
NVIDIA Blackwell’s headline performance gains (4× faster training, 30× faster inference) and the overall 25× energy reduction claim

  • Writer: Kommu
  • Apr 9
  • 5 min read

This report summarizes the key points from the article, explaining where the headline performance gains (4× faster training, 30× faster inference) and the 25× energy reduction claim come from. Many of these figures result from comparing large-scale, “benchmarketing” configurations across two generations of technology. The article shows that while the raw compute improvements are more modest (around 2.5× per GPU in simple metrics), system-level factors push the overall claims higher when systems are configured for massive-scale training and inference.

1. Comparing Generations and System Configurations

  • Three Generations in Context:

    • The baseline is the Hopper H100 (announced in 2022), used in current production.

    • The next-generation Grace Hopper GH200 (announced in 2023) offers per‑GPU improvements of roughly 1.4×–1.8×—mainly due to higher memory capacity and faster memory interconnect.

    • The new Grace Blackwell GB200 (and its related B200/B100 modules) is compared against the H100, skipping the intermediate GH200. Because the comparison spans roughly two years of technology evolution, part of the headline gain reflects improvements that had already arrived with the skipped generation.

  • Configuration Differences:

    • The H100 baseline is typically measured in a system with eight GPUs (the HGX H100 or DGX H100), an air-cooled, InfiniBand-connected rack system.

    • The Blackwell benchmarks come from a water-cooled NVL72 system that packs 72 GPUs into a single rack, with much higher GPU density and an improved interconnect (NVLink with double the bandwidth).

2. Raw Compute Improvements and the “Missing” Multiplicative Factors

A. Training Performance (4× Claim)

  • Raw Compute Gain:

    • On a per-GPU basis in sparse FP8 mode, the H100 delivers roughly 4 petaflops, whereas a Blackwell GPU delivers about 10 petaflops: roughly a 2.5× improvement in raw compute.

    • The Blackwell GPU is essentially built from two chiplets (each similar in performance to an H200) packaged together. Looking at the silicon alone, a single Blackwell GPU is roughly twice as powerful as an H200 and a bit more than twice an H100.

  • Additional Interconnect and System-Level Gains:

    • The interconnect improvements are critical. The H100 system uses NVLink at 900 GB/s among 8 GPUs and InfiniBand between nodes, which introduces latency and bandwidth constraints.

    • The Blackwell system (NVL72) uses NVLink at 1800 GB/s plus improved InfiniBand links. Because the NVLink domain grows from 8 GPUs to 72, a given transfer is roughly 9× more likely to stay on the fast NVLink fabric, and each NVLink transfer is twice as fast; together these effects contribute an extra factor of approximately 1.6× in large-scale configurations (a back-of-envelope sketch follows this list).

    • When the 2.5× base gain is combined with this additional 1.6× boost, the overall system-level training performance claim reaches about 4× relative to an H100 configuration.

    • Note that in smaller configurations (1–8 GPUs), you’d likely see a gain closer to the raw 2.5×.
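
A minimal Python sketch of this arithmetic (the variable names and the way the factors are combined are illustrative; the 1.6× system-level factor is taken from the article rather than derived here):

```python
# Back-of-envelope arithmetic behind the 4x training claim.
# Figures are those quoted above; the combination is a
# simplification, not the article's exact model.

nvlink_bw_h100 = 900     # GB/s per GPU, Hopper-generation NVLink
nvlink_bw_gb200 = 1800   # GB/s per GPU, Blackwell-generation NVLink
domain_h100 = 8          # GPUs per NVLink domain (HGX/DGX H100)
domain_gb200 = 72        # GPUs per NVLink domain (NVL72)

raw_compute_gain = 10 / 4  # ~10 PF vs ~4 PF sparse FP8 per GPU -> 2.5x

# In a large job, a transfer is ~9x more likely to stay on NVLink
# (72-GPU domain vs 8-GPU domain), and each NVLink hop is 2x faster.
locality_gain = domain_gb200 / domain_h100           # 9x
link_speed_gain = nvlink_bw_gb200 / nvlink_bw_h100   # 2x

# The article folds these communication effects into roughly a
# 1.6x system-level factor for large-scale training runs.
system_factor = 1.6

print(f"locality gain: {locality_gain:.0f}x, link speed gain: {link_speed_gain:.0f}x")
print(f"raw compute gain: {raw_compute_gain:.1f}x")
print(f"headline training claim: {raw_compute_gain * system_factor:.1f}x")  # ~4.0x
```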

B. Inference Performance (30× Claim)

  • Lower Precision Arithmetic (FP4 vs FP8):

    • Blackwell introduces a new FP4 (4-bit floating point) format for inference. Because FP4 data is half as wide as FP8 (the lowest precision available on the H100), arithmetic throughput roughly doubles; combined with the ~2.5× raw compute gain, a Blackwell GPU running FP4 can theoretically deliver about 5× the throughput of an H100 running FP8.

    • Thus, the raw compute per GPU might be expected to improve by roughly 5× before any system-level effects are counted.

  • System-Level and Memory Efficiency Factors:

    • However, the benchmark that yields a “30×” figure comes from large, highly optimized configurations used for models like those powering ChatGPT.

    • Two effects contribute further:

      1. Model Weight Loading and Memory: With FP4, weights take up half as much memory as FP8, meaning they load twice as fast and have better cache hit rates. This effectively speeds up inference beyond the raw compute gain.

      2. System Architecture and Shared Memory: The H100 baseline consists of 8-GPU nodes connected via the slower InfiniBand interconnect, whereas the Blackwell NVL72 system acts as a single 72-GPU shared memory domain. This reduces overhead and latency considerably.

    • When these factors are combined, the “missing” multiplicative factor (beyond the 5× expected from FP4) is roughly 6×, if one picks the point on the performance curves most favorable to Blackwell.

    • However, Adrian Cockcroft notes that this 30× figure appears to be derived from comparing an inefficiently run H100 system (at a low point on its performance curve) against the Blackwell system, so the practical improvement in typical deployments might be closer to 8–10× (the sketch below decomposes these factors).
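
A short sketch of how these multiplicative factors stack up. The 6× residual is the article’s estimate for the most favorable benchmark point; the 1.6×–2× residuals used for the practical case are illustrative values chosen to land in the 8–10× range, not numbers from the article:

```python
# Decomposition of the headline 30x inference claim into the factors
# described above. All values are estimates, not measurements.

fp4_per_gpu_gain = 5.0       # Blackwell FP4 vs H100 FP8, per GPU
benchmark_point_gain = 6.0   # FP4 memory halving + NVL72 shared domain
                             # + a favorable point on the perf curve

print(f"headline claim: {fp4_per_gpu_gain * benchmark_point_gain:.0f}x")  # 30x

# A typical deployment keeps the per-GPU gain but loses most of the
# benchmark-point advantage (illustrative residual of ~1.6x-2x):
for residual in (1.6, 2.0):
    print(f"practical estimate: ~{fp4_per_gpu_gain * residual:.0f}x")  # ~8x, ~10x
```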

3. Energy Efficiency Improvements (25× Claim)

  • Per-Operation Efficiency at Lower Precision:

    • The use of FP4 arithmetic not only increases throughput but also reduces the amount of data that must be moved, thus lowering the energy consumed per operation. Fewer bits per operation mean that less power is required for memory transfers and compute.

  • Improved Memory Bandwidth and System Architecture:

    • Blackwell’s per-GPU memory bandwidth increases from 3.35 TB/s on the H100 to about 8 TB/s, and the overall system uses liquid cooling rather than air cooling, which improves thermal efficiency and allows higher performance per watt.

  • TCO and Energy Use Benchmarking Issues:

    • The benchmarks compare a 72-GPU Blackwell NVL72 rack against a comparable number of H100 GPUs. According to the technical brief, the NVL72 is rated at 120 kW (about 1.7 kW per GPU), whereas the equivalent 72 H100 GPUs, nine 8-GPU nodes at roughly 10.2–11 kW each, would consume roughly 100 kW (about 1.3 kW per GPU).

    • There is some confusion because the energy and TCO comparisons both quote the same 25× figure. However, if one adjusts for a more realistic inference speedup (e.g., 10× rather than 30×), the energy reduction comes out closer to an 8× improvement (the sketch after this section works through the arithmetic).

    • Regardless, even if the “25×” claim is partly a product of aggressive configuration choices and comparisons made at non-optimal points, it still represents a very strong improvement in energy efficiency for inference workloads.
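
A minimal sketch of the energy arithmetic, assuming that energy per unit of inference work scales as power divided by throughput. The per-GPU power figures are derived from the system ratings above (120 kW / 72 GPUs for the NVL72; a roughly 10.2 kW 8-GPU DGX H100 node), which is one reading of the comparison rather than an NVIDIA per-GPU figure:

```python
# Energy per unit of work ~ power / throughput, so the energy
# reduction is the speedup scaled by the per-GPU power ratio.
# Power figures are derived from system ratings, as noted above.

p_h100 = 10.2 / 8    # kW per H100 GPU (8-GPU DGX H100 at ~10.2 kW)
p_gb200 = 120 / 72   # kW per Blackwell GPU (NVL72 rack at 120 kW)

def energy_reduction(speedup: float) -> float:
    """Factor by which energy per inference drops, given a speedup."""
    return speedup * (p_h100 / p_gb200)

for speedup in (30, 10):
    print(f"{speedup}x speedup -> ~{energy_reduction(speedup):.0f}x less energy")
# 30x speedup -> ~23x less energy (same order as the 25x claim)
# 10x speedup -> ~8x less energy (matching the article's estimate)
```

Note that under these ratings each Blackwell GPU actually draws more power than an H100 (about 1.7 kW versus about 1.3 kW), so the claimed energy reduction comes entirely from doing more work per watt, not from drawing less power.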

4. Summary and Context

  • Training Performance:

    • The raw compute gain per GPU (2.5× faster) is amplified by improved interconnect and system configurations (≈1.6× boost) leading to a headline figure of 4× improvement for large-scale training workloads.

  • Inference Performance:

    • The new FP4 arithmetic provides roughly a 5× raw improvement versus FP8 on H100, and when combined with benefits from faster memory loading and superior interconnect (in a highly optimized large-scale configuration), the claim reaches 30×.

    • In practice, however, typical inference speedups might be more modest (around 8–10×) depending on the workload and system design.

  • Energy Efficiency:

    • Lower precision operations (FP4) reduce data movement and energy per operation.

    • Enhanced memory bandwidth and liquid-cooling based system architectures further reduce energy consumption.

    • The 25× energy efficiency claim appears to derive from these combined factors in the best-case scenario, although practical benefits might be somewhat lower.

In conclusion, while the raw improvements in chip design yield modest (2.5×) gains, the large-scale system-level enhancements—especially the improved interconnect, memory bandwidth, and optimized inference using FP4—inflate the headline numbers in marketing materials. This “benchmarketing” approach compares highly optimized, liquid-cooled, shared memory configurations (Blackwell NVL72) against less efficient air-cooled H100 setups, thereby producing the 4×, 30×, and 25× figures. The article cautions that when comparing smaller or more standardized configurations, the gains are lower, but the overall direction of improvement remains significant.

This report is based on the analysis and insights from Adrian Cockcroft’s Medium article “Deep dive into NVIDIA Blackwell Benchmarks — where does the 4x training and 30x inference performance gain, and 25x reduction in energy usage come from?”
