
Why Most AI Training Runs Operate Below Hardware Potential and What the Fix Actually Costs

VECTOR Labs Team

Last updated on: Jun 25, 2026

The gap between what a GPU cluster can theoretically deliver and what a training job actually extracts is not a hardware problem. It is a configuration problem, and it is more common than most infrastructure teams acknowledge. Data from Lambda's Blackwell benchmark work shows that training runs on H100 and B200 clusters routinely settle at 35 to 45 percent Model FLOP Utilisation (MFU), a range that many teams have normalised as acceptable rather than diagnosed as a solvable inefficiency. Moving from that baseline to 60 percent or above requires no architectural changes to the model itself. It requires systematic configuration work that most teams have not done in full.

What MFU Measures and Why It Is the Right Signal

MFU expresses the fraction of theoretical peak FLOP/s that a training job actually uses during the forward and backward pass. A B200 GPU has a peak dense BF16 throughput of approximately 2.25 petaFLOP/s; the 4.5 petaFLOP/s figure often quoted for BF16 assumes structured sparsity, which standard dense training does not use. A training run at 40 percent MFU is using roughly 0.9 petaFLOP/s of that dense capacity, leaving the remainder idle due to memory stalls, communication overhead, or suboptimal kernel dispatch.

The reason MFU is a more useful signal than GPU utilisation percentages reported by monitoring tools is that those metrics count the GPU as "busy" even when it is waiting on memory transfers or blocked on inter-node communication. MFU connects hardware cost directly to productive compute, which is why it translates cleanly into financial terms.

If a team is paying for 128 H100s at cloud spot rates and running at 40 percent MFU, it is effectively paying for 51 GPUs' worth of productive compute. Raising MFU to 60 percent lifts that to about 77 GPUs' worth, which is equivalent to recovering the output of roughly 26 additional GPUs without purchasing or renting them.
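The arithmetic behind that comparison can be sketched in a few lines. This is a minimal illustration rather than a costing tool; the function name `productive_gpu_equivalents` is hypothetical and simply multiplies fleet size by MFU.

```python
def productive_gpu_equivalents(num_gpus: int, mfu: float) -> float:
    """Return how many GPUs' worth of peak-rate compute a fleet delivers at a given MFU."""
    if not 0.0 <= mfu <= 1.0:
        raise ValueError("mfu must be a fraction between 0 and 1")
    return num_gpus * mfu


# Figures from the example in the text: 128 H100s at 40 and 60 percent MFU.
baseline = productive_gpu_equivalents(128, 0.40)
tuned = productive_gpu_equivalents(128, 0.60)
recovered = tuned - baseline

print(f"baseline: {baseline:.1f}  tuned: {tuned:.1f}  recovered: {recovered:.1f}")
```

For the 128-GPU example this gives 51.2 and 76.8 GPU-equivalents, a difference of 25.6.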

The Three Root Causes of Sub-50% MFU

Sequence Length and Batch Packing Misalignment

The most common source of wasted compute is padding. When variable-length sequences are batched naively, short sequences are padded to the length of the longest sequence in the batch. On long-context training jobs, this can mean that 30 to 40 percent of tokens processed per step are padding tokens, which consume memory bandwidth and attention compute without contributing to gradient signal.
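To see how quickly padding adds up, the share of padded token slots in a naively batched step can be estimated directly from the sequence lengths. The sketch below is illustrative only: the function name and the batch of lengths are invented for the example, not taken from any benchmark.

```python
def padding_fraction(seq_lens: list[int]) -> float:
    """Fraction of token slots that are padding when every sequence
    in the batch is padded to the length of the longest one."""
    if not seq_lens:
        raise ValueError("seq_lens must not be empty")
    padded_slots = max(seq_lens) * len(seq_lens)
    real_tokens = sum(seq_lens)
    return 1.0 - real_tokens / padded_slots


# Hypothetical mixed-length batch with a 4096-token maximum.
batch = [4096, 3100, 2800, 1900, 2500, 3900, 1200, 2000]
waste = padding_fraction(batch)
print(f"padding share of the batch: {waste:.1%}")
```

For this invented batch roughly a third of the slots are padding, which falls inside the 30 to 40 percent range described above.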

Packing strategies, which concatenate multiple short sequences with positional resets rather than padding them, eliminate most of this waste. The implementation requires careful handling of attention masks to prevent cross-document attention, but the mechanism is well understood and the engineering effort is bounded. Lambda's benchmarks show MFU improvements of 8 to 12 percentage points from packing alone on mixed-length datasets.
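A minimal sketch of the packing mechanism follows, assuming a greedy first-fit-decreasing strategy (one of several possible bin-packing heuristics, and not necessarily the one used in Lambda's benchmarks). All function names are hypothetical. The position ids restart at zero for each document, and the mask rule permits attention only within a document and only to earlier tokens.

```python
def pack_sequences(seq_lens: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing.
    Returns one list of sequence indices per packed row."""
    bins: list[list] = []  # each entry: [remaining_capacity, [sequence indices]]
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        length = seq_lens[idx]
        if length > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for b in bins:
            if b[0] >= length:          # first row with enough room
                b[0] -= length
                b[1].append(idx)
                break
        else:                           # no row had room: open a new one
            bins.append([max_len - length, [idx]])
    return [b[1] for b in bins]


def position_and_doc_ids(row: list[int], seq_lens: list[int]):
    """Position ids reset per document; doc ids mark document membership."""
    positions, doc_ids = [], []
    for doc, idx in enumerate(row):
        positions.extend(range(seq_lens[idx]))
        doc_ids.extend([doc] * seq_lens[idx])
    return positions, doc_ids


def may_attend(query: int, key: int, doc_ids: list[int]) -> bool:
    """Causal attention restricted to tokens of the same document."""
    return doc_ids[query] == doc_ids[key] and key <= query


seq_lens = [5, 3, 2, 4, 1]
rows = pack_sequences(seq_lens, max_len=8)
positions, doc_ids = position_and_doc_ids(rows[0], seq_lens)
```

In this toy case five sequences totalling 15 tokens fit into two rows of 8 slots, so only one slot is padding. A production implementation would pass the document boundaries to the attention kernel rather than evaluate a mask token by token.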

Tensor Parallelism and Pipeline Stage Configuration

Distributed training across multiple nodes introduces communication overhead that scales with the degree of tensor parallelism. A tensor parallel degree of 8 within a single node keeps inter-GPU communication on NVLink, which operates at 900 GB/s on H100 SXM systems. Extending tensor parallelism across nodes drops that bandwidth to InfiniBand rates, typically 400 Gb/s per port (gigabits, not gigabytes: about 50 GB/s), which introduces stalls that directly reduce MFU.
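The effect of that bandwidth drop can be approximated with the standard bandwidth-only cost model for a ring all-reduce, in which each GPU transfers 2(n-1)/n times the message size. The sketch below is a rough estimate under stated assumptions: it ignores latency, overlap with compute, and topology effects, treats the nominal link rates quoted above as achievable, and uses an invented message size.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time (latency ignored)."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / bandwidth_bytes_per_s


NVLINK_BYTES_PER_S = 900e9          # 900 GB/s, nominal H100 SXM NVLink figure
INFINIBAND_BYTES_PER_S = 400e9 / 8  # 400 Gb/s per port is 50 GB/s

# Hypothetical activation tensor: 4 sequences x 4096 tokens x 8192 hidden, BF16.
message_bytes = 4 * 4096 * 8192 * 2

t_nvlink = ring_allreduce_seconds(message_bytes, 8, NVLINK_BYTES_PER_S)
t_infiniband = ring_allreduce_seconds(message_bytes, 8, INFINIBAND_BYTES_PER_S)

print(f"NVLink:     {t_nvlink * 1e3:.2f} ms")
print(f"InfiniBand: {t_infiniband * 1e3:.2f} ms")
print(f"slowdown:   {t_infiniband / t_nvlink:.0f}x")
```

Under these assumptions the same all-reduce takes 18 times longer over a single InfiniBand port than over NVLink, and tensor parallelism repeats that cost at every transformer layer.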

The correct configuration depends on model size and the ratio of compute to communication in each layer. For a 70B parameter model, 8-way tensor parallelism within a node combined with 4-way pipeline parallelism across nodes keeps the communication-to-compute ratio manageable. Misconfiguring this ratio, for example by using 16-way tensor parallelism across node boundaries on a cluster without sufficient InfiniBand bandwidth, can reduce MFU by 15 percentage points or more.
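A layout like the 70B example can be sanity-checked before a job is launched. The sketch below is a hypothetical helper, not part of any training framework: it derives the data parallel degree implied by a cluster size and flags a tensor parallel degree that cannot fit inside one node. The 128-GPU cluster size is an assumption added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    world_size: int         # total GPUs in the job
    tensor_parallel: int    # TP degree
    pipeline_parallel: int  # PP degree
    gpus_per_node: int

    @property
    def gpus_per_replica(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel

    @property
    def data_parallel(self) -> int:
        """Data parallel degree implied by the other settings."""
        if self.world_size % self.gpus_per_replica != 0:
            raise ValueError("world_size is not a multiple of TP x PP")
        return self.world_size // self.gpus_per_replica

    @property
    def tensor_parallel_fits_in_node(self) -> bool:
        return self.tensor_parallel <= self.gpus_per_node


# The 70B example from the text, on an assumed 128-GPU cluster of 8-GPU nodes.
good = ParallelLayout(world_size=128, tensor_parallel=8,
                      pipeline_parallel=4, gpus_per_node=8)
# The misconfiguration from the text: 16-way TP on 8-GPU nodes.
bad = ParallelLayout(world_size=128, tensor_parallel=16,
                     pipeline_parallel=2, gpus_per_node=8)

print(good.data_parallel, good.tensor_parallel_fits_in_node)
print(bad.data_parallel, bad.tensor_parallel_fits_in_node)
```

Both layouts use 32 GPUs per model replica and imply 4-way data parallelism, but only the first keeps every tensor parallel group on NVLink.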

Flash Attention Kernel Selection and Precision Routing

Not all attention kernels are equal on all hardware generations. Flash Attention 2 and Flash Attention 3 differ in how they schedule warp-level operations on Hopper and Blackwell architectures. On H100s, Flash Attention 3 with BF16 precision and persistent kernel scheduling consistently outperforms Flash Attention 2 by 5 to 10 percent in attention throughput. The difference is architectural: Flash Attention 3 is written specifically for Hopper's asynchronous memory pipeline, which Flash Attention 2 does not exploit.

Teams running Flash Attention 2 on H100 or B200 hardware because it was the default at the time their training stack was assembled are leaving measurable compute on the table. The switch requires testing for numerical stability on the specific model architecture, but does not require changes to the model code itself.
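The shape of such a numerical check can be illustrated without any GPU code. The sketch below is a stand-in, not the Flash Attention kernels themselves: it compares a straightforward single-query softmax attention against a block-by-block version that keeps a running maximum and normaliser (the online-softmax idea that tiled attention kernels build on), then reports the largest absolute difference. All names and inputs are invented for the example; a real test would compare the two kernels on the model's actual tensors and precision.

```python
import math


def attention_reference(q, keys, values):
    """Single-query softmax attention computed over all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_blockwise(q, keys, values, block=2):
    """Same computation, one block of keys at a time, rescaling the
    accumulated output whenever the running maximum score changes."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, denom, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        ks, vs = keys[start:start + block], values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0, 0.25]
keys = [[math.sin(i + j) for j in range(4)] for i in range(5)]
values = [[math.cos(i * j) for j in range(3)] for i in range(5)]

ref = attention_reference(q, keys, values)
blk = attention_blockwise(q, keys, values)
max_abs_error = max(abs(a - b) for a, b in zip(ref, blk))
print(f"max abs error: {max_abs_error:.2e}")
```

In double precision the two paths agree to within rounding error. The interesting question in practice is how large the gap becomes in BF16 or FP8 on the specific model, which is why the check has to be run per architecture.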

The Configuration Framework That Closes the Gap

Moving from 40 percent to 60 percent MFU on a Blackwell or Hopper cluster follows a reproducible sequence. The first step is profiling with NVIDIA Nsight or PyTorch Profiler to identify whether the primary bottleneck is memory bandwidth, compute, or communication. These three failure modes require different interventions, and treating a communication-bound run with memory optimisations will not move the metric.
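One way to turn profiler output into that three-way classification is a simple roofline-style rule. The sketch below is illustrative only: the function name, the 25 percent communication threshold, and the idea of summarising a step as four numbers are assumptions, and the H100 SXM constants are nominal datasheet values rather than measured ones.

```python
H100_PEAK_BF16_FLOPS = 989e12   # nominal dense BF16 peak, H100 SXM (assumed)
H100_HBM_BYTES_PER_S = 3.35e12  # nominal HBM3 bandwidth, H100 SXM (assumed)


def primary_bottleneck(step_seconds: float,
                       exposed_comm_seconds: float,
                       flops_per_step: float,
                       hbm_bytes_per_step: float,
                       peak_flops: float = H100_PEAK_BF16_FLOPS,
                       peak_mem_bw: float = H100_HBM_BYTES_PER_S,
                       comm_threshold: float = 0.25) -> str:
    """Classify a training step as communication-, compute-, or
    memory-bandwidth-bound from per-GPU profiler summaries."""
    if step_seconds <= 0 or hbm_bytes_per_step <= 0:
        raise ValueError("step_seconds and hbm_bytes_per_step must be positive")
    # Communication not hidden behind compute dominates first.
    if exposed_comm_seconds / step_seconds >= comm_threshold:
        return "communication"
    # Otherwise compare arithmetic intensity with the hardware ridge point.
    intensity = flops_per_step / hbm_bytes_per_step  # FLOPs per byte moved
    ridge = peak_flops / peak_mem_bw                 # about 295 for these values
    return "compute" if intensity >= ridge else "memory bandwidth"


print(primary_bottleneck(1.0, 0.40, 1e15, 1e12))
print(primary_bottleneck(1.0, 0.05, 1e15, 1e13))
print(primary_bottleneck(1.0, 0.05, 1e15, 1e12))
```

The three calls return "communication", "memory bandwidth", and "compute" respectively. The inputs would come from Nsight Systems or PyTorch Profiler traces; how to extract them depends on the framework in use.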

The second step is fixing sequence packing and verifying that attention masks are correctly applied post-packing. The third is auditing parallelism configuration against the specific cluster topology, checking NVLink vs. InfiniBand boundaries and adjusting tensor parallel degree accordingly. The fourth is updating the attention kernel and running a precision sweep across BF16 and FP8 where the model's training stability permits.

Each of these steps is independently testable. A team can run a 500-step benchmark at each stage and observe MFU directly, which means the engineering work is auditable rather than speculative.

What This Engineering Work Actually Costs

The labour to move a training stack from 40 to 60 percent MFU on a well-specified cluster is approximately two to four weeks of senior ML infrastructure engineering time, assuming the team has profiling access and familiarity with the distributed training framework in use. The work is not research. It is systematic diagnosis and configuration, which means the timeline is predictable and the outcome is measurable.

The financial case is straightforward. A team running a 90-day pre-training job on 256 H100s at $2.50 per GPU-hour spends approximately $1.38 million. At 40 percent MFU, they are producing the equivalent of roughly 102 GPUs of productive compute. Moving to 60 percent MFU produces the equivalent of 154 GPUs. The delta, 52 GPU-equivalents sustained over 90 days, represents approximately $280,000 in recovered compute value, or a proportional reduction in time-to-completion if the budget is fixed.
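The figures in this calculation can be reproduced as follows. The helper names are hypothetical, and the model is deliberately simple: it values each recovered GPU-equivalent at the same hourly rate as a rented GPU.

```python
def job_cost_usd(num_gpus: int, days: int, usd_per_gpu_hour: float) -> float:
    """Total rental cost of a job that holds num_gpus for the full period."""
    return num_gpus * days * 24 * usd_per_gpu_hour


def recovered_value_usd(num_gpus: int, days: int, usd_per_gpu_hour: float,
                        mfu_before: float, mfu_after: float) -> float:
    """Value of the extra productive GPU-equivalents gained by raising MFU."""
    extra_gpu_equivalents = num_gpus * (mfu_after - mfu_before)
    return extra_gpu_equivalents * days * 24 * usd_per_gpu_hour


total = job_cost_usd(256, 90, 2.50)
recovered = recovered_value_usd(256, 90, 2.50, 0.40, 0.60)

print(f"total spend:     ${total:,.0f}")
print(f"recovered value: ${recovered:,.0f}")
```

Without intermediate rounding, the delta is 51.2 GPU-equivalents and the recovered value is about $276,000, in line with the rounded figures quoted above.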

Two to four weeks of senior engineering time costs considerably less than $280,000. The return on that investment is not conditional on hardware upgrades or model architecture changes.

Why Teams Accept Sub-Optimal MFU

The normalisation of 35 to 45 percent MFU as a baseline is partly a product of how training infrastructure is typically staffed. ML researchers optimise models. Infrastructure engineers maintain cluster availability. The configuration work that closes the MFU gap sits in between: it requires understanding of both the model's computational structure and the hardware's memory and communication topology. Teams without a dedicated ML systems engineer tend not to have anyone whose primary responsibility includes this intersection.

There is also a measurement problem. Teams that do not track MFU explicitly tend to evaluate training runs by loss curves and throughput in tokens per second. Both metrics can look acceptable even when MFU is poor, because a misconfigured run that completes training in 95 days instead of 60 days still produces a trained model. The inefficiency is visible only in the infrastructure bill and the elapsed calendar time.

The Connection to Time-to-Production

Training efficiency has a direct effect on how quickly a model reaches production. A 60 percent MFU run completes a fixed training compute budget in one third less wall-clock time than a 40 percent MFU run on identical hardware, which is the same as running at 1.5 times the throughput. For teams operating under competitive pressure to release a model or complete a fine-tuning cycle before a product deadline, that compression in calendar time is often more valuable than the cost saving.
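Because wall-clock time for a fixed compute budget scales inversely with MFU, the calendar effect is easy to work out. The sketch below assumes identical hardware and a 90-day baseline chosen for illustration; the function name is hypothetical.

```python
def days_to_complete(baseline_days: float, baseline_mfu: float,
                     new_mfu: float) -> float:
    """Wall-clock days for the same compute budget at a different MFU."""
    if baseline_mfu <= 0 or new_mfu <= 0:
        raise ValueError("MFU values must be positive")
    return baseline_days * baseline_mfu / new_mfu


baseline_days = 90.0
tuned_days = days_to_complete(baseline_days, 0.40, 0.60)
time_saved_fraction = 1 - tuned_days / baseline_days
throughput_gain = 0.60 / 0.40 - 1

print(f"{baseline_days:.0f} days -> {tuned_days:.0f} days")
print(f"time saved: {time_saved_fraction:.0%}, throughput gain: {throughput_gain:.0%}")
```

A 90-day run at 40 percent MFU becomes a 60-day run at 60 percent MFU: one third less time, or 50 percent more throughput, depending on which way the ratio is read.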

We have written previously about how training velocity connects to infrastructure strategy and time-to-revenue in the context of frontier model deployments. The MFU gap is one of the most direct levers available within a fixed hardware budget, and it requires no procurement cycle or architectural commitment to address.

This article is a companion piece to our broader work on AI training infrastructure strategy. See "Why Your AI Training Infrastructure Will Become a Competitive Moat" for how to evaluate MLPerf benchmarks, MoE scaling implications, and the connection between training velocity and time-to-revenue.

What the Hardware Vendors Are Not Telling You

GPU vendors publish peak FLOP/s figures that assume perfect memory access patterns, full tensor parallelism efficiency, and sustained kernel occupancy. These conditions do not hold in practice. Lambda's Blackwell benchmarks, which measure achieved throughput on real training workloads rather than synthetic microbenchmarks, show that even well-configured runs on B200 hardware settle below 70 percent MFU under standard configurations. The gap between peak specification and achievable throughput is a structural property of how large language model training interacts with memory bandwidth constraints.

This matters for procurement decisions. A team evaluating whether to upgrade from H100 to B200 hardware should model the upgrade against their current achieved MFU, not against theoretical peak throughput. If the current H100 run operates at 38 percent MFU due to configuration issues, migrating to B200 hardware will not resolve those issues. The configuration work is prerequisite to getting full value from either generation of hardware.

FAQs

What is a realistic target MFU for a well-configured large-scale training run on H100 or B200 hardware?

Based on Lambda's Blackwell benchmarks, a well-configured run on H100 SXM or B200 hardware should achieve 60 to 68 percent MFU on dense transformer architectures with standard attention patterns. Mixture-of-experts models tend to achieve lower MFU due to routing overhead and load imbalance across experts. Targets above 70 percent are achievable in narrow conditions but should not be used as planning assumptions for general pre-training workloads.

How do we measure MFU in practice, and what tooling is required?

MFU is calculated by dividing the achieved throughput in FLOP/s by the theoretical peak FLOP/s of the hardware at the relevant precision. Achieved throughput is derived from tokens per second multiplied by the estimated FLOPs per token for the model, typically approximated as 6N, where N is the number of non-embedding parameters: roughly 2N for the forward pass and 4N for the backward pass. This approximation omits the attention-score FLOPs, which grow with sequence length, so it understates achieved throughput on long-context runs. PyTorch Profiler and NVIDIA Nsight Systems provide the underlying timing data needed to validate this calculation and identify where cycles are being lost.
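As a minimal sketch of that calculation, assuming the 6-FLOPs-per-parameter-per-token approximation and a nominal dense BF16 peak of 989 teraFLOP/s for an H100 SXM (a datasheet value, not a measured one): the function name and the throughput figure in the example are invented for illustration.

```python
def model_flops_utilisation(tokens_per_second: float,
                            non_embedding_params: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """MFU = achieved FLOP/s / aggregate peak FLOP/s, using the
    6 x parameters FLOPs-per-token approximation (attention FLOPs ignored)."""
    achieved_flops_per_s = 6.0 * non_embedding_params * tokens_per_second
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical run: a 70B-parameter model on 128 H100s at 120,000 tokens/s.
mfu = model_flops_utilisation(
    tokens_per_second=120_000,
    non_embedding_params=70e9,
    num_gpus=128,
    peak_flops_per_gpu=989e12,  # assumed nominal dense BF16 peak, H100 SXM
)
print(f"MFU: {mfu:.1%}")
```

For this invented run the result is just under 40 percent. The tokens-per-second input should be the global figure summed across all data parallel replicas, and the peak value must match the precision and sparsity mode the job actually uses.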

Does improving MFU require changes to the model architecture or training objective?

No. The configuration changes that close the MFU gap, including sequence packing, parallelism tuning, and attention kernel selection, operate at the infrastructure and training framework level. The model's architecture, loss function, and training data pipeline are unchanged. This is precisely why the engineering work is bounded and auditable: each change can be tested in isolation against a fixed model checkpoint without affecting the training trajectory.

Should we address MFU before or after deciding whether to upgrade hardware?

Before. A hardware upgrade applied to a misconfigured training stack will deliver a fraction of its potential improvement, because the configuration bottlenecks will persist on the new hardware. The correct sequencing is to profile the current run, close the configuration gap, establish a baseline MFU on current hardware, and then model the expected improvement from a hardware upgrade against that baseline. This also produces more accurate procurement projections, because the achieved MFU figure is a real measurement rather than an assumption derived from vendor specifications.

How does tensor parallelism degree interact with cluster topology, and what is the most common misconfiguration?

Tensor parallelism requires all-reduce communication across the participating GPUs at every transformer layer. When the tensor parallel group fits within a single node, this communication travels over NVLink at very high bandwidth and adds minimal overhead. When the group spans nodes, it travels over InfiniBand, which is typically 10 to 20 times slower per GPU pair. The most common misconfiguration is setting tensor parallel degree to match the total number of GPUs per model replica without accounting for node boundaries, which forces high-frequency all-reduces over InfiniBand and can reduce MFU by 15 percentage points or more on standard cluster configurations.
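Whether a given tensor parallel degree forces all-reduces across node boundaries can be checked by enumerating the rank groups. The sketch below assumes ranks are assigned contiguously, so that ranks 0 to 7 share the first node on an 8-GPU system, and that tensor parallel groups are formed from consecutive ranks; frameworks can order ranks differently, so treat this as an illustration with hypothetical function names rather than a description of any specific launcher.

```python
def tensor_parallel_groups(world_size: int, tp_degree: int) -> list[list[int]]:
    """Consecutive-rank tensor parallel groups (assumed rank ordering)."""
    if world_size % tp_degree != 0:
        raise ValueError("world_size must be a multiple of tp_degree")
    return [list(range(start, start + tp_degree))
            for start in range(0, world_size, tp_degree)]


def nodes_spanned(group: list[int], gpus_per_node: int) -> set[int]:
    """Node indices touched by a group, assuming contiguous rank placement."""
    return {rank // gpus_per_node for rank in group}


def cross_node_groups(world_size: int, tp_degree: int,
                      gpus_per_node: int) -> list[list[int]]:
    """Groups whose all-reduces must leave the node."""
    return [group for group in tensor_parallel_groups(world_size, tp_degree)
            if len(nodes_spanned(group, gpus_per_node)) > 1]


# A 32-GPU model replica on 8-GPU nodes.
within_node = cross_node_groups(world_size=32, tp_degree=8, gpus_per_node=8)
across_nodes = cross_node_groups(world_size=32, tp_degree=16, gpus_per_node=8)

print(len(within_node), "groups cross a node boundary at TP=8")
print(len(across_nodes), "groups cross a node boundary at TP=16")
```

With 8-way tensor parallelism no group leaves its node. With 16-way, both groups span two nodes, so every per-layer all-reduce is pushed onto InfiniBand.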

What is the minimum team capability required to execute this configuration work, and can it be done by a generalist ML team?

The work requires familiarity with distributed training frameworks such as Megatron-LM or DeepSpeed, the ability to read GPU profiling traces, and an understanding of how transformer FLOPs are distributed across attention and feed-forward layers. A generalist ML team focused on model development typically lacks the profiling and systems knowledge to do this reliably. The work is best suited to an ML systems engineer or infrastructure specialist with direct experience tuning distributed training jobs. Without that capability in-house, teams tend to iterate by trial and error, which extends the timeline significantly and often does not reach the 60 percent threshold.
