AdamW has been the default optimizer for LLM pretraining for several years, and that default has been treated as settled. Most engineering teams select it once, embed it in their training stack, and direct their infrastructure investment elsewhere. A 2026 open problem paper from researchers at Nanjing, Zhejiang, and Fudan Universities (Yu et al., arXiv 2026) makes that posture harder to defend. The paper formally identifies the absence of convergence guarantees for AdamW under heavy-tailed gradient noise, the exact noise regime that characterises LLM pretraining, and presents a candidate failure mechanism. This is not a theoretical curiosity. It is a structural gap in the justification for a component that sits at the centre of every training run your organisation operates.
Why Heavy-Tailed Noise Is the Relevant Regime
Gradient noise in LLM pretraining does not follow the bounded-variance distributions assumed by most Adam convergence proofs. Language data follows Zipfian frequency distributions, which produce gradient signals that are intermittently very large. This is not a property of small or poorly configured models. Empirical work cited by Yu et al. (arXiv 2026) confirms that heavy-tailed noise persists across coordinates and matrix blocks in nanoGPT pretraining, and that noise magnitude tracks gradient magnitude in a way that finite-variance assumptions cannot capture.
The practical consequence is that the theoretical guarantees underpinning AdamW's convergence behaviour were derived under conditions that do not match the workloads you are running. This does not prove that AdamW fails in practice. It means that when AdamW training runs diverge, produce unexpected loss spikes, or require extensive hyperparameter intervention, you have no theoretical framework to diagnose whether the optimizer itself is implicated.
The Denominator Memory Problem
Yu et al. (arXiv 2026) present a specific mechanism, which they term a corridor lower-bound, showing how AdamW's second-moment accumulator can obscure large gradient events. The exponential moving average of squared gradients smooths the denominator over an effective window of roughly 1/(1 - beta2) steps, which is about 1,000 steps at the PyTorch default of beta2 = 0.999. When a heavy-tailed gradient spike arrives, the accumulated denominator may not respond quickly enough to scale the update appropriately, effectively hiding the event from the adaptive step size.
This is a structural property of the algorithm, not a tuning failure. Adjusting beta parameters or learning rate schedules addresses the symptom, not the mechanism. Engineering teams that have responded to loss spikes with hyperparameter sweeps may have been compensating for a property of the optimizer that no hyperparameter directly controls.
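The memory effect described above can be seen in a toy setting. The following is a minimal sketch, not the corridor construction from Yu et al.: it assumes a scalar stream of synthetic unit-variance gradients with a single injected spike, and traces only the standard bias-corrected Adam denominator. The function name adam_denominator_trace, the spike size, and the step indices are illustrative assumptions.

```python
import numpy as np

def adam_denominator_trace(grads, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam denominator sqrt(v_hat) + eps for a scalar gradient stream."""
    v = 0.0
    denom = np.empty(len(grads))
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g   # EMA of squared gradients
        v_hat = v / (1.0 - beta2 ** t)          # standard Adam bias correction
        denom[t - 1] = np.sqrt(v_hat) + eps
    return denom

rng = np.random.default_rng(0)
grads = rng.normal(0.0, 1.0, size=4000)  # synthetic unit-scale gradient noise (assumption)
grads[2000] = 1000.0                     # one synthetic tail event, 1000x the typical scale
denom = adam_denominator_trace(grads)

print(f"denominator just before spike: {denom[1999]:.2f}")
print(f"denominator at the spike:      {denom[2000]:.2f}")
print(f"denominator 500 steps later:   {denom[2500]:.2f}")
```

In this sketch the denominator rises by far less than the 1000x jump in the gradient, because a single squared gradient enters the average with weight 1 - beta2, and it then stays inflated for hundreds of steps while the average decays. Both effects follow directly from the moving average, which is why they appear regardless of the learning rate.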
What Sign-Based Optimizers Provide That AdamW Does Not
Lion and Muon, both sign-based optimizers, have now received rigorous convergence analysis under heavy-tailed noise. Yu et al. (arXiv 2026) reference work establishing that under finite p-th moment noise with p between 1 and 2, sign-based optimizers attain a convergence rate of O(T^(-(p-1)/(3p-2))), which is provably tight. This is not a marginal theoretical improvement. It is the first class of optimizers with sharp guarantees in the regime that LLM pretraining actually occupies.
The practical record aligns with this. Muon has been adopted in production training runs for Kimi-K2, GLM-5, and DeepSeek-V4, as noted by Yu et al. (arXiv 2026). These are not experimental deployments. They are frontier-scale systems where optimizer choice directly affects compute cost, training stability, and iteration speed.
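To make the quoted rate concrete, the short sketch below evaluates the exponent (p-1)/(3p-2) for a few values of p. It assumes the range includes p = 2, the finite-variance case; the function name sign_rate_exponent is illustrative and does not come from the paper.

```python
from fractions import Fraction

def sign_rate_exponent(p):
    """Exponent a in the O(T^-a) rate quoted above: a = (p - 1) / (3p - 2)."""
    p = Fraction(p)
    if not (1 < p <= 2):  # assumed range: 1 < p <= 2
        raise ValueError("p must satisfy 1 < p <= 2")
    return (p - 1) / (3 * p - 2)

for p in ("2", "3/2", "6/5"):
    print(f"p = {p}: rate O(T^-{sign_rate_exponent(p)})")
# p = 2:   rate O(T^-1/4)
# p = 3/2: rate O(T^-1/5)
# p = 6/5: rate O(T^-1/8)
```

The exponent shrinks towards zero as p approaches 1, so heavier tails mean slower guaranteed progress per step. That dependence on p is what a bounded-variance analysis cannot express.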
The Infrastructure Risk That Is Not Being Managed
The gap between AdamW's theoretical foundations and its operational environment is now an infrastructure risk, not just a research gap. A training run that costs $500,000 in compute and diverges at 60% completion due to a heavy-tailed gradient event cannot be retrospectively diagnosed if the optimizer's behaviour under that noise regime is theoretically undefined. The inability to reason about failure modes is itself a risk, because it prevents systematic remediation.
This risk compounds at scale. Longer training runs, larger batch sizes, and more aggressive learning rate schedules all increase the probability of encountering the tail events that AdamW's theoretical framework does not cover. Infrastructure decisions made today, including checkpoint frequency, gradient clipping thresholds, and monitoring instrumentation, are implicitly bets on AdamW's empirical reliability continuing to hold.
What This Means for Training Infrastructure Design
Treating optimizer selection as a first-class infrastructure decision changes several things concretely. Gradient monitoring pipelines need to track distributional properties of gradient noise, not just gradient norms, so that heavy-tailed events are detectable rather than invisible. Checkpoint strategies should account for the possibility that loss spikes are optimizer-driven rather than data-driven, which changes the recovery logic.
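The difference between tracking a norm and tracking a distributional property can be illustrated with a small sketch. It uses synthetic data only: a Gaussian vector and a Student-t vector (3 degrees of freedom, chosen here as a stand-in for heavy-tailed gradients) are rescaled to the same L2 norm and then compared by excess kurtosis. The helper excess_kurtosis and the sample sizes are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: close to 0 for Gaussian data, large for heavy tails."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / variance ** 2 - 3.0)

rng = np.random.default_rng(0)
light = rng.normal(size=100_000)                         # light-tailed stand-in
heavy = rng.standard_t(df=3, size=100_000)               # heavy-tailed stand-in
heavy *= np.linalg.norm(light) / np.linalg.norm(heavy)   # force identical L2 norms

print(f"norms:    {np.linalg.norm(light):.2f} vs {np.linalg.norm(heavy):.2f}")
print(f"kurtosis: {excess_kurtosis(light):.2f} vs {excess_kurtosis(heavy):.2f}")
```

A dashboard that logs only the norm would report these two vectors as identical. A per-layer kurtosis series, logged alongside the norm, is one inexpensive way to make the difference visible.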
Optimizer switching mid-training is operationally complex but not impossible. Teams running fine-tuning pipelines have more flexibility than those running pretraining from scratch, because the cost of an interrupted run is lower. For organisations planning new pretraining infrastructure in 2026, the build-versus-configure decision for the optimizer layer should now include explicit evaluation of Lion and Muon against AdamW, with convergence theory as one evaluation criterion alongside empirical performance.
Treating Optimizer Strategy as an MLOps Concern
The organisational implication is that optimizer selection should not be delegated entirely to ML researchers working within a fixed infrastructure contract. The choice of optimizer affects monitoring requirements, checkpoint overhead, recovery procedures, and the interpretability of training failures. These are MLOps concerns, and they require input from the engineering leads who own those systems.
This does not mean replacing AdamW immediately across all workloads. It means establishing a structured evaluation process, running sign-based optimizer candidates on representative workloads with instrumentation sufficient to detect heavy-tailed gradient events, and building the operational knowledge to switch if the evidence warrants it. The cost of that evaluation is small relative to the cost of a training run that fails without a diagnosable cause.
The Theoretical Gap as a Monitoring Signal
The absence of convergence theory for AdamW under heavy-tailed noise has a direct operational translation: you cannot use theoretical bounds to set principled thresholds for gradient clipping, learning rate schedules, or anomaly detection. Teams that have set these thresholds empirically have done so without a formal basis for knowing whether their thresholds are conservative or insufficient.
Sign-based optimizers with sharp theoretical rates provide a reference point. If Muon attains a provably tight convergence rate under the noise distribution your workload produces, that rate can inform expected loss trajectory bounds, which in turn inform monitoring alerts and intervention thresholds. This is the kind of principled instrumentation that distinguishes a mature MLOps practice from one that responds to failures after they occur.
FAQs
None of this means that AdamW must be abandoned. The absence of convergence guarantees under heavy-tailed noise does not prove that AdamW fails in practice. What it means is that when failures occur, you lack a theoretical framework to determine whether the optimizer is implicated. For teams with stable, well-instrumented training runs, the immediate priority is improving observability of gradient noise distributions rather than switching optimizers. For teams planning new pretraining infrastructure, the evaluation case for sign-based alternatives is now strong enough to warrant structured testing before committing to AdamW as the default.
Lion and Muon are both sign-based optimizers with convergence guarantees under heavy-tailed noise, but they differ in their update mechanics. Lion applies a sign operation to a momentum-weighted gradient combination. Because it keeps no second-moment accumulator, it stores less optimizer state and does less work per step than AdamW. Muon applies Nesterov momentum in the gradient space followed by orthogonalisation via Newton-Schulz iteration, making it better suited to matrix-valued parameters such as weight matrices in transformer layers. Muon has seen adoption in large-scale pretraining (Kimi-K2, GLM-5, DeepSeek-V4), while Lion has shown gains in fine-tuning and smaller-scale pretraining contexts. The choice should be driven by parameter structure and available compute for the orthogonalisation step.
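The two update styles can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a production implementation: lion_step follows the Lion update rule as commonly published, and newton_schulz_orthogonalize uses the classical cubic Newton-Schulz iteration rather than the tuned polynomial and small step count used in Muon implementations. Both function names and the default values are chosen here for illustration.

```python
import numpy as np

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: sign of an interpolated momentum, then a momentum refresh."""
    update = np.sign(beta1 * m + (1.0 - beta1) * g)  # every coordinate moves by lr or not at all
    theta = theta - lr * (update + wd * theta)       # decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * g                # the only optimizer state Lion keeps
    return theta, m

def newton_schulz_orthogonalize(G, steps=30):
    """Classical cubic Newton-Schulz: drives all singular values of G towards 1."""
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

theta, m = lion_step(np.zeros(3), np.zeros(3), np.array([0.5, -2.0, 0.0]))
print(theta)  # step size is lr per coordinate, independent of gradient magnitude

rng = np.random.default_rng(0)
O = newton_schulz_orthogonalize(rng.normal(size=(4, 6)))
print(np.round(O @ O.T, 3))  # close to the 4x4 identity
```

In both cases the size of the update is decoupled from the size of the gradient, per coordinate for Lion and per singular direction for the orthogonalised update. This is the property that makes a single extreme gradient less able to dominate a step.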
Standard gradient norm logging is insufficient because it aggregates information that heavy-tailed analysis requires at a distributional level. The relevant additions are per-layer gradient kurtosis tracking, tail index estimation on gradient distributions across training steps, and logging of the ratio between gradient magnitude and the second-moment accumulator denominator in AdamW. These metrics, computed at checkpoint intervals rather than every step, provide the observability needed to determine whether heavy-tailed events are occurring and whether they correlate with loss instability. This instrumentation also establishes a baseline that makes optimizer comparison experiments interpretable.
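Tail index estimation, one of the metrics named above, can be sketched with the Hill estimator. The version below is minimal and makes several assumptions: the function name hill_tail_index and the choice of k are illustrative, the inputs are synthetic samples rather than real gradients, and in practice the estimate is sensitive to k and should be inspected over a range of values.

```python
import numpy as np

def hill_tail_index(values, k):
    """Hill estimate of the tail index from the k largest absolute values."""
    x = np.sort(np.abs(np.asarray(values, dtype=np.float64)).ravel())[::-1]
    if not (0 < k < x.size) or x[k] <= 0.0:
        raise ValueError("need 0 < k < n and a positive threshold")
    log_excess = np.log(x[:k] / x[k])  # log ratios against the (k+1)-th largest value
    return float(1.0 / np.mean(log_excess))

rng = np.random.default_rng(0)
pareto = rng.pareto(1.5, size=100_000) + 1.0  # classical Pareto, true tail index 1.5
gauss = rng.normal(size=100_000)              # light-tailed reference

print(f"Pareto sample:   {hill_tail_index(pareto, k=2000):.2f}")
print(f"Gaussian sample: {hill_tail_index(gauss, k=2000):.2f}")
```

A smaller estimate means a heavier tail. Estimates below 2 are consistent with infinite variance, which is the regime where the finite p-th moment analysis with p between 1 and 2 applies.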
Switching optimizers mid-training is technically feasible but introduces a discontinuity in the optimizer state that can destabilise training. AdamW carries a second-moment accumulator that sign-based optimizers do not use, so a direct state transfer is not possible. The practical approach is to restart from the most recent stable checkpoint with the new optimizer, using a reduced learning rate for the first several thousand steps to allow the new optimizer's momentum state to warm up. The risk is a temporary regression in loss before the new optimizer's trajectory stabilises. For fine-tuning runs, this cost is manageable. For pretraining at scale, the decision warrants a controlled experiment on a representative smaller run before committing.
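The reduced learning rate after a restart can be expressed as a simple ramp. The sketch below assumes a linear schedule. The function name restart_lr, the 2,000-step window, and the 10% starting fraction are hypothetical placeholders to be tuned per workload, not values taken from the paper or from any library.

```python
def restart_lr(steps_since_switch, target_lr, warmup_steps=2000, start_fraction=0.1):
    """Linear ramp from start_fraction * target_lr up to target_lr after a switch."""
    if warmup_steps <= 0 or steps_since_switch >= warmup_steps:
        return target_lr
    progress = steps_since_switch / warmup_steps
    return target_lr * (start_fraction + (1.0 - start_fraction) * progress)

# The new optimizer starts with empty momentum state, so early steps are kept small.
for step in (0, 500, 1000, 2000, 5000):
    print(step, restart_lr(step, target_lr=1e-3))
```

The ramp applies only to the steps after the switch. Whatever schedule the run was already following would resume once steps_since_switch reaches warmup_steps.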
Optimizer selection should sit at the intersection of ML research and MLOps ownership, with explicit accountability for both the training performance implications and the infrastructure implications. In practice, this means the ML engineering lead and the research lead jointly own the optimizer evaluation process, with the infrastructure team responsible for the monitoring and recovery systems that make optimizer experiments interpretable. Decisions to change the default optimizer for a given model class should be documented with the same rigour as changes to data pipelines or model architecture, including the empirical basis, the failure modes evaluated, and the monitoring thresholds established for the new configuration.
Heavy-tailed gradient noise is most pronounced in pretraining because the full data distribution is presented to the model, including the rare linguistic events that drive tail behaviour. Fine-tuning on domain-specific corpora with more uniform token distributions may exhibit lighter tails, which would reduce the practical relevance of the theoretical gap. However, fine-tuning on diverse instruction datasets or on corpora with Zipfian characteristics, such as web-scraped text, can reproduce heavy-tailed gradient behaviour. The safe approach is to measure gradient tail properties on your specific fine-tuning data before assuming the theoretical gap is irrelevant to your workload.

