Short Review
Overview of Optimizer‑Quantization Interactions
The study investigates how the choice of optimizer influences model performance under both post-training quantization (PTQ) and quantization-aware training (QAT). The researchers trained full-precision models ranging from 50 M to 1.5 B parameters with six distinct optimizers, carefully tuning hyperparameters to establish strong baselines. After applying PTQ, they found that conventional outlier metrics such as the max-to-mean ratio (MMR) and kurtosis failed to predict degradation across optimizers, prompting an analytical account of how quantization error propagates through deep networks. In the QAT experiments, models trained from scratch showed that optimizers that excel at full precision do not necessarily keep their edge once quantization is incorporated; notably, Shampoo exhibited the lowest accuracy loss. Finally, the authors derived scaling laws for QAT under the various optimizers, showing that Shampoo achieves the highest parameter efficiency among those tested.
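To make the outlier metrics concrete, the sketch below computes MMR and excess kurtosis for a weight tensor and measures the reconstruction error of a simple symmetric per-tensor int8 PTQ round trip. The metric definitions and the quantization scheme here are common conventions assumed for illustration, not necessarily the exact formulations used in the paper.

```python
import numpy as np

def max_to_mean_ratio(w):
    """Max-to-mean ratio (MMR): how extreme the largest weight is
    relative to the typical magnitude. Large values signal outliers."""
    a = np.abs(w).ravel()
    return a.max() / a.mean()

def kurtosis(w):
    """Excess kurtosis of the weight distribution; heavy tails
    push this well above 0, the Gaussian baseline."""
    w = w.ravel()
    z = (w - w.mean()) / w.std()
    return np.mean(z ** 4) - 3.0

def ptq_int8_error(w):
    """Round-trip a tensor through symmetric per-tensor int8
    quantization and return the mean-squared reconstruction error."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return np.mean((q * scale - w) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
w_outlier = w.copy()
w_outlier[0] = 40.0  # a single outlier inflates both metrics and the PTQ error

for t in (w, w_outlier):
    print(f"MMR={max_to_mean_ratio(t):7.1f}  "
          f"kurtosis={kurtosis(t):6.1f}  "
          f"PTQ MSE={ptq_int8_error(t):.2e}")
```

The single outlier stretches the quantization scale, so every other weight is represented more coarsely; this is the mechanism by which outlier-heavy distributions degrade under PTQ, even though, per the paper, these metrics alone do not predict degradation across optimizers.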
Critical Evaluation
Strengths
The paper’s systematic approach—spanning multiple model sizes and a diverse optimizer set—provides comprehensive empirical evidence rarely seen in quantization research. By combining quantitative analysis with theoretical insights into MMR limitations, the authors bridge a critical gap between practice and theory. The derivation of scaling laws offers actionable guidance for practitioners seeking to balance accuracy and efficiency.
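To illustrate how such scaling laws are typically fit, the sketch below regresses a power law, loss ≈ c · N^(−α), onto a handful of (parameter count, loss) pairs in log-log space. The data points are invented for illustration and are not the paper's measurements.

```python
import numpy as np

# Hypothetical (model size, validation loss) pairs for one optimizer's
# QAT runs -- illustrative numbers only, not results from the paper.
params = np.array([50e6, 150e6, 400e6, 1.5e9])
loss = np.array([3.20, 2.85, 2.60, 2.35])

# Fit loss ~ c * N^(-alpha) by linear regression in log-log space:
# log(loss) = log(c) - alpha * log(N).
slope, log_c = np.polyfit(np.log(params), np.log(loss), 1)
alpha = -slope

print(f"fitted exponent alpha ~ {alpha:.3f}")
```

Comparing the fitted exponents α across optimizers is one way to quantify the "parameter efficiency" claim: a steeper exponent means loss falls faster as parameters are added.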
Weaknesses
While six optimizers cover many popular choices, the study omits newer adaptive methods that may behave differently under quantization. Hyperparameters were tuned only on the full-precision models; re-tuning learning rates and weight decay for each quantized scenario could further sharpen the conclusions. The analysis also focuses solely on PTQ and QAT, leaving out hybrid or mixed-precision strategies.
Implications
The findings suggest that selecting an optimizer for deployment should consider its interaction with the chosen quantization pipeline rather than relying on full‑precision performance alone. The demonstrated superiority of Shampoo in both PTQ resilience and QAT efficiency positions it as a strong candidate for production‑grade models, especially where parameter budgets are tight.
Conclusion
This work delivers a nuanced understanding of optimizer‑quantization dynamics, highlighting that traditional metrics may mislead practitioners. By revealing Shampoo’s consistent advantage across PTQ and QAT, the study offers clear, evidence‑based recommendations for model deployment strategies in resource‑constrained environments.