Compute-Optimal Quantization-Aware Training
Authors: Aleksandr Dremov†, David Grangier, Angelos Katharopoulos, Awni Hannun
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B parameters to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.
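The tokens-per-parameter-byte statistic mentioned above can be made concrete with a short sketch. The reading of the statistic's name used here (training tokens divided by the quantized model's size in bytes), the helper name `tokens_per_parameter_byte`, and the example numbers are illustrative assumptions, not values or code from the paper; the fitted scaling law itself is not reproduced.

```python
def tokens_per_parameter_byte(num_tokens: float, num_params: float, bit_width: int) -> float:
    """Training tokens divided by the quantized model's size in bytes.

    Assumed interpretation of the statistic's name: bytes per parameter are
    taken as bit_width / 8, so a 4-bit model has 0.5 bytes per parameter.
    The paper's fitted law mapping this statistic to the optimal QAT
    fraction is not reproduced here.
    """
    bytes_per_param = bit_width / 8.0
    model_bytes = num_params * bytes_per_param
    return num_tokens / model_bytes


# Illustrative numbers only (not from the paper): a 2.2B-parameter model
# trained on 400B tokens with 4-bit QAT.
stat = tokens_per_parameter_byte(num_tokens=400e9, num_params=2.2e9, bit_width=4)
print(f"tokens per parameter-byte: {stat:.1f}")
```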
With Apple Intelligence, we’re integrating powerful generative AI right into the apps and experiences people use every day, all while protecting their privacy. At the 2025 Worldwide Developers Conference we introduced a new generation of language foundation models specifically developed to enhance the Apple Intelligence features in our latest software releases. We also introduced the new Foundation Models framework, which gives app developers…
Least Squares Binary Quantization of Neural Networks
March 23, 2020 · Research area: Methods and Algorithms · Workshop at CVPR
This paper was accepted at the Efficient Deep Learning in Computer Vision workshop at the CVPR 2020 conference.
Quantizing weights and activations of deep neural networks results in significant improvements in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full-precision and quantized models is the quantization error. In this work, we focus on binary quantization, in which values are mapped to -1 and…
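To make the truncated description concrete, here is a minimal sketch of the 1-bit case: a weight tensor is approximated as alpha * sign(w), and the scale alpha that minimizes the squared error ||w - alpha * sign(w)||^2 is the mean absolute value of w. This is the standard closed-form least-squares result for a single binary basis; the function name below is illustrative, and the paper's multi-bit least-squares schemes are not reproduced here.

```python
import numpy as np


def binary_quantize(w: np.ndarray) -> tuple[float, np.ndarray]:
    """Least-squares 1-bit quantization of a weight tensor.

    Approximates w as alpha * b with b in {-1, +1}. For fixed b = sign(w),
    the alpha minimizing ||w - alpha * b||^2 is mean(|w|).
    """
    b = np.sign(w).astype(w.dtype)
    b[b == 0] = 1.0                 # map exact zeros to +1 so codes stay in {-1, +1}
    alpha = float(np.abs(w).mean()) # closed-form least-squares scale
    return alpha, b


w = np.random.randn(1024)
alpha, b = binary_quantize(w)
mse = np.mean((w - alpha * b) ** 2)
print(f"scale={alpha:.4f}  reconstruction MSE={mse:.4f}")
```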