Scaling Laws for Optimal Data Mixtures
Authors: Mustafa Shukor†, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct, large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs and then used to predict performance at larger scales and for unseen domain weights. Finally, the scaling laws let us derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
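As a rough illustration of this workflow, the sketch below fits a mixture-aware scaling law to a handful of hypothetical small-scale runs and then optimizes the domain weight at a larger budget. The additive power-law form, the 1/h mixture term, and all run data are assumptions made for illustration, not the paper's actual parameterization.

```python
# A minimal sketch, not the paper's exact law: assume a Chinchilla-style
# form L(N, D, h) = E + A/N^alpha + B(h)/D^beta, where the data term
# B(h) = b1/h1 + b2/h2 blows up when either domain weight vanishes,
# so the optimal mixture is interior. All run data below are made up.
import numpy as np
from scipy.optimize import curve_fit, minimize

def loss_law(X, E, A, alpha, b1, b2, beta):
    N, D, h1 = X                       # two domains: h = (h1, 1 - h1)
    mix = b1 / h1 + b2 / (1.0 - h1)    # mixture-dependent data coefficient
    return E + A / N**alpha + mix / D**beta

# Hypothetical small-scale runs varying model size, token count, and mixture.
runs_N  = np.array([1e8, 1e8, 3e8, 3e8, 1e9, 1e9, 1e9, 3e8])
runs_D  = np.array([2e9, 2e9, 5e9, 5e9, 1e10, 1e10, 2e9, 1e10])
runs_h1 = np.array([0.8, 0.2, 0.6, 0.3, 0.5, 0.7, 0.4, 0.25])
true = (2.0, 500.0, 0.3, 200.0, 120.0, 0.3)
rng = np.random.default_rng(0)
runs_loss = loss_law((runs_N, runs_D, runs_h1), *true)
runs_loss = runs_loss + rng.normal(0.0, 0.005, size=len(runs_N))

# Fit the law's parameters to the small runs, ...
params, _ = curve_fit(loss_law, (runs_N, runs_D, runs_h1), runs_loss,
                      p0=[1.5, 300.0, 0.25, 100.0, 100.0, 0.25],
                      maxfev=50000)

# ... then extrapolate: choose the mixture minimizing the predicted loss
# at a much larger target budget (N*, D*).
N_star, D_star = 7e9, 1e12
res = minimize(lambda h: loss_law((N_star, D_star, h[0]), *params),
               x0=[0.5], bounds=[(0.01, 0.99)])
print(f"predicted optimal domain weight h1 = {res.x[0]:.3f}")
```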
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
July 7, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conferences: ICML, Workshop at ICLR
This paper was accepted at the Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference workshop at ICLR 2025.
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity is primarily defined along two dimensions: the number of model parameters and the compute per example. While scaling typically involves…
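To make those two capacity dimensions concrete, the sketch below (an illustration, not from the paper) compares a dense feed-forward layer with a top-k mixture-of-experts layer: total parameters track the number of experts, while per-token FLOPs track only the routed experts. The layer sizes and the 2-FLOPs-per-weight rule of thumb are illustrative assumptions.

```python
# Illustration of the two capacity axes: total parameters vs compute per
# example. In a top-k mixture-of-experts FFN, total parameters grow with
# the number of experts, while per-token compute grows only with the k
# experts each token is routed to.

def moe_ffn_stats(d_model: int, d_ff: int, n_experts: int, top_k: int):
    expert_params = 2 * d_model * d_ff        # up- and down-projection
    total_params = n_experts * expert_params  # capacity: all experts
    active_params = top_k * expert_params     # compute: routed experts only
    flops_per_token = 2 * active_params       # ~2 FLOPs per active weight
    return total_params, active_params, flops_per_token

dense = moe_ffn_stats(d_model=4096, d_ff=16384, n_experts=1, top_k=1)
sparse = moe_ffn_stats(d_model=4096, d_ff=16384, n_experts=8, top_k=2)
print(f"dense : total={dense[0]:.2e}  flops/token={dense[2]:.2e}")
print(f"sparse: total={sparse[0]:.2e}  flops/token={sparse[2]:.2e}")
```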
Scaling Laws for Forgetting During Finetuning with Pretraining Data Injection
June 20, 2025 · Research area: Methods and Algorithms · Conference: ICML
A widespread strategy for obtaining a language model that performs well in a target domain is to fine-tune it with unsupervised next-token prediction on data from that domain. Fine-tuning presents two challenges: i) if the amount of target data is limited, as is the case in most practical applications, the model will quickly overfit, and ii) the model will drift away from the original model and forget the pre-training…
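One common way to realize the pretraining data injection named in the title is to mix a fraction of pretraining examples into each fine-tuning batch, which counteracts both overfitting and drift. The hypothetical sketch below shows only the mechanics; the mixing fraction and sampling scheme are assumptions, not the paper's recipe.

```python
# Hypothetical sketch of pretraining data injection: each fine-tuning
# batch mixes a fraction p of pretraining examples with target-domain
# examples. The fraction p and the uniform samplers are illustrative.
import random

def mixed_batches(target_data, pretrain_data, batch_size=8, p=0.25):
    """Yield batches with roughly p pretraining examples injected."""
    n_inject = max(1, int(p * batch_size))
    n_target = batch_size - n_inject
    while True:
        batch = random.sample(target_data, n_target)
        batch += random.sample(pretrain_data, n_inject)
        random.shuffle(batch)
        yield batch

# Usage with toy sequences standing in for real corpora.
target = [f"target_seq_{i}" for i in range(100)]
pretrain = [f"pretrain_seq_{i}" for i in range(1000)]
batches = mixed_batches(target, pretrain)
print(next(batches))
```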