Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
Large-scale models are routinely trained on a mixture of different data sources, and different data mixtures yield very different downstream performance. We propose a novel architecture that can instantiate one model for each data mixture without retraining. The architecture consists of a bank of expert weights that are linearly combined to instantiate a single model; the combination coefficients are learned as a function of the input domain histogram. To train this architecture, we sample a random histogram, instantiate the corresponding model, and backpropagate through one batch of data sampled according to that histogram. We demonstrate the promise of this approach for quickly obtaining small specialized models on several datasets.
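A minimal sketch of the training loop described in the abstract, assuming a toy regression model in place of a language model. The names `coef_net`, `sample_histogram`, and `sample_batch` are illustrative stand-ins, and the softmax normalization of the coefficients is an assumption of this sketch (the abstract only specifies a linear combination).

```python
import torch
import torch.nn as nn
from torch.func import functional_call

n_experts, n_domains = 8, 5

# Base architecture whose weights are instantiated by combining experts.
base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
names = [n for n, _ in base.named_parameters()]
shapes = [p.shape for _, p in base.named_parameters()]

# Bank of expert weights: one full parameter set per expert.
bank = nn.ParameterList(
    nn.Parameter(torch.randn(n_experts, *s) * 0.02) for s in shapes
)
# Small network mapping a domain histogram to combination coefficients.
coef_net = nn.Sequential(nn.Linear(n_domains, 32), nn.ReLU(),
                         nn.Linear(32, n_experts))

opt = torch.optim.Adam(list(bank.parameters()) + list(coef_net.parameters()),
                       lr=1e-3)

def sample_histogram():
    # Random point on the simplex over data domains.
    return torch.distributions.Dirichlet(torch.ones(n_domains)).sample()

def sample_batch(hist):
    # Placeholder: in practice, draw examples from each domain in
    # proportion to `hist`.
    return torch.randn(64, 16), torch.randn(64, 16)

for step in range(1000):
    hist = sample_histogram()
    # Softmax keeps the combination on the simplex (a choice of this sketch).
    alpha = torch.softmax(coef_net(hist), dim=-1)          # (n_experts,)
    # Instantiate one model as a linear combination of expert weights.
    params = {name: torch.einsum("e,e...->...", alpha, w)
              for name, w in zip(names, bank)}
    x, y = sample_batch(hist)
    loss = nn.functional.mse_loss(functional_call(base, params, (x,)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the instantiated model is an ordinary parameter set, a specialist for any new histogram can be produced after training with a single forward pass through `coef_net` and one weighted average, with no further gradient steps.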
Scaling Laws for Optimal Data Mixtures
September 26, 2025 · Research area: Methods and Algorithms · Conference: NeurIPS
Large foundation models are typically trained on data from multiple domains, with the data mixture—the proportion of each domain used—playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach…
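The abstract is truncated, so the exact law is not shown here; the sketch below illustrates the general recipe it describes under stated assumptions: fit a parametric loss model to a few pilot runs at different mixtures, then minimize the fitted law over the simplex. The quadratic functional form and all numbers are illustrative placeholders, not the paper's scaling law.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_domains = 4

# Pilot runs: a few trial mixtures and the target-domain losses they produced
# (synthetic placeholders here).
W = rng.dirichlet(np.ones(n_domains), size=12)               # (12, n_domains)
a_true, b_true = np.array([0.5, -0.3, 0.1, 0.2]), np.full(n_domains, 0.4)
losses = 2.0 + W @ a_true + (W ** 2) @ b_true + 0.01 * rng.standard_normal(12)

def predicted_loss(theta, w):
    # Toy parametric law: L(w) = c + a.w + b.(w^2); a stand-in for the
    # paper's actual scaling-law form.
    c, a, b = theta[0], theta[1:1 + n_domains], theta[1 + n_domains:]
    return c + w @ a + (w ** 2) @ b

# Step 1: fit the law's parameters to the pilot runs by least squares.
theta = minimize(lambda t: np.mean((predicted_loss(t, W) - losses) ** 2),
                 np.zeros(1 + 2 * n_domains)).x

# Step 2: minimize the fitted law over the simplex; a softmax
# reparameterization enforces w >= 0 and sum(w) = 1.
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
z = minimize(lambda z: predicted_loss(theta, softmax(z)), np.zeros(n_domains)).x
print("predicted-optimal mixture:", softmax(z).round(3))
```

The payoff of this recipe is that the expensive trial-and-error loop over full pretraining runs is replaced by a handful of small pilot runs plus a cheap numerical optimization.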
No Need to Talk: Asynchronous Mixture of Language Models
April 10, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conference: ICLR
We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert according to a short prefix. This inference scheme naturally uses a fraction…
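A minimal sketch of the prefix-based routing step at inference, assuming the router is a small classifier over the first few tokens; `experts`, `router`, and `PREFIX_LEN` are illustrative names and tiny stand-in models, not the paper's actual interface.

```python
import torch
import torch.nn as nn

PREFIX_LEN = 16
vocab, n_experts = 1000, 4

# Each expert is an independently trained language model
# (tiny fixed-length stand-ins here).
experts = [nn.Sequential(nn.Embedding(vocab, 64), nn.Flatten(),
                         nn.Linear(64 * PREFIX_LEN, vocab))
           for _ in range(n_experts)]

# Lightweight router: a small classifier over the prefix tokens.
router = nn.Sequential(nn.Embedding(vocab, 32), nn.Flatten(),
                       nn.Linear(32 * PREFIX_LEN, n_experts))

def route_and_run(tokens):
    prefix = tokens[:, :PREFIX_LEN]
    # The router reads only the short prefix and picks a single expert.
    expert_id = router(prefix).argmax(dim=-1).item()
    # Only the chosen expert runs (on the prefix here, for brevity), so the
    # per-sequence cost is a fraction of evaluating the whole mixture.
    return experts[expert_id](prefix)

tokens = torch.randint(0, vocab, (1, PREFIX_LEN))
logits = route_and_run(tokens)
```

Routing on a short prefix is what makes the scheme cheap: the router's cost is negligible next to one expert's forward pass, and the experts never need to exchange activations or gradients at inference time.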