Paper, July 2025

Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

Large-scale models are routinely trained on a mixture of different data sources. Different data mixtures yield very different downstream performance. We propose a novel architecture that can instantiate one model for each data mixture without having to re-train the model. Our architecture consists of a bank of expert weights, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input histogram. To train this architecture, we sample random histograms, instantiate the corresponding model, and backpropagate through one batch of data sampled from the corresponding histogram. We demonstrate the promise of our approach to quickly obtain small specialized models on several datasets.
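The training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear regression model in place of a language model, a purely linear map from histogram to coefficients, an added shared weight vector, and Dirichlet-sampled histograms. All names (`train_soup`, `instantiate`, `sample_batch`, `expected_loss`) are hypothetical.

```python
import numpy as np


def sample_batch(rng, h, domain_weights, batch_size):
    """Draw a batch whose domain proportions follow the histogram h."""
    k, d = domain_weights.shape
    domains = rng.choice(k, size=batch_size, p=h)
    x = rng.normal(size=(batch_size, d))
    # Each sample is labelled by the (toy) ground-truth weights of its domain.
    y = np.einsum("bd,bd->b", x, domain_weights[domains])
    return x, y


def instantiate(params, h):
    """Build one model: shared weights plus a linear combination of experts."""
    shared, experts, router = params
    alpha = router @ h               # combination coefficients, shape (n_experts,)
    return shared + alpha @ experts  # instantiated weights, shape (d,)


def train_soup(domain_weights, n_experts=4, steps=3000, lr=0.02,
               batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    k, d = domain_weights.shape
    shared = np.zeros(d)
    experts = 0.1 * rng.normal(size=(n_experts, d))  # bank of expert weights
    router = 0.1 * rng.normal(size=(n_experts, k))   # histogram -> coefficients
    for _ in range(steps):
        h = rng.dirichlet(np.ones(k))                # assumption: Dirichlet sampling
        x, y = sample_batch(rng, h, domain_weights, batch_size)
        alpha = router @ h
        theta = shared + alpha @ experts
        # Gradient of the mean squared error with respect to theta.
        g_theta = 2.0 * x.T @ (x @ theta - y) / batch_size
        # Chain rule through the linear combination.
        g_experts = np.outer(alpha, g_theta)
        g_router = np.outer(experts @ g_theta, h)
        shared -= lr * g_theta
        experts -= lr * g_experts
        router -= lr * g_router
    return shared, experts, router


def expected_loss(theta, w):
    """Expected squared error on a domain with true weights w (isotropic inputs)."""
    return float(np.sum((theta - w) ** 2))


if __name__ == "__main__":
    W = np.random.default_rng(1).normal(size=(3, 8))  # 3 toy domains, 8 features
    params = train_soup(W)
    specialist = instantiate(params, np.array([1.0, 0.0, 0.0]))
    generalist = instantiate(params, np.ones(3) / 3)
    print("specialist loss on domain 0:", expected_loss(specialist, W[0]))
    print("generalist loss on domain 0:", expected_loss(generalist, W[0]))
```

After training, a specialist for any mixture is obtained by calling `instantiate` with the desired histogram, with no further gradient steps. In this toy setting the one-hot histogram for a domain should give a lower loss on that domain than the uniform histogram.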

Figure 1: Training pipeline for the soup-of-experts.

Related readings and updates.

June 9, 2025

With Apple Intelligence, we're integrating powerful generative AI right into the apps and experiences people use every day, all while protecting their privacy. At the 2025 Worldwide Developers Conference we introduced a new generation of language foundation models specifically developed to enhance the Apple Intelligence features in our latest software releases. We also introduced the new Foundation Models framework, which gives app developers...

April 10, 2025. Research areas: Methods and Algorithms, Speech and Natural Language Processing. Conference: ICLR

We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction...

Updates to Apple's On-Device and Server Foundation Language Models

No Need to Talk: Asynchronous Mixture of Language Models
