Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
Large-scale models are routinely trained on a mixture of different data sources, and different data mixtures yield markedly different downstream performance. We propose a novel architecture that can instantiate one model for each data mixture without retraining. Our architecture consists of a bank of expert weights that are linearly combined to instantiate one model, and we learn the linear combination coefficients as a function of the input histogram. To train this architecture, we sample a random histogram, instantiate the corresponding model, and backpropagate through one batch of data sampled from that histogram. We demonstrate that our approach quickly yields small specialized models on several datasets.
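The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `expert_bank`, `coef_head`, and `instantiate` are hypothetical names, the coefficient head is a simple linear map for brevity, and real expert weights would be full model parameter tensors rather than flat vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_params, n_domains = 4, 8, 3

# Bank of expert weights: each row is one expert's (flattened) parameters.
expert_bank = rng.normal(size=(n_experts, n_params))

# Hypothetical coefficient head: maps a data-mixture histogram over
# domains to linear combination coefficients over the experts.
coef_head = rng.normal(size=(n_domains, n_experts))

def instantiate(histogram):
    """Instantiate one model by linearly combining the expert weights,
    with coefficients predicted from the input histogram."""
    alpha = histogram @ coef_head      # (n_experts,) combination weights
    return alpha @ expert_bank         # (n_params,) instantiated model

# Training would sample a random histogram (Dirichlet keeps it on the
# probability simplex), instantiate the model, and backpropagate through
# one batch drawn from that mixture; here we only show instantiation.
h = rng.dirichlet(np.ones(n_domains))
params = instantiate(h)
```

At inference time, specializing to a new data mixture reduces to one such forward pass: no retraining is needed, only a weighted average of the stored expert parameters.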
April 10, 2025 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Conference: ICLR
July 17, 2024 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Workshop at ICML