MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
Authors: Bingbing Wen, Sirajul Salekin, Feiyang Kang, Lucy Lu Wang, Bill Howe, Javier Movellan, Manjot Bilkhu
This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models (NADPFM) at ICLR 2026.
Principled domain reweighting can substantially improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal pretraining remains underexplored. Current multimodal training recipes tune mixtures from only a single perspective, such as data format or task type. We introduce MixAtlas, a principled framework for compute-efficient multimodal mixture optimization via systematic domain decomposition and smaller proxy models. MixAtlas factorizes the training data along two interpretable axes, image concepts and task supervision, enabling interpretable mixture control and fine-grained attribution of downstream performance to specific domains within each axis. Using small proxy models and a Gaussian-process surrogate, we explore the mixture space at 1/100th the cost of full-scale training. The resulting mixtures yield substantial improvements: up to 3× faster convergence and consistent gains of 2–5% across diverse benchmarks over existing approaches, with especially strong boosts on text-rich benchmarks like ChartQA (+10%) and TextVQA (+13%). Importantly, we show that mixtures obtained via smaller proxy models transfer to larger-scale model training, preserving both efficiency and accuracy gains. Overall, MixAtlas makes multimodal mixture optimization practical and interpretable, providing concrete, compute-efficient recipes for training next-generation MLLMs.
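To illustrate the proxy-model idea, here is a minimal sketch of surrogate-guided mixture search: sample candidate mixtures on the simplex, score a few of them with a cheap proxy, fit a Gaussian-process surrogate, and pick the next mixture by an upper-confidence-bound rule. The `proxy_score` function is a toy stand-in for training a small proxy model and evaluating a benchmark; every name and constant here is an illustrative assumption, not the paper's actual API or recipe.

```python
# Hedged sketch of GP-surrogate mixture search (toy objective, assumed setup).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def proxy_score(w):
    # Toy stand-in for a proxy-model benchmark score: pretend the
    # best mixture over 3 domains is (0.5, 0.3, 0.2).
    target = np.array([0.5, 0.3, 0.2])
    return -np.sum((w - target) ** 2)

# Candidate mixtures on the 3-domain probability simplex.
candidates = rng.dirichlet(np.ones(3), size=200)

# Evaluate a small subset with the (cheap) proxy, then fit a GP surrogate.
idx = rng.choice(len(candidates), size=20, replace=False)
X = candidates[idx]
y = np.array([proxy_score(w) for w in X])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

# Choose the next mixture to try by upper-confidence bound (exploit + explore).
mu, sigma = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(mu + 1.0 * sigma)]
```

In a real pipeline, `proxy_score` would launch a short proxy-model training run, and the loop would alternate between fitting the surrogate and evaluating its suggested mixture.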
Scaling Laws for Optimal Data Mixtures
September 26, 2025 · Research area: Methods and Algorithms · Conference: NeurIPS
Large foundation models are typically trained on data from multiple domains, with the data mixture—the proportion of each domain used—playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach…
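The core mechanic behind a scaling-law approach is fitting a parametric loss curve on small-scale runs and extrapolating it. Below is a minimal sketch that fits a power law L(N) = a·N^(−b) + c to (synthetic, noiseless) proxy losses and extrapolates to a larger scale; the constants and data are assumptions for illustration, not results from the paper.

```python
# Hedged sketch: fit a power-law scaling curve to proxy losses, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, b, c):
    # Standard scaling-law form: reducible term a*N^(-b) plus floor c.
    return a * N ** (-b) + c

# Synthetic losses generated from an assumed "true" law (a=10, b=0.3, c=1.5).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = power_law(N, 10.0, 0.3, 1.5)

# Fit the three parameters to the small-scale observations.
(a, b, c), _ = curve_fit(power_law, N, L, p0=[1.0, 0.5, 1.0], maxfev=10000)

# Extrapolate the fitted law beyond any observed scale.
pred = power_law(1e9, a, b, c)
```

For mixture optimization, one such curve would be fit per candidate mixture (or the mixture weights would enter the law as extra parameters), and the mixture minimizing the extrapolated target-domain loss would be selected.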
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
July 4, 2025 · Research area: Methods and Algorithms · Conference: ICML
Large-scale models are routinely trained on a mixture of different data sources. Different data mixtures yield very different downstream performances. We propose a novel architecture that can instantiate one model for each data mixture without having to re-train the model. Our architecture consists of a bank of expert weights, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of…
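The instantiation step described above, one model per data mixture via a linear combination of a bank of expert weights, can be sketched in a few lines. The coefficient function here is a plain softmax over similarity between the target mixture and each expert's "home" mixture; that choice, and all the toy shapes, are illustrative assumptions rather than the paper's learned coefficient network.

```python
# Hedged sketch: instantiate one model from an expert bank by parameter averaging.
import numpy as np

rng = np.random.default_rng(1)
n_experts, n_params, n_domains = 4, 10, 3

# Bank of expert weights: one flattened parameter vector per expert.
expert_bank = rng.normal(size=(n_experts, n_params))

# Each expert's "home" data mixture (assumed known for this sketch).
expert_profiles = rng.dirichlet(np.ones(n_domains), size=n_experts)

def instantiate(mixture):
    # Coefficients from similarity between the target mixture and each
    # expert profile, normalized with a softmax so they sum to 1.
    logits = expert_profiles @ mixture
    coeffs = np.exp(logits - logits.max())
    coeffs /= coeffs.sum()
    # Parameter averaging: one specialist model, no retraining.
    return coeffs @ expert_bank, coeffs

model, coeffs = instantiate(np.array([0.6, 0.3, 0.1]))
```

The appeal of this design is that instantiation is a single weighted sum over stored parameters, so specialist models for new mixtures cost no additional training.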