Projected Language Models: A Large Model Pre-Segmented Into Smaller Ones
Authors: David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun
This paper has been accepted at the Foundation Models in the Wild workshop at ICML 2024.
Large language models are versatile tools but are not suitable for small inference budgets. Small models offer more efficient inference, but their lower capacity means their performance is good only when their scope is limited to a specialized domain. This paper explores how to obtain a small language model with good specialized accuracy, even when the specialization data is unknown during pretraining. We propose a novel architecture, projected networks (PN). A PN is a high-capacity network whose parameters can be linearly projected into a small network for fine-tuning. We assess the empirical effectiveness of our solution against small-model training, distillation, and hard mixtures of experts.
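The abstract only states that the large model's parameters are linearly projected into a small network; the concrete projection below (fixed random maps applied to a single weight matrix) is an assumption for illustration, not the paper's actual construction. It shows the core idea: the small model's weights are a linear function of the large model's weights.

```python
import numpy as np

# Illustrative sketch: linearly project one large weight matrix into a
# smaller one. The projection matrices are random placeholders (assumption);
# the paper's actual projection scheme is not specified in this abstract.
rng = np.random.default_rng(0)

d_big, d_small = 1024, 256                 # hidden sizes: large vs. small model
W_big = rng.normal(size=(d_big, d_big))    # one weight matrix of the large model

# Fixed linear maps from the large parameter space to the small one.
P_out = rng.normal(size=(d_small, d_big)) / np.sqrt(d_big)
P_in = rng.normal(size=(d_small, d_big)) / np.sqrt(d_big)

# Because W_small is linear in W_big, the small network can be extracted
# cheaply for fine-tuning while the capacity lives in W_big.
W_small = P_out @ W_big @ P_in.T
print(W_small.shape)  # (256, 256)
```

Fine-tuning would then update `W_small` (or, through the linear maps, `W_big`) on the specialized domain; the hypothetical names `P_out`/`P_in` are only meant to convey that the projection is fixed and linear.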
Optimal Splitting of Language Models from Mixtures to Specialized Domains
March 23, 2026 · Research areas: Data Science and Annotation; Speech and Natural Language Processing · Workshop at ICLR
This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality,…
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
November 12, 2024 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Workshop at NeurIPS
This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.
The pre-training phase of language models often begins with randomly initialized parameters. Given current trends in model scaling, training such a large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large…