
This paper was accepted at the Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML) Workshop at NeurIPS 2024.

Large language models (LLMs) pretrained on large corpora of internet text possess much of the world's knowledge. After pretraining, one often conducts continued pretraining to strengthen specific capabilities, such as math and coding, or applies "posttraining" (a.k.a. alignment) techniques to make the model follow users' instructions and align with human preferences. One challenge during these finetuning stages is that the model can lose pretraining knowledge or forget certain capabilities (e.g., in-context learning). Moreover, although strong open-weight LLMs such as Llama 3 exist, neither their pretraining nor their posttraining data is public, making it difficult to mix the finetuning data with the model's own pretraining data as a way to mitigate forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuning by adding a KL-divergence term to the loss function, regularizing the divergence between the finetuned model's predictions and those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model's performance in the target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between instruction-following capability and pretraining knowledge. We complement our empirical investigation with a mathematical model based on overparameterized linear regression that provides geometric intuition for why label annealing helps.
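To make the mechanism concrete, below is a minimal PyTorch sketch of a loss of this form: cross-entropy on the finetuning labels plus a KL term that keeps the finetuned model's token distribution close to that of the frozen pretrained model. The function name `label_annealing_loss`, the `kl_weight` coefficient, and the direction of the KL term are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a label-annealing-style finetuning loss.
# Combines next-token cross-entropy on the finetuning data with a KL term
# that regularizes the finetuned ("student") model toward the predictions
# of the frozen pretrained ("teacher") model.
import torch
import torch.nn.functional as F


def label_annealing_loss(student_logits, teacher_logits, targets, kl_weight=0.5):
    """student_logits: (batch, seq, vocab) from the model being finetuned.
    teacher_logits: (batch, seq, vocab) from the frozen pretrained model,
        typically computed under torch.no_grad().
    targets: (batch, seq) token ids from the finetuning data.
    kl_weight: assumed tradeoff coefficient between the two terms.
    """
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy on the finetuning labels.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))

    # KL term between the pretrained model's and the finetuned model's
    # predictive distributions (the exact direction/weighting used in the
    # paper may differ; this is one plausible choice).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits.detach(), dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce + kl_weight * kl


# Example usage with random tensors (shapes are illustrative only).
B, T, V = 2, 8, 32000
student_logits = torch.randn(B, T, V, requires_grad=True)
teacher_logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
loss = label_annealing_loss(student_logits, teacher_logits, targets)
```

Setting `kl_weight` to zero recovers ordinary finetuning, while larger values pull the model back toward the pretrained distribution, which is the tradeoff the abstract describes for alignment finetuning.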

Related readings and updates.

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to train a small task-specific model for a new target…

Vanishing Gradients in Reinforcement Finetuning of Language Models

Pretrained language models are commonly adapted to comply with human intent and downstream tasks via finetuning. The finetuning process involves supervised finetuning (SFT), using labeled samples, and/or reinforcement learning based finetuning (RFT) via policy gradient methods, using a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we prove that the expected gradient for an input sample (i.e…