
Pretrained language models are commonly adapted to comply with human intent and downstream tasks via finetuning. The finetuning process involves supervised finetuning (SFT), using labeled samples, and/or reinforcement learning based finetuning (RFT) via policy gradient methods, using a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we prove that the expected gradient for an input sample (i.e. prompt) vanishes if its reward standard deviation under the model is low, regardless of whether the reward mean is near-optimal or not. We then demonstrate the prevalence and detrimental effects of vanishing gradients due to low reward standard deviation in an RFT benchmark for language models. In particular, we show that in datasets where samples with low reward standard deviation under the pretrained model are more prevalent, the reward achieved by RFT, relative to SFT, is worse. Controlled experiments and a theoretical analysis further establish that, even in simplified settings, vanishing gradients in RFT can lead to extremely slow convergence. Lastly, we explore ways to overcome vanishing gradients in RFT of language models. We find the common practice of an initial SFT phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Furthermore, our experiments reveal that relatively few SFT optimization steps on a small number of labeled samples suffice, implying that the initial SFT phase need not be expensive in terms of compute and data labeling efforts.
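To make the vanishing-gradient claim concrete, here is a minimal numpy sketch (not the paper's code) for a toy softmax policy over a handful of outputs. For a softmax policy, the exact gradient of the expected reward with respect to the logits works out to pi(y) * (r(y) - mean reward), so its norm is bounded by the reward standard deviation under the policy; the loop below shrinks the reward spread and prints how the expected gradient shrinks with it. The policy, rewards, and vocabulary size are illustrative assumptions, not taken from the paper.

```python
# Toy illustration: exact policy gradient of E_{y~pi}[r(y)] for a softmax
# policy over a small "vocabulary", showing that its norm is bounded by the
# reward standard deviation under the policy (hypothetical setup, not the
# paper's benchmark).
import numpy as np

rng = np.random.default_rng(0)

def expected_grad_and_reward_std(logits, rewards):
    """Exact gradient of E_{y~pi}[r(y)] w.r.t. softmax logits, plus reward std."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mean_r = probs @ rewards
    # d/d logit_k E[r] = pi_k * (r_k - E[r])  (baseline-subtracted form),
    # so every component carries a factor of the centered reward.
    grad = probs * (rewards - mean_r)
    std_r = np.sqrt(probs @ (rewards - mean_r) ** 2)
    return np.linalg.norm(grad), std_r

logits = rng.normal(size=8)               # fixed toy policy
for spread in [1.0, 0.1, 0.01, 0.001]:    # progressively flatter rewards
    rewards = 0.5 + spread * rng.normal(size=8)
    g, s = expected_grad_and_reward_std(logits, rewards)
    print(f"reward std {s:.4f}  ->  expected-gradient norm {g:.4f}")
```

Running the sketch shows the expected-gradient norm tracking the reward standard deviation downward, mirroring the abstract's point that a low reward standard deviation for a prompt starves RFT of gradient signal even when the reward mean is far from optimal.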

Related readings and updates.

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

This paper was accepted at the "Human in the Loop Learning Workshop" at NeurIPS 2022. Specifying reward functions for Reinforcement Learning is a challenging task that preference-based learning methods bypass by instead learning from preference labels on trajectory queries. These methods, however, still require large numbers of preference labels and often achieve poor reward recovery. We present…