
Pretrained language models are commonly adapted to comply with human intent and downstream tasks via finetuning. The finetuning process involves supervised finetuning (SFT), using labeled samples, and/or reinforcement learning based finetuning (RFT) via policy gradient methods, using a (possibly learned) reward function. This work highlights an overlooked optimization hurdle in RFT: we prove that the expected gradient for an input sample (i.e. prompt) vanishes if its reward standard deviation under the model is low, regardless of whether the reward mean is near-optimal or not. We then demonstrate the prevalence and detrimental effects of vanishing gradients due to low reward standard deviation in an RFT benchmark for language models. In particular, we show that in datasets where samples with low reward standard deviation under the pretrained model are more prevalent, the reward that RFT achieves compared to SFT is worse. Controlled experiments and a theoretical analysis further establish that, even in simplified settings, vanishing gradients in RFT can lead to extremely slow convergence. Lastly, we explore ways to overcome vanishing gradients in RFT of language models. We find the common practice of an initial SFT phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Furthermore, our experiments reveal that relatively few SFT optimization steps on a small number of labeled samples suffice, implying that the initial SFT phase need not be expensive in terms of compute and data labeling efforts.
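The vanishing-gradient effect described above can be illustrated on a toy softmax policy. The sketch below (a simplified setting, not the paper's experimental setup) computes the exact expected policy gradient, which for a softmax policy reduces to grad_k = p(k) * (r(k) - E[r]). When the reward is nearly constant under the model's distribution, i.e. its standard deviation is low, this gradient is near zero even though the mean reward (0.3 here) is far from optimal:

```python
import numpy as np

def expected_policy_gradient(logits, rewards):
    """Exact expected REINFORCE gradient w.r.t. softmax logits.

    For a softmax policy p = softmax(logits), the gradient of
    E_p[r] with respect to logit k is p[k] * (r[k] - E_p[r]).
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    mean_reward = p @ rewards
    return p * (rewards - mean_reward)

rng = np.random.default_rng(0)
logits = rng.normal(size=5)  # an arbitrary pretrained policy over 5 outputs

# Rewards with high standard deviation under the model.
spread_rewards = np.array([0.0, 1.0, 0.2, 0.9, 0.1])
# Near-constant rewards: low std under the model, suboptimal mean (0.3).
flat_rewards = np.full(5, 0.3)

g_spread = expected_policy_gradient(logits, spread_rewards)
g_flat = expected_policy_gradient(logits, flat_rewards)

print(np.linalg.norm(g_spread))  # noticeably nonzero
print(np.linalg.norm(g_flat))    # numerically zero: the gradient vanishes
```

Because a constant reward factors out of the expectation and E_p[grad log p] = 0, the flat-reward gradient is exactly zero regardless of how suboptimal the mean reward is, which is the mechanism the abstract refers to.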

Related readings and updates.

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

This paper was accepted at the "Human in the Loop Learning Workshop" at NeurIPS 2022. Specifying reward functions for Reinforcement Learning is a challenging task, which Preference-Based Learning methods bypass by instead learning from preference labels on trajectory queries. These methods, however, still require large numbers of preference labels and often achieve low reward recovery. We present…

Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

This paper was accepted at the "Human-in-the-Loop Learning Workshop" at NeurIPS 2022. Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment…