GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Authors: Silvia Sapora**, Devon Hjelm, Alexander Toshev, Omar Attia, Bogdan Mazoure

Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield “black-box” models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
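To make the idea of searching over code-based rewards concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. Everything in it is an assumption made for illustration: the state features (`at_goal`, `holding_key`), the hypothetical helper names (`make_reward`, `fitness`, `mutate`, `evolve`), the use of hand-written non-expert trajectories as negatives, and above all the random `mutate` step, which stands in for the LLM that would propose revised reward programs in GRACE.

```python
import random
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Trajectory = List[State]
# A candidate reward is kept as plain, inspectable parameters:
# (feature name, target value) means "reward 1.0 when state[feature] == target".
Candidate = Tuple[str, int]


def make_reward(candidate: Candidate) -> Callable[[State], float]:
    """Turn a candidate into an executable reward function."""
    feature, target = candidate
    return lambda state: 1.0 if state.get(feature) == target else 0.0


def fitness(candidate: Candidate, expert: List[Trajectory],
            non_expert: List[Trajectory]) -> float:
    """Fraction of (expert, non-expert) pairs where the expert return is higher."""
    reward = make_reward(candidate)
    ret = lambda traj: sum(reward(s) for s in traj)
    wins = sum(ret(e) > ret(n) for e in expert for n in non_expert)
    return wins / (len(expert) * len(non_expert))


def mutate(candidate: Candidate, rng: random.Random,
           features: List[str], values: List[int]) -> Candidate:
    """Random edit of a candidate. ASSUMPTION: stands in for an LLM proposal."""
    feature, target = candidate
    if rng.random() < 0.5:
        feature = rng.choice(features)
    else:
        target = rng.choice(values)
    return (feature, target)


def evolve(expert: List[Trajectory], non_expert: List[Trajectory],
           generations: int = 30, population_size: int = 8,
           seed: int = 0) -> Candidate:
    """Simple elitist evolutionary search over candidate rewards."""
    rng = random.Random(seed)
    trajectories = expert + non_expert
    features = sorted({k for t in trajectories for s in t for k in s})
    values = sorted({v for t in trajectories for s in t for v in s.values()})
    population = [(rng.choice(features), rng.choice(values))
                  for _ in range(population_size)]
    for _ in range(generations):
        children = [mutate(c, rng, features, values) for c in population]
        pool = population + children
        # Keep the candidates that best separate expert from non-expert data.
        pool.sort(key=lambda c: fitness(c, expert, non_expert), reverse=True)
        population = pool[:population_size]
    return population[0]


# Toy data (hypothetical): the expert picks up a key and reaches the goal.
expert = [[{"at_goal": 0, "holding_key": 0},
           {"at_goal": 0, "holding_key": 1},
           {"at_goal": 1, "holding_key": 1}]]
# Non-expert runs: one does nothing, one picks up the key but never arrives.
non_expert = [[{"at_goal": 0, "holding_key": 0}] * 3,
              [{"at_goal": 0, "holding_key": 0},
               {"at_goal": 0, "holding_key": 1},
               {"at_goal": 0, "holding_key": 1}]]

best = evolve(expert, non_expert)
print(best, fitness(best, expert, non_expert))
```

On this toy data, only the candidate that rewards `at_goal == 1` ranks the expert above both non-expert runs, so the search should settle on it. Because the winning candidate is a readable (feature, value) pair, it can be inspected directly, which is the property the abstract emphasizes for code-based rewards.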

** Work done while at Apple

Related readings and updates.

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

October 9, 2024. Research areas: Methods and Algorithms; Speech and Natural Language Processing. Conference: EMNLP.

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…

Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

November 28, 2022. Research areas: Human-Computer Interaction; Methods and Algorithms. Workshop at NeurIPS.

This paper was accepted at the Human-in-the-Loop Learning Workshop at NeurIPS 2022.

Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment…
