Symbol Guided Hindsight Priors for Reward Learning from Human Preferences
In collaboration with Arizona State University
Authors: Mudit Verma, Katherine Metcalf
This paper was accepted at the "Human in the Loop Learning Workshop" at NeurIPS 2022.
Specifying a reward function for reinforcement learning is a challenging task, one that preference-based learning methods sidestep by instead learning from preference labels on trajectory queries. These methods, however, still require impractically many preference labels and often achieve poor reward recovery. We present the PRIOR framework, which addresses both the impractical number of queries to humans and poor reward recovery by computing priors over the reward function from the environment dynamics and a surrogate preference classification model. We find that imposing these priors as soft constraints significantly reduces the number of queries made to the human in the loop and improves overall reward recovery. Additionally, we investigate computing these priors over an abstract state space to further improve the agent's performance.
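To make the setup concrete, the following is a minimal sketch of preference-based reward learning with a prior imposed as a soft constraint. It uses the standard Bradley-Terry preference model; the specific function names (`preference_loss`, `prior_penalty`) and the squared-error form of the penalty are illustrative assumptions, not the paper's actual formulation of PRIOR.

```python
import numpy as np

def trajectory_return(reward_fn, traj):
    """Sum of per-state rewards along a trajectory."""
    return sum(reward_fn(s) for s in traj)

def preference_loss(reward_fn, traj_a, traj_b, label):
    """Bradley-Terry cross-entropy: label=1 means traj_a was preferred.

    P(a > b) = exp(R(a)) / (exp(R(a)) + exp(R(b)))
    """
    ra = trajectory_return(reward_fn, traj_a)
    rb = trajectory_return(reward_fn, traj_b)
    p_a = 1.0 / (1.0 + np.exp(rb - ra))  # numerically stable logistic
    return -(label * np.log(p_a) + (1 - label) * np.log(1.0 - p_a))

def prior_penalty(reward_fn, states, prior_fn, weight=0.1):
    """Soft constraint pulling the learned reward toward prior estimates.

    `prior_fn` stands in for a reward prior computed from environment
    dynamics or a surrogate preference classifier (assumption: a
    squared-error penalty; the actual constraint form may differ).
    """
    return weight * np.mean([(reward_fn(s) - prior_fn(s)) ** 2 for s in states])

def total_loss(reward_fn, traj_a, traj_b, label, prior_fn):
    """Preference loss regularized by the prior over all visited states."""
    states = list(traj_a) + list(traj_b)
    return (preference_loss(reward_fn, traj_a, traj_b, label)
            + prior_penalty(reward_fn, states, prior_fn))
```

In practice the reward function would be a trainable model (e.g. a neural network) minimized over batches of labeled trajectory pairs; the key point is simply that the prior enters as an additive penalty rather than a hard constraint, so preference labels can still override it.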