Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Authors: Arduin Findeis*†, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter
Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. Such data can provide a feedback signal in domains where traditional hard-coded metrics are difficult to obtain (e.g., the quality of a chat interaction), thereby helping to measure model progress or to fine-tune models (e.g., via reinforcement learning from human feedback, RLHF). However, in some domains it can be difficult to obtain high-quality pairwise comparisons from either humans or AI. For example, long-form responses containing many (possibly false) factual statements or complex (possibly incorrect) code pose significant challenges for both AI and human annotators. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance in three challenging domains: long-form factual, math, and code tasks. We propose a tool-using agentic system that augments existing annotators to provide higher-quality feedback in these domains. Our system uses web search and code execution to ground its annotations in external validation, independent of the LLM's internal biases. We provide extensive experimental results evaluating our method across the three task domains, as well as on out-of-domain tasks based on RewardBench subsets, where we aim to avoid performance regressions. We share all code to replicate the experiments as an open-source package.
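To make the high-level pipeline concrete, the sketch below shows one way a tool-augmented pairwise judge could be wired up. It is a minimal illustration, not the authors' implementation: `call_llm`, `web_search`, and `run_code` are hypothetical placeholders standing in for an LLM client, a search backend, and a sandboxed code executor, and the plan format is an assumption.

```python
# Hypothetical sketch of a tool-augmented pairwise judge.
# `call_llm`, `web_search`, and `run_code` are placeholders, not real APIs.

from dataclasses import dataclass


@dataclass
class Comparison:
    prompt: str
    response_a: str
    response_b: str


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real client."""
    raise NotImplementedError


def web_search(query: str) -> str:
    """Placeholder for a web-search tool used to fact-check claims."""
    raise NotImplementedError


def run_code(snippet: str) -> str:
    """Placeholder for sandboxed code execution used to validate code."""
    raise NotImplementedError


def judge(task: Comparison) -> str:
    """Return 'A' or 'B', grounding the decision in external validation."""
    # 1. Ask the LLM which external checks would help (search queries to
    #    fact-check claims, code snippets to execute), rather than relying
    #    on its internal knowledge alone.
    plan = call_llm(
        "List checks as lines starting with 'SEARCH:' or 'RUN:' that would "
        f"help verify these responses.\nPrompt: {task.prompt}\n"
        f"A: {task.response_a}\nB: {task.response_b}"
    )

    # 2. Run the requested checks and collect the resulting evidence.
    evidence = []
    for line in plan.splitlines():
        if line.startswith("SEARCH:"):
            evidence.append(web_search(line.removeprefix("SEARCH:").strip()))
        elif line.startswith("RUN:"):
            evidence.append(run_code(line.removeprefix("RUN:").strip()))

    # 3. Make the final pairwise judgement conditioned on the evidence.
    verdict = call_llm(
        "Given the evidence below, answer 'A' or 'B' for the better "
        f"response.\nEvidence: {evidence}\nPrompt: {task.prompt}\n"
        f"A: {task.response_a}\nB: {task.response_b}"
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

Under these assumptions, the judge only commits to a preference after the fact-checking and code-execution results are available, which is the grounding step the abstract describes.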
May 22, 2024 | Research area: Data Science and Annotation | Conference: ACL