
This paper was accepted at the Safe Generative AI Workshop (SGAIW) 2024 at NeurIPS 2024.

Uncertainty quantification (UQ) is crucial for ensuring the safe deployment of large language models, particularly in high-stakes applications where hallucinations can be harmful. However, existing UQ methods often demand substantial computational resources: multi-sample methods such as Semantic Entropy (Kuhn et al., 2023) typically require 5-10 inference calls, and probing-based methods require additional datasets for training. This raises a key question: how can we balance UQ performance with computational efficiency? In this work, we first analyze the performance and efficiency of various UQ methods across 6 datasets × 6 models × 2 prompt strategies. Our findings reveal that: 1) multi-sample methods generally perform only marginally better than single-sample methods, i.e., an AUROC gain of ≤ 0.02 in over 65% of settings, despite significantly higher inference costs; 2) probing-based methods perform well primarily on mathematical reasoning and truthfulness benchmarks, while multi-sample methods show a clear advantage only on knowledge-seeking tasks. These findings suggest that the high computational cost does not translate into significant performance gains. Despite their similar overall performance, we observe only moderate correlations between different UQ methods, suggesting that they may capture different uncertainty signals. This motivates us to explore combining different methods to harness their complementary strengths at lower computational cost. Our experiments demonstrate that a simple combination of single-sample features can match or even outperform the best existing methods. These results point to a promising direction for developing cost-effective uncertainty estimators.
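
The combination result described above can be illustrated with a minimal sketch: extract a few features from a single generation's token log-probabilities and feed them to a lightweight learned combiner, scoring the output against correctness labels with AUROC. The specific features (mean and worst-case token surprisal, sequence perplexity) and the logistic-regression combiner are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch (not the paper's exact recipe) of combining single-sample
# uncertainty features with a lightweight learned combiner.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def single_sample_features(token_logprobs: np.ndarray) -> np.ndarray:
    """Features computed from one generation's per-token log-probabilities."""
    return np.array([
        -token_logprobs.mean(),          # average negative log-likelihood
        -token_logprobs.min(),           # worst-case token surprisal
        np.exp(-token_logprobs.mean()),  # sequence perplexity
    ])

# Toy data standing in for per-response token log-probs and correctness labels.
rng = np.random.default_rng(0)
responses = [rng.normal(loc=-1.0, scale=0.5, size=rng.integers(5, 40))
             for _ in range(200)]
labels = rng.integers(0, 2, size=200)  # 1 = response judged correct

X = np.stack([single_sample_features(lp) for lp in responses])
combiner = LogisticRegression().fit(X, labels)
scores = combiner.predict_proba(X)[:, 1]
print("AUROC (toy data):", roc_auc_score(labels, scores))
```

Because the features come from a single forward pass, the combiner adds essentially no inference cost beyond standard decoding, unlike multi-sample estimators that require several generations per query.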

Related readings and updates.

Posterior Uncertainty Quantification in Neural Networks using Data Augmentation

In this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations…
See paper details

SapAugment: Learning A Sample Adaptive Policy for Data Augmentation

Data augmentation methods usually apply the same augmentation (or a mix of them) to all the training samples. For example, to perturb data with noise, the noise is sampled from a Normal distribution with the same fixed standard deviation for all samples. We hypothesize that a hard sample with high training loss already provides a strong training signal to update the model parameters and should be perturbed with mild or no augmentation. Perturbing a hard…
See paper details
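
To make the sample-adaptive idea from the SapAugment abstract concrete, here is a minimal sketch under illustrative assumptions: per-sample training losses are mapped to noise scales so that harder samples (higher loss) receive milder perturbation. The rank-based linear schedule and the helper name adaptive_noise_std are hypothetical, not the method's actual learned policy.

```python
# A minimal sketch (assumptions, not the SapAugment algorithm itself) of
# loss-adaptive augmentation: high-loss samples get little or no noise,
# easy samples get stronger perturbation.
import numpy as np

def adaptive_noise_std(losses: np.ndarray, max_std: float = 0.1) -> np.ndarray:
    """Map per-sample losses to noise standard deviations: higher loss -> smaller std."""
    ranks = losses.argsort().argsort() / max(len(losses) - 1, 1)  # 0 = easiest sample
    return max_std * (1.0 - ranks)

# Toy batch: easy samples (low loss) are perturbed more than hard ones.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))
losses = np.array([0.2, 1.5, 0.7, 3.0])
stds = adaptive_noise_std(losses)
augmented = batch + rng.normal(size=batch.shape) * stds[:, None]
print(stds)  # hardest sample (loss 3.0) gets std 0.0 under this toy schedule
```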