
This paper was accepted at the Safe Generative AI Workshop (SGAIW) 2024 at NeurIPS 2024.

Uncertainty quantification (UQ) is crucial for ensuring the safe deployment of large language models, particularly in high-stakes applications where hallucinations can be harmful. However, existing UQ methods often demand substantial computational resources: multi-sample methods such as Semantic Entropy (Kuhn et al., 2023) typically require 5-10 inference calls, and probing-based methods require additional datasets for training. This raises a key question: how can we balance UQ performance with computational efficiency? In this work, we first analyze the performance and efficiency of various UQ methods across 6 datasets × 6 models × 2 prompt strategies. Our findings reveal that: 1) multi-sample methods generally perform only marginally better than single-sample methods, i.e., an AUROC gain of ≤ 0.02 in over 65% of settings, despite significantly higher inference costs; 2) probing-based methods perform well primarily on mathematical reasoning and truthfulness benchmarks, while multi-sample methods show a clear advantage only on knowledge-seeking tasks. These findings suggest that the high computational cost does not translate into significant performance gains. Despite their similar overall performance, we observe only moderate correlations between different UQ methods, suggesting that they may capture different uncertainty signals. This motivates us to explore combining different methods to harness their complementary strengths at lower computational cost. Our experiments demonstrate that a simple combination of single-sample features can match or even outperform the best existing methods. These results point to a promising direction for developing cost-effective uncertainty estimators.
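
The combination result described above can be illustrated with a minimal sketch: extract a few features from a single generation's token log-probabilities and feed them to a lightweight learned combiner, scoring the output against correctness labels with AUROC. The specific features (mean and worst-case token surprisal, sequence perplexity) and the logistic-regression combiner are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch (not the paper's exact recipe) of combining single-sample
# uncertainty features with a lightweight learned combiner.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def single_sample_features(token_logprobs: np.ndarray) -> np.ndarray:
    """Features computed from one generation's per-token log-probabilities."""
    return np.array([
        -token_logprobs.mean(),          # average negative log-likelihood
        -token_logprobs.min(),           # worst-case token surprisal
        np.exp(-token_logprobs.mean()),  # sequence perplexity
    ])

# Toy data standing in for per-response token log-probs and correctness labels.
rng = np.random.default_rng(0)
responses = [rng.normal(loc=-1.0, scale=0.5, size=rng.integers(5, 40))
             for _ in range(200)]
labels = rng.integers(0, 2, size=200)  # 1 = response judged correct

X = np.stack([single_sample_features(lp) for lp in responses])
combiner = LogisticRegression().fit(X, labels)
scores = combiner.predict_proba(X)[:, 1]
print("AUROC (toy data):", roc_auc_score(labels, scores))
```

Because the features come from a single forward pass, the combiner adds essentially no inference cost beyond standard decoding, unlike multi-sample estimators that require several generations per query.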

Related readings and updates.

Posterior Uncertainty Quantification in Neural Networks using Data Augmentation

In this paper, we approach the problem of uncertainty quantification in deep learning through a predictive framework, which captures uncertainty in model parameters by specifying our assumptions about the predictive distribution of unseen future data. Under this view, we show that deep ensembling (Lakshminarayanan et al., 2017) is a fundamentally mis-specified model class, since it assumes that future data are supported on existing observations…
See paper details

SapAugment: Learning A Sample Adaptive Policy for Data Augmentation

Data augmentation methods usually apply the same augmentation (or a mix of them) to all the training samples. For example, to perturb data with noise, the noise is sampled from a Normal distribution with the same fixed standard deviation for all samples. We hypothesize that a hard sample with high training loss already provides a strong training signal to update the model parameters and should be perturbed with mild or no augmentation. Perturbing a hard…
See paper details
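
To make the sample-adaptive idea from the SapAugment abstract concrete, here is a minimal sketch under illustrative assumptions: per-sample training losses are mapped to noise scales so that harder samples (higher loss) receive milder perturbation. The rank-based linear schedule and the helper name adaptive_noise_std are hypothetical, not the method's actual learned policy.

```python
# A minimal sketch (assumptions, not the SapAugment algorithm itself) of
# loss-adaptive augmentation: high-loss samples get little or no noise,
# easy samples get stronger perturbation.
import numpy as np

def adaptive_noise_std(losses: np.ndarray, max_std: float = 0.1) -> np.ndarray:
    """Map per-sample losses to noise standard deviations: higher loss -> smaller std."""
    ranks = losses.argsort().argsort() / max(len(losses) - 1, 1)  # 0 = easiest sample
    return max_std * (1.0 - ranks)

# Toy batch: easy samples (low loss) are perturbed more than hard ones.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))
losses = np.array([0.2, 1.5, 0.7, 3.0])
stds = adaptive_noise_std(losses)
augmented = batch + rng.normal(size=batch.shape) * stds[:, None]
print(stds)  # hardest sample (loss 3.0) gets std 0.0 under this toy schedule
```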