Trace Length is a Simple Uncertainty Signal in Reasoning Models
Authors: Siddhartha Devic†, Charlotte Peale‡, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota
Uncertainty quantification for LLMs is a key research direction toward addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs comparably to, and complements, other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work showing that post-training causes traces to grow longer in general (e.g., “overthinking”). We investigate the mechanisms behind trace length’s performance as a confidence signal, observing that the effect persists even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or “forking” tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
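To make the core signal concrete, the sketch below is our own illustration (not the authors' code) of how trace length can be used as a zero-shot confidence score. It treats negative trace length, in tokens, as a per-sample confidence value and measures how well it separates correct from incorrect answers via AUROC, a standard way to compare confidence estimators. It assumes that shorter traces tend to accompany correct answers; the `Sample` structure and the toy data are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's implementation): trace length as a
# zero-shot confidence signal, scored by AUROC against answer correctness.
from dataclasses import dataclass
from typing import List

from sklearn.metrics import roc_auc_score


@dataclass
class Sample:
    trace_tokens: int  # number of tokens in the model's reasoning trace
    correct: bool      # whether the final answer was correct


def trace_length_confidence(samples: List[Sample]) -> List[float]:
    # Assumption: shorter traces signal higher confidence, so we use the
    # negative token count as the confidence score.
    return [-float(s.trace_tokens) for s in samples]


def auroc_of_confidence(samples: List[Sample]) -> float:
    # AUROC of the confidence score at predicting correctness:
    # 0.5 means no signal, 1.0 means perfect separation.
    labels = [int(s.correct) for s in samples]
    scores = trace_length_confidence(samples)
    return roc_auc_score(labels, scores)


if __name__ == "__main__":
    # Toy, hypothetical data: correct answers come with shorter traces.
    toy = [
        Sample(300, True), Sample(450, True), Sample(1200, False),
        Sample(500, True), Sample(2000, False), Sample(900, False),
    ]
    print(f"trace-length AUROC: {auroc_of_confidence(toy):.2f}")
```

In practice the same AUROC computation can be run side by side with other zero-shot estimators (e.g., verbalized confidence) on the same samples, which is the kind of comparison the abstract describes.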