Learning Structured Reasoning via Tractable Trajectory Control
Authors: Po-Nien Kung†, Zhen Yang, Jeffrey Luo†, Cheng-Fu Yang†, Haikang Deng†, Zi-Yi Dou†**, Yinfei Yang**, Nanyun Peng†, Zhe Gan, Kai-Wei Chang†
Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., “wait,” indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision–language models on mathematical reasoning tasks.
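As a rough illustration of the power-scaled importance-sampling weight mentioned above, the sketch below assumes the scaling takes the form of raising the likelihood ratio between the target policy and the behavior policy to an exponent `alpha` in [0, 1]. The abstract does not give the exact formula, so this form, along with the names `power_scaled_weight`, `scaled_coefficients`, and `alpha`, is a hypothetical stand-in and not the authors' implementation.

```python
import math


def power_scaled_weight(logp_target, logp_behavior, alpha):
    """Return (pi_target / pi_behavior) ** alpha, computed in log space.

    Assumed form: alpha = 1 gives the ordinary importance-sampling ratio,
    alpha = 0 ignores the ratio entirely (every weight is 1).
    """
    return math.exp(alpha * (logp_target - logp_behavior))


def scaled_coefficients(logps_target, logps_behavior, advantages, alpha=0.5):
    """Multiply each trajectory's advantage by its power-scaled weight.

    Inputs are per-trajectory sequence log-probabilities under the target
    policy and under the guided behavior policy (hypothetical interface).
    """
    return [
        power_scaled_weight(lt, lb, alpha) * adv
        for lt, lb, adv in zip(logps_target, logps_behavior, advantages)
    ]


if __name__ == "__main__":
    # A guided trajectory that is unlikely under the current policy:
    # the log-ratio is -6, so the plain importance weight is exp(-6).
    plain = power_scaled_weight(-10.0, -4.0, alpha=1.0)
    softened = power_scaled_weight(-10.0, -4.0, alpha=0.5)
    print(f"alpha=1.0 weight: {plain:.5f}")
    print(f"alpha=0.5 weight: {softened:.5f}")
```

Under this assumed form, an exponent below 1 pulls the weight of an exploratory, low-probability trajectory toward 1, so it contributes more to the update than it would under the plain ratio, at the cost of no longer being an exact importance-sampling correction.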
Interleaved Reasoning for Large Language Models via Reinforcement Learning
May 28, 2025 | Research areas: Knowledge Bases and Search, Speech and Natural Language Processing
Long chain-of-thought (CoT) significantly enhances large language models’ (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved…
Bootleg: Self-Supervision for Named Entity Disambiguation
June 25, 2021 | Research areas: Knowledge Bases and Search, Speech and Natural Language Processing | Conference: CIDR
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly…