
Recent advancements in long-context language models (LCLMs) have the potential to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their extended context windows, LCLMs can process entire knowledge bases and directly handle retrieval and reasoning, a capability we define as In-Context Retrieval and Reasoning (ICR2). However, existing benchmarks like LOFT often overestimate LCLM performance because they lack sufficiently challenging contexts. To address this, we introduce ICR2, a benchmark designed for more realistic evaluation and training of LCLMs. This dataset simulates practical scenarios by including confounding documents retrieved using strong retrievers. Additionally, we propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) explicit modeling of a retrieval head trained jointly with the generation head, and (3) retrieval-attention-probing decoding, which uses attention heads to filter and refine long contexts. Through extensive benchmarking of four well-known LCLMs on LOFT and ICR2, we show that our best approach, applied to Mistral-7B, achieves significant improvements: +17 and +15 points on LOFT, and +13 and +2 points on ICR2, compared to zero-shot RAG and in-domain supervised fine-tuned models, respectively. It even outperforms GPT-4 on most tasks, despite its much smaller size.
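To make the attention-probing idea concrete, the sketch below scores each candidate document by how much attention the final query token pays to its span and keeps only the top-scoring documents for a subsequent generation pass. It is a minimal illustration assuming a Hugging Face causal LM; the checkpoint name, the choice of upper-layer heads as the probe, and the `probe_and_filter` helper are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: use attention weights to filter a long context before generation.
# Assumptions (not the paper's exact method): which layers/heads to probe, the
# top-k selection rule, and the Mistral-7B checkpoint name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# "eager" attention so per-head attention weights are returned by the forward pass.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()


def probe_and_filter(question: str, documents: list[str], top_k: int = 4) -> list[str]:
    """Score each document by the attention the final question token pays to its span."""
    doc_spans, pieces = [], []
    for doc in documents:
        ids = tokenizer(doc + "\n", add_special_tokens=False)["input_ids"]
        start = sum(len(p) for p in pieces)
        pieces.append(ids)
        doc_spans.append((start, start + len(ids)))
    q_ids = tokenizer("\nQuestion: " + question, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([sum(pieces, []) + q_ids]).to(model.device)

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # Average attention from the last (query) token over the upper layers' heads
    # (an assumed probe location).
    attn = torch.stack(out.attentions[-8:])                 # (layers, batch, heads, seq, seq)
    last_tok_attn = attn[:, 0, :, -1, :].mean(dim=(0, 1))   # (seq,)

    # Sum the probe's attention mass over each document's token span.
    scores = [last_tok_attn[s:e].sum().item() for (s, e) in doc_spans]
    keep = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [documents[i] for i in sorted(keep)]  # preserve original order for the final prompt
```

The retained documents would then be re-packed into a much shorter prompt for the final answer-generation pass, which is what lets the probe act as a lightweight in-context retriever.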
