To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models
Authors: Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, Etai Littwin
State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by a fixed-size memory and computational complexity that scales linearly in sequence length. We begin this work with a simple theoretical result: SSMs cannot accurately solve any “truly long-form” generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by granting SSMs interactive access to external tools. In fact, given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical findings, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potentially efficient alternative to Transformers in interactive tool-based and agentic settings.
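To make the contrast concrete, below is a minimal sketch, not the paper's construction: the class FixedStateSSM, the function add_with_external_tool, and the external "tape" are hypothetical illustrations. The toy SSM carries only a fixed-size hidden state no matter how long the input is, while the tool-augmented loop offloads all unbounded intermediate state (the partial result of a long addition) to an external scratchpad, so the model-side memory stays constant for arbitrarily long problems.

# Illustrative sketch only; names and interfaces are hypothetical.
import numpy as np
from itertools import zip_longest

class FixedStateSSM:
    """Toy linear SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Memory is O(d) regardless of sequence length."""
    def __init__(self, d: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = 0.9 * np.eye(d) + 0.01 * rng.standard_normal((d, d))
        self.B = rng.standard_normal((d, 1)) / np.sqrt(d)
        self.C = rng.standard_normal((1, d)) / np.sqrt(d)
        self.h = np.zeros((d, 1))

    def step(self, x: float) -> float:
        # The state update never grows: this fixed-size bottleneck is what
        # bounds what an SSM can carry across a "truly long-form" generation.
        self.h = self.A @ self.h + self.B * x
        return float((self.C @ self.h)[0, 0])

def add_with_external_tool(a_digits, b_digits):
    """Tool-augmented long addition (hypothetical tool interface): the model
    only ever tracks one carry digit; the unbounded partial result lives on
    an external tape, so the procedure works at any input length."""
    tape = []   # external scratchpad (the "tool")
    carry = 0   # the only state the fixed-memory model must hold
    for da, db in zip_longest(reversed(a_digits), reversed(b_digits), fillvalue=0):
        s = da + db + carry
        tape.append(s % 10)  # offload each output digit to the tool
        carry = s // 10
    if carry:
        tape.append(carry)
    return tape[::-1]

ssm = FixedStateSSM(d=16)
for _ in range(10_000):
    ssm.step(1.0)
print(ssm.h.shape)  # (16, 1): state size is unchanged after 10,000 steps

print(add_with_external_tool([1, 2, 3, 4], [8, 7, 6, 6]))  # [1, 0, 0, 0, 0]

Run as-is, the demo shows both regimes: the SSM's state shape is the same after 10,000 steps as after one, and the digit-by-digit addition (1234 + 8766 = 10000) succeeds with O(1) model-side memory because the tool absorbs the growing output.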