Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Authors: Shuangfei Zhai*, Tatiana Likhomanenko*, Etai Littwin*, Dan Busbridge*, Jason Ramapuram*, Yizhe Zhang, Jiatao Gu, Josh M. Susskind.
* Equal Contributors
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We refer to this pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose sigmaReparam, a simple and efficient solution in which we reparameterize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with sigmaReparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across Transformer architectures. We show that sigmaReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup and adaptive optimizers.
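For concreteness, below is a minimal sketch of the reparameterization described in the abstract, assuming a PyTorch-style implementation: each linear layer's weight W is rescaled to (gamma / sigma(W)) * W, where sigma(W) is the spectral norm estimated with power iteration and gamma is a learned scalar. The class name `SigmaReparamLinear`, the helper `attention_entropy`, and the initialization choices are illustrative assumptions, not the authors' reference code.

```python
# Illustrative sketch of a sigmaReparam-style linear layer; details such as
# initialization and the number of power-iteration steps are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    """Linear layer whose effective weight is (gamma / sigma(W)) * W."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        # Learned scalar controlling the spectral norm of the effective weight.
        self.gamma = nn.Parameter(torch.ones(1))
        # Power-iteration vector, kept as a buffer (not a trainable parameter).
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    @torch.no_grad()
    def _power_iteration(self, n_steps: int = 1):
        """Run power-iteration steps to track the leading singular vectors of W."""
        u = self.u
        for _ in range(n_steps):
            v = F.normalize(self.weight.t() @ u, dim=0)
            u = F.normalize(self.weight @ v, dim=0)
        self.u.copy_(u)
        return u, v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            u, v = self._power_iteration()
        else:
            # At inference, reuse the stored vector without updating it.
            with torch.no_grad():
                v = F.normalize(self.weight.t() @ self.u, dim=0)
            u = self.u
        # Spectral norm estimate sigma(W) ~= u^T W v; gradients flow through W.
        sigma = torch.einsum("i,ij,j->", u, self.weight, v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)


def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean Shannon entropy of attention rows; attn has shape (..., queries, keys)."""
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()
```

In this sketch, `SigmaReparamLinear` would simply stand in for the Transformer's attention and feed-forward projections, and `attention_entropy` can be logged per head during training to monitor for entropy collapse.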