LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss
Authors: Pau Rodríguez‡, Michal Klein, Eleonora Gualdoni, Valentino Maiorca†, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau‡
The growing use of generative models in daily life calls for efficient mechanisms to control their generation, e.g., to produce safe content or to give users tools for exploring style changes. Ideally, such mechanisms should require a low volume of unpaired data (i.e., without explicit preference annotations) and should be cheap at both train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, without accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. In this work we propose linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which automatically carry out neuron selection. LinEAS requires only a handful of unpaired samples to be effective, and it beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic, and we find empirically that it outperforms existing activation-steering methods at mitigating and including new concepts in the output of single-step text-to-image generation models.
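To make the recipe concrete, here is a minimal, self-contained sketch of end-to-end activation steering as the abstract describes it. Everything in it is an illustrative assumption rather than the paper's implementation: the names (AffineSteer, per_neuron_w2, forward_all), the frozen three-block toy model standing in for a transformer, the choice of a per-neuron sorted (1-D Wasserstein) loss as the distributional objective, and all hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class AffineSteer(nn.Module):
    """Hypothetical per-neuron affine map exp(log_a) * x + b.

    log_a = b = 0 gives the identity, so training starts from a no-op."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.exp(self.log_a) * x + self.b

def per_neuron_w2(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Empirical per-neuron squared 1-D Wasserstein distance.

    Sorting each coordinate of two equal-size batches yields the optimal
    1-D coupling, so matching sorted values compares whole distributions,
    not just their means."""
    return ((x.sort(dim=0).values - y.sort(dim=0).values) ** 2).mean()

# A frozen toy "model": three blocks standing in for transformer layers.
dim, n = 32, 256
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(3)]
)
blocks.requires_grad_(False)  # the base model is never updated

# One steering map after each block; only these parameters are trained.
steers = nn.ModuleList([AffineSteer(dim) for _ in blocks])

def forward_all(x: torch.Tensor, steer: bool) -> list[torch.Tensor]:
    """Run the stack, optionally steering, and collect every layer's activations."""
    acts = []
    for blk, st in zip(blocks, steers):
        x = blk(x)
        if steer:
            x = st(x)
        acts.append(x)
    return acts

# Toy stand-ins for activations induced by unpaired prompt sets.
src = torch.randn(n, dim)        # e.g., inputs from toxic prompts
tgt = torch.randn(n, dim) + 1.0  # e.g., inputs from non-toxic prompts

opt = torch.optim.Adam(steers.parameters(), lr=1e-2)
lam = 1e-3  # sparsity weight; larger values leave more neurons untouched
for step in range(200):
    opt.zero_grad()
    steered = forward_all(src, steer=True)  # source passes through the maps
    with torch.no_grad():
        ref = forward_all(tgt, steer=False)  # target defines the references
    # Global loss: distributional terms at *all* layers simultaneously, plus
    # an L1 penalty shrinking each map toward the identity (neuron selection).
    loss = sum(per_neuron_w2(s, r) for s, r in zip(steered, ref))
    loss = loss + lam * sum(
        (st.log_a.exp() - 1).abs().sum() + st.b.abs().sum() for st in steers
    )
    loss.backward()
    opt.step()
```

The end-to-end ingredient is visible in the training loop: the map after layer 1 changes the activations that layers 2 and 3 receive, and the summed loss holds it accountable for those downstream shifts, whereas a purely local method would fit each map against its own layer's statistics in isolation. The L1 term pulls each map back to the identity, so neurons that do not help match the target distribution are left untouched.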
Related readings

ExpertLens: Activation Steering Features Are Highly Interpretable
November 7, 2025 · Research areas: Methods and Algorithms; Speech and Natural Language Processing · Workshop at NeurIPS
This paper was accepted at the Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2025.
Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific…
Controlling Language and Diffusion Models by Transporting Activations
April 10, 2025 · Research areas: Computer Vision; Methods and Algorithms; Speech and Natural Language Processing · Conference: ICLR
Large generative models are becoming increasingly capable and more widely deployed to power production applications, but getting these models to produce exactly what’s desired can still be challenging. Fine-grained control over these models’ outputs is important to meet user expectations and to mitigate potential misuses, ensuring the models’ reliability and safety. To address these issues, Apple machine learning researchers have developed a new…