Sparse Autoencoders Are Capable LLM Jailbreak Mitigators
Authors: Yannick Assogba, Jacopo Cortellazzi, Javier Abad†**, Pau Rodriguez, Xavier Suau, Arno Blaas
Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), a defense based on sparse autoencoders (SAEs) that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety–utility tradeoffs than baseline defenses operating in dense latent space. Our method clearly outperforms dense mean-shift steering on all four models, most notably against out-of-distribution attacks, which shows that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
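The two steps named in the abstract, feature selection by statistical testing and mean-shift steering in SAE latent space, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' implementation: the paired t statistic stands in for the unspecified statistical test, the threshold `t_threshold` and strength `alpha` are hypothetical parameters, the clamp at zero assumes a ReLU-style SAE with non-negative latents, and the toy vectors stand in for SAE encodings of real model activations.

```python
import math
from statistics import mean, stdev


def paired_t_stat(diffs):
    """t statistic for the hypothesis that the mean paired difference is zero."""
    sd = stdev(diffs)
    if sd == 0.0:
        # All differences identical: no evidence if they are all zero,
        # otherwise treat the shift as maximally consistent.
        m = mean(diffs)
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def select_features(plain, jailbreak, t_threshold=4.0):
    """Return {feature_index: mean_delta} for features that shift under jailbreak context.

    plain[i] and jailbreak[i] are SAE latent vectors for the same harmful
    request without and with jailbreak context (assumed pairing).
    """
    selected = {}
    for j in range(len(plain[0])):
        diffs = [jb[j] - p[j] for p, jb in zip(plain, jailbreak)]
        if abs(paired_t_stat(diffs)) > t_threshold:
            selected[j] = mean(diffs)
    return selected


def steer(latent, selected, alpha=1.0):
    """Shift selected features back by alpha * mean delta, clamped at zero."""
    steered = list(latent)
    for j, delta in selected.items():
        steered[j] = max(0.0, steered[j] - alpha * delta)
    return steered


# Toy data: only feature 0 responds to the jailbreak context.
plain = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.2], [0.0, 1.1, 0.2], [0.1, 1.0, 0.2]]
jailbreak = [[2.0, 1.0, 0.2], [2.2, 0.9, 0.2], [1.9, 1.1, 0.2], [2.1, 1.0, 0.2]]

selected = select_features(plain, jailbreak)
steered = steer([2.0, 1.0, 0.2], selected)
print(selected, steered)
```

In a full pipeline the steered latents would be decoded back through the SAE and written into the model's activations at inference time; that step is omitted here because it depends on the specific model and SAE.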
DSO: Direct Steering Optimization for Bias Mitigation
April 29, 2026; research areas: Fairness, Methods and Algorithms; conference: CVPR
Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have…
STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants
November 8, 2023; research area: Speech and Natural Language Processing; conference: EMNLP
*Equal Contributors
In the context of a voice assistant system, steering refers to the phenomenon in which a user issues a follow-up command attempting to direct or clarify a previous turn. We propose STEER, a steering detection model that predicts whether a follow-up turn is a user’s attempt to steer the previous command. Constructing a training dataset for steering use cases poses challenges due to the cold-start problem. To overcome this, we…