Whispering Experts: Toxicity Mitigation in Pre-trained Language Models by Dampening Expert Neurons
Authors: Xavier Suau Cuadros, Pieter Delobelle, Rin Metcalf Susa, Armand Joulin, Nick Apostoloff, Luca Zappella, Pau Rodriguez Lopez
An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be identified by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AURA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. Because the intervention is proportional to each neuron's ability to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AURA achieves up to a 2.2× reduction in toxicity with only a 0.72 perplexity increase. We also show that AURA is effective across model scales (from 1.5B to 40B parameters), mitigating toxic language while preserving common-sense zero-shot abilities at every scale. AURA can be combined with pre-prompting strategies, boosting its average mitigation potential from 1.28× to 2.35×. Moreover, AURA can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
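To make the idea concrete, the sketch below illustrates AUROC-proportional dampening under stated assumptions: each neuron's per-sentence activation is scored against toxic/non-toxic labels with AUROC, and neurons with AUROC > 0.5 are scaled down by a gain of 2·(1 − AUROC) (identity otherwise). This captures the abstract's idea but may differ in detail from the published method; the function name `compute_dampening_gains` and the toy data are illustrative, not the authors' code.

```python
# Minimal sketch of AUROC-proportional dampening (illustrative, not the authors' implementation).
import numpy as np
from sklearn.metrics import roc_auc_score

def compute_dampening_gains(acts: np.ndarray, is_toxic: np.ndarray) -> np.ndarray:
    """Per-neuron gains in [0, 1] derived from each neuron's ability to
    separate toxic from non-toxic sentences.

    acts:     (num_sentences, num_neurons) per-sentence neuron activations
    is_toxic: (num_sentences,) binary labels, 1 = toxic
    """
    num_neurons = acts.shape[1]
    gains = np.ones(num_neurons)
    for j in range(num_neurons):
        auroc = roc_auc_score(is_toxic, acts[:, j])
        if auroc > 0.5:  # neuron discriminates toxic content ("expert" neuron)
            # Assumed scaling: AUROC = 0.5 -> gain 1 (untouched),
            # AUROC = 1.0 -> gain 0 (fully muted).
            gains[j] = 2.0 * (1.0 - auroc)
    return gains

# Toy usage with synthetic activations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, 16))
acts[:, 3] += 2.0 * labels  # neuron 3 fires more strongly on "toxic" sentences
gains = compute_dampening_gains(acts, labels)
print(gains[3], gains[0])   # neuron 3 is strongly dampened, neuron 0 is left near 1

# At inference time, these gains would multiply the corresponding neuron outputs
# inside the relevant layers, e.g. hidden = hidden * gains.
```

Because the gains come directly from the AUROC statistics, no threshold or strength hyperparameter has to be tuned per model, which is the property the abstract highlights.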