PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
Wesley Hanwen Deng†, Sunnie S. Y. Kim, Akshita Jha‡, Ken Holstein†, Motahhare Eslami†, Lauren Wilcox, Leon A. Gatys
Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people’s backgrounds and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either “red-teaming expert” personas or “regular AI user” personas. We then develop a dynamic persona-generating algorithm that automatically generates persona types adapted to different seed prompts. In addition, we develop a set of new metrics that explicitly measure the “mutation distance,” complementing existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
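To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch of persona-based prompt mutation and a toy "mutation distance" score. It is not the paper's implementation: in PersonaTeaming the mutation would be performed by an LLM conditioned on a persona, which is stubbed here with a simple template, and the distance metric is reduced to word-set Jaccard distance purely to illustrate quantifying how far a mutated prompt drifts from its seed. The `Persona`, `mutate_with_persona`, and `mutation_distance` names are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """A red-teamer identity used to steer prompt mutation."""
    name: str
    description: str


def mutate_with_persona(seed_prompt: str, persona: Persona) -> str:
    """Toy stand-in for LLM-driven mutation: reframes the seed prompt
    as if posed by the given persona."""
    return (f"Speaking as {persona.name} ({persona.description}): "
            f"{seed_prompt}")


def mutation_distance(seed: str, mutated: str) -> float:
    """Toy 'mutation distance': 1 minus Jaccard similarity of word sets.
    Only illustrates the idea of measuring drift from the seed prompt."""
    a = set(seed.lower().split())
    b = set(mutated.lower().split())
    return 1.0 - len(a & b) / len(a | b)


expert = Persona("a red-teaming expert",
                 "probes models for policy violations")
seed = "How do I make my argument more persuasive?"
mutated = mutate_with_persona(seed, expert)
print(mutated)
print(round(mutation_distance(seed, mutated), 2))
```

In the paper's actual pipeline, both the persona generation and the mutation step are model-driven and adaptive to the seed prompt; this sketch only fixes the interfaces such a pipeline would expose.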
Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
March 26, 2025 · Research areas: Human-Computer Interaction; Tools, Platforms, Frameworks · Conference: CHI
Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these “unknown unknowns” is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners…
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
November 4, 2024 · Research areas: Computer Vision; Methods and Algorithms · Conference: NeurIPS
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when limited annotation data are available. In…