Exploring Empty Spaces: Human-in-the-Loop Data Augmentation

Authors: Catherine Yeh, Donghao Ren, Yannick Assogba, Dominik Moritz, Fred Hohman

Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.

Figure 1: Given a dataset of unstructured text, it can be challenging to determine how and where to augment the data most effectively. We propose a visualization-based approach to help users find relevant empty data spaces to explore to improve dataset diversity. To fill in these empty spaces, metaphorically represented by gaps in an embedding plot, we design an interactive tool with three human-in-the-loop augmentation methods: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. Here, each dot represents an embedded sentence from the input dataset of CHI 2024 paper titles.
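To make the idea of "gaps in an embedding plot" concrete, the following is a minimal sketch, not Amplio's actual method. It assumes the sentence embeddings have already been projected to 2D points, lays a uniform grid over them, and reports the centers of cells that contain no points. The function name `find_empty_cells` and the `bins` parameter are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def find_empty_cells(points: List[Point], bins: int = 4) -> List[Point]:
    """Return the centers of grid cells that contain no points.

    Hypothetical helper: a uniform grid is an assumption made for this
    sketch, not the approach described in the paper.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to a span of 1.0 if all points share a coordinate.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    occupied = set()
    for x, y in points:
        # Clamp so the maximum coordinate lands in the last bin.
        i = min(int((x - x_min) / x_span * bins), bins - 1)
        j = min(int((y - y_min) / y_span * bins), bins - 1)
        occupied.add((i, j))

    empty = []
    for i in range(bins):
        for j in range(bins):
            if (i, j) not in occupied:
                # Center of the unoccupied cell in plot coordinates.
                cx = x_min + (i + 0.5) * x_span / bins
                cy = y_min + (j + 0.5) * y_span / bins
                empty.append((cx, cy))
    return empty


if __name__ == "__main__":
    projected = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    print(find_empty_cells(projected, bins=2))
```

In an interactive setting, the returned cell centers could serve as candidate regions for a person to inspect and decide whether they are worth filling; which gaps are actually relevant remains a human judgment.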

