On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
Authors: Sarah Ball†, Greg Gluch‡, Shafi Goldwasser‡, Frauke Kreuter†§, Omer Reingold¶, Guy N. Rothblum
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system’s intelligence cannot be separated from its judgment.
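The abstract's two intervention points can be pictured as a wrapper around a black-box LLM. The sketch below is purely illustrative and not from the paper; `model`, `is_harmful_prompt`, and `is_harmful_output` are hypothetical stand-ins for an LLM and for the external filters whose existence the paper's results call into question.

```python
# Illustrative sketch (not from the paper): the two filtering intervention
# points described in the abstract, wrapped around a black-box LLM.
# All names here are hypothetical placeholders.

def guarded_generate(model, prompt,
                     is_harmful_prompt, is_harmful_output,
                     refusal="I can't help with that."):
    # Intervention point 1: filter the input prompt before it reaches the model.
    if is_harmful_prompt(prompt):
        return refusal
    output = model(prompt)
    # Intervention point 2: filter the output after generation.
    if is_harmful_output(output):
        return refusal
    return output

# Toy usage: trivial keyword filters stand in for real safety classifiers.
flag = lambda text: "FORBIDDEN" in text
echo_model = lambda p: f"echo: {p}"

print(guarded_generate(echo_model, "hello", flag, flag))      # passes both filters
print(guarded_generate(echo_model, "FORBIDDEN", flag, flag))  # blocked at the prompt
```

The paper's negative results concern exactly this architecture: for some LLMs, no efficient `is_harmful_prompt` can distinguish adversarial prompts from benign ones, and output filtering can be computationally intractable as well.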
Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
September 22, 2025 · Research areas: Data Science and Annotation, Speech and Natural Language Processing · Conference: NeurIPS
Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. To better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through…
Data Filtering Networks
April 8, 2024 · Research areas: Computer Vision, Methods and Algorithms · Conference: ICLR
Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering…
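The "collect a massive pool, then filter it down via heuristics" paradigm this abstract describes can be sketched minimally. The heuristics below (length bounds and exact-duplicate removal) are common illustrative examples of such ad-hoc filters, not the specific learned filtering approach the paper studies.

```python
# Hedged sketch of heuristic data curation: reduce a large candidate pool
# to a training set. The filters here are illustrative placeholders.

def heuristic_filter(pool, min_chars=32, max_chars=10_000):
    seen = set()
    kept = []
    for doc in pool:
        text = doc.strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # drop documents outside a plausible length range
        if text in seen:
            continue  # drop exact duplicates (crude deduplication)
        seen.add(text)
        kept.append(text)
    return kept

# Toy usage: a short doc is dropped, a repeated doc is kept only once.
pool = ["a" * 50, "a" * 50, "hi", "b" * 100]
print(heuristic_filter(pool))  # ['aaa…a' (50 chars), 'bbb…b' (100 chars)]
```

A learned data filtering network, by contrast, would replace these hand-written rules with a trained model that scores each candidate document.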