The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Authors: Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier’s score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
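Below is a minimal sketch of the two procedures the abstract describes, assuming a scikit-learn-style workflow. The hashed bag-of-words features, the logistic-regression classifier, the 10% retention rate, and the parameters of the token-permutation baseline are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; feature extractor, classifier, and all
# hyperparameters are assumptions, not the paper's actual setup.
import random

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression


def cqf_filter(pretraining_docs, high_quality_docs, keep_fraction=0.1):
    """Classifier-based Quality Filtering: train a binary classifier to
    separate web-crawled pretraining data from a high-quality reference
    set, score every pretraining document, and keep the top fraction."""
    vectorizer = HashingVectorizer(n_features=2**18)
    X_pre = vectorizer.transform(pretraining_docs)
    X_hq = vectorizer.transform(high_quality_docs)

    # Label 1 = high-quality reference set, label 0 = pretraining data.
    X = vstack([X_pre, X_hq])
    y = np.concatenate([np.zeros(X_pre.shape[0]), np.ones(X_hq.shape[0])])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # A document's quality score is the classifier's probability that
    # it belongs to the high-quality set; retain the top scorers.
    scores = clf.predict_proba(X_pre)[:, 1]
    n_keep = max(1, int(keep_fraction * len(pretraining_docs)))
    top = np.argsort(scores)[::-1][:n_keep]
    return [pretraining_docs[i] for i in top]


def permute_tokens(tokens, corruption_rate, seed=0):
    """Synthetic-quality baseline: degrade a document by shuffling a
    random fraction of its tokens; a lower corruption_rate yields
    'higher-quality' synthetic data."""
    rng = random.Random(seed)
    out = list(tokens)
    k = int(corruption_rate * len(out))
    if k > 1:
        positions = rng.sample(range(len(out)), k)
        values = [out[p] for p in positions]
        rng.shuffle(values)
        for p, v in zip(positions, values):
            out[p] = v
    return out
```

Sweeping `corruption_rate` from 1 down to 0 yields a family of datasets of controllably increasing quality, the kind of synthetic comparison the abstract contrasts with CQF-filtered data.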