Assessing the Role of Data Quality in Training Bilingual Language Models
Authors: Skyler Seto, Maartje ter Hoeve*, Maureen de Seyssel*, David Grangier*
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies widely between languages: prior work shows that adding more languages can degrade performance for some languages (such as English) while improving it for others (typically more data-constrained languages). In this work, we investigate the causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that selects higher-quality bilingual training data using only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2–4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.
Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
September 25, 2025 | research area: Speech and Natural Language Processing
Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In…
Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data
April 10, 2020 | research area: Speech and Natural Language Processing | conference: ICASSP
We present progress towards bilingual Text-to-Speech which is able to transform a monolingual voice to speak a second language while preserving speaker voice quality. We demonstrate that a bilingual speaker embedding space contains a separate distribution for each language and that a simple transform in speaker space generated by the speaker embedding can be used to control the degree of accent of a synthetic voice in a language. The same…