Data-Centric Lessons To Improve Speech-Language Pretraining
Authors: Vishaal Udandarao†‡, Zhiyun Lu, Xuankai Chang, Yongqiang Wang, Violet Z. Yao, Albin Madapally Jose, Fartash Faghri, Josh Gardner, Chung-Cheng Chiu
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data, and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models up to 3x larger by 10.2% absolute. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.
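As a rough illustration of question (3), the minimal sketch below interleaves text and audio segments into a single training sequence. The boundary tokens, discrete audio-code format, and helper names are illustrative assumptions, not the paper's actual interleaving scheme.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical special tokens marking modality boundaries.
BOA, EOA = "<audio>", "</audio>"

@dataclass
class TextSegment:
    tokens: List[str]

@dataclass
class AudioSegment:
    codes: List[int]  # e.g. discrete audio tokenizer/codec codes

def interleave(segments: List[Union[TextSegment, AudioSegment]]) -> List[str]:
    """Flatten alternating (text, audio) segments into one training sequence,
    wrapping audio spans in boundary tokens so the model can tell modalities apart."""
    seq: List[str] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            seq.extend(seg.tokens)
        else:
            seq.append(BOA)
            seq.extend(f"<a{c}>" for c in seg.codes)
            seq.append(EOA)
    return seq

# Example: a transcript snippet followed by its corresponding audio codes.
print(interleave([TextSegment(["how", "are", "you"]), AudioSegment([17, 42, 42, 3])]))
```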
Language Models Improve When Pretraining Data Matches Target Tasks
July 18, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing
Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining…
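To make the general idea concrete, here is a minimal sketch of benchmark-targeted data selection: score each pretraining document by its similarity to benchmark examples and keep the top-ranked documents. The embedding inputs, mean-cosine scoring rule, and function name are illustrative assumptions, not BETR's actual procedure.

```python
import numpy as np

def rank_by_benchmark_similarity(doc_embs: np.ndarray,
                                 bench_embs: np.ndarray,
                                 keep_fraction: float = 0.1) -> np.ndarray:
    """Return indices of pretraining documents most similar to benchmark examples.

    doc_embs:   (num_docs, d) unit-normalized document embeddings
    bench_embs: (num_benchmark_examples, d) unit-normalized benchmark embeddings
    """
    # Score each document by its mean cosine similarity to the benchmark set.
    scores = doc_embs @ bench_embs.T      # (num_docs, num_bench)
    doc_scores = scores.mean(axis=1)      # (num_docs,)
    k = max(1, int(keep_fraction * len(doc_scores)))
    return np.argsort(-doc_scores)[:k]    # indices of top-scoring documents

# Example with random embeddings standing in for real ones.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)); docs /= np.linalg.norm(docs, axis=1, keepdims=True)
bench = rng.normal(size=(32, 64)); bench /= np.linalg.norm(bench, axis=1, keepdims=True)
selected = rank_by_benchmark_similarity(docs, bench, keep_fraction=0.05)
```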
Memory-Retaining Finetuning via Distillation
November 21, 2024 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Workshop at NeurIPS
This paper was accepted at the Fine-Tuning in Modern Machine Learning: Principles and Scalability (FITML) Workshop at NeurIPS 2024.
Large language models (LLMs) pretrained on large corpora of internet text possess much of the world’s knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities, such as math and coding, or apply “posttraining” (a.k.a. alignment) techniques to make the models follow users’…
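The title points to a distillation-based approach to retaining pretrained knowledge during finetuning. As a generic, hedged sketch of what that can look like (the loss weighting, temperature, and function name are illustrative assumptions, not the paper's method), one can add a KL term that keeps the finetuned student close to the frozen pretrained teacher:

```python
import torch
import torch.nn.functional as F

def memory_retaining_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          alpha: float = 0.5,
                          temperature: float = 1.0) -> torch.Tensor:
    """Finetuning loss plus a distillation term toward the frozen pretrained model.

    student_logits / teacher_logits: (batch, vocab) next-token logits
    labels: (batch,) target token ids for the finetuning task
    alpha: illustrative weight trading off new-task learning vs. retention
    """
    # Standard finetuning objective on the new task.
    task_loss = F.cross_entropy(student_logits, labels)
    # Distillation term: keep the student's predictive distribution close to the
    # frozen pretrained teacher's, i.e. KL(teacher || student).
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    distill_loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2
    return (1 - alpha) * task_loss + alpha * distill_loss

# Example with random logits standing in for real model outputs.
s, t = torch.randn(4, 100), torch.randn(4, 100)
y = torch.randint(0, 100, (4,))
print(memory_retaining_loss(s, t, y).item())
```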