Synthetic Bootstrapped Pretraining
Authors: Zitong Yang†‡, Aonan Zhang‡, Hong Liu†, Tatsunori Hashimoto†, Emmanuel Candès†, Chong Wang, Ruoming Pang
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter and a 6B-parameter model on up to 1T tokens from scratch. We find that SBP consistently improves upon a strong repetition baseline and delivers up to 60% of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
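The abstract does not say how related documents are identified or how the synthesizer is trained, so the following is only a toy sketch of the data-preparation idea, under stated assumptions: relatedness is approximated here by bag-of-words cosine similarity with a hypothetical threshold, and each related pair becomes a (condition, target) example on which a conditional synthesizer could be trained. The function names `related_pairs` and `synthesizer_examples` are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (a stand-in for a real document embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_pairs(docs, threshold=0.3):
    """For each document, keep its nearest neighbour if similarity exceeds
    the threshold. The threshold value is an assumption for illustration."""
    vecs = [bow(d) for d in docs]
    pairs = []
    for i in range(len(docs)):
        best_j, best_sim = None, threshold
        for j in range(len(docs)):
            if i == j:
                continue
            sim = cosine(vecs[i], vecs[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


def synthesizer_examples(docs, pairs):
    """Turn related pairs into conditional examples: given one document,
    the synthesizer would be trained to produce a related one."""
    return [{"condition": docs[i], "target": docs[j]} for i, j in pairs]


docs = [
    "the cat sat on the mat",
    "the cat slept on the mat",
    "stock prices rose sharply today",
]
pairs = related_pairs(docs)
examples = synthesizer_examples(docs, pairs)
```

In this toy corpus the two cat documents pair with each other and the unrelated finance document is left unpaired. In the procedure the abstract describes, a synthesizer trained on such relations would then generate new documents from seed documents, and the synthetic corpus would be used for joint training alongside the original data.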