SpeakStream: Streaming Text-to-Speech with Interleaved Data

Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to text-to-speech (TTS) systems seem to be oddly under-explored, even though they are potentially much simpler. Using traditional TTS systems to convert LLM outputs to audio, however, poses a technical problem because they need entire utterances to generate stylistic audio. In this paper we present a ‘streaming’ TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained using next-step prediction on interleaved data that is generated from forced alignment of text transcripts to speech. During inference, our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications like conversational AI agents where an LLM can stream text to a TTS system. Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.
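To make the idea of interleaved training data concrete, here is a minimal Python sketch of how such a sequence might be assembled from word-level alignments. It is an illustration under stated assumptions, not the authors' implementation: the `AlignedWord` type, the `interleave` function, the `words_per_chunk` parameter, and the integer speech tokens are all hypothetical, and the abstract does not specify the chunking granularity or the speech representation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class AlignedWord:
    """One word plus the speech tokens a forced aligner assigned to it.

    Hypothetical structure: speech tokens are assumed to be discrete
    integer codes, which the abstract does not confirm.
    """
    text: str
    speech_tokens: List[int]


Item = Tuple[str, Union[str, int]]


def interleave(words: List[AlignedWord], words_per_chunk: int = 2) -> List[Item]:
    """Alternate small text chunks with the speech tokens aligned to them.

    Each chunk contributes its words first, then the speech tokens for
    those same words, so a decoder-only model trained with next-step
    prediction sees the text for a chunk before the audio for that chunk.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")

    sequence: List[Item] = []
    for start in range(0, len(words), words_per_chunk):
        chunk = words[start:start + words_per_chunk]
        # Text for this chunk comes first ...
        sequence.extend(("text", w.text) for w in chunk)
        # ... followed by the speech tokens aligned to those words.
        sequence.extend(("speech", tok) for w in chunk for tok in w.speech_tokens)
    return sequence


# Toy alignment with made-up token ids.
example = [
    AlignedWord("hello", [11, 12]),
    AlignedWord("there", [13]),
    AlignedWord("friend", [14, 15, 16]),
]
interleaved = interleave(example, words_per_chunk=2)
```

With this layout, a model at inference time would only need the text of the current chunk before it starts emitting speech for that chunk, which is what allows it to consume text incrementally instead of waiting for the full utterance.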

