ChipChat: Low-Latency Cascaded Conversational Agent in MLX
Authors: Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly
The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) state-action augmented LLM, (c) text-to-speech synthesis, (d) neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without dedicated GPUs, while preserving user privacy through complete on-device processing. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
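To illustrate the cascaded design the abstract describes, here is a minimal sketch of a streaming cascade. The interfaces (StreamingASR-style `accept`, `stream_reply`, `synthesize`, `to_waveform`) are illustrative assumptions, not ChipChat's actual components; the point is that each stage consumes partial output from the previous one rather than waiting for it to finish.

```python
# A minimal sketch of a cascaded streaming loop. All component interfaces
# here are hypothetical stand-ins, not ChipChat's actual API.

class CascadedAgent:
    """Chains streaming ASR -> LLM -> TTS -> vocoder so that later stages
    start on partial results instead of waiting for earlier stages."""

    def __init__(self, asr, llm, tts, vocoder):
        self.asr, self.llm, self.tts, self.vocoder = asr, llm, tts, vocoder

    def respond(self, audio_chunks):
        # Stage 1: feed audio chunks to the recognizer as they arrive;
        # `accept` is assumed to return the running transcript hypothesis.
        transcript = ""
        for chunk in audio_chunks:
            transcript = self.asr.accept(chunk)
        # Stage 2: stream the LLM reply sentence by sentence.
        for sentence in self.llm.stream_reply(transcript):
            # Stages 3-4: synthesize and vocode each sentence as soon as it
            # completes, so playback begins before the full reply exists.
            acoustic = self.tts.synthesize(sentence)
            yield self.vocoder.to_waveform(acoustic)
```

Overlapping the stages this way is what lets a sequential cascade reach sub-second response latency despite its pipeline structure.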
Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU
November 19, 2025
Macs with Apple silicon are increasingly popular among AI developers and researchers who want to experiment with the latest models and techniques. With MLX, users can explore and run LLMs efficiently on a Mac: it lets researchers try new inference or fine-tuning techniques, and investigate AI methods in a private environment on their own hardware. MLX works with all Apple silicon systems, and with the latest…
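As a concrete starting point, a minimal generation script with the mlx-lm package looks like the sketch below; the quantized model identifier is an illustrative community conversion, not one named in the post.

```python
# Minimal text generation with mlx-lm on an Apple silicon Mac.
# The model name is an example community conversion (an assumption,
# not a recommendation from the post).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Explain what MLX is in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
```

Everything runs locally: weights are loaded from disk and inference happens on the machine's own unified memory, which is what makes MLX suitable for private, on-device experimentation.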
Streaming Models for Joint Speech Recognition and Translation
April 5, 2021 · research area: Speech and Natural Language Processing · conference: EACL
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models retain the advantage of producing intermediate automatic speech recognition output, which is useful for the many practical ST systems that display transcripts to the user alongside the translations. To bridge this gap,…