ChipChat: Low-Latency Cascaded Conversational Agent in MLX
Authors: Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly
The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) state-action augmented LLM, (c) text-to-speech synthesis, (d) neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without dedicated GPUs, while preserving user privacy through complete on-device processing. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
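The latency argument above hinges on each stage consuming its predecessor's output incrementally rather than waiting for it to finish. The sketch below is a hypothetical illustration of that streaming hand-off (all names are illustrative stand-ins, not ChipChat's actual components or API): Python generators chain the ASR, LLM, and TTS stages so the first audio frame is produced after the first input chunk, rather than after the full sequential pass.

```python
# Hypothetical sketch of a streaming cascaded voice pipeline.
# Each stage is a generator that yields output as soon as a unit of
# input arrives, so stage latencies overlap instead of summing.

def asr_stream(audio_chunks):
    """Streaming speech recognition: emit a partial transcript per chunk."""
    for chunk in audio_chunks:
        yield f"word{chunk}"   # stand-in for a recognized word

def llm_stream(words):
    """Streaming LLM: generate response tokens as transcript words arrive."""
    for word in words:
        yield word.upper()     # stand-in for a generated response token

def tts_stream(response_tokens):
    """Streaming TTS + vocoder: synthesize audio per response token."""
    for tok in response_tokens:
        yield f"<audio:{tok}>"

# Chain the stages; iteration pulls one chunk end-to-end at a time.
audio = range(3)
frames = list(tts_stream(llm_stream(asr_stream(audio))))
print(frames)  # ['<audio:WORD0>', '<audio:WORD1>', '<audio:WORD2>']
```

Because the chain is pull-based, the first `<audio:...>` frame is available after processing only the first audio chunk, which is the property that lets a redesigned cascaded system approach sub-second response latency.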
AgentBuilder: Exploring Scaffolds for Prototyping User Experiences of Interface Agents
January 9, 2026
Research area: Human-Computer Interaction
Interface agents powered by generative AI models (referred to as “agents”) can automate actions based on user commands. An important aspect of developing agents is their user experience (i.e., agent experience). There is a growing need to provide scaffolds for a broader set of individuals beyond AI engineers to prototype agent experiences, since they can contribute valuable perspectives to designing agent experiences. In this work, we explore the…
Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU
November 19, 2025
Macs with Apple silicon are increasingly popular among AI developers and researchers who want to experiment with the latest models and techniques on their own machines. With MLX, users can explore and run LLMs efficiently on Mac. It lets researchers experiment with new inference or fine-tuning techniques, or investigate AI techniques in a private environment, on their own hardware. MLX works with all Apple silicon systems, and with the latest…
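The kind of inference experimentation described above typically starts from a token-by-token decoding loop. The sketch below is a framework-agnostic toy version (the model and all names are hypothetical, written in pure Python rather than MLX so it runs anywhere): a greedy loop that repeatedly scores the context and appends the highest-scoring token, the skeleton a researcher would swap a real MLX model into.

```python
# Toy greedy decoding loop (hypothetical names; pure Python, no MLX).
# A real experiment would replace toy_model with an MLX model's forward
# pass, but the control flow of the loop is the same.

def toy_model(context):
    """Toy next-token scorer: favors (last token + 1) mod vocab size."""
    vocab = 5
    scores = [0.0] * vocab
    scores[(context[-1] + 1) % vocab] = 1.0
    return scores

def greedy_generate(model, prompt, max_tokens):
    """Append the argmax token at each step (greedy decoding)."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        scores = model(tokens)
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

print(greedy_generate(toy_model, [0], 4))  # [0, 1, 2, 3, 4]
```

Variants such as sampling, speculative decoding, or KV-cache tweaks are all modifications of this loop, which is why a hackable on-device framework is useful for prototyping them.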