
We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition (ASR) system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as a prefix token to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements of 38.9% and 20.5% over text-only and audio-only models, respectively.
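As a rough illustration of this prefix-fusion idea, the sketch below projects the audio encoder's continuous embeddings into the LLM's embedding space and prepends them to the embedded ASR text before decoding. The module names, dimensions, and the assumption of a HuggingFace-style model that accepts `inputs_embeds` are illustrative, not the paper's exact implementation.

```python
# Illustrative sketch only: a generic audio encoder, a linear projection into
# the LLM embedding space, and a decoder-only LLM that accepts precomputed
# input embeddings (HuggingFace-style `inputs_embeds`). Dimensions are placeholders.
import torch
import torch.nn as nn


class PrefixFusionDDSD(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.audio_encoder = audio_encoder               # waveform -> (B, T_audio, audio_dim)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)  # map audio embeddings into LLM space
        self.llm = llm                                   # pretrained decoder-only LLM

    def forward(self, waveform: torch.Tensor, text_embeds: torch.Tensor):
        # The waveform becomes a sequence of continuous embeddings ...
        audio_embeds = self.audio_encoder(waveform)      # (B, T_audio, audio_dim)
        audio_prefix = self.audio_proj(audio_embeds)     # (B, T_audio, llm_dim)
        # ... which is prepended as a prefix to the embedded ASR text (and
        # confidence features); the LLM then generates the directedness decision.
        inputs_embeds = torch.cat([audio_prefix, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```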

Related readings and updates.

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

This paper was accepted at the Adaptive Foundation Models (AFM) Workshop at NeurIPS 2024. Follow-up conversations with virtual assistants (VAs) enable a user to interact seamlessly with a VA without needing to repeatedly invoke it with a keyword after the first query. Accurate Device-Directed Speech Detection (DDSD) on follow-up queries is therefore critical for enabling a naturalistic user experience. To this end, we explore the notion…

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being…
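The modality dropout idea can be sketched as follows: during training, entire modalities are randomly zeroed out so the classifier learns to cope with whatever inputs are available at inference time. The feature names, drop probability, and the choice of zeroing (rather than, say, learned placeholder embeddings) are assumptions for illustration, not the paper's settings.

```python
# Illustrative sketch only: randomly drop whole modalities during training so a
# DDSD classifier stays robust when acoustic, text, or ASR features are missing.
import torch


def modality_dropout(acoustic: torch.Tensor,
                     text: torch.Tensor,
                     asr_conf: torch.Tensor,
                     p_drop: float = 0.3,
                     training: bool = True):
    """Zero out each modality independently with probability p_drop.

    A real implementation might instead substitute learned "missing-modality"
    embeddings and ensure at least one modality is always kept.
    """
    if training:
        if torch.rand(()) < p_drop:
            acoustic = torch.zeros_like(acoustic)
        if torch.rand(()) < p_drop:
            text = torch.zeros_like(text)
        if torch.rand(()) < p_drop:
            asr_conf = torch.zeros_like(asr_conf)
    return acoustic, text, asr_conf
```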