
We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition system. An audio encoder represents the waveform as a sequence of continuous embeddings, which is presented as a prefix to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements of 38.9% and 20.5% over text-only and audio-only models, respectively.
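As a rough illustration of the prefix-fusion idea, the sketch below concatenates continuous audio embeddings with ASR token embeddings (biased by per-token confidences) and feeds the combined sequence to a transformer. All module names, sizes, and the randomly initialized stand-in for the pretrained LLM are assumptions for illustration, not the paper's actual encoder or model.

```python
import torch
import torch.nn as nn

class PrefixFusionDDSD(nn.Module):
    """Toy prefix-fusion model: audio embeddings prepended to ASR text tokens."""
    def __init__(self, vocab_size=32000, d_model=768, n_audio_frames=32):
        super().__init__()
        # Audio encoder: waveform -> short sequence of continuous embeddings.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=400, stride=320),
            nn.AdaptiveAvgPool1d(n_audio_frames),
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.conf_proj = nn.Linear(1, d_model)  # per-token ASR confidence
        # Stand-in for the pretrained LLM (randomly initialized here).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, waveform, asr_ids, asr_conf):
        # waveform: (B, samples); asr_ids: (B, T); asr_conf: (B, T) in [0, 1]
        audio = self.audio_encoder(waveform.unsqueeze(1)).transpose(1, 2)
        text = self.token_embed(asr_ids) + self.conf_proj(asr_conf.unsqueeze(-1))
        fused = torch.cat([audio, text], dim=1)  # audio acts as the prefix
        return self.lm_head(self.llm(fused))     # next-token logits

model = PrefixFusionDDSD()
logits = model(torch.randn(2, 16000),
               torch.randint(0, 32000, (2, 12)),
               torch.rand(2, 12))
print(logits.shape)  # torch.Size([2, 44, 32000]): 32 audio frames + 12 tokens
```

In the paper the LLM is pretrained and emits the directedness decision as generated text; the toy transformer here only shows how the audio prefix and text embeddings are joined into one input sequence.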

Related readings and updates.

Device-directed speech detection (DDSD) is a binary classification task that separates the user's queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving a naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general…
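The excerpt does not detail the adaptive KD scheme, but for background, here is a minimal sketch of standard soft-label distillation (Hinton-style); the temperature, loss weight, and binary DDSD setup are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft teacher targets (KL at temperature T) with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: binary DDSD logits from a small student and a large teacher.
student_logits = torch.randn(8, 2)
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```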


Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text, and/or automatic speech recognition (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being…
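Detectors like these are commonly scored by equal error rate, the metric reported in the abstract above. As a minimal sketch, EER can be computed from detection scores with a threshold sweep; the toy scores and labels below are illustrative only.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: operating point where false-accept rate equals false-reject rate."""
    order = np.argsort(scores)[::-1]            # sort by score, descending
    labels = np.asarray(labels)[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    far = np.cumsum(1 - labels) / n_neg         # negatives accepted at each cut
    frr = 1.0 - np.cumsum(labels) / n_pos       # positives rejected at each cut
    idx = np.argmin(np.abs(far - frr))          # closest crossing of the two rates
    return (far[idx] + frr[idx]) / 2.0

scores = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])   # 1 = device-directed
print(equal_error_rate(scores, labels))  # 0.0 for this perfectly separable toy set
```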
