View publication

Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues (for example, acoustic, text and/or automatic speech recognition system (ASR) features) to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities during inference time.

Related readings and updates.

A Multi-signal Large Language Model for Device-directed Speech Detection

We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as a prefix token to a…
See paper details

ICASSP 2022

Apple sponsored Acoustics, Speech, and Signal Processing (ICASSP), which was held in a hybrid format. The virtual event took place on May 7 to 13, and the hybrid main conference on May 22 to 27. ICASSP is the IEEE Signal Processing Society’s flagship conference on signal processing and its applications.

See event details