
Suppressing unintended invocations of the device, caused by speech that sounds like the wake word or by accidental button presses, is critical for a good user experience; this task is referred to as False-Trigger-Mitigation (FTM). When multiple invocation options are available, the traditional approach to FTM is to use either invocation-specific models or a single model for all invocations. Both approaches are sub-optimal: the memory cost of the former grows linearly with the number of invocation options, which is prohibitive for on-device deployment, and it does not take advantage of shared training data; the latter is unable to accurately capture acoustic differences across invocation types. To this end, we propose a Unified Acoustic Detector (UAD) for FTM when multiple invocation options are available on device. The proposed UAD is trained using a multi-task learning framework, in which a jointly trained acoustic encoder is augmented with invocation-specific classification layers. In the context of the FTM task, we show for the first time that a shared model architecture across invocations (keeping the model size similar to that of a monolithic model used for a single invocation type) can not only match but substantially improve on the accuracy of invocation-specific models. In particular, in the challenging case of touch-based invocation, we obtain 50% and 35% relative improvements in false-positive rate at 99% true-positive rate, compared with a single-output model for both invocations and with separate models per invocation, respectively. Furthermore, we propose streaming and non-streaming variants of the UAD and show that both outperform a traditional ASR-based approach to FTM.
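The multi-task structure described above can be sketched in a few lines: a single encoder is shared by all invocation types, and each invocation gets only a small classification head, so the total parameter count stays close to that of a monolithic single-invocation model. This is only an illustrative sketch; the dimensions, the single-matrix "encoder," and the two invocation names are assumptions for the example, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not specified in the abstract.
FEAT_DIM, ENC_DIM = 40, 64
INVOCATIONS = ["wake_word", "button_press"]  # example invocation types

# Shared acoustic encoder: one weight matrix stands in for the
# jointly trained encoder network.
W_enc = rng.standard_normal((FEAT_DIM, ENC_DIM)) * 0.1

# One small classification head per invocation type
# (2 classes: intended invocation vs. false trigger).
heads = {name: rng.standard_normal((ENC_DIM, 2)) * 0.1
         for name in INVOCATIONS}

def ftm_score(features, invocation):
    """Return P(intended) for a frame-averaged acoustic feature vector."""
    h = np.tanh(features @ W_enc)      # shared encoder forward pass
    logits = h @ heads[invocation]     # invocation-specific head
    p = np.exp(logits - logits.max())  # softmax over the two classes
    return (p / p.sum())[0]

x = rng.standard_normal(FEAT_DIM)      # a dummy feature vector
score = ftm_score(x, "button_press")
```

Note that adding another invocation type costs only one extra `ENC_DIM x 2` head, while the shared encoder, which dominates the parameter count, is reused.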

Related readings and updates.

Streaming On-Device Detection of Device Directed Speech from Voice and Touch-Based Invocation

When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can be invoked accidentally by keyword-like speech or an accidental button press, which may have implications for user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device…

Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation

False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is the process of detecting false trigger events and responding appropriately to the user. In this paper, we propose a novel solution to the FTM problem by introducing a parallel ASR decoding process with a special language model trained from "out-of-domain" data…