
Accurately predicting a user's intent to interact with a voice assistant (VA) on a device (e.g., a smartphone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach that predicts user intention (whether or not the user is speaking to the device) directly from acoustic and textual information encoded at subword tokens, which are obtained via an end-to-end (E2E) ASR model. Modeling subword tokens directly, rather than phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token has a semantic meaning, in contrast to phoneme-level representations; and (ii) each subword token has a reusable “sub”-word acoustic pattern (one that can be used to construct multiple full words), resulting in a much smaller vocabulary space than that of full words. To learn the subword representations for audio-to-intent classification, we extract: (i) acoustic information from an E2E-ASR model, which provides frame-level CTC posterior probabilities for the subword tokens, and (ii) textual information from a pretrained continuous bag-of-words model that captures the semantic meaning of the subword tokens. The key to our approach is that it combines acoustic subword-level posteriors with textual information, using the notion of positional encoding to account for multiple ASR hypotheses simultaneously. We show that the proposed approach learns robust representations for audio-to-intent classification and prevents 93.3% of unintended user audio from invoking the VA at a 99% true positive rate.
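To make the closing metric concrete, the sketch below shows one common way to compute a figure of the form "share of unintended audio rejected at a fixed true positive rate" from classifier scores. It is a minimal illustration, not the authors' evaluation code: the function name `rejection_rate_at_tpr`, the convention that higher scores mean "device-directed", and the toy scores are all assumptions.

```python
import math


def rejection_rate_at_tpr(intended_scores, unintended_scores, target_tpr=0.99):
    """Return (threshold, rejected_fraction).

    Assumed convention (hypothetical): an utterance invokes the VA when
    its score is >= threshold. The threshold is the highest one that
    still accepts at least `target_tpr` of the intended utterances.
    """
    if not intended_scores or not unintended_scores:
        raise ValueError("both score lists must be non-empty")

    # Rank intended utterances from most to least confident.
    ranked = sorted(intended_scores, reverse=True)

    # Minimum number of intended utterances that must be accepted.
    needed = max(1, math.ceil(target_tpr * len(ranked)))

    # Accepting everything at or above this score meets the target TPR.
    threshold = ranked[needed - 1]

    # Unintended utterances scoring below the threshold are rejected.
    rejected = sum(1 for s in unintended_scores if s < threshold)
    return threshold, rejected / len(unintended_scores)


# Toy, made-up scores purely for illustration.
intended = [0.9, 0.8, 0.7, 0.2]
unintended = [0.1, 0.3, 0.6, 0.75]
threshold, rejected_fraction = rejection_rate_at_tpr(intended, unintended, 0.75)
```

Under this reading, the reported result corresponds to a rejected fraction of 0.933 on unintended audio when `target_tpr` is 0.99.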
