Hybrid Transformer and CTC Networks for Hardware Efficient Voice Triggering
Authors: Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir
We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass, which re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM) with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We then add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies whether the whole sequence is a true trigger or not. Results demonstrate that networks with self-attention layers yield ~60 percent relative reduction in false-reject rates for a given false-alarm rate, while requiring 10 percent fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show a 70 percent relative reduction in inference time. Additionally, the proposed network architectures are ~5 times faster to train.
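The multi-task design above pairs a shared self-attention encoder with two heads: a per-frame acoustic-model branch trained with the CTC loss, and a whole-sequence branch that classifies the segment as a true trigger or not. The sketch below illustrates that structure with a single-head self-attention layer in NumPy; the layer sizes, phone inventory, and mean-pooling for the sequence branch are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentionEncoder:
    """Single-head self-attention layer (illustrative dimensions only)."""
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.normal(0, s, (d_model, d_model))
        self.Wk = rng.normal(0, s, (d_model, d_model))
        self.Wv = rng.normal(0, s, (d_model, d_model))

    def __call__(self, x):  # x: (T, d_model) acoustic frames
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[1]))  # (T, T) attention weights
        return attn @ v                                # (T, d_model) contextual frames

class MTLVoiceTrigger:
    """Shared encoder with two branches: per-frame AM posteriors (the CTC branch)
    and a whole-sequence true-trigger vs. non-trigger classifier."""
    def __init__(self, d_model, n_phones, seed=0):
        rng = np.random.default_rng(seed)
        self.encoder = SelfAttentionEncoder(d_model, seed)
        self.W_am = rng.normal(0, 0.1, (d_model, n_phones + 1))  # +1 for the CTC blank
        self.W_disc = rng.normal(0, 0.1, (d_model, 2))           # trigger / non-trigger

    def __call__(self, x):
        h = self.encoder(x)
        am_posteriors = softmax(h @ self.W_am)        # (T, n_phones+1), fed to the CTC loss
        pooled = h.mean(axis=0)                       # sequence-level summary
        trigger_prob = softmax(pooled @ self.W_disc)  # (2,), fed to the cross-entropy loss
        return am_posteriors, trigger_prob

# Toy forward pass: 50 frames of 32-dimensional features.
frames = np.random.default_rng(1).normal(size=(50, 32))
model = MTLVoiceTrigger(d_model=32, n_phones=40)
am, trig = model(frames)
print(am.shape, trig.shape)  # (50, 41) (2,)
```

During training, the two losses would be minimized jointly over the shared encoder, which is what lets the sequence-level branch improve the trigger decision without a separate network.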
A growing number of consumer devices, including smart speakers, headphones, and watches, use speech as the primary means of user input. As a result, voice trigger detection systems, mechanisms that use voice recognition technology to control access to a particular device or feature, have become an important component of the user interaction pipeline: they signal the start of an interaction between the user and a device. Since these systems are deployed entirely on-device, several considerations inform their design, such as privacy, latency, accuracy, and power consumption.
Apple sponsored the Interspeech 2020 conference, which was held virtually from October 25 to 29. Interspeech is a global conference focused on speech processing and applications.