Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Authors: Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik

Spotting user-defined (flexible) keywords represented in text frequently relies on an expensive text encoder that is analyzed jointly with an audio encoder in a shared embedding space. This approach can suffer from heterogeneous modality representations (for example, a large mismatch between the text and audio embeddings) and increased complexity. In this work, we propose a novel architecture that efficiently detects arbitrary keywords using an audio-compliant text encoder, which inherently has a representation homogeneous with the audio embedding and is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power. Experimental results show that our scheme outperforms the state-of-the-art results on the LibriPhrase hard dataset, increasing the Area Under the ROC Curve (AUC) from 84.21% to 92.7% and reducing the Equal Error Rate (EER) from 23.36% to 14.4%.
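The text-encoder pipeline described in the abstract (text to phonemes, then phonemes to vectors taken from the audio encoder) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `TOY_LEXICON` is a hypothetical lookup table standing in for a trained G2P model, `build_phoneme_table` assumes that a "representative" phoneme vector is the mean of phoneme-labeled audio-encoder frame embeddings, and the two-dimensional vectors are made up. The abstract does not specify how the representative vectors are computed, so this should not be read as the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a trained G2P model (assumption).
TOY_LEXICON = {
    "hey": ["HH", "EY"],
    "hay": ["HH", "EY"],
    "hi": ["HH", "AY"],
}


def g2p(text):
    """Map text to a phoneme sequence by lexicon lookup (stand-in for G2P)."""
    phonemes = []
    for word in text.lower().split():
        if word not in TOY_LEXICON:
            raise KeyError(f"word not in toy lexicon: {word!r}")
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def build_phoneme_table(labeled_frames):
    """Average audio-encoder frame embeddings per phoneme label.

    labeled_frames: iterable of (phoneme, embedding) pairs, where each
    embedding is a list of floats assumed to come from the audio encoder.
    Using the mean as the representative vector is an assumption.
    """
    sums = {}
    counts = defaultdict(int)
    for phoneme, vec in labeled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(vec)
        sums[phoneme] = [s + v for s, v in zip(sums[phoneme], vec)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in total] for p, total in sums.items()}


def encode_text(text, phoneme_table):
    """Encode text as a sequence of representative phoneme vectors."""
    return [phoneme_table[p] for p in g2p(text)]


if __name__ == "__main__":
    # Made-up frame embeddings labeled with phonemes (illustration only).
    frames = [
        ("HH", [1.0, 0.0]),
        ("HH", [3.0, 0.0]),
        ("EY", [0.0, 2.0]),
        ("AY", [0.0, 4.0]),
    ]
    table = build_phoneme_table(frames)
    print(encode_text("hey", table))
```

Because the text embedding in this sketch is assembled from vectors that already live in the audio encoder's output space, no separately trained text encoder is needed to bring the two modalities into one space, which is the intuition behind the homogeneous representation the abstract describes.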


