Optimizing Byte-level Representation for End-to-End ASR

Authors: Roger Hsiao, Liuhui Deng, Erik McDermott, Ruchir Travadi, Xiaodan Zhuang

In this paper, we propose an algorithm to optimize a byte-level representation for end-to-end (E2E) automatic speech recognition (ASR). Byte-level representations are often used by large-scale multilingual ASR systems when the character set of the supported languages is large. The compactness and universality of a byte-level representation allow ASR models to use a smaller output vocabulary and therefore provide more flexibility. UTF-8 is the most commonly used byte-level representation and has been successfully applied to ASR. However, it was not designed for ASR or any other machine learning task. Using an auto-encoder and vector quantization, we show that a byte-level representation can be optimized for ASR and achieve better accuracy. Our proposed framework can incorporate information from different modalities and provides an error correction mechanism. In an English/Mandarin dictation task, we show that a bilingual ASR model built with this approach can outperform one built with the UTF-8 representation by 5% relative in error rate.
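To make the UTF-8 baseline concrete, the following minimal Python sketch shows how English and Mandarin text map onto the same 256-symbol byte vocabulary, and why a byte-level output can be invalid when a single byte is wrong or missing. It illustrates only the UTF-8 representation that the abstract compares against, not the optimized representation proposed in the paper. The helper names `text_to_byte_tokens` and `byte_tokens_to_text` and the sample strings are assumptions made for illustration.

```python
from typing import List


def text_to_byte_tokens(text: str) -> List[int]:
    """Map text to a sequence of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_tokens_to_text(tokens: List[int]) -> str:
    """Map byte values back to text; invalid sequences become U+FFFD."""
    return bytes(tokens).decode("utf-8", errors="replace")


# Hypothetical sample inputs, chosen only for illustration.
english = "speech"
mandarin = "语音"

en_tokens = text_to_byte_tokens(english)   # one byte per ASCII character
zh_tokens = text_to_byte_tokens(mandarin)  # three bytes per character here

# Both languages share one output vocabulary of at most 256 symbols.
vocab_size = 256
assert all(0 <= t < vocab_size for t in en_tokens + zh_tokens)

# A complete byte sequence round-trips exactly.
round_trip = byte_tokens_to_text(zh_tokens)

# Dropping one byte (as a recognizer might) leaves an invalid sequence,
# which the decoder can only replace with a placeholder character.
truncated = byte_tokens_to_text(zh_tokens[:-1])

print(len(en_tokens), len(zh_tokens))
print(round_trip, truncated)
```

The truncated case is the kind of failure that motivates an error correction mechanism in a byte-level ASR output: a single wrong or missing byte can corrupt a whole multi-byte character.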
