
Automatic speech recognition (ASR) is widely used in consumer electronics. ASR greatly improves the utility and accessibility of technology, but its output is usually a word sequence without punctuation, which can make user intent ambiguous. We first present a transformer-based approach for punctuation prediction that achieves an 8% improvement on the IWSLT 2012 TED Task, surpassing the previous state of the art [1]. We next describe a multimodal model that learns from both text and audio and achieves an 8% improvement over the text-only algorithm on an internal dataset for which we have both audio and transcriptions. Finally, we present an approach to learning a model using contextual dropout that allows us to handle variable amounts of future context at test time.

Related readings and updates.

Humanizing Word Error Rate for ASR Transcript Readability and Accessibility

Podcasting has grown to be a popular and powerful medium for storytelling, news, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard of hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge: the text needs to accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to corresponding ASR transcripts.
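For readers unfamiliar with the metric named in the title, the comparison between a reference transcript and an ASR transcript is conventionally scored as word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below shows only that standard textbook definition, not the humanized variant the article goes on to discuss. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    so casing and attached punctuation count as differences.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Under this definition every error costs the same, whether it changes the meaning of a sentence or only drops a comma attached to a word, which is part of why plain WER is a limited measure of readability.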



Apple sponsored the 46th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). The conference focuses on signal processing and its applications and took place virtually from June 6 to 11.
