Pitch Accent Detection Improves Pretrained Automatic Speech Recognition
Authors: David Sasu, Natalie Schluter
We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1-score by 41%. Additionally, joint training reduces the ASR word error rate (WER) by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
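The two headline figures are relative quantities, so a short sketch of how such numbers are typically computed may help. The function names are illustrative and do not come from the paper. Reading the 28.3% figure as a relative reduction in WER, and "closing the gap" as the fraction of the remaining distance to a perfect F1 of 1.0, are both assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which an error metric such as WER dropped versus a baseline."""
    return (baseline - improved) / baseline


def gap_closed(baseline_f1: float, new_f1: float, ceiling: float = 1.0) -> float:
    """Fraction of the distance between a baseline F1 and the ceiling that was
    recovered. ASSUMPTION: the ceiling is a perfect score of 1.0."""
    return (new_f1 - baseline_f1) / (ceiling - baseline_f1)
```

With purely hypothetical numbers, `relative_reduction(0.10, 0.08)` is a 20% relative WER reduction even though the absolute change is only two points, which is why it matters whether a reported percentage is relative or absolute.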