
Audiovisual speech synthesis involves synthesizing a talking face while maximizing the coherence of the acoustic and visual speech. To solve this problem, we propose AVTacotron2, an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are passed through a WaveRNN model to reconstruct the speech waveform, and the waveform together with the output facial controllers is used to generate the corresponding video of the talking face. As a baseline, we use a modular system in which acoustic speech is synthesized from text using the traditional Tacotron2, and the reconstructed acoustic speech then drives the controls of the face model through an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. A comprehensive analysis shows that the end-to-end system synthesizes close to human-like audiovisual speech with a mean opinion score (MOS) of 4.1, matching the MOS obtained on ground truth generated from professionally recorded videos.
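The key architectural point above is that a single decoder emits both acoustic features and face-model controllers per frame, so the two streams stay synchronized by construction. The following toy sketch illustrates that shape of the problem; the dimensions (80 mel bins, 51 controllers), the stand-in encoder/decoder, and all weights are illustrative assumptions, not the actual AVTacotron2 implementation.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's):
N_MELS = 80          # acoustic features per output frame (mel-spectrogram bins)
N_CONTROLLERS = 51   # face-model controller values per output frame

rng = np.random.default_rng(0)

def encode_phonemes(phoneme_ids, dim=16):
    """Stand-in encoder: embed each phoneme id into a fixed-size vector."""
    table = rng.standard_normal((64, dim))
    return table[np.asarray(phoneme_ids)]

def decode_audiovisual(encoder_states, n_frames):
    """Stand-in decoder: each output frame's shared hidden state is
    projected into BOTH acoustic features and facial controllers."""
    context = encoder_states.mean(axis=0)      # crude substitute for attention
    w_audio = rng.standard_normal((context.size, N_MELS))
    w_face = rng.standard_normal((context.size, N_CONTROLLERS))
    hidden = np.tile(context, (n_frames, 1))
    return hidden @ w_audio, hidden @ w_face

mels, face = decode_audiovisual(encode_phonemes([3, 7, 12]), n_frames=10)
# The two streams share one time axis: one controller frame per mel frame,
# which is what keeps the synthesized audio and face animation coherent.
print(mels.shape, face.shape)
```

In a real system the mel frames would be passed to a neural vocoder such as WaveRNN, and the controller frames to the face-model renderer.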

Related readings and updates.

On the Role of Visual Cues in Audiovisual Speech Enhancement

We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is…

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate…
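The mapping described above is a per-frame regression from acoustic features to lip-animation controls. A minimal sketch of that idea, with an illustrative feature size (13 MFCC-like coefficients), 8 hypothetical lip-control parameters, and fixed random weights standing in for a trained DNN:

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's architecture):
N_ACOUSTIC = 13   # acoustic features per frame, e.g. MFCC-like coefficients
N_LIP = 8         # lip-control parameters of the face model per frame

rng = np.random.default_rng(1)

# Single hidden-layer network with random weights, standing in for a trained model.
w1 = rng.standard_normal((N_ACOUSTIC, 32))
w2 = rng.standard_normal((32, N_LIP))

def audio_to_lip_controls(acoustic_frames):
    """Map a (frames, N_ACOUSTIC) acoustic sequence to (frames, N_LIP) controls."""
    hidden = np.maximum(acoustic_frames @ w1, 0.0)   # ReLU
    return hidden @ w2

# One lip-control vector per acoustic frame, keeping animation and audio aligned.
controls = audio_to_lip_controls(rng.standard_normal((100, N_ACOUSTIC)))
print(controls.shape)
```

The speaker-independence challenge the paper raises is that such a model must generalize across voices; domain-adapting the acoustic front end is one way to make the input features speaker-invariant before this mapping is applied.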