Emphasis Control for Parallel Neural TTS
Authors: Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li
Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance. However, these systems often lack control over the output prosody, thus restricting the semantic information conveyable for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis. Three candidate features for the latent space are compared: 1) Variance of pitch and duration within words in a sentence, 2) Wavelet-based feature computed from pitch, energy, and duration, and 3) Learned combination of the two aforementioned approaches. At inference time, word-level prosodic emphasis is achieved by increasing the feature values of the latent space for the given words. Experiments show that all the proposed methods are able to achieve the perception of increased emphasis with little loss in overall quality. Moreover, emphasized utterances were preferred in a pairwise comparison test over the non-emphasized utterances, indicating promise for real-world applications.
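As a rough illustration of the first candidate feature and the inference-time control described above, the sketch below computes a per-word variance of pitch and duration and then raises the feature values for selected words. It is a toy example and not the authors' implementation: the function names, the input layout (one list of pitch values and one list of phone durations per word), and the additive `bias` are all assumptions made for illustration.

```python
from statistics import pvariance


def word_variance_features(pitch_by_word, duration_by_word):
    """Return one (pitch_variance, duration_variance) pair per word.

    Hypothetical input layout: each word is a list of pitch values (Hz)
    and a list of phone durations (seconds).
    """
    return [
        (pvariance(pitch), pvariance(durations))
        for pitch, durations in zip(pitch_by_word, duration_by_word)
    ]


def emphasize(features, word_indices, bias=1.0):
    """Raise the feature values of the chosen words by an assumed additive bias."""
    targets = set(word_indices)
    return [
        tuple(value + bias for value in feat) if i in targets else feat
        for i, feat in enumerate(features)
    ]


# Two-word toy sentence: the first word is flat, the second varies.
pitch = [[100.0, 100.0], [100.0, 120.0]]
durations = [[0.1, 0.1], [0.1, 0.3]]

feats = word_variance_features(pitch, durations)
emphasized = emphasize(feats, word_indices=[1], bias=2.0)
```

In a real system the modified features would condition the TTS model, which then predicts pitch and duration that realize the emphasis. Here they are simply returned.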