
Representations from models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT) have helped to achieve state-of-the-art performance in dimensional speech emotion recognition. Both HuBERT and BERT models generate fairly high-dimensional representations, and neither model was trained with the emotion recognition task in mind. Such high-dimensional representations lead to speech emotion models with large parameter counts, which increases both memory and computational cost. In this work, we investigate the selection of representations based on their task saliency, which may help to reduce model complexity without sacrificing dimensional emotion estimation performance. In addition, we investigate modeling label uncertainty in the form of grader opinion variance, and demonstrate that such information can help to improve the model's generalization capacity and robustness. Finally, we analyze the robustness of the speech emotion model against acoustic degradation and observe that selecting salient representations from pre-trained models and modeling label uncertainty both help to improve the model's generalization capacity to unseen data containing acoustic distortions in the form of environmental noise and reverberation.

Related readings and updates.

Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis

Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Speech models, such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT), have enabled generating lexical and acoustic representations to benefit speech recognition applications. We investigated the use of pre-trained model representations for…

Detecting Emotion Primitives From Speech And Their Use In Discerning Categorical Emotions

Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity. While modern speech technologies rely heavily on speech recognition and natural language understanding for speech content understanding, the investigation of vocal expression is increasingly gaining attention. Key considerations for building robust emotion models include characterizing and improving the…