INTERSPEECH 2023

Focus-attention-enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition

Abstract

Recognizing emotions in speech is essential for improving human-computer interaction, which requires understanding and responding to users' emotional states. Integrating multiple modalities, such as speech and text, improves the performance of speech emotion recognition systems by providing more varied sources of emotional information. In this context, we propose a model that enhances cross-modal transformer fusion by applying focus-attention mechanisms to align and combine the salient features of two modalities, namely speech and text. An analysis of the disentanglement of the emotional representations across multiple embedding spaces, using deep metric learning, confirmed that our method improves emotion recognition performance. The proposed approach was evaluated on the IEMOCAP dataset, and experimental results demonstrated that our model achieves the best performance among the relevant multimodal speech emotion recognition systems considered.
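
For readers unfamiliar with cross-modal fusion, the following is a minimal sketch of generic cross-modal scaled dot-product attention, in which queries from one modality (here, text) attend over keys and values from the other (here, speech). It is not the focus-attention mechanism proposed in this paper, whose details are not given in the abstract. The function names, the toy feature values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Generic scaled dot-product attention across two modalities.

    queries: list of Tq vectors (dim d), e.g. text token features
    keys:    list of Tk vectors (dim d), e.g. speech frame features
    values:  list of Tk vectors (dim dv), e.g. speech frame features
    Returns (outputs, weights): Tq fused vectors of dim dv, and the
    Tq x Tk attention weights. No learned projections are applied
    (an illustrative simplification).
    """
    d = len(queries[0])
    dv = len(values[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key in the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the other modality's value vectors.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(dv)]
        )
    return outputs, weights


# Hypothetical toy features: 2 text tokens, 3 speech frames, dimension 2.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
speech_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_modal_attention(text_feats, speech_feats, speech_feats)
```

Each row of `attn` is a probability distribution over speech frames for one text token, and each row of `fused` is the corresponding speech-informed representation of that token. A full cross-modal transformer would typically add learned query, key, and value projections, multiple heads, and the symmetric speech-to-text direction.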
