Towards enhancing emotion recognition via multimodal framework

Akalya devi, C.; Karthika Renuka, D.; Pooventhiran, G.; Harish, D.; Yadav, Shweta; Thirunarayan, Krishnaprasad

doi:10.3233/JIFS-220280

Article type: Research Article

Authors: Akalya devi, C.^{a; *} | Karthika Renuka, D.^a | Pooventhiran, G.^b | Harish, D.^c | Yadav, Shweta^d | Thirunarayan, Krishnaprasad^d

Affiliations: [a] Department of Information Technology, PSG College of Technology, Coimbatore, India | [b] Qualcomm India Private Limited, Chennai, India | [c] Software AG, Bangalore, India | [d] Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA

Correspondence: [*] Corresponding author. C. Akalya devi, Department of Information Technology, PSG College of Technology, Coimbatore, India. E-mail: [email protected].

Abstract: Emotional AI, which draws on cues from multiple sources, is expected to play a major role in fields such as entertainment, health care, and self-paced online education. In this work, we propose a multimodal emotion recognition system that extracts information from speech, motion capture, and text data. The main aim of this research is to improve the unimodal architectures so that they outperform the state of the art, and to combine them into a robust multimodal fusion architecture. We developed 1D and 2D CNN-LSTM time-distributed models for speech, a hybrid CNN-LSTM model for motion capture data, and a BERT-based model for text data to achieve state-of-the-art results, and explored both concatenation-based decision-level fusion and Deep CCA-based feature-level fusion schemes. The proposed speech and mocap models achieve emotion recognition accuracies of 65.08% and 67.51%, respectively, and the BERT-based text model achieves an accuracy of 72.60%. The decision-level fusion approach significantly improves emotion detection accuracy on the IEMOCAP and MELD datasets. It achieves 80.20% accuracy on IEMOCAP, which is 8.61% higher than state-of-the-art methods, and 63.52% and 61.65% on 5-class and 7-class classification, respectively, on the MELD dataset, both higher than the state of the art.
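
As a rough illustration of what concatenation-based decision-level fusion can look like, the sketch below concatenates the class-probability vectors from three unimodal models and passes them through a single linear layer followed by a softmax. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the number of classes (4), the random stand-in predictions, and the averaging placeholder weights are all hypothetical. In a real system the fusion weights would be learned from training data.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_decisions(speech_probs, mocap_probs, text_probs, weights, bias):
    """Concatenate per-modality class probabilities and apply a linear layer.

    Each *_probs array has shape (batch, num_classes); weights has shape
    (3 * num_classes, num_classes); bias has shape (num_classes,).
    """
    fused = np.concatenate([speech_probs, mocap_probs, text_probs], axis=-1)
    return softmax(fused @ weights + bias)


# Hypothetical setup: 2 utterances, 4 emotion classes (an assumption).
num_classes = 4
batch = 2
rng = np.random.default_rng(0)

# Stand-ins for the outputs of the speech, mocap, and text models.
speech = softmax(rng.normal(size=(batch, num_classes)))
mocap = softmax(rng.normal(size=(batch, num_classes)))
text = softmax(rng.normal(size=(batch, num_classes)))

# Placeholder fusion weights that average the three modalities;
# a trained fusion layer would replace these.
weights = np.vstack([np.eye(num_classes)] * 3) / 3.0
bias = np.zeros(num_classes)

fused = fuse_decisions(speech, mocap, text, weights, bias)
predicted = fused.argmax(axis=-1)  # one class index per utterance
```

Feature-level fusion differs in that it combines intermediate representations rather than final predictions; the Deep CCA-based scheme mentioned in the abstract is not sketched here.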

Keywords: Emotion recognition, time-distributed models, CNN-LSTM, BERT, DCCA

Journal: Journal of Intelligent & Fuzzy Systems, vol. 44, no. 2, pp. 2455-2470, 2023

Published: 30 January 2023
