Article type: Research Article
Authors: Al-Azani, Sadam [a, *] | Almeshari, Ridha [a, b] | El-Alfy, El-Sayed [a, b, c]
Affiliations: [a] SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia | [b] Information and Computer Science Department, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia | [c] Computer Engineering Department and Interdisciplinary Research Center of Intelligent Secure Systems (IRC-ISS), King Fahd University of Petroleum & Minerals, Saudi Arabia
Correspondence: [*] Corresponding author Sadam Al-Azani, SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi Arabia. Email: [email protected].
Abstract: Speaker demographic recognition and segmentation analytics play a key role in offering personalized experiences across different automated industries and businesses. This paper aims to develop a multi-label demographic recognition system for Arabic speakers from audio and associated textual modalities. The system detects age group, gender, and dialect, and can be easily extended to incorporate more demographic traits. The proposed method is based on deep learning for feature learning and recognition. Representations of the audio modality are learned through a 3D spectrogram and an AlexNet CNN-based architecture. An AraBERT transformer is employed to learn representations of the textual modality. Additionally, a method is provided for fusing the audio and textual representations. The effectiveness of the proposed method is evaluated on the Saudi Audio Dataset for Arabic (SADA), a recently published database containing audio recordings of TV shows in different Arabic dialects. The experimental findings show that, when using standalone-modality models for multi-label demographic classification, the textual modality represented with AraBERT performs better than the audio modality represented with a 3D spectrogram and an AlexNet CNN-based architecture. Furthermore, combining the audio and textual modalities yields a significant improvement for all demographic traits.
Keywords: Demographic, 3D spectrogram, AraBERT, multi-label classification, Arabic LLMs, multimodal deep learning
DOI: 10.3233/JIFS-219389
Journal: Journal of Intelligent & Fuzzy Systems, vol. Pre-press, no. Pre-press, pp. 1-12, 2024
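The sketch below is an illustrative reading of the architecture described in the abstract, not the authors' code: an AlexNet-style CNN over a 3-channel ("3D") spectrogram for the audio modality, AraBERT for the textual modality, concatenation-based fusion, and one output head per demographic trait for multi-label prediction. The AraBERT checkpoint name, label counts, embedding dimensions, CLS pooling, and the concatenation fusion are all assumptions made for illustration.

```python
# Hedged sketch (assumptions noted): multimodal multi-label demographic classifier.
import torch
import torch.nn as nn
from torchvision.models import alexnet
from transformers import AutoModel, AutoTokenizer

ARABERT_NAME = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint


class MultimodalDemographicClassifier(nn.Module):
    def __init__(self, n_age=4, n_gender=2, n_dialect=10):  # label counts are assumptions
        super().__init__()
        # Audio branch: AlexNet backbone applied to a 3-channel spectrogram image.
        cnn = alexnet(weights=None)
        cnn.classifier[-1] = nn.Linear(4096, 512)  # project to a 512-d audio embedding
        self.audio_encoder = cnn
        # Text branch: AraBERT transformer; the [CLS] token embedding (768-d) is used.
        self.text_encoder = AutoModel.from_pretrained(ARABERT_NAME)
        # Fusion: simple concatenation of the two modality embeddings.
        fused_dim = 512 + 768
        # One linear head per trait; sigmoid/BCE training gives multi-label outputs.
        self.age_head = nn.Linear(fused_dim, n_age)
        self.gender_head = nn.Linear(fused_dim, n_gender)
        self.dialect_head = nn.Linear(fused_dim, n_dialect)

    def forward(self, spectrogram, input_ids, attention_mask):
        a = self.audio_encoder(spectrogram)                      # (B, 512)
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask
                              ).last_hidden_state[:, 0]          # (B, 768) CLS token
        z = torch.cat([a, t], dim=-1)                            # (B, 1280) fused vector
        return {"age": self.age_head(z),
                "gender": self.gender_head(z),
                "dialect": self.dialect_head(z)}


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(ARABERT_NAME)
    model = MultimodalDemographicClassifier()
    batch = tok(["مثال نصي"], return_tensors="pt", padding=True)
    spec = torch.randn(1, 3, 224, 224)  # placeholder 3-channel spectrogram "image"
    out = model(spec, batch["input_ids"], batch["attention_mask"])
    # A BCEWithLogitsLoss per head would be the usual multi-label training objective.
    print({k: v.shape for k, v in out.items()})
```

In this reading, each head is trained with an independent binary cross-entropy objective, which is a common way to realize multi-label classification over several demographic traits at once.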