An enhanced speech emotion recognition using vision transformer

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-63776-4
PMID: https://pubmed.ncbi.nlm.nih.gov/38849422
Publication Date: 2024-06-07
Author(s): Samson Akinpelu et al.
Primary Topic: Emotion and Mood Recognition

Overview

This study develops a novel Vision Transformer (ViT) model for speech emotion recognition (SER) that operates on mel-spectrograms and deep features. The model combines flattening, tokenization with a patch size of 32, position embedding, and self-attention, and uses a simple Multi-Layer Perceptron (MLP) head with 128 dimensions to extract deep features. This architecture keeps computational complexity and the number of parameters low, which contributes to its efficiency.
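
The patching and position-embedding steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 128 x 128 input size, the token width of 64, and the random projection and position embeddings are assumptions chosen only to make the example run, and the helper names `patchify` and `embed_patches` are hypothetical. Only the patch size of 32 comes from the summary above.

```python
import numpy as np

def patchify(spectrogram: np.ndarray, patch_size: int = 32) -> np.ndarray:
    """Split a (H, W) spectrogram into flattened, non-overlapping square patches."""
    height, width = spectrogram.shape
    if height % patch_size or width % patch_size:
        raise ValueError("spectrogram sides must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows*cols, patch*patch)
    patches = spectrogram.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return patches.reshape(rows * cols, patch_size * patch_size)

def embed_patches(patches, projection, position_embedding):
    """Linearly project each flattened patch to a token and add its position embedding."""
    return patches @ projection + position_embedding

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((128, 128))   # assumed mel-spectrogram image size
embed_dim = 64                                  # assumed token width
patches = patchify(spectrogram)                 # 16 patches, each 32 * 32 = 1024 values
projection = rng.standard_normal((1024, embed_dim)) * 0.02       # stand-in for learned weights
position_embedding = rng.standard_normal((16, embed_dim)) * 0.02  # stand-in for learned weights
tokens = embed_patches(patches, projection, position_embedding)
print(patches.shape, tokens.shape)
```

In a trained model the projection and position embeddings would be learned parameters; random values are used here only so the shapes can be inspected.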

The model was evaluated on two benchmark datasets, TESS and EMO-DB, and outperformed existing state-of-the-art methods. It achieved recognition accuracies of 98% on TESS, 91% on EMO-DB, and 93% on the combination of both datasets, surpassing previous accuracy benchmarks by 2% and 5%. The findings highlight the Vision Transformer’s ability to capture global contextual information, model long-range dependencies, and enrich the representation of emotional speech patterns. Future work will apply the model to additional speech recognition tasks and evaluate it on more diverse datasets, including non-synthetic speech corpora, potentially in combination with other deep learning techniques to further improve recognition rates.

Methods

In this section, the authors describe the methodology of their experiments, which evaluate the effectiveness of their speech emotion recognition (SER) model on speech spectrograms. The model’s robustness is validated on two benchmark datasets, TESS and EMO-DB, and its performance is compared against several baseline systems to demonstrate its significance.

Later sections describe the datasets, the accuracy metrics, and the results in more detail. The key mathematical formulations behind the model’s architecture are also presented. The attention mechanism is defined as \( \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \), and the multi-head attention function as \( \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O \), where each \( \text{head}_i \) is an attention output computed on its own learned projections of \( Q \), \( K \), and \( V \). The Gaussian Error Linear Unit (GELU) activation function is given as \( \text{GELU}(x) = x \cdot \Phi(x) \), where \( \Phi(x) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right) \) is the standard normal cumulative distribution function.
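
The three formulas above can be written out as a small NumPy sketch. This is an illustration under assumptions, not the paper's code: the token count, model width, number of heads, and random weight matrices are made up, and the per-head projections are packed into single matrices `w_q`, `w_k`, `w_v` whose columns are split across heads.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for inputs shaped (..., tokens, d_k)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / math.sqrt(d_k)
    return softmax(scores) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention form of Concat(head_1, ..., head_h) W_O, with Q, K, V all derived from x."""
    tokens, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (tokens, d_model) -> (heads, tokens, d_head)
        return t.reshape(tokens, num_heads, d_head).transpose(1, 0, 2)

    heads = attention(split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v))
    concat = heads.transpose(1, 0, 2).reshape(tokens, d_model)   # concatenate the heads
    return concat @ w_o

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))). Scalar input."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))   # 16 tokens of width 64 (assumed sizes)
w_q, w_k, w_v, w_o = (rng.standard_normal((64, 64)) * 0.1 for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=4)
print(out.shape, gelu(0.0), round(gelu(1.0), 4))
```

The output keeps the input shape (tokens by model width), which is what allows attention blocks to be stacked.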

Results

The experiments indicate robust performance in recognizing unseen speech utterances. The model’s generalization error is approximated by a prediction error of 70. A 5-fold cross-validation method was employed, dividing the dataset into training and testing subsets, and the loss decreased effectively in both the training and testing phases. The highest loss values recorded for the three datasets (TESS, EMO-DB, and TESS-EMODB) were 0.13, 0.20, and 0.25, respectively. The model achieved notable precision, recall, and F1-scores: precision reached 100% for neutral and 99% for disgust, while boredom had the lowest recall at 76%.
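
As an illustration of how a 5-fold split partitions a dataset, here is a minimal standard-library sketch. It assumes a plain shuffled split without stratification by emotion class; the summary does not say which variant the authors used, and the function name `k_fold_indices` and the sample count of 103 are hypothetical.

```python
import random

def k_fold_indices(num_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when num_samples is not divisible by k.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical dataset of 103 utterances, referenced by index.
folds = list(k_fold_indices(num_samples=103, k=5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: {len(train)} train / {len(test)} test")
```

Each utterance lands in the test subset of exactly one fold, so every sample is used for both training and testing across the five runs.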

The classification performance, summarized in confusion matrices and detailed tables, shows that the model outperforms state-of-the-art techniques, especially for disgust, neutral, sad, and fear. The confusion matrix shows recognition accuracies of 99% for angry, neutral, and disgust, while the hybrid TESS-EMODB dataset recorded the lowest accuracy, 74%, for sad. The model’s architectural simplicity contributes to its effectiveness in raising SER recognition rates and reducing misclassification, making it suitable for real-time applications that monitor human behavioral patterns. Overall, the model correctly identified 27 out of 30 emotions, underscoring its potential for practical deployment in emotion recognition tasks.
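
Per-class precision, recall, and F1-score follow directly from a confusion matrix, as the sketch below shows. The three-class matrix is made-up toy data, not the paper's results, and the function names are hypothetical; rows are taken to be true classes and columns predicted classes.

```python
def per_class_metrics(confusion, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        predicted = sum(row[i] for row in confusion)   # column sum: all predictions of class i
        actual = sum(confusion[i])                     # row sum: all true samples of class i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Made-up counts for three classes; NOT the paper's confusion matrix.
labels = ["angry", "neutral", "sad"]
confusion = [
    [18, 0, 2],
    [0, 20, 0],
    [3, 1, 16],
]
metrics = per_class_metrics(confusion, labels)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(confusion):.2f}")
```

A class can have high precision and lower recall at the same time, which is why the summary reports the two figures separately for different emotions.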

Discussion

The discussion section traces the evolution of speech emotion recognition (SER) systems from traditional machine learning models to deep learning techniques. Early systems relied on conventional classifiers such as Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and Hidden Markov Models (HMM), which were susceptible to noise and inefficient on large datasets. Neural network architectures, particularly Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks, then gained traction because they process sequential data and capture the temporal dependencies that emotion recognition depends on. More recent work has integrated Convolutional Neural Networks (CNN) and various deep learning frameworks, yielding significant accuracy improvements, with some models exceeding 93% accuracy on specific datasets.

The section also discusses transformer models, which have advanced SER by capturing long-range dependencies and improving feature extraction from audio signals. Notable contributions include hybrid models that combine CNNs and transformers, as well as architectures such as the Vision Transformer (ViT), which uses self-attention to improve emotion recognition from mel-spectrograms. The ViT model proposed in this study builds on these advances while minimizing computational complexity, and it outperforms existing methods on benchmark datasets such as TESS and EMO-DB. The findings underscore the potential of transformer-based approaches to achieve high accuracy in SER tasks, paving the way for more effective human-computer interaction systems.