Real-time speech emotion recognition using deep learning and data augmentation

Journal: Artificial Intelligence Review, Volume: 58, Issue: 2
DOI: https://doi.org/10.1007/s10462-024-11065-x
Publication Date: 2024-12-19
Author(s): Chawki Barhoumi et al.
Primary Topic: Emotion and Mood Recognition

Overview

The paper presents a Speech Emotion Recognition (SER) system that uses deep learning to identify human emotions from vocal intonation. The system applies two data augmentation methods, noise addition and spectrogram shifting, to increase the quality and diversity of the training data. It is evaluated on three datasets (TESS, EmoDB, and RAVDESS) using five acoustic features: Mel-Frequency Cepstral Coefficients (MFCC), Zero-Crossing Rate (ZCR), Mel spectrogram, Root Mean Square (RMS) value, and chroma. Three deep learning models are tested: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a hybrid model combining a CNN with Bidirectional Long Short-Term Memory (BiLSTM).

The CNN+BiLSTM model achieved the highest accuracy: 100%, 99.50%, and 90.12% on TESS, EmoDB, and RAVDESS, respectively. The MLP and CNN models also performed strongly, with average accuracies of 99.90% and 99.95% on TESS, respectively. The study highlights the effectiveness of combining multiple acoustic features with neural network architectures, and the role of data augmentation in improving SER performance. Future research directions include exploring additional acoustic features, evaluating the system on larger and more diverse datasets, and extending it to other languages and cultural contexts, which could broaden its applicability in fields such as human-computer interaction and psychology.

Introduction

The introduction discusses the evolving relationship between humans and machines and the importance of improving communication through Human-Computer Interaction (HCI). It notes that emotion recognition has become a critical component of HCI applications and that emotions can be inferred through several modalities, including facial expressions, physiological signals, and speech. The paper focuses on Speech Emotion Recognition (SER), which aims to deduce emotions from speech signals. The authors point out the challenges of accurately detecting and processing emotions in speech, a capability that is essential for improving user experience in applications such as call centers, aviation safety, gaming, and mental health services.

The paper reviews advances in SER, particularly in feature extraction and classification methods. It distinguishes between global and local features, with prosodic and spectral features being the most commonly used. The authors note a shift from classical machine learning algorithms to deep learning techniques, which have shown superior performance on SER tasks. They propose a hybrid deep learning model that combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) network, alongside advanced feature extraction methods. The approach is validated on standard datasets (TESS, RAVDESS, EmoDB) and strengthened with data augmentation techniques such as noise addition and spectrogram shifting. The stated goal is to improve the accuracy and robustness of SER systems in real-time applications.

Methods

The proposed Speech Emotion Recognition (SER) methodology is outlined in Section 3 and consists of several key steps: data collection (3.1), data preparation (3.2), feature extraction (3.4), model development using deep learning techniques (3.5), model training and testing, and classification. Three distinct models are used: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a hybrid CNN with Bidirectional Long Short-Term Memory (BiLSTM). Five acoustic features are extracted as model inputs: Mel-Frequency Cepstral Coefficients (MFCC), Zero-Crossing Rate (ZCR), Mel spectrogram, Root Mean Square (RMS), and chroma.
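
Two of the five features, ZCR and RMS, are simple enough to illustrate directly. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the frame length of 2048 samples, and the hop length of 512 samples are assumptions chosen for illustration, and a signal-processing library would normally be used in practice.

```python
import math

def frames(signal, frame_length, hop_length):
    """Split a list of samples into overlapping frames (no padding at the end)."""
    last_start = len(signal) - frame_length
    return [signal[i:i + frame_length] for i in range(0, last_start + 1, hop_length)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)

def rms(frame):
    """Root mean square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def extract_features(signal, frame_length=2048, hop_length=512):
    """Return [mean ZCR, mean RMS] over all frames (hypothetical feature layout)."""
    fs = frames(signal, frame_length, hop_length)
    if not fs:
        raise ValueError("signal is shorter than one frame")
    mean_zcr = sum(zero_crossing_rate(f) for f in fs) / len(fs)
    mean_rms = sum(rms(f) for f in fs) / len(fs)
    return [mean_zcr, mean_rms]
```

For a pure 100 Hz sine sampled at 16 kHz, this yields a ZCR near 200/16000 and an RMS near 1/sqrt(2), which is a quick sanity check for either function.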

The overall architecture of the proposed SER system is illustrated in Figure 1, with each component elaborated upon in subsequent sections, from data collection to classification. The systematic approach aims to optimize the recognition of emotional states from speech, leveraging advanced deep learning techniques and diverse feature sets to achieve robust classification outcomes.

Results

This section presents the results of the different Speech Emotion Recognition (SER) approaches, evaluated with accuracy, loss, recall, precision, F-score, and confusion matrices. Model performance is analyzed across three datasets: EmoDB, TESS, and RAVDESS. The comparative study in Table 5 shows that the CNN model generally has the shortest training time but also the lowest accuracy. The CNN+BiLSTM model achieves slightly higher accuracy than the MLP model but takes longer to train. Longer training does not always translate into higher accuracy, however: on EmoDB, the CNN+BiLSTM model achieves the highest accuracy with a shorter training time than the CNN model.

The classification reports in Tables 6, 7, and 8 indicate that the CNN+BiLSTM model achieves precision, recall, and F-score values of 100% for most emotion categories in the TESS and EmoDB datasets, with the exception of Happy, which scores 98%. On RAVDESS, the reported per-class values range from 80% to 100%. The MLP model also performs well, with precision, recall, and F-scores between 95% and 100%. The CNN model performs less well on RAVDESS, particularly for the neutral category, where precision and F-score are 70% and 76%, respectively. Overall, the findings suggest that combining a CNN with a BiLSTM is effective for emotion classification from speech signals, though performance varies by dataset and emotion category. Plots of model performance across 100 epochs provide insight into training behavior and help identify potential overfitting or underfitting.
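
The per-class metrics in such classification reports can all be derived from a confusion matrix. The sketch below shows one way to compute them in plain Python. The function names and the toy matrix in the usage note are illustrative assumptions and are not taken from the paper's tables. Note that "support" here is a sample count per class, not a percentage.

```python
def classification_report(confusion, labels):
    """Per-class precision, recall, F1, and support.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(labels)
    report = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum = support
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual_k}
    return report

def accuracy(confusion):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total
```

With a made-up two-class matrix `[[8, 2], [1, 9]]` and labels `["happy", "neutral"]`, the "happy" class has precision 8/9, recall 8/10, and support 10, and the overall accuracy is 17/20.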

Discussion

This section describes the data collection and preparation processes for the speech emotion recognition (SER) study, which uses three distinct datasets: TESS, RAVDESS, and EmoDB. TESS comprises 2,800 audio samples of target words spoken in various emotional contexts, while RAVDESS includes 1,440 files from 24 professional actors expressing a range of emotions at varying intensities. EmoDB, recorded in an anechoic chamber, contains 535 utterances reflecting different emotional states. Data preparation involved converting the audio files into time-series representations, annotating samples with emotion labels, and applying data augmentation techniques such as noise addition and spectrogram shifting to improve model robustness.
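
The two augmentation techniques can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the summary does not give the noise level or the shift direction, so the signal-to-noise parameter `snr_db`, the time-axis shift, and the zero fill value are all hypothetical choices.

```python
import math
import random

def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def shift_spectrogram(spec, steps, fill=0.0):
    """Shift a spectrogram along the time axis by `steps` frames.

    `spec` is a list of frequency rows, each a list of time frames.
    Positive steps shift right (later in time); vacated frames get `fill`.
    """
    shifted = []
    for row in spec:
        n = len(row)
        if steps >= 0:
            k = min(steps, n)
            shifted.append([fill] * k + row[:n - k])
        else:
            k = min(-steps, n)
            shifted.append(row[k:] + [fill] * k)
    return shifted
```

Each augmented copy keeps the emotion label of the sample it was derived from, so the training set grows without any new annotation work.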

The section also covers the feature extraction methods used: Mel-Frequency Cepstral Coefficients (MFCC), Mel spectrograms, Zero-Crossing Rate (ZCR), Root Mean Square (RMS), and chroma features. These techniques aim to reduce dimensionality while retaining the information most relevant to the emotional content of speech. The proposed SER models are a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a hybrid CNN-BiLSTM model. Each model is designed to leverage the strengths of classical and deep learning approaches to improve the accuracy of emotion prediction from audio data. Combining data augmentation with advanced feature extraction is expected to improve the models' ability to generalize in real-world applications.
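
Both MFCC and Mel spectrograms rely on the mel scale, which compresses high frequencies to approximate human pitch perception. The sketch below uses the widely cited formula mel = 2595 * log10(1 + f / 700). Whether the paper's toolchain uses this exact variant is an assumption, since some libraries default to a different mel formula, and the function names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies in Hz of n_mels bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]
```

Because the bands are evenly spaced in mels, their spacing in Hz widens toward higher frequencies, which is how a Mel spectrogram represents a wide frequency range with relatively few bands.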