Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-60278-1
Publication Date: 2024-04-25
Author(s): Xu Zhang et al.
Primary Topic: Emotion and Mood Recognition

Overview

The paper addresses depression detection from speech, a task hampered by the scarcity of annotated data. To mitigate this, the authors propose a transfer learning approach that fine-tunes the wav2vec 2.0 model to extract depression-related features. A 1D-CNN combined with attention pooling improves the model's ability to capture temporal relationships in the audio, and an LSTM with self-attention increases its sensitivity to depression-related segments. The model achieves F1 scores of 79% on the DAIC-WOZ dataset and 90.53% on the CMDC dataset, outperforming existing baseline models.
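For reference, the F1 score reported above is the harmonic mean of precision $P$ and recall $R$. The definition below is the standard one, written in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$). This summary does not state which averaging scheme the authors applied across classes, so that detail is left open here.

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
$$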

The study finds that fine-tuning all layers of wav2vec 2.0 improves performance significantly over fine-tuning only the last layer. Attention pooling proves more effective than traditional pooling methods, and the self-attention module is confirmed to be crucial for capturing relevant information in speech segments. Future work will explore advanced feature extraction techniques, integrate multiple acoustic features, and address real-time deployment through lightweight models or pruning. The authors also plan to incorporate electroencephalography (EEG) signals to build a more comprehensive depression detection method, with the aim of improving early diagnosis and treatment outcomes for patients.
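To make the pooling comparison concrete, the following is a minimal pure-Python sketch of mean pooling next to a simple form of attention pooling. It is not the authors' implementation: the function names `mean_pool` and `attention_pool` and the single scoring vector `query` are assumptions introduced for illustration, and in a real model `query` would be a learned parameter.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame vectors over time; every frame gets equal weight."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def attention_pool(
    frames: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weight each frame by softmax(frame . query), then sum over time.

    `query` stands in for a learned scoring vector (an assumption here).
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights
```

With an all-zero `query` every frame scores the same, so attention pooling reduces to mean pooling. A non-zero `query` lets frames that align with it dominate the pooled vector, which is the property the comparison above relies on.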

Methods

The experiments were run on Linux with an NVIDIA V100 GPU and the PyTorch framework. The pre-trained audio model was fine-tuned with a learning rate of $1 \times 10^{-5}$, while a higher rate of $0.006$ was applied for the downstream task. Training used the Adam optimizer with a weight decay of $0.001$, a batch size of 32, and a maximum of 200 epochs. An early-stopping mechanism halted training when no significant improvement was observed on the validation set for 10 consecutive epochs. The IS09 emotion acoustic feature set was extracted with the openSMILE toolkit. It comprises 16 low-level descriptors (LLDs) and their first-order differences; applying the set's 12 statistical functionals to these 32 contours yields a 384-dimensional sentence-level feature representation ($32 \times 12 = 384$).
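The training setup described above can be sketched as a small configuration object plus an early-stopping helper. This is a framework-independent illustration under stated assumptions, not the authors' code: the names `TrainConfig`, `EarlyStopper`, and `epochs_run` are hypothetical, and `min_delta` is an assumed stand-in for what counts as a "significant" improvement, which the summary does not quantify.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrainConfig:
    """Hyperparameters as reported in the summary."""
    pretrained_lr: float = 1e-5   # learning rate for the wav2vec 2.0 layers
    downstream_lr: float = 0.006  # learning rate for the downstream task
    weight_decay: float = 0.001   # Adam weight decay
    batch_size: int = 32
    max_epochs: int = 200
    patience: int = 10            # epochs without improvement before stopping
    min_delta: float = 0.0        # assumption: threshold for "significant"


class EarlyStopper:
    """Tracks the best validation score and signals when to stop."""

    def __init__(self, patience: int, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_score: float) -> bool:
        """Record one epoch's score; return True if training should stop."""
        if val_score > self.best + self.min_delta:
            self.best = val_score
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def epochs_run(val_scores: Sequence[float], config: TrainConfig) -> int:
    """Return how many epochs run before early stopping or max_epochs."""
    stopper = EarlyStopper(config.patience, config.min_delta)
    capped = list(val_scores)[: config.max_epochs]
    for epoch, score in enumerate(capped, start=1):
        if stopper.step(score):
            return epoch
    return len(capped)
```

In a PyTorch training loop, the two learning rates would typically be passed to the optimizer as separate parameter groups, and `EarlyStopper.step` would be called once per epoch with the validation metric.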

The authors also ran comparative experiments on two datasets in different languages, DAIC-WOZ (English) and CMDC (Chinese), and assessed the impact of different input features under a consistent model framework. These experiments were intended to validate the robustness and effectiveness of the method.

Discussion

The authors review existing methods for speech-based depression detection (SDD) and the use of transfer learning to address data scarcity. Early approaches relied primarily on handcrafted features, which, while effective, often required specialized knowledge and suffered from issues such as feature redundancy. Recent advances in deep learning have shifted the focus toward automatic feature extraction, with models such as the Transformer encoder and CNNs showing improved adaptability and efficiency in capturing depression-related features. Notably, studies have shown that segmenting speech into smaller intervals can improve model performance, although many prior works did not evaluate entire speech segments, which limits their findings.
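The idea of segmenting speech into smaller intervals can be illustrated with a short sketch. The helper name `segment_waveform`, the zero-padding of the final window, and any particular segment length or hop are assumptions made for illustration; the summary does not specify how the authors segment their recordings.

```python
from typing import List, Sequence


def segment_waveform(
    samples: Sequence[float], segment_len: int, hop: int
) -> List[List[float]]:
    """Split a waveform into fixed-length windows.

    Windows start every `hop` samples; the final window is zero-padded
    so that all segments share the same length.
    """
    if segment_len <= 0 or hop <= 0:
        raise ValueError("segment_len and hop must be positive")
    if not samples:
        return []
    segments: List[List[float]] = []
    start = 0
    while True:
        chunk = list(samples[start:start + segment_len])
        chunk += [0.0] * (segment_len - len(chunk))  # pad the last window
        segments.append(chunk)
        if start + segment_len >= len(samples):
            break  # this window already reaches the end of the signal
        start += hop
    return segments
```

Setting `hop` equal to `segment_len` gives non-overlapping segments, while a smaller `hop` gives overlapping ones. Each segment can then be passed through the feature extractor separately before the segment-level outputs are combined.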

To mitigate the challenges posed by limited datasets, the authors highlight transfer learning, which adapts a model trained on one dataset to another and thereby improves performance in the target domain. Several studies have successfully applied such strategies, for example by fine-tuning models such as wav2vec 2.0 that were pretrained on large unlabeled datasets, to improve depression detection. The authors propose a novel framework that integrates deep learning techniques, including LSTM and self-attention mechanisms, to predict depression status from speech. The framework emphasizes segment-level feature extraction and temporal information, with the aim of optimizing wav2vec 2.0 for SDD and thereby addressing the low-resource challenge.
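As a rough illustration of how self-attention lets each segment draw on information from the others, here is a minimal pure-Python sketch of scaled dot-product self-attention over segment-level feature vectors. It assumes identity query, key, and value projections for brevity, whereas a trained model would learn these projections, and it omits the LSTM entirely. The function names are hypothetical and this is not the authors' architecture.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(segments: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output vector is a weighted average of all segment vectors,
    weighted by similarity to the segment at that position.
    """
    dim = len(segments[0])
    scale = math.sqrt(dim)
    outputs: List[List[float]] = []
    for query in segments:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in segments
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, segments))
            for d in range(dim)
        ])
    return outputs
```

Because every output is a convex combination of the inputs, segments that resemble one another reinforce each other, which is one intuition for why such a module can emphasize depression-related segments.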