Depression recognition using voice-based pre-training model

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-63556-0
PMID: https://pubmed.ncbi.nlm.nih.gov/38830969
Publication Date: 2024-06-03
Author(s): Xiangsheng Huang et al.
Primary Topic: Emotion and Mood Recognition

Overview

The study presents an approach to early depression screening from voice data that addresses the challenge of limited dataset size. It uses the wav2vec 2.0 model as a feature extractor to capture high-quality voice features from raw audio, and a small fine-tuning network for classification. On the DAIC-WOZ dataset, the authors report a binary classification accuracy of 0.9649 with a root mean square error (RMSE) of 0.1875, and a multi-class classification accuracy of 0.9481 with an RMSE of 0.3810. They conclude that the method generalizes well, which would make it a practical tool for clinicians in the early detection of depression.

In conclusion, the study reports a significant improvement in model performance and generalization for depression recognition, achieved without complex preprocessing techniques such as feature extraction or data augmentation. The proposed method outperforms traditional machine learning and deep learning strategies, particularly on the noisy DAIC-WOZ dataset. Future research aims to incorporate additional datasets to improve the model’s robustness and generalizability, and to validate the model on unseen data.

Methods

The research employs the DAIC-WOZ dataset, part of the Distress Analysis Interview Corpus, which consists of 189 audio files featuring dialogues between patients and a virtual agent named Ellie. The dataset is structured into training, validation, and test sets, with audio IDs ranging from 300 to 492, excluding specific IDs due to technical issues. It includes patient IDs, binary labels, and PHQ-8 scores, which are visualized in the accompanying figures and tables.

Two experimental tasks were designed: binary classification, which identifies the presence of depression (labeled as ‘dep’ or ‘ndep’), and multi-class classification, which categorizes depression severity into four levels based on discretized PHQ-8 scores: non, mild, moderate, and severe. To account for class imbalance, model performance is evaluated with five metrics: accuracy, precision, recall, F1 score, and root mean square error (RMSE). Together these metrics give a comprehensive picture of the model’s predictive capabilities. RMSE quantifies the deviation between predicted and actual values, so lower values indicate a better fit.
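The five metrics and the PHQ-8 discretization can be sketched in plain Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the PHQ-8 cut-offs (5, 10, 15) are an assumption because the summary does not state the ones used, and RMSE is assumed to be computed on integer class indices.

```python
import math

def severity_level(phq8_score, cutoffs=(5, 10, 15)):
    """Map a PHQ-8 score (0-24) to one of four levels.

    The cut-offs are ASSUMED for illustration; the summary does not
    state the thresholds used in the study.
    """
    names = ("non", "mild", "moderate", "severe")
    # Count how many cut-offs the score reaches; that count is the level index.
    level = sum(phq8_score >= c for c in cutoffs)
    return names[level]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean square error over numeric targets (here: class indices)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy binary example with an imbalanced label distribution.
y_true = ["dep", "ndep", "ndep", "dep", "ndep", "ndep"]
y_pred = ["dep", "ndep", "dep", "ndep", "ndep", "ndep"]
acc = accuracy(y_true, y_pred)
prec, rec, f1 = precision_recall_f1(y_true, y_pred, positive="dep")
err = rmse([int(t == "dep") for t in y_true], [int(p == "dep") for p in y_pred])
```

If RMSE is computed on 0/1 class indices as assumed here, it equals the square root of one minus the accuracy. For the reported binary accuracy of 0.9649 that gives about 0.187, close to the reported RMSE of 0.1875.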

Discussion

In recent years, various methodologies have emerged for automated depression detection, drawing on indicators that range from biomarkers to multimodal signals. Notable studies include Wollenhaupt-Aguiar et al., who focused on biomarkers, and Zhou et al., who employed a time-aware attention multimodal fusion network (TAMFN) based on social media data. Other approaches leveraged visual cues, EEG data, and electrodermal activity; these highlight the challenges of data acquisition and of environmental factors that affect signal accuracy. The use of voice signals for depression detection has gained traction, with studies demonstrating distinct acoustic characteristics in depressed individuals. Traditional machine learning methods have been complemented by deep learning techniques, which have shown improved performance by extracting meaningful features from voice data.

The current study advances depression detection by fine-tuning the wav2vec 2.0 model on the DAIC-WOZ dataset, improving classification accuracy without relying on complex preprocessing techniques. The authors report an overall accuracy of 96.49% in binary classification and 94.81% in multi-class classification. The findings indicate that freezing the parameters of the wav2vec feature encoder and using average pooling both contribute to the improved performance. According to the authors, the approach not only outperforms existing methods but also addresses limited data availability and the complexity of feature extraction, paving the way for future studies that incorporate additional datasets for improved robustness and generalizability.
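The average pooling step can be illustrated in isolation. The sketch below assumes the encoder has already produced one feature vector per audio frame and simply averages them over time, so that recordings of different lengths map to a fixed-size vector for the classifier. The function name and the toy vectors are hypothetical, and the frozen encoder itself is not modelled.

```python
def mean_pool(frames):
    """Average a sequence of frame vectors over the time axis.

    frames: list of T lists, each of length D (one vector per audio frame).
    Returns one list of length D.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]

# Two toy "utterances" of different length yield vectors of the same size.
short_clip = [[1.0, 2.0], [3.0, 4.0]]
long_clip = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
pooled_short = mean_pool(short_clip)
pooled_long = mean_pool(long_clip)
```

Because the output size depends only on the feature dimension, a small classification head can sit on top of the pooled vector regardless of clip duration.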

Limitations

The limitations of the study stem primarily from the characteristics of the DAIC-WOZ dataset, which was created in a controlled environment and includes samples from diverse age groups. This variability in age affects the quality of voice features, resulting in discrepancies between the dataset samples and those encountered in real-world applications. Furthermore, the dataset comprises only 189 samples, which is inadequate for robust training. To mitigate this limitation, the researchers segmented and merged the voice data, expanding the dataset to 6,545 samples while minimizing the introduction of extraneous noise.
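The summary does not describe the segment-and-merge procedure in detail, so the following is only one plausible reading, stated as an assumption: the participant's speech spans are concatenated into a single stream, which is then cut into fixed-length segments. The helper names, the spans, and the segment length are all hypothetical.

```python
def merge_spans(waveform, spans):
    """Concatenate the samples inside each (start, end) span, in order."""
    merged = []
    for start, end in spans:
        merged.extend(waveform[start:end])
    return merged

def split_fixed(samples, segment_len):
    """Cut samples into non-overlapping segments and drop a short tail."""
    n_full = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n_full)]

# Toy waveform of 20 samples; the spans stand in for participant turns
# (ASSUMED: interviewer speech and silence lie outside the spans).
waveform = list(range(20))
spans = [(2, 6), (10, 17)]
merged = merge_spans(waveform, spans)   # 4 + 7 = 11 samples
segments = split_fixed(merged, 4)       # 2 full segments, 3 samples dropped
```

Under this reading, one long interview yields many training samples, which is how a corpus of 189 recordings could grow to several thousand segments.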

Additionally, limited computational resources restricted the maximum batch size for model training to 4. To enhance training efficiency, the researchers used gradient accumulation, which raised the effective batch size to 8. Despite these efforts, the study acknowledges the need for a larger dataset to improve the generalizability and performance of the model.
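The effect of gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss (both chosen for illustration, not taken from the study) and shows that averaging the gradients of two micro-batches of 4 gives the same gradient as a single batch of 8. All names are hypothetical.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch=4, accum_steps=2):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    total = 0.0
    for step in range(accum_steps):
        lo = step * micro_batch
        hi = lo + micro_batch
        # Scaling keeps the result equal to the mean over all samples.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]   # toy targets generated with w = 2
w = 0.5

full = mse_grad(w, xs, ys)            # one batch of 8
accum = accumulated_grad(w, xs, ys)   # two micro-batches of 4

# A single optimizer step is taken only after both micro-batches.
learning_rate = 0.001
w_next = w - learning_rate * accum
```

Only 4 samples need to be held in memory at a time, while the weight update matches what a batch of 8 would produce for losses that are averaged over samples.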