Depression detection methods based on multimodal fusion of voice and text

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-03524-4
PMID: https://pubmed.ncbi.nlm.nih.gov/40593978
Publication Date: 2025-07-01
Author(s): Zhongwen Xu et al.
Primary Topic: Emotion and Mood Recognition

Overview

This paper presents a fusion model for automated depression detection that integrates bimodal voice and text data, addressing a limitation of traditional diagnostic methods, which often rely on subjective assessments. Features are extracted with pre-trained models (Wav2Vec 2.0 for voice, BERT for text), then fused and classified by a multi-scale convolutional layer followed by a Bi-LSTM network. Compared with single-modal approaches, the model achieves higher F1 scores and lower root mean square error (RMSE) on the CMDC and DAIC-WOZ datasets. On CMDC, the F1 score improves by 0.0103 over the voice-only model and by 0.2017 over the text-only model, while RMSE decreases by 0.5186. On DAIC-WOZ, the F1 score increases by 0.0645 and 0.2589, respectively, and RMSE decreases by 1.9901.
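Since the comparison above is stated in terms of F1, a minimal sketch of how that metric is computed from confusion counts may help. The counts used here are hypothetical and are not taken from the paper; the function name `f1_score` is an illustrative choice.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the "depressed" class (illustration only).
print(round(f1_score(tp=40, fp=5, fn=3), 4))  # 0.9091
```

Because F1 ignores true negatives, it is less flattering than accuracy when the positive (depressed) class is the minority, which is one reason it is commonly reported for screening tasks.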

The study concludes that the proposed fusion model effectively captures temporal correlations and makes better use of features from both modalities, simplifying feature extraction and expanding the model’s receptive field. Although the results indicate robust performance in depression detection and severity prediction, the authors acknowledge limitations: dataset constraints, potential overfitting, and the computational complexity of the models used. Future work will explore additional modalities, such as video and image data, and model compression techniques to improve efficiency and practical deployment. Overall, the study underscores the promise of multimodal fusion for automated mental health diagnostics.
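The idea of widening the receptive field with multi-scale convolution can be illustrated with a small pure-Python sketch. This is not the authors' layer: the kernel sizes (1, 3, 5), the fixed averaging weights, and the function names are all assumptions chosen for illustration, and a trained layer would learn its weights.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    pad = k // 2  # assumes an odd kernel size
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernel_sizes=(1, 3, 5)):
    """Run one branch per kernel size and concatenate the outputs per time step."""
    branches = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # placeholder averaging weights (assumption)
        branches.append(conv1d_same(signal, kernel))
    # zip(*) groups the branch outputs so each time step has one value per scale.
    return [list(step) for step in zip(*branches)]


# A single impulse in the middle of a 5-step sequence.
features = multi_scale_features([0.0, 0.0, 1.0, 0.0, 0.0])
for step in features:
    print([round(v, 3) for v in step])
```

At the first time step only the width-5 branch responds to the impulse two steps away, which is the sense in which larger kernels extend the receptive field while smaller ones keep local detail.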

Methods

This section describes how the authors evaluate their multimodal fusion approach for depression classification, comparing it against existing techniques on the CMDC and DAIC-WOZ datasets. The results, summarized in Tables 5 and 6, indicate that the method outperforms a previously reported attention-based multimodal fusion method, particularly on the F1 score, which is crucial for effective depression classification. Specifically, the approach achieves an F1 score of 96.66%, exceeding the cross-attention baseline by 12.66 percentage points and the LSTM-CNN architecture by 11.66 percentage points.

The proposed method also performs well on regression, achieving a mean absolute error (MAE) of 2.8793 and a root mean square error (RMSE) of 3.8199 for PHQ-8 score prediction, a significant reduction in error compared with LSTM-CNN methods. The findings suggest that the optimized model structure extracts multi-scale information efficiently, which benefits both classification and regression. The authors emphasize the method’s robustness and generalizability across diverse patient populations, which makes it a candidate tool for remote depression screening, particularly in resource-limited settings. Overall, the framework’s balanced performance across tasks positions it as a promising way to improve depression screening workflows in a range of healthcare environments.
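For readers less familiar with the two regression metrics, the following sketch shows how MAE and RMSE are computed. The PHQ-8 scores below are hypothetical (PHQ-8 totals range from 0 to 24) and do not come from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more heavily than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical PHQ-8 totals (illustration only).
true_scores = [4, 10, 17, 2]
predicted = [6, 9, 13, 2]

print(mae(true_scores, predicted))             # 1.75
print(round(rmse(true_scores, predicted), 4))  # 2.2913
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the reported pair of values (3.8199 versus 2.8793); a wider gap between the two suggests a few large errors rather than uniformly moderate ones.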

Results

The results demonstrate the effectiveness of the proposed fusion model, which combines a Multi-Scale Convolution (MSC) module with a Bi-LSTM module to process multimodal data. Ablation experiments assess the contribution of each module. The MSC network excels at multimodal data processing, achieving an F1-score of 0.9673, which is credited to its hierarchical feature extraction. The Bi-LSTM model is proficient at temporal sequence modeling, with an accuracy of 0.9613 and an F1-score of 0.9605 on unimodal audio, but it struggles with concatenated multimodal features, where its F1-score drops to 0.9213. The integrated MSC+Bi-LSTM framework combines the strengths of both modules and yields the best F1-score, 0.9708, an improvement of 0.35 percentage points over the MSC alone and a reported 2.67% gain over the standalone Bi-LSTM.
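The ablation gains can be checked directly from the F1-scores quoted in this summary. The sketch below only restates those figures; the dictionary labels and the helper name `gain_in_points` are illustrative choices.

```python
# F1-scores as quoted in this summary of the ablation experiments.
f1 = {
    "MSC (multimodal)": 0.9673,
    "Bi-LSTM (audio only)": 0.9605,
    "Bi-LSTM (multimodal)": 0.9213,
    "MSC + Bi-LSTM (multimodal)": 0.9708,
}


def gain_in_points(baseline: float, improved: float) -> float:
    """Absolute F1 gain expressed in percentage points."""
    return round((improved - baseline) * 100, 2)


combined = f1["MSC + Bi-LSTM (multimodal)"]
for name, score in f1.items():
    if score != combined:
        print(f"{name}: +{gain_in_points(score, combined)} points")
```

The gain over the MSC alone comes out at 0.35 points, matching the reported figure. The gains over the two Bi-LSTM scores quoted here are 1.03 and 4.95 points, so the reported 2.67% presumably refers to a Bi-LSTM configuration or calculation that this summary does not spell out.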

The model’s robustness is further validated on the DAIC-WOZ dataset, where it also outperforms existing methods. Results differ only slightly between datasets: CMDC yields marginally higher F1-scores (a 0.4% difference), attributed to differences in data distribution. Notably, CMDC has a more balanced split of healthy and depressed samples (56.5% vs. 43.5%), whereas DAIC-WOZ is more imbalanced (69% vs. 31%). Despite these differences, the fusion model performs consistently well on both datasets, underscoring its adaptability and effectiveness in heterogeneous feature analysis.
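To see why class balance matters when comparing the two datasets, consider a trivial classifier that labels every sample as healthy. The sketch below uses 1,000 hypothetical samples per dataset split according to the percentages above, and assumes the larger share is the healthy class in both cases.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


def always_healthy(n_healthy, n_depressed):
    """Score a classifier that predicts 'healthy' (0) for every sample."""
    y_true = [0] * n_healthy + [1] * n_depressed
    y_pred = [0] * len(y_true)
    return evaluate(y_true, y_pred)


# Hypothetical 1,000-sample splits mirroring the reported class shares.
print("CMDC-like split:    ", always_healthy(565, 435))
print("DAIC-WOZ-like split:", always_healthy(690, 310))
```

The do-nothing classifier gains accuracy simply because the split is more skewed, while its F1 for the depressed class stays at zero. This is why F1 is the more informative metric on the more imbalanced dataset.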

Discussion

The discussion reviews advances in depression recognition across modalities, covering both physiological data (e.g., heart rate variability, electrocardiograms) and non-physiological data (e.g., voice and text characteristics) as means of identifying depressive states. Previous studies have shown that physiological indicators correlate strongly with depression, whereas non-physiological data, although easier to collect, are limited in how accurately they capture the emotional nuances of depressed individuals. The authors note that single-modality approaches have yielded promising results but often lack generalizability. Consequently, there is growing interest in multimodal approaches that integrate audio and text data to improve detection accuracy.

The paper also critiques existing multimodal studies, pointing out that many rely on datasets that do not include clinically diagnosed patients, which limits their applicability. The authors propose a fusion model that uses BERT for text feature extraction and Wav2Vec 2.0 for audio features, integrating the two through a multi-scale convolutional framework and a Bi-LSTM architecture. This approach aims to improve the model’s ability to capture the complex emotional states associated with depression. The results indicate that the proposed model significantly outperforms single-modality models, with higher accuracy and robustness in depression detection across the datasets evaluated.