Diagnosis of depression based on facial multimodal data

Journal: Frontiers in Psychiatry, Volume: 16
DOI: https://doi.org/10.3389/fpsyt.2025.1508772
PMID: https://pubmed.ncbi.nlm.nih.gov/39935533
Publication Date: 2025-01-28
Author(s): Na Jin et al.
Primary Topic: Emotion and Mood Recognition

Overview

The paper addresses the challenge of diagnosing depression, a significant mental health issue, by proposing a deep learning approach that uses multimodal data, specifically facial video and audio inputs. Traditional diagnostic methods often suffer from subjectivity and high misdiagnosis rates, which motivates objective, automated tools. The study introduces a model that uses a spatio-temporal attention module to improve visual feature extraction and combines a Graph Convolutional Network (GCN) with a Long Short-Term Memory (LSTM) network to analyze audio features. Fusing the features of the two modalities lets the model identify patterns associated with depression.

Experiments on the Extended Distress Analysis Interview Corpus (E-DAIC) dataset show a Mean Absolute Error (MAE) of 3.51 in estimating Patient Health Questionnaire-8 (PHQ-8) scores from recorded interviews. The findings indicate that the proposed TSNet-DD architecture, which applies a spatio-temporal attention mechanism to video data and a combination of GCN and LSTM to audio data, is promising for early depression assessment. The authors suggest that future research use larger datasets and more diverse modalities to improve recognition accuracy, classify different levels of depression, and support clinicians in diagnosis and treatment.
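To make the MAE figure concrete, the following minimal sketch shows how the metric is computed for score regression. The score lists are hypothetical and are not data from the study; the 0 to 24 range is the standard PHQ-8 total range and is background knowledge rather than something stated in this summary.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical PHQ-8 totals (valid range 0-24); NOT data from the study.
true_scores = [4, 12, 7, 19, 0]
predicted_scores = [6.0, 9.5, 7.0, 14.0, 3.0]

mae = mean_absolute_error(true_scores, predicted_scores)
print(f"MAE = {mae:.2f}")  # MAE = 2.50
```

Read this way, an MAE of 3.51 means that predicted PHQ-8 totals differ from the questionnaire totals by about 3.5 points on average.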

Methods

The proposed method uses a multimodal framework that integrates visual and audio data extracted from participant videos. Visual and audio information are first preprocessed separately; features are then extracted from each modality and fused into a single multimodal feature set. The framework uses a spatio-temporal attention module to extract facial behavior features, and Graph Convolutional Networks (GCN) together with Long Short-Term Memory (LSTM) networks to process audio features. According to the paper, this design improves the model’s ability to uncover hidden patterns in the data and, in turn, its classification accuracy.
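The two-branch flow described above (process each modality separately, then fuse) can be illustrated with a deliberately simplified sketch. It is not the authors' architecture: a plain dot-product attention pooling stands in for both the spatio-temporal attention module and the GCN plus LSTM audio branch, fusion is assumed to be simple concatenation, and every name, vector, and dimension is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(steps, query):
    """Weight each time step by softmax(step . query), then sum the steps.

    steps: list of T feature vectors, one per frame or audio segment
    query: hypothetical scoring vector (it would be learned in a real model)
    """
    scores = [sum(x * q for x, q in zip(step, query)) for step in steps]
    weights = softmax(scores)
    dim = len(steps[0])
    pooled = [sum(w * step[d] for w, step in zip(weights, steps))
              for d in range(dim)]
    return pooled, weights


def fuse(visual_vec, audio_vec):
    """Feature-level fusion by concatenation (an assumption, see lead-in)."""
    return list(visual_vec) + list(audio_vec)


# Toy inputs: 3 video frames with 2-d features, 2 audio segments with 3-d features.
visual_frames = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]
audio_segments = [[0.3, 0.1, 0.7], [0.6, 0.2, 0.4]]

visual_vec, visual_weights = attention_pool(visual_frames, query=[1.0, -1.0])
audio_vec, audio_weights = attention_pool(audio_segments, query=[0.5, 0.5, 0.5])
fused = fuse(visual_vec, audio_vec)
print(len(fused))  # 5 = 2 visual dims + 3 audio dims
```

In a trained model the fused vector would feed a classification or regression head; the sketch stops at fusion because the summary does not describe the head.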

The experiments ran on an NVIDIA RTX 3090 GPU, with the model trained in a Python 3.9 environment using PyTorch v1.12.0 and CUDA 11.6. The study used the Extended Distress Analysis Interview Corpus (E-DAIC) dataset, which includes clinical interviews from 219 participants annotated for depression severity with the Patient Health Questionnaire-8 (PHQ-8). The multimodal model significantly outperformed single-modality approaches, achieving an F1 score of 0.922, the highest across all experiments. Receiver Operating Characteristic (ROC) curve analysis confirmed this result: the multimodal fusion model yielded the highest Area Under the Curve (AUC) values. The findings suggest that integrating diverse data types improves the model’s ability to distinguish between depressed and non-depressed individuals while maintaining computational efficiency.
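For readers less familiar with the two reported classification metrics, here is a minimal sketch of how F1 and ROC AUC are computed for a binary depressed / non-depressed task. The labels, scores, and the 0.5 decision threshold are hypothetical and unrelated to the study's data; AUC is computed with the pairwise-ranking definition rather than by tracing the curve.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0))
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = depressed) and model scores; NOT data from the study.
labels = [1, 0, 1, 1, 0, 0]
model_scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in model_scores]

f1 = f1_score(labels, predictions)
auc = roc_auc(labels, model_scores)
print(f"F1 = {f1:.3f}, AUC = {auc:.3f}")  # F1 = 0.667, AUC = 0.667
```

F1 depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.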

Discussion

The discussion section highlights advances in multimodal approaches to depression diagnosis, emphasizing the integration of visual and audio data to improve diagnostic accuracy. Previous studies have explored methods such as 3D convolutional neural networks and attention mechanisms to extract features from facial expressions and speech patterns. However, many of these approaches have limitations, such as neglecting audio data or lacking a comprehensive feature fusion strategy. The proposed Temporal-Spatial Network for Depression Diagnosis (TSNet-DD) captures both temporal and spatial features through its Temporal-Spatial Attention Module (TSAM), while the combination of Graph Convolutional Networks (GCN) and Long Short-Term Memory (LSTM) models for audio processing extracts complex patterns related to depression.
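As a rough illustration of what a graph convolution contributes in the audio branch, the sketch below runs one GCN propagation step, ReLU(A_norm · H · W), using the widely used symmetric normalization with self-loops. The summary does not say which GCN variant the authors use or how they build the graph over audio features, so the chain graph, node features, and weight matrix here are all hypothetical, and the LSTM stage is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}: self-loops plus symmetric normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, features, weight):
    """One propagation step: each node mixes its neighbours' features, then ReLU."""
    a_norm = normalize_adjacency(adj)
    out = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical chain graph over 3 consecutive audio segments: 0 - 1 - 2.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
node_features = [[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]]
weight = [[1.0], [-1.0]]  # hypothetical 2 -> 1 projection (learned in practice)

hidden = gcn_layer(adjacency, node_features, weight)
print(hidden)  # one non-negative value per node
```

In a GCN plus LSTM design, per-node outputs such as `hidden` would typically be ordered in time and passed to the LSTM, which models how the graph-smoothed features evolve across the recording.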

The findings indicate that fusing video and audio features significantly improves diagnostic performance, underscoring the complementary nature of these modalities. Despite the promising results, the study acknowledges limitations, including its reliance on a single dataset (E-DAIC) and the need for further research to categorize different levels of depression and to integrate additional data types, such as text. The authors call for future work that broadens dataset diversity and explores new modalities to improve the model’s interpretability and diagnostic capabilities, with the aim of providing more robust support for clinicians in diagnosing and treating depression.