DepITCM: an audio-visual method for detecting depression

Journal: Frontiers in Psychiatry, Volume: 15
DOI: https://doi.org/10.3389/fpsyt.2024.1466507
PMID: https://pubmed.ncbi.nlm.nih.gov/39917382
Publication Date: 2025-01-23
Author(s): Lishan Zhang et al.
Primary Topic: Mental Health via Writing

Overview

The paper introduces DepITCM, a depression detection model that applies multi-task representation learning to audio and visual data. It targets a limitation of existing deep learning approaches, which do not fully extract and integrate multimodal information across three dimensions: time, channel, and space. The model consists of a data preprocessing module, an Inception-Temporal-Channel Principal Component Analysis Module (ITCM Encoder), and a multi-task learning module. The ITCM Encoder uses a staged feature extraction strategy that captures both global and local features, which strengthens the model’s ability to fuse temporal, channel, and spatial information.

On the AVEC2017 and AVEC2019 datasets, DepITCM achieves F1 scores of 0.823 and 0.816 and classification accuracies of 0.823 and 0.810, respectively. On the regression tasks it records root mean square errors (RMSE) of 6.10 and 4.89. According to the authors, these results exceed those of most existing methods in both classification and regression and show the benefit of multi-task learning for depression detection. The study concludes that DepITCM substantially improves feature representation and detection performance, but that further optimization and additional modalities are still needed.

Introduction

The introduction highlights the rising prevalence of depression, which the World Health Organization projects will be the leading global health issue by 2030. It stresses the importance of early detection and treatment, noting that current clinical diagnosis relies heavily on subjective assessment through structured interviews and standardized scales, which may overlook nonverbal cues of depression. The paper identifies a gap in existing research: previous studies focus mainly on global feature extraction and neglect local features, which the authors consider crucial for accurate detection.

To address these limitations, the authors propose DepITCM, a multi-task representation learning model that integrates visual and audio features for depression detection. The model uses an Inception-Temporal-Channel Principal Component Analysis Module (ITCM Encoder) to extract both global and local features across the temporal, channel, and spatial dimensions. The study also uses multi-task learning to exploit complementary information from different tasks and so improve detection accuracy. The model is validated on the AVEC2017 and AVEC2019 datasets, and ablation experiments assess the contribution of each component. The paper aims to provide a more comprehensive approach to understanding emotional states and to support early intervention in depressive symptoms.

Methods

The Methods section outlines the research framework, which the paper depicts in its Figure 2. The process begins with data preprocessing and feature screening; the screened features are then passed to DepITCM, the model designed for automatic depression detection.

After preprocessing, DepITCM is built around two main components: an ITCM encoder and a multi-task learning module. The ITCM encoder extracts the visual and audio features relevant to the detection task. The multi-task learning module trains the classification and regression tasks together so that each can benefit from information the other provides, with the aim of improving detection accuracy.
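The idea of training both tasks together can be illustrated with a joint loss. The following is a minimal sketch, not the authors' formulation: this summary does not state the loss functions or how the tasks are weighted, so the use of binary cross-entropy for classification, mean squared error for regression, and the weight `alpha` are all assumptions made for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy for a depressed / not-depressed head."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def mean_squared_error(s_true, s_pred):
    """Mean squared error for a severity-score regression head."""
    return sum((t - p) ** 2 for t, p in zip(s_true, s_pred)) / len(s_true)

def multi_task_loss(y_true, p_pred, s_true, s_pred, alpha=0.5):
    """Weighted sum of both task losses; alpha is a hypothetical weight."""
    cls_loss = binary_cross_entropy(y_true, p_pred)
    reg_loss = mean_squared_error(s_true, s_pred)
    return alpha * cls_loss + (1.0 - alpha) * reg_loss

# Two subjects: binary labels with predicted probabilities,
# and severity scores with predicted scores.
loss = multi_task_loss([1, 0], [0.9, 0.1], [10.0, 2.0], [9.0, 3.0])
print(round(loss, 4))
```

Because both terms are computed from the output of one shared encoder, gradients from either task update the shared features, which is the mechanism multi-task learning relies on.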

Results

The Results section reports the experimental validation of DepITCM on the AVEC2017 and AVEC2019 datasets. The main objectives were to assess the model’s effectiveness in depression detection and to measure the effect of the multi-task learning strategy on detection accuracy. The experiments comprised comparisons with existing methods, multi-task learning assessments, and ablation studies, all implemented in Keras and run on NVIDIA A800 GPUs.

For the classification tasks, the authors use accuracy, F1 score, recall, and precision as evaluation metrics. For the regression tasks, they use Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Together these metrics cover both the binary detection of depression and the prediction of a continuous depression severity score.
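The metrics named above have standard definitions, sketched below in plain Python. The labels and scores are made-up illustration values, not data from the paper, and treating "depressed" (label 1) as the positive class is an assumption; the summary does not say whether the reported F1 is per-class or averaged.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

def regression_metrics(s_true, s_pred):
    """Mean absolute error and root mean square error."""
    errors = [t - p for t, p in zip(s_true, s_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}

# Hypothetical predictions for six subjects (classification)
cls = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
# Hypothetical severity scores for three subjects (regression)
reg = regression_metrics([10.0, 4.0, 15.0], [12.0, 4.0, 11.0])
print(cls)
print(reg)
```

RMSE penalizes large errors more heavily than MAE, so reporting both shows whether a model's error is spread evenly or dominated by a few badly predicted subjects.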

Discussion

The Discussion section reviews the approach on the AVEC2017 and AVEC2019 datasets, which relies on visual and audio modalities rather than on text, the predominant modality in prior studies. The AVEC2017 dataset comprises 189 subjects, with visual features such as facial landmarks and eye gaze and audio features extracted with the COVAREP toolbox. The AVEC2019 dataset extends this to 275 subjects and adds audio features from the eGeMAPS set. The authors describe their preprocessing, which normalizes the features and filters them by variance to improve model performance.
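The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the summary does not give the normalization method, the order of the two steps, or the cutoff, so min-max scaling, filtering after scaling, and the `threshold` value of 0.01 are all hypothetical choices.

```python
def min_max_normalize(column):
    """Scale one feature column to [0, 1]; constant columns become all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def variance(column):
    """Population variance of one feature column."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

def preprocess(frames, threshold=0.01):
    """Normalize each feature, then keep features whose variance exceeds threshold.

    frames: list of time steps, each a list of feature values.
    Returns the filtered frames and the indices of the kept features.
    """
    columns = [list(col) for col in zip(*frames)]
    normalized = [min_max_normalize(col) for col in columns]
    kept = [i for i, col in enumerate(normalized) if variance(col) > threshold]
    filtered = [[normalized[i][t] for i in kept] for t in range(len(frames))]
    return filtered, kept

# Four time steps, three features; the second feature never changes.
frames = [
    [0.0, 5.0, 1.0],
    [1.0, 5.0, 1.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 100.0],
]
filtered, kept = preprocess(frames)
print(kept)  # the constant second feature (index 1) is dropped
```

Scaling before filtering puts every feature on the same range, so a single variance threshold is comparable across features; a constant feature carries no information about the subject and is removed.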

DepITCM combines an Inception Dilated Convolution (IDC) module for global feature extraction with a Temporal Channel Prior Convolutional Attention (TCPCA) module for local feature enhancement. The authors emphasize multi-task learning, reporting that the model outperforms its single-task counterparts by sharing features across the classification and regression tasks. Ablation experiments confirm that both the IDC and TCPCA modules contribute to performance. The authors acknowledge limitations, in particular the absence of a text modality, and suggest that future work integrate additional modalities to further improve detection accuracy. Overall, the findings indicate that DepITCM is a meaningful advance in multimodal depression detection.
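How dilated convolution widens the temporal context seen by each output can be shown with a small sketch. It illustrates the general technique of parallel dilated branches only: the kernel, the dilation rates (1, 2, 4), and the function names are hypothetical, and the sketch does not reproduce the actual IDC module or the TCPCA attention, whose details are not given in this summary.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1D convolution whose kernel taps are `dilation` steps apart.

    The output has the same length as the input.
    """
    k = len(kernel)
    span = (k - 1) * dilation          # distance between first and last tap
    pad_left = span // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (span - pad_left)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(signal))
    ]

def inception_dilated_block(signal, kernel, dilations=(1, 2, 4)):
    """Run parallel branches with different dilation rates.

    Returns one feature vector per time step, with one entry per branch.
    """
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [list(step) for step in zip(*branches)]

# A single impulse at index 4 shows how far each branch can "see".
signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]
features = inception_dilated_block(signal, [1.0, 1.0, 1.0])
for t, step in enumerate(features):
    print(t, step)
```

With a three-tap kernel, the branch with dilation 1 responds to the impulse only at neighboring time steps, while the branch with dilation 4 responds four steps away. Larger dilation rates therefore capture longer-range (more global) context without adding parameters.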