Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals

Journal: European Journal of Investigation in Health Psychology and Education, Volume: 15, Issue: 1
DOI: https://doi.org/10.3390/ejihpe15010009
PMID: https://pubmed.ncbi.nlm.nih.gov/39852192
Publication Date: 2025-01-18
Author(s): Inbar Levkovich
Primary Topic: Mental Health via Writing

Overview

This study compares the diagnostic accuracy and treatment recommendations of four large language models (LLMs), namely Gemini, Claude, ChatGPT-3.5, and ChatGPT-4, against established norms from mental health professionals for a range of mental health conditions, including depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Using text vignettes, the study found that the LLMs, particularly ChatGPT-4, achieved high diagnostic accuracy, with perfect rates for conditions such as depression and PTSD. Performance dropped in more complex cases: for early schizophrenia, ChatGPT-4 reached only 55% accuracy, indicating that human professionals outperformed the LLMs in these scenarios.

The LLMs tended to recommend a wider array of proactive treatments, whereas professionals suggested more targeted psychiatric consultations and medications. The LLMs also predicted lower rates of full recovery and higher rates of partial recovery, especially in untreated cases, while human experts were more optimistic about full recovery across conditions. These findings suggest that LLMs can provide valuable support in diagnostics and treatment planning but should not replace professional judgment, particularly given their conservative recovery predictions. The study underscores the potential for integrating LLMs into clinical decision-making and calls for further research to validate these results and address the study’s limitations.

Introduction

The paper's introduction discusses the impact of large language models (LLMs) on mental health diagnostics and treatment recommendations. LLMs such as ChatGPT-3.5 and ChatGPT-4 are trained on large, diverse text datasets, which enables them to process and generate human-like language. Although they may perform comparably to mental health professionals in areas such as suicide risk assessment, the proprietary nature of their training raises concerns about the transparency and interpretability of their outputs in clinical settings. The paper positions LLMs as supportive tools rather than replacements for human clinicians, emphasizing the importance of human empathy and nuanced understanding in mental healthcare.

The study aims to empirically evaluate how well various LLMs diagnose and recommend treatments for multiple psychiatric disorders, addressing a gap in a literature that has focused primarily on specific conditions such as depression and schizophrenia. By comparing the diagnostic accuracy, treatment recommendations, and predicted outcomes of LLMs against those of mental health professionals, the research seeks to provide a comprehensive picture of LLM integration in mental health. Its objectives are to assess correct diagnosis rates, compare treatment recommendations, and compare outcome predictions, contributing to the ongoing discussion of AI's role in mental healthcare delivery while considering ethical implications and the need for human oversight.

Methods

The study used a text vignette methodology with six vignettes depicting various mental disorders, as outlined by Reavley and Jorm (2012). Each vignette was presented in a male and a female version, with the male character named ‘John’ and the female character named ‘Mary’. The vignettes adhered to the diagnostic criteria of the DSM-5 and ICD-11 and were presented to multiple large language models (LLMs). Each LLM underwent ten assessments, in which it was prompted to identify the problem presented in the vignette and to answer two key questions, one about the expected outcome of professional intervention and one about the consequences of not receiving help, using a 6-point Likert scale.
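The design described above can be sketched as a grid of planned model queries. This is only an illustration, not the authors' code: the vignette labels are taken from the overview, and the reading that "ten assessments" means ten runs per model, vignette, and persona is an assumption, as are the names `build_trials` and `RUNS_PER_CELL`.

```python
from itertools import product

# Labels follow the conditions named in the overview; exact vignette titles are assumed.
VIGNETTES = [
    "depression",
    "suicidal ideation",
    "early schizophrenia",
    "chronic schizophrenia",
    "social phobia",
    "PTSD",
]
PERSONAS = ["John", "Mary"]  # male and female version of each vignette
MODELS = ["Gemini", "Claude", "ChatGPT-3.5", "ChatGPT-4"]
RUNS_PER_CELL = 10  # assumption: ten runs per model, vignette, and persona


def build_trials():
    """Return one dict per planned query (hypothetical helper)."""
    runs = range(1, RUNS_PER_CELL + 1)
    return [
        {"model": model, "vignette": vignette, "persona": persona, "run": run}
        for model, vignette, persona, run in product(MODELS, VIGNETTES, PERSONAS, runs)
    ]


trials = build_trials()
print(len(trials))  # number of planned queries under the assumptions above
```

Under a different reading of the protocol (for example, ten runs per vignette regardless of persona), only `RUNS_PER_CELL` and the `product` arguments would need to change.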

The first question asked the LLMs to predict the most likely outcome for John/Mary with appropriate professional help; the second asked for the expected outcome without any intervention. The responses were then compared with established norms from a sample of 1,536 health professionals, including general practitioners, psychiatrists, and clinical psychologists, as well as a broader public sample. LLM performance was assessed using criteria from previous studies by Morgan et al. (2013, 2014), allowing a comparative analysis of how effectively the LLMs identified mental health problems and recommended interventions.
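One way to compare Likert-scale predictions with professional norms is to tabulate the share of responses at each scale point and subtract the norm shares. The sketch below is a hedged illustration: the function names are hypothetical, the meaning of each scale point is deliberately left unspecified, and the ratings and norm shares are made-up values, not data from the study.

```python
from collections import Counter

SCALE_POINTS = range(1, 7)  # 6-point scale; the label of each point is not assumed here


def rating_shares(ratings):
    """Return the share of responses at each scale point (1 to 6)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in SCALE_POINTS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 6")
    counts = Counter(ratings)
    return {point: counts[point] / len(ratings) for point in SCALE_POINTS}


def share_gap(model_ratings, norm_shares):
    """Model share minus professional-norm share at each scale point."""
    model_shares = rating_shares(model_ratings)
    return {
        point: model_shares[point] - norm_shares.get(point, 0.0)
        for point in SCALE_POINTS
    }


# Hypothetical data for illustration only (not taken from the study).
model_ratings = [2, 2, 2, 1, 2, 3, 2, 2, 1, 2]
norm_shares = {1: 0.4, 2: 0.4, 3: 0.2}
gap = share_gap(model_ratings, norm_shares)
for point, value in gap.items():
    print(point, round(value, 2))
```

A positive gap at a scale point means the model chose that outcome more often than the professional norm; whether the difference is meaningful would still require a formal test of the kind the study ran in R.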

Results

The Results section reports the diagnostic accuracy, treatment recommendations, and predicted outcomes of the four LLMs across the mental health conditions studied. The analyses were performed using R version 4.4.1 and RStudio version 2023.06.1.

The findings show substantial variation in LLM performance across mental health conditions, pointing to both the potential utility and the limitations of these models in clinical settings. The study reports specific metrics for diagnostic accuracy and for the effectiveness of treatment recommendations, which give insight into the predictive capabilities of these models in mental health applications.

Discussion

The study, conducted in May 2024, evaluated the diagnostic accuracy and treatment recommendations of four advanced large language models (LLMs), namely Gemini, Claude, ChatGPT-3.5, and ChatGPT-4, against those of mental health professionals across various mental health conditions, including depression, schizophrenia, social phobia, and PTSD. The LLMs achieved a 100% correct diagnosis rate for depression, social phobia, and PTSD, outperforming the professionals, whose accuracy rate was 95%. ChatGPT-4, however, was less accurate in diagnosing early and chronic schizophrenia, with rates of 55% and 67%, respectively. This variability suggests that while LLMs can diagnose common conditions effectively, their reliability may fall with more complex disorders, which calls for ongoing training and evaluation.
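A correct diagnosis rate is the number of correct diagnoses divided by the number of attempts, and with few attempts per condition the uncertainty around it is wide. The sketch below computes the rate together with a Wilson score interval. The interval is an added illustration, not a method the summary attributes to the study, and the counts (11 of 20) are hypothetical, chosen only because they give 55%; the summary reports percentages, not counts.

```python
import math


def accuracy(correct, total):
    """Proportion of correct diagnoses."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = accuracy(correct, total)
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts: 11 correct out of 20 attempts corresponds to 55%.
rate = accuracy(11, 20)
low, high = wilson_interval(11, 20)
print(f"accuracy={rate:.2f}, interval=({low:.2f}, {high:.2f})")
```

With samples this small the interval spans a large part of the scale, which is one reason to read per-condition differences between models and professionals with caution.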

In their treatment recommendations, the LLMs consistently suggested consulting healthcare professionals, such as general practitioners and counselors, at higher rates than the professionals themselves did. They also advocated a broader range of treatment options, including physical activity and cognitive behavioral therapy. Professionals, by contrast, were more optimistic about recovery, particularly for cases involving suicidal thoughts and schizophrenia, where the LLMs predicted significantly lower full recovery rates. These findings indicate a cautious bias in LLM predictions and support using LLMs as supportive tools rather than replacements for clinical expertise in mental health care. The study underscores the need for continuous improvement of LLMs and the critical role of human judgment in treatment planning and patient outcomes.

Limitations

Several limitations may affect the applicability of the findings. Although the study used validated vignettes previously administered to a range of professionals and covering several mental disorders, text-based scenarios do not fully capture the complexity of real-life patient interactions or the full range of mental health conditions, which limits the generalizability of the results. The study also acknowledges the biases inherent in large language models (LLMs): they are trained on extensive datasets that may contain biases influencing their diagnostic and treatment recommendations. In addition, the opaque nature of LLMs makes it difficult to understand the rationale behind specific recommendations, which is vital for establishing trust and ethical practice in clinical settings.

Additionally, the findings rest on comparisons with health professionals and lack direct clinical validation in actual patient care. Future research should therefore include clinical trials and real-world applications to evaluate the efficacy and safety of LLMs in mental health diagnostics and treatment planning. Because AI technologies evolve rapidly, newer versions of the assessed models may perform differently, so evaluation needs to be ongoing. Addressing these limitations is essential for understanding the challenges of applying LLM technology to mental health and for guiding better-informed, more ethical research and implementation.