Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

Journal: Journal of Medical Internet Research, Volume: 27
DOI: https://doi.org/10.2196/67891
PMID: https://pubmed.ncbi.nlm.nih.gov/40053817
Publication Date: 2025-03-05
Author(s): Ryan K. McBain et al.
Primary Topic: Mental Health via Writing

Overview

The study investigates how well three large language models (LLMs), ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, evaluate responses to suicidal ideation, against a backdrop of rising suicide rates in the United States. Using the revised Suicide Intervention Response Inventory (SIRI-2), the researchers had each LLM rate clinician responses to hypothetical scenarios involving depressive symptoms and suicidal thoughts. The models were instructed to score each response on a scale from -3 (highly inappropriate) to +3 (highly appropriate), and their ratings were compared with those of expert suicidologists using linear regression analyses and z scores.
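The z score comparison described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the `ExpertNorm` type, the outlier cutoff of 2.0, and the example numbers are all assumptions made for demonstration, since the summary does not state the threshold or the per-item expert statistics.

```python
from dataclasses import dataclass


@dataclass
class ExpertNorm:
    """Expert panel mean and standard deviation for one rated response."""
    mean: float
    sd: float


def z_score(llm_rating: float, norm: ExpertNorm) -> float:
    """Distance of an LLM rating from the expert mean, in expert SD units."""
    if norm.sd <= 0:
        raise ValueError("expert SD must be positive")
    return (llm_rating - norm.mean) / norm.sd


def is_outlier(llm_rating: float, norm: ExpertNorm, cutoff: float = 2.0) -> bool:
    """Flag a rating whose |z| exceeds the cutoff (the cutoff is an assumed value)."""
    return abs(z_score(llm_rating, norm)) > cutoff


# Hypothetical item: experts rated the response -2.0 on average (SD 0.5),
# while the model rated it 0 on the -3..+3 scale.
norm = ExpertNorm(mean=-2.0, sd=0.5)
z = z_score(0.0, norm)
print(z, is_outlier(0.0, norm))
```

A positive z score here means the model judged the response more appropriate than the experts did, which is the direction of bias the study reports.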

All three LLMs rated responses as more appropriate than the expert suicidologists did. The mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). Measured against the expert ratings, 19% of ChatGPT's ratings, 11% of Claude's, and 36% of Gemini's were outliers. On total SIRI-2 score, where a lower score means closer agreement with the experts, ChatGPT scored 45.7, comparable to master's-level counselors; Claude scored 36.7, better than scores previously reported for trained professionals; and Gemini scored 54.5, akin to untrained K-12 staff. So although every model showed an upward bias in its appropriateness ratings, two of them performed at or above the level of trained mental health professionals.
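To make the reported mean differences and confidence intervals concrete, the sketch below computes a paired mean difference with a normal-approximation 95% CI. It is a simplified stand-in, not the study's analysis: the paper used linear regression, and the function name `mean_difference_ci` and the ratings shown are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_difference_ci(llm, expert, z_crit=1.96):
    """Mean of (llm - expert) per item, with a normal-approximation 95% CI."""
    if len(llm) != len(expert) or len(llm) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    diffs = [a - b for a, b in zip(llm, expert)]
    d = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return d, d - z_crit * se, d + z_crit * se


# Hypothetical ratings on the -3..+3 scale (not data from the study).
llm_ratings = [2.0, 1.0, -1.0, 3.0, 0.0, -2.0]
expert_means = [1.0, 0.5, -2.0, 2.5, -1.0, -2.5]

d, low, high = mean_difference_ci(llm_ratings, expert_means)
print(f"mean difference {d:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

An interval that lies entirely above zero, as in this made-up example, corresponds to the upward bias described in the text.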

Introduction

The introduction highlights the alarming rise in suicide rates among individuals under 50 in the United States, particularly adolescents, with reported deaths increasing from 39,518 in 2011 to 48,183 in 2021. Although there was a temporary decline during the COVID-19 pandemic, recent data suggest that the upward trend has resumed. The paper discusses the potential role of large language models (LLMs) in addressing mental health issues, particularly for individuals experiencing depression and suicidal ideation. These models, exemplified by platforms like ChatGPT, could provide valuable support, especially for the approximately 50 million Americans in rural areas with limited access to mental health care.

However, concerns are raised regarding the potential risks associated with LLMs, particularly their capacity to offer harmful advice to those at risk of suicide. The existing literature on LLMs’ effectiveness in this context is limited, primarily focusing on their behaviors rather than direct comparisons to established benchmarks. To address this gap, the study evaluates the competency of three widely used LLMs in distinguishing appropriate from inappropriate responses to suicidal ideation, using the Suicide Intervention Response Inventory (SIRI-2) as a standardized measure. The researchers hypothesize that the LLMs’ ratings will significantly differ from those of expert suicidologists and that the models will not exhibit a consistent bias in their evaluations.

Methods

The researchers presented three LLMs, ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, with the SIRI-2 items: hypothetical scenarios involving depressive symptoms and suicidal thoughts, each paired with clinician responses. Each model was instructed to rate every response on a scale from -3 (highly inappropriate) to +3 (highly appropriate).

The models' ratings were then compared with those of expert suicidologists using linear regression analyses and z scores, which yielded a mean difference from the experts for each model and the share of its ratings that were outliers. Each model's total SIRI-2 score was also set against scores previously reported for human groups, including master's-level counselors, trained professionals, and untrained K-12 staff.

Results

All three LLMs rated responses as more appropriate than the expert suicidologists did, with mean differences of 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). The share of ratings identified as outliers relative to the experts was 19% for ChatGPT, 11% for Claude, and 36% for Gemini.

Total SIRI-2 scores were 45.7 for ChatGPT, 36.7 for Claude, and 54.5 for Gemini. Relative to the hypotheses stated in the introduction, the models' ratings did differ significantly from the experts', but the models also showed a consistent upward bias, which the researchers had not expected.

Discussion

In this study, the authors assessed the ability of three large language models (LLMs), ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, to evaluate the appropriateness of clinician responses to hypothetical scenarios involving depressive symptoms and suicidal thoughts. The evaluation used the SIRI-2 instrument, which scores raters on how they judge the appropriateness of clinician responses. The LLMs' ratings correlated highly with those of expert suicidologists but showed a consistent upward bias: the models rated responses as more appropriate than the experts did. ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro produced mean SIRI-2 scores of 45.71, 36.65, and 54.52, respectively; Claude's score, the lowest, was the best of the three.
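A short sketch may help explain why the lowest SIRI-2 score is the best one. It assumes the standard SIRI-2 scoring rule, in which a rater's total is the sum of absolute differences between their ratings and the expert panel's mean ratings; the summary does not spell this rule out, and the function name `siri2_total` and all ratings below are hypothetical.

```python
def siri2_total(ratings, expert_means):
    """Sum of absolute differences between a rater's scores and expert means."""
    if len(ratings) != len(expert_means):
        raise ValueError("ratings and expert_means must have the same length")
    for r in ratings:
        if not -3 <= r <= 3:
            raise ValueError("ratings must lie on the -3..+3 scale")
    return sum(abs(r - e) for r, e in zip(ratings, expert_means))


# Hypothetical expert means for four responses (not SIRI-2 data).
expert_means = [-2.5, 1.5, 2.0, -1.0]

close_rater = [-2, 2, 2, -1]   # tracks the experts closely
lenient_rater = [0, 3, 3, 1]   # rates every response as more appropriate

print(siri2_total(close_rater, expert_means))    # 1.0
print(siri2_total(lenient_rater, expert_means))  # 7.0
```

Under this rule, a rater who consistently overrates responses accumulates a larger total, so an upward bias and a worse score go together.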

The findings suggest that although LLMs can assess response appropriateness effectively, their tendency to overrate responses raises concerns about their reliability in real-world therapeutic contexts. The authors emphasize the importance of aligning LLM outputs with expert benchmarks to make them more useful in mental health applications. They also note the study's limitations, including the evolving nature of LLM technology and the focus on how LLMs evaluate responses rather than on how they respond directly. Future research should examine how LLMs can be configured to respond to suicidal ideation while adhering to established performance benchmarks.