Evaluating the agreement between ChatGPT-4 and validated questionnaires in screening for anxiety and depression in college students: a cross-sectional study

Journal: BMC Psychiatry, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12888-025-06798-0
PMID: https://pubmed.ncbi.nlm.nih.gov/40211256
Publication Date: 2025-04-10
Author(s): Jiali Liu et al.
Primary Topic: Digital Mental Health Interventions

Overview

This cross-sectional study evaluated the validity and utility of the Chat Generative Pre-trained Transformer (ChatGPT-4) for assessing anxiety and depression among college students. ChatGPT-4 was used to generate structured interview questionnaires based on the validated Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder Scale-7 (GAD-7), termed GPT-PHQ-9 and GPT-GAD-7. Using Spearman correlation analysis, receiver operating characteristic (ROC) curves, and other statistical methods, the study found that GPT-PHQ-9 and GPT-GAD-7 had acceptable internal consistency (Cronbach’s α = 0.75 and 0.76, respectively) and moderate to good agreement with the corresponding validated measures (intraclass correlation coefficients, ICC = 0.80 and 0.70, respectively). Optimal cutoff scores were 9.5 for depressive symptoms and 6.5 for anxiety symptoms, at which the adapted questionnaires showed high sensitivity and specificity.
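Half-integer cutoffs such as 9.5 and 6.5 are typical of ROC analysis on integer-valued scores: candidate thresholds sit midway between adjacent observed scores, and a common rule selects the one that maximizes Youden’s J = sensitivity + specificity - 1. The summary does not state which selection criterion or reference standard the authors used, so the sketch below assumes Youden’s J and a binary case label derived from the validated questionnaire; the function name and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def youden_optimal_cutoff(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    Assumption: a respondent is classified positive when score >= cutoff, and
    labels are 1 for reference-standard cases and 0 otherwise.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    # Candidate thresholds are midpoints between adjacent distinct scores,
    # which is why integer questionnaires yield cutoffs like 9.5.
    distinct = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        raise ValueError("need at least two distinct scores")

    best = (candidates[0], 0.0, 0.0)
    best_j = -1.0
    for cut in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if j > best_j:  # strict '>' keeps the lowest cutoff on ties
            best_j = j
            best = (cut, sens, spec)
    return best
```

Because the scores are integers, a cutoff of 9.5 is equivalent to the rule "score ≥ 10".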

The findings suggest that the ChatGPT-adapted questionnaires are reliable tools for assessing mental health conditions in this population, with potential for clinical application. The study also highlights the possibility of using ChatGPT to develop questionnaires tailored to diverse populations, which could streamline patient assessments in clinical settings and reduce the workload of healthcare professionals. However, further research is needed to establish how this technology affects patient care and clinical workflows.

Introduction

The introduction highlights the transformative role of large language models (LLMs), particularly ChatGPT, in artificial intelligence and natural language processing. Since its public release on November 30, 2022, ChatGPT has gained popularity rapidly, reaching roughly 100 million monthly active users within two months and becoming, at the time, the fastest-growing consumer application on record. ChatGPT is used in domains ranging from customer service to medical research because it can generate high-quality, context-aware text. Among its versions, ChatGPT-4 stands out for its linguistic fluency and its performance on complex text, including in healthcare applications such as mental health assessment.

The paper emphasizes the potential of ChatGPT for developing and administering medical questionnaires, which are essential for screening mental health conditions. Traditional questionnaires often lack cultural and demographic specificity, whereas ChatGPT’s ability to generate tailored assessments dynamically could improve their validity and reliability. Preliminary studies indicate that ChatGPT-adapted questionnaires can correlate significantly with established assessment tools, which may support standardization and data reliability. However, ChatGPT’s use in mental health screening remains underexplored, particularly among university students, who have high rates of anxiety and depression. This study therefore used ChatGPT-4 to create structured interview questionnaires tailored to this group and compared their validity against the validated standardized questionnaires, as a step toward AI-assisted assessment methods.

Methods

The study used an observational, cross-sectional design among college students in China. ChatGPT-4 was used to adapt the validated PHQ-9 and GAD-7 into structured interview questionnaires (GPT-PHQ-9 and GPT-GAD-7) whose items reflect the daily experiences of college students. Participants completed both the ChatGPT-adapted questionnaires and the original validated versions, so each respondent contributed paired scores for comparison.
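Spearman’s ρ, used to relate the GPT-adapted and original totals, is the Pearson correlation of the rank-transformed scores, with tied values (frequent on short integer scales) receiving the mean of the ranks they occupy. A minimal standard-library sketch, assuming two equal-length lists of paired totals; the function names are hypothetical, and in practice `scipy.stats.spearmanr` computes the same tie-corrected coefficient.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation for paired scores x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any monotone relationship gives ρ = 1; for example, `spearman_rho([1, 2, 3], [1, 4, 9])` returns 1.0.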

Data analysis assessed reliability and agreement. Internal consistency of the adapted questionnaires was measured with Cronbach’s α; their relationship with the validated instruments was examined with Spearman correlation and intraclass correlation coefficients (ICCs); receiver operating characteristic (ROC) curves were used to identify optimal cutoff scores for depressive and anxiety symptoms; and Bland-Altman analysis was used to examine the variability of differences between paired scores.
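Cronbach’s α compares the summed item variances with the variance of the total score: α = k/(k - 1) × (1 - (sum of item variances) / (variance of total score)), where k is the number of items. A minimal sketch, assuming a respondents × items matrix of item scores; the function name and toy matrices are hypothetical.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents x items score matrix.

    Uses sample variances (n - 1 denominator) for both the items and the
    total; the ratio is unchanged under either convention if used consistently.
    """
    if len(items) < 2:
        raise ValueError("need at least 2 respondents")
    n_items = len(items[0])
    if n_items < 2:
        raise ValueError("need at least 2 items")
    columns = list(zip(*items))  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in items])
    return n_items / (n_items - 1) * (1 - item_var_sum / total_var)
```

Items that move in lockstep give α = 1, while `cronbach_alpha([[1, 2], [2, 1], [3, 3]])` gives 2/3 because the two items disagree for two of the three respondents.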

Results

In this study, 200 college students participated (mean age 21.07 years, SD = 1.31), most of whom (139, 67.50%) identified as women. All participants completed the questionnaires independently.

The reported medians were: GPT-PHQ-9, 4 (IQR 3-6, range 0-20); PHQ-9, 4 (IQR 2-5, range 0-16); GPT-GAD-7, 3 (IQR 2-5, range 0-15); and GAD-7, 3 (IQR 2-4, range 0-13). The GPT-adapted versions thus had the same medians as the originals but slightly higher upper quartiles and maximum scores. On the standard severity bands of the original PHQ-9 and GAD-7 (0-4 minimal, 5-9 mild), both medians fall in the minimal range, indicating a sample with predominantly minimal-to-mild symptoms.
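Median, IQR, and range summaries like these can be computed from raw total scores with Python’s `statistics` module. Quartiles depend on the interpolation convention, so IQR endpoints from different software may differ slightly, especially with heavily tied integer scores; the sketch assumes the "inclusive" method of `statistics.quantiles`, and the function name and scores are hypothetical.

```python
from statistics import median, quantiles
from typing import Dict, Sequence


def describe_scores(scores: Sequence[int]) -> Dict[str, float]:
    """Median, quartiles (inclusive method), and range of total scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "median": median(scores),
        "q1": q1,
        "q3": q3,
        "min": min(scores),
        "max": max(scores),
    }
```

For the hypothetical scores `[4, 0, 5, 2, 16, 3, 4, 6, 2, 5]` this yields a median of 4, an IQR of 2.25-5.0, and a range of 0-16.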

Discussion

The study is an observational, cross-sectional analysis of college students in China that evaluated the validity of ChatGPT-4-adapted questionnaires for assessing anxiety and depression. The questionnaires were developed from established scales, the Patient Health Questionnaire-9 (PHQ-9) and the Generalized Anxiety Disorder-7 (GAD-7), and tailored to reflect the daily experiences of college students. The ChatGPT-adapted questionnaires showed acceptable internal consistency, with Cronbach’s alpha of 0.75 for GPT-PHQ-9 and 0.76 for GPT-GAD-7. Intraclass correlation coefficients (ICCs) indicated moderate to good agreement with the validated questionnaires: 0.80 between GPT-PHQ-9 and PHQ-9, and 0.70 between GPT-GAD-7 and GAD-7.
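An ICC for two methods applied to the same respondents can be computed from a two-way ANOVA decomposition. The summary does not say which ICC form the authors reported; the sketch below assumes ICC(2,1) in Shrout and Fleiss’s notation (two-way random effects, absolute agreement, single measures), which penalizes systematic offsets between the GPT-adapted and original totals. The function name and data are hypothetical.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    `ratings` is an n_subjects x k_methods matrix, e.g. k = 2 columns holding
    each respondent's GPT-adapted total and original questionnaire total.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition total sum of squares into subjects, methods, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

For example, paired totals `[[1, 2], [2, 3], [3, 4], [4, 5]]` have a Spearman’s ρ of 1, yet `icc_2_1` returns 10/13 ≈ 0.77 because the second method scores one point higher throughout; this is why absolute-agreement ICCs can sit below correlation coefficients.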

The ChatGPT-adapted questionnaires also identified depressive and anxiety symptoms effectively, with optimal cutoff points determined by receiver operating characteristic (ROC) analysis. However, although the adapted questionnaires agreed substantially with the traditional measures, Bland-Altman analysis showed variability between the paired scores, suggesting that they should not replace validated instruments but rather serve as an adjunct to clinical assessment. The study highlights the potential of integrating AI tools such as ChatGPT into mental health evaluation, while emphasizing that further research is needed to refine these tools for more complex clinical scenarios.
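Bland-Altman analysis summarizes paired scores by the mean difference (bias) and the 95% limits of agreement, bias ± 1.96 × SD of the differences; wide limits mean individual scores can disagree noticeably even when correlation and ICC are acceptable. A minimal sketch, assuming differences are taken as GPT-adapted minus original score; the function name is hypothetical, and the paper’s actual bias and limits are not given in this summary.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower_loa, upper_loa) for paired measurements a and b.

    Differences are a - b; the 95% limits of agreement are bias +/- 1.96 SD,
    using the sample standard deviation of the differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width
```

For example, `bland_altman([5, 6, 7, 8], [4, 5, 5, 8])` gives a bias of 1 with limits of roughly -0.60 to 2.60. The corresponding plot places each pair’s mean on the x-axis and its difference on the y-axis.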

Limitations

The study has several limitations that bear on the applicability of AI-adapted versions of the PHQ-9 and GAD-7. First, the participants were exclusively college students, which limits the representativeness of the sample and the generalizability of the findings to broader populations. Second, ChatGPT’s ability to interpret and adapt psychological concepts such as depression and anxiety may differ across cultural and linguistic contexts, adding uncertainty about whether the results hold for other demographic groups.

Finally, access to the ChatGPT-4 version used in the study requires a paid subscription. The ChatGPT Plus plan, which includes features such as Browse with Bing (beta), costs $20 per month, a financial barrier that may restrict access to these AI tools. Future research should include more diverse participants and test ChatGPT-adapted questionnaires across different cultural backgrounds to strengthen the validity and applicability of the findings.