Large language model diagnostic assistance for physicians in a lower-middle-income country: a randomized controlled trial

Journal: Nature Health, Volume: 1, Issue: 2
DOI: https://doi.org/10.1038/s44360-025-00007-8
Publication Date: 2026-02-06
Author(s): Ihsan Ayyub Qazi et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The study investigates whether large language models (LLMs) improve diagnostic reasoning among physicians in a lower-middle-income country, specifically Pakistan. In a single-blind randomized controlled trial with 60 licensed physicians, the researchers tested whether AI-trained physicians could use LLMs to improve diagnostic accuracy compared with conventional resources alone. Participants completed a 20-hour AI-literacy curriculum before being randomized into two groups: one with access to LLMs plus conventional resources, and the other with conventional resources alone.

Results indicated that physicians with LLM access achieved a mean diagnostic reasoning score of 71.4%, significantly higher than the 42.6% score of those relying solely on conventional resources (adjusted difference of 27.5 percentage points, 95% CI: 22.8 to 32.2; P < 0.001). Time spent per vignette was comparable between groups, suggesting that LLM utilization did not hinder efficiency. Notably, while LLMs alone outperformed the LLM-assisted group in some cases, 31.4% of the physician-plus-LLM group exceeded the median performance of LLMs alone, highlighting potential complementarity. These findings underscore the promise of LLMs in addressing diagnostic errors in resource-limited settings, contingent upon effective AI-literacy training for healthcare professionals.
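The headline numbers above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis: it assumes the reported 95% CI is a symmetric normal-approximation (Wald) interval, which the summary does not state, and all variable names are illustrative. It shows that the unadjusted gap between the group means (28.8 points) differs slightly from the adjusted estimate (27.5 points), and that the interval width implies a standard error consistent with P < 0.001.

```python
from statistics import NormalDist

# Values reported in the summary (percentage points).
mean_llm = 71.4
mean_control = 42.6
adjusted_diff = 27.5
ci_low, ci_high = 22.8, 32.2

# Unadjusted difference between the two group means.
raw_diff = round(mean_llm - mean_control, 1)

# ASSUMPTION: the 95% CI is a symmetric normal-approximation interval,
# so half its width equals z_crit * standard error.
z_crit = NormalDist().inv_cdf(0.975)
half_width = (ci_high - ci_low) / 2
std_error = half_width / z_crit

# Two-sided p-value implied by the estimate and that standard error.
z_stat = adjusted_diff / std_error
p_value = 2 * (1 - NormalDist().cdf(z_stat))

# A symmetric interval should be centred on the point estimate.
midpoint = (ci_low + ci_high) / 2

print(f"raw difference:      {raw_diff} pp")
print(f"adjusted difference: {adjusted_diff} pp")
print(f"CI midpoint:         {midpoint:.1f} pp")
print(f"implied std. error:  {std_error:.2f} pp")
print(f"implied z statistic: {z_stat:.1f}")
print(f"p < 0.001:           {p_value < 0.001}")
```

Under this assumption the implied standard error is roughly 2.4 percentage points, and the interval is centred on the adjusted estimate. The actual model behind the adjusted difference may use a different distribution or account for repeated vignettes per physician, so these figures are only a plausibility check.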

Methods

The study was conducted following ethical guidelines approved by the institutional review board at Lahore University of Management Sciences (LUMS), with informed consent obtained from all participants prior to enrollment and randomization. Participants, who were registered medical practitioners with an MBBS degree and had completed 20 hours of structured training in artificial intelligence (AI) and large language models (LLMs), received a one-month subscription to ChatGPT Plus as compensation. The research adhered to the CONSORT 2025 guidelines for randomized controlled trials (RCTs), and the complete study protocol is detailed in Supplementary Protocol 1.

Recruitment of participants occurred over six months through email lists associated with LUMS’s Learning Institute, which provides specialized training in health AI and data science. Participants were organized into small groups and engaged in either remote sessions or in-person meetings at a computer laboratory at LUMS. Each session lasted 85 minutes, beginning with a 10-minute baseline survey followed by 75 minutes for clinical vignette assessments. The trial was executed as planned, with participant flow illustrated in Figure 1 and further details available in Supplementary Figure 2.

Results

Sixty licensed physicians were recruited between January 10, 2025, and May 17, 2025; 58 completed the study after two participants withdrew consent. The cohort was evenly split, with 29 participants in the intervention group and 29 in the control group. Thirty-five participants (60%) took part in remote sessions, while 23 (40%) attended in person. Participants had a median of 8.5 years of practice, with an interquartile range (IQR) of 3.3 to 12.0 years. Additional demographic details are provided in Table 1.
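The participant counts in this section can be checked for internal consistency. This is a small illustrative sketch using only the numbers reported above; the variable and function names are hypothetical.

```python
# Counts reported in the Results summary.
recruited = 60
withdrew = 2
completed = recruited - withdrew

arms = {"intervention": 29, "control": 29}
formats = {"remote": 35, "in_person": 23}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Both breakdowns should account for every participant who completed.
assert sum(arms.values()) == completed
assert sum(formats.values()) == completed

for label, count in formats.items():
    print(f"{label}: {count} of {completed} ({share(count, completed)}%)")
```

The rounded shares match the 60% and 40% quoted in the text, and both the arm sizes and the session formats sum to the 58 physicians who completed the study.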

Discussion

In this randomized controlled trial (RCT) conducted in a low- and middle-income country (LMIC), the integration of GPT-4 as a diagnostic aid significantly improved physicians' diagnostic accuracy compared to conventional resources. A total of 342 cases were analyzed, with the LLM group achieving a mean diagnostic reasoning score of 71.4% compared to 42.6% for the control group, an adjusted difference of 27.5 percentage points (pp; P < 0.001). The study highlighted structured AI-literacy training as crucial for maximizing the benefits of LLMs; trained physicians performed better with LLM access, particularly those with less clinical experience and lower prior exposure to LLMs.

Subgroup analyses showed that less experienced clinicians improved more (30.0 pp) than their more experienced counterparts (24.5 pp). Male physicians also benefited more (41.6 pp) than female physicians (17.1 pp), indicating a potential gender disparity in the effectiveness of LLM assistance. The findings underscore the need for tailored training interventions to optimize the use of AI tools in clinical settings, particularly in resource-constrained environments where diagnostic errors are prevalent. The study calls for further research on the dynamics of human-AI collaboration and on the ongoing training needed to maintain effective interaction with evolving AI technologies.