Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Journal: Nature Medicine, Volume: 32, Issue: 2
DOI: https://doi.org/10.1038/s41591-025-04074-y
PMID: https://pubmed.ncbi.nlm.nih.gov/41663592
Publication Date: 2026-02-01
Author(s): Andrew M. Bean et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The study investigates how well large language models (LLMs) help members of the public with medical advice, specifically in identifying underlying conditions and choosing an appropriate course of action (disposition) across ten medical scenarios. Tested on their own, the LLMs (GPT-4o, Llama 3, Command R+) performed well, correctly identifying conditions in 94.9% of cases and dispositions in 56.3% on average. In a controlled study of 1,298 participants, however, those who used the same LLMs identified relevant conditions in fewer than 34.5% of cases and the correct disposition in fewer than 44.2%, no better than the control group.

The findings highlight the challenges of user interaction with LLMs, suggesting that existing benchmarks for medical knowledge do not adequately predict real-world performance when human users are involved. The authors advocate for systematic human user testing to assess the interactive capabilities of LLMs before their deployment in healthcare settings. Despite the potential of LLMs to democratize healthcare and enhance access to medical knowledge, the study underscores the necessity of addressing user interaction issues to ensure effective application in real-world scenarios.

Methods

The study used a between-subjects design with three treatment groups, each using a different large language model (LLM), and a control group. Participants were presented with medical scenarios and asked to assess clinical acuity, using either an LLM chat interface or their usual reference materials. Data were collected through the Dynabench platform, with pre- and post-treatment surveys administered via Qualtrics. Pilot studies were run to refine the interface and instructions. The study design was preregistered; the LLM-only experiments were added after registration to help interpret the results of the human study.

Participants using LLMs were significantly less likely to identify relevant medical conditions than the control group. The reported odds ratios, 1.76 for identifying any relevant condition and 1.57 for serious ‘red flag’ conditions, are greater than 1 and are therefore read here as the odds for the control group relative to the LLM groups. The overall rate of correct dispositions was 43.0%, meaning most responses selected an incorrect disposition, and both groups tended to underestimate condition acuity. Notably, the LLMs performed better when operating independently than when users interacted with them, suggesting that strong LLM capabilities do not guarantee effective user outcomes. Statistical analyses were conducted using the statsmodels and SciPy packages, with χ² tests and Mann-Whitney U tests used to assess differences in performance and in acuity estimates.
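
To make the odds ratio and χ² comparison above concrete, here is a minimal sketch of both calculations on a 2x2 table. It uses only the Python standard library rather than the statsmodels and SciPy packages named in the summary, and the counts are hypothetical placeholders, not data from the study. The function names `odds_ratio` and `chi_square_2x2` are likewise illustrative.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]: (a / b) / (c / d)."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# HYPOTHETICAL counts, chosen only to keep the arithmetic easy to follow.
# Rows: control group, LLM groups. Columns: identified / did not identify
# a relevant condition.
control_hit, control_miss = 300, 300
llm_hit, llm_miss = 200, 400

ratio = odds_ratio(control_hit, control_miss, llm_hit, llm_miss)
statistic, p_value = chi_square_2x2(control_hit, control_miss, llm_hit, llm_miss)

print(f"odds ratio (control vs LLM): {ratio:.2f}")
print(f"chi-square: {statistic:.2f}, p-value: {p_value:.2e}")
```

An odds ratio above 1 in this layout means the control group had higher odds of identifying a relevant condition. Note that SciPy's `chi2_contingency` applies a continuity correction to 2x2 tables by default, so its statistic would differ slightly from the uncorrected value computed here.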

Results

In this study, the researchers evaluated the risks associated with the public utilizing large language models (LLMs) for medical advice through a randomized experiment involving 1,298 participants from the UK, aged 18 and older. Participants were presented with ten medical scenarios requiring decisions on whether and how to seek professional medical treatment. They rated their decisions on a five-point scale, from staying home to calling an ambulance, and identified the medical conditions influencing their choices. The accuracy of their decisions was assessed against responses from three physicians who helped draft the scenarios, while the relevance of the listed medical conditions was compared to a gold-standard list created by four independent physicians.
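
The scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' scoring code: only the two endpoints of the five-point scale come from the summary, so the three middle labels are hypothetical, as are the function name `score_response`, the exact-match comparison of condition names, and the example conditions.

```python
from enum import IntEnum


class Disposition(IntEnum):
    """Five-point scale; only STAY_HOME and AMBULANCE come from the summary."""
    STAY_HOME = 1
    ROUTINE_APPOINTMENT = 2   # hypothetical middle label
    URGENT_APPOINTMENT = 3    # hypothetical middle label
    EMERGENCY_DEPARTMENT = 4  # hypothetical middle label
    AMBULANCE = 5


def score_response(chosen, gold, listed_conditions, gold_conditions):
    """Score one response against the physicians' disposition and condition list."""
    # Assumption: conditions match on case-insensitive exact names.
    listed = {name.strip().lower() for name in listed_conditions}
    reference = {name.strip().lower() for name in gold_conditions}

    if chosen == gold:
        acuity_error = "correct"
    elif chosen < gold:
        acuity_error = "underestimate"
    else:
        acuity_error = "overestimate"

    return {
        "disposition_correct": chosen == gold,
        "acuity_error": acuity_error,
        "relevant_condition": bool(listed & reference),
    }


# Hypothetical example: the participant names a relevant condition
# but picks a disposition that is less urgent than the gold standard.
result = score_response(
    chosen=Disposition.ROUTINE_APPOINTMENT,
    gold=Disposition.AMBULANCE,
    listed_conditions=["Migraine", "subarachnoid haemorrhage "],
    gold_conditions=["Subarachnoid haemorrhage", "Meningitis"],
)
print(result)
```

Treating the scale as ordered integers makes it straightforward to separate underestimates of acuity from overestimates, which is the distinction the summary draws when it notes that participants tended to underestimate.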

Participants were randomly assigned to one of three treatment groups, each interacting with a different LLM (GPT-4o, Llama 3, or Command R+), or to a control group that used whatever resources they would typically rely on at home, such as internet searches. Each participant could provide up to two responses, and data collection continued until 600 responses had been obtained for each experimental condition. The sample was intended to reflect the demographics of the UK population, and the design was meant to test how well LLMs support medical decision-making in a home context.
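
The assignment and stopping rule above can be illustrated with a toy simulation. The summary does not describe how randomization or quota filling was implemented, so everything here is an assumption made for illustration: uniform assignment among conditions that are still open, a random choice of one or two responses per participant, and the names `simulate_recruitment`, `QUOTA`, and `MAX_PER_PARTICIPANT`.

```python
import random

ARMS = ["GPT-4o", "Llama 3", "Command R+", "control"]
QUOTA = 600              # responses required per condition (from the summary)
MAX_PER_PARTICIPANT = 2  # each participant may give up to two responses


def simulate_recruitment(seed=0):
    """Assign simulated participants at random until every arm meets its quota."""
    rng = random.Random(seed)
    responses = {arm: 0 for arm in ARMS}
    participants = {arm: 0 for arm in ARMS}

    while any(count < QUOTA for count in responses.values()):
        # Assumption: only arms that still need responses accept participants.
        open_arms = [arm for arm in ARMS if responses[arm] < QUOTA]
        arm = rng.choice(open_arms)

        # Assumption: a participant gives one or two responses at random,
        # truncated so the arm never exceeds its quota.
        wanted = rng.randint(1, MAX_PER_PARTICIPANT)
        given = min(wanted, QUOTA - responses[arm])

        responses[arm] += given
        participants[arm] += 1

    return responses, participants


responses, participants = simulate_recruitment()
print("responses per arm:", responses)
print("participants per arm:", participants)
```

Under these assumptions each arm ends with exactly 600 responses, while the number of participants per arm varies with how many of them gave two responses.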

Discussion

The discussion highlights significant challenges in deploying large language models (LLMs) for direct patient care, particularly for medical self-assessment. Although GPT-4o, Llama 3, and Command R+ were highly proficient at suggesting relevant medical conditions, pairing them with human users did not improve outcomes over a control group using traditional methods. Specifically, the models themselves suggested relevant conditions in over 90% of cases, yet their accuracy in recommending appropriate dispositions was notably lower, with GPT-4o reaching only 64.7%. This indicates that LLMs can generate useful medical information, but a breakdown in communication between users and models significantly hampers effective decision-making.

The analysis revealed that users often provided incomplete information, which limited the models’ ability to offer accurate recommendations. Conversely, even when the LLMs suggested correct conditions, users frequently failed to carry those suggestions into their final decisions. The study also compared LLM performance on established medical benchmarks with performance in real user interactions and found that high benchmark scores did not translate into effective use. This discrepancy underscores the need for further research on user-LLM interaction, in particular on clear communication and on making LLM responses more consistent. The authors recommend that future development of LLMs for medical applications prioritize user testing to ensure safety and efficacy in real-world settings.