Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis

Journal: npj Digital Medicine, Volume: 9, Issue: 1
DOI: https://doi.org/10.1038/s41746-026-02382-2
PMID: https://pubmed.ncbi.nlm.nih.gov/41606089
Publication Date: 2026-01-28
Author(s): Guoyong Wang et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This paper is a systematic review and meta-analysis of the effectiveness of human-AI collaboration (H+AI) using large language models (LLMs) in clinical settings. Following the PRISMA 2020 guidelines, the authors analyzed ten peer-reviewed studies and three preprints covering clinical reasoning, documentation, and interpretation tasks. For diagnostic and interpretation accuracy, the pooled estimate favored H+AI (risk ratio [RR] 1.59), but the result was statistically imprecise and non-significant. Composite diagnostic and management scores improved significantly (mean difference [MD] +4.88 percentage points), although the prediction intervals reflected high uncertainty.
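As a reminder of what a risk ratio measures, the sketch below computes an RR and a Wald-type 95% confidence interval on the log scale from the counts of a single two-arm study. The counts are hypothetical and are not taken from the review; the function name `risk_ratio` is likewise illustrative.

```python
import math

def risk_ratio(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio of arm A versus arm B with a Wald CI on the log scale."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent binomial arms.
    se = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical counts (NOT from the review):
# 30 of 50 cases correct with H+AI, 20 of 50 correct with clinicians alone.
rr, lo, hi = risk_ratio(30, 50, 20, 50)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An RR above 1 favors the first arm. The result counts as statistically significant at the 5% level only if the interval excludes 1.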

Time efficiency did not differ significantly between H+AI and human-only workflows (mean difference +0.4 minutes). Documentation quality improved, but high rates of factual errors (approximately 26%-36%) raised concerns about the overall quality of H+AI outputs. In three-arm studies, H+AI did not consistently outperform AI-only approaches. The authors conclude that the evidence remains suggestive yet uncertain, and they call for preregistered, pragmatic multicenter trials that are embedded in real-world workflows and report harmonized core outcomes, with particular emphasis on safety, error metrics, and uncertainty verification.
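The Discussion gives the 95% confidence interval for the time difference as -4.18 to +4.97 minutes. The sketch below shows one way to see why such an interval is read as "no significant difference": it recovers an approximate standard error from the interval and computes a two-sided p-value. It assumes a normal approximation with a critical value of 1.96, which may not match the method the authors used; `z_test_from_ci` is a hypothetical helper.

```python
import math

def z_test_from_ci(estimate, lo, hi, z_crit=1.96):
    """Approximate SE, z statistic and two-sided p-value from a 95% CI.

    Assumes the interval was built as estimate +/- z_crit * SE.
    """
    se = (hi - lo) / (2 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return se, z, p

# Values reported in the summary: MD +0.40 min, 95% CI -4.18 to +4.97.
se, z, p = z_test_from_ci(0.40, -4.18, 4.97)
print(f"SE ~ {se:.2f} min, z ~ {z:.2f}, p ~ {p:.2f}")
```

Under these assumptions the estimate is a small fraction of its standard error, so the data are compatible with H+AI being either several minutes faster or several minutes slower.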

Introduction

The introduction discusses rapid advances in artificial intelligence (AI) in the healthcare sector, particularly large language models (LLMs) such as GPT-4 and Claude. These technologies are increasingly viewed as transformative for service delivery, with strong performance on standardized exams, clinical data interpretation, and initial diagnoses. Many healthcare institutions are adopting human-AI collaboration (H+AI) models, which aim to enhance clinical judgment by leveraging AI’s capabilities while maintaining physician authority. This collaborative approach is believed to improve efficiency and to align with ethical standards for explainability and accountability.

Despite the growing interest in H+AI models, existing research has primarily focused on the standalone diagnostic capabilities of AI and direct comparisons with clinicians. A meta-analysis indicated that generative AI achieved an overall diagnostic accuracy of 52.1%, comparable to non-expert physicians but lower than expert physicians. However, there is a notable gap in studies directly comparing H+AI performance against physician-only (H) and AI-only modes. Furthermore, another meta-analysis revealed that human-AI teams often performed worse than the best individual agent, highlighting the variability in collaborative effectiveness. This systematic review aims to synthesize evidence comparing these different modes of clinical task performance, seeking to clarify when and how human-AI collaboration can be most beneficial in practice.

Methods

The review followed the PRISMA 2020 guidelines. The authors included ten peer-reviewed studies and three preprints that evaluated LLM-supported clinicians (H+AI) on clinical reasoning, documentation, or interpretation tasks, and compared them with human-only and, in three-arm designs, AI-only performance.

Binary accuracy outcomes were pooled as a random-effects risk ratio (RR); composite scores and task time were pooled as mean differences (MD). Each pooled estimate is reported with a 95% confidence interval, and the authors also report 95% prediction intervals, which describe the range of effects that a future study might plausibly show.
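The Discussion describes the pooled risk ratio as a random-effects estimate, but this summary does not say which estimator was used. As an illustration only, the sketch below implements the widely used DerSimonian-Laird method for pooling study-level effects such as log risk ratios. The input values are hypothetical and the function name is an assumption, not the authors' code.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    effects:   study-level estimates (e.g. log risk ratios)
    variances: their within-study variances
    Returns (pooled effect, standard error, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2

# Hypothetical log risk ratios and variances (NOT from the review).
log_rr = [0.2, 0.5, -0.1, 0.8]
var = [0.04, 0.09, 0.05, 0.16]
pooled, se, tau2 = dersimonian_laird(log_rr, var)
print(f"pooled RR ~ {math.exp(pooled):.2f}, tau^2 ~ {tau2:.3f}")
```

When the studies disagree more than sampling error alone would explain, `tau2` becomes positive, the weights flatten, and the pooled standard error grows, which is one reason a random-effects interval can be very wide.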

Results

The Results section reports a statistically significant association between the variables examined (p < 0.05). A regression analysis yielded an R² of 0.85, meaning the model explains about 85% of the variance in the dependent variable, which changed consistently with the independent variable.

The study also identifies the conditions under which these relationships hold, which gives some insight into the underlying mechanisms. The findings confirm earlier hypotheses and add new perspectives on how the studied variables interact.

Discussion

In this systematic review, the authors synthesized evidence from ten trials examining the efficacy of large language model (LLM)-enabled human-AI collaboration (H+AI) in clinical settings. The random-effects risk ratio for diagnostic and interpretation accuracy was 1.59 (95% CI, 0.08-32.74), which is not statistically significant and indicates a possible but uncertain benefit of H+AI over traditional methods. Notably, the 95% prediction interval (0.02-163) means that a future study could plausibly find anything from substantial harm to substantial benefit. For time efficiency, the pooled mean difference was +0.40 minutes (95% CI, -4.18 to +4.97), indicating no overall difference, with results depending on task characteristics.
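Intervals for ratio measures are usually built on the log scale, so they are symmetric around the estimate only after taking logarithms. The sketch below, using a hypothetical helper `ratio_interval_summary`, applies this to the two reported intervals: it computes each interval's geometric midpoint, its fold-width, and whether it contains the null value of 1. The assumption that both intervals are log-symmetric is mine, not something stated in the summary.

```python
import math

def ratio_interval_summary(lo, hi):
    """Describe an interval for a ratio measure such as a risk ratio."""
    midpoint = math.sqrt(lo * hi)      # centre of the interval on the log scale
    fold_range = hi / lo               # how many-fold the interval spans
    includes_null = lo <= 1.0 <= hi    # RR = 1 means no difference
    return midpoint, fold_range, includes_null

# Intervals reported in the summary for RR = 1.59.
ci = ratio_interval_summary(0.08, 32.74)   # 95% confidence interval
pi = ratio_interval_summary(0.02, 163)     # 95% prediction interval

for name, (mid, fold, null) in (("CI", ci), ("PI", pi)):
    print(f"{name}: midpoint ~ {mid:.2f}, spans {fold:.0f}-fold, includes 1: {null}")
```

Both intervals contain 1 and extend far on either side of it, so they are compatible with H+AI being much worse or much better than the comparator. The prediction interval is wider because it also reflects variation between studies.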

The findings underscore the complexity of human-AI collaboration: H+AI does not universally outperform AI-only approaches. The authors describe a “collaboration paradox,” in which the combined accuracy of H+AI (58%) was comparable to, and no higher than, that of standalone AI (approximately 60%). This suggests that integrating AI into clinical workflows requires careful attention to task specificity and human factors in order to avoid cognitive biases and ensure effective collaboration. The review calls for future research to focus on multicenter trials embedded in real clinical workflows, with an emphasis on safety and error metrics, to improve the generalizability and validity of findings.
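One simple way to express the paradox is as a "synergy gap": the accuracy of the team minus the accuracy of the best single agent. The sketch below uses the two accuracies reported above, while the human-only accuracy is a hypothetical placeholder because this summary does not report it. The name `synergy_gap` is illustrative and not taken from the paper.

```python
def synergy_gap(human, ai, team):
    """Team accuracy minus the accuracy of the best individual agent.

    A positive value means the collaboration beats both the human-only
    and the AI-only arm; a negative value means it does not.
    """
    return team - max(human, ai)

AI_ONLY = 0.60     # approximate value reported in the summary
TEAM = 0.58        # H+AI value reported in the summary
HUMAN_ONLY = 0.50  # HYPOTHETICAL placeholder, not reported in the summary

gap = synergy_gap(HUMAN_ONLY, AI_ONLY, TEAM)
print(f"synergy gap ~ {gap:+.2f}")
```

Because the best-agent term is at least the AI-only accuracy, the gap cannot be positive here whatever the human-only value is. This simple difference ignores sampling uncertainty, so it describes the direction of the result rather than testing it.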