Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models

Journal: JAMA Network Open, Volume: 7, Issue: 5
DOI: https://doi.org/10.1001/jamanetworkopen.2024.12687
PMID: https://pubmed.ncbi.nlm.nih.gov/38776081
Publication Date: 2024-05-22
Author(s): Honghao Lai et al.
Primary Topic: Meta-analysis and systematic reviews

Overview

The study investigates the feasibility and reliability of large language models (LLMs) in assessing the risk of bias (ROB) in randomized clinical trials (RCTs). Conducted from August to October 2023, it was a survey study of thirty RCTs selected from published systematic reviews. A structured prompt guided two LLMs, ChatGPT (LLM 1) and Claude (LLM 2), in applying a modified Cochrane ROB tool. Each model assessed every RCT twice, and the results were compared against the assessments of three expert reviewers, which served as the criterion standard.

The findings indicate that both LLMs achieved high correct assessment rates, with LLM 1 attaining 84.5% (95% CI, 81.5%-87.3%) and LLM 2 achieving a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%). Despite overall high performance, sensitivity was notably below 0.80 in key domains such as random sequence generation and allocation concealment. The models demonstrated consistent assessment rates of 84.0% for LLM 1 and 87.3% for LLM 2, with Cohen’s κ indicating substantial agreement in several domains. Additionally, LLM 2 was more efficient, requiring an average of 53 seconds per assessment compared to 77 seconds for LLM 1. The study concludes that LLMs show promise as supportive tools in the systematic review process, highlighting their potential to enhance the efficiency and accuracy of ROB assessments in RCTs.
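The metrics named above (correct assessment rate, sensitivity, consistency between repeated runs, and Cohen's κ) can be illustrated with a minimal Python sketch. It assumes each domain-level judgment is coded as a binary label, with 1 standing for whichever rating is treated as positive; the function names and the toy ratings are hypothetical and are not taken from the study's data or analysis code.

```python
from typing import Sequence


def correct_rate(model: Sequence[int], reference: Sequence[int]) -> float:
    """Share of assessments where two rating lists agree item by item."""
    if len(model) != len(reference) or len(model) == 0:
        raise ValueError("ratings must be non-empty and equally long")
    return sum(m == r for m, r in zip(model, reference)) / len(model)


def sensitivity(model: Sequence[int], reference: Sequence[int],
                positive: int = 1) -> float:
    """True positives divided by all items the reference rates positive."""
    on_positives = [m for m, r in zip(model, reference) if r == positive]
    if not on_positives:
        raise ValueError("reference contains no positive ratings")
    return sum(m == positive for m in on_positives) / len(on_positives)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two rating lists."""
    n = len(a)
    p_o = correct_rate(a, b)  # observed agreement
    labels = set(a) | set(b)
    # Agreement expected by chance from each list's marginal frequencies.
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Toy ratings (hypothetical): expert consensus and two runs of one model.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
run_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
run_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

accuracy_run_1 = correct_rate(run_1, reference)     # model vs. experts
sensitivity_run_1 = sensitivity(run_1, reference)   # model vs. experts
consistency = correct_rate(run_1, run_2)            # run 1 vs. run 2
kappa_between_runs = cohen_kappa(run_1, run_2)      # chance-corrected
```

Raw agreement between two runs can look high simply because one rating dominates a domain, which is why a chance-corrected statistic such as κ is usually reported alongside the consistency rate.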

Introduction

The introduction highlights the critical role of systematic reviews in synthesizing existing research to guide clinical decisions and health guidelines. Despite advances in medical research, a significant gap remains in the availability of up-to-date, high-quality evidence, largely because systematic reviews are resource-intensive to produce. A key challenge is assessing the ROB of the included RCTs, which is essential for ensuring the reliability of review findings. The CLARITY group at McMaster University has developed a modified Cochrane ROB tool that evaluates ten specific domains related to bias and classifies risk without the ambiguous “unclear” category.

The introduction also discusses the potential of large language models (LLMs) to transform the systematic review process. These models, equipped with advanced machine learning algorithms, can generate human-like text and may streamline the ROB assessment process. However, there is currently no evidence validating the effectiveness of LLMs in conducting ROB assessments within systematic reviews. This study aims to propose a structured prompt for LLMs and evaluate their accuracy, consistency, and efficiency in assessing the ROB of RCTs, potentially addressing the challenges faced in systematic review production.

Methods

The survey study was conducted from August 10 to October 30, 2023, following the American Association for Public Opinion Research (AAPOR) reporting guidelines. The medical ethics review committee at the School of Public Health, Lanzhou University, determined that the study was exempt from review and did not require informed consent, as all data were derived from published research.

For this investigation, two highly representative large language models (LLMs) were selected: ChatGPT (referred to as LLM 1) and Claude (referred to as LLM 2). The main study process is illustrated in Figure 1.

Results

Measured against the expert criterion standard, both LLMs achieved high correct assessment rates: 84.5% (95% CI, 81.5%-87.3%) for LLM 1 and 89.5% (95% CI, 87.0%-91.8%) for LLM 2, a significantly higher rate. Sensitivity nevertheless fell below 0.80 in key domains such as random sequence generation and allocation concealment.

Across the two repeated assessments, the consistent assessment rate was 84.0% for LLM 1 and 87.3% for LLM 2, and Cohen’s κ indicated substantial agreement in several domains. LLM 2 was also faster, averaging 53 seconds per assessment compared with 77 seconds for LLM 1.

Discussion

In this study, a multidisciplinary working group developed a structured prompt to guide large language models (LLMs) in assessing the risk of bias (ROB) in randomized clinical trials (RCTs). The team, comprising experts in evidence-based medicine and computer science, conducted a systematic review of RCTs, refining the prompt through iterative feedback until it effectively aligned with expert assessments. The finalized prompt included clear instructions and examples, enabling LLMs to evaluate various domains of ROB and select appropriate ratings based on provided criteria.
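To make the idea of a structured prompt concrete, the following Python sketch assembles one from a list of bias domains, a fixed set of rating options, and the trial text. It is an illustration only: the `Domain` class, the `build_prompt` function, the rating labels, and the criteria wording are all hypothetical and do not reproduce the prompt developed by the study team.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Domain:
    name: str      # bias domain, e.g. "Random sequence generation"
    criteria: str  # hypothetical guidance text, not the study's wording


# Hypothetical rating options; the summary only states that the modified
# tool has no ambiguous middle category.
RATINGS = ("low risk", "high risk")


def build_prompt(domains: Sequence[Domain], trial_text: str) -> str:
    """Assemble instructions, per-domain criteria, and the trial report."""
    lines = [
        "You are assessing risk of bias in a randomized clinical trial.",
        "For each domain, choose exactly one rating from: "
        + ", ".join(RATINGS) + ".",
        "Give a one-sentence rationale that cites the trial report.",
        "",
    ]
    for number, domain in enumerate(domains, start=1):
        lines.append(f"{number}. {domain.name}: {domain.criteria}")
    lines += ["", "Trial report:", trial_text]
    return "\n".join(lines)


# Usage with two of the domains named in this summary (criteria invented).
example_domains = [
    Domain("Random sequence generation",
           "Was the allocation sequence adequately generated?"),
    Domain("Allocation concealment",
           "Was allocation adequately concealed?"),
]
example_prompt = build_prompt(example_domains, "<full text of the trial>")
```

Keeping the domains and rating options as data rather than free text makes it straightforward to revise one criterion during iterative refinement without touching the rest of the prompt.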

The assessment applied the prompt to a sample of RCTs using two LLMs, and both models achieved high accuracy and consistency when compared with human reviewers. Notably, LLM 2 demonstrated a significantly higher correct assessment rate than LLM 1, attributed to its ability to directly process PDF uploads, which streamlined the assessment process. Although both models were less accurate in assessing random sequence generation, they provided correct rationales for their judgments, highlighting the potential for LLMs to assist researchers in ROB assessments. This study is pioneering in its exploration of LLMs for evaluating ROB in RCTs, suggesting that with appropriate prompts, these models can enhance the efficiency and reliability of systematic reviews.

Limitations

The limitations of this survey study highlight several factors that may affect the robustness of its findings. First, the constrained sample size, a consequence of usage restrictions on LLM 1 and LLM 2, limits the ability to draw strong conclusions, particularly in domains with low probabilities of positive assessments. Second, the study evaluated only RCTs published in English, leaving the efficacy of the methodology for non-English literature unexamined.

Moreover, the criterion standard for the research was established through consensus among three senior experts, based on a uniform prompt provided to the LLMs. This approach may have oversimplified the assessment by not incorporating supplementary materials, such as appendices and registration details, which are often crucial for comprehensive evaluations. The study also focused solely on primary outcomes, despite indications that LLMs could potentially assess multiple outcomes simultaneously if prompts were tailored accordingly. Future research could address these limitations by expanding sample sizes, including non-English literature, and allowing LLMs access to external resources for more nuanced assessments.