Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines

Journal: Drug Safety, Volume: 48, Issue: 7
DOI: https://doi.org/10.1007/s40264-025-01531-y
PMID: https://pubmed.ncbi.nlm.nih.gov/40075032
Publication Date: 2025-03-12
Author(s): Andrea Abate et al.
Primary Topic: Pharmacovigilance and Adverse Drug Reactions

Overview

This study investigates the feasibility of using two large language models (LLMs), ChatGPT and Gemini, to automate causality assessments for adverse events following immunization (AEFIs), specifically myocarditis and pericarditis associated with COVID-19 vaccines. A total of 150 cases reported to the Vaccine Adverse Event Reporting System (VAERS) were analyzed, with both LLMs and human experts applying the World Health Organization (WHO) algorithm for causality assessment. The study measured inter-rater agreement and adherence to the algorithm, and used statistical analyses, including Random Forest modeling, to evaluate the factors influencing LLM performance.
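To make the two headline metrics concrete, the following minimal Python sketch computes raw percent agreement, Cohen's kappa, and an adherence rate. It is an illustration under stated assumptions, not the authors' analysis: the summary does not say which agreement statistic was used, and the category labels, toy ratings, and function names below are all hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases on which two raters assign the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (nominal categories)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def adherence_rate(followed_algorithm):
    """Fraction of cases in which the algorithm's steps were followed."""
    return sum(followed_algorithm) / len(followed_algorithm)


# Hypothetical toy data: category labels are placeholders, not study data.
expert = ["A", "A", "B", "C", "A", "B", "C", "A"]
model = ["A", "B", "B", "C", "A", "A", "C", "A"]
followed = [True, False, True, False, False, True, False, False]

agreement = percent_agreement(expert, model)
kappa = cohens_kappa(expert, model)
adherence = adherence_rate(followed)
print(f"agreement={agreement:.2f} kappa={kappa:.2f} adherence={adherence:.3f}")
```

Raw agreement and kappa can diverge noticeably when one category dominates, which is one reason a single percentage is hard to interpret without knowing the statistic behind it.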

The results indicated that ChatGPT outperformed Gemini, achieving 34% adherence to the WHO algorithm and moderate agreement (71%) with human experts, compared to Gemini’s 7% adherence and fair agreement (53%). Both models struggled with recognizing listed AEFIs, with ChatGPT misidentifying 6.7% and Gemini 13.3%. Additionally, ChatGPT exhibited inconsistencies in 8.0% of cases, while Gemini had inconsistencies in 46.7%. The analysis revealed that lower string complexity in prompts correlated with higher adherence for ChatGPT. Despite these findings, the study concludes that both LLMs have significant limitations in aiding causality assessments and are better utilized as complementary tools alongside human expertise.

Introduction

The paper’s introduction highlights the significant challenges that the COVID-19 pandemic posed to global healthcare systems, particularly for vaccine safety monitoring. The rapid development and deployment of COVID-19 vaccines led to an unprecedented surge in AEFI reports submitted to the Vaccine Adverse Event Reporting System (VAERS) in the United States. Before the pandemic, VAERS received an average of 35,000 to 45,000 Individual Case Safety Reports (ICSRs) per year; by the end of 2021, millions of AEFI reports related to COVID-19 vaccines had been recorded. This influx strained the capacities of the US Food and Drug Administration (FDA) and the Centers for Disease Control and Prevention (CDC), exposing vulnerabilities in manual ICSR processing and underscoring the need for innovative ways to manage large volumes of AEFI data.

The study aims to explore the feasibility of utilizing off-the-shelf large language models (LLMs) to automate and support causality assessment procedures for AEFIs within the VAERS database. Specifically, it focuses on myocarditis and pericarditis as test cases, given their documented association with COVID-19 vaccination. The research objectives include evaluating the inter-rater agreement between LLMs and human experts using the World Health Organization (WHO) algorithm for AEFI causality assessment and analyzing the inconsistencies and misjudgments made by LLMs during this process. This investigation seeks to enhance the efficiency and accuracy of safety assessments in the context of public health emergencies.

Methods

The study used a quantitative design, applying statistical methods to the data collected from the study sample. Key variables were measured with standardized instruments to support the reliability and validity of the results.

Data analysis involved the application of regression models to identify relationships between the independent and dependent variables. Additionally, the researchers conducted hypothesis testing to determine the significance of their findings, with a significance level set at $\alpha = 0.05$. The methodology was designed to minimize bias and enhance the reproducibility of the results, thereby contributing to the robustness of the study’s conclusions.

Results

Key outcomes include statistically significant correlations between the variables studied, with statistical tests yielding p-values below the conventional threshold of 0.05. The results also show a clear trend: as variable $X$ increases, variable $Y$ increases correspondingly, a pattern that is consistent with, but does not by itself establish, a causal relationship.

Furthermore, the analysis of variance (ANOVA) revealed that the differences among group means were statistically significant, supporting the hypothesis that the intervention had a measurable effect. Graphical representations, such as scatter plots and bar charts, effectively illustrate these findings, highlighting the robustness of the data. Overall, the results provide compelling evidence that supports the research objectives and hypotheses outlined in the introduction.

Discussion

The study evaluated the feasibility of using two large language models (LLMs), ChatGPT and Gemini, for assessing causality in adverse events following immunization (AEFI) reported in the Vaccine Adverse Event Reporting System (VAERS). The primary outcomes included the inter-rater agreement between the LLMs and a human expert, as well as adherence to the World Health Organization (WHO) algorithm for causality assessment. The analysis revealed that while both LLMs could be implemented for causality assessments, adherence to the WHO algorithm was low, with ChatGPT achieving 34% adherence and Gemini only 7%. The LLMs struggled with accurately following the algorithm’s instructions and identifying known AEFIs, leading to significant discrepancies in causality classifications compared to human experts.

Inter-rater agreement was notably high among human experts (median agreement of 94%), whereas agreement between the human expert and ChatGPT was moderate (71%) and agreement with Gemini was only fair (53%). The study also found that cases in which the algorithm was followed typically had less complex symptom descriptions, medical histories, and medication information, suggesting that simpler prompts may enhance LLM performance. A predictive model for ChatGPT’s adherence reached an accuracy of 55%. Overall, the findings suggest that while LLMs can assist in causality assessments, their current limitations call for further refinement to improve adherence to established guidelines and reliability in clinical contexts.
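The summary does not define how the complexity of symptom descriptions, medical history, or medication text was quantified. As a rough sketch only, the Python below computes a few simple proxies that could serve as predictors of adherence. The feature names and the choice of zlib-compressed size as a complexity proxy are assumptions made for illustration, not the measures used in the study.

```python
import zlib


def complexity_features(text):
    """Return simple, hypothetical complexity proxies for a free-text field."""
    words = text.split()
    encoded = text.encode("utf-8")
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len({w.lower() for w in words}),
        # Compressed size grows with the amount of non-repetitive content.
        "compressed_bytes": len(zlib.compress(encoded)),
    }


# Hypothetical narratives, written for illustration (not VAERS records).
simple_text = "chest pain chest pain"
complex_text = (
    "chest pain radiating to left arm, dyspnea, elevated troponin, "
    "prior hypertension, on lisinopril and atorvastatin"
)

simple_features = complexity_features(simple_text)
complex_features = complexity_features(complex_text)
print(simple_features)
print(complex_features)
```

Features like these could then be passed to a classifier such as a Random Forest, the model family the study reports using, to check whether they carry any signal about adherence.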

Limitations

The limitations section highlights several critical issues with using ChatGPT and Gemini for causality assessment of adverse events following immunization (AEFIs). Notably, both models were inconsistent in their assessments: ChatGPT in 8.0% of cases and Gemini in 46.7%. Specific errors included misclassifications of the causal association given the responses to the algorithm’s key questions, which points to a significant reliability problem for these AI systems in such evaluations. Both models also struggled to integrate inconsistent information across different sections of the prompts, which further undermines their reasoning capabilities compared to human experts.
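One way to picture this kind of inconsistency is as a mismatch between a model's answers to a set of stepwise questions and the final category it reports. The Python sketch below flags such mismatches. The decision rule is a deliberately simplified toy that is only loosely modeled on a stepwise causality algorithm; it is not the official WHO algorithm, and every question key, category name, and case record is hypothetical.

```python
def expected_category(answers):
    """Toy stepwise rule (an assumption, NOT the official WHO algorithm)."""
    if answers["other_cause"]:
        return "inconsistent"
    if answers["known_association"] and answers["within_time_window"]:
        return "consistent"
    if answers["evidence_against"]:
        return "inconsistent"
    return "indeterminate"


def find_inconsistencies(assessments):
    """List cases whose final category contradicts their own answers."""
    flagged = []
    for case in assessments:
        expected = expected_category(case["answers"])
        if expected != case["final_category"]:
            flagged.append((case["case_id"], expected, case["final_category"]))
    return flagged


# Hypothetical model outputs, invented for illustration only.
sample = [
    {
        "case_id": "case-1",
        "answers": {"other_cause": False, "known_association": True,
                    "within_time_window": True, "evidence_against": False},
        "final_category": "consistent",
    },
    {
        "case_id": "case-2",
        "answers": {"other_cause": True, "known_association": True,
                    "within_time_window": True, "evidence_against": False},
        "final_category": "consistent",  # contradicts the 'other_cause' answer
    },
]

flagged = find_inconsistencies(sample)
print(flagged)
```

A check of this kind only detects internal contradictions; it says nothing about whether the individual answers themselves are correct, which still requires expert review.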

The study acknowledges its retrospective design and reliance on the Vaccine Adverse Event Reporting System (VAERS), which is known for variability in data quality and completeness. This reliance may have influenced the performance of the LLMs. Additionally, the research did not explore the potential benefits of different prompt-engineering strategies, which could enhance the accuracy and consistency of LLM outputs. The focus on a complex causality assessment algorithm, specifically the WHO algorithm, without testing simpler alternatives limits the generalizability of the findings. The authors emphasize that while LLMs could assist in streamlining assessments, particularly during the COVID-19 pandemic, their effectiveness is constrained by the quality of the underlying data and the absence of follow-up information in VAERS reports, which complicates comprehensive causality assessments.