Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models

Journal: PharmacoEconomics – Open, Volume: 8, Issue: 2
DOI: https://doi.org/10.1007/s41669-024-00476-9
PMID: https://pubmed.ncbi.nlm.nih.gov/38340277
Publication Date: 2024-02-10
Author(s): Tim Reason et al.
Primary Topic: Meta-analysis and systematic reviews

Overview

This pilot study investigates the application of a large language model (LLM), specifically the Generative Pre-trained Transformer 4 (GPT-4), in automating the processes involved in systematic reviews and network meta-analyses (NMAs). The research focuses on four case studies with binary and time-to-event outcomes in two disease areas, where NMAs had previously been conducted manually. A Python script was developed to interface with the LLM via API calls, prompting it to extract relevant data from publications, generate R scripts to run the NMA, and produce interpretative reports.
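The three prompted stages described above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the authors' script: `ask_llm` and `run_r` are hypothetical callables standing in for the API call and the R execution step, and the prompt wording is invented for the example.

```python
from typing import Callable, Dict

# Hypothetical interfaces: any function that sends a prompt to an LLM and returns
# its reply, and any function that executes an R script and returns its output.
AskLLM = Callable[[str], str]
RunR = Callable[[str], str]


def run_nma_pipeline(publication_text: str, ask_llm: AskLLM, run_r: RunR) -> Dict[str, str]:
    """Chain the three prompted stages: extract data, generate R code, write a report."""
    # Stage 1: data extraction from the publication text (prompt wording is illustrative).
    extracted = ask_llm(
        "Extract the treatment arms and outcome data needed for a network "
        "meta-analysis from this publication:\n" + publication_text
    )
    # Stage 2: R script generation from the extracted data.
    r_script = ask_llm(
        "Write an R script that runs a network meta-analysis on these data:\n" + extracted
    )
    # The script is executed outside the LLM; its output feeds the report stage.
    results = run_r(r_script)
    # Stage 3: interpretative report.
    report = ask_llm(
        "Write a short report interpreting these network meta-analysis results:\n" + results
    )
    return {"data": extracted, "r_script": r_script, "results": results, "report": report}
```

Passing the LLM and R runner in as functions keeps the sketch independent of any particular API client and makes each stage easy to check with stand-ins.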

The findings indicate that the LLM achieved over 99% accuracy in data extraction across 20 runs for each case study and successfully generated executable R scripts without human intervention. Additionally, the reports produced were of high quality, effectively summarizing the disease area, analysis, results, and interpretations. The study concludes that LLMs like GPT-4 show significant promise in automating data extraction, code generation, and result interpretation in NMAs, potentially leading to considerable time savings and reduced human error. However, it emphasizes the necessity for routine technical checks to ensure reliability, as LLMs are not yet fully consistent but are expected to improve over time. Further research is recommended to refine and validate LLM-based processes in this context.
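One way to read the "over 99% across 20 runs" figure is as the share of extracted fields that match a reference extraction, pooled over all runs. The sketch below assumes that definition; the field names and values in the usage lines are made up for illustration and are not data from the study.

```python
from typing import Dict, List


def extraction_accuracy(runs: List[Dict[str, object]], reference: Dict[str, object]) -> float:
    """Share of reference fields reproduced exactly, pooled over all runs."""
    if not runs or not reference:
        raise ValueError("need at least one run and one reference field")
    total = 0
    correct = 0
    for run in runs:
        for field, expected in reference.items():
            total += 1
            # A missing field counts as an error, the same as a wrong value.
            if run.get(field) == expected:
                correct += 1
    return correct / total


# Illustrative values only (hypothetical field names, not study data).
reference = {"arm_a_events": 12, "arm_a_n": 100}
runs = [{"arm_a_events": 12, "arm_a_n": 100}, {"arm_a_events": 12, "arm_a_n": 101}]
accuracy = extraction_accuracy(runs, reference)  # 3 of 4 fields match
```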

Introduction

The introduction outlines the critical role of health technology assessment (HTA) agencies in evaluating the efficacy, safety, and cost-effectiveness of new medical interventions, which is essential for price negotiations and reimbursement decisions. Pharmaceutical companies must provide robust evidence through systematic literature reviews (SLRs) and meta-analyses, recognized as the most rigorous methods for synthesizing such evidence. The HTA process involves several steps, including conducting SLRs to identify relevant clinical studies, synthesizing data using statistical techniques, and developing economic models to assess cost-effectiveness. This process is labor-intensive, requiring a team of experts and often taking over a year to complete, which can delay patient access to new treatments.

The introduction further highlights the potential of artificial intelligence (AI), particularly generative AI and large language models (LLMs), to optimize the HTA workflow, making it more efficient and less error-prone. Despite their capabilities, the practical application of these AI models in health economics and outcomes research (HEOR) remains largely untested. The study aims to assess the feasibility of using LLMs for automated network meta-analysis (NMA) by developing processes for automated data extraction and analysis, using previously conducted NMAs as case studies. This initiative seeks to streamline the HTA process, ultimately improving patient outcomes through faster access to new treatments.

Methods

The study used four previously conducted NMAs as case studies: one in hidradenitis suppurativa (HS) with a binary outcome, analyzed as odds ratios versus placebo, and three in non-small cell lung cancer (NSCLC) with time-to-event outcomes, analyzed as hazard ratios versus docetaxel. A Python script interfaced with GPT-4 through API calls and prompted it to perform three tasks: extract the relevant data from the publications, generate an R script to run the NMA, and write a report interpreting the results. The prompts were developed iteratively to improve output quality, to work within the model's token limit, and to supply the context needed to interpret statistical significance correctly.

Each case study was run 20 times. The outputs of these runs (the extracted data, the generated R scripts, the NMA estimates, and the written reports) were then compared with the manually conducted NMAs.

Results

Across the four case studies, the R scripts generated by the LLM reproduced the results of the manual NMAs. In case study 1, the mean odds ratios for treatments versus placebo differed from those of the human-conducted analysis by less than 1%, which is within the expected variability (Table 6). In case studies 2, 3, and 4, the mean hazard ratios for treatments versus docetaxel were identical to those from the manual NMA, with slight variations in credible interval limits attributable to Monte Carlo error, consistent with human-run analyses (Tables 7, 8, and 9).
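Monte Carlo error is the reason two runs of the same Bayesian analysis can agree on the central estimate yet differ slightly at the interval limits. The sketch below illustrates this with independent draws from a log-normal distribution standing in for posterior samples of a hazard ratio; the mean and standard deviation are arbitrary assumptions, and real NMA software would produce correlated MCMC draws rather than independent ones.

```python
import math
import random
import statistics
from typing import Tuple


def summarize_posterior(seed: int, n_draws: int = 20000,
                        mean_log_hr: float = -0.3, sd_log_hr: float = 0.15) -> Tuple[float, float, float]:
    """Return (median, 2.5% limit, 97.5% limit) of simulated hazard-ratio draws."""
    rng = random.Random(seed)  # a separate generator per run, so seeds do not interact
    draws = sorted(math.exp(rng.gauss(mean_log_hr, sd_log_hr)) for _ in range(n_draws))
    lower = draws[int(0.025 * n_draws)]  # empirical 2.5th percentile
    upper = draws[int(0.975 * n_draws)]  # empirical 97.5th percentile
    return statistics.median(draws), lower, upper


# Two "runs" of the same analysis that differ only in the random seed.
run_a = summarize_posterior(seed=1)
run_b = summarize_posterior(seed=2)
relative_gaps = [abs(a - b) / a for a, b in zip(run_a, run_b)]
```

With this many draws the gaps are small but not zero, and they tend to be larger for the interval limits than for the median, because tail quantiles are estimated from fewer nearby draws.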

Additionally, the LLM produced well-structured NMA reports that effectively summarized the disease areas of interest, specifically hidradenitis suppurativa for case study 1 and non-small cell lung cancer (NSCLC) for the others. The reports included clear methodologies and accurate interpretations of results, correctly identifying the best treatment and assessing statistical significance across all runs. Minor issues were noted in the dataframes generated by the LLM, such as incorrect treatment numbering and confidence interval reporting, which were easily rectified. Overall, the LLM performed well in both data extraction and report writing, although the level of detail varied between runs (Fig. 4).

Discussion

The study presents a novel approach that uses a large language model (LLM), specifically GPT-4, to automate the processes involved in network meta-analysis (NMA), including data extraction, R script generation, and results interpretation. Four case studies were conducted across two disease areas, hidradenitis suppurativa (HS) and non-small cell lung cancer (NSCLC), to evaluate the LLM’s performance in replicating results from manually conducted NMAs. The case studies demonstrated that the LLM could effectively extract relevant data from literature, generate R scripts that were largely executable with minimal human intervention, and produce coherent reports summarizing the findings. The overall data extraction success rate exceeded 99%, well above typical human data extraction accuracy.

The methodology involved iterative prompt development to enhance the LLM’s output quality, addressing challenges such as the token limit for text processing and the need for contextual information to ensure accurate interpretation of statistical significance. Despite some minor errors in script generation, the LLM’s performance was robust, with most scripts requiring only minor adjustments to run successfully. This study highlights the potential of LLMs in streamlining complex analytical processes in healthcare research, suggesting that further refinements in prompting and contextualization could enhance their utility in automated meta-analyses.
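A common workaround for a token limit is to split long text into pieces that each fit the budget and to prompt on each piece separately. The sketch below shows that idea only; it is not necessarily how the authors handled the limit, and the `tokens_per_word` ratio is a rough assumed heuristic, since the true count depends on the model's tokenizer.

```python
from typing import List


def split_into_chunks(text: str, max_tokens: int, tokens_per_word: float = 1.3) -> List[str]:
    """Split text on whitespace into chunks whose estimated token count fits max_tokens."""
    if max_tokens <= 0 or tokens_per_word <= 0:
        raise ValueError("max_tokens and tokens_per_word must be positive")
    words = text.split()
    # Estimated number of words that fit in the budget (at least one word per chunk).
    words_per_chunk = max(1, int(max_tokens / tokens_per_word))
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]
```

Splitting on word boundaries is the simplest choice; for publications, splitting on section or table boundaries would keep related numbers together, at the cost of more parsing.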

Limitations

The limitations of the proposed approach highlight several key areas for further research and improvement. Firstly, the methodology has only been tested on a single large language model (LLM), specifically GPT-4, and a limited number of case studies, which raises questions about its generalizability to other LLMs and more complex treatment networks. The study did not verify the proportional hazards assumption for time-to-event analyses, indicating a need for additional validation. Variability in LLM responses poses challenges for reproducibility, although strategies such as controlling randomness and using specific LLM versions could mitigate this issue.
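The two mitigation strategies mentioned above, controlling randomness and using a specific model version, amount to fixing the run settings and recording them with every output. A minimal sketch follows, assuming hypothetical setting names (`model_version`, `temperature`, `seed`) rather than the parameters of any particular API, and a placeholder model identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMRunConfig:
    """Settings held fixed across runs (field names are illustrative assumptions)."""
    model_version: str         # a pinned, dated model identifier rather than a moving alias
    temperature: float = 0.0   # lowest randomness setting
    seed: int = 0              # only meaningful if the provider supports seeding


def run_fingerprint(config: LLMRunConfig, prompt: str) -> str:
    """Hash the settings and prompt so each output can be traced to its exact inputs."""
    payload = json.dumps({"config": asdict(config), "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder identifier, not a real model name.
config = LLMRunConfig(model_version="example-model-2024-01-01")
fingerprint = run_fingerprint(config, "Extract the outcome data from this publication.")
```

Fixed settings reduce run-to-run variation but may not eliminate it, so storing the fingerprint next to each output at least makes differences between runs attributable.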

Moreover, while the prompts developed are believed to be applicable across various disease areas, further demonstration is necessary to confirm their effectiveness in different contexts and with continuous outcomes. The study also identifies the need for LLMs to assist in additional network meta-analysis (NMA) tasks, such as model selection and convergence diagnostics. Although the LLM produced high-quality R scripts, inconsistencies were noted, suggesting that enhanced prompt engineering and fine-tuning could improve output reliability. Finally, the current approach lacks the robustness required for health technology assessment (HTA) bodies, particularly in areas like heterogeneity assessment, underscoring the necessity for human oversight even in fully automated processes.
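Convergence diagnostics, one of the additional NMA tasks named above, are often summarized with the Gelman-Rubin statistic (R-hat), which compares variation between chains with variation within chains. The sketch below implements the basic, non-split form of that statistic on plain lists; it is an illustration of the diagnostic, not part of the study's workflow, and the simulated chains are stand-ins for real MCMC output.

```python
import random
import statistics
from typing import List, Sequence


def gelman_rubin(chains: Sequence[Sequence[float]]) -> float:
    """Basic Gelman-Rubin R-hat for one parameter from equal-length chains."""
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("chains must have equal length of at least two draws")
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])  # W
    between = n * statistics.variance(chain_means)                               # B
    pooled = (n - 1) / n * within + between / n  # pooled estimate of posterior variance
    return (pooled / within) ** 0.5


def simulate_chain(seed: int, center: float, n: int = 2000) -> List[float]:
    """Stand-in for MCMC output: independent normal draws around a center."""
    rng = random.Random(seed)
    return [rng.gauss(center, 1.0) for _ in range(n)]


mixed = gelman_rubin([simulate_chain(seed, 0.0) for seed in range(4)])
stuck = gelman_rubin([simulate_chain(seed, center) for seed, center in enumerate([0.0, 0.0, 3.0, 3.0])])
```

Values close to 1 suggest the chains agree, while values well above 1, as in the `stuck` example where two chains sit around a different center, indicate that they have not converged to the same distribution.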