Accelerating clinical evidence synthesis with large language models

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01840-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40775042
Publication Date: 2025-08-07
Author(s): Zifeng Wang et al.
Primary Topic: Biomedical Text Mining and Ontologies

Overview

TrialMind is a generative artificial intelligence (AI) pipeline designed to make systematic reviews (SRs) more efficient in clinical evidence synthesis. To evaluate it, the authors built TrialReviewBench, a dataset of 100 systematic reviews and 2,220 clinical studies. In study search, TrialMind achieves recall between 0.711 and 0.834, compared with 0.138 to 0.232 for a human baseline. In study screening, it outperforms previous document ranking methods by a factor of 1.5 to 2.6, and in data extraction it exceeds the accuracy of GPT-4 by 16 to 32%.

In a pilot study of human-AI collaboration, using TrialMind improved recall by 71.4% and reduced screening time by 44.2%. Data extraction accuracy rose by 23.5%, while the time spent fell by 63.4%. Medical experts preferred the evidence synthesized by TrialMind over that generated by GPT-4 in 62.5% to 100% of cases. These results suggest that TrialMind can substantially accelerate clinical evidence synthesis when used in collaboration with human reviewers.

Methods

This section describes how TrialMind retrieves and ranks clinical studies from large medical literature databases such as PubMed. Study search proceeds through query generation, augmentation, and refinement, and users can adjust the queries as needed. Search performance was evaluated on a dataset of clinical studies across ten cancer treatment areas, with Recall as the primary metric. TrialMind achieved an average Recall of 0.782, well above both a GPT-4 baseline (Recall = 0.073) and a Human baseline (Recall = 0.187). The advantage held across treatment topics, with Recall values such as 0.797 for Immunotherapy and 0.834 for Hyperthermia.
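As an illustration of the Recall metric used for study search, the following minimal sketch computes the fraction of a review's target studies that a query retrieves. The function name `search_recall` and the PMIDs are hypothetical and are not taken from the paper or its code.

```python
def search_recall(retrieved_ids, target_ids):
    """Fraction of target studies that appear among the retrieved records."""
    targets = set(target_ids)
    if not targets:
        raise ValueError("target_ids must not be empty")
    return len(targets & set(retrieved_ids)) / len(targets)


# Hypothetical PMIDs, for illustration only.
targets = ["101", "102", "103", "104"]
retrieved = ["101", "103", "104", "900", "901"]

recall = search_recall(retrieved, targets)
print(f"Recall = {recall:.3f}")  # 3 of the 4 target studies were retrieved
```

Search recall ignores how many irrelevant records a query returns, which is why a separate screening step is needed afterwards.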

TrialMind also streamlines citation screening through a three-step approach: generating eligibility criteria in PICO format, predicting study eligibility, and ranking studies by the aggregated predictions. Ranking performance was assessed with Recall@20 and Recall@50 and improved substantially on baseline methods, with fold changes ranging from 1.3 to 2.6 across topics. A detailed analysis of the individual eligibility criteria confirmed that most of them improved ranking performance. Together, the retrieval and ranking results illustrate TrialMind’s potential to improve systematic literature reviews in clinical research.
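The ranking and Recall@k steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoding of per-criterion predictions as 1 (meets), 0 (uncertain), and -1 (fails), the use of a plain mean as the aggregate, and all names and data are assumptions.

```python
def aggregate_score(criterion_predictions):
    """Mean of per-criterion predictions (assumed encoding: 1, 0, or -1)."""
    return sum(criterion_predictions) / len(criterion_predictions)


def rank_citations(predictions):
    """Return citation ids sorted by aggregated score, highest first."""
    return sorted(
        predictions,
        key=lambda cid: aggregate_score(predictions[cid]),
        reverse=True,
    )


def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of target studies found in the top-k ranked citations."""
    targets = set(target_ids)
    return len(targets & set(ranked_ids[:k])) / len(targets)


# Hypothetical predictions for four citations against three criteria.
predictions = {
    "a": [1, 1, 1],
    "b": [1, -1, 0],
    "c": [1, 1, 0],
    "d": [-1, -1, 0],
}
included = ["a", "b"]  # hypothetical ground-truth inclusions

ranking = rank_citations(predictions)
print(ranking)
print(recall_at_k(ranking, included, k=2))
```

A fold change between two methods is then the ratio of their Recall@k values at the same k.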

Results

The authors developed TrialReviewBench, a benchmark derived from published systematic reviews in the medical literature, focusing on cancer treatments. Target studies were extracted from the review papers, and data from the study characteristics tables served as ground truth for evaluation, following PRISMA guidelines. In total, 2,220 studies across 100 reviews were compiled, covering four primary therapy areas: Immunotherapy, Radiation/Chemotherapy, Hormone Therapy, and Hyperthermia.

Three evaluation tasks were established: study search, study screening, and data extraction. The study search task used the PICO (Population, Intervention, Comparison, Outcome) framework to generate keywords for Boolean queries submitted to the PubMed database, with model performance assessed by Recall. In the citation screening task, a candidate set of 2,000 citations was created, and the model ranked these by inclusion likelihood, evaluated using Recall@k. For data extraction, the authors manually annotated 1,334 study characteristics and 1,049 study results and compared the model’s outputs against these ground truth values to determine accuracy. Together, the three tasks provide a framework for evaluating systematic review methods in medical research.
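To make the search task concrete, the sketch below assembles a Boolean query string from PICO keywords by joining synonyms with OR inside each element and joining the elements with AND. It only builds the string and does not contact PubMed. The function names, the grouping strategy, and the example terms are assumptions for illustration, not the prompts or queries used in the paper.

```python
def or_block(terms):
    """Join synonyms with OR, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_boolean_query(pico_terms):
    """AND together one OR-block per non-empty PICO element."""
    blocks = [or_block(terms) for terms in pico_terms.values() if terms]
    return " AND ".join(blocks)


# Hypothetical keywords; an empty element is skipped.
pico = {
    "population": ["melanoma"],
    "intervention": ["nivolumab", "immune checkpoint inhibitor"],
    "comparison": [],
    "outcome": ["overall survival"],
}

query = build_boolean_query(pico)
print(query)
```

Adding synonyms inside an OR-block broadens the query and tends to raise recall, while adding another AND-ed element narrows it.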

Discussion

This section discusses the development and evaluation of TrialMind, an LLM-driven system designed to support clinical evidence synthesis within the PRISMA framework for systematic literature reviews. TrialMind covers the identification, screening, and inclusion stages by generating search terms from PICO elements, applying eligibility criteria, and extracting the data fields needed for meta-analysis. The system showed strong data extraction performance, with accuracies ranging from 0.72 to 0.83 across therapeutic topics and the highest accuracy for study design. Extracting study results was harder, mainly because of the complexity of the numerical data.
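Per-field extraction accuracy of the kind reported above can be computed as in the following minimal sketch. It assumes exact matching after light text normalization, which is a simplification: the paper's own matching procedure is not described here, and the field names and values are hypothetical.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and ignore case before comparing values."""
    return " ".join(str(value).split()).casefold()


def accuracy_by_field(records):
    """records: iterable of (field, extracted_value, ground_truth_value)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for field, extracted, truth in records:
        total[field] += 1
        if normalize(extracted) == normalize(truth):
            correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical extraction results, for illustration only.
records = [
    ("study design", "Randomized controlled trial", "randomized controlled trial"),
    ("study design", "Phase II", "phase ii "),
    ("sample size", "120", "120"),
    ("sample size", "98", "89"),
]

print(accuracy_by_field(records))
```

Exact matching is strict for numerical results, where a transposed digit counts as an error, which is consistent with results being harder to extract than study characteristics.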

TrialMind was compared against generalist LLMs such as GPT-4 and extracted clinical outcomes with significantly higher accuracy across multiple topics. The system also showed low rates of hallucinations and missing information, particularly in study results, where human oversight is crucial for verification. User studies indicated that TrialMind improved the quality of synthesized clinical evidence and also increased efficiency, substantially reducing the time spent on screening and data extraction. Overall, TrialMind is a promising approach to integrating AI into clinical research, one that relies on human-AI collaboration to address the challenges of systematic literature reviews.