Evaluating the effectiveness of large language models in abstract screening: a comparative analysis

Journal: Systematic Reviews, Volume: 13, Issue: 1
DOI: https://doi.org/10.1186/s13643-024-02609-x
PMID: https://pubmed.ncbi.nlm.nih.gov/39169386
Publication Date: 2024-08-21
Author(s): Michael Li et al.
Primary Topic: Meta-analysis and systematic reviews

Overview

This study investigates the efficacy of Large Language Models (LLMs) in abstract screening for systematic reviews and meta-analyses. Using automation scripts written in Python, the authors evaluated several LLMs, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM, and Meta Llama 2, on three databases of abstracts. Performance was measured by sensitivity, specificity, and overall accuracy, with human-curated inclusion decisions as the benchmark. Notably, ChatGPT v4.0 exhibited superior performance, achieving over 90% accuracy, suggesting that LLMs can significantly enhance the efficiency of abstract screening.

While LLMs are not yet a complete substitute for human experts, the findings indicate their potential as autonomous reviewers and collaborators in hybrid workflows. The study highlights that LLMs can streamline the screening process, reducing the need for extensive human involvement and offering a cost-effective alternative to traditional methods. As technology evolves, LLMs are expected to play an increasingly pivotal role in transforming abstract screening in systematic reviews and meta-analyses, thereby improving overall research efficiency and effectiveness.

Introduction

The introduction of this research paper outlines the significance of systematic reviews in synthesizing evidence across various academic fields, particularly in medical research. Systematic reviews adhere to rigorous methodologies, such as the PRISMA guidelines, to minimize bias and enhance reproducibility, thereby serving as a high level of evidence in evidence-based research. The process begins with a clearly defined research question and involves comprehensive searches through databases like PubMed and Embase, guided by predetermined inclusion and exclusion criteria. Despite their importance, the execution of systematic reviews is challenging, particularly during the abstract screening phase, which is time-consuming and prone to human error due to variability in reporting standards and cognitive fatigue.

The paper highlights the growing prevalence of systematic reviews, with approximately 51 published daily in 2023, indicating their critical role in consolidating research findings. Recent advancements in machine learning (ML) and artificial intelligence (AI), particularly tools like ChatGPT, Google PaLM, and Llama 2, present promising solutions to streamline the abstract screening process. The research aims to evaluate the performance of these AI models against traditional human expert-based methods and existing ML approaches, with the goal of developing efficient and accurate strategies that reduce human intervention in systematic reviews. The remaining sections of the paper cover the current methods for abstract screening, the experimental design for evaluating AI tools, and the empirical findings, and close by discussing the implications for the future of systematic review practices.

Results

In this section, the authors present the performance of the large language model (LLM) tools at screening abstracts from three databases, with a detailed analysis of the Meijboom 2021 database. The screening prompts asked each LLM to assess eligibility against specific criteria related to transitioning patients from originator to biosimilar treatments. On this database, ChatGPT v4.0 achieved the highest accuracy (0.840) among the tested tools, although its sensitivity (0.812) and specificity (0.860) were each exceeded by at least one other model.
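The three metrics reported above can all be derived from a comparison of model decisions with the human-curated inclusion decisions. The following is a minimal sketch of that calculation, not the authors' script; the function name `screening_metrics` and the boolean encoding (True = include) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    sensitivity: float  # share of human-included abstracts the model also included
    specificity: float  # share of human-excluded abstracts the model also excluded
    accuracy: float     # share of all abstracts where model and human agree


def screening_metrics(human: Sequence[bool], model: Sequence[bool]) -> ScreeningMetrics:
    """Compare model decisions with human decisions (True = include)."""
    if len(human) != len(model):
        raise ValueError("human and model decisions must have the same length")
    if not human:
        raise ValueError("at least one decision is required")

    tp = sum(h and m for h, m in zip(human, model))              # both include
    tn = sum((not h) and (not m) for h, m in zip(human, model))  # both exclude
    fp = sum((not h) and m for h, m in zip(human, model))        # model includes only
    fn = sum(h and (not m) for h, m in zip(human, model))        # model misses a study

    # Guard against a reference set with no included (or no excluded) abstracts.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(human)
    return ScreeningMetrics(sensitivity, specificity, accuracy)
```

In abstract screening, sensitivity is usually the metric to watch most closely, because a false negative removes a relevant study from the review before anyone reads its full text.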

Pairwise comparisons using the McNemar test showed that, in terms of sensitivity, ChatGPT v4.0 performed significantly better than Google PaLM (p < 0.001) but worse than ChatGPT v3.5 (p = 0.001) and Llama 2. In terms of specificity, ChatGPT v4.0 outperformed ChatGPT v3.5 (p < 0.001) and Llama 2 (p < 0.001) but was outperformed by Google PaLM (p = 0.002). Combining the models' decisions by majority vote did not improve overall accuracy over ChatGPT v4.0 alone, although it significantly improved sensitivity (p = 0.008) without a significant decline in specificity (p > 0.50). ChatGPT v4.0 also showed higher sensitivity than both the zero-shot method (p = 0.0002) and the hybrid method (p < 0.001), and marginally higher specificity than the zero-shot method (p = 0.099).
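To make the two techniques in this paragraph concrete, here is a small sketch of majority voting across models and of a McNemar test on paired decisions. It rests on assumptions: the summary does not say which McNemar variant the authors used (the exact binomial form is shown here), and the tie rule for an even number of models is an illustrative choice. The names `majority_vote` and `mcnemar_exact` are hypothetical.

```python
from math import comb
from typing import List, Sequence


def majority_vote(decisions_per_model: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-model decisions (True = include) into one decision per abstract.

    Assumption: a tie counts as "include", which favors sensitivity.
    """
    n_models = len(decisions_per_model)
    return [2 * sum(votes) >= n_models for votes in zip(*decisions_per_model)]


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correctness indicators.

    correct_a[i] / correct_b[i] say whether method A / B agreed with the
    human decision on abstract i. Only discordant pairs carry information.
    """
    b = sum(a and not o for a, o in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(o and not a for a, o in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # the two methods never disagree
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

To compare sensitivity between two methods, one would pass correctness indicators for only the abstracts that the human reviewers included; restricting to the human-excluded abstracts gives the corresponding comparison of specificity.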

Discussion

In the discussion of existing approaches to abstract screening in systematic reviews, the paper highlights the evolution from manual evaluations to advanced AI techniques. The manual approach, while thorough and reliable due to expert scrutiny, is time-consuming and susceptible to human error. In contrast, Natural Language Processing (NLP) approaches built on classifiers such as Support Vector Machines (SVM) and Random Forests (RF) automate the screening process, enhancing efficiency but relying heavily on well-curated training datasets. A notable advancement is zero-shot classification, which allows for categorization without extensive training data, though it may compromise sensitivity and lead to missed relevant studies.

To address the limitations of these methods, a hybrid approach has emerged, combining the rapid categorization of zero-shot classification with the precision of traditional machine learning models. This method involves an initial screening using zero-shot classification, followed by manual review of selected abstracts, which then informs the training of traditional models for broader datasets. The paper also discusses the potential of Large Language Models (LLMs) like ChatGPT in abstract screening, demonstrating their ability to evaluate abstracts against specific criteria effectively. The authors propose an automated workflow utilizing LLMs to streamline the screening process, emphasizing the importance of tuning parameters for optimal performance. Overall, the findings suggest that while LLMs and hybrid methodologies show promise in improving the accuracy and efficiency of abstract screening, careful consideration of their implementation and training data remains crucial.
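The automated workflow described above can be pictured as a loop that turns each abstract and the eligibility criteria into a prompt, sends it to a model, and parses the reply into a decision. The sketch below is illustrative only and is not the authors' code: the prompt wording, the function names, and the rule for ambiguous replies are all assumptions, and `keyword_stub_llm` is a stand-in so the example runs without any real model API.

```python
from typing import Callable, Iterable, List


def build_prompt(criteria: str, abstract: str) -> str:
    """Assemble a screening prompt (wording is hypothetical)."""
    return (
        "You are screening abstracts for a systematic review.\n"
        f"Eligibility criteria: {criteria}\n"
        f"Abstract: {abstract}\n"
        "Answer with a single word: INCLUDE or EXCLUDE."
    )


def parse_decision(reply: str) -> bool:
    """Return True for include.

    Assumption: any reply that does not clearly start with EXCLUDE is kept,
    so ambiguous answers go to a human rather than being dropped.
    """
    return not reply.strip().upper().startswith("EXCLUDE")


def screen_abstracts(
    abstracts: Iterable[str],
    criteria: str,
    ask_llm: Callable[[str], str],
) -> List[bool]:
    """Screen each abstract with whatever model `ask_llm` wraps."""
    return [parse_decision(ask_llm(build_prompt(criteria, a))) for a in abstracts]


def keyword_stub_llm(prompt: str) -> str:
    """Stand-in for a real model call: includes abstracts mentioning 'biosimilar'."""
    abstract_part = prompt.split("Abstract:", 1)[1].lower()
    return "INCLUDE" if "biosimilar" in abstract_part else "EXCLUDE"
```

Passing the model call in as a function keeps the loop independent of any particular provider, so the same screening code could be run against each model being compared.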