Validation of large language models (Llama 3 and ChatGPT-4o mini) for title and abstract screening in biomedical systematic reviews

Journal: Research Synthesis Methods, Volume: 16, Issue: 4
DOI: https://doi.org/10.1017/rsm.2025.15
PMID: https://pubmed.ncbi.nlm.nih.gov/41626916
Publication Date: 2025-03-24
Author(s): Adriana López‐Pineda et al.
Primary Topic: Meta-analysis and systematic reviews

Overview

The study addresses the growing volume of scientific literature that systematic reviews must process, particularly during the labor-intensive title and abstract screening phase. It validates two large language models, Llama 3 70B and ChatGPT-4o mini, for automating this step in biomedical research. Model decisions were compared with those of human reviewers on a dataset of 1,081 articles. The Llama 3 configuration LLA_2 achieved a sensitivity of 77.5%, a specificity of 91.4%, and an overall accuracy of 90.2%. The ChatGPT-4o mini configuration CHAT_2 achieved a sensitivity of 56.2%, a specificity of 95.1%, and an accuracy of 92.0%. Both models showed strong specificity; Llama 3 was clearly more sensitive, while ChatGPT-4o mini was slightly more specific and more accurate overall.

The findings suggest that while AI can substantially improve the efficiency of systematic reviews, manual validation remains essential to limit both false positives and false negatives. The study concludes that combining AI with human review can optimize the selection of relevant studies while maintaining high methodological standards. Further research is recommended on AI's ability to judge the suitability of selected articles and on its overall effect on review outcomes, which could build confidence in adopting AI tools in scientific practice.

Introduction

The introduction describes how systematic reviews are conducted and why they matter for synthesizing research evidence on a specific topic. It outlines the labor-intensive steps involved, such as defining the research question, searching the literature, and assessing bias, and stresses that the growing volume of published research makes these steps harder. The paper highlights the potential of artificial intelligence (AI) models, particularly GPT-4, to improve efficiency and accuracy during literature screening and selection. Tools such as ASReview®, Rayyan®, and Covidence® are noted for their ability to automate study identification and ranking, thereby streamlining the review process.

However, the integration of AI in systematic reviews raises concerns regarding algorithmic bias and transparency, necessitating rigorous validation of these tools against traditional human review methods. The paper underscores the importance of ensuring the reliability of AI in screening, as inaccuracies could significantly impact the outcomes of systematic reviews. The primary objective of the study is to validate the effectiveness of large language model (LLM) tools, including Llama 3 and ChatGPT-4o mini, in the article selection process for biomedical systematic reviews, particularly in the screening of titles and abstracts.

Methods

The study used a comparative validation design to assess how accurately two AI tools screen out irrelevant records, by title and abstract, from a set of articles identified through a systematic review search. The reference (gold) standard was the manual screening performed by human reviewers. Data were collected retrospectively, after the manual screening had been completed in an academic setting equipped with standard technological resources. The articles came from a previous systematic review of cardiovascular morbidity and mortality risk factors in postmenopausal women (PROSPERO reference: CRD42022323101).

A total of 1,397 potentially eligible articles were retrieved from Medline, Embase, and Scopus in October 2022, of which 1,081 remained after duplicate removal on the Rayyan platform. Human reviewers independently screened the titles and abstracts from November 22, 2022, to January 23, 2023, with discrepancies resolved by a third reviewer. Articles were classified as “Excluded” or “Included” according to predefined eligibility criteria, and the final decisions from manual screening served as the validated gold standard. The AI tools were run in July 2024 and classified each article as “Included” or “Excluded” based on their training to identify irrelevant content patterns in abstracts, without access to the manual screening results.
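The comparison against the gold standard described above comes down to tallying a 2×2 confusion matrix. The following is a minimal sketch of that tally, assuming each record carries one of the two labels named in the text and that “Included” is treated as the positive class; the function name `confusion_counts` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Tally TP, FP, FN, TN, treating "Included" as the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = Counter()
    for g, p in zip(gold, predicted):
        if g == "Included":
            # Gold standard says relevant: model either caught it or missed it.
            counts["TP" if p == "Included" else "FN"] += 1
        else:
            # Gold standard says irrelevant: model either kept it or dropped it.
            counts["FP" if p == "Included" else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical toy labels for illustration only, not data from the study.
gold = ["Included", "Excluded", "Excluded", "Included", "Excluded"]
model = ["Included", "Excluded", "Included", "Excluded", "Excluded"]
result = confusion_counts(gold, model)
print(result)
```

In this framing a false negative (FN) is a relevant study the model would have discarded, which is the costlier error in a systematic review.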

Results

In the systematic review, a total of 1,081 articles were screened, leading to the inclusion of 84 studies for full-text review. Notably, 97 articles lacked abstracts in the reference files, so human reviewers had to obtain them from alternative platforms. The per-model results are reported on 984 records (80 included and 904 excluded), which equals 1,081 minus these 97 and suggests that the AI tools were evaluated only on records with an abstract available. The study tested two AI tools, Llama 3 and ChatGPT-4o mini, each under three configurations. Their performance was evaluated against the researchers’ classifications, with results summarized in tables and figures.

The Llama 3 model’s best-performing configuration, LLA_2, demonstrated a sensitivity of 77.5% by correctly including 62 of 80 studies and a specificity of 91.4% by accurately excluding 826 of 904 studies. It yielded a positive predictive value (PPV) of 44.3% and a negative predictive value (NPV) of 97.9%, achieving an overall accuracy of 90.2%. Conversely, the ChatGPT-4o mini model’s optimal configuration, CHAT_2, showed lower sensitivity at 56.2% (correctly including 45 of 80 studies) but higher specificity at 95.1% (correctly excluding 860 of 904 studies). CHAT_2’s PPV was 50.6% and NPV was 96.1%, with an overall accuracy of 92.0%. These findings highlight the varying effectiveness of AI tools in systematic review processes.
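The reported percentages can be re-derived from the counts given in the paragraph above. The sketch below does this, assuming that false negatives and false positives are the complements of the stated counts (for LLA_2, 80 − 62 = 18 and 904 − 826 = 78); the function name `screening_metrics` is illustrative and not from the paper.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return standard diagnostic accuracy metrics as fractions in [0, 1]."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # relevant studies correctly included
        "specificity": tn / (tn + fp),   # irrelevant studies correctly excluded
        "ppv": tp / (tp + fp),           # share of model inclusions that are relevant
        "npv": tn / (tn + fn),           # share of model exclusions that are irrelevant
        "accuracy": (tp + tn) / total,
    }


# Counts from the text; FP and FN are derived as complements (an assumption).
lla_2 = screening_metrics(tp=62, fp=904 - 826, fn=80 - 62, tn=826)
chat_2 = screening_metrics(tp=45, fp=904 - 860, fn=80 - 45, tn=860)

for name, metrics in (("LLA_2", lla_2), ("CHAT_2", chat_2)):
    print(name, {key: f"{100 * value:.2f}%" for key, value in metrics.items()})
```

Under that assumption, every recomputed value agrees with the reported figures to within rounding at one decimal place.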

Discussion

In this study, we explored the efficacy of AI tools, specifically the Llama 3 70B model and ChatGPT-4o mini, for screening titles and abstracts in systematic reviews within the biomedical field. Our findings indicate that both models achieved an overall accuracy of about 90% or higher (90.2% for LLA_2 and 92.0% for CHAT_2). Llama 3 performed better at excluding irrelevant articles, as shown by its high negative predictive value (NPV) of 97.6% to 97.9% across configurations. In contrast, ChatGPT-4o mini had a slightly lower NPV of 96.1% but a higher positive predictive value (PPV), ranging from 48.9% to 50.6%. These results underscore the potential of AI to enhance the efficiency and accuracy of systematic reviews, although they also highlight the need for manual validation to reduce the risk of missing relevant studies.
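The combination of a high NPV and a modest PPV follows largely from how few records are relevant. As a worked illustration using the LLA_2 counts reported in the Results, and assuming the standard Bayes relations between predictive values, sensitivity, specificity, and prevalence (the symbols Se, Sp, and p are notation introduced here, not the paper's):

```latex
% Se = sensitivity, Sp = specificity, p = prevalence of relevant records
\[
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}, \qquad
\mathrm{NPV} = \frac{Sp\,(1 - p)}{Sp\,(1 - p) + (1 - Se)\,p}
\]
% LLA_2: Se = 62/80, Sp = 826/904, p = 80/984 (about 8.1%)
\[
\mathrm{PPV}_{\text{LLA\_2}} = \frac{62/984}{62/984 + 78/984} = \frac{62}{140} \approx 0.443, \qquad
\mathrm{NPV}_{\text{LLA\_2}} = \frac{826/984}{826/984 + 18/984} = \frac{826}{844} \approx 0.979
\]
```

With only about 8% of records relevant, even a false-positive rate under 9% produces more false inclusions (78) than true inclusions (62), which is why the PPV stays below 50% while the NPV is close to 98%.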

The study further emphasizes the importance of customizing AI configurations, such as adjusting temperature settings, to optimize performance. Our results align with previous research, indicating that while AI tools can significantly reduce workload and improve selection accuracy, they are not infallible and should be used as complementary aids rather than replacements for human reviewers. Future research should focus on validating the suitability of articles identified by AI and assessing their impact on the final review outcomes. This approach will help establish greater confidence in the integration of AI tools in systematic reviews, ultimately contributing to more efficient and rigorous scientific practices in the face of an ever-increasing volume of literature.