Leveraging LLMs for semi-automatic corpus filtration in systematic literature reviews

Journal: Computers & Graphics, Volume: 135
DOI: https://doi.org/10.1016/j.cag.2026.104537
Publication Date: 2026-02-16
Author(s): Lucas Joos et al.
Primary Topic: Meta-analysis and systematic reviews

Overview

This research presents a semi-automated pipeline designed to enhance the process of systematic literature reviews (SLRs) by utilizing multiple large language models (LLMs). The proposed method addresses the challenges of literature retrieval and filtering, which are often time-consuming and labor-intensive due to the prevalence of irrelevant publications in keyword-based searches. By employing a consensus scheme among various LLMs and integrating human supervision through the open-source tool LLMSurver, the pipeline allows researchers to efficiently manage large corpora of literature while ensuring high recall and transparency.

The evaluation of this approach, conducted on a dataset of over 8,000 candidate papers, indicates that modern LLMs, both open-source and commercial, can achieve performance comparable to or exceeding that of human annotators. This significantly reduces manual effort, turning weeks of work into mere minutes. The findings emphasize the value of human-AI collaboration, since even minor adjustments to prompts can substantially influence outcomes. Overall, the study illustrates the potential of LLM-assisted workflows to streamline academic research, making the literature review process more accessible, cost-effective, and efficient.

Introduction

The introduction of the paper discusses the significance of systematic literature reviews (SLRs) in academia, highlighting their role as a structured method for synthesizing research findings and identifying gaps in knowledge. Despite their importance, the process of conducting SLRs is labor-intensive and time-consuming, particularly in rapidly evolving fields with high publication volumes. The standard SLR process, as outlined by Egger et al., involves eight stages, with manual screening of titles and abstracts being notably resource-intensive. For instance, reviewing a corpus of 8,000 papers can require around 66 person-hours, often extending to several months due to various factors affecting efficiency.
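As a rough reading of the figures above, the per-paper screening time they imply can be worked out directly. This is only arithmetic on the two numbers quoted in the summary (8,000 papers, about 66 person-hours); the per-paper value itself is a derived estimate, not a figure stated in the summary.

```latex
\[
  t_{\text{paper}} \approx \frac{66\ \text{h} \times 3600\ \text{s/h}}{8000\ \text{papers}}
  = \frac{237{,}600\ \text{s}}{8000}
  = 29.7\ \text{s per paper}
\]
```

In other words, the estimate corresponds to roughly half a minute of attention per title and abstract, before accounting for the breaks and interruptions that stretch the work over months.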

The paper proposes leveraging large language models (LLMs) to optimize the initial filtration of literature during SLR creation. LLMs can enhance the efficiency and quality of this process by capturing semantic nuances and operating faster and at lower costs than manual screening. The authors introduce a human-AI collaboration pipeline that utilizes multiple LLMs to classify papers, culminating in a consensus decision-making scheme. They present the open-source application LLMSurver, which implements this semi-automatic pipeline and allows for user supervision and refinement. The study evaluates the pipeline using a dataset of over 8,000 papers, assessing the performance of various LLMs from mid-2024 and fall 2025, and discusses the implications of their findings for future literature filtration efforts.

Methods

The authors propose a methodology to improve the corpus retrieval process for systematic literature reviews (SLRs) using large language models (LLMs) while maintaining high-quality outcomes. The pipeline builds upon the established PRISMA framework and integrates AI agents with human researchers. The initial phase involves searching multiple online libraries for relevant publications based on predefined keywords and criteria, resulting in an initial corpus. This corpus undergoes automated filtering, including duplicate removal and data unification, before being screened. Unlike traditional methods, in which researchers manually review titles and abstracts, the approach uses LLMs to classify papers based on their titles and abstracts together with customized prompts.
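The automated pre-filtering step (duplicate removal and data unification) might look roughly like the following. This is a minimal sketch, not the authors' implementation: the record fields `title` and `doi`, the helper names `normalize_title` and `deduplicate`, and the matching rule (same DOI or same normalized title) are all assumptions made for illustration, and the DOIs in the example are made up.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key: accent-free, lowercase, alphanumeric words."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each DOI or normalized title (assumed rule)."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        keys: set[str] = set()
        title_key = normalize_title(record.get("title") or "")
        if title_key:
            keys.add("title:" + title_key)
        doi = (record.get("doi") or "").strip().lower()
        if doi:
            keys.add("doi:" + doi)
        if keys & seen:
            continue  # already have this paper from another library export
        seen |= keys
        unique.append(record)
    return unique


# Hypothetical records, as they might arrive from different library exports.
records = [
    {"title": "Visual Network Analysis in Immersive Environments", "doi": "10.1000/example-1"},
    {"title": "Visual network analysis in immersive environments.", "doi": ""},
    {"title": "Some Retitled Preprint", "doi": "10.1000/EXAMPLE-1"},
    {"title": "A Different Paper", "doi": None},
]
unique_records = deduplicate(records)
```

Matching on either key is deliberately aggressive; a real pipeline would likely also reconcile fields such as authors and year before merging records.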

To mitigate the risks of relying solely on LLMs for classification, human researchers play a critical role in refining prompts, inspecting outputs, and evaluating results. The pipeline employs multiple LLMs with varying architectures to leverage their complementary strengths and make classification more robust. Each paper receives a classification from every LLM, and these are then harmonized through a consensus scheme that favors inclusion: a paper is excluded only if all models recommend rejection. This conservative approach minimizes the risk of excluding relevant publications. Although some false exclusions may still occur, they can be addressed in subsequent steps, such as snowballing. Overall, the methodology significantly reduces the manual review burden on researchers, accelerating the SLR process while still yielding a high-quality filtered corpus for further analysis. The visual-interactive tool LLMSurver supports this process by facilitating the analysis of LLM outputs and consensus formation.
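The inclusion-favoring consensus rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not LLMSurver's code: the `Vote` enum, the `consensus` and `filter_corpus` helpers, and the paper IDs are hypothetical, and treating a paper with no model output as "include" is an assumption made here to stay on the conservative side.

```python
from enum import Enum


class Vote(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude"


def consensus(votes: list[Vote]) -> Vote:
    """Exclude a paper only if every model voted to exclude it."""
    if not votes:
        # Assumption: with no model output, keep the paper for human review.
        return Vote.INCLUDE
    if all(vote is Vote.EXCLUDE for vote in votes):
        return Vote.EXCLUDE
    return Vote.INCLUDE


def filter_corpus(classifications: dict[str, list[Vote]]) -> list[str]:
    """Return the IDs of papers that survive the consensus filter."""
    return [
        paper_id
        for paper_id, votes in classifications.items()
        if consensus(votes) is Vote.INCLUDE
    ]


# Hypothetical votes from three models for three papers.
classifications = {
    "paper-a": [Vote.EXCLUDE, Vote.EXCLUDE, Vote.EXCLUDE],
    "paper-b": [Vote.EXCLUDE, Vote.INCLUDE, Vote.EXCLUDE],
    "paper-c": [Vote.INCLUDE, Vote.INCLUDE, Vote.INCLUDE],
}
kept = filter_corpus(classifications)
```

Because a single dissenting model is enough to keep a paper, the rule trades a larger filtered corpus for a lower chance of dropping a relevant publication.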

Discussion

In the discussion section, the authors highlight the growing role of large language models (LLMs) in supporting various stages of the scientific publishing process, particularly literature reviews. They note that while traditional methodologies for conducting literature reviews are often repetitive and labor-intensive, recent advancements in LLMs have shown promise in automating tasks such as corpus filtering, research question formulation, and summarization. Despite these advancements, challenges remain in accurately filtering and classifying large research corpora, especially in fields with semantic ambiguities. The authors reference previous studies that have explored LLM applications in this context, including the use of consensus schemes to improve classification accuracy.

The authors introduce their own approach, LLMSurver, an interactive web application designed to facilitate the classification and filtering of research papers using multiple LLMs. This tool allows users to analyze results, refine prompts, and compare consensus strategies while ensuring data privacy by operating locally within the browser. The evaluation of their pipeline, based on a substantial dataset from a literature review on visual network analysis in immersive environments, demonstrates the effectiveness of their method. The results indicate that while LLMs achieved high accuracy rates, the consensus approach significantly improved performance by mitigating individual model errors, thus enhancing the reliability of the classification process. The findings underscore the potential of LLMs in research contexts, particularly as open models become increasingly accessible and capable.
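Comparing a model's or a consensus strategy's decisions with human labels typically comes down to a few standard screening metrics. The sketch below shows one way to compute recall, precision, and accuracy; the function name `screening_metrics`, the dictionary layout, and the example labels are all made up for illustration and do not reproduce the study's data or results.

```python
def screening_metrics(
    predicted: dict[str, bool], reference: dict[str, bool]
) -> dict[str, float]:
    """Compare include/exclude decisions with reference labels (True = relevant)."""
    tp = fp = fn = tn = 0
    for paper_id, relevant in reference.items():
        included = predicted[paper_id]
        if included and relevant:
            tp += 1  # relevant paper kept
        elif included and not relevant:
            fp += 1  # irrelevant paper kept
        elif not included and relevant:
            fn += 1  # relevant paper wrongly excluded
        else:
            tn += 1  # irrelevant paper removed
    total = tp + fp + fn + tn
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical labels for five papers.
reference = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": False}
predicted = {"p1": True, "p2": False, "p3": True, "p4": False, "p5": False}
metrics = screening_metrics(predicted, reference)
```

For corpus filtration, recall is usually the metric to watch most closely, since a wrongly excluded relevant paper is harder to recover than an irrelevant one that slips through to the next stage.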

Limitations

The study has several limitations that point to areas for future research. Primarily, the evaluation was confined to a single topic and dataset within computer science, raising questions about how well the findings generalize to other disciplines, which may use more ambiguous language or less structured abstracts. Additionally, LLMs were deliberately not used to identify candidate papers in the initial phase, in order to avoid erroneous references, although exploring this could yield valuable insights. The reliance on abstract-level information also limits the context available for inclusion decisions, suggesting that future iterations of the methodology should incorporate full-text or citation data to enhance accuracy.

Moreover, the continuous evolution of models poses challenges for reproducibility, as updates may alter outcomes. While a consensus approach mitigates false exclusions, it cannot eliminate them entirely, necessitating manual verification as a critical safeguard. The study also notes the absence of a formal user evaluation of LLMSurver, which could inform usability improvements. Future directions include integrating online library access for automatic corpus retrieval, developing adaptive consensus methods based on model confidence, and expanding LLMSurver into a collaborative platform for enhanced transparency and reproducibility. Additionally, the proposed pipeline could be adapted for other academic tasks, such as content screening and snowballing.