Benchmarking Retrieval-Augmented Generation for Medicine

Journal: Findings of the Association for Computational Linguistics ACL 2024
DOI: https://doi.org/10.18653/v1/2024.findings-acl.372
Publication Date: 2024-01-01
Author(s): Guangzhi Xiong et al.
Primary Topic: Intelligent Tutoring Systems and Adaptive Learning

Overview

The authors discuss the limitations of large language models (LLMs) in medical question answering (QA), particularly hallucination and outdated knowledge. To address these challenges, they propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a benchmark of 7,663 questions drawn from five medical QA datasets. Using the MEDRAG toolkit, they ran extensive experiments over 41 combinations of corpora, retrievers, and backbone LLMs, consuming over 1.8 trillion prompt tokens.

The findings indicate that the MEDRAG toolkit enhances the accuracy of six different LLMs by up to 18% compared to traditional chain-of-thought prompting, effectively elevating the performance of models like GPT-3.5 and Mixtral to levels comparable to GPT-4. The study reveals that optimal performance is achieved through the strategic combination of various medical corpora and retrievers. Additionally, the authors identify a log-linear scaling property and the “lost-in-the-middle” effect within medical RAG systems. The insights gained from this research are intended to provide practical guidelines for the implementation of RAG systems in medical contexts.
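
The log-linear scaling property mentioned above can be illustrated with a small fit. The sketch below assumes the property relates accuracy to the number of retrieved snippets k, in the form acc ≈ a + b·ln(k); that reading is an interpretation not spelled out in this summary. The function name fit_log_linear is hypothetical, and the data points are synthetic values generated from the formula itself, not results from the paper.

```python
import math


def fit_log_linear(ks, accs):
    """Ordinary least-squares fit of acc ≈ a + b * ln(k)."""
    xs = [math.log(k) for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
    b = sxy / sxx            # slope: accuracy gained per unit of ln(k)
    a = mean_y - b * mean_x  # intercept: fitted accuracy at k = 1
    return a, b


# Synthetic points generated from a = 60, b = 2 (NOT measurements from the paper).
ks = [1, 2, 4, 8, 16, 32]
accs = [60 + 2 * math.log(k) for k in ks]
a, b = fit_log_linear(ks, accs)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Under such a relationship, each doubling of k adds roughly b·ln(2) accuracy points, so the gain from each additional snippet shrinks as k grows.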

Introduction

The introduction of this research paper highlights the transformative impact of Large Language Models (LLMs) on information retrieval, particularly in the context of question answering (QA) in both general and medical domains. Despite their advanced capabilities, LLMs are prone to generating factually incorrect responses, a phenomenon known as hallucination, and may lack the most current knowledge, which poses significant risks in critical fields like healthcare. To mitigate these issues, the authors propose Retrieval-Augmented Generation (RAG) as a solution that enhances LLM performance by integrating relevant, up-to-date documents into the QA process, thereby improving transparency and grounding responses in reliable sources.

To systematically assess the effectiveness of various RAG components, the authors introduce the MIRAGE benchmark, which consists of 7,663 questions derived from five widely used biomedical QA datasets. This benchmark emphasizes the zero-shot capabilities of RAG systems, simulating real-world scenarios where no prior examples are available. Additionally, the authors present MEDRAG, a comprehensive toolkit that encompasses multiple document corpora, retrieval algorithms, and LLMs, facilitating a thorough evaluation of RAG performance. Their findings indicate that MEDRAG can enhance LLM performance by 1% to 18% compared to chain-of-thought prompting, with specific insights on the optimal use of different corpora and retrievers. The study concludes with practical recommendations for the deployment and future research of RAG systems in the biomedical field.

Results

In the Results section, the authors systematically assess the performance of MEDRAG using the MIRAGE benchmark, which facilitates a comprehensive evaluation of various components within Retrieval-Augmented Generation (RAG) specifically tailored for medical applications. Section 5.1 details the outcomes associated with different large language models (LLMs), while Section 5.2 focuses on the performance metrics derived from various corpora and retrievers. The findings from these evaluations culminate in practical recommendations for optimizing RAG implementations, which are discussed in Section 6.4.

Discussion

This section discusses advances in, and evaluations of, Retrieval-Augmented Generation (RAG) systems, particularly for biomedical question answering (QA). RAG, introduced by Lewis et al. (2020), improves the performance of large language models (LLMs) on knowledge-intensive tasks by integrating relevant retrieved information, thereby reducing hallucinations and providing up-to-date knowledge. Although various studies have explored RAG applications in biomedicine, systematic evaluations have predominantly focused on vanilla LLMs without RAG. This research presents the first comprehensive evaluation of RAG systems in medical contexts through the MIRAGE benchmark, which employs realistic evaluation settings such as zero-shot learning and question-only retrieval.

The MIRAGE benchmark includes five medical QA datasets, emphasizing the need for RAG systems to retrieve relevant contexts without pre-provided answer options. The findings indicate that RAG significantly enhances the performance of LLMs in medical QA, with the MEDRAG toolkit demonstrating improved accuracy across various tasks. Notably, the study reveals that the choice of corpus and retriever significantly impacts performance, with the PubMed corpus consistently yielding better results across tasks. Furthermore, the research highlights the importance of snippet retrieval strategies and the arrangement of ground-truth snippets, providing practical recommendations for optimizing RAG systems in medical applications. Overall, the results underscore the potential of RAG to improve the efficiency and accuracy of medical question answering, suggesting a shift towards more flexible and cost-effective approaches in leveraging LLMs for clinical decision-making.
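
The point about snippet arrangement can be made concrete with a reordering heuristic. The sketch below is one common way to respond to a lost-in-the-middle effect, assuming that snippets at the start and end of the context are used more reliably than those in the middle; it is not presented as the authors' procedure, and the function name edges_first is hypothetical.

```python
def edges_first(ranked):
    """Reorder a best-first list so top-ranked items sit at both ends of the context.

    Items at even ranks fill the front, items at odd ranks fill the back in
    reverse, which pushes the lowest-ranked items toward the middle.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# Hypothetical snippet identifiers, ordered from most to least relevant.
ordered = edges_first(["s1", "s2", "s3", "s4", "s5"])
print(ordered)  # ['s1', 's3', 's5', 's4', 's2']
```

Here the two most relevant snippets (s1, s2) end up first and last, while the least relevant one (s5) lands in the middle.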

Limitations

The study presents a systematic evaluation of medical Retrieval-Augmented Generation (RAG) systems, but it acknowledges several limitations. Firstly, the research primarily focuses on the vanilla RAG architecture, which involves directly prepending retrieved documents to the context of the large language model (LLM). This choice, while practical, overlooks newer architectures such as active RAG (Jiang et al., 2023), indicating a need for future evaluations of these advanced designs. Secondly, although the MEDRAG corpus is comprehensive, it could benefit from the inclusion of additional resources like full-text articles from PubMed Central and FAQs from reliable sources (Ben Abacha and Demner-Fushman, 2019).
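
The vanilla RAG setup described above, in which retrieved documents are prepended to the LLM context, can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEDRAG implementation: the toy word-overlap retriever, the prompt wording, the function names retrieve and build_prompt, and the three-sentence corpus are all hypothetical. The retriever sees only the question text, mirroring the question-only retrieval setting.

```python
def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank snippets by word overlap with the question only."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,  # Python's sort is stable, so ties keep corpus order
    )
    return ranked[:k]


def build_prompt(question, options, snippets):
    """Prepend retrieved snippets to the question and its answer options."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    choices = "\n".join(f"{label}. {text}" for label, text in options.items())
    return (
        f"Relevant documents:\n{context}\n\n"
        f"Question: {question}\nOptions:\n{choices}\nAnswer:"
    )


# Hypothetical mini-corpus and question, for illustration only.
corpus = [
    "Insulin lowers blood glucose.",
    "Metformin is a first-line drug for type 2 diabetes.",
    "Aspirin inhibits platelet aggregation.",
]
question = "Which drug is first-line for type 2 diabetes?"
options = {"A": "Aspirin", "B": "Metformin", "C": "Insulin"}

snippets = retrieve(question, corpus)  # the options are not passed to the retriever
prompt = build_prompt(question, options, snippets)
print(prompt)
```

The resulting prompt string would then be sent to whichever LLM is being evaluated; that call is omitted here because it depends on the model provider.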

Furthermore, the retrieval component is evaluated only on specific datasets (PubMedQA* and BioASQ-Y/N) because the other datasets lack ground-truth labels. Future research should assess the utility of retrieved snippets for those remaining datasets and consider employing cross-encoder re-rankers to enhance retrieval performance. Additionally, while question answering (QA) is a prevalent task for assessing biomedical LLMs, other knowledge-intensive tasks, such as claim verification (Wadden et al., 2020; Liu et al., 2024), could also leverage MEDRAG. The study employs a multiple-choice format for large-scale evaluation, but the reliance on answer choices during the final prediction phase warrants further scrutiny. Overall, the authors recognize these limitations and suggest that addressing them will be a focus for future work.
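
To show what multiple-choice evaluation involves at the prediction step, the sketch below scores free-text model outputs against gold option letters. It is a simplified assumption of how such scoring might work, not the authors' evaluation script; the names extract_choice and accuracy, the regular expression, and the sample outputs are hypothetical.

```python
import re


def extract_choice(output, labels="ABCD"):
    """Return the first standalone option letter in the output, or None.

    Deliberately naive: a capitalized article such as "A patient ..." would be
    mistaken for option A, so real pipelines usually need stricter parsing.
    """
    match = re.search(rf"\b([{labels}])\b", output)
    return match.group(1) if match else None


def accuracy(outputs, gold):
    """Fraction of outputs whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Hypothetical model outputs and gold labels, for illustration only.
outputs = ["Answer: B", "The answer is C.", "I am not sure"]
gold = ["B", "A", "D"]
score = accuracy(outputs, gold)
print(f"accuracy = {score:.3f}")
```

An output with no recognizable option letter counts as incorrect here, which is one reasonable convention among several.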