Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01519-z
PMID: https://pubmed.ncbi.nlm.nih.gov/40185842
Publication Date: 2025-04-05
Author(s): Yu He Ke et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Methods

The LLM-RAG pipeline is built from several components that work in tandem, each responsible for a specific task. Combining them yields a streamlined process that improves the language model's ability to retrieve relevant information and generate answers from it.

The methodology emphasizes modularity, which lets the system adapt to different tasks and datasets. Because each component has a distinct function, the pipeline can be tuned for different applications, keeping the retrieval-augmented generation process robust and versatile. This structure improves the accuracy of the outputs and also supports scaling to larger datasets and more complex queries.
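The modular structure described above can be illustrated with a minimal sketch. It is not the study's implementation: the bag-of-words retriever, the placeholder guideline snippets, the `Retriever`, `build_prompt`, and `answer` names, and the stub generator are all assumptions made for illustration. A real system would typically use an embedding model for retrieval and an actual LLM for generation.

```python
import math
import re
from collections import Counter
from typing import Callable, List

# Placeholder snippets standing in for guideline chunks (hypothetical, not clinical advice).
GUIDELINE_CHUNKS = [
    "Placeholder guideline A: fasting instructions before elective surgery.",
    "Placeholder guideline B: anticoagulant management before surgery.",
    "Placeholder guideline C: blood glucose targets for diabetic patients.",
]

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class Retriever:
    """Retrieval component: ranks stored chunks by similarity to the query."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks
        self.vectors = [Counter(tokenize(c)) for c in chunks]

    def top_k(self, query: str, k: int = 2) -> List[str]:
        q = Counter(tokenize(query))
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: cosine(q, self.vectors[i]),
            reverse=True,
        )
        return [self.chunks[i] for i in ranked[:k]]

def build_prompt(question: str, context: List[str]) -> str:
    """Prompt component: places the retrieved chunks ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only the context below to answer.\nContext:\n{joined}\nQuestion: {question}"

def stub_generate(prompt: str) -> str:
    """Stand-in for an LLM call; it only echoes the prompt it received."""
    return "[stub model output]\n" + prompt

def answer(question: str, retriever: Retriever,
           generate: Callable[[str], str], k: int = 2) -> str:
    """Pipeline: retrieve context, build the prompt, call the generator."""
    return generate(build_prompt(question, retriever.top_k(question, k)))

if __name__ == "__main__":
    retriever = Retriever(GUIDELINE_CHUNKS)
    print(answer("How should anticoagulant medication be handled before surgery?",
                 retriever, stub_generate, k=1))
```

Because the generator is passed in as a plain callable, a different model can be swapped in without touching the retrieval or prompt code, which is one way to read the modularity the summary describes.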

Results

In this study, a total of 3,682 components were assessed, comprising 448 human-generated and 3,234 LLM-generated outputs. The Generative Pre-trained Transformer (GPT)4_international model demonstrated superior performance, achieving an accuracy of 96.4% in predicting medical fitness for surgery, significantly outperforming human evaluators (86.6%) and its non-RAG (92.9%) and RAG counterparts (92.9%). The model’s effectiveness was particularly notable in evaluating sicker patients (ASA 3), where it surpassed other models such as Gemini and LLAMA2-13b, which had accuracies below 50%. Additionally, the GPT4_international model was better than humans at generating required medical optimizations (71.0% vs. 55.0%, p = 0.026), although human-generated medication instructions were more accurate (91.0% for the model vs. 98.0% for humans, p = 0.035).

The study also highlighted the GPT4 RAG model’s high reproducibility (4.86/5) and safety in providing instructions (4.93/5). Notably, the false negative rate for identifying medically unfit patients was significantly lower for the GPT4_international model (25%) than for human evaluators (62.5%). Inter-rater reliability (IRR) for the GPT4_international model was high across categories, indicating consistent performance in predicting medical fitness (IRR = 0.93) and providing healthcare instructions (IRR = 0.96). The analysis revealed low hallucination rates across several LLM systems, with LLAMA2 exhibiting notably higher rates, particularly in its RAG-enhanced versions. Overall, while the GPT4_international model showed remarkable accuracy and reliability, human responses were still superior in specific areas, such as medication instructions.
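To make the false negative rate concrete, the sketch below computes it from labelled cases, treating "medically unfit" as the positive class, so a false negative is an unfit patient judged fit. The function name and the case lists are hypothetical. The counts are invented for illustration and were chosen only so that the arithmetic reproduces the reported percentages; the study's actual case counts are not given in this summary.

```python
from typing import Sequence

def false_negative_rate(truth: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Share of truly unfit cases (True) that were predicted fit (False)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    # Keep only the predictions made for cases that are truly unfit.
    on_unfit = [p for t, p in zip(truth, predicted) if t]
    if not on_unfit:
        raise ValueError("no unfit cases, so the rate is undefined")
    missed = sum(1 for p in on_unfit if not p)
    return missed / len(on_unfit)

# Invented example: 8 truly unfit patients followed by 2 fit patients.
truth = [True] * 8 + [False] * 2
rater_a = [False] * 2 + [True] * 6 + [False] * 2  # misses 2 of the 8 unfit cases
rater_b = [False] * 5 + [True] * 3 + [False] * 2  # misses 5 of the 8 unfit cases

if __name__ == "__main__":
    print(false_negative_rate(truth, rater_a))  # 0.25
    print(false_negative_rate(truth, rater_b))  # 0.625
```

A lower value means fewer unfit patients slip through as fit, which is why this rate matters more for patient safety than overall accuracy alone.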

Discussion

The discussion emphasizes the integration of Large Language Model Retrieval-Augmented Generation (LLM-RAG) systems into healthcare workflows, particularly in preoperative medicine. The findings indicate that LLM-RAG models can surpass human clinicians in evaluating patient fitness for surgery, providing consistent and accurate assessments. This capability positions LLM-RAG as a valuable adjunct to clinical practice, potentially enhancing efficiency and alleviating clinician workload. The ability of these models to incorporate local healthcare practices while aligning with international guidelines further underscores their utility in standardizing preoperative evaluations.

The study also highlights the importance of consistency in clinical decision-making: LLM-RAG models gave more uniform responses than human evaluators, particularly in tasks such as ASA scoring. This consistency can mitigate miscommunication among healthcare teams and reduce subjective variability in clinical judgments. However, the research acknowledges limitations, such as variability in local guidelines and the need for ongoing expert oversight to ensure the accuracy and relevance of model outputs. Ethical considerations regarding potential biases in model training data are also discussed, emphasizing that LLM-RAG systems should be implemented carefully as supportive tools rather than replacements for clinical expertise. Future research directions include optimizing retrieval mechanisms and establishing standardized evaluation metrics to support the deployment of LLM-RAG in diverse clinical settings.