Medical foundation large language models for comprehensive text analysis and beyond

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01533-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40044845
Publication Date: 2025-03-05
Author(s): Qianqian Xie et al.
Primary Topic: Machine Learning in Healthcare

Overview

This paper presents Me-LLaMA, a family of open-source medical large language models (LLMs) designed to enhance medical applications by combining extensive domain-specific knowledge with strong instruction-following capabilities. Developed through continual pretraining and instruction tuning of LLaMA2 models, Me-LLaMA draws on diverse biomedical and clinical data sources, including biomedical literature and clinical notes. Evaluations across six text analysis tasks using 12 benchmarks, such as PubMedQA and MIMIC-CXR, show that Me-LLaMA outperforms existing open medical LLMs in both zero-shot and supervised settings. After task-specific instruction tuning, it also performs comparably to advanced models such as ChatGPT and GPT-4 in diagnosing complex clinical cases.

The research highlights the limitations of current general-domain LLMs in medical contexts, which stem primarily from training on nonmedical datasets and the resulting lack of specialized knowledge. Some models have attempted to strengthen medical capabilities through instruction fine-tuning, but they remain constrained by their foundational training. Me-LLaMA addresses these limitations by combining continual pretraining with instruction tuning, leveraging both biomedical literature and clinical notes to improve real-world applicability. This approach underscores the importance of integrating domain-specific knowledge with instruction-following capabilities to advance the performance of medical LLMs.

Methods

In this study, the authors used LLaMA2 (ref. 6) as the foundation model and enhanced it through continual pretraining and instruction tuning, resulting in Me-LLaMA. This process used 129 billion pretraining tokens and 214,000 instruction tuning samples, sourced from general, biomedical, and clinical domains. Figure 4 gives an overview of the methodology, while Table 4 compares the Me-LLaMA models with existing open-source medical LLMs. The approach aims to improve the performance and applicability of LLMs in medical contexts.

Results

Table 1 reports the performance of the Me-LLaMA models on medical text analysis. The Me-LLaMA 13B model outperformed the PMC-LLaMA 13B model on 11 out of 12 datasets and surpassed the general-domain LLaMA2 13B model on 10 out of 12 datasets. Notably, it was competitive with the larger LLaMA2 70B and Meditron 70B models on 8 out of 12 datasets. In the 70B category, the Me-LLaMA 70B model achieved the best performance on 9 out of 12 datasets when compared with its counterparts.
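The "N out of 12 datasets" comparisons above are per-dataset win counts. The following is a minimal sketch of how such a count could be computed, assuming each model has one score per dataset and that a higher score is better on every dataset. The function name `count_wins`, the dataset names, and all scores are hypothetical and are not taken from the paper.

```python
from typing import Dict


def count_wins(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> int:
    """Count datasets on which model A scores strictly higher than model B.

    Only datasets present in both dictionaries are compared.
    Assumes higher is better for every metric (an assumption of this sketch).
    """
    shared = scores_a.keys() & scores_b.keys()
    return sum(1 for name in shared if scores_a[name] > scores_b[name])


# Hypothetical scores for illustration only; not values from the paper.
model_a = {"dataset_1": 0.71, "dataset_2": 0.64, "dataset_3": 0.58}
model_b = {"dataset_1": 0.69, "dataset_2": 0.66, "dataset_3": 0.50}

wins = count_wins(model_a, model_b)
print(f"Model A wins on {wins} out of {len(model_a)} datasets")
```

Ties are not counted as wins here; a real comparison would also need to decide how to treat metrics where lower is better.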

Table 2 highlights the zero-shot performance of the Me-LLaMA chat models: the Me-LLaMA 13B-chat model outperformed LLaMA2 13B-chat, PMC-LLaMA-chat, and Medalpaca 13B across nearly all datasets. Additionally, Me-LLaMA 70B-chat outperformed LLaMA2-70B-chat on 11 out of 12 datasets. The Me-LLaMA 13B-chat model also outperformed the larger LLaMA2-70B-chat model on 6 out of 12 datasets and remained competitive on 3 of the remaining 6. Figure 1 compares the Me-LLaMA models with ChatGPT and GPT-4 in zero-shot and supervised learning settings, emphasizing the models' efficiency and potential in medical applications, although privacy constraints limit the use of clinical datasets containing patient information.

Discussion

In this section, the authors discuss the performance and potential of the Me-LLaMA models, specifically Me-LLaMA-13B and Me-LLaMA-70B, in complex clinical case diagnosis and various medical NLP tasks. The Me-LLaMA-70B-chat model was competitive, achieving accuracy comparable to GPT-4 and ChatGPT while significantly outperforming LLaMA2-70B-chat. Human evaluations further confirmed that Me-LLaMA-70B-chat surpassed GPT-4 in both top-1 and top-5 accuracy, highlighting its applicability in challenging clinical scenarios. The study emphasizes the effectiveness of instruction tuning, which significantly improves model performance, particularly in zero-shot settings, while continual pretraining provides additional benefits for larger models.
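Top-1 and top-5 accuracy measure how often the correct diagnosis appears in the first one or first five entries of a model's ranked list. The sketch below shows one way to compute them, assuming exact string matching against a single reference diagnosis per case. That matching rule is a simplification chosen for this example, not the paper's evaluation protocol, and the function name `top_k_accuracy` and all case data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose gold label appears in the first k predictions."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1 for preds, answer in zip(ranked_predictions, gold) if answer in preds[:k]
    )
    return hits / len(gold)


# Hypothetical ranked diagnoses for four cases (most likely first).
cases = [
    ["diagnosis_a", "diagnosis_b", "diagnosis_c"],
    ["diagnosis_d", "diagnosis_e", "diagnosis_f"],
    ["diagnosis_g", "diagnosis_h", "diagnosis_i"],
    ["diagnosis_j", "diagnosis_k", "diagnosis_l"],
]
gold = ["diagnosis_a", "diagnosis_e", "diagnosis_z", "diagnosis_l"]

top1 = top_k_accuracy(cases, gold, k=1)
top5 = top_k_accuracy(cases, gold, k=5)
print(f"top-1 accuracy: {top1:.2f}, top-5 accuracy: {top5:.2f}")
```

In practice, free-text diagnoses rarely match a reference string exactly, which is one reason human evaluation is used alongside automatic scoring.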

The authors also address the limitations of current models, noting challenges in tasks such as Named Entity Recognition (NER) and Relation Extraction (RE), where even advanced models like GPT-4 struggle. They emphasize the importance of data diversity in model training, suggesting that a balanced mix of general and medical domain data can mitigate knowledge forgetting. The Me-LLaMA models are positioned as valuable tools for clinical decision support, medical education, and administrative tasks, although the authors acknowledge the need for further research to improve their performance and address issues such as factual inaccuracies and token handling capacity. Overall, the findings underscore the potential of Me-LLaMA models in advancing medical applications while identifying areas for future enhancement.
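The idea of mixing general and medical-domain data can be illustrated with weighted sampling across sources. The sketch below assumes three in-memory lists of documents and a set of mixture weights. The weights, source names, and the function `sample_mixture` are placeholders for illustration and do not reflect the ratio or pipeline used by the authors.

```python
import random
from typing import Dict, List


def sample_mixture(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[str]:
    """Draw n documents, picking a source by weight and then a document uniformly."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = sorted(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(sources[source]))
    return batch


# Placeholder documents and weights; not the paper's data or mixing ratio.
sources = {
    "general": ["general_doc_1", "general_doc_2"],
    "biomedical": ["biomedical_doc_1", "biomedical_doc_2"],
    "clinical": ["clinical_doc_1", "clinical_doc_2"],
}
weights = {"general": 0.25, "biomedical": 0.5, "clinical": 0.25}

batch = sample_mixture(sources, weights, n=8)
print(batch)
```

Keeping a nonzero weight on the general source is what lets general-domain text continue to appear during continual pretraining; setting it to zero would remove that data from the mix entirely.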