CancerLLM: a large language model in cancer domain

Journal: npj Digital Medicine, Volume: 9, Issue: 1
DOI: https://doi.org/10.1038/s41746-026-02441-8
PMID: https://pubmed.ncbi.nlm.nih.gov/41720895
Publication Date: 2026-02-20
Author(s): Mingchen Li et al.
Primary Topic: Machine Learning in Healthcare

Overview

The paper presents CancerLLM, a specialized large language model (LLM) for cancer phenotyping and diagnosis, built because existing medical LLMs are not tailored to these tasks. The 7-billion-parameter model was trained on roughly 2.7 million clinical notes and 515,000 pathology reports spanning 17 cancer types, then fine-tuned for cancer phenotype extraction and diagnosis generation.

CancerLLM achieved F1 scores of 91.78% on phenotype extraction and 86.81% on diagnosis generation, an average F1 improvement of 9.23% over existing LLMs. It also compared favorably with other models in generation time, GPU resource usage, and robustness. These findings suggest that CancerLLM could significantly enhance clinical research and practice in the cancer domain.

Methods

The authors outline the methodology behind the CancerLLM framework, depicted in Figure 2. Cancer-specific knowledge is first integrated into the model, followed by supervised instruction tuning. CancerLLM's generative capabilities are then evaluated against existing medical LLMs using three metrics. To assess robustness, the authors introduce two testbeds. The model is applied to two target tasks: cancer phenotype extraction and cancer diagnosis generation.

The base model is Mistral 7B, selected for its demonstrated performance on various medical benchmarks. Training has two phases: continued pre-training and instruction tuning, with the latter using Low-Rank Adaptation (LoRA) for efficiency. The datasets for both phases come from the University of Minnesota Clinical Data Repository and cover health records from 31,465 cancer patients, including over 2.6 million clinical notes and more than 500,000 pathology reports. The study received ethical approval from the University of Minnesota Institutional Review Board (#STUDY00017350). Detailed descriptions of the tasks and the corresponding instruction tuning data are provided in subsequent sections of the paper.
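To make the efficiency argument for LoRA concrete, here is a minimal sketch, not the authors' training code. It assumes a single 4096 x 4096 projection matrix (4096 is the hidden size of Mistral 7B) and a hypothetical rank of 8 with a hypothetical scaling factor alpha; the rank and scaling actually used for CancerLLM are not stated in this summary, and the function names are illustrative.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix W.
    LoRA freezes W and trains two small factors instead:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen; only A and B would receive gradient updates.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Assumed example: one 4096 x 4096 projection with a hypothetical rank of 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

With these assumed values, the low-rank factors hold 256 times fewer trainable parameters than the full matrix. When B is initialized to zeros, as is conventional for LoRA, `lora_forward` returns exactly the frozen model's output, so tuning starts from the pre-trained behavior.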

Results

The paper evaluates CancerLLM across several tasks and reports superior performance in cancer diagnosis generation and phenotype extraction. In diagnosis generation, CancerLLM significantly outperformed all baseline models, including Mistral 1*7B and Bio-Mistral 7B, with improvements of 31.7% and 17.92%, respectively, in average F1 score across the Exact Match, BLEU-2, and ROUGE-L metrics. Notably, CancerLLM also surpassed larger models of up to 70 billion parameters, indicating the effectiveness of a lightweight, domain-specific approach.
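As an illustration of how two of the metrics named above can be scored, the following is a minimal sketch, not the paper's evaluation code. It assumes lowercased, whitespace-tokenized text; the normalization and tokenization actually used by the authors are not described in this summary, and the function names are hypothetical. BLEU-2 is omitted to keep the example short.

```python
def exact_match(prediction, reference):
    """Return 1 if the strings match after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pair, for illustration only.
print(exact_match("Breast cancer ", "breast cancer"))
print(rouge_l_f1("ductal carcinoma", "invasive ductal carcinoma"))
```

In the second call the prediction drops one of three reference tokens, so precision is 1, recall is 2/3, and the F1 works out to 0.8. Exact Match would score the same pair as 0, which is why a strict and a partial-credit metric are usually reported together.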

In cancer phenotype extraction, CancerLLM again outperformed all comparable models and performed on par with larger models such as Llama3 8B and Llama2 13B. ClinicalCamel-70B achieved the highest extraction performance, but its much larger parameter count makes training and inference less efficient. In the robustness tests, CancerLLM maintained superior performance under counterfactual conditions and remained reliable in the presence of misspellings, outperforming Bio-Mistral 7B at higher error rates. Finally, in an evaluation on an independent patient cohort, CancerLLM achieved the highest average F1 score, 85.08%, significantly surpassing the baseline models. This underscores its ability to generate clinically coherent diagnostic outputs while effectively integrating domain knowledge.
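A misspelling testbed of the kind mentioned above can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the perturbation (swapping two adjacent interior characters), the per-word error rate, and the names `misspell_word` and `inject_misspellings` are all hypothetical, since the summary does not describe how the paper generated its misspellings.

```python
import random


def misspell_word(word, rng):
    """Swap two adjacent interior characters, keeping first and last fixed.

    Words shorter than 4 characters are returned unchanged. If the two
    swapped characters happen to be identical, the word is also unchanged.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def inject_misspellings(text, rate, seed=0):
    """Perturb each whitespace-separated word with probability `rate`."""
    rng = random.Random(seed)  # seeded so a test set is reproducible
    words = []
    for word in text.split():
        if rng.random() < rate:
            words.append(misspell_word(word, rng))
        else:
            words.append(word)
    return " ".join(words)


# Hypothetical clinical phrase, perturbed at increasing error rates.
note = "patient with metastatic carcinoma"
for rate in (0.0, 0.5, 1.0):
    print(rate, inject_misspellings(note, rate))
```

Evaluating a model on the same inputs at several values of `rate` gives a curve of score against error rate, which is one way to compare models "at higher error rates" as the text does.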

Discussion

The authors compare CancerLLM with various LLMs on cancer diagnosis generation and phenotype extraction. They note that models such as ClinicalCamel-70B show superior phenotype extraction but are hindered by excessive generation times and GPU memory usage. CancerLLM, with far fewer parameters, achieves competitive performance, particularly in diagnosis generation, which suggests that model efficiency can be prioritized without sacrificing accuracy. The authors attribute CancerLLM's success to its targeted training on clinical notes and pathology reports, and they emphasize the importance of high-quality data and annotations for effective model training.
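A back-of-the-envelope calculation shows why parameter count drives GPU memory use. The sketch below counts model weights only and assumes 16-bit weights (2 bytes per parameter); activations, the key-value cache, and any optimizer state are excluded, and the precision the authors actually used is not stated in this summary. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to hold the model weights, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 10**9


small = weight_memory_gb(7_000_000_000)   # a 7B-parameter model
large = weight_memory_gb(70_000_000_000)  # a 70B-parameter model
print(small, large, large / small)
```

Under these assumptions the 70B model needs about 140 GB for weights alone, ten times the 14 GB of the 7B model, which typically means spreading it across several GPUs.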

The authors also discuss CancerLLM's limitations, such as its sensitivity to misspellings and abbreviations in clinical notes, which can cause significant performance drops. They note that the model performs well under controlled conditions but that its robustness is compromised by linguistic inaccuracies. The study further finds that retrieval-augmented approaches can improve model performance, particularly in complex reasoning tasks, which points to a need for domain-specific retrieval methods. Overall, the findings suggest that CancerLLM outperforms many baseline models and offers useful insights into integrating AI in clinical settings, particularly for oncology applications.