Benchmarking foundation models as feature extractors for weakly supervised computational pathology

Journal: Nature Biomedical Engineering
DOI: https://doi.org/10.1038/s41551-025-01516-3
PMID: https://pubmed.ncbi.nlm.nih.gov/41034516
Publication Date: 2025-10-01
Author(s): Peter Neidlinger et al.
Primary Topic: AI in cancer detection

Overview

This study benchmarks 19 histopathology foundation models on 13 patient cohorts comprising 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers. The models were evaluated on weakly supervised tasks covering biomarkers, morphological properties, and prognostic outcomes. The vision-language foundation model CONCH achieved the best overall performance among the models compared, including the vision-only models, with Virchow2 a close second. CONCH’s advantage was less pronounced in low-data scenarios and in low-prevalence tasks. The study also shows that foundation models trained on different cohorts can learn complementary features: an ensemble combining CONCH and Virchow2 outperformed the individual models in 55% of tasks.

The research places these results in the context of the growing impact of artificial intelligence on digital pathology, particularly biomarker prediction from high-resolution whole-slide images (WSIs). Deep learning can improve diagnostic accuracy, efficiency, and consistency while reducing the subjectivity of human interpretation. Foundation models trained with self-supervised learning (SSL) have advanced the field further by extracting meaningful representations from histological tissue. SSL methods such as contrastive learning and masked image modeling have shown improved performance and robustness because they exploit large volumes of unlabelled data, which reduces the reliance on manual annotation.

Methods

The authors benchmarked 19 foundation models in digital pathology: 12 pure vision models, 3 vision-language models, and 4 slide encoders. The models were evaluated on three task categories (morphological, biomarker, and prognostic tasks) across several cancer types. Morphological tasks distinguished cancer subgroups by phenotypic characteristics, for example classifying colorectal cancer (CRC) slides by tumour location and stomach adenocarcinoma (STAD) slides by Lauren classification. Biomarker tasks targeted clinically relevant markers, including BRAF and KRAS for CRC, and HER2 and PIK3CA for breast cancer (BRCA). Prognostic tasks predicted clinical outcomes such as N-status (lymph node involvement) and M-status (distant metastasis) for CRC and STAD.

The study prioritized tasks associated with actionable therapeutic targets, as identified by OncoKB, and used ground-truth data from The Cancer Genome Atlas (TCGA) for training and independent testing. A task was included only if at least one test cohort contained a minimum of ten cases in each category. By restricting the benchmark to tasks with clear therapeutic actionability or prognostic relevance, the authors aimed to assess the practical utility of the models in clinical settings. This selection yielded 31 tasks evaluated across 8 external test cohorts.
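The inclusion rule above can be illustrated with a minimal sketch. The function names, cohort names, and label counts below are hypothetical, and the requirement that a cohort contain at least two categories is an added assumption (a category with zero cases would otherwise go unnoticed); this is not the authors' code.

```python
from collections import Counter

MIN_CASES_PER_CATEGORY = 10  # threshold stated in the text


def cohort_qualifies(labels, min_cases=MIN_CASES_PER_CATEGORY):
    """Return True if every category in this cohort has at least min_cases cases.

    Assumption: a cohort must also contain at least two categories,
    otherwise there is nothing to discriminate.
    """
    counts = Counter(labels)
    return len(counts) >= 2 and all(n >= min_cases for n in counts.values())


def task_is_included(labels_by_cohort, min_cases=MIN_CASES_PER_CATEGORY):
    """A task is kept if at least one test cohort qualifies."""
    return any(cohort_qualifies(labels, min_cases)
               for labels in labels_by_cohort.values())


# Hypothetical label distributions, for illustration only.
common_marker = {
    "cohort_a": ["mutated"] * 12 + ["wild_type"] * 80,
    "cohort_b": ["mutated"] * 4 + ["wild_type"] * 50,
}
rare_marker = {
    "cohort_a": ["mutated"] * 3 + ["wild_type"] * 90,
}

print(task_is_included(common_marker))  # cohort_a has >= 10 cases per category
print(task_is_included(rare_marker))    # no cohort reaches the threshold
```

Applying the threshold per cohort rather than to pooled counts means a marker that is rare everywhere is excluded even when the cohorts together contain ten positive cases.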

Results

Across the 31 tasks, CONCH and Virchow2 tied for the highest average area under the receiver operating characteristic curve (AUROC), at 0.71. CONCH achieved the highest mean AUROC in the morphology tasks (0.77) and also led in average area under the precision-recall curve (AUPRC), balanced accuracy, and F1 score. Its advantage was less pronounced in low-data scenarios and low-prevalence tasks.

Ensembles improved on single models: combining CONCH and Virchow2 outperformed the individual models in 55% of tasks.

Discussion

The study benchmarked 19 pathology foundation models and 14 ensembles on 31 weakly supervised downstream prediction tasks covering morphology, biomarkers, and prognostication. CONCH emerged as the top performer: it achieved the highest mean area under the receiver operating characteristic curve (AUROC) in morphology tasks (0.77) and tied with Virchow2 for the highest average AUROC across all tasks (0.71). CONCH also led in average area under the precision-recall curve (AUPRC), balanced accuracy, and F1 score, indicating robustness across cancer types. In a comparison of aggregation methods, transformer-based aggregation slightly outperformed attention-based multiple instance learning (ABMIL), and the overall ranking of the foundation models was maintained.
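For readers unfamiliar with ABMIL, the following is a minimal sketch of its attention pooling step: each tile feature vector h_k receives a weight a_k = softmax_k(w · tanh(V h_k)), and the slide embedding is the weighted sum of the tile features. The parameters V and w would be learned in practice; here they are random, and all sizes and names are hypothetical. It shows the general technique only, not the study's implementation.

```python
import math
import random


def abmil_pool(tiles, V, w):
    """Attention-based MIL pooling over the tiles of one slide.

    tiles: list of n tile feature vectors, each of length d
    V:     projection matrix as a list of h rows, each of length d
    w:     attention vector of length h
    Returns (slide_embedding, attention_weights).
    """
    scores = []
    for h_k in tiles:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h_k))) for row in V]
        scores.append(sum(w_j * u_j for w_j, u_j in zip(w, hidden)))

    # Softmax over tiles, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Slide embedding: attention-weighted sum of the tile features.
    d = len(tiles[0])
    slide = [sum(a * h_k[i] for a, h_k in zip(attention, tiles))
             for i in range(d)]
    return slide, attention


# Hypothetical sizes; real extractors emit hundreds of dimensions per tile.
random.seed(0)
n_tiles, d, h = 50, 8, 4
tiles = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tiles)]
V = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(h)]
w = [random.gauss(0, 0.5) for _ in range(h)]

slide_embedding, attention = abmil_pool(tiles, V, w)
print(len(slide_embedding), round(sum(attention), 6))
```

Only the slide-level label is needed to train such a pooling layer, which is what makes the setting weakly supervised: no tile is annotated individually.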

The analysis also examined performance in low-prevalence scenarios and the influence of pretraining data. The size of a foundation model’s training set and the diversity of its tissue sites correlated positively with downstream task performance, but these factors alone did not fully account for the observed results. For instance, CONCH outperformed BiomedCLIP despite being trained on fewer image-caption pairs, which points to the importance of model architecture and dataset quality. The study further found that ensembles, built either by averaging predictions or by concatenating feature vectors from multiple models, improved performance and achieved higher AUROCs than individual models. This suggests that combining models with complementary strengths can raise predictive accuracy in pathology tasks, particularly in challenging clinical scenarios.
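The two ensembling strategies named above, together with the AUROC used to compare them, can be sketched as follows. The scores are made-up toy values, deliberately chosen so that the two models err on different patients; they are not data from the study, and the function names are hypothetical.

```python
def average_predictions(*model_scores):
    """Late fusion: element-wise mean of per-patient scores from several models."""
    return [sum(vals) / len(vals) for vals in zip(*model_scores)]


def concatenate_features(*model_features):
    """Early fusion: join the feature vectors of the same tile from several models."""
    return [sum((list(f) for f in per_tile), []) for per_tile in zip(*model_features)]


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented numbers): each model misranks a different patient.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.3]
model_b = [0.6, 0.8, 0.3, 0.2, 0.5, 0.1]
ensemble = average_predictions(model_a, model_b)

print(auroc(labels, model_a), auroc(labels, model_b), auroc(labels, ensemble))

# Early fusion of two hypothetical 2-D and 1-D tile embeddings.
print(concatenate_features([[1, 2], [3, 4]], [[5], [6]]))
```

Averaging helps in this toy case only because the two models make different mistakes; if their errors coincided, the ensemble would inherit them. This is the sense in which complementary features matter.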