A comprehensive evaluation of histopathology foundation models for ovarian cancer subtype classification

Journal: npj Precision Oncology, Volume: 9, Issue: 1
DOI: https://doi.org/10.1038/s41698-025-00799-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39885243
Publication Date: 2025-01-30
Author(s): Jack Breen et al.
Primary Topic: AI in cancer detection

Overview

The study investigates how well histopathology foundation models perform in the morphological subtyping of ovarian carcinoma, and addresses the limitation that arbitrary hyperparameter choices impose on such evaluations. Using a rigorous validation approach, it compares attention-based multiple instance learning (ABMIL) classifiers built on three ImageNet-pretrained encoders and fourteen foundation models, trained on a dataset of 1864 whole slide images (WSIs). The classifiers were evaluated through hold-out testing and through external validation on two datasets: the Transcanadian Study and the OCEAN Challenge.

The H-optimus-0 foundation model achieved the highest balanced accuracies: 89%, 97%, and 74% on the hold-out, Transcanadian Study, and OCEAN Challenge test sets, respectively. The UNI model delivered comparable results at a substantially lower computational cost. Hyperparameter tuning yielded a median improvement of 1.9% in balanced accuracy, and many of the improvements reached statistical significance. The results suggest that foundation models not only improve classification performance but also have potential clinical uses, such as offering a second opinion in complex cases and improving diagnostic accuracy and efficiency.
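
Balanced accuracy is the headline metric here, and it matters because the classes are heavily imbalanced. As a minimal sketch (the function name and the toy labels are illustrative and are not study data), it can be computed as the mean of the per-class recalls:

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels, made up for illustration: a classifier that always predicts
# the majority class looks good on plain accuracy but not on balanced accuracy.
y_true = ["HGSC"] * 8 + ["LGSC"] * 2
y_pred = ["HGSC"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Because every class contributes equally to the average, a rare subtype such as LGSC weighs as much as HGSC in the final score.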

Methods

In this study, a comprehensive dataset of ovarian carcinoma histopathology was assembled for subtype classification. The training set comprised 1864 whole slide images (WSIs) from 434 cases treated at Leeds Teaching Hospitals NHS Trust between 2008 and 2022, covering the five predominant epithelial ovarian cancer subtypes: high-grade serous carcinoma (HGSC), low-grade serous carcinoma (LGSC), clear cell carcinoma (CCC), mucinous carcinoma (MC), and endometrioid carcinoma (EC). Diagnoses were independently verified by a histopathologist, ensuring accuracy and removing discrepancies. The dataset was markedly imbalanced: LGSC was represented by only 92 WSIs (about 5% of the training set), while HGSC accounted for 1266 WSIs (about 68%).

To evaluate model performance, an independent class-balanced hold-out test set of 100 WSIs from 30 patients was created, alongside two external test sets: the Transcanadian Study dataset, with 80 WSIs from 80 patients, and the OCEAN Challenge dataset, with 513 WSIs of various subtypes. The study focused on classifying primary surgery specimens, which are typically of higher diagnostic quality than interval debulking surgery (IDS) specimens; the latter may show altered morphology because of prior treatment. The training set nevertheless included both primary (1412 WSIs) and IDS (452 WSIs) specimens, as previous findings suggested that IDS samples could make the training data more robust.
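
Because several WSIs can come from the same patient, a test set is only independent if the split is made by patient rather than by slide. The following is a minimal sketch of such a grouped split, assuming a hypothetical mapping from slide IDs to patient IDs. It does not reproduce the class balancing of the study's hold-out set, and the function name, fraction, and seed are illustrative.

```python
import random


def split_by_patient(slide_to_patient, test_fraction=0.2, seed=0):
    """Split slides so that no patient appears in both train and test."""
    patients = sorted(set(slide_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [s for s, p in slide_to_patient.items() if p not in test_patients]
    test = [s for s, p in slide_to_patient.items() if p in test_patients]
    return train, test


# Hypothetical data: 10 patients with 3 slides each.
mapping = {f"slide_{p}_{i}": f"patient_{p}" for p in range(10) for i in range(3)}
train_slides, test_slides = split_by_patient(mapping, test_fraction=0.2, seed=0)

train_patients = {mapping[s] for s in train_slides}
held_out_patients = {mapping[s] for s in test_slides}
print(len(train_slides), len(test_slides))       # 24 6
print(train_patients & held_out_patients)        # set()
```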

Results

Preprocessing techniques had inconsistent effects on the ResNet50 feature extractor, yielding modest improvements in internal validation but variable outcomes in external validation. Specifically, no preprocessing method improved balanced accuracy or F1 score by more than 0.02, and none improved AUROC. In hold-out testing, only the 20× color augmentation gave a slight performance increase, raising F1 by 0.023 and balanced accuracy by 0.020, although AUROC fell by 0.012. Conversely, in external validation on the Transcanadian Study dataset, all preprocessing methods surpassed the baseline. The most effective were Otsu thresholding and Macenko normalization, each improving F1 score and balanced accuracy by more than 0.1 and AUROC by more than 0.016. In the OCEAN Challenge external validation, however, most methods performed worse than the baseline, and only Otsu thresholding provided a benefit.
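
For readers unfamiliar with Otsu thresholding, the sketch below shows the general technique on 8-bit grayscale intensities: it picks the threshold that maximizes the between-class variance of the two resulting pixel groups. It is a generic illustration, not the study's preprocessing code. The toy pixel values are made up, and treating the darker group as tissue is an assumption that suits bright-background slides.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximizes between-class variance,
    where pixels <= t form one class and pixels > t form the other."""
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low, sum_low = 0, 0
    for t in range(256):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Proportional to the between-class variance (constant factor dropped).
        var_between = count_low * count_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Toy intensities, made up: dark "tissue" pixels and bright "background" pixels.
pixels = [40] * 60 + [220] * 40
t = otsu_threshold(pixels)
tissue_mask = [v <= t for v in pixels]  # assumption: tissue is the darker class
print(sum(tissue_mask))  # 60
```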

Hyperparameter tuning substantially improved average validation loss across models, with a median improvement of 0.150 and a maximum of 0.301. Most of these gains were realized in the initial tuning iterations, particularly for models other than the ImageNet-pretrained ResNet50. The median effect of tuning was a 1.9% increase in balanced accuracy, alongside modest improvements in AUROC and F1 score. The effects varied widely among models, however, with changes in balanced accuracy ranging from -6.6% to +15.0%. Models using ResNet50, ResNet18, Phikon, and H-optimus-0 did not benefit from tuning, while the other models showed statistically significant improvements in at least one evaluation metric. The most consistent benefits across validations were observed for the ImageNet-pretrained ViT-L model and for Hibou-L.
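
This summary does not describe the tuning procedure itself, so the following is only a generic sketch of iterative tuning, in which one hyperparameter is varied at a time and the best value is kept. The grid, the starting configuration, and the `toy_validation_loss` function are all hypothetical stand-ins; in practice the loss would come from training and validating a classifier.

```python
import math


def tune_one_at_a_time(evaluate, grid, start, n_rounds=2):
    """Greedy search: vary one hyperparameter at a time, keep any improvement."""
    best = dict(start)
    best_loss = evaluate(best)
    history = [best_loss]  # best loss seen after each tuning step
    for _ in range(n_rounds):
        for name, values in grid.items():
            for value in values:
                candidate = {**best, name: value}
                loss = evaluate(candidate)
                if loss < best_loss:
                    best, best_loss = candidate, loss
            history.append(best_loss)
    return best, best_loss, history


def toy_validation_loss(cfg):
    """Stand-in objective (not a real model): minimal at lr=1e-3, hidden_dim=256."""
    lr_term = (math.log10(cfg["learning_rate"]) + 3) ** 2
    size_term = (cfg["hidden_dim"] - 256) ** 2 / 1e5
    return lr_term + size_term


grid = {"learning_rate": [1e-2, 1e-3, 1e-4], "hidden_dim": [128, 256, 512]}
start = {"learning_rate": 1e-2, "hidden_dim": 512}
best, best_loss, history = tune_one_at_a_time(toy_validation_loss, grid, start)
print(best)
print(history)
```

In this toy setting the loss drops in the first round and then stays flat, which mirrors the qualitative pattern reported above, where most gains arrived in the initial tuning iterations.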

Discussion

In this study, the authors evaluated the effectiveness of various patch feature extractors for classifying ovarian carcinoma subtypes using an attention-based multiple instance learning (ABMIL) pipeline. The results demonstrated that transformer-based histopathology foundation models significantly outperformed traditional ImageNet-pretrained models, with 13 out of 14 foundation models exceeding the performance of all ImageNet models across multiple validation sets. Notably, the RN18-Histo model, the earliest histopathology foundation model, was the only one that did not surpass the ImageNet-pretrained models, highlighting the limitations of earlier architectures compared to more recent transformer-based approaches.
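
To make the ABMIL step concrete, the sketch below shows attention pooling in its commonly used non-gated form: each patch feature vector h_k receives a score w · tanh(V h_k), the scores are passed through a softmax, and the slide embedding is the attention-weighted sum of the patch features. This is a generic illustration in plain Python; the exact variant used in the study is not specified here, and the toy features and weights are made up rather than learned.

```python
import math


def abmil_pool(patch_features, V, w):
    """Attention pooling: a_k = softmax_k(w . tanh(V h_k)), z = sum_k a_k h_k."""
    scores = []
    for h in patch_features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(w_i * u_i for w_i, u_i in zip(w, hidden)))

    # Numerically stable softmax over the patches of one slide.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    denom = sum(exps)
    attention = [e / denom for e in exps]

    dim = len(patch_features[0])
    slide_embedding = [
        sum(a * h[j] for a, h in zip(attention, patch_features)) for j in range(dim)
    ]
    return slide_embedding, attention


# Toy inputs: three patches with 2-D features (real extractors give far more).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical attention weights, not trained
w = [1.0, -1.0]
embedding, attention = abmil_pool(features, V, w)
print(attention)
print(embedding)
```

The slide embedding would then be passed to a small classifier that outputs the five subtype probabilities, and the attention weights indicate which patches contributed most to the prediction.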

The analysis revealed a weak positive correlation between model performance and both the size of the foundation model and the size of the pretraining dataset. The largest models, such as Virchow and H-optimus-0, generally achieved the best results, although a smaller model, GPFM, excelled in one validation scenario. The study also found that hyperparameter tuning of the ABMIL classifiers yielded modest improvements in classification performance, particularly when adjusting learning rates and classifier sizes. Preprocessing techniques, by contrast, had limited impact, which suggests that choosing a good feature extractor matters more than extensive preprocessing for improving classification outcomes. Overall, this comprehensive evaluation underscores the advances in AI-driven histopathology and the potential for improved diagnostic accuracy in ovarian cancer subtyping.