Hierarchical multi-scale vision transformer model for accurate detection and classification of brain tumors in MRI-based medical imaging

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-23100-0
PMID: https://pubmed.ncbi.nlm.nih.gov/41174141
Publication Date: 2025-10-31
Author(s): C. Sankari et al.
Primary Topic: Brain Tumor Detection and Classification

Overview

The paper presents a Vision Transformer (ViT) framework enhanced with Hierarchical Multi-Scale Attention (HMSA) for the automated detection and classification of brain tumors. It assigns each scan to one of four classes: glioma, meningioma, pituitary adenoma, or healthy brain tissue. Key innovations include a multi-resolution patch embedding strategy that extracts features at three spatial scales (8×8, 16×16, and 32×32 patches), a computationally optimized transformer architecture that reduces training duration by 35%, and a probabilistic calibration mechanism that makes predicted confidence more reliable. The framework was validated on a dataset of 7023 T1-weighted contrast-enhanced MRI images, achieving a classification accuracy of 98.7%, with a precision of 0.986, a recall of 0.988, and an F1-score of 0.987, outperforming conventional machine learning methods and state-of-the-art CNN architectures.
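To make the three patch scales concrete, the sketch below counts how many tokens each patch size would produce for a square input. The 224×224 input size is an assumption chosen for illustration (the summary does not state the resolution), and `patch_grid` is a hypothetical helper, not a function from the paper.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


IMAGE_SIZE = 224            # assumption: not stated in the summary
PATCH_SIZES = (8, 16, 32)   # the three scales named in the overview

# Number of tokens the transformer would see at each scale.
token_counts = {p: patch_grid(IMAGE_SIZE, p)[1] for p in PATCH_SIZES}

for p in PATCH_SIZES:
    side, total = patch_grid(IMAGE_SIZE, p)
    print(f"{p}x{p} patches: {side}x{side} grid, {total} tokens")
```

Under this assumption the finest scale yields sixteen times as many tokens as the coarsest, which is one reason a multi-scale design has to manage its compute budget carefully.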

The findings indicate that the proposed ViT with HMSA achieves state-of-the-art performance in brain tumor detection, including a 1.9% accuracy increase over the standard ViT and substantial gains over traditional methods such as Random Forest and Support Vector Machines. The model is also well calibrated: its Expected Calibration Error (ECE) of 0.023 supports its reliability for clinical decision-making. Future research will extend the framework to multi-sequence and 3D volumetric MRI data and develop more rigorous theoretical models for attention scale selection, laying the groundwork for advanced AI systems in medical imaging.
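For readers unfamiliar with ECE, the following is a minimal sketch of the standard binned definition: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The use of 10 equal-width bins is an assumption, since the summary does not state the binning scheme, and the toy data are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum of (bin size / N) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins [k/n, (k+1)/n); a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example (invented data): an overconfident classifier.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
toy_hit = [True, True, True, False, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_hit)
```

On the toy data the model claims 95% confidence but is right 75% of the time in that bin, so the ECE comes out at roughly 0.18. A reported value of 0.023 means the average gap between confidence and accuracy is far smaller.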

Introduction

The introduction discusses the domain-specific data augmentation techniques used to improve model performance. Elastic deformation, with parameters $\alpha \in [5, 15]$ and $\sigma \in [3, 7]$, introduces variability in the data while preserving structural integrity. Intensity augmentation is applied through gamma correction with $\gamma \in [0.8, 1.2]$, which adjusts image brightness and contrast. Tumor-aware cropping ensures that at least 90% of the tumor remains visible in the training data, which keeps the augmented samples relevant. Finally, MixUp regularization, with a mixing parameter $\lambda \sim \text{Beta}(0.2, 0.2)$, improves generalization by blending different training examples. Together, these techniques aim to improve the robustness and accuracy of models in domain-specific tasks.
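Three of these augmentations are simple enough to sketch with the Python standard library. The code below assumes intensities scaled to [0, 1], binary tumor masks, and one-hot labels; all function names are hypothetical, and elastic deformation is omitted because it needs an image-resampling library. It is an illustration of the stated parameter ranges, not the authors' pipeline.

```python
import random


def gamma_correct(image, gamma):
    """Apply I -> I**gamma to a 2D image with intensities in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]


def mixup(x1, y1, x2, y2, lam):
    """Blend two flattened images and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def tumor_fraction_in_crop(mask, top, left, height, width):
    """Fraction of tumor pixels (mask value 1) that fall inside the crop box."""
    total = sum(sum(row) for row in mask)
    if total == 0:
        return 1.0  # nothing to preserve, e.g. a healthy scan
    inside = sum(sum(row[left:left + width]) for row in mask[top:top + height])
    return inside / total


def crop_is_tumor_aware(mask, top, left, height, width, min_fraction=0.9):
    """Accept a crop only if it keeps at least min_fraction of the tumor."""
    return tumor_fraction_in_crop(mask, top, left, height, width) >= min_fraction


rng = random.Random(0)            # fixed seed so the sketch is repeatable
gamma = rng.uniform(0.8, 1.2)     # gamma range from the text
lam = rng.betavariate(0.2, 0.2)   # lambda ~ Beta(0.2, 0.2) from the text
```

Because Beta(0.2, 0.2) concentrates its mass near 0 and 1, most MixUp samples stay close to one of the two original images, with only occasional strongly blended pairs.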

Methods

The model is trained with a three-phase progressive strategy. Phase 1 is a 5-epoch warm-up in which the transformer blocks remain frozen and only the classification head is trained, with a learning rate of $1 \times 10^{-5}$, to adapt the output mapping. Phase 2 is full training over 20 epochs: all layers are unfrozen, the learning rate is $1 \times 10^{-4}$ with cosine annealing, and various augmentations are applied. Phase 3 is a 5-epoch calibration period in which the model weights are fixed and the temperature is optimized with a learning rate of $1 \times 10^{-6}$ to minimize calibration error.
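The three phases can be laid out as a per-epoch schedule, sketched below in plain Python. Several details are assumptions not given in the summary: the cosine schedule is stepped once per epoch and decays toward a minimum rate of 0, and "temperature optimization" is read as standard temperature scaling, where logits are divided by a scalar T before the softmax. Function names are hypothetical.

```python
import math


def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine annealing: base_lr at epoch 0, decaying toward min_lr."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


def build_schedule():
    """Return one entry per epoch: phase, learning rate, what is trainable."""
    schedule = []
    for _ in range(5):        # Phase 1: warm-up, transformer blocks frozen
        schedule.append({"phase": 1, "lr": 1e-5, "trainable": "classification head"})
    for epoch in range(20):   # Phase 2: full training with cosine annealing
        lr = cosine_annealed_lr(1e-4, epoch, 20)
        schedule.append({"phase": 2, "lr": lr, "trainable": "all layers"})
    for _ in range(5):        # Phase 3: calibration, model weights fixed
        schedule.append({"phase": 3, "lr": 1e-6, "trainable": "temperature"})
    return schedule


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / T; T > 1 softens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


schedule = build_schedule()
for index, entry in enumerate(schedule):
    print(index, entry["phase"], f"{entry['lr']:.2e}", entry["trainable"])
```

Since only the scalar temperature changes in Phase 3 under this reading, the predicted class for each scan stays the same and only the confidence attached to it moves.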

To ensure a fair comparison, all models were trained under identical protocols, although this section does not give specific details of the experimental setup. The structured approach aims to optimize model performance while keeping experimental conditions consistent.

Discussion

The discussion highlights advances in brain tumor classification through deep learning, focusing on the transition from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs). It underscores the importance of early detection of brain tumors, which significantly affects patient survival rates, and the role of MRI in providing detailed imaging without harmful radiation. The paper notes that while CNNs have historically dominated medical image analysis because of their efficiency and ability to capture spatial hierarchies, ViTs offer advantages in handling long-range dependencies and integrating multi-scale features, both of which are essential for accurately classifying brain tumors.

Recent innovations, such as hybrid models that combine CNNs and ViTs, have shown promising results in improving diagnostic accuracy. The proposed vision transformer architecture addresses the challenges of medical imaging by incorporating hierarchical multi-scale attention mechanisms and probabilistic calibration for confidence estimation. According to the paper, this architecture improves both computational efficiency and interpretability, allowing for better integration into clinical workflows. The findings suggest that while CNNs remain effective for tasks like segmentation, ViTs offer significant benefits for classification, particularly in complex medical scenarios where understanding global context is crucial. The paper advocates a task-specific approach to choosing between CNNs and ViTs and proposes that future work focus on architectural synergies that combine the strengths of both.