Comparative evaluation of vision transformers and convolutional networks for breast ultrasound image classification

Journal: Exploration of Medicine, Volume: 7
DOI: https://doi.org/10.37349/emed.2026.1001382
Publication Date: 2026-02-27
Author(s): Suleyman Naral et al.
Primary Topic: AI in cancer detection

Overview

This study addresses interobserver variability in breast ultrasound interpretation by comparing two Vision Transformer (ViT) models (Swin Transformer Base and DeiT Base) with two Convolutional Neural Network (CNN) models (InceptionV3 and MobileNetV3 Large) for automated classification of breast ultrasound images into three categories: benign, malignant, and normal. The experiments use the Breast Ultrasound Images (BUSI) dataset, which comprises 780 labeled images, and all models are trained with the same on-the-fly augmentation pipeline. The study weighs predictive performance against computational efficiency.

The Swin Transformer Base model achieved the highest test accuracy (0.9167) and F1 score (0.8981). MobileNetV3 Large reached a lower accuracy of 0.8583 but needed far less computation: 0.43 GFLOPs compared with 30.34 GFLOPs for Swin. The results suggest that ViT models may offer higher accuracy, whereas lightweight CNNs such as MobileNetV3 Large are better suited to deployment in resource-constrained clinical environments. Model selection for breast ultrasound classification should therefore weigh both predictive accuracy and operational feasibility within clinical workflows.

Introduction

Breast cancer, characterized by the uncontrolled growth of epithelial cells in breast tissue, poses a significant global health challenge due to its multifactorial etiology, which includes genetic, hormonal, and environmental factors. Timely and accurate detection is crucial for effective management, as clinical outcomes are closely linked to the stage at diagnosis. While mammography is the standard screening method, its sensitivity is limited in women with dense breast tissue and younger populations, prompting the need for complementary imaging modalities such as ultrasound. Although ultrasound is advantageous due to its accessibility and lack of ionizing radiation, its interpretation can be subjective, leading to variability in reporting.

Recent advancements in artificial intelligence, particularly deep learning, have shown promise in enhancing the accuracy and consistency of breast ultrasound interpretation. Convolutional Neural Networks (CNNs) have been the predominant approach, but Vision Transformer (ViT) models are emerging as alternatives that may capture broader contextual relationships. Despite the potential benefits of transformer-based models, challenges related to computational resources and dataset heterogeneity persist. This study aims to address the gap in comparative evaluations of transformer models against established CNN architectures by assessing the Swin Transformer Base and DeiT Base against InceptionV3 and MobileNetV3 Large for three-class breast ultrasound classification. The research not only focuses on predictive performance but also evaluates efficiency-related indicators, thereby aiding in informed model selection based on deployment constraints.

Methods

The study used the Breast Ultrasound Images (BUSI) dataset of 780 labeled images in three classes: benign, malignant, and normal. The images were divided with a stratified split into 70% for training, 15% for validation, and 15% for testing. Four architectures were compared: two vision transformers (Swin Transformer Base and DeiT Base) and two CNNs (InceptionV3 and MobileNetV3 Large). All models were trained with the same on-the-fly augmentation pipeline.

Each model was evaluated on predictive performance (test accuracy and F1 score) and on efficiency-related indicators (parameter count and GFLOPs), so that accuracy could be compared against computational cost.
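The Discussion reports a stratified 70%/15%/15% split of the 780 BUSI images. The sketch below shows one way such a split could be produced with the Python standard library only. It is an illustration under stated assumptions, not the authors' code: the function name `stratified_split`, the seed, and the rounding rule are hypothetical, and the per-class counts are the commonly cited BUSI figures rather than numbers given in this summary.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n = len(indices)
        n_val = round(n * fractions[1])
        n_test = round(n * fractions[2])
        n_train = n - n_val - n_test  # the remainder goes to training
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Assumption: commonly cited BUSI class counts (they sum to 780).
class_counts = {"benign": 437, "malignant": 210, "normal": 133}
labels = [name for name, count in class_counts.items() for _ in range(count)]

train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
print(Counter(labels[i] for i in test))
```

This sketch splits by image. The patient-level splitting that the Discussion recommends for future work would require grouping images by a patient identifier before assigning groups to subsets.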

Results

Table 2 of the study summarizes the predictive performance, model size, and computational complexity of the evaluated models. Swin Transformer Base was the top performer, with an accuracy of 0.9167 and an F1 score of 0.8981, at a substantial cost of 86.75 million parameters and 30.3375 giga floating-point operations (GFLOPs). MobileNetV3 Large attained a lower accuracy of 0.8583 with a far lighter profile of 4.21 million parameters and 0.4307 GFLOPs, which is about 20.61 times fewer parameters and about 70 times fewer GFLOPs than Swin.
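The efficiency gap between the two models can be checked directly from the reported figures. The short calculation below uses only the parameter counts and GFLOPs quoted above and computes the two ratios separately. It shows that the 20.61 figure corresponds to the ratio of parameter counts, while the ratio of GFLOPs is considerably larger. The dictionary names are illustrative.

```python
# Figures as reported in the Results (Table 2 of the study).
swin = {"params_m": 86.75, "gflops": 30.3375}
mobilenet = {"params_m": 4.21, "gflops": 0.4307}

# Ratio of model sizes (millions of parameters).
param_ratio = swin["params_m"] / mobilenet["params_m"]

# Ratio of computational cost per forward pass.
gflops_ratio = swin["gflops"] / mobilenet["gflops"]

print(f"parameter ratio: {param_ratio:.2f}x")   # 20.61x
print(f"GFLOPs ratio:    {gflops_ratio:.2f}x")  # 70.44x
```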

DeiT Base ranked second in accuracy at 0.8750, with an F1 score of 0.8555, a parameter count similar to Swin's, and slightly higher GFLOPs. The two CNN models, InceptionV3 and MobileNetV3 Large, both achieved an accuracy of 0.8583 but differed markedly in resource requirements: InceptionV3 required 21.79 million parameters and 5.6719 GFLOPs, whereas MobileNetV3 Large required 4.21 million parameters and 0.4307 GFLOPs. The findings indicate a clear trade-off between accuracy and computational demand, so the choice of model should be guided by the deployment context, balancing performance needs against computational constraints. Further validation through direct measurements of inference latency and resource usage on target hardware is recommended.
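GFLOPs and parameter counts are proxies, which is why direct latency measurement on the target hardware is recommended. The following is a minimal, framework-independent sketch of such a measurement. The names `measure_latency` and `dummy_predict` are hypothetical, and `dummy_predict` is only a stand-in for a real model's forward pass.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=5, runs=30):
    """Time repeated calls to predict(sample); return median and p95 in ms."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        predict(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }


# Hypothetical stand-in for a model's forward pass on one image.
def dummy_predict(pixels):
    return sum(pixels) / len(pixels)


stats = measure_latency(dummy_predict, [0.5] * 10_000)
print(stats)
```

With a real model, `predict` would wrap the inference call. On accelerators that execute asynchronously, the wrapper would also need to wait for the computation to finish before the timer is read.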

Discussion

In this study, the authors evaluated four architectures for classifying breast ultrasound images from the BUSI dataset: two convolutional neural networks (CNNs: InceptionV3 and MobileNetV3 Large) and two vision transformers (ViTs: Swin Transformer Base and DeiT Base). The dataset consists of 780 images categorized into benign, malignant, and normal classes, with a stratified split of 70% for training, 15% for validation, and 15% for testing. Swin Transformer Base achieved the highest accuracy of 0.9167 and an F1 score of 0.8981, a result consistent with its hierarchical attention mechanisms capturing both local and broader contextual cues. However, this model also had a substantial computational cost, with 86.75 million parameters and 30.34 GFLOPs.
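Every model in the study has an F1 score below its accuracy, which is a common pattern when classes are imbalanced. The sketch below computes accuracy and a macro-averaged F1 score from a three-class confusion matrix to illustrate how the two metrics can diverge. Two things are assumptions here: the summary does not state which F1 averaging the authors used, and the confusion matrix is invented for illustration and is not the study's data.

```python
def accuracy_and_macro_f1(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    f1_scores = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp
        fn = sum(confusion[k]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return correct / total, sum(f1_scores) / n_classes


# Hypothetical confusion matrix (rows: true benign, malignant, normal).
confusion = [
    [60, 4, 2],
    [5, 26, 1],
    [3, 0, 17],
]

acc, macro_f1 = accuracy_and_macro_f1(confusion)
print(f"accuracy = {acc:.4f}, macro F1 = {macro_f1:.4f}")
```

On this invented matrix the macro F1 is lower than the accuracy because the smaller classes, which are classified less reliably, count as much as the largest class in the average.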

In contrast, MobileNetV3 Large provided a competitive accuracy of 0.8583 while significantly reducing computational demands, utilizing only 4.21 million parameters and 0.43 GFLOPs. This disparity highlights the importance of selecting an architecture that balances predictive performance with operational efficiency, particularly in resource-constrained clinical settings. The study also acknowledged limitations, such as reliance on a single dataset and class imbalance, suggesting that future research should focus on external validation and patient-level data splitting to enhance generalizability. Overall, the findings underscore the need for careful architecture selection aligned with specific clinical requirements and deployment considerations.
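The accuracy-versus-cost trade-off can be made concrete as a simple selection rule: choose the most accurate model whose cost fits a given compute budget. The sketch below applies that rule to the figures reported in this summary. The function `pick_model` and the budget values are hypothetical, and DeiT Base is omitted because its exact GFLOPs figure is not stated here.

```python
# Accuracy and GFLOPs as reported in the summary.
models = [
    {"name": "Swin Transformer Base", "accuracy": 0.9167, "gflops": 30.3375},
    {"name": "InceptionV3", "accuracy": 0.8583, "gflops": 5.6719},
    {"name": "MobileNetV3 Large", "accuracy": 0.8583, "gflops": 0.4307},
]


def pick_model(candidates, gflops_budget):
    """Most accurate model within the budget; ties go to the cheaper model."""
    feasible = [m for m in candidates if m["gflops"] <= gflops_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda m: (m["accuracy"], -m["gflops"]))


# Hypothetical per-image compute budgets, in GFLOPs.
for budget in (0.1, 1.0, 10.0, 50.0):
    choice = pick_model(models, budget)
    print(budget, "->", choice["name"] if choice else "no model fits")
```

Under these figures, InceptionV3 is never selected, because MobileNetV3 Large matches its accuracy at a lower cost. A real deployment decision would also weigh measured latency, memory use, and clinical requirements.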