CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation

Journal: Information Fusion, Volume: 113
DOI: https://doi.org/10.1016/j.inffus.2024.102634
Publication Date: 2024-08-10
Author(s): Xiao Liu et al.
Primary Topic: Radiomics and Machine Learning in Medical Imaging

Overview

The paper presents CSWin-UNet, a U-shaped segmentation architecture that integrates the CSWin self-attention mechanism into the UNet framework to improve medical image segmentation. Convolutional neural networks (CNNs) are limited by their inductive biases, and Transformer methods carry high computational demands; CSWin-UNet is designed to address both problems. The architecture improves computational efficiency and receptive-field interaction by computing self-attention over horizontal and vertical stripes. In addition, the decoder employs a content-aware reassembly operator, which reassembles features using predicted kernels and thereby improves the restoration of image resolution.
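The content-aware reassembly step can be illustrated with a minimal single-channel sketch. It assumes the per-location kernel logits are already available; in CARAFE they are predicted from the feature map itself, which is omitted here. The function names, the zero padding at the borders, and the default `scale` and `k_up` values are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Normalize a flat list of logits so the weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, scale=2, k_up=3):
    """Upsample a 2D feature map by content-aware reassembly.

    feature:       H x W nested list (one channel).
    kernel_logits: (H*scale) x (W*scale) nested list; each entry is a flat
                   list of k_up*k_up logits for that output location
                   (assumed to be given, not predicted here).
    """
    height, width = len(feature), len(feature[0])
    radius = k_up // 2
    out = [[0.0] * (width * scale) for _ in range(height * scale)]
    for oy in range(height * scale):
        for ox in range(width * scale):
            sy, sx = oy // scale, ox // scale  # source location
            weights = softmax(kernel_logits[oy][ox])
            acc, idx = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = sy + dy, sx + dx
                    # Zero padding outside the feature map (an assumption).
                    if 0 <= y < height and 0 <= x < width:
                        acc += weights[idx] * feature[y][x]
                    idx += 1
            out[oy][ox] = acc
    return out


feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
uniform = [[[0.0] * 9 for _ in range(6)] for _ in range(6)]
upsampled = carafe_reassemble(feature, uniform)
print(len(upsampled), len(upsampled[0]))  # 6 6
```

With uniform logits every output value is a plain neighbourhood average; logits that favour the centre tap reduce the operator to nearest-neighbour upsampling. Predicted kernels sit between these two extremes and vary per location, which is what makes the operator content-aware.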

Empirical evaluations on diverse datasets, including Synapse multi-organ CT, cardiac MRI, and skin lesion images, demonstrate that CSWin-UNet achieves high segmentation accuracy while maintaining low model complexity. The study concludes that CSWin-UNet not only reduces computational cost but also improves the precision of organ edge segmentation, outperforming existing state-of-the-art methods. The authors see this lightweight approach as having significant potential for further optimization in deep learning applications for complex medical image segmentation tasks.

Introduction

The introduction of this research paper emphasizes the critical role of medical image segmentation in enhancing diagnostic and treatment accuracy by extracting detailed information from medical images. It highlights the advancements brought about by deep learning techniques, particularly Convolutional Neural Networks (CNNs) and their derivatives, such as UNet and its enhanced versions, which have significantly improved segmentation accuracy across various medical imaging modalities. However, these CNN-based methods face challenges in capturing global semantic information and managing long-range dependencies, prompting researchers to explore the Transformer architecture, initially successful in natural language processing, for image segmentation tasks.

The paper introduces CSWin-UNet, a Transformer-based architecture designed to address the limitations of existing methods while reducing computational cost. Unlike hybrid models such as TransUNet, CSWin-UNet employs pure Transformer blocks in both its encoder and decoder, using the cross-shaped window self-attention mechanism to strengthen feature extraction. The architecture incorporates a CARAFE layer for upsampling, which improves the preservation of detailed features and boundaries in segmentation tasks. Experimental results demonstrate that CSWin-UNet outperforms existing methods in both segmentation accuracy and computational efficiency.

Methods

The methodology section describes the architecture of CSWin-UNet, which comprises an encoder, a decoder, and skip connections, with the fundamental component being the CSWin Transformer block. For processing medical images with dimensions \( H \times W \times 3 \), the model employs convolutional token embedding using a \( 7 \times 7 \) kernel and a stride of 4, resulting in patch tokens of size \( H/4 \times W/4 \) with \( C \) channels. The architecture includes four stages in both the encoder and decoder, incorporating skip connections to effectively merge features and preserve contextual information.
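The token-grid size quoted above follows from the standard convolution output-size formula. The sketch below checks the arithmetic. The padding of 2 is an assumption: the summary does not state it, but it is the value that makes a \( 7 \times 7 \), stride-4 convolution produce exactly \( H/4 \times W/4 \) when \( H \) and \( W \) are divisible by 4. The function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid(height: int, width: int, kernel: int = 7, stride: int = 4,
               padding: int = 2) -> tuple:
    """Spatial size of the patch-token grid after convolutional embedding.

    padding=2 is an assumed value, chosen so the grid is exactly H/4 x W/4.
    """
    return (
        conv_output_size(height, kernel, stride, padding),
        conv_output_size(width, kernel, stride, padding),
    )


print(token_grid(224, 224))  # (56, 56)
```

Because the 7-pixel kernel is wider than the 4-pixel stride, neighbouring tokens overlap, unlike the non-overlapping patches of a plain ViT-style embedding.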

In the encoder, convolutional layers with a \( 3 \times 3 \) kernel and a stride of 2 perform downsampling, halving the resolution while doubling the channel count. Conversely, the decoder uses a CARAFE layer for upsampling, which doubles the resolution while halving the channel count. A final \( 4 \times \) CARAFE upsampling operation restores the resolution to \( H \times W \), followed by a linear layer that transforms the feature map into a segmentation mask. This symmetric structure, together with the skip connections, is intended to help the model segment medical images accurately.
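The resolution and channel bookkeeping described above can be traced with a short sketch. It assumes three downsampling steps between the four encoder stages and three mirrored upsampling steps in the decoder, which is consistent with the final \( 4 \times \) upsampling restoring \( H \times W \) from \( H/4 \times W/4 \). The base width of 64 channels is a hypothetical example value for \( C \), and the function names are illustrative.

```python
def encoder_shapes(height, width, channels, stages=4):
    """(H, W, C) at each encoder stage, starting from the H/4 x W/4 token grid."""
    h, w, c = height // 4, width // 4, channels
    shapes = [(h, w, c)]
    for _ in range(stages - 1):
        # 3x3 stride-2 convolution: halve the resolution, double the channels.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


def decoder_shapes(bottleneck, stages=4):
    """(H, W, C) at each decoder stage, starting from the bottleneck shape."""
    h, w, c = bottleneck
    shapes = [(h, w, c)]
    for _ in range(stages - 1):
        # CARAFE 2x upsampling: double the resolution, halve the channels.
        h, w, c = h * 2, w * 2, c // 2
        shapes.append((h, w, c))
    return shapes


def final_resolution(shape, factor=4):
    """Spatial size after the last 4x CARAFE upsampling."""
    return (shape[0] * factor, shape[1] * factor)


enc = encoder_shapes(224, 224, 64)  # 64 is an assumed example value for C
dec = decoder_shapes(enc[-1])
print(enc)  # [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
print(dec)  # [(7, 7, 512), (14, 14, 256), (28, 28, 128), (56, 56, 64)]
print(final_resolution(dec[-1]))  # (224, 224)
```

Each decoder shape matches an encoder shape at the same resolution, which is what allows the skip connections to merge features stage by stage.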

Results

In the results section, the proposed CSWin-UNet method demonstrates significant improvements in segmentation performance on the Synapse dataset, as evidenced by a mean Dice Similarity Coefficient (DSC) increase of 3.64% over TransUNet and 1.99% over Swin-UNet, alongside a mean Hausdorff Distance (HD) improvement of 12.83% and 2.69%, respectively. Notably, CSWin-UNet excels in segmenting the pancreas, achieving higher accuracy in delineating its boundaries compared to other methods, which struggle with the organ’s complex characteristics. Visual analyses further illustrate that CSWin-UNet outperforms competitors in accurately segmenting small and complex organs, as shown in comparative figures.
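For readers unfamiliar with the two metrics, the sketch below computes the Dice similarity coefficient and a plain symmetric Hausdorff distance on binary masks represented as sets of pixel coordinates. This is an illustration of the definitions only. The paper's exact evaluation protocol (for example, whether a percentile variant of HD is used) is not stated in this summary, and the masks below are made-up toy data.

```python
import math


def dice(pred, target):
    """Dice similarity coefficient between two sets of foreground pixels."""
    if not pred and not target:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


def hausdorff(pred, target):
    """Symmetric Hausdorff distance between two non-empty pixel sets."""
    def directed(src, dst):
        # Largest distance from any source pixel to its nearest pixel in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(pred, target), directed(target, pred))


# Toy masks: two 2x2 squares that overlap in one column.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice(pred, target))       # 0.5
print(hausdorff(pred, target))  # 1.0
```

Dice rewards region overlap (higher is better), while the Hausdorff distance penalizes the worst boundary error (lower is better), which is why the two are reported together.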

Additionally, results on the ACDC dataset indicate that CSWin-UNet achieves the highest DSC values for various cardiac structures, outperforming other state-of-the-art methods. The performance is corroborated by error bars representing the 95% confidence interval. On skin lesion segmentation datasets, CSWin-UNet consistently surpasses other methods, including Swin-UNet, in multiple evaluation metrics, showcasing its robust generalization capabilities. However, challenges remain in low-contrast scenarios, where segmentation accuracy diminishes. Overall, CSWin-UNet demonstrates superior segmentation accuracy and robustness across diverse datasets and complex segmentation tasks.

Discussion

The discussion on self-attention mechanisms in image segmentation highlights their significant impact on improving segmentation performance, particularly in medical imaging where capturing intricate structures is crucial. Various self-attention designs, such as the Swin Transformer and CSWin Transformer, have been developed to mitigate the computational complexity associated with global attention mechanisms. The CSWin Transformer, in particular, enhances efficiency by employing a cross-shaped window self-attention approach that allows for parallel computation across horizontal and vertical stripes, thereby broadening the interaction within receptive fields while conserving computational resources.
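A rough sketch of the stripe partitioning makes the efficiency argument concrete. It groups the tokens of an \( H \times W \) grid into horizontal or vertical stripes, counts the query-key pairs that attention must evaluate within the stripes, and lists the cross-shaped region a single token can reach. The function names, the stripe width of 2, and the 8 by 8 grid are illustrative assumptions; the split of attention heads between the two directions is not modelled.

```python
def horizontal_stripes(height, width, stripe_width):
    """Group token coordinates into horizontal stripes of `stripe_width` rows."""
    return [
        [(y, x) for y in range(top, min(top + stripe_width, height))
         for x in range(width)]
        for top in range(0, height, stripe_width)
    ]


def vertical_stripes(height, width, stripe_width):
    """Group token coordinates into vertical stripes of `stripe_width` columns."""
    return [
        [(y, x) for y in range(height)
         for x in range(left, min(left + stripe_width, width))]
        for left in range(0, width, stripe_width)
    ]


def attention_pairs(groups):
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


def cross_shaped_field(y, x, height, width, stripe_width):
    """Tokens reachable from (y, x) through its horizontal and vertical stripes."""
    top = (y // stripe_width) * stripe_width
    left = (x // stripe_width) * stripe_width
    rows = {(r, c) for r in range(top, min(top + stripe_width, height))
            for c in range(width)}
    cols = {(r, c) for r in range(height)
            for c in range(left, min(left + stripe_width, width))}
    return rows | cols


stripes = horizontal_stripes(8, 8, 2)
print(attention_pairs(stripes), (8 * 8) ** 2)  # 1024 4096
print(len(cross_shaped_field(3, 5, 8, 8, 2)))  # 28
```

In this toy setting, stripe attention in one direction evaluates a quarter of the pairs that global attention would, yet the union of the two directions still spans the full height and width of the grid.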

In the realm of CNN-based medical image segmentation, architectures like FCN and UNet have been foundational, with subsequent models like SegNet and UNet++ enhancing segmentation accuracy through innovative design features. The integration of Transformers into segmentation frameworks, exemplified by models such as TransUNet and CSWin-UNet, further capitalizes on the global contextual modeling capabilities of Transformers, addressing the limitations of traditional CNNs in capturing long-range dependencies. The CSWin-UNet model, which combines CSWin self-attention with a UNet architecture, demonstrates improved segmentation accuracy while maintaining lower computational complexity, representing a notable advancement in the application of Transformers to medical image segmentation tasks.