ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation


Journal: Image and Vision Computing, Volume 147
DOI: https://doi.org/10.1016/j.imavis.2024.105057
Publication date: 2024-05-01


Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C.-W. Phan
School of Information Technology, Monash University, Malaysia Campus, Subang Jaya 47500, Malaysia

Article info

Keywords:

Medical image analysis
Small object segmentation
You only look once (YOLO)
Sequence feature fusion
Attention mechanism

Abstract

We propose a novel Attentional Scale Sequence Fusion based You Only Look Once (YOLO) framework (ASF-YOLO) that combines spatial and scale features for accurate and fast cell instance segmentation. Built on the YOLO segmentation framework, it uses a Scale Sequence Feature Fusion (SSFF) module to enhance the network's ability to extract multi-scale information, and a Triple Feature Encoder (TFE) module to fuse feature maps of different scales and increase detailed information. We further introduce a Channel and Position Attention Mechanism (CPAM) to integrate the SSFF and TFE modules; it focuses on informative channels and on small objects tied to spatial position to improve detection and segmentation performance. Experimental validation on two cell datasets shows remarkable segmentation accuracy and speed for the proposed ASF-YOLO model. It achieves a box mAP of 0.91, a mask mAP of 0.887, and an inference speed of 47.3 FPS on the 2018 Data Science Bowl dataset, outperforming state-of-the-art methods. The source code is available at https://github.com/mkang315/ASF-YOLO.

1. Introduction

With the rapid development of sample preparation and microscopic imaging technology, quantitative processing and analysis of cell images play an important role in fields such as medicine and cell biology. With convolutional neural networks (CNNs), the characteristic information of different cell images can be learned through network training, with strong generalization performance. The two-stage R-CNN series [1-3] and its one-stage variants [4, 5] are classical CNN-based frameworks for instance segmentation tasks. However, these classical CNN-based methods achieve only suboptimal performance for real-time cell instance segmentation, especially on dense and small cells.

In recent work, the You Only Look Once (YOLO) series [6-9] has become one of the fastest and most accurate model families for real-time instance segmentation. Owing to their one-stage design and feature extraction capability, YOLO instance segmentation models offer better accuracy and speed than two-stage segmentation models. However, the performance of YOLO-based models for small object segmentation in medical or histopathological images, such as cell instance segmentation, remains largely unexplored. Cell instance segmentation poses additional challenges because the objects are small, dense, and overlapping. Blurred cell boundaries lead to poor segmentation accuracy, and cell images require accurate, detailed segmentation of different object types. As shown in Fig. 1, different types of cell images differ greatly in color, morphology, texture, and other characteristic information because of differences in cell morphology, preparation methods, and imaging techniques. Although segmentation accuracy and speed have improved for natural images, the architectures of YOLO models can be further improved to handle small objects in medical images, such as cells.

A typical YOLO framework consists of three main parts: the backbone, the neck, and the head. The YOLO backbone is a convolutional neural network that extracts image features at different levels of granularity. The Darknet with 53 convolutional layers (CSPDarknet53) was modified from YOLOv4 and designed as the backbone of YOLOv5; it contains C3 modules (CSP bottleneck with 3 convolutional layers) and ConvBNSiLU modules. In the YOLOv8 backbone, the C3 modules are replaced by C2f modules (CSP bottleneck with 2 convolutional layers and a shortcut), which is the only difference from YOLOv5. As shown in Fig. 2, the level 1-5 feature extraction branches in the backbones of YOLOv5 and YOLOv8 correspond to the outputs of the YOLO network associated with each of these feature maps.

Fig. 1. Different cell images (left) and their feature maps (right).

Fig. 2. A concise view of the general architecture of YOLOv5 v7.0 and YOLOv8 for the segmentation task. P1, P2, P3, P4, and P5 denote the different feature levels output by the backbone. The head crops the segmentation masks to each detected bounding box, which ensures that the masks do not spill outside the bounding boxes. The neck is intentionally abbreviated because its structure differs between YOLOv5 and YOLOv8.

YOLOv5 and YOLOv8 are the first YOLO-based architectures that can handle segmentation tasks in addition to detection and classification. In the feature extraction stage of YOLOv5, the CSPDarkNet53 backbone, built by stacking multiple C3 modules, is used, and the three effective branches P3, P4, and P5 of the backbone then serve as input to the feature pyramid network (FPN) [13] to build a multi-scale fusion structure in the neck. During decoding of the feature layers, three heads of different sizes, corresponding to the effective branches of the backbone, predict the bounding boxes of objects. After the P3 features are upsampled, pixel-by-pixel decoding predicts the segmentation mask of each object and completes the instance segmentation. In the segmentation head, the three feature scales output three different anchor boxes, and the mask prototype module outputs the prototype masks, which are processed to obtain the detection boxes and segmentation masks for the instance segmentation task.
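
The mask decoding and cropping steps described above and in the caption of Fig. 2 can be illustrated with a small pure-Python sketch. It assumes the prototype-based scheme in which a mask is a sigmoid of a linear combination of prototype masks; the function names `assemble_mask` and `crop_mask`, the end-exclusive box convention, and the toy values are illustrative assumptions, not the authors' code.

```python
import math


def assemble_mask(prototypes, coeffs):
    """Combine K prototype masks (each H x W) with K coefficients, then apply a sigmoid."""
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            z = sum(c * proto[y][x] for c, proto in zip(coeffs, prototypes))
            row.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(row)
    return mask


def crop_mask(mask, box):
    """Zero every mask value outside the box (x1, y1, x2, y2), end-exclusive (assumed convention)."""
    x1, y1, x2, y2 = box
    return [
        [v if (x1 <= x < x2 and y1 <= y < y2) else 0.0 for x, v in enumerate(row)]
        for y, row in enumerate(mask)
    ]


# Toy example: two constant 4 x 4 prototypes, so z = 2*1 + 1*(-1) = 1 at every pixel.
prototypes = [[[1.0] * 4 for _ in range(4)], [[-1.0] * 4 for _ in range(4)]]
mask = assemble_mask(prototypes, coeffs=[2.0, 1.0])
cropped = crop_mask(mask, box=(1, 1, 3, 3))
```

Cropping after assembly is what keeps a mask from leaking into neighboring cells when the prototypes respond to several nearby objects at once.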

In this paper, we address the challenges of small object segmentation with an improved YOLO-based model applied to cell instance segmentation. Our method builds on YOLOv5 and exploits the ability of its backbone to extract multi-scale features from cell images. It extends the original YOLOv5 architecture to improve its effectiveness on small objects by incorporating several novel modules, particularly in the neck of the model. These include multi-scale feature fusion and the integration of an attention mechanism.

More specifically, we propose a one-stage instance segmentation model for cell images that integrates attentional scale sequence fusion into the YOLO framework (ASF-YOLO). The CSPDarknet53 network is used as the backbone to extract multi-dimensional feature information from cell images in the feature extraction stage. We propose several novel network designs in the neck to enable multi-scale feature fusion and an attention mechanism. The multi-scale fusion aims to improve the robustness of the model to scale variations of small objects in cell images acquired under different conditions. The attention mechanism provides a selective focus on the multi-scale features relevant to small objects. For the detection head, we use EIoU [14] in the training stage to optimize the bounding box localization loss by minimizing the difference between the widths and heights of the bounding box and the anchor box. EIoU can capture the locations of small objects better than the CIoU used in YOLOv5 and YOLOv8, which reflects only the difference in aspect ratio and not the true relationship between the widths and heights of the labeled and predicted boxes. Soft non-maximum suppression (Soft-NMS) [15] is also used in the post-processing stage to alleviate the problem of densely overlapping cells. The main contributions of this work are summarized as follows:

We design a Scale Sequence Feature Fusion (SSFF) module and a Triple Feature Encoder (TFE) module to fuse the multi-scale feature maps extracted from the backbone in a path aggregation network (PANet) [16] structure. SSFF combines the global semantic information of images across different scales by normalizing, upsampling, and fusing the multi-scale features in a 3D convolution. It can therefore effectively handle objects of varying sizes, orientations, and aspect ratios in a scale-space representation to improve instance segmentation. TFE integrates small, medium, and large feature maps to capture fine-grained spatial information about small objects across distinct scales. Together, these overcome the limitations of the FPN in YOLOv5, which cannot fully exploit the correlations between pyramid feature maps through simple sum and concatenation operations and relies mainly on small-sized feature maps.
We then design a Channel and Position Attention Mechanism (CPAM) to integrate the feature information from the SSFF and TFE modules. This module lets the model adaptively adjust its focus on the channels and spatial positions relevant to small objects across different scales, and thus achieves better instance segmentation than the conventional YOLOv5 architecture, which has no attention mechanism.
We apply the proposed ASF-YOLO model to challenging instance segmentation tasks with densely overlapping objects and diverse cell types. To the best of our knowledge, this is the first study to use a YOLO-based model for cell instance segmentation. Evaluation on two benchmark cell datasets shows superior detection accuracy and speed compared with other state-of-the-art methods, including CNN-based models previously used for cell segmentation and several recent YOLO-based models.

Fig. 3. Overview of the proposed ASF-YOLO model. The framework mainly consists of the Scale Sequence Feature Fusion (SSFF) module, the Triple Feature Encoder (TFE) module, and the Channel and Position Attention Mechanism (CPAM), built on the CSPDarkNet backbone and the YOLO head. The CSP and Concat modules come from YOLOv5.

2. Related work

2.1. Cell instance segmentation

Cell instance segmentation can help complete the task of counting cells in an image, which semantic segmentation of cell images cannot do. Deep learning methods have increased the accuracy of automated nucleus segmentation [17]. Johnson et al. [18], Jung et al. [19], Fujita et al. [20], and Bancher et al. [21] proposed improved methods for simultaneous detection and segmentation of cells based on Mask R-CNN [2]. Yi et al. [22] and Cheng et al. [23] used the single-shot multi-box detector (SSD) [24] to detect and segment neural cell instances. Mahbod et al. [25] used a semantic segmentation algorithm based on the U-Net model [26] to segment cell nuclei. Hybrid models combining SSD and U-Net with attention mechanisms [19], or U-Net and Mask R-CNN [27], achieved some performance improvement on cell instance segmentation datasets. BlendMask [28] is a nucleus instance segmentation framework with a dilated convolution aggregation module and a context information aggregation module. Mask R-CNN is a two-stage instance segmentation framework, and its speed is slow. SSD, U-Net, and BlendMask are unified end-to-end (i.e., one-stage) frameworks, but they perform poorly when segmenting dense and small cells. Classical CNN-based methods for cell instance segmentation therefore no longer meet the needs of real-time detection and segmentation.

2.2. Improved YOLO for instance segmentation

Recent improvements to YOLO for the instance segmentation task focus on attention mechanisms, improved backbones or networks, and loss functions. The Squeeze-and-Excitation (SENet) block [29] was integrated into an improved YOLACT [6] to identify protozoa in microscopic images [30]. YOLOMask [31], PR-YOLO [32], and YOLO-SF [33] augmented YOLOv5 [8] and YOLOv7-Tiny [34] with the Convolutional Block Attention Module (CBAM) [35]. Effective feature extraction modules were added to improved backbone networks to make feature extraction in YOLO more efficient [36, 37]. YOLO-CORE [38] efficiently enhanced the instance mask through explicit and direct contour regression, using a designed multi-order constraint that consists of a polar distance loss and a sector loss. In addition, the hybrid models YOLOMask [39] and YUSEG [40] combined an improved YOLOv4 [12] and the original YOLOv5s with the semantic segmentation network U-Net to ensure instance segmentation accuracy. To the best of our knowledge, these improved YOLO architectures, which were originally designed for instance segmentation of natural images, have not been applied to cell instance segmentation, which is more challenging because the cells are small and densely overlapping.

3. The proposed ASF-YOLO model

3.1. Overall architecture

Fig. 3 gives an overview of the proposed ASF-YOLO framework, which combines spatial and multi-scale features for cell image instance segmentation. We develop a novel feature fusion network architecture consisting of two main component networks that provide complementary information for small object segmentation: (1) the SSFF module, which combines global or high-level semantic information from multiple scales of the image, and (2) the TFE module, which captures local fine details of small objects. Integrating both local and global feature information can produce a more accurate segmentation map. We fuse the output features of P3, P4, and P5 extracted from the backbone network. First, the SSFF module is designed to effectively fuse the feature maps of P3, P4, and P5, which capture different spatial scales covering a variety of sizes and shapes of different cell types. In SSFF, the P3, P4, and P5 feature maps are normalized to the same size, upsampled, and then stacked together as input to a 3D convolution that combines the multi-scale features. Second, the TFE module is developed to enhance small object detection for dense cells by concatenating features of three different sizes (large, medium, and small) in the spatial dimension to capture detailed information about small objects. The detailed feature information from the TFE module is then integrated into each feature branch through the PANet structure, and is then combined with the multi-scale information from the SSFF module in the P3 branch. We further introduce the Channel and Position Attention Mechanism (CPAM) in the P3 branch to exploit both the high-level multi-scale features and the detailed features. The channel and position attention mechanisms in CPAM capture informative channels and refine the spatial localization of small objects such as cells, which improves their detection and segmentation accuracy.

Fig. 4. The structure of the TFE module. $C$ represents the number of channels, and $S$ represents the feature map size. Each triple feature encoder module takes three feature maps of different sizes as input.

3.2. Scale sequence feature fusion module

To deal with the multi-scale problem of cell images, the existing literature uses feature pyramid structures for feature fusion, in which only sum or concatenation is used to fuse the pyramid features. However, the various feature pyramid network structures cannot effectively exploit the correlation between all pyramid feature maps. We propose a novel SSFF module that better combines multi-scale feature maps, i.e., the high-level information of deep feature maps with the detailed information of shallow feature maps, which have the same aspect ratio.

We construct sequence representations of the multi-scale feature maps generated by the backbone (i.e., P3, P4, and P5), which capture image content at different levels of detail or scale. The P3, P4, and P5 feature maps are first convolved with a series of Gaussian kernels of increasing standard deviation [41-43], as follows:

$$F_{\sigma}(i, j) = \sum_{u} \sum_{v} f(i - u, j - v) \times G_{\sigma}(u, v) \tag{1}$$

$$G_{\sigma}(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2} + y^{2}) / 2\sigma^{2}} \tag{2}$$

where $f$ represents a 2D feature map, and $F_{\sigma}$ is generated by smoothing through a series of convolutions using a 2D Gaussian filter with increasing standard deviation $\sigma$.
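
As a rough illustration of the Gaussian filters with increasing standard deviation mentioned above, the sketch below builds discrete 2D Gaussian kernels in plain Python. The kernel size of 5 and the sigma values are arbitrary assumptions for the example, and the weights are normalized to sum to 1, which stands in for the analytic normalizing constant of a continuous Gaussian.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel whose weights sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


# A short scale sequence: the larger sigma is, the flatter (more blurring) the kernel.
kernels = [gaussian_kernel(5, sigma) for sigma in (0.5, 1.0, 2.0)]
center_weights = [k[2][2] for k in kernels]
```

The center weight shrinks as sigma grows, which is the sense in which each successive filter in the sequence represents a coarser scale.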

We then stack the feature maps of different scales horizontally and use 3D convolution to extract their scale sequence features, inspired by 2D and 3D convolution operations on multiple video frames. Because the feature maps output by the Gaussian smoothing above have different resolutions, nearest-neighbor interpolation is used to align all feature maps to the resolution of the P3 level. Because the high-resolution feature map at the P3 level contains most of the information that is critical for detecting and segmenting small objects, the SSFF module is designed around the P3 level. As shown in Fig. 3, the proposed SSFF module consists of the following components:

(1) A convolution is used to change the number of channels of the P4 and P5 feature levels to 256.
(2) Nearest-neighbor interpolation [45] is used to resize them to match the size of the P3 level.
(3) The "unsqueeze" method is used to increase the dimension of each feature layer, changing it from a 3D tensor [height, width, channel] to a 4D tensor [depth, height, width, channel].
(4) The 4D feature maps are then concatenated along the depth dimension to form a 3D feature map for the subsequent convolutions.
(5) Finally, 3D convolution, 3D batch normalization, and the SiLU [46] activation function are used to complete the extraction of scale sequence features.
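
The resizing and stacking steps listed above can be traced with a minimal pure-Python sketch. It handles a single channel, assumes that P4 and P5 are 2 and 4 times smaller than P3 (the usual stride relationship in YOLOv5), and omits the learned channel-adjusting convolution and the final 3D convolution, batch normalization, and SiLU; the names `upsample_nearest` and `scale_sequence` are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2D map by an integer factor."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in fmap
        for _ in range(factor)
    ]


def scale_sequence(p3, p4, p5):
    """Align P4 and P5 to the P3 resolution and stack the three maps along a new depth axis."""
    p4_up = upsample_nearest(p4, 2)   # assumed: P4 is half the size of P3
    p5_up = upsample_nearest(p5, 4)   # assumed: P5 is a quarter the size of P3
    return [p3, p4_up, p5_up]         # shape: [depth=3][height][width]


p3 = [[float(4 * y + x) for x in range(4)] for y in range(4)]  # 4 x 4
p4 = [[10.0, 20.0], [30.0, 40.0]]                               # 2 x 2
p5 = [[7.0]]                                                    # 1 x 1
volume = scale_sequence(p3, p4, p5)
```

The resulting volume treats scale as a third axis, so a 3D convolution over it can mix information across neighboring scales in the same way it would across neighboring video frames.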

3.3. Triple feature encoder module

To identify densely overlapping small objects, one can refer to and compare the changes in shape or appearance at different scales by enlarging the image. Because the different feature layers of the backbone network have different sizes, the conventional FPN fusion mechanism only upsamples the small-sized feature map and then concatenates it with, or adds it to, the features of the previous layer, ignoring the rich detailed information in the larger feature layer. We therefore propose the TFE module, which splits the features into large, medium, and small sizes, adds the large-sized feature maps, and performs feature amplification to improve the detailed feature information.

Fig. 4 shows the structure of the TFE module. Before feature encoding, the number of feature channels is first adjusted so that it matches the main scale features. After the large-sized feature map (Large) is processed by the convolution module, its number of channels is adjusted to $1C$. A hybrid structure of max pooling + average pooling is then used for downsampling, which reduces the spatial dimensions of the features and helps achieve translation invariance, making the network more robust to spatial variations and translations of the input images. For the small-sized feature map (Small), the convolution module is likewise used to adjust the number of channels, and nearest-neighbor interpolation is then used for upsampling. This helps retain local features and prevents the loss of small-object feature information, because nearest-neighbor interpolation fills the feature map using neighboring pixels and takes the sub-pixel neighborhood into account. Even so, a large part of the feature detail of small objects can still be lost to background interference during nearest-neighbor upsampling. Finally, the three feature maps of large, medium, and small sizes, now with the same dimensions, are each convolved once and then concatenated in the channel dimension, as follows:

$$F_{TFE} = \text{Concat}(F_{l}, F_{m}, F_{s}) \tag{3}$$

where $F_{TFE}$ denotes the feature maps output by the TFE module, and $F_{l}$, $F_{m}$, and $F_{s}$ denote the large-, medium-, and small-sized feature maps, respectively. $F_{TFE}$ results from the concatenation of $F_{l}$, $F_{m}$, and $F_{s}$, and it has the same resolution as $F_{m}$ and three times its number of channels.
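
The resize-and-concatenate flow of the TFE module described above can be sketched in plain Python as follows. Feature maps are nested lists of shape [channel][height][width]; the learned convolutions are omitted, the large map is assumed to be exactly twice the medium size and the small map half of it, and the max-pooled and average-pooled maps are summed, which is one plausible reading of the "max pooling + average pooling" hybrid rather than a confirmed detail.

```python
def pool2x2(fmap):
    """Downsample a 2D map by 2: max pooling plus average pooling, summed (assumed combination)."""
    out = []
    for y in range(0, len(fmap), 2):
        row = []
        for x in range(0, len(fmap[0]), 2):
            window = [fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1]]
            row.append(max(window) + sum(window) / 4.0)
        out.append(row)
    return out


def upsample2x(fmap):
    """Nearest-neighbor upsampling of a 2D map by a factor of 2."""
    return [[row[x // 2] for x in range(len(row) * 2)] for row in fmap for _ in range(2)]


def triple_feature_encode(large, medium, small):
    """Bring all inputs to the medium resolution, then concatenate along the channel axis."""
    large_down = [pool2x2(channel) for channel in large]
    small_up = [upsample2x(channel) for channel in small]
    return large_down + medium + small_up


large = [[[1.0] * 4 for _ in range(4)]]   # 1 channel, 4 x 4
medium = [[[5.0, 6.0], [7.0, 8.0]]]       # 1 channel, 2 x 2
small = [[[9.0]]]                         # 1 channel, 1 x 1
fused = triple_feature_encode(large, medium, small)
```

With one channel per input, the fused output has three channels at the medium resolution, matching the statement that the output has the same resolution as the medium map and three times its channel count.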

Fig. 5. The structure of the CPAM module. It contains channel and position attention networks. $w$ and $h$ represent the width and height, respectively. $\odot$ denotes the Hadamard product operation.

3.4. Channel and position attention mechanism

To extract the representative feature information contained in different channels, we propose CPAM to integrate the detailed and multi-scale feature information from both SSFF and TFE. The structure of CPAM is shown in Fig. 5. It consists of a channel attention network that receives input from TFE (Input 1), and a position attention network that receives input from the superposition of the channel attention network output and SSFF (Input 2).

Input 1 for the channel attention network is the feature map after PANet, which contains the detailed features of TFE. The channel attention block in SENet [29] first applies global average pooling to each channel independently and then uses two fully connected layers with a sigmoid non-linearity to generate the channel weights. The two fully connected layers aim to capture non-linear cross-channel interactions and involve dimensionality reduction to control model complexity, but dimensionality reduction has side effects on channel attention prediction, and capturing the dependencies across all channels is inefficient and unnecessary. We introduce an attention mechanism without dimensionality reduction to capture cross-channel interactions effectively. After channel-wise global average pooling without dimensionality reduction, local cross-channel interaction is captured by considering each channel and its $k$ nearest neighbors. This is implemented using a 1D convolution of size $k$, where the kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors take part in the attention prediction of one channel. To obtain the optimal coverage, one could resort to manual tuning of $k$ for different network architectures and different numbers of convolution modules, which is tedious. Because the convolution kernel size $k$ is proportional to the channel dimension $C$, and the channel dimension is usually a power of 2, $C$ can be defined in terms of $k$ as

$$C = \psi(k) = 2^{(\gamma \times k - b)} \tag{4}$$

where $\gamma$ and $b$ are the scaling parameters that control the ratio of the convolution kernel size $k$ to the channel dimension $C$. Conversely,

$$k = \psi^{-1}(C) = \left| \frac{\log_{2}(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd} \tag{5}$$

where $|\cdot|_{odd}$ denotes the nearest odd number. The value of $\gamma$ is set to 2 and $b$ is set to 1. According to this non-linear mapping, a larger channel dimension yields longer-range cross-channel interaction, and a smaller channel dimension yields shorter-range interaction. The channel attention mechanism can therefore mine the features of multiple channels more deeply.
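
The adaptive kernel size described above, with the two scaling parameters set to 2 and 1, can be computed with a few lines of Python. The sketch follows the convention used in ECA-style channel attention, where the value is truncated to an integer and bumped to the next odd number when it is even; that rounding rule, the parameter names `gamma` and `b`, and the fixed example kernel weights are assumptions for illustration, since in a trained network the 1D convolution weights are learned.

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Odd 1D-convolution kernel size derived from the channel dimension."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


def channel_weights(pooled, kernel):
    """Zero-padded 1D convolution across neighboring channels, followed by a sigmoid."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(pooled) + [0.0] * pad
    return [
        1.0 / (1.0 + math.exp(-sum(kernel[j] * padded[i + j] for j in range(k))))
        for i in range(len(pooled))
    ]


sizes = {c: adaptive_kernel_size(c) for c in (64, 128, 256, 512)}
# Per-channel global averages (toy values) and an illustrative size-3 kernel.
weights = channel_weights([0.0, 0.0, 0.0, 0.0], kernel=[1.0, 1.0, 1.0])
```

Because each weight depends only on a channel and its nearest neighbors, the parameter cost is the kernel size itself rather than a pair of fully connected layers.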
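
Combining the output of the channel attention mechanism with the features from SSFF (Input 2) as the input to the position attention network provides complementary information for extracting the critical position information of each cell. In contrast to the channel attention mechanism, the position attention mechanism first splits the input feature map into two parts along its width and height, which are processed separately to encode features along the two axes ($p_{w}$ and $p_{h}$), and are finally merged to generate the output.

More precisely, the input feature map is pooled along both the horizontal ($p_{w}$) and vertical ($p_{h}$) axes to retain the spatial structure information of the feature map, which can be calculated as follows:

$$p_{w} = \frac{1}{H} \sum_{0 \le j < H} E(w, j) \tag{6}$$

$$p_{h} = \frac{1}{W} \sum_{0 \le i < W} E(i, h) \tag{7}$$

where $W$ and $H$ are the width and height of the input feature map, respectively, and $E(w, j)$ and $E(i, h)$ are the values at position $(i, j)$ of the input feature map.

When generating the position attention coordinates, concatenation and convolution operations are applied to the horizontal and vertical axes:

$$P(a_{w}, a_{h}) = \text{Conv}[\text{Concat}(p_{w}, p_{h})] \tag{8}$$

where $P(a_{w}, a_{h})$ denotes the output of the position attention coordinates, Conv denotes a $1 \times 1$ convolution, and Concat denotes concatenation.

When splitting the attention features, pairs of position-dependent feature maps are generated as follows:

$$s_{w} = \text{Split}(a_{w}) \tag{9}$$

$$s_{h} = \text{Split}(a_{h}) \tag{10}$$

where $s_{w}$ and $s_{h}$ are the width and height of the split output, respectively.

The final output of CPAM is defined by:

$$F_{CPAM} = E \times s_{w} \times s_{h} \tag{11}$$

where $E$ represents the weight matrix of the channel and position attention.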
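
The pool, concatenate, split, and reweight sequence of the position attention network described above can be followed on a single-channel toy map. In this sketch the learned convolution is replaced by a plain sigmoid gate, which is a deliberate simplification and an assumption; the function names are hypothetical and the code only illustrates how the two axis descriptors reweight each position.

```python
import math


def axis_pool(fmap):
    """Average over the height for each column, and over the width for each row."""
    height, width = len(fmap), len(fmap[0])
    pooled_w = [sum(fmap[j][i] for j in range(height)) / height for i in range(width)]
    pooled_h = [sum(row) / width for row in fmap]
    return pooled_w, pooled_h


def position_attention(fmap):
    """Reweight each position by the gated descriptors of its column and its row."""
    pooled_w, pooled_h = axis_pool(fmap)
    joint = pooled_w + pooled_h                             # concatenate both axes
    gated = [1.0 / (1.0 + math.exp(-v)) for v in joint]     # stand-in for the learned convolution
    s_w, s_h = gated[:len(pooled_w)], gated[len(pooled_w):]  # split back per axis
    return [
        [fmap[y][x] * s_w[x] * s_h[y] for x in range(len(fmap[0]))]
        for y in range(len(fmap))
    ]


feature_map = [[1.0, 3.0], [5.0, 7.0]]
pooled_w, pooled_h = axis_pool(feature_map)
attended = position_attention(feature_map)
```

Pooling along one axis at a time keeps the coordinate along the other axis, which is why this form of attention can preserve where in the image a small object lies.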

3.5. Anchor box optimization

By optimizing the loss function and non-maximum suppression (NMS), the anchor boxes of the three detection heads are improved to better segment cell instances of different sizes.

Intersection over Union (IoU) is commonly used as the anchor box loss function; it determines convergence by computing the degree of overlap between the labeled bounding box and the predicted box. However, the classical IoU loss cannot reflect the distance and overlap between the object box and the anchor box. To address these issues, GIoU [47], DIoU, and CIoU [48] were proposed. CIoU, which is used by YOLOv5 and YOLOv8, introduces an influence factor on top of the DIoU loss. In addition to the overlap area and the distance between center points, it also accounts for the effect of the width-to-height ratio (i.e., the aspect ratio) of the labeled box and the predicted box on the loss. However, it reflects only the difference in aspect ratio rather than the true relationship between the widths and heights of the labeled and predicted boxes. EIoU [14] minimizes the difference in width and height between the object box and the anchor box, which can improve the localization of small objects. The EIoU loss can be divided into three parts: the IoU loss $L_{IoU}$, the distance loss $L_{dis}$, and the aspect loss $L_{asp}$, formulated as follows:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^{2}(b, b_{gt})}{(w_{c})^{2} + (h_{c})^{2}} + \frac{\rho^{2}(w, w_{gt})}{(w_{c})^{2}} + \frac{\rho^{2}(h, h_{gt})}{(h_{c})^{2}} \tag{12}$$

where $\rho(\cdot) = \lVert b - b_{gt} \rVert_{2}$ denotes the Euclidean distance, and $b$ and $b_{gt}$ denote the central points of $B$ and $B_{gt}$, respectively; $b_{gt}$, $w_{gt}$, and $h_{gt}$ are the center point, width, and height of the ground truth; and $w_{c}$ and $h_{c}$ denote the width and height of the smallest enclosing box that covers the two boxes. Compared with CIoU, EIoU not only speeds up the convergence of the predicted box but also improves regression accuracy. We therefore choose EIoU to replace CIoU in the head.
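
To make the three parts of the EIoU loss concrete, the sketch below evaluates it for a single pair of axis-aligned boxes in plain Python. Boxes are given as (x1, y1, x2, y2) corners, which is an assumed convention, and the code assumes boxes with positive width and height; it is a worked example of the formula from [14], not the training code of the paper.

```python
def eiou_loss(pred, gt):
    """EIoU loss for two boxes in (x1, y1, x2, y2) format: IoU, distance, and aspect terms."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # IoU term
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / union

    # Smallest enclosing box covering both boxes
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Distance term: squared center distance over squared enclosing-box diagonal
    dx = (px1 + px2) / 2.0 - (gx1 + gx2) / 2.0
    dy = (py1 + py2) / 2.0 - (gy1 + gy2) / 2.0
    loss_dis = (dx * dx + dy * dy) / (cw * cw + ch * ch)

    # Aspect term: width and height differences penalized separately
    loss_asp = (pw - gw) ** 2 / (cw * cw) + (ph - gh) ** 2 / (ch * ch)

    return (1.0 - iou) + loss_dis + loss_asp


same = eiou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
shifted = eiou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))   # 6/7 + 1/9
wider = eiou_loss((0.0, 0.0, 4.0, 2.0), (0.0, 0.0, 2.0, 2.0))
```

In the `wider` case the centers and the height match closely, so most of the penalty comes from the width term, which is the direct width and height supervision that distinguishes EIoU from an aspect-ratio penalty.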

Detection models output multiple detection boxes at the same time, and many high-confidence boxes often surround a real object, so duplicate anchor boxes have to be eliminated. The principle of the classical NMS algorithm is to keep the local maximum: if the IoU between the current bounding box and the highest-scoring detection box is greater than the threshold, the score of the current box is set directly to zero. To overcome the errors caused by classical NMS, we adopt Soft-NMS, which uses a Gaussian function as a weight function to reduce the score of the predicted box and replace the original score, instead of setting it directly to zero, and thereby modifies the rule for removing bounding boxes.
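
A minimal pure-Python sketch of Gaussian Soft-NMS follows, to show how score decay replaces hard removal. The values `sigma=0.5` and `score_threshold=0.001` are illustrative defaults chosen for the example, not settings reported in this paper, and the helper names are hypothetical.

```python
import math


def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of zeroing them."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        rescored = []
        for other, other_score in remaining:
            decayed = other_score * math.exp(-(box_iou(box, other) ** 2) / sigma)
            if decayed >= score_threshold:
                rescored.append((other, decayed))
        remaining = rescored
    return kept


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

Here the second box overlaps the first heavily, so its score drops below that of the distant third box but stays above zero; with classical NMS and a typical threshold it would have been discarded, which is the failure mode for touching or overlapping cells.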

4. Experiments

4.1. Datasets

We evaluated the performance of the proposed ASF-YOLO model on two cell image datasets: the DSB2018 dataset and the BCC dataset. The 2018 Data Science Bowl (DSB2018) dataset [50] contains 670 cell nucleus images with segmented masks, and was designed to evaluate how well an algorithm generalizes across variations in cell type, magnification, and imaging modality (brightfield vs. fluorescence). Each mask contains one nucleus, and masks do not overlap (no pixel belongs to two masks). The dataset was randomly split into a training set and a test set at a ratio of 8:2, giving 536 training images and 134 test images.
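
As a quick check of the 8:2 split sizes, the sketch below performs a seeded random split in plain Python. The seed and the helper name `split_dataset` are assumptions for reproducibility of the example; the paper does not state how the random split was generated.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of the items with a fixed seed and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


image_ids = list(range(670))            # stand-ins for the 670 DSB2018 images
train_ids, test_ids = split_dataset(image_ids)
```

Fixing the seed keeps the same images in the test set across runs, which matters when several models are compared on a dataset this small.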

The Breast Cancer Cell (BCC) dataset [51] was collected from the Center for Bio-Image Informatics, University of California, Santa Barbara (UCSB CBI). It includes 160 hematoxylin and eosin-stained histopathological images used for breast cancer cell detection, with the associated ground truth data. The dataset was randomly split into 128 images (80%) as the training set and 32 images (20%) as the test set.

Table 1
Performance comparison of different models for cell instance segmentation on the DSB2018 dataset. Best results are in bold.

Model	Param (M)	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95	FPS
Mask R-CNN [2]	43.75	0.774	0.519	0.782	0.525	20
Cascade Mask R-CNN [3]	69.17	–	–	0.783	0.533	17.9
SOLO [4]	–	–	–	0.642	0.398	25
SOLOv2 [5]	–	–	–	0.741	0.495	28.7
YOLACT [6]	34.3	0.703	0.456	0.683	0.440	25
Mask RCNN Swin T [52]	47.3	0.784	0.524	0.783	0.527	24
YOLOv5l-seg [8]	45.27	0.876	0.616	0.855	0.502	46.9
YOLOv8l-seg [9]	45.91	0.865	0.631	0.866	0.562	45.5
ASF-YOLO (ours)	46.18	0.910	0.676	0.887	0.558	47.3

Table 2
Performance comparison of different models for cell instance segmentation on the BCC dataset. Best results are in bold.

Model	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
Mask R-CNN [2]	0.852	0.614	0.836	0.628
Cascade Mask R-CNN [3]	0.836	0.630	0.823	0.598
SOLO [4]	–	–	0.864	0.647
SOLOv2 [5]	–	–	0.860	0.651
YOLACT [6]	0.715	0.545	0.774	0.565
Mask RCNN Swin T [52]	0.841	0.604	0.806	0.588
YOLOv5l-seg [8]	0.892	0.703	0.877	0.672
YOLOv8l-seg [9]	0.850	0.619	0.814	0.564
ASF-YOLO (ours)	0.911	0.737	0.898	0.645

4.2. Implementation details

The experiments were run on an NVIDIA GeForce 3090 (24G) GPU with PyTorch 1.10, Python 3.7, and CUDA 11.3. We initialized the model with weights pre-trained on the COCO dataset. The input image size is […]. The training batch size is 16, and training lasts 100 epochs. We used stochastic gradient descent (SGD) as the optimizer to train the model. The SGD hyperparameters were set to a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.0005.
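
To show how the three SGD hyperparameters above interact, the sketch below applies one update to a single scalar parameter in plain Python. It assumes the common formulation in which weight decay is added to the gradient before the momentum buffer is updated (the form used by PyTorch's SGD without dampening or Nesterov momentum); the function name and the toy gradient are illustrative.

```python
def sgd_step(param, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay for a scalar parameter."""
    grad = grad + weight_decay * param       # weight decay acts as an extra gradient term
    velocity = momentum * velocity + grad    # momentum buffer accumulates past gradients
    return param - lr * velocity, velocity


param, velocity = 1.0, 0.0
param, velocity = sgd_step(param, grad=0.5, velocity=velocity)
```

With a learning rate of 0.001 each step is small, and the momentum of 0.9 lets consistent gradients build up over many steps while smoothing out noisy ones.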

4.3. Quantitative results

Table 1 compares the performance on the DSB2018 dataset of the proposed ASF-YOLO with other classical and state-of-the-art methods, including Mask R-CNN [2], Cascade Mask R-CNN [3], SOLO [4], SOLOv2 [5], YOLACT [6], Mask R-CNN with a Swin Transformer backbone (Mask RCNN Swin T) [52], YOLOv5l-seg v7.0 [8], and YOLOv8l-seg [9].

Our model, with 46.18M parameters, achieves the best accuracy, with a box mAP of 0.91 and a mask mAP of 0.887, and its inference speed reaches 47.3 frames per second (FPS), which is the best performance. Because of its image input size […], the accuracy and speed of Mask R-CNN with the Swin Transformer backbone are not high. Our model also outperforms the classical one-stage algorithms SOLO and YOLACT.

The proposed model also achieves the best instance segmentation performance on the BCC dataset, as shown in Table 2. The experimental validation demonstrates the ability of ASF-YOLO to generalize to different datasets with diverse cell types.

Fig. 6. Qualitative comparison of different instance segmentation models on the DSB2018 dataset.

Table 3
Ablation study of the main components of the ASF-YOLO model on the DSB2018 dataset.

Soft-NMS	EIoU	TFE	SSFF	CPAM	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
					0.876	0.616	0.855	0.502
					0.881	0.622	0.856	0.523
					0.880	0.634	0.852	0.507
					0.891	0.653	0.867	0.542
					0.896	0.653	0.876	0.549
					0.902	0.656	0.874	0.543
					0.910	0.676	0.887	0.558

4.4. Qualitative results

Fig. 6 provides a visual comparison of cell segmentation by different methods on sample images from the DSB2018 dataset. By using the TFE module to improve small object detection, ASF-YOLO achieves good recall on single-channel cell images that contain dense and small objects. By using the SSFF module to enhance multi-scale feature extraction, ASF-YOLO also gives good segmentation accuracy on large-sized cell images with complex backgrounds. This indicates that our method generalizes well to diverse cell types. In Fig. 6 (a) and (b), every model gives good results because the cell images are relatively simple. In Fig. 6 (c) and (d), Mask R-CNN has a high false detection rate because of the design principle of the two-stage algorithm, SOLO has many missed detections, and YOLOv5l-seg fails to segment cells with blurred boundaries.

4.5. Ablation study

We conduct a series of extensive ablation studies on the proposed ASF-YOLO model.

Table 4
Effect of different attention mechanisms.

Method	Box mAP50	Mask mAP50	Param (K)	FLOPs (M)
ASF-YOLO without attention (baseline)	0.902	0.874	0	0
+ SENet [29]	0.901	0.879	+8.19	+1.65
+ CBAM [35]	0.905	0.884	+16.48	+3.94
+ CA [53]	0.903	0.8881	+12.32	+1.32
+ CPAM (ours)	0.910	0.888	+12.23	+2.96

4.5.1. Effect of the proposed methods

Table 3 shows the contribution of each proposed module to the improvement in segmentation performance. Using Soft-NMS in YOLOv5l-seg overcomes the problem of erroneous suppression caused by mutual occlusion when detecting dense, small cells, and provides a performance gain. The EIoU loss function improves the bounding boxes of small objects, which improves […] by […]. The SSFF, TFE, and CPAM modules effectively improve model performance by addressing small object instance segmentation in cell images.

4.5.2. Effect of attention mechanisms

Compared with the channel attention of SENet, the channel and spatial attention of CBAM, and the spatial attention of Coordinate Attention (CA) [53], the proposed CPAM attention mechanism provides better performance despite a slight increase in computation and parameters, as shown in Table 4.

Fig. 7. Qualitative comparison of different attention mechanisms for instance segmentation on the DSB2018 dataset.

Table 5
Effect of different convolution modules in the backbone of the proposed ASF-YOLO model. Best results are in bold.

Dataset	Module	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
DSB2018	C3	0.910	0.676	0.887	0.558
DSB2018	C2f	0.867	0.633	0.859	0.558
BCC	C3	0.911	0.737	0.898	0.645
BCC	C2f	0.855	0.619	0.835	0.570

Fig. 7 visualizes the segmentation results obtained with different attention modules in the ASF-YOLO model. The proposed CPAM captures better channel and spatial feature information and extracts richer features from the original images.

4.5.3. Effect of the convolution module in the backbone

To show why we chose the YOLOv5 backbone, we compare the results of the improved neck with different YOLO backbones. Table 5 shows that when the C3 modules of YOLOv5 are replaced with the C2f modules of YOLOv8 in the backbone of the proposed model, performance decreases on both datasets.

5. Conclusion

We developed ASF-YOLO, an accurate and fast instance segmentation model for cell image analysis, which fuses spatial and scale features for the detection and segmentation of cell images. We introduced several novel modules into the YOLO framework. The SSFF and TFE modules enhance multi-scale and small object instance segmentation performance, and the channel and position attention mechanism extracts further feature information from the two modules. Extensive experimental results show that the proposed model can handle instance segmentation tasks for various cell images and substantially improves the accuracy of the original YOLO models on cell segmentation, where the objects are small and dense. Our method significantly outperforms state-of-the-art methods in both accuracy and inference speed for cell instance segmentation. Because the datasets used in this article are small, the generalization performance of the model needs further improvement. In addition, the effectiveness of each module in ASF-YOLO is discussed in the ablation study.

The proposed ASF-YOLO model achieves a good balance between detection accuracy and computational speed, although the CPAM attention mechanism may slightly increase the computational effort. Nevertheless, there is still room to improve overall detection accuracy while preserving the segmentation efficiency of the model, which is crucial for practical application in clinical settings. To this end, future work will extend the feature extractor to incorporate a pyramidal convolution structure, deformable convolution, and non-local mechanisms as in [54] to enlarge the receptive field within the bottleneck block. In addition, dilated convolution [55] could be used to improve the ability of CNNs to capture global contextual information without increasing the computational effort or the number of parameters. Transfer learning could also be performed on the backbone network to extract histological features, as inspired by [56]. Recent advances in Transformers could likewise be adopted to improve the attention mechanism of the proposed model.

CRediT authorship contribution statement

Ming Kang: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Investigation, Formal analysis, Conceptualization. Chee-Ming Ting: Writing – review & editing, Validation, Supervision, Project administration, Funding acquisition, Conceptualization. Fung Fung Ting: Writing – review & editing, Validation, Supervision. Raphaël C.-W. Phan: Writing – review & editing, Validation, Supervision.

Declaration of competing interest

The authors have no competing interests to declare.

Data availability

We have shared the link to our data in the reference section.

Acknowledgments

This work was supported by Monash University Malaysia and the Ministry of Higher Education, Malaysia under the Fundamental Research Grant Scheme FRGS/1/2023/ICT02/MUSM/02/1.

References

[1] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580-587.
[2] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2980-2988.
[3] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6154-6162.
[4] X. Wang, T. Kong, C. Shen, Y. Jiang, L. Li, SOLO: segmenting objects by locations, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Part XVIII, 2020, pp. 649-665.
[5] X. Wang, R. Zhang, T. Kong, L. Li, C. Shen, SOLOv2: dynamic and fast instance segmentation, in: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 17721-17732.
[6] D. Bolya, C. Zhou, F. Xiao, Y.J. Lee, YOLACT: real-time instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9158-9166.
[7] E. Mohamed, A. Shaker, A. El-Sallab, M. Hadhoud, INSTA-YOLO: real-time instance segmentation, arXiv:2102.0677, 2021.
[8] G. Jocher, YOLO by Ultralytics (version 5.7.0), GitHub, 2022. https://github.com/ultralytics/yolov5.
[9] G. Jocher, A. Chaurasia, J. Qiu, YOLO by Ultralytics (version 8.0.0), GitHub, 2023. https://github.com/ultralytics/ultralytics.
[10] C.-Y. Wang, H.-Y.M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, CSPNet: A new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 1571-1580.
[11] Ultralytics, Ultralytics YOLOv5 Architecture, Ultralytics, 2023. https://docs.ultralytics.com/yolov5/tutorials/architecture_description.
[12] A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv:2004.10934, 2020.
[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2117-2125.
[14] Y.-F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang, T. Tan, Focal and efficient IOU loss for accurate bounding box regression, Neurocomputing 506 (2022) 146-157.
[15] N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS-improving object detection with one line of code, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5562-5570.
[16] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8759-8768.
[17] R. Hollandi, N. Moshkov, L. Paavolainen, E. Tasnadi, F. Piccinini, P. Horvath, Nucleus segmentation: towards automated solutions, Trends Cell Biol. 32 (4) (2021) 295-310.
[18] J. Johnson, Adapting mask-RCNN for automatic nucleus segmentation, arXiv:1805.00500, 2018.
[19] H. Jung, B. Lodhi, J. Kang, An automatic nuclei segmentation method based on deep convolutional neural networks for histopathology images, BMC Biomed. Eng. 1 (2019) 24.
[20] S. Fujita, X.-H. Han, Cell detection and segmentation in microscopy images with improved mask R-CNN, in: I. Sato, B. Han (Eds.), Computer Vision – ACCV 2020 Workshops, 2020, pp. 58-70.
[21] B. Bancher, A. Mahbod, I. Ellinger, Improving mask R-CNN for nuclei instance segmentation in hematoxylin & eosin-stained histological images, in: M. Atzori, N. Burlutskiy, F. Ciompi, Z. Li, F. Minhas, H. Müller, et al. (Eds.), Proceedings of the MICCAI Workshop on Computational Pathology, PMLR 156, 2024, pp. 20-35.
[22] J. Yi, P. Wu, M. Jiang, Q. Huang, D.J. Hoeppner, D.N. Metaxas, Attentive neural cell instance segmentation, Med. Image Anal. 55 (2019) 228-240.
[23] Z. Cheng, A. Qu, A fast and accurate algorithm for nuclei instance segmentation in microscopy images, IEEE Access 8 (2020) 158679-158689.
[24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, et al., SSD: Single shot multibox detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Part I, 2016, pp. 21-37.
[25] A. Mahbod, G. Schaefer, G. Dorffner, S. Hatamikia, R. Ecker, I. Ellinger, A dual decoder U-net-based model for nuclei instance segmentation in hematoxylin and eosin-stained histological images, Front. Med. 9 (2022) 978146.
[26] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: N. Navab, J. Hornegger, W.M. Wells, A.F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI, Part III, 2015, pp. 234-241.
[27] T. Konopczyński, R. Heiman, P. Woźnicki, P. Gniewek, M.-C. Duvernoy, O. Hallatschek, et al., Instance segmentation of densely packed cells using a hybrid model of U-Net and mask R-CNN, in: L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, J.M. Zurada (Eds.), Artificial Intelligence and Soft Computing (ICAISC), Part I, 2020, pp. 626-635.
[28] J. Wang, Z. Zhang, M. Wu, Y. Ye, S. Wang, Y. Cao, et al., Improved BlendMask: nuclei instance segmentation for medical microscopy images, IET Image Process. 17 (7) (2023) 2284-2296.
[29] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
[30] Z. Shang, X. Wang, Y. Jiang, Z. Li, J. Ning, Identifying rumen protozoa in microscopic images of ruminant with improved YOLACT instance segmentation, Biosyst. Eng. 215 (2022) 156-169.
[31] Y. Wang, Z. Ouyang, R. Han, Z. Yin, Z. Yang, YOLOMask: Real-time instance segmentation with integrating YOLOv5 and OrienMask, in: Proceedings of the IEEE 22nd International Conference on Communication Technology (ICCT), 2022, pp. 1646-1650.
[32] W. Yang, S. Chen, G. Chen, Q. Shi, PR-YOLO: Improved YOLO for fast protozoa classification and segmentation, Res. Square Preprint (2024), https://doi.org/10.21203/rs.3.rs-3199595/v1.
[33] X. Cao, Y. Su, X. Geng, Y. Wang, YOLO-SF: YOLO for fire segmentation detection, IEEE Access 11 (2023) 111079-111092.
[34] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 7464-7475.
[35] S. Woo, J. Park, J.-Y. Lee, I.S. Kweon, CBAM: Convolutional block attention module, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, Part VII, 2018, pp. 3-19.
[36] O.M. Lawal, YOLOv5-LiNet: a lightweight network for fruits instance segmentation, PLoS One 18 (3) (2023) e0282297.
[37] M. Yasir, L. Zhan, S. Liu, J. Wan, M.S. Hossain, A.T. Isiacik Çolak, et al., Instance segmentation ship detection based on improved Yolov7 using complex background SAR images, Front. Mar. Sci. 10 (2023) 1113669.
[38] H. Liu, W. Xiong, Y. Zhang, YOLO-CORE: contour regression for efficient instance segmentation, Mach. Intell. Res. 20 (2023) 716-728.
[39] J. Hua, T. Hao, L. Zeng, G. Yu, YOLOMask, an instance segmentation algorithm based on complementary fusion network, Math 9 (15) (2021) 1766.
[40] B. Bai, J. Tian, T. Wang, S. Luo, S. Lyu, YUSEG: Yolo and Unet is all you need for cell instance segmentation, in: NeurIPS 2022 Weakly Supervised Cell Segmentation in Multi-modality High-Resolution Microscopy Images, 2022.
[41] T. Lindeberg, Scale-Space Theory in Computer Vision, Springer, Cham, 1994, pp. 10-11.
[42] D.J. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2004) 91-110.
[43] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489-4497.
[44] H.-J. Park, J.-W. Kang, B.-G. Kim, ssFPN: scale sequence ($S^2$) feature-based feature pyramid network for object detection, Sens. 23 (9) (2023) 4432.
[45] O. Rukundo, H. Cao, Nearest neighbor value interpolation, Int. J. Adv. Comput. Sci. Appl. 3 (4) (2012) 25-30.
[46] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Netw. 107 (2018) 3-11.
[47] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, S. Savarese, Generalized intersection over union: a metric and a loss for bounding box regression, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 658-666.
[48] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-IoU loss: faster and better learning for bounding box regression, Proc. AAAI Conf. Artific. Intellig. 34 (07) (2020) 12993-13000.
[49] A. Neubeck, L. Van Gool, Efficient non-maximum suppression, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), 2006, pp. 850-855.
[50] A. Goodman, A. Carpenter, E. Park, J. Lefman, J. BoozAllen, K. Thomas, et al., 2018 Data Science Bowl, Kaggle, 2018. https://kaggle.com/competitions/data-science-bowl-2018.
[51] CBI, Breast Cancer Cell, UCSB, 2008. https://bioimage.ucsb.edu/research/bio-segmentation.
[52] OpenMMLab, Mask-rcnn_swin-t, GitHub, 2022. https://github.com/open-mmlab/mmdetection/tree/main/configs/swin.
[53] Q. Hou, D. Zhou, J. Feng, Coordinate attention for efficient mobile network design, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13708-13717.
[54] P. Wu, H. Li, N. Zeng, F. Li, FMD-Yolo: an efficient face mask detection method for COVID-19 prevention and control in public, Image Vis. Comput. 117 (2022) 104341.
[55] X. Guo, Z. Wang, P. Wu, Y. Li, F.E. Alsaadi, N. Zeng, ELTS-net: an enhanced liver tumor segmentation network with augmented receptive field and global contextual information, Comput. Biol. Med. 169 (2024) 107879.
[56] L. Wang, Y. Jiao, Y. Diao, N. Zeng, R. Yu, A novel approach combined transfer learning and deep learning to predict TMB from histology image, Pattern Recogn. Lett. 135 (2020) 244-248.

*Corresponding author.
E-mail address: ting.cheeming@monash.edu (C.-M. Ting).

Journal: Image and Vision Computing, Volume: 147
DOI: https://doi.org/10.1016/j.imavis.2024.105057
Publication Date: 2024-05-01

ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation

Ming Kang, Chee-Ming Ting*, Fung Fung Ting, Raphaël C.-W. Phan
School of Information Technology, Monash University, Malaysia Campus, Subang Jaya 47500, Malaysia

ARTICLE INFO

Keywords:

Medical image analysis
Small object segmentation
You only look once (YOLO)
Sequence feature fusion
Attention mechanism

Abstract

We propose a novel Attentional Scale Sequence Fusion based You Only Look Once (YOLO) framework (ASF-YOLO) which combines spatial and scale features for accurate and fast cell instance segmentation. Built on the YOLO segmentation framework, we employ the Scale Sequence Feature Fusion (SSFF) module to enhance the multiscale information extraction capability of the network, and the Triple Feature Encoder (TFE) module to fuse feature maps of different scales to increase detailed information. We further introduce a Channel and Position Attention Mechanism (CPAM) to integrate both the SSFF and TFE modules, which focus on informative channels and spatial position-related small objects for improved detection and segmentation performance. Experimental validations on two cell datasets show remarkable segmentation accuracy and speed of the proposed ASF-YOLO model. It achieves a box mAP of 0.91, mask mAP of 0.887, and an inference speed of 47.3 FPS on the 2018 Data Science Bowl dataset, outperforming the state-of-the-art methods. The source code is available at https://github.com/mkang315/ASF-YOLO.

1. Introduction

With the rapid development of sample preparation technology and microscopic imaging technology, quantitative processing and analysis of cell images play an important role in fields such as medicine and cell biology. Based on Convolutional Neural Networks (CNN), the characteristic information of different cell images can be learned through neural network training, which has strong generalization performance. The two-stage R-CNN series [1-3] and its one-stage variants [4, 5] are classical CNN-based frameworks that have been used for segmentation tasks. However, these traditional methods based on CNNs only achieved sub-optimal performance for real-time cell instance segmentation, especially when dealing with dense and small cells.

In recent works, the You Only Look Once (YOLO) series [6-9] have become among the fastest and most accurate models for real-time instance segmentation. Because of the one-stage design idea and the capabilities of feature extraction, YOLO instance segmentation models have better accuracy and speed than two-stage segmentation models. Nevertheless, the performance of YOLO-based models for small object segmentation in medical or histopathology images, such as cell instance segmentation, is largely unexplored. Cell instance segmentation poses more challenges due to the small, dense, and overlapping objects, as well as blurred boundaries of cells, which may result in poor segmentation accuracy. It requires accurate detailed segmentation of different types of objects in cell images. As shown in Fig. 1, different types of cell images have large differences in color, morphology, texture, and other characteristic information due to differences in cell morphology, preparation methods, and imaging technologies. Despite its improved segmentation accuracy and speed for natural images, the architectures of YOLO-based models can be further optimized for handling small objects in medical images, such as cells.

The typical YOLO framework architecture consists of three main parts: backbone, neck, and head. The backbone network of YOLO is a convolutional neural network that extracts image features at different granularities. Cross Stage Partial [10] Darknet with 53 convolutional layers (CSPDarknet53) [11] was modified from YOLOv4 [12] and designed as the backbone network of YOLOv5 [8], which contains C3 (CSP bottleneck including 3 convolutional layers) and ConvBNSiLU modules. The C3 modules are replaced by the C2f (CSP bottleneck including 2 convolutional layers with shortcut) modules in the backbone of YOLOv8 [9], which is the only difference from that of YOLOv5. As shown in Fig. 2, the level 1-5 feature extraction branches P1, P2, P3, P4, and P5 in the backbone of YOLOv5 and YOLOv8 correspond to the outputs of the YOLO network associated with each of these feature maps.

Fig. 1. Different cell images (left) and their feature maps (right).

Fig. 2. An abridged general view of the framework of YOLOv5 v7.0 and YOLOv8 for segmentation task. P1, P2, P3, P4, P5 represent different levels of features output by the backbone. The head part clips the segmentation masks to bind them inside each of the detected bounding boxes, which ensures that the segmentation masks do not flow out of the bounding boxes. The neck part is intentionally abridged due to the different structures between YOLOv5 and YOLOv8.

YOLOv5 and YOLOv8 are the first mainstream YOLO-based architectures that can handle segmentation tasks besides detection and classification. In the feature extraction stage of YOLOv5, the CSPDarknet53 backbone network stacked by multiple C3 modules is used, and then the three effective feature branches P3, P4, and P5 of the backbone network are used as the input of the Feature Pyramid Network (FPN) [13] structure to build a multiscale fusion structure in the neck part. During the decoding process of the feature layer, three heads of different sizes corresponding to the effective feature branches of the backbone network are used for the bounding box prediction of the object. After upsampling the P3 features, pixel-by-pixel decoding is performed as the segmentation mask prediction of the object to complete the instance segmentation of the object. In the segmentation head, three scales of features output three different anchor boxes, and the mask proto module is responsible for outputting the prototype masks, which are processed to get the detection boxes and segmentation masks for instance segmentation tasks.
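
The clipping of segmentation masks to their detected bounding boxes, mentioned in the caption of Fig. 2, can be illustrated with a minimal pure-Python sketch. The function name `crop_mask`, the `(x1, y1, x2, y2)` pixel-box convention, and the half-open box boundaries are assumptions made for illustration only; this is not the Ultralytics implementation.

```python
def crop_mask(mask, box):
    """Zero out every mask value that lies outside the bounding box.

    mask: 2D list indexed as mask[row][col].
    box:  (x1, y1, x2, y2) in pixel coordinates, half-open on the right/bottom
          (an assumed convention for this sketch).
    """
    x1, y1, x2, y2 = box
    return [
        [v if (x1 <= col < x2 and y1 <= row < y2) else 0.0
         for col, v in enumerate(values)]
        for row, values in enumerate(mask)
    ]


# A 4x4 mask that "leaks" over the whole image, clipped to a 2x2 box.
mask = [[1.0] * 4 for _ in range(4)]
cropped = crop_mask(mask, (1, 1, 3, 3))
```

After cropping, only the pixels inside the box keep their mask values, so a predicted mask cannot extend beyond its detection.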

In this paper, we address the challenges of small object segmentation based on an improved YOLO-based model with an application to cell instance segmentation. Our method builds on the YOLOv5 model, leveraging the capability of its backbone in extracting multiscale features from the cell images. Our method further extends the original YOLOv5 architecture to improve its effectiveness in segmenting small objects by incorporating several new modules, especially in the neck part of the model. These include the fusion of multiscale features and the incorporation of attention mechanism.

More precisely, we propose a one-stage instance segmentation model for cell images, which integrates Attentional Scale Sequence Fusion in the YOLO framework (ASF-YOLO). The CSPDarknet53 backbone network is first used to extract multi-dimensional feature information from cell images in the feature extraction stage. We propose several novel network designs in the neck part to enable multiscale feature fusion and attention mechanism. The multiscale fusion aims to improve the model's robustness to the scale variations of small objects from cell images obtained from different conditions. The attention mechanism provides a selective focus on the multiscale features relevant to the small objects. For the detection head, we leverage the EIoU [14] in the training stage to optimize the bounding box location loss by minimizing the difference between the width and height of the bounding box and the anchor box. The use of EIoU can better capture the locations of small objects, compared to CIoU used in YOLOv5 and YOLOv8, which only reflects the difference in aspect ratio, not the true relationship between the width and height of the labeled and predicted boxes. Soft Non-Maximum Suppression (Soft-NMS) [15] is also used in the post-processing stage to better handle densely overlapping cells. The main contributions of this work are summarized as follows:

We design a Scale Sequence Feature Fusion (SSFF) module and a Triple Feature Encoder (TFE) module to fuse the multiscale feature maps extracted from the backbone in a Path Aggregation Network (PANet) [16] structure. The SSFF combines global semantic information of images across different scales by normalizing, upsampling, and concatenating multiscale features into a 3D convolution. Thus, it can effectively handle objects of varying sizes, orientations, and aspect ratios in a scale-space representation to improve object segmentation. The TFE incorporates small, medium, and large-sized feature maps to capture the fine spatial information of small objects across distinctive scales. These overcome the limitations of FPN in YOLOv5, which cannot fully exploit the correlations between the pyramid feature maps via simple sum and concatenation operations and mainly leverages small feature maps.
We then design a Channel and Position Attention Mechanism (CPAM) to integrate feature information from the SSFF and TFE modules. This module allows the model to adaptively adjust its focus on the channels and spatial locations relevant for the small objects at different scales, and hence achieves better instance segmentation than the conventional YOLOv5 architecture with no attention mechanism.
We apply the proposed ASF-YOLO model for challenging instance segmentation tasks of densely overlapping and various cell types. To the best of our knowledge, this is the first work to leverage a YOLO-based model for cell instance segmentation. Evaluation on two benchmarking cell datasets shows superior detection accuracy and speed compared to other state-of-the-art methods, including CNN-based models previously used for cell segmentation and several recent YOLO-based models.

Fig. 3. The overview of the proposed ASF-YOLO model. The framework is mainly comprised of the Scale Sequence Feature Fusion (SSFF) module, the Triple Feature Encoder (TFE) module, and the Channel and Position Attention Model (CPAM) based on the CSPDarkNet backbone and the YOLO head. CSP and Concat modules come from YOLOv5.

2.1. Cell instance segmentation

Cell instance segmentation can further help complete the cell counting task in the image while semantic segmentation of cell images cannot. Deep learning approaches have increased the accuracy of automated nucleus segmentation [17]. Johnson et al. [18], Jung et al. [19], Fujita et al. [20] and Bancher et al. [21] proposed improved methods for simultaneous detection and segmentation of cells based on Mask R-CNN [2]. Yi et al. [22] and Cheng et al. [23] utilized a Single Shot MultiBox Detector (SSD) [24] method to detect and segment neural cell instances. Mahbod et al. [25] employed a model based on the semantic segmentation algorithm U-Net [26] for cell nuclei segmentation. The hybrid model SSD and U-Net with attention mechanisms [19] or U-Net and Mask R-CNN [27] achieved some boost in performance on the cell instance segmentation datasets. BlendMask [28] is a nuclei instance segmentation framework with a dilated convolution aggregation module and a context information aggregation module. Mask R-CNN is a two-stage object segmentation framework whose speed is slow. SSD, U-Net, and BlendMask are unified end-to-end (i.e., one-stage) frameworks but have poor performance in segmenting dense and small cells. Traditional CNN-based methods for cell instance segmentation no longer meet the needs of real-time detection and segmentation.

2.2. Improved YOLO for instance segmentation

Recent improvements of YOLO, for the instance segmentation task, focus on attention mechanisms, improved backbone or networks, and loss functions. Squeeze-and-Excitation (SENet) [29] block was integrated into an improved YOLACT [6] for identifying rumen protozoa in microscopic images [30]. YOLOMask [31], PR-YOLO [32] and YOLO-SF [33] augmented YOLOv5 [8] and YOLOv7-Tiny [34] with Convolutional Block Attention Module (CBAM) [35]. Effective feature extraction modules were added to the improved backbone network to make the process of YOLO feature extraction more efficient [36,37]. YOLO-CORE [38] enhanced the mask of the instance efficiently by explicit and direct contour regression using a designed multi-order constraint consisting of a polar distance loss and a sector loss. In addition, the hybrid models YOLOMask [39] and YUSEG [40] combined optimized YOLOv4 [12] and the original YOLOv5s with semantic segmentation U-Net network to ensure the accuracy of instance segmentation. To the best of our knowledge, these improved YOLO architectures, originally designed for instance segmentation for natural images, have not been applied for cell instance segmentation, which is more challenging due to the small and densely overlapping cells.

3. The proposed ASF-YOLO model

3.1. Overall architecture

Fig. 3 shows the overview of the proposed ASF-YOLO framework that combines spatial and multiscale features for cell image instance segmentation. We develop a novel feature fusion network architecture consisting of two main component networks that can provide complementary information for small object segmentation: (1) SSFF module, which combines global or high-level semantic information from multiple scales of images, and (2) TFE module, which can capture local fine details of small objects. The integration of both local and global feature information can produce a more accurate segmentation map. We perform a fusion of output features of P3, P4, and P5 extracted from the backbone network. First, the SSFF module is designed to effectively fuse the feature maps of P3, P4, and P5 which captures the different spatial scales covering a variety of sizes and shapes of different cell types. In SSFF, the P3, P4, and P5 feature maps are normalized to the same size, upsampled, and then stacked together as input to a three-dimensional (3D) convolution to combine multiscale features. Secondly, the TFE module is developed to enhance small object detection for dense cells, by splicing features of three different sizes (large, medium, and small) in the spatial dimension to capture detailed information about small objects. The detailed feature information of the TFE module is then integrated into each feature branch through the PANet structure, which is then combined with the multiscale information of the SSFF module into the P3 branch. We further introduce the Channel and Position Attention Mechanism (CPAM) in the P3 branch to leverage both the high-level multiscale features and the detailed features. The channel and position attention mechanism in the CPAM can respectively capture informative channels and refine spatial localization related to small objects like cells, thus enhancing its detection and segmentation accuracy.

Fig. 4. The structure of TFE module. $C$ represents the number of channels, and $S$ represents the feature map size. Each triple feature encoder module uses three feature maps of different sizes as input.

3.2. Scale sequence feature fusion module

For the multiscale problem of cell images, feature pyramid structures are used for feature fusion in the existing literature, in which only sum or concatenation is employed to fuse the pyramid features. However, the structures of various feature pyramid networks cannot effectively exploit the correlation between all pyramid feature maps. We propose a novel SSFF module that can better combine the multiscale feature maps, i.e., the high-level information of deep feature maps with the detailed information of shallow feature maps, which have the same aspect ratio.

We further construct the sequential representations of the multiscale feature maps generated from the backbone (i.e., P3, P4, and P5) that capture the image content at different levels of detail or scale. The feature maps P3, P4, and P5 are first convolved with a series of Gaussian kernels of increasing standard deviation [41-43], as follows:

$$F_{\sigma}(i, j) = \sum_{u} \sum_{v} f(i - u, j - v) \times G_{\sigma}(u, v),$$

$$G_{\sigma}(x, y) = \frac{1}{2 \pi \sigma^{2}} e^{-\left(x^{2} + y^{2}\right) / 2 \sigma^{2}},$$

where $f$ represents a two-dimensional (2D) feature map and $F_{\sigma}$ is generated by smoothing with a series of convolutions using a 2D Gaussian filter with increasing standard deviation $\sigma$.
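
As a rough illustration of this scale-space construction, the following pure-Python sketch smooths a single 2D feature map with Gaussian kernels of increasing standard deviation. The kernel radius, the zero padding at the borders, the discrete renormalization of the truncated kernel (in place of the analytic normalizing constant), and the names `gaussian_kernel` and `smooth` are all assumptions of this sketch rather than details given in the paper.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [
        [math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def smooth(feature, sigma, radius=2):
    """Convolve a 2D feature map with a Gaussian kernel (zero padding)."""
    h, w = len(feature), len(feature[0])
    k = gaussian_kernel(sigma, radius)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(-radius, radius + 1):
                for v in range(-radius, radius + 1):
                    ii, jj = i - u, j - v
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += feature[ii][jj] * k[u + radius][v + radius]
            out[i][j] = acc
    return out


# A single bright pixel is spread out more strongly as sigma grows.
feature = [[0.0] * 7 for _ in range(7)]
feature[3][3] = 1.0
scale_sequence = [smooth(feature, s) for s in (0.5, 1.0, 2.0)]
peaks = [m[3][3] for m in scale_sequence]
```

The central response in `peaks` shrinks as the standard deviation increases, which is the progressive loss of fine detail that a scale sequence is meant to capture.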

Then, we stack these feature maps of different scales horizontally and use 3D convolution to extract their scale sequence features, inspired by 2D and 3D convolution operations on the multiple video frames [44]. Since the output feature maps from the above Gaussian smoothing have different resolutions, the nearest neighbor interpolation method is used to align all feature maps to the same resolution as the P3. Because the high-resolution feature map level P3 contains most of the information crucial for the detection and segmentation of small objects, the SSFF module is designed based on the P3 level. As shown in Fig. 3, the proposed SSFF module consists of the following components:

A convolution is used to change the number of channels of the P4 and P5 feature levels to 256.
Nearest neighbor interpolation method [45] is used to adjust their size to the size of the P3 level.
The unsqueeze method is used to increase the dimension of each feature layer, changing it from a 3D tensor [height, width, channel] to a 4D tensor [depth, height, width, channel].
The 4D feature maps are then concatenated along the depth dimension to form a 3D feature map for subsequent convolutions.
Finally, 3D convolution, 3D batch normalization, and SiLU [46] activation function are used to complete scale sequence feature extraction.
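
The resizing and stacking steps listed above can be sketched in pure Python for single-channel maps. This is only an illustrative sketch: the channel-adjusting convolution and the final 3D convolution, batch normalization, and SiLU are omitted, and the names `nearest_upsample` and `build_scale_sequence` are hypothetical.

```python
def nearest_upsample(fmap, out_h, out_w):
    """Resize a 2D map with nearest neighbor interpolation."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def build_scale_sequence(p3, p4, p5):
    """Align P4 and P5 to the P3 resolution and stack along a new depth axis.

    Returns a nested list indexed as volume[depth][row][col].
    """
    h, w = len(p3), len(p3[0])
    return [p3, nearest_upsample(p4, h, w), nearest_upsample(p5, h, w)]


# Toy single-channel maps at three resolutions (8x8, 4x4, 2x2).
p3 = [[0.0] * 8 for _ in range(8)]
p4 = [[1.0] * 4 for _ in range(4)]
p5 = [[1.0, 2.0], [3.0, 4.0]]
volume = build_scale_sequence(p3, p4, p5)
small_demo = nearest_upsample(p5, 4, 4)
```

The resulting `volume` has depth 3 with every slice at the P3 resolution, which is the shape a 3D convolution over the scale axis would consume.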

3.3. Triple feature encoding module

To identify densely overlapping small objects, one can reference and compare shape or appearance changes at different scales by enlarging the image. Since different feature layers of the backbone network have different sizes, the conventional FPN fusion mechanism only upsamples the small-sized feature map and then splits or adds it to the features of the previous layer, ignoring the rich detailed information of the larger-sized feature layer. Therefore, we propose the TFE module, which splits large, medium, and small features, adds large-size feature maps, and performs feature amplification to improve detailed feature information.

Fig. 4 illustrates the structure of the TFE module. Before feature encoding, the number of feature channels is first adjusted so that it is consistent with the main scale characteristics. After the large-size feature map (Large) is processed by the convolution module, its channel number is adjusted to $1C$. Then, a hybrid structure of maximum pooling + average pooling is used for downsampling, which helps to reduce the spatial dimensions of features and to achieve translation invariance, enhancing the network’s robustness to spatial variations and translations of the input images. For small-size feature maps (Small), the convolution module is also used to adjust the number of channels, and then the nearest neighbor interpolation method is used for upsampling. This helps to retain the local features and prevents the loss of small object feature information because the nearest-neighbor interpolation technique can populate the feature map by utilizing neighboring pixels and accounts for sub-pixel neighborhoods. Additionally, when employing nearest-neighbor interpolation for upsampling, a significant portion of the feature details pertaining to small objects tend to be lost due to background interference. Finally, the three feature maps of large, medium, and small sizes with the same dimensions are convolved once and then spliced in the channel dimension, as follows:

$$F_{TFE} = \mathrm{Concat}\left(F_{l}, F_{m}, F_{s}\right),$$

where $F_{TFE}$ denotes the feature maps of the TFE module output. $F_{l}$, $F_{m}$, and $F_{s}$ denote large, medium, and small-size feature maps, respectively. $F_{TFE}$ results from the concatenation of $F_{l}$, $F_{m}$, and $F_{s}$. $F_{TFE}$ has the same resolution as and three times the channel number of $F_{m}$.
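
A minimal pure-Python sketch of this encoding follows, with each feature map given as a list of 2D channels. Several points are assumptions for illustration: the max-pooled and average-pooled maps are summed (the text only says a hybrid of the two is used), the scale factor between adjacent levels is fixed at 2, all convolutions are omitted, and the function names are hypothetical.

```python
def pool_down2(channel):
    """Downsample by 2 using 2x2 max pooling plus 2x2 average pooling (summed)."""
    out = []
    for i in range(0, len(channel), 2):
        row = []
        for j in range(0, len(channel[0]), 2):
            window = [channel[i][j], channel[i][j + 1],
                      channel[i + 1][j], channel[i + 1][j + 1]]
            row.append(max(window) + sum(window) / 4.0)
        out.append(row)
    return out


def upsample2(channel):
    """Upsample by 2 using nearest neighbor interpolation."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def tfe(large, medium, small):
    """Bring large and small maps to the medium size, then concatenate channels."""
    down = [pool_down2(c) for c in large]
    up = [upsample2(c) for c in small]
    return down + medium + up


large = [[[1.0, 2.0, 0.0, 0.0],
          [3.0, 4.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 8.0]]]        # 1 channel, 4x4
medium = [[[1.0, 1.0], [1.0, 1.0]]]     # 1 channel, 2x2
small = [[[7.0]]]                       # 1 channel, 1x1
fused = tfe(large, medium, small)
```

With one channel per input, `fused` has three channels at the medium resolution, matching the statement that the output has three times the channel number of the medium-size map.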

Fig. 5. The structure of CPAM module. It contains channel and position attention networks. $W$ and $H$ represent width and height, respectively. $\odot$ denotes the operation of the Hadamard product.

3.4. Channel and position attention mechanism

To extract the representative feature information contained in different channels, we propose the CPAM to integrate detailed and multiscale feature information from both SSFF and TFE. The architecture of CPAM is shown in Fig. 5. It consists of a channel attentional network receiving input from the TFE (Input 1), and a position attentional network receiving input from the superposition of the outputs of the channel attentional network and the SSFF (Input 2).

Input 1 for the channel attentional network is the feature map after PANet, which contains the detailed features of the TFE. The SENet [29] channel attention block first adopted global average pooling for each channel independently and used two fully connected layers together with a nonlinear Sigmoid function to generate channel weights. The two fully connected layers aim to capture nonlinear cross-channel interactions, which involves reducing dimensionality to control the complexity of the model, but dimensionality reduction brings side effects to channel attention prediction, and capturing the dependencies between all channels is inefficient and unnecessary. We introduce an attention mechanism without dimensionality reduction to capture cross-channel interactions in an effective manner. After channel-wise global average pooling without reducing dimensionality, local cross-channel interactions are captured by considering each channel and its $k$ nearest neighbors. This is implemented using 1D convolutions of size $k$, where the kernel size $k$ represents the coverage of local cross-channel interactions, that is, how many neighbors participate in the attention prediction of one channel. To obtain optimal coverage, one may resort to manual tuning of $k$ in different network structures and different numbers of convolution modules, which is tedious. Since the convolution kernel size $k$ is proportional to the channel dimension $C$, the channel dimension is generally a power of 2 and can be defined in terms of $k$:

$$C = 2^{(\gamma \times k - b)},$$

where $\gamma$ and $b$ are the scaling parameters controlling the ratio of the convolution kernel size $k$ to the channel dimension $C$, respectively. Solving for $k$ gives

$$k = \left| \frac{\log_{2}(C) + b}{\gamma} \right|_{odd},$$

where $|\cdot|_{odd}$ denotes the odd number of the nearest neighbors. The value of $\gamma$ is set to 2 and $b$ is set to 1. According to the above non-linear mapping relationship, channels of higher dimension interact over a longer range, while channels of lower dimension interact over a shorter range. Therefore, the channel attention mechanism can perform deeper mining of multiple channel features.

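
The adaptive choice of the 1D convolution kernel size can be sketched as a small helper. The rounding rule used here (truncate, then move to the next odd integer if the result is even) follows a common ECA-style implementation and is an assumption about how "nearest odd number" is realized; the function name `adaptive_kernel_size` is hypothetical.

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Map a channel count to an odd 1D kernel size.

    Computes (log2(channels) + b) / gamma, truncates it to an integer,
    and bumps even results up by one so the kernel size is always odd.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


# Kernel sizes for typical YOLO channel widths, with gamma = 2 and b = 1.
sizes = {c: adaptive_kernel_size(c) for c in (64, 128, 256, 512)}
```

Under these assumptions a 64-channel map uses a kernel of size 3 while wider maps use size 5, so the interaction range grows slowly with the channel dimension.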

Combining the outputs of the channel attention mechanism with the features from SSFF (Input 2) as input to the position attention network provides complementary information to extract crucial location information from each cell. In contrast to the channel attention mechanism, the position attention mechanism first splits the input feature map into two parts in terms of its width and height, which are then processed separately for feature encoding in the axes ($p_w$ and $p_h$), and are finally merged to generate output.

More precisely, the input feature map is pooled in both horizontal ($p_w$) and vertical ($p_h$) axes to retain the spatial structure information of the feature map, which can be calculated as follows:

$$p_w = \frac{1}{H} \sum_{0 \le j < H} E(w, j),$$

$$p_h = \frac{1}{W} \sum_{0 \le i < W} E(i, h),$$

where $W$ and $H$ are the width and height of the input feature map, respectively. $E(w, j)$ and $E(i, h)$ are the values at the position $(i, j)$ of the input feature map.
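
The directional pooling described above can be sketched for a single channel as follows. The storage layout `fmap[row][col]` and the function name `pool_axes` are assumptions of this sketch.

```python
def pool_axes(fmap):
    """Average a 2D map along each spatial axis.

    fmap is indexed as fmap[row][col] with H rows and W columns.
    Returns (p_w, p_h): p_w has one value per column (averaged over the
    height) and p_h has one value per row (averaged over the width).
    """
    height, width = len(fmap), len(fmap[0])
    p_w = [sum(fmap[j][w] for j in range(height)) / height for w in range(width)]
    p_h = [sum(fmap[h][i] for i in range(width)) / width for h in range(height)]
    return p_w, p_h


fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
p_w, p_h = pool_axes(fmap)
```

Each pooled vector keeps the position along one axis while summarizing the other, which is how the spatial structure is retained.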

When generating position attention coordinates, the concatenation and convolution operations are applied to the horizontal and vertical axes:

$$P(a_w, a_h) = \mathrm{Conv}[\mathrm{Concat}(p_w, p_h)]$$

where $P(a_w, a_h)$ denotes the output of position attention coordinates, Conv denotes a $1 \times 1$ convolution, and Concat denotes concatenation.

When splitting the attention features, pairs of location-dependent feature maps are generated as follows:

$$s_w = \mathrm{Split}(a_w)$$

$$s_h = \mathrm{Split}(a_h)$$

where $s_w$ and $s_h$ are the width and height components of the split output, respectively.

The final output of CPAM is defined by:

$$F_{CPAM} = E \times s_w \times s_h$$

where $E$ represents the weight matrix of the channel and position attentions.
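The position attention steps described above (pooling along each axis, concatenation, convolution, splitting, and reweighting) can be sketched in plain Python on nested lists. This is an illustrative sketch only, not the authors' implementation: the function name `position_attention`, the channel-mixing matrix `conv_weight` standing in for a learned $1 \times 1$ convolution, and the sigmoid gate applied after splitting (borrowed from Coordinate Attention) are all assumptions.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def position_attention(feat, conv_weight):
    """feat: C x H x W nested lists; conv_weight: C x C matrix (hypothetical 1x1 conv)."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # Pool over the height (one value per column) and over the width (one value per row).
    p_w = [[sum(feat[c][j][w] for j in range(H)) / H for w in range(W)] for c in range(C)]
    p_h = [[sum(feat[c][h][i] for i in range(W)) / W for h in range(H)] for c in range(C)]
    # Concatenate both descriptors along the spatial axis: C x (W + H).
    cat = [p_w[c] + p_h[c] for c in range(C)]
    # A 1x1 convolution mixes channels independently at every position.
    mixed = [[sum(conv_weight[o][c] * cat[c][n] for c in range(C)) for n in range(W + H)]
             for o in range(C)]
    # Split back into width and height parts; the sigmoid gate is an assumption.
    s_w = [[_sigmoid(v) for v in row[:W]] for row in mixed]
    s_h = [[_sigmoid(v) for v in row[W:]] for row in mixed]
    # Reweight every position by its column gate and its row gate.
    return [[[feat[c][h][w] * s_w[c][w] * s_h[c][h] for w in range(W)]
             for h in range(H)] for c in range(C)]


if __name__ == "__main__":
    feat = [[[1.0, 2.0], [3.0, 4.0]]]  # one channel, 2 x 2
    print(position_attention(feat, [[0.0]]))
```

With an all-zero `conv_weight`, both gates equal 0.5 everywhere and every value is scaled by 0.25, which gives a quick sanity check of the data flow.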

3.5. Anchor box optimization

By optimizing the loss function and Non-Maximum Suppression (NMS), the anchor boxes of the three detection heads are improved for a better instance segmentation of cell images in different sizes.

Intersection over Union (IoU) is typically used as the anchor box loss function, determining convergence by calculating the degree of overlap between the labeled bounding box and the predicted box. However, the classical IoU loss cannot reflect the distance between the object box and the anchor box or the way they overlap. To address these issues, GIoU [47], DIoU, and CIoU [48] have been proposed. CIoU, which is used by YOLOv5 and YOLOv8, introduces an influence factor on top of the DIoU loss. While taking into account the impact of the overlapping area and center point distance on the loss function, it also takes into account the width-to-height (i.e., aspect) ratio of the labeled box and the predicted box. However, it only reflects the difference in aspect ratio, rather than the true relationship between the widths and heights of the labeled box and the predicted box. EIoU [14] minimizes the difference in width and height between the object box and the anchor box, which can improve the localization of small objects. The EIoU loss can be divided into three parts: the IoU loss function $L_{IoU}$, the distance loss function $L_{dis}$, and the aspect loss function $L_{asp}$, formulated as follows:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{(w^c)^2 + (h^c)^2} + \frac{\rho^2(w, w^{gt})}{(w^c)^2} + \frac{\rho^2(h, h^{gt})}{(h^c)^2}$$

where $\rho(\cdot)$ indicates the Euclidean distance; $b$ and $b^{gt}$ denote the central points of the predicted box $B$ and the ground-truth box $B^{gt}$, respectively; $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth, and $w$ and $h$ are those of the predicted box; $w^c$ and $h^c$ denote the width and height of the smallest enclosing box covering the two boxes. Compared with CIoU, EIoU not only speeds up the convergence of the predicted box but also improves the regression accuracy. Therefore, we select EIoU to replace CIoU in the head part.
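The three-part structure of the EIoU loss (an IoU term, a center-distance term, and a width and height term) can be made concrete with a small worked example. The following is a minimal sketch for a single pair of axis-aligned boxes, assuming the standard EIoU formulation from [14]; the function name `eiou_loss` and the `(x1, y1, x2, y2)` box format are illustrative choices, not taken from the authors' code.

```python
def eiou_loss(pred, gt):
    """EIoU loss for two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # IoU term: 1 - intersection / union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    l_iou = 1.0 - inter / union

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Distance term: squared center distance over squared enclosing diagonal.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    l_dis = rho2 / (cw ** 2 + ch ** 2)

    # Aspect term: squared width and height differences, normalized by the enclosing box.
    l_asp = (w - w_gt) ** 2 / cw ** 2 + (h - h_gt) ** 2 / ch ** 2

    return l_iou + l_dis + l_asp


if __name__ == "__main__":
    print(eiou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

For the two 2 x 2 boxes in the usage example, the IoU is 1/7, the distance term is 2/18, and the aspect term vanishes because the sizes match, so the loss is 6/7 + 1/9 = 61/63.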

Detection models output multiple detection boxes at the same time, and there are often many high-confidence boxes around each real object, so duplicate boxes must be eliminated. The principle of the classical NMS [49] algorithm is to keep the local maximum: if the overlap (IoU) between the current bounding box and the highest-scoring detection box is greater than the threshold, the score of the bounding box is directly set to zero. To overcome the errors caused by classical NMS, we adopt Soft-NMS [15], which uses a Gaussian function as the weight function to reduce the score of an overlapping box instead of directly setting it to zero, thus modifying the rule for removing bounding boxes.
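The difference between hard suppression and Gaussian score decay can be sketched as follows. This is a minimal, unoptimized illustration of Gaussian Soft-NMS in the spirit of [15], not the code used in the paper; the function names, the `(x1, y1, x2, y2)` box format, and the default values `sigma=0.5` and `score_thresh=0.001` are assumptions.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Return (index, decayed_score) pairs in the order the boxes are selected."""
    remaining = [(i, s) for i, s in enumerate(scores)]
    kept = []
    while remaining:
        # Select the box with the highest current score.
        best = max(remaining, key=lambda item: item[1])
        remaining.remove(best)
        kept.append(best)
        survivors = []
        for idx, score in remaining:
            overlap = box_iou(boxes[best[0]], boxes[idx])
            # Gaussian decay: heavy overlap lowers the score instead of zeroing it.
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_thresh:
                survivors.append((idx, score))
        remaining = survivors
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 2, 2), (0, 0, 2, 2), (10, 10, 12, 12)]
    print(soft_nms(boxes, [0.9, 0.8, 0.7]))
```

In the usage example the duplicate of the top box is not discarded; its score is decayed by a factor of exp(-2), so it ranks below the distant third box but can still be reported if it corresponds to a genuinely overlapping cell.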

4. Experiments

4.1. Datasets

We evaluated the performance of the proposed ASF-YOLO model on two cell image datasets: DSB2018 and BCC. The 2018 Data Science Bowl (DSB2018) dataset [50] contains 670 cell nuclei images with segmented masks and is designed to assess the generalizability of an algorithm across variations of cell type, magnification, and imaging modality (brightfield vs. fluorescence). Each mask contains one nucleus, with no overlap between masks (no pixel belongs to two masks). The dataset was randomly divided into a training set and a test set at an 8:2 ratio, containing 536 and 134 images, respectively.
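A random 8:2 split of this kind can be reproduced with a few lines of Python. This is a sketch only: the function name `split_indices` and the fixed seed are hypothetical, and the paper does not state how the random partition was generated.

```python
import random


def split_indices(n_images, train_ratio=0.8, seed=0):
    """Randomly partition image indices into training and test subsets."""
    indices = list(range(n_images))
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(indices)
    n_train = round(n_images * train_ratio)
    return indices[:n_train], indices[n_train:]


if __name__ == "__main__":
    train, test = split_indices(670)
    print(len(train), len(test))
```

For 670 images this yields 536 training and 134 test indices, matching the sizes reported above.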

The Breast Cancer Cell (BCC) dataset [51] was collected from the Center for Bio-Image Informatics, University of California, Santa Barbara (UCSB CBI). It includes 160 hematoxylin and eosin-stained histopathology images used in breast cancer cell detection, with associated ground truth data. The dataset was randomly partitioned into 128 images (80%) as the training set and 32 images (20%) as the test set.

Table 1
Performance comparison of different models for cell instance segmentation on the DSB2018 dataset. The best results are in bold.

Model	Param (M)	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95	FPS
Mask R-CNN [2]	43.75	0.774	0.519	0.782	0.525	20
Cascade Mask R-CNN [3]	69.17	–	–	0.783	0.533	17.9
SOLO [4]	–	–	–	0.642	0.398	25
SOLOv2 [5]	–	–	–	0.741	0.495	28.7
YOLACT [6]	34.3	0.703	0.456	0.683	0.440	25
Mask RCNN Swin T [52]	47.3	0.784	0.524	0.783	0.527	24
YOLOv5l-seg [8]	45.27	0.876	0.616	0.855	0.502	46.9
YOLOv8l-seg [9]	45.91	0.865	0.631	0.866	0.562	45.5
ASF-YOLO (Ours)	46.18	0.910	0.676	0.887	0.558	47.3

Table 2
Performance comparison of different models for cell instance segmentation on the BCC dataset. The best results are in bold.

Model	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
Mask R-CNN [2]	0.852	0.614	0.836	0.628
Cascade Mask R-CNN [3]	0.836	0.630	0.823	0.598
SOLO [4]	–	–	0.864	0.647
SOLOv2 [5]	–	–	0.860	0.651
YOLACT [6]	0.715	0.545	0.774	0.565
Mask RCNN Swin T [52]	0.841	0.604	0.806	0.588
YOLOv5l-seg [8]	0.892	0.703	0.877	0.672
YOLOv8l-seg [9]	0.850	0.619	0.814	0.564
ASF-YOLO (Ours)	0.911	0.737	0.898	0.645

4.2. Implementation details

The experiments were conducted on an NVIDIA GeForce 3090 (24 GB) GPU with PyTorch 1.10, Python 3.7, and CUDA 11.3. We initialized the model with weights pretrained on the COCO dataset. The input image size is

. The batch size is 16. The training process lasts 100 epochs. We used Stochastic Gradient Descent (SGD) as the optimizer to train the model. The SGD hyperparameters are a momentum of 0.9, an initial learning rate of 0.001, and a weight decay of 0.0005.
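To make the role of the three SGD hyperparameters concrete, the sketch below applies one update step to a list of scalar weights. It assumes the update convention used by PyTorch's `torch.optim.SGD` (weight decay added to the gradient, no dampening, no Nesterov momentum); the paper does not state these details, and the function name `sgd_step` is hypothetical.

```python
def sgd_step(weights, grads, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay on lists of floats."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay acts as an extra gradient pulling the weight toward zero.
        g = g + weight_decay * w
        # Momentum accumulates an exponentially weighted sum of past gradients.
        v = momentum * v + g
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = sgd_step([1.0], [0.5], [0.0])
    print(w, v)
```

Starting from a weight of 1.0, a gradient of 0.5, and zero velocity, the effective gradient is 0.5005 and the weight moves to 0.9994995 after one step.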

4.3. Quantitative results

Table 1 compares the performance of the proposed ASF-YOLO on the DSB2018 dataset with other classical and state-of-the-art methods, including Mask R-CNN [2], Cascade Mask R-CNN [3], SOLO [4], SOLOv2 [5], YOLACT [6], Mask R-CNN with Swin Transformer backbone (Mask RCNN Swin T) [52], YOLOv5l-seg v7.0 [8], and YOLOv8l-seg [9].

Our model, with 46.18 million parameters, achieved the best accuracy, with a Box mAP50 of 0.91 and a Mask mAP50 of 0.887, and the highest inference speed, 47.3 frames per second (FPS). Due to the image input size of

, the accuracy and speed of Mask R-CNN using the Swin Transformer backbone are not high. Our model also surpasses the classical one-stage algorithms SOLO and YOLACT.

Our proposed model also achieved the best instance segmentation performance on the BCC dataset, as shown in Table 2. The experimental validation shows the generalization ability of ASF-YOLO to different datasets with varying cell types.

Fig. 6. Qualitative comparison of different instance segmentation models on the DSB2018 dataset.

Table 3
Ablation study of main components of the ASF-YOLO on the DSB2018 dataset.

Soft-NMS	EIoU	TFE	SSFF	CPAM	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
					0.876	0.616	0.855	0.502
					0.881	0.622	0.856	0.523
					0.880	0.634	0.852	0.507
					0.891	0.653	0.867	0.542
					0.896	0.653	0.876	0.549
					0.902	0.656	0.874	0.543
					0.910	0.676	0.887	0.558

4.4. Qualitative results

Fig. 6 provides a visual comparison of cell segmentation by different methods on sample images from the DSB2018 dataset. By using the TFE module to improve small object detection, ASF-YOLO achieved good recall on cell images with dense and small objects in a single channel. By using the SSFF module to enhance multiscale feature extraction, ASF-YOLO also provided good segmentation accuracy for large-sized cell images under complex backgrounds. This indicates the good generalizability of our method to varying cell types. In Fig. 6 (a) and (b), every model performs well because the cell images are relatively simple. In Fig. 6 (c) and (d), Mask R-CNN has a high false detection rate due to the design principle of the two-stage algorithm, SOLO has many missed detections, and YOLOv5l-seg fails to segment cells with blurred boundaries.

4.5. Ablation study

We conduct a series of extensive ablation studies of the proposed ASF-YOLO model.

Table 4
Effect of different attention mechanisms.

Method	Box mAP50	Mask mAP50	Param (K)	FLOPs (M)
ASF-YOLO w/o attention (baseline)	0.902	0.874	0	0
+ SENet [29]	0.901	0.879	+8.19	+1.65
+ CBAM [35]	0.905	0.884	+16.48	+3.94
+ CA [53]	0.903	0.8881	+12.32	+1.32
+ CPAM (ours)	0.910	0.888	+12.23	+2.96

4.5.1. Effect of the proposed methods

Table 3 shows the contribution of each proposed module to the segmentation performance. Using Soft-NMS in YOLOv5l-seg mitigates the erroneous suppression caused by mutual occlusion when detecting dense, small cells, and improves performance. The EIoU loss function improves the localization of small-object bounding boxes, improving

. The SSFF, TFE, and CPAM modules effectively improve model performance on small-object instance segmentation of cell images.

4.5.2. Effect of attention mechanisms

Compared with the channel attention SENet, the channel and spatial attention CBAM, and the spatial attention Coordinate Attention (CA) [53], the proposed CPAM attention mechanism provides better performance, despite a slight increase in computation and parameters, as shown in Table 4.

Fig. 7. Qualitative comparison of different attentional mechanisms for instance segmentation on the DSB2018 dataset.

Table 5
Effect of different convolution modules in the backbone of the proposed ASF-YOLO. The best results are in bold.

Dataset	Module	Box mAP50	Box mAP50:95	Mask mAP50	Mask mAP50:95
DSB2018	C3	0.910	0.676	0.887	0.558
DSB2018	C2f	0.867	0.633	0.859	0.558
BCC	C3	0.911	0.737	0.898	0.645
BCC	C2f	0.855	0.619	0.835	0.570

Fig. 7 visualizes the segmentation results obtained with different attentional modules in the ASF-YOLO model. The proposed CPAM captures better channel and positional feature information and mines richer features from the original images.

4.5.3. Effect of convolution module in the backbone

To justify our choice of the YOLOv5 backbone network, we compare the results of the improved neck part with different YOLO backbone networks. Table 5 shows that, when the C3 modules of YOLOv5 are replaced by the C2f modules of YOLOv8 in the backbone of the proposed model, performance decreases on both datasets.

5. Conclusion

We developed an accurate and fast instance segmentation model, ASF-YOLO, for cell image analysis, which fuses spatial and scale features for the detection and segmentation of cell images. We introduced several novel modules into the YOLO framework. The SSFF and TFE modules enhance the multiscale and small object instance segmentation performance. The channel and position attention mechanism further mines the feature information of the two modules. Extensive experimental results demonstrate that our proposed model is capable of handling instance segmentation tasks for various cell images and substantially improves the accuracy of the original YOLO models on cell segmentation, where objects are small and dense. Our method substantially outperforms state-of-the-art methods in terms of both accuracy and inference speed for cell instance segmentation. Due to the small size of the datasets in this article, the generalization performance of the model needs to be further improved. In addition, the effectiveness of each module in ASF-YOLO is discussed in the ablation study.

The proposed ASF-YOLO has achieved a good balance between detection accuracy and computational speed, despite the fact that the CPAM attention mechanism might induce a slight increase in computational effort. However, there is still room for further enhancing the overall detection accuracy while maintaining the model’s segmentation efficiency, which is crucial for practical implementation in clinical settings. To this end, future work will extend the feature extractor to incorporate hierarchical convolutional structures, deformable convolution, and non-local mechanisms as in [54] to enlarge the receptive field within the bottleneck block. Besides, dilated convolution [55] could be used to enhance the performance of CNNs in capturing global contextual information without increasing the computational effort and the number of parameters. As inspired by [56], transfer learning could be performed on the backbone network to extract histological features. Recent advances in Transformers could also be adopted to improve the attention mechanism of the proposed model.

CRediT authorship contribution statement

Ming Kang: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Investigation, Formal analysis, Conceptualization. Chee-Ming Ting: Writing – review & editing, Validation, Supervision, Project administration, Funding acquisition, Conceptualization. Fung Fung Ting: Writing – review & editing, Validation, Supervision. Raphaël C.-W. Phan: Writing – review & editing, Validation, Supervision.

Declaration of competing interest

There are no competing interests to declare by the authors.

Data availability

We have shared the link to our data in the References section.

Acknowledgments

This work was supported by the Monash University Malaysia and the Ministry of Higher Education, Malaysia under Fundamental Research Grant Scheme FRGS/1/2023/ICT02/MUSM/02/1.

References

* Corresponding author.
E-mail address: ting.cheeming@monash.edu (C.-M. Ting).
