A multi-scale small object detection algorithm SMA-YOLO for UAV remote sensing images

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-92344-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40102487
Publication Date: 2025-03-18
Author(s): Shilong Zhou et al.
Primary Topic: Advanced Neural Network Applications

Overview

The paper presents the SMA-YOLO algorithm, designed to improve small object detection in complex remote sensing environments by addressing two weaknesses: inadequate extraction of local spatial information and rigid feature fusion. The algorithm incorporates a Non-Semantic Sparse Attention (NSSA) mechanism within its backbone network to efficiently extract task-relevant non-semantic features, thereby increasing sensitivity to small objects. Additionally, a Bidirectional Multi-Branch Auxiliary Feature Pyramid Network (BIMA-FPN) is introduced to integrate high-level semantic information with low-level spatial details, expanding multi-scale receptive fields and improving detection capabilities. The Channel-Space Feature Fusion Adaptive Head (CSFA-Head) further optimizes the handling of multi-scale features, enhancing model robustness in complex scenarios.

Experimental results on the VisDrone2019 dataset indicate that SMA-YOLO achieves a mean Average Precision (mAP) of 45.9% at an IoU threshold of 0.5, representing a 13% improvement over the baseline YOLOv8n model, with a notable 7.3% increase in small object detection accuracy. The study highlights the algorithm’s ability to keep the parameter count low while improving detection performance. Future research will focus on making the model more lightweight, more accurate, and faster, and on addressing challenges such as domain adaptation and scale variation through transfer learning and optimized multi-scale feature fusion. The authors also aim to explore the integration of newer models from the YOLO family to further advance detection capabilities and practical applications.

Methods

The YOLOv8 architecture comprises three primary components: the backbone, the neck, and the detection head. It comes in five variants (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) that differ in channel width, depth, and maximum channel count. The backbone uses a Cross Stage Partial Darknet (CSPDarknet) structure, incorporates the C2f module to enhance gradient flow, and employs a Spatial Pyramid Pooling Fast (SPPF) module to standardize feature map dimensions across varying output sizes. Detection is performed on feature maps at three scales (80 × 80, 40 × 40, and 20 × 20, which correspond to strides of 8, 16, and 32 on a 640 × 640 input) through multiple convolutional operations.
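The three detection scales follow directly from the downsampling strides of the network. The sketch below assumes a 640 × 640 input and strides of 8, 16, and 32, which are the usual YOLOv8 defaults but are not stated in the summary; the function names are illustrative only.

```python
def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Return the (height, width) of the feature map at each detection scale.

    Assumes a square input and that each scale downsamples by its stride.
    """
    return [(input_size // s, input_size // s) for s in strides]


def total_grid_cells(input_size=640, strides=(8, 16, 32)):
    """Count the grid cells (candidate prediction locations) over all scales."""
    return sum(h * w for h, w in grid_sizes(input_size, strides))


if __name__ == "__main__":
    print(grid_sizes())        # feature-map size per scale
    print(total_grid_cells())  # number of candidate locations
```

The finest grid (stride 8) contributes most of the candidate locations, which is one reason the high-resolution branch matters for small objects.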

The neck uses a PAN-FPN structure, which fuses multi-scale information through both top-down and bottom-up pathways. The detection head is decoupled, with separate branches for target classification and bounding box regression: it uses Binary Cross Entropy (BCE) loss for classification, and Distribution Focal Loss (DFL) together with Complete IoU (CIoU) loss for regression. To improve the detection of small and multi-scale objects in remote sensing imagery, the study introduces SMA-YOLO, an optimized model based on YOLOv8n and tailored to this application.
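To make the regression term more concrete, here is a minimal sketch of the CIoU loss for a single pair of boxes, following the standard published definition (IoU, plus a center-distance penalty, plus an aspect-ratio penalty). It is a plain-Python illustration, not the code used by the authors or by any YOLOv8 implementation; the `(x1, y1, x2, y2)` box format and the `eps` constant are assumptions.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) corner coordinates."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU: intersection area over union area.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike plain IoU, the loss still provides a useful signal when the boxes do not overlap, because the center-distance term stays non-zero. That property is relevant for small objects, where a slightly misplaced prediction can have zero overlap with the ground truth.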

Discussion

The discussion highlights advances in attention mechanisms and feature fusion strategies that improve object detection performance, particularly for small objects in complex environments. Attention mechanisms, such as the Squeeze-and-Excitation (SE) and Coordinate Attention (CA) modules, improve feature representation by dynamically reweighting channels and incorporating spatial information. Integrating these modules into models like CSPNet has shown significant improvements in detection accuracy while maintaining computational efficiency. The paper also discusses the importance of multi-scale feature extraction through Feature Pyramid Networks (FPN) and their variants, which combine high-level semantic information with low-level spatial details to enhance object recognition.
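As an illustration of the channel reweighting that SE-style attention performs, the following plain-Python sketch applies the three standard steps (squeeze, excitation, scale) to a tiny feature map. The weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters, biases are omitted for brevity, and the nested-list layout is an assumption made to keep the example dependency-free.

```python
import math


def squeeze_excite(feature_map, w1, w2):
    """Apply a bias-free Squeeze-and-Excitation step.

    feature_map: list of C channels, each an H x W nested list of floats.
    w1: hidden x C weight matrix (channel reduction).
    w2: C x hidden weight matrix (channel expansion).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
```

Each gate lies between 0 and 1, so the module can only attenuate channels relative to one another; the network learns which channels to keep close to full strength.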

The introduction of the Non-Semantic Sparse Attention (NSSA) mechanism allows for a more focused extraction of non-semantic features, improving the model’s sensitivity to small objects. The proposed Bidirectional Multi-Branch Auxiliary Feature Pyramid Network (BIMA-FPN) further optimizes feature fusion, addressing issues related to information loss during upsampling and downsampling processes. The paper also details the development of the Channel-Space Feature Fusion Adaptive Head (CSFA-Head), which dynamically learns fusion weights for multi-scale features, enhancing the model’s scale invariance and overall detection capability. Experimental results demonstrate that these enhancements lead to substantial improvements in mean Average Precision (mAP) and detection accuracy for small objects, validating the effectiveness of the proposed methods in UAV object detection tasks.
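The idea of learning fusion weights for multi-scale features can be sketched generically as a softmax-weighted sum over features that have already been resized to a common shape. This is not the CSFA-Head itself, whose internals are not described in this summary; the function names, the flat-vector feature layout, and the per-scale `logits` are all hypothetical, and a real head would predict the weights from the features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, logits):
    """Fuse same-length feature vectors from several scales.

    features: list of S vectors, one per scale, already resized to equal length.
    logits: list of S learnable scores; softmax turns them into fusion weights.
    """
    weights = softmax(logits)
    length = len(features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(length)
    ]
```

With equal logits the result is a plain average of the scales; as one logit grows, the fused output approaches that scale's features. This gives the model a way to favor the high-resolution branch when objects are small.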