Cross-dataset late fusion of Camera–LiDAR and radar models for object detection

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-32588-5
PMID: https://pubmed.ncbi.nlm.nih.gov/41501167
Publication Date: 2026-01-07
Author(s): Zhenyun Du et al.
Primary Topic: Advanced Neural Network Applications

Overview

This paper introduces a modular late-fusion framework that combines Camera, LiDAR, and Radar modalities for object classification in autonomous driving. Instead of a complex end-to-end fusion architecture, the authors independently train two lightweight neural networks: a Convolutional Neural Network (CNN) for Camera+LiDAR on the KITTI dataset and a Gated Recurrent Unit (GRU)-based classifier for Radar on RadarScenes. A unified 5-class label space harmonizes the two heterogeneous datasets and is validated through class-distribution analysis. Fusion is defined by a confidence-weighted decision rule, and performance is evaluated with 3-fold cross-validation, yielding a mean Average Precision (mAP) of 95.34% for the Camera+LiDAR model and 33.89% for the Radar model. The late-fusion output reaches 94.97% mAP against KITTI ground truth and 33.74% against RadarScenes ground truth, that is, within 0.37 and 0.15 percentage points of the respective single-modality baselines rather than above them. The result is presented as evidence of the complementary strengths of the modalities.
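The label harmonization step can be pictured with a minimal sketch. The summary does not name the five unified classes or the mapping rules, so every class name and mapping below is a hypothetical placeholder chosen for illustration, not the authors' scheme. The sketch maps dataset-specific labels into one shared space and then compares class distributions, which is the kind of check the class-distribution analysis refers to.

```python
from collections import Counter

# Hypothetical unified label space: the summary does not list the five classes.
UNIFIED_CLASSES = ["car", "pedestrian", "cyclist", "large_vehicle", "other"]

# Hypothetical per-dataset mappings, for illustration only.
KITTI_TO_UNIFIED = {
    "Car": "car",
    "Van": "car",
    "Pedestrian": "pedestrian",
    "Person_sitting": "pedestrian",
    "Cyclist": "cyclist",
    "Truck": "large_vehicle",
    "Tram": "large_vehicle",
    "Misc": "other",
}
RADAR_TO_UNIFIED = {
    "CAR": "car",
    "PEDESTRIAN": "pedestrian",
    "PEDESTRIAN_GROUP": "pedestrian",
    "TWO_WHEELER": "cyclist",
    "LARGE_VEHICLE": "large_vehicle",
    "STATIC": "other",
}


def harmonize(labels, mapping):
    """Map dataset-specific labels to the unified space, dropping unmapped ones."""
    return [mapping[label] for label in labels if label in mapping]


def class_distribution(labels):
    """Return the fraction of samples per unified class (0.0 if a class is absent)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in UNIFIED_CLASSES}


# Toy label lists; "DontCare" has no mapping and is dropped.
kitti_labels = harmonize(
    ["Car", "Van", "Cyclist", "DontCare", "Pedestrian"], KITTI_TO_UNIFIED
)
radar_labels = harmonize(
    ["CAR", "TWO_WHEELER", "STATIC", "PEDESTRIAN_GROUP"], RADAR_TO_UNIFIED
)
kitti_dist = class_distribution(kitti_labels)
radar_dist = class_distribution(radar_labels)
print(kitti_dist)
print(radar_dist)
```

Comparing the two distributions side by side shows whether a unified class is badly under-represented in one dataset, which matters when the two models are later evaluated against each other's ground truth.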

In conclusion, the study argues that the late-fusion framework improves the reliability of object detection for autonomous driving. The Camera+LiDAR model excels on objects with distinct visual and geometric features, while the Radar model provides stability in challenging conditions, supporting its role as a robust auxiliary sensor. The late-fusion mechanism, which uses a probabilistic weighting scheme, is reported to outperform the individual modalities on various classes, with low variance and tight confidence intervals across validation runs; the aggregate mAP figures above, however, sit marginally below the single-modality baselines. The findings point to the potential of lightweight late-fusion strategies for robust detection and to future work on adaptive fusion weighting, temporal sequence modeling, and real-time deployment on embedded automotive systems.

Introduction

The introduction stresses that accurate perception is essential for autonomous driving systems to navigate complex environments. It discusses the strengths and limitations of the main sensing technologies: cameras provide high-resolution semantic data but struggle with varying illumination and adverse weather; LiDAR delivers precise 3D structural information but can falter on reflective surfaces or at extended ranges; and radar, while reliable in poor visibility, suffers from lower spatial resolution and higher measurement noise. Given these complementary strengths and weaknesses, recent research underscores the need for multisensor fusion to make perception reliable in real-world driving scenarios.

Methods

The methodology section details the experimental setup for training and evaluating three models: the Camera+LiDAR model, the Radar model, and the proposed late-fusion framework. The experiments were conducted using Python and PyTorch on a GPU-accelerated Google Cloud Platform environment. To improve the reliability of the results and reduce sampling bias, 3-fold stratified cross-validation was applied to both the KITTI and RadarScenes datasets and repeated with three random seeds (42, 2024, 7), giving nine fold and seed combinations per dataset.
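The evaluation protocol described above can be sketched as follows. This is a minimal standard-library illustration of stratified 3-fold splitting repeated over the three reported seeds; the function names, the round-robin assignment, and the toy labels are assumptions of this sketch and are not taken from the paper's code.

```python
import random
from collections import defaultdict

SEEDS = (42, 2024, 7)  # seeds reported in the text
N_FOLDS = 3


def stratified_folds(labels, n_folds, seed):
    """Assign sample indices to folds so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def cross_validation_runs(labels, n_folds=N_FOLDS, seeds=SEEDS):
    """Yield (seed, fold_id, train_indices, val_indices) for every run."""
    for seed in seeds:
        folds = stratified_folds(labels, n_folds, seed)
        for fold_id, val_idx in enumerate(folds):
            train_idx = [
                i for f, fold in enumerate(folds) if f != fold_id for i in fold
            ]
            yield seed, fold_id, sorted(train_idx), sorted(val_idx)


# Toy labels standing in for a real dataset (hypothetical class names).
labels = ["car"] * 6 + ["pedestrian"] * 3 + ["cyclist"] * 3
runs = list(cross_validation_runs(labels))
print(len(runs), "runs")
```

With 3 folds and 3 seeds the generator produces nine train and validation splits; a model would be trained and scored once per split, and the per-run scores aggregated afterwards.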

This design is intended to give stable evaluation metrics and a fair comparison between the models. Stratification keeps the class distribution similar across the training and validation sets in every fold, so that rare classes are not lost from a validation split and the per-class scores remain comparable.

Results

The results section compares the performance of three models: the proposed Camera+LiDAR model, the Radar model, and the late-fusion model. The evaluation used 3-fold stratified cross-validation across three random seeds (42, 2024, 7), so each reported estimate aggregates nine runs. The late-fusion model was assessed in two settings: first, by comparing its outputs against the KITTI ground truth (denoted FUS-K), and second, against the RadarScenes ground truth (denoted FUS-R), using the Camera+LiDAR and Radar labels as references, respectively.

Performance metrics included per-class average precision (AP), mean average precision (mAP), standard deviations, and 95% confidence intervals (CI), calculated across all folds and seeds. Together they describe both how well each model performs and how stable that performance is from run to run.
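As an illustration of how such run-level statistics can be aggregated, the sketch below computes mAP from per-class AP values and then a mean, sample standard deviation, and 95% confidence interval over nine runs. The summary does not state how the authors computed their intervals, so the Student t-interval used here is an assumption, and all numeric inputs are made-up placeholders rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 8 degrees of freedom
# (9 runs - 1). Using a t-interval is an assumption of this sketch.
T_CRIT_95_DF8 = 2.306


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return statistics.mean(per_class_ap.values())


def summarize(scores, t_crit=T_CRIT_95_DF8):
    """Return (mean, sample std, (ci_low, ci_high)) for a list of run scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder per-class AP for one run (hypothetical class names and values).
one_run_ap = {"car": 0.95, "pedestrian": 0.90, "cyclist": 0.88}
one_run_map = mean_average_precision(one_run_ap)

# Placeholder mAP values for 9 runs (3 folds x 3 seeds); not the paper's data.
run_maps = [0.90, 0.92, 0.91, 0.93, 0.89, 0.92, 0.91, 0.90, 0.91]
mean_map, std_map, (ci_low, ci_high) = summarize(run_maps)
print(f"mAP = {mean_map:.4f} +/- {std_map:.4f}, 95% CI [{ci_low:.4f}, {ci_high:.4f}]")
```

A narrow interval relative to the gap between two models suggests the difference is not an artifact of one particular split; if the intervals overlap heavily, the models are hard to separate on this evidence alone.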

Discussion

The discussion section emphasizes that sensor fusion improves object detection and tracking by combining semantic detail, geometric accuracy, and environmental robustness. It groups fusion strategies into early, middle, and late fusion and sets out the benefits and costs of each. Early fusion can exploit low-level correlations but is computationally intensive and requires precise calibration. Middle fusion, demonstrated in models such as TransFuser and UniFusion, excels at learning cross-modal attention but needs tightly coupled architectures and complete multi-modal data during training. In contrast, the late-fusion approach adopted in this study combines predictions from independently trained models, which offers flexibility and robustness and is particularly useful when datasets differ in sensor modalities, as with KITTI (Camera + LiDAR) and RadarScenes (Radar).

The proposed late-fusion framework aims to improve object detection by integrating Camera, LiDAR, and Radar data. It is motivated by findings that combining visual and geometric cues with radar's motion information and robustness significantly improves performance, especially in challenging conditions. The framework employs a weighted fusion rule that gives the Camera + LiDAR model more influence because of its stronger baseline performance while still incorporating radar's contribution. The study reports experimental results using the mean Average Precision (mAP) metric as evidence that late fusion yields reliable detection outcomes across heterogeneous datasets. This part of the paper closes with an outline of the sections that follow, covering related work, methodology, experimental setup, results, and future directions.
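A weighted late-fusion rule of this kind can be sketched in a few lines. The summary does not give the actual weights or the exact formula, so the 0.7 and 0.3 weights, the function name, and the example probability vectors below are hypothetical. The sketch also assumes both models have already produced class-probability vectors over the same unified label space for the same object.

```python
def fuse_predictions(p_cam_lidar, p_radar, w_cam_lidar=0.7, w_radar=0.3):
    """Weighted average of two class-probability vectors, renormalized to sum to 1.

    The default weights are hypothetical; they only encode the idea that the
    Camera+LiDAR model receives more influence than the Radar model.
    """
    if len(p_cam_lidar) != len(p_radar):
        raise ValueError("both models must predict over the same label space")
    fused = [w_cam_lidar * a + w_radar * b for a, b in zip(p_cam_lidar, p_radar)]
    total = sum(fused)
    if total <= 0:
        raise ValueError("fused scores must have a positive sum")
    return [value / total for value in fused]


def predicted_class(probabilities):
    """Index of the highest-probability class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Made-up probability vectors over a 5-class unified label space.
p_cam_lidar = [0.70, 0.10, 0.10, 0.05, 0.05]
p_radar = [0.20, 0.50, 0.10, 0.10, 0.10]

fused = fuse_predictions(p_cam_lidar, p_radar)
print(fused, predicted_class(fused))
```

In this toy case the radar model prefers class 1, but the fused decision follows the Camera+LiDAR model because of its larger weight. A confidence-weighted variant could scale each model's weight by its own maximum probability before averaging; whether the paper does this is not stated in the summary.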