Efficient feature ranked hybrid framework for Android IoT malware detection

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-35238-6
PMID: https://pubmed.ncbi.nlm.nih.gov/41593148
Publication Date: 2026-01-27
Author(s): Nahla Hafez Saeed et al.
Primary Topic: Advanced Malware Detection Techniques

Overview

The study presents a hybrid malware detection framework for Android-based IoT devices that integrates static and dynamic analysis. The framework uses a Random Forest classifier, optimized through systematic hyperparameter tuning, together with a dual feature-ranking mechanism based on Information Gain and the Gini Index to select the most relevant features. Evaluated on four benchmark datasets (Drebin, CCCS-CIC-AndMal-2020, TUANDROMD, and CIMD-2024), the framework achieved accuracy ranging from 99.03% to 100% and F1-scores between 0.98 and 1.00. However, the highly imbalanced CIMD-2024 dataset required imbalance-handling strategies to improve detection of both majority and minority classes.
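As a rough illustration of what a dual ranking by Information Gain and Gini Index can look like, the sketch below scores each discrete feature by its entropy reduction and by its Gini impurity reduction, then merges the two orderings. The summary does not say how the authors combine the two rankings, so the average-rank rule here is an assumption, and the feature names and toy data are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(feature, labels, impurity):
    """Impurity reduction obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted

def dual_rank(columns, labels):
    """Rank features by Information Gain and by Gini decrease, then merge
    the two orderings by average rank (ASSUMED rule, not from the paper)."""
    ig = {f: split_gain(col, labels, entropy) for f, col in columns.items()}
    gd = {f: split_gain(col, labels, gini) for f, col in columns.items()}
    ig_order = sorted(columns, key=lambda f: -ig[f])
    gd_order = sorted(columns, key=lambda f: -gd[f])
    mean_rank = {f: (ig_order.index(f) + gd_order.index(f)) / 2 for f in columns}
    return sorted(columns, key=lambda f: mean_rank[f])

# Toy data: 1 = malware, 0 = benign. Feature names are hypothetical.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "SEND_SMS": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the classes perfectly
    "INTERNET": [1, 1, 1, 0, 1, 1, 0, 0],  # weakly informative
    "VIBRATE":  [1, 0, 1, 0, 1, 0, 1, 0],  # carries no class information
}
ranking = dual_rank(columns, labels)
print(ranking)
```

Keeping only the top-ranked features before training the classifier is one plausible way such a ranking reduces computational cost on constrained devices.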

The findings underscore the framework’s robustness and adaptability, particularly in detecting evasive and zero-day malware, while maintaining low computational overhead. Despite its strengths, the study identifies limitations in handling extreme class imbalances, particularly within the CIMD-2024 dataset, where the model struggled to detect minority classes. Future work will focus on enhancing the framework’s resilience against advanced evasive malware, incorporating real-time network behavior analysis, and addressing power consumption issues for low-powered IoT devices. Additionally, the research emphasizes the importance of continuous model updates with new malware samples to combat evolving threats effectively, while advocating for the use of balanced evaluation metrics to ensure meaningful performance assessments.

Methods

The authors validate their hybrid detection framework for Android malware on four widely used datasets: Drebin, TUANDROMD, CCCS-CIC-AndMal-2020, and CIMD-2024, which together cover static and dynamic feature types. Drebin and TUANDROMD contain only static features, CCCS-CIC-AndMal-2020 combines static and dynamic features, and CIMD-2024 is based on dynamic features alone.

The choice of these datasets allows for a comprehensive assessment of the framework’s robustness across distinct analysis paradigms. Static analysis is highlighted for its computational efficiency, as it evaluates application code without execution, whereas dynamic analysis provides insights into the runtime behavior of applications. This dual approach aims to enhance the accuracy and effectiveness of malware detection in Android systems.

Results

The proposed framework was evaluated on multiple datasets and showed high accuracy and stability on most of them. On the Drebin dataset, the model achieved a mean accuracy of 98.78% with a standard deviation of 0.27%, alongside macro-precision, macro-recall, and macro-F1 scores of 0.9888, 0.9849, and 0.9868, respectively. These metrics indicate robust performance in distinguishing benign from malicious applications. On the CCCS-CIC-AndMal-2020 dataset, the framework attained a mean accuracy of 93.03%, with a macro-precision of 93.82% and a macro-recall of 85.72%, reflecting effective detection across malware categories despite class imbalance.
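For readers unfamiliar with macro-averaged metrics, the sketch below computes them from scratch on a small made-up label set; it is not the authors' evaluation code. Each class gets its own precision, recall, and F1, and the macro score is the unweighted mean over classes, so a poorly detected minority class pulls the score down even when overall accuracy looks high. Macro-F1 is taken here as the mean of per-class F1 values, which is one common convention.

```python
from statistics import mean

def macro_scores(y_true, y_pred):
    """Return (macro-precision, macro-recall, macro-F1) over the classes in y_true."""
    precisions, recalls, f1s = [], [], []
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)

# Toy, imbalanced example (made-up labels): one of two malware samples is missed.
y_true = ["benign"] * 8 + ["malware"] * 2
y_pred = ["benign"] * 8 + ["benign", "malware"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro-P={macro_p:.3f} "
      f"macro-R={macro_r:.3f} macro-F1={macro_f1:.3f}")
```

In this toy case accuracy is 0.9 while macro-recall is only 0.75, which is the kind of gap between accuracy and macro-recall seen in the CCCS-CIC-AndMal-2020 figures above.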

The framework also performed well on the TUANDROMD dataset, achieving a mean accuracy of 98.66% with minimal variance, which indicates reliability across different data partitions. However, on the highly imbalanced CIMD-2024 dataset, the model’s accuracy dropped to 69.82%, highlighting the difficulty of detecting minority malware classes when benign traffic predominates. This finding points to the need for imbalance-handling strategies. Overall, the authors position the framework's combination of static and dynamic analysis with feature selection as outperforming existing hybrid detection frameworks, achieving high accuracy with efficient use of computational resources. They report that statistical tests confirmed the significance of the hybrid model's improvements across all datasets.
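The summary does not say which imbalance-handling strategies the authors applied, so the following is only a sketch of two common options: inverse-frequency class weights and random oversampling of minority classes. The 95/5 class split and the function names are assumptions chosen for illustration.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical 95/5 split between benign and malware samples.
labels = ["benign"] * 95 + ["malware"] * 5
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only; resampling before the train/test split leaks duplicated rows into the test set and inflates the reported scores.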

Discussion

In the discussion, the authors review existing approaches to Android malware detection, covering static, dynamic, and hybrid analysis. Static analysis is lightweight and suits resource-constrained environments such as IoT, but it is vulnerable to obfuscation and struggles to detect zero-day and polymorphic threats. Notable frameworks such as Drebin and MLDroid have improved static detection, but the authors highlight the need for greater robustness through dynamic or hybrid approaches. Dynamic analysis captures application behavior at runtime and is more resilient to evasion tactics, yet it demands significant computational resources, which can be problematic in large-scale IoT deployments.

Hybrid analysis techniques aim to combine the strengths of both static and dynamic methods, addressing the weaknesses inherent in each. By integrating features from both paradigms, hybrid models enhance detection accuracy and robustness, although they often face challenges related to computational complexity and real-time applicability in IoT contexts. The authors propose a novel hybrid malware detection framework tailored for IoT environments, utilizing an optimized Random Forest classifier with a dual feature ranking strategy based on Information Gain and Gini Index. This approach aims to achieve high accuracy, interpretability, and low-power deployment, ultimately addressing key challenges in Android malware detection while ensuring efficient and transparent classification processes.