Enhancing malware detection with feature selection and scaling techniques using machine learning models

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-93447-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40097688
Publication Date: 2025-03-17
Author(s): Rakibul Hasan et al.
Primary Topic: Advanced Malware Detection Techniques

Overview

The study addresses the growing challenge of malware detection in cybersecurity by evaluating how feature selection, feature scaling, and the choice of machine learning (ML) model affect detection performance. Using a binary tabular classification dataset of 11,598 samples and 139 features, the authors test three feature scaling options (no scaling, normalization, and min-max scaling) and three feature selection options (no selection, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA)) across twelve ML models, including both traditional algorithms and ensemble methods. The Light Gradient Boosting Machine (LGBM) achieves the highest accuracy, 97.16%, when PCA is combined with either min-max scaling or normalization, which highlights the stronger performance of ensemble models relative to traditional ones.
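To give a sense of the size of this experimental design, the sketch below enumerates the configurations under the assumption that every scaling option was paired with every selection option and every model. The summary names only a few of the twelve models, so the list uses placeholder names (`model_1` to `model_12`), which are hypothetical.

```python
from itertools import product

# Options as listed in the overview above.
scalers = ["none", "normalization", "min-max"]
selectors = ["none", "LDA", "PCA"]

# Assumption: only the model count (twelve) is given, so placeholder names are used.
models = [f"model_{i}" for i in range(1, 13)]

# Assumption: a full factorial design, i.e. every combination is evaluated.
configurations = list(product(scalers, selectors, models))

print(len(configurations))  # 3 * 3 * 12 = 108
```

Under that assumption, the reported best result corresponds to two cells of this grid: LGBM with PCA and min-max scaling, and LGBM with PCA and normalization.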

The conclusion emphasizes the significance of preprocessing in enhancing model efficacy, with LDA and PCA identified as effective feature selection techniques, particularly LDA for linear models. Min-max scaling notably improved performance for ensemble models like LGBM and Random Forest (RF). However, the study acknowledges limitations, including reliance on a single dataset that may not fully represent the malware landscape, the exclusion of alternative feature selection methods, and a limited exploration of deep learning techniques. Future research directions include testing on diverse datasets, investigating advanced feature selection and deep learning models, addressing class imbalance more comprehensively, and optimizing models for real-time applications. These efforts aim to enhance the reliability and efficiency of malware detection systems.

Methods

The study follows a structured approach to malware detection, as illustrated in Figure 1. Feature selection was performed with Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA): PCA reduces dimensionality by projecting the data onto the directions of maximum variance, while LDA seeks projections that enhance class separability, which is crucial for effective classification. To accommodate the diverse nature of malware data, feature scaling techniques such as normalization and min-max scaling were applied, so that no single feature disproportionately influences the model and optimization converges more readily.
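The interaction between scaling, PCA, and LDA can be illustrated on a toy two-feature dataset. The following is a minimal sketch, not the study's implementation: it uses closed-form 2-D versions of PCA (principal axis angle) and two-class LDA (Fisher's discriminant), and the data, the feature names `wide` and `narrow`, and all function names are hypothetical. The `wide` feature has a large range but carries no class information, while `narrow` separates the classes.

```python
import math

def min_max_scale(column):
    """Rescale one feature column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def mean(values):
    return sum(values) / len(values)

def scatter(xs, ys):
    """Return (sxx, sxy, syy): sums of squared and cross deviations from the mean."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy

def pca_direction(xs, ys):
    """Unit vector of maximum variance for 2-D data (class labels are ignored)."""
    sxx, sxy, syy = scatter(xs, ys)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def lda_direction(xs, ys, labels):
    """Fisher discriminant direction for two classes: Sw^-1 (m1 - m0), normalized."""
    groups = {0: ([], []), 1: ([], [])}
    for x, y, label in zip(xs, ys, labels):
        groups[label][0].append(x)
        groups[label][1].append(y)
    # Within-class scatter Sw is the sum of each class's own scatter matrix.
    a = b = d = 0.0
    for gx, gy in groups.values():
        sxx, sxy, syy = scatter(gx, gy)
        a, b, d = a + sxx, b + sxy, d + syy
    dx = mean(groups[1][0]) - mean(groups[0][0])
    dy = mean(groups[1][1]) - mean(groups[0][1])
    det = a * d - b * b
    wx, wy = (d * dx - b * dy) / det, (a * dy - b * dx) / det
    norm = math.hypot(wx, wy)
    return wx / norm, wy / norm

# Toy data (hypothetical): `wide` spans [-4, 4] in both classes, `narrow` sits near 0 or 1.
noise = [0.05, -0.05, 0.0, -0.05, 0.05]
wide = [-4.0, -2.0, 0.0, 2.0, 4.0] * 2
narrow = noise + [1.0 + n for n in noise]
labels = [0] * 5 + [1] * 5

raw_pca = pca_direction(wide, narrow)
scaled_pca = pca_direction(min_max_scale(wide), min_max_scale(narrow))
lda = lda_direction(wide, narrow, labels)

print("PCA direction, unscaled:", raw_pca)
print("PCA direction, min-max scaled:", scaled_pca)
print("LDA direction, unscaled:", lda)
```

On this toy data, unscaled PCA aligns with the large-range `wide` feature, min-max scaling shifts the principal direction to the class-separating `narrow` feature, and LDA points along `narrow` in either case because it uses the labels. Real malware feature sets have many more dimensions, where library implementations would normally be used instead of these closed forms.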

The selection of machine learning (ML) models aimed to provide a balanced representation of traditional, ensemble, and neural network approaches, thereby enabling a thorough evaluation of performance. Ensemble methods, including LightGBM (LGBM) and Random Forest (RF), were chosen for their robustness in handling high-dimensional data, while Logistic Regression (LR) and Support Vector Machines (SVM) were included for their interpretability and sensitivity to preprocessing. Hyperparameter optimization was performed using grid search to maximize model performance across various configurations. This comprehensive methodology addresses existing gaps in the literature and aligns with the study’s goal of developing a robust, scalable, and interpretable malware detection framework.
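Grid search itself is simple to sketch: every combination of candidate hyperparameter values is scored, and the best-scoring combination is kept. The example below is an illustration under stated assumptions, not the authors' setup. The summary does not say which hyperparameters or validation scheme were used, so the k-nearest-neighbour stand-in model, the grid over `k` and `p`, the 5-fold cross-validation, and the synthetic data are all hypothetical.

```python
import itertools
import random

def knn_predict(train, query, k, p):
    """Majority vote among the k nearest training points (Minkowski-p ordering)."""
    def dist(a, b):
        # The p-th root is omitted because it does not change the ordering.
        return sum(abs(u - v) ** p for u, v in zip(a, b))
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(data, k, p, folds=5):
    """Mean accuracy over `folds` interleaved validation folds."""
    scores = []
    for f in range(folds):
        valid = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        hits = sum(knn_predict(train, x, k, p) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / folds

def grid_search(data, grid):
    """Score every combination in `grid` and return (best_params, best_score)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(data, **params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

# Synthetic two-class data (hypothetical): two Gaussian clusters in 2-D.
rng = random.Random(0)
data = []
for _ in range(30):
    data.append(((rng.gauss(0, 1), rng.gauss(0, 1)), 0))
    data.append(((rng.gauss(4, 1), rng.gauss(4, 1)), 1))

best_params, best_score = grid_search(data, {"k": [1, 3, 5], "p": [1, 2]})
print(best_params, best_score)
```

The cost grows with the product of the grid sizes, which matters when the search is repeated for each model and each preprocessing configuration.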

Discussion

The discussion section of the paper reviews various machine learning (ML) methodologies for malware detection, highlighting significant advances and persistent challenges in the field. Notable contributions include the dilated convolutional neural networks (CNNs) introduced by Mezina and Burget, which achieved a detection accuracy of 0.99, and the hybrid classification method proposed by Roy et al., which effectively identifies obfuscated malware using a stacked ensemble learning approach. Additionally, Hossain and Islam integrate hybrid feature selection with ML algorithms for botnet detection, while Rafiq et al. demonstrate high accuracy (up to 98.2%) in identifying repurposed malware through package name similarity analysis.

Despite these advancements, the discussion identifies critical gaps in existing methodologies, such as reliance on traditional feature selection techniques that struggle with high-dimensional datasets and insufficient attention to feature scaling. Moreover, many studies focus narrowly on either traditional ML or deep learning techniques, neglecting the potential of hybrid approaches. The paper emphasizes the need for improved explainability and interpretability in malware detection models to facilitate practical deployment in cybersecurity. These insights motivate the proposed methodology, which aims to enhance detection accuracy and robustness by integrating effective feature selection and scaling techniques with advanced ML models.