LLM-Generated Samples for Android Malware Detection

Journal: Digital, Volume: 6, Issue: 1
DOI: https://doi.org/10.3390/digital6010005
Publication Date: 2026-01-18
Author(s): Nik Rollinson et al.
Primary Topic: Advanced Malware Detection Techniques

Overview

The study investigates whether Large Language Models (LLMs), specifically a fine-tuned version of GPT-4.1-mini, can generate structured tabular records for Android malware detection, focusing on three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS. It addresses the challenge that evolving Android malware poses for traditional detection methods and explores synthetic data as a way to augment limited datasets. Classifiers trained solely on real data achieve near-perfect detection rates, and adding synthetic data to the real training set maintains high performance with minimal degradation. Training exclusively on synthetic data, however, yields variable results, indicating that LLM-generated records can help mitigate dataset scarcity but are not yet sufficient as a standalone training source.

The study contributes to the field by framing LLM-based synthesis as a structured record generation task and evaluating it across multiple scenarios. It shows that synthetic augmentation preserves detection specificity, while synthetic-only training is limited by its sensitivity to the characteristics of each malware family and to the fine-tuning strategy. The authors underscore the importance of methodological considerations, such as prompt design and post-processing, when applying LLMs to cybersecurity data. Future work will focus on optimizing fine-tuning parameters and exploring more advanced LLMs to improve the fidelity and generalization of synthetic data, with the aim of broadening the use of synthetic records in security analytics.

Introduction

The authors highlight the challenges posed by the rapid evolution of Android malware, which increasingly outpaces traditional signature-based detection. They note that although machine learning (ML) and deep learning (DL) approaches have improved detection accuracy, their effectiveness is often limited by a lack of timely, high-quality labels and balanced datasets, particularly for specific malware families. The study aims to fill this gap by investigating how synthetic Android malware records, produced with a fine-tuned large language model (LLM), affect the performance of ML classifiers.

The authors use the KronoDroid dataset to generate structured malware records that emulate the characteristics of three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS. They describe a fine-tuning pipeline that includes schema sanitization and prompt engineering to ensure the quality of the generated data. The evaluation protocol covers three scenarios (real-only training, real-plus-synthetic augmentation, and synthetic-only generalization) and applies rigorous leakage checks and statistical methods to assess the utility of the synthetic data. Key findings indicate that augmentation with synthetic data keeps detection performance close to baseline levels, whereas training solely on synthetic data yields unreliable results that vary by malware family. The study also finds that deeper fine-tuning with richer datasets can reduce the performance gap between synthetic and real data.
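The three training scenarios and a leakage check can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the summary does not say how the authors implemented their leakage checks, so the exact-duplicate test below is an assumed stand-in, the helper names `build_training_set` and `leaked_rows` are hypothetical, and the feature vectors are made-up values rather than KronoDroid data. It also assumes that every scenario is scored against a held-out set of real records.

```python
def build_training_set(scenario, real_train, synthetic):
    """Return the training rows for one of the three scenarios."""
    if scenario == "real_only":
        return list(real_train)
    if scenario == "real_plus_synthetic":
        return list(real_train) + list(synthetic)
    if scenario == "synthetic_only":
        return list(synthetic)
    raise ValueError(f"unknown scenario: {scenario}")


def leaked_rows(train_rows, test_rows):
    """Return training rows that also appear verbatim in the test set.

    Assumed check: exact duplicates only. The source does not specify
    which leakage checks the authors actually used.
    """
    test_set = {tuple(row) for row in test_rows}
    return [row for row in train_rows if tuple(row) in test_set]


# Toy feature vectors (made-up values, not KronoDroid data).
real_train = [(1, 0, 3), (0, 2, 1)]
real_test = [(4, 4, 0), (0, 2, 2)]
synthetic = [(1, 1, 3), (4, 4, 0)]  # second row duplicates a test row

for scenario in ("real_only", "real_plus_synthetic", "synthetic_only"):
    train = build_training_set(scenario, real_train, synthetic)
    # Any row reported here should be removed before fitting a classifier.
    print(scenario, len(train), leaked_rows(train, real_test))
```

In this toy setup the duplicated synthetic row is flagged in both scenarios that include synthetic data, which is the kind of overlap a leakage check is meant to catch before evaluation.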

Methods

The methodology describes the experimental setup and the evaluation process used to assess the impact of synthetic data on malware detection accuracy. Starting from 484 features, the authors excluded identifiers and metadata that could lead to overfitting, leaving 474 numeric columns. Non-numeric artifacts were handled by replacing 'None' values with zero, and columns in which more than 70% of the values were zero were removed, giving a final feature set of 387 columns. This conservative cutoff was chosen to maintain adequate coverage across syscall and permission groups while ensuring that removing sparsely populated features did not affect classifier performance.
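The filtering steps described above can be sketched with pandas. This is a minimal illustration, not the authors' code: the function name `preprocess`, the `id_columns` argument, and the toy column names are assumptions made for the example. Only the replacement of 'None' with zero and the 70% zero-value cutoff come from the text.

```python
import pandas as pd


def preprocess(df, id_columns, max_zero_fraction=0.70):
    """Drop identifier columns, coerce to numeric, and remove mostly-zero columns."""
    # Remove identifiers/metadata that a classifier could memorize
    # (the column names passed in are hypothetical).
    features = df.drop(columns=[c for c in id_columns if c in df.columns])
    # Coerce everything to numbers; 'None' strings and other non-numeric
    # values become NaN and are then set to zero.
    features = features.apply(pd.to_numeric, errors="coerce").fillna(0)
    # Fraction of zero entries in each column.
    zero_fraction = (features == 0).mean()
    # Keep columns whose zero fraction does not exceed the cutoff.
    keep = zero_fraction[zero_fraction <= max_zero_fraction].index
    return features[keep]


toy = pd.DataFrame({
    "sha256": ["a1", "b2", "c3", "d4"],    # identifier, dropped
    "sys_read": [3, 0, 5, 2],              # 25% zeros, kept
    "perm_sms": ["None", 1, "None", 1],    # 50% zeros after replacement, kept
    "sys_rare": [0, 0, 0, 7],              # 75% zeros, dropped
})
clean = preprocess(toy, id_columns=["sha256"])
print(list(clean.columns))
```

Note that coercing with `errors="coerce"` maps every non-numeric value to zero, not just 'None'; a stricter pipeline might prefer to fail on unexpected strings instead.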

The experiments were run on an Apple M3 Pro processor with 18 GB of RAM, using Python libraries such as pandas, numpy, and sklearn together with the public KronoDroid dataset. The evaluation tested three scenarios for each of the three malware families (BankBot, Locker/SLocker, and Airpush/StopSMS) across five classifiers, for a total of 45 combinations (3 scenarios × 3 families × 5 classifiers). Detailed results and discussion of these evaluations follow in subsequent sections.
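The count of 45 combinations is simply the Cartesian product of families, scenarios, and classifiers. The sketch below enumerates that grid with the standard library. The scenario labels are informal names chosen for this example, and the classifier names are placeholders, since the summary states that five classifiers were used but does not list them.

```python
from itertools import product

families = ["BankBot", "Locker/SLocker", "Airpush/StopSMS"]
scenarios = ["real_only", "real_plus_synthetic", "synthetic_only"]
# Placeholder names: the text gives the number of classifiers, not their identities.
classifiers = [f"classifier_{i}" for i in range(1, 6)]

# Every (family, scenario, classifier) triple is one experimental run.
grid = list(product(families, scenarios, classifiers))
print(len(grid))  # 3 * 3 * 5 = 45
```

Enumerating the grid up front makes it easy to loop over runs and to confirm that no combination is skipped or repeated.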

Discussion

The discussion highlights the role of Artificial Intelligence (AI) in strengthening cybersecurity, particularly through Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). It emphasizes that traditional manual analysis is inadequate against increasingly sophisticated cyber threats and argues that AI can improve threat detection by recognizing patterns and anomalies that conventional techniques may overlook. It also identifies a critical research gap concerning the use of LLM-generated synthetic malware for training detection models: although AI techniques have shown promise in malware detection and prevention, empirical evaluations of LLM-generated data in this context are still needed.

The section further explores the dual nature of LLMs in cybersecurity: they can be used both to generate malware and to detect it. The cited studies report that LLMs such as OpenAI's GPT-3 can generate malware variants, but they also facilitate the creation of synthetic datasets for training detection models. Challenges remain, including the need for high-quality, balanced datasets and the limited ability of synthetic data to replicate real-world scenarios. The authors advocate a hybrid approach that combines real and synthetic data to make detection models more robust, and they stress the importance of continuous updates and evaluations to keep pace with evolving cyber threats. Overall, the findings point to the need for further research into the effective integration of AI and LLMs in cybersecurity frameworks.