Web-based phishing URL detection model using deep learning optimization techniques

Journal: International Journal of Data Science and Analytics, Volume: 20, Issue: 5
DOI: https://doi.org/10.1007/s41060-025-00728-9
Publication Date: 2025-02-08
Author(s): Kousik Barik et al.
Primary Topic: Spam and Phishing Detection

Overview

The paper addresses the growing threat of phishing attacks, which deceive Internet users into revealing sensitive information by redirecting them to fraudulent websites that closely mimic legitimate ones. Despite existing detection methods, these attacks have become more prevalent, which calls for more sophisticated techniques. The authors propose an Enhanced Grid Search Optimization-Convolutional Neural Network (EGSO-CNN) model for detecting web phishing. The model uses a newly created dataset, built to improve the availability of up-to-date phishing data, and applies StandardScaler and Variational Autoencoders (VAE) for preprocessing and feature extraction. The EGSO technique optimizes the model's performance, yielding an accuracy of 99.44%, a recall of 99.21%, and an F1-score of 99.32%, alongside low false positive and error rates.
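As a rough illustration of the grid search idea that EGSO builds on, the sketch below scores every combination in a small hyperparameter grid and keeps the best one. It is plain exhaustive grid search, not the authors' enhanced variant, whose details this summary does not give. The function `toy_validation_score`, the parameter names `learning_rate` and `filters`, and the grid values are all hypothetical stand-ins for "train a CNN and return its validation accuracy".

```python
from itertools import product


def grid_search(objective, grid):
    """Score every combination in `grid` and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical objective: peaks at learning_rate=0.01 and filters=64.
# A real run would train the classifier and return a validation metric.
def toy_validation_score(learning_rate, filters):
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(filters - 64) / 1000


# Assumed search space, chosen only for illustration.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "filters": [32, 64, 128],
}

best_params, best_score = grid_search(toy_validation_score, search_space)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is presumably why refinements of plain grid search are attractive when each evaluation means training a network.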

In conclusion, the study emphasizes the urgent need for advanced phishing detection techniques due to the rapid increase in such cyber-attacks, which have resulted in significant financial losses for users. The proposed EGSO-CNN model not only demonstrates high accuracy but also shows promise for real-world applications in enhancing phishing prevention strategies. Future work may involve exploring alternative deep learning methodologies and incorporating transfer learning techniques to further improve performance. Ultimately, the model aims to assist organizations in implementing effective phishing detection strategies, thereby enhancing customer satisfaction and bolstering their reputation in a competitive environment.

Introduction

The introduction outlines the background of phishing, a form of cybercrime that emerged in 1996, in which attackers, referred to as “phishers,” create fraudulent websites to steal users’ sensitive information. The term “phishing” is derived from “fishing,” reflecting the method of luring victims. Phishing websites typically mimic legitimate ones by manipulating URL components, such as domain names and paths, often through subtle alterations like misspellings or look-alike characters. For instance, a phishing URL like “https://aimazon.az-z7acyu3z0y10.abc/m” is designed to resemble “https://www.amazon.com”, thereby deceiving users into providing their credentials.
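To make the URL manipulation concrete, the following sketch extracts a few lexical features from the example URL above and measures how close its host labels are to a brand name. The feature set and the helper names `edit_distance` and `url_features` are assumptions made for illustration; they are not the features used in the paper.

```python
from urllib.parse import urlsplit


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def url_features(url, brand):
    """Return simple lexical features of the URL's host (illustrative only)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "host": host,
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in host),
        # Smallest distance between any host label and the brand name.
        "brand_distance": min(edit_distance(label, brand) for label in labels),
    }


features = url_features("https://aimazon.az-z7acyu3z0y10.abc/m", "amazon")
print(features)
```

A small but non-zero `brand_distance` marks a near-miss spelling such as "aimazon". A distance of zero only says the brand name appears as a label, so a real detector would also need to check which domain is actually registered.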

The section also highlights the critical role of URLs, the web addresses that identify resources of various kinds, including web pages and databases. The paper illustrates the phishing life cycle, emphasizing the initial step in which attackers create deceptive websites that exploit user trust to capture sensitive information. This background sets the stage for discussing existing datasets and the machine learning (ML) and deep learning (DL) approaches used to enhance web phishing detection.

Methods

The methodology is built around a comprehensive model for detecting phishing attacks, comprising a data module, preprocessing, feature extraction, deep learning classifiers, optimization, and performance evaluation. The data collection module aggregates information from multiple sources to create a primary dataset, which is then cleaned and processed to extract critical features. The deep learning classifiers are then optimized, and their phishing detection performance is assessed.
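The preprocessing step can be pictured with a minimal standardization sketch. It rescales each feature column to zero mean and unit variance using the population standard deviation, which is the same convention scikit-learn's StandardScaler follows. It is written in plain Python so it runs without dependencies. The function name `standardize` and the feature rows are hypothetical and do not come from the paper's dataset.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical feature rows: [url_length, num_dots, num_digits]
rows = [[20, 1, 0], [40, 2, 5], [60, 3, 10]]
scaled = standardize(rows)
for row in scaled:
    print([round(value, 4) for value in row])
```

Standardizing matters here because raw URL features live on very different scales (lengths in the tens, counts in single digits), and gradient-based models such as a VAE or CNN tend to train more stably on comparable ranges.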

In Experiment 1, the study evaluates three deep learning classifiers: Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU). The CNN classifier achieved the highest accuracy at 94.47%, ahead of LSTM at 93.04% and GRU at 92.95%. Accuracy and loss over epochs are illustrated in figures, and the confusion matrices show how well each classifier separates the classes. The Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves further indicate that the CNN classifier consistently outperformed the others, as summarized in Table 5.
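For readers who want to see how such metrics follow from a confusion matrix, here is a minimal sketch. The counts passed in are hypothetical and chosen only to keep the arithmetic easy to follow. They are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical counts: 100 phishing URLs and 100 legitimate URLs.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can hide an imbalance between missed phishing pages (false negatives) and blocked legitimate pages (false positives), which is why recall, F1, and the false positive rate are reported alongside it.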

Discussion

The discussion addresses three primary research questions related to web phishing detection: what existing studies and datasets offer, how effective deep learning classifiers are at detecting phishing attacks, and how much optimizing these models improves performance. The study highlights the value of summarizing current techniques and creating a comprehensive phishing dataset derived from trusted sources, which can help researchers explore new features and improve existing methodologies. The proposed model integrates feature engineering and deep learning optimization to improve detection accuracy.

The paper reviews various publicly available datasets, such as PhishTank, Alexa, and OpenPhish, which are crucial for training machine learning (ML) and deep learning (DL) models. It emphasizes the importance of data quality and the need for updated datasets to effectively combat evolving phishing techniques. The study also critiques existing ML-based approaches for their reliance on manual feature engineering and limited adaptability to large datasets. In contrast, the proposed DL model, utilizing CNN, LSTM, and GRU architectures, aims to address these challenges by employing advanced preprocessing techniques and optimization strategies, ultimately demonstrating enhanced performance metrics across multiple experiments.

Limitations

The study has several limitations that may affect the generalizability and effectiveness of its findings. First, the rapid evolution of phishing attack types and motivations necessitates further testing and validation of the proposed model across diverse datasets to ensure its robustness. The model, which includes methods such as DTOF-ANN, RNN-GRU, LSTM-CNN, VAE-DNN, SI-BBA, DNN, and EGSO-CNN, achieved notable accuracy metrics but requires broader application to confirm its efficacy.

Additionally, the model cannot detect phishing websites that rely on embedded objects, such as Flash and JavaScript, which leaves a significant gap in its detection capabilities. Supplementary tools are therefore needed to identify these types of phishing attempts. Lastly, the study does not address the impact of training effectiveness, test outcomes, or user feedback on phishing detection performance, which could provide valuable insights for future research.