An improved soft voting-based machine learning technique to detect breast cancer utilizing effective feature selection and SMOTE-ENN class balancing

Journal: Discover Artificial Intelligence, Volume: 5, Issue: 1
DOI: https://doi.org/10.1007/s44163-025-00224-w
Publication Date: 2025-01-20
Author(s): Indu Chhillar et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses breast cancer, one of the leading causes of cancer death among women worldwide, and stresses that early detection and accurate diagnosis reduce mortality. It notes that machine learning algorithms are less effective on datasets containing redundant or irrelevant features. To improve their effectiveness, the authors apply data scaling with the Robust Scaler, class balancing with the Synthetic Minority Over-sampling Technique-Edited Nearest Neighbor (SMOTE-ENN), and feature selection with Boruta and Coefficient-Based Feature Selection (CBFS).
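As a rough illustration of the scaling step, robust scaling centers each feature on its median and divides by its interquartile range (IQR), so a few extreme values have little influence on the scale. The sketch below is a plain-Python approximation of that idea, not the authors' code: the function name `robust_scale` and the sample values are hypothetical, and the 25th to 75th percentile range is an assumption (it is the default of scikit-learn's `RobustScaler`).

```python
import statistics


def robust_scale(values):
    """Center a feature on its median and divide by its interquartile range."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the feature is constant over its middle half")
    return [(v - median) / iqr for v in values]


# Hypothetical feature column with one extreme value
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(feature)
print(scaled)
```

Here the median is 3 and the IQR is 2, so the four ordinary values map to -1.0, -0.5, 0.0, and 0.5, while the extreme value maps to 48.5. The outlier remains visible, but it does not compress the other values the way min-max scaling would.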

The proposed solution is a soft voting ensemble of three classifiers: Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Evaluated on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset with the optimal features selected by CBFS, the ensemble achieved an accuracy of 99.42%, precision of 100.0%, recall of 98.41%, F1 score of 99.20%, and an AUC of 1.0. It also maintained a mean accuracy of 99.34% under tenfold cross-validation (10-FCV). The authors report that the methodology outperforms existing state-of-the-art techniques, which they attribute to robust data preprocessing, effective feature selection, and the complementary strengths of the ensemble's base classifiers.
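Soft voting combines classifiers by averaging their predicted class probabilities and choosing the class with the highest average, rather than counting one vote per classifier. The following is a minimal sketch of that rule for a single sample. The function name `soft_vote`, the equal default weights, and the probability values are illustrative assumptions; the summary does not state how the paper weights its three classifiers.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities across classifiers and pick the top class.

    probabilities: one list per classifier, each holding the class
                   probabilities for one sample, e.g. [P(benign), P(malignant)].
    """
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical outputs of three classifiers: [P(benign), P(malignant)]
mlp_probs = [0.45, 0.55]
svm_probs = [0.45, 0.55]
xgb_probs = [0.95, 0.05]

label, avg = soft_vote([mlp_probs, svm_probs, xgb_probs])
print(label, [round(p, 4) for p in avg])
```

In this example two of the three classifiers lean slightly toward class 1, so a majority (hard) vote would pick class 1. Soft voting picks class 0 because the third classifier is far more confident, which is the behavior that lets an ensemble use confidence information from its members.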

Introduction

Cancer is characterized by the abnormal growth of cells that can invade surrounding tissues and metastasize if untreated. Breast cancer originates in the milk-producing ducts or lobules of the breast and can manifest as either benign or malignant tumors. Benign tumors remain localized within the ducts, while malignant tumors penetrate the surrounding fatty and connective tissues. Breast cancer is the second most prevalent and severe cancer globally and poses significant mortality and morbidity risks, particularly for women, approximately one in eight of whom is affected. In 2020, the World Health Organization reported 2.3 million new cases, highlighting the disease’s substantial impact. Breast cancer in men is rare, accounting for 0.5-1% of all cases.

Methods

The Methods section describes the study's workflow on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset: the preprocessing steps, the feature selection techniques, the classification algorithms applied, and the performance metrics used to evaluate the proposed strategy. All experiments were run in Python in Jupyter Notebook on a personal computer with an Intel Core i5 CPU and 8 GB of RAM.

Results

The Results section reports the performance of the proposed soft voting classifier and its base classifiers, evaluated with tenfold cross-validation. Table 4 lists accuracy, precision, recall, and F1 score when all features are used: the soft voting classifier and SVM share the highest accuracy, 98.25%, followed by MLP and XGBoost, and all classifiers have the same recall, 96.83%. Table 5 compares two feature selection strategies, Boruta and CBFS. CBFS outperformed Boruta, with the soft voting classifier reaching 99.42% accuracy and 100% precision.
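Tenfold cross-validation splits the data into ten folds, holds each fold out once as a test set while training on the other nine, and averages the ten scores. The sketch below shows the mechanics in plain Python. It is an illustration only: the helper names, the shuffled non-stratified split, and the nearest-class-mean stand-in model are assumptions, and the toy data has nothing to do with WDBC.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Hold out each fold once, train on the rest, and average the accuracies."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores), scores


# Hypothetical stand-in model: classify a value by the nearest class mean.
def fit_means(X, y):
    return {
        c: sum(x for x, label in zip(X, y) if label == c) / y.count(c)
        for c in set(y)
    }


def predict_nearest_mean(means, x):
    return min(means, key=lambda c: abs(x - means[c]))


X = list(range(10)) + list(range(100, 110))  # toy one-feature dataset
y = [0] * 10 + [1] * 10

mean_acc, fold_scores = cross_val_accuracy(X, y, fit_means, predict_nearest_mean)
print(f"mean accuracy over {len(fold_scores)} folds: {mean_acc:.2%}")
```

Because every sample is tested exactly once by a model that never saw it, the mean score is a less optimistic estimate than accuracy on a single train/test split.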

The confusion matrices (Figure 5) and Receiver Operating Characteristic (ROC) curves (Figure 6) further support these results. The soft voting classifier was the most reliable at identifying benign cases and misclassified only one malignant sample. All classifiers reached high Area Under the Curve (AUC) values; the soft voting classifier had the highest, 0.9985, indicating excellent discrimination between benign and malignant cases. The proposed method also outperformed existing breast cancer classification techniques, suggesting that it could improve diagnostic accuracy and patient outcomes.
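Accuracy, precision, recall, and F1 score all follow directly from the four cells of a confusion matrix. The sketch below computes them with malignant treated as the positive class. The counts are an assumption, not figures taken from the paper: they were chosen so that a single missed malignant sample and no false alarms reproduce the reported 99.42% accuracy, 100% precision, 98.41% recall, and 99.20% F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (malignant = positive): one malignant sample missed,
# no benign sample flagged as malignant.
metrics = classification_metrics(tp=62, fp=0, fn=1, tn=108)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts the single false negative lowers recall to 62/63 and accuracy to 170/171, while precision stays at 100% because there are no false positives.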

Discussion

The Discussion section frames breast cancer as a pressing global problem, noting approximately 297,790 new cases and 91,708 deaths in the U.S. alone in 2023. The World Health Organization’s Global Breast Cancer Initiative aims to reduce mortality by 2.5% annually through early detection, which many regions cannot deliver because they lack diagnostic facilities and trained personnel. The paper argues that machine learning (ML) can strengthen breast cancer detection, particularly diagnosis based on fine needle aspiration (FNA), by automating analysis and addressing the limitations of traditional diagnostic systems.

The authors review studies that classify breast cancer on the WDBC dataset with ML algorithms such as Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN). They identify gaps in this earlier work: little attention to data scaling and class imbalance, and reliance on single classifiers rather than ensembles. To address these gaps, they propose a pipeline that uses the Robust Scaler for data scaling, SMOTE-ENN for class balancing, and a soft voting ensemble combining SVM, XGBoost, and MLP. Their results indicate that this approach improves classification accuracy, offering a promising route to better breast cancer diagnosis and, potentially, higher survival rates.
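To make the class-balancing step more concrete: SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors, and ENN then removes samples whose label disagrees with most of their nearest neighbors. The code below is a simplified plain-Python sketch of that two-stage idea for two classes, not the authors' code and not the imbalanced-learn implementation. All function names, the choice of k = 3, the majority-vote cleaning rule, and the toy data are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Return indices of the k points closest to points[idx], excluding idx."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position on the segment from sample i to neighbor j
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def edited_nearest_neighbors(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbors."""
    kept = [
        i for i in range(len(X))
        if Counter(y[j] for j in nearest(i, X, k)).most_common(1)[0][0] == y[i]
    ]
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_enn(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class to parity, then clean the result with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority count minus minority count
    new_points = smote(minority, n_new, k, seed)
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return edited_nearest_neighbors(X_bal, y_bal, k)


# Toy data (hypothetical): label 0 = majority, label 1 = minority.
# The last majority point sits inside the minority cluster and acts as noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (5.3, 5.3),
     (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = smote_enn(X, y, minority_label=1)
print(Counter(y_res))
```

On this toy data, SMOTE adds four synthetic minority points, and ENN then drops the noisy majority point at (5.3, 5.3) because all of its nearest neighbors are minority samples. The result has 6 majority and 7 minority samples, which shows why the combined method both balances the classes and cleans the boundary between them.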