An integrated approach of feature selection and machine learning for early detection of breast cancer

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-97685-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40234520
Publication Date: 2025-04-15
Author(s): Jing Zhu et al.
Primary Topic: AI in cancer detection

Overview

Breast cancer remains one of the most common cancers among women worldwide, and early detection is crucial for improving survival rates. This study presents a feature selection method that combines SHapley Additive exPlanations (SHAP) values with Recursive Feature Elimination (RFE) and a Random Forest (RF) algorithm. To address class imbalance, the researchers applied Borderline-SMOTE. The approach was evaluated with five machine learning models: K-Nearest Neighbor (KNN), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Light Gradient Boosting Machine (LightGBM), with hyperparameters tuned by Particle Swarm Optimization (PSO).
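The elimination loop at the core of RFE can be sketched in a few lines. This is a minimal, library-free illustration and not the authors' SHAP-RF-RFE implementation: the function name, the toy feature names, and the toy importance values are all hypothetical. In the study's setting, the scoring function would presumably refit a Random Forest on the surviving features and return the mean absolute SHAP value per feature.

```python
from typing import Callable, Dict, List, Sequence


def recursive_feature_elimination(
    features: Sequence[str],
    score_features: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
) -> List[str]:
    """Drop the lowest-scoring feature one at a time until n_keep remain.

    score_features is re-run on the surviving features at every step, which
    is what makes the elimination recursive rather than a one-shot ranking.
    """
    if not 1 <= n_keep <= len(features):
        raise ValueError("n_keep must be between 1 and the number of features")
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = score_features(remaining)
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
    return remaining


# Hypothetical stand-in scores (not real SHAP values, not WDBC features).
TOY_IMPORTANCE = {"f_a": 0.40, "f_b": 0.05, "f_c": 0.30, "f_d": 0.01}


def toy_scorer(names: List[str]) -> Dict[str, float]:
    # A real scorer would refit a model on `names`; this one just looks up
    # fixed numbers so the example stays self-contained.
    return {name: TOY_IMPORTANCE[name] for name in names}


selected = recursive_feature_elimination(list(TOY_IMPORTANCE), toy_scorer, n_keep=2)
print(selected)  # ['f_a', 'f_c']
```

Passing the scorer in as a callable keeps the loop independent of any particular model or explanation library, so the same skeleton works whether importance comes from SHAP, permutation importance, or model coefficients.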

The method selected a subset of 26 features, and the LightGBM-PSO model performed best, reaching an accuracy of 99.0% in distinguishing benign from malignant cases. The model also achieved a specificity and precision of 100%, a recall of 97.40%, an F-measure of 98.68%, an AUC of 0.9870, and a 10-fold cross-validation accuracy of 0.9808. The researchers also developed an online tool for breast cancer risk prediction based on this model. The findings suggest that the proposed feature selection and optimization techniques can improve breast cancer prediction accuracy, which may in turn improve patient prognoses.
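The reported F-measure can be checked against the reported precision and recall, since the F-measure is their harmonic mean. The sketch below does that check and also shows how the other metrics follow from a confusion matrix. The function names are illustrative, and the confusion-matrix counts are hypothetical values chosen for the example; the summary does not give the study's actual counts.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary metrics, treating malignant as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_measure": f_measure(precision, recall),
    }


# Consistency check using only the figures quoted in the summary.
reported_f = f_measure(precision=1.0, recall=0.9740)
print(f"{reported_f:.4f}")  # 0.9868

# Hypothetical counts for illustration only, NOT taken from the paper.
example = classification_metrics(tp=45, fp=0, tn=50, fn=5)
print(example["accuracy"], example["recall"], example["specificity"])
```

The check reproduces the reported 98.68%, so the precision, recall, and F-measure figures are mutually consistent.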

Methods

The study's methods are computational. The authors worked with the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, preprocessed it with min-max normalization, and applied Borderline-SMOTE to address class imbalance. Features were selected with the proposed SHAP-RF-RFE algorithm, which combines SHAP values with Random Forest and Recursive Feature Elimination.

Five classifiers (KNN, RF, LR, SVM, and LightGBM) were trained, with hyperparameters tuned by Particle Swarm Optimization (PSO). Performance was assessed with accuracy, precision, recall, specificity, F-measure, AUC, and 10-fold cross-validation.

Results

In this study, training and testing were run on a Windows 11 system with an Intel i5 processor and an NVIDIA RTX 2050 GPU. The model was developed in Python 3.9, with data preprocessing handled by the ‘imblearn’ and ‘pandas’ libraries. The implementation relied on several further packages, including ‘numpy’, ‘sklearn’, ‘shap’, and ‘scikit-opt’.

The online platform for model deployment was built with the ‘streamlit’ package. Together, these tools and libraries form the technical framework of the study, supporting data handling and model optimization.
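To make the optimization step more concrete, here is a library-free sketch of a basic particle swarm optimizer. It does not use or reproduce the ‘scikit-opt’ API, and it is not the authors' configuration: the function name, the swarm settings (inertia, cognitive and social weights), and the toy objective are all assumptions. In a real tuning run, the objective would typically be something like one minus the cross-validated accuracy of a classifier built from the candidate hyperparameters.

```python
import random
from typing import Callable, List, Sequence, Tuple


def particle_swarm_minimize(
    objective: Callable[[List[float]], float],
    bounds: Sequence[Tuple[float, float]],
    n_particles: int = 20,
    n_iterations: int = 100,
    inertia: float = 0.7,
    cognitive: float = 1.5,
    social: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimize `objective` inside box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_val = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=lambda i: personal_best_val[i])
    global_best = personal_best[best_idx][:]
    global_best_val = personal_best_val[best_idx]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best position, and pull toward the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                lo, hi = bounds[d]
                # Clamp the new position so it stays inside the search box.
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)
            value = objective(positions[i])
            if value < personal_best_val[i]:
                personal_best[i] = positions[i][:]
                personal_best_val[i] = value
                if value < global_best_val:
                    global_best = positions[i][:]
                    global_best_val = value
    return global_best, global_best_val


def toy_loss(params: List[float]) -> float:
    # Hypothetical stand-in for a validation loss over two hyperparameters;
    # its minimum is placed at (0.1, 31) purely for illustration.
    a, b = params
    return (a - 0.1) ** 2 + ((b - 31) / 100) ** 2


best_params, best_loss = particle_swarm_minimize(toy_loss, bounds=[(0.01, 0.5), (8, 128)])
print(best_params, best_loss)
```

Fixing the seed makes the run repeatable, which matters when comparing tuned models, since PSO is a stochastic search and different seeds can land on different hyperparameters.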

Discussion

In this study, the authors developed a breast cancer diagnostic model using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which contains 569 samples (357 benign and 212 malignant). Preprocessing included min-max normalization and Borderline-SMOTE to address class imbalance. The model uses SHapley Additive exPlanations (SHAP) to interpret feature contributions, which led to the SHAP-RF-RFE algorithm that integrates SHAP values with Random Forest for feature selection. The LightGBM classifier was the most accurate model, achieving an accuracy of 99% and an AUC of 0.987 with a 26-feature subset.
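The two preprocessing ideas mentioned here can be illustrated with simple arithmetic. The sketch below shows min-max normalization of a single column and the size of the class gap in WDBC. The function name and sample values are illustrative, and the assumption that oversampling brings the minority class up to parity with the majority is mine; the summary does not state the target ratio.

```python
from typing import List


def min_max_normalize(column: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


values = [10.0, 15.0, 20.0, 30.0]  # made-up numbers, not WDBC measurements
scaled = min_max_normalize(values)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

# Class counts quoted in the summary.
n_benign, n_malignant = 357, 212
# Assumption: oversampling to parity, which the summary does not confirm.
synthetic_needed = n_benign - n_malignant
print(synthetic_needed)  # 145
```

In practice the minimum and maximum are usually computed on the training split only and then reused for the test split, so that no information from held-out samples leaks into preprocessing.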

The findings highlight the importance of specific features, such as ‘radius_worst’, ‘area_worst’, and ‘perimeter_worst’, in predicting breast cancer risk. The model’s performance was assessed with several metrics, including accuracy, precision, recall, and ten-fold cross-validation, and the authors report that it outperforms traditional methods. While the model shows promise for early breast cancer diagnosis, the authors acknowledge limitations in generalizability across diverse datasets and the potential computational cost of deployment. Future directions include extending the model to other diseases and integrating more advanced optimization techniques to further improve performance. The authors have also launched an online “Breast Cancer Prediction Tool” to make risk assessment accessible to patients and healthcare professionals.
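Ten-fold cross-validation splits the data into ten disjoint test folds, training on the other nine each time. The following is a minimal sketch of how such index splits can be generated. The function name and seed are illustrative, and 569 is used only because it is the WDBC sample count quoted above; the summary does not say whether the authors cross-validated before or after oversampling, or whether their folds were stratified.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(569, k=10))
print([len(test) for _, test in folds])
# [57, 57, 57, 57, 57, 57, 57, 57, 57, 56]
```

This plain split ignores class labels. With an imbalanced dataset, a stratified split that preserves the benign-to-malignant ratio in every fold is generally the safer choice.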