A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-06426-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40596255
Publication Date: 2025-07-02
Author(s): Suganya Athisayamani et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This study introduces three Double Machine Learning (DML) models intended to improve the accuracy of breast cancer detection on a breast cancer detection dataset. Each model pairs a machine learning component with a deep learning component to extract primary features, and a meta-classifier then fuses those features for the final classification. The first model integrates Random Forest (RF) for structured feature processing with a Feedforward Neural Network (FNN) for capturing non-linear relationships. The second combines eXtreme Gradient Boosting (XGBoost) with an Artificial Neural Network (ANN) to handle both structured and numerical features. The third merges LightGBM with an ANN and targets datasets that contain static and sequential data. Combined with dimensionality reduction techniques such as Principal Component Analysis (PCA), the DML models reach an accuracy of 0.99 in breast cancer detection.
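The first pairing described above (a tree ensemble and a feedforward network fused by a meta-classifier) can be sketched with scikit-learn stacking. This is a minimal illustration under stated assumptions, not the authors' implementation: the copy of the Wisconsin diagnostic data bundled with scikit-learn stands in for the paper's dataset, `MLPClassifier` stands in for the FNN, logistic regression is an assumed choice of meta-classifier, and the layer sizes, the ten PCA components, and the 80/20 split are illustrative values that do not come from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Wisconsin diagnostic dataset stands in
# for the dataset used in the paper.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Two base learners: a tree ensemble and a small feedforward network.
# Hyperparameters are illustrative, not taken from the paper.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("fnn", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                          random_state=0)),
]

# The meta-classifier is trained on the base learners' out-of-fold
# class probabilities (5-fold cross-validation inside the stack).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)

# Standardize, reduce dimensionality with PCA, then fit the stacked model.
model = make_pipeline(StandardScaler(), PCA(n_components=10), stack)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Swapping the `"rf"` entry for a gradient-boosting classifier would give a rough analogue of the second and third pairings. The accuracy this sketch prints depends on the split and the settings chosen here, so it should not be compared directly with the 0.99 reported in the paper.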

The findings underscore the efficacy of several machine learning classifiers, including Support Vector Machine (SVM), logistic regression, and K-nearest neighbors (KNN), when trained on the dataset. Dimensionality reduction through PCA lowers computational complexity, while feature selection based on correlation coefficients identifies key features such as perimeter_worst and radius_worst, which exhibit high correlation values (0.99). The study concludes that the DML framework, particularly the combination of LightGBM and ANN, optimizes performance and mitigates overfitting. Future work may integrate additional imaging modalities to support multi-modal diagnosis and may improve computational efficiency through further optimization.
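The two preprocessing ideas in this paragraph can be illustrated in a short sketch. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data, where the columns are named `worst perimeter` and `worst radius` rather than perimeter_worst and radius_worst. It also assumes that the 0.99 figure refers to the pairwise correlation between these two features, which is one reading of the summary and not something the text confirms. The 95% variance threshold for PCA is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
names = list(data.feature_names)

# Pearson correlation between the two features named in the text
# (scikit-learn column names; an assumed mapping to the paper's names).
i = names.index("worst perimeter")
j = names.index("worst radius")
r = np.corrcoef(X[:, i], X[:, j])[0, 1]
print(f"corr(worst perimeter, worst radius) = {r:.3f}")

# PCA on standardized features, keeping enough components to explain
# 95% of the variance (threshold chosen for illustration only).
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features reduced to {X_reduced.shape[1]} components")
```

Strongly correlated columns such as these carry largely redundant information, which is why PCA can drop many dimensions while retaining most of the variance.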

Discussion

In the discussion, the authors highlight the significant advances in machine learning applications in healthcare, particularly in breast cancer diagnosis. Earlier studies have shown that algorithms such as Support Vector Machines (SVM), logistic regression, and random forests can achieve high classification accuracy on breast cancer datasets. For instance, SVM achieved an accuracy of 99% in one study, while logistic regression yielded a maximum accuracy of 74.47% in another. Despite these successes, existing methods often struggle with feature correlation and high-dimensional datasets, which prompts the authors to propose an approach that incorporates feature selection, dimensionality reduction, and feature fusion to enhance predictive performance.

The authors detail their methodology, which uses correlation coefficients to identify essential features and applies Principal Component Analysis (PCA) for dimensionality reduction. They emphasize the importance of selecting features that correlate strongly with the target variable. The top five features identified are concave points_worst, perimeter_worst, concave points_mean, radius_worst, and perimeter_mean. The paper also introduces a Double Machine Learning (DML) framework that combines traditional machine learning models with deep learning techniques to improve accuracy, robustness, and interpretability. The approach draws on the strengths of both model types and culminates in a meta-classifier that synthesizes the combined outputs, thereby addressing the limitations of single-model systems on complex datasets.
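Correlation-based ranking of this kind can be sketched as follows. The sketch assumes scikit-learn's bundled copy of the Wisconsin diagnostic data, whose column names differ from those quoted above (for example `worst concave points` for concave points_worst), and it assumes Pearson correlation with the binary diagnosis label as the ranking criterion. The summary does not state which correlation coefficient the authors used.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target  # in this copy, 0 = malignant, 1 = benign

# Pearson correlation of each feature with the diagnosis label.
corr = np.array([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])

# Rank by absolute value: the sign only reflects how the label is encoded.
order = np.argsort(-np.abs(corr))
top5 = [(str(data.feature_names[k]), float(corr[k])) for k in order[:5]]

for name, value in top5:
    print(f"{name:25s} {value:+.3f}")
```

Ranking on the absolute value matters here because this copy of the dataset encodes malignant as 0, so features that grow with malignancy show negative coefficients.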