SGA-Driven feature selection and random forest classification for enhanced breast cancer diagnosis: A comparative study

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-95786-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40159513
Publication Date: 2025-03-30
Author(s): Abrar Yaqoob et al.
Primary Topic: AI in cancer detection

Overview

In this study, the authors introduce a methodology for breast cancer classification that combines the Seagull Optimization Algorithm (SGA) for feature selection with the Random Forest (RF) classifier. The authors describe this as the first application of SGA to gene selection for breast cancer diagnosis, where it enables systematic exploration of the feature space to identify the most informative gene subsets. Combining SGA with RF is reported to improve classification accuracy while reducing computational complexity. The proposed method achieved a peak accuracy of 99.01% using 22 selected genes, outperforming other classifiers such as Logistic Regression (LR), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), with mean accuracies ranging from 85.35% to 94.33%.

The findings underscore the effectiveness of the SGA-RF combination in distinguishing tumor characteristics while maintaining a balance between feature reduction and classification performance. The results indicate that the model’s flexibility allows for competitive performance across various feature subsets, enhancing its interpretability and reducing computational demands. This capability is particularly beneficial for clinical applications, as it aids oncologists in making informed decisions regarding diagnosis and treatment strategies. The study suggests that future research could explore the integration of additional nature-inspired algorithms and deep learning models to further improve the method’s performance and applicability in precision medicine.

Introduction

The introduction of the research paper discusses the evolution of feature selection methods and classifiers in breast cancer classification. Traditional techniques, including statistical tests, principal component analysis (PCA), and minimum redundancy maximum relevance (mRMR), have laid the groundwork but often fail to effectively manage the relevance-redundancy trade-off, particularly in the context of non-linear gene expression data. Advanced methods, such as wrapper and embedded approaches, enhance performance but introduce significant computational complexity, especially with high-dimensional datasets. Wrapper methods assess feature subsets based on classifier performance, while embedded methods are constrained by the assumptions of their underlying models, limiting their generalizability.

Recently, nature-inspired optimization algorithms have emerged as effective tools for feature selection, leveraging biological and physical processes. Among these, the Seagull Optimization Algorithm (SGA) stands out for its innovative approach, mimicking the social and migratory behaviors of seagulls to efficiently navigate the search space. SGA’s capability to balance local and global search strategies makes it particularly suitable for high-dimensional feature selection tasks, yet its application in breast cancer classification remains underexplored. Additionally, the introduction highlights the importance of classifier selection, noting that Random Forest (RF) is particularly effective for high-dimensional data due to its ensemble nature and resilience to overfitting. Other classifiers, such as Support Vector Machines (SVM), Logistic Regression (LR), and K-Nearest Neighbors (KNN), each present distinct advantages and limitations, warranting a comparative analysis to assess the proposed method’s effectiveness.

Methods

The experimental setup classifies breast cancer using a dataset of 24,481 gene expression features from 97 samples, labeled as malignant or benign. The methodology begins with data preprocessing: cleaning, handling of missing values through mean and mode imputation, capping of outliers identified with the interquartile range (IQR) method, and encoding of categorical variables for compatibility with machine learning models. The dataset is then normalized using min-max scaling so that features contribute on a comparable scale, followed by stratified sampling to split the data into training and test sets in an 80:20 ratio.
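The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the IQR multiplier `k=1.5`, and the seed are assumptions, and each function works on a single feature column or label list. In practice the capping bounds and scaling range would usually be estimated on the training split only, to avoid leaking test information.

```python
import random
import statistics
from collections import defaultdict


def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting per class before shuffling is what makes the split stratified: each class contributes about 20% of its own samples to the test set, which matters when only 97 samples are available.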

Feature selection is performed using the Seagull Optimization Algorithm (SGA), which is run over multiple iterations to identify the most informative gene subsets robustly. The selected features are then passed to a Random Forest (RF) classifier optimized through hyperparameter tuning. The model’s performance is evaluated using accuracy, sensitivity, specificity, and AUC-ROC, with a reported peak accuracy of 99.01%. To improve generalizability and mitigate overfitting, the authors employ strategies such as feature regularization and parameter tuning, with the aim of reliably distinguishing malignant from benign cases in clinical scenarios.
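For reference, three of the evaluation metrics named above follow directly from the confusion-matrix counts. The sketch below assumes binary labels with `1` standing for the malignant class; the function name and the label encoding are illustrative choices, not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity (recall): share of positive cases that were detected.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of negative cases correctly left unflagged.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Reporting sensitivity and specificity alongside accuracy matters in a diagnostic setting because a missed malignant case and a false alarm carry different costs.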

Results

The results of the study demonstrate the effectiveness of the proposed Seagull Optimization Algorithm (SGA) combined with the Random Forest (RF) classifier for gene selection in breast cancer classification. As detailed in Table 2, the classification accuracy improves with an increasing number of selected genes, peaking at 99.01% accuracy with 22 genes, alongside a mean accuracy of 94.33% and a worst-case accuracy of 88.18%. This optimal performance indicates that a well-chosen subset of features enhances the classifier’s ability to differentiate between malignant and benign cases. However, as the number of selected genes exceeds 22, a decline in performance metrics suggests potential overfitting and the introduction of redundant features.
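The best, mean, and worst-case figures above are summaries over repeated runs at each subset size. A small sketch of that aggregation follows; the accuracy values are placeholders chosen for illustration and are not the paper's data, and the helper names are hypothetical.

```python
import statistics


def summarize_runs(accuracies_by_gene_count):
    """Map each subset size to its best, mean and worst accuracy over runs."""
    return {
        n_genes: {
            "best": max(accs),
            "mean": statistics.fmean(accs),
            "worst": min(accs),
        }
        for n_genes, accs in accuracies_by_gene_count.items()
    }


def best_subset_size(summary, key="best"):
    """Subset size with the highest score; ties go to the smaller subset."""
    return min(summary, key=lambda n: (-summary[n][key], n))


# Placeholder accuracies for illustration only; these are NOT the paper's data.
runs = {
    10: [0.90, 0.92, 0.88],
    20: [0.97, 0.93, 0.90],
    30: [0.95, 0.91, 0.80],
}
summary = summarize_runs(runs)
print(best_subset_size(summary))
```

Breaking ties in favor of the smaller subset reflects the trade-off the study emphasizes: fewer genes at equal accuracy means a simpler model.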

Further analysis reveals that while the Best and Mean Scores improve with more genes, the Worst Score exhibits significant variability, particularly beyond 30 genes, indicating that excessive feature selection can complicate the classification task. The study emphasizes the importance of selecting an optimal number of genes to maintain model accuracy and generalization. Statistical significance tests confirm that the improvements in classification accuracy are robust (p < 0.05), underscoring the reliability of the SGA + RF method in gene selection and its potential for broader applications in cancer classification.

Discussion

The research paper presents a novel approach that combines the Seagull Optimization Algorithm (SGA) for feature selection with the Random Forest (RF) classifier to enhance breast cancer classification. This method effectively addresses the challenges posed by high-dimensional gene expression data, achieving superior accuracy, computational efficiency, and interpretability. The SGA utilizes the migratory behavior of seagulls to explore the feature space, ensuring robust identification of biologically relevant genes while avoiding local optima. This study is the first to apply SGA in breast cancer classification, demonstrating its effectiveness in selecting optimal feature subsets.

The integration of SGA with RF further strengthens the proposed framework, as RF’s capability to handle high-dimensional data and its feature importance metrics complement the feature selection process. Comparative analyses reveal that the SGA-RF combination consistently outperforms traditional classifiers such as Logistic Regression, Support Vector Machines, and K-Nearest Neighbors across various evaluation metrics, including accuracy, precision, recall, and F1-score. Empirical validation on breast cancer gene expression data shows a peak accuracy of 99.01% with only 22 selected genes, underscoring the method’s efficiency and potential for clinical applications. The findings suggest that this approach not only facilitates the development of cost-effective diagnostic tools but also provides insights into the molecular mechanisms of breast cancer, paving the way for future research and treatment strategies.

Limitations

The limitations of the Random Forest algorithm are primarily centered around computational complexity and interpretability. The model can become computationally intensive, particularly when employing a large number of trees or when dealing with datasets that contain numerous features. This complexity can lead to increased training times and slower prediction processes compared to simpler models.

Additionally, while Random Forest enhances the predictive performance by mitigating variance through the aggregation of multiple decision trees, it concurrently diminishes interpretability. Individual decision trees are straightforward to understand; however, the amalgamation of hundreds or thousands of trees complicates the task of discerning the contribution of each feature to the overall decision-making process. The tuning of key hyperparameters, as summarized in Table 1, was conducted through a systematic grid search optimization to strike a balance between model complexity and generalization performance.
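A grid search of the kind mentioned above evaluates every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below is generic and makes several assumptions: the parameter names and candidate values are hypothetical (Table 1 is not reproduced in this summary), and `toy_score` is a stand-in for a real cross-validated accuracy estimate.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # strict: the first combination wins ties
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the paper's actual search space is not given here.
grid = {"n_estimators": [100, 300, 500], "max_depth": [4, 8, None]}


def toy_score(params):
    """Stand-in for cross-validated accuracy; replace with a real evaluation."""
    depth_penalty = 0.02 if params["max_depth"] is None else 0.0
    return 0.9 + params["n_estimators"] / 10000 - depth_penalty


best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The cost grows with the product of the candidate list lengths, which ties back to the computational concerns raised in this section: each extra hyperparameter multiplies the number of forests that must be trained.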