AI-Driven Predictive Modeling for Lung Cancer Detection and Management Using Synthetic Data Augmentation and Random Forest Classifier

Journal: International Journal of Computational Intelligence Systems, Volume: 18, Issue: 1
DOI: https://doi.org/10.1007/s44196-025-00879-4
Publication Date: 2025-06-10
Author(s): Nisreen Innab et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This paper examines the impact of artificial intelligence (AI) on medical research, with a focus on lung cancer detection. Lung cancer remains the deadliest cancer worldwide, which creates a need for diagnostic tools that identify it accurately and early. The study introduces CTGAN-RF, a method that combines a conditional tabular generative adversarial network (CTGAN) with a random forest (RF) classifier and uses synthetic data generation to improve lung cancer detection.

The CTGAN-RF model achieved an accuracy of 0.9893 and precision, recall, and F1 scores of 0.99. The evaluation covered nine classification algorithms, several data balancing techniques such as SMOTE and borderline-SMOTE, and both balanced and unbalanced data configurations. According to the results, CTGAN-RF outperformed the traditional classifiers, particularly in handling class imbalance and improving prediction accuracy. Fivefold cross-validation further supported the model’s reliability. The authors present the approach as superior to existing lung cancer diagnostic methods and as a contribution toward personalized treatment strategies.

Introduction

The introduction highlights the impact of artificial intelligence (AI) and machine learning (ML) across sectors, particularly healthcare, where they support precise and personalized treatment strategies. It focuses on lung cancer, a leading cause of death worldwide, and stresses that early detection and intervention are critical to improving patient survival.

The authors propose a framework that combines synthetic features generated by a Conditional Tabular Generative Adversarial Network (CTGAN) with a Random Forest classifier. It achieves an accuracy of 0.9893, with precision, recall, and F1 scores of 0.99. The framework was evaluated against nine established ML classifiers and validated with fivefold cross-validation, and it outperformed the conventional methods. The findings point to the potential of combining synthetic data generation with ML models to improve early lung cancer detection and to support next-generation diagnostic tools in healthcare. The paper reviews the existing literature, details the methodology, presents the experimental results, and concludes with the implications of the findings.

Methods

The methods section describes a systematic workflow for early identification of lung cancer: dataset acquisition, preprocessing, data splitting, oversampling, and model training. The dataset was divided into 70% for training and 30% for testing, so that performance is measured on data the models have not seen and overfitting can be detected. Accuracy, precision, recall, and F1 score were used to evaluate the models.
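The 70/30 split can be illustrated with a short standard-library sketch. The summary does not say whether the authors stratified the split by class, so stratification here is an assumption, and the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Assumption: stratifying by class is our choice for this sketch;
    the summary only states a 70/30 split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = cancer, 0 = no cancer (imbalanced on purpose).
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the class ratio of the test set close to that of the full dataset, which matters when one class is rare.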

The experiments were run in Python, using various libraries, on a Dell PowerEdge T430 server with dual Intel Xeon processors (8 cores at 2.4 GHz), 32 GB of DDR4 memory, and a GPU with 2 GB of memory. This setup supported the analysis and validation work needed to develop the lung cancer diagnosis models.

Results

Across the ML models applied to the lung cancer dataset, performance varied widely. The Support Vector Machine (SVM) had the lowest accuracy, 0.5648, with precision and recall of 0.56. Logistic Regression (LR) reached an accuracy of 0.6944 and Random Forest (RF) 0.7407. The Extra Trees Classifier (ETC), an ensemble of extremely randomized trees, outperformed all others with an accuracy of 0.7962 and precision, recall, and F1 scores of 0.80.
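For reference, the four reported metrics can be computed from predictions as in the sketch below. It uses the standard binary definitions. The summary does not state whether the reported values are per-class or averaged across classes, and the function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example (not data from the paper): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can look good while recall on the minority class stays poor, which is why all four values are worth reporting together.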

The study also examined how data resampling techniques affect model accuracy. The Synthetic Minority Oversampling Technique (SMOTE) and its combinations improved performance substantially. For instance, after SMOTE-ENN was applied, XGBoost (XGB) and the Stochastic Gradient Descent Classifier (SGDC) reached an accuracy of 0.9838. In k-fold cross-validation, the RF model combined with CTGAN reached an accuracy of 0.9893, which the authors take as evidence of the model’s robustness and ability to generalize when identifying lung cancer cases.
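The k-fold procedure can be sketched as follows, using only the standard library. This is an illustration of fivefold cross-validation in general, not the authors' code: the threshold classifier is a toy stand-in for RF, and all names (`k_fold_indices`, `cross_val_accuracy`, `fit_threshold`, `predict_threshold`) and the data are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(X, y, fit, predict, k=5):
    """Train on k-1 folds, score on the held-out fold, and average."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores), scores


# Toy stand-in for a real classifier: threshold midway between class means.
def fit_threshold(X, y):
    mean0 = mean(x for x, label in zip(X, y) if label == 0)
    mean1 = mean(x for x, label in zip(X, y) if label == 1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, x):
    return int(x > threshold)


X = [0.1 * i for i in range(50)]   # hypothetical single feature
y = [int(x >= 2.5) for x in X]     # hypothetical labels
mean_acc, fold_accs = cross_val_accuracy(X, y, fit_threshold, predict_threshold)
print(mean_acc, fold_accs)
```

Reporting the per-fold scores alongside the mean shows how much the result depends on which samples land in the held-out fold.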

Discussion

The discussion emphasizes the importance of early detection and accurate prediction of lung cancer and reviews advances in machine learning (ML) and deep learning (DL) that have enabled new approaches to both. Recent studies have shown that ML algorithms such as Logistic Regression (LR), XGBoost, and Support Vector Machines (SVM) can improve predictive models for lung cancer risk assessment. Notably, XGBoost achieved an accuracy of 96.92%, while Rotation Forest reached an AUC of 99.3% and SVM an accuracy of 98.8%. These findings illustrate the potential of ML to improve early lung cancer identification and patient outcomes.

The section also identifies gaps in current research that the proposed framework aims to address. It stresses the need for diverse data sources, including clinical and demographic information, to refine predictive models further. The paper also discusses data preprocessing techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants, which help mitigate class imbalance in datasets. By applying these techniques, the study aims to make ML models for lung cancer prediction more robust and reliable, and so to improve diagnostic accuracy and patient care.
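The core idea of SMOTE is to create new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the authors' implementation or a library API. The function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
print(new_points)
```

Resampling of this kind is usually applied to the training portion only, so that synthetic points derived from test samples do not leak into the evaluation.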