Robust predictive framework for diabetes classification using optimized machine learning on imbalanced datasets

Journal: Frontiers in Artificial Intelligence, Volume: 7
DOI: https://doi.org/10.3389/frai.2024.1499530
PMID: https://pubmed.ncbi.nlm.nih.gov/39839971
Publication Date: 2025-01-07
Author(s): Inam Abousaber et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper presents a novel predictive framework for diabetes classification that addresses the class imbalance common in clinical datasets. The study emphasizes combining advanced machine learning algorithms with imbalance-handling techniques, including feature engineering and resampling strategies, to improve predictive accuracy. Rigorous testing on three datasets (PIMA, Diabetes Dataset 2019, and BIT_2019) demonstrated the framework's robustness and adaptability, highlighting the critical role of model selection and imbalance mitigation in achieving reliable diabetes predictions.

In the evaluation of twelve machine learning models combined with five resampling techniques (SMOTE, ADASYN, Borderline-SMOTE, Random Under Sampling, and SMOTEENN), Random Forest paired with SMOTE consistently yielded the highest accuracy, F1-score, and ROC-AUC. This combination effectively addressed the challenges of imbalanced data, enhancing the prediction of minority cases and ensuring high sensitivity and specificity. The findings underscore the clinical relevance of the approach, which aims to reduce the risks of overdiagnosis and missed diagnoses, thereby providing actionable insights for healthcare providers. Future work is suggested to explore the framework’s applicability to larger, more complex datasets and to test advanced deep learning models in real-time clinical settings, facilitating the transition of machine learning applications into practical healthcare interventions.

Introduction

The introduction highlights the escalating global diabetes epidemic, which affected approximately 537 million adults in 2021 and is projected to reach 783 million by 2045. Diabetes, characterized by insulin dysfunction leading to elevated blood glucose levels, poses severe health risks, including cardiovascular disease and kidney failure, which raise healthcare costs and burden healthcare systems. Early detection is emphasized as crucial for effective intervention, particularly to prevent progression from prediabetes to type 2 diabetes. Traditional diagnostic methods, which rely on fasting blood glucose and HbA1c levels, are limited by accessibility and sensitivity, particularly in underserved regions.

The paper advocates for the use of machine learning (ML) as a transformative approach to early diabetes detection, capable of analyzing large datasets to identify at-risk individuals. However, it notes the challenge of class imbalance in medical datasets, where diabetic cases are significantly fewer than non-diabetic cases, leading to biased predictive models. To address this, the study proposes advanced preprocessing techniques and various ML algorithms, including ensemble methods and synthetic sampling techniques like SMOTE and ADASYN, to enhance model performance and sensitivity towards the minority diabetic class. The research aims to develop a comprehensive framework that improves predictive accuracy and generalizability across diverse populations, ultimately contributing to better early-stage diabetes detection and improved patient outcomes. The paper outlines its structure, detailing subsequent sections that review literature, describe datasets and methodologies, present experimental results, and conclude with implications for future research in medical informatics.

Methods

The methodology of this study focuses on developing and evaluating machine learning models aimed at the early detection of diabetes, particularly addressing the challenges associated with imbalanced datasets. The research investigates effective models and techniques for diabetes prediction while assessing the impact of various class imbalance handling methods.
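SMOTE, one of the resampling techniques the study evaluates, balances classes by creating synthetic minority samples along the line segments between a minority point and its nearest minority neighbours. The block below is a minimal, standard-library sketch of that interpolation idea only. It is not the authors' code and not the imbalanced-learn implementation; the function name `smote_like_oversample`, the parameter `k=3`, and the toy feature values are all hypothetical choices made for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Simplified illustration of the SMOTE idea; names and defaults are
    assumptions, not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: four minority samples with two features each,
# to be balanced against a majority class of ten samples.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
majority_count = 10
new_points = smote_like_oversample(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies. In practice, resampling of this kind is normally applied to the training split only, so that evaluation data remain untouched.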

Experimental results are derived from multiple machine learning models tested on three datasets: PIMA, Diabetes Dataset 2019, and BIT_2019. Model performance was evaluated using accuracy, precision, recall, F1-score, specificity, and ROC-AUC, with statistical tests confirming the significance of the findings. The results presented reflect the highest performance values achieved across multiple runs, and the optimal model configurations were saved for future clinical deployment to support early diagnosis and intervention in diabetes management. All experiments used cloud-based computational resources from Kaggle, which provided the processing power needed for efficient data handling and model training. The subsequent section will include raw data, visualizations, and statistical analyses to further assess model robustness and variability.
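To make the listed metrics concrete, the following standard-library sketch computes accuracy, precision, recall, F1-score, and specificity from confusion-matrix counts, and ROC-AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. It is an illustration under stated assumptions, not the study's evaluation code: the helper names, the 0.5 decision threshold, and the toy labels and scores are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels (1 = diabetic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(a, b):
        # Report 0.0 when a metric is undefined (zero denominator).
        return a / b if b else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy set: 8 non-diabetic (0) and 2 diabetic (1) cases.
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts the majority class.
baseline = classification_metrics(y_true, [0] * 10)

# Hypothetical model scores, thresholded at 0.5.
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.6, 0.7, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
model = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)

print(baseline)
print(model, auc)
```

On this toy set the always-negative baseline reaches 0.8 accuracy while its recall is 0.0, which is why accuracy alone can be misleading on imbalanced data and why recall, specificity, F1-score, and ROC-AUC are reported alongside it.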

Discussion

In the discussion section, the paper traces the evolution of machine learning (ML) techniques in diabetes detection, emphasizing the transition from simpler models like Logistic Regression and Decision Trees to more complex ensemble methods such as Random Forests and Gradient Boosting. Early models, while interpretable, struggled with the intricacies of medical data, particularly in imbalanced datasets where non-diabetic cases significantly outnumber diabetic ones. Recent advancements have incorporated ensemble methods and resampling techniques like SMOTE and ADASYN to enhance model sensitivity and classification accuracy. Studies have demonstrated the effectiveness of combining these approaches; for example, Ganie et al. (2023) integrated SMOTE with ensemble methods to improve performance on diabetic datasets.

The paper also underscores the importance of model interpretability in clinical settings, where transparent decision-making is crucial. It references various interpretable models that have shown promise in identifying diabetes risk factors, alongside more complex models like artificial neural networks (ANNs) that have achieved notable accuracy improvements. Despite these advancements, the authors point out a gap in systematic comparisons of imbalance-handling strategies across multiple datasets, which their study aims to address. By evaluating a range of ML models and resampling methods across three prominent datasets (PIMA, Diabetes Dataset 2019, and BIT_2019), the study seeks to establish effective and generalizable strategies for diabetes detection, ultimately contributing to the development of reliable ML models in medical diagnostics.