Efficient diagnosis of diabetes mellitus using an improved ensemble method

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-87767-1
PMID: https://pubmed.ncbi.nlm.nih.gov/39863685
Publication Date: 2025-01-25
Author(s): Blessing Oluwatobi Olorunfemi et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study addresses diabetes, a pressing health problem that contributes significantly to mortality, particularly in developing countries. Although machine learning (ML) has potential for early detection and treatment, previous studies reported low classification accuracy because of challenges such as overfitting and noisy data. This study aims to improve classification accuracy by combining parallel and sequential ensemble ML techniques with feature selection. Using the Pima Indians Diabetes dataset from the UCI ML Repository, the researchers preprocessed the data by replacing missing values and selecting highly correlated features. The dataset was split into training (70%) and testing (30%) subsets, and classification was carried out in Python in a Jupyter Notebook.
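The preprocessing and 70/30 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the summary does not say how missing values were replaced, so treating zeros as missing and filling them with the column median is an assumption, and the helper names (`impute_zeros_with_median`, `train_test_split_70_30`) and the toy rows are hypothetical.

```python
import random
from statistics import median


def impute_zeros_with_median(rows, columns):
    """Replace zeros in the given column indices with the median of that
    column's non-zero values (assumption: zero encodes a missing value)."""
    cleaned = [list(r) for r in rows]
    for c in columns:
        observed = [r[c] for r in rows if r[c] != 0]
        fill = median(observed) if observed else 0
        for r in cleaned:
            if r[c] == 0:
                r[c] = fill
    return cleaned


def train_test_split_70_30(rows, seed=0):
    """Shuffle with a fixed seed and cut at 70% of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy rows: [glucose, bmi, outcome]
toy = [[148, 33.6, 1], [85, 26.6, 0], [0, 23.3, 1], [89, 0, 0], [137, 43.1, 1],
       [116, 25.6, 0], [78, 31.0, 1], [115, 35.3, 0], [197, 30.5, 1], [125, 0, 1]]

clean = impute_zeros_with_median(toy, columns=[0, 1])
train, test = train_test_split_70_30(clean)
print(len(train), len(test))
```

In a stricter pipeline the fill values would be computed on the training portion only and then applied to the test portion, so that no test-set information leaks into preprocessing.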

The methodology comprised two design phases. The first phase built a Random Forest model from J48, Classification and Regression Tree (CART), and Decision Stump (DS) base learners. The second phase applied sequential ensemble methods (XGBoost, AdaBoostM1, and Gradient Boosting) and combined their outputs with an average voting algorithm for binary classification. The ensemble methods achieved a reported classification accuracy of 100%, with all performance metrics (F1 score, MCC, precision, recall, AUC-ROC, and AUC-PR) equal to 1.00, which the authors take to indicate highly reliable predictions of diabetes presence. The findings suggest that the predictive model could help researchers and practitioners make fast and accurate predictions of diabetes mellitus, potentially saving lives.
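To make the reported metrics concrete, the sketch below computes accuracy, precision, recall, F1, and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the label vectors are hypothetical; AUC-ROC and AUC-PR are omitted because they need predicted scores rather than hard labels.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


y_true = [1, 0, 1, 1, 0, 0]
perfect = binary_metrics(y_true, [1, 0, 1, 1, 0, 0])   # every metric is 1.0
one_miss = binary_metrics(y_true, [1, 0, 0, 1, 0, 0])  # one false negative
print(perfect)
print(one_miss)
```

All five values reach 1.0 only when every prediction is correct; a single false negative in this six-sample example already lowers recall, F1, and MCC.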

Methods

The methodology uses an ensemble model to assess the correlation strength of diabetes attributes, applying both forward and backward feature selection. The study implemented parallel and sequential ensemble approaches. In the first experiment, the J48 algorithm, Classification and Regression Tree (CART), and Decision Stump (DS) were used to generate a Random Forest (RF). In the subsequent phase, the same algorithms were combined with three boosting-based ensemble methods: XGBoost, AdaBoostM1, and Gradient Boosting. Final predictions were obtained by average voting, which aggregates the outputs of the base classifiers to improve classification accuracy.
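The average voting step can be illustrated with a short sketch. It assumes "average voting" means averaging each classifier's predicted probability of the positive class and thresholding at 0.5 (often called soft voting); the summary does not confirm this detail. The function name `average_vote` and the probabilities are made up for illustration.

```python
def average_vote(prob_lists, threshold=0.5):
    """Average P(diabetic) across classifiers, then threshold.

    prob_lists holds one list of probabilities per classifier,
    all aligned on the same samples.
    """
    n_models = len(prob_lists)
    averaged = [sum(col) / n_models for col in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical probabilities from three boosted models on three samples
xgb = [0.9, 0.2, 0.6]
ada = [0.8, 0.4, 0.4]
gb = [0.7, 0.3, 0.35]

avg, labels = average_vote([xgb, ada, gb])
print(labels)
```

In the third sample one model votes positive on its own (0.6), but the averaged probability falls below the threshold, so the ensemble predicts the negative class.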

Additionally, the study employed wrapper methods for feature selection. A wrapper method searches the space of feature subsets and judges each candidate subset by training and testing a classifier on it, so the selection is tailored to both the learning algorithm and the dataset. Because evaluating every possible subset is expensive, the search is greedy: forward selection adds one feature at a time and backward selection removes one at a time, scoring each candidate against a defined evaluation criterion. The findings suggest that wrapper methods often yield more accurate predictions than filter methods, which supports their use for selecting diabetes attributes.
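A greedy forward wrapper search can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation. In practice `score_fn` would train and test a classifier on the candidate subset. Here it is replaced by a hypothetical lookup (`toy_score`, with invented gains) so the example runs on its own.

```python
def forward_select(features, score_fn, min_gain=1e-9):
    """Greedy forward selection: repeatedly add the single feature that
    most improves score_fn, stopping when no addition helps."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(features)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored, key=lambda item: item[0])
        if cand_score - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


# Hypothetical stand-in for "train a classifier and measure accuracy"
TOY_GAIN = {"Glucose": 0.20, "BMI": 0.05, "Age": 0.03, "SkinThickness": -0.01}


def toy_score(subset):
    return 0.65 + sum(TOY_GAIN[f] for f in subset)


chosen, score = forward_select(list(TOY_GAIN), toy_score)
print(chosen, round(score, 2))
```

For n features the greedy search scores on the order of n² candidate subsets instead of all 2ⁿ, which is why it is practical but not guaranteed to find the best subset.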

Results

The results section presents the implementation and results of the diabetes prediction model, developed on a dataset of 768 instances and 9 attributes (8 predictor features plus the outcome label) stored in a CSV file. The classes are imbalanced: non-diabetic cases (500) outnumber diabetic cases (268). A heatmap indicated notable correlations between the outcome variable and features such as glucose level, age, BMI, and number of pregnancies. The features were normalized with Standard Scaler, so that each feature was centered and scaled using the mean and standard deviation of the training set.
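The scaling step can be sketched in plain Python as follows: the mean and standard deviation are estimated on the training rows only and then reused for the test rows. The helper names (`fit_scaler`, `transform`) and the toy values are hypothetical; the population standard deviation is used because that is what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Estimate per-column mean and population standard deviation."""
    columns = list(zip(*train_rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stds


def transform(rows, means, stds):
    """Center and scale rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Hypothetical toy rows: [glucose, bmi]
train = [[100.0, 20.0], [120.0, 30.0], [140.0, 40.0]]
test = [[160.0, 30.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # reuses training statistics
print(test_scaled)
```

Fitting on the training set alone keeps test-set information out of the model inputs, which matters when the reported accuracy is measured on the held-out 30%.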

The evaluation compared the model's prediction accuracy with existing studies that applied machine learning techniques to the Pima Indians dataset. The proposed sequential ensemble methods (XGBoost, AdaBoost, and Gradient Boosting) achieved a reported classification accuracy of 100%, at or above the benchmarks set by earlier studies. For instance, Yadav and Pal reported 100% accuracy with Random Forest in parallel ensembles, a figure the current study's methods equal, since 100% cannot be exceeded. Comparisons with other ensemble techniques showed the proposed model matching or exceeding their accuracy rates, which the authors attribute to the combination of feature selection and ensemble learning. Overall, the findings are presented as a significant advance in diagnostic accuracy and as a benchmark for future research in this domain.

Discussion

The discussion reviews recent advances in machine learning (ML) applications for the early detection and management of diabetes mellitus. Various studies have used ML algorithms to predict and diagnose diabetes and its complications. Reported findings include the identification of differentially expressed miRNAs and genes in diabetic nephropathy, and the potential of herbal interventions such as Liuwei Dihuang Decoction to improve insulin sensitivity by modulating the PI3K/Akt signaling pathway. The review also emphasizes the role of ML techniques such as AdaBoost and Support Vector Machines (SVM) in improving diagnostic accuracy across medical domains, including diabetes and heart disease.

Despite these advances, the review identifies critical research gaps, including noisy data, overfitting, and underfitting, all of which compromise the reliability of predictive models in clinical settings. It also notes that sequential and parallel ensemble approaches are rarely combined, although doing so could make models more resilient, and that feature selection techniques for optimizing the input feature set remain underused. The paper therefore advocates a more comprehensive exploration of ensemble methods and advanced feature selection strategies to build a robust predictive model for diabetes mellitus and address the identified gaps.