Robust diabetic prediction using ensemble machine learning models with synthetic minority over-sampling technique

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-78519-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39578517
Publication Date: 2024-11-22
Author(s): Zhenyun Du et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This paper addresses diabetes, a major global health problem characterized by insufficient insulin production or an inadequate response to insulin, which raises blood sugar levels and can lead to severe complications such as kidney disease, vision impairment, and cardiovascular disease. The authors propose a diabetes prediction framework that combines the Synthetic Minority Over-sampling Technique (SMOTE) with ensemble machine learning methods. The pipeline covers missing-data handling, outlier rejection, feature selection via correlation analysis, and class balancing. The combination of AdaBoost and XGBoost achieves an area under the curve (AUC) of 0.968 ± 0.015, outperforming both the alternative approaches tested and existing state-of-the-art models.
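Feature selection via correlation analysis can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`pearson`, `select_features`), the 0.2 threshold, and the toy values are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, labels, threshold=0.2):
    """Keep feature names whose |r| with the label meets the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]

# Toy data invented for illustration (not taken from the Pima dataset).
columns = {
    "glucose": [85, 90, 95, 100, 150, 160, 170, 180],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = select_features(columns, labels)
```

Here the uninformative `noise` column has zero correlation with the label and is dropped, while `glucose` is kept.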

In conclusion, the study predicts diabetes from lifestyle and biological factors, reaching a prediction accuracy of 90.4% on the Pima Indian Diabetes dataset. The preprocessing steps, including data imputation and class balancing, were essential to that accuracy. The current model is limited to structured data; the authors note that future research could incorporate unstructured data and additional predictive features, such as family history and lifestyle habits. The same methodology could also be adapted to predict other conditions, including cancer and cardiovascular disease.

Results

The results section covers data visualization, preprocessing, and machine learning (ML) model performance. Histograms, box plots, and heatmaps were used to expose data characteristics such as skewness, kurtosis, and outliers, which guided the choice of outlier removal method. Applying the Synthetic Minority Over-sampling Technique (SMOTE) balanced the dataset and improved predictive accuracy for the minority class, which matters particularly in disease diagnosis.
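The core idea of SMOTE is to create new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below illustrates that idea in plain Python; the function name `smote`, the choice of `k=2`, the fixed seed, and the toy points are assumptions for the example, and a real project would more likely use an established library implementation.

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbor))
        )
    return synthetic

# Toy two-feature minority class, invented for illustration.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region the minority class already occupies.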

After preprocessing, including missing-value imputation and outlier rejection, the dataset contained 829 records; standardization and normalization further improved model accuracy. The ML models were evaluated with AUC, precision, recall, F1 score, and classification accuracy. The XGBoost model alone achieved an AUC of 0.962 ± 0.012 and an accuracy of 0.901 ± 0.016. The ensemble combining AdaBoost and XGBoost outperformed the individual classifiers, with an AUC of 0.968 ± 0.015 and an accuracy of 0.904 ± 0.023. Comparative analysis showed the proposed methodology ahead of existing state-of-the-art algorithms, as detailed in the paper's figures and tables.
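For readers unfamiliar with these metrics, the sketch below shows how each can be computed from labels and scores, and how a "mean ± standard deviation" figure is obtained from per-fold scores. The function names, the toy labels, and the fold values are invented for illustration. The use of the sample standard deviation is also an assumption, since the summary does not say which variant the authors report.

```python
import statistics

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def classification_report(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (tp + tn) / len(pairs)}

def mean_std(values):
    """Mean and sample standard deviation, e.g. over cross-validation folds."""
    return statistics.mean(values), statistics.stdev(values)

# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 cutoff is an assumption

toy_auc = auc(y_true, scores)
report = classification_report(y_true, y_pred)
fold_mean, fold_std = mean_std([0.90, 0.92, 0.94])  # made-up fold scores
```

AUC is computed from the ranking of the scores, so it does not depend on the 0.5 cutoff, whereas precision, recall, F1, and accuracy all do.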

Discussion

The discussion reviews prior approaches to diabetes prediction and compares the effectiveness of different machine learning algorithms. Previous studies have used a range of classifiers, including Genetic Algorithms, Decision Trees, Random Forests, and ensemble methods such as AdaBoost and XGBoost, with varying accuracy and area under the curve (AUC) scores. The proposed hybrid ensemble of AdaBoost and XGBoost reached an AUC of 0.968 and an accuracy of 0.904, an improvement over traditional methods. The paper emphasizes the importance of preprocessing: the Synthetic Minority Over-sampling Technique (SMOTE) to correct class imbalance, outlier rejection using the interquartile range (IQR), and missing-value imputation, which together improve both the quality of the dataset and the model's predictive performance.
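The imputation and IQR-based outlier rejection steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: median imputation, the conventional 1.5 × IQR fence, the `inclusive` quantile method, the helper names, and the toy `insulin` values are all choices made for the example.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def iqr_bounds(values, factor=1.5):
    """Lower and upper fences: Q1 - factor * IQR and Q3 + factor * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def reject_outliers(rows, column, factor=1.5):
    """Drop rows whose value in `column` falls outside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], factor)
    return [row for row in rows if low <= row[column] <= high]

# Toy column with one missing value and one extreme value (invented data).
raw = [1, 2, None, 4, 5, 100]
imputed = impute_median(raw)
rows = [{"insulin": v} for v in imputed]
cleaned = reject_outliers(rows, "insulin")
```

Imputing before rejecting outliers, as done here, keeps rows that only lack a value while still removing rows with extreme measurements; the order is itself a design choice that the summary does not specify.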

The proposed framework uses the Pima Indian Diabetes dataset, which covers female patients at least 21 years old, and applies rigorous preprocessing to address data quality issues. The findings indicate that combining AdaBoost and XGBoost both improves prediction accuracy and yields a robust model for early diabetes detection. The authors acknowledge limitations, such as the reliance on structured data, and suggest that future research explore unstructured data and additional predictive features, including lifestyle factors and family history.
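This summary does not state how the AdaBoost and XGBoost outputs are combined. One common way to combine two classifiers is soft voting, which averages their predicted probabilities; the sketch below assumes that rule purely for illustration. The function name `soft_vote`, the equal weights, the 0.5 cutoff, and the probability values are all hypothetical.

```python
def soft_vote(prob_sets, weights=None):
    """Weighted average of per-model positive-class probabilities."""
    if weights is None:
        weights = [1.0] * len(prob_sets)  # equal weights by default
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, probs)) / total
            for probs in zip(*prob_sets)]

# Made-up positive-class probabilities for three patients from two models.
adaboost_probs = [0.30, 0.55, 0.80]
xgboost_probs = [0.10, 0.65, 0.90]

ensemble_probs = soft_vote([adaboost_probs, xgboost_probs])
ensemble_labels = [int(p >= 0.5) for p in ensemble_probs]
```

Averaging probabilities lets a confident model outweigh an uncertain one on a given patient, which is one reason an ensemble can edge out its individual members.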

Limitations

The authors identify their reliance on the Pima dataset as the main limitation. Although the hybrid AdaBoost and XGBoost algorithm outperforms other state-of-the-art algorithms on this dataset, results from a single dataset may not generalize to broader populations. Further validation on diverse datasets is needed to confirm the robustness of the proposed hybrid algorithm.