Diabetes Prediction Using Hybrid Supervised and Unsupervised Techniques Based on PIMA Dataset

Journal: Journal of Artificial Intelligence and Technology
DOI: https://doi.org/10.37965/jait.2025.0899
Publication Date: 2025-11-23
Author(s): Ahmad Adel Abu-Shareha et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This research paper addresses the challenges of diabetes prediction using machine learning, particularly those posed by small and imbalanced medical datasets. The authors propose a hybrid framework that combines supervised and unsupervised learning techniques, specifically clustering, feature selection via Mutual Information, and classification, to enhance predictive performance on the PIMA Indian Diabetes Dataset. The framework aims to reduce computational complexity while improving class separability through clustering, which groups similar patient records. A total of thirteen classifiers were evaluated, including ensemble methods such as Random Forest (RF) and eXtreme Gradient Boosting (XGB). The ensemble methods achieved the highest accuracy, precision, recall, and area under the curve (AUC) scores, with RF and XGB both reaching an accuracy of 88.5%.

The study concludes that integrating clustering with classification significantly enhances predictive performance, particularly for complex ensemble-based methods. The results indicate substantial improvements in accuracy, precision, recall, F1 score, and AUC across various classifiers, while simpler models such as Naïve Bayes (NB) and Quadratic Discriminant Analysis (QDA) showed limited gains. The findings underscore the effectiveness of the proposed hybrid approach in real-world predictive modeling tasks and suggest avenues for future research, including advanced clustering techniques, alternative feature selection methods, and hyperparameter tuning to further refine the model's performance.

Introduction

The introduction of the paper addresses the growing public health crisis of diabetes mellitus, a complex metabolic disorder characterized by elevated blood glucose levels, which can lead to severe complications such as cardiovascular diseases and kidney failure. The World Health Organization highlights the increasing prevalence of diabetes, particularly in low- and middle-income countries, underscoring the urgent need for early detection and management to improve patient outcomes and reduce healthcare costs. The authors advocate for the use of machine learning to develop predictive models that can identify individuals at high risk for diabetes based on clinical and demographic data, presenting a more efficient alternative to traditional methods.

Despite the potential of machine learning in diabetes prediction, several challenges persist, including data quality and availability, class imbalance between diabetic and non-diabetic cases, and the difficulty of identifying relevant predictive features. The paper proposes a hybrid approach that combines clustering with classification to enhance predictive sensitivity for minority diabetic cases. By grouping patients based on health attributes and applying feature selection to eliminate irrelevant variables, the proposed method aims to improve model performance and interpretability. The structure of the paper is outlined, with subsequent sections dedicated to a literature review, a detailed explanation of the hybrid framework, evaluation of the approach, and discussion of results.

Methods

In this section, the authors detail the experimental methods used to evaluate their proposed system for diabetes prediction. The experiments were run in Python 3.9 on an Intel Core i7 (1.8 GHz) system, using the scikit-learn and XGBoost (XGB) libraries. Each experiment was repeated five times with varying random seeds to ensure reproducibility, and the statistical significance of the results was assessed using the Wilcoxon signed-rank test at a significance level of α = 0.05.
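The significance check described above can be sketched as follows. This is a minimal illustration, assuming the test is applied to paired scores from a baseline and a hybrid configuration; the helper name `paired_wilcoxon` is hypothetical and the accuracy values are invented placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon


def paired_wilcoxon(baseline_scores, hybrid_scores, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired scores (hypothetical helper)."""
    baseline = np.asarray(baseline_scores, dtype=float)
    hybrid = np.asarray(hybrid_scores, dtype=float)
    if baseline.shape != hybrid.shape:
        raise ValueError("scores must be paired one-to-one")
    # The test is computed on the differences hybrid - baseline.
    statistic, p_value = wilcoxon(hybrid, baseline)
    return statistic, p_value, bool(p_value < alpha)


# Placeholder accuracies for ten paired runs; invented for illustration only.
baseline = np.array([0.74, 0.76, 0.75, 0.77, 0.73, 0.78, 0.75, 0.76, 0.74, 0.77])
gains = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
hybrid = baseline + gains

stat, p, significant = paired_wilcoxon(baseline, hybrid)
print(f"W={stat:.1f}, p={p:.4f}, significant={significant}")
```

The summary does not state which scores were paired. With only five paired observations, the smallest two-sided p-value the exact test can return is 2/2^5 = 0.0625, so a comparison at α = 0.05 needs more than five pairs, for example scores paired across folds or classifiers.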

The workflow of the proposed system is as follows. First, the PIMA Indian Diabetes Dataset is loaded and preprocessed to handle missing values and scale the features. Next, mutual information (MI) is used for feature selection. The authors then apply K-means clustering to identify patient subgroups, running it for several cluster counts and selecting the number of clusters with the Elbow method. Finally, supervised classifiers that incorporate the clustering outcomes are used for diabetes prediction, and model performance is evaluated and compared using standard metrics.
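The four steps above can be sketched end to end with scikit-learn. This is an illustrative sketch under several assumptions, not the authors' implementation: the data is a synthetic stand-in with PIMA's shape, the number of selected features (5) and clusters (3) are arbitrary, and the clustering outcome is incorporated by appending the cluster label as an extra feature, which the summary does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 0

# Synthetic stand-in shaped like PIMA (768 rows, 8 features); not the real data.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=SEED)
rng = np.random.default_rng(SEED)
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing entries

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

# 1. Preprocess: median imputation, then standardization (fit on training data only).
imputer = SimpleImputer(strategy="median").fit(X_tr)
scaler = StandardScaler().fit(imputer.transform(X_tr))
X_tr_p = scaler.transform(imputer.transform(X_tr))
X_te_p = scaler.transform(imputer.transform(X_te))


# 2. Feature selection by mutual information (k=5 is an assumed value).
def mi_score(features, labels):
    return mutual_info_classif(features, labels, random_state=SEED)


selector = SelectKBest(score_func=mi_score, k=5).fit(X_tr_p, y_tr)
X_tr_s, X_te_s = selector.transform(X_tr_p), selector.transform(X_te_p)

# 3. K-means: record inertia for several k so the elbow can be inspected.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=SEED).fit(X_tr_s).inertia_
    for k in range(2, 9)
}
n_clusters = 3  # assumed here; in practice read off the elbow of `inertias`
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=SEED).fit(X_tr_s)

# 4. Append the cluster label as an extra feature and train a classifier.
X_tr_h = np.column_stack([X_tr_s, kmeans.labels_])
X_te_h = np.column_stack([X_te_s, kmeans.predict(X_te_s)])
clf = RandomForestClassifier(n_estimators=200, random_state=SEED).fit(X_tr_h, y_tr)
accuracy = clf.score(X_te_h, y_te)
print(f"hold-out accuracy on synthetic data: {accuracy:.3f}")
```

Fitting the imputer, scaler, selector, and K-means on the training split only keeps information from the test split out of the model. The printed accuracy reflects the synthetic data and is not comparable to the figures reported in the paper.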

Discussion

The paper reviews advances in diabetes prediction using machine learning, covering both supervised and unsupervised methods. Supervised learning techniques, including Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), and ensemble methods, have shown promising results in prior studies, with reported accuracies ranging from 76.3% to 91%. The authors propose a hybrid approach that combines supervised and unsupervised learning, leveraging the strengths of both to address challenges such as class imbalance and feature redundancy. The PIMA Indian Diabetes Dataset serves as the benchmark, with preprocessing steps that include median imputation for missing values and feature selection via Mutual Information (MI).
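The preprocessing steps named above, median imputation and MI-based feature ranking, can be illustrated as follows. The summary does not say how missing values are encoded; in the commonly distributed PIMA CSV, zeros in measurements such as glucose, blood pressure, and BMI are usually treated as missing, and that convention is assumed here. The data frame is randomly generated and only borrows a few of the dataset's usual column names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 300

# Randomly generated placeholder values, not patient records.
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n).round(),
    "BloodPressure": rng.normal(70, 12, n).round(),
    "BMI": rng.normal(32, 6, n).round(1),
    "Age": rng.integers(21, 80, n).astype(float),
})
# Toy outcome that depends on glucose and BMI so MI has some signal to find.
risk = 0.03 * (df["Glucose"] - 120) + 0.1 * (df["BMI"] - 32)
df["Outcome"] = (risk + rng.normal(0, 1, n) > 0.5).astype(int)

# Mimic zero-coded missing measurements in a few rows of each column.
zero_coded = ["Glucose", "BloodPressure", "BMI"]
for col in zero_coded:
    df.loc[rng.choice(n, size=15, replace=False), col] = 0

# Treat zeros as missing, then fill each column with its own median.
df[zero_coded] = df[zero_coded].replace(0, np.nan)
df[zero_coded] = df[zero_coded].fillna(df[zero_coded].median())

# Rank features by estimated mutual information with the outcome.
features = df.drop(columns="Outcome")
mi = mutual_info_classif(features, df["Outcome"], random_state=0)
ranking = pd.Series(mi, index=features.columns).sort_values(ascending=False)
print(ranking)
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed clinical measurements.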

The proposed framework integrates K-means clustering to enhance feature separability, followed by classification using multiple algorithms. Results indicate that ensemble methods such as RF and XGB outperform simpler models, both reaching an accuracy of 88.5%, with F1-scores of 0.836 and 0.835, respectively. The study emphasizes the effectiveness of combining clustering with feature selection to improve predictive performance, particularly for complex classifiers. Statistical significance tests support the robustness of the findings, suggesting that the hybrid model can be a valuable tool in real-world diabetes prediction scenarios. Future research directions include exploring advanced clustering techniques and optimizing hyperparameters to further enhance model performance.
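A comparison of this kind, a classifier with and without the cluster feature scored on the same metrics, might look like the sketch below. It is a hedged illustration only: the data is synthetic, Random Forest stands in for the thirteen classifiers, the cluster count of 3 is assumed, and `evaluate` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the five metrics named in the text for a fitted binary classifier."""
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, prob),
    }


# Synthetic, imbalanced stand-in data; not the PIMA dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Hybrid variant: append the K-means cluster label as an extra feature.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_tr)
X_tr_h = np.column_stack([X_tr, km.predict(X_tr)])
X_te_h = np.column_stack([X_te, km.predict(X_te)])

baseline = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
hybrid = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr_h, y_tr)

results = {
    "baseline": evaluate(baseline, X_te, y_te),
    "hybrid": evaluate(hybrid, X_te_h, y_te),
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

On synthetic data the hybrid variant may or may not beat the baseline; the sketch shows how the metrics are computed side by side, not the gains reported in the study.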

Limitations

The research highlights several limitations in current diabetes prediction studies. While prediction accuracy has improved notably, many existing models fail to address crucial aspects such as scalability and interpretability, which are essential for their application in real-world healthcare settings. Furthermore, the predominant reliance on single datasets, such as the PIMA dataset, restricts the generalizability of findings, as these datasets often reflect the characteristics of a specific population.

Additionally, hybrid methods, although beneficial in some respects, tend to complicate implementation. They require a careful balance between supervised and unsupervised learning components, which can pose challenges in practical applications. Addressing these limitations is vital for advancing diabetes prediction methodologies and ensuring their effectiveness across diverse populations.