Advanced supervised machine learning methods for precise diabetes mellitus prediction using feature selection

Journal: Frontiers in Medicine, Volume: 12
DOI: https://doi.org/10.3389/fmed.2025.1620268
PMID: https://pubmed.ncbi.nlm.nih.gov/41001366
Publication Date: 2025-09-10
Author(s): Gufran Ahmad Ansari et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This paper investigates machine learning techniques (MLT) for the early prediction of diabetes mellitus (DM), a chronic condition with severe health consequences if left untreated. Using the Pima Indians Diabetes Dataset (PIDD) from the UCI repository, the study compares four supervised models: Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Random Forest (RF). The authors report using 10-fold cross-validation to address class imbalance and improve the generalizability of the findings. SVM achieved the highest accuracy at 91.5%, followed by RF (90%), KNN (89%), and NB (83%), which underscores how strongly algorithm choice affects predictive performance.
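To make the validation setup concrete, the following is a minimal sketch of stratified k-fold splitting in plain Python. The source does not say whether the folds were stratified; stratification is an assumption here, shown because it keeps the class ratio roughly constant in every fold. The function name `stratified_kfold` and the toy labels are hypothetical, and the 500/268 split only mirrors the class counts commonly reported for PIDD.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels (assumption): 500 negatives and 268 positives.
labels = [0] * 500 + [1] * 268
splits = list(stratified_kfold(labels, k=10))
positives_per_fold = [sum(labels[i] for i in test) for _, test in splits]
```

Each sample appears in exactly one test fold, and every fold holds 26 or 27 positives, so a per-fold score is never computed on a fold that is missing the minority class.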

The study contributes a framework for diabetes risk assessment that combines multiple classifiers with a set of evaluation metrics: accuracy, precision, recall, and F1-score. It stresses the need for effective data-driven diagnostic tools as the volume of healthcare data grows. The research is limited by its reliance on a structured dataset; future work aims to apply the framework to unstructured data and to extend it to other medical domains, including tumor classification and cardiovascular conditions. The authors also propose incorporating lifestyle factors into the predictive model to make it more relevant in real-world clinical settings.
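The four metrics named above all derive from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```

With these counts, accuracy is 0.85 while recall is only about 0.67, which illustrates why accuracy alone can hide missed positive cases.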

Introduction

The introduction describes diabetes mellitus (DM) as a chronic disease marked by elevated blood glucose levels due to insufficient insulin production by pancreatic beta cells. It can lead to severe complications in the kidneys, eyes, and cardiovascular system, and it affects older adults in particular. Common symptoms include frequent urination, excessive thirst, and increased hunger; without treatment, diabetes can significantly diminish quality of life.

The introduction also highlights the challenges in developing countries, where limited access to diagnostic tools hampers early detection and contributes to higher diabetes-related mortality. In response, the paper emphasizes the growing role of MLT in medical diagnostics. MLT has shown promise in analyzing healthcare data to identify patterns that support early diagnosis and personalized treatment, and it has been applied to predicting other diseases, including hepatitis and cancer.

Methods

The methodology is organized around a conceptual framework that integrates the essential machine learning steps into a workflow for medical diagnosis. As illustrated in Figure 2, the framework separates the data into training, testing, and correlation datasets, which allows targeted feature analysis before modeling. Incorporating k-fold cross-validation early in the workflow makes model validation more consistent and helps guard against overfitting.
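One way to read the feature analysis step is as a correlation-based ranking of features against the outcome. The sketch below assumes Pearson correlation, which the source does not specify; the helper names `pearson` and `rank_features` and the toy values are hypothetical and are not taken from PIDD.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort (name, score) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy columns (hypothetical values for illustration).
columns = {
    "glucose": [85, 90, 150, 160, 170, 95],
    "bmi": [22.0, 31.0, 28.0, 35.0, 30.0, 33.0],
    "constant": [1, 1, 1, 1, 1, 1],
}
outcome = [0, 0, 1, 1, 1, 0]
ranking = rank_features(columns, outcome)
```

A ranking like this only captures linear, one-feature-at-a-time relationships, so it is best treated as a screening step before model-based selection.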

Unlike traditional workflows, the framework uses a layered evaluation strategy that separates performance scoring from result analysis, which the authors present as improving interpretability and reliability in clinical contexts. Its modular design also allows straightforward adaptation to other healthcare applications.

Results

SVM outperformed the other classifiers in diabetes prediction, achieving an accuracy of 91.5%. McNemar’s test confirmed that the improvement in SVM’s accuracy over the alternative models was statistically significant (p < 0.05). The evaluation metrics in Table 3 show that SVM also achieved high precision (96%) and recall (93%), giving an F1-score of 94%.

In diabetes prediction, the balance between precision and recall is critical. High precision minimizes false positives, so that most predicted diabetic cases are truly diabetic, while high recall ensures that actual diabetic patients are identified and treatment is not delayed. The trade-off between the two should be chosen according to clinical priorities: excessive false positives can cause unnecessary patient anxiety and testing, while low recall can jeopardize patient health.
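McNemar's test compares two classifiers on the same test cases by looking only at the cases where they disagree. The sketch below uses the exact binomial form of the test; the source does not state which variant the authors used, so that choice is an assumption. The function name `mcnemar_exact` and the toy predictions are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant predictions of A and B."""
    # b: A right and B wrong; c: A wrong and B right.
    # Cases where both are right or both are wrong do not enter the test.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, discordant cases split 50/50 between b and c.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy predictions (hypothetical): A beats B on 12 cases, B beats A on 1.
y_true = [1] * 20
pred_a = [1] * 19 + [0]
pred_b = [0] * 12 + [1] * 8
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

In this toy case the p-value is 28/8192, roughly 0.0034, so the difference would count as significant at the 0.05 level.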

Discussion

The discussion outlines the main types of diabetes: Type 1 diabetes (T1D), Type 2 diabetes (T2D), and gestational diabetes mellitus (GDM). T1D is an autoimmune condition that requires lifelong insulin therapy and primarily affects younger populations. T2D, which accounts for over 90% of diabetes cases, is linked to insulin resistance and rising obesity rates, making it a significant public health concern. GDM occurs during pregnancy and poses risks to both mothers and infants. The need for better predictive analytics is underscored by International Diabetes Federation projections that 700 million people will have diabetes by 2045.

The paper advocates MLT to improve diabetes prediction accuracy and to address limitations of prior studies, such as inadequate validation and feature selection. The proposed framework employs four supervised techniques (SVM, KNN, NB, and RF) and incorporates preprocessing steps that include normalization and exploratory data analysis. With SVM, the study reports an accuracy of 91.5%, which the authors take as evidence of the potential of MLT for early diabetes detection. The research aims to provide a scalable solution for healthcare professionals, particularly in resource-limited settings, and outlines a roadmap for extending these methods to other diseases.
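The normalization step can be illustrated with min-max scaling, which matters for distance-based models such as KNN and SVM. The source does not say which normalization the authors applied, so min-max scaling is an assumption, as are the helper names `fit_min_max` and `apply_min_max` and the toy rows. The bounds are learned from training rows only, so no information from the test rows leaks into preprocessing.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) bounds from training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Scale rows with the training bounds; constant columns map to 0.0.

    Values outside the training range land outside [0, 1].
    """
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, bounds)
        ])
    return scaled


# Toy rows (hypothetical): two features per sample.
train = [[80.0, 20.0], [120.0, 30.0], [160.0, 40.0]]
test = [[100.0, 45.0]]
bounds = fit_min_max(train)
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
```

The second test value scales to 1.25 because it exceeds the training maximum, which is expected when bounds are fitted on the training split alone.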