An automated approach to predict diabetic patients using KNN imputation and effective data mining techniques

Journal: BMC Medical Research Methodology, Volume: 24, Issue: 1
DOI: https://doi.org/10.1186/s12874-024-02324-0
PMID: https://pubmed.ncbi.nlm.nih.gov/39333904
Publication Date: 2024-09-27
Author(s): Abdulaziz Altamimi et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses diabetes detection in underdeveloped nations, where early diagnosis and effective medical care are critical. It notes that missing data in existing studies often compromises the accuracy and robustness of diabetes detection methods. To address this, the authors propose an automated diabetes prediction method built on a stacked ensemble voting classifier that combines three machine learning models, with a KNN imputer to fill in missing values. The proposed model reports an accuracy of 98.59%, precision of 99.26%, recall of 99.75%, F1 score of 99.45%, and Matthews correlation coefficient (MCC) of 99.24%. A comparison with seven other machine learning techniques shows the proposed model performing best, which the authors take as evidence that the KNN imputer improves diabetes detection.

In conclusion, the study finds that machine learning techniques detect diabetes more accurately when missing data is handled during preprocessing. The conclusion cites an accuracy of 97.49% for the XGB + RF + ETC ensemble, whereas the figure reported alongside the other performance metrics for the same configuration is 98.59%. The authors recommend combining the KNN imputer with ensemble models in diabetes detection frameworks. Future work will explore stacking models in both machine learning and deep learning settings to improve performance and generalizability, especially on higher-dimensional datasets.

Introduction

The introduction of the research paper highlights the critical role of healthcare professionals in diagnosing and treating various health conditions, particularly diabetes mellitus (DM). DM is characterized by inadequate insulin production or utilization, leading to hyperglycemia, which can cause severe organ damage. The prevalence of diabetes has escalated, with 8.5% of adults affected in 2014 and a projected 629 million cases by 2045. The financial burden of diabetes care is substantial, estimated at nearly $825 billion annually. The paper categorizes DM into four types: Type 1, Type 2, gestational, and prediabetes, and identifies various risk factors, including obesity, sedentary lifestyle, and genetic predisposition.

The authors argue for a holistic approach to diabetes management that integrates technology and lifestyle modifications. They propose the use of meal recommendation systems, physical activity tracking, drug adherence tools, and interactive chatbots to enhance patient engagement and education. The paper emphasizes the necessity for collaboration among technology developers, healthcare providers, and patients to create user-friendly solutions, particularly in developing countries where diabetes prevalence is rising. Furthermore, the introduction outlines the study’s objective to develop a machine learning-based ensemble model for predicting diabetes, addressing challenges such as missing data and comparing the proposed model’s performance against existing methodologies.

Methods

In this section, the authors describe how they handle missing values in their dataset with the KNN imputer, which finds each record’s nearest neighbors by Euclidean distance and fills a missing value with the mean of the values observed in those neighbors. Following this preprocessing step, various machine learning models were trained and evaluated on the imputed dataset. The performance metrics of these models are summarized in Table 6 and illustrated in Figure 4.
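The imputation idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `nan_euclidean` and `knn_impute`, the choice of k = 2, and the sample values are all hypothetical. The distance skips coordinates that are missing in either row and rescales the result, which is similar in spirit to the default behaviour of scikit-learn's `KNNImputer`.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows,
    rescaled to compensate for the coordinates that were skipped."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # no common coordinates: treat as infinitely far
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, k=2):
    """Return a copy of rows in which each NaN is replaced by the mean of
    that feature over the k nearest rows that have the feature observed."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows where feature j is present.
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and not math.isnan(other[j])
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical records: the third feature of the first row is missing.
nan = float("nan")
rows = [
    [1.0, 2.0, nan],
    [1.0, 2.0, 10.0],
    [2.0, 2.0, 20.0],
    [9.0, 9.0, 90.0],
]
filled = knn_impute(rows, k=2)
print(filled[0])  # the gap is filled with the mean of the two closest donors
```

In practice the features would usually be scaled first, since Euclidean distance is dominated by whichever attribute has the largest numeric range.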

The Logistic Regression (LR) model achieved an accuracy of 86.52%, with precision, recall, and F1-score of 93.64%, 96.74%, and 94.61%, respectively. The Decision Tree (DT) model did somewhat better, with an accuracy of 89.71%. The Random Forest (RF) model improved further, reaching an accuracy of 93.44% with precision of 96.45%, recall of 97.98%, and F1-score of 97.22%. Extreme Gradient Boosting (XGB) and the Extra Trees Classifier (ETC) performed better still, with ETC reaching the highest accuracy among the individual models at 95.3% and precision, recall, and F1-score of around 97%. The Voting Classifier (VC), which combines XGB, RF, and ETC, performed best overall, with an accuracy of 98.59%, precision of 99.26%, recall of 99.75%, and F1-score of 99.45%. These findings indicate that ensemble methods improve performance on this classification task, while also suggesting that model selection should be tailored to the dataset’s specific characteristics. The ROC-AUC curve for the proposed model is presented in Figure 5.
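To make the voting step concrete, the sketch below combines the positive-class probabilities of three base models by averaging them (soft voting). This summary does not state whether the authors used soft or hard voting, so the averaging rule, the 0.5 threshold, the function name `soft_vote`, and the probabilities are assumptions chosen for illustration only.

```python
def soft_vote(prob_by_model, threshold=0.5):
    """Average positive-class probabilities across models, then threshold.

    prob_by_model maps a model name to a list of probabilities,
    one per sample; all lists must have the same length.
    """
    n_models = len(prob_by_model)
    n_samples = len(next(iter(prob_by_model.values())))
    averaged = [
        sum(probs[i] for probs in prob_by_model.values()) / n_models
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels

# Hypothetical probabilities from three base models for four patients.
probs = {
    "XGB": [0.9, 0.4, 0.2, 0.6],
    "RF":  [0.8, 0.6, 0.1, 0.3],
    "ETC": [0.7, 0.7, 0.3, 0.3],
}
averaged, labels = soft_vote(probs)
print(labels)
```

For the second patient, XGB alone would predict the negative class, but the averaged probability crosses the threshold; this is the sense in which a voting ensemble can correct an individual model's error.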

Results

This section presents the results of the machine learning (ML) models for predicting diabetes mellitus. The experiments used Python 3.8 with the TensorFlow and scikit-learn libraries in a Jupyter Notebook environment, on a system equipped with an 11th generation Intel Core i7 processor. Model performance was assessed using F1 score, recall, precision, and accuracy, with a focus on handling missing values in the dataset prior to model application.
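The four metrics named above all derive from the counts of a binary confusion matrix. The following sketch computes them from scratch; the function names and the label vectors are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Precision penalizes false positives and recall penalizes false negatives, so a model can score well on one while doing poorly on the other; F1 is their harmonic mean.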

The Decision Tree (DT) model achieved an accuracy of 78.34%, with precision of 88.61%, recall of 90.55%, and an F1-score of 89.87%. The Extra Trees Classifier (ETC) and Random Forest (RF) outperformed DT, with accuracies of 84.18% and 82.75%, respectively, and precision, recall, and F1-scores of 91.45%. XGBoost (XGB) had the highest accuracy, 84.61%, with balanced performance across all metrics. A voting classifier (VC) combining XGB, RF, and ETC achieved an accuracy of 81.13% and a precision of 94.56%, indicating a low rate of false positives. In addition, 5-fold cross-validation showed consistently strong performance across all evaluation metrics with a small standard deviation between folds, which supports the reliability of the results.
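The cross-validation procedure mentioned above can be illustrated with a small sketch that splits sample indices into five folds and summarizes per-fold scores by their mean and standard deviation. The helper names and the fold scores are hypothetical; the paper's actual fold results are not reproduced here. For simplicity the folds are contiguous, whereas a real experiment would normally shuffle or stratify the data first.

```python
import statistics

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    The first n_samples % k folds receive one extra sample so that
    every sample appears in exactly one test fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size

def summarise(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Each sample lands in exactly one test fold.
for train_idx, test_idx in k_fold_indices(12, k=5):
    print(len(train_idx), test_idx)

# Hypothetical accuracies from five folds.
mean_acc, std_acc = summarise([0.96, 0.97, 0.98, 0.97, 0.97])
print(mean_acc, std_acc)
```

A small standard deviation relative to the mean indicates that the score does not depend heavily on which samples happened to fall in the test fold.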

Discussion

In the discussion section, the paper highlights the significant advances in diabetes prediction achieved by combining data mining and machine learning techniques. Various studies have demonstrated the efficacy of machine learning algorithms in predicting diabetes from risk factors such as blood pressure, insulin levels, glucose, and body mass index (BMI). One ensemble classifier combining soft voting and confusion-matrix-based integration achieved an accuracy of 81% in predicting Type-2 diabetes using electronic health records. Other studies reported higher accuracies, with some ensemble models reaching up to 95% by employing advanced techniques such as deep learning and transfer learning.

The paper also emphasizes the importance of data preprocessing, particularly the handling of missing values, which can significantly affect model performance. The KNN imputer was used to address missing data in the diabetes dataset sourced from Kaggle, in which many attributes contain zero values. The proposed approach for diabetes detection combines three algorithms, Random Forest (RF), Extra Trees Classifier (ETC), and XGBoost (XGB), into an ensemble model that improves prediction accuracy by drawing on the strengths of each individual classifier. Accuracy, precision, recall, and the Matthews Correlation Coefficient (MCC) were used to assess model performance, and on these metrics the proposed ensemble method outperforms existing models in the literature.
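Because MCC is less familiar than accuracy or F1, a short worked example may help. The sketch below evaluates the standard formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), on hypothetical confusion-matrix counts; the function name `mcc` and the counts are illustrative and do not come from the paper. The coefficient ranges from −1 to 1, and the paper reports it as a percentage.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any marginal total is zero, the usual convention
    for an undefined denominator.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts: 4 true positives, 1 false positive,
# 1 false negative, 4 true negatives.
print(mcc(tp=4, fp=1, fn=1, tn=4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class is much more common than the other.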

Limitations

The proposed model presents several limitations that impact its overall effectiveness. Primarily, its performance is significantly influenced by the quality of the input data, as both KNN imputation and data mining results rely on the completeness and accuracy of this data. Additionally, the model’s sensitivity to KNN parameters, such as the number of neighbors (K) and the choice of distance metrics, complicates the optimization process for imputation quality and prediction accuracy.

Furthermore, the complexity inherent in the data mining techniques employed may hinder interpretability, making it difficult to identify and understand the influential features relevant to diabetes prediction. Lastly, scalability issues arise due to the resource-intensive nature of the computations involved, which may limit the model’s applicability to larger datasets or real-time scenarios.