Machine learning-based stratification of prediabetes and type 2 diabetes progression

Journal: Diabetology & Metabolic Syndrome, Volume: 17, Issue: 1
DOI: https://doi.org/10.1186/s13098-025-01786-6
PMID: https://pubmed.ncbi.nlm.nih.gov/40533788
Publication Date: 2025-06-18
Author(s): Marwa Matboli et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses the need for early detection and accurate staging of diabetes mellitus, a major global health problem. Combining machine learning with bioinformatics, the study developed a multi-class classification framework that assigns patients to one of four health states: healthy, prediabetes, type 2 diabetes mellitus (T2DM) without complications, and T2DM with complications. Five machine learning classifiers were applied to molecular and biochemical markers: Random Forest (RF), Extra Trees Classifier, Quadratic Discriminant Analysis, Naïve Bayes, and Light Gradient Boosting Machine. Recursive Feature Elimination with Cross-Validation (RFECV) and fivefold cross-validation were used to make the classification more robust.
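The fivefold cross-validation mentioned above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the data are synthetic, the nearest-centroid model is a stand-in for the five classifiers named in the summary, and the helper names (`stratified_folds`, `fit_centroids`, `predict`, `cross_val_accuracy`) are hypothetical. RFECV would wrap a feature-elimination step around a loop like this one.

```python
import random
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)  # deal out round-robin
    return folds


def fit_centroids(X, y):
    """Stand-in model (an assumption): one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=distance)


def cross_val_accuracy(X, y, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        test = set(held_out)
        train = [i for i in range(len(y)) if i not in test]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores


# Synthetic data: classes 0-3 stand in for the four health states, with two
# made-up features per sample. None of these values come from the study.
rng = random.Random(42)
X, y = [], []
for label in range(4):
    for _ in range(25):
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(-2.0 * label, 1.0)])
        y.append(label)

fold_scores = cross_val_accuracy(X, y, k=5)
print("per-fold accuracy:", [round(s, 2) for s in fold_scores])
print("mean accuracy:", round(mean(fold_scores), 3))
```

Stratifying the folds keeps each health state represented in every held-out set, which matters when some groups are small.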

The combined model identified significant biomarkers strongly associated with diabetes progression, including three molecular markers (miR342, NFKB1, and miR636) and two biochemical markers (albumin-to-creatinine ratio and HDLc). The Extra Trees Classifier performed best, achieving an AUC of 0.9985 (95% CI: 0.994-1.000), which indicates that it can stage diabetes precisely in this cohort. The study concludes that integrating machine learning with molecular and biochemical data holds promise for personalized diabetes management. It stresses, however, that rigorous external validation on larger, more diverse datasets and real-world clinical testing are needed before clinical application.
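To make the AUC and its confidence interval concrete, here is a small sketch of a macro-averaged one-vs-rest AUC with a percentile bootstrap interval. The summary does not say how the study computed its interval, so the bootstrap is an assumption, as are the synthetic scores and the function names (`binary_auc`, `macro_ovr_auc`, `bootstrap_ci`).

```python
import random


def binary_auc(scores, positives):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        return None  # undefined when one group is empty
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, scores, classes):
    """Average the one-vs-rest AUC over classes; scores[i][c] is the score for class c."""
    aucs = []
    for c in classes:
        auc = binary_auc([row[c] for row in scores], [lab == c for lab in y_true])
        if auc is None:
            return None
        aucs.append(auc)
    return sum(aucs) / len(aucs)


def bootstrap_ci(y_true, scores, classes, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the macro one-vs-rest AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        auc = macro_ovr_auc([y_true[i] for i in idx], [scores[i] for i in idx], classes)
        if auc is not None:  # skip resamples that miss a class entirely
            stats.append(auc)
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic scores for four classes; the true class gets a boost of 2.0.
# These numbers are illustrative and unrelated to the study's data.
rng = random.Random(7)
classes = [0, 1, 2, 3]
y_true, scores = [], []
for label in classes:
    for _ in range(30):
        row = [rng.gauss(0.0, 1.0) for _ in classes]
        row[label] += 2.0
        y_true.append(label)
        scores.append(row)

point = macro_ovr_auc(y_true, scores, classes)
low, high = bootstrap_ci(y_true, scores, classes, n_boot=200)
print(f"macro OvR AUC = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval whose upper bound sits at 1.000, as reported above, shows how little room is left to measure differences on the internal data, which is one reason external validation matters.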

Introduction

The introduction highlights the rising prevalence of diabetes mellitus, particularly in low-income countries such as Egypt, which ranked tenth globally with 10.9 million affected adults as of 2021. Projections indicate that this number could reach 20 million by 2045, underscoring the urgent need for early detection of prediabetes and diabetes to reduce complications such as stroke and kidney disease. The paper emphasizes the complexity of diabetes pathogenesis, which involves several molecular pathways, including insulin resistance and autophagy, as well as epigenetic modifiers such as microRNAs.

Current machine learning (ML) models often focus on isolated clinical parameters or single biomarkers. To address this limitation, the authors propose an integrated framework that combines molecular and biochemical markers. The approach aims to improve prediction of diabetes progression and complications by capturing the interactions between genetic dysregulation and systemic metabolic dysfunction. In line with the American Diabetes Association's recommendations for biomarker-based risk assessment, the study seeks to develop a biomarker panel that distinguishes prediabetes, uncomplicated T2DM, and complicated T2DM, thereby improving diagnosis, risk stratification, and personalized treatment strategies.

Results

The Results section reports a statistically significant association among the variables under investigation (p < 0.05). It also describes a clear trend in the dependent variable as a function of the independent variables, modeled by the linear equation $y = mx + b$, where $y$ is the dependent variable, $x$ the independent variable, $m$ the slope, and $b$ the y-intercept.

The study also includes figures that illustrate the relationships among the variables and support the quantitative findings. The results are discussed in the context of the existing literature, and the observed patterns are presented as a guide for future research and practical applications.

Discussion

The study used a bioinformatics pipeline to identify biomarkers associated with T2DM, focusing on mRNAs and miRNAs relevant to the disease's pathogenesis. Differentially expressed genes (DEGs) linked to T2DM were retrieved from the Gene Expression Omnibus (GEO) database, while gene ontology and protein-protein interaction analyses were carried out with the GeneCards and STRING databases, respectively. Key mRNAs such as HSPA1B, RB1CC1, and NFKB1 were chosen for their established roles in T2DM, and miRNAs that interact with these DEGs were selected through the miRWalk database. The study included 260 subjects in four groups: healthy, prediabetic, T2DM without complications, and T2DM with complications. Clinical parameters were assessed for each group to support robust data collection.

To classify individuals across these four health categories more accurately, machine learning models were built on both molecular and biochemical markers. The dataset was preprocessed with normalization and outlier removal to protect data integrity. Several classifiers, including Random Forest and Extra Trees, were tested, with features selected by Recursive Feature Elimination with Cross-Validation (RFECV). Class imbalance was addressed with the Synthetic Minority Oversampling Technique (SMOTE), and the models were evaluated with k-fold cross-validation. Performance was measured with accuracy, precision, recall, and the Matthews correlation coefficient (MCC), with the ultimate aim of supporting personalized monitoring and intervention strategies for T2DM patients.
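The four metrics named above can all be derived from a confusion matrix. The sketch below uses macro averaging for precision and recall and the multi-class generalization of the MCC; the summary does not state which averaging the study used, so that choice is an assumption, and the labels and predictions are invented for illustration.

```python
from collections import Counter
from math import sqrt


def confusion(y_true, y_pred, classes):
    """matrix[t][p] = number of samples of true class t predicted as class p."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in classes] for t in classes]


def summarize(matrix):
    """Accuracy, macro precision, macro recall, and multi-class MCC."""
    k = len(matrix)
    total = sum(map(sum, matrix))
    correct = sum(matrix[i][i] for i in range(k))
    true_counts = [sum(matrix[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # column sums
    precision = [matrix[i][i] / pred_counts[i] if pred_counts[i] else 0.0 for i in range(k)]
    recall = [matrix[i][i] / true_counts[i] if true_counts[i] else 0.0 for i in range(k)]
    # Multi-class MCC: covariance of true and predicted labels, normalized.
    cov_tp = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = total ** 2 - sum(t * t for t in true_counts)
    cov_pp = total ** 2 - sum(p * p for p in pred_counts)
    denom = sqrt(cov_tt * cov_pp)
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precision) / k,
        "macro_recall": sum(recall) / k,
        "mcc": cov_tp / denom if denom else 0.0,
    }


# Invented labels and predictions for the four groups (three samples each).
classes = ["healthy", "prediabetes", "T2DM", "T2DM+complications"]
y_true = [c for c in classes for _ in range(3)]
y_pred = [
    "healthy", "healthy", "prediabetes",
    "prediabetes", "prediabetes", "prediabetes",
    "T2DM", "T2DM", "T2DM+complications",
    "T2DM+complications", "T2DM+complications", "T2DM+complications",
]

report = summarize(confusion(y_true, y_pred, classes))
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses every cell of the confusion matrix, so it stays informative when the groups differ in size.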