Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-52023-5
PMID: https://pubmed.ncbi.nlm.nih.gov/38267466
Publication Date: 2024-01-24
Author(s): Moa Lugner et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study investigates predictors of incident type 2 diabetes using an XGBoost classification model trained on UK Biobank data from about 450,000 participants aged 40-69 (448,277 in the analysed cohort). Over a decade of follow-up, 12,148 individuals developed type 2 diabetes. The analysis identified HbA1c as the strongest predictor, followed by BMI, waist circumference, blood glucose, family history of diabetes, gamma-glutamyl transferase (GGT), waist-hip ratio, HDL cholesterol, age, and urate. The full model achieved an area under the receiver operating characteristic curve (ROC-AUC) of 0.90, and a reduced model using only these 10 features retained an ROC-AUC of 0.88.
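To put the case count in proportion, the crude cumulative incidence can be worked out from the two figures quoted in this summary. The sketch below is only arithmetic on those numbers; the variable names are illustrative, and it uses the 448,277 analysed participants as the denominator. It ignores censoring and competing risks, so it should be read as a rough proportion rather than a formal incidence estimate.

```python
# Figures quoted in the summary; variable names are illustrative.
participants = 448_277    # participants in the analysed cohort
incident_cases = 12_148   # developed type 2 diabetes during follow-up

# Crude proportion of participants who became cases during follow-up.
cumulative_incidence = incident_cases / participants

# Class imbalance the classifier has to cope with.
non_cases_per_case = (participants - incident_cases) / incident_cases

print(f"Crude cumulative incidence: {cumulative_incidence:.2%}")
print(f"Non-cases per incident case: {non_cases_per_case:.1f}")
```

With roughly one case for every 36 non-cases, a ranking metric such as ROC-AUC is more informative than raw accuracy, since a model that predicts "no diabetes" for everyone would already be correct about 97% of the time.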

The findings indicate that easily measured biological markers predict type 2 diabetes better than traditional risk factors such as diet, physical activity, and socioeconomic status. The study highlights the potential of machine learning and large datasets in health research and argues for identifying high-risk individuals before the onset of diabetes so that interventions can be timely. The top 10 predictors alone yield high discrimination (ROC-AUC 0.88), and the remaining variables add only marginally to it (ROC-AUC 0.90 for the full model).

Methods

The study is a quantitative, observational analysis of UK Biobank cohort data. Baseline variables were selected and preprocessed, leaving 419 variables for model development with XGBoost. The outcome was incident type 2 diabetes, defined by the first occurrence of ICD code E11 during 10 years of follow-up, and model performance was summarized by ROC-AUC.

Data were analyzed using appropriate statistical software, with significance levels set at p < 0.05. The researchers employed various statistical tests, such as t-tests and ANOVA, to compare group means and assess the relationships between variables. Additionally, regression analysis was conducted to explore predictive relationships, allowing for a deeper understanding of the underlying patterns within the data. Overall, the methods were rigorously designed to ensure the reliability and validity of the findings.

Discussion

In this study, data from 448,277 UK Biobank participants were analyzed to identify key predictors of type 2 diabetes over a 10-year follow-up period. The primary outcome was incident diabetes, defined by the first occurrence of ICD-10 code E11. Baseline HbA1c was the strongest predictor of diabetes risk, followed by BMI, waist circumference, plasma glucose, family history of diabetes, and other metabolic indicators. Notably, biological factors clearly outweighed lifestyle and socioeconomic factors in predicting diabetes risk within this cohort.
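The outcome definition above (first E11 code within 10 years of baseline) can be sketched as a small labelling function. This is a minimal illustration, not the authors' pipeline: the function names and the date-based inputs are hypothetical, and treating participants with an E11 code on or before baseline as excluded prevalent cases is an assumption that the summary does not state.

```python
from datetime import date
from typing import Optional

FOLLOW_UP_YEARS = 10  # follow-up window stated in the summary


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def incident_t2d(baseline: date, first_e11: Optional[date]) -> Optional[bool]:
    """Label one participant (hypothetical helper).

    first_e11 is the date of the first recorded E11 code, or None if absent.
    Returns True for an incident case, False for a non-case, and None for a
    participant who already had the code at baseline.
    """
    if first_e11 is None:
        return False
    if first_e11 <= baseline:
        # Assumption: prevalent diabetes at baseline is excluded, not labelled.
        return None
    return first_e11 <= add_years(baseline, FOLLOW_UP_YEARS)


# Usage: a diagnosis four years after baseline counts as an incident case.
print(incident_t2d(date(2008, 5, 1), date(2012, 3, 15)))
```

Returning None for prevalent cases keeps them distinguishable from non-cases, so a caller can drop them before training.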

The study used an extreme gradient boosting (XGBoost) algorithm for model development, with a dataset of 419 variables retained after selection and preprocessing. Model performance was evaluated using several metrics, with an ROC-AUC of 0.90 for the main model. Sex-specific models were also developed and revealed differences in predictors between males and females, particularly in the influence of urate levels. The findings underscore the importance of biological markers in diabetes prediction and suggest that, while genetic factors play a role, phenotypic features and family history may provide more robust predictive capabilities. Limitations include potential biases due to the “healthy volunteer” effect and the demographic homogeneity of the UK Biobank population, which may limit the generalizability of the results.
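Since ROC-AUC is the headline metric here, it may help to recall what it measures: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The sketch below computes it directly from that definition in plain Python. It is not the evaluation code used in the study; the function name is illustrative, and the toy labels and scores are made up purely to show the call.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC by pairwise comparison of cases and non-cases.

    Each (case, non-case) pair scores 1 if the case is ranked higher,
    0.5 on a tie, and 0 otherwise; the AUC is the mean over all pairs.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    non_cases = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Toy data (made up): three of the four case/non-case pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of participants, so it suits small illustrations only; at cohort scale a rank-based formulation or a library routine would be the practical choice. On this reading, an ROC-AUC of 0.90 means the model ranks a future case above a non-case in about 90% of such pairs.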