Web-based cardiovascular disease risk prediction using machine learning

Journal: Frontiers in Artificial Intelligence, Volume: 9
DOI: https://doi.org/10.3389/frai.2026.1690664
PMID: https://pubmed.ncbi.nlm.nih.gov/41766947
Publication Date: 2026-02-13
Author(s): Suraiya Akhter et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses cardiovascular disease (CVD), the leading cause of morbidity and mortality worldwide, and emphasizes the need for effective predictive tools that support early risk assessment and informed clinical decisions. Using data from the National Health and Nutrition Examination Survey (NHANES), the authors investigate four feature-selection strategies to determine the most predictive factors for CVD risk: (1) Pearson correlation combined with the Chi-squared test, (2) Alternating Decision Tree (ADT)-based scoring, (3) Cross-Validated Feature Evaluation (CVFE), and (4) Hypergraph-Based Feature Evaluation (HFE).

The study uses three machine learning models, random forest (RF), support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost), to evaluate the predictive value of the selected features. HFE combined with SVM performed best, with an accuracy of 82.84% and an area under the curve (AUC) of 0.9027. Key predictors identified include age, total cholesterol, history of hypertension, cholesterol-lowering medication use, recent prescription medication use, smoking history, family income-to-poverty ratio, gender, educational attainment, and red cell distribution width. The findings underscore the role of strategic feature selection in improving both predictive accuracy and model interpretability, giving clinicians a resource for assessing cardiovascular risk and optimizing preventive care. A web application lets users obtain predictions and view SHAP plots derived from the model.

Introduction

The introduction frames cardiovascular diseases (CVDs) as a critical global health challenge: they account for approximately 17.9 million deaths annually, nearly one-third of all deaths worldwide. Coronary artery disease is identified as the predominant form, comprising about 64% of cases. CVD risk is multifactorial, with modifiable factors such as high cholesterol, diabetes, and lifestyle choices playing significant roles alongside non-modifiable factors such as age and ethnicity. The authors stress the need for effective early detection and prevention, particularly because traditional risk assessment tools often oversimplify the complex interactions among risk factors.

To address these limitations, the paper advocates machine learning (ML) techniques, which can model nonlinear relationships and high-order interactions among diverse variables and thereby improve predictive accuracy. The authors note that although ML methods have shown promise for CVD risk prediction, gaps remain in the representation of social and behavioral determinants and in the deployment of user-friendly, interpretable models. The study proposes an ML pipeline that integrates robust feature-selection methods with SHAP-driven interpretability, using NHANES data. The framework aims to identify and prioritize influential factors in CVD risk assessment, ultimately supporting personalized treatment strategies and clinical decision-making through a publicly accessible web application.

Methods

The authors detail their methodology, as illustrated in Figure 1. The process begins with data collected from individuals categorized as either at risk for cardiovascular disease (CVD) or not. From this dataset, a pool of candidate features is generated. To keep only the relevant features, the authors apply several evaluation techniques: Pearson correlation, the Chi-squared test, Alternating Decision Tree (ADT), Cross-Validated Feature Evaluation (CVFE), and Hypergraph-Based Feature Evaluation (HFE). These methods filter out features that show low relevance or minimal impact.
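The two simplest filters named above, Pearson correlation and the Chi-squared test, can be sketched in a few lines. This is an illustrative, standard-library-only sketch and not the authors' code: the function names `pearson_r` and `chi_squared` and the toy values for `age`, `smoker`, and `cvd` are assumptions made for the example, and the summary does not state which thresholds the paper applied to these scores.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uninformative in this sketch
    return cov / math.sqrt(var_x * var_y)


def chi_squared(feature, label):
    """Chi-squared statistic of the contingency table of two categorical sequences."""
    n = len(feature)
    f_levels = sorted(set(feature))
    l_levels = sorted(set(label))
    # Observed counts for every (feature level, label level) cell.
    observed = {(f, l): 0 for f in f_levels for l in l_levels}
    for f, l in zip(feature, label):
        observed[(f, l)] += 1
    row_tot = {f: sum(observed[(f, l)] for l in l_levels) for f in f_levels}
    col_tot = {l: sum(observed[(f, l)] for f in f_levels) for l in l_levels}
    stat = 0.0
    for f in f_levels:
        for l in l_levels:
            expected = row_tot[f] * col_tot[l] / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


# Toy data (hypothetical values, not taken from NHANES).
age = [34, 45, 52, 61, 67, 70, 29, 58]
smoker = [0, 1, 0, 1, 1, 1, 0, 0]
cvd = [0, 0, 0, 1, 1, 1, 0, 1]

print("Pearson r (age vs. CVD):", round(pearson_r(age, cvd), 3))
print("Chi-squared (smoker vs. CVD):", round(chi_squared(smoker, cvd), 3))
```

A numeric feature would typically be scored with the correlation and a categorical one with the Chi-squared statistic; features with scores near zero are the candidates for removal.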

After feature selection, the refined feature subsets are used to train three machine learning models: random forest (RF), support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost). The predictive performance of each model is then assessed, allowing a comparison of how well each predicts CVD risk from the selected features.
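The overview reports accuracy and AUC as the performance measures. A minimal sketch of both follows, assuming that AUC means the area under the ROC curve (the usual convention, though the summary does not spell it out). The function names and the four-sample toy data are hypothetical; the AUC is computed from its pairwise definition, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    Each positive/negative pair counts 1 if the positive is scored higher,
    0.5 on a tie, and 0 otherwise.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores from some classifier).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen cut-off, whereas AUC depends only on how the scores rank the cases, which is why the two are usually reported together.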

Discussion

The authors used data from the NHANES 2021-2023 cycle to develop a machine learning framework for predicting cardiovascular disease (CVD) risk. The dataset included demographic, clinical, and laboratory data from 335 individuals with CVD and 3,187 without; the two classes were balanced through random undersampling before predictive modeling. The outcome variable was defined from self-reported, physician-diagnosed CVD conditions, following CDC guidelines. Several feature-selection techniques, including Pearson correlation, Chi-squared tests, Alternating Decision Trees (ADT), Cross-Validated Feature Evaluation (CVFE), and Hypergraph-Based Feature Evaluation (HFE), were used to refine the initial set of 31 candidate predictors. Ultimately, 15 features were retained, with the HFE method yielding the highest predictive performance when combined with a support vector machine (SVM) classifier.
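Random undersampling, the balancing step mentioned above, keeps every minority-class sample and draws an equally sized random subset of the majority class. The sketch below is illustrative and uses only the standard library. The class counts (335 and 3,187) come from the text, while the function name `random_undersample`, the placeholder row IDs, and the seed are assumptions.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # local generator so the result is reproducible
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        # Sample without replacement; the smallest class is kept in full.
        for row in rng.sample(members, n_min):
            balanced.append((row, label))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    out_rows = [row for row, _ in balanced]
    out_labels = [label for _, label in balanced]
    return out_rows, out_labels


# Placeholder participant IDs; only the class sizes are taken from the text.
labels = [1] * 335 + [0] * 3187
rows = list(range(len(labels)))

bal_rows, bal_labels = random_undersample(rows, labels)
print(len(bal_rows), "samples,", sum(bal_labels), "with CVD")
```

With these counts the balanced set holds 670 samples, 335 per class. The trade-off is that most of the majority-class records are discarded.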

The SHAP analysis showed that key predictors of CVD risk included age, total cholesterol, blood pressure, socioeconomic factors, and lifestyle behaviors such as smoking. The study also highlighted complex relationships among features, such as the cholesterol paradox and the influence of socioeconomic status on cardiovascular outcomes. The findings emphasize the importance of integrating diverse data dimensions to improve predictive accuracy and model interpretability. The accompanying web application lets users assess CVD risk and see how each feature contributes to a prediction, supporting clinical decision-making and future research. Overall, the study demonstrates the potential of advanced machine learning techniques to improve CVD risk prediction while providing clinically relevant insights.
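SHAP attributes a prediction to individual features using Shapley values from cooperative game theory. The sketch below computes exact Shapley values for a tiny hand-written model by enumerating all feature subsets. It illustrates the underlying idea only and is not the SHAP library or the authors' model: the linear `toy_risk_score`, its coefficients, and the single baseline individual are all hypothetical, and practical SHAP implementations average over background data and use approximations because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition are set to their baseline value.
    """
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk_score(x):
    """Hypothetical linear score over [age, hypertension, smoker]."""
    age, hypertension, smoker = x
    return 0.03 * age + 0.5 * hypertension + 0.4 * smoker


instance = [65, 1, 1]   # individual being explained (made-up values)
baseline = [50, 0, 0]   # reference individual (made-up values)

phi = shapley_values(toy_risk_score, instance, baseline)
for name, contribution in zip(["age", "hypertension", "smoker"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline. For a linear model like this one, each contribution reduces to the coefficient times the feature's distance from the baseline, which makes the output easy to check by hand.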