Enhanced cardiovascular disease risk prediction using explainable machine learning and data balancing

Journal: Discover Computing, Volume: 29, Issue: 1
DOI: https://doi.org/10.1007/s10791-026-09973-3
Publication Date: 2026-02-16
Author(s): Raduana Khawla et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This study presents a machine learning (ML) framework for cardiovascular disease (CVD) risk prediction that targets two problems: class imbalance and model interpretability. The framework uses four tree-based ensemble classifiers: Gradient Boosting (GB), Extreme Gradient Boosting (XGB), CatBoost (CB), and Extra Trees (ET). It applies the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes. Model performance is evaluated on two benchmark datasets, Heart Disease Classification (HDC) and Cardiovascular Disease (CD), using 5-fold cross-validation and several metrics, including accuracy, recall, F1-score, and ROC-AUC. CatBoost was the most effective model, reaching accuracies of 98.88% on HDC and 99.44% on CD, with high sensitivity and specificity.
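As a reminder of what the reported metrics measure, the following minimal sketch computes them from confusion-matrix counts. The function name `classification_metrics` and all counts are illustrative assumptions. They are not values or code from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive predictions).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, not taken from the paper: 10 positives, 90 negatives.
# A classifier that labels every patient as negative:
always_negative = classification_metrics(tp=0, fp=0, tn=90, fn=10)
# A classifier that finds most positives at the cost of a few false alarms:
balanced_model = classification_metrics(tp=8, fp=5, tn=85, fn=2)

print(always_negative)
print(balanced_model)
```

In this toy case the always-negative classifier reaches 0.9 accuracy with a recall of 0.0, which is why accuracy alone says little on imbalanced data and why recall, F1-score, and specificity are reported alongside it.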

The findings point to the framework's potential for clinical application: besides high predictive accuracy, it uses explainable artificial intelligence (XAI) techniques, namely SHAP and LIME, to show how each feature contributes to a prediction, and it identifies troponin and glucose levels as key predictors. The study acknowledges limitations, including the lack of demographic diversity in the datasets and the computational cost of hyperparameter tuning. Future research will aim to validate the framework with longitudinal and multimodal data to improve its generalizability and applicability in real-world clinical settings. Overall, the authors present the SMOTE-CatBoost framework as a robust, interpretable, and clinically relevant tool for CVD risk prediction.

Introduction

Cardiovascular diseases (CVDs) are a leading cause of global mortality and pose significant challenges to healthcare systems, particularly in low- and middle-income regions where advanced diagnostic resources are scarce. Timely identification of individuals at high cardiovascular risk is essential for effective preventive interventions, yet conventional risk assessment tools, such as the Framingham Risk Score and SCORE, often fail to capture the complex, nonlinear interactions among various risk factors, limiting their generalizability across diverse populations. Recent advancements in machine learning (ML) have shown promise in improving cardiovascular risk prediction by leveraging high-dimensional clinical data to identify intricate patterns that traditional statistical methods may overlook.

Despite the potential of ML models, two major challenges hinder their clinical adoption: the inherent class imbalance in cardiovascular datasets, which skews learning algorithms towards the majority class and compromises sensitivity for high-risk patients, and the lack of interpretability in many high-performing ML models, which operate as black boxes. While techniques like the Synthetic Minority Over-sampling Technique (SMOTE) have been employed to address class imbalance, there remains a gap in systematically integrating these approaches with ensemble learning and explainable artificial intelligence (XAI) to enhance both predictive performance and clinical transparency.
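The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified, standard-library illustration of that interpolation step, not the implementation used in the study. The function name `smote_sketch`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-feature minority class (hypothetical values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point lies between two real minority samples, SMOTE fills in the minority region instead of duplicating records. The same property explains why it can produce ambiguous samples where the classes overlap.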

Methods

The proposed methodological framework for robust cardiovascular disease (CVD) risk prediction is structured as a multi-stage pipeline, as depicted in Figure 1. This framework encompasses systematic data preprocessing, strategies for mitigating class imbalance, ensemble-based machine learning (ML) techniques, and thorough model evaluation. It is specifically tailored to tackle common challenges encountered in real-world clinical data, including issues related to missing values and skewed class distributions.

The design of this workflow prioritizes reliable predictive performance while ensuring interpretability and strong generalization across diverse datasets. By addressing these critical aspects, the methodology aims to enhance the accuracy and applicability of CVD risk predictions in clinical settings.

Discussion

The discussion addresses why CVD is difficult to predict: risk factors interact in complex ways, and traditional risk assessment methods have clear limitations. Conventional approaches often rely on linear models and limited feature sets, which leads to suboptimal predictive performance, particularly in high-risk patient populations. The paper argues for a robust machine learning (ML) framework that improves predictive accuracy while also addressing class imbalance and model interpretability, both of which are essential for clinical trust and adoption.

The authors propose a Balanced Ensemble Learning Framework that integrates data balancing techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE), with advanced ensemble ML models such as Gradient Boosting, XGBoost, CatBoost, and Extra Trees. This framework aims to enhance minority-class detection while maintaining high predictive performance across diverse datasets. The study includes a comprehensive evaluation of the proposed models using two publicly available cardiovascular datasets, demonstrating significant improvements in predictive metrics such as accuracy, recall, and ROC-AUC, particularly after applying SMOTE. Furthermore, the integration of explainable AI techniques, including SHAP and LIME, enhances the interpretability of model predictions, thereby aligning the framework with clinical decision-making needs. Overall, the findings underscore the effectiveness of the proposed ensemble approach in delivering reliable and interpretable cardiovascular risk predictions, addressing both predictive performance and clinical applicability.
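SHAP attributions are based on Shapley values, which split the difference between a prediction and a baseline prediction among the input features. The sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function. It does not use the SHAP library or the paper's model: `toy_risk`, its coefficients, and the input values are made up, and the feature names are only labels borrowed from the predictors mentioned in the text.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one prediction (small n only)."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance value, the rest the baseline.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_risk(x):
    """Hypothetical score with an interaction term; `age` is deliberately unused."""
    troponin, glucose, age = x
    return 3.0 * troponin + 1.0 * glucose + 0.5 * troponin * glucose


instance = [2.0, 1.0, 0.7]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, instance, baseline)
print(dict(zip(["troponin", "glucose", "age"], phi)))
```

The attributions sum to the gap between the instance score and the baseline score, the interaction term is shared equally between the two features involved, and the unused feature receives zero. The brute-force loop grows exponentially with the number of features, which is why practical tools rely on faster, model-specific or approximate methods.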

Limitations

The main limitation of the proposed framework concerns SMOTE (Synthetic Minority Over-sampling Technique): in regions where the classes overlap, interpolated samples can be noisy and may introduce bias. To mitigate this, SMOTE was applied exclusively to the training data so that no synthetic samples leak into evaluation, and model performance was assessed with stratified cross-validation. Robustness was checked with several evaluation metrics and independent test sets to confirm that the observed performance improvements were not artifacts of oversampling. These measures aimed to improve sensitivity to the minority class while preserving model generalization and reliability.
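The leakage safeguard described above can be illustrated with a small standard-library sketch: the data are split into stratified folds first, and oversampling is applied only to the training portion of each split. For brevity, random duplication stands in for SMOTE here, which is an assumption of this sketch. The helper names and the toy labels are also hypothetical.

```python
import random


def stratified_folds(labels, n_splits=5, seed=0):
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


def oversample_minority(train_idx, labels, rng):
    """Balance classes by duplicating training indices (stand-in for SMOTE)."""
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choice(idx) for _ in range(target - len(idx)))
    return balanced


# Toy labels (hypothetical): 10 positive and 40 negative cases.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels)
rng = random.Random(1)

splits = []
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Oversample the training indices only; the test fold stays untouched.
    train_balanced = oversample_minority(train_idx, labels, rng)
    splits.append((train_balanced, test_idx))
    print(k, len(train_balanced), len(test_idx))
```

Each test fold keeps the original class ratio and contains no resampled records, so the scores measured on it reflect performance on data distributed like the original set.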

Additionally, while the framework showcases strong predictive performance and high clinical interpretability through the integration of data balancing, ensemble machine learning, and explainable AI, it is limited to established ensemble techniques applied to structured tabular datasets. This design choice was made to prioritize model transparency and computational efficiency. Future research directions include the exploration of hybrid feature fusion strategies and the incorporation of advanced deep learning and reinforcement learning models, particularly for sequential and longitudinal cardiovascular datasets. Such advancements are anticipated to improve adaptability and dynamic risk prediction, building on the solid foundation established in this study.