Predicting cardiovascular risk with hybrid ensemble learning and explainable AI

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-01650-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40410273
Publication Date: 2025-05-23
Author(s): Pooja Shah et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study addresses the pressing need for better risk prediction of cardiovascular diseases (CVDs), which remain a leading cause of death worldwide. It introduces a hybrid ensemble learning framework that combines Gradient Boosting, CatBoost, and Neural Networks in a stacked ensemble architecture. The approach achieves an AUC-ROC of 0.82, with precision, recall, and F1-score of 81%, 83%, and 82%, respectively. Interpretability is supported by Explainable AI techniques such as SHAP values, together with dimensionality reduction methods such as t-SNE and PCA, allowing clinicians to understand how individual risk factors influence predictions.

The reported accuracy of 82% may seem modest compared with other applications, but the study emphasizes the challenges posed by heterogeneous and noisy healthcare data. The hybrid model shows significant improvements in recall and AUC-ROC over the base models, highlighting its potential for early detection of high-risk individuals. Future research directions include expanding the dataset to improve generalizability, integrating real-time data from wearables for continuous monitoring, and employing more advanced Explainable AI tools to improve model transparency. The study acknowledges limitations such as reliance on a single dataset and the need for further hyperparameter optimization, yet it lays a solid foundation for developing actionable machine learning solutions in healthcare.

Methods

The Methods section details data collection and preprocessing. The researchers used three publicly available datasets (the Cleveland Heart Disease dataset, the Hungarian dataset, and the IEEE Dataport Cardiovascular Disease Dataset), yielding a combined dataset of 70,000 instances with 12 clinical features, including age, gender, blood pressure, BMI, cholesterol, and glucose. The dataset was class-imbalanced, with more healthy cases than cardiovascular patients. To address this, the Synthetic Minority Over-sampling Technique (SMOTE) was used to oversample the minority class, complemented by random undersampling of the majority class. Preprocessing involved filling missing values by mean or mode imputation, detecting and removing outliers with the interquartile range (IQR) method, and normalizing continuous features to the range [0, 1] via Min-Max scaling.
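The three preprocessing steps named above can be sketched for a single numeric column as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the 1.5 × IQR fence multiplier, and the sample blood pressure values are all assumptions.

```python
from statistics import mean, quantiles


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the common convention; the summary does not state the value used.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical systolic blood pressure column with one gap and one outlier.
systolic = [120, 130, None, 125, 140, 118, 135, 400]

filled = impute_mean(systolic)                       # step 1: imputation
low, high = iqr_bounds(filled)                       # step 2: IQR fences
kept = [v for v in filled if low <= v <= high]       #         drop outliers
scaled = min_max_scale(kept)                         # step 3: scale to [0, 1]
```

The steps run in the order the text lists them. Note that with this order the imputed value is computed before the outlier is dropped, so the fill value is pulled toward the outlier.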

The dataset was then split into training (80%) and testing (20%) sets using stratification, so that the class distribution was preserved in both subsets, which is critical when evaluating models on imbalanced data. The preprocessing workflow included feature engineering to create new variables such as Body Mass Index (BMI) and a cholesterol-to-glucose ratio, as well as interaction terms between blood pressure measurements. After these steps, applying SMOTE improved the model’s ability to discriminate between classes: the AUC-ROC rose from 0.75 to 0.82, indicating that the class imbalance was successfully mitigated. Exploratory data analysis, including feature distribution visualizations and correlation analysis, was also conducted to better understand the relationships among features and their impact on cardiovascular risk.
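To make the stratified split and the oversampling idea concrete, here is a small standard-library sketch. It is a simplified stand-in for SMOTE (it interpolates between a minority sample and one of its nearest minority neighbours, which is the core idea) and it omits the random undersampling step. The function names, the synthetic two-feature data, and the choice to oversample only the training portion are assumptions; the summary does not state the order in which splitting and resampling were applied.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical imbalanced data: 70 healthy (0) and 30 CVD (1) cases, 2 features.
data_rng = random.Random(42)
labels = [0] * 70 + [1] * 30
features = [(data_rng.gauss(y, 1.0), data_rng.gauss(y, 1.0)) for y in labels]

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# Oversample the minority class in the training portion until classes match.
train_minority = [features[i] for i in train_idx if labels[i] == 1]
n_majority = sum(1 for i in train_idx if labels[i] == 0)
synthetic = smote_like(train_minority, n_new=n_majority - len(train_minority))
```

Keeping synthetic points out of the test set, as done here, means the evaluation reflects only real cases; whether the study did the same is not stated in this summary.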

Results

The Results section evaluates the proposed hybrid ensemble model for cardiovascular risk prediction across multiple dimensions. The model handles multidimensional data effectively and demonstrates robust classification, particularly on imbalanced datasets. The use of several metrics and visualizations aids interpretation, giving a clearer picture of the data transformations and of model behavior in real-world applications.

Key findings include clustering patterns in a 2D scatter plot of systolic versus diastolic blood pressure, which highlight the complexity of cardiovascular risk factors. The scatter plot of Body Mass Index (BMI) against systolic blood pressure indicates a positive correlation with increased cardiovascular risk, shown by a color-coded gradient. The complementary distribution plot of the cholesterol-to-glucose ratio further distinguishes individuals at higher risk, emphasizing the importance of these features in model development. Confusion matrices for the different models, detailing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are provided to quantify each model’s performance, with a numerical summary presented in Table 4.
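For readers who want to see how the four confusion-matrix counts map to the reported metrics, the standard definitions can be sketched as below. The counts are made up for illustration and are not taken from the paper's Table 4; the function names are likewise assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)            # share of predicted positives that are real
    recall = tp / (tp + fn)               # share of real positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for 200 test cases (NOT the values in Table 4).
metrics = classification_metrics(tp=83, tn=81, fp=19, fn=17)

# Consistency check on the figures quoted in the Overview:
# precision 81% and recall 83% imply an F1-score that rounds to 82%.
reported_f1 = f1_from(0.81, 0.83)
```

The last line shows that the three headline percentages in the Overview are mutually consistent, since the harmonic mean of 0.81 and 0.83 rounds to 0.82.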

Discussion

The Discussion emphasizes the pressing need for better cardiovascular disease (CVD) risk prediction methods, given the high mortality associated with CVDs worldwide. Traditional risk prediction models often rely on linear assumptions and so fail to capture the complex, nonlinear relationships inherent in medical data. Recent advances in ensemble machine learning have shown promise in improving predictive accuracy; however, their “black-box” nature limits their clinical applicability. The authors propose a hybrid ensemble learning model that integrates top-performing machine learning classifiers with Explainable AI (XAI) tools, such as SHAP (SHapley Additive exPlanations), to provide cardiovascular risk predictions that are both accurate and interpretable. This approach aims to bridge the gap between predictive performance and clinical validity, thereby delivering trustworthy AI tools for healthcare professionals.
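The idea behind SHAP can be illustrated with exact Shapley values on a toy model. The sketch below is not the SHAP library and not the paper's model: it enumerates every feature ordering against a single baseline patient, which is only feasible for a handful of features. The risk function, its coefficients, and the patient values are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution over
    all orderings in which features are switched from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(p):
    """Hypothetical nonlinear risk score with a blood pressure x BMI interaction."""
    score = (
        0.02 * (p["systolic_bp"] - 120)
        + 0.05 * (p["bmi"] - 25)
        + 0.01 * (p["age"] - 50)
    )
    if p["systolic_bp"] > 140 and p["bmi"] > 30:
        score += 0.3  # extra risk only when both factors are elevated
    return score


baseline = {"systolic_bp": 120, "bmi": 25, "age": 50}   # reference patient
patient = {"systolic_bp": 160, "bmi": 32, "age": 60}    # patient to explain

phi = shapley_values(toy_risk, patient, baseline)
total_effect = toy_risk(patient) - toy_risk(baseline)
```

The per-feature values in `phi` sum to `total_effect`, which is the additivity property that lets a clinician read each value as that risk factor's share of the prediction. In this toy case the 0.3 interaction bonus is split evenly between blood pressure and BMI.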

The literature survey highlights various studies that have explored machine learning applications in CVD risk prediction, revealing a trend toward ensemble methods that combine multiple classifiers to improve accuracy and reliability. The proposed hybrid ensemble framework uses a stacking approach, in which predictions from several base models are aggregated and refined by a meta-model (XGBoost). This architecture enhances predictive performance and, through XAI techniques, supports interpretability, allowing clinicians to understand the contributions of individual risk factors such as blood pressure and BMI. By addressing class imbalance and the need for model transparency, the study presents a robust solution that connects advanced machine learning capabilities with practical healthcare applications, ultimately fostering trust in AI-driven decision-making in cardiovascular care.
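A stacked ensemble of this general shape can be sketched with scikit-learn. This is an approximation under stated assumptions, not the authors' pipeline: scikit-learn's `GradientBoostingClassifier` stands in for both CatBoost (as a base model) and XGBoost (as the meta-model), since those are separate packages, and the data are synthetic, generated to loosely mimic 12 features with a 70/30 class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 12 features, imbalanced classes.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# Base models: a boosted-tree model and a small neural network.
base_models = [
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("mlp", make_pipeline(
        MinMaxScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    )),
]

# The meta-model is trained on out-of-fold probabilities from the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=GradientBoostingClassifier(random_state=0),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

proba = stack.predict_proba(X_test)[:, 1]   # predicted probability of class 1
auc = roc_auc_score(y_test, proba)
```

Using `cv=5` means the meta-model never sees base-model predictions made on data those base models were trained on, which is the usual safeguard against leakage in stacking. The resulting `auc` on this synthetic data says nothing about the paper's reported 0.82.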