A preprocessing-enhanced stacking classifier for generalized cardiovascular disease detection across diverse datasets

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-41042-z
PMID: https://pubmed.ncbi.nlm.nih.gov/41792193
Publication Date: 2026-03-06
Author(s): Adeel Ashraf et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study develops a preprocessing-enhanced stacking ensemble to improve the accuracy and generalizability of binary cardiovascular disease (CVD) prediction across diverse datasets. Its preprocessing pipeline covers feature transformation, derived-attribute construction, encoding, and K-modes clustering, all performed after the train-test split so that no test-set information leaks into the evaluation. The stacking architecture combines three tree-based base learners (Random Forest, Decision Tree, and Extra Trees) with a Logistic Regression meta-learner trained on their out-of-fold predictions.
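As an illustration of the stacking architecture described above, the following is a minimal sketch using scikit-learn's StackingClassifier. It is not the authors' code: the synthetic data, the hyperparameters (n_estimators, cv=5, max_iter) and the variable names are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary CVD dataset (assumption: not the paper's data).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)

# Split first, so the held-out rows play no part in fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The three tree-based base learners named in the summary.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
]

# With cv=5 the Logistic Regression meta-learner is trained on
# out-of-fold predictions of the base learners (5 folds is an assumption).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions keeps it from seeing base-learner outputs for rows those learners were fitted on, which would otherwise inflate its apparent performance.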

The ensemble was validated on three heterogeneous datasets, reaching accuracies of 93.26%, 72%, and 99% on Datasets I, II, and III, respectively. Performance stability was assessed with 95% confidence intervals computed across five random seeds, which supports the model's robustness in varied data environments. The study also systematically evaluates a range of machine learning and ensemble techniques for CVD prediction and underscores the importance of preprocessing for model performance.
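The summary does not say how the confidence intervals were computed. One common choice for a small number of runs is a Student's t interval around the mean score, sketched below. The per-seed accuracies are made-up placeholders, not results from the paper, and the function name is hypothetical.

```python
import math
import statistics

# Two-sided 95% t critical value for 4 degrees of freedom (5 runs - 1).
T_CRIT_95_DF4 = 2.776


def mean_ci95_five_runs(scores):
    """Return (mean, lower, upper) of a 95% t interval for exactly five runs."""
    if len(scores) != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_95_DF4 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical accuracies from five seeds (illustration only).
seed_accuracies = [0.90, 0.92, 0.91, 0.93, 0.89]
mean, lower, upper = mean_ci95_five_runs(seed_accuracies)
print(f"mean={mean:.4f}, 95% CI=[{lower:.4f}, {upper:.4f}]")
```

A narrow interval across seeds indicates that the score depends little on the random initialization and data shuffling, which is the sense of stability the summary refers to.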

Introduction

The introduction frames cardiovascular disease (CVD) as the leading cause of death worldwide, responsible for nearly one-third of all deaths, and stresses that early diagnosis is key to reducing morbidity and mortality. Traditional diagnostic methods, such as electrocardiography and echocardiography, often depend on subjective interpretation and can miss early-stage or asymptomatic conditions. This motivates objective, data-driven diagnostic tools that can detect subtle clinical patterns indicative of impending disease.

The paper presents machine learning (ML) as a valuable adjunct to conventional diagnostic techniques because it can automatically extract complex relationships from clinical data. It identifies generalizability as a major weakness of existing CVD prediction models: many studies report high accuracy but rely on single datasets, limited preprocessing, or narrowly defined patient populations, so performance drops when the models are applied to new or more diverse data. This limitation is a recurrent theme in recent biomedical ML research and underscores the need for more robust and adaptable predictive models.

Methods

The research proposes a preprocessing-enhanced stacking classifier for the early detection of cardiovascular diseases (CVDs). The goal is a robust, generalizable predictive model that can be applied across different datasets, meeting the need for reliable diagnostic support given that CVDs are a leading cause of death worldwide. The study also emphasizes timely clinical intervention as a means of improving patient outcomes.

To achieve this, the researchers implement an ensemble machine learning framework together with a carefully structured preprocessing pipeline designed to avoid data leakage. The methodology begins with data cleaning, after which the dataset is split into training and test subsets before any feature engineering takes place, so the test data remains strictly isolated. All preprocessing steps (data transformation, categorization, attribute combination, label encoding, normalization, and clustering) are fitted on the training data only and then applied to the test set with fixed parameters. This prevents information leakage and allows a fair assessment of the model's generalization. The methodology is illustrated in Figure 1 of the paper.
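The split-then-fit ordering described above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's pipeline: the field names (age, chest_pain), the min-max scaling, the integer encoding and all function names are hypothetical.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it before any preprocessing is fitted."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_preprocessor(train_rows):
    """Learn scaling and encoding parameters from the training rows only."""
    ages = [row["age"] for row in train_rows]
    categories = sorted({row["chest_pain"] for row in train_rows})
    return {
        "age_min": min(ages),
        "age_max": max(ages),
        "chest_pain_codes": {name: i for i, name in enumerate(categories)},
    }


def transform(rows, params):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    span = (params["age_max"] - params["age_min"]) or 1.0
    features = []
    for row in rows:
        scaled_age = (row["age"] - params["age_min"]) / span
        # Categories never seen in training map to -1 instead of being learned.
        code = params["chest_pain_codes"].get(row["chest_pain"], -1)
        features.append((scaled_age, code))
    return features


# Made-up records for illustration only.
pain_types = ["typical", "atypical", "none"]
records = [{"age": 30 + 3 * i, "chest_pain": pain_types[i % 3]} for i in range(10)]

train_rows, test_rows = split_rows(records)
params = fit_preprocessor(train_rows)         # fitted on training data only
train_features = transform(train_rows, params)
test_features = transform(test_rows, params)  # fixed parameters reused
```

Because the minimum and maximum come from the training rows, a test value may scale outside the range 0 to 1. That is expected, and it is the price of keeping the test set out of the fitting step.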

Discussion

The discussion highlights the need for robust and reproducible machine learning (ML) frameworks for cardiovascular disease (CVD) detection, particularly across heterogeneous datasets. The authors propose a preprocessing-enhanced stacking classifier whose preprocessing pipeline (transformation, categorization, attribute engineering, label encoding, and clustering) standardizes feature representations before model training. The approach is benchmarked against traditional ML methods, ensemble techniques, and deep learning architectures on three distinct datasets that differ in clinical variables and patient demographics. The findings suggest that the proposed model significantly improves generalizability and predictive consistency, addressing a limitation of existing biomedical ML studies, which often rely on single or homogeneous datasets.
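The clustering step in this pipeline is K-modes, which groups categorical records by Hamming distance to per-column modes. The summary gives no details of the configuration, so the following is a generic sketch: the number of clusters, the initialization from the first k distinct rows, and the toy records are all assumptions. Modes are fitted on training rows and then only reused to label test rows.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_mode(row, modes):
    """Index of the closest mode (the lowest index wins ties)."""
    return min(range(len(modes)), key=lambda k: hamming(row, modes[k]))


def fit_kmodes(rows, k, max_iter=20):
    """Fit k modes; initialization uses the first k distinct rows (an assumption)."""
    modes = []
    for row in rows:
        if row not in modes:
            modes.append(row)
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("need at least k distinct rows")
    for _ in range(max_iter):
        labels = [nearest_mode(row, modes) for row in rows]
        new_modes = []
        for cluster in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == cluster]
            if not members:
                new_modes.append(modes[cluster])  # keep the mode of an empty cluster
                continue
            # New mode: the most frequent category in each column.
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_modes == modes:
            break
        modes = new_modes
    return modes


# Toy categorical records (illustration only).
train = [
    ("high", "smoker", "active"),
    ("low", "nonsmoker", "active"),
    ("high", "smoker", "sedentary"),
    ("low", "nonsmoker", "sedentary"),
    ("high", "smoker", "active"),
    ("low", "nonsmoker", "active"),
]
test = [("high", "smoker", "sedentary"), ("low", "nonsmoker", "sedentary")]

modes = fit_kmodes(train, k=2)                            # fitted on training rows only
test_labels = [nearest_mode(row, modes) for row in test]  # fixed modes reused
```

The resulting cluster label can be appended to each record as an extra feature. Assigning test rows with the training modes keeps the step consistent with the leakage-free ordering described in the Methods.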

The authors also critique the current landscape of CVD prediction research, noting that many studies report high accuracy but fail to demonstrate generalizability due to reliance on narrow datasets and inadequate preprocessing methods. They emphasize the importance of rigorous evaluation methodologies that mitigate data leakage and ensure robust performance across diverse clinical environments. By demonstrating the efficacy of enhanced preprocessing and ensemble learning, this study contributes to advancing ML-based CVD diagnostics, aiming for solutions that are not only reliable but also scalable and applicable in real-world clinical settings.