Prediction of cardiovascular disease based on multiple feature selection and improved PSO-XGBoost model

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-96520-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40216915
Publication Date: 2025-04-11
Author(s): Kerang Cao et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper presents a cardiovascular disease prediction model, termed MFS-DLPSO-XGBoost, which combines multiple feature selection (MFS), an improved particle swarm optimization (PSO) algorithm, and the extreme gradient boosting (XGBoost) algorithm. The pipeline begins by preprocessing the dataset, then trains an XGBoost model and compares its performance against four established machine learning algorithms: logistic regression, support vector machine, random forest, and K-nearest neighbor. Feature selection relies on two-factor (pairwise) Pearson correlation analysis together with feature importance ranking, and yields the feature subset used as model input.
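The feature selection step described above can be illustrated with a minimal sketch. It assumes a simple rule (when two features are strongly correlated, keep the one with the higher importance score); the summary does not state the paper's exact rule. The threshold of 0.8, the feature names, and the importance values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, importance, threshold=0.8):
    """Drop the less important feature of every highly correlated pair.

    columns:    dict mapping feature name -> list of numeric values
    importance: dict mapping feature name -> importance score
    threshold:  hypothetical cut-off on |r|, not taken from the paper
    """
    # Visit features from most to least important and keep a feature only
    # if it is not strongly correlated with one that was already kept.
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: "weight_lb" is a rescaled copy of "weight_kg", so it is redundant.
columns = {
    "age": [50, 61, 45, 58, 39, 66],
    "weight_kg": [70, 92, 60, 85, 77, 64],
    "weight_lb": [154, 202.4, 132, 187, 169.4, 140.8],
}
importance = {"age": 0.5, "weight_kg": 0.3, "weight_lb": 0.2}
selected = select_features(columns, importance)
```

In this toy case `selected` keeps `age` and `weight_kg` and drops `weight_lb`. In a real pipeline the importance scores would typically come from a trained tree model rather than being typed in by hand.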

The improved PSO algorithm is used to tune the hyperparameters of the XGBoost model. The resulting MFS-DLPSO-XGBoost model achieved recall, precision, accuracy, F1 score, and area under the ROC curve (AUC) values of 71.4%, 76.3%, 74.7%, 73.6%, and 80.8%, respectively, surpassing the baseline XGBoost model, whose AUC was 78.5%. The findings indicate solid classification performance and suggest potential utility in clinical settings for predicting and preventing cardiovascular disease. Future applications may involve software for doctor-patient interaction and integration with historical patient data to further refine predictive accuracy.
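To make the tuning step concrete, here is a minimal sketch of plain global-best PSO with a linearly decreasing inertia weight. It is not the paper's improved variant, whose modifications are not described in this summary. The objective `toy_loss`, the two tuned parameters, their bounds, and all PSO constants are assumptions chosen for illustration.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=60, seed=0):
    """Plain global-best PSO with a linearly decreasing inertia weight."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    c1 = c2 = 1.5                               # cognitive and social weights
    for t in range(n_iters):
        w = 0.9 - 0.5 * t / (n_iters - 1)       # inertia decays from 0.9 to 0.4
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                # Clamp the new position to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val

# Stand-in objective: a smooth bowl with its minimum at learning_rate = 0.1
# and max_depth = 6. A real run would return 1 - cross-validated AUC instead.
def toy_loss(params):
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + ((max_depth - 6.0) / 10.0) ** 2

best_params, best_loss = pso_minimize(toy_loss, bounds=[(0.01, 0.5), (2.0, 12.0)])
```

Here `toy_loss` stands in for an expensive model evaluation. Integer hyperparameters such as tree depth would need rounding inside the objective, which this sketch leaves out.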

Methods

In this study, the experimental environment was established using PyCharm and Python 3.8 (64-bit), with specific computer configurations detailed in Table 1. The research utilized a public dataset on cardiovascular disease sourced from the Kaggle platform for its analyses.

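The workflow had four stages. First, the data were preprocessed: records with missing values were deleted, outliers were detected with box plots and removed, and categorical variables were label encoded. Second, a baseline XGBoost model was trained and compared with logistic regression, support vector machine, random forest, and K-nearest neighbor classifiers. Third, redundant features were removed using Pearson correlation analysis and feature importance ranking. Fourth, the improved PSO algorithm tuned the XGBoost hyperparameters. Further implementation details are given in the original paper.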

Results

The experiments assess the feasibility and reliability of the proposed model. XGBoost outperformed traditional algorithms such as logistic regression and support vector machines, and the tuned MFS-DLPSO-XGBoost model improved on it further, reaching an AUC of 0.808 against 0.785 for standard XGBoost, with recall of 71.4%, precision of 76.3%, accuracy of 74.7%, and an F1 score of 73.6%. Paired t-tests indicated that the improvements were statistically significant.
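The metrics used in these comparisons (accuracy, precision, recall, F1 score, and AUC) follow standard definitions, sketched below in plain Python. The labels and scores are a toy example, not data from the study, and the 0.5 decision threshold is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four positives and four negatives with predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

For this toy input all four threshold-based metrics equal 0.75 and the AUC is 13/16 = 0.8125. The pairwise AUC formula is quadratic in the sample count, so a rank-based implementation is preferable on large datasets.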

Discussion

In this study, a cardiovascular disease prediction model was developed using the XGBoost algorithm, enhanced through multiple feature selection and an improved particle swarm optimization (PSO) approach. The dataset, sourced from Kaggle, comprised 11,389 samples with 12 features, including physical health indicators. Preprocessing consisted of deleting records with missing values (less than 5% of the dataset), detecting outliers with box plots, and label encoding the categorical variables. After outlier removal, the final dataset contained 10,587 samples. Performance was evaluated using accuracy, precision, recall, F1 score, and area under the ROC curve (AUC), with the XGBoost model outperforming traditional algorithms such as logistic regression and support vector machines.
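The three preprocessing steps can be sketched as follows. This assumes the common box-plot convention of flagging values beyond 1.5 times the interquartile range from the quartiles; the summary does not give the paper's exact rule. The column names `sex` and `systolic` and all record values are hypothetical.

```python
import statistics

def drop_missing(rows):
    """Delete every record that has at least one missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def iqr_bounds(values, k=1.5):
    """Box-plot whisker limits: [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(rows, column):
    """Keep only records whose value in `column` lies within the whiskers."""
    lo, hi = iqr_bounds([r[column] for r in rows])
    return [r for r in rows if lo <= r[column] <= hi]

def label_encode(rows, column):
    """Replace each category with an integer code assigned in sorted order."""
    codes = {cat: i for i, cat in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows], codes

# Hypothetical records: one missing value and one implausible reading.
rows = [
    {"sex": "female", "systolic": 120},
    {"sex": "male", "systolic": 130},
    {"sex": "male", "systolic": None},
    {"sex": "female", "systolic": 110},
    {"sex": "male", "systolic": 125},
    {"sex": "female", "systolic": 900},
    {"sex": "male", "systolic": 140},
    {"sex": "female", "systolic": 118},
]

clean = drop_outliers(drop_missing(rows), "systolic")
clean, sex_codes = label_encode(clean, "sex")
```

Of the eight toy records, one is dropped for the missing value and one for the reading of 900, leaving six. The order of operations matters: quartiles are computed only after missing values are removed.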

To further improve prediction accuracy, multiple feature selection was applied, removing redundant features based on correlation analysis and feature importance rankings. The improved PSO algorithm was then used to optimize the hyperparameters of the XGBoost model, with the improvements aimed at the premature convergence of standard PSO. The MFS-DLPSO-XGBoost model achieved an AUC of 0.808, outperforming the standard XGBoost model (AUC of 0.785) and the other comparison methods. Paired t-tests confirmed that the differences were statistically significant. Future work will focus on external validation with multi-center clinical data and on integrating deep learning techniques to improve model performance further.
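A paired t-test of the kind mentioned above compares two models evaluated on the same folds or splits. The sketch below computes the test statistic by hand. The per-fold AUC values are invented for illustration and are not results from the paper, and the use of five folds is an assumption.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical per-fold AUC scores, invented for illustration only.
auc_tuned = [0.81, 0.80, 0.82, 0.80, 0.81]
auc_baseline = [0.79, 0.78, 0.79, 0.78, 0.78]

t_stat, dof = paired_t_statistic(auc_tuned, auc_baseline)

# Two-sided critical value of Student's t for 4 degrees of freedom at the
# 5% level; the difference is significant if |t| exceeds it.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

With these toy numbers the statistic is about 9.8 on 4 degrees of freedom, well above the critical value. A library routine such as SciPy's `scipy.stats.ttest_rel` would also return the p-value directly.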

Limitations

The study acknowledges several limitations that may affect the generalizability of its predictive model. Firstly, the reliance on a single data source, specifically the Kaggle dataset, raises concerns about the model’s applicability to diverse clinical environments, as it may not adequately reflect variations in data composition, categories, and distribution. Secondly, discrepancies between the labeling standards used in the dataset and those in real-world clinical practice could hinder the model’s effectiveness when applied outside the study context. Lastly, the absence of specific patient data, such as cases involving rare conditions or patients with multiple comorbidities, further limits the model’s robustness and performance in actual medical scenarios.