Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-93675-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40251253
Publication Date: 2025-04-18
Author(s): Yagyanath Rimal et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This paper investigates how cross-validation affects the performance of machine learning models applied to heart disease datasets. The study uses a novel data preparation process that imputes numerical features with the mean, imputes categorical features using chi-square methods, and applies normalization. Four models, Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF), are compared through cross-validation. The RF model achieves the highest macro accuracy at 94%, with precision and recall of 97% each. The LR and KNN models both reach 81% accuracy, the lowest of the four, while SVM reaches 89%.

The findings underscore the role of cross-validation in improving model reliability and accuracy. The variation in accuracy scores across models suggests that advanced feature selection and ensemble methods could improve performance further. The study notes that while the RF model excels in overall metrics, the logistic regression model's performance could be improved with hyperparameter optimization. The authors advocate integrating machine learning into healthcare and stress that challenges such as bias and data privacy must be addressed for clinical applications to be effective. Future directions include refining feature selection and using larger datasets to improve model robustness and generalizability.

Methods

The methodology section emphasizes the role of machine learning in uncovering latent patterns within scientific data, particularly for the early detection of heart diseases, which contribute to approximately 12 million deaths worldwide each year. Coronary disease mortality is notably higher in the United States than in advanced European nations. To address this issue, the study aims to achieve optimal validation accuracy for several machine learning models: logistic regression, support vector machines (SVM), k-nearest neighbors (KNN), and random forests.

The research employs cross-validation to assess and compare the predictive efficacy of these four machine learning techniques in identifying coronary artery conditions. This comparative analysis is designed to assist medical practitioners in selecting the most appropriate model for classification and prediction, thereby enhancing diagnostic accuracy and potentially improving patient outcomes.
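The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the heart disease dataset, and the fold count, hyperparameters, and the `compare_models` helper are all hypothetical choices. Only the mean-imputation and scaling steps are shown; the chi-square handling of categorical features is not reproduced because the summary does not specify it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart disease table (assumption: 300 rows, 13 features).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

# Blank out about 2% of the values so the mean-imputation step has work to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.02] = np.nan

# The four model families named in the text; hyperparameters are illustrative defaults.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Assumed 5-fold stratified cross-validation; the summary does not state the fold count.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def compare_models(X, y):
    """Return {model name: (mean accuracy, std of accuracy)} across the folds."""
    results = {}
    for name, model in models.items():
        # Imputation and scaling sit inside the pipeline so they are refit on
        # each training fold and never see the held-out fold.
        pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), model)
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


cv_results = compare_models(X, y)
for name, (mean, std) in cv_results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Keeping the preprocessing inside the pipeline is a deliberate choice here: fitting the imputer or scaler on the full dataset before splitting would leak information from the validation folds into training.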

Results

In the results section, the study uses a heatmap, drawn with the `sns.heatmap` function, to illustrate the correlation between the dependent and independent variables in the heart disease dataset. Certain features, such as sex, chest pain type (cp), maximum heart rate (thalach), and the number of major vessels (ca), perform satisfactorily, with mean squared errors below 10. In contrast, exercise-induced angina (exang), oldpeak, and slope show significantly higher error values. The model's coefficient of determination indicates that 58% of the variance in the dependent variable is explained by the independent variables, with an adjusted R-squared of 54%, suggesting a reasonable fit.
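A correlation heatmap of this kind might be produced along the following lines. This is a sketch under stated assumptions: the data are randomly generated, the column names follow the features named in the text, the `target` column name and the output filename are hypothetical, and the styling arguments are illustrative rather than taken from the paper.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; real values would come from the heart disease dataset.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "sex": rng.integers(0, 2, size=n),
    "cp": rng.integers(0, 4, size=n),
    "thalach": rng.normal(150, 23, size=n),
    "exang": rng.integers(0, 2, size=n),
    "oldpeak": rng.exponential(1.0, size=n),
    "slope": rng.integers(0, 3, size=n),
    "ca": rng.integers(0, 4, size=n),
    "target": rng.integers(0, 2, size=n),  # assumed name for the dependent variable
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Draw the matrix with a diverging colormap centered on zero correlation.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlation (synthetic data)")
fig.tight_layout()
fig.savefig("heart_corr_heatmap.png")
```

The `target` row of the matrix is the part that relates each feature to the dependent variable; with random data it will be near zero everywhere.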

Further analysis through binary logistic regression shows strong overall statistical significance, with a p-value close to zero and an F-statistic of 17.71. The intercept is 0.54. A one-unit increase in age is associated with a 20-unit increase in the dependent variable, but this relationship is not statistically significant. The variable Sex_1 is significant, with a t-value of 2.64 and a p-value of 0.001. The remaining variables, including resting blood pressure (trestbps), cholesterol, thalach, and oldpeak, show no significant relationship with the dependent variable. These findings motivate further evaluation of the machine learning models using cross-validation to improve prediction accuracy for heart disease patients.

Discussion

In the discussion section, the authors detail the preprocessing and evaluation of a heart disease dataset comprising 13 categorical attributes. Data preparation involved one-hot encoding, which produced 76 columns, and a stratified 80:20 train-test split that preserves class proportions. The study trained four machine learning models: logistic regression, support vector machines (SVM), random forests, and k-nearest neighbors (KNN). Cross-validation was used to optimize model performance. The logistic regression and KNN models achieved the highest accuracy, 81.9%, while the SVM and random forest models scored 78.6%. On macro-average metrics, however, the random forest model outperformed the others, with precision and recall of 96-97%.
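The preparation and scoring steps described above can be illustrated with a small sketch. It assumes pandas and scikit-learn and uses a tiny synthetic table, so the column count will not match the 76 columns reported; the column names, the `target` label, and the random forest settings are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in table with one numeric and three categorical columns.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "sex": rng.integers(0, 2, size=n),
    "cp": rng.integers(0, 4, size=n),
    "slope": rng.integers(0, 3, size=n),
})
# Exactly balanced labels (100 per class) so the stratification is easy to check.
target = pd.Series(np.tile([0, 1], n // 2), name="target")

# One-hot encode the categorical columns; each category becomes its own 0/1 column.
X = pd.get_dummies(df, columns=["sex", "cp", "slope"])

# 80:20 split; stratify keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.2, stratify=target, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging computes the metric per class, then takes the unweighted mean,
# so each class counts equally regardless of how many samples it has.
macro_precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
print(f"macro precision={macro_precision:.3f}  macro recall={macro_recall:.3f}")
```

Because the labels here are random, the scores will hover around chance; the point of the sketch is the mechanics of encoding, stratified splitting, and macro averaging rather than the numbers.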

The authors emphasize the importance of cross-validation in minimizing overfitting and enhancing model reliability. The learning curves generated during the evaluation process illustrated how model performance improved with increased training data, highlighting the trade-offs between bias and variance. Ultimately, the random forest model emerged as the most effective tool for heart disease prediction, achieving a macro accuracy of 94%. The authors conclude that while machine learning holds significant promise for healthcare applications, challenges such as bias, interpretability, and data privacy must be addressed to ensure equitable outcomes. Future work should focus on refining feature selection and leveraging larger datasets to bolster model robustness and generalizability.
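Learning curves like those mentioned above can be computed with scikit-learn's `learning_curve` helper. The sketch below is an assumption-laden illustration: it uses synthetic data, a random forest with arbitrary settings, and five training-set sizes chosen for convenience, none of which come from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the heart disease data (assumption: 300 rows, 13 features).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

# Train on 20%..100% of each training fold and score on the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the folds; rows correspond to training-set sizes.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_mean, val_mean):
    # A large train/validation gap points to variance (overfitting);
    # low scores on both point to bias (underfitting).
    print(f"n={int(size):3d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

Reading the gap column from top to bottom shows whether adding training data is narrowing the difference between training and validation accuracy.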