Improving stroke risk prediction by integrating XGBoost, optimized principal component analysis, and explainable artificial intelligence

Journal: BMC Medical Informatics and Decision Making, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12911-025-02894-z
PMID: https://pubmed.ncbi.nlm.nih.gov/39920691
Publication Date: 2025-02-07
Author(s): Lesia Mochurad et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study addresses the rising prevalence of cerebrovascular diseases, particularly stroke, a major contributor to global disability and mortality. To improve stroke risk prediction, the authors propose combining XGBoost with an optimized principal component analysis (PCA). The combination improves data structuring and processing speed, especially for large datasets, and incorporates explainable artificial intelligence (XAI) to make the model more transparent and interpretable for medical professionals. The proposed method achieved accuracies of 95% and 98% on two datasets. Cross-validation indicated high generalizability (average score of 0.99) and reliability (Matthews correlation coefficient of 0.96 and Cohen’s kappa of 0.96). In addition, OpenMP parallelization produced a threefold increase in processing speed, which the authors present as making the approach practical for clinical applications.

In conclusion, the integration of PCA and XGBoost improves both the accuracy and the efficiency of stroke risk prediction. Dimensionality reduction via PCA lowers resource use and computation time, while SHAP-based interpretation keeps the results transparent and comprehensible for healthcare providers. The findings suggest that the approach can support early diagnosis and timely intervention for stroke risk, with potential applications in other medical domains such as cardiovascular and oncological diseases. Future research directions include applying the methodology to larger real-world datasets and exploring alternative technologies to further optimize model performance and interpretability in clinical settings.

Introduction

The introduction identifies stroke as a leading cause of death and disability worldwide and stresses the need for effective prevention and timely diagnosis. Machine learning algorithms such as XGBoost and neural networks are presented as a promising way to improve stroke risk prediction by drawing on both traditional and novel patient data. The paper discusses the importance of data quality and the problems caused by anomalies in datasets, and advocates Principal Component Analysis (PCA) to improve data reliability and model interpretability. According to the authors, PCA helps identify key variables and simplifies the modeling process, making it more accessible to healthcare professionals.

The authors propose utilizing parallel computing technologies, specifically OpenMP, to optimize PCA processing, thereby addressing the computational inefficiencies associated with traditional methods. This approach aims to enhance the speed and efficiency of stroke risk prediction models while maintaining high accuracy. Furthermore, the integration of Explainable Artificial Intelligence (XAI) techniques is highlighted as essential for improving the interpretability of the models, allowing healthcare providers to better understand the factors influencing predictions. The study ultimately seeks to combine these methodologies to create a robust framework for stroke risk assessment, underscoring the significance of machine learning in advancing medical diagnostics and patient care.

Methods

In this section, the authors outline the methodology for analyzing a dataset \( D = \{(x_i, y_i)\}_{i=1}^{n} \) of \( n \) patient records, where \( x_i \in \mathbb{R}^d \) holds the patient characteristics and \( y_i \in \{0, 1\} \) is a binary outcome indicating stroke presence. Preprocessing began with a check for duplicates, none of which were found, followed by mean imputation of the missing values in the BMI column. Outliers were identified using boxplots and the interquartile range (IQR) method, and extreme values were replaced by the boundary values. Exploratory Data Analysis (EDA) was then conducted to identify potential issues such as class imbalance.
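The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the BMI values are made up, the function names are hypothetical, and the 1.5 × IQR fence is the conventional default, which the summary does not confirm as the authors' choice.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def clip_to_iqr_bounds(values, factor=1.5):
    """Clamp values outside [Q1 - factor*IQR, Q3 + factor*IQR] to the nearest bound.

    factor=1.5 is an assumed, conventional fence width.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up BMI readings (not from the paper): one missing value, one extreme value.
bmi = [22.0, 24.0, None, 26.0, 28.0, 95.0]
bmi_imputed = impute_mean(bmi)
bmi_clean = clip_to_iqr_bounds(bmi_imputed)
print(bmi_imputed)
print(bmi_clean)
```

Because imputation runs before outlier handling here, the extreme reading inflates the imputed mean; whether the authors used the same order beyond what the summary states is not known.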

The authors centered the data and computed the covariance matrix \( S \), then extracted its eigenvalues and eigenvectors using the Jacobi method, parallelized with OpenMP for efficiency. Dimensionality reduction was achieved by selecting the eigenvectors corresponding to the \( k \) largest eigenvalues and projecting the data onto them, yielding a new feature matrix \( X_{\text{reduced}} \), which was then passed to the XGBoost algorithm. The model was trained with the logistic loss function, and hyperparameters were tuned via Grid Search to improve accuracy while limiting overfitting. Performance was evaluated using the ROC AUC score, and balancing techniques such as undersampling and SMOTE were applied to address class imbalance. Finally, the SHAP method was used to interpret model predictions, providing insight into the contribution of each feature to the predictions based on the original input data.
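To make the PCA step concrete, the following is a serial, pure-Python sketch of centering, covariance computation, cyclic Jacobi rotations, and projection onto the top \( k \) eigenvectors. It is an illustration under assumptions rather than the authors' implementation: their version is parallelized with OpenMP, the sample covariance (division by \( n - 1 \)) is assumed, the tolerance and sweep limit are arbitrary, and all function names and data are hypothetical.

```python
import math

def center_and_covariance(rows):
    """Center the columns and return (centered data, sample covariance matrix S)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    S = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
         for a in range(d)]
    return centered, S

def jacobi_eigen(S, tol=1e-12, max_sweeps=50):
    """Cyclic Jacobi rotations on a symmetric matrix.

    Returns (eigenvalues, V); column i of V is the eigenvector of eigenvalue i.
    """
    d = len(S)
    A = [row[:] for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(max_sweeps):
        off = math.sqrt(sum(A[i][j] ** 2
                            for i in range(d) for j in range(d) if i != j))
        if off < tol:  # accuracy threshold: off-diagonal mass is negligible
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if A[p][q] == 0.0:
                    continue
                # Smaller-angle rotation that zeroes A[p][q].
                tau = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                for k in range(d):  # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):  # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(d):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [A[i][i] for i in range(d)], V

def pca_reduce(rows, k):
    """Project centered data onto the eigenvectors of the k largest eigenvalues."""
    centered, S = center_and_covariance(rows)
    eigvals, V = jacobi_eigen(S)
    top = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)[:k]
    X_reduced = [[sum(c[j] * V[j][i] for j in range(len(c))) for i in top]
                 for c in centered]
    explained = sum(eigvals[i] for i in top) / sum(eigvals)
    return X_reduced, explained

# Made-up 3-feature data in which the first two columns are strongly correlated.
rows = [[2.0, 1.9, 0.1], [4.0, 4.1, -0.1], [6.0, 5.9, 0.0],
        [8.0, 8.2, 0.1], [10.0, 9.8, -0.2]]
X_reduced, explained = pca_reduce(rows, k=1)
print(len(X_reduced[0]), round(explained, 4))
```

The `explained` ratio returned by `pca_reduce` is the share of total variance retained, which is the quantity one would compare against a target such as 95% when choosing \( k \).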

Results

The results demonstrate the effectiveness of the proposed approach for stroke risk prediction, tested on two datasets with significant class imbalance. Dataset 1 comprised 5,110 records, of which only 249 were positive stroke cases, while Dataset 2 contained 5,769,190 records. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to Dataset 1, while random undersampling was used for Dataset 2. The study employed Principal Component Analysis (PCA) to reduce dimensionality while retaining 95% of the variance, which improved computational efficiency and model accuracy. The Jacobi rotation algorithm was used for eigenvalue and eigenvector calculations, and parallel computation via OpenMP yielded a speedup of more than threefold compared to previous methods.
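For scale, 249 positives out of 5,110 records is a positive rate of about 4.9%. The random undersampling applied to Dataset 2 can be sketched as follows; this is a generic illustration on made-up data with a hypothetical function name, assuming the majority class is reduced to the size of the minority class (the summary does not state the target ratio). SMOTE, which synthesizes new minority samples by interpolating between neighbours, is not reproduced here.

```python
import random

def random_undersample(X, y, seed=0):
    """Keep every minority-class row and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)  # avoid leaving the classes in two contiguous blocks
    return [X[i] for i in kept], [y[i] for i in kept]

# Made-up imbalanced data: 50 positives among 1,000 records (not from the paper).
y = [1] * 50 + [0] * 950
X = [[float(i)] for i in range(len(y))]
X_bal, y_bal = random_undersample(X, y)
print(sum(y) / len(y), len(y_bal), sum(y_bal))
```

Undersampling discards majority-class records, which is affordable on a dataset of several million rows but would be costly on one with only a few thousand.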

The analysis revealed that key predictors for stroke risk included blood glucose level, age, and work type, with SHAP (SHapley Additive exPlanations) values providing insights into feature importance. The model achieved an accuracy of 95% on Dataset 1, surpassing previous benchmarks, and demonstrated statistically significant performance improvements when PCA was included, as confirmed by paired t-tests and Mann-Whitney U tests. The findings underscore the importance of integrating advanced computational techniques in medical diagnostics, facilitating timely and accurate predictions that can significantly impact patient outcomes.
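To illustrate what a SHAP value measures, the sketch below computes exact Shapley values for a tiny hand-written risk function by averaging each feature's marginal contribution over all feature orderings. Everything here is hypothetical: `toy_risk`, its coefficients, and the input values are invented for illustration and are unrelated to the paper's fitted model, and replacing "absent" features with a fixed baseline is a simplifying assumption. The SHAP library uses far more efficient estimators for tree ensembles than this brute-force enumeration.

```python
from itertools import permutations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    d = len(x)
    phi = [0.0] * d
    for order in permutations(range(d)):
        current = list(baseline)      # start with every feature "absent"
        prev = model(current)
        for j in order:
            current[j] = x[j]         # switch feature j to its actual value
            new = model(current)
            phi[j] += new - prev      # marginal contribution of j in this ordering
            prev = new
    return [p / factorial(d) for p in phi]

def toy_risk(features):
    """Hypothetical risk score with a glucose-age interaction (not the paper's model)."""
    glucose, age, work = features
    return 0.002 * glucose + 0.01 * age + 0.0001 * glucose * age + 0.05 * work

x = [200.0, 70.0, 1.0]          # made-up patient
baseline = [100.0, 40.0, 0.0]   # made-up reference patient
phi = shapley_values(toy_risk, x, baseline)
print(phi, sum(phi), toy_risk(x) - toy_risk(baseline))
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that makes per-feature attributions straightforward to read.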

Discussion

In the discussion section of the paper, the authors present a parallel implementation of the Jacobi algorithm for calculating eigenvalues and eigenvectors, utilizing OpenMP directives to enhance computational efficiency. This approach is particularly beneficial for high-dimensional datasets, such as those used in stroke risk prediction models, as it allows for simultaneous execution across multiple threads while ensuring safe updates to shared variables through critical sections. The algorithm’s convergence is monitored via an accuracy threshold, enabling efficient termination of iterations. The choice of the Jacobi method is justified by its suitability for parallelization and its balance of accuracy and performance, with plans for future comparisons against alternative methods like QR decomposition.
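One reason the Jacobi method parallelizes well is that rotations acting on disjoint index pairs \( (p, q) \) touch different rows and columns, so they can be grouped into rounds whose rotations are independent. The sketch below builds such rounds with the round-robin (circle) method. This is a common textbook ordering shown for illustration only; the summary says the authors used OpenMP directives with critical sections and does not say they used this schedule. The function name is hypothetical.

```python
from itertools import combinations

def round_robin_pairs(d):
    """Group all index pairs (p, q), p < q, into rounds of mutually disjoint pairs."""
    idx = list(range(d))
    if d % 2:
        idx.append(None)  # dummy slot: whoever is paired with it sits the round out
    m = len(idx)
    rounds = []
    for _ in range(m - 1):
        pairs = []
        for i in range(m // 2):
            a, b = idx[i], idx[m - 1 - i]
            if a is not None and b is not None:
                pairs.append((min(a, b), max(a, b)))
        rounds.append(pairs)
        # Circle method: keep the first slot fixed and rotate the others by one.
        idx = [idx[0], idx[-1]] + idx[1:-1]
    return rounds

for number, pairs in enumerate(round_robin_pairs(6), start=1):
    print(number, pairs)
```

Each round could, in principle, be distributed across threads, with a synchronization point between rounds; the convergence check against the accuracy threshold would then run once per full sweep.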

The authors also evaluate their model’s performance on two datasets, using metrics such as the Matthews correlation coefficient (MCC) and Cohen’s kappa coefficient (CK) to assess classification quality, particularly under class imbalance. The results indicate high model reliability, with an average cross-validation score of 0.99028 for Dataset 2 and consistent performance across folds. Despite the model’s effectiveness, challenges remain in clinical integration, including compatibility with existing systems and the need for real-world testing. The discussion highlights the potential of combining XGBoost with PCA for improved accuracy and processing speed, while acknowledging the complexities of applying the approach to diverse datasets. Future research directions include exploring metaheuristic algorithms for feature selection and further optimizing the model’s runtime and accuracy in clinical applications.
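The two agreement metrics named above can be computed directly from a binary confusion matrix. The sketch below uses the standard definitions; the labels are made up, and returning 0.0 when a denominator vanishes is a convention assumed here, not something taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention

def cohen_kappa(tp, tn, fp, fn):
    """Kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0  # assumed convention

# Made-up imbalanced labels: 5 positives among 100 records.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
counts = confusion_counts(y_true, always_negative)
accuracy = (counts[0] + counts[1]) / len(y_true)
print(accuracy, matthews_corrcoef(*counts), cohen_kappa(*counts))
```

A classifier that always predicts "no stroke" scores 95% accuracy on this toy data yet receives 0 on both MCC and kappa, which is why these metrics are informative when classes are imbalanced.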