Machine Learning Models for Predicting Smoking-Related Health Decline and Disease Risk

Journal: Journal of Intelligent Medicine and Healthcare, Volume: 4, Issue: 1
DOI: https://doi.org/10.32604/jimh.2026.074347
Publication Date: 2026-01-01
Author(s): Vaskar Chakma et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses the substantial health risks of smoking, which remains a leading preventable cause of death worldwide. Current medical screening often misses early signs of smoking-related health problems, so diagnoses tend to come at a late stage. The study systematically evaluates machine learning techniques for assessing smoking-related health risk, with a focus on clinical interpretability and practical application. Using health screening data from 55,691 individuals, it applied three advanced prediction algorithms (Random Forest, XGBoost, and LightGBM) to identify high-risk smokers from a range of health indicators. Random Forest performed best, achieving an Area Under the Curve (AUC) of 0.926, which indicates strong discrimination between high-risk and lower-risk individuals.
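To make the AUC figure concrete, the following minimal sketch computes AUC as the share of (high-risk, lower-risk) pairs in which the high-risk person receives the higher score. It assumes the reported AUC is the area under the ROC curve, which the summary does not state explicitly. The function name `auc_by_ranking` and the labels and scores are invented for illustration and do not come from the study.

```python
from itertools import product


def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = high-risk, 0 = lower-risk; scores are model outputs.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.91, 0.78, 0.40, 0.55, 0.30, 0.22, 0.10]
print(auc_by_ranking(labels, scores))
```

On this toy data 11 of the 12 positive-negative pairs are ordered correctly, so the result is 11/12. Under the same reading, an AUC of 0.926 would mean that a randomly chosen high-risk individual outscores a randomly chosen lower-risk individual about 93% of the time.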

The findings underscore the potential of artificial intelligence for early disease detection among smokers, which would allow timely intervention through personalized prevention strategies. Blood pressure, triglyceride levels, liver enzymes, and kidney function were identified as key predictors of declining health. The study concludes that machine learning not only improves prediction but also offers insight into the complex risk factors associated with smoking. The consistently strong performance of ensemble methods, particularly Random Forest, supports integrating them into clinical risk assessment tools, paving the way for smoking cessation and health monitoring approaches tailored to individual vulnerabilities.

Introduction

The introduction frames smoking as a major public health challenge: it is responsible for over 8 million deaths annually and contributes to a range of diseases beyond lung cancer and chronic obstructive pulmonary disease (COPD), including cardiovascular and metabolic disorders. Despite extensive public health efforts, approximately 1.3 billion people continue to smoke, often underestimating the risks of their habit. The authors highlight the limitations of reactive diagnostic approaches that intervene only after symptoms appear, and advocate a proactive strategy that uses advanced predictive tools to identify risk at earlier stages of smoking-related health decline.

The study proposes that smoking generates distinct biological signatures across various health pathways, which machine learning algorithms can detect before conventional clinical thresholds are reached. By analyzing diverse biomarkers, the research aims to create a comprehensive assessment of smoking-related health risks, addressing gaps in previous studies that focused narrowly on single diseases. A key innovation is the comparison of machine learning models with established clinical risk assessment tools, such as the Framingham cardiovascular risk score, to evaluate their effectiveness. The authors emphasize the importance of model interpretability using SHAP (SHapley Additive exPlanations) values to enhance clinician trust and support shared decision-making. Additionally, the study considers practical aspects of clinical implementation and algorithmic fairness, ensuring that the predictive tools developed are ethically responsible and applicable across diverse populations. Ultimately, the research aims to facilitate timely interventions for at-risk individuals, improving public health outcomes related to tobacco use.
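The idea behind SHAP values can be illustrated with an exact Shapley computation on a tiny model. The sketch below is an illustration only: the risk function `toy_risk`, its coefficients, the feature names, and the patient and baseline values are all hypothetical, and real SHAP tooling typically approximates this calculation rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline)."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's values; the rest stay at baseline.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        rest = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(p):
    """Hypothetical risk score with one interaction term (not the paper's model)."""
    score = 0.02 * (p["systolic_bp"] - 120) + 0.005 * (p["triglycerides"] - 150)
    score += 0.01 * (p["alt"] - 30)
    if p["systolic_bp"] > 140 and p["triglycerides"] > 200:
        score += 0.5  # extra risk when both markers are elevated
    return score


baseline = {"systolic_bp": 120, "triglycerides": 150, "alt": 30}
patient = {"systolic_bp": 150, "triglycerides": 250, "alt": 50}
phi = shapley_values(toy_risk, patient, baseline)
print(phi)
```

The attributions sum to the difference between the patient's score and the baseline score, and the interaction bonus is split evenly between the two features that trigger it. That additive breakdown is what lets a clinician see which markers drive an individual prediction.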

Methods

In this section, the authors outline the methodological framework used to assess how well various machine learning models predict smoking-related health decline. The dataset, referred to as Smoking.csv, was divided into training (80%) and testing (20%) subsets through stratified sampling, so that the distribution of outcomes remained consistent across both groups. Stratification matters because it prevents a chance difference in outcome prevalence between the two subsets from skewing the evaluation. The study used seven machine learning algorithms, including traditional models such as Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), alongside advanced ensemble methods such as XGBoost and LightGBM, which are particularly well suited to complex, non-linear relationships in the data.
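The stratified 80/20 split described above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the authors' code: the function name `stratified_split`, the seed, and the toy labels (30% positive) are hypothetical. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would normally be used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # randomize within the class before cutting
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical outcome labels: 300 positives, 700 negatives.
labels = [1] * 300 + [0] * 700
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because each class is cut separately, both subsets keep the same 30% positive rate as the full toy dataset, which a plain random split would only match approximately.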

To optimize model performance, hyperparameters were tuned using 10-fold cross-validation, a method that balances computational cost against the reliability of the performance estimate. To address class imbalance, a common problem in health-related datasets, the authors applied the NRSBoundary-SMOTE algorithm. This technique selectively oversamples minority class instances that lie near the decision boundary, improving the model's ability to predict outcomes for the underrepresented class. The algorithm computes Neighborhood Rough Set (NRS) boundaries for minority samples and generates synthetic samples based on local density and proximity to majority class samples, thereby enriching the training dataset.
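The general idea of oversampling only near the class boundary can be sketched as follows. Note the substitution: this is a simplified Borderline-SMOTE-style procedure based on nearest neighbours, not NRSBoundary-SMOTE itself, whose neighborhood rough set computation is not detailed in this summary. The function names, the value of `k`, and the toy points are all hypothetical.

```python
import math
import random


def boundary_minority_indices(X, y, minority=1, k=5):
    """Minority samples whose k nearest neighbours are mostly, but not all, majority."""
    boundary = []
    for i, label in enumerate(y):
        if label != minority:
            continue
        others = [j for j in range(len(X)) if j != i]
        neigh = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        n_major = sum(1 for j in neigh if y[j] != minority)
        # All-majority neighbourhoods are treated as noise and skipped.
        if k / 2 <= n_major < k:
            boundary.append(i)
    return boundary


def oversample_boundary(X, y, minority=1, k=5, n_new=10, seed=0):
    """Create synthetic minority points by interpolating from boundary samples."""
    rng = random.Random(seed)
    boundary = boundary_minority_indices(X, y, minority, k)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    if not boundary or len(minority_idx) < 2:
        return synthetic
    for _ in range(n_new):
        i = rng.choice(boundary)
        mates = sorted(
            (j for j in minority_idx if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(mates)
        t = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


# Hypothetical 2-D data: a safe minority cluster, two minority points near the
# majority class, and a majority cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (1.7, 0.1), (1.8, 0.1),
     (2.0, 0.0), (2.0, 0.2), (2.2, 0.0), (2.2, 0.2), (2.4, 0.1), (2.6, 0.0),
     (2.6, 0.2), (2.8, 0.1)]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(boundary_minority_indices(X, y, k=3))
print(oversample_boundary(X, y, k=3, n_new=4))
```

Only the two minority points adjacent to the majority cluster are selected as seeds, so the synthetic samples concentrate where the classifier is most likely to confuse the classes. Any such oversampling should be applied to the training portion only, so that synthetic points never leak into the test set.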

Discussion

The discussion traces the evolution of smoking-related health risk assessment from traditional statistical approaches, such as logistic regression, to advanced machine learning techniques. Earlier studies achieved acceptable accuracy in predicting smoking status and associated health risks, but they often failed to capture the complex, non-linear relationships inherent in smoking's biological effects across physiological systems. More recent machine learning work, including decision trees and gradient boosting methods, has shown improved predictive performance, particularly for diseases such as lung cancer and chronic obstructive pulmonary disease. However, significant gaps remain in the literature: a narrow focus on single disease endpoints, limited feature sets, and reliance on “black-box” models whose lack of interpretability hinders clinical adoption.

To address these limitations, the authors propose a holistic, systems-based approach that incorporates a comprehensive panel of biomarkers across multiple health domains, reflecting the systemic nature of smoking-induced damage. They emphasize model interpretability through SHAP (SHapley Additive exPlanations) values, which improve transparency and support clinical trust. The research also includes rigorous benchmarking against established clinical risk assessment tools, allowing a meaningful comparison between machine learning models and traditional methods. By reframing the research focus as a multidimensional risk assessment framework, the authors aim to support personalized prevention strategies and proactive health preservation, ultimately contributing to better patient care and public health outcomes for smoking populations.

Limitations

Several limitations may affect the generalizability of the findings. First, the dataset comes from a single health screening program in South Korea, with predominantly urban and ethnically homogeneous participants. This raises concerns about the model's applicability to other ethnic groups, since genetic variation, such as variants in the CYP2A6 gene that affect nicotine metabolism, and differing baseline disease prevalence could influence outcomes. External validation in more diverse populations, including European, African, and Latino cohorts, is therefore necessary before clinical application.

Additionally, the dataset lacks socioeconomic indicators such as income, education, and occupation, which are known confounders of both smoking behavior and health outcomes. This omission may conflate socioeconomic health disparities with smoking-specific effects. The study's cross-sectional design further limits the ability to establish temporal causality or predict future disease outcomes; longitudinal studies that track individuals over 5 to 10 years are needed to validate high-risk predictions against actual disease incidence. Lastly, reliance on self-reported smoking status may lead to underreporting due to social desirability bias; incorporating biochemical validation, such as cotinine levels, would improve the accuracy of outcome ascertainment. Future research should prioritize multi-site studies in diverse ethnic populations, prospective cohorts with longitudinal follow-up, and the inclusion of detailed smoking history variables.