Interpretable machine learning framework for air quality prediction in Istanbul using Shapley additive explanations (SHAP)

Journal: Stochastic Environmental Research and Risk Assessment, Volume: 40, Issue: 2
DOI: https://doi.org/10.1007/s00477-026-03168-4
Publication Date: 2026-01-22
Author(s): Enes Birinci et al.
Primary Topic: Air Quality Monitoring and Forecasting

Overview

The paper frames its analysis in the context of urban air pollution, citing the projection that 68% of the global population will live in cities by 2050. The World Health Organization (WHO) reported approximately 3.7 million premature deaths attributable to air pollution in 2012, underscoring the need for effective air quality management. The study applies machine learning (ML) algorithms to predict concentrations of particulate matter (PM10, PM2.5) and ozone (O3) in Istanbul, using data from three distinct monitoring sites. It also notes that air quality standards have been established for pollutants, including PM10 and PM2.5, to safeguard public health and the environment.

The findings show that ensemble-tree methods, particularly Extra Trees and XGBoost, significantly outperform the other algorithms in predicting pollutant levels, achieving root mean square error (RMSE) values below 10 µg/m³ for PM2.5 and O3. The SHapley Additive exPlanations (SHAP) analysis clarifies how co-pollutants and meteorological factors, such as wind speed and humidity, influence model predictions, especially during winter months. The study concludes that ML models can enhance traditional chemical transport models by providing cost-effective, localized, and timely pollution estimates, which are crucial for public health advisories and emission control strategies. The supplementary material includes detailed variable importance rankings that support the study's account of air pollution dynamics.

Introduction

The introduction highlights the escalating issue of air pollution, which has emerged as a critical concern due to rising urbanization and population growth. As cities expand and populations swell, the resultant increase in emissions from vehicles, industries, and other sources significantly deteriorates air quality. This section underscores the urgency of addressing air pollution, given its profound implications for public health and environmental sustainability. The authors aim to explore the multifaceted impacts of urbanization on air quality and propose strategies for mitigation in the context of growing urban centers.

Methods

The authors partitioned their dataset by season, using two summers (June-August, JJA 2021-2023) and two winters (December-February, DJF 2021-2023) to train and evaluate the machine learning models. This approach mitigated temporal leakage and allowed assessment under distinct meteorological conditions. The models were season- and site-specific, which improves interpretability and reduces heterogeneity because each model captures a single meteorological regime. To maintain chronological integrity, training and evaluation used blocked splits, that is, contiguous time-ordered blocks rather than shuffled samples, with a dedicated test block reserved for the final assessment.
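As a rough illustration of season-specific blocked splitting, the following Python sketch groups time-stamped records by season and holds out the final contiguous block of each season for testing. The function names, the 20% test fraction, and the synthetic daily series are assumptions made for illustration; the exact block boundaries used in the paper are not stated in this summary.

```python
from datetime import datetime, timedelta


def season_label(ts):
    """Return 'JJA' for June-August, 'DJF' for December-February, else None."""
    if ts.month in (6, 7, 8):
        return "JJA"
    if ts.month in (12, 1, 2):
        return "DJF"
    return None


def blocked_split(records, test_fraction=0.2):
    """Split (timestamp, value) records chronologically, without shuffling.

    The last `test_fraction` of the time-ordered records forms the test block,
    so every training timestamp precedes every test timestamp.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


def seasonal_blocked_splits(records, test_fraction=0.2):
    """Build one (train, test) pair per season; other months are ignored."""
    by_season = {}
    for rec in records:
        label = season_label(rec[0])
        if label is not None:
            by_season.setdefault(label, []).append(rec)
    return {
        label: blocked_split(recs, test_fraction)
        for label, recs in by_season.items()
    }


# Hypothetical daily series covering 2021-06-01 through 2023-02-28.
start = datetime(2021, 6, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(638)]
splits = seasonal_blocked_splits(records)

for label, (train, test) in splits.items():
    print(label, len(train), len(test), train[-1][0].date(), test[0][0].date())
```

Because nothing is shuffled, no test observation can precede a training observation within a season, which is the property that guards against temporal leakage.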

A variety of machine learning algorithms, including XGBoost, Extra Trees, Random Forest (RF), AdaBoost, Gradient Boosting, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Support Vector Regression (SVR), were compared to identify the most effective model for predicting air quality outcomes. Hyperparameter optimization was performed using Bayesian search tailored to each algorithm, focusing on minimizing validation error. Key hyperparameters, such as tree depth and number of trees for ensemble methods, kernel and penalty terms for SVR, and network architecture for MLP, were fine-tuned for each station and pollutant. To prevent overfitting, the authors implemented regularization techniques in tree-based models and dropout/weight decay in neural networks. The evaluation metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), R², Nash-Sutcliffe Efficiency (NSE), Kling-Gupta Efficiency (KGE), and Willmott’s d1, were computed using Python 3.10.
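The evaluation metrics listed above have standard closed forms, sketched below in plain Python. This is not the authors' code. It assumes R² means the squared Pearson correlation (some toolkits instead define R² identically to NSE), that KGE follows the common 2009 formulation with correlation, variability ratio, and bias ratio, and that d1 is Willmott's modified index of agreement with absolute differences. The sample values are made up.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _pearson(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    mo, mp = _mean(obs), _mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


def rmse(obs, pred):
    return math.sqrt(_mean([(p - o) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    return _mean([abs(p - o) for o, p in zip(obs, pred)])


def r2(obs, pred):
    """Squared Pearson correlation (an assumed definition of R²)."""
    return _pearson(obs, pred) ** 2


def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 minus error variance over observed variance."""
    mo = _mean(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mo) ** 2 for o in obs)
    return 1.0 - sse / sst


def kge(obs, pred):
    """Kling-Gupta Efficiency from correlation, variability ratio, bias ratio."""
    r = _pearson(obs, pred)
    mo, mp = _mean(obs), _mean(pred)
    so = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    sp = math.sqrt(_mean([(p - mp) ** 2 for p in pred]))
    alpha = sp / so  # variability ratio
    beta = mp / mo   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def d1(obs, pred):
    """Willmott's modified index of agreement (absolute-difference form)."""
    mo = _mean(obs)
    num = sum(abs(p - o) for o, p in zip(obs, pred))
    den = sum(abs(p - mo) + abs(o - mo) for o, p in zip(obs, pred))
    return 1.0 - num / den


# Made-up concentrations in µg/m³, for illustration only.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
for name, fn in [("RMSE", rmse), ("MAE", mae), ("R2", r2),
                 ("NSE", nse), ("KGE", kge), ("d1", d1)]:
    print(f"{name}: {fn(obs, pred):.3f}")
```

RMSE and MAE are in the units of the pollutant, while R², NSE, KGE, and d1 are dimensionless, which is one reason to report several of them together.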

Results

The results section analyzes model performance in predicting air pollutant concentrations, specifically PM10, PM2.5, and O3, across different meteorological conditions and urban typologies. Figures 2, 3, and 4 use violin plots to show the distribution of residuals: narrower distributions indicate more stable model predictions, while broader distributions indicate greater uncertainty. The study confirms that the ML algorithms outperform a multiple linear regression baseline, as shown by higher R² values and lower RMSE across the configurations tested.
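The width of a residual distribution, which the violin plots convey visually, can also be summarized numerically. The sketch below uses only the Python standard library to compare the interquartile range of residuals for two hypothetical models; the data and the helper name `residual_summary` are illustrative assumptions and do not come from the paper.

```python
import statistics


def residual_summary(obs, pred):
    """Summarize residuals (prediction minus observation) by bias and spread."""
    residuals = [p - o for o, p in zip(obs, pred)]
    q1, median, q3 = statistics.quantiles(residuals, n=4)
    return {
        "bias": statistics.fmean(residuals),
        "median": median,
        "iqr": q3 - q1,  # a wider IQR corresponds to a broader violin
    }


# Hypothetical observations and two sets of predictions.
obs = [float(v) for v in range(10, 30)]
pred_stable = [o + (1.0 if i % 2 == 0 else -1.0) for i, o in enumerate(obs)]
pred_noisy = [o + (5.0 if i % 2 == 0 else -5.0) for i, o in enumerate(obs)]

stable = residual_summary(obs, pred_stable)
noisy = residual_summary(obs, pred_noisy)
print("stable model IQR:", stable["iqr"])
print("noisy model IQR:", noisy["iqr"])
```

Both hypothetical models are unbiased on average, so only the spread distinguishes them, which is the comparison the violin plots are meant to support.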

Table 4 summarizes the best-performing model for each pollutant-site combination. Performance gaps among the leading algorithms were modest, but their consistent ranking across multiple metrics supports their efficacy. Notably, the XGBoost and Extra Trees models generalized well, particularly in urban areas such as Bağcılar and Arnavutköy, with test R² values for PM2.5 and O3 exceeding 0.80. The models showed minimal overfitting, and their predictive accuracy was notably higher in winter than in summer, reflecting the influence of localized and episodic sources on PM10 variability. Overall, the findings underscore the importance of model selection and evaluation methodology, in particular the need for fixed seasonal test splits to prevent temporal leakage in performance assessments.

Discussion

The discussion section addresses the use of ML techniques for air quality prediction in Istanbul, focusing on particulate matter (PM10, PM2.5) and ozone (O3) concentrations. The authors emphasize the limitations of traditional chemical transport models (CTMs) and statistical methods in capturing nonlinear atmospheric phenomena, and argue that ML approaches handle them more effectively. They benchmark seven different ML algorithms, including XGBoost, K-Nearest Neighbors (KNN), and Support Vector Regression (SVR), across pollutants, seasons, and geographical contexts. A distinguishing feature of the study is its use of Shapley additive explanations (SHAP) to improve model interpretability by quantifying the contributions of meteorological variables to pollutant dynamics.
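To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a single prediction of a toy model. It is not the SHAP library and not the authors' pipeline: the toy PM2.5 model, its coefficients, the feature values, and the single-baseline treatment of absent features are all assumptions chosen for illustration. The SHAP library uses more efficient algorithms for tree ensembles and typically averages over a background dataset rather than one baseline point.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance `x` relative to `baseline`.

    A feature outside a coalition takes its baseline value. The cost grows
    exponentially with the number of features, so this suits only toy cases.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical PM2.5 model with an NO2-humidity interaction term."""
    no2, wind, humidity = z
    return 5.0 + 0.4 * no2 - 2.0 * wind + 0.05 * humidity * no2


x = [40.0, 2.0, 80.0]         # instance to explain: NO2, wind speed, humidity
baseline = [20.0, 4.0, 60.0]  # assumed reference conditions
phi = shapley_values(toy_model, x, baseline)

for name, contribution in zip(["NO2", "wind", "humidity"], phi):
    print(f"{name}: {contribution:+.2f}")
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction and the baseline prediction, which is the additivity property that gives Shapley additive explanations their name.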

The research uses high-resolution air quality and meteorological data collected from three distinct Air Quality Monitoring Stations (AQMS) in Istanbul over a two-year period. The findings show that model performance varies significantly between urban and rural settings, with urban areas showing higher predictive accuracy because of localized emission sources. Seasonal analysis indicates that model reliability is generally greater in winter than in summer, which the authors attribute to the more complex pollutant dynamics that arise under varying meteorological conditions. Overall, the study adds to the growing literature on ML applications in environmental science and provides a framework for understanding air quality dynamics in urban environments.

Limitations

Several limitations may affect the interpretation of the findings. First, the research is confined to data from three air quality monitoring stations in Istanbul, which restricts the spatial generalizability of the results. Second, the modeling framework covers only summer and winter, omitting the transitional seasons (spring and autumn), which may have distinct emission patterns and meteorological dynamics. Third, the reliance on co-pollutant measurements (e.g., PM10, NO2, NOx) as input features may limit the robustness of the predictions, particularly in summer, when dispersed sources and stronger photochemistry make accurate modeling harder.

Despite these limitations, the study demonstrates that PM10, PM2.5, and O3 can be predicted effectively at urban sites, with secondary pollutants being more amenable to modeling than coarse PM. The analysis shows that traffic-related indicators and meteorological factors significantly influence PM variability in urban areas, while rural locations are more sensitive to wind direction and long-range transport. The findings support co-locating air quality and meteorological measurements to improve predictive accuracy, and suggest that additional features, such as traffic flow and satellite data, could further improve model performance. Overall, the proposed framework offers practical utility for short-term air quality estimation and supports emission control strategies in megacities like Istanbul, while also providing insight into the mechanisms that affect air quality.