A hierarchical ensemble approach for multi-country PM10 forecasting using LightGBM and residual neural network

Journal: Discover Atmosphere, Volume: 4, Issue: 1
DOI: https://doi.org/10.1007/s44292-025-00072-4
Publication Date: 2026-01-05
Author(s): Syed Azeem Inam et al.
Primary Topic: Air Quality Monitoring and Forecasting

Overview

This research introduces a three-stage stacked ensemble framework for day-ahead forecasting of particulate matter (PM10) concentrations across 380 cities in 25 countries, using the World Air Quality Index (WAQI) dataset of nearly 1.8 million records. The framework targets two methodological problems, temporal leakage and inadequate modeling of geographic dependencies. It uses Light Gradient Boosting Machine (LightGBM) as the base learner, a residual neural network to capture nonlinearities, and a Ridge regression meta-learner trained on out-of-fold predictions. Chronological per-city train-test splits, leakage-safe target encoding, and expanding-window rolling-origin cross-validation guard against leakage.
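The expanding-window rolling-origin scheme mentioned above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `expanding_window_splits` and the parameters `n_folds` and `min_train` are assumptions, and the summary does not state the number of folds or the initial window length. The indices are assumed to refer to one city's records sorted by date.

```python
def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, validation_indices) with a growing training window.

    Hypothetical helper: every validation block lies strictly after the
    training block, so no future observation is visible during fitting.
    """
    if min_train >= n_samples:
        raise ValueError("min_train must be smaller than n_samples")
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        # The last fold absorbs any leftover samples.
        val_end = train_end + fold_size if k < n_folds - 1 else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))


# Ten daily records for one city, three folds, at least four training days.
splits = list(expanding_window_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx)
```

Applying the splitter separately to each city, as the summary describes, keeps one city's future observations out of another city's training window.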

The model achieves $R^2 = 0.9983$, RMSE = 1.01 µg/m³, and MAE = 1.01 µg/m³. These figures represent substantial improvements over traditional methods, including a 96.7% reduction in RMSE compared to persistence and a 76% improvement over XGBoost. Cross-validation results indicate strong generalization, with $R^2$ values exceeding 0.99 for larger datasets. The framework’s computational efficiency allows for routine retraining, making it suitable for practical applications in public health, environmental regulation, and urban planning. Future research directions include adding multi-step forecasting and integrating exogenous variables. The authors emphasize that methodological rigor is essential to developing reliable air quality forecasting systems.
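For readers who want to reproduce this kind of comparison, the metrics quoted above follow their standard definitions. The sketch below uses made-up PM10 values, not data from the paper, and the helper name `rmse_reduction_pct` is an assumption introduced for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse_reduction_pct(model_rmse, baseline_rmse):
    """Percentage by which the model's RMSE falls below a baseline's RMSE."""
    return 100.0 * (1.0 - model_rmse / baseline_rmse)


# Illustrative values in µg/m³ (invented for this example).
observed = [20.0, 35.0, 50.0, 42.0, 30.0]
predicted = [21.0, 34.0, 48.0, 43.0, 30.0]
persistence = [25.0, 20.0, 35.0, 50.0, 42.0]  # previous day's observation

print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
print(rmse_reduction_pct(rmse(observed, predicted), rmse(observed, persistence)))
```

MAE can never exceed RMSE on the same predictions, and the two coincide only when every absolute error has the same magnitude, which makes the pair a quick consistency check on reported figures.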

Introduction

The introduction highlights the public health implications of particulate matter with an aerodynamic diameter smaller than 10 µm (PM10), which is associated with increased morbidity and mortality worldwide. Systematic reviews and epidemiological studies have established a causal relationship between PM10 exposure and adverse health outcomes, prompting regulatory bodies such as the World Health Organization (WHO) and the European Union (EU) to impose stricter limits on PM10 concentrations. The study emphasizes the need for reliable day-ahead predictions of urban PM10 levels to support timely public health interventions, especially in cities that frequently exceed regulatory thresholds.

The authors identify significant challenges in predicting urban air quality, including the nonstationary nature of PM10 time series, sporadic sensor data, and the hierarchical structure of urban environments that complicates modeling efforts. They critique existing methodologies for their potential to introduce data leakage and bias, particularly in cross-validation and target encoding practices. To address these issues, the study proposes a novel hybrid machine-learning pipeline designed to provide accurate PM10 forecasts while preventing information leakage. This pipeline employs a three-stage ensemble architecture, incorporating Light Gradient Boosting Machine (LightGBM) as the primary learner, a residual neural network for capturing higher-order nonlinearities, and a Ridge regression meta-learner for final predictions. The methodology emphasizes strict chronological splits and out-of-sample testing to ensure robust performance across diverse urban contexts. The research aims to enhance the reliability of PM10 forecasting, offering a transferable framework for environmental time-series modeling that adheres to best practices in leakage prevention.
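One common way to keep target encoding free of temporal leakage is to encode each row using only target values observed earlier in time. The sketch below illustrates that idea under stated assumptions: the function name, the `prior` and `smoothing` parameters, and the flat city-level encoding are all hypothetical, and the paper's hierarchical variant is not reproduced here.

```python
def past_only_target_encoding(categories, targets, prior, smoothing=1.0):
    """Encode each row with the smoothed mean of earlier targets in its category.

    Rows must already be sorted chronologically. A row never sees its own
    target or any later one, so the encoding cannot leak future information.
    `prior` is a fallback value (for example, a mean computed on training data).
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)
        s = sums.get(cat, 0.0)
        # Shrink toward the prior when few past observations exist.
        encoded.append((s + smoothing * prior) / (n + smoothing))
        # Update the running statistics only after the row has been encoded.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


# Invented example: two cities, five chronologically ordered records.
cities = ["A", "B", "A", "A", "B"]
pm10 = [10.0, 40.0, 20.0, 30.0, 60.0]
enc = past_only_target_encoding(cities, pm10, prior=25.0)
print(enc)
```

The first record of each city falls back to the prior, which is the behaviour one would want for a city with no history at prediction time.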

Methods

The proposed framework for multi-country PM10 prediction is a three-stage, leakage-free ensemble. It integrates Light Gradient Boosting Machine (LightGBM) as the primary learner, a residual neural network that captures nonlinear relationships, and a Ridge regression meta-learner that combines the predictions. The architecture is designed to prevent temporal leakage and cross-city information contamination through an expanding-window rolling-origin validation protocol applied to each city individually. The framework decomposes the prediction problem into complementary components: linear and additive effects modeled by gradient boosting, higher-order nonlinearities captured by residual learning, and optimal weighting of these information sources via meta-learning. This modular design improves both predictive power and interpretability, both of which matter for public health applications.
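The final stage can be illustrated with a closed-form ridge fit over two stacked inputs. This is a sketch under assumptions, not the authors' implementation: the summary does not specify how the meta-learner's inputs are arranged, so the example assumes one column of out-of-fold base predictions and one column of out-of-fold residual corrections, with no intercept. The name `fit_ridge_2`, the `alpha` values, and the numbers are invented.

```python
def fit_ridge_2(x1, x2, y, alpha=1.0):
    """Closed-form ridge weights (no intercept) for two input columns.

    Solves (X^T X + alpha * I) w = X^T y for the 2x2 case via Cramer's rule.
    """
    a11 = sum(v * v for v in x1) + alpha
    a22 = sum(v * v for v in x2) + alpha
    a12 = sum(u * v for u, v in zip(x1, x2))
    b1 = sum(u * t for u, t in zip(x1, y))
    b2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12  # positive whenever alpha > 0
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


# Invented out-of-fold outputs: stage 1 (base learner) and stage 2 (residual model).
base_oof = [20.0, 30.0, 40.0, 50.0]
resid_oof = [1.0, -2.0, 1.5, -0.5]
target = [21.0, 28.0, 41.5, 49.5]  # here exactly base + residual correction

w_base, w_resid = fit_ridge_2(base_oof, resid_oof, target, alpha=1.0)
stacked = [w_base * b + w_resid * r for b, r in zip(base_oof, resid_oof)]
print(w_base, w_resid)
```

Fitting the weights on out-of-fold predictions rather than in-sample ones matters because in-sample base predictions look more accurate than they are, which would bias the meta-learner toward overconfident weights.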

Data preprocessing involved systematic handling of raw data from the WAQI database, including the exclusion of records with missing data to maintain temporal integrity and the normalization of numeric variables to the range [0, 1]. Feature engineering included seven lag variables to reflect autoregressive patterns in PM10, as well as rolling-mean features to capture medium- and long-term pollution trends. A leakage-safe target encoding method was used to prevent temporal leakage when incorporating categorical variables. The dataset was split chronologically into training and test sets so that model evaluation reflects real-world predictive performance. Ablation analysis indicated that the full model, incorporating lagged features, rolling means, and categorical encodings, achieved the lowest error metrics (MAE = 1.13, RMSE = 6.27), with rolling means identified as the most significant feature group, followed by lagged variables and categorical encodings. Baseline models, including a persistence model and a biased 7-day moving average, served as points of comparison for the proposed ensemble framework.
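The lag and rolling-mean features, together with the chronological split, can be sketched for a single city's daily series as follows. The seven lags come from the text; the rolling-window lengths, the 20% test fraction, and the function names are assumptions made for illustration, since the summary does not state them.

```python
def make_features(series, n_lags=7, windows=(7, 30)):
    """Build (feature_rows, targets) using only values observed before each target day."""
    start = max(n_lags, max(windows))
    rows, targets = [], []
    for t in range(start, len(series)):
        # lag_k is the value k days before the target day.
        feats = {f"lag_{k}": series[t - k] for k in range(1, n_lags + 1)}
        for w in windows:
            past = series[t - w:t]  # excludes day t itself
            feats[f"roll_mean_{w}"] = sum(past) / w
        rows.append(feats)
        targets.append(series[t])
    return rows, targets


def chronological_split(rows, targets, test_fraction=0.2):
    """Keep the most recent rows for testing; never shuffle a time series."""
    cut = int(len(rows) * (1 - test_fraction))
    return (rows[:cut], targets[:cut]), (rows[cut:], targets[cut:])


# Invented daily PM10 series for one city: 10.0, 11.0, ..., 29.0.
series = [float(v) for v in range(10, 30)]
rows, targets = make_features(series, n_lags=7, windows=(3, 7))
(train_X, train_y), (test_X, test_y) = chronological_split(rows, targets)
print(rows[0], targets[0])
```

In this layout the persistence baseline is simply the `lag_1` column used directly as the forecast, which makes the comparison with the ensemble straightforward.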

Results

The results demonstrate the effectiveness of the stacked ensemble model, which combines LightGBM with a residual neural network (NN), in predicting PM10 concentrations across the twenty-five countries in the WAQI dataset. In cross-validation, including the residual NN raised the average coefficient of determination ($R^2$) from 0.9832 to 0.9910. Countries with lower baseline $R^2$ values, such as India and Iran, showed proportionally greater improvements, indicating the model’s robustness in regions with high pollution variability. The ensemble consistently outperformed the baseline models, achieving an $R^2$ of 0.9983 on the test set with a root mean square error (RMSE) of 1.0071 µg/m³.

Statistical analysis confirmed that geographic variations in model performance are significant, with an F-statistic of 277.60 and a p-value below 0.0001, supporting the hierarchical modeling approach. The training process showed effective optimization, avoiding overfitting while maintaining generalization across different temporal windows. Feature importance analysis indicated that the previous day’s PM10 level was the most influential predictor, followed by rolling means, underscoring the model’s ability to capture both temporal dynamics and geographic context. The study concludes that the proposed methodology, characterized by leakage-free encoding and robust cross-validation, represents a significant advancement in environmental forecasting, achieving a 99.83% improvement over persistence models and outperforming traditional machine learning techniques.
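The summary does not name the test behind the reported F-statistic. Assuming it is a one-way ANOVA comparing a per-city performance measure across countries, the statistic can be computed as in the sketch below. The function name and the grouped values are invented for illustration.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)


# Invented per-city error values, grouped by three hypothetical countries.
errors_by_country = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_stat = one_way_anova_f(errors_by_country)
print(f_stat)
```

A large F-statistic means the spread between country means is large relative to the spread within countries. Converting it to a p-value requires the F distribution, for which a library routine such as `scipy.stats.f_oneway` would normally be used.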

Discussion

The study presents a novel three-stage hybrid ensemble framework for predicting PM10 concentrations across multiple countries, using a comprehensive dataset from the World Air Quality Index (WAQI) that encompasses 1.8 million records from 380 cities. The framework achieves an $R^2$ of 0.9983 and a root mean square error (RMSE) of 1.01 µg/m³, outperforming persistence forecasts by 96.7% and XGBoost by 76%. Key methodological advancements include leakage-safe hierarchical target encoding, the incorporation of residual neural networks to capture higher-order nonlinearities, and a Ridge regression meta-learner that enhances prediction stability. The model demonstrates robust geographic generalization, with $R^2$ values exceeding 0.99 in densely populated areas and 0.98 in sparsely populated regions.

Despite its strengths, the framework has limitations, such as its focus on day-ahead forecasting without multi-step capabilities and the exclusion of exogenous meteorological factors. Future research directions include integrating meteorological data, enhancing uncertainty quantification, and expanding the model’s applicability to rural areas and multi-pollutant scenarios. The findings underscore the importance of rigorous methodological practices in developing reliable air quality forecasting systems that can inform public health, environmental regulation, and urban planning initiatives.