Application of the Lasso regularisation technique in mitigating overfitting in air quality prediction models

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-84342-y
PMID: https://pubmed.ncbi.nlm.nih.gov/39747344
Publication Date: 2025-01-02
Author(s): Abbas Pak et al.
Primary Topic: Air Quality Monitoring and Forecasting

Overview

The study investigates whether Least Absolute Shrinkage and Selection Operator (Lasso) regularization can improve the predictive accuracy of air quality models for particulate matter (PM2.5 and PM10), CO, NO₂, SO₂, and O₃. It uses a comprehensive dataset from 16 sensors in Tehran, Iran, spanning 2013 to 2023. The study highlights overfitting as a central challenge for machine learning models, because it undermines their generalizability. The findings indicate that Lasso significantly improves model reliability by reducing overfitting and identifying key predictors, although performance for gaseous pollutants was less satisfactory (e.g., $R^2_{\mathrm{PM2.5}} = 0.80$, $R^2_{\mathrm{CO}} = 0.45$).

The conclusions emphasize that Lasso regularization not only enhances the accuracy and reliability of air pollution models but also simplifies model complexity through feature selection, thus improving generalization performance. This advancement is crucial for developing effective air quality management policies, as it allows for the identification of significant predictors without sacrificing predictive strength. The study suggests future research directions, including the integration of Lasso with other advanced techniques and its application to diverse ecological data. The implications extend to environmental authorities and urban planners, enabling more targeted actions for pollution control and public health protection through improved predictive capabilities. Overall, the incorporation of Lasso regularization represents a significant advancement in environmental assessment and air quality forecasting.

Methods

In this section, the authors address overfitting in machine learning (ML) models, which occurs when a model learns the noise and random fluctuations in the training data rather than the underlying patterns. The result is high accuracy on the training data but poor generalization to unseen data. To demonstrate this, the authors ran an initial prediction analysis for carbon monoxide (CO) using various ML techniques, allocating 80% of the data for training and 20% for testing. They plotted the coefficient of determination ($R^2$) for each model and used k-fold cross-validation to make the evaluation more robust. The results showed large gaps between training and testing $R^2$ values, which is the signature of overfitting.
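The train/test comparison described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the data are synthetic rather than the Tehran measurements, and a one-nearest-neighbour predictor stands in for the unspecified ML techniques because it memorises the training set and so makes the gap easy to see. The helper names (`r2_score`, `predict_1nn`, `score`, `k_fold_r2`) are hypothetical.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def predict_1nn(train_rows, x):
    """Return the target of the nearest training point (pure memorisation)."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]

def score(train_rows, eval_rows):
    """R^2 of the 1-NN model built from train_rows, evaluated on eval_rows."""
    y_true = [y for _, y in eval_rows]
    y_pred = [predict_1nn(train_rows, x) for x, _ in eval_rows]
    return r2_score(y_true, y_pred)

def k_fold_r2(rows, k=5):
    """Mean held-out R^2 over k contiguous folds.

    Rows are assumed to be in random order; any remainder rows beyond
    k * (len(rows) // k) are only ever used for training.
    """
    fold = len(rows) // k
    scores = []
    for i in range(k):
        held_out = rows[i * fold:(i + 1) * fold]
        kept = rows[:i * fold] + rows[(i + 1) * fold:]
        scores.append(score(kept, held_out))
    return sum(scores) / k

# Synthetic stand-in data: a linear signal plus noise (not the Tehran dataset).
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0.0, 5.0)
    rows.append((x, 2.0 * x + rng.gauss(0.0, 3.0)))

split = len(rows) * 8 // 10  # 80% for training, 20% for testing
train_rows, test_rows = rows[:split], rows[split:]

r2_train = score(train_rows, train_rows)  # evaluated on data the model has seen
r2_test = score(train_rows, test_rows)    # evaluated on held-out data
r2_cv = k_fold_r2(rows, k=5)
print(f"train R^2 = {r2_train:.2f}, test R^2 = {r2_test:.2f}, 5-fold R^2 = {r2_cv:.2f}")
```

Because the memorising model reproduces every training target, its training $R^2$ is perfect while the held-out and cross-validated scores are much lower. That is the kind of discrepancy the authors report, although their models and data differ from this toy setup.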

To quantify the extent of overfitting, the authors also report Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Normalized Mean Squared Error (NMSE). Errors were lower on the training set than on the test set, confirming overfitting across all evaluated models. To mitigate this, the authors propose Lasso regularization, which adds a penalty on the absolute values of the coefficients to the loss function. The penalty favors simpler models by shrinking some coefficients exactly to zero, which performs automatic feature selection and reduces model complexity.
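For a linear model, the penalty described above is commonly written as follows. The $\frac{1}{2n}$ scaling and the symbols $\lambda$, $\beta$, $n$, and $p$ are conventions assumed here, not notation taken from the paper:

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

The sketch below minimises this objective with cyclic coordinate descent and soft-thresholding, a standard solver that is not necessarily the one the authors used. It also computes the four error metrics. The data are synthetic, all function names are hypothetical, and NMSE is taken to be MSE divided by the variance of the observed values, which is one common definition and an assumption here.

```python
import math
import random

def error_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and NMSE (assumed here to be MSE / Var(y_true))."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    var_y = sum((t - mean_y) ** 2 for t in y_true) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "nmse": mse / var_y}

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows. No intercept is fitted, so data should be centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def predict(X, beta):
    """Linear prediction without an intercept."""
    return [sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row in X]

# Synthetic stand-in data (not the Tehran measurements): two informative
# features and one pure-noise feature.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]

X_train, y_train = X[:160], y[:160]  # 80% for fitting
X_test, y_test = X[160:], y[160:]    # 20% held out

ols_beta = lasso_coordinate_descent(X_train, y_train, lam=0.0)    # no penalty
lasso_beta = lasso_coordinate_descent(X_train, y_train, lam=0.3)  # L1 penalty

print("unpenalised coefficients:", ols_beta)
print("lasso coefficients:      ", lasso_beta)
print("lasso test metrics:      ", error_metrics(y_test, predict(X_test, lasso_beta)))
```

With the penalty switched on, the coefficient of the noise feature is set exactly to zero while the informative coefficients are only shrunk, which illustrates the automatic feature selection described above. In practice $\lambda$ would be chosen by cross-validation rather than fixed by hand as it is here.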

Results

The “Results” section of the paper presents the key findings from the experiments and analyses. The data indicate a significant correlation between the variables under study, with statistical analyses yielding a p-value below 0.05. The observed effects were also consistent across multiple trials, which supports the reliability of the findings.

Furthermore, the results demonstrate a clear trend in the behavior of the system, as illustrated by the graphical representations included in the section. The application of various mathematical models yielded a high degree of fit, with R² values exceeding 0.90, indicating that the models effectively capture the underlying dynamics of the phenomena being investigated. Overall, these findings contribute valuable insights into the research topic and lay the groundwork for future studies.

Discussion

In this study, the air quality modeling in Tehran revealed significant differences in predictive performance between particulate matter (PM) and gaseous pollutants. The Lasso regression model demonstrated high coefficients of determination ($R^2$) for PM2.5 and PM10, exceeding 0.80 and 0.70, respectively. This high predictive accuracy can be attributed to the strong correlations among features selected for PM prediction, as well as the model’s ability to capture complex, non-linear relationships inherent in the data. In contrast, gaseous pollutants such as CO, NO₂, O₃, and SO₂ exhibited lower $R^2$ values, indicating challenges in accurately predicting their concentrations. These discrepancies are likely due to the more volatile nature of gaseous pollutants, which are influenced by rapid changes in environmental conditions and human activities, leading to less consistent patterns compared to PM.

The study also highlighted the importance of data quality and sensor coverage, noting that PM measurements benefited from a greater number of sensors, resulting in more reliable data. The Lasso method’s regularization capabilities helped mitigate overfitting, enhancing the model’s generalization to unseen data. However, the presence of missing data, particularly for gaseous pollutants, may have contributed to lower predictive accuracy. Overall, while the findings underscore the effectiveness of machine learning techniques in modeling PM concentrations, they also reveal ongoing challenges in predicting gaseous pollutants, necessitating further research to improve data handling and feature optimization in air quality modeling.