AI-driven wastewater management through comparative analysis of feature selection techniques and predictive models

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-07124-0
PMID: https://pubmed.ncbi.nlm.nih.gov/40659650
Publication Date: 2025-07-14
Author(s): Zhenyun Du et al.
Primary Topic: Water Quality Monitoring Technologies

Overview

The integration of artificial intelligence (AI) into wastewater treatment management presents a significant opportunity to improve effluent quality predictions and operational efficiency. This study assesses how well various machine learning models predict critical wastewater effluent parameters, including Chemical Oxygen Demand (COD), Biochemical Oxygen Demand (BOD), Total Suspended Solids (TSS), Total Effluent Nitrogen, and Total Effluent Phosphorus. Three feature selection techniques were used to identify the most influential predictors: SelectKBest, Mutual Information, and Recursive Feature Elimination (RFE) using Random Forest.
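The three feature selection techniques named above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's code: the column names, the `f_regression` score function for SelectKBest, and the Random Forest settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    f_regression,
    mutual_info_regression,
)

# Synthetic stand-in data; the column names are hypothetical, not the study's.
rng = np.random.default_rng(0)
feature_names = ["vss", "dissolved_cod", "ph", "flow", "temperature", "conductivity"]
X = rng.normal(size=(300, len(feature_names)))
# The target depends strongly on column 0 and weakly on column 1.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 1) SelectKBest with a univariate F-test (the score function is an assumption).
skb = SelectKBest(score_func=f_regression, k=3).fit(X, y)
top_skb = feature_names[int(np.argmax(skb.scores_))]

# 2) Mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)
top_mi = feature_names[int(np.argmax(mi))]

# 3) Recursive Feature Elimination wrapped around a Random Forest.
#    Features that survive elimination receive rank 1.
rfe = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=1,
)
rfe.fit(X, y)
top_rfe = feature_names[int(np.argmin(rfe.ranking_))]

print(top_skb, top_mi, top_rfe)
```

When all three methods agree on the top-ranked feature, as the study reports for VSS, that agreement is stronger evidence than any single ranking, because the methods rest on different assumptions (linear association, general dependence, and model-based importance).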

The study compared ensemble learning models (XGBoost, Random Forest, Gradient Boosting, and LightGBM) against Decision Tree models. Volatile suspended solids (VSS) emerged as the most significant predictor across all three feature selection methods. Ensemble models outperformed Decision Trees: Gradient Boosting achieved the highest predictive accuracy for TSS and total nitrogen (Mean Absolute Error (MAE): 3.667, $R^2$: 97.53%), while XGBoost performed best for COD (MAE: 6.251, $R^2$: 83.41%) and BOD (MAE: 1.589, $R^2$: 79.64%). LightGBM gave the best result for total phosphorus (MAE: 0.230, $R^2$: 28.68%), although an $R^2$ this low means most of the variance in that parameter remained unexplained. Despite these advances, challenges such as operational irregularities and seasonal variations remain, indicating the need for further refinement of AI-driven wastewater management approaches.

Methods

In this study, machine learning (ML) models were developed to predict potential disturbances in the discharge of wastewater treatment plants (WWTPs). The methodology is structured into six layers: handling missing values, label encoding, feature selection, stratified splits, ML modeling, and ML model evaluation. Each layer is elaborated upon in the subsequent sections, with an overview provided in Figure 1.
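The six layers listed above could look roughly like the following in Python. This is a hedged sketch on synthetic data: the column names, the median imputation, the quartile binning used to stratify a continuous target, and the choice of Gradient Boosting as the model are all assumptions for illustration, since the summary does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for plant records; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "vss": rng.normal(20, 5, n),
    "dissolved_cod": rng.normal(40, 10, n),
    "ph": rng.normal(7, 0.3, n),
    "season": rng.choice(["winter", "spring", "summer", "autumn"], n),
})
df["tss"] = 1.5 * df["vss"] + 0.1 * df["dissolved_cod"] + rng.normal(0, 1, n)
df.loc[rng.choice(n, 20, replace=False), "ph"] = np.nan  # inject gaps

# Layer 1: handle missing values (median imputation is an assumption).
num_cols = ["vss", "dissolved_cod", "ph"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Layer 2: label-encode the categorical column.
df["season"] = LabelEncoder().fit_transform(df["season"])

# Layer 3: feature selection (keep the 2 highest-scoring features).
X, y = df.drop(columns="tss"), df["tss"]
selector = SelectKBest(f_regression, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

# Layer 4: stratified split. Binning the continuous target into quartiles
# so that it can be stratified is an assumption about how this was done.
bins = pd.qcut(y, q=4, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X[selected], y, test_size=0.2, stratify=bins, random_state=0
)

# Layer 5: model with library-default hyperparameters.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Layer 6: evaluation.
pred = model.predict(X_test)
report = {
    "MAE": mean_absolute_error(y_test, pred),
    "MSE": mean_squared_error(y_test, pred),
    "R2_percent": 100 * r2_score(y_test, pred),
}
print(selected, report)
```

The sketch follows the layer order as listed, with feature selection before the split. Fitting the selector on the training portion only is the more conservative choice when information leakage into the test set is a concern.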

The experiments were conducted in Google Colab, a cloud-based Jupyter Notebook environment, using Python libraries such as Matplotlib for data visualization, NumPy for numerical computation, Pandas for data manipulation and preprocessing, and Scikit-learn for ML modeling and evaluation. Notably, the study did not perform hyperparameter optimization and relied instead on the default settings for all algorithms. Model training and evaluation ran on CPUs with 16 GB of RAM, without GPU acceleration.

Results

This section presents the empirical findings from the machine learning (ML) models predicting key wastewater parameters: Chemical Oxygen Demand (COD), Biochemical Oxygen Demand (BOD), Total Suspended Solids (TSS), Effluent Total Nitrogen, and Effluent Total Phosphorus. The evaluation metrics are Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared ($R^2$). Feature importance analysis with the SelectKBest method identified Effluent Volatile Suspended Solids (VSS) as the most significant predictor, with a score of 5298.4, followed by Effluent Dissolved COD and others, with scores declining steeply after the top feature.
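For reference, the three evaluation metrics can be computed by hand as in the short sketch below. The four observed and predicted values are made up for illustration and are not taken from the study. $R^2$ is multiplied by 100 to match the percentage form used in this summary.

```python
import numpy as np

# Illustrative values only; not data from the study.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

residuals = y_true - y_pred

# Mean Absolute Error: average magnitude of the residuals.
mae = np.mean(np.abs(residuals))

# Mean Squared Error: average squared residual (penalizes large errors more).
mse = np.mean(residuals ** 2)

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={100 * r2:.2f}%")
# prints: MAE=2.000  MSE=4.500  R2=96.40%
```

MAE and MSE are expressed in the units of the target (and its square), so they are not comparable across parameters with different scales, whereas $R^2$ is scale-free.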

The results indicate that XGBoost achieved the lowest MSE (119.24) for COD predictions, while Random Forest excelled in BOD predictions with the lowest MAE (1.62) and MSE (6.08). Gradient Boosting emerged as the best model for TSS and total nitrogen predictions, while LightGBM performed best for total phosphorus. The $R^2$ scores revealed that XGBoost (83.41%) and Gradient Boosting (74.29%) were the top performers for COD and BOD, respectively, whereas Decision Tree models consistently underperformed across all parameters. Additionally, the Mutual Information method highlighted Effluent VSS as the most relevant feature for predictions, reinforcing the importance of selecting appropriate features for model training. Overall, ensemble models, particularly XGBoost and Gradient Boosting, demonstrated superior predictive accuracy compared to Decision Tree models across various effluent quality parameters.
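A comparison of this kind, with every model left at its default hyperparameters, might be set up as in the sketch below. It is an assumption-laden illustration: it uses scikit-learn's synthetic `make_friedman1` dataset rather than plant data, and it omits XGBoost and LightGBM so that it runs with scikit-learn alone.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem standing in for effluent data.
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default hyperparameters throughout; only the random seed is fixed.
models = {
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2_percent": 100 * r2_score(y_test, pred),
    }

# Report each model and pick the one with the highest R-squared.
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
best = max(results, key=lambda k: results[k]["R2_percent"])
print("best by R2:", best)
```

Because the ranking can differ by metric, as it does for BOD in the results above, it helps to keep all three metrics per model rather than a single score.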

Discussion

In this study, the authors examined the predictive accuracy of machine learning (ML) models for effluent parameters in wastewater treatment facilities, emphasizing feature selection strategies to enhance model performance. The dataset, spanning January 1, 2022, to December 8, 2024, comprises 1,075 rows with 65 features and 6 target variables, including Chemical Oxygen Demand (COD), Biochemical Oxygen Demand (BOD), and Total Suspended Solids (TSS). The analysis identified Effluent Volatile Suspended Solids (VSS) as the most significant predictor, corroborating its relevance in assessing biological activity within treatment systems. The study found that ensemble methods, particularly XGBoost, Gradient Boosting, and LightGBM, outperformed simpler models like Decision Trees, demonstrating their effectiveness in capturing the complex, non-linear relationships inherent in environmental datasets.

Despite the promising results, the study acknowledges limitations, such as the dataset’s potential inadequacy in reflecting operational irregularities and seasonal variations. Future research directions include integrating operational logs and time series data, employing hybrid feature selection methods that combine statistical and expert-driven criteria, and enhancing real-time control systems with AI predictions. The ultimate goal is to improve the efficiency and sustainability of wastewater management through advanced modeling techniques, which could lead to optimized chemical dosing and reduced energy consumption while ensuring compliance with discharge regulations.