Predicting water quality index using stacked ensemble regression and SHAP based explainable artificial intelligence

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-09463-4
PMID: https://pubmed.ncbi.nlm.nih.gov/40850991
Publication Date: 2025-08-24
Author(s): Rakesh Choudhary et al.
Primary Topic: Hydrological Forecasting Using AI

Overview

This study presents a novel framework for predicting the Water Quality Index (WQI) with a stacked regression ensemble that combines six machine learning algorithms (XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost) as base learners, with Linear Regression as the meta-learner. The model was trained on a dataset of 1,987 water quality samples collected from Indian rivers between 2005 and 2014. It achieved a coefficient of determination ($R^2$) of 0.9952 and a mean absolute error (MAE) of 0.7637, outperforming the individual models, including CatBoost and Gradient Boosting, which reached $R^2$ values of 0.9894 and 0.9907, respectively.
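As a rough illustration of the stacking arrangement described above, the following Python sketch uses scikit-learn's StackingRegressor with a Linear Regression meta-learner. It is not the authors' implementation: the data are synthetic, the feature ranges and the target formula are invented for the example, and only the four base learners that ship with scikit-learn are included (XGBoost and CatBoost are left out to avoid extra dependencies).

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in data, not the study's river samples.
rng = np.random.default_rng(0)
n_samples = 400
X = np.column_stack([
    rng.uniform(2.0, 10.0, n_samples),      # dissolved oxygen (DO)
    rng.uniform(0.0, 8.0, n_samples),       # biochemical oxygen demand (BOD)
    rng.uniform(100.0, 1000.0, n_samples),  # conductivity
    rng.uniform(6.0, 9.0, n_samples),       # pH
])
# ASSUMPTION: made-up target for demonstration, NOT a real WQI formula.
y = (
    10.0 * X[:, 0]
    - 5.0 * X[:, 1]
    - 0.02 * X[:, 2]
    - 8.0 * np.abs(X[:, 3] - 7.0)
    + rng.normal(0.0, 1.0, n_samples)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base learners (level 0). Small n_estimators keeps the sketch fast.
base_learners = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
    ("ada", AdaBoostRegressor(n_estimators=50, random_state=0)),
]

# Meta-learner (level 1) is fitted on out-of-fold base predictions (cv=5).
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
stack_r2 = r2_score(y_test, y_pred)
stack_mae = mean_absolute_error(y_test, y_pred)
print(f"stacked R^2 = {stack_r2:.4f}, MAE = {stack_mae:.4f}")
```

Fitting the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, keeps it from simply rewarding whichever base learner overfits the training set most.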

The study also applies SHapley Additive exPlanations (SHAP) to improve model interpretability, identifying dissolved oxygen (DO), biochemical oxygen demand (BOD), conductivity, and pH as the most influential predictors of WQI. This combination is reported to improve predictive accuracy by 15-30% compared to previous methodologies while also addressing the interpretability challenges associated with black-box models. The framework is designed for real-world applications, offering scalability and adaptability to real-time data streams, which supports proactive environmental monitoring and informed decision-making for water resource management and public health.
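To make the attribution idea concrete, the sketch below computes exact Shapley values by brute force for a toy model with four inputs. Everything in it is illustrative: the toy_wqi function, its coefficients, and the sample and baseline values are hypothetical, and an "absent" feature is represented by substituting a baseline value, which is one common convention. The SHAP library uses much faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial

FEATURES = ["DO", "BOD", "conductivity", "pH"]


def toy_wqi(z):
    """HYPOTHETICAL model for illustration only; not a real WQI formula."""
    do, bod, cond, ph = z
    return (
        40.0
        + 6.0 * do
        - 4.0 * bod
        - 0.01 * cond
        - 5.0 * abs(ph - 7.0)
        - 0.5 * do * bod  # interaction term so the attribution is non-trivial
    )


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)

    def value(present):
        # Features in `present` take their actual value, the rest the baseline.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                # Weighted marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi


# ASSUMPTION: invented sample and baseline values.
x = [7.5, 2.0, 350.0, 7.8]
baseline = [6.0, 3.0, 500.0, 7.2]

phi = shapley_values(toy_wqi, x, baseline)
for name, contribution in zip(FEATURES, phi):
    print(f"{name:>12}: {contribution:+.3f}")
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the property that lets SHAP values be read as a per-feature breakdown of a single prediction.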

Methods

The study follows a quantitative, supervised regression approach. The data comprise 1,987 water quality samples collected from Indian rivers between 2005 and 2014, and the prediction target is the Water Quality Index (WQI).

The model is a stacked ensemble: six base regressors (XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost) each produce WQI predictions, and a Linear Regression meta-learner combines those predictions into the final estimate. Performance is assessed with the coefficient of determination ($R^2$), mean absolute error (MAE), and root mean squared error (RMSE), and the stacked model is compared against the individual models. SHAP is then applied to attribute the model's predictions to the input water quality parameters.

Results

The stacked ensemble achieved an $R^2$ of 0.9952, an MAE of 0.7637, and an RMSE of 1.07, outperforming the individual models, including Gradient Boosting ($R^2 = 0.9907$) and CatBoost ($R^2 = 0.9894$).

SHAP analysis identified dissolved oxygen (DO), biochemical oxygen demand (BOD), conductivity, and pH as the most influential predictors of WQI. The authors suggest that the framework has broader applications in water quality monitoring and may inform future research directions and practical deployments.

Discussion

The discussion highlights the importance of water quality monitoring in the context of increasing urbanization and industrial activity, which have severely reduced the availability of clean water. Traditional water quality assessments, while comprehensive, rely on static, labor-intensive laboratory methods that fail to capture dynamic environmental changes. The paper advocates integrating machine learning (ML) techniques, particularly regression-based models, to improve predictive accuracy and responsiveness in real-time water quality monitoring. The proposed framework employs a stacked ensemble of regression models, including XGBoost, CatBoost, and AdaBoost, combined with SHAP-based explainability to provide transparent insight into model predictions.

Key findings indicate that the stacked ensemble outperforms traditional models, achieving a coefficient of determination of $R^2 = 0.995$ and a root mean squared error (RMSE) of 1.07. The framework not only improves predictive quality but also fosters stakeholder trust through its interpretability. SHAP analysis shows that dissolved oxygen (DO), biochemical oxygen demand (BOD), conductivity, and pH are critical indicators of water quality. The study emphasizes the need for scalable and adaptable models that can integrate with Internet of Things (IoT) sensor networks, supporting proactive water resource management and sustainable governance. Overall, this research contributes to the advancement of real-time environmental monitoring and decision-making frameworks in water quality management.
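For reference, the three metrics quoted in this summary ($R^2$, MAE, and RMSE) can be computed from paired observed and predicted values as in the following sketch. The sample values are invented for illustration and are not taken from the study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the observed mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": sqrt(ss_res / n),
    }


# ASSUMPTION: invented WQI values, used only to exercise the function.
observed = [60.0, 72.0, 85.0, 91.0]
predicted = [61.0, 71.0, 86.0, 90.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

MAE and RMSE are expressed in the same units as WQI, while $R^2$ is unitless and measures the share of variance in the observed values that the predictions explain.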

Limitations

The study's limitations point to the need for broader data coverage and further model validation. Because the data come from a limited number of river sites in India, it is unclear how well the stacked ensemble generalizes to other geographical and climatic conditions. Future research should test the model on datasets from other regions to assess its robustness and adaptability. In addition, the performance of the stacked ensemble depends on careful selection and tuning of the base models, so the sensitivity of the results to these choices warrants further investigation.

Moreover, while the study demonstrates the practical and scalable nature of the proposed framework, which integrates with IoT-enabled sensor networks for real-time water quality index (WQI) predictions, challenges such as calibration, environmental noise, and data integrity must be addressed. Future directions include the development of digital twins and cloud-based processing to enhance decision-making in smart water quality monitoring. The exploration of advanced deep learning techniques, such as Long Short-Term Memory (LSTM) networks and transformers, is also recommended to improve forecast accuracy by effectively capturing complex spatiotemporal patterns, thereby expanding the model’s applicability in large-scale environmental contexts.