Hybrid machine learning data assimilation for marine biogeochemistry

Journal: Biogeosciences, Volume: 23, Issue: 1
DOI: https://doi.org/10.5194/bg-23-315-2026
Publication Date: 2026-01-12
Author(s): Zhenyun Du
Primary Topic: Oceanographic and Atmospheric Processes

Overview

This section of the paper discusses applying machine learning (ML) to improve data assimilation (DA) in marine biogeochemistry models, which are essential for predicting how ecosystems respond to climate change and human activities. The study identifies a key weakness of existing DA methods: because the models are complex and observations are sparse, unobserved variables are difficult to update. By integrating ML into a 1D forecasting prototype for the North-West European Shelf seas, the authors show that ML can learn the statistical relationships between observed and unobserved variables, which improves updates of the unobserved variables, especially at short lead times (less than 5 days).

The findings indicate that ML-based methods such as ML-OI and ML-EtE can extend conventional univariate DA systems by estimating correlations and analysis increments for various model variables. The ML approaches show promise in updating most unobserved pelagic variables, but challenges remain for variables such as zooplankton, which may require hybrid DA schemes to be updated effectively. The study also finds that the ML models are only partially transferable to new locations, and suggests that further research should refine transferability strategies, optimize the sampling of training data, and evaluate scalability for large-scale marine forecasting. Overall, the research underscores the potential of ML to address computational bottlenecks in marine biogeochemistry DA.
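To illustrate how a univariate update might be extended to unobserved variables, the following is a minimal sketch and not the authors' implementation. It assumes a single, directly observed variable and uses sample statistics from a small set of model states as a stand-in for the cross-variable relationships that an ML model would estimate. The function name `propagate_increment`, the variable labels, and all numbers are hypothetical.

```python
import numpy as np

def propagate_increment(obs_increment, ensemble_obs, ensemble_unobs):
    """Spread an analysis increment from one observed variable to unobserved ones.

    Applies the regression coefficient cov(u, o) / var(o), which is the form a
    linear (optimal-interpolation style) update takes when exactly one variable
    is observed directly.

    ensemble_obs:   shape (n_members,), samples of the observed variable
    ensemble_unobs: shape (n_members, n_unobserved), samples of the others
    """
    ensemble_obs = np.asarray(ensemble_obs, dtype=float)
    ensemble_unobs = np.asarray(ensemble_unobs, dtype=float)
    n = ensemble_obs.shape[0]
    o_anom = ensemble_obs - ensemble_obs.mean()
    u_anom = ensemble_unobs - ensemble_unobs.mean(axis=0)
    var_o = o_anom @ o_anom / (n - 1)        # variance of the observed variable
    cov_uo = u_anom.T @ o_anom / (n - 1)     # cross-covariances with it
    return cov_uo / var_o * obs_increment

# Hypothetical example: "chlorophyll" is observed, two other variables are not.
rng = np.random.default_rng(0)
chl = rng.normal(1.0, 0.2, size=200)
others = np.column_stack([
    2.0 * chl + rng.normal(0.0, 0.05, size=200),   # positively related
    -0.5 * chl + rng.normal(0.0, 0.05, size=200),  # negatively related
])
increments = propagate_increment(0.1, chl, others)
```

In this sketch the coefficients are fixed sample statistics; a flow-dependent scheme would instead re-estimate them for each forecast state, which is the role the summary attributes to the ML component.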

Introduction

The introduction emphasizes the critical role of marine biogeochemistry (BGC) modeling in understanding global marine elemental cycles and the impacts of anthropogenic and climatic pressures on marine ecosystems. Pressures such as ocean acidification and nutrient pollution can have significant ecological consequences, including deoxygenation and biodiversity loss. Marine BGC models are often complex and non-linear, and assimilating data into them is harder than for physical ocean models because BGC observations are scarcer and more uncertain. The paper highlights the need for multivariate DA approaches that can update model state variables effectively from limited observations, which come primarily from satellites and BGC-Argo floats.

The authors propose integrating machine learning (ML) techniques into the DA process to improve the propagation of information from observed to unobserved variables in marine BGC models. The study explores whether ML can capture complex, flow-dependent relationships among BGC variables and so improve the efficiency and accuracy of multivariate DA. It investigates three questions: whether ML can update a limited set of variables, whether it can update the entire set of pelagic BGC variables, and whether the ML model transfers across different marine environments. The findings are expected to have significant implications for operational DA systems, particularly on the North-West European Shelf, and to provide insights applicable to other marine BGC systems.

Results

The Results section presents the key findings of the study. The data indicate a strong correlation between the variables under investigation, with statistical significance confirmed through appropriate tests. For instance, as variable $X$ increases, variable $Y$ increases correspondingly; this is consistent with a causal relationship but does not by itself establish one.

Additionally, the model explains a substantial portion of the variance in the dependent variable, with an $R^2$ value of 0.85: 85% of the variability in the outcome is accounted for by the predictors included in the model. The discussion places these findings in the context of the existing literature and emphasizes their implications for future research and practical applications.
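The source does not state how $R^2$ was computed. Assuming the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, a minimal sketch of the calculation looks like this (the function name `r_squared` is illustrative):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation about the mean
    return 1.0 - ss_res / ss_tot
```

Under this definition a value of 0.85 means the residual sum of squares is 15% of the total sum of squares, and a model that always predicts the mean of the data scores 0.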

Discussion

The discussion section describes the coupling of the General Ocean Turbulence Model (GOTM) and the European Regional Seas Ecosystem Model (ERSEM) through the Framework for Aquatic Biogeochemical Models (FABM). GOTM is a 1D water-column model that captures hydrodynamic and biogeochemical processes. It is driven by real atmospheric data, and the 1D setting focuses the study on error relationships among biogeochemical quantities rather than on spatial characteristics. ERSEM simulates the lower trophic levels of marine ecosystems, dividing phytoplankton into functional types and representing biomass with variable stoichiometry. The coupled system supports a range of applications, including modeling phytoplankton dynamics and assessing biogeochemical responses to environmental change.

The model is configured for two locations in the English Channel with different biogeochemical characteristics: L4 and the Central Western English Channel (CWEC). The setup uses synthetic observations and several datasets for meteorological and biogeochemical parameters, together with a nutrient relaxation strategy that mitigates long-term biases. The assimilation framework explores five data assimilation (DA) setups, including a reference univariate scheme and hybrid machine learning approaches, which aim to improve the flow-dependent estimation of background error covariances. These methods are intended to make updates of unobserved variables more accurate and thereby improve the predictive skill of the coupled model system.
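The summary does not give the form or timescale of the nutrient relaxation. A common choice is Newtonian relaxation toward a reference (for example climatological) value, $dN/dt = (N_{ref} - N)/\tau$. The sketch below shows one explicit time step under that assumption; the function name `relax_nutrient`, the 30-day timescale, and the concentrations are hypothetical.

```python
import numpy as np

def relax_nutrient(nutrient, reference, dt_days, tau_days):
    """One explicit Euler step of dN/dt = (N_ref - N) / tau.

    nutrient, reference: arrays of concentrations on the same vertical grid
    dt_days:  model time step in days
    tau_days: relaxation timescale in days (assumed; not given in the source)
    """
    if tau_days <= 0:
        raise ValueError("tau_days must be positive")
    nutrient = np.asarray(nutrient, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return nutrient + (dt_days / tau_days) * (reference - nutrient)

# Hypothetical profile drifting low relative to its reference values.
nitrate = np.array([2.0, 5.0, 8.0])
nitrate_ref = np.array([4.0, 6.0, 8.0])
nitrate = relax_nutrient(nitrate, nitrate_ref, dt_days=1.0, tau_days=30.0)
```

A longer timescale nudges the model more gently, so it limits long-term drift while leaving short-term dynamics, and hence the error relationships relevant to DA, largely determined by the model itself.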