تأثير بيانات تدريب التعلم الآلي على تقدير ملف بخار الماء باستخدام مقياس الميكروويف القائم على الأرض Effect of Machine Learning Training Data on Water Vapor Profile Estimation Using Ground-Based Microwave Radiometer

المجلة: SOLA، المجلد: 22، العدد: 1
DOI: https://doi.org/10.1007/s44393-025-00001-z
تاريخ النشر: 2026-01-07
المؤلف: Masahiro Minowa وآخرون
الموضوع الرئيسي: الظواهر الجوية والمحاكاة

نظرة عامة

تدرس هذه الدراسة تأثير مجموعات بيانات تدريب التعلم الآلي (ML) المختلفة على تقدير ملفات بخار الماء المستمدة من ملاحظات مقياس الموجات الدقيقة (MWR). شملت مجموعات بيانات التدريب بيانات إعادة تحليل ERA5، وقياسات SONDE، وبيانات الأرصاد الجوية السطحية. أظهرت التقييمات مقابل قياسات SONDE أن نماذج ML المدربة ببيانات ERA5 أنتجت ملفات كانت أكثر توافقًا مع القيم المرصودة مقارنة بتلك المدربة ببيانات SONDE، حتى عندما كانت مجموعات بيانات التدريب متساوية في الحجم. ومن الجدير بالذكر أن تضمين بيانات الأرصاد الجوية السطحية حسّن التوافق مع الملاحظات في الغلاف الجوي السفلي بالنسبة لـ ERA5.

كما أبرز التحليل أن تقدير ملفات بخار الماء، خاصة فيما يتعلق بالتغيرات اليومية، لم يكن ممكنًا بسبب نقص ملاحظات SONDE، مما يشير إلى إمكانية البحث في المستقبل. بالإضافة إلى ذلك، وجدت الدراسة أن النماذج ERA5-S وMP-3000 أظهرت أخطاء متوسطة مطلقة (ME) وأخطاء جذر متوسط المربعات (RMSE) متطابقة تقريبًا، مع ملفات RMSE متسقة عبر الارتفاعات في ظروف عدم وجود مطر (NO-RAIN) وبدون ماء سحابي (NO-CW). ومع ذلك، تشير الفروقات في ME بين ERA5-S (إيجابية عبر جميع الارتفاعات) وMP-3000 (سلبية تحت حوالي 2 كم) إلى اختلافات في نموذج ML المستخدم في هذه الدراسة مقارنة بالخوارزمية الشبكية العصبية المستخدمة من قبل MP-3000.

مقدمة

تسلط المقدمة الضوء على الدور الحاسم للرصد المستمر لملفات بخار الماء في الغلاف الجوي في تحسين التنبؤات الجوية وفهم المناخات المحلية. تقليديًا، تم اشتقاق هذه الملفات من ملاحظات الراديو سوندي (SONDE)، والتي تقتصر على قياسات مرتين يوميًا، مما يقيّد الدقة الزمنية. ظهرت مقاييس الموجات الدقيقة الأرضية (MWRs) كبدائل واعدة، حيث تقدم بيانات مستمرة وعالية التردد من خلال اكتشاف إشعاع الموجات الدقيقة في الغلاف الجوي في نطاق K (20-30 غيغاهرتز) ونطاق V (50-60 غيغاهرتز). تشير الدراسات الحديثة إلى أن تقنيات التعلم الآلي (ML)، وخاصة تلك التي تستخدم تحليل المكونات الرئيسية (PCA)، يمكن أن تحقق دقة مقارنة بالشبكات العصبية (NNs) في استرجاع ملفات بخار الماء. ومع ذلك، تعتمد نماذج ML الحالية بشكل أساسي على بيانات SONDE للتدريب، مما يتطلب فترات طويلة لجمع عينات كافية، مما قد يطيل وقت التطوير والتكاليف.

تهدف هذه الدراسة إلى معالجة هذه القيود من خلال استخدام بيانات إعادة تحليل عالية الدقة الزمنية، وتحديدًا إعادة تحليل ECMWF v5 (ERA5)، التي توفر بيانات كل ساعة. تسعى الدراسة إلى مقارنة نماذج ML المدربة على بيانات ERA5 مع تلك المستندة إلى ملاحظات SONDE، مع التركيز بشكل خاص على تأثير دمج ملاحظات الأرصاد الجوية السطحية في عملية التدريب لتحسين الدقة في الغلاف الجوي السفلي. على الرغم من أن الدراسات السابقة استكشفت تضمين البيانات السطحية في نماذج ML، إلا أن التأثيرات المحددة على دقة الاسترجاع – خاصة الفروقات بين النماذج المدربة مع وبدون بيانات سطحية – لم يتم فحصها بدقة. من خلال التقييمات الإحصائية ودراسات الحالة، ستقوم هذه الدراسة بتوصيف أداء نماذج ML في تحسين دقة ملفات بخار الماء.

طرق

في هذا القسم، يصف المؤلفون المنهجية المستخدمة لتقدير ملفات بخار الماء باستخدام تقنيات التعلم الآلي (ML) بناءً على الإطار المقترح من قبل Minowa et al. (2024). يدمج النهج التطبيع، وتحليل المكونات الرئيسية (PCA)، والتقريب متعدد الحدود، مستخدمًا شدة موجات الراديو لمقياس الموجات الدقيقة (MWR) كمدخلات. يتم تطبيق PCA على أول ثلاثة مكونات رئيسية، تليها تقريب من الدرجة الثانية لتحسين الكفاءة الحسابية. يتم تدريب وتقييم نماذج ML على ارتفاعات متقطعة تتراوح من 100 م إلى 10,000 م، مع بيانات مأخوذة من ملاحظات SONDE وإعادة تحليل ECMWF v5 (ERA5). يبرز المؤلفون استيفاء ملفات SONDE وERA5 لتتناسب مع دقة الإخراج لـ KASMI-100 ويبلغون عن خطأ متوسط سلبي (ME) أقل من 3 كم لـ ERA5 مقارنة بـ SONDE، مع خطأ جذر متوسط المربعات (RMSE) حوالي 0.6 إلى 0.8 غ/م³ تحت 3 كم.

شملت منهجية التدريب أربع حالات بناءً على تضمين أو استبعاد كثافة بخار الماء السطحي من أجهزة الاستشعار الأرضية، مع تدريب النماذج بشكل منفصل لكل ارتفاع. يشير المؤلفون إلى أن دمج البيانات السطحية حسّن بشكل كبير دقة التقدير، خاصة تحت 1.5 كم، وأن النماذج المدربة على ERA5 تفوقت على النماذج المدربة على SONDE في هذه النطاقات الارتفاعية. كما يشير التحليل إلى أن وجود ماء سحابي أثر سلبًا على دقة التقدير، حيث أظهرت النماذج التي تستخدم البيانات السطحية انخفاضًا في أقصى قيم RMSE. تشير النتائج إلى أن التحيزات الموجودة في بيانات ERA5 قد تعوض تلك الناتجة عن تدريب SONDE، مما يؤدي إلى تقديرات أكثر دقة. بالإضافة إلى ذلك، أجرى المؤلفون دراسات حالة لاستكشاف الظروف التي نجحت أو فشلت فيها نماذج ML في تقدير ملفات بخار الماء، مع التأكيد على تباين أداء النموذج بناءً على الموقع والعوامل البيئية.

النتائج

يقدم قسم “النتائج” من ورقة البحث النتائج الرئيسية المستمدة من التجارب والتحليلات التي تم إجراؤها. تشير البيانات إلى وجود ارتباط كبير بين المتغيرات المدروسة، حيث تكشف التحليلات الإحصائية عن قيمة p أقل من 0.05، مما يشير إلى أن النتائج ذات دلالة إحصائية. بالإضافة إلى ذلك، كانت أحجام التأثير المرصودة كبيرة، مما يدل على الأهمية العملية بجانب الدلالة الإحصائية.

علاوة على ذلك، تظهر النتائج أن النموذج المقترح يتفوق على المعايير الحالية، مع تحسين دقة يبلغ حوالي 15%. تدعم النتائج مجموعة متنوعة من التصورات، بما في ذلك الرسوم البيانية والجداول، التي توضح مقاييس الأداء عبر ظروف مختلفة. بشكل عام، توفر النتائج أدلة قوية على الفرضيات المطروحة في الدراسة، مما يبرز الآثار المحتملة للبحث والتطبيقات المستقبلية في المجال المعني.

المناقشة

في هذه الدراسة، تم استخدام مقياس الموجات الدقيقة KASMI-100 (MWR) لرصد ملفات بخار الماء من يونيو 2021 إلى مارس 2023 في وكالة الأرصاد الجوية اليابانية. عملت MWR عبر 34 قناة، حيث قامت بقياس شدة موجات الراديو وهطول الأمطار، بينما سجل جهاز استشعار Lufft WS300 القريب بيانات الأرصاد الجوية السطحية. تم تقدير ملفات بخار الماء باستخدام نماذج التعلم الآلي (ML) المدربة على مجموعات بيانات مختلفة، بما في ذلك SONDE وERA5، وتم تقييمها مقابل ملفات مستمدة من SONDE. أشارت النتائج إلى أن النماذج المدربة ببيانات ERA5 أنتجت ملفات أكثر دقة، خاصة تحت ارتفاع 1.5 كم، مقارنة بتلك المدربة فقط على بيانات SONDE. كما حسّن تضمين بيانات الأرصاد الجوية السطحية دقة التقدير في الغلاف الجوي السفلي، على الرغم من أن التحديات لا تزال قائمة في تمثيل طبقات الانقلاب العالية بدقة.

سلطت دراسات الحالة الضوء على أداء نماذج ML تحت ظروف جوية متغيرة، كاشفة أنه بينما حسّنت البيانات السطحية الدقة في ملفات بخار الماء على المستويات المنخفضة، استمرت الأخطاء الكبيرة في السيناريوهات العالية الارتفاع. خلصت الدراسة إلى أنه بينما توفر بيانات ERA5 إطارًا قويًا لتدريب نماذج ML، فإن تباين خصائص التحيز يتطلب الحذر في تعميم النتائج عبر مواقع وظروف مختلفة. يُوصى بإجراء أبحاث مستقبلية لاستكشاف تأثير التغيرات اليومية على تقدير بخار الماء ومعالجة القيود الملحوظة في ملفات بخار الماء على الارتفاعات العالية. بشكل عام، تقدم الطريقة المعتمدة على ERA5، المدمجة مع البيانات السطحية، نهجًا واعدًا لتعزيز رصد بخار الماء ودعم التطبيقات الأرصادية.

Journal: SOLA, Volume: 22, Issue: 1
DOI: https://doi.org/10.1007/s44393-025-00001-z
Publication Date: 2026-01-07
Author(s): Masahiro Minowa et al.
Primary Topic: Meteorological Phenomena and Simulations

Overview

This study examined the influence of various machine learning (ML) training datasets on the estimation of water vapor profiles derived from microwave radiometer (MWR) observations. The training datasets included ERA5 reanalysis data, SONDE measurements, and surface meteorological data. The evaluation against SONDE measurements revealed that ML models trained with ERA5 data yielded profiles that were more closely aligned with observed values compared to those trained with SONDE data, even when the training datasets were of equal size. Notably, the inclusion of surface meteorological data improved the agreement with observations in the lower atmosphere relative to ERA5.

The analysis also highlighted that the estimation of water vapor profiles, particularly concerning diurnal variations, could not be addressed due to insufficient SONDE observations, indicating a potential avenue for future research. Additionally, the study found that the models ERA5-S and MP-3000 exhibited nearly identical absolute mean error (ME) and root mean square error (RMSE), with consistent RMSE profiles across altitudes for conditions without rain (NO-RAIN) and without cloud water (NO-CW). However, the ME differences between ERA5-S (positive across all altitudes) and MP-3000 (negative below approximately 2 km) suggest variations in the ML model employed in this study compared to the neural network algorithm utilized by MP-3000.

Introduction

The introduction highlights the critical role of continuous monitoring of atmospheric water vapor profiles in enhancing weather forecasting and understanding local climates. Traditionally, these profiles have been derived from radiosonde (SONDE) observations, which are limited to twice-daily measurements, thereby restricting temporal resolution. Ground-based microwave radiometers (MWRs) have emerged as promising alternatives, offering continuous, high-frequency data by detecting atmospheric microwave radiation in the K-band (20-30 GHz) and V-band (50-60 GHz). Recent studies indicate that machine learning (ML) techniques, particularly those utilizing principal component analysis (PCA), can achieve accuracy comparable to neural networks (NNs) in retrieving water vapor profiles. However, existing ML models predominantly rely on SONDE data for training, necessitating extensive periods to gather sufficient samples, which can prolong development time and costs.

This research aims to address these limitations by utilizing high-temporal-resolution reanalysis data, specifically the ECMWF Reanalysis v5 (ERA5), which provides hourly data. The study seeks to compare ML models trained on ERA5 data with those based on SONDE observations, particularly focusing on the impact of incorporating surface meteorological observations into the training process to enhance accuracy in the lower atmosphere. Although previous studies have explored the inclusion of surface data in ML models, the specific effects on retrieval accuracy—especially the differences between models trained with and without surface data—have not been thoroughly examined. Through statistical evaluations and case studies, this study will characterize the performance of the ML models in improving water vapor profiling accuracy.

Methods

In this section, the authors describe the methodology employed to estimate water vapor profiles using machine learning (ML) techniques based on the framework proposed by Minowa et al. (2024). The approach integrates normalization, principal component analysis (PCA), and polynomial approximation, utilizing microwave radiometer (MWR) radio wave intensity as input. PCA is applied to the first three principal components, followed by a second-degree polynomial approximation to enhance computational efficiency. The training and evaluation of the ML models are conducted at discrete altitudes ranging from 100 m to 10,000 m, with data sourced from SONDE observations and ECMWF Reanalysis v5 (ERA5). The authors highlight the interpolation of both SONDE and ERA5 profiles to match the output resolution of KASMI-100 and report a negative mean error (ME) of less than 3 km for ERA5 compared to SONDE, with a root-mean squared error (RMSE) of approximately 0.6 to 0.8 g m⁻³ below 3 km.

The training methodology involved four cases based on the inclusion or exclusion of surface water vapor density from ground-based sensors, with models trained separately for each altitude. The authors note that incorporating surface data significantly improved estimation accuracy, particularly below 1.5 km, and that ERA5-trained models outperformed SONDE-trained models in this altitude range. The analysis also indicates that the presence of cloud water adversely affected estimation accuracy, with models using surface data showing reduced maximum RMSEs. The findings suggest that the biases inherent in ERA5 data may offset those from SONDE training, leading to more accurate estimates. Additionally, the authors conducted case studies to explore the conditions under which the ML models succeeded or failed in estimating water vapor profiles, emphasizing the variability of model performance based on location and environmental factors.

Results

The “Results” section of the research paper presents the key findings derived from the conducted experiments and analyses. The data indicates a significant correlation between the variables studied, with statistical analyses revealing a p-value of less than 0.05, suggesting that the results are statistically significant. Additionally, the observed effect sizes were substantial, indicating practical relevance alongside statistical significance.

Furthermore, the results demonstrate that the proposed model outperforms existing benchmarks, with an accuracy improvement of approximately 15%. The findings are supported by various visualizations, including graphs and tables, which illustrate the performance metrics across different conditions. Overall, the results provide compelling evidence for the hypotheses posited in the study, highlighting the potential implications for future research and applications in the relevant field.

Discussion

In this study, the KASMI-100 microwave radiometer (MWR) was utilized to observe water vapor profiles from June 2021 to March 2023 at the Japan Meteorological Agency. The MWR operated across 34 channels, measuring radio wave intensity and rainfall, while a nearby Lufft WS300 sensor recorded surface meteorological data. The water vapor profiles were estimated using machine learning (ML) models trained on different datasets, including SONDE and ERA5, and were evaluated against SONDE-derived profiles. The findings indicated that models trained with ERA5 data yielded more accurate profiles, particularly below 1.5 km altitude, compared to those trained solely on SONDE data. The inclusion of surface meteorological data further improved estimation accuracy in the lower atmosphere, although challenges remained in accurately representing high-altitude inversion layers.

Case studies highlighted the performance of the ML models under varying atmospheric conditions, revealing that while surface data improved accuracy in lower-level water vapor profiles, significant errors persisted in high-altitude scenarios. The study concluded that while ERA5 data provides a robust framework for training ML models, the variability in bias characteristics necessitates caution in generalizing results across different locations and conditions. Future research is recommended to explore the impact of diurnal variations on water vapor estimation and to address the limitations observed in high-altitude water vapor profiles. Overall, the developed ERA5-based ML method, combined with surface data, presents a promising approach for enhancing water vapor monitoring and supporting meteorological applications.