Memory-augment graph transformer based unsupervised detection model for identifying performance anomalies in highly-dynamic cloud environments

Journal: Journal of Cloud Computing Advances Systems and Applications, Volume: 14, Issue: 1
DOI: https://doi.org/10.1186/s13677-025-00766-5
Publication Date: 2025-07-22
Author(s): Zhenyun Du et al.
Primary Topic: Anomaly Detection Techniques and Applications

Overview

The research paper presents MemGT, an unsupervised multivariate time series anomaly detection method designed for cloud computing systems. It addresses the challenges posed by the complexity and noise in monitoring metrics, which can lead to increased server failures and abnormal events. MemGT leverages a Transformer encoder combined with dynamic graph structure learning to effectively extract spatio-temporal features from data. A novel dynamic gated memory module enhances the model’s robustness against varying data patterns, while a window-wise graph learning technique improves its ability to distinguish between concurrent noise and genuine anomalies. Experimental results indicate that MemGT achieves an average F1 score of 95.04%, outperforming 15 baseline methods by 24.80% across 8 public datasets.

The conclusion emphasizes the effectiveness of MemGT in capturing intricate dependencies in high-dimensional, noisy datasets, making it a promising tool for performance anomaly detection in dynamic cloud environments. While the framework shows potential for broader applications beyond cloud computing, the authors acknowledge that adaptations may be necessary for different domains due to variations in the nature of concurrent noise. Future work will focus on tailoring MemGT for specific non-cloud applications and investigating necessary modifications to enhance its capabilities as a robust anomaly detection framework for complex multivariate time series data.

Introduction

The introduction of this research paper discusses the complexities of anomaly detection in cloud computing systems, which consist of numerous interconnected servers, networks, and storage devices. The authors highlight the critical need for accurate detection of performance anomalies—such as high CPU utilization and unexpected workload spikes—to maintain system reliability and continuity. However, the intricate architecture and dynamic nature of cloud environments pose significant challenges, including high dimensionality of monitoring metrics and the presence of concurrent noise, which can lead to increased false positive rates in anomaly detection tasks.

To address these challenges, the paper proposes a novel unsupervised anomaly detection method called MemGT, which integrates a Transformer encoder with dynamic graph structure learning to effectively extract spatiotemporal features from monitoring metrics. The method aims to enhance robustness against diverse data patterns and concurrent noise through a dynamic gated memory module and a window-wise Graph Convolutional Network (GCN). The authors outline three key research questions focused on accurately modeling dependencies, improving robustness, and effectively handling concurrent noise. Empirical evaluations demonstrate that MemGT achieves state-of-the-art performance on multiple real-world datasets, indicating its potential as a reliable solution for anomaly detection in complex cloud environments.
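The idea of learning a graph per window can be illustrated with a small sketch. The summary does not state how MemGT builds its graph, so the construction here is an assumption: each metric in a window is connected to its `k` most correlated neighbours, a common heuristic in graph-based time series models. The names `window_adjacency` and `k` are hypothetical.

```python
import math
from typing import List, Sequence


def centered_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of mean-centred traces (equivalent to Pearson correlation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    norm_a = math.sqrt(sum(x * x for x in ca))
    norm_b = math.sqrt(sum(y * y for y in cb))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a constant trace carries no correlation signal
    return sum(x * y for x, y in zip(ca, cb)) / (norm_a * norm_b)


def window_adjacency(window: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Directed top-k adjacency over the d metrics of one (L x d) window.

    Assumption: similarity-based top-k selection, not the paper's exact rule.
    """
    d = len(window[0])
    # Column j of the window is the length-L trace of metric j.
    channels = [[row[j] for row in window] for j in range(d)]
    adjacency = [[0] * d for _ in range(d)]
    for i in range(d):
        sims = [(centered_cosine(channels[i], channels[j]), j) for j in range(d) if j != i]
        sims.sort(reverse=True)
        for _, j in sims[:k]:
            adjacency[i][j] = 1
    return adjacency


# Toy window: metrics 0 and 1 rise together, metric 2 falls.
window = [[1, 2, 4], [2, 4, 3], [3, 6, 2], [4, 8, 1]]
adjacency = window_adjacency(window, k=1)
```

Because the adjacency is recomputed for every window, the graph can change as the relationships between metrics drift, which is the property the "dynamic" graph learning in the text relies on.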

Methods

The methods section traces the evolution of memory-network-based techniques for anomaly detection. Early models used external memory matrices to extend the memory capacity of neural networks. The first significant advance was MemAE, an unsupervised anomaly detection model that combines memory networks with autoencoders and uses information from normal samples to detect anomalies. However, MemAE cannot update the memory network in real time during training. The MNAD model introduced a memory update strategy that refreshes the memory network only when it encounters normal samples, but it does not control the degree of each update. To overcome these limitations, MEMTO introduced a gated memory module that adapts to diverse normal patterns, significantly improving anomaly detection accuracy and addressing over-generalization.
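To make the idea of a gated memory update concrete, the following is a minimal sketch rather than the formulation used by MEMTO or MemGT. It assumes that each memory item attends over the latent queries of a window and that a sigmoid gate decides how far the item moves toward the attention-weighted summary. Published models use learned projection matrices inside the gate; they are omitted here for brevity, and the name `gated_memory_update` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_update(memory: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Move each memory item toward the queries by a gated amount.

    Assumption: the gate has no learned parameters (illustration only).
    """
    updated = []
    for item in memory:
        # Attention of this memory item over all queries in the window.
        weights = softmax([sum(m * q for m, q in zip(item, query)) for query in queries])
        # Attention-weighted summary of the queries.
        summary = [sum(w * query[k] for w, query in zip(weights, queries))
                   for k in range(len(item))]
        # Per-dimension gate in (0, 1): controls the degree of the update.
        gate = [sigmoid(m_k + s_k) for m_k, s_k in zip(item, summary)]
        updated.append([(1.0 - g) * m_k + g * s_k
                        for g, m_k, s_k in zip(gate, item, summary)])
    return updated


memory = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
new_memory = gated_memory_update(memory, queries)
```

The gate is what separates this style of update from an all-or-nothing refresh: a value near 0 keeps the stored pattern, while a value near 1 replaces it with what the current window suggests.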

The methodology section defines the monitoring metrics from cloud computing systems as a collection of sub-series, denoted as \( X = \{ X_1, \ldots, X_N \} \), where each sub-series \( X_s = [x_{s1}, \ldots, x_{sL}] \in \mathbb{R}^{L \times d} \) consists of \( L \) consecutive observations across timestamps. Here, \( N \) represents the total number of sub-series, \( L \) is the length of each sub-series, \( d \) is the input data dimension, and \( x_{st} \in \mathbb{R}^d \) is the observation at time \( t \). The goal of the multivariate time series anomaly detection task is to generate an anomaly score for each sub-series \( X_s \) to determine its anomalous status.
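The notation above can be illustrated with a short sliding-window sketch. The summary does not say whether the sub-series overlap, so the `stride` parameter and the function name `make_subseries` are assumptions made for illustration.

```python
from typing import List, Sequence

Observation = Sequence[float]  # one observation x_{st} in R^d


def make_subseries(series: Sequence[Observation], L: int, stride: int = 1) -> List[List[Observation]]:
    """Split a (T x d) series into sub-series X_s of L consecutive observations."""
    if L <= 0 or stride <= 0:
        raise ValueError("L and stride must be positive")
    T = len(series)
    return [list(series[start:start + L]) for start in range(0, T - L + 1, stride)]


# Toy series: T = 6 timestamps, d = 2 metrics (for example CPU and memory usage).
series = [[0.1, 0.5], [0.2, 0.5], [0.2, 0.6], [0.9, 0.6], [0.3, 0.5], [0.2, 0.4]]
X = make_subseries(series, L=3, stride=3)  # non-overlapping windows
N = len(X)                                 # total number of sub-series
```

With `stride=3` the six timestamps yield N = 2 sub-series of shape 3 x 2. With `stride=1` the same series yields four overlapping sub-series.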

Results

The results section presents a comparative analysis of various models applied to eight real-world anomaly detection tasks, as detailed in Table 3. The findings indicate that traditional machine learning methods suffer significant performance declines in high-dimensional and high-noise environments. In contrast, the proposed MemGT model consistently outperforms baseline methods across five datasets: PSM, WADI, MSL, SMAP, and AstrosetHigh, with F1 score improvements of 9.02%, 35.38%, 9.09%, 8.92%, and 40.8%, respectively. Notably, MemGT shows the most pronounced enhancements on the WADI and AstrosetHigh datasets, which are characterized by high dimensionality and noise, respectively, underscoring its effectiveness in challenging conditions typical of complex IoT systems.

Additionally, MemGT achieves the second highest F1 score on the SWaT, AstrosetLow, and AstrosetMiddle datasets, with the best F1 scores highlighted in bold and the second-best underlined. The model maintains a minimal average discrepancy of 3.5% between precision and recall across all datasets, which mitigates the risk of skewed performance metrics and enhances the reliability of anomaly detection while minimizing false alarms. Overall, MemGT demonstrates robust capabilities in managing diverse anomaly detection challenges within highly dynamic cloud environments.
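For reference, F1 is the harmonic mean of precision and recall, so a detector with a small gap between the two is not inflating its F1 through one metric alone. The sketch below computes all three from binary labels. The labels are invented toy data, not results from the paper, and the summary does not say whether the paper applies any adjustment to the predictions before scoring.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
gap = abs(precision - recall)  # the kind of discrepancy discussed above
```

In this toy case precision is 0.6 and recall is 0.75, so the gap is 0.15 and F1 is about 0.667.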

Discussion

In the discussion section, the paper outlines advances in deep learning-based unsupervised anomaly detection, emphasizing the integration of graph structure learning and Transformer architectures. Notable approaches include DAGMM, which combines deep autoencoders with Gaussian mixture models to model anomalies in the latent space, and OmniAnomaly, which uses stochastic RNNs to capture the complexity of time series. Other methods such as LSTM-VAE and USAD rely on LSTM and adversarial training, respectively, to improve anomaly detection accuracy. The section highlights the limitations of existing methods in handling the dynamic and complex nature of data in cloud computing environments, which can hinder detection robustness.

The paper introduces MemGT, a novel architecture designed to address these challenges through four key components: a Memory-augmented Transformer, Dynamic Graph Structure Learning, Spatio-temporal Feature Fusion, and Concurrent Noise Recognition. MemGT employs a dynamic gated memory module to adaptively enhance feature extraction and pattern recognition, while a self-attention mechanism facilitates the learning of complex dependencies among monitoring metrics. The architecture integrates spatio-temporal features through a Graph Neural Network (GNN) and employs a concurrent noise recognition strategy to mitigate the impact of noise on anomaly detection. The proposed loss function combines reconstruction loss with auxiliary losses to ensure effective training, ultimately leading to a robust anomaly score calculation that considers both input and latent spaces.
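A score that considers both input and latent spaces can be sketched as follows. The summary does not give MemGT's exact formula, so the combination here is an assumption in the style of memory-based detectors: the input-space reconstruction error at each timestamp is weighted by a softmax over the latent-space distance to the nearest memory item. All names and toy values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_scores(inputs: List[Vector], reconstructions: List[Vector],
                   latents: List[Vector], memory: List[Vector]) -> List[float]:
    """Per-timestamp score from input-space and latent-space deviations (assumed form)."""
    # Input space: how badly each observation was reconstructed.
    recon_err = [sq_dist(x, x_hat) for x, x_hat in zip(inputs, reconstructions)]
    # Latent space: distance from each latent query to its nearest memory item.
    latent_dev = [min(sq_dist(q, m) for m in memory) for q in latents]
    weights = softmax(latent_dev)
    return [w * e for w, e in zip(weights, recon_err)]


inputs = [[0.1, 0.2], [0.2, 0.2], [0.9, 0.8]]           # third point is unusual
reconstructions = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.3]]
latents = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.7]]
memory = [[1.0, 0.0], [0.0, 1.0]]                       # stored normal patterns
scores = anomaly_scores(inputs, reconstructions, latents, memory)
```

A timestamp then scores highly only when it is both poorly reconstructed and far from every stored normal pattern. In practice a threshold on the score, chosen on validation data, would turn it into an anomaly decision.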