Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-87752-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39929951
Publication Date: 2025-02-10
Author(s): Dong Chen et al.
Primary Topic: Human Pose and Action Recognition

Overview

The paper presents SA-TDGFormer, an architecture for skeleton-based action recognition that runs Graph Convolutional Networks (GCNs) and Transformer models in parallel. It targets a limitation of earlier methods, which treated the human skeleton as a single whole and often overlooked skeletal parts that are only indirectly connected. The dual-stream design consists of a spatiotemporal GCN stream, which captures the topological and motion representations of the skeleton, and a spatiotemporal Transformer stream, which focuses on global inter-joint relationships. A late fusion strategy merges the outputs of the two streams, enhancing feature representation and information exchange.
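To make "global inter-joint relationships" concrete, the following is a minimal sketch of scaled dot-product self-attention applied across the joints of a single frame. It is not the authors' Transformer stream: the identity query/key/value projections, the four-joint toy skeleton, and the function names are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(joints):
    """Scaled dot-product self-attention over joint feature vectors.

    Assumption: queries, keys, and values are the raw joint features
    (identity projections). A trained Transformer learns these projections.
    """
    d = len(joints[0])
    outputs, weights = [], []
    for query in joints:
        # Similarity of this joint to every joint, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in joints]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all joint features.
        outputs.append([sum(w[j] * joints[j][c] for j in range(len(joints)))
                        for c in range(d)])
    return outputs, weights


# Hypothetical toy frame: 4 joints with 3D coordinates.
joints = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [0.4, 0.6, 0.1], [-0.4, 0.6, 0.1]]
attended, attn = joint_self_attention(joints)
```

Every row of `attn` assigns a non-zero weight to every joint, so joints that share no bone can still exchange information in a single step.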

Empirical validation on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton benchmark datasets shows that SA-TDGFormer improves action recognition accuracy by 1-5% over existing frameworks, particularly on NTU RGB+D 60. The architecture incorporates Adaptive Graph Convolutional (AGCN) and Temporal Convolutional (TDCN) components to capture complex inter-joint connections, while the Transformer stream uses advanced modules to uncover nonlinear interactions over time. Ablation studies assess the contribution of individual components and confirm the model's superior performance in action recognition tasks.

Methods

The authors detail the architecture of the proposed SA-TDGFormer, which consists of two primary streams: the spatiotemporal graph convolution stream and the spatiotemporal Transformer stream. The graph convolution stream employs an Adaptive Graph Convolutional Network (AGCN) and a Temporal Convolutional Network (TDCN) to extract motion features from the human skeleton graph, adapting to topological variations and capturing temporal dependencies across frames. The Transformer stream, in contrast, uses Transformer modules to analyze the interactions between joint pairs over time, revealing correlations that traditional methods may overlook. To combine the strengths of both streams, a late fusion strategy is applied, and label smoothing is used during training to mitigate overfitting.
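Label smoothing, mentioned above as the remedy for overfitting, can be sketched as follows. This is the common uniform variant, in which a fraction `epsilon` of the target probability is spread evenly over all classes. The value `epsilon=0.1`, the four-class setup, and the function names are assumptions, since this summary does not state the settings the authors used.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    target = [epsilon / num_classes] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical 4-class example; epsilon = 0.1 is an assumed value.
probs = [0.05, 0.05, 0.85, 0.05]
target = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
loss = cross_entropy(target, probs)

# With epsilon = 0 the target is one-hot and the loss is plain cross-entropy.
hard_loss = cross_entropy(smooth_labels(2, 4, epsilon=0.0), probs)
```

The smoothed target never reaches probability 1 for the true class. This penalizes over-confident predictions, which is the mechanism by which label smoothing reduces overfitting.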

The effectiveness of SA-TDGFormer is validated through comparisons against state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton datasets. Compared with the ST-GCN baseline, the model improves by 8.5% on the X-View benchmark and by 11.2% on the X-Sub benchmark of NTU RGB+D 60. On Kinetics-Skeleton it reaches an accuracy of 39.0%, outperforming existing methods. However, the authors acknowledge limitations in recognizing actions with subtle skeletal movements, such as "reading" and "writing," and suggest that incorporating additional data modalities could improve performance in these challenging categories. This points to potential avenues for future research on action recognition accuracy.

Discussion

The discussion reviews the evolution of Graph Neural Networks (GNNs) and their applications in Human Action Recognition (HAR). It highlights foundational works, starting with Scarselli et al. (2009), who established the theoretical framework for GNNs, and Bruna et al. (2013), who introduced Graph Convolutional Networks (GCNs) that extend convolutional principles to non-Euclidean domains. Subsequent developments, such as ChebyNet and Kipf and Welling's optimizations, improved computational efficiency and performance in semi-supervised classification tasks. The section also covers the applications of GCNs in HAR alongside other methodologies, including CNNs, RNNs, and Transformer-based techniques, that have been used to analyze skeletal sequences and improve action recognition accuracy.
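For readers unfamiliar with the Kipf and Welling formulation referenced here, a single layer computes ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, D the degree matrix of A + I, H the node features, and W a learned weight matrix. The sketch below implements that rule in plain Python. The three-joint chain, the feature values, and the weights are hypothetical, and this is the generic GCN layer rather than the adaptive AGCN variant used in the paper.

```python
import math


def normalized_adjacency(adj):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2) with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weight):
    """One graph convolution layer: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalized_adjacency(adj), features)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Hypothetical chain of 3 joints: shoulder - elbow - wrist.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 features per joint
weight = [[1.0, -1.0], [0.5, 1.0]]                # assumed 2x2 weight matrix
out = gcn_layer(adj, features, weight)
norm = normalized_adjacency(adj)
```

In this toy graph `norm[0][2]` is zero: one layer passes no information directly between the shoulder and the wrist. This illustrates why purely graph-based models can under-weight indirectly connected body parts.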

The paper further discusses the integration of GCNs and Transformers, proposing a dual-stream architecture that draws on the strengths of both models. The late fusion strategy combines the classification scores from both streams to improve action recognition outcomes. The authors detail the implementation of their model on established datasets, such as NTU RGB+D and Kinetics-Skeleton, and describe the experimental setup, including optimization techniques and training protocols. An ablation study evaluates the contributions of individual components and reveals that the GCN stream significantly enhances model performance, achieving high accuracy while managing computational complexity. Overall, this section underscores the value of combining GCNs and Transformers for effective HAR and points toward future research in this domain.
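Late fusion of classification scores can be illustrated with a short sketch. Each stream produces its own class scores, and the two are combined only at the end. The softmax normalization, the equal weighting `alpha=0.5`, and the example logits are assumptions for illustration, as this summary does not specify how the authors weight the two streams.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(gcn_logits, transformer_logits, alpha=0.5):
    """Weighted average of per-stream class probabilities.

    alpha is the weight of the GCN stream (assumed value, not from the paper).
    Returns the fused probabilities and the index of the predicted class.
    """
    gcn_probs = softmax(gcn_logits)
    transformer_probs = softmax(transformer_logits)
    fused = [alpha * g + (1.0 - alpha) * t
             for g, t in zip(gcn_probs, transformer_probs)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Hypothetical scores for 3 action classes from each stream.
gcn_logits = [2.0, 1.9, 0.1]          # GCN stream narrowly prefers class 0
transformer_logits = [0.5, 2.5, 0.2]  # Transformer stream clearly prefers class 1
fused, predicted = late_fusion(gcn_logits, transformer_logits)
```

In this example the GCN stream alone would pick class 0 by a small margin, while the fused scores select class 1. The confident stream resolves a case where the other is nearly undecided.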