Benchmarking Micro-Action Recognition: Dataset, Methods, and Applications

Journal: IEEE Transactions on Circuits and Systems for Video Technology, Volume: 34, Issue: 7
DOI: https://doi.org/10.1109/tcsvt.2024.3358415
Publication Date: 2024-01-25
Author(s): Dan Guo et al.
Primary Topic: Human Pose and Action Recognition

Overview

In this research, the authors focus on video-based human micro-action recognition and present the Micro-Action-52 (MA-52) dataset, which covers a wide range of action classes and captures the whole body. They evaluate existing generic action recognition methods on it and propose a new convolutional neural network, MANet, which extends ResNet with Squeeze-and-Excitation (SE) and Temporal Shift Module (TSM) components so that it can capture subtle spatio-temporal dynamics in video.
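The temporal shift idea behind TSM can be illustrated with a minimal sketch. It is not the authors' implementation: the `[frame][channel]` layout, the omission of spatial dimensions, and the `fold_div` parameter are assumptions made to keep the example small. A fraction of the channels is taken from the next frame, an equal fraction from the previous frame, and the rest stay in place, so that a later per-frame layer sees information from neighbouring frames.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of the channels along the time axis.

    frames: list of T frames, each a list of C per-channel values
            (spatial dimensions are omitted; this layout is an assumption).
    fold_div: 1/fold_div of the channels read from the next frame and
              another 1/fold_div read from the previous frame
              (hypothetical default, not taken from the paper).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    # Positions with no neighbouring frame are left at zero.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first group: value comes from the next frame
            elif c < 2 * fold:
                src = t - 1      # second group: value comes from the previous frame
            else:
                src = t          # remaining channels are not shifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 frames, 4 channels
mixed = temporal_shift(clip, fold_div=4)
```

The shift itself adds no parameters; it only moves existing activations between frames.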

The experiments support both the usefulness of MA-52 as a benchmark and the effectiveness of MANet. The study also applies micro-action recognition to emotion recognition through multi-task learning, showing that MANet can be used to analyze emotional states. The authors conclude that micro-action recognition has potential for diverse real-world applications and merits further research.
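Multi-task learning of this kind is commonly expressed as a weighted sum of per-task losses. The sketch below shows that general pattern only. The function names, the use of cross-entropy for both tasks, and the `emotion_weight` parameter are assumptions, since this summary does not state how the paper combines the two objectives.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(action_logits, action_label,
                    emotion_logits, emotion_label, emotion_weight=1.0):
    """Joint loss for one sample: action loss plus weighted emotion loss.

    emotion_weight is a hypothetical balancing factor, not a value from the paper.
    """
    action_loss = cross_entropy(action_logits, action_label)
    emotion_loss = cross_entropy(emotion_logits, emotion_label)
    return action_loss + emotion_weight * emotion_loss
```

With a shared backbone, minimizing such a sum lets the emotion head benefit from features learned for micro-actions, which is the effect the summary describes.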

Introduction

The introduction of this research paper highlights the growing interest in micro-action recognition (MAR) within the artificial intelligence community, emphasizing its significance as a means to detect subtle human behaviors that serve as nonverbal cues reflecting mental states. Micro-actions, characterized by rapid and low-intensity movements across various body parts, provide a more nuanced understanding of individual emotions and intentions compared to traditional verbal communication and facial expressions. The paper identifies key challenges in MAR, including the difficulty of distinguishing minor visual changes, inter-class similarities, data imbalance across action categories, and limited data sources.

To address these challenges, the authors present the Micro-Action-52 (MA-52) dataset, which comprises 52 action categories and 22,422 video samples collected from 205 participants through a specialized interview process designed to elicit authentic micro-actions. This dataset notably includes a diverse range of lower-body movements and complex bodily interactions, enhancing the study of body language. Additionally, the authors propose the Micro-Action Network (MANet), a benchmark network for MAR that integrates advanced techniques to improve recognition performance. The paper further explores the application of MAR in emotion analysis, demonstrating that micro-action data significantly enhances emotion recognition capabilities. Overall, this work contributes a comprehensive dataset and a robust framework for advancing research in micro-action recognition and its implications in understanding human emotions and behaviors.

Methods

In this section, the authors present their methodology for micro-action recognition, focusing on the development of a new dataset, Micro-Action-52 (MA-52), and a benchmark model, the Micro-Action Network (MANet). The MA-52 dataset comprises 52 micro-action categories and seven body-part labels, with 22,422 video instances from 205 participants. The dataset aims to capture the subtlety of micro-actions, which are critical for understanding human emotions and intentions. The authors also extend the dataset to create MA-52-Pro for emotion analysis, in which each instance is annotated with both emotion and micro-action labels, reflecting the complexity of human behavior.

The experimental analysis demonstrates the effectiveness of MANet, which integrates squeeze-and-excitation (SE) and temporal shift module (TSM) into the ResNet architecture. The results indicate that MANet outperforms existing models, such as UniFormer and TSM, across various evaluation metrics, including F1 mean, Acc-Top1, and Acc-Top5. Notably, MANet achieves an F1 mean score of 65.59%, surpassing UniFormer by 1.16%. The findings also reveal that fine-grained micro-action recognition is inherently more challenging than coarse-grained recognition, as evidenced by the significant drop in performance metrics when transitioning from coarse to fine-grained evaluations. This research underscores the potential of the MA-52 dataset and MANet for advancing the understanding of human behavior and emotion recognition.
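The metrics named above can be made concrete with a short sketch. This summary does not define "F1 mean", so the code assumes only that it is some average of F1 scores and computes the two standard ingredients, macro-F1 and micro-F1, alongside top-k accuracy. The function names and data layout are illustrative.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def f1_scores(predictions, labels, num_classes):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for pred, label in zip(predictions, labels):
        if pred == label:
            tp[label] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[label] += 1  # true class gains a false negative
    per_class = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / num_classes          # every class counts equally
    total = 2 * sum(tp) + sum(fp) + sum(fn)
    micro = 2 * sum(tp) / total if total else 0.0  # every sample counts equally
    return macro, micro
```

Macro-F1 weights rare classes as heavily as frequent ones, which matters for a dataset with the class imbalance the paper reports.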

Results

This section presents the confusion matrices computed from the study's prediction results. Each matrix tabulates ground-truth classes against predicted classes, so the per-class true positive, false positive, and false negative counts can be read from it directly. This allows a class-by-class assessment of the model's accuracy and reliability, beyond a single aggregate score.

The results indicate varying levels of predictive performance across different classes, highlighting specific areas where the model excels and where improvements are necessary. The analysis of these confusion matrices provides insights into the strengths and weaknesses of the model, guiding future refinements and adjustments to enhance overall predictive capabilities.
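As an illustration of how such a matrix is built and read, the following sketch counts (true class, predicted class) pairs, derives per-class recall, and finds the most frequently confused pair. The helper names and the toy labels in the usage note are hypothetical and do not come from the paper.

```python
def confusion_matrix(labels, predictions, num_classes):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, pred in zip(labels, predictions):
        matrix[label][pred] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, for each true class."""
    recalls = []
    for c, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[c] / total if total else 0.0)
    return recalls


def most_confused_pair(matrix):
    """Return (true_class, predicted_class, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (i, j, count)
    return best
```

For example, with true labels `[0, 0, 1, 1, 1, 2]` and predictions `[0, 1, 1, 1, 2, 2]`, class 0 has a recall of 0.5 because one of its two samples was predicted as class 1.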

Discussion

The discussion section of the paper highlights the challenges and advancements in micro-action recognition, particularly in the context of human behavior analysis. The authors emphasize the limitations of existing datasets, which often suffer from data bias and lack the granularity needed to distinguish closely related actions. Unlike previous datasets that focus on broader action categories, the current study introduces a new dataset, Micro-Action-52 (MA-52), which captures a wide range of spontaneous micro-actions through professional psychological interviews. The dataset comprises 22,422 video instances from 205 participants, categorized into 52 distinct micro-action types, thereby offering a more nuanced view of human behavior.

The authors also discuss the methodological advancements in micro-action recognition, noting that traditional action recognition techniques excel in identifying generic actions but struggle with fine-grained categories. They propose a new model, MANet, which integrates a ResNet backbone with Temporal Squeeze-and-Excitation and Temporal Shift Modules to enhance the recognition of subtle movements. The study underscores the importance of capturing whole-body micro-actions, particularly lower-limb movements, and aims to provide insights into cognitive processes and emotional states through detailed analysis of body language. Overall, the research contributes significantly to the understanding of micro-actions and their implications in various applications, including emotion analysis and human-computer interaction.
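The standard Squeeze-and-Excitation mechanism can be sketched as follows. This is a generic illustration under simplifying assumptions: flattened feature maps, no biases, and weights passed in by hand. It is not the paper's module, and in particular it does not attempt the temporal variant mentioned above, which this summary does not specify.

```python
import math


def squeeze_excitation(feature_maps, w_reduce, w_expand):
    """Rescale each channel by a gate in (0, 1).

    feature_maps: list of C channels, each a flat list of activations.
    w_reduce: R x C weights of the bottleneck layer (R < C in practice).
    w_expand: C x R weights that map back to one gate per channel.
    Biases are omitted for brevity; all weights here are placeholders.
    """
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation: linear layer -> ReLU -> linear layer -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[gate * value for value in channel]
            for gate, channel in zip(gates, feature_maps)]
```

Because the gates depend on the pooled content of every channel, the block can emphasize channels that respond to subtle motion and suppress the rest.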