CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network

Journal: Computational Visual Media
DOI: https://doi.org/10.1007/s41095-023-0369-x
Publication Date: 2024-02-08
Author(s): Fan Zhang et al.
Primary Topic: Emotion and Mood Recognition

Overview

The paper presents a cross-fusion dual-attention network (CF-DAN) for facial-expression recognition (FER) in complex environments, such as images captured in the wild, which often suffer from face occlusion and blurring. The network has three key components: (i) a cross-fusion grouped dual-attention mechanism that refines local features while capturing global context; (ii) a new C² activation function, a piecewise cubic polynomial designed to improve computational efficiency and flexibility; and (iii) a closed-loop operation that couples self-attention distillation with residual connections to reduce redundant information and improve the model's generalization. The model achieves recognition accuracies of 92.78%, 92.02%, and 63.58% on the RAF-DB, FERPlus, and AffectNet datasets, respectively.

In conclusion, the study emphasizes the complementary roles of the spatial and channel dimensions in feature extraction: local interactions refine features, while global receptive fields capture broader context. The adaptive activation function improves feature extraction and fusion and may also be useful in other learning frameworks. The authors suggest that the method used to construct the activation function could be extended to higher-order polynomials, which may improve backpropagation behavior and model generalization. Future research will focus on adapting the self-attention mechanism to different data features in FER and on exploring the interplay between self-attention and activation functions, with the goal of more efficient models with better backpropagation, continuity, and computational efficiency.

Introduction

The introduction highlights the importance of facial expression recognition (FER) for understanding human emotional states, which underpins work on emotional intelligence in disciplines such as psychology and computer science. Facial expressions are vital nonverbal cues that convey a range of emotions, and advances in technology have carried FER research into applications such as public security and intelligent medical treatment. Traditional FER methods relied on handcrafted features and shallow learning techniques; deep learning, particularly convolutional neural networks (CNNs), has since transformed the field and achieved remarkable success in computer vision tasks. However, CNNs struggle to capture high-level visual semantic information, so additional processing methods are needed.

To improve FER, the study proposes a cross-fusion dual-attention transformer that operates in both the spatial and channel dimensions, allowing for refined feature extraction and improved accuracy. The approach addresses a known limitation of existing models, the computational demands of self-attention, by combining a grouped mechanism with self-attention distillation, which reduces computational cost by 33%. The paper also introduces a new method for constructing activation functions, designed to mitigate problems of traditional activation functions such as neuron deactivation and high computational overhead. Overall, the contributions aim to advance FER by improving feature integration and computational efficiency.
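To see why a grouped mechanism can lower the cost of self-attention, a rough operation count helps. The sketch below assumes one common form of grouping, in which tokens are split into equal groups and attention scores are computed only within each group. The function name and the token and dimension values are illustrative assumptions, and the count does not attempt to reproduce the 33% figure reported in the paper.

```python
def attention_score_cost(num_tokens, dim, groups=1):
    """Multiply-adds needed to form the attention score matrices.

    Assumption (not taken from the paper): tokens are split into `groups`
    equal groups and scores are computed only inside each group.
    """
    if groups < 1 or num_tokens % groups:
        raise ValueError("num_tokens must be divisible by groups")
    per_group = num_tokens // groups
    # Each group builds a (per_group x per_group) score matrix,
    # and every entry is a dot product of length `dim`.
    return groups * per_group * per_group * dim


# Illustrative sizes only: 49 tokens (a 7 x 7 map) with 512-dimensional features.
full = attention_score_cost(49, 512)
grouped = attention_score_cost(49, 512, groups=7)
print(full // grouped)  # 7: under this assumption, cost falls by the number of groups
```

Under this assumption the score computation scales as N²d/g for N tokens, feature dimension d, and g groups, so the saving grows with the number of groups at the price of losing direct interactions between groups.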

Methods

The study introduces a dual-attention mechanism aimed at improving facial expression recognition (FER) in real-world settings. Because self-attention is computationally expensive on large feature maps, the authors first use a ResNet architecture for initial feature extraction. Specifically, an input image \( X \in \mathbb{R}^{3 \times H \times W} \) is processed through a ResNet with five convolutional layers, yielding a feature map \( X^{\text{Res}} \in \mathbb{R}^{C \times (H/32) \times (W/32)} \), where \( C \) is the number of channels. The spatial resolution is therefore reduced by a factor of 32 in each dimension before attention is applied.
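The shape arithmetic above can be checked with a short sketch. It assumes that each of the five layers halves the height and width (2⁵ = 32), as in standard ResNet backbones, and it uses 512 output channels as in ResNet-18. The function name and both defaults are illustrative assumptions, since the summary does not state which ResNet variant is used.

```python
def backbone_output_shape(height, width, channels=512, num_halvings=5):
    """Shape of X^Res for an input image of shape (3, height, width).

    Assumptions: every one of the `num_halvings` stages halves the spatial
    resolution, and `channels` is the backbone's final channel count C
    (512 would match ResNet-18; the paper's value may differ).
    """
    factor = 2 ** num_halvings  # 2**5 == 32
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by %d" % factor)
    return (channels, height // factor, width // factor)


print(backbone_output_shape(224, 224))  # (512, 7, 7)
```

For a 224 × 224 input, attention then operates on 49 spatial positions instead of 50,176 pixels, which is why the backbone makes self-attention tractable.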

The methodology further extracts high-level features from the ResNet output in both channel and spatial dimensions, resulting in \( X^c \) and \( X^s \), respectively. These features are integrated through an interactive learning mechanism to form \( X^i \in \mathbb{R}^{C \times (H/32) \times (W/32)} \). A fully connected layer then transforms this into an output vector \( X^{\text{out}} \in \mathbb{R}^{m \times 1} \) for classification, where \( m \) denotes the number of expression categories. The proposed dual-attention mechanism enhances global interactions by simultaneously applying attention across both dimensions, thereby improving sensitivity to contextual information and outperforming traditional single-dimensional self-attention in FER tasks.
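A minimal sketch of attention along both dimensions follows. It treats the feature map as a C × N matrix (N = (H/32)·(W/32) flattened positions), applies plain scaled dot-product self-attention once with channels as tokens and once with positions as tokens, and fuses the two results by simple addition. The identity query/key/value projections and the additive fusion are simplifying assumptions; the paper's interactive learning and cross-fusion steps are more elaborate and are not reproduced here.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def self_attention(tokens):
    """Scaled dot-product self-attention on a T x D token matrix.

    Queries, keys, and values are the tokens themselves (no learned
    projections), which is a simplification for illustration.
    """
    d = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))                    # T x T
    scaled = [[v / math.sqrt(d) for v in row] for row in scores]
    return matmul(softmax_rows(scaled), tokens)                   # T x D


def dual_attention(x):
    """x is a C x N feature map; returns a fused C x N map.

    x_c: channels act as tokens (channel attention).
    x_s: spatial positions act as tokens (spatial attention).
    The element-wise sum is a stand-in for the paper's fusion step.
    """
    x_c = self_attention(x)
    x_s = transpose(self_attention(transpose(x)))
    return [[c + s for c, s in zip(row_c, row_s)] for row_c, row_s in zip(x_c, x_s)]


# Toy feature map with C = 2 channels and N = 3 positions (made-up values).
fused = dual_attention([[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]])
print(len(fused), len(fused[0]))  # 2 3
```

The output keeps the C × N shape of the input, matching the statement that \( X^i \) has the same dimensions as \( X^{\text{Res}} \); a fully connected layer would then map the flattened result to the \( m \) expression classes.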

Results

The results section evaluates the proposed facial expression recognition (FER) model on three datasets: AffectNet, RAF-DB, and FERPlus. The model outperforms several contemporary methods, improving on EfficientFace and SCN by approximately 4% on RAF-DB and on attention-based models such as DAN and AMP-Net by 2%. On AffectNet, the model reaches a recognition accuracy of 63.58%, surpassing SCN and EfficientFace by about 4%, and it also shows a modest improvement on FERPlus. Confusion matrices show that the model is most accurate on happy expressions, which are easier to identify because they are displayed prominently, while expressions such as disgust and contempt have lower recognition rates because few samples are available.
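The relationship between overall accuracy and per-class recognition rates in a confusion matrix can be sketched as follows. The matrix below is made-up toy data chosen only to illustrate the effect of class imbalance; it is not taken from the paper, and the function name is an assumption.

```python
def accuracy_and_recall(confusion):
    """Overall accuracy and per-class recall from a confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    recalls = [row[i] / sum(row) if sum(row) else 0.0
               for i, row in enumerate(confusion)]
    return correct / total, recalls


# Toy data (not from the paper): one frequent class and two rare ones.
labels = ["happy", "disgust", "contempt"]
toy = [
    [95, 3, 2],   # 100 happy samples
    [4, 5, 1],    # 10 disgust samples
    [3, 2, 5],    # 10 contempt samples
]
overall, per_class = accuracy_and_recall(toy)
print(overall)                       # 0.875
print(dict(zip(labels, per_class)))  # {'happy': 0.95, 'disgust': 0.5, 'contempt': 0.5}
```

In this toy case the overall accuracy stays high even though the two rare classes are recognized only half the time, which is why per-class rates from the confusion matrix are reported alongside the headline accuracy.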

The study also addresses the high parameter counts of transformer models by introducing grouped self-attention and self-attention distillation, resulting in a model with only a quarter of the parameters of competing methods while maintaining high accuracy. Training on RAF-DB takes approximately 1 hour and 20 minutes, and testing takes about 1 minute. The dual-attention mechanism, which incorporates interactive learning and cross-fusion, further improves classification accuracy, by around 1% on FERPlus and 3% on AffectNet. The proposed activation function also contributes positively to the interactive learning mechanism, supporting the model's effectiveness in feature fusion and its overall performance on FER tasks.

Discussion

In this discussion section, the authors explore advancements in facial expression recognition (FER) within real-world environments, emphasizing the integration of dual-attention mechanisms and innovative activation functions. They highlight the limitations of existing methods, such as convolutional neural networks (CNNs) and transformers, in capturing both local and global features effectively. The proposed dual-attention mechanism operates in both spatial and channel dimensions, enhancing feature extraction by allowing local interactions to refine details while simultaneously leveraging global context. This approach is further augmented by a cross-fusion attention model, which facilitates the integration of information from both dimensions, leading to improved recognition accuracy.

The authors also introduce a piecewise cubic polynomial activation function designed to address problems of traditional activation functions, such as vanishing gradients and neuron deactivation. The new function improves computational efficiency and feature extraction. In addition, the study uses self-attention distillation to suppress noise and redundant information during feature extraction, which improves model performance. Experiments on AffectNet, RAF-DB, and FERPlus demonstrate the effectiveness of the proposed methods, with significant improvements in recognition rates and computational efficiency. Future work will focus on refining self-attention mechanisms and activation functions to further improve FER performance.
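To make the idea of a C² piecewise cubic activation concrete, here is one hypothetical construction. It is not the function defined in the paper, whose exact form is not given in this summary. The sketch joins a line of slope `alpha` on the left to the identity on the right using two cubic pieces on [-a, 0] and [0, a], with coefficients chosen so that the value, first derivative, and second derivative agree at every joint. The names `a` and `alpha` and their defaults are assumptions.

```python
def c2_cubic_activation(x, a=1.0, alpha=0.1):
    """Hypothetical C^2 piecewise cubic activation (not the paper's formula).

    Returns alpha * x for x <= -a and x for x >= a. In between, two cubic
    pieces blend the two lines; k = 1 / (6 a^2) makes the value, slope, and
    curvature continuous at x = -a, 0, and a.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    k = 1.0 / (6.0 * a * a)
    if x <= -a:
        smooth = 0.0
    elif x <= 0.0:
        smooth = k * (x + a) ** 3
    elif x < a:
        smooth = x + k * (a - x) ** 3
    else:
        smooth = x
    return alpha * x + (1.0 - alpha) * smooth


def c2_cubic_activation_grad(x, a=1.0, alpha=0.1):
    """Derivative of c2_cubic_activation; it lies between alpha and 1."""
    if a <= 0:
        raise ValueError("a must be positive")
    k = 1.0 / (6.0 * a * a)
    if x <= -a:
        slope = 0.0
    elif x <= 0.0:
        slope = 3.0 * k * (x + a) ** 2
    elif x < a:
        slope = 1.0 - 3.0 * k * (a - x) ** 2
    else:
        slope = 1.0
    return alpha + (1.0 - alpha) * slope


for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(v, c2_cubic_activation(v), c2_cubic_activation_grad(v))
```

This construction uses only additions and multiplications, with no exponentials, and with `alpha > 0` its gradient never drops to zero, which illustrates how a polynomial activation can be cheap to evaluate while avoiding fully deactivated neurons.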