Image-based facial emotion recognition using convolutional neural network on emognition dataset

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-65276-x
PMID: https://pubmed.ncbi.nlm.nih.gov/38910179
Publication Date: 2024-06-23
Author(s): Erlangga Satrio Agung et al.
Primary Topic: Emotion and Mood Recognition

Overview

This study addresses facial emotion recognition (FER) on the Emognition dataset, which covers ten classes: nine discrete emotions (amusement, awe, enthusiasm, liking, surprise, anger, disgust, fear, and sadness) plus a neutral state. The authors classify these emotions with Convolutional Neural Networks (CNNs). A preprocessing pipeline converts the video data into images and augments the dataset, yielding a clean set of 2,535 facial images. This set is then divided into training (2,028 images), validation (253 images), and testing (254 images) subsets, roughly an 80/10/10 split.
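The shuffle-and-split step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the seed, the placeholder image identifiers, and the rounding rule are all assumptions, chosen so that the subset sizes match the reported 2,028 / 253 / 254.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items and split them into train, validation, and test lists.

    The rounding rule (round for train, floor for validation, remainder
    for test) is an assumption that reproduces the reported subset sizes.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = round(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 2,535 cleaned face images.
image_ids = [f"img_{i:04d}" for i in range(2535)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 2028 253 254
```

Shuffling before splitting matters here because consecutive frames from one video are highly similar; without it, a subset could end up dominated by a few recordings.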

The study explores two ways of developing CNN models: transfer learning with fine-tuning of the pre-trained Inception-V3 and MobileNet-V2 models, and building a model from scratch with the Taguchi method for hyperparameter optimization. In the experiments all models performed well, and the Inception-V3 transfer learning model reached the highest accuracy, 96%, with an average F1-score of 0.95. Notably, the models effectively identified several emotions that are distinctive to the Emognition dataset, including amusement, enthusiasm, awe, and liking. The authors see potential for practical applications in fields such as marketing, mental health, and education.
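To illustrate the general idea behind Taguchi-style hyperparameter search, the sketch below uses the standard L4 orthogonal array, which covers three two-level factors in four runs instead of the eight a full grid would need. The factors, their levels, and the validation scores are hypothetical placeholders, not values from the paper, and the analysis uses plain mean accuracy per level rather than the signal-to-noise ratios often used in Taguchi analysis.

```python
# Standard L4(2^3) orthogonal array: every pair of columns contains each
# level combination exactly once.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Hypothetical factors and levels, chosen only for illustration.
FACTORS = {
    "learning_rate": (1e-3, 1e-4),
    "batch_size": (16, 32),
    "dropout": (0.3, 0.5),
}


def build_runs(factors, array):
    """Translate each row of the orthogonal array into a configuration."""
    names = list(factors)
    return [{n: factors[n][lvl] for n, lvl in zip(names, row)} for row in array]


def best_levels(factors, array, scores):
    """Pick, per factor, the level with the highest mean score (main effect)."""
    best = {}
    for col, (name, levels) in enumerate(factors.items()):
        means = []
        for lvl in range(len(levels)):
            vals = [s for row, s in zip(array, scores) if row[col] == lvl]
            means.append(sum(vals) / len(vals))
        best[name] = levels[means.index(max(means))]
    return best


runs = build_runs(FACTORS, L4)
# Placeholder validation accuracies, one per run; in practice each would
# come from training the from-scratch CNN with that configuration.
scores = [0.80, 0.85, 0.78, 0.84]
print(best_levels(FACTORS, L4, scores))
```

With these placeholder scores the selected combination was not itself one of the four runs, which is typical of main-effects analysis and is why a confirmation run is usually trained afterwards.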

Methods

The authors take a dual approach to developing a CNN model for emotion recognition. First, they apply transfer learning with fine-tuning to two pre-trained models, MobileNet-V2 and Inception-V3. Second, they design a new network from scratch with the aim of improving efficiency. The Emognition dataset is preprocessed in sequential stages to suit the research objectives.

The Emognition dataset, which consists of physiological signals and upper-body video recordings from 43 participants, is central to this research. Participants viewed emotionally validated film clips intended to elicit nine discrete emotions (amusement, awe, enthusiasm, liking, surprise, anger, disgust, fear, and sadness), along with a neutral state. The dataset is noted for its diverse emotion categories, which support a comprehensive analysis. The study uses the half-body video data, 387 videos in total, of which 287 were recorded at 60 FPS and the remaining 100 at 30 FPS; video length also varies across emotion classes.
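Because the videos mix 60 FPS and 30 FPS recordings, extracting every n-th frame would sample the two groups at different rates in time. One way to handle this, sketched below, is to sample at a fixed number of frames per second of footage. The function name and the one-frame-per-second rate are assumptions for illustration; the summary does not state the sampling rate the authors used.

```python
def sample_frame_indices(n_frames, fps, samples_per_second=1):
    """Return frame indices that sample a clip at a fixed rate in time.

    Assumes fps is an integer multiple of samples_per_second, which holds
    for the 30 FPS and 60 FPS recordings described above.
    """
    if fps % samples_per_second != 0:
        raise ValueError("fps must be a multiple of samples_per_second")
    step = fps // samples_per_second
    return list(range(0, n_frames, step))


# A 10-second clip yields the same number of samples at either frame rate.
idx_60 = sample_frame_indices(n_frames=600, fps=60)
idx_30 = sample_frame_indices(n_frames=300, fps=30)
print(len(idx_60), len(idx_30))  # 10 10
```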

Results

On the test set, the Inception-V3 transfer learning model outperforms the other models in overall accuracy, scoring 0.96 compared with 0.89 for MobileNet-V2 and 0.87 for the model trained from scratch (full learning). This advantage holds across individual classes, as illustrated in Figure 14, where Inception-V3 recognizes the true class more accurately in every class except awe, where it matches MobileNet-V2. The recall values in Figure 15 show Inception-V3 leading in most classes, although it falls behind MobileNet-V2 in the fear class.

The precision analysis in Figure 16 further supports the lead of the Inception-V3 model, particularly in the enthusiasm and neutral classes. However, it shares precision values with MobileNet-V2 in several classes, including amusement and awe, while the build-from-scratch model outperforms in the amusement class. The F1-score analysis in Figure 17 confirms that Inception-V3 performs best overall, with results comparable to MobileNet-V2 in the awe class. Its testing time is longer because of its complexity, as shown in Figure 18, but the model remains fast enough for real-time applications, which makes it the most effective of the three models for emotion classification on the Emognition dataset.
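The per-class precision, recall, and F1 values compared in these figures, together with overall accuracy and the averaged F1-score, follow from the true and predicted labels in the usual way. The sketch below computes them with the standard definitions; the toy labels are invented for illustration and are not results from the paper, and the averaged F1 is computed here as an unweighted (macro) mean, which is an assumption about how the paper averages.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute per-class precision/recall/F1, overall accuracy, and macro F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    report = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
    return report, accuracy, macro_f1


# Toy labels, invented for illustration only.
y_true = ["awe", "awe", "fear", "fear", "neutral", "neutral"]
y_pred = ["awe", "fear", "fear", "fear", "neutral", "neutral"]
report, accuracy, macro_f1 = per_class_metrics(y_true, y_pred)
for label, m in report.items():
    print(label, {k: round(v, 3) for k, v in m.items()})
print("accuracy", round(accuracy, 3), "macro F1", round(macro_f1, 3))
```

In the toy data one awe sample is misread as fear, which lowers recall for awe and precision for fear while leaving the other values at 1.0. This is how one model can lead on recall in a class yet trail on precision in the same class.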

Discussion

The discussion reviews the data preprocessing and model training methods used to classify emotions from video data. Preprocessing proceeds through several stages: video frame extraction, face cropping, data cleaning, and augmentation, which together improve the quality and diversity of the dataset. The authors emphasize shuffling the data and splitting it into training, validation, and test sets to reduce bias and improve generalization. A dataframe is used to manage the data, while normalization and resizing ensure that the images match the input requirements of the CNN.
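The resizing, normalization, and augmentation steps can be illustrated on a tiny grayscale image stored as nested lists. This is a dependency-free sketch under stated assumptions: nearest-neighbour interpolation, scaling pixel values to the range [0, 1], and a horizontal flip as the augmentation are illustrative choices, since the summary does not specify which variants the authors used. A real pipeline would typically use an image library and the target size expected by the chosen network.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2-D list of pixels with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[px / 255.0 for px in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right, a common face augmentation."""
    return [row[::-1] for row in image]


# A 2x2 stand-in for a cropped face image.
face = [[0, 255], [128, 64]]
resized = resize_nearest(face, 4, 4)
model_input = normalize(resized)
augmented = horizontal_flip(resized)
print(resized[0], augmented[0], model_input[0])
```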

The transfer learning approach uses the pre-trained MobileNet-V2 and Inception-V3 models to exploit their established feature extraction capabilities. Fine-tuning is carried out in two scenarios: the first freezes all convolutional layers and trains only the classification head, while the second unfreezes the last 50% of the layers to adapt the model to the emotion classification task. Inception-V3 is chosen for its high accuracy and complex architecture, and MobileNet-V2 for its efficiency and suitability for deployment on smaller devices. The Inception-V3 transfer learning model outperforms the other models on accuracy and loss, showing effective learning and generalization, while the model trained from scratch performs less well.
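The two fine-tuning scenarios differ only in which layers of the pre-trained base are trainable. The sketch below computes that freeze plan as a list of flags; the function name, the layer count of 10, and the floor rounding are hypothetical choices for illustration. In a Keras-style framework the flags would be applied by setting each layer's `trainable` attribute, which is shown only as a comment because the summary does not name the framework used.

```python
def trainable_flags(n_layers, unfreeze_fraction):
    """Return one boolean per base-model layer: True means trainable.

    Only the last `unfreeze_fraction` of the layers is unfrozen; earlier
    layers keep their pre-trained weights fixed.
    """
    if not 0.0 <= unfreeze_fraction <= 1.0:
        raise ValueError("unfreeze_fraction must be between 0 and 1")
    n_unfrozen = int(n_layers * unfreeze_fraction)  # floor: an assumption
    first_trainable = n_layers - n_unfrozen
    return [i >= first_trainable for i in range(n_layers)]


# Scenario 1: freeze the whole base, train only the classification head.
scenario_1 = trainable_flags(n_layers=10, unfreeze_fraction=0.0)
# Scenario 2: unfreeze the last 50% of the base layers.
scenario_2 = trainable_flags(n_layers=10, unfreeze_fraction=0.5)
print(scenario_1)
print(scenario_2)

# Applying the plan in a Keras-style API would look like this (assumption):
# for layer, flag in zip(base_model.layers, scenario_2):
#     layer.trainable = flag
```

Keeping the early layers frozen preserves generic low-level features such as edges and textures, while the later, more task-specific layers adapt to facial expressions.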
