Hybrid CNN-transformer architecture for enhanced EEG-based emotion recognition: capturing local and global dependencies with self-attention mechanisms

Journal: Discover Computing, Volume: 28, Issue: 1
DOI: https://doi.org/10.1007/s10791-025-09596-0
Publication Date: 2025-05-22
Author(s): Zhenyun Du et al.
Primary Topic: Emotion and Mood Recognition

Overview

The research presents a hybrid CNN-transformer architecture designed to improve emotion recognition from EEG signals by combining spatial and temporal processing. Conventional EEG models often struggle to track long-range neural dependencies, while transformers excel at modeling global patterns but may miss fine-grained local relationships. The proposed architecture uses convolutional neural networks (CNNs) for spatial pattern detection and transformers for global context modeling, so that each component covers the other's weakness.

Evaluated on the DEAP dataset, which contains EEG recordings from 32 subjects over 40 trials each, the hybrid model achieved an accuracy of 87.00%, surpassing established CNN baselines: AlexNet (83.50%), VGG-16 (85.00%), ResNet-50 (85.50%), GoogLeNet (85.00%), and MobileNetV2 (86.00%). The architecture remained robust across emotion intensity levels, with high class-wise accuracy for the individual emotional states: Happy + (96.00%), Angry + (91.00%), Sad + (90.00%), Nervous + (88.00%), and Cheerful + (86.00%). The study provides a foundation for future research on hybrid deep learning models, with implications for human-computer interaction, affective computing, and psychological diagnostics.
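The accuracy figures above are ratios of correct predictions to total predictions, overall or per class. The following is a minimal sketch, using toy labels rather than the paper's data, of how overall accuracy and per-class precision, recall, and F1-score can be computed. The function name `classification_report` is hypothetical, and reading "class-wise accuracy" as per-class recall is an assumption about the paper's definition.

```python
def classification_report(y_true, y_pred):
    """Return (overall accuracy, per-class precision/recall/F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, report


# Toy labels (hypothetical, not taken from DEAP or the paper).
y_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
y_pred = ["happy", "happy", "sad", "angry", "angry", "angry"]
accuracy, report = classification_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
print("accuracy", round(accuracy, 3))
```

In this toy case one "sad" trial is misread as "angry", which lowers recall for "sad" and precision for "angry" while leaving "happy" untouched; a single overall accuracy would hide that asymmetry.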

Introduction

The introduction highlights the growing importance of understanding human emotions through physiological signals, particularly in human-computer interaction and affective computing. Advanced brain imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) provide high spatial resolution for neural activity mapping, but their clinical requirements and costs limit their use for real-time emotion identification. In contrast, electroencephalography (EEG) is non-invasive and cost-effective and offers superior temporal resolution, which makes it well suited to emotion detection despite its lower spatial resolution compared to CT and MRI.

The primary challenge in using EEG for emotion recognition lies in extracting and interpreting the intricate temporal and spatial patterns that differentiate emotional states. Earlier approaches relied on conventional machine learning algorithms, which may not fully capture the complexity of EEG data. This motivates more advanced methods that improve the accuracy and reliability of EEG-based emotion recognition systems.

Methods

The research presents a hybrid CNN-Transformer architecture for emotion recognition using EEG data, structured in a multi-phase methodology. In Phase 1, raw EEG signals are collected and pre-processed through noise reduction techniques, including band-pass filtering, to enhance signal quality. Key features such as entropy, asymmetry, wavelet features, and connectivity are extracted from the pre-processed data, with wavelet features enabling a detailed multi-scale analysis of transient brain activity. The processed EEG data is then transformed into spectrograms that capture both temporal and frequency characteristics, which are fused into a unified representation for input into the CNN-Transformer model.
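The band-limiting and spectrogram steps described above can be illustrated on a single synthetic channel. This is a rough sketch under stated assumptions, not the authors' pipeline: the 128 Hz sampling rate, the 4 to 45 Hz band, the frame length, and the hop size are all illustrative choices, and the function names are hypothetical. It uses only the standard library and a direct DFT to stay self-contained; real data would normally go through an FFT-based routine.

```python
import cmath
import math


def hann(n):
    """Hann window of length n (reduces spectral leakage at frame edges)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, fs, frame_len, hop, band=(4.0, 45.0)):
    """Return (freqs, frames), where frames[t][k] is the power at freqs[k].

    Only DFT bins inside `band` are kept, which acts as a crude band-pass step.
    """
    window = hann(frame_len)
    bins = [k for k in range(frame_len // 2 + 1)
            if band[0] <= k * fs / frame_len <= band[1]]
    freqs = [k * fs / frame_len for k in bins]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in bins:
            # Direct DFT of one bin; an FFT would be used for real recordings.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                        for i, x in enumerate(chunk))
            row.append(abs(coeff) ** 2)
        frames.append(row)
    return freqs, frames


# Toy channel: a 10 Hz (alpha-band) sine plus a weaker 60 Hz component
# standing in for line noise, sampled at an assumed 128 Hz for 2 seconds.
fs = 128
signal = [math.sin(2 * math.pi * 10 * n / fs)
          + 0.5 * math.sin(2 * math.pi * 60 * n / fs)
          for n in range(2 * fs)]
freqs, frames = spectrogram(signal, fs, frame_len=128, hop=64)
peak = max(range(len(freqs)), key=lambda k: frames[0][k])
print(len(frames), "frames,", len(freqs), "bins, peak at", freqs[peak], "Hz")
```

Each row of `frames` is one time slice and each column one frequency bin, so stacking such arrays across channels gives the image-like input that a CNN can consume. The 60 Hz component never appears in the output because its bin lies outside the retained band.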

In Phase 2, the CNN component identifies spatial correlations across EEG channels, while the Transformer component uses self-attention to model long-range dependencies and global temporal relationships. This dual approach improves the model’s ability to detect subtle emotional changes. The final phase validates the model on the DEAP database, where performance is measured by classification accuracy, precision, recall, and F1-score. The proposed model achieves an overall accuracy of 87.00%, outperforming traditional architectures such as AlexNet and VGG-16. According to the authors, it balances local and global feature processing while maintaining real-time efficiency.
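To make the self-attention step concrete, here is a minimal single-head scaled dot-product attention in plain Python. It is a sketch of the general mechanism, not the paper's model: the toy inputs stand in for per-time-step CNN feature vectors, the identity projection matrices are placeholders for learned weights, and the helper names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x holds one feature vector per time step. Returns (output, weights),
    where weights[i][j] is how strongly step i attends to step j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k)
               for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Toy input: 4 time steps with 2 features each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Steps 0 and 2 carry identical features, so step 0 attends to step 2 exactly as strongly as to itself, regardless of how far apart they are in time. That distance-independence is what lets attention capture long-range dependencies that a small convolution kernel cannot see.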

Discussion

In this research, a novel hybrid architecture combining Convolutional Neural Networks (CNNs) and Transformers is proposed for EEG-based emotion recognition. The architecture addresses key challenges in the field, including capturing both local and global dependencies in EEG signals, managing temporal dynamics of emotional responses, and handling inter-subject variability. The CNN components effectively extract localized features from specific brain regions, while the Transformer layers model interactions across distributed neural networks. This integration is supported by a hierarchical feature fusion mechanism and an adaptive attention strategy that dynamically balances local and global feature processing.
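The summary does not specify how the adaptive balancing between local and global features is implemented. One common way to realize such a balance is a learned sigmoid gate, sketched below purely as an illustration. The gating formulation, the function name `gated_fusion`, and all parameter values are assumptions and should not be read as the authors' mechanism.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(local_feat, global_feat, gate_w, gate_b):
    """Blend two equal-length feature vectors with a per-feature gate.

    The gate for feature i is computed from the concatenated inputs, so the
    mix between the local (CNN) and global (Transformer) value can change
    from sample to sample. gate_w and gate_b stand in for learned parameters.
    """
    concat = local_feat + global_feat
    fused = []
    for i, (loc, glob) in enumerate(zip(local_feat, global_feat)):
        gate = sigmoid(sum(w * c for w, c in zip(gate_w[i], concat)) + gate_b[i])
        fused.append(gate * loc + (1.0 - gate) * glob)
    return fused


# Toy features (hypothetical values).
local_feat = [1.0, 2.0]
global_feat = [3.0, 0.0]
zero_w = [[0.0] * 4, [0.0] * 4]

# Zero weights and bias give a gate of 0.5, i.e. a plain average.
balanced = gated_fusion(local_feat, global_feat, zero_w, [0.0, 0.0])
# A large positive bias favors the local value, a large negative one the global.
skewed = gated_fusion(local_feat, global_feat, zero_w, [10.0, -10.0])
print(balanced, [round(v, 3) for v in skewed])
```

Because the gate stays strictly between 0 and 1, each fused value is a convex combination of the local and global inputs, so neither branch can be amplified beyond its own range.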

The proposed model demonstrates superior performance, achieving an accuracy of 87% on the DEAP dataset, outperforming both pure CNN and Transformer architectures, as well as existing hybrid models. Notably, the architecture excels in distinguishing between high-arousal and low-arousal emotional states. Comprehensive evaluations, including cross-validation and comparative analysis with established models, reveal the hybrid architecture’s efficiency in terms of computational resources and real-time applicability. The findings underscore the potential of this approach to enhance emotion recognition systems, particularly in practical applications requiring robust performance across diverse populations.