CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network


Journal: Computational Visual Media
DOI: https://doi.org/10.1007/s41095-023-0369-x
Publication date: 2024-02-08


Fan Zhang, Gongguan Chen, Hua Wang, and Caiming Zhang

© The Author(s) 2024.

Abstract

Recently, facial-expression recognition (FER) has focused primarily on images in the wild, which involve factors such as face occlusion and image blur, rather than on laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed construction method for the activation function, a piecewise cubic polynomial with three degrees of freedom, which requires less computation while improving flexibility and recognition ability, and can therefore better address the problems of slow running speed and neuron inactivation; and (3) a closed-loop operation between self-attention distillation and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.

Keywords: facial-expression recognition (FER); cubic polynomial activation function; dual-attention mechanism; interactive learning; self-attention distillation

1 Introduction

The study of human emotional states is an interdisciplinary research field that combines psychology and computer science and is fundamental to the development of emotional intelligence. Facial expressions convey a person's emotional state naturally and powerfully, expressing emotions such as calm, happiness, anger, sadness, fear, disgust, and surprise. Facial expressions are key to nonverbal communication between humans. As technology continues to advance, research on facial-expression recognition (FER) has deepened and has significantly affected many aspects of life, such as public safety, lie detection, driver-fatigue detection, intelligent medical treatment, and security monitoring.

Traditional FER techniques mainly use handcrafted features and shallow learning methods, such as oriented gradients, sparse representations, and non-negative matrix factorization. Recently, with the widespread application of deep learning, convolutional neural networks (CNNs) have almost replaced traditional methods and achieved excellent results in various computer-vision tasks. Many researchers have applied these networks to FER tasks, leading to excellent progress. CNNs have significant advantages in extracting low-level features and visual structures. However, high-level visual semantic information often concerns how these elements relate to one another to form an object and how the spatial relationships between objects form a scene, for which CNNs cannot achieve ideal results. To address this, we use a CNN only to extract shallow features and perform further processing with other methods based on the extracted shallow features.

Recent studies have used self-attention mechanisms for various computer-vision tasks. Self-attention mechanisms can imitate how people attend to key locations within an image and extract key information from these locations to complete different tasks. For example, the vision transformer (ViT) can be applied to image classification. However, in FER, ViT-like networks cannot be used directly because of the limited sample size. Aouayeb et al. successfully applied a ViT network to the FER task by adding a squeeze-and-excitation (SE) block. Because of their ability to model global dependencies, self-attention mechanisms can better relate different facial parts when processing high-level visual semantic information. However, self-attention mechanisms require substantial computational resources, which is often unacceptable in many real-world scenarios. In this study, we address this problem by adopting a grouping approach and adding self-attention distillation within each group.

Dual-attention networks complement each other in terms of feature refinement and global information acquisition. DANet uses a dual-attention convolutional network with dual attention mechanisms based on spatial and temporal attention for action recognition. Li et al. proposed a dual-attention model consisting of spatial and channel-mixed attention for recognizing sports activities in videos. SDA-Net uses spatial/temporal self- and cross-attention modules for human action recognition. Ullah and Munir used a unified spatial attention mechanism to extract human-centered salient features in video frames. These studies demonstrate the effectiveness of dual attention. In FER, POSTER uses cross-transformers to fuse facial-landmark and image features and builds a pyramid structure to enhance scale invariance. However, POSTER requires an additional network to extract facial-landmark features and achieves only one-dimensional cross-fusion of attention. Moreover, the pyramid structure of POSTER significantly increases model complexity. In this study, we use self-attention mechanisms to model high-level visual semantic information directly in both the channel and spatial dimensions while achieving cross-fusion between the two.

Existing activation functions can be divided into two categories: those based on extensions of the sigmoid function [34] and those based on extensions of the ReLU function. The sigmoid function and its related functions have high computational complexity because they involve power operations, which leads to slow computation. Although the ReLU function and its related functions are fast to compute, they are unstable because the first derivative is discontinuous, and the negative interval becomes zero after activation, which easily leads to neuron inactivation. To address this, this study proposes a method for constructing an activation function that uses fitting to avoid power operations while maintaining continuity of the first derivative. During fitting, the intervals are adjusted to prevent neuron inactivation. The contributions of this study are as follows:
(1) We propose a cross-fusion dual-attention transformer based on the spatial and channel dimensions. Local interaction in the spatial dimension accomplishes feature refinement, and a global receptive field is provided in the channel dimension. The cross-fusion dual-attention transformer achieves mutual complementation of feature information from different dimensions, which improves the precision of the relevant features.
(2) We design a method for constructing an activation function to solve problems in commonly used activation functions, such as neuron inactivation, the large computational burden caused by power functions, and the presence of non-differentiable points. Furthermore, this study constructs a continuous activation function for the interactive learning mechanism, which improves the ability of the mechanism to fuse different features.
(3) To reduce the high computational cost of the self-attention mechanism, we propose a grouping mechanism and self-attention distillation. This operation divides attention into different groups and uses self-attention distillation in each group to reduce the spatial dimensions of the keys and values, thereby reducing the computational cost. Self-attention distillation improves the ability of the self-attention mechanism and significantly reduces its computational cost.


FER in the wild. To address the challenges of FER in real-world environments, researchers have made considerable efforts to improve recognition accuracy under complex backgrounds and face occlusion. Bourel et al. [36] extracted spatial features of key facial parts for classification. Methods based on principal component analysis [37, 38] project test images onto training images using similarity to achieve expression recognition. Hammal et al. [39] used multi-level image segmentation and sparse decomposition to solve the problem of facial-expression occlusion. Several face detectors, such as MTCNN [40] and Dlib [41], have been used for face detection in real-world scenarios. Furthermore, to solve the problems that arise in real-world scenarios, an increasing number of studies based on multi-view and multi-scale vision have been conducted. Happy and Routray [42] and Majumder et al. [43] suggested that facial-expression changes are mainly reflected in key parts, such as the eyes and mouth. With the continuous development of deep learning, many CNN-based studies have made considerable progress in extracting facial features in real-world environments. SCN [44] adds three extra modules to a traditional CNN for self-attention importance weighting, rank regularization, and relabeling. RAN [45] uses convolution operations to design feature-extraction, self-attention, and relation-attention modules. EfficientFace [46] uses a local feature extractor and a channel-spatial modulator to improve robustness. DMUE [47] uses an auxiliary branch to mine latent distributions and performs pairwise uncertainty estimation. FDRL [48] is a feature decomposition and reconstruction network for emotion-related features. DAN [49] consists of a feature clustering network, multi-head cross attention, and an attention fusion network. ARM uses an amended representation module to replace the pooling layer. CSGResNet [50] integrates Gabor convolution (GConv) into ResNet. AMP-Net extracts global, local, and salient facial-expression features at different granularities to learn the underlying diversity and key information of facial emotions. However, these methods are limited by the small receptive fields of CNNs. In this study, we exploit the advantages of CNNs for shallow feature extraction and use self-attention mechanisms to process high-level visual semantic information.

Vision transformers. Many recent studies have shown that transformers have enormous potential in computer-vision applications. Several pioneering studies, such as those on iGPT [51] and ViT, applied the self-attention mechanism directly to image pixels or patch sequences. Inspired by this, convolutional visual transformers (CVTs) [52] were the first to apply the transformer model to FER tasks. CVTs use the local binary pattern (LBP) algorithm to feed facial-expression images in two different forms into a ResNet to obtain smaller feature maps, and a transformer is then used to complete the FER task. The mask vision transformer (MViT) [53] generates a mask based on the transformer model to filter out complex backgrounds and occlusions in face images. FER-VT [54] feeds feature maps of different scales into the same transformer to complete information fusion. Random attention maps [55] and self-attention modules guide the model in learning rich relationships between different local patches. However, these methods are restricted to the spatial dimension and are limited in obtaining global information compared with the dual-attention mechanism.

3 Methodology

This study proposes a more concise and effective dual-attention mechanism to solve FER tasks in real-world environments. Because a large feature map greatly increases the computation of the self-attention mechanism, the network first uses a ResNet to map facial-expression images to a high-dimensional space, thereby obtaining high-dimensional feature maps of smaller size. Figs. 1 and 2 show the overall flow of the network. Specifically, the input image is first fed into a ResNet containing five convolutional layers to extract low-level features. High-level features are then extracted from the ResNet output simultaneously in both the channel and spatial dimensions. The interactive learning mechanism combines the results of these two dimensions. Finally, the model uses a fully connected layer to obtain the output vector for classification, whose length is the number of facial-expression classes.

Fig. 1 Dual-attention mechanism. The left side shows the overall structure of the dual-attention mechanism. The right side shows self-attention in the different dimensions. The inputs are obtained through a linear transformation and self-attention distillation, after which self-attention is computed. Unlike spatial attention, channel attention does not split the input into multiple patches; it performs the self-attention operation directly on the channel dimension.

Fig. 2 Overall structure of the proposed model. The main part of the model consists of two dual-attention layers and an interactive learning mechanism. Information is exchanged between the two dual-attention layers, which breaks the information blocking caused by parallel operations. To improve the feature-fusion ability of the interactive learning mechanism for dual attention, we designed a continuous activation function composed of three cubic polynomial pieces, which has better feature-fusion ability.

Self-attention is used to perform FER, which addresses the small receptive fields of CNNs. Introducing self-attention in different dimensions compensates for the shortcomings of traditional single-dimension attention in global interactions. The parallel algorithm applies dual attention directly to the original input and overcomes the drawback that channel-dimension self-attention merely enhances the results of spatial-dimension self-attention, making the model more sensitive to the surrounding environment. Compared with the traditional self-attention mechanism, parallel dual attention is better suited to FER tasks.

3.1 Dual-attention mechanism (DAM)

With the widespread application of sparse attention mechanisms, most researchers have reconstructed the input image into the form

by splitting the spatial dimension into patches and adding positional encoding, where

denotes the number of patches and

denotes the number of channels. However, the unique two-dimensional information of images can be destroyed; thus, refining local features becomes an important challenge. To address these problems, the reconstructed images are divided into groups, and a separate self-attention mechanism is applied to each group. Note that the channel dimension is not damaged during reconstruction and that each channel is an abstract representation of global information. In this study, self-attention mechanisms for the channel and spatial dimensions are run in parallel to form the dual-attention mechanism. Likewise, group-based learning is added to the channel dimension. The two attention mechanisms complement each other in acquiring local features and global information, showing strong FER capability.

Specifically, in the spatial dimension, the reconstructed image can be assumed to be divided into

groups, each containing

patches. The reconstructed image can be split into multiple non-overlapping patches using a sliding-window approach, and these patches can then be grouped. The size of each patch (in pixels) depends on the required resolution of the reconstructed image and the size of the original image. For example, if the reconstructed image has dimensions

groups, each patch can contain

pixels. The general self-attention operation in the spatial dimension is expressed as Eq. (1):

where

can be obtained through a linear transformation of the input,

denotes the result of self-attention, and

In the channel dimension, the reconstructed image can be assumed to be divided into

groups, each containing

channels, such that

The general self-attention operation in the channel dimension can be expressed as Eq. (2):

where

can be obtained through a linear transformation of the input.

Equations (1) and (2) show that all spatial locations are considered when computing channel-dimension self-attention, which enables global interaction. In subsequent experiments, we confirmed that channel-dimension self-attention pays more attention to the face as a whole and to the relation between the facial expression and the surrounding environment. Spatial-dimension self-attention is restricted to different spatial locations to complete local interactions, which makes it more sensitive to key locations, such as the eyes and mouth. The two cooperate to enhance the perception of facial expressions. Meanwhile, to share information promptly while the two dimensions are processed in parallel, we use a crossover method to transfer information between the dimensions. Section 3.2 explains this further.
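The difference between the two attention directions can be illustrated with a small sketch. Because Eqs. (1) and (2) are not reproduced in this text, the code assumes the standard scaled dot-product form and, for brevity, uses identity projections for the queries, keys, and values instead of learned linear transformations; grouping is omitted. The toy feature map and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x):
    # x: one token per row. Q = K = V = x (identity projections, an assumption).
    d = len(x[0])
    scores = matmul(x, transpose(x))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x), weights

# Toy reconstructed feature map: 4 patches (rows) x 3 channels (columns).
features = [[1.0, 0.0, 2.0],
            [0.5, 1.0, 0.0],
            [0.0, 2.0, 1.0],
            [1.5, 0.5, 0.5]]

# Spatial attention: tokens are patches, so the weight matrix is 4 x 4.
spatial_out, spatial_w = self_attention(features)

# Channel attention: tokens are channels, so the weight matrix is 3 x 3 and
# each score is computed over all spatial locations at once.
channel_out_t, channel_w = self_attention(transpose(features))
channel_out = transpose(channel_out_t)
```

In this sketch the spatial weights relate patch to patch, whereas each channel weight is an inner product over every spatial position, which is the sense in which the channel branch interacts globally.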

3.2 Cross-fusion attention mechanism

Facial expressions in images from real-world environments are complex and diverse. People express their emotions in different ways. Moreover, many similarities exist between the expressions of different emotions. Consequently, simply applying neural networks to facial-expression images cannot accurately distinguish subtle differences, which leads to low recognition rates. In the dual self-attention mechanism, spatial-dimension self-attention focuses more on the key regions related to facial expressions. In contrast, channel-dimension self-attention focuses more on global information. Both types of information are important for FER. To make better use of information from different dimensions, we designed the cross-fusion dual self-attention model shown in Fig. 2. Previous research on cross-fusion often required additional data to complete cross-fusion between features generated from different data. In this study, we adopt a more concise implementation that adds cross-fusion directly to the operation of the two dimensions of the self-attention mechanism. Experiments show that the proposed design is effective.

Specifically, the output

of the upper layer of the spatial dimension is mapped to two matrices,

through two linear transformations; the output

of the upper layer of the channel dimension is mapped to the matrix

through a linear transformation, before sending

, and

to the spatial dimension of the next layer for self-attention. The process is similar in the channel dimension, and the attention result can be expressed as Eq. (3):

3.3 Self-attention distillation

The self-attention mechanism expands the attention window to the entire image, which greatly increases the computational burden and produces a severe smearing phenomenon, creating many redundant combinations. Useless redundant information seriously interferes with feature extraction and degrades model performance. This study proposes a self-attention distillation mechanism that operates on the keys and values and reduces the size of the spatial and channel dimensions. This operation extracts the dominant features to form a feature map with a concentrated advantage in the subsequent self-attention, which suppresses interference from redundant information and reduces noise. Furthermore, to reduce the loss of mid- and high-frequency information caused by self-attention distillation, a residual connection is used to merge the original

with the results of the self-attention operation. The interaction between self-attention distillation and the residual connection forms an independent closed-loop operation, which effectively reconstructs the lost information.

In the spatial dimension, two convolutional layers with a kernel size of three are constructed. The first layer converts the number of channels from a high dimension to a low dimension, which is a dynamic process of extracting the dominant features. The second layer converts the number of channels from the low dimension back to the high dimension, keeping the same dimensions as the original input. Finally, max pooling is used to reduce the number of patches in the keys and values. The overall computation is given by Eq. (4):

where

denotes a convolution with a kernel size of three (used to reduce the number of channels), Conv

denotes a convolution with a kernel size of three (used to increase the number of channels), and MaxPool denotes max pooling.
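The effect of the final pooling step on attention cost can be sketched as follows. This is a simplified illustration, not the operation of Eq. (4): the two convolutions are omitted, pooling is done over a one-dimensional token sequence rather than a two-dimensional feature map, and the window size of 4 and the token counts are hypothetical values chosen only for the example.

```python
def max_pool_tokens(tokens, window):
    """Max-pool consecutive groups of `window` tokens, channel by channel."""
    pooled = []
    for start in range(0, len(tokens) - window + 1, window):
        group = tokens[start:start + window]
        pooled.append([max(channel) for channel in zip(*group)])
    return pooled

def score_matrix_size(num_queries, num_keys):
    # Number of query-key scores that self-attention has to compute.
    return num_queries * num_keys

# 16 key tokens with 4 channels each (toy values).
keys = [[float(i + c) for c in range(4)] for i in range(16)]
pooled_keys = max_pool_tokens(keys, window=4)

num_queries = 16
full_cost = score_matrix_size(num_queries, len(keys))            # 16 * 16
reduced_cost = score_matrix_size(num_queries, len(pooled_keys))  # 16 * 4
```

Because only the keys and values are pooled, the number of queries (and therefore the output resolution) is unchanged, while the score matrix shrinks in proportion to the pooling window.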

In the channel dimension, a convolution with a kernel size of one is used to reduce the number of channels within a group; a convolution with a kernel size of three is then used to strengthen the connections between channels and learn more reliable, higher-quality features. To guide the model to focus on acquiring global information, max pooling is omitted in the channel dimension so that the number of patches in each group remains unchanged. The overall computation is given by Eq. (5):

where

denote convolutions with kernel sizes of one and three, respectively.

3.4 Interactive learning mechanism (ILM)

Because activation functions such as sigmoid and hyperbolic tangent (tanh) contain exponential operations, computation is slowed, and the ReLU activation function has non-differentiable points and the dying-neuron problem. To solve these problems, we designed a continuous activation function composed of three cubic polynomial segments, called the piecewise cubic polynomial (PCP) function, to increase the feature-fusion ability of the interactive learning mechanism. It is constructed as follows.

First, the interval

is divided into three subintervals by the points

and the corresponding function values at the four points

are

The first derivative at the point

is defined by

where

is an undetermined parameter.
The first derivatives at the points

are both zero,

In this way, we can construct a cubic Hermite interpolation function on the interval

:

where

This computation involves three undetermined parameters:

When the values of the parameters

are given,

can be computed as follows. To ensure that

is

continuous, the second derivative at the point

is defined by

such that

can be computed using the system of equations (9):

In this construction,

is

defined as Eq. (10):

By suitably adjusting the values of the three parameters,

in

the accuracy of the algorithm can be improved, which strengthens the feature-fusion ability of the interactive learning mechanism.
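The construction above relies on cubic Hermite interpolation, in which a cubic on one subinterval is fixed by the function values and first derivatives at its two endpoints. The sketch below shows the textbook Hermite basis for a single segment. The endpoint values used in the example are hypothetical and are not the nodes or parameters of the proposed activation function, which are not recoverable from this text.

```python
def hermite_cubic(x, x0, x1, y0, y1, m0, m1):
    """Cubic on [x0, x1] matching values y0, y1 and first derivatives m0, m1."""
    h = x1 - x0
    t = (x - x0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1   # weight of y0
    h10 = t**3 - 2 * t**2 + t       # weight of h * m0
    h01 = -2 * t**3 + 3 * t**2      # weight of y1
    h11 = t**3 - t**2               # weight of h * m1
    return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1

def hermite_cubic_derivative(x, x0, x1, y0, y1, m0, m1):
    """First derivative of the same cubic with respect to x."""
    h = x1 - x0
    t = (x - x0) / h
    d00 = 6 * t**2 - 6 * t
    d10 = 3 * t**2 - 4 * t + 1
    d01 = -6 * t**2 + 6 * t
    d11 = 3 * t**2 - 2 * t
    return (d00 * y0 + d10 * h * m0 + d01 * y1 + d11 * h * m1) / h

# Hypothetical segment: flat at the left end, slope 1 at the right end.
segment = dict(x0=-2.0, x1=0.0, y0=-0.1, y1=0.0, m0=0.0, m1=1.0)
left_value = hermite_cubic(-2.0, **segment)
right_value = hermite_cubic(0.0, **segment)
left_slope = hermite_cubic_derivative(-2.0, **segment)
right_slope = hermite_cubic_derivative(0.0, **segment)
```

Joining several such segments with shared endpoint values and slopes gives a piecewise cubic whose first derivative is continuous by construction; matching second derivatives at the joints, as the text describes, then constrains the remaining free slopes.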

Note: Because

is applied repeatedly in the model, directly computing

using Eq. (8) incurs a high computational cost. Therefore, we write

as

where

are coefficients obtained by simplifying Eq. (8).

We compute

in

before each iteration, so that

can be computed using Eq. (12):

Therefore, only three multiplications and three additions are required to compute

, which greatly reduces the number of operations. By contrast, computation using Eqs. (8) and (9) requires four divisions, 17 multiplications, four subtractions, and three additions.
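The count of three multiplications and three additions is what Horner's rule gives for any cubic whose four coefficients are known in advance. The sketch below assumes that Eq. (12) has this nested form, which is not shown in the text; the coefficient names are illustrative.

```python
def cubic_horner(x, a3, a2, a1, a0):
    # ((a3*x + a2)*x + a1)*x + a0: three multiplications, three additions.
    return ((a3 * x + a2) * x + a1) * x + a0

def cubic_naive(x, a3, a2, a1, a0):
    # Direct evaluation, shown only for comparison; it uses more multiplications.
    return a3 * x * x * x + a2 * x * x + a1 * x + a0

value = cubic_horner(2.0, 1.0, 2.0, 3.0, 4.0)
```

For a piecewise function, the coefficients of each segment would be precomputed once, and evaluation then reduces to selecting the segment that contains the input and applying `cubic_horner`.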

Based on this activation function, this study proposes an interactive learning mechanism, as shown in Fig. 3. First, global average pooling is applied to the individual feature vectors in the channel dimension to aggregate global spatial information; the feature vector is then transformed by a feed-forward neural network. We refer to the transformed results as implicit dynamic vectors. Finally, the feature vector and the implicit dynamic vector are cross-multiplied to complete the interactive learning. Global average pooling uses the mean value of the feature map to determine its importance, which directly gives each channel its actual meaning, and the result of global average pooling is then fed back to the input as weights to strengthen the input features. During the interaction, because the activation function can compress the negative region into a smaller negative interval, interactive information can be recognized and non-interactive information can be suppressed to an inactive state, so that it can be fused better with information from other dimensions.

Fig. 3 Interactive learning mechanism. The features extracted from the two dimensions are global-average-pooled and passed through a linear layer before being multiplied with the inputs separately. They are then cross-multiplied between the two dimensions.

The generation of the implicit dynamic vectors and the overall operation of the interactive learning mechanism can be expressed as Eqs. (13)-(16):

Here,

denotes the activation function; Linear, a fully connected layer; and

global average pooling.
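One plausible reading of the pooling, linear, activation, and cross-multiplication steps is sketched below. Since Eqs. (13)-(16) are not reproduced here, the exact order of operations and the final merge of the two branches are unknown; the sketch stops at the cross-weighted outputs. The activation is passed in as a parameter because the coefficients of the proposed function are not available, and the weights, bias, and all function names are hypothetical.

```python
def global_avg_pool(feature_map):
    # feature_map: list of channels, each a flat list of spatial values.
    return [sum(channel) / len(channel) for channel in feature_map]

def linear(vec, weights, bias):
    # Fully connected layer: weights has one row per output element.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def implicit_dynamic_vector(feature_map, weights, bias, activation):
    pooled = global_avg_pool(feature_map)
    return [activation(v) for v in linear(pooled, weights, bias)]

def reweight(feature_map, channel_weights):
    # Scale every channel by its weight.
    return [[w * v for v in channel]
            for channel, w in zip(feature_map, channel_weights)]

def interactive_learning(spatial_feat, channel_feat, weights, bias, activation):
    d_spatial = implicit_dynamic_vector(spatial_feat, weights, bias, activation)
    d_channel = implicit_dynamic_vector(channel_feat, weights, bias, activation)
    # Cross-multiplication: each branch is weighted by the other branch's vector.
    return reweight(spatial_feat, d_channel), reweight(channel_feat, d_spatial)

# Toy inputs: 2 channels x 2 spatial positions per branch.
spatial_feat = [[1.0, 3.0], [2.0, 2.0]]
channel_feat = [[0.0, 2.0], [4.0, 4.0]]
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

out_spatial, out_channel = interactive_learning(
    spatial_feat, channel_feat, identity_weights, zero_bias,
    activation=lambda v: v,  # stand-in; the proposed activation would go here
)
```

With the identity stand-ins, the spatial branch is scaled by the channel branch's pooled means and vice versa, which makes the cross-weighting easy to check by hand.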

4 Results

This section describes the datasets, experimental environment, and implementation. Fig. 4 shows some samples. The effectiveness of the proposed model is compared with that of methods commonly used in recent years. The improvement contributed by each part of the model is then investigated through ablation experiments.

Fig. 4 Sample images from the three datasets.

4.1 Datasets

We evaluated our method on three commonly used facial-expression datasets: AffectNet, RAF-DB, and FERPlus. These data were collected in real-world environments and are subject to varying degrees of illumination and occlusion.

AffectNet

AffectNet is a large in-the-wild facial-expression dataset consisting of more than one million face images from the Internet. The dataset contains eight classes. Note that the training and test sets of AffectNet are highly imbalanced.

RAF-DB [59]

RAF-DB is a dataset of facial expressions in real-life scenarios. The dataset consists of seven classes. The training set contains 12,271 samples, and the test set contains 3,068 samples.

FERPlus

FERPlus contains images relabeled from mislabeled images and excludes non-face images from the original FER2013 dataset. Each image in FERPlus was labeled by multiple crowd-sourced annotators, which provides better label quality than the original FER2013 dataset. Like AffectNet, the dataset contains eight classes.

4.2 Experimental details

The model was implemented using Python 3.7 and PyTorch 1.7.1. For all training cases, face images were detected using MTCNN. During the experiments, all images were resized to

pixels. The model was trained on an NVIDIA GTX 2080 graphics card. During training, the batch size was set to 32, and the AdamW optimizer with a momentum of 0.9 and a weight decay of

was used to optimize the model. During training, the model used only the cross-entropy loss function, which gives it good generalization ability.

4.3 Results and analysis

Here, we compare the proposed model with state-of-the-art methods from recent years on the AffectNet, RAF-DB, and FERPlus datasets to demonstrate the superiority of our method in FER tasks. Table 1 presents the results.

On the RAF-DB dataset, compared with EfficientFace, SCN, and other CNN methods, the proposed model shows an improvement of approximately

. Compared with DAN, AMP-Net, and other attention methods, the improvement is approximately

The experiments show that the proposed model has more advanced recognition ability on the RAF-DB dataset and provides a more effective solution for FER tasks. On the AffectNet dataset, based on the data in Table 1, the recognition accuracy of the proposed method is 63.58%

. Compared with SCN, EfficientFace, and RAN, the improvement is approximately

, and compared with DMUE, FDRL, DAN, ARM, CSGResNet, and AMP-Net, the improvement is approximately

On the FERPlus dataset, the improvement is greater than

compared with other methods.

Table 1 Comparison on the RAF-DB, AffectNet, and FERPlus datasets (accuracy, %)

Method	Venue, year	RAF-DB	AffectNet	FERPlus
SCN [44]	CVPR 2020	87.03	60.23	89.39
RAN [45]	TIP 2020	86.90	-	89.16
EfficientFace [46]	AAAI 2021	88.36	59.89	-
DMUE [47]	CVPR 2021	89.42	-	-
FDRL [48]	CVPR 2021	89.47	-	-
DAN [49]	arXiv 2021	89.70	62.09	-
ARM [56]	arXiv 2021	90.42	61.33	-
CSGResNet [50]	ICASSP 2022	88.59	61.03	88.94
AMP-Net [57]	TCSVT 2022	89.19	61.32	89.37
POSTER [33]	arXiv 2022	92.05	63.34	91.62
Ours	-	92.78	63.58	92.02

To explore the performance of the model on different facial expressions in more detail, we examined the confusion matrices on the two datasets, as shown in Fig. 5. The confusion matrices describe the recognition accuracy for each expression and the proportion of samples misclassified as other expressions; the diagonal elements give the recognition accuracy for each expression. Clearly, the happy expression is the easiest to recognize among the eight expressions, owing to its large display range. The recognition rate for happiness on AffectNet was much higher than that for the other expressions. Apart from happy faces, the differences in recognition rates were small, mainly because the images in AffectNet come from the Internet and contain many mislabeled samples. On FERPlus, the recognition rate for happiness was slightly lower than that for neutral expressions, because neutral expressions have the largest sample size. In addition, FERPlus contains the fewest samples of disgust and contempt (only one-tenth of the number for other expressions), and these expressions look similar. Consequently, the recognition accuracy for disgust and contempt was much lower than that for the other expressions.
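The reading of the confusion matrix described above, with per-expression accuracy on the diagonal after row normalization, can be sketched as follows. The labels and predictions are made-up toy data for three classes, not results from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    # Rows are true classes, columns are predicted classes.
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def row_normalize(matrix):
    # Each row becomes the distribution of predictions for one true class.
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy data: class 0 has three samples, class 1 two, class 2 one.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
rates = row_normalize(cm)
per_class_accuracy = [rates[i][i] for i in range(3)]
```

The toy example also shows why classes with few samples, such as disgust and contempt in FERPlus, have unstable diagonal entries: a single misclassified sample moves the rate for class 2 from 1.0 to 0.0.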

Transformers have attracted increasing attention in recent FER studies. However, the large number of parameters required remains a major limitation of transformers. Moreover, the number of model parameters (Params) and floating-point operations (FLOPs) are key quantities that must be considered for fair comparison. One of the main concerns of this study was reducing the total number of model parameters. To this end, we proposed grouped self-attention and self-attention distillation. Table 2 compares the parameters of the proposed method with those of other methods, including CVT, DMUE, TransFER, POSTER, and DAN. The proposed method has about a quarter or fewer of the parameters of CVT, DMUE, TransFER, and POSTER and about 58% of those of DAN, while maintaining the highest FER accuracy. In terms of running speed, training on the RAF-DB training set takes approximately 1 h 20 min, and testing all expression images in the RAF-DB test set takes approximately 1 min. This gives the self-attention-based FER model a considerable advantage.
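The parameter comparison can be checked directly from the counts reported in Table 2 (in millions). The short calculation below uses only those reported figures; the variable names are illustrative.

```python
# Parameter counts in millions, as reported in Table 2.
params_m = {"CVT": 80.1, "DMUE": 78.4, "TransFER": 65.2, "POSTER": 71.8, "DAN": 28.3}
ours_m = 16.4

# Ratio of the proposed model's parameter count to each compared method.
ratios = {name: ours_m / count for name, count in params_m.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}")
```

The ratios fall between roughly 0.20 and 0.25 for the four transformer-based methods, and the ratio for DAN is noticeably higher, which is why it is stated separately.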

Table 2 Comparison of parameters and FLOPs

Method	Venue, year	Params	FLOPs	Acc (RAF-DB)	Acc (FERPlus)
CVT [52]	arXiv 2021	80.1 M	-	87.61	88.81
DMUE [47]	CVPR 2021	78.4 M	13.4 G	89.42	-
TransFER [55]	ICCV 2021	65.2 M	15.3 G	90.91	90.83
POSTER [33]	arXiv 2022	71.8 M	15.7 G	92.05	91.62
DAN [49]	arXiv 2021	28.3 M	2.6 G	89.70	-
Ours	-	16.4 M	2.0 G	92.78	92.02

4.4 Ablation study

Performance of the dual-attention and interactive learning mechanisms: To verify the significant advantages of the dual-attention approach, we first used a single self-attention mechanism as the baseline. The baseline was then compared with the dual-attention mechanism. Next, we compared the effects of interactive learning and cross-fusion of dual attention on FER. Figs. 6(a)-6(d) illustrate these variants, and Table 3 lists the results. Classification accuracy on FER tasks clearly improved after adding channel-dimension self-attention and the interactive learning mechanism in parallel processing, which demonstrates the feasibility and effectiveness of introducing the dual-attention mechanism. The "Baseline + DAM" and "Baseline + DAM + ILM" rows of Table 3 show that adding the interactive learning mechanism raises accuracy from 90.92% to 91.55% on FERPlus, from 90.88% to 91.63% on RAF-DB, and from 60.44% to 63.12% on AffectNet. To better fuse features from two different sources, we first consider the influence of one feature on the other, capture the mutual compatibility between the two features, and complete feature fusion on this basis, rather than simply adding or concatenating them. The experiments confirm the feasibility of this idea. In addition, we designed a method for constructing an activation function to improve the interactive learning mechanism. In a later comparison of different activation functions, we demonstrate the effectiveness of the proposed activation function.

Table 3 Performance of the dual-attention mechanism and the interactive learning mechanism

Method	AffectNet	FERPlus	RAF-DB
Baseline	59.73%	89.69%	90.37%
Baseline + DAM	60.44%	90.92%	90.88%
Baseline + DAM + ILM	63.12%	91.55%	91.63%
Baseline + DAM + cross ILM	63.58%	92.02%	92.78%

To observe the effects of the dual-attention mechanism on feature extraction and clustering in FER, we used the t-SNE algorithm [61] to visualize the expression features of some test-set samples in a two-dimensional space. Fig. 7 shows the results. Clearly, owing to the lack of global information, the recognition ability of the single-dimension attention mechanism was much lower than that of the dual-attention mechanism in early training. The advantages of local key features emerged during late training. Because of the large number of training samples and distinct facial features, the four expressions of happiness, neutral, surprise, and fear were recognized well under all three conditions, which is consistent with the information in the confusion matrices. For expressions with subtle differences, such as disgust and contempt, spatial-dimension self-attention had difficulty separating and distinguishing them because of the small number of training samples. After adding channel-dimension self-attention, some overlap remained among the facial features of classes with small sample sizes; however, the separation was much better than that of the single-dimension self-attention mechanism. Moreover, grouping in the dual-attention mechanism can effectively reduce the redundant connections caused by global attention and strengthen the FER ability of the whole network. The grouped dual-attention mechanism classified sadness, contempt, and disgust more accurately, with the improvement for sadness being the most evident.

To study the effect of the fusion mechanism on the attention regions for facial expressions, we plotted the attention maps of the attention mechanism in the channel and spatial dimensions. As shown in Fig. 8, in the channel dimension, attention is concentrated on the overall facial expression and its interaction with the surrounding environment, whereas in the spatial dimension, attention is concentrated on key parts, such as the eyes and mouth. Fusion gathers the key information from both dimensions.

Fig. 6 Ablation studies on the dual-attention mechanism and the interactive learning mechanism.

Fig. 7 Two-dimensional feature map of some samples.

Fig. 8 Visualization of the attention mechanism.

Fig. 9 Different activation functions and their derivatives. The red curve is the activation function; the purple curve, the first derivative; and the blue curve, the second derivative.

Validity analysis of the activation function: As shown in Fig. 9, the sigmoid function maps values to between 0 and 1, and its gradient never exceeds 0.25; thus, vanishing gradients can occur. The proposed activation function maps all values to between a small negative number and one, and its gradient lies between 0 and 1.5. To solve the neuron-inactivation and discontinuous-derivative problems in the negative part of the ReLU activation function, we map the negative region to a smaller negative region. As shown in Fig. 9, the activation function used in this study is continuously differentiable at all points. Moreover, the proposed activation function involves no exponentials, which improves the running speed of the model compared with the sigmoid and tanh activation functions. In terms of recognition accuracy, we compared different activation functions on the RAF-DB dataset; Table 4 presents the results. The proposed activation function clearly performs better in FER tasks.
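The bound on the sigmoid gradient mentioned above follows from the identity that the derivative of the sigmoid equals s(1 - s), which peaks at 0.25 when the input is zero. The short check below evaluates this on a grid; the grid range and the ten-layer example are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid expressed through its own value.
    s = sigmoid(x)
    return s * (1.0 - s)

# Evaluate the gradient on a grid from -8 to 8 in steps of 0.01.
grid = [i / 100 for i in range(-800, 801)]
max_grad = max(sigmoid_grad(x) for x in grid)

# Upper bound on the product of sigmoid gradients across ten stacked layers.
ten_layer_bound = 0.25 ** 10
```

Because each layer contributes a factor of at most 0.25, the product over ten layers is already below one millionth, which illustrates why an activation with a larger gradient range is less prone to vanishing gradients.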

Performance of self-attention distillation: To verify the effectiveness of self-attention distillation, the full model was used as the reference and compared with the model without self-attention distillation on the AffectNet and FERPlus datasets. Table 5 lists the results. Without self-attention distillation, the operating window expands to the whole image, which causes a severe smearing phenomenon and an increase in redundant information pairs, and the excessive noise introduced harms the generalization of the whole model. When self-attention distillation and the residual connection form a complete closed loop, information exchange across windows becomes easier while the loss of mid- and high-frequency information is reduced. Therefore, introducing self-attention distillation into this model is beneficial. It also reduces the computational cost of the model (from 0.21 G to 0.14 G FLOPs in Table 5).

Table 4 Performance of PCP and the interactive learning mechanism

PCP	Sigmoid	ReLU	tanh	ILM	ACC
					92.54
					92.78
					91.42
					91.78
					91.98
					92.26
					91.75
					91.97

Table 5 Performance of self-attention distillation

Method	AffectNet	FERPlus	Params	FLOPs
Without distillation			4.2 M	0.21 G
With distillation			4.22 M	0.14 G

4.5 Parameter sensitivity analysis

We performed a sensitivity analysis of the variable parameters of the activation function on the RAF-DB dataset. During the experiment,

was set to

1.75, 1.85, and 1.95 in three cases;

, to

, and 0.21; and

, to

, and -0.27 in five cases. Fig. 10 shows the results. As shown in Fig. 10, when

, the model performs poorly at

, and performance is average in the remaining cases. When

, the model performs poorly at

, and at

the model was relatively stable. When

, the model achieved the highest accuracy.

Fig. 10 Parameter sensitivity analysis of the activation function.

5 Conclusions

In this study, a cross-fusion dual-attention network based on the spatial and channel dimensions was proposed. Local interaction in the spatial dimension completes the feature refinement, and the channel dimension provides a global receptive field. The two types of self-attention features complement each other through cross-attention, so that the extracted features contain more effective information. The shape of the activation function can be adjusted adaptively, which increases its feature extraction and fusion abilities. It can also be applied to other learning frameworks to improve accuracy and reduce computation time. In addition, the proposed activation function construction method can be extended to construct a polynomial activation function of

according to the application, to increase the back-propagation ability of the activation function and improve the generalization ability. Because attention mechanisms often incur a high computational cost, this study proposes that a grouping mechanism and self-attention distillation act together on the self-attention mechanism. The attention is divided into different groups, and self-attention distillation is used in each group to reduce the spatial dimensions of

, which improves the self-attention mechanism and reduces the computational cost.

The self-attention mechanism is a key technique in emotion recognition. In future research, we will continue to explore the construction of more effective self-attention mechanisms based on the characteristics of different data in emotion recognition tasks. At the same time, we plan to study further the relationship between the self-attention mechanism and the activation function. For specific data, we will try to construct a more effective activation function adaptively, one with strong back-propagation ability, high continuity, and low computational cost, to obtain a stronger feature extraction ability at a low computational cost.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62272281 and 62007017, the Special Funds for the Taishan Scholars Project under Grant No. tsqn202306274, and the Youth Innovation Technology Project of Higher School in Shandong Province under Grant No. 2019KJN042.

Declaration of competing interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

[1] Edwards, J.; Jackson, H. J.; Pattison, P. E. Emotion recognition via facial expression and affective prosody in schizophrenia. Clinical Psychology Review Vol. 22, No. 6, 789-832, 2002.
[2] Joshi, A.; Kyal, S.; Banerjee, S.; Mishra, T. In-the-wild drowsiness detection from facial expressions. In: Proceedings of the IEEE Intelligent Vehicles Symposium, 207-212, 2020.
[3] Tran, L.; Yin, X.; Liu, X. M. Representation learning by rotating your faces. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 41, No. 12, 3007-3021, 2019.
[4] Wu, T. F.; Bartlett, M. S.; Movellan, J. R. Facial expression recognition using Gabor motion energy filters. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, 42-47, 2010.
[5] Shan, C. F.; Gong, S. G.; McOwan, P. W. Facial expression recognition based on Local Binary Patterns: A comprehensive study. Image and Vision Computing Vol. 27, No. 6, 803-816, 2009.
[6] Shokoohi, Z.; Bahmanjeh, R.; Faez, K. Expression recognition using directional gradient local pattern and gradient-based ternary texture patterns. In: Proceedings of the 2nd International Conference on Pattern Recognition and Image Analysis, 1-7, 2015.
[7] Wang, Z.; Ying, Z. L. Facial expression recognition based on local phase quantization and sparse representation. In: Proceedings of the 8th International Conference on Natural Computation, 222-225, 2012.
[8] Ali, H. B.; Powers, D. M. W.; Jia, X. B.; Zhang, Y. H. Extended non-negative matrix factorization for face and facial expression recognition. International Journal of Machine Learning and Computing Vol. 5, No. 2, 142-147, 2015.
[9] Baddar, W. J.; Lee, S. M.; Ro, Y. M. On-the-fly facial expression prediction using LSTM encoded appearance-suppressed dynamics. IEEE Transactions on Affective Computing Vol. 13, No. 1, 159-174, 2022.
[10] Li, Y. J.; Gao, Y. N.; Chen, B. Z.; Zhang, Z.; Lu, G. M.; Zhang, D. Self-supervised exclusive-inclusive interactive learning for multi-label facial expression recognition in the wild. IEEE Transactions on Circuits and Systems for Video Technology Vol. 32, No. 5, 3190-3202, 2022.
[11] Zhang, X.; Zhang, F. F.; Xu, C. S. Joint expression synthesis and representation learning for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology Vol. 32, No. 3, 1681-1695, 2022.
[12] Otberdout, N.; Daoudi, M.; Kacem, A.; Ballihi, L.; Berretti, S. Dynamic facial expression generation on Hilbert hypersphere with conditional Wasserstein generative adversarial nets. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 44, No. 2, 848-863, 2022.
[13] Zhang, F. F.; Zhang, T. Z.; Mao, Q. R.; Xu, C. S. A unified deep model for joint facial expression recognition, face synthesis, and face alignment. IEEE Transactions on Image Processing Vol. 29, 6574-6589, 2020.
[14] Feffer, M.; Rudovic, O.; Picard, R. W. A mixture of personalized experts for human affect estimation. In: Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science, Vol. 10935. Perner, P. Ed. Springer Cham, 316-330, 2018.
[15] Fan, Y.; Lu, X. J.; Li, D.; Liu, Y. L. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 445-450, 2016.
[16] Zhang, T.; Zheng, W. M.; Cui, Z.; Zong, Y.; Li, Y. Spatial-temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics Vol. 49, No. 3, 839-847, 2019.
[17] Pang, L.; Li, N. Q.; Zhao, L.; Shi, W. X.; Du, Y. P. Facial expression recognition based on Gabor feature and neural network. In: Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics, 489-493, 2018.
[18] Liu, Z.; Lin, Y. T.; Cao, Y.; Hu, H.; Wei, Y. X.; Zhang, Z.; Lin, S.; Guo, B. N. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9992-10002, 2021.
[19] Kim, J. H.; Kim, N.; Won, C. S. Facial expression recognition with Swin transformer. arXiv preprint arXiv:2203.13472, 2022.
[20] Wang, W. H.; Xie, E. Z.; Li, X.; Fan, D. P.; Song, K. T.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 548-558, 2021.
[21] Zhang, Q.; Yang, Y. B. ResT: An efficient transformer for visual recognition. In: Proceedings of the Advances in Neural Information Processing Systems, 15475-15485, 2021.
[22] Zhang, F.; Chen, G. G.; Wang, H.; Li, J. J.; Zhang, C. M. Multi-scale video super-resolution transformer with polynomial approximation. IEEE Transactions on Circuits and Systems for Video Technology Vol. 33, No. 9, 4496-4506, 2023.
[23] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X. H.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[24] Aouayeb, M.; Hamidouche, W.; Soladie, C.; Kpalma, K.; Seguier, R. Learning vision transformer with squeeze and excitation for facial expression recognition. arXiv preprint arXiv:2107.03107, 2021.
[25] Putro, M. D.; Nguyen, D. L.; Jo, K. H. A dual attention module for real-time facial expression recognition. In: Proceedings of the 46th Annual Conference of the IEEE Industrial Electronics Society, 411-416, 2020.
[26] Song, W. Y.; Shi, S. Z.; Wu, Y. X.; An, G. Y. Dual-attention guided network for facial action unit detection. IET Image Processing Vol. 16, No. 8, 2157-2170, 2022.
[27] Ding, M. Y.; Xiao, B.; Codella, N.; Luo, P.; Wang, J. D.; Yuan, L. DaViT: Dual attention vision transformers. In: Computer Vision – ECCV 2022. Lecture Notes in Computer Science, Vol. 13684. Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; Hassner, T. Eds. Springer Cham, 74-92, 2022.
[28] Fu, J.; Liu, J.; Tian, H. J.; Li, Y.; Bao, Y. J.; Fang, Z. W.; Lu, H. Q. Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3141-3149, 2019.
[29] Li, X. Q.; Xie, M.; Zhang, Y.; Ding, G. T.; Tong, W. Q. Dual attention convolutional network for action recognition. IET Image Processing Vol. 14, No. 6, 1059-1065, 2020.
[30] Li, Y. S.; Liu, Y.; Yu, R.; Zong, H. L.; Xie, W. X. Dual attention based spatial-temporal inference network for volleyball group activity recognition. Multimedia Tools and Applications Vol. 82, No. 10, 15515-15533, 2023.
[31] Gedamu, K.; Yilma, G.; Assefa, M.; Ayalew, M. Spatiotemporal dual-attention network for view-invariant human action recognition. In: Proceedings of the SPIE 12342, 14th International Conference on Digital Image Processing, 123420Q, 2022.
[32] Ullah, H.; Munir, A. Human activity recognition using cascaded dual attention CNN and bi-directional GRU framework. arXiv preprint arXiv:2208.05034, 2022.
[33] Zheng, C.; Mendieta, M.; Chen, C. POSTER: A pyramid cross-fusion transformer network for facial expression recognition. arXiv preprint arXiv:2204.04083, 2022.
[34] Han, J.; Moraga, C. The influence of the sigmoid function parameters on the speed of backpropagation learning. In: Proceedings of the International Workshop on Artificial Neural Networks: From Natural to Artificial Neural Computation, 195-201, 1995.
[35] Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 315-323, 2011.
[36] Bourel, F.; Chibelushi, C. C.; Low, A. A. Recognition of facial expressions in the presence of occlusion. In: Proceedings of the British Machine Vision Conference, 1-10, 2001.
[37] Mao, X.; Xue, Y. L.; Li, Z.; Huang, K.; Lv, S. W. Robust facial expression recognition based on RPCA and AdaBoost. In: Proceedings of the 10th Workshop on Image Analysis for Multimedia Interactive Services, 113-116, 2009.
[38] Jiang, B.; Jia, K. B. Research of robust facial expression recognition under facial occlusion condition. In: Proceedings of the 7th International Conference on Active Media Technology, 92-100, 2011.
[39] Hammal, Z.; Arguin, M.; Gosselin, F. Comparing a novel model based on the transferable belief model with humans during the recognition of partially occluded facial expressions. Journal of Vision Vol. 9, No. 2, 22, 2009.
[40] Zhang, K. P.; Zhang, Z. P.; Li, Z. F.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters Vol. 23, No. 10, 1499-1503, 2016.
[41] Amos, B.; Ludwiczuk, B.; Satyanarayanan, M. OpenFace: A general-purpose face recognition library with mobile applications. School of Computer Science, Carnegie Mellon University, 2016. Available at https://elijah.cs.cmu.edu/DOCS/CMU-CS-16-118.pdf
[42] Happy, S. L.; Routray, A. Automatic facial expression recognition using features of salient facial patches. IEEE Transactions on Affective Computing Vol. 6, No. 1, 1-12, 2015.
[43] Majumder, A.; Behera, L.; Subramanian, V. K. Automatic facial expression recognition system using deep network-based data fusion. IEEE Transactions on Cybernetics Vol. 48, No. 1, 103-114, 2018.
[44] Wang, K.; Peng, X. J.; Yang, J. F.; Lu, S. J.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6896-6905, 2020.
[45] Wang, K.; Peng, X. J.; Yang, J. F.; Meng, D. B.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing Vol. 29, 4057-4069, 2020.
[46] Zhao, Z. Q.; Liu, Q. S.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 35, No. 4, 3510-3519, 2021.
[47] She, J. H.; Hu, Y. B.; Shi, H. L.; Wang, J.; Shen, Q.; Mei, T. Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6244-6253, 2021.
[48] Ruan, D. L.; Yan, Y.; Lai, S. Q.; Chai, Z. H.; Shen, C. H.; Wang, H. Z. Feature decomposition and reconstruction learning for effective facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7656-7665, 2021.
[49] Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. arXiv preprint arXiv:2109.07270, 2021.
[50] Jiang, S. P.; Xu, X. M.; Liu, F.; Xing, X. F.; Wang, L. CS-GResNet: A simple and highly efficient network for facial expression recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2599-2603, 2022.
[51] Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In: Proceedings of the 37th International Conference on Machine Learning, 1691-1703, 2020.
[52] Ma, F.; Sun, B.; Li, S. Robust facial expression recognition with convolutional visual transformers. arXiv preprint arXiv:2103.16854, 2021.
[53] Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. MVT: Mask vision transformer for facial expression recognition in the wild. arXiv preprint arXiv:2106.04520, 2021.
[54] Huang, Q. H.; Huang, C. Q.; Wang, X. Z.; Jiang, F. Facial expression recognition with grid-wise attention and visual transformer. Information Sciences Vol. 580, 35-54, 2021.
[55] Xue, F. L.; Wang, Q. C.; Guo, G. D. TransFER: Learning relation-aware facial expression representations with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3581-3590, 2021.
[56] Shi, J.; Zhu, S.; Liang, Z. Learning to amend facial expression representation via de-albino and affinity. arXiv preprint arXiv:2103.10189, 2021.
[57] Liu, H. W.; Cai, H. L.; Lin, Q. C.; Li, X. F.; Xiao, H. Adaptive multilayer perceptual attention network for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology Vol. 32, No. 9, 6253-6266, 2022.
[58] Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2106-2112, 2011.
[59] Li, S.; Deng, W. H.; Du, J. P. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2584-2593, 2017.
[60] Barsoum, E.; Zhang, C.; Ferrer, C. C.; Zhang, Z. Y. Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 279-283, 2016.
[61] Van Der Maaten, L.; Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research Vol. 9, 2579-2625, 2008.

Fan Zhang received his B.S. and Ph.D. degrees in computer science from Shandong University. He is currently an associate professor at the School of Computer Science and Technology, Shandong Technology and Business University. His research interests include computer graphics, computer vision, and artificial intelligence.

Gongguan Chen is a graduate student at the School of Computer Science and Technology, Shandong Technology and Business University. His research interests include computer vision and artificial intelligence.

Hua Wang received her B.S. degree from the University of Kentucky and China University of Mining and Technology in 2012. She joined the Computational Biomechanics Laboratory for her master's and Ph.D. degrees. She is currently an associate professor at the School of Information and Electrical Engineering, Ludong University. Her research focuses on computer graphics and computational geometry.

Caiming Zhang is a professor and doctoral supervisor at Software College, Shandong University. He received his B.S. and M.E. degrees in computer science from Shandong University in 1982 and 1984, respectively, and his Ph.D. degree in computer science from the Tokyo Institute of Technology in 1994. From 1997 to 2000, he held a visiting position at the University of Kentucky, USA. His research interests include CAGD, CG, image processing, and computational finance.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

1 Shandong Technology and Business University, Shandong 264005, China. E-mail: F. Zhang, zhangfan@sdtbu.edu.cn; G. Chen, 2020410018@sdtbu.edu.cn.
2 School of Information and Electrical Engineering, Ludong University, Yantai 264025, China. E-mail: hua.wang@ldu.edu.cn.
3 Shandong University, Shandong 250100, China. E-mail: czhang@sdu.edu.cn.
Manuscript received: 2023-02-21; accepted: 2023-07-19
(1) http://mohammadmahoor.com/affectnet/
(2) http://www.whdeng.cn/raf/model1.html
(3) https://www.worldlink.com.cn/osdir/ferplus.html

Journal: Computational Visual Media
DOI: https://doi.org/10.1007/s41095-023-0369-x
Publication Date: 2024-02-08

CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network

Fan Zhang, Gongguan Chen, Hua Wang, and Caiming Zhang

© The Author(s) 2024.

Abstract

Recently, facial-expression recognition (FER) has primarily focused on images in the wild, including factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed activation function construction method, which is a piecewise cubic polynomial with three degrees of freedom, requiring less computation with improved flexibility and recognition abilities, which can better address slow running speeds and neuron inactivation problems; and (3) a closed-loop operation between the self-attention distillation process and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were , and , respectively. Experiments show that this model can provide more effective solutions for FER tasks.

Keywords: facial-expression recognition (FER); cubic polynomial activation function; dual-attention mechanism; interactive learning; self-attention distillation

1 Introduction

The study of human emotional states is an interdisciplinary research field spanning psychology and computer science and is fundamental to developing emotional intelligence. Facial expressions are the most natural and powerful external expressions of a person's emotional state, conveying such emotions as calmness, happiness, anger, sadness, fear, disgust, and surprise. Facial expressions are key to nonverbal human communication. With continuing technological developments, research on facial-expression recognition (FER) has deepened and has had a major impact on many facets of life, such as public security, lie detection, driving-fatigue detection, intelligent medical treatment [1], and security monitoring.

Traditional FER primarily uses manual features and shallow learning methods [4, 5], directional gradients [6], sparse representations [7], and non-negative matrix decomposition [8]. Recently, with the widespread application of deep learning, convolutional neural networks (CNNs) [9-13] have almost completely replaced traditional methods and have achieved excellent results in various computer vision tasks. Many researchers have applied these networks to FER tasks, resulting in excellent progress [14-17]. CNNs have significant advantages in extracting low-level features and visual structures. However, high-level visual semantic information often concerns how these elements relate to each other to form an object and how the spatial relationships between objects form a scene, for which CNNs cannot achieve ideal results. To address this, we use a CNN only to extract shallow features, and process the obtained shallow features further with other methods.

Recent studies have used self-attention mechanisms for various computer vision tasks [18-22]. Self-attention mechanisms can imitate how people pay attention to key positions within an image and extract key information from these key positions to complete various tasks. For example, the vision transformer (ViT) [23] model can be applied to image classification. However, in the field of FER, networks similar to the ViT model cannot be used directly, owing to the limited sample size. Aouayeb et al. [24] successfully applied a ViT network for an expression recognition task by adding a squeeze-and-excitation (SE) block. Because of the network's ability to model global dependencies, self-attention mechanisms can better correlate with various facial parts when processing high-level visual semantic information. However, self-attention mechanisms require considerable computational resources, which are often unacceptable in several real-world scenarios. In this study, we addressed this issue by adopting a grouping approach and adding self-attention distillation within each group.
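The effect of grouping on cost can be illustrated by counting the query-key pairs that self-attention has to score. The sketch below is a rough model, not the paper's cost analysis: the token count of 196 is an arbitrary example, and `kv_reduction` is a hypothetical stand-in for distillation, under the assumption that distillation shortens the key/value sequence inside each group.

```python
def attention_pairs(num_tokens, num_groups=1, kv_reduction=1):
    """Count the query-key pairs scored by (grouped) self-attention.

    num_groups   : tokens are split evenly into this many groups
    kv_reduction : hypothetical factor by which keys/values are shortened
                   inside each group (assumed stand-in for distillation)
    """
    if num_tokens % num_groups != 0:
        raise ValueError("num_tokens must be divisible by num_groups")
    queries_per_group = num_tokens // num_groups
    if queries_per_group % kv_reduction != 0:
        raise ValueError("group size must be divisible by kv_reduction")
    keys_per_group = queries_per_group // kv_reduction
    return num_groups * queries_per_group * keys_per_group


full = attention_pairs(196)                                    # one global window
grouped = attention_pairs(196, num_groups=4)                   # four groups of 49
reduced = attention_pairs(196, num_groups=4, kv_reduction=7)   # shortened keys/values
print(full, grouped, reduced)
```

Splitting N tokens into G equal groups cuts the pair count from N² to N²/G, and shortening the keys and values by a further factor divides it again by that factor.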

Dual attention networks [25-28] complement each other in terms of feature refinement and global information acquisition. DANet [29] employs a dual-attention convolutional network with dual-attention mechanisms based on spatial and temporal attention for action recognition. Li et al. [30] proposed a dual-attention model composed of spatial and mixed-channel attention to recognize sports activities in videos. SDA-Net [31] employs spatial/temporal self- and cross-attention modules for human action recognition. Ullah and Munir [32] utilized a unified-channel spatial attention mechanism to extract salient features centered on humans in video frames. These studies demonstrated the effectiveness of dual-attention. In the field of FER, POSTER [33] utilizes cross-transformers to fuse facial landmark and image features and builds a pyramid structure to promote scale invariance. However, POSTER requires an additional network to extract facial landmark features and achieves only the cross-fusion of single-dimensional attention. Moreover, the POSTER pyramid structure significantly increases the model complexity. In this study, we utilize self-attention mechanisms to directly model high-level visual semantic information in both channel and spatial dimensions while achieving cross-fusion between the two dimensions.

Existing activation functions can be divided into two categories: those based on the extension of the sigmoid function [34] and those based on the extension of the ReLU function [35]. Sigmoid and its related functions have high computational complexity owing to the inclusion of power operations, resulting in a slow computation speed. Although ReLU and its related functions have fast computation speeds, their poor stability is caused by the discontinuity of the first-order derivative, and the negative interval becomes zero after activation, which easily leads to neuron deactivation. To address this, this study proposes an activation function construction method that uses fitting to avoid power operations while maintaining the continuity of the first-order derivative. During the fitting process, interval adjustments are made to prevent neuron deactivation. The contributions of this study are as follows:
(1) We propose a cross-fusion dual-attention transformer based on the spatial and channel dimensions. Local interaction in the spatial dimension completes the feature refinement, and a global receptive field is provided in the channel dimension. The cross-fusion dual-attention transformer realizes the mutual complementation of feature information of different dimensions, thereby improving the accuracy of the respective features.
(2) We designed an activation function construction method to solve problems in commonly used activation functions, such as neuron deactivation, the large computational overhead introduced by the power function, and the existence of non-differentiable points. Moreover, this study constructed a

continuous activation function for the interactive learning mechanism, which improves the ability of the interactive learning mechanism to integrate different features.
(3) To reduce the high computational cost of the self-attention mechanism, we propose a grouped mechanism and self-attention distillation. This process divides the attention into different groups and uses self-attention distillation in each group to reduce the spatial dimensions of

and

, thereby reducing the computational cost. Using self-attention distillation improves the ability of the self-attention mechanism and significantly reduces the computational cost (

).

in real-world environments. To address the challenges of FER in real-world environments, researchers have made great efforts to improve the recognition accuracy under complex background and figure occlusion conditions. Bourel et al. [36] extracted the spatial degree features of key facial parts for classification. PCA-based methods [37, 38] realized the projection of test images onto training images using similarity to realize expression recognition. Hammal et al. [39] used multilevel image segmentation and sparse decomposition to solve the facial expression occlusion problem. Multiple face detectors such as the MTCNN [40] and Dlib [41] models have been used for facial detection in real-world scenarios. Moreover, to solve problems present in real-world scenarios, an increasing number of multi-view and multiscale-based studies have been conducted. Happy and Routray [42] and Majumder et al. [43] proposed that facial expression changes are primarily reflected in key parts, such as the eyes and mouth. With the continuous development of deep learning, numerous studies based on convolutional neural networks (CNNs) have made significant progress in extracting facial features in real-world environments. The SCN model [44] adds three additional modules to the traditional CNN for self-attention importance weighting, rank regularization, and relabeling purposes. An RAN [45] employs convolutional operations to design feature extraction, self-attention, and relation attention modules. EfficientFace [46] uses a local feature extractor and channel spatial modulator to improve the robustness. DMUE [47] uses an auxiliary branch to mine potential distributions and perform pairwise deterministic estimations. FDRL [48] is an emotion-related feature decomposition and reconstruction network. A DAN [49] consists of a feature-clustering, multi-head cross attention, and attention fusion network. ARM uses a modified representation module to replace the pooling layer. CS-GResNet [50] integrates Gabor convolution (GConv) into ResNet. AMP-Net extracts global, local, and salient facial-emotion features with different granularities to learn the underlying diversity and key information of facial emotions. However, these methods are limited by the small receptive fields of CNNs. In this study, we leverage the advantages of CNNs to extract shallow features and use self-attention mechanisms to process high-level visual semantic information.

Visual transformers. Several recent studies have shown that transformers have enormous potential in computer vision applications. Several pioneering studies, such as those on the iGPT [51] and ViT methods, applied a self-attention mechanism directly to image pixels or patch sequences. Inspired by this, convolutional visual transformers (CVTs) [52] were the first to apply the transformer model to FER tasks. CVTs use the local binary pattern (LBP) algorithm to send facial expression images in two different states to a ResNet network to obtain smaller feature images, and the transformer model is used to complete the FER task. A mask vision transformer (MViT) [53] generates a mask based on the transformer model to filter complex backgrounds and occlusions of facial images. The FER-VT [54] method sends feature maps of different scales to the same transformer model to complete the information fusion. TransFER [55] randomly drops attention maps and self-attention modules, guiding the model to learn the rich relationships between different local patches. However, these methods are limited by the spatial dimension and have limitations in obtaining global information compared with the dual-attention mechanism.

3 Methodology

This study proposes a more concise and effective dual-attention mechanism to solve FER tasks in real-world environments more effectively. Considering that the large size of the feature map significantly increases the computation of the self-attention mechanism, the network first uses the ResNet model for the high-dimensional mapping of facial expression images, thereby acquiring high-dimensional feature maps of a smaller scale. Figures 1 and 2 depict the overall flow of the network. In particular, assuming that the input image is

, the image is first fed to a ResNet neural network with five convolutional layers for low-level feature extraction. The output feature map of ResNet is denoted as

, where

denotes the number of channels. High-level features are then extracted simultaneously from the ResNet output in both the channel and spatial dimensions, and the extracted results are denoted as

and

, respectively. The interactive learning mechanism combines the results of these two dimensions into

. Finally, the model uses a fully connected layer to obtain the output vector

for classification, where

is the number of facial expression categories.

Fig. 1 Dual-attention mechanism. The left side depicts the overall structure of the dual-attention mechanism. The right side depicts the self-attention in different dimensions. Inputs

, and

are obtained via linear variation and self-attention distillation, and self-attention is then calculated. Channel attention does not divide the input into multiple patches for operation, unlike spatial attention. It performs the self-attention operation directly on the channel dimension.

Fig. 2 Overall structure of the proposed model. The main part of the model comprises two layers of dual-attention and an interactive learning mechanism. Information is shared by exchanging

between the two layers of dual-attention, thereby breaking the information occlusion caused by parallel operations. To improve the feature fusion ability of the interactive learning mechanism for dual-attention, we designed a

continuous activation function

comprising three-segment cubic polynomials, which has a better feature fusion ability.

Self－attention is used to realize facial expression recognition，which solves the problem of the small receptive fields of CNNs．Introducing a self－attention mechanism with different dimensions compensates for the deficiencies of traditional single dimensions in global interactions．The parallel algorithm directly applies double attention to the original input， overcomes the defect in which self－attention in the channel dimension enhances only the results of the spatial dimension of the self－attention，and makes the model more sensitive to the surrounding environment．Compared with the traditional self－ attention mechanism，parallel dual－attention can be better applied to FER tasks．

3.1 Dual-attention mechanism (DAM)

With the widespread application of sparse attention mechanisms, most researchers have reconstructed the input image in the form of

by dividing the spatial dimension into patches and adding positional coding, where

denotes the number of patches and

the number of channels. However, this can destroy the inherent two-dimensional structure of the image, so refining local features becomes an important challenge. To address this problem, the reconstructed image is divided into groups, and self-attention is computed separately within each group. Note that the channel dimension is not corrupted during reconstruction, and each channel is an abstract representation of global information. In this study, the self-attention mechanisms of the channel and spatial dimensions operate in parallel to form a dual-attention mechanism. Group-based learning is likewise applied to the channel dimension. The two attention mechanisms complement each other in acquiring local features and global information, yielding strong FER ability.

In particular, in the spatial dimension, the reconstructed image can be assumed to be divided into

groups, and each group contains

patches. The reconstructed image can be divided into multiple nonoverlapping patches using a sliding window approach, and these patches can then be grouped. The size of each patch (in pixels) depends on the desired resolution of the reconstructed image and size of the original image. For example, if the reconstructed image has dimensions

and

groups, then each patch can contain

pixels. The operational process of the overall self-attention mechanism in the spatial dimension is expressed as Eq. (1):

where

can be obtained through a linear transformation of the input.

represent the self-attention result, and

In the channel dimension, the reconstructed image can be assumed to be divided into

groups, with each group containing

channels such that

. The operational process of the overall self-attention mechanism in the channel dimension can be expressed as Eq. (2):

where

can be obtained through a linear transformation of the input.

Equations (1) and (2) illustrate that all spatial locations are considered when calculating channel-dimensional self-attention, which enables global interaction. In the subsequent ablation studies, we confirmed that channel-dimensional self-attention pays more attention to the face as a whole and to the connection between facial expressions and the surrounding environment. Spatial-dimensional self-attention is restricted to different spatial locations and completes local interactions, which makes it more sensitive to key locations, such as the eyes and mouth. The two cooperate to enhance the perception of facial expressions. To share information in a timely manner while the two dimensions are processed in parallel, we use a cross-fusion method to transfer information between the dimensions. Section 3.2 details this further.
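The contrast between local interaction in the spatial branch and global interaction in the channel branch can be illustrated with a minimal NumPy sketch. Because Eqs. (1) and (2) are not legible here, this is not the authors' exact formulation: the function names, the single-head layout, the scaling factors, and the contiguous grouping of patches and channels are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_group_attention(x, wq, wk, wv, num_groups):
    """Self-attention among patches, restricted to patches of the same group.

    x: (N, C) array of N patch tokens with C channels; N % num_groups == 0.
    """
    n, c = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv                  # linear maps of the input
    shape = (num_groups, n // num_groups, c)          # (G, N/G, C)
    q, k, v = q.reshape(shape), k.reshape(shape), v.reshape(shape)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))    # (G, N/G, N/G)
    return (attn @ v).reshape(n, c)

def channel_group_attention(x, wq, wk, wv, num_groups):
    """Self-attention among channels within each channel group.

    Every attention weight sums over all N patches, so each output
    location depends on the whole image.
    """
    n, c = x.shape
    cg = c // num_groups
    q, k, v = x @ wq, x @ wk, x @ wv

    def split(t):
        # (N, C) -> (G, N, C/G): contiguous channel groups (an assumption)
        return t.reshape(n, num_groups, cg).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    attn = softmax(q.transpose(0, 2, 1) @ k / np.sqrt(n))    # (G, C/G, C/G)
    out = v @ attn.transpose(0, 2, 1)                         # (G, N, C/G)
    return out.transpose(1, 0, 2).reshape(n, c)

rng = np.random.default_rng(0)
n_patches, channels = 16, 8
x = rng.normal(size=(n_patches, channels))
wq, wk, wv = (rng.normal(scale=0.1, size=(channels, channels)) for _ in range(3))

spatial_out = spatial_group_attention(x, wq, wk, wv, num_groups=4)
channel_out = channel_group_attention(x, wq, wk, wv, num_groups=2)

# Perturb one patch that belongs to the first spatial group and recompute.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
spatial_pert = spatial_group_attention(x_perturbed, wq, wk, wv, num_groups=4)
channel_pert = channel_group_attention(x_perturbed, wq, wk, wv, num_groups=2)
```

In this sketch, perturbing one patch changes the spatial-branch output only inside that patch's group, whereas the channel-branch output changes at every location. This mirrors the local-versus-global behaviour described above.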

3.2 Cross-fusion attention mechanism

Expression images from real-world environments are complex and diverse. People express their emotions differently, and the expressions of different emotions share many similarities. Consequently, simply applying neural networks to facial-expression images cannot accurately distinguish subtle nuances, resulting in low recognition rates. In the dual self-attention mechanism, self-attention in the spatial dimension focuses more on key regions related to facial expressions, whereas self-attention in the channel dimension focuses more on global information. Both types of information are important for FER. To better utilize information from the two dimensions, we designed a cross-fusion dual self-attention model, as shown in Fig. 2. Previous research on cross-fusion often required additional data to complete the cross-fusion between features generated from different data. In this study, we adopted a more concise implementation that adds cross-fusion directly to the computation of the two self-attention dimensions. Experiments demonstrated that the proposed design is effective.

In particular, the output

of the upper layer of the spatial dimension is mapped to two image matrices,

and

, through two linear transformations; the output

of the upper layer of the channel dimension is mapped to image matrix

through a linear transformation, before sending

, and

to the spatial dimension of the next layer for self-attention. The operation is similar in the channel dimension, and the attention result can be expressed as Eq. (3):
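The following NumPy sketch illustrates one possible reading of this step for the spatial branch. Eq. (3) and the symbols naming the three matrices are not legible here, so which matrix is taken from the other branch is an assumption: the sketch takes the query and key from the spatial output and the value from the channel output. The function name and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fused_spatial_attention(prev_spatial, prev_channel, wq, wk, wv):
    """Spatial self-attention of the next layer with one exchanged input.

    prev_spatial, prev_channel: (N, C) outputs of the previous layer.
    """
    q = prev_spatial @ wq      # first linear map of the spatial output
    k = prev_spatial @ wk      # second linear map of the spatial output
    v = prev_channel @ wv      # ASSUMPTION: the exchanged matrix is the value
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))   # (N, N) spatial weights
    return attn @ v            # channel-branch content, spatially reweighted

rng = np.random.default_rng(0)
n_patches, channels = 16, 8
spatial_feat = rng.normal(size=(n_patches, channels))
channel_feat = rng.normal(size=(n_patches, channels))
wq, wk, wv = (rng.normal(scale=0.1, size=(channels, channels)) for _ in range(3))

fused = cross_fused_spatial_attention(spatial_feat, channel_feat, wq, wk, wv)
# With an all-zero channel branch the fused output is zero as well, which
# shows that the content of the result comes from the exchanged matrix.
fused_zero = cross_fused_spatial_attention(
    spatial_feat, np.zeros_like(channel_feat), wq, wk, wv)
```

The channel branch would be handled symmetrically, with the roles of the two outputs swapped.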

3.3 Self-attention distillation

The self-attention mechanism expands the attention window to the entire image, significantly increases the computational overhead, and produces a severe smearing phenomenon, causing many redundant combinations of

and

. Excessive useless information seriously interferes with the feature extraction process, degrading model performance. This study therefore applies a self-attention distillation mechanism to the keys and values, reducing their scale in the spatial and channel dimensions. This operation extracts the dominant features to form a focused feature map for the subsequent self-attention, suppressing interference from redundant information and reducing noise. Moreover, to reduce the loss of middle- and high-frequency information caused by self-attention distillation, a residual connection is used to fuse the original

with the results of the self-attention process. The interaction between self-attention distillation and the residual connection forms an independent closed-loop operation that effectively reconstructs lost information.

In the spatial dimension, two convolution layers with a kernel size of three are constructed. The first layer maps the number of channels from a high to a low dimension, which is a dynamic process for extracting dominant features. The second layer maps the number of channels from the low dimension back to the high dimension, thereby keeping the same dimensions as the original input. Finally, max pooling is used to reduce the number of patches in the keys and values. The overall calculation process is given by Eq. (4):

where

denotes the convolution with a kernel size of three that reduces the number of channels, Conv

denotes the convolution with a kernel size of three that increases the number of channels, and MaxPool denotes max pooling.
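The effect of pooling the keys and values can be sketched in NumPy as follows. This is a simplified illustration rather than Eq. (4) itself: the two convolutions are omitted, pooling is applied along a one-dimensional token sequence instead of a two-dimensional patch grid, and the residual connection adds the layer input, which is an assumption because the fused quantity is not legible here. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def max_pool_tokens(t, stride):
    """Max-pool along the token axis: (N, C) -> (N // stride, C)."""
    n, c = t.shape
    usable = n - n % stride                 # drop a ragged tail, if any
    return t[:usable].reshape(usable // stride, stride, c).max(axis=1)

def distilled_attention(x, wq, wk, wv, stride=2):
    """Self-attention whose keys and values are reduced by max pooling."""
    q = x @ wq                                       # (N, C), not reduced
    k = max_pool_tokens(x @ wk, stride)              # (N/stride, C)
    v = max_pool_tokens(x @ wv, stride)              # (N/stride, C)
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))    # (N, N/stride)
    # Residual connection (ASSUMPTION: the layer input is added back).
    return x + attn @ v, attn

rng = np.random.default_rng(0)
n_patches, channels = 16, 8
x = rng.normal(size=(n_patches, channels))
wq, wk, wv = (rng.normal(scale=0.1, size=(channels, channels)) for _ in range(3))

out, attn = distilled_attention(x, wq, wk, wv, stride=2)
print(attn.shape)   # the score matrix has N * N/stride entries instead of N * N
```

With a stride of 2, the score matrix has half as many entries as in full self-attention, while the output keeps the input shape because the queries are not pooled.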

In the channel dimension, a convolution with a kernel size of one is first used to reduce the number of channels in a single group; subsequently, a convolution with a kernel size of three is used to strengthen the connection between channels and learn more reliable, higher-quality features. To guide the model to focus on acquiring global information, max pooling is not used in the channel dimension, so that the number of patches in each group remains unchanged. The overall calculation process is given by Eq. (5):

where

and

denote the convolution operations with kernel sizes of one and three, respectively.

3.4 Interactive learning mechanism (ILM)

Activation functions such as sigmoid and tanh involve exponentiation, which slows computation, whereas the ReLU activation function has a non-differentiable point and suffers from the neuron-death problem. To solve these problems, we designed a

continuous activation function comprising three-segment cubic polynomial curves, called a piecewise cubic polynomial (PCP) function

, to increase the feature fusion ability of the interactive learning mechanism. The construction method for

is as follows.

First, interval

is divided into three intervals by the points

, and

, and the corresponding function values on the four points

, and

are

, and

, respectively. The first derivative at point

is defined by

where

is an undetermined parameter. The first derivatives at points

and

are both zero,

. In this manner, we can construct a Hermite cubic interpolation function in the interval

:

where

This calculation process involves three undetermined parameters:

, and

. When the values of parameters

, and

are given,

and

can be calculated by the following method. To ensure that

continuous, the second derivative at point

is defined by

, such that

and

can be calculated using the system of Eq. (9):

In this construction process,

is defined as Eq. (10):

By appropriately adjusting the values of the three parameters,

, and

, the accuracy of the algorithm can be improved, thereby enhancing the feature fusion ability of the interactive learning mechanism.

Note: Because

is applied repeatedly in the model, directly calculating

using Eq. (8) incurs a high computational cost. Therefore, we write

where

, and

are coefficients obtained by simplifying Eq. (8).

We calculate

, and

before each iteration, such that

(8) can be calculated using Eq. (12):

Therefore, only three multiplications and three additions are required to calculate

, which significantly reduces the number of computations. By contrast, calculations using Eqs. (8) and (9) require four divisions, 17 multiplications, four subtractions, and three additions.
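The saving described above can be illustrated with a generic cubic Hermite segment in Python. The knots, function values, and slopes of the paper's function are not legible here, so the numbers used below (a segment on [-1, 0] rising from -0.2 to 0 with end slopes 0 and 1) are purely hypothetical, and the helper names are illustrative. The sketch only shows that a Hermite cubic can be expanded once into power-basis coefficients and then evaluated in nested (Horner) form.

```python
def hermite_cubic_coefficients(x0, x1, y0, y1, d0, d1):
    """Power-basis coefficients (a, b, c, d) of the cubic p(x) with
    p(x0) = y0, p(x1) = y1, p'(x0) = d0, and p'(x1) = d1."""
    h = x1 - x0
    slope = (y1 - y0) / h
    # Coefficients in the shifted variable t = x - x0:
    # p = y0 + d0*t + c2*t**2 + c3*t**3
    c2 = (3.0 * slope - 2.0 * d0 - d1) / h
    c3 = (d0 + d1 - 2.0 * slope) / (h * h)
    # Expand the powers of (x - x0) to obtain coefficients in x.
    a = y0 - d0 * x0 + c2 * x0**2 - c3 * x0**3
    b = d0 - 2.0 * c2 * x0 + 3.0 * c3 * x0**2
    c = c2 - 3.0 * c3 * x0
    d = c3
    return a, b, c, d

def horner(coeffs, x):
    """Evaluate a + b*x + c*x**2 + d*x**3 with 3 multiplications, 3 additions."""
    a, b, c, d = coeffs
    return a + x * (b + x * (c + x * d))

def horner_derivative(coeffs, x):
    """Evaluate the derivative b + 2*c*x + 3*d*x**2 in nested form."""
    _, b, c, d = coeffs
    return b + x * (2.0 * c + x * 3.0 * d)

# Hypothetical segment: the coefficients are computed once, before iterating.
coeffs = hermite_cubic_coefficients(-1.0, 0.0, -0.2, 0.0, 0.0, 1.0)
for x in (-1.0, -0.5, 0.0):
    print(x, horner(coeffs, x), horner_derivative(coeffs, x))
```

Once the coefficients are cached, each evaluation costs three multiplications and three additions, which matches the operation count stated above. The divisions occur only in the one-off coefficient computation.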

Based on

, this study proposes an interactive learning mechanism, as shown in Fig. 3. Global average pooling is first applied to each feature vector in the channel dimension to integrate the global spatial information; the pooled vector is then mapped through a feedforward neural network. We refer to the mapping results as implicit dynamic vectors. Finally, the feature vector and implicit dynamic vector are cross-multiplied to complete interactive learning. Global average pooling uses the mean value of each feature map to mark its importance, directly giving each channel a concrete meaning, and the pooled result is fed back to the input as weights to enhance the input features. In the interaction process, because the activation function compresses the negative region into a smaller negative interval, interactive information can be identified and non-interactive information can be suppressed to an inactive state, so that it integrates better with information from the other dimension.

Fig. 3 Interactive learning mechanism. The features extracted from the two dimensions are globally average pooled and passed through a linear layer before being multiplied with their own input. Then, they are cross-multiplied between the two dimensions.

The generation of implicit dynamic vectors and the overall operational process of the interactive learning mechanism can be expressed as Eqs. (13)-(16):

Here,

denotes the activation function; Linear, the fully connected layer; and

, global average pooling.
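One possible reading of this mechanism is sketched below in NumPy. Eqs. (13)-(16) are not legible here, so the way the two reweighted branches are combined (a sum), the use of `np.tanh` as a stand-in for the proposed activation function, and all function and parameter names are assumptions.

```python
import numpy as np

def implicit_dynamic_vector(features, weight, bias, activation):
    """Global average pooling over patches, a linear layer, then activation.

    features: (N, C); weight: (C, C); bias: (C,). Returns a (C,) vector.
    """
    pooled = features.mean(axis=0)                # global average pooling
    return activation(pooled @ weight + bias)     # implicit dynamic vector

def interactive_learning(f_spatial, f_channel, lin_s, lin_c, activation=np.tanh):
    """Cross-multiply each branch with the other branch's dynamic vector.

    lin_s and lin_c are (weight, bias) pairs. np.tanh is only a stand-in
    for the paper's activation, whose coefficients are not available here.
    """
    d_s = implicit_dynamic_vector(f_spatial, *lin_s, activation)
    d_c = implicit_dynamic_vector(f_channel, *lin_c, activation)
    # ASSUMPTION: the two cross-weighted branches are fused by addition.
    return f_spatial * d_c + f_channel * d_s

rng = np.random.default_rng(0)
n_patches, channels = 16, 8
f_s = rng.normal(size=(n_patches, channels))
f_c = rng.normal(size=(n_patches, channels))
lin_s = (rng.normal(scale=0.1, size=(channels, channels)), np.zeros(channels))
lin_c = (rng.normal(scale=0.1, size=(channels, channels)), np.zeros(channels))

fused = interactive_learning(f_s, f_c, lin_s, lin_c)

# Hand-checkable case: identity layers and identity activation.
eye = (np.eye(2), np.zeros(2))
toy = interactive_learning(np.ones((2, 2)), 2.0 * np.ones((2, 2)), eye, eye,
                           activation=lambda t: t)
```

In the hand-checkable case, the dynamic vectors are [1, 1] and [2, 2], so every entry of `toy` equals 1·2 + 2·1 = 4.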

4 Results

This section describes the datasets, experimental environment, and implementation details. Figure 4 shows some of the samples. The proposed model is then compared with methods commonly used in recent years. Subsequently, the contribution of each part of the model is investigated through ablation experiments.

Fig. 4 Partial samples from three datasets.

4.1 Datasets

We evaluated our method on three commonly used facial expression datasets: AffectNet, RAF-DB, and FERPlus. All were collected in real-world environments and are subject to varying degrees of illumination change and occlusion.

AffectNet

: AffectNet is a large in-the-wild facial expression dataset comprising over one million facial images collected from the Internet. The dataset contains eight categories. Note that AffectNet's training and test sets are extremely unbalanced.

RAF-DB [59]

: RAF-DB is a dataset of facial expressions in real-life scenarios. It comprises seven categories. The training set contains 12,271 samples, and the test set contains 3,068 samples.

FERPlus

: FERPlus relabels the mislabeled images of the original FER2013 dataset and removes its non-face images. Each image in FERPlus is labeled by multiple annotators, providing better label quality than that of the original FER2013 dataset. Similar to AffectNet, the dataset has eight categories.

4.2 Experiment details

The model was implemented using Python 3.7 and PyTorch 1.7.1. For all training cases, faces were detected using the MTCNN network. During the experiment, all images were resized to

pixels. The model was trained using an NVIDIA GTX 2080 GPU. During training, the batch size was set to 32, and the AdamW optimizer with a momentum of 0.9 and weight decay of

was used to optimize the model. During training, the model used only the cross-entropy loss function, providing it with a good generalization ability.

4.3 Results and analysis

Here, we compare the proposed model with state-of-the-art methods from recent years on the AffectNet, RAF-DB, and FERPlus datasets to demonstrate its effectiveness for FER tasks. Table 1 presents the results.

On the RAF-DB dataset, compared with EfficientFace, SCN, and other CNN-based methods, the proposed model exhibits an improvement of approximately

. Compared with DAN, AMP-Net, and other attention-based methods, the improvement is approximately

. The experiments demonstrate that the proposed model has stronger recognition capability on the RAF-DB dataset and provides a more effective solution for FER tasks. On the AffectNet dataset, the recognition accuracy of the proposed method is 63.58% (Table 1). Compared with the SCN, EfficientFace, and RAN methods, the improvement is approximately

, and compared with the DMUE, FDRL, DAN, ARM, CSGResNet, and AMP-Net methods, the improvement is approximately

. On the FERPlus dataset, the improvement is greater than

compared with the other methods.
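The margins discussed above can be recomputed directly from the accuracies reported in Table 1. The short Python sketch below does this; the dictionary layout and the function name are illustrative, and the numbers are copied from the table without any additional data.

```python
# Accuracies (%) copied from Table 1 as (RAF-DB, AffectNet, FERPlus);
# None marks an entry that the table leaves blank.
TABLE1 = {
    "SCN":           (87.03, 60.23, 89.39),
    "RAN":           (86.90, None, 89.16),
    "EfficientFace": (88.36, 59.89, None),
    "DMUE":          (89.42, None, None),
    "FDRL":          (89.47, None, None),
    "DAN":           (89.70, 62.09, None),
    "ARM":           (90.42, 61.33, None),
    "CSGResNet":     (88.59, 61.03, 88.94),
    "AMP-Net":       (89.19, 61.32, 89.37),
    "POSTER":        (92.05, 63.34, 91.62),
}
OURS = (92.78, 63.58, 92.02)
DATASETS = ("RAF-DB", "AffectNet", "FERPlus")

def margins(dataset_index):
    """Absolute gain (percentage points) of the proposed model over every
    method that reports a score on the chosen dataset."""
    return {
        name: round(OURS[dataset_index] - acc[dataset_index], 2)
        for name, acc in TABLE1.items()
        if acc[dataset_index] is not None
    }

for i, dataset in enumerate(DATASETS):
    m = margins(i)
    print(f"{dataset}: smallest gain +{min(m.values()):.2f}, "
          f"largest gain +{max(m.values()):.2f}")
```

On all three datasets the smallest gain is over POSTER, the strongest prior method listed in the table.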

Table 1 Comparison on the RAF-DB, AffectNet, and FERPlus datasets

Method	Year	RAF-DB	AffectNet	FERPlus
SCN [44]	CVPR 2020	87.03	60.23	89.39
RAN [45]	TIP 2020	86.90	-	89.16
EfficientFace [46]	AAAI 2021	88.36	59.89	-
DMUE [47]	CVPR 2021	89.42	-	-
FDRL [48]	CVPR 2021	89.47	-	-
DAN [49]	arXiv 2021	89.70	62.09	-
ARM [56]	arXiv 2021	90.42	61.33	-
CSGResNet [50]	ICASSP 2022	88.59	61.03	88.94
AMP-Net [57]	TCSVT 2022	89.19	61.32	89.37
POSTER [33]	arXiv 2022	92.05	63.34	91.62
Ours	-	92.78	63.58	92.02

To explore the performance of this model on individual facial expressions, we examined the confusion matrices on two datasets, as shown in Fig. 5. A confusion matrix gives the recognition accuracy of each expression and the proportion misclassified as other expressions; the diagonal entries are the per-expression recognition accuracies. Evidently, the happy expression is the easiest to recognize among the eight expressions, owing to its large display range. The recognition rate of happiness in the AffectNet dataset was considerably higher than that of the other expressions. Among the remaining expressions, the differences in recognition rates were small, primarily because the images in the AffectNet dataset originate from the Internet and contain many erroneous samples. In the FERPlus dataset, the recognition rate of happy was slightly lower than that of neutral, because neutral expressions have the largest sample size. Additionally, the FERPlus dataset contains the fewest disgust and contempt samples, only about one tenth the number of samples of the other expressions, and these two expressions look similar. Consequently, the recognition accuracy of disgust and contempt was substantially lower than that of the other expressions.

Transformers have received increasing attention in recent studies on FER. However, the large number of parameters they require remains a major limitation. Moreover, the number of model parameters (Params) and floating-point operations (FLOPs) must be considered for fair comparisons. One of the starting points of this study was to reduce the overall number of model parameters. To this end, we proposed grouped self-attention and self-attention distillation mechanisms. Table 2 compares the parameters of the proposed method with those of other methods, including the CVT, DMUE, TransFER [55], POSTER, and DAN models. Evidently, the proposed method uses roughly one fifth to one quarter of the parameters of CVT, DMUE, TransFER, and POSTER and about 58% of those of DAN, while maintaining the highest FER accuracy. In terms of running speed, training on the RAF-DB training set takes approximately 1 h and 20 min, and testing all expression images in the RAF-DB test set takes approximately 1 min. This gives the self-attention-based FER model a significant advantage.
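The parameter comparison can be checked with a few lines of Python using the Params column of Table 2. The variable names are illustrative, and the values are copied from the table.

```python
# Parameter counts in millions, copied from the Params column of Table 2.
PARAMS_M = {
    "CVT": 80.1,
    "DMUE": 78.4,
    "TransFER": 65.2,
    "POSTER": 71.8,
    "DAN": 28.3,
}
OURS_PARAMS_M = 16.4

# Fraction of each baseline's parameter count used by the proposed model.
ratios = {name: OURS_PARAMS_M / params for name, params in PARAMS_M.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:9s} {ratio:.1%} of its parameters")
```

The ratio is smallest relative to CVT and largest relative to DAN, the only compared model with fewer than 30 M parameters.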

Table 2 Comparing parameters and FLOPs

Method	Year	Params	FLOPs	Acc (RAF-DB)	Acc (FERPlus)
CVT [52]	arXiv 2021	80.1 M	-	87.61	88.81
DMUE [47]	CVPR 2021	78.4 M	13.4 G	89.42	-
TransFER [55]	ICCV 2021	65.2 M	15.3 G	90.91	90.83
POSTER [33]	arXiv 2022	71.8 M	15.7 G	92.05	91.62
DAN [49]	arXiv 2021	28.3 M	2.6 G	89.70	-
Ours	-	16.4 M	2.0 G	92.78	92.02

4.4 Ablation study

Performance of dual-attention and interactive-learning mechanisms: To verify the advantages of the dual-attention approach, we first used a single self-attention mechanism as the baseline and compared it with the dual-attention mechanism. We then compared the effects of interactive learning and cross-fusion dual-attention on FER. Figures 6(a)-6(d) show these differences, and Table 3 lists the results. Evidently, classification accuracy in FER tasks improves after the channel-dimension self-attention and the interactive learning mechanism are added in parallel processing, demonstrating the feasibility and effectiveness of the dual-attention mechanism. The Baseline+DAM and Baseline+DAM+ILM rows in Table 3 show that adding the interactive learning mechanism on the FERPlus and RAF-DB datasets yields an improvement of approximately

, and a

on the AffectNet dataset. To better fuse the features from the two sources, we first consider the impact of each feature on the other, realize the mutual mapping of the two features, and complete feature fusion on this basis, instead of simply adding or concatenating them. The experiments support the feasibility of this idea. In addition, we designed an activation function construction method to improve the interactive learning mechanism. A later comparison of various activation functions demonstrates the effectiveness of the proposed activation function.

Table 3 Performance of dual-attention and interactive learning mechanism

Method	AffectNet	FERPlus	RAF-DB
Baseline	59.73%	89.69%	90.37%
Baseline+DAM	60.44%	90.92%	90.88%
Baseline+DAM+ILM	63.12%	91.55%	91.63%
Baseline+cross-fusion DAM+ILM	63.58%	92.02%	92.78%

To observe the effects of the dual-attention mechanism on feature extraction and of grouping on FER, we used the t-SNE [61] algorithm to visualize the expression features of part of the test set in a two-dimensional space. Figure 7 shows the results. Evidently, because it lacks global information, the recognition ability of the single-dimensional attention mechanism was considerably lower than that of the dual-attention mechanism early in training. The advantages of the local key features emerged late in training. Because of the numerous training samples and obvious facial features, the four expressions happy, neutral, surprise, and fear were well recognized under all three conditions, which is consistent with the confusion matrices. For expressions that differ only slightly, such as disgust and contempt, spatial-dimension self-attention had difficulty separating them owing to the few training samples. After channel-dimension self-attention was added, some overlap of facial features remained for the classes with few samples; however, the separation was substantially better than with single-dimension self-attention. Moreover, grouping in the dual-attention mechanism effectively reduces the redundant connections caused by global attention and enhances the FER ability of the entire network. The grouped dual-attention mechanism classified sadness, contempt, and disgust more accurately, with the improvement most evident for sadness.

To study the effect of the fusion mechanism on the attention regions for facial expressions, we drew attention maps of the attention mechanism in the channel and spatial dimensions. As shown in Fig. 8, attention in the channel dimension focuses on the overall facial expression and its interaction with the surrounding environment, whereas attention in the spatial dimension focuses on key parts, such as the eyes and mouth. Fusion collects the key information from the two dimensions.

Fig. 6 Ablation studies on dual-attention and interactive learning mechanism.

Fig. 7 Two-dimensional feature map of some samples.

Fig. 8 Visualization of the attention mechanism.

Fig. 9 Various activation functions and their derivatives. The red curve represents the activation function; the violet curve, the first derivative; and the blue curve, the second derivative.

Analysis of the validity of the activation function: As shown in Fig. 9, the sigmoid function scales values to between 0 and 1, and its gradient never exceeds 0.25; thus, vanishing gradients are possible. The proposed activation function scales all values to between a small negative number and one, and its gradient in this interval is between 0 and 1.5. To solve the problems of neuron deactivation and the discontinuous derivative of the ReLU activation function in the negative region, we mapped the negative region to a smaller negative interval. As shown in Fig. 9, the activation function used in this study is continuously differentiable at all points. Moreover, the proposed activation function does not involve exponentiation, which improves the model's operational speed compared with the sigmoid and tanh activation functions. In terms of recognition accuracy, we compared various activation functions on the RAF-DB dataset; Table 4 presents the results. Evidently, the proposed activation function performs better in FER tasks.
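The bound on the sigmoid gradient and the zero gradient of ReLU for negative inputs can be verified numerically with the short Python sketch below. It covers only these two standard functions; the proposed activation function is not reproduced because its coefficients are not available here. The grid and the ten-layer depth are arbitrary illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU away from x = 0 (it jumps from 0 to 1 at 0)."""
    return 1.0 if x > 0 else 0.0

# Scan a grid over [-8, 8]; the largest sigmoid gradient occurs at x = 0.
grid = [i / 100.0 for i in range(-800, 801)]
max_sigmoid_grad = max(sigmoid_grad(x) for x in grid)

# Upper bound on the product of ten sigmoid gradients along a backward pass.
ten_layer_bound = max_sigmoid_grad ** 10

print(max_sigmoid_grad)                 # 0.25
print(ten_layer_bound)                  # below 1e-6
print(relu_grad(-1.0), relu_grad(1.0))  # 0.0 1.0
```

Because every factor is at most 0.25, the product over ten layers is below 1e-6, which is the vanishing-gradient risk mentioned above. The zero ReLU gradient for negative inputs is what allows a neuron to stop updating.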

Performance of self-attention distillation: To verify the effectiveness of self-attention distillation, the complete model was used as the benchmark and compared with a variant without self-attention distillation on the AffectNet and FERPlus datasets. Table 5 lists the results. Evidently, when self-attention distillation is removed, the operation window expands to the entire image, the severe smearing phenomenon increases the number of redundant information pairs, and the excessive noise harms the generalization of the entire model. When self-attention distillation and the residual connection form a complete closed loop, information communication across windows becomes easier, and the loss of middle- and high-frequency information is reduced. Therefore, self-attention distillation is worth introducing into this model. It also reduces the computational cost: in Table 5, FLOPs drop from 0.21 G to 0.14 G, while the parameter count is nearly unchanged (4.2 M vs. 4.22 M).

Table 4 Performance of the PCP and interactive learning mechanism

PCP	sigmoid	ReLU	tanh	ILM	ACC
					92.54
					92.78
					91.42
					91.78
					91.98
					92.26
					91.75
					91.97

Table 5 Performance of self-attention distillation

Method	AffectNet	FERPlus	Params	FLOPs
Without distillation			4.2 M	0.21 G
With distillation			4.22 M	0.14 G

4.5 Parameter sensitivity analysis

We conducted a sensitivity analysis of the variable parameters of the activation function on the RAF-DB dataset. During the experiment,

was set to 1.75, 1.85, and 1.95 in three cases;

, to

, and 0.21; and

, to

, and -0.27 in five cases. Figure 10 shows the results. As shown in Fig. 10, when

, the model performs poorly at

, and the performance is average for the other cases. At

, the model performs poorly at

, and at

the model is relatively stable. When

,

, the model achieves the highest accuracy.

Fig. 10 Parameter sensitivity analysis of activation functions.

5 Conclusions

In this study, a cross-fusion dual-attention network was proposed based on the spatial and channel dimensions. The local interaction of the spatial dimension refines the features, and the channel dimension provides the global receptive field. Cross-fusion attention lets the two types of self-attention features complement each other, so that the extracted features contain more effective information. The shape of the activation function can be adjusted adaptively, which increases the feature extraction and fusion abilities. The function can also be applied to other learning frameworks to improve accuracy and reduce computation time. In addition, the construction method can be extended to build a polynomial activation function of

according to the application, to increase the backpropagation ability of the activation function and improve generalization. Because the attention mechanism often incurs a high computational cost, this study applies a grouping mechanism together with self-attention distillation. Attention is divided into groups, and self-attention distillation is used within each group to reduce the spatial dimensions of $K$ and $V$, improving the self-attention mechanism and reducing the computational cost.

The self-attention mechanism is a key technology in FER. In future research, we will continue to explore more effective self-attention mechanisms based on different data features in FER tasks. We also plan to further study the relationship between the self-attention mechanism and the activation function. For specific data, we will attempt to construct a more effective activation function adaptively, with strong backpropagation ability, high continuity, and low computational cost, to obtain a stronger feature extraction ability.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62272281 and 62007017, the Special Funds for Taishan Scholars Project under Grant No. tsqn202306274, and Youth Innovation Technology Project of the Higher School in Shandong Province under Grant No. 2019KJN042.

Declaration of competing interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

1 Shandong Technology and Business University, Shandong 264005, China. E-mail: F. Zhang, zhangfan@sdtbu.edu.cn; G. Chen, 2020410018@sdtbu.edu.cn.
2 School of Information and Electrical Engineering, Ludong University, Yantai 264025, China. E-mail: hua.wang@ldu.edu.cn.
3 Shandong University, Shandong 250100, China. E-mail: czhang@sdu.edu.cn.
Manuscript received: 2023-02-21; accepted: 2023-07-19
(1) http://mohammadmahoor.com/affectnet/
(2) http://www.whdeng.cn/raf/model1.html
(3) https://www.worldlink.com.cn/osdir/ferplus.html
