Multimodal sentiment analysis based on multi-layer feature fusion and multi-task learning

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-85859-6
PMID: https://pubmed.ncbi.nlm.nih.gov/39821109
Publication Date: 2025-01-16
Author(s): Yujian Cai et al.
Primary Topic: Sentiment Analysis and Opinion Mining

Overview

This section discusses advances in multimodal sentiment analysis (MSA), which integrates several modalities, such as text, facial expressions, and speech, to improve the prediction of human emotions. Key challenges in MSA include extracting emotional information effectively from individual modalities, producing stable predictions when the modalities disagree, and maintaining high accuracy when data is incomplete. Traditional approaches often fail to account for the interactions between unimodal and multimodal information, which leads to suboptimal performance, particularly when the representations are asymmetric.

To address these challenges, the paper introduces two components: the Unimodal Feature Extraction Network (UFEN) and the Multi-Task Fusion Network (MTFN). The UFEN strengthens the representation of unimodal features, while the MTFN improves the correlation and fusion of information across modalities. The model combines multi-layer feature extraction, attention mechanisms, and Transformer architectures to uncover complex relationships among features. Experimental results on the MOSI, MOSEI, and SIMS datasets indicate that the approach outperforms existing state-of-the-art methods, although the authors acknowledge ongoing challenges in achieving robust sentiment perception, particularly in noisy environments. Future improvements may involve transfer learning and semi-supervised learning techniques to adapt to more dynamic sentiment analysis tasks.

Methods

In this section, the authors detail their proposed method for multimodal sentiment analysis, beginning with a definition of the task and the relevant notation. They introduce the model architecture, which includes the Unimodal Feature Extraction Network (UFEN) and the Multi-Task Fusion Network (MTFN), alongside a Transformer module that enhances the model’s capabilities. The section concludes with an outline of the multi-task learning approach and the overall optimization objectives used to train the model.
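As a rough illustration of a multi-task objective of the kind outlined above, the sketch below adds one multimodal regression loss to three unimodal auxiliary losses. The L1 loss, the task names, the weights, and the toy scores are all assumptions made for illustration; this summary does not state the paper's actual loss terms or weighting.

```python
def l1_loss(preds, targets):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def multitask_loss(preds, targets, weights):
    """Weighted sum of per-task losses.

    preds and targets map a task name to a list of sentiment scores.
    """
    return sum(weights[task] * l1_loss(preds[task], targets[task])
               for task in preds)


# Hypothetical batch of two samples. "m" is the fused (multimodal) task;
# "t", "a" and "v" are text, audio and visual auxiliary tasks.
preds = {"m": [0.5, -1.0], "t": [1.0, -1.0], "a": [0.0, 0.0], "v": [0.5, -0.5]}
targets = {"m": [1.0, -1.0], "t": [1.0, -1.0], "a": [1.0, -1.0], "v": [1.0, -1.0]}
weights = {"m": 1.0, "t": 0.5, "a": 0.5, "v": 0.5}  # assumed, not from the paper

total = multitask_loss(preds, targets, weights)
print(total)
```

In this toy batch every task shares the same targets. On a dataset with separate unimodal annotations, each auxiliary task could be given its own target list without changing the function.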

To validate the effectiveness of their method, the authors compare it against several established baselines in multimodal sentiment analysis. These include Early Fusion LSTM (EF-LSTM), which employs feature-level fusion with Bi-LSTM networks; Late Fusion Deep Neural Network (LF-DNN), which integrates PCA for feature fusion; Tensor Fusion Network (TFN), which models interactions among modalities through a tensor fusion layer; Low-rank Multimodal Fusion (LMF), which enhances efficiency in multimodal representation; Multimodal Transformer (MulT), which utilizes cross-modal attention; Modality-Invariant and Specific Representations (MISA), which learns commonalities and private features across modalities; and Self-Supervised Multi-task Multimodal (Self-MM), which incorporates a self-supervised learning strategy for unimodal supervision.
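To make the difference between feature-level fusion and tensor fusion concrete, here is a minimal sketch. It uses tiny hand-written feature vectors and leaves out the learned networks around the fusion step; the function names are hypothetical. The tensor fusion step follows the published TFN formulation, which appends a constant 1 to each unimodal vector before taking the three-way outer product.

```python
def early_fusion(text, audio, visual):
    """Feature-level fusion: concatenate the unimodal vectors."""
    return text + audio + visual


def tensor_fusion(text, audio, visual):
    """Three-way outer product of the 1-augmented unimodal vectors."""
    t, a, v = [1.0] + text, [1.0] + audio, [1.0] + visual
    return [[[ti * ai * vi for vi in v] for ai in a] for ti in t]


# Illustrative feature vectors, not taken from any dataset.
text, audio, visual = [2.0, 3.0], [5.0], [7.0]

concat = early_fusion(text, audio, visual)
fused = tensor_fusion(text, audio, visual)

print(concat)                                       # [2.0, 3.0, 5.0, 7.0]
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 2 2
print(fused[1][0][0], fused[0][1][1], fused[1][1][1])  # 2.0 35.0 70.0
```

The constant 1 keeps unimodal and bimodal terms in the fused tensor alongside the trimodal products. The tensor size is the product of the augmented dimensions, which is the cost that low-rank approaches such as LMF aim to reduce.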

Results

The Results section reports model performance on three datasets: SIMS, MOSI, and MOSEI. The evaluation metrics are binary accuracy (Acc-2), multi-class accuracy (Acc-3, Acc-5, Acc-7), F1-score, mean absolute error (MAE), and the Pearson correlation coefficient (Corr). Acc-2 is reported in two forms, negative/non-negative and negative/positive, with the two values set apart by a separator. Acc-x measures how often a prediction falls into the same one of x sentiment intervals as its label; it is computed with the indicator function \( I(\cdot) \), and higher values indicate better classification.

Among the compared methods, Multimodal Mutual Information Maximization (MMIM) reduces the loss of task-related information by maximizing mutual information among unimodal representations. The study also includes ablation experiments and visualization analyses to substantiate the efficacy of the proposed method, which outperforms the baseline models on these datasets. Overall, the findings suggest that the proposed approach improves the reported sentiment analysis metrics.
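The evaluation metrics named in this section can be sketched in plain Python as follows. The sketch assumes MOSI-style labels in [-3, 3] and a convention commonly used on these benchmarks: Acc-7 compares clipped, rounded scores, and the two Acc-2 variants differ in whether zero-labelled samples are kept. The function names and toy data are illustrative and not taken from the paper; F1 is left out for brevity.

```python
import math


def mae(preds, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def pearson(preds, labels):
    """Pearson correlation coefficient."""
    n = len(labels)
    mp, my = sum(preds) / n, sum(labels) / n
    cov = sum((p - mp) * (y - my) for p, y in zip(preds, labels))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sp * sy)


def to_class(score, limit):
    """Clip to [-limit, limit] and round to the nearest integer class."""
    return round(max(-limit, min(limit, score)))


def acc_7(preds, labels):
    """Seven-class accuracy over the integer classes -3..3 (assumed convention)."""
    hits = [to_class(p, 3) == to_class(y, 3) for p, y in zip(preds, labels)]
    return sum(hits) / len(hits)


def acc_2_non_negative(preds, labels):
    """Negative vs. non-negative: zero counts as non-negative."""
    hits = [(p >= 0) == (y >= 0) for p, y in zip(preds, labels)]
    return sum(hits) / len(hits)


def acc_2_positive(preds, labels):
    """Negative vs. positive: samples with a zero label are dropped."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    hits = [(p > 0) == (y > 0) for p, y in pairs]
    return sum(hits) / len(hits)


# Toy scores for four utterances (made up for illustration).
labels = [2.0, -1.0, 0.0, 3.0]
preds = [1.8, -0.2, -0.4, 2.6]

print(acc_7(preds, labels))               # 0.75
print(acc_2_non_negative(preds, labels))  # 0.75
print(acc_2_positive(preds, labels))      # 1.0
print(mae(preds, labels), pearson(preds, labels))
```

The third sample shows why the two Acc-2 variants can differ: its label is exactly zero, so it counts as an error under negative/non-negative but is excluded under negative/positive.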

Discussion

In the discussion section, the authors review existing literature on multimodal sentiment analysis (MSA) and multi-task learning, highlighting the evolution from unimodal approaches focused primarily on text to more sophisticated multimodal methods that integrate audio and visual data. They emphasize the importance of representation learning and multimodal fusion, noting strategies such as adversarial networks, attention mechanisms, and graph neural networks that have been used to improve sentiment prediction accuracy. The authors also point out that while deep learning has significantly improved performance in sentiment analysis, challenges remain in model interpretability and generalization across domains.

The paper introduces a model that combines a Unimodal Feature Extraction Network (UFEN) and a Multi-Task Fusion Network (MTFN) to extract and fuse features from different modalities (text, audio, and visual). It uses multi-task learning to improve generalization and reduce model complexity. The proposed framework aims to balance intra-modal feature extraction with inter-modal fusion, using cross-modal attention and an encoder-decoder architecture to enhance the representation of sentiment features. The authors argue that their approach addresses limitations of previous methods by allowing dynamic weighting of modalities and by optimizing the model through representations shared across related tasks, which leads to improved sentiment analysis results.
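The cross-modal attention mentioned above can be illustrated with a minimal sketch of scaled dot-product attention in which the queries come from one modality and the keys and values from another. Learned projection matrices, multiple heads, and the encoder-decoder wiring are omitted, and the feature values are made up, so this shows the general mechanism and not the paper's exact module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries come from the target modality; keys and values come from
    the source modality. Learned projections are omitted (assumption).
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


# Hypothetical features: 2 text steps attend over 3 audio steps (dimension 2).
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attended, weights = cross_modal_attention(text, audio, audio)
for row in weights:
    print([round(w, 3) for w in row])
```

Because the attention weights are recomputed for every input, each text step can lean more or less on individual audio steps. This is one way to read the dynamic weighting of modalities described in the discussion.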