DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion

Journal: ICASSP 2026 – 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: https://doi.org/10.1109/icassp55912.2026.11463648
Publication Date: 2026-04-21
Author(s): Wei Huang et al.
Primary Topic: Advanced Graph Neural Networks

Overview

This overview covers multimodal knowledge graphs (MKGs) and their incompleteness, which has motivated multimodal knowledge graph completion (MKGC) methods. Current MKGC techniques mainly use discriminative models that maximize conditional likelihood, and these often fail to capture the intricate relationships present in real-world knowledge graphs. To overcome this limitation, the authors introduce DiffusionCom, a structure-aware multimodal diffusion model that approaches MKGC from a generative perspective. The model represents the association between a (head, relation) pair and candidate tail entities as a joint probability distribution, $p((\text{head}, \text{relation}), (\text{tail}))$, and formulates completion as progressively generating this distribution from noise.
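To make the contrast between the joint and the conditional view concrete, here is a minimal sketch on a toy graph. The triples, the function names, and the count-based estimate are all assumptions for illustration; DiffusionCom learns the joint distribution with a diffusion model rather than by counting.

```python
from collections import Counter

# Hypothetical toy triples (head, relation, tail); not taken from the paper.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "France"),
    ("Paris", "located_in", "Europe"),
    ("Berlin", "capital_of", "Germany"),
]

def joint_distribution(triples):
    """Empirical joint p((head, relation), tail) estimated from triple counts."""
    counts = Counter(((h, r), t) for h, r, t in triples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def conditional_from_joint(joint, query):
    """Recover p(tail | head, relation) by normalising one slice of the joint."""
    row = {t: p for (q, t), p in joint.items() if q == query}
    mass = sum(row.values())
    return {t: p / mass for t, p in row.items()}

joint = joint_distribution(TRIPLES)
cond = conditional_from_joint(joint, ("Paris", "located_in"))
print(cond)  # {'France': 0.5, 'Europe': 0.5}
```

The joint table carries strictly more information than any single conditional slice: a discriminative model only ever fits the normalised rows, whereas a generative model has to account for how probability mass is spread across all (head, relation) queries as well.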

The authors also propose Structure-MKGformer, an adaptive, structure-aware method for multimodal knowledge representation learning that serves as the encoder for DiffusionCom. Structure-MKGformer employs a multimodal graph attention network (MGAT) to capture rich structural information and integrate it with entity representations, thereby enhancing the model’s structural awareness. Training DiffusionCom applies both a generative and a discriminative loss to the generator, while the feature extractor is optimized with the discriminative loss only. This dual-loss strategy aims to improve the performance of MKGC methods, particularly those that rely on multimodal pre-trained models, by making better use of structural information.
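The loss routing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the generative term is assumed to be a DDPM-style mean squared error on predicted noise, the discriminative term is assumed to be a softmax cross-entropy over candidate tails, and `gen_weight` is a hypothetical balancing factor that the summary does not mention.

```python
import math

def generative_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise (assumed objective)."""
    return sum((p - q) ** 2 for p, q in zip(predicted_noise, true_noise)) / len(true_noise)

def discriminative_loss(scores, target_index):
    """Cross-entropy of the gold tail under a softmax over candidate scores."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[target_index]

def losses_per_module(predicted_noise, true_noise, scores, target_index, gen_weight=1.0):
    """Route both terms to the generator and only the discriminative term
    to the feature extractor. gen_weight is hypothetical, not from the paper."""
    l_gen = generative_loss(predicted_noise, true_noise)
    l_disc = discriminative_loss(scores, target_index)
    return {
        "generator": gen_weight * l_gen + l_disc,
        "feature_extractor": l_disc,
    }

losses = losses_per_module([0.1, -0.2], [0.0, 0.0], [2.0, 0.5, -1.0], target_index=0)
print(losses)
```

In an autodiff framework the same routing would typically be achieved by blocking the gradient of the generative term before it reaches the feature extractor; the dictionary here only shows which module is trained on which quantity.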

Introduction

The introduction discusses the significance of knowledge graphs (KGs) and their multimodal extensions (MKGs) in representing real-world data as factual triples. MKGs enrich knowledge representation by integrating diverse modalities such as text and images, but they suffer from incompleteness, which stems from limited multimodal corpora and the complexity of relationships among entities. To tackle this, the paper addresses the task of multimodal knowledge graph completion (MKGC), which uses multimodal data to improve entity embeddings and fill in missing information.

The authors highlight two primary challenges in MKGC. First, existing models that optimize conditional probability distributions struggle to capture the complex multimodal data distributions found in real-world scenarios. Second, many multimodal pre-trained Transformer (MPT) models fail to fully exploit the structural information embedded in knowledge graphs. To address these challenges, the paper proposes a structure-aware multimodal diffusion model, DiffusionCom, which frames MKGC as a joint distribution generation problem. The model incorporates a new encoder, Structure-MKGformer, which uses a multimodal graph attention network (MGAT) to capture fine-grained structural relationships. Experimental results show that DiffusionCom outperforms existing state-of-the-art methods, achieving a 38.2% relative improvement in Hits@1 on the FB15k-237-IMG dataset.
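For readers unfamiliar with the metric, the sketch below shows how Hits@k and a relative improvement are usually computed. The ranks and the helper names are hypothetical and are not results from the paper.

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose gold entity is ranked within the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a percentage."""
    return 100.0 * (new - baseline) / baseline

# Hypothetical ranks of the correct tail entity for five test queries.
ranks = [1, 3, 1, 12, 2]
print(hits_at_k(ranks, 1))   # 0.4
print(hits_at_k(ranks, 10))  # 0.8
```

A relative improvement is measured against the baseline score rather than in absolute points, so a 38.2% relative gain means the new Hits@1 is 1.382 times the baseline Hits@1.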

Methods

In this section, the authors outline their methodology, beginning with a formal definition of the multimodal knowledge graph completion (MKGC) task. They then describe the overall pipeline of the proposed model, DiffusionCom, break it down into its constituent modules, and give implementation details for each. This structure makes it clear how the model operates and how it interacts with the data.
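One common way to state the task formally is sketched below. The symbols ($\mathcal{G}$, $\mathcal{E}$, $\mathcal{R}$, $\mathcal{T}$, $f_\theta$) are assumed notation for illustration and may differ from the paper's own definition.

$$
\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T}), \qquad \mathcal{T} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E},
$$

where $\mathcal{E}$ is the entity set, $\mathcal{R}$ the relation set, $\mathcal{T}$ the set of observed triples, and each entity $e \in \mathcal{E}$ may carry a textual description and one or more images. For a query $(h, r, ?)$ with a missing tail, completion amounts to ranking all candidates by a learned score:

$$
\hat{t} = \arg\max_{t \in \mathcal{E}} f_\theta(h, r, t).
$$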

The authors also describe the training strategy for the model, including the design of its loss function, which is central to optimizing performance. This methodology lays the foundation for the subsequent experimental setup, where the effectiveness of the proposed model is evaluated.

Discussion

This section discusses advances in multimodal knowledge graph completion (MKGC), tracing the evolution of architectures from non-Transformer models, such as TransE and ComplEx, to Transformer-based approaches that build on pre-trained models like VisualBERT and ViLBERT. DiffusionCom marks a significant shift: it employs a denoising diffusion probabilistic model (DDPM) to reformulate entity prediction as modeling a joint probability distribution. This approach aims to capture complex relationships in multimodal data, addressing a limitation of existing discriminative models, which struggle with intricate relational tasks.
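As background on the DDPM component, here is a minimal sketch of the standard forward (noising) process, in which a clean vector $x_0$ is corrupted in closed form as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The linear schedule, its endpoint values, the step count, and the one-hot target are assumptions chosen for illustration; the summary does not state which schedule or target representation DiffusionCom uses.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in closed form.
    Returns the noised vector and the noise that was injected."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * n for v, n in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [1.0, 0.0, 0.0]  # hypothetical clean target, e.g. one-hot over three candidate tails
xt, noise = q_sample(x0, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])
```

By the last step almost none of the original signal remains, which is why generation can start from pure noise and rely on a learned denoiser to run the process in reverse.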

DiffusionCom integrates two main components: Structure-MKGformer, which uses a multimodal graph attention network to enhance entity representations through structural awareness, and a Conditional Denoiser (CDenoiser), which progressively denoises its input to generate accurate predictions. The framework outperforms 13 state-of-the-art models on two benchmark datasets, FB15k-237-IMG and WN18-IMG, with significant improvements in Hits@1, Hits@3, and Hits@10. Ablation studies further validate each component, confirming that structural information, conditional guidance, and the dual-loss optimization strategy together enhance the model’s capabilities in multimodal knowledge graph completion.
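To illustrate what attention over graph neighbors does in general, the sketch below aggregates neighbor embeddings with softmax weights. It is a deliberate simplification and not the paper's MGAT: it uses a single head, scaled dot-product scores instead of learned attention parameters, no projection matrices, and no modality-specific handling. All names and vectors are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_over_neighbors(entity, neighbors):
    """Mix the entity with its neighbors, weighted by scaled dot-product attention.
    The entity itself is included as a candidate (a self-loop)."""
    candidates = [entity] + neighbors
    scale = math.sqrt(len(entity))
    weights = softmax([dot(entity, c) / scale for c in candidates])
    mixed = [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(len(entity))]
    return mixed, weights

entity = [1.0, 0.0]
neighbors = [[0.9, 0.1], [-1.0, 0.0]]  # one similar neighbor, one dissimilar
mixed, weights = attend_over_neighbors(entity, neighbors)
print(weights)
```

The similar neighbor receives more weight than the dissimilar one, so the updated representation reflects graph structure selectively rather than averaging all neighbors equally.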