RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking

Journal: Companion Proceedings of the ACM Web Conference 2026
DOI: https://doi.org/10.1145/3774905.3796483
Publication Date: 2026-05-28
Author(s): Shuo Yang et al.
Primary Topic: Misinformation and Its Impacts

Overview

The paper introduces RAMA, a retrieval-augmented multi-agent framework for verifying multimodal misinformation. RAMA addresses the challenges posed by ambiguous claims and insufficient context through three key innovations: (1) strategic query formulation that converts multimodal claims into precise web search queries, (2) aggregation of cross-verification evidence from diverse authoritative sources, and (3) a multi-agent ensemble architecture that draws on the strengths of several multimodal large language models and prompt variants. Experimental results indicate that RAMA significantly outperforms traditional text-only and single-agent systems, particularly on ambiguous or improbable claims, because it grounds verification in retrieved factual evidence.

In conclusion, RAMA demonstrates the effectiveness of combining advanced vision-language models with a dedicated WebRetriever module, which allows robust verification decisions based on current external evidence. The evaluation, including ablation studies and qualitative analyses, confirms RAMA’s advantage in scenarios where visual cues may be misleading. Future research will aim to extend RAMA’s retrieval to multilingual and cross-platform evidence and to explore adaptive ensemble strategies that improve its robustness. Overall, RAMA represents a significant advance in automated misinformation detection, offering a path toward more reliable and explainable multimodal verification.

Introduction

The introduction of the paper highlights the dual impact of the digital age on information dissemination, emphasizing the rapid spread of misinformation alongside the benefits of knowledge acquisition. It identifies critical domains such as healthcare, politics, and economics as particularly vulnerable to the consequences of misinformation, which can lead to harmful practices, economic instability, and diminished public trust. Social media platforms are noted as significant contributors to this issue due to their facilitation of content creation and rapid propagation of misleading information.

The paper focuses on two major types of deceptive media: deepfakes and cheapfakes. Cheapfakes are the greater concern because they are easy to create and hard to spot, particularly when authentic images are reused out of context (OOC). Traditional detection methods struggle with OOC content, which has prompted a shift toward deep learning approaches; these, however, face challenges in generalization and interpretability. To address these limitations, the authors propose the Retrieval-Augmented Multi-Agent (RAMA) framework, which incorporates a WebRetriever module for evidence collection and a multi-agent system for evaluating the consistency of multimodal claims. The approach aims to make misinformation detection more robust and scalable, and the authors report superior performance in real-world scenarios through collaborative reasoning and evidence-driven decision-making.

Methods

RAMA (Retrieval-Augmented Multi-Agent Framework) is a multi-agent system for detecting misinformation in multimodal content. It draws inspiration from human fact-checking and combines retrieval-augmented verification with collaborative reasoning among multiple agents to evaluate the authenticity of images together with their accompanying textual claims. The framework consists of three interconnected modules that operate in a cascade.

The first module, WebRetriever, processes multimodal inputs and utilizes web-based tools to conduct evidence retrieval, ensuring access to current external information. The second module, VLJudge, engages in cross-modal verification and employs multi-agent reasoning to assess the consistency between images and text, ultimately producing interpretable decisions. The final module, DecisionFuser, consolidates the evaluations from all agents, thereby enhancing the system’s robustness and scalability in addressing misinformation.
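The cascade described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Claim` and `Verdict` types, the `Retriever`, `Judge`, and `Fuser` interfaces, and the stub functions are all hypothetical names introduced here. The stub retriever simply echoes the caption instead of searching the web, and the keyword judges stand in for vision-language models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Claim:
    image_ref: str  # path or URL of the image under review
    caption: str    # textual claim attached to the image


@dataclass
class Verdict:
    agent: str            # which judge produced this verdict
    out_of_context: bool  # True if the judge flags the image-caption pair
    rationale: str        # short human-readable justification


# Hypothetical interfaces, not taken from the paper.
Retriever = Callable[[Claim], List[str]]
Judge = Callable[[Claim, List[str]], Verdict]
Fuser = Callable[[List[Verdict]], bool]


def run_pipeline(claim: Claim, retrieve: Retriever, judges: List[Judge],
                 fuse: Fuser) -> Tuple[bool, List[Verdict]]:
    """Run the three stages in order: retrieve, judge, fuse."""
    evidence = retrieve(claim)                               # WebRetriever role
    verdicts = [judge(claim, evidence) for judge in judges]  # VLJudge role
    return fuse(verdicts), verdicts                          # DecisionFuser role


def stub_retriever(claim: Claim) -> List[str]:
    """Stand-in for web search: echoes the caption as one evidence snippet."""
    return [f"Archived news report: {claim.caption}"]


def make_keyword_judge(name: str, keyword: str) -> Judge:
    """Stand-in for a vision-language model: checks evidence for a keyword."""
    def judge(claim: Claim, evidence: List[str]) -> Verdict:
        found = any(keyword.lower() in snippet.lower() for snippet in evidence)
        status = "found" if found else "missing"
        return Verdict(name, not found, f"keyword {keyword!r} {status} in evidence")
    return judge


def simple_majority(verdicts: List[Verdict]) -> bool:
    """Flag the claim only if more than half of the judges flag it."""
    flagged = sum(v.out_of_context for v in verdicts)
    return flagged * 2 > len(verdicts)


if __name__ == "__main__":
    claim = Claim("flood.jpg", "Flooding in Venice, November 2019")
    judges = [make_keyword_judge("judge_a", "venice"),
              make_keyword_judge("judge_b", "flooding"),
              make_keyword_judge("judge_c", "paris")]
    flagged, verdicts = run_pipeline(claim, stub_retriever, judges, simple_majority)
    print("out of context:", flagged)
    for v in verdicts:
        print(f"  {v.agent}: {v.rationale}")
```

Keeping each stage behind a plain callable means a real search client or model could replace a stub without changing `run_pipeline`, and every verdict carries a rationale so the final decision can be traced back to individual agents.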

Results

The results section evaluates RAMA against state-of-the-art methods on a public test set, using accuracy and F1-score as the key metrics. Notably, prior approaches such as [26, 40, 52, 53] used both provided captions for decision-making, whereas RAMA, in line with the competition guidelines, relied solely on caption1. Nguyen et al. [34] reported the highest accuracy, 0.930, with an F1-score of 0.926. However, their method declined sharply on the 2024 private test set, to an accuracy of 0.655 and an F1-score of 0.5174, which suggests overfitting and weak generalization.
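For readers less familiar with the two metrics, the sketch below computes accuracy and F1-score from confusion-matrix counts. The function name and the counts in the usage example are illustrative assumptions, not figures from the paper; the positive class is assumed to be "out of context".

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (accuracy, F1) for a binary classifier from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN); it is defined as 0 when there are no positives.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Illustrative counts only, chosen to show that F1 can fall below accuracy
    # when many positive samples are missed.
    print(accuracy_and_f1(tp=20, fp=10, fn=30, tn=40))
```

Because F1 ignores true negatives, a model that misses many out-of-context samples can keep a moderate accuracy while its F1-score drops further, which is the pattern to watch for when a method is moved to an unseen test set.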

In contrast, RAMA achieved competitive results without any training or fine-tuning, recording an accuracy of 0.910 and an F1-score of 0.910. Because the framework grounds multimodal verification in external evidence, it is less sensitive to dataset distribution and generalizes better to unseen samples. This positions RAMA as a robust alternative among multimodal verification methods.

Discussion

The discussion section highlights the growing challenge of misinformation on social media platforms, emphasizing the need for advanced detection technologies that leverage artificial intelligence (AI) and machine learning (ML). Traditional methods primarily analyze content at several levels (word, sentence, and broader text representation) but struggle with out-of-context (OOC) misinformation, where authentic images are misrepresented by misleading text. Recent detection methods take multimodal approaches, using pretrained models such as CLIP and external knowledge bases to improve detection accuracy. Despite these improvements, challenges remain, particularly in adapting to dynamic real-world contexts and avoiding overfitting to synthetic data.

The paper also introduces a novel multimodal fact-checking framework, RAMA, which integrates a WebRetriever module for contextual evidence gathering and a VLJudge module for consistency assessment between visual and textual information. This framework employs an ensemble of multiple agents to evaluate claims, enhancing reliability through majority voting and weighted decision-making strategies. Experimental results demonstrate RAMA’s superior performance compared to traditional methods, underscoring the importance of external evidence and ensemble reasoning in effectively detecting misinformation. The authors propose future enhancements to RAMA’s capabilities, aiming to support multilingual evidence retrieval and adaptive ensemble strategies, thereby advancing the field of automated misinformation detection and verification.
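The two fusion strategies mentioned above, majority voting and weighted decision-making, can be illustrated as follows. This is a sketch under stated assumptions: the function names, the tie-breaking rule, the 0.5 threshold, and the example weights are hypothetical, since the summary does not specify how the authors weight or combine agents.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True if strictly more than half of the agents vote True (ties give False)."""
    return sum(votes) * 2 > len(votes)


def weighted_vote(votes: Sequence[bool], weights: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """True if the weight share of agents voting True exceeds the threshold."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    share = sum(w for v, w in zip(votes, weights) if v) / total
    return share > threshold


if __name__ == "__main__":
    votes = [False, True, True]  # three agents; True means "out of context"
    weights = [3, 1, 1]          # assumed weights, e.g. from validation accuracy
    print(majority_vote(votes))           # two of three agents vote True
    print(weighted_vote(votes, weights))  # the heavily weighted agent disagrees
```

The example shows where the two strategies diverge: a single highly trusted agent can overrule a numerical majority under weighting, which is useful only if the weights reflect measured reliability.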