Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Venue: ICASSP 2026 – 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: https://doi.org/10.1109/icassp55912.2026.11462517
Publication Date: 2026-04-21
Author(s): Jiaxin Liu et al.
Primary Topic: Generative Adversarial Networks and Image Synthesis

Overview

Recent advances in deepfake technology, particularly in digital human generation, pose new challenges for detection. The authors introduce DigiFakeAV, a multimodal forgery dataset of 60,000 videos produced by five leading digital human models. The dataset is diverse in nationality, skin tone, gender, and scenario. Existing detection methods show significant performance declines when evaluated on DigiFakeAV, which indicates how difficult the dataset is.

To address these challenges, the authors propose DigiShield, a baseline detection model that applies spatiotemporal and cross-modal fusion to visual and audio features. DigiShield achieves state-of-the-art performance on DigiFakeAV and generalizes well to other datasets. Together, DigiFakeAV and DigiShield are intended to improve deepfake detection methods and stimulate further research in this area.

Introduction

The introduction addresses the escalating threats posed by deepfake technology, particularly political disinformation and financial fraud. It argues that existing deepfake detection benchmarks are limited by outdated synthesis methods and inadequate datasets. The authors group these datasets into three generations; the latest focuses on face-swapping techniques, which produce homogenized artifacts and generalize poorly to new threats. Current datasets therefore fail to capture the multimodal realism and temporal coherence of modern digital forgeries, so detection systems show saturated benchmark performance yet remain fragile in real-world applications.

To address these challenges, the authors introduce DigiFakeAV, a large-scale multimodal benchmark for detecting diffusion-based digital human forgeries. DigiFakeAV incorporates three key innovations: (1) it uses five state-of-the-art diffusion models to generate 60,000 high-resolution videos, achieving photorealistic detail without strict source-target similarity; (2) it ensures multimodal consistency through improved audiovisual synchronization; and (3) it features scene-aware diversity, which broadens demographic representation and reduces bias. Experimental results reveal significant vulnerabilities in existing detectors: state-of-the-art models suffer substantial performance drops on DigiFakeAV. In response, the authors propose DigiShield, a multimodal framework that combines cross-modal attention with spatiotemporal feature extraction to identify subtle inconsistencies in deepfakes, establishing a new baseline for detecting next-generation forgeries.
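
The cross-modal attention idea mentioned above can be illustrated with a minimal sketch. This is not DigiShield's actual architecture, which this summary does not specify. It assumes plain scaled dot-product attention in which each visual frame feature queries all audio frame features, omits learned projection matrices, and uses a hypothetical function name, cross_modal_attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual, audio):
    """Let each visual feature attend over all audio features.

    visual: list of T_v vectors used as queries.
    audio:  list of T_a vectors used as both keys and values.
    All vectors share the same dimension d. Learned projections are
    omitted (an assumption made to keep the sketch short).
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in visual:
        # Scaled dot-product similarity between this query and every audio key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio]
        weights = softmax(scores)
        # Attention-weighted average of the audio values.
        context = [sum(w * v[j] for w, v in zip(weights, audio)) for j in range(d)]
        # Concatenate the visual feature with its audio context.
        fused.append(list(q) + context)
    return fused
```

Each output vector has dimension 2d: the original visual feature followed by the audio context it attended to. A downstream classifier could then look for mismatches between the two halves.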

Discussion

The Discussion section details how the DigiFakeAV dataset was constructed. The pipeline begins by collecting data from two existing high-definition video datasets, HDTF and CelebV-HQ, and then applies state-of-the-art video and audio synthesis techniques. Specifically, the authors use five diffusion-based video generation methods and one audio synthesis method, six synthesis techniques in total, so the dataset supports multimodal forgeries. The dataset is organized into four categories based on the authenticity of the video and the audio, with a focus on demographic diversity and realistic scenarios.
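
The four-category organization can be sketched as follows, assuming the categories are the four combinations of real or fake video with real or fake audio. The enum name, the short codes, and the helper functions are hypothetical and are not taken from the paper.

```python
from enum import Enum


class AVCategory(Enum):
    # Short codes are hypothetical labels chosen for this sketch.
    REAL_VIDEO_REAL_AUDIO = "RV-RA"
    REAL_VIDEO_FAKE_AUDIO = "RV-FA"
    FAKE_VIDEO_REAL_AUDIO = "FV-RA"
    FAKE_VIDEO_FAKE_AUDIO = "FV-FA"


def categorize(video_is_real: bool, audio_is_real: bool) -> AVCategory:
    """Map the authenticity of each modality to one of four categories."""
    if video_is_real:
        if audio_is_real:
            return AVCategory.REAL_VIDEO_REAL_AUDIO
        return AVCategory.REAL_VIDEO_FAKE_AUDIO
    if audio_is_real:
        return AVCategory.FAKE_VIDEO_REAL_AUDIO
    return AVCategory.FAKE_VIDEO_FAKE_AUDIO


def is_forgery(category: AVCategory) -> bool:
    """A sample counts as forged if at least one modality is fake."""
    return category is not AVCategory.REAL_VIDEO_REAL_AUDIO
```

Under this reading, a binary detector treats three of the four categories as positives, which is one reason both modalities have to be inspected.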

The authors also introduce DigiShield, a multimodal deepfake detection baseline that uses audio-visual cues. DigiShield extracts features with a spatiotemporal two-stream pipeline, then applies cross-modal fusion and classification. The model is trained with a combination of contrastive and cross-entropy losses to capture inconsistencies that indicate forgery. In the reported experiments, DigiShield significantly outperforms existing detection methods on DigiFakeAV, reaching an Area Under the Curve (AUC) of 80.1%, which highlights the importance of spatiotemporal and multimodal analysis for detecting diffusion-based digital forgeries. Overall, the findings underscore the need for new approaches to deepfake detection and the potential of DigiFakeAV and DigiShield to drive future research in this domain.
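
One way to combine a contrastive loss with a cross-entropy loss is sketched below. The summary does not give the paper's exact formulation, so everything here is an assumption: the margin-based contrastive term on paired audio and visual embeddings, the binary cross-entropy on a predicted fake probability, and the hypothetical parameters weight and margin.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(visual_emb, audio_emb, is_fake, margin=1.0):
    """Margin-based contrastive term (assumed form, not the paper's).

    Genuine pairs are pulled together; forged pairs are pushed apart
    until their distance reaches the margin.
    """
    d = euclidean(visual_emb, audio_emb)
    if is_fake:
        return max(0.0, margin - d) ** 2
    return d ** 2


def binary_cross_entropy(p_fake, is_fake, eps=1e-7):
    """Cross-entropy between the predicted fake probability and the label."""
    p = min(max(p_fake, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_fake else -math.log(1.0 - p)


def total_loss(visual_emb, audio_emb, p_fake, is_fake, weight=0.5, margin=1.0):
    """Weighted sum of both terms; the weight is a hypothetical hyperparameter."""
    ce = binary_cross_entropy(p_fake, is_fake)
    con = contrastive_loss(visual_emb, audio_emb, is_fake, margin)
    return ce + weight * con
```

In this sketch the contrastive term shapes the embedding space so that mismatched audio and video drift apart, while the cross-entropy term trains the final real-versus-fake decision.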