Video deepfake detection using a hybrid CNN-LSTM-Transformer model for identity verification

Journal: Multimedia Tools and Applications, Volume: 84, Issue: 33
DOI: https://doi.org/10.1007/s11042-024-20548-6
Publication Date: 2025-03-25
Author(s): Γεώργιος Πετμεζάς et al.
Primary Topic: Digital Media Forensic Detection

Overview

The proliferation of deepfake technology presents significant challenges due to its potential for misuse in creating convincing manipulated videos. This paper introduces a novel approach for video deepfake detection that combines 3D Morphable Models (3DMMs) with a hybrid CNN-LSTM-Transformer architecture. This model enhances detection accuracy by utilizing 3DMMs for detailed facial feature extraction, a CNN for spatial analysis, an LSTM for short-term temporal dynamics, and a Transformer for long-term dependencies. The proposed system employs an identity verification method, comparing test videos with genuine reference footage, and has been trained on the VoxCeleb2 dataset, demonstrating superior performance across various video qualities, compression levels, and manipulation types.
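The identity-verification step, in which a test video is compared with genuine reference footage, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the use of the minimum pairwise Euclidean distance as the score, and the threshold value are all assumptions made for illustration, and the embeddings are made-up numbers standing in for the model's output.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify_identity(test_embeddings, reference_embeddings, threshold):
    """Compare test clips against genuine reference clips of one person.

    Assumption: the score is the smallest distance between any test clip
    and any reference clip; a score above `threshold` flags the video.
    """
    score = min(
        euclidean(t, r)
        for t in test_embeddings
        for r in reference_embeddings
    )
    return score <= threshold, score


# Hypothetical 2-D embeddings (real embeddings would be much larger).
references = [[0.0, 1.0], [0.1, 0.9]]
genuine_clip = [[0.05, 0.95]]
suspect_clip = [[1.0, 0.0]]

is_genuine, _ = verify_identity(genuine_clip, references, threshold=0.5)
is_suspect_genuine, _ = verify_identity(suspect_clip, references, threshold=0.5)
print(is_genuine)          # True
print(is_suspect_genuine)  # False
```

Because the decision only needs genuine reference footage of the person, a scheme of this kind does not require examples of any particular manipulation technique at decision time.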

The authors present the work as the first comprehensive integration of these techniques for deepfake detection. The model performs well in challenging conditions and runs efficiently, with significantly faster inference than existing methods. Because it is trained solely on pristine, unmanipulated data, it mitigates overfitting and remains effective against evolving manipulation techniques. Beyond the technical contribution, the paper stresses practical applicability in areas such as digital forensics and media integrity.

Introduction

The introduction of the research paper highlights the rapid advancements in artificial intelligence (AI) and machine learning (ML) that have facilitated the rise of deepfake technology, particularly in video manipulation. These highly realistic digital forgeries pose significant risks to privacy, security, and trust in media, with potential implications for disinformation campaigns, democratic processes, and personal reputations. As deepfake generation techniques become more sophisticated, traditional detection methods struggle to keep pace, necessitating the development of advanced AI-based detection systems.

The study proposes a novel approach to video deepfake detection that integrates convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers, and uses 3D Morphable Models (3DMMs) to extract detailed facial biometrics. The hybrid model is designed to improve detection accuracy by training exclusively on authentic data, which avoids common issues such as overfitting and bias towards genuine content. The methodology aims to process large volumes of video data efficiently and shows strong performance across manipulation techniques and video qualities. The remaining sections of the paper cover background knowledge, methodology, performance evaluation, and conclusions.

Methods

The method represents each video by facial features extracted with 3D Morphable Models (3DMMs) and passes them through a hybrid network: a CNN for spatial analysis, an LSTM for short-term temporal dynamics, and a Transformer encoder for long-range dependencies. The network is trained only on pristine, unmanipulated videos from the VoxCeleb2 dataset, so no manipulated examples are needed during training.

Detection is framed as identity verification rather than direct real-versus-fake classification: a test video is compared with genuine reference footage, and the comparison relies on a distance metric. The evaluation uses the DFD and FF++ datasets under high-quality (HQ) and low-quality (LQ) conditions, includes ablations over the number of Transformer encoder layers and LSTM layers, and compares inference time with ID-Reveal.
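As a rough illustration of how a CNN, an LSTM, and a Transformer encoder could be chained over per-frame facial features, the PyTorch sketch below may help. It is a sketch under stated assumptions, not the authors' architecture: the class name `HybridEncoder`, the layer sizes, the use of a 1D convolution across each frame's feature vector, the mean pooling, and the 62-value input size are all hypothetical. Only the single LSTM layer and the six Transformer encoder layers follow the configuration reported in the Results.

```python
import torch
from torch import nn
import torch.nn.functional as F


class HybridEncoder(nn.Module):
    """Illustrative CNN -> LSTM -> Transformer encoder (not the paper's code)."""

    def __init__(self, conv_channels=32, hidden=64, n_heads=4,
                 n_encoder_layers=6, embed_dim=128):
        super().__init__()
        # CNN stage (assumed form): 1D convolution across the feature axis
        # of each frame, pooled to one vector per frame.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # LSTM stage: a single layer for short-term temporal dynamics.
        self.lstm = nn.LSTM(conv_channels, hidden, num_layers=1,
                            batch_first=True)
        # Transformer stage: six encoder layers for long-range dependencies.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=4 * hidden,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=n_encoder_layers)
        self.head = nn.Linear(hidden, embed_dim)

    def forward(self, x):
        # x has shape (batch, frames, features_per_frame).
        batch, frames, n_feat = x.shape
        per_frame = x.reshape(batch * frames, 1, n_feat)
        spatial = self.cnn(per_frame).squeeze(-1)     # (batch*frames, channels)
        spatial = spatial.reshape(batch, frames, -1)  # (batch, frames, channels)
        temporal, _ = self.lstm(spatial)              # (batch, frames, hidden)
        context = self.transformer(temporal)          # (batch, frames, hidden)
        pooled = context.mean(dim=1)                  # average over frames
        # Unit-length embedding, convenient for distance comparisons.
        return F.normalize(self.head(pooled), dim=-1)


# Usage: 2 clips, 16 frames each, 62 placeholder feature values per frame.
model = HybridEncoder().eval()
with torch.no_grad():
    embeddings = model(torch.randn(2, 16, 62))
print(embeddings.shape)
```

The output is one fixed-size embedding per clip, which is the kind of representation an identity-verification comparison against reference footage would consume.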

Results

The results show gains in both detection performance and computational efficiency. The model was implemented in Python 3.9 with the PyTorch library and trained on the VoxCeleb2 dataset, reaching a peak area under the curve (AUC) of 90.82% after 1,000 epochs, which the authors treat as the point of convergence for training duration. Further experiments showed that increasing the number of Transformer encoder layers from 1 to 6 improved performance, while additional layers yielded diminishing returns, so a 6-layer Transformer module was selected for the final configuration. An ablation study on the LSTM indicated that a single LSTM layer performed best, which supports the design choice of capturing temporal dependencies without overfitting.
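For readers less familiar with the AUC metric reported here, the following sketch computes it from first principles as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The function name `roc_auc` and the sample labels and scores are made up for illustration; they are not data from the paper, and the paper's own evaluation code is not shown in the source.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    `labels` holds 1 for the positive class and 0 for the negative class;
    `scores` holds the corresponding detector outputs (higher = more positive).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up example: 3 positives, 3 negatives, one misranked pair.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8 of 9 pairs ranked correctly
```

This pairwise form is quadratic in the number of samples, which is fine for illustration; library implementations use sorting instead.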

The proposed CNN-LSTM-Transformer model outperformed existing detection methods across datasets, achieving an AUC of 97% under both high-quality (HQ) and low-quality (LQ) conditions on the DFD dataset, and 98% (HQ) and 97% (LQ) on the FF++ dataset. The authors attribute this performance to training on unmanipulated data, which improves generalizability and robustness against diverse manipulation techniques. The model also showed a significant reduction in inference time compared with the ID-Reveal model, making it suitable for real-time applications.

While the study confirms the model’s effectiveness, it acknowledges limitations in generalizing to unseen manipulation techniques and the need for future adaptation as deepfake technologies evolve. Future work will focus on creating a comprehensive training dataset and exploring alternative distance metrics to further improve performance.
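To make the remark about alternative distance metrics concrete, the sketch below shows that the choice of metric can change which reference embedding counts as nearest to a test embedding. The three metrics are standard definitions, but the source does not say which metrics the authors use or plan to try; the helper `nearest_index` and the vectors are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest_index(test, references, metric):
    """Index of the reference embedding closest to `test` under `metric`."""
    return min(range(len(references)), key=lambda i: metric(test, references[i]))


# Made-up vectors: reference 0 points the same way as the test vector but is
# longer; reference 1 is closer in absolute position but points elsewhere.
test_vec = [1.0, 1.0]
refs = [[3.0, 3.0], [1.0, 0.0]]

print(nearest_index(test_vec, refs, euclidean))        # 1
print(nearest_index(test_vec, refs, manhattan))        # 1
print(nearest_index(test_vec, refs, cosine_distance))  # 0
```

Since the verification decision depends on such distances, swapping the metric can change both the ranking of references and the threshold that separates genuine from manipulated videos.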

Discussion

The discussion section of the research paper highlights the advancements in video deepfake detection through various machine learning (ML) and deep learning (DL) methodologies. Numerous studies have employed diverse techniques, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and hybrid models, to identify subtle artifacts and patterns indicative of video manipulation. Notable contributions include the use of CNNs for capturing face-warping artifacts, temporal-aware systems combining CNNs and LSTMs for analyzing video sequences, and innovative approaches like the Identity Consistency Transformer, which focuses on identity discrepancies in facial regions. However, a common limitation across these studies is their reliance on raw video frames, which may restrict the models’ ability to exploit hidden information.

To address these limitations, some researchers have proposed feature extraction methods prior to model input, such as utilizing optical flow fields and frequency spectra. The paper emphasizes the potential of personalized detection approaches, exemplified by the ID-Reveal method, which focuses on learning temporal facial features and assessing biometric traits rather than merely classifying videos as real or fake. The proposed approach in this study integrates a hybrid CNN-LSTM-Transformer model with 3D Morphable Models (3DMMs) to enhance detection accuracy and efficiency. This model captures spatial details, short-term temporal dynamics, and long-range dependencies, thereby improving the identification of manipulated content while maintaining a focus on individual identity verification.