Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers

Journal: Applied Intelligence, Volume: 56, Issue: 3
DOI: https://doi.org/10.1007/s10489-026-07122-3
Publication Date: 2026-02-01
Author(s): Pablo Negre et al.
Primary Topic: Human Pose and Action Recognition

Overview

The paper investigates how well different deep learning models detect violence in video, focusing on what temporal modeling contributes beyond frame-level representations. Using a systematic experimental framework, the study evaluates convolutional neural networks (CNNs) combined with long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) layers on three benchmark datasets: Hockey Fights, Violent Flow, and RWF-2000. Frame-level analysis with a pre-trained VGG-19 network and a simple aggregation strategy achieves competitive accuracies of 95% and 96% on Hockey Fights and Violent Flow, respectively. Bi-LSTM layers provide some improvement, but the gains are inconsistent, suggesting that added architectural complexity does not always improve performance.
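The summary does not spell out the aggregation rule, so the following is only a minimal sketch of one plausible frame-to-video strategy: label each frame from its predicted probability, then call the video violent if a large enough share of frames is violent. The function name `classify_video` and both thresholds are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def classify_video(
    frame_probs: Sequence[float],
    frame_threshold: float = 0.5,       # assumed per-frame cutoff
    min_violent_fraction: float = 0.5,  # assumed share of frames needed
) -> bool:
    """Return True if the video is labelled violent.

    frame_probs holds one 'violent' probability per frame, for example the
    output of a frame-level CNN. The thresholds are illustrative assumptions.
    """
    if not frame_probs:
        raise ValueError("frame_probs must contain at least one frame")
    # Count the frames that are individually classified as violent.
    violent_frames = sum(1 for p in frame_probs if p >= frame_threshold)
    # Video-level decision: fraction of violent frames against the cutoff.
    return violent_frames / len(frame_probs) >= min_violent_fraction


print(classify_video([0.9, 0.8, 0.2, 0.7]))  # True  (3 of 4 frames violent)
print(classify_video([0.1, 0.6, 0.2, 0.3]))  # False (1 of 4 frames violent)
```

A rule of this kind needs no trained temporal layer, which is the kind of low-cost baseline the comparison with LSTM and Bi-LSTM layers is measured against.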

The study also highlights limitations of existing work, particularly in understanding the true impact of combining CNNs with recurrent neural networks (RNNs) and the marginal benefit of Bi-LSTM over LSTM. Notably, the evaluated models show that simpler architectures can rival more complex temporal models, especially when effective feature aggregation strategies are employed. The research underscores the importance of balancing model complexity with practical effectiveness in violence detection systems. Future work will explore integrating explainability techniques and improving performance on challenging datasets through transfer learning and data augmentation, with the aim of making violence detection models more robust and interpretable.

Introduction

The introduction of this research paper addresses the pressing issue of physical aggression and its widespread impact on individuals and society. It highlights the multifaceted nature of aggressive behavior, which is influenced by emotional regulation, interpersonal conflicts, and socio-economic factors. The authors propose that real-time video violence detection using artificial intelligence (AI) represents a scalable solution to mitigate these issues, emphasizing the importance of approximating human-level labeling of violence through AI rather than establishing a universal definition.

The study focuses on comparing various architectures for video violence detection, particularly the effectiveness of frame-by-frame Convolutional Neural Networks (CNNs) against CNN-Recurrent Neural Network (RNN) architectures that incorporate Long Short-Term Memory (LSTM) layers. Key objectives include assessing the performance benefits of integrating LSTM and Bi-LSTM layers, analyzing the impact of architectural choices and hyperparameters, and evaluating the comparative effectiveness of VGG-16 and VGG-19 in extracting spatial features. The authors aim to provide a systematic analysis rather than introducing new architectures, ultimately demonstrating that competitive performance can be achieved with simpler models, particularly using VGG-19 without explicit temporal modeling. The paper outlines its structure, indicating that subsequent sections will cover related work, methodology, experimental results, and conclusions.

Methods

The methodology section describes the model architectures developed to address the challenges identified in Section 2.5. For the task of violence detection, the study employs two Convolutional Neural Networks (CNNs), VGG-16 and VGG-19, both pre-trained on the ImageNet dataset. To adapt these models to the binary classification problem of violence versus non-violence, the last fully connected layer, which originally has 1000 neurons (one per ImageNet class), is replaced with a layer of only 2 neurons. This change lets the models reuse the features learned by the pre-trained networks while focusing on the specific task of violence detection.
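To give a sense of how small the new output layer is, the arithmetic below counts its parameters before and after the change. It assumes the standard VGG layout, in which the last fully connected layer receives a 4096-dimensional input; the summary does not say whether the authors altered any other layer, and `dense_layer_params` is a hypothetical helper name.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """Number of weights plus biases in one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Assumption: standard VGG-16/VGG-19 layout, where the final fully
# connected layer takes a 4096-dimensional input.
VGG_FC_INPUT = 4096

original_head = dense_layer_params(VGG_FC_INPUT, 1000)  # 1000 ImageNet classes
binary_head = dense_layer_params(VGG_FC_INPUT, 2)       # violence vs. non-violence

print(original_head)  # 4097000
print(binary_head)    # 8194
```

Under this assumption only a few thousand output-layer parameters are new, while everything before them can start from the ImageNet weights.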

Results

The results section presents findings from the training and testing phases of the various architectures for video-based violence detection. Experiments were run on a server equipped with an Intel Core i9-10940X CPU and an NVIDIA GeForce RTX 3090 GPU. The analysis is structured into subsections on the VGG-16 and VGG-19 models and on their combinations with LSTM and Bi-LSTM layers. The study emphasizes the performance of frame-wise predictions from VGG-19, which achieved competitive accuracy without explicit temporal modeling, particularly on the Hockey Fights and Violent Flow datasets. Notably, manual aggregation strategies improved video-level decisions, while the performance gains from LSTM and Bi-LSTM layers were modest, suggesting that complex temporal modeling may not always justify its additional computational cost.

A comprehensive comparison of model variants revealed that while Bi-LSTM architectures consistently outperformed LSTM models, the overall accuracy improvements remained limited. The findings indicate that effective violence detection can be achieved using frame-level visual cues alone, underscoring the importance of dataset characteristics in determining the benefits of temporal modeling. Furthermore, the study highlights the trade-offs between accuracy and computational efficiency, with frame-based CNN models demonstrating lower inference latency compared to recurrent architectures. This insight is crucial for real-time applications, suggesting that simpler models may offer a more practical solution for violence detection in resource-constrained environments.
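The summary does not describe how latency was measured. As a hedged illustration only, a comparison of this kind could use a small timing harness like the one below; `median_latency_ms` is a hypothetical helper, and the workload in the usage line is a stand-in rather than a real model.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    predict: Callable[[], object], runs: int = 20, warmup: int = 3
) -> float:
    """Median wall-clock time of predict() in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up calls are executed but not timed (caches, lazy initialisation).
    for _ in range(warmup):
        predict()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        timings.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(timings)


# Stand-in workload; a real comparison would wrap each model's forward pass.
print(median_latency_ms(lambda: sum(range(10_000))))
```

When the model runs on a GPU, the wrapped call would also need to wait for the device to finish its work before returning, otherwise the timer can stop too early.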

Discussion

The discussion section of the research paper provides a comprehensive overview of the current state of video violence detection using artificial intelligence, highlighting both advancements and ongoing challenges in the field. It begins by summarizing existing solutions aimed at mitigating violence in society, emphasizing the necessity for real-time detection systems that can alert authorities and gather evidence efficiently. The growth of this domain is attributed to the proliferation of security cameras, advancements in big data technologies, and the evolution of AI algorithms, particularly in computer vision and human action detection.

The section further details the methodologies employed in violence detection, particularly the combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. The paper presents this combination as the most effective architecture in the literature, with CNNs extracting spatial features from video frames and LSTMs capturing temporal dynamics. It reviews datasets commonly used for training violence detection algorithms, such as Hockey Fights and RWF-2000, and categorizes the algorithms into traditional and deep learning methods. It also identifies key challenges, including the need for optimal datasets and the computational demands of real-time analysis, and proposes architectures that pair pre-trained CNNs with LSTM layers to improve detection accuracy. Overall, the section underscores the importance of continued research to address the complexities of violence detection in diverse real-world scenarios.
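To make the difference between an LSTM and a Bi-LSTM concrete, the toy sketch below runs a one-unit LSTM over a sequence of scalar per-frame features, once in frame order and once in reverse, and concatenates the two final hidden states. It is a simplification under stated assumptions: real systems feed CNN feature vectors into trained layers from a deep learning framework, and the names `lstm_last_hidden`, `bilstm_features` and all weights here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Per gate: (input weight, recurrent weight, bias) for a one-unit LSTM.
Gate = Tuple[float, float, float]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_last_hidden(xs: Sequence[float], params: Dict[str, Gate]) -> float:
    """Run a one-unit LSTM over xs and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        # Pre-activation of every gate from the current input and previous h.
        pre = {name: w_x * x + w_h * h + b for name, (w_x, w_h, b) in params.items()}
        i = _sigmoid(pre["input"])
        f = _sigmoid(pre["forget"])
        o = _sigmoid(pre["output"])
        g = math.tanh(pre["candidate"])
        c = f * c + i * g        # cell state: keep part of the past, add new
        h = o * math.tanh(c)     # hidden state exposed to the next step
    return h


def bilstm_features(
    xs: Sequence[float], forward: Dict[str, Gate], backward: Dict[str, Gate]
) -> List[float]:
    """Concatenate the final states of a forward and a backward pass."""
    return [
        lstm_last_hidden(xs, forward),
        lstm_last_hidden(list(reversed(xs)), backward),
    ]


# Illustrative, untrained weights (assumptions, not values from the paper).
toy = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}
print(bilstm_features([0.1, 0.9, 0.7, 0.2], toy, toy))
```

The backward pass is what a Bi-LSTM adds: each direction keeps its own weights and the classifier sees both summaries, which roughly doubles the recurrent computation compared with a single LSTM.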