A lightweight convolutional neural network architecture for violence detection in video sequences

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-37743-0
PMID: https://pubmed.ncbi.nlm.nih.gov/41651963
Publication Date: 2026-02-06
Author(s): Bhawana Tyagi et al.
Primary Topic: Human Pose and Action Recognition

Overview

The study addresses the need for efficient real-time violence detection in high-density public spaces, where rapid identification of aggressive incidents is crucial for timely intervention. To handle the spatiotemporal variation and high computational cost of video data, the authors propose a lightweight deep convolutional neural network (CNN) architecture based on MobileNetV2. The model uses depthwise separable convolutions and inverted residual bottlenecks to significantly reduce the parameter count while maintaining high classification accuracy. The framework processes video frames at a resolution of 224 × 224 pixels and applies normalization and augmentation to improve generalization and reduce overfitting.
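As a rough illustration of why depthwise separable convolutions reduce the parameter count, the sketch below compares the number of weights in a standard convolution with its depthwise separable counterpart. The layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) and the function names are assumptions chosen for illustration, not values from the paper, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel standard convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from the paper).
kernel, c_in, c_out = 3, 32, 64
std = standard_conv_params(kernel, c_in, c_out)        # 18432
sep = depthwise_separable_params(kernel, c_in, c_out)  # 288 + 2048 = 2336
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
```

For these assumed sizes the separable layer needs roughly an eighth of the weights. The saving grows with the number of output channels, which is one reason MobileNet-style networks remain compact.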

Empirical results show 97% accuracy on the Real-Life Violence Situations Dataset (RLVSD) and 94% on the Hockey Fight Dataset (HFD), alongside improvements in precision, recall, and F1-score over traditional CNNs. Computational profiling indicates that the model can run at real-time frame rates on resource-constrained hardware, making it suitable for deployment in a range of surveillance systems. Future research directions include integrating temporal features through 3D CNNs or transformer models, improving detection in challenging conditions, and exploring multi-modal inputs. The findings underscore the potential of deep learning-based systems to improve public safety by enabling swift responses to violent incidents in diverse environments, provided that ethical considerations are addressed.
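For readers less familiar with the reported metrics, the following minimal sketch shows how accuracy, precision, recall, and F1-score are derived from the counts of a binary (violent vs. non-violent) confusion matrix. The counts used here are hypothetical and are not the paper's results; the function name is likewise an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not from the paper).
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a surveillance setting, precision reflects how many raised alarms are genuine, while recall reflects how many violent clips are caught, so the two are usually reported together rather than relying on accuracy alone.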

Introduction

The introduction provides the background for the study. It outlines the context of the research, highlighting key concepts and previous findings that inform the current investigation. The authors emphasize the significance of the topic, detailing how the work contributes to the broader field and addresses existing gaps in the literature.

Furthermore, the introduction sets out the research objectives, articulating the specific questions the study aims to answer. By establishing a clear rationale for the research, the authors underscore the importance of their findings and the potential implications for future work in the area. Overall, this section places the research within the existing body of knowledge and prepares the ground for the methodology and results that follow.

Methods

This section reviews traditional and deep learning-based methods for violence detection in videos. Traditional approaches include the system of Penet et al., which achieved a 3% missed detection rate with 50% false alarms using multimodal data, and that of De Souza et al., who combined visual codebooks with linear Support Vector Machines (SVMs) for improved detection. Other notable methods relied on techniques such as Histogram of Oriented Gradients (HOG) features and the Horn-Schunck optical flow method, with varying degrees of success in accuracy and efficiency. For instance, Das et al. reported 86% accuracy using Random Forest classifiers, while Xu et al. improved detection by using SVMs with radial basis function (RBF) kernels.
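To make the RBF kernel mentioned above concrete, here is a minimal sketch of the standard definition, K(x, y) = exp(-gamma * ||x - y||^2), which an SVM uses as a similarity measure between feature vectors. The `gamma` value and the example vectors are assumptions for illustration; this is not a reconstruction of the pipeline used by Xu et al.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors are maximally similar; distant vectors approach zero.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])
print(same, far)
```

Because the kernel depends only on the distance between descriptors, it lets an SVM draw non-linear decision boundaries over hand-crafted features such as HOG or optical flow histograms.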

In contrast, deep learning methods have brought significant advances in violence detection. Simonyan & Zisserman introduced a convolutional neural network (CNN) that used both spatial and temporal features, achieving notable results on the HMDB-51 and UCF-101 datasets. Other models, such as those by Sudhakaran & Lanz and Butt et al., employed LSTM and VGG-19 architectures, respectively, reaching accuracies of 77.9% and 81%. Another noteworthy contribution is a three-stage deep learning framework that uses a lightweight CNN for initial filtering, followed by a 3D CNN for spatiotemporal feature extraction, and achieves high accuracy in violence detection. Overall, the section traces the evolution from traditional methods to more sophisticated deep learning techniques, emphasizing improvements in accuracy, efficiency, and the ability to handle diverse datasets.

Results

In this section, the authors present a comparative analysis of their proposed model against established architectures, namely InceptionV3, VGG-19, MobileNet, and MobileNetV2. The evaluation uses quantitative metrics to assess the proposed model relative to these existing models. The results indicate that the proposed model outperforms them on the studied tasks. The section also anticipates further discussion of the implications of these findings and of potential areas for future research.

Discussion

The discussion section of the research paper highlights the advancements and challenges in violence detection from surveillance videos, categorizing existing methods into traditional and deep learning techniques. Key research gaps identified include the need for automated feature extraction that captures both spatial and temporal patterns, lightweight architectures for real-time applications, and models that can generalize across diverse datasets. The authors pose several research questions aimed at addressing these gaps, focusing on the construction of effective deep learning architectures, the design of resilient models for real-time deployment, and techniques to minimize false positives in complex scenes.

The proposed model, based on the MobileNetV2 architecture, achieves strong performance: 97% accuracy on the Real-Life Violence Situations Dataset (RLVSD) and 94% on the Hockey Fight Dataset (HFD). Its lightweight design allows efficient computation, making it suitable for real-time surveillance applications. The findings suggest that the model captures relevant motion patterns while relying little on the characteristics of any specific dataset, indicating its potential for deployment in real-world scenarios. Future research directions include improving detection accuracy in challenging conditions, integrating multi-modal inputs, and exploring explainable AI methods to improve system transparency and reliability.
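One simple way to check a real-time claim of this kind is to time a classifier over a batch of frames and compare the throughput with a target frame rate. The sketch below does this with a placeholder classifier; `dummy_classifier`, the synthetic frames, and the 25 FPS target are all assumptions for illustration and do not represent the authors' model or their profiling setup.

```python
import time


def measure_fps(classify_frame, frames, warmup=2):
    """Return frames processed per second by classify_frame over frames."""
    for frame in frames[:warmup]:
        classify_frame(frame)  # warm-up calls are excluded from timing
    start = time.perf_counter()
    for frame in frames:
        classify_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_classifier(frame):
    # Placeholder for a real model's forward pass (assumption).
    return sum(frame) > 0


# Synthetic "frames": flat lists of pixel values, for timing only.
frames = [[0.0] * 1000 for _ in range(50)]
fps = measure_fps(dummy_classifier, frames)
target_fps = 25  # assumed real-time threshold, not from the paper
print(f"{fps:.0f} FPS, meets target: {fps >= target_fps}")
```

In practice the measurement would be repeated on the target device with the actual model and full-resolution frames, since throughput on a workstation says little about an embedded camera unit.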