Manuscript Title:

A FUSION-BASED CNN-BiLSTM FRAMEWORK FOR ROBUST VIOLENCE DETECTION IN REAL-WORLD VIDEO SURVEILLANCE

Authors:

MD. SHAFIUL AZAM, TAHMID RAHMAN, ABU SALEH MUSA MIAH, NAKIB AMAN, MD ABDUR RAHIM

DOI Number:

10.5281/zenodo.17896835

Published: 2025-12-10

About the author(s)

1. MD. SHAFIUL AZAM - Department of Computer Science and Engineering, Pabna University of Science and Technology, Rajapur, Pabna, Bangladesh.
2. TAHMID RAHMAN - Department of Computer Science and Engineering, Hamdard University, Bangladesh.
3. ABU SALEH MUSA MIAH - School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu, Fukushima, Japan.
4. NAKIB AMAN - Department of Computer Science and Engineering, Pabna University of Science and Technology, Rajapur, Pabna, Bangladesh.
5. MD ABDUR RAHIM - Department of Computer Science and Engineering, Pabna University of Science and Technology, Rajapur, Pabna, Bangladesh.


Abstract

Due to the enormous volume of video material and the growing demand for automated surveillance systems, violence detection has become a crucial area of study in computer vision. By detecting violence in real-time video streams, law enforcement and security personnel may be able to prevent or mitigate violent incidents. Deep learning techniques such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have shown promising results in detecting violent activity. However, existing approaches have limitations, including reduced performance in real-world conditions and difficulty distinguishing violent from non-violent activities with similar motion patterns. This paper presents a fully integrated violence detection system that overcomes these limitations by combining CNN architectures with a bidirectional LSTM (BiLSTM) through fusion techniques. We analyze existing violence detection approaches in depth and propose a novel, effective method in which CNNs extract spatial features from video frames and a BiLSTM models their temporal dynamics. The study assesses five CNN architectures (MobileNetV2, ResNet50V2, DenseNet201, Xception, and VGG19), each integrated with the BiLSTM network to recognize violent scenes in video data. Furthermore, two fusion strategies are examined: intermediate fusion and late fusion. Both are evaluated on two datasets, RLVS and HF. The results show that late fusion delivers the highest performance across multiple evaluation metrics, achieving accuracies of 98.50% on RLVS and 97.50% on HF, demonstrating its potential as a superior violence detection approach. This framework could help address the serious issue of violence affecting communities worldwide.
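To make the CNN-BiLSTM late-fusion idea concrete, the following is a minimal sketch in TensorFlow/Keras. It is not the authors' implementation: the clip length, frame size, LSTM width, choice of two backbones, and averaging of branch outputs are illustrative assumptions; the paper's actual hyperparameters and fusion details are described in the full text.

# Minimal sketch of CNN-BiLSTM branches with late fusion (assumed settings).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2, ResNet50V2

FRAMES, H, W, C = 16, 224, 224, 3  # assumed clip length and frame size

def cnn_bilstm_branch(backbone_fn, name):
    """One branch: a frozen per-frame CNN feature extractor, then a BiLSTM."""
    backbone = backbone_fn(include_top=False, pooling="avg",
                           input_shape=(H, W, C))
    backbone.trainable = False  # reuse ImageNet features per frame

    clip = layers.Input(shape=(FRAMES, H, W, C))
    feats = layers.TimeDistributed(backbone)(clip)       # (FRAMES, feat_dim)
    seq = layers.Bidirectional(layers.LSTM(128))(feats)  # temporal modelling
    out = layers.Dense(1, activation="sigmoid")(seq)     # violent vs. non-violent
    return models.Model(clip, out, name=name)

# Late fusion: each branch makes its own prediction; the final score is
# the average of the branch outputs (intermediate fusion would instead
# merge the BiLSTM feature vectors before a shared classifier).
branch_a = cnn_bilstm_branch(MobileNetV2, "mobilenetv2_bilstm")
branch_b = cnn_bilstm_branch(ResNet50V2, "resnet50v2_bilstm")

clip_in = layers.Input(shape=(FRAMES, H, W, C))
fused = layers.Average()([branch_a(clip_in), branch_b(clip_in)])
late_fusion_model = models.Model(clip_in, fused, name="late_fusion")
late_fusion_model.compile(optimizer="adam", loss="binary_crossentropy",
                          metrics=["accuracy"])
late_fusion_model.summary()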


Keywords

Violence Detection, Deep Learning, CNN, BiLSTM, Late Fusion, Video Surveillance, RLVS Dataset.