2025-06-28 16:36:36 - Adil Khan
VioDet
Project Area of Specialization: Artificial Intelligence

Project Summary
VioDet, short for Violence Detection, is a modern AI-based video violence detection system that uses deep learning and computer vision techniques to automate the detection of violent content in videos.
Proper content filtering of violent media is an important issue nowadays, for it has many applications: it can be used in conjunction with surveillance cameras to detect inappropriate behavior; aiding parental control by rating videos of streaming services; protecting users from receiving undesired media via messaging applications; blocking content from being uploaded to websites such as social networks, forums or educational platforms; or preventing it from being shown in specific places such as schools and workplaces.
With hundreds of hours of video uploaded to the Internet every minute, and video becoming a part of everyday life, many violent scenes unsuitable for viewers, especially children, reach the public. Hence, the demand for automated systems such as VioDet that detect these violent scenes is increasing.
Coming to the technical details, VioDet is built on top of an open-source deep learning model that we custom-train to detect violence in videos. The training phase requires extensive computing resources, which calls for a capable GPU. After training, the model is deployed as a web service using Flask and a Docker container. Once deployed, the web service can be hosted on any third-party server and thereby made accessible to the public.
Project Objectives
Nowadays, the amount of public violence has increased dramatically, ranging from terror attacks involving one or more persons wielding guns to a knife attack by a single person. This has resulted in the ubiquitous use of surveillance cameras, which helps authorities identify violent attacks and take the necessary steps to minimize their disastrous effects. But almost all current systems require manual human inspection of these videos to identify such scenarios, which is practically infeasible and inefficient. It is in this context that this project becomes relevant.
Moreover, video content makes up an ever-growing share of the world's Internet traffic. Video services built around short clips and live streams have become the new trend of the Internet. However, Internet video content includes violent videos that seriously harm the health of the online ecosystem. Furthermore, monitoring sudden violence in time creates tremendous challenges for video surveillance.
Thus, violent video detection is of vital importance, and the primary objective of this project is to provide a video violence detection service with reasonable accuracy and a friendly user interface.
Project Implementation Method
The implementation of VioDet can be divided into two groups of modules:

1. Front-End Application Modules
1.1 Video Acquisition and Transmission Module
This module initiates the main functioning of the application. It takes a video from the user, either an uploaded file or a link to a live stream, and passes it to the server.
1.2 Notification/Result View Module
This module displays the notification and the detection results to the user.
2. Back-End/Server-Side Modules
2.1 Video Acquisition Module
This module loads the video or accesses the live stream received at the endpoint.
2.2 Preprocessing Module
The video transmitted by the application is received and divided into frames at a rate (fps) specified by the user.
2.3 Violence Detection Module
Once the preprocessing module has produced a series of frames, this module runs those frames through the custom-trained deep learning model. Based on the spatio-temporal features of each frame, the frame is given a final classification.
2.4 Notification Module
If the amount of detected violence exceeds a set threshold, the video is labeled violent; if it falls below the threshold, the video is labeled non-violent. The generated notification is sent to the server along with the frame details, and the server then notifies the web application of the results.
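The frame-sampling and threshold logic of modules 2.2 and 2.4 can be sketched in plain Python. The sampling rule and the 0.5 per-frame score cutoff below are illustrative assumptions, not the project's final values:

```python
def sample_frame_indices(total_frames, video_fps, target_fps):
    """Pick the frame indices to keep when downsampling a video to a user-set rate."""
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))

def classify_video(frame_scores, threshold=0.5):
    """Label the whole video 'violent' when the fraction of violent frames exceeds
    the threshold; frames with a model score >= 0.5 count as violent (assumed cutoff)."""
    violent = sum(1 for score in frame_scores if score >= 0.5)
    return "violent" if violent / len(frame_scores) > threshold else "non-violent"
```

For example, downsampling a 30 fps video to 10 fps keeps every third frame.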
Implementation Methodology
All the modules mentioned above are incorporated into the following implementation methodology:
First, we train the deep learning model on datasets of videos. Then we wrap that model in a Flask web service so that users can run predictions against it.
To achieve portability and scalability, we build a Docker image that provides the model with the environment it needs to execute, including all dependencies such as Python, TensorFlow, and the required Python libraries. The resulting container communicates with the web application through a REST API endpoint.
Now, all that is left is for the user to upload a video or provide a link to a live stream.
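A minimal sketch of the Flask wrapper described above, under stated assumptions: the `run_model` stub stands in for the trained detector, and the `/predict` endpoint name is hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_model(video_bytes):
    # Hypothetical stub standing in for the trained violence detector.
    return {"label": "non-violent", "violent_frame_ratio": 0.0}

@app.route("/predict", methods=["POST"])
def predict():
    # Expect the video as a multipart file upload from the front-end.
    video = request.files.get("video")
    if video is None:
        return jsonify({"error": "no video uploaded"}), 400
    return jsonify(run_model(video.read()))
```

In the deployed system, an app along these lines would run inside the Docker container and receive uploads or stream links from the front-end application.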
Benefits of the Project
This project can benefit every one of us who uses the Internet (social media, YouTube) as well as security companies.
The ubiquitous use of surveillance cameras has created a need for human inspectors who must remain continuously alert to identify violent scenarios, which is practically infeasible and inefficient. Furthermore, monitoring sudden violence in real time creates tremendous challenges for video surveillance. It is in this context that this project becomes relevant.
Moreover, video content makes up an ever-growing share of the world's Internet traffic. Cisco estimated that by 2022, video would make up more than 82% of all consumer Internet traffic, and most social media content nowadays is video-based. With such large volumes of video being generated, a significant portion inevitably contains some degree of violence. One study found that Facebook alone serves over 8 billion video views per day, and approximately 13% of those videos contain some degree of violence.
Exposure to this type of violent media can have serious effects on the human mind and seriously harms the health of the online ecosystem.
Numerous psychological studies have shown that exposure to violent video content is a causal risk factor for increased aggressive behavior, aggressive cognition, and aggressive affect and for decreased empathy and prosocial behavior.
In a 2009 Policy Statement on Media Violence, the American Academy of Pediatrics said, “Extensive research evidence indicates that media violence can contribute to aggressive behavior, desensitization to violence, nightmares, and fear of being harmed.” Many other studies can be quoted here, however, the problem is clear.
Therefore, VioDet can play a major role in detecting violent video content on the Internet and, consequently, in stopping its propagation.
Technical Details of Final Deliverable
The final deliverable of this project will be the VioDet Web Application, which will allow its users to upload videos or provide links to live streams on which to perform violence detection. The web application will be constructed using front-end technologies including, but not limited to, HTML5, CSS3, JavaScript, Bootstrap, and React.
The core process of detecting violence in videos will be carried out by the TensorFlow-based deep learning model, which works as follows:
Two consecutive frames are taken as input. They are processed separately but in parallel by two pretrained Convolutional Neural Networks (CNNs), instances of Darknet19. Output from the bottom layers of Darknet19 gives us low-level features, while output from the top layers gives us high-level features.
The low-level feature outputs from Darknet19 are concatenated and fed into one of the additional CNNs (which are not pretrained). This additional CNN is meant to learn local motion features as well as appearance-invariant features by comparing the feature maps of the two frames.
The high-level feature outputs from Darknet19 are concatenated and fed into the other additional CNN, where the high-level features of the two frames are compared.
The outputs from both additional CNNs are concatenated and passed to a fully-connected layer and a Long Short-Term Memory (LSTM) cell to learn global temporal features.
Finally, the outputs of the LSTM cell are classified by a fully-connected layer containing two neurons that represent the two categories (violent and non-violent), respectively.
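The architecture above can be sketched with the Keras functional API. This is a sketch under stated assumptions, not the project's exact model: the small convolutional stacks stand in for pretrained Darknet19, the 64x64 input resolution is arbitrary, and the LSTM is shown over a single frame pair rather than the full sequence:

```python
from tensorflow.keras import layers, Model

# Stand-in feature extractor (the real system uses pretrained Darknet19).
def feature_extractor():
    inp = layers.Input(shape=(64, 64, 3))
    low = layers.Conv2D(16, 3, strides=2, activation="relu")(inp)   # "low-level" features
    high = layers.Conv2D(32, 3, strides=2, activation="relu")(low)  # "high-level" features
    return Model(inp, [low, high])

extractor = feature_extractor()  # shared weights for both frames

frame_a = layers.Input(shape=(64, 64, 3))
frame_b = layers.Input(shape=(64, 64, 3))
low_a, high_a = extractor(frame_a)
low_b, high_b = extractor(frame_b)

# Additional (not pretrained) CNNs compare the two frames' feature maps.
low_cat = layers.Concatenate()([low_a, low_b])
low_out = layers.GlobalAveragePooling2D()(
    layers.Conv2D(32, 3, activation="relu")(low_cat))
high_cat = layers.Concatenate()([high_a, high_b])
high_out = layers.GlobalAveragePooling2D()(
    layers.Conv2D(32, 3, activation="relu")(high_cat))

# Merge both streams, then a fully-connected layer and an LSTM cell
# (shown over a length-1 sequence here) for global temporal features.
merged = layers.Dense(64, activation="relu")(
    layers.Concatenate()([low_out, high_out]))
temporal = layers.LSTM(32)(layers.Reshape((1, 64))(merged))

# Two output neurons for the two categories: violent and non-violent.
probs = layers.Dense(2, activation="softmax")(temporal)
model = Model([frame_a, frame_b], probs)
```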
A Docker image will be constructed containing all of the prerequisite libraries and dependencies needed to run the model, including TensorFlow 1.7.0 and Python 3.6 with its libraries (tensorflow-gpu, sklearn, numpy, pillow, opencv-python, and keras). The trained model will be packaged into this Docker image. From there, the model will be exposed through a REST API, based on Flask 1.1, that makes it available to the Web Application and facilitates two-way communication between the two modules.
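An image along these lines could be described by a Dockerfile such as the following sketch; the `app.py` entry point and the exact version pins are assumptions based on the dependency list above:

```dockerfile
# Sketch of the VioDet serving image; entry point and pins are assumptions.
FROM python:3.6-slim
RUN pip install tensorflow==1.7.0 flask==1.1.2 \
    scikit-learn numpy pillow opencv-python keras
WORKDIR /app
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```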
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Media, Health, Security
Core Technology: Artificial Intelligence (AI)
Other Technologies: (none)
Sustainable Development Goals: Good Health and Well-Being for People; Industry, Innovation and Infrastructure; Peace, Justice and Strong Institutions

Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| MSI GeForce GTX 1070 Ti DUKE 8G GDDR5 Graphics Card | Equipment | 1 | 61215 | 61215 |
| Printing of Project Thesis/Documents | Miscellaneous | 1 | 3500 | 3500 |
| HD Wireless IP Camera | Miscellaneous | 1 | 3500 | 3500 |
| Total (in Rs) | | | | 68215 |