Overview
This article describes my current project, Violence Detection, in detail. The proposed approach outperforms the state-of-the-art methods while still processing videos in real time. A comparison between my method and the previous Conv3D-based work is also presented.
Introduction
School fights were always a big issue while I was guarding a high school. However, having staff monitor the surveillance cameras is infeasible, not to mention that they may have other responsibilities. Fortunately, recently developed AI (deep learning) techniques may be able to detect such anomalies automatically [1]. This kind of anomaly detection is very fast and can serve as a preprocessing step that filters out the normal surveillance videos, so that only the anomalous videos are sent to other, more accurate algorithms for further examination.
Previous work on violence detection used traditional features such as BoVW, STIP, and MoSIFT, and classified them with an SVM [2]. Ding et al. extracted spatial and temporal features with 3D convolution layers and classified them with fully-connected layers (as shown in Fig. 1) [3]. However, neither of these methods supports videos of variable length well, and the computational cost of 3D convolution grows rapidly with the depth of the temporal axis.
Fig. 1. Violence detection by 3D convolutional networks in ref. [3].
Moreover, the limited richness of large-scale video datasets is also an issue. Although the Sports-1M dataset [4] provides a million videos, most of its categories are about sports (i.e., the diversity of the videos is not as rich as that of ImageNet). Several studies imply that if the pre-trained model has already seen an object that appears later in the transfer learning task, the performance on the target task will be better [4, 5]. Therefore, being able to use a model pre-trained on ImageNet is also important, not to mention that plenty of such pre-trained models are readily available.
In this work, a new network is proposed: a CNN takes the input video frames and feeds its features to a Long Short-Term Memory (LSTM) network to learn global temporal features, and the features are finally classified by fully-connected layers. This network can not only be built on models pre-trained on ImageNet, but also has the flexibility to accept videos of variable length, and it boosts the accuracy to 98.5% while still processing frames in real time (80 fps on an Nvidia GTX 1080 Ti).
Method
Network Architecture
The proposed network architecture is shown in Fig. 2. It has been shown that, in addition to adding an LSTM (which is supposed to extract global temporal features) after the CNN, the local temporal features that can be obtained from the optical flow are also important [6]. Furthermore, it has been reported that the virtue of the optical flow lies in its invariance in appearance as well as its accuracy at boundaries and at small displacements [7]. Therefore, in this work, the effect of optical flow is mimicked by taking two video frames as input. The two input frames are processed by the pre-trained CNN. The outputs of the bottom layer of the pre-trained model for the two frames are concatenated along the channel axis and then fed into an additional CNN (labeled in orange in Fig. 2). Since the outputs of the bottom layer are regarded as low-level features, this additional CNN is supposed to learn local motion features as well as appearance-invariant features by comparing the feature maps of the two frames. The outputs of the top layer of the pre-trained network for the two frames are also concatenated and fed into another additional CNN to compare the high-level features of the two frames. The outputs of the two additional CNNs are then concatenated and passed to a fully-connected layer and an LSTM cell to learn global temporal features. Finally, the outputs of the LSTM cell are classified by a fully-connected layer containing two neurons that represent the two categories (fight and non-fight), respectively.
Fig. 2. The proposed network architecture. The layers labeled in blue are pre-trained on the ImageNet dataset and are frozen during training. The layers labeled in orange are trained on the video dataset.
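To make the data flow of Fig. 2 concrete, the following is a minimal PyTorch sketch (the actual project is implemented in TensorFlow). The tiny dummy backbone, channel widths, and hidden sizes are illustrative assumptions rather than the real Darknet19 or the original hyperparameters; only the wiring (frozen backbone, two-frame channel concatenation at a low and a high layer, two trainable branches, fully-connected layer, LSTM, 2-way classifier) follows the description above.

```python
import torch
import torch.nn as nn


class TwoFrameViolenceNet(nn.Module):
    """Sketch of the Fig. 2 data flow: two frames per time step -> frozen
    backbone (low- and high-level feature maps) -> two trainable comparison
    branches -> fully-connected layer -> LSTM -> 2-way classifier."""

    def __init__(self, low_ch=64, high_ch=512, feat_dim=256, hidden=128):
        super().__init__()
        # Stand-in for the frozen, ImageNet-pretrained backbone (Darknet19 in
        # the article); a tiny dummy CNN is used here so the sketch runs alone.
        self.backbone_low = nn.Sequential(
            nn.Conv2d(3, low_ch, 3, stride=2, padding=1), nn.ReLU())
        self.backbone_high = nn.Sequential(
            nn.Conv2d(low_ch, high_ch, 3, stride=8, padding=1), nn.ReLU())
        for p in self.backbone_low.parameters():
            p.requires_grad = False                       # frozen (blue layers)
        for p in self.backbone_high.parameters():
            p.requires_grad = False

        # Trainable branches that compare the two frames (orange layers).
        self.low_branch = nn.Sequential(
            nn.Conv2d(2 * low_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.high_branch = nn.Sequential(
            nn.Conv2d(2 * high_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))

        self.fc = nn.Linear(2 * feat_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)            # fight / non-fight

    def forward(self, frames):
        # frames: (batch, time, 2, 3, H, W) -- a pair of frames per time step.
        b, t = frames.shape[:2]
        pair = frames.reshape(b * t * 2, *frames.shape[3:])
        low = self.backbone_low(pair)                     # low-level features
        high = self.backbone_high(low)                    # high-level features

        # Concatenate the two frames of each pair along the channel axis.
        low = low.reshape(b * t, 2 * low.shape[1], *low.shape[2:])
        high = high.reshape(b * t, 2 * high.shape[1], *high.shape[2:])

        feat = torch.cat([self.low_branch(low).flatten(1),
                          self.high_branch(high).flatten(1)], dim=1)
        feat = torch.relu(self.fc(feat)).reshape(b, t, -1)
        out, _ = self.lstm(feat)                          # global temporal features
        return self.classifier(out)                       # per-frame logits


if __name__ == "__main__":
    net = TwoFrameViolenceNet()
    clips = torch.randn(2, 4, 2, 3, 64, 64)               # 2 clips, 4 time steps
    print(net(clips).shape)                                # torch.Size([2, 4, 2])
```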
The pre-trained model is implemented with Darknet19 [8] because of its accuracy on ImageNet and its faster-than-real-time performance. Since Darknet19 already contains 19 convolutional layers, the additional CNNs are implemented as residual layers [9] to avoid the degradation problem [9].
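For reference, a minimal identity-shortcut residual block in the style of [9] is sketched below; the exact layer widths and normalization used in the project are not specified here, so this is only a generic illustration.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A minimal identity-shortcut residual block (He et al. [9]).
    The additional (orange) layers in Fig. 2 are built from blocks of this
    kind so that stacking them on top of Darknet19 does not suffer from the
    degradation problem."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # shortcut: add the input back
```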
Accuracy Evaluation
The proposed model outputs a classification result per frame, whereas previous research evaluates accuracy at the video level. To be comparable with previous work, the frame-level results are aggregated by the following strategy: a video is assigned to a certain category if and only if the number of consecutive frame-level signals of that category is larger than a certain threshold. The threshold is derived by scanning values from 0 to the length of the video and choosing the one that yields the best accuracy on the validation set, as shown in Fig. 3. If multiple thresholds yield the same accuracy, the smallest one is chosen.
Fig. 3. The threshold-accuracy curve on the validation set. The horizontal axis represents the threshold on the number of consecutive frames with a positive signal. The vertical axis represents the accuracy at that threshold on the validation set. In the figure, thresholds from 3 to 9 all yield the best accuracy. The smallest threshold (i.e., threshold = 3) is chosen so that continuous false positives in the test set are reflected by this metric.
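The video-level decision rule and the threshold scan can be summarized by the plain-Python sketch below. The function names and the use of ">=" for the run-length comparison are my own illustrative choices, not the project's actual code.

```python
def longest_positive_run(frame_signals):
    """Length of the longest run of consecutive positive (fight) frames."""
    best = run = 0
    for s in frame_signals:
        run = run + 1 if s else 0
        best = max(best, run)
    return best


def classify_video(frame_signals, threshold):
    """A video is labelled 'fight' iff it contains at least `threshold`
    consecutive positive frame-level predictions."""
    return longest_positive_run(frame_signals) >= threshold


def pick_threshold(val_videos, val_labels, max_len):
    """Scan thresholds 0..max_len on the validation set and return the
    smallest threshold that achieves the best video-level accuracy."""
    best_acc, best_t = -1.0, 0
    for t in range(max_len + 1):
        preds = [classify_video(v, t) for v in val_videos]
        acc = sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)
        if acc > best_acc:            # strict '>' keeps the smallest threshold
            best_acc, best_t = acc, t
    return best_t, best_acc
```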
Gradient Clipping
It is well known that the gradient of a recurrent network may grow rapidly due to long-term components [10]. The usual way to deal with an exploding gradient is to truncate it so that it remains in a reasonable range. Several studies take another approach: start training with a few unrolls, then double the number of unrolls whenever the loss reaches a plateau [5]. With this second approach, they found that it is not even necessary to clip the gradients. They also state that, without starting from a small number of unrolls, the network may not converge at all [5].
In this work, I found that the network converges easily even when the initial number of unrolls is set to the length of the videos. However, without gradient clipping, the loss curve oscillates during training even if training starts with a small number of unrolls. Therefore, the gradients of the network are truncated to the range from -5.0 to 5.0. Clipping the gradients to a smaller range (e.g., from -1.0 to 1.0) has also been tested; however, my experiments show that this makes it hard for the network to converge to a lower minimum.
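As an illustration, a training step with element-wise gradient clipping to [-5.0, 5.0] might look like the following PyTorch sketch (the project itself is in TensorFlow; `train_step` and its arguments are hypothetical).

```python
import torch


def train_step(model, batch, labels, optimizer, loss_fn, clip_value=5.0):
    """One optimisation step with element-wise gradient clipping: every
    gradient component is truncated to [-clip_value, clip_value] before
    the parameter update."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
    optimizer.step()
    return loss.item()
```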
Results
Experiment on the Hockey dataset
The Hockey dataset proposed by Bermejo et al. consists of 500 fighting clips and 500 non-fighting clips collected from hockey games [2]. Following the experiment proposed by Ding et al. [3], the dataset is split as follows: 400 clips (200 fighting and 200 non-fighting) for testing, 500 clips for training, and 100 clips for validation. The results are shown in Table 1. One can see that the method proposed in this work outperforms the other state-of-the-art methods.
Method                                       | Accuracy
STIP(HOG)+HIK with 1000 vocabulary [3]       | 84.25%
STIP(HOF)+HIK with 1000 vocabulary [3]       | 78.00%
STIP(HOG+HOF)+HIK with 1000 vocabulary [3]   | 78.50%
MoSIFT+HIK with 1000 vocabulary [3]          | 90.90%
Conv3D [3]                                   | 91.00%
Darknet19 + Residual Layers + LSTM           | 98.50%
Table 1. The comparison between the previous methods and the proposed method.
The Single Frame Baseline
It has been reported that single-frame models (i.e., models that do not consider temporal information) already have strong performance [4]. This may be because several categories in video classification tasks (such as Sports-1M and UCF-101) can be recognized from the scene or background of the videos (e.g., football or swimming), so the network does not necessarily need to learn the motion features of the moving objects.
In this work, however, all of the videos are shot on a hockey field, and several frames are needed even when a human examines them. Therefore, the single-frame model was not expected to perform as well as the models that take temporal information into account. Nevertheless, to compare the proposed method with a single-frame model, a simple single-frame network has also been built. As shown in Fig. 4, the single-frame model takes the output of Darknet19 and sends the feature map into 3 fully-connected layers to classify the input.
Fig. 4. The single frame model.
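A minimal sketch of this baseline is given below. The hidden layer sizes are assumptions, and the frozen backbone is assumed to have been applied beforehand; only the "backbone features -> 3 fully-connected layers" structure follows Fig. 4.

```python
import torch
import torch.nn as nn


class SingleFrameBaseline(nn.Module):
    """Single-frame baseline of Fig. 4: features from the frozen backbone
    are flattened and classified by three fully-connected layers.
    The hidden sizes (1024, 256) are illustrative, not the original ones."""

    def __init__(self, feature_dim, hidden1=1024, hidden2=256, num_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, hidden1), nn.ReLU(),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
            nn.Linear(hidden2, num_classes))

    def forward(self, backbone_features):
        # backbone_features: (batch, C, H, W) feature map from the pre-trained CNN.
        return self.classifier(backbone_features.flatten(1))
```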
The comparison results are shown in Table 2. Surprisingly, the single-frame model also achieves a very high video accuracy. However, its per-frame accuracy is much lower than that of the network that considers temporal information. Moreover, its threshold on the number of consecutive positive signals is much larger than that of the network with the LSTM unit. This is reasonable, since the single-frame model has no temporal information, and the only way to reduce misjudgments is to increase the threshold on consecutive positive signals.
Method                              | Threshold | Frame accuracy | Video accuracy
Darknet19 + 3 FC                    | 14        | 93.77%         | 96.00%
Darknet19 + Residual Layers + LSTM  | 3         | 97.81%         | 98.50%
Table 2. The comparison between the single-frame model and the proposed method.
Comments
Reader: Hi, I tried to run this code by following your instructions in https://github.com/JoshuaPiinRueyPan/ViolenceDetection. However, I am facing some issues because of computational resource limitations. Could you please provide the trained model so that I can just test it?
Author: Of course. However, we have a holiday here, and I am spending it in my hometown, so my computer is not available at the moment. I'll send you the link around 10/2 (next Tuesday). Sorry for the trouble.
Author: Hi, the following address links to the trained model: https://drive.google.com/open?id=1TwGzBTooHvAkBcrKzEfukrZMSakuCdYd
Note: To compare with the previous papers, this model is trained on the hockey violence dataset. If you want to perform violence detection in a surveillance system, you would be better off gathering your own dataset and training again. Also, you might want to add more layers if the learning capacity of the model is not enough.
Reader: The link is not opening.
Reader: Thank you for sharing! Very interesting.
Reader: Hey, I would like to extend your research further. Could you share your email address so that we can discuss and communicate further on this topic?
Author: Hi, thank you for your interest in this project. My e-mail address is piinrueypan@gmail.com.
Reader: Hi, have you published any research paper regarding this? Please share it with me; it would really help me with my final-year project.
Author: No, I'm not working for any academic institute, and the process of publishing a paper would be tedious. I'm doing this out of interest and wanted to write it down in a formal way, which is why I wrote this article.
Author: Hi, once you have trained the model, you can fine-tune it by specifying the variable "PRETRAIN_MODEL_PATH_NAME" in the file settings/TrainSettings.py, line 22.
Reader: This is amazing! Can you tell me how you calculate the video accuracy and frame accuracy?
Reader: Hey! I recently ran your program on my virtual machine. It was slow but worked fine. But I recently messed up my VM and had to re-install it. Now when I try running it, it checks for violence but never gives me the video; instead, it prints "unsmoothed results: false, false, false...". I couldn't understand what is happening.