Video Summarization Using Deep ConvNets with Multi Attention
In this final year project, we propose a novel method for key-shot-based video summarization that combines a 3D convolutional neural network with multi-attention. Video summarization is a computer vision problem within the broader field of artificial intelligence, and we propose a deep learning solution for it.
The process starts by encoding the video as a time-varying, three-dimensional stack of frames, followed by two steps of attention.
Attention in deep learning can be broadly interpreted as a vector of importance weights: to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with other elements.
Step one learns attention weights for the features inside each frame, and step two learns attention weights across all frames, thereby assigning each frame an importance score between 0 and 1. The frames with the highest scores are selected for the summary. The network is trained using this information.
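The selection of the highest-scoring frames can be illustrated with a minimal sketch. The function name `select_key_frames` and the `summary_ratio` budget are assumptions made for illustration; the proposal does not state how many frames are kept.

```python
def select_key_frames(scores, summary_ratio=0.15):
    """Return the indices of the highest-scoring frames, in temporal order.

    scores        -- per-frame importance scores in [0, 1]
    summary_ratio -- assumed fraction of frames to keep (hypothetical default)
    """
    if not 0 < summary_ratio <= 1:
        raise ValueError("summary_ratio must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one frame so that a short video still gets a summary.
    budget = max(1, int(len(scores) * summary_ratio))
    # Rank frame indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order for the selected frames.
    return sorted(ranked[:budget])


example_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4, 0.6, 0.15]
selected = select_key_frames(example_scores, summary_ratio=0.3)
```

With ten frames and a ratio of 0.3, three frames are kept, and they are returned in their original order so that the summary plays back chronologically.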
The current state-of-the-art method uses 2D ConvNets with self-attention and therefore loses the dependency of each frame on the next, so its self-attention focuses on fewer features. Our experimental evaluation of the proposed approach will use two standard video summarization datasets: (i) SumMe and (ii) TVSum.
We intend to use the F1-score computed on key shots as the evaluation measure for comparison with the state of the art in video summarization.
The current state of the art has an F1-score of 49% on SumMe and 62% on TVSum, and we aim to surpass this.
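As a rough illustration of the evaluation measure, the sketch below computes an F1-score from the overlap between a predicted summary and a reference summary. It assumes that both summaries are represented as per-frame binary masks of equal length (1 means the frame belongs to a selected key shot); the function name `keyshot_f1` is hypothetical, and the handling of multiple reference summaries per video is not covered.

```python
def keyshot_f1(predicted, reference):
    """F1-score between two per-frame binary summary masks."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of frames")
    # Frames selected by both the prediction and the reference.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(1 for p in predicted if p)
    recall = overlap / sum(1 for r in reference if r)
    return 2 * precision * recall / (precision + recall)


predicted_mask = [1, 1, 0, 0, 1, 0]
reference_mask = [1, 0, 0, 1, 1, 0]
f1 = keyshot_f1(predicted_mask, reference_mask)
```

Here two of the three predicted frames also appear in the reference, so precision and recall are both 2/3 and the F1-score is 2/3.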
The objective of our FYP is to propose a new method for video summarization that introduces multi-attention and 3D convolutional neural networks, to beat the F1-score of the current state-of-the-art methods, and to contribute useful research to the deep learning community.
We have started with exploratory data analysis on our datasets, which will cover object detection, pixel intensity distribution, and motion analysis.
After that, we will preprocess the data by normalizing and resizing the frames.
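The two preprocessing operations can be sketched on a single grayscale frame stored as a nested list. This is only an illustration under stated assumptions: 8-bit pixel values scaled to the range [0, 1], and nearest-neighbour interpolation for resizing. The proposal does not specify either choice, and the helper names are hypothetical.

```python
def normalize_frame(frame, max_value=255):
    """Scale pixel values to [0, 1], assuming 8-bit input by default."""
    return [[pixel / max_value for pixel in row] for row in frame]


def resize_nearest(frame, new_height, new_width):
    """Resize a 2D frame with nearest-neighbour sampling."""
    old_height, old_width = len(frame), len(frame[0])
    return [
        [
            # Map each target pixel back to its nearest source pixel.
            frame[r * old_height // new_height][c * old_width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


raw = [[0, 255], [51, 102]]
normalized = normalize_frame(raw)
enlarged = resize_nearest([[1, 2], [3, 4]], 4, 4)
```

In practice an image library would do this work on whole arrays; the nested-list version only makes the arithmetic explicit.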
Then we will build data generators that dynamically load into memory only the chunk of data that needs to be processed, so that we use our resources effectively, since video data is very large.
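A minimal sketch of this idea, using a plain Python generator, is shown here. The names `frame_batches` and `load_frame` are hypothetical: `load_frame` stands for whatever function reads one frame from disk, and the real pipeline may differ.

```python
def frame_batches(frame_ids, load_frame, batch_size):
    """Yield lists of loaded frames, reading only one batch at a time."""
    batch = []
    for frame_id in frame_ids:
        # The frame is read only when the generator reaches it.
        batch.append(load_frame(frame_id))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch


loaded_ids = []


def fake_load(frame_id):
    """Stand-in loader that records which frames were requested."""
    loaded_ids.append(frame_id)
    return frame_id * 10


batches = frame_batches(range(5), fake_load, batch_size=2)
first_batch = next(batches)
loaded_after_first = list(loaded_ids)
remaining_batches = list(batches)
```

After the first batch is requested, only two frames have been loaded; the rest are read as the consumer asks for more, which is what keeps memory use bounded.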
We will then use the Keras Functional API to implement our model, starting with the encoder.
The encoder is the initial part of our model. It takes as input the frames extracted from the video together with their corresponding scores, and applies a series of 3D convolution and 3D max pooling layers for detailed feature learning. This block of convolution, max pooling, and batch normalization is applied 5 times, which means 13 layers, with the number of filters doubling in every block. In the fifth and sixth pairs of convolution and max pooling, we reduce the kernel size from (3 x 3 x 3) to (1 x 1 x 1) to apply network-in-network convolution, which helps capture the relations between objects in the images. After this, the decoder begins, where we apply multi-attention.
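The doubling schedule and the switch to (1 x 1 x 1) kernels can be written down as a small configuration sketch. The starting filter count of 32, the block count of 5, and the block at which the kernel shrinks are assumptions chosen for illustration; the function `encoder_plan` is hypothetical and does not build an actual network.

```python
def encoder_plan(num_blocks=5, base_filters=32, pointwise_from=5):
    """Describe each encoder block: filter count and 3D kernel size.

    base_filters   -- assumed filter count of the first block
    pointwise_from -- first block that uses a (1, 1, 1) kernel (assumed)
    """
    plan = []
    filters = base_filters
    for block in range(1, num_blocks + 1):
        kernel = (1, 1, 1) if block >= pointwise_from else (3, 3, 3)
        plan.append({"block": block, "filters": filters, "kernel": kernel})
        # The number of filters doubles from one block to the next.
        filters *= 2
    return plan


plan = encoder_plan()
filter_counts = [entry["filters"] for entry in plan]
```

Each entry could then be used to configure one convolution, max pooling, and batch normalization block when the model is assembled.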
The main research contribution resides in the decoder of our neural network, where we apply the attention and multi-attention mechanisms for video summarization. The input of the decoder is the output of the encoder. We then iterate over the number of frames in the batch.
We first expand the dimensions of the GRU hidden state to match the size of the features. This is then passed through a small DNN of three layers: the outputs of the first two, each with 1024 units, are summed, and the hyperbolic tangent is applied as the activation function. The result is passed through another dense layer with 1 unit. Softmax is applied along axis 1 so that the attention weights sum to 1.
These attention weights are then multiplied with the features obtained from our encoder.
The resulting context vector from step one is forwarded to the second step of attention.
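One attention step of this kind can be sketched in plain Python. This is a sketch under stated assumptions, not the project's implementation: it assumes one dense layer projects the features and the other projects the hidden state, uses 8 units instead of 1024 to stay small, uses random weights in place of trained ones, and sums the weighted features to form the context vector. All names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(features, hidden, w_feat, w_hid, v):
    """Return (context vector, attention weights) for one decoder step."""
    projected_hidden = matvec(w_hid, hidden)
    scores = []
    for feature in features:
        projected_feature = matvec(w_feat, feature)
        # Sum the two projections, then apply tanh.
        combined = [math.tanh(a + b)
                    for a, b in zip(projected_feature, projected_hidden)]
        # Final dense layer with a single unit gives one score per position.
        scores.append(sum(vi * ci for vi, ci in zip(v, combined)))
    weights = softmax(scores)  # weights sum to 1 across positions
    # Weighted sum of the features (assumed reduction to a context vector).
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(len(features[0]))]
    return context, weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
units, feat_dim, hid_dim, positions = 8, 4, 6, 5
features = random_matrix(positions, feat_dim, rng)
hidden = [rng.uniform(-1, 1) for _ in range(hid_dim)]
context, weights = additive_attention(
    features, hidden,
    random_matrix(units, feat_dim, rng),
    random_matrix(units, hid_dim, rng),
    [rng.uniform(-1, 1) for _ in range(units)],
)
```

The returned `weights` hold one value per feature position, and `context` has the same dimensionality as a single feature vector, which is what gets passed on to the next attention step.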
This process is repeated in step-two attention after reducing the dimensions of some of the features.
After applying multi-attention to the encoder features, we feed the final context vector from step-two attention, together with the last hidden state, into the GRU. The GRU output is then passed to a dense layer with 300 neurons, followed by a dense layer with 1 neuron.
This final output and the GRU state are returned.
The same process is carried out in each iteration of the decoder: we update the hidden state variable and concatenate the output of each iteration with the previous ones.
After all iterations are complete, we pass all the outputs through a final dense layer with units equal to the number of frames in the batch and apply a sigmoid activation, since the ground truth score for each frame lies between 0 and 1.
Video data is a defining need of our age, and humans are producing it faster than ever before. Video summaries are needed in many areas, such as educational lectures, news broadcasting, video editing, surveillance, and data storage. They will play a pivotal role in all of these areas. Saving time is a major advantage of video summaries, and the storage demands of rapidly growing data can be reduced by keeping summaries of that data.
Our research will be a step toward solving the complex problem of video summarization. It will also reflect positively on our country by showing that we have great potential and are capable of tackling problems that developed countries have not yet solved in the 21st century.
In the end, using our trained model, we will compare our results with the current state of the art using the F1-score on both the SumMe and TVSum datasets.
We will publish a research paper on our approach and results, and we will deliver the implementation code as well as the trained model.
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| RAM 8GB | Equipment | 3 | 5500 | 16500 |
| GPU | Equipment | 1 | 50000 | 50000 |
| Using online cloud service like colab-pro or kaggle AI notebooks + Documentation Printing, Files, Covers | Miscellaneous | 1 | 10000 | 10000 |
| Total (in Rs) | | | | 76500 |