Human Face Super Resolution in CCTV

2025-06-28 16:32:57 - Adil Khan

Project Title

Human Face Super Resolution in CCTV

Project Area of Specialization: Artificial Intelligence

Project Summary

Although many attempts have been made to accurately hallucinate Low Resolution (LR) facial images, super-resolving faces in low-quality videos remains an open challenge. Our proposed project is a Generative Adversarial Network (GAN) based deep neural network that uses attention networks for High Resolution (HR) face reconstruction and recognition. The technique builds a multistage deep network that takes a CCTV video as input and outputs a super-resolved result. CCTV videos suffer from motion blur and occlusions, and they exhibit pose and illumination variations. The low resolution of CCTV footage makes matters worse when a culprit captured in the video frames must be recognised automatically. We propose that our GAN-based method will address this problem by creating a 3D face model that serves as an accurate HR face.

Project Objectives

Following are the main objectives of the project:

1. Creating a state-of-the-art face recognition model that identifies faces in CCTV videos.

2. Mitigating problems such as motion blur, pose and illumination variation, and tiny faces by using an Attention Module alongside feature maps.

3. Super-resolving the faces captured in CCTV frames to provide a high-resolution face.

4. Building an accurate high-resolution 3D face.

Project Implementation Method

To implement face volumes, the backbone network of InsightFace is used. It produces a 512-dimensional feature vector. To create a feature volume, n feature embeddings are stacked together into a volume embedding. A volume embedding carries temporal information: in a video, it holds information from the n previous frames. For example, if 3 images are stacked together, the model has a 1536-dimensional feature vector (3 × 512), which is one face volume. This 1536-dimensional vector is the input to a hidden layer with 4608 nodes, ReLU activation and a dropout rate of 0.5. The second hidden layer contains 1024 nodes. The final layer is a softmax layer with N nodes, where N is the number of classes. The final output is the recognized face labelled with its identity. Furthermore, quality vectors are assigned that describe the quality of each component of the face. The highest-quality components contribute to the final vector. This final representation is then super-resolved using a GAN and, finally, a 3D face is constructed using 3D morphable modeling.
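The stacking and classification steps described above can be illustrated with a minimal pure-Python sketch. It is not the project's implementation: the function names (`build_face_volume`, `layer_sizes`, `forward` and so on) are hypothetical, the weights are random placeholders rather than trained values, and the InsightFace backbone is replaced by plain lists of numbers. Only the dimensions (512 per frame, 3 frames, 4608 and 1024 hidden nodes, N softmax outputs) come from the text.

```python
import math
import random

EMBED_DIM = 512          # per-frame embedding size stated in the text
FRAMES_PER_VOLUME = 3    # the text's example stacks 3 frames


def build_face_volume(embeddings):
    """Concatenate per-frame embeddings into one flat face-volume vector."""
    volume = []
    for embedding in embeddings:
        volume.extend(embedding)
    return volume


def layer_sizes(num_classes, embed_dim=EMBED_DIM, frames=FRAMES_PER_VOLUME,
                hidden1=4608, hidden2=1024):
    """Widths of the input, the two hidden layers and the softmax output."""
    return [embed_dim * frames, hidden1, hidden2, num_classes]


def count_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes, seed=0):
    """Random placeholder weights; a real model would load trained ones."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(x, weights, biases):
    """Fully connected layer: one dot product plus bias per output node."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over the class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(volume, layers):
    """Inference pass: ReLU hidden layers, softmax output.

    Dropout (rate 0.5 in the text) only applies during training, so it is
    an identity operation here.
    """
    x = volume
    for weights, biases in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


if __name__ == "__main__":
    # Full-size layer widths from the text, for a hypothetical 10 identities.
    print(layer_sizes(num_classes=10))
    # Scaled-down run so the pure-Python demo finishes quickly.
    small = layer_sizes(num_classes=5, embed_dim=8, hidden1=16, hidden2=12)
    rng = random.Random(1)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
    probs = forward(build_face_volume(frames), init_layers(small))
    print(len(probs), round(sum(probs), 6))
```

The demo runs at reduced widths because the full-size first layer alone holds about seven million weights, which is slow to build from Python lists. A practical version would use a tensor library and trained weights.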

Benefits of the Project

Face recognition from low-resolution CCTV videos is an important problem because current methods are ineffective at helping bring criminals to justice, even after they are caught on camera. In one incident in London, a cyclist headbutted a pedestrian; although the CCTV footage was released, no significant progress was made in the case using the footage. Similarly, in the Dua Mangi kidnapping case, CCTV footage was not used to solve the case, although the videos contained frames in which the culprit's face was visible. CCTV footage processing plays a vital role in police investigations: it not only helps reduce cost, time and effort but also improves prosecution outcomes. Hence, we present a method to enhance the recognition of faces captured in CCTV video sequences that accounts for pose, occlusion and illumination variations.

Technical Details of Final Deliverable

The final deliverable would be a software system that leverages a multi-stage pre-trained model. It uses InsightFace as the base model to extract feature maps. An aggregation module then assigns a quality value to each dimension of the feature map. The input to the network is a low-resolution CCTV video, and the output is either the identity of the person in question or a super-resolved face (if the person is unknown to the model). The model returns 19 facial landmarks and a feature vector that is the quality-weighted combination of the feature volumes. The 19 facial landmarks are then used to reconstruct the face. Finally, a Generative Adversarial Network is used to super-resolve the face.
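The aggregation and output-routing ideas in this section can be sketched as follows. The text does not say how quality values are turned into weights or how an unknown person is detected, so both choices here are assumptions: quality scores are normalised per dimension with a softmax across frames, and a person is treated as unknown when the top class probability falls below a hypothetical `threshold`. The names `aggregate_by_quality` and `route_output` are illustrative only.

```python
import math


def aggregate_by_quality(features, qualities):
    """Combine per-frame feature vectors into one vector, dimension by dimension.

    features:  list of per-frame feature vectors (equal length)
    qualities: list of per-frame quality vectors, one score per dimension
    Assumption: weights are a softmax of the quality scores across frames,
    so the highest-quality frame dominates each dimension.
    """
    num_dims = len(features[0])
    aggregated = []
    for d in range(num_dims):
        scores = [q[d] for q in qualities]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weighted = sum(w * f[d] for w, f in zip(weights, features))
        aggregated.append(weighted / total)
    return aggregated


def route_output(class_probs, labels, threshold=0.6):
    """Return the identity if the classifier is confident, else ask for SR.

    Assumption: 'unknown to the model' is approximated by a low maximum
    softmax probability; the 0.6 threshold is a placeholder value.
    """
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    if class_probs[best] >= threshold:
        return ("identity", labels[best])
    return ("super_resolve", None)


if __name__ == "__main__":
    frames = [[1.0, 4.0], [3.0, 0.0]]
    quality = [[10.0, -10.0], [-10.0, 10.0]]
    print(aggregate_by_quality(frames, quality))
    print(route_output([0.1, 0.8, 0.1], ["a", "b", "c"]))
    print(route_output([0.4, 0.3, 0.3], ["a", "b", "c"]))
```

With equal quality scores the aggregation reduces to a plain average over frames; as one frame's score grows, the result approaches that frame's value for the dimension in question.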

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Security
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Industry, Innovation and Infrastructure; Peace, Justice and Strong Institutions

Required Resources

Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs)
GPU | Equipment | 1 | 70000 | 70000
Total (in Rs) | | | | 70000
