Adil Khan 9 months ago

Project Title

Image Captioning

Project Area of Specialization

Artificial Intelligence

Project Summary

Image captioning is the task of producing a good representational text that describes an image to a user or machine. This is quite challenging: to describe an image effectively, a number of things are required, such as the high-level topic of the image, the objects in it, their relationships, and their relative orientation and temporal positions. Deep learning has been used quite effectively to solve this problem, and attention mechanisms are applied in a fashion similar to how humans describe an image. In this Final Year Project (FYP), we intend to use multiple attention mechanisms to simultaneously detect the topic and identify objects, their orientation, and their temporal relationships. A tailor-made Long Short-Term Memory (LSTM) model with multiple attentions will be defined, trained, and optimized. The proposed method will be compared with earlier state-of-the-art models on datasets such as MSCOCO, Flickr8K, and Flickr30K. We intend to evaluate the method with BLEU, CIDEr, ROUGE-L, and METEOR scores to determine the efficiency of the approach.
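As a rough illustration of how the BLEU family of metrics scores a generated caption against a reference, the sketch below implements modified n-gram precision with a brevity penalty in plain Python. This is a toy sentence-level version for intuition only; in practice a library implementation such as NLTK's `corpus_bleu` would be used, and the example sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip candidate counts by reference counts (modified precision).
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
score = bleu(cand, ref)   # between 0 and 1; higher is better
```

CIDEr, ROUGE-L, and METEOR follow the same pattern of comparing candidate and reference texts, but weight n-grams by TF-IDF, longest common subsequence, and synonym-aware alignment respectively.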

Project Objectives

Image captioning, being a challenging problem, has attracted much research aimed at achieving scores close to, or even above, those of human evaluators. Surveys of existing work show that a sufficient number of solutions handle this task very well, the most prominent being the encoder-decoder framework. However, much of this research improved one side of the framework while leaving the other behind: some works enhanced the encoder alone, by improving feature extraction, introducing topic modelling, or combining order-embeddings, while others improved the decoder, by tailoring the RNN or introducing memory-based networks. Our solution aims to improve both the encoder and the decoder by combining several recent lines of research and analyzing the resulting captions with standard evaluation metrics.

In FYP-I, we will work on the encoder by examining the effect of feeding topics, along with image features, embedded in a high-dimensional order-embedding space, using a simple RNN as the decoder so that any improvement can be attributed to the encoder alone. In FYP-II, we will improve the decoder by incorporating attention and memory transmission using a tailored LSTM (STMA), then combine both implementations to produce captions and analyze them with evaluation metrics such as BLEU, CIDEr, and ROUGE.
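Since both phases revolve around attention inside the encoder-decoder framework, a minimal NumPy sketch of additive (Bahdanau-style) attention over a grid of image features may help fix ideas. The weight names and dimensions below are illustrative assumptions, not the actual STMA design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(features, hidden, W_f, W_h, v):
    """Score each spatial feature against the decoder state, normalise
    with softmax, and return the attention-weighted context vector."""
    # features: (L, D) image regions; hidden: (H,) decoder state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,) scores
    alpha = softmax(scores)                               # attention weights
    context = alpha @ features                            # (D,) context
    return context, alpha

L, D, H, A = 49, 256, 128, 64            # e.g. a 7x7 feature grid; sizes illustrative
features = rng.standard_normal((L, D))   # stand-in for CNN feature map
hidden = rng.standard_normal(H)          # stand-in for decoder hidden state
W_f = rng.standard_normal((D, A)) * 0.01
W_h = rng.standard_normal((H, A)) * 0.01
v = rng.standard_normal(A)
context, alpha = additive_attention(features, hidden, W_f, W_h, v)
```

At each decoding step the context vector is fed into the RNN alongside the previous word, letting the decoder "look at" different image regions while emitting each word.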

Project Implementation Method

Our approach follows the traditional encoder-decoder framework and uses images, their captions, and topics extracted for each image. Topics, captions, and images are organized in a three-level visual-semantic hierarchy with the topic at the top and the image at the bottom, and all three are embedded in the same space by the order-embedding method. Given an image embedding and a topic embedding, there is a subspace bounded by them, and the embeddings of the target captions are constrained to lie in that subspace. The language model is trained to sample a point in the subspace and decode it to generate the target caption.
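The order-embedding constraint described above can be sketched as a violation penalty: in a non-negative embedding space, the more specific item (the image) should dominate the more general one (the topic) coordinate-wise, with the caption in between. The NumPy sketch below shows the standard order-embedding penalty on toy 4-d vectors; the actual embedding dimension and training loss in the project may differ.

```python
import numpy as np

def order_penalty(specific, general):
    """Order-embedding violation: zero when every coordinate of the
    more specific embedding dominates the more general one."""
    return np.square(np.maximum(0.0, general - specific)).sum()

# Toy non-negative embeddings: image (most specific) dominates the
# caption, which in turn dominates the topic (most general).
topic   = np.array([0.1, 0.0, 0.2, 0.1])
caption = np.array([0.4, 0.2, 0.5, 0.3])
image   = np.array([0.9, 0.6, 0.8, 0.7])

assert order_penalty(image, caption) == 0.0   # caption lies in the subspace
assert order_penalty(caption, topic) == 0.0

# A caption embedding outside the bounded subspace is penalised:
bad_caption = np.array([1.2, 0.2, 0.5, 0.3])
penalty = order_penalty(image, bad_caption)
```

During training, this penalty is minimised for true (image, topic, caption) triples and pushed above a margin for mismatched ones, which is what carves out the subspace the language model samples from.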

We will primarily use popular datasets for this task, such as COCO and Flickr. For the encoder, we aim to use a pre-trained CNN such as InceptionV3 or ResNet to produce image attention features. Captions are stemmed and collected into documents, which are fed into a topic model such as LDA or NMF to extract relevant topics. The topics, images, and captions are then used to train an embedding space, and a weighted sum of the image and topic embeddings is subsampled as the initial state of an STMA-based LSTM. The STMA-LSTM will then be trained over a number of epochs to produce relevant captions. These will be evaluated using BLEU, CIDEr, ROUGE, etc., and the results compared with recent state-of-the-art approaches. We will use NLTK for caption pre-processing, along with deep learning frameworks such as TensorFlow, PyTorch, Keras, and Theano for model building and training. Hyperparameter tuning will be done using high-level Keras APIs such as KerasTuner. A high-level diagram of the approach can be seen below:
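The step of forming a weighted sum of the image and topic embeddings and subsampling it as the decoder's initial state could look roughly like the NumPy sketch below. The mixing weight, noise scale, and dimensions are all assumptions for illustration; here "subsampling" is read as perturbing the convex combination to pick a point inside the bounded subspace.

```python
import numpy as np

rng = np.random.default_rng(42)

def initial_state(image_emb, topic_emb, lam=0.7, noise_scale=0.05):
    """Convex combination of image and topic embeddings, plus a small
    Gaussian perturbation to sample a nearby point in the subspace."""
    anchor = lam * image_emb + (1.0 - lam) * topic_emb
    return anchor + noise_scale * rng.standard_normal(anchor.shape)

D = 512                                   # shared embedding size (illustrative)
image_emb = rng.standard_normal(D)        # e.g. pooled CNN features, projected
topic_emb = rng.standard_normal(D)        # e.g. LDA topic vector, projected
h0 = initial_state(image_emb, topic_emb)  # used as the LSTM's initial hidden state
```

In a Keras or PyTorch implementation, `h0` would simply be passed as the initial hidden (and, if desired, cell) state of the LSTM decoder before the first word is generated.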

Benefits of the Project

Digital images and videos make up a large proportion of the unstructured data available today. The main idea is to emulate human vision in order to produce rich, ‘human-like’ descriptions of such data. Extracting information from images can therefore yield much benefit, with applications such as:

  • Optimizing image searches in search engines

  • Voice assistance for the visually impaired

  • Boosting the self-driving system by captioning the environment

  • Real time video description for CCTV cameras

Such essential and significant functionalities make image captioning an important field to study, explore, and improve, and it is already used by software giants such as Google and Microsoft.

Technical Details of Final Deliverable

Deliverables:

  1. Exploratory Data Analysis: jupyter notebook code + presentation
  2. Image Topic Detector model: jupyter notebook code
  3. Order Embedding model: jupyter notebook code
  4. Simple LSTM model without attention: jupyter notebook code
  5. STMA model implementation: jupyter notebook code
  6. Model Optimization: jupyter notebook code
  7. Model Evaluation: jupyter notebook code
  8. Final Report

Final Deliverable of the Project

Software System

Core Industry

IT

Other Industries

Core Technology

Artificial Intelligence(AI)

Other Technologies

Sustainable Development Goals

Industry, Innovation and Infrastructure

Required Resources

Item Name   Type        No. of Units   Per Unit Cost (in Rs)   Total (in Rs)
RTX 2060    Equipment   1              69,000                  69,000
Total (in Rs): 69,000
If you need this project, please contact me on contact@adikhanofficial.com