Image Captioning
2025-06-28 16:33:01 - Adil Khan
Project Area of Specialization: Artificial Intelligence

Project Summary
Automatically describing image contents in human-readable sentences is a very challenging task, but one with many potential benefits. It is harder than common image classification or object recognition, and it has gained major attention in computer vision. A textual description must capture not only the objects contained in an image but also the relationships between those objects, expressed in natural language; alongside image processing, we therefore also need a language model to generate the descriptions.
There are several ways to provide an automatic description of image contents. One popular approach uses a human-annotated training set that describes image contents. The motivation for our project comes from recent research in machine translation, where a sentence is translated from one language to another. Traditionally this was done by translating words individually, aligning them, and reordering them, but it has recently been shown that it can be achieved more simply using Recurrent Neural Networks (RNNs). There are two phases: an encoder and a decoder. In the first phase, an RNN reads the input and transforms it into a fixed-length vector representation, which is then used by a second RNN, the decoder, to generate the actual output. In our model, we will replace the RNN encoder with a deep convolutional neural network (CNN), since we are working with images: CNNs produce a very good representation of an input image by embedding it into a fixed-length vector, which can be used for a variety of computer vision tasks. We will train the CNN on an image classification task and feed its last hidden layer as input to the RNN decoder that generates the descriptions. The resulting model, which we call the Image Caption Generator, combines computer vision with natural language processing. Initially we will train the model on the Flickr8k dataset; if hardware resources permit, we will also train it on the Flickr30k or MSCOCO datasets. All of these datasets are publicly available. For evaluation, we will use the BLEU score.
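The CNN-encoder / RNN-decoder idea above can be sketched as follows. This is a minimal illustrative model, not the final architecture: the tiny convolutional encoder here stands in for a deep classification CNN (whose last hidden layer would supply the image embedding), and all layer sizes are assumptions.

```python
# Minimal sketch of the proposed CNN-encoder / LSTM-decoder captioning model.
# The encoder maps an image to a fixed-length vector; that vector is fed to
# the decoder as if it were the first word of the caption.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy encoder standing in for a deep CNN trained on classification.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E) image embedding
        words = self.embed(captions)                # (B, T, E) caption words
        seq = torch.cat([feats, words], dim=1)      # image vector leads the sequence
        out, _ = self.decoder(seq)
        return self.fc(out)                         # (B, T+1, vocab) word scores
```

During training, the word scores at each step would be compared against the next ground-truth caption word with a cross-entropy loss.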
Project Objectives
Image Captioning can be beneficial in many ways. On social media sites, for instance, it can be used to suggest automatic image descriptions to the user. It can also be integrated with surveillance-camera software to capture frames from security footage, generate descriptions, and pass them through a language-processing model to detect suspicious behavior. For visually impaired people, when a screen reader in a browser reaches an image, it reads the image's alternative text, i.e. the 'alt' attribute of the HTML img tag: <img alt="company_logo" src="logo.jpg">. Instead of a short title, we can place the full description generated by our model there, helping the visually impaired better understand web content. Image captioning can also be beneficial for file indexing: when all images are tagged with their captions, the user can simply search for, say, "beach", and all of their seaside pictures will be filtered and displayed. Last but not least, in robotics, captions can serve as a representation of the environment and surroundings, built from frames captured by the camera on a robot's head; those captions can then be passed to an NLP model to interpret their meaning and help the robot's control algorithm take appropriate actions based on its surroundings.
Project Implementation Method
The final trained model will be integrated with a web-based interface, which can be used to test the model by generating captions. It will also be connected to a CCTV system to capture footage and generate captions, which can be utilized in many ways, for example to detect suspicious activities or to keep a record of people going in or out.
Benefits of the Project
Artificial Intelligence is the future. Companies around the world continue to invest in cognitive software capabilities and expect returns from such investments. The adoption of natural language processing (NLP, the technology at the heart of Amazon Echo), robotic process automation (RPA), and deep learning will be essential in the near future; hence an artificially intelligent model that generates captions is beneficial in several ways.
Captions are a central component of image posts, communicating the background story behind photos. They can enhance audience engagement and are therefore critical to campaigns and advertisements. The automated descriptions the model generates from visual input could be valuable in many domains, and the model can be integrated with any application in which images are a primary interest, from social media sites on the internet to live surveillance-camera feeds monitored for security reasons. Given this potential, it could find considerable commercial uptake. The model can be made available at a price to many different fields:
- Any social media platform that wants to suggest captions to end users for their photos.
- Text-centric fields, where the information in any image can be converted directly into textual form and made available for use.
- Assisting visually impaired people: using smartphone accessibility features, pictures taken with the camera can be automatically captioned, and the generated caption read aloud via text-to-speech.
To generate revenue independently, without any affiliations, a website can be set up with basic, standard, and premium packages, where images can be uploaded in bulk for automatic caption generation.
Technical Details of Final Deliverable
Generating natural language sentences from visual data has been studied for a long time, and much in this area still remains to be improved. Many researchers have shown great interest in generating natural language captions from images, and advances in object detection and image recognition have affected the field of image captioning a great deal. Li et al. started by detecting objects in pictures and then assembled a final description from the detected objects and their likely relationships. Growing interest has also produced approaches that rank candidate captions for a specific image: several descriptions close to the image are taken and ranked by how likely each is to be accurate. These approaches are either heavily rigid or fail to describe unseen compositions of objects, and they offer no way to evaluate how accurate a generated caption is. More recently, a neural-network-based recognizer has been used to detect a larger set of words, and sentences are generated in conjunction with a language model.
In this approach, we will take the idea of image classification and combine it with recurrent networks to generate well-structured sentences, producing a single network that extracts features from images and gives them a meaningful description.
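At inference time, such a network produces a caption word by word, feeding each predicted word back in as the next input until an end-of-sentence token appears. The sketch below shows this greedy decoding loop; `next_word_scores` is a hypothetical stand-in for the trained network's output distribution, used here only for illustration.

```python
# Greedy word-by-word caption generation: the model's own previous
# prediction is fed back as the next input until the end token appears.
def generate_caption(next_word_scores, start="<s>", end="</s>", max_len=20):
    caption = [start]
    for _ in range(max_len):
        scores = next_word_scores(caption)   # dict: word -> score for next word
        word = max(scores, key=scores.get)   # greedy choice (highest score)
        if word == end:
            break
        caption.append(word)
    return caption[1:]                       # drop the start token

# Toy scorer that deterministically walks through one fixed sentence,
# standing in for the real network's predictions.
_SENT = ["a", "dog", "runs", "</s>"]
def toy_scores(prefix):
    nxt = _SENT[len(prefix) - 1]
    return {w: (1.0 if w == nxt else 0.0) for w in _SENT}
```

In practice, beam search (keeping several high-scoring partial captions) is a common alternative to this purely greedy loop.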
A few algorithms using the above model have already been introduced. Recent work by Mao et al. uses an RNN to predict the words of the description, which is not very different from our model, but important changes can improve the results a good deal. Using a more powerful RNN and feeding the image into the RNN directly lets the model keep track of what has already been described and avoid repeating the same words. This may not seem like a significant difference, but it can achieve substantially better results. In addition, a caption-ranking system can be embedded in the model to make it more reliable and improve accuracy. Finally, the model will compare generated captions with the ground truth for evaluation, so its score can be compared with those of other models.
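The BLEU evaluation mentioned above compares a generated caption against ground-truth references using clipped n-gram precision. As a minimal illustration, the sketch below computes a unigram (BLEU-1) score with a brevity penalty in pure Python; real evaluations typically use up to 4-gram BLEU with multiple references (e.g. via NLTK's `sentence_bleu`).

```python
# Minimal unigram BLEU sketch: clipped word precision times a brevity
# penalty that punishes captions shorter than the reference.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word does not inflate the score.
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

For example, `bleu1("a dog runs on grass", "a dog runs on the grass")` has perfect unigram precision but is penalized for being one word shorter than the reference.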
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Security
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Good Health and Well-Being for People; Industry, Innovation and Infrastructure; Sustainable Cities and Communities

Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| RAM 8 GB DDR4 | Equipment | 3 | 5000 | 15000 |
| GeForce GTX 1080 Ti | Equipment | 1 | 30000 | 30000 |
| Core i7 5th Generation | Equipment | 1 | 25000 | 25000 |
| Total (in Rs) | | | | 70000 |