Adil Khan 9 months ago
AdiKhanOfficial #FYP Ideas


Project Title

Image Captioning Deep Learning Model

Project Area of Specialization

Artificial Intelligence

Project Summary

Automatic generation of natural-language captions describing the visual content of an image has attracted increasing attention over the last decade due to its potential applications. Caption generation is challenging: a textual description must be produced from a given photograph or scene, which requires both computer vision methods to understand the content of the image and a language model from the field of natural language processing to express that understanding in words. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption directly from a photo or video, instead of requiring sophisticated data preparation or a pipeline of specially designed models.

Our image captioning model underpins a system that can help visually impaired people better understand their surroundings.

Although visually impaired people use other senses such as hearing and touch to recognize the events and objects around them, their quality of life can be dramatically lower than the standard level. Therefore, to address this problem, we are using image captioning as an assistive technology and combining it with smart glasses to improve the quality of life of the visually impaired.

This is a new captioning approach for describing the visual content of an image; it can be integrated into hardware such as smart glasses to guide visually impaired people efficiently and safely, helping them overcome the difficulties of traveling.

This device will serve as a consumer product that helps visually impaired people travel safely without carrying extra hardware, such as a mobile phone, to be aware of obstacles in front of them. They will only need to wear the smart glasses as an accessory, which will make their lives not only simpler and more accessible but also socially meaningful and enjoyable. At its current stage of development, the glasses' recognition capability is limited to indoor object detection. As we improve the model, and with the funding and resources provided by NGIRI, the glasses will be able to identify more objects; this requires more graphics processing unit (GPU) capacity, which is used in the feature-extraction stage of building the model, along with other resources.

Project Objectives

The Image Captioning Deep Learning model is an artificial intelligence model that automatically describes images with one or more natural-language sentences. The final system is intended to be fast and accurate, built on state-of-the-art methodologies. The image caption generator is built around three project objectives:

Firstly, for media and publishing companies that generate content every day, captioning images has so far been a manual effort. It takes a significant amount of work when a high volume of images is published online. Generating captions that accurately describe each object and its relationship with its surroundings is the job of our artificial intelligence model. Our solution is an image-captioning Application Programming Interface (API): a service for fast image captioning that will greatly reduce the effort of content creators and publishers and add value through rapid responses.

Secondly, it can aid visually impaired users and make it easy for them to organize and navigate large amounts of typically unstructured visual data. To generate high-quality captions, the model needs to incorporate fine-grained visual clues from the image. A mobile application will serve this purpose: it guides users about the objects and environment surrounding them and converts captions into voice-over instructions for navigation. We will build a mobile narrator application that helps someone with complete blindness or impaired vision carry out routine activities easily. Unlike other systems, this one will not only read text from a document but also apply dense image captioning to produce captions for the images present in the document, convert both into audio, and narrate them to the user, giving the feel of human narration.

Moreover, smart guiding glasses for visually impaired people in indoor environments are a further feature of our project. To overcome the difficulty of traveling, we propose a smart guiding device in the shape of a pair of eyeglasses that guides visually impaired people efficiently and safely. Audio assistance will be implemented together with a distance sensor that measures the distance between an obstacle and the user. Whereas Ray-Ban Stories smart glasses are used for recording, our project serves a different purpose: observing the environment and converting it into audible captions and descriptions through a wearable. It thus serves as a consumer device that helps visually impaired people travel safely.

Project Implementation Method

We implemented a deep recurrent architecture that automatically produces a short description of an image. Our model uses a Convolutional Neural Network (CNN), a ResNet50 pre-trained on ImageNet, to obtain image features. We then feed these features into a Long Short-Term Memory (LSTM) network to generate a description of the image in English. CNNs have been widely used and studied for image tasks and are currently state-of-the-art methods for object recognition and detection; the idea behind using a pre-trained ResNet50 is that it can already parse out objects that may be useful for captioning. The LSTM-based model predicts the sequence of words, the caption, from the feature vector obtained from the ResNet50 network. For this whole process we use the Flickr30k dataset, which contains about 31,000 images, each with 5 captions.

Technical Approach / Implementation:

We implemented a deep recurrent architecture that automatically produces a short description of an image. Our model uses a CNN (convolutional neural network), a ResNet50 pre-trained on ImageNet, to obtain image features. We then feed these features into an LSTM (Long Short-Term Memory) network to generate a description of the image in English.
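The two-stage flow described above can be sketched in a few lines. This is a minimal illustration of the control flow only: `encode_image` and `next_word` below are hypothetical stand-ins for the trained ResNet50 encoder and LSTM decoder, not real model code.

```python
# Sketch of the CNN-encode / LSTM-decode pipeline with greedy word-by-word
# decoding. The two model functions are stubs standing in for trained networks.

START, END = "<start>", "<end>"

def encode_image(image):
    # Stand-in for the ResNet50 encoder: returns a 2048-d feature vector.
    return [0.0] * 2048

def next_word(features, words_so_far):
    # Stand-in for the LSTM decoder: emits a canned two-word caption.
    canned = ["a", "dog", END]
    return canned[min(len(words_so_far) - 1, 2)]

def generate_caption(image, max_len=20):
    """Greedy decoding: repeatedly ask the decoder for the next word."""
    features = encode_image(image)
    words = [START]
    while words[-1] != END and len(words) < max_len:
        words.append(next_word(features, words))
    return " ".join(words[1:-1])   # drop the <start>/<end> markers

print(generate_caption(None))  # a dog
```

In the real system the decoder would return a distribution over the vocabulary and the loop would pick the most probable word (or run a beam search) until the end token is produced.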

CNN-based Image Feature Extractor:

For feature extraction, we use a CNN. CNNs have been widely used and studied for image tasks, and are currently state-of-the-art methods for object recognition and detection. Concretely, for all input images, we extract features using a ResNet50 model. ResNet50 models are trained on the ImageNet dataset, which contains over a million labeled images; the classification subset used for training has 1,000 categories, so the top classification layer of ResNet50 has a dimension of 1,000. The idea behind using a pre-trained ResNet50 model is that it is already able to parse out objects that may be useful in image captioning.

LSTM-based Sentence Generator:

Although RNNs have proven successful on tasks such as text generation and speech recognition, it is difficult to train them to learn long-term dynamics. This is likely due to the vanishing and exploding gradient problems that result from propagating gradients through the many layers of a recurrent network. LSTM networks provide a solution by incorporating memory units that allow the network to learn when to forget previous hidden states and when to update them given new information.
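A possible shape for this decoder in PyTorch is sketched below. It is one common design, not the only one: the image feature initializes the LSTM's hidden and cell states, and the network is trained to predict the next word of the caption at each step. The dimensions (256-d embeddings, 512-d hidden state, 5,000-word vocabulary) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM decoder: maps a 2048-d image feature plus the word history
    to next-word logits over the vocabulary."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial h
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial c
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        h0 = self.init_h(feats).unsqueeze(0)   # (1, B, H)
        c0 = self.init_c(feats).unsqueeze(0)   # (1, B, H)
        emb = self.embed(captions)             # (B, T, E)
        out, _ = self.lstm(emb, (h0, c0))      # (B, T, H)
        return self.fc(out)                    # (B, T, vocab) next-word logits

decoder = CaptionDecoder(vocab_size=5000)
feats = torch.randn(4, 2048)                  # 4 image feature vectors
caps = torch.randint(0, 5000, (4, 12))        # 4 captions of 12 word indices
logits = decoder(feats, caps)
print(logits.shape)                           # torch.Size([4, 12, 5000])
```

Training would minimize cross-entropy between these logits and the caption shifted by one position; at inference time, words are fed back in one at a time.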

Dataset:

Here we use the Flickr30k dataset, which contains about 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.
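Since each image has five reference captions, a small preprocessing step groups captions by image. The sketch below assumes a captions file where each line has the form `<image>.jpg#<caption index><tab><caption>`; the exact layout depends on which distribution of Flickr30k is used, and the sample lines are only illustrative.

```python
# Group Flickr30k-style caption lines by image name.
from collections import defaultdict

raw = (
    "1000092795.jpg#0\tTwo young guys with shaggy hair look at their hands .\n"
    "1000092795.jpg#1\tTwo young , White males are outside near many bushes .\n"
    "10002456.jpg#0\tSeveral men in hard hats operate a giant pulley system ."
)

captions = defaultdict(list)          # image name -> list of its captions
for line in raw.splitlines():
    key, text = line.split("\t", 1)
    image_name = key.split("#")[0]    # strip the "#<index>" suffix
    captions[image_name].append(text.strip())

print(len(captions["1000092795.jpg"]))  # 2
```

In the full pipeline each caption would then be lowercased, tokenized, wrapped in start/end tokens, and mapped to vocabulary indices before being fed to the LSTM.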

Benefits of the Project

The Image Captioning Deep Learning model will have the following benefits according to the three scopes and objectives discussed:

  1. Utilizing image captioning as assistive technology and building a product that helps the visually impaired community. It allows people with impaired vision to perceive the world around them using a smartphone; this is a huge help, and many applications can be developed in that space.
  2. The state-of-the-art smart glasses described in the objectives will provide a new option for the visually impaired.
  3. The smart glasses will assist blind people and give them the independence to live their lives without depending on others.
  4. Smart glasses free visually impaired people from carrying any other equipment, or operating a mobile application, while traveling.
  5. The model will have the quality attributes of accuracy and fast caption retrieval.
  6. Beyond decreasing social isolation and empowering the blind to become more autonomous, the glasses can read printed material, recognize objects, and vocalize the wearer's position, creating a sense of security.
  7. The content-creation industry is in need of the service we offer through the API: rapid responses with descriptions of images.
  8. As an alternative to the smart glasses, a mobile application with a well-designed user interface will also benefit the visually impaired community.
  9. To increase quality of life, we provide a portable and user-friendly smartphone-based platform capable of generating captions and text descriptions, including a narrator option, from an image taken with the smartphone camera, so that users can carry out their routine activities easily.

The benefits above are limited to our implementation scope and plan. More generally, image captioning is an active research area, and new ways are still being found to improve such models for wide application in societal good, technology, and business.

Technical Details of Final Deliverable

Image captioning is a fundamental task that requires a semantic understanding of images and the ability to construct meaningful sentences from generated keywords. It requires computer vision to understand the content of the image and a language model from the field of natural language processing to turn that understanding into words in the right order.

The technical work on the project is divided according to the final products to be built and implemented with the Image Captioning Deep Learning Model.

For the first objective, making content creation easy and fast, we provide companies with an Application Programming Interface (API) as a service. Companies, or users of the community edition, register once and then use a pay-as-you-go service. The web application will have a modern user interface (UI) through which the user provides images (one at a time or in bulk) for description. The artificial intelligence (AI) model will generate relevant captions, and the description for each image will be delivered through the interactive UI quickly and accurately.
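A minimal sketch of such an API, assuming Flask: the `/caption` endpoint name and the stub `generate_caption` function are illustrative placeholders, not the final design, and the stub stands in for the trained CNN+LSTM model.

```python
# Sketch of an image-captioning web API using Flask.
import io
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_caption(image_bytes: bytes) -> str:
    # Placeholder: the real service would decode the image and run the model.
    return "a placeholder caption"

@app.route("/caption", methods=["POST"])
def caption():
    if "image" not in request.files:
        return jsonify(error="no image file uploaded"), 400
    data = request.files["image"].read()
    return jsonify(caption=generate_caption(data))

# Exercise the endpoint with Flask's built-in test client.
client = app.test_client()
resp = client.post("/caption",
                   data={"image": (io.BytesIO(b"fake-bytes"), "photo.jpg")})
print(resp.get_json()["caption"])  # a placeholder caption
```

Bulk captioning could accept multiple files in one request, and the pay-as-you-go model would sit in front of this endpoint as an API-key check and usage counter.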

Secondly, the second objective requires an Android application that helps visually impaired users through their cell phones: the application surveys the environment and returns specific instructions through voice assistance. With the same pay-as-you-go model and a single one-time registration with no further sign-ups, we will continue providing the service. The application identifies the objects in the surroundings by accessing the user's camera and guides the user by converting image captions to voice, acting as a voice assistant for daily chores.

With the third objective, we aim to build a pair of smart glasses resembling Ray-Ban glasses in form. Where Ray-Ban Stories serve the purpose of recording, our project will observe the surroundings, identify the objects around the wearer, and provide closely related captions and descriptions of the environment to the visually impaired user, working as wearable assistive technology.

For all the technical requirements above, we will need a high-end GPU for training on many images, so that new and unseen images from the surroundings are detected easily. If we receive sufficient funding for the smart glasses as a hardware product, we will be able to cover the cost of hardware resources, high-quality sensors, and access to the Meta resources needed for this project, along with additional assets if required.

Final Deliverable of the Project

HW/SW integrated system

Core Industry

IT

Other Industries

Media, Others, Health

Core Technology

Artificial Intelligence(AI)

Other Technologies

Wearables and Implantables

Sustainable Development Goals

Good Health and Well-Being for People, Decent Work and Economic Growth, Industry, Innovation and Infrastructure

Required Resources

Item Name                             Type       No. of Units  Per Unit Cost (Rs)  Total (Rs)
NVIDIA GeForce GTX 1050 Ti GPU        Equipment  1             43,000              43,000
Sensors (camera, motion, capacitive)  Equipment  1             2,000               2,000
Speakers                              Equipment  1             3,500               3,500
Total (Rs)                                                                         48,500
If you need this project, please contact me on contact@adikhanofficial.com