Image Captioning Deep Learning Model
Automatic generation of natural-language captions describing the visual content of an image has attracted increasing attention over the last decade due to its potential applications. It is a challenging task: a textual description must be generated from a given photograph or scene, which requires both computer vision methods to understand the content of the image and a language model from the field of natural language processing. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption from a photo or video, without sophisticated data preparation or a pipeline of specifically designed models.
Our image captioning model underpins a system that can help visually impaired people better understand their surroundings.
Although visually impaired people use other senses such as hearing and touch to recognize the events and objects around them, their quality of life can still be dramatically lower than the standard level. To address this problem, we are applying image captioning as an assistive technology and combining it with smart glasses to improve the quality of life of the visually impaired.
This is a new captioning approach for describing the visual content of an image that can be integrated into hardware such as smart glasses, guiding visually impaired people efficiently and safely and easing the difficulty of traveling.
The device will serve as a consumer product that helps visually impaired people travel safely without carrying additional hardware, such as a mobile phone, to be aware of obstacles in front of them. Users will only need to wear the smart glasses as an accessory, which will make their lives not only more accessible but also more socially meaningful and enjoyable. At the current stage of development, the glasses' object recognition will be limited to indoor object detection. With the funding and resources provided by NGIRI, however, the glasses will be able to identify more objects; this requires additional graphics processing unit (GPU) capacity for the feature-extraction stage of model building, along with other resources.
The Image Captioning Deep Learning model is an artificial intelligence model that automatically describes images with one or more natural-language sentences. The final project aims to be one of the fastest and most accurate models, built on state-of-the-art methodologies. The image caption generator serves three project objectives:
First, for media and publishing companies that generate content every day, captioning images has so far been a manual effort, and it takes significant work when a high volume of images is published online. Generating captions that accurately describe each object and its relationship with its surroundings is the job of our artificial intelligence model. Our solution is an image-captioning Application Programming Interface (API) that provides fast image captioning as a service, greatly reducing the effort of content creators and publishers.
Second, the model can aid visually impaired users and make it easy to organize and navigate large amounts of typically unstructured visual data. To generate high-quality captions, the model needs to incorporate fine-grained visual clues from the image. A mobile application (narrator) serves this purpose: it guides users through the objects and environment surrounding them and converts the captions into voice-over instructions for navigation, helping someone with complete blindness or impaired vision carry out routine activities easily. Unlike other systems, this system will not only read the text from a document but also apply dense image captioning to produce captions for the images in the document, convert both into audio, and narrate them to the user with the feel of human narration.
Third, smart guiding glasses for visually impaired people in indoor environments are an extended feature of our project. To ease the difficulty of traveling for the visually impaired, we propose a smart guiding device in the shape of a pair of eyeglasses that helps these people travel efficiently and safely. Audio assistance will be implemented along with a distance sensor that measures the distance between an obstacle and the user. Whereas Ray-Ban Stories smart glasses are used for recording, our model serves a different purpose: observing the environment and converting it into audible captions and descriptions through a wearable. The device thus acts as a consumer product that helps visually impaired people travel safely.
Technical Approach / Implementation:
We implemented a deep recurrent architecture that automatically produces a short description of an image. Our model uses a convolutional neural network (CNN), specifically a ResNet50 pre-trained on ImageNet, to obtain image features. We then feed these features into a Long Short-Term Memory (LSTM) network to generate a description of the image in English.
CNN-based Image Feature Extractor:
For feature extraction, we use a CNN. CNNs have been widely used and studied for image tasks and are currently state-of-the-art methods for object recognition and detection. Concretely, for all input images we extract features using a ResNet50 model. ResNet50 models are trained on ImageNet; while the full ImageNet dataset contains millions of images in over 20,000 categories, the standard ILSVRC training subset covers 1,000 categories, so the top classification layer of ResNet50 has a dimension of 1,000. The idea behind using a pre-trained ResNet50 is that it can already parse out objects that may be useful in image captioning.
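As a minimal sketch of this step (assuming a TensorFlow/Keras environment; the function names here are our own, not part of the final system), the 1,000-way classification head is dropped and the pooled activations are kept as the feature vector:

```python
# Sketch of the feature-extraction step, assuming TensorFlow/Keras is available.
# We drop ResNet50's 1,000-way classification head (include_top=False) and keep
# the globally average-pooled 2,048-dimensional activations as the image feature.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.models import Model

def build_extractor(weights="imagenet"):
    """Return a model mapping a 224x224 RGB image to a 2,048-d feature vector."""
    base = ResNet50(weights=weights, include_top=False, pooling="avg",
                    input_shape=(224, 224, 3))
    return Model(inputs=base.input, outputs=base.output)

def extract_features(extractor, images):
    """images: float array of shape (n, 224, 224, 3) in 0..255 RGB order."""
    x = preprocess_input(images.astype("float32"))
    return extractor.predict(x, verbose=0)  # shape (n, 2048)
```

These 2,048-dimensional vectors are what the LSTM sentence generator below consumes in place of raw pixels.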
LSTM-based Sentence Generator:
Although RNNs have proven successful on tasks such as text generation and speech recognition, they are difficult to train to learn long-term dynamics. This is likely due to the vanishing and exploding gradients that can result from propagating gradients down through the many layers of a recurrent network. LSTM networks provide a solution by incorporating memory units that allow the network to learn when to forget previous hidden states and when to update them given new information.
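At inference time the LSTM emits the caption one token at a time, conditioned on the image features and the words generated so far. The greedy decoding loop can be sketched as below; `next_word` stands in for the trained LSTM (here stubbed with a lookup table so the loop itself is runnable), and the token names are illustrative:

```python
# Illustrative sketch of how the LSTM decoder emits a caption word by word.
# In the real model, next_word(features, caption) would run the LSTM over the
# image feature vector and the partial caption and return the most likely
# next token; here it is stubbed so the decoding loop can be demonstrated.
START, END = "<start>", "<end>"

def greedy_decode(next_word, features, max_len=20):
    """Repeatedly ask the model for the next token until <end> or max_len."""
    caption = [START]
    for _ in range(max_len):
        token = next_word(features, caption)
        if token == END:
            break
        caption.append(token)
    return caption[1:]  # drop the <start> marker

# Stub "model": a canned continuation table keyed by the partial caption.
_demo = {
    (START,): "a",
    (START, "a"): "dog",
    (START, "a", "dog"): END,
}
def demo_next_word(features, caption):
    return _demo[tuple(caption)]
```

With the stub, `greedy_decode(demo_next_word, None)` returns `["a", "dog"]`; swapping in the trained LSTM yields full captions for real feature vectors.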
Dataset:
Here we used the Flickr30k dataset, which contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.
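Loading the reference sentences can be sketched as follows, assuming the commonly distributed tab-separated caption file where each line reads `image.jpg#N<TAB>caption` (N being the caption index 0-4); the exact filename and format of the copy used in the project may differ:

```python
# Sketch of loading Flickr30k captions, assuming the common tab-separated
# token-file format: "1000092795.jpg#0<TAB>Two young guys ...".
from collections import defaultdict

def load_captions(path):
    """Map each image filename to its list of reference captions."""
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, text = line.split("\t", 1)
            image_id = key.split("#")[0]  # drop the "#N" caption index
            captions[image_id].append(text)
    return dict(captions)
```

The resulting dictionary pairs each image's feature vector with its 5 reference sentences for training and evaluation.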
The Image Captioning Deep Learning model offers benefits under each of the three scopes and objectives discussed above.
These benefits are limited to our implementation scope and plan; more generally, this model is a research-oriented project, and new ways are still being found to improve it for use in wider areas, including societal good, technology, and business.
Image captioning is a fundamental task that requires semantic understanding of images and the construction of meaningful sentences from the generated keywords. It requires computer vision to understand the content of the image and a language model from the field of natural language processing to turn that understanding into words in the right order.
On the technical side, the work is divided according to the final products to be built and implemented with the Image Captioning Deep Learning Model.
For the first objective, making content creation easy and fast, we will provide companies with an Application Programming Interface (API) as a service. Companies, or users on the community edition, register once and are then billed on a pay-as-you-go basis. The web application will have a modern user interface (UI) through which the user provides images (one at a time, or in bulk) for description. The artificial intelligence (AI) model will generate relevant captions, and the description for each image will be delivered through the interactive UI quickly and accurately.
Second, we will build an Android application to help visually impaired users through their cell phones: the application surveys the environment and returns specific instructions through voice assistance. With the same pay-as-you-go model and no repeated sign-ups, we will register each user only once and continue providing the service. The application identifies objects in the surroundings through the user's camera and converts the captions for those images into voice guidance, acting as a voice assistant for their daily chores.
With the third objective, we aim to build a pair of smart glasses resembling Ray-Ban glasses; whereas those serve the purpose of recording stories, our project will observe the surroundings, identify the objects in them, and provide closely related captions and descriptions of the environment to the visually impaired community, working as wearable assistive technology.
All of the above requires a high-end GPU for training on a large number of images, so that new and unseen images from the surroundings are detected reliably. If we receive sufficient funding for the smart glasses as a hardware product, we will be able to cover the cost of hardware resources, high-quality sensors, and access to Meta resources needed for this project, along with additional assets if required.
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| NVIDIA GEFORCE GTX 1050 TI GPU | Equipment | 1 | 43000 | 43000 |
| Sensors (camera, motion, Capacitive, etc.) | Equipment | 1 | 2000 | 2000 |
| Speakers | Equipment | 1 | 3500 | 3500 |
| Total (in Rs) | | | | 48500 |