Bilingual Automatic Speech Recognition System

2025-06-28 16:30:37 - Adil Khan

Project Title

Project Area of Specialization

Artificial Intelligence

Project Summary

Urdu is the Persianised standard register of the Hindustani language, and it presents several challenges for Natural Language Processing (NLP) tasks and the development of speech technologies. One is that few language resources, such as lexicons, corpora, or NLP software, are available for Urdu. Another is bilingualism and interference from English, a legacy of British rule, which lasted for around 200 years.
To address this, we have created a Bilingual Automatic Speech Recognition (ASR) system for Urdu speakers who frequently mix English words and sentences into their speech. Our ASR is able to detect and recognize both pure Urdu and English sentences, as well as Urdu sentences with English words mixed in.
To our knowledge, there is no previous work on bilingual ASR for the combination of English and Urdu. Our project is therefore mainly research based: we explore different techniques and, in the end, present a working prototype integrated within an assistant application with near-production accuracy. We have divided our work into increments in which we gradually implement increasingly complex systems. This allows us to grasp the technology and the techniques needed to achieve our final goal.

Project Objectives

We aim to create a Bilingual Automatic Speech Recognition system for Urdu speakers who frequently mix English words and sentences into their speech. The system is needed because English has an extensive influence on the everyday speech of the common Urdu speaker. Our ASR will be able to detect and recognize both pure Urdu and English sentences, as well as Urdu sentences with English words mixed in, and produce a transcript for each.

Project Implementation Method

In pursuit of a satisfactory bilingual speech recognition experience, we have made use of DeepSpeech, an open source engine maintained by Mozilla that implements the Deep Speech architecture published by Baidu and that has, over time, received contributions from numerous open source collaborators. DeepSpeech is based on the TensorFlow library. The core of the system is a bidirectional recurrent neural network (BRNN) trained to ingest speech spectrograms and generate English text transcriptions. A pre-trained English model is available for use. We employ this system to first train a model for Urdu, and then extend it to cover bilingual cases.

The DeepSpeech platform requires relatively large datasets compared to other approaches. As a starting point, we use the 75-hour Read Urdu Multi-Speaker (RUMI) Corpus collected and provided by CSALT18. The dataset is cleaned and a transcript compatible with DeepSpeech is prepared. The transcript is fed to KenLM, an open source tool for building language models. The resulting language model, along with the transcript and an alphabet list, is fed to a binary provided with DeepSpeech to create a trie, which is then passed to the DeepSpeech application along with the transcript and language model for training. The dataset is split into training, validation, and testing sets. Since training requires significant processing power, it was done on GPUs provided by Google on the Google Colab research platform. The resultant model is used as a base for further fine-tuning.
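
The data preparation steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual scripts: the function names, the 80/10/10 split, and the sample clips are hypothetical. The CSV columns (wav_filename, wav_filesize, transcript) and the one-character-per-line alphabet file follow the layout that DeepSpeech training expects, and the corpus text (one sentence per line) is the kind of input KenLM reads.

```python
import csv
import io
import random


def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle reproducibly and split into (train, validation, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def to_manifest_csv(samples):
    """Render (wav_path, size_in_bytes, transcript) tuples as manifest CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path, size, text in samples:
        writer.writerow([path, size, text])
    return out.getvalue()


def build_alphabet(transcripts):
    """Collect every distinct character; written one per line to the alphabet list."""
    return sorted({ch for text in transcripts for ch in text})


def lm_corpus(transcripts):
    """One sentence per line: plain-text input for language model training."""
    return "\n".join(transcripts) + "\n"


if __name__ == "__main__":
    # Made-up clips for illustration only; not taken from the RUMI corpus.
    samples = [(f"clips/{i:04d}.wav", 32044, f"sample sentence {i}") for i in range(10)]
    train, val, test = split_dataset(samples)
    print(len(train), len(val), len(test))
    print(to_manifest_csv(train[:2]))
    print(build_alphabet(text for _, _, text in samples))
```

Fixing the random seed keeps the three sets stable between runs, so a clip never moves from the test set into the training set when the manifests are regenerated.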

Our model created through DeepSpeech has an accuracy of approximately 80% on the test dataset, corresponding to a character error rate (CER) of 20%. For testing and deployment, this model and the DeepSpeech library were hosted on a Node.js-enabled server.
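
For readers unfamiliar with the metric, the sketch below shows one common way to compute CER: the character-level edit (Levenshtein) distance between the reference and the recognized text, divided by the number of reference characters. The function names and the example strings are illustrative assumptions; the source does not state how its CER was computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


if __name__ == "__main__":
    # Two wrong characters out of ten reference characters.
    print(cer(["abcdefghij"], ["abcdefghXY"]))  # 0.2
```

Reading "accuracy" as 1 minus CER, as the text above does, is an informal convention: CER can exceed 100% when the hypothesis contains many inserted characters.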

An application was created in which a user can make any business query by voice. The recorded voice is sent to the Node.js server, which returns the inferred text transcript.

The application has several other features, the core of which is assistance with banking tasks. A chatbot was integrated, connected to an IBM Watson Assistant instance hosted on IBM Cloud. The text returned by DeepSpeech is sent to Watson Assistant for processing, and an appropriate response is generated and displayed to the user.

Benefits of the Project

A fully functioning bilingual automatic speech recognition system capable of recognizing both Urdu and English speech and producing the corresponding transcripts.
We will also write a research paper to aid future researchers in this field.

Technical Details of Final Deliverable

We made use of the 75-hour Read Urdu Multi-Speaker (RUMI) Corpus collected and provided by CSALT18. The dataset's transcripts were first converted from Arabic-script Urdu into Roman Urdu. Next, the dataset was used to train a speech recognition model with the help of the DeepSpeech engine, which is based on the Deep Speech architecture from Baidu.
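
The source does not describe how the script conversion was carried out. Purely as an illustration of the idea, the sketch below applies a naive character-by-character mapping from Arabic-script Urdu letters to Latin letters. The mapping table is a hypothetical, deliberately tiny subset, and a real converter would also need a lexicon or rules, because short vowels are usually not written in Urdu script.

```python
# Hypothetical, incomplete mapping for illustration only.
CHAR_MAP = {
    "\u0627": "a",  # alef
    "\u0628": "b",  # beh
    "\u062A": "t",  # teh
    "\u06A9": "k",  # keheh
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    " ": " ",
}


def romanize(text, unknown="?"):
    """Map each character independently; unmapped characters become `unknown`."""
    return "".join(CHAR_MAP.get(ch, unknown) for ch in text)


if __name__ == "__main__":
    # The word for "book" (keheh, teh, alef, beh) comes out as "ktab",
    # not "kitab", because the short vowel is not written in the source script.
    print(romanize("\u06A9\u062A\u0627\u0628"))
```

One practical consequence of working in Roman Urdu is that Urdu and English transcripts share a single Latin alphabet list.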

The resultant model is hosted locally on a Node.js-enabled server. The application sends a request containing an audio clip; the server runs inference on the clip and sends the resulting text back to the user application in the response.
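
The request and response exchange can be sketched from the client side as follows. This is an assumption-laden illustration rather than the project's code: the endpoint URL, the raw WAV request body, and the plain-text response are all hypothetical, since the source does not specify the server's interface. The 16 kHz, mono, 16-bit audio format is likewise an assumption, chosen because it is the usual input format for DeepSpeech models.

```python
import io
import urllib.request
import wave

SERVER_URL = "http://localhost:3000/transcribe"  # hypothetical endpoint


def make_silent_wav(seconds=1.0, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV clip of silence (stand-in for a recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def build_transcription_request(wav_bytes, url=SERVER_URL):
    """Wrap the audio clip in an HTTP POST request; nothing is sent yet."""
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def transcribe(wav_bytes, url=SERVER_URL):
    """Send the clip and return the response body, assumed to be the transcript."""
    request = build_transcription_request(wav_bytes, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server listening at the assumed address, `transcribe(make_silent_wav())` would perform the round trip; building the request separately makes it possible to inspect it without a running server.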

The user interacts with a locally developed Android application named ProBanker, which assists with various everyday banking activities. The interface takes the form of a chat in which the user can enter either text or voice. The voice option records the audio and sends it to the Node.js server mentioned above. The returned text is used in the chat.

To understand and process the text (including speech converted to text), we employ an IBM Watson Assistant instance hosted on IBM Cloud. The assistant is trained to understand banking-related queries in both Roman Urdu and English. Once the query is understood, the application takes the required action. User information is stored in and retrieved from a NoSQL database hosted on Google Firebase.

Final Deliverable of the Project

Software System

Type of Industry

IT, Telecommunication

Technologies

Artificial Intelligence (AI), Big Data

Sustainable Development Goals

Industry, Innovation and Infrastructure

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
Graphics Processing Unit	Equipment	1	70000	70000
			Total (in Rs)	70000
