Acoustic modelling using deep learning for Quran recitation assistance

2025-06-28 16:30:07 - Adil Khan

Project Title

Project Area of Specialization: Artificial Intelligence

Project Summary

This project is the first step towards a recitation assistance system for the Qur’an that will let users check whether they are reciting the Qur’an correctly. A traditional Automatic Speech Recognition (ASR) system comprises several modules: the acoustic model, the pronunciation dictionary, the language model and the search decoder. This project aims to develop the acoustic model for the recitation system. The acoustic model relates the phonemes of a language to their spectral information. The key goal is to design and train an acoustic model that converts a recited ayah into its corresponding phonemes by analysing patterns of speech using deep learning methods.
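For reference, the way these four modules fit together is usually written as the standard Bayes decomposition of speech recognition. This is textbook background rather than something specified by the project, and the symbols X (sequence of acoustic feature vectors), W (word sequence) and Q (phoneme sequence) are notation assumed here.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \; p(X \mid W)\, P(W),
\qquad
p(X \mid W) \approx \sum_{Q} p(X \mid Q)\, P(Q \mid W)
```

Under the usual assumption that the acoustics depend on W only through Q, p(X | Q) is the acoustic model, P(Q | W) is the pronunciation dictionary, P(W) is the language model, and the arg max is carried out by the search decoder.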

Project Objectives

The aim of the project is to help spread the correct method of reciting the Holy Qur’an. The technical objectives of the project are:

To collect and transcribe a total of 10 hours of audio data from multiple Qaris to build the dataset.
To develop a module for extracting useful features from the raw speech signal using Mel-Frequency Cepstral Coefficients (MFCCs).
To train an acoustic model to map the extracted feature vectors to their corresponding phonemes.
To develop a web-based software application, based on the MVC pattern, for using this acoustic model.

Project Implementation Method

The project comprises two modules: the feature extraction module and the acoustic model.

The feature extraction module would take a speech waveform as input and apply a series of steps to extract MFCC features. For better performance, the input speech must be sampled at 22050 Hz with a 16-bit sample depth. The raw speech would be sliced into one-second clips, and each clip would be pre-emphasized, windowed into overlapping 25 ms frames with a 10 ms stride, and transformed using the short-time Fourier transform, which preserves the temporal ordering of the frames. The resulting spectrum would be mapped onto the Mel scale, and the log magnitude of this Mel spectrum would then be transformed to the quefrency (cepstral) domain to obtain the MFCCs. Finally, a technique such as Linear Discriminant Analysis (LDA) would be applied to reduce the dimensionality of the feature vectors. The MFCC extraction process would be implemented as an Octave script.
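The extraction steps described above can be sketched roughly as follows. The project plans an Octave script; this sketch uses standard-library Python purely for illustration. The sampling rate, frame length and stride come from the text, while the pre-emphasis coefficient (0.97), Hamming window, 26 Mel filters, 13 coefficients and the use of a DCT-II for the cepstral step are common defaults assumed here, not values stated by the project. The LDA step is omitted, and the naive DFT is far too slow for real use.

```python
import cmath
import math

SAMPLE_RATE = 22050            # Hz, from the text
FRAME_MS, STRIDE_MS = 25, 10   # from the text
PRE_EMPH, N_MELS, N_MFCC = 0.97, 26, 13  # assumed, common defaults


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(signal, frame_len, stride):
    """Pre-emphasise, then cut into overlapping Hamming-windowed frames."""
    emph = [signal[0]] + [signal[i] - PRE_EMPH * signal[i - 1]
                          for i in range(1, len(signal))]
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    starts = range(0, len(emph) - frame_len + 1, stride)
    return [[emph[s + n] * window[n] for n in range(frame_len)] for s in starts]


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (illustration only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_bins, frame_len):
    """Triangular filters spaced evenly on the Mel scale up to Nyquist."""
    top = hz_to_mel(SAMPLE_RATE / 2)
    edges_hz = [mel_to_hz(top * i / (N_MELS + 1)) for i in range(N_MELS + 2)]
    edges = [hz * frame_len / SAMPLE_RATE for hz in edges_hz]  # bin positions
    bank = []
    for m in range(1, N_MELS + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((k - lo) / (mid - lo), (hi - k) / (hi - mid)))
                     for k in range(n_bins)])
    return bank


def mfcc(signal):
    """Return one list of N_MFCC coefficients per frame."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000   # 551 samples
    stride = SAMPLE_RATE * STRIDE_MS // 1000     # 220 samples
    bank = mel_filterbank(frame_len // 2 + 1, frame_len)
    features = []
    for frame in frame_signal(signal, frame_len, stride):
        power = power_spectrum(frame)
        log_mel = [math.log(sum(w * p for w, p in zip(filt, power)) + 1e-10)
                   for filt in bank]
        # DCT-II of the log Mel energies gives the cepstral coefficients.
        features.append([
            sum(e * math.cos(math.pi * c * (j + 0.5) / N_MELS)
                for j, e in enumerate(log_mel))
            for c in range(N_MFCC)
        ])
    return features
```

With these settings a one-second clip (22050 samples) is cut into 98 frames of 551 samples each, so `mfcc` would return 98 vectors of 13 coefficients per clip. A real implementation would use an FFT rather than the naive DFT shown here.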

These feature vectors, along with their transcriptions, would be fed into the acoustic model to learn a mapping from spectral information to phonemes. Traditionally, Gaussian mixture models were used to estimate probabilities and hidden Markov models were used for the mapping, but we will use a deep learning approach, namely convolutional neural networks (CNNs), to train this model, as they have been reported to perform very well in acoustic modelling. The CNN would be implemented in Python using relevant libraries.
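To make the mapping from feature vectors to phonemes concrete, here is a minimal sketch of the forward pass of a very small CNN: one convolution over time with ReLU, followed by a per-step softmax over phoneme classes. It is not the project's architecture. The layer sizes, the five-class phoneme inventory and the random, untrained weights are all hypothetical, and a real model would be built and trained with a deep learning library rather than in plain Python.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the example small.
N_MFCC, N_FILTERS, KERNEL_WIDTH, N_PHONEMES = 13, 8, 3, 5


def init_params(seed=0):
    """Random, untrained weights; training is outside this sketch."""
    rng = random.Random(seed)

    def rand(*shape):
        if len(shape) == 1:
            return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
        return [rand(*shape[1:]) for _ in range(shape[0])]

    return {
        "kernels": rand(N_FILTERS, KERNEL_WIDTH, N_MFCC),
        "conv_bias": [0.0] * N_FILTERS,
        "out_weights": rand(N_PHONEMES, N_FILTERS),
        "out_bias": [0.0] * N_PHONEMES,
    }


def conv1d_relu(frames, kernels, bias):
    """Valid 1-D convolution over time, followed by ReLU."""
    width = len(kernels[0])
    out = []
    for t in range(len(frames) - width + 1):
        row = []
        for kern, b in zip(kernels, bias):
            acc = b
            for w in range(width):
                acc += sum(x * k for x, k in zip(frames[t + w], kern[w]))
            row.append(max(0.0, acc))
        out.append(row)
    return out


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_posteriors(frames, params):
    """Map T feature frames to (T - KERNEL_WIDTH + 1) phoneme distributions."""
    hidden = conv1d_relu(frames, params["kernels"], params["conv_bias"])
    return [
        softmax([sum(h * w for h, w in zip(row, ws)) + b
                 for ws, b in zip(params["out_weights"], params["out_bias"])])
        for row in hidden
    ]
```

Each output row is a probability distribution over the phoneme classes for one window of consecutive frames. Training would adjust the weights so that the highest probability falls on the transcribed phoneme.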

Once these two modules have been developed, an MVC-based web application would be developed to use the model. This application would take input from the user, use the trained model to predict the recited phonemes, and display them back to the user. The web application would use AngularJS on the front end and Python on the back end.
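As a rough illustration of the request flow on the Python back end, the sketch below separates a model layer from a controller function, with the JSON response standing in for what the front-end view would render. No web framework is assumed. The names `PhonemeModel` and `handle_recitation`, the JSON field names and the idea of sending raw samples as a JSON list are all hypothetical choices made for this example.

```python
import json


class PhonemeModel:
    """Model layer: wraps whatever trained predictor is available."""

    def __init__(self, predictor):
        # predictor: callable taking a list of samples, returning phoneme labels
        self._predictor = predictor

    def transcribe(self, samples):
        return list(self._predictor(samples))


def handle_recitation(model, request_body):
    """Controller: validate the input, call the model, build a JSON reply.

    Returns an (HTTP status code, JSON string) pair for the view to render.
    """
    try:
        payload = json.loads(request_body)
        samples = [float(s) for s in payload["samples"]]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a numeric 'samples' list"}
        return 400, json.dumps(error)
    if not samples:
        return 400, json.dumps({"error": "no audio samples received"})
    return 200, json.dumps({"phonemes": model.transcribe(samples)})
```

Keeping the predictor behind `PhonemeModel` means the controller can be exercised with a stub function before a trained acoustic model exists.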

Benefits of the Project

The beneficiaries of the project are not limited to the people of Pakistan but include Muslims all around the world. Users of the complete application would be able to validate their recitation without needing human assistance, which helps spread knowledge of the Qur’an among the masses. The architecture of the application would be designed so that, if trained properly, it can be adapted to other Surahs, reciters and dialects, enabling others to extend it.

Technical Details of Final Deliverable

The final deliverable would comprise:

A feature extraction module to generate feature vectors from audio waveforms.
The trained model to predict the phonemes given the feature vectors.
A web application using these two modules to provide a phoneme transcription for the recited verse.
An instruction manual for using the model and application.

Final Deliverable of the Project: Software System
Core Industry: Education
Other Industries:
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Quality Education

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
GPU GeForce RTX 2060 AMP	Equipment	1	65000	65000
Printing	Miscellaneous	1	2000	2000
Microphone	Equipment	1	2500	2500
			Total (in Rs)	69500
