Ligature Segmentation of Urdu OCR
2025-06-28 16:33:59 - Adil Khan
Project Area of Specialization: Electrical/Electronic Engineering
Project Summary
The proposed project is to develop an efficient and accurate ligature segmentation technique for an Urdu OCR system on Field Programmable Gate Arrays (FPGAs). Optical Character Recognition (OCR) is the technique used to convert images of printed or handwritten text into a machine-encoded electronic format, so that the text can be edited, searched, and stored efficiently.
Urdu is the official language of Pakistan, and a considerable amount of printed Urdu material is held in our libraries, offices and universities but is not available in a digitally accessible form. Areas such as journalism and history are in dire need of efficient OCR systems, as studies show a direct link between a nation’s coherence and how its history is preserved and made accessible. An Urdu OCR system is therefore extremely important for digitizing valuable printed and handwritten Urdu material, preserving it and making it searchable for a large audience.
Urdu OCR is more difficult than OCR for many other languages because of its Nastalique script. The script's complex calligraphic nature poses major problems for the segmentation and recognition of Urdu text: high cursiveness, diagonal writing, context sensitivity, and overlapping characters.
Urdu OCR for printed and handwritten literature is an ongoing research area. To date, a recognition accuracy of 96% has been reported, limited by the small number of available data sets, and considerable improvement is still needed on large data sets. Moreover, analysis shows that high accuracy in ligature segmentation leads to high efficiency in an Urdu OCR system. To broaden the application of Urdu OCR, we propose an efficient ligature segmentation method deployed on Field Programmable Gate Arrays (FPGAs) to achieve high-speed, real-time performance on large data sets.
The main aim of this project is to propose an efficient ligature segmentation method for Urdu OCR and to implement it efficiently on FPGA hardware. Efficiency is measured in terms of recognition accuracy and hardware resource usage.
Project Objectives
- To develop an efficient and accurate ligature segmentation technique for an Urdu OCR system that can handle the overlapping of ligatures caused by the diagonality of the Nastalique script.
- To deploy the segmentation technique on an FPGA for real-time processing of large data sets.
- To build a data set of correctly segmented ligatures from Urdu printed text images.
- To analyze the effect of picture quality on our segmentation technique.
- To analyze the time and power consumption of the FPGA implementation in order to improve efficiency.
This project aims to develop an accurate, high-speed Urdu OCR system. The OCR system comprises five stages: image acquisition, pre-processing, segmentation, classification and post-processing. Image acquisition captures the document in digital form, here with a digital camera, so that it can be processed by a computer. Pre-processing removes distortions introduced during image acquisition, such as quality degradation and orientation errors. Segmentation divides a source image into sub-components; it proceeds from page segmentation to line segmentation and then text segmentation. Classification is a computational process that sorts the images of segmented ligatures or characters into groups according to their similarities, using machine learning and deep learning methods. Finally, post-processing is applied to improve the classification and recognition results.
The workflow of the proposed project is discussed below:
Literature Review: To better understand the drawbacks of ligature segmentation techniques previously used at the national and international level, a detailed study of existing pre-processing and segmentation methods will be carried out.
Dataset collection: Urdu printed text in Nastalique script will be scanned with a high-resolution camera to minimize the noise introduced during image acquisition. Achieving high accuracy in an Urdu OCR system requires a large data set containing a mix of high- and low-quality images. To enlarge the data set, Urdu novels and newspapers in Nastalique style that are available online will also be downloaded.
Pre-Processing: After the Urdu Printed Text Images (UPTI) have been collected, pre-processing techniques will be applied depending on the quality of the stored images. Binarization (thresholding) and border removal methods will be developed so that text lines can be extracted efficiently.
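The binarization step can be illustrated with a small sketch. The proposal does not name a specific thresholding method, so Otsu's global threshold is used here purely as an assumed example; the function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be an 8-bit grayscale grid with dark ink on light paper.

```python
def otsu_threshold(gray):
    """Return the gray level t (0..255) that maximizes between-class variance.

    `gray` is a list of rows of integers in 0..255. Pixels <= t form the
    dark class, pixels > t the light class. Otsu's method is an assumption
    of this sketch, not a method stated in the proposal.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue          # no dark pixels yet, nothing to separate
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map a grayscale image to 1 (ink) / 0 (background) using Otsu's threshold."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]
```

A global threshold is the simplest choice; for the low-quality images mentioned in the data set description, a locally adaptive threshold may be more appropriate, which is a design decision this sketch leaves open.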
Segmentation: Once the input image has been pre-processed, text lines can be separated using the projection profile. After line segmentation, the main task is to extract ligatures from each text line. Connected component analysis has been used for ligature segmentation by contributors in this field at both the national and international level. To achieve high segmentation accuracy, we aim to use image dilation or erosion to associate each secondary ligature correctly with its primary ligature.
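The segmentation steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a binary image stored as lists of 0/1 (1 = ink), and the function names, the purely vertical dilation and the `radius` parameter are all hypothetical choices made for the example.

```python
from collections import deque


def segment_lines(binary):
    """Return (top, bottom) row ranges of text lines via the horizontal projection profile."""
    profile = [sum(row) for row in binary]   # ink pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                        # first inked row of a line
        elif count == 0 and start is not None:
            lines.append((start, y))         # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


def dilate_vertical(binary, radius):
    """A pixel becomes ink if any pixel within `radius` rows in its column is ink."""
    h, w = len(binary), len(binary[0])
    return [[1 if any(binary[yy][x]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1)))
             else 0
             for x in range(w)]
            for y in range(h)]


def label_components(binary):
    """8-connected component labelling by breadth-first search."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if binary[ny][nx] and not labels[ny][nx]:
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count


def segment_ligatures(line, radius=1):
    """Group the ink of one text line into ligatures.

    The line is dilated vertically so that secondary components (dots,
    diacritics) touch their primary body, the dilated image is labelled,
    and the labels are masked back onto the original ink pixels.
    """
    labels, count = label_components(dilate_vertical(line, radius))
    masked = [[label if ink else 0 for label, ink in zip(label_row, ink_row)]
              for label_row, ink_row in zip(labels, line)]
    return masked, count
```

Plain connected component labelling would treat a detached dot as its own component; dilating first lets it inherit the label of the body beneath it. The same dilation can also merge neighbouring ligatures that overlap diagonally in Nastalique, so the radius and the shape of the structuring element would need to be tuned on real data.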
FPGA implementation: Processing large amounts of data increases power consumption on CPUs and GPUs. To achieve a high-speed, real-time and efficient OCR process, we therefore aim to deploy the setup on an FPGA as an acceleration platform.
Benefits of the Project
- The developed OCR technology can be integrated with a text-to-speech converter to help blind people read Urdu books, newspapers, etc.
- This project could be a breakthrough for the tourism industry of Pakistan, one of the country's major revenue-generating sectors. A proper OCR system will help tourists translate sign boards, brochures and literature at historical places written in Urdu into their own language.
- The circulation of printed Urdu magazines is decreasing day by day. With growing access to the internet through electronic devices, the need to digitize Urdu magazines containing inspirational and character-building stories for children, as well as accounts of historical and religious events, has risen significantly, since digital documents can be shared more easily and quickly. An Urdu OCR system is therefore extremely important for digitizing and preserving this material.
- Studies have shown that 99.3% of the population of Pakistan cannot access Urdu content on their devices. This number reaches almost 100% when Urdu literature more than ten years old is considered. It is currently impossible for the general public to search or explore Urdu accounts of historical events on their devices, which creates a barrier for our younger generations in connecting with our history. One direct impact of this project is to use Urdu OCR technology to address this barrier, opening a pathway towards a more cognizant Pakistani society.
- The proposed OCR system can be used in banking operations, mail services, office automation, and assistive devices, which will help millions of people.
- The efficiency of a business increases when its information is quickly searchable. Data in printed form is difficult to search and manage, and OCR (Optical Character Recognition) tools were developed to improve this data entry process.
This project aims to deliver a highly accurate Urdu OCR system for printed text images by implementing ligature segmentation of Urdu Nastalique on an FPGA. The FPGA will serve as an acceleration platform, processing large data sets in parallel.
To minimize unwanted noise introduced during image acquisition, a high-quality image of printed text in Nastalique style will be captured with a digital camera and passed to the FPGA. This high-resolution image of Urdu printed text will be the input to the pre-processing and segmentation phases, as shown in Figure 1.
The main objective of the proposed project will be achieved by addressing inter- and intra-ligature overlapping in printed Nastalique script. After high accuracy and efficiency in ligature segmentation have been achieved on the data set, the technique will be analyzed again to further improve the OCR system.
Targeted Dataset value:
- Targeted set of input images: 3000
- Expected accuracy of ligature segmentation: 100%
- Power consumption: very low during parallel operation on large data sets, through the use of an FPGA.
Fig.1. Final Deliverable Layout
Final Deliverable of the Project: Hardware System
Core Industry: Education
Other Industries: IT, Legal
Core Technology: Big Data
Other Technologies: Artificial Intelligence (AI)
Sustainable Development Goals: Quality Education, Decent Work and Economic Growth
Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Joy Book Scan V160 Visualizer 16MP Document Camera for High Quality | Equipment | 1 | 38000 | 38000 |
| PYNQ-Z1: Python Productivity for Zynq-7000 ARM/FPGA SoC + Accessories | Equipment | 1 | 32000 | 32000 |
| Total (in Rs) | | | | 70000 |