Cross-Lingual Text Reuse Detection for English-Urdu Language Pairs

2025-06-28 16:31:00 - Adil Khan

Project Title

Cross-Lingual Text Reuse Detection for English-Urdu Language Pairs

Project Area of Specialization: Artificial Intelligence

Project Summary

Our system detects cross-lingual text reuse at the sentence and passage level. Plagiarism and text reuse have become more common with the growth of the Internet. Therefore, it is important to check scientific papers for cheating and unacknowledged text reuse.

Moreover, English-to-English text reuse detectors are available, but across languages such as English and Urdu, people still face problems. Our system will efficiently detect text reuse for the Urdu-English language pair. This not only opens a door for Pakistani researchers to work on the Urdu language but will also increase the value of the Urdu language.

In order to support Urdu language:

Project Objectives

Nowadays, there is a need to develop a system which can detect text reuse between Urdu and English. Plagiarism and text reuse have become more common with the growth of the Internet. Therefore, it is important to check scientific papers for cheating and unacknowledged text reuse. Moreover, English-to-English text reuse detectors are available, but across languages such as English and Urdu, people still face problems. Our system will efficiently detect text reuse for the Urdu-English language pair and:

Text reuse is an offence that is increasing day by day, and it is hard to detect across different languages. This leads us to design a text reuse detection system for our native language, Urdu. This not only opens a door for Pakistani researchers to work on the Urdu language but will also increase the value of the Urdu language.

Natural Language Processing is a difficult and time-consuming field. Monolingual techniques are becoming popular in NLP because of their applications in plagiarism detection for the English language, but little work has been done on cross-lingual detection, and for the Urdu language in particular there isn't any work being done. Hence we are unable to catch the culprits who steal content.

None of the techniques developed for other languages is sufficient for Urdu. So, to enable a quick response to such activities, an efficient automated text reuse detection system is needed.

To develop a cross-lingual text reuse detection system for Urdu-English language pairs. Implementing text processing with deep learning techniques is a challenging task, and developing fast and efficient algorithms that achieve high accuracy on large datasets remains a research issue.

A benchmark dataset for the research community.

Project Implementation Method

We will use deep learning techniques in this project. We will extract features with a CNN and train an LSTM model on these features.
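The proposal does not specify the network architecture, so the following is only a minimal sketch of the feature-extraction stage, assuming a 1D convolution over token embeddings followed by ReLU and max-over-time pooling. The function name `conv1d_features`, the toy embeddings, and the filters are all hypothetical; in practice these would be learned inside a deep learning framework, and the resulting feature vectors (one per sentence) would form the sequence an LSTM is trained on. The LSTM itself is not shown.

```python
from typing import List

Vector = List[float]


def conv1d_features(embeddings: List[Vector], filters: List[List[Vector]]) -> Vector:
    """Slide each filter over the token embeddings, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling)."""
    features: Vector = []
    for filt in filters:
        width = len(filt)  # number of consecutive tokens the filter covers
        best = 0.0         # ReLU floor: negative activations count as 0
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            activation = sum(
                weight * value
                for filt_row, emb_row in zip(filt, window)
                for weight, value in zip(filt_row, emb_row)
            )
            best = max(best, activation)
        features.append(best)
    return features


# Toy input: three tokens with 2-dimensional embeddings (illustrative values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical filters, each spanning two tokens.
toy_filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]

sentence_features = conv1d_features(sentence, toy_filters)
```

Each filter yields one number per sentence, so the feature vector length equals the number of filters regardless of sentence length.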

Both the iterative and the incremental development models are preferred for this project because we will be testing our project several times during development to achieve maximum accuracy. Of the four modules, one module is implemented and tested for accuracy, then the next, and so on.

Python will be the programming language for our project. Our primary host platform is Anaconda on Linux, with Windows as the secondary platform. Anaconda is a Python-based AI/ML platform that allows organizations to develop, govern, and automate AI/ML and data science work from training to production.

The system budget will be:

Benefits of the Project

Our system will detect text reuse for Urdu-English language pairs.

Our major tasks will be:

Develop Corpora (Cross-Lingual Text Reuse Detector): Develop benchmark corpora at sentence and passage level for the English-Urdu language pair.

Propose Technique: Propose suitable techniques for the developed corpora.

Implement and Apply Proposed Technique: Develop the technique for measuring cross-lingual text reuse for the English-Urdu language pair.

Apply State-of-the-Art Techniques: Apply state-of-the-art techniques for measuring cross-lingual text reuse for the English-Urdu language pair.

Compare Techniques: Compare the state-of-the-art techniques and the proposed technique on the same benchmark.
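The comparison step above can be sketched as follows. This is a minimal illustration, not the project's method: it assumes a benchmark of labelled English-Urdu sentence pairs and scores two stand-in detectors with precision, recall, and F1 on the same data. The tiny lexicon `TOY_DICT`, the four benchmark pairs, and both detectors (`dictionary_detector`, `length_detector`) are hypothetical placeholders for the proposed and state-of-the-art techniques.

```python
from typing import Callable, Dict, List, Set, Tuple

# Urdu tokens for the toy example, named so that the dictionary and the
# sentences are guaranteed to use identical strings.
LARKA = "لڑکا"    # boy
KITAB = "کتاب"    # book
PARHTA = "پڑھتا"  # reads
PANI = "پانی"     # water
PEETA = "پیتا"    # drinks
HAI = "ہے"        # is (auxiliary; deliberately left out of the lexicon)

# Hypothetical Urdu -> English lexicon; a real system would need a full
# bilingual dictionary or a machine translation model.
TOY_DICT: Dict[str, str] = {
    LARKA: "boy", KITAB: "book", PARHTA: "reads", PANI: "water", PEETA: "drinks",
}
STOPWORDS: Set[str] = {"the", "a", "is"}

Pair = Tuple[str, str, bool]  # (English sentence, Urdu sentence, gold label)
Detector = Callable[[str, str], bool]


def dictionary_detector(english: str, urdu: str, threshold: float = 0.5) -> bool:
    """Translate Urdu word by word, then compare word sets by Jaccard overlap."""
    translated = {TOY_DICT[w] for w in urdu.split() if w in TOY_DICT}
    content = {w for w in english.lower().split() if w not in STOPWORDS}
    union = translated | content
    if not union:
        return False
    return len(translated & content) / len(union) >= threshold


def length_detector(english: str, urdu: str, min_ratio: float = 0.8) -> bool:
    """Naive baseline: flag a pair when the token counts are similar."""
    n_en, n_ur = len(english.split()), len(urdu.split())
    if max(n_en, n_ur) == 0:
        return False
    return min(n_en, n_ur) / max(n_en, n_ur) >= min_ratio


def evaluate(detector: Detector, benchmark: List[Pair]) -> Dict[str, float]:
    """Score one detector on the benchmark with precision, recall, and F1."""
    tp = fp = fn = 0
    for english, urdu, gold in benchmark:
        predicted = detector(english, urdu)
        if predicted and gold:
            tp += 1
        elif predicted and not gold:
            fp += 1
        elif gold and not predicted:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative benchmark: two reused pairs and two unrelated pairs.
BENCHMARK: List[Pair] = [
    ("the boy reads a book", f"{LARKA} {KITAB} {PARHTA} {HAI}", True),
    ("the boy drinks water", f"{LARKA} {PANI} {PEETA} {HAI}", True),
    ("the boy reads a book", f"{LARKA} {PANI} {PEETA} {HAI}", False),
    ("the boy drinks water", f"{LARKA} {KITAB} {PARHTA} {HAI}", False),
]

results = {
    name: evaluate(detector, BENCHMARK)
    for name, detector in [("dictionary", dictionary_detector), ("length", length_detector)]
}
```

Because both detectors are scored against the same labelled pairs, their precision, recall, and F1 values are directly comparable, which is the point of sharing one benchmark.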

So that the person monitoring the whole system can act against them. Our main target will be:

Technical Details of Final Deliverable

The final deliverable will be a complete automated system, delivered as a desktop application. The system prototype can be deployed on any website to detect duplicate web content, and can be used in machine translation systems, for cross-lingual plagiarism detection, and to detect plagiarism in journalism.

Final Deliverable of the Project: Software System
Core Industry: Security
Other Industries: Education, Media
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Peace, Justice and Strong Institutions

Required Resources

Item Name                        Type        No. of Units   Per Unit Cost (in Rs)   Total (in Rs)
GPU GTX 1080Ti/1050Ti (8/11GB)   Equipment   1              60000                   60000
Computer System                  Equipment   1              10000                   10000
Total (in Rs)                                                                       70000
