Cross Lingual Text Reuse Detection for English Urdu Language pairs
2025-06-28 16:31:00 - Adil Khan
Project Area of Specialization: Artificial Intelligence
Project Summary
Our system detects cross-lingual text reuse at the sentence and passage level. Plagiarism and text reuse have become easier with the growth of the Internet. It is therefore important to check scientific papers for cheating and unacknowledged text reuse.
Moreover, English-to-English text reuse detectors are available, but detection across languages, such as English to Urdu, remains a problem. Our system will efficiently detect text reuse for the English-Urdu language pair. This not only opens a door for Pakistani researchers to work on the Urdu language but will also increase the value of the Urdu language.
In order to support the Urdu language:
- We will develop benchmark corpora at the sentence and passage level for the English-Urdu language pair.
- We will propose, implement, and apply a suitable technique to the developed corpora for measuring cross-lingual text reuse for the English-Urdu language pair.
- We will also apply the state-of-the-art technique for cross-lingual text reuse detection for the English-Urdu language pair.
- Finally, we will compare the above techniques and report their relative performance.
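As an illustration of the corpus item in the list above, one possible record layout for a sentence- or passage-level benchmark is sketched below. The project does not specify its corpus format, so the field names, the `level` values, and the three reuse labels are assumptions made for this example only.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical label and level sets; the project does not fix its own.
LABELS = {"wholly_derived", "partially_derived", "non_derived"}
LEVELS = {"sentence", "passage"}


@dataclass
class CorpusPair:
    pair_id: str
    level: str       # "sentence" or "passage"
    source_en: str   # English source text
    derived_ur: str  # Urdu text that may reuse the source
    label: str       # one of LABELS

    def __post_init__(self):
        # Reject records whose level or label is outside the assumed sets.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, leaving Urdu text unescaped."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text):
    """Parse JSON Lines back into CorpusPair records."""
    return [CorpusPair(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Invented example record, not taken from any real corpus.
example = CorpusPair(
    pair_id="s-0001",
    level="sentence",
    source_en="The boy reads a book.",
    derived_ur="لڑکا کتاب پڑھتا ہے۔",
    label="wholly_derived",
)
```

Storing one record per line keeps sentence-level and passage-level pairs in the same file format, so the same loader could serve every technique evaluated on the benchmark.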
Nowadays, there is a need for a system that can detect text reuse between Urdu and English. Our system will efficiently detect text reuse for the English-Urdu language pair and:
- Detect cross-lingual text reuse.
- Increase the popularity of the Urdu language.
- Automate monitoring.
- Minimize manual effort.
Text reuse is an offence that is increasing day by day, and it is hard to detect across languages. This led us to design a text reuse detection system for our native language, Urdu. This not only opens a door for Pakistani researchers to work on the Urdu language but will also increase the value of the Urdu language.
Natural Language Processing (NLP) is a difficult and time-consuming field. Monolingual techniques are becoming popular in NLP because of their applications in plagiarism detection for English, but little work has been done on cross-lingual detection, and for Urdu in particular, to our knowledge, none has been done. As a result, people who steal content go undetected.
None of the techniques developed for other languages is sufficient for Urdu. To enable a quick response to such activities, an efficient automated text reuse detection system is needed.
Developing a cross-lingual text reuse detection system for the Urdu-English language pair is a challenging task. Implementing text processing with deep learning techniques is currently a research issue: fast and efficient algorithms are needed that achieve high accuracy while handling large datasets.
The project will also provide a benchmark dataset for the research community.
Project Implementation Method
We will use deep learning techniques in this project. We will extract features of the text with a CNN and train an LSTM model on these features.
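The data flow described above (convolutional features fed into an LSTM) can be sketched as an untrained forward pass. This is only an illustration under stated assumptions: the layer sizes, the random weights, the single output score, and all function names are invented here, and the project does not say how the English and Urdu sides are combined into one input. In practice a deep learning framework would supply trained layers; plain Python is used so the sketch has no dependencies.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def conv1d_relu(seq, filters, width):
    """Slide each filter over windows of `width` consecutive token vectors.

    seq: list of T token vectors of size D.
    filters: list of F weight vectors of size width * D.
    Returns T - width + 1 feature vectors of size F (ReLU applied).
    """
    out = []
    for t in range(len(seq) - width + 1):
        window = [x for vec in seq[t:t + width] for x in vec]  # flatten the window
        out.append([max(0.0, sum(w * x for w, x in zip(f, window))) for f in filters])
    return out


def lstm_final_state(seq, W, U, b, hidden):
    """Run one LSTM layer over seq and return the last hidden state.

    W: (4*hidden) x input_dim, U: (4*hidden) x hidden, b: 4*hidden.
    Rows are stacked in the order: input, forget, candidate, output.
    """
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        z = [
            b[r]
            + sum(wi * xi for wi, xi in zip(W[r], x))
            + sum(ui * hi for ui, hi in zip(U[r], h))
            for r in range(4 * hidden)
        ]
        i = [sigmoid(v) for v in z[0:hidden]]
        f = [sigmoid(v) for v in z[hidden:2 * hidden]]
        g = [math.tanh(v) for v in z[2 * hidden:3 * hidden]]
        o = [sigmoid(v) for v in z[3 * hidden:4 * hidden]]
        c = [f[k] * c[k] + i[k] * g[k] for k in range(hidden)]
        h = [o[k] * math.tanh(c[k]) for k in range(hidden)]
    return h


def init_params(embed_dim, n_filters, width, hidden, seed=0):
    """Random, untrained weights (assumed sizes, for illustration only)."""
    rng = random.Random(seed)
    return {
        "width": width,
        "hidden": hidden,
        "filters": rand_matrix(n_filters, width * embed_dim, rng),
        "W": rand_matrix(4 * hidden, n_filters, rng),
        "U": rand_matrix(4 * hidden, hidden, rng),
        "b": [0.0] * (4 * hidden),
        "out": [rng.uniform(-0.1, 0.1) for _ in range(hidden)],
    }


def reuse_score(token_vectors, params):
    """CNN features -> LSTM -> sigmoid score in (0, 1)."""
    feats = conv1d_relu(token_vectors, params["filters"], params["width"])
    h = lstm_final_state(feats, params["W"], params["U"], params["b"], params["hidden"])
    return sigmoid(sum(w * v for w, v in zip(params["out"], h)))


# Random vectors stand in for the embeddings of 12 input tokens.
rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(12)]
params = init_params(embed_dim=8, n_filters=6, width=3, hidden=5)
score = reuse_score(tokens, params)
```

Because the weights are random, the score carries no meaning until the model is trained on labelled pairs; the sketch only shows how the shapes pass from the convolution to the recurrent layer.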
Both iterative and incremental development models are preferred for this project because we will test the system several times during development to achieve maximum accuracy. Of the four modules, one is implemented and tested for accuracy before work moves on to the next.
Python will be the programming language for our project. Our primary host platform is Anaconda on Linux, with Windows as the secondary platform. Anaconda is an AI/ML platform that allows organizations to develop, govern, and automate AI/ML and data science work in Python, from training to production.
The estimated hardware budget is:
- Approx. Rs 30-40k for the computer system
- Approx. Rs 60-70k for a GPU, GTX 1080Ti/1050Ti (8/11GB)
Our system will detect text reuse for the Urdu-English language pair.
- The system can detect text reuse for the English-Urdu language pair.
- The system can detect text reuse for the English-Urdu language pair at the sentence level.
- The system can detect text reuse for the English-Urdu language pair at the passage level.
- The system will be trained on big data.
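To make the difference between sentence-level and passage-level detection concrete, the sketch below uses a simple translate-then-compare baseline. It is not the project's proposed technique: the five-word lexicon, the word-by-word translation, the Jaccard overlap, and every function name are assumptions chosen to keep the example small.

```python
import re

# Toy Urdu-to-English lexicon, invented for this example only.
TOY_LEXICON = {
    "لڑکا": "boy",
    "کتاب": "book",
    "پڑھتا": "reads",
    "پانی": "water",
    "ہے": "is",
}


def tokenize(text):
    """Lowercase and keep runs of word characters (works for Latin and Urdu script)."""
    return re.findall(r"\w+", text.lower())


def translate(urdu_text, lexicon=TOY_LEXICON):
    """Word-by-word lookup; words missing from the lexicon are dropped."""
    return [lexicon[t] for t in tokenize(urdu_text) if t in lexicon]


def jaccard(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def sentence_score(english, urdu):
    """Sentence level: overlap between English tokens and translated Urdu tokens."""
    return jaccard(tokenize(english), translate(urdu))


def split_sentences(text):
    """Split on Latin and Urdu sentence-final marks."""
    return [s.strip() for s in re.split(r"[.!?۔؟]", text) if s.strip()]


def passage_score(english_passage, urdu_passage):
    """Passage level: average, over Urdu sentences, of the best-matching English sentence."""
    en = split_sentences(english_passage)
    ur = split_sentences(urdu_passage)
    if not en or not ur:
        return 0.0
    return sum(max(sentence_score(e, u) for e in en) for u in ur) / len(ur)


reused = sentence_score("The boy reads a book.", "لڑکا کتاب پڑھتا ہے۔")
unrelated = sentence_score("Water is cold.", "لڑکا کتاب پڑھتا ہے۔")
```

Here `reused` is higher than `unrelated` because three of the translated Urdu words also occur in the first English sentence. A real system would need a threshold or a trained classifier on top of such scores, and a far larger translation resource than this toy lexicon.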
Our major tasks will be:
- Develop corpora (Cross Lingual Text Reuse Detector): Develop benchmark corpora at the sentence and passage level for the English-Urdu language pair.
- Propose technique: Propose suitable techniques for the developed corpora.
- Implement and apply the proposed technique: Develop the technique for measuring cross-lingual text reuse for the English-Urdu language pair.
- Apply state-of-the-art techniques: Apply the state-of-the-art technique for cross-lingual text reuse detection for the English-Urdu language pair.
- Compare techniques: Compare the state-of-the-art technique and the proposed technique using the same benchmark.
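Comparing techniques on the same benchmark means scoring every technique against one shared set of gold labels. A minimal sketch using precision, recall, and F1 follows; the choice of these metrics, the names `technique_a` and `technique_b`, and the label lists are assumptions invented for illustration, not results from the project.

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: equal-length lists of booleans (True = text reuse)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def compare(gold, runs):
    """Score every run against the same gold labels.

    runs: dict mapping a technique name to its list of predictions.
    """
    return {name: precision_recall_f1(gold, preds) for name, preds in runs.items()}


# Invented labels for five benchmark pairs (illustration only).
gold = [True, True, False, False, True]
runs = {
    "technique_a": [True, False, False, False, True],
    "technique_b": [True, True, True, False, False],
}
results = compare(gold, runs)
```

Passing the same `gold` list to every run is what makes the numbers comparable; scoring each technique on its own test set would not be.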
The system will report detected reuse so that the person monitoring it can act against those responsible. Our main targets will be:
- Reduce the manpower required.
- Increase accuracy.
The final deliverable will be a complete automated system in the form of a desktop application. The system prototype can be deployed on any website to detect duplicate web content, and it can be used in machine translation systems, for cross-lingual plagiarism detection, and to detect plagiarism in journalism.
Final Deliverable of the Project: Software System
Core Industry: Security
Other Industries: Education, Media
Core Technology: Artificial Intelligence (AI)
Other Technologies:
Sustainable Development Goals: Peace, Justice and Strong Institutions
Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| GPU GTX 1080Ti/1050Ti (8/11GB) | Equipment | 1 | 60000 | 60000 |
| Computer System | Equipment | 1 | 10000 | 10000 |
| Total (in Rs) | | | | 70000 |