Sentinel Paraphrase Detection
| Project Title |
Sentinel Paraphrase Detection
| Project Area of Specialization |
Artificial Intelligence | | Project Summary |
History
Natural Language Processing (NLP) is one of the fastest-growing fields of our time; we want our systems to be capable of understanding the high-level language of human beings. In NLP, machines are trained to learn natural language so that they can generate and retrieve data in natural language. Paraphrasing is the rewording and rearrangement of one text into another. It can be done within a single language (monolingual paraphrasing). The monolingual technique is becoming popular in NLP because of its various applications in paraphrase detection.
- Natural Language Processing (NLP) focuses on developing computer systems that can analyze, understand and generate natural human languages.
- One of the major difficulties in natural language processing is ambiguity, where the same text has several possible interpretations.
- Another, equally challenging aspect is that the same content can be conveyed in different ways. This is termed paraphrasing.
Need
Paraphrasing has been in play since the beginning of formalized education, so there will probably always be students or other individuals who paraphrase. "An evenly difficult perspective is that the same content and data can be used in different ways known as Paraphrasing."
Related work on paraphrase detection includes the following:
- Jun Choi Lee and Yu-N Cheah [1] presented a semantic relatedness measure based on the synset shortest path in WordNet for paraphrase detection.
- Chen Liang, Praveen Paritosh, Vinodh Rajendran and Kenneth D. Forbus [2] proposed a new alignment-based approach to learning semantic similarity. They used a hybrid representation of attributed relational graphs to encode lexical, syntactic and semantic information.
- Hoang-Quoc Nguyen-Son, Yusuke Miyao and Isao Echizen [3] proposed an approach to paraphrase detection using identical phrase and similar word matching.
- Wenpeng Yin and Hinrich Schütze [5] presented a new deep learning architecture, Bi-CNN-MI, for paraphrase identification.
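The shortest-path idea behind the first approach can be illustrated with a minimal sketch. It does not reproduce the measure of [1]; it only shows one common way to turn a path length between two concepts into a relatedness score, 1 / (1 + distance). The graph `TOY_EDGES` is a hypothetical hand-made hierarchy, not real WordNet data, and all function names are illustrative.

```python
from collections import deque

# Hypothetical toy concept hierarchy (undirected edges); NOT real WordNet data.
TOY_EDGES = [
    ("car", "motor_vehicle"),
    ("truck", "motor_vehicle"),
    ("motor_vehicle", "vehicle"),
    ("bicycle", "vehicle"),
    ("vehicle", "entity"),
    ("dog", "animal"),
    ("animal", "entity"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_relatedness(graph, a, b):
    """Relatedness in (0, 1]: 1 / (1 + shortest path length); 0.0 if no path."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


GRAPH = build_graph(TOY_EDGES)
```

In this toy graph, "car" and "truck" are two edges apart and "car" and "dog" are five edges apart, so the first pair scores higher. A real system would query WordNet synsets instead of this hand-written graph.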
References
- [1] Jun Choi Lee and Yu-N Cheah (2016). "Paraphrase Detection using Semantic Relatedness based on Synset Shortest Path in WordNet." In: International Conference on Advance
- [2] Chen Liang, Praveen Paritosh, Vinodh Rajendran and Kenneth D. Forbus (2016). "Learning Paraphrase Identification with Structural Alignment." IJCAI 2016, New York.
- [3] Hoang-Quoc Nguyen-Son, Yusuke Miyao and Isao Echizen (2019). "Paraphrase Detection Based on Identical Phrase and Similar Word Matching." 29th Pacific Asia Conference on Language.
- [4] Jun Choi Lee and Yu-N Cheah (2017). "Paraphrase Detection using String Similarity with Synonyms." The Fourth Asian Conference on Information Systems, ACIS 2017.
Purpose
We will develop a prototype system for monolingual, sentence-level paraphrase detection that takes two English sentences and outputs whether or not they are paraphrases of each other. | | Project Objectives |
Monolingual Paraphrase Detection
The derived and source texts are in the same language.
Objective
The monolingual technique is becoming popular in NLP because of its various applications in plagiarism detection.
- Build a large, standardized benchmark corpus.
- Use state-of-the-art techniques.
- Train the model on the corpus.
- Develop our own technique as the proposed technique.
- Compare both techniques on the developed corpus, i.e. train models using both the proposed and the state-of-the-art techniques.
- Develop a prototype interface.
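The intended input and output of the prototype (two English sentences in, a paraphrased / not paraphrased decision out) can be sketched as follows. This is only an illustration of the interface: the Jaccard word-overlap score and the `threshold=0.5` default are placeholder assumptions standing in for the trained model, and the function names are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(sentence_a, sentence_b):
    """Word-overlap score: |intersection| / |union| of the two token sets."""
    tokens_a, tokens_b = tokenize(sentence_a), tokenize(sentence_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty inputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def detect_paraphrase(sentence_a, sentence_b, threshold=0.5):
    """Return the decision label. The threshold is an assumed placeholder value."""
    score = jaccard(sentence_a, sentence_b)
    return "paraphrased" if score >= threshold else "not paraphrased"
```

A word-overlap baseline like this misses paraphrases that use synonyms, which is one reason a learned model is planned instead.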
Abstract
- We want to develop paraphrase detection at the sentence level.
- We want to develop an English-English benchmark corpus at the sentence level (simulated, real and artificial cases).
Research Gap
- No work has been done at the sentence level yet.
- A corpus of real cases is available at the document level, but no work has been done yet for simulated cases.
Proposed Solution
- We will implement a technique and build a dataset using deep learning approaches.
- We will implement the technique and compare it with state-of-the-art techniques.
- We will develop an English paraphrase detection corpus covering real and artificial data.
- We will develop our own proposed solution.
- We will compare the state-of-the-art techniques with our own proposed technique to find the best fit for the corpus.
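The comparison step above can be sketched as a small evaluation routine. The proposal does not name an evaluation metric, so the use of accuracy, precision, recall and F1, and the choice of F1 to pick the best technique, are assumptions made for illustration; the technique names in the usage note are hypothetical.

```python
def evaluate(gold, predicted):
    """Binary classification metrics; 1 = paraphrased, 0 = not paraphrased."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def compare_techniques(gold, predictions_by_name):
    """Score every technique on the same gold labels and pick the highest F1."""
    scores = {name: evaluate(gold, preds) for name, preds in predictions_by_name.items()}
    best = max(scores, key=lambda name: scores[name]["f1"])
    return scores, best
```

For example, `compare_techniques([1, 1, 0, 0], {"state_of_the_art": [1, 0, 0, 0], "proposed": [1, 1, 1, 0]})` scores both techniques on the same four labelled pairs. Both must be evaluated on the same held-out part of the corpus for the comparison to be meaningful.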
Application
Paraphrase detection is important for applications such as:
- Summarization
- Information retrieval
- Information extraction
- Question answering
| | Project Implementation Method |
Goals and Guidelines
We follow this principle:
- The KISS principle ("Keep it simple, stupid!")
Development Methodology
Our system will be developed and delivered using the agile methodology. The development phase will be subdivided into a number of iterations, each focused mainly on improving the system. The system will be tested and its behavior noted under different circumstances. Each component will be developed and tested concurrently. All the scenarios that a user may encounter will be considered while developing the system.
Architectural Strategies
The architecture comprises the following main elements, which serve as the pillars of the system:
• Data Collection
• Model Training
• Interface
The components are layered on top of each other. The collected data feeds the model that needs to be trained, and the interface gets its results from the trained model. Each of them contributes equally to the end result. Each component depends on the one before it, as the operations take place sequentially.
Design Description
The system will be designed from the bottom level up to the top level, as the quality of later developments depends highly on earlier ones. The first concern is to build a corpus that provides the basis for the system, and then to move on to the model. The interface will be delivered last, once all the back-end work is complete. The design of the system consists of the following phases:
- Collecting raw data
- Saving the informative data
- Annotation process
- Displaying the data
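The four phases above can be sketched as a small sequential pipeline. Everything here is an illustrative assumption: the `SentencePair` record, its field names, the `min_tokens` filter, and the hard-coded `RAW` samples are hypothetical, and writing the filtered data to disk is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SentencePair:
    """One corpus entry (hypothetical schema)."""
    source_sentence: str
    derived_sentence: str
    origin: str                   # where the pair was collected, e.g. "twitter"
    label: Optional[bool] = None  # True = paraphrased; None = not yet annotated


# Hypothetical raw samples standing in for collected data.
RAW: List[Tuple[str, str, str]] = [
    ("He bought a new car yesterday.", "Yesterday he purchased a new car.", "twitter"),
    ("ok", "okay", "chat"),
    ("The match starts at noon.", "Rain is expected tomorrow evening.", "chat"),
]


def collect_raw_data() -> List[Tuple[str, str, str]]:
    """Phase 1: gather raw (source, derived, origin) triples."""
    return list(RAW)


def keep_informative(raw, min_tokens: int = 3) -> List[SentencePair]:
    """Phase 2: keep only pairs in which both sentences have at least min_tokens words."""
    pairs = []
    for source, derived, origin in raw:
        if min(len(source.split()), len(derived.split())) >= min_tokens:
            pairs.append(SentencePair(source.strip(), derived.strip(), origin))
    return pairs


def annotate(pairs: List[SentencePair], annotator: Callable[[SentencePair], bool]):
    """Phase 3: attach a label to each pair using the supplied annotator."""
    for pair in pairs:
        pair.label = bool(annotator(pair))
    return pairs


def display(pairs: List[SentencePair]) -> List[str]:
    """Phase 4: render each pair as one human-readable line."""
    lines = []
    for pair in pairs:
        if pair.label is None:
            tag = "unlabelled"
        else:
            tag = "paraphrased" if pair.label else "not paraphrased"
        lines.append(f"[{pair.origin}] {pair.source_sentence} | {pair.derived_sentence} -> {tag}")
    return lines
```

Each phase consumes the output of the previous one, which mirrors the sequential, layered dependency described in the architecture. In the real project the `annotator` argument would be a human annotation step rather than a function.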
Use of a particular type of product
We will be using the following (programming language, database, library, etc.):
- Python as the programming language
- Anaconda or PyCharm as the working platform
- The corpus discussed above
- NLTK, pandas, Keras and similar libraries
Reuse of existing software components to implement various parts/features of the system
No. We will be using our own dataset, collected from various sources such as:
- Twitter
- Chat boxes
- Facebook, etc.
However, we will use state-of-the-art techniques to train our model on the corpus.
Future plans for extending or enhancing the software
The system can be extended by deploying it on a website. That requires considerably more time, so it is left as a future extension.
User interface paradigms (or system input and output models)
The system prototype will be somewhat similar to the following form. | | Benefits of the Project |
The monolingual technique is becoming popular in NLP because of its various applications in plagiarism detection.
- Build a large, standardized benchmark corpus.
- Use state-of-the-art techniques.
- Train the model on the corpus.
- Develop our own technique as the proposed technique.
- Compare both techniques on the developed corpus, i.e. train models using both the proposed and the state-of-the-art techniques.
- Develop a prototype interface.
Paraphrase detection is important for applications such as:
- Summarization
- Information retrieval
- Information extraction
- Question answering
It is often difficult for readers to see how paraphrased or quoted ideas fit with the broader discussion, because they have not read the same source material as the writer. Thus, in psychological writing, paraphrasing is considered bad writing practice. We will therefore detect paraphrasing at the sentence level, which should increase the accuracy of detection at this stage. | | Technical Details of Final Deliverable |
- A large, standardized benchmark corpus. It would contain somewhat more than 1 lakh (100,000) entries.
- Use state-of-the-art techniques such as word2vec.
- Train the model on the corpus.
- Develop our own technique as the proposed technique.
- Compare both techniques on the developed corpus, i.e. train models using both the proposed and the state-of-the-art techniques.
- Develop a prototype interface.
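One common way to use word2vec-style embeddings for this task is to average the word vectors of each sentence and compare the averages with cosine similarity. The sketch below shows that idea under stated assumptions: `TOY_VECTORS` is a hand-made, three-dimensional stand-in for real trained embeddings, and the proposal does not say that this particular averaging scheme will be used.

```python
import math

# Hypothetical 3-dimensional toy embeddings; real word2vec vectors are learned
# from a corpus and typically have hundreds of dimensions.
TOY_VECTORS = {
    "buy": [0.90, 0.10, 0.00],
    "purchase": [0.85, 0.15, 0.00],
    "car": [0.10, 0.90, 0.00],
    "automobile": [0.12, 0.88, 0.00],
    "rain": [0.00, 0.10, 0.90],
    "falls": [0.10, 0.00, 0.80],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_similarity(sentence_a, sentence_b, vectors=TOY_VECTORS):
    """Cosine similarity between the averaged word vectors of two sentences."""
    vec_a = sentence_vector(sentence_a.lower().split(), vectors)
    vec_b = sentence_vector(sentence_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)
```

With these toy vectors, "buy car" and "purchase automobile" score close to 1 even though they share no words, while "buy car" and "rain falls" score close to 0. A threshold or a classifier trained on the corpus would turn such a score into a paraphrase decision.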
DFD  | | Final Deliverable of the Project |
HW/SW integrated system | | Core Industry |
Security | | Other Industries |
Education , IT | | Core Technology |
Big Data | | Other Technologies |
Artificial Intelligence(AI) | | Sustainable Development Goals |
Quality Education | Required Resources
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
| RAM 8GB | Equipment | 1 | 4000 | 4000 |
| GPU | Equipment | 1 | 40000 | 40000 |
| | | | Total (in Rs) | 44000 |