Adil Khan 11 months ago

AdiKhanOfficial #FYP Ideas

Project Title

Sentential Paraphrase Detection

Project Area of Specialization

Artificial Intelligence

Project Summary

Background

NLP is one of the fastest-emerging fields of our era: we want our systems to be capable of understanding the high-level language of human beings. In NLP, computers are trained to learn natural language so that they can generate and retrieve data in natural language. Paraphrasing is the rewording and rearrangement of one text into another text in the same language. Monolingual techniques are becoming popular in NLP because of their various applications in paraphrase detection.

Natural Language Processing (NLP) focuses on developing computer systems that can analyze, understand and generate natural human languages.
One of the major difficulties in natural language processing is ambiguity, where the same text has several possible interpretations.
Another equally challenging aspect is that the same content can be conveyed in different ways. This is termed paraphrasing.

Why is it needed?

Paraphrasing has been in practice since the beginning of formalized education, so there will probably always be students or other individuals who paraphrase.

"An equally difficult aspect is that the same content and data can be expressed in different ways, which is known as paraphrasing."

Literature Review:

Jun Choi Lee and Yu-N Cheah [1] presented a semantic relatedness measure based on the synset shortest path in WordNet for paraphrase detection.
Chen Liang, Praveen Paritosh, Vinodh Rajendran and Kenneth D. Forbus [2] proposed a new alignment-based approach to learning semantic similarity. They used a hybrid representation of attributed relational graphs to encode lexical, syntactic and semantic information.
Hoang-Quoc Nguyen-Son, Yusuke Miyao and Isao Echizen [3] proposed an approach to paraphrase detection using identical phrase and similar word matching.
Wenpeng Yin and Hinrich Schütze [5] presented a new deep learning architecture, Bi-CNN-MI, for paraphrase identification.

[1] Lee, Jun Choi and Cheah, Yu-N (2016) Paraphrase Detection using Semantic Relatedness based on Synset Shortest Path in WordNet. In: International Conference on Advance
[2] Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth D. Forbus, Learning Paraphrase Identification with Structural Alignment, IJCAI 2016, New York.
[3] Hoang-Quoc Nguyen-Son, Yusuke Miyao, Isao Echizen, Paraphrase Detection Based on Identical Phrase and Similar Word Matching, 29th Pacific Asia Conference on Language, 2019.
[4] J.C. Lee and Y. Cheah, "Paraphrase Detection using String Similarity with Synonyms", The Fourth Asian Conference on Information Systems, ACIS 2017.

Purpose

We will develop a prototype system for monolingual sentence-level paraphrase detection, which will take two English sentences and output whether or not they are paraphrases of each other.

Example
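As a rough illustration of the intended input and output, here is a minimal sketch, assuming a simple word-overlap (Jaccard) baseline rather than the deep learning model the project proposes. The function names and the 0.5 threshold are hypothetical choices made only for this example.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def jaccard_similarity(sentence_a, sentence_b):
    """Return the share of distinct words that the two sentences have in common."""
    tokens_a, tokens_b = set(tokenize(sentence_a)), set(tokenize(sentence_b))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def detect_paraphrase(sentence_a, sentence_b, threshold=0.5):
    """Label a sentence pair; the 0.5 threshold is an illustrative assumption."""
    score = jaccard_similarity(sentence_a, sentence_b)
    return "paraphrased" if score >= threshold else "not paraphrased"


print(detect_paraphrase("The cat sat on the mat.", "On the mat the cat sat."))
print(detect_paraphrase("The cat sat on the mat.", "Stock prices fell sharply today."))
```

A baseline like this only catches paraphrases that reuse the same words; pairs that rely on synonyms would need the semantic techniques discussed in the literature review.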

Project Objectives

Monolingual Paraphrase Detection

The derived and source texts are in the same language.

Objective

Monolingual techniques are becoming popular in the field of NLP because of their various applications in plagiarism detection.

A large, standardized benchmark corpus will be built.
State-of-the-art techniques will be used.
The model will be trained on the corpus.
We will develop our own technique as the proposed technique.
Both techniques will be compared on the developed corpus, i.e. models will be trained using both the proposed and the state-of-the-art techniques.
A prototype interface will be developed.

Abstract

We want to develop paraphrase detection at the sentence level.
We want to develop an English-English benchmark corpus at the sentence level (simulated, real and artificial cases).

Research Gap

No work has been done at the sentence level yet.
A corpus of real cases is available at the document level, but no work has been done yet for simulated cases.

Proposed Solution

We will build a dataset and implement a technique using deep learning approaches.
We will implement the technique and compare it with state-of-the-art techniques.
We will develop a paraphrase detection corpus for the English language covering real and artificial data.
We will develop our own proposed solution.
We will compare the state-of-the-art techniques with our own proposed technique to find the best fit for the corpus.
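The comparison step can be sketched as follows: a minimal example, assuming both techniques are exposed as functions that take two sentences and return True or False. The two detectors and the four labelled pairs are hypothetical stand-ins, not the actual techniques or corpus.

```python
import re


def evaluate(predict, labelled_pairs):
    """Compute accuracy, precision, recall and F1 for a detector on labelled pairs."""
    tp = fp = tn = fn = 0
    for sentence_a, sentence_b, is_paraphrase in labelled_pairs:
        predicted = predict(sentence_a, sentence_b)
        if predicted and is_paraphrase:
            tp += 1
        elif predicted and not is_paraphrase:
            fp += 1
        elif not predicted and is_paraphrase:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labelled_pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def exact_match_detector(sentence_a, sentence_b):
    """Hypothetical baseline: only identical sentences count as paraphrases."""
    return sentence_a.lower() == sentence_b.lower()


def overlap_detector(sentence_a, sentence_b):
    """Hypothetical second technique: word overlap of at least 0.5."""
    words_a = set(re.findall(r"[a-z']+", sentence_a.lower()))
    words_b = set(re.findall(r"[a-z']+", sentence_b.lower()))
    return len(words_a & words_b) / len(words_a | words_b) >= 0.5


# Toy labelled pairs (made up for illustration): (sentence, sentence, is_paraphrase)
toy_corpus = [
    ("He bought a new car.", "He purchased a new car.", True),
    ("She likes tea.", "She likes tea.", True),
    ("It is raining.", "The match was cancelled.", False),
    ("He bought a new car.", "He sold his old car.", False),
]

for name, detector in [("exact match", exact_match_detector), ("overlap", overlap_detector)]:
    print(name, evaluate(detector, toy_corpus))
```

Evaluating both detectors with the same function on the same pairs keeps the comparison fair; on the real corpus a held-out test split would be used rather than the training data.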

Application

Paraphrase detection is important for applications such as

Summarization
Information retrieval
Information extraction
Question answering

Project Implementation Method

Goals and Guidelines

We are using the following principle:

The KISS principle ("Keep it simple, stupid!")

Development Methodology

Our system will be developed and delivered using the agile methodology. The development phase will be subdivided into a number of iterations, each focused mainly on improving the system. The system will be tested and its behavior noted under different circumstances. Each component will be developed and tested concurrently. All the possible scenarios that a user may encounter will be considered while developing the system.

Architectural Strategies

The architecture comprises three main elements that serve as the three pillars of the system:

• Data Collection

• Model Training

• Interface

The components are layered on top of each other. The collected data serves the model that needs to be trained, and the interface gets its results from the trained model. Each component contributes equally to the end result, and each depends on the one before it because the operations take place sequentially.
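The layered flow described above can be sketched as three functions that feed into one another. This is a minimal sketch under stated assumptions: the "model" is just a word-overlap threshold tuned on a handful of made-up pairs, and all function names are hypothetical.

```python
import re


def word_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the word sets of two sentences."""
    words_a = set(re.findall(r"[a-z']+", sentence_a.lower()))
    words_b = set(re.findall(r"[a-z']+", sentence_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def collect_data():
    """Layer 1, data collection: hard-coded pairs standing in for the corpus."""
    return [
        ("The meeting starts at noon.", "The meeting begins at noon.", True),
        ("I love this movie.", "I love this film.", True),
        ("The meeting starts at noon.", "I love this movie.", False),
        ("I love this movie.", "I hate this weather.", False),
    ]


def train_model(pairs):
    """Layer 2, model training: pick the threshold with the best accuracy."""
    def accuracy(threshold):
        hits = sum((word_overlap(a, b) >= threshold) == label for a, b, label in pairs)
        return hits / len(pairs)

    candidates = [i / 10 for i in range(1, 10)]
    return max(candidates, key=accuracy)


def interface(threshold, sentence_a, sentence_b):
    """Layer 3, interface: report the trained model's decision to the user."""
    is_paraphrase = word_overlap(sentence_a, sentence_b) >= threshold
    return "paraphrased" if is_paraphrase else "not paraphrased"


threshold = train_model(collect_data())
print(interface(threshold, "The meeting starts at noon.", "The meeting begins at noon."))
```

Each layer only consumes the output of the layer below it, so the threshold model could later be swapped for a neural network without changing the data collection or interface code.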

Design Description

The system will be designed from the bottom level up to the top level, because the quality of later developments depends heavily on earlier ones. The first concern is to build a corpus that provides the basis for the system, and then to move on to the model. The interface will be delivered last, once all the back-end work is complete. The design of the system consists of the following phases:

Collecting raw data
Saving the informative data
Annotation process
Displaying the data
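The four phases above can be sketched as a small pipeline. This is a minimal sketch, assuming each corpus entry is a sentence pair with a label and that the corpus is stored as CSV; the record layout, the three-word filter and the sample data are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class SentencePair:
    source: str
    candidate: str
    label: str = "unlabelled"


def collect_raw():
    """Phase 1: raw pairs; hard-coded here as a stand-in for collected data."""
    return [
        ("He bought a new car.", "He purchased a new car."),
        ("", "lol"),
        ("It is raining.", "The match was cancelled."),
    ]


def keep_informative(raw_pairs, min_words=3):
    """Phase 2: keep only pairs where both sentences have enough words."""
    return [
        SentencePair(a, b)
        for a, b in raw_pairs
        if len(a.split()) >= min_words and len(b.split()) >= min_words
    ]


def annotate(pair, is_paraphrase):
    """Phase 3: store a human annotator's judgement on the pair."""
    pair.label = "paraphrased" if is_paraphrase else "not paraphrased"
    return pair


def to_csv(pairs):
    """Serialize the annotated pairs as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source", "candidate", "label"])
    writer.writeheader()
    for pair in pairs:
        writer.writerow(asdict(pair))
    return buffer.getvalue()


pairs = keep_informative(collect_raw())
judgements = [True, False]  # made-up annotations for the two surviving pairs
annotated = [annotate(pair, judgement) for pair, judgement in zip(pairs, judgements)]

# Phase 4: display the data
print(to_csv(annotated))
```

Writing to an in-memory buffer keeps the example self-contained; the real corpus would be written to a file instead.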

Use of a particular type of product:

We will be using the following programming language, tools and libraries:

Python as the programming language
Anaconda or PyCharm as the working environment
The corpus discussed above
NLTK, pandas, Keras, etc.

Reuse of existing software components to implement various parts/features of the system:

No. We will be using our own dataset, collected from various sources such as:

Twitter
Chat boxes
Facebook, etc.

However, we will use state-of-the-art techniques to train our model on the corpus.

Future plans for extending or enhancing the software:

We can further extend the system by deploying it on a website, but that would require considerable time, so it is left as a future extension.

User interface paradigms (or system input and output models)

The system prototype would be somewhat similar to the following form:

Benefits of the Project

Monolingual techniques are becoming popular in the field of NLP because of their various applications in plagiarism detection.

A large, standardized benchmark corpus will be built.
State-of-the-art techniques will be used.
The model will be trained on the corpus.
We will develop our own technique as the proposed technique.
Both techniques will be compared on the developed corpus, i.e. models will be trained using both the proposed and the state-of-the-art techniques.
A prototype interface will be developed.

Paraphrase detection is important for applications such as

Summarization
Information retrieval
Information extraction
Question answering

It is often difficult for the reader to see how paraphrased or quoted ideas fit with the broader discussion because they have not read the same source material as the writer. Thus, in psychological writing, paraphrasing is considered bad writing practice. We will therefore detect paraphrasing at the sentence level, which should increase the accuracy of detection at this stage.

Technical Details of Final Deliverable

A large, standardized benchmark corpus. Its size would be somewhat more than 1 lakh (100,000).
State-of-the-art techniques such as word2vec will be used.
The model will be trained on the corpus.
We will develop our own technique as the proposed technique.
Both techniques will be compared on the developed corpus, i.e. models will be trained using both the proposed and the state-of-the-art techniques.
A prototype interface will be developed.
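To illustrate how word vectors such as those produced by word2vec could be used for sentence comparison, here is a minimal sketch, assuming each sentence is represented by the average of its word vectors and compared by cosine similarity. The three-dimensional vectors below are invented for the example; real vectors would come from a trained word2vec model.

```python
import math

# Toy embeddings (made up for illustration, not real word2vec output).
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.0],
    "fast": [0.1, 0.9, 0.0],
    "quick": [0.15, 0.85, 0.0],
    "banana": [0.0, 0.1, 0.9],
    "yellow": [0.0, 0.2, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [TOY_VECTORS[word] for word in sentence.lower().split() if word in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(sentence_a, sentence_b):
    """Similarity of two sentences; 0.0 when either has no known words."""
    vector_a, vector_b = sentence_vector(sentence_a), sentence_vector(sentence_b)
    if vector_a is None or vector_b is None:
        return 0.0
    return cosine(vector_a, vector_b)


print(sentence_similarity("fast car", "quick automobile"))
print(sentence_similarity("fast car", "yellow banana"))
```

Unlike plain word overlap, this approach can score "fast car" and "quick automobile" as similar even though they share no words, which is why embedding-based features are attractive for paraphrase detection.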

DFD