Adil Khan 9 months ago
AdiKhanOfficial #FYP Ideas

Sentinel Paraphrase Detection


Project Title

Sentinel Paraphrase Detection

Project Area of Specialization

Artificial Intelligence

Project Summary

Background

Natural Language Processing (NLP) is one of the fastest-emerging fields of this era: we want our systems to be efficient enough to understand the high-level language of human beings. In NLP, computers are trained to learn natural language so that they can generate and retrieve data in natural language. Paraphrasing is the reordering and rearrangement of one text into another text in the same language. Monolingual techniques are becoming popular in NLP because of their various applications in paraphrase detection.

  • Natural Language Processing (NLP) focuses on developing computer systems that can analyze, understand, and generate natural human languages.
  • One of the major difficulties in natural language processing is ambiguity, where the same text has several possible interpretations.
  • Another equally challenging aspect is that the same content can be conveyed in different ways. This is termed paraphrasing.
Why is it needed?

Paraphrasing has been practiced since the beginning of formal education, so there will most likely always be students or other individuals who paraphrase.

"An equally difficult aspect is that the same content and data can be expressed in different ways, which is known as paraphrasing."

Literature Review:

  1. Jun Choi Lee and Yu-N Cheah [1] presented a semantic relatedness measure based on the Synset shortest path in WordNet for paraphrase detection.
  2. Chen Liang, Praveen Paritosh, Vinodh Rajendran, and Kenneth D. Forbus [2] proposed a new alignment-based approach to learning semantic similarity. They used a hybrid representation of attributed relational graphs to encode lexical, syntactic, and semantic information.
  3. Hoang-Quoc Nguyen-Son, Yusuke Miyao, and Isao Echizen [3] proposed an approach to paraphrase detection using identical phrase and similar word matching.
  4. Wenpeng Yin and Hinrich Schütze [5] presented a new deep learning architecture, Bi-CNN-MI, for paraphrase identification.
  5. Lee, Jun Choi and Cheah, Yu-N (2016) Paraphrase Detection using Semantic Relatedness based on Synset Shortest Path in WordNet. In: International Conference on Advance
  6. Chen Liang, Praveen Paritosh, Vinodh Rajendran, Kenneth D. Forbus, Learning Paraphrase Identification with Structural Alignment, IJCAI 2016, New York.
  7. Hoang-Quoc Nguyen-Son, Yusuke Miyao, Isao Echizen, Paraphrase Detection Based on Identical Phrase and Similar Word Matching, 29th Pacific Asia Conference on Language, 2019.
  8. J.C. Lee and Y. Cheah, “Paraphrase Detection using String Similarity with Synonyms,” The Fourth Asian Conference on Information Systems, ACIS 2017.
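
The shortest-path idea mentioned in the first reviewed approach can be illustrated with a small sketch. This is not the measure from the cited paper and it does not use WordNet data: `TOY_GRAPH` is a hypothetical, hand-made hypernym graph, and the score 1 / (1 + distance) is simply one common way of turning a path length into a relatedness value.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parents); NOT real WordNet data.
TOY_GRAPH = {
    "car": ["vehicle"],
    "truck": ["vehicle"],
    "vehicle": ["artifact"],
    "banana": ["fruit"],
    "fruit": ["food"],
    "artifact": ["entity"],
    "food": ["entity"],
}

def build_adjacency(graph):
    """Turn the child -> parents map into an undirected adjacency map."""
    adjacency = {}
    for child, parents in graph.items():
        for parent in parents:
            adjacency.setdefault(child, set()).add(parent)
            adjacency.setdefault(parent, set()).add(child)
    return adjacency

ADJACENCY = build_adjacency(TOY_GRAPH)

def shortest_path(start, goal):
    """Breadth-first search; returns the number of edges or None if unreachable."""
    if start not in ADJACENCY or goal not in ADJACENCY:
        return None
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in ADJACENCY[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

def relatedness(a, b):
    """Shorter paths give higher scores; unknown or unconnected words score 0."""
    distance = shortest_path(a, b)
    return 0.0 if distance is None else 1.0 / (1.0 + distance)

print(relatedness("car", "truck"), relatedness("car", "banana"))
```

In this toy graph "car" and "truck" share a direct parent, so they score higher than "car" and "banana", which are only connected through the root.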
Purpose

We will develop a prototype system for monolingual, sentence-level paraphrase detection that takes two English sentences and outputs whether or not they are paraphrases of each other.

Example
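
A minimal sketch of the intended input and output, assuming a simple word-overlap baseline rather than the deep learning model the project proposes. The function names and the 0.5 threshold are illustrative assumptions only.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def jaccard(a, b):
    """Share of distinct words that the two sentences have in common."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def is_paraphrase(a, b, threshold=0.5):
    """Assumed decision rule: call the pair a paraphrase above the threshold."""
    return jaccard(a, b) >= threshold

first = "The cat sat on the mat."
second = "On the mat the cat sat."
print("Paraphrased" if is_paraphrase(first, second) else "Not paraphrased")
```

For this pair the sketch prints "Paraphrased", because both sentences use exactly the same words. A baseline like this misses paraphrases that rely on synonyms, which is the gap the trained model is meant to close.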

Project Objectives

Monolingual Paraphrase Detection

Derived and source texts are in the same language.

Objective

Monolingual techniques are becoming popular in NLP because of their various applications in plagiarism detection.

  1. Build a large, standardized benchmark corpus.
  2. Use state-of-the-art techniques.
  3. Train the model on the corpus.
  4. Develop our own proposed technique.
  5. Compare both techniques on the developed corpus, i.e., train models with both the proposed and the state-of-the-art techniques.
  6. Develop a prototype interface.
Abstract
  1. We want to develop paraphrase detection at the sentence level.
  2. We want to develop an English-English benchmark corpus at the sentence level (simulated, real, and artificial cases).
Research Gap
  1. No work has been done at the sentence level yet.
  2. A corpus of real cases is available at the document level, but no work has been done yet for simulated cases.
Proposed Solution
  1. We will build a dataset and implement a technique using deep learning approaches.
  2. We will implement the technique and compare it with state-of-the-art techniques.
  3. We will develop an English paraphrase detection corpus covering real and artificial data.
  4. We will develop our own proposed solution.
  5. We will compare the state-of-the-art techniques with our own proposed technique to find the best fit for the corpus.
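
The comparison step could be organized roughly as follows. This is only a sketch: `exact_match` and `word_overlap` are hypothetical stand-ins for the state-of-the-art and proposed techniques, and `CORPUS` is a four-pair toy sample, not the project corpus.

```python
import re

# Toy labelled pairs: (sentence_a, sentence_b, label), where 1 means paraphrase.
CORPUS = [
    ("He bought a car.", "He purchased a car.", 1),
    ("She is happy.", "She is happy.", 1),
    ("It is raining.", "The market closed early.", 0),
    ("I like tea.", "I like coffee.", 0),
]

def exact_match(a, b):
    """Stand-in technique 1: sentences must be identical ignoring case."""
    return a.strip().lower() == b.strip().lower()

def word_overlap(a, b, threshold=0.5):
    """Stand-in technique 2: Jaccard word overlap at or above a threshold."""
    tokens_a = set(re.findall(r"[a-z']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z']+", b.lower()))
    union = tokens_a | tokens_b
    return bool(union) and len(tokens_a & tokens_b) / len(union) >= threshold

def evaluate(detector, corpus):
    """Score a detector on labelled pairs with accuracy, precision, recall, F1."""
    tp = fp = fn = tn = 0
    for a, b, label in corpus:
        predicted = detector(a, b)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(corpus),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

for name, detector in [("exact match", exact_match), ("word overlap", word_overlap)]:
    print(name, evaluate(detector, CORPUS))
```

Reporting precision, recall, and F1 alongside accuracy matters here because two techniques can reach the same accuracy while making different kinds of mistakes, as the two stand-ins do on this toy sample.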
Application

Paraphrase detection is important for applications such as

  1. Summarization
  2. Information retrieval
  3. Information extraction
  4. Question answering

Project Implementation Method

Goals and Guidelines

We are using the following principle:

  • The KISS principle ("Keep it simple, stupid!")
Development Methodology

Our system will be developed and delivered using the agile methodology. The development phase will be subdivided into a number of iterations, each focused mainly on improving the system. The system will be tested and its behavior noted under different circumstances. Each component will be developed and tested concurrently. All the possible scenarios that a user may encounter will be considered while developing the system.

Architectural Strategies

The architecture comprises three main elements that serve as the pillars of the system:

• Data Collection

• Model Training

• Interface

The components are layered on top of one another. The collected data feeds the model that needs to be trained, and the interface gets its results from the trained model. Each component contributes equally to the end result, and each depends on the one before it because the operations take place sequentially.
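
The layered flow described above can be sketched as three small pieces that call each other in sequence. Everything here is an assumption made for illustration: the hard-coded pairs stand in for the collected corpus, and `ThresholdModel` is a deliberately simple placeholder for the trained model.

```python
import re
from dataclasses import dataclass

@dataclass
class LabeledPair:
    first: str
    second: str
    is_paraphrase: bool

def collect_data():
    """Data collection layer: a hard-coded stand-in for the real corpus."""
    return [
        LabeledPair("He bought a car.", "He purchased a car.", True),
        LabeledPair("The cat sat on the mat.", "On the mat the cat sat.", True),
        LabeledPair("I like tea.", "I like coffee.", False),
        LabeledPair("It is raining.", "The market closed early.", False),
    ]

def overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    tokens_a = set(re.findall(r"[a-z']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z']+", b.lower()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

class ThresholdModel:
    """Model layer: learns a single overlap threshold from labelled pairs."""

    def __init__(self):
        self.threshold = 0.5

    def fit(self, pairs):
        def accuracy(threshold):
            hits = sum(
                (overlap(p.first, p.second) >= threshold) == p.is_paraphrase
                for p in pairs
            )
            return hits / len(pairs)

        # Try 0.05, 0.15, ..., 0.95 and keep the first best-scoring value.
        candidates = [i / 20 for i in range(1, 20, 2)]
        self.threshold = max(candidates, key=accuracy)
        return self

    def predict(self, a, b):
        return overlap(a, b) >= self.threshold

def interface(model, a, b):
    """Interface layer: turns the model's decision into a user-facing message."""
    return "Paraphrased" if model.predict(a, b) else "Not paraphrased"

model = ThresholdModel().fit(collect_data())
print(interface(model, "He bought a car.", "He purchased a car."))
```

Keeping the three layers behind separate functions and classes means the placeholder model can later be swapped for a trained network without changing the data collection or interface code.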

Design Description

The system will be designed from the bottom level up, because the quality of later development depends heavily on earlier development. The first concern is to build a corpus that provides the basis for the system, and then to move on to the model. The interface will be delivered last, once all the back-end work is complete. The design of the system consists of the following phases:

  • Collecting raw data
  • Saving the informative data
  • Annotation process
  • Displaying the data
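
The four phases above might look roughly like this in code. This is a sketch under assumptions: the `CorpusEntry` fields, the minimum-length filter, the majority-vote annotation rule, and the CSV layout are all illustrative choices, not decisions stated in the proposal.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    source: str
    derived: str
    label: str = "unlabelled"

def keep_informative(raw_pairs, min_words=3):
    """Drop pairs with a very short sentence and pairs already seen."""
    seen = set()
    kept = []
    for source, derived in raw_pairs:
        key = (source.strip().lower(), derived.strip().lower())
        too_short = min(len(source.split()), len(derived.split())) < min_words
        if too_short or key in seen:
            continue
        seen.add(key)
        kept.append(CorpusEntry(source.strip(), derived.strip()))
    return kept

def annotate(entry, votes):
    """Set the label to the most common vote among the annotators."""
    entry.label = Counter(votes).most_common(1)[0][0]
    return entry

def to_csv(entries):
    """Render the entries as CSV text for display or storage."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "derived", "label"])
    for entry in entries:
        writer.writerow([entry.source, entry.derived, entry.label])
    return buffer.getvalue()

# Collected raw data (hard-coded here): one useful pair, one too short, one repeat.
raw = [
    ("He bought a car.", "He purchased a car."),
    ("ok", "fine"),
    ("He bought a car.", "He purchased a car."),
]
entries = keep_informative(raw)
annotate(entries[0], ["paraphrase", "paraphrase", "not_paraphrase"])
print(to_csv(entries))
```

Only the first pair survives the filter, and it is labelled "paraphrase" because two of the three hypothetical annotators voted that way.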

Use of a particular type of product:

We’ll be using the following (programming language, database, library, etc.):

  • Python as the programming language
  • Anaconda or PyCharm as the working platform
  • The corpus as discussed above
  • NLTK, pandas, Keras, etc. as libraries
Reuse of existing software components to implement various parts/features of the system:

No. We’ll be using our own dataset, collected from various places such as:

  • Twitter
  • Chat boxes
  • Facebook, etc.

However, we will use state-of-the-art techniques to train our model on the corpus.

Future plans for extending or enhancing the software:

We can further extend the system by deploying it on a website, but that requires considerably more time, so it is left as a future extension.

User interface paradigms (or system input and output models)

The system prototype would look somewhat like the following form.

Benefits of the Project


It is often difficult for the reader to see how paraphrased or quoted ideas fit with your broader discussion, because they have not read the same source material you have. Thus, in psychological writing, paraphrasing is considered bad writing practice. We will therefore detect paraphrasing at the sentence level, which should increase the accuracy of detection at this stage.

Technical Details of Final Deliverable

  • A large, standardized benchmark corpus, with a size of somewhat more than 1 lakh (100,000).
  • Use of state-of-the-art techniques such as word2vec.
  • A model trained on the corpus.
  • Our own technique, developed as the proposed technique.
  • A comparison of both techniques on the developed corpus, i.e., models trained with both the proposed and the state-of-the-art techniques.
  • A prototype interface.
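
As a rough illustration of how word vectors such as word2vec can be used for sentence comparison, the sketch below averages word vectors and compares the averages with cosine similarity. The three-dimensional `TOY_VECTORS` are hypothetical, hand-written values; real word2vec vectors are learned from a large corpus and have far more dimensions.

```python
import math

# Hypothetical toy embeddings; NOT real word2vec output.
TOY_VECTORS = {
    "buy": (0.9, 0.1, 0.0),
    "purchase": (0.85, 0.15, 0.0),
    "car": (0.1, 0.9, 0.0),
    "rain": (0.0, 0.1, 0.9),
}

def sentence_vector(tokens):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    dims = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dims))

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

buy_car = sentence_vector(["buy", "car"])
purchase_car = sentence_vector(["purchase", "car"])
rain = sentence_vector(["rain"])

print(cosine(buy_car, purchase_car))  # close to 1: similar meaning
print(cosine(buy_car, rain))          # close to 0: unrelated meaning
```

Unlike plain word overlap, this kind of comparison can score "buy" and "purchase" as similar even though the strings differ, which is why embedding-based techniques are a natural baseline for paraphrase detection.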

DFD

Final Deliverable of the Project

HW/SW integrated system

Core Industry

Security

Other Industries

Education, IT

Core Technology

Big Data

Other Technologies

Artificial Intelligence (AI)

Sustainable Development Goals

Quality Education

Required Resources

Item Name     Type         No. of Units   Per Unit Cost (in Rs)   Total (in Rs)
RAM 8GB       Equipment    1              4000                    4000
GPU           Equipment    1              40000                   40000
Total (in Rs)                                                     44000