Roman Urdu Hate Speech Detection
2025-06-28 16:34:51 - Adil Khan
Project Area of Specialization: Artificial Intelligence
Project Summary
Hate speech is commonly defined as any communication that disparages a target group of people on the basis of some characteristic such as race, colour, identity, sex, sexual orientation, ethnicity, religion or profession.
Hate speech falls into two classes: hate speech that ought to be regulated or prohibited by law, and hate speech that is harmful but falls outside the boundaries requiring state action and regulation.
Hate speech detection is the automated task of determining whether a piece of text contains hate speech. However, no such work has yet been done for the language most widely used in our country, Roman Urdu.
The proposed project will provide a platform presenting statistics on hate speech collected and identified, in textual form, from social media platforms (Facebook, Twitter, Instagram and YouTube comments) and several websites: Siasat.pk, Jang Roznama, Siasat.pk (Urdu), iJunoon, Y This News (an Indian news forum), Roman Urdu News (a Pakistani news and business forum), Ahlul Hadees (a Roman Urdu Islamic blog) and Ahnaf Media Services.
Anyone will be able to see how much hateful content, and of which type, is being circulated in Roman Urdu. This will give an idea of the mindset and critical thinking of the community. The top targets of hate speech will be identified, and the findings will then be used to inform intolerance prevention campaigns at both local and national levels.
Project Objectives
By April 2021, we will detect Roman Urdu hate speech content on social media and websites through our proposed project. We have 7 weeks, at 100 hours per week, to complete our research and develop a platform for Roman Urdu hate speech detection.
The main objective is to raise public awareness of the hate speech being circulated on social media and similar platforms. Everyone will have a way to learn about hateful content. The system can also be applied to any website or blog to filter hateful content out of its posts and comments.
Industry Objectives
As discussed above, the purpose of the proposed project is to detect and remove hateful content from any platform. From an industrial point of view, we could sell the project to any organization that wants to keep track of negative public opinion about it. For instance, hatebase.org works on hate speech detection in 95+ languages but does not cover Roman Urdu hate speech. This project could therefore become part of any existing project that needs Roman Urdu support.
Organizations working on this type of project may help us in understanding and collecting the data. For example, we can contact those that have already collected Roman Urdu data, since data that has already been prepared is more helpful for data scientists. Secondly, our research will be helpful for beginners, because simple prepared data makes it easier to train a model.
Research Objectives
To research and identify the percentage of hate speech on different social websites from text written in Roman Urdu, we will collect data from all the proposed sources and apply different machine learning techniques to it.
The research will identify which kinds of content are most frequently targeted by hate and what the main sources of hateful content are. This will help in understanding the basis of hateful situations.
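As an illustration of the percentage calculation described above, the following minimal Python sketch computes the share of hateful items per source. The record layout (a `(source, is_hate)` pair), the function name `hate_share_by_source` and the sample labels are assumptions made for this example; in practice the labels would come from the trained classifier.

```python
from collections import defaultdict

def hate_share_by_source(records):
    """Return the percentage of hateful items for each source.

    `records` is an iterable of (source, is_hate) pairs; this layout is
    a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    hateful = defaultdict(int)
    for source, is_hate in records:
        totals[source] += 1
        if is_hate:
            hateful[source] += 1
    return {s: 100.0 * hateful[s] / totals[s] for s in totals}

# Toy, made-up labels purely for demonstration.
sample = [
    ("twitter", True), ("twitter", False), ("twitter", False),
    ("facebook", True), ("facebook", True),
]
shares = hate_share_by_source(sample)
for source, pct in sorted(shares.items()):
    print(f"{source}: {pct:.1f}% hateful")
```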
Academic Objectives
After studying Artificial Intelligence and Machine Learning, we will implement these techniques in our project. For machine learning programs, data is the most important part.
- Data in Roman Urdu will be collected using reliable methods for scraping Roman Urdu text from different platforms, such as social media (Facebook, Twitter, Instagram, TikTok, etc.) and websites (Siasat.pk, Jang Roznama, Siasat.pk (Urdu), iJunoon, etc.).
- The data has to be cleaned before it can be used to train the model.
- Accurate algorithms for hate speech classification will be selected, and the cleaned data will then be used to train the models.
- Each model will be tested using test data.
- Finally, the model that gives the best result will be used for hate speech detection.
After the model is trained and working well, we will deploy the project as a web application.
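The train, test and select steps listed above can be sketched as follows. This is a minimal, standard-library-only illustration and not the project's chosen algorithms: the two candidates (a majority-class baseline and a multinomial Naive Bayes classifier), the helper names, and the toy tokens and labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

class MajorityBaseline:
    """Always predicts the most common training label."""
    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, tokens):
        return self.label

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(model, docs, labels):
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)

def select_best(models, train, test):
    """Fit every candidate on `train` and rank them by accuracy on `test`."""
    scores = {}
    for name, model in models.items():
        model.fit(*train)
        scores[name] = accuracy(model, *test)
    return max(scores, key=scores.get), scores

# Toy, made-up data: 1 = hate, 0 = not hate. Tokens are placeholders.
train = ([["nafrat", "bura"], ["bura", "ganda"], ["acha", "pyar"],
          ["pyar", "dost"], ["dost", "acha"]], [1, 1, 0, 0, 0])
test = ([["ganda", "nafrat"], ["acha", "dost"], ["bura"]], [1, 0, 1])

candidates = {"majority": MajorityBaseline(), "naive_bayes": NaiveBayes()}
best, scores = select_best(candidates, train, test)
print(best, scores)
```

Keeping a trivial baseline among the candidates is a deliberate choice in this sketch: a model is only worth deploying if it clearly beats always predicting the majority class.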
Project Implementation Method
We have divided our project into small, achievable modules (activities), listed below, according to the current knowledge of our project members:
- Planning
- Feasibility study of the project
- Defining the work flow
- Environment setup
- Dataset collection for training of the ML system
- Data cleaning and Data mapping
- Implementation
- Documentation
- Deployment
- It will provide a real-time snapshot of community behaviors and attitudes against social, ethnic, sexual, gender, and minority groups.
- Moreover, the work is worthwhile because of the language chosen, since Roman Urdu is the language most widely used by our public.
- Since hate speech has specific targets, this study will reveal the top targets of hate speech.
- The topics, issues and individuals (politicians, anchors, actors, etc.) that are hated or criticized by the public will also be highlighted.
- It will shed light on the age groups, genders, professions and races that are the targets of hate speech.
- The proposed system will be used to predict which politicians and political parties are liked or disliked by the public.
- The results deduced will be used to inform intolerance prevention campaigns on both local and national levels.
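The "top targets" item in the list above amounts to a simple tally over classified items. The sketch below shows one hedged way to do it; the record keys `"target"` and `"is_hate"`, the function name `top_targets` and the sample records are hypothetical, and how a target is assigned to each item is left open here.

```python
from collections import Counter

def top_targets(records, n=3):
    """Return the n most frequent targets among items labelled as hate.

    Each record is a dict with the hypothetical keys "target" and "is_hate".
    The result is a list of (target, count, share_of_hateful_items) tuples.
    """
    counts = Counter(r["target"] for r in records if r["is_hate"])
    total = sum(counts.values())
    return [(t, c, c / total) for t, c in counts.most_common(n)]

# Toy, made-up records purely for demonstration.
sample = (
    [{"target": "politicians", "is_hate": True}] * 3
    + [{"target": "religion", "is_hate": True}] * 2
    + [{"target": "gender", "is_hate": True}]
    + [{"target": "actors", "is_hate": False}]
)
ranking = top_targets(sample, n=2)
for target, count, share in ranking:
    print(f"{target}: {count} items ({share:.0%} of hateful items)")
```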
Each activity is achieved through sub-activities or tasks, known as the milestones of the project. Like all other projects, ours has the following key milestones and deliverables:
Feasibility Survey: We conducted an online survey on the feasibility of our project and found it completely feasible, along with some risks.
- Financial Risk: Financial risk involves loss in financial transactions. It directly affects the scope of the project.
- Technical Risk: Technical risk is the possible impact that changes could have on a project, system, and infrastructure when an implementation does not work as predicted.
Analyzing and comparing the existing projects and apps currently operating in the market. A few of those are Hatebase, Hatebusters, and Haternet.
Functional Requirements: Defining the functional requirements, along with the stakeholders of our project and their importance to the project.
Literature Review: An in-depth study of the research already done in the domain of our project; approximately 8 research papers covering work in different languages (English, Spanish, Arabic, and Turkish) will be studied.
Dataset: Collecting a dataset using different scraping techniques. Estimated dataset size: 15k.
Data Cleaning: This includes removal of stop words, stemming, lemmatization and tokenization.
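A minimal sketch of such a cleaning step is shown below, using only the Python standard library. It is an illustration under stated assumptions rather than the project's pipeline: the stop-word list and the spelling-variant table are tiny hypothetical samples, and because Roman Urdu has no standard spelling, the variant lookup stands in here for stemming and lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be
# curated from the collected corpus.
STOP_WORDS = {"ka", "ki", "ke", "ko", "hai", "hain", "aur", "se", "ne", "ye", "wo"}

# Hypothetical spelling variants mapped to one canonical form. This lookup
# is a stand-in for stemming/lemmatization in this sketch.
SPELLING_VARIANTS = {"nahin": "nahi", "nhi": "nahi", "bohot": "bohat", "bht": "bohat"}

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only Latin letters and whitespace
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # squeeze runs of 3+ repeated characters
    return text

def clean(text):
    """Normalize, tokenize on whitespace, unify spellings, drop stop words."""
    tokens = normalize(text).split()
    tokens = [SPELLING_VARIANTS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(clean("Ye banda bohaaat bura hai!!! http://example.com @someone"))
```

In this sketch tokenization happens before stop-word removal, since stop words can only be matched once the text has been split into tokens.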
Data Mapping: Transforming the collected data into a form on which the model can be trained.
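One common way to do this mapping is a bag-of-words representation, where each text becomes a vector of word counts. The sketch below assumes that representation purely for illustration; the source does not say which mapping the project uses, and the function names and toy tokens are hypothetical.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=1):
    """Map each word seen at least `min_count` times to a column index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

def vectorize(tokens, vocab):
    """Turn one tokenized text into a count vector; unknown words are skipped."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec

# Toy, made-up tokenized texts.
docs = [["banda", "bura"], ["banda", "acha", "banda"]]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```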
Data Training: The selected model is then trained on the collected and cleaned data.
Testing: Testing the model on the training data, the test data, and then on independent data.
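The three-way testing described above can be sketched as one evaluation function applied to each split. The metrics chosen (accuracy, precision, recall and F1 for the hate class), the `toy_predict` stand-in for the trained model, and the sample texts are assumptions made for this example.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall and F1 for the positive (hate) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def report(predict, splits):
    """Evaluate `predict` on every named split of (texts, labels)."""
    return {name: evaluate(labels, [predict(t) for t in texts])
            for name, (texts, labels) in splits.items()}

# Hypothetical stand-in for the trained model: flags texts containing a marker word.
def toy_predict(text):
    return 1 if "nafrat" in text.split() else 0

# Toy, made-up splits: 1 = hate, 0 = not hate.
splits = {
    "training": (["nafrat hai", "acha din", "bohat nafrat"], [1, 0, 1]),
    "testing": (["nafrat nahi", "acha dost"], [0, 0]),
    "independent": (["bura banda", "nafrat"], [1, 1]),
}
results = report(toy_predict, splits)
for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

Comparing the splits side by side is the point of the exercise: a score that is much higher on the training data than on the test or independent data suggests the model has overfitted.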
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Legal, Telecommunication
Core Technology: Artificial Intelligence (AI)
Other Technologies: Blockchain, NeuroTech
Sustainable Development Goals: Peace and Justice Strong Institutions
Required Resources

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| GPU | Equipment | 1 | 50000 | 50000 |
| Domain | Miscellaneous | 1 | 2500 | 2500 |
| Advertisement | Miscellaneous | 3 | 1500 | 4500 |
| Total (in Rs) | | | | 57000 |