Real-time Emotion Detection using Speech, Text and Visual Image Data
Project Area of Specialization: Computer Science

Project Summary

Abstract:
Sentiment Analysis aims to detect positive, neutral, or negative feelings from text, whereas Emotion Analysis aims to detect and recognize types of feelings expressed in text, such as anger, disgust, fear, happiness, sadness, and surprise. Emotions play an important role in how we think and behave: the emotions we feel each day can compel us to take action and influence the decisions we make about our lives, both large and small. We have chosen to diversify our data sources depending on the type of data considered. For text input, we use the stream-of-consciousness dataset gathered in a study by Pennebaker and King [1999]. It consists of 2,468 daily writing submissions from 34 psychology students (29 women and 5 men, aged 18 to 67 with a mean of 26.4). For audio, we use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This database contains 7,356 files (total size: 24.8 GB) recorded by 24 professional actors (12 female, 12 male). Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. All conditions are available in three modality formats: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video-only (no sound). For the visual data, we use the popular FER2013 Kaggle Challenge dataset, which consists of 48x48-pixel grayscale images of faces. We analyze facial, vocal, and textual emotions using a machine-learning-based Android application. We are exploring state-of-the-art models in multimodal sentiment analysis: we take text, sound, and video input data and develop an ensemble model that gathers the information from all these sources and displays it in a clear and interpretable way on an Android device.
Project Objectives

Introduction:
Human beings convey their feelings and inner states through different types of emotions. These emotions are conveyed through facial expressions, text, or speech. By understanding these emotions, we can find out what a particular person is feeling at that moment and what his or her thoughts are on a certain topic or situation.
In today's age of science and technology, recent advancements in neural networks and machine learning have enabled us to create emotional interaction between machines and humans. This helps greatly in obtaining results and other useful information from a machine by understanding the emotions of the person interacting with it.
Keeping these things in mind, we have come up with a project, a Real-Time Emotion Recognition System (RTERS), which will predict the emotions of a particular person based on his or her facial expressions, text, or speech.
Aim and Objectives:
- Converting text into data for analysis via natural language processing (NLP) and predicting emotions from the text data.
- Converting voice into data for analysis and predicting emotions from the voice data.
- Detecting emotion from real-time visual image data.
During the 1970s, psychologist Paul Ekman identified six basic emotions that he suggested were universally experienced in all human cultures: happiness, sadness, disgust, fear, surprise, and anger. We will classify our data into these six emotion classes. The emotion classes will be the same for text, image, and vocal data; only the processing and the model will differ.

Figure 1: Global Emotion Detection and Recognition Market
How We Process the Different Types of Data:

Figure 2: Dealing with different types of data

Figure 3: Flow Diagram for Naive Bayes Fusion
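As a rough illustration of the fusion step in Figure 3, the sketch below combines the per-class probabilities produced by the three modality models under a naive-Bayes (conditional independence) assumption. The function name, the optional prior correction, and the `EMOTIONS` list in the usage comment are our own illustrative choices, not taken from the original design.

```python
import numpy as np

def naive_bayes_fusion(probs_text, probs_audio, probs_image, priors=None):
    """Late fusion under a conditional-independence assumption:
    multiply the per-modality class posteriors and renormalize.
    If class priors are given, divide by priors^(M-1) so the prior
    is not counted once per modality."""
    stacked = np.vstack([probs_text, probs_audio, probs_image])
    fused = np.prod(stacked, axis=0)
    if priors is not None:
        fused = fused / (np.asarray(priors) ** (stacked.shape[0] - 1))
    return fused / fused.sum()

# Hypothetical usage:
#   EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
#   fused = naive_bayes_fusion(p_text, p_audio, p_image)
#   predicted = EMOTIONS[int(np.argmax(fused))]
```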
Applications:
- In education, to gauge students' attentiveness during lectures.
- For security purposes.
- Improving online learning.
- Business improvement.
- Measuring user interaction with a product.
- Improving people's social interaction.
- Detecting lies or fake messages.
- Detecting fake calls.
Methodology:
Methodology for Speech Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the recorded audio file and extracts the following features from it (a minimal extraction sketch follows the list):
- MFCC: Mel-frequency cepstral coefficients, representing the short-term power spectrum.
- Mel: the mel-scaled spectrogram of the signal.
- Chroma: features describing the pitch classes of the sound.
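A minimal sketch of this extraction using the librosa library (our choice of tool; the `n_mfcc` value and the mean-pooling over time are illustrative assumptions, not the project's documented settings):

```python
import numpy as np
import librosa

def extract_audio_features(path, sr=48000):
    """Load an audio file and return a fixed-length feature vector:
    mean-pooled MFCC, chroma, and mel-spectrogram features."""
    y, sr = librosa.load(path, sr=sr)             # RAVDESS audio is 48 kHz
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    return np.concatenate([mfcc, chroma, mel])
```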
Data processing:
This section processes all the extracted features into a dataset for prediction.
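As one plausible interpretation of this step, the sketch below stacks per-file feature vectors into a matrix, standardizes them, and integer-encodes the emotion labels. The scaler and encoder choices are illustrative assumptions, not the project's documented pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

def build_dataset(feature_vectors, labels):
    """Assemble per-file feature vectors into a standardized dataset
    with integer-encoded emotion labels, ready for model training."""
    X = np.vstack(feature_vectors)                 # (n_samples, n_features)
    scaler = StandardScaler().fit(X)
    y = LabelEncoder().fit_transform(labels)       # e.g. "happy" -> 3
    return scaler.transform(X), y, scaler
```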
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
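Read literally, this matching step resembles a nearest-centroid classifier. The sketch below is one hedged interpretation; representing each training class by its mean feature vector is our assumption.

```python
import numpy as np

def predict_emotion(feature_vec, class_centroids, labels):
    """Nearest-centroid matching: return the emotion whose mean training
    feature vector is closest (Euclidean distance) to the new sample."""
    dists = np.linalg.norm(np.asarray(class_centroids) - feature_vec, axis=1)
    return labels[int(np.argmin(dists))]
```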
Methodology for Text Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the recorded text file and extracts features and model inputs from it (a model sketch follows the list):
- Word frequency.
- The Text-LSTM (T-LSTM) model, which uses an LSTM layer.
- The Text-Bi-LSTM (T-BL) model, which uses a bidirectional LSTM layer.
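A minimal sketch of the T-BL variant in Keras. The vocabulary size, embedding width, and layer sizes are illustrative assumptions; only the bidirectional-LSTM structure and the six emotion classes come from the description above.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # assumed vocabulary size
NUM_CLASSES = 6      # the six Ekman emotion classes

def build_t_bl_model():
    """Text-Bi-LSTM (T-BL): embedding -> bidirectional LSTM -> softmax."""
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, 128),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The T-LSTM variant would differ only in replacing the `Bidirectional` wrapper with a plain `layers.LSTM(64)` layer.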
Data processing:
This section processes all the extracted features into a dataset for prediction.
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
Methodology for Real Time Face Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the captured image and extracts the following features from it:
- FPE: facial point extraction.
- Facial component extraction (mouth, nose, ears, etc.).
| Features | Description | Size |
|---|---|---|
| Xm1 | Width of mouth / height of mouth | 1 × 1 |
| Xm2 | Distance between nose and mouth | 1 × 1 |
| Xse | Error between mouth and template | 6 × 1 |
| Xe1 | Distance between the two eyebrows | 1 × 1 |
| Xe2 | Distance between eye and eyebrow | 1 × 1 |
| Xe3 | Distance between nose and eye (left side) | 1 × 1 |
| Xe4 | Distance between nose and eye (right side) | 1 × 1 |
| Xse | Error between eye and template | 4 × 1 |
- Transformed facial points as features.
- 68 landmark points, including those around the eyes.
- Features selected from the auxiliary region as well.
- A NumPy array is used to convert the 68 points into an array of (x, y) coordinates representing their locations (see the sketch below).
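A minimal sketch of this landmark-to-array step using dlib's standard 68-point shape predictor (the model file must be downloaded separately from dlib.net). The `geometric_features` helper at the end computes a few of the tabulated distances using conventional iBUG landmark indices; those index choices are illustrative assumptions.

```python
import numpy as np
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Standard pre-trained 68-point model, obtained separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_to_array(image_path):
    """Detect the first face and return its 68 landmarks as a (68, 2) array."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])

def geometric_features(pts):
    """A few of the tabulated distances, using standard iBUG indices."""
    mouth_width = np.linalg.norm(pts[54] - pts[48])    # mouth corners
    nose_to_mouth = np.linalg.norm(pts[33] - pts[51])  # nose tip to upper lip
    brow_gap = np.linalg.norm(pts[22] - pts[21])       # inner eyebrow ends
    return np.array([mouth_width, nose_to_mouth, brow_gap])
```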
Data processing:
This section processes all the extracted features into a dataset for prediction.
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
Benefits of the Project

No existing system analyzes all three of these data types. The systems available work on only a single type of data, and for speech there is no good-quality Android-based application that predicts emotion from speech data with good accuracy. We will provide an accurate, responsive Android application that processes all three types of data and recognizes emotions in real time. In the future we will make it cross-platform to support both iOS and Android.
Benefits:
- Improving the online education system
- Improving business growth
- Security improvement
- Safety from social media scams
- Recognizing fake messages and calls
- Recognizing students' emotions and reactions toward lectures
Requirement Specification:
In this chapter we give a detailed description of the software as well as the hardware. It also covers the client's interaction with the software and the backend processing requirements.
The requirement specification consists of two parts: functional and non-functional requirements.
Non-Functional Requirements:
PERFORMANCE:
Performance relies heavily on the user's internet connection: the better the connection, the faster the results.
RELIABILITY:
The accuracy of the results depends on the quality of the recorded audio and the user's speech characteristics; it likewise depends on the quality of the text input and on factors such as image backgrounds.
AVAILABILITY:
The system will be available to access via app 24/7 unless down due to maintenance issues.
CAPACITY:
The system will store only a log of registered users and the data collected during model training for the prediction stage.
SECURITY:
The system will ensure user privacy and protection. The recorded audio files, images, and text taken for analysis will be removed immediately after the system generates the results.
USABILITY:
Users only need the app, which is easy to use and understand.
Functional Requirements:
- Android device
- Strong internet connection
- Good-quality camera
- Good-quality microphone for speech
Framework of Speech-Based Emotion Recognition:

Figure 4: Design 1 for Speech
Framework of Text-Based Emotion Recognition:

Figure 5: Design 2 for Text
Framework of Visual-Image-Based Emotion Recognition:

Figure 6: Design 3 for Visual Image
Use Case of the System:

Figure 7: Use Case Diagram for Emotion Detection
Activity Flow Analysis:

Figure 8: Activity Analysis Diagram for Emotion Detection
Class Based Analysis of Emotion Detection:

Figure 9: Class Analysis Diagram for Emotion Detection
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Education, Finance, Media, Security
Core Technology: Internet of Things (IoT)
Other Technologies:
Sustainable Development Goals: Quality Education; Decent Work and Economic Growth

Required Resources:

| Elapsed time since start of the project (days/weeks/months/quarters) | Milestone | Deliverable |
|---|---|---|
| Month 1 | Complete Software Requirements Specification and Design | Yes |