Real-time Emotion Detection using Speech, Text and Visual Image Data
Project Area of Specialization: Computer Science

Project Summary

Abstract:
Sentiment Analysis aims to detect positive, neutral, or negative feelings from text, whereas Emotion Analysis aims to detect and recognize types of feelings expressed in text, such as anger, disgust, fear, happiness, sadness, and surprise. Emotions play an important role in how we think and behave: the emotions we feel each day can compel us to take action and influence the decisions we make about our lives, both large and small. We have chosen to diversify our data sources depending on the type of data considered. For text input, we use the stream-of-consciousness dataset gathered in a study by Pennebaker and King [1999]. It consists of 2,468 daily writing submissions from 34 psychology students (29 women and 5 men, aged 18 to 67 with a mean of 26.4). For audio, we use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This database contains 7,356 files (total size: 24.8 GB) recorded by 24 professional actors (12 female, 12 male). Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. All conditions are available in three modality formats: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video-only (no sound). For the visual data, we use the popular FER2013 Kaggle Challenge dataset, which consists of 48x48-pixel grayscale images of faces. We analyze facial, vocal, and textual emotions using a machine-learning-based Android application. We are exploring state-of-the-art models in multimodal sentiment analysis: we take text, sound, and video input data and develop an ensemble model that gathers the information from all these sources and displays it in a clear and interpretable way on an Android device.
Project Objectives

Introduction:
Human beings convey their feelings and inner states through different types of emotions. These emotions are conveyed through facial expressions, text, or speech. By understanding these emotions, we can find out what a particular person is feeling at that moment and what his or her thoughts are on a certain topic or situation.
In today's age of science and technology, recent advancements in neural networks and machine learning have enabled us to create emotional interaction between machines and humans. This helps greatly in obtaining results and other useful information from a machine by understanding the emotions of the person interacting with it.
Keeping these things in mind, we have come up with a project, a Real-Time Emotion Recognition System (RTERS), which will predict the emotions of a particular person based on his or her facial expressions, text, or speech.
Aim and Objectives:
- Converting text into data for analysis via natural language processing (NLP) and predicting emotions from the text data.
- Converting voice into data for analysis and predicting emotions from the voice data.
- Detecting emotion from real-time visual image data.
During the 1970s, psychologist Paul Ekman identified six basic emotions that he suggested were universally experienced in all human cultures: happiness, sadness, disgust, fear, surprise, and anger. We will classify our data into these six emotion classes. The emotion classes will be the same for text, image, and vocal data; only the processing and the model will differ.

Figure 1: Global Emotion Detection and Recognition Market
How We Process the Different Types of Data:

Figure 2: Dealing with different types of data

Figure 3: Flow Diagram for Naive Bayes Fusion
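As a rough illustration of the fusion step in Figure 3, the sketch below combines the per-class probabilities produced by the three modality models under a naive-Bayes (conditional independence) assumption. The function name, the optional prior correction, and the `EMOTIONS` list in the usage comment are our own illustrative choices, not taken from the original design.

```python
import numpy as np

def naive_bayes_fusion(probs_text, probs_audio, probs_image, priors=None):
    """Late fusion under a conditional-independence assumption:
    multiply the per-modality class posteriors and renormalize.
    If class priors are given, divide by priors^(M-1) so the prior
    is not counted once per modality."""
    stacked = np.vstack([probs_text, probs_audio, probs_image])
    fused = np.prod(stacked, axis=0)
    if priors is not None:
        fused = fused / (np.asarray(priors) ** (stacked.shape[0] - 1))
    return fused / fused.sum()

# Hypothetical usage:
#   EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
#   fused = naive_bayes_fusion(p_text, p_audio, p_image)
#   predicted = EMOTIONS[int(np.argmax(fused))]
```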
Applications:
- In education, to gauge students' attentiveness during lectures.
- For security purposes.
- Improving online learning.
- Business improvement.
- Measuring user interaction with a product.
- Improving people's social interaction.
- Detecting lies or fake messages.
- Detecting fake calls.
Methodology:
Methodology for Speech Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the recorded audio file and extracts the following features from it (a minimal extraction sketch follows the list):
- MFCC: Mel-frequency cepstral coefficients, representing the short-term power spectrum.
- Mel: the mel-scaled spectrogram of the signal.
- Chroma: features describing the pitch classes of the sound.
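A minimal sketch of this extraction using the librosa library (our choice of tool; the `n_mfcc` value and the mean-pooling over time are illustrative assumptions, not the project's documented settings):

```python
import numpy as np
import librosa

def extract_audio_features(path, sr=48000):
    """Load an audio file and return a fixed-length feature vector:
    mean-pooled MFCC, chroma, and mel-spectrogram features."""
    y, sr = librosa.load(path, sr=sr)             # RAVDESS audio is 48 kHz
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    return np.concatenate([mfcc, chroma, mel])
```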
Data processing:
This section processes all the extracted features into a dataset for prediction.
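As one plausible interpretation of this step, the sketch below stacks per-file feature vectors into a matrix, standardizes them, and integer-encodes the emotion labels. The scaler and encoder choices are illustrative assumptions, not the project's documented pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

def build_dataset(feature_vectors, labels):
    """Assemble per-file feature vectors into a standardized dataset
    with integer-encoded emotion labels, ready for model training."""
    X = np.vstack(feature_vectors)                 # (n_samples, n_features)
    scaler = StandardScaler().fit(X)
    y = LabelEncoder().fit_transform(labels)       # e.g. "happy" -> 3
    return scaler.transform(X), y, scaler
```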
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
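Read literally, this matching step resembles a nearest-centroid classifier. The sketch below is one hedged interpretation; representing each training class by its mean feature vector is our assumption.

```python
import numpy as np

def predict_emotion(feature_vec, class_centroids, labels):
    """Nearest-centroid matching: return the emotion whose mean training
    feature vector is closest (Euclidean distance) to the new sample."""
    dists = np.linalg.norm(np.asarray(class_centroids) - feature_vec, axis=1)
    return labels[int(np.argmin(dists))]
```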
Methodology for Text Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the recorded text file and extracts features and model inputs from it (a model sketch follows the list):
- Word frequency.
- The Text-LSTM (T-LSTM) model, which uses an LSTM layer.
- The Text-Bi-LSTM (T-BL) model, which uses a bidirectional LSTM layer.
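A minimal sketch of the T-BL variant in Keras. The vocabulary size, embedding width, and layer sizes are illustrative assumptions; only the bidirectional-LSTM structure and the six emotion classes come from the description above.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # assumed vocabulary size
NUM_CLASSES = 6      # the six Ekman emotion classes

def build_t_bl_model():
    """Text-Bi-LSTM (T-BL): embedding -> bidirectional LSTM -> softmax."""
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, 128),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The T-LSTM variant would differ only in replacing the `Bidirectional` wrapper with a plain `layers.LSTM(64)` layer.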
Data processing:
This section processes all the extracted features into a dataset for prediction.
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
Methodology for Real Time Face Emotion Analysis:
Feature extractor:
In this part, the algorithm loads the captured image and extracts the following features from it:
- FPE: facial point extraction.
- Facial component extraction (mouth, nose, ears, etc.).
| Features | Description | Size |
|---|---|---|
| Xm1 | Width of mouth / height of mouth | 1 × 1 |
| Xm2 | Distance between nose and mouth | 1 × 1 |
| Xse | Error between mouth and template | 6 × 1 |
| Xe1 | Distance between the two eyebrows | 1 × 1 |
| Xe2 | Distance between eye and eyebrow | 1 × 1 |
| Xe3 | Distance between nose and eye (left side) | 1 × 1 |
| Xe4 | Distance between nose and eye (right side) | 1 × 1 |
| Xse | Error between eye and template | 4 × 1 |
- Transformed facial points as features.
- 68 landmark points, including those around the eyes.
- Features selected from the auxiliary region as well.
- A NumPy array is used to convert the 68 points into an array of (x, y) coordinates representing their locations (see the sketch below).
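A minimal sketch of this landmark-to-array step using dlib's standard 68-point shape predictor (the model file must be downloaded separately from dlib.net). The `geometric_features` helper at the end computes a few of the tabulated distances using conventional iBUG landmark indices; those index choices are illustrative assumptions.

```python
import numpy as np
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Standard pre-trained 68-point model, obtained separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_to_array(image_path):
    """Detect the first face and return its 68 landmarks as a (68, 2) array."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])

def geometric_features(pts):
    """A few of the tabulated distances, using standard iBUG indices."""
    mouth_width = np.linalg.norm(pts[54] - pts[48])    # mouth corners
    nose_to_mouth = np.linalg.norm(pts[33] - pts[51])  # nose tip to upper lip
    brow_gap = np.linalg.norm(pts[22] - pts[21])       # inner eyebrow ends
    return np.array([mouth_width, nose_to_mouth, brow_gap])
```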
Data processing:
This section processes all the extracted features into a dataset for prediction.
Emotion Prediction:
Once the processed data is obtained, it is compared against the data obtained during model training, and the closest match is returned as the predicted emotion.
Benefits of the Project

No existing system analyzes all three of these data types. The systems available work on only a single type of data, and for speech there is no good-quality Android-based application that predicts emotion from speech data with good accuracy. We will provide an accurate, responsive Android application that processes all three types of data and recognizes emotions in real time. In the future we will make it cross-platform to support both iOS and Android.
Benefits:
- Improving the online education system
- Improving business growth
- Security improvement
- Safety from social media scams
- Recognizing fake messages and calls
- Recognizing students' emotions and reactions toward lectures
Requirement Specification:
In this chapter we give a detailed description of the software as well as the hardware. It also covers the client's interaction with the software and the backend processing requirements.
The requirement specification consists of two parts: functional and non-functional requirements.
Non-Functional Requirements:
PERFORMANCE:
Performance relies heavily on the user's internet connection: the better the connection, the faster the results.
RELIABILITY:
The accuracy of the results depends on the quality of the recorded audio and the user's speech characteristics; it likewise depends on the quality of the text input and on factors such as image backgrounds.
AVAILABILITY:
The system will be available to access via app 24/7 unless down due to maintenance issues.
CAPACITY:
The system will store only a log of registered users and the data collected during model training for the prediction stage.
SECURITY:
The system will ensure user privacy and protection. The recorded audio files, images, and text taken for analysis will be removed immediately after the system generates the results.
USABILITY:
Users only need the app, which is easy to use and understand.
Functional Requirements:
- Android device
- Strong internet connection
- Good-quality camera
- Good-quality microphone for speech
Framework of Speech-Based Emotion Recognition:

Figure 4: Design 1 for Speech
Framework of Text-Based Emotion Recognition:

Figure 5: Design 2 for Text
Framework of Visual-Image-Based Emotion Recognition:

Figure 6: Design 3 for Visual Image
Use Case of the System:

Figure 7: Use Case Diagram for Emotion Detection
Activity Flow Analysis:

Figure 8: Activity Analysis Diagram for Emotion Detection
Class Based Analysis of Emotion Detection:

Figure 9: Class Analysis Diagram for Emotion Detection
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Education, Finance, Media, Security
Core Technology: Internet of Things (IoT)
Other Technologies:
Sustainable Development Goals: Quality Education; Decent Work and Economic Growth

Required Resources:

| Elapsed time since start of the project (days/weeks/months/quarters) | Milestone | Deliverable |
|---|---|---|
| Month 1 | Complete Software Requirements Specification and Design | Yes |