Sight and Sound

2025-06-28 16:35:00 - Adil Khan

Project Title

Sight and Sound

Project Area of Specialization: Artificial Intelligence

Project Summary

In short: We propose to create a dataset that captures the Sights & Sounds of Pakistan by pairing ambient sound with representative images and videos of scenes, locations, and local cultural events. Initially, Lahore will be our area of interest. Such a dataset will help us digitally preserve the current feel and ambiance of Lahore and create a virtual or augmented reality tour of it. Most importantly, it will help us train deep learning-based content-generation algorithms, which suffer heavily from dataset bias and are unable to produce content that is representative of our culture.

In detail: Sound plays an important role in enhancing human understanding of visual information. Deep learning-based techniques are now used to both index and create digital content, visual (images, videos) as well as audio (speech, music). However, very few works have explored the two together. Understanding how sounds relate to visual content opens up a wide range of possibilities, from indexing them better to generating one given the other. For example, such a model could assist a Foley artist (the person responsible for recreating sound in films) or enhance augmented and virtual reality experiences. However, machine learning algorithms suffer from dataset bias: users cannot search or generate content from a domain that was absent from the training data. For content-generation algorithms the impact is twofold: the generated content does not represent local culture, and once generated it displaces native content.

We will construct two datasets. For the first, we will collect videos and images of different cultural and social events (e.g. Barat arrival, Mehendi, qawwali, Muharram procession, and local festivals such as Sibi-Mela and the Kalash festival) and of locations across Pakistan (train and bus stations, schools, seashore, gardens, bazaars, etc.). The second is the Walk-Along tour dataset, for which we will capture both sound and video at different locations in Lahore (e.g. Anarkali Bazar, Badshahi mosque, food-street, Railway station). The Walk-Along dataset will be used to create a virtual tour of these places so that people can visit them without being there.

The first dataset will be used to learn the relationship between sound and scene. We will use it to design pipelines for better content retrieval as well as for generating sounds from images.

Project Objectives

To achieve our goal, we will mainly focus on the following objectives:

Develop web application and data collection app: The project website will present the introduction and details of our project. Alongside this, we will post samples of the dataset and the progress of the project on the website. The data collection app will capture an image and record sound, and the two files will be uploaded to Google Drive on the spot. Access to this application will be given only to members of our group so that they can create and access the dataset.
Develop UI map application: This mobile application will help us curate the collected data and make it available to the general public. Users can navigate locations on the map and enjoy the scenes and sounds associated with them. This will be the end product of our project, into which all other code will be integrated.
Collect representative images and sounds: The aim is to capture the culture of Pakistan (e.g. mosques, markets, local food spots, wedding halls, roads) and create a dataset. Our team members will visit different areas of Pakistan and save the data in a single, organized folder in Google Drive.
Preprocessing of data: The collected data will contain a lot of noise and irrelevant information, so we intend to clean and reshape the images and sounds to avoid difficulties when training our models. Furthermore, we will divide our dataset into training, validation, and test sets. Data augmentation techniques may also be used to increase the size of our dataset.
Conduct research on existing work related to audio-video correspondence: We will study the existing research on audio and visual representation of multimedia content. We will evaluate the available implementations on the content we collect.
Deep learning application for content retrieval and classification: We will classify our images using simple CNNs and pre-trained models such as VGG16, ResNet-50, and DenseNet. We will then use fine-tuning and transfer learning to adapt the model to our dataset. The model will take an image as input and output whether it shows a park, mosque, wedding hall, market, or any other class used in training.
Generation of sound: Sound will be generated in accordance with the features and setting of the image. For example, if the image shows rain in a marketplace, our model will produce the sound of rain and the sound of people in a crowded place. We will record and generate binaural audio to make the experience more realistic.
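
The division into training, validation, and test sets mentioned under "Preprocessing of data" could be done along the following lines. This is only a minimal sketch: the function name split_dataset, the 70/15/15 proportions, the seed, and the sample IDs are all assumptions for illustration, not choices made in the proposal.

```python
import random

def split_dataset(sample_ids, train_pct=70, val_pct=15, seed=0):
    """Shuffle sample IDs reproducibly and cut them into train/val/test lists.

    The percentages and seed are illustrative assumptions.
    """
    ids = sorted(sample_ids)              # sort first so input order does not matter
    random.Random(seed).shuffle(ids)      # seeded shuffle gives the same split every run
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],    # whatever remains goes to the test set
    }

# Hypothetical IDs, each standing for one (image, audio) pair.
samples = [f"lahore_{i:03d}" for i in range(20)]
splits = split_dataset(samples)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by sample ID keeps each image together with its recorded sound. If many samples come from the same location, it may be safer to split by location instead so that near-duplicate scenes do not appear in both training and test sets.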

Project Implementation Method

The final year project implementation has been divided into two phases, Thesis 1 and Thesis 2.
In Thesis 1 (Phase 1), our implementation strategy includes reviewing the literature, understanding the work already done in this domain, and building an understanding of the models used in research papers by running them ourselves. It then includes building an Android application that will help us capture the Sight and Sound of Lahore in the form of images and audio and store it on the central drive. In parallel, Phase 1 includes building a project website for posting the current progress of the project. Furthermore, it includes the creation of the scene-to-sound dataset and fine-tuning a pre-trained image classifier for image classification.
Thesis 2 (Phase 2) includes audio generation for the respective images, integration of the image classifier with the audio generation model, and finally the development of an Android application whose backend runs the integrated image classifier and audio generation models. The application will take an image as input and produce the corresponding audio as output.
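
The Phase 2 integration described above can be pictured as a thin wrapper that chains the two models. The sketch below is hedged accordingly: make_image_to_sound, fake_classifier, and the clip file names are hypothetical, and a simple lookup of recorded clips per class stands in for the audio generation model, which the proposal does not specify.

```python
from typing import Callable, Dict, List, Tuple

def make_image_to_sound(
    classify: Callable[[str], str],
    clips_by_label: Dict[str, List[str]],
) -> Callable[[str], Tuple[str, List[str]]]:
    """Chain a scene classifier with a per-class sound source."""
    def image_to_sound(image_path: str) -> Tuple[str, List[str]]:
        label = classify(image_path)                  # step 1: predict the scene class
        return label, clips_by_label.get(label, [])   # step 2: fetch sound for that class
    return image_to_sound

# Hypothetical stand-in so the sketch runs without a trained model.
def fake_classifier(image_path: str) -> str:
    return "market" if "anarkali" in image_path else "park"

# Hypothetical clip names; a generation model would replace this lookup.
clips = {"market": ["market_crowd_01.wav"], "park": ["park_birds_01.wav"]}

pipeline = make_image_to_sound(fake_classifier, clips)
result = pipeline("anarkali_bazar_001.jpg")
print(result)
```

Keeping the classifier and the sound source behind plain callables means either one can be swapped (for example, lookup replaced by a generative model) without changing the application code that calls the pipeline.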

Benefits of the Project

Overall research contribution towards generating a non-western dataset: We will create a diverse, culture-specific dataset for Pakistan, which will be a major contribution to research.
Encourage local researchers: Our dataset will be open-source and available for local researchers to work with, encouraging them to develop more advanced practical applications on top of it.
Diversity: This project will train deep learning-based content-generation algorithms, which currently suffer heavily from dataset bias and are unable to produce content that is representative of our culture. Because we will apply bias-reduction techniques, the resulting model will help preserve the diversity of Pakistani culture.
Automatic content generation that can be used in games, augmented reality, animation: Our project will help improve the user experience in video games and in virtual and augmented reality. Since our model will automatically associate sound with images, it can also be used in the animation industry.
Addresses an emerging need: We are focusing on the niche of generating audio from images, which is rapidly gaining popularity and is likely to be in high demand in the near future.

Technical Details of Final Deliverable

We will study the protocols used for large dataset collections such as ImageNet, Places, and CIFAR-10, and design our own protocol on that basis. Techniques such as annotation agreement will be used to minimize label noise.
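
One common way to quantify annotation agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The proposal does not name a specific measure, so the choice of kappa here is an assumption, as are the function name cohen_kappa and the example labels.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each class, the product of the two annotators'
    # label frequencies, summed over classes.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label everywhere
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators for four images.
annotator_1 = ["mosque", "market", "park", "market"]
annotator_2 = ["mosque", "market", "market", "market"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Items on which annotators disagree could then be re-labeled or discarded before training; the threshold for acceptable agreement would need to be set as part of the collection protocol.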
In our literature review, we found that researchers have applied a simple RNN to generate natural sound for videos in the wild and have modified the RNN for other models. For image classification, AlexNet, GoogLeNet, and CRNN (a combination of convolutional and recurrent neural networks) have been used to attain higher accuracy, with GoogLeNet giving the highest reported accuracy (93%) on the UrbanSound8K dataset. In another attempt to relate audio and images, models were trained on (image, audio) pairs drawn from the Flickr8k, MSCOCO, Flickr-Audio, and SPEECH-COCO corpora. That model was a sequence-to-sequence model composed of an encoder, an attention module, and a decoder. On the audio side, we found that the Sep-Stereo framework has the advantage of leveraging mono audio data for stereophonic learning; the reported experiments indicate that this approach can produce more realistic binaural audio while preserving satisfactory source quality.

Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries: Education
Core Technology: Artificial Intelligence (AI)
Other Technologies: Augmented & Virtual Reality
Sustainable Development Goals: Industry, Innovation and Infrastructure

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
Binaural Microphones	Equipment	1	40000	40000
Portable Stereo camera rig	Equipment	1	10000	10000
Jetson Nano	Equipment	1	15000	15000
Travel to the locations	Miscellaneous	1	10000	10000
			Total (in Rs)	75000
