Sight and Sound


2025-06-28 16:35:00 - Adil Khan

Project Title

Sight and Sound

Project Area of Specialization

Artificial Intelligence

Project Summary

In short: We propose to create a dataset that captures the sights and sounds of Pakistan by pairing ambient sound with images and videos of scenes, locations, and local cultural events. Initially, Lahore will be our area of interest. On the one hand, such a dataset will help us digitally preserve the current feel and ambiance of Lahore and create a virtual or augmented reality tour of the city. More importantly, it will help us train deep learning-based content-generation algorithms, which suffer heavily from dataset bias and are therefore unable to produce content that is representative of our culture.

In detail: Sound plays an important role in enhancing human understanding of visual information. Deep learning-based techniques are now used to both index and create digital content, visual (images, videos) as well as audio (speech, music). However, very few works have explored the two in conjunction. Understanding how sounds relate to visual content opens up a wide range of possibilities, from indexing both better to generating one given the other. For example, such a model could assist a Foley artist (the person responsible for recreating sound in films) when working on videos, or enhance the experience of augmented or virtual reality. However, machine learning algorithms suffer from dataset bias: users cannot search for or generate content from a domain that was absent from the training data. For content-generation algorithms the impact is twofold: the generated content does not represent the local culture, and once generated it displaces the native content.

We will construct two datasets. For the first, we will collect videos and images of cultural and social events (e.g. Barat arrival, Mehendi, qawwali, Muharram processions, and local festivals such as the Sibi-Mela and the Kalash festival) and of locations (e.g. train and bus stations, schools, the seashore, gardens, and bazaars) across Pakistan. The second is the Walk-Along tour dataset, for which we will capture both sound and video at different locations in Lahore (e.g. Anarkali Bazar, Badshahi mosque, food-street, and the Railway station). The Walk-Along dataset will be used to create virtual tours of these places so that people can visit them without being there.
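One possible way to keep the two datasets consistent is to describe every paired recording with the same small record. The sketch below is only an illustration: the `Sample` class, its field names, the subset names, and the file paths are all assumptions made for this example and are not part of the proposal.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Sample:
    """One paired recording. Every field name here is an illustrative assumption."""
    sample_id: str
    subset: str       # assumed values: "events_and_locations" or "walk_along"
    label: str        # scene or event class, e.g. "bazaar" or "qawwali"
    place: str        # free-text location name
    visual_path: str  # image or video file
    audio_path: str   # ambient sound recorded together with the visual


def to_manifest_line(sample: Sample) -> str:
    """Serialise one sample as a single JSON Lines record."""
    return json.dumps(asdict(sample), ensure_ascii=False, sort_keys=True)


# Hypothetical example entry; the paths do not refer to real files.
example = Sample(
    sample_id="lhr-0001",
    subset="walk_along",
    label="bazaar",
    place="Anarkali Bazar, Lahore",
    visual_path="walk_along/lhr-0001.mp4",
    audio_path="walk_along/lhr-0001.wav",
)
line = to_manifest_line(example)
print(line)
```

Writing one such line per recording into a manifest file would let the retrieval, classification, and map components read the same index, whichever storage is eventually used.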

The first dataset will be used to learn the relationship between sound and scene. We will use it to design pipelines for better content retrieval as well as for generating sounds from images.
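As a rough illustration of what sound retrieval for an image could look like, the sketch below ranks stored sound embeddings by cosine similarity to an image embedding. It assumes that some trained model already maps images and sounds into a shared vector space. The three-dimensional vectors, the file names, and the function `retrieve_sounds` are made up for the example; this is one common retrieval approach, not the pipeline the project has committed to.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_sounds(image_vec, sound_index, k=2):
    """Return the names of the k sounds whose embeddings are closest to image_vec."""
    scored = [(cosine(image_vec, vec), name) for name, vec in sound_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings standing in for the output of a learned audio encoder (assumed).
sound_index = {
    "bazaar_crowd.wav": [0.9, 0.1, 0.0],
    "mosque_courtyard.wav": [0.1, 0.9, 0.1],
    "train_platform.wav": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of a bazaar photograph (assumed)
top = retrieve_sounds(query, sound_index, k=2)
print(top)
```

With these toy vectors the bazaar recording ranks first because its embedding points in nearly the same direction as the query.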

Project Objectives

To achieve our goal, we will mainly focus on the following objectives:

  1. Develop a web application and a data collection app: The project website will introduce the project and present its details. We will also post sample data and progress updates on the website. The data collection app will capture an image and record sound, and both files will be uploaded to Google Drive on the spot. Access to this application will be given only to members of our group so that they can create and access the dataset.

  2. Develop a UI map application: This mobile application will help us curate the collected data and make it available to the general public. Users can navigate to locations on the map and experience the scenes and sounds associated with them. This will be the end product of our project, into which all other code will be integrated.

  3. Collect representative images and sounds: The aim is to capture the culture of Pakistan (e.g. mosques, markets, local food spots, wedding halls, roads) and create a dataset. Our team members will visit different areas of Pakistan and save the data in a single, organized folder in Google Drive.

  4. Preprocessing of data: The collected data will contain a lot of noise and irrelevant information, so we intend to clean and reshape the images and sounds to avoid difficulties when training our models. Furthermore, we will divide our dataset into training, validation, and test sets. Data augmentation techniques may also be used to increase the size of our dataset.

  5. Carry out research on existing work related to audio-video correspondence: We will study prior research on audio and visual representations of multimedia content. All available implementations will be compared on the content we collect.

  6. Deep learning application for content retrieval and classification: Using simple CNNs and pre-trained models such as VGG16, ResNet-50, and DenseNet, we will classify our images. We will then use fine-tuning and transfer learning to adapt the models to our dataset. The model will take an image as input and output whether it shows a park, mosque, wedding hall, market, or any other class used for training.

  7. Generation of sound: Sound will be generated in accordance with the features and setting of the image. For example, if the image shows rain in a marketplace, our model will produce the sound of rain together with the sound of people in a crowded place. We will record and generate binaural audio, which will make the result more realistic.
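The division into training, validation, and test sets mentioned in objective 4 could be done per class, so that every scene type appears in each set. The sketch below is one minimal way to do this with the Python standard library. The 70/15/15 ratio, the fixed seed, the function name `stratified_split`, and the file names are assumptions chosen for illustration, since the proposal does not specify them.

```python
import random
from collections import defaultdict


def stratified_split(samples, train=0.7, val=0.15, seed=0):
    """Split (item, label) pairs into train/val/test sets, class by class.

    The ratios and the seed are illustrative defaults, not values from the proposal.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        items = sorted(by_label[label])
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += [(i, label) for i in items[:n_train]]
        splits["val"] += [(i, label) for i in items[n_train:n_train + n_val]]
        splits["test"] += [(i, label) for i in items[n_train + n_val:]]
    return splits


# Hypothetical file names: 20 images for each of three classes.
samples = [
    (f"{label}_{i:02d}.jpg", label)
    for label in ("mosque", "market", "park")
    for i in range(20)
]
splits = stratified_split(samples)
print({name: len(part) for name, part in splits.items()})
```

For paired data, each image and its sound recording should be assigned to the same set. Any data augmentation would normally be applied only to the training set after splitting, so that augmented copies of one recording do not leak into validation or test.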

Project Implementation Method

Benefits of the Project

Technical Details of Final Deliverable

Final Deliverable of the Project

Software System

Core Industry

IT

Other Industries

Education

Core Technology

Artificial Intelligence (AI)

Other Technologies

Augmented & Virtual Reality

Sustainable Development Goals

Industry, Innovation and Infrastructure

Required Resources

Item Name                    Type            No. of Units   Per Unit Cost (in Rs)   Total (in Rs)
Binaural Microphones         Equipment       1              40000                   40000
Portable Stereo camera rig   Equipment       1              10000                   10000
Jetson Nano                  Equipment       1              15000                   15000
Travel to the locations      Miscellaneous   1              10000                   10000
Total (in Rs)                                                                       75000
