Sight and Sound
In short: We propose to create a dataset that captures the Sights & Sounds of Pakistan by associating ambient sound with images and videos of scenes, locations, and local cultural events. Initially, Lahore will be our area of interest. Such a dataset will help us digitally preserve the current feel or ambiance of Lahore and create a virtual or augmented reality tour of it. Most importantly, it will help us train deep learning-based content-generation algorithms, which suffer heavily from dataset bias and are unable to produce content that is representative of our culture.
In Detail: Sound plays an important role in enhancing human understanding of visual information. Deep learning-based techniques are now being employed to both index and create digital content, visual (images, videos) as well as audio (speech, music). However, very few works have explored the two in conjunction. Understanding how sounds relate to visual content opens up a wide range of possibilities, from indexing them better to generating one given the other. For example, such a model could assist a Foley artist (the person responsible for recreating sound in films), or could enhance the experience of augmented/virtual reality. However, machine learning algorithms suffer from dataset bias: users cannot search or generate content from a domain that was not in the training data. For content-generation algorithms the impact is twofold: the generated content will not represent the local culture, and once generated it displaces native content.
We will construct two datasets. For the first, we will collect videos and images of different cultural and social events (e.g. Barat arrival, Mehendi, qawwali, Muharram processions, local festivals like the Sibi Mela and the Kalash festival) and of locations across Pakistan (train and bus stations, schools, seashores, gardens, bazaars). The second is the Walk-Along tour dataset, for which we will capture both sound and video at different locations in Lahore (e.g. Anarkali Bazar, Badshahi Mosque, Food Street, the Railway Station). The Walk-Along dataset will be used to create a virtual tour of these places so that people can visit them without being there.
The first dataset will be used to learn the relationship between the sound and the scene. We will use it to design pipelines for better content retrieval as well as for generating sounds given images.
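As an illustration of the retrieval side, here is a minimal sketch of ranking sound clips against an image by cosine similarity. It assumes that some trained model has already mapped images and sounds into a shared embedding space; the three-dimensional vectors, the file names, and the function names below are hypothetical placeholders, not outputs of any model from this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def retrieve_sounds(image_embedding, sound_index, top_k=3):
    """Return the top_k (clip_id, score) pairs, best match first."""
    scored = [
        (clip_id, cosine_similarity(image_embedding, embedding))
        for clip_id, embedding in sound_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones would come from a trained model.
sound_index = {
    "anarkali_bazar.wav": [0.9, 0.1, 0.0],
    "badshahi_mosque.wav": [0.1, 0.9, 0.1],
    "railway_station.wav": [0.2, 0.1, 0.9],
}
query_image_embedding = [0.8, 0.2, 0.1]  # e.g. a photo of a bazaar

results = retrieve_sounds(query_image_embedding, sound_index, top_k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

The same index could be queried in the other direction (sound to image) by swapping the roles of the two embedding sets.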
To achieve our goal, we will mainly focus on the following objectives:
Develop web application and data collection app: The project website will present the introduction and details of our project. Alongside, we will post sample data and progress updates on the website. The data collection app will capture an image and record sound, and the two files will be uploaded to Google Drive on the spot. Access to this application will be given only to members of our group, so that they can create and access the dataset.
Develop UI map application: This mobile application will help us curate the collected data and make it available to the general public. Users can navigate the locations on the map and enjoy the scenes and sounds associated with them. This will be the end product of our project, into which all other code will be integrated.
Collect representative images and sounds: The aim is to capture the culture of Pakistan (e.g. mosques, markets, local food spots, wedding halls, roads) and create a dataset. Our team members will visit different areas of Pakistan and save the data in a single organized folder in Google Drive.
Preprocessing of data: Since the collected data will contain a lot of noise and irrelevant information, we intend to clean and reshape the images and sounds to avoid difficulties while training our models. Furthermore, we will divide our dataset into training, validation and test sets. Data augmentation techniques may also be used to increase the size of our dataset.
Carry out research on existing work related to audio-video correspondence: We will study the existing research on the audio and visual representation of multimedia content. We will evaluate the available implementations on the content we collect.
Deep learning application for content retrieval and classification: Using simple CNNs and pre-trained models like VGG16, ResNet-50, and DenseNet, we will classify our images. We will then use fine-tuning and transfer learning to adapt the models to our dataset. Our model will take an image as input and output whether the image shows a park, mosque, wedding hall, market, or any other class that we used for training.
Generation of sound: Sound will be generated in accordance with the features and setting of the image. For example, if the image shows rain in a marketplace, our model will produce the sound of rain and the sound of people in a crowded place. We will record and generate binaural audio, which will make the experience more realistic.
We will study the protocols that were used to collect large datasets such as ImageNet, Places, and CIFAR-10, and design our own protocol on that basis. Techniques like annotation agreement will be used to reduce label noise.
In our literature review, we found that researchers have applied a simple RNN to generate natural sound for videos in the wild, and have modified the RNN to obtain other models. For classification, AlexNet, GoogLeNet, and CRNN (a combination of convolutional and recurrent neural networks) have been used, with GoogLeNet giving the highest accuracy (93%) on the UrbanSound8K dataset. In another attempt to relate audio and images, models were trained using (image, audio) pairs drawn from the Flickr8k, MSCOCO, Flickr-Audio, and SPEECH-COCO corpora. This model was a sequence-to-sequence model composed of an encoder, an attention module, and a decoder. On the audio side, we found that the Sep-Stereo framework has the unique advantage of leveraging mono audio data for stereophonic learning; extensive experiments demonstrate that this approach is capable of producing more realistic binaural audio while preserving satisfactory source quality.
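The preprocessing objective above mentions dividing the dataset into training, validation and test sets. A minimal sketch of a reproducible split over (image, sound) pairs is shown below. The 70/15/15 ratio, the fixed seed, and the file names are assumptions made for illustration; the proposal does not specify them.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle samples with a fixed seed and split into train/val/test."""
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(samples)                # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # local RNG keeps the split reproducible
    n = len(shuffled)
    n_train = round(n * train_frac)
    n_val = min(round(n * val_frac), n - n_train)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Hypothetical (image, sound) file pairs, as the collection app might name them.
samples = [(f"img_{i:03d}.jpg", f"snd_{i:03d}.wav") for i in range(20)]
splits = split_dataset(samples)
print({name: len(items) for name, items in splits.items()})
```

If several samples are recorded at the same place within minutes of each other, splitting by recording session rather than by individual sample may be a better choice, since near-duplicate pairs on both sides of the split would inflate test scores.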
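One common way to quantify annotation agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The sketch below is an illustration only; the choice of kappa, the label names, and the example annotations are assumptions, not part of the protocol described above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scene labels from two annotators for the same six images.
annotator_1 = ["mosque", "market", "park", "market", "mosque", "park"]
annotator_2 = ["mosque", "market", "park", "mosque", "mosque", "park"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Items on which annotators disagree could then be sent for a third opinion or dropped, depending on the protocol that is eventually adopted.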
| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Binaural Microphones | Equipment | 1 | 40000 | 40000 |
| Portable Stereo camera rig | Equipment | 1 | 10000 | 10000 |
| Jetson Nano | Equipment | 1 | 15000 | 15000 |
| Travel to the locations | Miscellaneous | 1 | 10000 | 10000 |
| Total (in Rs) | | | | 75000 |