Synthetic Data Production for Machine Learning using Robotic Arm

2025-06-28 16:36:14 - Adil Khan

Project Title

Synthetic Data Production for Machine Learning using Robotic Arm

Project Area of Specialization

Artificial Intelligence

Project Summary

How do we go about the task of selecting the grasp locations for an object? One method is to create a 3D model of the object and annotate the grasp locations manually. However, such a methodology has two drawbacks:

  1. Fitting 3D models is an extremely difficult and time-consuming task in itself, and
  2. A geometry-based approach may ignore the distribution of density and mass within an object, which can play a significant part in selecting the grasp locations.

Therefore, a more practical and viable approach is to use visual recognition to predict grasp locations, since it does not require explicit manual 3D modelling of objects. For example, one can create a grasp-location training dataset for hundreds or thousands of objects and use standard machine learning algorithms such as CNNs. However, creating a grasp-location dataset for thousands of objects using human labeling can itself be quite challenging for two reasons:

  1. First, most objects can be grasped from multiple positions and angles, which makes manually labeling all possible grasp locations impractical.
  2. Second, human notions of grasping are biased by semantics: humans act largely on reflex while picking up objects. For example, humans tend to use handles as the grasp location for many objects such as cups and bags, even though there might be many more positive grasp locations and configurations.

In our project, the robot will train itself using self-learning algorithms and build a database of all the positive points it gathers through hit-and-trial grasping. At least 100 positive grasp locations for unique objects will be collected and stored in the database. A customized, scalable intelligent grasping robot will be developed: by increasing the scale of the robot we can handle a bigger payload and use the same datasets to interact with bigger objects. Our final outcome will be a self-learning grasping robot which does not need human input to identify, pick, and drop novel objects.

First, using Python’s image-processing libraries and the Kinect V1’s depth sensor, we differentiated between the background and the objects. Next, by finding the two boundaries of each object, we selected multiple points between them. Then, for the same objects, we found the approximate center of gravity. Since we have to train our grasping-point selection model on images, we created patches on the objects, centered on the grasping point in question.

A system was then designed using ROS (Robot Operating System) to control the robot: given a set of coordinates, the gripper moves to that position. Afterwards, using visual confirmation to check whether the object is properly grasped, the grasping point is saved in the database as a positive or negative grasp location.
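The grasp database described above can be sketched with Python's built-in sqlite3 module. The table name and columns here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3

# Each hit-and-trial attempt is stored with its image coordinates, gripper
# angle, and a positive/negative label from the visual confirmation step.
conn = sqlite3.connect(":memory:")  # use a file path on the real robot
conn.execute(
    """CREATE TABLE grasps (
           object_id INTEGER,
           x REAL, y REAL,          -- grasp point in image coordinates
           angle_deg REAL,          -- gripper rotation for this attempt
           success INTEGER          -- 1 = positive, 0 = negative label
       )"""
)

def record_trial(object_id, x, y, angle_deg, success):
    """Save one hit-and-trial grasp attempt."""
    conn.execute("INSERT INTO grasps VALUES (?, ?, ?, ?, ?)",
                 (object_id, x, y, angle_deg, int(success)))
    conn.commit()

record_trial(1, 120.0, 85.0, 40.0, True)    # grasp held: positive point
record_trial(1, 30.0, 10.0, 130.0, False)   # object slipped: negative point
positives = conn.execute(
    "SELECT COUNT(*) FROM grasps WHERE success = 1").fetchone()[0]
print(positives)  # → 1
```

Keeping the negative attempts as well is what makes the collected dataset complete, as discussed in the objectives below.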

Project Objectives

Fig. 1. (a) An example object; (b) possible grasp locations and configurations.

Consider the object shown in Fig. 1(a). How do we go about selecting the grasp locations for this object? One method is to create a 3D model of the object and annotate the grasp locations manually. However, such a methodology has two drawbacks:

  1. Fitting 3D models is an extremely difficult and time-consuming task in itself, and
  2. A geometry-based approach may ignore the distribution of density and mass within an object, which can play a significant part in selecting the grasp locations.

Therefore, a more practical and viable approach is to use visual recognition to predict grasp locations, since it does not require explicit manual 3D modelling of objects. For example, one can create a grasp-location training dataset for hundreds or thousands of objects and use standard machine learning algorithms such as CNNs. However, creating a grasp-location dataset for thousands of objects using human labeling can itself be quite challenging for two reasons:

  1. First, most objects can be grasped from multiple positions and angles, which makes manually labeling all possible grasp locations impractical.
  2. Second, human notions of grasping are biased by semantics: humans act largely on reflex while picking up objects. For example, humans tend to use handles as the grasp location for many objects such as cups and bags, even though there might be many more positive grasp locations and configurations (shown in Fig. 1(b)).

Hence, a randomly sampled patch of an object cannot be assumed to be a negative grasp location, even if it was not marked as a positive location by a human.

In this project, we break the trend of using manually labeled grasp datasets for training grasping robots, as such an approach is not scalable. Instead, taking inspiration from human experiential learning, we present a self-supervised algorithm that learns to predict grasp locations and creates a dataset with both positive and negative points via trial and error.
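The trial-and-error collection loop can be sketched as follows. Here `try_grasp` is a hypothetical stand-in for the real robot attempt (move to the point, close the gripper at the given angle, lift, and visually verify); it is simulated with a random outcome so the loop structure is runnable:

```python
import random

def try_grasp(point, angle):
    """Stand-in for a physical grasp attempt; assumption: ~30% of random
    attempts succeed. The real system checks success visually."""
    return random.random() < 0.3

def self_supervised_collection(points, n_angles=18, trials_per_point=1):
    """Trial-and-error labeling: every attempted (point, angle) pair is
    recorded as positive or negative — no human annotation involved."""
    dataset = []
    for point in points:
        for k in range(n_angles):            # 18-way angle bin, 10 deg apart
            angle = k * 180 / n_angles
            for _ in range(trials_per_point):
                label = try_grasp(point, angle)
                dataset.append((point, angle, label))
    return dataset

random.seed(0)
data = self_supervised_collection([(50, 80), (120, 40)])
print(len(data))  # → 36: 2 points x 18 angles, each labeled
```

Because every attempt is recorded, failed grasps become genuine negative examples rather than unlabeled patches.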

Here the question arises: how much training data do we need to train our grasp models, using convolutional neural networks as a base, to predict possible grasp locations for novel objects? Recent studies have shown better results than previous experiments by using reinforcement learning with a few hundred grasp locations to train a CNN.

We present a proposal for a large-scale experimental study that not only substantially increases the amount of data for a self-learning grasping robot, but also provides a complete dataset with positive and negative points, in terms of whether an object can be grasped at a particular location and angle. This dataset, collected through robot-executed interactions, will be released to the research community.

Project Implementation Method

Our robot will train itself using self-learning algorithms and build a database of all the positive points it gathers through hit-and-trial grasping. At least 100 positive grasp locations for unique objects will be collected and stored in the database. A customized, scalable intelligent grasping robot will be developed: by increasing the scale of the robot we can handle a bigger payload and use the same datasets to interact with bigger objects. Our final outcome will be a self-learning grasping robot which does not need human input to identify, pick, and drop novel objects.

First, using Python’s image-processing libraries and the Kinect V1’s depth sensor, we differentiated between the background and the objects. Next, by finding the two boundaries of each object, we selected multiple points between them. Then, for the same objects, we found the approximate center of gravity. Since we have to train our grasping-point selection model on images, we created patches on the objects, centered on the grasping point in question.

Background subtraction is a popular method for isolating the moving parts of a scene by segmenting it into background and foreground. The fundamental logic for detecting objects is to find the difference between the background and the foreground; this is known as the “Frame Difference Method”.

For selecting grasping points, since we have to find the best grasping point of an object relative to all the other points, we first selected 5 points on each object. These points were chosen along the particular object’s top boundary, equidistant from each other between its two ends. The points are shown on the picture in red, so as to distinguish them from the noise.

The center of mass of each object was then found and marked. The best grasping point for most objects is the center of mass, since at that location all the forces are in natural equilibrium. However, as seen in the picture, for some irregularly shaped objects the algorithm failed to find the actual center of mass. For this problem, we will discard the incorrect center of mass and use the generically selected points, or divide the object in two and find the center of mass of each part separately.

Next comes patch selection for each grasping point: a patch of size 227x227 was made with the grasping point at its center. These patches will be used to train our model to select the best grasping point out of all candidates, taking into consideration rules and measures for identifying a good grasp.

When using an RGB camera, the algorithm failed to differentiate between background and foreground for objects similar in contrast to the background. With the Kinect V1, the depth sensor allowed us to detect objects with greater precision, as objects with some mass appeared in a different shade of grey from the background.
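A minimal sketch of this pipeline using NumPy only. The real system reads depth frames from the Kinect V1; here a synthetic depth image with one rectangular “object” stands in for a real capture, and the threshold value is an assumption:

```python
import numpy as np

# Synthetic stand-in for Kinect V1 depth frames (values = depth in arbitrary units).
background = np.full((480, 640), 200, dtype=np.int16)   # empty table
frame = background.copy()
frame[100:300, 200:400] = 120                           # object is closer

# 1) Frame Difference Method: foreground = pixels whose depth changed.
mask = np.abs(frame - background) > 20

# 2) Five candidate points, equidistant along the object's top boundary.
cols = np.where(mask.any(axis=0))[0]                    # object's left/right extent
left, right = cols[0], cols[-1]
xs = np.linspace(left, right, num=7)[1:-1].astype(int)  # 5 interior columns
top_points = [(mask[:, x].argmax(), x) for x in xs]     # first foreground row per column

# 3) Center of mass of the foreground pixels.
rows_idx, cols_idx = np.nonzero(mask)
center = (int(rows_idx.mean()), int(cols_idx.mean()))

# 4) A 227x227 patch around a grasp point (clamped to the image border).
def extract_patch(img, point, size=227):
    r = np.clip(point[0] - size // 2, 0, img.shape[0] - size)
    c = np.clip(point[1] - size // 2, 0, img.shape[1] - size)
    return img[r:r + size, c:c + size]

patch = extract_patch(frame, center)
print(center, patch.shape)  # → (199, 299) (227, 227)
```

For irregular shapes, step 3 can be run separately on each half of the mask, as described above.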

Benefits of the Project

In terms of academic worth, our project demonstrates that thanks to the creation and wider use of high-capacity CNNs, trial-and-error experiments for robot tasks such as grasping are now possible. We present a new methodology which uses self-supervised learning algorithms to perform the task at hand. We will also make public the dataset created during the project, with both positive and negative grasp locations, and allow people to use the algorithm we have created and the retrained AlexNet CNN for robotics-related tasks. This methodology will allow researchers and scientists to work on creating larger datasets using state-of-the-art technology and robotics. The system also has the potential to be used for high-school and college education.

Currently, robots perform human labor in all kinds of places; best of all, they do jobs that are unhealthy or impractical for people, or in areas which are contaminated. This project enables workers to do more skilled jobs without any risk to their health.

Another point to consider is that, previously, tasks in robotics were done by creating 3D models of objects. Taking a car assembly line as an example: for each new car or new part, a new 3D design was made and fed to the robot so it could handle the part properly. With our proposal, the robot can interact with the part itself and create the data points, so workers only need to program the robot to place the object in position and nothing else. Since we will build a system that is purely based on vision inputs, the end result will be a low-cost, sensor-free robotic arm which can be used to accomplish grasping tasks.

As a prototype, our project can be used by the community in multiple ways, easily accomplishing tasks that are repetitive in nature and a cause of harm to people.

Most of all, it contributes to the community by paving a path for future students and researchers to build better technology, taking the self-learning algorithm and methodology as a base. Students in high schools or universities can be introduced to such technology to spark their interest in this field.

This project also shows that many robot-mediated tasks can be configured and performed without dedicated sensors, using only visual readings. This can in turn lower the costs of many projects and let people work on innovative ideas that were previously thought impossible.

Technical Details of Final Deliverable

The scope of our project includes an algorithm for identifying objects present in an image, finding the center of mass of each object, and selecting arbitrary points as grasp locations. The algorithm will also output image patches of a specific size, which will be used to retrain an AlexNet CNN. The retrained AlexNet will allow us to choose the best grasping point for each object, relative to all other candidate points. Once a grasp location is selected, using a grid system for our area of contact, the CNN will output a set of coordinates and an angle. Since we use an 18-way angle bin, for each selected grasp location the robotic arm will try to grasp the object from 18 different angles, and the angle(s) at which the object is grasped successfully and with relative ease will be marked as positive.
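The 18-way angle bin maps a continuous gripper angle onto one of 18 discrete classes, 10 degrees apart over the 0–180 degree range (for a parallel gripper, a grasp at theta and theta + 180 is the same). The helper names below are illustrative:

```python
def angle_to_bin(theta_deg, n_bins=18):
    """Quantize a grasp angle in degrees to a bin index in [0, n_bins)."""
    return int((theta_deg % 180) / (180 / n_bins))

def bin_to_angle(bin_idx, n_bins=18):
    """Representative (center) angle of a bin, in degrees."""
    return (bin_idx + 0.5) * 180 / n_bins

print(angle_to_bin(95))   # → 9: 95 deg falls in the 90–100 deg bin
print(bin_to_angle(9))    # → 95.0, the center of that bin
```

Modeling the CNN output layer over these 18 bins lets a single object carry several positive angles at one location.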

The final product will be a prototype which, using the fine-tuned algorithm and an Arduino-based serial controller, performs hit-and-trial grasping experiments with the robotic arm. The algorithm, as previously mentioned, will be self-learning, meaning that even for objects of greater size and different shapes the model will work well with the same configuration. The purpose of our algorithm is to produce a prototype that reacts to novel objects in the same way as it does to objects that have already been tested.
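One way the grasp planner could talk to the Arduino-based serial controller is a simple line protocol carrying the coordinates and angle. The "G x y z a" format here is an assumption for illustration, not the project's actual firmware protocol:

```python
def format_grasp_command(x_mm, y_mm, z_mm, angle_deg):
    """Encode one grasp attempt as a newline-terminated serial command
    (hypothetical 'G x y z a' line protocol)."""
    return f"G {x_mm:.1f} {y_mm:.1f} {z_mm:.1f} {angle_deg:.1f}\n"

cmd = format_grasp_command(120.0, 85.5, 30.0, 40.0)
print(repr(cmd))  # → 'G 120.0 85.5 30.0 40.0\n'
# On the robot this string would be written to the port, e.g. with pyserial:
#   serial.Serial("/dev/ttyACM0", 115200).write(cmd.encode())
```

A plain-text protocol like this keeps the Arduino firmware trivial to parse and easy to debug over a serial monitor.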

The outcomes of the project will be three-fold:

  1. We introduce a relatively large robot dataset for the task of grasping. Our dataset will contain positive and negative grasp locations for more than 50 objects, collected through trial-and-error experiments with an aluminum-based robot.
  2. We present a novel formulation of a high capacity CNN for the task of grasping. We predict grasping locations by sampling image patches and predicting the grasping angle. Note that since an object may be graspable at multiple angles, we model the output layer as an 18-way angle bin.
  3. The final product will be a self-learning, scalable grasping robot which can handle novel objects as well as objects with different properties such as size and mass.
Final Deliverable of the Project: HW/SW integrated system
Core Industry: Manufacturing
Other Industries: Petroleum, Agriculture
Core Technology: Artificial Intelligence (AI)
Other Technologies: Robotics
Sustainable Development Goals: Industry, Innovation and Infrastructure

Required Resources
Item Name          Type        No. of Units    Per Unit Cost (in Rs)    Total (in Rs)
Robotic Arm        Equipment   1               10500                    10500
Arduino Mega       Equipment   1               1000                     1000
Modules and PCBs   Equipment   1               500                      500
Total (in Rs)                                                           12000
