Exploration of an Unknown Environment Using Deep Reinforcement Learning
We need robots in places that humans cannot physically reach, such as planetary exploration, search-and-rescue operations that involve traversing tight passageways, and perilous activities like covert operations. Achieving these tasks requires exploration. Conventional methods are computationally expensive and demand substantial time and memory, which makes exploration in large environments difficult. We propose a novel autonomous exploration system for an unknown environment, in which a mobile robot uses a Deep Reinforcement Learning (DRL) based approach. In a DRL-based system, the agent (robot) learns by interacting with the environment through trial and error, without the need for a map. The agent improves its policy according to the reward received for each action.
Problem Formulation
We formulate the problem of autonomous exploration as a Markov Decision Process (MDP), where at each time step t, the agent receives an observation of its current state, takes an action, receives a reward, and transitions to the next state. For the mapless exploration task that we consider, the states consist of laser range readings and the agent's pose relative to the goal position. The task of the agent is to reach the goal position.
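The MDP interaction above can be summarized in a transition tuple. The following is an illustrative sketch only; the field names and example values are assumptions, not taken from the project code.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical container for one MDP transition (names are illustrative).
@dataclass
class Transition:
    state: Tuple[float, ...]       # laser readings + pose relative to goal
    action: int                    # discrete action index
    reward: float
    next_state: Tuple[float, ...]
    done: bool                     # True when the goal is reached or the episode ends

t = Transition(state=(0.5, 1.2), action=0, reward=-0.1,
               next_state=(0.4, 1.1), done=False)
print(t.done)  # False
```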
Project Components
We use the CoppeliaSim robot simulator to train the agent. Training is conducted in simulation to avoid hardware damage and to reduce training time, since the robot must be returned to the start location after each episode. After training, we test the learned policies on the real mobile robot and deploy the learning algorithm in the environment to analyze its performance. A DRL problem consists of the following components, each of which must be properly formulated to achieve a specific task:
Agent
We use a differential-drive robot (BubbleRob) as our agent in a simulated environment, as shown in the figure. For exploration without obstacles, the main characteristics of the mobile robot are given in the table.
| Characteristic | Description |
| --- | --- |
| Actions | left/right rotation, forward motion |
| Sensors | proximity sensor, LIDAR sensor |

Environment
An obstacle-free rectangular environment is created in the CoppeliaSim simulator for exploration without obstacles.

State Space
We formulated the state space for exploration without obstacles as the relative angle between the goal and the heading of the BubbleRob. To reduce the convergence complexity of the algorithm, we discretized the state space from -90 degrees to +90 degrees in 1-degree steps.
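The discretization described above can be sketched as a simple binning function. The function name and index convention are assumptions for illustration; clipping to [-90°, +90°] and 1-degree steps follow the text, giving 181 discrete states.

```python
# Sketch of the state discretization: the relative angle between the goal
# direction and the robot heading is clipped to [-90, +90] degrees and
# binned in 1-degree steps, yielding state indices 0..180.
def discretize_angle(angle_deg: float) -> int:
    clipped = max(-90.0, min(90.0, angle_deg))
    return int(round(clipped)) + 90  # shift so indices start at 0

print(discretize_angle(-90.0))  # 0
print(discretize_angle(0.0))    # 90
print(discretize_angle(90.0))   # 180
```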

Reward Formulation
The reward function formulation is based on the states, collisions, and the relative angle. The reward function is formulated as
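Since the exact formula is not reproduced here, the following is a purely illustrative shaping scheme consistent with the ingredients named above (collision, goal, relative angle); every constant and the linear form are assumptions.

```python
# Illustrative reward shaping only: the exact constants and functional form
# used in the project are not given, so these values are assumptions.
def reward(relative_angle_deg: float, collided: bool, reached_goal: bool) -> float:
    if collided:
        return -10.0   # heavy collision penalty (assumed value)
    if reached_goal:
        return 10.0    # large positive reward at the goal (assumed value)
    # Shaping term: better alignment of the heading with the goal direction
    # yields a higher reward (assumed linear form over [-90, 90] degrees).
    return 1.0 - abs(relative_angle_deg) / 90.0
```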

Framework
We use the Double Deep Q-Network (DDQN) algorithm to achieve the autonomous exploration task. In DDQN, two neural networks are used: the policy network and the target network. The agent observes a state from the environment, takes an action using the policy network, transitions to the next state, and receives a reward from the environment. Episode-completion information is stored in a done flag. The set {state, action, next state, reward, done} is called a transition. All transitions are logged into Experience Replay Memory (ERM). To update the action policy (the weights of the policy network), we use a random batch of transitions from the ERM. To reduce the overestimation problem, the weights of the target network are replaced with the policy network's weights after a fixed number of episodes. The training process continues until the agent is trained well enough to achieve the goal; the resulting policy is termed the optimal policy. The optimal policy is the one by which the agent reaches the goal in a minimum number of steps and minimum time while maximizing the reward.
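The DDQN pieces described above can be sketched as follows. To keep the example self-contained, small random Q-tables stand in for the real policy and target networks (an assumption); the replay memory, the double-Q target (select the action with the policy network, evaluate it with the target network), and the periodic target sync mirror the description in the text.

```python
import random
from collections import deque
import numpy as np

GAMMA = 0.99
N_STATES, N_ACTIONS = 181, 3   # e.g. discretized angle, {left, right, forward}

rng = np.random.default_rng(0)
policy_net = rng.normal(size=(N_STATES, N_ACTIONS))  # stand-in for the policy network
target_net = policy_net.copy()                       # stand-in for the target network

erm = deque(maxlen=10_000)  # Experience Replay Memory

def ddqn_targets(batch):
    """Double-DQN target: action chosen by the policy net, value by the target net."""
    ys = []
    for s, a, r, s2, done in batch:
        if done:
            ys.append(r)
        else:
            best_a = int(np.argmax(policy_net[s2]))        # select with policy net
            ys.append(r + GAMMA * target_net[s2, best_a])  # evaluate with target net
    return np.array(ys)

# Store a transition and compute the target for a sampled batch.
erm.append((0, 2, 1.0, 1, False))
batch = random.sample(list(erm), k=1)
print(ddqn_targets(batch).shape)  # (1,)

# Periodically sync the target network with the policy network.
target_net = policy_net.copy()
```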

Observations are taken directly from the environment using Lidar and are preprocessed before being fed to the learning algorithm. The learning algorithm is trained over many episodes. We kept the simulation and real-robot specifications similar to achieve high sim-to-real performance. The trained neural network is uploaded to the robot's memory. In any given state, the algorithm generates the best possible action. The actuation signals, either left/right rotation or forward motion, are sent to the wheels of the robot. Using this sim-to-real approach, the mobile robot is able to explore and reach the goal location in a real-world environment intelligently.
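Deployment-time action selection can be sketched as a greedy argmax over the trained network's outputs. Here `q_values_for` is a hypothetical placeholder for a forward pass of the trained network on the preprocessed Lidar observation; the action names are assumptions.

```python
import numpy as np

ACTIONS = ["rotate_left", "rotate_right", "forward"]  # assumed action names

def q_values_for(observation: np.ndarray) -> np.ndarray:
    # Placeholder: a real deployment would run the trained network here.
    return observation @ np.ones((observation.size, len(ACTIONS)))

def select_action(observation: np.ndarray) -> str:
    """Greedy policy used after training: pick the highest-valued action."""
    return ACTIONS[int(np.argmax(q_values_for(observation)))]
```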