AWS DeepRacer Part 2: Applying Concept of RL to DeepRacer

image from Unsplash

This is Part 2 of my series about AWS DeepRacer. You can read Part 1 here

In Part 1, I introduced the concept of reinforcement learning. In this part, I’ll show you how the DeepRacer model applies RL.

Before we dive deep into this topic, let me refresh your knowledge of RL.

Reinforcement learning (RL) is a type of machine learning in which the machine learns by exploring an environment and taking actions to receive rewards or feedback from the environment, aiming to maximize them.

An RL model learns from experience, again and again, until it is able to decide which actions lead to the maximum reward.

How does DeepRacer apply RL?

In reinforcement learning, the agent learns in an environment with one goal: to maximize the total reward.

The agent takes an action based on a particular state, and the environment returns a reward and the next state. The agent learns from its mistakes through trial and error. At first, the agent takes random actions; over time, it learns to choose the actions that lead to the maximum reward.

Here is the RL flowchart for DeepRacer:

DeepRacer RL flowchart by AWS
  1. Agent. In this scenario, the AWS DeepRacer vehicle is the agent. In the training simulation, it controls the vehicle, taking inputs and deciding actions.
  2. Environment. The environment contains a track that defines where the agent can go and what states it can be in. The agent explores the environment to collect training data, which it can later exploit.
  3. State. A state represents the agent’s situation at a particular point in time. For AWS DeepRacer, a state is an input captured by a sensor, such as an image from the front-facing camera on the vehicle.
  4. Action. An action is a move made by the agent in the current state. For AWS DeepRacer, an action represents a move at a particular speed and steering angle.
  5. Reward. The reward is the score given as feedback to the agent when it takes an action in a given state. When training the DeepRacer model, the reward is returned by a reward function. The reward function specifies which actions are good or bad for the agent to take in a given state.
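To make the reward function concrete, here is a minimal sketch in the style of the AWS DeepRacer console, where you write a Python function that receives a `params` dict from the simulator and returns a score. The three-band centerline logic below is an illustration of "reward staying near the center", not the only possible design; the exact band widths are my own choice.

```python
def reward_function(params):
    """Reward the car for staying close to the track's centerline.

    `params` is the dict the DeepRacer simulator passes in;
    `track_width` and `distance_from_center` are standard inputs.
    """
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]

    # Three bands around the centerline, from tight to wide.
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0       # very close to center: high reward
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3      # likely off track: near-zero reward

    return float(reward)
```

The function is called at every step of the simulation, so small per-step rewards accumulate into the episode totals we discuss below.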

Training the DeepRacer model

In a simulator, the agent explores the environment and builds up experience. The collected experiences are used to update the model periodically, and the updated model is used to create more experiences. It is an iterative process.

In this example, we simplify the environment and simulate it as a grid world. Each square represents an individual state. The actions we give the vehicle are to move up or down while facing the direction of the goal.

image by Author

We assign a score to each grid square to decide which behavior to incentivize.

We mark the squares at the edge of the track as “terminal states”, which tell the vehicle that it has gone off the track and failed. Since we want the car to learn to drive along the center of the track, we give a high reward to squares on the centerline and lower rewards to squares farther from it.

During training, the car starts by exploring the grid until it moves off the track or reaches the destination. As it drives and explores, the car collects rewards from the scores we defined. This process is called an episode.
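The episode loop above can be sketched in a few lines of Python. The grid size, per-square scores, and the two policies below are my own illustrative choices, not AWS's; the point is only the mechanics: the car steps through states, collects rewards, and stops at a terminal state (off track or goal).

```python
import random

# A toy 3-row grid world: row 1 is the centerline (reward 1.0 per step),
# rows 0 and 2 are near the edge (reward 0.1), and leaving the grid is a
# terminal "off-track" state. Scores here are illustrative.
CENTER_ROW, N_ROWS, N_COLS = 1, 3, 5
STEP_REWARD = {0: 0.1, 1: 1.0, 2: 0.1}

def run_episode(policy, start=(CENTER_ROW, 0)):
    """Drive from the start column toward the last column, collecting rewards.

    `policy(state)` returns 'up', 'down', or 'forward'. The episode ends
    when the car leaves the grid (off track) or reaches the goal column.
    """
    row, col = start
    total = 0.0
    while True:
        action = policy((row, col))
        if action == "up":
            row -= 1
        elif action == "down":
            row += 1
        col += 1  # the car always advances toward the goal
        if not 0 <= row < N_ROWS:
            return total, "off_track"   # terminal state: failed
        total += STEP_REWARD[row]
        if col == N_COLS - 1:
            return total, "goal"        # terminal state: finished

def random_policy(state):
    """An untrained agent: pure exploration."""
    return random.choice(["up", "down", "forward"])

def center_policy(state):
    """A trained agent: steer back toward the centerline."""
    row, _ = state
    return "down" if row < CENTER_ROW else "up" if row > CENTER_ROW else "forward"
```

Running `run_episode(random_policy)` many times shows mostly low totals and off-track endings, while `run_episode(center_policy)` always reaches the goal with the maximum total, mirroring the before/after episodes described next.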

Let’s look at some scenarios below.

In the episode above, the car collects a total reward of 2.5 before reaching a stop state.
In this episode, the agent learns from its previous mistakes. Now the car collects a total reward of 5.0 before reaching a stop state.
The agent learns over time and keeps improving. Now the car collects a higher total reward than before until it reaches a stop state.

RL algorithms are trained by iterating and optimizing for the cumulative reward. The model learns which action in a particular state leads to the highest cumulative reward.

Learning isn’t done on the first go; it takes many iterations. First, the agent needs to explore and discover which actions earn the highest rewards before it can exploit that information.
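A common way to balance exploration and exploitation is an epsilon-greedy rule, sketched below. This is a generic RL technique, not something specific to the DeepRacer console; the `q_values` table of expected rewards is a hypothetical example.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action: explore with probability `epsilon`, else exploit.

    `q_values` maps each action to the reward the agent currently
    expects from it. Early in training, epsilon is kept high (mostly
    random exploration); it is lowered over time so the agent
    increasingly exploits what it has learned.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: random action
    return max(q_values, key=q_values.get)     # exploit: best known action

# Example estimates after some training (hypothetical numbers):
q = {"left": 0.2, "straight": 0.9, "right": 0.4}
```

With `epsilon = 0.0`, `epsilon_greedy(q, 0.0)` always picks `"straight"`, the action with the highest expected reward; with `epsilon = 1.0`, it picks uniformly at random.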

image from DeepRacer console by Author

As the agent gains more and more experience, it learns how to earn higher rewards. If we plot the total reward from each episode, we can see how the model improves over time. With more experience, the agent gets better and is eventually able to reach the goal.

In this episode, the agent reaches the stop state, which is the goal or finish line, and collects a total reward of 14.0.

But why don’t we just program the car to go straight to the finish line?

Because we want the model to adapt to any situation. The environment we provided is just one sample environment, and real-world conditions vary widely and unpredictably. We want the model to adapt to any environment and still produce the desired behavior.

Finale

image from Unsplash

Congratulations to us for learning more about reinforcement learning. It’s just the beginning of our long journey into RL and DeepRacer. But learning is not like racing; it’s about changing from what you were into what you are now.

Stay tuned for the rest of my series. I’ll publish more articles about RL and DeepRacer.

Great news: Part 3 is now published
AWS DeepRacer Part 3: The brain behind DeepRacer Model
