AWS DeepRacer Part 3: The brain behind DeepRacer Model
How does the training work?
In the previous part, we learned the concept of RL and how DeepRacer applies it. One important piece is the reward function.
As I demonstrated with the grid world in the previous part (Applying Concept of RL to DeepRacer), RL algorithms are trained by repeatedly interacting with the environment and optimizing the cumulative reward. The model learns which action at a particular state will result in the highest cumulative reward.
How should we give reward to the agent?
Reinforcement learning algorithms are built for the optimization of cumulative rewards.
The model will learn which action (and then subsequent actions) will result in the highest cumulative reward on the way to the goal.
The critical part to make your reinforcement learning model work is the reward function. In general, you design your reward function to act as an instructor to your agent.
Reward function parameters for AWS DeepRacer
In AWS DeepRacer, the reward function is a Python function which is given certain parameters that describe the current state and returns a numeric reward value.
Just as when you drive a car, vision is the input you use to decide your actions behind the wheel. But the agent's actions are different: they are not as smooth as the way you manage speed and steering while driving. The agent's action space is discrete, as in the sketch below.
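For illustration, a discrete action space could look something like this: each action is a fixed combination of steering angle and speed. The values here are invented for the example and are not console defaults; you choose your own when you configure the action space.
# Illustrative discrete action space: each action is a fixed
# (steering angle, speed) pair. These values are an example,
# not AWS defaults.
action_space = [
    {"steering_angle": -30.0, "speed": 1.0},
    {"steering_angle": -15.0, "speed": 2.0},
    {"steering_angle": 0.0, "speed": 3.0},
    {"steering_angle": 15.0, "speed": 2.0},
    {"steering_angle": 30.0, "speed": 1.0},
]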
The parameters passed to the reward function describe various aspects of the state of the vehicle, such as its position and orientation on the track, its observed speed, steering angle, and more.
There are several important parameters in AWS DeepRacer:
- Position on track
- Heading
- Waypoints
- Track width
- Distance from centerline
- All wheels on track
- Speed
- Steering angle
With all these parameters at your disposal, you can define a reward function to incentivize whatever driving behavior you like. There are more parameters available to the DeepRacer agent; you can take a look at the documentation for the full list.
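To make the shape of the input concrete, here is a minimal sketch of a reward function that reads a couple of these parameters from the params dictionary and returns a reward. It only illustrates the signature, not a model you would actually want to train.
def reward_function(params):
    '''
    Minimal sketch: read a few of the input parameters above
    and return a numeric reward.
    '''
    speed = params['speed']                               # current speed (m/s)
    all_wheels_on_track = params['all_wheels_on_track']   # False if any wheel is off the track

    # Toy logic: stay on the track, and faster is better
    reward = speed if all_wheels_on_track else 1e-3
    return float(reward)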
There are several basic reward functions provided by AWS.
1. Following the centerline
In part 2 of this series, I demonstrated how the agent learns in the grid world to get a high reward when it sticks to the centerline. Here is the reward function.
def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    return float(reward)
This reward function is pretty good, and it can be the basis of how you build your model, because the most important thing is not going off the track before you care about speed.
2. Prevent Zig-Zag
This example incentivizes the agent to follow the center line but penalizes with lower reward if it steers too much, which helps prevent zig-zag behavior.
def reward_function(params):
    '''
    Example of penalize steering, which helps mitigate zig-zag behaviors
    '''

    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
With this reward function, you can stop the agent from wasting effort on useless movement. If you only force the agent to stick to the centerline, it tends to zig-zag, and those zig-zag moves will increase your lap time.
3. Go minimal (custom)
I did some research on past winners of DeepRacer tournaments, and this one is my own idea. I will not explain it in too much detail.
def reward_function(params):
    reward = 0.001
    # Reward staying on the track
    if params["all_wheels_on_track"]:
        reward += 1
    # Reward small steering angles (prevents zig-zag)
    if abs(params["steering_angle"]) < 10:
        reward += 1
    # Reward speed (squared, so faster is much better)
    reward += params["speed"]**2
    return float(reward)
The idea of my reward function is to force the agent to keep on the track and to prevent it from zig-zagging.
This reward function will lead you to a better time, but you can still improve.
4. Self Learning (custom)
This is the base of my current reward function. I did some research on other people's experiences in DeepRacer tournaments. One of them, scottpletcher, documented his reward function; you can see it in his repo on GitHub.
def reward_function(params):
    # Reward progress made per step while the car stays on the track, plus speed
    if params["all_wheels_on_track"] and params["steps"] > 0:
        reward = ((params["progress"] / params["steps"]) * 100) + (params["speed"]**2)
    else:
        reward = 0.01
    return float(reward)
This reward function actually gives you a great base model for your agent to learn the track.
As you can see, it reaches a high training completion rate within 6 hours of training with the right hyperparameter tuning; you can find an explanation of the hyperparameters on the official AWS website. After you get this base model, you can improve the speed or work on the waypoints so the lap time decreases.
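For reference, the hyperparameters exposed in the DeepRacer console include things like the batch size, learning rate, discount factor, entropy, and the number of epochs and experience episodes. The sketch below only illustrates the kind of settings involved; the values are examples, not defaults or recommendations, so check the official documentation before tuning.
# Illustration only: the kinds of hyperparameters you tune in the
# DeepRacer console. These values are examples, not AWS defaults
# or recommendations.
hyperparameters = {
    "gradient_descent_batch_size": 64,
    "learning_rate": 0.0003,
    "discount_factor": 0.999,
    "entropy": 0.01,
    "number_of_epochs": 10,
    "number_of_experience_episodes": 20,
    "loss_type": "huber",
}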
It all depends on how good and "generic" your reward function is. For example, if you train a model on a track shaped like a circle and later place the car on a track shaped like a square, it may not perform well unless the reward function is generic.
How does the training work?
Let me give you a brief explanation about the Markov Decision Process that will connect part 2 and part 3.
In Part 1, I already gave an introduction to the basic concepts of Reinforcement Learning. As you noticed, the basics of RL rely on the Markov Decision Process (MDP).
An MDP is a framing of the problem of learning from an agent's interaction with its environment in order to achieve a goal; it is the framework that formalizes this problem. At each time step, the agent receives some representation of the environment's state and learns from it.
Let's come back to the flowchart of the RL algorithm. The agent is in state S(t). It takes an action A(t) in the environment, then receives the reward R(t+1) and moves to state S(t+1).
- S is a set of states called the state space; a state is something the agent observes. The state space contains all possible states of the environment.
- A is a set of actions called the action space.
- R is the reward received after transitioning from state s to s' due to action a.
There is an important piece of an MDP called the policy (𝜋). A policy is a way of defining the agent's action selection with respect to changes in the environment. A (probabilistic) policy on an MDP is a mapping from the state space to a distribution over the action space; it determines, probabilistically, how the agent moves from state s to s'. The policy fully defines the behavior of an agent.
A policy acts as a strategy that an agent uses in pursuit of its goals. It consists of the suggested action the agent should take in every possible state.
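To make this concrete, here is a tiny, made-up tabular policy: a mapping from each state to a probability distribution over the actions. The states and actions are invented for illustration and have nothing to do with DeepRacer's real state space.
# Tiny made-up example of a tabular (probabilistic) policy:
# each state maps to a probability distribution over the actions.
import random

actions = ["left", "straight", "right"]

policy = {
    "on_centerline":   {"left": 0.1, "straight": 0.8, "right": 0.1},
    "left_of_center":  {"left": 0.1, "straight": 0.3, "right": 0.6},
    "right_of_center": {"left": 0.6, "straight": 0.3, "right": 0.1},
}

def choose_action(state):
    # Sample an action according to the policy's distribution for this state
    probs = policy[state]
    return random.choices(actions, weights=[probs[a] for a in actions])[0]

print(choose_action("left_of_center"))  # most likely "right", steering back toward the center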
For further details, I will direct you to one of my friend's articles, which explains it very simply (Markov Decision Processes Simplified).
In a simple MDP example like the one above, the policy can be written down explicitly and handed to the agent so that it acts accordingly, but that is not how the DeepRacer RL algorithm works.
The DeepRacer algorithm uses a neural network to build the policy the agent follows.
With the help of the neural network, the agent decides which of the actions already defined in its action space it should take. The idea of Reinforcement Learning is to optimize the policy generated by the neural network so that the agent collects the highest cumulative reward.
RL has one job: "maximize the cumulative reward".
There is a family of algorithms called Policy Gradient Methods that is frequently used in Reinforcement Learning. Just as gradient descent is used to minimize a loss in supervised learning, in RL we use gradient ascent to maximize the expected cumulative reward.
Policy gradient works the way you learn tennis and try to get better each time you practice: you learn how much force to put into the ball and how to angle your racket when facing it. Just as in the 2nd part of this series (Applying Concept of RL to DeepRacer), the agent tries to maximize its reward in each learning iteration; it tries to be better every iteration.
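As a rough sketch of the gradient ascent idea (and only the idea; AWS DeepRacer itself uses PPO, as mentioned below), a REINFORCE-style update nudges the policy parameters so that actions which led to high returns become more likely:
# Minimal sketch of a REINFORCE-style policy-gradient update for a softmax
# policy over a small discrete action space. Purely illustrative: this is
# not the PPO algorithm that AWS DeepRacer actually uses.
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))  # policy parameters: one row per state
alpha = 0.1                              # learning rate

def policy(state):
    # Softmax distribution over actions for the given state
    prefs = theta[state]
    exps = np.exp(prefs - prefs.max())
    return exps / exps.sum()

def update(state, action, G):
    # Gradient ASCENT: make the chosen action more likely in proportion to the return G
    grad_log = -policy(state)
    grad_log[action] += 1.0              # gradient of log-softmax w.r.t. this state's parameters
    theta[state] += alpha * G * grad_log

# One hypothetical piece of experience: in state 2, action 1 led to a return of 5.0
update(state=2, action=1, G=5.0)
print(policy(2))                         # action 1 is now a bit more likely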
AWS DeepRacer uses the Proximal Policy Optimization (PPO) algorithm to train the reinforcement learning model. You can learn more about policy-based methods in my friend's article (Reinforcement Learning Algorithms Taxonomy).
Finale
Congratulations to us for learning more about Reinforcement Learning. Learning is not like racing; it's about changing from what you were to what you are now.