AWS DeepRacer Part 1: Concept of Reinforcement Learning
Comprehensive Reinforcement Learning Bootcamp through AWS DeepRacer
Thanks to Jakarta Machine Learning (JML) and AWS. I am currently participating in the AWS DeepRacer Bootcamp provided by AWS and Jakarta Machine Learning. Nine participants will be intensively mentored by experienced representatives from AWS to learn Reinforcement Learning over the next 3 months. We will learn RL by using one of AWS's products: AWS DeepRacer.
And of course, I’ll share what I learn with you.
First things first, as a brief introduction, let me walk you through machine learning. There are 3 types of machine learning (or at least as far as I understand): Unsupervised Learning, Supervised Learning, and Reinforcement Learning.
In supervised learning, we provide the machine with a ton of data and also with the outcome of that data. In other words, we give labeled inputs and outputs and let the machine learn the relationship and correlation between them, so that the next time we give the machine an input, it can predict the corresponding output.
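Here is a minimal sketch of that idea, assuming scikit-learn is available; the tiny dataset is made up purely for illustration:

```python
# Supervised learning sketch: labeled inputs (X) and outputs (y) are both given,
# and the model learns the relationship between them.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]      # labeled inputs
y = [2, 4, 6, 8]              # the known outcome for each input

model = LinearRegression()
model.fit(X, y)               # learn the input-output relationship

print(model.predict([[5]]))   # predict the output for a new input, roughly 10
```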
Unsupervised learning, on the other hand, is given input with no labels. The machine identifies patterns in the data that are not so obvious to the human eye. So, unsupervised learning is extremely useful for recognizing patterns in data and helping us make decisions.
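And a matching sketch for the unsupervised case, again assuming scikit-learn and a made-up dataset:

```python
# Unsupervised learning sketch: only unlabeled inputs are given,
# and the model groups them by the patterns it finds on its own.
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # no labels, just raw data points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # discover two clusters in the data

print(labels)                          # e.g. [0 0 1 1] -- the pattern it found
```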
Well, what we’ll learn here is Reinforcement Learning, which works quite differently from the other two types of ML.
What is Reinforcement Learning?
According to Wikipedia, Reinforcement Learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
RL algorithms typically do not involve target outputs (only inputs are given).
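To make that definition concrete, here is a minimal sketch of the RL interaction loop in plain Python. The environment and its reward rule are toy examples I made up, not any real library API; the point is only that the agent sees states and rewards, never a labeled "correct answer":

```python
# RL sketch: the agent acts, the environment replies with a new state and a
# reward, and the agent's goal is to maximize the cumulative reward.
import random

def environment_step(state, action):
    """Toy environment: reward +1 when the action matches the state's parity."""
    reward = 1 if action == state % 2 else 0
    next_state = random.randint(0, 9)
    return next_state, reward

state = random.randint(0, 9)
cumulative_reward = 0
for step in range(100):
    action = random.choice([0, 1])          # the agent chooses an action
    state, reward = environment_step(state, action)
    cumulative_reward += reward             # this is what RL tries to maximize

print("cumulative reward:", cumulative_reward)
```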
OK, that sounded difficult to me at first.
RL works just like how living things, like us, learn. It’s like training your pet, say a dog. For instance, you want to train your dog to do a handshake. You feed him as a reward if he does the handshake, and you don’t feed him if he doesn’t. At first, obviously, he won’t understand and will make many mistakes. This process is repeated until he realizes what he should do.
Important Elements
There are several elements of a basic RL algorithm you really need to know:
- Agent (the learner, which chooses which actions to take in its current state).
- Environment (responds to the agent's actions and provides new input to the agent).
- Reward (the incentive or cumulative feedback returned by the environment).
- State (the ‘situation’ the agent is currently in, which it uses to decide what to do next).
- Action (a possible move or decision that can be chosen in each state).
- Episode (the sequence of states from the start until it reaches an end).
The dog we trained before is the agent. The dog learns in the environment we provide, which in that case is the handshake task and how we respond to his attempts. The reward is how much we feed him. The reward helps the agent understand the problem and the environment better, and thus helps it make better decisions on our behalf.
The state is each combination of movements along the way to a handshake. In each state, the dog has actions, i.e. every possible motion it can make, which can be right or completely wrong. In an episode, the dog goes through the whole sequence until it reaches the end, regardless of whether it did it right or not, and the final reward is calculated.
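As a rough sketch (my own illustration, not DeepRacer code), here is how those elements could be mapped onto the dog example in a few lines of Python:

```python
# Toy mapping of the RL elements: the dog is the agent, our feeding rule is the
# environment, each attempt is an episode, and food is the reward.
import random

ACTIONS = ["sit", "bark", "handshake"]        # actions the dog (agent) can take

def environment(action):
    """Environment: we feed the dog (reward 1) only for a handshake."""
    return 1 if action == "handshake" else 0

preferences = {a: 0 for a in ACTIONS}         # what the agent has learned so far

for episode in range(200):                    # one episode = one training attempt
    action = random.choice(ACTIONS)           # the dog tries something
    reward = environment(action)              # we respond with food or nothing
    preferences[action] += reward             # reinforce actions that earned food

print(max(preferences, key=preferences.get))  # after training: "handshake"
```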
The Trade-off
There is a basic trade-off in RL. It is similar to the bias-variance trade-off in Supervised Learning, or, maybe for better understanding, it’s like the time-space trade-off in Dynamic Programming. It’s called Exploration-Exploitation.
Exploration in RL is gathering new data points, while exploitation is using previously captured data.
OK, I know that sounds strange. Exploration is when the agent gathers new information in the hope of earning more reward; that information may lead to more reward or less. Exploitation is when the agent takes the best option based on what it already knows.
In order to get a better understanding, let’s see these scenarios.
You are planning to have lunch. There are two restaurants, A and B, that you could go to.
- You already know that restaurant A tastes good and is worth the money you spend.
- You’ve never eaten at restaurant B, so you don’t know how it tastes or whether it’s worth your money.
In the first scenario, you will likely be satisfied with your lunch and the money you spent; that is how exploitation works. In the second scenario, you will experience something new, which could turn out to be a better or worse way to spend your money than restaurant A; that is how exploration works.
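One common way to balance the two is an epsilon-greedy strategy. Here is a hedged sketch of the restaurant story in those terms; the satisfaction numbers and the 10% exploration rate are made up for illustration:

```python
# Epsilon-greedy sketch: with probability epsilon we explore (try a random
# restaurant); otherwise we exploit (go to the best one we know so far).
import random

restaurants = ["A", "B"]
true_satisfaction = {"A": 0.8, "B": 0.6}   # unknown to us; only used to simulate lunches
estimates = {"A": 0.0, "B": 0.0}           # what we believe so far
visits = {"A": 0, "B": 0}
epsilon = 0.1                              # 10% of lunches are exploration

for lunch in range(1000):
    if random.random() < epsilon:
        choice = random.choice(restaurants)                     # explore: gather new info
    else:
        choice = max(restaurants, key=lambda r: estimates[r])   # exploit what we know
    reward = 1 if random.random() < true_satisfaction[choice] else 0
    visits[choice] += 1
    estimates[choice] += (reward - estimates[choice]) / visits[choice]  # running average

print(estimates)   # the estimates drift toward the true satisfaction rates
```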
Finale
Congratulations to us for learning a bit about Reinforcement Learning. It’s just the beginning of our long journey into RL and DeepRacer. Stay tuned for my series; I’ll publish more articles about RL and DeepRacer.
Hey, great news: my 2nd part is published!
AWS DeepRacer Part 2: Applying Concept of RL to DeepRacer
I know this is a lot to process. But don’t worry, we are doing this together. So, I’ll drop the resources I’ve used to learn RL and some references.
- Lecture 7: Markov Decision Processes — Value Iteration
- Not-So-Deep Reinforcement Learning for dummies — Part 1 & Part 2
- Markov Decision Process