
# Assignment #2: Implement Deep Q-Learning


These questions require thought, but do not require long answers. Please be as concise as possible.
Please use the provided commands and do not edit outside of the marked areas.
You'll need to download the starter code and fill in the appropriate functions following the instructions
from the handout and the code's documentation. Training DeepMind's network on Pong takes roughly
12 hours on a GPU, so please start early! (Only a completed run will receive full credit.) We will give
you access to an Azure GPU cluster. You'll find the setup instructions on the course assignment page.
## Introduction
In this assignment we will implement deep Q-learning, following DeepMind's papers ([mnih2015human]
and [mnih-atari-2013]), which learn to play Atari games from raw pixels. The purpose is to understand the
effectiveness of deep neural networks as well as some of the techniques used in practice to stabilize training and
achieve better performance. You'll also have to get comfortable with TensorFlow. We will train our networks
on the Pong-v0 environment from OpenAI Gym, but the code can easily be applied to any other environment.
In Pong, one player wins a point if the ball passes by the other player. Winning a point gives a reward of +1, while
losing one gives a reward of −1. An episode is over when one of the two players reaches 21 points. Thus,
the final score is between −21 (lost episode) and +21 (won episode). Our agent plays against a decent hard-coded
AI player. Average human performance is −3 (reported in [mnih-atari-2013]). If you complete the
homework successfully, you will train an AI agent with super-human performance, reaching at least
+10 (hopefully more!).
## 1 Test Environment (5 pts)
Before running our code on Pong, it is crucial to test our code on a test environment. You should be able
to run your models on CPU in no more than a few minutes on the following environment:
• 4 states: 0, 1, 2, 3
• 5 actions: 0, 1, 2, 3, 4. Action 0 ≤ i ≤ 3 goes to state i, while action 4 makes the agent stay in the same
state.
• Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0.1, R(1) =
−0.2, R(2) = 0, R(3) = −0.1. If we start in state 2, then the rewards defined above are multiplied by
−10. See Table 1 for the full transition and reward structure.
CS 234 Winter 2019: Assignment #2
• One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the
initial state).
| State (s) | Action (a) | Next State (s') | Reward (R) |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0.1 |
| 0 | 1 | 1 | -0.2 |
| 0 | 2 | 2 | 0.0 |
| 0 | 3 | 3 | -0.1 |
| 0 | 4 | 0 | 0.1 |
| 1 | 0 | 0 | 0.1 |
| 1 | 1 | 1 | -0.2 |
| 1 | 2 | 2 | 0.0 |
| 1 | 3 | 3 | -0.1 |
| 1 | 4 | 1 | -0.2 |
| 2 | 0 | 0 | -1.0 |
| 2 | 1 | 1 | 2.0 |
| 2 | 2 | 2 | 0.0 |
| 2 | 3 | 3 | 1.0 |
| 2 | 4 | 2 | 0.0 |
| 3 | 0 | 0 | 0.1 |
| 3 | 1 | 1 | -0.2 |
| 3 | 2 | 2 | 0.0 |
| 3 | 3 | 3 | -0.1 |
| 3 | 4 | 3 | -0.1 |

Table 1: Transition table for the Test Environment
An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be
represented in terms of s_t, a_t, R_t as: s_0 = 0, a_0 = 1, R_0 = −0.2, s_1 = 1, a_1 = 2, R_1 = 0, s_2 = 2, a_2 = 4, R_2 = 0, s_3 = 2, a_3 = 3, R_3 = (−0.1) × (−10) = 1, s_4 = 3, a_4 = 0, R_4 = 0.1, s_5 = 0.
Figure 1: Example of a path in the Test Environment
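The dynamics above are small enough to write out directly. Below is a minimal Python sketch of the test environment, assuming the transition and reward structure of Table 1; the class and method names are illustrative, not part of the starter code.

```python
class TestEnv:
    """4 states, 5 actions; action i (0-3) moves to state i, action 4 stays put."""

    # Base rewards R(i) for entering state i.
    R = [0.1, -0.2, 0.0, -0.1]

    def __init__(self):
        self.state = 0      # episodes always start in state 0
        self.num_steps = 0  # episodes last 5 time steps

    def step(self, action):
        next_state = self.state if action == 4 else action
        reward = self.R[next_state]
        if self.state == 2:          # rewards from state 2 are multiplied by -10
            reward *= -10
        self.state = next_state
        self.num_steps += 1
        done = self.num_steps >= 5
        return next_state, reward, done
```

Running the action sequence 1, 2, 4, 3, 0 from the start state reproduces the trajectory described for Figure 1.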
1. (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the
test environment, assuming γ = 1?
## 2 Q-learning (12 pts)
**Tabular setting** In the tabular setting, we maintain a table Q(s, a) for each state-action tuple. Given an
experience sample (s, a, r, s'), our update rule is

$$Q(s, a) = Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a) \right], \tag{1}$$

where α ∈ ℝ is the learning rate and γ is the discount factor.
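As a concrete instance of update rule (1), here is a short sketch using a NumPy array as the Q-table; the function name and shapes are illustrative, not part of the starter code.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update, following Eq. (1)."""
    # TD target: r + gamma * max_a' Q(s', a')
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(s, a) toward the target by a step of size alpha
    Q[s, a] += alpha * (td_target - Q[s, a])
```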
**Approximation setting** Due to the scale of Atari environments, we cannot reasonably learn and store a Q-value
for each state-action tuple. We will instead represent our Q-values as a function $\hat{q}(s, a, w)$, where w are the
parameters of the function (typically a neural network's weights and bias parameters). In this approximation
setting, our update rule becomes

$$w = w + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} \hat{q}(s', a', w) - \hat{q}(s, a, w) \right] \nabla_w \hat{q}(s, a, w). \tag{2}$$
In other words, we are trying to minimize

$$L(w) = \mathbb{E}_{s,a,r,s'} \left[ \left( r + \gamma \max_{a' \in \mathcal{A}} \hat{q}(s', a', w) - \hat{q}(s, a, w) \right)^2 \right]. \tag{3}$$
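To see rule (2) in a runnable form without a neural network, here is a sketch with a linear approximator $\hat{q}(s, a, w) = w \cdot \phi(s, a)$, for which $\nabla_w \hat{q} = \phi(s, a)$. The feature vectors and function names are illustrative assumptions, not the DQN architecture used in the assignment.

```python
import numpy as np

def q_hat(w, phi):
    """Linear approximation: q_hat(s, a, w) = w . phi(s, a)."""
    return float(w @ phi)

def semi_gradient_step(w, phi_sa, r, next_phis, alpha, gamma):
    """One update of Eq. (2); for a linear q_hat, grad_w q_hat(s, a, w) = phi(s, a)."""
    best_next = max(q_hat(w, p) for p in next_phis)  # max over a' of q_hat(s', a', w)
    td_error = r + gamma * best_next - q_hat(w, phi_sa)
    return w + alpha * td_error * phi_sa
```

With one-hot features (one component per state-action pair), this reduces exactly to the tabular rule (1).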
**Target Network** DeepMind's paper [mnih2015human] [mnih-atari-2013] maintains two sets of parameters:
w (used to compute $\hat{q}(s, a, w)$) and w⁻ (the target network, used to compute $\hat{q}(s', a', w^-)$), such that our
update rule becomes

$$w = w + \alpha \left[ r + \gamma \max_{a' \in \mathcal{A}} \hat{q}(s', a', w^-) - \hat{q}(s, a, w) \right] \nabla_w \hat{q}(s, a, w). \tag{4}$$
The target network’s parameters are updated with the Q-network’s parameters occasionally and are kept
fixed between individual updates. Note that when computing the update, we don’t compute gradients with
respect to w− (these are considered fixed weights).
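The key practical point is that the target $r + \gamma \max_{a'} \hat{q}(s', a', w^-)$ is computed from the target network's outputs and treated as a constant. A small sketch of that computation (the function name and batch layout are illustrative):

```python
import numpy as np

def td_targets(rewards, q_next_target, gamma):
    """TD targets from Eq. (4), built from the target network's outputs.

    q_next_target holds q_hat(s', a', w_minus) for a minibatch, with shape
    (batch_size, num_actions); it is a plain array of fixed numbers, so no
    gradient ever flows into w_minus.
    """
    return rewards + gamma * q_next_target.max(axis=1)
```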
**Replay Memory** As we play, we store our transitions (s, a, r, s') in a buffer. Old examples are deleted as
we store new transitions. To update our parameters, we sample a minibatch from the buffer and perform an
update of the parameters w.
**ε-Greedy Exploration Strategy** During training, we use an ε-greedy strategy. DeepMind's paper
[mnih2015human] [mnih-atari-2013] decreases ε from 1 to 0.1 during the first million steps. At test
time, the agent chooses a random action with probability ε_soft = 0.05.
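A replay memory of the kind described above can be sketched in a few lines; the class and its API are illustrative, and the starter code defines its own buffer.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay memory: bounded storage with uniform minibatch sampling."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random minibatch used for each parameter update
        return random.sample(self.buffer, batch_size)
```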
There are several things to be noted:
• In this assignment, we will update w every learning_freq steps by using a minibatch of experiences sampled from the replay buffer.
• DeepMind's deep Q-network takes as input the state s and outputs a vector of size equal to the number of actions. In the Pong environment, we have 6 actions, thus $\hat{q}(s, w) \in \mathbb{R}^6$.
• The input of the deep Q-network is the concatenation of 4 consecutive steps, which results in an input after preprocessing of shape (80 × 80 × 4).
We will now examine these assumptions and implement the epsilon-greedy strategy.
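The ε-greedy strategy with the linear schedule described above can be sketched as follows; the function names and default values are illustrative assumptions, not the starter code's interface.

```python
import random

def linear_epsilon(t, eps_begin=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Anneal epsilon linearly from eps_begin to eps_end over decay_steps."""
    if t >= decay_steps:
        return eps_end
    return eps_begin + (eps_end - eps_begin) * t / decay_steps

def epsilon_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy action."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```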
1. (written 3pts) What is one benefit of using experience replay?