## Description

CS 234: Assignment #3

1 Policy Gradient Methods (45pts coding + 20pts writeup)

The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals are to set up policy gradient for both continuous and discrete environments, and to implement a neural network baseline for variance reduction. The framework for the vanilla policy gradient algorithm is set up in pg.py, and everything that you need to implement is in this file. The file has detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.

1.1 REINFORCE

Recall the vanilla policy gradient theorem,

$$\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$

REINFORCE is a Monte Carlo policy gradient algorithm, so we use the sampled returns Gt as unbiased estimates of Q^{πθ}(s, a). The gradient update can then be expressed as maximizing the following objective:

$$V(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=1}^{T} \log\bigl(\pi_\theta(s_t, a_t)\bigr)\, G_t$$

where D is the set of all trajectories collected by policy πθ, and τ = (s0, a0, r0, s1…) is a trajectory.
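As a concrete illustration (this is not the pg.py starter code; the function names here are hypothetical), the sampled returns Gt and the objective above can be computed as:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{k >= t} gamma^(k-t) * r_k for one trajectory,
    working backwards so each step reuses the running sum."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_objective(log_probs, returns):
    """Sample estimate of V(theta): the mean over timesteps of
    log pi_theta(s_t, a_t) * G_t. In practice one maximizes this by
    minimizing its negative with a gradient-based optimizer."""
    return float(np.mean(np.asarray(log_probs) * np.asarray(returns)))
```

Here `log_probs` would come from the policy network's output distribution, evaluated at the sampled actions.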

1.2 Baseline

One difficulty of training with the REINFORCE algorithm is that the Monte Carlo estimated return Gt can have high variance. To reduce variance, we subtract a baseline b(s) from the estimated returns when computing the policy gradient. A good baseline is the state value function, b(s) = V^{πθ}(s), which requires a training update to minimize the mean-squared error loss:

$$V(b) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=1}^{T} \bigl(b(s_t) - G_t\bigr)^2$$

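To make the baseline update concrete, here is a minimal sketch (not the pg.py implementation: it uses a linear baseline in place of the neural network, and the function name and hyperparameters are hypothetical) of fitting b(s) by gradient descent on this mean-squared error:

```python
import numpy as np

def fit_baseline(states, returns, lr=0.1, steps=200):
    """Fit a linear baseline b(s) = s @ w by gradient descent on the
    mean-squared error (1/N) * sum_t (b(s_t) - G_t)^2. A neural network
    baseline replaces the linear map with an MLP but trains on the
    same loss."""
    states = np.asarray(states, dtype=float)
    returns = np.asarray(returns, dtype=float)
    w = np.zeros(states.shape[1])
    for _ in range(steps):
        pred = states @ w
        # Gradient of the mean-squared error with respect to w.
        grad = 2.0 * states.T @ (pred - returns) / len(returns)
        w -= lr * grad
    return w
```

The fitted b(st) is then subtracted from Gt to form the advantage estimate used in the policy gradient.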


1.3 Advantage Normalization

A second variance reduction technique is to normalize the computed advantages so that they have mean 0

and standard deviation 1. From a theoretical perspective, we can consider centering the advantages to be

simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise,

rescaling the advantages effectively changes the learning rate by a factor of 1/σ, where σ is the standard

deviation of the empirical advantages.
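A minimal sketch of this normalization (the `eps` safeguard against a zero standard deviation is an added assumption, not specified in the handout):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Shift and rescale advantages to mean 0 and standard deviation 1.
    The eps term avoids division by zero when all advantages are equal."""
    adv = np.asarray(adv, dtype=float)
    return (adv - adv.mean()) / (adv.std() + eps)
```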

1.4 Writeup Questions

(a) (8 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment with the given configuration parameters (in config.py) by running

python pg.py

With the given configuration file, this should achieve a perfect score of 200 within 100 iterations. NOTE: training may repeatedly converge to 200 and diverge. Your plot does not have to reach 200 and stay there; we only require that you achieve a perfect score of 200 at some point during training. Include in your writeup the tensorboard plot for the average reward. Start tensorboard with:

tensorboard --logdir=results/CartPole-v0

and then navigate to the link it gives you. Click on the "SCALARS" tab to view the average reward graph.

Now, modify the configuration file to try running with larger and smaller batch sizes. What happens as you increase/decrease the batch size, and why? Note that as you perform multiple runs, tensorboard will generate multiple result files, so you should either remove each previous run's results or change the output path for each run in the configuration file. Try running without the baseline, and include the plot. Do you notice any difference? Explain.

(b) (5 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by modifying config.py. To use this environment (and HalfCheetah-v1), you will have to install MuJoCo and mujoco-py. Instructions for installation are in the README.txt file in the starter code. We have run into errors installing MuJoCo on Windows, so Windows users may instead need to use the FarmShare servers.

Find configuration parameters so that training reaches the maximum episode reward of 1000 within 100 iterations. Again, training will probably reach 1000, and then repeatedly diverge and converge; we only require that you reach 1000 at some point during training. Include the tensorboard plot in your writeup.

(c) (7 pts) (HalfCheetah-v1) Test your implementation on the much more difficult HalfCheetah-v1 environment with γ = 0.9, so that training reaches an average episode reward of more than 200 within 100 iterations. Include the tensorboard plot, and explain the parameter choices you make in the writeup. Hint: a large batch size of 50000 works well for us, and it may help to change the size of the policy network.



2 Expected Regret Bounds (35pts)

Assume a reinforcement learning algorithm A for discounted infinite-horizon MDPs has expected regret

$$\mathbb{E}_*\!\left[\sum_{t=1}^{T} r_t\right] - \mathbb{E}_A\!\left[\sum_{t=1}^{T} r_t\right] = f(T)$$

for all T > 0, where E∗ denotes expectation under the probability distribution induced by the optimal policy π∗ and E_A denotes expectation with respect to the algorithm's behavior. We assume that γ ∈ [0, 1) is the discount factor and that rewards are normalized, i.e., rt ∈ [0, 1].

(a) (15 pts) Let π be an arbitrary policy or algorithm. Show that for any ε′ > 0 and T′ ≥ log_{1/γ}(H/ε′), where H = 1/(1 − γ), we have

$$V_\pi(s) - \sum_{t=1}^{T'} \gamma^{t-1}\, \mathbb{E}_\pi\!\left[r_t \mid s_1 = s\right] \;\le\; \varepsilon', \quad \text{for all states } s.$$

Note that V_π is the value function associated with π, and E_π is the expectation with respect to the randomization of π.
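For reference (the handout does not restate it, but the bound above relies on it), the value function here is the standard discounted one, so part (a) asks how much truncating this infinite sum after T′ terms can cost:

$$V_\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{\,t-1}\, r_t \;\middle|\; s_1 = s\right]$$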

(b) (20 pts) From the regret guarantee of algorithm A and part (a), it follows that for any ε′ > 0 and T′ ≥ log_{1/γ}(H/ε′), we have

$$\mathbb{E}_*\!\left[V_*(s_{T+1})\right] - \mathbb{E}_A\!\left[V_A(s_{T+1})\right] \;\le\; f(T' + T) - f(T) + 2\varepsilon', \quad \text{for } T > 0,$$

where V_A is the value function of the (possibly nonstationary) policy that algorithm A follows. Assume now f(T) = √T. Show that for any ε > 0 and

$$t \;\ge\; 1 + \frac{1}{\varepsilon^2}\,\log_{1/\gamma}^2\!\left(\frac{4H}{\varepsilon}\right),$$

we have

$$\mathbb{E}_*\!\left[V_*(s_t)\right] - \mathbb{E}_A\!\left[V_A(s_t)\right] \;\le\; \varepsilon.$$

Hint: It may be helpful to set ε′ to be some function of ε and choose an appropriate value of T′.
