CMPT 419/726: Assignment 3

Assignment 3: Graphical Models / Recurrent Neural Networks / Reinforcement Learning

Submitting Your Assignment

The assignment must be submitted online at https://coursys.cs.sfu.ca. You must submit two files:

1. An assignment report in PDF format, named report.pdf. This report should contain the

solutions to questions 1-3.

2. Your code for question 3, named cartpole_stabilize.py.

CMPT 419/726: Assignment 3 (Spring 2020) Instructor: Mo Chen

1 Graphical Models (22 marks)

Consider the problem of determining whether a local high school student will attend SFU or not.

Define a boolean random variable A (true if the person will attend SFU), discrete random variables

L (maximum of parents’ education level: can take values o for non-university or u for university)

and G (current provincial government: l for Liberal Party, d for NDP), and continuous-valued

random variables E (current provincial economy size) and T (SFU tuition level).

1. 4 marks. Draw a simple Bayesian network for this domain.

2. 2 marks. Write the factored representation for the joint distribution p(A, L, G, E, T) that is

described by your Bayesian network.

3. 8 marks. Supply all necessary conditional distributions. Provide the type of distribution that

should be used and give rough guidance / example values for parameters (do this by hand,

educated guesses).

4. 8 marks. Suppose we had a training set and wanted to learn the parameters of the distributions using maximum likelihood. Denote each of the N examples with its values for each

random variable by $x_n = (a_n, l_n, g_n, e_n, t_n)$. The training set is $\{x_1, x_2, \ldots, x_N\}$.

Which elements of the training data are needed to learn the parameters for $p(A \mid \mathrm{pa}_A)$? Why?

(Note that $\mathrm{pa}_A$ denotes the parents of $A$.)

Start by writing down the likelihood and argue from there.
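
As a reminder of the general form only (this is not specific to the network you draw), a Bayesian network factorizes the joint distribution into one conditional per node, and the log-likelihood of a data set then splits into one sum per node. The notation below is an assumption made for illustration and does not appear in the assignment: $v_1, \ldots, v_K$ are generic variables, $\theta_k$ are the parameters of the $k$-th conditional, and the $N$ examples are assumed independent and identically distributed.

```latex
% Generic Bayesian network factorization (one factor per node)
p(v_1, \ldots, v_K) = \prod_{k=1}^{K} p(v_k \mid \mathrm{pa}_k)

% Log-likelihood of N i.i.d. examples under that factorization
\ln p(\mathcal{D} \mid \theta)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \ln p(v_{n,k} \mid \mathrm{pa}_{n,k}; \theta_k)
```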

2 Gated Recurrent Unit (10 marks)

A Gated Recurrent Unit (GRU) is another type of recurrent neural network unit with the ability

to remember and forget components of the state vector (see Cho et al., EMNLP 2014, https://arxiv.org/abs/1406.1078).

Read Sec. 2.3 of the linked paper for the description of the GRU. Note that the GRU’s state consists

of a vector of $h$ values. There are two gates, $r_j$ and $z_j$, which control the update of $h_j$, the $j$-th component of the GRU state.

• What values of $r_j$ and $z_j$ would cause the new state for $h_j$ to be similar to its old state? Give
a short, qualitative answer.

• If $r_j$ and $z_j$ are both close to 0, how would the state for $h_j$ be updated? Give a short,
qualitative answer.
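
To experiment with the gates numerically, the following is a minimal sketch of one GRU step following the update equations in Sec. 2.3 of Cho et al. It is written in plain Python to stay dependency-free. The function names (`gru_step`, `matvec`, `sigmoid`) and the weight-matrix arguments are assumptions made for illustration, biases are omitted, and the paper remains the authoritative description.

```python
import math


def sigmoid(a):
    """Logistic sigmoid used for both gates."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU update (hypothetical helper, no bias terms).

    x      : input vector at the current time step
    h_prev : previous state vector
    Returns the new state together with the reset and update gates.
    """
    # Reset gate r_j and update gate z_j, one value per state component.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]

    # Candidate state: the reset gate scales the previous state elementwise.
    reset_h = [rj * hj for rj, hj in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(W, x), matvec(U, reset_h))]

    # New state: z_j interpolates between the old state and the candidate.
    h_new = [zj * hj + (1.0 - zj) * cj
             for zj, hj, cj in zip(z, h_prev, h_tilde)]
    return h_new, r, z
```

Calling `gru_step` with hand-picked weights and printing the returned gates alongside the new state is one way to check a qualitative answer against the equations.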

3 Reinforcement Learning (17 marks)

This question guides you through implementing the policy gradient algorithm with average reward

baseline.

Preparation:

• Install gym and TensorFlow for Python. Documentation can be found at https://gym.openai.com/ and https://www.tensorflow.org/install.

• Replace cartpole.py in gym with the version provided. The included file

cartpole_stabilize.py contains the skeleton code for training a cartpole to achieve

its goal of keeping its position centred and pole upright.

The cartpole environment consists of a rotatable pole mounted on top of a cart. The states of the

system are the position and velocity (x, v) of the cart, and the angular position and velocity (θ, ω)

of the pole. The two possible actions are to push the cart left or right with a constant force.

Our goal in this problem is to keep the cart’s position near zero and the pole near upright for as long

as possible. To encourage this, in the custom environment defined in the provided cartpole.py

file, the cartpole system receives a reward of 1 for every time step in which its state satisfies

$|x| \le 0.5$ and $|\theta| \le 4\pi/180$. Training episodes terminate when the system state violates
$|x| \le 1.5$ or $|\theta| \le 12\pi/180$.
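
The reward and termination conditions above can be sketched as two small predicates. This is only an illustration: the function and constant names are hypothetical, the reward is assumed to be 0 outside the rewarded region, and the episode is assumed to end as soon as either bound is exceeded. The provided cartpole.py remains the authoritative definition.

```python
import math

# Thresholds taken from the text above (angles in radians).
X_REWARD_LIMIT = 0.5
THETA_REWARD_LIMIT = 4 * math.pi / 180
X_FAIL_LIMIT = 1.5
THETA_FAIL_LIMIT = 12 * math.pi / 180


def step_reward(x, theta):
    """Return 1.0 when the cart is near the centre and the pole near upright.

    Assumption: the reward is 0.0 otherwise.
    """
    inside = abs(x) <= X_REWARD_LIMIT and abs(theta) <= THETA_REWARD_LIMIT
    return 1.0 if inside else 0.0


def episode_done(x, theta):
    """Return True once the cart position or the pole angle leaves its bound.

    Assumption: violating either bound is enough to end the episode.
    """
    return abs(x) > X_FAIL_LIMIT or abs(theta) > THETA_FAIL_LIMIT
```

Note that a state can earn no reward without ending the episode, for example a cart at x = 1.0 with an upright pole.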

1. In the __init__ method of the agent class, define a policy network that takes as input the

state, has two fully connected hidden layers of the desired number of neurons with ReLU activation,

and outputs the probability distribution of applying the two possible actions.

2. In the __init__ method of the agent class, compute the probability of applying the actions

in the input data.

3. In the __init__ method of the agent class, define the loss function such that its gradient

is $\nabla_\theta J(\theta)$.

4. Complete the compute advantage function, which should compute a list of advantage values
$A_t = \sum_{t' \ge t} \gamma^{t'-t} r(s_{t'}, a_{t'}) - b$ for every time step across a batch of episodes, where
$b = \mathbb{E}_{\tau \sim p(\tau;\theta)} \sum_{t \ge 0} \gamma^{t} r(s_t, a_t)$ is the average reward across the batch of episodes.
Note that the batch size is specified by the update_frequency variable.

5. Complete the main part of the script (fill in the unmodified cartpole_stabilize.py

at lines 73-78, 104-107, 122-124).

6. Produce several plots showing the state of the cart-pole system at different snapshots in time for

a well-performing episode.

7. Produce a plot showing the sum of discounted rewards in each episode versus episode number.
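
To make the advantage formula in item 4 concrete, here is a minimal sketch in plain Python. It assumes each episode is given as a non-empty list of per-step rewards and estimates the baseline as the mean discounted return from the first time step over the batch. The names `discounted_returns` and `batch_advantages` are hypothetical, and the function in the skeleton code may expect different inputs and outputs.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted reward-to-go for every step of one episode.

    Entry t is the sum over t' >= t of gamma**(t' - t) * rewards[t'].
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def batch_advantages(batch_rewards, gamma):
    """Subtract an average-return baseline from every reward-to-go.

    batch_rewards is a list of episodes, each a non-empty list of rewards.
    The baseline is the mean discounted return from t = 0 over the batch.
    """
    per_episode = [discounted_returns(r, gamma) for r in batch_rewards]
    baseline = sum(ret[0] for ret in per_episode) / len(per_episode)
    return [[g - baseline for g in ret] for ret in per_episode]
```

For example, `discounted_returns([1, 1, 1], 0.5)` gives `[1.75, 1.5, 1.0]`. The backward pass computes every reward-to-go in a single sweep instead of re-summing the tail of the episode at each step.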

4 Attention Models (Optional)

As an alternative to recurrent neural network structures, attention models can be used to analyze

an input sequence directly to compute a sequence of output state representations.

If you are interested in learning more, consider reading Vaswani et al., NIPS 2017, https://arxiv.org/abs/1706.03762.
