CSC413/2516

Assignment 4

• (v1.1) Specify the deliverables for the programming questions and provide the raw latex

dump for the written questions

Submission: You must submit two files through MarkUs1

: (1) a PDF file containing your writeup,

titled a4-writeup.pdf, and (2) your code file gnn.ipynb, dqn.ipynb. There will be sections in

the notebook for you to write your responses. Your writeup must be typed. Make sure that the

relevant outputs (e.g. print gradients() outputs, plots, etc.) are included and clearly visible.

See the syllabus on the course website2

for detailed policies. You may ask questions about the

assignment on Piazza3

. Note that 10% of the assignment mark (worth 2 pts) may be removed for

lack of neatness.

You may notice that some questions are worth 0 pt, which means we will not mark them in this

Assignment. Feel free to skip them if you are busy. However, you are expected to see some of them

in the midterm. So, we won’t release the solution for those questions.

The teaching assistants for this assignment are Mustafa Ammous, Masoud Moghani, Haotian Cui,

and Shihao Ma. Send your email with subject “[CSC413] A4 …” to csc413-2023-01-tas@cs.

toronto.edu or post on Piazza with the tag a4.

1

https://markus.teach.cs.toronto.edu/2023-01

2

https://uoft-csc413.github.io/2023/assets/misc/syllabus.pdf

3

https://piazza.com/class/lcp8mp3f9dl7lp

1

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Important Instructions

Read the following before attempting the assignment.

Overview and Expectations

You will be completing this assignment with the aid of large language models (LLMs) such as

ChatGPT, text-davinci-003, or code-davinci-002. To alleviate the unnecessary steps related to

generating results and screenshotting, we have provided the GPT-generated solution with minimum

prompting effort in ChatGPT: clauses. The goal is to help you (i) develop a solid understanding

of the course materials, and (ii) gain some insight in problem-solving with LLMs. Think of this

as analogous to (i) understanding the rules of addition and multiplication, and (ii) learning how

to use a calculator. Note that LLMs may not be a reliable “calculator” (yet) — as you will see,

GPT-like models can generate incorrect and contradicting answers. It is, therefore important that

you have a good grasp of the lecture materials, so that you can evaluate the correctness of the

model output, and also prompt the model toward the correct solution.

Prompt engineering. In this assignment, we ask that you try to (i) solve the problems

yourself, and (ii) use LLMs to solve a selected subset of them. You will “guide” the LLMs toward

desired outcomes by typing text prompts into the models. There are a number of different ways to

prompt an LLM, including direct copy-pasting LATEX strings of a written question, copying function

docstrings, or interactively editing the previously generated results. Prompting offers a natural and

intuitive interface for humans to interact with and use LLMs. However, LLM-generated solutions

depend significantly on the quality of the prompt used to steer the model, and most effective

prompts come from a deep understanding of the task. You can decide how much time you want to

spend as a university student vs. a prompt engineer, but we’d say it’s probably not a good idea to

use more than 25% of your time on prompting LLMs. See Best Practices below for the basics of

prompt engineering.

What are LLMs good for? We have divided the assignment problems into the following

categories, based on our judgment of how difficult it is to obtain the correct answer using LLMs.

• [Type 1] LLMs can produce almost correct answers from rather straightforward prompts,

e.g., minor modification of the problem statement.

• [Type 2] LLMs can produce partially correct and useful answers, but you may have to use

a more sophisticated prompt (e.g., break down the problem into smaller pieces, then ask a

sequence of questions), and also generate multiple times and pick the most reasonable output.

• [Type 3] LLMs usually do not give the correct answer unless you try hard. This may

include problems with involved mathematical reasoning or numerical computation (many

GPT models do not have a built-in calculator).

• [Type 4] LLMs are not suitable for the problem (e.g., graph/figure-related questions).

2

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Written Assignment

What you have to submit for this part

See the top of this handout for submission directions. Here are the requirements.

• The zero point questions (in black below) will not be graded, but you are more than welcome

to include your answers for these as well in the submission.

• For (nonzero-point) questions labeled [Type 1] [Type 2] you need to submit your own solution.

Your own solution can be a copy-paste of the LLM output (if you verify that it is correct),

but make sure you cite the model properly.

• For (nonzero-point) questions in [Type 3] [Type 4] you only need to submit your own written

solution, but we encourage you to experiment with LLMs on some of them.

For reference, here is everything you need to hand in for the first half of the PDF report

a4-writeup.pdf.

• Problem 1: 1.1.2[Type 1] , 1.2.1[Type 1] , 1.3.1[Type 1] , 1.3.2[Type 2]

• Problem 2: 2.1.1[Type 3] , 2.1.2[Type 3] , 2.1.2[Type 1] , 2.2.1[Type 3]

Useful prompts

You could start by naively copy-pasting the question and the context as the prompt, and try to

improve the generated answers by trial and error. Raw LATEX dumps are made available for the

written questions to facilitate this process.

• https://uoft-csc413.github.io/2023/assets/assignments/a4_raw_latex_dump.tex

3

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

1 RNNs and Self Attention

For any successful deep learning system, choosing the right network architecture is as important

as choosing a good learning algorithm. In this question, we will explore how various architectural

choices can have a significant impact on learning. We will analyze the learning performance from

the perspective of vanishing /exploding gradients as they are backpropagated from the final layer

to the first.

1.1 Warmup: A Single Neuron RNN

Consider an n layered fully connected network that has scalar inputs and outputs. For now, assume

that all the hidden layers have a single unit, and that the weight matrices are set to 1 (because

each hidden layer has a single unit, the weight matrices have a dimensionality of R

1×1

).

1.1.1 Effect of Activation – ReLU [0pt]

Lets say we’re using the ReLU activation. Let x be the input to the network and let f : R

1 → R

1

be the function the network is computing. Do the gradients necessarily have to vanish or explode

as they are backpropagated? Answer this by showing that 0 ≤ | ∂f(x)

∂x | ≤ 1.

ChatGPT:

Assuming the ReLU activation function is used, the output of the hidden layer can be expressed

as h(x) = max(0, x), where x is the input to the layer. Therefore, the output of the entire network

can be expressed as f(x) = h(h(· · ·(h(x)))), where there are n consecutive applications of h.

Let’s first consider the case where x ≥ 0. In this case, each application of h has a gradient of

1, so the overall gradient is simply the product of n 1’s, which is 1. Therefore, |

∂f(x)

∂x | = 1 when

x ≥ 0.

Now let’s consider the case where x < 0. In this case, the first application of h will output 0,

and all subsequent applications of h will also output 0. Therefore, the output of the entire network

will be f(x) = 0, and the gradient will be 0 as well. Therefore, |

∂f(x)

∂x | = 0 when x < 0.

Combining these two cases, we can conclude that 0 ≤ | ∂f(x)

∂x | ≤ 1 for any input x. Therefore,

the gradients do not necessarily have to vanish or explode as they are backpropagated, and can be

bounded between 0 and 1.

1.1.2 Effect of Activation – Different weights [0.5pt] [Type 1]

Solve the problem in 1.1.1 by assuming now the weights are not 1. You can assume that the i-th

hidden layer has weight wi

. Do the gradients necessarily have to vanish or explode as they are

backpropagated? Answer this by deriving a similar bound as in Sec 1.1.1 for the magnitude of the

gradient.

ChatGPT:

Let x be the input to the network, and let f : R

1 → R

1 be the function the network is computing.

Let hi(x) be the output of the i-th hidden layer with weight wi

, and let f(x) be the output of the

entire network. Then we can write:

hi(x) = wi max(0, hi−1(x))

where h0(x) = x and hn(x) is the output of the entire network. Using the chain rule, we can

write:

4

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

∂f(x)

∂x =

∂hn(x)

∂x =

∂hn(x)

∂hn−1(x)

∂hn−1(x)

∂hn−2(x)

· · ·

∂h2(x)

∂h1(x)

∂h1(x)

∂x

Using the expression for hi(x), we can compute the derivatives as follows:

∂hi(x)

∂hi−1(x)

= wi

∂

∂hi−1(x)

max(0, hi−1(x)) = wi

[max(0, hi−1(x))]′

where [max(0, hi−1(x))]′

is the derivative of the ReLU activation function evaluated at hi−1(x).

If hi−1(x) > 0, then [max(0, hi−1(x))]′ = 1, and if hi−1(x) ≤ 0, then [max(0, hi−1(x))]′ = 0.

Therefore, we have:

∂hi(x)

∂hi−1(x)

= wi

[max(0, hi−1(x))]′ =

(

wi

if hi−1(x) > 0

0 if hi−1(x) ≤ 0

Using the above expression, we can compute:

∂f(x)

∂x

=

∂hn(x)

∂x

=

Yn

i=1

∂hi(x)

∂hi−1(x)

=

Y

hi−1(x)>0

wi

The above expression shows that the magnitude of the gradients can be bounded by the product

of the magnitudes of the weights of the positive activations. This bound can still grow exponentially

with the number of hidden layers and the magnitudes of the weights, which can lead to the vanishing

or exploding gradients problem.

1.2 Matrices and RNN

We will now analyze the recurrent weight matrices under Singular Value Decomposition. SVD is

one of the most important results in all of linear algebra. It says that any real matrix M ∈ R

mxn

can be written as M = UΣV

T where U ∈ R

mxm and V ∈ R

nxn are square orthogonal matrices, and

Σ ∈ R

mxn is a rectangular diagonal matrix with nonnegative entries on the diagonal (i.e. Σii ≥ 0

for i ∈ {1, . . . , min(m, n)} and 0 otherwise). Geometrically, this means any linear transformation

can be decomposed into a rotation/flip, followed by scaling along orthogonal directions, followed

by another rotation/flip.

1.2.1 Gradient through RNN [0.5pt] [Type 1]

Let say we have a very simple RNN-like architecture that computes xt+1 = sigmoid(W xt). You

can view this architecture as a deep fully connected network that uses the same weight matrix at

each layer. Suppose the largest singular value of the weight matrix is σmax(W) = 1

4

. Show that

the largest singular value of the input-output Jacobian has the following bound:

0 ≤ σmax(

∂xn

∂x1

) ≤ (

1

16

)

n−1

(Hint: if C = AB, then σmax(C) ≤ σmax(A)σmax(B). Also, the input-output Jacobian is the

multiplication of layerwise Jacobians).

ChatGPT:

To begin, let’s compute the Jacobian of the RNN-like architecture:

5

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

∂xt+1

∂xt

=

∂

∂xt

sigmoid(W xt) = sigmoid′

(W xt)W

Using the chain rule, we can write the input-output Jacobian as the product of the layerwise

Jacobians:

∂xn

∂x1

=

nY−1

t=1

∂xt+1

∂xt

=

nY−1

t=1

sigmoid′

(W xt)W

To bound the largest singular value of this Jacobian, we can use the hint provided and the fact

that the largest singular value of a matrix is equal to its spectral norm, which is defined as the

maximum absolute value of its eigenvalues.

Using the spectral norm, we can write:

σmax

∂xn

∂x1

=

nY−1

t=1

sigmoid′

(W xt)W

2

We can use the fact that the spectral norm is sub-multiplicative, meaning that ∥AB∥2 ≤

∥A∥2∥B∥2, to bound the spectral norm of the product:

σmax

∂xn

∂x1

≤

nY−1

t=1

sigmoid′

(W xt)W

2

We can further bound the 2-norm of sigmoid′

(W xt)W as follows:

sigmoid′

(W xt)W

2

= σmax(sigmoid′

(W xt)W) ≤ σmax(sigmoid′

(W xt))σmax(W)

where we used the hint provided to obtain the inequality.

Recall that the sigmoid function has a maximum derivative value of 1

4

, which occurs at x = 0.

Therefore, we can write:

σmax

∂xn

∂x1

≤

nY−1

t=1

1

4

σmax(W) =

1

4

n−1

σmax(W)

n−1

Substituting the value of σmax(W) = 1

4

given in the problem, we obtain the desired bound:

0 ≤ σmax

∂xn

∂x1

≤

1

16n−1

1.3 Self-Attention

In a self-attention layer (using scaled dot-product attention), the matrix of outputs is computed

as:

Attention(Q, K, V ) = softmax

QK⊤

√

dk

V

where Q, K, V ∈ R

n×d are the query, key, and value matrices, n is the sequence length, and dm

is the embedding dimension.

6

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

1.3.1 Complexity of Self-Attention [0.5pt] [Type 1]

Recal from Lecture 8, the total cost for scaled dot-product attention scales quadratically with the

sequence length n, i.e., O

n

2

. We can generalize the attention equation for any similarity function

sim() to the following:

αi =

Pn

j=1 sim(Qi

, Kj )Vj

Pn

j=1 sim(Qi

, Kj )

(1.1)

where the subscript of a matrix represents the i-th row as a vector. This is equivalent to the

Softmax attention if we substitute sim(q, k) = exp(

q

T k

√

dk

). Note that for this generalized equation

to be a valid attention equation, the only constraint on sim() is that it need to be non-negative,

which is true for all kernel functions k(x, y) = ϕ(x)

T ϕ(y), for some feature mapping ϕ(). Show that

by applying kernel functions, attention can be calculated with linear complexity (i.e., O(n)).

Hint: Sub in the kernel function for the similarity function into Eq 1.1. Group the terms based

on their subscript (i.e., i and j).

ChatGPT:

To show that attention can be calculated with linear complexity using kernel functions, we will

substitute a kernel function for the similarity function sim() in Equation 1.1 and group the terms

based on their subscript.

Let k(x, y) be a kernel function, then we can define the similarity function sim(Qi

, Kj ) as:

sim(Qi

, Kj ) = k(Qi

, Kj )

Substituting this into Equation 1.1, we get:

αi =

Pn

j=1 k(Qi

, Kj )Vj

Pn

j=1 k(Qi

, Kj )

Now, let’s expand k(Qi

, Kj ) using the kernel trick:

k(Qi

, Kj ) = ϕ(Qi)

T ϕ(Kj )

where ϕ() is a feature mapping that maps the input to a high-dimensional space.

Substituting this in the previous equation, we get:

αi =

Pn

j=1 ϕ(Qi)

T ϕ(Kj )Vj

Pn

j=1 ϕ(Qi)

T ϕ(Kj )

Next, we can group the terms based on their subscripts as follows:

αi =

ϕ(Qi)

T Pn

j=1 ϕ(Kj )Vj

ϕ(Qi)

T Pn

j=1 ϕ(Kj )

Now, we can precompute ϕ(Kj )Vj and ϕ(Kj ) for all j, and store them in matrices M and N

respectively, where Mj = ϕ(Kj )Vj and Nj = ϕ(Kj ). Then, we can rewrite the equation as:

αi =

ϕ(Qi)

TM

ϕ(Qi)

T N

This can be computed for all i with a single matrix multiplication, which has a complexity of

O(n), since M and N have dimensions n × d, where d is the dimensionality of the feature space.

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Therefore, we have shown that by applying kernel functions, attention can be calculated with

linear complexity.

1.3.2 Linear Attention with SVD [0.5pt] [Type 2]

It has been empirically shown in Transformer models that the context mapping matrix P =

softmax

QK⊤

√

dk

often has a low rank. Show that if the rank of P is k and we already have access

to the SVD of P, then it is possible to compute self-attention in O(nkd) time.

ChatGPT:

If the rank of P is k, then we can express P as a product of two matrices U ∈ R

n×k and V

⊤ ∈ R

k×n

,

where U and V are orthonormal matrices, i.e., U

⊤U = Ik and V V ⊤ = In. Therefore, we have:

P = UV ⊤

Using this factorization of P, we can rewrite the self-attention function as:

Self-Attention(Q, K, V ) =

QK⊤

√

dk

V softmax

QK⊤

√

dk

⊤

= Q(UV ⊤)K⊤V (UV ⊤)

⊤

= QU(V

⊤K)

⊤V U ⊤V U ⊤ = QU(V

⊤K)

⊤(V U)

⊤

Here, we used the property that U

⊤V = 0 because U and V have different orthonormal bases.

Note that V

⊤K has a shape of k ×dk, and we can compute it once since we already have access

to the SVD of P.

Therefore, given the SVD of P, we can compute the self-attention with O(nkd) time complexity,

which is linear in n and d, and proportional to the rank k of the context mapping matrix P. This is

more efficient than the O(n

2d) time complexity required for the standard self-attention calculation.

1.3.3 Linear Attention by Projecting [0pt]

Suppose we ignore the Softmax and scaling and let P = QK⊤ ∈ R

n×n

. Assume P is rank k. Show

that there exist two linear projection matrices C, D ∈ R

k×n

such that P V = Q(CK)

⊤DV and the

right hand side can be computed in O(nkd) time. Hint: Consider using SVD in your proof.

ChatGPT:

Since P = QK⊤, we can perform SVD on K as K = UΣV

⊤, where U ∈ R

n×n and V ∈ R

dk×dk are

orthogonal matrices and Σ ∈ R

dk×dk is a diagonal matrix with singular values σ1, σ2, …, σdk

.

Then, we have QK⊤ = QUΣV

⊤. Since P = QK⊤ has rank k, we know that Σ has at most k

non-zero singular values. Without loss of generality, let the first k singular values be non-zero, and

let Σk be the corresponding diagonal matrix of singular values.

Using the first k columns of U and V (denoted Uk and Vk, respectively), we can approximate

K as K ≈ UkΣkV

⊤

k

. Substituting this approximation into the expression for P, we have P ≈

QUkΣkV

⊤

k

, which we can rewrite as P = Q(CK)

⊤DV where C = V

⊤

k

, D = (UkΣk)

T and V = Idk

.

Now, to compute P V we can first compute CK and DV separately, each in O(nkd) time. Then,

we can compute Q(CK)

⊤DV in O(nkd) time. Therefore, we can compute P V in O(nkd) time as

required.

8

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

2 Reinforcement Learning

2.1 Bellman Equation

The Bellman equation can be seen as a fix point equation to what’s called the Bellman Operator.

Given a policy π, the Bellman operators T

π

for V : B(S) → B(S) and T

π

for Q : B(S × A) →

B(S × A) are defined as follows:

(T

πV )(s) ≜ r

π

(S) + γ

Z

P(s

′

|s, a)π(a|s)V (s

′

) (2.1)

(T

πQ)(s, a) ≜ r(s, a) + γ

Z

P(s

′

|s, a)π(a

′

|s

′

)Q(s

′

, a′

) (2.2)

for all s ∈ S( for V ) or all (s, a) ∈ S × A( for Q).

The Bellman operators have two important properties, 1) monotonicity and 2) γ-contraction.

These properties give us many guarantees such as applying the operator repeatedly will converge to

an unique and optimal solution, which is what allow us to show RL algorithms such as Q-Learning

converges (under certain additional assumptions, but we won’t go over them here). In this section,

we will show that the Bellman operator indeed have these two properties.

2.1.1

[0.5pt] Show that the Bellman operator (on V ) has the monotonicity property. i.e., show that for

a fixed policy π, if V1, V2 ∈ B(S), and V1(s) ≤ V2(s) for all s ∈ S, then we have

T

πV1 ≤ T

πV2 (2.3)

ChatGPT:

To show the monotonicity property of the Bellman operator T

π on V , we need to prove that

if V1, V2 ∈ B(S) with V1(s) ≤ V2(s) for all s ∈ S, then (T

πV1)(s) ≤ (T

πV2)(s) for all s ∈ S.

Expanding the definition of T

π

for V , we have:

(T

πV1)(s) = r

π

(s) + γ

X

s

′∈S

P(s

′

|s, π(s))V1(s

′

)

(T

πV2)(s) = r

π

(s) + γ

X

s

′∈S

P(s

′

|s, π(s))V2(s

′

)

Since V1(s) ≤ V2(s) for all s ∈ S, we have:

γ

X

s

′∈S

P(s

′

|s, π(s))V1(s

′

) ≤ γ

X

s

′∈S

P(s

′

|s, π(s))V2(s

′

)

⇒ r

π

(s) + γ

X

s

′∈S

P(s

′

|s, π(s))V1(s

′

) ≤ r

π

(s) + γ

X

s

′∈S

P(s

′

|s, π(s))V2(s

′

)

⇒ (T

πV1)(s) ≤ (T

πV2)(s)

Therefore, we have shown that the Bellman operator T

π on V has the monotonicity property.

9

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

2.1.2

[0.5pt] Show that the Bellman operator is a γ-contraction mapping with the supremum norm (on

Q). i.e. show that for a discount factor γ and Q1, Q2 ∈ B(S × A), we have

∥T

π

(Q1) − T

π

(Q2)∥∞ ≤ γ∥Q1 − Q2∥∞ (2.4)

Recall from your math classes, the supremum norm (on Q) is as follows:

∥Q∥∞ = sup

(s,a)∈S×A

|Q(s, a)| (2.5)

Hint: for some function f, we have the following. For this question, you can think about what

is P and f in our case.

Z

P(x)f(x)

≤

Z

|P(x)f(x)| =

Z

|P(x)| · |f(x)|

≤

Z

P(x) · sup

x∈X

|f(x)| (2.6)

= sup

x∈X

|f(x)|

Z

P(x) = ∥f∥∞

where in the last line we used the fact R

P(x) = 1

ChatGPT:

To show that the Bellman operator is a γ-contraction mapping with the supremum norm, we

need to prove that for any Q1, Q2 ∈ B(S × A), we have:

|T

πQ1 − T

πQ2|∞ ≤ γ|Q1 − Q2|∞ (2.7)

Let s and a be arbitrary elements in S and A respectively. Then we have:

(T

πQ1)(s, a) − (T

πQ2)(s, a)

=

r(s, a) + γ

X

s

′

P(s

′

| s, a)π(a

′

| s

′

)Q1(s

′

, a′

)

−

r(s, a) + γ

X

s

′

P(s

′

| s, a)π(a

′

| s(2.8)

= γ

X

s

′

P(s

′

| s, a)π(a

′

| s

′

)

Q1(s

′

, a′

) − Q2(s

′

, a′

)

(2.9)

≤ γ

X

s

′

P(s

′

| s, a)π(a

′

| s

′

)

Q1(s

′

, a′

) − Q2(s

′

, a′

)

(2.10)

≤ γ|Q1 − Q2|∞ (2.11)

where we used the fact that |

P

x

ax| ≤ P

x

|ax| in the third line and the supremum norm definition

in the last line. Taking the supremum over all (s, a) ∈ S × A yields:

|T

πQ1 − T

πQ2|∞ ≤ γ|Q1 − Q2|∞ (2.12)

as required. Therefore, the Bellman operator is a γ-contraction mapping with the supremum norm.

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

2.1.3

[0.5pt] For this question, you may assume knowledge of the reward function r(s, a) and transition

probability function p(s

′

|s, a), where s

′

is the next state.

1. Give a definition of v∗(s) in terms of q∗(s, a).

2. Give a definition of q∗(s, a) in terms of v∗(s).

3. Give a definition of a∗ in terms of q∗(s, a).

4. Give a definition of a∗ in terms of v∗(s).

ChatGPT:

a) v ∗ (s) is the optimal state-value function, which is defined as the maximum expected return

achievable by any policy π, starting from state s, i.e., v ∗ (s) = maxπ Eπ[Gt

|st = s]. We can

express v ∗ (s) in terms of q ∗ (s, a) as: v ∗ (s) = maxa q ∗ (s, a), where q ∗ (s, a) is the optimal

action-value function, which is defined as the maximum expected return achievable by any policy

π, starting from state s, taking action a, and then following the optimal policy thereafter, i.e.,

q ∗ (s, a) = maxπ Eπ[Gt

|st = s, at = a].

b) q ∗ (s, a) is the optimal action-value function, which is defined as the maximum expected

return achievable by any policy π, starting from state s, taking action a, and then following the

optimal policy thereafter, i.e., q ∗ (s, a) = maxπ Eπ[Gt

|st = s, at = a]. We can express q ∗ (s, a)

in terms of v ∗ (s) as: q ∗ (s, a) = r(s, a) + γ

P

s

′∈S P(s

′

|s, a)v ∗ (s

′

), where r(s, a) is the reward

function for taking action a in state s, γ is the discount factor, and P(s

′

|s, a) is the probability of

transitioning to state s

′ when taking action a in state s.

c) a∗ is the optimal action to take in state s, which maximizes the action-value function, i.e.,

a∗ = arg maxa q ∗ (s, a).

d) a∗ in terms of v ∗ (s) is: a∗ = arg maxa[r(s, a) + γ

P

s

′∈S P(s

′

|s, a)v ∗ (s

′

)]. This means that

we can find the optimal action to take in state s by computing the action-value function for each

action and selecting the action with the highest value.

2.2 Policy gradients and black box optimization

Very often we have a function f that does not give us useful gradient information: input or output

may be discrete; f may be piecewise constant, nowhere differentiable, or have pathological gradients

(e.g., a discontinuous saw wave on an incline, whose gradient always points away from the global

optimum); or f may be a black box that we cannot backpropagate through. For example, we may

have a phone app that labels photos as cats or dogs. This situation is the default in Reinforcement

Learning (RL), where we can execute the environment dynamics, but we cannot see or control their

internals.

We still, however, want to optimize some score function J[f] : X → R. For example, in RL, we

want to learn a policy that maximizes the non-differentiable environment reward.

When using the REINFORCE strategy, we replaced the θ optimization task with a Monte-Carlo

approximation. One of the key factors for a successful REINFORCE application is the variance.

The higher the variance, the more “noisy” the gradient estimates will be, which can slow down the

optimization process. In this section we will derive the variance of the REINFORCE estimator for

a simple toy task.

11

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Consider a loss function, f(˜a) which is the zero-one loss of the logistic regression output, p(a|θ).

The input vector has D independent scalar features, xd. We evaluate the performance of the

classifier by sampling from the output of the sigmoid µ. The loss function J(θ) can be written as:

µ = σ

X

D

d=1

θdxd

, (2.13)

p(a|θ) = Bernoulli(µ) = (

µ a = 1

1 − µ a = 0

, (2.14)

a˜ ∼ p(a|θ), (2.15)

f(˜a) = (

1 ˜a = 1

0 ˜a = 0

, (2.16)

J(θ) = Ea˜∼p(a|θ)

[f(˜a)]. (2.17)

2.2.1 Closed form expression for REINFORCE estimator [0.5pt]

Recall from above that the expression for REINFORCE estimator is:

∇θJ[θ] = Ea˜∼p(a|θ)

f(˜a)

∂

∂θ log p(a = ˜a|θ)

(2.18)

We can denote the expression inside the expectation as g[θ, x]:

g[θ, a˜] = f(˜a)

∂

∂θ log p(a = ˜a|θ), a˜ ∼ p(a|θ) (2.19)

For this question, derive a closed form for the g[θ, a˜] as a deterministic function of ˜a, µ, θ, and

xd.

Hint: Substitute in the log likelihood of the Bernoulli distribution.

ChatGPT:

We can start by computing the partial derivative of the log-likelihood of the Bernoulli distribution:

∂

∂θ log p(a = ˜a|θ) = ∂

∂θ [˜a log µ + (1 − a˜) log(1 − µ)] (2.20)

=

a˜

µ

∂

∂θµ −

1 − a˜

1 − µ

∂

∂θµ (2.21)

=

a˜ − µ

µ(1 − µ)

∂

∂θµ. (2.22)

Next, we can plug in the expressions for µ and ∂

∂θµ:

g[θ, a˜] = f(˜a)

∂

∂θ log p(a = ˜a|θ) (2.23)

= f(˜a)

a˜ − µ

µ(1 − µ)

∂

∂θµ (2.24)

= f(˜a)

a˜ − σ(

PD d = 1θdxd)

σ(

PD d = 1θdxd)(1 − σ(

PD d = 1θdxd))

σ(

X

D

d = 1θdxd)(1 − σ(

X

D

d = 1θdxd))xd

(2.25)

= f(˜a)(˜a − σ(

X

D

d = 1θdxd))xd. (2.26)

12

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Therefore, we obtain the closed form for g[θ, a˜] as f(˜a)(˜a − σ(

PD

d=1 θdxd))xd.

2.2.2 Variance of REINFORCE estimator [0pt]

We will derive the variance of the REINFORCE estimator above. Since the gradient is is Ddimensional, the covariance of the gradients will be D × D matrix. In this question, we will only

consider the variance with respect to the first parameter, i.e. Var[[]ˆg[θ, a˜]1] which scalar value

corresponding to the first element in the diagonal of the covariance matrix. Derive the variance of

the gradient estimator as a function of the first parameter vector: Var[[]ˆg[θ, a˜]1], as a function of

µ, θ, and xd.

Hint: The second moment of a Bernoulli random variable is µ(1 − µ).

2.2.3 Convergence and variance of REINFORCE estimator [0 pt]

Comment on the variance in Part 2.2.2. When do we expect learning to converge slowly in terms

of the output of the logistic regression model, µ?

13

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

Programming Assignment

What you have to submit for this part

For reference, here is everything you need to hand in:

• This is the second half of your PDF report a4-writeup.pdf. Please include the solutions to

the following problems. You may choose to export gnn.ipynb, dqn.ipynb as a PDF and

attach it to the first half of a4-writeup.pdf.

– Question 3: 3.1[Type 1] , 3.2[Type 2] , 3.3[Type 4] , 3.4[Type 2] , 3.5[Type 4] , 3.6[Type

4]

– Question 4: 4.1[Type 1] , 4.2[Type 2] , 4.3[Type 4] .

• Your code file gnn.ipynb, dqn.ipynb

Introduction

In this assignment, you’ll get hands-on experience coding and training GCN (Graph Convolution

Network) and DQN (Deep Q-learning Network), one of Reinforcement Learning methods. This

assignment is divided into two parts: in the first part, you will learn how to implement the vanilla

version of GCN and GAT. In the second part, you will implement and train a DQN agent to learn

how to play the CartPole balancing game. It will be fun to see your model performs much better

than you on the simple game :).

Setting Up

We recommend that you use Colab(https://colab.research.google.com/) for the assignment.

To setup the Colab environment, just open the notebooks for each part of the assignment and

make a copy in your own Google Drive account.

Deliverables

Each section is followed by a checklist of deliverables to add in the assignment writeup. To also give

a better sense of our expectations for the answers to the conceptual questions, we’ve put maximum

sentence limits. You will not be graded for any additional sentences.

14

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

3 Graph Convolution Networks[5pt]

For this part of the assignment, you will implement the vanilla version of Graph Convolution

Networks (GCN) Kipf and Welling [2016] and Graph Attention Networks (GAT) Velickovi´c et al.

[2018].

Basics of GCN:

Recall from the lecture, the goal of a GCN is to learn a function of signals/features on a graph

G = (V, E), which takes as inputs:

1. the input features of each node, xi ∈ RF (in matrix form: X ∈ R|V |×F )

2. some information about the graph structure, typically the adjacency matrix A

Each convolutional layer can be written as H(l+1) = f(H(l)

, A), for some function f(). The f()

we are using for this assignment is in the form of f(H(l)

, A) = σ(Dˆ −1/2AˆDˆ −1/2H(l)W(l)

), where

Aˆ = A + Identity and Dˆ is diagonal node degree matrix (Dˆ −1Aˆ normalizes Aˆ such that all rows

sum to one). Let A˜ = Dˆ −1/2AˆDˆ −1/2

. The GCN we will implement takes two convolution layers,

Z = f(X, A) = sof tmax(A˜ · Dropout(ReLU(AXW ˜ (0))) · W(1))

Basics of GAT:

Graph Attention Network (GAT) is a novel convolution-style neural network. It operates on graphstructured data and leverages masked self-attentional layers. In this assignment, we will implement

the graph attention layer.

Dataset:

The dataset we used for this assignment is Cora Sen et al. [2008]. Cora is one of standard citation

network benchmark dataset (just like MNIST dataset for computer vision tasks). It that consists

of 2708 scientific publications and 5429 links. Each publication is classified into one of 7 classes.

Each publication is described by a word vector (length 1433) that indicates the absence/presence

of the corresponding word. This is used as the features of each node for our experiment. The task

is to perform node classification (predict which class each node belongs to).

Experiments:

Open [GNN notebook link] on Colab and answer the following questions.

1. [1pt] Implementation of Graph Convolution Layer

Complete the code for GraphConvolution() Class

2. [1pt] Implementation of Graph Convolution Network

Complete the code for GCN() Class

3. [0.5pt] Train your Graph Convolution Network

After implementing the required classes, now you can train your GCN. You can play with the

hyperparameters in args.

15

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

4. [1.5pt] Implementation of Graph Attention Layer

Complete the code for GraphAttentionLayer() Class

5. [0.5pt] Train your Graph Attention Network

After implementing the required classes, now you can train your GAT. You can play with the

hyperparameters in args.

6. [0.5pt] Compare your models

Compare the evaluation results for Vanilla GCN and GAT. Comment on the discrepancy in

their performance (if any) and briefly explain why you think it’s the case (in 1-2 sentences).

Deliverables

Create a section in your report called Graph Convolution Networks. Add the following:

• Screenshots of your GraphConvolution, GCN implementations. Highlight the lines you’ve

added.

• Screenshots of your GCN training output, you can just screenshot the last 10 epochs with

test set results.

• Screenshots of your GraphAttentionLayer implementations. Highlight the lines you’ve added.

• Screenshots of your GAT training output, you can just screenshot the last 10 epochs with

test set results.

• Your response to the written component of question 3.6. Your analysis should not exceed 3

sentences.

16

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

4 Deep Q-Learning Network (DQN) [4pt]

In this part of the assignment, we will apply Reinforcement Learning (DQN) to tackle the CartPole

Balancing game, the game that seems easy but actually quite hard. If you haven’t tried it yet, I

recommend you try it first [the link]. However, the difficult game for human may be very simple

to a computer.

Figure 1: Image of the CartPole Balancing game from OpenAI Gym.Brockman et al. [2016]

DQN Overview

Reinforcement learning defines an environment for the agent to perform certain actions (according

to the policy) that maximize the reward at every time stamp. Essentially, our aim is to train a

agent that tries to maximize the discounted, cumulative reward Rt0 =

P∞

t=t0

γ

t−t0 rt

. Because we

assume there can be infinite time stamps, the discount factor, γ, is a constant between 0 and 1 that

ensures the sum converges. It makes rewards from the uncertain far future less important for our

agent than the ones in the near future.

The idea of Q-learning is that if we have a function Q∗

(state, action) that outputs the maximum

expected cumulative reward achievable from a given state-action pair, we could easily construct a

policy (action selection rule) that maximizes the reward:

π

∗

(s) = argmax

a

Q

∗

(s, a) (4.1)

However, we don’t know everything about the world, so we don’t have access to Q∗

. But,

since neural networks are universal function approximators, we can simply create one and train it

to resemble Q∗

. For our training update rule, we will use a fact that every Q function for some

policies obeys the Bellman equation:

Q

π

(s, a) = r + γQπ

(s

′

, π(s

′

)) (4.2)

An intuitive explanation of the structure of the Bellman equation is as follows. Suppose that

the agent has received reward rt at the current state, then the maximum discounted reward from

17

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

this point onward is equal to the current reward plus the maximum expected discounted reward

γQ∗

(st+1, at+1) from the next stage onward. The difference between the two sides of the equality

is known as the temporal difference error, δ:

δ = Q(s, a) − (r + γ max

a

Q(s

′

, a)) (4.3)

Our goal is the minimise this error, so that we can have a good Q function to estimate the

rewards given any state-action pair.

Experiments

Open the Colab notebook link to begin: [DQN notebook link]. Read through the notebook and

play around with it. More detailed instructions are given in the notebook. Have fun!

1. [1pt] Implementation of ϵ − greedy

Complete the function get action for the agent to select an action based on current state. We

want to balance exploitation and exploration through ϵ − greedy, which is explained in the

notebook. Include your code snippet in your write-up.

2. [1pt] Implementation of DQN training step

Complete the function train for the model to perform a single step of optimization. This is

basically to construct the the temporal difference error δ and perform a standard optimizer

update. Notice that there are two networks in the DQN network, policy net and target net,

think about how to use these two networks to construct the loss. Include your code snippet

in your write-up.

3. [2pt] Train your DQN Agent

After implementing the required functions, now you can train your DQN Agent, and you

are suggested to tune the hyperparameters listed in the notebook. Hyperparameters are

important to train a good agent. After all of these, now you can validate your model by

playing the CartPole Balance game! List the hyperparameters’ value you choose,

your epsilon decay rule and summarize your final results from the visualizations

in a few sentences in your write-up.

Deliverables

Create a section in your report called Deep Q-learning Network. Add the following:

• Screenshots of your get_action, train, epsilon decay rule implementations. Highlight

the lines you’ve added.

• Screenshots of your reward curve, and the end time frame of your cartpole video (to show

how many seconds you can balance the cartpole).

• Your response to the written component of question 4.3. Your analysis should not exceed 3

sentences.

18

CSC413/2516 Winter 2023 with Professor Jimmy Ba & Professor Bo Wang Assignment 4

What you need to submit

• Your code files: gnn.ipynb, dqn.ipynb.

• A PDF document titled a4-writeup.pdf containing code screenshots, any experiment

results or visualizations, as well as your answers to the written questions.

Further Resources

For further reading on GANs, DCGAN, GCN and DQN, the following links may be useful:

1. Generative Adversarial Nets (Goodfellow et al., 2014)

2. Deconvolution and Checkerboard Artifacts (Odena et al., 2016)

3. Progressive Growing of GANs (Karras et al. [2017]

4. Analyzing and Improving the Image Quality of StyleGAN (Karras et al. [2020])

5. An Introduction to GANs in Tensorflow

6. Generative Models Blog Post from OpenAI

7. Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013)

8. Deep Reinforcement Learning: A Brief Survey (Arulkumaran et al., 2017)

References

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.

Petar Velickovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li`o, and Yoshua

Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.

URL https://openreview.net/forum?id=rJXMpikCZ. accepted as poster.

Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina

Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008. URL

http://www.cs.iit.edu/~ml/pdfs/sen-aimag08.pdf.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,

and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for

improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.

19

## Reviews

There are no reviews yet.