CSC413/2516

Assignment 2

• (v1.3) Updated 3.2 to use a simplified FCNN architecture for the calculations.

• (v1.2) Further clarified question 3.2: updated the image of the FCNN architecture to specify output channels after hyphens, and added a sentence stating that the number after the hyphen specifies the number of output channels or units of a layer in the image. Removed the artefact from an older version of the question stating "Assume zero padding is used in convolutional layers such that the output dimension is equal to the input dimension", as the output and input dimensions are specified by the images and are not equal.

• (v1.1) Updated question 3.2 to REMOVE the statement that all convolutional layers have only one output channel. Please refer to the numbers in the images to see the output channels of each convolutional layer.

Submission: You must submit two files through MarkUs [1]: (1) a PDF file containing your writeup, titled a2-writeup.pdf, and (2) your code file a2-code.ipynb. There will be sections in the notebook for you to write your responses. Your writeup must be typed. Make sure that the relevant outputs (e.g. print_gradients() outputs, plots, etc.) are included and clearly visible.

See the syllabus on the course website [2] for detailed policies. You may ask questions about the assignment on Piazza [3]. Note that 10% of the assignment mark (worth 2 pts) may be removed for lack of neatness.

You may notice that some questions are worth 0 pts, which means we will not mark them in this assignment. Feel free to skip them if you are busy. However, you should expect to see some of them on the midterm, so we will not release the solutions for those questions.

The teaching assistants for this assignment are Mica Consens, Amir Peimani and Yun-Chun Chen.

Send your email with subject “[CSC413] A2 …” to csc413-2023-01-tas@cs.toronto.edu or post

on Piazza with the tag a2.

[1] https://markus.teach.cs.toronto.edu/2023-01
[2] https://uoft-csc413.github.io/2023/assets/misc/syllabus.pdf
[3] https://piazza.com/class/lcp8mp3f9dl7lp


CSC413/2516 Winter 2023 with Prof. Jimmy Ba and Bo Wang Assignment 2

Important Instructions

Read the following before attempting the assignment.

Overview and Expectations

You will be completing this assignment with the aid of large language models (LLMs) such as

ChatGPT, text-davinci-003, or code-davinci-002. The goal is to help you (i) develop a solid understanding of the course materials, and (ii) gain some experience in using LLMs for problem solving.

Think of this as analogous to (i) understanding the rules of addition and multiplication, and (ii)

learning how to use a calculator. Note that LLMs may not be a reliable “calculator” (yet) — as you

will see, GPT-like models can generate incorrect and contradictory answers. It is therefore important that you have a good grasp of the lecture materials, so that you can evaluate the correctness

of the model output, and also prompt the model toward the correct solution.

Prompt engineering. In this assignment, we ask that you try to (i) solve the problems

yourself, and (ii) use LLMs to solve a selected subset of them. You will “guide” the LLMs toward

desired outcomes by typing text prompts into the models. There are a number of different ways to

prompt an LLM, including directly copy-pasting LaTeX strings of a written question, copying function

docstrings, or interactively editing the previously generated results. Prompting offers a natural and

intuitive interface for humans to interact with and use LLMs. However, LLM-generated solutions

depend significantly on the quality of the prompt used to steer the model, and most effective

prompts come from a deep understanding of the task. You can decide how much time you want to

spend as a university student vs. a prompt engineer, but we’d say it’s probably not a good idea to

use more than 25% of your time on prompting LLMs. See Best Practices below for the basics of

prompt engineering.

What are LLMs good for? We have divided the assignment problems into the following

categories, based on our judgment of how difficult it is to obtain the correct answer using LLMs.

• [Type 1] LLMs can produce almost correct answers from rather straightforward prompts,

e.g., minor modification of the problem statement.

• [Type 2] LLMs can produce partially correct and useful answers, but you may have to use

a more sophisticated prompt (e.g., break down the problem into smaller pieces, then ask a

sequence of questions), and also generate multiple times and pick the most reasonable output.


• [Type 3] LLMs usually do not give the correct answer unless you try hard. This may

include problems with involved mathematical reasoning or numerical computation (many

GPT models do not have a built-in calculator).

• [Type 4] LLMs are not suitable for the problem (e.g., graph/figure-related questions).

Program trace and show your work. For questions in [Type 1] [Type 2] , you must

include the program trace (i.e., screenshots of your interaction with the model) in

your submission; the model output does not need to be the correct answer, but you should be

able to verify the solution or find the mistake. For questions in [Type 3] [Type 4] , by contrast, you do

not have to include the program traces, but we encourage you to experiment with LLMs and get

creative in the prompting; for problems labeled [EC] in these two categories, extra credit will be

given to prompts that lead to the correct solution.

Grading. We will be grading the assignments as follows.

• Written Part

– For questions in [Type 1] [Type 2] , we require you to submit (i) the program trace,

(ii) critique of the model output (is it correct? which specific part is wrong?), (iii)

your own solution. You will not lose points if you cannot prompt the LLM to generate

the correct answer, as long as you can identify the mistake. However, you will receive

0 points on the question if you do not include the program trace, even if you solve the
problem by yourself; the goal of this is to help you familiarize yourself with the use of LLMs.

If you are confident that the model output is correct, you can directly use it as your own

solution. Make sure you cite the model properly, that is, include the model name,

version (date), and url if applicable.

– For questions in [Type 3] [Type 4] , you will be graded based on the correctness of your

own solution, and you do not need to include screenshots of the model output. Creative

prompts that lead to correct model output will be rewarded.

• Programming Part

– For writing questions labeled [Type 1] [Type 2] please submit (i) the program trace, (ii)

critique of the model output, (iii) your own solution. The grading scheme is the same

as the previous written part.

– For coding questions labeled [Type 1] [Type 2] , submit (i) the program trace, (ii) your

own solution. You will be graded based on the correctness of your own solution, that is,

you will receive full marks if your code can execute and produce the desired outcome.

From our experience, the most efficient way is to start with the LLM-generated code,

which often gives the correct solution under minor modifications. Again, make sure

you cite the model properly.


– For questions in [Type 3] [Type 4] , you will be graded based on the correctness of your

own solution, and you do not need to include screenshots of the model output.

Written Assignment

What you have to submit for this part

See the top of this handout for submission directions. Here are the requirements.

• The zero point questions (in black below) will not be graded, but you are more than welcome

to include your answers for these as well in the submission.

• For (nonzero-point) questions labeled [Type 1] [Type 2] you need to submit (i) the LLM

program trace, (ii) critique of the model output, (iii) your own solution. You will

receive 0 points on the question if you do not include the program trace, even if you solve the

problem by yourself. Your own solution can be a copy-paste of the LLM output (if you verify

that it is correct), but make sure you cite the model properly.

• For (nonzero-point) questions in [Type 3] [Type 4] you only need to submit your own written

solution, but we encourage you to experiment with LLMs on some of them.

• If you attempt the extra credit problems [EC] using LLMs, include your program trace in the

beginning of the writeup document.

For reference, here is everything you need to hand in for the first half of the PDF report

a2-writeup.pdf.

• Problem 1: 1.1.1[Type 2] , 1.2.1[Type 1] , 1.2.2[Type 1]

• Problem 2: 2.1.1[Type 4] , 2.1.2[Type 1] , 2.1.3[Type 1] , 2.2.1[Type 2] , 2.2.2[Type 3] ,

2.2.3[Type 2] , 2.3.1[Type 3] , 2.3.2[Type 3] , 2.3.3[Type 1]

• Problem 3: 3.1[Type 4] , 3.2[Type 3] , 3.3 [Type 1]

Useful prompts

You could start by naively copy-pasting the question and the context as the prompt, and try to

improve the generated answers by trial and error. Raw LaTeX dumps are made available for the

written questions to facilitate this process.

• https://uoft-csc413.github.io/2023/assets/assignments/a2_raw_latex_dump.tex

• https://uoft-csc413.github.io/2023/assets/assignments/a2_macros.tex


1 Optimization

This week, we will continue investigating the properties of optimization algorithms, focusing on

stochastic gradient descent and adaptive gradient descent methods. For a refresher on optimization,

refer to: https://uoft-csc413.github.io/2023/assets/slides/lec03.pdf.

We will continue using the linear regression model established in Homework 1. Given $n$ pairs of input data with $d$ features and scalar labels $(x_i, t_i) \in \mathbb{R}^d \times \mathbb{R}$, we want to find a linear model $f(x) = \hat{w}^\top x$ with $\hat{w} \in \mathbb{R}^d$ such that the squared error on the training data is minimized. Given a data matrix $X \in \mathbb{R}^{n \times d}$ and corresponding labels $t \in \mathbb{R}^n$, the objective function is defined as:

$$\mathcal{L} = \frac{1}{n} \| X \hat{w} - t \|_2^2 \qquad (1)$$

1.1 Mini-Batch Stochastic Gradient Descent (SGD)

Mini-batch SGD performs optimization by taking the average gradient over a mini-batch, denoted $B \in \mathbb{R}^{b \times d}$, where $1 < b \ll n$. Each training example in the mini-batch, denoted $x_j \in B$, is randomly sampled without replacement from the data matrix $X$. Assume that $X$ is full rank. Where $\mathcal{L}$ denotes the loss on $x_j$, the update for a single step of mini-batch SGD at time $t$ with scalar learning rate $\eta$ is:

$$w_{t+1} \leftarrow w_t - \frac{\eta}{b} \sum_{x_j \in B} \nabla_{w_t} \mathcal{L}(x_j, w_t) \qquad (2)$$

Mini-batch SGD iterates by randomly drawing mini-batches and updating the model weights using the above equation until convergence is reached.
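To make update (2) concrete, here is a minimal NumPy sketch of a single mini-batch SGD step for the squared-error loss above (illustrative only, not part of the assignment; the function and variable names are our own). For a batch $X_B, t_B$ of size $b$, the averaged gradient is $\frac{2}{b} X_B^\top (X_B w - t_B)$:

```python
import numpy as np

def sgd_step(w, X_batch, t_batch, lr):
    """One mini-batch SGD step for the loss (1/b) * ||X_B w - t_B||_2^2."""
    b = X_batch.shape[0]
    grad = (2.0 / b) * X_batch.T @ (X_batch @ w - t_batch)  # gradient averaged over the batch
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                   # n = 8 examples, d = 3 features
t = rng.normal(size=8)
w = np.zeros(3)                               # zero initialization, as in Question 1.1.1
batch = rng.choice(8, size=4, replace=False)  # sample a mini-batch without replacement
w = sgd_step(w, X[batch], t[batch], lr=0.1)
```

Observe that each update is a linear combination of the sampled rows of $X$ (cf. the hint in Question 1.1.1).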

1.1.1 Minimum Norm Solution [2pt] [Type 2]

Recall Question 3.3 from Homework 1. For an overparameterized linear model, gradient descent starting from zero initialization finds the unique minimum-norm solution $w^*$ such that $Xw^* = t$. Let $w_0 = 0$ and $d > n$. Assume mini-batch SGD also converges to a solution $\hat{w}$ such that $X\hat{w} = t$. Show that the mini-batch SGD solution is identical to the minimum-norm solution $w^*$ obtained by gradient descent, i.e., $\hat{w} = w^*$.

Hint: What distinguishes the minimum-norm solution from the other solutions? Is each $x_j$ (and hence each mini-batch $B$) contained in the span of the rows of $X$? Do the update steps of mini-batch SGD ever leave the span of $X$?

1.2 Adaptive Methods

We now consider the behavior of adaptive gradient descent methods. In particular, we will investigate the RMSProp method. Let $w_i$ denote the $i$-th parameter. A scalar learning rate $\eta$ is used.


At time $t$ for parameter $i$, the update step for RMSProp is:

$$w_{i,t+1} = w_{i,t} - \frac{\eta}{\sqrt{v_{i,t}} + \epsilon} \nabla_{w_{i,t}} \mathcal{L}(w_{i,t}) \qquad (3)$$

$$v_{i,t} = \beta v_{i,t-1} + (1 - \beta) \left( \nabla_{w_{i,t}} \mathcal{L}(w_{i,t}) \right)^2 \qquad (4)$$

We begin the iteration at $t = 0$ and set $v_{i,-1} = 0$. The term $\epsilon$ is a fixed small scalar used for numerical stability. The decay parameter $\beta$ is typically set such that $\beta \geq 0.9$. Intuitively, RMSProp adapts a separate learning rate in each dimension to efficiently move through poorly conditioned curvature (see lecture slides/notes).
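Equations (3) and (4) translate directly into code. A minimal NumPy sketch (illustrative only; the function and variable names are our own):

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step following equations (3)-(4): update v first, then w."""
    v = beta * v + (1.0 - beta) * grad ** 2   # running average of squared gradients, eq. (4)
    w = w - lr * grad / (np.sqrt(v) + eps)    # per-dimension effective learning rate, eq. (3)
    return w, v

w = np.array([0.0, 0.0])
v = np.zeros_like(w)                          # v_{i,-1} = 0
grad = np.array([4.0, 2.0])
w, v = rmsprop_step(w, v, grad)
```

Note how the magnitude of the very first step is nearly the same in both coordinates even though the gradient components differ; this per-coordinate normalization is the behaviour Question 1.2.1 asks you to analyze.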

1.2.1 Minimum Norm Solution [1pt] [Type 1]

Consider the overparameterized linear model ($d > n$) for the loss function defined in Section 1. Assume the RMSProp optimizer converges to a solution. Provide a proof or counterexample for whether RMSProp always obtains the minimum-norm solution.

Hint: Compute a simple 2D case. Let $x_1 = [2, 1]$, $w_0 = [0, 0]$, $t = [2]$.

1.2.2 [0pt] [Type 1]

Consider the result from the previous section. Does this result hold true for other adaptive methods

(Adagrad, Adam) in general? Why might making learning rates independent per dimension be

desirable?

2 Gradient-based Hyper-parameter Optimization

In this problem, we will implement a simple toy example of gradient-based hyper-parameter optimization, introduced in Lecture 3.

Often in practice, hyper-parameters are chosen by trial and error based on a model evaluation criterion. Instead, gradient-based hyper-parameter optimization computes the gradient of the evaluation criterion w.r.t. the hyper-parameters and uses this gradient to directly optimize for the best set of hyper-parameters. For this problem, we will optimize for the learning rate of gradient descent in a regularized linear regression problem.

Specifically, given $n$ pairs of input data with $d$ features and scalar labels $(x_i, t_i) \in \mathbb{R}^d \times \mathbb{R}$, we wish to find a linear model $f(x) = \hat{w}^\top x$ with $\hat{w} \in \mathbb{R}^d$ and an L2 penalty, $\tilde{\lambda} \|\hat{w}\|_2^2$, that minimizes the squared error of prediction on the training samples. $\tilde{\lambda}$ is a hyperparameter that modulates the impact of the L2 regularization on the loss function. Using the concise notation for the data matrix $X \in \mathbb{R}^{n \times d}$ and the corresponding label vector $t \in \mathbb{R}^n$, the squared error loss can be written as:

$$\tilde{\mathcal{L}} = \frac{1}{n} \| X \hat{w} - t \|_2^2 + \tilde{\lambda} \|\hat{w}\|_2^2 .$$


Starting from initial weight parameters $w_0$, gradient descent (GD) updates $w_0$ with a learning rate $\eta$ for $t$ iterations. Let's denote the weights after $t$ iterations of GD as $w_t$, the loss as $\tilde{\mathcal{L}}_t$, and its gradient as $\nabla_{w_t} \tilde{\mathcal{L}}_t$. The goal is to find the optimal learning rate by following the gradient of $\tilde{\mathcal{L}}_t$ w.r.t. the learning rate $\eta$.
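As an illustration of what "the gradient of the final loss w.r.t. the learning rate" means, the following NumPy sketch unrolls the GD iterations and estimates $\nabla_\eta \tilde{\mathcal{L}}_t$ by central finite differences (in the assignment you will instead reason about backpropagating through the unrolled computation graph; all names and values here are illustrative, not part of the assignment):

```python
import numpy as np

def loss_after_gd(eta, X, t, w0, steps, lam=0.0):
    """Run `steps` GD iterations on the regularized loss, return the final loss."""
    n = X.shape[0]
    w = w0.copy()
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - t) + 2.0 * lam * w
        w = w - eta * grad
    return (1.0 / n) * np.sum((X @ w - t) ** 2) + lam * np.sum(w ** 2)

def dloss_deta(eta, X, t, w0, steps, lam=0.0, h=1e-6):
    """Central-difference estimate of the gradient of the final loss w.r.t. eta."""
    return (loss_after_gd(eta + h, X, t, w0, steps, lam)
            - loss_after_gd(eta - h, X, t, w0, steps, lam)) / (2.0 * h)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
t = np.array([1.0, 0.0])
g = dloss_deta(0.05, X, t, np.zeros(2), steps=2, lam=0.1)  # estimate of d(loss_2)/d(eta)
```

One could then update $\eta$ itself by gradient descent on this quantity; backpropagation through the unrolled steps computes the same gradient exactly.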

2.1 Computation Graph

2.1.1 [0.5pt] [Type 4]

Consider a case of 2 GD iterations. Draw the computation graph to obtain the final loss $\tilde{\mathcal{L}}_2$ in terms of $w_0$, $\nabla_{w_0} \tilde{\mathcal{L}}_0$, $\tilde{\mathcal{L}}_0$, $w_1$, $\tilde{\mathcal{L}}_1$, $\nabla_{w_1} \tilde{\mathcal{L}}_1$, $w_2$, $\tilde{\lambda}$ and $\eta$.

2.1.2 [0.5pt] [Type 1]

Then, consider a case of $t$ iterations of GD. What is the memory complexity of forward propagation in terms of $t$? What is the memory complexity of using standard back-propagation to compute the gradient w.r.t. the learning rate, $\nabla_\eta \tilde{\mathcal{L}}_t$, in terms of $t$?

Hint: Express your answers in big-O notation in terms of $t$.

2.1.3 [0pt] [Type 1]

Explain one potential problem for applying gradient-based hyper-parameter optimization in more

realistic examples where models often take many iterations to converge.

2.2 Optimal Learning Rates

In this section, we will take a closer look at the gradient w.r.t. the learning rate. To simplify the computation for this section, consider an unregularized loss function of the form $\mathcal{L} = \frac{1}{n} \| X \hat{w} - t \|_2^2$. Let's start with the case of only one GD iteration, where GD updates the model weights from $w_0$ to $w_1$.

2.2.1 [1pt] [Type 2]

Write down the expression for $w_1$ in terms of $w_0$, $\eta$, $t$ and $X$. Then use the expression to derive the loss $\mathcal{L}_1$ in terms of $\eta$.

Hint: If the expression gets too messy, introduce a constant vector $a = X w_0 - t$.

2.2.2 [0pt] [Type 3]

Determine whether $\mathcal{L}_1$ is convex w.r.t. the learning rate $\eta$.

Hint: A twice-differentiable function is convex if its second derivative is nonnegative.


2.2.3 [1pt] [Type 2]

Write down the derivative of $\mathcal{L}_1$ w.r.t. $\eta$ and use it to find the optimal learning rate $\eta^*$ that minimizes the loss after one GD iteration. Show your work.

2.3 Weight decay and L2 regularization

Although well studied in statistics, L2 regularization is usually replaced with explicit weight decay in modern neural network architectures:

$$w_{i+1} = (1 - \lambda) w_i - \eta \nabla \mathcal{L}_i(X) \qquad (5)$$

In this question you will compare regularized regression of the form $\tilde{\mathcal{L}} = \frac{1}{n} \| X \hat{w} - t \|_2^2 + \tilde{\lambda} \|\hat{w}\|_2^2$ with the unregularized loss, $\mathcal{L} = \frac{1}{n} \| X \hat{w} - t \|_2^2$, accompanied by weight decay (equation 5).
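For intuition, the two updates being compared can be written side by side as code for a single GD step (an illustrative sketch, not part of the assignment; the names are our own):

```python
import numpy as np

def gd_step_l2(w, X, t, eta, lam_tilde):
    """GD on the L2-regularized loss: w <- w - eta * grad(L_tilde)."""
    n = X.shape[0]
    grad = (2.0 / n) * X.T @ (X @ w - t) + 2.0 * lam_tilde * w
    return w - eta * grad

def gd_step_weight_decay(w, X, t, eta, lam):
    """GD on the unregularized loss with explicit weight decay (equation 5)."""
    n = X.shape[0]
    grad = (2.0 / n) * X.T @ (X @ w - t)
    return (1.0 - lam) * w - eta * grad
```

Writing out the coefficient that multiplies $w$ in each update is one way to approach Question 2.3.2.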

2.3.1 [0.5pt] [Type 3]

Write down two expressions for $w_1$ in terms of $w_0$, $\eta$, $t$, $\lambda$, $\tilde{\lambda}$, and $X$: the first using $\tilde{\mathcal{L}}$, the second using $\mathcal{L}$ with weight decay.

2.3.2 [0.5pt] [Type 3]

How can you express $\tilde{\lambda}$ (corresponding to the L2 loss) so that it is equivalent to $\lambda$ (corresponding to weight decay)?

Hint: Think about how you can express $\tilde{\lambda}$ in terms of $\lambda$ and another hyperparameter.

2.3.3 [0pt] [Type 1]

Adaptive gradient methods like RMSProp (equations 3 and 4) modulate the learning rate for each weight individually. Can you describe how L2 regularization differs from weight decay when adaptive gradient methods are used? In practice, it has been shown that weight decay is more successful than L2 regularization for adaptive gradient methods.

3 Convolutional Neural Networks

The last set of questions aims to build basic familiarity with convolutional neural networks (CNNs).


3.1 Convolutional Filters [0.5pt] [Type 4]

Given the input matrix $I$ and filter $J$ shown below, compute $I * J$, the output of the convolution operation (as defined in Lecture 4). Assume zero padding is used such that the input and output are of the same dimension. What feature does this convolutional filter detect?

$$I = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \qquad J = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} \qquad I * J = \begin{bmatrix} ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \end{bmatrix}$$

3.2 Size of Conv Nets [1pt] [Type 3]

CNNs provide several advantages over fully connected neural networks (FCNNs) when applied to image data. In particular, FCNNs do not scale well to high-dimensional image data, as you will demonstrate below. Consider the following CNN architecture on the left:

The input image has dimension 32 × 32 and is RGB (three channels). For ease of computation, all convolutional layers use 3 × 3 kernels, and the output channels of each layer are given in the images; the number after the hyphen specifies the number of output channels or units of a layer (e.g. a Conv3-64 layer has 64 output channels). Assume zero padding is used in convolutional layers. Each max pooling layer has a filter size of 2 × 2 and a stride of 2. Furthermore, ignore all bias terms.

We consider an alternative architecture, shown on the right, which replaces convolutional layers with fully connected (FC) layers. Assume the fully connected layers do not change the output shape of their inputs. For both the CNN architecture and the FCNN architecture, compute the total number of neurons in the network and the total number of trainable parameters. You should report four numbers in total. Finally, name one disadvantage of having more trainable parameters.


3.3 Receptive Fields [0.5pt] [Type 1]

The receptive field of a neuron in a CNN is the area of the input image that can affect the neuron (i.e. the area the neuron can 'see'). For example, a neuron in a 3 × 3 convolutional layer is computed from a 3 × 3 area of the input, so its receptive field is 3 × 3. However, as we go deeper into the CNN, the receptive field increases. One helpful resource for visualizing receptive fields can be found at: https://distill.pub/2019/computing-receptive-fields/.
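The growth of the receptive field can be computed layer by layer with a standard recurrence: each layer enlarges it by (kernel size − 1) times the product of all earlier strides (see the Distill article linked above). A small sketch, with illustrative layer configurations:

```python
def receptive_field(layers):
    """Receptive field of one output unit, given [(kernel_size, stride), ...] per layer."""
    r, jump = 1, 1  # jump: spacing (in input pixels) between adjacent units at this depth
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)]))                   # single 3x3 conv: 3
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # three stacked 3x3 convs: 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 3x3 conv, 2x2/stride-2 pool, 3x3 conv: 8
```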

List 3 things that can affect the size of the receptive field of a neuron and briefly explain your

answers.


Programming Assignment

What you have to submit for this part

See the top of this handout for submission directions. Here are the requirements.

• The zero point questions (in black below) will not be graded, but you are more than welcome

to include your answers for these as well in the submission.

• For (nonzero-point) writing questions labeled [Type 1] [Type 2] you need to submit (i) the

program trace, (ii) critique of the model output, (iii) your own solution. You will

receive 0 points on the question if you do not include the program trace, even if you solve the

problem by yourself. Make sure you cite the model properly.

• For (nonzero-point) coding questions labeled [Type 1] [Type 2] you need to submit (i) the

LLM program trace (in the writeup PDF), (ii) your own solution.

• For (nonzero-point) questions in [Type 3] [Type 4] you only need to submit your own solution,

but we encourage you to experiment with LLMs on some of them.

• If you attempt the extra credit problems [EC] using LLMs, include your program trace in the

beginning of the writeup document.

For reference, here is everything you need to hand in:

• This is the second half of your PDF report a2-writeup.pdf. Please include the solutions

to the following problems. You may choose to export a2-code.ipynb as a PDF and attach

it to the first half of a2-writeup.pdf. Do not forget to append the following program

traces/screenshots to the end of a2-writeup.pdf:

– Problem 4:

∗ 4.1 [0.5pt] [Type 1]

· program trace

· critique of the model output

· code for model PoolUpsampleNet (screenshot or text)

∗ 4.2 [0.5pt] [Type 4]

· visualizations and your commentary

∗ 4.3 [1.0pt] [Type 3]

· your answer (6 values as function of NIC, NF, NC)

– Problem 5:

∗ 5.1 [0.5pt] [Type 1]


· program trace

· critique of the model output

· code for model ConvTransposeNet (screenshot or text)

∗ 5.2 [0.5pt] [Type 4]

· your answer, 1 plot figure (training/validation curves)

∗ 5.3 [0.5pt] [Type 3]

· your answer

∗ 5.4 [0.5pt] [Type 3]

· your answer

– Problem 6:

∗ 6.1 [0.5pt] [Type 1]

· program trace

· critique of the model output

· code for model UNet (screenshot or text)

∗ 6.2 [0.5pt] [Type 4]

· your answer, 1 plot figure (training/validation curves)

∗ 6.3 [1.0pt] [Type 3]

· your answer

• Your code file a2-code.ipynb

Introduction

This assignment will focus on the applications of convolutional neural networks in various image

processing tasks. The starter code is provided as a Python Notebook on Colab (https://colab.

research.google.com/github/uoft-csc413/2023/blob/master/assets/assignments/a2-code.

ipynb). First, we will train a convolutional neural network for a task known as image colourization.

Given a greyscale image, we will predict the colour at each pixel. This is a difficult problem for

many reasons, one of which being that it is ill-posed: for a single greyscale image, there can be

multiple, equally valid colourings. In the second half of the assignment, we will perform fine-tuning

on a pre-trained object detection model.

Image Colourization as Classification

In this section, we will perform image colourization using three convolutional neural networks

(Figure 1). Given a grayscale image, we wish to predict the color of each pixel. We have provided


a subset of 24 output colours, selected using k-means clustering [4]. The colourization task will be

framed as a pixel-wise classification problem, where we will label each pixel with one of the 24

colours. For simplicity, we measure distance in RGB space. This is not ideal but reduces the

software dependencies for this assignment.

We will use the CIFAR-10 data set, which consists of images of size 32×32 pixels. For most

of the questions we will use a subset of the dataset. The data loading script is included with the

notebooks, and should download automatically the first time it is loaded.

Helper code for Section 4 is provided in a2-code.ipynb, which will define the main training

loop as well as utilities for data manipulation. Run the helper code to setup for this question and

answer the following questions.

4 Pooling and Upsampling

4.1 [0.5pt] [Type 1]

Complete the model PoolUpsampleNet, following the diagram in Figure 1a. Use the PyTorch

layers nn.Conv2d, nn.ReLU, nn.BatchNorm2d [5], nn.Upsample [6], and nn.MaxPool2d. Your CNN

should be configurable by parameters kernel, num in channels, num filters, and num colours.

In the diagram, num in channels, num filters and num colours are denoted NIC, NF and NC

respectively. Use the following parameterizations (if not specified, assume default parameters):

• nn.Conv2d: The number of input filters should match the second dimension of the input

tensor (e.g. the first nn.Conv2d layer has NIC input filters). The number of output filters

should match the second dimension of the output tensor (e.g. the first nn.Conv2d layer has

NF output filters). Set kernel size to parameter kernel. Set padding to the padding variable

included in the starter code.

• nn.BatchNorm2d: The number of features should match the second dimension of the output

tensor (e.g. the first nn.BatchNorm2d layer has NF features).

• nn.Upsample: Use scaling factor = 2.

• nn.MaxPool2d: Use kernel size = 2.

Note: grouping layers according to the diagram (those not separated by white space) using nn.Sequential containers will aid the implementation of the forward method.

[4] https://en.wikipedia.org/wiki/K-means_clustering
[5] https://gauthamkumaran.com/batchnormalization/amp/
[6] https://machinethink.net/blog/coreml-upsampling/


(a) PoolUpsampleNet:
Image → [BS, NIC, 32, 32]
Conv2d, BatchNorm2d, ReLU, MaxPool2d → [BS, NF, 16, 16]
Conv2d, BatchNorm2d, ReLU, MaxPool2d → [BS, 2NF, 8, 8]
Conv2d, BatchNorm2d, ReLU, Upsample → [BS, NF, 16, 16]
Conv2d, BatchNorm2d, ReLU, Upsample → [BS, NC, 32, 32]
Conv2d → [BS, NC, 32, 32]

(b) ConvTransposeNet:
Image → [BS, NIC, 32, 32]
Conv2d, BatchNorm2d, ReLU → [BS, NF, 16, 16]
Conv2d, BatchNorm2d, ReLU → [BS, 2NF, 8, 8]
ConvTranspose2d, BatchNorm2d, ReLU → [BS, NF, 16, 16]
ConvTranspose2d, BatchNorm2d, ReLU → [BS, NC, 32, 32]
Conv2d → [BS, NC, 32, 32]

(c) UNet:
Image → [BS, NIC, 32, 32]
Conv2d, BatchNorm2d, ReLU → [BS, NF, 16, 16]
Conv2d, BatchNorm2d, ReLU → [BS, 2NF, 8, 8]
ConvTranspose2d, BatchNorm2d, ReLU; concatenate with the first block's output → [BS, NF + NF, 16, 16]
ConvTranspose2d, BatchNorm2d, ReLU; concatenate with the input image → [BS, NIC + NC, 32, 32]
Conv2d → [BS, NC, 32, 32]

Figure 1: Three network architectures that we will be using for image colourization, shown here as layer sequences. Numbers inside square brackets denote the shape of the tensor produced by each block: BS: batch size, NIC: num in channels, NF: num filters, NC: num colours.


4.2 [0.5pt] [Type 4]

Run the main training loop of PoolUpsampleNet. This will train the CNN for a few epochs using the

cross-entropy objective. It will generate some images showing the trained result at the end. Do

these results look good to you? Why or why not?

4.3 [1.0pt] [Type 3]

Compute the number of weights, outputs, and connections in the model, as a function of NIC, NF

and NC. Compute these values when each input dimension (width/height) is doubled. Report all

6 values.

Note:

1. Please ignore biases when answering the questions.

2. Please ignore nn.BatchNorm2d when counting the number of weights, outputs, and connections, but we will still accept answers that include it.

Hint:

1. nn.Upsample does not have parameters (this will help you answer the number of weights).

2. Think about when the input width and height are both doubled, how will the dimension of

feature maps in each layer change? If you know this, you will know how dimension scaling

will affect the number of outputs and connections.

5 Strided and Transposed Dilated Convolutions [2 pts]

For this part, instead of using nn.MaxPool2d layers to reduce the dimensionality of the tensors, we will increase the stride of the preceding nn.Conv2d layers, and instead of using nn.Upsample layers to increase the dimensionality of the tensors, we will use transposed convolutions. Transposed convolutions aim to apply the same operations as convolutions, but in the opposite direction. For example, while increasing the stride from 1 to 2 in a convolution forces the filters to skip over every other position as they slide across the input tensor, increasing the stride from 1 to 2 in a transposed convolution adds "empty" space around each element of the input tensor, as if reversing the skipping over every other position done by the convolution. We will be using a dilation rate of 1 for the transposed convolution. Excellent visualizations of convolutions and transposed convolutions have been developed by Dumoulin and Visin [2018] and can be found on their GitHub page [7].

[7] https://github.com/vdumoulin/conv_arithmetic
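The tensor shapes in Figure 1b can be checked with the one-dimensional output-size formulas from the PyTorch nn.Conv2d and nn.ConvTranspose2d documentation. A pure-Python sketch (the helper names are our own):

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output size along one spatial dimension of nn.Conv2d (PyTorch docs formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0, dilation=1):
    """Output size along one spatial dimension of nn.ConvTranspose2d (PyTorch docs formula)."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

# Section 5 settings with kernel size 3:
print(conv_out(32, kernel=3, stride=2, padding=1))                              # 32 -> 16
print(conv_transpose_out(16, kernel=3, stride=2, padding=1, output_padding=1))  # 16 -> 32
```

Experimenting with kernel sizes 4 and 5 in these formulas is one way to sanity-check your answer to Question 5.4.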


5.1 [0.5pt] [Type 1]

Complete the model ConvTransposeNet, following the diagram in Figure 1b. Use the PyTorch

layers nn.Conv2d, nn.ReLU, nn.BatchNorm2d and nn.ConvTranspose2d. As before, your CNN

should be configurable by parameters kernel, dilation, num in channels, num filters, and

num colours. Use the following parameterizations (if not specified, assume default parameters):

• nn.Conv2d: The number of input and output filters, and the kernel size, should be set in the

same way as Section 4. For the first two nn.Conv2d layers, set stride to 2 and set padding

to 1.

• nn.BatchNorm2d: The number of features should be specified in the same way as for Section

4.

• nn.ConvTranspose2d: The number of input filters should match the second dimension of the

input tensor. The number of output filters should match the second dimension of the output

tensor. Set kernel size to parameter kernel. Set stride to 2, set dilation to 1, and set

both padding and output padding to 1.

5.2 [0.5pt] [Type 4]

Train the model for at least 25 epochs using a batch size of 100 and a kernel size of 3. Plot the

training curve, and include this plot in your write-up.

5.3 [0.5pt] [Type 3]

How do the results compare to Section 4? Does the ConvTransposeNet model result in lower

validation loss than the PoolUpsampleNet? Why may this be the case?

5.4 [0.5pt] [Type 3]

How would the padding parameter passed to the first two nn.Conv2d layers, and the padding and

output padding parameters passed to the nn.ConvTranspose2d layers, need to be modified if we

were to use a kernel size of 4 or 5 (assuming we want to maintain the shapes of all tensors shown

in Figure 1b)? Note: the PyTorch documentation for nn.Conv2d [8] and nn.ConvTranspose2d [9] includes equations that can be used to calculate the shape of the output tensors given the parameters.

[8] https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
[9] https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html


5.5 [0pt] [Type 4]

Re-train a few more ConvTransposeNet models using different batch sizes (e.g., 32, 64, 128, 256,

512) with a fixed number of epochs. Describe the effect of batch sizes on the training/validation

loss, and the final image output quality. You do not need to attach the final output images.

6 Skip Connections

A skip connection in a neural network is a connection that skips one or more layers and feeds into a later layer. We will introduce skip connections to the model we implemented in Section 5.
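Such skip connections merge feature maps by concatenating along the channel dimension. A NumPy sketch of the shape arithmetic (torch.cat(tensors, dim=1) behaves analogously; the sizes below are illustrative, with BS = 4 and NF = 32):

```python
import numpy as np

decoder = np.zeros((4, 32, 16, 16))  # [BS, NF, 16, 16] from the decoder path
skip    = np.zeros((4, 32, 16, 16))  # [BS, NF, 16, 16] saved from the encoder path
merged  = np.concatenate([decoder, skip], axis=1)  # channel dim -> [BS, NF + NF, 16, 16]
print(merged.shape)  # (4, 64, 16, 16)
```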

6.1 [0.5pt] [Type 1]

Add a skip connection from the first layer to the last, second layer to the second last, etc. That

is, the final convolution should have both the output of the previous layer and the initial greyscale

input as input (see Figure 1c). This type of skip connection was introduced by Ronneberger et al. [2015], and the resulting architecture is called a "UNet". Following the ConvTransposeNet class that you have completed, complete the __init__ and forward methods of the UNet class in Section 6 of the notebook. Hint:

You will need to use the function torch.cat.

6.2 [0.5pt] [Type 4]

Train the model for at least 25 epochs using a batch size of 100 and a kernel size of 3. Plot the

training curve, and include this plot in your write-up.

6.3 [1.0pt] [Type 3]

How does the result compare to the previous model? Did skip connections improve the validation

loss and accuracy? Did the skip connections improve the output qualitatively? How? Give at least

two reasons why skip connections might improve the performance of our CNN models.


References

Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning, 2018.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
