Homework 2

600.482/682 Deep Learning

Please submit a report (LaTeX-generated PDF) and the notebook as a Python file (File → Download .py) to Gradescope with entry code 9G83Y7 (submit the code as a programming assignment).

1. The goal of this problem is to minimize a function given a certain input using gradient

descent by breaking down the overall function into smaller components via a computation

graph. The function is defined as:

\[
f(x_1, x_2, w_1, w_2) = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2)}} + 0.5\,(w_1^2 + w_2^2).
\]

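(a) Please calculate $\frac{\partial f}{\partial w_1}$, $\frac{\partial f}{\partial w_2}$, $\frac{\partial f}{\partial x_1}$, $\frac{\partial f}{\partial x_2}$.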

Solution:

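\[
\begin{aligned}
\frac{\partial f}{\partial w_1} &= \frac{x_1 \, e^{-(w_1 x_1 + w_2 x_2)}}{\left(1 + e^{-(w_1 x_1 + w_2 x_2)}\right)^2} + w_1 \\
\frac{\partial f}{\partial w_2} &= \frac{x_2 \, e^{-(w_1 x_1 + w_2 x_2)}}{\left(1 + e^{-(w_1 x_1 + w_2 x_2)}\right)^2} + w_2 \\
\frac{\partial f}{\partial x_1} &= \frac{w_1 \, e^{-(w_1 x_1 + w_2 x_2)}}{\left(1 + e^{-(w_1 x_1 + w_2 x_2)}\right)^2} \\
\frac{\partial f}{\partial x_2} &= \frac{w_2 \, e^{-(w_1 x_1 + w_2 x_2)}}{\left(1 + e^{-(w_1 x_1 + w_2 x_2)}\right)^2}
\end{aligned}
\]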
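The four partial derivatives can be sanity-checked numerically. The sketch below compares them with central finite differences at the initialization given in part (b). The helper names `analytic_grad` and `numeric_grad` and the step `h` are illustrative choices, not part of the assignment.

```python
import math

def f(x1, x2, w1, w2):
    """f = sigmoid(w1*x1 + w2*x2) + 0.5 * (w1^2 + w2^2)."""
    z = w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-z)) + 0.5 * (w1 ** 2 + w2 ** 2)

def analytic_grad(x1, x2, w1, w2):
    """Closed-form partial derivatives, ordered (w1, w2, x1, x2)."""
    e = math.exp(-(w1 * x1 + w2 * x2))
    s = e / (1.0 + e) ** 2  # derivative of the sigmoid term w.r.t. its argument
    return (x1 * s + w1, x2 * s + w2, w1 * s, w2 * s)

def numeric_grad(x1, x2, w1, w2, h=1e-6):
    """Central finite differences, ordered (w1, w2, x1, x2)."""
    args = [x1, x2, w1, w2]
    grads = []
    for i in (2, 3, 0, 1):  # positions of w1, w2, x1, x2 in args
        hi = list(args)
        lo = list(args)
        hi[i] += h
        lo[i] -= h
        grads.append((f(*hi) - f(*lo)) / (2 * h))
    return tuple(grads)

# Evaluation point taken from part (b).
point = dict(x1=0.2, x2=0.4, w1=0.3, w2=-0.5)
ga = analytic_grad(**point)
gn = numeric_grad(**point)
max_abs_diff = max(abs(a - n) for a, n in zip(ga, gn))
print("analytic:", ga)
print("max |analytic - numeric|:", max_abs_diff)
```

A small `max_abs_diff` suggests the closed-form expressions are consistent with the function definition.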

(b) Start with the following initialization: w1 = 0.3, w2 = −0.5, x1 = 0.2, x2 = 0.4, draw

the computation graph. Please use backpropagation as we did in class.

You can draw the graph on paper and insert a photo into your report.

The goal is for you to practice working with computation graphs. As a consequence,

you must include the intermediate values during the forward and backward pass.

Solution:

The computation graph is shown below. All numbers above the lines are values in the forward pass. All numbers below the lines are values in the backward pass.
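The intermediate values can be reproduced with a short script. This is a minimal sketch that assumes one possible decomposition of f into elementary nodes (multiply, add, negate, exp, reciprocal, square). The graph drawn in the report may group the operations differently, and the node names below are made up for illustration.

```python
import math

# Inputs and weights from part (b).
x1, x2, w1, w2 = 0.2, 0.4, 0.3, -0.5

# Forward pass, one elementary node at a time.
a = w1 * x1          # multiply
b = w2 * x2          # multiply
z = a + b            # add
n = -z               # negate
e = math.exp(n)      # exponential
d = 1.0 + e          # add constant 1
s = 1.0 / d          # reciprocal: the sigmoid output
q1 = w1 ** 2         # square
q2 = w2 ** 2         # square
r = 0.5 * (q1 + q2)  # regularization term
f = s + r            # final output

# Backward pass: multiply the upstream gradient by each local derivative.
df = 1.0
ds = df                     # through f = s + r
dr = df
dd = ds * (-1.0 / d ** 2)   # through s = 1 / d
de = dd                     # through d = 1 + e
dn = de * e                 # through e = exp(n)
dz = -dn                    # through n = -z
da = dz                     # through z = a + b
db = dz
dq = 0.5 * dr               # through r = 0.5 * (q1 + q2)
dw1 = da * x1 + dq * 2.0 * w1   # w1 feeds both a and q1
dw2 = db * x2 + dq * 2.0 * w2   # w2 feeds both b and q2
dx1 = da * w1
dx2 = db * w2

print("forward:", z, s, r, f)
print("backward:", dw1, dw2, dx1, dx2)
```

Because w1 and w2 each feed two nodes, their gradients are the sum of the contributions from both paths.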


(c) Implement the above computation graph in the accompanying Colab Notebook using

numpy. Use the values of (b) to initialize the weights and fix the input. Use a constant

step size of 0.01. Plot the weight values w1 and w2 for 30 iterations in a single figure in

the report.

Solution:
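A minimal NumPy sketch of the update loop described in (c) might look as follows. It is not the notebook's code: the function name `forward_backward` and the history variables are assumptions made for illustration, and the gradient uses the identity σ'(z) = σ(z)(1 − σ(z)), which equals the e^{-z}/(1 + e^{-z})^2 factor from part (a).

```python
import numpy as np

def forward_backward(w, x):
    """Return f(x, w) and its gradient with respect to w."""
    z = np.dot(w, x)
    s = 1.0 / (1.0 + np.exp(-z))       # sigmoid term
    f = s + 0.5 * np.dot(w, w)         # add the 0.5 * (w1^2 + w2^2) term
    grad_w = s * (1.0 - s) * x + w     # sigmoid'(z) * x + w
    return f, grad_w

x = np.array([0.2, 0.4])     # fixed input from part (b)
w = np.array([0.3, -0.5])    # initial weights from part (b)
step_size = 0.01
num_iters = 30

w_history = [w.copy()]
f_history = []
for _ in range(num_iters):
    f, grad_w = forward_backward(w, x)
    f_history.append(f)
    w = w - step_size * grad_w         # gradient descent step
    w_history.append(w.copy())

w_history = np.array(w_history)        # shape (31, 2): initial value plus 30 updates
print(w_history[-1], f_history[-1])
```

The two columns of `w_history` can then be plotted against the iteration index in a single figure, for example with matplotlib.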

2. The goal of this problem is to understand the classification ability of a neural network.

Specifically, we consider the XOR problem. Go to the link in footnote 1 and answer the following questions. Hint: click the reset button next to the run button after you change the architecture.

(a) Can a linear classifier, without any hidden layers, solve the XOR problem?

Solution: No. With no hidden layer, the classifier can only separate the data with a single line. The data in the XOR problem cannot be divided correctly by a single line.
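One way to make this precise is a short contradiction argument. It assumes, for illustration, that the four XOR quadrants are represented by the points $(\pm 1, \pm 1)$, that points with $x_1 x_2 > 0$ form the positive class, and that the classifier predicts the sign of $w_1 x_1 + w_2 x_2 + b$. A correct linear classifier would have to satisfy

\[
\begin{aligned}
(1, 1):&\quad w_1 + w_2 + b > 0, \\
(-1, -1):&\quad -w_1 - w_2 + b > 0, \\
(1, -1):&\quad w_1 - w_2 + b < 0, \\
(-1, 1):&\quad -w_1 + w_2 + b < 0.
\end{aligned}
\]

Adding the first two inequalities gives $2b > 0$, while adding the last two gives $2b < 0$. No choice of $w_1$, $w_2$, $b$ satisfies both, so no single line classifies all four points correctly.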

1 https://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.01&regularizationRate=0&noise=0&networkShape=&seed=0.10699&showTestData=false&discretize=true&percTrainData=80&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false


(b) With one hidden layer and ReLU(x) = max(0, x), how many neurons in the hidden

layer do you need to solve the XOR problem? Describe the training loss and estimated

prediction accuracy when using 2, 3 and 4 neurons. Discuss the intuition of why a

certain number of neurons is necessary to solve XOR.

Solution:

When using 2 neurons, the training loss is 0.268 and the estimated prediction accuracy is 78/100 = 0.78. The picture is shown below.

When using 3 neurons, the training loss is 0.260 and the estimated prediction accuracy is 73/100 = 0.73. The picture is shown below.

When using 4 neurons, the training loss is 0.002 and the estimated prediction accuracy is 100/100 = 1.00. The picture is shown below.


I think that x1 and x2 each have 2 states, so there are $2^2 = 4$ combinations. Since a layer of neurons can only perform one transformation, we need 4 neurons to represent the 4 cases of x1 XOR x2. Therefore, we can use the four neurons in the hidden layer to make the right prediction.
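To illustrate how four ReLU neurons can cover the four cases, here is a hand-constructed network, offered as a sketch rather than as the weights the playground actually learns. It assumes the label is +1 when x1 and x2 have the same sign and -1 otherwise. The hidden layer computes ReLU(±(x1 + x2)) and ReLU(±(x1 − x2)), so the output equals |x1 + x2| − |x1 − x2|, which has the same sign as x1·x2. The construction shows only that four neurons are sufficient.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

# Hand-picked weights (an illustrative assumption, not trained values).
W1 = np.array([[ 1.0,  1.0],    # relu(x1 + x2)
               [-1.0, -1.0],    # relu(-(x1 + x2))
               [ 1.0, -1.0],    # relu(x1 - x2)
               [-1.0,  1.0]])   # relu(-(x1 - x2))
w2 = np.array([1.0, 1.0, -1.0, -1.0])

def predict(X):
    """Return +1 or -1 for each row of X using the 4-neuron ReLU network."""
    H = relu(X @ W1.T)           # hidden activations, shape (n, 4)
    score = H @ w2               # equals |x1 + x2| - |x1 - x2|
    return np.where(score > 0, 1, -1)

# Random test points with XOR-style labels: +1 if the signs agree, else -1.
rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=(1000, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
accuracy = np.mean(predict(X) == y)
print("accuracy:", accuracy)
```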

3. In this problem, we want to build a neural network from scratch using NumPy for a real-world problem. We consider the MNIST dataset (http://yann.lecun.com/exdb/mnist/), a hand-written digit classification dataset. Please follow the formula in the accompanying Colab Notebook. Hint: Make sure you pass the loss and gradient check in the notebook.

(a) Implement the loss and gradient of a linear classifier (Python function linear_classifier_forward_and_backward).

(b) Implement the loss and gradient of a multilayer perceptron with one hidden layer and ReLU(x) = max(0, x) (Python function mlp_single_hidden_forward_and_backward).

(c) Implement the loss and gradient of a multilayer perceptron with two hidden layers, a skip connection and ReLU(x) = max(0, x) (Python function mlp_two_hidden_forward_and_backward).
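As a rough illustration of what a loss-and-gradient function and its gradient check can look like, the sketch below implements a one-hidden-layer ReLU network with a softmax cross-entropy loss and compares the analytic gradients with finite differences. The loss choice, the parameter layout, the function names and the random toy data are all assumptions. The notebook's own formulas and signatures take precedence.

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Mean cross-entropy loss and its gradient with respect to the scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = scores.shape[0]
    loss = -log_probs[np.arange(n), y].mean()
    dscores = np.exp(log_probs)          # softmax probabilities
    dscores[np.arange(n), y] -= 1.0      # subtract the one-hot targets
    return loss, dscores / n

def mlp_forward_backward(params, X, y):
    """Loss and gradients for X -> linear -> ReLU -> linear -> softmax."""
    W1, b1, W2, b2 = params
    pre = X @ W1 + b1
    h = np.maximum(0.0, pre)             # ReLU(x) = max(0, x)
    scores = h @ W2 + b2
    loss, dscores = softmax_cross_entropy(scores, y)
    dW2 = h.T @ dscores
    db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dpre = dh * (pre > 0)                # ReLU passes gradient only where pre > 0
    dW1 = X.T @ dpre
    db1 = dpre.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

def numeric_grad(params, X, y, h=1e-5):
    """Central finite-difference gradient for every parameter entry."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(*p.shape):
            old = p[idx]
            p[idx] = old + h
            loss_plus, _ = mlp_forward_backward(params, X, y)
            p[idx] = old - h
            loss_minus, _ = mlp_forward_backward(params, X, y)
            p[idx] = old                 # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * h)
        grads.append(g)
    return grads

# Tiny random problem (not MNIST): 5 samples, 4 features, 6 hidden units, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
params = [rng.normal(scale=0.5, size=(4, 6)), np.zeros(6),
          rng.normal(scale=0.5, size=(6, 3)), np.zeros(3)]

loss, grads = mlp_forward_backward(params, X, y)
num_grads = numeric_grad(params, X, y)
max_err = max(np.abs(g - n).max() for g, n in zip(grads, num_grads))
print("loss:", loss, "max gradient error:", max_err)
```

Dropping the hidden layer gives the linear classifier case. A two-hidden-layer variant with a skip connection would add one more layer and sum the gradients that arrive along both paths.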

(d) Plot the development accuracy for each epoch of the three models in a single figure using

the following hyperparameters: the batch size is 50, the learning rate is 0.005 and the

number of epochs is 20.

Solution:


(e) Try using other hyperparameters and select the best set of hyperparameters using development accuracy. Once you pick the best model and hyperparameters, add its development accuracy for each epoch to the above figure (make a new figure) and report the test accuracy of the selected model and hyperparameters.

Solution: The best hyperparameters I have found so far are BS = 100, LR = 0.01, NB_EPOCH = 20. The development accuracy is 97.30%, slightly higher than the 97.29% development accuracy of the original MLP with 2 hidden layers.

The picture is shown below:

The test accuracy is 97.18%.
