Homework 3

600.482/682 Deep Learning

Please submit a LaTeX-generated PDF to Gradescope with entry code 9G83Y7.

1. We have discussed backpropagation in class. Here is supplementary material on calculating the gradient for backpropagation (https://piazza.com/class_profile/get_resource/jxcftju833c25t/k0labsf3cny4qw). Please study this material carefully before you start this exercise. Suppose $P = WX$ and $L = f(P)$, where $L$ is a loss function.

(a) Please show that $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial P} X^T$. Show each step of your derivation.
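As a hint (a sketch only; filling in the intermediate steps is part of the exercise), one possible starting point is the element-wise chain rule applied to each entry $W_{ij}$:
\[
\frac{\partial L}{\partial W_{ij}} \;=\; \sum_{k,l} \frac{\partial L}{\partial P_{kl}} \, \frac{\partial P_{kl}}{\partial W_{ij}}, \qquad \text{where } P_{kl} = \sum_m W_{km} X_{ml}.
\]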

(b) Suppose the loss function is the L2 loss, defined as $L(y, \hat{y}) = \|y - \hat{y}\|^2$, where $y$ is the ground truth and $\hat{y}$ is the prediction. Given the following initialization of $W$ and $X$, please calculate the updated $W$ after one iteration of gradient descent (step size = 0.1).

\[
W = \begin{pmatrix} 0.3 & 0.5 \\ -0.2 & 0.4 \end{pmatrix}, \qquad
X = \begin{pmatrix} x_1 & x_2 \end{pmatrix} = \begin{pmatrix} 0 & 2 \\ 3 & 1 \end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1 & y_2 \end{pmatrix} = \begin{pmatrix} 0.5 & 1 \\ 1 & -1.5 \end{pmatrix}
\]
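If you want to sanity-check your hand computation, a minimal NumPy sketch of one gradient-descent step is given below. It assumes the squared-norm convention above (no $\frac{1}{2}$ factor), so that $\partial L / \partial P = 2(P - Y)$; if your derivation uses a different scaling, adjust accordingly.

\begin{verbatim}
import numpy as np

W = np.array([[0.3, 0.5], [-0.2, 0.4]])
X = np.array([[0.0, 2.0], [3.0, 1.0]])
Y = np.array([[0.5, 1.0], [1.0, -1.5]])

P = W @ X                # forward pass: P = WX
dL_dP = 2.0 * (P - Y)    # gradient of ||Y - P||^2 w.r.t. P
dL_dW = dL_dP @ X.T      # chain rule from part (a)
W_new = W - 0.1 * dL_dW  # one gradient-descent step

print(W_new)
\end{verbatim}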

2. In this exercise, we will explore how vanishing and exploding gradients affect the learning process. Consider a simple, 1-dimensional, 3-layer network with data $x \in \mathbb{R}$, prediction $\hat{y} \in [0, 1]$, true label $y \in \{0, 1\}$, and weights $w_1, w_2, w_3 \in \mathbb{R}$, where each weight is initialized randomly via $w_i \sim \mathcal{N}(0, 1)$. We will use the sigmoid activation function $\sigma$ between all layers, and the cross-entropy loss function $L(y, \hat{y}) = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}))$. This network can be represented as $\hat{y} = \sigma(w_3 \cdot \sigma(w_2 \cdot \sigma(w_1 \cdot x)))$. Note that for this problem, we are not including a bias term.

(a) Compute the derivative of the sigmoid. What are the values of the extrema of this derivative, and where are they attained?
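After deriving the closed form by hand, you can check it numerically. The sketch below (an optional check, not part of the required answer) approximates the sigmoid's derivative with central finite differences on a grid and reports where it peaks:

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Central finite-difference approximation of the sigmoid's
# derivative on a grid of inputs.
xs = np.linspace(-10, 10, 2001)
h = 1e-5
d_approx = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)

print("max derivative ~", d_approx.max(),
      "at x ~", xs[np.argmax(d_approx)])
\end{verbatim}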

(b) Consider a random initialization of $w_1 = 0.25$, $w_2 = -0.11$, $w_3 = 0.78$, and a sample from the data set ($x = 0.63$, $y = 1$). Using backpropagation, compute the gradients for each weight. What do you notice about the magnitude of the gradients?
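To check your backpropagation result, a finite-difference sketch such as the one below can approximate each gradient numerically. It assumes the cross-entropy network defined above; it is a check, not a substitute for the derivation.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Forward pass: y_hat = sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x)))
    y_hat = sigmoid(w[2] * sigmoid(w[1] * sigmoid(w[0] * x)))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w = np.array([0.25, -0.11, 0.78])
x, y = 0.63, 1.0

# Central finite differences: perturb one weight at a time.
h = 1e-6
for i in range(3):
    wp, wm = w.copy(), w.copy()
    wp[i] += h
    wm[i] -= h
    print(f"dL/dw{i+1} ~", (loss(wp, x, y) - loss(wm, x, y)) / (2 * h))
\end{verbatim}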

Now consider that we want to switch to a regression task using a similar network structure to the one above: we remove the final sigmoid activation, so our new network is defined as $\hat{y} = w_3 \cdot \sigma(w_2 \cdot \sigma(w_1 \cdot x))$, where predictions $\hat{y} \in \mathbb{R}$ and targets $y \in \mathbb{R}$, and we use the L2 loss function instead of cross entropy: $L(y, \hat{y}) = (y - \hat{y})^2$. Derive the gradient of the loss function with respect to each of the weights $w_1, w_2, w_3$.

(c) Consider again the random initialization of $w_1 = 0.25$, $w_2 = -0.11$, $w_3 = 0.78$, and a sample from the data set ($x = 0.63$, $y = 128$). Using backpropagation, compute the gradients for each weight. What do you notice about the magnitude of the gradients?
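The same finite-difference check adapts to the regression setup; only the forward pass and the loss change (again a sketch, assuming the network and L2 loss defined above):

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(w, x, y):
    # Regression network: no sigmoid on the output layer.
    y_hat = w[2] * sigmoid(w[1] * sigmoid(w[0] * x))
    return (y - y_hat) ** 2

w = np.array([0.25, -0.11, 0.78])
x, y = 0.63, 128.0

# Central finite differences, as in part (b).
h = 1e-6
for i in range(3):
    wp, wm = w.copy(), w.copy()
    wp[i] += h
    wm[i] -= h
    print(f"dL/dw{i+1} ~",
          (l2_loss(wp, x, y) - l2_loss(wm, x, y)) / (2 * h))
\end{verbatim}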