Exercise II

AMTH/CPSC 663

Compress your solutions into a single zip file titled <lastname and initials>_assignment2.zip, e.g. for a student named Tom Marvolo Riddle, riddletm_assignment2.zip. Include a single PDF titled <lastname and initials>_assignment2.pdf and any Python scripts specified. Any requested plots should be sufficiently labeled for full points.

Programming assignments should use built-in functions in Python and TensorFlow; in general, you may use the scipy stack [1]. However, the exercises are designed to emphasize the nuances of machine learning and deep learning algorithms, so if a function exists that trivially solves an entire problem, please consult with the TA before using it.

Problem 1

1. Provide a geometric interpretation of gradient descent in the one-dimensional case. (Adapted from the Nielsen book, chapter 1)

2. An extreme version of gradient descent is to use a mini-batch size of just 1. This procedure is known as online or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning compared to stochastic gradient descent with a mini-batch size of, say, 20. (Adapted from the Nielsen book, chapter 1)

3. Create a network that classifies the MNIST data set using only 2 layers: the input layer (784 neurons) and the output layer (10 neurons). Train the network using stochastic gradient descent. What accuracy do you achieve? You can adapt the code from the Nielsen book, but make sure you understand each step of building up the network. Please save your code as prob1.py. (Adapted from the Nielsen book, chapter 1)
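As a starting point, such a two-layer network can be sketched in plain numpy. This is only an illustrative skeleton, not the full prob1.py solution: the class name and the synthetic single-example training in the usage below stand in for the real MNIST loading, mini-batching, and evaluation loop that the assignment still requires.

```python
import numpy as np

class TwoLayerNet:
    """Input layer (784 neurons) connected directly to a sigmoid output layer (10)."""

    def __init__(self, n_in=784, n_out=10, seed=0):
        rng = np.random.default_rng(seed)
        # Scale initial weights by 1/sqrt(n_in) to keep weighted inputs moderate
        self.w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        self.b = np.zeros(n_out)

    @staticmethod
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        # Activation of the output layer: a = sigma(w x + b)
        return self.sigmoid(self.w @ x + self.b)

    def sgd_step(self, x, y, eta):
        # Quadratic cost: delta = (a - y) * sigma'(z), with sigma'(z) = a(1 - a)
        a = self.forward(x)
        delta = (a - y) * a * (1.0 - a)
        # Gradient descent update for a single training example
        self.w -= eta * np.outer(delta, x)
        self.b -= eta * delta

    def predict(self, x):
        return int(np.argmax(self.forward(x)))
```

With real MNIST data, `sgd_step` would be applied over shuffled mini-batches of (image, one-hot label) pairs rather than the single fabricated example used here for testing.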


Problem 2

1. Alternate presentation of the equations of backpropagation (Nielsen book, chapter 2): Show that δ^L = ∇_a C ⊙ σ′(z^L) can be written as δ^L = Σ′(z^L) ∇_a C, where Σ′(z^L) is a square matrix whose diagonal entries are the values σ′(z^L_j) and whose off-diagonal entries are zero.

2. Show that δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l) can be rewritten as δ^l = Σ′(z^l) (w^{l+1})^T δ^{l+1}.

3. By combining the results from problems 2.1 and 2.2, show that δ^l = Σ′(z^l) (w^{l+1})^T ⋯ Σ′(z^{L−1}) (w^L)^T Σ′(z^L) ∇_a C.
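Before proving the identity in 2.2 algebraically, it can help to confirm it numerically on random data. In this sketch the layer sizes, random seed, and the choice of sigmoid for σ are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
sigma_prime = lambda z: sigma(z) * (1.0 - sigma(z))

n_l, n_next = 4, 3
z_l = rng.normal(size=n_l)                # weighted inputs at layer l
w_next = rng.normal(size=(n_next, n_l))   # weights into layer l+1
delta_next = rng.normal(size=n_next)      # error at layer l+1

# Hadamard form: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
hadamard = (w_next.T @ delta_next) * sigma_prime(z_l)

# Matrix form: delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1},
# with Sigma'(z^l) the diagonal matrix of sigma'(z^l_j) values
matrix = np.diag(sigma_prime(z_l)) @ w_next.T @ delta_next

assert np.allclose(hadamard, matrix)
```

The check passes because multiplying a vector by a diagonal matrix is exactly an elementwise (Hadamard) product with the diagonal, which is the content of the proof being asked for.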

4. Backpropagation with linear neurons (Nielsen book, chapter 2): Suppose we replace the usual non-linear σ function (sigmoid) with σ(z) = z throughout the network. Rewrite the backpropagation algorithm for this case.
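As a starting point (a sketch of the key simplification, not the full requested rewrite), note how the error equations from Problem 2 collapse when σ′(z) = 1 everywhere:

```latex
% With \sigma(z) = z we have \sigma'(z) = 1, so \Sigma'(z^l) = I and the
% error equations become purely linear maps:
\delta^L = \nabla_a C,
\qquad
\delta^l = (w^{l+1})^T \delta^{l+1}.
% The gradient equations keep their usual form:
\frac{\partial C}{\partial b^l_j} = \delta^l_j,
\qquad
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j.
```

The remaining steps of the algorithm (feedforward, backward pass, update) follow the same pattern with these simplified equations substituted in.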

Figure 1: Simple neural network with initial weights and biases.

Problem 3

1. It can be difficult at first to remember the respective roles of the ys and the as for cross-entropy. It's easy to get confused about whether the right form is −[y ln a + (1 − y) ln(1 − a)] or −[a ln y + (1 − a) ln(1 − y)]. What happens to the second of these expressions when y = 0 or 1? Does this problem afflict the first expression? Why or why not? (Nielsen book, chapter 3)
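The behavior in question can be probed numerically before arguing it analytically. The particular values of a and y below are arbitrary illustrations, and the written answer still needs to explain *why* the two forms differ:

```python
import numpy as np

a = 0.7  # some network activation strictly between 0 and 1
y = 0.0  # a binary label, exactly 0

# Second form: -[a ln y + (1 - a) ln(1 - y)]
with np.errstate(divide="ignore"):
    second = -(a * np.log(y) + (1 - a) * np.log(1 - y))

# First form: -[y ln a + (1 - y) ln(1 - a)]
first = -(y * np.log(a) + (1 - y) * np.log(1 - a))

print(np.isinf(second))   # True: ln 0 blows up when y is exactly 0 (or 1)
print(np.isfinite(first)) # True: first form stays finite for 0 < a < 1
```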

2. Show that the cross-entropy is still minimized when σ(z) = y for all training inputs (i.e. even when y ∈ (0, 1)). When this is the case the cross-entropy has the value C = −(1/n) Σ_x [y ln y + (1 − y) ln(1 − y)]. (Nielsen book, chapter 3)
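One possible route (a sketch of the calculus step only): treat the per-input cost as a function of the activation a = σ(z) and examine where its derivative vanishes.

```latex
C(a) = -\bigl[\, y \ln a + (1 - y)\ln(1 - a) \,\bigr]
\quad\Longrightarrow\quad
\frac{dC}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}.
% The derivative vanishes exactly when a = y, for any y in (0, 1);
% checking that this is a minimum, and substituting a = y back into C,
% yields the stated value of the cross-entropy.
```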


3. Given the network in Figure 1, calculate the derivatives of the cost with respect to the weights and the biases and the backpropagation error equations (i.e. δ^l for each layer l) for the first iteration using the cross-entropy cost function. Initial weights are colored in red, initial biases are colored in orange, and the training inputs and desired outputs are in blue. This problem aims to optimize the weights and biases through backpropagation to make the network output the desired results. More specifically, given inputs 0.05 and 0.10, the neural network is supposed to output 0.01 and 0.99 after many iterations.

Problem 4

1. Download the python template prob4_1.py and read through the code, which implements a neural network with TensorFlow based on MNIST data. Implement the TODO part to define the loss and optimizer. Compare the squared loss, cross-entropy loss, and softmax with log-likelihood. Plot the training cost and the test accuracy vs. epoch for each loss function (in two separate plots). Which loss function converges fastest?
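For orientation before filling in the TensorFlow TODO, here is a framework-agnostic numpy sketch of the three candidate losses for a single example. The function names are made up for illustration; the actual template should use the corresponding TensorFlow ops rather than these helpers:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of logits
    e = np.exp(z - z.max())
    return e / e.sum()

def squared_loss(a, y):
    # Quadratic cost: 0.5 * ||a - y||^2
    return 0.5 * np.sum((a - y) ** 2)

def cross_entropy(a, y):
    # -sum_j [ y_j ln a_j + (1 - y_j) ln(1 - a_j) ], for sigmoid outputs a
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

def log_likelihood(z, y):
    # -ln a_y with a = softmax(z): the softmax / log-likelihood pairing
    return -np.log(softmax(z)[np.argmax(y)])
```

The log-likelihood form assumes y is one-hot; the cross-entropy form assumes each output is an independent sigmoid activation in (0, 1).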

2. Based on problem 4.1, add regularization to the previous network. Implement L2 and L1 regularization separately, and dropout separately. Compare the accuracy and report the final regularization parameters you used (for dropout, report the probability parameter). Are the final results sensitive to each parameter? Please save your code as prob4_2.py. You may want to check out the following link for regularization: https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.layers/regularizers
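For reference, the three mechanisms can each be written in a few lines of plain numpy (all names here are illustrative, not part of the template): the penalties are added to the unregularized cost, and the dropout mask is applied to a layer's activations during training only.

```python
import numpy as np

def l2_penalty(weights, lam):
    # (lam / 2) * sum of squared weights across all layers
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam):
    # lam * sum of absolute weights across all layers
    return lam * sum(np.sum(np.abs(w)) for w in weights)

def dropout_mask(shape, keep_prob, rng):
    # Inverted dropout: zero each unit with prob (1 - keep_prob),
    # rescale survivors by 1/keep_prob
    return (rng.random(shape) < keep_prob) / keep_prob
```

Inverted dropout rescales at training time so the expected activation is unchanged and no rescaling is needed at test time, which is also how standard framework implementations behave.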

References

[1] “The scipy stack specification.” [Online]. Available: https://www.scipy.org/stackspec.html
