## Description

Programming Assignment 2

CSE 253: Neural Networks

Instructions

1. Please submit your assignment on Vocareum. There are two components to this assignment: written homework (Problems 1 & 2a-c) and a programming part. You will be writing a report in a conference paper format for this assignment, reporting your findings. While we won't enforce this, we prefer the report to be written using LaTeX or Word in NIPS format (NIPS is the top machine learning conference, and it is now dominated by deep nets; it will be good practice for you to write in that format!). You are free to choose an alternate format, but NIPS format is strongly recommended. The templates, both in Word and LaTeX, are available from the 2015 NIPS format site.

2. You may use a language of your choice (Python and MATLAB are recommended). You also need to submit all of your source code files and a readme.txt file that includes detailed instructions on how to run your code. You should write clean code with consistent formatting, as well as explanatory comments, as this code may be reused in the future.

3. Using the MATLAB neural network toolbox or any off-the-shelf code is strictly prohibited.

Part I

Homework problems to be solved individually, and turned in individually

Multi-layer Neural Networks

In this problem, we will continue classifying handwritten digits from Yann LeCun's MNIST Database. In Assignment 1, we classified the digits using a single-layer neural network with different output activation functions (logistic and softmax regression). In this assignment, we are going to classify the MNIST dataset using multi-layer neural networks with softmax outputs.


Problem

1. In class we discussed two different error functions: sum-of-squared error (SSE) and cross-entropy error. We learned that SSE is appropriate for linear regression problems where we try to fit data generated from:

$$t = h(\mathbf{x}) + \epsilon \qquad (1)$$

Here $\mathbf{x}$ is a $K$-dimensional vector, $h(\mathbf{x})$ is a deterministic function of $\mathbf{x}$, where $\mathbf{x}$ includes the bias $x_0$, and $\epsilon$ is random noise that has a Gaussian probability distribution with zero mean and variance $\sigma^2$, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Suppose we want to model this data with a linear function approximation with parameter vector $\mathbf{w}$:

$$y = \sum_{i=0}^{K} w_i x_i \qquad (2)$$

Prove that finding the optimal parameter $\mathbf{w}$ for the above linear regression problem on the dataset $D = \{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$ is equivalent to finding the $\mathbf{w}^*$ that minimizes the SSE:

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \sum_{n=1}^{N} (t_n - y_n)^2 \qquad (3)$$

Hint: This problem is solved in Bishop, Chapter 6.

2. For multiclass classification on the MNIST dataset, we previously used softmax regression with cross-entropy error as the objective function, and learned the weights of a single-layer network to classify the digits. In this assignment, we will add a hidden layer between the input and output that consists of $J$ units with the sigmoid activation function. So this network has three layers: an input layer, a hidden layer, and a softmax output layer.

Notation: We use index $k$ to represent a node in the output layer, index $j$ to represent a node in the hidden layer, and index $i$ to represent a node in the input layer. Additionally, the weight from node $i$ in the input layer to node $j$ in the hidden layer is $w_{ij}$. Similarly, the weight from node $j$ in the hidden layer to node $k$ in the output layer is $w_{jk}$.

(a) Derivation. Derive the expression for $\delta$ for both the units of the output layer ($\delta_k$) and the hidden layer ($\delta_j$). For this problem, use the definition of $\delta$ as $\delta_i = \frac{\partial E}{\partial a_i}$, where $a_i$ is the weighted sum of the inputs to unit $i$.

(b) Update rule. Derive the update rule for $w_{ij}$ and $w_{jk}$ using learning rate $\eta$, starting with the gradient descent rule. The rule for the output units should be the same as you found in Programming Assignment 1. For the hidden units:

$$w_{ij} = w_{ij} - \eta \frac{\partial E}{\partial w_{ij}} \qquad (4)$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}} \qquad (5)$$

The derivative should take into account all of the outputs, so:

$$\delta_j = \frac{\partial E^n}{\partial a_j^n} = \sum_k \frac{\partial E^n}{\partial y_k} \frac{\partial y_k}{\partial a_j} \qquad (6)$$

(c) Vectorize computation. The computation is much faster when you update all $w_{ij}$s and $w_{jk}$s at the same time, using matrix multiplications rather than for loops. Please show the update rule for the weight matrix from the hidden layer to the output layer, and for the matrix from the input layer to the hidden layer, using matrix/vector notation.
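
As an illustration (not the required derivation), a vectorized forward and backward pass for this three-layer network might look like the sketch below. The shape conventions, variable names, and function name are our own assumptions, not part of the assignment:

```python
import numpy as np

def forward_backward(X, T, W1, W2):
    """One vectorized forward/backward pass for a 1-hidden-layer net.

    Assumed shapes (a convention, not mandated by the handout):
      X  : (N, I)  inputs, bias column already appended
      T  : (N, K)  one-hot targets
      W1 : (I, J)  input-to-hidden weights
      W2 : (J, K)  hidden-to-output weights
    """
    A1 = X @ W1                                   # hidden pre-activations, (N, J)
    Z = 1.0 / (1.0 + np.exp(-A1))                 # logistic hidden units
    A2 = Z @ W2                                   # output pre-activations, (N, K)
    expA = np.exp(A2 - A2.max(axis=1, keepdims=True))
    Y = expA / expA.sum(axis=1, keepdims=True)    # softmax outputs

    delta_out = Y - T                             # output deltas (softmax + cross-entropy)
    delta_hid = (delta_out @ W2.T) * Z * (1 - Z)  # hidden deltas via equation (6)

    grad_W2 = Z.T @ delta_out                     # dE/dW2
    grad_W1 = X.T @ delta_hid                     # dE/dW1
    return Y, grad_W1, grad_W2
```

With deltas defined as $\partial E / \partial a$, the corresponding updates are `W1 -= eta * grad_W1` and `W2 -= eta * grad_W2`.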


Part II

Team Programming Assignment

3. Classification. Classification on the MNIST database. Refer to your derivations from Problem 2.

(a) Read in the data from the MNIST database. You can use a loader function to read in the data. Loader functions for MATLAB can be found at http://ufldl.stanford.edu/wiki/resources/mnistHelper.zip. Python helper functions can be found at https://gist.github.com/akesling/5358964.
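
If you prefer not to depend on the linked helpers, a minimal reader can be sketched directly. The byte layout below follows the published MNIST IDX format (big-endian magic number, then one 4-byte size per dimension, then raw unsigned bytes); the function name and gzip handling are our own choices:

```python
import gzip
import struct
import numpy as np

def load_idx(path):
    """Read an MNIST IDX file (optionally gzipped) into a numpy array.

    Works for both the idx3 image files and the idx1 label files,
    e.g. 'train-images-idx3-ubyte.gz'; adjust paths to your download.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        zero, dtype, ndim = struct.unpack(">HBB", f.read(4))   # magic number
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
```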

(b) The pixel values in the digit images are in the range [0..255]. Divide by 255 so that they are in the range [0..1], and then subtract the mean over all of the pixels in each image. I.e., if the mean pixel value in an image of a "4" is 0.3, you would subtract 0.3 from each pixel value.
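
This preprocessing step can be sketched in a few lines, assuming the images arrive as an (N, 784) array of uint8 pixel values:

```python
import numpy as np

def preprocess(images):
    """Scale pixels to [0, 1], then subtract each image's own mean.

    `images` is assumed to be an (N, 784) uint8 array of flattened digits.
    """
    X = images.astype(np.float64) / 255.0
    X -= X.mean(axis=1, keepdims=True)   # per-image mean, as the handout asks
    return X
```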

(c) While you should use the softmax activation function at the output level, for the hidden layer you can use the logistic function, i.e., $y_j = 1/(1 + \exp(-a_j))$. This has the nice feature that $dy/da = y(1 - y)$.

(d) Check your code for computing the gradient using a small subset of data; so small, you can use one input-output pattern! You can compute the slope with respect to one weight using the numerical approximation:

$$\frac{d}{dw} E^n(w) \approx \frac{E^n(w + \epsilon) - E^n(w - \epsilon)}{2\epsilon}$$

where $\epsilon$ is a small constant, e.g., $10^{-2}$. Compare the gradient computed using the numerical approximation with the one computed by backpropagation. The difference of the gradients should be within big-O of $\epsilon^2$, so if you used $10^{-2}$, your gradients should agree within $10^{-4}$. (See Section 4.8.4 in Bishop for more details.) Note that $w$ here is one weight in the network, so this must be repeated for every weight and bias. Report your results.
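
The central-difference check above can be sketched as a loop over a flat copy of all the weights; `loss_fn` is a hypothetical stand-in for your per-pattern error $E^n$:

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-2):
    """Central-difference slope of loss_fn at each weight, one at a time.

    loss_fn takes the full flat weight vector and returns the scalar error
    E^n on a single input-output pattern; w holds all weights and biases.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad
```

Compare `np.max(np.abs(numerical_grad(loss_fn, w) - backprop_grad))` against $\epsilon^2$.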

(e) Using the update rule you obtained in 2(c), perform gradient descent to learn a classifier that maps each input to one of the labels $t \in \{0, \ldots, 9\}$ (but using a one-hot encoding). Use a hold-out set to decide when to stop training. (Hint: You may choose to split the training set of 60000 images into two subsets: one training set with 50000 images and one validation set with 10000 images.) Stop training when the error on the validation set goes up, or better yet, save the weights as you go, keeping the ones from when the validation set error was at a minimum. Report your training procedure and plot your training and testing accuracy vs. the number of training iterations of gradient descent. Again, by accuracy, we mean the percent correct on the training and testing patterns.
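
The save-the-best-weights strategy can be sketched as a generic loop. Here `step_fn` and `eval_fn` are hypothetical stand-ins for your gradient update and your validation-error computation; only the snapshot logic is the point:

```python
import numpy as np

def train_with_holdout(step_fn, eval_fn, w, max_iters=10000, check_every=100):
    """Gradient descent with 'keep the best weights' early stopping.

    step_fn(w) -> new weights after one gradient-descent update (assumed)
    eval_fn(w) -> error on the held-out validation set (assumed)
    """
    best_w, best_err = w.copy(), eval_fn(w)
    for it in range(1, max_iters + 1):
        w = step_fn(w)
        if it % check_every == 0:
            err = eval_fn(w)
            if err < best_err:                    # validation error at a new minimum:
                best_w, best_err = w.copy(), err  # snapshot these weights
    return best_w, best_err
```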

4. Adding the "Tricks of the Trade." Read the first paper on the resources page under readings, "lecun98efficient.pdf", Sections 4.1-4.7. Implement the following ideas from that paper (you can do these all at once, or incrementally; there is no need to try all possible combinations, however).

(a) Shuffle the examples, and then use stochastic gradient descent. For the purposes of this class, use "minibatches", in which you compute the total gradient over, say, 128 patterns, take the average weight change (i.e., divide by 128), and then change the weights by that amount.
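
A minimal shuffle-and-batch generator might look like this; the function name and use of numpy's `Generator` are our own choices:

```python
import numpy as np

def minibatches(X, T, batch_size=128, rng=None):
    """Shuffle the training set, then yield minibatches of (inputs, targets)."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], T[idx]
```

Within each batch, divide the summed gradient by the batch size before applying the weight change, as described above.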

(b) Stick with the normalization you have already done in the previous part, which is a little different from what LeCun describes in Section 4.3; that can be expensive to compute over a whole dataset.

(c) For the output layer, use the usual softmax activation function. However, for the hidden layer(s), use the sigmoid in Section 4.4. Note that you will need to derive the slope of that sigmoid to use when computing deltas.
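
The sigmoid recommended in Section 4.4 of that paper is $f(a) = 1.7159 \tanh(\tfrac{2}{3}a)$, and its slope follows from the chain rule: $f'(a) = 1.7159 \cdot \tfrac{2}{3} \, (1 - \tanh^2(\tfrac{2}{3}a))$. A sketch (derive this yourself before using it):

```python
import numpy as np

A = 1.7159     # amplitude from LeCun et al., Section 4.4
B = 2.0 / 3.0  # input scale from the same section

def scaled_tanh(a):
    """f(a) = 1.7159 * tanh(2a/3), used for the hidden units."""
    return A * np.tanh(B * a)

def scaled_tanh_deriv(a):
    """f'(a) = A * B * (1 - tanh^2(B a)); replaces y(1 - y) in the deltas."""
    return A * B * (1.0 - np.tanh(B * a) ** 2)
```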

(d) Initialize the input weights to each unit using a distribution with zero mean and standard deviation $1/\sqrt{\text{fan-in}}$, where the fan-in is the number of inputs to the unit.
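
One way to realize this is with a Gaussian, which is a natural choice (the prescription only fixes the mean and standard deviation, so other distributions would also qualify):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=None):
    """Weights drawn with zero mean and standard deviation 1/sqrt(fan_in)."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```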

(e) Use momentum.
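
The handout only names the technique; a common form (assumed here, with `eta` and `mu` as hypothetical values to tune) keeps one velocity array per weight matrix and mixes in the previous step:

```python
import numpy as np

def momentum_step(W, grad_W, velocity, eta=0.01, mu=0.9):
    """One momentum update: velocity accumulates a decaying sum of gradients."""
    velocity = mu * velocity - eta * grad_W   # remember a fraction of the last step
    return W + velocity, velocity
```

Initialize the velocity to zeros and carry it across updates alongside the weights.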

(f) Comment on the change in performance, which has hopefully improved, at least in terms of learning speed.


5. Experiment with Network Topology. Start with the network from Problem 4. Now, we will consider how the topology of the neural network changes its performance.

(a) Try halving and doubling the number of hidden units. What do you observe if the number of hidden units is too small? What if the number is too large?

(b) Change the number of hidden layers: use two hidden layers instead of one. Create a new architecture that uses two hidden layers of equal size and has approximately the same number of parameters as the previous network with one hidden layer. By that, we mean it should have roughly the same total number of weights and biases. Report training and testing accuracy vs. the number of training iterations of gradient descent.
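
One way to pick the size of the two equal hidden layers is to match parameter counts algebraically. The helper below is a sketch assuming 784 inputs, 10 outputs, and one bias per hidden and output unit; it solves the resulting quadratic for the two-layer hidden size $H$ given the one-layer hidden size $J$:

```python
import math

def matched_hidden_size(J, n_in=784, n_out=10):
    """Hidden size H for a two-equal-hidden-layer net whose parameter count
    (weights + biases) roughly matches a one-hidden-layer net with J units.

    One hidden layer : (n_in + 1) * J + (J + 1) * n_out
    Two hidden layers: (n_in + 1) * H + (H + 1) * H + (H + 1) * n_out
    """
    target = (n_in + 1) * J + (J + 1) * n_out
    # Expanding the two-layer count gives H^2 + (n_in + 2 + n_out) H + n_out,
    # so set it equal to `target` and take the positive root.
    b = n_in + 2 + n_out
    c = n_out - target
    H = (-b + math.sqrt(b * b - 4 * c)) / 2
    return round(H)
```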
