## Description

Programming Assignment 2

CSE 253: Neural Networks for Pattern Recognition

Instructions

1. Please submit your assignment on Gradescope. There are two components to this assignment: written

homework (Problems 1, 2a-c), and a programming part. You will be writing a report in a conference paper

format for the programming part of this assignment, reporting your findings. The report should be written

using LATEX or Word in NeurIPS format. The templates, both in Word and LATEX are available from

the 2015 NIPS format site.

2. For the programming part, you are allowed to work in groups of up to 3 (like PA1). In extraordinary

circumstances, we will allow you to do it on your own. Please discuss your circumstances with your TA, who

will then present your case to me. Again, don’t forget to include a paragraph for each team member

in your report that describes what each team member contributed to the project.

3. You need to submit all of the source codes files and a readme.txt file that includes detailed instructions on

how to run your code.

You should write clean code with consistent format, as well as explanatory comments, as this code may be

reused in the future. Do not submit any of your output plot files or .pyc files, just the .py files and a readme

that can reproduce your findings.

4. For the programming assignment, you are expected to write your code using NumPy. Using any machine

learning frameworks (PyTorch, Tensorflow, Keras, etc.) which compute the gradients for you is not allowed

for this assignment.

Multi-layer Neural Networks

In this assignment, we will be classifying images of fashion products from Zalando’s Fashion MNIST Dataset. In

Assignment 1, we classified facial expressions using a single-layer neural network with different output activation

functions. (Logistic and Softmax regression). In this assignment, we are going to be using multi-layer neural

networks with softmax outputs for classification.

1

Part I

Homework problems to be solved individually, and

turned in individually

For this part we will not be accepting handwritten reports. Please use latex or word for your report. MathType

is a handy tool for equations in Word. The free version (MathType Lite) has everything you need. This should be

done individually, and each team member should turn in his or her own work separately.

1. (5 pts) Maximum Likelihood Estimation. An exponential distribution can be completely characterized by

the parameter λ . Given n samples [x1, x2, …., xn] drawn from an exponential distribution, find the maximum

likelihood estimate for the parameter λ. Recall that the exponential distribution is given by

p(x; λ) = λ exp(−λx) (1)

2. (15pts) For multiclass classification on the Fashion MNIST dataset, we will use the cross-entropy

loss as our objective function and softmax as the output layer. In our network, we will have a hidden

layer between the input and output, that consists of J units with the tanh activation function. So this network

has three layers: an input layer, a hidden layer and a softmax output layer.

Notation: We use index k to represent a node in output layer and index j to represent a node in hidden layer

and index i to represent a node in the input layer. Additionally, the weight from node i in the input layer

to node j in the hidden layer is wij . Similarly, the weight from node j in the hidden layer to node k in the

output layer is wjk.

(a) (10pts) Derivation. In the following discussion, n denotes the nth input pattern. Derive the expression

for δ for both the units of output layer (δ

n

k

) and the hidden layer (δ

n

j

). Recall that the definition of δ is

δ

n

i = −

∂En

∂an

i

, where a

n

i

is the weighted sum of the inputs to unit i. There are two “hard parts” to this: 1)

taking the derivative of the softmax; and 2) figuring out how to apply the chain rule to get the hidden

deltas. Bishop and Chapter 8 of the PDP books both have good hints on the latter, and Bishop on the

former. However, crucial steps have been left out of the Bishop derivation (Chapter 6). Our main hint

here is: break it up into two parts (see equation 6.161 in Bishop), when k = k

0 and when it doesn’t.

Note that Bishop (Equation 4.31) defines δ

n

j without a minus sign, which is the opposite of the way that

we defined it above, and different from the PDP book chapter 8.

(b) (2pts) Update rule. Write the update rule for w’s in terms of the δ’s you derived above using learning

rate α, starting with the gradient descent rule:

wij = wij − α

∂E

∂wij

(2)

where

∂E

∂wij

=

X

n

∂En

∂wij

(3)

You have to write both the update rules, the hidden to output layer (wjk) update rule and the input to

hidden (wij ) update rule in a generalized form. (Hint: you will have to use chain rule for differentiation.)

∂En

∂wij

=

∂En

∂an

j

∂an

j

∂wij

(4)

(c) (3pts) Vectorize computation. The computation is much faster when you update all wij s and wjks

at the same time, using matrix multiplications rather than for loops. Please show the update rule for

the weight matrix from the hidden layer to output layer and the matrix from input layer to hidden layer,

using matrix/vector notation.

2

Part II

Team Programming Assignment

3. Classification. Classification on the Fashion MNIST datatset. Refer to your derivations from Problem 2.

(a) (0pts) Implement the normalize_data and one_hot_encoding methods provided in the starter code and

read in the Fashion MNIST data using the load_data method. This will load the training and testing

data for you. Create a validation split from the training data.

(b) (5pts) Check your code for computing the gradient using a small subset of data. For example, in

computing E(w + ?) below, you could use 10 examples, one from each category. You can compute the

slope with respect to one weight using the numerical approximation:

d

dw E(w) ≈

E(w + ?) − E(w − ?)

2?

where ? is a small constant, e.g., 10−2

. Compare the gradient computed using numerical approximation

with the one computed as in backpropagation. The difference of the gradients should be within big-O of

?

2

, so if you used 10−2

, your gradients should agree within 10−4

. (See section 4.8.4 in Bishop for more

details). Note that w here is one weight in the network. You can only check one weight at a time this

way – every other weight must stay the same.

Choose one output bias weight, one hidden bias weight, and two hidden to output weights and two input

to hidden weights, and show that the gradient obtained for that weight after backpropagation is within

(O(?

2

)) of the gradient obtained by numerical approximation. For each selected weight w, first increment

the weight by small value ?, do a forward pass for one training example, and compute the loss. This value

is E(w + ?). Then reduce w by the same amount ?, do a forward pass for the same training example and

compute the loss E(w − ?). Then compute the gradient using equation mentioned above and compare

this with gradient obtained by backpropagation. Report the results in a Table.

(c) (10pts) Using the vectorized update rule you obtained from 2(c), perform gradient descent to learn a

classifier that maps each input image to one of the labels t ∈ {0, …, 9}, using a one-hot encoding. Use 50

hidden units. For this programming assignment, use mini-batch stochastic gradient descent

throughout, in all problems.

You should use momentum in your update rule, i.e., include a momentum term weighted by γ, and set

γ to 0.9. You should use cross-validation for early stopping of your training: stop the training when the

error on the validation set goes up. Use the following criteria – If the validation error goes up for some

threshold number of epochs, stop training and save the weights which resulted in minimum validation

error. The validation set error should go up at some point, although this didn’t always happen with the

faces.

Describe your training procedure. Plot your training and validation accuracy (i.e., percent correct) vs.

number of training epochs, as well as training and validation loss vs. number of training epochs. Report

accuracy on test set using the best weights obtained through early stopping.

You may experiment with different learning rates, but you only need to report your results and plots on

the best learning rate you find.

(d) (5pts) Experiment with Regularization. Starting with the network you used for part (c), with new

initial random weights, add weight decay to the update rule. (You will have to decide the amount of

regularization, i.e., λ, a factor multiplied times the weight decay penalty. Experiment with 0.001 and

0.0001) Again, plot training and validation loss, training and validation accuracy, and report final test

accuracy. For this problem, train for about 10% more epochs than you found in part c (i.e., if you found

that 100 epochs were best, train for 110 for this problem). Comment on the change of performance, if any.

3

(e) (5pts) Experiment with Activations. Starting with the network of part (c), try using different

activation functions for the hidden units. You are already using tanh, try the other two below. Note

that the derivative changes when the activation rule changes!!

sigmoid(z) = 1

1 + e−z

(5)

ReLU(z) = max(0, z) (6)

The weight update rule is exactly the same for each activation function. The only thing that changes is

the derivative of the activation function when computing the hidden unit δ’s. For each activation function

you try, plot training and validation loss on one graph, training and validation accuracy on another, and

report final test accuracy. Comment on the change of performance. How do these activation functions

compare to each other? What works best for your problem?

(f) (5pts) Experiment with Network Topology. Starting with the network from part (c), consider how

your neural network architecture changes the performance.

i. Try halving and doubling the number of hidden units. Plot training and validation loss, training and

validation accuracy, and report final test accuracy. How does performance change? Explain your

results.

ii. Change the number of hidden layers. Use two hidden layers instead of one. Create a new architecture

that uses two hidden layers of equal size and has approximately the same number of parameters, as

the previous network with one hidden layer of 50 units. By that, we mean it should have roughly the

same total number of weights and biases. This is a standard method for comparing different neural

network architectures in order to make a fair comparison. Again, plot training and validation loss,

training and validation accuracy, and report final test accuracy. How did the performance change?

Instructions for Programming Assignment

The Fashion MNIST dataset and started code has been provided to you on Piazza.

You need to edit the neuralnet.py file to complete the assignment. This file is a skeleton code that is designed to

guide you to build and implement your neural net in an efficient and modular fashion, and this will give you a feel

for what developing models in PyTorch will be like.

Follow instructions in the code on how to install PyYAML. The config.yaml specifies the configuration for your

Neural Network architecture, training hyperparameters, type of activation, etc. The purpose of each flag is indicated

by the comment before it. Play around with the parameters here to decide what works best for the problem.

The class Activation includes the definitions for all activation functions and their gradients, which you need to fill

in. The definitions of forward and backward have been implemented for you in this class. The code is structured

in such a way that each activation function is treated as an additional layer on top of a linear layer that computes

the net input (a) to the unit. To add an activation layer after a fully-connected or linear layer, a new object of this

class needs to be instantiated and added to the model.

The Layer class denotes a standard fully-connected / linear layer. The forward and backward methods need

to be implemented by you. As the name suggests, ’forward’ takes in an input vector ’x’ and outputs the variable

’a’. Do not apply the activation function on the computed weighted sum of inputs since the activation function is

implemented as a separate layer, as mentioned above. The ’backward’ method takes the weighted sum of the deltas

from the layer above it as input, computes the gradient for its weights (to be saved in d_w) and biases (to be saved

in d_b). If there is another layer below that (multiple hidden layers), it also passes the weighted sum of the deltas

back to the previous layer. Otherwise, if the previous layer is the input layer, it stops there.

The Neuralnetwork class defines the entire network. The __init__() method has been implemented for you

which uses the configuration specified in config.yaml to create the network. Make sure to understand this function

very carefully since good understanding of this will be needed while implementing forward and backward for this

4

class. The function ’forward’ takes in the input dataset ’x’ and targets (in one hot encoded form) as input, performs

a forward pass on the data ’x’ and returns the loss and predictions. The ’backward’ function computes the error

signal from saved predictions and targets and performs a backward pass through all the layers by calling backward

pass for each layer of the network, until it reaches the first hidden layer above the input layer (usually there will

only be one hidden layer for this project, but when there are more, there will be more backward passes). The loss

method computes cross-entropy loss using the predictions and targets and returns this loss.

All of the classes above implement a __call__() method which calls the class’s forward method by default.

Recall that the __call__() method allows instances of the class to be callable. This will make it really easy to

appreciate PyTorch API’s later on as they are also very similar.

The load_data method has been partially implemented for you. You need to implement the softmax,

normalize_data, train and test functions. The requirements for these functions and all other functions are

given in the code.

Once you have implemented everything, you can run checker.py to verify your implementation. The checker.py

file uses sanity.pkl to verify your results using the default configuration provided (in config.yaml). You might

want to save a copy of the default configuration before you begin experimenting with it.

Furthermore, a couple of things to take care of:

• Do not add additional keys in config.yaml.

• Do not rename the classes.

• Do not modify the checker.py file. This is the file that has test cases for your code. You can download it

and use it to verify your results to ensure your implementation is correct.

• You are allowed to write additional functions if you feel the need to do so.

• Make sure to include clear instructions in the Readme file to run your code. If you haven’t done so and if we

are not able to run your code using the instructions you provide, you lose points.

5