## Description

Homework 3

For this homework, you will be working extensively in TensorFlow. It is suggested that you spin up a Google Cloud VM with a GPU attached; instructions for doing so are found in Homework 0.

Part 1: Homework 2, but in TensorFlow

Part 2: DNN on MNIST and CIFAR10

Part 3: VGG on CIFAR10 and CIFAR100

(Optional) Part 4, getting state of the art (#SOTA)

Part 1

You don't have to repeat everything in Homework 2. Instead, pick one set of two features that worked well for you last homework and implement logistic regression in TensorFlow without using Keras (you will practice using Keras in Parts 2 and 3). In other words, using TensorFlow operations, create a scalar-valued loss function and let TensorFlow create the training operation for logistic regression, which automatically computes the gradients and updates the weight parameters. Note that the logistic loss is a special case of the softmax cross-entropy loss that you've seen when classifying MNIST.
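To see why the logistic loss is a special case of softmax cross-entropy, here is a NumPy sketch of the math that TensorFlow automates (the dataset and learning rate below are illustrative, not part of the assignment):

```python
# Sketch: the logistic (sigmoid) cross-entropy loss equals two-class
# softmax cross-entropy with logits [0, z], plus one manual gradient step
# of the kind TensorFlow's training op would take automatically.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, X, y):
    # Mean binary cross-entropy for labels y in {0, 1}.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax_xent_two_class(w, b, X, y):
    # Two-class softmax with logits [0, z] reduces to the sigmoid loss above.
    z = X @ w + b
    logits = np.stack([np.zeros_like(z), z], axis=1)
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(y)), y.astype(int)])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))               # two features, as in Part 1
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels for illustration
w, b = np.zeros(2), 0.0
assert np.isclose(logistic_loss(w, b, X, y),
                  softmax_xent_two_class(w, b, X, y))

# One gradient-descent step on the logistic loss (learning rate 0.1).
p = sigmoid(X @ w + b)
grad_w = X.T @ (p - y) / len(y)            # d(loss)/dw
grad_b = np.mean(p - y)
w, b = w - 0.1 * grad_w, b - 0.1 * grad_b
```

In TensorFlow you define only the loss; the gradient computation and the weight update are generated for you by the optimizer.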

Part 2: DNN on MNIST and CIFAR10

In lab, you saw how to work with the MNIST dataset to perform image classification. We can attempt the MNIST classification problem with fully connected layers only, which means we will be optimizing dense (non-banded) weight matrices with no convolutions.

Calculate the number of weight parameters you are optimizing for 1, 2, and 3 different fully connected layers (the total size of each layer is up to you).
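As a sanity check for your counts, each Dense layer contributes fan_in × fan_out weights plus fan_out biases. A short sketch (the hidden widths of 100 are illustrative choices, not required sizes):

```python
# Parameter count for a stack of fully connected layers on flattened
# 28x28 MNIST images: each Dense layer has fan_in * fan_out weights
# plus fan_out biases.
def dense_params(layer_sizes):
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# 1, 2, and 3 hidden layers of width 100; input 784, output 10:
print(dense_params([784, 100, 10]))             # 79510
print(dense_params([784, 100, 100, 10]))        # 89610
print(dense_params([784, 100, 100, 100, 10]))   # 99710
```

The 79,510 figure for a single hidden layer of 100 units matches the model.summary() output of the example DNN in this notebook.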

What is the max layer depth you can go before training loss does not converge? You can usually tell that something is not converging by examining the training loss vs. iteration curve.

How does the number of parameters relate to the training loss and validation/test loss? Try to get a few data points to speak to this question.

Keeping the maximum number of parameters possible while still maintaining convergence (i.e., a good training and validation/test loss), what happens when you swap the activation function to tanh instead of relu? How about sigmoid?
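One reason the activation function affects convergence: backpropagation multiplies one activation derivative per layer, so derivatives smaller than 1 shrink the gradient exponentially with depth. A NumPy sketch of each derivative's best case (the depth of 10 is an illustrative assumption):

```python
# Maximum derivative of each activation, and the best-case gradient
# scale contributed by the activations alone after 10 layers.
import numpy as np

z = np.linspace(-5, 5, 1001)
sigmoid = 1 / (1 + np.exp(-z))
d_sigmoid = sigmoid * (1 - sigmoid)   # peaks at 0.25 (z = 0)
d_tanh = 1 - np.tanh(z) ** 2          # peaks at 1.0, but saturates quickly
d_relu = (z > 0).astype(float)        # exactly 1 on the active half

print(d_sigmoid.max() ** 10)   # ~1e-6: deep sigmoid nets train very slowly
print(d_tanh.max() ** 10)      # 1.0 at z = 0, but shrinks once units saturate
print(d_relu.max() ** 10)      # 1.0 for active units
```

This is consistent with what you should observe experimentally: relu tolerates the most depth, tanh less, sigmoid the least.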

After exploring the above, train a DNN model with the combination of hyperparameters that you believe will work best on MNIST.

Using the same architecture, try training a DNN model on a more difficult dataset such as Fashion-MNIST or CIFAR10/100. Example download instructions are shown in the next problem.

Must haves

Make a curve of the final validation/test loss of your DNN after the loss plateaus as a function of the number of weight parameters used (final loss versus # parameters used). Note that you might see something like the curve below for a low number of parameters, but as the number of parameters increases, it will not look like this plot.

On the same figure, make the same curve as above, but use different activation functions in your architecture.

Plot a point corresponding to your crafted DNN architecture from question 4.

Repeat 1-3 for CIFAR10

When a reasonable number of parameters is used, the curves look like the figure below.

[Figure: final validation/test loss versus number of weight parameters]

```python
# Download and visualize the data; see all datasets here:
# https://www.tensorflow.org/api_docs/python/tf/keras/datasets
import tensorflow as tf

(X_train, y_train), (X_val, y_val) = tf.keras.datasets.mnist.load_data()
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_val = tf.keras.utils.to_categorical(y_val, 10)
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_val = X_val.reshape(X_val.shape[0], 28, 28, 1)

from matplotlib import pyplot as plt
%matplotlib inline

print('Training data shape', X_train.shape)
_, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(X_train[0].reshape(28, 28), cmap=plt.cm.Greys);
ax2.imshow(X_train[1].reshape(28, 28), cmap=plt.cm.Greys);
```

```python
# Build your DNN; an example model is given for you.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    # Try adding more layers and graph the final loss and accuracy
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

```
Training data shape (60000, 28, 28, 1)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten_5 (Flatten)          (None, 784)               0
_________________________________________________________________
dense_125 (Dense)            (None, 100)               78500
_________________________________________________________________
dense_126 (Dense)            (None, 10)                1010
=================================================================
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
_________________________________________________________________
```

```python
model.fit(X_train, y_train,
          batch_size=64,
          epochs=1,
          verbose=1,
          validation_data=(X_val, y_val))
```

```
Train on 60000 samples, validate on 10000 samples
Epoch 1/1
60000/60000 [==============================] - 13s 222us/step - loss: 2.3015 - acc: 0.1119 - val_loss: 2.3013 - val_acc: 0.1135
<tensorflow.python.keras.callbacks.History at 0x7fb1d432e780>
```

Part 3: VGG on CIFAR10 and CIFAR100

VGG is a simple but powerful CNN introduced in 2014. Read the VGG paper here: https://arxiv.org/pdf/1409.1556.pdf

Here, we're going to try to reproduce the paper's findings on the CIFAR10 and CIFAR100 datasets. Note that the paper takes 224 x 224 images, but CIFAR10 and CIFAR100 images are only 32 x 32.

Implement all of the layers for VGG ConvNet configuration A, using the shell code below as a guide. Then train this network on the CIFAR10 and CIFAR100 datasets.
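To check your configuration-A implementation, you can count its parameters by hand. The sketch below assumes the paper's original 224 x 224 / 1000-class ImageNet setting; with 32 x 32 CIFAR inputs, the flattened size feeding the first FC layer (and hence the total) will be much smaller:

```python
# Parameter count for VGG configuration A (Table 1 of the VGG paper),
# assuming 224x224 inputs and 1000 output classes as in the paper.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # weights + biases

def dense_params(fan_in, fan_out):
    return fan_in * fan_out + fan_out

# Config A: eight 3x3 conv layers (64, 128, 256, 256, 512, 512, 512, 512).
conv_channels = [(3, 64), (64, 128), (128, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512), (512, 512)]
total = sum(conv_params(3, c_in, c_out) for c_in, c_out in conv_channels)
total += dense_params(7 * 7 * 512, 4096)     # 224 / 2**5 = 7 after 5 pools
total += dense_params(4096, 4096)
total += dense_params(4096, 1000)
print(f"{total:,}")                          # 132,863,336
```

The total of 132,863,336 matches the ~133M parameters the paper reports for configuration A.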

For CIFAR10 and CIFAR100, VGG is probably overkill. Try reducing the number of layers and the number of filters without sacrificing too much accuracy. How many filters can you get rid of before you see the accuracy drop by more than 2%? Where in the architecture is it better to remove filters: towards the input layers, or towards the output layers?
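Note that removing filters from a layer changes the parameter count of both that layer and the next one (whose fan-in drops). A sketch with hypothetical 3x3 conv widths, comparing a cut near the input against a cut near the output:

```python
# Parameter cost of halving the filters in an early vs. a late conv
# layer, for an illustrative 3x3 conv stack (widths are hypothetical).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # weights + biases

def stack_params(widths, k=3, c_in=3):
    total = 0
    for w in widths:
        total += conv_params(k, c_in, w)
        c_in = w
    return total

base      = stack_params([64, 128, 256, 512])
early_cut = stack_params([32, 128, 256, 512])   # halve the first layer
late_cut  = stack_params([64, 128, 256, 256])   # halve the last layer
print(base, early_cut, late_cut)
```

Because conv parameters scale with c_in * c_out, cutting a wide late layer removes far more parameters than cutting a narrow early one; whether it also preserves accuracy better is what this question asks you to measure.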

For each experiment, report the validation-loss-versus-parameter-count curves for changing i) the number of layers, ii) the filter sizes, and iii) both.

```python
# This is the same style of model as in the other notebook, just simplified.
import tensorflow as tf

(X_train, y_train), (X_val, y_val) = tf.keras.datasets.cifar10.load_data()
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_val = tf.keras.utils.to_categorical(y_val, 10)
X_train = X_train.reshape(X_train.shape[0], 32, 32, 3)
X_val = X_val.reshape(X_val.shape[0], 32, 32, 3)

from matplotlib import pyplot as plt
%matplotlib inline

print('Training data shape', X_train.shape)
_, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(X_train[0].reshape(32, 32, 3));
ax2.imshow(X_train[1].reshape(32, 32, 3));
```

```python
# Example CNN used in class. Note the input_shape: CIFAR images have
# 3 color channels, so it is (32, 32, 3), not (32, 32, 1).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu',
                           input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(padding='same'),
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(padding='same'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer=tf.train.AdamOptimizer(0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

```
Training data shape (50000, 32, 32, 3)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 32, 32, 32)        2432
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 16, 16, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 16, 16, 64)        51264
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 64)          0
_________________________________________________________________
flatten_6 (Flatten)          (None, 4096)              0
_________________________________________________________________
dense_127 (Dense)            (None, 512)               2097664
_________________________________________________________________
dense_128 (Dense)            (None, 10)                5130
=================================================================
Total params: 2,156,490
Trainable params: 2,156,490
Non-trainable params: 0
_________________________________________________________________
```

(Optional) Part 4, state of the art

Currently, state-of-the-art architectures for image classification include DenseNet (https://arxiv.org/abs/1608.06993), ResNet (https://arxiv.org/abs/1512.03385), and ResNeXt (https://arxiv.org/pdf/1611.05431.pdf). Try implementing and training one of these on the CIFAR10 and CIFAR100 datasets. Feel free to experiment.

Jargon to learn about

What is “residual learning”?

What is a “bottleneck layer”?

What is a “dense block”?
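As a starting point for the first question, here is a minimal NumPy sketch of residual learning (the layer shapes and zero initialization are illustrative assumptions, not ResNet's actual convolutional blocks):

```python
# Sketch of "residual learning": a residual block computes y = x + F(x),
# so its layers only have to learn the residual F(x) = y - x rather than
# the full mapping. When F is near zero, the block is near the identity,
# which is what makes very deep networks like ResNet easier to optimize.
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(0, x @ W1)   # an inner layer with relu (dense here,
                                # conv in a real ResNet)
    return x + h @ W2           # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W1 = np.zeros((16, 16))         # F(x) == 0 with zeroed weights...
W2 = np.zeros((16, 16))
y = residual_block(x, W1, W2)   # ...so the block reduces to the identity
```

A "bottleneck layer" narrows the channel count inside such a block before widening it again, and a "dense block" (DenseNet) concatenates each layer's output with all previous outputs instead of adding them; the papers linked above define both precisely.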