## Description

CSCI 4155/6505 – Winter 2020

Assignment 1

Introduction

In this assignment you will use Python notebooks to learn about kNN and the MNIST dataset, explore the curse

of dimensionality, and try out the decision tree classifier in the sklearn library.

Questions

[8pts total] Q1. k-Nearest Neighbours Classifier on Synthetic Data

One of the important ways of both understanding and also debugging machine learning algorithms is to

create your own, synthetic data sets. In this question you will implement a k-nearest neighbours (kNN)

classifier and evaluate it on synthetic data.

a) [2pts] Create the data set. You will “train” a kNN classifier on the following training data:

• The data is 2-dimensional points in a grid, such that the x1-coordinates and x2-coordinates both range

from −1.5 to 1.5, with a spacing of 0.1 between points.

• A training point, x = (x1, x2), will be classified as follows:

– Class 1 if ‖x‖_2 ≤ 1

– Class 2 otherwise

Write code that generates this dataset and displays it using a scatterplot, using different colours for each

class.

Note ‖x‖_p = (|x1|^p + |x2|^p)^(1/p) can be implemented by: numpy.linalg.norm([x[0], x[1]], p)
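One possible sketch of the dataset generation and scatterplot for part a) is below; the variable names are illustrative, and the grid coordinates are rounded to sidestep floating-point drift from numpy.arange:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid coordinates from -1.5 to 1.5 with spacing 0.1 (rounded so values
# like 0.6 do not drift due to floating-point accumulation in arange).
coords = np.round(np.arange(-1.5, 1.51, 0.1), 1)
xx, yy = np.meshgrid(coords, coords)
X = np.column_stack([xx.ravel(), yy.ravel()])

# Class 1 where the Euclidean norm is <= 1, Class 2 elsewhere.
y = np.where(np.linalg.norm(X, axis=1) <= 1.0, 1, 2)

plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="Class 1")
plt.scatter(X[y == 2, 0], X[y == 2, 1], s=10, label="Class 2")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```

With this grid there are 31 values per coordinate, so 961 training points in total.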

b) [3pts] Implement the kNN algorithm. This may be best done using a Python class, with a train

method that takes the training data and k as input, and a predict method that takes a test example as

input and returns a class prediction. You should find the nearest neighbours by using the Euclidean distance,

‖x − y‖_2.
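A minimal class-based sketch of the kNN algorithm, assuming the train/predict interface described above (the class and method names are only suggestions):

```python
import numpy as np

class KNNClassifier:
    """Minimal kNN sketch: store the training data, predict by majority
    vote among the k nearest points under Euclidean distance."""

    def train(self, X, y, k):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        self.k = k

    def predict(self, x):
        # Euclidean distance from the test point to every training point.
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        # Indices of the k smallest distances, then a majority vote.
        nearest = np.argsort(dists)[: self.k]
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]
```

Usage: `knn = KNNClassifier(); knn.train(X, y, k=5); knn.predict([0.0, 0.0])`.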

c) [1pt] Visualize the predictions. Use the matplotlib.pyplot.contourf function to display how your

kNN classifier performs on new data. To do this, evaluate your model’s predictions across a fairly fine

numpy.meshgrid, with a spacing of about 10^−2 between points. The meshgrid can have boundaries from −1.5

to 1.5 in both coordinates. By examining multiple plots, can you say how the performance of the classifier

depends on k?
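The evaluation-and-plot step might look like the sketch below. It uses sklearn's KNeighborsClassifier purely as a stand-in so the example is self-contained; in your notebook you would substitute your own classifier from part b):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier  # stand-in for your own kNN

# Training grid from part a).
coords = np.round(np.arange(-1.5, 1.51, 0.1), 1)
gx, gy = np.meshgrid(coords, coords)
X_train = np.column_stack([gx.ravel(), gy.ravel()])
y_train = np.where(np.linalg.norm(X_train, axis=1) <= 1.0, 1, 2)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Fine evaluation meshgrid with spacing ~1e-2 over the same square.
fine = np.arange(-1.5, 1.5, 0.01)
fx, fy = np.meshgrid(fine, fine)
preds = model.predict(np.column_stack([fx.ravel(), fy.ravel()]))

plt.contourf(fx, fy, preds.reshape(fx.shape), levels=[0.5, 1.5, 2.5])
plt.gca().set_aspect("equal")
plt.show()
```

Repeating this for several values of k makes the boundary-smoothing effect of larger k visible.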

d) [1pt] Try another data set. Repeat part a), replacing the condition ‖x‖_2 ≤ 1 with ‖x‖_1 ≤ 1, and

repeat part c) on the new training data. How does the performance of the classifier change with k? (Try

some extreme values!)

e) [1pt] Continue exploring data sets. Repeat part a), replacing the condition ‖x‖_2 ≤ 1 with ‖x‖_0.4 ≤ 1,

and repeat part c) with the new data. Now how does the performance of the classifier change with k? Based on your

results, do you think using Euclidean distance for kNN is always the best choice?

[5pts] Q2. k-Nearest Neighbours Classifier on MNIST Data

In this question you will use the k-nearest neighbours (kNN) classifier implemented in Q1 to classify the

MNIST images.

a) [2pts] Explore k values. Plot the classification accuracy on train and test sets by considering different

values of k in the range of 1 to 100.


b) [1pt] Explore data set size. Plot the classification accuracy on the train and test sets for different amounts of training data (randomly select training subsets with sizes ranging from 100 to 60,000), using

the best value of k obtained in part (a).

c) [2pts] Explore image resolutions. What happens if you repeat part (a) but reduce the resolution of

the data sets?
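One simple way to reduce the resolution of the 28×28 images is block averaging; the sketch below is one possible approach (the function name and the choice of averaging are assumptions, not a prescribed method):

```python
import numpy as np

def downsample(images, factor=2):
    """Reduce 28x28 MNIST images (rows of 784 pixels) by averaging
    non-overlapping factor x factor pixel blocks. Assumes 28 is
    divisible by factor (e.g. 2 -> 14x14, 4 -> 7x7)."""
    n = images.shape[0]
    side = 28 // factor
    # Reshape so each block's pixels share two axes, then average them.
    blocks = images.reshape(n, side, factor, side, factor)
    return blocks.mean(axis=(2, 4)).reshape(n, side * side)
```

With factor=2 each image shrinks from 784 to 196 features, so part (a) can be re-run directly on the reduced data.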

MNIST dataset: The MNIST database (Modified National Institute of Standards and Technology database)

of handwritten digits consists of a training set of 60,000 examples, and a test set of 10,000 examples. Additionally, the black and white images from NIST were size-normalized and centered to fit into a 28×28 pixel

bounding box and anti-aliased, which introduced grayscale levels.

You can download the MNIST database from the following links: train set and test set

Every line of these files consists of an image, i.e., 785 numbers between 0 and 255. The first number of each

line is the label, i.e., the digit which is depicted in the image. The following 784 numbers are the pixels of

the 28 x 28 image.
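Given the file format described above (one image per line: label first, then 784 pixel values), loading could be sketched as follows; the function name and the [0, 1] pixel scaling are choices, not requirements:

```python
import numpy as np

def load_mnist_csv(path):
    """Load an MNIST CSV file: each row is a label followed by 784
    pixel values in 0..255."""
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    labels = data[:, 0].astype(int)
    images = data[:, 1:] / 255.0   # scale pixels to [0, 1]
    return images, labels
```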

[4pts total] Q3. The Curse of Dimensionality

Taken from an assignment by Roger Grosse

This question will explore what happens to the distances between randomly sampled points as the dimension

is increased.

a) [2pts] Computing expectation in 2D. Take two continuous random variables, X and Y , sampled

from a uniform distribution over the interval [0, 1]. Let Z be the squared Euclidean distance, defined by

Z = (X − Y)^2. Compute E[Z] and Var[Z] (this will require integration, which you can compute numerically

using scipy.integrate.dblquad). Explain the integrals that you are computing.
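The numerical computation could be sketched with scipy.integrate.dblquad as below. Since X and Y are independent Uniform(0, 1), E[Z] is the double integral of (x − y)^2 over the unit square, and Var[Z] = E[Z^2] − E[Z]^2:

```python
from scipy.integrate import dblquad

# E[Z] = integral over [0,1]^2 of (x - y)^2.
E_Z, _ = dblquad(lambda y, x: (x - y) ** 2, 0, 1, 0, 1)

# E[Z^2] = integral over [0,1]^2 of (x - y)^4, then Var[Z] = E[Z^2] - E[Z]^2.
E_Z2, _ = dblquad(lambda y, x: (x - y) ** 4, 0, 1, 0, 1)
Var_Z = E_Z2 - E_Z ** 2

print(E_Z, Var_Z)   # analytically 1/6 and 7/180
```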

b) [2pts] Computing expected distances in n-d. Take two continuous random variables, X and Y ,

sampled from a uniform distribution over the unit cube in d dimensions. That is, if X = (X1, X2, . . . , Xd)

then each Xi is uniformly sampled from [0, 1], independently from each other. The squared Euclidean distance

can be written as R = Z1 + Z2 + · · · + Zd, where Zi = (Xi − Yi)^2. Compute E[R] and Var[R]. You can write your

answers in terms of d, E[Z], and Var[Z]. Again, you may explore this question empirically/numerically.
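An empirical check could look like the Monte-Carlo sketch below (sample sizes and the seed are arbitrary). By independence of the Zi, one expects E[R] = d · E[Z] and Var[R] = d · Var[Z]:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_dist_stats(d, n=50_000):
    """Monte-Carlo estimates of E[R] and Var[R], where R is the squared
    Euclidean distance between two uniform points in the d-dim unit cube."""
    X = rng.random((n, d))
    Y = rng.random((n, d))
    R = ((X - Y) ** 2).sum(axis=1)
    return R.mean(), R.var()

for d in (1, 10, 100):
    mean, var = squared_dist_stats(d)
    # Independence predicts E[R] = d/6 and Var[R] = 7d/180.
    print(d, mean, var)
```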

c) [Non-marked] Compare the mean and standard deviation of R to the maximum possible distance (i.e.

the distance between opposite corners of the hypercube). How do the answers from a) and b) relate to the

claim that “in higher dimensions, most points are far away, and approximately the same distance?”

[3pts] Q4. Decision Trees

Begin by watching this online video resource about decision trees:

Use the sklearn library to run a decision tree classifier on each of the main datasets you used in the rest of

the assignment:

(a) Let Dp refer to the dataset from Question 1, for different values of p, e.g. D2 refers to the dataset

generated using the Euclidean norm, etc. Run a decision tree classifier on D1, D2, and at least 2 other

values of p.

(b) Write code (not using built-in sklearn functionality) to split the MNIST training set to get a validation

set. Then, using sklearn’s built-in DecisionTreeClassifier, write a function which tries 4-6 different

values for max_depth and both information gain ("entropy") and Gini impurity ("gini") split criteria, and outputs the accuracies

on the train, on the validation, and on the test sets.[1] What would be the meta-parameter settings that you

would normally choose? Note that usually you would not run the classifier on the test set! This is simply a

chance to experiment with (i.e. take a peek at) how the performance on the test set would change. Briefly

[1] That is, you can use the built-in classifier, but the validation code is your own.


summarize your observations, making sure to include clear description of any other meta-parameter choices

that you made.
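A possible sketch of both pieces of part (b) is below: a hand-rolled split (no sklearn) plus a small tuning loop over max_depth and split criterion. Function names, the validation fraction, and the depth grid are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Hand-rolled split: shuffle indices, carve off a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val, tr = idx[:n_val], idx[n_val:]
    return X[tr], y[tr], X[val], y[val]

def tune_tree(X_train, y_train, X_val, y_val, X_test, y_test,
              depths=(2, 4, 8, 16), criteria=("entropy", "gini")):
    """Try each max_depth / criterion pair and report accuracies.
    ("entropy" is sklearn's name for the information-gain criterion.)"""
    results = []
    for criterion in criteria:
        for depth in depths:
            clf = DecisionTreeClassifier(max_depth=depth,
                                         criterion=criterion,
                                         random_state=0)
            clf.fit(X_train, y_train)
            results.append((criterion, depth,
                            clf.score(X_train, y_train),
                            clf.score(X_val, y_val),
                            clf.score(X_test, y_test)))
    return results
```

Normally you would pick the meta-parameters with the best validation accuracy; the test-set column is only for the one-off peek the question describes.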

General Marking Notes and Tips

• You will be marked for accuracy, correctness and also clarity of presentation where relevant.

• Number each markdown cell in your notebook, so that the marker can reference their feedback to

individual cells.

• Be selective in the experimental results you present. You don’t need to present every single experiment

you carried out. It is best to summarize some of your work and present only the interesting results,

e.g. where the ML model behaves differently.

• In your answers, you want to demonstrate skill in using any required libraries to experiment with ML

models, and good understanding / interpretation of the results. Make it easy for the marker to see

your skill and your understanding.

• Justify your meta-parameter choices where appropriate.

• Except where you are explicitly asked to implement a particular algorithm/step yourself, if there is a

procedure in sklearn that does the task, you are free to find it, read its documentation and use it. If

you are not sure, it is best to ask on the shared notes or in class!

• You should submit one python notebook with your answers.

• Liberally add markdown cells to explain your code and experimental results. Make good use of the

formatting capabilities of markdown to make your notebook highly readable.

• Add links to all online resources you use (markdown syntax is: [anchor text](URL) ). Avoid explicit

long URLs in your markdown cells.
