B 551 Assignment 4: Machine Learning

The last assignment of the course will give you a chance to implement and develop some machine learning

classification algorithms from scratch and test out your implementations on various standard datasets.

For this assignment you will be working individually and not in groups as was the case for assignments

1-3. For this reason and because of the reduced time provided before the deadline, this assignment is shorter

than the previous few assignments.

Additionally, because the assignment is due right before finals week, we urge you to please start early and

ask questions on Q&A Community or in office hours if you require any assistance.

Guidelines for this Assignment

Please read the instructions below carefully as there may be a few changes in the guidelines

compared to previous assignments. It is your responsibility to be up-to-date with these guidelines,

and to maintain fairness between all students, we cannot accept any submissions that do not follow the

instructions given here.

Coding Requirements. For fairness and efficiency, we use a semi-automatic program to grade your submissions. This means you must write your code carefully so that our program can run your code and understand

its output properly. As usual, we require the following.

1. Your code for the assignment must be written in Python 3, not Python 2. The skeleton

code is already written in Python 3, but inserting Python 2 code that is not compatible with Python

3 will cause the program and autograding to fail.

2. Use the skeleton code that is provided for you and follow the instructions in the skeleton

code and the specifications laid out in the assignment below. This means that you should not

change file names for your python scripts nor change the parameters for functions that are already in

the base skeleton code. Our autograding scripts may call these functions directly in order to test your

implementations, and changing the file name or changing the parameters of these functions will cause

the scripts to fail and you will likely lose points. Of course, feel free to create new functions as needed,

but keep in mind that the functions in the base skeleton code must work as intended. Finally, please

avoid using global (public) variables in your programs as they may cause your program to behave

unexpectedly when run by our grading program.

3. You may import Python modules from the Python Standard Library for routines not

related to A.I. This includes items such as basic sorting algorithms and data structures like queues,

as long as they are already installed on the SICE Linux servers. For this assignment, no other packages

are allowed to be imported for use with your machine learning implementations from scratch other than

numpy. There are more details regarding this in the assignment description below.

4. Your code must work on the silo.sice.indiana.edu server. We will test your code on this

system, so this is the only way to guarantee that your program will work correctly when we run it.

Minor differences in Python versions, available modules and packages, etc. between the Python on

your system and that of the SICE Linux servers may cause your code to work differently and may

seriously affect your grade.

Coding Style and Documentation. We will not explicitly grade based on coding style, but it’s important

1

that you write your code in a way that we can easily understand it. Please use descriptive variable and

function names, and use comments when needed to help us understand code that is not obvious.

Report. For this assignment, we will require a written report that summarizes your programming solutions

and that answers any specific questions that we pose in the problem descriptions below. Please put the

report in the README.md file in your GitHub repository. Reports that are located somewhere else will not be

graded. For each programming problem, your report should include: (1) the answers to the questions that

are asked in the assignment description, if any; (2) a brief description of how you formulated each problem;

(3) a brief description of how your program works; (4) and a discussion of any problems you faced and

any assumptions, simplifications, and/or design decisions you made. These comments will help us better

understand your code and your implementations. They are especially important if your code does not work

perfectly, since it is a chance to document the energy and thought you put into your solution that may not

otherwise be reflected by the code itself.

Academic Integrity. We take academic integrity very seriously, and especially so for this assignment. To

maintain fairness to all students in the class and integrity of our grading system, we will compare your code

against other sources and prosecute any academic integrity violations that we discover. Before beginning

this assignment, make sure you are familiar with the Academic Integrity policy of the course, as stated in the

Syllabus, and ask us about any doubts or questions you may have. To briefly summarize, you may discuss the

assignment with other people at a high level (e.g., discussing general strategies to solve the problem, talking

about Python syntax and features, etc.). You may also consult printed and/or online references, including

books, tutorials, etc., but you must cite these materials (e.g., in source code comments). We expect that

you’ll write your own code and not copy anything from anyone else, including online resources. However,

if you do copy something (e.g., a very small bit of code that you think is particularly clever), you have to

make it explicitly clear which parts were copied and which parts were your own. You can do this by putting

a very detailed comment in your code, marking the line above which the copying began, and the line below

which the copying ended, and a reference to the source. Code marked in this manner should not on its own

directly implement and solve the problem at hand but should be reserved for little code segments that assist

your own code in solving the problem. Any code that is not marked in this way must be your own, which

you personally designed and wrote. You may not share written answers or code with any other students,

nor may you possess code written by another student, either in whole or in part, regardless of format.

Assignment Description

Please read the guidelines above before starting on the assignment’s problems as described below.

In this assignment, your main task will be to implement two machine learning classification algorithms from

scratch: k-nearest neighbors and multilayer perceptron (which is a class of a feedforward artificial

neural networks). Your implementations from scratch for these classifiers will be tested on various datasets,

and the performance of them will be compared to the respective implementations from the popular machine learning package scikit-learn. Additionally, you will have to implement some utility functions that

these machine learning algorithms rely on, including activation functions for the neurons in the multilayer

perceptron network and distance formulas used during k-nearest neighbors.

You are required to implement these classifiers and utility functions from scratch. As a result, you are not

allowed in any capacity to use any pre-implemented packages such as scikit-learn or scipy (using numpy

is fine and is in fact highly encouraged). We know that these pre-implemented packages will likely perform

much better, but the point of the assignment is to dig deep into the machine learning algorithms and really

learn how they work under the hood. The skeleton code we provide helps take care of details such as class

and function definitions and leaves it to you to write the necessary code to implement the functions related

to machine learning (that is, the functions you must implement will be clearly indicated).

2

Part 0: Getting Started

As usual, you will find that a GitHub repository has been created for you. You can find it by logging

into IU GitHub at https://github.iu.edu/. In the upper left hand corner of the screen, you should see

a pull-down menu. Select cs-b551-fa2021. Then, under Repositories, you should see a repository called

userid -a4, where userid should be your IU username. To get started, clone the GitHub repository using

one of the two commands below in your command line terminal.

git clone https://github.iu.edu/cs-b551-fa2021/userid -a4.git

git clone git@github.iu.edu:cs-b551-fa2021/userid -a4.git

For this assignment, the usage of the package numpy is highly encouraged, and the skeleton code is already written to use numpy by default because it will make this assignment much easier and make your

implementations better and faster. numpy is a fundamental package for scientific computing in Python

that provides a multidimensional array object (ndarray), various derived objects, and an assortment of

routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation

and much more. Instructions for installing numpy if you do not have it on your system can be found at

https://numpy.org/install/, and a quickstart guide to get yourself familiarized on how numpy works can

be found at https://numpy.org/doc/stable/user/quickstart.html.

Part 1: K-Nearest Neighbors Classification

In the machine learning world, k-nearest neighbors is a type of non-parametric supervised machine learning

algorithm that is used for both classification and regression tasks. For classification, the principle behind

k-nearest neighbors is to find k training samples that are closest in distance to a new sample in the test

dataset, and then make a prediction based on those samples.

These k closest neighbors are used to try and predict the correct discrete class for a given test sample.

This prediction is typically done by a simple majority vote of the k nearest neighbors of each test sample;

in other words, the test sample is assigned the data class which has the most representatives within the k

nearest neighbors of the sample. An alternative method for prediction is to weigh the neighbors such that

the nearer neighbors contribute more to the fit than do the neighbors that are further away. For this, a

common choice is to assign weights proportional to the inverse of the distance from the test sample to the

neighbor. The distance can, in general, be any metric measure, but the standard Euclidean distance and

Manhattan distance metrics are the most common choices.

k-nearest neighbors is also known as a non-generalizing machine learning method since it simply “remembers”

all of its training data as opposed to other methods that update specific coefficients that fit a model to the

training data.

What to Do. Your goal in this part is to implement a k-nearest neighbors classifier from scratch. Your

GitHub repository contains the skeleton code for two files that will be used to implement the algorithm:

utils.py and k_nearest_neighbors.py.

The utils.py file contains helpful utility functions that will be used by the machine learning algorithms.

For this part, the only functions you need to concern yourself with are the functions euclidean_distance

and manhattan_distance.

The k_nearest_neighbors.py file defines the KNearestNeighbors class that we will use to implement the

algorithm from scratch. As you can see, the __init__ function has already been properly implemented

for you. This function is run whenever a new KNearestNeighbors object is created, and it checks the

arguments passed in for the parameters in addition to setting up the class attributes based on those arguments. The attributes for the class itself are described in detail in the skeleton code. When creating the

3

KNearestNeighbors object, the following parameters must be specified (or their default values will be used):

• n_neighbors: the number of neighbors a sample is compared with when predicting target class values

(analogous to the value k in k-nearest neighbors).

• weights: represents the weight function used when predicting target class values (can be either

‘uniform’ or ‘distance’). Setting the parameter to ‘distance’ assigns weights proportional to

the inverse of the distance from the test sample to each neighbor.

• metric: represents which distance metric is used to calculate distances between samples. There are

two options: ‘l1’ or ‘l2’, which refer to the Manhattan distance and Euclidean distance respectively.

Between these two files, there are four functions that you are required to implement. The four functions

currently raise a NotImplementedError in order to clearly indicate which functions from the skeleton code

you must implement yourself. Comment out or remove the raise NotImplementedError(…) lines and

implement each function so that they work as described in the documentation. You may assume that the

input data features are all numerical features and that the target class values are categorical features. As

a reminder from the guidelines, feel free to create other functions as you deem necessary, but the functions

defined by the skeleton code itself must work as intended for your final solution, and therefore you cannot

change the parameters for these functions. This is required because we may call any of these functions

directly when testing your code. The four functions you must implement and their descriptions are as

follows:

• euclidean_distance(x1, x2) in utils.py: computes and returns the Euclidean distance between

two vectors.

• manhattan_distance(x1, x2) in utils.py: computes and returns the Manhattan distance between

two vectors.

• fit(X, y) in k_nearest_neighbors.py: fits the model to the provided data matrix X and targets y.

• predict(X) in k_nearest_neighbors.py: predicts class target values for the given test data matrix

X using the fitted classifier model.

Testing your Implementation. We have provided you with a driver program called main.py that allows

you to run and test your implementation of the KNearestNeighbors class and its associated functions.

This driver program will run your implementation multiple times on two different datasets with different

arguments set for the class’ parameters. Alongside your implementation, the corresponding scikit-learn

implementation will be run on the same datasets and with the same arguments, allowing you to directly

compare the accuracy scores achieved by your program with those achieved by the standard scikit-learn

implementation.

In order to run the program, you must have the following packages installed on your system: numpy,

pandas, and scikit-learn. You should already have numpy installed from the previous instructions.

To install pandas, please see the instructions at https://pandas.pydata.org/docs/getting_started/

install.html. To install scikit-learn, please see the instructions at https://scikit-learn.org/stable/

install.html.

To test your k-nearest neighbors implementation, enter the command shown below on your terminal.

python3 main.py knn

If you have not implemented one of the required functions (or have forgotten to remove the line that raises

a NotImplementedError), then the driver program will terminate unsuccessfully. A successful program call,

upon finishing, will result in the output shown below.

4

$ python3 main.py knn

Loading the Iris and Digits Datasets…

Splitting the Datasets into Train and Test Sets…

Standardizing the Train and Test Datasets…

Testing K-Nearest Neighbors Classification…

– Iris Dataset Progress: [============================================================] 100%

– Digits Dataset Progress: [============================================================] 100%

Exporting KNN Results to HTML Files…

Done Testing K-Nearest Neighbors Classification!

Program Finished! Exiting the Program…

You should now see two new HTML files in your project directory. These HTML files contain tables

depicting the accuracy scores of both your implementation and the scikit-learn implementation for all

of the parameters tested. These files, called knn_iris_results.html and knn_digits_results.html, are

accordingly named based on the dataset that was used for testing.

Open both files using an internet browser and you will see that a table was generated from the results of the

driver program. In the table, there is a blank row in between each pair of implementations for readability. If

you have implemented the four functions correctly, the accuracy scores computed for your implementation

and the scikit-learn implementation should be very close to each other for all of the cases. Note that

the accuracy scores may not be exactly the same due to slight differences in the implementations and the

stochastic nature of machine learning algorithms.

Part 2: Multilayer Perceptron Classification

In machine learning, the field of artificial neural networks is often just called neural networks or multilayer

perceptrons. As we have learned in class, a perceptron is a single neuron model that was a precursor to the

larger neural networks that are utilized today.

The building blocks for neural networks are neurons, which are simple computational units that have input

signals and produce an output signal using an activation function. Each input of the neuron is weighted

with specific values, and while the weights are initially randomized, it is usually the goal of training to

find the best set of weights that minimize the output error. The weights can be initialized randomly to

small values, but more complex initialization schemes can be used that can have significant impacts on the

classification accuracy of the models. A neuron also has a bias input that always has a value of 1.0 and it

too must be weighted. These weighted inputs are summed and passed through an activation function, which

is a simple mapping that generates an output value from the weighted inputs. Some common activation

functions include the sigmoid (logistic) function, the hyperbolic tangent function, or the rectified linear unit

function.

These individual neurons are then arranged into multiple layers that connect to each other to create a

network called a neural network (or multilayer perceptron). The first layer is always the input layer that

represents the input of a sample from the dataset. The input layer has the same number of nodes as the

number of features that each sample in the dataset has. The layers after the input layer are called hidden

layers because they are not directly exposed to the dataset inputs. The number of neurons in a hidden layer

can be chosen based on what is necessary for the problem. The neurons in a specific hidden layer all use

the same activation function, but different layers can use different ones. Multilayer perceptrons must have

at least one hidden layer in their network.

The final layer is called the output layer and it is responsible for outputting values in a specific format. It is

5

common for output layers to output a probability indicating the chance that a sample has a specific target

class label, and this probability can then be used to make a final clean prediction for a sample. For example, if

we are classifying images between dogs and cats, then the output layer will output a probability that indicates

whether dog or cat is more likely for a specific image that was inputted to the neural network. The nature

of the output layer means that its activation function is strongly constrained. Binary classification problems

have one neuron in the output layer that uses a sigmoid activation function to represent the probability

of predicting a specific class. Multi-class classification problems have multiple neurons in the output layer,

specifically one for each class. In this case, the softmax activation function is used to output probabilities

for each possible class, and then you can select the class with the highest probability during prediction.

Before training a neural network, the data must be prepared properly. Frequently, the target class values are

categorical in nature: for example, if we are classifying pets in an image, then the possible target class values

might be either dog, cat, or goldfish. However, neural networks usually require that the data is numerical.

Categorical data can be converted to a numerical representation using one-hot encoding. One-hot encoding

creates an array where each column represents a possible categorical value from the original data (for the

image pet classification, one-hot encoding would create three columns). Each row then has either 0s or 1s in

specific positions depending on the class value for that row. Here is an example of one-hot encoding using

the dog, cat, or goldfish image classification example, where we are using five test samples and looking at

their target class values.

y

dog

cat

cat

goldfish

dog

one-hot encoding

−−−−−−−−−−−−−→

ydog ycat ygoldfish

1 0 0

0 1 0

0 1 0

0 0 1

1 0 0

In this assignment, we will specifically be focusing on multilayer perceptron neural networks that are feedforward, fully-connected, and have exactly three layers: an input layer, a hidden layer, and an output layer.

A feedforward fully-connected network is one where each node in one layer connects with a certain weight

to every node in the following layer. A diagram of such a neural network is shown below, where the input

layer has five nodes corresponding to five input features, the hidden layer has four neurons, and the output

layer has three neurons corresponding to three possible target class values. The bias terms are also added

on as nodes named with subscript of b.

xb

x1

x2

x3

x4

x5

ab

a1

a2

a3

a4

y1

y2

y3

Input Layer

5 units

Hidden Layer

4 units

Output Layer

3 units

6

Once the data is prepared properly, training occurs using batch gradient descent. During each iteration,

forward propagation is performed where training data inputs go through the layers of the network until

an output is produced by the output layer. Frequently, the cross-entropy loss is calculated using this output

and stored in a history list that allows us to see how quickly the error reduces every few iterations. The

output from the output layers is then compared to the expected output (the target class values) and an

error is calculated. The output error is then propagated back through the network one layer at a time,

and the weights are updated according to the amount that they contributed to the error. This is called

backward propagation. A parameter called the learning rate is typically used to control how much to

change the model in response to the estimated error each time the model weights are updated. Once the

maximum number of iterations is reached, the neural network is finished training and it can be used to make

new predictions. A prediction is made by using new test data and computing an output using forward

propagation. When there are multiple output neurons, the output with the highest softmax value is chosen

as the predicted target class value.

What to Do. Your goal in this part is to implement a feedforward fully-connected multilayer perceptron

classifier with one hidden layer (as shown in the description above) from scratch. As before, your GitHub

repository contains the skeleton code for two files that will be used to implement the algorithm: utils.py

and multilayer_perceptron.py.

This time, the functions you need to concern yourself with in the utils.py file are the unimplemented

functions defined after the distance functions. Specifically, these functions are: identity, sigmoid, tanh,

relu, cross_entropy, and one_hot_encoding.

The multilayer_perceptron.py file defines the MultilayerPerceptron class that we will use to implement

the algorithm from scratch. Just like the previous part, the __init__ function has already been properly

implemented for you. The attributes for the class itself are described in detail in the skeleton code. When

creating the MultilayerPerceptron object, the following parameters must be specified (or their default

values will be used):

• n_hidden: the number of neurons in the one hidden layer of the neural network.

• hidden_activation: represents the activation function of the hidden layer (can be either ‘identity’,

‘sigmoid’, ‘tanh’, or ‘relu’).

• n_iterations: represents the number of gradient descent iterations performed by the fit(X, y)

method.

• learning_rate: represents the learning rate used when updating neural network weights during gradient descent.

Between these two files, there are nine functions that you are required to implement for this part.

The nine functions currently raise a NotImplementedError in order to clearly indicate which functions

from the skeleton code you must implement yourself. Like before, comment out or remove the raise

NotImplementedError(…) lines and implement each function so that they work as described in the documentation. You may assume that the input data features are all numerical features and that the target

class values are categorical features. As a reminder from the guidelines, feel free to create other functions

as you deem necessary, but the functions defined by the skeleton code itself must work as intended for your

final solution, and therefore you cannot change the parameters for these functions. This is required because

we may call any of these functions directly when testing your code. The nine functions you must implement

and their descriptions are as follows:

• identity(x, derivative = False) in utils.py: computes and returns the identity activation function of the given input data x. If derivative = True, the derivative of the activation function is

returned instead.

7

• sigmoid(x, derivative = False) in utils.py: computes and returns the sigmoid (logistic) activation function of the given input data x. If derivative = True, the derivative of the activation

function is returned instead.

• tanh(x, derivative = False) in utils.py: computes and returns the hyperbolic tangent activation

function of the given input data x. If derivative = True, the derivative of the activation function is

returned instead.

• relu(x, derivative = False) in utils.py: computes and returns the rectified linear unit activation

function of the given input data x. If derivative = True, the derivative of the activation function is

returned instead.

• cross_entropy(y, p) in utils.py: computes and returns the cross-entropy loss, defined as the negative log-likelihood of a logistic model that returns p probabilities for its true class labels y.

• one_hot_encoding(y) in utils.py: converts a vector y of categorical target class values into a one-hot

numeric array using one-hot encoding: one-hot encoding creates new binary-valued columns, each of

which indicate the presence of each possible value from the original data.

• _initialize(X, y) in multilayer_perceptron.py: function called at the beginning of fit(X, y)

that performs one-hot encoding for the target class values and initializes the neural network weights

(_h_weights, _h_bias, _o_weights, and _o_bias).

• fit(X, y) in multilayer_perceptron.py: fits the model to the provided data matrix X and targets

y.

• predict(X) in multilayer_perceptron.py: predicts class target values for the given test data matrix

X using the fitted classifier model.

Testing your Implementation. As with the previous part, running the driver program main.py allows

you to test your implementation of the MultilayerPerceptron class and its associated functions. Assuming

you have already installed the three packages (numpy, pandas, and scikit-learn as discussed before), you

can test your multilayer perceptron implementation by entering the command shown below on your terminal.

python3 main.py mlp

You can also test both your k-nearest neighbors implementation and your multilayer perceptron implementation one after the other by entering the following command.

python3 main.py all

You should see the following terminal output if the driver program has tested your multilayer perceptron

implementation without runtime errors.

8

$ python3 main.py mlp

Loading the Iris and Digits Datasets…

Splitting the Datasets into Train and Test Sets…

Standardizing the Train and Test Datasets…

Testing Multilayer Perceptron Classification…

– Iris Dataset Progress: [============================================================] 100%

– Digits Dataset Progress: [============================================================] 100%

Exporting MLP Results to HTML Files…

Done Testing Multilayer Perceptron Classification!

Program Finished! Exiting the Program…

Just like before, you should now see two new HTML files in your project directory, this time named

mlp_iris_results.html and mlp_digits_results.html. The format of the generated tables are the same

as before, but with different columns representing the parameters of the MultilayerPerceptron class. If you

have correctly implemented the multilayer perceptron, the accuracy scores computed for your implementation and the scikit-learn implementation should be somewhat similar in most cases. However, unlike with

k-nearest neighbors, you may see some more substantial variation in the accuracy scores between

the implementations because the way that the model weights are initialized can make a big difference in

the final prediction accuracy score of the model. Your implementation is likely just fine as long as you are

seeing a decent number of cases where your implementation and the scikit-learn implementation accuracy

scores are close to each other. If there are any concerns regarding this, feel free to make a Q&A Community

post about it and the course staff will help you out.

Part 3: Turn in Your Work

Turn in your source code files (utils.py, k_nearest_neighbors.py, and multilayer_perceptron.py) and

the report your wrote in the README.md file by pushing the files to GitHub (remember to add, commit, push).

## Reviews

There are no reviews yet.