CS 383 – Machine Learning

Assignment 2 – Clustering

Introduction

In this assignment you’ll work on clustering data.

You may not use any functions from machine learning library in your code, however you may use

statistical functions. For example, if available you MAY NOT use functions like

• kmeans

however you MAY use basic statistical functions like:

• std

• mean

• cov

• eig

• In addition, since the spirit of this assignment is not (necessarily) about feature reduction, you

may use functions like pca.

Grading

Although all assignments will be weighed equally in computing your homework grade, below is the

grading rubric we will use for this assignment:

Part 1 (Theory) 0pts

Part 2 (K-Means Clustering) 75pts

Part 3 (Purity) 25pts

TOTAL 100pts

Table 1: Grading Rubric

1

DataSets

Pima Indians Diabetes Data Set In this dataset of 768 instances of testing Pima Indians for

diabetes each row has the following information (1 class label, 8 features).

0. Class Label (-1=negative,+1=positive)

1. Number of times pregnant

2. Plasma glucose concentration

3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thinkness (mm)

5. Insulin (mu U/ml)

6. Body mass index (kg/m2

)

7. Diabetes pedigree function

8. Age (yrs)

Data obtained from: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

2

1 Theory

None

3

2 Clustering

Let’s implement our own version of k-means to cluster data!

Write a script that:

1. Reads in the data. For demonstrative purposes, you will be using the dataset mentioned in

Datasets section (i.e. the diabetes dataset). However, given another dataset’s observable data

matrix, X, your code should still work.

2. Separates the class label from the observable data to form matrices Y and X, respectively.

3. Standardizes the features.

4. Passes the observable data X to a (user-created) function called myKMeans that takes two

parameters:

(a) The data to cluster, X

(b) The number of clusters, k

The myKMeans function :

Your myKMeans function should run k-means on the supplied data and value of k. Since we’ll

visualize the clustering process, a few caveats:

• Although theoretically your code should work for any value of 1 <= k <= N, we’ll constrain

k to have a maximum value of 7, so that you will only need to define up to 7 colors.

• Although your code should be written in such a way to be able to cluster in any dimensional

space, for visualization purposes, if your data’s dimenionality is greater than 3, first reduce its

dimensionality to 3 using PCA. You may use the code you developed in HW1 to do this, or a

PCA function from your linear algebra library.

In order to visualize the cluster process, you’ll generate a video. Choose a frame rate that is

natural for observing the changes in the association. Each frame of the video will be a plot showing

the current clustering. If X has only 2 features, this will be a 2D plot. If it has 3 features (either

natively, or after PCA reduction), then this will be a 3D plot. If possible (that is, if the graphics

library you are using allows it), try to have any 3D plots rotated so that it is easier to see what’s

going on.

For each frame/plot you should:

1. Display the current reference vectors as solid circles and the observations as x’s.

2. The reference vectors and associated observations should have the same colors. For instance, if

you use blue to represent cluster 1, then its reference vector should be a solid blue circle, and

all observations associated with it should be blue x’s.

3. At the top of your graph you should state the iteration number.

Figures 1-2 show the initial and terminal clusterings when k = 2.

4

Implementation Details

1. Seed your random number generator with zero (do this right before running your k-means).

2. Randomly select k observations and use them for the initial reference vectors. I suggest you

use randomize the indices and use the first k randomized indices to select the observations.

3. Use the L2 distance (Euclidean) to measure the distance between observations and reference

vectors.

4. Terminate the training process when the sum of magnitude of change of the cluster centers (from

the previous iteration to the current one) is less than = 2−23. That is, when Pk

i=1 d(ai(t −

1), ai(t)) < where k is the number of clusters, ai(t) is the reference vector for cluster i at

time t and d(x, y) is the L1 distance (Manhattan) between vectors x and y (as defined in the

Similarity and Distance Functions link on BBlearn).

Figure 1: Initial Clustering

5

Figure 2: Final Clustering

6

3 Purity

Augment your code from Part 1 to include the purity of the clusterings at each iteration on the plot.

In order to compute the purity, use the class label information (-1=negative for diabetes, +1=positive

for diabetes). Add the current purity value to the text displayed on your figures.

7

Submission

For your submission, upload to Blackboard a single zip file containing:

1. Source Code – Including any necessary makefiles, etc..

2. readme.txt file – The readme.txt file should contain information on how to run your code to

reproduce your results.

3. Sample videos of at least three different runs where you vary k and/or the features used. At

least one of these must use more than two features. If you were able to get the purity part

working, then these videos will include the purity in text within each frame. The video filenames

should be in the format

K [k] F [f].[ext]

where:

(a) [k] is the value of k passed to your function.

(b) [f] is an underscore-separated list of the features passed in to your kmeans function. If

you used all the features, just state all.

(d) [ext] is the file extension.

For example, in the example in Part 2, the file name would be

K 2 F all.avi

Do not include spaces or special characters (other than the underscore character) in your file and

directory names. Doing so may break our grading scripts.

Since there is no theory component to this assignment, and the visualization is a video, there is no

PDF for this assignment.

8

Sale!

# Assignment 2 – Clustering

$30.00

## Reviews

There are no reviews yet.