## Description

COM S 474/574

Introduction to Machine Learning

Homework 2

1 Directions:

• Due: Thursday February 27, 2020 at 10pm. Late submissions will be accepted

for 24 hours after that time, with a 15% penalty.

• Upload the homework to Canvas as a pdf file. Answers to problems 1-4 can be

handwritten, but writing must be neat and the scan should be a high-quality image.

Other responses should be typed or computer generated.

• Any non-administrative questions must be asked in office hours or (if a brief response

is sufficient) Piazza.

2 Problems

In this homework, we will focus on using nearest neighbor methods and logistic regression to

classify (predict a discrete-valued feature y, such as y ∈ {0, 1}).

Problem 1. [5 points] Suppose you are predicting feature y using feature x with logistic

regression, and x is measured in kilometers. After fitting, you get coefficients β0 = 1.24 and

β1 = −3.74. Thus, your model is

$$\mathrm{Prob}(y = 1 \mid x) = \frac{e^{1.24 - 3.74x}}{1 + e^{1.24 - 3.74x}}.$$

Suppose our friend Sammie has an innate fear of the metric system, starts with the same

data set, converts the x values to miles, does not change y values, and then fits. What will

Sammie’s β0 and β1 be?

Problem 2. [15 points] Book problem Chapter 4, #4 “When the number of features . . . ”

Problem 3. [10 points] Book problem Chapter 4, #6 “Suppose we collect data . . . ”

Problem 4. [5 points] Book problem Chapter 4, #8 “Suppose that we take a data set . . . ”

Note: For the following problem, instead of reporting training/testing loss, for simplicity you

will be asked to report accuracy. Accuracy is the percentage of samples that were correctly

labeled as ŷ = 0 or ŷ = 1.
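The accuracy definition above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the helper name `accuracy` and the label arrays are made up for the example and are not part of the homework data.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Made-up labels for illustration only (not from HW2train or HW2test).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 4 of the 5 labels match
```

Multiply the result by 100 if you want to report a percentage.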

Problem 5. [65 points]

A. Download the data sets HW2train.csv and HW2test.csv from Canvas. In both files,

the first column is a binary-valued feature y. The second column is a continuous-valued

feature x. Make a scatter-plot of the data-set HW2train with y values on the vertical

axis, x values on the horizontal axis.

B. Fit a logistic model to predict y. Use the whole data set HW2train. If you are

using Python, you can use https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

• The argument ‘penalty’ controls whether the coefficients are penalized. For this assignment, we will use no penalty (‘penalty’=None in recent scikit-learn versions; older versions used ‘penalty’='none').

• Set the argument ‘fit_intercept’=True to add a β0 (provided that we do not include a column of ones when we call the .fit() function).

(1) Report the β0 and β1 values you obtain.

(2) Report the accuracy for HW2train (you can do this with the .score() function).

(3) Also, make a copy of the scatter-plot of the data-set HW2train plot. Add the

function Prob(y = 1|x) on the plot.

• To plot the Prob(y = 1|x) function, you can first generate uniformly spaced values along the horizontal axis, such as with https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html. 1000 values evenly spaced between 0 and 100 should be enough for a good picture.

• Then determine Prob(y = 1|x) for each of those evenly spaced points. One way to obtain this is the .predict_proba() function (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba) with those 1000 evenly spaced values as inputs. That function returns an array with Prob(y = 0|x) in the first column and Prob(y = 1|x) in the second; use the second column. Alternatively, you can use the β0 and β1 coefficients to calculate Prob(y = 1|x) yourself.

The scatter plot markers of the HW2train data should be plotted on top of the function Prob(y = 1|x) (i.e. in the foreground), so they are not painted over by the Prob(y = 1|x) curve. Title this plot ‘HW2train Scatter Plot and Prob(y = 1|x)’.

(4) Next make another scatter plot titled ‘HW2test Scatter Plot and Prob(y = 1|x)’

just like the previous plot except where you plot the HW2test data instead of

HW2train data. Include the same Prob(y = 1|x) function as in the previous plot.

(5) Report the total accuracy for the HW2test data.
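One possible way to draw the probability curve with the data markers in the foreground is sketched below. It computes Prob(y = 1|x) directly from the coefficients and uses Matplotlib's `zorder` argument to control layering. The coefficients, data points, helper name `prob_y1`, and output filename are hypothetical placeholders, not values from the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt

def prob_y1(x, beta0, beta1):
    """Logistic model: Prob(y = 1 | x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * np.asarray(x, dtype=float))))

# Hypothetical coefficients and data, for illustration only.
beta0, beta1 = -5.0, 0.1
x_data = np.array([10.0, 30.0, 45.0, 55.0, 70.0, 90.0])
y_data = np.array([0, 0, 1, 0, 1, 1])

# 1000 evenly spaced points between 0 and 100, as suggested in the text.
x_grid = np.linspace(0, 100, 1000)
p_grid = prob_y1(x_grid, beta0, beta1)

fig, ax = plt.subplots()
# Lower zorder is drawn first, so the curve sits behind the markers.
ax.plot(x_grid, p_grid, zorder=1, label="Prob(y = 1|x)")
ax.scatter(x_data, y_data, zorder=2, color="black", label="data")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("HW2train Scatter Plot and Prob(y = 1|x)")
ax.legend()
fig.savefig("hw2train_logistic.png")  # hypothetical output filename
```

The expression in `prob_y1` is algebraically the same as e^(β0 + β1 x) / (1 + e^(β0 + β1 x)). For the test-data plot, only the scattered points and the title would change.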

C. Now we will try k-nearest neighbors. Our predictions will be based on the data set

HW2train and odd values of k, using majority vote. Since we only have a single

(one-dimensional) feature, x, we will measure the distance between sample i and a new

sample using the absolute difference, |x(i) − x(new)|.
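A manual version of this rule could look like the sketch below: compute absolute differences, take the k closest training samples, and use a majority vote. It assumes 0/1 labels and odd k; the function name `knn_predict` and the tiny data set are invented for illustration. Training points at exactly equal distance are ordered by their position in the training array, which is an arbitrary choice made here.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Majority-vote k-NN for one 1-D feature and 0/1 labels, with odd k."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Pairwise absolute differences |x(i) - x(new)|, shape (n_new, n_train).
    dist = np.abs(x_new[:, None] - x_train[None, :])
    # Indices of the k closest training samples for each new sample.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    # Number of neighbors labeled 1; with odd k there are no vote ties.
    votes = y_train[nearest].sum(axis=1)
    return (votes > k / 2).astype(int)

# Invented toy data, not the homework data.
x_train = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
y_train = [0, 0, 1, 1, 1, 0]
print(knn_predict(x_train, y_train, [2.5, 9.5], k=3))
```

With k = 1 and distinct training x values, every training point is its own nearest neighbor, so the training predictions reproduce the training labels.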

You can implement knn manually or by using built-in functions. For Python, you can use https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClasshtml#sklearn.neighbors.KNeighborsClassifier. Some usage notes:

• The argument ‘n_neighbors’ is k, so set ‘n_neighbors’=1 when you just want to use the nearest neighbor.

• Set the argument ‘weights’='uniform' (for this assignment we will just use uniform weights, but we encourage you to explore what happens with ‘weights’='distance', which uses a built-in distance weighting, or with a weighting you make yourself).

• The argument ‘algorithm’ affects how many distances are computed to find the nearest neighbors for a new sample. Use 'auto' for this assignment.
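The usage notes above might translate into code like the following sketch. It uses synthetic stand-in data rather than the HW2 files, and the variable names are illustrative. The default distance metric of `KNeighborsClassifier` is Euclidean, which for a single feature equals the absolute difference.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the homework data).
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 100, size=150)
y_train = (x_train + rng.normal(0, 15, size=150) > 50).astype(int)
x_test = rng.uniform(0, 100, size=50)
y_test = (x_test + rng.normal(0, 15, size=50) > 50).astype(int)

# scikit-learn expects 2-D feature arrays of shape (n_samples, n_features).
X_train = x_train.reshape(-1, 1)
X_test = x_test.reshape(-1, 1)

accuracies = {}
for k in (1, 3, 9):
    knn = KNeighborsClassifier(n_neighbors=k, weights="uniform", algorithm="auto")
    knn.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for this k.
    accuracies[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(k, accuracies[k])
```

The same loop extends to other odd k values by changing the tuple of k values.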

(1) For each value of k ∈ {1, 3, 9}

a. Fit the knn classifier using the HW2train data set. Report the training accuracy (if using Python, you can use the .score() function); briefly mention how you calculated it, such as whether you used .score() or some other way.

b. Make a plot of the classifier’s prediction function ŷ(x). This should be a

step-function (piece-wise constant), though your plot of the function can have

steeply slanted lines instead of perfectly vertical jumps. Also plot HW2train

data in the foreground (as a scatter plot). Use the title ‘1nn Classifier with

Training data’ for k = 1 and similar titles for other k.

c. Report the total accuracy for HW2test data set.

d. Make another plot, also with the classifier’s prediction function ŷ(x), but show

the HW2test data instead. Use the title ‘1nn Classifier with Testing data’ for

k = 1 and similar titles for other k.

(2) Make a plot with the title ‘Training accuracy as a function of k’ where the horizontal

axis is the parameter k with the odd-numbered values {1, 3, 5, . . . , 13, 15}. The

vertical axis should be the training accuracy for the HW2train data set using the

knn classifier fit on the HW2train data.

(3) Make a plot with the title ‘Testing accuracy as a function of k’ where the horizontal

axis is the parameter k with the odd-numbered values {1, 3, 5, . . . , 13, 15}. The

vertical axis should be the testing accuracy for the HW2test data set using the

knn classifier fit on the HW2train data.

D. In about 4-6 sentences, comment on the performance of the different nearest neighbor

classifiers for the different k values you used, including whether you see any evidence of

over-fitting or under-fitting, and how they compare to the logistic regression classifier,

and any other note-worthy aspects.
