CSCI 635: Introduction to Machine Learning

Assignment 1

Notes

• The assignment is out of 50 points.

• Submit your assignment through the Dropbox in MyCourses as two files:

• A .pdf file for the write-up, named a1.pdf.

• A .zip file containing your code, trained parameter values, and a README explaining

(briefly!) how to run your trained classifiers, named a1.zip.

• Code must be able to run on the course servers, granger.cs.rit.edu or weasley.cs.rit.edu.

Grade Penalties will be applied for:

• Not submitting the write-up as instructed above.

• Submitting code with incorrect file names.

• Not providing trained parameter values – we should be able to easily run your programs using

the trained parameter values – we will not retrain your networks.

• Submitting code that cannot run on the class servers (this penalty will be substantial).

Question 1 – Data Analysis and Visualization (20 points)

For Questions 1 and 2, we will use two data sets (available through MyCourses):

• Frogs-subsample.csv, and

• Frogs.csv.

The features are Mel Frequency Cepstrum Coefficients (MFCCs) representing two frequency

band intensities measured from South American frog calls in audio recordings. There are two

species of frog in the data sets. This data was selected from the Anuran Calls dataset available

from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29.

The subsampled data set was produced by obtaining a

(class-)balanced random sample (25 instances per class).

a. Visualization. Separately for each of Frogs-subsample.csv and Frogs.csv, use numpy and

matplotlib to produce the following:

• Plotting Raw Features

• A scatter plot of the ‘raw’ samples (both features, separate color for each class)

• For each frog class in the file:

• 2 histograms (1 per feature/attribute)

• 2 line graphs (1 per feature/attribute – after sorting feature values)

• Plotting Feature Distributions

• A boxplot showing the distribution of features for both classes (For each class, 1

box+whiskers per feature; 4 boxes total)

• Bar graph with error bars (For each class, 1 error bar per feature; 4 error bars total)
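The plot types listed above can be sketched with numpy and matplotlib as follows. This is only a minimal illustration under stated assumptions: the data is synthetic (it is not the frog data), and the names class_a, class_b, and the axis labels are hypothetical placeholders. The histograms and sorted line graphs are omitted for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for a headless server
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (NOT the frog data): two classes, two features each.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(25, 2))
class_b = rng.normal(loc=[2.0, 1.0], scale=[0.5, 1.0], size=(25, 2))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot: both features, one color per class.
axes[0].scatter(class_a[:, 0], class_a[:, 1], color="tab:blue", label="class A")
axes[0].scatter(class_b[:, 0], class_b[:, 1], color="tab:orange", label="class B")
axes[0].set_xlabel("feature 1")
axes[0].set_ylabel("feature 2")
axes[0].legend()

# Boxplot: one box per (class, feature) pair, four boxes in total.
columns = [class_a[:, 0], class_a[:, 1], class_b[:, 0], class_b[:, 1]]
names = ["A / f1", "A / f2", "B / f1", "B / f2"]
axes[1].boxplot(columns)
axes[1].set_xticks(range(1, 5))
axes[1].set_xticklabels(names)

# Bar graph of the means, with one sample standard deviation as the error bar.
means = [c.mean() for c in columns]
stds = [c.std(ddof=1) for c in columns]
axes[2].bar(names, means, yerr=stds, capsize=4)

fig.tight_layout()
```

Calling fig.savefig with a file name of your choice would write the figure to disk. Using one standard deviation as the error bar is an assumption of this sketch; other choices (such as the standard error) are also common.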

b. Descriptive Statistics. Separately for each data set, use numpy to compute 1) the mean

(expected value), 2) covariance matrix, and 3) standard deviation for each individual feature.
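As a rough illustration of these three statistics in numpy, the sketch below uses a tiny synthetic array (not the frog data) in which rows are samples and columns are features. One detail worth checking in your own code: np.cov normalizes by N - 1 by default, while np.std normalizes by N unless ddof=1 is passed.

```python
import numpy as np

# Synthetic stand-in data (NOT the frog data): rows are samples, columns are features.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])

mean = X.mean(axis=0)           # per-feature mean
cov = np.cov(X, rowvar=False)   # 2x2 covariance matrix (columns are the variables)
std = X.std(axis=0, ddof=1)     # per-feature sample standard deviation

print(mean)
print(cov)
print(std)
```

With ddof=1, the squared standard deviations match the diagonal of the covariance matrix, which is a quick consistency check for a statistics table.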

In the write-up, provide one or two tables that clearly and attractively allow the plots of each

type to be easily compared visually (e.g., putting the two scatter plots beside one another).

Provide another (text) table providing the descriptive statistics from part b. for each dataset.

Then discuss the distributions of the features in the two data sets, making direct reference to

your plots and descriptive statistic tables. In what ways are the distributions of the two classes

similar or dissimilar? In what ways are the class distributions in the two data sets similar or

dissimilar?

!! Fair Warning !! The content and clarity of both the visual presentation and text will be

considered when grading. If the presentation is vague, messy, unclear, and/or incorrect, a

grade penalty will be incurred.

Name your program for generating plots and statistics q1.py, and include this in a1.zip.

Question 2 – The Effect of Training Data (15 points)

Using numpy (or PyTorch), create a binary classifier for the data in each file using a single

logistic regressor (i.e., a single ‘perceptron’ using the sigmoid activation function). Create a

scatter plot for each data set, and then visualize the class regions and decision boundaries

(similar to the Hastie et al. figures seen in class).
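For orientation, a single logistic regressor and its class regions might be sketched in numpy as below. This is one possible approach, not a required design: the data is synthetic (not the frog data), and the function name train_logistic, the learning rate, and the epoch count are hypothetical choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)        # predicted probability of class 1
        grad_w = X.T @ (p - y) / n    # gradient of the loss w.r.t. the weights
        grad_b = np.mean(p - y)       # gradient of the loss w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-in data (NOT the frog data): two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2.0, -2.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = train_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)

# Evaluate the classifier on a grid to obtain the class regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
regions = (sigmoid(grid @ w + b) >= 0.5).reshape(xx.shape)
```

The decision boundary is the set of points where w · x + b = 0, which is a straight line in two dimensions. The regions array can be passed to matplotlib's contourf together with xx and yy to shade the two class regions beneath the scatter plot.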

In the write-up, present the two plots in a table. Discuss the decision boundaries that you

obtained, and both how and why the different data sets produced the results obtained. 2-3

paragraphs should be sufficient.

Name your program q2.py, and include the program along with the saved parameters for

your networks in a1.zip.
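One way to store and reload trained parameters in numpy is np.savez together with np.load, sketched below. The weight and bias values and the file name are hypothetical placeholders, and the temporary directory is used only so that the sketch runs anywhere; PyTorch users would typically rely on that library's own saving mechanism instead.

```python
import os
import tempfile
import numpy as np

w = np.array([0.5, -1.25])   # hypothetical trained weights
b = 0.75                     # hypothetical trained bias

# Hypothetical file name; a real submission would store the file inside a1.zip.
path = os.path.join(tempfile.gettempdir(), "q2_params_sketch.npz")
np.savez(path, w=w, b=b)

# Reload the parameters without retraining.
with np.load(path) as params:
    w_loaded = params["w"]
    b_loaded = float(params["b"])
```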

Question 3 – Let Us Not Forget Probability! (15 points)

• In a manufacturing process, 85% of products are produced without defects. Of the products inspected, 10% of the good

ones are seen as defective and not shipped, while only 5% of the defective products are approved and

shipped. If a product is shipped, what is the probability that it has a defect?

• Consider randomly generated bit strings of length four. Demonstrate whether or not the event of

generating a bit string with an even number of 1’s is independent of the event of generating a bit string that ends

in 1.

• Let’s flip a (fair) coin n times to generate a dataset, where we choose to represent the state of the coin,

i.e., heads or tails, as the variable X ∈ {0, 1}, where the first value (0) represents “tails” and the other (1)

“heads”. Suppose that the outcome of our experiment yields the following state sequence:

S = {1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0}

Estimate the probability p(X = 1) from the data, using a maximum likelihood approach (Hint: the

frequency approach). Also, what is the probability of getting tails under the sequence above, i.e., p(X = 0)?

Bonus: Provide a maximum a posteriori (MAP) estimate of the probability of getting heads, p(X = 1),

assuming a prior belief that the coin is fair. (Hint: Adapt your MLE estimate to account for your

prior, recalling that a coin toss is a Bernoulli random variable.)
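As a reminder, the standard identities relevant to the three problems above are collected below in general form. The symbols A, B, k, n, θ, α, and β are notation introduced here, not part of the assignment. The Beta(α, β) prior in the last line is one common way to encode a prior belief about a Bernoulli parameter and is an assumption of this sketch; the choice of prior and of its hyperparameters is yours to make and justify.

```latex
% Bayes' rule with the law of total probability in the denominator
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}

% Events A and B are independent if and only if
P(A \cap B) = P(A)\,P(B)

% Bernoulli maximum likelihood estimate from n trials with k = \sum_{i=1}^{n} x_i successes
\hat{\theta}_{\mathrm{MLE}} = \frac{k}{n}

% MAP estimate under a Beta(\alpha, \beta) prior with \alpha, \beta > 1
\hat{\theta}_{\mathrm{MAP}} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}
```

A symmetric prior (α = β) is centered on θ = 1/2, and larger values of α and β express a stronger belief that the coin is fair.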

Put your answers into a1.pdf and make sure your questions for Q3 are cleanly organized (and show

your work/calculation steps). You may scan your math if hand-written (though LaTeX is a better

choice) BUT your writing must be clear or points will be deducted for sloppiness & poor readability.