605.649 — Introduction to Machine Learning

Programming Project #4

The purpose of this assignment is to give you a firm foundation in comparing a variety of linear classifiers.

In this project, you will compare two algorithms, one of which you have already implemented: Adaline and Logistic Regression. You will also use the same five datasets that you used in Project 1 from the UCI Machine Learning Repository, namely:

1. Breast Cancer — https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

2. Glass — https://archive.ics.uci.edu/ml/datasets/Glass+Identification

The study of classification of types of glass was motivated by criminological investigation.

3. Iris — https://archive.ics.uci.edu/ml/datasets/Iris

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

4. Soybean (small) — https://archive.ics.uci.edu/ml/datasets/Soybean+%28Small%29

A small subset of the original soybean database.

5. Vote — https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the Congressional Quarterly Almanac.

When using these data sets, be aware of the following issues.

1. Not all of these data sets correspond to 2-class classification problems. A method for handling multiclass classification was described for Logistic Regression. For Adaline, it is suggested that you use what is called a “multi-net,” where you train a single network with multiple outputs. Note that if you wish to apply a one-vs-one or one-vs-all strategy for the neural network, that is acceptable. Just be sure to explain your strategy in your report.

2. Some of the data sets have missing attribute values. When these occur in small numbers, you may simply edit the corresponding values out of the data sets. For more frequent occurrences, you should perform some kind of “data imputation,” where, basically, you generate a value of some kind. The generated value can be purely random, or it can be sampled according to the conditional probability of the values occurring, given the underlying class for that example. The choice is yours, but be sure to document it.

3. Most of the attributes in the various data sets are either multi-value discrete (categorical) or real-valued. You will need to deal with this in some way. For the multi-value situation, you can apply what is called “one-hot coding,” where you create a separate Boolean attribute for each value. For the continuous attributes, you may use one-hot coding if you wish, but there is actually a better way. Specifically, it is recommended that you normalize them first to be in the range −1 to +1 and apply the inputs directly. (If you want to normalize to be in the range 0 to 1, that’s fine. Just be consistent.)
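
As one illustration of the one-vs-all strategy mentioned in issue 1, a binary learner can be wrapped as sketched below. This is only a sketch, not a required design: `train_binary` and `score` are hypothetical placeholders for your own Adaline or Logistic Regression training and activation routines.

```python
# One-vs-all sketch: train one binary model per class, then predict the
# class whose model produces the highest score. train_binary and score
# are hypothetical stand-ins for your own Adaline or Logistic Regression
# implementation.

def train_one_vs_all(X, y, classes, train_binary):
    # Relabel each example as +1 (target class) or -1 (everything else)
    # and fit one binary model per class.
    models = {}
    for c in classes:
        labels = [1 if yi == c else -1 for yi in y]
        models[c] = train_binary(X, labels)
    return models

def predict_one_vs_all(x, models, score):
    # Pick the class whose binary model scores this example highest.
    return max(models, key=lambda c: score(models[c], x))
```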
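
For issue 2, sampling a missing value according to its empirical conditional distribution given the class can be sketched as follows. This is a hedged illustration only; it assumes missing values are marked with `?` and that the class label is the last field of each example, which may not hold for every file you download.

```python
import random

def impute(dataset, missing='?'):
    # Replace each missing attribute value by sampling uniformly from the
    # values observed for that attribute among examples of the same class;
    # this draws from the empirical conditional distribution P(value | class).
    # Assumes the class label is the last field of each example.
    by_class = {}
    for row in dataset:
        by_class.setdefault(row[-1], []).append(row)
    filled = []
    for row in dataset:
        new_row = list(row)
        for j, v in enumerate(row[:-1]):
            if v == missing:
                pool = [r[j] for r in by_class[row[-1]] if r[j] != missing]
                new_row[j] = random.choice(pool)
        filled.append(new_row)
    return filled
```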
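
For issue 3, one-hot coding and min–max normalization into [−1, +1] can be sketched as below (the normalization assumes the column is not constant):

```python
def one_hot(values, vocabulary):
    # Map each categorical value to a list of 0/1 indicators, one
    # indicator per possible value in the vocabulary.
    return [[1 if v == u else 0 for u in vocabulary] for v in values]

def normalize(values, lo=-1.0, hi=1.0):
    # Rescale a real-valued column linearly into [lo, hi].
    # Assumes the column is not constant (min != max).
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]
```

Passing `lo=0.0` gives the 0-to-1 variant mentioned above; whichever range you pick, apply it consistently across all continuous attributes.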

For this project, the following steps are required:

• Download the five (5) data sets from the UCI Machine Learning repository. You can find this repository at http://archive.ics.uci.edu/ml/. All of the specific URLs are also provided above.

• Pre-process each data set as necessary to handle missing data and non-Boolean data (both classes and attributes).

• Implement Adaline and Logistic Regression.


• Run your algorithms on each of the data sets. These runs should be done with 5-fold cross-validation so you can compare your results statistically. You can use classification error, cross-entropy loss, or mean squared error (as appropriate) for your loss function.

• Run your algorithms on each of the data sets. These runs should output the learned models in a way that can be interpreted by a human, and they should output the classifications on all of the test examples. If you are doing cross-validation, just output the classifications for a single fold.

• Write a very brief paper that incorporates the following elements, summarizing the results of your experiments. Your paper is required to be at least 5 pages and no more than 10 pages using the JMLR format. You can find templates for this format at http://www.jmlr.org/format/format.html. The format is also available within Overleaf.

1. Title and author name

2. Problem statement, including your hypothesis projecting how you expect each algorithm to perform

3. Brief description of your experimental approach, including any assumptions made with your algorithms

4. Presentation of the results of your experiments

5. A discussion of the behavior of your algorithms, combined with any conclusions you can draw

6. Summary

7. References (Only required if you use a resource other than the course content.)

• Submit your fully documented code, the video demonstrating the running of your programs, and your paper.

• For the video, the following constitute minimal requirements that must be satisfied:

– The video is to be no longer than 5 minutes.

– The video should be provided in mp4 format. Alternatively, it can be uploaded to a streaming service such as YouTube with a link provided.

– Fast forwarding is permitted through long computational cycles. Fast forwarding is not permitted whenever there is a voice-over or when results are being presented.

– Provide sample outputs from one test set showing classification performance on Adaline and Logistic Regression

– Show a sample trained Adaline model and Logistic Regression model

– Demonstrate the weight updates for Adaline and Logistic Regression. For Logistic Regression, show the multi-class case

– Demonstrate the gradient calculation for Adaline and Logistic Regression. For Logistic Regression, show the multi-class case

– Show the average performance over the five folds for Adaline and Logistic Regression
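
For reference, binary weight updates of the kind to be demonstrated in the video can be sketched roughly as below. This is only one possible formulation, not necessarily the one from the course notes: batch gradient descent with a fixed learning rate `eta`, minimizing mean squared error for Adaline (raw linear activation) and cross-entropy for binary Logistic Regression (sigmoid activation), with the bias folded in as `w[0]` against a constant input `x[0] = 1`.

```python
import math

def adaline_epoch(X, y, w, eta=0.01):
    # One batch gradient step on mean squared error: the Adaline output
    # is the raw linear activation w . x (bias folded in as w[0], x[0] = 1).
    grad = [0.0] * len(w)
    for x, t in zip(X, y):
        out = sum(wi * xi for wi, xi in zip(w, x))
        err = t - out
        for j in range(len(w)):
            grad[j] += err * x[j]
    return [wi + eta * g / len(X) for wi, g in zip(w, grad)]

def logistic_epoch(X, y, w, eta=0.01):
    # One batch gradient step on cross-entropy loss: targets are 0/1 and
    # the output is the sigmoid of the linear activation. Note the gradient
    # has the same (target - output) * x form as Adaline's.
    grad = [0.0] * len(w)
    for x, t in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        err = t - p
        for j in range(len(w)):
            grad[j] += err * x[j]
    return [wi + eta * g / len(X) for wi, g in zip(w, grad)]
```

For the multi-class Logistic Regression case required above, the same pattern extends to one weight vector per class with a softmax in place of the sigmoid.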
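
The 5-fold splits behind the averaged performance can be generated with a simple index partition; a minimal sketch, assuming a fixed seed so the folds are reproducible across runs:

```python
import random

def five_fold_splits(n, seed=0):
    # Shuffle the example indices once, then partition them into 5 folds
    # of (nearly) equal size; each fold serves as the test set exactly once.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test
```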

Your grade will be broken down as follows:

• Code structure – 10%

• Code documentation/commenting – 10%

• Proper functioning of your code, as illustrated by a 5-minute video – 30%

• Summary paper – 50%

