Artificial Intelligence Assignment 7

Download assignment7.zip from Canvas. There are four problems worth a total of 195 points plus 25 extra credit points for comp440

and 220 points for comp557. Problems 1 and 4 require written work only, Problems 2 and 3 require Python code in two folders and a writeup. All written work should be placed in a file called

writeup.pdf with problem numbers clearly identified. All code should be included at the labeled

points in submission2.py and submission3.py in the two code folders. For Problems 2 and 3,

please run the autograder using the command line python grader.py in each folder, and report the

results in your writeup. Please upload writeup.py, submission2.py and submission3.py onto

Canvas by the due date/time.

Problem 1: Naive Bayes, Perceptrons, Decision Trees, Neural Networks (45 points + 25 EC points for comp440/required points for

comp557)

The data set below is a subset of a census database (the full data is available at

ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult). Your goal is to predict whether

an individual earns more or less than 50K a year based on education, gender and citizenship. Education is a discrete-valued attribute which can take one of three values; BS, MS, PhD; gender takes

one of two values: male, female; and citizenship has two values: US and nonUS. You will use four

different algorithms to learn models over this training data set.

Education Gender Citizenship Income

BS male US ≤50K

MS male nonUS 50K

BS female US ≤50K

PhD male nonUS ≤50K

MS female US 50K

PhD female nonUS ≤50K

BS male US ≤50K

PhD male US 50K

BS female nonUS ≤50K

PhD female US 50K

Test your models on the following test set.

Education Gender Citizenship

PhD male US

PhD male nonUS

MS female nonUS

1

2 a. Naive Bayes (10 points)

• (7 points) Estimate the priors for each income class and the class conditional probabilities for

each of the attributes in the training data. That is, calculate P(≤50K), P(50K), and fill in

the following conditional probability tables.

Education P(Education|≤50K) P(Education| 50K)

BS

MS

PhD

Gender P(Gender|≤50K) P(Gender| 50K)

male

female

Citizenship P(Citizenship|≤50K) P(Citizenshipr| 50K)

US

nonUS

• (3 points) How would the Naive Bayes classifier you estimated in the previous step classify the

income of the individuals in the test set? That is, fill in the income column of the following

table with values ≤50K, and 50K.

Education Gender Citizenship Income

PhD male US

PhD male nonUS

MS female nonUS

b. The perceptron algorithm (10 points)

• (1 point) How would you encode the training data as inputs to a perceptron?

• (8 points) Initialize all weights to zero and perform one pass of perceptron training on the

training set, doing updates in the order in which the observations are presented. Present your

result in a table, showing the weight vector after processing each example.

• (1 point) Will the perceptron converge if you run it long enough? Justify your answer.

c. Decision Trees (10 points)

• (3 points) Calculate the information gain of each of the three features. Which feature should

be chosen as the root of the decision tree?

3 • (4 points) Now continue with the decision tree construction process with the root attribute

chosen above. Determine the attribute to split on for the next level of nodes in the decision

tree. Show your information gain calculations. Continue the construction process till all the

leaf nodes have instances belonging to the same Income class. Show the final (unpruned) tree

that you have constructed.

• (3 points) What does your decision tree predict on the three examples in the test set? That

is, fill out the following table.

Education Gender Citizenship Income

PhD male US

PhD male nonUS

MS female nonUS

d. Neural networks (15 points)

• (2 points) How would you encode the data as inputs to a feedforward neural network?

• (9 points) Code this example with sklearn’s MLP implementation, and train the network on

the given training set. Documentation is available at

http://scikit-learn.org/stable/modules/neural networks supervised.html You will

need to set up two numpy arrays for training trainX and trainy corresponding to the 10

training examples. You will also need to set up the test set as the array testX. You will

have to make a choice of the number of hidden units. Try the range 2, . . . , 5 and report the

cross-validated training error.

• (4 points) What are the predictions made on the test set? Compare the classifications of the

test cases made by the four classifiers and comment on their similarities and differences.

e. Using the full data set (25 points)(optional for comp440/required for comp557)

Download the full data from ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult.

Your goal is to predict whether an individual earns more or less than 50K a year based on education,

gender, citizenship as well as other features in the full data set. Use sklearn to build naive Bayes,

perceptrons, decision trees and multi layer neural networks on this data set. How predictable is

income from these features? Produce a table with columns learning-method, features used, training

error, test error and comment on your results.

Problem 2: Text classification (70 points)

According to Microsoft, circa 2007, 97% of all email is unwanted spam. Fortunately, most of us

avoid the majority of these emails because of well-engineered spam filters. In this assignment, we

will build a text classification system that, despite its simplicity, can identify spurious email with

impressive accuracy. You will then apply the same techniques to identify positive and negative

product reviews and to classify email posts into topical categories.

Problem 2.1: Spam classification (40 points) 4

Let us start by building a simple rule-based system to classify email as spam or ham. To test our

system, we will be using a corpus of email made publicly available after the legal proceedings of the

Enron collapse; within the data/spam-classification/train directory, you will find two folders

spam and ham. Each folder contains a number of full text files that contain the whole email without

headers and have been classified as spam and ham respectively.

In order to implement your own spam classifier, you will subclass Classifier and implement the

classify method appropriately in submission2.py. As usual, grader.py contains a number of

simple test cases for your code. To run, type python grader.py on the command line.

Additionally, we have provided a script main.py to help you interactively run your code with different parameters; be ready to change this script to use alternate features, print different statistics,

etc. It is really just meant to help you get started. To use this script, type python main.py

part<part, using the section number (2.1, 2.2, etc.). python main.py part<part -h lists additional parameters you might find useful. The script will output the classification error rate as

well as a confusion matrix for the training and development set. Each cell of the confusion matrix,

(row, column), of the matrix contains the number of elements with the true label row that have

been classified as column.

Problem 2.1.1: Rule-based system (10 points)

• (5 points) data/spam-classification/blacklist.txt contains a list of words sorted by

descending spam correlation. Implement RuleBasedClassifier by discarding any email

containing at least one word from this list. It is recommended you store the blacklist as a

set to reduce lookup time. For comparison, our system achieved a dev error rate of about

48% on the training set with this heuristic. You can test this by running python main.py

part3.1.1 at the command line. If you run python grader.py you should get something

like this (your times may vary).

========== START GRADING

—– START PART 2.1.1a-0

—– END PART 2.1.1a-0 [took 0:00:00.032256, 2/2 points]

—– START PART 2.1.1a-1

trainErr: 0.474339810663

devErr: 0.480730223124

—– END PART 2.1.1a-1 [took 0:00:08.852822, 3/3 points]

• (5 points) Relax this heuristic by only discarding email that contains at least n or more of

the first k listed words. You can test this by running python main.py part2.1.1 -n 1 -k

10000 at the command line for n = 1 and k = 10000. Report your dev error results on the

training set in a 3×3 table with n = 1, 2, 3 and k = 10000, 20000, 30000.

Problem 2.1.2: Linear classifiers (10 points) 5

As you have observed, this naive rule-based system is quite ineffective. A reasonable diagnosis is

that each word, whether it is viagra or sale, contributes equally to the ’spaminess’ of the email.

Let us relax this assumption by weighing each word separately.

Consider producing a spaminess score, fw(x) for each email document x which sums the ’spaminess’

scores for each word in that document;

fw(x) = X

L

i=1

w(wordi)

where L is the length of the document x, and w is the ’spaminess’ score for word wordi

. We can

use a linear classifier that will classify the email as spam if fw(x) ≥ 0, and as ham, otherwise.

Note that the order of words does not matter when computing the sum fw(x). Thus, it is convenient

to represent a document as a sparse collection of words. An ideal data structure for this is a hash

map or dictionary. For example, the document text The quick dog chased the lazy fox over

the brown fence, would be represented using the dictionary, { brown: 1, lazy: 1, fence: 1, fox:

1, over: 1, chased: 1, dog: 1, quick: 1, the: 2, The: 1}.

Let the above vector representation of a document be φ(x); in this context, the vector representation

is called a feature. The use of individual words or unigrams is a common choice in many language

modelling tasks. With this vector or feature representation, the spaminess score for a document is

simply the inner product of the weights and the document vector fw(x) = w.φ(x).

Let the positive label be represented as a 1. Then, we can write the predicted output ˆy of the

classifier as

yˆ =

(

+1 if w.φ(x) ≥ 0

−1 if w.φ(x) < 0

• (5 points) Implement a function extractUnigramFeatures that reads a document and returns

the sparse vector φ(x) of unigram features. If you run python grader.py, you should see

the following corresponding to a correct implementation of this part.

—– START PART 2.1.2a

—– END PART 2.1.2a [took 0:00:00.000072, 5/5 points]

• (5 points) Implement the classify function in WeightedClassifier. You can test whether

you have implemented this correctly by running python grader.py – you should see:

—– START PART 2.1.2b

—– END PART 2.1.2b [took 0:00:00.000055, 5/5 points]

Problem 2.1.3: Learning to distinguish spam (20 points)

The next question we need to address is where the vector of spaminess scores, w, comes from.

Our prediction function is a simple linear function, so we will use the perceptron algorithm to

6 learn weights. The perceptron algorithm visits each training example and incrementally adjusts

weights to improve the classification of the current labelled example (x, y). If ˆy is the prediction

the algorithm makes with the current set of weights w

(t)

, i.e., ˆy = I(w

(t)

.φ(x) ≥ 0), then if ˆy 6= y,

increment w

(t) by y × φ(x).

• (10 points) Implement learnWeightsFromPerceptron that takes as input a corpus of training

examples, and returns the w learned by the perceptron algorithm. Initialize your weights

uniformly with 0.0. If you run python grader.py you should see the following:

—– START PART 2.1.3a-0

—– END PART 2.1.3a-0 [took 0:00:00.047040, 0/0 points]

—– START PART 2.1.3a-1

trainErr: 0.00896860986547

devErr: 0.0486815415822

—– END PART 2.1.3a-1 [took 0:00:06.147842, 10/10 points]

Our reference implementation with unigram features had an error rate of about 0.048 on the

development set.

• (5 points) So far, we have looked at scores for single words or unigrams. We will now consider

using two adjacent words, or bigrams as features by implementing extractBigramFeatures.

To handle the edge case of a word at the beginning of a sentence (i.e. after a punctuation like

’.’, ’!’ or ’?’), use the token -BEGIN-. On the previous example, The quick dog chased the

lazy fox over the brown fence, extractBigramFeatures would return, {the brown: 1,

brown: 1, lazy: 1, fence: 1, brown fence: 1, fox: 1, over: 1, fox over: 1, chased: 1, dog:

1, lazy fox: 1, quick dog: 1, The quick: 1, the lazy: 1, chased the: 1, quick: 1, the:

2, over the: 1, -BEGIN- The: 1, dog chased: 1}.

Run python grader.py and you should see the following:

—– START PART 2.1.3b-0

—– END PART 2.1.3b-0 [took 0:00:00.000146, 5/5 points]

• (5 points) Vary the number of examples given to the training classifier in steps of 500 from

500 to 5,000. Provide a table of results showing the training and development set classification error when using bigram features. How did the additional features help the training

and development set error? You can run python main.py part2.1.3 to get the results for

producing the table.

Problem 2.2: Sentiment classification (10 points)

You have just constructed a spam classifier that can identify spam with a 96% accuracy. While this

in itself is great, what’s really impressive is that the same system can easily be used to learn how to

tackle other text classification tasks. Let us look at something that is completely different; identifying positive and negative movie reviews. We have provided a dataset in data/sentiment/train,

with labels pos and neg.

7 • (5 points) Use the perceptron learning algorithm you wrote in the previous section to classify

positive and negative reviews. Report the training and development set error rate for unigram

features and bigram features. You can run python main.py part2.2 to get these results.

You can modify the function part2 in main.py if you want to change parameters.

• (5 points) Make a table of the training and development set classification errors as you vary

the number of iterations of the perceptron algorithm from 1 to 20. Use bigram features for

this part. Use main.py for this part as before. Does development set error rate monotonically

decrease with iteration number? Why or why not? You might find it useful to write another

version of learnWeightsFromPerceptron that prints training and development error in each

iteration. Optionally, you might try plotting the errors to visually see how the two errors

behave. We recommend the matplotlib library. Include these plots in your writeup if you

make them.

Problem 2.3: Document categorization (20 points)

Finally, let’s see how this system can be generalized to tasks with multiple labels. We will apply our

knowledge to the task of categorizing emails based on topics. Our dataset in data/topics/train

contains a number of emails from 20 popular USENET groups that have been segregated into 5

broader categories.

• (15 points) A common approach to multi-class classification is to train one binary classifier

for each of the categories in a ”one-vs-all” scheme. Namely, for the binary classifier i, all

examples labelled i are positive examples, and all other examples are negative. When using

this classifier for prediction, one uses the class label with the largest score, fi(x). Implement

the OneVsAllClassifier classifier (and MultiClassClassifier). In order to train this

classifier, you will also need to implement learnOneVsAllClassifiers that will appropriately

train binary classifiers on the provided input data.

If you run python grader.py, you will see the following:

—– START PART 2.3a-0

—– END PART 2.3a-0 [took 0:00:00.000141, 15/15 points]

• (5 points) Report the train and development set error rate for unigram features and bigram

features in your writeup. You can generate these by modifying function part3 in main.py.

If you run python main.py part2.3 with our default part3, you should see a development

error rate of about 12%.

Problem 3: Image classification (60 points)

In this assignment, you will implement an image classifier that distinguishes birds and airplanes.

We will use the perceptron algorithm as a starting point but what should the features (arguably

the most important part of machine learning) be? We will see that rather than specifying them

by hand, as we did for spam filtering, we can actually learn the features automatically using the

8

Figure 1: The sixteen patches of size 8×8 pixels corresponding to an image of size 32×32 pixels.

K-means clustering algorithm. We will be working with the CIFAR-10 dataset, one of the standard

benchmarks for image classification. We assume that your version of Python has numpy and PIL

packages already installed. If you do not have these packages, you can install them fairly easily

from distributions downloadable from the Web. If you use Enthought Python, these packages come

pre-installed.

Warmup

Now we will get you familiar with the classification task. You will be classifying a 32 × 32 image

as either a bird ( y = 1) or a plane (y = 0). One way to do this is to train a classifier on the raw

pixels, that is, φ(x) ∈ <32×32 vector where each component of the vector represents the intensity

of a particular pixel. Try running this classifier with

python run.py –pixels

As you can see, these features do not work very well, the classifier drastically overfits the training

set and as a result barely generalizes better than chance.

The problem here is that pixel values themselves are not really meaningful. We need a higher-level

representation of images. So we resort to the following strategy: Each image is a 32 × 32 grid of

pixels. We will divide the image into sixteen 8 × 8 ”patches”. Now comes the key part: we will

use K-means to cluster all the patches into centroids. These centroids will then allow us to use a

better feature representation of the image which will make the classification task much easier.

9

Figure 2: 20 centroids learned from K-means on patches from the first 1000 training images..

Implementing the K-means algorithm (20 points)

In this part of the assignment you will implement K-means clustering to learn centroids for a given

training set. Specifically you should fill out the method runKMeans in the file submission3.py.

Let D = {x1, . . . , xn} be a set of data, where xi ∈ <d

. We will construct K clusters with means

µ1, . . . , µK where µj ∈ <d

.

• Initialize a set of means µ1, µ2, . . . , µK.

• for i in 1..maxIter

– Assignment step: assign each xi to the closest cluster mean zi ∈ {µ1, . . . , µK}.

zi = argmink||xi − µk||2

– Update step: given the cluster assignments zi

, recalculate the cluster means µ’s.

muk =

1

P

zi=k

1

X

zi=k

xi

We start you off by initializing K-means with random centroids where each floating point number

is chosen from a normal distribution with mean 0 and standard deviation 1. Test the K-means

code with the provided test utility in grader.py by running:

python grader.py

Optional: One way to determine if your K-means algorithm is learning sensible features is to view

the learned centroids using our provided utility function. To view the first 25 learned centroids,

run

10 python run.py –view

Your centroids should look similar to Figure 2. Notice how the centroids look like edges. Important

note on patches: Images are composed of patches which have been converted to gray-scale followed

by standard image preprocessing. This includes normalizing them for luminance and contrast as

well as further whitening.

Feature Extraction (20 points)

The next step in the pipeline is to use the centroids to generate features that will be more useful

for the classification task then the raw pixel intensities. One way to do this is to represent each

patch by the distance to the centroids that are closest to it, the intuition being that we can encode

a patch by the centroids to which it is most similar. We will map each image x to its new feature

vector φ(x) ∈ <16k

, where there is a real value for each patch, centroid combination.

Let pij ∈ <64 be the ij-th patch of x where i, j = 1, . . . , 4. The relative activation, aijk, of centroid

µk by patch pij is defined to be the average Euclidean distance from pij to all centroids minus the

Euclidean distance from pij to µk. The Euclidean distance is the L2 norm, ||v||2 =

√

v

T v.

aijk =

1

K

X

K

k

0=1

||pij − µk

0 ||2

− ||pij − µk||2

The feature value for patch pij and centroid µk is the max of the relative activation and zero.

φijk(x) = max(aijk, 0)

Implement the function extractFeatures in submission3.py. We will use these features in the

linear classifier below.

Supervised Training (20 points)

The final task is to use our better feature representation to classify images as birds or planes. We

have provided you with a working Perceptron classifier which does fairly well. You will implement

the logistic and hinge loss updates to learn the parameters and see if either of these improves

classification accuracy and which does better. First you can run the Perceptron classifier to see

how it performs on the test set:

python run.py –gradient=perceptron

You should see a test accuracy result between 60%-65% – we can do better!

• (10 points) Implement the logisticGradient method in submission3.py. You can test this

method with python grader.py. Given example (φ(x), y) where φ(x) ∈ <d and y ∈ {0, 1},

and parameter vector w, define yy = 2 ∗ y − 1 to map {0, 1} to {−1, 1}. The logistic loss

function is

Losslogistic(w, φ(x), yy) = log(1 + e

−wT φ(x)∗yy)

Compute the gradient of this function with respect to 11 w and complete the logisticGradient

method.

• (10 points) Implement the hingeLossGradient method in submission3.py. The hinge loss

function is

Losshinge(w, φ(x), yy) = max(1 − w

T φ(x) ∗ yy, 0)

Compute the gradient of this function with respect to w and complete the hingeLossGradient

method. You can test this method with python grader.py.

Now you are ready to run the entire pipeline and evaluate performance. Run the full pipeline with:

python run.py –gradient=<loss function

The loss function can be any of {”hinge”,”logistic”,”perceptron”}. The output of this should be

the test accuracy on 1000 images. One benefit of using an unsupervised algorithm to learn features

is that we can train the algorithm with unlabeled data which is much more easily obtained than

labeled data. We have included an option to train the K-means algorithm using 500 extra unlabeled

images. You can run this with

python run.py –gradient=hinge -m

The performance of the supervised classifier goes up even though we are not using labeled data –

a major benefit to using an unsupervised algorithm to learn a feature representation!

Numpy cheat sheet

This is a quick overview of the functions you will need to complete this assignment. A really

thorough introduction is at http://wiki.scipy.org/Tentative NumPy Tutorial.

All standard operations on vectors, matrices and n-dimensional arrays are element-wise in numpy.

For example

A = numpy.random.randn(5,5)

B = numpy.random.randn(5,5)

A+B # element-wise addition

A*B # element-wise multiplication

You can also access individual elements, columns and rows of A using the following notation,

A[0,0] # first element of the first row of A

A[:,1] # second column of A

A[3:5,:] # fourth and fifth rows of A

In order to take the matrix product of A and B use the numpy.dot function.

numpy.dot(A,B) # matrix multiplication A*B

A.dot(B) # same as above

To find the minimum or maximum element in an array or along a specific axis of an array (rows or 12

columns for 2D), use numpy.minimum or numpy.maximum

numpy.minimum(A) # min of A

numpy.minimum(A, axis=0) # min of each column of A

numpy.maximum(A, axis=1) # max of each row of A

To take the indices of the minimum element in each row or column of a 2D array use the numpy.argmin

function and specify the axis

numpy.argmin(A,axis=0) # argmin of each column

Similarly you can take the mean along the columns or rows of a 2D array using numpy.mean and

the sum along a specific axis using numpy.sum

numpy.mean(A,axis=0) # mean of each column of A

numpy.sum(A,axis=1) # sum of each row of A

Problem 4: Relating Naive Bayes classifiers and perceptrons (20

points)

Consider a naive Bayes classifier with binary-valued features, i.e. fi ∈ {0, 1} for a two class problem

{+1, −1}. Prove that it can be represented by a perceptron with weights wi

, i = 0, 1, . . . , n such that

the decisions made by the naive Bayes classifier are the same as the ones made by the perceptron.

The weights of the perceptron should be expressed in terms of the naive Bayes probabilities: P(y),

P(fi

|y = 1), and P(fi

|y = −1). Assume that all these probabilities are non-zero.

Sale!