CS 536: Decision Trees

Let {(X1, Y1), (X2, Y2), . . . , (Xm, Ym)} denote a data set, where Xi represents a vector of k (binary) feature values, and Yi is a corresponding binary class or label that we will need to learn to predict from the X-values.

We generate data via the following scheme, defining a distribution for our data set: Let X = (X1, X2, X3, . . . , Xk) be a vector of binary values satisfying the following:

• X1 = 1 with probability 1/2, X1 = 0 with probability 1/2

• For i = 2, . . . , k, Xi = Xi−1 with probability 3/4, and Xi = 1 − Xi−1 with probability 1/4.

In this way, the first feature value is uniformly random, but every successive feature is strongly correlated with the value of the feature before it. We can then define Y to be a function of X as

    Y = { X1       if w2X2 + w3X3 + . . . + wkXk ≥ 1/2
        { 1 − X1   otherwise.                               (1)

In other words, if the ‘weighted average’ of X2, . . . , Xk tilts high, Y will agree with X1; if the weighted average of X2, . . . , Xk tilts low, Y will disagree with X1. Take the weights to be defined by wi = 0.9^i / (0.9^2 + 0.9^3 + . . . + 0.9^k).

1) For a given value of k, m (number of features, number of data points), write a function to generate a training data set based on the above scheme.
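As a sketch of one possible implementation (the function name generate_data, the optional seed argument, and the list-of-pairs representation are my own choices, not prescribed by the assignment), the sampling scheme above can be coded as:

```python
import random

def generate_data(k, m, seed=None):
    """Generate m samples of (X, Y) under the chained-feature scheme.

    X1 is a fair coin flip; each later Xi copies X(i-1) with prob 3/4.
    Y equals X1 when the weighted sum of X2..Xk is >= 1/2, else 1 - X1.
    """
    rng = random.Random(seed)
    # Weights w_i = 0.9^i, normalized by the sum 0.9^2 + ... + 0.9^k.
    norm = sum(0.9 ** i for i in range(2, k + 1))
    w = {i: 0.9 ** i / norm for i in range(2, k + 1)}
    data = []
    for _ in range(m):
        x = [rng.randint(0, 1)]            # X1 uniform on {0, 1}
        for _ in range(k - 1):
            if rng.random() < 0.75:        # copy the previous feature
                x.append(x[-1])
            else:                          # flip the previous feature
                x.append(1 - x[-1])
        weighted = sum(w[i] * x[i - 1] for i in range(2, k + 1))
        y = x[0] if weighted >= 0.5 else 1 - x[0]
        data.append((x, y))
    return data
```

Each sample is stored as a pair (feature list, label), which keeps the later partitioning and tree-fitting code simple.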

2) Given a data set, write a function to fit a decision tree to that data based on splitting the variables by maximizing the information gain. Additionally, return the training error of this tree on the data set, errtrain(f̂). It may be useful to have a function that takes a data set and a variable, and returns the data set partitioned based on the values of that variable.
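A greedy ID3-style fit might look like the following sketch. All names here (entropy, partition, info_gain, fit_tree), and the choice to represent a tree as either a leaf label or a nested dict, are illustrative assumptions, not part of the assignment:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(data, var):
    """Split data on feature index var, returning the two sub-datasets."""
    left = [(x, y) for x, y in data if x[var] == 0]
    right = [(x, y) for x, y in data if x[var] == 1]
    return left, right

def info_gain(data, var):
    """Reduction in label entropy from splitting on feature var."""
    left, right = partition(data, var)
    n = len(data)
    cond = sum(len(part) / n * entropy([y for _, y in part])
               for part in (left, right))
    return entropy([y for _, y in data]) - cond

def fit_tree(data, used=frozenset()):
    """Greedy ID3: split on the unused feature with maximum info gain.

    A tree is either a leaf label (0/1) or {'var': i, 0: sub, 1: sub}.
    """
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    k = len(data[0][0])
    candidates = [v for v in range(k) if v not in used]
    if len(set(labels)) == 1 or not candidates:
        return majority
    best = max(candidates, key=lambda v: info_gain(data, v))
    left, right = partition(data, best)
    if info_gain(data, best) <= 0 or not left or not right:
        return majority
    return {'var': best,
            0: fit_tree(left, used | {best}),
            1: fit_tree(right, used | {best})}

def predict(tree, x):
    """Follow splits down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree[x[tree['var']]]
    return tree

def training_error(tree, data):
    """Fraction of the training set the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in data) / len(data)
```

For example, on the toy data set [([0, 0], 0), ([0, 1], 0), ([1, 0], 1), ([1, 1], 1)] (where Y simply copies the first feature), the fitted tree splits on feature 0 and achieves zero training error.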

3) For k = 4 and m = 30, generate data and fit a decision tree to it. Does the ordering of the variables in the decision tree make sense, based on the function that defines Y? Why or why not? Draw the tree.

4) Write a function that takes a decision tree and estimates its typical error on this data, err(f̂); i.e., generate a lot of data according to the above scheme, and find the average error rate of this tree over that data.
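One way to sketch this Monte Carlo estimate (the names estimate_error and predict_fn are my own; the generator from part (1) is re-implemented inline so the snippet stands alone):

```python
import random

def generate_data(k, m, rng):
    """Minimal re-implementation of the sampling scheme from part (1)."""
    norm = sum(0.9 ** i for i in range(2, k + 1))
    data = []
    for _ in range(m):
        x = [rng.randint(0, 1)]
        for _ in range(k - 1):
            x.append(x[-1] if rng.random() < 0.75 else 1 - x[-1])
        s = sum(0.9 ** i / norm * x[i - 1] for i in range(2, k + 1))
        data.append((x, x[0] if s >= 0.5 else 1 - x[0]))
    return data

def estimate_error(predict_fn, k, n_test=100_000, seed=0):
    """Monte Carlo estimate of err(f^): average 0/1 loss on fresh samples.

    predict_fn maps a feature vector to a 0/1 prediction (e.g., a fitted
    decision tree wrapped in a lambda).
    """
    rng = random.Random(seed)
    test = generate_data(k, n_test, rng)
    return sum(predict_fn(x) != y for x, y in test) / n_test
```

For instance, estimate_error(lambda x: x[0], 4) evaluates the naive rule "predict Y = X1", which errs exactly when the weighted sum of X2, . . . , Xk falls below 1/2.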

5) For k = 10, estimate the value of |errtrain(f̂) − err(f̂)| for a given m by repeatedly generating data sets, fitting trees to those data sets, and estimating the true and training error. Do this for multiple values of m, and graph this difference as a function of m. What can you say about the marginal value of additional training data?

6) Design an alternative metric for splitting the data, not based on information content / information gain. Repeat the computation from (5) above for your metric, and compare the performance of your trees vs. the ID3 trees.
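One natural candidate metric (your own choice may differ; this is just an example, not the assignment's required answer) is the Gini impurity used by CART. A minimal sketch, with gini and gini_gain as illustrative names, that drops into the tree-fitting code in place of information gain:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance two independent draws disagree; 0 if pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(data, var):
    """Drop in weighted Gini impurity from splitting on feature var."""
    n = len(data)
    parts = [[y for x, y in data if x[var] == v] for v in (0, 1)]
    return gini([y for _, y in data]) - sum(len(p) / n * gini(p)
                                            for p in parts)
```

Swapping gini_gain in for info_gain when choosing the best split variable leaves the rest of the fitting procedure unchanged, which makes the comparison in (5) a one-line modification.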
