CS 536 : Decision Trees 



CS 536 : Decision Trees
Let {(X1
, Y1),(X2
, Y2), . . . ,(Xm, Ym)} denote a data set, where Xi
represents a vector of k (binary) feature values,
and Yi
is a corresponding binary class or label that we will need to learn to be able to predict from the X-values.
We generate data via the following scheme, defining a distribution for our data set: Let X = (X1, X2, X3, . . . , Xk)
be a vector of binary values, satisfying the following
• X1 = 1 with probability 1/2, X1 = 0 with probability 1/2
• For i = 2, . . . , k, Xi = Xi−1 with probability 3/4, and Xi = 1 − Xi−1 with probability 1/4.
In this way, the first feature value is uniformly random, but every successive feature is strongly correlated with the
value of the feature before it. We can then define Y to be a function of X as
Y =

X1 if w2X2 + w3X3 + . . . + wkXk ≥ 1/2
1 − X1 else.
In other words, if the ‘weighted average’ of X2, . . . Xk tilts high, Y will agree with X1; if the weighted average of
X2, . . . , Xk tilts low, Y will disagree with X1. Take the weights to be defined by wi = 0.9
2 + 0.9
3 + . . . + 0.9
1) For a given value of k, m, (number of features, number of data points), write a function to generate a training
data set based on the above scheme.
2) Given a data set, write a function to fit a decision tree to that data based on splitting the variables by maximizing
the information gain. Additionally, return the training error of this tree on the data set, errtrain(
ˆf). It may be
useful to have a function that takes a data set and a variable, and returns the data set partitioned based on the
values of that variable.
3) For k = 4 and m = 30, generate data and fit a decision tree to it. Does the ordering of the variables in the
decision tree make sense, based on the function that defines Y ? Why or why not? Draw the tree.
4) Write a function that takes a decision tree and estimates its typical error on this data err( ˆf); i.e., generate a
lot of data according to the above scheme, and find the average error rate of this tree over that data.
5) For k = 10, estimate the value of |errtrain(
ˆf) − err( ˆf)| for a given m by repeatedly generating data sets, fitting
trees to those data sets, and estimating the true and training error. Do this for multiple m, and graph this
difference as a function of m. What can you say about the marginal value of additional training data?
6) Design an alternative metric for splitting the data, not based on information content / information gain. Repeat
the computation from (5) above for your metric, and compare the performance of your trees vs the ID3 trees.


There are no reviews yet.

Be the first to review “CS 536 : Decision Trees ”

Your email address will not be published.