CS 5350/6350: Machine Learning

Homework 1

1 Decision Tree [40 points + 10 bonus]

1. [7 points] Decision tree construction.

(a) [5 points] Use the ID3 algorithm with information gain to learn a decision tree

from the training dataset in Table 1. Please list every step in your tree construction, including the data subsets, the attributes, and how you calculate the

information gain of each attribute and how you split the dataset according to the

selected attribute. Please also give a full structure of the tree. You can manually

draw the tree structure, convert the picture into a PDF/EPS/PNG/JPG format

and include it in your homework submission; or instead, you can represent the

tree with a conjunction of prediction rules as we discussed in the lecture.


x1 x2 x3 x4 y

0 0 1 0 0

0 1 0 0 0

0 0 1 1 1

1 0 0 1 1

0 1 1 0 0

1 1 0 0 0

0 1 0 1 0

Table 1: Training data for a Boolean classifier

Solution: To construct a decision tree using the ID3 algorithm with information gain, we follow a step-by-step process: select the best attribute to split the dataset, then recursively build the tree.

First, calculate the entropy of the target variable (y) for the entire dataset.

Entropy(y) = −p(0) ∗ log2(p(0)) − p(1) ∗ log2(p(1))

Where:

• p(0) is the probability of y = 0 in the dataset.

• p(1) is the probability of y = 1 in the dataset.

The dataset has 2 positive and 5 negative examples, so p(1) = 2/7 and p(0) = 5/7.

Then, Entropy(y) = −(5/7)log2(5/7) − (2/7)log2(2/7) ≈ 0.8631.

Now, we can calculate the information gain for each attribute (x1, x2, x3, x4) and

split the dataset based on that attribute.

x1: x1 = 0: 5/7 examples

p = 1/5, n = 4/5

Entropy(x1 = 0) = −(1/5)log2(1/5) − (4/5)log2(4/5) = 0.7219

x1 = 1: 2/7 examples

p = 1/2, n = 1/2

Entropy(x1 = 1) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1

Entropy(x1) = (5/7)(0.7219) + (2/7)(1) = 0.8014

IG = 0.8631 − 0.8014 = 0.0617

x2: x2 = 0: 3/7 examples

p = 2/3, n = 1/3

Entropy(x2 = 0) = −(2/3)log2(2/3) − (1/3)log2(1/3) = 0.9183

x2 = 1: 4/7 examples

p = 0, n = 1

Entropy(x2 = 1) = −0 − 1(log2(1)) = 0

Entropy(x2) = (4/7)(0) + (3/7)(0.9183) = 0.3936

IG = 0.8631 − 0.3936 = 0.4695

x3: x3 = 0: 4/7 examples

p = 1/4, n = 3/4

Entropy(x3 = 0) = −(1/4)log2(1/4) − (3/4)log2(3/4) = 0.8113

x3 = 1: 3/7 examples

p = 1/3, n = 2/3

Entropy(x3 = 1) = −(1/3)log2(1/3) − (2/3)log2(2/3) = 0.9183

Entropy(x3) = (4/7)(0.8113) + (3/7)(0.9183) = 0.8572

IG = 0.8631 − 0.8572 = 0.0059

x4: x4 = 0: 4/7 examples

p = 0, n = 1

Entropy(x4 = 0) = 0

x4 = 1: 3/7 examples

p = 2/3, n = 1/3

Entropy(x4 = 1) = −(1/3)log2(1/3) − (2/3)log2(2/3) = 0.9183

Entropy(x4) = (4/7)(0) + (3/7)(0.9183) = 0.3936

IG = 0.8631 − 0.3936 = 0.4695

Now, we compare the information gains and select the attribute with the highest one as the root of the decision tree. According to the results above, x2 and x4 have the same (highest) IG value, so we pick x2 as the root of the decision tree.
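These hand computations are easy to sanity-check in a few lines of Python; the helper names below (entropy, info_gain) are our own, not from any required library:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(values, labels):
    """Information gain of splitting on the attribute column `values`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Table 1: (x1, x2, x3, x4, y)
data = [(0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 1, 1), (1, 0, 0, 1, 1),
        (0, 1, 1, 0, 0), (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)]
y = [row[4] for row in data]
for i in range(4):
    # x2 and x4 come out tied for the highest gain (≈ 0.4696)
    print(f"IG(x{i + 1}) = {info_gain([row[i] for row in data], y):.4f}")
```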

We are left with one subtree that still needs splitting; the corresponding data can be seen in Table 2. We first calculate the entropy of the target variable (y) for the

Num. x1 x2 x3 x4 y

1 0 0 1 0 0

3 0 0 1 1 1

4 1 0 0 1 1

Table 2: Remaining data after the first split

remaining dataset. We know the following: p = 2/3 and n = 1/3. Therefore:

Entropy(y) = −(1/3)log2(1/3) − (2/3)log2(2/3) ≈ 0.9183.

Now, we can calculate the information gain for each of the attributes left (x1, x3, x4)

and split the dataset based on that attribute.

x1: x1 = 0: 2/3 examples

p = 1/2, n = 1/2

Entropy(x1 = 0) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1

x1 = 1: 1/3 examples

p = 0, n = 1

Entropy(x1 = 1) = −1(log2(1)) = 0

Entropy(x1) = (2/3)(1) + (1/3)(0) = 0.6667

IG = 0.9183 − 0.6667 = 0.2516

x3: x3 = 0: 1/3 examples

p = 1, n = 0

Entropy(x3 = 0) = 0

x3 = 1: 2/3 examples

p = 1/2, n = 1/2

Entropy(x3 = 1) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1

Entropy(x3) = (1/3)(0) + (2/3)(1) = 0.6667

IG = 0.9183 − 0.6667 = 0.2516

x4: x4 = 0: 1/3 examples

p = 0, n = 1

Entropy(x4 = 0) = −1(log2(1)) = 0

x4 = 1: 2/3 examples

p = 1, n = 0

Entropy(x4 = 1) = −1(log2(1)) = 0

Entropy(x4) = (1/3)(0) + (2/3)(0) = 0

IG = 0.9183 − 0 = 0.9183

According to the calculations of IG, the attribute with the highest IG is x4. Therefore, the next split happens at x4, forming the decision tree as seen below.

(b) [2 points] Write the boolean function which your decision tree represents. Please

use a table to describe the function — the columns are the input variables and

label, i.e., x1, x2, x3, x4 and y; the rows are different input and function values.

Solution:

The decision tree in part (a) can be represented as the table below (Table 3), where *

can be replaced by either 0 or 1. (The * is used to save space.)

x1 x2 x3 x4 y

* 0 * 0 0

* 0 * 1 1

* 1 * * 0

Table 3: Boolean representation
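As a quick check, the tree from part (a) outputs y = 1 exactly when x2 = 0 and x4 = 1. A short script (the function name f is ours) confirms this agrees with every row of Table 1:

```python
def f(x1, x2, x3, x4):
    """Boolean function represented by the tree: root x2, then x4."""
    if x2 == 1:
        return 0
    return 1 if x4 == 1 else 0

# Table 1 rows: (x1, x2, x3, x4, y)
train = [(0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 1, 1), (1, 0, 0, 1, 1),
         (0, 1, 1, 0, 0), (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)]
assert all(f(*row[:4]) == row[4] for row in train)
print("tree matches all 7 training examples")
```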

2. [17 points] Let us use a training dataset to learn a decision tree about whether to play

tennis (Page 43, Lecture: Decision Tree Learning, accessible by clicking the link

http://www.cs.utah.edu/˜zhe/teach/pdf/decision-trees-learning.pdf). In the class, we

have shown how to use information gain to construct the tree in ID3 framework.


(a) [7 points] Now, please use majority error (ME) to calculate the gain, and select

the best feature to split the data in ID3 framework. As in problem 1, please list

every step in your tree construction, the attributes, how you calculate the gain of

each attribute and how you split the dataset according to the selected attribute.

Please also give a full structure of the tree.

Solution:

The dataset contains the following attributes: Outlook (O), Temperature (T),

Humidity (H), Wind (W), and the target attribute Play. First, we can calculate

the majority error for the target attribute Play and determine the majority class.

The majority class in Play is No (i.e., −), since it appears more frequently. Now,

we can select the attribute with the highest ME gain as the root node. First,

calculate the ME for each attribute (O, T, H, W).

ME for Outlook (O):

ME(Sunny) = Majority Error(-, -, +) = 2/3

ME(Overcast) = Majority Error(-, +, +) = 1/3

ME(Rainy) = Majority Error(-, +, -, -) = 2/4 = 1/2

Weighted Average ME for Outlook (O):

P(O=S) = 3/14; P(O=O) = 4/14; P(O=R) = 7/14

ME(O) = [P(O=S) * ME(Sunny)] + [P(O=O) * ME(Overcast)] + [P(O=R) *

ME(Rainy)]

ME(O) = (3/14) * (2/3) + (4/14) * (1/3) + (7/14) * (1/2) = 6/42 + 4/42 +

7/28 = 17/42

ME for Temperature (T):

ME(Hot) = Majority Error(-, -, +, +, +) = 3/5

ME(Medium) = Majority Error(-, -, +) = 2/3

ME(Cool) = Majority Error(-, +, +) = 1/3

Weighted Average ME for Temperature (T):

P(T=H) = 5/14; P(T=M) = 3/14; P(T=C) = 6/14

ME(T) = [P(T=H) * ME(Hot)] + [P(T=M) * ME(Medium)] + [P(T=C) *

ME(Cool)]

ME(T) = (5/14) * (3/5) + (3/14) * (2/3) + (6/14) * (1/3) = 15/70 + 6/42 +

6/42 = 27/70

ME for Humidity (H):

ME(High) = Majority Error(-, -, +, +, +) = 3/5

ME(Normal) = Majority Error(-, -, -, +, +) = 2/5

ME(Low) = Majority Error(-, +, +) = 1/3

Weighted Average ME for Humidity (H):

P(H=H) = 5/14; P(H=N) = 5/14; P(H=L) = 4/14

ME(H) = [P(H=H) * ME(High)] + [P(H=N) * ME(Normal)] + [P(H=L) *

ME(Low)]

ME(H) = (5/14) * (3/5) + (5/14) * (2/5) + (4/14) * (1/3) = 15/70 + 10/70 +

4/42 = 29/70

ME for Wind (W):

ME(Strong) = Majority Error(-, -, -, +, +, +) = 3/6 = 1/2

ME(Weak) = Majority Error(-, -, +, +, +) = 2/5


Weighted Average ME for Wind (W):

P(W=S) = 6/14; P(W=W) = 8/14

ME(W) = [P(W=S) * ME(Strong)] + [P(W=W) * ME(Weak)]

ME(W) = (6/14) * (1/2) + (8/14) * (2/5) = 3/14 + 16/70 = 41/70

Now, we select the attribute with the lowest ME, which is ”Outlook” (O) with

ME(O) = 17/42, as the root of the decision tree. This is our first split.
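For reference, the weighted majority error of a split can be computed generically as follows. This is a sketch with our own helper names, and the example counts use the standard 14-example tennis dataset, which may differ from the per-branch counts written above:

```python
def majority_error(labels):
    """Fraction misclassified by always predicting the majority label."""
    if not labels:
        return 0.0
    best = max(labels.count(v) for v in set(labels))
    return 1 - best / len(labels)

def split_me(subsets):
    """Weighted average majority error over the subsets of a split."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * majority_error(s) for s in subsets)

# Standard tennis data split on Outlook: Sunny (2+, 3-), Overcast (4+), Rainy (3+, 2-)
sunny = ['+', '+', '-', '-', '-']
overcast = ['+', '+', '+', '+']
rainy = ['+', '+', '+', '-', '-']
print(split_me([sunny, overcast, rainy]))  # (5/14)(2/5) + 0 + (5/14)(2/5) = 4/14
```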

We start with the root node ”Outlook (O),” which has three possible values:

Sunny, Overcast, and Rainy.

Split on Sunny (O = Sunny)

For the first branch, we consider the subset of data where ”Outlook” is Sunny. In

this subset, we focus on the ”Humidity (H)” attribute to make the next split decision. We calculate the Majority Error (ME) for each unique value of ”Humidity

(H)” for this subset.

ME(High) = Majority Error(-, -) = 0/2 = 0 (All instances are negative)

ME(Normal) = Majority Error(+) = 0 (All instances are positive)

ME(Low) = Majority Error(-) = 0 (All instances are negative)

Since all ME values for ”Humidity (H)” are 0, we don’t need further splits in this

branch.

Split on Overcast (O = Overcast)

For the second branch, we consider the subset of data where ”Outlook” is Overcast. In this subset, we have a pure subset where all instances are positive (Play

= Yes). We don’t need further splits in this branch since it’s already pure.

Split on Rainy (O = Rainy)

For the third branch, we consider the subset of data where ”Outlook” is Rainy. In

this subset, we focus on the ”Wind (W)” attribute to make the next split decision.

We calculate the Majority Error (ME) for each unique value of ”Wind (W)” for

this subset.

ME(Strong) = Majority Error(-, -, -) = 0/3 = 0 (All instances are negative)

ME(Weak) = Majority Error(+, +) = 0/2 = 0 (All instances are positive)

Since all ME values for ”Wind (W)” are 0, we don’t need further splits in this

branch.

The decision tree is now fully constructed with the following structure:


(b) [7 points] Please use gini index (GI) to calculate the gain, and conduct tree learning with ID3 framework. List every step and the tree structure.

Solution:

First, we calculate the Gini Index (GI) for the target attribute Play and determine

the attribute with the lowest Gini Index.

Gini Index for Play:

GI(Play) = 1 − [P(Yes)^2 + P(No)^2]

GI(Play) = 1 − [(9/14)^2 + (5/14)^2] = 0.4592

We can now select the attribute with the lowest Gini Index as the root node by

calculating the Gini Index for each attribute (O, T, H, W).

For Outlook (O):

Calculate the Gini Index for each unique value of O (S, O, R).

For Sunny (S):

Count(Play=Yes) = 2; Count(Play=No) = 3; Total instances (Sunny) = 2+3 = 5

Gini(Sunny) = 1 − [(2/5)^2 + (3/5)^2] = 0.48

For Overcast (O):

Count(Play=Yes) = 4; Count(Play=No)= 0; Total instances (Overcast) = 4+0 =

4

Gini(Overcast) = 1 − [(4/4)^2 + (0/4)^2] = 0.00

For Rainy (R):

Count(Play=Yes) = 3; Count(Play=No) = 2; Total instances (Rainy) = 3+2 = 5

Gini(Rainy) = 1 − [(3/5)^2 + (2/5)^2] = 0.48

Calculate the weighted average Gini Index for Outlook (O):

P(O = S) = 5/14; P(O = O) = 4/14; P(O = R) = 5/14

GI(O) = Weighted average of Gini Index for each value of O.

GI(O) = [P(O = S) ∗ Gini(S)] + [P(O = O) ∗ Gini(O)] + [P(O = R) ∗ Gini(R)]

GI(O) = [(5/14) ∗ 0.48] + [(4/14) ∗ 0.00] + [(5/14) ∗ 0.48] ≈ 0.1714 + 0.00 + 0.1714 = 0.3429

For Temperature (T):

Calculate the Gini Index for each unique value of T (H, M, C).

For Hot (H):

Count(Play=Yes) = 2; Count(Play=No) = 2; Total instances (Hot) = 2 + 2 = 4

Gini(Hot) = 1 − [(2/4)^2 + (2/4)^2] = 0.50

For Medium (M):

Count(Play=Yes) = 4; Count(Play=No) = 1; Total instances (Medium) = 4+1 =

5

Gini(Medium) = 1 − [(4/5)^2 + (1/5)^2] = 0.32


For Cool (C):

Count(Play=Yes) = 3; Count(Play=No) = 2; Total instances (Cool) = 3 + 2 = 5

Gini(Cool) = 1 − [(3/5)^2 + (2/5)^2] = 0.48

Calculate the weighted average Gini Index for Temperature (T):

P(T = H) = 4/14; P(T = M) = 5/14; P(T = C) = 5/14

GI(T) = Weighted average of Gini Index for each value of T.

GI(T) = [P(T = H) ∗Gini(H)] + [P(T = M) ∗Gini(M)] + [P(T = C) ∗Gini(C)]

GI(T) = [(4/14) ∗ 0.50] + [(5/14) ∗ 0.32] + [(5/14) ∗ 0.48] ≈ 0.1429 + 0.1143 + 0.1714 = 0.4286

For Humidity (H):

Calculate the Gini Index for each unique value of H (H, N, L).

For High (H):

Count(Play=Yes) = 3; Count(Play=No) = 4; Total instances (High) = 3 + 4 = 7

Gini(High) = 1 − [(3/7)^2 + (4/7)^2] = 0.4898

For Normal (N):

Count(Play=Yes) = 6; Count(Play=No) = 1; Total instances (Normal)= 6+1 = 7

Gini(Normal) = 1 − [(6/7)^2 + (1/7)^2] = 0.2449

For Low (L):

Count(Play=Yes) = 0; Count(Play=No) = 0; Total instances (Low) = 0 + 0 = 0

Gini(Low) = 0

Calculate the weighted average Gini Index for Humidity (H):

P(H = H) = 7/14; P(H = N) = 7/14;P(H = L) = 0/14

GI(H) = [P(H=H) * Gini(H)] + [P(H=N) * Gini(N)] + [P(H=L) * Gini(L)]

GI(H) = [(7/14) ∗ 0.4898] + [(7/14) ∗ 0.2449] + [(0/14) ∗ 0] ≈ 0.2449 + 0.1224 + 0 = 0.3673

For Wind (W):

Calculate the Gini Index for each unique value of W (S, W).

For Strong (S):

Count(Play=Yes) = 3; Count(Play=No) = 3; Total instances (Strong) = 3+3 = 6

Gini(Strong) = 1 − [(3/6)^2 + (3/6)^2] = 0.5

For Weak (W):

Count(Play=Yes) = 6; Count(Play=No) = 2; Total instances (Weak) = 6+ 2 = 8

Gini(Weak) = 1 − [(6/8)^2 + (2/8)^2] = 0.375

Calculate the weighted average Gini Index for Wind (W):

P(W = S) = 6/14; P(W = W) = 8/14

GI(W) = [P(W=S) * Gini(S)] + [P(W=W) * Gini(W)]

GI(W) = [(6/14) ∗ 0.5] + [(8/14) ∗ 0.375] ≈ 0.2143 + 0.2143 = 0.4286

The Outlook (O) attribute has the lowest Gini Index (0.3429). So, we select Outlook (O) as the root attribute.
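The weighted Gini computation above can be reproduced with a short helper (names are ours):

```python
def gini(labels):
    """Gini index 1 - sum(p_v^2) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(v) / n) ** 2 for v in set(labels))

def split_gini(subsets):
    """Weighted average Gini index over the subsets of a split."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)

# Split on Outlook: Sunny (2+, 3-), Overcast (4+, 0-), Rainy (3+, 2-)
sunny = ['+', '+', '-', '-', '-']
overcast = ['+', '+', '+', '+']
rainy = ['+', '+', '+', '-', '-']
print(round(split_gini([sunny, overcast, rainy]), 4))  # 0.3429
```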

Split the Data:


Split the dataset into subsets based on the values of Outlook (O):

Subset Sunny: Outlook = Sunny

Subset Overcast: Outlook = Overcast

Subset Rain: Outlook = Rain

For Subset S (Outlook = S):

Split Subset S based on Humidity (H):

Subset High: Humidity = High, Subset Normal: Humidity = Normal, Subset

Low: Humidity = Low.

For Subset High (Humidity = High):

The majority class is Play: No (since it appears more frequently). Therefore, we

can create a leaf node with the label ”No” for this subset.

For Subset Normal (Humidity = Normal):

The majority class is Play: Yes (since it appears more frequently). Therefore, we

can create a leaf node with the label ”Yes” for this subset.

For Subset Low (Humidity = Low):

The majority class is Play: No (since it appears more frequently). Therefore, we

can create a leaf node with the label ”No” for this subset.

For Subset O (Outlook = O):

The majority class is Play: Yes (since it appears more frequently). Therefore, we

can create a leaf node with the label ”Yes” for this subset.

For Subset R (Outlook = R):

Split Subset R based on Wind (W):

Subset Strong: Wind = Strong, Subset Weak: Wind = Weak

For Subset Strong (Wind = Strong):

The majority class is Play: No (since it appears more frequently). Therefore, we

can create a leaf node with the label ”No” for this subset.

For Subset Weak (Wind = Weak):

The majority class is Play: Yes (since it appears more frequently). Therefore, we

can create a leaf node with the label ”Yes” for this subset.

The correct tree structure for the Gini Index approach will look like the following:

(c) [3 points] Compare the two trees you just created with the one we built in the

class (see Page 62 of the lecture slides). Are there any differences? Why?

Solution:

The trees acquired in parts (a) and (b) are the same as the one presented in class. However, had we split on Humidity instead of Outlook when using the ME method, the decision tree would look different, though it would still have the same depth.


3. [16 points] Continue with the same training data in Problem 2. Suppose before the tree

construction, we receive one more training instance where Outlook’s value is missing:

{Outlook: Missing, Temperature: Mild, Humidity: Normal, Wind: Weak, Play: Yes}.

(a) [3 points] Use the most common value in the training data as the missing value,

and calculate the information gains of the four features. Note that if there is a

tie for the most common value, you can choose any value in the tie. Indicate the

best feature.

Solution:

To handle the missing value for the ”Outlook” attribute in the new training

instance, we can use the most common value in the training data. First, let’s

calculate the Most Common Value (MCV) for the ”Outlook” attribute in the

training data:

”Outlook” values in the training data: [S, S, O, R, R, R, O, S, S, R, S, O, O, R]

Count of ”Sunny” (S): 5

Count of ”Overcast” (O): 4

Count of ”Rainy” (R): 5

Since ”Sunny” and ”Rainy” both have a count of 5, we can choose either one as

the MCV. Let’s choose ”Sunny” as the MCV.

Now, we have the following new instance with the missing ”Outlook” value replaced by ”Sunny”:

Outlook: Sunny, Temperature: Mild, Humidity: Normal, Wind: Weak, Play: Yes

Now, we can calculate the information gain for all four variables (O, T, H, W).

Target attribute Play: p = 10/15, n = 5/15

Entropy(P) = −(10/15)log2(10/15) − (5/15)log2(5/15) ≈ 0.9183

Outlook (O):

O=S: 6/15, p = 1/2, n = 1/2. Entropy(O = S) = 1

O=O: 4/15, p = 1, n = 0. Entropy(O = O) = 0

O=R: 5/15 = 1/3, p = 3/5, n = 2/5. Entropy(O = R) ≈ 0.971

Entropy(O) = (6/15)(1) + (4/15)(0) + (1/3)(0.971) ≈ 0.7237

IG(O) = 0.9183 − 0.7237 = 0.1946

Temperature (T):

T=H: 4/15, p = 2/4 = 1/2, n = 2/4 = 1/2. Entropy(T = H) = 1

T=M: 7/15, p = 5/7, n = 2/7. Entropy(T = M) ≈ 0.8631

T=C: 4/15, p = 3/4, n = 1/4. Entropy(T = C) ≈ 0.8113

Entropy(T) ≈ 0.8858

IG(T) = 0.9183 − 0.8858 = 0.0325

Humidity (H):

H=H: 7/15, p = 3/7, n = 4/7. Entropy(H = H) ≈ 0.9852

H=N: 8/15, p = 7/8, n = 1/8. Entropy(H = N) ≈ 0.5436

H=L: 0/15

Entropy(H) ≈ 0.7497

IG(H) = 0.1686

Wind (W):

W=S: 6/15, p = 1/2, n = 1/2. Entropy(W = S) = 1


W=W: 9/15, p = 7/9, n = 2/9. Entropy(W = W) ≈ 0.7642

Entropy(W) ≈ 0.8585

IG(W) = 0.0598

Therefore, Outlook still has the highest IG value.
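Both imputation strategies from parts (a) and (b) can be sketched with a single helper. The function name and keyword arguments below are hypothetical, ours alone, not part of any required interface:

```python
from collections import Counter

def impute_most_common(rows, attr, missing=None, same_label=None, label_key='Play'):
    """Replace `missing` values of `attr` with the most common observed value.
    If `same_label` is given, only rows with that label vote (part (b)'s variant).
    Returns a new list of rows; the input rows are not modified."""
    pool = [r[attr] for r in rows
            if r[attr] != missing and (same_label is None or r[label_key] == same_label)]
    mcv = Counter(pool).most_common(1)[0][0]
    return [dict(r, **{attr: mcv}) if r[attr] == missing else r for r in rows]
```

Called without `same_label`, this reproduces part (a); with `same_label='Yes'`, part (b).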

(b) [3 points] Use the most common value among the training instances with the

same label, namely, their attribute ”Play” is ”Yes”, and calculate the information

gains of the four features. Again if there is a tie, you can choose any value in the

tie. Indicate the best feature.

Solution:

To handle the missing value for the ”Outlook” attribute in the new training instance, we can use the most common value in the training data. First, let’s

calculate the Most Common Value (MCV) for the ”Play = Yes” attribute in the

training data:

Count of ”Sunny” (S): 2

Count of ”Overcast” (O): 4

Count of ”Rainy” (R): 3

Since ”Overcast” has the highest count (4), we choose it as the MCV.

Now, we have the following new instance with the missing ”Outlook” value replaced by ”Overcast”:

Outlook: Overcast, Temperature: Mild, Humidity: Normal, Wind: Weak, Play:

Yes

Now, we can calculate the information gain for all four variables (O, T, H, W).

Since the only counts that change are O = S and O = O, we only need to recompute the information gain for Outlook; everything else stays the same.

Target attribute Play: Entropy(P) =≈ 0.9183

Outlook (O):

O=S: 5/15 = 1/3, p = 2/5, n = 3/5. Entropy(O = S) ≈ 0.971

O=O: 5/15 = 1/3, p = 1, n = 0. Entropy(O = O) = 0

O=R: 5/15 = 1/3, p = 3/5, n = 2/5. Entropy(O = R) ≈ 0.971

Entropy(O) = (1/3)(0.971) + (1/3)(0) + (1/3)(0.971) ≈ 0.6473

IG(O) = 0.9183 − 0.6473 = 0.271

Temperature (T): IG(T) = 0.9183 − 0.8858 = 0.0325

Humidity (H): IG(H) = 0.1686

Wind (W): IG(W) = 0.0598

As we can see information gain for the Outlook attribute remains the highest.

(c) [3 points] Use the fractional counts to infer the feature values, and then calculate

the information gains of the four features. Indicate the best feature.

Solution:

Fractional Counts for Outlook (O):

Total instances: 14. Calculate the fraction of each unique value in the available

data:

Fraction(S) = Count(S) / Total instances = 5 / 14

Fraction(O) = Count(O) / Total instances = 4 / 14


Fraction(R) = Count(R) / Total instances = 5 / 14

Then the sizes of the Sunny, Overcast, and Rain subsets are, respectively, 5 + 5/14, 4 + 4/14, and 5 + 5/14. We can then find p+ and p− for all attributes, which will help us calculate the entropies and then the information gains.

Outlook: O=S: p+ = (2 + 5/14)/(5 + 5/14), p− = 3/(5 + 5/14)

Entropy(O = S) = −p+ log2(p+) − p− log2(p−) ≈ 0.9896

Outlook: O=O: p+ = (4 + 4/14)/(4 + 4/14) = 1, p− = 0

Entropy(O = O) = −1(log2(1)) − 0 = 0

Outlook: O=R: p+ = (3 + 5/14)/(5 + 5/14), p− = 2/(5 + 5/14)

Entropy(O = R) = −p+ log2(p+) − p− log2(p−) ≈ 0.9532

Since the current entropy is still 0.9183, we get the following:

Entropy(O) = ((5 + 5/14)/15)(0.9896) + ((4 + 4/14)/15)(0) + ((5 + 5/14)/15)(0.9532) ≈ 0.6939

IG(O) = 0.9183 − 0.6939 = 0.2244

The information gains for the other attributes remain the same, since the missing value does not affect them. Therefore, Outlook still has the highest IG.
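Fractional counts amount to giving each (label, branch) cell a real-valued weight instead of an integer count; a minimal sketch (the helper name is ours):

```python
import math

def weighted_entropy(weight_by_label):
    """Entropy of a label distribution given fractional example weights."""
    total = sum(weight_by_label.values())
    h = 0.0
    for w in weight_by_label.values():
        if w > 0:
            p = w / total
            h -= p * math.log2(p)
    return h

# Sunny branch: 2 positives plus the 5/14-weighted fractional example, 3 negatives
print(round(weighted_entropy({'+': 2 + 5/14, '-': 3}), 4))  # 0.9896
```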

(d) [7 points] Continue with the fractional examples, and build the whole tree with information gain. List every step and the final tree structure.

Solution:

According to the previous parts, we know that Outlook has the highest information gain, so we can take it as a root node and have the first split there (as seen

below).

For the subset ”Sunny”, entropy is already found and

it is 0.9896. Now, we can find out which attribute should split here.

Temperature (T): T=H: 2/6 = 1/3, p = 0, n = 1, Entropy(T = H) = 0.

Temperature (T): T=M: 3/6 = 1/2, p = 2/3, n = 1/3, Entropy(T = M) ≈ 0.9183.

Temperature (T): T=C: 1/6, p = 1, n = 0, Entropy(T = C) = 0.

Entropy(T) = (1/3)(0) + (1/2)(0.9183) + (1/6)(0) ≈ 0.4592

IG(T) = 0.9896 − 0.4592 = 0.5304

Humidity (H): H=H: 3/6 = 1/2, p = 0, n = 1, Entropy(H = H) = 0.

Humidity (H): H=N: 3/6 = 1/2, p = 1, n = 0, Entropy(H = N) = 0.

Humidity (H): H=L: 0 examples.

Entropy(H) = 0, IG(H) = 0.9896

There is no need to check the Wind attribute, since Humidity already has the highest possible information gain. Therefore, we split on Humidity.

For the subset ”Rain”, entropy is already found and it is 0.9532. Now, we can

find out which attribute should split here.


Temperature (T): T=H: 0 examples.

Temperature (T): T=M: 3/5, p = 2/3, n = 1/3, Entropy(T = M) ≈ 0.9183.

Temperature (T): T=C: 2/5, p = 1/2, n = 1/2, Entropy(T = C) = 1.

Entropy(T) = 0 + (3/5)(0.9183) + (2/5)(1) ≈ 0.951

IG(T) = 0.9532 − 0.951 = 0.0022

Wind (W): W=W: 3/5, p = 1, n = 0, Entropy(W = W) = 0.

Wind (W): W=S: 2/5, p = 0, n = 1, Entropy(W = S) = 0.

Entropy(W) = 0, IG(W) = 0.9532

Wind has the highest information gain. Therefore, we split on Wind.

4. [Bonus question 1] [5 points]. Prove that the information gain is always nonnegative. That means, as long as we split the data, the purity will never get worse!

(Hint: use convexity)

Solution:

Proof: We want to prove that IG(A) is always non-negative. To do this, we use the concavity of the entropy function H(q) = −Σ_y q(y) log2(q(y)) (equivalently, the convexity of −H, which is what the hint refers to).

Let pi = P(A = vi) be the probability of each value vi of attribute A. By the law of total probability, the marginal distribution of Y is a convex combination of the conditional distributions:

P(Y) = Σi pi P(Y | A = vi)

Since entropy is concave, Jensen's inequality gives:

H(Y) = H(Σi pi P(Y | A = vi)) ≥ Σi pi H(P(Y | A = vi)) = Σi P(A = vi) H(Y | A = vi)

The right-hand side is exactly the expected entropy after splitting on A, i.e., H(Y | A). Substituting into the definition of information gain:

IG(A) = H(Y) − Σi P(A = vi) H(Y | A = vi) = H(Y) − H(Y | A) ≥ 0

Therefore, the information gain IG(A) is always non-negative, meaning that as long as we split the data, the (expected) purity will never get worse.


5. [Bonus question 2] [5 points]. We have discussed how to use decision tree for regression (i.e., predict numerical values) — on the leaf node, we simply use the average

of the (numerical) labels as the prediction. Now, to construct a regression tree, can

you invent a gain to select the best attribute to split data in ID3 framework?

Solution:

In the context of constructing a regression tree in the ID3 framework, we can use variance reduction. The formula would look like the following:

IG(A) = Var(S) − Σ_v (|Sv| / |S|) ∗ Var(Sv)

where v ranges over the values of A and Sv is the subset of S with A = v.

This formula works because the goal of a regression tree is to find attribute splits that

result in reduced variance within the subsets. When variance is reduced after a split,

it indicates that the predicted values in those subsets are more similar to each other,

resulting in better predictions. By selecting the attribute A that maximizes the reduction in variance (i.e., maximizes IG(A)), we ensure that the chosen split leads to the

greatest improvement in prediction accuracy.
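A sketch of this variance-reduction gain (helper names are ours):

```python
def variance(values):
    """Population variance of a list of numeric targets."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def variance_gain(attr_values, targets):
    """Var(S) minus the weighted variance of the subsets induced by the attribute."""
    n = len(targets)
    gain = variance(targets)
    for v in set(attr_values):
        subset = [t for x, t in zip(attr_values, targets) if x == v]
        gain -= len(subset) / n * variance(subset)
    return gain

# A perfectly separating attribute recovers the full variance as gain:
print(variance_gain([0, 0, 1, 1], [1.0, 1.0, 5.0, 5.0]))  # 4.0
```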

2 Decision Tree Practice [60 points]

1. [5 Points] Starting from this assignment, we will build a light-weighted machine learning library. To this end, you will first need to create a code repository in Github.com.

Please refer to the short introduction in the appendix and the official tutorial to create

an account and repository. Please commit a README.md file in your repository, and

write one sentence: ”This is a machine learning library developed by Your Name for

CS5350/6350 in University of Utah”. You can now create a first folder, ”DecisionTree”.

Please leave the link to your repository in the homework submission. We will check if

you have successfully created it.

Solution: Here is the link to my repository: https://github.com/milenabel/CS6350-ML2023

2. [30 points] We will implement a decision tree learning algorithm for the car evaluation task. The dataset is from the UCI repository (https://archive.ics.uci.edu/ml/datasets/car+evaluation). Please download the processed dataset (car.zip) from Canvas. In

this task, we have 6 car attributes, and the label is the evaluation of the car. The

attribute and label values are listed in the file “data-desc.txt”. All the attributes are

categorical. The training data are stored in the file “train.csv”, consisting of 1,000 examples. The test data are stored in “test.csv”, and comprise 728 examples. In both training and test datasets, attribute values are separated by commas; the file “data-desc.txt” lists the attribute names in each column.

Note: we highly recommend you to use Python for implementation, because it is very

convenient to load the data and handle strings. For example, the following snippet

reads the CSV file line by line and split the values of the attributes and the label into

a list, “terms”. You can also use “dictionary” to store the categorical attribute values.

On the web there are numerous tutorials and examples for Python; if you have issues, just google it!

with open(CSVfile, 'r') as f:
    for line in f:
        terms = line.strip().split(',')
        # process one training example

(a) [15 points] Implement the ID3 algorithm that supports information gain, majority error and gini index to select attributes for data splits. Besides, your ID3 should allow users to set the maximum tree depth. Note: you do not need to convert categorical attributes into binary ones, and your tree can be wide here.

Solution: The implementation can be seen in my Github in the carIDE.py file

inside the DecisionTree folder.
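For reference, a minimal sketch of such an ID3 (selectable heuristic, depth limit, categorical attributes kept wide) could look like the following. All names here are ours; this is an illustrative sketch, not the submitted implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def majority_error(labels):
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

HEURISTICS = {'info_gain': entropy, 'majority_error': majority_error, 'gini': gini}

def best_attribute(rows, labels, attrs, measure):
    """Attribute with the highest gain under the given impurity measure."""
    def gain(a):
        n = len(labels)
        g = measure(labels)
        for v in set(r[a] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[a] == v]
            g -= len(sub) / n * measure(sub)
        return g
    return max(attrs, key=gain)

def id3(rows, labels, attrs, max_depth, heuristic='info_gain'):
    """Nested-dict tree: {attribute: {value: subtree or leaf label}}."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs or max_depth == 0:
        return majority
    a = best_attribute(rows, labels, attrs, HEURISTICS[heuristic])
    rest = [x for x in attrs if x != a]
    node = {a: {}}
    for v in set(r[a] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        node[a][v] = id3([r for r, _ in sub], [y for _, y in sub],
                         rest, max_depth - 1, heuristic)
    return node

def predict(tree, row, default=None):
    """Walk the nested dicts until a leaf label is reached."""
    while isinstance(tree, dict):
        a = next(iter(tree))
        tree = tree[a].get(row[a], default)
    return tree
```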

(b) [10 points] Use your implemented algorithm to learn decision trees from the

training data. Vary the maximum tree depth from 1 to 6 — for each setting,

run your algorithm to learn a decision tree, and use the tree to predict both the

training and test examples. Note that if your tree cannot grow up to 6 levels, you

can stop at the maximum level. Report in a table the average prediction errors

on each dataset when you use information gain, majority error and gini index

heuristics, respectively.

Solution: The solution for this part is represented through table 4.

Error Type Max Depth Avg Train Error Avg Test Error

information gain 1 0.3020 0.2967

information gain 2 0.2220 0.2225

information gain 3 0.1810 0.1964

information gain 4 0.0820 0.1470

information gain 5 0.0270 0.0907

information gain 6 0.0000 0.0989

majority error 1 0.3020 0.2967

majority error 2 0.3020 0.2967

majority error 3 0.2590 0.2940

majority error 4 0.1690 0.2665

majority error 5 0.3020 0.2102

majority error 6 0.0600 0.2225

gini index 1 0.3020 0.2967

gini index 2 0.2220 0.2225

gini index 3 0.1760 0.1841

gini index 4 0.0890 0.1332

gini index 5 0.0270 0.0907

gini index 6 0.0000 0.0989

Table 4: Average Prediction Errors


(c) [5 points] What can you conclude by comparing the training errors and the test

errors?

Solution:

From the results of part (b), we can make the following observations regarding the

training errors and test errors for different heuristics and different tree depths:

Information Gain Heuristic:

Training Error: Decreases as the tree depth increases.

Test Error: Initially decreases but then increases slightly after a certain depth.

Conclusion: The information gain heuristic results in relatively low training errors and

reasonable test errors, suggesting that it effectively builds decision trees that generalize

well. However, it’s crucial to avoid overfitting by selecting an appropriate tree depth.

Majority Error Heuristic:

Training Error: Generally decreases as the tree depth increases.

Test Error: Initially decreases but then tends to plateau or increase slightly after a

certain depth.

Conclusion: The majority error heuristic also results in low training errors but tends

to have higher test errors compared to information gain. It may be less effective at

generalization.

Gini Index Heuristic:

Training Error: Decreases as the tree depth increases.

Test Error: Shows a similar pattern to information gain, initially decreasing and then

stabilizing or slightly increasing.

Conclusion: The Gini index heuristic performs similarly to information gain in terms

of training and test errors, indicating its effectiveness in building decision trees that

generalize well.

Overall, it appears that the Information Gain and Gini Index heuristics are more

effective than Majority Error in terms of generalization performance, as they tend to

produce lower test errors.

3. [25 points] Next, modify your implementation a little bit to support numerical attributes. We will use a simple approach to convert a numerical feature into a binary one. We choose the median (NOT the average) of the attribute values (in the training set) as the threshold, and examine if the feature is bigger (or less) than the threshold. We will use another real dataset from the UCI repository (https://archive.ics.uci.edu/

ml/datasets/Bank+Marketing). This dataset contains 16 attributes, including both

numerical and categorical ones. Please download the processed dataset from Canvas

(bank.zip). The attribute and label values are listed in the file “data-desc.txt”. The training set is the file “train.csv”, consisting of 5,000 examples, and the test set “test.csv” with 5,000 examples as well. In both training and test datasets, attribute values are separated by commas; the file “data-desc.txt” lists the attribute names in each column.
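The median-threshold conversion described above can be sketched as follows (function and label names are ours):

```python
import statistics

def binarize_numeric(train_column, other_column):
    """Binarize a numeric column using the TRAINING-set median as the threshold."""
    threshold = statistics.median(train_column)
    label = lambda x: 'gt' if x > threshold else 'le'
    return [label(x) for x in train_column], [label(x) for x in other_column], threshold
```

The same training-set threshold is reused for the test column, so train and test stay consistent.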

(a) [10 points] Let us consider “unknown” as a particular attribute value, and hence

we do not have any missing attributes for both training and test. Vary the

maximum tree depth from 1 to 16 — for each setting, run your algorithm to learn

a decision tree, and use the tree to predict both the training and test examples.


Again, if your tree cannot grow up to 16 levels, stop at the maximum level. Report

in a table the average prediction errors on each dataset when you use information

gain, majority error and gini index heuristics, respectively.

Solution: The solution for this part is represented through tables 5, 6, 7.

Max Depth    Avg Train Error    Avg Test Error
    1             0.1192            0.1248
    2             0.1060            0.1114
    3             0.1006            0.1068
    4             0.0800            0.1154
    5             0.0624            0.1296
    6             0.0480            0.1386
    7             0.0372            0.1464
    8             0.0292            0.1522
    9             0.0222            0.1572
   10             0.0182            0.1614
   11             0.0150            0.1640
   12             0.0136            0.1668
   13             0.0132            0.1686
   14             0.0132            0.1694
   15             0.0132            0.1708
   16             0.0132            0.1708

Table 5: Information Gain Average Prediction Errors with "Unknown" Treated as an Attribute Value

(b) [10 points] Let us consider "unknown" as a missing attribute value. Here we simply complete it with the majority of the other values of the same attribute in the training set. Vary the maximum tree depth from 1 to 16; for each setting, run your algorithm to learn a decision tree, and use the tree to predict both the training and test examples. Report in a table the average prediction errors on each dataset when you use the information gain, majority error, and Gini index heuristics, respectively.

Solution: The results for this part are reported in Tables 8, 9, and 10.
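The majority-value completion described above can be sketched as follows (a minimal sketch; `impute_unknown` is a hypothetical helper, and rows are assumed to be lists of strings parsed from the CSV files):

```python
from collections import Counter

def impute_unknown(train_rows, test_rows, col):
    """Replace "unknown" in one column with the majority non-"unknown"
    value, computed from the training set only and applied to both sets."""
    known = [r[col] for r in train_rows if r[col] != "unknown"]
    majority = Counter(known).most_common(1)[0][0]
    for r in train_rows + test_rows:
        if r[col] == "unknown":
            r[col] = majority
    return majority
```

Computing the majority from the training set alone keeps the test set untouched during preprocessing decisions, as the assignment specifies.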

(c) [5 points] What can you conclude by comparing the training errors and the test errors, with different tree depths, as well as different ways to deal with "unknown" attribute values?

Solution:

From the results of parts (a) and (b), we can make the following observations regarding the training errors and test errors for the different heuristics, different tree depths, and different ways to deal with "unknown" attribute values:

Without Handling "Unknown" as Missing Values:

As the tree depth increases, the training error tends to decrease for all three heuristics (information gain, majority error, Gini index). This is expected because deeper trees can model the training data more closely, leading to lower training errors.


Max Depth    Avg Train Error    Avg Test Error
    1             0.1088            0.1166
    2             0.1042            0.1088
    3             0.0964            0.1134
    4             0.0790            0.1204
    5             0.0648            0.1302
    6             0.0544            0.1416
    7             0.0428            0.1638
    8             0.0372            0.1764
    9             0.0328            0.1774
   10             0.0262            0.1820
   11             0.0232            0.1926
   12             0.0192            0.2012
   13             0.0154            0.2060
   14             0.0132            0.2084
   15             0.0132            0.2102
   16             0.0132            0.2102

Table 6: Majority Error Average Prediction Errors with "Unknown" Treated as an Attribute Value

The test error, however, does not always follow the same trend. It initially decreases with increasing depth but starts to increase after reaching a certain depth. This indicates overfitting, where the model becomes too specialized to the training data and performs poorly on unseen test data.

Among the three heuristics, information gain generally performs better in terms of test error compared to majority error and the Gini index.

With Handling "Unknown" as Missing Values:

Handling "unknown" values as missing values and completing them with the majority value of the respective attribute from the training set does not significantly change the overall trends observed in the errors. The same pattern holds: training error decreases with increasing tree depth, while test error initially decreases and then increases.

The choice of heuristic remains important, with information gain generally performing better in terms of test error. Handling "unknown" values as missing ensures the decision tree algorithm can work with the dataset, but it does not fundamentally change the trade-off between tree complexity and overfitting.

Overall, deeper trees can fit the training data better but may lead to overfitting, while shallower trees may underfit.


Max Depth    Avg Train Error    Avg Test Error
    1             0.1088            0.1166
    2             0.1042            0.1088
    3             0.0936            0.1124
    4             0.0754            0.1232
    5             0.0606            0.1338
    6             0.0484            0.1474
    7             0.0366            0.1568
    8             0.0268            0.1614
    9             0.0214            0.1660
   10             0.0172            0.1690
   11             0.0140            0.1736
   12             0.0136            0.1744
   13             0.0132            0.1758
   14             0.0132            0.1760
   15             0.0132            0.1780
   16             0.0132            0.1780

Table 7: Gini Index Average Prediction Errors with "Unknown" Treated as an Attribute Value

Max Depth    Avg Train Error    Avg Test Error
    1             0.1192            0.1248
    2             0.1060            0.1114
    3             0.1008            0.1068
    4             0.0814            0.1168
    5             0.0646            0.1310
    6             0.0510            0.1426
    7             0.0406            0.1508
    8             0.0330            0.1572
    9             0.0274            0.1600
   10             0.0218            0.1636
   11             0.0194            0.1648
   12             0.0188            0.1668
   13             0.0180            0.1694
   14             0.0180            0.1706
   15             0.0180            0.1738
   16             0.0180            0.1738

Table 8: Information Gain Average Prediction Errors with "Unknown" Imputed by the Training Majority


Max Depth    Avg Train Error    Avg Test Error
    1             0.1088            0.1166
    2             0.1042            0.1088
    3             0.0964            0.1134
    4             0.0804            0.1196
    5             0.0666            0.1286
    6             0.0566            0.1396
    7             0.0476            0.1584
    8             0.0432            0.1662
    9             0.0382            0.1688
   10             0.0306            0.1768
   11             0.0274            0.1824
   12             0.0260            0.1918
   13             0.0210            0.1978
   14             0.0180            0.2018
   15             0.0180            0.2048
   16             0.0180            0.2048

Table 9: Majority Error Average Prediction Errors with "Unknown" Imputed by the Training Majority

Max Depth    Avg Train Error    Avg Test Error
    1             0.1088            0.1166
    2             0.1042            0.1088
    3             0.0936            0.1124
    4             0.0770            0.1232
    5             0.0634            0.1326
    6             0.0522            0.1454
    7             0.0406            0.1556
    8             0.0318            0.1600
    9             0.0268            0.1622
   10             0.0214            0.1676
   11             0.0190            0.1686
   12             0.0184            0.1718
   13             0.0180            0.1730
   14             0.0180            0.1740
   15             0.0180            0.1772
   16             0.0180            0.1772

Table 10: Gini Index Average Prediction Errors with "Unknown" Imputed by the Training Majority

