CS 5350/6350: Machine Learning

Homework 2

1 Paper Problems [40 points + 8 bonus]

1. [5 points] We have derived the PAC guarantee for consistent learners (namely, the

learners can produce a hypothesis that can 100% accurately classify the training data).

The PAC guarantee is described as follows. Let H be the hypothesis space used by

our algorithm. Let C be the concept class we want to apply our learning algorithm to

search for a target function in C. We have shown that, with probability at least 1 − δ,

a hypothesis h ∈ H that is consistent with a training set of m examples will have the

generalization error errD(h) < ϵ if

m > (1/ϵ) (log(|H|) + log(1/δ)).

(a) [2 points] Suppose we have two learning algorithms L1 and L2, which use hypothesis spaces H1 and H2 respectively. We know that H1 is larger than H2, i.e.,

|H1| > |H2|. For each target function in C, we assume both algorithms can find

a hypothesis consistent with the training data.

i. [1 point] According to Occam’s Razor principle, which learning algorithm’s

result hypothesis do you prefer? Why?

ii. [1 point] How is this principle reflected in our PAC guarantee? Please use

the above inequality to explain why we will prefer the corresponding result

hypothesis.

(b) [3 points] Let us investigate algorithm L1. Suppose we have n input features,

and the size of the hypothesis space used by L1 is 3^n. Given n = 10 features,

if we want to guarantee a 95% chance of learning a hypothesis of at least 90%

generalization accuracy, how many training examples at least do we need for L1?

Solution.

a)

i) According to Occam’s Razor principle, simpler hypotheses are preferred over more

complex ones given that they both explain the data sufficiently well. So if both L1 and

L2 can find hypotheses consistent with the training data, we would prefer the result

hypothesis from the learning algorithm with a smaller hypothesis space. Thus, we’d

prefer the result hypothesis from L2 since H2 is smaller than H1.

ii) The PAC guarantee inequality reflects the principle of Occam’s Razor. If we look

at the term log(|H|), it becomes evident that the size of the required training set m

increases logarithmically with the size of the hypothesis space |H|. So, if |H1| > |H2|,

then the required m to meet the same ϵ and δ for L1 will be greater than that for L2.

This implies that for algorithms with larger hypothesis spaces, more data is required

to guarantee a certain level of performance, which is a direct consequence of preferring

simpler hypotheses.

b)

Given the information:

n = 10,

|H1| = 3^10,

δ = 0.05 (because there’s a 95% chance of learning),

ϵ = 0.10 (because we want at least 90% generalization accuracy).

Plugging these into our PAC guarantee formula:

m > (1/ϵ) (log(|H|) + log(1/δ))

m > (1/0.10) (log(3^10) + log(1/0.05))

m > 10 (10 log(3) + log(20))

Evaluating with natural logarithms:

m > 10 (10 × 1.0986 + 2.9957)

m > 10 × 13.9817

m > 139.817

Rounding up, we’ll need at least 140 training examples for L1 to guarantee a 95%

chance of learning a hypothesis with at least 90% generalization accuracy.
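The arithmetic above can be double-checked with a few lines of Python (a minimal sketch; it assumes natural logarithms, which the hand calculation uses, and the helper name `pac_sample_bound` is just illustrative):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Smallest integer m satisfying m > (1/epsilon) * (ln|H| + ln(1/delta))."""
    bound = (1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta))
    return math.floor(bound) + 1  # strictly greater than the bound

# Part (b): |H| = 3^10, 90% accuracy -> epsilon = 0.10, 95% confidence -> delta = 0.05
m = pac_sample_bound(3 ** 10, epsilon=0.10, delta=0.05)
print(m)  # 140
```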

2. [5 points] In our lecture about AdaBoost algorithm, we introduced the definition of

weighted error in each round t,

ϵt = 1/2 − (1/2) Σ_{i=1}^{m} Dt(i) yi ht(xi)

where Dt(i) is the weight of the i-th training example, and ht(xi) is the prediction of the weak classifier learned in round t. Note that both yi and ht(xi) belong to {1, −1}. Prove that, equivalently,

ϵt = Σ_{yi ≠ ht(xi)} Dt(i).

Solution. Let’s break this equation down by considering the two possible cases for

each training example:

1) yi = ht(xi)

Given yi and ht(xi) belong to {1, −1}, the products for the two possible values are:

If yi = 1 and ht(xi) = 1, then yiht(xi) = 1

If yi = −1 and ht(xi) = −1, then yiht(xi) = 1

2) yi ̸= ht(xi)

For these misclassified examples:

If yi = 1 and ht(xi) = −1, then yiht(xi) = −1

If yi = −1 and ht(xi) = 1, then yiht(xi) = −1

Using the above cases in the given expression:

For correctly classified examples, the term Dt(i)yiht(xi) is Dt(i).

For misclassified examples, the term Dt(i)yiht(xi) is −Dt(i).

Now, we can split the sum according to the two cases:

ϵt = 1/2 − (1/2) ( Σ_{yi = ht(xi)} Dt(i) − Σ_{yi ≠ ht(xi)} Dt(i) )

The first summation, Σ_{yi = ht(xi)} Dt(i), is the total weight of the correctly classified examples. Since the total weight of all examples is 1 (Dt is a distribution), the weight of the correctly classified examples equals 1 − Σ_{yi ≠ ht(xi)} Dt(i). Substituting this into the equation:

ϵt = 1/2 − (1/2) ( 1 − 2 Σ_{yi ≠ ht(xi)} Dt(i) )

ϵt = 1/2 − 1/2 + Σ_{yi ≠ ht(xi)} Dt(i)

ϵt = Σ_{yi ≠ ht(xi)} Dt(i)

Therefore, we have proved the required statement.
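The identity can also be confirmed numerically on random data (a quick sketch; the labels, predictions, and normalized weights are generated arbitrarily, since the identity holds for any distribution Dt):

```python
import random

random.seed(0)
m = 20
y = [random.choice([-1, 1]) for _ in range(m)]   # true labels in {-1, +1}
h = [random.choice([-1, 1]) for _ in range(m)]   # weak-classifier predictions
raw = [random.random() for _ in range(m)]
total = sum(raw)
D = [r / total for r in raw]                     # weights form a distribution

lhs = 0.5 - 0.5 * sum(D[i] * y[i] * h[i] for i in range(m))
rhs = sum(D[i] for i in range(m) if y[i] != h[i])
print(abs(lhs - rhs) < 1e-12)  # True
```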

3. [20 points] Can you figure out an equivalent linear classifier for the following Boolean

functions? Please point out what the weight vector, the bias parameter and the hyperplane are. Note that the hyperplane is determined by an equation. If you cannot find

out a linear classifier, please explain why, and work out some feature mapping such

that, after mapping all the inputs of these functions into a higher dimensional space,

there is a hyperplane that well separates the inputs; please write down the separating

hyperplane in the new feature space.

(a) [2 point] f(x1, x2, x3) = x1 ∧ ¬x2 ∧ ¬x3

(b) [2 point] f(x1, x2, x3) = ¬x1 ∨ ¬x2 ∨ ¬x3

(c) [8 points] f(x1, x2, x3, x4) = (x1 ∨ x2) ∧ (x3 ∨ x4)

(d) [8 points] f(x1, x2) = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)

Solution.

a) The only input that results in a ’true’ output for this function is (1, 0, 0). All other

input combinations will result in ’false’. So, a linear classifier can be constructed for

this function.

Linear threshold function: w1x1 + w2x2 + w3x3 + b ≥ 0.

Setting the weights and bias to w = [2, −1, −1], b = −1.5 classifies (1, 0, 0) correctly as 'true' (since 2 − 1.5 = 0.5 ≥ 0) and all other input combinations as 'false' (the closest 'false' inputs, (1, 1, 0) and (1, 0, 1), give 2 − 1 − 1.5 = −0.5 < 0). Therefore, the linear threshold function is 2x1 − x2 − x3 − 1.5 ≥ 0, i.e., 2x1 − x2 − x3 ≥ 1.5.

The hyperplane is determined by the equation: 2x1 − x2 − x3 − 1.5 = 0.

b) This function results in ’true’ for all input combinations except (1, 1, 1). We can

also construct a linear classifier for this function.

Linear threshold function: classify 'false' when w1x1 + w2x2 + w3x3 + b ≤ 0.

Setting the weights and bias to w = [−1, −1, −1], b = 2.5 classifies (1, 1, 1) correctly as 'false' (since −3 + 2.5 = −0.5 ≤ 0) and all other input combinations as 'true' (the closest 'true' inputs, with two variables set to 1, give −2 + 2.5 = 0.5 > 0). Therefore, the linear threshold function is −x1 − x2 − x3 + 2.5 > 0, i.e., −x1 − x2 − x3 > −2.5.

The hyperplane is determined by the equation: −x1 − x2 − x3 + 2.5 = 0.

c) This function isn't linearly separable. To see why, note that (1, 0, 1, 0) and (0, 1, 0, 1) are positive examples while (1, 1, 0, 0) and (0, 0, 1, 1) are negative, and the two positive inputs sum to (1, 1, 1, 1), exactly as the two negative inputs do. Any linear function w⊤x + b would therefore assign the positive pair and the negative pair the same total score, so no hyperplane can separate them. However, we can perform a feature mapping into a higher-dimensional space:

ϕ(x1, x2, x3, x4) = (x1x3, x1x4, x2x3, x2x4)

For the function to be true, at least one product (of x1 or x2 with x3 or x4) must equal 1, so the mapped features sum to at least 1. When the function is false, either x1 = x2 = 0 or x3 = x4 = 0, so every product is 0. The separating hyperplane in the new feature space can therefore be:

x1x3 + x1x4 + x2x3 + x2x4 − 0.5 = 0

with weights w = [1, 1, 1, 1] and bias b = −0.5.

d) This function results in 'true' for inputs (0, 0) and (1, 1) and 'false' for (1, 0) and (0, 1); this is XNOR, the complement of XOR. It isn't linearly separable in its current form: the two true points and the two false points both average to (0.5, 0.5), so no hyperplane can put the pairs on opposite sides. However, we can perform a feature mapping:

ϕ(x1, x2) = (x1^2, x2^2, x1x2)

In the mapped space, (1, 1) maps to (1, 1, 1), (0, 0) maps to (0, 0, 0), (1, 0) maps to (1, 0, 0), and (0, 1) maps to (0, 1, 0).

A hyperplane that separates the true points from the false points is:

−x1^2 − x2^2 + 2x1x2 + 0.5 = 0

with weights w = [−1, −1, 2] and bias b = 0.5: the true points (1, 1, 1) and (0, 0, 0) both score 0.5 > 0, while the false points (1, 0, 0) and (0, 1, 0) both score −0.5 < 0.
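The XNOR mapping can be checked exhaustively over the four Boolean inputs (a sketch; the hyperplane w = (−1, −1, 2), b = 0.5 in the mapped space is one valid separating choice, not the only one):

```python
def phi(x1, x2):
    """Feature map (x1^2, x2^2, x1*x2)."""
    return (x1 * x1, x2 * x2, x1 * x2)

def predict(x1, x2):
    """True iff the mapped point lies on the positive side of the hyperplane."""
    w, b = (-1.0, -1.0, 2.0), 0.5
    return sum(wi * zi for wi, zi in zip(w, phi(x1, x2))) + b > 0

def xnor(x1, x2):
    return bool((x1 and x2) or (not x1 and not x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        assert predict(x1, x2) == xnor(x1, x2)
print("all four inputs classified correctly")
```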

4. [Bonus] [8 points] Given two vectors x = [x1, x2] and y = [y1, y2], find a feature mapping ϕ(·) for each of the following functions, such that the function is equal to the inner product between the mapped feature vectors ϕ(x) and ϕ(y). For example, (x⊤y)^0 = ϕ(x)⊤ϕ(y) where ϕ(x) = [1] and ϕ(y) = [1]; (x⊤y)^1 = ϕ(x)⊤ϕ(y) where ϕ(x) = x and ϕ(y) = y.

(a) [2 points] (x⊤y)^2

(b) [2 points] (x⊤y)^3

(c) [4 points] (x⊤y)^k where k is any positive integer.

Solution.

a) (x⊤y)^2

Expanding the term gives:

(x⊤y)^2 = (x1y1 + x2y2)^2 = x1^2 y1^2 + 2 x1x2 y1y2 + x2^2 y2^2

This suggests that the feature mapping ϕ(·) for vectors x and y should be:

ϕ(x) = [x1^2, √2 x1x2, x2^2]

ϕ(y) = [y1^2, √2 y1y2, y2^2]

b) (x⊤y)^3

Expanding:

(x⊤y)^3 = (x1y1 + x2y2)^3 = x1^3 y1^3 + 3 x1^2 x2 y1^2 y2 + 3 x1 x2^2 y1 y2^2 + x2^3 y2^3

This leads to the feature mapping:

ϕ(x) = [x1^3, √3 x1^2 x2, √3 x1 x2^2, x2^3]

ϕ(y) = [y1^3, √3 y1^2 y2, √3 y1 y2^2, y2^3]

c) (x⊤y)^k where k is any positive integer.

By the binomial theorem,

(x⊤y)^k = (x1y1 + x2y2)^k = Σ_{i=0}^{k} C(k, i) x1^{k−i} x2^{i} y1^{k−i} y2^{i}

where C(k, i) is the binomial coefficient. Splitting each coefficient evenly between the two vectors gives a (k + 1)-dimensional feature mapping whose i-th component, for i = 0, . . . , k, is

ϕ(x)_i = √(C(k, i)) x1^{k−i} x2^{i},  ϕ(y)_i = √(C(k, i)) y1^{k−i} y2^{i},

so that ϕ(x)⊤ϕ(y) = (x⊤y)^k. In other words, the feature mapping captures all degree-k monomials in the coordinates of the vector, weighted by the square roots of the binomial coefficients; the cases k = 2 and k = 3 above are instances of this general form.
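The general mapping can be verified numerically for an arbitrary pair of vectors (a sketch; `phi_k` is a hypothetical helper name, and `math.comb` requires Python 3.8+):

```python
import math

def phi_k(x, k):
    """Map [x1, x2] -> [sqrt(C(k, i)) * x1^(k-i) * x2^i for i = 0..k]."""
    x1, x2 = x
    return [math.sqrt(math.comb(k, i)) * x1 ** (k - i) * x2 ** i
            for i in range(k + 1)]

x, y, k = [1.5, -2.0], [0.5, 3.0], 4
kernel = (x[0] * y[0] + x[1] * y[1]) ** k                 # (x^T y)^k directly
mapped = sum(a * b for a, b in zip(phi_k(x, k), phi_k(y, k)))
print(abs(kernel - mapped) < 1e-9)  # True
```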

5. [10 points] Suppose we have the training data shown in Table 1, from which we want

to learn a linear regression model, parameterized by a weight vector w and a bias

parameter b.

x1 x2 x3 y

1 -1 2 1

1 1 3 4

-1 1 0 -1

1 2 -4 -2

3 -1 -1 0

Table 1: Linear regression training data.

(a) [1 point] Write down the LMS (least mean square) cost function J(w, b).

(b) [3 points] Calculate the gradients ∇J/∇w and ∇J/∇b when w = [−1, 1, −1]⊤ and b = −1.

(c) [3 points] What are the optimal w and b that minimize the cost function?

(d) [3 points] Now, we want to use stochastic gradient descent to minimize J(w, b).

We initialize w = 0 and b = 0. We set the learning rate r = 0.1 and sequentially

go through the 5 training examples. Please list the stochastic gradient in each

step and the updated w and b.

Solution.

a) The LMS (least mean square) cost function J(w, b) is defined as the mean of the squared differences between the predicted and actual values:

J(w, b) = (1/m) Σ_{i=1}^{m} (w⊤x^(i) + b − y^(i))^2

where m is the number of training examples.

b) To calculate the gradients with respect to w and b, we differentiate the cost function. For simplicity, denote the error on each example as:

e^(i) = w⊤x^(i) + b − y^(i)

The gradient with respect to w is:

∇J/∇w = (2/m) Σ_{i=1}^{m} e^(i) x^(i)

And the gradient with respect to b is:

∇J/∇b = (2/m) Σ_{i=1}^{m} e^(i)

Using the given values w = [−1, 1, −1]⊤ and b = −1, we compute the error e^(i) for each example and then compute the gradients.

Using the training data and plugging in the values:

e^(1) = (−1)(1) + (1)(−1) + (−1)(2) − 1 − 1 = −4 − 2 = −6

e^(2) = (−1)(1) + (1)(1) + (−1)(3) − 1 − 4 = −3 − 5 = −8

e^(3) = (−1)(−1) + (1)(1) + (−1)(0) − 1 + 1 = 2 + 0 = 2

e^(4) = (−1)(1) + (1)(2) + (−1)(−4) − 1 + 2 = 5 + 1 = 6

e^(5) = (−1)(3) + (1)(−1) + (−1)(−1) − 1 − 0 = −3 − 1 = −4

Using the above errors:

∇J/∇w = (2/5) [(−6)(1) + (−8)(1) + (2)(−1) + (6)(1) + (−4)(3),
(−6)(−1) + (−8)(1) + (2)(1) + (6)(2) + (−4)(−1),
(−6)(2) + (−8)(3) + (2)(0) + (6)(−4) + (−4)(−1)]
= (2/5) [−22, 16, −56] = [−8.8, 6.4, −22.4]

∇J/∇b = (2/5)(−6 − 8 + 2 + 6 − 4) = (2/5)(−10) = −4
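This error and gradient computation is easy to check mechanically (a sketch assuming numpy is available; it evaluates the cost's gradient exactly as defined above):

```python
import numpy as np

X = np.array([[1, -1, 2], [1, 1, 3], [-1, 1, 0], [1, 2, -4], [3, -1, -1]],
             dtype=float)
y = np.array([1, 4, -1, -2, 0], dtype=float)
w = np.array([-1, 1, -1], dtype=float)
b = -1.0

e = X @ w + b - y                 # per-example errors e^(i)
grad_w = (2 / len(y)) * X.T @ e   # (2/m) * sum_i e^(i) x^(i)
grad_b = (2 / len(y)) * e.sum()
print(e)       # [-6. -8.  2.  6. -4.]
print(grad_w)  # [ -8.8   6.4 -22.4]
print(grad_b)  # -4.0
```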

c) The optimal parameters w and b can be determined using the normal equations for linear regression.

Given the training set, we construct the design matrix X and the target vector y. First, we augment the input data with a column of ones to account for the bias term b:

X =
[1  1 −1  2]
[1  1  1  3]
[1 −1  1  0]
[1  1  2 −4]
[1  3 −1 −1]

and the target vector is y = [1, 4, −1, −2, 0]⊤.

The normal equations give θ = (X⊤X)^(−1) X⊤y, where the first entry of θ is the bias term b (because we augmented X with a column of ones). Computing:

X⊤X =
[5   5   2   0]
[5  13  −2  −2]
[2  −2   8  −6]
[0  −2  −6  30]

X⊤y = [2, 4, −2, 22]⊤

Solving the linear system (X⊤X) θ = X⊤y yields θ = [−1, 1, 1, 1]⊤. Thus the bias term is b = −1 and the weights are w = [1, 1, 1]. In fact, this solution fits every training example exactly (for instance, x^(1) gives 1 − 1 + 2 − 1 = 1 = y^(1)), so the optimal cost is J(w, b) = 0.
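The normal-equation solution can be recomputed in a few lines (a sketch assuming numpy; the ones column is prepended so the first coefficient is the bias):

```python
import numpy as np

X = np.array([[1, -1, 2], [1, 1, 3], [-1, 1, 0], [1, 2, -4], [3, -1, -1]],
             dtype=float)
y = np.array([1, 4, -1, -2, 0], dtype=float)

A = np.hstack([np.ones((5, 1)), X])        # prepend a ones column for the bias
theta = np.linalg.solve(A.T @ A, A.T @ y)  # normal equations: (A^T A) theta = A^T y
print(theta)  # [-1.  1.  1.  1.]  -> b = -1, w = [1, 1, 1]
```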

d) Stochastic gradient descent for linear regression.

Initialization: w = [0, 0, 0]⊤, b = 0, r = 0.1. Each step uses the gradient of the single-example cost (1/2)(ŷ − y)^2, namely ∇w = (ŷ − y)x and ∇b = (ŷ − y), and we go through the 5 training examples in order.

1. Using the first data point: x^(1) = [1, −1, 2]⊤, y^(1) = 1.
Prediction: ŷ^(1) = w⊤x^(1) + b = 0.
Gradients: ∇w = (ŷ^(1) − y^(1)) x^(1) = [−1, 1, −2]⊤, ∇b = −1.
Updates: w = w − r ∇w = [0.1, −0.1, 0.2]⊤, b = b − r ∇b = 0.1.

2. Using the second data point: x^(2) = [1, 1, 3]⊤, y^(2) = 4.
Prediction: ŷ^(2) = 0.1 − 0.1 + 0.6 + 0.1 = 0.7.
Gradients: ∇w = (0.7 − 4) [1, 1, 3]⊤ = [−3.3, −3.3, −9.9]⊤, ∇b = −3.3.
Updates: w = [0.43, 0.23, 1.19]⊤, b = 0.43.

3. Using the third data point: x^(3) = [−1, 1, 0]⊤, y^(3) = −1.
Prediction: ŷ^(3) = −0.43 + 0.23 + 0 + 0.43 = 0.23.
Gradients: ∇w = (0.23 + 1) [−1, 1, 0]⊤ = [−1.23, 1.23, 0]⊤, ∇b = 1.23.
Updates: w = [0.553, 0.107, 1.19]⊤, b = 0.307.

4. Using the fourth data point: x^(4) = [1, 2, −4]⊤, y^(4) = −2.
Prediction: ŷ^(4) = 0.553 + 0.214 − 4.76 + 0.307 = −3.686.
Gradients: ∇w = (−3.686 + 2) [1, 2, −4]⊤ = [−1.686, −3.372, 6.744]⊤, ∇b = −1.686.
Updates: w = [0.7216, 0.4442, 0.5156]⊤, b = 0.4756.

5. Using the fifth data point: x^(5) = [3, −1, −1]⊤, y^(5) = 0.
Prediction: ŷ^(5) = 2.1648 − 0.4442 − 0.5156 + 0.4756 = 1.6806.
Gradients: ∇w = 1.6806 · [3, −1, −1]⊤ = [5.0418, −1.6806, −1.6806]⊤, ∇b = 1.6806.
Updates: w = [0.2174, 0.6123, 0.6837]⊤, b = 0.3075.

Therefore, at the end of the first pass through the dataset, we have w ≈ [0.2174, 0.6123, 0.6837]⊤ and b ≈ 0.3075.
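The hand computation of the single pass can be reproduced with a short script (a sketch assuming numpy; it uses the per-example gradient of (1/2)(ŷ − y)^2, matching the updates above):

```python
import numpy as np

X = np.array([[1, -1, 2], [1, 1, 3], [-1, 1, 0], [1, 2, -4], [3, -1, -1]],
             dtype=float)
y = np.array([1, 4, -1, -2, 0], dtype=float)
w, b, r = np.zeros(3), 0.0, 0.1

for xi, yi in zip(X, y):       # one sequential pass through the examples
    err = w @ xi + b - yi      # stochastic gradient of (1/2) * err^2
    w -= r * err * xi
    b -= r * err
    print(w, b)                # w and b after each update
```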

2 Practice [60 points + 10 bonus]

1. [2 Points] Update your machine learning library. Please check in your implementation

of decision trees in HW1 to your GitHub repository. Remember last time you created

a folder “Decision Tree”. You can commit your code into that folder. Please also

supplement README.md with concise descriptions about how to use your code to

learn decision trees (how to call the command, set the parameters, etc). Please create

two folders “Ensemble Learning” and “Linear Regression” in the same level as the

folder “Decision Tree”.

2. [36 points] We will implement the boosting and bagging algorithms based on decision

trees. Let us test them on the bank marketing dataset in HW1 (bank.zip in Canvas).

We use the same approach to convert the numerical features into binary ones. That

is, we choose the median (NOT the average) of the attribute values (in the training set)

as the threshold, and examine if the feature is bigger (or less) than the threshold. For

simplicity, we treat “unknown” as a particular attribute value, and hence we do not

have any missing attributes for both training and test.

(a) [8 points] Modify your decision tree learning algorithm to learn decision stumps

— trees with only two levels. Specifically, compute the information gain to select

the best feature to split the data. Then for each subset, create a leaf node. Note

that your decision stumps must support weighted training examples. Based on

your decision stump learning algorithm, implement AdaBoost algorithm. Vary

the number of iterations T from 1 to 500, and examine the training and test errors. You should report the results in two figures. The first figure shows how the

training and test errors vary along with T. The second figure shows the training

and test errors of all the decision stumps learned in each iteration. What can you

observe and conclude? You have had the results for a fully expanded decision tree


in HW1. Comparing them with Adaboost, what can you observe and conclude?

Solution. The code is implemented in GitHub (https://github.com/milenabel/CS6350-ML2023/tree/main/EnsembleLearning).

Due to the long runtime of the algorithms, I only managed to do 1 to 50 iterations.
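A compact sketch of the boosting loop is below. It is illustrative only: it uses weighted error to pick a stump over binary features on toy data, whereas the actual assignment selects stumps by weighted information gain on the bank dataset, and all helper names are hypothetical:

```python
import math

def stump_train(X, y, D):
    """Pick the (feature, polarity) stump with the smallest weighted error."""
    best = None
    for j in range(len(X[0])):
        for pol in (1, -1):
            err = sum(D[i] for i in range(len(y))
                      if (pol if X[i][j] == 1 else -pol) != y[i])
            if best is None or err < best[0]:
                best = (err, j, pol)
    return best[1], best[2]

def stump_predict(stump, x):
    j, pol = stump
    return pol if x[j] == 1 else -pol

def adaboost(X, y, T):
    m = len(y)
    D = [1.0 / m] * m                      # uniform initial weights
    ensemble = []
    for _ in range(T):
        stump = stump_train(X, y, D)
        preds = [stump_predict(stump, x) for x in X]
        eps = sum(D[i] for i in range(m) if preds[i] != y[i])
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        D = [D[i] * math.exp(-alpha * y[i] * preds[i]) for i in range(m)]
        Z = sum(D)                          # renormalize to a distribution
        D = [d / Z for d in D]
        ensemble.append((alpha, stump))
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1

# toy sanity check: labels determined by the first binary feature
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
model = adaboost(X, y, T=5)
print(all(ensemble_predict(model, x) == yi for x, yi in zip(X, y)))  # True
```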

(b) [8 points] Based on your code of the decision tree learning algorithm (with information gain), implement a Bagged trees learning algorithm. Note that each tree

should be fully expanded — no early stopping or post pruning. Vary the number

of trees from 1 to 500, report how the training and test errors vary along with the

tree number in a figure. Overall, are bagged trees better than a single tree? Are

bagged trees better than Adaboost?

Solution. The code is implemented in GitHub (https://github.com/milenabel/CS6350-ML2023/tree/main/EnsembleLearning).

Due to the long runtime of the algorithms, I only managed to do 1 to 50 iterations.

Bagged trees, in general, tend to perform better than a single decision tree because they reduce variance and help to avoid overfitting. They are also better

than AdaBoost according to the figure provided.

(c) [6 points] Through the bias and variance decomposition, we have justified why

the bagging approach is more effective than a single classifier/predictor. Let us

verify it in real data. Experiment with the following procedure.

• REPEAT for 100 times

• [STEP 1] Sample 1,000 examples uniformly without replacement from the training dataset.

• [STEP 2] Run your bagged trees learning algorithm based on the 1,000 training examples and learn 500 trees.

• END REPEAT

• Now you have 100 bagged predictors in hand. For comparison, pick the first

tree in each run to get 100 fully expanded trees (i.e. single trees).

• For each of the test example, compute the predictions of the 100 single trees.

Take the average, subtract the ground-truth label, and take square to compute

the bias term (see the lecture slides). Use all the predictions to compute the

sample variance as the approximation to the variance term (if you forget

what the sample variance is, check it out here). You now obtain the bias

and variance terms of a single tree learner for one test example. You will

need to compute them for all the test examples and then take average as

your final estimate of the bias and variance terms for the single decision tree

learner. You can add the two terms to obtain the estimate of the general

squared error (that is, expected error w.r.t test examples). Now use your 100

bagged predictors to do the same thing and estimate the general bias and


variance terms, as well as the general squared error. Comparing the results

of the single tree learner and the bagged trees, what can you conclude? What

causes the difference?
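The bias/variance bookkeeping in this procedure can be sketched generically (a sketch assuming numpy; the predictions matrix would come from the 100 learned predictors, and the synthetic numbers here only exercise the function):

```python
import numpy as np

def bias_variance(preds, y_test):
    """preds: (n_predictors, n_test) matrix of per-example predictions.
    Returns (average squared bias, average sample variance, their sum)."""
    mean_pred = preds.mean(axis=0)
    bias2 = ((mean_pred - y_test) ** 2).mean()
    var = preds.var(axis=0, ddof=1).mean()  # ddof=1 -> sample variance
    return bias2, var, bias2 + var

# illustrative stand-in: 100 noisy, slightly biased predictors of a fixed target
rng = np.random.default_rng(0)
y_test = np.ones(50)
preds = y_test + rng.normal(0.2, 0.5, size=(100, 50))
b2, v, err = bias_variance(preds, y_test)
print(b2, v, err)
```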

(d) [8 points] Implement the random forest algorithm as we discussed in our lecture.

Vary the number of random trees from 1 to 500. Note that you need to modify

your tree learning algorithm to randomly select a subset of features before each

split. Then use the information gain to select the best feature to split. Vary the

size of the feature subset from {2, 4, 6}. Report in a figure how the training and

test errors vary along with the number of random trees for each feature subset

size setting. How does the performance compare with bagged trees?

(e) [6 points] Following (c), estimate the bias and variance terms, and the squared

error for a single random tree and the whole forest. Comparing with the bagged

trees, what do you observe? What can you conclude?

3. [Bonus][10 points] In practice, to confirm the performance of your algorithm, you

need to find multiple datasets for test (rather than one). You need to extract and

process data by yourself. Now please use the credit default dataset in UCI repository https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Randomly choose 24000 examples for training and the remaining 6000 for test. Feel free to

deal with continuous features. Run bagged trees, random forest, and Adaboost with

decision stumps algorithms for 500 iterations. Report in a figure how the training and

test errors vary along with the number of iterations, as compared with a fully expanded

single decision tree. Are the results consistent with the results you obtained from the

bank dataset?

4. [22 points] We will implement the LMS method for a linear regression task. The dataset

is from UCI repository (https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test). The task is to predict the real-valued SLUMP of the concrete, with 7 features.

The features and output are listed in the file “concrete/data-desc.txt”. The training data are stored in the file “concrete/train.csv”, consisting of 53 examples. The

test data are stored in “concrete/test.csv”, and comprise 50 examples. In both the

training and testing datasets, feature values and outputs are separated by commas.

(a) [8 points] Implement the batch gradient descent algorithm, and tune the learning

rate r to ensure the algorithm converges. To examine convergence, you can watch

the norm of the weight vector difference, ∥wt−wt−1∥, at each step t. if ∥wt−wt−1∥

is less than a tolerance level, say, 10−6

, you can conclude that it converges. You

can initialize your weight vector to be 0. Please find an appropriate r such that

the algorithm converges. To tune r, you can start with a relatively big value,

say, r = 1, and then gradually decrease r, say r = 0.5, 0.25, 0.125, . . ., until you

see the convergence. Report the learned weight vector, and the learning rate r.

Meanwhile, please record the cost function value of the training data at each step,

and then draw a figure showing how the cost function changes along with the steps. Use

your final weight vector to calculate the cost function value of the test data.


Solution. The code is implemented in GitHub (https://github.com/milenabel/CS6350-ML2023/tree/main/LinearRegression).

Learned GDA Weight Vector: [−0.01520404, 0.90020422, 0.78592199, 0.85064194, 1.29860639, 0.12983046, 1.57176489, 0.99832589].

Learning rate: 0.5.

Test Cost: 0.4672255434765307
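The loop structure used for this experiment can be sketched as follows (an illustrative sketch assuming numpy; it runs on the toy data from the paper problems, whereas the real run loads concrete/train.csv, and the learning rate here is chosen for the toy data, not the reported r = 0.5):

```python
import numpy as np

def batch_gd(X, y, r, tol=1e-6, max_steps=100_000):
    """LMS batch gradient descent; halt when ||w_t - w_{t-1}|| < tol."""
    A = np.hstack([np.ones((len(y), 1)), X])  # fold the bias into the weights
    w = np.zeros(A.shape[1])
    costs = []
    for _ in range(max_steps):
        e = A @ w - y
        costs.append(0.5 * (e ** 2).sum())    # cost recorded at each step
        w_new = w - r * (A.T @ e)             # full-batch gradient step
        if np.linalg.norm(w_new - w) < tol:
            return w_new, costs
        w = w_new
    return w, costs

X = np.array([[1, -1, 2], [1, 1, 3], [-1, 1, 0], [1, 2, -4], [3, -1, -1]],
             dtype=float)
y = np.array([1, 4, -1, -2, 0], dtype=float)
w, costs = batch_gd(X, y, r=0.02)
print(w)  # converges near [-1, 1, 1, 1], the exact least-squares solution
```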

(b) [8 points] Implement the stochastic gradient descent (SGD) algorithm. You can

initialize your weight vector to be 0. Each step, you randomly sample a training

example, and then calculate the stochastic gradient to update the weight vector.

Tune the learning rate r to ensure your SGD converges. To check convergence,

you can calculate the cost function of the training data after each stochastic gradient update, and draw a figure showing how the cost function values vary along

with the number of updates. At the beginning, your curve will oscillate a lot.

However, with an appropriate r, as more and more updates are finished, you will

see the cost function tends to converge. Please report the learned weight vector,

and the learning rate you chose, and the cost function value of the test data with

your learned weight vector.

Solution. The code is implemented in GitHub (https://github.com/milenabel/CS6350-ML2023/tree/main/LinearRegression).

Learned SGD Weight Vector: [−0.03969494, −0.0385621, −0.24910744, −0.10948452, 0.39582832, −0.09827536, 0.1331324, −0.03406326].

Learning rate: 0.25.

Test Cost: 0.4202370852021848


(c) [6 points] We have discussed how to calculate the optimal weight vector with an

analytical form. Please calculate the optimal weight vector in this way. Comparing with the weight vectors learned by batch gradient descent and stochastic

gradient descent, what can you conclude? Why?

Solution.

Optimal Weight Vector (Analytical): [−0.01519667, 0.90056451, 0.78629331, 0.85104314, 1.29889413, 0.12989067, 1.57224887, 0.99869359]

Observations:

– Similarity between Analytical and GDA Solutions:

– – The weight vector obtained from the Gradient Descent Algorithm (GDA) is

very close to the optimal weight vector obtained analytically.

– – This indicates that GDA has converged to a solution that is almost identical

to the true optimal solution.

– Dissimilarity between Analytical and SGD Solutions:

– – The weight vector obtained from the Stochastic Gradient Descent (SGD) is

different from the optimal weight vector.

– – This is typical for SGD, as it is a stochastic algorithm and may not converge

to the exact optimal solution. However, the cost function value indicates that the

solution is reasonable.

Conclusion:

– Gradient Descent Algorithm (GDA) is Very Effective:

– – The GDA has effectively converged to a solution that is almost identical to the

true optimal solution obtained analytically. This is a strong indication that GDA

is a powerful method for this particular problem.

– Stochastic Gradient Descent (SGD) is Effective but Less Precise:

– – The SGD has found a reasonable solution that, while not identical to the optimal solution, still provides a good approximation. The nature of SGD means it

is less precise than GDA, but it can be more efficient for very large datasets.

– Learning Rate Tuning is Crucial:

– – The best learning rate for GDA was 0.5, while for SGD it was 0.25. This

emphasizes the importance of tuning the learning rate to find the most effective

solution.

Overall, both gradient descent algorithms have proven to be effective methods for

solving this problem, with GDA providing a more precise solution that is closer

to the true optimal weight vector.

