Lab 6 (pt. 2): Multivariate Linear Regression

This is an INDIVIDUAL assignment. The due date is as indicated on BeachBoard. Follow ALL instructions; otherwise, you may lose points. In this lab, you will make predictions based on more elaborate data and analyze the accuracy of your model. You will need numpy and pandas for this lab. Please note that since the labs will be graded separately, there will be separate resubmissions for Lab 6 (pt. 1) and Lab 6 (pt. 2).

Review:

In the previous lab, we discussed how you can find the slope and y-intercept of univariate data by using the formula below:

$$\begin{bmatrix} m \\ b \end{bmatrix} = (A^T A)^{-1} A^T y$$

However, what would you do if the data presented to you contained more columns?

Multivariate Linear Regression:

The data in the previous lab had only one independent variable (x). However, it is very realistic for data to rely on multiple independent variables. For this example, we will use the famous real estate data set. I have attached an edited version on BeachBoard for your convenience. Please use the data that I provided on BeachBoard; otherwise, your answers and your formatting will not match. A portion of the csv file (real estate train.csv) is shown below:

X1: Transaction date

X2: House age

X3: Distance to the nearest MRT station

X4: Number of convenience stores

X5: Latitude

X6: Longitude

Y: house price of unit area

The goal is to find a multivariate linear equation that can predict the house price of unit area using all of the mentioned features.

$$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + m_5 x_5 + m_6 x_6 + b$$

We simply apply the linear algebra method above, but on a larger scale. Based on the snippet from the table, we can create our matrix equation:

$$\begin{bmatrix} 37.9 \\ 42.2 \\ 47.3 \\ \vdots \end{bmatrix} = \begin{bmatrix} 2012.917 & 32 & 84.87882 & 10 & 24.98298 & 121.54024 & 1 \\ 2012.917 & 19.5 & 306.5947 & 9 & 24.98034 & 121.53951 & 1 \\ 2013.583 & 13.3 & 561.9845 & 5 & 24.98746 & 121.54391 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$

Looks familiar? This is still the same thing as our $y = A \begin{bmatrix} m \\ b \end{bmatrix}$ setup from before. However, we have multiple m values instead of just one. So instead of $y = A \begin{bmatrix} m \\ b \end{bmatrix}$, our matrix equation looks more like:

$$y = A \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$

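Thus, the approach is still the same.

$$\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix} = (A^T A)^{-1} A^T y$$

where

$$A = \begin{bmatrix} 2012.917 & 32 & 84.87882 & 10 & 24.98298 & 121.54024 & 1 \\ 2012.917 & 19.5 & 306.5947 & 9 & 24.98034 & 121.53951 & 1 \\ 2013.583 & 13.3 & 561.9845 & 5 & 24.98746 & 121.54391 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}, \quad y = \begin{bmatrix} 37.9 \\ 42.2 \\ 47.3 \\ \vdots \end{bmatrix}$$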

The result is

$$\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix} = \begin{bmatrix} 4.95347 \\ -0.2696252 \\ -0.0044963 \\ 1.11479937 \\ 230.797555 \\ -13.593203 \\ -14039.67836 \end{bmatrix} \approx \begin{bmatrix} 4.9535 \\ -0.2696 \\ -0.0045 \\ 1.1148 \\ 230.7976 \\ -13.5932 \\ -14039.6784 \end{bmatrix}$$

when rounded to 4 decimal places.
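As a rough illustration of the normal-equation step, here is a minimal sketch on made-up toy data (not the real estate file). The variable names and the toy values are assumptions for this example only; it uses just `np.linalg.inv` and matrix products, and it is not meant as the required implementation of any graded function.

```python
import numpy as np

# Toy data (made up for illustration): y = 2*x1 - 3*x2 + 5 exactly.
features = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 5.0],
])
y = 2.0 * features[:, 0] - 3.0 * features[:, 1] + 5.0

# Append a column of ones so that the last weight plays the role of b.
ones = np.ones((features.shape[0], 1))
A = np.hstack([features, ones])

# Normal equation: weights = (A^T A)^(-1) A^T y
weights = np.linalg.inv(A.T @ A) @ A.T @ y

# The recovered weights should be close to [2, -3, 5], i.e. [m1, m2, b].
print(np.round(weights, 4))
```

Because the toy targets were generated from an exact linear rule, the recovered weights match that rule up to floating-point error; real data will generally not fit this cleanly.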

Suppose I wanted to predict the house price for a unit with the following data (from the first data row in real estate test.csv):

X1: Transaction date is 2013.25

X2: House age is 26.8

X3: Distance to the nearest MRT station is 482.7581

X4: Number of convenience stores is 5

X5: Latitude is 24.97433

X6: Longitude is 121.53863

We can easily predict the house price for the unit (y) by plugging what we know into the formula

$$y = A \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$

$$y = \begin{bmatrix} 2013.25 & 26.8 & 482.7581 & 5 & 24.97433 & 121.53863 & 1 \end{bmatrix} \times \begin{bmatrix} 4.9535 \\ -0.2696 \\ -0.0045 \\ 1.1148 \\ 230.7976 \\ -13.5932 \\ -14039.6784 \end{bmatrix} = 41.0483\ldots \approx 41.0483$$

when rounded to four decimal places.
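The arithmetic above can be checked with a short sketch. The variable names `row` and `weights` are chosen here for illustration; the numbers are the example inputs and the rounded weights from this handout.

```python
import numpy as np

# Feature values from the example, with a trailing 1 for the intercept term.
row = np.array([2013.25, 26.8, 482.7581, 5, 24.97433, 121.53863, 1.0])

# Rounded weights [m1, ..., m6, b] as listed in this handout.
weights = np.array([4.9535, -0.2696, -0.0045, 1.1148,
                    230.7976, -13.5932, -14039.6784])

# A single prediction is the dot product of the row with the weights.
prediction = row @ weights
print(round(float(prediction), 4))  # 41.0483
```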

If you look at the actual price in the csv file (real estate test.csv), you'll find that the actual value is 35.5. We can measure how far off the prediction is with the absolute error:

$$\text{abs err} = |\text{predicted} - \text{actual}| = |41.0483 - 35.5| = 5.5483$$

We can also calculate the relative error by using the following formula:

$$\text{rel err} = \frac{|\text{predicted} - \text{actual}|}{\text{actual}} = \frac{|41.0483 - 35.5|}{35.5} = 0.1563$$

when rounded to 4 decimal places. This means that the predicted answer is off by 15.63%.
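For a single case, the two error measures amount to a couple of lines. The sketch below reuses the numbers from this example; the variable names are illustrative only.

```python
predicted = 41.0483
actual = 35.5

# Absolute error: distance between prediction and actual value.
abs_err = abs(predicted - actual)

# Relative error: absolute error as a fraction of the actual value.
rel_err = abs_err / actual

print(round(abs_err, 4))  # 5.5483
print(round(rel_err, 4))  # 0.1563
```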

We can better test the accuracy of the linear regression model by finding the mean absolute error (MAE) and the mean relative error (MRE).

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{predicted}_i - \text{actual}_i|$$

$$\text{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\text{predicted}_i - \text{actual}_i|}{\text{actual}_i}$$

where n is the total number of cases and i is the iteration/case number.
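To make the two averages concrete, here is a small sketch on made-up numbers (not taken from the real estate data). It only illustrates the formulas; the array names are assumptions for this example.

```python
import numpy as np

# Made-up predictions and actual values, for illustration only.
predicted = np.array([10.0, 22.0, 27.0, 44.0])
actual = np.array([8.0, 20.0, 30.0, 40.0])

# Per-case absolute errors: [2, 2, 3, 4]
abs_errors = np.abs(predicted - actual)

# MAE: mean of the absolute errors, (2 + 2 + 3 + 4) / 4
mae = abs_errors.mean()

# MRE: mean of the per-case relative errors, (0.25 + 0.1 + 0.1 + 0.1) / 4
mre = (abs_errors / actual).mean()

print(round(float(mae), 4))  # 2.75
print(round(float(mre), 4))  # 0.1375
```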

You can see a summary of the test results on the next page.

After calculating all of the absolute errors and the relative errors, we can find the MAE and the MRE. Overall, based on the test data, our model has a 15.18% error on average.

Task:

The purpose of the second part of the lab is to create a framework that will take a csv file with any number of columns and create a linear regression model. You will also analyze the accuracy of this linear regression model.

1. Take a close look at the multi_lin_reg.py file. There are four empty functions: multivar_linreg(file_name), predict(inputs, file_name), MAE(inputs, file_name), and MRE(inputs, file_name). Read through all of their descriptions carefully. Remember, you will lose points if you do not follow the instructions. We are using a grading script.

Summary of function tasks

multivar_linreg(file_name):

Given the csv file_name, find all of the weights and return these values in a numpy array. This 1xn numpy array should contain [m1, m2, m3, … , b] in this order. Round all values to four decimal places.

predict(inputs, file_name):

Given inputs, which is a numpy array of all of the weights [m1, m2, m3, … , b], make predictions from the data given in file_name. The predictions will be stored in a 1xm numpy array [y1, y2, y3, …]. Each row of data from the csv should have a prediction. Round all values to four decimal places.

MAE(inputs, file_name):

Find the mean absolute error of the linear regression model given by inputs, which is a numpy array of all of the weights [m1, m2, m3, … , b]. The mean absolute error will be calculated by testing the linear regression model with the data from file_name. Round all values to four decimal places.

MRE(inputs, file_name):

Find the mean relative error of the linear regression model given by inputs, which is a numpy array of all of the weights [m1, m2, m3, … , b]. The mean relative error will be calculated by testing the linear regression model with the data from file_name. Round all values to four decimal places.

Some important notes:

• Though this example uses six columns (six independent variables), other test cases may use more or fewer columns. However, there will be at least one independent variable.

• For consistency's sake, do not round until the very end. This means you should not round anything until you return your answers.

• If you want to create extra functions/methods to assist you, feel free to do so. However, we will only be testing the four functions that are originally in the file.

• If you use any library's linear regression or least squares method function, you will get an automatic zero. You must implement this on your own!

2. Your job is to implement all four of these functions so that they pass all test cases. We provide two csv files: Real estate train.csv is for multivar_linreg() and Real estate test.csv is for the other functions. However, we will be using other data sets and csv files to check whether your work is correct.

3. By running the provided test cases, you should get the following results:

4. After completing these functions, comment out the test cases (or delete them), or else the grading script will pick them up and mark your program as incorrect. Ensure that you have commented out or deleted ALL print statements. You risk losing points if your file prints anything.

5. Convert your multi_lin_reg.py file to a .txt file. Submit your multi_lin_reg.py file and your .txt file on BeachBoard. Do NOT submit them in a compressed folder.

Some helpful functions (refer to part 1 for other helpful functions)

Function name           What it does

np.round(array, num)    Rounds all elements in array to num decimal places.
                        Example: np.round([0.1234, 0.6546], 3) => [0.123, 0.655]

df_name.shape           Gets the dimensions of the data frame.
                        Example: df.shape => (num_rows, num_columns)

np.append(x, y)         Appends y to the end of x. See the NumPy documentation for details.
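The helpers above can be combined to turn a data frame with any number of columns into a feature matrix with a trailing column of ones. The sketch below uses a small in-memory data frame with made-up values instead of a csv file, and it assumes (as in the real estate example) that the last column holds the value to predict. The `axis=1` argument to `np.append` is used so that the ones are added as a new column rather than flattened onto the end.

```python
import numpy as np
import pandas as pd

# Small in-memory frame standing in for a csv file (values are made up).
# Assumption: the last column is the target; all others are features.
df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0],
    "X2": [4.0, 5.0, 6.0],
    "Y": [7.0, 8.0, 9.0],
})

num_rows, num_cols = df.shape        # works for any number of columns
data = df.to_numpy()
features = data[:, :num_cols - 1]    # every column except the last
y = data[:, num_cols - 1]            # the last column

# Append a column of ones for the intercept term b.
A = np.append(features, np.ones((num_rows, 1)), axis=1)
print(A.shape)  # (3, 3)
```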

Grading rubric:

To achieve any points, your submission must have the following. Anything missing from this list will result in an automatic zero. NO EXCEPTIONS!

• Submit everything: py file, txt file, and pdf file

• Program has no errors (infinite loops, syntax errors, logical errors, etc.) that terminate the program

Please note that if you change the function headers or if you do not return the proper outputs according to the function requirements, you risk losing all points for those test cases.

Points   Requirement

5        Submission is correct. Both files are part of the submission (py file and txt file). All or nothing.

15       Implemented multivar_linreg() correctly (three other cases, not including Real estate)

12       Implemented predict() correctly (three other cases, not including Real estate)

6        Implemented MAE() correctly (three other cases, not including Real estate)

6        Implemented MRE() correctly (three other cases, not including Real estate)

5        Passes original test cases (test cases in the python file have been commented out too). All or nothing.

TOTAL: 49