Large-Scale Data Mining: Models and Algorithms ECE 219

Project 4: Regression Analysis and Define Your Own Task!

1 Introduction

Regression analysis is a statistical procedure for estimating the relationship between a

target variable and a set of features that jointly inform about the target. In this project,

we explore specific-to-regression feature engineering methods and model selection that

jointly improve the performance of regression. You will conduct different experiments

and identify the relative significance of the different options.

2 Datasets

Carry out the steps in Section 3 on either one of the following datasets.

2.1 Dataset 1: Diamond Characteristics

Valentine’s day might be over, but we are still interested in building a bot to predict the

price and characteristics of diamonds. A synthetic diamonds dataset can be downloaded

from this link. This dataset contains information about 53,940 round-cut diamonds.

There are 10 variables (features) and for each sample, these features specify the various

properties of the sample. Below we describe these features:

• carat: weight of the diamond (0.2–5.01);

• cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal);

• color: diamond colour, from J (worst) to D (best);

• clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2,

VS1, VVS2, VVS1, IF (best));

• x: length in mm (0–10.74)

• y: width in mm (0–58.9)

• z: depth in mm (0–31.8)

• depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

• table: width of top of diamond relative to widest point (43–95)

In addition to these features, there is the target variable, i.e., what we would like to

predict:

• price: price in US dollars ($326–$18,823);


2.2 Dataset 2: Gas Turbine CO and NOx Emission Data Set

Being able to predict the gas emissions in a particular region may be more important

than assessing the price of diamonds. This dataset can be downloaded from this link.

The dataset contains 36733 instances of 11 sensor measurements aggregated over one

hour (by means of average or sum) from a gas turbine located in Turkey’s north western

region for the purpose of studying flue gas emissions, namely CO and NOx (NO + NO2).

There are 5 CSV files, one for each year. Concatenate all data points and add

a column for the corresponding year and treat it as a categorical feature.

There are two types of gas studied in this project:

• NOx

• CO

Pick one gas to model and drop the other. Important: Do not use the gas that you dropped

as a feature.
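The loading steps above can be sketched as follows. This is a minimal illustration, not the required implementation: the column names "AT", "CO" and "NOX", the helper name load_gas_turbine, and the in-memory stand-ins for the yearly CSV files are all assumptions made for the example.

```python
import io
import pandas as pd

def load_gas_turbine(files_by_year, target="NOX", dropped="CO"):
    """Concatenate per-year tables, tag each row with its year, drop the other gas."""
    frames = []
    for year, source in files_by_year.items():
        frame = pd.read_csv(source)       # source: a path or a file-like object
        frame["year"] = str(year)         # string label, so it is not treated as a number
        frames.append(frame)
    data = pd.concat(frames, ignore_index=True)
    data["year"] = data["year"].astype("category")
    data = data.drop(columns=[dropped])   # the dropped gas must not leak in as a feature
    return data.drop(columns=[target]), data[target]

# Tiny in-memory stand-ins for two of the yearly CSV files (made-up values).
demo_files = {
    2011: io.StringIO("AT,CO,NOX\n4.5,0.3,81.9\n4.2,0.4,82.3\n"),
    2012: io.StringIO("AT,CO,NOX\n5.1,0.5,80.1\n"),
}
X, y = load_gas_turbine(demo_files)
print(X)
print(y)
```

With the real files, the dictionary would map each year to the path of the corresponding CSV file.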

3 Required Steps

In this section, we describe the setup you need to follow. Follow these steps to process

either of the datasets in Section 2.

3.1 Before Training

Before training an algorithm, it’s always essential to inspect the data. This provides

intuition about the quality and quantity of the data and suggests ideas to extract features

for downstream ML applications. In the following sections we address these steps.

3.1.1 Handling Categorical Features

A categorical feature is a feature that can take on one of a limited number of possible

values. A common preprocessing step is to convert categorical variables into numbers so that

they are ready for training.

One method for numerical encoding of categorical features is to assign a scalar. For

instance, if we have a “Quality” feature with values {Poor, Fair, Typical, Good,

Excellent} we might replace them with numbers 1 through 5. If there is no numerical

meaning behind categorical features (e.g. {Cat, Dog}) one has to perform “one-hot

encoding” instead.
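The two encodings described above might look like this in pandas. It is a small sketch on a made-up table: the "Pet" column and the new column names are hypothetical, and the "Quality" levels are the ones from the example in the text.

```python
import pandas as pd

df = pd.DataFrame({
    "Quality": ["Poor", "Good", "Excellent", "Fair", "Typical"],
    "Pet": ["Cat", "Dog", "Dog", "Cat", "Cat"],
})

# Scalar (ordinal) encoding: the levels have a natural order, so map them to 1..5.
quality_order = ["Poor", "Fair", "Typical", "Good", "Excellent"]
quality_map = {level: rank for rank, level in enumerate(quality_order, start=1)}
df["Quality_encoded"] = df["Quality"].map(quality_map)

# One-hot encoding: no order among the values, so use one indicator column per value.
encoded = pd.get_dummies(df, columns=["Pet"], dtype=int)
print(encoded)
```

In the Diamonds dataset, cut, color and clarity have a stated worst-to-best order, so the scalar mapping is one reasonable option for them.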

3.1.2 Data Inspection

The first step for data analysis is to take a close look at the dataset.¹

• Plot a heatmap of the Pearson correlation matrix of the dataset columns. Report

which features have the highest absolute correlation with the target variable. In

the context of either dataset, describe what the correlation patterns suggest. Question 1.1

• Plot the histogram of numerical features. What preprocessing can be done if the

distribution of a feature has high skewness? Question 1.2

¹ For exploratory data analysis, one can try pandas-profiling


• Construct and inspect the box plot of categorical features vs target variable. What

do you find? Question 1.3

• For the Diamonds dataset, plot the counts by color, cut and clarity; or

• For the Gas Emission dataset, plot the yearly trends for each feature and compare

them. The data points don’t have timestamps but you may assume the indices are

times. Question 1.4
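The numerical side of the inspection steps above can be sketched without any plotting. The data here is synthetic and only borrows the column names "carat", "table" and "price"; the log transform is shown as one common option for a right-skewed feature, not as the expected answer. The correlation matrix is what a heatmap routine (for example seaborn.heatmap) would display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
carat = rng.lognormal(mean=-0.5, sigma=0.5, size=n)      # right-skewed, made-up values
table = rng.normal(loc=57.0, scale=2.0, size=n)          # unrelated to price here
price = 4000.0 * carat**2 + rng.normal(scale=200.0, size=n)
df = pd.DataFrame({"carat": carat, "table": table, "price": price})

# Pearson correlation matrix, then features ranked by |correlation| with the target.
corr = df.corr(method="pearson")
ranking = corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranking)

# Skewness before and after a log transform of the skewed feature.
skew_before = df["carat"].skew()
skew_after = np.log1p(df["carat"]).skew()
print(f"skewness of carat: {skew_before:.2f} -> {skew_after:.2f} after log1p")
```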

3.1.3 Standardization

Standardization of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features do not more-or-less look like

standard normally distributed data: Gaussian with zero mean and unit variance. If a

feature has a variance that is orders of magnitude larger than others, it might dominate

the objective function and make the estimator unable to learn from other features correctly as expected.

Standardize feature columns and prepare them for training. Question 2.1
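A minimal sketch of this step with scikit-learn's StandardScaler, on made-up data. The train/test split is an assumption of the example; it is there to show the scaler being fitted on training rows only and then reused on held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales.
X = np.column_stack([rng.normal(0.8, 0.4, 500), rng.normal(4000.0, 1500.0, 500)])
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learn mean and std from training rows only
X_test_std = scaler.transform(X_test)         # reuse those statistics on held-out rows

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```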

3.1.4 Feature Selection

• The sklearn.feature_selection.mutual_info_regression function returns the estimated

mutual information between each feature and the label. Mutual information (MI)

between two random variables is a non-negative value which measures the dependency between the variables. It is equal to zero if and only if two random variables

are independent, and higher values mean higher dependency.

• The sklearn.feature_selection.f_regression function provides F scores, which are a

way of comparing the significance of the improvement of a model, with respect to

the addition of new variables.

You **may** use these functions to select features that yield better regression results (especially in the classical models). Describe how this step qualitatively affects the

performance of your models in terms of test RMSE. Is it true for all model types? Also

list the two features of your chosen dataset that have the lowest MI w.r.t. the target. Question

2.2
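The two scoring functions can be compared on a small synthetic problem, sketched below. The three features and their names are made up: one is linearly related to the target, one only through its square, and one is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
nonlinear = rng.uniform(-2.0, 2.0, size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, nonlinear, noise])
y = 3.0 * informative + nonlinear**2 + rng.normal(scale=0.3, size=n)
names = ["informative", "nonlinear", "noise"]

mi = mutual_info_regression(X, y, random_state=0)   # non-negative MI estimates
f_scores, p_values = f_regression(X, y)             # univariate linear F-tests

for name, m, f in zip(names, mi, f_scores):
    print(f"{name:12s} MI={m:.3f}  F={f:.1f}")

lowest_mi = [names[i] for i in np.argsort(mi)[:2]]  # two features with the lowest MI
print("lowest MI:", lowest_mi)
```

Because f_regression tests a linear relationship per feature, it can rank a purely non-linear feature low even when MI does not.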

From this point on, you are free to use any combination of features, as long as the

performance of the regression model is on par with (or slightly worse than) that of the Neural Network

model.

3.2 Training

Once the data is prepared, we would like to train multiple algorithms and compare their

performance using average RMSE from 10-fold cross-validation (please refer to part 3.3).

3.3 Evaluation

Perform 10-fold cross-validation and measure average RMSE errors for training and validation sets.
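One way to obtain both averages is scikit-learn's cross_validate with return_train_score=True, sketched here on synthetic data with a plain linear model standing in for whichever estimator is being evaluated. The shuffling and the fixed random seed are choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring="neg_root_mean_squared_error",   # sklearn reports errors as negated scores
    return_train_score=True,
)
train_rmse = -scores["train_score"].mean()
valid_rmse = -scores["test_score"].mean()
print(f"train RMSE = {train_rmse:.3f}, validation RMSE = {valid_rmse:.3f}")
```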

For the random forest model, measure “Out-of-Bag Error” (OOB) as well. Explain what

OOB error and R² score mean given this link. Question 3


3.3.1 Linear Regression

What is the objective function? Train three models: (a) ordinary least squares (linear

regression without regularization), (b) Lasso and (c) Ridge regression, and answer the

following questions.

• Explain how each regularization scheme affects the learned parameter set. Question 4.1

• Report your choice of the best regularization scheme along with the optimal penalty

parameter and explain how you computed it. Question 4.2

• Does feature standardization play a role in improving the model performance (in

the cases with ridge regularization)? Justify your answer. Question 4.3

• Some linear regression packages return p-values for different features.² What is

the meaning of these p-values and how can you infer the most significant features?

Question 4.4
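To see how the three models differ in practice, the sketch below fits them on synthetic standardized data in which only the first feature matters, then prints the learned coefficients. The penalty values are arbitrary illustrative choices, not tuned ones.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 5)))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # only feature 0 matters

models = {
    "ols": LinearRegression(),
    "lasso": Lasso(alpha=0.5),     # alpha values are arbitrary illustrative penalties
    "ridge": Ridge(alpha=100.0),
}
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, w in coefs.items():
    print(f"{name:6s}", np.round(w, 3))
```

In practice the penalty would be chosen by cross-validation over a range of values rather than fixed by hand.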

3.3.2 Polynomial Regression

Perform polynomial regression by crafting products of features you selected in part 3.1.4

up to a certain degree (max degree 6) and applying ridge regression on the compound

features. You can use the scikit-learn library to build such features. Avoid overfitting by

proper regularization. Answer the following:

• What are the most salient features? Why? Question 5.1

• What degree of polynomial is best? How did you find the optimal degree? What

does a very high-order polynomial imply about the fit on the training data? What

about its performance on testing data? Question 5.2
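A possible way to set this up is a pipeline of PolynomialFeatures, StandardScaler and Ridge, with the degree chosen by cross-validated RMSE. The sketch below uses synthetic data with a known degree-2 structure and stops at degree 4 to stay fast; the ridge penalty is an illustrative value.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)

cv_rmse = {}
for degree in range(1, 5):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),          # put the compound features on a common scale
        Ridge(alpha=1.0),          # alpha is an illustrative value; tune it as well
    )
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[degree] = -scores.mean()

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, "best degree:", best_degree)
```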

3.3.3 Neural Network

You will train a multi-layer perceptron (fully connected neural network). You can simply

use the sklearn implementation:

• Adjust your network size (number of hidden neurons and depth), and weight decay

as regularization. Find a good hyper-parameter set systematically (no more than

20 experiments in total). Question 6.1

• How does the performance generally compare with linear regression? Why? Question 6.2

• What activation function did you use for the output and why? You may use none.

Question 6.3

• What is the risk of increasing the depth of the network too far? Question 6.4
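One systematic way to stay within the experiment budget is a small grid over network size and weight decay, as sketched below with sklearn's MLPRegressor on synthetic data. The grid values are placeholders, and grid search is only one possible search scheme; in MLPRegressor the alpha parameter is the L2 penalty that plays the role of weight decay.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

pipeline = make_pipeline(
    StandardScaler(),
    MLPRegressor(max_iter=500, random_state=0),
)
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(16,), (32, 32)],   # width and depth
    "mlpregressor__alpha": [1e-4, 1e-2],                      # L2 weight decay
}
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

n_experiments = len(search.cv_results_["params"])
print(n_experiments, "experiments")
print(search.best_params_, f"CV RMSE = {-search.best_score_:.3f}")
```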

² E.g., scipy.stats.linregress and statsmodels.regression.linear_model.OLS


3.3.4 Random Forest

You will train a random forest regression model on your chosen dataset and answer the following:

• Random forests have the following hyper-parameters:

– Maximum number of features;

– Number of trees;

– Depth of each tree;

Explain how these hyper-parameters affect the overall performance. Describe if

and how each hyper-parameter results in a regularization effect during training.

Question 7.1

• How do random forests create a highly non-linear decision boundary despite the fact

that all we do at each layer is apply a threshold on a feature? Question 7.2

• Randomly pick a tree in your random forest model (with maximum depth of 4) and

plot its structure. Which feature is selected for branching at the root node? What

can you infer about the importance of this feature as opposed to others? Do the

important features correspond to what you got in part 3.3.1? Question 7.3

• Measure “Out-of-Bag Error” (OOB). Explain what OOB error and R² score mean.

Question 7.4
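The random forest steps above can be sketched with sklearn as follows. The data and feature names are synthetic. For a regressor, sklearn's oob_score_ attribute is an R² computed on out-of-bag predictions; reporting 1 - oob_score_ as the OOB error is one common convention and is an assumption of this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2"]                 # hypothetical feature names
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=500)

forest = RandomForestRegressor(
    n_estimators=100,      # number of trees
    max_depth=4,           # depth of each tree
    max_features=1.0,      # fraction of features considered at each split
    oob_score=True,        # score each sample with trees that did not see it
    random_state=0,
).fit(X, y)

oob_r2 = forest.oob_score_                 # R^2 computed on out-of-bag predictions
print(f"OOB R^2 = {oob_r2:.3f}, OOB error = {1.0 - oob_r2:.3f}")

tree = forest.estimators_[rng.integers(len(forest.estimators_))]   # a random tree
root_feature = feature_names[tree.tree_.feature[0]]
print("root split on:", root_feature)
print(export_text(tree, feature_names=feature_names, max_depth=1))
```

For a graphical view of the tree, sklearn.tree.plot_tree accepts the same fitted tree.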

3.3.5 LightGBM, CatBoost and Bayesian Optimization

Boosted tree methods have shown advantages when dealing with tabular data, and recent

advances make these algorithms scalable to large scale data and enable natural treatment

of (high-cardinality) categorical features. Two of the most successful examples are LightGBM and CatBoost.

Both algorithms have many hyperparameters that influence their performance. This

results in a large search space of hyperparameters, which makes tuning hard with naive random search or grid search. Therefore, one may want to utilize

“smarter” hyperparameter search schemes. We specifically explore one of them: Bayesian

optimization.

In this part, pick either one of the datasets and apply LightGBM OR CatBoost. If

you do both, we will only look at the first one.

• Read the documentation of LightGBM OR CatBoost and determine the important

hyperparameters along with a search space for the tuning of these parameters (keep

the search space small). Question 8.1

• Apply Bayesian optimization using skopt.BayesSearchCV from scikit-optimize

to find the ideal hyperparameter combination in your search space. Report the best

hyperparameter set found and the corresponding RMSE. Question 8.2

• Qualitatively interpret the effect of the hyperparameters using the Bayesian optimization results: Which of them help with performance? Which help with regularization (shrink the generalization gap)? Which affect the fitting efficiency?

Question 8.3
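A sketch of how the search could be wired up with LightGBM is given below. The choice of hyperparameters and their ranges is an illustrative assumption, not a recommendation, and run_bayes_search is a hypothetical helper name. The search space uses skopt's tuple shorthand, and the two third-party packages are imported inside the function so the space can be inspected without them installed.

```python
# Search space in skopt's shorthand: (low, high) or (low, high, prior).
# The parameter choices and ranges are illustrative assumptions.
search_spaces = {
    "num_leaves": (8, 64),                          # tree complexity
    "learning_rate": (1e-3, 0.3, "log-uniform"),    # shrinkage per boosting step
    "n_estimators": (50, 300),                      # number of boosting rounds
    "reg_lambda": (1e-3, 10.0, "log-uniform"),      # L2 penalty on leaf weights
    "colsample_bytree": (0.5, 1.0, "uniform"),      # feature subsampling per tree
}

def run_bayes_search(X, y, n_iter=20, random_state=0):
    """Tune LGBMRegressor with BayesSearchCV; returns (best_params, best_rmse)."""
    # Imported lazily so the search space can be inspected without these packages.
    from lightgbm import LGBMRegressor
    from skopt import BayesSearchCV

    search = BayesSearchCV(
        LGBMRegressor(random_state=random_state),
        search_spaces,
        n_iter=n_iter,
        cv=10,
        scoring="neg_root_mean_squared_error",
        random_state=random_state,
    )
    search.fit(X, y)
    return dict(search.best_params_), -search.best_score_
```

After fitting, the per-iteration results in search.cv_results_ can be inspected to relate each hyperparameter to validation RMSE and fit time.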


Show Us Your Skills: Twitter Data

Introduction

As a culmination of the four projects in this class, we introduce a final dataset for you

to explore. Your task is to walk us through an end-to-end ML pipeline that accomplishes

any particular goal: regression, classification, clustering or anything else. This is a

design question and it is going to be about 30% of your grade in this project.

Below is a description and some small questions about the provided dataset to get

you started and familiarized with the dataset:

3.4 About the Data

Download the training tweet data.³ The data consists of 6 text files, each one containing

tweet data from one hashtag as indicated in the filenames.

Report the following statistics for each hashtag, i.e., for each file: Question 9.1

• Average number of tweets per hour

• Average number of followers of users posting the tweets per tweet (to make it simple,

we average over the number of tweets; if a user posted twice, we count the user

and the user’s followers twice as well)

• Average number of retweets per tweet

Plot “number of tweets in hour” over time for #SuperBowl and #NFL (a bar plot with

1-hour bins). The tweets are stored in separate files for different hashtags and files are

named tweet_[#hashtag].txt. Question 9.2

Note: The tweet file contains one tweet in each line and tweets are sorted with respect

to their posting time. Each tweet is a JSON string that you can load in Python as a dictionary. For example, if you parse it with json_object = json.loads(json_string),

you can look up the time a tweet is posted by:

json_object['citation_date']

You may also assess the number of retweets of a tweet through the following command:

json_object['metrics']['citations']['total']

Besides, the number of followers of the person tweeting can be retrieved via:

json_object['author']['followers']

The time information in the data file is in the form of UNIX time, which “encodes a

point in time as a scalar real number which represents the number of seconds that have

passed since the beginning of 00:00:00 UTC Thursday, 1 January 1970” (see Wikipedia

for details). In Python, you can convert it to human-readable date by

import datetime

datetime_object = datetime.datetime.fromtimestamp(unix_time)

The conversion above returns a datetime object storing the date and time in your

local time zone corresponding to that UNIX time.
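Putting the fields above together, the per-hashtag statistics could be computed along these lines. The helper name hashtag_stats and the demo tweets are hypothetical, and measuring hours from the first to the last tweet in the file is one possible reading of "average number of tweets per hour".

```python
import json
from collections import Counter

def hashtag_stats(lines):
    """Summarise one hashtag file given an iterable of JSON strings, one tweet each."""
    times, followers, retweets = [], [], []
    for line in lines:
        tweet = json.loads(line)
        times.append(tweet["citation_date"])
        followers.append(tweet["author"]["followers"])
        retweets.append(tweet["metrics"]["citations"]["total"])

    n = len(times)
    start = min(times)
    # Assumption: hours are measured from the first to the last tweet in the file.
    hours = max((max(times) - start) / 3600.0, 1.0)
    # Tweet counts in 1-hour bins, indexed by whole hours since the first tweet.
    hourly = Counter(int((t - start) // 3600) for t in times)
    return {
        "tweets_per_hour": n / hours,
        "followers_per_tweet": sum(followers) / n,
        "retweets_per_tweet": sum(retweets) / n,
        "hourly_counts": dict(sorted(hourly.items())),
    }

# Three made-up tweets spanning two hours, in the JSON layout described above.
demo = [
    json.dumps({"citation_date": 1421222400 + offset,
                "author": {"followers": f},
                "metrics": {"citations": {"total": r}}})
    for offset, f, r in [(0, 100, 1), (1800, 300, 3), (7200, 200, 2)]
]
stats = hashtag_stats(demo)
print(stats)
```

For a real file, an open file handle can be passed directly, since iterating over it yields one line (one tweet) at a time. The hourly counts are the heights a bar plot with 1-hour bins would show.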

³ https://ucla.box.com/s/24oxnhsoj6kpxhl6gyvuck25i3s4426d


In later parts of the project, you may need to use the PST time zone to interpret the

UNIX timestamps. To specify the time zone you would like to use, refer to the example

below:

import datetime

import pytz

# Attach the America/Los_Angeles zone so the result is expressed in Pacific time.

pst_tz = pytz.timezone('America/Los_Angeles')

datetime_object_in_pst_timezone = datetime.datetime.fromtimestamp(unix_time, pst_tz)

For more details about datetime operation and time zones, see

https://medium.com/@eleroy/10-things-you-need-to-know-about-date-and-time-in-pytho

Follow the steps outlined below: Question 10

• Describe your task.

• Explore the data and any metadata (you can even incorporate additional datasets

if you choose).

• Describe the feature engineering process. Implement it with reason: Why are you

extracting features this way – why not in any other way?

• Generate baselines for your final ML model.

• A thorough evaluation is necessary.

• Be creative in your task design – use things you have learned in other classes too if

you are excited about them!

We value creativity in this part of the project, and your score is partially based on how

unique your task is. Here are a few pitfalls you should avoid (there are more than this

list suggests):

• DO NOT perform simple sentiment analysis on Tweets: running a pre-trained sentiment analysis model on each tweet and correlating that sentiment to the score in

the game in time would give you an obvious result.

• DO NOT include trivial baselines: In sentiment analysis, for example, if you are

going to try and train a Neural Network or use a pre-trained model, your baselines

need to be competitive. Try to include alternate network architectures in addition

to simple baselines such as random or naive Bayesian baselines.

Here we list a few project directions that you can consider and modify. These are not

complete specifications. You are free, and encouraged, to create your own projects/project

parts (that may get some points for creativity). The projects you come up with should

match or exceed the complexity of the following 3 suggested options:

• Time-Series Correlation between Scores and Tweets: Since this tweet dataset

contains tweets that were posted before, during, and after the Superbowl, you can

find time-series data that have the real-time score of the football game as the tweets

are being generated. This score can be used as a dynamic label for your raw tweet

dataset: there is an alignment between the tweets and the score. You can then train

a model to predict, given a tweet, the team that is winning. Given the score change,

can you generate a tweet using an ensemble of sentences from the original data (or

using a generative model that is more sophisticated)?


Figure 1: A sample of the significant events in the game that you can easily find on the internet.

Here is one link that has the time-indexed events.

• Character-centric time-series tracking and prediction: In the #gopatriots

dataset, there are several thousand tweets mentioning “Tom Brady” and his immediate success/failure during the game. He threw 4 touchdowns and 2 interceptions,

so fan emotions about Brady throughout the game are fickle. Can we track the

average perceived emotion across tweets about each player in the game across time

in each fan base? Note that this option would require you to explore ways to find

the sentiment associated with each player in time, not to an entire tweet. Can we

correlate these emotions with the score and significant events (such as interceptions

or fumbles)? Using these features, can you predict the MVP of the game? Who was

the most successful receiver? The MVP was Brady.

• Library of Prediction Tasks given a tweet: Predict the hashtags or how

likely it is that a tweet belongs to a specific team fan. Predict the number of

retweets/likes/quotes. Predict the relative time at which a tweet was posted.

Submission

Your submission should be made to both of the following places: BruinLearn and Gradescope

within BruinLearn.

BruinLearn Please submit to BruinLearn a zip file containing your report and your code, with a

readme file on how to run your code. The zip file should be named

as

“Project1_UID1_UID2_…_UIDn.zip”

where UIDx’s are student ID numbers of the team members. Only one submission

per team is required. If you have any questions, please ask on Piazza or through

email. Piazza is preferred.

Gradescope Please submit your report to Gradescope as well. Please specify your

group members in Gradescope. It is very important that you assign each part of

your report to the question number provided in the Gradescope template.
