COMP3430 / COMP8430 – Data Wrangling

Assignment 1

Worth 10% of the final grade for COMP3430 / COMP8430

Overview and Objectives

This assignment covers the topics of data quality, data exploration, and data profiling as presented in the first few weeks of

the course. It also includes questions about what data wrangling is, why it is important, and how it fits into the broader field

of data analytics. One task refers to the required readings from week 1 of the course while others ask you about practical

aspects of data exploration.

Assignment Structure

The assignment consists of four (4) tasks, described below, which are worth different numbers of marks. Make sure you

answer all aspects of each task.

If you have any questions about the assignment, please post them on Wattle. However, do not post any partial solutions,

program code, URLs, etc., or any hints on how to solve any of the assignment tasks.

Marking

This assignment will be marked out of 10, and it will contribute 10% of your final course mark.

Note that not all tasks and questions are equally difficult. For some of the tasks there is no single right or wrong answer.

Marks will be awarded based on your reasoning and the justification of your decisions and explanations, as well as clarity

and correctness of writing.

We will endeavour to release your marks and feedback within two teaching weeks after the submission deadline. If you feel

we have made an error in marking, you have two weeks following the release of marks to raise any issues with the course

convener, after which time your mark will be considered final. If you request that we re-mark your assignment, we

will re-mark the entire assignment and your mark may go up or down as a result.

Data Set Generation for this Assignment and for Assignment 2

For this assignment and the upcoming Assignment 2 each of you will work on an individual data set that will be based on a

master data set we will provide, and a data generation program we will also provide.

Note that we have generated the master data set based on real data (such as lookup tables of names, addresses, etc.), and

we have then corrupted and modified certain aspects of that data set. We have intentionally tried to include the types of

relationships, features, errors, and other data quality issues that you might find in real data sets. Any similarity to real

persons or places is entirely coincidental.

Download the master data set from Wattle (to be made available in week 2) named dw assignment master.csv.gz, and

the data generation program named generate-student-dataset.py. Copy both these files into one folder / directory, and

run the code using Python 3 in the following way:

python3 generate-student-dataset.py your ANU ID dw assignment master.csv.gz

The program will generate an output data set named data wrangling medical 2021 your ANU ID.csv.gz, and print

some output which contains the following important lines (for the example ANU ID u1234567):

>>> python3 generate-student-dataset.py u1234567 dw assignment master.csv.gz

Your student data set for the data wrangling 2021 assignments has been generated and written into file:

data wrangling medical 2021 u1234567.csv

Your ANU ID check code is: d76225bc

Your student data set check code is: 216b3fef9401

*** Check this pair of numbers is in the list provided on Wattle, if not contact the course convenor.

Important

Write down your two check codes because you must provide them with the assignment submission. This

will allow us to validate that you have generated and used the correct data set.

Check that the pair of check codes you get (d76225bc and 216b3fef9401 in the example above) is in the list of

check codes we will provide on Wattle (in week 2 under the assignment 1 document). This will allow you to check

that you have generated the correct data set.

You must use your individual generated data set for task 4 of this assignment (and the tasks on data

cleaning in Assignment 2).
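If you choose Python for the exploration, the generated file can be read without unpacking it first. The following is only a minimal sketch using the standard library: the function name load_rows is hypothetical, and the sketch assumes that the file is UTF-8 encoded and that its first row holds the attribute names.

```python
import csv
import gzip


def load_rows(path):
    """Read a gzip-compressed CSV file into a list of dictionaries."""
    # Assumption: UTF-8 text with a header row naming the attributes.
    # "rt" opens the compressed file in text mode; newline="" is what the
    # csv module expects so that quoted line breaks are handled correctly.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling load_rows with the path of your generated data set returns one dictionary per record, keyed by attribute name. All values are returned as strings, so missing entries show up as empty strings unless the file encodes them in some other way.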

Assignment Tasks

Task 1 (2 marks):

According to the paper (from week 1) by Rahm and Do (Data Cleaning: Problems and Current Approaches), data cleaning

generally deals with detecting and removing errors and inconsistencies from data to improve the quality of data. As mentioned in this

paper, there are many issues and problems related to data cleaning.

Answer the following two questions each in 10 or fewer lines of text (a maximum of 250 words each), where one text entry

will be provided in Wattle per question.

(1) Do you think the problems and issues related to data cleaning raised in this paper (in the year 2000) are still relevant

today? Justify why or why not.

(2) Imagine you are hired by the Australian Federal Department of Health as a data wrangler to deal with incoming

data sets about COVID-19 cases (details of patients who were diagnosed with the virus) from the eight Australian states

and territories. Your task is to clean and integrate these data sets to support decision making by the Australian

government.

Briefly describe three (3) data wrangling aspects you will have to consider when dealing with such data sets.

Task 2 (1 mark): The following is a list L of age values (in years) of a group of people:

L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7]

First, split your ANU ID (excluding the first character ‘u’) into four number segments (three pairs and a single number)

and then add these four number segments to L. For example, if your ANU ID is u1204067 then split it into: 12, 04, 06, 7

and add these numbers to L, so the final list becomes: L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7, 12, 4, 6, 7].
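The splitting step can be sketched in Python as follows. This is only an illustration of the example above: the function name anu_id_segments is hypothetical, and the sketch assumes an ID made up of the letter 'u' followed by exactly seven digits.

```python
def anu_id_segments(anu_id):
    """Split an ANU ID such as 'u1204067' into three pairs and a single digit."""
    digits = anu_id[1:]  # drop the leading 'u'
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected 'u' followed by seven digits")
    # int() drops leading zeros, so '04' becomes 4 as in the example above.
    return [int(digits[0:2]), int(digits[2:4]), int(digits[4:6]), int(digits[6])]


base = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7]
L = base + anu_id_segments("u1204067")
print(L)
# prints [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7, 12, 4, 6, 7]
```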

Now calculate and enter into the corresponding answer fields on Wattle:

1. the mean and standard deviation of L,

2. the median and median absolute deviation of L, and

3. the mode of L.
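One possible way to compute these measures with Python's standard statistics module is sketched below, using the example list for ANU ID u1204067. Two points are assumptions rather than course definitions: whether the standard deviation should be the population or the sample version (both are shown), and that the median absolute deviation is the unscaled median of the absolute deviations from the median. Check both against the lecture material.

```python
import statistics

# Example list from the text for ANU ID u1204067.
L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7, 12, 4, 6, 7]

mean = statistics.mean(L)
# pstdev divides by n (population); stdev divides by n - 1 (sample).
population_sd = statistics.pstdev(L)
sample_sd = statistics.stdev(L)

median = statistics.median(L)
# Median absolute deviation: the median of |x - median|, without scaling.
mad = statistics.median(abs(x - median) for x in L)

# multimode returns every most-frequent value, so ties are not hidden.
modes = statistics.multimode(L)

print(mean, population_sd, sample_sd, median, mad, modes)
```

The example list contains two values that occur equally often, which is why the sketch uses multimode rather than mode: mode would report only one of them.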

Task 3 (2 marks): Apply binning as covered in the lectures to the numbers in the list L as generated in the previous

task (i.e. L including the number segments based on your ANU ID appended).

Calculate and enter into the corresponding answer fields on Wattle the results when binning L using:

1. equal depth with two bins and smoothed by bin median,

2. equal width with three bins and smoothed by bin mean,

3. equal width with four bins and smoothed by bin boundaries, and

4. equal depth with four bins and smoothed by bin boundaries.

Clearly show the bins you generated when you enter your answers into the Wattle answer fields by showing one bin per line,

for example (assume we have binned [1,2,3,4,5,6,7,8,9] into three bins with smoothing by bin medians):

Bin 1: [2, 2, 2]

Bin 2: [5, 5, 5]

Bin 3: [8, 8, 8]
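The example above can be reproduced with the following Python sketch of equal depth and equal width binning with the three smoothing options. The function names are hypothetical and several conventions are assumptions that may differ from the lectures: how leftover values are distributed when the list does not divide evenly, that the maximum value is placed in the last equal width bin, and that a value equally close to both bin boundaries is moved to the lower one.

```python
import statistics


def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (nearly) equal counts."""
    data = sorted(values)
    size, extra = divmod(len(data), k)
    bins, start = [], 0
    for i in range(k):
        # Assumption: leftover values go to the first bins, one each.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def equal_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    bins = [[] for _ in range(k)]
    for v in data:
        # The maximum would fall just past the last interval, so clamp it.
        index = min(int((v - lo) * k / (hi - lo)), k - 1) if hi > lo else 0
        bins[index].append(v)
    return bins


def smooth(bins, how):
    """Replace each value by its bin's mean, median, or nearest boundary."""
    result = []
    for b in bins:
        if not b:
            result.append([])
        elif how == "boundaries":
            low, high = b[0], b[-1]
            # Assumption: ties between the two boundaries go to the lower one.
            result.append([low if v - low <= high - v else high for v in b])
        else:
            centre = statistics.mean(b) if how == "mean" else statistics.median(b)
            result.append([centre] * len(b))
    return result


smoothed = smooth(equal_depth_bins(range(1, 10), 3), "median")
for number, b in enumerate(smoothed, 1):
    print(f"Bin {number}: {b}")
```

Run as written, this prints the three lines of the example. Note that an equal width bin can be empty when no value falls into its interval, in which case the sketch returns an empty list for that bin.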

Task 4 (5 marks):

For the last task of this assignment you must use the data set you generated as per the instructions above. We ask

you to explore this data set using tools of your choice (Rattle, R, Python, Pandas, etc.) and answer the specific questions

about this data set given below.

Make sure to follow the instructions on the individual Wattle answer fields with regard to rounding, the

number of digits to provide after the decimal point, etc.

1. Provide the missingness patterns of values (as we discussed in the labs) for the three attributes: postcode, phone,

and email.

2. Calculate the correlation between the attributes (a) BMI and age at consultation, (b) BMI and height, and (c)

state and valid marital status. In your answers you need to provide the numerical correlation value, the name

of the correlation method you used, and a brief (one sentence) explanation of why you used that specific correlation

function for each pair of attributes.

3. For the following attributes, calculate numerical values for the following data quality dimensions:

(a) Completeness for postcode and phone.

(b) Validity for weight.

(c) Uniqueness for last name.

(d) Consistency between age at consultation and birth date (for valid age values).

4. Calculate the distributions of the first digits (Benford’s law) for the attributes (a) cholesterol level, (b) blood pressure

and (c) medicare number. Describe for each in one or two sentences whether or not it follows Benford’s law, and

why you think it does or does not follow this law.

5. Describe in a few sentences three other unusual characteristics you can identify in this data set using data exploration

and profiling.

You will receive up to one mark for correctly answering each of these questions, where both correct numerical values as

well as correct and clearly written justifications of your answers will be considered.
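For the question on first digits, the comparison against Benford's law can be sketched in Python as below. Benford's law gives the expected relative frequency of a leading digit d (1 to 9) as log10(1 + 1/d). The function names are hypothetical, and the way the leading digit is extracted (the first non-zero digit in the text form of a value, ignoring entries without one) is an assumption that you should check against how missing or malformed values appear in your own data set.

```python
import math
from collections import Counter


def first_digit_distribution(values):
    """Return the relative frequency of each leading digit 1-9."""
    counts = Counter()
    for value in values:
        # Take the first non-zero digit, skipping signs, leading zeros and
        # decimal points. Values without such a digit are ignored.
        for ch in str(value):
            if ch in "123456789":
                counts[int(ch)] += 1
                break
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


observed = first_digit_distribution(["123", "19", "0.05", "", "-2.5"])
expected = benford_expected()
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

Comparing the two columns digit by digit shows how closely an attribute follows the law. Explaining why it does or does not still requires reasoning about how the values of that attribute come about.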

Other Aspects

For all textual answers in this assignment, English writing mistakes and typographical errors will attract small penalties.
