HW5 (Hadoop MapReduce & Spark)



DSCI 551
HW5 (Hadoop MapReduce & Spark), 100 points
In this homework, we will consider the churn data set again (as in hw1). You are given two versions of
the file: churn4hadoop.csv and churn.csv. The former has no header and is to be used for the Hadoop question
below; the latter has a header and is used in Spark.
1. [Hadoop MapReduce, 40 points] Complete the provided source file by supplying the missing code where
indicated, so that it answers the following SQL query.
Select InternetService, max(tenure)
From Churn
Where churn = 'Yes'
Group by InternetService
Having count(*) > 200;
Execution format: hadoop jar churn.jar Churn input output
where the input directory contains a single file: churn4hadoop.csv.
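The query maps onto a single MapReduce job: the mapper applies the WHERE filter and emits (InternetService, tenure) pairs, and the reducer sees all tenures for one key, so it can compute both the maximum and the count and write output only when the count passes the HAVING threshold. The plain-Python sketch below simulates that flow in memory. It is not the Hadoop Java code the assignment asks for; the record layout, the sample rows, and the small threshold are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (internet_service, tenure, churn).
# The real column positions in churn4hadoop.csv are not assumed here.
records = [
    ("DSL", 5, "Yes"),
    ("DSL", 40, "Yes"),
    ("DSL", 70, "No"),
    ("Fiber optic", 12, "Yes"),
    ("Fiber optic", 30, "Yes"),
    ("Fiber optic", 8, "Yes"),
    ("No", 3, "Yes"),
]


def mapper(record):
    """WHERE churn = 'Yes': emit (InternetService, tenure) for matching rows."""
    service, tenure, churn = record
    if churn == "Yes":
        yield service, tenure


def reducer(service, tenures, min_count):
    """HAVING count(*) > min_count, then SELECT key, max(tenure)."""
    tenures = list(tenures)
    if len(tenures) > min_count:
        yield service, max(tenures)


def run_job(records, min_count):
    # Shuffle phase: group mapper output by key, as happens between map and reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key], min_count))
    return results


# The assignment's threshold is 200; 1 is used here only so the toy data yields output.
result = run_job(records, min_count=1)
print(result)
```

Because HAVING depends on the group size, the reducer needs a count as well as a maximum; a combiner that forwarded only a running maximum would lose the information needed for the count.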
2. [40 points] For each of the following SQL queries, write a Spark script that finds the answer to the
query. Note: to read a CSV file with a header into Spark as a DataFrame, proceed as follows:
churn =‘churn.csv’, header=True)
You will also need to import this:
import pyspark.sql.functions as fc
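The read line above appears to have lost the start of its call in this copy. A minimal sketch of reading a CSV file with a header, assuming the standard DataFrameReader entry point `spark.read.csv` is what was intended, is given below. The helper name `load_churn` is hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def load_churn(spark, path="churn.csv"):
    """Read a CSV whose first line is a header into a DataFrame.

    `spark` is expected to be a SparkSession; `load_churn` is a
    hypothetical helper name used only for this sketch.
    """
    # header=True makes Spark take column names from the first line
    # instead of treating that line as data.
    return spark.read.csv(path, header=True)
```

In the pyspark shell a session is already available as `spark`, so `churn = load_churn(spark)` would be enough there. Without a schema, every column read this way is a string.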
a) select count(*)
from churn
where gender = 'Male' and churn = 'Yes';
b) select gender, max(TotalCharges)
from churn
where churn = 'Yes'
group by gender;
Note: you will need to change the data type of TotalCharges from string to double. For example,
churn = churn.withColumn('TotalCharges', fc.col('TotalCharges').cast('double'))
c) select gender, count(*)
from churn
where churn = 'Yes'
group by gender;
d) select churn, contract, count(*) cnt
from churn
group by churn, contract
order by churn, cnt desc;
(churn is ascending)
e) select gender, churn, count(*)
from churn
group by gender, churn
having count(*) > 1000;
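The note under query (b) matters because TotalCharges arrives as a string, and strings compare character by character rather than by numeric value. The plain-Python sketch below shows the effect on made-up values; it illustrates the comparison rule only and is not Spark code.

```python
# Made-up charge values, stored as strings the way a CSV reader delivers them.
charges = ["99.5", "1000.25", "250.0"]

# Lexicographic comparison: "9" sorts after "1" and "2", so "99.5" wins.
max_as_string = max(charges)

# Numeric comparison after converting each value to a float.
max_as_number = max(float(c) for c in charges)

print(max_as_string)  # 99.5
print(max_as_number)  # 1000.25
```

Casting the column to double before taking the maximum avoids the first, misleading result.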
3. [20 points] Write a Spark RDD script for each of the following SQL queries.
a. Same as q2.a.
b. Same as q2.b.
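With RDDs the same queries are expressed as a chain of filter and map steps, followed by either a count or a per-key reduction with a two-argument function. The sketch below imitates that shape in plain Python so the data flow is easy to follow. `reduce_by_key` is a hypothetical stand-in for the RDD method `reduceByKey`, and the rows are made up.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: fold the values of each key pairwise."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Made-up (gender, TotalCharges, churn) rows; charges are strings, as parsed from text.
rows = [
    ("Male", "120.5", "Yes"),
    ("Female", "80.0", "Yes"),
    ("Male", "99.9", "No"),
    ("Male", "45.25", "Yes"),
    ("Female", "300.75", "Yes"),
]

# Keep churned customers only, in the role of rdd.filter(...).
churned = [r for r in rows if r[2] == "Yes"]

# Shape of q2.a: count the rows that also match gender = 'Male'.
count_male = sum(1 for r in churned if r[0] == "Male")

# Shape of q2.b: map to (key, numeric value), then reduce per key with max.
pairs = [(r[0], float(r[1])) for r in churned]
max_by_gender = reduce_by_key(pairs, max)

print(count_male)
print(max_by_gender)
```

The function passed to the per-key reduction takes two values and returns one, which is why `max` fits directly.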
Submission:
• Q1: submit your completed source file, churn.jar, and the part-r-00000 file from the output directory.
• Q2: submit a text file q2-solution.txt with your scripts and outputs from each script.
• Q3: submit a text file q3-solution.txt with your scripts and outputs from each script.

