title: “Homework 2”
subtitle: “Soc 225: Data & Society”
author: “PUT YOUR NAME HERE”
date: “`r Sys.Date()`”
Write all code in the chunks provided!
Remember to unzip to a real directory before running everything!
Problem 1 should be roughly analogous to what we’ve done in class, with a few extensions. There are hints at the bottom of this document if you get stuck. If you still can’t figure it out, go to google/stack exchange/ask a friend. Finally, email me or come to office hours :).
## Problem 1: Piping Hot Variables
*This problem uses `dplyr` verbs to answer questions about an Airbnb data set.*
### 1.1: Get the data
Go to [Inside Airbnb](http://insideairbnb.com/get-the-data.html) and download the “Detailed Listings” data for Seattle, `listings.csv.gz`. This file has many more variables than the “Summary” file we’ve been using in class. Put it in a `data/` subfolder in your `hw-02` project folder.
[This is a compressed (gzipped) file, but R should be able to handle it as-is. If you run into trouble, try unzipping the file before reading it into R.]
### 1.2: Set up your R environment
a. Load the tidyverse
b. Read the detailed Airbnb data into R
### 1.3: Use the data to answer a question
*For how many units does the host live in a different neighborhood from the listing? For how many units does the host live in the same neighborhood as the listing?*
Try to figure out which variables to use from their names, and think about which verbs you’ve learned about might work to answer this question. See the hints at the end if you need help.
### 1.4: Build on your answer
*Building on that work, what is the average number of listings for hosts that live in the same neighborhood as their listing? What’s the average for hosts who live in different neighborhoods from their listing?*
The `mean` function will take the average of a variable, but you might need to look up how to use it. See the hints for more suggestions if you get stuck.
### 1.5: Reflect and interpret
Reflect on your answer to 1.4. What might cause the results you got? How does that connect to the idea that Airbnb might be changing neighborhoods?
*Your answer should be at least a few sentences here*
## Problem 2: Literature Review
*This question asks you to think deeply about the research question you’re investigating. Each answer should be around 100 words.*
2.1: What dataset did you select (include a link again)? Why did you select it? What is your research question? What variables do you plan to use to answer your question?
2.2: Find at least two articles (at least one must be from an academic journal) that have addressed a question similar to your own. What data did they use? What problems did they have?
*If you “can’t find” two articles, provide a screenshot of your search in the university library system from here: http://www.lib.washington.edu/*
2.3: What is one way that you have to modify or examine your data to begin to answer your question?
## Problem 3: Pipe your own data
3.1: Using the functions we’ve worked with in class (select, filter, arrange, mutate), plus any others you’d like to use, examine the key relationship from your research question.
a. Create a new dataset that only includes the variables you’re interested in
b. Output a version of that dataset that only includes certain values of observations, hopefully ones you’re interested in.
c. Order your data by the values of one variable you’re interested in.
d. Create a modified version of one of your variables (many of you will *need* to do this, but even if you don’t, I want to see that you can)
e. Look up and try out one new verb for data transformation. The RStudio data transformation cheat sheet is a fantastic place to start: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
For e., we’d recommend using `group_by` + `summarize`. You can group your data by one variable, and then see the mean (or similar) of another variable within each of those groups.
*Use as many code blocks as you need for a-e*
1.3 Try using these steps:
– Step 1: identify the variables you need
– Listing neighborhood: `neighbourhood`
– Host’s neighborhood: `host_neighbourhood`
– Step 2: Filter the data to only include the rows where those variables are not equal. Look back to Module 2 (or look online) if you need a reminder about how to write “equal”, “not equal”, and so on in R.
– Step 3: How many rows are left in the filtered data?
Extra food for thought: how do “NA” (missing) values get handled here? Do you think that makes sense? Should you do something else with them, maybe using `is.na`?
1.4 The variable for number of listings is `host_listings_count`. You might want to make a new variable indicating if a host is a local host (your answer to 1.3 will help here!). There are many ways to use `mean` on a subset of data, but the best approach is one we introduce in Module 5: `group_by` + `summarize`. Try it out now if you can! For this problem, don’t worry about NAs.
a. use select()
b. use filter()
c. use arrange()
d. use mutate()
e. use group_by(var1) %>% summarise(mean = mean(var2))