Tidying Data

# Tidying Data
## <span style="color:#56B4E9">tidyr</span> <br>
### 2019-08-16

---

<img src="tidyr_spread_gather.png" width="70%" height="70%" style="display: block; margin: auto;" />
.pull-right[Art by Allison Horst]

---
background-image: url(http://hexb.in/hexagons/tidyr.png)
background-position: 90% 10%

## **tidyr**

## Functions for tidying data. 
## What is tidy data?
.large[
> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
]
---

# Tidy Data
![](http://garrettgman.github.io/images/tidy-1.png)</div>

---

# Tidy Data 
![](http://garrettgman.github.io/images/tidy-1.png)</div>
### .medium[Each **column** is a single *variable*]

---

# Tidy Data
![](http://garrettgman.github.io/images/tidy-1.png)</div>
### ~~.medium[Each column is a single variable]~~
### .medium[Each **row** is a single *observation*]

---

# Tidy Data 
![](http://garrettgman.github.io/images/tidy-1.png)</div>
### ~~.medium[Each column is a single variable]~~
### ~~.medium[Each row is a single observation]~~
### .medium[Each **cell** is a *value*]

---

## `gather()`

```r
gather(<DATA>, "<KEY>", "<VALUE>", <VARIABLES>)
```

---

## Lord of the Rings

```r
lotr <- tribble(
                         ~film,    ~race, ~female, ~male,
  "The Fellowship Of The Ring",    "Elf",   1229L,  971L,
  "The Fellowship Of The Ring", "Hobbit",     14L, 3644L,
  "The Fellowship Of The Ring",    "Man",      0L, 1995L,
              "The Two Towers",    "Elf",    331L,  513L,
              "The Two Towers", "Hobbit",      0L, 2463L,
              "The Two Towers",    "Man",    401L, 3589L,
      "The Return Of The King",    "Elf",    183L,  510L,
      "The Return Of The King", "Hobbit",      2L, 2673L,
      "The Return Of The King",    "Man",    268L, 2459L
)
```

---

## Lord of the Rings

```r
lotr
```

```
## # A tibble: 9 x 4
##   film                       race   female  male
##   <chr>                      <chr>   <int> <int>
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf       331   513
## 5 The Two Towers             Hobbit      0  2463
## 6 The Two Towers             Man       401  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459
```

---

---

## `gather()`

```r
lotr %>% 
* gather("sex", "words", female:male)
```

---

## `gather()`

```r
lotr %>% 
  gather("sex", "words", female:male)
```

```
## # A tibble: 18 x 4
*##    film                       race   sex    words
##    <chr>                      <chr>  <chr>  <int>
##  1 The Fellowship Of The Ring Elf    female  1229
##  2 The Fellowship Of The Ring Hobbit female    14
##  3 The Fellowship Of The Ring Man    female     0
##  4 The Two Towers             Elf    female   331
##  5 The Two Towers             Hobbit female     0
##  6 The Two Towers             Man    female   401
##  7 The Return Of The King     Elf    female   183
##  8 The Return Of The King     Hobbit female     2
##  9 The Return Of The King     Man    female   268
## 10 The Fellowship Of The Ring Elf    male     971
## # … with 8 more rows
```

---

---

## Your Turn 1

### Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**.

---

## Your Turn 1

### Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**.

```r
table4a %>% 
  gather("year", "cases", -country)
```

```
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
```

---

## `spread()`

```r
spread(<DATA>, <KEY>, <VALUE>)
```

---

## `spread()`

```r
lotr %>% 
* gather("sex", "words", female:male) %>%
  spread(race, words)
```

---

## `spread()`

```r
lotr %>% 
  gather("sex", "words", female:male) %>% 
* spread(race, words)
```

---

## `spread()`

```r
lotr %>% 
  gather("sex", "words", female:male) %>% 
  spread(race, words)
```

```
## # A tibble: 6 x 5
*##   film                       sex      Elf Hobbit   Man
##   <chr>                      <chr>  <int>  <int> <int>
## 1 The Fellowship Of The Ring female  1229     14     0
## 2 The Fellowship Of The Ring male     971   3644  1995
## 3 The Return Of The King     female   183      2   268
## 4 The Return Of The King     male     510   2673  2459
## 5 The Two Towers             female   331      0   401
## 6 The Two Towers             male     513   2463  3589
```

---

## Your Turn 2

### Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 
### Create a new variable called `prevalence` that divides `cases` by `population` multiplied by 100000.
###  Pass the data frame to a ggplot. Make a scatter plot with `year` on the x axis and `prevalence` on the y axis. Set the color aesthetic (`aes()`) to `country`. Use `size = 2` for the points. Add a line geom.

```r
table2
```

---

## Your Turn 2

```r
table2 %>% 
* spread(type, count) %>%
* mutate(prevalence = (cases/population) * 100000)
```

```
## # A tibble: 6 x 5
##   country      year  cases population prevalence
##   <chr>       <int>  <int>      <int>      <dbl>
## 1 Afghanistan  1999    745   19987071       3.73
## 2 Afghanistan  2000   2666   20595360      12.9 
## 3 Brazil       1999  37737  172006362      21.9 
## 4 Brazil       2000  80488  174504898      46.1 
## 5 China        1999 212258 1272915272      16.7 
## 6 China        2000 213766 1280428583      16.7
```

---

## Your Turn 2

```r
table2 %>% 
  spread(type, count) %>% 
  mutate(prevalence = (cases/population) * 100000) %>% 
* ggplot(aes(x = year, y = prevalence, color = country)) +
*   geom_point(size = 2) +
*   geom_line() +
*   scale_x_continuous(breaks = c(1999L, 2000L))
```

---

## Your Turn 2

---

## `gather()` and `spread()`

![](https://github.com/gadenbuie/tidyexplain/raw/master/images/tidyr-spread-gather.gif)
.pull-right[Animation: Garrick Aden-Buie]

---

## Your Turn 3

### Gather the 5th through 60th columns of `who` into a key column: value column pair named **codes** and **n**. Then select just the `county`, `year`, `codes` and `n` variables.

```r
who
```

---

## Your Turn 3

```r
who %>% 
* gather("codes", "n", 5:60) %>%
  select(country, year, codes, n)
```

---

## Your Turn 3

```r
who %>%
  gather("codes", "n", 5:60) %>% 
  select(country, year, codes, n)
```

```
## # A tibble: 405,440 x 4
##    country      year codes           n
##    <chr>       <int> <chr>       <int>
##  1 Afghanistan  1980 new_sp_m014    NA
##  2 Afghanistan  1981 new_sp_m014    NA
##  3 Afghanistan  1982 new_sp_m014    NA
##  4 Afghanistan  1983 new_sp_m014    NA
##  5 Afghanistan  1984 new_sp_m014    NA
##  6 Afghanistan  1985 new_sp_m014    NA
##  7 Afghanistan  1986 new_sp_m014    NA
##  8 Afghanistan  1987 new_sp_m014    NA
##  9 Afghanistan  1988 new_sp_m014    NA
## 10 Afghanistan  1989 new_sp_m014    NA
## # … with 405,430 more rows
```

---

## `separate()`/`unite()`

```r
separate(<DATA>, <VARIABLE>, into = c("<VARIABLE1>", "<VARIABLE2>"))

unite(<DATA>, <VARIABLES>)
```

---

## Your Turn 4

### Use the `cases` data below. Separate the `sex_age` column into **sex** and **age** columns.

```r
cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(_______, _______, into = c("_______", "_______"))
```

---

## Your Turn 4

```r
cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
*separate(cases, sex_age, into = c("sex", "age"))
```

---

## Your Turn 4

```r
cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"))
```

```
## # A tibble: 3 x 3
##   id    sex    age  
##   <chr> <chr>  <chr>
## 1 1     male   56   
## 2 2     female 77   
## 3 3     female 49
```

---

## Your Turn 4

```r
cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
separate(cases, sex_age, into = c("sex", "age"))
```

```
## # A tibble: 3 x 3
##   id    sex    age  
*##   <chr> <chr>  <chr>
## 1 1     male   56   
## 2 2     female 77   
## 3 3     female 49
```

---

## Your Turn 4

```r
cases <- tribble(
   ~id,     ~sex_age,
   "1",    "male_56",
   "2",  "female_77",    
   "3",  "female_49"
)
*separate(cases, sex_age, into = c("sex", "age"), convert = TRUE)
```

```
## # A tibble: 3 x 3
##   id    sex      age
*##   <chr> <chr>  <int>
## 1 1     male      56
## 2 2     female    77
## 3 3     female    49
```

---

## Your Turn 5: Challenge!

#### There are two CSV files in this folder containing SEER data in breast cancer incidence in white and black women. For both sets of data:

#### Import the data
#### Gather the columns into 2 new columns called `year` and `incidence`
#### Add a new variable called `race`. Remember that each data set corresponds to a single race.
#### Bind the data sets together using `bind_rows()` from the dplyr package. Either save it as a new object or pipe the result directly into the ggplot2 code.
#### Plot the data using the code below. Fill in the blanks to have `year` on the x-axis, `incidence` on the y-axis, and `race` as the color aesthetic.

---

# Other neat tidyr tools
## Uncounting frequency tables

```r
lotr %>% 
* gather("sex", "count", female, male) %>%
* uncount(count)
```

---

# Other neat tidyr tools
## Uncounting frequency tables

```
*## # A tibble: 21,245 x 3
*##    film                       race  sex   
##    <chr>                      <chr> <chr> 
##  1 The Fellowship Of The Ring Elf   female
##  2 The Fellowship Of The Ring Elf   female
##  3 The Fellowship Of The Ring Elf   female
##  4 The Fellowship Of The Ring Elf   female
##  5 The Fellowship Of The Ring Elf   female
##  6 The Fellowship Of The Ring Elf   female
##  7 The Fellowship Of The Ring Elf   female
##  8 The Fellowship Of The Ring Elf   female
##  9 The Fellowship Of The Ring Elf   female
## 10 The Fellowship Of The Ring Elf   female
## # … with 21,235 more rows
```

---

# Other neat tidyr tools
## Work with data frames
## **`crossing()`** and **`expand()`**
## **`nest()`** and **`unnest()`**

---

# Other neat tidyr tools
## Work with missing data
## **`complete()`**
## **`drop_na()`** and **`replace_na()`**

---
 
# In development: `pivot_longer()`, `pivot_wider()`

## https://tidyr.tidyverse.org/dev/

---

# Resources
## [R for Data Science](http://r4ds.had.co.nz/): A comprehensive but friendly introduction to the tidyverse. Free online.
## [RStudio Primers](https://rstudio.cloud/learn/primers): Free interactive courses in the Tidyverse