Tidy data and pivots

Schwab

Tidydata

Tidy data problems

Tidy data is not always in a human readable format.

It is easy for computers to read but tricky for people.

or

Data is in a format that is easy for people for tricky for computers.

pivot_*()

When you want to make data readable for machines you’ll often pivot_longer().

pivot_longer()

When you want to make it more human readable you’ll pivot wider.

pivot_wider()

These are new functions we should check the docs.

pivot_longer()

This data is not tidy how can we make it longer for computers?

Code
gs4_deauth()
hiv_key <- "1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA"
hiv <- read_sheet(hiv_key) |>
  rename(Country = 1) |>
  filter(
    Country %in% c("United States", "France", "South Africa")
  ) |>
  select(Country, `1979`, `1989`, `1999`, `2009`) |>
  unnest(cols = c(`2009`)) |>
  mutate(across(matches("[0-9]"), as.double))
hiv
# A tibble: 3 × 5
  Country        `1979` `1989` `1999` `2009`
  <chr>           <dbl>  <dbl>  <dbl>  <dbl>
1 France        NA          NA    0.3    0.4
2 South Africa  NA          NA   14.8   17.2
3 United States  0.0318     NA    0.5    0.6

How you think we could do that with pivot_longer()

pivot_longer()

Solution to previous slide

Code
hiv |>
  pivot_longer(cols = `1979`:`2009`, names_to = "year", values_to = "hiv_rate")
# A tibble: 12 × 3
   Country       year  hiv_rate
   <chr>         <chr>    <dbl>
 1 France        1979   NA     
 2 France        1989   NA     
 3 France        1999    0.3   
 4 France        2009    0.4   
 5 South Africa  1979   NA     
 6 South Africa  1989   NA     
 7 South Africa  1999   14.8   
 8 South Africa  2009   17.2   
 9 United States 1979    0.0318
10 United States 1989   NA     
11 United States 1999    0.5   
12 United States 2009    0.6   

pivot_wider()

This data is tidy how could we make it wider for humans?

library(babynames)
babynames |>
     filter(name %in% c("Sue", "Robin", "Leslie")) |>
     group_by(name, sex) |>
     summarize(total = sum(n))
# A tibble: 6 × 3
# Groups:   name [3]
  name   sex    total
  <chr>  <chr>  <int>
1 Leslie F     266474
2 Leslie M     112689
3 Robin  F     289395
4 Robin  M      44616
5 Sue    F     144465
6 Sue    M        519

How you think we could do that with pivot_wider()

pivot_wider()

Solution to the previous slide.

Code
library(babynames)
babynames |>
     filter(name %in% c("Sue", "Robin", "Leslie")) |>
     group_by(name, sex) |>
     summarize(total = sum(n))|>
     pivot_wider(names_from = sex,values_from = total)
# A tibble: 3 × 3
# Groups:   name [3]
  name        F      M
  <chr>   <int>  <int>
1 Leslie 266474 112689
2 Robin  289395  44616
3 Sue    144465    519

fish_encounters

This data is tidy try to make it wider.

fish_encounters
# A tibble: 114 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# ℹ 104 more rows

fish_encounters 2

The solution to the previous slide.

fish_encounters |>
  pivot_wider(names_from = station,values_from = "seen")
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA    NA
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

Bonus: the separate() function

If your values have a common symbol and they should be put into different column that can be done with separate().

We’ll see some examples in the next several problems.

Not Tidy -> Tidy

Are these tables tidy, do we need to pivot wider or longer?

  • table1,

  • table2,

  • table3,

  • table4a,

  • table4b,

  • table5

solutions posted in video

Reading

Read Chapter 6.

Futher reading on tidy data