The joins

Ben Baumer

Relational data

  • Data from two or more tables that is related
    • Need principled way to combine

More info here

Soon:

- Tidy data
- 3rd normal form

Later:

- SQL
- Relational Database Management Systems (RDBMS)
- Keys and indices

Example: nycflights13

library(nycflights13)
  • flights
  • airports
  • airlines
  • planes
  • weather

Keep like with like

flights %>%
  select(year, month, day, flight, origin, dest, carrier) %>%
  head(3)
# A tibble: 3 × 7
   year month   day flight origin dest  carrier
  <int> <int> <int>  <int> <chr>  <chr> <chr>  
1  2013     1     1   1545 EWR    IAH   UA     
2  2013     1     1   1714 LGA    IAH   UA     
3  2013     1     1   1141 JFK    MIA   AA     
airlines |>
  head(3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  

Connections between tables (ER model)

Be specific

flights %>%
  inner_join(airlines) %>%
  dim()
[1] 336776     20
flights %>%
  inner_join(airlines, by = "carrier") %>%
  dim()
[1] 336776     20
flights %>%
  inner_join(airlines, by = c("carrier" = "carrier")) %>%
  dim()
[1] 336776     20

An abstract example

Inner join

inner_join() returns all rows from flights where there are matching values in airlines.It returns all columns from both.

Left join

left_join() returns all rows from flights even if there are no matches in airlines. If no match then NA. It returns all columns from both.

right and full

All is returned, NA is filled in to missing data

Alternatively, but…

  • This doesn’t capture the many-to-oneness

Many-to-one


Coding Example with

  • flights

  • airlines