Joins

CSI-MTH-190

Schwab
library(tidyverse)

Combine two data frames

  • bind_cols()
  • inner_join()
  • left_join()
  • full_join()

Our toy data frames

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
#band_members2
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

bind_cols()

This function just smashes two dataframes together.

Only use if you are sure that the rows match up together.

bind_cols( band_instruments, band_members)
# A tibble: 3 × 4
  name...1 plays  name...3 band   
  <chr>    <chr>  <chr>    <chr>  
1 John     guitar Mick     Stones 
2 Paul     bass   John     Beatles
3 Keith    guitar Paul     Beatles

full_join()

Keeps all the data and by default matches by a column name.

full_join(band_instruments, band_members)
# A tibble: 4 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles
3 Keith guitar <NA>   
4 Mick  <NA>   Stones 

left_join()

Keeps all data in the left dataframe, adding rows from the right dataframe if there is a key match.

left_join(band_instruments, band_members)
# A tibble: 3 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles
3 Keith guitar <NA>   

inner_join()

Keeps only the data when keys match in both data frames. This is helpful when you want to know what values the dataframes have in common.

inner_join(band_instruments, band_members)
# A tibble: 2 × 3
  name  plays  band   
  <chr> <chr>  <chr>  
1 John  guitar Beatles
2 Paul  bass   Beatles

Different Key names

If the keys in both data frames after different we must specify how to join the data.

left_join(
    band_instruments2, 
    band_members, 
    by = c("artist" = "name")
    )
# A tibble: 3 × 3
  artist plays  band   
  <chr>  <chr>  <chr>  
1 John   guitar Beatles
2 Paul   bass   Beatles
3 Keith  guitar <NA>   

other joins

There are many other joins, some of which are on dplyr’s cheatsheet