The Verbs

Nic Schwab

dplyr lives in the tidyverse

The Verbs

You’ve learned about these verbs:

filter()
- subsets rows
group_by()
- to group the data before summarizing
summarize()
- to summarize data
mutate()
- creates new variables (or columns)

We will also use:

select()
- subsets columns
rename()
- to rename columns
arrange() / arrange(desc())/ slice_*
case_when()
- change variable levels

Preview SQL Queries

CSI 242 Applied Database Management

SQL	dplyr
SELECT	select(), mutate(), summarise()
FROM	the data frame. data=
WHERE	filter()
ORDER BY	arrange()
LIMIT	slice_head()
GROUP BY	group_by

The idea

You can get good at a few functions and do a lot.

The first argument is a data frame.

its a special kind called a tibble.

The output of the wrangling functions is a data frame.

When we wrangle we are not altering the original data.

It still exists.
You can start over.

What is a tibble?

Object of class tbl
Its kind of different from a data frame.

New verbs by example

Live code with the mpg data frame.

Try these with me.

Open R or HCC’s RStudio Server.

Side Note: Vectors

c() makes a vector in R.

We use them to hold information. All information must be the same type (int, bool, etc.)

If you are familiar with lists or arrays, a vector is very similar to those.

books = c("The Martian", "The Hail Mary Project", "Artemis")

filter()

%in% vs ==

mpg |> # 18 audis
    filter ( manufacturer == "audi")|>
    nrow()

[1] 18

mpg |> # 25 fords
    filter(manufacturer == "ford")|>
    nrow()

[1] 25

mpg |> # doesn't work correctly
    filter(manufacturer == c("ford", "audi"))|>
    nrow()

[1] 22

mpg |> # doesn't work correctly
    filter(manufacturer %in% c("ford", "audi"))|>
    nrow()

[1] 43

group_by and summarize

group_by() works with summarize()

mpg|>
    group_by(manufacturer)|>
    # n() is a counting function
    summarise(count = n())

# A tibble: 15 × 2
   manufacturer count
   <chr>        <int>
 1 audi            18
 2 chevrolet       19
 3 dodge           37
 4 ford            25
 5 honda            9
 6 hyundai         14
 7 jeep             8
 8 land rover       4
 9 lincoln          3
10 mercury          4
11 nissan          13
12 pontiac          5
13 subaru          14
14 toyota          34
15 volkswagen      27

mutate()

Changes values of colums. You can save the data frame.

mpg = mpg |>
    mutate(new_year = 2000 - year )
mpg$year

  [1] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 2008
 [16] 1999 2008 2008 2008 2008 2008 1999 2008 1999 1999 2008 2008 2008 2008 2008
 [31] 1999 1999 1999 2008 1999 2008 2008 1999 1999 1999 1999 2008 2008 2008 1999
 [46] 1999 2008 2008 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 2008 2008
 [61] 2008 1999 2008 1999 2008 2008 2008 2008 2008 2008 1999 1999 2008 1999 1999
 [76] 1999 2008 1999 1999 1999 2008 2008 1999 1999 1999 1999 1999 2008 1999 2008
 [91] 1999 1999 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 1999 1999 2008
[106] 2008 2008 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 2008
[121] 2008 2008 2008 2008 1999 1999 2008 2008 2008 2008 1999 2008 2008 1999 1999
[136] 1999 2008 1999 2008 2008 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008
[151] 1999 1999 2008 2008 1999 1999 1999 2008 2008 1999 1999 2008 2008 2008 2008
[166] 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 2008 2008 1999
[181] 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999
[196] 1999 2008 2008 1999 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
[211] 2008 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 1999 1999
[226] 2008 2008 1999 1999 2008 2008 1999 1999 2008

select()

select() subsets the columms. Maybe you don’t need the whole df.

names() is useful for finding column names

names(mpg)

 [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"        "new_year"

new_mpg = mpg |> select(manufacturer, year)
new_mpg

# A tibble: 234 × 2
   manufacturer  year
   <chr>        <int>
 1 audi          1999
 2 audi          1999
 3 audi          2008
 4 audi          2008
 5 audi          1999
 6 audi          1999
 7 audi          2008
 8 audi          1999
 9 audi          1999
10 audi          2008
# ℹ 224 more rows

arrange() / slice_*()

arrange() ordered your df by a variable, while slice_max() will take the top so many rows.

mpg |> arrange (desc(displ))

# A tibble: 234 × 12
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 chevrolet    corvette     7    2008     8 manu… r        15    24 p     2sea…
 2 chevrolet    k1500 tah…   6.5  1999     8 auto… 4        14    17 d     suv  
 3 chevrolet    corvette     6.2  2008     8 manu… r        16    26 p     2sea…
 4 chevrolet    corvette     6.2  2008     8 auto… r        15    25 p     2sea…
 5 jeep         grand che…   6.1  2008     8 auto… 4        11    14 p     suv  
 6 chevrolet    c1500 sub…   6    2008     8 auto… r        12    17 r     suv  
 7 dodge        durango 4…   5.9  1999     8 auto… 4        11    15 r     suv  
 8 dodge        ram 1500 …   5.9  1999     8 auto… 4        11    15 r     pick…
 9 chevrolet    c1500 sub…   5.7  1999     8 auto… r        13    17 r     suv  
10 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…
# ℹ 224 more rows
# ℹ 1 more variable: new_year <dbl>

mpg |> slice_max(displ, n = 5)

# A tibble: 5 × 12
  manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
  <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet    corvette      7    2008     8 manu… r        15    24 p     2sea…
2 chevrolet    k1500 taho…   6.5  1999     8 auto… 4        14    17 d     suv  
3 chevrolet    corvette      6.2  2008     8 manu… r        16    26 p     2sea…
4 chevrolet    corvette      6.2  2008     8 auto… r        15    25 p     2sea…
5 jeep         grand cher…   6.1  2008     8 auto… 4        11    14 p     suv  
# ℹ 1 more variable: new_year <dbl>

Practice wrangling

MDSR Problems 1 and 2

 # Copy this code into a chunk in R to make the Random_subset data frame from problem 1 and 2.
 # Use the verbs we've discussed to make the subsets from the text.
 Random_subset <-  tibble::tribble(
     ~year,~sex,   ~name,         ~n, ~prop,
      2003, "M",     "Bilal",        146, 0.0000695,
      1999, "F",     "Terria",        23, 0.0000118,
      2010, "F",     "Naziyah",       45, 0.0000230,
      1989, "F",     "Shawana",       41, 0.0000206,
      1989, "F",     "Jessi",        210, 0.000105,
      1928, "M",     "Tillman",       43, 0.0000377,
      1981, "F",     "Leslee",        83, 0.0000464,
      1981, "F",     "Sherise",       27, 0.0000151,
      1920, "F",     "Marquerite",    26, 0.0000209,
      1941, "M",     "Lorraine",      24, 0.0000191
   )