You’ve learned about these verbs:
filter()
group_by()
summarize()
mutate()
We will also use:
select()
rename()
arrange() / arrange(desc())/ slice_*
case_when()
CSI 242 Applied Database Management
| SQL | dplyr |
|---|---|
| SELECT | select(), mutate(), summarise() |
| FROM | the data frame. data= |
| WHERE | filter() |
| ORDER BY | arrange() |
| LIMIT | slice_head() |
| GROUP BY | group_by |
You can get good at a few functions and do a lot.
The first argument is a data frame.
The output of the wrangling functions is a data frame.
When we wrangle we are not altering the original data.
It still exists.
You can start over.
Object of class tbl
Live code with the mpg data frame.
Try these with me.
Open R or HCC’s RStudio Server.
c() makes a vector in R.
We use them to hold information. All information must be the same type (int, bool, etc.)
If you are familiar with lists or arrays, a vector is very similar to those.
%in% vs ==
[1] 18
[1] 25
[1] 22
[1] 43
group_by() works with summarize()
# A tibble: 15 × 2
manufacturer count
<chr> <int>
1 audi 18
2 chevrolet 19
3 dodge 37
4 ford 25
5 honda 9
6 hyundai 14
7 jeep 8
8 land rover 4
9 lincoln 3
10 mercury 4
11 nissan 13
12 pontiac 5
13 subaru 14
14 toyota 34
15 volkswagen 27
Changes values of colums. You can save the data frame.
[1] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 2008
[16] 1999 2008 2008 2008 2008 2008 1999 2008 1999 1999 2008 2008 2008 2008 2008
[31] 1999 1999 1999 2008 1999 2008 2008 1999 1999 1999 1999 2008 2008 2008 1999
[46] 1999 2008 2008 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 2008 2008
[61] 2008 1999 2008 1999 2008 2008 2008 2008 2008 2008 1999 1999 2008 1999 1999
[76] 1999 2008 1999 1999 1999 2008 2008 1999 1999 1999 1999 1999 2008 1999 2008
[91] 1999 1999 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 1999 1999 2008
[106] 2008 2008 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 2008
[121] 2008 2008 2008 2008 1999 1999 2008 2008 2008 2008 1999 2008 2008 1999 1999
[136] 1999 2008 1999 2008 2008 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008
[151] 1999 1999 2008 2008 1999 1999 1999 2008 2008 1999 1999 2008 2008 2008 2008
[166] 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 2008 2008 1999
[181] 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999
[196] 1999 2008 2008 1999 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
[211] 2008 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 1999 1999
[226] 2008 2008 1999 1999 2008 2008 1999 1999 2008
select() subsets the columms. Maybe you don’t need the whole df.
names() is useful for finding column names
[1] "manufacturer" "model" "displ" "year" "cyl"
[6] "trans" "drv" "cty" "hwy" "fl"
[11] "class" "new_year"
# A tibble: 234 × 2
manufacturer year
<chr> <int>
1 audi 1999
2 audi 1999
3 audi 2008
4 audi 2008
5 audi 1999
6 audi 1999
7 audi 2008
8 audi 1999
9 audi 1999
10 audi 2008
# ℹ 224 more rows
arrange() ordered your df by a variable, while slice_max() will take the top so many rows.
# A tibble: 234 × 12
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet corvette 7 2008 8 manu… r 15 24 p 2sea…
2 chevrolet k1500 tah… 6.5 1999 8 auto… 4 14 17 d suv
3 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
4 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
5 jeep grand che… 6.1 2008 8 auto… 4 11 14 p suv
6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
7 dodge durango 4… 5.9 1999 8 auto… 4 11 15 r suv
8 dodge ram 1500 … 5.9 1999 8 auto… 4 11 15 r pick…
9 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
10 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
# ℹ 224 more rows
# ℹ 1 more variable: new_year <dbl>
# A tibble: 5 × 12
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet corvette 7 2008 8 manu… r 15 24 p 2sea…
2 chevrolet k1500 taho… 6.5 1999 8 auto… 4 14 17 d suv
3 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
4 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
5 jeep grand cher… 6.1 2008 8 auto… 4 11 14 p suv
# ℹ 1 more variable: new_year <dbl>
# Copy this code into a chunk in R to make the Random_subset data frame from problem 1 and 2.
# Use the verbs we've discussed to make the subsets from the text.
Random_subset <- tibble::tribble(
~year,~sex, ~name, ~n, ~prop,
2003, "M", "Bilal", 146, 0.0000695,
1999, "F", "Terria", 23, 0.0000118,
2010, "F", "Naziyah", 45, 0.0000230,
1989, "F", "Shawana", 41, 0.0000206,
1989, "F", "Jessi", 210, 0.000105,
1928, "M", "Tillman", 43, 0.0000377,
1981, "F", "Leslee", 83, 0.0000464,
1981, "F", "Sherise", 27, 0.0000151,
1920, "F", "Marquerite", 26, 0.0000209,
1941, "M", "Lorraine", 24, 0.0000191
)