Schwab
We will scrape data with the rvest package. We will also use the janitor package to clean the data up.
Find the seasons section of the page.
Right-click on the table.
Choose Inspect (Element).
Notice the <table> or <tr> tags in the page source.
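To see how those tags map onto rvest calls, here is a minimal sketch that runs without a network connection. The small HTML snippet is made up for illustration and only stands in for the real Wikipedia page; minimal_html() is an rvest helper that wraps a string in a complete document.

```r
library(rvest)

# A tiny, made-up page standing in for the Wikipedia article (an assumption).
page <- minimal_html('
  <h2>Seasons</h2>
  <table>
    <tr><th>Season</th><th>Episodes</th></tr>
    <tr><td>1</td><td>26</td></tr>
    <tr><td>2</td><td>22</td></tr>
  </table>
')

tables <- html_nodes(page, "table") # every <table> node on the page
rows <- html_nodes(page, "tr")      # every <tr> node, header row included

length(tables)
length(rows)

# Convert the first table node into a tibble
seasons <- html_table(tables[[1]])
seasons
```

The string passed to html_nodes() is a CSS selector, so the tag names seen in the browser inspector can be used directly.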
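library(tidyverse) #provides the pipe, pluck(), separate() and unite()
library(rvest)     #provides read_html(), html_nodes() and html_table()
library(janitor)   #used to clean up the column names
#I'm saving the URL for Star Trek: The Next Generation as a string
star_trek_url <- "https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation"
#Here I'm gathering the tables from that site; there are 25.
st_tables <- star_trek_url %>%
  read_html() %>% #read_html takes a URL and returns an XML document
  html_nodes("table") #keep every <table> node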
Star_Trek_Air_Dates <- st_tables %>%
  pluck(2) %>% #It's the second table, but finding the right index takes some guessing
  html_table()
head(Star_Trek_Air_Dates, 3)
# A tibble: 3 × 5
Season Episodes Episodes `Originally aired` `Originally aired`
<chr> <chr> <chr> <chr> <chr>
1 Season Episodes Episodes First aired Last aired
2 1 26 26 September 28, 1987 (1987-09-28) May 16, 1988 (1988-0…
3 2 22 22 November 21, 1988 (1988-11-21) July 17, 1989 (1989-…
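Guessing the table index can be replaced by inspecting every table at once. The sketch below is one possible approach, not necessarily the one used above; it runs on a made-up two-table page, and the names all_tables and season_idx are hypothetical.

```r
library(rvest)
library(purrr)

# Made-up page with two tables; only the second one lists seasons (an assumption).
page <- minimal_html('
  <table><tr><th>Genre</th></tr><tr><td>Science fiction</td></tr></table>
  <table>
    <tr><th>Season</th><th>Episodes</th></tr>
    <tr><td>1</td><td>26</td></tr>
  </table>
')

# html_table() on a set of nodes returns a list with one tibble per table
all_tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Column names of each table, to see which one is wanted
map(all_tables, names)

# Position of the first table that has a "Season" column
season_idx <- detect_index(all_tables, ~ "Season" %in% names(.x))
season_idx
```

On a real page the same check could be run against st_tables, with the column name adjusted to whatever the wanted table contains.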
What’s wrong with this table?
Now we need to tidy our table. What do we want to do?
Drop the first row, which only repeats the column headers, and clean up the column names.
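A minimal sketch of that clean-up step follows. It works on a small hand-typed copy of the scraped table rather than the live page; the placeholder column names V1 to V5 and the object names raw, header and tidy are assumptions made for the example. janitor also offers row_to_names() and clean_names(), which could be used for the same purpose.

```r
library(dplyr)
library(janitor)

# Hand-typed stand-in for the scraped table; V1..V5 are placeholder names.
raw <- tibble(
  V1 = c("Season", "1", "2"),
  V2 = c("Episodes", "26", "22"),
  V3 = c("Episodes", "26", "22"),
  V4 = c("First aired", "September 28, 1987 (1987-09-28)", "November 21, 1988 (1988-11-21)"),
  V5 = c("Last aired", "May 16, 1988 (1988-05-16)", "July 17, 1989 (1989-07-17)")
)

# The first row holds the real headers
header <- unlist(raw[1, ], use.names = FALSE)

tidy <- raw %>%
  slice(-1) %>%                           # drop the repeated header row
  setNames(make_clean_names(header)) %>%  # snake_case names, duplicates get a suffix
  select(-episodes_2)                     # drop the duplicated episodes column

names(tidy)
```

make_clean_names() turns "First aired" into first_aired and renames the second "Episodes" to episodes_2, which is why that column can be dropped by name.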
#This will separate first_aired into pieces and put month, day and year back together without the extra date in parentheses.
Star_Trek_Air_Dates <- Star_Trek_Air_Dates %>%
  separate(col = first_aired, into = c("month", "day", "year"), extra = "drop") %>% #extra = "drop" discards the leftover pieces without a warning
  unite(col = "first_aired", c("month", "day", "year"), sep = " ")
#This will do the same for the last_aired column.
Star_Trek_Air_Dates <- Star_Trek_Air_Dates %>%
  separate(col = last_aired, into = c("month", "day", "year"), extra = "drop") %>%
  unite(col = "last_aired", c("month", "day", "year"), sep = " ")
See the dplyr and tidyr cheat sheets; separate() and unite() come from tidyr.
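The air-date columns are still text at this point. One way to turn them into proper dates is lubridate's mdy(), sketched here on a small hand-typed table; the air_dates object and its values are assumptions that mirror the first two seasons.

```r
library(dplyr)
library(lubridate)

# Hand-typed stand-in: month, day and year stored as text
air_dates <- tibble(
  season = c("1", "2"),
  first_aired = c("September 28 1987", "November 21 1988"),
  last_aired = c("May 16 1988", "July 17 1989")
)

# mdy() parses month-day-year text into Date values
air_dates <- air_dates %>%
  mutate(across(c(first_aired, last_aired), mdy))

air_dates
```

After this step the two columns print with the <date> type instead of <chr>.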
# A tibble: 3 × 4
season episodes first_aired last_aired
<chr> <chr> <date> <date>
1 1 26 1987-09-28 1988-05-16
2 2 22 1988-11-21 1989-07-17
3 3 26 1989-09-25 1990-06-18