Scraping_Data

Schwab

Scraping Data with rvest.

We will scrape data with the rvest package and use the janitor package to clean it up.

Let’s scrape some Star Trek TNG data

Here’s the Wikipedia page: https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation

  1. Find the seasons section

  2. Right click on the table.

  3. Inspect (element)

  4. Notice the <table> or <tr> tag in the highlighted HTML
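The tags you see in the inspector are what rvest selects on. Here is a minimal sketch, assuming a tiny made-up HTML snippet in place of the real page (`toy_page` and its contents are hypothetical). It uses `html_elements()`, the newer rvest name for `html_nodes()`.

```r
library(rvest)

# Hypothetical stand-in for a web page: one heading and one small table
toy_page <- minimal_html('
  <h2>Seasons</h2>
  <table>
    <tr><th>Season</th><th>Episodes</th></tr>
    <tr><td>1</td><td>26</td></tr>
    <tr><td>2</td><td>22</td></tr>
  </table>
')

# "table" is a CSS selector that matches every <table> element
toy_tables <- html_elements(toy_page, "table")
length(toy_tables)

# Each <tr> is one row of the table, and the header row counts too
toy_rows <- html_elements(toy_tables[[1]], "tr")
length(toy_rows)

# html_table() turns the <table> node into a tibble
toy_df <- html_table(toy_tables[[1]])
toy_df
```

The same selector, "table", is the one used on the real page in the next step.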

Get the table

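# Assumption: the tidyverse is loaded for %>%, pluck(), and the dplyr/tidyr verbs used later
library(tidyverse)
library(rvest)     # read_html(), html_nodes(), html_table()
library(janitor)   # clean_names()

# Save the URL for the Star Trek page as a string
star_trek_url <- "https://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation"

# Gather every table on that page (there are 25)
st_tables <- star_trek_url %>% 
  read_html() %>% # read_html() takes a URL and returns an XML document
  html_nodes("table")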

Have R pluck() the table.

Star_Trek_Air_Dates <- st_tables %>%
  pluck(2) %>% # It's the second table, but finding it takes some guessing
  html_table()  

head(Star_Trek_Air_Dates,3)
# A tibble: 3 × 5
  Season Episodes Episodes `Originally aired`              `Originally aired`   
  <chr>  <chr>    <chr>    <chr>                           <chr>                
1 Season Episodes Episodes First aired                     Last aired           
2 1      26       26       September 28, 1987 (1987-09-28) May 16, 1988 (1988-0…
3 2      22       22       November 21, 1988 (1988-11-21)  July 17, 1989 (1989-…
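Picking the right table index takes some guessing. One way to reduce the guessing is to convert every table and look at the column names. This is a sketch on a hypothetical two-table page (`toy_page` is made up); on the real page, `st_tables` would take the place of `toy_tables`.

```r
library(rvest)

# Hypothetical page with two tables; only the second one lists seasons
toy_page <- minimal_html('
  <table>
    <tr><th>Genre</th></tr>
    <tr><td>Science fiction</td></tr>
  </table>
  <table>
    <tr><th>Season</th><th>Episodes</th></tr>
    <tr><td>1</td><td>26</td></tr>
  </table>
')
toy_tables <- html_elements(toy_page, "table")

# html_table() on a set of nodes returns a list with one tibble per table
toy_dfs <- html_table(toy_tables)

# List the column names of each table to see which one holds the seasons
toy_names <- lapply(toy_dfs, names)
toy_names

# Position of the table(s) with a "Season" column
season_index <- which(vapply(toy_names, function(nms) "Season" %in% nms, logical(1)))
season_index
```

On a real page several tables may share a column name, so the result is a short list of candidates to inspect rather than a guaranteed answer.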

What’s wrong with this table?

Janitor to clean_names()

Star_Trek_Air_Dates <- Star_Trek_Air_Dates |>
  clean_names()
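To see what clean_names() does with duplicated headers, here is a small sketch on a hypothetical one-row table (`toy_dates` is made up to mirror the scraped column names).

```r
library(tibble)
library(janitor)

# Hypothetical miniature of the scraped table, with the same duplicated headers
toy_dates <- tibble(
  Season = "1",
  Episodes = "26",
  Episodes = "26",
  `Originally aired` = "September 28, 1987",
  `Originally aired` = "May 16, 1988",
  .name_repair = "minimal"  # keep the duplicate names exactly as given
)
names(toy_dates)

# clean_names() lower-cases, replaces spaces with "_", and numbers the duplicates
toy_clean <- clean_names(toy_dates)
names(toy_clean)
```

The duplicates get a numeric suffix, which is where names such as episodes_2 and originally_aired_2 come from.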

Now we need to tidy our table. What do we want to do?

Tidying up the data with verbs

Star_Trek_Air_Dates <- Star_Trek_Air_Dates |>
  select(-episodes_2) |>  # remove the duplicated episodes column
  rename(                 # new_name = old_name
    first_aired = originally_aired,
    last_aired = originally_aired_2
  )
# Re-running this chunk fails because episodes_2 and the originally_aired
# columns no longer exist. Rerun the code from the top first.
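If you would rather have a chunk that can be run more than once, one option is the tidyselect helper any_of(), which skips columns that are not there. This is a sketch on a hypothetical one-row table (`toy_clean` and `new_names` are made-up names), not a change to the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row table with the cleaned column names
toy_clean <- tibble(
  season = "1",
  episodes = "26",
  episodes_2 = "26",
  originally_aired = "September 28, 1987",
  originally_aired_2 = "May 16, 1988"
)

# Named vector in the form new_name = "old_name"
new_names <- c(first_aired = "originally_aired", last_aired = "originally_aired_2")

tidy_once <- toy_clean |>
  select(-any_of("episodes_2")) |>  # no error if the column is already gone
  rename(any_of(new_names))         # renames only the columns that still exist

# Running the same steps on the result changes nothing and raises no error
tidy_twice <- tidy_once |>
  select(-any_of("episodes_2")) |>
  rename(any_of(new_names))

identical(tidy_once, tidy_twice)
```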

Drop the first row.

Star_Trek_Air_Dates <- Star_Trek_Air_Dates[-1, ]

Fix the dates with separate() and unite()

# separate() splits first_aired on every run of non-alphanumeric characters.
# Keeping only the first three pieces (month, day, year) drops the repeated
# date in parentheses; unite() then joins the pieces with "_".

Star_Trek_Air_Dates <- Star_Trek_Air_Dates %>%
  separate(col = first_aired, into = c("month","day","year"), extra = "drop") |> 
  unite(col = "first_aired", c("month","day","year"))

# Do the same for the last_aired column.

Star_Trek_Air_Dates <- Star_Trek_Air_Dates %>%
  separate(col = last_aired, into = c("month","day","year"), extra = "drop") |> 
  unite(col = "last_aired", c("month","day","year"))
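It can help to watch the two steps on a single value. This sketch uses a hypothetical one-row tibble (`toy_aired`) holding the first season's date string from the table above.

```r
library(tibble)
library(tidyr)

toy_aired <- tibble(first_aired = "September 28, 1987 (1987-09-28)")

# Step 1: split on runs of non-alphanumeric characters and keep the first three pieces
toy_split <- toy_aired |>
  separate(col = first_aired, into = c("month", "day", "year"), extra = "drop")
toy_split

# Step 2: glue the pieces back together; unite() joins with "_" by default
toy_united <- toy_split |>
  unite(col = "first_aired", c("month", "day", "year"))
toy_united$first_aired
# "September_28_1987"
```

The underscores matter later: they are why the date format string in the next section uses "_" between its parts.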

See the dplyr cheat sheet.

Make the Dates into Dates

class(Star_Trek_Air_Dates$first_aired)
[1] "character"
Star_Trek_Air_Dates <- Star_Trek_Air_Dates %>%
  mutate(
    first_aired = as.Date(first_aired, "%B_%d_%Y"),
    last_aired = as.Date(last_aired, "%B_%d_%Y")
  )
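The format string "%B_%d_%Y" reads a full month name, a day, and a four-digit year, and %B depends on the session's locale. As an alternative sketch (not the approach used above), the raw Wikipedia string already carries an ISO date in parentheses, which as.Date() can parse without a format string. `raw_date` is a hypothetical copy of one raw cell.

```r
library(stringr)

# Hypothetical copy of one raw cell from the scraped table
raw_date <- "September 28, 1987 (1987-09-28)"

# Pull out the yyyy-mm-dd part inside the parentheses
iso_part <- str_extract(raw_date, "\\d{4}-\\d{2}-\\d{2}")
iso_part

# ISO dates parse with as.Date()'s default format, with no month names involved
parsed <- as.Date(iso_part)
class(parsed)
```

This route would have to be applied to the raw columns, before separate() discards the part in parentheses.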

All clean

head(Star_Trek_Air_Dates,3)
# A tibble: 3 × 4
  season episodes first_aired last_aired
  <chr>  <chr>    <date>      <date>    
1 1      26       1987-09-28  1988-05-16
2 2      22       1988-11-21  1989-07-17
3 3      26       1989-09-25  1990-06-18
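The season and episodes columns are still character (<chr>). If you plan to do arithmetic on them, one possible last step is converting them to integers. This is a sketch on a hypothetical two-row copy of the table (`toy_final` is made up from the first two rows shown above).

```r
library(dplyr)
library(tibble)

# Hypothetical copy of the first two rows, with counts stored as text
toy_final <- tibble(
  season = c("1", "2"),
  episodes = c("26", "22")
)

# across() applies as.integer() to both columns in one step
toy_typed <- toy_final |>
  mutate(across(c(season, episodes), as.integer))

# Numeric columns can now be summed
sum(toy_typed$episodes)
```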