Web Scrapping

and cleaning

Schwab

Ways to get data

We have looked at several places to find data:

packages

csv files

APIs

In this lecture we will consider scraping.

How does it work?

Every web page is an html file.

html stands for HyperText Markup Language.

html has markers to tell the web browser how to display the page.

There are several markers that could be used for scraping.

We will focus primarily on the <table> marker

How to see markers

You can right click on a website an select “inspect” to view a page’s html file.

We’ll consider the Wikipedia page for the 2026 Winter Olympics.

There are a lot of tables on this site.

https://en.wikipedia.org/wiki/2026_Winter_Olympics

rvest

We will scrape with a package called Rvest.

rvest (meant to sound like “Harvest”) is a package that is part of the tidyverse, but is not loaded by default.

This means you download when you download the tidyverse.

But you must still load it after you have loaded the tidyverse.

library(tidyverse)
library(rvest)

The read in the html

There are two helpful examples here.

Generally we will first read in the html data with read_html()

read_html() takes a url as a string and you have to be connected to the internet for it to work properly.

winter_olympics = read_html("https://en.wikipedia.org/wiki/2026_Winter_Olympics")

winter_olympics
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled wp25eastereggs-enable-clientpref-1 wp25eastereggs-companion-enabled wp25eastereggs-companion-phone vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Find the right table

We want to find the “selector” for our table.

winter_olympics |>
html_element("#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(85)")
{html_node}
<table class="wikitable" style="margin:0.5em auto; font-size:90%; position:relative; width:75%;">
[1] <tbody><tr>\n<td style="width:2.5em; background:#00cc33; text-align:cente ...

Convert into a table

We use html_table() to create a data frame.

winter_olympics |>
html_element("#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(85)")|>
html_table()
# A tibble: 1 × 10
  X1    X2               X3    X4            X5    X6    X7    X8    X9    X10  
  <chr> <chr>            <chr> <chr>         <chr> <chr> <chr> <chr> <chr> <chr>
1 OC    Opening ceremony ●     Event compet… X     Even… EG    Exhi… CC    Clos…

Cleaning

Note: We’d have to clean this table a fair bit. Its not tidy and there are strange symbols.

Also everything is a character vector.

We will not clean this table but we’ll clean the next one.

Medal count

Attempt to scrape the 2026 Winter Olympics medal table. We’ll clean this one up a bit.

Medal Count Solution

winter_olympics = read_html("https://en.wikipedia.org/wiki/2026_Winter_Olympics")

olympic_medals = winter_olympics |>
  html_element("#mw-content-text > div.mw-content-ltr.mw-parser-output > table.wikitable.sortable.sticky-header-multi.plainrowheaders.jquery-tablesorter")|>
  html_table()

Cleaning with Janitor

We can clean this table with the dplyr and janitor packages. janitor is not part of the tidyverse, but it works with tidy data.

See this page for more information.

Items to clean

Try some of these yourself first, see the hint functions for each line:

  • Rank should be an integer, not a character (mutate())

  • .mw-paser… should say “name_of_country” (rename())

  • all column names should be lowercase to match style (clean_names())

  • lose the last “total row” (head())

  • should probably change “remaining” to “other” (case_when(), or manually)

  • drop the “total” column (select())

  • table is not tidy (pivot_longer())

Rank should be an integer, not a character

Use mutate to make a new column of the correct type.

olympic_medals = olympic_medals |>
  mutate(Rank = as.integer(Rank))
  olympic_medals
# A tibble: 12 × 6
    Rank .mw-parser-output .tooltip-dotted{border-bo…¹  Gold Silver Bronze Total
   <int> <chr>                                         <int>  <int>  <int> <int>
 1     1 Norway                                           18     12     11    41
 2     2 United States                                    12     12      9    33
 3     3 Netherlands                                      10      7      3    20
 4     4 Italy*                                           10      6     14    30
 5     5 Germany                                           8     10      8    26
 6     6 France                                            8      9      6    23
 7     7 Sweden                                            8      6      4    18
 8     8 Switzerland                                       6      9      8    23
 9     9 Austria                                           5      8      5    18
10    10 Japan                                             5      7     12    24
11    NA Remaining                                        26     32     35    93
12    NA Totals (29 entries)                             116    118    115   349
# ℹ abbreviated name:
#   ¹​`.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}NOC`

.mw-paser… should say “name_of_country”

We’ll use rename()

olympic_medals = olympic_medals|>
  rename( name_of_country = `.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}NOC`)

all column names should be lowercase to match style (clean_names())

The clean_names() function will lower case all column names and use underscores as spaces. This conforms to the tidyverse style guide. It also makes data easier to work with.

library(janitor)

olympic_medals = olympic_medals |> 
  clean_names()

lose the last “total row” (head())

olympic_medals = olympic_medals |>
head(16)

change “remaining” to “other” (case_when(), or manually)

olympic_medals|>
mutate(name_of_country = case_when(
  name_of_country=="Remaining" ~ "Other Countries",
  TRUE ~ name_of_country))
# A tibble: 12 × 6
    rank name_of_country      gold silver bronze total
   <int> <chr>               <int>  <int>  <int> <int>
 1     1 Norway                 18     12     11    41
 2     2 United States          12     12      9    33
 3     3 Netherlands            10      7      3    20
 4     4 Italy*                 10      6     14    30
 5     5 Germany                 8     10      8    26
 6     6 France                  8      9      6    23
 7     7 Sweden                  8      6      4    18
 8     8 Switzerland             6      9      8    23
 9     9 Austria                 5      8      5    18
10    10 Japan                   5      7     12    24
11    NA Other Countries        26     32     35    93
12    NA Totals (29 entries)   116    118    115   349
# alternative method
#olympic_medals[16, "name_of_country"]<-"hello" # Notice the <- here. 16th row of the name_of_country column.

drop the “total” column

olympic_medals = 
  olympic_medals |>
  select(-total)

tidy up the table

olympic_medals = olympic_medals |>
pivot_longer(
  cols = c(gold,silver,bronze),
  names_to = "medal", 
  values_to = "number_of_medals")

admire the clean data frame

head(olympic_medals)
# A tibble: 6 × 4
   rank name_of_country medal  number_of_medals
  <int> <chr>           <chr>             <int>
1     1 Norway          gold                 18
2     1 Norway          silver               12
3     1 Norway          bronze               11
4     2 United States   gold                 12
5     2 United States   silver               12
6     2 United States   bronze                9