and cleaning
We have looked at several places to find data:
packages
csv files
APIs
In this lecture we will consider scraping.
Every web page is an html file.
html stands for HyperText Markup Language.
html has markers to tell the web browser how to display the page.
There are several markers that could be used for scraping.
We will focus primarily on the <table> marker
You can right click on a website an select “inspect” to view a page’s html file.
We’ll consider the Wikipedia page for the 2026 Winter Olympics.
There are a lot of tables on this site.
We will scrape with a package called Rvest.
rvest (meant to sound like “Harvest”) is a package that is part of the tidyverse, but is not loaded by default.
This means you download when you download the tidyverse.
But you must still load it after you have loaded the tidyverse.
There are two helpful examples here.
Generally we will first read in the html data with read_html()
read_html() takes a url as a string and you have to be connected to the internet for it to work properly.
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled wp25eastereggs-enable-clientpref-1 wp25eastereggs-companion-enabled wp25eastereggs-companion-phone vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
We want to find the “selector” for our table.
winter_olympics |>
html_element("#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(85)"){html_node}
<table class="wikitable" style="margin:0.5em auto; font-size:90%; position:relative; width:75%;">
[1] <tbody><tr>\n<td style="width:2.5em; background:#00cc33; text-align:cente ...
We use html_table() to create a data frame.
winter_olympics |>
html_element("#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(85)")|>
html_table()# A tibble: 1 × 10
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 OC Opening ceremony ● Event compet… X Even… EG Exhi… CC Clos…
Note: We’d have to clean this table a fair bit. Its not tidy and there are strange symbols.
Also everything is a character vector.
We will not clean this table but we’ll clean the next one.
Attempt to scrape the 2026 Winter Olympics medal table. We’ll clean this one up a bit.
We can clean this table with the dplyr and janitor packages. janitor is not part of the tidyverse, but it works with tidy data.
Try some of these yourself first, see the hint functions for each line:
Rank should be an integer, not a character (mutate())
.mw-paser… should say “name_of_country” (rename())
all column names should be lowercase to match style (clean_names())
lose the last “total row” (head())
should probably change “remaining” to “other” (case_when(), or manually)
drop the “total” column (select())
table is not tidy (pivot_longer())
Use mutate to make a new column of the correct type.
# A tibble: 12 × 6
Rank .mw-parser-output .tooltip-dotted{border-bo…¹ Gold Silver Bronze Total
<int> <chr> <int> <int> <int> <int>
1 1 Norway 18 12 11 41
2 2 United States 12 12 9 33
3 3 Netherlands 10 7 3 20
4 4 Italy* 10 6 14 30
5 5 Germany 8 10 8 26
6 6 France 8 9 6 23
7 7 Sweden 8 6 4 18
8 8 Switzerland 6 9 8 23
9 9 Austria 5 8 5 18
10 10 Japan 5 7 12 24
11 NA Remaining 26 32 35 93
12 NA Totals (29 entries) 116 118 115 349
# ℹ abbreviated name:
# ¹`.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}NOC`
We’ll use rename()
The clean_names() function will lower case all column names and use underscores as spaces. This conforms to the tidyverse style guide. It also makes data easier to work with.
olympic_medals|>
mutate(name_of_country = case_when(
name_of_country=="Remaining" ~ "Other Countries",
TRUE ~ name_of_country))# A tibble: 12 × 6
rank name_of_country gold silver bronze total
<int> <chr> <int> <int> <int> <int>
1 1 Norway 18 12 11 41
2 2 United States 12 12 9 33
3 3 Netherlands 10 7 3 20
4 4 Italy* 10 6 14 30
5 5 Germany 8 10 8 26
6 6 France 8 9 6 23
7 7 Sweden 8 6 4 18
8 8 Switzerland 6 9 8 23
9 9 Austria 5 8 5 18
10 10 Japan 5 7 12 24
11 NA Other Countries 26 32 35 93
12 NA Totals (29 entries) 116 118 115 349