Numeric Data

Most of this is stolen from Ben Baumer

Schwab?

Read before lecture

Chapter 5 Numeric Data

Summarizing distributions

Numeric Data

  • Shape, Center, and Spread

Load the libraries

library(tidyverse)
library(palmerpenguins)

Histogram

Summarize the shape of the distribution of one variable

ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_histogram()
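Called with no arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you choose a better value. A minimal sketch of setting the bin width by hand is below; the 250 g width and the name mass_histogram are illustrative assumptions, not values from the lecture.

```r
library(tidyverse)
library(palmerpenguins)

# binwidth is in the units of the x variable (grams here);
# 250 is an arbitrary choice made for illustration
mass_histogram <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250)

mass_histogram
```

Try a few widths: bins that are too wide hide the shape, and bins that are too narrow make it noisy.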

Density plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_density()

Box plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_boxplot()

Your turn

  • Use a data graphic to summarize the distribution of the height variable in the starwars data frame.

Histogram: two variables

ggplot(data = penguins, mapping = aes(x = body_mass_g, fill = species)) +
  geom_histogram()

Density plot: two variables

ggplot(data = penguins, mapping = aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.4)

Boxplot: two variables

ggplot(data = penguins, mapping = aes(x = body_mass_g, fill = species)) +
  geom_boxplot()

Measures of center

  • mean: mean()
  • median: median()

# When using summary functions like n() and mean() inside summarize(), the name to the left of each = becomes the column header.

  summarize(
    .data = penguins,
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  number_of_penguins mean_mass median_mass
               <int>     <dbl>       <dbl>
1                344     4202.        4050

Measures of spread

  • standard deviation: sd()
  • variance: var()
  • range: range()
  • IQR: IQR()

  summarize(
    .data = penguins,
    number_of_penguins = n(),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    var_mass = var(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  number_of_penguins sd_mass var_mass
               <int>   <dbl>    <dbl>
1                344    802.  643131.
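The summary above covers sd() and var() but not the other two functions in the list. A small sketch of range() and IQR() on the same variable follows; the object names mass_range and spread_summary, and the range_width column, are made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

# range() returns two values (the minimum and the maximum),
# so it is called on the column directly rather than inside summarize()
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)
mass_range

# IQR() returns a single value, so it fits inside summarize()
spread_summary <- summarize(
  .data = penguins,
  iqr_mass = IQR(body_mass_g, na.rm = TRUE),
  range_width = max(body_mass_g, na.rm = TRUE) - min(body_mass_g, na.rm = TRUE)
)
spread_summary
```

The IQR is the distance between the first and third quartiles, so it ignores the extremes that drive the range.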

Quantiles

Most summary information can be found with the fivenum() or summary() function.

# This is just the five-number summary
five_number_pengiuns <- fivenum(penguins$body_mass_g)
five_number_pengiuns
[1] 2700 3550 4050 4750 6300
# This is the five-number summary plus the mean and the count of missing values
summary(penguins$body_mass_g)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2 
# This is how you can get specific quantiles.
quantile(x = penguins$body_mass_g, probs = c(0.25, 0.5), na.rm = TRUE)
 25%  50% 
3550 4050 

Graphed Quantiles

# Note the vertical line showing the 25th percentile.
ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_density() +
  geom_vline(xintercept = 3550, color = "purple")
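The 3550 in the plot is typed in by hand. One possible variation, sketched here, computes the 25th percentile with quantile() and passes the result to geom_vline(), so the line stays correct if the data change. The names q25_mass and quantile_plot are assumptions for this example.

```r
library(tidyverse)
library(palmerpenguins)

# Compute the 25th percentile instead of typing the number by hand
q25_mass <- quantile(x = penguins$body_mass_g, probs = 0.25, na.rm = TRUE)

# Draw the density curve with a vertical line at the computed value
quantile_plot <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_density() +
  geom_vline(xintercept = q25_mass, color = "purple")

quantile_plot
```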

Your turn

  • Summarize the distribution of the height variable in the starwars data frame by computing:
    • the number of observations
    • the mean
    • the standard deviation
  • Find the 80th percentile (the 0.80 quantile) of the height variable in the starwars data.

Answer here.

Outliers

To locate outliers along a number line, we first calculate two bounds:

\[ \begin{aligned} L &= Q_1 - 1.5 \times IQR \\ U &= Q_3 + 1.5 \times IQR \end{aligned} \]

Any values outside that range are considered outliers.

Let’s do this with the penguins’ body mass.

Calculate outlier bounds in R

# This is the IQR of the penguins' body mass
iqr_mass <- IQR(x = penguins$body_mass_g, na.rm = TRUE)

q1 <- five_number_pengiuns[2]
q3 <- five_number_pengiuns[4]

# This is the lower bound
q1 - 1.5*iqr_mass 
[1] 1750
# This is the upper bound
q3 + 1.5*iqr_mass
[1] 6550
# The smallest and largest masses both fall inside the bounds, so no penguin is an outlier
min(penguins$body_mass_g, na.rm = TRUE)
[1] 2700
max(penguins$body_mass_g, na.rm = TRUE)
[1] 6300
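Comparing the minimum and maximum with the bounds works when there are no outliers, but it does not say which rows are outliers when there are some. A sketch of one way to list them with filter() is below. It uses quantile() for the quartiles rather than fivenum(), and the names lower_bound, upper_bound, and mass_outliers are assumptions for this example.

```r
library(tidyverse)
library(palmerpenguins)

# Quartiles and IQR of body mass; unname() drops the "25%" / "75%" labels
q1 <- unname(quantile(penguins$body_mass_g, probs = 0.25, na.rm = TRUE))
q3 <- unname(quantile(penguins$body_mass_g, probs = 0.75, na.rm = TRUE))
iqr_mass <- IQR(x = penguins$body_mass_g, na.rm = TRUE)

# Outlier bounds from the 1.5 x IQR rule
lower_bound <- q1 - 1.5 * iqr_mass
upper_bound <- q3 + 1.5 * iqr_mass

# Keep only the rows whose mass falls outside the bounds
# (filter() also drops rows where body_mass_g is missing)
mass_outliers <- filter(
  .data = penguins,
  body_mass_g < lower_bound | body_mass_g > upper_bound
)

nrow(mass_outliers)
```

For the penguins this returns a data frame with zero rows, which agrees with the minimum and maximum both sitting inside the bounds.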

Your turn

  • Make a boxplot of the mass variable in the starwars data frame.

  • Calculate the outlier bounds (by hand or on the computer).

  • Who are the outliers?

Answer here