Center, Shape, and Spread

IMS, Ch. 4–5

Benjamin S. Baumer, with extensive edits by Nic Schwab

Warmup

Confounding Variables

For each of the following pairs of variables, a statistically significant positive relationship has been observed. Identify a potential lurking variable (omitted confounder) that might cause the spurious correlation.

  • The amount of ice cream sold in New England and the number of deaths by drowning
  • The salary of U.S. ministers and the price of vodka
  • The number of doctors in a region and the number of crimes committed in that region
  • The amount of coffee consumed and the prevalence of lung cancer
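
To see how a lurking variable can produce correlations like these, here is a minimal simulation sketch in base R. All numbers are invented for illustration (they are not real sales or mortality data). A hypothetical `temperature` variable drives both `ice_cream` and `drownings`, which have no direct effect on each other.

```r
# Simulated (not real) data: temperature is the lurking variable.
set.seed(1)
n <- 500
temperature <- rnorm(n, mean = 60, sd = 15)                 # daily high, degrees F
ice_cream   <- 20 + 2.0 * temperature + rnorm(n, sd = 10)   # depends on temperature only
drownings   <- 5 + 0.05 * temperature + rnorm(n, sd = 1)    # depends on temperature only

# Raw correlation between two variables that do not affect each other
raw_cor <- cor(ice_cream, drownings)

# Remove the part of each variable explained by temperature,
# then correlate what is left over (a partial correlation)
ice_resid    <- resid(lm(ice_cream ~ temperature))
drown_resid  <- resid(lm(drownings ~ temperature))
adjusted_cor <- cor(ice_resid, drown_resid)

raw_cor
adjusted_cor
```

In this simulation the raw correlation is clearly positive, while the correlation after accounting for temperature should be close to zero.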

Explore Categorical Data

Chapter 4

Exploring categorical data

Let’s explore categorical data using summary statistics and visualizations.

# Look at the loans data

library(tidyverse)
library(openintro)  # provides loans_full_schema
head(loans_full_schema)
# A tibble: 6 × 55
  emp_title         emp_length state homeownership annual_income verified_income
  <chr>                  <dbl> <fct> <fct>                 <dbl> <fct>          
1 "global config e…          3 NJ    MORTGAGE              90000 Verified       
2 "warehouse offic…         10 HI    RENT                  40000 Not Verified   
3 "assembly"                 3 WI    RENT                  40000 Source Verified
4 "customer servic…          1 PA    RENT                  30000 Not Verified   
5 "security superv…         10 CA    RENT                  35000 Verified       
6 ""                        NA KY    OWN                   34000 Not Verified   
# ℹ 49 more variables: debt_to_income <dbl>, annual_income_joint <dbl>,
#   verification_income_joint <fct>, debt_to_income_joint <dbl>,
#   delinq_2y <int>, months_since_last_delinq <int>,
#   earliest_credit_line <dbl>, inquiries_last_12m <int>,
#   total_credit_lines <int>, open_credit_lines <int>,
#   total_credit_limit <int>, total_credit_utilized <int>,
#   num_collections_last_12m <int>, num_historical_failed_to_pay <int>, …

Bar graphs

Bar graphs are helpful for exploring the counts in each level of a categorical variable.

ggplot(loans_full_schema, aes(x=homeownership)) +
  geom_bar()

Two variables

We can map more than one variable to aesthetics inside aes()

ggplot(loans_full_schema, aes(x=homeownership,fill=application_type)) +
  geom_bar()

fill

ggplot(loans_full_schema, aes(x=homeownership,fill=application_type)) +
  geom_bar(position = "fill")

dodge

ggplot(loans_full_schema, aes(x=homeownership,fill=application_type)) +
  geom_bar(position = "dodge")
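
The three two-variable bar graphs above display the same underlying two-way table in different ways. As a sketch, assuming `loans_full_schema` is the data frame shipped in the `openintro` package, the counts and within-group proportions can be computed directly with `dplyr`. The name `ownership_by_type` is made up for this example.

```r
library(dplyr)
library(openintro)  # assumed source of loans_full_schema

ownership_by_type <- loans_full_schema |>
  count(homeownership, application_type) |>   # heights of the stacked and dodged bars
  group_by(homeownership) |>
  mutate(prop = n / sum(n)) |>                # segment heights under position = "fill"
  ungroup()

ownership_by_type
```

The `n` column matches what the stacked and dodged graphs show. The `prop` column matches the filled graph, where every bar has total height 1.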

Try Problem 4.5

Summarizing distributions

Numeric Data

  • Shape, Center, and Spread

Histogram

Summarize the shape of the distribution of one variable

library(tidyverse)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram()
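
Called without arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting that you pick a bin width yourself. A sketch with an explicit width follows. The value of 250 g is an arbitrary choice for illustration, and `na.rm = TRUE` silences the warning about penguins with a missing body mass.

```r
library(ggplot2)
library(palmerpenguins)

# Each bar now covers a 250-gram interval of body mass
p_hist <- ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250, na.rm = TRUE)
p_hist
```

Trying a few different widths is worthwhile, since the apparent shape of a histogram can change with the bin width.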

Density plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_density()

Box plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_boxplot()

Your turn

  • Use a data graphic to summarize the distribution of the height variable in the starwars data frame.

Histogram: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_histogram()

Density plot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_density()
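
With `fill = species`, the filled density curves overlap, so one species can hide another. A possible variation, assuming the goal is to compare the shapes of the three species directly, makes the fills semi-transparent with `alpha` (0.5 is an arbitrary choice):

```r
library(ggplot2)
library(palmerpenguins)

# Semi-transparent fills keep overlapping densities visible
p_density <- ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5, na.rm = TRUE)
p_density
```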

Boxplot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_boxplot()

Measures of center

  • mean: mean()
  • median: median()

penguins |>
  summarize(
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  number_of_penguins mean_mass median_mass
               <int>     <dbl>       <dbl>
1                344     4202.        4050
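
The mean and the median respond differently to extreme values. A minimal sketch with made-up numbers (not the penguin data):

```r
# Hypothetical body masses in grams, invented for illustration
masses <- c(3400, 3600, 3800, 4000, 4200)
mean(masses)     # 3800
median(masses)   # 3800

# Replace the largest value with an extreme one
masses_outlier <- c(3400, 3600, 3800, 4000, 9200)
mean(masses_outlier)     # 4800: pulled toward the extreme value
median(masses_outlier)   # 3800: unchanged
```

In the penguin summary above, the mean (about 4202 g) sits above the median (4050 g), which is consistent with a distribution that has a longer right tail.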

Measures of spread

  • standard deviation: sd()
  • variance: var()
  • range: range()
  • IQR: IQR()

penguins |>
  summarize(
    number_of_penguins = n(),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    var_mass = var(body_mass_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  number_of_penguins sd_mass var_mass
               <int>   <dbl>    <dbl>
1                344    802.  643131.
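
The summary above uses `sd()` and `var()` but not the other two functions in the list. Here is a sketch of how the range and IQR might be added. Because `range()` returns two values (the minimum and the maximum), this version calls `min()` and `max()` so that each summary column holds a single number. The name `spread_summary` is made up for this example.

```r
library(dplyr)
library(palmerpenguins)

spread_summary <- penguins |>
  summarize(
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE),
    iqr_mass = IQR(body_mass_g, na.rm = TRUE),   # Q3 minus Q1
    sd_mass  = sd(body_mass_g, na.rm = TRUE),
    var_mass = var(body_mass_g, na.rm = TRUE)    # the square of sd_mass
  )
spread_summary
```

The standard deviation, range, and IQR are in grams, while the variance is in squared grams, which is why `var_mass` is so much larger than `sd_mass`.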

Your turn

  • Summarize the distribution of the height variable in the starwars data frame by computing:
    • the number of observations
    • the mean
    • the standard deviation

Thought Experiment

Consider the following two variables:

  1. The height of all adults in the United States
  2. The annual income of all working adults in the United States

  • Sketch a density plot for each distribution. What features does it have?
  • Is it symmetric? Is it normal? Is it unimodal?
  • Summarize each distribution numerically. Which measures are most appropriate?

Challenge

Suppose that the government issued a tax rebate of $2,000 to each American taxpayer.

  • How would the distribution of income change?
  • What would happen to your measures of center and spread?
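
One way to check your reasoning is to simulate it. The sketch below uses invented, right-skewed incomes drawn from a log-normal distribution (an assumption made for illustration, not real income data) and adds a flat 2000 to every value. The helper `summarize_income()` is a made-up name.

```r
set.seed(42)
income <- rlnorm(10000, meanlog = log(50000), sdlog = 0.8)  # simulated, right-skewed
income_rebate <- income + 2000                              # everyone receives the same amount

# Hypothetical helper: measures of center and spread for one vector
summarize_income <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

# Compare the measures before and after the rebate
rbind(before = summarize_income(income),
      after  = summarize_income(income_rebate))
```

Compare the two rows column by column, then ask whether the same pattern would hold if the rebate were instead a fixed percentage of each person's income.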