Center, Shape, and Spread

IMS, Ch. 4–5

Benjamin S. Baumer, edited by Nic Schwab

Warmup

Confounding Variables

For each of the following pairs of variables, a statistically significant positive relationship has been observed. Identify a potential lurking variable (omitted confounder) that might cause the spurious correlation.

The amount of ice cream sold in New England and the number of deaths by drowning
The salary of U.S. ministers and the price of vodka
The number of doctors in a region and the number of crimes committed in that region
The amount of coffee consumed and the prevalence of lung cancer
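One way to see how a lurking variable can produce a spurious correlation is to simulate one. The sketch below uses made-up data in base R: a hypothetical variable `z` drives both `x` and `y`, which have no direct effect on each other. The variable names, coefficients, sample size, and seed are all assumptions chosen for illustration.

```r
# Simulated (not real) data: z is a lurking variable that drives both x and y.
set.seed(42)
n <- 200
z <- rnorm(n, mean = 20, sd = 8)
x <- 50 + 3.0 * z + rnorm(n, sd = 10)  # x depends on z only
y <- 2 + 0.2 * z + rnorm(n, sd = 1)    # y depends on z only

# x and y are strongly correlated even though neither affects the other.
raw_cor <- cor(x, y)

# Remove the part of each variable explained by z, then correlate the leftovers.
x_resid <- resid(lm(x ~ z))
y_resid <- resid(lm(y ~ z))
adjusted_cor <- cor(x_resid, y_resid)

raw_cor       # large and positive
adjusted_cor  # close to zero once z is accounted for
```

The same reasoning applies to each pair in the warmup: look for a third variable that plausibly moves both.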

Explore Categorical Data

Chapter 4

Exploring categorical data

Let’s explore categorical data using summary statistics and visualizations.

# Look at the loans data (loans_full_schema comes from the openintro package)
library(tidyverse)
library(openintro)

head(loans_full_schema)

# A tibble: 6 × 55
  emp_title         emp_length state homeownership annual_income verified_income
  <chr>                  <dbl> <fct> <fct>                 <dbl> <fct>          
1 "global config e…          3 NJ    MORTGAGE              90000 Verified       
2 "warehouse offic…         10 HI    RENT                  40000 Not Verified   
3 "assembly"                 3 WI    RENT                  40000 Source Verified
4 "customer servic…          1 PA    RENT                  30000 Not Verified   
5 "security superv…         10 CA    RENT                  35000 Verified       
6 ""                        NA KY    OWN                   34000 Not Verified   
# ℹ 49 more variables: debt_to_income <dbl>, annual_income_joint <dbl>,
#   verification_income_joint <fct>, debt_to_income_joint <dbl>,
#   delinq_2y <int>, months_since_last_delinq <int>,
#   earliest_credit_line <dbl>, inquiries_last_12m <int>,
#   total_credit_lines <int>, open_credit_lines <int>,
#   total_credit_limit <int>, total_credit_utilized <int>,
#   num_collections_last_12m <int>, num_historical_failed_to_pay <int>, …

Bar graphs

Bar graphs are helpful for exploring the counts of a categorical variable.

ggplot(loans_full_schema, aes(x = homeownership)) +
  geom_bar()

Two variables

We can map more than one variable to aesthetics inside aes(), for example x and fill.

ggplot(loans_full_schema, aes(x = homeownership, fill = application_type)) +
  geom_bar()

position = "fill": stacked bars rescaled to proportions

ggplot(loans_full_schema, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "fill")

position = "dodge": side-by-side bars

ggplot(loans_full_schema, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "dodge")
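The bar charts above display the same two-way table in different ways: the stacked and dodged versions show counts, while position = "fill" shows proportions within each x category. As a rough sketch of the numbers behind the bars, the base R code below builds a contingency table from a small made-up data frame. The data frame `toy` and its values are hypothetical stand-ins, not the loans data.

```r
# Made-up data standing in for homeownership and application_type.
toy <- data.frame(
  homeownership = c("RENT", "RENT", "RENT", "OWN", "OWN",
                    "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE", "RENT"),
  application_type = c("individual", "joint", "individual", "individual", "joint",
                       "individual", "individual", "joint", "individual", "individual")
)

# Counts: what the stacked and dodged bars display.
counts <- table(toy$homeownership, toy$application_type)
counts

# Row proportions: what position = "fill" displays
# (each homeownership category sums to 1).
props <- prop.table(counts, margin = 1)
props
```

Comparing `counts` with `props` shows why the "fill" chart makes groups of different sizes easier to compare, at the cost of hiding how many observations each bar represents.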

Try Problem 4.5

Summarizing distributions

Numeric Data

Shape, Center, and Spread

Histogram

Summarize the shape of the distribution of one variable

library(tidyverse)
library(palmerpenguins)
ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram()

Density plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_density()

Box plot

Summarize the shape of the distribution of one variable

ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_boxplot()
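A box plot is a picture of a few summary numbers: the median, the first and third quartiles, and whiskers that reach the most extreme points within 1.5 × IQR of the box. This is the usual convention and, as far as the ggplot2 documentation describes, the default for geom_boxplot(). The sketch below computes those pieces by hand in base R on a small made-up vector. The values in `x` are invented for illustration.

```r
# Made-up values with one unusually large observation.
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- IQR(x)                                  # Q3 - Q1

# Fences: points beyond these are drawn individually as outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers  # 30
```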

Your turn

Use a data graphic to summarize the distribution of the height variable in the starwars data frame.

Histogram: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_histogram()

Density plot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_density()

Boxplot: two variables

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) +
  geom_boxplot()

Measures of center

mean: mean()
median: median()

penguins |>
  summarize(
    number_of_penguins = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE)
  )

# A tibble: 1 × 3
  number_of_penguins mean_mass median_mass
               <int>     <dbl>       <dbl>
1                344     4202.        4050
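The mean and median respond very differently to extreme values, which matters when choosing a measure of center for a skewed distribution. The sketch below uses made-up incomes (in thousands of dollars) to show the effect of a single large value. The numbers are invented for illustration.

```r
incomes <- c(30, 35, 40, 45, 50)          # made-up incomes, in thousands
incomes_with_outlier <- c(incomes, 1000)  # add one very large income

mean(incomes)                 # 40
median(incomes)               # 40
mean(incomes_with_outlier)    # 200
median(incomes_with_outlier)  # 42.5
```

One extreme value pulls the mean from 40 to 200, while the median barely moves.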

Measures of spread

standard deviation: sd()
variance: var()
range: range()
IQR: IQR()

penguins |>
  summarize(
    number_of_penguins = n(),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    var_mass = var(body_mass_g, na.rm = TRUE)
  )

# A tibble: 1 × 3
  number_of_penguins sd_mass var_mass
               <int>   <dbl>    <dbl>
1                344    802.  643131.
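The four measures of spread listed above are related but not interchangeable. The sketch below, on a small made-up vector, checks that the standard deviation is the square root of the variance and shows that range() returns two values (minimum and maximum) rather than a single number. The vector `x` and the helper names are assumptions for illustration.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)  # made-up values

sd_x <- sd(x)
var_x <- var(x)
all.equal(sd_x, sqrt(var_x))   # TRUE: sd is the square root of the variance

range(x)                       # two values: 2 and 30
range_width <- diff(range(x))  # a single number: max - min
iqr_x <- IQR(x)                # Q3 - Q1, not affected by the value 30

range_width  # 28
iqr_x        # 4
```

Because range() returns two values, wrapping it in diff() (or using min() and max() separately) is often more convenient when a single summary number is needed.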

Your turn

Summarize the distribution of the height variable in the starwars data frame by computing:
- the number of observations
- the mean
- the standard deviation

Thought Experiment

Consider the following two variables:

The height of all adults in the United States
The annual income of all working adults in the United States

Sketch a density plot for each distribution. What features does it have?
Is it symmetric? Is it normal? Is it unimodal?
Summarize each distribution numerically. Which measures are most appropriate?

Challenge

Suppose that the government issued a tax rebate in the amount of $2000 to each American taxpayer.

How would the distribution of income change?
What would happen to your measures of center and spread?