Exploring Categorical Data

IMS, Ch. 4

Nic Schwab

Read before Lecture

Chapter 4 Exploring Categorical Data

Confounding Variables

For each of the following pairs of variables, a statistically significant positive relationship has been observed. For each pair, identify a confounding variable that might cause the spurious correlation.

  • The amount of ice cream sold in New England and the number of deaths by drowning
  • The number of doctors in a region and the number of crimes committed in that region
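To see how a confounder can produce a positive relationship on its own, here is a minimal simulated sketch. Everything in it is hypothetical: z is a made-up confounder, and x and y are made-up variables that each depend on z but not on each other. The overall correlation between x and y is strong, yet it mostly disappears once we compare only observations with similar values of z.

```r
# Simulated (hypothetical) data: z drives both x and y; x and y never affect each other.
set.seed(42)
n <- 1000
z <- runif(n, min = 0, max = 10)   # the confounder
x <- 2 * z + rnorm(n, sd = 2)      # depends on z only
y <- 1.5 * z + rnorm(n, sd = 2)    # depends on z only

# Correlation between x and y when z is ignored
overall_cor <- cor(x, y)

# Correlation between x and y within narrow bands of z (width 1)
z_bin <- cut(z, breaks = 0:10, include.lowest = TRUE)
within_cor <- sapply(split(data.frame(x, y), z_bin),
                     function(d) cor(d$x, d$y))

round(overall_cor, 2)
round(within_cor, 2)
```

Holding z roughly fixed removes most of the association, which is the signature of a spurious correlation caused by confounding.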

Explore Categorical Data

We’ll be exploring categorical data using R.

library(tidyverse)
library(openintro)

# You should read and view the documentation of the data as a first step to exploring it. 
?assortive_mating
assortive_mating
# A tibble: 204 × 2
   self_male partner_female
   <fct>     <fct>         
 1 blue      blue          
 2 blue      blue          
 3 blue      blue          
 4 blue      blue          
 5 blue      blue          
 6 blue      blue          
 7 blue      blue          
 8 blue      blue          
 9 blue      blue          
10 blue      blue          
# ℹ 194 more rows

Exploring categorical data

Let’s explore categorical data using summary statistics and visualizations.

Tables

Tables are helpful for understanding categorical data. A two-way table counts how many observations fall into each combination of categories.

table(assortive_mating)
         partner_female
self_male blue brown green
    blue    78    23    13
    brown   19    23    12
    green   11     9    16

Prop tables

# Here I am saving the table as the variable my_table.
# I am telling table() to cross-tabulate self_male (rows) against partner_female (columns).

my_table <- table(assortive_mating$self_male, assortive_mating$partner_female)

prop.table(my_table)
       
              blue      brown      green
  blue  0.38235294 0.11274510 0.06372549
  brown 0.09313725 0.11274510 0.05882353
  green 0.05392157 0.04411765 0.07843137
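By default prop.table() divides every cell by the grand total. Its margin argument gives row or column proportions instead. The sketch below rebuilds the table of counts shown above by hand so it runs without any packages; the names eye_table, row_props, and col_props are made up for this example.

```r
# Rebuild the 3 x 3 table of counts (rows: self_male, columns: partner_female).
eye_table <- as.table(matrix(
  c(78, 23, 13,
    19, 23, 12,
    11,  9, 16),
  nrow = 3, byrow = TRUE,
  dimnames = list(self_male      = c("blue", "brown", "green"),
                  partner_female = c("blue", "brown", "green"))
))

# margin = 1: divide by row totals, so each row sums to 1
row_props <- prop.table(eye_table, margin = 1)

# margin = 2: divide by column totals, so each column sums to 1
col_props <- prop.table(eye_table, margin = 2)

round(row_props, 3)
round(col_props, 3)
```

Row proportions answer questions like "among blue-eyed males, what share have a blue-eyed partner?", which the overall proportions cannot answer directly.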

Margins

addmargins(A = my_table)
       
        blue brown green Sum
  blue    78    23    13 114
  brown   19    23    12  54
  green   11     9    16  36
  Sum    108    55    41 204
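addmargins() also works on a table of proportions, which puts the overall share of each eye color in the Sum row and column. As a small sketch, the counts are again typed in by hand from the output above, and the object names are made up for this example.

```r
# Counts copied from the table above (rows: self_male, columns: partner_female).
eye_table <- as.table(matrix(
  c(78, 23, 13,
    19, 23, 12,
    11,  9, 16),
  nrow = 3, byrow = TRUE,
  dimnames = list(self_male      = c("blue", "brown", "green"),
                  partner_female = c("blue", "brown", "green"))
))

# Overall proportions with row and column totals added
props_with_margins <- addmargins(prop.table(eye_table))
round(props_with_margins, 3)

# margin = 1 adds only a Sum row (the column totals)
counts_with_sum_row <- addmargins(eye_table, margin = 1)
counts_with_sum_row
```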

Bar graphs

Bar graphs can be helpful for exploring tables. Each bar shows the count for one category of a single variable.

ggplot(data = assortive_mating, mapping = aes(x=self_male)) +
  geom_bar()

ggplot(data = assortive_mating, mapping = aes(x=partner_female)) +
  geom_bar()

Mosaic plots

For two categorical variables, a mosaic plot is a good choice: the area of each tile is proportional to the count in the corresponding cell of the table.

# Here I am showing that I want to see a relationship between the two variables. 
mosaicplot(self_male ~ partner_female, data=assortive_mating)

Two variables with bar graphs

We can map more than one variable inside aes(). Here x is mapped to self_male and fill to partner_female.

ggplot(data = assortive_mating, mapping = aes(x=self_male, fill=partner_female)) +
  geom_bar()

fill

ggplot(data = assortive_mating, mapping = aes(x=self_male,fill=partner_female)) +
  geom_bar(position = "fill")

dodge

ggplot(data = assortive_mating, mapping = aes(x=self_male,fill=partner_female)) +
  geom_bar(position = "dodge")

Problem 4.5

Additional Practice exploring Categorical Data

Once you’ve loaded the openintro and tidyverse libraries, there are many datasets to practice with.

One of them is avandia. Explore it using the methods above.

I’ll explore it myself in the next video.