ANOVA part 1

Schwab

Reading

Libraries

library(openintro)
library(tidyverse)

Comparing multiple groups

We use Analysis of Variance (ANOVA) to test whether the means of three or more groups differ.

New distribution: F

The F distribution looks like:
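A minimal sketch for drawing an F density curve with ggplot2. The degrees of freedom (2 and 161) are illustrative assumptions, chosen to match three groups and 164 observations; any other positive values work too. The curve starts at zero and is right-skewed.

```r
library(ggplot2)

# Illustrative degrees of freedom (an assumption, not fixed by the slide)
df_groups <- 2     # number of groups - 1
df_resid  <- 161   # number of observations - number of groups

# stats::df() is the density function of the F distribution
f_plot <- ggplot(data.frame(x = c(0, 6)), aes(x = x)) +
  stat_function(fun = stats::df,
                args = list(df1 = df_groups, df2 = df_resid)) +
  labs(x = "F", y = "density")

f_plot
```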

p-value and CI

We find the p-value in the right tail.

There is no single confidence interval that corresponds to the F test.

Conditions:

the observations are independent within and between groups,

the responses within each group are nearly normal, and

the variability across the groups is about equal.

Rule of Thumb

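The “variability across the groups is about equal” condition has a rule of thumb.

Treat the condition as met if the largest sample SD divided by the smallest sample SD is between 0.5 and 2. Because this ratio is always at least 1, only the upper bound matters in practice.

\[ 0.5\le \frac{s_{max}}{s_{min}}\le 2 \]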

A test

There are three classes that took a midterm. We want to know if the exam average for any of the classes is different.

ggplot(data = classdata) +
  geom_boxplot(aes(x = m1, color = lecture))

Summary stats

Is the variance about the same?

group_by(.data = classdata, lecture) |>
  summarise(mean = mean(m1), sd = sd(m1), count = n())

# A tibble: 3 × 4
  lecture  mean    sd count
  <fct>   <dbl> <dbl> <int>
1 a        75.1  13.9    58
2 b        72.0  13.8    55
3 c        78.9  13.1    51
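The rule of thumb can be checked directly from this table. The sketch below copies the three standard deviations by hand instead of reading them from classdata, so the vector `sds` is an illustrative stand-in.

```r
# Standard deviations copied from the summary table above
sds <- c(a = 13.9, b = 13.8, c = 13.1)

# Ratio of the largest SD to the smallest SD
sd_ratio <- max(sds) / min(sds)
sd_ratio

# TRUE means the equal-variability rule of thumb is satisfied
sd_ratio <= 2
```

The ratio is about 1.06, well under 2, so the equal-variability condition looks reasonable for these three classes.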

Notation:

\(H_0: \mu_1 =\mu_2 =\mu_3\)

\(H_a:\) At least one mean is different.

\(\alpha = 0.05\)

The test computation

R does the computation with aov(). We will not be doing this test by hand.

results <- aov( m1 ~ lecture, data = classdata )
summary(results)

             Df Sum Sq Mean Sq F value Pr(>F)  
lecture       2   1290   645.1   3.484  0.033 *
Residuals   161  29810   185.2                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
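To see where the numbers in this table come from, the sketch below rebuilds the F value and the right-tail p-value from the Df and Sum Sq columns. The inputs are copied from the rounded table, so the results match only up to rounding.

```r
# Values copied from the ANOVA table above
ss_group <- 1290;  df_group <- 2
ss_resid <- 29810; df_resid <- 161

# Mean Sq = Sum Sq / Df
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# F value = between-group mean square / residual mean square
f_stat <- ms_group / ms_resid

# p-value = area in the right tail of the F distribution
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

f_stat
p_value
```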

Conclusion:

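The p-value (0.033) is less than \(\alpha = 0.05\), so we reject the null. There is evidence that at least one class's exam average is different. ANOVA alone does not tell us which one.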

A flower’s sepal

Try `Sepal.Length` from iris.

Is it appropriate to do ANOVA on Sepal.Length?

# A tibble: 3 × 3
  Species     mean    sd
  <fct>      <dbl> <dbl>
1 setosa      5.01 0.352
2 versicolor  5.94 0.516
3 virginica   6.59 0.636
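A summary like the one above can be produced with the same group_by() and summarise() pattern used for the class data. This is a sketch of one way to do it, and it checks only the equal-variability condition; independence and normality still need their own look.

```r
library(dplyr)

# Mean and SD of Sepal.Length for each species
iris_stats <- iris |>
  group_by(Species) |>
  summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length))

iris_stats

# Rule of thumb: largest SD divided by smallest SD should be at most 2
sd_ratio <- max(iris_stats$sd) / min(iris_stats$sd)
sd_ratio
```

The ratio comes out near 1.8, which is under 2 but closer to the limit than in the exam example.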

Look at the graph

Is there one mean that’s different?

iris |>
  ggplot(aes(x = Sepal.Length, color = Species)) +
  geom_boxplot()

Do the test

Notation
Test
Conclusion

results <- aov( Sepal.Length ~ Species, data = iris )
summary(results)

             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
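The printed table shows the p-value only as <2e-16. If the actual number is needed, it can be pulled out of the summary object. This is a sketch that relies on summary() of an aov fit returning a list whose first element is the ANOVA table; the names `anova_table`, `f_value`, and `p_value` are illustrative.

```r
results <- aov(Sepal.Length ~ Species, data = iris)

# The first element of the summary is the ANOVA table as a data frame
anova_table <- summary(results)[[1]]

# First row is the Species row; second row is Residuals
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

f_value
p_value

# Compare with alpha = 0.05
p_value < 0.05
```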

Problems 5 or 9

Link to exercises