Libraries
library(openintro)
library(tidyverse)
Comparing multiple groups
We use Analysis of Variance (ANOVA) to test whether the means of three or more groups differ.
New distribution: F
The F distribution looks like:
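A curve like this can be drawn with a short sketch. The degrees of freedom (2 and 161) are borrowed from the midterm example later in this section; the grid range from 0 to 6 and the names `f_curve` and `f_plot` are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)

# Evaluate the F density on a grid. df1 = 2 and df2 = 161 match the
# midterm example (3 groups, 164 students); the grid range is arbitrary.
f_curve <- data.frame(x = seq(0, 6, by = 0.01))
f_curve$density <- df(f_curve$x, df1 = 2, df2 = 161)

f_plot <- ggplot(f_curve, aes(x = x, y = density)) +
  geom_line() +
  labs(x = "F", y = "density")
f_plot
```

The F distribution takes only non-negative values and is skewed to the right.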
p-value and CI
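The p-value is the area in the right tail of the F distribution, beyond the observed F statistic.
There is no confidence interval paired with this test, because the F test compares several means at once rather than estimating a single parameter.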
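As a sketch of the right-tail calculation, the p-value can be computed with `pf()`. The F statistic and degrees of freedom here are the ones reported for the midterm example later in this section; the variable names are illustrative.

```r
# F statistic and degrees of freedom as reported in the midterm ANOVA table.
f_stat    <- 3.484
df_groups <- 2     # number of groups - 1
df_resid  <- 161   # number of observations - number of groups

# Right-tail area: P(F >= f_stat) when the null hypothesis is true.
p_value <- pf(f_stat, df1 = df_groups, df2 = df_resid, lower.tail = FALSE)
p_value
```

Up to rounding, this should agree with the `Pr(>F)` value of 0.033 that `summary()` prints for that example.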
Conditions:
- the observations are independent within and between groups,
- the responses within each group are nearly normal, and
- the variability across the groups is about equal.
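One possible way to check the "nearly normal" condition is a normal Q-Q plot for each group. The sketch below uses `iris`, which ships with R and appears later in this section, rather than the class data; the name `qq_plot` is an illustrative assumption.

```r
library(ggplot2)

# One normal Q-Q panel per group; points that stay close to the line
# suggest the responses in that group are nearly normal.
qq_plot <- ggplot(iris, aes(sample = Sepal.Length)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ Species)
qq_plot
```

Independence cannot be checked from a plot; it depends on how the data were collected.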
Rule of Thumb
The “variability across the groups is about equal” condition has a rule of thumb: it is considered met when the ratio of the largest to the smallest group standard deviation is between 0.5 and 2.
\[
0.5\le \frac{\sigma_{max}}{\sigma_{min}}\le 2
\]
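The check is a one-line calculation. The sketch below applies it to the three group standard deviations reported for the midterm example later in this section; `sd_ratio` is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: ratio of the largest to the smallest group SD.
sd_ratio <- function(sds) max(sds) / min(sds)

# Group standard deviations from the midterm summary table (lectures a, b, c).
midterm_sds <- c(a = 13.9, b = 13.8, c = 13.1)

ratio <- sd_ratio(midterm_sds)
ratio
ratio <= 2   # TRUE means the rule of thumb is satisfied
```

Because the largest SD divided by the smallest is never below 1, only the upper bound of 2 can fail in practice.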
A test
Three classes took a midterm. We want to know whether any class's exam average differs from the others.
ggplot(data = classdata) +
  geom_boxplot(aes(x = m1, color = lecture))
Summary stats
Is the variance about the same?
group_by(.data = classdata, lecture) |>
summarise(mean = mean(m1), sd = sd(m1), count = n())
# A tibble: 3 × 4
  lecture  mean    sd count
  <fct>   <dbl> <dbl> <int>
1 a        75.1  13.9    58
2 b        72.0  13.8    55
3 c        78.9  13.1    51
Notation:
\(H_0: \mu_1 =\mu_2 =\mu_3\)
\(H_a:\) At least one mean is different.
\(\alpha = 0.05\)
The test computation
R does the computation with aov(). We will not be doing this test by hand.
results <- aov( m1 ~ lecture, data = classdata )
summary(results)
             Df Sum Sq Mean Sq F value Pr(>F)
lecture       2   1290   645.1   3.484  0.033 *
Residuals   161  29810   185.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
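To see where the F value comes from, the columns of the table can be recombined by hand. This is only a sketch using the rounded values printed above, so the result matches the table up to rounding; the variable names are illustrative.

```r
# Values read from the ANOVA table above (rounded as printed).
ss_groups <- 1290;  df_groups <- 2
ss_resid  <- 29810; df_resid  <- 161

ms_groups <- ss_groups / df_groups   # mean square between groups
ms_resid  <- ss_resid / df_resid     # mean square within groups (residual)

# The F statistic is the ratio of the two mean squares.
f_stat <- ms_groups / ms_resid
f_stat
```

A large F means the group means vary more than the spread within the groups would explain.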
Conclusion:
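Since the p-value (0.033) is less than \(\alpha = 0.05\), we reject the null hypothesis. The data provide evidence that at least one class's exam average differs from the others. ANOVA alone does not tell us which one.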
A flower’s sepal
Try Sepal.Length from iris.
Is it appropriate to do ANOVA on Sepal.Length?
# A tibble: 3 × 3
  Species     mean    sd
  <fct>      <dbl> <dbl>
1 setosa      5.01 0.352
2 versicolor  5.94 0.516
3 virginica   6.59 0.636
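The code that produced this table is not shown. The sketch below mirrors the summary code from the midterm example and should reproduce it; the name `sepal_stats` and the final ratio line are illustrative additions.

```r
library(dplyr)

# Mean and SD of Sepal.Length for each species (iris ships with R).
sepal_stats <- iris |>
  group_by(Species) |>
  summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length))
sepal_stats

# Rule of thumb: largest SD divided by smallest SD should not exceed 2.
max(sepal_stats$sd) / min(sepal_stats$sd)
```

With the SDs in the table, the ratio is about 0.636 / 0.352, or roughly 1.8, which is under 2 but not by much.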
Look at the graph
Is there one mean that’s different?
iris |>
  ggplot(aes(Sepal.Length, color = Species)) +
  geom_boxplot()
Do the test
results <- aov( Sepal.Length ~ Species, data = iris )
summary(results)
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show how to do HW problems 3 and 4