What is a test statistic?
It is the value computed from your data that you compare to a sampling (or comparison) distribution, primarily to find a p-value.
The test statistic for a single proportion is a z-score for the standard normal N(0,1):
\[ T=\frac{\widehat{p} - p_{null}}{SE} \]
Where \(p_{null}\) is the proportion under the null hypothesis.
The test statistic for a difference of proportions is also a z-score for the standard normal N(0,1):
\[ T = \frac{\widehat{p_1} - \widehat{p_2} - 0}{SE} \]
Where \(\widehat{p_1} - \widehat{p_2}\) is the point estimate and 0 is the difference under the null.
We look up the test statistic in the standard normal table of z-scores to find a p-value.
We don’t really need to know how to do this because we have pnorm().
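As a rough sketch of how pnorm() replaces the table lookup, the single-proportion test statistic can be computed and converted to a p-value directly. The numbers below (p_hat, p_null, n) are made up for illustration and are not from any data set in these notes; the standard error is assumed to be the usual one computed under the null hypothesis.

```r
# Hypothetical values, for illustration only
p_hat  <- 0.56   # sample proportion
p_null <- 0.50   # proportion under the null hypothesis
n      <- 200    # sample size

# Standard error computed under the null hypothesis (assumed formula)
se <- sqrt(p_null * (1 - p_null) / n)

# Test statistic: a z-score on the standard normal N(0,1)
z <- (p_hat - p_null) / se

# Two-sided p-value: area in both tails beyond |z|
p_value <- 2 * pnorm(-abs(z))

z
p_value
```

Using -abs(z) keeps the calculation correct whether the test statistic is positive or negative.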
When reading stats papers, the test statistic is still often reported alongside the p-value.
Fictional:
We are concerned about over-fishing depleting the penguins' food supply.
Last year the average weight of Adelie penguins on Torgersen Island was 3900 grams.
We went to Torgersen Island this year and weighed some penguins.
Our data is in penguins.
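library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data

# Keep only the penguins measured on Torgersen Island
Togersen_penguins <- penguins |>
  filter(island == "Torgersen")
# Mean body mass in grams, ignoring missing values
mean(Togersen_penguins$body_mass_g, na.rm = TRUE)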
[1] 3706.373
Is the difference from 3900 due to chance?
Let’s set up the test for the penguins
\[ H_0: \mu = 3900 \\ H_a: \mu \ne 3900 \]
No extreme outliers
Sample size larger than 30
or
The data distribution is normal.
No clear outliers.
[1] 52
We could use the sampling distribution… but it's much better to use the Student t.
The Student t distribution is similar to the normal distribution in shape.
It has wider tails.
It is used with confidence intervals and hypothesis testing of a single mean.
Unlike the normal distribution, it works well for small sample sizes if the population is normally distributed.
William Sealy Gosset
He worked at Guinness and was interested in the chemical properties of barley. source
The Student t takes just one parameter
df = degrees of freedom.
\(df = n-1\)
As \(df \rightarrow \infty\), \(t_{df} \rightarrow N(\mu=0, \sigma=1)\)
https://en.wikipedia.org/wiki/Student%27s_t-distribution#/media/File:Student_t_pdf.svg
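One way to see the t distribution approaching the normal is to compare critical values as the degrees of freedom grow. This is a small sketch using base R's qt() and qnorm(); the particular df values are arbitrary choices for illustration.

```r
# Arbitrary degrees of freedom, chosen for illustration
dfs <- c(2, 5, 30, 100, 1000)

# Upper 97.5% critical values: t for each df, and z for the standard normal
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

# Side-by-side comparison: t_crit shrinks toward z_crit as df increases
data.frame(df = dfs, t_crit = t_crit, z_crit = z_crit)
```

The t critical values are always larger than the normal one, which reflects the wider tails, and the gap closes as df increases.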
You can use either if:
The sample size is large or the underlying distribution is normal
The population standard deviation \(\sigma\) is known
But if the population sd \(\sigma\) is unknown or n is small, use the Student t.
In general the t distribution works, the normal less so.
\(T \sim \text{t}_{df}\)
The mean of the t-dist is 0.
The mean of your sample is \(\bar{x}\) and the standard error is \(SE = \frac{s}{\sqrt{n}}\)
We’ve skipped discussing test statistics until now, because of pnorm().
Ideally
\[ T = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\]
or more realistically:
\[ T = \frac{\bar{x} - \mu}{\frac{s_x}{\sqrt{n}}}\]
We are considering the weight of penguins.
Last year the weight was 3900, and this year it's 3706.
We checked conditions.
The standard deviation is:
So the test statistic is:
We need to calculate the probability of getting a test statistic at least as extreme as -3.11.
Multiply by 2 for a two tailed test.
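The penguin calculation can be sketched from the summary statistics alone. The mean (3706.373) and standard deviation (445.10) are the values quoted in these notes; n = 51 is assumed to be the number of penguins with a recorded body mass.

```r
# Summary statistics quoted in the notes
x_bar <- 3706.373   # sample mean (grams)
mu_0  <- 3900       # mean under the null hypothesis
s     <- 445.10     # sample standard deviation
n     <- 51         # assumed count of non-missing weights

# Standard error and test statistic
se     <- s / sqrt(n)
t_stat <- (x_bar - mu_0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
df_t    <- n - 1
p_value <- 2 * pt(-abs(t_stat), df = df_t)

round(t_stat, 2)
p_value
```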
Reject the null hypothesis. We have strong evidence to believe that the average weight this year is different from 3900 grams.
It's possible we made a type 1 error and rejected the null even though it was true.
A confidence interval is built the same way as with the normal distribution,
except you use a t* score instead of a z* score.
\[ \bar{x} \pm t^* SE\]
\(\bar{x} = 3706.373\)
\(SE = \frac{s}{\sqrt{n}}=\frac{445.10}{\sqrt{51}} = 62.33\)
\(t^*= 1.676\) (90% confidence, \(df = 50\))
We are 90% confident that the true mean weight of the penguins is between 3602 and 3811 grams.
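The interval can be checked with qt(), which returns the t* multiplier. This sketch again works from the summary statistics in the notes and assumes n = 51 non-missing weights, so df = 50.

```r
# Summary statistics quoted in the notes (n = 51 is an assumption)
x_bar <- 3706.373
s     <- 445.10
n     <- 51
se    <- s / sqrt(n)

# For a 90% interval, 5% sits in each tail, so t* is the 95th percentile
t_star <- qt(0.95, df = n - 1)

# Point estimate plus or minus the margin of error
ci <- x_bar + c(-1, 1) * t_star * se
round(ci)
```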
Awesome Auto data set. 𝑛 = 5, \(\bar{x}\)= 14600, and \(s_x\)= 7765.31. Suppose you hear that the Awesome Auto dealership typically sells cars for 18000. You decide to test this claim.
Write the hypotheses in symbols.
Check conditions, then calculate the test statistic, 𝑇 , and the associated degrees of freedom.
Find and interpret the p-value in this context.
What is the conclusion of the hypothesis test when using 𝛼 = 0.05?
\[ H_0: \mu = 18000 \\ H_a: \mu \ne 18000 \\ \alpha = 0.05 \]
Unfortunately we cannot at present test the normal condition. The cars are independent.
\(T = \frac{14600 - 18000}{\frac{7765.31}{\sqrt{5}}}\)
df = 4
p-value:
With a large p-value we fail to reject the null hypothesis: we find no evidence that the average price at Awesome Auto differs from 18000.
We could never verify that the underlying data was normal.
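For reference, the Awesome Auto test statistic and p-value can be computed from the sample statistics given in the exercise. This is a minimal sketch using pt(); the variable names are just for illustration.

```r
# Sample statistics given in the exercise
x_bar <- 14600
s     <- 7765.31
n     <- 5
mu_0  <- 18000   # claimed typical price under the null hypothesis

# Test statistic and degrees of freedom
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
df_t   <- n - 1

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_t)

t_stat
p_value
```

The p-value is well above 0.05, which is why the null hypothesis is not rejected.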
t.test()
R, of course, can make this test easy if you are working from a vector of values (x) (instead of sample statistics, like above). You will do this on occasion in your homework.
You can also take the time to have R calculate the mean and standard deviation, but t.test() is fine.
t.test(x= , alternative= , mu=)
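As an illustration of filling in that template, here is a sketch with a made-up vector x (not data from these notes). It also checks that t.test() gives the same test statistic as the formula worked by hand.

```r
# Made-up values, for illustration only
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5)

# Two-sided one-sample t-test against a null mean of 10
result <- t.test(x = x, alternative = "two.sided", mu = 10)

# Pieces of the result: test statistic, degrees of freedom, p-value
result$statistic
result$parameter
result$p.value

# Same test statistic computed by hand
t_manual <- (mean(x) - 10) / (sd(x) / sqrt(length(x)))
t_manual
```

The alternative argument accepts "two.sided", "less", or "greater".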
Suppose the awesome auto data was:
t.test()
The 2017 Toyota Prius Prime has a MPG_e = 54. Consider the prius_mpg data set in the openintro package.
Test the hypothesis that the actual MPG is greater than 54.
Consider the conditions, but do the test either way and we’ll discuss conditions afterwards.
Hint: use prius_mpg$_____ to take just one column from the data.
library(tidyverse)
library(openintro)
t.test(x=prius_mpg$average_mpg, alternative = "greater", mu = 54)
One Sample t-test
data: prius_mpg$average_mpg
t = 5.757, df = 18, p-value = 9.304e-06
alternative hypothesis: true mean is greater than 54
95 percent confidence interval:
117.4944 Inf
sample estimates:
mean of x
144.8632
# A tibble: 1 × 1
p_value
<dbl>
1 0