Linear regression with one predictor

Schwab

Read before:

Read Chapter 24 beforehand.

Libraries

library(openintro)
library(tidyverse)
library(broom)
library(statsr)

Linear Regression (again)

Recall Linear Regression

ggplot(data = starbucks) +
  geom_point(aes(x = fat, y = calories))

Add the model to the points.

ggplot(data = starbucks, aes(x = fat, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Model outputs: tidy

star_model <- lm(calories ~ fat, data = starbucks)
# tidy() gives the regression output as a tibble
tidy(star_model) 
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    184.      17.3       10.6 1.25e-16
2 fat             11.3      1.12      10.1 1.32e-15
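
To pull out just the slope row, one sketch using dplyr's filter() (already loaded by tidyverse):

tidy(star_model) |>
  filter(term == "fat")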

summary()

summary(star_model)

Call:
lm(formula = calories ~ fat, data = starbucks)

Residuals:
     Min       1Q   Median       3Q      Max 
-132.599  -44.130    3.469   54.868  126.134 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  183.734     17.277   10.63  < 2e-16 ***
fat           11.267      1.117   10.09 1.32e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 69.1 on 75 degrees of freedom
Multiple R-squared:  0.5756,    Adjusted R-squared:  0.5699 
F-statistic: 101.7 on 1 and 75 DF,  p-value: 1.32e-15

Are residuals normalish?

star_model_aug <- augment(star_model)


ggplot(data = star_model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

What were we doing?

  • Made a model

  • Tried to figure out whether the model is reasonable

    • by looking at the correlation coefficient \(r\).

What will we do today?

  • A hypothesis test to see if the slope is a number other than zero.

  • Recall: \(y = \beta_0 + \beta_1 x\)

  • \(\beta_1\) is the slope.

  • If the slope is zero, there is no linear relationship between \(y\) and \(x\).

Conditions: LINE

  • Linearity: the data has to be linear

  • Independence: the data has to be independent

    • watch out for time series.
  • Nearly normal residuals

    • look for random scatter around the zero line of the residual plot.
  • Equal (constant) variability

    • the points in the residual plot should not have a distinct or changing pattern (a simulated violation is sketched after this list).
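
Here is a minimal sketch of what a violation of equal variability looks like, using simulated data (the tibble fake and its columns are made up for illustration); the residuals fan out as the fitted values grow:

set.seed(1)
fake <- tibble(x = runif(100, 0, 10),
               y = 2 + 3 * x + rnorm(100, sd = x))  # noise grows with x

fake_aug <- augment(lm(y ~ x, data = fake))

ggplot(data = fake_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red")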

Some examples:

Independence problems

ggplot(data = arbuthnot, aes(year, boys)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Counts of births over consecutive years form a time series, so the observations are not independent.

Linearity problems

(example plot)

Normality issues

(example plot)

Normality issues again

(example plot)

Starbucks calories vs fat.

starbucks |>
  ggplot(aes(x = fat, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Residuals
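
The residual plot for the Starbucks model can be rebuilt from the star_model_aug data we made earlier:

ggplot(data = star_model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")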

What about this?

ggplot(data = arbuthnot, aes(x = girls, y = boys)) +
  geom_point() +
  labs(title = "Boys vs Girls",
       subtitle = "Born in London in the 1600s") +
  geom_smooth(method = "lm")

Residuals: girl vs boy births
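
A sketch to recreate this residual plot; the names bg_model and bg_aug are ours, with the model fit the same way as before:

bg_model <- lm(boys ~ girls, data = arbuthnot)
bg_aug <- augment(bg_model)

ggplot(data = bg_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")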

Let’s do a test

Recall

\[ y=\beta_0+\beta_1 x + e \]

  • \(\beta_0\) = intercept

  • \(\beta_1\) = slope

  • \(x\) is the predictor variable

  • \(y\) is the response variable

  • \(e\) is the error

Relationship between variables \(x\) and \(y\)?

If there is no linear relationship, the slope is 0.

If there is a linear relationship, the slope is not zero.

We do

\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \ne 0 \]

(We could also run tests on \(r\), the correlation coefficient, or on \(\beta_0\).)

The t distribution

The distribution of the slopes across an infinite number of samples would follow a Student's \(t\) distribution.

So we are doing a t test with this test statistic:

\(T = \frac{\hat{\beta}_1 - 0}{\text{SE}}\)

with \(df = n - 2\).

We’ll let R calculate \(\text{SE}\) and \(\hat{\beta}_1\).
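
As a check, here is a sketch computing \(T\) by hand for the Starbucks model from the tidy() output (est, se, and t_stat are just local names):

star_tidy <- tidy(star_model)
est <- star_tidy$estimate[2]   # slope estimate, beta-hat_1
se  <- star_tidy$std.error[2]  # its standard error

t_stat <- (est - 0) / se
t_stat  # should match the statistic column for fat, about 10.1

# two-sided p-value with df = n - 2
2 * pt(abs(t_stat), df = nrow(starbucks) - 2, lower.tail = FALSE)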

The test

\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \ne 0 \]

results <- lm(boys ~ girls, data = arbuthnot)
summary(results)

Call:
lm(formula = boys ~ girls, data = arbuthnot)

Residuals:
    Min      1Q  Median      3Q     Max 
-363.30 -110.25   -6.42   97.44  356.63 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 184.77244   59.76823   3.091  0.00274 ** 
girls         1.03391    0.01038  99.578  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 148.8 on 80 degrees of freedom
Multiple R-squared:  0.992, Adjusted R-squared:  0.9919 
F-statistic:  9916 on 1 and 80 DF,  p-value: < 2.2e-16

The p-value on the girls slope is far below 0.05, so we reject \(H_0\): the data give strong evidence of a nonzero slope, i.e., a linear relationship between girl and boy births.

Some Vocab

\(r\) - correlation coefficient

\(r^2\) - coefficient of determination

\(r^2\) is the proportion of the variability in the response variable that is explained by the explanatory variable.
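
A quick sketch (r is just a local name): for the Starbucks model, cor() gives \(r\), squaring it should reproduce the Multiple R-squared (0.5756) from summary() above, and broom's glance() reports the same value.

r <- cor(starbucks$fat, starbucks$calories)
r    # correlation coefficient
r^2  # coefficient of determination, about 0.576

glance(star_model)$r.squared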

Homework Time

We’ll do Problem 3 in class.