Intro to Linear Regression

IMS, Ch. 7.1

Ben Baumer edited by Nic Schwab

2023-09-22

Read Before Lecture

Chapter 7 Linear Regression with a single predictor

Bivariate Relationships

Two variables
Response variable
Explanatory variable

Response

Response variable (aka dependent variable): the variable that you are trying to understand/model

Explanatory

Explanatory variable (aka independent variable, aka predictor): the variable that you can measure that you think might be related to the response variable

Scatterplot for two numerical variables

response variable on \(y\)-axis and explanatory variable on \(x\)-axis
geom_point()
What are we looking for?
- Overall patterns and deviations from those patterns
- Form (e.g. linear, quadratic, etc.), direction (positive or negative), and strength (how much scatter?)
- Outliers

Examine the babies data

From the babies dataset, let’s make a scatter plot of bwt (response) vs. gestation (explanatory).

Characterize the distribution
- Form
- Direction
- Strength
- Outliers

Birthweight of babies

ggplot(data = babies, aes(x = gestation, y = bwt)) +
  geom_point()

Your turn

Use a scatter plot to analyze the relationship between the height and mass of characters in the starwars data frame.
Characterize the distribution
- Form
- Direction
- Strength
- Outliers

Solution: starwars characters height vs. mass.

There is one outlier: Jabba the Hutt.

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()

Correlation

(Pearson Product-Moment) correlation coefficient

measure of the strength and direction of the linear relationship between two numerical variables
denoted \(r\)
measured on the scale of \([-1, 1]\)
cor(data1,data2,use = "complete.obs")
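
For reference, the usual textbook definition of the Pearson correlation coefficient is sketched below. The symbols \(x_i\), \(y_i\), \(\bar{x}\), \(\bar{y}\), and \(n\) are notation assumed here, not taken from the slides: they stand for the paired observations, their sample means, and the number of complete pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit above or below their means together, and the denominator rescales the result so that it always falls in \([-1, 1]\).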

Find the correlation for babies

summarise(.data = babies, r = cor(gestation,bwt,use="complete.obs"))

# A tibble: 1 × 1
      r
  <dbl>
1 0.408

You try

Find the correlation coefficient for starwars height vs. mass. Hint: copy and edit the code from above.

Solution

summarise(.data = starwars, r = cor(height, mass ,use="complete.obs"))

# A tibble: 1 × 1
      r
  <dbl>
1 0.131
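
Why pass use = "complete.obs"? A minimal sketch with made-up toy vectors (not the starwars data) suggests the reason: by default, cor() returns NA as soon as either variable has a missing value, and starwars has missing heights and masses.

```r
# Toy data (made up for illustration): the third y value is missing.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# Default behaviour: any missing value makes the result NA.
r_default <- cor(x, y)

# "complete.obs" drops every pair with a missing value before computing r.
r_complete <- cor(x, y, use = "complete.obs")

print(r_default)
print(r_complete)
```

Here the four complete pairs lie exactly on the line y = 2x, so the second call returns a correlation of 1.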

Filtering out outliers

We might want to remove Jabba the Hutt from the data. You do not need to know how to filter the data yourself, but if I filter the data for you, you are expected to be able to graph it.

filter() out the one massive character and create a new df
find the new correlation coefficient

Graph the new starwars scatter plot

new_starwars <- filter(.data = starwars, mass < 500)
summarise(.data = new_starwars, r = cor(height, mass ,use="complete.obs"))

# A tibble: 1 × 1
      r
  <dbl>
1 0.751

Add the regression line to babies.

# Keep only babies with gestation longer than 259 days; earlier births are premature.
babies <- filter(.data = babies, gestation > 259)

ggplot(data=babies, aes(x = gestation, y = bwt)) +
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)

The equation of that regression.

# Save the linear regression object. 
baby_model <- lm(bwt~gestation,data=babies)

# Look at the summary output of that object. 
tidy(baby_model)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.14    11.4        0.277 7.82e- 1
2 gestation      0.419    0.0402    10.4   2.46e-24
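
Reading the estimate column of the output above, the fitted line can be written out as follows. The hat notation and the 280-day example are illustrative assumptions, and the coefficients are the rounded values shown in the table, so the prediction is approximate.

```latex
\widehat{\text{bwt}} = 3.14 + 0.419 \times \text{gestation}

% Worked example for a hypothetical gestation of 280 days:
\widehat{\text{bwt}} = 3.14 + 0.419 \times 280 = 3.14 + 117.32 = 120.46
```

The slope says that, in this filtered data, each additional day of gestation is associated with a birthweight about 0.419 units higher on average.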

Add the regression line to new_starwars

Do you think it is fair to remove the outlier from starwars?
Can we remove outliers from babies?
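
One possible sketch of the new_starwars plot with its regression line, reusing the same calls as the babies example. It assumes the dplyr and ggplot2 packages are installed; the names p and starwars_model are introduced here for illustration only.

```r
library(dplyr)    # provides the starwars data frame and filter()
library(ggplot2)

# Same filter as earlier: keep characters with mass under 500.
# Rows with a missing mass are dropped too, because NA < 500 is not TRUE.
new_starwars <- filter(.data = starwars, mass < 500)

# Scatter plot with the least-squares line overlaid.
p <- ggplot(data = new_starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
print(p)

# The intercept and slope of that line.
starwars_model <- lm(mass ~ height, data = new_starwars)
coef(starwars_model)
```

Comparing this plot with the unfiltered one is a quick way to see how much a single extreme point can pull on both the correlation and the fitted line.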

Anscombe

# A tibble: 4 × 5
  set       N `mean(x)` `mean(y)` `cor(x, y)`
  <chr> <int>     <dbl>     <dbl>       <dbl>
1 1        11         9      7.50       0.816
2 2        11         9      7.50       0.816
3 3        11         9      7.5        0.816
4 4        11         9      7.50       0.817

same mean
same standard deviation
same correlation coefficient (to within 0.001)!
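
The slides do not show how the summary table or the ds data frame was built. A minimal sketch, assuming ds is simply a long-format version of R's built-in anscombe data set, could look like this (the column names mean_x, mean_y, and r are chosen here for illustration):

```r
# The built-in `anscombe` data frame is wide: columns x1..x4 and y1..y4.
# Stack it into long form with one row per (set, x, y) observation.
ds <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = as.character(i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One summary row per set: sample size, means, and correlation.
anscombe_summary <- do.call(rbind, lapply(split(ds, ds$set), function(d) {
  data.frame(
    set = d$set[1],
    N = nrow(d),
    mean_x = mean(d$x),
    mean_y = mean(d$y),
    r = cor(d$x, d$y)
  )
}))

print(anscombe_summary)
```

The summaries for the four sets are nearly identical, which is why the data need to be plotted before the correlation is trusted.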

Anscombe plots

ggplot(data = ds, aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~set)