Intro to Linear Regression

IMS, Ch. 7.1

Ben Baumer, edited by Nic Schwab

2023-09-22

Read Before Lecture

Chapter 7 Linear Regression with a single predictor

Bivariate Relationships

  • Two variables

  • Response variable

  • Explanatory variable

Response

Response variable (aka dependent variable): the variable that you are trying to understand/model

Explanatory

Explanatory variable (aka independent variable, aka predictor): the variable that you can measure that you think might be related to the response variable

Scatterplot for two numerical variables

  • response variable on \(y\)-axis and explanatory variable on \(x\)-axis
  • geom_point()
  • What are we looking for?
    • Overall patterns and deviations from those patterns
    • Form (e.g. linear, quadratic, etc.), direction (positive or negative), and strength (how much scatter?)
    • Outliers

Examine the babies data

From the babies dataset, let’s make a scatter plot of gestation (explanatory) vs bwt (response).

  • Characterize the relationship
    • Form
    • Direction
    • Strength
    • Outliers

Birthweight of babies

ggplot(data = babies, aes(x = gestation, y = bwt)) +
  geom_point()

Your turn

  • Use a scatter plot to analyze the relationship between the height and mass of characters in the starwars data frame

  • Characterize the relationship

    • Form
    • Direction
    • Strength
    • Outliers

Solution: starwars characters height vs. mass.

There is one outlier: Jabba the Hutt.

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()

Correlation

(Pearson Product-Moment) correlation coefficient

  • measure of the strength and direction of the linear relationship between two numerical variables
  • denoted \(r\)
  • measured on the scale of \([-1, 1]\)
  • cor(x, y, use = "complete.obs"), where use = "complete.obs" drops observations with a missing value in either variable
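
To make the definition above concrete, here is a minimal sketch that computes \(r\) by hand as the sum of products of standardized values divided by \(n - 1\), then compares it with cor(). The vectors x and y are made-up toy data, not from the lecture, and the names z_x, z_y, r_by_hand, and r_builtin are illustrative.

```r
# Hypothetical toy data (not from the lecture)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# r is the sum of the products of the standardized values, divided by n - 1
r_by_hand <- sum(z_x * z_y) / (n - 1)

# The built-in function should agree up to floating-point error
r_builtin <- cor(x, y)
all.equal(r_by_hand, r_builtin)
```

Because both variables are standardized first, \(r\) has no units and does not change if either variable is rescaled (for example, from ounces to grams).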

Find the correlation for babies

summarise(.data = babies, r = cor(gestation, bwt, use = "complete.obs"))
# A tibble: 1 × 1
      r
  <dbl>
1 0.408

You try

Find the correlation coefficient for height vs. mass in the starwars data. Hint: copy and edit the code from above.

Solution

summarise(.data = starwars, r = cor(height, mass, use = "complete.obs"))
# A tibble: 1 × 1
      r
  <dbl>
1 0.131

Filtering out outliers

We might want to remove Jabba the Hutt from the data. You do not need to know how to filter data yourself, but if I give you filtered data, you are expected to be able to graph it.

  • filter() out the one massive character and create a new df
  • find the new correlation coefficient

Graph the new starwars scatter plot

new_starwars <- filter(.data = starwars, mass < 500)
summarise(.data = new_starwars, r = cor(height, mass, use = "complete.obs"))
# A tibble: 1 × 1
      r
  <dbl>
1 0.751

Add the regression line to babies.

# Keep only babies with gestation longer than 259 days; earlier births are premature
babies <- filter(.data = babies, gestation > 259)

ggplot(data = babies, aes(x = gestation, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

The equation of that regression.

# Save the linear regression object.
baby_model <- lm(bwt ~ gestation, data = babies)

# Look at the coefficient table for that object (tidy() is from the broom package).
tidy(baby_model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.14    11.4        0.277 7.82e- 1
2 gestation      0.419    0.0402    10.4   2.46e-24
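
Reading the estimate column of the table gives the fitted line. As a worked example, the prediction below uses a gestation of 280 days, a value chosen here only for illustration; the result is in whatever units bwt is recorded in.

\[
\widehat{\text{bwt}} = 3.14 + 0.419 \times \text{gestation}
\]

\[
\widehat{\text{bwt}} = 3.14 + 0.419 \times 280 = 3.14 + 117.32 = 120.46
\]

The slope says that each additional day of gestation is associated with a predicted birthweight about 0.419 units higher. The intercept corresponds to a gestation of 0 days, which is far outside the data, so it has no practical interpretation here.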

Add the regression line to new_starwars

  • Do you think it is fair to remove the outlier from starwars?

  • Can we remove outliers from babies?
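
One possible way to draw the plot named in the slide title is to reuse the pattern from the babies plot. This is a sketch, not necessarily the code used in lecture; the object name p is an illustrative choice, and the filter is the same one shown earlier.

```r
library(dplyr)    # filter() and the starwars data frame
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()

# Same filter as earlier in the lecture: drop the one very massive character.
# Rows with a missing mass are dropped too, because NA < 500 is not TRUE.
new_starwars <- filter(.data = starwars, mass < 500)

# Scatter plot with the least-squares regression line overlaid
p <- ggplot(data = new_starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p
```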

Anscombe

# A tibble: 4 × 5
  set       N `mean(x)` `mean(y)` `cor(x, y)`
  <chr> <int>     <dbl>     <dbl>       <dbl>
1 1        11         9      7.50       0.816
2 2        11         9      7.50       0.816
3 3        11         9      7.5        0.816
4 4        11         9      7.50       0.817

  • same mean
  • same standard deviation
  • nearly the same correlation coefficient (it agrees to about three digits)!
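
The code that produced the table above is not shown on the slide. The sketch below is one way to rebuild it from R's built-in anscombe data frame, assuming the tidyr and dplyr packages; the name anscombe_summary is illustrative, and ds is chosen to match the name used in the plotting code for the Anscombe plots.

```r
library(dplyr)  # group_by(), summarise(), n()
library(tidyr)  # pivot_longer()

# anscombe has columns x1..x4 and y1..y4.
# Reshape to one row per point with columns set, x, and y.
ds <- pivot_longer(
  data = anscombe,
  cols = everything(),
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"
)

# Per-set sample size, means, and correlation
anscombe_summary <- summarise(
  .data = group_by(ds, set),
  N = n(),
  mean(x),
  mean(y),
  cor(x, y)
)

anscombe_summary
```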

Anscombe plots

ggplot(data = ds, aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~set)

Datasaurus

More examples

Beware

  • Note that correlation only measures the strength of a linear relationship

  • Always graph your data!
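
A small sketch of why the warning above matters: in the made-up data below (not from the lecture), y is completely determined by x, yet the correlation is zero because the relationship is quadratic and symmetric rather than linear.

```r
# Hypothetical toy data: a perfect quadratic relationship
x <- -5:5
y <- x^2

# The linear correlation is zero even though y depends entirely on x
r <- cor(x, y)
r
```

A scatter plot of these points would show the U shape immediately, which is something the single number \(r\) cannot reveal.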

Spurious Correlations

Reading for next lecture

No additional reading