IMS, Ch. 7.1
2023-09-22
Two variables
Response variable
Explanatory variable
Response variable (aka dependent variable): the variable that you are trying to understand/model
Explanatory variable (aka independent variable, aka predictor): the variable that you can measure that you think might be related to the response variable
geom_point()
From the babies dataset, let’s make a scatter plot of gestation (explanatory) vs bwt (response).
help(babies)
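A minimal sketch of that scatter plot. It assumes `babies` is the data frame shipped with the openintro package (the slide does not name the package) and that ggplot2 is installed. The object name `p_babies` is made up for illustration.

```r
library(ggplot2)
library(openintro)  # assumed source of the babies data frame

# Explanatory variable on the x-axis, response variable on the y-axis.
p_babies <- ggplot(data = babies, mapping = aes(x = gestation, y = bwt)) +
  geom_point()

# Printing the object draws the plot.
p_babies
```

If some rows have missing values, ggplot2 normally drops them and prints a warning; that is expected.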
Use a scatter plot to analyze the relationship between the height and mass of characters in the starwars data frame.
Characterize the distribution
There is one outlier: Jabba the Hutt.
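One way this exercise might be worked, sketched with ggplot2 and the `starwars` data frame from dplyr. The names `p_sw` and `heaviest` are hypothetical helpers, not part of the slides.

```r
library(ggplot2)
library(dplyr)  # provides the starwars data frame

# height is treated as explanatory (x), mass as response (y).
p_sw <- ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
p_sw

# Look up which character is the outlier: the row with the largest mass.
heaviest <- slice_max(starwars, mass, n = 1)
select(heaviest, name, height, mass)
```

The single point far above the rest of the cloud is the row returned by `slice_max()`.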
cor(data1,data2,use = "complete.obs")
Find the correlation coefficient for starwars height vs mass. Hint: copy and edit the code from above.
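One possible solution, substituting the two columns into the `cor()` template. This is a sketch that pulls the columns out with `$`; the name `r_all` is made up for illustration.

```r
library(dplyr)  # provides the starwars data frame

# use = "complete.obs" drops characters with a missing height or mass.
r_all <- cor(starwars$height, starwars$mass, use = "complete.obs")
r_all
```

Because the outlier is still in the data here, the result is expected to be much weaker than the correlation among the remaining characters.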
We might want to remove Jabba the Hutt from the data. This is not something you need to know how to do, but if I filter the data for you, you are expected to be able to graph it.
filter() out the one massive character and create a new data frame:
new_starwars <- filter(.data = starwars, mass < 500)
summarise(.data = new_starwars, r = cor(height, mass, use = "complete.obs"))
# A tibble: 1 × 1
      r
  <dbl>
1 0.751
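Graphing the filtered data works the same way as before; only the `data` argument changes. A minimal sketch, assuming ggplot2 and dplyr are installed (`p_new` is a hypothetical name):

```r
library(ggplot2)
library(dplyr)

# Keep only characters lighter than 500. filter() also drops rows where
# mass is missing, because the comparison evaluates to NA for them.
new_starwars <- filter(.data = starwars, mass < 500)

# Same scatter plot as for the full data, now without the outlier.
p_new <- ggplot(data = new_starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
p_new
```

With the outlier gone, the y-axis no longer has to stretch to accommodate it, so the upward trend is much easier to see.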
# Load broom for tidy(), then save the linear regression object.
library(broom)
baby_model <- lm(bwt ~ gestation, data = babies)
# Look at the summary output of that object.
tidy(baby_model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.14    11.4        0.277 7.82e- 1
2 gestation      0.419    0.0402    10.4   2.46e-24
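The two estimates define the fitted line: predicted bwt = 3.14 + 0.419 × gestation. A small sketch of how they could be used for a prediction; the gestation value of 280 is chosen here purely for illustration and does not come from the slides.

```r
# Estimates copied (rounded) from the tidy() output.
intercept <- 3.14
slope <- 0.419

# Hypothetical gestation value, used only as an example input.
gestation_example <- 280

# Plug the value into the fitted line.
predicted_bwt <- intercept + slope * gestation_example
predicted_bwt  # 120.46
```

The slope reads as: one more unit of gestation is associated with about 0.419 more units of bwt, on average. With the model object available, `predict(baby_model, newdata = data.frame(gestation = 280))` should give the same prediction from the unrounded coefficients.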
Do you think it is fair to remove the outlier from starwars?
Can we remove outliers from babies?
# A tibble: 4 × 5
  set       N `mean(x)` `mean(y)` `cor(x, y)`
  <chr> <int>     <dbl>     <dbl>       <dbl>
1 1        11         9      7.50       0.816
2 2        11         9      7.50       0.816
3 3        11         9      7.5        0.816
4 4        11         9      7.50       0.817
Note that correlation only measures the strength of a linear relationship.
Always graph your data!
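The four-set summary table appears to describe Anscombe's quartet (the counts, means, and correlations match), which ships with base R as `anscombe`. Assuming that is the data behind the table, this sketch recomputes the correlations and draws one panel per set. It uses base R graphics only to stay dependency-free; `r_by_set` is a made-up name.

```r
# anscombe has columns x1..x4 and y1..y4, with 11 rows.
sets <- 1:4

# Correlation between x and y within each of the four sets.
r_by_set <- sapply(sets, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r_by_set, 3)

# Nearly identical correlations, very different shapes: one panel per set.
old_par <- par(mfrow = c(2, 2))
for (i in sets) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(old_par)
```

Only one of the four panels looks like a straight-line cloud, even though the four values of r agree to about three decimal places.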
No additional reading