Can we predict the stopping distance of a car based on its speed? This is a question we can try to answer using simple linear regression.
First, let’s plot the stopping distance by speed and visually check if there seems to be a relationship between the two variables.
plot(cars$speed, cars$dist)
Now we can make a linear model using the variables.
carmodel <- lm(cars$dist ~ cars$speed)
summary(carmodel)
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Finally, we can plot the linear model on top of the original plot and examine the fit visually.
plot(cars$speed, cars$dist)
abline(carmodel, lty = 1, col = "cyan", lwd = 3)
Drennan includes data for the number of hoes in sites of different areas. The data have been drawn from the R Companion for Drennan.
First let’s read in and plot the data.
RSHoes <- read.csv("https://maddiebrown.github.io/ANTH630/data/drennan_datasets/RSHoes.csv")
plot(Hoes ~ Area, RSHoes, xlab = "Site Area (ha)", ylab = "Number of Hoes", las = 1,
pch = 4)
Based on visual inspection, it looks like as the site area increases, the number of hoes decreases. Next we can tun lm()
to create a linear model of the data.
RSHoes.lm <- lm(Hoes ~ Area, RSHoes)
summary(RSHoes.lm)
Call:
lm(formula = Hoes ~ Area, data = RSHoes)
Residuals:
Min 1Q Median 3Q Max
-12.2994 -2.6945 -0.4595 3.8827 10.0001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.802 7.231 6.611 2.5e-05 ***
Area -1.958 0.527 -3.716 0.00295 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.877 on 12 degrees of freedom
Multiple R-squared: 0.5351, Adjusted R-squared: 0.4963
F-statistic: 13.81 on 1 and 12 DF, p-value: 0.002947
Using predict()
we can see what the model predicts the number of hoes will be when the area is 15.2ha.
predict(RSHoes.lm, data.frame(Area = 15.2))
1
18.03207
Then we can check what the actual number of hoes is when the area is 15.2ha and compare this to the linear model. It appears the model has slightly overpredicted how many hoes there should be. If you are interested in recreating the equations to find the best fit lines as described in Drennan, you can find example code in the R Companion by Carlson.
RSHoes$Hoes[RSHoes$Area == 15.2]
[1] 15
Using code from the R Companion by Carlson, we can plot the residuals between the model and the observed data.
plot(Hoes ~ Area, RSHoes, xlab = "Site Area (ha)", ylab = "Number of Hoes", las = 1,
pch = 4)
abline(RSHoes.lm)
Predict <- fitted(RSHoes.lm)
segments(RSHoes$Area, Predict, RSHoes$Area, RSHoes$Hoes, lty = 3)
Next, we can calculate the \(r^2\) value, which is the coefficient of determination and described what proportion of the variation in the observed the number of hoes can be explained by the site area. In this case, \(53.51\%\).
r2 <- 1 - sum(resid(RSHoes.lm)^2)/sum((RSHoes$Hoes - mean(RSHoes$Hoes))^2)
round(r2, 4)
[1] 0.5351
We can also examine the strength of the correlation between the variables. Pearson’s R can be calculated with the cor()
function. This confirms the strong negative relationship between three variables.
cor(RSHoes$Area, RSHoes$Hoes)
[1] -0.7314935
Finally, we can plot the regression line, original data and 95% confidence intervals with the following code (see: R Companion):
plot(Hoes ~ Area, RSHoes, xlab = "Site Area (ha)", ylab = "Number of Hoes", las = 1,
pch = 4)
xp <- seq(7, 19, 0.1)
yp <- predict(RSHoes.lm, data.frame(Area = xp), int = "c")
matlines(xp, yp, lty = c(1, 2, 2), col = "black")
Based on Drennan’s analysis of the residuals, he found that soil fertility also impacted the number of hoes, thereby explaining much of the residuals. We can directly run a regression model on all three variables and examine how well the linear model now predicts the relationship between variables.
RSHoes.lm2 <- lm(Hoes ~ Area + Soil, RSHoes)
summary(RSHoes.lm2)
Call:
lm(formula = Hoes ~ Area + Soil, data = RSHoes)
Residuals:
Min 1Q Median 3Q Max
-3.3016 -0.7414 0.3711 1.0160 1.9560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.8448796 2.5930686 11.124 2.53e-07 ***
Area -1.4370562 0.1549888 -9.272 1.56e-06 ***
Soil 0.0105754 0.0008939 11.830 1.35e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.657 on 11 degrees of freedom
Multiple R-squared: 0.9661, Adjusted R-squared: 0.96
F-statistic: 156.8 on 2 and 11 DF, p-value: 8.214e-09
In this case, we see that the \(R^2\) is 0.9661, meaning that the model of area and soil as predictors of the number of hoes explains \(96.61\%\) of the variation in number of hoes. So while there may be some additional variables, these two explain the nearly all of the variation.
The F-statistic and its \(p\)-value tell us that there is a very small chance of obtaining a F-ratio of that scale by chance, so the model is significantly predicts the number of hoes.