regression models project
Impact of transmission type on fuel efficiency in mtcars
20 June 2015
Executive Summary
We study the mtcars data set that ships with R to determine the relationship between miles per gallon (mpg) and transmission type am (manual/automatic). We evaluate several candidate models to explore the relationship, finally settling on mpg ~ wt * factor(am) based on our selection strategy. Using this model, we find that manual transmission offers better mpg for cars lighter than about 2,808 lbs. Averaged across all car weights, a t-test gives a 95% confidence interval of 3.2 to 11.3 miles per gallon for the mpg advantage of manual over automatic transmission (under sample constraints). We also look at the residual variation in the chosen linear model.
Exploratory Data Analysis and choosing the regression model
We load the data and explore the correlation between the variables in the data set. ?mtcars provides the required variable descriptions.
library('ggplot2'); library('xtable')
data(mtcars)
options(scipen = 999)
cr <- as.data.frame(cor(mtcars))
tab <- xtable(cr[1:4,], caption = "Correlation table for mtcars (top 4 rows)")
print.xtable(tab, floating = TRUE, comment = FALSE)
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
Table 1: Correlation table for mtcars (top 4 rows)
The correlation table excerpt above shows which variables are most strongly correlated with miles per gallon (mpg). Because we wish to study the effect of transmission am, it is also included in the exploration. We choose to look at wt, hp and cyl, as they are highly correlated with mpg.
In Figures 1, 2 and 3 we can see the relationships between mpg and weight in 1000 lbs, wt (Figure 1), horsepower, hp (Figure 2), and number of cylinders in the engine, cyl (Figure 3). The linear fits with 95% confidence intervals also help guide the candidate models we may wish to choose for our linear regression.
At a cursory glance we might also consider displacement in cu. in., disp, as a variable highly related to the outcome mpg, but it is strongly correlated with cyl (0.90), so we do not consider it separately.
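The ranking implied by the correlation table can be reproduced with a few lines of base R. This is a minimal sketch rather than part of the original analysis; the names cors and ranked are illustrative.

```r
# Rank the other mtcars variables by absolute correlation with mpg.
data(mtcars)

cors <- cor(mtcars)["mpg", ]           # correlations of every column with mpg
cors <- cors[names(cors) != "mpg"]     # drop the trivial mpg-mpg entry

# Sort by strength of association, ignoring the sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

The first few names in ranked are the candidates considered for the models; am appears further down the list, which is why it is included deliberately rather than because of its rank.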
Choosing the regression models
In order to choose the appropriate model, we follow these steps:
- Fit a model with the variable having the highest correlation with mpg (and also include am).
- Create subsequent models, each including one more variable that has relatively high correlation with the outcome mpg, then perform anova as a nested likelihood ratio test and compare consecutive p-values.
Thus we choose 4 models, f1, f2, f3 and f4, as shown below, with the anova results in Table 2.
f1 <- lm(mpg ~ wt + factor(am), data = mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)
f3 <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
f4 <- lm(mpg ~ wt * factor(am) + factor(cyl) + hp, data = mtcars)
anv <- as.data.frame(anova(f1, f2, f3, f4))
tab1 <- xtable(anv, caption = "Anova for choosing regression model from f1, f2, f3, f4")
digits(tab1) <- 5
print.xtable(tab1, floating = TRUE, comment = FALSE)
    Res.Df       RSS       Df  Sum of Sq         F   Pr(>F)
1 29.00000 278.31970
2 28.00000 188.00767  1.00000   90.31203  17.30489  0.00033
3 26.00000 137.99173  2.00000   50.01593   4.79183  0.01731
4 25.00000 130.47184  1.00000    7.51990   1.44090  0.24124
Table 2: Anova for choosing regression model from f1, f2, f3, f4
Examining the p-values in Table 2, we can see that there is a benefit in considering an interaction between wt and am when estimating mpg, as in model f2. The comparison between f1 and f2 yields a p-value of 0.0003283, which is less than a typical Type I error rate α = 0.05. Adding factor(cyl) in f3 gives a p-value of 0.01731, which is also below 0.05 but is much weaker evidence, and adding hp in f4 (p-value 0.24124) shows no benefit. We therefore choose the simpler model f2 for further study of the effect of transmission am on mpg.
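The consecutive p-values in Table 2 can also be pulled out of the anova object directly and compared against α. The following is a minimal sketch in base R; the names pvals and alpha are illustrative and not part of the original analysis.

```r
# Refit the four nested models and extract the consecutive p-values.
data(mtcars)
f1 <- lm(mpg ~ wt + factor(am), data = mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)
f3 <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
f4 <- lm(mpg ~ wt * factor(am) + factor(cyl) + hp, data = mtcars)

# The first row has no comparison, so its p-value is NA and is dropped
pvals <- anova(f1, f2, f3, f4)[["Pr(>F)"]][-1]
names(pvals) <- c("f1 vs f2", "f2 vs f3", "f3 vs f4")

alpha <- 0.05  # Type I error rate used in the text
print(data.frame(p.value = signif(pvals, 4), below.alpha = pvals < alpha))
```

Printing the comparisons side by side makes it easier to see which added terms clear the chosen threshold and by how much.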
Inferring from the chosen model (model f2)
Model Estimation
To answer our questions of interest, we plot the relationship between mpg and wt, with color representing am, in Figure 4. The grey sloped line shows the direct relationship between mpg and wt without considering am, while the two horizontal lines represent the mean mpg for the two transmission types. On average, manual transmission has a higher mpg (24.39) than automatic transmission (17.15). However, there is considerable overlap between the points, so a clear relationship cannot be inferred visually.
tt <- t.test(x = mtcars[mtcars$am == 0,1], y = mtcars[mtcars$am == 1,1])
hval <- hatvalues(f2)
topcar <- names(hval[order(hval, decreasing = TRUE)])
mpgChangeAuto <- f2$coeff[2]
mpgChangeMan <- f2$coeff[2] + f2$coeff[4]
This can be analysed further with a two-sided t-test, which gives a p-value of 0.0014 and a 95% confidence interval of (-11.28, -3.21) for the difference in mean mpg (automatic minus manual). This indicates a measurable difference in mpg between the two transmission types. Continuing with Figure 4, the two regression lines for automatic and manual transmission show that mpg decreases more rapidly with increasing weight for a car with manual transmission than for one with automatic. Based on the model coefficients, there is a -3.79 change in mpg per 1000 lbs increase in weight for automatic transmission and a -9.08 change in mpg per 1000 lbs increase in weight for manual transmission. Also, as weight increases, cars tend to have automatic rather than manual transmission, which means group status (manual or automatic) partially matters. (Note: this assumes that the sample of cars was not chosen in such a manner that heavier cars had automatic transmission.) The mpg benefit of manual transmission is nullified beyond wt = 2.81, where the two regression lines intersect.
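The two slopes and the intersection point follow directly from the coefficients of f2: the lines cross where the manual offset and the interaction term cancel, that is at wt = -b3 / b4. A minimal sketch, with b, slope_auto, slope_manual and wt_cross as illustrative names:

```r
# Slopes per transmission type and the weight at which the two lines cross.
data(mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)

# Coefficient order: (Intercept), wt, factor(am)1, wt:factor(am)1
b <- unname(coef(f2))

slope_auto   <- b[2]          # change in mpg per 1000 lbs, automatic
slope_manual <- b[2] + b[4]   # change in mpg per 1000 lbs, manual

# Manual line minus automatic line is b[3] + b[4] * wt; it is zero at:
wt_cross <- -b[3] / b[4]

print(round(c(auto = slope_auto, manual = slope_manual, crossover = wt_cross), 2))
```

Multiplying wt_cross by 1000 gives the crossover weight in lbs.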
Model Characteristics
In Figure 5 we examine the model fit for model f2. The residuals vs. fitted plot (plot 1) shows no heteroskedasticity but notable residual variation in the middle of the data set. We see outliers in Fiat 128, Merc 240D and Toyota Corolla, which have large residuals, but as per plot 4 they have low leverage. In the Normal Q-Q plot, the residuals closely follow the normal distribution, but at the higher positive quantiles we see some skewness in the error distribution. Exploring the cars with the highest leverage, we find Maserati Bora with a hat value of 0.37.
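The leverage figures can be inspected numerically as well as in the plot. The sketch below uses hatvalues() from base R; the 2p/n cutoff is a common rule of thumb that we assume here for illustration and is not taken from the original analysis, and the names p, n and cutoff are illustrative.

```r
# List the highest-leverage cars for model f2.
data(mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)

hval <- hatvalues(f2)
p <- length(coef(f2))   # number of estimated coefficients
n <- nrow(mtcars)       # number of cars

# Hat values always sum to p, so the average leverage is p / n
cutoff <- 2 * p / n     # assumed rule-of-thumb threshold

top <- sort(hval, decreasing = TRUE)
print(round(head(top, 5), 2))
print(names(top)[top > cutoff])
```

Cars above the cutoff are worth a second look, but high leverage alone does not make a point influential; the residual matters too, as the Residuals vs Leverage panel shows.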
Conclusion
Answering the questions: based on model f2, we can say that
(1) A manual transmission is better for mpg when the weight of the car is less than 2808.12 lbs. Beyond that, automatic transmission offers better mpg.
(2) On an overall basis across all car weights, manual transmission offers between approx. 3.2 and 11.3 more miles per gallon than automatic (as per our t.test inference).
(3) Our conclusion is based upon the model f2 we chose, and the residual variance may impact the final result. Our model choice was also influenced by our need to observe the impact of am on mpg, whose correlation with mpg is actually lower than that of wt, hp, disp and cyl.
(4) To infer the difference in mpg we have used a t.test, and the assumption is that there are no confounders that impact the obtained result.
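As a rough check on point (1), one could compare the predictions of f2 for both transmission types at a weight below and a weight above the crossover. The weights 2.0 and 3.5 (in 1000 lbs) are arbitrary illustrative choices, and the names grid and pred are assumptions of this sketch.

```r
# Predicted mpg for both transmission types at two illustrative weights.
data(mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)

# All combinations of the two weights and the two transmission codes
grid <- expand.grid(wt = c(2.0, 3.5), am = c(0, 1))
grid$pred  <- predict(f2, newdata = grid)
grid$trans <- c("Automatic", "Manual")[grid$am + 1]

print(grid[order(grid$wt), c("wt", "trans", "pred")])
```

These are point predictions only; adding interval = "confidence" to predict() would show how much uncertainty surrounds each one.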
Appendix
tr <- c("Automatic","Manual")
mtcars$trans <- tr[mtcars$am + 1]
qplot(x = wt, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 1: Miles per Gallon mpg vs. Car wt (in 1000lbs)")
Figure 1: Miles per Gallon mpg vs. Car wt (in 1000lbs) [plot: mpg against wt, colored by trans (Automatic, Manual)]
qplot(x = hp, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 2: Miles per Gallon mpg vs. Horse Power hp")
Figure 2: Miles per Gallon mpg vs. Horse Power hp [plot: mpg against hp, colored by trans (Automatic, Manual)]
qplot(x = cyl, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 3: Miles per Gallon mpg vs. No. of cylinders cyl")
Figure 3: Miles per Gallon mpg vs. No. of cylinders cyl [plot: mpg against cyl, colored by trans (Automatic, Manual)]
f0 <- lm(mpg ~ wt, data = mtcars)
g <- ggplot(data = mtcars, aes(wt,mpg))
g <- g + geom_point(aes(color = trans))
g <- g + geom_hline(aes(yintercept = mean(mtcars[mtcars$am==0,1])),color ="dark grey")
g <- g + geom_hline(aes(yintercept = mean(mtcars[mtcars$am==1,1])),color ="dark grey")
g <- g + geom_abline(intercept = f0$coeff[1], slope = f0$coeff[2], color = "grey47")
g <- g + geom_abline(intercept = f2$coeff[1], slope = f2$coeff[2], color = "salmon")
g <- g + geom_abline(intercept = f2$coeff[1] + f2$coeff[3], slope = f2$coeff[2] + f2$coeff[4], color = "darkturquoise")
g <- g + labs(title = "Figure 4: Regression Model Effects for linear model f2")
g
Figure 4: Regression Model Effects for linear model f2 [plot: mpg against wt, colored by trans (Automatic, Manual)]
par(mfrow = c(2,2), oma = c(2,2,4,2))
plot(f2, sub.caption = "Figure 5: Linear model f2 characteristics")
Figure 5: Linear model f2 characteristics [panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage; labeled points: Fiat 128, Merc 240D, Toyota Corolla, Chrysler Imperial]