Regression Models Project


Upload: akshay-rao

Post on 09-Dec-2015



Impact of transmission type on fuel efficiency in mtcars
20 June 2015

Executive Summary

We study the mtcars data set that ships with R to determine the relationship between miles per gallon (mpg) and transmission type am (manual/automatic). We evaluate several candidate models and settle on mpg ~ wt * factor(am) based on our selection strategy. Using this model, we find that manual transmission offers better mpg for cars lighter than about 2,808 lbs. Ignoring weight, a two-sided t-test gives a 95% confidence interval of 3.2 to 11.3 miles per gallon for the mpg advantage of manual over automatic transmission (under the constraints of the sample). We also look at the residual variation in the chosen linear model.

Exploratory Data Analysis and choosing the regression model

We load the data and explore the correlations between the variables in the data set. ?mtcars provides the descriptions of the variables.

library('ggplot2'); library('xtable'); data(mtcars); options(scipen = 999)
cr <- as.data.frame(cor(mtcars))
tab <- xtable(cr[1:4,], caption = "Correlation table for mtcars (top 4 rows)")
print.xtable(tab, floating = TRUE, comment = FALSE)

       mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
mpg   1.00  -0.85  -0.85  -0.78   0.68  -0.87   0.42   0.66   0.60   0.48  -0.55
cyl  -0.85   1.00   0.90   0.83  -0.70   0.78  -0.59  -0.81  -0.52  -0.49   0.53
disp -0.85   0.90   1.00   0.79  -0.71   0.89  -0.43  -0.71  -0.59  -0.56   0.39
hp   -0.78   0.83   0.79   1.00  -0.45   0.66  -0.71  -0.72  -0.24  -0.13   0.75

Table 1: Correlation table for mtcars (top 4 rows)

The correlation table excerpt above indicates which variables are most correlated with miles per gallon (mpg). As we wish to study the effect of transmission (am), it is also included in the exploration. We choose to look at wt, hp and cyl, as they are highly correlated with mpg.

Figures 1, 2 and 3 show the relationships between mpg and weight in 1000 lbs, wt (Figure 1), horsepower, hp (Figure 2), and number of cylinders in the engine, cyl (Figure 3). The linear fits with 95% confidence intervals also help guide the models we may wish to consider for our linear regression.

At a cursory glance, we might also consider displacement in cubic inches (disp) as a variable highly related to the outcome mpg, but it is strongly correlated with cyl (0.90), so we do not add it as a separate predictor.
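As a minimal sketch of the screening step described above, the correlations with mpg can be ranked by absolute value. The object names r and ranked are illustrative and are not part of the original analysis.

```r
data(mtcars)

# Correlation of every variable with mpg
r <- cor(mtcars)[, "mpg"]

# Drop mpg itself and rank the rest by absolute correlation, strongest first
r <- r[names(r) != "mpg"]
ranked <- r[order(abs(r), decreasing = TRUE)]

print(round(ranked, 2))
```

The top of this ranking should agree with the mpg row of Table 1, where wt, cyl, disp and hp have the largest absolute correlations.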

Choosing the regression models

In order to choose the appropriate model, we follow these steps:
- Fit a model with the variable having the highest correlation with mpg (and also include am).
- Create subsequent models, each adding one more term that has a relatively high correlation with the outcome mpg, then perform anova as a nested model comparison (F-test) and compare consecutive p-values.

Thus we choose 4 models, f1, f2, f3 and f4, as shown below, with the anova results in Table 2.

f1 <- lm(mpg ~ wt + factor(am), data = mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)
f3 <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
f4 <- lm(mpg ~ wt * factor(am) + factor(cyl) + hp, data = mtcars)
anv <- as.data.frame(anova(f1, f2, f3, f4))
tab1 <- xtable(anv, caption = "Anova for choosing regression model from f1, f2, f3, f4")
digits(tab1) <- 5; print.xtable(tab1, floating = TRUE, comment = FALSE)


    Res.Df        RSS       Df  Sum of Sq         F   Pr(>F)
1  29.00000  278.31970
2  28.00000  188.00767  1.00000   90.31203  17.30489  0.00033
3  26.00000  137.99173  2.00000   50.01593   4.79183  0.01731
4  25.00000  130.47184  1.00000    7.51990   1.44090  0.24124

Table 2: Anova for choosing regression model from f1, f2, f3, f4

Examining the p-values in Table 2, we see that there is a benefit in including an interaction between wt and am when estimating mpg, as in model f2: the comparison between f1 and f2 yields a p-value of 0.0003283, well below a typical Type I error rate of α = 0.05. We therefore choose model f2 for further study of the effect of transmission am on mpg. Adding hp in f4 brings no benefit (p = 0.24). Adding factor(cyl) in f3 gives p = 0.017, which is also below 0.05 but is much weaker evidence than for the interaction, so we keep the simpler model f2.
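The f1 versus f2 comparison can also be run on its own, as in the sketch below. The names cmp and p_interaction are illustrative. Note that a pairwise anova uses the residual variance of f2 as the F-test denominator, whereas the four-model call behind Table 2 uses that of the largest model, f4, so the p-value here is expected to differ somewhat from the one in Table 2 while the sum of squares stays the same.

```r
data(mtcars)

# Nested pair: f2 adds the wt:am interaction to f1
f1 <- lm(mpg ~ wt + factor(am), data = mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)

# F-test for the extra interaction term
cmp <- anova(f1, f2)
p_interaction <- cmp[2, "Pr(>F)"]

print(cmp)
print(p_interaction < 0.05)
```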

Inferring from the chosen model (model f2)

Model Estimation

To answer our questions of interest, we plot the relationship between mpg and wt, with color representing am, in Figure 4. The grey sloped line shows the relationship between mpg and wt without considering am, while the two horizontal lines represent the mean mpg for the two transmission types. On average, manual transmission has a higher mpg (24.39) than automatic transmission (17.15). However, the points of the two groups overlap considerably, so a clear relationship cannot be inferred visually.

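# Two-sample t-test of mpg (column 1): automatic (am == 0) vs. manual (am == 1)
tt <- t.test(x = mtcars[mtcars$am == 0, 1], y = mtcars[mtcars$am == 1, 1])
# Car names ordered by leverage (hat values) in model f2, highest first
hval <- hatvalues(f2); topcar <- names(hval[order(hval, decreasing = TRUE)])
# Slope of mpg on wt for automatic, and for manual (wt slope plus interaction)
mpgChangeAuto <- f2$coeff[2]; mpgChangeMan <- f2$coeff[2] + f2$coeff[4]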

This can be analysed further with a two-sided t-test, which gives a p-value of 0.0014 and a 95% confidence interval of (-11.28, -3.21) for the difference in mean mpg (automatic minus manual). This indicates a measurable impact of transmission on mpg when weight is ignored. Continuing from Figure 4, the two regression lines for automatic and manual transmission show that mpg decreases more rapidly with increasing weight for a car with manual transmission than for one with automatic. Based on the model coefficients, there is a -3.79 change in mpg per 1000 lbs increase in weight for automatic transmission and a -9.08 change in mpg per 1000 lbs increase in weight for manual transmission. Also, heavier cars tend to have automatic rather than manual transmission, so transmission group is partly confounded with weight. (Note: this assumes that the sample of cars was not chosen in such a manner that heavier cars had automatic transmission.) The mpg benefit of manual transmission is nullified beyond wt = 2.81, where the two regression lines intersect.
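The two slopes and the intersection point follow directly from the coefficients of mpg ~ wt * factor(am): the manual line differs from the automatic line by an intercept shift plus a slope shift times wt, so the lines cross where that difference is zero, at wt = -(intercept shift) / (slope shift). A minimal sketch, in which the names b, slope_auto, slope_man and crossover are illustrative:

```r
data(mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)
b <- coef(f2)

# Slope of mpg on wt (per 1000 lbs) for automatic cars (am == 0)
slope_auto <- unname(b["wt"])

# Slope for manual cars (am == 1): base slope plus the interaction term
slope_man <- unname(b["wt"] + b["wt:factor(am)1"])

# Weight (in 1000 lbs) at which the two fitted lines intersect
crossover <- unname(-b["factor(am)1"] / b["wt:factor(am)1"])

print(round(c(auto = slope_auto, manual = slope_man, crossover = crossover), 2))
```

Accessing coefficients by name rather than by position makes the sketch robust to changes in the order of the model terms.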

Model Characteristics

In Figure 5 we examine the fit of model f2. The residuals vs. fitted plot (plot 1) shows no obvious heteroskedasticity, but noticeable residual variation in the middle of the fitted range. The Fiat 128, Merc 240D and Toyota Corolla stand out with large residuals, but as per plot 4 they have low leverage. In the normal Q-Q plot, the residuals closely follow the normal distribution, although at the higher positive quantiles we see some (negative) skewness in the error distribution. Among the cars with the highest leverage, the Maserati Bora comes first with a hat value of 0.37.
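The leverage figure quoted above can be checked with a short sketch such as the following. The names top3 and flagged are illustrative, and the 2p/n cut-off is a common rule of thumb assumed here for illustration; it is not used in the report itself.

```r
data(mtcars)
f2 <- lm(mpg ~ wt * factor(am), data = mtcars)

# Hat values measure how far each car's predictors lie from the rest
hval <- hatvalues(f2)

# Three cars with the highest leverage
top3 <- head(sort(hval, decreasing = TRUE), 3)
print(round(top3, 2))

# Assumed rule of thumb (not from the report): flag hat values above 2 * p / n
p <- length(coef(f2))
n <- nrow(mtcars)
flagged <- names(hval[hval > 2 * p / n])
print(flagged)
```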

Conclusion

Answering the questions: based on model f2, we can say that
(1) A manual transmission is better for mpg when the weight of the car is less than 2808.12 lbs. Beyond that, automatic transmission offers better mpg.
(2) Overall, across all car weights, manual transmission offers approximately 3.2 to 11.3 more miles per gallon than automatic (as per our t-test inference).
(3) Our conclusion is based upon the model f2 we chose, and the residual variance may impact the final result. Our model choice was also influenced by our need to observe the impact of am on mpg, whose correlation with mpg is actually lower than that of wt, hp, disp and cyl.
(4) To infer the difference in mpg we used a t-test, which assumes that there are no confounders affecting the obtained result.


Appendix

tr <- c("Automatic","Manual")
mtcars$trans <- tr[mtcars$am + 1]
qplot(x = wt, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 1: Miles per Gallon mpg vs. Car wt (in 1000lbs)")

[Figure 1: Miles per Gallon mpg vs. Car wt (in 1000lbs). Axes: wt (x), mpg (y); color legend trans: Automatic, Manual.]

qplot(x = hp, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 2: Miles per Gallon mpg vs. Horse Power hp")

[Figure 2: Miles per Gallon mpg vs. Horse Power hp. Axes: hp (x), mpg (y); color legend trans: Automatic, Manual.]

qplot(x = cyl, y = mpg, data = mtcars, color = trans, geom = c("point", "smooth"),
      method = "lm", main = "Figure 3: Miles per Gallon mpg vs. No. of cylinders cyl")


[Figure 3: Miles per Gallon mpg vs. No. of cylinders cyl. Axes: cyl (x), mpg (y); color legend trans: Automatic, Manual.]

f0 <- lm(mpg ~ wt, data = mtcars)
g <- ggplot(data = mtcars, aes(wt,mpg))
g <- g + geom_point(aes(color = trans))
g <- g + geom_hline(aes(yintercept = mean(mtcars[mtcars$am==0,1])),color ="dark grey")
g <- g + geom_hline(aes(yintercept = mean(mtcars[mtcars$am==1,1])),color ="dark grey")
g <- g + geom_abline(intercept = f0$coeff[1], slope = f0$coeff[2], color = "grey47")
g <- g + geom_abline(intercept = f2$coeff[1], slope = f2$coeff[2], color = "salmon")
g <- g + geom_abline(intercept = f2$coeff[1] + f2$coeff[3], slope = f2$coeff[2] + f2$coeff[4], color = "darkturquoise")
g <- g + labs(title = "Figure 4: Regression Model Effects for linear model f2")
g

[Figure 4: Regression Model Effects for linear model f2. Axes: wt (x), mpg (y); color legend trans: Automatic, Manual.]

par(mfrow = c(2,2), oma = c(2,2,4,2))
plot(f2, sub.caption = "Figure 5: Linear model f2 characteristics")


[Figure 5: Linear model f2 characteristics. Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Labelled points: Fiat 128, Merc 240D and Toyota Corolla in the first three panels; Chrysler Imperial, Fiat 128 and Toyota Corolla in the Residuals vs Leverage panel.]
