422 hw 4 - final draft with edits jt
DESCRIPTION
422 JE22TRANSCRIPT
-
TEAM 5 Page 1
EXP 422 MANAGERIAL DATA ANALYSIS
TEAM ASSIGNMENT #4
Team 5
Chuck Albanese
Himadri Kaul
Sreehari Purna
Justine Rumbel
Joshua Thompson
Due: December 30, 2013
-
TEAM 5 Page 2
Date Provided : DupreeFuels.xlsx
Objective: Find a statistically reliable oil consumption model based on the data provided for Dupree Fuels.
The following variables were noted in the data provided:
The number of degree days since the last oil fill and the consumption amounts for 40 customers.
The number of people residing in the homes of each of the 40 customers.
The home type classification, which is a number between 1 and 5, is a composite index of the home
size, age, exposure to wind, level of insulation and furnace type. A low index implies lower oil
consumption per degree day and a high index implies higher consumption of oil per degree day.
Answer:
Step 1: We created a series of scatterplots and a correlation matrix of every independent variable against
the dependent variable of oil usage to understand the relationship between the same and to detect any
potential outliers.
-
TEAM 5 Page 3
Oil Usage Degree Days Home Index Number People
Linear Correlation Table Data Set #1 Data Set #1 Data Set #1 Data Set #1
Oil Usage 1.000 0.543 0.662 -0.086
Degree Days 1.000 -0.065 -0.138
Home Index 1.000 -0.062
Number People 1.000
The plots and the correlation matrix indicate that there is a positive relationship between oil usage and the
variables of Degree Days and Home Index but a negative relationship was noted between oil usage and the
variable Number of People. However, the scatterplot for Number of People indicates that the variable may not
be pulling its weight in the model, however, well have a better idea of this and whether multicollinearity is a
concern when we compare regression models..
Step 2: Regression
The simple linear regression analysis with oil usage as a dependent variable (DV) and Home Index, Number
People and Degree days as an independent variable (IV) separately revealed the following:
The regression analysis between Oil Usage and Home Index indicates significant p-value, high t-value
with R2 value approx.. 44% and adjusted R
2 approx. 42%. The Se was noted to be around 134.
The regression analysis between Oil Usage and Degree days indicate significant p-value (.0003), high t-
value with R2 approaching 30% and adjusted R
2 around 28%. The Se was noted to be more than 150.
The regression analysis of Oil Usage vs Number People revealed high p-value (0.5966), t-value less than
1 (0.5337), with R2 approx. 7.4% and adjusted R
2 approaching -2. The Se was noted to be more than
178.
In addition, we also performed Liliefors test for normality for the residuals (since the sample is small) and
violation of assumption of normality was not detected in any cases. Please see the graphs below:
-
TEAM 5 Page 4
Oil usage vs Home index Oil usage vs Degree Days
Oil usage vs Number people
We then performed a single multiple linear regression analysis with Oil Usage as the DV and Degree
Days, Home Index and Number of People as IVs. The following model was obtained:
Multiple Regression for Oil Usage Multiple R-Square
Adjusted StErr of
Summary R R-Square Estimate
0.8855 0.7840 0.7660 85.46750087
Degrees of Sum of Mean of F-Ratio p-Value
ANOVA Table Freedom Squares Squares
Explained 3 954738.9266 318246.3089 43.5674 < 0.0001
Unexplained 36 262968.9734 7304.693706
-
TEAM 5 Page 5
Coefficient
Standard t-Value p-Value
Confidence Interval 95%
Regression Table Error Lower Upper
Constant -218.309867 63.95850802 -3.4133 0.0016 -348.023734 -88.5960014
Degree Days 0.27507924 0.036332705 7.5711 < 0.0001 0.201393098 0.348765382
Home Index 86.98874993 9.630435116 9.0327 < 0.0001 67.45732225 106.5201776
Number People 5.267239565 10.56179448 0.4987 0.6210 -16.1530724 26.68755159
The following equation was estimated:
Predicted oil usage = -218.30 + 0.28 Degree Days + 87 Home Index + 5.30 Number of People
We noticed that more than 78% of variations were explained in this model which is more than average and Se
was also noted to be statistically significant at 85. However, one of the variables Number of People showed t-
value less than 1 and high p-value. The scatterplot of residual vs fit showed a significant pattern and normality
was noted in the residuals. Please see below.
Step 3: Improving our Model.
We performed various regression analysis including:
1. Creating log variable of number people as an IV. We noticed that although the t-value (.7966 as
compared to 0.49 before) and p-value (0.43 as compared to 0.62 before) of variable number people
slightly improved but remained insignificant. In addition, the R2 and Se did not improve at all.
2. We created log variable of the dependent variable oil usage. We noticed that the coefficient of number
people had improved t-value (1.82) and p-value (0.07) and very low Se but the overall R2 value
dropped to 58%. I dont think we can compare the R^2 values when we Log the DVhe mentioned
this to us on our last Homework.
-
TEAM 5 Page 6
3. We eliminated the variable number people and the model improved with low p-value and high t-value of
each IV coefficient. The R2 value was also noted to be more than 78% with Se to be more than 84%. Se
is not a percentage. Please see below:
Multiple Regression for Oil Usage Multiple R-Square
Adjusted StErr of
Summary R R-Square Estimate
0.8846 0.7826 0.7708 84.59533454
Degrees of Sum of Mean of F-Ratio p-Value
ANOVA Table Freedom Squares Squares
Explained 2 952922.1868 476461.0934 66.5786 < 0.0001
Unexplained 37 264785.7132 7156.370627
Coefficient
Standard t-Value p-Value
Confidence Interval 95%
Regression Table Error Lower Upper
Constant -192.815538 38.04404902 -5.0682 < 0.0001 -269.900104 -115.730973
Degree Days 0.272492284 0.035593529 7.6557 < 0.0001 0.200372943 0.344611625
Home Index 86.64572305 9.50781596 9.1131 < 0.0001 67.38105801 105.9103881
We noticed that although the model has improved with low p-value of IVs but the R2 and Se values were similar
to the previous model that included the variable Number of People. In addition, the scatterplot of residual vs fit
was noted to be similar as before.
Current model Previous model
It helped us decide that the variable Number of People did not have much explanatory power as the other
variables and our model will improve by eliminating this variable. However, to test the assumption of Dupree
Fuels that the oil usage is affected by the number of people resided in a home, we created a pivot table with oil
usage in rows and averaged of rest of the IVs to study if this is true. Our results indicate that the number of
-
TEAM 5 Page 7
people did not really have any effect on the oil usage as some of the highest consumption occurred in the homes
with less than people and some of the lower consumption occurred in the households with more people in
residence. Therefore, we confirmed our statistic results that the variable Number of People does not have any
significant explanatory power and it can be safely eliminated from our model.
4. As we continued and tried to improve our R2 and Se values in our model and decided to create
interaction variable of Home Index and Degree Days. The following model was obtained:
Multiple Regression for Oil Usage Multiple R-Square
Adjusted StErr of
Summary R R-Square Estimate
0.9685 0.9380 0.9328 45.79215001
Degrees of Sum of Mean of F-Ratio p-Value
ANOVA Table Freedom Squares Squares
Explained 3 1142218.744 380739.5813 181.5708 < 0.0001
Unexplained 36 75489.15608 2096.921002
Coefficient
Standard t-Value p-Value
Confidence Interval 95%
Regression Table Error Lower Upper
Constant 32.36917195 31.39762783 1.0309 0.3094 -31.3081687 96.0465126
Degree Days -0.05605467 0.039584758 -1.4161 0.1654 -0.13633628 0.024226939
Home Index 2.096470139 10.27988387 0.2039 0.8395 -18.7521006 22.94504094
InterDegHome 0.126175343 0.013279881 9.5012 < 0.0001 0.099242497 0.153108189
The following equation was estimated:
Predicted oil usage = 32.36 0.05 Degree Days + 2.1 Home Index + 0.13 Interaction (Degree daysHome Index)
This can be interpreted as follows:
As the degree-days increases by 1 and the home index remains constant, the oil usage is expected to
decrease by 0.05 plus the product of 0.13 and the current number of home index.
As the home index increases by 1 and the degree-days remains constant, the oil usage is expected to
increase by 2.1 plus the product of 0.13 and the current number of degree days.
In other words the oil usage depends upon the number of home index. (Im not sure what this is saying.
Is it needed?)
In addition, the R2 improved significantly to almost 94% and the Se dropped significantly to 45.79% (Se not a
percentage) improving the model significantly. Moreover the scatterplot of residual vs fit improved as well.
Please see below:
-
TEAM 5 Page 8
We also noticed insignificant p-value and t-value for the two IVs of Degree Days and Home Index. However,
eliminating these variables from the model did not improve our model in terms of R2 and Se. Instead, the R
2
decreased to .9274, adjusted R2 became .9255, and the Se increased to 48.24.
Step 4: Test for Violation of Assumptions:
1. Nonconstant Error Variance not detected as noticed in the scatterplot of residual vs fit. Please see
above.
2. Non-Normality Please see the graph below for Chi-square test of normality. The test appears to show
normality of the residuals.
3. Autocorrelation: The Durbin-Watson test revealed the value of 2.37 indicating a slight negative
autocorrelation.
-
TEAM 5 Page 9
Conclusion:
We concluded that the model obtained in #4 with oil usage as a DV and IVs of Degree Days, Home Index and
Interaction (Home Index, Degree Days) is the best in explaining the oil consumption for Dupree Fuels. There is
a possibility of having a Type II error in this model, so in our communication with Dupree Fuels we would
suggest increasing the sample size to achieve statistically significant results.