422 hw 4 - final draft with edits jt

TEAM 5 Page 1

EXP 422 MANAGERIAL DATA ANALYSIS

TEAM ASSIGNMENT #4

Team 5

Chuck Albanese

Himadri Kaul

Sreehari Purna

Justine Rumbel

Joshua Thompson

Due: December 30, 2013

TEAM 5 Page 2

Date Provided : DupreeFuels.xlsx

Objective: Find a statistically reliable oil consumption model based on the data provided for Dupree Fuels.

The following variables were noted in the data provided:

The number of degree days since the last oil fill and the consumption amounts for 40 customers.

The number of people residing in the homes of each of the 40 customers.

The home type classification, which is a number between 1 and 5, is a composite index of the home

size, age, exposure to wind, level of insulation and furnace type. A low index implies lower oil

consumption per degree day and a high index implies higher consumption of oil per degree day.

Answer:

Step 1: We created a series of scatterplots and a correlation matrix of every independent variable against

the dependent variable of oil usage to understand the relationship between the same and to detect any

potential outliers.

TEAM 5 Page 3

Oil Usage Degree Days Home Index Number People

Linear Correlation Table Data Set #1 Data Set #1 Data Set #1 Data Set #1

Oil Usage 1.000 0.543 0.662 -0.086

Degree Days 1.000 -0.065 -0.138

Home Index 1.000 -0.062

Number People 1.000

The plots and the correlation matrix indicate that there is a positive relationship between oil usage and the

variables of Degree Days and Home Index but a negative relationship was noted between oil usage and the

variable Number of People. However, the scatterplot for Number of People indicates that the variable may not

be pulling its weight in the model, however, well have a better idea of this and whether multicollinearity is a

concern when we compare regression models..

Step 2: Regression

The simple linear regression analysis with oil usage as a dependent variable (DV) and Home Index, Number

People and Degree days as an independent variable (IV) separately revealed the following:

The regression analysis between Oil Usage and Home Index indicates significant p-value, high t-value

with R2 value approx.. 44% and adjusted R

2 approx. 42%. The Se was noted to be around 134.

The regression analysis between Oil Usage and Degree days indicate significant p-value (.0003), high t-

value with R2 approaching 30% and adjusted R

2 around 28%. The Se was noted to be more than 150.

The regression analysis of Oil Usage vs Number People revealed high p-value (0.5966), t-value less than

1 (0.5337), with R2 approx. 7.4% and adjusted R

2 approaching -2. The Se was noted to be more than

178.

In addition, we also performed Liliefors test for normality for the residuals (since the sample is small) and

violation of assumption of normality was not detected in any cases. Please see the graphs below:

TEAM 5 Page 4

Oil usage vs Home index Oil usage vs Degree Days

Oil usage vs Number people

We then performed a single multiple linear regression analysis with Oil Usage as the DV and Degree

Days, Home Index and Number of People as IVs. The following model was obtained:

Multiple Regression for Oil Usage Multiple R-Square

Adjusted StErr of

Summary R R-Square Estimate

0.8855 0.7840 0.7660 85.46750087

Degrees of Sum of Mean of F-Ratio p-Value

ANOVA Table Freedom Squares Squares

Explained 3 954738.9266 318246.3089 43.5674 < 0.0001

Unexplained 36 262968.9734 7304.693706

TEAM 5 Page 5

Coefficient

Standard t-Value p-Value

Confidence Interval 95%

Regression Table Error Lower Upper

Constant -218.309867 63.95850802 -3.4133 0.0016 -348.023734 -88.5960014

Degree Days 0.27507924 0.036332705 7.5711 < 0.0001 0.201393098 0.348765382

Home Index 86.98874993 9.630435116 9.0327 < 0.0001 67.45732225 106.5201776

Number People 5.267239565 10.56179448 0.4987 0.6210 -16.1530724 26.68755159

The following equation was estimated:

Predicted oil usage = -218.30 + 0.28 Degree Days + 87 Home Index + 5.30 Number of People

We noticed that more than 78% of variations were explained in this model which is more than average and Se

was also noted to be statistically significant at 85. However, one of the variables Number of People showed t-

value less than 1 and high p-value. The scatterplot of residual vs fit showed a significant pattern and normality

was noted in the residuals. Please see below.

Step 3: Improving our Model.

We performed various regression analysis including:

1. Creating log variable of number people as an IV. We noticed that although the t-value (.7966 as

compared to 0.49 before) and p-value (0.43 as compared to 0.62 before) of variable number people

slightly improved but remained insignificant. In addition, the R2 and Se did not improve at all.

2. We created log variable of the dependent variable oil usage. We noticed that the coefficient of number

people had improved t-value (1.82) and p-value (0.07) and very low Se but the overall R2 value

dropped to 58%. I dont think we can compare the R^2 values when we Log the DVhe mentioned

this to us on our last Homework.

TEAM 5 Page 6

3. We eliminated the variable number people and the model improved with low p-value and high t-value of

each IV coefficient. The R2 value was also noted to be more than 78% with Se to be more than 84%. Se

is not a percentage. Please see below:


Adjusted StErr of


0.8846 0.7826 0.7708 84.59533454



Explained 2 952922.1868 476461.0934 66.5786 < 0.0001

Unexplained 37 264785.7132 7156.370627

Coefficient




Constant -192.815538 38.04404902 -5.0682 < 0.0001 -269.900104 -115.730973

Degree Days 0.272492284 0.035593529 7.6557 < 0.0001 0.200372943 0.344611625

Home Index 86.64572305 9.50781596 9.1131 < 0.0001 67.38105801 105.9103881

We noticed that although the model has improved with low p-value of IVs but the R2 and Se values were similar

to the previous model that included the variable Number of People. In addition, the scatterplot of residual vs fit

was noted to be similar as before.

Current model Previous model

It helped us decide that the variable Number of People did not have much explanatory power as the other

variables and our model will improve by eliminating this variable. However, to test the assumption of Dupree

Fuels that the oil usage is affected by the number of people resided in a home, we created a pivot table with oil

usage in rows and averaged of rest of the IVs to study if this is true. Our results indicate that the number of

TEAM 5 Page 7

people did not really have any effect on the oil usage as some of the highest consumption occurred in the homes

with less than people and some of the lower consumption occurred in the households with more people in

residence. Therefore, we confirmed our statistic results that the variable Number of People does not have any

significant explanatory power and it can be safely eliminated from our model.

4. As we continued and tried to improve our R2 and Se values in our model and decided to create

interaction variable of Home Index and Degree Days. The following model was obtained:


Adjusted StErr of


0.9685 0.9380 0.9328 45.79215001



Explained 3 1142218.744 380739.5813 181.5708 < 0.0001

Unexplained 36 75489.15608 2096.921002

Coefficient




Constant 32.36917195 31.39762783 1.0309 0.3094 -31.3081687 96.0465126

Degree Days -0.05605467 0.039584758 -1.4161 0.1654 -0.13633628 0.024226939

Home Index 2.096470139 10.27988387 0.2039 0.8395 -18.7521006 22.94504094

InterDegHome 0.126175343 0.013279881 9.5012 < 0.0001 0.099242497 0.153108189

The following equation was estimated:

Predicted oil usage = 32.36 0.05 Degree Days + 2.1 Home Index + 0.13 Interaction (Degree daysHome Index)

This can be interpreted as follows:

As the degree-days increases by 1 and the home index remains constant, the oil usage is expected to

decrease by 0.05 plus the product of 0.13 and the current number of home index.

As the home index increases by 1 and the degree-days remains constant, the oil usage is expected to

increase by 2.1 plus the product of 0.13 and the current number of degree days.

In other words the oil usage depends upon the number of home index. (Im not sure what this is saying.

Is it needed?)

In addition, the R2 improved significantly to almost 94% and the Se dropped significantly to 45.79% (Se not a

percentage) improving the model significantly. Moreover the scatterplot of residual vs fit improved as well.

Please see below:

TEAM 5 Page 8

We also noticed insignificant p-value and t-value for the two IVs of Degree Days and Home Index. However,

eliminating these variables from the model did not improve our model in terms of R2 and Se. Instead, the R

2

decreased to .9274, adjusted R2 became .9255, and the Se increased to 48.24.

Step 4: Test for Violation of Assumptions:

1. Nonconstant Error Variance not detected as noticed in the scatterplot of residual vs fit. Please see

above.

2. Non-Normality Please see the graph below for Chi-square test of normality. The test appears to show

normality of the residuals.

3. Autocorrelation: The Durbin-Watson test revealed the value of 2.37 indicating a slight negative

autocorrelation.

TEAM 5 Page 9

Conclusion:

We concluded that the model obtained in #4 with oil usage as a DV and IVs of Degree Days, Home Index and

Interaction (Home Index, Degree Days) is the best in explaining the oil consumption for Dupree Fuels. There is

a possibility of having a Type II error in this model, so in our communication with Dupree Fuels we would

suggest increasing the sample size to achieve statistically significant results.

422 hw 4 - final draft with edits jt

Documents