422 hw 4 - final draft with edits jt

9
TEAM 5 Pag EXP 422  –  MANAGERIAL DATA ANALYSIS TEAM ASSIGNMENT #4 Team 5 Chuck Albanese Himadri Kaul Sreehari Purna Justine Rumbel Joshua Thompson Due: December 30, 2013 

Upload: sree7

Post on 17-Oct-2015

15 views

Category:

Documents


0 download

DESCRIPTION

422 JE22

TRANSCRIPT

  • TEAM 5 Page 1

    EXP 422 MANAGERIAL DATA ANALYSIS

    TEAM ASSIGNMENT #4

    Team 5

    Chuck Albanese

    Himadri Kaul

    Sreehari Purna

    Justine Rumbel

    Joshua Thompson

    Due: December 30, 2013

  • TEAM 5 Page 2

    Date Provided : DupreeFuels.xlsx

    Objective: Find a statistically reliable oil consumption model based on the data provided for Dupree Fuels.

    The following variables were noted in the data provided:

    The number of degree days since the last oil fill and the consumption amounts for 40 customers.

    The number of people residing in the homes of each of the 40 customers.

    The home type classification, which is a number between 1 and 5, is a composite index of the home

    size, age, exposure to wind, level of insulation and furnace type. A low index implies lower oil

    consumption per degree day and a high index implies higher consumption of oil per degree day.

    Answer:

    Step 1: We created a series of scatterplots and a correlation matrix of every independent variable against

    the dependent variable of oil usage to understand the relationship between the same and to detect any

    potential outliers.

  • TEAM 5 Page 3

    Oil Usage Degree Days Home Index Number People

    Linear Correlation Table Data Set #1 Data Set #1 Data Set #1 Data Set #1

    Oil Usage 1.000 0.543 0.662 -0.086

    Degree Days 1.000 -0.065 -0.138

    Home Index 1.000 -0.062

    Number People 1.000

    The plots and the correlation matrix indicate that there is a positive relationship between oil usage and the

    variables of Degree Days and Home Index but a negative relationship was noted between oil usage and the

    variable Number of People. However, the scatterplot for Number of People indicates that the variable may not

    be pulling its weight in the model, however, well have a better idea of this and whether multicollinearity is a

    concern when we compare regression models..

    Step 2: Regression

    The simple linear regression analysis with oil usage as a dependent variable (DV) and Home Index, Number

    People and Degree days as an independent variable (IV) separately revealed the following:

    The regression analysis between Oil Usage and Home Index indicates significant p-value, high t-value

    with R2 value approx.. 44% and adjusted R

    2 approx. 42%. The Se was noted to be around 134.

    The regression analysis between Oil Usage and Degree days indicate significant p-value (.0003), high t-

    value with R2 approaching 30% and adjusted R

    2 around 28%. The Se was noted to be more than 150.

    The regression analysis of Oil Usage vs Number People revealed high p-value (0.5966), t-value less than

    1 (0.5337), with R2 approx. 7.4% and adjusted R

    2 approaching -2. The Se was noted to be more than

    178.

    In addition, we also performed Liliefors test for normality for the residuals (since the sample is small) and

    violation of assumption of normality was not detected in any cases. Please see the graphs below:

  • TEAM 5 Page 4

    Oil usage vs Home index Oil usage vs Degree Days

    Oil usage vs Number people

    We then performed a single multiple linear regression analysis with Oil Usage as the DV and Degree

    Days, Home Index and Number of People as IVs. The following model was obtained:

    Multiple Regression for Oil Usage Multiple R-Square

    Adjusted StErr of

    Summary R R-Square Estimate

    0.8855 0.7840 0.7660 85.46750087

    Degrees of Sum of Mean of F-Ratio p-Value

    ANOVA Table Freedom Squares Squares

    Explained 3 954738.9266 318246.3089 43.5674 < 0.0001

    Unexplained 36 262968.9734 7304.693706

  • TEAM 5 Page 5

    Coefficient

    Standard t-Value p-Value

    Confidence Interval 95%

    Regression Table Error Lower Upper

    Constant -218.309867 63.95850802 -3.4133 0.0016 -348.023734 -88.5960014

    Degree Days 0.27507924 0.036332705 7.5711 < 0.0001 0.201393098 0.348765382

    Home Index 86.98874993 9.630435116 9.0327 < 0.0001 67.45732225 106.5201776

    Number People 5.267239565 10.56179448 0.4987 0.6210 -16.1530724 26.68755159

    The following equation was estimated:

    Predicted oil usage = -218.30 + 0.28 Degree Days + 87 Home Index + 5.30 Number of People

    We noticed that more than 78% of variations were explained in this model which is more than average and Se

    was also noted to be statistically significant at 85. However, one of the variables Number of People showed t-

    value less than 1 and high p-value. The scatterplot of residual vs fit showed a significant pattern and normality

    was noted in the residuals. Please see below.

    Step 3: Improving our Model.

    We performed various regression analysis including:

    1. Creating log variable of number people as an IV. We noticed that although the t-value (.7966 as

    compared to 0.49 before) and p-value (0.43 as compared to 0.62 before) of variable number people

    slightly improved but remained insignificant. In addition, the R2 and Se did not improve at all.

    2. We created log variable of the dependent variable oil usage. We noticed that the coefficient of number

    people had improved t-value (1.82) and p-value (0.07) and very low Se but the overall R2 value

    dropped to 58%. I dont think we can compare the R^2 values when we Log the DVhe mentioned

    this to us on our last Homework.

  • TEAM 5 Page 6

    3. We eliminated the variable number people and the model improved with low p-value and high t-value of

    each IV coefficient. The R2 value was also noted to be more than 78% with Se to be more than 84%. Se

    is not a percentage. Please see below:

    Multiple Regression for Oil Usage Multiple R-Square

    Adjusted StErr of

    Summary R R-Square Estimate

    0.8846 0.7826 0.7708 84.59533454

    Degrees of Sum of Mean of F-Ratio p-Value

    ANOVA Table Freedom Squares Squares

    Explained 2 952922.1868 476461.0934 66.5786 < 0.0001

    Unexplained 37 264785.7132 7156.370627

    Coefficient

    Standard t-Value p-Value

    Confidence Interval 95%

    Regression Table Error Lower Upper

    Constant -192.815538 38.04404902 -5.0682 < 0.0001 -269.900104 -115.730973

    Degree Days 0.272492284 0.035593529 7.6557 < 0.0001 0.200372943 0.344611625

    Home Index 86.64572305 9.50781596 9.1131 < 0.0001 67.38105801 105.9103881

    We noticed that although the model has improved with low p-value of IVs but the R2 and Se values were similar

    to the previous model that included the variable Number of People. In addition, the scatterplot of residual vs fit

    was noted to be similar as before.

    Current model Previous model

    It helped us decide that the variable Number of People did not have much explanatory power as the other

    variables and our model will improve by eliminating this variable. However, to test the assumption of Dupree

    Fuels that the oil usage is affected by the number of people resided in a home, we created a pivot table with oil

    usage in rows and averaged of rest of the IVs to study if this is true. Our results indicate that the number of

  • TEAM 5 Page 7

    people did not really have any effect on the oil usage as some of the highest consumption occurred in the homes

    with less than people and some of the lower consumption occurred in the households with more people in

    residence. Therefore, we confirmed our statistic results that the variable Number of People does not have any

    significant explanatory power and it can be safely eliminated from our model.

    4. As we continued and tried to improve our R2 and Se values in our model and decided to create

    interaction variable of Home Index and Degree Days. The following model was obtained:

    Multiple Regression for Oil Usage Multiple R-Square

    Adjusted StErr of

    Summary R R-Square Estimate

    0.9685 0.9380 0.9328 45.79215001

    Degrees of Sum of Mean of F-Ratio p-Value

    ANOVA Table Freedom Squares Squares

    Explained 3 1142218.744 380739.5813 181.5708 < 0.0001

    Unexplained 36 75489.15608 2096.921002

    Coefficient

    Standard t-Value p-Value

    Confidence Interval 95%

    Regression Table Error Lower Upper

    Constant 32.36917195 31.39762783 1.0309 0.3094 -31.3081687 96.0465126

    Degree Days -0.05605467 0.039584758 -1.4161 0.1654 -0.13633628 0.024226939

    Home Index 2.096470139 10.27988387 0.2039 0.8395 -18.7521006 22.94504094

    InterDegHome 0.126175343 0.013279881 9.5012 < 0.0001 0.099242497 0.153108189

    The following equation was estimated:

    Predicted oil usage = 32.36 0.05 Degree Days + 2.1 Home Index + 0.13 Interaction (Degree daysHome Index)

    This can be interpreted as follows:

    As the degree-days increases by 1 and the home index remains constant, the oil usage is expected to

    decrease by 0.05 plus the product of 0.13 and the current number of home index.

    As the home index increases by 1 and the degree-days remains constant, the oil usage is expected to

    increase by 2.1 plus the product of 0.13 and the current number of degree days.

    In other words the oil usage depends upon the number of home index. (Im not sure what this is saying.

    Is it needed?)

    In addition, the R2 improved significantly to almost 94% and the Se dropped significantly to 45.79% (Se not a

    percentage) improving the model significantly. Moreover the scatterplot of residual vs fit improved as well.

    Please see below:

  • TEAM 5 Page 8

    We also noticed insignificant p-value and t-value for the two IVs of Degree Days and Home Index. However,

    eliminating these variables from the model did not improve our model in terms of R2 and Se. Instead, the R

    2

    decreased to .9274, adjusted R2 became .9255, and the Se increased to 48.24.

    Step 4: Test for Violation of Assumptions:

    1. Nonconstant Error Variance not detected as noticed in the scatterplot of residual vs fit. Please see

    above.

    2. Non-Normality Please see the graph below for Chi-square test of normality. The test appears to show

    normality of the residuals.

    3. Autocorrelation: The Durbin-Watson test revealed the value of 2.37 indicating a slight negative

    autocorrelation.

  • TEAM 5 Page 9

    Conclusion:

    We concluded that the model obtained in #4 with oil usage as a DV and IVs of Degree Days, Home Index and

    Interaction (Home Index, Degree Days) is the best in explaining the oil consumption for Dupree Fuels. There is

    a possibility of having a Type II error in this model, so in our communication with Dupree Fuels we would

    suggest increasing the sample size to achieve statistically significant results.