
Post on 24-Jul-2020


By: Shivani Choudhary & Emily Phillips

Predicting Housing Sales Price in the Year 2008

Objective

Accurately predict Sales Price in 2008 from house characteristics

Which of these characteristics are important in this prediction?

Dataset obtained from the United States Census Bureau website: http://www.census.gov/construction/nrc/index.html

Data is collected through the Survey of Construction

Funded by the Department of Housing and Urban Development

Methodology

Split data into training (75%) and test (25%)

Complete Univariate Analysis of variables

Check for Heteroscedasticity, multicollinearity, etc.

Step-wise Model Selection

Test significance, residual analysis, etc.

Check model on test dataset

Re-run on full dataset
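
The 75%/25% split in the first step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the seed, and the placeholder rows are assumptions, and the slides do not say how rows were assigned to each set.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and cut them into a training set and a test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_train = int(len(rows) * train_fraction)  # rounds down
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Placeholder rows standing in for the 7,042 houses (hypothetical data).
houses = list(range(7042))
train, test = train_test_split(houses)
print(len(train), len(test))
```

With 7,042 rows, a 75% cut rounded down gives 5,281 training rows and 1,761 test rows.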

Data Distribution

7,042 observations in the whole dataset: 5,281 in training, 1,761 in test

1 continuous response, 1 continuous regressor, and 6 categorical regressors:

Variable                        Type
Sales Price                     Continuous (Response)
Square Foot Area of the House   Continuous (Regressor)
Bedrooms                        Categorical (Regressor)
Full Bathrooms                  Categorical (Regressor)
Half Bathrooms                  Categorical (Regressor)
Stories                         Categorical (Regressor)
Parking Facility                Categorical (Regressor)
Metropolitan Area               Categorical (Regressor)

Scatterplot

Checking Heteroscedasticity

Spread vs Level

Box-Cox Transformation

Reducing Heteroscedasticity

Creating Linear Relationships

Scatterplot of Re-expressed Values

Problem: Interpretation

Our final Box-Cox Transformation gave a lambda of -0.333 (the reciprocal cube root)

This is hard to interpret, and thus not optimal.

-0.333 is close to 0, and a lambda of 0 corresponds to the log transformation

The log is easier to explain
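
To illustrate why the log is a reasonable stand-in for a lambda of -0.333, the sketch below applies the standard Box-Cox formula with both lambdas and measures how closely the two re-expressed series track each other. The price grid is hypothetical (evenly spaced on a log scale between 50,000 and 1,000,000), not the Census data, and the helper names are assumptions.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam is 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical sale prices, not taken from the dataset.
prices = [50_000 * 20 ** (i / 99) for i in range(100)]
log_prices = [box_cox(p, 0) for p in prices]
cube_root_prices = [box_cox(p, -1 / 3) for p in prices]

r = pearson(log_prices, cube_root_prices)
print(f"correlation between the two transforms: {r:.3f}")
```

A correlation near 1 on a grid like this suggests the two transforms order and space the prices similarly, which is the intuition behind choosing the more interpretable log.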

Proof of Similarity of Transformation


Outliers: Hat Matrix

Cutoff: 2p/n ~ 0.003
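
The 2p/n rule flags observations whose leverage (diagonal entry of the hat matrix) exceeds twice the average leverage, where p is the number of model parameters and n the number of observations. The sketch below uses the simple-regression special case (intercept plus one regressor, so p = 2), where leverage has a closed form; the model in these slides has more parameters, and the square-footage values are hypothetical.

```python
def leverages(x):
    """Hat-matrix diagonal for a simple regression of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def high_leverage(x, p=2):
    """Indices of observations whose leverage exceeds the 2p/n cutoff."""
    cutoff = 2 * p / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


# Hypothetical square-foot areas; the last house is unusually large.
sqft = [1200, 1500, 1800, 2000, 2100, 2300, 2500, 2600, 2800, 9000]
print(high_leverage(sqft))
```

The leverages always sum to p, so 2p/n is twice their average; only the unusually large house crosses that threshold here.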

Final Model

All 3 methodologies (forward, backward, and stepwise) using Log transforms agreed on the final model

Metropolitan Area is not included in the final model

Test data confirmed this model as a good fit

R^2 = 0.5371031 for test

R^2 = 0.5228 for training

Refit this model on the entire dataset for more accuracy

R^2 = 0.5277

X1 = log of square foot area of house

X2 = 1 if 2 full bathrooms

X3 = 1 if 3 full bathrooms

X4 = 1 if 4 or more full bathrooms

X5 = 1 if 1 half bathroom

X6 = 1 if 2 or more half bathrooms

X7 = 1 if 3 bedrooms

X8 = 1 if 4 bedrooms

X9 = 1 if 5 or more bedrooms

X10 = 1 if 2 car garage

X11 = 1 if 3 or more car garage

X12 = 1 if other parking

X13 = 1 if 2 or more stories

X14 = 1 if split-level

Variable    Coefficient   Std. error   t-statistic   p-value     Meaning
Intercept    7.195195     0.134570      53.468       < 2e-16
X1           0.673273     0.018480      36.432       < 2e-16     Log of square foot area of house
X2           0.014853     0.031145       0.477       0.633       = 1 if 2 full bathrooms
X3           0.203439     0.033720       6.033       1.69e-09    = 1 if 3 full bathrooms
X4           0.421026     0.039484      10.663       < 2e-16     = 1 if 4 or more full bathrooms
X5           0.113380     0.011296      10.037       < 2e-16     = 1 if 1 half bathroom
X6           0.182157     0.031005       5.875       4.42e-09    = 1 if 2 or more half bathrooms
X7          -0.164756     0.015939     -10.337       < 2e-16     = 1 if 3 bedrooms
X8          -0.185899     0.018351     -10.130       < 2e-16     = 1 if 4 bedrooms
X9          -0.266615     0.025485     -10.462       < 2e-16     = 1 if 5 or more bedrooms
X10         -0.003572     0.018385      -0.194       0.846       = 1 if 2 car garage
X11          0.145093     0.021573       6.726       1.88e-11    = 1 if 3 or more car garage
X12         -0.075753     0.026314      -2.879       0.004       = 1 if other parking
X13          0.067372     0.011786       5.716       1.13e-08    = 1 if 2 or more stories
X14         -0.002316     0.061797      -0.037       0.970       = 1 if split-level house
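
As a sketch of how the fitted coefficients above might be used, the code below predicts a price and converts a dummy coefficient into a percentage effect. It rests on assumptions the slides do not state: that the logs are natural logs, that the response is the log of price in dollars, and that a house with every dummy at 0 belongs to the omitted baseline categories. The function names are hypothetical.

```python
import math

# Coefficients copied from the table; keys follow the X1..X14 labels.
COEF = {
    "intercept": 7.195195, "X1": 0.673273, "X2": 0.014853, "X3": 0.203439,
    "X4": 0.421026, "X5": 0.113380, "X6": 0.182157, "X7": -0.164756,
    "X8": -0.185899, "X9": -0.266615, "X10": -0.003572, "X11": 0.145093,
    "X12": -0.075753, "X13": 0.067372, "X14": -0.002316,
}


def predict_price(sqft, active_dummies=()):
    """Predicted price, assuming natural logs on both sides of the model."""
    log_price = COEF["intercept"] + COEF["X1"] * math.log(sqft)
    log_price += sum(COEF[name] for name in active_dummies)
    return math.exp(log_price)  # back-transform from the log scale


def percent_effect(name):
    """Approximate % change in price when a dummy switches from 0 to 1."""
    return 100 * (math.exp(COEF[name]) - 1)


baseline = predict_price(2000)
upgraded = predict_price(2000, active_dummies=("X3", "X11"))
print(f"baseline: {baseline:,.0f}  3 baths + 3-car garage: {upgraded:,.0f}")
print(f"4+ full bathrooms vs baseline: {percent_effect('X4'):.1f}%")
```

Because both price and area are logged, the X1 coefficient reads as an elasticity: a 1% increase in square footage goes with roughly a 0.67% increase in predicted price, holding the other variables fixed.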

Testing a Subset of Regression Coefficients

Full Model: F-statistic= 560.9, p-value < 2.2e-16

We can conclude that the equation as a whole has predictive value
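
The overall F-statistic can be cross-checked against the reported R^2 using the standard identity F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The sketch assumes the full-model fit used all 7,042 observations and the 14 regressors X1 to X14; the slides do not state the degrees of freedom, so treat these inputs as assumptions.

```python
def overall_f(r_squared, n, k):
    """Overall F-statistic for a regression with k regressors and n rows."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


# R^2 from the full-dataset refit; n and k are assumed, as noted above.
f_stat = overall_f(0.5277, n=7042, k=14)
print(round(f_stat, 1))
```

Under those assumptions the result should land close to the reported 560.9, with any small gap attributable to the rounding of R^2.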

Variable Taken Out           F-statistic   p-value
Square Foot Area of House    1327.3        < 2.2e-16
Full Bathrooms               118.3         < 2.2e-16
Half Bathrooms               55.972        < 2.2e-16
Bedrooms                     45.704        < 2.2e-16
Parking Facility             55.833        < 2.2e-16
Stories                      16.48         7.24e-08
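
For reference, the usual partial F-statistic for testing whether a group of q coefficients can be dropped compares the error sum of squares of the reduced model with that of the full model. The notation below is assumed, not taken from the slides: p is the number of parameters in the full model including the intercept, and n is the number of observations.

```latex
F = \frac{\left( SSE_{\text{reduced}} - SSE_{\text{full}} \right) / q}
         {SSE_{\text{full}} / (n - p)}
\sim F_{q,\; n - p} \quad \text{under } H_0
```

For example, removing Full Bathrooms drops the three dummies X2, X3, and X4 at once, so q = 3 for that row.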

Example of Whole vs. Individual Significance

Variable           Level     t-statistic   p-value     Signif. code
Parking Facility   Level 2   -0.194        0.846
Parking Facility   Level 3    6.726        1.88e-11    ***
Parking Facility   Level 4   -2.879        0.004       **

Parking Facility as a whole: F-statistic = 55.833, p-value < 2.2e-16

Residuals vs Fitted

Normal Q-Q Plot of Residuals

Problems we faced

Necessary transformations for variables

Missing data (chose to exclude)

Low Level of Multicollinearity

Categorical Data

Outliers

Possible overfitting (huge dataset)?

Conclusion

We developed a model that predicts the 2008 Sales Price of houses moderately well (R^2 of about 0.53)

We found variables that appear to be important in this prediction
