rpa307 p2
TRANSCRIPT
RPA 307 Project #2
Valuation Modelling Case Study
Name: Eun Ji (Lindsay) Lee
Student Id : 055-693-147
Date Submitted: December 1 2014
Professor: Anthony Percaccio
P a g e | 1
Table of Contents Introduction 2 Data Analysis 2 Transformations 9 Variable selection 10 Model Calibration 11 Final analysis and conclusions 17 Appendix A: Graphs and analysis 22 Appendix B: Syntax for transformation 39
P a g e | 2
Introduction
This report is intended to provide the most accurate predicted market value based on given data set, ‘Midsize_T2’ through multiple linear regression equation. SPSS software was used for the calculation and the calculation is derived from the nine steps by the course manual.
The equation will provide predicted house price information for single-family residential properties in Midsize City. The intended users are lenders, private vendors, homebuyers, and property tax agents in the market. The valuation date is December 31, 2008.
Data Analysis Step 0. Preliminary data screening
The purpose of preliminary data screening is to filter out ‘bad’ data that was incorrectly put(such as one digit square foot in total finished area, 100 washrooms, etc.), filter out any duplicated data(s), and quickly review the outliers.
The raw data has 726 sales with 43 variables. Quick screening by frequency statistics show there were three ‘bad’ data; 72 Mount View, 46 South Cove, and 46 South Cove (the latter two are duplicated) have one-digit total finished area (1,1,3 accordingly), they were removed. Also there were 9 duplicate sales, which were removed. So far 714 sales were left.
Checking for outliers by scatter plot (sales price vs. continuous variables) showed there were no specific outliers.
Step 1. Specifying the model
The basic model is;
MV= LV + BV
where MV = Estimated Market value of sales price;
LV = land value and BV = building value
This is the base model which will be developed in multivariate (multi variable) linear model.
Step 2. Examine the variables
The 46 variables could be distinguished into structure (building-related), land-related, and information only (user-friendly purpose). The detail is as follow:
Category Variables Structure (Candidate for building value) YearBuilt, EffectiveYear,
Foundation, FinishedArea, First_Floor,
P a g e | 3
Second_Floor, BasementTotalArea, BasementFinishedArea, Stories, DeckCoveredArea, DeckUncoveredArea, FullBath, ThreeQtrBath,HalfBath, Bedrooms, Fireplcs, MultiCarGarage, SingleCarGarage, CarPort, Pool, Outbuildings
Land (Candidate for land value) LotSizeSqft, CornerLot, OutstandingView, ExcellentView, GoodView,
Information only PID, Nbhd, Neigh, ActualUse, ActualUseTypeDese, OutstandingView, ExcellengView, Good View SalesYear, SalesMonth, SalesDay, Sale Price, Adj_Price AssessedLand, AssessedImprov, AssessedTotal ManualClass, Lin_mancls, ManualClassDesc, Random
Elimination of Non-meaningful Variables
1. ActualUse (all 0), ActualUseTypeDesc(all 'SINGLE FAMILY DWELLING),
2. SaleYear(all 2008) will not be considered in any further steps since there is no distinction finding.
3. Random variable does not imply anything hence will not be considered in further steps.
4. AssessedLand, AssessedImprov, AssessedTotal do not reflect current market price hence will not be considered as well.
5. Sales date will be not used because it is too specific.
Step 3. Examining the variables
Check for time adjustment
Note that all sales were made in 2008 and sales day was not considered since it is too specific. There is a span of 7 months (May 2008 to December 2012) of sales, which shows fairly even distributions. Since the target date is Dec 31 2008, in this rapid moving market, time adjustment is required to make all the data in the same baseline. However, the detail of time adjustment will not be discussed here because it is already included in the data.
P a g e | 4
Note that the market movement seems stable throughout the seven months but slightly more active during non-winter seasons. The typical time adjustment steps are 1) study graphically the general trend with time; 2) confirm the trend using K-W test; 3)
Sales price and time adjusted sale price has high R-squared = 95.8%.
Summary of the relationship of variables with adjusted sales price
• All 714 sales are open market sales of single family detached dwellings • the land area for all properties is measured in square foot
Neighbourhood characteristics
• there are five neighbourhoods; 55 properties of Alexandra Heights (nbhd code 005), 101 properties on Cairn Wood (nbhd code 014),
P a g e | 5
• 87 properties on Lillian Lake (nbhd code 015), 164 properties on Mount View (nbhd code 072),
• 175 properties on South Cove (nbhd code 046), and 132 properties on West Hill (nbhd code 036)
House characteristics - general
• the lot size of the properties is from 2311 sq ft to 24803 sq ft, mean of 7572.8 sq ft and median of 7127.5 sq ft.
• *year built ranges from 1930 to 2007, median 1993 anad mean 1991.67 • effective year ranges from 1941 to 2007, median 1993 • There are four foundation types; 375 are basement, 247 are crawl space, 56 are 33 are
slab on-grade • Total finished area varies from 616 sq ft to 7477 sq ft. the range(6861 sq ft) is quite huge • *house characteristics-inside • the size of the first floor varis from 546 sq ft to 3609 sq ft, median of 1292 sq ft and mean
of 1322.39 sq ft • 476 properties have second floor; ranges from 309 sq ft to 2115 sq ft. mean of 1057.31
and median of 1065.5 • 432 properties have basement, total basement area ranges from 140 to 2625 sq ft.and
total finished basement area ranges from 0 to 2625 • bathrooms/ • 4 properties have 0 full bathroom; 4 properties have 6 full bathrooms; most properties
have 1 or 2 full bathrooms • most properties have 0 three-quarter bathrooms • 1 property have 3 half bathrooms; most properties have 2 full bathrooms • *most properties have 1 or 2 fireplaces • only 4 properties have 2 multi-car garages; most properties have 1 multi-car garage • most properties have 0 single car garage • only 1 properties have 2 carports; most properties have 0 carport. • *average number of bedrooms are 3.87. varies from 1 to 8. • *only one property has one bedroom • only two properties has eight bedrooms • *the properties has mostly 1 or 2 fireplaces, 16 properties has three fireplaces. • *most houses have one multi-car garage. only four properties has two multi-car garages • most houses have zero single car garages. • *most houses have zero carport. only one property has two carports.
House characteristics - binary variables
• 94.6% of the properties do not have pool • 96.6% of the properties do not have outbuilding • 6.9% of the properties are located around the corner • 2.8% of the properties have outstanding view • 5.5% of the properties have excellent view • 3.7% of the properties have good view.
P a g e | 6
• all properties were sold at 2008. • properties were quite evenly sold between May to Dec, • but more properties were sold on Aug to Nov; about 20-30 more properties than other
months. • the majority of the properties, 33.5%, are second storey single family dwelling, modern
style, builted after 1930. • the quality of the house ranges from • the condition of the homes is either • 235 properties are first floor ; 20 properties have 1.5 storey; 456 2-floor properties • number of full bathrooms are varied from 0 to 6; mostly 1 or 2 full bathrooms
According to rule of 5, • 0 full bathrooms; four properties • 6 full bathrooms; four properties • 3 half bathrooms; one property
• 1 bedroom; one property • 8 bedrooms; two properties
• 2 multi-car garages: four properties • 2 carports; one property
Were eliminated because there is less than 5 data, which does not represent the average market condition.
So far, there were 712 remaining properties.
Graph analysis of the variables
a) scatter plot analysis : adjusted time sales price vs. areas
Adjusted time sales and continuous variables (LotSizeSqft, FinishedAreaFirst_Floor, BasementTotalArea, BasementFinishedArea , second_floor , basementfinishedarea ) were examined. Nothing significant relationship was observed. The highest R-squared value was 0.638, which was Finished area and Adjusted sales price.
The graphical analysis between basementtotalarea and basementfinishedarea showed R-squared of 0.447.
P a g e | 7
b) box plot analysis: adjusted sales price vs. discrete variables
Adjusted sales price and discrete variables such as Neigh, Foundation, Stories, Bedrooms, Fireplcs, Pool, Outbuildings, CornerLot, Views, and ManualClass, were examined.
P a g e | 8
The Noticeable distinctions were the Neighbourhoods - Alexander and Southcode had slightly higher adjusted sales price than other neighbourhoods, second story houses are higher than 1 or 1.5 stories, number of bedrooms are clearly distinct with adj sales price, number of fireplaces are clearly distinct with adj sales price, and manual class are distinct with adj sales price.
Corrleation analysis showed total finishe area (79.9%), assessment improved (68.6$) are closely corrleated with adjusted sales price.
Hence, the variables difficult or not worth to use were;
sales price will not be used; instead, time adjusted sales price will be used because it represents the data better year built; effective year is more efficient information storeys- finished area will include this information sales date is too specific hence not needed manual class will be transformed to linearized manual class random variable does not attribute any value to the price; hence will be not used sales year will not be used because it does not differentiate the data (all are in 2008) will only use deck covered area because simply deck uncovered area is area - deck covered area
Transformations Step 4. Transformations
Variables need to be transformed were
1) Number of bathrooms to be combined into one variable, Bathrooms = FullBath + 0.75*ThreeQtrBath + 0.5*HalfBath.
2) Corner lot, outbuildings, Excellent_view, Outstanding_view, Good_view and pool to be transformed to binary variable, i.e. Yes/ No to 1 and 0
3) All the square footage variables were linearized; they were multiplied by Lin_mancls,
Lin_LotSizeSqft = Lin_mancls * LotSizeSqft.
Lin_FinishedArea =Lin_mancls * FinishedArea.
Lin_First_Floor = Lin_mancls* First_Floor.
Lin_Second_Floor = Lin_mancls * Second_Floor.
Lin_BasementTotal = Lin_mancls* BasementTotalArea.
Lin_BasementFinished = Lin_mancls* BasementFinishedArea.
Lin_DeckCovered = Lin_mancls * DeckCoveredArea.
4) Foundations and Neighbourhoods were transformed as well; into six different binary variables.
The transformed variables were double checked by cross-tabulation by comparing to pre-transformed data, in case of something went wrong.
(The detailed syntax is available in Appendix B)
P a g e | 9
Variable Selection Step 5. Re-examing new variables
The transformed variables were re-examined through scatter plot, box plot, and correlation matrix.
Adjusted sales price and discrete variables (bathrooms, CornorLot2, Outbuildings2, Pool2, Outstanding_view, Excellent_view, Good_view, AlexandraHeights, CairnWood, LillianLake, MountView, SouthCove, WestHill, CrawlSpace, PartialBsmt, and SlabFn) were examined through box plot.
Outstanding view, Excellent view, South Cove, West Hill showed clear distinction. (all moderate to low distinctive results could be found in Appendix A).
P a g e | 10
Scatter plot analysis examined the relationship between adjusted sales price and continuous variables (Lin_LotSizeSqft, Lin_FinisheArea, Lin_First_Floor, Lin_Second_Floor, Lin_BasementTotal, Lin_BasementFinished, Lin_DeckCovered). All of the area variables had positive corrleationwith the adjusted sales price. However mulicolliniearity should be considered hence correlation matrix was examined as well.
According to the correlation matrix, lot size, finished area, first floor area, and second floor area had at least moderate relationship (correlation of 0.6 or high) with the adjusted sales price. However, lot size is also closely related to finished area and first floor, finished area is closely related to first floor, second floor, and finished basement area, etc. This “multicollinearity” is simply double-counting and potentially decreases the effectiveness of the equation. The multicollinearity could be also determined by VIF and significance level on regression analysis.
Model Calibration Step 6: List the Variables for Calibration
*The following 16 variables will be common to each of combinations
bathrooms CornerLot2
outbuildings2 PartialBsmt
SlabFn
pool2 outstanding_view excelleng_view
good_view alexandra heights
southcove westhill
CairnWood LillianLake crawlspace
Fireplcs Stepwise
Regression
In this step, multivariate linear regression (enter) was performed to build the model, which will be further adjusted.
P a g e | 11
Three possible combinations of area variables are designed based on their low correlation.
Any area variables with their correlations greater than 60% were not put in the same model to avoid multicollinearity.
First Test: the sixteen common variables plus lot size, first floor, second floor
Second Test: the sixteen common variables plus total finished area
Third Test: the sixteen common variables plus lot size and finished area
First Run
First test includes the 16 common variables in addition to (Linearized) Lot Size, First Floor, and Second Floor.
Note that no variable has VIF greater than 3.3, Adjusted R^squared = 0.819, and the Standard Error of Estimate is 40395.853.
Second Run
P a g e | 12
The second test includes 16 common variables and Total Finished Area.
VIF of 3.591 was observed in bathroom variable, which is not acceptable because VIF should be less than 3.591. Hence, will exclude bathroom and re-run
Second run - 2
All VIF was under 2, adjusted R square was 0.74 and SEE was 48408.249.
P a g e | 13
Third Run
The third test considered the 16 common variables plus Lin-Lotsize sqft and Finished Area
Adjusted R-Square was 0.812, SEE was 41194.271, and All VIF were less than 3.3
Hence, first test was taken as the best representative based on the highes R-square and lowest SEE, as well as all VIF was less than 3.3.
Hence, the "Best” set of variables was:
Lin_LotSizeSqft Lin_First_Floor
Lin_Second_Floor AlexandraHeights
CairnWood LillianLake MountView SouthCove
WestHill Crawlspace
PartialBsmt SlabFn Bsmt
bathrooms CornerLot2
Outbuildings2 Pool2
Outstanding_view Excellent_view
Good_view.
P a g e | 14
Step 7: model calibration
The model calibration step is calculating the coefficients for each of the variables in the model.
First, comparing R squared, SEE, constant, and coefficients in enter- and stepwise- regressions.
Showed that the R squared, SEE, constant, and coefficients were all the same.
P a g e | 15
Residuals against the predicted values indicated sales 168, 196, and 565 were the outliers based on standard residual 3.3. Sales 168, 196, and 565 were filtered out.
When the regression was re-run, adjusted R-square and SEE were improved but crawl space and basement had VIF greater than 3.3. Hence Crawl space was removed because basement had the higher VIF, and sales number 157 was an outlier.
P a g e | 16
After removing crawl space and sales number 157, the new regression model looked okay. The
following is the final result.
P a g e | 17
The final model has adjusted R-square of 0.826, which is strong because it is greater than 0.8, SEE of 38941.812, and COV is (mean of adj.sales divided by SEE) which is 530985.62/38941.812 is 13.6%, which is acceptable range of 10~20% but ideally it should be lower than 10%, All t-statistics are not within the range of positive and negative 1.6, all VIF were less than 3.3 which means there is no major multicollinearity, and all significance level was less than 0.1, F-statistics was 262.557 and signicance level of the model was 0.000.
The final model is:
Predval = 327643.789 + 24181.094 * AlexandraHeights - 12570.719 * MountView
+ 20877.339 * SouthCove - 23705.343 * WestHill + 24663.88 * Bsmt - 17051.539 * SlabFn
+ 17787.747 * bathrooms - 15875.994 * CornerLot2 + 48134.742 * Outstanding_view
+ 28127.906 *Excellent_view + 7.613 * Lin_LotSizeSqft + 28.421 * Lin_First_Floor +
10.930 * Lin_Second_Floor.
Final Analysis and Conclusions STEP8: Test and Evaluate the Model
In this step, we will test and evaluate the model based on PAR = Predval/Adj_Price.
a) Testing Neighbourhood
Testing Property characteristics
Both scatter and box plot analysis showed there is no relationship between PAR and the variables in the model.
The mean ranks should be approximately 349 (half of 698). All of the neighbourhoods appear to be close to the target value.
P a g e | 18
The Chi-square test indicates the acceptable range of the fair distribution of neighbourhoods.
Examination of Property Characteristics
All of them showed unrelated relationships with PAR.
Correlation matrix
The Correlation is quite low. The variable most strongly related to PAR is Lin_LotSizeSqft, -0.03.
P a g e | 19
b) Stepwise Regression
14 variables are chosen in this model.
This model is acceptable because all of their t-statistics is not within the range of 1.6. It
has adjusted R square of 0.819, the mean value of adjusted sale is 530255.62. The COV is 530255.62 / 40330.79 = 13.15%, which is good but COV less than 10% would
be excellent representation of the data. F statistics is 228.297 with significance level
0.00. Lower significance level indicates good representation. All of the VIF were less than 3.3 and Tolerance greater than 0.3.
Hence, the “best” model has fourteen variables including;
Lin_LotSizeSqft
Lin_First_Floor
Lin_Second_Floor AlexandraHeights
CairnWood LillianLake
MountView SouthCove
WestHill
Crawlspace PartialBsmt
CornerLot2
Outbuildings2
Pool2 Outstanding_view
Excellent_view
Good_view SlabFn
Bsmt
bathrooms
Neighbourhood test
P a g e | 20
The result is quite good.
The overall mean is 1.006 (should be within 1.1%) and the overall median is 1.000 (should be within 2.7%). The COD is 6.1% which is very good indication because it is below than 10%.
The lowest median PAR is 0.993 while the highest is 1.009. All of the 95% interval contains the median if 1.000. Which means there is less than 5% chance the medial PAR is not equal to 1.000.
Hence, there are no further adjustments are needed.
Step 9: Conclusion
The regression model quite well represents the market value for single family dwellings in Midsize City. Based on model test in Step 7 and 8, there is no further adjustment needed. However, this model represents 82% of the market data. Which is fairly good indication, however, one should research further to improve the goodness of the equation by gathering more data. Also all of the houses in the data set were sold in 2008 hence more comprehensive sales date is needed to understand the general price trend of sales price. Also, one should consider the subjective of the views (excellent, outstanding, and good) and the manual class, which was subjectively decided by one appraiser. Overall, the valuation model is credible and could be used by lenders, private vendors, homebuyers, and property tax agents.
Appendix A: Graphs and analysis Step 0:
Checking outliers
P a g e | 21
Step 3: graphical analysis of variables
P a g e | 22
Step 5: Re-examining the variables
P a g e | 23
P a g e | 24
Scatter plot analysis
P a g e | 25
.
P a g e | 26
Step 6: Regression Analysis First run
P a g e | 27
Second run
P a g e | 28
Second run-2
P a g e | 29
Third run
P a g e | 30
Step 7: model calibration
Re-run
P a g e | 31
Final model
P a g e | 32
P a g e | 33 PAR
N Valid 698
Missing 0
Mean 1.0056
Median 1.0001
Range .40
Minimum .81
Maximum 1.21
Percentiles 10 .9129
20 .9384
25 .9488
30 .9573
40 .9790
50 1.0001
60 1.0241
70 1.0493
75 1.0601
80 1.0740
90 1.1059
Model Summary
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate
1 .907a .823 .819 40330.790
a. Predictors: (Constant), Lin_Second_Floor, MountView, SlabFn,
CornerLot2, Outstanding_view, Excellent_view, Crawlspace,
AlexandraHeights, WestHill, Lin_LotSizeSqft, SouthCove, bathrooms,
Lin_First_Floor, Bsmt
P a g e | 34
Step 8:
Neighbourhood analysis
Step Variable Added Adjusted R Squared Standard Error 1 Lin_First_Floor 0.612 59106.45 2 Lin_LotSizeSqft 0.670 54500.135 3 Bathrooms 0.746 47800.484 4 Bsmt 0.769 45627.034 5 SouthCove 0.784 44087.197 6 AlexandraHeights 0.796 42910.725
Linear Regression based on PAR
Model Summaryb
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate
1 .086a .007 -.013 .07434
a. Predictors: (Constant), Lin_Second_Floor, MountView, SlabFn,
CornerLot2, Outstanding_view, Excellent_view, Crawlspace,
AlexandraHeights, WestHill, Lin_LotSizeSqft, SouthCove, bathrooms,
Lin_First_Floor, Bsmt
b. Dependent Variable: PAR
P a g e | 35
7 WestHill 0.802 42276.472 8 Outstanding_View 0.807 41696.602 9 Lin_Second_Floor 0.812 41131.194 10 SlabFn 0.814 40883.494 11 Excellent_View 0.816 40717.662 12 MountView 0.818 40532.234 13 CrawlSpace 0.819 40424.299 14 CornerLot2 0.819 40330.790
Stepwise regression
P a g e | 36
P a g e | 37
P a g e | 38
APPENDIX B: Syntax for Transformation
***step 4 transformation
*variables need to be transfomed
- number of bathrooms are needed to be combine into one variable.
USE ALL.
COMPUTE Bathrooms = FullBath + 0.75*ThreeQtrBath + 0.5*HalfBath.
*- corner lot, outbuildings, and pool will be transformed to binary variable.
RECODE CornerLot ('Y'=1) (ELSE=0) INTO CornerLot2.
RECODE Outbuildings ('Y'=1) (ELSE=0) INTO Outbuildings2.
RECODE Pool ('Y'=1) (ELSE=0) INTO Pool2.
*- views (outstanding, excelleng, good) will be combine into one variable and will be linearized.
RECODE OutstandingView ('Y'=1) (ELSE=0) INTO Outstanding_view.
RECODE ExcellentView ('Y'=1) (ELSE=0) INTO Excellent_view.
RECODE GoodView ('Y'=1) (ELSE=0) INTO Good_view.
*Manual classic description was already linearized.
P a g e | 39
*cross tab
CROSSTABS
/TABLES=FullBath ThreeQtrBath HalfBath BY bathrooms
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
*linearize square footage variables
COMPUTE Lin_LotSizeSqft = Lin_mancls * LotSizeSqft.
COMPUTE Lin_FinishedArea =Lin_mancls * FinishedArea.
COMPUTE Lin_First_Floor = Lin_mancls* First_Floor.
COMPUTE Lin_Second_Floor = Lin_mancls * Second_Floor.
COMPUTE Lin_BasementTotal = Lin_mancls* BasementTotalArea.
COMPUTE Lin_BasementFinished = Lin_mancls* BasementFinishedArea.
COMPUTE Lin_DeckCovered = Lin_mancls * DeckCoveredArea.
EXECUTE.
*transform neighbourhoods
*Note: changed Nbhd from String to Numeric.
COMPUTE AlexandraHeights = 0.
COMPUTE CairnWood = 0.
COMPUTE LillianLake = 0.
COMPUTE MountView = 0.
COMPUTE SouthCove = 0.
COMPUTE WestHill = 0.
IF (Nbhd = 005) AlexandraHeights = 1.
IF (Nbhd = 014) CairnWood = 1.
IF (Nbhd = 015) LillianLake = 1.
P a g e | 40
IF (Nbhd = 072) MountView = 1.
IF (Nbhd = 046) SouthCove = 1.
IF (Nbhd = 036) WestHill = 1.
EXECUTE.
*transform foundation.
COMPUTE Crawlspace = 0.
COMPUTE PartialBsmt = 0.
COMPUTE SlabFn = 0.
COMPUTE Bsmt = 0.
IF (Foundation = 'CRAWL') Crawlspace = 1.
IF (Foundation = 'PBSMT') PartialBsmt = 1.
IF (Foundation = 'SLAB') SlabFn = 1.
IF (Foundation = 'BSMT') Bsmt = 1.
EXECUTE.