introduction to regression analysis - avttools.com intro reg 2...and the regression plus, and the...
TRANSCRIPT
1
Introduction to Regression Analysis
2
Automated Valuation Technologies, Inc. This introductory workbook was created by Automated Valuation Technologies, Inc. (AVT). The purpose of AVT is to fill the voids in appraisal practice that result from the rapidly changing appraisal environment. Appraisers often find themselves engaged in new activities which require the use of technology that has not yet reached their computer desktops. This is both unfortunate and unacceptable. It is unfortunate because appraisers are not fully effective in carrying out their duties. It is unacceptable because it compromises the vital role appraisers perform in the safekeeping of their country’s greatest wealth: real property. It is AVT’s mission to provide technologies real estate appraisers require to fulfill their duties. AVT operates under the belief that there is no substitute for the “Neighborhood Appraiser.” Their knowledge of the local market is unique and cannot be duplicated by remote computer analysis. These local appraisers are hardworking and dependable. Without question, these gritty individuals will carry out their duties as long as they have the knowledge, training, and tools to do so.
This Workbook is copyrighted. © 2010 Automated Valuation Technologies, Inc. 215 Ellis Avenue, Maryville, TN 37804
3
David A. Braun, MAI, SRA
David A. Braun, MAI, SRA, has been an appraiser in the Greater Knoxville area, since 1976. He was born in St. Louis Missouri, and later moved to Knoxville, Tennessee. He is a 1976 graduate of the University of Tennessee with a Bachelor’s of Science degree in business administration. David received the Appraisal Institute’s SRA in 1980 and MAI in 1999. David has been appraising most types of properties since 1976. He is also a licensed real estate broker with sales and management experience. David founded Braun & Associates, Inc. in 1983 in Maryville, Tennessee. Braun & Associates, Inc. is a medium-sized appraisal firm with 12 employees and handles most types of assignments and property types. He is also president and founder of Automated Valuation Technologies, Inc. (AVT). AVT produces unique appraisal seminars and software for the practicing appraiser. David has had “how to” articles published in the Real Estate Valuation Magazine, The Working RE Magazine, and presented white papers on appraisal topics. He has twice served as President of his local Chapter of the Society of Real Estate Appraisers (now a part of the Appraisal Institute). He has served on the Appraisal Institute’s Instructor Subcommittee at a national level. David is currently an approved Appraisal Institute Instructor and is certified by the Appraisal Foundation as a USPAP Instructor. David has been a pioneer in developing and reporting the scope of work. He developed and taught the first scope of work seminar, and was the first to use a computer program to aid in the development and reporting of the scope of work. He authored, APPRAISING IN THE NEW MILLENNIUM; Due Diligence & Scope of Work. David has developed numerous applications that run on macros in MS Excel® some of which are free and some are for purchase. These can be found at AVT’s website. He has designed, but did not program other applications such as the “USL Documenter” and the Regression Plus, and the Fusion. David has made presentations to the Federal Financial Institutions Examination Council (FFIEC), and to the Association of Appraiser Regulatory Officials (AARO) on the topic of scope of work.
4
TableofContentsI. FORWARD............................................................................................................................. 5
II. INTRODUCTION .................................................................................................................. 6
A. Review .............................................................................................................................. 7
B. Winding up Step 8; Testing the Model ............................................................................ 9
III. FINALLY…REGRESSION ANALYSIS ......................................................................... 13
A. HowRegressionWorks ............................................................................................. 13
B. Exercise: Number Games .............................................................................................. 16
IV. STEPS IN THE REGRESSION ANALYSIS.................................................................... 17
A. Steps in the Regression Analysis ................................................................................... 20
B. Gauging the Reliability of the Value Prediction ............................................................ 22
V. CASE STUDIES ................................................................................................................... 26
A. Case Studies 1 & 2 ......................................................................................................... 26
VI. CONCLUSION .................................................................................................................. 27
VII. QUESTIONS ..................................................................................................................... 28
5
I. FORWARD After trying several training methods we have concluded that workbooks supplemented with videos is the best avenue for success. This is an inexpensive and flexible learning environment. This is the second of a five part series:
1. An Introduction to Market Modeling 2. Introduction to Regression Analysis 3. Trending Values over Time 4. Reporting Your Regression Analysis 5. Extracting Sales Adjustments with Regression
The first three workbooks will empower the valuator to form credible value opinions with regression analysis. The fourth workbook shows the valuator how to report this value in a USPAP compliant manner Finally, the fifth workbook explores the phenomena of extracting sales adjustments with the Regression Plus software.
6
II. INTRODUCTION In this workbook we will:
…wind up our novice level understanding of testing the market model by
introducing:
o Coefficient of Determination (R2), and
o Adjusted R2
…explore how regression analysis calculates the coefficients.
…Introduce the P-factor
…test for curvilinear relations with residual trending of the variables.
…learn the techniques to improve the model provided by the regression
analysis.
…identify where the appraiser influences the regression analysis.
…perform some case studies.
We will not consider the market conditions (time) adjustment in this workbook. This
topic will be addressed in Workbook 3 to allow a steady progressive learning process.
The good news is that we will not really learn any new theory in Workbook 3, we will just
apply or knowledge to some specific techniques.
7
A. Review Basically, valuation modeling has the following steps:
1. Data Gathering (Research)
2. Identifying the Variables (Property Characteristics)
3. Choosing the unit of Measure for each Variable
4. Data-basing
5. Scrubbing the Data
6. Estimating the Value of the Coefficients
7. Building the Model
8. Testing the Model
9. Applying the Model to the Subject Property
Regression Analysis is one method or technique to perform Step 6: Estimating the
Value of the Coefficients. The point is that most of what is done to form an opinion of
value with regression is really market modeling. The essence of regression analysis is
limited to Step 6 above.
8
Here are some things that are necessary to perform regression analysis that valuators
have little or no experience in:
The difference in building models as opposed to the “comparative” nature of a
traditional sales approach;
how to organize and scrub data,
how to test a model for accuracy, and
how to actually perform the regression analysis.
Four premises were presented:
Premise 1 The scope of work to extract sales adjustments with regression analysis is much more
extensive and time consuming than simply forming an opinion of the final value.
Premise 2
Regression analysis naturally leads to the correct final value opinion even when
multicollinearity is present. It tends to allocate the value of property components which
are piggy-backing on each other in a manner that will result in the correct value even
though the coefficients of those related property characteristics may not be accurate on
an individual basis.
9
Premise 3
If a market model of 30 or more comparable sales has a high degree of accuracy, and
the subject property is very similar to the sales used in the analysis; then the predicted
value for the subject property will have a high degree of reliability. This topic will be
covered again in “Workbook 2; An Introduction to Regression”.
Premise 4
To lead to a credible value opinion comparables must have similar value relationships
(coefficients) to the subject property under the current market conditions. This results
because specific sets of purchasers tend to shop this specific collection of properties.
B. WindingupStep8;TestingtheModel
In workbook 1 we discussed testing the model by means of residual analysis. We
presented three methods:
Average Absolute Error (AAE)
Market Model’s Predictability for best 90% of the observations (MMP)
Plotting the residuals (as a percentage) on a chart
10
At this time we will conclude our discussion of testing the model by introducing the
coefficient of determination (R2). We already know a lot about R2 based on the
discussion of the correlation coefficient (r) presented in Workbook 1. Mathematically R2
is simply the correlation coefficient squared. So, to understand R2 we need to review
what the correlation coefficient is.
The Pearson’s correlation coefficient is a measure of the linear relationship between two
sets of numbers. We will use MS Excel to make this calculation. We will examine the
formula in the intermediate level material.
Consider some examples:
SET 1 r R2
Independent (A) 2 4 6 8 18 12 14 16 1.000 1.000
dependent (B) 20 40 60 80 180 120 140 160
SET 2 r R2
Independent (A) 2 4 6 8 18 12 14 16 ‐0.252 0.063
dependent (B) 8 12 6 2 1 9 13 7
Which of these two sets of numbers could you best predict what number in row (B)
should correspond to the number 10 in Row (A)?
It is easy to predict that in Set (1) a 100 in Row (B) would relate to a 10 in row (A). In
fact, all of the numbers in row (B) are a multiple of 10 of Row (A). This gives it a perfect
correlation of positive 1. Set 2 is not as easy to predict because the two sets of
numbers are not very correlated.
11
R² = 1
0
50
100
150
200
0 5 10 15 20
SET 1
R² = 0.0633
0
2
4
6
8
10
12
14
0 5 10 15 20
SET 2
Consider their plot charts:
SET 1 has a really good (in fact,
perfect) linear relationship
between the two data sets. This
fact is borne out by the
correlation coefficient of a
positive 1. It is a positive number
because the line moves in an
upward direction.
Set 2 has very low linear
relationship between the two
data sets. These two data sets
have a correlation coefficient of
minus 0.252. A coefficient of
zero indicates no relationship. It
is a negative number because
the line moves in an downward
direction.
12
Now imagine that both Row (A)’s are the actual sales prices and both Row (B)’s are
what the model predicts for each actual sales price. Based on the linear relationships
between the actual sales prices and what the two models predicted, I would hold that
the model used in SET 1 does a better job of prediction than the model used in SET 2.
The correlation coefficient is simply a residual analysis technique.
We could stop here, but most everyone in the world prefers to express this linear
relationship by squaring the correlation of coefficient (r) which results in the coefficient
of determination (R2). The coefficient of determination is very misunderstood by most
appraisers, this lack of understanding often results in incorrect conclusions to their
analysis. R2 will always be less than the absolute of r (unless it is a (-1, 0, or +1). What
the difference in the two measures represents will be covered in the intermediate level
material. For now I am pulling a Sargent Friday, and just presenting “the facts mam,
just the facts”. However, novice level students need to know when an R2 becomes low
enough to be of concern. While an R2 can be as low as 0.60 and the model still be
useful, the analyst should be suspicious of a model with an R2 of less than 0.70.
Go to Excel File “WB2-Problems.xls”
Solve Problem 1, then watch WB2-Video 1,
zzzzzand then solve Problem 2.
The R2 works best when there is a high ratio of sales to variables (property
characteristics). The Adjusted R2 adjusts the calculation of R2 to make it more
meaningful when there are few sales with many variables. Valuation analysts will tend
to rely heavily on the adjusted R2 because of the limitations on the number of sales in
the real estate markets.
13
III. FINALLY…REGRESSIONANALYSIS We are now at Step 6; Estimating the Value of the Coefficients. This step requires
identifying which variables have a relationship with the over-all value of the property.
One method of identifying which property components contribute to the value, and how
much each unit of those components contributes is regression analysis.
A. HowRegressionWorks Regression is a profound analysis tool that can help the appraiser form an opinion of
market value by creating a market model which the appraiser can apply to the subject
property. Its framework is like all market models, but how it builds the market model is
unique and easy to understand when considering only one property characteristic
(variable). Utilization of one variable is called “simple” regression.
14
The model can be visualized by plotting the data points on a grid and then positioning a
straight line within those plotted points. The objective is to position the line so that the
sum of the differences between each plotted point and the line squared is minimized
when totaled. This is called the “sum of the least squares” method. Notice that the line
placed within the plotted points is a straight line.
There are other types of regression analysis that will be discussed in more advanced
classes. This seminar will only involve “linear regression based on the sum of the least
squares”.
HOW DOES REGRESSION ACTUALLY DO THE CALCULATIONS?
The sum of the least squares is one method of building the universal model equation
like the one presented earlier for three property characteristics:
Y = (ax1) + (bx2) + (cx3)
For simple regression we will use only one property characteristic and consider the
intercept (where the regression line intersects with the Y-axis.
Y = (ax) + intercept
15
Go to the File “WB-2 Problems.xls” and work through Problem 3, them watch Video 2
Some of the advantages of the “sum of the least squares straight line method” of
regression are:
It is functional for real estate uses such as valuation predictions and extracting
adjustments.
It is straight forward enough to be understood and implemented by appraisers.
This method is proven by mathematicians, and is accepted as valid by the users
of appraisals.
16
B. Exercise:NumberGames
Go to the File “WB-2 Problems.xls” and work through
Problems 4-6, then watch Video 3
17
IV. STEPSINTHEREGRESSIONANALYSIS
There are two tools that aid in the building of the model that have not yet been
discussed. These are the P-factor and residual trend analysis of the individual variables
(property characteristics). These are presented in the regression sections because they
are especially related to regression.
The P-factor is a statistical check that
attempts to verify if the relationship (shown
in the analysis) really exists or did a
coefficient just occur by chance. A low P-
factor is evidence that a relationship
between the specific variable and the over-
all value does exist. A high P-factor is
evidence that what the regression analysis is
showing is of no consequence because
there is no relationship between the variable
and the value. In the Regression Plus a “green” color code indicates a low P-factor, a
“yellow” color code means be careful, and a “red” color code indicates that no
relationship exists. In the illustration above the P-factor indicates that there is a
relationship between the GLA, Age, and Pool and the over-all value. However, there is
evidence that no such relationship exists for Lot size.
The thresholds are set as; Green <= 0.1, Red >0.3, and Yellow between 0.1 and 0.3.
These thresholds are set liberally based on the targeted over-all accuracy, and on the
fact that the appraiser has some feel (experience) for what property characteristics
effect value and which do not. If the P-factor is high then simply unselect that variable
so it is not considered in the analysis.
18
The residual trend analysis of the individual variables is a test that identifies variables
which have a curvilinear relationship to value. Least squares linear regression always
plots values along a straight line. The model’s accuracy will be degraded if the value
relationship is curvilinear but is being measured by a straight line. The following charts
are from the Regression Plus software. It charts the residual trend by each individual
variable. The first chart shows that there are no residuals for “GLA” ranging from 1,150
to 1,750 square feet.
19
However, in the following chart there are significant residuals when trending them by
size. This type of a trend indicates that there is a curvilinear relationship between the
size and the over-all value.
If a variable has a curvilinear relation the analyst should attempt to remove some of the
data points where the relationship tends to show up. If too many observations must be
removed it may be necessary to remove that variable entirely. Another alternative
which will be covered in the intermediate material is data transformation. This involves
changing the data so it results in a linear relationship.
20
A. StepsintheRegressionAnalysis
Once the data has been gathered, scrubbed, and loaded into the regression program
we are ready to perform Steps 6 and 7:
Step 6; Estimating the value of the coefficients, and
Step 7; Building the model.
In order to do this we will have to utilize what we have learned about Step 8; Testing the
model. Developing the model is basically a trial and error process. The idea is to run it
and gauge its quality, then modify it and see if the model got better or worse. We can
organize this trial and error process as much as possible with the goal of having some
efficiency. The following tasks should be performed in the following order after the data
has been loaded and the date of value has been set:
Premise 6
You can improve most any model that the regression analysis first returns.
This premise goes a long way in proving that any value opinion formed by the valuation
analyst based on regression analysis is not an automated Valuation model (AVM).
Although, with enough pre-programming, you could use regression analysis as the
basis for an AVM.
21
1. Tasks for Step 6
a. Data Check
i. Check for blanks or text fields (cell will color coded “red”).
ii. Be sure each dummy field has several “hits”.
b. Be sure there is only one variable for each property characteristic.
c. Unselect one of each set of dummy fields (unless one field was never
loaded).
d. Except for “Dummy” fields unselect any coefficients that have an
unrealistic value (especially the incorrect sign).
e. Unselect any variables that have high P-factors
f. Correct or remove any outliers.
g. Run the residual analysis for each variable.
h. Gauge the quality of the model.
i. Examine the Avg. Abs. Errors and the MMP at 90%.
ii. Examine the R2 and the adjusted R2.
iii. Enter the subject property information
iv. Consider the weights applied by the model on each variable.
i. If the quality of the model meets or exceeds the desire specifications, then
stop, if not continue with task “2” below.
2. Next Run
a. Modify the model
i. This could be removing or adding back a sale, removing or adding
back a variable.
b. Go to step “d” above
It typically takes between 10 and 20 runs to maximize the quality of the market. The
“stepwise” method is often employed to Task (2. a.). This method is not covered here
as it is discussed in most books which cover regression analysis. In some regression
programs these runs are sometimes completed by computer programming. The
Regression Plus program purposely does not automate these tasks because it is
22
important for the valuation analyst to personally perform and control this and every other
part of the valuation process.
B. GaugingtheReliabilityoftheValuePrediction
The reliability of the value prediction is based on:
1. The quality of the model, and
2. how well the subject property fits the sales data used in the model.
The model and the predicted value for the subject property are two distinct issues. The
accuracy of the model can easily be distinguished by residual analysis and other issues.
However, the reliability of the predicted value of the subject property varies based on
how “similar” it is to the sales data used to determine its predicated value.
Let’s say that you have examined and considered the following evidence that led to the
market model’s development:
1. The number of observations
2. Quality of the sales data
3. The AAE and the MMP at 90%
4. The reasonableness of the model’s coefficients
5. The weights applied of each variable applied to form the prediction
6. The R2 and adjusted R2
7. The P-factors
8. The residual analysis check for curvilinear relationship of each variable.
Let’s say in this hypothetical example the AAE is 5% and the MMP for the best 90% of
the residuals is 12%. This means that the actual sales used had residuals from 0% to
more than 12%. On average the reliability factor of the subject will be the AAE.
However, the subject just like the comps has an expected range in value from 0% to
23
over 12%. Where in this range is reasonable for the predicted value depends on the
similarity of the subject property to the sales used in the analysis.
In this example let’s say that three variables were considered, GLA, Age, and Land size.
The sales used in the analysis have an average size of each of these variables as 1,850
square feet, 9 years old, and 1.7 acres. If the subject property had exactly those
averages then its expected range of value would be less than the model’s AAE. If it’s
components were very similar then its expected range would be about what the AAE is
or just a little larger. The farther it is from the average of the data the higher its
expected range of value will be.
The Regression Plus software gives you
two places to compare the subject to the
averages of the sales for each variable.
The first is under “Valuation Settings”
where you enter the subject property’s
variables. When the Regression Plus first
opens it runs the regression analysis
based on the average of the sales data.
In the below caption the age of 14 is the
average age of all of the sales, the
average for bedrooms is 3, and the
average square feet is 1,853. The other
variables shown are yes/no (1/0) answers
so there averages are not really
meaningful. This gives the analyst a pretty good feel of how similar the subject is when
the subject’s data is entered over the averages.
24
The second method is the output for
the subject’s deviation from the
average of the sales denoted as
“SubDev” in the output. The standard
deviation is color coded as “Green” for
good, “yellow” for being within the
range of the sales data, and “Red” for
being outside of the range of the sales
data. This color coding is based on
the following thresholds in standard
deviation; Green < 1.0, Yellow >= 1.0
but within the observation range, and
Red = outside of the observation
range.
Notice that in this case all of the variables are color coded green except the Days since
sale. This will typically be the case unless the value opinion is retrospective. Be
concerned any time the subject property has a component that is very different than the
sales data. This is true for the days since sale as well. This important topic will be
covered in Workbook 3.
Premise 6
The reliability (reasonable range of value) for the subject property is affected by
both the quality of the market model and by how similar the subject property is to
the sales used.
25
The Regression Plus software forces the analyst to consider all of the factors by
requiring the following items to be rated before the analysis report can be copied or
printed:
The rating given for the “Over-all Confidence Rating” considers both the quality of the
model and the fit of the subject property to the sales used.
Premise 7
Two assumptions of regression analysis are:
1. That there is a non-curvilinear relationship between the variable and the
dependent variable.
2. That the residuals are both positive and negative.
These will be explained in the intermediate level material.
26
V. CASESTUDIES
“Knowledge is of no value unless you put it into practice.”
~ Anton Chekhov
A. CaseStudies1&2 Open the Excel workbook “Workbook2-Problems.xls” The correct coefficients and values for these case studies are: Site $0.00, GLA $72.00,
Foundation (Crawl $8,000, Full Unfin. $16,000, Full ½ Fin. $24,000, Full All Fin.
$32,000), Age -$2,800 per year, Bedrooms $0.0, Bath Full $5,000, Bath Half $3,000,
Neighborhoods (Hamptons $0.0, Glades -$5,000, Clark Addn. $7,000).
Site 0.6 $0 $0 GLA 1,577 $72.00 $113,544.00 Full 1/2 unfin $24,000 Age 14 ($2,800) ($39,200)BR 3 $0 $0.00 Bath Full 3 $5,000 $15,000.00 Bath Half 1 $3,000 $3,000.00
Clark Addn. $7,000
Total $123,344
27
VI. CONCLUSION
At this point you know a lot about regression analysis. Workbook 3 will contain a lot of
information about trending values over time. Once that workbook is completed you will
be ready to put regression into use for your everyday appraising.
I suggest that after completing Workbook 3 that you begin running a regression analysis
in conjunction with each appraisal you perform. Competency comes from
understanding theory, training, and practice.
Be sure and address the “Discussion Questions” in the Forum under Workbook 2.
28
VII. QUESTIONS
1. The “adjusted R2” will typically be ___________ as R2:
a. About the same
b. Typically lower
c. Typically higher
d. None of the above
2. The “sum of the least squares method” is used to…
a. to calculate where the regression line should be positioned
b. to calculate the what the intercept is
c. to calculate the number of points to plot
d. both (a) and (b)
3. The P-factor is a statistical tool that is used to…
a. establish the amount of the coefficient
b. establish if the amount of the coefficient is correct
c. establish if there is a relationship between the variable and the over-all
value
d. all of the above
4. One assumption of regression analysis is…
a. that a curvilinear relation exists between the variables and over-all value
b. that a non-curvilinear relation exists between the variables and over-all
value
c. that a curvilinear relation exists between the variables and coefficient
d. that a non-curvilinear relation exists between the variables and coefficient
29
5. Another assumption of regression is that…
a. all of the residuals have positive signs
b. all of the residuals have negative signs
c. some residuals have positive and some have negative signs
d. none of the above
6. When utilizing regression which part of the valuation process is truly regression?
a. Gathering the sales
b. Data-basing
c. Applying the model to the subject
d. Estimating the value of the coefficients
7. Of the nine steps listed as valuation modeling which steps does the Regression
Plus software control for the valuation analyst?
a. Steps 5 & 7
b. Steps 8 & 9
c. All of the steps
d. None of the steps
8. The most meaningful analysis to test the quality of a market model is…
a. R2
b. AAE
c. MMP
d. All of the above because they are all residual based analysis
30
9. Some of the advantages of the “sum of the least squares straight line method” of
regression are:
a. It is functional for real estate uses such as valuation predictions and
extracting adjustments.
b. It is straight forward enough to be understood and implemented by
appraisers.
c. This method is proven by mathematicians, and is accepted as valid by the
users of appraisals.
d. All of the above
10. It typically takes between __________ runs to maximize the quality of the
market.
a. 1-5
b. 5-10
c. 10-20
d. None of the above
11. The rating given for the “Over-all Confidence Rating” considers:
a. Both the quality of the model and the fit of the subject property to the sales
used.
b. The quality of the model.
c. The fit of the subject property to the sales used.
d. None of the above
31
12. If the AAE is 6.0% and the MMP is 12.0% and all of the subject property’s
characteristics are about average for the sales data used, you should expect the
reliability of the value opinion to be about…
a. over 12.0%+-
b. 7.0% to 12.0%
c. 6.0% or less
d. None of the above