Robust Hedonic Price Indexes
Steven C. Bourassa
Department of Urban and Public Affairs, University of Louisville, 426 W. Bloom Street,
Louisville, Kentucky 40208, USA, email: [email protected]
Eva Cantoni
Research Center for Statistics and Department of Economics, University of Geneva,
40 boulevard du Pont-d’Arve, CH-1211 Geneva 4, Switzerland, email:
and
Martin Hoesli
University of Geneva (HEC), Swiss Finance Institute, 40 boulevard du Pont-d’Arve,
CH-1211 Geneva 4, Switzerland, University of Aberdeen Business School, Scotland, and
Bordeaux Management School, France, email: [email protected]
April 19, 2013
Abstract
Using single-family sales data for Louisville, Kentucky, we assess the benefits of applying
robust methods to down weight problematic transactions in the context of longitudinal
hedonic price models. Robust estimators can reduce the influence of outliers that are
due to data entry errors, unmeasured hedonic characteristics, or non-market
transactions. We use simulation analysis to compare conventional indexes with several
robust indexes and conclude that a robust S-estimator is least biased relative to a true
(simulated) underlying index. We then compare a robust index with a conventional
index and a conventional index that controls for distressed sales and find that it is much
closer to the latter, except in a submarket with a large proportion of distressed sales.
We also demonstrate that robust methods can substantially eliminate the revision
problem in longitudinal hedonic indexes.
Key Words: house price indexes, hedonic models, robust methods, distressed sales
JEL Codes: R31
Introduction
Hedonic models are widely used to construct constant-quality house price
indexes; for example, they are the primary measures of price movements in France,
Norway, Switzerland, and the United Kingdom.1 Hedonic modelling raises a variety of
practical issues relating to, among other things, choice of estimation technique,
specification of relevant hedonic characteristics, and handling of data problems. These
issues are inter-related. For example, spatial estimation techniques have been
developed to overcome the inability of hedonic characteristics to fully capture spatial
relationships. Robust estimation techniques, the focus of this paper, are intended
mainly to correct data problems.
1 The United States, where repeat sales methods predominate, is an exception to this
rule. Other exceptions include the countries Denmark, New Zealand, and Sweden,
which rely upon sale price-appraisal ratio (SPAR) indexes (Bourassa, Hoesli, and Sun,
2006) and Australia, where a stratified median index is used (Prasad and Richards,
2008).
The types of data problems that might occur in a hedonic context include sample
selection bias, which refers to the differences in implicit prices of hedonic characteristics
or price movements between the entire housing stock and the sample of houses that
actually transacts (Gatzlaff and Haurin, 1998). Sample selection bias correction
techniques can, at least in theory, be used to address this problem. Although robust
techniques are not helpful with respect to sample selection problems, they are useful
with respect to several other common data problems. These include non-market
(discounted) prices on the left-hand side of the hedonic equation, relevant attributes
missing from the right-hand side of the equation, and measurement errors on either
side.
These data problems can lead to outliers that bias indexes in undesirable ways.
Conventional hedonic regression models are highly sensitive to these outliers because
they are estimated by minimizing the sums of the squared residuals. This gives outliers,
and particularly large outliers, disproportionately large influence. Even a small number
of outliers can have a large effect. If these outliers represent incorrect data, in the
sense that the data are not consistent with the assumptions of the model, then least
squares estimation can be biased and inconsistent. Robust methods generally down
weight observations automatically based on the size of their residuals and thereby
provide a means for producing consistent and possibly efficient estimators and test
statistics when a model is somewhat misspecified because its assumptions are not fully
consistent with the data.
Problematic data include distressed sales, non-arm’s length sales, and
typographical errors.2 For example, discounts associated with distressed sales, such as
foreclosure and short sales, may bias indexes downwards. These types of forced sales
may be discounted because the seller is unusually eager to reach a deal or appear to be
discounted because there is unmeasured deterioration in quality (Campbell, Giglio, and
Pathak, 2011; Pennington-Cross, 2006). Some types of problematic transactions may be
flagged in some hedonic data sets. In other cases, it may be possible to identify these
transactions, but only with considerable investment of time and effort. Typographical
errors may be impossible to identify with any degree of certainty. Robust methods
provide a means for responding to data problems when it is difficult or impossible to
identify all of the transactions with contaminated data.
2 We also considered flips as a possible type of problematic data. Flips involve two
transactions of the same property in relatively quick succession (e.g., within a year),
with some upgrading of the property between transactions. The improvements will
often involve characteristics that are not measured by the available hedonic variables.
Consequently, the first transaction in a flip may appear to be at a discount and the
second transaction at a premium relative to the market. In preliminary estimations of
our model, however, we did not find evidence of bias due to flips.
What constitutes problematic data depends on the purpose of the index (Wang
and Zorn, 1997). Some indexes are designed to treat the market as a portfolio and to
track changes in the value of the portfolio over time. Such indexes are value-weighted
(i.e., weighted by the value of each transaction) rather than equal-weighted. An
example of a value-weighted index is the S&P/Case-Shiller index (Standard & Poor’s,
2009). For such an index, down-weighting distressed sales might not be desirable
because such sales might be valid contributors to the value of the market portfolio.
However, in other cases, the aim is to track constant-quality price movements for
properties that are not distressed. In such a case, an equal-weighted index is
appropriate, as is controlling for the downward bias caused by distressed sales. Our
purpose here is consistent with the latter aim of measuring movements in typical
market-rate transactions.
Because robust techniques aim to model the behavior of the majority of the
data, they respond differently to different market circumstances. In a “normal”
market—defined as a market that does not have a large proportion of forced or
distressed sales—such transactions will tend to be classified as outliers and robust
techniques will prevent them from biasing an index. However, as the proportion of
distressed sales increases (becoming the majority), they are less likely to be classified as
outliers and hence less likely to be down weighted by robust techniques. In such cases,
robust indexes will tend to be similar to conventional indexes estimated using ordinary
least squares (OLS). We give an example of this by calculating a robust index for a
submarket in Louisville that has a large percentage of distressed sales.
Robust methods have been used in a variety of fields such as biostatistics
(Heritier, Cantoni, Copt, and Victoria-Feser, 2009) and sociological and political research
(Andersen, 2008), but their applications to real estate have been limited.3 Belsley, Kuh,
and Welsch (1980) use a robust M-estimator to analyze a hedonic model of house prices
in the Boston area.4 Thorson (1994) and Janssen, Söderberg, and Zhou (2001) use a
robust least median of squares estimator in hedonic models of agricultural land values
in Illinois and apartment building incomes and prices in Stockholm, respectively. Yoo
(2001) applies a least absolute deviations estimator (also known as a median quantile
estimator) in a hedonic model of house prices in Seoul, Korea. Finally, Song and
Wilhelmsson (2010) use a robust M-estimator method to identify and down weight
outliers in a hedonic price model of condominium prices in Stockholm.5 Two papers
apply robust techniques to construction of repeat sales indexes. McMillen and Thorsnes
(2006) conclude that a median quantile estimator suffers less bias from positive outliers,
such as unobserved renovations, than the conventional estimator. Bourassa, Cantoni,
and Hoesli (2013) show how robust techniques improve repeat sales house price index
estimation and reduce the magnitude of revisions as new data are added over time.
3 The term “robust” is used fairly widely in real estate and other types of statistical
research to refer to regression models with various types of desirable features, such as
standard errors that account for heteroscedasticity. Here we are concerned with
techniques designed to reduce the influence of unusual observations.
4 In contrast to least squares, M-estimators minimize some function that gives
decreasing weights to observations as the size of the standardized residual increases
(see, e.g., Huber, 1964).
5 Peña and Ruiz-Castillo (1984) apply Cook’s distance statistic (Cook, 1977) to identify
outliers in a hedonic study of apartment rents in Madrid. They discuss robust
M-estimators, but do not apply them.
We use a large dataset of house sales for Louisville, Kentucky, for the period
1998Q1 to 2010Q2 to investigate the usefulness of robust regression methods in
producing more reliable house price indexes. The time period is of interest as it covers
both bull and bear housing market conditions. Bias related to distressed properties is
likely to be more severe in bear markets. Data entry errors and other data problems
could occur at any time. Our data allow us to identify most distressed transactions and
assess their impacts on hedonic price indexes.
After conducting a simulation analysis to determine the best robust estimator for
problems like ours, we compare a robust hedonic index to a conventional OLS index and
an OLS index that controls for distressed property sales. We find that the robust index
for the market as a whole tracks the latter index more closely than the conventional
index because distressed sales are outliers in the market-wide context. However, for a
submarket in which distressed sales constitute a substantial proportion of the
observations, they are not outliers, and the robust index tracks the conventional index.
We also demonstrate that robust methods can virtually eliminate the index revision
problem.
The next section reviews the hedonic models that are typically used to construct
price indexes. The subsequent section discusses robust estimators and their application
to hedonic modeling. We then discuss our data and provide some summary statistics.
The subsequent section applies simulation analysis to compare alternative robust
estimators. Next, we compute and compare robust and conventional indexes for
Louisville. The penultimate section analyzes index revisions and the final section
concludes.
Hedonic Price Models
The hedonic model has a measure of house sale prices on the left-hand side and
a set of property characteristics on the right-hand side. These “hedonic” characteristics
typically describe aspects of the structure and lot, as well as the location of the
property. If the data cover multiple periods of time, then a series of time dummy
variables is included to control for changes in price levels over time. In addition, a set of
spatial dummy variables can be used to capture submarket effects. This approach
allows the model’s intercept term to vary over time and space, but holds the implicit
prices of the hedonic characteristics constant. These implicit prices can be interpreted
as averages over the time period covered by the model. With both time and submarket
dummies, the model is:
$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{P} \gamma_t D_{it} + \sum_{m=2}^{M} \delta_m S_{im} + \varepsilon_{it}, \qquad (1)$$
where $Y_{it}$ is the natural logarithm of the sale price of house i at time t, $X_{it}$ is the vector of
property characteristics for house i at time t, the superscript T indicates the transpose
function, $D_{it}$ is a set of P – 1 time dummy variables equal to 1 if house i sold at time t and
0 otherwise, $S_{im}$ is a set of M – 1 submarket dummy variables equal to 1 if house i is
located in submarket m and 0 otherwise, $\alpha$ is the intercept term, $\beta$, $\gamma_t$, and $\delta_m$ are
vectors of parameters to be estimated, and $\varepsilon_{it}$ is a random error term distributed
$N(0, \sigma^2)$. The antilogarithms of the time dummy coefficients are used to construct a
price index, with the index number $I_1 = 100$ and $I_t = \exp(\gamma_t) \times 100$ for $t > 1$.
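The index construction described for equation (1) can be illustrated with a minimal sketch. The coefficient values and the function name below are hypothetical, chosen only to show the arithmetic; they are not estimates from the Louisville data.

```python
import math

def index_from_time_dummies(gammas):
    """Turn time-dummy coefficients (periods 2..P) into index numbers.

    The base period has no dummy, so its index number is fixed at 100.
    """
    return [100.0] + [100.0 * math.exp(g) for g in gammas]

# Hypothetical coefficients (in log points) for periods 2, 3, and 4.
gammas = [0.02, 0.05, -0.01]
index = index_from_time_dummies(gammas)
print([round(v, 2) for v in index])
```

Because the dependent variable is in logarithms, each coefficient measures the log price difference relative to the base period, which is why the exponential is taken before scaling by 100.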
As an alternative to using a “longitudinal” model with time dummies, price
indexes can be constructed by chaining together price increases based on the values of a
typical house as predicted by pairs of successive single-period estimations. The “typical”
house can be defined in terms of the average characteristics in the first or second
periods or a geometric mean of the two (Laspeyres, Paasche, and Fisher indexes,
respectively). A disadvantage of this approach is the need for a substantial number of
transactions each period; small samples may result in volatility in the coefficient
estimates due to changes in the characteristics of properties sold over time. Here we
focus on the application of robust estimators to longitudinal hedonic modeling.6
6 A variation on the longitudinal approach involves moving windows containing a fixed
number of time periods (see, e.g., Bourassa, Hoesli, Scognamiglio, and Sormani, 2008).
Robust Estimators
Conventional hedonic models are estimated using ordinary least squares.
Hence, the model’s parameters are estimated by minimizing the sum of the squared
errors, $\sum \varepsilon_i^2$. Part of the reason for the use of least squares is habit. The least squares
estimator has been applied to a wide variety of problems for over 200 years. The
estimator has a closed form solution and is easy to compute, although this advantage is
less relevant today given modern computers. Also, if $\varepsilon_{it}$ is assumed to be Gaussian, the
estimator is maximum likelihood and has the smallest variance among all unbiased
estimators. However, it is well known that very small numbers of outliers can have
substantial impacts on the regression results (Andersen, 2008).7 It can be difficult to
diagnose this problem because the residuals themselves will be biased if the parameter
estimates are biased (Heritier, Cantoni, Copt, and Victoria-Feser, 2009). In other words,
the true outliers may be “masked”. Diagnostic tools, such as Cook’s distance (Cook,
1977), which compare models fitted with and without one observation at a time, do not
cope with the problem of multiple masked outliers. Data “cleaning” may be problematic
in part because decisions about which data to remove can be arbitrary. Iterative
processes for identifying and removing outliers may never reach an obviously
satisfactory conclusion and may remove a large proportion of the data. Moreover,
removal of data implies that conventional inference obtained from the remaining sample
is no longer valid and conventional tests are questionable (see MacDonald and
Robinson, 1985, pp. 125-26).
7 Yohai (1987, p. 642) notes that “even one outlier may have a large effect on the
estimate.” Janssen, Söderberg, and Zhou (2001) give an example of this by simulating a
single outlier in a hedonic model of apartment building prices.
Robust estimators are designed to address deviations from the assumptions of a
model and, therefore, robust statistics “as a collection of related theories, is the
statistics of approximate parametric models” (Hampel, Ronchetti, Rousseeuw, and
Stahel, 1986, p. 7). Robust estimators achieve robustness by assuming that the data
generating process lies in the neighborhood of the “ideal” model. For instance, for the
regression model in equation (1) this translates into assuming that the distribution of
the error term $\varepsilon_{it}$ is
$$F_{\eta} = (1 - \eta)F + \eta G, \qquad (2)$$
where $\eta$ is typically small, $F$ is the normal distribution, and $G$ is an arbitrary
distribution. What is then sought is inference about $F$, the distribution of the majority
of the data. To ensure robustness, the fitting criterion is changed.
A variety of robust estimators has been developed and is available in various
statistical packages such as R, Stata, and SAS. The different estimators are typically
compared with respect to breakdown point and asymptotic efficiency.8 The breakdown
point is a measure of resistance to outliers and refers to the smallest percentage of
contaminated data that the estimator can tolerate without producing an arbitrary
result. For example, an estimator with a breakdown point of zero could produce an
arbitrary result with as little as one outlying data point. The maximum breakdown point
is 50% because otherwise the estimates would be based on a minority of the data.
Asymptotic efficiency refers to the sampling variance for large samples of the estimator
relative to that of OLS, which under standard assumptions is the most efficient of
unbiased linear estimators. Estimators with high efficiency have low standard errors.
8 In addition, estimators can be compared with respect to their influence function. An
estimator is said to be robust if it has bounded influence.
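The notion of a breakdown point can be made concrete with the simplest possible case, a location estimate. The sketch below is only an illustration under assumed, made-up log prices: the sample mean (the least squares estimate of location) is moved arbitrarily far by a single bad value, whereas the median tolerates contamination of just under half of the sample.

```python
import statistics

# Hypothetical clean sample of log sale prices.
clean = [11.8, 11.9, 12.0, 12.1, 12.2, 12.0, 11.95, 12.05, 12.1, 11.9]

def contaminate(sample, n_bad, bad_value):
    """Replace the first n_bad observations with an arbitrary bad value."""
    return [bad_value] * n_bad + sample[n_bad:]

one_bad = contaminate(clean, 1, 1200.0)   # e.g., a misplaced decimal point
four_bad = contaminate(clean, 4, 1200.0)  # 40% of the sample contaminated

# How far each estimate moves away from its value on the clean sample.
mean_shift_one = abs(statistics.mean(one_bad) - statistics.mean(clean))
median_shift_one = abs(statistics.median(one_bad) - statistics.median(clean))
median_shift_four = abs(statistics.median(four_bad) - statistics.median(clean))

print(mean_shift_one, median_shift_one, median_shift_four)
```

The same logic carries over to regression: OLS behaves like the mean, while high breakdown point estimators behave more like the median.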
Robust estimators have received limited attention in real estate research. In
general, previous studies involving hedonic models have not used what are now
considered to be the optimal robust techniques. Belsley, Kuh, and Welsch (1980) use an
M-estimator to identify outliers which they then delete, whereas the preferred method
is to retain all observations and use the estimator to automatically down weight outliers
because in the latter case inference is correct (i.e., the standard errors are correct).
Moreover, the M-estimator is not optimal due to its low breakdown point, meaning that
it cannot cope with large proportions of outliers and produces arbitrary results (as is the
case for OLS). Thorson (1994) and Janssen, Söderberg, and Zhou (2001) both use least
median of squares (LMS) regression, which minimizes the median of the squared
residuals (rather than the sum as in OLS). A problem with these authors’ approach is
that the LMS estimator is used only to identify outliers. The model is re-estimated using
weighted least squares, in which the outliers are assigned arbitrary weights of zero. The
least absolute deviations (also referred to as the least absolute values) method of Yoo
(2001), which minimizes the sum of the absolute values of the residuals, has low efficiency. The
M-estimators used by Song and Wilhelmsson (2010) have high efficiency, but as mentioned
above suffer from a low breakdown point. These authors also use Cook’s distance
statistic to delete some outliers, instead of simply allowing the robust procedure to
down weight them.
In theory, the preferred option would be to apply an MM-estimator, which
according to Andersen (2008) is probably now the most popular robust technique for
linear regression modeling. The MM-estimators were first proposed by Yohai (1987)
and they are appealing because they can have both a high breakdown point (e.g., 50%)
and predefined high efficiency (e.g., 95% relative to OLS when the errors have a normal
distribution). MM-estimators involve two M-estimators (hence the label MM). The first
step involves a highly resistant S-estimator that minimizes a robust M-estimate of the
scale (i.e., the standard deviation) of the residuals (Rousseeuw and Leroy, 1987):
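$$\hat{\theta}_S = \arg\min_{\theta} \hat{\sigma}_S\left(\varepsilon_1(\theta), \ldots, \varepsilon_n(\theta)\right), \qquad (3)$$
where the $\varepsilon_i$ are the residuals and the value of $\theta$ that minimizes $\hat{\sigma}_S$ is the S-estimator.
The robust scale, $\hat{\sigma}_S$, satisfies
$$\frac{1}{n}\sum_{i=1}^{n} \rho\left(\frac{\varepsilon_i}{\hat{\sigma}_S}\right) = E[\rho(Z)], \qquad (4)$$
where $Z$ has a standard normal distribution and several choices are available for $\rho$. A
popular option, which we will use here, is the Tukey biweight or bisquare function
(Beaton and Tukey, 1974) given by
$$\rho(r) = \begin{cases} 3\left(\dfrac{r}{c}\right)^2 - 3\left(\dfrac{r}{c}\right)^4 + \left(\dfrac{r}{c}\right)^6 & \text{if } |r| \le c \\ 1 & \text{if } |r| > c \end{cases} \qquad (5)$$
where $c$ is a tuning constant that determines the tradeoff between breakdown point
and efficiency.9 To define the robustness weights, we use
$$\psi(r) = \begin{cases} r\left[1 - \left(\dfrac{r}{c}\right)^2\right]^2 & \text{if } |r| \le c \\ 0 & \text{if } |r| > c \end{cases}, \qquad w(r) = \frac{\psi(r)}{r}, \qquad (6)$$
so that
$$w(r) = \begin{cases} \left[1 - \left(\dfrac{r}{c}\right)^2\right]^2 & \text{if } |r| \le c \\ 0 & \text{if } |r| > c \end{cases} \qquad (7)$$
Note that $w \to 0$ as $|r| \to c$ and $w = 0$ when $|r| \ge c$. Using the residuals, $\varepsilon_{it}$, from the first
step, the second step of the MM-estimation process simply computes the scale of the
residuals, $\hat{\sigma}_S$.
Fixing the scale at the value obtained from the second step, the third step
involves computation of an M-estimate of the regression parameters:
$$\hat{\theta}_{MM} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\left(\frac{\varepsilon_i(\theta)}{\hat{\sigma}_S}\right). \qquad (8)$$
Again, the function $\rho$ is the bisquare function given in equation (5) and the
corresponding weights are as in equation (7). Further technical details of this approach
can be found in Maronna, Martin, and Yohai (2006).
9 The tuning constant is negatively related to the breakdown point and positively related
to efficiency and is typically set to 1.547 for the initial S-estimator and then 4.685 for the
subsequent M-estimator. This aims for a high breakdown point for the S-estimator and
then a high level of efficiency for the M-estimator.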
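The bisquare loss in equation (5) and the weights in equation (7) can be sketched in a few lines. The function names and the numerical integration helper are illustrative assumptions rather than part of any statistical package. The last step approximates the right-hand side of equation (4) for the S-step tuning constant of 1.547.

```python
import math

def rho_bisquare(r, c):
    """Tukey bisquare loss, scaled so that its maximum is 1 (equation 5)."""
    if abs(r) > c:
        return 1.0
    u = (r / c) ** 2
    return 3 * u - 3 * u ** 2 + u ** 3

def weight_bisquare(r, c):
    """Robustness weight (equation 7): 1 at r = 0, falling to 0 at |r| = c."""
    if abs(r) > c:
        return 0.0
    return (1 - (r / c) ** 2) ** 2

def expected_rho_under_normal(c, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E[rho(Z)] for standard normal Z with the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        end_weight = 0.5 if k in (0, steps) else 1.0
        total += end_weight * rho_bisquare(z, c) * density
    return total * h

e_rho_s = expected_rho_under_normal(1.547)
print(round(e_rho_s, 3))

# Weights for a few standardized residuals under the two tuning constants.
for r in (0.0, 1.0, 2.0, 5.0):
    print(r, round(weight_bisquare(r, 1.547), 3), round(weight_bisquare(r, 4.685), 3))
```

With c = 1.547 the expectation should come out close to one half of the maximum of the loss, which is what ties this tuning constant to a breakdown point of about 50%. The weights also show why the smaller constant is the more resistant choice: a standardized residual of 2 receives zero weight when c = 1.547 but a substantial weight when c = 4.685.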
While the MM-estimator seems optimal in theory, in practice it does not always
perform as expected. Verardi and Croux (2009) use simulation to study how well
various robust estimators respond to contaminated data. They first simulate a clean
data set (no outliers) with five independent variables and a sample size of 1,000. They
then randomly replace 10% of the observations with contaminated data for one of the
independent variables (without modifying the dependent variable; these are called
“leverage points”) and estimate the model using OLS and several robust estimators,
including two MM-estimators, one with 95% efficiency, MM(0.95), and the other with
70% efficiency, MM(0.70).10 This was repeated for a total of 1,000 simulations, the
results of which were then summarized and compared with the true results. By far, the
least biased estimator was MM(0.70), indicating that the less efficient MM-estimator
has properties that make it preferable to the more efficient one for this type of
contamination (leverage points).
10 The efficiency level is modified by adjusting the tuning constant, c, for the
M-estimator that is applied in the third step of the MM process. For 95% and 70%
efficiency, c is set at 4.685 and 2.697, respectively.
Gürünlü Alma (2011) performs a similar simulation analysis to compare
alternative robust indicators for regression equations containing two, three, or five
independent variables for small samples (n = 30). Deviations in both the independent
variables (“leverage points”) and the dependent variables (“vertical outliers”) are
considered as contamination schemes, separately or jointly. Several contamination
levels are considered and the results are based on 1,000 replications. Using $R^2$ as a
measure of goodness of fit, the author concludes that the S-estimator (the first step of
the MM-estimator) consistently outperforms MM(0.95) and in most cases outperforms
the other estimators considered (OLS, M-estimator, and least trimmed squares).
Given the discrepancy between theory and empirical results, we undertake a
simulation analysis based on our data as a means for choosing the optimal robust
technique. We discuss our data in the next section before describing the simulation
analysis in the following section.
Data
The data consist of 82,297 transactions of single-family residential properties in
Louisville Metro-Jefferson County, Kentucky, for the period from 1998Q1 through
2010Q2. The transactions data are from the local Multiple Listing Service (MLS) and the
county Property Valuation Administrator (PVA). Transactions were excluded from the
sample when required hedonic characteristics were missing or when the data were
obviously incorrect. Hedonic characteristics include the age of the structure, number of
bathrooms, finished square feet, lot size (in acres), and submarket. The nine
submarkets are approximately as defined by the MLS, although we combined two
submarkets and redefined the boundaries to coincide with zip codes.
We supplemented the MLS data with information from the PVA identifying the
seller of each property. This allowed us to flag real estate owned (REO) sales by
financial institutions, which we categorized as distressed sales. Other distressed sales,
such as short sales, were not identifiable during the time period covered by the data.
Exhibit 1 summarizes the data for the full sample and by submarket and year of
sale. Area (submarket) 1 stands out as having by far the lowest average sale price and
the largest percentage of distressed sales, both in terms of the percentages of sales
within that submarket (27.8%) and in terms of the percentage of all distressed sales
(38.7%). This submarket has the oldest and smallest structures and the smallest lot size
on average. The data by year of sale show that average prices increased steadily from
1998 through 2007 and then declined between 2007 and 2009, followed by a small
increase in the first two quarters of 2010. Average age increased by about one year per
year, reflecting the aging housing stock, while house size remained fairly constant. Lot
size decreased at a slow rate, with an overall average of a quarter acre. Distressed sales
constituted 7.9% of all transactions. By the first two quarters of 2010, distressed sales
(as we define them) had increased to 18% of all transactions. Over 50% of distressed
transactions occurred in 2007 or more recently. Overall transaction volume tends to
move with house prices.
[Exhibit 1 here]
Simulations
To create a “true” index to compare with the robust indexes, we add to the
conventional index controls for sales of distressed properties (REO sales). The REO
controls are specified as a series of time dummies equal to 1 if the sale was REO and 0
otherwise. This allows the impacts of distressed sales to vary over time with the
severity of the foreclosure crisis. The model with controls for REO sales is specified as:
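$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{P} \gamma_t D_{it} + \sum_{m=2}^{M} \delta_m S_{im} + \sum_{t=1}^{P} \lambda_t R_{it} + \varepsilon_{it}, \qquad (9)$$
where $R_{it}$ are the REO time dummies and $\lambda_t$ are the corresponding parameters to be
estimated. Note that the REO dummies start in t = 1.
In order to compare conventional (OLS) and robust indexes with the known
index, we generate data according to the model
$$Y_{it} = \hat{\alpha} + X_{it}^{T}\hat{\beta} + \sum_{t=2}^{50} \hat{\gamma}_t D_t + \sum_{m=2}^{9} \hat{\delta}_m S_m + \sum_{t=1}^{50} \hat{\lambda}_t R_t + \varepsilon_{it}, \qquad (10)$$
where the dummy covariates $D_t$, $S_m$, and $R_t$ are the ones from the market-wide
dataset, the coefficient values $\alpha$, $\beta$, $\gamma$, $\delta$, and $\lambda$ are those stemming from fitting
equation (9), and the error terms $\varepsilon_{it}$ are generated according to an $N(0, 0.2)$
distribution. We then estimate the model
$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{50} \gamma_t D_t + \sum_{m=2}^{9} \delta_m S_m + \varepsilon_{it}, \qquad (11)$$
to reproduce the situation in which REO dummies are not available for fitting.11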
We repeated this procedure 200 times with both OLS and robust techniques.
Exhibit 2 shows the simulation results. The top panel represents the estimates obtained
with OLS. Each box plot is the summary of the 200 estimates for each time dummy. The
bottom panels give the same information for the S, MM(0.70), and MM(0.95) robust
estimates. Because the known underlying values have been subtracted, we expect the
box plots to be centered on zero if the parameters are estimated correctly. As the
exhibit shows, all of the estimators track the known index closely when the proportion
of REO transactions is low. However, the OLS and MM(0.95) indexes display increasing
bias as the proportion of REO transactions increases. The MM(0.95) index is not as
biased as the OLS index, but it nevertheless displays substantial bias during the later
years of the period. The MM(0.70) index does a better job than the MM(0.95) index,
but it still displays some bias. By far the least biased is the S-estimator; consequently,
we will use that estimator for the rest of our analysis.
11 Some versions of MM- and S-estimators have difficulty with the large numbers of
dummy variables in hedonic models due to the fact that they start with a subsample of
the data. For example, the lmrob() function in R’s robustbase package has recently been
updated to allow MM-estimators to accommodate a large number of categorical
variables, although it may be necessary in some circumstances to try multiple seeds for
the random number generator used in the first stage of the MM procedure. Also, the
cov=“.vcov.w” option is recommended. For another example, it may be necessary to
modify the subsetsize option when using SAS’s robustreg procedure for the S-estimator.
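The mechanism behind these results can be illustrated with a deliberately simplified sketch. It is not the simulation design of equations (10) and (11), and it does not use an S- or MM-estimator. Instead it assumes a single period whose true log price level is zero, a hypothetical unflagged REO discount of 0.6 log points, and an error standard deviation of 0.2. It compares the sample mean (what OLS gives with only a period dummy) with a bisquare M-estimate of location computed by iterative reweighting under the two tuning constants discussed earlier. All function names and parameter values are assumptions made for illustration.

```python
import random
import statistics

def bisquare_location(values, c, iterations=50):
    """Bisquare M-estimate of location via iteratively reweighted means.

    Starts from the median and holds the scale fixed at the normalized MAD.
    """
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    scale = 1.4826 * mad
    for _ in range(iterations):
        num = den = 0.0
        for v in values:
            r = (v - center) / scale
            w = (1 - (r / c) ** 2) ** 2 if abs(r) <= c else 0.0
            num += w * v
            den += w
        center = num / den
    return center

def simulate_period(n, reo_share, discount, sd, rng):
    """Log-price deviations for one period; REO sales carry an unflagged discount."""
    sample = []
    for _ in range(n):
        shift = discount if rng.random() < reo_share else 0.0
        sample.append(shift + rng.gauss(0.0, sd))
    return sample

rng = random.Random(2013)
results = {}
for share in (0.05, 0.20):
    y = simulate_period(20000, share, discount=-0.6, sd=0.2, rng=rng)
    # The true level is zero, so each estimate is also its own bias.
    results[share] = {
        "mean": statistics.mean(y),
        "bisquare_c4.685": bisquare_location(y, 4.685),
        "bisquare_c1.547": bisquare_location(y, 1.547),
    }

for share, est in results.items():
    print(share, {k: round(v, 3) for k, v in est.items()})
```

Under these assumptions the mean is pulled down roughly in proportion to the REO share, the high-efficiency tuning constant removes only part of that bias, and the low tuning constant removes most of it, which is the same ordering as reported for the OLS, MM(0.95), and S indexes.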
[Exhibit 2 here]
Robust and Conventional Indexes
Exhibit 3 shows the conventional longitudinal (OLS single-equation) hedonic
index compared to a robust index calculated using the S-estimator. In addition, the
exhibit also shows an OLS index that controls for REO sales. The conventional index
increases fairly steadily through 2005Q3 and then it becomes much more volatile, with a
pronounced drop in late 2008. The index with REO controls is above the conventional
index from 2001Q3, and the gap tends to grow with the volume of REO transactions.
This is consistent with the expectation that failure to control for distressed sales should
bias the index downwards. The robust index is generally much closer to the index with
REO controls than to the conventional index. This is particularly the case during the
latter, more volatile, part of the series. Moreover, the robust index is somewhat less
volatile than the index with REO controls, which is likely because the robust index is
controlling for the impacts of outliers in addition to REO sales.
[Exhibit 3 here]
Exhibit 4 displays 95% confidence intervals for the conventional and robust
indexes. The confidence intervals show that the robust index is significantly different
from the conventional index in 11 quarters, all at the end of the series. Given the
influence of outliers in the conventional index, the confidence intervals for that index
are undoubtedly too large (note how narrow the interval is for the robust index),
implying that the indexes have significant differences more frequently than depicted.
[Exhibit 4 here]
Exhibit 3 suggests that REO transactions are an important source of outliers.
Exhibit 1 indicates that 38.7% of all REO sales occurred in Area 1 and 27.8% of the
transactions in Area 1 were REO sales. Area 1 (West Louisville) is a low-income, largely
African-American, part of the city, and it was subject to considerable investment by
landlords during the early part of the period and large numbers of foreclosures during
the latter part of the period. Within Area 1, distressed transactions rose from 6.2% in
1998 to 20.0% in 2003. In 2004, however, 33.6% of transactions were distressed and by
the first half of 2010 some 49.2% were distressed. The large percentages of problematic
transactions suggest that a robust estimation for Area 1 should be less likely to consider
these to be outliers. Consistent with this, Exhibit 5 shows that for Area 1 the robust
index tracks the conventional index rather than the index with REO controls.
[Exhibit 5 here]
Index Revisions
Index revision refers to the fact that, as data for subsequent periods are added,
the historical values of an index are subject to change. Such revisions are problematic,
in particular when an index is used for the settlement of derivative contracts (Deng and
Quigley, 2008). Revisions are an inherent problem for repeat sales indexes, but are not
necessarily an issue for hedonic indexes. For example, an index that is derived from
chained single-period hedonic equations will not be subject to revision. However, an
index based on a longitudinal equation or a moving window series of equations may
change as new time periods and new data are added.
Our strategy for studying revisions follows the approach of Clapham, Englund,
Quigley, and Redfearn (2006). We focus on revisions to the ten quarters in the middle
of our series (2003Q1 to 2005Q2). This gives us 20 quarters to build up the index prior
to the analysis period. This also yields 20 sets of revisions for each of the 10 analysis
quarters as we progressively add quarterly data for 2005Q3 through 2010Q2.
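The design of this revision exercise can be sketched in a few lines. The code below is a minimal illustration on simulated data, assuming a time-dummy OLS specification with a single hedonic characteristic; the variable names, the data-generating process, and the function names are hypothetical and do not reproduce the model estimated here.

```python
import numpy as np

def hedonic_index(log_price, sqft, period, n_periods):
    """Longitudinal time-dummy index: 100 * exp(delta_t), period 0 as base."""
    X = np.column_stack(
        [np.ones(len(period)), np.log(sqft)]
        + [(period == t).astype(float) for t in range(1, n_periods)]
    )
    beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)
    return 100.0 * np.exp(np.concatenate([[0.0], beta[2:]]))

def index_vintages(log_price, sqft, period, first_end, last_end, window):
    """Re-estimate on data through each end period; keep the analysis window."""
    vintages = []
    for end in range(first_end, last_end + 1):
        keep = period <= end
        idx = hedonic_index(log_price[keep], sqft[keep], period[keep], end + 1)
        vintages.append(idx[window])
    return np.array(vintages)

# Simulated sales: 50 quarters (1998Q1 to 2010Q2), 200 sales per quarter.
rng = np.random.default_rng(1)
n_periods, per_period = 50, 200
period = np.repeat(np.arange(n_periods), per_period)
sqft = rng.uniform(1000, 3000, period.size)
log_price = (4.0 + 0.8 * np.log(sqft) + 0.01 * period
             + rng.normal(0, 0.2, period.size))

# Quarters 0-19 build up the index; quarters 20-29 are the analysis window.
window = slice(20, 30)
vintages = index_vintages(log_price, sqft, period, 29, 49, window)

# Row 0 is the original index; rows 1-20 are re-estimates with added quarters.
revisions = vintages[1:] - vintages[0]
print(vintages.shape, np.abs(revisions).max())
```

Each row of `vintages` is one estimate of the ten analysis quarters, so the array has 21 rows: the original index plus 20 re-estimates.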
Exhibit 6 reports statistics for cumulative revisions over the 10-quarter period as
well as quarterly revisions for the conventional, conventional with REO controls, and
robust indexes. The cumulative revisions refer to the changes in price growth over the
10-quarter period, while the quarterly revisions refer to the changes in price growth for
the individual quarters within that period. The means of the absolute values of the
cumulative revisions for the three indexes are 1.13%, 0.49%, and 0.09%, respectively.
The corresponding figures for the quarterly revisions are 0.06%, 0.03%, and 0.01%.
These results indicate that the robust method virtually eliminates the revision problem.
This can be seen clearly in Exhibit 7. Each panel of the exhibit shows 21 superimposed
indexes for 2003Q1 through 2005Q2, including the original index plus the 20 revisions.
[Exhibits 6 and 7 here]
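The summary statistics of the kind reported in Exhibit 6 can be computed from a matrix of index vintages along the following lines. This is a sketch under assumptions: revisions are measured relative to the original estimate, cumulative growth runs from the first to the last quarter of the window, and the standard deviation uses the sample formula. The function names and the three-quarter example values are hypothetical.

```python
import numpy as np

def revision_stats(x):
    """Summary statistics for a set of revisions (in percentage points)."""
    x = np.asarray(x, dtype=float).ravel()
    return {
        "mean_abs": np.mean(np.abs(x)),
        "min": x.min(),
        "max": x.max(),
        "range": x.max() - x.min(),
        "sd": x.std(ddof=1),
    }

def cumulative_and_quarterly_revisions(vintages):
    """Revisions to price growth, relative to the first (original) vintage.

    vintages: rows are successive estimates, columns are analysis quarters.
    """
    v = np.asarray(vintages, dtype=float)
    cum_growth = 100.0 * (v[:, -1] / v[:, 0] - 1.0)      # growth over the window
    qtr_growth = 100.0 * (v[:, 1:] / v[:, :-1] - 1.0)    # quarter-to-quarter growth
    return cum_growth[1:] - cum_growth[0], qtr_growth[1:] - qtr_growth[0]

# Hypothetical example: an original index and two later re-estimates.
vintages = np.array([
    [100.0, 102.0, 104.0],
    [100.0, 102.0, 105.0],
    [100.0, 103.0, 106.0],
])
cum_rev, qtr_rev = cumulative_and_quarterly_revisions(vintages)
cum_stats = revision_stats(cum_rev)
qtr_stats = revision_stats(qtr_rev)
print(cum_stats["mean_abs"], qtr_stats["mean_abs"])
```

In this toy example the cumulative growth estimates are 4%, 5%, and 6%, so the two cumulative revisions are 1 and 2 percentage points.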
Conclusions
Even in the best real estate database, it is almost inevitable that there will be
measurement and data entry errors. Moreover, a database may omit key variables,
such as identifiers for one or more types of distressed sales. By reducing the impacts of
outliers, robust methods provide a means for addressing such data problems. Using a
longitudinal hedonic model with simulated data, we compare three robust
estimators: two MM-estimators with different levels of efficiency and an
S-estimator. We select the S-estimator for further empirical analysis because it
provides the least biased estimates of a known index.
Using over 82,000 house sales from Louisville, Kentucky, we then construct a
robust hedonic index and compare it with a conventional OLS index and a conventional
index that controls for a major source of outliers, distressed transactions. The robust
index differs significantly from the conventional index and is much closer to the index
with controls for distressed sales. In contrast, in a submarket with a large percentage of
distressed sales, the robust index is much closer to the conventional index because the
distressed sales are not treated as outliers. We also show that robust techniques
minimize the index revision problem that occurs when new data are incorporated into a
longitudinal hedonic model.
Acknowledgments
We thank the Greater Louisville Association of Realtors for providing the
Multiple Listing Service data.
References
Andersen, R. Modern Methods for Robust Regression. Thousand Oaks, CA: Sage
Publications, 2008.
Beaton, A.E. and J.W. Tukey. The Fitting of Power Series, Meaning Polynomials,
Illustrated on Band-Spectroscopic Data. Technometrics, 1974, 16:2, 147-85.
Belsley, D.A., E. Kuh, and R.E. Welsch. Regression Diagnostics: Identifying Influential
Data and Sources of Collinearity. New York: John Wiley, 1980.
Bourassa, S.C., E. Cantoni, and M. Hoesli. Robust Repeat Sales Indexes. Real Estate
Economics, 2013 (in press).
Bourassa, S.C., M. Hoesli, and J. Sun. A Simple Alternative House Price Index Method.
Journal of Housing Economics, 2006, 15:1, 80-97.
Bourassa, S.C., M. Hoesli, D. Scognamiglio, and P. Sormani. Constant-Quality House
Price Indexes for Switzerland. Swiss Journal of Economics and Statistics, 2008,
144:4, 561-75.
Campbell, J.Y., S. Giglio, and P.A. Pathak. Forced Sales and House Prices. American
Economic Review, 2011, 101:5, 2108-31.
Clapham, E., P. Englund, J.M. Quigley, and C.L. Redfearn. Revisiting the Past and Settling
the Score: Index Revision for House Price Derivatives. Real Estate Economics,
2006, 34:2, 275-302.
Cook, R.D. Detection of Influential Observations in Linear Regression. Technometrics,
1977, 19:1, 15-18.
Deng, Y. and J.M. Quigley. Index Revision, House Price Risk, and the Market for House
Price Derivatives. Journal of Real Estate Finance and Economics, 2008, 37:3,
191-209.
Gatzlaff, D.H. and D.R. Haurin. Sample Selection and Biases in Local House Value
Indices. Journal of Urban Economics, 1998, 43:2, 199-222.
Gürünlü Alma, Ö. Comparison of Robust Regression Methods in Linear Regression.
International Journal of Contemporary Mathematical Sciences, 2011, 6:9, 409-21.
Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust Statistics: The
Approach Based on Influence Functions. New York: John Wiley & Sons, 1986.
Heritier, S., E. Cantoni, S. Copt, and M.-P. Victoria-Feser. Robust Methods in
Biostatistics. Chichester, UK: John Wiley & Sons, 2009.
Huber, P.J. Robust Estimation of a Location Parameter. Annals of Mathematical
Statistics, 1964, 35:1, 73-101.
Janssen, C., B. Söderberg, and J. Zhou. Robust Estimation of Hedonic Models of Price
and Income for Investment Property. Journal of Property Investment and Finance,
2001, 19:4, 342-60.
MacDonald, G.M. and C. Robinson. Cautionary Tails about Arbitrary Deletion of
Observations; or, Throwing the Variance Out with the Bathwater. Journal of Labor
Economics, 1985, 3:2, 124-52.
Maronna, R., R. Martin, and V. Yohai. Robust Statistics: Theory and Methods.
Chichester, UK: John Wiley & Sons, 2006.
McMillen, D. and P. Thorsnes. Housing Renovations and the Quantile Repeat-Sales Price
Index. Real Estate Economics, 2006, 34:4, 567-84.
Peña, D. and J. Ruiz-Castillo. Robust Methods of Building Regression Models: An
Application to the Housing Sector. Journal of Business and Economic Statistics,
1984, 2:1, 10-20.
Pennington-Cross, A. The Value of Foreclosed Property. Journal of Real Estate
Research, 2006, 28:2, 193-214.
Prasad, N. and A. Richards. Improving Median Housing Price Indexes through
Stratification. Journal of Real Estate Research, 2008, 30:1, 45-71.
Rousseeuw, P.J. and A.M. Leroy. Robust Regression and Outlier Detection. New York:
John Wiley, 1987.
Song, H.-S. and M. Wilhelmsson. Handling Outliers in the Construction of a Price Index
for Condominiums. Paper presented at the American Real Estate Society Annual
Meeting, Naples, FL, 2010.
Standard & Poor’s. S&P/Case-Shiller Home Price Indices: Index Methodology. New York:
McGraw-Hill, 2009.
Thorson, J.A. The Use of Least Median of Squares in the Estimation of Land Value
Equations. Journal of Real Estate Finance and Economics, 1994, 8:2, 183-90.
Verardi, V. and C. Croux. Robust Regression in Stata. Stata Journal, 2009, 9:3, 439-53.
Wang, F.T. and P.M. Zorn. Estimating House Price Growth with Repeat Sales Data:
What’s the Aim of the Game? Journal of Housing Economics, 1997, 6:2, 93-118.
Yohai, V.J. High Breakdown Point and High Efficiency Robust Estimates for Regression.
Annals of Statistics, 1987, 15:2, 642-56.
Yoo, S.-H. A Robust Estimation of Hedonic Price Functions: Least Absolute Deviations
Estimator. Applied Economics Letters, 2001, 8:1, 55-58.
Exhibit 1 Descriptive Statistics

Columns: (1) sample size; (2) mean sale price ($); (3) mean age of structure; (4) mean number of bathrooms; (5) mean finished square feet; (6) mean lot size (acres); (7) distressed (REO) sales as a % of sales in submarket or year; (8) distressed (REO) sales as a % of all distressed sales.

                  (1)      (2)   (3)  (4)    (5)   (6)   (7)    (8)
By submarket
1               9,097   53,239  72.5  1.4  1,310  0.13  27.8   38.7
2               8,687  158,962  69.8  1.7  1,723  0.17   4.2    5.6
3               5,795  230,385  56.9  2.1  2,224  0.27   1.5    1.4
4               7,663   98,547  38.1  1.5  1,438  0.25  11.9   14.0
5               8,427  109,272  40.2  1.6  1,526  0.28   9.7   12.5
6               8,538  120,263  25.9  1.7  1,562  0.25   7.9   10.3
7              16,044  154,847  25.8  2.0  1,857  0.26   4.5   11.0
8              11,169  253,249  19.7  2.6  2,685  0.34   2.6    4.4
9               6,877  255,641  17.2  2.7  2,729  0.33   2.1    2.2
By year of sale
1998            6,192  129,875  33.4  1.9  1,863  0.26   1.0    1.0
1999            6,577  132,187  35.7  1.9  1,827  0.26   1.5    1.5
2000            6,047  141,663  35.8  1.9  1,867  0.25   2.1    2.0
2001            6,238  146,872  36.6  1.9  1,843  0.25   2.9    2.8
2002            6,031  152,285  36.3  1.9  1,862  0.25   3.5    3.2
2003            6,941  163,342  36.5  2.0  1,916  0.26   4.9    5.2
2004            7,182  165,134  38.5  1.9  1,902  0.26   8.9    9.8
2005            7,457  171,811  40.0  1.9  1,921  0.25   9.4   10.8
2006            7,362  173,996  40.7  1.9  1,919  0.25  10.5   11.9
2007            7,170  177,803  42.3  1.9  1,938  0.25  12.6   13.9
2008            5,882  165,613  43.0  1.9  1,895  0.24  15.5   13.9
2009            6,030  159,798  44.6  1.9  1,895  0.24  16.5   15.3
2010            3,188  161,967  44.2  1.9  1,913  0.24  18.0    8.8
Full sample    82,297  157,669  38.9  1.9  1,890  0.25   7.9  100.0

Source: Authors’ calculations based on Greater Louisville Association of Realtors’ Multiple Listing Service and Jefferson County Property Valuation Administrator data.
Notes: The 2010 data are for the first two quarters only.
Exhibit 2 Simulation Analysis
Exhibit 3 Conventional, Conventional with REO Controls, and Robust (S) Indexes for Louisville
[Figure not reproduced. Vertical axis ticks run from 100 to 145. Series: Conventional; Conventional with REO controls; Robust (S).]
Exhibit 4 Confidence Intervals (95%) for Conventional and Robust (S) Indexes
[Figure not reproduced. Vertical axis ticks run from 90 to 150. Series: Conventional; Robust (S).]
Exhibit 5 Conventional, Conventional with REO Controls, and Robust (S) Indexes for Area 1
[Figure not reproduced. Vertical axis ticks run from 60 to 150. Series: Conventional; Conventional with REO controls; Robust (S).]
Note: Indexes are four-quarter moving averages.
Exhibit 6 Analysis of Revisions for 2003Q1 through 2005Q2

                              Conventional  Conventional indexes   Robust
                                   indexes     with REO controls  indexes
Cumulative revisions (%)
  Mean (of absolute values)           1.13                  0.49     0.09
  Minimum                             0.24                  0.09    -0.12
  Maximum                             1.94                  0.93     0.22
  Range                               1.70                  0.84     0.34
  Standard deviation                  0.51                  0.26     0.11
Quarterly revisions (%)
  Mean (of absolute values)           0.06                  0.03     0.01
  Minimum                            -0.01                 -0.03    -0.02
  Maximum                             0.16                  0.09     0.02
  Range                               0.17                  0.11     0.04
  Standard deviation                  0.04                  0.02     0.01
Exhibit 7 Revisions for 2003Q1 through 2005Q2: (a) Conventional Indexes; (b) Conventional Indexes with REO Controls; (c) Robust (S) Indexes
[Figure not reproduced. Three panels, (a), (b), and (c), each with vertical axis ticks running from 115 to 140.]