Robust Hedonic Price Indexes

Steven C. Bourassa

Department of Urban and Public Affairs, University of Louisville, 426 W. Bloom Street,

Louisville, Kentucky 40208, USA, email: [email protected]

Eva Cantoni

Research Center for Statistics and Department of Economics, University of Geneva,

40 boulevard du Pont-d’Arve, CH-1211 Geneva 4, Switzerland, email:

[email protected]

and

Martin Hoesli

University of Geneva (HEC), Swiss Finance Institute, 40 boulevard du Pont-d’Arve, CH-

1211 Geneva 4, Switzerland, University of Aberdeen Business School, Scotland, and

Bordeaux Management School, France, email: [email protected]

April 19, 2013

Abstract

Using single-family sales data for Louisville, Kentucky, we assess the benefits of applying

robust methods to down weight problematic transactions in the context of longitudinal

hedonic price models. Robust estimators can reduce the influence of outliers that are

due to data entry errors, unmeasured hedonic characteristics, or non-market

transactions. We use simulation analysis to compare conventional indexes with several

robust indexes and conclude that a robust S-estimator is least biased relative to a true

(simulated) underlying index. We then compare a robust index with a conventional

index and a conventional index that controls for distressed sales and find that it is much

closer to the latter, except in a submarket with a large proportion of distressed sales.

We also demonstrate that robust methods can substantially eliminate the revision

problem in longitudinal hedonic indexes.

Key Words: house price indexes, hedonic models, robust methods, distressed sales

JEL Codes: R31

Introduction

Hedonic models are widely used to construct constant-quality house price indexes; for example, they are the primary measures of price movements in France, Norway, Switzerland, and the United Kingdom.1 Hedonic modelling raises a variety of practical issues relating to, among other things, choice of estimation technique, specification of relevant hedonic characteristics, and handling of data problems. These issues are inter-related. For example, spatial estimation techniques have been developed to overcome the inability of hedonic characteristics to fully capture spatial relationships. Robust estimation techniques, the focus of this paper, are intended mainly to correct data problems.

1 The United States, where repeat sales methods predominate, is an exception to this rule. Other exceptions include the countries Denmark, New Zealand, and Sweden, which rely upon sale price-appraisal ratio (SPAR) indexes (Bourassa, Hoesli, and Sun, 2006) and Australia, where a stratified median index is used (Prasad and Richards, 2008).

The types of data problems that might occur in a hedonic context include sample selection bias, which refers to the differences in implicit prices of hedonic characteristics or price movements between the entire housing stock and the sample of houses that actually transacts (Gatzlaff and Haurin, 1998). Sample selection bias correction techniques can, at least in theory, be used to address this problem. Although robust techniques are not helpful with respect to sample selection problems, they are useful with respect to several other common data problems. These include non-market (discounted) prices on the left-hand side of the hedonic equation, relevant attributes missing from the right-hand side of the equation, and measurement errors on either side.

These data problems can lead to outliers that bias indexes in undesirable ways.

Conventional hedonic regression models are highly sensitive to these outliers because

they are estimated by minimizing the sums of the squared residuals. This gives outliers,

and particularly large outliers, disproportionately large influence. Even a small number

of outliers can have a large effect. If these outliers represent incorrect data, in the

sense that the data are not consistent with the assumptions of the model, then least

squares estimation can be biased and inconsistent. Robust methods generally down

weight observations automatically based on the size of their residuals and thereby

provide a means for producing consistent and possibly efficient estimators and test

statistics when a model is somewhat misspecified because its assumptions are not fully

consistent with the data.

Problematic data include distressed sales, non-arm’s length sales, and typographical errors.2 For example, discounts associated with distressed sales, such as foreclosure and short sales, may bias indexes downwards. These types of forced sales may be discounted because the seller is unusually eager to reach a deal or appear to be discounted because there is unmeasured deterioration in quality (Campbell, Giglio, and Pathak, 2011; Pennington-Cross, 2006). Some types of problematic transactions may be flagged in some hedonic data sets. In other cases, it may be possible to identify these transactions, but only with considerable investment of time and effort. Typographical errors may be impossible to identify with any degree of certainty. Robust methods provide a means for responding to data problems when it is difficult or impossible to identify all of the transactions with contaminated data.

2 We also considered flips as a possible type of problematic data. Flips involve two transactions of the same property in relatively quick succession (e.g., within a year), with some upgrading of the property between transactions. The improvements will often involve characteristics that are not measured by the available hedonic variables. Consequently, the first transaction in a flip may appear to be at a discount and the second transaction at a premium relative to the market. In preliminary estimations of our model, however, we did not find evidence of bias due to flips.

What constitutes problematic data depends on the purpose of the index (Wang

and Zorn, 1997). Some indexes are designed to treat the market as a portfolio and to

track changes in the value of the portfolio over time. Such indexes are value-weighted

(i.e., weighted by the value of each transaction) rather than equal-weighted. An

example of a value-weighted index is the S&P/Case-Shiller index (Standard & Poors,

2009). For such an index, down-weighting distressed sales might not be desirable

because such sales might be valid contributors to the value of the market portfolio.

However, in other cases, the aim is to track constant-quality price movements for

properties that are not distressed. In such a case, an equal-weighted index is

appropriate, as is controlling for the downward bias caused by distressed sales. Our

purpose here is consistent with the latter aim of measuring movements in typical

market-rate transactions.

Because robust techniques aim to model the behavior of the majority of the

data, they respond differently to different market circumstances. In a “normal”

market—defined as a market that does not have a large proportion of forced or

distressed sales—such transactions will tend to be classified as outliers and robust

techniques will prevent them from biasing an index. However, as the proportion of

distressed sales increases (becoming the majority), they are less likely to be classified as

outliers and hence less likely to be down weighted by robust techniques. In such cases,

robust indexes will tend to be similar to conventional indexes estimated using ordinary

least squares (OLS). We give an example of this by calculating a robust index for a

submarket in Louisville that has a large percentage of distressed sales.

Robust methods have been used in a variety of fields such as biostatistics (Heritier, Cantoni, Copt, and Victoria-Feser, 2009) and sociological and political research (Andersen, 2008), but their applications to real estate have been limited.3 Belsley, Kuh, and Welsch (1980) use a robust M-estimator to analyze a hedonic model of house prices in the Boston area.4 Thorson (1994) and Janssen, Söderberg, and Zhou (2001) use a robust least median of squares estimator in hedonic models of agricultural land values in Illinois and apartment building incomes and prices in Stockholm, respectively. Yoo (2001) applies a least absolute deviations estimator (also known as a median quantile estimator) in a hedonic model of house prices in Seoul, Korea. Finally, Song and Wilhelmsson (2010) use a robust M-estimator method to identify and down weight outliers in a hedonic price model of condominium prices in Stockholm.5 Two papers apply robust techniques to construction of repeat sales indexes. McMillen and Thorsnes (2006) conclude that a median quantile estimator suffers less bias from positive outliers, such as unobserved renovations, than the conventional estimator. Bourassa, Cantoni, and Hoesli (2013) show how robust techniques improve repeat sales house price index estimation and reduce the magnitude of revisions as new data are added over time.

3 The term “robust” is used fairly widely in real estate and other types of statistical research to refer to regression models with various types of desirable features, such as standard errors that account for heteroscedasticity. Here we are concerned with techniques designed to reduce the influence of unusual observations.

4 In contrast to least squares, M-estimators minimize some function that gives decreasing weights to observations as the size of the standardized residual increases (see, e.g., Huber, 1964).

5 Peña and Ruiz-Castillo (1984) apply Cook’s distance statistic (Cook, 1977) to identify outliers in a hedonic study of apartment rents in Madrid. They discuss robust M-estimators, but do not apply them.

We use a large dataset of house sales for Louisville, Kentucky, for the period 1998Q1 to 2010Q2 to investigate the usefulness of robust regression methods in producing more reliable house price indexes. The time period is of interest as it covers both bull and bear housing market conditions. Bias related to distressed properties is likely to be more severe in bear markets. Data entry errors and other data problems could occur at any time. Our data allow us to identify most distressed transactions and assess their impacts on hedonic price indexes.

After conducting a simulation analysis to determine the best robust estimator for problems like ours, we compare a robust hedonic index to a conventional OLS index and an OLS index that controls for distressed property sales. We find that the robust index for the market as a whole tracks the latter index more closely than the conventional index because distressed sales are outliers in the market-wide context. However, for a submarket in which distressed sales constitute a substantial proportion of the observations, they are not outliers, and the robust index tracks the conventional index. We also demonstrate that robust methods can virtually eliminate the index revision problem.

The next section reviews the hedonic models that are typically used to construct

price indexes. The subsequent section discusses robust estimators and their application

to hedonic modeling. We then discuss our data and provide some summary statistics.

The subsequent section applies simulation analysis to compare alternative robust

estimators. Next, we compute and compare robust and conventional indexes for

Louisville. The penultimate section analyzes index revisions and the final section

concludes.

Hedonic Price Models

The hedonic model has a measure of house sale prices on the left-hand side and a

set of property characteristics on the right-hand side. These “hedonic” characteristics

typically describe aspects of the structure and lot, as well as the location of the

property. If the data cover multiple periods of time, then a series of time dummy

variables is included to control for changes in price levels over time. In addition, a set of

spatial dummy variables can be used to capture submarket effects. This approach

allows the model’s intercept term to vary over time and space, but holds the implicit

prices of the hedonic characteristics constant. These implicit prices can be interpreted

as averages over the time period covered by the model. With both time and submarket

dummies, the model is:

$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{P}\gamma_t D_{it} + \sum_{m=2}^{M}\delta_m S_{im} + \varepsilon_{it}, \qquad (1)$$

where $Y_{it}$ is the natural logarithm of the sale price of house i at time t, $X_{it}$ is the vector of property characteristics for house i at time t, the superscript T indicates the transpose function, $D_{it}$ is a set of P – 1 time dummy variables equal to 1 if house i sold at time t and 0 otherwise, $S_{im}$ is a set of M – 1 submarket dummy variables equal to 1 if house i is located in submarket m and 0 otherwise, $\alpha$ is the intercept term, $\beta$, $\gamma_t$, and $\delta_m$ are vectors of parameters to be estimated, and $\varepsilon_{it}$ is a random error term distributed $N(0, \sigma^2)$. The antilogarithms of the time dummy coefficients are used to construct a price index, with the index number $I_1 = 100$ and $I_t = \exp(\gamma_t) \times 100$ for $t > 1$.
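As a minimal sketch of the index construction just described, the snippet below converts estimated time dummy coefficients into index numbers with the first period set to 100. The function name and the coefficient values are hypothetical and serve only as an illustration; they are not estimates from our data.

```python
import numpy as np

def index_from_time_dummies(gamma):
    """Turn time dummy coefficients (periods 2..P) into index numbers.

    The base period has no dummy, so its index is 100 by construction;
    every later period is 100 times the antilog of its coefficient.
    """
    gamma = np.asarray(gamma, dtype=float)
    return np.concatenate(([100.0], 100.0 * np.exp(gamma)))

# Hypothetical coefficients for periods 2, 3, and 4 (illustration only).
gamma_hat = [0.02, 0.05, 0.04]
index = index_from_time_dummies(gamma_hat)
print(index)
```

Because the dependent variable is in logarithms, a coefficient of 0.05 corresponds to an index level of about 105.1 rather than exactly 105.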

As an alternative to using a “longitudinal” model with time dummies, price

indexes can be constructed by chaining together price increases based on the values of a

typical house as predicted by pairs of successive single-period estimations. The “typical”

house can be defined in terms of the average characteristics in the first or second

periods or a geometric mean of the two (Laspeyres, Paasche, and Fisher indexes,

respectively). A disadvantage of this approach is the need for a substantial number of

transactions each period; small samples may result in volatility in the coefficient

estimates due to changes in the characteristics of properties sold over time. Here we focus on the application of robust estimators to longitudinal hedonic modeling.6

6 A variation on the longitudinal approach involves moving windows containing a fixed number of time periods (see, e.g., Bourassa, Hoesli, Scognamiglio, and Sormani, 2008).

Robust Estimators

Conventional hedonic models are estimated using ordinary least squares. Hence, the model’s parameters are estimated by minimizing the sum of the squared errors, $\sum \varepsilon_{it}^2$. Part of the reason for the use of least squares is habit. The least squares estimator has been applied to a wide variety of problems for over 200 years. The estimator has a closed form solution and is easy to compute, although this advantage is less relevant today given modern computers. Also, if $\varepsilon_{it}$ is assumed to be Gaussian, the estimator is maximum likelihood and has the smallest variance among all unbiased estimators. However, it is well known that very small numbers of outliers can have substantial impacts on the regression results (Andersen, 2008).7 It can be difficult to diagnose this problem because the residuals themselves will be biased if the parameter estimates are biased (Heritier, Cantoni, Copt, and Victoria-Feser, 2009). In other words, the true outliers may be “masked”. Diagnostic tools, such as Cook’s distance (Cook, 1977), which compare models fitted with and without one observation at a time, do not cope with the problem of multiple masked outliers. Data “cleaning” may be problematic in part because decisions about which data to remove can be arbitrary. Iterative processes for identifying and removing outliers may never reach an obviously satisfactory conclusion and may remove a large proportion of the data. Moreover, removal of data implies that conventional inference obtained from the remaining sample is no longer valid and conventional tests are questionable (see MacDonald and Robinson, 1985, pp. 125-26).

7 Yohai (1987, p. 642) notes that “even one outlier may have a large effect on the estimate.” Janssen, Söderberg, and Zhou (2001) give an example of this by simulating a single outlier in a hedonic model of apartment building prices.

Robust estimators are designed to address deviations from the assumptions of a model and, therefore, robust statistics “as a collection of related theories, is the statistics of approximate parametric models” (Hampel, Ronchetti, Rousseeuw, and Stahel, 1986, p. 7). Robust estimators achieve robustness by assuming that the data generating process lies in the neighborhood of the “ideal” model. For instance, for the regression model in equation (1) this translates into assuming that the distribution of the error term $\varepsilon_{it}$ is

$$F_{\eta} = (1 - \eta)F + \eta G \qquad (2)$$

where $\eta$ is typically small, $F$ is the normal distribution, and $G$ is an arbitrary distribution. What is then sought is inference about $F$, the distribution of the majority of the data. To ensure robustness, the fitting criterion is changed.
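The contamination model can be illustrated with a short simulation. The sketch below is only an illustration under assumed values: a 5% contamination share (called eta in the code), a normal majority distribution with standard deviation 0.2, and an arbitrary contaminating distribution centered at a discount of -1. It compares the classical standard deviation with a simple robust scale, the normalized median absolute deviation (MAD), which is a stand-in here and not one of the estimators studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed illustration values, not taken from the paper's data.
n = 20000
eta = 0.05      # share of errors drawn from the contaminating distribution G
sigma = 0.2     # standard deviation of the majority distribution F

# Draw each error from G with probability eta and from F otherwise.
from_G = rng.random(n) < eta
errors = np.where(from_G,
                  rng.normal(-1.0, 1.0, n),   # G: arbitrary, here discounted and noisy
                  rng.normal(0.0, sigma, n))  # F: the "ideal" normal model

# The classical scale estimate is inflated by the contaminated minority.
sd_classical = errors.std(ddof=1)

# Normalized MAD: a simple robust scale that targets the majority distribution F.
sd_robust = 1.4826 * np.median(np.abs(errors - np.median(errors)))

print(f"classical sd: {sd_classical:.3f}, robust scale: {sd_robust:.3f}")
```

The robust scale stays close to the standard deviation of the majority distribution, which is the sense in which robust inference targets F.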

A variety of robust estimators has been developed and is available in various statistical packages such as R, Stata, and SAS. The different estimators are typically compared with respect to breakdown point and asymptotic efficiency.8 The breakdown point is a measure of resistance to outliers and refers to the largest percentage of contaminated data that the estimator can tolerate without producing an arbitrary result. For example, an estimator with a breakdown point of zero could produce an arbitrary result with as little as one outlying data point. The maximum breakdown point is 50% because otherwise the estimates would be based on a minority of the data. Asymptotic efficiency refers to the sampling variance for large samples of the estimator relative to that of OLS, which under standard assumptions is the most efficient of unbiased linear estimators. Estimators with high efficiency have low standard errors.

8 In addition, estimators can be compared with respect to their influence function. An estimator is said to be robust if it has bounded influence.
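The breakdown point can be made concrete with the two simplest location estimators. The following sketch uses simulated data and an arbitrary contaminating value of one million (both assumptions for illustration). The sample mean has a breakdown point of zero, while the median tolerates contamination of just under 50%.

```python
import numpy as np

rng = np.random.default_rng(2)
clean = rng.normal(0.0, 1.0, 1000)   # simulated "good" data, true center 0

def contaminate(data, fraction, value=1e6):
    """Replace the given fraction of observations with an arbitrary large value."""
    out = data.copy()
    k = int(round(fraction * out.size))
    out[:k] = value
    return out

results = {}
for frac in (0.0, 0.001, 0.10, 0.40, 0.60):
    sample = contaminate(clean, frac)
    results[frac] = (np.mean(sample), np.median(sample))
    print(f"contamination {frac:6.1%}: mean = {results[frac][0]:12.2f}, "
          f"median = {results[frac][1]:12.2f}")
```

A single contaminated observation (0.1% of the sample) is enough to move the mean far from zero, whereas the median becomes arbitrary only once the contaminated observations form a majority.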

Robust estimators have received limited attention in real estate research. In

general, previous studies involving hedonic models have not used what are now

considered to be the optimal robust techniques. Belsley, Kuh, and Welsch (1980) use an

M-estimator to identify outliers which they then delete, whereas the preferred method

is to retain all observations and use the estimator to automatically down weight outliers

because in the latter case inference is correct (i.e., the standard errors are correct).

Moreover, the M-estimator is not optimal due to its low breakdown point, meaning that

it cannot cope with large proportions of outliers and produces arbitrary results (as is the

case for OLS). Thorson (1994) and Janssen, Söderberg, and Zhou (2001) both use least

median of squares (LMS) regression, which minimizes the median of the squared

residuals (rather than the sum as in OLS). A problem with these authors’ approach is

that the LMS estimator is used only to identify outliers. The model is re-estimated using

weighted least squares, in which the outliers are assigned arbitrary weights of zero. The

least absolute deviations (also referred to as the least absolute values) method of Yoo

(2001), which minimizes the sum of the absolute values of the residuals, has low efficiency. The M-

estimators used by Song and Wilhelmsson (2010) have high efficiency, but as mentioned

above suffer from a low breakdown point. These authors also use Cook’s distance

statistic to delete some outliers, instead of simply allowing the robust procedure to

down weight them.

In theory, the preferred option would be to apply an MM-estimator, which

according to Andersen (2008) is probably now the most popular robust technique for

linear regression modeling. The MM-estimators were first proposed by Yohai (1987)

and they are appealing because they can have both a high breakdown point (e.g., 50%)

and predefined high efficiency (e.g., 95% relative to OLS when the errors have a normal

distribution). MM-estimators involve two M-estimators (hence the label MM). The first

step involves a highly resistant S-estimator that minimizes a robust M-estimate of the

scale (i.e., the standard deviation) of the residuals (Rousseeuw and Leroy, 1987):

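$$\hat{\theta}_S = \arg\min_{\theta}\, \hat{\sigma}_S\left(\varepsilon_1(\theta), \ldots, \varepsilon_n(\theta)\right), \qquad (3)$$

where the $\varepsilon_i(\theta)$ are the residuals and the value of $\theta$ (the vector of regression parameters) that minimizes $\hat{\sigma}_S$ is the S-estimator. The robust scale, $\hat{\sigma}_S$, satisfies

$$\frac{1}{n}\sum_{i=1}^{n}\rho\left(\frac{\varepsilon_i}{\hat{\sigma}_S}\right) = E\left[\rho(Z)\right], \qquad (4)$$

where $Z$ has a standard normal distribution and several choices are available for $\rho$. A popular option, which we will use here, is the Tukey biweight or bisquare function (Beaton and Tukey, 1974) given by

$$\rho(r) = \begin{cases} 3\left(\dfrac{r}{c}\right)^2 - 3\left(\dfrac{r}{c}\right)^4 + \left(\dfrac{r}{c}\right)^6 & \text{if } |r| \le c \\ 1 & \text{if } |r| > c \end{cases} \qquad (5)$$

where $c$ is a tuning constant that determines the tradeoff between breakdown point and efficiency.9 To define the robustness weights, we use

$$\psi(r) = \begin{cases} r\left[1 - \left(\dfrac{r}{c}\right)^2\right]^2 & \text{if } |r| \le c \\ 0 & \text{if } |r| > c \end{cases}, \qquad w(r) = \frac{\psi(r)}{r}, \qquad (6)$$

so that

$$w(r) = \begin{cases} \left[1 - \left(\dfrac{r}{c}\right)^2\right]^2 & \text{if } |r| \le c \\ 0 & \text{if } |r| > c \end{cases} \qquad (7)$$

Note that $w \to 0$ as $|r| \to c$ and $w = 0$ when $|r| \ge c$. Using the residuals, $\hat{\varepsilon}_{it}$, from the first step, the second step of the MM-estimation process simply computes the scale of the residuals, $\hat{\sigma}_S$.

9 The tuning constant is negatively related to the breakdown point and positively related to efficiency and is typically set to 1.547 for the initial S-estimator and then 4.685 for the subsequent M-estimator. This aims for a high breakdown point for the S-estimator and then a high level of efficiency for the M-estimator.

Fixing the scale at the value obtained from the second step, the third step involves computation of an M-estimate of the regression parameters:

$$\hat{\theta}_{MM} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\left(\frac{\varepsilon_i(\theta)}{\hat{\sigma}_S}\right) \qquad (8)$$

Again, the function $\rho$ is the bisquare function given in equation (5) and the corresponding weights are as in equation (7). Further technical details of this approach can be found in Maronna, Martin, and Yohai (2006).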
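The bisquare function, its weights, and the final M-step can be sketched in a few lines. The code below is a simplified illustration and not the procedure used in this paper: the starting values come from OLS and the fixed scale is the normalized median absolute deviation of the OLS residuals, whereas a genuine MM-estimator would take both from an S-estimator. All function names and the simulated data are hypothetical.

```python
import numpy as np

def rho_bisquare(r, c):
    """Tukey bisquare rho, scaled so that rho equals 1 whenever |r| > c."""
    u = np.clip(np.abs(np.asarray(r, dtype=float)) / c, 0.0, 1.0)
    return 3 * u**2 - 3 * u**4 + u**6

def weight_bisquare(r, c):
    """Robustness weights: (1 - (r/c)^2)^2 for |r| <= c and 0 otherwise."""
    u = np.asarray(r, dtype=float) / c
    return np.where(np.abs(u) <= 1.0, (1.0 - u**2) ** 2, 0.0)

def expected_rho_normal(c, half_width=10.0, n_grid=200001):
    """Approximate E[rho(Z)] for standard normal Z with a Riemann sum."""
    z = np.linspace(-half_width, half_width, n_grid)
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(rho_bisquare(z, c) * pdf) * (z[1] - z[0]))

def m_step_irls(X, y, c=4.685, n_iter=50):
    """M-estimate with a fixed scale via iteratively reweighted least squares.

    Simplification: OLS start and MAD scale replace the S-estimator
    that an MM-estimator would use for both.
    """
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ theta
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    for _ in range(n_iter):
        w = weight_bisquare((y - X @ theta) / scale, c)
        sw = np.sqrt(w)
        # Weighted least squares: scale rows by the square root of the weights.
        theta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return theta

# Simulated data: true intercept 1 and slope 2, with 10% discounted outliers.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
y[:50] -= 20.0
X = np.column_stack([np.ones(n), x])

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_rob = m_step_irls(X, y)
print("E[rho(Z)] at c = 1.547:", round(expected_rho_normal(1.547), 3))
print("OLS:", theta_ols, "robust:", theta_rob)
```

With c = 1.547 the expectation on the right-hand side of equation (4) is approximately 0.5, which is what gives the S-estimator its 50% breakdown point. In the simulated example the discounted observations receive zero weight after the first few iterations, so the robust intercept is much closer to its true value than the OLS intercept.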

While the MM-estimator seems optimal in theory, in practice it does not always perform as expected. Verardi and Croux (2009) use simulation to study how well various robust estimators respond to contaminated data. They first simulate a clean data set (no outliers) with five independent variables and a sample size of 1,000. They then randomly replace 10% of the observations with contaminated data for one of the independent variables (without modifying the dependent variable; these are called “leverage points”) and estimate the model using OLS and several robust estimators, including two MM-estimators, one with 95% efficiency, MM(0.95), and the other with 70% efficiency, MM(0.70).10 This was repeated for a total of 1,000 simulations, the results of which were then summarized and compared with the true results. By far, the least biased estimator was MM(0.70), indicating that the less efficient MM-estimator has properties that make it preferable to the more efficient one for this type of contamination (leverage points).

10 The efficiency level is modified by adjusting the tuning constant, c, for the M-estimator that is applied in the third step of the MM process. For 95% and 70% efficiency, c is set at 4.685 and 2.697, respectively.

Gürünlü Alma (2011) performs a similar simulation analysis to compare alternative robust indicators for regression equations containing two, three, or five independent variables for small samples (n = 30). Deviations in both the independent variables (“leverage points”) and the dependent variables (“vertical outliers”) are considered as contamination schemes, separately or jointly. Several contamination levels are considered and the results are based on 1,000 replications. Using $R^2$ as a measure of goodness of fit, the author concludes that the S-estimator (the first step of the MM-estimator) consistently outperforms MM(0.95) and in most cases outperforms the other estimators considered (OLS, M-estimator, and least trimmed squares).

Given the discrepancy between theory and empirical results, we undertake a simulation analysis based on our data as a means for choosing the optimal robust technique. We discuss our data in the next section before describing the simulation analysis in the following section.

Data

The data consist of 82,297 transactions of single-family residential properties in

Louisville Metro-Jefferson County, Kentucky, for the period from 1998Q1 through

2010Q2. The transactions data are from the local Multiple Listing Service (MLS) and the

county Property Valuation Administrator (PVA). Transactions were excluded from the

sample when required hedonic characteristics were missing or when the data were

obviously incorrect. Hedonic characteristics include the age of the structure, number of

bathrooms, finished square feet, lot size (in acres), and submarket. The nine

submarkets are approximately as defined by the MLS, although we combined two

submarkets and redefined the boundaries to coincide with zip codes.

We supplemented the MLS data with information from the PVA identifying the

seller of each property. This allowed us to flag real estate owned (REO) sales by

financial institutions, which we categorized as distressed sales. Other distressed sales,

such as short sales, were not identifiable during the time period covered by the data.

Exhibit 1 summarizes the data for the full sample and by submarket and year of

sale. Area (submarket) 1 stands out as having by far the lowest average sale price and

the largest percentage of distressed sales, both in terms of the percentage of sales

within that submarket (27.8%) and in terms of the percentage of all distressed sales

(38.7%). This submarket has the oldest and smallest structures and the smallest lot size

on average. The data by year of sale show that average prices increased steadily from

1998 through 2007 and then declined between 2007 and 2009, followed by a small

increase in the first two quarters of 2010. Average age increased by about one year per

year, reflecting the aging housing stock, while house size remained fairly constant. Lot

size decreased at a slow rate, with an overall average of a quarter acre. Distressed sales

constituted 7.9% of all transactions. By the first two quarters of 2010, distressed sales

(as we define them) had increased to 18% of all transactions. Over 50% of distressed

transactions occurred in 2007 or more recently. Overall transaction volume tends to

move with house prices.

[Exhibit 1 here]

Simulations

To create a “true” index to compare with the robust indexes, we add to the

conventional index controls for sales of distressed properties (REO sales). The REO

controls are specified as a series of time dummies equal to 1 if the sale was REO and 0

otherwise. This allows the impacts of distressed sales to vary over time with the

severity of the foreclosure crisis. The model with controls for REO sales is specified as:

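$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{P}\gamma_t D_{it} + \sum_{m=2}^{M}\delta_m S_{im} + \sum_{t=1}^{P}\lambda_t R_{it} + \varepsilon_{it}, \qquad (9)$$

where $R_{it}$ are the REO time dummies and $\lambda_t$ are the corresponding parameters to be estimated. Note that the REO dummies start in t = 1.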

In order to compare conventional (OLS) and robust indexes with the known

index, we generate data according to the model

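$$Y_{it} = \hat{\alpha} + X_{it}^{T}\hat{\beta} + \sum_{t=2}^{50}\hat{\gamma}_t D_t + \sum_{m=2}^{9}\hat{\delta}_m S_m + \sum_{t=1}^{50}\hat{\lambda}_t R_t + \varepsilon_{it}, \qquad (10)$$

where the dummy covariates $D_t$, $S_m$, and $R_t$ are the ones from the market-wide dataset, the coefficient values $\hat{\alpha}$, $\hat{\beta}$, $\hat{\gamma}_t$, $\hat{\delta}_m$, and $\hat{\lambda}_t$ are those stemming from fitting equation (9), and the error terms $\varepsilon_{it}$ are generated according to an $N(0, 0.2)$ distribution. We then estimate the model

$$Y_{it} = \alpha + X_{it}^{T}\beta + \sum_{t=2}^{50}\gamma_t D_t + \sum_{m=2}^{9}\delta_m S_m + \varepsilon_{it}, \qquad (11)$$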

to reproduce the situation in which REO dummies are not available for fitting.11

We repeated this procedure 200 times with both OLS and robust techniques. Exhibit 2 shows the simulation results. The top panel represents the estimates obtained with OLS. Each box plot is the summary of the 200 estimates for each time dummy. The bottom panels give the same information for the S, MM(0.70), and MM(0.95) robust estimates. Because the known underlying values have been subtracted, we expect the box plots to be centered on zero if the parameters are estimated correctly. As the exhibit shows, all of the estimators track the known index closely when the proportion of REO transactions is low. However, the OLS and MM(0.95) indexes display increasing bias as the proportion of REO transactions increases. The MM(0.95) index is not as biased as the OLS index, but it nevertheless displays substantial bias during the later years of the period. The MM(0.70) index does a better job than the MM(0.95) index, but it still displays some bias. By far the least biased is the S-estimator; consequently, we will use that estimator for the rest of our analysis.

11 Some versions of MM- and S-estimators have difficulty with the large numbers of dummy variables in hedonic models due to the fact that they start with a subsample of the data. For example, the lmrob() function in R’s robustbase package has recently been updated to allow MM-estimators to accommodate a large number of categorical variables, although it may be necessary in some circumstances to try multiple seeds for the random number generator used in the first stage of the MM procedure. Also, the cov=“.vcov.w” option is recommended. For another example, it may be necessary to modify the subsetsize option when using SAS’s robustreg procedure for the S-estimator.
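The logic of the simulation design can be sketched on a much smaller scale. The example below is a toy version under assumed values (five periods, one hedonic characteristic, a rising share of REO sales, and a log discount of 0.30 for REO sales); none of these numbers come from our data. It generates prices from a model that includes an REO discount, fits OLS without REO controls, and reports the bias in the time dummy coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustration values, not estimates from the paper.
n_periods, n_per_period = 5, 2000
true_gamma = np.array([0.00, 0.05, 0.10, 0.15, 0.20])  # "known" log index
reo_share = np.array([0.00, 0.02, 0.05, 0.15, 0.30])   # REO share per period
reo_discount = -0.30                                    # log discount on REO sales

period = np.repeat(np.arange(n_periods), n_per_period)
log_sqft = rng.normal(7.5, 0.3, period.size)
is_reo = rng.random(period.size) < reo_share[period]

# Data generating process with an REO effect (error standard deviation 0.2 assumed).
y = (4.0 + 0.8 * log_sqft + true_gamma[period]
     + reo_discount * is_reo + rng.normal(0, 0.2, period.size))

# Fit without REO controls: intercept, characteristic, and time dummies for periods 2..5.
D = (period[:, None] == np.arange(1, n_periods)).astype(float)
X = np.column_stack([np.ones(period.size), log_sqft, D])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

gamma_hat = coef[2:]
bias = gamma_hat - true_gamma[1:]
print("bias in time dummy coefficients:", np.round(bias, 3))
```

The downward bias grows with the REO share, mirroring the pattern described for the OLS index; replacing the OLS fit with a robust estimator would be the next step in a fuller replication.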

[Exhibit 2 here]

Robust and Conventional Indexes

Exhibit 3 shows the conventional longitudinal (OLS single-equation) hedonic

index compared to a robust index calculated using the S-estimator. The exhibit also

shows an OLS index that controls for REO sales. The conventional index

increases fairly steadily through 2005Q3 and then it becomes much more volatile, with a

pronounced drop in late 2008. The index with REO controls is above the conventional

index from 2001Q3, and the gap tends to grow with the volume of REO transactions.

This is consistent with the expectation that failure to control for distressed sales should

bias the index downwards. The robust index is generally much closer to the index with

REO controls than to the conventional index. This is particularly the case during the

latter, more volatile, part of the series. Moreover, the robust index is somewhat less

volatile than the index with REO controls, which is likely because the robust index is

controlling for the impacts of outliers in addition to REO sales.

[Exhibit 3 here]

Exhibit 4 displays 95% confidence intervals for the conventional and robust

indexes. The confidence intervals show that the robust index is significantly different

from the conventional index in 11 quarters, all at the end of the series. Given the

influence of outliers in the conventional index, the confidence intervals for that index

are undoubtedly too large (note how narrow the interval is for the robust index),

implying that the indexes have significant differences more frequently than depicted.

[Exhibit 4 here]

Exhibit 3 suggests that REO transactions are an important source of outliers.

Exhibit 1 indicates that 38.7% of all REO sales occurred in Area 1 and 27.8% of the

transactions in Area 1 were REO sales. Area 1 (West Louisville) is a low-income, largely

African-American, part of the city, and it was subject to considerable investment by

landlords during the early part of the period and large numbers of foreclosures during

the latter part of the period. Within Area 1, distressed transactions rose from 6.2% in

1998 to 20.0% in 2003. In 2004, however, 33.6% of transactions were distressed and by

the first half of 2010 some 49.2% were distressed. The large percentages of problematic

transactions suggest that a robust estimation for Area 1 should be less likely to consider

these to be outliers. Consistent with this, Exhibit 5 shows that for Area 1 the robust

index tracks the conventional index rather than the index with REO controls.

[Exhibit 5 here]

Index Revisions

Index revision refers to the fact that, as data for subsequent periods are added,

the historical values of an index are subject to change. Such revisions are problematic,

in particular when an index is used for the settlement of derivative contracts (Deng and

Quigley, 2008). Revisions are an inherent problem for repeat sales indexes, but are not

necessarily an issue for hedonic indexes. For example, an index that is derived from

chained single-period hedonic equations will not be subject to revision. However, an

index based on a longitudinal equation or a moving window series of equations may

change as new time periods and new data are added.

Our strategy for studying revisions follows the approach of Clapham, Englund,

Quigley, and Redfearn (2006). We focus on revisions to the 10 quarters in the middle

of our series (2003Q1 to 2005Q2). This gives us 20 quarters (1998Q1 to 2002Q4) to build up the index prior

to the analysis period. This also yields 20 sets of revisions for each of the 10 analysis

quarters as we progressively add quarterly data for 2005Q3 through 2010Q2.
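
The re-estimation scheme described above can be sketched as follows. This is a minimal illustration on simulated data, assuming an OLS time-dummy model with a single hypothetical characteristic (log floor area); the variable names, sample size, and data-generating process are not from the paper, and a robust estimator would replace the least-squares step in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

N_QUARTERS = 50        # 1998Q1 through 2010Q2
FIRST, LAST = 20, 29   # 0-based positions of 2003Q1 and 2005Q2

# Hypothetical simulated sales (not the Louisville data).
n = 5000
quarter = rng.integers(0, N_QUARTERS, size=n)
log_size = rng.normal(7.5, 0.3, size=n)
true_log_index = np.cumsum(rng.normal(0.005, 0.01, size=N_QUARTERS))
true_log_index -= true_log_index[0]
log_price = 11 + 0.5 * log_size + true_log_index[quarter] + rng.normal(0, 0.2, size=n)


def longitudinal_index(last_quarter):
    """Time-dummy index estimated from all sales up to and including last_quarter."""
    keep = quarter <= last_quarter
    q = quarter[keep]
    # One dummy per quarter after the first; quarter 0 is the base period.
    dummies = (q[:, None] == np.arange(1, last_quarter + 1)).astype(float)
    X = np.column_stack([np.ones(keep.sum()), log_size[keep], dummies])
    beta, *_ = np.linalg.lstsq(X, log_price[keep], rcond=None)
    log_index = np.concatenate([[0.0], beta[2:]])
    return 100 * np.exp(log_index)


# Row 0 uses data through 2005Q2; rows 1 to 20 add 2005Q3 through 2010Q2
# one quarter at a time. Columns are the 10 analysis quarters.
vintages = np.array([
    longitudinal_index(end)[FIRST:LAST + 1] for end in range(LAST, N_QUARTERS)
])
print(vintages.shape)
```

Each row of `vintages` is one estimate of the same 10 quarters, so differences between rows are the revisions caused solely by adding later data to the pooled regression.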

Exhibit 6 reports statistics for cumulative revisions over the 10-quarter period as

well as quarterly revisions for the conventional, conventional with REO controls, and

robust indexes. The cumulative revisions refer to the changes in price growth over the

10-quarter period, while the quarterly revisions refer to the changes in price growth for

the individual quarters within that period. The means of the absolute values of the

cumulative revisions for the three indexes are 1.13%, 0.49%, and 0.09%, respectively.

The corresponding figures for the quarterly revisions are 0.06%, 0.03%, and 0.01%.

These results indicate that the robust method virtually eliminates the revision problem.

This can be seen clearly in Exhibit 7. Each panel of the exhibit shows 21 superimposed

indexes for 2003Q1 through 2005Q2, including the original index plus the 20 revisions.
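
As a rough sketch of how summary statistics of this kind can be computed from a set of index vintages, the code below measures each revision against the original index, takes cumulative growth as the percentage change between the first and last analysis quarters, and uses the sample standard deviation. These conventions are assumptions made for illustration; the index values are hypothetical and far smaller in number than the 21 vintages used in the paper.

```python
import numpy as np

# Hypothetical index vintages: row 0 is the original index, later rows are
# revisions. Columns are consecutive quarters of the analysis period.
vintages = np.array([
    [120.0, 121.0, 122.5, 124.0],
    [120.2, 121.3, 122.9, 124.6],
    [120.3, 121.5, 123.2, 125.0],
])


def summarize(revisions):
    """Summary statistics of a 1-D array of revisions (percentage points)."""
    return {
        "mean_abs": float(np.mean(np.abs(revisions))),
        "min": float(np.min(revisions)),
        "max": float(np.max(revisions)),
        "range": float(np.max(revisions) - np.min(revisions)),
        "std": float(np.std(revisions, ddof=1)),
    }


def revision_stats(vintages):
    # Price growth over the whole period, in percent, for each vintage.
    cumulative = 100 * (vintages[:, -1] / vintages[:, 0] - 1)
    # Quarter-on-quarter price growth, in percent, for each vintage.
    quarterly = 100 * (vintages[:, 1:] / vintages[:, :-1] - 1)
    # Revisions: each later vintage minus the original vintage (row 0).
    cumulative_rev = cumulative[1:] - cumulative[0]
    quarterly_rev = (quarterly[1:] - quarterly[0]).ravel()
    return {
        "cumulative": summarize(cumulative_rev),
        "quarterly": summarize(quarterly_rev),
    }


stats = revision_stats(vintages)
print(stats)
```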

[Exhibits 6 and 7 here]

Conclusions

Even in the best real estate database, it is almost inevitable that there will be

measurement and data entry errors. Moreover, a database may omit key variables,

such as identifiers for one or more types of distressed sales. By reducing the impacts of

outliers, robust methods provide a means for addressing such data problems. Using a

longitudinal hedonic model with simulated data, we compare three robust

estimators: two MM-estimators with different levels of efficiency and an S-

estimator. We select the S-estimator for further empirical analysis because it provides the

least biased estimates of a known index.

Using over 82,000 house sales from Louisville, Kentucky, we then construct a

robust hedonic index and compare it with a conventional OLS index and a conventional

index that controls for a major source of outliers, distressed transactions. The robust

index differs significantly from the conventional index and is much closer to the index

with controls for distressed sales. In contrast, in a submarket with a large percentage of

distressed sales, the robust index is much closer to the conventional index because the

distressed sales are not treated as outliers. We also show that robust techniques

minimize the index revision problem that occurs when new data are incorporated into a

longitudinal hedonic model.

Acknowledgments

We thank the Greater Louisville Association of Realtors for providing the

Multiple Listing Service data.

References

Andersen, R. Modern Methods for Robust Regression. Thousand Oaks, CA: Sage

Publications, 2008.

Beaton, A.E. and J.W. Tukey. The Fitting of Power Series, Meaning Polynomials,

Illustrated on Band-Spectroscopic Data. Technometrics, 1974, 16:2, 147-85.

Belsley, D.A., E. Kuh, and R.E. Welsch. Regression Diagnostics: Identifying Influential

Data and Sources of Collinearity. New York: John Wiley, 1980.

Bourassa, S.C., E. Cantoni, and M. Hoesli. Robust Repeat Sales Indexes. Real Estate

Economics, 2013 (in press).

Bourassa, S.C., M. Hoesli, and J. Sun. A Simple Alternative House Price Index Method.

Journal of Housing Economics, 2006, 15:1, 80-97.

Bourassa, S.C., M. Hoesli, D. Scognamiglio, and P. Sormani. Constant-Quality House

Price Indexes for Switzerland. Swiss Journal of Economics and Statistics, 2008,

144:4, 561-75.

Campbell, J.Y., S. Giglio, and P.A. Pathak. Forced Sales and House Prices. American

Economic Review, 2011, 101:5, 2108-31.

Clapham, E., P. Englund, J.M. Quigley, and C.L. Redfearn. Revisiting the Past and Settling

the Score: Index Revision for House Price Derivatives. Real Estate Economics,

2006, 34:2, 275-302.

Cook, R.D. Detection of Influential Observations in Linear Regression. Technometrics,

1977, 19:1, 15-18.

Deng, Y. and J.M. Quigley. Index Revision, House Price Risk, and the Market for House

Price Derivatives. Journal of Real Estate Finance and Economics, 2008, 37:3, 191-

209.

Gatzlaff, D.H. and D.R. Haurin. Sample Selection and Biases in Local House Value

Indices. Journal of Urban Economics, 1998, 43:2, 199-222.

Gürünlü Alma, Ö. Comparison of Robust Regression Methods in Linear Regression.

International Journal of Contemporary Mathematical Sciences, 2011, 6:9, 409-21.

Hampel, F.R., E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust Statistics: The

Approach Based on Influence Functions. New York: John Wiley & Sons, 1986.

Heritier, S., E. Cantoni, S. Copt, and M.-P. Victoria-Feser. Robust Methods in

Biostatistics. Chichester, UK: John Wiley & Sons, 2009.

Huber, P.J. Robust Estimation of a Location Parameter. Annals of Mathematical

Statistics, 1964, 35:1, 73-101.

Janssen, C., B. Söderberg, and J. Zhou. Robust Estimation of Hedonic Models of Price

and Income for Investment Property. Journal of Property Investment and Finance,

2001, 19:4, 342-60.

MacDonald, G.M. and C. Robinson. Cautionary Tails about Arbitrary Deletion of

Observations; or, Throwing the Variance Out with the Bathwater. Journal of Labor

Economics, 1985, 3:2, 124-52.

Maronna, R., R. Martin, and V. Yohai. Robust Statistics: Theory and Methods.

Chichester, UK: John Wiley & Sons, 2006.

McMillen, D. and P. Thorsnes. Housing Renovations and the Quantile Repeat-Sales Price

Index. Real Estate Economics, 2006, 34:4, 567-84.

Peña, D. and J. Ruiz-Castillo. Robust Methods of Building Regression Models: An

Application to the Housing Sector. Journal of Business and Economic Statistics,

1984, 2:1, 10-20.

Pennington-Cross, A. The Value of Foreclosed Property. Journal of Real Estate

Research, 2006, 28:2, 193-214.

Prasad, N. and A. Richards. Improving Median Housing Price Indexes through

Stratification. Journal of Real Estate Research, 2008, 30:1, 45-71.

Rousseeuw, P.J. and A.M. Leroy. Robust Regression and Outlier Detection. New York:

John Wiley, 1987.

Song, H.-S. and M. Wilhelmsson. Handling Outliers in the Construction of a Price Index

for Condominiums. Paper presented at the American Real Estate Society Annual

Meeting, Naples, FL, 2010.

Standard & Poor’s. S&P/Case-Shiller Home Price Indices: Index Methodology. New York:

McGraw-Hill, 2009.

Thorson, J.A. The Use of Least Median of Squares in the Estimation of Land Value

Equations. Journal of Real Estate Finance and Economics, 1994, 8:2, 183-90.

Verardi, V. and C. Croux. Robust Regression in Stata. Stata Journal, 2009, 9:3, 439-53.

Wang, F.T. and P.M. Zorn. Estimating House Price Growth with Repeat Sales Data:

What’s the Aim of the Game? Journal of Housing Economics, 1997, 6:2, 93-118.

Yohai, V.J. High Breakdown Point and High Efficiency Robust Estimates for Regression.

Annals of Statistics, 1987, 15:2, 642-56.

Yoo, S.-H. A Robust Estimation of Hedonic Price Functions: Least Absolute Deviations

Estimator. Applied Economics Letters, 2001, 8:1, 55-58.

Exhibit 1 Descriptive Statistics

Columns: (1) sample size; (2) mean sale price ($); (3) mean age of structure; (4) mean number of bathrooms; (5) mean finished square feet; (6) mean lot size (acres); (7) distressed (REO) sales as a % of sales in submarket or year; (8) distressed (REO) sales as a % of all distressed sales.

                 (1)       (2)    (3)   (4)     (5)    (6)    (7)     (8)
By submarket
1              9,097    53,239   72.5   1.4   1,310   0.13   27.8    38.7
2              8,687   158,962   69.8   1.7   1,723   0.17    4.2     5.6
3              5,795   230,385   56.9   2.1   2,224   0.27    1.5     1.4
4              7,663    98,547   38.1   1.5   1,438   0.25   11.9    14.0
5              8,427   109,272   40.2   1.6   1,526   0.28    9.7    12.5
6              8,538   120,263   25.9   1.7   1,562   0.25    7.9    10.3
7             16,044   154,847   25.8   2.0   1,857   0.26    4.5    11.0
8             11,169   253,249   19.7   2.6   2,685   0.34    2.6     4.4
9              6,877   255,641   17.2   2.7   2,729   0.33    2.1     2.2
By year of sale
1998           6,192   129,875   33.4   1.9   1,863   0.26    1.0     1.0
1999           6,577   132,187   35.7   1.9   1,827   0.26    1.5     1.5
2000           6,047   141,663   35.8   1.9   1,867   0.25    2.1     2.0
2001           6,238   146,872   36.6   1.9   1,843   0.25    2.9     2.8
2002           6,031   152,285   36.3   1.9   1,862   0.25    3.5     3.2
2003           6,941   163,342   36.5   2.0   1,916   0.26    4.9     5.2
2004           7,182   165,134   38.5   1.9   1,902   0.26    8.9     9.8
2005           7,457   171,811   40.0   1.9   1,921   0.25    9.4    10.8
2006           7,362   173,996   40.7   1.9   1,919   0.25   10.5    11.9
2007           7,170   177,803   42.3   1.9   1,938   0.25   12.6    13.9
2008           5,882   165,613   43.0   1.9   1,895   0.24   15.5    13.9
2009           6,030   159,798   44.6   1.9   1,895   0.24   16.5    15.3
2010           3,188   161,967   44.2   1.9   1,913   0.24   18.0     8.8
Full sample   82,297   157,669   38.9   1.9   1,890   0.25    7.9   100.0

Source: Authors’ calculations based on Greater Louisville Association of Realtors’ Multiple Listing Service and Jefferson County Property Valuation Administrator data.

Note: The 2010 data are for the first two quarters only.

Exhibit 2 Simulation Analysis

Exhibit 3 Conventional, Conventional with REO Controls, and Robust (S) Indexes for Louisville

[Figure not reproduced. Vertical axis: index value, 100 to 145. Series: Conventional; Conventional with REO controls; Robust (S).]

Exhibit 4 Confidence Intervals (95%) for Conventional and Robust (S) Indexes

[Figure not reproduced. Vertical axis: index value, 90 to 150. Series: Conventional; Robust (S).]

Exhibit 5 Conventional, Conventional with REO Controls, and Robust (S) Indexes for Area 1

Note: Indexes are four-quarter moving averages.

[Figure not reproduced. Vertical axis: index value, 60 to 150. Series: Conventional; Conventional with REO controls; Robust (S).]

Exhibit 6 Analysis of Revisions for 2003Q1 through 2005Q2

                              Conventional   Conventional indexes   Robust
                              indexes        with REO controls      indexes
Cumulative revisions (%)
  Mean (of absolute values)       1.13              0.49              0.09
  Minimum                         0.24              0.09             -0.12
  Maximum                         1.94              0.93              0.22
  Range                           1.70              0.84              0.34
  Standard deviation              0.51              0.26              0.11
Quarterly revisions (%)
  Mean (of absolute values)       0.06              0.03              0.01
  Minimum                        -0.01             -0.03             -0.02
  Maximum                         0.16              0.09              0.02
  Range                           0.17              0.11              0.04
  Standard deviation              0.04              0.02              0.01

Exhibit 7 Revisions for 2003Q1 through 2005Q2: (a) Conventional Indexes; (b) Conventional Indexes with REO Controls; (c) Robust (S) Indexes

[Figure not reproduced. Three panels, (a), (b), and (c), each with a vertical axis of index values from 115 to 140.]
