nonparametric likelihood ratio model selection tests ...doubleh/papers/sieve.pdf · ference in rio...

Nonparametric Likelihood Ratio Model Selection Testsbetween Parametric Likelihood and Moment Condition

Models1

Xiaohong Chena, Han Hongb and Matthew Shumc

a Department of Economics, New York University.b Department of Economics, Duke University.

c Department of Economics, Johns Hopkins University.

First version: November 2001, Current version: October 2005

Abstract

We propose a nonparametric likelihood ratio testing procedure for choosing betweena parametric (likelihood) model and a moment condition model when both models couldbe misspecified. Our procedure is based on comparing the Kullback-Leibler Informa-tion Criterion (KLIC) between the parametric model and moment condition model. Weconstruct the KLIC for the parametric model using the difference between the para-metric log likelihood and a sieve nonparametric estimate of population entropy, andobtain the KLIC for the moment model using the empirical likelihood statistic. We alsoconsider multiple (> 2) model comparison tests, when all the competing models couldbe misspecified, and some models are parametric while others are moment-based. Weevaluate the performance of our tests in a Monte Carlo study, and apply the tests to anexample from industrial organization.

Keywords: Model Selection Tests, Empirical Likelihood, KLICJEL classification: C52

1We thank the associate editor, three referees, J. Geweke, J. Horowitz, Y. Kitamura, O. Linton, N.

Meddahi, S. Ng, H. Paarsch, R. Porter and seminar participants in Carnegie-Mellon, U. Iowa, the 2003

Econometric Society Winter Meetings in Washington DC, SITE 2003, and the 2004 Semiparametrics con-

ference in Rio for helpful comments. We also thank D. Chen and in particular D. Nekipelov for excellent

research assistance in the simulation study. We also thank the National Science Foundation and the Sloan

Foundation for generous research support.

1 Introduction

In many empirical applications, similar economic theoretical models imply different econo-metric specifications: some specify parametric (likelihood-based) models while others spec-ify moment based models. For a given dataset, then, applied researchers face the problemof how to select one model among the multiple competing models. For example, in de-mand analysis, both likelihood based estimation methods (Gowrisankaran, Geweke, andTown (2003)) and moment based estimation methods (Berry, Levinsohn, and Pakes (1995))have been used. In empirical structural auction models, Paarsch (1992) showed that whilemaximum likelihood is feasible for some models, other models can be feasibly estimated bymethod of moments.

There has been a well developed econometric literature on model selection when all thecompeting models are parametric likelihood-based, and can be estimated by parametricmaximum likelihood; see, for example, Vuong (1989), Sin and White (1996), Gourierouxand Monfort (1994) and the references therein. Several recent papers have also developedmodel selection tests when all the competing models are moment based, and can be esti-mated by generalized method of moments or empirical likelihood; see, for example, Hansen(1982), Kitamura (2000), Smith (1992), Rivers and Vuong (2001), Ramalho and Smith(2002) and others. In particular, using information theoretic methods,2 Kitamura (2000)developed the nonparametric analog of Vuong’s 1989 model selection tests for potentiallymisspecified (unconditional) moment based models, which is also extended to conditionalmoment restriction models in Kitamura (2002). However, to the best of our knowledge,there is no published work in the likeliood setting on model selection tests among compet-ing parametric (likelihood-based) models and (unconditional) moment based models. Forthe non-likelihood based models, Horowitz and Hardle (1994) offered a specification testbetween a parametric model against a semiparametric alternative.

In this paper, we formulate nonparametric likelihood ratio tests for choosing among para-metric models and moment models when all these models could be misspecified. Our pro-cedures are based on comparing the Kullback-Leibler Information Criterion (KLIC) for theparametric models and the moment condition models, and can be regarded as extensions ofthe likelihood ratio test of Vuong (1989) for comparing two parametric models and of thenonparametric likelihood ratio test of Kitamura (2000) for comparing two moment-based

2cf. Kitamura and Stutzer (1997), Qin and Lawless (1994), Kitamura (1997), Imbens, Spady, and Johnson

(1998) and Kitamura and Tripathi (2001).

2

models. We construct a sample estimate of the KLIC for a parametric model using thedifference between the maximized sample log-likelihood parametric function and a sievenonparametric estimate of the population entropy;3 we obtain a sample estimate of theKLIC for a moment-based model by the empirical likelihood method (cf. Owen (2001) andQin and Lawless (1994)). Under mild regularity conditions, we show that both estimatedKLIC are

√n−consistent and asymptotically normally distributed under potential model

misspecification. The difference between the estimated KLICs is then used to construct ourtest statistics. For comparison between two models, the proposed test is directional andapplies to situations in which the two competing models are non-nested, overlapping, ornested and whether both, one, or neither is misspecified. We derive large sample statisticalproperties of these tests under mild regularity conditions.

In some applications, researchers need to compare multiple econometric structural modelsimplied by several competing economic theories, where some of the candidate models areparametric, and some are moment-based. For multiple model comparison, we follow theapproach of White (2000) by specifying a benchmark model, and require that at least oneof the candidate models is not nesting nor nested by the benchmark model. The benchmarkmodel can be either a parametric model or a moment-based model.

In this paper we use the KLIC as the model comparison criterion, which is a popularcriterion in both statistics and econometrics. On the one hand, it is well known that,for parametric models, the maximum likelihood estimate minimizes the KLIC between theparametric density and the true population density even when the model is misspecified;see e.g. White (1982). On the other hand, for moment models, in which a set of momentconditions form the statistical model for approximating the population data-generatingprocess, the maximized empirical likelihood criterion function is the minimal KLIC betweenthe set of implied densities which satisfy the moment conditions and the true populationdensity. Therefore it seems convenient to use a KLIC criterion in comparing parametric andmoment-based models, because in both cases the KLIC can be constructed from componentswhich are outputs of the estimation procedure.

We note that KLIC is not the only choice for the discrepancy measure between differentfamilies of probability distributions defined by competing models. In some applications,alternative discrepancy measures that measure goodness-of-fit in some sense, such as the

3Any choice of the sieve basis, such as spline, Fourier series, power series, neural network, Hermite

polynomial, wavelet, normal mixture and many others, can be used to estimate the population entropy.

3

Kolmogorov-Smirnov distance, might be preferred.4 Nevertheless, we need to stress thepoint that no matter which discrepancy measure to use for model selection tests, it ismore desirable to use the same criterion for both the parameter estimation and the modelcomparison. This is because in model selection problems, all the competing models couldbe misspecified even under the null hypothesis that they are equally close to the truedata generating process (DGP) according to some discrepancy measure; for this reason, itis important to apply the same criterion to define pseudo true parameters as well as tocompare the models. This is in contrast to consistent specification test problems, wherethe model is correctly specified under the null hypothesis. For this case, one can apply anyconsistent estimator for the true unknown parameters under the null and then apply othercriteria to define test statistics. The crucial difference is that the true parameters onlydepend on the true DGP and do not depend on the criterion used to estimate them, butthe pseudo true parameters depend on both the true DGP and the discrepancy measureused to define them.5

The rest of the paper is organized as follows: Section 2 discusses the definition and esti-mation of the KLIC for parametric models and moment condition models. In Section 3 weformulate nonparametric likelihood ratio tests between a parametric model and a momentcondition model when both can be misspecified. Monte Carlo evidence on the small-sampleproperties of this test are presented in Section 4. In Section 5, we study selection testsamong multiple models where some models are parametric and others are moment-based.Section 6 contains an empirical illustration of our testing procedure to price search models.Section 7 briefly concludes. All technical proofs are gathered in the Appendix.

2 The KLIC for parametric and moment condition models

Throughout the paper, we assume that there is a random sample (zi : i = 1, ..., n) fromZ, whose true but unknown probability density is h0(Z). We refer to the parametric like-lihood specification as model F, which is given by a family of probability density functionsf(Z;β) : β ∈ Rdβ, where the functional form of f(·;β) is specified up to unknownfinite-dimensional parameter β. We refer to the moment model as M, which is described

4macroeconomic models, other popular model comparison criteria include the Hansen-Jaganathen dis-

tance in evaluating asset pricing models; the mean squared prediction error, the median squared prediction

error, density forecasts evaluation and the conditional Kolmogorov test. Related work includes Granger

(2002), F.X.Diebold (1989).5See Chen and Fan (2005) for more detailed discussion of this matter.

4

by the unconditional moment restrictions E0[m (Z;α)] = 0, where the functional form ofm (·;α) ∈ Rdm is specified up to an unknown finite-dimensional parameter α ∈ Rdα , andthe expectation E0[·] is taken with respect to the true data generating process (DGP). Forfeasibility, we focus on the over-identified case where dm ≥ dα.

The KLIC defines a pseudo distance between a family of distribution functions and the trueDGP. A sample estimate of the KLIC of a parametric model can be obtained by combiningthe maximized sample log-likelihood parametric function with a sieve nonparametric esti-mate of the population entropy. A sample estimate of the KLIC for the moment conditionmodel can be obtained by the empirical likelihood methods.

2.1 KLIC for parametric models

The KLIC from a family of densities h (Z) : h ∈ T to a density h0 (Z) is defined as

I

(h (Z) : h ∈ T

∣∣∣∣h0 (·))≡ min

h∈T

∫ (log

h0 (z)h (z)

)h0 (z) dz ≥ 0,

where the equality is achieved at h(·) = h0 (·) only if h0 ∈ T . The family of densitiesh (Z) : h ∈ T is called misspecified if h0 /∈ T , in which case the KLIC is strictly positive.The population KLIC for a parametric likelihood family f(Z;β) : β ∈ B, B a compactsubset of Rdβ , is

I

(f(·;β) : β ∈ B

∣∣∣∣h0 (·))

= minβ∈B

∫ (log

h0(z)f(z;β)

)h0 (z) dz.

The parametric maximum likelihood estimator,

β = argmaxβ∈B1n

n∑t=1

log f (zt;β)

is well known to minimize sample versions of the population KLICs. The behavior of theparametric maximum likelihood estimator under mis-specification is by now well understood(see, for example, White (1982)). Define the pseudo true value β∗ as

β∗ = argminβ∈BI

(f(·;β) : β ∈ B

∣∣∣∣h0(·)). (2.1)

Under mild regularity conditions, β converges to β∗ at the√n-rate and is asymptotically

normally distributed.

5

ASSUMPTION 1 Assume that the data ztnt=1 are i.i.d., B is a compact subset of Rdβ

with non-empty interior, and the following conditions hold:

1. The solution β∗ to (2.1) is unique and belongs to the interior of B.

2. log f(zt;β), is continuous at β ∈ B with probability one, and E0[supβ∈B | log f(zt;β)|] <∞.

3. log f(zt;β) is twice continuously differentiable at β ∈ N (β∗) with probability one,where N (β∗) is a small neighborhood around β∗, and

E0[ supβ∈N (β∗)

∣∣∣∣∂2 log f (zt;β)∂β∂β′

∣∣∣∣] <∞.

4. Af and Ωf are finite and non-singular, where

Af ≡∂2E0log f (zt;β∗)

∂β∂β′,Ωf ≡ E0

[∂ log f (zt, β∗)

∂β

∂ log f (zt, β∗)′

∂β

].

In the above, E0 denote the expectation with respect to h0(z), the true density function ofthe data z. The next theorem follows directly from Newey and McFadden (1994) (theorem2.1, lemma 2.4, theorem 3.1), and is therefore stated without a proof.

Theorem 1 Under assumption 1,√n(β − β∗) d−→ N

(0, A−1

f ΩfA−1f

), and

n∑t=1

(log f(zt; β)− log f(zt;β∗)

)= Op (1) .

It is well known that the conclusion of the above theorem still holds if we replace assumption1 by other weaker sufficient conditions.

2.1.1 Sieve nonparametric entropy estimation

The population KLICs for parametric models can be consistently estimated by the differencebetween the average log likelihood function and a nonparametric estimate of the entropy ofthe underlying population, given by

1n

n∑t=1

log h(zt)−1n

n∑t=1

log f(zt; β).

6

The first part is a nonparametric estimate of the population entropy∫[log h0 (z)]h0(z)dz ≡ E0log h0 (Z).

In the following, we shall present two classes of sieve maximum likelihood estimation ofthe unknown true density h0(z). The first class is to use sieves to approximate the log-density function and the second class is to use sieves to approximate the square-root densityfunction. Both classes of sieve density estimators allow for the use of a variety of sievebased functions. For example, Stone (1985) applied spline sieves, Barron and Sheu (1991)used orthogonal polynomials, splines and trigonometric sieves, and Chen and White (1999)considered neural network sieves for approximating log-densities; Chen and Shen (1998)used trigonometric sieves, Gallant and Nychka (1987) and Coppejans and Gallant (2002)used Hermite polynomials to approximate square-root densities.

Denote the parametric space of the density h (·) byH. A sieve is a sequence of approximatingspaces Hn that is dense in H as n → ∞, in the sense that for any h (·) ∈ H, there existsa sequence πnh ∈ Hn such that d (h, πnh) → 0 as n → ∞, where d(·) denotes a pseudometric.

Let h0 be the true unknown probability density of Z on its support Z, where Z is a compactset in Rdz with Lipschitz continuous boundaries. For notational convenience, we use L(h0)to denote either log h0 or

√h0, and L(h) to denote either log h or

√h. In the following,

Ck(Z) denotes the space of k-times continuously differentiable functions on Z, and ∂kL(h(z))∂zk

denotes the k-th derivative of L(h) with respect to z.

ASSUMPTION 2 Let L(h0) ∈ H:

H =L(h) ∈ Ck (Z) : h (z) ≥ c0 > 0 for all z ∈ Z, and

|∂kL(h(z1))∂zk

− ∂kL(h(z2))∂zk

|/|z1 − z2|γ ≤ dh for all z1, z2 ∈ Z

where k = 0, 1, 2, ... and γ ∈ (0, 1], and s ≡ k + γ > dz/2.

Let pj(Z), j = 1, 2, ... denote a sequence of known basis functions, such as Fourier series,Hermite polynomials, splines of order k + 1, etc., that can approximate any real-valuedsquare-integrable function of Z arbitrarily well. Then a finite-dimensional linear sieve for

7

H is

Hn =L(h) ∈ H : L(h(z)) = Σkn

j=1ajpj(z), aj ∈ R

,

with dim(Hn) = kn →∞ and kn/n→ 0 as n→∞.

An artificial neural networks (ANN) sieve is attractive when the dimension dz is greaterthan or equal to 3 (see e.g. Chen and White (1999) and the reference therein). A simpleANN sieve for H is:

Hn =L(h) ∈ H : L(h(z)) = b0 +

kn∑j=1

bjψ(a′

jz + a0,j),kn∑

j=1

dz∑i=0

|ai,j | ≤ kn log n,kn∑

j=0

|bj | ≤ log(n)

where ψ() is a logistic function, and aj = (a1,j , . . . , adz ,j)′.

Log-density estimation: Suppose that we want to estimate log h0, the log-density. Sincelog h0 is subject to the non-linear constraint

∫Z explog h0(z)dz = 1, it is more convenient

to write log h0 = θ0 − log∫Z expθ0(z)dz, and treat θ0 as an unknown function in some

linear space Θ:

θ0 ∈ Θ = θ ∈ H : E[θ(Z)2] <∞,

∫Zθ(z)dz = 0.

Let Θn denote some sieve space for Θ:

Θn = θ ∈ Hn :∫Zθ(z)dz = 0,

where Hn can be any of the popular sieve spaces for H. Then

θ = arg maxθ∈Θn

1n

n∑t=1

[θ(zt)− log

∫Z

exp θ(z)dz]

is a sieve maximum likelihood estimator of θ0, and we have log h = θ − log∫Z expθ(z)dz.

Square-root density estimation: Suppose that we want to estimate the square root densityθ0 =

√h0 ∈ H, then

θ0 ∈ Θ = θ ∈ H :∫Z[θ(z)]2dz = 1.

Let Θn denote some sieve space for Θ:

Θn = θ ∈ Hn :∫Z[θ(z)]2dz = 1,

where Hn be any of the popular sieve spaces for H. Then

θ = arg maxθ∈Θn

1n

n∑t=1

log [θ(zt)]2

is a sieve maximum likelihood estimator of θ0, and we have h = [θ]2.

8

Theorem 2 Assume that the data ztnt=1 are i.i.d. and V ar0

[log h0 (Z)

]is positive fi-

nite. Suppose h0(z) satisfies assumption 2. Let h(z) be the nonparametric estimator ofthe density using the sieve MLE of either the log-density or the squared-root density. Ifkn = O(n1/(2s+dz)) for the linear sieve and (kn)2+1/dz log(kn) = O(n) for ANN sieves, then

√n

(1n

n∑t=1

log h(zt)− E0 log h0 (Z)

)=√n

(1n

n∑t=1

log h0(zt)− E0 log h0 (Z)

)+ op(1)

d−→N

(0, V ar0

[log h0 (Z)

])

Proof of theorem 2: Under the stated conditions, by Example 3.7, Remark 3.8 andExample 3.9 in Chen and White (1999) for ANN sieve estimation of log-densities and square-root densities, and by the results in Chen and Shen (1998) for linear sieve estimation ofsquared root-density, we have the following two results:

1n

n∑t=1

(log h(zt)− log h0(zt)

)− E0

(log h(Z)− log h0(Z)

)= op(n−1/2)

and

E0(log h(Z)− log h0(Z)

)= op(n−1/2).

The theorem now follows.

2.2 KLIC for moment condition models

Moment condition-based models allow researchers to focus on the main features of theeconometric models without facing a high risk of misspecifying a tight parametric likeli-hood function. A large recent econometric literature has studied the information theoreticfeatures of moment condition-based estimation, and developed a variety of generalized em-pirical likelihood estimators (see Kitamura and Stutzer (1997), Kitamura (1997), Imbens,Spady, and Johnson (1998), Kitamura and Tripathi (2001) and Newey and Smith (2001)).A set of overidentified estimating equations imply a nonparametric family of probabilitymeasures that are compatible with these moment restrictions.

For each α ∈ A, define Qα = q (z) |∫m (z;α) q(z)dz = 0, a set of probability densities for

Z which are consistent with the moment conditions. Then we can define M = ∪α∈AQα as

9

the unconditional moment condition-based model. The Model M is misspecified if the truepopulation distribution h0 (·) /∈ M.

An empirical likelihood procedure will identify an element in M that is closest to the truedistribution h0, in the sense of infq(·)∈M I (q (·) |h0 (·)). The solution to this populationKLIC problem can be characterized in two steps (see Appendix A for more details). First,for a fixed parameter vector α ∈ A:

infq(·)∈Qα

I (q (·) |h0 (·)) = I (q∗α (·) |h0 (·)) =maxλ∈Λ

∫log

(1 + λ′m (z;α)

)h0 (z) dz

≡∫

log(1 + λ(α)′m(z;α)

)h0 (z) dz,

and the implied distribution q∗α (·) that minimizes the KLIC is given by

q∗α (Z) =h0 (Z)

(1 + λ(α)′m(Z;α)).

The population empirical likelihood problem can then be stated as

infq(·)∈M

I (q (·) |h0 (·)) =minα∈A

∫log

(1 + λ(α)′m(z;α)

)h0 (z) dz

=minα∈A

maxλ∈Λ

∫log

(1 + λ′m(z;α)

)h0 (z) dz

≡∫

log(1 + λ∗

′m(z;α∗)

)h0 (z) dz,

with λ∗ ≡ λ(α∗). The empirical likelihood estimator optimizes the following sample analogof the population saddle point objective function (cf. Qin and Lawless (1994), Imbens,Spady, and Johnson (1998)):(

α, λ)

= argminα∈Aargmaxλ∈ΛM (λ, α)

where

M (λ, α) ≡ 1n

n∑t=1

log(1 + λ′m (zt;α)

).

The implied empirical distribution subject to the moment constraints is given by

Q (z) =1n

n∑t=1

1 (zt ≤ z) /(1 + λ′m (zt; α)

).

Under mild regularity conditions, αp−→ α∗ and λ

p−→ λ∗. Both√n (α− α∗) and

√n(λ−λ∗)

are asymptotically normal (see Kitamura (2000)). These are formally stated in the followingtheorem. First, we list the assumptions.

10

ASSUMPTION 3 Assuming that the data ztnt=1 are iid, and:

1. A ⊂ Rdα, λ ∈ Λ ⊂ Rdm. Both A and Λ are compact, and

supα∈A,z∈Z

|m (z;α) | <∞ and infα∈A,λ∈Λ,z∈Z

1 + λ′m (z;α) > 0.

2. E0m (zt;α)m (zt;α)′ is positive definite uniformly over α ∈ A.

3. M (λ, α) ≡ E0log (1 + λ′m (zt;α)) is continuously differentiable over Λ and A.

4. The problem of minα∈A maxλ∈ΛM (λ, α) has an unique saddle point solution (α∗, λ∗)with α∗ ∈ int(A) and λ∗ ∈ int(Λ).

5. D = ∂2M(λ,α)∂λ∂α′ |(α∗,λ∗) is of full column rank.

6. V = V ar0(

m(zt;α∗)

1+λ∗′m(zt;α∗)

)= E0 m(zt;α∗)m(zt;α∗)′

(1+λ∗′m(zt;α∗))2 is finite and positive definite.

7. supα∈A supλ∈Λ

∣∣∣∣M (λ, α)−M (λ, α)∣∣∣∣ = op (1).

8. Stochastic equi-continuity for sample moment conditions. For any δn → 0:

sup|α−α∗|≤δn

∣∣∣∣ 1√n

n∑t=1

(m (zt;α)

1 + λ∗′m (zt;α)− m (zt;α∗)

1 + λ∗′m (zt;α∗)− E0 m (zt;α)

1 + λ∗′m (zt;α)

) ∣∣∣∣ = op (1) .

9. Stochastic equi-continuity for KLIC saddle-point objective function. For any δn → 0:

sup|α−α∗|≤δn,|λ−λ∗|≤δn

√n

∣∣∣∣M (λ, α)− M (λ∗, α∗)− [M (λ, α)−M (λ∗, α∗)]∣∣∣∣ = op (1) .

Remark 1: The uniform boundedness condition (1) on m (zt;α) is doubtless very strong.We assume it now to avoid the technical difficulty that the nonnegativity condition that1 + λ′m (zt;α) > 0 may reduce Λ to a singleton (zero) when m (zt;α) is unbounded. Wesuspect that attempts to relax this assumption may be technically rather difficult. Inpractice, whether of not this condition is satisfied depends on both z and the functionalform of m (z, α). The actual specification of m does offer the user some flexibility in termsof the assumptions they would make in practice.

Remark 2: The interior point condition (4) of (λ∗, α∗) requires the moment condition modelnot to be too misspecified. The interior point condition implies that

E0m (zt;α∗) /(1 + λ∗

′m (zt;α∗)

) = 0,

which rules out cases where m (z;α) > 0 with probability one for all α ∈ A. Essentially, weare requiring that the set M be non-empty.

11

Theorem 3 Under assumption 3,√n (α− α∗) = Op (1),

√n(λ− λ∗) = Op (1), and

n∑t=1

(log(1 + λ′m (zt; α)

)− log

(1 + λ∗

′m (zt;α∗)

))= Op (1) .

3 Nonparametric Selection Test between Two Models

Given the KLIC measures for the parametric and moment condition-based models, theirdifference indicates which model is closer to the true data generating process. Define

LR∞ (β∗, h0;λ∗, α∗) = E0 log h0 (Z)− E0 log f(Z;β∗)− E0 log(1 + λ∗

′m (Z;α∗)

),

and consider the following sets of null and alternative hypotheses:

H0 : LR∞ (β∗, h0;λ∗, α∗) = 0,

meaning that model F and model M are equivalent, against

Hg : LR∞ (β∗, h0;λ∗, α∗) < 0,

meaning that model F is better than model M, or

Hm : LR∞ (β∗, h0;λ∗, α∗) > 0,

meaning that model M is better than model F, in the sense of providing a better (smaller)KLIC to the data.

Hence a nonparametric quasi-likelihood ratio test statistic between a parametric model anda moment condition model can be formulated as the sample analog of these populationquantities

LRn(β, h; λ, α) =1n

n∑t=1

log h(zt)−1n

n∑t=1

log f(zt; β)− 1n

n∑t=1

log(1 + λ′m(zt, α)

).

For example, a statistically significantly negative value of LRn(β, h; λ, α) indicates that themodel F is likely to be better than model M.

3.1 Large sample properties of the test statistics

The limit distribution of the test statistic LRn(β, h; λ, α) is an immediate consequence ofTheorems 1, 2 and Theorem 3. In the following we denote

vg(zt) =(log h0(zt)− E0 log h0(zt)

)−(log f(zt;β∗)− E0 log f(zt;β∗)

)−(log(1 + λ∗

′m(zt;α∗))− E0 log(1 + λ∗

′m(zt;α∗))

).

12

Theorem 4 Under conditions for Theorems 1, 2 and 3,

√n(LRn(β, h; λ, α)− LR∞ (β∗, h0;λ∗, α∗)

)d−→ N

(0, σ2

g

)for σ2

g = E0[vg(zt)2].

This result follows directly from the fact that due to Theorems 1, 2 and 3,

√n(LRn(β, h; λ, α)− LR∞ (β∗, h0;λ∗, α∗)

)=

1√n

n∑t=1

vg(zt) + op (1) .

The asymptotic distribution described in Theorem 4 can be used to test the null hypothesisof H0 when the limiting variance is not degenerate. Degeneracy occurs if and only ifvg(Z) = 0 with Z-probability one, which occurs when

HE : f(z;β∗) ≡ h0(z)(1 + λ∗′m(z;α∗))

for almost all z.

namely that the probability measure induced by the parametric family coincides with theprobability measure induced by the moment functions. When this happens, the two modelsare observationally equivalent, even if they can be both misspecified.

A consistent estimate of σ2g can be formed by σ2

g = 1n

∑nt=1[vg (zt)]2, where

vg(zt) =

log h(zt)−1n

n∑j=1

log h(zj)

−log f(zt; β)− 1

n

n∑j=1

log f(zj ; β)

−

log(1 + λ

′m(zt, α)

)− 1n

n∑j=1

log(1 + λ

′m(zj , α)

) .Lemma 1 Under conditions for Theorems 1, 2 and 3, σ2

gp−→ σ2

g .

Similarly to Vuong (1989), the asymptotic distribution of this likelihood-ratio-like statisticis normal. This differs from the familiar case where the competing models are nested, andthe unrestricted model is correctly specified. For this case, the leading term in the Taylorexpansion of the likelihood ratio statistic is the second-order term, leading to a limiting χ2

distribution. On the other hand, in the non-nested case, the difference between the samplelog-likelihood functions of the competing models determines the asymptotic distribution,leading to a limiting normal distribution via a central limit theorem argument. For furtherdiscussion, see Vuong (1989), pp. 313–314.

13

3.2 Testing the degeneracy condition

The degeneracy hypothesis HE can be tested. For comparing between two parametricmodels, Vuong (1989) used the variance estimate σ2

g as the test statistic and obtained itsdistribution in a manner related to a second order expansion of the statistic LRn

(β, h; λ, α

).

While a similar strategy can be used here, the statistical properties are more complicatedthat the Vuong (1989) test. Therefore we consider more direct tests of the equality betweenthe two implied distributions. We consider both density- and CDF-based test statistics forthe degeneracy hypothesis.

Density-based degeneracy test statistic The degeneracy hypothesis HE implies thatthe pseudo density functions should be equal to each other: h0 (z) = f (z;β∗)

(1 + λ∗

′m(z;α∗)

).

Therefore a test statistic can be based on an empirical analog of a norm of the differencebetween these two densities. For example, a von Mises type statistic can be formulated as

Vn =∫ [

f(z; β)(1 + λ′m (z, α)

)− h (z)

]2dz (3.2)

where h (z) is a nonparametric density estimate using either kernel or sieve based methods.The results of Fan (1994) can be used to describe the asymptotic distribution of Vn whenh (z) is a kernel density estimate.6 A drawback of the density based degeneracy test is thatit converges at the slower nonparametric rate, therefore eroding power of the sequentialtesting procedure. Appendix C gives some more details.

CDF-based degeneracy test statistics Next we focus on CDF-based tests for thedegeneracy hypotheses HE using the implied empirical distribution functions. These testsconverge faster than the density-based statistics, and therefore are more powerful thanthose based on density estimates. However, the faster convergence rate also implies thatthe sampling variation introduced by using the estimated β, λ and α in computing the teststatistic needs to be taken into account for valid large sample inferences. Note that thedegeneracy hypothesis can alternatively be stated as

HE :∫ Z

h(z)dz ≡∫ Z

f(z;β∗)(1 + λ∗

′m(z;α∗)

)dz.

6If the variance vg is estimated using kernel density methods, then Vuong’s variance test will be closely

related to the kernel based von-Mises statistic. We thank Yuichi Kitamura for pointing this out to us.

Kitamura (2002) uses sample splitting as an alternative to the degeneracy test.

14

Thus a consistent test can be based on some notation of functional distance between theempirical distribution and its implied parametric form:

ρ

(1n

n∑t=1

1 (zt ≤ Z)−∫ Z

f(z; β)(1 + λ′m(z; α)

)dz

),

where ρ (·) is a pseudo-metric that is bounded and continuous with respect to the L1 norm.These include both Kolmogorov-Smirnov type sup-norm statistic

ρ1 = supZ

∣∣∣∣ 1nn∑

i=1

1 (zi ≤ Z)−∫ Z

f(z, β

) (1 + λ′m (z, α)

)dz

∣∣∣∣, (3.3)

and Cramer-Von Mises type integrated square statistic

ρ2 =∫ ( 1

n

n∑i=1

1 (zi ≤ Z)−∫ Z

f(z, β

) (1 + λ′m (u, α)

)du

)2

w (Z) dZ. (3.4)

Appendix D gives more details about the asymptotic distributions of the empirical CDFbased tests of nondegeneracy.

3.3 Nested, non-nested and overlapping models

Since HE ⊂ H0, if we fail to reject HE , then we are done. On the other hand, if we rejectHE , H0 may still hold, in which case theorem 4 provides the basis for a directional test forH0 vs Hf and Hm. Using the classification method of Vuong (1989), the relation betweenthe parametric model and the moment model can be nested, non-nested or overlapping.

The parametric model and the moment based model are strictly non-nested if and only ifF ∩M = ∅. Formally this holds if and only if,

∀β ∈ B, ∀α ∈ A,∫m (z;α) f(z;β)dz 6= 0.

Since model F and model M have no common implied probability measure, the degeneracyhypothesis HE cannot hold for this case. Hence theorem 4 can be applied.

Models M and F are overlapping if and only if (i) M ∩ F 6= ∅; (ii) M 6⊂ F and F 6⊂ M.This holds if and only if B = B1 ∪ B2 such that

∀β ∈ B1, ∃α ∈ A,∫m (z;α) f(z;β)dz = 0,

∀β ∈ B2, ∀α ∈ A,∫m (z;α) f(z;β)dz 6= 0,

∃P ∈ M, such that P /∈ F.

15

Since F ∩ M 6= ∅, the observational equivalence relation may or may not hold. Thus ageneral sequential procedure consists in testing first whether HE holds. In the second step,if HE is not rejected, we can conclude that F and M can not be discriminated given thedata. On the other hand, if HE is rejected, H0 may still hold, and we can proceed by usingTheorem 4 to test for H0. The significance level for the sequential testing procedure isbounded from above by those of each of the two steps.

The two models can also be nested. It is probably more common for the moment basedmodel to nest the fully parametric model than for the fully parametric model to nest themoment based model. Model M nests model F if and only if F ⊂ M, namely

∀β ∈ B, ∃α ∈ A, such that∫m(z;α)f(z;β)dz = 0.

For nested models, the pseudo-true measure h0(z)/(1 + λ∗

′m(z;α∗)

)may not necessarily

equal f(z;β∗) since it is may not belong to f(·;β) : β ∈ B. However, HE and H0

are equivalent statements for nested models. A proof of this result would follow the samereasoning as Lemma 7.1 in Vuong (1989). Given this equivalence, we can use the degeneracytest to test for H0.

4 Monte Carlo Simulations

In this section we report the results from a small monte carlo simulation exercise to illustratethe finite sample properties of the above testing procedure. We specify the true datagenerating process to be a standard normal distribution. The data in the experimentsare drawn from this distribution. We consider two unconditional competing models, oneparametric and one based on moment conditions, such that neither of them nests the truenormal distribution (so that both are misspecified).

The unconditional moment model is specified to be the first six moments calculated from thedouble exponential distribution density of f (z;α) = 1

2α1exp

(− |(z−α2)|

α1

). In other words,

m (z;α1, α2) has six components, for k = 1, 2, 3:

m2k (z;α) = (z − α2)2k − (2k)!α2k

1 ,

m2k+1 (z;α) = (z − α2)2k+1 .

The location parameter α2 is assumed to be known, and the scale parameter α1 is to beestimated.

16

The unconditional parametric model f (z;β) is specified to be a double Weibull distribution:

f (z;β) =β2

2β1|z − β|β2−1 exp

(−|z − β|β2

β1

), −∞ < z <∞.

The shape parameters β1 and β2 are considered to be known and not to be estimated.Only the location parameter β is considered to be the relevant parameter of the parametricWeibull model.

The first set of experiments studies the size of the proposed test. In order to do this, wecalibrate the parameters α2, β1, β2 such that the two competing models are equi-distant (interms of the KLIC) from the true normal distribution. Hence, in this design, although bothmodels are misspecified, the null that both models fit equally well (in terms of KLIC) istrue, and we consider the empirical rejection probability for small samples.

We calibrate the parameters α2, β1, β2 in the following way. First of all, we fix α2 = 0and find, via monte carlo simulations (with 10 million simulation draws) and numericalminimizations to obtain the values of α1 that minimizes the KLIC between the momentmodel mj (z;α) , j = 1, . . . , 6 and the standard normal distribution. The resulting valueof α1 turns out to be 1/6. Through this numerical procedure we also found the vector ofLagrange multiplier associated with these moment conditions:

λ = (−0.1377, 75.3769, 3.4499, 11.4009, −1.9334, 58.6049)′.

For these values of α1 and α2, we also found the KLIC between this moment model m (z;α)and the true DGP of the standard normal distribution.

Next, we fix β = 0 and one of the shape parameters of the double Weibull distributionβ2 = 3. Then, using monte carlo simulations and a numerical root finder, we found the valueof β1 that equates the KLIC between the double Weibull distribution with (β1, β2 = 3, β = 0)and the true DGP of standard normal distribution with the previously found KLIC betweenthe moment model m (z;α) and the true DGP of standard normal distribution. The valueof β1 that equates the two KLICs turns out to be β1 = 0.2554.

Because the standard normal distribution is symmetrically distributed around zero, it is im-mediate that, at the calibrated values of α2, β1, β2 described above, the location parametersα2 = 0 and β = 0 will minimize the KLICs of the two competing models in the population.Hence, for these values of α2, β1, β2, the KLICs for the two competing models are identical,and equal to 3.9945.

17

In running the simulations, we generate data sets of sizes 100 and 200 respectively andrepeat the experiments 10,000 times. In each experiment, the parameter estimates α1 andβ are obtained through empirical likelihood methods and maximum likelihood numerically,and the test statistics are computed. The nonparametric entropy estimate was obtainedusing cardinal B-spline wavelet functions.

[Table 1 about here.]

Table 1 tabulates the empirical rejection rates of these 10,000 experiments. It is clear fromthe above table that the empirical rejection rates are close to the ones predicted by theasymptotic theory. The accuracy of the asymptotic theory increases with the larger samplesize.

In the second set of monte carlo experiments, we examine the finite sample power propertiesof the testing procedure. The setups of the experiments are identical to the first one, exceptthat we calibrate β1 at different values so that the minimized population KLICs for thetwo competing models are no longer identical. In particular, we choose β1 = 0.26 with thedifference in the KLICs being −0.41 with the moment model being the one with the smallerKLIC, and β1 = 0.30 with the difference in the KLICs being −0.95.


The results are reported in Table 2. They show that the rejection rates increase as theparameters are chosen such that the differences in the KLICs are no longer equal to zero.

5 Multiple Model Comparison

In this section, we extend our procedure to handle the comparison of multiple (> 2) models,following the idea of White (2000). In this case, all the candidate models are compared witha benchmark model. If no candidate model is closer to the true model than the benchmarkmodel in terms of KLIC, the benchmark model is chosen; otherwise, the candidate modelthat is closest to the true model in terms of KLIC will be selected.

In the following we denote I0 as the KLIC associated with the benchmark model. When

18

the benchmark model is a parametric model f0(z;β0) : β0 ∈ B0, then

I0 =I(β∗0 |h0) = E0log h0(Z)− log f0(Z;β∗0)

= minβ0∈B0

E0log h0(Z)− log f0(Z;β0).

When the benchmark model is a moment-based model m0(·;α0) : α0 ∈ A0, then

I0 =I(λ∗0, α∗0|h0) = E0log

(1 + λ∗′0 m0(Z;α∗0)

)

= minα0∈A0

maxλ0∈Λ0

E0log(1 + λ′0m0(Z;α0)

).

Let fi(z;βi)Mf

i=1 be the candidate parametric models and mi(z;αi)Mmi=1 be the candidate

moment-based models. For i = 1, ...,Mf define

I(β∗i |h0) = E0log h0(Z)− log fi(Z;β∗i ) = minβi∈Bi

E0log h0(Z)− log fi(Z;βi),

and for j = 1, ...,Mm define

I(λ∗j , α∗j |h0) =E0log

(1 + λ∗′j mj(Z;α∗j )

)

= minαj∈Aj

maxλj∈Λj

E0log(1 + λ′jmj(Z;αj)

).

We are interested in testing whether the best candidate model outperforms the benchmarkparametric model. Hence, the null hypothesis is

HM0 : I0 ≤ min

min

i=1,...,Mf

I(β∗i |h0), minj=1,...,Mm

I(λ∗j , α∗j |h0)

,

or equivalently

HM0 : max

max

i=1,...,Mf

I0 − I(β∗i |h0), maxj=1,...,Mm

I0 − I(λ∗j , α∗j |h0)

≤ 0,

meaning that no candidate model is closer to the true model than the benchmark model interms of KLIC, and the alternative hypothesis is

HM1 : I0 > min

min

i=1,...,Mf

I(β∗i |h0), minj=1,...,Mm

I(λ∗j , α∗j |h0)

,

or equivalently

HM1 : max

max

i=1,...,Mf

I0 − I(β∗i |h0), maxj=1,...,Mm

I0 − I(λ∗j , α∗j |h0)

> 0,

19

meaning that there exists a candidate model that is closer to the true model than thebenchmark model in terms of KLIC.

For i = 1, ...,Mf , let

βi = arg maxβi∈Bi

1n

n∑t=1

log fi(zt;βi) and I(βi|h) =1n

n∑t=1

log h(zt)− log fi(zt; βi),

and for j = 1, ...,Mm let

(αj , λj) = arg minαj∈Aj

arg maxλj∈Λj

1n

n∑t=1

log(1 + λ′jmj(zt;αj)

),

I(λj , αj |h) =1n

n∑t=1

log(1 + λ′jmj(zt; αj)

).

Our test will be based on the following nonparametric likelihood ratio statistic vector oflength M = Mf +Mm:(

I0 − I(β1|h), ..., I0 − I(βMf|h); I0 − I(λ1, α1|h), ..., I0 − I(λMm , αMm |h)

)′.

We consider several cases.

Case 1: if the benchmark is a parametric model I0 = I(β∗0 |h0), we denote

Ω = Ωf0 = (σik)Mi,k=1 ,

where

(1.i) for i, k = 1, ...,Mf ,

σik = Cov0[log

fi(Z;β∗i )f0(Z;β∗0)

, logfk(Z;β∗k)f0(Z;β∗0)

],

(1.ii) for i = 1, ...,Mf and k = Mf + k′, k′ = 1, ...,Mm,

σik = Cov0[log

fi(Z;β∗i )f0(Z;β∗0)

, logh0(Z)

f0(Z;β∗0)− log(1 + λ∗′k′mk′(Z;α∗k′))

],

(1.iii) for i = Mf + i′, i′ = 1, ...,Mm and k = 1, ...,Mf ,

σik = Cov0[log

h0(Z)f0(Z;β∗0)

− log(1 + λ∗′i′mi′(Z;α∗i′)), logfk(Z;β∗k)f0(Z;β∗0)

],

20

(1.iv) for i = Mf + i′, i′ = 1, ...,Mm and k = Mf + k′, k′ = 1, ...,Mm,

σik = Cov0[log

h0(Z)f0(Z;β∗0)

− log(1 + λ∗′i′mi′(Z;α∗i′)), logh0(Z)

f0(Z;β∗0)− log(1 + λ∗′k′mk′(Z;α∗k′))

].

Case 2: if the benchmark is a moment-based model I0 = I(λ∗0, α∗0|h0), we have

Ω = Ωm0 = (σik)Mi,k=1

where

(2.i) for i, k = 1, ...,Mf ,

σik = Cov0

[log(1 + λ∗′0 m0(Z;α∗0))− log

h0(Z)fi(Z;β∗i )

, log(1 + λ∗′0 m0(Z;α∗0))− logh0(Z)

fk(Z;β∗k)

],

(2.ii) for i = 1, ...,Mf and k = Mf + k′, k′ = 1, ...,Mm,

σik = Cov0

[log(1 + λ∗′0 m0(Z;α∗0))− log

h0(Z)fi(Z;β∗i )

, log1 + λ∗′0 m0(Z;α∗0)1 + λ∗′k′mk′(Z;α∗k′)

],

(2.iii) for i = Mf + i′, i′ = 1, ...,Mm and k = 1, ...,Mf ,

σik = Cov0

[log

1 + λ∗′0 m0(Z;α∗0)1 + λ∗′i′mi′(Z;α∗i′)

, log(1 + λ∗′0 m0(Z;α∗0))− logh0(Z)fk(z;β∗k)

],

(2.iv) for i = Mf + i′, i′ = 1, ...,Mm and k = Mf + k′, k′ = 1, ...,Mm,

σik = Cov0

[log

1 + λ∗′0 m0(Z;α∗0)1 + λ∗′i′mi′(Z;α∗i′)

, log1 + λ∗′0 m0(Z;α∗0)1 + λ∗′k′mk′(Z;α∗k′)

].

Corollary 5 Suppose that the parametric model i = 1, 2, . . . ,Mf satisfies conditions ofTheorems 1 and 2, and moment model j = 1, ...,Mm satisfies conditions of Theorem 3.Suppose that the benchmark satisfies conditions of Theorems 1 and 2 (if it is parametric) orconditions of Theorem 3 (if it is moment-based). Suppose that Ω is positive semi-definiteand its largest eigenvalue is positive. Then

√n

I0 − I(β1|h)− [I0 − I(β∗1 |h0)]...I0 − I(βMf

|h)− [I0 − I(β∗Mf|h0)]

I0 − I(λ1, α1|h)− [I0 − I(λ∗1, α∗1|h0)]

...I0 − I(λMm , αMm |h)− [I0 − I(λ∗Mm

, α∗Mm|h0)]

−→d

Z1

...ZMf

ZMf+1

...ZM

21

where (Z1, . . . , ZM )′ ∼ N (0,Ω). Further Ω = Ωf0 if the benchmark is parametric I0 =I(β∗0 |h0), and Ω = Ωm0 if the benchmark is moment-based I0 = I(λ∗0, α

∗0|h0).

The proof of this Proposition directly follows from those of Theorems 1, 2 and 3; hence, weomit it.

White (2000)’s reality check test is under the Least Favorable Configuration (LFC) (i.e.,I0 = I(β∗i |h0) = I(λ∗j , α

∗j |h0) for all i = 1, ...,Mf and j = 1, ...,Mm),

TWn ≡ max

[max

i=1,... ,Mf

n1/2I0 − I(βi|h), maxj=1,...,Mm

n1/2I0 − I(λj , αj |h)]

−→ d maxj=1,...,M

Zj under HM0 and LFC.

Since Hansen (2003) shows via simulation that the power of White’s test could be poor, wefollow Hansen (2003)’s suggestion and consider a modified test

THn ≡ max

[max

i=1,... ,Mf

n1/2I0 − I(βi|h), maxj=1,...,Mm

n1/2I0 − I(λj , αj |h), 0]

→ d max[

maxi=1,...,M0

Zi, 0]

under HM0 .

where M0 is the set of competing models that are binding under HM0 . Since the limiting

distribution of both TWn and TH

n are complicated, we can use bootstrap critical values toimplement these tests.

Step 1: Draw a b−th independent random bootstrap sample zbtn

t=1 of size n from theraw data ztn

t=1 with replacement, and compute the b−th bootstrap estimates Ib0, I(β

bi |hb),

I(λbj , α

bj |hb) for i = 1, ...,Mf , j = 1, ...,Mm,

Step 2: Compute the b−th bootstrap centered terms.

For White’s test: Ib,Wci ≡ Ib

0 − I(βbi |hb) − [I0 − I(βi|h)] for i = 1, ...,Mf , and Ib,Wc

j ≡Ib0 − I(λb

j , αbj |hb)− [I0 − I(λj , αj |h)] for j = 1, ...,Mm.

For Hansen’s test: Ib,Hci ≡ Ib

0−I(βbi |hb)−[I0−I(βi|h)]1I0−I(βi|h) ≥ −an for i = 1, ...,Mf ,

and Ib,Hcj ≡ Ib

0− I(λbj , α

bj |hb)− [I0− I(λj , αj |h)]1I0− I(λj , αj |h) ≥ −an for j = 1, ...,Mm.

an is a sequence of constants where an → 0 and ann1/2 → ∞. For example, we can take

an = log n/√n.

Step 3: Compute the b−th bootstrap estimates of the tests:

TW,bn ≡ max

[max

i=1,... ,Mf

n1/2Ib,Wci , max

j=1,...,Mm

n1/2Ib,Wcj

],

22

TH,bn ≡ max

[max

i=1,... ,Mf

n1/2Ib,Hci , max

j=1,...,Mm

n1/2Ib,Hcj , 0

].

Step 4: Repeat Steps 1 - 3 for b = 1, ..., B with B = 500 (say). Compute the bootstrapestimates of the p− value

pW ≡ 1B

B∑b=1

1TW,bn > TW

n ,

pH ≡ 1B

B∑b=1

1TH,bn > TH

n .

The White (2000) test rejects the null when pW is small and does not reject when pW isbig. The Hansen (2003) test rejects the null if pH is small and does not reject if pH is big.

6 Empirical Illustration: Sequential vs. Non-sequential Search

Models

To illustrate the use of our testing procedure, we present an example from industrial organi-zation of testing between sequential and non-sequential models of consumer search models,using price data obtained from internet websites. Since the search models we compare arerelatively stylized, and omit important details of online shopping behavior, we emphasizethat the results in this section are presented more as an illustration of the testing proceduredescribed above, rather than a detailed analysis of search behavior in online markets.

As is discussed in Hong and Shum (2001), the parameters characterizing consumers’ op-timal search and firms’ optimal pricing strategies in an equilibrium pricing model withnon-sequential consumer search behavior can be identified nonparametrically, in the sensethat the population moment restrictions implied by the equilibrium model are enough toidentify these parameters, without additional functional form restrictions on the shape ofconsumers’ search costs. On the other hand, the parameters of an equilibrium pricing modelwith sequential consumer search behavior cannot be identified without functional form as-sumptions on the distribution of consumer search costs. Therefore, as an illustration of ourproposed test, we test between several parameterizations of the sequential search model,and the moment-based nonsequential search model.

23

For both models, we assume that there are a continuum of firms and consumers, andinterpret the equilibrium price distribution Fp as the symmetric equilibrium mixed strategyemployed by all firms. We assume that consumers have inelastic demand for a single unitof the good. Consumers incur a search cost c to receive a single price quote, and we assumethat search costs are i.i.d. draws across consumers from a distribution Fc. Let p and p

denote, respectively, the lower and upper bound of the support of Fp, and r denote thecommon unit production cost of each firm. Since all firms produce homogeneous products,only search frictions (arising from consumers’ imperfect information about stores’ prices)generate price dispersion in this market.

Before proceeding to the test, we briefly describe the estimating procedures for each modelin turn. Details on the derivations of these equations is given in Hong and Shum (2001).

6.1 Non-sequential Search Model

Consumers who search non-sequentially commit to buying from the lowest-priced store afterobtaining a random sample of n prices. A consumer with search cost c chooses the numberof stores n to canvass to minimize her total expected cost, which is the sum of her searchcosts as well as the price she expects to pay for the product:

n∗(c) ≡ argminn>1C(n; c) ≡ c ∗ (n− 1) +∫ p

pnp (1− F (p))n−1 f(p)dp.

Let Fp denote the empirical distribution of the observed prices. Define:

q1 :the equilibrium proportion of consumers who search only one store

q2 :the equilibrium proportion of consumers who search two stores

q3 :the equilibrium proportion of consumers who search three stores, etc.

In the mixed strategy equilibrium, the following indifference condition must hold for everyprice p ∈ [p, p):

(p− r) q1 = (p− r)

[ ∞∑k=1

qkk (1− F (p))k−1

].

Rearranging, we get that

F−1p (τ) = r +

(p− r) · q1[∑K−1k=1 qkk(1− τ)

] ,24

for any quantile τ ∈ [0, 1] and where K(< N) denotes the maximal number of stores atwhich consumers will search. From this equation, we can derive set of M(≥ K) momentconditions

E

1

pt ≤ r +(p− r) q1[∑K

k=1 qkk (1− τm)k−1]− τm

= 0, m = 1, . . . ,M, M ≥ K.

which we can use to estimate the unknowns r, q1, . . . , qK−1.

The sample analogs (with a dataset consisting of T + 1 prices, with pT+1 = p) are

1T

T∑t=1

1pt ≤ r +

(p− r) q1[∑Kk=1 qkk (1− τm)k−1

]− τm

= 0. (6.5)

6.2 Sequential Search Model

Consumers who search sequentially follow a “reservation price” policy whereby they obtainadditional price quotes until they find one which is lower than their (optimally chosen)reservation price.

As discussed in Hong and Shum (2001), the sequential search model is not nonparametricallyidentified given only data on prices. Hence, we consider parametric maximum likelihood(MLE) estimation of this model, assuming that the search cost distribution Fc(·; θ) is pa-rameterized by a (finite-dimensional) vector θ.7 The likelihood function for each observedprice is the equilibrium price density, which is given by (see Hong and Shum (2001) fordetails)8

fp (s; θ) =−2(p− r)

(s− r)3 ∗ fc

(c(1− p−r

s−r ; θ)

; θ) − (p− r)2f ′c

(c(1− p−r

s−r ; θ))

(s− r)4 ∗[fc

(c(1− p−r

s−r ; θ)

; θ)]3 .

7p and p can be (super-consistently) estimated from the data. See Donald and Paarsch (1993) for a

discussion of maximum likelihood estimation when a subset of the parameters can be super-consistently

estimated.8In deriving this equation, we are implicitly assuming that the flow of consumers visiting any store is

slow enough so that, from the firms’ point of view, the consumer population is identical over time. This

assumption is made to avoid difficult theoretical issues; please see Hong and Shum (2001) for a more extended

discussion.

25

In the above equations, c(τ ; θ) ≡ F−1c (τ ; θ), the inverse CDF for the search cost distribution.

Given θ, the auxiliary parameter r can be determined by the restriction that Fp(p) = 1 sothat r must satisfy:

1 = Fp(p) =

(p− r

)(p− r)2 ∗ fc

(c(1− p−r

p−r

); θ) .

The likelihood function for the whole sample of prices, then, is just L(θ, r) =∏T

t=1 fp(pt; θ).

6.3 Results

To illustrate the use of our tests, we present the test results for a sample of online pricescollected for two products: the Palm Pilot Vx, and the statistics textbook Probability andMeasure, by P. Billingsley. These prices were gathered from the pricescan.com websiteon February 5, 2002, with shipping and handling costs confirmed by visiting the individualwebsites. The summary statistics for these price data are given in Table 3.


The test results are reported in Table 4. We considered three models: (1) the non-sequentialsearch models, estimated via the moment conditions (6.5); (ii) a parametric sequentialsearch model, with a Gamma search cost distribution; (iii) a parametric sequential searchmodel with a Lognormal search cost distribution. We only report results for the White teststatistics; the Hansen test results were qualitatively similar, and we do not report themfor convenience. We performed the tests taking each of the three models, in turn, as thebenchmark model. The nonparametric sieve estimation of the entropy was done using acardinal B-spline wavelet basis.

The test results are qualitatively similar for both products. For the Palm Pilot Vx, thebootstrapped p-values for the test statistics using Models 1 and 3 as the benchmark modelare large (at 0.953 and 0.847, respectively) which appear to imply that both the nonse-quential and the parametric model assuming a Log-normal search cost distribution performequally well in fitting the data. The p-value for the test statistics when Model 2 is usedas the benchmark model is substantially lower (at 0.162), suggesting the inferiority of theparametric model assuming a Gamma search cost distribution.

26

For the Billingsley text, the lowest p-value is again obtained when Model 2 is used as thebenchmark model. These results indicate perhaps that it is difficult to distinguish (on thebasis of fit) between the nonsequential search models, and certain parameterizations of thesequential search model. This may be due in part to the small sample size used in thisexample (we were able to obtain only 18 prices for each product).

For completeness, Table 4 also includes descriptive statistics for the KLIC and entropyestimates for the three models, across bootstrap resamples. Not surprising, given the testresults, we see that the KLIC’s for Models 1 and 3 are noticeably smaller than the KLICfor Model 2. Furthermore, the KLICs, as well as the nonparametric entropy estimates, arequite stable across bootstrap resamples.


While this application is rather specialized, there are other potentially interesting indus-trial organization applications of our testing procedure. For example, Bresnahan (1981)and Berry, Levinsohn, and Pakes (1995) have presented competing equilibrium demandand supply models for the automobile industry. Bresnahan (1981) considers a vertical dif-ferentiation context, and estimates model parameters using MLE assuming a parametricdistribution for measurement errors in price. Berry, Levinsohn, and Pakes (1995) considera more general model of with both vertical and horizontal differentiation, and present amoment-based GMM estimation approach. Testing between these two specification couldbe done using our approach.

7 Conclusion

In this paper we first develop a nonparametric likelihood ratio model selection test betweentwo competing models where one of the models is parametric likelihood, and the other ismoment-based model. Our tests extend the likelihood-ratio model selection tests presentedpreviously in Vuong (1989) for two parametric likelihood models and Kitamura (2000) fortwo moment-based models to the scenario where one model is parametric and the otheris moment-based. We then extend our procedure to handle the comparison of multiple(> 2) models where some candidate models are parametric and other candidate modelsare moment-based. We present a short Monte Carlo simulation study for our procedure,and also present an industrial organization example of our tests. Using data on observed

27

online prices for an electronic product, we examine whether the prices are generated froman equilibrium search model where consumers follow sequential or non-sequential searchstrategies.

For notational clarity we have focused on the case with continuous observations in the paper.The extension to a mixture of discrete and continuous regressors is straightforward butinvolves more tedious notations. In particular, we only need to modify the nonparametricestimation of the entropy of the true data generating process by partitioning the samplebased on the value of the discrete regressors.

We believe that these model selection testing situations arise often in practice, especiallygiven that both parametric likelihood specification and moment condition specification havebecome popular in structural econometric modeling. There are potential applications to,inter alia, discrete-choice differentiated product demand models in industrial organization,sample selection models in labor economics, as well as dynamic models in macroeconomicsand financial economics.

28

References

Barron, A., and C. Sheu (1991): “Approximation of Density Functions by Sequences ofExponential Families,” Annals of Statistics, 19, 1347–1369.

Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilib-rium,” Econometrica, 63, 841–890.

Bresnahan, T. (1981): “Departures from Marginal Cost Pricing in the American Auto-mobile Industry,” Journal of Econometrics, 17, 201–227.

Chen, X., and Y. Fan (2005): “A Model Selection Test for Bivariate Failure-Time Data,”working paper, New York University and Vanderbilt University.

Chen, X., and X. Shen (1998): “Sieve Extremum Estimates for Weakly Dependent Data,”Econometrica, 66, 289–314.

Chen, X., and H. White (1999): “Improved rates and asymptotic normality for nonpara-metric neural network estimators,” IEEE Tranactions Information Theory, 45, 682–691.

Chernozhukov, V., and C. Hansen (2001): “An IV Model of Quantile Treatment Ef-fects,” MIT Department of Economics Working Paper.

Christoffersen, P., J. Hahn, and A. Inoue (2001): “Testing, Comparing and Com-bining Value at Risk Measures,” Journal of Empirical Finance, 8, 325–342.

Coppejans, M., and A. R. Gallant (2002): “Cross-Validated SNP Density Estimates,”Journal of Econometrics, 110, 27–65.

Donald, S., and H. Paarsch (1993): “Piecewise Pseudo-Maximum Likelihood Estima-tion in Empirical Models of Auctions,” International Economic Review, 34, 121–148.

Fan, Y. (1994): “Testing the Goodness of Fit of a Parametric Density Function by KernelMe thod,” Econometric Theory, 10, 316–356.

F.X.Diebold (1989): “Forecast Combination and Encompassing: Reconciling Two Diver-gent Literature,” International Journal of Forecasting, 5, 589–592.

Gallant, A. R., and D. W. Nychka (1987): “Semi-Nonparametric Maximum LikelihoodEstimation,” Econometrica, 55, 363–390.

Gourieroux, G., and A. Monfort (1994): “Testing Non-Nested Hypotheses,” in Hand-book of Econometrics, Vol. 4, ed. by R. Engle, and D. McFadden. North Holland.

Gowrisankaran, G., J. Geweke, and R. Town (2003): “Bayesian inference for hospitalquality in a selection model,” Econometrica, 71(4), 1215–1239.

29

Granger, C. (2002): “Time Series Concept for Conditional Distributions,” manuscript,UCSD.

Hansen, L. (1982): “Large Sample Properties of Generalized Method of Moments Estima-tors,” Econometrica, 50, 1029–1054.

Hansen, P. R. (2003): “A Test for Superior Predictive Ability,” manuscript, StanfordUniversity.

Hong, H., and M. Shum (2001): “Estimating Search Costs Using Equilibrium Models,”mimeo., Princeton University.

Horowitz, J., and W. Hardle (1994): “Testing a Parametric Model against a Semi-parametric Alternative,” Econometric Theory, 10, 821–848.

Imbens, G., R. Spady, and P. Johnson (1998): “Information Theoretic Approaches toInference in Moment Condition Models,” Econometrica, 66, 333–357.

Kitamura, Y. (1997): “Empirical likelihood methods with weakly dependent processes,”Ann. Statist., 25(5), 2084–2102.

Kitamura, Y. (2000): “Comparing Misspecified Dynamic Econometric Models using Non-parametric Likelihood,” Department of Economics, University of Wisconsin.

Kitamura, Y. (2002): “Econometric Comparison of Conditional Models,” Manuscript,University of Pennsylvania.

Kitamura, Y., and M. Stutzer (1997): “An Information-Theoretic Alternative to Gen-eralized Method of Moments Estimation,” Econometrica, 65, 861–874.

Kitamura, Y., and G. Tripathi (2001): “Empirical Likelihood-Based Inference in Condi-tional Moment Restriction Models,” Department of Economics, University of Wisconsin.

Newey, W., and D. McFadden (1994): “Large Sample Estimation and Hypothesis Test-ing,” in Handbook of Econometrics, Vol. 4, ed. by R. Engle, and D. McFadden, pp.2113–2241. North Holland.

Newey, W., and R. Smith (2001): “Higher Order Properties of GMM and GeneralizedEmpirical Likeliood Estimators,” MIT and University of Bristol.

Owen, A. (2001): Empirical Likelihood. Chapman and Hall/CRC.

Paarsch, H. (1992): “Deciding between the common and private value paradigms inempirical models of auctions,” Journal of Econometrics, 51, 191–215.

Qin, J., and J. Lawless (1994): “Empirical Likelihood and General Estimating Equa-tions,” Annals of Statistics, 22, 300–325.

30

Ramalho, J. J., and R. J. Smith (2002): “Generalized Empirical Likelihood Non-NestedTests,” Journal of Econometrics, pp. 1–28.

Rivers, D., and Q. Vuong (2001): “Model Selection Tests for Nonlinear Dynamic Mod-els,” The Econometrics Journal(forthcoming).

Sin, C., and H. White (1996): “Information Criteria for Selecting Possibly MisspecifiedParametric Models,” Journal of Econometrics, 71, 207–225.

Smith, R. J. (1992): “Non-Nested Tests for Competing Models Estimated by GeneralizedMethod of Moments,” Econometrica, 60(4), 973–980.

Stone, C. (1985): “Additive Regression and Other Nonparametric Models,” Annals ofStatistics, 13, 689–705.

Vuong, Q. (1989): “Likelihood-ratio tests for model selection and non-nested hypotheses,”Econometrica, pp. 307–333.

White, H. (1982): “Maximum Likelihood Estimation for Misspecified Models,” Economet-rica, 50, 1–25.

(2000): “A Reality Check for Data Snooping,” Econometrica, 68, 1097–1126.

31

A Population empirical likelihood function

We restrict Q to the set of probability measures with continuously differentiable densityfunctions. We find the solution for the constrained optimization problem:

minq(·)

I (q (·) |h0 (·)) =∫

logh0 (Z)q (Z)

h0 (Z) dZ

s.t.∫m (Z;α) q (Z) dZ = 0 and

∫q (Z) dZ = 1

The Lagrangian for this constrained optimization problem is

Lq(·),λ,β =∫

logh0 (Z)q (Z)

h0 (Z) dZ + λ′∫m (Z;α) q (Z) dZ + β

(∫q (Z) dZ − 1

)Using calculus of variation to concentrate out q (·) given λ, β, we find that

q∗(·|λ, β

)= argminq(·)Lq(·),λ,β =

h0 (Z)β + λ′m (Z; α)

Define λ = λ/β. Using the constraint that q (·) integrates to 1, we can solve for β:

β =∫

h0 (Z)1 + λ′m (Z;α)

dZ and q (Z) =h0 (Z)

1 + λ′m (Z;α)/

∫h0 (Z)

1 + λ′m (Z;α)dZ

Concentrating out q (·) and β given λ leads to

I (q∗ (·) |h (·)) = L =∫

log (1 + λ′m (Z;α))h0 (Z) dZ + log∫

h0 (Z)1 + λ′m (Z;α)

dZ

The moment constraint∫ m(Z;α)

1+λ′m(Z;α)h0 (Z) dZ = 0 implies that∫

11+λ′m(Z;α)h0 (Z) dZ = 1.

Hence

I (q∗ (·) |h0 (·)) = maxλ

∫log (1 + λ′m (Z;α))h0 (Z) dZ.

B Proof of Theorem 3

Consistency and asymptotic normality for generalized empirical likelihood estimators canbe found in Kitamura and Stutzer (1997), Christoffersen, Hahn, and Inoue (2001), Cher-nozhukov and Hansen (2001) and Newey and Smith (2001). In particular, we follow closelythe arguments of Christoffersen, Hahn, and Inoue (2001), which applies to misspecifiedmodels. We review their main steps.

32

Consistency:

Step 1. Condition 2 implies that λ (α) is continuous in α, where

λ (α) = arg maxλ

E0 log(1 + λ′m (zi;α)

).

Step 2. This combined with conditions 5 and 7 and the saddle point property shows thatfor L defined as E0 log

(1 + λ∗

′m (zi;α∗)

), and for all δ > 0, there exists η > 0 such that

P

(1n

n∑i=1

log(1 + λ (α)′m (zi;α)

)< L+ η

)≤ P

(1n

n∑i=1

log(1 + λ (α)′m (zi;α)

)< L+ η

)→ 0.

Step 3. Using convexity lemmas, λ (α∗)p→ λ∗, and

P

(1n

n∑i=1

log(1 + λ (α∗)′m (zi;α∗)

)> L+ η

)→ 0.

Combining steps 2 and 3 gives αp→ α∗. Using convexity lemma again, since

1n

n∑i=1

log(1 + λ′m (zi; α)

) p→ log(1 + λ′m (zi;α∗)

)pointwise and therefore uniform in λ,

λ = λ (α)p→ λ∗ = arg max

λE0 log

(1 + λ′m (zi, α∗)

).

√n convergence rate:

Step 1. Condition 4 and 6 imply 1√n

∑ni=1

m(zi;α∗)

1+λ∗′m(zi;α∗)= Op (1) and

n∑i=1

log(1 + λ (α∗)′m (zi;α∗)

)= Op (1) .

Step 2. Using the Saddle point property, denote M (λ, α) = 1n

∑ni=1 log (1 + λ′m (zi;α)).

Then for all t:

nM

(λ∗ +

t√n, α

)≤ nM

(λ, α

)≤ nM

(λ (α∗) , α∗

)= Op (1) .

33

A Taylor expansion of the left-hand side up to second order then shows that

1√n

n∑i=1

m (zi; α)1 + λ∗′m (zi; α)

= Op (1) .

Step 3. Using conditions 8,

1√n

n∑i=1

m (zi; α)1 + λ∗′m (zi; α)

=1√n

n∑i=1

m (zi;α∗)1 + λ∗′m (zi;α∗)

+√nE

m (zi; α)1 + λ∗′m (zi; α)

+ op (1)

=1√n

n∑i=1

m (zi;α∗)1 + λ∗′m (zi;α∗)

+ (D + op (1))′√n (α− α∗) + op (1)

Then by condition 5,√n (α− α∗) = Op (1).

Step 4. A Taylor expansion of 1√n

∑ni=1

m(zi;α)

1+λ′m(zi;α)= 0 in λ around λ∗ shows that:

√n(λ− λ∗

)= (V + op (1))−1 1√

n

n∑i=1

m (zi; α)1 + λ∗′m (zi; α)

+ op (1) = Op (1)

Empirical likelihood objective functions: Using condition 3 and 9,

√n(M(λ, α

)− M (λ∗, α∗)

)=√n(M(λ, α

)−M (λ∗, α∗)

)+ op (1)

=(∂

∂λM (λ∗, α∗) + op (1)

)√n(λ− λ∗

)+(∂

∂αM (λ∗, α∗) + op (1)

)√n (α− α∗) + op (1) .

C Kernel based test of nondegeneracy

The limit distribution of Vn, the kernel based nondegeneracy test (defined in Eq. (3.2)in the main text) can be derived following the results of Fan (1994), who shows that theconvergence rate and the limit distribution of the test statistics depends on whether thedata is under-smoothed, over-smoothed or optimally-smoothed. In particular, Fan (1994)showed that the effect of the first step estimation of β, λ and α needs to be taken intoaccount in the cases of over-smoothing and optimal-smoothing. On the other hand, in theundersmoothing case, preliminary parametric estimation has no effect on the limit varianceof the test statistics. In the following we restate the result for the undersmoothing casefrom case (c2) of Corollary 2.4 of Fan (1994):

Proposition 1 Let K (·) be a mth order kernel and let b be the bandwidth used in the kernel

34

estimation. If nbdz/2+2m −→ 0, then

nbdz/2(Vn −

1nbdz

∫K (u)2 du

)d−→ N

(0,

14

[∫h2 (Z) dZ

]∫ [∫K (u)K (u+ v) du

]2dv

).

D Distribution function based test of nondegeneracy

We first introduce some notation to describe the limiting distribution of the CDF-basedtest statistics for degeneracy. Let

Jβ (Z) =∫ Z ∂

∂βf (z, β∗)

(1 + λ∗

′m (z, α∗)

)dz,

Jλ,α (Z) = ∫ Z f (z, β∗)m (z, α∗) dz∫ Z f (z, β∗)

(1 + λ∗

′ ∂∂αm (z, α∗)

)dz.

Then

√n

(∫ Z

f(z, β

) (1 + λ′m (z, α)

)dz −

∫ Z

f (z, β∗)(1 + λ∗

′m (z, α∗)

)dz

)=Jβ (Z)

√n(β − β∗

)+ Jλ,α (x)

√n(λ− λ∗, α− α∗

)+ op (1) .

Using the linear representation of the parametric estimates, we can write

√n

(1n

n∑i=1

1 (zi ≤ Z)−∫ Z

f(z, β

) (1 + λ′m (z, α)

)dz

− h (Z)−∫ Z

f (z, β∗)(1 + λ∗

′m (z, α∗)

)dz

)=

1√n

n∑i=1

(1 (zi ≤ Z)− h (Z)− Jβ (Z)A−1

f

∂

∂βlog f (zi;β∗)

− Jλ,α (x)A−1m

m(zi;α∗)

1+λ∗′m(zi;α∗)∂

∂αm(zi;α)λ∗

1+λ∗′m(zi;α∗)

)+ op (1) .

In the above we have used the notation

Am =

[S11 S12

S21 S22

],

35

where

S11 =− Em (Z;α∗)m (Z;α∗)′

(1 + λ∗′m (Z;α∗))2,

S12 =E∂

∂αm (Z;α∗)1 + λ∗′m (;α∗)

− Em (Z;α∗)

(1 + λ∗′m (Z;α∗))2λ∗

′ ∂

∂λm (Z;α∗) ,

S21 =E∂

∂αm (Z;α∗)1 + λ′m (Z;α∗)

− E∂

∂αm (Z;α∗)λ∗m (Z;α∗)

(1 + λ∗′m (Z;α∗))2,

S22 =Eλ∗

′ ∂2

∂α∂α′m (Z;α∗)

1 + λ′m (Z;α∗)− E

∂∂αm (Z;α∗)λ∗λ∗

′ ∂∂αm (Z;α∗)

(1 + λ∗′m (Z;α))2.

This linear representation then converges weakly as n → ∞ to the Gaussian process G (·)with covariance function G (u)G (v) = Ev (zi, u) v (zi, v), where v (zi, u) is1 (zi ≤ u)− h (u)− Jβ (u)A−1

f

∂

∂βlog f (zi;β∗)− Jλ,α (u)A−1

m

m(zi;α∗)

1+λ∗′m(zi;α∗)∂

∂αm(zi;α)λ

1+λ′m(zi;α)

.

Given the derivations above, it is immediate to characterize the limit distribution for ρ1

and ρ2, the two CDF-based nondegeneracy test statistics introduced in Equations (3.3) and(3.4) in the main text:

Proposition 2 Under H0:

√nρ1

d−→ supZG (Z) ,

nρ2d−→∫G (u)2w (u) du.

36

Table 1: Monte carlo results: Empirical rejection probabilities under nullTest sizes for the corresponding critical valuesSample size z1% z5% z10%

n = 100 3.40% 7.87% 12.79%n = 200 1.52% 5.41% 11.89%

37

Table 2: Monte carlo results: Empirical rejection probabilities under alternativesTable: Test sizes for the corresponding critical valuesSample size z1% z5% z10%

β1 = 0.2600n = 100 3.84% 8.64% 13.43%n = 200 1.97% 5.66% 10.89%

β1 = 0.3000n = 100 13.46% 21.01% 26.76%n = 200 11.72% 19.99% 25.99%

38

Product # obs List Mean Stdev Median p p

Including shipping and handling costs9

Palm Pilot Vx 18 238.97 38.96 219.45 190.00 310.62Billingsley: Probability and Measure 18 99.95 95.23 6.14 98.90 83.58 100.87

Table 3: Summary statistics on online prices for two different productsPrice data for all products downloaded from pricescan.com: February 5, 2002.

39

Table 4: Results from Multiple Model Comparison Tests

Model 1: Non-sequential SearchModel 2: Parametric Sequential Search Model, with Gamma Search Cost Distribution

Model 3: Parametric Sequential Search Model, with Lognormal Search Cost Distribution

Benchmark Model Twn (White Bootstrap

for Test test statistic) p-valuea

Palm Pilot VxModel 1 0.397 0.953Model 2 0.640 0.162Model 3 -0.397 0.847

Billingsley textbookModel 1 0.034 0.386Model 2 0.640 0.205Model 3 -0.034 0.988

Statistic Full-sample Mean from Stdev from Number ofStatistic Bootstrap samples Bootstrap samples Bootstrap samples

Palm Pilot VxKLIC1 0.193 0.182 0.086 148KLIC2 0.956 0.055 1.215KLIC3 0.316 0.085 2.423NP entropy est. 2.182 1.163 0.383

Billingsley textbookKLIC1 0.713 0.733 0.281 83KLIC2 1.319 1.839 1.606KLIC3 0.679 1.916 2.658NP entropy est. 2.545 2.801 0.224

All prices includes shipping and handling chargesaProb

(T w,b

n ≥ T wn

), where T w,b

n denotes bootstrapped value for test statistic T wn , and Prob (· · · ) denotes

the empirical frequency from bootstrap re-samples.

40

nonparametric likelihood ratio model selection tests ...doubleh/papers/sieve.pdf · ference in rio...

Documents