linear inverse problems in structural econometrics estimation

Linear Inverse Problems in Structural EconometricsEstimation based on spectral decomposition

and regularization∗

Marine CarrascoUniversity of Rochester

Jean-Pierre FlorensUniversite de Toulouse (GREMAQ and IDEI)

Eric RenaultUniversity of North Carolina, Chapel Hill

∗We thank Richard Blundell, Xiaohong Chen, Serge Darolles, James Heckman, Jan Johannes, OliverLinton, Jean-Michel Loubes, Enno Mammen, Costas Meghir, Whitney Newey, Jean-Francois Richard,Anne Vanhems, and Ed Vytlacil for helpful discussions. Carrasco gratefully aknowledges financial supportfrom the National Science Foundation, grant # SES-0211418.

Abstract

Inverse problems can be described as functional equations where the value of thefunction is known or easily estimable but the argument is unknown. Many problemsin econometrics can be stated in the form of inverse problems where the argumentitself is a function. For example, consider a nonlinear regression where the functionalform is the object of interest. One can readily estimate the conditional expectationof the dependent variable given a vector of instruments. From this estimate, onewould like to recover the unknown functional form.

This chapter provides an introduction to the estimation of the solution to in-verse problems. This chapter focuses mainly on integral equations of the first kind.Solving these equations is particularly challenging as the solution does not neces-sarily exist, may not be unique, and is not continuous. As a result, a regularized(or smoothed) solution needs to be implemented. We review different regularizationmethods and study the properties of the estimator. Integral equations of the firstkind appear, for example, in the generalized method of moments when the numberof moment conditions is infinite, and in the nonparametric estimation of instrumen-tal variable regressions. In the last section of this chapter, we investigate integralequations of the second kind, whose solutions may not be unique but are continu-ous. Such equations arise when additive models and measurement error models areestimated nonparametrically.

Keywords: Additive models, Integral equation, Generalized Method of Moments,Instrumental variables, Many regressors, Nonparametric estimation, Tikhonov andLandweber-Fridman regularizations.

JEL: C13, C14, C20.

1. Introduction

1.1. Structural models and functional estimation

The objective of this chapter is to analyze functional estimation in structural econometricmodels. Different approaches exist to structural inference in econometrics and our pre-sentation may be viewed as a nonparametric extension of the basic example of structuralmodels, namely the static linear simultaneous equations model (SEM). Let us consider Ya vector of random endogenous variables and Z a vector of exogenous random variables.A SEM is characterized by a system

BθY + CθZ = U (1.1)

where Bθ and Cθ are matrices that are functions of an unknown “structural” parameterθ and E [U |Z] = 0. The reduced form is a multivariate regression model

Y = ΠZ + V (1.2)

where Π is the matrix of ordinary regression coefficients. The relation between reducedand structural form is, in the absence of higher moments restrictions, characterized by:

BθΠ + Cθ = 0. (1.3)

The two essential issues of structural modeling, the identification and the overidenti-fication problems, follow from the consideration of Equation (1.3). The uniqueness of thesolution in θ for given Π defines the identification problem. The existence of a solution (orrestrictions imposed on Π to guarantee the existence) defines the overidentification ques-tion. The reduced form parameter Π can be estimated by OLS and if a unique solution inθ exists for any Π, it provides the Indirect Least Square estimate of θ. If the solution doesnot exist for any Π, θ can be estimated by a suitable minimization of BθΠ + Cθ where Πis an estimator of Π.

In this chapter, we address the issue of functional extension of this construction. Thedata generating process (DGP) is described by a stationary ergodic stochastic processwhich generates a sequence of observed realizations of a random vector X.

The structural econometric models considered in this chapter are about the station-ary distribution of X. This distribution is characterized by its cumulative distributionfunction (c.d.f.) F, while the functional parameter of interest is an element ϕ of some in-finite dimensional Hilbert space. Following the notation of Florens (2003), the structuraleconometric model defines the connection between ϕ and F under the form of a functionalequation:

A(ϕ, F ) = 0. (1.4)

This equation extends Equation (1.3) and the definitions of identification (uniqueness ofthis solution) and of overidentification (constraints on F such that a solution exists) areanalogous to the SEM case. The estimation is also performed along the same line: F

1

can be estimated by the empirical distribution of the sample or by a more sophisticatedestimator (like kernel smoothing) belonging to the domain of A. ϕ is estimated by solving(1.4) or, in the presence of overidentification, by a minimization of a suitable norm ofA(ϕ, F ) after plugging in the estimator of F .

This framework may be clarified by some remarks.

1. All the variables are treated as random in our model and this construction seems todiffer from the basic econometric models which are based on a distinction betweenexogenous or conditioning variables and endogenous variables. Actually this dis-tinction may be used in our framework. Let X be decomposed into Y and Z and Finto FY (.|Z = z) the conditional c.d.f. of Y given Z = z, and FZ the marginal c.d.f.of Z. Then, the exogeneity of Z is tantamount to the conjunction of two conditions.

Firstly, the solution ϕ of (1.4) only depends on FY (.|Z = z) and ϕ is identified bythe conditional model only. Secondly if FY (.|Z = z) and FZ are “variations free” ina given statistical model defined by a family of sampling distributions (intuitivelyno restrictions link FY (.|Z = z) and FZ), no information on FY (.|Z = z) (and thenon ϕ) is lost by neglecting the estimation of FZ . This definition fully encompassesthe usual definition of exogeneity in terms of cuts (see Engle, Hendry and Richard(1983), Florens and Mouchart (1985)). Extension of that approach to sequentialmodels and then to sequential or weak exogeneity is straightforward.

2. Our construction does not explicitly involve residuals or other unobservable vari-ables. As will be illustrated in the examples below, most of the structural econo-metric models are formalized by a relationship between observable and unobservablerandom elements. A first step in the analysis of these models is to express the re-lationship between the functional parameters of interest and the DGP, or, in ourterminology, to specify the relation A(ϕ, F ) = 0. We start our presentation at thesecond step of this approach and our analysis is devoted to the study of this equationand to its use for estimation.

3. The overidentification is handled by extending the definition of the parameter inorder to estimate overidentified models. Even if A(ϕ, F ) = 0 does not have a so-lution for a given F , the parameter ϕ is still defined as the minimum of a norm ofA(ϕ, F ). Then ϕ can be estimated from an estimation of F, which does not satisfythe overidentification constraints. This approach extends the original GeneralizedMethod of Moments (GMM) treatment of overidentification. Another way to takeinto account overidentification constraints consists in estimating F under these con-straints (the estimator of F is the nearest distribution to the empirical distributionfor which there exists a solution, ϕ, of A(ϕ, F ) = 0). This method extends the newapproach to GMM called the empirical likelihood analysis (see Owen (2001) andreferences therein). In this chapter, we remain true to the first approach: if theequation A(ϕ, F ) = 0 has no solution it will be replaced by the first order conditionof the minimization of a norm of A(ϕ, F ). In that case, this first order conditiondefines a functional equation usually still denoted A(ϕ, F ) = 0.

2

1.2. Notation

In this chapter, X is a random element of a finite or infinite dimensional space X . In mostof the examples, X is a finite dimensional euclidean space (X ⊂ Rm) and the distributionon X, denoted F is assumed to belong to a set F . If F is absolutely continuous, its densityis denoted by f . Usually, X is decomposed into several components, X = (Y, Z, W ) ∈Rp×Rq×Rr(p+q+r = m) and the marginal c.d.f. or probability density function (p.d.f.)are denoted by FY , FZ , FW and fY , fX , fW respectively. Conditional c.d.f. are denoted byFY (.|Z = z) or FY (.|z) and conditional density by fY (.|Z = z) or fY (.|z) . The samplemay be an i.i.d. sample of X (denoted in that case (xi)i=1,...,n) or weakly dependent timeseries sample denoted (xt)t=1,...,T in the dynamic case.

The paper focuses on the estimation of an infinite dimensional parameter denotedby ϕ, which is an element of a Hilbert space H (mathematical concepts are recalled inSection 2). In some particular cases, finite dimensional parameters are considered andthis feature is underlined by the notation θ ∈ Θ ⊂ Rd.

The structural model is expressed by an operator A from H×F into an Hilbert spaceE and defines the equation A(ϕ, F ) = 0. The (possibly local) solution of this equation isdenoted by:

ϕ = Ψ(F ). (1.5)

For statistical discussions, a specific notation for the true value is helpful and F0 willdenote the true c.d.f. (associated with the density f0 and with the true parameter ϕ0 (orθ0)). The estimators of the c.d.f. will be denoted by Fn in an i.i.d. setting or FT in adynamic environment.

The operator A may take various forms. Particular cases are linear operators withrespect to F or to ϕ. The first case will be illustrated in the GMM example but mostof the paper will be devoted to the study of linear operator relatively to ϕ. In that case,equation A(ϕ, F ) = 0 can be rewritten :

A(ϕ, F ) = Kϕ− r = 0 (1.6)

where K is a linear operator from H to E depending on F and r is an element of E andis also a function of F . The properties of K are essential and we will present differentexamples of integral or differential operators. More generally, A may be nonlinear eitherwith respect to F or to ϕ, but as usual in functional analysis, most of the analysis ofnonlinear operators may be done locally (around the true value typically) and reduces tothe linear case. Game theoretic models or surplus estimation give examples of nonlinearmodels.

The problem of solving Equation (1.4) enters in the class of inverse problems. An in-verse problem consists of the resolution of an equation where the elements of the equationsare imperfectly known. In the linear case, the equation is Kϕ = r and F is not exactlyknown but only estimated. Thus, r is also imperfectly known. The econometric situationis more complex than most of the inverse problems studied in the statistical literature

3

because K is also only imperfectly known. According to the classification proposed byVapnik (1998), the stochastic inverse problems of interest in this chapter are more oftenthan not characterized by equations where both the operator and the right-hand sideterm need to be estimated. Inverse problems are said to be well-posed if a unique solutionexists and depends continuously on the imperfectly known elements of the equation. Inour notation, this means that Ψ in (1.5) exists as a function of F and is continuous. Thenif F is replaced by Fn, the solution ϕn of A(ϕn, Fn) = 0 exists and the convergence of Fn

to F0 implies the convergence of ϕn to ϕ0 by continuity. Unfortunately a large class ofinverse problems relevant to econometric applications are not well-posed (they are thensaid to be ill-posed in the Hadamard sense, see e.g. Kress (1999), Vapnik (1998)). Inthis case, a regularization method needs to be implemented to stabilize the solution. Ourtreatment of ill-posed problems is closed to that of Van Rooij and Ryumgaart (1999).

1.3. Examples

This section presents various examples of inverse problems motivated by structural econo-metric models. We will start with the GMM example, which is the most familiar toeconometricians. Subsequently, we present several examples of linear (w.r.t. ϕ) inverseproblems. The last three examples are devoted to nonlinear inverse problems.

1.3.1. Generalized Method of Moments (GMM)

Let us assume that X is m dimensional and the parameter of interest θ is also finitedimensional (θ ∈ Θ ⊂ Rd). We consider a function

h : Rm ×Θ → E (1.7)

and the equation connecting θ and F is defined by:

A(θ, F ) = EF (h(X, θ)) = 0 (1.8)

A particular case is given by h(X, θ) = µ(X) − θ where θ is exactly the expectationof a transformation µ of the data. More generally, θ may be replaced by an infinitedimensional parameter ϕ but we do not consider this extension here.

The GMM method was introduced by Hansen (1982) and has received numerous ex-tensions (see Ai and Chen (2003) for the case of an infinite dimensional parameter). GMMconsists in estimating θ by solving an inverse problem linear in F but nonlinear in θ. Itis usually assumed that θ is identified i.e. that θ is uniquely characterized by Equation(1.8). Econometric specifications are generally overidentified and a solution to (1.8) onlyexists for some particular F , including the true DGP F0, under the hypothesis of correctspecification of the model. The c.d.f F is estimated by the empirical distribution and theequation (1.8) becomes:

1

n

n∑i=1

h(xi, θ) = 0, (1.9)

4

which has no solution in general. Overidentification is treated by an extension of thedefinition of θ as follows:

θ = arg minθ‖BEF (h)‖2 (1.10)

where B is a linear operator in E and ‖‖ denotes the norm in E . This definition coincideswith (1.8) if F satisfies the overidentification constraints. Following Equation (1.10), theestimator is:

θn = arg minθ

∥∥∥∥∥Bn

(1

n

n∑i=1

h(xi, θ)

)∥∥∥∥∥

2

(1.11)

where Bn is a sequence of operators converging to B. If the number of moment conditionsis finite, Bn and B are square matrices.

As θ is finite dimensional, the inverse problem generated by the first order conditionsof (1.10) or (1.11) is well-posed and consistency of the estimators follows from standardregularity conditions. As it will be illustrated in Section 6, an ill-posed inverse problemarises if the number of moment conditions is infinite and if optimal GMM is used. In finitedimensions, optimal GMM is obtained using a specific weighting matrix, B = Σ− 1

2 , whereΣ is the asymptotic variance of

√n

(1n

∑ni=1 h(xi, θ)

)(Σ = V ar(h) in i.i.d. sampling). In

the general case, optimal GMM requires the minimization of ‖g‖2 where

Σ12 g = EF (h) (1.12)

The function g is then the solution of a linear inverse problem. If the dimension of h is notfinite, Equation (1.12) defines an ill-posed inverse problem, which requires a regularizationscheme (see Section 3).

1.3.2. Instrumental variables

Instrumental regression is a possible strategy to perform nonparametric estimation whenexplanatory variables are endogenous. Let us decompose X into (Y, X, W ) where Y ∈ R,Z ∈ Rq, W ∈ Rr. The subvectors Z and W may have elements in common. Theeconometrician starts with a relation

Y = ϕ(Z) + U (1.13)

where U is a random term which does not satisfy E(U |Z) = 0. This assumption isreplaced by the more general hypothesis

E(U |W ) = 0 (1.14)

and W is called the set of instrumental variables. Condition (1.14) defines ϕ as thesolution of an integral equation. In terms of density, (1.14) means that

A(ϕ, F ) =

∫ϕ(z)fZ(z|W = w)dz −

∫yfY (y|W = w)dy = 0 (1.15)

5

Using previous notation, the first part of (1.15) is denoted Kϕ and the second part isequal to r.

This expression is linear in ϕ and can be made linear in F by eliminating the denom-inator through a multiplication by fW (w). However, as will be seen later, this problem isessentially nonlinear in F because the treatment of overidentification and of regularizationwill necessarily reintroduce the denominator in (1.15).

Instrumental regression introduced in (1.15) can be generalized to local instrumentalregression and to generalized local instrumental regression. These extensions are relevantin more complex models than (1.13), where in particular the error term may enter theequation in non additive ways (see for such a treatment, Florens, Heckman, Meghir, andVytlacil (2002)). For example, consider the equation

Y = ϕ(Z) + Zε + U (1.16)

where Z is scalar and ε is a random unobservable heterogeneity component. It can beproved that, under a set of identification assumptions, ϕ satisfies the equations :

Aj(ϕ, F ) = EF

(∂ϕ(Z)

∂Z|W = w

)−

∂

∂Wj

E(Y |W = w)

∂

∂Wj

E(Z|W = w)

= 0 (1.17)

for any j = 1, ..., r. This equation, linear with respect to ϕ, combines integral and differ-ential operators.

Instrumental variable estimation and its local extension define ill-posed inverse prob-lems as will be seen in Section 5.

1.3.3. Deconvolution

Another classical example of ill-posed inverse problem is given by the deconvolution prob-lem. Let us assume that X,Y, Z be three scalar random elements such that

Y = X + Z (1.18)

Only Y is observable. The two components X and Z are independent. The density of Z(the error term) is known and denoted g. The parameter of interest is the density ϕ ofX. Then ϕ is solution of:

A(ϕ, F ) =

∫ϕ(y)g(x− y)dy − f(x) = 0

(1.19)

= Kϕ− r

This example is comparable to the instrumental variables case but only the r.h.s. r = fis unknown whereas the operator K is given.

6

1.3.4. Regression with many regressors

This example also constitutes a case of linear ill-posed inverse problems. Let us considera regression model where the regressors are indexed by τ belonging to an infinite indexset provided with a measure Π. The model says:

Y =

∫Z(τ)ϕ(τ)Π(dτ) + U (1.20)

where E(U |(Z(τ))τ ) = 0 and ϕ is the parameter of interest and is infinite dimensional.Examples of regression with many regressors are now common in macroeconomics (seeStock and Watson (2002) or Forni and Reichlin (1998) for two presentations of this topic).

Let us assume that Y and (Z(τ))τ are observable. Various treatments of (1.20) canbe done and we just consider the following analysis. The conditional moment equationE(U |(Z(τ))τ ) = 0 implies an infinite number of conditions indexed by τ :

E(Z(τ)U) = 0, ∀τ

or equivalently

∫EF (Z(τ)Z(ρ))ϕ(ρ)Π(dρ)− EF (Y Z(τ)) = 0, ∀τ (1.21)

This equation generalizes the usual normal equations of the linear regression to aninfinite number of regressors. The inverse problem defined in (1.21) is linear in both Fand ϕ but it is ill posed. An intuitive argument to illustrate this issue is to considerthe estimation using a finite number of observations of the second moment operatorEF (Z(τ)Z(ρ)) which is infinite dimensional. The resulting multicollinearity problem issolved by a ridge regression. The “infinite matrix” EF (Z(.)Z(.)) is replaced by αI +EF (Z(.)Z(.)) where I is the identity and α a positive number, or by a reduction of the setof regressors to the first principal components. These two solutions are particular examplesof regularization methods (namely the Tikhonov and the spectral cut-off regularizations),which will be introduced in Section 3.

1.3.5. Additive models

The properties of the integral equations generated by this example and by the next oneare very different from that of the three previous examples. We consider an additiveregression model:

Y = ϕ(Z) + ψ(W ) + U (1.22)

where E(U |Z, W ) = 0 and X = (Y, Z,W ) is the observable element. The parameters ofinterest are the two functions ϕ and ψ. The approach we propose here is related to thebackfitting approach (see Hastie and Tibshirani (1990)). Other treatments of additive

7

models have been considered in the literature (see Pagan and Ullah (1999)). Equation(1.22) implies

EF (Y |Z = z) = ϕ(z) + EF (ψ(W )|Z = z)

EF (Y |W = w) = EF (ϕ(Z)|W = w) + ψ(w)(1.23)

and by substitution

ϕ(z)− EF (EF (ϕ(Z)|W )|Z = z)

= EF (Y |Z = z)− EF (EF (Y |W )|Z = z)(1.24)

or, in our notations:

(I −K) ϕ = r

where K = EF (EF ( . |W )|Z). Backfitting refers to the iterative method to solve Equation(1.23).

An analogous equation characterizes ψ. Actually even if (1.22) is not well specified,these equations provide the best approximation of the regression of Y given Z and W byan additive form. Equation (1.24) is a linear integral equation and even if this inverseproblem is ill-posed because K is not one-to-one (ϕ is only determined up to a constantterm), the solution is still continuous and therefore the difficulty is not as important asthat of the previous examples.

1.3.6. Measurement-error models or nonparametric analysis of panel data

We denote η to be an unobservable random variable for which two measurements Y1 andY2 are available. These measurements are affected by a bias dependent on observablevariables Z1 and Z2. More formally:

Y1 = η + ϕ(Z1) + U1 E(U1|η, Z1, Z2) = 0

Y2 = η + ϕ(Z2) + U2 E(U2|η, Z1, Z2) = 0(1.25)

An i.i.d. sample (y1i, y2i, ηi, z1i, z2i) is drawn but the ηi are unobservable. Equivalentlythis model may be seen as a two period panel data with individual effects ηi.

The parameter of interest is the “bias function” ϕ, identical for the two observations.In the measurement context, it is natural to assume that the joint distribution of theobservables is independent of the order of the observations, or equivalently (Y1, Z1, Y2, Z2)are distributed as (Y2, Z2, Y1, Z1). This assumption is not relevant in a dynamic context.

The model is transformed in order to eliminate the unobservable variable by difference:

Y = ϕ(Z2)− ϕ(Z1) + U (1.26)

8

where Y = Y2 − Y1, U = U2 − U1, and E(U |Z1, Z2) = 0.This model is similar to an additive model except for the symmetry between the

variables, and the fact that with the notation of (1.22), ϕ and ψ are identical. Anapplication of this model may be found in Gaspar and Florens (1998) where y1i and y2i

are two measurements of the level of the ocean in location i by a satellite radar altimeter,ηi is the true level and ϕ is the “sea state bias” depending on the waves’ height and thewind speed (Z1i and Z2i are both two dimensional).

The model is treated through the relation:

E(Y |Z2 = z2) = ϕ(z2)− E(ϕ(Z1)|Z2 = z2) (1.27)

which defines an integral equation Kϕ = r. The exchangeable property between thevariables implies that conditioning on Z1 gives the same equation (where Z1 and Z2 areexchanged).

1.3.7. Game theoretic model

This example and the next ones present economic models formalized by nonlinear inverseproblems. As the focus of this chapter is on linear equations, these examples are givenfor illustration and will not be treated outside of this section. The analysis of nonlinearfunctional equations raises numerous questions: uniqueness and existence of the solution,asymptotic properties of the estimator, implementation of the estimation procedure andnumerical computation of the solution.

Most of these questions are usually solved locally by a linear approximation of thenonlinear problem deduced from a suitable concept of derivative. A strong concept ofderivation (typically Frechet derivative) is needed to deal with the implicit form of themodel, which requires the use of the Implicit Function theorem.

The first example of nonlinear inverse problems follows from the strategic behavior ofthe players in a game. Let us assume that for each game, each player receives a randomsignal or type denoted by ξ and plays an action X. The signal is generated by a probabilitydescribed by its c.d.f. ϕ, and the players all adopt a strategy σ dependent on ϕ whichassociates X with ξ, i.e.

X = σϕ(ξ).

The strategy σϕ is determined as an equilibrium of the game (e.g. Nash equilibrium) orby an approximation of the equilibrium (bounded rationality behavior). The signal ξ isprivate knowledge for the player but is unobserved by the econometrician, and the c.d.f.ϕ is common knowledge for the players but is unknown for the statistician. The strategyσϕ is determined from the rules of the game and by the assumptions on the behavior ofthe players. The essential feature of the game theoretic model from a statistical viewpointis that the relation between the unobservable and the observable variables depends on thedistribution of the unobservable component. The parameter of interest is the c.d.f. ϕ ofthe signals.

9

Let us restrict our attention to cases where ξ and X are scalar and where σϕ is strictlyincreasing. Then the c.d.f. F of the observable X is connected with ϕ by:

A(ϕ, F ) = F σϕ − ϕ = 0 (1.28)

If the signals are i.i.d. across the different players and different games, F can beestimated by a smooth transformation of the empirical distribution and Equation (1.28)is solved in ϕ. The complexity of this relation can be illustrated by the auction model.In the private value first price auction model, ξ is the value of the object and X the bid.If the number of bidders is N + 1 the strategy function is equal to:

X = ξ −

∫ ξ

ξ

ϕN(u)du

ϕN(ξ)(1.29)

where [ξ, ξ] is the support of ξ and ϕN(u) = [ϕ(u)]N is the c.d.f. of the maximum privatevalue among N players.

Model (1.28) may be extended to a non iid setting (depending on exogenous variables)or to the case where σϕ is partially unknown. The analysis of this model has been done byGuerre, Perrigne and Vuong (2000) in a nonparametric context. The framework of inverseproblem is used by Florens, Protopopescu and Richard (1997).

1.3.8. Solution of a differential equation

In several models like the analysis of the consumer surplus, the function of interest is thesolution of a differential equation depending on the data generating process.

Consider for example a class of problems where X = (Y, Z,W ) ∈ R3 is i.i.d., F is thec.d.f. of X and the parameter ϕ verifies:

d

dzϕ(z) = mF (z, ϕ(z)) (1.30)

when mF is a regular function depending on F . A first example is

mF (z, w) = EF (Y |Z = z, W = w) (1.31)

but more complex examples may be constructed in order to take into account the en-dogeneity of one or two variables. For example, Z may be endogenous and mF may bedefined by:

E(Y |W1 = w1,W2 = w2) = E(mF (Z, W1)|W1 = w1,W2 = w2) (1.32)

Economic applications can be found in Hausman (1981, 1985) and Hausman and Newey(1995) and a theoretical treatment of these two problems is given by Vanhems (2000) andLoubes and Vanhems (2001).

10

1.3.9. Instrumental variables in a nonseparable model

Another example of a nonlinear inverse problem is provided by the following model:

Y = ϕ (Z,U) (1.33)

where Z is an endogenous variable. The function ϕ is the parameter of interest. Denoteϕz (u) = ϕ (z, u) . Assume that ϕz (u) is an increasing function of u for each z. Moreover,the distribution, FU of U is assumed to be known for identification purposes. Model (1.33)may arise in a duration model where Y is the duration (see Equation (2.2) of Horowitz1999). One difference with Horowitz (1999) is the presence of an endogenous variablehere. There is a vector of instruments W, which are independent of U . Because U and Ware independent, we have

P (U ≤ u|W = w) = P (U ≤ u) = FU (u) . (1.34)

Denote f the density of (Y, Z) and

F (y, z|w) =

∫ y

−∞f (t, z|w) dt.

F can be estimated using the observations (yi, zi, wi), i = 1, 2, ..., n. By a slight abuse ofnotation, we use the notation P (Y ≤ y, Z = z|W = w) for F (y, z|w) . We have

P (U ≤ u, Z = z|W = w) = P(ϕz (Y )−1 ≤ u, Z = z|W = w

)

= P (Y ≤ ϕz (u) , Z = z|W = w)

= F (ϕz (u) , z|w) . (1.35)

Combining Equations (1.34) and (1.35), we obtain∫

F (ϕz (u) , z|w) dz = FU (u) . (1.36)

Equation (1.36) belongs to the class of Urysohn equations of Type I (Polyanin and Manzhi-rov, 1998). The estimation of the solution of Equation (1.36) is discussed in Florens(2005).

1.4. Organization of the chapter

Section 2 reviews the basic definitions and properties of operators in Hilbert spaces. Thefocus is on compact operators because they have the advantage of having a discretespectrum. We recall some laws of large numbers and central limit theorems for Hilbertvalued random elements. Finally, we discuss how to estimate the spectrum of a compactoperator and how to estimate the operators themselves.

Section 3 is devoted to solving integral equations of the first kind. As these equationsare ill-posed, the solution needs to be regularized (or smoothed). We investigate theproperties of the regularized solutions for different types of regularizations.

11

In Section 4, we show under suitable assumptions the consistency and asymptoticnormality of regularized solutions.

Section 5 detail five examples: the ridge regression, the factor model, the infinitenumber of regressors, the deconvolution, and the instrumental variables estimation.

Section 6 has two parts. First, it recalls the main results relative to reproducingkernels. Reproducing kernel theory is closely related to that of the integral equationsof the first kind. Second, we explain the extension of GMM to a continuum of momentconditions and show how the GMM objective function reduces to the norm of the momentfunctions in a specific reproducing kernel Hilbert space. Several examples are provided.

Section 7 tackles the problem of solving integral equations of the second kind. Atypical example of such a problem is the additive model introduced earlier.

2. Spaces and Operators

The purpose of this section is to introduce terminology and to state the main properties ofoperators in Hilbert spaces that are used in our econometric applications. Most of theseresults can be found in Debnath and Mikusinsky (1999) and Kress (1999). Ait-Sahalia,Hansen, and Scheinkman (2004) provide an excellent survey of operator methods for thepurpose of financial econometrics.

2.1. Hilbert spaces

We start by recalling some of the basic concepts of analysis. In the sequel, C denotes theset of complex numbers. A vector space equipped by a norm is called a normed space.A sequence (ϕn) of elements in a normed space is called a Cauchy sequence if for everyε > 0 there exists an integer N (ε) such that

‖ϕn − ϕm‖ < ε

for all n, m ≥ N (ε) , i.e, if limn,m→∞ ‖ϕn − ϕm‖ = 0. A space S is complete if everyCauchy sequence converges to an element in S. A complete normed vector space is calleda Banach space.

Let (E, E , Π) be a probability space and

LpC (E, E , Π) =

f : E → C measurable s.t. ‖f‖ ≡

(∫|f |p dΠ

)1/p

< ∞

, p ≥ 1.

Then, LpC (E, E , Π) is a Banach space. If we only consider functions valued in R this space

is still a Banach space and is denoted in that case by Lp (we drop the subscript C). Inthe sequel, we also use the following notation. If E is a subset of Rp, then the σ−field Ewill always be the Borel σ−field and will be omitted in the notation Lp (Rp, Π). If Π hasa density π with respect to Lebesgue measure, Π will be replaced by π. If π is uniform, itwill be omitted in the notation.

12

Definition 2.1 (Inner product). Let H be a complex vector space. A mapping 〈, 〉 :H × H → C is called an inner product in H if for any ϕ, ψ, ξ ∈ H and α, β ∈ C thefollowing conditions are satisfied:

(a) 〈ϕ, ψ〉 = 〈ψ, ϕ〉 (the bar denotes the complex conjugate),(b) 〈αϕ + βψ, ξ〉 = α 〈ϕ, ξ〉+ β 〈ψ, ξ〉 ,(c) 〈ϕ, ϕ〉 ≥ 0 and 〈ϕ, ϕ〉 = 0 ⇐⇒ ϕ = 0.A vector space equipped by an inner product is called an inner product space.

Example. The space CN of ordered N -tuples x = (x1, ..., xN) of complex numbers,with the inner product defined by

〈x, y〉 =N∑

l=1

xlyl

is an inner product space.Example. The space l2 of all sequences (x1, x2, ...) of complex numbers such that∑∞

j=1 |xj|2 < ∞ with the inner product defined by 〈x, y〉 =∑∞

j=1 xjyj for x = (x1, x2, ...)and y = (y1, y2, ...) is an infinite dimensional inner product space.

Example. The space L2C (E, E , Π) associated with the inner product defined by

〈ϕ, ψ〉 =

∫ϕψdΠ

is an inner product space. On the other hand, LpC (E, E , Π) is not a inner product space

if p 6= 2.An inner product satisfies the Cauchy-Schwartz inequality, that is,

|〈ϕ, ψ〉|2 ≤ 〈ϕ, ϕ〉〈ψ, ψ〉for all ϕ, ψ ∈ H. Remark that 〈ϕ, ϕ〉 is real because 〈ϕ, ϕ〉 = 〈ϕ, ϕ〉. It actually defines a

norm ‖ϕ‖ = 〈ϕ, ϕ〉1/2 (this is the norm induced by the inner product 〈, 〉).Definition 2.2 (Hilbert space). If an inner product space is complete in the inducednorm, it is called a Hilbert space.

A standard theorem in functional analysis guarantees that every inner product spaceH can be completed to form a Hilbert space H. Such a Hilbert space is said to be thecompletion of H.

Example. CN , l2 and L2 (R,Π) are Hilbert spaces.Example. (Sobolev space) Let Ω = [a, b] be an interval of R. Denote by Hm (Ω),

m = 1, 2, ..., the space of all complex-valued functions ϕ ∈ Cm such that for all |l| ≤ m,ϕ(l) = ∂lϕ (τ) /∂τ l ∈ L2 (Ω) . The inner product on Hm (Ω) is

〈ϕ, ψ〉 =

∫ b

a

m∑

l=0

ϕ(l) (τ) ψ(l) (τ)dτ .

Hm (Ω) is an inner product space but it is not a Hilbert space because it is not complete.The completion of Hm (Ω) , denoted Hm (Ω), is a Hilbert space.

13

Definition 2.3 (Convergence). A sequence (ϕn) of vectors in an inner product spaceH, is called strongly convergent to a vector ϕ ∈ H if ‖ϕn − ϕ‖ → 0 as n →∞.

Remark that if (ϕn) converges strongly to ϕ in H then 〈ϕn, ψ〉 → 〈ϕ, ψ〉 as n → ∞,for every ψ ∈ H. The converse is false.

Definition 2.4. Let H be an inner product space. A sequence (ϕn) of nonzero vectorsin H is called an orthogonal sequence if 〈ϕm, ϕn〉 = 0 for n 6= m. If in addition ‖ϕn‖ = 1for all n, it is called an orthonormal sequence.

Example. Let π (x) be the pdf of a normal with mean µ and variance σ2. Denote byφj the Hermite polynomials of degree j:

φj (x) = (−1)jdjπdxj

π. (2.1)

The functions φj (x) form an orthogonal system in L2 (R, π) .

Any sequence of vectors(ψj

)in an inner product space that is linearly independent,

i.e.,

∞∑j=1

αjψj = 0 ⇒ αj = 0 ∀j = 1, 2, ...

can be transformed into an orthonormal sequence by the method called Gram-Schmidtorthonormalization process. This process consists of the following steps. Given

(ψj

),

define a sequence(ϕj

)inductively as

ϕ1 =ψ1

‖ψ1‖,

ϕ2 =ψ2 − 〈ψ2, ϕ1〉ϕ1

‖ψ2 − 〈ψ2, ϕ1〉ϕ1‖...

ϕn =ψn −

∑n−1l=1 〈ψn, ϕl〉ϕl∥∥ψn −

∑n−1l=1 〈ψn, ϕl〉ϕl

∥∥ .

As a result,(ϕj

)is orthonormal and any linear combinations of vectors ϕ1, ..., ϕn is also

a linear combinations of ψ1, ..., ψn and vice versa.

Theorem 2.5 (Pythagorean formula). If ϕ1, ..., ϕn are orthogonal vectors in an innerproduct space, then

∥∥∥∥∥n∑

j=1

ϕj

∥∥∥∥∥

2

=n∑

j=1

∥∥ϕj

∥∥2.

14

From the Pythagorean formula, it can be seen that the αl that minimize

∥∥∥∥∥ϕ−n∑

j=1

αjϕj

∥∥∥∥∥

are such that αj =⟨ϕ, ϕj

⟩. Moreover

n∑j=1

∣∣⟨ϕ, ϕj

⟩∣∣2 ≤ ‖ϕ‖2 . (2.2)

Hence the series∑∞

j=1

∣∣⟨ϕ, ϕj

⟩∣∣2 converges for every ϕ ∈ H. The expansion

ϕ =∞∑

j=1

⟨ϕ, ϕj

⟩ϕj (2.3)

is called a generalized Fourier series of ϕ. In general, we do not know whether the seriesin (2.3) is convergent. Below we give a sufficient condition for convergence.

Definition 2.6 (Complete orthonormal sequence ). An orthonormal sequence(ϕj

)in an inner product space H is said to be complete if for every ϕ ∈ H we have

ϕ =∞∑

j=1

⟨ϕ, ϕj

⟩ϕj

where the equality means

limn→∞

∥∥∥∥∥ϕ−n∑

j=1

⟨ϕ, ϕj

⟩ϕj

∥∥∥∥∥ = 0

where ‖.‖ is the norm in H.

A complete orthonormal sequence(ϕj

)in an inner product space H is an orthonormal

basis in H, that is every ϕ ∈ H has a unique representation ϕ =∑∞

j=1 αjϕj where αl ∈ C.

If(ϕj

)is a complete orthonormal sequence in an inner product space H then the set

span ϕ1, ϕ2, ... =

n∑

j=1

αjϕj : ∀n ∈ N, ∀α1, ..., αn ∈ C

is dense in H.

Theorem 2.7. An orthonormal sequence(ϕj

)in a Hilbert space H is complete if and

only if⟨ϕ, ϕj

⟩= 0 for all j = 1, 2, ... implies ϕ = 0.

15

Theorem 2.8 (Parseval’s formula). An orthonormal sequence(ϕj

)in a Hilbert space

H is complete if and only if

‖ϕ‖2 =∞∑

j=1

∣∣⟨ϕ, ϕj

⟩∣∣2 (2.4)

for every ϕ ∈ H.

Definition 2.9 (Separable space). A Hilbert space is called separable if it contains acomplete orthonormal sequence.

Example. A complete orthonormal sequence in L2 ([−π, π]) is given by

φj (x) =eijx

√2π

, j = ...,−1, 0, 1, ...

Hence, the space L2 ([−π, π]) is separable.

Theorem 2.10. Every separable Hilbert space contains a countably dense subset.

2.2. Definitions and basic properties of operators

In the sequel, we denote K : H → E the operator that maps a Hilbert space H (withnorm ‖.‖H) into a Hilbert space E (with norm ‖.‖E).

Definition 2.11. An operator K : H → E is called linear if

K (αϕ + βψ) = αKϕ + βKψ

for all ϕ, ψ ∈ H and all α, β ∈ C.

Definition 2.12. (i) The null space of K : H → E is the setN (K) = ϕ ∈ H : Kϕ = 0 .(ii) The range of K : H → E is the set R(K) = ψ ∈ E : ψ = Kϕ for some ϕ ∈ H .(iii) The domain of K : H → E is the subset ofH denoted D(K) on which K is defined.(iv) An operator is called finite dimensional if its range is of finite dimension.

Theorem 2.13. A linear operator is continuous if it is continuous at one element.

Definition 2.14. A linear operator K : H → E is called bounded if there exists a positivenumber C such that

‖Kϕ‖E ≤ C ‖ϕ‖Hfor all ϕ ∈ H.

16

Definition 2.15. The norm of a bounded operator K is defined as

‖K‖ ≡ sup‖ϕ‖≤1

‖Kϕ‖E

Theorem 2.16. A linear operator is continuous if and only if it is bounded.

Example. The identity operator defined by Iϕ = ϕ for all ϕ ∈ H is bounded with‖I‖ = 1.

Example. Consider the differential operator:

(Dϕ) (x) =dϕ (τ)

dτ= ϕ′ (τ)

defined on the space E1 = ϕ ∈ L2 ([−π, π]) : ϕ′ ∈ L2 ([−π, π]) with norm ‖ϕ‖ =√∫ π

−π|f (τ)|2 dτ .

For ϕj (τ) = sin jτ , j = 1, 2, ..., we have∥∥ϕj

∥∥ =√∫ π

−π|sin (jτ)|2 dτ =

√π and

∥∥Dϕj

∥∥ =√∫ π

−π|j cos (jτ)|2 dτ = j

√π. Therefore

∥∥Dϕj

∥∥ = j∥∥ϕj

∥∥ proving that the differential

operator is not bounded.

Theorem 2.17. Each linear operator K from a finite dimensional normed space H intoa normed space E is bounded.

An important class of linear operators are valued in C and they are characterized byRiesz theorem. From (2.2), we know that for any fixed vector g in an inner product spaceH, the formula G (ϕ) = 〈ϕ, g〉 defines a bounded linear functional on H. It turns out thatif H is a Hilbert space, then every bounded linear functional is of this form.

Theorem 2.18 (Riesz). Let H be a Hilbert space. Then for each bounded linear func-tion G : H → C there exists a unique element g ∈ H such that

G (ϕ) = 〈ϕ, g〉for all ϕ ∈ H. The norms of the element g and the linear function F coincide

‖g‖H = ‖G‖where ‖.‖H is the norm in H and ‖.‖ is the operator norm.

Definition 2.19 (Hilbert space isomorphism). A Hilbert space H1 is said to be iso-metrically isomorphic (congruent) to a Hilbert space H2 if there exists a one-to-one linearmapping J from H1 to H2 such that

〈J (ϕ) , J (ψ)〉H2= 〈ϕ, ψ〉H1

for all ϕ, ψ ∈ H1. Such a mapping J is called a Hilbert space isomorphism (or congruence)from H1 to H2.

17

The terminology “congruence” is used by Parzen (1959, 1970).

Theorem 2.20. Let H be a separable Hilbert space.(a) If H is infinite dimensional, then it is isometrically isomorphic to l2.(b) If H has a dimension N , then it is isometrically isomorphic to CN .

A consequence of Theorem 2.20 is that two separable Hilbert spaces of the samedimension (finite or infinite) are isometrically isomorphic.

Theorem 2.21. Let H and E be Hilbert spaces and let K : H → E be a bounded op-erator. Then there exists a uniquely determined linear operator K∗ : E → H with theproperty

〈Kϕ, ψ〉E = 〈ϕ,K∗ψ〉Hfor all ϕ ∈ H and ψ ∈ E . Moreover, the operator K∗ is bounded and ‖K‖ = ‖K∗‖ . K∗

is called the adjoint operator of K.

Riesz Theorem 2.18 implies that, in Hilbert spaces, the adjoint of a bounded operatoralways exists.

Example 2.1. (discrete case) Let π and ρ be two discrete probability density func-tions on N. Let H = L2 (N, π) =

ϕ : N→ R, ϕ = (ϕl)l∈N such that

∑l∈N ϕ2

l π (l) < ∞and E = L2 (N, ρ). The operator K that associates to elements (ϕl)l∈N of H elements(ψp

)p∈N of E such that

(Kϕ)p = ψp =∑

l∈Nk (p, l) ϕlπ (l)

is an infinite dimensional matrix. If H and E are finite dimensional, then K is simply amatrix.

Example 2.2. (integral operator) An important kind of operator is the integraloperator. Let H = L2

C (Rq, π) and E =L2C (Rr, ρ) where π and ρ are pdf. The integral

operator K : H → E is defined as

Kϕ (τ) =

∫k (τ , s) ϕ (s) π (s) ds. (2.5)

The function k is called the kernel of the operator. If k satisfies∫ ∫

|k (τ , s)|2 π (s) ρ (τ) dsdτ < ∞ (2.6)

(k is said to be a L2−kernel) then K is a bounded operator and

‖K‖ ≤√∫ ∫

|k (τ , s)|2 π (s) ρ (τ) dsdτ .

18

Indeed for any ϕ ∈ H, we have

‖Kϕ‖2E =

∫ ∣∣∣∣∫

k (τ , s) ϕ (s) π (s) ds

∣∣∣∣2

ρ (τ) dτ

=

∫|〈k (τ , .) , ϕ (.)〉H|2 ρ (τ) dτ

≤∫‖k (τ , .)‖2

H ‖ϕ‖2

H ρ (τ) dτ

by Cauchy-Schwarz inequality. Hence we have

‖Kϕ‖2E ≤ ‖ϕ‖2

H

∫‖k (τ , .)‖2

H ρ (τ) dτ

= ‖ϕ‖2

H

∫ ∫|k (τ , s)|2 π (s) ρ (τ) dsdτ .

The upperbound for ‖K‖ follows.The adjoint K∗ of the operator K is also an integral operator

K∗ψ (s) =

∫k∗ (s, τ) ψ (τ) ρ (τ) dτ (2.7)

with k∗ (s, τ) = k (τ , s). Indeed, we have

〈Kϕ, ψ〉E =

∫(Kϕ) (τ) ψ (τ)ρ (τ) dτ

=

∫ (∫k (τ , s) ϕ (s) π (s) ds

)ψ (τ)ρ (τ) dτ

=

∫ϕ (s)

(∫k (τ , s) ψ (τ)ρ (τ)

)π (s) ds

=

∫ϕ (s)

(∫k∗ (s, τ) ψ (τ) ρ (τ)

)π (s) ds

= 〈ϕ,K∗ψ〉H .

There are two types of integral operators we are interested in, the covariance operatorand the conditional expectation operator.

Example 2.3. (conditional expectation operator) When K is a conditional ex-pectation operator, it is natural to define the spaces of reference as a function of unknownpdfs. Let (Z, W ) ∈ Rq × Rr be a r.v. with distribution FZ,W , let FZ , and FW be themarginal distributions of Z and W respectively. The corresponding pdfs are denoted fZ,W ,fZ , and fW . Define

H = L2 (Rq, fZ) ≡ L2Z ,

E = L2 (Rr, fW ) ≡ L2W .

19

Let K be the conditional expectation operator:

K : L2Z → L2

W

ϕ → E [ϕ (Z) |W ] . (2.8)

K is an integral operator with kernel

k (w, z) =fZ,W (z, w)

fZ (z) fW (w)

By Equation (2.7), its adjoint K∗ has kernel k∗ (z, w) = k (w, z) and is also a conditionalexpectation operator:

K∗ : L2W → L2

Z

ψ → E [ψ (W ) |Z] .

Example 2.4. (Restriction of an operator on a subset of H) Let K : H → Eand consider the restriction denoted K0 of K on a subspace H0 of H. K0 : H0→ E is suchthat K0 and K coincide on H0. It can be shown that the adjoint K∗

0 of K0 is the operatormapping E into H0 such that

K∗0 = PK∗ (2.9)

where P is the projection on H0. The expression of K∗0 will reflect the extra information

contained in H0.To prove (2.9), we use the definition of K∗ :

〈Kϕ,ψ〉E = 〈ϕ,K∗ψ〉H for all ϕ ∈ H0

= 〈ϕ,K∗0ψ〉H0

for all ϕ ∈ H0

⇔ 〈ϕ,K∗ψ −K∗0ψ〉H = 0 for all ϕ ∈ H0

⇔ K∗ψ −K∗0ψ ∈ H⊥

0

⇔ K∗0ψ = PK∗ψ.

A potential application of this result to the conditional expectation in Example 2.3 isthe case where ϕ is known to be additive. Let Z = (Z1, Z2) . Then

H0 =ϕ (Z) = ϕ1 (Z1) + ϕ2 (Z2) : ϕ1 ∈ L2

Z1, ϕ2 ∈ L2

Z2

.

Assume that E [ϕ1 (Z1)] = E [ϕ2 (Z2)] = 0. We have Pϕ = (ϕ1, ϕ2) with

ϕ1 = (I − P1P2)−1 (P1 − P1P2) ϕ,

ϕ2 = (I − P1P2)−1 (P2 − P1P2) ϕ,

where P1 and P2 are the projection operators on L2Z1

and L2Z2

respectively. If the twospaces L2

Z1and L2

Z2are orthogonal, then ϕ1 = P1ϕ and ϕ2 = P2ϕ.

20

Definition 2.22 (Self-adjoint). If K = K∗ then K is called self-adjoint (or Hermitian).

Remark that if K is a self-adjoint integral operator, then k (s, τ) = k (τ , s).

Theorem 2.23. Let K : H → H be a self-adjoint operator then

‖K‖ = sup‖ϕ‖=1

∣∣〈Kϕ, ϕ〉H∣∣ .

Definition 2.24 (Positive operator). An operator K : H → H is called positive if itis self-adjoint and 〈Kϕ, ϕ〉H ≥ 0.

Definition 2.25. A sequence (Kn) of operators Kn : H → E is called pointwise conver-gent if for every ϕ ∈ H, the sequence Knϕ converges in E . A sequence (Kn) of boundedoperators converges in norm to a bounded operator K if ‖Kn −K‖ → 0 as n →∞.

Definition 2.26 (Compact operator). A linear operator K : H → E is called a com-pact operator if for every bounded sequence (ϕn) in H, the sequence (Kϕn) contains aconvergent subsequence in E .

Theorem 2.27. Compact linear operators are bounded.

Not every bounded operator is compact. An example is given by the identity operatoron an infinite dimensional space H. Consider an orthonormal sequence (en) in H. Thenthe sequence Ien = en does not contain a convergent subsequence.

Theorem 2.28. Finite dimensional operators are compact.

Theorem 2.29. If the sequence Kn : H → E of compact linear operators are norm con-vergent to a linear operator K : H → E , i.e., ‖Kn −K‖ → 0 as n → ∞, then K iscompact. Moreover, every compact operator is the limit of a sequence of operators withfinite dimensional range.

Hilbert Schmidt operators are discussed in Dunford and Schwartz (1988, p. 1009),Dautray and Lyons (1988, Vol 5, p.41, chapter VIII).

Definition 2.30 (Hilbert-Schmidt operator). Letϕj, j = 1, 2, ...

be a complete or-

thonormal set in a Hilbert space H. An operator K : H → E is said to be a Hilbert-Schmidt operator if the quantity ‖.‖HS defined by

‖K‖HS =

∞∑j=1

∥∥Kϕj

∥∥2

E

1/2

is finite. The number ‖K‖HS is called the Hilbert-Schmidt norm of K. Moreover

‖K‖ ≤ ‖K‖HS (2.10)

and hence K is bounded.

21

From (2.10), it follows that HS norm convergence implies (operator) norm convergence.

Theorem 2.31. The Hilbert-Schmidt norm is independent of the orthonormal basis usedin its definition.

Theorem 2.32. Every Hilbert-Schmidt operator is compact.

Theorem 2.33. The adjoint of a Hilbert-Schmidt operator is itself a Hilbert-Schmidtoperator and ‖K‖HS = ‖K∗‖HS .

Theorem 2.32 implies that Hilbert-Schmidt (HS) operators can be approached by asequence of finite dimensional operators, which is an attractive feature when it comesto estimating K. Remark that the integral operator K defined by (2.5) and (2.6) is aHilbert-Schmidt (HS) operator and its adjoint is also a HS operator. Actually, all Hilbert-Schmidt operators of L2 (Rq, π) in L2 (Rr, ρ) are integral operators. The following theoremis proved in Dautray and Lions (Vol. 5, p. 45).

Theorem 2.34. An operator of L2 (Rq, π) in L2 (Rr, ρ) is Hilbert-Schmidt if and only ifit admits a kernel representation (2.5) conformable to (2.6). In this case, the kernel k isunique.

Example 2.1 (continued). Let K from L2 (N, π) in L2 (N, ρ) with kernel k (l, p). Kis a Hilbert-Schmidt operator if

∑∑k (l, p)2 π (l) ρ (p) < ∞. In particular, the operator

defined by (Kϕ)1 = ϕ1 and (Kϕ)p = ϕp − ϕp−1, p = 2, 3, ... is not a Hilbert-Schmidtoperator; it is not even compact.

Example 2.3 (continued). By Theorem 2.34, a sufficient condition for K and K∗

to be compact is

∫ ∫ [fZ,W (z, w)

fZ (z) fW (w)

]2

fZ (z) fW (w) dzdw < ∞.

Example 2.5 (Conditional expectation with common elements). Consider aconditional expectation operator from L2 (X,Z) into L2 (X,W ) defined by

(Kϕ) (x,w) = E [ϕ (X, Z) |X = x,W = w] .

Because there are common elements between the conditioning variable and the argumentof the function ϕ, the operator K is not compact. Indeed, let ϕ (X) be such that E (ϕ2) =1, we have Kϕ = ϕ. It follows that the image of the unit circle in L2 (X, Z) contains theunit circle of L2 (X) and hence is not compact. Therefore, K is not compact.

Example 2.6 (Restriction). For illustration, we consider the effect of restricting K

on a subset of L2C (Rq, π) . Consider K the operator defined by

K : L2C (Rq, π) → L2

C (Rr, ρ)

Kϕ = Kϕ

22

for every ϕ ∈ L2C (Rq, π) , where L2

C (Rq, π) ⊂ L2C (Rq, π) and L2

C (Rr, ρ) ⊃ L2C (Rr, ρ) .

Assume that K is a HS operator defined by (2.5). Under which conditions is K an HSoperator? Let

Kϕ (s) =

∫k (τ , s) ϕ (s) π (s) ds

=

∫k (τ , s)

π (s)

π (s)ϕ (s) π (s) ds

≡∫

k (τ , s) ϕ (s) π (s) ds.

Assume that π (s) = 0 implies π (s) = 0 and ρ (τ) = 0 implies ρ (τ) = 0. Note that

∫ ∣∣∣k (τ , s)∣∣∣2

π (s) ρ (τ) dsdτ

=

∫|k (τ , s)|2 π (s)

π (s)

ρ (τ)

ρ (τ)π (s) ρ (τ) dsdτ

< sups

∣∣∣∣π (s)

π (s)

∣∣∣∣ supτ

∣∣∣∣ρ (τ)

ρ (τ)

∣∣∣∣∫|k (τ , s)|2 π (s) ρ (τ) dsdτ .

Hence the HS property is preserved if (a) there is a constant c > 0 such that π (s) ≤ cπ (s)for all s ∈ Rq and (b) there is a constant d such that ρ (τ) ≤ dρ (τ) for all τ ∈ Rr.

2.3. Spectral decomposition of compact operators

For compact operators, spectral analysis reduces to the analysis of eigenvalues and eigen-functions. Let K : H → H be a compact linear operator.

Definition 2.35. λ is an eigenvalue of K if there is a nonzero vector φ ∈ H such thatKφ = λφ. φ is called the eigenfunction of K corresponding to λ.

Theorem 2.36. All eigenvalues of a self-adjoint operator are real and eigenfunctionscorresponding to different eigenvalues are orthogonal.

Theorem 2.37. All eigenvalues of a positive operator are nonnegative.

Theorem 2.38. For every eigenvalue λ of a bounded operator K, we have |λ| ≤ ‖K‖ .

Theorem 2.39. Let K be a self-adjoint compact operator, the set of its eigenvalues (λj)is countable and its eigenvectors

(φj

)can be orthonormalized. Its largest eigenvalue (in

absolute value) satisfies |λ1| = ‖K‖ . If K has infinitely many eigenvalues |λ1| ≥ |λ2| ≥ ...,then limj→∞λj = 0.

Let K : H → E , K∗K and KK∗ are self-adjoint positive operators on H and E respec-tively. Hence their eigenvalues are nonnegative by Theorem 2.37.

23

Definition 2.40. Let H and E be Hilbert spaces, K : H → E be a compact linearoperator and K∗ : E → H be its adjoint. The square roots of the eigenvalues of thenonnegative self-adjoint compact operator K∗K : H → H are called the singular valuesof K.

The following results (Kress, 1999, Theorem 15.16) apply to operators that are notnecessarily self-adjoint.

Theorem 2.41. Let (λj) denote the sequence of the nonzero singular values of the com-pact linear operator K repeated according to their multiplicity. Then there exist or-thonormal sequences φj of H and ψj of E such that

Kφj = λjψj, K∗ψj = λjφj (2.11)

for all j ∈ N. For each ϕ ∈ H we have the singular value decomposition

ϕ =∞∑

j=1

⟨ϕ, φj

⟩φj + Qϕ (2.12)

with the orthogonal projection operator Q : H → N (K) and

Kϕ =∞∑

j=1

λj

⟨ϕ, φj

⟩ψj. (2.13)

λj, φj, ψj

is called the singular system of K. Note that λ2

j are the nonzero eigenvaluesof KK∗ and K∗K associated with the eigenfunctions ψj and φj respectively.

Theorem 2.42. Let K be the integral operator defined by (2.5) and assume Condition(2.6) holds. Let

λj, φj, ψj

be as in (2.11). Then:

(i) The Hilbert Schmidt norm of K can be written as

‖K‖HS =

∑j∈N

|λj|21/2

=

∫ ∫|k (τ , s)|2 π (s) ρ (τ) dsdτ

1/2

where each λj is repeated according to its multiplicity.

(ii) (Mercer’s formula) k (τ , s) =∑∞

j=1 λjψj (τ) φj (s).

Example (degenerate operator). Consider an integral operator defined on L2 ([a, b])with a Pincherle-Goursat kernel

Kf (τ) =

∫ b

a

k (τ , s) f (s) ds,

k (τ , s) =n∑

l=1

al (τ) bl (s) .

24

Assume that al and bl belong to L2 ([a, b]) for all l. By (2.6), it follows that K is bounded.Moreover, as K is finite dimensional, we have K compact by Theorem 2.28. Assume thatthe set of functions (al) is linearly independent. The equality Kφ = λφ yields

n∑

l=1

al (τ)

∫bl (s) φ (s) ds = λφ (τ) ,

hence φ (τ) is necessarily of the form∑n

l=1 clal (τ). The dimension of the range of K istherefore n, there are at most n nonzero eigenvalues.

Example. Let H = L2 ([0, 1]) and the integral operator Kf (τ) =∫ 1

0(τ ∧ s) f (s) ds

where τ ∧ s = min(τ , s). It is possible to explicitly compute the eigenvalues and eigen-

functions of K by solving Kφ = λφ ⇐⇒ ∫ τ

0sφ (s) ds + τ

∫ 1

τφ (s) ds = λφ (τ) . Us-

ing two successive differentiations with respect to τ , we obtain a differential equationφ (τ) = −λφ′′ (τ) with boundary conditions φ (0) = 0 and φ′ (1) = 0. Hence the set oforthonormal eigenvectors is φj (τ) =

√2 sin ((πjτ) /2) associated with the eigenvalues

λj = 4/ (π2j2), j = 1, 3, 5, ....We can see that the eigenvalues converge to zero at anarithmetic rate.

Example. Let π be the pdf of the standard normal distribution and H = L2 (R, π) .Define K as the integral operator with kernel

k (τ , s) =l (τ , s)

π (τ) π (s)

where l (τ , s) is the joint pdf of the bivariate normal N((

00

),

(1 ρρ 1

)). Then K

is a self-adjoint operator with eigenvalues λj = ρj and has eigenfunctions that take theHermite polynomial form φj, j = 1, 2, ... defined in (2.1). This is an example where theeigenvalues decay exponentially fast.

2.4. Random element in Hilbert spaces

2.4.1. Definitions

Let H be a real separable Hilbert space with norm ‖‖ induced by the inner product 〈, 〉 .Let (Ω,F , P ) be a complete probability space. Let X : Ω → H be a Hilbert space-valuedrandom element (an H-r.e.). X is integrable or has finite expectation E (X) if E (‖X‖) =∫Ω‖X‖ dP < ∞, in that case E (X) satisfies E (X) ∈ H and E [〈X, ϕ〉] = 〈E (X) , ϕ〉 for

all ϕ ∈ H. An H-r.e. X is weakly second order if E[〈X, ϕ〉2] < ∞ for all ϕ ∈ H. For a

weakly second order H-r.e. X with expectation E (X) , we define the covariance operatorK as

K : H → HKϕ = E [〈X − E (X) , ϕ〉 (X − E (X))]

for all ϕ ∈ H. Note that var 〈X, ϕ〉 = 〈Kϕ,ϕ〉 .

25

Example. Let H = L2 ([0, 1]) with ‖g‖ =[∫ 1

0g (τ)2 dτ

]1/2

and X = h (τ , Y ) where Y

is a random variable and h (., Y ) ∈ L2 ([0, 1]) with probability one. Assume E (h (τ , Y )) =0, then the covariance operator takes the form:

Kϕ (τ) = E [〈h (., Y ) , ϕ〉h (τ , Y )]

= E

[(∫h (s, Y ) ϕ (s) ds

)h (τ , Y )

]

=

∫E [h (τ , Y ) h (s, Y )] ϕ (s) ds

≡∫

k (τ , s) ϕ (s) ds.

Moreover, if h (τ , Y ) = I Y ≤ τ − F (τ) then k (τ , s) = F (τ ∧ s)− F (τ) F (s) .

Definition 2.43. An H-r.e. Y has a Gaussian distribution on H if for all ϕ ∈ H thereal-valued r.v. 〈ϕ, Y 〉 has a Gaussian distribution on R.

Definition 2.44 (strong mixing). Let Xi,n, i = ...,−1, 0, 1, ...; n ≥ 1 be an array ofH-r.e., defined on the probability space (Ω,F , P ) and define An,b

n,a = σ (Xi,n, a ≤ i ≤ b)for all −∞ ≤ a ≤ b ≤ +∞, and n ≥ 1. The array Xi,n is called a strong or α−mixingarray of H-r.e. if limj→∞ α (j) = 0 where

α (j) = supn≥1

supl

supA,B

[|P (A ∩B)− P (A) P (B)| : A ∈ An,l

n,−∞, B ∈ An,+∞n,l+j

].

2.4.2. Central limit theorem for mixing processes

We want to study the asymptotic properties of Zn = n−1/2∑n

i=1 Xi,n where Xi,n : 1 ≤ 1 ≤ nis an array of H-r.e.. Weak and strong laws of large numbers for near epoch dependent(NED) processes can be found in Chen and White (1996). Here we provide sufficientconditions for the weak convergence of processes to be denoted ⇒ (see Davidson, 1994,for a definition). Weak convergence is stronger than the standard central limit theorem(CLT) as illustrated by a simple example. Let (Xi) be an iid sequence of zero mean weaklysecond order elements of H. Then for any Z in H, 〈Xi, Z〉 is an iid zero mean sequenceof C with finite variance 〈KZ, Z〉. Then standard CLT implies the asymptotic normalityof 1√

n

∑ni=1 〈Xi, Z〉 . The weak convergence of 1√

n

∑ni=1 Xi to a Gaussian process N (0, K)

in H requires an extra assumption, namely E ‖X1‖2 < ∞. Weak convergence theoremsfor NED processes that might have trending mean (hence are not covariance stationary)are provided by Chen and White (1998). Here, we report results for mixing processesproved by Politis and Romano (1994). See also van der Vaart and Wellner (1996) for iidsequences.

Theorem 2.45. Let Xi,n : 1 ≤ 1 ≤ n be a double array of stationary mixing H-r.e.with zero mean, such that for all n, ‖Xi,n‖ < B with probability one, and

∑mj=1 j2α (j) ≤

26

Kmr for all 1 ≤ m ≤ n and n, and some r < 3/2. Assume, for any integer l ≥ 1,that (X1,n, ..., Xl,n), regarded as a r.e. of Hl, converges in distribution to say, (X1, ..., Xl).Moreover, assume E [〈X1,n, Xl,n〉] → E [〈X1, Xl〉] as n →∞ and

limn→∞

n∑

l=1

E [〈X1,n, Xl,n〉] =∞∑

l=1

E [〈X1, Xl〉] < ∞.

Let Zn = n−1/2∑n

i=1 Xi,n. For any ϕ ∈ H, let σ2ϕ,n denote the variance of 〈Zn, ϕ〉 . Assume

σ2ϕ,n →

n→∞σ2

ϕ ≡ V ar (〈X1, ϕ〉) + 2∞∑i=1

cov (〈X1, ϕ〉 , 〈X1+i, ϕ〉) . (2.14)

Then Zn converges weakly to a Gaussian process N (0, K) in H, with zero mean andcovariance operator K satisfying 〈Kϕ, ϕ〉 = σ2

ϕ for each ϕ ∈ H.

In the special case when the Xi,n = Xi form a stationary sequence, the conditionssimplify considerably:

Theorem 2.46. Assume X1, X2, ...is a stationary sequence of H-r.e. with mean µ andmixing coefficient α. Let Zn = n−1/2

∑ni=1 (Xi − µ).

(i)If E(‖X1‖2+δ

)< ∞ for some δ > 0, and

∑j [α (j)]δ/(2+δ) < ∞

(ii) or if X1, X2, ...is iid and E ‖X1‖2 < ∞Then Zn converges weakly to a Gaussian process G ∼ N (0, K) in H. The distribu-

tion of G is determined by the distribution of its marginals 〈G,ϕ〉 which are N (0, σ2

ϕ

)distributed for every ϕ ∈ H where σ2

ϕ is defined in (2.14).

Let el be a complete orthonormal basis of H. Then ‖X1‖2 =∑∞

l=1 〈X1, el〉2 andhence in the iid case, it suffices to check that E ‖X1‖2 =

∑∞l=1 E

[〈X1, el〉2]

< ∞.The following theorem is stated in more general terms in Chen and White (1992).

Theorem 2.47. Let An be a random bounded linear operator from H to H and A 6= 0be a nonrandom bounded linear operator from H to H. If ‖An − A‖ → 0 in probabilityas n →∞ and Yn ⇒ Y ∼ N (0, K) in H. Then AnYn ⇒ AY ∼ N (0, AKA∗).

In Theorem 2.47, the boundedness of A is crucial. In most of our applications, A willnot be bounded and we will not be able to apply Theorem 2.47. Instead we will have tocheck the Liapunov condition (Davidson 1994) “by hand”.

Theorem 2.48. Let the array Xi,n be independent with zero mean and variance se-

quenceσ2

i,n

satisfying

∑ni=1 σ2

i,n = 1. Then∑n

i=1 Xi,nd→ N (0, 1) if

limn→∞

n∑i=1

E[|Xi,n|2+δ

]= 0 (Liapunov condition)

for some δ > 0.

27

2.5. Estimation of an operator and its adjoint

2.5.1. Estimation of an operator

In many cases of interest, an estimator of the compact operator, K, is given by a degen-erate operator of the form

Knϕ =Ln∑

l=1

al (ϕ) εl (2.15)

where εl ∈ E , al (ϕ) is linear in ϕ.Examples:1 - Covariance operator

Kϕ (τ 1) =

∫E [h (τ 1, X) h (τ 2, X)] ϕ (τ 2) dτ 2.

Replacing the expectation by the sample mean, one obtains an estimator of K :

Knϕ (τ 1) =

∫ (1

n

n∑i=1

h (τ 1, xi) h (τ 2, xi)

)ϕ (τ 2) dτ 2

=n∑

i=1

ai (ϕ) εi

with

ai (ϕ) =1

n

∫h (τ 2, xi) ϕ (τ 2) dτ 2 and εi = h (τ 1, xi) .

Note that here K is self-adjoint and the rate of convergence of Kn is parametric.2 - Conditional expectation operator

Kϕ (w) = E [ϕ (Z) |W = w] .

The kernel estimator of K with kernel ω and bandwidth cn is given by

Knϕ (w) =

∑ni=1 ϕ (zi) ω

(w−wi

cn

)

∑ni=1 ω

(w−wi

cn

)

=n∑

i=1

ai (ϕ) εi

where

ai (ϕ) = ϕ (zi) and εi =

ω

(w−wi

cn

)

∑ni=1 ω

(w−wi

cn

) .

In this case, the rate of convergence of Kn is nonparametric, see Subsection 4.1.

28

2.5.2. Estimation of the adjoint of a conditional expectation operator

Consider a conditional expectation operator as described in Example 2.3. Let K : L2Z →

L2W be such that (Kϕ) (w) = E [ϕ (Z) |W = w] and its adjoint is K∗ : L2

W → L2Z with

(Kψ) (z) = E [ψ (W ) |Z = z] . Let fZ,W , fZ (z), and fW (w) be nonparametric estimatorsof fZ,W , fZ (z), and fW (w) obtained either by kernel or sieves estimators. Assume thatK and K∗ are estimated by replacing the unknown pdfs by their estimators, that is:

Knϕ (w) =

∫fZ,W (z, w)

fZ (z)ϕ (z) dz,

(K∗)nψ (z) =

∫fZ,W (z, w)

fW (w)ψ (w) dw.

Then we have (K∗)n 6=(Kn

)∗for H = L2

Z and E = L2W . Indeed, we do not have

⟨Knϕ, ψ

⟩E

=⟨ϕ, (K∗)nψ

⟩H

. (2.16)

There are two solutions to this problem. First, we choose as space of references Hn =

L2(Rq, fZ

)and En = L2

(Rr, fW

). In which case, (K∗)n =

(Kn

)∗for Hn and En because

⟨Knϕ, ψ

⟩En

=⟨ϕ, (K∗)nψ

⟩Hn

. (2.17)

The new spaces Hn and En depend on the sample size and on the estimation procedure.Another approach consists in definingH = L2 (Rq, π) and E = L2 (Rr, ρ) where π and ρ areknown and satisfy: There exist c, c′ > 0 such that fZ (z) ≤ cπ (z) and fW (w) ≥ c′ρ (w) .Then

K∗ψ (z) =

∫fZ,W (z, w)

fW (w)

ρ (w)

π (z)ψ (w) dw

6= E [ψ (W ) |Z = z] .

In that case, (K∗)n =(Kn

)∗forH and E but the choice of π and ρ require some knowledge

on the support and the tails of the distributions of W and Z.An alternative solution to estimating K and K∗ by kernel is to estimate the spectrum

of K and to apply Mercer’s formula. Let H = L2Z and E = L2

W . The singular systemλj, φj, ψj

of K satisfies

λj = supφj ,ψj

E[φj (Z) ψj (W )

], j = 1, 2... (2.18)

subject to∥∥φj

∥∥H = 1,

⟨φj, φl

⟩H = 0, l = 1, 2, ..., j − 1,

∥∥ψj

∥∥E = 1,

⟨ψj, ψl

⟩E = 0, l =

1, 2, ..., j − 1. Assume the econometrician observes a sample wi, zi : i = 1, ..., n. To

29

estimateλj, φj, ψj

, one can either estimate (2.18) by replacing the expectation by the

sample mean or by replacing the joint pdf by a nonparametric estimator.The first approach was adopted by Darolles, Florens, and Renault (1998). Let

Hn =

ϕ : Rq → R,

∫ϕ (z)2 dFZ (z) < ∞

,

En =

ψ : Rr → R,

∫ψ (w)2 dFW (w) < ∞

where FZ and FW are the empirical distributions of Z and W. That is ‖ϕ‖2Hn

= 1n

∑ni=1 ϕ (zi)

2

and ‖ψ‖2En

= 1n

∑ni=1 ψ (wi)

2 . Darolles, Florens, and Renault (1998) propose to estimateλj, φj, ψj

by solving

λj = supφj ,ψj

1

n

n∑i=1

[φj (zi) ψj (wi)

], j = 1, 2... (2.19)

subject to∥∥∥φj

∥∥∥Hn

= 1,⟨φj, φl

⟩Hn

= 0, l = 1, 2, ..., j − 1,∥∥∥ψj

∥∥∥En

= 1,⟨ψj, ψl

⟩En

= 0,

l = 1, 2, ..., j − 1 where φj and ψj are elements of increasing dimensional spaces

φj (z) =J∑

j=1

αjaj (z) ,

ψj (w) =J∑

j=1

βjbj (w)

for some basis aj and bj. By Mercer’s formula (2.13), K can be estimated by

Knϕ (w) =∑

λj

(∫φj (z) ϕ (z) dFZ

)ψj (w)

(K∗)nψ (z) =∑

λj

(∫ψj (w) ψ (w) dFW

)φj (z) .

Hence (K∗)n =(Kn

)∗for Hn and En.

The second approach consists in replacing fZ,W by a nonparametric estimator fZ,W .Darolles, Florens, and Gourieroux (2004) use a kernel estimator, whereas Chen, Hansen

and Scheinkman (1998) use B-spline wavelets. LetHn = L2(Rq, fZ

)and En = L2

(Rr, fW

)

where fZ and fW are the marginals of fZ,W . (2.18) can be replaced

λj = supφj ,ψj

∫φj (z) ψj (w) fZ,W (z, w) dzdw, j = 1, 2... (2.20)

30

subject to∥∥φj

∥∥Hn

= 1,⟨φj, φl

⟩Hn

= 0, l = 1, 2, ..., j − 1,∥∥ψj

∥∥En

= 1,⟨ψj, ψl

⟩En

= 0,

l = 1, 2, ..., j−1. Denote

λj, φj, ψj

the resulting estimators of

λj, φj, ψj

. By Mercer’s

formula, K can be approached by

Knϕ (w) =∑

λj

(∫φj (z) ϕ (z) fZ (z) dz

)ψj (w)

(K∗)nψ (z) =∑

λj

(∫ψj (w) ψ (w) fW (w) dw

)φj (z) .

Hence (K∗)n =(Kn

)∗for Hn and En. Note that in the three articles mentioned above,

Z = Xt+1 and W = Xt where Xt is a Markov process. These papers are mainlyconcerned with estimation. When the data are the discrete observations of a diffusionprocess, the nonparametric estimations of a single eigenvalue-eigenfunction pair and ofthe marginal distribution are enough to recover a nonparametric estimate of the diffusioncoefficient. The techniques described here can also be used for testing the reversibility ofthe process Xt , see Darolles, Florens, and Gourieroux (2004).

2.5.3. Computation of eigenvalues and eigenfunctions of finite dimensional op-erators

Here, we assume that we have some estimators of K and K∗, denoted Kn and K∗n such

that Kn and K∗n have finite range and satisfy

Knϕ =Ln∑

l=1

al (ϕ) εl (2.21)

K∗nψ =

Ln∑

l=1

bl (ψ) ηl (2.22)

where εl ∈ E , ηl ∈ H, al (ϕ) is linear in ϕ and bl (ψ) is linear in ψ. Moreover the εl andηl are assumed to be linearly independent. It follows that

K∗nKnϕ =

Ln∑

l=1

bl

(Ln∑

l′=1

al′ (ϕ) εl′

)ηl

=Ln∑

l,l′=1

al′ (ϕ) bl (εl′) ηl. (2.23)

We calculate the eigenvalues and eigenfunctions of K∗nKn by solving

K∗nKnφ = λ2φ.

31

Hence φ is necessarily of the form: φ =∑

l βlηl. Replacing in (2.23), we have

λ2βl =Ln∑

l′,j=1

βjal′(ηj

)bl (εl′) . (2.24)

Denote β=[β1, ..., βLn

]the solution of (2.24). Solving (2.24) is equivalent to finding the

Ln nonzero eigenvalues λ2

1, ..., λ2

Lnand eigenvectors β 1, ...,βLn of an Ln × Ln−matrix C

with principle element

cl,j =Ln∑

l′=1

al′(ηj

)bl (εl′) .

The eigenfunctions of K∗nKn are

φj =Ln∑

l=1

βj

l ηl, j = 1, ...Ln

associated with λ2

1, ..., λ2

Ln.

φj : j = 1, .., Ln

need to be orthonormalized. The estima-

tors of the singular values are λj =

√λ

2

j .

2.5.4. Estimation of noncompact operators

This chapter mainly focuses on compact operators, because compact operators can beapproached by a sequence of finite dimensional operators and therefore can be easily esti-mated. However, it is possible to estimate a noncompact operator by an estimator, whichis infinitely dimensional. A simple example is provided by the conditional expectationoperator with common elements.

Example 2.5 (continued). This example is discussed in Hall and Horowitz (2004).Assume that the dimension of Z is p. The conditional expectation operator K can beestimated by a kernel estimator with kernel ω and bandwidth cn

(Kϕ

)(x,w) =

∑ni=1

[∫1cpnϕ (x, z) ω

(z−zi

cn

)dz

]ω

(x−xi

cn

)ω

(w−wi

cn

)

∑ni=1 ω

(x−xi

cn

)ω

(w−wi

cn

) .

We can see that K is an infinite dimensional operator because all functions ϕ (x) thatdepend only on x are in the range of K.

3. Regularized solutions of integral equations of the first kind

Let H and E be two Hilbert spaces considered only over the real scalars for the sake ofnotational simplicity. Let K be a linear operator on D(K) ⊂ H into E . This section

32

discusses the properties of operator equations (also called Fredholm equations) of the firstkind

Kϕ = r (3.1)

where K is typically an integral compact operator. Such an equation in ϕ is in generalan ill-posed problem by opposition to a well-posed problem. Equation (3.1) is said to bewell-posed if (i) (existence) a solution exists, (ii) (uniqueness) the solution is unique, and(iii) (stability) the solution is continuous in r, that is ϕ is stable with respect to smallchanges in r. Whenever one of these conditions is not satisfied, the problem is said tobe ill-posed. The lack of stability is particularly problematic and needs to be addressedby a regularization scheme. Following Wahba (1973) and Nashed and Wahba (1974), weintroduce generalized inverses of operators in Reproducing Kernel Hilbert Spaces (RKHS).Properties of RKHS will be studied more extensively in Section 6.

3.1. Ill-posed and well-posed problems

This introductory subsection gives an overview of the problems encountered when solvingan equation Kϕ = r where K is a linear operator, not necessarily compact. A moredetailed encounter can be found in Groetsch (1993). We start with a formal definition ofa well-posed problem.

Definition 3.1. Let K : H → E . The equation

Kϕ = r (3.2)

is called well-posed if K is bijective and the inverse operator K−1 : E → H is continuous.Otherwise, the equation is called ill-posed.

Note that K is injective means N (K) = 0 , and K is surjective means R (K) = E .In this section, we will restrict ourselves to the case where K is a bounded (and thereforecontinuous) linear operator. By Banach theorem (Kress, 1999, page 266), if K : H → Eis a bounded linear operator, K bijective implies that K−1 : E → H is bounded andtherefore continuous. In this case, Kϕ = r is well-posed.

An example of a well-posed problem is given by

(I − C) ϕ = r

where C : H → H is a compact operator and 1 is not an eigenvalue of K. This is anexample of integral equation of the second kind that will be studied in Section 7.

We now turn our attention to ill-posed problems.Problem of uniqueness:If N (K) 6= 0 , then to any solution of ϕ of (3.2), one can add an element ϕ1 of

N (K), so that ϕ + ϕ1 is also a solution. A way to achieve uniqueness is to look for thesolution with minimal norm.

33

Problem of existence:A solution to (3.2) exists if and only if

r ∈ R (K) .

Since K is linear, R (K) is a subspace of E , however it generally does not exhaust E .Therefore, a traditional solution of (3.2) exists only for a restricted class of functions r.If we are willing to broaden our notion of solution, we may enlarge the class of functionsr for which a type of generalized solution exists to a dense subspace of functions of E .

Definition 3.2. An element ϕ ∈ H is said to be a least squares solution of (3.2) if:

‖Kϕ− r‖ ≤ ‖Kf − r‖ , for any f ∈ H (3.3)

If the set Sr of all least squares solutions of (3.2) for a given r ∈ E is not empty andadmits an element ϕ of minimal norm, then ϕ is called a pseudosolution of (3.2).

The pseudosolution, when it exists, is denoted ϕ = K†r where K† is by definition theMoore Penrose generalized inverse of K. However, the pseudosolution does not necessarilyexist. The pseudosolution exists if and only if Pr ∈ R (K) where P is the projectionoperator on R (K), the closure of the range of K. Note that Pr ∈ R (K) if and only if

r = Pr + (1− P ) r ∈ R (K) +R (K)⊥ . (3.4)

Therefore, a pseudosolution exists if and only if r lies in the dense subspace R (K) +R (K)⊥ of E .

We distinguish two cases:1. R (K) is closed.For any r ∈ E , ϕ = K†r exists and is continuous in r.Example. (I − C) ϕ = r where K is compact and 1 is an eigenvalue of K. The

problem is ill-posed because the solution is not unique but it is not severally ill-posedbecause the pseudosolution exists and is continuous.

2. R (K) is not closed.The pseudosolution exists if and only if r ∈ R (K) +R (K)⊥ . But here, ϕ = K†r is

not continuous in r.Example. K is a compact infinitely dimensional operator.

For the purpose of econometric applications, condition (3.4) will be easy to maintainsince:

Either (K, r) denotes the true unknown population value, and then the assumption r∈ R (K) means that the structural econometric model is well-specified. Inverse problemswith specification errors are beyond the scope of this chapter.

Or (K, r) denotes some estimators computed from a finite sample of size n. Then,insofar as the chosen estimation procedure is such that R (K) is closed (for instancebecause it is finite dimensional as in Subsection 2.5.1), we have R (K) +R (K)⊥ = E .

34

The continuity assumption of K will come in general with the compacity assumptionfor population values and, for sample counterparts, with the finite dimensional prop-erty. Moreover, the true unknown value K0 of K will be endowed with the identificationassumption:

N (K0) = 0 (3.5)

and the well-specification assumption:

r0 ∈ R (K0) . (3.6)

(3.5) and (3.6) ensure the existence of a unique true unknown value ϕ0 of ϕ defined asthe (pseudo) solution of the operator equation K0ϕ0 = r0. Moreover, this solution is notgoing to depend on the choice of topologies on the two spaces H and E .

It turns out that a compact operator K with infinite-dimensional range is a prototypeof an operator for which R (K) is not closed. Therefore, as soon as one tries to generalizestructural econometric estimation from a parametric setting (K finite dimensional) toa non-parametric one, which can be seen as a limit of finite dimensional problems (Kcompact), one is faced with an ill-posed inverse problem. This is a serious issue for thepurpose of consistent estimation, since in general one does not know the true value r0

of r but only a consistent estimator rn. Therefore, there is no hope to get a consistentestimator ϕn of ϕ by solving Kϕn = rn that is ϕn = K†rn, when K† is not continuous.In general, the issue to address will be even more involved since K† and K must also beestimated.

Let us finally recall a useful characterization of the Moore-Penrose generalized inverseof K.

Proposition 3.3. Under (3.4), K†r is the unique solution of minimal norm of the equa-tion K?Kϕ = K?r.

In other words, the pseudosolution ϕ of (3.2) can be written in two ways:

ϕ = K†r = (K?K)†K?r

For r ∈ R (K) (well-specification assumption in the case of true unknown values), K?r∈ R (K?K) and then (K?K)−1 K?r is well defined. The pseudosolution can then berepresented from the singular value decomposition of K as

ϕ = K†r = (K?K)−1 K?r =∞∑

j=1

⟨r, ψj

⟩

λj

φj (3.7)

It is worth noticing that the spectral decomposition (3.7) is also valid for any r ∈ R (K)+R (K)⊥ to represent the pseudosolution ϕ = K†r = (K?K)†K?r since r ∈ R (K)⊥ isequivalent to K†r = 0.

35

Formula (3.7) clearly demonstrates the ill-posed nature of the equation Kϕ = r. Ifwe perturb the right-hand side r by rδ = r + δψj, we obtain the solution ϕδ = ϕ+ δφj/λj.

Hence, the ratio∥∥ϕδ − ϕ

∥∥ /∥∥rδ − r

∥∥ = 1/λj can be made arbitrarily large due to thefact that the singular values tend to zero. Since the influence of estimation errors in r iscontrolled by the rate of this convergence, Kress (1999, p. 280) says that the equationis “mildly ill-posed” if the singular values decay slowly to zero and that it is “severelyill-posed” if they decay rapidly. Actually, the critical property is the relative decay rateof the sequence

⟨r, ψj

⟩with respect to the decay of the sequence λj. To see this, note that

the solution ϕ has to be determined from its Fourier coefficients by solving the equations

λj

⟨ϕ, φj

⟩=

⟨r, ψj

⟩, for all j.

Then, we may expect high instability of the solution ϕ if λj goes to zero faster than⟨ϕ, φj

⟩. The properties of regularity spaces introduced below precisely document this

intuition.

3.2. Regularity spaces

As stressed by Nashed and Wahba (1974), an ill-posed problem relative to H and E maybe recast as a well-posed problem relative to new spaces H′ ⊂ H and E ′ ⊂ E , withtopologies on H′ and E ′ , which are different from the topologies on H and E respectively.While Nashed and Wahba (1974) generally build these Hilbert spaces H′ and E ′ as RKHSassociated with an arbitrary self-adjoint Hilbert-Schmidt operator, we focus here on theRKHS associated with (K?K)β, for some positive β. More precisely, assuming that K isHilbert-Schmidt and denoting (λj, φj, ψj) its singular system, we define the self-adjoint

operator (K?K)β by:

(K?K)β ϕ =∞∑

j=1

λ2βj

⟨ϕ, φj

⟩φj.

Definition 3.4. The β−regularity space of the compact operator K is defined for allβ > 0, as the RKHS associated with (K?K)β . That is, the space:

Φβ =

ϕ ∈ N (K)⊥ such that

∞∑j=1

⟨ϕ, φj

⟩2

λ2βj

< ∞

(3.8)

where a Hilbert structure is being defined through the inner product

〈f, g〉β =∞∑

j=1

⟨f, φj

⟩ ⟨g, φj

⟩

λ2βj

for f and g ∈ Φβ.

36

Note however that the construction of RKHS considered here is slightly more generalthan the one put forward in Nashed and Wahba (1974) since we start from elements of ageneral Hilbert space, not limited to be a L2 space of functions defined on some interval ofthe real line. This latter example will be made explicit in Section 6. Moreover, the focusof our interest here will only be the regularity spaces associated with the true unknownvalue K0 of the operator K. Then, the identification assumption will ensure that all theregularity spaces are dense in H :

Proposition 3.5. Under the identification assumption N (K) = 0, the sequence ofeigenfunctions

φj

associated with the non-zero singular values λj defines a Hilbert

basis of H. In particular, all the regularity spaces Φβ , β > 0, contain the vectorial spacespanned by the

φj

and, as such, are dense in H .

Proposition 3.5. is a direct consequence of the singular value decomposition (2.9).Generally speaking, when β increases, Φβ , β > 0, is a decreasing family of subspaces ofH. Hence, β may actually be interpreted as the regularity level of the functions ϕ, asillustrated by the following result.

Proposition 3.6. Under the identification assumption (N (K) = 0), for any β > 0,

Φβ = R[(K∗K)

β2

].

In particular, Φ1 = R (K∗) .

Proof. By definition, the elements of the range of (K∗K)β2 can be written f =∑∞

j=1 λβj

⟨ϕ, φj

⟩φj for some ϕ ∈ H. Note that this decomposition also describes the range

of K∗ for β = 1. Then:

‖f‖2β =

∞∑j=1

⟨ϕ, φj

⟩2

λ2βj

λ2βj =

∞∑j=1

⟨ϕ, φj

⟩2= ‖ϕ‖2 < ∞.

Hence R[(K∗K)

β2

]⊂ Φβ.

Conversely, for any ϕ ∈ Φβ, one can define:

f =∞∑

j=1

⟨ϕ, φj

⟩

λβj

φj

and then (K∗K)β2 f =

∑∞j=1

⟨ϕ, φj

⟩φj = ϕ since N (K) = 0. Hence, Φβ ⊂ R

[(K∗K)

β2

].

Since we mainly consider operators, K, which are integral operators with continuous

kernels, applying the operator (K∗K)β2 has a smoothing effect, which is stronger for

larger values of β. This is the reason why the condition ϕ ∈ Φβ qualifies the level, β

37

of regularity or smoothness of ϕ. The associated smoothness properties are studied infurther details in Loubes and Vanhems (2003). The space Φ1 of functions is also putforward in Schaumburg (2004) when K denotes the conditional expectation operator fora continuous time Markov process Xt with Levy type generator sampled in discrete time.He shows that whenever a transformation ϕ(Xt) of the diffusion process is consideredwith ϕ ∈ Φ1, the conditional expectation operator E[ϕ(Xt+h) |Xt] admits a convergentpower series expansion as the exponential of the infinitesimal generator.

The regularity spaces Φβ are of interest here as Hilbert spaces (included in H butendowed with another scalar product) where our operator equation (3.2) is going to be-come well-posed. More precisely, let us also consider the family of regularity spaces Ψβ

associated with the compact operator K∗:

Ψβ =

ψ ∈ N (K?)⊥ such that

∞∑j=1

⟨ψ, ψj

⟩2

λ2βj

< ∞

Ψβ is a Hilbert space endowed with the inner product:

Definition 3.7. 〈F, G〉β =∑∞

j=1

⟨F, ψj

⟩ ⟨G, ψj

⟩

λ2βj

for F and G ∈ Ψβ.

Note that the spaces Ψβ are not in general dense in E since N (K?) 6= 0. But theydescribe well the range of K when K is restricted to some regularity space:

Proposition 3.8. Under the identification assumptionN (K) = 0, K(Φβ ) = Ψβ+1 forall positive β. In particular, Ψ1 = R (K) .

Proof. We know from Proposition 3.6 that when ϕ ∈ Φβ , it can be written:

ϕ =∑∞

j=1 λβj

⟨f, φj

⟩φj for some f ∈ H. Then, Kϕ =

∑∞j=1 λβ+1

j

⟨f, φj

⟩ψj ∈ Ψβ+1. Hence

K(Φβ ) ⊂ Ψβ+1.Conversely, since according to a singular value decomposition like (2.9), the sequence

ψj

defines a basis of N (K?)⊥, any element of Ψβ+1 can be written as

ψ =∞∑

j=1

⟨ψ, ψj

⟩ψj with

∞∑j=1

⟨ψ, ψj

⟩2

λ2β+2j

< ∞.

Let us then define ϕ =∑∞

j=1(1/λj)⟨ψ, ψj

⟩φj .We have

∞∑j=1

⟨ϕ, φj

⟩2

λ2βj

=∞∑

j=1

⟨ψ, ψj

⟩2

λ2β+2j

< ∞

and thus ϕ ∈ Φβ . Moreover, Kϕ =∑∞

j=1

⟨ψ, ψj

⟩ψj = ψ. This proves that Ψβ+1 ⊂

K(Φβ ).

38

Therefore, when viewed as an operator from Φβ into Ψβ+1, K has a closed rangedefined by the space Ψβ+1 itself. It follows that the ill-posed problem

K : H → EKϕ = r

may be viewed as well-posed relative to the subspaces Φβ into Ψβ+1 and their associatednorms. This means that

(i) First, we think about the pseudosolution ϕ = K†r as a function of r evolving inΨβ+1, for some positive β.

(ii) Second, continuity of ϕ = K†r with respect to r must be understood with respect

to the norms ‖r‖β+1 = 〈r, r〉1/2β+1 and ‖ϕ‖β = 〈ϕ, ϕ〉1/2

β

To get the intuition of this result, it is worth noticing that these new topologies defineanother adjoint operator K?

β of K characterized by:

〈Kϕ, ψ〉β+1 =⟨ϕ,K?

βψ⟩

β,

and thus:

K?βψ =

∞∑j=1

(1/λj)⟨ψ, ψj

⟩φj.

In particular, K?βψj = (1/λj)φj. In other words, all the eigenvalues of K?

βK and KK?β

are now equal to one and the pseudosolution is defined as:

ϕ = K†βr = K?

βr =∞∑

j=1

⟨r, ψj

⟩

λj

φj.

The pseudosolution depends continuously on r because K†β = K?

β is a bounded operatorfor the chosen topologies; it actually has a unit norm.

For the purpose of econometric estimation, we may be ready to assume that the trueunknown value ϕ0 belongs to some regularity space Φβ. This just amounts to an additionalsmoothness condition about our structural functional parameter of interest. Then, we aregoing to take advantage of this regularity assumption through the rate of convergence ofsome regularization bias as characterized in the next subsection.

Note finally that assuming ϕ0 ∈ Φβ, that is r0 ∈ Ψβ+1 for some positive β, is nothingbut a small reinforcement of the common criterion of existence of a solution, known asthe Picard’s theorem (see e.g. Kress, 1999, page 279), which states that r0 ∈ Ψ1 = R (K).The spaces Φβ and Ψβ are strongly related to the concept of Hilbert scales, see Natterer(1984), Engl, Hanke, and Neubauer (1996), and Tautenhahn (1996).

3.3. Regularization schemes

As pointed out in Subsection 3.1, the ill-posedness of an equation of the first kind witha compact operator stems from the behavior of the sequence of singular values, which

39

converge to zero. This suggests trying to regularize the equation by damping the explosiveasymptotic effect of the inversion of singular values. This may be done in at least twoways:

A first estimation strategy consists in taking advantage of the well-posedness of theproblem when reconsidered within regularity spaces. Typically, a sieve kind of approachmay be designed, under the maintained assumption that the true unknown value r0 ∈ Ψβ+1

for some positive β, in such a way that the estimator rn evolves when n increases, in anincreasing sequence of finite dimensional subspaces of Ψβ+1. Note however that when the

operator K is unknown, the constraint rn ∈ N (K?)⊥ may be difficult to implement.Hence, we will not pursue this route any further.

The approach adopted in this chapter follows the general regularization frameworkof Kress (1999, Theorem 15.21). It consists in replacing a sequence

1/µj

of explosive

inverse singular values by a sequenceq(α, µj)/µj

where the damping function q(α, µ)

is chosen such that:(i) q(α, µ)/µ remains bounded when µ goes to zero (damping effect),(ii) for any given µ : Limα→0q(α, µ) = 1 (asymptotic unbiasedness).Since our inverse problem of interest can be addressed in two different ways:

ϕ = K†r = (K?K)†K?r,

the regularization scheme can be applied either to K† (µj = λj) or to (K?K)† (µj =

λ2j). The latter approach is better suited for our purpose since estimation errors will

be considered below at the level of (K?K) and K?r respectively. We maintain in thissubsection the identification assumption N (K) = 0. We then define:

Definition 3.9. A regularized version ϕα = Aα K?r of the pseudosolution ϕ = (K?K)†K?ris defined as:

ϕα =∞∑

j=1

1

λ2j

q(α, λ2

j

) ⟨K?r, φj

⟩φj =

∞∑j=1

1

λj

q(α, λ2

j

) ⟨r, ψj

⟩φj (3.9)

=∞∑

j=1

q(α, λ2

j

) ⟨ϕ, φj

⟩φj

where the real-valued function, q, is such that

|q (α, µ)| ≤ d (α) µ (3.10)

limα→0

q (α, µ) = 1.

Note that (3.9) leaves unconstrained the values of the operator Aα on the spaceR (K∗)⊥ = N (K) . However, since N (K) = 0, Aα is uniquely defined as

Aαϕ =∞∑

j=1

1

λ2j

q(α, λ2

j

) ⟨ϕ, φj

⟩φj (3.11)

40

for all ϕ ∈ H. Note that as q is real, Aα is self-adjoint. Then by (3.10), Aα is a boundedoperator from H into H with

‖Aα‖ ≤ d (α) . (3.12)

In the following, we will always normalize the exponent of the regularization parameterα such that αd (α) has a positive finite limit c when α goes to zero. By construction,AαK∗Kϕ → ϕ as α goes to zero. When a genuine solution exists (r = Kϕ), the regular-ization induces a bias:

ϕ− ϕα =∞∑

j=1

[1− q

(α, λ2

j

)] ⟨r, ψj

⟩(φj/λj) =

∞∑j=1

[1− q

(α, λ2

j

)] ⟨ϕ, φj

⟩φj (3.13)

The squared regularization bias is

‖ϕ− ϕα‖2 =∞∑

j=1

b2(α, λ2

j

) ⟨ϕ, φj

⟩2, (3.14)

where b(α, λ2

j

)= 1 − q

(α, λ2

j

)is the bias function characterizing the weight of the

Fourier coefficient⟨ϕ, φj

⟩. Below, we show that the most common regularization schemes

fulfill the above conditions. We characterize these schemes through the definitions of thedamping weights q (α, µ) or equivalently, of the bias function b (α, µ) .

Example (Spectral cut-off).The spectral cut-off regularized solution is

ϕα =∑

λ2j≥α/c

1

λj

⟨r, ψj

⟩φj.

The explosive influence of the factor (1/µ) is filtered out by imposing q (α, µ) = 0 forsmall µ , that is |µ| < α/c. α is a positive regularization parameter such that no bias isintroduced when |µ| exceeds the threshold α/c :

q (α, µ) = I |µ| ≥ α/c =

1 if |µ| ≥ α/c,0 otherwise.

For any given scaling factor c, the two conditions of Definition 3.9. are then satisfied(with d(α) = c/α) and we get a bias function b

(α, λ2

)which is maximized (equal to 1)

when λ2 < α/c and minimized (equal to 0) when λ2 ≥ α/c.

Example (Landweber-Fridman).Landweber-Fridman regularization is characterized by

Aα = c

1/α−1∑

l=0

(I − cK∗K)l K∗,

ϕα = c

1/α−1∑

l=0

(I − cK∗K)l K∗r.

41

The basic idea is similar to spectral cut-off but with a smooth bias function. Of course,one way to make the bias function continuous while meeting the conditions b (α, 0) = 1and b

(α, λ2

)= 0 for λ2 > α/c would be to consider a piecewise linear bias function

with b(α, λ2

)= 1− (c/α)λ2 for λ2 ≤ α/c . Landweber-Fridman regularization makes it

smooth, while keeping the same level and the same slope at λ2 = 0 and zero bias for large

λ2, b(α, λ2

)=

(1− cλ2

)1/αfor λ2 ≤ 1/c and zero otherwise, that is

q (α, µ) =

1 if |µ| > 1/c,

1− (1− cµ)1/α for |µ| ≤ 1/c.

For any given scaling factor c, the two conditions of Definition 3.9 are then satisfied withagain d(α) = c/α.

Example (Tikhonov regularization).Here, we have

Aα =(α

cI + K∗K

)−1

,

ϕα =(α

cI + K∗K

)−1

K∗r

=∞∑

j=1

λj

λ2j + α/c

⟨r, ψj

⟩φj

where c is some scaling factor. In contrast to the two previous examples, the bias functionis never zero but decreases toward zero at a hyperbolic rate (when λ2 becomes infinitelylarge), while still starting from 1 for λ2 = 0 :

b(α, λ2

)=

(α/c)

(α/c) + λ2 .

that is:

q(α, λ2

)=

λ2

(α/c) + λ2

For any given scaling factor c, the two conditions of Definition 3.9 are again satisfied withd(α) = c/α.

We are going to show now that the regularity spaces Φβ introduced in the previoussubsection are well-suited for controlling the regularization bias. The basic idea is astraightforward consequence of (3.15):

‖ϕ− ϕα‖2 ≤ [supj

b2(α, λ2

j

)λ2β

j ] ‖ϕ‖2β (3.15)

42

Therefore, the rate of convergence (when the regularization parameter α goes to zero) ofthe regularization bias will be controlled, for ϕ ∈ Φβ, by the rate of convergence of

Mβ(α) = supj

b2(α, λ2

j

)λ2β

j

The following definition is useful to characterize the regularization schemes.

Definition 3.10 (Geometrically unbiased regularization). A regularization schemecharacterized by a bias function b

(α, λ2

)is said to be geometrically unbiased at order

β > 0 if:

supλ2

b2(α, λ2

)λ2β = O(αβ).

Proposition 3.11. The spectral cut-off and the Landweber-Fridman regularization schemesare geometrically unbiased at any positive order β, whereas Tikhonov regularizationscheme is geometrically unbiased at all order β ∈ (0, 2]. For Tikhonov regularizationand β ≥ 2, we have

Mβ(α) = O(α2).

Proof. In the spectral cut-off case, there is no bias for λ2j > α/c while the bias is

maximum, equal to one, for smaller λ2j . Therefore:

Mβ(α) ≤ (α/c)β.

In the Landweber-Fridman case, there is no bias for λ2j > 1/c but a decreasing

bias(1− cλ2

j

)1/αfor λ2

j increasing from zero to (1/c). Therefore, Mβ(α) ≤ [Supλ2≤(1/c)(1− cλ2

)2/αλ2β]. The supremum is reached for λ2 =(β/c)[β + (2/α)]−1 and gives:

Mβ(α) ≤ (β/c)β[β + (2/α)]−β ≤ (β/2)β(α/c)β.

In the Tikhonov case, the bias decreases hyperbolically and then Mβ(α) ≤ supλ2

[ (α/c)

(α/c)+λ2 ]2 λ2β. For β < 2, the supremum is reached for λ2 = (βα/c)[2− β]−1 and thus

Mβ(α) ≤ λ2β ≤ [β/(2− β)]β(α/c)β.

As K is bounded, its largest eigenvalue is bounded. Therefore, for β ≥ 2, we have

Mβ(α) ≤ (α/c)2 supj

λ2(β−2)j .

43

Proposition 3.12. Let K : H → E be an injective compact operator. Let us assume thatthe solution ϕ of Kϕ = r lies in the β−regularity space Φβ of operator K, for some positiveβ. Then, if ϕα is defined by a regularization scheme geometrically unbiased at order β,we have

‖ϕα − ϕ‖2 = O(αβ

).

Therefore, the smoother the function ϕ of interest (ϕ ∈ Φβ for larger β) is, the fasterthe rate of convergence to zero of the regularization bias will be. However, a degree ofsmoothness larger than or equal to 2 (corresponding to the case ϕ ∈ R [(K∗K)]) may beuseless in the Tikhonov case. Indeed, for Tikhonov, we have ‖ϕα − ϕ‖2 = O

(αmin(β,2)

).

This is basically the price to pay for a regularization procedure, which is simple to imple-ment and rather intuitive (see Subsection 3.4 below) but introduces a regularization biaswhich never vanishes completely.

Both the operator interpretation and the practical implementation of smooth regular-ization schemes (Tikhonov and Landweber-Fridman) are discussed below.

3.4. Operator interpretation and implementation

In contrast to spectral cut-off, the advantage of Tikhonov and Landweber-Fridman regu-larization schemes is that they can be interpreted in terms of operators. Their algebraicexpressions only depend on the global value of (K∗K) and (K∗r), and not of the singularvalue decomposition. An attractive feature is that it implies that they can be implementedfrom the computation of sample counterparts (KnK

∗n) and (K∗

nrn) without resorting toan estimation of eigenvalues and eigenfunctions.

The Tikhonov regularization is based on

(αnI + K∗K) ϕαn= K∗r ⇔

ϕαn=

∞∑j=1

λj

λ2j + αn

⟨r, ψj

⟩φj

for a penalization term αn and λj =√

λ2j , while, for notational simplicity, the scaling

factor c has been chosen equal to 1.

The interpretation of αn as a penalization term comes from the fact that ϕα can beseen as the solution of

ϕα = arg minϕ‖Kϕ− r‖2 + α ‖ϕ‖2 = 〈ϕ,K∗Kϕ + αϕ− 2K∗r〉+ ‖r‖2 .

To see this, just compute the Frechet derivative of the above expression and note that itis zero only for K∗Kϕ + αϕ = K∗r.

This interpretation of Tikhonov regularization in terms of penalization may suggest tolook for quasi-solutions (see Kress, 1999, section 16-3), that is solutions of the minimiza-tion of ‖Kϕ− r ‖ subject to the constraint that the norm is bounded by ‖ϕ‖ ≤ ρ for

44

given ρ . For the purpose of econometric estimation, the quasi-solution may actually bethe genuine solution if the specification of the structural econometric model entails thatthe function of interest ϕ lies in some compact set (Newey and Powell, 2003).

If one wants to solve directly the first order conditions of the above minimization,it is worth mentioning that the inversion of the operator (αI + K∗K) is not directlywell-suited for iterative approaches since, typically for small α, the series expansion of[I + (1/α)K∗K]−1 does not converge. However, a convenient choice of the estimators Kn

and K∗n may allow us to replace the inversion of infinite dimensional operators by the

inversion of finite dimensional matrices.

More precisely, when Kn and K∗n can be written as in (2.21) and (2.22), one can

directly write the finite sample problem as:(αnI + K∗

nKn

)ϕ = K∗

nr ⇔

αnϕ +Ln∑

l,l′=1

al′ (ϕ) bl (εl′) ηl =Ln∑

l=1

bl (r) ηl (3.16)

1) First we compute al (ϕ) :Apply aj to (3.16):

αnaj (ϕ) +Ln∑

l,l′=1

al′ (ϕ) bl (εl′) aj (ηl) =Ln∑

l=1

bl (r) aj (ηl) (3.17)

(3.17) can be rewritten as

(αnI + A) a = b

where a =[

a1 (ϕ) a2 (ϕ) · · · aLn (ϕ)]′

, A is the Ln × Ln−matrix with principalelement

Aj,l′ =Ln∑

l=1

bl (εl′) aj (ηl)

and

b =

∑l bl (r) a1 (ηl)

...∑l bl (r) aLn (ηl)

.

2) From (3.16), we have

ϕn =1

αn

[Ln∑

l=1

bl (r) ηl −Ln∑

l,l′=1

al′ (ϕ) bl (εl′) ηl

].

45

Landweber-Fridman regularizationThe great advantage of this regularization scheme is not only that it can be written

directly in terms of quantities (K∗K) and (K∗r), but also the resulting operator problemcan be solved by a simple iterative procedure, with a finite number of steps. To get this,one has to first choose a sequence of regularization parameters, αn, such that (1/αn)is an integer and second the scaling factor c so that 0 < c < 1/ ‖K‖2. This lattercondition may be problematic to implement since the norm of the operator K may beunknown. The refinements of an asymptotic theory, that enables us to accommodate afirst step estimation of ‖K‖ before the selection of an appropriate c, is beyond the scopeof this chapter. Note however, that in several cases of interest, ‖K‖ is known a priori eventhough the operator K itself is unknown. For example, if K is the conditional expectationoperator, ‖K‖ = 1.

The advantage of the condition c < 1/ ‖K‖2 is to guarantee a unique expression for

the bias function b(α, λ2

)=

(1− cλ2

)1/αsince, for all eigenvalues, λ2 ≤ 1/c. Thus, when

(1/α) is an integer :

ϕα =∞∑

j=1

1

λj

q(α, λ2j)

⟨r, ψj

⟩φj

with

q(α, λ2

j

)= 1− (

1− cλ2j

)1/α

= cλ2j

1/α−1∑

l=0

(1− cλ2

j

)l.

Thus,

ϕα = c

1/α−1∑

l=0

∞∑j=1

λj

(1− cλ2

j

)l ⟨r, ψj

⟩φj

= c

1/α−1∑

l=0

∞∑j=1

λ2j

(1− cλ2

j

)l ⟨ϕ, φj

⟩φj

= c

1/α−1∑

l=0

(I − cK∗K)l K∗Kϕ.

Therefore, the estimation procedure will only resort to estimators of K∗K and of K∗r,without need for either the singular value decomposition nor any inversion of operators.For a given c and regularization parameter αn, the estimator of ϕ is

ϕn = c

1/αn−1∑

l=0

(I − cK∗

nKn

)l

K∗nrn.

46

ϕn can be computed recursively by

ϕl,n =(I − cK∗

nKn

)ϕl−1,n + cK∗

nrn, l = 1, 2, ..., 1/αn − 1.

starting with ϕ0,n = cK∗nrn. This scheme is known as the Landweber-Fridman iteration

(see Kress, 1999, p. 287).

3.5. Estimation bias

Regularization schemes have precisely been introduced because the right hand side rof the inverse problem Kϕ = r is generally unknown and replaced by an estimator.Let us denote by rn an estimator computed from an observed sample of size n. Asannounced in the introduction, a number of relevant inverse problems in econometrics areeven more complicated since the operator K itself is unknown. Actually, in order to applya regularization scheme, we may not need only an estimator of K but also of its adjointK∗ and of its singular system

λj, φj, ψj : j = 1, 2, ...

. In this subsection, we consider

such estimators Kn, K∗n, and

λj, φj,ψj : j = 1, ..., Ln

as given. We also maintain the

identification assumption, so that the equation Kϕ = r defines without ambiguity a trueunknown value ϕ0.

If ϕα = AαK∗r is the chosen regularized solution, the proposed estimator ϕn of ϕ0 isdefined by

ϕn = AαnK∗nrn. (3.18)

Note that the definition of this estimator involves two decisions. First, we need to select asequence (αn) of regularization parameters so that limn→∞ αn = 0 (possibly in a stochasticsense in the case of a data-driven regularization) in order to get a consistent estimatorof ϕ0. Second, for a given αn, we estimate the second order regularization scheme AαnK∗

by AαnK∗n. Generally speaking, Aαn is defined from (3.9) where the singular values are

replaced by their estimators and the inner products⟨ϕ, φj

⟩are replaced by their empirical

counterparts (see Subsection 2.5.3). Yet, we have seen above that in some cases, theestimation of the regularized solution does not involve the estimators λj but only the

estimators Kn and K∗n.

In any case, the resulting estimator bias ϕn − ϕ0 has two components:

ϕn − ϕ0 = ϕn − ϕαn+ ϕαn

− ϕ0. (3.19)

While the second component ϕαn− ϕ0 defines the regularization bias characterized in

Subsection 3.3, the first component ϕn−ϕαnis the bias corresponding to the estimation of

the regularized solution of ϕαn. The goal of this subsection is to point out a set of statistical

assumptions about the estimators Kn, K∗n, and rn that gives an (asymptotic) upper

bound to the specific estimation bias magnitude,∥∥ϕn − ϕαn

∥∥ when the regularizationbias

∥∥ϕαn− ϕ0

∥∥ is controlled.

47

Definition 3.13 (Smooth regularization). A regularization scheme is said to be smoothwhen

∥∥∥(AαnK∗

nKn − AαnK∗K)

ϕ0

∥∥∥ ≤ d (αn)∥∥∥K∗

nKn −K∗K∥∥∥

∥∥ϕαn− ϕ0

∥∥ (1 + εn) (3.20)

with εn = O(∥∥∥K∗

nKn −K∗K∥∥∥)

.

Proposition 3.14 (Estimation bias). If ϕα = AαK∗r is the regularized solution con-formable to Definition 3.9 and ϕn = AαnK∗

nrn, then

∥∥ϕn − ϕαn

∥∥ (3.21)

≤ d (αn)∥∥∥K∗

nrn − K∗nKnϕ0

∥∥∥ +∥∥∥(AαnK∗

nKn − AαnK∗K)

ϕ0

∥∥∥

In addition, both the Tikhonov and Landweber-Fridman regularization schemes are smooth.In the Tikhonov case, εn = 0 identically.

Proof.

ϕn − ϕαn= AαnK∗

nrn − AαnK∗r

= AαnK∗n

(rn − Knϕ0

)+ AαnK∗

nKnϕ0 − AαnK∗Kϕ0

Thus,

∥∥ϕn − ϕαn

∥∥ ≤ d (αn)∥∥∥K∗

nrn − K∗nKnϕ0

∥∥∥ +∥∥∥AαnK∗

nKnϕ0 − AαnK∗Kϕ0

∥∥∥ .

1) Case of Tikhonov regularization:

AαnK∗nKnϕ0 − AαnK∗Kϕ0 (3.22)

= Aαn

(K∗

nKn −K∗K)

ϕ0 +(Aαn − Aαn

)K∗Kϕ0.

Since in this case,

Aα = (αI + K∗K)−1 ,

the identity

B−1 − C−1 = B−1(C −B)C−1

gives

Aαn − Aαn = Aαn

(K∗K − K∗

nKn

)Aαn

48

and thus,(Aαn − Aαn

)K∗Kϕ0 = Aαn

(K∗K − K∗

nKn

)AαnK∗Kϕ0 (3.23)

= Aαn

(K∗K − K∗

nKn

)ϕαn

.

(3.22) and (3.23) together give

AαnK∗nKnϕ0 − AαnK∗Kϕ0

= Aαn

(K∗

nKn −K∗K) (

ϕ0 − ϕαn

),

which shows that Tikhonov regularization is smooth with εn = 0.2) Case of Landweber-Fridman regularization:In this case,

ϕα =∞∑

j=1

[1− (

1− cλ2j

)1/α]

< ϕ0, φj > φj

=[I − (I − cK∗K)1/α

]ϕ0.

Thus,

AαnK∗nKnϕ0 − AαnK∗Kϕ0

=

[(I − cK∗K)1/αn −

(I − cK∗

nKn

)1/αn]

ϕ0

+

[I −

(I − cK∗

nKn

)1/αn

(I − cK∗K)−1/αn

](I − cK∗K)1/αn ϕ0

+

[I −

(I − cK∗

nKn

)1/αn


] (ϕ0 − ϕαn

).

Then, a Taylor expansion gives:∥∥∥∥I −

(I − cK∗

nKn

)1/αn


∥∥∥∥

=

∥∥∥∥c

αn

(K∗

nKn −K∗K)∥∥∥∥ (1 + εn)

with εn = O(∥∥∥K∗

nKn −K∗K∥∥∥).

The result follows with d(α) = c/α.Note that we are not able to establish (3.20) for the spectral cut-off regularization. In

that case, the threshold introduces a lack of smoothness, which precludes a similar Taylorexpansion based argument as above.

The result of Proposition 3.14 jointly with (3.19) shows that two ingredients matter incontrolling the estimation bias ‖ϕn − ϕ0‖ . First, the choice of a sequence of regularization

49

parameters, αn, will govern the speed of convergence to zero of the regularization bias∥∥ϕαn− ϕ0

∥∥ (for ϕ0 in a given Φβ) and the speed of convergence to infinity of d (αn).Second, nonparametric estimation of K and r will determine the rates of convergence of∥∥∥K∗

nrn − K∗nKnϕ0

∥∥∥ and∥∥∥K∗

nKn −K∗K∥∥∥ .

4. Asymptotic properties of solutions of integral equations of thefirst kind

4.1. Consistency

Let ϕ0 be the solution of Kϕ = r. By abuse of notation, we denote Xn = O (cn) forpositive sequences Xn and cn, if the sequence Xn/cn is upper bounded.

We maintain the following assumptions:

A1. Kn, rn are consistent estimators of K and r.

A2.∥∥∥K∗

nKn −K∗K∥∥∥ = O

(1

an

)

A3.∥∥∥K∗

nrn − K∗nKnϕ0

∥∥∥ = O(

1bn

)

As before ϕα = AαK∗r is the regularized solution where Aα is a second order regular-ization scheme and ϕn = AαnK∗

nrn. Proposition 4.1 below follows directly from Proposi-tion 3.9 and Definition 3.6 (with the associated normalization rule αd(α) = O(1)):

Proposition 4.1. When applying a smooth regularization scheme, we get:

‖ϕn − ϕ0‖= O

(1

αnbn

+

(1

αnan

+ 1

) ∥∥ϕαn− ϕ0

∥∥)

.

Discussion on the rate of convergence:The general idea is that the fastest possible rate of convergence in probability of

‖ϕn − ϕ0‖ to zero should be the rate of convergence of the regularization bias∥∥ϕαn

− ϕ0

∥∥.Proposition 4.1 shows that these two rates of convergence will precisely coincide when therate of convergence to zero of the regularization parameter, αn, is chosen sufficiently slowwith respect to both the rate of convergence an of the sequence of approximations of thetrue operator, and the rate of convergence bn of the estimator of the right-hand side ofthe operator equation. This is actually a common strategy when both the operator andthe right-hand side of the inverse problem have to be estimated (see e.g. Vapnik (1998),corollary p. 299).

To get this, it is first obvious that αnbn must go to infinity at least as fast as∥∥ϕαn− ϕ0

∥∥−1. For ϕ0 ∈ Φβ, β > 0, and a geometrically unbiased regularization scheme,

this means that:

α2nb2

n ≥ α−βn

50

that is αn ≥ b− 2

β+2n . To get the fastest possible rate of convergence under this constraint,

we will choose:

αn = b− 2

β+2n .

Then, the rate of convergence of ‖ϕn − ϕ0‖ and∥∥ϕαn

− ϕ0

∥∥ will coincide if and only if

anb− 2

β+2n is bounded away from zero. Thus, we have proved:

Proposition 4.2. Consider a smooth regularization scheme, which is geometrically un-biased of order β > 0 with estimators of K and r conformable to Assumptions A1, A2,

A3, and anb− 2

β+2n bounded away from zero. For ϕ0 ∈ Φβ, the optimal choice of the regu-

larization parameter is αn = b− 2

β+2n , and then

‖ϕn − ϕ0‖ = O

(b− β

β+2n

).

For Tikhonov regularization, when ϕ0 ∈ Φβ, β > 0, provided anb−min( 2

β+2, 12)

n is bounded

away from zero and αn = b−min( 2

β+2, 12)

n , we have

‖ϕn − ϕ0‖ = O

(b−min( β

β+2, 12)

n

).

Note that the only condition about the estimator of the operator K∗K is that its rate

of convergence, an, is sufficiently fast to be greater than b2

β+2n . Under this condition, the

rate of convergence of ϕn does not depend upon the accuracy of the estimator of K∗K.Of course, the more regular the unknown function ϕ0 is, the larger β is and the easierit will be to meet the required condition. Generally speaking, the condition will involvethe relative bandwidth sizes in the nonparametric estimation of K∗K and K∗r. Note thatif, as it is generally the case for a convenient bandwidth choice (see e.g. subsection 5.4),bn is the parametric rate (bn =

√n), an must be at least n1/(β+2). For β not too small,

this condition will be fulfilled by optimal nonparametric rates. For instance, the optimalunidimensional nonparametric rate, n2/5, will work as soon as β ≥ 1/2.

The larger β is, the faster the rate of convergence of ϕn is. In the case where ϕ0 is afinite linear combination of

φj

(case where β is infinite), and bn =

√n, an estimator

based on a geometrically unbiased regularization scheme (such as Landweber-Fridman)achieves the parametric rate of convergence. We are not able to obtain such a fast ratefor Tikhonov, therefore it seems that if the function ϕ0 is suspected to be very regular,Landweber-Fridman is preferable to Tikhonov. However, it should be noted that the ratesof convergence in Proposition 4.2 are upperbounds and could possibly be improved upon.

51

4.2. Asymptotic normality

Asymptotic normality of

ϕn − ϕ0 = ϕn − ϕαn+ ϕαn

− ϕ0

= AαnK∗nrn − AαnK∗Kϕ0 + ϕαn

− ϕ0

can be deduced from a functional central limit theorem applied to K∗nrn − K∗

nKnϕ0.Therefore, we must reinforce Assumption A3 by assuming a weak convergence in H:

Assumption WC:

bn

(K∗

nrn − K∗nKnϕ0

)⇒ N (0, Σ) in H.

According to (3.21), (3.22), and (3.23), we have in the case of Tikhonov regularization:

bn (ϕn − ϕ0) = bnAαn

[K∗

nrn − K∗nKnϕ0

](4.1)

+bnAαn

[K∗

nKn −K∗K] (

ϕ0 − ϕαn

)(4.2)

while an additional term corresponding εn in (3.20) should be added for general regular-ization schemes. The term (4.1) can be rewritten as

Aαnξ + Aαn (ξn − ξ)

where ξ denotes the random variable N (0, Σ) in H and

ξn = bn

(K∗

nrn − K∗nKnϕ0

).

By definition:⟨Aαnξ, g

⟩∥∥∥Σ1/2Aαng

∥∥∥d→ N (0, 1)

for all g ∈ H. Then, we may hope to get a standardized normal asymptotic probabilitydistribution for

〈bn (ϕn − ϕ0) , g〉∥∥∥Σ1/2Aαng∥∥∥

for vectors g conformable to the following assumption:

Assumption G∥∥∥Aαng

∥∥∥∥∥∥Σ1/2Aαng

∥∥∥= O (1) .

52

Indeed, we have in this case:

∣∣∣⟨Aαn (ξn − ξ) , g

⟩∣∣∣∥∥∥Σ1/2Aαng

∥∥∥≤‖ξn − ξ‖

∥∥∥Aαng∥∥∥

∥∥∥Σ1/2Aαng∥∥∥

,

which converges to zero in probability because ‖ξn − ξ‖ P→ 0 by WC. We are then able toshow:

Proposition 4.3. Consider a Tikhonov regularization. Suppose Assumptions A1, A2,A3, and WC hold and ϕ0 ∈ Φβ, β > 0, with bnα

min(β/2,1)n →

n→∞0, we have for any g

conformable to G:

〈bn (ϕn − ϕ0) , g〉∥∥∥Σ1/2Aαng∥∥∥

d→ N (0, 1) .

Proof. From the proof of Proposition 3.9, we have:

⟨bn

(ϕn − ϕαn

), g

⟩

=⟨Aαnξ, g

⟩

+⟨Aαn (ξn − ξ) , g

⟩

+⟨bnAαn

[K∗

nKn −K∗K] (

ϕ0 − ϕαn

), g

⟩(4.3)

in the case of Tikhonov regularization. We already took care of the terms in ξ and ξn, itremains to deal with the bias term corresponding to (4.3):

bn

⟨Aαn

(K∗

nKn −K∗K) (

ϕ0 − ϕαn

), g


∥∥∥

≤bn

⟨(K∗

nKn −K∗K) (

ϕ0 − ϕαn

), Aαng


∥∥∥

≤ bn

∥∥∥K∗nKn −K∗K

∥∥∥∥∥ϕ0 − ϕαn

∥∥∥∥∥Aαng

∥∥∥∥∥∥Σ1/2Aαng

∥∥∥

= O

(bnα

min(β/2,1)n

an

).

53

Discussion of Proposition 4.3.(i) It is worth noticing that Proposition 4.3 does not in general deliver a weak con-

vergence result for bn (ϕn − ϕ0) because it does not hold for all g ∈ H. However, thecondition G is not so restrictive. It just amounts to assume that the multiplication byΣ1/2 does not modify the rate of convergence of Aαng.

(ii) We remark that for g = K∗Kh, Aαng and Σ1/2Aαng converge respectively to h andΣ1/2h. Moreover, if g 6= 0, Σ1/2h = Σ1/2 (K∗K)−1 g 6= 0. Therefore, in this case, not onlyis the condition G fulfilled but we get asymptotic normality with rate of convergence bn,that is typically root n. This result is conformable to the theory of asymptotic efficiencyof inverse estimators as recently developed by Van Rooij, Ruymgaart and Van Zwet(2000). They show that there is a dense linear submanifold of functionals for which theestimators are asymptotically normal at the root n rate with optimal variance (in thesense of minimum variance in the class of the moment estimators). We do get optimalvariance in Proposition 4.3 since in this case (using heuristic notations as if we were infinite dimension) the asymptotic variance is:

limn→∞

g′AαnΣAαn

= g′ (K∗K)−1 Σ (K∗K)−1 g.

Moreover, we get this result in particular for any nonzero g in R (K∗K) while weknow that R (K∗) is dense in H (identification condition). Generally speaking, VanRooij, Ruymgaart and Van Zwet (2000) stress that the inner products do not convergeweakly for every vector g in H at the same rate, if they converge at all.

(iii) The condition bnαmin(β/2,1)n → 0 imposes a convergence to zero of the regularization

coefficient αn faster than the rate αn = b−min( 2

β+2, 12)

n required for the consistency. Thisstronger condition is needed to show that the regularization bias multiplied by bn convergesto zero. A fortiori, the estimation bias term vanishes asymptotically.

The results of Proposition 4.3 are established under strong assumptions: convergencein H and restriction on g. An alternative method consists in establishing the normalityof ϕn by the Liapunov condition (Davidson, 1994), see the example on deconvolution inSection 5 below.

5. Applications

A well-known example is that of the kernel estimator of the density. Indeed, the estimationof the pdf f of a random variable X can be seen as solving an integral equation of thefirst kind

Kf (x) =

∫ +∞

−∞I (u ≤ x) f (u) du = F (x) (5.1)

where F is the cdf of X. Applying the Tikhonov regularization to (5.1), one obtains akernel estimator of f . This example is detailed in Hardle and Linton (1994) and in Vapnik(1998, pages 308-311) and will not be discussed further.

54

This section reviews the standard examples of the ridge regression and factor modelsand less standard examples such as the regression with an infinite number of regressors,the deconvolution and the instrumental variable estimation.

5.1. Ridge regression

The Tikhonov regularization discussed in Section 3 can be seen as an extension of thewell-known ridge regression. The ridge regression was introduced by Hoerl and Kennard(1970). It was initially motivated by the fact that in the presence of near multicollinearityof the regressors, the least squares estimator may vary dramatically as the result of a smallperturbation in the data. The ridge estimator is more stable and may have a lower riskthan the conventional least squares estimator. For a review of this method, see Judge,Griffiths, Hill, Lutkepohl, and Lee (1980) and for a discussion in the context of inverseproblems, see Ruymgaart (2001).

Consider the linear model (the notation of this paragraph is specific and correspondsto general notations of linear models):

y = Xθ + ε (5.2)

where y and ε are n×1−random vectors, X is a n×q matrix of regressors of full rank, andθ is an unknown q × 1−vector of parameters. The number of explanatory variables, q, isassumed to be constant and q < n. Assume that X is exogenous and all the expectationsare taken conditionally on X. The classical least-squares estimator of θ is

θ = (X ′X)−1

X ′y.

There exists an orthogonal transformation such that X ′X/n = P ′DP with

D =

µ1 0. . .

0 µq

,

55

µj > 0, and P ′P = Iq. Using the mean square error as measure of the risk, we get

E∥∥∥θ − θ

∥∥∥2

= E∥∥∥(X ′X)

−1(X ′ (Xθ + ε)− θ)

∥∥∥2

= E∥∥∥(X ′X)

−1X ′ε

∥∥∥2

= E(ε′X (X ′X)

−2X ′ε

)

= σ2trace(X (X ′X)

−2X ′

)

=σ2

ntrace

((X ′X

n

)−1)

=σ2

ntrace

(P ′D−1P

)

=σ2

n

q∑j=1

1

µj

.

If some of the columns of X are closely collinear, the eigenvalues may be very small andthe risk very large. Moreover, when the number of regressors is infinite, the risk is nolonger bounded.

A solution is to use the ridge regression estimator:

θa = arg minθ‖y −Xθ‖2 + a ‖θ‖2

⇒ θa = (aI + X ′X)−1

X ′y

for a > 0. We prefer to introduce α = a/n and define

θα =

(αI +

X ′Xn

)−1X ′yn

. (5.3)

This way, the positive constant α corresponds to the regularization parameter introducedin earlier sections.

The estimator θα is no longer unbiased. We have

θα = E(θα

)=

(αI +

X ′Xn

)−1X ′X

nθ.

Using the fact that A−1 −B−1 = A−1 [B − A] B−1. The bias can be rewritten as

θα − θ =

(αI +

X ′Xn

)−1X ′X

nθ −

(X ′X

n

)−1X ′X

nθ

= α

(αI +

X ′Xn

)−1

θ.

56

The risk becomes

E∥∥∥θα − θ

∥∥∥2

= E∥∥∥θα − θα

∥∥∥2

+ ‖θα − θ‖2

= E

∥∥∥∥∥(

αI +X ′X

n

)−1X ′εn

∥∥∥∥∥

2

+ α2

∥∥∥∥∥(

αI +X ′X

n

)−1

θ

∥∥∥∥∥

2

= E

(ε′Xn

(αI +

X ′Xn

)−2X ′εn

)+ α2

∥∥∥∥∥(

αI +X ′X

n

)−1

θ

∥∥∥∥∥

2

=σ2

ntrace

((αI +

X ′Xn

)−2X ′X

n

)+ α2

∥∥∥∥∥(

αI +X ′X

n

)−1

θ

∥∥∥∥∥

2

=σ2

n

q∑j=1

µj(α + µj

)2 + α2

q∑j=1

((Pθ)j

)2

(α + µj

)2 .

There is the usual trade-off between the variance (decreasing in α) and the bias (increasingin α). For each θ and σ2, there is a value of α for which the risk of θα is smaller than

that of θ. As q is finite, we have E∥∥∥θα − θα

∥∥∥2

∼ 1/n and ‖θα − θ‖2 ∼ α2. Hence, the

MSE is minimized for αn ∼ 1/√

n. Let us compare this rate with that necessary to theasymptotic normality of θα. We have

θα − θ = −α

(αI +

X ′Xn

)−1

θ +

(αI +

X ′Xn

)−1X ′εn

.

Therefore, if X and ε satisfy standard assumptions of stationarity and mixing, θα is

consistent as soon as αn goes to zero and√

n(θα − θ

)is asymptotically centered normal

provided αn = o (1/√

n) , which is a faster rate than that obtained in the minimization ofthe MSE. Data-dependent methods for selecting the value of α are discussed in Judge etal. (1980).

Note that the ridge estimator (5.3) is the regularized inverse of the equation

y = Xθ, (5.4)

where obviously θ is overidentified as there are n equations for q unknowns. Let H be Rq

endowed with the euclidean norm and E be Rn endowed with the norm, ‖v‖2 = v′v/n.Define K : H → E such that Ku = Xu for any u ∈ Rq. Solving 〈Ku, v〉 = 〈u,K∗v〉, wefind the adjoint of K, K∗ : E → H where K∗v = X ′v/n for any v ∈ Rn. The Tikhonovregularized solution of (5.4) is given by

θα = (αI + K∗K)−1 K∗y,

which corresponds to (5.3). It is also interesting to look at the spectral cut-off reg-ularization. Let P1, P2, ..., Pq be the orthonormal eigenvectors of the q × q matrix

57

K∗K = X ′X/n and Q1, Q2, ..., Qn be the orthonormal eigenvectors of the n×n matrixKK∗ = XX ′/n. Let λj =

√µj. Then the spectral cut-off regularized estimator is

θα =∑

λj≥α

1

λj

〈y,Qj〉Pj =∑

λj≥α

1

λj

y′Qj

nPj.

A variation on the spectral cut-off consists in keeping the l largest eigenvalues to obtain

θl =l∑

j=1

1

λj

y′Qj

nPj.

We will refer to this method as truncation. A forecast of y is given by

y = Kθl =l∑

j=1

y′Qj

nQj. (5.5)

Equation (5.5) is particularly interesting for its connection with forecasting using factorsdescribed in the next subsection.

5.2. Principal components and factor models

Let Xit be the observed data for the ith cross-section unit at time t, with i = 1, 2, ..., qand t = 1, 2, ..., T. Consider the following dynamic factor model

Xit = δ′iFt + eit (5.6)

where Ft is an l × 1 vector of unobserved common factors and δi is the vector of factorloadings associated with Ft. The factor model is used in finance, where Xit represents thereturn of asset i at time t, see Ross (1976). Here we focus on the application of (5.6) toforecasting a single time series using a large number of predictors as in Stock and Watson(1998, 2002), Forni and Reichlin (1998), and Forni, Hallin, Lippi, and Reichlin (2000).Stock and Watson (1998, 2002) consider the forecasting equation

yt+1 = β′Ft + εt+1

where yt is either the inflation or the industrial production and Xt in (5.6) comprises 224macroeconomic time-series. If the number of factors l is known, then Λ = (δ1, δ2, ..., δq)and F = (F1, F2, ..., FT )′ can be estimated from

minΛ,F

1

qT

q∑i=1

T∑t=1

(Xit − δ′iFt)2

(5.7)

58

under the restriction F ′F/T = I. The F solution of (5.7) are the eigenvectors of XX ′/Tassociated with the l largest eigenvalues. Hence F = [Q1 | ... | Ql] where Qj is jth eigen-vector of XX ′/T. Using the compact notation y = (y2, ..., yT+1)

′, a forecast of y is givenby

y = Fβ

= F (F ′F )−1

F ′y

= FF ′yT

=l∑

j=1

Q′jy

TQj.

We recognize (5.5). It means that forecasting using a factor model (5.6) is equivalent toforecasting Y from (5.4) using a regularized solution based on the truncation. The onlydifference is that in the factor literature, it is assumed that there exists a fixed number ofcommon factors, whereas in the truncation approach (5.5), the number of factors growswith the sample size. This last assumption may seem more natural when the number ofexplanatory variables, q goes to infinity.

An important issue in factor analysis is the estimation of the number of factors. Stockand Watson (1998) propose to minimize the MSE of the forecast. Bai and Ng (2002)propose various BIC and AIC criterions that permit us to consistently estimate the numberof factors, even when T and q go to infinity.

5.3. Regression with an infinity of regressors

Consider the following model

Y =

∫Z (τ) ϕ (τ) Π (dτ) + U (5.8)

where Z is uncorrelated with U and may include lags of Y , Π is a known measure (possiblywith finite or discrete support). One observes (yi, zi (τ))i=1,...,n .

First approach: Ridge regression(5.8) can be rewritten as

y1...

yn

=

∫z1 (τ) ϕ (τ) Π (dτ)

...∫zn (τ) ϕ (τ) Π (dτ)

+

u1...

un

or equivalently

y = Kϕ + u

59

where the operator K is defined in the following manner

K : L2 (Π) → Rn

Kϕ =

∫z1 (τ) ϕ (τ) Π (dτ)

...∫zn (τ) ϕ (τ) Π (dτ)

.

As is usual in the regression, the error term u is omitted and we solve

Kϕ = y

using a regularized inverse

ϕα = (K∗K + αI)−1 K∗y. (5.9)

As an exercise, we compute K∗ and K∗K. To compute K∗, we solve

〈Kϕ,ψ〉 = 〈ϕ,K∗ψ〉for ψ = (ψ1, ..., ψn) and we obtain

(K∗y) (τ) =1

n

n∑i=1

yizi (τ) ,

K∗Kϕ (τ) =

∫1

n

n∑i=1

zi (τ) zi (s) ϕ (s) Π (ds) .

The properties of the estimator (5.9) are further discussed in Van Rooij, Ruymgaart andVan Zwet (2000).

Second approach: moment conditionsAlternatively, (5.8) can be rewritten as

E [Y − 〈Z,ϕ〉 |Z (τ)] = 0 for all τ in the support of Π

Replacing the conditional moments by unconditional moments, we have

E [Y Z (τ)− 〈Z, ϕ〉Z (τ)] = 0 ⇐⇒∫E [Z (τ) Z (s)] ϕ (s) Π (ds) = E [Y Z (τ)] ⇐⇒

Tϕ = r. (5.10)

The operator T can be estimated by Tn, the operator with kernel1

n

∑ni=1 zi (τ) zi (s) and

rF can be estimated by rn (τ) =1

n

∑ni=1 yizi (τ) . Hence (5.10) becomes

Tnϕ = rn, (5.11)

60

which is equal to

K∗Kϕ = K∗y.

If one preconditions (5.11) by applying the operator T ∗n , one gets the solution

ϕn =(αI + T ∗

n Tn

)T ∗

n rn (5.12)

which differs from the solution (5.9). When α goes to zero at an appropriate rate of con-vergence (different in both cases), the solutions of (5.9) and (5.12) will be asymptoticallyequivalent. Actually, the preconditioning by an operator in the Tikhonov regulariza-tion has the purpose of constructing an operator which is positive self-adjoint. BecauseTn = K∗K is already positive self-adjoint, there is no reason to precondition here. Some-times preconditioning more than necessary is aimed at facilitating the calculations (seeRuymgaart, 2001).

Using the results of Section 4, we can establish the asymptotic normality of ϕn definedin (5.12).

Assuming thatA1 - ui has mean zero and variance σ2 and is uncorrelated with zi (τ) for all τA2 - uizi (.) is an iid process of L2 (Π) .A3 - E ‖uizi (.)‖2 < ∞.we have(i)

∥∥∥T 2n − T 2

∥∥∥ = O(

1√n

)

(ii)√

n(Tnrn − T 2

nϕ0

)⇒ N (0, Σ) in L2 (Π) .

(i) is straightforward. (ii) follows from

rn − Tnϕ0 =1

n

n∑i=1

yizi (τ)−∫

1

n

n∑i=1

zi (τ) zi (s) ϕ0 (s) Π (ds)

=1

n

n∑i=1

uizi (τ) .

Here, an =√

n and bn =√

n. Under Assumptions A1 to A3, we have

1√n

n∑i=1

uizi (τ) ⇒ N (0, σ2T

)

in L2 (Π) by Theorem 2.46. Hence

√n

(Tnrn − T 2

nϕ0

)⇒ N (

0, σ2T 3).

Let us rewrite Condition G in terms of the eigenvalues λj and eigenfunctions φj of T :∥∥∥(T 2 + αnI)

−1g∥∥∥

2

∥∥T 3/2 (T 2 + αnI)−1 g∥∥2 = O (1)

61

⇔∑∞

j=1

〈g,φj〉2(λ2

j+α)2

∑∞j=1

λ3j〈g,φj〉2(λ2

j+α)2

= O (1) .

Obviously condition G will not be satisfied for all g in L2 (Π) .

By Proposition 4.3, assuming that ϕ0 ∈ Φβ, 0 < β < 2 and√

nαβ/2n → 0, we have for

g conformable with Condition G,

〈√n (ϕn − ϕ0) , g〉∥∥T 3/2 (T 2 + αnI)−1 g∥∥

d→ N (0, 1) .

The asymptotic variance is given by

∥∥T−1/2g∥∥2

=∞∑

j=1

⟨g, φj

⟩2

λj

.

Whenever it is finite, that is whenever g ∈ R (T−1/2

), 〈(ϕn − ϕ0) , g〉 converges at the

parametric rate.

A related but different model from (5.8) is the Hilbertian autoregressive model:

Xt = ρ (Xt−1) + εt (5.13)

where Xt and εt are random elements in a Hilbert space and ρ is a compact linear operator.The difference between (5.13) and (5.8) is that in (5.8), Y is a random variable and notan element of a Hilbert space. Bosq (2000) proposes an estimator of ρ and studies itsproperties.

Kargin and Onatski (2004) are interested in the best prediction of the interest ratecurve. They model the forward interest rate Xt (τ) at maturity τ by (5.13) where ρ is aHilbert-Schmidt integral operator:

(ρf) (τ) =

∫ ∞

0

ρ (τ , s) f (s) ds. (5.14)

The operator ρ is identified from the covariance and cross-covariance of the process Xt. LetΓ11 be the covariance operator of random curve Xt and Γ12 the cross-covariance operatorof Xt and Xt+1. For convenience, the kernels of Γ11 and Γ12 are denoted using the samenotation. Equations (5.13) and (5.14) yield

Γ12 (τ 1, τ 2) = E [Xt+1 (τ 1) Xt (τ 2)]

= E

[∫ρ (τ 1, s) Xt (s) Xt (τ 2) ds

]

=

∫ρ (τ 1, s) Γ11 (s, τ 2) ds.

62

Hence,

Γ12 = ρΓ11.

Solving this equation requires a regularization because Γ11 is compact. Interestingly,Kargin and Onatski (2004) show that the best prediction of the interest rate curve infinite sample is not necessarily provided by the eigenfunctions of Γ11 associated withthe largest eigenvalues. It means that the spectral cut-off does not provide satisfactorypredictions in small samples. They propose a better predictor of the interest rate curve.

Continuous-time models have been extensively studied by Le Breton (1994) and Bergstrom(1988).

5.4. Deconvolution

Assume we observe iid realizations y1, ..., yn of a random variable Y with unknown pdf h,where Y satisfies

Y = X + Z

where X and Z are independent random variables with pdf ϕ and g respectively. Theaim is to get an estimator of ϕ assuming g is known. This problem consists in solving inϕ the equation:

h (y) =

∫g (y − x) ϕ (x) dx. (5.15)

(5.15) is an integral equation of the first kind where the operator K defined by (Kϕ) (y) =∫g (y − x) ϕ (x) dx has a known kernel and need not be estimated. Recall that the com-

pactness property depends on the space of reference. If we define as space of reference,L2 with respect to Lebesgue measure, then K is not a compact operator and hence has acontinuous spectrum. However, for a suitable choice of the reference spaces, K becomescompact. The most common approach to solving (5.15) is to use a deconvolution ker-nel estimator, this method was pioneered by Carroll and Hall (1988) and Stefanski andCarroll (1990). It is essentially equivalent to inverting Equation (5.15) by means of thecontinuous spectrum of K, see Carroll, Van Rooij, and Ruymgaart (1991) and Subsection5.4.2 below. In a related paper, Van Rooij and Ruymgaart (1991) propose a regularizedinverse to a convolution problem of the type (5.15) where g has the circle for support.They invert the operator K using its continuous spectrum.

5.4.1. A new estimator based on Tikhonov regularization

The approach of Carrasco and Florens (2002) consists in defining two spaces of reference,L2

πX(R) and L2

πY(R) as

L2πX

(R) =

φ (x) such that

∫φ (x)2 πX (x) dx < ∞

,

L2πY

(R) =

ψ (y) such that

∫ψ (y)2 πY (y) dy < ∞

,

63

so that K is a Hilbert-Schmidt operator from L2πX

(R) to L2πY

(R), that is the followingcondition is satisfied

∫ ∫ (πY (y) g (y − x)

πY (y) πX (x)

)2

πY (y) πX (x) dxdy < ∞.

As a result K has a discrete spectrum for these spaces of reference. Letλj, φj, ψj

denote

its singular value decomposition. Equation (5.15) can be approximated by a well-posedproblem using Tikhonov regularization

(αnI + K∗K) ϕαn= K∗h.

Hence we have

ϕαn(x) =

∞∑j=1

1

αn + λ2j

⟨K∗h, φj

⟩φj (x)

=∞∑

j=1

1

αn + λ2j

⟨h,Kφj

⟩φj (x)

=∞∑

j=1

λj

αn + λ2j

⟨h, ψj

⟩φj (x)

=∞∑

j=1

λj

αn + λ2j

E[ψj (Yi) πY (Yi)

]φj (x) .

The estimator of ϕ is obtained by replacing the expectation by a sample mean:

ϕn =1

n

n∑i=1

∞∑j=1

λj

αn + λ2j

ψj (yi) πY (yi) φj (x) .

Note that we avoided estimating h by a kernel estimator. In some cases, ψj and φj are

known. For instance, if Z ∼ N (0, σ2), πY (y) = φ (y/τ) and πX (x) = φ(x/√

τ 2 + σ2)

then ψj and φj are Hermite polynomials associated with λj = ρj. When ψj and φj areunknown, they can be estimated via simulations. Since one can do as many simulationsas one wishes, the error due to the estimation of ψj and φj can be considered negligible.

Using the results of Section 3, one can establish the rate of convergence of ‖ϕn − ϕ0‖ .Assume that ϕ0 ∈ Φβ, 0 < β < 2, that is

∞∑j=1

⟨ϕ, φj

⟩2

λ2βj

< ∞.

We have∥∥ϕαn

− ϕ0

∥∥ = O(α

β/2n

)and

∥∥ϕn − ϕαn

∥∥ = O (1/ (αn

√n)) as here bn =

√n. For

an optimal choice of αn = Cn−1/(β+2), ‖ϕn − ϕ0‖2 is O(n−β/(β+2)

). The mean integrated

64

square error (MISE) defined as E ‖ϕn − ϕ0‖2 has the same rate of convergence. Fan (1993)provides the optimal rate of convergence for a minimax criterion on a Lipschitz class offunctions. The optimal rate of the MISE when the error term is normally distributed isonly (ln n)−2 if ϕ is twice differentiable. On the contrary, here we get an arithmetic rateof convergence. The condition ϕ0 ∈ Φβ has the effect of reducing the class of admissiblefunctions and hence improves the rate of convergence. Which type of restriction doesϕ0 ∈ Φβ impose? In Carrasco and Florens (2002), it is shown that ϕ0 ∈ Φ1 is satisfied if

∫ ∣∣∣∣ψϕ0

(t)

ψg (t)

∣∣∣∣ dt < ∞ (5.16)

where ψϕ0and ψg denote the characteristic functions of ϕ0 and g respectively. This con-

dition can be interpreted as the noise is “smaller” than the signal. Consider for examplethe case where ϕ0 and g are normal. Condition (5.16) is equivalent to the fact that thevariance of g is smaller than that of ϕ0. Note that the condition ϕ0 ∈ Φ1 relates ϕ0 andg while one usually imposes restrictions on ϕ0 independently of those on g.

5.4.2. Comparison with the deconvolution kernel estimator

Let L2λ(R) be the space of square-integrable functions with respect to Lebesgue measure

on R. Let F denote the Fourier transform operator from L2λ(R) into L2

λ(R) defined by

(Fq) (s) =1√2π

∫eisxq (x) dx.

F satisfies that F ∗ = F−1. We see that

F (g ∗ f) = φgFf

so that K admits the following spectral decomposition (see Carroll, van Rooij and Ruym-gaart, 1991, Theorem 3.1.):

K = F−1MφgF

where Mρ is the multiplication operator Mρϕ = ρϕ.

K∗K = F−1M|φg|2F.

We want to solve in f the equation:

K∗Kf = K∗h.

Let us denote

q (x) = (K∗h) (x) =

∫g (y − x) h (y) dy.

65

Then,

q (x) =1

n

n∑i=1

g (yi − x)

is a√

n−consistent estimator of q.Using the spectral cut-off regularized inverse of K∗K, we get

f = F−1M 1

|φg |2|φg|>αF q

Using the change of variables u = yi − x, we have

(F q) (t) =1

n

n∑i=1

∫eitxg (yi − x) dx

=1

n

n∑i=1

∫eit(yi−u)g (u) du

=1

n

n∑i=1

φg (t)eityi .

f (x) =1

2π

∫e−itxI

∣∣φg (t)∣∣ > α

1∣∣φg (t)∣∣2 (F q) (t) dt

=1

2π

1

n

n∑i=1

∫e−it(yi−x)I

∣∣φg (t)∣∣ > α

1

φg (t)dt.

Assuming that φg > 0 and strictly decreasing as |t| goes to infinity, we have I∣∣φg (t)

∣∣ > α

=I −A ≤ t ≤ A for some A > 0 so that

f (x) =1

2π

1

n

n∑i=1

∫ A

−A

e−it(yi−x)

φg (t)dt.

Now compare this expression with the kernel estimator (see e.g. Stefanski and Carroll,1990). For a smoothing parameter c and a kernel ω, the kernel estimator is given by

fk(x) =1

nc

n∑i=1

1

2π

∫φω (u)

φg (u/c)eiu(yi−x)/cdu. (5.17)

Hence f coincides with the kernel estimator when φω (u) = I[−1,1] (u). This is the sinckernel corresponding to ω (x) = sin c (x) = sin (x) /x. This suggests that the kernel es-timator is obtained by inverting an operator that has a continuous spectrum. Becausethis spectrum is given by the characteristic function of g, the speed of convergence of the

66

estimator depends on the behavior of φg in the tails. For a formal exposition, see Carrollet al (1991, Example 3.1.). They assume in particular that the function to estimate isp differentiable and they obtain a rate of convergence (as a function of p) that is of thesame order as the rate of the kernel estimator.

By using the Tikhonov regularization instead of the spectral cut-off, we obtain

f(y) =1

n

n∑i=1

∫φg (t)∣∣φg (t)

∣∣2 + αe−itxieitydt.

We apply a change of variable u = −t,

f(y) =1

n

n∑i=1

1

2π

∫φg (u)∣∣φg (u)

∣∣2 + αeiu(xi−y)du. (5.18)

The formulas (5.18) and (5.17) differ only by the way the smoothing is applied.

5.5. Instrumental variables

This example is mainly based on Darolles, Florens and Renault (2002).An economic relationship between a response variable Y and a vector Z of explanatory

variables is often represented by an equation:

Y = ϕ (Z) + U , (5.19)

where the function ϕ(.) defines the parameter of interest while U is an error term. Therelationship (5.19) does not characterize the function ϕ if the residual term is not con-strained. This difficulty is solved if it is assumed that E[U | Z] = 0, or if equivalentlyϕ (Z) = E[Y | Z]. However in numerous structural econometric models, the conditionalexpectation function is not the parameter of interest. The structural parameter is a rela-tion between Y and Z where some of the Z components are endogenous. This is the casein various situations: simultaneous equations, error-in-variables models, and treatmentmodels with endogenous selection etc.

The first issue is to add assumptions to Equation (5.19) in order to characterize ϕ. Twogeneral strategies exist in the literature, at least for linear models. The first one consistsin introducing some hypotheses on the joint distribution of U and Z (for example on thevariance matrix). The second one consists in increasing the vector of observables from(Y, Z) to (Y, Z, W ), where W is a vector of instrumental variables. The first approachwas essentially followed in the error-in-variables models and some similarities exist withthe instrumental variables model (see e.g. Malinvaud (1970, ch. 9), Florens, Mouchart,Richard (1974) or Florens, Mouchart, Richard (1987) for the linear case). Instrumentalvariable analysis as a solution to an endogeneity problem was proposed by Reiersol (1941,1945), and extended by Theil (1953), Basmann (1957), and Sargan (1958).

However, even in the instrumental variables framework, the definition of the functionalparameter of interest remains ambiguous in the general nonlinear case. Three possible

67

definitions of ϕ have been proposed (see Florens, Heckman, Meghir and Vytlacil (2002) fora general comparison between these three concepts and their extensions to more generaltreatment models).i) The first one replaces E[U | Z] = 0 by E[U | W ] = 0, or equivalently it defines ϕ asthe solution of

E[Y − ϕ (Z) | W ] = 0. (5.20)

This definition was the foundation of the analysis of simultaneity in linear models orparametric nonlinear models (see Amemiya (1974)), but its extension to the nonparamet-ric case raises new difficulties. The focus of this subsection is to show how to address thisissue in the framework of ill-posed inverse problems. A first attempt was undertaken byNewey and Powell (2003), who prove consistency of a series estimator of ϕ in Equation(5.20). Florens (2003) and Blundell and Powell (2003) consider various nonparametricmethods for estimating a nonlinear regression with endogenous regressors. Darolles, Flo-rens, and Renault (2002) prove both the consistency and the asymptotic distribution ofa kernel estimator of ϕ. Hall and Horowitz (2004) give the optimal rate of convergenceof the kernel estimator under conditions which differ from those of Darolles, Florens, andRenault (2002). Finally, Blundell, Chen, and Kristensen (2003) propose a sieves estimatorof the Engel curve.

ii) A second approach is now called control function approach and was systematized byNewey, Powell, and Vella (1999). This technique was previously developed in specificmodels (e.g. Mills ratio correction in some selection models for example). The startingpoint is to compute E[Y | Z, W ] which satisfies:

E[Y | Z, W ] = ϕ (Z) + h(Z,W ), (5.21)

where h(Z, W ) = E[U | Z, W ]. Equation (5.21) does not characterize ϕ. However wecan assume that there exists a function V (the control function) of (Z,W ) (typicallyZ − E[Z | W ]), which captures all the endogeneity of Z in the sense that E[U | W,V ] =E[U | V ] = h (V ). This implies that (5.21) may be rewritten as

E[Y | Z,W ] = ϕ (Z) + h(V ), (5.22)

and under some conditions, ϕ may be identified from (5.22) up to an additive constantterm. This model is an additive model where the V are not observed but are estimated.iii) A third definition follows from the literature on treatment models (see e.g. Imbens,Angrist (1994), Heckman, Ichimura, Smith, Todd (1998) and Heckman, Vytlacil (2000)).We extremely simplify this analysis by considering Z and W as scalars. Local instrumentis defined by ∂E[Y |W ]

∂W/∂E[Z|W ]

∂W, and the function of interest ϕ is assumed to be characterized

by the relation:

∂E[Y |W ]∂W

∂E[Z|W ]∂W

= E

[∂ϕ

∂Z| W

]. (5.23)

68

Let us summarize the arguments, which justify Equation (5.23).Equation (5.19) is extended to a non separable model

Y = ϕ (Z) + Zε + U (5.24)

where ε and U are two random noises.First, we assume that

E(U |W ) = E (ε|W ) = 0

This assumption extends the instrumental variable assumption but is not sufficient toidentify the parameter of interest ϕ. From (5.24) we get:

E (Y |W = w) =

∫[ϕ (z) + zr (z, w)] fZ (z|w) dz

where fZ (.|.) denote the conditional density of Z given W and r (z, w) = E (ε|Z = z,W = w) .Then

∂

∂wE (Y |W = w) =

∫ϕ (z)

∂

∂wfZ (z|w) dz +

∫z

∂

∂wr (z, w) fZ (z|w) dz

+

∫zr (z, w)

∂

∂wfZ (z|w) dz.

We assume that the order of integration and derivative may commute (in particular theboundary of the distribution of Z given W = w does not depends on w).

Second, we introduce the assumption that V = Z −E (Z|W ) is independent of W. Interms of density, this assumption implies that fZ (z|w) = f (z −m (w)) where m (w) =E (Z|W = w) and f is the density of v. Then:

∂

∂wE (Y |W = w) = −∂m (w)

∂w

∫ϕ (z)

∂

∂zfZ (z|w) dz

+

∫z

∂

∂wr (z, w) fZ (z|w) dz

− ∂m (w)

∂w

∫zr (z, w)

∂

∂zfZ (z|w) dz

An integration by parts of the first and the third integrals gives

∂

∂wE (Y |W = w) =

∂m (w)

∂w

∫∂

∂zϕ(z)fZ (z|w) dz

+

∫z

(∂r

∂w+

∂m

∂w

∂r

∂z

)fZ (z|w) dz

+∂m (w)

∂w

∫r (z, w) fZ (z|w) dz

The last integral is zero under E (ε|w) = 0. Finally, we need to assume that the second in-tegral is zero. This is true in particular if there exists r such that r (z, w) = r (z −m (w)) .

69

Hence, Equation (5.23) is verified.

These three concepts are identical in the linear normal case but differ in general.We concentrate our presentation in this chapter on the pure instrumental variable casesdefined by equation (5.20).

For a general approach of Equation (5.20) in terms of inverse problems, we introducethe following notation:K : L2

F (Z) → L2F (W ) ϕ → Kϕ = E[ϕ (Z) | W ],

K∗ : L2F (W ) → L2

F (Z) ψ → K∗ψ = E[ψ (W ) | Z].

All these spaces are defined relatively to the true (unknown) DGP. The two linear oper-ators K and K∗ satisfy:

〈ϕ (Z) , ψ (W )〉 = E[ϕ (Z) ψ (W )] = 〈Kϕ (W ) , ψ (W )〉L2F (W ) = 〈ϕ (Z) , K∗ψ (Z)〉L2

F (Z).

Therefore, K∗ is the adjoint operator of K, and reciprocally. Using these notations, theunknown instrumental regression ϕ corresponds to any solution of the functional equation:

A(ϕ, F ) = Kϕ− r = 0, (5.25)

where r (W ) = E[Y | W ].

In order to illustrate this construction and the central role played by the adjointoperator K∗, we first consider the example where Z is discrete, namely Z is binary. Thismodel is considered by Das (2005) and Florens and Malavolti (2002). In that case, afunction ϕ(Z) is characterized by two numbers ϕ(0) and ϕ(1) and L2

Z is isomorphic toR2. Equation (5.20) becomes

ϕ (0) Prob (Z = 0|W = w) + ϕ (1) Prob (Z = 1|W = w) = E (Y |W = w) .

The instruments W need to take at least two values in order to identify ϕ (0) and ϕ (1) fromthis equation. In general, ϕ is overidentified and overidentification is solved by replacing(5.25) by

K∗Kϕ = K∗r

or, in the binary case, by

ϕ (0) E (Prob (Z = 0|W ) |Z) + ϕ (1) E (Prob (Z = 1|W ) |Z) = E (E (Y |W ) |Z) .

In the latter case, we get two equations which in general have a unique solution.This model can be extended by considering Z = (Z1, Z2) where Z1 is discrete (Z1 ∈ 0, 1)

and Z2 is exogenous (i.e. W = (W1, Z2)). In this extended binary model, ϕ is characterizedby two functions ϕ(0, z2) and ϕ(1, z2), the solutions of

ϕ(0, z2)E (Prob (Z1 = 0|W ) |Z1 = z1, Z2 = z2) + ϕ (1, z2) E (Prob (Z1 = 1|W ) |Z1 = z1, Z2 = z2)= E (E (Y |W ) |Z1 = z1, Z2 = z2) , for z1 = 0, 1.

70

The properties of the estimator based on the previous equation are considered in Flo-rens and Malavolti (2002). In this case, no regularization is needed because K∗K has acontinuous inverse (since the dimension is finite in the pure binary case and K∗K is notcompact in the extended binary model).

We can also illustrate our approach in the case when the Hilbert spaces are not neces-sarily L2 spaces. Consider the following semiparametric case. The function ϕ is constrainedto be an element of

X =

ϕ such that ϕ =

L∑

l=1

βlεl

where (εl)l=1,...,L is a vector of fixed functions in L2F (Z) . Then X is a finite dimensional

Hilbert space. However, we keep the space E equal to L2F (W ). The model is then partially

parametric but the relation between Z and W is treated nonparametrically. In this case,it can easily be shown that K∗ transforms any function ψ of L2

F (W ) into a function of X ,which is its best approximation in the L2 sense (see Example 2.4. in Section 2). Indeed:

If ψ ∈ L2F (W ) ,∀j ∈ 1, ...L

E (εjψ) = 〈Kεj, ψ〉 = 〈εj, K∗ψ〉 .

Moreover, K∗ψ ∈ X =⇒ K∗ψ =L∑

l=1

αlεl, therefore

⟨εj,

L∑

l=1

αlεl

⟩= E (ψεj)

⇔L∑

l=1

αlE (εjεl) = E (ψεj) .

The function ϕ defined as the solution of Kϕ = r is in general overidentified but the equa-tion K∗Kϕ = K∗r always has a unique solution. The finite dimension of X implies that(K∗K)−1 is a finite dimensional linear operator and is then continuous. No regularizationis required.

Now we introduce an assumption which is only a regularity condition when Z and Whave no element in common. However, this assumption cannot be satisfied if there aresome elements in common between Z and W . Extensions to this latter case are discussedin Darolles, Florens and Renault (2002), see also Example 2.5. in Section 2.

Assumption A.1: The joint distribution of (Z,W ) is dominated by the product of itsmarginal distributions, and its density is square integrable w.r.t. the product of margins.

Assumption A.1 ensures that K and K∗ are Hilbert Schmidt operators, and is asufficient condition for the compactness of K, K∗, KK∗ and K∗K (see Lancaster (1968),Darolles, Florens, Renault (2002)) and Theorem 2.34.

71

Under Assumption A1, the instrumental regression ϕ is identifiable if and only if 0 isnot an eigenvalue of K∗K. Then, for the sake of expositional simplicity, we focus on thei.i.d. context:

Assumption A.2: The data (yi, zi, wi) i = 1, · · ·n, are i.i.d samples of (Y, Z,W ).

We estimate the joint distribution F of (Y, Z, W ) using a kernel smoothing of theempirical distribution. In the applications, the bandwidths differ, but they all have thesame speed represented by the notation cn.

For economic applications, one may be interested either by the unknown functionϕ(Z) itself, or only by its moments, including covariances with some known functions.These moments may for instance be useful for testing economic statements about scaleeconomies, elasticities of substitutions, and so on.

For such tests, one will only need the empirical counterparts of these moments andtheir asymptotic probability distribution. An important advantage of the instrumentalvariable approach is that it permits us to estimate the covariance between ϕ(Z) and g(Z)for a large class of functions. Actually, the identification assumption amounts to ensurethat the range R(K∗) is dense in L2

F (Z) and for any g in this range:

∃ψ ∈ L2F (W ), g(Z) = E[ψ (W ) | Z],

and then Cov[ϕ(Z), g(Z)] = Cov[ϕ(Z), E[ψ (W ) | Z]] = Cov[ϕ(Z), ψ (W )] = Cov[E[ϕ(Z) |W ], ψ (W )] = Cov[Y, ψ (W )], can be estimated with standard parametric techniques. Forinstance, if E[g(Z)] = 0, the empirical counterpart of Cov[Y, ψ (W )], i.e.:

1

n

n∑i=1

Yiψ (Wi) ,

is a root-n consistent estimator of Cov[ϕ(Z), g(Z)], and:

√n

[1

n

n∑i=1

Yiψ (Wi)− Cov[ϕ(Z), g(Z)]

]d→ N (0, V ar[Y ψ (W )]),

where V ar[Y ψ (W )] will also be estimated by its sample counterpart. However in practicethis analysis has very limited interest because even if g is given, ψ is not known and mustbe estimated by solving the integral equation g(Z) = E[ψ(W ) | Z], where the conditionaldistribution of W given Z is also estimated.

Therefore, the real problem of interest is to estimate Cov[ϕ(Z), g(Z)], or 〈ϕ, g〉 byreplacing ϕ by an estimator. This estimator will be constructed by solving a regularizedversion of the empirical counterpart of (5.25) where K and r are replaced by their estima-tors. In the case of kernel smoothing, the necessity of regularization appears obviously.Using the notation of 2.5.3, the equation

Knϕ = rn

72

becomes

n∑i=1

ϕ (zi) ω

(w − wi

cn

)

n∑i=1

ω

(w − wi

cn

) =

n∑i=1

yiω

(w − wi

cn

)

n∑i=1

ω

(w − wi

cn

) .

The function ϕ can not be obtained from this equation except for the values ϕ (zi) equalto yi. This solution does not constitute a consistent estimate. The regularized Tikhonovsolution is the solution of:

αnϕ (z) +

n∑j=1

ω

z − zj

cn

! n∑i=1

ϕ(zi)ω

wj−wi

cn

!n∑

i=1

ω

wj−wi

cn

!n∑

j=1

ω

z − zj

cn

! =

n∑j=1

ω

z − zj

cn

! n∑i=1

yiω

wj−wi

cn

!n∑

i=1

ω

wj−wi

cn

!n∑

j=1

ω

z − zj

cn

! .

This functional equation may be solved in two steps. First, the z variable is fixed to thevalues zi and the system becomes an n×n linear system, which can be solved in order toobtain the ϕ (zi) . Second, the previous expression gives a value of ϕ (z) for any value ofz.

If n is very large, this inversion method may be difficult to apply and may be replacedby a Landweber Fridman resolution (see Section 3). A first expression of ϕ (z) may be forinstance the estimated conditional expectation E (E (Y |W ) |Z) and this estimator will bemodified a finite number of times by the formula

ϕl,n =(I − cK∗

nKn

)ϕl−1,n + cK∗

nrn.

To simplify our analysis, we impose a relatively strong assumption:

Assumption A.3: The error term is homoskedastic, that is:

V ar (U |W ) = σ2.

In order to check the asymptotic properties of the estimator of ϕ, it is necessary tostudy to properties of the estimators of K and of r. Under regularity conditions such asthe compactness of the joint distribution support and the smoothness of the density (seeDarolles et al. (2002)),the estimation by boundary kernels gives the following results:

i)∥∥∥K∗

nKn −K∗K∥∥∥

2

∼ O(

1n(cn)p + (cn)2ρ

)where ρ is the order of the kernel and p the

dimension of Z.

ii)∥∥∥K∗

nrn − K∗nKnϕ

∥∥∥2

∼ O(

1n

+ (cn)2ρ)

73

iii) A suitable choice of cn implies

√n

(K∗

nrn − K∗nKnϕ

)=⇒ N

(0, σ2K∗K

)

This convergence is a weak convergence in L2F (Z) (see Section 2.4).

Using results developed in Section 4 and in Darolles et al. (2002) it can be deducedthat:

a) If αn → 0, c2ρn

α2n→ 0, 1

α2nncρ

n∼ O (1) the regularized estimator ϕn converge in proba-

bility to ϕ in L2 norm.b) If ϕ ∈ Φβ (0 < β ≤ 2) , the optimal choices of αn and cn are:

αn = k1n− 1

2β

cn = k2n− 1

2ρ

and, if ρ is chosen such that p2ρ≤ β

2+β, we obtain the following bound for the rate of

convergence

‖ϕn − ϕ‖ ∼ O(n−

β2+β

)

c) Let us assume that α is kept constant. In that case, the linear operators(αI + K∗

n Kn)−1 and (αI + K∗K)−1 are bounded, and using a functional version of theSlutsky theorem (see Chen and White (1992), and Section 2.4), it is immediately checkedthat:

√n(ϕn − ϕ− bα

n) =⇒ N (0, Ω), (5.26)

where

bαn = α

[(αI + K∗

nKn)−1 − (αI + K∗K)−1]ϕ,

and

Ω = σ2(αI + K∗K)−1K∗K(αI + K∗K)−1.

Some comments may illustrate this first result:i) The convergence obtained in (5.26) is still a functional distributional convergence inthe Hilbert space L2

F (Z), which in particular implies the convergence of inner product√n〈ϕn − ϕ− bα

n, g〉 to univariate normal distribution N (0, 〈g, Ωg〉).ii) The convergence of ϕn involves two bias terms. The first bias is ϕα − ϕ. This term isdue to the regularization and does not decrease if α is constant. The second one, ϕn−ϕα

follows from the estimation error of K. This bias decreases to zero when n increases, butat a lower speed than

√n.

iii) The asymptotic variance in (5.26) can be seen as the generalization of the two stageleast squares asymptotic variance. An intuitive (but not correct) interpretation of this

74

result could be the following. If α is small, the asymptotic variance is approximatelyσ2(K∗K)−1, which is the functional extension of σ2(E(ZW ′)E(WW ′)−1E(WZ ′))−1.

d) Let us now consider the case where α → 0. For any δ ∈ Φβ (β ≥ 1), if αn is optimal

(= k1n− 1

2β ) and if cn = k2n−( 1

2ρ+ε) (ε > 0) , we have

√νn (δ) 〈ϕn − ϕ, δ〉 −Bn =⇒ N

(0, σ2

),

where the speed of convergence is equal to

νn (δ) =n∥∥K (αnI + K∗K)−1 δ

∥∥2 ≥ O(n

2β2+β

),

and the bias Bn is equal to√

νn (δ) 〈ϕα − ϕ, δ〉 , which in general does not vanish. If δ= 1 for example, this bias is O (nα2

n) and diverges.The notion of Φβ permits us to rigorously define the concept of weak or strong in-

struments. Indeed, if λj are not zero for any j, the function ϕ is identified by Equation(5.25) and ϕn is a consistent estimator. A bound for the speed of convergence of ϕn isprovided under the restriction that ϕ belongs to a space Φβ with β > 0. The conditionϕ ∈ Φβ means that the rate of decline of the Fourier coefficients of ϕ in the basis of φj

is faster than the rate of decline of the λβj (which measures the dependence). In order to

have asymptotic normality we need to assume that β ≥ 1. In that case, if ϕ ∈ Φβ, wehave asymptotic normality of inner products 〈ϕn − ϕ, δ〉 in the vector space Φβ. Then,it is natural to say that W is a strong instrument for ϕ if ϕ is an element of a Φβ withβ ≥ 1. This may have two equivalent interpretations. Given Z and W , the set of instru-mental regressions for which W is a strong instrument is Φ1 or given Z and ϕ, any set ofinstruments is strong if ϕ is an element of the set Φ1 defined using these instruments.

We may complete this short presentation with two final remarks. First, the optimalchoice of cn and αn implies that the speed of convergence and the asymptotic distributionare not affected by the fact that K is not known and is estimated. The accuracy of theestimation is governed by the estimation of the right hand side term K∗r. Secondly, theusual “curse of dimensionality” of nonparametric estimation appears in a complex way.The dimension of Z appears in many places but the dimension of W is less explicit. Thevalue and the rate of decline of the λj depend on the dimension of W : Given Z, thereduction of the number of instruments implies a faster rate of decay of λj to zero and aslower rate of convergence of the estimator.

6. Reproducing kernel and GMM in Hilbert spaces

6.1. Reproducing kernel

Models based on reproducing kernels are the foundation for penalized likelihood estimationand splines (see e.g. Berlinet and Thomas-Agnan, 2004). However, it has been little used

75

in econometrics so far. The theory of reproducing kernels becomes very useful when theeconometrician has an infinite number of moment conditions and wants to exploit all ofthem in an efficient way. For illustration, let θ ∈ R be the parameter of interest andconsider an L× 1−vector h that gives L moment conditions satisfying Eθ0 (h (θ)) = 0 ⇔θ = θ0. Let hn (θ) be the sample estimate of Eθ0 (h (θ)). The (optimal) generalized methodof moments (GMM) estimator of θ is the minimizer of hn (θ)′ Σ−1hn (θ) where Σ is the

covariance matrix of h. hn (θ)′ Σ−1hn (θ) can be rewritten as∥∥Σ−1/2hn (θ)

∥∥2and coincides

with the norm of hn (θ) in a particular space called the reproducing kernel Hilbert space(RKHS). When h is finite dimensional, the computation of the GMM objective functiondoes not raise any particular difficulty, however when h is infinite dimensional (for instanceis a function) then the theory of RKHS becomes very handy. A second motivation for theintroduction of the RKHS of a self-adjoint operator K is the following. Let T be suchthat K = TT ∗ then the RKHS of K corresponds to the 1−regularity space of T (denotedΦ1 in Section 3.1).

6.1.1. Definitions and basic properties of RKHS

This section presents the theory of reproducing kernels, as described in Aronszajn (1950)and Parzen (1959, 1970). Let L2

C (π) =ϕ : I ⊂ RL → C :

∫I|ϕ (s)|2 π (s) ds < ∞

whereπ is a pdf (π may have a discrete or continuous support) and denote ‖.‖ and 〈, 〉 the normand inner product on L2

C (π).

Definition 6.1. A space H (K) of complex-valued functions defined on a set I ⊂ RL issaid to be a reproducing kernel Hilbert space H (K) associated with the integral operatorK : L2

C (π) → L2C (π) with kernel k (t, s) if the three following conditions hold

(i) it is a Hilbert space (with inner product denoted 〈, 〉K),(ii) for every s ∈ I, k (t, s) as a function of t belongs to H (K) ,(iii) (reproducing property) for every s ∈ I and ϕ ∈ H (K), ϕ (s) = 〈ϕ (.) , k (., s)〉K .The kernel k is then called the reproducing kernel.

The following properties are listed in Aronszajn (1950):1 - If the RK k exists, it is unique.2 - A Hilbert space H of functions defined on I ⊂ RL is a RKHS if and only if all

functionals ϕ → ϕ (s) for all ϕ ∈ H, s ∈ I, are bounded.3 - K is a self-adjoint positive operator on L2

C (π).4 - To a self-adjoint positive operator K on I, there corresponds a unique RKHSH (K)

of complex-valued functions.5 - Every sequence of functions ϕn which converges weakly to ϕ in H (K) (that is

〈ϕn, g〉K → 〈ϕ, g〉K for all g ∈ H (K)) converges also pointwise, that is lim ϕn (s) = ϕ (s) .

Note that (2) is a consequence of Riesz theorem 2.18: There exists a representor ksuch that for all ϕ ∈ H

ϕ (t) = 〈ϕ, kt〉K .

76

Let kt = k (t, .) so that 〈kt, ks〉K = k (t, s). (5) follows from the reproducing property.Indeed, 〈ϕn (t)− ϕ (t) , k (t, s)〉K = ϕn (s)− ϕ (s) .

Example (finite dimensional case). Let I = 1, 2, ..., L , let Σ be a positivedefinite L × L matrix with principal element σt,s. Σ defines an inner product on RL :〈ϕ, ψ〉Σ = ϕ′Σ−1ψ. Let (σ1, ..., σL) be the columns of Σ. Let ϕ = (ϕ (1) , ..., ϕ (L))′, thenwe have the reproducing property

〈ϕ, σt〉Σ = ϕ (t) , τ = 1, ..., L

because ϕΣ−1Σ = ϕ. Now we diagonalize Σ, Σ = PDP ′ where P is the m × m matrixwith (t, j) element φj (t) (φj are the orthonormal eigenvectors of Σ) and D is the diagonalmatrix with diagonal element λj (the eigenvalues of Σ). The (t, s)th element of Σ can berewritten as

σ (t, s) =m∑

j=1

λjφj (t) φj (s) .

We have

〈ϕ, ψ〉Σ = ϕ′Σ−1ψ =m∑

j=1

1

λj

⟨ϕ, φj

⟩ ⟨ψ, φj

⟩

where 〈, 〉 is the euclidean inner product.

From this small example, we see that the norm in a RKHS can be characterized by thespectral decomposition of an operator. Let K be a positive self-adjoint compact operatorwith spectrum

φj, λj : j = 1, 2, ...

. Assume that N (K) = 0. It turns out that H (K)

coincides with the 1/2-regularization space of the operator K :

H (K) =

ϕ : ϕ ∈ L2 (π) and

∞∑j=1

∣∣⟨ϕ, φj

⟩∣∣2λj

< ∞

= Φ1/2 (K) .

We can check that(i) H (K) is a Hilbert space with inner product

〈ϕ, ψ〉K =∞∑

j=1

⟨ϕ, φj

⟩ ⟨ψ, φj

⟩

λj

and norm

‖ϕ‖2K =

∞∑j=1

∣∣⟨ϕ, φj

⟩∣∣2λj

.

(ii) k (., t) belongs to H (K)

77

(iii) 〈ϕ, k(., t)〉K = ϕ (t) .

Proof. (ii) follows from Mercer’s formula (Theorem 2.42 (iii)) that is k (t, s) =∑∞j=1 λjφj (t) φj (s). Hence ‖k (., t)‖2

K =∑∞

j=1

∣∣⟨φj, k (., t)⟩∣∣2 /λj =

∑∞j=1

∣∣λjφj (t)∣∣2 /λj =∑∞

j=1 λjφj (t) φj (t) = k (t, t) < ∞. For (iii), we use again Mercer’s formula. 〈ϕ (.) , k (., t)〉K =∑∞j=1

⟨φj, k (., t)

⟩ ⟨ϕ, φj

⟩/λj =

∑∞j=1

⟨ϕ, φj

⟩Kφj (t) /λj =

∑∞j=1

⟨ϕ, φj

⟩φj (t) = ϕ (t) .

There is a link between calculating a norm in a RKHS and solving an integral equationKϕ = ψ. We follow Nashed and Wahba (1974) to enlighten this link. We have

Kϕ =∞∑

j=1

λj

⟨ϕ, φj

⟩φj.

Define K1/2 the square root of K:

K1/2ϕ =∞∑

j=1

√λj

⟨ϕ, φj

⟩φj.

Note that N (K) = N (K1/2

), H (K) = K1/2 (L2

C (π)) . Define K−1/2 =(K1/2

)†where ()†

is the Moore-Penrose generalized inverse introduced in Subsection 3.1.:

K†ψ =∞∑

j=1

1

λj

⟨ψ, φj

⟩φj.

Similarly, the generalized inverse of K1/2 takes the form:

K−1/2ψ =∞∑

j=1

1√λj

⟨ψ, φj

⟩φj.

From Nashed and Wahba (1974), we have the relations

‖ϕ‖2K = inf

‖p‖ : p ∈ L2C (π) and ϕ = K1/2p

,

〈ϕ, ψ〉K =⟨K−1/2ϕ,K−1/2ψ

⟩, for all ϕ, ψ ∈ H (K) . (6.1)

The following result follows from Proposition 3.6.

Proposition 6.2. Let T : E →L2C (π) be an operator such that K = TT ∗ then

H (K) = R (K1/2

)= R (T ∗) = Φ1 (T ) .

Note that T ∗ : L2C (π) → E and K1/2 : L2

C (π) → L2C (π) are not equal because they

take their values in different spaces.

78

6.1.2. RKHS for covariance operators of stochastic processes

In the previous section, we have seen how to characterize H (K) using the spectral de-composition of K. When K is known to be the covariance kernel of a stochastic pro-cess, then H (K) admits a simple representation. The main results of this section comefrom Parzen (1959). Consider a random element (r.e.). h (t) , t ∈ I ⊂ Rp defined ona probability space (Ω,F , P ) and observed for all values of t. Assume h (t) is a sec-ond order random function that is E

(|h (t)|2) =∫Ω|h (t)|2 dP < ∞ for every t ∈ I.

Let L2 (Ω,F , P ) be the set of all r.v. U such that E |U |2 =∫Ω|U |2 dP < ∞. De-

fine the inner product 〈U, V 〉L2(Ω,F ,P ) between any two r.v. U and V of L2 (Ω,F , P )

by 〈U, V 〉L2(Ω,F ,P ) = E(UV

)=

∫Ω

UV dP. Let L2 (h (t) , t ∈ I) be the Hilbert spacespanned by the r.e. h (t) , t ∈ I. Define K the covariance operator with kernel k (t, s) =

E(h (t) h (s)

). The following theorem implies that any symmetric nonnegative kernel can

be written as a covariance kernel of a particular process.

Theorem 6.3. K is a covariance operator of a r.e. if and only if K is a positive self-adjoint operator.

The following theorem can be found in Parzen (1959) for real-valued functions. Thecomplex case is treated in Saitoh (1997).

Theorem 6.4. Let h (t) , t ∈ I be a r.e. with mean zero and covariance kernel k. Then(i) L2 (h (t) , t ∈ I) is isometrically isomorphic or congruent to the RKHS H (K) . De-

note J : H (K) → L2 (h (t) , t ∈ I) this congruence.(ii) For every function ϕ in H (K) , J (ϕ) satisfies

〈J (ϕ) , h (t)〉L2(Ω,F ,P ) = E(J (ϕ) h (t)

)= 〈ϕ, k (., t)〉K = ϕ (t) , for all t ∈ I (6.2)

where J (ϕ) is unique in L2 (h (t) , t ∈ I) and has mean zero and variance such that

‖ϕ‖2K = ‖J (ϕ)‖2

L2(Ω,F ,P ) = E(|J (ϕ)|2) .

Note that, by (6.2), the congruence is such that J (k (., t)) = h (t). The r.v. U ∈L2 (h (t) , t ∈ I) corresponding to ϕ ∈ H (K) is denoted below as 〈ϕ, h〉K (or J (ϕ)). AsL2 (h (t) , t ∈ I) and H (K) are isometric, we have by Definition 2.19

cov [〈ϕ, h〉K , 〈ψ, h〉K ] = E[J (ϕ) J (ψ)

]= 〈ϕ, ψ〉K

for every ϕ, ψ ∈ H (K) . Note that 〈ϕ, h〉K is not correct notation because h =∑

j

⟨h, φj

⟩φj

a.s. does not belong to H (K). If it were the case, we should have∑

j

⟨h, φj

⟩2/λj < ∞

a.s.. Unfortunately⟨h, φj

⟩are independent with mean 0 and variance

⟨Kφj, φj

⟩= λj.

Hence, E[∑

j

⟨h, φj

⟩2/λj

]= ∞ and by Kolmogorov’s theorem

∑j

⟨h, φj

⟩2/λj = ∞ with

79

nonzero probability. The r.v. J (ϕ) itself is well-defined, only the notation 〈ϕ, h〉K is notadequate; as Kailath (1971) explains, it should be regarded only as a mnemonic for find-ing J (ϕ) in a closed form. The rest of this section is devoted to the calculation of ‖ϕ‖K .Note that the result (6.2) is valid when t is multidimensional, t ∈ RL. In the next section,h (t) will be a moment function indexed by an arbitrary index parameter t.

Assume that the kernel k on I × I can be represented as

k (s, t) =

∫h (s, x) h (t, x)P (dx) (6.3)

where P is a probability measure and h (s, .) , s ∈ I is a family of functions on L2 (Ω,F , P ) .By Theorem 6.4, H (K) consists of functions ϕ on I of the form

ϕ (t) =

∫ψ (x) h (t, x)P (dx) (6.4)

for some unique ψ in L2 (h (t, .) , t ∈ I) , the subspace of L2 (Ω,F , P ) spanned by h (t, .) , t ∈ I.The RKHS norm of ϕ is given by

‖ϕ‖2K = ‖ψ‖2

L2(Ω,F ,P ) .

When calculating ‖ϕ‖2K in practice, one looks for the solutions of (6.4). If there are several

solutions, it is not always obvious to see which one is spanned by h (t, .) , t ∈ I. In thiscase, the right solution is the solution with minimal norm (Parzen, 1970):

‖ϕ‖2K = min

ψ s.t.ϕ=〈ψ,h〉L2

‖ψ‖2L2(Ω,F ,P ) .

Theorem 6.4 can be reinterpreted in terms of range. Let T and T ∗ be

T : L2 (π) → L2 (h (t, .) , t ∈ I)

ϕ → Tϕ (x) =

∫ϕ (t) h (t, x) π (t) dt.

and

T ∗ : L2 (h (t, .) , t ∈ I) → L2 (π)

ψ → T ∗ψ (s) =

∫ψ (x) h (s, x)P (dx) .

To check that T ∗ is indeed the dual of T, it suffices to check 〈Tϕ, ψ〉L2(Ω,F ,P ) = 〈ϕ, T ∗ψ〉L2(π)

for ϕ ∈ L2 (π) and ψ (x) = h (t, x) as h (t, .) spans L2 (h (t, .) , t ∈ I) . Using the fact thatK = T ∗T and Proposition 6.2, we have H (K) = R (T ∗), which gives Equation (6.4).

80

Example. Let k (t, s) = t ∧ s. k can be rewritten as

k (t, s) =

∫ 1

0

(t− x)0+ (s− x)0

+ du

with

(s− x)0+ =

1 if x < s0 if x ≥ s

.

It follows that H (K) consists of functions ϕ of the form:

ϕ (t) =

∫ 1

0

ψ (x) (t− x)0+ dx =

∫ t

0

ψ (x) dx, 0 ≤ t ≤ 1

⇒ ψ (t) = ϕ′ (t) .

Hence, we have

‖ϕ‖2K =

∫ 1

0

|ψ (x)|2 dx =

∫ 1

0

|ϕ′ (x)|2 dx.

Example. Let k be defined as in (6.3) with h (t, x) = eitx. Assume P admits a pdffθ0 (x) , which is positive everywhere. Equation (6.4) is equivalent to

ϕ (t) =

∫ψ (x) e−itxP (dx)

=

∫ψ (x) e−itxfθ0 (x) dx.

By the Fourier Inversion formula, we have

ψ (x) =1

2π

1

fθ0 (x)

∫eitxϕ (t) dt.

6.2. GMM in Hilbert spaces

First introduced by Hansen (1982), the Generalized Method of Moments (GMM) becamethe cornerstone of modern structural econometrics. In Hansen, the number of momentconditions is supposed to be finite. The method proposed in this section permits to dealwith moment functions that take their values in finite or infinite dimensional Hilbertspaces. It was initially proposed by Carrasco and Florens (2000) and further developedin Carrasco and Florens (2001) and Carrasco, Chernov, Florens, and Ghysels (2004).

81

6.2.1. Definition and examples

Let xi : i = 1, 2, ..., n be an iid sample of a random vector X ∈ Rp. The case where Xis a time-series will be discussed later. The distribution of X is indexed by a parameterθ ∈ Θ ⊂ Rd. Denote Eθ the expectation with respect to this distribution. The unknownparameter θ is identified from the function h (X; θ) (called moment function) defined onRp ×Θ, so that the following is true.

Identification Assumption

Eθ0 (h (X; θ)) = 0 ⇔ θ = θ0. (6.5)

It is assumed that h (X; θ) takes its values in a Hilbert space H with inner product 〈., .〉and norm ‖.‖ . When f = (f1, ..., fL) and g = (g1, ..., gL) are vectors of functions of H,we use the convention that 〈f, g′〉 denotes the L×L−matrix with (l, m) element 〈fl, gm〉 .Let Bn : H → H be a sequence of random bounded linear operators and

hn (θ) =1

n

n∑i=1

h (xi; θ) .

We define the GMM estimator associated with Bn as

θn (Bn) = arg minθ∈Θ

∥∥∥Bnhn (θ)∥∥∥ . (6.6)

Such an estimator will in general be suboptimal; we will discuss the optimal choice of Bn

later. Below, we give four examples that can be handled by the method discussed in thissection. They illustrate the versatility of the method as it can deal with a finite number ofmoments (Example 1), a continuum (Examples 2 and 3) and a countably infinite sequence(Example 4).

Example 1 (Traditional GMM). Let h (x; θ) be a vector of RL, Bn be a L ×L−matrix and ‖.‖ denote the Euclidean norm. The objective function to minimize is

∥∥∥Bnhn (θ)∥∥∥

2

= hn (θ)′ B′nBnhn (θ)

and corresponds to the usual GMM quadratic form hn (θ)′ Wnhn (θ) with weighting matrixWn = B′

nBn.

Example 2 (Continuous time process). Suppose we observe independent repli-cations of a continuous time process

X i (t) = G (θ, t) + ui (t) , 0 ≤ t ≤ T , i = 1, 2, ..., n (6.7)

where G is a known function and ui = ui (t) : 0 ≤ t ≤ T is a zero mean Gaussianprocess with continuous covariance function k (t, s) = E [u (t) u (s)], t, s ∈ [0, T ] . Denote

82

X i = X i (t) : 0 ≤ t ≤ T, G (θ) = G (θ, t) : 0 ≤ t ≤ T , and H = L2 ([0, T ]). Theunknown parameter θ is identified from the moment of the function

h(X i; θ

)= X i −G (θ) .

Assume h (X i; θ) ∈ L2 ([0, T ]) with probability one. Candidates for Bn are arbitrarybounded operators on L2 ([0, T ]) including the identity. For Bnf = f , we have

∥∥∥Bnhn (θ)∥∥∥

2

=

∫ T

0

hn (θ)2 dt.

Estimation of Model (6.7) is discussed in Kutoyants (1984).

Example 3 (Characteristic function). Denote ψθ (t) = Eθ[eit′X

]the characteris-

tic function of X. Inference can be based on

h (t,X; θ) = eit′X − ψθ (t) , t ∈ RL.

Note that contrary to the former examples, h (t,X; θ) is complex valued and |h (t, X; θ)| ≤∣∣eit′X∣∣ + |ψθ (t)| ≤ 2. Let Π be a probability measure on RL and H = L2

C

(RL, Π

). As

h (., X; θ) is bounded, it belongs to L2C

(RL, Π

)for any Π. Feuerverger and McDunnough

(1981) and more recently Singleton (2001) show that an efficient estimator of θ is ob-tained from h (., X; θ) by solving an empirical counterpart of

∫Eh (t,X; θ) ω (t) dt = 0

for an adequate weighting function ω, which turns out to be a function of the pdf of X.This efficient estimator is not implementable as the pdf of X is unknown. They suggestestimating θ by GMM using moments obtained from a discrete grid t = t1, t2, ..., tM . Analternative strategy put forward in this section is to use the full continuum of momentconditions by considering the moment function h as an element of H = L2

C

(RL, Π

).

Example 4 (Conditional moment restrictions). Let X = (Y, Z) . For a knownfunction ρ ∈ R, we have the conditional moment restrictions

Eθ0 [ρ (Y, Z, θ) |Z] = 0.

Hence for any function g (Z), we can construct unconditional moment restrictions

Eθ0 [ρ (Y, Z, θ) g (Z)] = 0.

Assume Z has bounded support. Chamberlain (1987) shows that the semiparametricefficiency bound can be approached by a GMM estimator based on a sequence of mo-ment conditions using as instruments the power function of Z : 1, Z, Z2, ..., ZL for alarge L. Let π be the Poisson probability measure π (l) = e−1/l! and H = L2 (N,π) =f : N → R :

∑∞l=1 g (l) π (l) < ∞ . Let

h (l, X; θ) = ρ (Y, Z, θ) Z l, l = 1, 2, ...

If h (l, X; θ) is bounded with probability one, then h (., X; θ) ∈ L2 (N,π) with probabilityone. Instead of using an increasing sequence of moments as suggested by Chamberlain, itis possible to handle h (., X; θ) as a function. The efficiency of the GMM estimator basedon the countably infinite number of moments h (l, X; θ) : l ∈ N will be discussed later.

83

6.2.2. Asymptotic properties of GMM

Let H = L2C (I, Π) =

f : I → C :

∫I|f (t)|2 Π (dt) < ∞

where I is a subset of RL forsome L ≥ 1 and Π is a (possibly discrete) probability measure. This choice of H isconsistent with Examples 1 to 4. Under some weak assumptions,

√nhn (θ0) converges to

a Gaussian process N (0, K) in H where K denotes the covariance operator of h (X; θ0) .K is defined by

K : H → Hf → Kf (s) = 〈f, k (., t)〉 =

∫

I

k (t, s) f (s) Π (ds)

where the kernel k of K satisfies k (t, s) = Eθ0

[h (t,X; θ0) h (s,X; θ0)

]and k (t, s) =

k (s, t). Assume moreover that K is a Hilbert Schmidt operator and hence admits a discretespectrum. Suppose that Bn converges to a bounded linear operator B defined on Hand that θ0 is the unique minimizer of

∥∥BEθ0h (X; θ)∥∥ . Then θn (Bn) is consistent and

asymptotically normal. The following result is proved in Carrasco and Florens (2000).

Proposition 6.5. Under Assumptions 1 to 11 of Carrasco and Florens (2000), θn (Bn)is consistent and

√n

(θn (Bn)− θ0

)L→ N (0, V )

with

V =⟨BEθ0 (∇θh) , BEθ0 (∇θh)′

⟩−1

× ⟨BEθ0 (∇θh) , (BKB∗) BEθ0 (∇θh)′

⟩

× ⟨BEθ0 (∇θh) , BEθ0 (∇θh)′

⟩−1

where B∗ is the adjoint of B.

6.2.3. Optimal choice of the weighting operator

Carrasco and Florens (2000) show that the asymptotic variance V given in Proposi-tion 6.5 is minimal for B = K−1/2. In that case, the asymptotic covariance becomes⟨K−1/2Eθ0 (∇θh) , K−1/2Eθ0 (∇θh)

⟩−1.

Example 1 (continued). K is the L×L−covariance matrix of h (X; θ) . Let Kn be

the matrix 1n

∑ni=1 h

(xi; θ

1)

h(xi; θ

1)′

where θ1

is a consistent first step estimator of θ.

Kn is a consistent estimator of K. Then the objective function becomes

⟨K−1/2

n hn (θ) , K−1/2n hn (θ)

⟩= hn (θ)′ K−1

n hn (θ)

which delivers the optimal GMM estimator.

84

When H is infinite dimensional, we have seen in Section 3.1 that the inverse of K,

K−1, is not bounded. Similarly K−1/2 =(K1/2

)−1is not bounded on H and its domain

has been shown in Subsection 6.1.1 to be the subset of H which coincides with the RKHSassociated with K and denoted H (K) .

To estimate the covariance operator K, we need a first step estimator θ1

that is√n−consistent. It may be obtained by letting Bn equal the identity in (6.6) or by using

a finite number of moments. Let Kn be the operator with kernel

kn (t, s) =1

n

n∑i=1

h(t, xi; θ

1)

h(s, xi; θ

1)

.

Then Kn is a consistent estimator of K and ‖Kn −K‖ = O (1/√

n) . As K−1f is notcontinuous in f , we estimate K−1 by the Tykhonov regularized inverse of Kn :

(Kαnn )−1 =

(αnI + K2

n

)−1Kn

for some penalization term αn ≥ 0. If αn > 0, (Kαnn )−1 f is continuous in f but is

a biased estimator of K−1f. There is a trade-off between the stability of the solutionand its bias. Hence, we will let αn decrease to zero at an appropriate rate. We define

(Kαnn )−1/2 =

((Kαn

n )−1)1/2.

The optimal GMM estimator is given by

θn = arg minθ∈Θ

∥∥∥(Kαnn )−1/2 hn (θ)

∥∥∥ .

Interestingly, the optimal GMM estimator minimizes the norm of hn (θ) in the RKHSassociated with Kαn

n . Under certain regularity conditions, we have∥∥∥(Kαn

n )−1/2 hn (θ)∥∥∥ P→

∥∥Eθ0 (h (θ))∥∥

K.

A condition for applying this method is that Eθ0 (h (θ)) ∈ H (K) . This condition can beverified using results from 6.1.1.

Proposition 6.6. Under the regularity conditions of Carrasco and Florens (2000, Theo-rem 8), θn is consistent and

√n

(θn − θ0

)L→ N

(0,

⟨Eθ0 (∇θh (θ0)) , Eθ0 (∇θh (θ0))

′⟩−1

K

)

as n and nα3n →∞ and αn → 0.

The stronger condition nα3n → ∞ of Carrasco and Florens (2000) has been relaxed

into nα2 → ∞ in Carrasco, Chernov, Florens, and Ghysels (2004). Proposition 6.6 doesnot indicate how to select αn in practice. A data-driven method is desirable. Carrascoand Florens (2001) propose to select the αn that minimizes the mean square error (MSE)of the GMM estimator θn. As θn is consistent for any value of αn, it is necessary tocompute the higher order expansion of the MSE, which is particularly tedious. Instead ofrelying on an analytic expression, it may be easier to compute the MSE via bootstrap orsimulations.

85

6.2.4. Implementation of GMM

There are two equivalent ways to compute the objective function

∥∥∥(Kαnn )−1/2 hn (θ)

∥∥∥2

, (6.8)

1) using the spectral decomposition of Kn,or2) using a simplified formula that involves only vectors and matrices.The first method discussed in Carrasco and Florens (2000) requires calculating the

eigenvalues and eigenfunctions of Kn using the method described in 2.5.3. Let φj denote

the orthonormalized eigenfunctions of Kn and λj the corresponding eigenvalues. Theobjective function in Equation (6.8) becomes

n∑j=1

λj

λ2

j + αn

∣∣∣⟨hn (θ) , φj

⟩∣∣∣2

. (6.9)

The expression (6.9) suggests a nice interpretation of the GMM estimator. Indeed, note

that⟨√

nhn (θ0) , φj

⟩, j = 1, 2, ... are asymptotically normal with mean 0 and vari-

ance λj and are independent across j. Therefore (6.9) is the regularized version of theobjective function of the optimal GMM estimator based on the n moment conditionsE

[⟨h (θ) , φj

⟩]= 0, j = 1, 2, ..., n.

The second method is more attractive by its simplicity. Carrasco et al. (2004) showthat (6.8) can be rewritten as

v (θ)′ [

αnIn + C2]−1

v (θ)

where C is a n × n−matrix with (i, j) element cij, In is the n × n identity matrix andv (θ) = (v1 (θ) , ..., vn (θ))′ with

vi (θ) =

∫h

(t, xi; θ

1)′

hn (t; θ) Π (dt)

cij =1

n

∫h

(t, xi; θ

1)′

h(t, xj; θ

1)

Π (dt) .

Note that the dimension of C is the same whether h ∈ R or h ∈ RL.

6.2.5. Asymptotic Efficiency of GMM

Assume that the pdf of X, fθ, is differentiable with respect to θ. Let L2 (h) be the closureof the subspace of L2 (Ω,F , P ) spanned by h (t, Xi; θ0) : t ∈ I.Proposition 6.7. Under standard regularity conditions, the GMM estimator based onh (t, xi; θ) : t ∈ I is asymptotically as efficient as the MLE if and only if

∇θ ln fθ (xi; θ0) ∈ L2 (h) .

86

This result is proved in Carrasco and Florens (2004) in a more general setting where Xi

is Markov of order L. A similar efficiency result can be found in Hansen (1985), Tauchen(1997) and Gallant and Long (1997).

Example 2 (continued). Let K be the covariance operator of u (t) and H (K) theRKHS associated with K. Kutoyants (1984) shows that if G (θ) ∈ H (K) , the likelihoodratio of the measure induced by X (t) with respect to the measure induced by u (t) equals

LR (θ) =n∏

i=1

exp

⟨G (θ) , xi

⟩K− 1

2‖G (θ)‖2

K

where 〈G,X〉K has been defined in Subsection 6.1.2 and denotes the element of L2 (X (t) : 0 ≤ t ≤ T )under the mapping J−1 of the function G (θ) (J is defined in Theorem 6.4). The scorefunction with respect to θ is

∇θ ln (LR (θ)) =

⟨∇θG (θ) ,

1

n

n∑i=1

(xi −G (θ)

)⟩

K

.

For θ = θ0 and a single observation, the score is equal to

〈∇θG (θ0) , u〉K ,

which is an element of L2 (u (t) : 0 ≤ t ≤ T ) = L2 (h (X (t) ; θ0) : 0 ≤ t ≤ T ) . Hence, byProposition 6.7, the GMM estimator based on h (X; θ0) is asymptotically efficient. Thisefficiency result is corroborated by the following. The GMM objective function is

‖h (x; θ)‖2K =

⟨1

n

n∑i=1

(xi −G (θ)

),1

n

n∑i=1

(xi −G (θ)

)⟩

K

.

The first order derivative equals to

∇θ ‖h (x; θ)‖2K = 2

⟨∇θG (θ) ,

1

n

n∑i=1

(xi −G (θ)

)⟩

K

= 2∇θ ln (LR (θ)) .

Therefore, the GMM estimator coincides with the MLE in this particular case as they aresolutions of the same equation.

Example 3 (continued). Under minor conditions on the distribution of Xi, theclosure of the linear span of

h (t,Xi; θ0) : t ∈ RL

contains all functions of L2 (X) =

g : Eθ0[g (X)2] < ∞

and hence the score ∇θ ln fθ (Xi; θ0) itself. Therefore the GMMestimator is efficient. Another way to prove efficiency is to explicitly calculate the asymp-totic covariance of θn. To simplify, assume that θ is scalar. By Theorem 6.4, we have

∥∥Eθ0 (∇θh (θ0))∥∥2

K=

∥∥∥Eθ0 (∇θh (θ0))∥∥∥

2

K= E |U |2

87

where U satisfies

Eθ0

[Uh (t; θ0)

]= Eθ0 (∇θh (t; θ0)) for all t ∈ RL

which is equivalent to

Eθ0

[U (X)

(eit′X − ψθ0

(t))]

= −∇θψθ0(t) for all t ∈ RL. (6.10)

As U has mean zero, U has also mean zero and we can replace (6.10) by

Eθ0

[U (X)eit′X

]= −∇θψθ0

(t) for all t ∈ RL ⇔∫

U (x)eit′xfθ0 (x) dx = −∇θψθ0(t) for all t ∈ RL ⇔

U (x)fθ0 (x) = − 1

2π

∫e−it′x∇θψθ0

(t) dt. (6.11)

The last equivalence follows from the Fourier inversion formula. Assuming that we canexchange the integration and derivation in the right hand side of (6.11), we obtain

U (x)fθ0 (x) = −∇θfθ0 (x) ⇔U (x) = −∇θ ln fθ0 (x) .

Hence Eθ0 |U |2 = Eθ0[(∇θ ln fθ0 (X))2] . The asymptotic variance of θn coincides with the

Cramer Rao efficiency bound even if, contrary to Example 3, θn differs from the MLE.

Example 4 (continued). As in the previous example, we intend to calculate theasymptotic covariance of θn using Theorem 6.4. We need to find U the p−vector of r.v.such that

Eθ0[Uρ (Y, Z; θ0) Z l

]= Eθ0

[∇θρ (Y, Z; θ0) Z l]

for all l ∈ N, ⇔Eθ0

[Eθ0 [Uρ (Y, Z; θ0) |Z] Z l

]= Eθ0

[Eθ0 [∇θρ (Y, Z; θ0) |Z] Z l

]for all l ∈ N(6.12)

(6.12) is equivalent to

Eθ0 [Uρ (Y, Z; θ0) |Z] = Eθ0 [∇θρ (Y, Z; θ0) |Z] (6.13)

by the completeness of polynomials (Sansone, 1959) under some mild conditions on thedistribution of Z. A solution is

U0 = Eθ0 [∇θρ (Y, Z; θ0) |Z] Eθ0[ρ (Y, Z; θ0)

2 |Z]−1ρ (Y, Z; θ0) .

We have to check that this solution has minimal norm among all the solutions. Consideran arbitrary solution U = U0 + U1. U solution of (6.13) implies

Eθ0 [U1ρ (Y, Z; θ0) |Z] = 0.

88

Hence Eθ0 (UU ′) = Eθ0 (U0U′0) + Eθ0 (U1U

′1) and is minimal for U1 = 0. Then

∥∥Eθ0 (∇θh (θ0))∥∥2

K

= Eθ0 (U0U′0)

= Eθ0

Eθ0 [∇θρ (Y, Z; θ0) |Z] Eθ0

[ρ (Y, Z; θ0)

2 |Z]−1Eθ0 [∇θρ (Y, Z; θ0) |Z]′

.

Its inverse coincides with the semi-parametric efficiency bound derived by Chamberlain(1987).

Note that in Examples 2 and 3, the GMM estimator reaches the Cramer Rao boundasymptotically, while in Example 4 it reaches the semi-parametric efficiency bound.

6.2.6. Testing overidentifying restrictions

Hansen (1982) proposes a test of specification, which basically tests whether the overiden-tifying restrictions are close to zero. Carrasco and Florens (2000) propose the analogueto Hansen’s J test in the case where there is a continuum of moment conditions. Let

pn =n∑

j=1

λ2

j

λ2

j + αn

, qn = 2n∑

j=1

λ4

j(λ

2

j + αn

)2

where λj are the eigenvalues of Kn as described earlier.

Proposition 6.8. Under the assumptions of Theorem 10 of Carrasco and Florens (2000),we have

τn =

∥∥∥(Kαnn )−1/2 hn

(θn

)∥∥∥2

− pn

qn

d→ N (0, 1)

as αn goes to zero and nα3n goes to infinity.

This test can also be used for testing underidentification. Let θ0 ∈ R be such thatE [h (X, θ0)] = 0. Arellano, Hansen and Sentana (2005) show that the parameter, θ0,is locally unidentified if E [h (X, θ)] = 0 for all θ ∈ R. It results in a continuum ofmoment conditions indexed by θ. Arellano et al. (2005) apply τn to test for the null ofunderidentification.

6.2.7. Extension to time series

So far, the data were assumed to be iid. Now we relax this assumption. Let x1, ..., xT bethe observations of a time series Xt that satisfies some mixing conditions. Inference willbe based on moment functions h (τ , Xt; θ0) indexed by a real, possibly multidimensionalindex τ . h (τ , Xt; θ0) are in general autocorrelated, except in some special cases, anexample of which will be discussed below.

89

Example 5 (Conditional characteristic function). Let Yt be a (scalar) Markovprocess and assume that the conditional characteristic function (CF) of Yt+1 given Yt,ψθ (τ |Yt) ≡ Eθ [exp (iτYt+1) |Yt] , is known. The following conditional moment conditionholds

Eθ[eiτYt+1 − ψθ (τ |Yt) |Yt

]= 0.

Denote Xt = (Yt, Yt+1)′. Let g (Yt) be an instrument so that

h (τ , Xt; θ) =(eiτYt+1 − ψθ (τ |Yt)

)g (Yt)

satisfies the identification condition (6.5). h (τ , Xt; θ) is a martingale difference se-quence and is therefore uncorrelated. The use of the conditional CF is very popular infinance. Assume that Yt, t = 1, 2, ..., T is a discretely sampled diffusion process, then Yt

is Markov. While the conditional likelihood of Yt+1 given Yt does not have a closed formexpression, the conditional CF of affine diffusions is known. Hence GMM can replace MLEto estimate these models where MLE is difficult to implement. For an adequate choiceof the instrument g (Yt), the GMM estimator is asymptotically as efficient as the MLE.The conditional CF has been recently applied to the estimation of diffusions by Singleton(2001), Chacko and Viceira (2003), and Carrasco et al. (2004). The first two papers useGMM based on a finite grid of values for τ , whereas the last paper advocates using thefull continuum of moments which permits us to achieve efficiency asymptotically.

Example 6 (Joint characteristic function). Assume Yt is not Markov. In thatcase, the conditional CF is usually unknown. On the other hand, the joint characteristicfunction may be calculated explicitly (for instance when Yt is an ARMA process with sta-ble error, see Knight and Yu, 2002; or Yt is the growth rate of a stochastic volatility model,see Jiang and Knight, 2002) or may be estimated via simulations (this technique is devel-oped in Carrasco et al., 2004). Denote ψθ (τ) ≡ Eθ [exp (τ 1Yt + τ 2Yt+1 + ... + τL+1Yt+L)]with τ = (τ 1, ..., τL)′ , the joint CF of Xt ≡ (Yt, Yt+1, ..., Yt+L)′ for some integer L ≥ 1.Assume that L is large enough for

h (τ , Xt; θ) = eiτ ′Xt − ψθ (τ)

to identify the parameter θ. Here h (τ ,Xt; θ) are autocorrelated. Knight and Yu (2002)estimate various models by minimizing the following norm of h (τ , Xt; θ) :

∫ (1

T

T∑t=1

eiτ ′xt − ψθ (τ)

)2

e−τ ′τdτ .

This is equivalent to minimizing∥∥∥B 1

T

∑Tt=1 h (τ , Xt; θ)

∥∥∥2

with B = e−τ ′τ/2. This choice

of B is suboptimal but has the advantage of being easy to implement. The optimalweighting operator is, as before, the square root of the inverse of the covariance operator.Its estimation will be discussed shortly.

90

Under some mixing conditions on h (τ , Xt; θ0) , the process hT (θ0) = 1T

∑Tt=1 h (τ , Xt; θ0)

follows a functional CLT (see Subsection 2.4.2):

√T hT (θ0)

L→ N (0, K)

where the covariance operator K is an integral operator with kernel

k (τ 1, τ 2) =+∞∑

j=−∞Eθ0

[h (τ 1, Xt; θ0) h (τ 2, Xt−j; θ0)

].

The kernel k can be estimated using a kernel-based estimator as those described in An-drews (1991) and references therein. Let ω : R → [−1, 1] be a kernel satisfying theconditions stated by Andrews. Let q be the largest value in [0, +∞) for which

ωq = limu→∞

1− ω (u)

|u|q

is finite. In the sequel, we will say that ω is a q−kernel. Typically, q = 1 for the Bartlettkernel and q = 2 for Parzen, Tuckey-Hanning and quadratic spectral kernels. We define

kT (τ 1, τ 2) =T

T − d

T−1∑j=−T+1

ω

(j

ST

)ΓT (j) (6.14)

with

ΓT (j) =

1T

∑Tt=j+1 h

(τ 1, Xt; θ

1

T

)h

(τ 2, Xt−j; θ

1

T

), j ≥ 0

1T

∑Tt=−j+1 h

(τ 1, Xt+j; θ

1

T

)h

(τ 2, Xt; θ

1

T

), j < 0

(6.15)

where ST is some bandwidth that diverges with T and θ1

T is a T 1/2−consistent estimatorof θ. Let KT be the integral estimator with kernel kT . Under some conditions on ωand h (τ , Xt; θ0) , Carrasco et al. (2004) establish the rate of convergence of KT to K.Assuming S2q+1

T /T → γ ∈ (0, +∞) , we have

‖KT −K‖ = Op

(T−q/(2q+1)

).

The inverse of K is estimated using the regularized inverse of KT , KαTT = (K2

T + αT I)−1

KT

for a penalization term αT ≥ 0. As before, the optimal GMM estimator is given by

θT = arg minθ∈Θ

∥∥∥(KαTT )−1/2 hT (θ)

∥∥∥ .

Carrasco et al. (2004) show the following result.

91

Proposition 6.9. Assume that ω is a q−kernel and that S2q+1T /T → γ ∈ (0, +∞) . We

have

√T (θT − θ0)

L→ N(0,

(⟨Eθ0 (∇θh) , Eθ0 (∇θh)′

⟩K

)−1)

(6.16)

as T and T q/(2q+1)αT go to infinity and αT goes to zero.

Note that the implementation of this method requires two smoothing parameters αT

and ST . No cross-validation method for selecting these two parameters simultaneouslyhas been derived yet. If ht is uncorrelated, then K can be estimated using the sampleaverage and the resulting estimator satisfies ‖KT −K‖ = Op

(T−1/2

). When ht are

correlated, the convergence rate of KT is slower and accordingly the rate of convergenceof αT to zero is slower.

7. Estimating solutions of integral equations of the second kind

7.1. Introduction

The objective of this section is to study the properties of the solution of an integralequation of the second kind (also called Fredholm equation of the second type) definedby:

(I −K)ϕ = r (7.1)

where ϕ is an element of a Hilbert space H, K is a compact operator from H to H and ris an element of H. As in the previous sections, K and r are known functions of a datagenerating process characterized by its c.d.f. F, and the functional parameter of interestis the function ϕ.

In most cases, H is a functional space and K is an integral operator defined by itskernel k. Equation (7.1) becomes:

ϕ(t)−∫

k(t, s)ϕ(s)Π(ds) = r(t) (7.2)

The estimated operators are often degenerate, see Subsection 2.5.1. and in that case,Equation (7.2) simplifies into:

ϕ(t)−L∑

`=1

a`(ϕ)ε`(t) = r(t) (7.3)

where the a`(ϕ) are linear forms on H and ε` belongs to H for any `.The essential difference between equations of the first kind and of the second kind

is the compactness of the operator. In (7.1), K is compact but I − K is not compact.Moreover, if I−K is one-to-one, its inverse is bounded. In that case, the inverse problem

92

is well-posed. Even if I −K is not one-to-one, the ill-posedness of equation (7.1) is lesssevere than in the first kind case because the solutions are stable in r.

In most cases, K is a self-adjoint operator (and hence I−K is also self-adjoint) but wewill not restrict our presentation to this case. On the other hand, Equation (7.1) can beextended by considering an equation (S −K)ϕ = r where K is a compact operator fromH to E (instead of H to H) and S is a one-to-one bounded operator from H to E with abounded inverse. Indeed, (S −K)ϕ = r ⇔ (I − S−1K) ϕ = S−1r where S−1K : H → His compact. So that we are back to Equation (7.1), see Corollary 3.6. of Kress (1999).

This section is organized in the following way. The next paragraph recalls the mainmathematical properties of the equations of the second kind. The two following para-graphs present the statistical properties of the solution in the cases of well-posed andill-posed problems, and the last paragraph applies these results to the two examples givenin Section 1.

The implementation of the estimation procedures is not discussed here because it issimilar to the implementation of the estimation of a regularized equation of the firstkind (see Section 3). Actually, regularizations transform first kind equations into secondkind equations and the numerical methods are then formally equivalent, even though thestatistical properties are fundamentally different.

7.2. Riesz theory and Fredholm alternative

We first briefly recall the main results about equations of the second kind as they weredeveloped at the beginning of the 20th century by Fredholm and Riesz. The statementsare given without proofs (see e.g. Kress, 1999, Chapters 3 and 4).

Let K be a compact operator from H to H and I be the identity on H (which iscompact only ifH is finite dimensional). Then, the operator I−K has a finite dimensionalnull space and its range is closed. Moreover, I−K is injective if and only if it is surjective.In that case I −K is invertible and its inverse (I −K)−1 is a bounded operator.

An element of the null space of I − K verifies Kϕ = ϕ, and if ϕ 6= 0, it is aneigenfunction of K associated with the eigenvalue equal to 1. Equivalently, the inverseproblem (7.1) is well-posed if and only if 1 is not an eigenvalue of K. The Fredholmalternative follows from the previous results.

Theorem 7.1 (Fredholm alternative). Let us consider the two equations of the sec-ond kind:

(I −K)ϕ = r (7.4)

and

(I −K∗)ψ = s (7.5)

where K∗ is the adjoint of K. Then:

93

i) Either the two homogeneous equations (I −K)ϕ = 0 and (I −K∗)ψ = 0 only havethe trivial solutions ϕ = 0 and ψ = 0. In that case, (7.4) and (7.5) have a uniquesolution for any r and s in H

ii) or the two homogeneous equations (I − K)ϕ = 0 and (I − K∗)ψ = 0 have thesame finite number m of linearly independent solutions ϕj and ψj (j = 1, ..., m)respectively, and the solutions of (7.4) and (7.5) exist if and only if 〈ψj, r〉 = 0 and〈ϕj, s〉 = 0 for any j = 1, ..., m.

(ii) means that the null spaces of I −K and I −K∗ are finite dimensional and havesame dimensions. Moreover, the ranges of I −K and I −K∗ satisfy

R (I −K) = N (I −K∗)⊥ ,

R (I −K∗) = N (I −K)⊥ .

7.3. Well-posed equations of the second kind

In this subsection, we assume that I −K is injective. In this case, the problem is well-posed and the asymptotic properties of the solution are easily deduced from the propertiesof the estimation of the operator K and of the right-hand side r.

The starting point of this analysis is the relation:

ϕn − ϕ0 =(I − Kn

)−1

rn − (I −K)−1 r

=(I − Kn

)−1

(rn − r) +

[(I − Kn

)−1

− (I −K)−1

]r

=(I − Kn

)−1 [rn − r +

(Kn −K

)(I −K)−1 r

]

=(I − Kn

)−1 [rn − r +

(Kn −K

)ϕ0

](7.6)

where the third equality follows from A−1 −B−1 = A−1 (B − A) B−1.

Theorem 7.2. If

i)∥∥∥Kn −K

∥∥∥ = o (1)

ii)∥∥∥(rn + Knϕ0

)− (r + Kϕ0)

∥∥∥ = O

(1

an

)

Then ‖ϕn − ϕ0‖ = O

(1

an

)

94

Proof. As I − K is invertible and admits a continuous inverse, i) implies that

‖(I − Kn

)−1

‖ converges to∥∥(I −K)−1

∥∥ and the result follows from (7.6).

In some cases ‖r− rn‖ = O( 1bn

) and ‖Kn−K‖ = O( 1dn

). Then 1an

= 1bn

+ 1dn

. In someparticular examples, as will be illustrated in the last subsection, the asymptotic behaviorof rn − Knϕ is directly considered.

Asymptotic normality can be obtained from different sets of assumptions. The follow-ing theorems illustrate two kinds of asymptotic normality.

Theorem 7.3. If

i)∥∥∥Kn −K

∥∥∥ = o (1)

ii) an

((rn + Knϕ0

)− (r + Kϕ0)

)=⇒ N (0, Σ) (weak convergence in H)

Then

an (ϕn − ϕ0) =⇒ N (0, (I −K)−1 Σ (I −K∗)−1) .

Proof. The proof follows immediately from (7.6) and Theorem 2.47.

Theorem 7.4. We consider the case where H = L2(Rp, π). If

i) ‖Kn −K‖ = o(1)

ii) ∃an s.t an

[(rn + Knϕ0

)− (r + Kϕ0)

](x)

d→ N (0, σ2 (x)) , ∀x ∈ Rp

iii) ∃bn s.t an

bn= o(1) and

bnK[(

rn + Knϕ)− (r + Kϕ0)

]=⇒ N (0, Ω) (weak convergence in H)

Then

an (ϕn − ϕ0) (x)d→ N (

0, σ2 (x)), ∀x.

Proof. Using

(I −K)−1 = I + (I −K)−1K

we deduce from (7.6):

an(ϕn − ϕ0)(x) = an

(I − Kn)−1

[rn + Knϕ0 − r −Kϕ0

]

= an(rn + Knϕ0 − r −Kϕ0)(x)

(7.7)

+an

bn

bn(I − Kn)−1Kn(rn + Knϕ0 − r −Kϕ0)

(x)

95

The last term in brackets converges (weakly in L2) to a N (0, (I −K)−1Ω(I −K)−1) andthe value of this function at any point x also converges to a normal distribution (weakconvergence implies finite dimensional convergence). Then the last term in brackets isbounded and the result is verified.

Note that condition (iii) is satisfied as soon as premultiplying by K increases the rateof convergence of rn + Knϕ. This is true in particular if K is an integral operator.

We illustrate these results by the following three examples. The first example is anillustrative example, while the other two are motivated by relevant econometric issues.

Example. Consider L2(R, Π) and (Y, Z) is a random element of R × L2(R, Π). Westudy the integral equation of the second kind defined by

ϕ(x) +

∫EF (Z(x)Z(y)) ϕ(y)Π (dy) = EF (Y Z(x)) (7.8)

denoted by ϕ+V ϕ = r. Here K = −V . As the covariance operator, V is a positive opera-tor, K is a negative operator and therefore 1 can not be an eigenvalue of K. Consequently,Equation (7.8) defines a well-posed inverse problem.

We assume that an i.i.d. sample of (Y, Z) is available and the estimated equation(7.8) defines the parameter of interest as the solution of an integral equation having thefollowing form:

ϕ(x) +1

n

n∑i=1

zi(x)

∫zi(y)ϕ(y)Π (dy) =

1

n

n∑i=1

yizi(x) (7.9)

Under some standard regularity conditions, one can check that ‖Vn − V ‖ = O(

1√n

)and

that

√n

1

n

∑i

zi(·)

[yi −

∫zi(y)ϕ(y)Π(dy)

]− EF (Y Z (·))−

∫EF (Z(.)Z(y))ϕ(y)Π(dy)

⇒ N (0, Σ) in L2(R, Π).

Suppose for instance that EF (Y |Z) =∫

Z(y)ϕ(y)Π(dy). Under a homoscedasticity hy-pothesis, the operator Σ is a covariance operator with kernel σ2EF (Z(x)Z(y)) where

σ2 = V ar

(Y −

∫Z(y)ϕ(y)Π(dy)|Z

).

Then, from Theorem 7.3,

√n (ϕn − ϕ0) ⇒ N (

0, σ2(I + V )−1V (I + V )−1).

96

Example. Rational expectations asset pricing models:Following Lucas (1978), rational expectations models characterize the pricing func-

tional as a function ϕ of the Markov state solution of an integral equation:

ϕ (x)−∫

a(x, y)ϕ (y) f (y|x) dy =

∫a(x, y)b(y)f (y|x) dy (7.10)

While f is the transition density of the Markov state, the function a denotes the marginalrate of substitution and b the dividend function. For the sake of expositional simplicity,we assume here that the functions a and b are both known while f is estimated nonpara-metrically by a kernel method. Note that if the marginal rate of substitution a involvessome unknown preference parameters (subjective discount factor, risk aversion param-eter), they will be estimated, for instance by GMM, with a parametric root n rate ofconvergence. Therefore, the nonparametric inference about ϕ (deduced from the solutionof (7.10) using a kernel estimation of f) is not contaminated by this parametric estima-tion; all the statistical asymptotic theory can be derived as if the preference parameterswere known.

As far as kernel density estimation is concerned, it is well known that under mildconditions (see e.g. Bosq (1998)) it is possible to get the same convergence rates and thesame asymptotic distribution with stationary strongly mixing stochastic processes as inthe i.i.d. case.

Let us then consider a n-dimensional stationary stochastic process Xt and H thespace of square integrable functions of one realization of this process. In this example, His defined with respect to the true distribution. The operator K is defined by

Kϕ (x) = EF (a (Xt−1, Xt) ϕ (Xt) |Xt−1 = x)

and

r (x) = EF (a (Xt−1, Xt) b(Xt)|Xt−1 = x)

We will assume that K is compact through possibly a Hilbert-Schmidt condition (seeAssumption A.1 of Section 5.5 for such a condition). A common assumption in rationalexpectation models is that K is a contraction mapping, due to discounting. Then, 1 isnot an eigenvalue of K and (7.10) is a well-posed Fredholm integral equation.

Under these hypotheses, both numerical and statistical issues associated with thesolution of (7.10) are well documented. See Rust, Traub and Wozniakowski (2002) andreferences therein for numerical issues. The statistical consistency of the estimator ϕn

obtained from the kernel estimator Kn is deduced from Theorem 7.2 above. Assumptioni) is satisfied because Kn − K has the same behavior as the conditional expectationoperator and

rn + Knϕ− r −Kϕ= EFn (a (Xt−1, Xt) (b(Xt) + ϕ (Xt)) |Xt−1)−EF (a (Xt−1, Xt) (b(Xt) + ϕ (Xt)) |Xt−1)

97

converges at the speed 1an

=(

1ncm

n+ c4

n

)1/2

if cn is the bandwidth of the (second order)

kernel estimator and m is the dimension of X.The weak convergence follows from Theorem 7.4. Assumption ii) of Theorem 7.4 is

the usual result on the normality of kernel estimation of conditional expectation. As K isan integral operator, the transformation by K increases the speed of convergence, whichimplies iii) of Theorem 7.4.

Example. Partially Nonparametric forecasting model:This example is drawn from Linton and Mammen (2003). Nonparametric prediction

of a stationary ergodic scalar random process Xt is often performed by looking for apredictor m (Xt−1, ..., Xt−d) able to minimize the mean square error of prediction:

E [Xt −m (Xt−1, ..., Xt−d)]2

In other words, if m can be any squared integrable function, the optimal predictor isthe conditional expectation

m0 (Xt−1,...,Xt−d) = E [Xt|Xt−1,...,Xt−d]

and can be estimated by kernel smoothing or any other nonparametric way of estimating aregression function. The problems with this kind of approach are twofold. First, it is oftennecessary to include many lagged variables and the resulting nonparametric estimationsurface suffers from the well-known “curse of dimensionality”. Second, it is hard todescribe and interpret the estimated regression surface when the dimension is more thantwo.

A solution to deal with these problems is to think about a kind of nonparametricgeneralization of ARMA processes. For this purpose, let us consider semiparametricpredictors of the following form

E [Xt|It−1] = mϕ (θ, It−1) =∞∑

j=1

aj (θ) ϕ (Xt−j) (7.11)

where θ is an unknown finite dimensional vector of parameters, aj (.) , j ≥ 1 are knownscalar functions, and ϕ (.) is the unknown functional parameter of interest. The notation

E [Xt|It−1] = mϕ (θ, It−1)

stresses the fact that the predictor depends on the true unknown value of the parametersθ and ϕ, and of the information It−1 available at time (t− 1) as well. This information isactually the σ-field generated by Xt−j, j ≥ 1. A typical example is

aj (θ) = θj−1 for j ≥ 1 with 0 < θ < 1. (7.12)

Then the predictor defined in (7.11) is actually characterized by

mϕ (θ, It−1) = θmϕ (θ, It−2) + ϕ (Xt−1) (7.13)

98

In the context of volatility modelling, Xt would denote a squared asset return overperiod [t− 1, t] and mϕ (θ, It−1) the so-called squared volatility of this return as expectedat the beginning of the period. Engle and Ng (1993) have studied such a partially non-parametric (PNP for short) model of volatility and called the function ϕ the “news impactfunction”. They proposed an estimation strategy based on piecewise linear splines. Notethat the PNP model includes several popular parametric volatility models as special cases.For instance, the GARCH (1,1) model of Bollerslev (1986) corresponds to ϕ (x) = w +αxwhile the Engle (1990) asymmetric model is obtained for ϕ (x) = w + α (x + δ)2 . Moreexamples can be found in Linton and Mammen (2003).

The nonparametric identification and estimation of the news impact function can bederived for a given value of θ. After that, a profile criterion can be calculated to estimateθ. In any case, since θ will be estimated with a parametric rate of convergence, theasymptotic distribution theory of a nonparametric estimator of ϕ is the same as if θwere known. For the sake of notational simplicity, the dependence on unknown finitedimensional parameters θ is no longer made explicit.

At least in the particular case (7.12)-(7.13), ϕ is easily characterized as the solutionof a linear integral equation of the first kind

E [Xt − θXt−1|It−2] = E [ϕ (Xt−1) |It−2]

Except for its dynamic features, this problem is completely similar to the nonparametricinstrumental regression example described in Section 5.5. However, as already mentioned,problems of the second kind are often preferable since they may be well-posed. As shownby Linton and Mammen (2003) in the particular case of a PNP volatility model, it isactually possible to identify and consistently estimate the function ϕ defined as

ϕ = arg minϕ

E

[Xt −

∞∑j=1

ajϕ (Xt−j)

]2

(7.14)

from a well-posed linear inverse problem of the second kind. When ϕ is an element of theHilbert space L2

F (X), its true unknown value is characterized by the first order conditionsobtained by differentiating in the direction of any vector h

E

[(Xt −

∞∑j=1

ajϕ (Xt−j)

)( ∞∑

l=1

alh (Xt−l)

)]= 0

99

In other words, for any h in L2F (X)

∞∑j=1

ajEX [E [Xt|Xt−j = x] h (x)]

−∞∑

j=1

a2jE

X [ϕ (x) h (x)]

−∞∑

j=1

∞∑

l=1l 6=j

ajalEX [E [ϕ (Xt−l) |Xt−j = x] h(x)] = 0

(7.15)

where EX denotes the expectation with respect to the stationary distribution of Xt. Asthe equality in (7.15) holds true for all h, it is true in particular for a complete sequenceof functions of L2

F (X). It follows that

∞∑j=1

ajE [Xt|Xt−j = x]−( ∞∑

l=1

a2l

)ϕ (x)

−∞∑

j=1

∞∑

l 6=j

ajalE [ϕ (Xt−l) |Xt−j = x] = 0

PX− almost surely on the values of x. Let us denote

rj (Xt) = E [Xt+j|Xt] and Hk (ϕ) (Xt) = E [ϕ (Xt+k) |Xt] .

Then, we have proved that the unknown function ϕ of interest must be the solution ofthe linear inverse problem of the second kind

A (ϕ, F ) = (I −K) ϕ− r = 0 (7.16)

where

r =

( ∞∑j=1

a2j

)−1 ∞∑j=1

ajrj,

K = −( ∞∑

j=1

a2j

)−1 ∞∑j=1

∑

l 6=j

ajalHj−l,

and, with a slight change of notation, F now characterizes the probability distribution ofthe stationary process (Xt) .

To study the inverse problem (7.16), it is first worth noticing that K is a self-adjointintegral operator. Indeed, while

K =

( ∞∑j=1

a2j

)−1 +∞∑

h=±1

Hk

+∞∑

l=max[1,1−k]

alal+k

100

we immediately deduce from Subsection 2.5.1 that the conditional expectation operatorHk is such that

H∗k = H−k

and thus K = K∗, since

+∞∑

l=max[1,1−k]

alal+k =+∞∑

l=max[1,1+k]

alal−k

As noticed by Linton and Mammen (2003), this property greatly simplifies the practicalimplementation of the solution of the sample counterpart of equation (7.19). Even moreimportantly, the inverse problem (7.19) will be well-posed as soon as one maintains thefollowing identification assumption about the news impact function ϕ.

Assumption A. There exists no θ and ϕ ∈ L2F (X) with ϕ 6= 0 such that∑∞

j=1 aj (θ) ϕ (Xt−j) = 0 almost surely.

To see this, observe that Assumption A means that for any non-zero function ϕ

0 < E

[ ∞∑j=1

ajϕ (Xt−j)

]2

,

that is

0 <

∞∑j=1

a2j 〈ϕ, ϕ〉+

∞∑j=1

∞∑

l=1l 6=j

alaj 〈ϕ,Hj−lϕ〉 .

Therefore

0 < 〈ϕ, ϕ〉 − 〈ϕ,Kϕ〉 (7.17)

for non zero ϕ. In other words, there is no non-zero ϕ such that

Kϕ = ϕ

and the operator (I −K) is one-to-one. Moreover, (7.17) implies that (I −K) has eigen-values bounded from below by a positive number. Therefore, if K depends continuouslyon the unknown finite dimensional vector of parameters θ and if θ evolves in some compactset, the norm of

(I −K)−1 will be bounded from above uniformly on θ.It is also worth noticing that the operator K is Hilbert-Schmidt and a fortiori compact

under reasonable assumptions. As already mentioned in Subsection 2.5.1, the Hilbert-Schmidt property for the conditional expectation operator Hk is tantamount to the inte-grability condition

∫ ∫ [fXt,Xt−k

(x, y)

fXt(x)fXt (y)

]2

fXt (x) fXt (y) dxdy < ∞

101

It amounts to saying that there is not too much dependence between Xt and Xt−k. Thisshould be tightly related to the ergodicity or mixing assumptions about the stationaryprocess Xt. Then, if all the conditional expectation operator Hk, k ≥ 1 are Hilbert-Schmidt, the operator K will also be Hilbert-Schmidt insofar as

∞∑j=1

∑

l 6=j

a2ja

2l < +∞.

Up to a straightforward generalization to stationary mixing processes of results onlystated in the i.i.d. case, the general asymptotic theory of Theorems 7.3 and 7.4 can thenbe easily applied to nonparametric estimators of the news impact function ϕ based onthe Fredholm equation of the second kind (7.15). An explicit formula for the asymptoticvariance of ϕn as well as a practical solution by implementation of matricial equationssimilar to those of Subsection 3.4 (without need of a Tikhonov regularization) is providedby Linton and Mammen (2003) in the particular case of volatility modelling.

However, an important difference with the i.i.d. case (see for instance assumption A.3in Section 5.5 about instrumental variables) is that the conditional homoskedasticity as-sumption cannot be maintained about the conditional probability distribution of Xt givenits own past. This should be particularly detrimental in the case of volatility modelling,since when Xt denotes a squared return, it will in general be even more conditionallyheteroskedastic than returns themselves. Such severe conditional heteroskedasticity willlikely imply a poor finite sample performance, and a large asymptotic variance of the esti-mator ϕn defined from the inverse problem (7.15), that is from the least squares problem(7.14). Indeed, ϕn is a kind of OLS estimator in infinite dimension. In order to bettertake into account conditional heteroskedasticity of Xt in the context of volatility mod-elling, Linton and Mammen (2003) propose to replace the least squares problem (7.14)by a quasi-likelihood kind of approach where the criterion to optimize is defined from thedensity function of a normal conditional probability distribution of returns, with variancemϕ (θ, It−1) . Then the difficulty is that the associated first order conditions now charac-terize the news impact function ϕ as solution of a nonlinear inverse problem. Linton andMammen (2003) suggest to work with a version of this problem which is locally linearizedaround the previously described least squares estimator ϕn (and associated consistentestimator of θ).

7.4. Ill-posed equations of the second kind

The objective of this section is to study equations (I −K)ϕ = r where 1 is an eigenvalueof K, i.e. where I − K is not injective (or one-to-one). For simplicity, we restrict ouranalysis to the case where the order of multiplicity of the eigenvalue 1 is one and theoperator K is self-adjoint. This implies that the dimension of the null spaces of I −K isone and using the results of Section 7.2, the space H may be decomposed into

H = N (I −K)⊕R(I −K)

102

i.e. H is the direct sum between the null space and the range of I −K, both closed. Wedenote by PN r the projection of r on N (I −K) and by PRr the projection of r on therange R(I −K).

Using ii) of Theorem 7.1, a solution of (I −K)ϕ = r exists in the non injective caseonly if r is orthogonal to N (I −K) or equivalently, if r belongs to R(I −K). In otherwords, a solution exists if and only if r = PRr. However in this case, the solution is notunique and there exists a one dimensional linear manifold of solutions. Obviously, if ϕis a solution, ϕ plus any element of N (I − K) is also a solution. This non uniquenessproblem will be solved by a normalization rule which selects a unique element in the setof solutions. The normalization we adopt is

〈ϕ, φ1〉 = 0 (7.18)

where φ1 is the eigenfunction of K corresponding to the eigenvalue equal to 1.In most statistical applications of equations of the second kind, the r element corre-

sponding to the true data generating process is assumed to be in the range of I−K whereK is also associated with the true DGP. However, this property is no longer true if F isestimated and we need to extend the resolution of (I −K)ϕ = r to cases where I −K isnot injective and r is not in the range of this operator. This extension must be done insuch a way that the continuity properties of inversion are preserved.

For this purpose we consider the following generalized inverse of (I − K). As K isa compact operator, it has a discrete spectrum λ1 = 1, λ2,... where only 0 may be anaccumulation point (in particular 1 cannot be an accumulation point). The associatedeigenfunctions are φ1, φ2, .... Then we define:

Lu =∞∑

j=2

1

1− λj

〈u, φj〉φj, u ∈ H (7.19)

Note that L = (I −K)† is the generalized inverse of I − K, introduced in Section 3.Moreover, L is continuous and therefore bounded because 1 is an isolated eigenvalue. Thisoperator computes the unique solution of (I −K)ϕ = PRu satisfying the normalizationrule (7.18). It can be easily verified that L satisfies:

LPR = L = PRL,

L(I −K) = (I −K)L = PR. (7.20)

We now consider estimation. For an observed sample, we obtain the estimator Fn ofF (that may be built from a kernel estimator of the density) and then the estimators rn

and Kn of r and K respectively. Let φ1, φ2, ... denote the eigenfunctions of Kn associatedwith λ1, λ2, ... We restrict our attention to the cases where 1 is also an eigenvalue ofmultiplicity one of Kn (i.e. λ1 = 1). However, φ1 may be different from φ1.

We have to make a distinction between two cases. First, assume that the Hilbertspace H of reference is known and in particular the inner product is given (for exampleH = L2(Rp, Π) with Π given). The normalization rule imposed to ϕn is

〈ϕn, φ1〉 = 0

103

and Ln is the generalized inverse of I − Kn in H (which depends on the Hilbert spacestructure) where

Lnu =∞∑

j=2

1

1− λj

〈u, φj〉φj, u ∈ H

Formula (7.20) applies immediately for Fn.However, if the Hilbert spaceH depends on F (e.g. H = L2(Rp, F )), we need to assume

that L2(R, Fn) ⊂ L2(Rp, F ). The orthogonality condition, which defines the normalizationrule (7.18) is related to L2(Rp, F ) but the estimator ϕn of ϕ will be normalized by

〈ϕn, φ1〉n = 0

where 〈 , 〉n denotes the inner product relative to Fn. This orthogonality is different froman orthogonality relative to 〈 , 〉. In the same way Ln is now defined as the generalizedinverse of I − Kn with respect to the estimated Hilbert structure, i.e.

Lnu =∞∑

j=2

1

1− λj

〈u, φj〉nφj

and Ln is not the generalized inverse of I − Kn in the original space H. The advan-tages of this definition are that Ln may be effectively computed and satisfies the for-mula (7.20) where Fn replaces F . In the sequel PRn denotes the projection operator on

Rn = R(I − Kn

)for the inner product < .,.>n.

To establish consistency, we will use the following equality.

Ln − L = Ln(Kn −K)L

+ Ln(PRn − PR) + (PRn − PR)L. (7.21)

It follows from (7.20) and Ln − L = LnPRn − PRL = Ln(PRn − PR) + (PRn − PR)L −PRnL + LnPr and Ln (Kn −K) L = Ln (Kn − I) L + Ln (I −K) L = PRnL + LnPr.

The convergence property is given by the following theorem.

Theorem 7.5. Let us define ϕ0 = Lr and ϕn = Lnrn. If

i)∥∥∥Kn −K

∥∥∥ = o (1)

ii) ‖PRn − PR‖ = O(

1bn

)

iii)∥∥∥(rn + Knϕ0)− (r + Kϕ0)

∥∥∥ = O(

1an

)

104

Then

‖ϕn − ϕ0‖ = O

(1

an

+1

bn

).

Proof. The proof is based on:

ϕn − ϕ0 = Lnrn − Lr

= Ln(rn − r) + (Ln − L)r

= Ln(rn − r) + Ln(Kn −K)ϕ0 (7.22)

+ Ln (PRn − PR) r + (PRn − PR)ϕ0

deduced from (7.21). Then

‖ϕn − ϕ0‖ ≤ ‖Ln‖‖(rn + Knϕ0)− (r + Kϕ0)‖+ (‖Ln‖‖r‖+ ‖ϕ‖)‖PRn − PR‖ (7.23)

Under i) and ii) ‖Ln − L‖ = o(1) from (7.21). This implies ‖Ln‖ → ‖L‖ and the resultfollows.

Ifan

bn

= O (1) , the actual speed of convergence is bounded by1

an

. This will be the

case in the two examples of 7.5 wherean

bn

→ 0.

We consider asymptotic normality in this case. By (7.20), we have Ln = PRn + LnKn,hence:

ϕn − ϕ0

= PRn

[(rn + Knϕ0)− (r + Kϕ0)

](7.24)

+ LnKn

[(rn + Knϕ0)− (r + Kϕ0)

](7.25)

+ Ln(PRn − PR)r + (PRn − PR)ϕ0 (7.26)

Let us assume that there exists a sequence an such that i) and ii) below are satisfied

i) anPRn

[(rn + Knϕ0)− (r + Kϕ0)

](x) has an asymptotic normal distribution,

ii) an

[LnKn(rn + Knϕ0 − r −Kϕ0)

](x) → 0, an

[Ln (PRn − PR) r

](x) → 0,

and an [(PRn − PR)ϕ0] (x) → 0 in probability.Then the asymptotic normality of an(ϕn − ϕ0) is driven by the behavior of (7.24).

This situation occurs in the nonparametric estimation, as illustrated in the next section.

105

7.5. Two examples: backfitting estimation in additive models and panel model

7.5.1. Backfitting estimation in additive models

Using the notation of Subsection 1.3.5, an additive model is defined by

(Y, Z, W ) ∈ R× Rp × Rq

Y = ϕ(Z) + ψ(W ) + UE(U |Z, W ) = 0,

(7.27)

in which case (see (1.24)), the function ϕ is solution of the equation:

ϕ− E [E(ϕ (Z) |W )|Z] = E(Y |Z)− E [E(Y |W )|Z]

and ψ is the solution of an equation of the same nature obtained by a permutation ofW and Z. The backfitting algorithm of Breiman and Friedman (1985) , and Hastie andTibshirani (1990) is widely used to estimate ϕ and ψ in Equation (7.27).Mammen, Linton,and Nielsen (1999) derive the asymptotic distribution of the backfitting procedure. Al-ternatively, Newey (1994), Tjostheim and Auestad (1994), and Linton and Nielsen (1995)propose to estimate ϕ (respectively ψ) by integrating an estimator of E [Y |Z = z, W = w]with respect to w (respectively z).

We focus our presentation on the estimation of ϕ. It appears as the result of a linearequation of the second kind. More precisely, we have in that case:

• H is the space of the square integrable functions of Z with respect to the true datagenerating process. This definition simplifies our presentation but an extension todifferent spaces is possible.

• The unknown function ϕ is an element of H. Actually, asymptotic considerationswill restrict the class of functions ϕ by smoothness restrictions.

• The operator K is defined by Kϕ = E [E(ϕ (Z) |W )|Z]. This operator is self adjointand we assume its compactness. This compactness may be obtained through theHilbert Schmidt Assumption A.1 of Subsection 5.5.

• The function r is equal to E(Y |Z)− E [E(Y |W )|Z].

The operator I − K is not one-to-one because the constant functions belong to thenull space of this operator. Indeed, the additive model (7.27) does not identify ϕ and ψ.We introduce the following assumption (see Florens, Mouchart, and Rolin (1990)), whichwarrants that ϕ and ψ are exactly identified up to an additive constant, or equivalentlythat the null space of I −K only contains the constants (meaning 1 is an eigenvalue ofK of order 1).

Identification assumption. Z and W are measurably separated w.r.t. the distri-bution F, i.e. a function of Z almost surely equal to a function of W is almost surelyconstant.

106

This assumption implies that if ϕ1, ϕ2, ψ1, ψ2 are such that E(Y |Z, W ) = ϕ1(Z) +ψ1(W ) = ϕ2(Z) + ψ2(W ) then ϕ1(Z) − ϕ2(Z) = ψ2(W ) − ψ1(W ) which implies thatϕ1 − ϕ2 and ψ2 − ψ1 are a.s. constant. In terms of the null set of I −K we have:

Kϕ = ϕ

⇐⇒ E [E(ϕ (Z) |W )|Z] = ϕ (Z)

=⇒ E[(E [ϕ (Z) |W ])2] = E [ϕ (Z) E (ϕ(Z)|W )]

= E(ϕ2 (Z)

).

But, by Pythagore theorem

ϕ(Z) = E(ϕ (Z) |W ) + υ

E(ϕ2 (Z)

)= E

((E (ϕ (Z) |W ))2) + Eυ2.

Then:

Kϕ = ϕ =⇒ υ = 0

⇔ ϕ(Z) = E [ϕ(Z) |W ] .

Then, if ϕ is an element of the null set of I −K, ϕ is almost surely equal to a function ofW and is therefore constant.

The eigenvalues of K are real, positive and smaller than 1 except for the first one,that is 1 = λ1 > λ2 > λ3 > ...1 The eigenfunctions are such that φ1 = 1 and the condition〈ϕ, φ1〉 = 0 means that ϕ has an expectation equal to zero. The range of I −K is the setof functions with mean equal to 0 and the projection of u, PRu, equals u− E(u).

It should be noticed that under the hypothesis of the additive model, r has zero meanand is then an element of R(I −K). Then, a unique (up to the normalization condition)solution of the structural equation (I −K)ϕ = r exists.

The estimation may be done by kernel smoothing. The joint density is estimated by

fn(y, z, w) =1

nc1+p+qn

n∑i=1

ω

(y − yi

cn

)ω

(z − zi

cn

)ω

(w − wi

cn

)(7.28)

and Fn is the c.d.f. associated to fn. The estimated Kn operator verifies:

(Knϕ)(z) =

∫ϕ (u) an (u, z) du (7.29)

where

an (u, z) =

∫fn (., u, w) fn (., z, w)

fn (., ., w) fn (., z, .)dw.

1Actually K = T ∗T when Tϕ = E(ϕ|W ) and T ∗ψ = E(ψ|Z) when ψ is a function of W . Theeigenvalues of K correspond to the squared singular values of the T and T ∗ defined in Section 2.

107

The operator Kn must be an operator from H to H (it is by construction an operator

from L2Z(Fn) into L2

Z(Fn)). Therefore,ω( z−z`

cn)P

` ω( z−z`cn

)must be square integrable w.r.t. F .

The estimation of r by rn verifies

rn(z) =1

n∑`=1

ω(

z−z`

cn

)n∑

`=1

(y` −

n∑i=1

yiω`i

)ω

(z − z`

cn

)

where ω`i =ω

w` − wi

cn

!n∑

j=1

ω

w` − wj

cn

! .

The operator Kn also has 1 as the greatest eigenvalue corresponding to an eigenfunc-tion equal to 1. Since Fn is a mixture of probabilities for which Z and W are independent,the measurable separability between Z and W is fulfilled. Then, the null set of I − Kn

reduces a.s. (w.r.t. Fn) to constant functions. The generalized inverse of an operatordepends on the inner product of the Hilbert space because it is defined as the function ϕof minimal norm which minimizes the norm of Knϕ− rn. The generalized inverse in thespace L2

Z(F ) cannot be used for the estimation because it depends on the actual unknownF . Then we construct Ln as the generalized inverse in L2

Z(Fn) of I − Kn. The practical

computation of Ln can be done by computing the n eigenvalues of Kn, λ1 = 1, ..., λn andthe n eigenfunctions φ1 = 1, φ2, ..., φn. Then

Lnu =n∑

j=2

1

1− λj

∫u(z)φj(z)fn(z)dz

φj

It can be easily checked that property (7.20) is verified where PRn is the projection(w.r.t. Fn) on the orthogonal of the constant function. This operator subtracts from anyfunction its empirical mean, which is computed through the smoothed density:

PRnu = u− 1

ncpn

∑i

∫u(z)ω

(z − zi

cn

)dz

The right hand side of the equation (I− Kn)ϕ = rn has a mean equal to 0 (w.r.t. Fn).Hence, this equation has a unique solution ϕn = Lnϕ0 which satisfies the normalization

condition 1ncp

n

∑i

∫ϕn(z)ω

(z−zi

cn

)dz = 0.

The general results of Section 7.4 apply. First, we check that the conditions i) to iii)of Theorem 7.5 are fulfilled.

i) Under very general assumptions, ‖Kn −K‖ → 0 in probability.

108

ii) We have to check the properties of PRn − PR

(PRn − PR)ϕ =1

ncpn

∑i

∫ϕ(z)ω

(z − zi

cn

)dz −

∫ϕ(z)f(z)dz.

The asymptotic behavior of ‖(PRn−PR)ϕ‖2 =∣∣∣ 1ncp

n

∑ni=1

∫ϕ(z)ω

(z−zi

cn

)dz − E(ϕ)

∣∣∣2

is the same as the asymptotic behavior of the expectation of this positive randomvariable:

E

(1

ncpn

n∑i=1

∫ϕ(z)ω

(z − zi

cn

)dz − E(ϕ)

)2

.

Standard computation on this expression shows that this mean square error is

O

(1

n+ c

2min(d,d′)n

)‖ϕ‖2, where d is the smoothness degree of ϕ and d′ the order of

the kernel.

iii) The last term we have to consider is actually not computable but its asymptoticbehavior is easily characterized. We simplify the notation by denoting EFn(.|.) theestimation of a conditional expectation. The term we have to consider is

(rn + Knϕ)− (r + Kϕ) = EFn(Y |Z)− EFn(EFn(Y |W )|Z) + EFn(EFn(ϕ(Z)|W )|Z)− EF (Y |Z) + EF (EF (Y |W )|Z)− EF (EF (ϕ(Z)|W )|Z)= EFn

(Y − EF (Y |W ) + EF (ϕ (Z) |W ) |Z)

− EF(Y − EF (Y |W ) + EF (ϕ (Z) |W ) |Z)

− R

where R = EFEFn (Y − ϕ (Z) |W )− EF (Y − ϕ (Z) |W )

. Moreover

EF (Y |W ) = EF (ϕ (Z) |W ) + ψ (W ) .

Then(rn + Knϕ

)− (r + Kϕ) = EFn (Y − ψ (W ) |Z)− EF (Y − ψ (W ) |Z)

− R.

The R term converges to zero at a faster rate than the first part of the r.h.s. of thisequation and can be neglected. We have seen in the other parts of this chapter that

‖EFn(Y − ψ(W )|Z)− EF (Y − ψ(W )|Z)‖2 = 0

(1

ncpn

+ c2ρn

)

where ρ depends on the regularity assumptions. Therefore, Condition iii) of Theo-rem 7.5 is fulfilled.

109

From Theorem 7.5, it follows that ‖ϕn−ϕ0‖ → 0 in probability and that ‖ϕn−ϕ0‖ =

0

(1√ncp

n

+ cρn

).

The pointwise asymptotic normality of√

ncρn(ϕn(z)− ϕ0(z)) can now be established.

We apply the formulas (7.24) to (7.26) and Theorem 7.4.

1) First, consider (7.26). Under a suitable condition on cn (typically ncρ+2min(d,r)n → 0),

we have:√ncp

n

Ln(PRn − PR)r + (PRn − PR)ϕ0

→ 0 in probability.

2) Second, consider (7.25). Using the same argument as in Theorem 7.4, a suitablechoice of cn implies that

√ncρ

nLnKn

[(rn + Knϕ0)− (r + Kϕ0)

]→ 0. (7.30)

Actually, while EFn(Y − ψ(W )|Z) − EF (Y − ψ(W )|Z) only converges pointwiseat a nonparametric speed, the transformation by the operator Kn converts thisconvergence into a functional convergence at a parametric speed. Then

√ncp

n

∥∥∥Kn

[EFn(Y − ψ(W )|Z)− EF (Y − ψ(W )|Z]∥∥∥ → 0.

Moreover, Ln converges in norm to L, which is a bounded operator. Hence, theresult of (7.30) follows.

3) The term (7.24) remains. The convergence of√

ncpn(ϕFn

(z)−ϕF (z)) is then identicalto the convergence of√

ncpnPRn

[EFn(Y − ψ(W )|Z = z)− EF (Y − ψ(W )|Z = z

]

=√

ncpn

[EFn(Y − ψ(W )|Z = z)− EF (Y − ψ(W )|Z = z)

− 1n

∑i

(yi − ψ(wi))− 1ncp

n

∑i

∫ ∫(y − ψ(w))f(y, w|Z = z)ω

(z − zi

cn

)dzdw

].

It can easily be checked that the difference between the two sample means convergeto zero at a higher speed than

√ncp

n and these two last terms can be neglected.Then using standard results on nonparametric estimation, we obtain:

√ncp

n(ϕFn(z)− ϕF (z))

d→ N(

0, V ar(Y − ψ(W )|Z = z)

∫ω (u)2 du

fZ(z)

)

where the 0 mean of the asymptotic distribution is obtained thanks to a suitablechoice of the bandwidth, which needs to converge to 0 faster than the optimal speed.

110

Note that the estimator of ϕ has the same properties as the oracle estimator based onthe knowledge of ψ. This attractive feature was proved by Mammen, Linton, and Nielsen(1999) using different tools.

7.5.2. Estimation of the bias function in a measurement error equation

We have introduced in Example 1.3.6, Section 1, the measurement error model:

Y1 = η + ϕ (Z1) + U1 Y1, Y2 ∈ RY2 = η + ϕ (Z2) + U2 Z1, Z2 ∈ Rp

where η, Ui are random unknown elements and Y1 and Y2 are two measurements of η con-taminated by a bias term depending on observable elements Z1 and Z2. The unobservablecomponent η is eliminated by differencing and we get the model under consideration :

Y = ϕ (Z2)− ϕ (Z1) + U (7.31)

when Y = Y2 − Y1 and E (Y |Z1, Z2) = ϕ (Z2) − ϕ (Z1) . We assume that i.i.d. obser-vations of (Y, Z1, Z2) are available. Moreover, the order of measurements is arbitrary orequivalently (Y1, Y2, Z1, Z2) is distributed identically to (Y2, Y1, Z2, Z1) . This implies that(Y, Z1, Z2) and (−Y, Z2, Z1) have the same distribution. In particular, Z1 and Z2 areidentically distributed.

• The reference spaceH is the space of random variables defined on Rp that are squareintegrable with respect to the true marginal distribution on Z1 (or Z2). We are in acase where the Hilbert space structure depends on the unknown distribution.

• The function ϕ is an element of H but this set has to be reduced by the smoothnesscondition in order to obtain asymptotic properties of the estimation procedure.

• The operator K is the conditional expectation operator

(Kϕ) (z) = EF (ϕ (Z2) |Z1 = z)= EF (ϕ (Z1) |Z2 = z)

from H to H. The two conditional expectations are equal because (Z1, Z2) and(Z2, Z1) are identically distributed (by the exchangeability property). This operatoris self-adjoint and we suppose that K is compact. This property may be deduced asin previous cases from a Hilbert Schmidt argument.

Equation (7.31) introduces an overidentification property because it constrains theconditional expectation of Y given Z1 and Z2. In order to define ϕ for any F (and inparticular for the estimated one), the parameter ϕ is now defined as the solution of theminimization problem:

ϕ = arg minϕ

E (Y − ϕ (Z2) + ϕ (Z1))2

111

or, equivalently as the solution of the first-order conditions:

EF [ϕ (Z2) |Z1 = z]− ϕ (z) = E (Y |Z1 = z)

because (Y, Z1, Z2) and (−Y, Z2, Z1) are identically distributed.The integral equation, which defines the function of interest, ϕ, may be denoted by

(I −K) ϕ = r

where r = E (Y |Z2 = z) = −E (Y |Z1 = z) . As in the additive models, this inverse prob-lem is ill-posed because I − K is not one-to-one. Indeed, 1 is the greatest eigenvalueof K and the eigenfunctions associated with 1 are the constant functions. We need anextra assumption to warrant that the order of multiplicity is one, or in more statisticalterms, that ϕ is identified up to a constant. This property is obtained if Z1 and Z2 aremeasurably separated, i.e. if the functions of Z1 almost surely equal to some functions ofZ2, are almost surely constant.

Then the normalization rule is

〈ϕ, φ1〉 = 0

where φ1 is constant. This normalization is then equivalent to

EF (ϕ) = 0.

If F is estimated using a standard kernel procedure, the estimated Fn does not sat-isfy in general, the exchangeability assumption ((Y, Z1, Z2) and (−Y, Z2, Z1) are identi-cally distributed). A simple way to incorporate this constraint is to estimate F usinga sample of size 2n by adding to the original sample (yi, z1i, z2i)i=1,...,n a new sample(−yi, z2i, z1i)i=1,...,n . For simplicity, we do not follow this method here and consider an es-timation of F, which does not verify the exchangeability. In that case, rn is not in general

an element of R(I − Kn

), and the estimator ϕn is defined as the unique solution of

(I − Kn

)ϕ = PRn rn,

which satisfies the normalization rule

EFn (ϕ) = 0.

Equivalently, we have seen that the functional equation(I − Kn

)ϕ = rn reduces to a

n dimensional linear system, which is solved by a generalized inversion. The asymptoticproperties of this procedure follow immediately from the theorems of Section 7.4 and areobtained identically to the case of additive models.

112

References

[1] Ai, C. and X. Chen (2003) “Efficient Estimation of Models with Conditional MomentRestrictions Containing Unknown Functions”, Econometrica, 71, 1795-1843.

[2] Ait-Sahalia, Y., L.P. Hansen, and J.A. Scheinkman (2004) “Operator Methods forContinuous-Time Markov Processes”, forthcoming in the Handbook of FinancialEconometrics, edited by L.P. Hansen and Y. Ait-Sahalia, North Holland.

[3] Arellano, M., L. Hansen, and E. Sentana (2005) “Underidentification?”, mimeo,CEMFI.

[4] Aronszajn, N. (1950) “Theory of Reproducing Kernels”, Transactions of the Amer-ican Mathematical Society, Vol. 68, 3, 337-404.

[5] Bai, J. and S. Ng (2002) “Determining the Number of Factors in ApproximateFactor Models”, Econometrica, 70, 191-221.

[6] Basmann, R.L. (1957), “A Generalized Classical Method of Linear Estimation ofCoefficients in a Structural Equations”, Econometrica, 25, 77-83.

[7] Berlinet, A. and C. Thomas-Agnan (2004) Reproducing Kernel Hilbert Spaces inProbability and Statistics, Kluwer Academic Publishers, Boston.

[8] Blundell, R., X. Chen, and D. Kristensen (2003) “Semi-Nonparametric IV Esti-mation of Shape-Invariant Engel Curves”, Cemmap working paper CWP 15/03,University College London.

[9] Blundell, R. and J. Powell (2003) “Endogeneity in Nonparametric and Semipara-metric Regression Models”, in Advances in Economics and Econometrics, Vol. 2,eds by M. Dewatripont, L.P. Hansen and S.J. Turnovsky, Cambridge UniversityPress, 312-357.

[10] Bollerslev, T. (1986), “Generalized Autoregressive Conditional Heteroskedasticity”,Journal of Econometrics 31, 307-327.

[11] Bosq, D. (1998) Nonparametric Statistics for Stochastic Processes. Estimation andPrediction, Lecture Notes in Statistics, 110. Springer-Verlag, NewYork.

[12] Bosq, D. (2000) Linear processes in function spaces. Theory and applications, Lec-ture Notes in Statistics, 149. Springer-Verlag, NewYork.

[13] Breiman, L. and J.H. Friedman (1985) “Estimating Optimal Transformations forMultiple Regression and Correlation”, Journal of American Statistical Association,80, 580-619.

113

[14] Carrasco, M., M. Chernov, J.-P. Florens, and E. Ghysels (2004) “Efficient estima-tion of jump diffusions and general dynamic models with a continuum of momentconditions”, mimeo, University of Rochester.

[15] Carrasco, M. and J.-P. Florens (2000) “Generalization of GMM to a continuum ofmoment conditions”, Econometric Theory, 16, 797-834.

[16] Carrasco, M. and J.-P. Florens (2001) “Efficient GMM Estimation Using the Em-pirical Characteristic Function”, mimeo, University of Rochester.

[17] Carrasco, M. and J.-P. Florens (2002) “Spectral method for deconvolving a density”,mimeo, University of Rochester.

[18] Carrasco, M. and J.-P. Florens (2004) “On the Asymptotic Efficiency of GMM”,mimeo, University of Rochester.

[19] Carroll, R. and P. Hall (1988) “Optimal Rates of Convergence for Deconvolving aDensity”, Journal of American Statistical Association, 83, No.404, 1184-1186.

[20] Carroll, R., A. Van Rooij, and F. Ruymgaart (1991) “Theoretical Aspects of Ill-posed Problems in Statistics”, Acta Applicandae Mathematicae, 24, 113-140.

[21] Chacko, G. and L. Viceira (2003) “Spectral GMM estimation of continuous-timeprocesses”, Journal of Econometrics, 116, 259-292.

[22] Chen, X., L.P. Hansen and J. Scheinkman (1998) “Shape-preserving Estimation ofDiffusions”, mimeo, University of Chicago.

[23] Chen, X., and H. White (1992), “Central Limit and Functional Central Limit The-orems for Hilbert Space-Valued Dependent Processes”, Working Paper, Universityof San Diego.

[24] Chen, X. and H. White (1996) “Law of Large Numbers for Hilbert Space-Valuedmixingales with Applications”, Econometric Theory, 12, 284-304.

[25] Chen, X. and H. White (1998) “Central Limit and Functional Central Limit The-orems for Hilbert Space-Valued Dependent Processes”, Econometric Theory, 14,260-284.

[26] Darolles, S., J.-P. Florens, and C. Gourieroux (2004) “Kernel Based NonlinearCanonical Analysis and Time Reversibility”, Journal of Econometrics, 119, 323-353.

[27] Darolles, S., J.-P. Florens, and E. Renault (1998), “Nonlinear Principal Componentsand Inference on a Conditional Expectation Operator with Applications to MarkovProcesses”, presented in Paris-Berlin conference 1998, Garchy, France.

114

[28] Darolles, S., J.-P. Florens, and E. Renault (2002), “Nonparametric InstrumentalRegression”, Working paper 05-2002, CRDE.

[29] Das, M. (2005) “Instrumental variables estimators of nonparametric models withdiscrete endogenous regressors”, Journal of Econometrics, 124, 335-361.

[30] Dautray, R. and J.-L. Lions (1988) Analyse mathematique et calcul numerique pourles sciences et les techniques. Vol. 5. Spectre des operateurs, Masson, Paris.

[31] Davidson, J. (1994) Stochastic Limit Theory, Oxford University Press, Oxford.

[32] Debnath, L. and P. Mikusinski (1999) Introduction to Hilbert Spaces with Applica-tions, Academic Press. San Diego.

[33] Dunford, N. and J. Schwartz (1988) Linear Operators, Part II: Spectral Theory,Wiley, New York.

[34] Engl. H. W., M. Hanke, and A. Neubauer (1996) Regularization of Inverse Problems,Kluwer Academic Publishers.

[35] Engle R.F., (1990), “Discussion: Stock Market Volatility and the Crash of ’87”,Review of Financial Studies 3, 103-106.

[36] Engle, R.F., D.F. Hendry and J.F. Richard, (1983), “Exogeneity”, Econometrica,51 (2) 277-304.

[37] Engle, R.F., and V.K. Ng (1993), “Measuring and Testing the Impact of News onVolatility”, The Journal of Finance XLVIII, 1749-1778.

[38] Fan, J. (1993) “Adaptively local one-dimentional subproblems with application toa deconvolution problem”, The Annals of Statistics, 21, 600-610.

[39] Feuerverger, A. and P. McDunnough (1981), “On the Efficiency of Empirical Char-acteristic Function Procedures”, Journal of the Royal Statistical Society, Series B,43, 20-27.

[40] Florens, J.-P. (2003) “Inverse Problems in Structural Econometrics: The Exampleof Instrumental Variables”, in Advances in Economics and Econometrics, Vol. 2, edsby M. Dewatripont, L.P. Hansen and S.J. Turnovsky, Cambridge University Press,284-311.

[41] Florens, J.-P. (2005) “Engogeneity in nonseparable models. Application to treat-ment models where the outcomes are durations”, mimeo, University of Toulouse.

[42] Florens, J.-P., J. Heckman, C. Meghir and E. Vytlacil (2002), “ Instrumental Vari-ables, Local Instrumental Variables and Control Functions”, Manuscript, Universityof Toulouse.

115

[43] Florens, J.-P. and Malavolti (2002) “Instrumental Regression with Discrete Vari-ables”, mimeo University of Toulouse, presented at ESEM 2002, Venice.

[44] Florens, J.-P. and M. Mouchart (1985), “Conditioning in Dynamic Models”, Journalof Time Series Analysis, 53 (1), 15-35.

[45] Florens, J.-P., M. Mouchart, and J.F. Richard (1974), “Bayesian Inference in Error-in-variables Models”, Journal of Multivariate Analysis, 4, 419-432.

[46] Florens, J.-P., M. Mouchart, and J.F. Richard (1987), “Dynamic Error-in-variablesModels and Limited Information Analysis”, Annales d’Economie et Statistiques,6/7, 289-310.

[47] Florens, J.-P., M. Mouchart, and J.-M. Rolin (1990) Elements of Bayesian Statistics,Dekker, New York.

[48] Florens, J.-P., C. Protopopescu, and J.F. Richard, (1997), “Identification and Esti-mation of a Class of Game Theoretic Models”, GREMAQ, University of Toulouse.

[49] Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2000) “The generalized dynamicfactor model: identification and estimation”, Review of Economic and Statistics,82, 4, 540-552.

[50] Forni, M. and L. Reichlin (1998) “Let’s Get Real: A Factor Analytical Approach toDisaggregated Business Cycle Dynamics”, Review of Economic Studies, 65, 453-473.

[51] Gallant, A. R. and J. R. Long (1997) “Estimating Stochastic Differential EquationsEfficiently by Minimum Chi-squared”, Biometrika, 84, 125-141.

[52] Gaspar, P. and J.-P. Florens, (1998), “Estimation of the Sea State Biais in RadarAltimeter Measurements of Sea Level: Results from a Nonparametric Method”,Journal of Geophysical Research, 103 (15), 803-814.

[53] Guerre, E., I. Perrigne, and Q. Vuong, (2000), “Optimal Nonparametric Estimationof First-Price Auctions”, Econometrica, 68 (3), 525-574.

[54] Groetsch, C. (1993) Inverse Problems in Mathematical Sciences, Vieweg Mathemat-ics for Scientists and Engineers, Wiesbaden.

[55] Hall, P. and J. Horowitz (2004) “Nonparametric Methods for Inference in the Pres-ence of Instrumental Variables”, mimeo, Northwestern University.

[56] Hansen, L.P., (1982), “Large Sample Properties of Generalized Method of MomentsEstimators”, Econometrica, 50, 1029-1054.

[57] Hansen, L.P. (1985) “A method for calculating bounds on the asymptotic covariancematrices of generalized method of moments estimators”, Journal of Econometrics,30, 203-238.

116

[58] Hardle, W. and O. Linton (1994) “Applied Nonparametric Methods”, Handbook ofEconometrics, Vol. IV, edited by R.F. Engle and D.L. McFadden, North Holland,Amsterdam.

[59] Hastie, T.J. and R.J. Tibshirani (1990), Generalized Additive Models, Chapman andHall, London.

[60] Hausman, J., (1981), “Exact Consumer’s Surplus and Deadweight Loss”, AmericanEconomic Review, 71, 662-676.

[61] Hausman, J. (1985), “The Econometrics of Nonlinear Budget sets”, Econometrica,53, 1255-1282.

[62] Hausman, J. and W.K. Newey, (1995) “Nonparametric Estimation of Exact Con-sumers Surplus and Deadweight Loss”, Econometrica, 63, 1445-1476.

[63] Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998), “Characterizing SelectionBias Using Experimental Data”, Econometrica, 66, 1017-1098.

[64] Heckman, J., and E. Vytlacil (2000), “Local Instrumental Variables”, in Nonlin-ear Statistical Modeling: Proceedings of the Thirteenth International Symposium inEconomic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, ed. byC. Hsiao, K. Morimune, and J. Powells. Cambridge: Cambridge University Press,1-46.

[65] Hoerl, A. E. and R. W. Kennard (1970) “Ridge Regression: Biased Estimation ofNonorthogonal Problems”, Technometrics, 12, 55-67.

[66] Horowitz, J. (1999) “Semiparametric estimation of a proportional hazard modelwith unobserved heterogeneity”, Econometrica, 67, 1001-1028.

[67] Imbens, G., and J. Angrist (1994), “Identification and Estimation of Local AverageTreatment Effects”, Econometrica, 62, 467-476.

[68] Jiang, G. and J. Knight (2002) “Estimation of Continuous Time Processes Via theEmpirical Characteristic Function”, Journal of Business & Economic Statistics, 20,198-212.

[69] Judge, G., W. Griffiths, R. C. Hill, H. Lutkepohl, and T-C. Lee (1980) The Theoryand Practice of Econometrics, John Wiley and Sons, New York.

[70] Kargin, V. and A. Onatski (2004) “Dynamics of Interest Rate Curve by Func-tional Auto-Regression”, mimeo Columbia University, presented at the CIRANOand CIREQ Conference on Operator Methods (Montreal, November 2004).

[71] Kitamura, Y. and M. Stutzer (1997), “An Information Theoretic Alternative toGeneralized Method of Moments Estimation”, Econometrica, 65, 4, 861-874.

117

[72] Knight, J. L. and J. Yu (2002) “Empirical Characteristic Function in Time SeriesEstimation”, Econometric Theory, 18, 691-721.

[73] Kress, R. (1999), Linear Integral Equations, Springer.

[74] Kutoyants, Yu. (1984), Parameter estimation for stochastic processes, HeldermannVerlag, Berlin.

[75] Lancaster, H. (1968), “The Structure of Bivariate Distributions”, Annals of Math-ematical Statistics, 29, 719-736.

[76] Linton, O. and E. Mammen (2003), “Estimating Semiparametric ARCH(∞) modelsby kernel smoothing methods”, forthcoming in Econometrica.

[77] Linton, O. and J.P. Nielsen (1995) “A Kernel Method of Estimating StructuredNonparametric regression Based on Marginal Integration”, Biometrika, 82, 93-100.

[78] Loubes, J.M. and A. Vanhems (2001), “Differential Equation and Endogeneity”,Discussion Paper, GREMAQ, University of Toulouse, presented at ESEM 2002,Venice.

[79] Loubes, J.M. and A. Vanhems (2003), “Saturation Spaces for Regularization Meth-ods in Inverse Problems”, Discussion Paper, GREMAQ, University of Toulouse,presented at ESEM 2003, Stockholm.

[80] Lucas, R. (1978) “Asset Prices in an Exchange Economy”, Econometrica, 46, 1429-1446.

[81] Luenberger, D. G. (1969) Optimization by Vector Space Methods, Wiley, New York.

[82] Malinvaud, E. (1970), Methodes Statistiques de l’Econometrie, Dunod, Paris.

[83] Mammen, E., O. Linton, and J. Nielsen (1999) “The existence and asymptoticproperties of a backfitting projection algorithm under weak conditions”, The Annalsof Statistics, 27, 1443-1490.

[84] Nashed, N. Z. and G. Wahba (1974) “Generalized inverses in reproducing kernelspaces: An approach to regularization of linear operator equations”, SIAM Journalof Mathematical Analysis, 5, 974-987.

[85] Natterer (1984) “Error bounds for Tikhonov regularization in Hilbert scales”, Ap-plicable Analysis, 18, 29-37.

[86] Newey, W. (1994) “Kernel Estimation of Partial Means”, Econometric Theory, 10,233-253.

[87] Newey, W., and J. Powell (2003), “Instrumental Variables for Nonparametric Mod-els”, Econometrica, 71, 1565-1578.

118

[88] Newey, W., Powell, J., and F. Vella (1999), “Nonparametric Estimation of Trian-gular Simultaneous Equations Models”, Econometrica, 67, 565-604.

[89] Owen, A. (2001) Empirical likelihood, Monographs on Statistics and Applied Prob-ability, vol. 92. Chapman and Hall, London.

[90] Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge UniversityPress.

[91] Parzen, E. (1959) “Statistical Inference on time series by Hilbert Space Methods,I.”,Technical Report No.23, Applied Mathematics and Statistics Laboratory, Stanford.Reprinted in (1967) Time series analysis papers, Holden-Day, San Francisco.

[92] Parzen, E. (1970) “Statistical Inference on time series by RKHS methods”, Proc.12th Biennal Canadian Mathematical Seminar, R. Pyke, ed. American Mathemati-cal Society, Providence.

[93] Politis, D. and J. Romano (1994) “Limit theorems for weakly dependent Hilbertspace valued random variables with application to the stationary bootstrap”, Sta-tistica Sinica, 4, 451-476.

[94] Polyanin, A. and A. Manzhirov (1998) Handbook of Integral Equations, CRC Press,Bocca Raton, Florida.

[95] Qin, J. and J. Lawless, (1994), “Empirical Likelihood and General Estimating Equa-tions”, The Annals of Statistics, 22, 1, 300-325.

[96] Reiersol, O. (1941), “Confluence Analysis of Lag Moments and other Methods ofConfluence Analysis”, Econometrica, 9, 1-24.

[97] Reiersol, O. (1945), “Confluence Analysis by Means of Instrumental Sets of Vari-ables”, Arkiv for Mathematik, Astronomie och Fysik, 32A, 119.

[98] Ross, S. (1976) “The Arbitrage Theory of Capital Asset Pricing”, Journal of Fi-nance, 13, 341-360.

[99] Rust, J., J. F. Traub, and H. Wozniakowski (2002) “Is There a Curse of Dimension-ality for Contraction Fixed Points in the Worst Case?”, Econometrica, 70, 285-330.

[100] Ruymgaart, F. (2001) “A short introduction to inverse statistical inference”, lecturegiven at the conference “L’Odyssee de la Statistique”, Institut Henri Poincare, Paris.

[101] Saitoh, S. (1997) Integral transforms, reproducing kernels and their applications,Longman.

[102] Sansone, G. (1959) Orthogonal Functions, Dover Publications, New York.

119

[103] Sargan, J.D. (1958), “The Estimation of Economic Relationship using InstrumentalVariables”, Econometrica, 26, 393-415.

[104] Schaumburg, E. (2004) “Estimation of Markov Processes of Levy Type Generators”,mimeo, Kellogg School of Management.

[105] Singleton, K. (2001) “Estimation of Affine Pricing Models Using the EmpiricalCharacteristic Function”, Journal of Econometrics, 102, 111-141.

[106] Stefanski, L. and R. Carroll (1990) “Deconvoluting Kernel Density Estimators”,Statistics, 2, 169-184.

[107] Stock, J. and M. Watson (1998) “Diffusion Indexes”, NBER working paper 6702.

[108] Stock, J. and M. Watson (2002) “Macroeconomic Forecasting Using Diffusion In-dexes”, Journal of Business and Economic Statistics, 20, 147-162.

[109] Tauchen, G. (1997) “New Minimum Chi-Square Methods in Empirical Finance”, inAdvances in Econometrics, Seventh World Congress, eds. D. Kreps and K. Wallis,Cambridge University Press, Cambridge.

[110] Tauchen, G. and R. Hussey (1991) “Quadrature-Based Methods for Obtaining Ap-proximate Solutions to Nonlinear Asset Pricing Models”, Econometrica, 59, 371-396.

[111] Tautenhahn, U. (1996) “Error estimates for regularization methods in Hilbertscales”, SIAM Journal of Numerical Analysis, 33, 2120-2130.

[112] Theil, H.(1953), “Repeated Least-Squares Applied to Complete Equations System”,The Hague: Central Planning Bureau (mimeo).

[113] Tjostheim, D. and B. Auestad (1994) “Nonparametric Identification of NonlinearTime Series Projections”, Journal of American Statistical Association, 89, 1398-1409.

[114] Vanhems, A. (2000), “Nonparametric Solutions to Random Ordinary DifferentialEquations of First Orders”, GREMAQ-University of Toulouse.

[115] Van Rooij and F. Ruymgaart (1991) “Regularized Deconvolution on the Circle andthe Sphere”, in Nonparametric Functional Estimation and Related Topics, editedby G. Roussas, 679-690, Kluwer Academic Publishers, the Netherlands.

[116] Van Rooij, A., F. Ruymgaart (1999) “On Inverse Estimation”, in Asymptotics,Nonparametrics, and Time Series, 579-613, Dekker, NY.

[117] Van Rooij, A., F. Ruymgaart, and W. Van Zwet (2000) “Asymptotic Efficiency ofInverse Estimators”, Theory of Probability and its Applications, 44, 4, 722-738.

[118] Vapnik A.C.M. (1998), Statistical Learning Theory, Wiley, New York.

120

[119] Wahba, G. (1973) “Convergence Rates of Certain Approximate Solutions to Fred-holm Integral Equations of the First Kind”, Journal of Approximation Theory, 7,167-185.

121

linear inverse problems in structural econometrics estimation

Documents