
Adaptive estimation in functional linear model: a model selection approach

Angelina Roche
joint work with Élodie Brunel and André Mas

I3M-Université Montpellier II

7èmes Journées de Statistique Fonctionnelle et Opératorielle, 28-29 June 2012, Montpellier


Summary

1 Model definition and estimation context

2 Estimation procedure of the slope function
  - Estimation – Minimization of the least square contrast
  - Penalized contrast model selection

3 Upper and lower bound on the risk
  - Oracle-inequality
  - Convergence rate over Sobolev spaces

4 Numerical results
  - Simulation method
  - Results


Model definition and estimation context




Functional linear regression model

We model the dependence between a functional random predictor $X$ and a scalar response $Y$ by the linear relation:

$$Y = \int_0^1 \beta(t) X(t)\,dt + \varepsilon, \qquad (1)$$

where
- $\beta$ is an unknown function in $L^2([0,1])$, the slope function;
- $X$ is a centred random variable with values in $L^2([0,1])$;
- $\varepsilon$ is a centred real random variable, independent of $X$, with variance $\sigma^2$.

The aim is to estimate the function $\beta$ from a sample $\{(X_i, Y_i),\ i = 1, \dots, n\}$ satisfying Equation (1).



Some existing works on functional linear model

- Many estimation procedures with asymptotic convergence results (see e.g. Cardot et al. (1999, 2003), Cai and Hall (2006), Hall and Horowitz (2007), Crambes et al. (2009), ...).
- The optimal theoretical choice of smoothing parameters depends on the unknown regularities of both the slope $\beta$ and the predictor $X$.
- In practice, smoothing parameters are obtained by cross-validation.
- Non-asymptotic results providing adaptive data-driven estimators were missing until the recent work of Comte and Johannes (2010).



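Prediction error – Definition

Let $(X_{n+1}, Y_{n+1})$ be independent of the sample $(X_1, Y_1), \dots, (X_n, Y_n)$ and let

$$\hat Y_{n+1} = \int_0^1 \hat\beta(s) X_{n+1}(s)\,ds$$

be the value of $Y_{n+1}$ predicted from $X_{n+1}$ and the estimator $\hat\beta$.

Prediction error

The prediction error of $\hat\beta$ is the quantity:

$$E\left[\left(\hat Y_{n+1} - E[Y_{n+1} \mid X_{n+1}]\right)^2 \,\middle|\, X_1, \dots, X_n\right] = \sum_{j \ge 1} \lambda_j \langle \hat\beta - \beta, \varphi_j \rangle^2 =: \|\hat\beta - \beta\|_\Gamma^2,$$

where, for all $j$, $\lambda_j$ is the eigenvalue of the covariance operator $\Gamma$ associated with the eigenfunction $\varphi_j$, and $\langle \cdot, \cdot \rangle$ is the usual scalar product of $L^2([0,1])$.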
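
As a small illustration of the $\Gamma$-norm, the sketch below evaluates $\sum_j \lambda_j \langle \hat\beta - \beta, \varphi_j \rangle^2$ from a finite list of coefficients. The function name `gamma_norm_sq` and the numerical values are illustrative assumptions, not taken from the talk.

```python
def gamma_norm_sq(coef_diff, eigvals):
    """Return sum_j lambda_j * <beta_hat - beta, phi_j>^2.

    coef_diff[j] is the j-th coefficient of beta_hat - beta in the
    eigenbasis (phi_j); eigvals[j] is the matching eigenvalue lambda_j.
    Both sequences are truncated to the same finite length.
    """
    if len(coef_diff) != len(eigvals):
        raise ValueError("coef_diff and eigvals must have the same length")
    return sum(lam * c * c for lam, c in zip(eigvals, coef_diff))


# Hypothetical example: lambda_j = j^-2 and three error coefficients.
eigvals = [1.0 / j**2 for j in range(1, 4)]
coef_diff = [0.1, -0.2, 0.3]
err = gamma_norm_sq(coef_diff, eigvals)
```

Because the eigenvalues decrease, an error carried by a high-frequency coefficient weighs less in this norm than the same error on a low-frequency one.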



Reformulation of the problem equation

Multiplying both sides of the model equation by $X(s)$ and taking the expectation, we obtain the following formulation of the problem:

$$g(s) := E[Y X(s)] = E\left[\int_0^1 \beta(t) X(t)\,dt \; X(s)\right] =: \Gamma\beta(s),$$

where $\Gamma$ is the covariance operator associated with the random function $X$.

The problem of estimating the function $\beta$ is thus related to the inversion of the covariance operator $\Gamma$ or of its empirical counterpart

$$\Gamma_n : f \in L^2([0,1]) \mapsto \frac{1}{n} \sum_{i=1}^n \langle f, X_i \rangle X_i.$$



Dimension reduction

Aim: find an approximation space $S_m$ of dimension $D_m < +\infty$ containing as much information as possible.

The final objective is to define an estimator based on functional PCA, i.e. to take $S_m$ as the space spanned by the eigenfunctions associated with the $m$ largest eigenvalues of $\Gamma_n$.

Problem: it is difficult to control an estimator defined on a random space.



Hypothesis – Consequence on the covariance operator eigenfunctions

Context of “circular data”: assumption on the curve $X$

The curve $X$ is assumed to be 1-periodic ($X(0) = X(1)$) and second-order stationary.

In this context the Fourier basis $(\varphi_j)_{j \ge 1}$ of $L^2([0,1])$ defined by

$$\varphi_1 \equiv 1, \quad \varphi_{2j}(\cdot) = \sqrt{2} \cos(2\pi j\, \cdot) \quad \text{and} \quad \varphi_{2j+1}(\cdot) = \sqrt{2} \sin(2\pi j\, \cdot),$$

is a basis of eigenfunctions of the covariance operator $\Gamma$.
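
The following sketch evaluates the Fourier basis on a grid and checks its orthonormality numerically. The helper name `fourier_basis`, the midpoint grid and the grid size are assumptions made for the illustration.

```python
import numpy as np


def fourier_basis(t, n_funcs):
    """Evaluate phi_1, ..., phi_{n_funcs} on the grid t.

    phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k .), phi_{2k+1} = sqrt(2) sin(2 pi k .).
    Returns an array of shape (n_funcs, len(t)).
    """
    t = np.asarray(t, dtype=float)
    basis = np.ones((n_funcs, t.size))
    for j in range(2, n_funcs + 1):
        k = j // 2  # frequency shared by phi_{2k} and phi_{2k+1}
        trig = np.cos if j % 2 == 0 else np.sin
        basis[j - 1] = np.sqrt(2.0) * trig(2.0 * np.pi * k * t)
    return basis


# Midpoint grid on [0, 1]; inner products are approximated by Riemann sums.
p = 1000
t = (np.arange(p) + 0.5) / p
phi = fourier_basis(t, 7)
gram = phi @ phi.T / p  # approximates the matrix of <phi_j, phi_k>
```

Here `gram` should be numerically close to the identity matrix, which is the discrete counterpart of the orthonormality of $(\varphi_j)_{j \ge 1}$ in $L^2([0,1])$.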


Estimation procedure of the slope function




Estimation – Minimization of the least square contrast




Dimension reduction

Approximation spaces

Let $N_n \in \mathbb{N}^*$ and $m \in \{1, \dots, N_n\}$. We denote by

$$S_m := \mathrm{span}\{\varphi_1, \dots, \varphi_{2m+1}\}$$

the linear space, called a model, spanned by the first $2m+1$ functions of the trigonometric basis.

For all $m \in \{1, \dots, N_n\}$, we define an estimator of $\beta$ on the space $S_m$.




Estimation on Sm

Minimization of the least square contrast

For all $m \in \{1, \dots, N_n\}$, we compute the least square estimator on $S_m$:

$$\hat\beta_m := \arg\min_{f \in S_m} \gamma_n(f),$$

where $\gamma_n$ is the least square contrast defined by:

$$\gamma_n : f \mapsto \frac{1}{n} \sum_{i=1}^n \left(Y_i - \langle f, X_i \rangle\right)^2.$$

After this first step, we obtain a family $\{\hat\beta_m,\ m = 1, \dots, N_n\}$ of estimators of the slope function $\beta$.
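
A minimal sketch of this step, assuming the curves are represented by their Fourier scores $\langle \varphi_j, X_i \rangle$ so that minimizing $\gamma_n$ over $S_m$ becomes an ordinary least squares problem in $2m+1$ coefficients. The function name `ls_estimator`, the simulated scores and the slope coefficients are hypothetical.

```python
import numpy as np


def ls_estimator(scores, y, m):
    """Coefficients of the least squares estimator on S_m.

    scores[i, j] = <phi_{j+1}, X_i>; S_m is spanned by phi_1, ..., phi_{2m+1},
    so only the first 2m + 1 columns of scores are used.
    """
    d = 2 * m + 1
    coef, *_ = np.linalg.lstsq(scores[:, :d], y, rcond=None)
    return coef


# Hypothetical data: scores sqrt(lambda_j) * xi_ij with lambda_j = j^-2.
rng = np.random.default_rng(0)
n, n_coef = 500, 21
lam = 1.0 / np.arange(1, n_coef + 1) ** 2
scores = np.sqrt(lam) * rng.standard_normal((n, n_coef))
beta_coef = np.zeros(n_coef)
beta_coef[:5] = [1.0, -0.5, 0.3, 0.2, -0.1]  # beta lies in S_2
y = scores @ beta_coef + 0.1 * rng.standard_normal(n)

beta_hat_2 = ls_estimator(scores, y, m=2)  # estimated coefficients on S_2
```

Calling `ls_estimator` for every `m` from 1 to $N_n$ would give the whole family of estimators. The solver `np.linalg.lstsq` is one possible choice and is not necessarily the one used by the authors.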



Penalized contrast model selection





Dimension selection – heuristic

Denote by $m^*$ the unknown ideal dimension, called the oracle,

$$m^* := \arg\min_{m = 1, \dots, N_n} E[\|\beta - \hat\beta_m\|_\Gamma^2].$$

The idea is to define a data-driven criterion that selects a dimension with performance similar to that of the oracle.

Let $\beta_m$ be the orthogonal projection of $\beta$ onto $S_m$. We have the following bias-variance decomposition:

$$E[\|\beta - \hat\beta_m\|_\Gamma^2] = E[\|\beta - \beta_m\|_\Gamma^2] + E[\|\hat\beta_m - \beta_m\|_\Gamma^2].$$

The idea is to define a criterion with a similar behaviour:

$$\mathrm{crit}(m) := \gamma_n(\hat\beta_m) + \mathrm{pen}(m).$$




Dimension selection via penalisation

We then choose an element of the family $\{\hat\beta_m,\ m = 1, \dots, N_n\}$ minimizing the penalized criterion:

$$\mathrm{crit}(m) := \gamma_n(\hat\beta_m) + \mathrm{pen}(m),$$

where $\mathrm{pen}(m) := \kappa \frac{2m+1}{n} \sigma^2$, $\kappa$ is a numerical constant and $\sigma^2$ can be replaced by an estimator $\hat\sigma_m^2$ (see the simulation section).

The estimator selected by our penalized criterion is then $\hat\beta_{\hat m}$ with

$$\hat m \in \arg\min_{m = 1, \dots, N_n} \mathrm{crit}(m).$$
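
The selection rule can be sketched as follows, again working with Fourier scores. The value `kappa=2.0` is only a placeholder (the talk does not give the constant), and the function name and simulated data are assumptions.

```python
import numpy as np


def select_dimension(scores, y, n_max, sigma2, kappa=2.0):
    """Return (m_hat, crit) for crit(m) = gamma_n(beta_hat_m) + pen(m).

    scores[i, j] = <phi_{j+1}, X_i>; pen(m) = kappa * sigma2 * (2m + 1) / n.
    kappa=2.0 is a placeholder: the talk does not state its value.
    """
    n = len(y)
    crit = np.empty(n_max)
    for m in range(1, n_max + 1):
        d = 2 * m + 1
        design = scores[:, :d]
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        contrast = np.mean((y - design @ coef) ** 2)  # gamma_n(beta_hat_m)
        crit[m - 1] = contrast + kappa * sigma2 * d / n
    m_hat = int(np.argmin(crit)) + 1
    return m_hat, crit


# Hypothetical data with lambda_j = j^-2 and a slope lying in S_2.
rng = np.random.default_rng(1)
n, n_max, sigma2 = 2000, 10, 0.01
n_coef = 2 * n_max + 1
lam = 1.0 / np.arange(1, n_coef + 1) ** 2
scores = np.sqrt(lam) * rng.standard_normal((n, n_coef))
beta_coef = np.zeros(n_coef)
beta_coef[:5] = [1.0, -0.5, 0.3, 0.2, -0.1]
y = scores @ beta_coef + np.sqrt(sigma2) * rng.standard_normal(n)

m_hat, crit = select_dimension(scores, y, n_max, sigma2)
```

The contrast alone decreases as $m$ grows, so without the penalty the largest model would always be chosen. The penalty term offsets this by charging each model for its dimension.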


Upper and lower bound on the risk




Oracle-inequality


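Theorem (Brunel and Roche (2011))

Assume moment hypotheses on the random variables $\langle \varphi_j, X \rangle / \sqrt{\lambda_j}$ and on $\varepsilon$, and choose $N_n$ such that

$$\min_{1 \le j \le 2N_n+1} \lambda_j \ge 2/n^2 \quad \text{and} \quad 2N_n + 1 \le K \frac{\sqrt{n}}{\log^3 n},$$

with $K$ a constant. Then, for every slope function $\beta \in L^2([0,1])$ such that $E[\langle \beta, X \rangle^4] < +\infty$:

$$E[\|\hat\beta_{\hat m} - \beta\|_\Gamma^2] \le C_1 \min_{m = 1, \dots, N_n} \left( \inf_{f \in S_m} \|\beta - f\|_\Gamma^2 + \mathrm{pen}(m) \right) + \frac{C(\beta, \Gamma)}{n},$$

where $C(\beta, \Gamma) = C_2 \left(1 + \|\beta\|_\Gamma^2 + E[\langle \beta, X \rangle^4]\right)$.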



Convergence rate over Sobolev spaces





Periodized Sobolev spaces

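We recall the definition of a Sobolev space on $[0,1]$:

$$W_2^\alpha = \left\{ f \in L^2([0,1]),\ f^{(\alpha-1)} \text{ absolutely continuous},\ \|f^{(\alpha)}\| \le L \right\},$$

for $\alpha \in \mathbb{N}^*$ and $L > 0$. We consider the following subset of $W_2^\alpha$:

$$W^{per}(\alpha, L) = \left\{ f \in W_2^\alpha,\ \forall j = 1, \dots, \alpha - 1,\ f^{(j)}(0) = f^{(j)}(1) \right\}.$$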




Upper-bound on the rate of convergence

Theorem

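Polynomial case. If, for all $j$, $j^{-2a}/c \le \lambda_j \le c\, j^{-2a}$, with $a > 1/2$ and $c > 0$, then:

$$\sup_{\beta \in W^{per}(\alpha, L)} E[\|\hat\beta_{\hat m} - \beta\|_\Gamma^2] \le C_P\, n^{-(2\alpha + 2a)/(2\alpha + 2a + 1)}.$$

Exponential case. If, for all $j$, $\exp(-j^{2a})/c \le \lambda_j \le c \exp(-j^{2a})$, with $a, c > 0$, then:

$$\sup_{\beta \in W^{per}(\alpha, L)} E[\|\hat\beta_{\hat m} - \beta\|_\Gamma^2] \le C_E\, n^{-1} (\log n)^{1/(2a)}.$$

Remark: in the case where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, these bounds coincide with the lower bounds given by Cardot and Johannes (2010).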
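
The polynomial rate can be recovered heuristically from the oracle inequality by balancing the approximation term against the penalty. The sketch below is not the authors' proof. It assumes the standard ellipsoid bound $\sum_{j \ge 1} j^{2\alpha} \langle \beta, \varphi_j \rangle^2 \le C L^2$ for $\beta \in W^{per}(\alpha, L)$, with $C$ a constant introduced here, and it assumes the balancing dimension is at most $N_n$.

```latex
% Polynomial case: \lambda_j \le c\, j^{-2a}. Since \Gamma is diagonal in the
% Fourier basis, the infimum over S_m is attained at the projection of \beta.
\begin{align*}
\inf_{f \in S_m} \|\beta - f\|_\Gamma^2
  &= \sum_{j > 2m+1} \lambda_j \langle \beta, \varphi_j \rangle^2
   \le c \sum_{j > 2m+1} j^{-2a - 2\alpha}\, j^{2\alpha} \langle \beta, \varphi_j \rangle^2 \\
  &\le c\, C L^2\, (2m+1)^{-2(\alpha + a)}, \\
\mathrm{pen}(m) &= \kappa \sigma^2\, \frac{2m+1}{n}.
\end{align*}
% Balancing (2m+1)^{-2(\alpha + a)} against (2m+1)/n:
\[
2m + 1 \asymp n^{1/(2\alpha + 2a + 1)}
\quad \Longrightarrow \quad
\inf_{f \in S_m} \|\beta - f\|_\Gamma^2 + \mathrm{pen}(m)
  \asymp n^{-(2\alpha + 2a)/(2\alpha + 2a + 1)}.
\]
```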


Numerical results


Simulation method


Simulation of X

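$$X = \sum_{j=1}^{1001} \sqrt{\lambda_j}\, \xi_j \varphi_j,$$

with $\xi_1, \dots, \xi_{1001}$ independent realizations of $\mathcal{N}(0,1)$. We consider two sequences $(\lambda_j)_{j \ge 1}$:

$$\lambda_j^{(P)} = \frac{1}{j^2}; \qquad \lambda_j^{(E)} = \exp(-\sqrt{j}).$$

[Figure: simulated curves $X(t)$ against $t \in [0,1]$, one panel for $\lambda = \lambda^{(E)}$ and one for $\lambda = \lambda^{(P)}$.]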
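
A minimal sketch of this simulation scheme on a discretisation grid. The grid, the number of curves, the seed and the function name `simulate_curves` are assumptions made for the illustration.

```python
import numpy as np


def simulate_curves(n, t, eigvals, rng):
    """Draw n curves X = sum_j sqrt(lambda_j) * xi_j * phi_j on the grid t.

    (phi_j) is the Fourier basis: phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k .),
    phi_{2k+1} = sqrt(2) sin(2 pi k .). Returns an array of shape (n, len(t)).
    """
    n_terms = len(eigvals)
    basis = np.ones((n_terms, len(t)))
    for j in range(2, n_terms + 1):
        k = j // 2  # frequency shared by phi_{2k} and phi_{2k+1}
        trig = np.cos if j % 2 == 0 else np.sin
        basis[j - 1] = np.sqrt(2.0) * trig(2.0 * np.pi * k * t)
    xi = rng.standard_normal((n, n_terms))  # independent N(0, 1) draws
    return (xi * np.sqrt(eigvals)) @ basis


idx = np.arange(1, 1002)
lam_poly = 1.0 / idx**2          # lambda_j^(P) = j^-2
lam_exp = np.exp(-np.sqrt(idx))  # lambda_j^(E) = exp(-sqrt(j))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 201)
X_poly = simulate_curves(5, t, lam_poly, rng)
X_exp = simulate_curves(5, t, lam_exp, rng)
```

Each simulated curve takes the same value at $t = 0$ and $t = 1$ up to rounding, as required by the periodicity assumption.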


Slope functions

We define:

$$\beta_1(t) = \ln(15t^2 + 10) + \cos(4\pi t) \quad \text{(Cardot et al. (2003))};$$

$$\beta_2(t) = t(t - 1).$$

[Figure: plots of $\beta_1$ and $\beta_2$ on $[0,1]$.]

Variance of the noise $\varepsilon$: $\sigma^2 = 0.01$, assumed to be known.



Results


Estimation of $\beta_1(x) = \ln(15x^2 + 10) + \cos(4\pi x)$

[Figure: estimation of $\beta_1$ for $n = 2000$, two panels with $x$ on the horizontal axis and $\beta(x)$ on the vertical axis.]

Mean and median over 1000 Monte-Carlo replications of $\|\hat\beta_{\hat m} - \beta\|_\Gamma^2$:

                                                     n = 500   n = 1000   n = 2000
$\lambda_j^{(P)} = j^{-2}$       mean (×10^-3)       0.54      0.17       0.091
                                 median (×10^-3)     0.55      0.15       0.089
$\lambda_j^{(E)} = e^{-\sqrt{j}}$  mean (×10^-3)       0.58      0.23       0.13
                                 median (×10^-3)     0.58      0.21       0.13


Estimation of $\beta_2(x) = x(x - 1)$

[Figure: estimation of $\beta_2$ for $n = 2000$, two panels with $x$ on the horizontal axis and $\beta(x)$ on the vertical axis.]

Mean and median over 1000 Monte-Carlo replications of $\|\hat\beta_{\hat m} - \beta\|_\Gamma^2$:

                                                     n = 500   n = 1000   n = 2000
$\lambda_j^{(P)} = j^{-2}$       mean (×10^-3)       0.51      0.084      0.037
                                 median (×10^-3)     0.53      0.057      0.033
$\lambda_j^{(E)} = e^{-\sqrt{j}}$  mean (×10^-3)       0.52      0.10       0.044
                                 median (×10^-3)     0.56      0.073      0.042


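Estimating the noise variance

When the variance $\sigma^2$ is not assumed to be known, we replace the penalty $\mathrm{pen}(m) = \kappa \sigma^2 \frac{D_m}{n}$ by $\widehat{\mathrm{pen}}(m) := \kappa \hat\sigma_m^2 \frac{D_m}{n}$, where

$$\hat\sigma_m^2 := \frac{1}{n} \sum_{i=1}^n \left(Y_i - \langle \hat\beta_m, X_i \rangle\right)^2 = \gamma_n(\hat\beta_m).$$

Mean over 1000 Monte-Carlo replications of $\|\hat\beta_{\hat m} - \beta\|_\Gamma^2 \times 10^{-3}$, with $\lambda_j = j^{-2}$:

                                  n = 500   n = 1000   n = 2000
$\beta_1$   known $\sigma^2$      0.54      0.17       0.091
            unknown $\sigma^2$    0.57      0.18       0.091
$\beta_2$   known $\sigma^2$      0.51      0.084      0.037
            unknown $\sigma^2$    0.54      0.089      0.037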
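
With the plug-in variance estimator, the criterion becomes $\gamma_n(\hat\beta_m)\,(1 + \kappa D_m / n)$. The sketch below applies it to precomputed contrast values. The function name, the contrast values and `kappa=2.0` are hypothetical, since the talk does not give the constant.

```python
def select_dimension_plugin(contrasts, dims, n, kappa=2.0):
    """Select m minimising gamma_n(beta_hat_m) + kappa * sigma2_hat_m * D_m / n.

    contrasts[k] = gamma_n(beta_hat_m) for m = k + 1, also used as sigma2_hat_m;
    dims[k] = D_m. kappa=2.0 is a placeholder, not a value from the talk.
    """
    crit = [g + kappa * g * d / n for g, d in zip(contrasts, dims)]
    m_hat = min(range(len(crit)), key=crit.__getitem__) + 1
    return m_hat, crit


# Hypothetical contrast values for m = 1, 2, 3 with D_m = 2m + 1 and n = 2000.
contrasts = [0.0129, 0.0100, 0.00999]
dims = [3, 5, 7]
m_hat, crit = select_dimension_plugin(contrasts, dims, n=2000)
```

In this toy example the contrast barely improves from $m = 2$ to $m = 3$, so the penalty makes $m = 2$ the selected dimension.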


Conclusion and perspectives


- Estimation of the slope function $\beta$ by minimization of a penalized least square contrast. The estimation procedure is simple enough to be implemented.
- The prediction error of this estimator is controlled by an oracle inequality whatever the regularity of the function to be estimated.
- The maximum risk on Sobolev spaces reaches the optimal rate of convergence.
- However, the assumption of periodicity of the function $X$ is too restrictive; a generalization of the results presented to non-periodic curves is in progress.



Estimation procedure

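Let $(\hat\lambda_j, \hat\varphi_j)_{j \ge 1}$ be the eigenelements of $\Gamma_n$, sorted such that $\hat\lambda_1 \ge \hat\lambda_2 \ge \dots$.

1. Estimation on $\hat S_m$

For all $m = 1, \dots, N_n$, if $\hat\lambda_m > 0$, set

$$\hat\beta_m = \sum_{j=1}^m \frac{\langle \hat g, \hat\varphi_j \rangle}{\hat\lambda_j}\, \hat\varphi_j,$$

the unique minimizer of the least square contrast on $\hat S_m = \mathrm{span}\{\hat\varphi_1, \dots, \hat\varphi_m\}$ (recall that $\hat g := \frac{1}{n} \sum_{i=1}^n Y_i X_i$).

2. Dimension selection

$$\hat m \in \arg\min_{m = 1, \dots, N_n} \left(\gamma_n(\hat\beta_m) + \mathrm{pen}(m)\right),$$

where $\mathrm{pen}(m) = \kappa' \frac{m}{n} \sigma^2$.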
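
Step 1 can be sketched on discretised curves as follows. Integrals are replaced by Riemann sums on a regular grid. The function name `fpca_estimator`, the grid and the toy data are assumptions, and this is not presented as the authors' implementation.

```python
import numpy as np


def fpca_estimator(X, y, m, dt):
    """FPCA-based estimator of beta on a regular grid with spacing dt.

    X has shape (n, p), one discretised curve per row; y has shape (n,).
    Inner products <f, g> are approximated by dt * sum(f * g).
    """
    n = X.shape[0]
    gamma_n = X.T @ X / n * dt                   # discretised operator Gamma_n
    eigvals, eigvecs = np.linalg.eigh(gamma_n)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]        # indices of the m largest
    lam = eigvals[order]
    if lam[-1] <= 0:
        raise ValueError("the m-th eigenvalue must be positive")
    phi = eigvecs[:, order] / np.sqrt(dt)        # eigenfunctions with unit L2 norm
    g_hat = X.T @ y / n                          # g_hat = (1/n) sum_i Y_i X_i
    coef = (phi.T @ g_hat) * dt / lam            # <g_hat, phi_j> / lambda_j
    return phi @ coef


# Toy check: noiseless responses and curves spanned by three functions.
rng = np.random.default_rng(0)
p, n = 200, 300
dt = 1.0 / p
t = (np.arange(p) + 0.5) * dt
psi = np.vstack([
    np.ones(p),
    np.sqrt(2.0) * np.cos(2.0 * np.pi * t),
    np.sqrt(2.0) * np.sin(2.0 * np.pi * t),
])
X = (rng.standard_normal((n, 3)) * np.array([1.0, 0.5, 1.0 / 3.0])) @ psi
beta = psi[0] - 0.5 * psi[1]
y = X @ beta * dt                                # Y_i = <beta, X_i>, no noise
beta_hat = fpca_estimator(X, y, m=3, dt=dt)
```

In this noiseless toy setting `beta_hat` should match `beta` on the grid, because $\beta$ lies in the span of the three leading empirical eigenfunctions.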



Estimation results

[Figure: estimation of $\beta_1(x) = \ln(15x^2 + 10) + \cos(4\pi x)$, $n = 2000$; three panels for $\lambda_j = j^{-2}$, $\lambda_j = j^{-3}$ and $\lambda_j = e^{-j}$, with $x$ on the horizontal axis and $\tilde\beta(x)$ on the vertical axis.]

[Figure: estimation of $\beta_2(x) = x(x - 1)$, $n = 2000$; three panels for $\lambda_j = j^{-2}$, $\lambda_j = j^{-3}$ and $\lambda_j = e^{-j}$, with $x$ on the horizontal axis and $\tilde\beta(x)$ on the vertical axis.]



Thank you for your attention!

Brunel, E. and Roche, A. (2011). Penalized contrast estimation in functional linear model with circular data, hal.archives-ouvertes.fr:hal-00651399.

Cai, T. and Hall, P. (2006). Prediction in functional linear regression, Ann. Statist., 34(5), 2159–2179.

Cai, T. and Yuan, M. (2012). Minimax and adaptive prediction for functional linear regression, J. American Statistical Association, to appear.

Cardot, H., Ferraty, F. and Sarda, P. (2003). Spline estimator for the functional linear model, Statistica Sinica, 13, 571–591.

Comte, F. and Johannes, J. (2010). Adaptive estimation in circular functional linear models, Math. Method. Statist., 19(1), 42–63.

Crambes, C., Kneip, A. and Sarda, P. (2009). Smoothing spline estimator for functional linear regression, Ann. Statist., 37(1), 35–72.


Assumptions on the moments

There exist two constants $v > 0$ and $c > 0$ such that, for all $j = 1, \dots, 2m+1$ and for all $q \ge 2$:

$$E\left| \frac{\langle \varphi_j, X_i \rangle}{\sqrt{\lambda_j}} \right|^{2q} \le \frac{q!}{2}\, v^2 c^{q-2}.$$

The random variable $\varepsilon$ admits a moment $\tau_p$ of order $p > 6$.


Sketch of proof

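For all $m = 1, \dots, N_n$:

$$\gamma_n(\hat\beta_{\hat m}) + \mathrm{pen}(\hat m) \le \gamma_n(\hat\beta_m) + \mathrm{pen}(m) \le \gamma_n(\beta_m) + \mathrm{pen}(m),$$

with $\beta_m$ the orthogonal projection of $\beta$ on $S_m$, and

$$\gamma_n(\hat\beta_{\hat m}) - \gamma_n(\beta_m) = \|\hat\beta_{\hat m} - \beta\|_n^2 - \|\beta_m - \beta\|_n^2 + 2\nu_n(\beta_m - \hat\beta_{\hat m}),$$

with, for all $f \in L^2([0,1])$:

$$\nu_n(f) := \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle f, X_i \rangle \quad \text{and} \quad \|f\|_n^2 := \frac{1}{n} \sum_{i=1}^n \langle f, X_i \rangle^2.$$

Then,

$$\|\hat\beta_{\hat m} - \beta\|_n^2 \le \|\beta_m - \beta\|_n^2 + 2\nu_n(\hat\beta_{\hat m} - \beta_m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$$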


Simulation of X

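$$X = \sum_{j=1}^{501} \sqrt{\lambda_j}\, \xi_j \psi_j,$$

with $\xi_1, \dots, \xi_{501}$ independent realizations of $\mathcal{N}(0,1)$, where $(\psi_j)_{j \ge 1}$ are the eigenfunctions of the covariance operator associated with the Brownian motion: $\psi_j(x) = \sqrt{2} \sin(\pi (j - 0.5) x)$. We consider three sequences $(\lambda_j)_{j \ge 1}$:

$$\lambda_j^{(P1)} = j^{-2}, \qquad \lambda_j^{(P2)} = j^{-3}, \qquad \lambda_j^{(E)} = \exp(-j).$$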


Comparison with cross validation ($\beta_1$, $\lambda_j = j^{-3}$, $n = 1000$)

vc: $\mathrm{crit}(m) = \gamma_n(\hat\beta_m) + \mathrm{pen}(m)$ (red line)

GCV: $\mathrm{crit}_{GCV}(m) = \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{(1 - \mathrm{tr}(H_m)/n)^2}$ (light blue dotted line)

CV: $\mathrm{crit}_{CV}(m) = \frac{1}{n} \sum_{i=1}^n \left(Y_i - \hat Y_i^{(-i)}\right)^2$ (blue dot-dash line)

[Figure: six panels of estimated slope functions, with $x$ on the horizontal axis and $\tilde\beta(x)$ on the vertical axis.]
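
For a least squares fit, both cross-validation criteria can be computed from a single fit. The sketch below assumes that $H_m$ is the hat matrix of the least squares fit (so $\hat Y = H_m Y$) and uses the classical leave-one-out identity $Y_i - \hat Y_i^{(-i)} = (Y_i - \hat Y_i)/(1 - (H_m)_{ii})$. The function name and the toy data are hypothetical.

```python
import numpy as np


def gcv_and_cv(design, y):
    """Return (crit_GCV, crit_CV) for the least squares fit of y on design.

    Assumes design has full column rank, so tr(H_m) is its number of columns.
    """
    n = len(y)
    q, _ = np.linalg.qr(design)           # orthonormal basis of the column space
    fitted = q @ (q.T @ y)                # Y_hat = H_m Y
    leverage = np.sum(q**2, axis=1)       # diagonal entries of H_m
    resid = y - fitted
    gcv = np.sum(resid**2) / (1.0 - leverage.sum() / n) ** 2
    cv = np.mean((resid / (1.0 - leverage)) ** 2)  # leave-one-out shortcut
    return gcv, cv


# Toy regression problem with 40 observations and 3 regressors.
rng = np.random.default_rng(0)
design = rng.standard_normal((40, 3))
y = design @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.standard_normal(40)
gcv, cv = gcv_and_cv(design, y)
```

The shortcut avoids refitting the model $n$ times. It is exact for ordinary least squares but would not apply unchanged to other estimators.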


Dataset

We apply our estimation procedure to the ozone concentration data studied previously by Aneiros-Perez et al., Cardot et al. (2006) and Crambes et al. (2009).

The dataset consists of 474 daily measurements of ozone concentrations and the ozone peak of the next day.

[Figure: left panel "Ozone concentration", ozone concentration (µg/m^3) against time (h); right panel "Ozone peak", ozone concentration against days.]


Division of data sample

We suppose that the dependence between the concentration curve $X_j$ of day $j$ and the ozone peak $Y_j$ of the following day can be modelled by functional linear regression.

We randomly split our sample into two subsets:
- a sub-sample $\{(X_i^E, Y_i^E),\ i = 1, \dots, n\}$, with $n = 373$, used to compute the slope estimator $\hat\beta_{\hat m}$;
- the rest of the sample, $\{(X_i^T, Y_i^T),\ i = 1, \dots, 101\}$, kept to evaluate the performance of the estimator.
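
Such a random split can be sketched as follows. The seed and the variable names are arbitrary choices, and the ozone data themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
n_total, n_train = 474, 373

perm = rng.permutation(n_total)       # random ordering of the 474 days
train_idx = perm[:n_train]            # indices used to compute the estimator
test_idx = perm[n_train:]             # remaining indices, kept for evaluation
```

With the curves stored row-wise in an array, indexing by `train_idx` and `test_idx` would give the two sub-samples.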


Results

[Figure: "Predicted values vs. observed values" for the estimator defined on the Fourier basis, with the predicted concentration peak (µg/m^3) on the horizontal axis and the observed concentration peak (µg/m^3) on the vertical axis; boxplot of the residuals.]

Figure: Left: plot of the points $(\hat Y_i^{(T)}, Y_i^{(T)})$ (blue) and of the line $y = x$. Right: boxplot of the vector $\{\hat Y_i^{(T)} - Y_i^{(T)},\ i = 1, \dots, 101\}$.


Application to ozone data

[Figure: observed concentration peak (µg/m^3) against predicted concentration peak (µg/m^3), for the estimator defined on the Fourier basis and for FPCA; boxplots of the residuals for both estimators.]


First theoretical results: oracle type inequality

Suppose that the eigenvalues $(\lambda_j)_{j \ge 1}$ decrease at a polynomial or exponential rate and that some moment assumptions hold on the curves $X$ and the noise $\varepsilon$. Then, for all $\beta \in L^2([0,1])$ such that there exists a constant $b > 0$ verifying

$$\sum_{j \ge 1} j^b \langle \beta, \varphi_j \rangle^2 < +\infty,$$

we have:

$$E[\|\hat\beta_{\hat m} - \beta\|_\Gamma^2] \le C_1 \min_{m \in M_n} \left( E[\|\beta - \Pi_m \beta\|_n^2] + E[\|\beta - \Pi_m \beta\|_\Gamma^2] + \mathrm{pen}(m) \right) + \frac{C_2}{n} \left(1 + \|\beta\|_\Gamma^2\right),$$

with $C_1, C_2 > 0$ independent of $\beta$ and $n$.