
Short courses on functional data analysis and statistical learning, part 2. CENATAV, Havana, Cuba, September 16th, 2008.


Several nonlinear models and methods for FDA

Nathalie Villa-Vialaneix - nathalie.villa@math.univ-toulouse.fr - http://www.nathalievilla.org

Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France

Havana, September 16th, 2008


Table of contents

1 Nonparametric kernel

2 Neural networks

3 References


Nonparametric model in FDA

In this section, X is a random variable taking its values in a metric space (X, ‖.‖X) where ‖.‖X denotes a semi-norm (i.e., ‖x‖X = 0 does not necessarily imply x = 0).

In the following presentation, we are interested in the following nonparametric functional model:

Y = Φ(X) + ε

where Y is a real random variable (regression case), X is a functional random variable taking its values in (X, ‖.‖X), ε is a centered random variable independent from X, and Φ is an unknown operator from X to R. We also suppose that we are given a set of n i.i.d. realizations of the random pair (X, Y):

(x1, y1), (x2, y2), . . . , (xn, yn).

From this training set, we aim at building an estimate, Φn, of Φ such that Φn converges to the true Φ when n tends to infinity, in a sense that will be developed later.


Nadaraya-Watson kernel estimate [Nadaraya, 1964, Watson, 1964]

Returning to the real case (i.e., X ∈ R), the Nadaraya-Watson kernel estimate is the regression function:

$$\Phi_n : x \in \mathbb{R} \longmapsto \frac{\sum_{i=1}^n y_i\, K\!\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)}$$

where

K is the so-called kernel, i.e., K : R → R is a bounded, integrable function. Additionally, K is often positive and null everywhere except on a compact subset of R.

h is the smoothing parameter: this parameter controls the smoothness of the estimate Φn.
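To fix ideas, here is a minimal numerical sketch (not taken from the slides) of this real-case estimate with a Gaussian kernel; the toy data and the function name are assumptions made for the illustration.

```python
# Minimal Nadaraya-Watson estimate in the real case (illustrative sketch).
import numpy as np

def nadaraya_watson(x_train, y_train, x_new, h, kernel=lambda u: np.exp(-u**2)):
    """Return Phi_n(x_new) = sum_i y_i K((x_new - x_i)/h) / sum_i K((x_new - x_i)/h)."""
    weights = kernel((x_new - x_train) / h)
    return np.sum(weights * y_train) / np.sum(weights)

# toy data: y = sin(x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.2 * rng.normal(size=200)
print(nadaraya_watson(x, y, x_new=1.0, h=0.3))   # roughly sin(1.0) ~ 0.84
```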


Remark on the parameters

Two parameters of the estimator Φn have to be set:

1 The kernel K: its choice does not have much influence on the accuracy of Φn. Several common choices for K are: the uniform kernel K : u ∈ R → I[−1,1](u) (“moving average estimate”), the Epanechnikov kernel K : u ∈ R → (1 − u²) I[−1,1](u), the Gaussian kernel K : u ∈ R → e^{−u²}, . . .

2 The smoothing parameter h: it is of main importance to obtain a good approximation. In particular, h depends on n. Several methods have been proposed to choose it, such as cross-validation strategies, as sketched below.

[Figure: the estimate Φn obtained with h = 1 and with h = 0.5.]
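As an illustration of one such strategy, the following sketch selects h by leave-one-out cross-validation on a toy sample; the candidate grid and the data are assumptions, and other selection rules are of course possible.

```python
# Leave-one-out cross-validation for the bandwidth h (illustrative sketch).
import numpy as np

def loo_cv_bandwidth(x, y, candidates, kernel=lambda u: np.exp(-u**2)):
    """Pick h minimizing the leave-one-out squared prediction error."""
    n = len(x)
    errors = []
    for h in candidates:
        sq_err = 0.0
        for i in range(n):
            mask = np.arange(n) != i          # leave observation i out
            w = kernel((x[i] - x[mask]) / h)
            pred = np.sum(w * y[mask]) / np.sum(w)
            sq_err += (y[i] - pred) ** 2
        errors.append(sq_err / n)
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 100)
y = np.sin(x) + 0.2 * rng.normal(size=100)
print(loo_cv_bandwidth(x, y, candidates=np.array([0.05, 0.1, 0.2, 0.5, 1.0])))
```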


Generalization of the N.W. kernel to FDA [Ferraty and Vieu, 2002, Ferraty and Vieu, 2000]

When X takes its values in the Hilbert space (X, 〈., .〉X), this estimate can be generalized by

$$\Phi_n : x \in \mathcal{X} \longmapsto \sum_{i=1}^n w_i(x)\, y_i$$

where

$$w_i(x) = \frac{K\!\left(\frac{\|x_i - x\|_{\mathcal{X}}}{h}\right)}{\sum_{k=1}^n K\!\left(\frac{\|x_k - x\|_{\mathcal{X}}}{h}\right)}$$

and K and h are defined as previously (i.e., as in the real case).
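A possible numerical sketch of this functional estimate, with curves observed on a common grid and the L2 norm approximated by its discretized version, is given below; the data-generating process and function names are assumptions made for the example.

```python
# Functional Nadaraya-Watson estimate on discretized curves (illustrative sketch).
import numpy as np

def functional_nw(curves, y, new_curve, h, kernel=lambda u: np.exp(-u**2)):
    """curves: (n, d) array of discretized x_i; new_curve: (d,) array.
    Returns sum_i w_i(x) y_i with w_i proportional to K(||x_i - x||_X / h)."""
    dists = np.sqrt(np.mean((curves - new_curve) ** 2, axis=1))  # discretized L2 norm
    w = kernel(dists / h)
    return np.sum(w * y) / np.sum(w)

# toy functional data: x_i(t) = a_i sin(t), y_i = a_i^2 + noise
rng = np.random.default_rng(2)
t = np.linspace(0, np.pi, 50)
a = rng.uniform(0.5, 2.0, 150)
curves = a[:, None] * np.sin(t)[None, :]
y = a ** 2 + 0.1 * rng.normal(size=150)
print(functional_nw(curves, y, new_curve=1.5 * np.sin(t), h=0.2))  # roughly 2.25
```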

Basic assumption about the fractal dimension

The main assumption for theoretical results on the convergence of the N.W. kernel estimate in the Hilbertian case is:

(Afd1) Fractal dimension assumption (also called “small balls assumption”)

$$\lim_{\alpha \to 0^+} \frac{\mathbb{P}\left(X \in B(x, \alpha)\right)}{\alpha^{a(x)}} = c(x)$$

where a(x) and c(x) are positive real numbers and B(x, α) := {u ∈ X : ‖u − x‖X ≤ α}.

a(x) is called the fractal dimension of the probability distribution of X.


Assumptions for pointwise convergence

(A1) $\lim_{n\to+\infty} h_n = 0$ and $\lim_{n\to+\infty} \dfrac{n\, h_n^{a(x)}}{\log n} = +\infty$

(A2) K is bounded and is null except on a compact subset of R+; moreover, K satisfies

$$\forall\, t, t' \in \mathbb{R}^+, \quad |K(t) - K(t')| \leq |t - t'|$$

(A3) Y is bounded

(A4) Φ is continuous at x ∈ X

Pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]

Under assumptions (A1)-(A4) and assumption (Afd1),

$$\lim_{n\to+\infty} \Phi_n(x) = \Phi(x).$$

Assumptions for optimal rate of pointwise convergence

(Afd2) $\mathbb{P}\left(X \in B(x,\alpha)\right) = \alpha^{a(x)} c(x) + O\!\left(\alpha^{a(x)+b(x)}\right)$

(A4’) There exist B > 0, C > 0 and β > 0 such that, for all u and v in B(x, B),

$$|\Phi(u) - \Phi(v)| \leq C\, \|u - v\|_{\mathcal{X}}^{\beta}$$

(A1’) $h_n = h\left(\frac{\log n}{n}\right)^{\frac{1}{2\gamma(x)+a(x)}}$ where $\gamma(x) = \min\{b(x), \beta\}$ and $\lim_{n\to+\infty} \dfrac{\log n}{n\, h_n^{a(x)}} = 0$

Rate of convergence for pointwise convergence

Theorem [Ferraty and Vieu, 2000, Ferraty and Vieu, 2002]

Under assumptions (A1’), (A2), (A3), (A4’) and assumption (Afd2),

$$\Phi_n(x) - \Phi(x) = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{\gamma(x)}{2\gamma(x)+a(x)}}\right)$$

Assumptions for uniform convergence on a compact subset S of X

(Afd3) $\lim_{\alpha \to 0^+} \dfrac{\mathbb{P}\left(X \in B(x,\alpha)\right)}{\alpha^{a}} = c(x)$ uniformly over x ∈ S, where $\inf_{x \in S} c(x) > 0$

(A5) The covering number of S, N(S, l) (i.e., the minimum number of balls of radius l that are needed to cover S), is such that there exist α > 0 and C > 0 with N(S, l) = C l^{−α}

(A4”) Φ is continuous on S

Uniform convergence

Theorem [Ferraty and Vieu, 2004, Ferraty and Vieu, 2008]

Under assumptions (A1)-(A3), (A4”), (A5) and assumption (Afd3),

$$\Phi_n(x) - \Phi(x) = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{\gamma(x)}{2\gamma(x)+a(x)}}\right)$$

Note on possible choices for the semi-norm

1 PCA semi-norm: suppose that X is a square integrable random variable of L² (i.e., E(‖X‖²_{L²}) < +∞). Then, the covariance operator ΓX can be written as Σ_{k≥1} λk vk ⊗ vk, where ((λk), (vk))k is the eigensystem of ΓX, the (λk) are in decreasing order and the (vk) are orthonormal vectors of L². This defines a semi-norm on X = L²: for a given K ∈ N*,

$$\forall\, x \in \mathcal{X}, \quad \|x\|_{\mathcal{X}}^2 = \sum_{k=1}^{K} \langle v_k, x\rangle_{L^2}^2 = \left\|P_{\mathrm{Span}\{v_1,\ldots,v_K\}}(x)\right\|_{L^2}^2.$$

This semi-norm emphasizes the main directions for the representation of the random variable X.

2 q-th derivative semi-norm: suppose now that X = {h ∈ L² : h^{(q)} exists and is in L²}. Then,

$$\forall\, x \in \mathcal{X}, \quad \|x\|_{\mathcal{X}}^2 = \left\|x^{(q)}\right\|_{L^2}^2.$$

This norm is strongly related to RKHS (Sobolev spaces) and splines and will be further investigated in Presentation 4.
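The two semi-norms can be approximated on discretized curves as in the following sketch; the simple quadrature and finite differences used here are assumptions made for the illustration, not the constructions used in the cited works.

```python
# Discretized surrogates of the PCA semi-norm and of the q-th derivative
# semi-norm (illustrative sketch on curves sampled on a regular grid).
import numpy as np

def pca_seminorm(curves, x, K):
    """Squared semi-norm of x from the K leading eigenvectors of the
    discretized covariance matrix of the sample `curves` (shape (n, d))."""
    centered = curves - curves.mean(axis=0)
    d = curves.shape[1]
    cov = centered.T @ centered / len(curves) / d      # discretized covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)
    v = eigvec[:, ::-1][:, :K]                          # K leading eigenvectors
    coords = v.T @ x / d                                # coordinates of x (up to grid scaling)
    return np.sum(coords ** 2)

def derivative_seminorm(x, t, q=1):
    """Squared semi-norm ||x^(q)||_{L2}^2, derivatives by finite differences."""
    deriv = x.copy()
    for _ in range(q):
        deriv = np.gradient(deriv, t)
    return np.trapz(deriv ** 2, t)

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
curves = rng.normal(size=(60, 1)) * np.sin(2 * np.pi * t)[None, :]
print(pca_seminorm(curves, curves[0], K=2), derivative_seminorm(curves[0], t, q=1))
```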


Generalization to curve classification [Ferraty and Vieu, 2003, Ferraty and Vieu, 2004]

Suppose that X ∈ X but also that Y ∈ {1, 2, . . . , G}.

Then, a classification rule is given by:

$$\forall\, x \in \mathcal{X}, \quad g(x) = \arg\max_{g=1,\ldots,G} \mathbb{P}\left(Y = g \mid X = x\right).$$

This rule needs the estimation of the probability P(Y = g | X = x):

$$\mathbb{P}_n(Y = g \mid X = x) := \sum_{i=1}^n w_i(x)\, \mathbb{I}_{[Y_i = g]}
\qquad \text{where} \quad
w_i(x) = \frac{K\!\left(\frac{\|x - x_i\|_{\mathcal{X}}}{h}\right)}{\sum_{l=1}^n K\!\left(\frac{\|x - x_l\|_{\mathcal{X}}}{h}\right)}.$$
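A minimal sketch of this classification rule on discretized curves follows; the toy classes, the discretized semi-metric and the bandwidth are illustrative assumptions.

```python
# Kernel-based curve classification: estimate P(Y = g | X = x) by the kernel
# weights and predict the maximizing class (illustrative sketch).
import numpy as np

def kernel_classify(curves, labels, new_curve, h, kernel=lambda u: np.exp(-u**2)):
    dists = np.sqrt(np.mean((curves - new_curve) ** 2, axis=1))
    w = kernel(dists / h)
    w = w / w.sum()
    classes = np.unique(labels)
    post = np.array([w[labels == g].sum() for g in classes])   # P_n(Y = g | X = x)
    return classes[np.argmax(post)], dict(zip(classes, post))

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 60)
n = 100
labels = rng.integers(1, 3, n)                                  # two classes: 1 and 2
curves = np.where(labels[:, None] == 1, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)) \
         + 0.3 * rng.normal(size=(n, len(t)))
print(kernel_classify(curves, labels, np.sin(2 * np.pi * t), h=0.3))
```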


Example of nonparametric curve classification with PCA semi-norm

Problem: Discriminating 5 phonemes from their log-periodograms.

Competitor methods:
Ridge PDA (i.e., Principal Discriminant Analysis penalized by the norm)
Partial Least Squares with the L2 norm (denoted by MPLSR)
Partial Least Squares with the PCA semi-norm (denoted by NPCD/MPLSR)
Nonparametric kernel estimator with the PCA semi-norm (denoted by NPCD/PCA)


Obtained result

[Figure: results obtained by the competing methods on the phoneme data set.]

Generalization to time series [Ferraty and Vieu, 2004]

Problem and notations: If we are given a time series (Z(t))_{t∈R}, one is often interested, knowing {Z(t), t ∈ [Tmax − T, Tmax]}, in predicting Z(Tmax + τ).

Denoting

X = {Z(t), t ∈ [Tmax − T, Tmax]}
Y = Z(Tmax + τ)

we can see that this problem is strongly related to a functional regression model.

The observations are given by

xi = {z(t), t ∈ [(i − 1)T, iT]}
yi = z(iT + τ)

for i = 1, . . . , n, but they are not independent.
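The construction of the pairs (xi, yi) from a sampled series can be sketched as follows; the block length and the horizon τ are illustrative assumptions.

```python
# Cut a sampled scalar series into consecutive blocks (the functional inputs)
# and associate the value tau steps after each block (illustrative sketch).
import numpy as np

def series_to_functional_pairs(z, block_len, tau):
    """Return x (n, block_len) and y (n,) with x_i = z[(i-1)T : iT] and y_i ~ z(iT + tau)."""
    xs, ys = [], []
    i = 1
    while i * block_len + tau - 1 < len(z):
        xs.append(z[(i - 1) * block_len: i * block_len])
        ys.append(z[i * block_len + tau - 1])
        i += 1
    return np.array(xs), np.array(ys)

z = np.sin(np.linspace(0, 40, 2000)) + 0.05 * np.random.default_rng(5).normal(size=2000)
x, y = series_to_functional_pairs(z, block_len=100, tau=10)
print(x.shape, y.shape)   # as noted on the slide, these pairs are not independent
```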


Mixing assumptions

Under mixing assumptions on ((xi, yi))i and other assumptions, the same convergence results hold.

More precisely, suppose that, for

$$\alpha(n) = \sup_{k \in \mathbb{Z},\; A \in \sigma_{-\infty}^{k},\; B \in \sigma_{k+n}^{+\infty}} \left|\mathbb{P}(A \cap B) - \mathbb{P}(A)\,\mathbb{P}(B)\right|$$

where $\sigma_{k}^{l} = \sigma\left(\{(x_i, y_i) : k \leq i \leq l\}\right)$, we have

$$\lim_{n\to+\infty} \alpha(n) = 0.$$

α is called the mixing coefficient.


Table of contents

1 Nonparametric kernel

2 Neural networks

3 References


Biological neural networks

[Diagram: a biological neuron receives input signals, which are summed (Σ); if the sum exceeds the activation threshold the neuron fires, otherwise it does not.]

Mathematical multilayer perceptrons: multidimensional case

[Diagram: the inputs (Variable 1, Variable 2, . . . , Variable d; here with example values 2, 1.5, . . . , 11) are multiplied by weights (e.g., 0.5, −1, 0.2), summed (here Σ = 4.4) and passed through the activation function G on the hidden layer; the hidden outputs are in turn multiplied by weights, summed and passed through f to produce the outputs.]

In summary, multilayer perceptrons with 1 hidden layer are:

a class of functions of the form:

$$\phi^p_w : x \in \mathbb{R}^d \longmapsto \sum_{k=1}^{p} w^{(2)}_k\, G\!\left(w^{(0)}_k + (w^{(1)}_k)^T x\right)$$

where

G is given and called the activation function. Popular examples of activation functions are the linear activation function and the sigmoid activation function.

p is also given and called the number of neurons on the hidden layer. p is a main parameter to obtain a good generalization ability.

For all k = 1, . . . , p, w^{(2)}_k ∈ R, w^{(0)}_k ∈ R and w^{(1)}_k ∈ R^d are the weights that have to be set from the learning data set.

More precisely, given ((xi, yi))_{i=1,...,n}, n i.i.d. realizations of the random pair (X, Y) that takes its values in R^d × R, an estimate of Ψ is given by Φn = φ_{w*_n} where w*_n is a solution of:

$$w^*_n = \arg\min_{w \in (\mathbb{R} \times \mathbb{R} \times \mathbb{R}^d)^p} \sum_{i=1}^{n} \left(y_i - \phi_w(x_i)\right)^2.$$
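The following self-contained sketch fits such a one-hidden-layer perceptron by plain gradient descent on the empirical squared error; the architecture sizes, the learning rate and the toy data are assumptions made for the example and this is not the training procedure advocated in the references.

```python
# One-hidden-layer perceptron phi_w(x) = sum_k w2_k G(w0_k + w1_k^T x), fitted
# by gradient descent on the empirical squared error (illustrative sketch).
import numpy as np

rng = np.random.default_rng(6)
d, p, n = 3, 10, 500
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)   # toy regression target

G = np.tanh                                                      # sigmoid-type activation
W1 = rng.normal(size=(p, d)) * 0.5                               # w^(1)_k, k = 1..p
w0 = np.zeros(p)                                                 # w^(0)_k
w2 = rng.normal(size=p) * 0.5                                    # w^(2)_k

lr = 0.01
for _ in range(2000):
    A = X @ W1.T + w0            # (n, p) pre-activations w0_k + w1_k^T x_i
    H = G(A)                     # hidden layer outputs
    pred = H @ w2                # phi_w(x_i)
    err = pred - y
    # gradients of (1/n) sum_i (phi_w(x_i) - y_i)^2
    g_w2 = 2 * H.T @ err / n
    g_A = 2 * np.outer(err, w2) * (1 - H ** 2) / n               # tanh'(a) = 1 - tanh(a)^2
    g_w0 = g_A.sum(axis=0)
    g_W1 = g_A.T @ X
    w2 -= lr * g_w2; w0 -= lr * g_w0; W1 -= lr * g_W1
print("training MSE:", np.mean(err ** 2))
```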


Universal approximation

The popularity of MLP in the multidimensional context comes from two main properties. The first one is called the universal approximation capability:

Theorem [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]

If the activation function is continuous and non polynomial, then

$$\left\{\phi^p_w : w \in \left(\mathbb{R} \times \mathbb{R} \times \mathbb{R}^d\right)^p,\ p \in \mathbb{N}^*\right\}$$

is dense in the set of continuous functions from R^d to R for the topology induced by the uniform norm on compact sets.

Consistency to optimal weights

The second property deals with the fact that the optimal empirical weights tend to the optimal weights when the size of the training data tends to infinity.

Theorem [White, 1989, White, 1990]

Denote $w^* = \arg\min_{w \in (\mathbb{R} \times \mathbb{R} \times \mathbb{R}^d)^p} \mathbb{E}\left(\left(Y - \phi^p_w(X)\right)^2\right)$ and

$$\Theta(p, \Delta) = \left\{\phi^p_w : \sum_{k=1}^{p} |w^{(2)}_k| \leq \Delta \ \text{ and } \ \sum_{k=1}^{p}\left(|w^{(0)}_k| + \sum_{l=1}^{d} |w^{(1)}_{kl}|\right) \leq \Delta p\right\}.$$

If

1 the activation function G is bounded;
2 $\lim_{n\to+\infty} p_n = +\infty$, $\lim_{n\to+\infty} \Delta_n = +\infty$, $\Delta_n = o(n)$ and $p_n \Delta_n^4 \log(p_n \Delta_n) = o(n)$;
3 w* exists and is unique;
4 $w^*_n = \arg\min_{w :\, \phi_w \in \Theta(p_n, \Delta_n)} \sum_{i=1}^{n} (y_i - \phi_w(x_i))^2$;

then $\lim_{n\to+\infty} \left\|w^*_n - w^*\right\| = 0$.

Multilayer perceptrons with functional input

Suppose now that (X, Y) is a random pair taking its values in X × R where (X, 〈., .〉X) is a Hilbert space.

Multilayer perceptrons with 1 hidden layer generalize to functional input by:

$$\phi^p_w : x \in \mathcal{X} \longmapsto \sum_{k=1}^{p} w^{(2)}_k\, G\!\left(w^{(0)}_k + \langle x, w^{(1)}_k\rangle_{\mathcal{X}}\right)$$

where the weights w^{(1)}_k ∈ X (functional values).

With relevant representations of the weights (w^{(1)}_k)_k (rich enough representations), the universal approximation property of this model remains valid.


Practical implementation: discrete sampling case

Suppose now that the (xi)i are known on a discrete sampling grid t1, t2, . . . , td. Then, 〈w^{(1)}_k, xi〉X can be approximated by

$$\langle w^{(1)}_k, x_i\rangle_{\mathcal{X}} \simeq \frac{1}{d} \sum_{l=1}^{d} x_i(t_l)\, w^{(1)}_k(t_l).$$

w^{(1)}_k should be searched in a class of functions F such that, for any set of real numbers (cl)_{l=1,...,d}, there exists a function w ∈ F for which

∀ l = 1, . . . , d, w(tl) = cl.

Splines have such a property (see Presentation 4).
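A short sketch of this discretized first layer is given below; the functional weights are simply represented by their values on the grid, and here they are drawn at random rather than trained, which is an assumption made to keep the example small.

```python
# Discretized first layer of a functional perceptron: <w^(1)_k, x_i>_X is
# approximated by (1/d) sum_l x_i(t_l) w^(1)_k(t_l) (illustrative sketch).
import numpy as np

rng = np.random.default_rng(7)
d, p, n = 100, 5, 20
t = np.linspace(0, 1, d)
curves = rng.normal(size=(n, 1)) * np.sin(2 * np.pi * t)[None, :]    # x_i on the grid

W1 = rng.normal(size=(p, d))        # values w^(1)_k(t_l) of the functional weights
w0 = rng.normal(size=p)
w2 = rng.normal(size=p)
G = np.tanh

inner = curves @ W1.T / d           # (n, p): approximate <w^(1)_k, x_i>_X
outputs = G(inner + w0) @ w2        # phi_w(x_i) for each curve
print(outputs.shape)                # (20,)
```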


Practical implementation: projection approach

Suppose now that X admits a family of functions (ψ^q_k)_{q∈N*, 1≤k≤q} such that:

for all q ∈ N*, (ψ^q_k)_k is an orthonormal system,

if Pq denotes the projection in X on Span{ψ^q_1, . . . , ψ^q_q}, then the pointwise consistency property is satisfied:

$$\forall\, x \in \mathcal{X},\ \forall\, \varepsilon > 0,\ \exists\, Q \in \mathbb{N}^* : \forall\, q \geq Q,\ \left\|P_q(x) - x\right\|_{\infty} \leq \varepsilon.$$

Hence, 〈w^{(1)}_k, xi〉X can be approximated by

$$\langle w^{(1)}_k, x_i\rangle_{\mathcal{X}} \simeq \langle w^{(1)}_k, P_q(x_i)\rangle_{\mathcal{X}} = \langle P_q(w^{(1)}_k), x_i\rangle_{\mathcal{X}} = \langle P_q(w^{(1)}_k), P_q(x_i)\rangle_{\mathcal{X}}.$$


Universal approximation for projection based approach

Theorem [Rossi and Conan-Guez, 2005, Rossi and Conan-Guez, 2006]

If G is continuous and non polynomial, then the set of functions

$$x \in \mathcal{X} \longmapsto \sum_{k=1}^{p} w^{(2)}_k\, G\!\left(w^{(0)}_k + \sum_{l=1}^{q} \beta_{kl}\, (P_q(x))_l\right)$$

is dense in the set of all continuous functions defined on X for the uniform norm on any compact subset of X.

Remark 1: A similar result exists for the pointwise consistent approach.

Remark 2: A convergence result for the optimal weights of the functional MLP also exists but requires many technical assumptions, so we do not detail it here. See [Rossi and Conan-Guez, 2005] for more details.


Functional Inverse Regression

Suppose that we are given a random pair (X, Y) taking its values in X × R for which we are given n i.i.d. realizations (x1, y1), . . . , (xn, yn).

Model: Moreover, we suppose that (X, Y) satisfies the following model:

$$Y = \Psi\left(\langle X, a_1\rangle_{\mathcal{X}}, \ldots, \langle X, a_q\rangle_{\mathcal{X}}, \varepsilon\right)$$

where ε is a centered real random variable independent of X, Ψ is an unknown function that has to be estimated and {a1, . . . , aq} are unknown elements of X that are linearly independent and that have to be estimated.

We call the space Span{a1, . . . , aq} the Effective Dimension Reduction subspace of X, denoted by EDR.


Fundamental property of EDR space

Denote A = (〈X, a1〉X, . . . , 〈X, aq〉X)^T.

Li’s condition [Li, 1991]

If

(A-Li) ∀ u ∈ X, ∃ v ∈ R^q : E(〈u, X〉X | A) = v^T A,

then E(X | Y) ∈ ΓX(EDR).

⇒ The EDR space is estimated through the estimation of a1, . . . , aq, the eigenvectors of Γ_X^{−1} Γ_{E(X|Y)}.

But, as for the functional linear model, Γ_X^{−1} has to be estimated by a penalized or a regularized approach.


PCA approach [Ferré and Yao, 2003, Ferré and Yao, 2005]

Note:

((λ^n_i, v^n_i))_{i≥1} the eigenvalue decomposition of Γ^n_X (the (λ^n_i)_i are ordered in decreasing order and at most n eigenvalues are non-null; the (v^n_i)_i are orthonormal);

k_n an integer such that k_n ≤ n and lim_{n→+∞} k_n = +∞;

P_{k_n} the projector $P_{k_n} = \sum_{i=1}^{k_n} \langle v^n_i, \cdot\rangle_{\mathcal{X}}\, v^n_i$;

$\Gamma^{n,k_n}_X = P_{k_n} \circ \Gamma^n_X \circ P_{k_n} = \sum_{i=1}^{k_n} \lambda^n_i\, \langle v^n_i, \cdot\rangle_{\mathcal{X}}\, v^n_i$;

if (I_s)_{s=1,...,S} is a partition of the subset of R where Y takes its values, then we estimate

P(Y ∈ I_s) by $p^n_s = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{y_i \in I_s\}}$,

E(X | Y ∈ I_s) by $\mu^n_s = \frac{1}{n} \sum_{i=1}^{n} x_i\, \mathbb{I}_{\{y_i \in I_s\}}$,

Γ_{E(X|Y)} by $\Gamma^n_{E(X|Y)} = \sum_{s=1}^{S} p^n_s\, \mu^n_s \otimes \mu^n_s$;

((α^{k_n}_k, a^n_k))_{k=1,...,q} the eigensystem of $(\Gamma^{n,k_n}_X)^{+}\, \Gamma^n_{E(X|Y)}$.
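A rough numerical sketch of this slice-based construction on discretized curves is given below. It uses within-slice means of the centered curves and a pseudo-inverse restricted to the first k_n principal components, which is a standard SIR-type variant and not exactly the estimator of the cited papers; the grid, the slicing and the toy model are illustrative assumptions.

```python
# Functional sliced inverse regression sketch on discretized curves.
import numpy as np

def functional_sir(curves, y, n_slices=5, k_n=5, q=1):
    n, d = curves.shape
    xc = curves - curves.mean(axis=0)
    cov = xc.T @ xc / n                                   # discretized covariance of X
    eigval, eigvec = np.linalg.eigh(cov)
    eigval, eigvec = eigval[::-1][:k_n], eigvec[:, ::-1][:, :k_n]

    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    slice_id = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
    cov_e = np.zeros((d, d))                              # estimate of Gamma_{E(X|Y)}
    for s in range(n_slices):
        idx = slice_id == s
        if not idx.any():
            continue
        p_s = idx.mean()                                  # estimated P(Y in I_s)
        mu_s = xc[idx].mean(axis=0)                       # slice mean of the centered curves
        cov_e += p_s * np.outer(mu_s, mu_s)

    # pseudo-inverse of the covariance restricted to the first k_n principal components
    proj = eigvec.T @ cov_e @ eigvec
    M = np.diag(1.0 / eigval) @ proj
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:q]
    return eigvec @ vecs[:, order].real                   # estimated EDR directions on the grid

rng = np.random.default_rng(8)
t = np.linspace(0, 1, 50)
n = 300
coef = rng.normal(size=(n, 2))
curves = coef[:, :1] * np.sin(2 * np.pi * t) + coef[:, 1:] * np.cos(2 * np.pi * t) \
         + 0.1 * rng.normal(size=(n, len(t)))
y = np.tanh(curves @ np.sin(2 * np.pi * t) / len(t)) + 0.05 * rng.normal(size=n)
print(functional_sir(curves, y, q=1).shape)               # (50, 1)
```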


Assumptions for convergence of the estimated EDR space

(A1) X has a finite 4th moment;

(A2) the (λk)_{k≥1} are all distinct and positive;

(A3) if we note $b_1 = \frac{2\sqrt{2}}{\lambda_1 - \lambda_2}$ and $b_j = \frac{2\sqrt{2}}{\min(\lambda_{j-1} - \lambda_j,\ \lambda_j - \lambda_{j+1})}$ for j ≥ 2, then

$$\lim_{n\to+\infty} \frac{1}{\sqrt{n}\,\lambda_{k_n}} \sum_{j=1}^{k_n} b_j = 0
\quad\text{and}\quad
\lim_{n\to+\infty} \frac{1}{\sqrt{n}\,\lambda^2_{k_n}} = 0;$$

(A4) the eigenvalues of $\Gamma_X^{-1} \Gamma_{E(X|Y)}$ are all distinct.

Convergence of the estimated EDR space

Theorem [Ferré and Yao, 2003, Ferré and Yao, 2005]

Under assumption (A-Li) and assumptions (A1)-(A4),

$$\left\|a^n_k - a_k\right\|_{\mathcal{X}} \xrightarrow[n\to+\infty]{\ \mathbb{P}\ } 0.$$

Remark: A similar result for a smoothing approach is described in [Ferré and Villa, 2005, Ferré and Villa, 2006].


FIR multilayer perceptrons

The idea of this model, developed in [Ferré and Villa, 2006], is to estimate the function Ψ by a multilayer perceptron.

[Diagram: the projections 〈X, a1〉X, 〈X, a2〉X, . . . , 〈X, aq〉X feed a one-hidden-layer perceptron with weights w^{(1)}, a bias, the weights w^{(2)} and the bias w^{(0)}, producing the prediction of Y.]


Notations

Outputs: ∀ u ∈ R^q,

$$\phi_w(u) = \sum_{k=1}^{p} w^{(2)}_k\, G\!\left((w^{(1)}_k)^T u + w^{(0)}_k\right);$$

For a given error function L (e.g., the mean square error), ζ((u, y), w) = L(φ_w(u), y);

Variables: Z = ((〈X, aj〉X)_{j=1,...,q}, Y) and z^n_i = ((〈xi, aj〉X)_{j=1,...,q}, yi): Z and the (z^n_i)_i take their values in an open subset, O, of R^{q+1};

The weights are chosen in a compact subset, W, of (R × R^q × R)^p:

theoretical: w* = arg min_{w∈W} E(ζ(Z, w)); W* is the set of all w*;
empirical: $w^*_n = \arg\min_{w\in\mathcal{W}} \sum_{i=1}^{n} \zeta(z^n_i, w)$.
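To make the objects above concrete, the following self-contained sketch minimizes the empirical criterion Σi ζ(z^n_i, w) with a one-hidden-layer perceptron on the projections; here the directions a_j are known toy functions (in practice they would be the FIR estimates), and the use of scikit-learn's MLPRegressor and its tuning values are assumptions made for brevity.

```python
# FIR-MLP step: fit a one-hidden-layer perceptron on the projections
# z_i = (<x_i, a_j>_X)_j (illustrative sketch with known toy directions).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(9)
t = np.linspace(0, 1, 50)
n, q = 300, 2
a = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)   # toy directions a_1, a_2
coef = rng.normal(size=(n, q))
curves = coef @ a.T                                            # x_i on the grid
Z = curves @ a / len(t)                                        # z_i = (<x_i, a_j>_X)_j
y = np.tanh(Z[:, 0]) + 0.5 * Z[:, 1] ** 2 + 0.05 * rng.normal(size=n)

mlp = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   max_iter=5000, random_state=0).fit(Z, y)
print("training R^2:", mlp.score(Z, y))
```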


Assumptions for convergence of the optimal empirical weights

(A1) ∀ z ∈ O, ζ(z, .) is continuous;

(A2) ∀ w ∈ W, ζ(., w) is measurable;

(A3) there exists a measurable function ζ̃ from O to R such that, ∀ z ∈ O, ∀ w ∈ W, |ζ(z, w)| < ζ̃(z) and E(ζ̃(Z)) < +∞;

(A4) ∀ w ∈ W, ∃ C(w) > 0 such that for all (x, y) and (x′, y′) in O,

$$\left|\zeta((x, y), w) - \zeta((x', y), w)\right| \leq C(w)\, \|x - x'\|_{\mathbb{R}^q}$$


Convergence of the optimal empirical weights

Theorem [Ferré and Villa, 2006]

Under assumption (A-Li), assumptions ensuring the convergence of the estimated EDR space and assumptions (A1)-(A4),

$$d(w^*_n, \mathcal{W}^*) \xrightarrow[n\to+\infty]{\ \mathbb{P}\ } 0,$$

where d is defined by $d(w, \mathcal{W}) = \inf_{\tilde{w}\in\mathcal{W}} \left\|w - \tilde{w}\right\|_{\mathbb{R}^{(q+2)p}}$.

Application: Tecator Data Set

Aim: Predicting the fat content of pieces of meat from their infrared spectra.

Comparison of:

PCA and then MLP [Thodberg, 1996];
Functional MLP (with a discrete sampling based approach) [Rossi and Conan-Guez, 2005];
smoothing FIR and then MLP [Ferré and Villa, 2006];
projection based FIR and then MLP [Ferré and Yao, 2003, Ferré and Villa, 2006];
smoothing FIR and then linear model.

Methodology: Repetition of 50 experiments with random training/test sets.


Results of the experiments

[Figure: boxplots of the test errors of the compared methods (labels on the plot: PCA-MLP, NN f, SIRr-MLP, SIRp-MLP, SIRr-ML).]

Remark: A similar experiment has been made on the phoneme dataset to compare “FIR-MLP” to the functional nonparametric kernel.


Table of contents

1 Nonparametric kernel

2 Neural networks

3 References


References

Further details for the references are given in the joint document.

Ferraty, F. and Vieu, P. (2000). Dimension fractale et estimation de la régression dans des espaces vectoriels semi-normés. Comptes Rendus Mathématique. Académie des Sciences. Paris, 330:139–142.

Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and application to spectrometric data. Computational Statistics, 17:515–561.

Ferraty, F. and Vieu, P. (2003). Curves discrimination: a nonparametric approach. Computational Statistics and Data Analysis, 44:161–173.

Ferraty, F. and Vieu, P. (2004). Nonparametric models for functional data, with application in regression, time series prediction and curves discrimination. Journal of Nonparametric Statistics, 16:111–125.

Ferraty, F. and Vieu, P. (2008). Erratum of: ’Non-parametric models for functional data, with application in regression, time-series prediction and curve discrimination’. Journal of Nonparametric Statistics, 20(2):187–189.

Ferré, L. and Villa, N. (2005). Discrimination de courbes par régression inverse fonctionnelle. Revue de Statistique Appliquée, LIII(1):39–57.

Ferré, L. and Villa, N. (2006). Multi-layer perceptron with functional inputs: an inverse regression approach. Scandinavian Journal of Statistics, 33(4):807–823.

Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.

Ferré, L. and Yao, A. (2005). Smoothed functional inverse regression. Statistica Sinica, 15(3):665–683.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316–342.

Nadaraya, E. (1964). On estimating regression. Theory of Probability and its Applications, 10:186–196.

Rossi, F. and Conan-Guez, B. (2005). Functional multi-layer perceptron: a nonlinear tool for functional data analysis. Neural Networks, 18(1):45–60.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.

Thodberg, H. (1996). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7(1):56–72.

Watson, G. (1964). Smooth regression analysis. Sankhya, Series A, 26:359–372.

White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425–464.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.
