
UNIVERSIDADE DE SANTIAGO DE COMPOSTELA

DEPARTAMENTO DE ESTATÍSTICA E INVESTIGACIÓN OPERATIVA

BOOSTING FOR REAL AND FUNCTIONAL SAMPLES. AN APPLICATION TO AN ENVIRONMENTAL PROBLEM

B. M. Fernández de Castro and W. González Manteiga.

Report 04-02

Reports in Statistics and Operations Research

BOOSTING FOR REAL AND FUNCTIONAL SAMPLES. AN APPLICATION TO AN ENVIRONMENTAL PROBLEM∗

B. M. Fernández de Castro†, W. González Manteiga†

†Dpto. de Estadística e Investigación Operativa. Univ. de Santiago de Compostela.

Abstract

In this paper, boosting techniques are applied in order to forecast SO2 levels near a power plant. We use boosting with neural networks to forecast real values of SO2 concentration. Then, the data are considered as a time series of curves. Assuming a lag one dependence, the predictions are computed using the functional kernel and the linear autoregressive Hilbertian model. Boosting techniques are developed for those functional models. We compare the results of functional boosting with different starting points and iteration models. In both the real and the functional case, we carry out the estimation with the information given by a historical matrix, a subsample that emphasizes relevant SO2 values.

Keywords: neural networks, functional data, boosting, air pollutant.

1 Introduction

Environmental care is nowadays a priority for many institutions and governments. Many directives have been developed, for example, to control industry's emissions to ambient air. To this effect, those directives set appropriate limit values for different pollutants such as sulphur dioxide, oxides of nitrogen, particulate matter or lead. Industries work on different tools to keep emissions under the limits. These tools often include prediction systems.

In this paper we study the case of prediction tools developed to forecast SO2 values in the surroundings of a power plant. Different statistical models have been used to that end: semiparametric models (García Jurado, et al., 1995), partially linear models (Prada Sánchez, et al., 2000) and neural networks (Fernández de Castro, et al., 2003). These models use real SO2 values from monitoring stations to forecast future values. Some approaches in the space-time field have also been made (Angulo, et al., 1998).

Real time monitoring makes large data sets available. Furthermore, computers can manage such databases. If we aggregate consecutive discrete recordings and view them as sampled values of a random curve, we can study statistical models for curves instead of vectors or numbers. Functional data analysis (Ramsay and Silverman, 2002) appears to be the efficient framework for dealing with such statistical elements. Functional autoregressive models (Bosq, 2000) have been used by Besse and Cardot (1996) and Besse, et al. (2000) to forecast traffic and the climatic variation phenomenon El Niño. Damon and Guillas (2002) and Guillas (2002) studied the inclusion of explanatory variables in the ARH(1) model in order to make predictions of ozone levels.

∗Supported by MCyT Grant BFM2002-03213 (European FEDER support included), Dirección Xeral de I+D (Xunta de Galicia) Grant PGIDT03PXIC20702PN and ENDESA GENERACIÓN S.A. under Dirección Xeral de I+D (Xunta de Galicia) Grant PGIDIT03TAM08E.

We study the predictions given by the functional kernel model and the autoregressive Hilbertian model of order one, ARH(1), on our data set. We also present some tools for better prediction. The idea of a historical matrix, introduced by García Jurado, et al. (1995), is used to better take into account information about episodes. Results show that this kind of estimation improves the forecasts.

Boosting is a useful method born in machine learning. The motivation of this technique was to obtain a powerful classifier from a combination of weak classifiers (Hastie, et al., 2001). Despite being originally designed for classification problems, it has been extended to regression and learning algorithms in general. Boosting techniques have been used by Borra and Di Ciaccio (2002) to improve the predictions of projection pursuit regression or multivariate adaptive regression splines.

In this paper we study the application of boosting techniques to functional data. Boosting is used with functional kernel models and autoregressive Hilbertian models. It is shown that this technique can improve the forecasts given by the initial models.

The paper is organized as follows. Section 2 describes the data set of SO2 values we deal with. In section 3 we define the models we use to forecast SO2 values: neural networks for real data, and the autoregressive Hilbertian model and functional kernel in the functional case. The description of the historical matrices for real and functional data is also included. Section 4 describes the boosting technique and its application to real and functional data. In section 5 we evaluate the results of the different techniques.

2 Data

We will use boosting techniques to improve forecasts of SO2 levels given by statistical models for real and functional data. Our interest is focused on SO2 levels around a power plant located in As Pontes, in the northwest of Spain. This power plant has developed an Atmospheric Pollution Supplementary Control System, which changes operation conditions to reduce atmospheric emissions during meteorological conditions that are unfavourable for plume diffusion. Its main purpose is to avoid air quality level episodes. The system includes a Vigilance Network with 17 measuring stations located within a radius of 30 kilometres (figure 1). The stations measure SO2 levels, among other pollutants, and send values continuously to the power plant, where a database manages the values of pollutants from each measuring station. Since the effects of emission reduction are not immediate in the surroundings of the power plant, prediction tools are quite important to make the Atmospheric Pollution Supplementary Control System effective. The power plant's system includes predictions of SO2 values, with short horizons, given by statistical models.

Our main purpose is to provide useful information to the staff at the power plant to avoid high levels of SO2 on the ground. The SO2 levels are near zero for long time periods, but they can rise quickly and cause air quality level episodes during unfavourable meteorological conditions. The statistical models give the staff at the power plant forecasts of SO2 values, so they can bring forward their decisions to change operation conditions and reduce SO2 levels.

The database at the power plant records an SO2 datum every 5 minutes. Current legislation requires control of hourly averaged SO2 values. In order to give information as good as possible, the SO2 average over the last hour is calculated every 5 minutes. We forecast values of this hourly averaged SO2 for the next half an hour, since the power plant needs predictions with at least a half-hour horizon. Those predictions are given, individually, for the 17 measuring stations located around the power plant. We will restrict our study to one of those stations, denoted by F4. This study can be reproduced for the other 16 measuring stations.

Figure 1: Vigilance Network of As Pontes Power Plant (northwest of Spain).

3 Models

3.1 Neural Networks

Artificial neural networks are composed of simple processors called nodes, arranged in layers and connected with one another. The information is given to the neural network through the nodes of the input layer. Every node processes the information collected and sends it to the nodes of the next layer. This procedure is repeated until the output layer is reached. We must make the neural network acquire enough knowledge to answer correctly to the information we give it. This knowledge is obtained using a training set. This general structure can be translated into a multivariate mathematical model that tries to minimize the error over a set with known responses (Ripley, 1996).

To use neural networks we need to look at hourly averaged SO2 values as observations of a time series {xt, t = ..., −1, 0, 1, 2, ...}. Between xt and xt+1 there is a period of 5 minutes, since the database provides a datum every 5 minutes.

Given the time series xt, we have developed a neural network to forecast values 6 steps ahead: xt+6. The inputs of the neural network are known values of SO2. Experience indicates that most of the information models need to forecast SO2 values lies in the near past of the SO2 series. The neural network model we are going to use has one input layer, one output layer and one hidden layer. The output layer has one node, as the response is one-dimensional. A two-dimensional input (xt−3, xt) is selected, which means two nodes in the input layer. The number of nodes in the hidden layer, L, can be fixed according to empirical results or determined during the training process. The predictor given by the designed neural network can be written as:

$$\hat{x}_{t+6} = f^{o}\left(\sum_{j=1}^{L} \omega^{o}_{1j}\, f^{h}\!\left(\theta^{h}_{j} + \omega^{h}_{j1}\, x_{t} + \omega^{h}_{j2}\, x_{t-3}\right)\right)$$

where $f^h$ and $f^o$ are the activation functions used in the nodes of the hidden and output layers, and $\{\theta^h_j, \omega^o_{1j}, \omega^h_{j1}, \omega^h_{j2};\ j = 1, \ldots, L\}$ are the trends and weights of the nodes. We use the logistic function $f(z) = \frac{1}{1+e^{-z}}$ as activation function for the hidden layer and the identity function for the output layer. Trends and weights are determined in the training process.
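As a minimal illustration, the predictor above can be written directly in NumPy; the parameter values below are random placeholders, since the paper determines trends and weights by training.

```python
import numpy as np

def logistic(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z)), used in the hidden layer."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_predict(x_t, x_t3, theta_h, w_h1, w_h2, w_o):
    """One-hidden-layer predictor for x_{t+6}, mirroring the formula above.

    theta_h : (L,) hidden trends theta^h_j
    w_h1    : (L,) hidden weights omega^h_{j1}, applied to x_t
    w_h2    : (L,) hidden weights omega^h_{j2}, applied to x_{t-3}
    w_o     : (L,) output weights omega^o_{1j}; f^o is the identity.
    """
    hidden = logistic(theta_h + w_h1 * x_t + w_h2 * x_t3)
    return float(np.dot(w_o, hidden))

# Hypothetical example with L = 10 hidden nodes and random parameters.
rng = np.random.default_rng(0)
L = 10
print(nn_predict(120.0, 95.0, rng.normal(size=L), rng.normal(size=L),
                 rng.normal(size=L), rng.normal(size=L)))
```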

The training process consists of giving the network a set of inputs with their known responses, the so-called training set, comparing the network's outputs with the real responses, then modifying the network's weights and trends according to the backpropagation algorithm, and finally iterating until a minimum error level is reached.

The training set is built with vectors of the form $(x_{t-3}, x_t, x_{t+6})^t$ chosen from real data. Our time series is formed largely by values near zero, because there are only a few high level episodes in history, often quite sparse in time. If we took sequential data for the training set, it would possibly be formed by a large number of values near zero and a few values with information about air quality level episodes. As our greatest interest is to predict those events, we need a mechanism to store these data intelligently for future use. Following the notion of historical matrix introduced by García Jurado, et al. (1995), we have divided the training set into 10 classes and assigned a range of values of xt+6 to each class. Thus each vector $(x_{t-3}, x_t, x_{t+6})^t$ from the known history is introduced in the class xt+6 belongs to, replacing the oldest vector. This kind of set has been used with favourable outcomes for predicting SO2 levels using semi-parametric models (García Jurado, et al., 1995) and neural networks (Fernández de Castro, et al., 2003).
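A sketch of this class-based replacement scheme might look as follows; the class boundaries and per-class capacity are illustrative assumptions, as the paper does not list the actual SO2 ranges.

```python
from collections import deque

# Illustrative cut points for x_{t+6} (the true ranges are not given in the
# paper); 9 cut points define 10 classes.  Each class is a FIFO of fixed
# capacity, so inserting a new vector evicts the oldest one in that class.
BOUNDS = [5, 10, 20, 35, 50, 75, 100, 150, 250]
CAPACITY = 150  # 10 classes x 150 vectors = 1500 training vectors

historical_matrix = [deque(maxlen=CAPACITY) for _ in range(len(BOUNDS) + 1)]

def insert_vector(x_t3, x_t, x_t6):
    """Store (x_{t-3}, x_t, x_{t+6}) in the class that x_{t+6} belongs to."""
    k = sum(x_t6 > b for b in BOUNDS)   # index of the class of x_{t+6}
    historical_matrix[k].append((x_t3, x_t, x_t6))

insert_vector(40.0, 60.0, 180.0)  # lands in a high-level class
```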

3.2 Functional Models

From a functional point of view, our data of SO2 levels can be seen as observations of a continuous-time stochastic process. For each time instant T, we try to forecast future values x(u), u ≥ T, using the information contained in the infinite number of variables of the past x(u), u ≤ T. The idea is to consider portions of the stochastic process as curves. We will study curves representing half an hour; that is, each half-hour sample involves 6 consecutive SO2 data. We will thus consider random variables with values in H = L2([0, 6]) in the following way: Xn(u) = x(6n + u) for u ∈ [0, 6], n = 1, 2, ...
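In discretized form this amounts to cutting the 5-minute series into blocks of 6 values; a minimal sketch:

```python
import numpy as np

def to_curves(x):
    """Cut the 5-minute series x into half-hour curves X_n(u) = x(6n + u),
    u = 0, ..., 5, dropping any incomplete final block."""
    n = len(x) // 6
    return np.asarray(x[:6 * n]).reshape(n, 6)

series = np.arange(25, dtype=float)   # toy 5-minute SO2 series
X = to_curves(series)                 # X[n] is the n-th half-hour curve
print(X.shape)                        # (4, 6)
```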

Several authors have worked on curve prediction. A theoretical study of linear processes with values in function spaces can be found in Bosq (2000). Some authors use smoothing splines in this area (Besse, et al., 2000). We choose not to apply this technique to our curves because our main interest is forecasting sudden increases.

As we are interested in predictions with a half-hour horizon and our sample curves represent half an hour, our analysis will be restricted to lag one dependence.

Let (εn) be a strong Hilbertian white noise (SWN), that is, a sequence of i.i.d. H-valued random variables satisfying:

$$E\varepsilon_n = 0, \qquad 0 < E\|\varepsilon_n\|^2_H = \sigma^2 < \infty, \qquad n \in \mathbb{Z}.$$

We will consider the following statistical model:

$$X_n = \rho(X_{n-1}) + \varepsilon_n,$$

where ρ : H → H is the operator to be estimated. Two different techniques are applied to estimate the operator ρ: first, assuming a linear relation between Xn and Xn+1, the autoregressive Hilbertian model; then a general model, with no assumption about the relation, the functional kernel.

In the context of real valued time series we have also used the notion of historical matrix introduced by García Jurado, et al. (1995) for semi-parametric models. Since this approach has provided good results for real data, it has been adjusted to our curve data. Matrices are filled with 1500 vectors of the form (Xt, Xt+1), where now each datum Xt is made of 6 consecutive SO2 measures. As for the real valued historical matrix, we divide the matrix into classes and include each functional vector in its corresponding class. Two types of classification are considered:

1. An "ordinary" classification: the functional historical matrix is divided into 10 classes. Each class has a range of real SO2 levels associated with it. Every functional vector (Xt, Xt+1) is introduced into the class the last real value of Xt+1 belongs to, replacing the oldest functional vector.

2. A "functional" classification: the functional historical matrix is divided into 5 classes. Each class has a curve shape associated with it. We distinguish 5 curve shapes: increases, decreases, plateaus, changes and everything else. To determine the class $X_t = (X^1_t, \ldots, X^6_t)$ belongs to, we compute the differences

$$(X^2_t - X^1_t, \ldots, X^6_t - X^5_t).$$

When the absolute value of a difference is strictly less than 5, it is regarded symbolically as a "0". When a difference is greater than 5, it is regarded as a "+". When a difference is less than −5, it is regarded as a "−". The classes are defined as follows: five "+" for an increase, five "−" for a decrease, five "0" for a plateau, and at least one "+" and one "−" with no "0" for a change; any other pattern falls in the remaining class. The changes' class is difficult to fill, since there were only a few changes in the sample, but it contains a great amount of information. Thus, five classes of 300 couples of the form (Xt, Xt+1) compose our historical matrix, using the entire curve Xt+1 to classify them.
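A sketch of this shape classification follows, under the interpretation that the boundary case |difference| = 5 counts as a signed move, since the paper does not specify it:

```python
def shape_class(curve, tol=5.0):
    """Classify a half-hour curve (6 values) by the signs of its five
    consecutive differences, following the rules in the text."""
    symbols = []
    for a, b in zip(curve, curve[1:]):
        d = b - a
        symbols.append('0' if abs(d) < tol else ('+' if d > 0 else '-'))
    if symbols == ['+'] * 5:
        return 'increase'
    if symbols == ['-'] * 5:
        return 'decrease'
    if symbols == ['0'] * 5:
        return 'plateau'
    if '+' in symbols and '-' in symbols and '0' not in symbols:
        return 'change'
    return 'other'

print(shape_class([0, 10, 25, 45, 70, 100]))  # 'increase'
```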

3.2.1 Autoregressive Hilbertian Model: ARH(1)

For this technique ρ is a bounded linear operator on H. The estimation of ρ is performed in several steps using the covariance operator given by C(x) = E[⟨X0, x⟩X0], the cross-covariance operator given by D(x) = E[⟨X0, x⟩X1], and the relation D = ρC. Since C is not invertible in general, a projection on a subspace of finite dimension kn is necessary to obtain an estimate of ρ. kn must be selected taking into account the decreasing rate of the eigenvalues of C (Bosq, 2000; Guillas, 2001). Bosq (2000) proved that, under mild assumptions, this autoregressive model has a unique stationary solution. After withdrawing the mean from the process, the steps to estimate ρ are as follows:

1. Compute, by a Principal Component Analysis (PCA), empirical estimators of the eigenelements of the covariance operator C associated with (Xn).

2. Project the relation between the cross-covariance operator D and C,

$$D = \rho C,$$

onto the subspace spanned by the first kn eigenvectors, associated with the kn greatest empirical eigenvalues.

3. Get a consistent estimator ρn of ρ using this projected relation whenever invertibility conditions make it possible. Several assumptions can ensure this, especially taking kn small in comparison with the sample size n. Depending on the decay rate of the eigenvalues of C, the usual (and optimal) choices of kn are of the types log(n) or $n^{1/\alpha}$, α > 1 (see Bosq, 2000, for the a.s. mode of convergence and Guillas, 2001, for the L2 mode). Our number of sample points per curve was small (equal to 6), and kn cannot be greater than this, because in the computing process we only have information for vectors of size 6. We choose its value by cross validation.

Notice that this statistical model is non-parametric, since ρ is an infinite-dimensional parameter.
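For the discretized curves used here (6 sampled values), the projected estimation can be sketched with plain matrix algebra; the choice k = 3 below is illustrative, while the paper chooses kn by cross-validation.

```python
import numpy as np

def arh1_fit(X, k):
    """Sketch of the projected ARH(1) estimator for discretized curves.

    X : (n, 6) array of centered curves X_1, ..., X_n (mean removed).
    k : dimension k_n of the projection subspace.
    Returns the (6, 6) matrix of the estimated operator rho.
    """
    n = len(X)
    C = X.T @ X / n                      # empirical covariance operator
    D = X[1:].T @ X[:-1] / (n - 1)       # empirical cross-covariance operator
    lam, V = np.linalg.eigh(C)           # eigenelements of C (the PCA step)
    idx = np.argsort(lam)[::-1][:k]      # k greatest empirical eigenvalues
    Vk, lam_k = V[:, idx], lam[idx]
    C_inv_k = Vk @ np.diag(1.0 / lam_k) @ Vk.T  # inverse of C on the subspace
    return D @ C_inv_k                   # projected relation D = rho C

# Toy usage on an already-centered sample of curves.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
rho_hat = arh1_fit(X, k=3)
x_next_hat = rho_hat @ X[-1]             # one-step-ahead curve forecast
```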

3.2.2 Functional Kernel

Linear operators could be too restrictive for such a dependent sample of curves. We can consider a general model using the functional extension of the Nadaraya-Watson kernel regression estimator (Besse, et al., 2000). In this case, the operator ρ can be estimated by the following functional kernel estimator:

$$\rho_{h_n}(x) = \frac{\sum_{i=1}^{n} X_{i+1}\, K\!\left(\frac{\|X_i - x\|}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\|X_i - x\|}{h_n}\right)},$$

where x belongs to H, K denotes a kernel (our choice was the Gaussian kernel), n is the sample size and hn is the bandwidth. We decided to work with global bandwidths. To select the bandwidth, we use cross-validation over a collection of functional vectors from the functional historical matrix.
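A minimal sketch of this estimator on discretized curves, approximating the L2([0, 6]) norm by the Euclidean norm of the 6 sampled values; the bandwidth below is a placeholder rather than the cross-validated one.

```python
import numpy as np

def fk_predict(x, X, Y, h):
    """Functional Nadaraya-Watson estimator: a weighted mean of the
    responses Y_i with Gaussian-kernel weights K(||X_i - x|| / h).

    x : (6,) query curve; X : (n, 6) sample curves; Y : (n, 6) responses
    (Y_i = X_{i+1} in the autoregressive case); h : global bandwidth.
    """
    dists = np.linalg.norm(X - x, axis=1) / h
    w = np.exp(-0.5 * dists ** 2)             # Gaussian kernel weights
    return (w[:, None] * Y).sum(axis=0) / w.sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
x_hat = fk_predict(X[-1], X[:-1], X[1:], h=1.5)   # lag-one curve forecast
```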

4 Boosting

4.1 Boosting for Real Valued Variables

The outline of boosting for regression is the following: given a d-dimensional explanatory variable x and a response variable y, we are interested in the relation:

$$y = f(x) + e,$$

where f : $\mathbb{R}^d \to \mathbb{R}$. Given a family of basis functions b(x, γ), with parameters γ, the solution set out by boosting for the estimation of f can be written as:

$$f(x) = \sum_{q=0}^{Q} \beta_q\, b(x, \gamma_q),$$

with βq combination parameters. As we can see, boosting gives a procedure to fit an additive model over a set of basis functions.

To estimate values for the parameters, we must minimize the empirical error over the training set for some loss function L(y, f(x)). Given a training set of explanatory variables with known responses $\{(x_i, y_i)\}_{i=1}^{n}$, we must solve:

$$\min_{\{\beta_q, \gamma_q\}_{q=1}^{Q}} \sum_{i=1}^{n} L\!\left(y_i, \sum_{q=1}^{Q} \beta_q\, b(x_i, \gamma_q)\right).$$

This minimization problem can be solved using the functional gradient descent algorithm (Bühlmann and Yu, 2003).

The steps of the generic functional gradient descent algorithm are as follows:

Step 1 (Initialization): Given the training set $\{(x_i, y_i)\}_{i=1}^{n}$, fit an initial basis function:

$$f_0(x) = b(x, \gamma_0).$$

Set q = 0.

Step 2: Compute the negative gradient of the loss function L evaluated at the current estimation:

$$r_i = -\left.\frac{\partial L(y_i, f)}{\partial f}\right|_{f = f_q(x_i)}, \qquad i = 1, \ldots, n.$$

Given the set $\{(x_i, r_i)\}_{i=1}^{n}$, fit a basis function:

$$B_{q+1}(x) = b(x, \gamma_{q+1}).$$

Step 3: Numerical search for the step size:

$$\beta_{q+1} = \arg\min_{\beta} \sum_{i=1}^{n} L\!\left(y_i, f_q(x_i) + \beta B_{q+1}(x_i)\right).$$

Step 4: Set $f_{q+1}(\cdot) = f_q(\cdot) + \beta_{q+1} B_{q+1}(\cdot)$.

Set q = q + 1 and return to step 2.

Notice that step 2 implies fitting a basis function to different data each time. The boosting algorithm is iterated until errors stabilize at a minimum level.

Different loss functions can be used with this algorithm. The exponential or the logit function are good choices in classification problems. When the response is real, the quadratic loss can be an appropriate choice:

$$L(y, f(x)) = \frac{1}{2}\,(y - f(x))^2.$$

If we use the quadratic loss, the terms ri in the second step are the errors of the previous boosting iteration: r = (y − f). That is, in every iteration of the boosting algorithm we are fitting a basis function to the set of explanatory variables x with the last errors as responses.

Different families can be used as basis functions, for example regression trees (Hastie, et al., 2001) or smoothing splines (Bühlmann and Yu, 2003). Since neural networks have given good results predicting SO2 levels, they will be our choice for the basis function family. We will select a neural network topology and fit the proper trends and weights each time step 2 is carried out.
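A compact sketch of this L2 boosting scheme with small neural networks as base learners; with quadratic loss, the line search of step 3 has the closed form used below. The network topology, the iteration count and the scikit-learn learner are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def l2_boost(X, y, n_iter=5):
    """L2 boosting with neural-network base learners.

    For quadratic loss, the negative gradient r_i is the current residual,
    and the step size minimizing sum_i (r_i - beta * B(x_i))^2 has a closed
    form.  The first fit plays the role of f_0 (here with a free step size,
    a harmless simplification of step 1).
    """
    models, betas = [], []
    resid = np.asarray(y, dtype=float).copy()
    for q in range(n_iter):
        net = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                           max_iter=2000, random_state=q).fit(X, resid)
        b = net.predict(X)
        beta = np.dot(resid, b) / np.dot(b, b)   # exact line search
        models.append(net)
        betas.append(beta)
        resid -= beta * b            # residuals become the next responses
    return models, betas

def boost_predict(models, betas, X):
    """Evaluate the additive boosted model sum_q beta_q * b(x, gamma_q)."""
    return sum(beta * m.predict(X) for m, beta in zip(models, betas))
```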

4.2 Boosting for Functional Variables

The boosting algorithm improves the predictions of statistical models applied to real data. Our goal is to extend the algorithm to functional data, so that we can improve the predictions given by the ARH(1) model and/or the functional kernel model.

The outline of the problem for regression in the functional data field is as follows. Given an explanatory variable X and a response Y, both with values in a Hilbert space H, we deal with the model Y = ρ(X) + ε.

Working with real data, the boosting algorithm described in section 4.1 improves the estimation on each iteration, progressing in the direction of maximum descent of the loss function. When we use the quadratic loss as loss function, the direction of maximum descent corresponds to the errors of the current estimation. So, each time we apply the algorithm we add an estimation of the errors of the previous iteration.

In the functional case, we consider the empirical quadratic L2-error approximation as loss function of the algorithm:

$$L(Y_t, \hat{Y}_t) = \|Y_t - \hat{Y}_t\|^2_{L^2} = \frac{1}{6} \sum_{j=1}^{6} \left(Y^j_t - \hat{Y}^j_t\right)^2.$$

Each time we apply boosting, we add an estimation of the functional errors of the current estimation.

The steps we will follow are described below. We will call this algorithm the L2 functional boosting algorithm:

Step 1 (Initialization): Given the training set $\{(X_i, Y_i)\}_{i=1}^{n}$, fit an initial operator $\rho_0(X) = \hat{\rho}_Y(X)$ for the model Y = ρ(X) + ε.

Set q = 0.

Step 2: Compute the functional errors of the current estimation:

$$R_i = Y_i - \rho_q(X_i), \qquad i = 1, \ldots, n.$$

Given the set $\{(X_i, R_i)\}_{i=1}^{n}$, fit a basis operator $\hat{\rho}_R(X)$ for the model R = ρ(X) + ε.

Step 3: Numerical search for the step size:

$$\beta_{q+1} = \arg\min_{\beta} \sum_{i=1}^{n} L\!\left(Y_i, \rho_q(X_i) + \beta\, \hat{\rho}_R(X_i)\right).$$

Step 4: Set $\rho_{q+1}(\cdot) = \rho_q(\cdot) + \beta_{q+1}\, \hat{\rho}_R(\cdot)$.

Set q = q + 1 and return to step 2.

The boosting algorithm is iterated until the empirical L2-errors stabilize at a minimum level.

In order to apply this algorithm, we need to select the initial operator and an operator family with which to iterate the boosting algorithm. In section 3.2 we described the models we apply to our data set: the ARH(1) and the functional kernel model. We will use those two models as two different starting points for the boosting algorithm. As operator family we will study the results given by the functional kernel for regression and by the functional linear model.

Given the functional model Y = ρ(X) + ε and the set $\{(X_i, Y_i)\}_{i=1}^{n}$, with X and Y valued in a Hilbert space H, the functional kernel for regression proposes an estimation of the operator ρ as follows:

$$\rho_{h_n}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{\|X_i - x\|}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\|X_i - x\|}{h_n}\right)},$$

where x belongs to H, K denotes a kernel (our choice was the Gaussian kernel), n is the sample size and hn is the bandwidth. We select global bandwidths using cross-validation.

For the same functional model we can estimate ρ using a linear approach for functional data. First, we need to subtract the mean of the explanatory and response data. Then, the steps are the following:

1. Compute, by a Principal Component Analysis (PCA), empirical estimators of the eigenelements of the operator C(x) = E[⟨X, x⟩X].

2. Project the relation between the operator D(x) = E[⟨X, x⟩Y] and C,

$$D = \rho C,$$

onto the subspace spanned by the first kn eigenvectors, associated with the kn greatest empirical eigenvalues.

3. Get a consistent estimator ρn of ρ using this projected relation whenever invertibility conditions make it possible.

We use those two approaches as operator families in the boosting algorithm.
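Putting the pieces together, here is a sketch of the L2 functional boosting loop for discretized curves; `fit` stands for either base operator family, and the kernel example below is one hypothetical choice. The step size is again the closed-form least-squares solution implied by the quadratic L2 loss.

```python
import numpy as np

def functional_l2_boost(X, Y, fit, n_iter=5):
    """L2 functional boosting (section 4.2) for discretized curves.

    X, Y : (n, 6) arrays of explanatory and response curves.
    fit  : fit(X, R) -> predictor mapping an (m, 6) array to an (m, 6)
           array, e.g. an ARH(1) or functional-kernel estimator.
    Returns the list of (step size, operator) stages.
    """
    rho0 = fit(X, Y)                        # step 1: initial operator
    pred = rho0(X)
    stages = [(1.0, rho0)]
    for _ in range(n_iter):
        R = Y - pred                        # step 2: functional errors
        rhoR = fit(X, R)
        B = rhoR(X)
        beta = np.vdot(R, B) / np.vdot(B, B)   # step 3: exact line search
        stages.append((beta, rhoR))
        pred = pred + beta * B              # step 4: update the estimation
    return stages

def kernel_fit(X, Y, h=1.5):
    """Hypothetical base operator: functional Nadaraya-Watson regression."""
    def rho(Q):
        d = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=2) / h
        W = np.exp(-0.5 * d ** 2)           # Gaussian kernel weights
        return W @ Y / W.sum(axis=1, keepdims=True)
    return rho

# Toy usage: boost the lag-one prediction of a sample of curves.
rng = np.random.default_rng(3)
curves = rng.normal(size=(200, 6))
stages = functional_l2_boost(curves[:-1], curves[1:], kernel_fit)
forecast = sum(b * rho(curves[-1:])[0] for b, rho in stages)
```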

5 Results

To illustrate the results of the estimations of the different models, we have evaluated them over air quality level episodes. We have built the historical matrices for real data, as described in section 3.1, and for functional data, as described in section 3.2, using historical data of 2001. We have selected two air quality level episodes observed at the measuring station F4 on April 22 and June 21, 2002, not included in the historical matrices.

5.1 Results for Real Data

Given the real time series of SO2 values xt, our interest is to forecast the value xt+6 for every instant t, using known values of the time series. The statistical model is:

$$x_{t+6} = f(x_t, x_{t-3}) + \varepsilon.$$

To estimate f we apply the boosting algorithm with a neural network with 10 nodes in the hidden layer. That means the selected basis function b(x, γ) is a neural network with L = 10, its parameters are the trends and weights $\gamma = \{\theta^h_j, \omega^o_{1j}, \omega^h_{j1}, \omega^h_{j2};\ j = 1, \ldots, L\}$ and its input vector is $x = (x_{t-3}, x_t)^t$.

The starting training set is a real historical matrix filled with 1500 real vectors of the form $(x_{t-3}, x_t, x_{t+6})^t$ from year 2001.

In order to evaluate the results of the algorithm, we calculate the mean squared error of its forecasts on the air quality level episodes:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2.$$

To evaluate the evolution of the algorithm, we calculate this error at every iteration q of the algorithm and compare its value MSEq with MSE0, the MSE at the first step of the algorithm:

$$MSE^{q}_{0} = \frac{MSE_q}{MSE_0} \cdot 100.$$

Figure 2 shows the forecasts given by the neural network with boosting iterations for the air quality level episode observed at the F4 station on April 22, 2002. The best prediction is displayed. Tables 1 and 2 display the forecast errors for the two different days we consider. The minimum error level is achieved at the second iteration of the boosting algorithm.

q    MSE        MSE^q_0
0    936.17627  100.00
1    921.58368   98.44

Table 1: Real valued boosting. Prediction errors at the F4 station on April 22, 2002.

q    MSE        MSE^q_0
0    813.54755  100.00
1    777.23328   95.54

Table 2: Real valued boosting. Prediction errors at the F4 station on June 21, 2002.

Figure 2: Real valued boosting, April 22, 2002. Forecasts given by boosting over neural networks.

5.2 Results for Functional Data

Given the functional time series of SO2 curves Xn, our interest is to forecast the curve Xn+1 for every instant n. The statistical model is:

$$X_n = \rho(X_{n-1}) + \varepsilon_n.$$

We use the boosting algorithm described in section 4.2 to estimate the operator ρ. We present the results of the boosting algorithm with two different starting estimators of ρ: the ARH(1) and the functional kernel (FK). We also use two different approaches for the boosting iteration: as basis operator we use the functional linear model (LIN) and the functional kernel (FK).

The starting training sets are functional historical matrices built with the two different criteria presented in section 3.2: the 'ordinary' classification, called 'levels matrix' hereafter, and the 'functional' classification, called 'shapes matrix' in the sequel. Those two matrices are filled with 1500 functional vectors of the form $(X_t, X_{t+1})^t$ from year 2001. We have also used a training set built as a usual time series, using real sequential data, so that we can compare the results with those using historical matrices.

Figures 3, 4, 5 and 6 display the forecasts given by the boosting algorithm, with different starting points and iteration models, for the air quality level episode at the F4 station on April 22, 2002. One should look at those figures carefully, since every 30 minutes we are joining the last predicted real value $X^6_{n-1}$ at time n−1 with the first predicted real value $X^1_n$ at time n; hence, there is a piece of line which is not really predicted. Note that when the levels are around zero, the forecast is typically an increase, which is usually a wrong prediction but not a serious error. Our interest is focused on air quality level episodes, when the SO2 level exceeds the value of 150 µg/m³. That situation corresponds to a level of intervention for the staff in the power plant. This part is quite well forecast when using a shapes historical matrix.

To evaluate the results of the algorithm, we use the empirical L2-error of its forecasts on the air quality level episodes:

$$L2ME = \frac{1}{n} \sum_{i=1}^{n} \left[\frac{1}{6} \sum_{j=1}^{6} \left(X^j_i - \hat{X}^j_i\right)^2\right]^{1/2}.$$

To evaluate the evolution of the algorithm, we calculate this error at every iteration q of the algorithm and compare its value L2MEq with the L2ME at the first step of the algorithm, L2ME0:

$$L2ME^{q}_{0} = \frac{L2ME_q}{L2ME_0} \cdot 100.$$
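For reference, this error measure is straightforward to compute on discretized curves; a small sketch:

```python
import numpy as np

def l2me(X_true, X_pred):
    """Empirical L2 mean error: the average over curves of the discretized
    L2([0,6]) norm of the prediction error, as in the formula above."""
    return float(np.mean(np.sqrt(np.mean((X_true - X_pred) ** 2, axis=1))))

def l2me_ratio(l2me_q, l2me_0):
    """Relative error L2ME^q_0 = 100 * L2ME_q / L2ME_0."""
    return 100.0 * l2me_q / l2me_0
```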

Tables 3 and 4 include the error measures for the different boosting models on the two air quality level episodes considered. The boosting iteration column indicates the iteration at which the algorithm achieves the lowest error for each example. A boosting iteration equal to 0 indicates that the minimum error is achieved with the initial estimation.

Note that the models perform better using the 'shapes' historical matrix. The most accurate forecasts are given by the boosting algorithm with an ARH(1) starting point and the functional kernel to iterate the model, using the shapes historical matrix.

Starting point   Iteration model   Historical matrix   L2ME       L2ME^q_0   Boosting iteration
ARH(1)           LIN               No HM               16.15677    99.66     2
ARH(1)           LIN               levels              18.45314   100.00     0
ARH(1)           LIN               shapes              17.40021    94.69     2
ARH(1)           FK                No HM               16.03335    98.90     9
ARH(1)           FK                levels              12.22794    66.26     4
ARH(1)           FK                shapes              10.40675    56.64     5
FK               LIN               No HM               16.67396    97.83     1
FK               LIN               levels              18.00115    86.63     2
FK               LIN               shapes              15.01328    87.74     2
FK               FK                No HM               17.04299   100.00     0
FK               FK                levels              20.69654    99.60     9
FK               FK                shapes              16.13719    94.31     1

Table 3: Functional boosting. Prediction errors at the F4 station on April 22, 2002.

Starting point   Iteration model   Historical matrix   L2ME       L2ME^q_0   Boosting iteration
ARH(1)           LIN               No HM               17.18766    99.75     2
ARH(1)           LIN               levels              17.53457   100.00     0
ARH(1)           LIN               shapes              16.84903    92.44     2
ARH(1)           FK                No HM               16.84921    97.78     1
ARH(1)           FK                levels              13.70578    78.16     4
ARH(1)           FK                shapes               9.87119    54.16     5
FK               LIN               No HM               16.96896   100.00     0
FK               LIN               levels              14.02233    79.44     2
FK               LIN               shapes              11.81233    81.23     2
FK               FK                No HM               16.96896   100.00     0
FK               FK                levels              17.56491    99.51     2
FK               FK                shapes              13.12376    90.25     1

Table 4: Functional boosting. Prediction errors at the F4 station on June 21, 2002.

Figure 3: Functional boosting, April 22, 2002. Starting point: ARH(1). Iteration model: linear (LIN). Historical matrix: levels matrix, shapes matrix and the case with no historical matrix.

6 Conclusions

In this paper we have used boosting techniques to improve the forecasts given by neural networks with real data. We can improve the forecasts given by those models with this technique. The neural networks use a historical matrix as their training set.

We have also studied the application of functional models to our data set. Those models are quite interesting, since we can predict the entire curve for the next half hour instead of a single real value. We introduced a specific way of building historical matrices for functional data, considering two different ways of building those matrices: taking into account the last real value or the shape of each functional datum. These ideas help us to select the data with interesting information to estimate the models. Forecasts obtained with such functional models, using functional historical matrices, appear to be a good option when dealing with SO2 values. Results are better using the 'shapes' historical matrix for functional data than using the 'levels' one.

Figure 4: Functional boosting, April 22, 2002. Starting point: ARH(1). Iteration model: functional kernel (FK). Historical matrix: levels matrix, shapes matrix and the case with no historical matrix.

Figure 5: Functional boosting, April 22, 2002. Starting point: functional kernel. Iteration model: linear (LIN). Historical matrix: levels matrix, shapes matrix and the case with no historical matrix.

Figure 6: Functional boosting, April 22, 2002. Starting point: functional kernel. Iteration model: functional kernel (FK). Historical matrix: levels matrix, shapes matrix and the case with no historical matrix.

We have introduced boosting ideas for functional models. The boosting algorithm for functional data makes use of the autoregressive Hilbertian and functional kernel models. By means of this technique we can combine those predictors, obtaining even better results. An autoregressive Hilbertian model as starting point, with boosting iteration using the functional kernel model, seems to be an appropriate way to capture the evolution of this kind of functional process.

References

Angulo, J. M., González Manteiga, W., Febrero Bande, M. and Alonso, F. J. (1998). Semi-parametric statistical approaches for space-time process prediction. Environmental and Ecological Statistics 5, 297-316.

Besse, P. and Cardot, H. (1996). Spline approximation of the prediction of a first-order autoregressive functional process. Canadian Journal of Statistics 24, 467-487.

Besse, P., Cardot, H. and Stephenson, D. (2000). Autoregressive forecasting of some functional climatic variations. Scandinavian Journal of Statistics 27(4), 673-687.

Borra, S. and Di Ciaccio, A. (2002). Improving nonparametric regression methods by bagging and boosting. Computational Statistics & Data Analysis 38, 407-420.

Bosq, D. (2000). Linear Processes in Function Spaces. Springer, New York.

Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association 98, 324-339.

Damon, J. and Guillas, S. (2002). The inclusion of exogenous variables in functional autoregressive ozone forecasting. Environmetrics 13, 759-774.

Fernández de Castro, B. M., Prada Sánchez, J. M., González Manteiga, W., Febrero Bande, M., Bermúdez Cela, J. L. and Hernández Fernández, J. J. (2003). Prediction of SO2 levels using neural networks. Journal of the Air and Waste Management Association 53, 532-538.

García Jurado, I., González Manteiga, W., Febrero Bande, M., Prada Sánchez, J. and Cao, R. (1995). Predicting using Box-Jenkins, nonparametric and bootstrap techniques. Technometrics 37, 303-310.

Guillas, S. (2001). Rates of convergence of autocorrelation estimates for autoregressive Hilbertian processes. Statistics & Probability Letters 55, 281-291.

Guillas, S. (2002). Doubly stochastic Hilbertian processes. Journal of Applied Probability 39, 566-580.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.

Prada Sánchez, J. M., Febrero Bande, M., Cotos Yáñez, T., González Manteiga, W., Bermúdez Cela, J. L. and Lucas Domínguez, T. (2000). Prediction of SO2 pollution incidents near a power station using partially linear models and a historical matrix of predictor-response vectors. Environmetrics 11, 209-225.

Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.