


A recursive least square algorithm for online kernel principal component extraction

João B.O. Souza Filho a,b,⁎, Paulo S.R. Diniz a,c

a Department of Electronics and Computer Engineering, Polytechnic School, Federal University of Rio de Janeiro, Avenida Athos da Silveira Ramos, 149, Technological Center, Building H, 2nd floor, rooms H-219(20) and 221, Rio de Janeiro, Brazil
b Electrical Engineering Postgraduate Program (PPEEL), Federal Center of Technological Education Celso Suckow da Fonseca, Avenida Maracanã, 229, Building E, 5th floor, Rio de Janeiro, Brazil
c Electrical Engineering Program (PEE), Alberto Luiz Coimbra Institute (COPPE), Federal University of Rio de Janeiro, Brazil

ARTICLE INFO

Communicated by Haiqin Yang

Keywords: Kernel principal components analysis; Kernel methods; Online kernel algorithms; Machine learning; Generalized Hebbian algorithm

ABSTRACT

The online extraction of kernel principal components has gained increased attention, and several algorithms proposed recently explore kernelized versions of the generalized Hebbian algorithm (GHA) [1], a well-known principal component analysis (PCA) extraction rule. Consequently, the convergence speed of such algorithms and the accuracy of the extracted components are highly dependent on a proper choice of the learning rate, a problem-dependent factor. This paper proposes a new online fixed-point kernel principal component extraction algorithm, exploring the minimization of a recursive least-square error function, conjugated with an approximated deflation transform using component estimates obtained by the algorithm, implicitly applied upon data. The proposed technique automatically builds a concise dictionary to expand kernel components, involves simple recursive equations to dynamically define a specific learning rate for each component under extraction, and has a linear computational complexity regarding dictionary size. As compared to state-of-the-art kernel principal component extraction algorithms, results show improved convergence speed and accuracy of the components produced by the proposed method in five open-access databases.

1. Introduction

Kernel principal components analysis (KPCA) [2] is a simple but powerful nonlinear generalization of the widely used PCA technique [3]. Originally stated as a Gram-matrix eigendecomposition problem [2], thus solvable by classical linear algebra methods [4], this technique faces problems with large-scale datasets, for which the computational burden involved in Gram-matrix construction and factorization may turn the extraction process infeasible.

To address these problems, several authors have proposed incremental [5–7] and, more recently, online kernel component extraction algorithms [8–10]. Some examples are the online kernel Hebbian algorithm (OKHA) [8] and the subset kernel Hebbian algorithm (SubKHA) [9], which are extensions of the kernel Hebbian algorithm (KHA) [6]. In both cases, component estimates are expanded using concise and dynamically built dictionaries, but following different rules for managing them. The KHA [6] is, in turn, a nonlinear extension of the generalized Hebbian algorithm (GHA) [1].

A practical issue faced when using these algorithms is the choice of the learning rate factor, which critically affects the convergence speed and the accuracy of the extracted components. Usually, this factor is set to the same value for all components under extraction, by trial and error or following some metaheuristic procedure. In the case of the KHA, the work [11] showed that adopting individual learning rates for each component under extraction results in better convergence and accuracy. Nonetheless, the meta-heuristic procedure explored in that work is computationally expensive and requires the adjustment of several experimental parameters. In a similar way, Tanaka [10] has proposed a recursive least-square online kernel PCA extraction algorithm based on the iterative weighted rule (IWR) [12]. The strategy adopted by this rule is somewhat similar to the one proposed in [13] for the extraction of principal components, resulting in learning rates dynamically adjusted to each component under extraction. The accuracy of this algorithm, however, seems to depend on the choice of the weighting constants [13].

Motivated by these drawbacks, this paper proposes a low-complexity online kernel principal component extraction algorithm, named the recursive least square kernel Hebbian algorithm (RLS-KHA).

http://dx.doi.org/10.1016/j.neucom.2016.12.031
Received 20 May 2016; Received in revised form 23 November 2016; Accepted 8 December 2016

⁎ Corresponding author at: Department of Electronics and Computer Engineering, Polytechnic School, Federal University of Rio de Janeiro, Avenida Athos da Silveira Ramos, 149, Technological Center, Building H, 2nd floor, rooms H-219(20) and 221, Rio de Janeiro, Brazil.

E-mail addresses: [email protected] (J.B.O.S. Filho), [email protected] (P.S.R. Diniz).

Neurocomputing 237 (2017) 255–264

Available online 13 December 2016
0925-2312/© 2016 Elsevier B.V. All rights reserved.



The algorithm employs an individual learning rate for each component under extraction, automatically tuned using a very simple iterative equation. The proposed RLS-KHA has iterative equations similar to those of the KHA, demands the setting of only one experimental parameter (the forgetting factor), and exhibits improved accuracy and convergence speed.

2. Kernel PCA

The KPCA [2] is a natural nonlinear extension of the PCA [3]. This technique implicitly produces mappings of realizations of a random vector $\mathbf{x}$ in a feature space $F$ using a nonlinear function $\Phi(\mathbf{x})$, which is supposed to satisfy the Mercer theorem [14]. The principal kernel components [2] consist of optimal directions in the sense of representing these mappings in such a space. These directions are defined by the eigenvectors, sorted in decreasing order of their corresponding eigenvalues, of the following covariance matrix [2]:

$\mathbf{C}_\kappa = E[\tilde{\Phi}(\mathbf{x})\,\tilde{\Phi}^T(\mathbf{x})]$  (1)

where the function $\tilde{\Phi}(\cdot)$ represents the centered mapping of $\mathbf{x}$ in the feature space, given by

$\tilde{\Phi}(\mathbf{x}) = \Phi(\mathbf{x}) - E[\Phi(\mathbf{x})]$  (2)

and the operator $E[\cdot]$ designates the expected value. An elegant result from KPCA theory is that the mapping $\Phi(\mathbf{x})$ does not have to be explicitly determined. Assume that the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ define a set of realizations of $\mathbf{x}$, and that $\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_N)$ represent the mappings of such realizations in the feature space. The principal kernel components can be simply stated in terms of linear combinations of these mappings [2] as follows:

$\mathbf{w}_j = [\Phi(\mathbf{x}_1)\ \cdots\ \Phi(\mathbf{x}_N)]\,\boldsymbol{\alpha}_j, \quad 1 \le j \le p \ll N$  (3)

where the vector $\boldsymbol{\alpha}_j$ collects the weighting constants related to the $j$-th kernel component. Now assume that the Gram matrix [14] of these realizations is given by

$\mathbf{K}_N = \begin{bmatrix} \kappa(\mathbf{x}_1,\mathbf{x}_1) & \cdots & \kappa(\mathbf{x}_1,\mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{x}_N,\mathbf{x}_1) & \cdots & \kappa(\mathbf{x}_N,\mathbf{x}_N) \end{bmatrix}$  (4)

where an arbitrary kernel function $\kappa(\mathbf{x}_a,\mathbf{x}_b)$ $(1 \le a, b \le N)$ satisfies the equality $\kappa(\mathbf{x}_a,\mathbf{x}_b) = \Phi^T(\mathbf{x}_a)\,\Phi(\mathbf{x}_b)$, i.e., it computes the inner product of the mappings of the vectors $\mathbf{x}_a$ and $\mathbf{x}_b$ in the feature space. The vector $\boldsymbol{\alpha}_j$ corresponds to a scaled version of the $j$-th dominant eigenvector of $\mathbf{K}_N$, as follows:

$\mathbf{K}_N\,\boldsymbol{\alpha}_j = \lambda_j\,\boldsymbol{\alpha}_j$  (5)

$\|\boldsymbol{\alpha}_j\| = \frac{1}{\sqrt{\lambda_j}}$  (6)
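For reference, the batch solution of Eqs. (4)-(6) can be sketched in a few lines of NumPy. This is a minimal illustration, not the online algorithm proposed in the paper; the function names are ours, and the Gaussian kernel is chosen only because it is the kernel adopted in the experiments of Section 4.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d = a - b
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def batch_kpca(X, sigma, p):
    """Batch KPCA via Gram-matrix eigendecomposition (Eqs. (4)-(6)).

    X: (N, d) data matrix. Returns the (N, p) matrix whose columns are the
    expansion coefficients alpha_j, scaled so that the components w_j of
    Eq. (3) have unit norm, plus the corresponding eigenvalues.
    """
    N = X.shape[0]
    K = np.array([[gaussian_kernel(X[i], X[j], sigma) for j in range(N)]
                  for i in range(N)])                      # Eq. (4)
    eigval, eigvec = np.linalg.eigh(K)                     # ascending order
    idx = np.argsort(eigval)[::-1][:p]                     # p dominant pairs
    lam, V = eigval[idx], eigvec[:, idx]
    alphas = V / np.sqrt(lam)                              # Eqs. (5)-(6)
    return alphas, lam

# Projection of a new sample x onto the j-th kernel component:
# y_j = sum_i alphas[i, j] * kappa(x_i, x)
```

For large N this is exactly the construction whose memory and computational cost motivates the online approach developed next.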

3. The RLS-KHA

The proposed algorithm explores the least mean-square error reconstruction (LMSER) principle [15], combined with an approximate deflation transform, both considering implicit mappings of the input data in a feature space. For the algorithm derivation, consider the following set of objective functions $O_j$:

$O_j = E\left[\left\|\tilde{\Phi}_j(\mathbf{x}) - \mathbf{w}_j\mathbf{w}_j^T\,\tilde{\Phi}_j(\mathbf{x})\right\|^2\right], \quad j = 1 \ldots p$  (7)

where the parameter vector $\mathbf{w}_j$ should be chosen to minimize the corresponding $O_j$, and the vector $\tilde{\Phi}_j(\mathbf{x})$ corresponds to a deflated [4] version of $\tilde{\Phi}(\mathbf{x})$, i.e., it has null projections onto the directions of the higher-order components $\mathbf{w}_i$ $(1 \le i \le j-1)$. This vector corresponds to

$\tilde{\Phi}_j(\mathbf{x}) = \mathbf{F}_j\,\tilde{\Phi}(\mathbf{x})$  (8)

with

$\mathbf{F}_j = \mathbf{I} - \sum_{i=1}^{j-1}\mathbf{e}_i\mathbf{e}_i^T$  (9)

where the vectors $\mathbf{e}_i$ represent the dominant eigenvectors of $\mathbf{C}_\kappa$. As shown in Appendix A, the set of objective functions proposed in Eq. (7) has the $p$ principal kernel components of $\mathbf{x}$ as minimum points.

Now, in order to iteratively estimate the expected value of $\Phi(\mathbf{x})$ at an arbitrary iteration $k$, we use an exponentially weighted moving average as follows:

$E[\Phi(\mathbf{x})] \approx \overline{\Phi}_k = (1-\gamma)\sum_{i=1}^{k}\gamma^{k-i}\,\Phi(\mathbf{x}(i)) = (1-\gamma)\,\Phi(\mathbf{x}(k)) + \gamma(1-\gamma)\sum_{i=1}^{k-1}\gamma^{(k-1)-i}\,\Phi(\mathbf{x}(i)) = (1-\gamma)\,\Phi(\mathbf{x}(k)) + \gamma\,\overline{\Phi}_{k-1}$  (10)

The variable $\gamma$ is a forgetting factor [16] $(0 < \gamma < 1)$. In this case, Eqs. (2) and (8) for an input vector $\mathbf{x}(k)$ can be rewritten as

$\tilde{\Phi}(\mathbf{x}(k)) \approx \Phi^c(\mathbf{x}(k)) = \Phi(\mathbf{x}(k)) - \overline{\Phi}_k$  (11)

$\tilde{\Phi}_j(\mathbf{x}(k)) \approx \Phi_j^c(\mathbf{x}(k)) = \mathbf{F}_j\,\Phi^c(\mathbf{x}(k))$  (12)

where the vectors $\Phi^c(\mathbf{x}(k))$ and $\Phi_j^c(\mathbf{x}(k))$ denote estimates of the vectors $\tilde{\Phi}(\mathbf{x}(k))$ and $\tilde{\Phi}_j(\mathbf{x}(k))$, respectively, both using $\overline{\Phi}_k$. Similarly, it is possible to estimate the value of $O_j$ in Eq. (7) as follows:

$O_j \approx O_j(k) = (1-\gamma)\sum_{i=1}^{k}\gamma^{k-i}\left\|\Phi_j^c(\mathbf{x}(i)) - \mathbf{w}_j(k)\mathbf{w}_j^T(k)\,\Phi_j^c(\mathbf{x}(i))\right\|^2 = (1-\gamma)\sum_{i=1}^{k}\gamma^{k-i}\left\|\Phi_j^c(\mathbf{x}(i)) - \mathbf{w}_j(k)\,\tilde{y}_j(k,i)\right\|^2$  (13)

where

$\tilde{y}_j(k,i) = \mathbf{w}_j^T(k)\,\Phi_j^c(\mathbf{x}(i))$  (14)

The use of this weighted estimator is convenient for non-stationary data, as it defines a time-limited window of reconstruction errors used in component extraction, whose length depends on the chosen value of $\gamma$. Besides, it is possible to approximate Eq. (13) by the following equality:

$O_j(k) \approx O_j^{pa}(k) = (1-\gamma)\sum_{i=1}^{k}\gamma^{k-i}\left\|\Phi_j^c(\mathbf{x}(i)) - \mathbf{w}_j(k)\,\tilde{y}_j(i-1,i)\right\|^2$  (15)

where we are assuming that the projections of the input data mapped in the feature space onto the directions of the kernel component estimates do not vary significantly over this range of iterations, thus $\tilde{y}_j(k,i) \approx \tilde{y}_j(i-1,i)$. A similar criterion was previously explored for principal subspace component extraction by the PAST algorithm [17] and by the Tanaka algorithm for KPCA extraction [10].

The gradient $\nabla_{\mathbf{w}_j(k)} O_j^{pa}(k)$ is given by

$\nabla_{\mathbf{w}_j(k)} O_j^{pa}(k) = -2(1-\gamma)\sum_{i=1}^{k}\gamma^{k-i}\,\tilde{y}_j(i-1,i)\left[\Phi_j^c(\mathbf{x}(i)) - \mathbf{w}_j(k)\,\tilde{y}_j(i-1,i)\right]$  (16)

Equating this gradient to zero, the following fixed-point equations for the extraction of the principal kernel components result:

$\mathbf{w}_j(k) = \frac{1}{\xi_j(k)}\sum_{i=1}^{k}\gamma^{k-i}\,\tilde{y}_j(i-1,i)\,\Phi_j^c(\mathbf{x}(i))$  (17)

where

$\xi_j(k) = \sum_{i=1}^{k}\gamma^{k-i}\,\tilde{y}_j^2(i-1,i)$  (18)

J.B.O.S. Filho, P.S.R. Diniz Neurocomputing 237 (2017) 255–264

256

Page 3: A recursive least square algorithm for online kernel ...diniz/papers/ri88.pdfa Department of Electronics and Computer Engineering, Polytechnic School, Federal University of Rio de

A more convenient form of these equations is given by (see Appendix B)

$\mathbf{w}_j(k) = \mathbf{w}_j(k-1) + \eta_j(k)\,\tilde{y}_j(k-1,k)\left[\Phi_j^c(\mathbf{x}(k)) - \tilde{y}_j(k-1,k)\,\mathbf{w}_j(k-1)\right]$  (19)

where

$\eta_j(k) = \frac{1}{\xi_j(k)}$  (20)

$\xi_j(k) = \tilde{y}_j^2(k-1,k) + \gamma\,\xi_j(k-1)$  (21)
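Isolating the learning-rate mechanism, Eqs. (19)-(21) amount to a one-line recursion per component. The sketch below is written for an explicit (already centered and deflated) input vector purely for illustration, since the kernelized form is only derived in Eqs. (38)-(44); all names are ours.

```python
import numpy as np

def rls_update(w, phi, xi, gamma):
    """One fixed-point update of Eq. (19) with the rate of Eqs. (20)-(21).

    w     : current component estimate (explicit vector, illustration only)
    phi   : current centered and deflated input vector
    xi    : accumulated projection energy xi_j(k-1)
    gamma : forgetting factor, 0 < gamma < 1
    """
    y = w @ phi                        # projection, as in Eq. (14)
    xi = y ** 2 + gamma * xi           # Eq. (21)
    eta = 1.0 / xi                     # Eq. (20)
    w = w + eta * y * (phi - y * w)    # Eq. (19)
    return w, xi
```

Because $\xi_j$ accumulates the discounted energy of the projections, components with larger projections automatically receive smaller steps, which is the mechanism that replaces the user-defined learning rate of the KHA.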

These equations, however, still involve the explicit mapping of the input vector $\mathbf{x}(k)$ in the feature space, which can be avoided by the use of the kernel trick [14]. For this, consider, at the $k$-th iteration, a dynamically defined dictionary having a set of $m$ selected input samples, as follows:

$\mathbf{D}_{m,k} = [\mathbf{x}(a)\ \cdots\ \mathbf{x}(b)], \quad a \neq b \le k$  (22)

The mappings of these dictionary members in the feature space can be written as

$\boldsymbol{\Psi}_{m,k} = [\Phi(\mathbf{x}(a))\ \cdots\ \Phi(\mathbf{x}(b))]$  (23)

The next step is to approximate the mapping of $\mathbf{x}(k)$ in the feature space by a linear combination of the columns of the matrix $\boldsymbol{\Psi}_{m,k}$, as follows:

$\Phi(\mathbf{x}(k)) \approx \hat{\Phi}(\mathbf{x}(k)) = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\beta}_{k,m}$  (24)

The vector $\boldsymbol{\beta}_{k,m}$ that minimizes the square norm of the error produced in this approximate mapping corresponds to [8]

$\boldsymbol{\beta}_{k,m} = \mathbf{K}_m^{-1}\,\boldsymbol{\kappa}_{\mathbf{x}(k)}^m$  (25)

Here, the matrix $\mathbf{K}_m$ is the Gram matrix [14] produced using only the dictionary members, i.e.,

$\mathbf{K}_m = \begin{bmatrix}\kappa(\mathbf{d}_1,\mathbf{d}_1) & \cdots & \kappa(\mathbf{d}_1,\mathbf{d}_m)\\ \vdots & \ddots & \vdots\\ \kappa(\mathbf{d}_m,\mathbf{d}_1) & \cdots & \kappa(\mathbf{d}_m,\mathbf{d}_m)\end{bmatrix}$  (26)

where the vectors $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_m$ correspond to the columns of the matrix $\mathbf{D}_{m,k}$, and the vector $\boldsymbol{\kappa}_{\mathbf{x}(k)}^m$ represents the empirical kernel mapping [14] of $\mathbf{x}(k)$ onto this set, given by

$\boldsymbol{\kappa}_{\mathbf{x}(k)}^m = \boldsymbol{\Psi}_{m,k}^T\,\Phi(\mathbf{x}(k)) = [\kappa(\mathbf{x}(k),\mathbf{d}_1);\ \kappa(\mathbf{x}(k),\mathbf{d}_2);\ \cdots;\ \kappa(\mathbf{x}(k),\mathbf{d}_m)]$  (27)

Note that the arbitrary kernel function, denoted as $\kappa(\cdot,\cdot)$, is supposed to satisfy the Mercer theorem [14]. Similarly to Eq. (24), assume the following mappings:

$\overline{\Phi}_k = \boldsymbol{\Psi}_{m,k}\,\overline{\boldsymbol{\beta}}_{k,m}$  (28)

$\Phi_j^c(\mathbf{x}(k)) = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\beta}_{k,m}^j$  (29)

Combining Eqs. (10), (24) and (28), as well as assuming no dictionary augmentation at iteration $k$, i.e., $\boldsymbol{\Psi}_{m,k} = \boldsymbol{\Psi}_{m,k-1}$, the following relation results:

$\boldsymbol{\Psi}_{m,k}\,\overline{\boldsymbol{\beta}}_{k,m} = (1-\gamma)\,\boldsymbol{\Psi}_{m,k}\,\boldsymbol{\beta}_{k,m} + \gamma\,\boldsymbol{\Psi}_{m,k-1}\,\overline{\boldsymbol{\beta}}_{k-1,m} \;\;\Rightarrow\;\; \overline{\boldsymbol{\beta}}_{k,m} = (1-\gamma)\,\boldsymbol{\beta}_{k,m} + \gamma\,\overline{\boldsymbol{\beta}}_{k-1,m}$  (30)

In addition, we consider expressing the kernel component estimates using the matrix $\boldsymbol{\Psi}_{m,k}$, as follows:

$\mathbf{w}_j(k) = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\alpha}_{j,m}(k), \quad 1 \le j \le p$  (31)

Using Eqs. (10), (11), (27) and (31), Eq. (14) can be rewritten in the following form:

$\tilde{y}_j(k-1,k) = \boldsymbol{\alpha}_{j,m}^T(k-1)\,\boldsymbol{\Psi}_{m,k}^T\,\Phi^c(\mathbf{x}(k)) = \boldsymbol{\alpha}_{j,m}^T(k-1)\,\boldsymbol{\Psi}_{m,k}^T\left[\Phi(\mathbf{x}(k)) - \overline{\Phi}_k\right] = \boldsymbol{\alpha}_{j,m}^T(k-1)\left[\boldsymbol{\kappa}_{\mathbf{x}(k)}^m - \overline{\boldsymbol{\kappa}}_k^m\right] = \boldsymbol{\alpha}_{j,m}^T(k-1)\,\tilde{\boldsymbol{\kappa}}_k^m$  (32)

where

$\tilde{\boldsymbol{\kappa}}_k^m = \boldsymbol{\kappa}_{\mathbf{x}(k)}^m - \overline{\boldsymbol{\kappa}}_k^m$  (33)

$\overline{\boldsymbol{\kappa}}_k^m = (1-\gamma)\,\boldsymbol{\kappa}_{\mathbf{x}(k)}^m + \gamma\,\overline{\boldsymbol{\kappa}}_{k-1}^m$  (34)

Then, at each iteration we can use an estimate of the matrix $\mathbf{F}_j$, denoted as $\mathbf{F}_j(k)$, where the true eigenvectors are approximated by the component estimates produced by the algorithm, as follows:

$\mathbf{F}_j \approx \mathbf{F}_j(k) = \mathbf{I} - \sum_{i=1}^{j-1}\mathbf{w}_i(k-1)\,\mathbf{w}_i^T(k-1)$  (35)

Using Eqs. (11), (12), (14), (24), (28), (29), (31) and (35), the following relations can be deduced:

$\Phi_j^c(\mathbf{x}(k)) = \mathbf{F}_j(k)\,\Phi^c(\mathbf{x}(k)) = \Phi^c(\mathbf{x}(k)) - \sum_{i=1}^{j-1}\mathbf{w}_i(k-1)\,\mathbf{w}_i^T(k-1)\,\Phi^c(\mathbf{x}(k)) = \boldsymbol{\Psi}_{m,k}\left[\tilde{\boldsymbol{\beta}}_{k,m} - \sum_{i=1}^{j-1}\boldsymbol{\alpha}_{i,m}(k-1)\,\tilde{y}_i(k-1,k)\right] = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\beta}_{k,m}^j, \qquad \boldsymbol{\beta}_{k,m}^j = \tilde{\boldsymbol{\beta}}_{k,m} - \sum_{i=1}^{j-1}\tilde{y}_i(k-1,k)\,\boldsymbol{\alpha}_{i,m}(k-1)$  (36)

where

$\tilde{\boldsymbol{\beta}}_{k,m} = \boldsymbol{\beta}_{k,m} - \overline{\boldsymbol{\beta}}_{k,m}$  (37)

Similarly, using Eqs. (29) and (31) in Eq. (19), it is possible to deduce that

$\boldsymbol{\Psi}_{m,k}\,\boldsymbol{\alpha}_{j,m}(k) = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\alpha}_{j,m}(k-1) + \eta_j(k)\,\tilde{y}_j(k-1,k)\,\boldsymbol{\Psi}_{m,k}\left[\boldsymbol{\beta}_{k,m}^j - \tilde{y}_j(k-1,k)\,\boldsymbol{\alpha}_{j,m}(k-1)\right] \;\;\Rightarrow\;\; \boldsymbol{\alpha}_{j,m}(k) = \boldsymbol{\alpha}_{j,m}(k-1) + \eta_j(k)\,\tilde{y}_j(k-1,k)\left[\boldsymbol{\beta}_{k,m}^j - \tilde{y}_j(k-1,k)\,\boldsymbol{\alpha}_{j,m}(k-1)\right]$  (38)

Defining the matrix $\mathbf{A}_m(k)$ as follows

$\mathbf{A}_m(k) = [\boldsymbol{\alpha}_{1,m}^T(k);\ \boldsymbol{\alpha}_{2,m}^T(k);\ \ldots;\ \boldsymbol{\alpha}_{p,m}^T(k)]$  (39)

the proposed algorithm can be expressed in the following matrix form

$\mathbf{A}_m(k) = \mathbf{A}_m(k-1) + \boldsymbol{\Lambda}(k)\left(\tilde{\mathbf{y}}(k)\,\tilde{\boldsymbol{\beta}}_{k,m}^T - \mathrm{LT}\!\left[\tilde{\mathbf{y}}(k)\,\tilde{\mathbf{y}}^T(k)\right]\mathbf{A}_m(k-1)\right)$  (40)

with

$\tilde{\mathbf{y}}(k) = \mathbf{A}_m(k-1)\,\tilde{\boldsymbol{\kappa}}_k^m$  (41)

$\boldsymbol{\Lambda}(k) = \mathrm{diag}\!\left[\frac{1}{\xi_1(k)},\ \frac{1}{\xi_2(k)},\ \ldots,\ \frac{1}{\xi_p(k)}\right]$  (42)

where $\mathrm{LT}[\cdot]$ represents the lower-triangular matrix operator and $\mathrm{diag}[\cdot]$ represents a diagonal matrix whose entries are determined by the values $\xi_j(k)$ defined by Eq. (21). This equation can also be rewritten in the following vector form:

$\boldsymbol{\xi}(k) = \tilde{\mathbf{y}}^2(k) + \gamma\,\boldsymbol{\xi}(k-1)$  (43)

where the vector $\boldsymbol{\xi}(k)$ is given by

$\boldsymbol{\xi}(k) = [\xi_1(k);\ \xi_2(k);\ \cdots;\ \xi_p(k)]$  (44)
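Assuming a fixed dictionary (no augmentation at iteration k), the matrix-form update of Eqs. (40)-(43) can be sketched as below. This is a minimal NumPy sketch under the reconstruction above; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def rls_kha_iteration(A, K_inv, kappa_x, kappa_bar, beta_bar, xi, gamma):
    """One RLS-KHA update over a fixed dictionary (Eqs. (25), (30), (33)-(43)).

    A         : (p, m) coefficient matrix A_m(k-1), Eq. (39)
    K_inv     : (m, m) inverse Gram matrix of the dictionary
    kappa_x   : (m,) empirical kernel map of x(k), Eq. (27)
    kappa_bar : (m,) running mean of the kernel maps, Eq. (34)
    beta_bar  : (m,) running mean of the expansion coefficients, Eq. (30)
    xi        : (p,) accumulated energies, Eq. (43)
    gamma     : forgetting factor, 0 < gamma < 1
    """
    beta = K_inv @ kappa_x                                   # Eq. (25)
    beta_bar = (1.0 - gamma) * beta + gamma * beta_bar       # Eq. (30)
    beta_tilde = beta - beta_bar                             # Eq. (37)
    kappa_bar = (1.0 - gamma) * kappa_x + gamma * kappa_bar  # Eq. (34)
    kappa_tilde = kappa_x - kappa_bar                        # Eq. (33)
    y = A @ kappa_tilde                                      # Eq. (41)
    xi = y ** 2 + gamma * xi                                 # Eq. (43)
    Lam = np.diag(1.0 / xi)                                  # Eq. (42)
    # Eq. (40): Hebbian-like update with lower-triangular coupling
    A = A + Lam @ (np.outer(y, beta_tilde)
                   - np.tril(np.outer(y, y)) @ A)
    return A, kappa_bar, beta_bar, xi
```

All operations above are matrix-vector products of size at most p x m, which is the source of the linear cost in the dictionary size discussed at the end of this section.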

3.1. Dictionary construction

The proposed RLS-KHA explores the approximate linear dependency (ALD) criterion [18] for dictionary augmentation, i.e., it evaluates the reconstruction error produced by the approximate mapping proposed


in Eq. (24) as follows

$\epsilon_k^2 = \left\|\Phi(\mathbf{x}(k)) - \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\beta}_{k,m}\right\|^2$  (45)

The value of $\epsilon_k^2$ can be written as [8]

$\epsilon_k^2 = \kappa(\mathbf{x}(k),\mathbf{x}(k)) - \boldsymbol{\beta}_{k,m}^T\,\boldsymbol{\kappa}_{\mathbf{x}(k)}^m$  (46)

If this quantity is greater than a user-defined threshold, here denoted as $\nu$, the approximation error is considered unacceptable, resulting in the inclusion of the input vector $\mathbf{x}(k)$ in the dictionary, as follows:

$\mathbf{D}_{m+1,k} = [\mathbf{D}_{m,k}\ \ \mathbf{x}(k)]$  (47)

As a consequence, a new matrix $\boldsymbol{\Psi}_{m+1,k}$ is defined, corresponding to

$\boldsymbol{\Psi}_{m+1,k} = [\boldsymbol{\Psi}_{m,k}\ \ \Phi(\mathbf{x}(k))]$  (48)

Dictionary augmentation also results in the update of the following algorithm variables:

• Vector $\boldsymbol{\beta}_{k,m}$: based on Eqs. (24) and (48), we have

$\hat{\Phi}(\mathbf{x}(k)) = \boldsymbol{\Psi}_{m+1,k}\,\boldsymbol{\beta}_{k,m+1} = [\boldsymbol{\Psi}_{m,k}\ \ \Phi(\mathbf{x}(k))]\,\boldsymbol{\beta}_{k,m+1}, \qquad \boldsymbol{\beta}_{k,m+1} = \begin{bmatrix}\mathbf{o}_m\\ 1\end{bmatrix}$  (49)

where the column vector $\mathbf{o}_m$ has all $m$ components equal to zero.

• Vector $\overline{\boldsymbol{\beta}}_{k-1,m}$: similarly to Eq. (49), considering Eq. (28), the following equation results:

$\overline{\boldsymbol{\beta}}_{k-1,m+1} = \begin{bmatrix}\overline{\boldsymbol{\beta}}_{k-1,m}\\ 0\end{bmatrix}$  (50)

• Vector $\overline{\boldsymbol{\kappa}}_{k-1}^m$: similarly to Eq. (50), we have

$\overline{\boldsymbol{\kappa}}_{k-1}^{m+1} = \begin{bmatrix}\overline{\boldsymbol{\kappa}}_{k-1}^{m}\\ 0\end{bmatrix}$  (51)

• Vectors $\boldsymbol{\alpha}_{j,m}$: consider the kernel component estimates before and after dictionary augmentation, denoted as $\mathbf{w}_{j,m}$ and $\mathbf{w}_{j,m+1}$ $(1 \le j \le p)$, respectively. Since dictionary member inclusion should not alter the component estimates from the previous iteration, using Eq. (31), the following expression results:

$\mathbf{w}_{j,m+1}(k-1) = \mathbf{w}_{j,m}(k-1) \;\Rightarrow\; \boldsymbol{\Psi}_{m+1,k}\,\boldsymbol{\alpha}_{j,m+1}(k-1) = \boldsymbol{\Psi}_{m,k}\,\boldsymbol{\alpha}_{j,m}(k-1) \;\Rightarrow\; \boldsymbol{\alpha}_{j,m+1}(k-1) = \begin{bmatrix}\boldsymbol{\alpha}_{j,m}(k-1)\\ 0\end{bmatrix}$  (52)

Thus, if we consider the matrices $\mathbf{A}_m(k-1)$ and $\mathbf{A}_{m+1}(k-1)$ having rows defined by the vectors $\boldsymbol{\alpha}_{j,m}^T(k-1)$ and $\boldsymbol{\alpha}_{j,m+1}^T(k-1)$ $(1 \le j \le p)$, respectively, the following structure results:

$\mathbf{A}_{m+1}(k-1) = [\mathbf{A}_m(k-1)\ \ \mathbf{o}_p]$  (53)

• Vector $\boldsymbol{\kappa}_{\mathbf{x}(k)}^m$: considering Eqs. (27) and (48), we have

$\boldsymbol{\kappa}_{\mathbf{x}(k)}^{m+1} = \boldsymbol{\Psi}_{m+1,k}^T\,\Phi(\mathbf{x}(k)) = [\boldsymbol{\Psi}_{m,k}\ \ \Phi(\mathbf{x}(k))]^T\,\Phi(\mathbf{x}(k)) = \begin{bmatrix}\boldsymbol{\kappa}_{\mathbf{x}(k)}^m\\ \kappa(\mathbf{x}(k),\mathbf{x}(k))\end{bmatrix}$  (54)

• Matrix $\mathbf{K}_m^{-1}$: the matrix $\mathbf{K}_{m+1}^{-1}$ can be incrementally determined using the following formula [8]:

$\mathbf{K}_{m+1}^{-1} = \begin{bmatrix}\mathbf{K}_m^{-1} & \mathbf{o}_m\\ \mathbf{o}_m^T & 0\end{bmatrix} + \frac{1}{\epsilon_k^2}\begin{bmatrix}\boldsymbol{\beta}_{k,m}\\ -1\end{bmatrix}\begin{bmatrix}\boldsymbol{\beta}_{k,m}^T & -1\end{bmatrix}$  (55)

where the vector $\boldsymbol{\beta}_{k,m}$ and the variable $\epsilon_k^2$ are given by Eqs. (25) and (46), respectively.
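A hedged sketch of the ALD test (Eq. (46)) and of the augmentation updates listed above (Eqs. (47)-(55)) could be written as follows; the function and variable names are illustrative, and only the quantities covered in the list above are handled.

```python
import numpy as np

def ald_test_and_augment(x, D, K_inv, A, kappa_bar_prev, beta_bar_prev,
                         kernel, nu):
    """ALD evaluation (Eq. (46)) and, if needed, dictionary augmentation
    (Eqs. (47)-(55)). Returns the possibly enlarged structures.

    x              : current input vector x(k)
    D              : list of dictionary members d_1, ..., d_m
    K_inv          : (m, m) inverse Gram matrix of the dictionary
    A              : (p, m) coefficient matrix A_m(k-1)
    kappa_bar_prev : (m,) running mean kernel map at k-1
    beta_bar_prev  : (m,) running mean expansion coefficients at k-1
    kernel         : kernel function kappa(a, b)
    nu             : ALD threshold
    """
    kappa_x = np.array([kernel(x, d) for d in D])     # Eq. (27)
    beta = K_inv @ kappa_x                            # Eq. (25)
    eps2 = kernel(x, x) - beta @ kappa_x              # Eq. (46)
    if eps2 > nu:                                     # include x(k), Eq. (47)
        D = D + [x]
        m = K_inv.shape[0]
        # Eq. (55): rank-one block update of the inverse Gram matrix
        K_inv = np.block([[K_inv, np.zeros((m, 1))],
                          [np.zeros((1, m)), np.zeros((1, 1))]])
        v = np.append(beta, -1.0)
        K_inv = K_inv + np.outer(v, v) / eps2
        # Eqs. (50)-(53): pad running means and coefficients with zeros
        kappa_bar_prev = np.append(kappa_bar_prev, 0.0)
        beta_bar_prev = np.append(beta_bar_prev, 0.0)
        A = np.hstack([A, np.zeros((A.shape[0], 1))])
        # Eq. (54): the kernel map gains the entry kappa(x, x)
        kappa_x = np.append(kappa_x, kernel(x, x))
        # Eq. (49): the expansion of x(k) over the new dictionary
        beta = np.append(np.zeros(m), 1.0)
    return D, K_inv, A, kappa_bar_prev, beta_bar_prev, kappa_x, beta
```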

In synthesis, the RLS-KHA involves three main conceptual phases: dictionary evaluation, dictionary augmentation, and the update of the kernel component estimates. Table 1 summarizes the algorithm for convenience, where separate boxes emphasize the steps corresponding to each phase. In the first phase, the error produced by the approximate mapping of the current input vector is calculated, which involves $O(m^2)$ operations, mainly due to step 5. The second phase corresponds to the update of both the dictionary and the algorithm variables, requiring $O((m+1)^2)$ operations in step 9. Lastly, the update of the kernel component estimates requires $O(mp)$ and $O(mp^2/2)$ operations in steps 21 and 24, respectively. If $m \gg p^2/2$, the number of algorithm operations in this phase becomes linear with respect to the dictionary size. In this case, the remaining phases dominate the overall computational cost of the algorithm.

Table 1. The RLS-KHA algorithm.


3.2. Relations to other KPCA extraction methods

Most of the recently proposed online kernel PCA extraction algorithms exploit a common framework, involving the following phases: input evaluation, dictionary management, and the update of the kernel component estimates. In the first phase, the algorithms evaluate whether the current input should become a dictionary member. Regarding this aspect, the proposed RLS-KHA and the OKHA are similar, since both methods employ the ALD criterion, whereas the Tanaka algorithm considers the coherence criterion [19]. Unlike the ALD-based RLS-KHA and OKHA, the SubKHA includes all input vectors in the dictionary until a user-defined maximum number of members is reached.

Relative to member inclusion, the dictionary update phase follows a very similar approach in all algorithms: a reduced computational cost formula (Eq. (55)) produces the inverse of the Gram matrix of the dictionary members. The exception is the SubKHA, which exchanges the least representative dictionary member for the current input when the dictionary is full and a replacement criterion based on the ALD is met. Note that the OKHA, the Tanaka algorithm, and the RLS-KHA do not replace dictionary members.

Regarding the iterative equations for kernel component extraction, the OKHA and the SubKHA exploit kernelizations of the GHA, employing the same user-defined learning rate for all components under extraction, usually kept fixed or adjusted according to some heuristic procedure. The proposed RLS-KHA and the Tanaka algorithm use an individual learning rate for each component under extraction, dynamically tuned, resulting in impressive gains in convergence speed.

Concerning the last phase, the Tanaka algorithm minimizes a weighted cost function (IWR), initially proposed for generalized eigendecomposition in [12], whereas the proposed RLS-KHA considers the minimization of multiple simple objective functions following the LMSER principle, coupled by approximated deflations of the input data based on the component estimates. The latter method involves simpler extraction equations than the former and produces more accurate results, as observed in our simulations. Besides, analyzing Eqs. (40) and (41), if we ignore the process of centering the input data mapped in the feature space, the RLS-KHA assumes extraction equations similar to those of the KHA, OKHA, and SubKHA. Thus, for this particular case, the proposed RLS-KHA solves an open problem stated in [8], which is the determination of an optimal learning rate for the OKHA algorithm. Table 2 summarizes the main characteristics of the algorithms mentioned here.

4. Results

We compared the accuracy and convergence speed of the proposed RLS-KHA with the following state-of-the-art online KPCA extraction algorithms: KHA, OKHA, SubKHA, and the Tanaka algorithm. For the sake of simplicity, the methods did not consider the centering of the implicitly mapped input samples in the feature space. All experiments adopted the Gaussian kernel [14] and utilized five open-access databases: USPS [20], ISOLET [21], MNIST [22], COREL [23] and COVERTYPE [24]. These sets represent an evaluation framework of increasing complexity regarding the quantity and dimensionality of the vectors involved. Except for USPS, assuming personal or embedded computing platforms, the extraction of kernel components using classical KPCA in such sets is cumbersome or infeasible due to the size of the Gram matrix involved. Moreover, online KPCA extraction techniques produce kernel components expanded in terms of small dictionaries, resulting in greater data summarization.

We conducted several algorithm trials, according to the dataset considered, randomly presenting the dataset vectors to the algorithms. For each trial, we initialized the elements of the matrix A using random numbers sampled from a zero-mean Gaussian distribution with variance 0.01. Table 3 summarizes the experiments carried out regarding the number (N) and dimension (d) of the dataset vectors, the amount of memory (in bytes) demanded by the storage of the Gram matrix involved in classical KPCA (assuming double-precision matrix elements), the adopted kernel width (σ), as well as the number of kernel components (p) and epochs (ep) considered in each experiment.¹

The hyperparameters of these evaluations were the same as previously employed in some literature experiments. For all datasets except USPS, we used a fixed (F) learning rate of 0.1 in the OKHA and SubKHA, and small dictionaries for component expansion, automatically extracted from the data. In the case of USPS, the experiments adopted both a fixed (F) and an exponentially decreasing (ED) learning rate. For the first, the value adopted was 0.05, whereas the second considered an initial value of 0.05 and a decay factor of 0.999995. These values were chosen by some experimental trials and achieved a good tradeoff between extraction speed and component accuracy. Besides, the USPS experiments considered dictionaries composed of all dataset members, i.e., a full dictionary (FD), as well as dictionaries defined by a small subset of them, i.e., small dictionaries (SD).

Regarding the Tanaka algorithm, for all datasets we employed the factor β equal to 0.9999, the entries of the matrix Q were sampled from a zero-mean Gaussian distribution with a variance of 0.1, and we adopted decreasing weighting factors starting from 1 with a regular step of 0.02. The SubKHA experiments considered two scenarios relative to the replacement of dictionary members: enabled (EN) and disabled (DS), considering values of α given by 1 and 10,000, respectively. For each set of dataset experiments, we tuned the parameters related to dictionary construction, i.e., the value of ν, the coherence threshold (δ), and the value of M, to result in dictionaries of similar size. The values adopted for such parameters are summarized in Table 4.

Note that in our experiments involving the proposed RLS-KHA, the dataset vectors were randomly sampled, configuring a stationary data process. For such cases, we recommend values of γ in the range of 0.99–0.9995. Nonetheless, lower values of γ may be employed in non-stationary environments, speeding up algorithm convergence. Clearly, the requirements of convergence speed and accuracy of the target application dictate the choice of the factor γ. Motivated by the USPS results, we adopted the value of γ equal to 0.999 for all remaining datasets.

Table 2. Main characteristics of some online kernel algorithms.

Aspect                | OKHA           | RLS-KHA             | Tanaka              | SubKHA
Dictionary evaluation | ALD            | ALD                 | Coherence           | None until dictionary is full
Dictionary management | Just inclusion | Just inclusion      | Just inclusion      | Inclusion and exclusion
Learning rate         | User defined   | Automatically tuned | Automatically tuned | User defined
Algorithm derivation  | GHA            | Particular          | IWR                 | GHA

Table 3. Datasets and extraction parameters used in the evaluation of the methods (see text).

Dataset | N       | d   | st       | σ   | p  | ne | ep
USPS    | 300     | 256 | 703.2 kB | 8   | 16 | 25 | 2000
ISOLET  | 7797    | 617 | 463.8 MB | 10  | 40 | 10 | 1
MNIST   | 22,008  | 784 | 3.61 GB  | 8.5 | 50 | 10 | 1
COREL   | 68,040  | 33  | 34.55 GB | 0.5 | 20 | 10 | 1
COVER   | 581,012 | 54  | 2.46 TB  | 1.0 | 30 | 4  | 1

¹ A submission of all dataset elements to a learning algorithm defines one epoch. Thus, this value corresponds in each experiment to the dataset size.


The quality extraction function proposed in [9], here named the empirical square reconstruction error (ESRE), was the figure of merit employed in our analysis. Ignoring the centering of the input data mappings in the feature space, we can express this function for an arbitrary iteration k as

$\mathrm{ESRE}(k) = \frac{1}{N}\sum_{i=1}^{N}\left[\kappa(\mathbf{x}_i,\mathbf{x}_i) - 2\,\mathbf{y}_i^T(k)\,\mathbf{y}_i(k) + \mathbf{t}_i^T(k)\,\mathbf{K}_m\,\mathbf{t}_i(k)\right]$  (56)

with

$\mathbf{y}_i(k) = \mathbf{A}_m(k)\,\boldsymbol{\kappa}_{\mathbf{x}_i}^m$  (57)

$\mathbf{t}_i(k) = \mathbf{A}_m^T(k)\,\mathbf{y}_i(k)$  (58)

In this formula, the matrix A gathers the kernel component estimates, as in Eq. (39), the variable N corresponds to the cardinality of the evaluation set, and the dataset vectors are denoted as $\mathbf{x}_i$ $(1 \le i \le N)$.
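Under the same assumption of uncentered mappings, Eqs. (56)-(58) translate directly into a short routine such as the sketch below (NumPy; names are ours).

```python
import numpy as np

def esre(X, A, D, K_m, kernel):
    """Empirical square reconstruction error of Eqs. (56)-(58).

    X      : (N, d) evaluation set
    A      : (p, m) matrix of kernel component estimates
    D      : list with the m dictionary members
    K_m    : (m, m) Gram matrix of the dictionary
    kernel : kernel function kappa(a, b)
    """
    err = 0.0
    for x in X:
        kappa_x = np.array([kernel(x, d) for d in D])   # empirical kernel map
        y = A @ kappa_x                                  # Eq. (57)
        t = A.T @ y                                      # Eq. (58)
        err += kernel(x, x) - 2.0 * y @ y + t @ K_m @ t  # Eq. (56)
    return err / len(X)
```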

The performance plots for method comparison considered the average value of the ESRE at each iteration. We also evaluated the average (μ) and standard deviation (σ) of the steady-state values attained by each method, reported as μ ± σ. To identify statistically significant differences in the method performances, we employed the non-parametric Kruskal-Wallis (KW) test [25], assuming a significance level of α = 10%, and Tukey's HSD post-hoc test [25]. We conducted the statistical analysis as follows: if the KW test returned a p-value lower than 0.1, the differences in the results were assumed to be significant, and we analyzed the p-value associated with each pairwise comparison, rejecting the hypothesis of similar performance (null hypothesis) for all pairs satisfying p < 0.01.
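This statistical protocol can be reproduced, for instance, with SciPy and statsmodels; the sketch below is one possible implementation (the paper does not specify the software used), assuming a dictionary `steady_state` that maps each method name to its vector of steady-state ESRE values collected over the trials.

```python
import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_methods(steady_state, alpha_kw=0.10, alpha_pair=0.01):
    """Kruskal-Wallis test followed by pairwise Tukey HSD comparisons.

    steady_state : dict mapping method name -> 1-D array of steady-state
                   ESRE values collected over the trials
    """
    groups = list(steady_state.values())
    stat, p = kruskal(*groups)
    print(f"Kruskal-Wallis: chi2 = {stat:.2f}, p = {p:.4f}")
    if p < alpha_kw:
        values = np.concatenate(groups)
        labels = np.concatenate([[name] * len(v)
                                 for name, v in steady_state.items()])
        # pairwise comparisons; reject similarity when the pairwise p < 0.01
        print(pairwise_tukeyhsd(values, labels, alpha=alpha_pair))
```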

More details regarding the datasets and the conducted experiments are presented in the following.

4.1. USPS dataset

The USPS dataset is composed of grayscale handwritten digit images of size 16×16, coded using integer numbers in the range 0–255. We reproduced an experiment conducted in [11], taking the first 100 images of each of the digits 1–3, and considering the same values for the kernel width and the number of components adopted in [11]. The selected images were converted to column vectors of dimension 256, normalized to the 0–1 range, and presented to the algorithms.

Initially, we examined the effect of the forgetting factor choice on the convergence behavior of the proposed RLS-KHA. Figs. 1 and 2 summarize the experiments, which considered values of γ ranging from 0.985 to 0.9995, as well as full and small dictionaries. The proposed algorithm became unstable in some trials when values of γ lower than 0.985 were adopted.

The use of higher values of γ resulted in slower convergence but improved accuracy, as can be observed by the reduction in the steady-state ESRE. The mean value attained by the proposed RLS-KHA for the full dictionary case was 0.1670 ± 0.0025 (γ = 0.985) and 0.1415 ± 0.001 (γ = 0.995), whereas for the small dictionary it was 0.1909 ± 0.0054 (γ = 0.985) and 0.1768 ± 0.0054 (γ = 0.995). In both cases, the KW test confirmed statistically significant differences in the algorithm performances (χ² > 57.45, p < 0.001). Notice that a good trade-off between accuracy and convergence speed was achieved using γ = 0.999. This factor produced statistically similar steady-state ESRE values to γ = {0.995, 0.9995} (p > 0.9050) and γ = 0.9995 (p = 0.5555) for the small and full dictionaries, respectively.

Next, we compared the proposed RLS-KHA using γ = 0.999 with the other KPCA methods for both dictionary sizes. Figs. 3 and 4 depict the results. The differences observed in the steady-state ESRE were relevant (χ² > 99.44, p < 0.001). Besides, the use of fixed learning rates resulted in slightly better results than exponentially decaying ones for both dictionary sizes.

In the case of the full dictionary, the OKHA and SubKHA exhibited superimposed curves, as expected. Regarding the online methods, the Tanaka algorithm converged faster than the proposed RLS-KHA until iteration 6,000, but attained a highly biased steady-state ESRE (0.1815 ± 0.0001). As the KHA employs all dataset members from the beginning of the extraction, this method initially exhibited higher values of ESRE. However, after 200 iterations, the KHA was the fastest method, being outperformed by the proposed RLS-KHA at iteration 20,000.

Fig. 1. ESRE errors (average curve) attained by the proposed RLS-KHA for the USPS dataset when employing different forgetting factors, for the full dictionary case (see text).

Fig. 2. ESRE errors (average curve) attained by the proposed RLS-KHA for the USPS dataset when employing different forgetting factors, for the small dictionary case (see text).

Table 4. Parameters regarding dictionary construction used by each algorithm in the experiments (see text).

Dataset   | RLS-KHA/OKHA (ν) | Tanaka (δ) | SubKHA (M)
USPS-FD   | 0.01             | 1.0000     | 300
USPS-SD   | 0.25             | 0.7420     | 49
ISOLET    | 0.62             | 0.4300     | 71
MNIST     | 0.43             | 0.5515     | 90
COREL     | 0.47             | 0.5900     | 50
COVERTYPE | 0.50             | 0.5750     | 156


The RLS-KHA also attained the lowest mean steady-state ESRE (0.1417 ± 0.0017), performing equivalently (p = 0.1049) to OKHA/SubKHA with an exponentially decaying learning rate (D) and better than the remaining techniques (p < 0.006). Moreover, the proposed RLS-KHA converged six times faster (45,000 iterations) than OKHA and SubKHA in this experiment.

Concerning the small dictionary results, the SubKHA initially behaved similarly to the OKHA; however, it attained the highest mean steady-state ESRE (0.2516 ± 0.0062), similar to the Tanaka algorithm (0.2101 ± 0.0047, p = 0.1049). The values attained by the proposed algorithm, OKHA (D), and OKHA (F) were 0.1768 ± 0.0054, 0.1770 ± 0.0054, and 0.1776 ± 0.0045, respectively, practically the same (p > 0.9934). Moreover, the proposed RLS-KHA converged in about 35,000 iterations, performing approximately 4.6 times faster than OKHA (F).

4.2. ISOLET dataset

The ISOLET dataset [21] is composed of 617 features extracted from voice signals produced by 150 subjects pronouncing all alphabet letters twice [26]. The simulations considered both the training and testing archives (isolet1 to isolet5), involving 7,797 examples. Based on trials applying classical KPCA extraction to subsamples of this dataset, we defined the kernel width and the number of components. The choice of such parameters aimed to produce a consistent energy curve [3] and to result in a set of kernel components retaining at least 80% of the energy [11] of the data (implicitly) mapped in the feature space. The results are summarized in Fig. 5.

According to the KW test, the methods performed differently (χ² = 28.74, p < 0.001), but the results were more similar, as indicated by the lower χ² attained. Up to iteration 6000, the best performing technique was the Tanaka algorithm. The proposed RLS-KHA outperformed the OKHA, the SubKHA, and the Tanaka algorithm at iterations 3600, 5000, and 6000, respectively. The replacement of dictionary members in SubKHA (EN) resulted in a slight improvement in convergence speed, but the steady-state values were equivalent (p = 0.8768) to SubKHA (DS). Our method achieved the lowest mean steady-state ESRE (0.4986 ± 0.0090), performing better than the OKHA (0.5500 ± 0.0284, p < 0.001) and SubKHA (DS) (0.5169 ± 0.0065, p = 0.0540), and equivalently to the Tanaka algorithm (0.5126 ± 0.0284, p = 0.6204) and SubKHA (EN) (0.5106 ± 0.0058, p = 0.4040).

4.3. MNIST dataset

Digital pictures of handwritten digits of size 28×28 [22] compose the MNIST dataset. Reproducing an experiment from [8], we selected the training and testing samples associated with the numbers "1", "2" and "3", resulting in 22,008 images. We converted the selected images to vectors of dimension 784, which were submitted to the algorithms. Unlike [8], we did not add noise to these pictures. The kernel width, the number of components extracted, and the dictionary size were the same as adopted in an experiment from [8]. Fig. 6 exhibits the corresponding results.

The KW test reported different steady-state ESRE values (χ² = 45.44, p < 0.001). The results show that the proposed RLS-KHA performed better than the Tanaka algorithm over the entire simulation interval. Our method surpassed the OKHA and SubKHA (EN, DS) at iterations 3000 and 6000, respectively, achieving the lowest mean steady-state ESRE (0.2720 ± 0.0094). It performed better than SubKHA (DS) (0.3004 ± 0.0018, p = 0.0496), the OKHA (0.3347 ± 0.0106, p < 0.001) and the Tanaka algorithm (0.3859 ± 4.6476, p < 0.001), but performed similarly to SubKHA (EN) (0.2962 ± 0.0043, p = 0.3228).

Fig. 4. ESRE errors (average curve) attained by each method considering the small sample dictionary and the USPS dataset (see text).

Fig. 5. ESRE errors (average curve) attained by each method considering the ISOLET dataset (see text).

Fig. 3. ESRE errors (average curve) attained by each method considering the full sample dictionary and the USPS dataset (see text).


4.4. COREL dataset

The COREL dataset consists of four sets of features (color histogram, color histogram layout, color moments and co-occurrence texture), extracted from 68,040 photo images belonging to various categories and obtained from the Corel Corporation. We considered only the color histogram features in our experiments. The kernel width, the number of components, and the dictionary size were set similarly to the ISOLET experiments. We have also included a modified (M) version of the RLS-KHA algorithm, employing the same dictionary management criterion as SubKHA (DS), to clarify some findings. Fig. 7 exhibits the results.

Excluding the RLS-KHA (M), the steady-state values were different according to the KW test (χ² = 37.39, p < 0.001). The convergence behavior of the proposed RLS-KHA was better than that of the OKHA and the Tanaka algorithm, but worse than that of the SubKHA. In terms of steady-state errors, our algorithm attained a mean steady-state ESRE of 0.1416 ± 0.0075, performing better than the OKHA (0.1581 ± 0.0060, p = 0.0807), similarly to the Tanaka algorithm (0.1492 ± 0.0068, p = 0.9480) and SubKHA (DS) (0.1323 ± 0.0082, p = 0.5203), and worse than SubKHA (EN) (0.1214 ± 0.0068, p = 0.0167). However, the best convergence curve belongs to the RLS-KHA (M), which attained a steady-state ESRE value of 0.1118 ± 0.0049, similar to SubKHA (EN) (p = 0.0855) and better than the remaining methods (p < 0.0855).

We conclude that the SubKHA performed better in this experiment since it fills its dictionary in far fewer iterations than the RLS-KHA, which employs a more conservative criterion for member inclusion. Note that we observed a similar phenomenon in the USPS experiments when comparing the KHA with the online KPCA algorithms.

4.5. Cover type dataset

The cover type UCI dataset involves 54 cartographic variables (elevation, slope, distance to hydrology, etc.) regarding wilderness areas located in the Roosevelt National Forest in northern Colorado. Each one of the 581,012 feature vectors relates to a 30×30 m cell, determined by the US Forest Service (USFS). The kernel width, the number of components, and the dictionary size were set similarly to the ISOLET experiments. We normalized each variable to be within the 0–1 range, and also included the modified RLS-KHA as in the COREL experiments. Fig. 8 shows the results.

Excluding the RLS-KHA (M), the Tanaka algorithm was initially the fastest technique, being surpassed by the proposed RLS-KHA, SubKHA (EN) and SubKHA (DS) at iterations 120,000, 180,000, and 230,000, respectively. From iteration 120,000 to 300,000, the proposed RLS-KHA was the best performing method, after which it was outperformed by SubKHA (EN). In terms of steady-state ESRE, the methods performed differently (χ² = 12.06, p = 0.0169), but the proposed RLS-KHA, SubKHA (DS), and SubKHA (EN) attained the following equivalent (p > 0.8983) values: 0.2513 ± 0.0022, 0.2369 ± 0.0134, and 0.255 ± 0.0213, respectively. In turn, the values achieved by SubKHA (EN) were lower than those of the OKHA (0.2755 ± 0.0150, p = 0.0884) and the Tanaka algorithm (0.2806 ± 0.0060, p = 0.0235). Similar to the COREL dataset, the RLS-KHA (M) converged faster than all methods (χ² = 17.62, p = 0.0035), attaining a final mean steady-state ESRE (0.2263 ± 0.0031) similar to SubKHA (EN, DS) (p > 0.4657), and better than the OKHA and the Tanaka algorithm (p < 0.0203). Thus, the use of a more aggressive criterion for dictionary construction may improve the performance of the proposed RLS-KHA, especially when dealing with large datasets, an issue to be explored in a future algorithm version.

5. Conclusions

In this work, we have proposed a new online kernel principal component extraction algorithm: the RLS-KHA.

Fig. 6. ESRE errors (average curve) attained by each method considering the MNIST dataset (see text).

Fig. 7. ESRE errors (average curve) attained by each method considering the COREL dataset (see text).

Fig. 8. ESRE errors (average curve) attained by each method considering the COVER dataset (see text).


The method involves extraction equations as simple as those of the KHA and employs an individual learning rate for each component under extraction, dynamically tuned by a computationally inexpensive iterative equation. The results show that convergence speed and accuracy are conflicting requirements in this approach, but the use of a suitably chosen forgetting factor leads to impressive gains in both, especially as compared to state-of-the-art KPCA extraction methods.

Appendix A. The minimum value of Eq. (7)

Using straightforward algebra, Eq. (7) may be rewritten as

$O_j = \mathrm{tr}\{E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})]\} - 2\,\mathbf{w}_j^T E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})]\,\mathbf{w}_j + \|\mathbf{w}_j\|^2\,\mathbf{w}_j^T E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})]\,\mathbf{w}_j$  (A.1)

where the trace operator is represented by $\mathrm{tr}\{\cdot\}$. The gradient of $O_j$ with respect to $\mathbf{w}_j$, denoted as $\nabla_{\mathbf{w}_j} O_j$, is given by

$\nabla_{\mathbf{w}_j} O_j = -4\,E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})]\,\mathbf{w}_j + 4\left(\mathbf{w}_j^T E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})]\,\mathbf{w}_j\right)\mathbf{w}_j$  (A.2)

Assuming that the vector $\tilde{\Phi}_j(\mathbf{x})$ has an arbitrary dimensionality of $n$, and considering Eqs. (8) and (9), the following expression results:

$E[\tilde{\Phi}_j(\mathbf{x})\tilde{\Phi}_j^T(\mathbf{x})] = \mathbf{F}_j\,E[\tilde{\Phi}(\mathbf{x})\tilde{\Phi}^T(\mathbf{x})]\,\mathbf{F}_j^T = \mathbf{F}_j\mathbf{C}_\kappa\mathbf{F}_j^T = \left(\mathbf{I} - \sum_{i=1}^{j-1}\mathbf{e}_i\mathbf{e}_i^T\right)\left(\sum_{i=1}^{n}\lambda_i\mathbf{e}_i\mathbf{e}_i^T\right)\left(\mathbf{I} - \sum_{i=1}^{j-1}\mathbf{e}_i\mathbf{e}_i^T\right) = \sum_{i=j}^{n}\lambda_i\mathbf{e}_i\mathbf{e}_i^T = \mathbf{C}_\kappa^j$  (A.3)

Making $\nabla_{\mathbf{w}_j} O_j = \mathbf{0}$ and considering Eq. (A.3) results in

$\mathbf{C}_\kappa^j\,\mathbf{w}_j = \delta\,\mathbf{w}_j$  (A.4)

for

$\delta = \mathbf{w}_j^T\,\mathbf{C}_\kappa^j\,\mathbf{w}_j$  (A.5)

i.e., the minimum value of $O_j$ is achieved for $\mathbf{w}_j = \alpha\,\mathbf{e}_j$, which is a scaled version of the $j$-th unit-norm dominant eigenvector of $\mathbf{C}_\kappa$. Substituting this result into Eq. (A.4), we have

$\mathbf{C}_\kappa^j(\alpha\mathbf{e}_j) = \left(\alpha\mathbf{e}_j^T\,\mathbf{C}_\kappa^j\,\alpha\mathbf{e}_j\right)\alpha\mathbf{e}_j \;\;\Rightarrow\;\; \mathbf{C}_\kappa^j\,\mathbf{e}_j = \alpha^2\lambda_j\,\mathbf{e}_j$  (A.6)

thus α = ± 1.

Appendix B. Proof of Eq. (19)

Expanding Eq. (17), we have

$\mathbf{w}_j(k) = \frac{1}{\xi_j(k)}\left[\tilde{y}_j(k-1,k)\,\Phi_j^c(\mathbf{x}(k)) + \gamma\sum_{i=1}^{k-1}\gamma^{(k-1)-i}\,\tilde{y}_j(i-1,i)\,\Phi_j^c(\mathbf{x}(i))\right] = \frac{1}{\xi_j(k)}\,\tilde{y}_j(k-1,k)\,\Phi_j^c(\mathbf{x}(k)) + \frac{\gamma\,\xi_j(k-1)}{\xi_j(k)}\,\mathbf{w}_j(k-1)$  (B.1)

Similarly, Eq. (18) can be written as

$\xi_j(k) = \tilde{y}_j^2(k-1,k) + \gamma\sum_{i=1}^{k-1}\gamma^{(k-1)-i}\,\tilde{y}_j^2(i-1,i) = \tilde{y}_j^2(k-1,k) + \gamma\,\xi_j(k-1) \;\;\Rightarrow\;\; \xi_j(k-1) = \frac{1}{\gamma}\,\xi_j(k) - \frac{1}{\gamma}\,\tilde{y}_j^2(k-1,k)$  (B.2)

Substituting Eq. (B.2) in (B.1) results in Eq. (19).

References

[1] T.D. Sanger, Optimal unsupervised learning in a single linear feedforward network, Neural Netw. 2 (1989) 459–473.
[2] B. Schölkopf, S. Mika, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (1998) 1299–1319.
[3] I.T. Jolliffe, Principal Component Analysis, 2nd ed., Springer, 2002.
[4] G. Golub, C. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, 1996.
[5] L. Hoegaerts, L.D. Lathauwer, I. Goethals, J. Suykens, J. Vandewalle, B.D. Moor, Efficiently updating and tracking the dominant kernel principal components, Neural Netw. 20 (2007) 220–229.
[6] K.I. Kim, M. Franz, B. Scholkopf, Iterative kernel principal component analysis for image modeling, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1351–1366.
[7] W. Shi, W. Zhang, The accelerated power method for kernel principal component analysis, in: D. Zeng (Ed.), Applied Informatics and Communication 225, Springer, Berlin Heidelberg, 2011, pp. 563–570.
[8] P. Honeine, Online kernel principal component analysis: a reduced-order model, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 1814–1826.
[9] Y. Washizawa, Adaptive subset kernel principal component analysis for time-varying patterns, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1961–1973.
[10] T. Tanaka, Y. Washizawa, A. Kuh, Adaptive kernel principal components tracking, in: IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 2012, pp. 1905–1908.
[11] S. Günter, N.N. Schraudolph, S.V.N. Vishwanathan, Fast iterative kernel principal component analysis, J. Mach. Learn. Res. 8 (2007) 1893–1918.
[12] J. Yang, Y. Zhao, H. Xi, Weighted rule based adaptive algorithm for simultaneously extracting generalized eigenvectors, IEEE Trans. Neural Netw. 22 (2011) 800–806.
[13] S. Ouyang, Z. Bao, Fast principal component extraction by a weighted information criterion, IEEE Trans. Signal Process. 50 (2002) 1994–2002.
[14] B. Scholkopf, A.J. Smola, Learning with Kernels, 1st ed., MIT Press, 2002.
[15] L. Xu, Least mean square error reconstruction principle for self-organizing neural-nets, Neural Netw. 6 (1993) 627–648.
[16] P.S.R. Diniz, Adaptive Filtering - Algorithms and Practical Implementation, 4th ed., Springer, 2013.
[17] B. Yang, Projection approximation subspace tracking, IEEE Trans. Signal Process. 43 (1995) 95–107.
[18] Y. Engel, S. Mannor, R. Meir, The kernel recursive least-squares algorithm, IEEE Trans. Signal Process. 52 (2004) 2275–2285.
[19] C. Richard, J.C.M. Bermudez, P. Honeine, Online prediction of time series data with kernels, IEEE Trans. Signal Process. 57 (2009) 1058–1067.
[20] USPS handwritten digits, http://www.cs.nyu.edu/roweis/data.html, 2015.
[21] UCI machine learning repository - ISOLET data set, https://archive.ics.uci.edu/ml/datasets/ISOLET, 2015.
[22] The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/, 2015.
[23] UCI machine learning repository - Corel Image Features data set, https://archive.ics.uci.edu/ml/datasets/Corel+Image+Features, 2015.
[24] UCI machine learning repository - Covertype data set, https://archive.ics.uci.edu/ml/datasets/Covertype, 2015.
[25] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 3rd ed., Chapman & Hall/CRC, 2004.
[26] R. Cole, M. Fanty, Spoken letter recognition, in: Proceedings of the Workshop on Speech and Natural Language, pp. 385–390.

João B.O. Souza Filho received the Electronic Engineering degree in 2001, and the M.Sc. and D.Sc. degrees in Electrical Engineering in 2002 and 2007, respectively, all from the Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil. He was with the Federal Center of Technological Education Celso Suckow da Fonseca (CEFET-RJ) from 2005 to 2015. He is now a full professor in the Department of Electronic Engineering at UFRJ. His research interests are in pattern recognition, machine learning, artificial neural networks, adaptive systems, digital signal processing, embedded systems and electronic instrumentation.

Paulo S.R. Diniz received the Electronics Eng. degree (Cum Laude) from the Federal University of Rio de Janeiro (UFRJ) in 1978, the M.Sc. degree from COPPE/UFRJ in 1981, and the Ph.D. from Concordia University, Montreal, Canada, in 1984 (all in Electrical Engineering). He is with the Department of Electronic Engineering and the Postgraduate Program in Electrical Engineering at UFRJ. Diniz is an IEEE Fellow. He has served as associate editor of several IEEE journals and received many career awards, having published numerous refereed papers and three textbooks. His research interests are in adaptive signal processing, digital and wireless communications and multirate systems.
