

3 VARIANTS OF THE MCA EXIN NEURON*

3.1 HIGH-ORDER MCA NEURONS

High-order units have been advocated in [65]. The MCA neurons can be generalized into high-order neurons for fitting nonlinear hypersurfaces, which can be expressed as

$$w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_n f_n(\mathbf{x}) + w_0 = 0 \qquad (3.1)$$

where f_i(x) is a function of x [e.g., for polynomial hypersurfaces $f_i(\mathbf{x}) = x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n}$ with integers $r_j \geq 0$]. A generalized high-order neuron is just like a linear neuron, but each weight is now regarded as a high-order connection, and each input is not the component of the data vector x but its high-order combination f_i(x). From this point of view, the neuron can be considered as a functional link neuron [148]. Hence, the same reasoning as for the hyperplane fitting can be made: Defining f = [f_1(x), f_2(x), . . . , f_n(x)]^T, calculate e_f = E(f); if it is different from zero, preprocess the data, and so on.
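As a concrete illustration, the following sketch (NumPy assumed; the function name and the particular quadratic surface are only an example, not taken from the book) builds the high-order input f(x) for a second-order surface in two variables and centers it, exactly as the hyperplane preprocessing prescribes:

```python
import numpy as np

def quadratic_features(x):
    """Map x = [x1, x2] to f(x) for a quadratic surface
    w1*x1^2 + w2*x1*x2 + w3*x2^2 + w4*x1 + w5*x2 + w0 = 0."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2])

# Preprocess as in the hyperplane case: estimate e_f = E(f) and, if it is
# nonzero, remove it before feeding the high-order inputs to the linear MCA neuron.
X = np.random.default_rng(0).standard_normal((500, 2))  # placeholder raw observations
F = np.apply_along_axis(quadratic_features, 1, X)
e_f = F.mean(axis=0)                                     # sample estimate of E(f)
F_centered = F - e_f                                     # inputs actually fed to the neuron
```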

Remark 80 (MCA Statistical Limit) If the observation datum x has Gaussian noise, the noise in the transformed input f is generally non-Gaussian [145]. It implies that the MCA solution (TLS) is not optimal (see [98]); a robust version of MCA EXIN (NMCA EXIN) that overcomes the problem is presented in the next section.

*Sections 3.1 and 3.2 are reprinted from G. Cirrincione and M. Cirrincione, “Robust Total Least Squares by the Nonlinear MCA EXIN Neuron,” IEEE International Joint Symposia on Intelligence and Systems (SIS98): IAR98, Intelligence in Automation & Robotics, pp. 295–300, Rockville, MD, May 21–23, 1998. Copyright © 1998 IEEE. Reprinted with permission.

3.2 ROBUST MCA EXIN NONLINEAR NEURON (NMCA EXIN)

As shown in Section 1.10, the TLS criterion (which gives a solution parallel to the minor eigenvector) is not optimal if the data errors are impulsive noise or colored noise with unknown covariance or if some strong outliers exist. Instead of using the TLS criterion, Oja and Wang [143–145] suggest an alternative criterion that minimizes the sum of certain loss functions of the orthogonal distance:

$$J_f(\mathbf{w}) = \sum_{i=1}^{N} f\!\left(\frac{\mathbf{w}^T \mathbf{x}(i)}{\sqrt{\mathbf{w}^T \mathbf{w}}}\right) = \sum_{i=1}^{N} f(d_i), \qquad w_0 = 0 \qquad (3.2)$$

This is the TLS version of the robust LS regression. In robust LS the vertical distances are taken instead of the perpendicular signed distances d_i. Solutions for the weights are among the M-estimates, which are the most popular of the robust regression varieties [91].

Definition 81 (Robust MCA) The weight vector that minimizes the criterion J_f(w) is called a robust minor component.

The loss function f must be nonnegative, convex with its minimum at zero, increasing less rapidly than the square function, and even [91]. Convexity prevents multiple minimum points, and the slow increase dampens the effect of strong outliers on the solution. The derivative of the loss function, g(y) = df(y)/dy, is called the influence function [80]. Depending on the choice of the loss function, several different learning laws are obtained. The most important loss functions are [21]:

• Logistic function:
$$f(y) = \frac{1}{\beta}\,\ln\cosh\beta y \qquad (3.3)$$
with corresponding influence function
$$g(y) = \tanh\beta y \qquad (3.4)$$
• Absolute value function:
$$f(y) = |y| \qquad (3.5)$$
with corresponding influence function
$$g(y) = \operatorname{sign} y \qquad (3.6)$$


• Huber’s function:
$$f(y) = \begin{cases} \dfrac{y^2}{2} & \text{for } |y| \leq \beta \\[6pt] \beta|y| - \dfrac{\beta^2}{2} & \text{for } |y| > \beta \end{cases} \qquad (3.7)$$
• Talvar’s function:
$$f(y) = \begin{cases} \dfrac{y^2}{2} & \text{for } |y| \leq \beta \\[6pt] \dfrac{\beta^2}{2} & \text{for } |y| > \beta \end{cases} \qquad (3.8)$$
• Hampel’s function:
$$f(y) = \begin{cases} \dfrac{\beta^2}{\pi}\left(1 - \cos\dfrac{\pi y}{\beta}\right) & \text{for } |y| \leq \beta \\[6pt] \dfrac{2\beta^2}{\pi} & \text{for } |y| > \beta \end{cases} \qquad (3.9)$$

where β is a problem-dependent control parameter.
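The loss functions (3.3)–(3.9) and the influence functions (3.4) and (3.6) translate directly into code. A minimal sketch (NumPy assumed; the function names are illustrative, not from the book's software):

```python
import numpy as np

def logistic_loss(y, beta=1.0):
    return np.log(np.cosh(beta * y)) / beta                # eq. (3.3)

def logistic_influence(y, beta=1.0):
    return np.tanh(beta * y)                                # eq. (3.4)

def absolute_loss(y):
    return np.abs(y)                                        # eq. (3.5)

def absolute_influence(y):
    return np.sign(y)                                       # eq. (3.6)

def huber_loss(y, beta=1.0):
    # quadratic near zero, linear beyond the control parameter beta
    return np.where(np.abs(y) <= beta, y**2 / 2, beta * np.abs(y) - beta**2 / 2)   # eq. (3.7)

def talvar_loss(y, beta=1.0):
    # quadratic near zero, constant beyond beta
    return np.where(np.abs(y) <= beta, y**2 / 2, beta**2 / 2)                      # eq. (3.8)

def hampel_loss(y, beta=1.0):
    # smooth redescending loss, constant beyond beta
    inside = (beta**2 / np.pi) * (1 - np.cos(np.pi * y / beta))
    return np.where(np.abs(y) <= beta, inside, 2 * beta**2 / np.pi)                # eq. (3.9)
```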

Replacing the finite sum in eq. (3.2) by the theoretical expectation of f gives the following cost function:

$$J_{\text{robust EXIN}}(\mathbf{w}) = E\left\{ f\!\left(\frac{\mathbf{w}^T \mathbf{x}}{\sqrt{\mathbf{w}^T \mathbf{w}}}\right) \right\} \qquad (3.10)$$

Then

$$\frac{dJ_{\text{robust EXIN}}(\mathbf{w})}{d\mathbf{w}} = E\left\{ g\!\left(\frac{\mathbf{w}^T \mathbf{x}}{\sqrt{\mathbf{w}^T \mathbf{w}}}\right) \frac{\left(\mathbf{w}^T \mathbf{w}\right)\mathbf{x} - \left(\mathbf{w}^T \mathbf{x}\right)\mathbf{w}}{\left(\mathbf{w}^T \mathbf{w}\right)^{3/2}} \right\}$$

Considering that g(y) is odd and assuming that |wᵀx| < ε, ∀ ε > 0 (i.e., the argument of g is small), the following approximation is valid:

$$g\!\left(\frac{\mathbf{w}^T \mathbf{x}}{\sqrt{\mathbf{w}^T \mathbf{w}}}\right) \simeq \frac{g\!\left(\mathbf{w}^T \mathbf{x}\right)}{\sqrt{\mathbf{w}^T \mathbf{w}}} \qquad (3.11)$$

which, in a stochastic approximation-type gradient algorithm in which the expectations are replaced by the instantaneous values of the variables, gives the robust MCA EXIN learning law (NMCA EXIN):

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \frac{\alpha(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \left[ g\left(y(t)\right)\mathbf{x}(t) - \frac{g\left(y(t)\right)\,y(t)\,\mathbf{w}(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \right] \qquad (3.12)$$


If the absolute value function is chosen, then

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \frac{\alpha(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \left[ \left(\operatorname{sign} y(t)\right)\mathbf{x}(t) - \frac{|y(t)|\,\mathbf{w}(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \right] \qquad (3.13)$$

and the MCA EXIN neuron is nonlinear because its activation function is the hard-limiter [eq. (3.6)]. If the logistic function is chosen, then

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \frac{\alpha(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \left[ \mathbf{x}(t)\tanh\beta y(t) - \frac{y(t)\,\mathbf{w}(t)\tanh\beta y(t)}{\mathbf{w}^T(t)\,\mathbf{w}(t)} \right] \qquad (3.14)$$

and the MCA EXIN neuron is nonlinear because its activation function is the sigmoid [eq. (3.4)]. In the simulations, only the sigmoid is taken into account and its steepness β is always chosen equal to 1.
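For reference, a minimal sketch of one NMCA EXIN iteration, eq. (3.12), with the logistic influence function g(y) = tanh βy of eq. (3.14) (NumPy assumed; the data stream below is only a placeholder, not the benchmark of Section 2.6.2.4, and the function name is illustrative):

```python
import numpy as np

def nmca_exin_step(w, x, alpha, beta=1.0):
    """One NMCA EXIN iteration, eq. (3.12), with g(y) = tanh(beta * y), eq. (3.14)."""
    y = w @ x                       # neuron output y(t) = w^T(t) x(t)
    n2 = w @ w                      # squared weight norm w^T(t) w(t)
    g = np.tanh(beta * y)           # influence function
    return w - (alpha / n2) * (g * x - g * y * w / n2)

# usage: iterate over a stream of (preprocessed) data vectors
rng = np.random.default_rng(0)
w = np.full(3, 0.1)                 # initial conditions as in the first simulation
for t in range(10000):
    x = rng.standard_normal(3)      # placeholder input; replace with real data
    w = nmca_exin_step(w, x, alpha=0.01)
```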

Remark 82 (NMCA EXIN vs. MCA EXIN) The MCA EXIN learning law is a particular case of the NMCA EXIN learning law for f(d_i) = d_i². As a consequence of the shape of the loss function, for small orthogonal distances (inputs not too far from the fitting hyperplane), the behaviors of the nonlinear and linear neuron are practically the same, but for larger orthogonal distances, the nonlinear neuron better rejects the noise. The nonlinear unit inherits exactly all the MCA EXIN properties and, in particular, the absence of the finite time (sudden) divergence phenomenon.

Remark 83 (Robusting the MCA Linear Neurons) The same considerations that give the NMCA EXIN learning law can be applied to OJA, OJAn, and LUO. It can be shown that these robust versions inherit the properties of the corresponding linear units. In particular, they still diverge because this phenomenon is not linked to the activation function chosen, but to the structure of the learning law, which implies orthogonal updates with respect to the current weight state. Furthermore, the presence of outliers anticipates the sudden divergence because the learning law is directly proportional to the squared weight modulus, which is amplified by the outlier; it increases the difficulty in choosing a feasible stop criterion.

For details on this subject, see [28].

3.2.1 Simulations for the NMCA EXIN Neuron

Here a comparison is made between the NMCA EXIN neuron and the nonlinear neuron of [143–145] (NOJA+), which is the nonlinear counterpart of OJA+. Under the same very strict assumptions, NOJA+ still converges. The first simulation uses, as a benchmark, the example in Sections 2.6.2.4 and 2.10, where λmin = 0.1235 < 1. For simplicity, α(t) = const = 0.01. Every 100 steps, an outlier is presented. It is generated with the autocorrelation matrix

$$R' = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix} \qquad (3.15)$$

The norm of the difference between the computed weight and the right MC unit eigenvector is called ε_w and represents the accuracy of the neuron. After 10 trials under the same conditions (initial conditions at [0.1, 0.1, 0.1]^T, α(t) = const = 0.01) and for σ² = 100, NMCA EXIN yields, on the average, after 10,000 iterations, λ = 0.1265 and ε_w = 0.1070 (the best result yields λ = 0.1246 and ε_w = 0.0760). A plot of the weights is given in Figure 3.1. Observe the peaks caused by the outliers and the following damping. Figure 3.2 shows the corresponding behavior of the squared weight norm, which is similar to the MCA EXIN dynamic behavior. In the same simulation for σ² = 100, NOJA+ yields, on the average, after 10,000 iterations, λ = 0.1420 and ε_w = 0.2168. This poorer accuracy is caused by its learning law, which is an approximation of the true Rayleigh quotient gradient. In the presence of strong outliers, this simplification is less valid. In fact, for lower σ², the two neurons have nearly the same accuracy. Figure 3.3 shows the sudden divergence phenomenon for the robust version of LUO. Here the experiment conditions are the same, but the initial conditions are the exact solution: It shows that the solution is not stable even in the robust version.
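A sketch of how the periodic outliers and the accuracy index ε_w of this experiment can be produced (NumPy assumed; the reading of ε_w below normalizes the weight first, and the non-outlier samples are placeholders because the benchmark data of Section 2.6.2.4 are not reproduced here):

```python
import numpy as np

def eps_w(w, R):
    """Norm of the difference between the (normalized) weight and the MC unit eigenvector of R."""
    eigvals, eigvecs = np.linalg.eigh(R)
    z_min = eigvecs[:, 0]                      # unit eigenvector of the smallest eigenvalue
    w_unit = w / np.linalg.norm(w)
    # an eigenvector is defined up to its sign, so take the closer of the two
    return min(np.linalg.norm(w_unit - z_min), np.linalg.norm(w_unit + z_min))

def draw_input(t, rng, sigma2=100.0, n=3):
    """Every 100 steps present an outlier with autocorrelation matrix sigma2 * I, eq. (3.15)."""
    if t > 0 and t % 100 == 0:
        return np.sqrt(sigma2) * rng.standard_normal(n)
    return rng.standard_normal(n)              # placeholder for the benchmark samples
```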

Figure 3.1 NMCA EXIN weights in the presence of outliers.

Page 6: Neural-Based Orthogonal Data Fitting (The EXIN Neural Networks) || Variants of the MCA EXIN Neuron

94 VARIANTS OF THE MCA EXIN NEURON

Figure 3.2 Divergence of the NMCA EXIN squared weight norm.

Figure 3.3 Deviation from the MC direction for NLUO, measured as the squared sine of the angle between the weight vector and the MC direction.

Page 7: Neural-Based Orthogonal Data Fitting (The EXIN Neural Networks) || Variants of the MCA EXIN Neuron

ROBUST MCA EXIN NONLINEAR NEURON (NMCA EXIN) 95

The second simulation deals with the benchmark problem in [195], already solved in Section 2.10. As seen before, at first the points are preprocessed. Then the NMCA EXIN and NOJA+ learnings are compared (at each step the input vectors are picked up from the training set with equal probability). Both neurons have the same initial conditions [0.05, 0.05]^T and the same learning rate used in [195]: that is, start α(t) at 0.01, linearly reduce it to 0.0025 over the first 500 iterations, and then keep it constant. In the end, the estimated solution must be renormalized. Table 3.1 shows the results (averaged in the interval between iteration 4500 and iteration 5000) obtained for σ² = 0.5 and without outliers. They are compared with the singular value decomposition (SVD)–based solution and the LS solution. It can be concluded that the robust neural solutions have the same accuracy as that of the corresponding linear solutions. Second, another training set is created by using the same 500 points, but without additional noise. Then two strong outliers at the points x = 22, y = 18 and x = 11.8, y = 9.8 are added and fed to the neuron every 100 iterations. Table 3.2 shows the results: NOJA+ is slightly more accurate than NMCA EXIN. This example is of no practical interest because the assumption of noiseless points is too strong. The last experiment uses the same training set, but with different values of additional Gaussian noise and the same two strong outliers. The learning rate and the initial conditions are always the same. The results in Table 3.3 are averaged over 10 experiments and in the interval between iteration 4500 and iteration 5000. As the noise in the data increases, the NMCA EXIN neuron gives more and more accurate results with respect to the NOJA+ neuron.

Table 3.1 Line fitting with Gaussian noise

             a1      a2      εw
True values  0.25    0.5
LS           0.233   0.604   0.105
SVD          0.246   0.511   0.011
OJA+         0.246   0.510   0.012
NOJA+        0.248   0.506   0.006
NMCA EXIN    0.249   0.498   ≈ 0

Table 3.2 Line fitting with strong outliers only

             a1      a2      εw
True values  0.25    0.5
LS           0.216   0.583   0.089
SVD          0.232   0.532   0.036
OJA+         0.232   0.532   0.036
NOJA+        0.244   0.492   0.010
NMCA EXIN    0.241   0.490   0.013

Page 8: Neural-Based Orthogonal Data Fitting (The EXIN Neural Networks) || Variants of the MCA EXIN Neuron

96 VARIANTS OF THE MCA EXIN NEURON

Table 3.3 Line fitting with Gaussian noise and strong outliers

σ²     NMCA EXIN εw    NOJA+ εw
0.1    0.0229          0.0253
0.2    0.0328          0.0393
0.3    0.0347          0.0465
0.4    0.0379          0.0500
0.5    0.0590          0.0880

Figure 3.4 Plot of NMCA EXIN weights in a very noisy environment with strong outliers.

As in the first simulation, this fact is explained by the type of approximation of NOJA+. Figure 3.4 shows the plot of the NMCA EXIN weights for σ² = 0.5 for one of the 10 experiments with this level of noise.

3.3 EXTENSIONS OF THE NEURAL MCA

In this section the neural analysis for MCA is extended to minor subspace analysis (MSA), principal components analysis (PCA), and principal subspace analysis (PSA).

Let R be the N × N autocorrelation matrix of the input data, with eigenvalues 0 ≤ λN ≤ λN−1 ≤ · · · ≤ λ1 and corresponding orthonormal eigenvectors zN, zN−1, . . . , z1. The following definitions are given:


• z1 is the first principal component.
• zN is the first minor component.
• z1, z2, . . . , zM (M < N) are M principal components.
• span{z1, z2, . . . , zM} is an M-dimensional principal subspace.
• zN−L+1, zN−L+2, . . . , zN (L < N) are L minor components.
• span{zN−L+1, zN−L+2, . . . , zN} is an L-dimensional minor subspace.
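These definitions can be checked against a direct eigendecomposition. A short sketch (NumPy assumed; the function name is illustrative) that extracts the components and subspace bases of a symmetric autocorrelation matrix R with the chapter's ordering λN ≤ · · · ≤ λ1:

```python
import numpy as np

def pca_mca_split(R, M, L):
    """Return principal/minor components and subspace bases of a symmetric R."""
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns eigenvalues in ascending order
    z = eigvecs[:, ::-1]                        # reorder columns as z_1, z_2, ..., z_N
    first_principal = z[:, 0]                   # z_1
    first_minor = z[:, -1]                      # z_N
    principal_subspace = z[:, :M]               # basis of span{z_1, ..., z_M}
    minor_subspace = z[:, -L:]                  # basis of span{z_(N-L+1), ..., z_N}
    return first_principal, first_minor, principal_subspace, minor_subspace
```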

The general neural architecture is given by a single layer of Q linear neurons with inputs x(t) = [x1(t), x2(t), . . . , xN(t)]^T and outputs y(t) = [y1(t), y2(t), . . . , yQ(t)]^T:
$$y_j(t) = \sum_{i=1}^{N} w_{ij}(t)\, x_i(t) = \mathbf{w}_j^T(t)\, \mathbf{x}(t) \qquad (3.16)$$
for j = 1, 2, . . . , Q or j = N, N − 1, . . . , N − Q + 1, where wj(t) = [w1j(t), . . . , wNj(t)]^T is the connection weight vector. Hence, Q = 1 for PCA and MCA (in this case, j = N only).

3.3.1 Minor Subspace Analysis

The minor subspace (dimension M) usually represents the statistics of the additive noise and hence is referred to as the noise subspace in many fields of signal processing, such as those cited in Section 2.2.1. Note that in this section, MSA represents both the subspace and the minor unit eigenvectors.

The existing MSA networks are:

• Oja’s MCA algorithm. Oja [141] extended the OJA rule (2.16) as follows:
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) \; \underbrace{-\,\alpha(t)\,\mathbf{x}(t)\,y_j(t)}_{A} \; \underbrace{+\,\alpha(t)\left[y_j^2(t) + 1 - \mathbf{w}_j^T(t)\,\mathbf{w}_j(t)\right]\mathbf{w}_j(t)}_{B} \; \underbrace{-\,\vartheta\,\alpha(t)\sum_{i>j} y_i(t)\,y_j(t)\,\mathbf{w}_i(t)}_{C} \qquad (3.17)$$

for j = N, N − 1, . . . , N − M + 1. The corresponding averaging ODE is given by
$$\frac{d\mathbf{w}_j(t)}{dt} = -R\,\mathbf{w}_j(t) + \left[\mathbf{w}_j^T(t)\,R\,\mathbf{w}_j(t)\right]\mathbf{w}_j(t) + \mathbf{w}_j(t) - \mathbf{w}_j^T(t)\,\mathbf{w}_j(t)\,\mathbf{w}_j(t) - \vartheta \sum_{i>j} \mathbf{w}_i^T(t)\,R\,\mathbf{w}_j(t)\,\mathbf{w}_i(t) \qquad (3.18)$$


The term A in eq. (3.17) is the anti-Hebbian term. The term B is the constraining term (forgetting term). The term C performs the Gram–Schmidt orthonormalization. Oja [141] demonstrated that under the assumptions wjᵀ(0) zj ≠ 0, ‖wj(0)‖₂ = 1, λN−M+1 < 1, and ϑ > λN−M+1/λN − 1, the weight vectors in eq. (3.18) converge to the minor components zN−M+1, zN−M+2, . . . , zN. If ϑ is larger than but close to λN−M+1/λN − 1, the convergence is slow.

• LUO MSA algorithms. In [121], two learning laws are proposed. The first is
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) - \alpha(t)\,y_j(t)\,\mathbf{w}_j^T(t)\,\mathbf{w}_j(t)\,\mathbf{x}(t) + \alpha(t)\,y_j^2(t)\,\mathbf{w}_j(t) + \beta\,\alpha(t)\,y_j(t)\,\mathbf{w}_j^T(t)\,\mathbf{w}_j(t) \sum_{i>j} y_i(t)\,\mathbf{w}_i(t) \qquad (3.19)$$

for j = N, N − 1, . . . , N − M + 1. The corresponding averaging ODE is given by
$$\frac{d\mathbf{w}_j(t)}{dt} = -\mathbf{w}_j^T(t)\,\mathbf{w}_j(t) \left[ I - \beta \sum_{i>j} \mathbf{w}_i(t)\,\mathbf{w}_i^T(t) \right] R\,\mathbf{w}_j(t) + \mathbf{w}_j^T(t)\,R\,\mathbf{w}_j(t)\,\mathbf{w}_j(t) \qquad (3.20)$$

where β is a positive constant affecting the convergence speed. Equation (3.20) does not possess the invariant norm property (2.27) for j < N. The second (MSA LUO) is the generalization of the LUO learning law (2.25):
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) - \alpha(t)\,y_j(t) \left[ \mathbf{w}_j^T(t)\,\mathbf{w}_j(t)\,\mathbf{x}_j(t) - \mathbf{w}_j^T(t)\,\mathbf{x}_j(t)\,\mathbf{w}_j(t) \right] \qquad (3.21)$$

for j = N, N − 1, . . . , N − M + 1 and
$$\mathbf{x}_j(t) = \mathbf{x}(t) - \sum_{i>j} y_i(t)\,\mathbf{w}_i(t) \qquad (3.22)$$

for j = N − 1, . . . , N − M + 1. Equation (3.22) represents, at convergence, the projection of x(t) on the subspace orthogonal to the eigenvectors zj+1, zj+2, . . . , zN. In this way, an adaptive Gram–Schmidt orthonormalization is done. Indeed, note that if
$$\mathbf{w}^T(0)\,\mathbf{z}_N = \mathbf{w}^T(0)\,\mathbf{z}_{N-1} = \cdots = \mathbf{w}^T(0)\,\mathbf{z}_i = 0 \qquad (3.23)$$
in the assumptions for Theorem 49, the LUO weight vector converges to zi−1 (this is clear in the proof of Theorem 60). This idea has been proposed


in [141,186,194]. The corresponding averaging ODE is given by
$$\frac{d\mathbf{w}_j(t)}{dt} = -\mathbf{w}_j^T(t)\,\mathbf{w}_j(t) \left[ I - \sum_{i>j} \mathbf{w}_i(t)\,\mathbf{w}_i^T(t) \right] R\,\mathbf{w}_j(t) + \mathbf{w}_j^T(t) \left[ I - \sum_{i>j} \mathbf{w}_i(t)\,\mathbf{w}_i^T(t) \right] R\,\mathbf{w}_j(t)\,\mathbf{w}_j(t) \qquad (3.24)$$

Defining
$$R_j = R - \sum_{i>j} \mathbf{w}_i(t)\,\mathbf{w}_i^T(t)\,R \qquad (3.25a)$$
$$\gamma_j(t) = \mathbf{w}_j^T(t)\,\mathbf{w}_j(t) \qquad (3.25b)$$
$$\phi_j(t) = \mathbf{w}_j^T(t)\,R_j\,\mathbf{w}_j(t) \qquad (3.25c)$$
eq. (3.24) becomes
$$\frac{d\mathbf{w}_j(t)}{dt} = -\gamma_j(t)\,R_j\,\mathbf{w}_j(t) + \phi_j(t)\,\mathbf{w}_j(t) \qquad (3.26)$$

Equation (3.26) is characterized by the invariant norm property (2.27) for j < N (evidently, in a first approximation).

It is instructive to summarize the convergence theorems for MSA LUO. In [122], the following theorem appears.

Theorem 84 (MSA LUO Convergence 1) If the initial values of the weight vector satisfy wjᵀ(0) zj ≠ 0 for j = N, N − 1, . . . , N − M + 1, then
$$\operatorname{span}\{\mathbf{w}_{N-M+1}(f), \mathbf{w}_{N-M+2}(f), \ldots, \mathbf{w}_{N}(f)\} = \operatorname{span}\{\mathbf{z}_{N-M+1}, \mathbf{z}_{N-M+2}, \ldots, \mathbf{z}_{N}\} \qquad (3.27)$$
where wj(f) = lim_{t→∞} wj(t). In other words, MSA LUO finds a basis for the minor subspace, but not the eigenvector basis.

The proof of this theorem exploits the analogy of the MSA LUO learning law with the PSA LUO learning law cited in Section 3.3.3 and uses the corresponding proof of convergence. Considering that only the part of the demonstration concerning the orthogonality of the converged weight vectors to the remaining nonminor eigenvectors can be applied to MSA, the authors cannot also demonstrate convergence to the minor components themselves. Nevertheless, in [121], the following theorem appears.


Theorem 85 (MSA LUO Convergence 2) If the initial values of the weight vector satisfy wjᵀ(0) zj ≠ 0 for j = N, N − 1, . . . , N − M + 1, then the weight vectors provided by eq. (3.21) converge to the corresponding minor components; that is,
$$\lim_{t\to\infty} \mathbf{w}_j(t) = \pm \left\| \mathbf{w}_j(0) \right\|_2 \mathbf{z}_j \qquad (3.28)$$
for j = N, N − 1, . . . , N − M + 1.

This theorem, which looks like an extension of Theorem 84, is given without proof! In fact, in [121] it is justified by analogy with the PSA LUO learning law (i.e., the same demonstration as that of Theorem 84 is suggested). Considering that the analogy cannot be used entirely because the upper bounds used are meaningless for MSA, it can be deduced that Theorem 85 is false. A qualitative analysis of the proof is given in the next section.

3.3.2 MSA EXIN Neural Network

Using the same adaptive Gram–Schmidt orthonormalization as in Section 3.3.1 and the MCA EXIN learning law (2.35), the MSA EXIN learning law is deduced [compare with eq. (3.21)]:

$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) - \frac{\alpha(t)\,y_j(t)}{\mathbf{w}_j^T(t)\,\mathbf{w}_j(t)} \left[ \mathbf{x}(t) - \sum_{i>j} y_i(t)\,\mathbf{w}_i(t) - \frac{\mathbf{w}_j^T(t)\left(\mathbf{x}(t) - \sum_{i>j} y_i(t)\,\mathbf{w}_i(t)\right)\mathbf{w}_j(t)}{\mathbf{w}_j^T(t)\,\mathbf{w}_j(t)} \right] \qquad (3.29)$$
for j = N, N − 1, . . . , N − M + 1. The corresponding averaging ODE is given by

$$\frac{d\mathbf{w}_j(t)}{dt} = -\gamma_j^{-1}(t)\,R_j\,\mathbf{w}_j(t) + \gamma_j^{-2}(t)\,\phi_j(t)\,\mathbf{w}_j(t) = -\gamma_j^{-1}(t)\left[ R_j - r\!\left(\mathbf{w}_j(t), R_j\right) I \right] \mathbf{w}_j(t) \qquad (3.30)$$

where r represents the Rayleigh quotient. Multiplying eq. (3.30) by wjᵀ(t) on the left (and recalling that d‖wj‖²/dt = 2 wjᵀ dwj/dt) yields
$$\frac{d\left\|\mathbf{w}_j(t)\right\|_2^2}{dt} = 2\left[ -\gamma_j^{-1}(t)\,\mathbf{w}_j^T(t)\,R_j\,\mathbf{w}_j(t) + \gamma_j^{-2}(t)\,\phi_j(t)\,\mathbf{w}_j^T(t)\,\mathbf{w}_j(t) \right] = 2\left[ -\gamma_j^{-1}(t)\,\phi_j(t) + \gamma_j^{-2}(t)\,\phi_j(t)\,\gamma_j(t) \right] = 0 \qquad (3.31)$$


for j = N, N − 1, . . . , N − M + 1. Then
$$\left\|\mathbf{w}_j(t)\right\|_2^2 = \left\|\mathbf{w}_j(0)\right\|_2^2, \qquad t > 0 \qquad (3.32)$$
This analysis and the following considerations use the averaging ODE (3.30) and so are limited to this approximation (i.e., as seen earlier, to the first part of the weight vector time evolution).

The following theorem for the convergence of the learning law holds.

Theorem 86 (Convergence to the MS) If the initial values of the weight vector satisfy wjᵀ(0) zj ≠ 0 for j = N, N − 1, . . . , N − M + 1, then the weight vectors provided by eq. (3.29) converge, in the first part of their evolution, to the minor subspace spanned by the corresponding minor components [i.e., eq. (3.27) holds].

Proof. The proof is equal to the proof in [122] because MSA EXIN can be deduced from MSA LUO simply by dividing the learning rate by ‖wj(t)‖₂⁴. In particular, the same analysis as already done for LUO and MCA EXIN can be repeated, and it is easy to notice that if the initial weight vector has modulus of less than 1, then MSA EXIN approaches the minor subspace more quickly than MSA LUO. ∎
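As a reading aid, a minimal sketch of one MSA EXIN sweep over the M neurons, eq. (3.29), with the deflated input of eq. (3.22) (NumPy assumed; this is one plausible implementation of the law, with y_j computed from the raw input as in eq. (3.16), and is not the authors' code):

```python
import numpy as np

def msa_exin_sweep(W, x, alpha):
    """One sweep of MSA EXIN, eq. (3.29).

    W is an (M, N) array whose rows are w_{N-M+1}, ..., w_{N-1}, w_N;
    the last row is the driving neuron (j = N) and is updated first.
    """
    M = W.shape[0]
    W_new = W.copy()
    for row in range(M - 1, -1, -1):            # j = N, N-1, ..., N-M+1
        w = W[row]
        y = w @ x                               # y_j(t) = w_j^T(t) x(t), eq. (3.16)
        n2 = w @ w                              # w_j^T(t) w_j(t)
        # adaptive Gram-Schmidt deflation, eq. (3.22): subtract the i > j terms
        x_defl = np.asarray(x, dtype=float).copy()
        for i in range(row + 1, M):
            x_defl -= (W[i] @ x) * W[i]
        W_new[row] = w - (alpha * y / n2) * (x_defl - (w @ x_defl) * w / n2)
    return W_new
```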

Now, a qualitative analysis of the temporal evolution of MSA EXIN and MSA LUO is given. The quantitative analysis will be the subject of future work. First, consider eq. (3.22), which represents an adaptive Gram–Schmidt orthonormalization and is the input to the jth neuron. It means that the M neurons of the MSA LUO and MSA EXIN neural networks are not independent, except the neuron for j = N (the driving neuron), which converges to the first minor component. It can be argued that the neurons labeled by j < N are driven by the weight vectors of the neurons with labels i > j. As will be clear from the analysis leading to Proposition 57, the cost landscape of the MCA neurons has N critical directions in the domain ℝ^N − {0}. Among these, all the directions associated with the nonminor components are saddle directions, which are minima in any direction of the space spanned by bigger eigenvectors (in the sense of eigenvectors associated with bigger eigenvalues) but are maxima in any direction of the space spanned by smaller eigenvectors. The critical direction associated with the minimum eigenvalue is the only global minimum. With these considerations in mind, let’s review the temporal evolution.

• Transient. The driving neuron weight vector tends to the minor component direction, staying approximately on a hypersphere of radius equal to the initial weight modulus. The other weight vectors stay in a first approximation on the respective hyperspheres [see eq. (3.32)] and tend to subspaces orthogonal to the eigenvectors associated with the smaller eigenvalues (the lower subspaces) because the adaptive orthogonalization eliminates degrees of freedom and the corresponding saddle in the cost landscape becomes a minimum. MSA EXIN is faster than MSA LUO in approaching the minor subspace for initial weights of modulus less than 1.

• MC direction. The weight vectors of the driving neurons for both neural networks diverge, fluctuating around the first MC direction. Weight moduli soon become larger than unity. Hence, the fluctuations, which are orthogonal to the first MC direction, as shown in Section 2.6.2, are larger for LUO than for EXIN, because the latter has the positive smoothing effect of the squared weight norm at the denominator of the learning law. Hence, its weight vector remains closer to the first MC direction. The driven neurons have inputs with nonnull projections in the lower subspaces.

• Loss of step. If the driving neurons have too large (with respect to MC) orthogonal weight increments, the driven neurons lose their orthogonal direction, and their associated minima become saddles again. So the weights tend to the minor component directions associated with the lower eigenvalues until they reach the first minor component direction. This loss of the desired direction is called a loss of step, in analogy with electrical machines. As a result, this loss propagates from neuron N − 1 to neuron N − M + 1, and the corresponding weights reach the first MC direction in the same order. Because of its larger weight increments, MSA LUO always loses a step before MSA EXIN. Furthermore, MSA EXIN has weight vectors staying longer near their correct directions.¹

• Sudden divergence. The driving neuron of MSA EXIN diverges at a finite time, as demonstrated in Section 2.6.2. The other neurons, once they have lost a step, have the same behavior as that of the driving neuron (i.e., they diverge at a finite time).

As a consequence of its temporal behavior, MSA LUO is useless for applications, both because its weights do not stay long in their correct directions and because of the sudden divergence problem.

¹This phenomenon is different from the loss of step, but is not independent.

3.3.2.1 Simulations The first example uses, as a benchmark, Problem 6.7 in [121, p. 228]. Here,

$$R = \begin{bmatrix} 2.7262 & 1.1921 & 1.0135 & -3.7443 \\ 1.1921 & 5.7048 & 4.3103 & -3.0969 \\ 1.0135 & 4.3103 & 5.1518 & -2.6711 \\ -3.7443 & -3.0969 & -2.6711 & 9.4544 \end{bmatrix}$$

has distinct eigenvalues λ1 < λ2 < λ3 < λ4, where λ1 = 1.0484 with corresponding eigenvector z1 = [0.9052, 0.0560, −0.0082, 0.4212]^T and λ2 = 1.1048 with corresponding eigenvector z2 = [−0.0506, 0.6908, −0.7212, 0.0028]^T. MSA EXIN has common learning rate α(t) = const = 0.001. The initial conditions are w1(0) = e1 and w2(0) = e2, where e_i is the coordinate vector with all zeros except the ith element, which is 1. MSA EXIN yields, after 1000 iterations, the following estimations:

$$\tilde{\lambda}_1 = \frac{\phi_1(1000)}{\gamma_1(1000)} = 1.0510 \qquad \mathbf{w}_1(1000) = [0.9152, 0.0278, -0.0054, 0.4215]^T$$
$$\tilde{\lambda}_2 = \frac{\phi_2(1000)}{\gamma_2(1000)} = 1.1077 \qquad \mathbf{w}_2(1000) = [0.0185, 0.7165, -0.7129, 0.0407]^T$$

which are in very good agreement with the true values. Define Pij as ziᵀwj / (‖zi‖₂ ‖wj‖₂). Figure 3.5 is a temporal plot of the eigenvalue estimation. Note the loss of step, which begins after about 3000 iterations. After 6000 iterations the two computed eigenvalues are equal because the two weight vectors, w1 and w2, are parallel. This can also be seen clearly in Figure 3.6, which is a plot of P12. Figures 3.7 and 3.8 show the accurate recovery of the minor subspace. Figure 3.9 is an estimation of the eigenvalues when the experimental setup of Section 2.6.2.4 is used (i.e., the true solutions are taken as initial conditions; second example). The plot shows both that the loss of step happens after the minor component direction is reached and that there is no sudden divergence.

Figure 3.5 First example for MSA EXIN: eigenvalue estimation.
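The two monitored quantities — the eigenvalue estimate λ̃j = φj/γj, with Rj, γj, and φj as in eqs. (3.25a)–(3.25c), and the alignment index Pij — can be computed as in the following sketch (NumPy assumed; W stores the weight vectors as rows, indexed as in the MSA EXIN sketch above):

```python
import numpy as np

def eigenvalue_estimate(W, R, row):
    """lambda_tilde_j = phi_j / gamma_j, with R_j built as in eq. (3.25a)."""
    w = W[row]
    Rj = R.copy()
    for i in range(row + 1, W.shape[0]):        # deflate by the i > j weight vectors
        Rj -= np.outer(W[i], W[i]) @ R
    phi = w @ Rj @ w                            # eq. (3.25c)
    gamma = w @ w                               # eq. (3.25b)
    return phi / gamma

def alignment(z_i, w_j):
    """P_ij = z_i^T w_j / (||z_i||_2 ||w_j||_2)."""
    return (z_i @ w_j) / (np.linalg.norm(z_i) * np.linalg.norm(w_j))
```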

MSA LUO has the same learning rate as MSA EXIN and initial conditions w1(0) = e1 and w2(0) = e2. MSA LUO yields, after 1000 iterations, the following estimations:

$$\tilde{\lambda}_1 = \frac{\phi_1(1000)}{\gamma_1(1000)} = 1.0508 \qquad \mathbf{w}_1(1000) = [0.9154, 0.0219, 0.0160, 0.4362]^T$$
$$\tilde{\lambda}_2 = \frac{\phi_2(1000)}{\gamma_2(1000)} = 1.1094 \qquad \mathbf{w}_2(1000) = [0.0414, 0.7109, -0.7263, 0.0744]^T$$


Figure 3.6 First example for MSA EXIN: P12.

Figure 3.7 First example for MSA EXIN: P31 and P32. (See insert for color representation of the figure.)


Figure 3.8 First example for MSA EXIN: P41 and P42. (See insert for color representation of the figure.)

Figure 3.9 Second example for MSA EXIN: eigenvalue estimation.


Figure 3.10 First example for MSA LUO: eigenvalue estimation.

which are similar to the MSA EXIN results. Figure 3.10 shows that the loss of step begins after only 1000 iterations. After 4000 iterations, w1 and w2 are parallel. This is also shown in Figure 3.11, which is a plot of P12. Hence, the loss of step happens much earlier in MSA LUO than in MSA EXIN. Note in Figure 3.11 that there are no flat portions of the curve before the loss of step (in MSA EXIN there is a big portion with nearly null gradient); this fact does not allow us to devise a good stop criterion. Figures 3.12 and 3.13 show the accurate recovery of the minor subspace. Figures 3.14, 3.15, and 3.16 show the results for λ1 and λ2 together, for λ2 alone, and for P12, respectively, when the true solutions are taken as initial conditions. λ2 goes to infinity at iteration 20,168, λ1 at iteration 21,229.

The third example uses, as a benchmark, Problem 6.8 in [121, p. 229]. Here,

$$R = \begin{bmatrix} 2.5678 & 1.1626 & 0.9302 & -3.5269 \\ 1.1626 & 6.1180 & 4.3627 & -3.4317 \\ 0.9302 & 4.3627 & 4.7218 & -2.7970 \\ -3.5269 & -3.4317 & -2.7970 & 9.0905 \end{bmatrix}$$

has nearly nondistinct eigenvalues, λ1 = 1.0000 and λ2 = 1.0001, with corresponding eigenvectors
z1 = [0.9070, 0.0616, 0.0305, 0.4433]^T
z2 = [−0.2866, 0.6078, −0.7308, −0.1198]^T


Figure 3.11 First example for MSA LUO: P12.

Figure 3.12 First example for MSA LUO: P31 and P32. (See insert for color representation of the figure.)


Figure 3.13 First example for MSA LUO: P41 and P42. (See insert for color representation of the figure.)

Figure 3.14 Second example for MSA LUO: eigenvalue estimation.


Figure 3.15 Second example for MSA LUO: estimation of λ2.

Figure 3.16 Second example for MSA LUO: P12.


Figure 3.17 Second example for MSA EXIN: P12.

MSA EXIN and MSA LUO have a common learning rate α(t) = const = 0.001. The initial conditions are w1(0) = e1 and w2(0) = e2. MSA EXIN yields, after 1000 iterations, the following estimations:

$$\tilde{\lambda}_1 = \frac{\phi_1(1000)}{\gamma_1(1000)} = 1.0028 \qquad \mathbf{w}_1(1000) = [0.9070, 0.0616, 0.0305, 0.4433]^T$$
$$\tilde{\lambda}_2 = \frac{\phi_2(1000)}{\gamma_2(1000)} = 1.0032 \qquad \mathbf{w}_2(1000) = [0.0592, 0.6889, -0.7393, 0.0797]^T$$

MSA LUO yields, after 1000 iterations, the following eigenvalues:

$$\tilde{\lambda}_1 = \frac{\phi_1(1000)}{\gamma_1(1000)} = 1.0011 \qquad \text{and} \qquad \tilde{\lambda}_2 = \frac{\phi_2(1000)}{\gamma_2(1000)} = 1.0088$$

Figures 3.17 to 3.21 show some results for both neural networks. Their performances with respect to the nondesired eigenvectors are similar. More interesting is the behavior in the minor subspace. As is evident in Figure 3.16 for MSA EXIN and Figure 3.19 for MSA LUO, MSA EXIN better estimates the minor components and for a longer time. Hence, it can easily be stopped.

3.3.3 Principal Components and Subspace Analysis

This subject is very important, but outside the scope of this book, except from the point of view of a straight extension of the MCA and MSA EXIN learning laws.


Figure 3.18 Second example for MSA EXIN: P31 and P32. (See insert for color representation of the figure.)

Figure 3.19 Second example for MSA EXIN: P41 and P42. (See insert for color representation of the figure.)


Figure 3.20 Second example for MSA LUO: P12.

Figure 3.21 Second example for MSA LUO: P31 and P32. (See insert for color representation of the figure.)


Table 3.4 PCA learning laws

       PCA OJA    PCA OJAn      PCA LUO      PCA EXIN
a1     1          1             ‖w(t)‖₂²     ‖w(t)‖₂⁻²
a2     1          ‖w(t)‖₂⁻²     1            ‖w(t)‖₂⁻⁴

Indeed, the formulation of the PCA and PSA (here PSA is intended as finding the principal components) EXIN learning laws derives from the corresponding equations for MCA and MSA simply by reversing a sign. As a consequence, all corresponding theorems are also valid here. The same can be said for OJA, OJAn, and LUO, even if, in reality, historically they were originally conceived for PCA. According to the general analysis for PCA in [121, p. 229], all these PCA algorithms can be represented by the general form
$$\mathbf{w}(t+1) = \mathbf{w}(t) + a_1\,\alpha(t)\,y(t)\,\mathbf{x}(t) - a_2\,\alpha(t)\,y^2(t)\,\mathbf{w}(t) \qquad (3.33)$$

and the particular cases are given in Table 3.4. The corresponding statistically averaging differential equation is the following:
$$\frac{d\mathbf{w}(t)}{dt} = a_1\,R\,\mathbf{w}(t) - a_2\,\mathbf{w}^T(t)\,R\,\mathbf{w}(t)\,\mathbf{w}(t) \qquad (3.34)$$

From the point of view of the convergence, at a first approximation, it holds that
$$\lim_{t\to\infty} \mathbf{w}(t) = \pm\sqrt{\frac{a_1}{a_2}}\,\mathbf{z}_1 \qquad (3.35)$$
which is valid on the condition wᵀ(0) z1 ≠ 0.

About the divergence analysis, eq. (2.105) still holds exactly for PCA. Hence, all consequences are still valid; in particular, PCA LUO diverges at the finite time given by eq. (2.111). On the contrary, eq. (2.116) is no longer valid, but for PCA OJA it must be replaced by

$$\frac{dp}{dt} = +2\lambda_{\min}(1 - p)\,p, \qquad p(0) = p_0 \qquad (3.36)$$

whose solution is given by

$$p(t) = \frac{1}{1 + \dfrac{1 - p_0}{p_0}\,e^{-2\lambda_{\min} t}} \qquad (3.37)$$

Hence,
$$\lim_{t\to\infty} p(t) = 1 \qquad (3.38)$$
and neither divergence nor sudden divergence happens.
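Since the four PCA rules differ only through the coefficients a1 and a2 of eq. (3.33), a single update function covers all of Table 3.4. A minimal sketch (NumPy assumed; the dictionary labels are just keys, not library identifiers):

```python
import numpy as np

def pca_coefficients(rule, w):
    """a1, a2 of eq. (3.33) for the four rules of Table 3.4."""
    n2 = w @ w                                  # ||w(t)||_2^2
    table = {
        "OJA":  (1.0,      1.0),
        "OJAn": (1.0,      1.0 / n2),
        "LUO":  (n2,       1.0),
        "EXIN": (1.0 / n2, 1.0 / n2**2),
    }
    return table[rule]

def pca_step(w, x, alpha, rule="EXIN"):
    """One update of the general PCA learning law, eq. (3.33)."""
    a1, a2 = pca_coefficients(rule, w)
    y = w @ x                                   # neuron output
    return w + a1 * alpha * y * x - a2 * alpha * y**2 * w
```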


Figure 3.22 Evolution of the squared weight moduli of the PCA linear neurons with initial conditions equal to the solution. (See insert for color representation of the figure.)

The following simulation deals with the same example of Section 2.6.2.4, but now the initial condition is given by the first principal component:
w(0) = [0.5366, 0.6266, 0.5652]^T
and λmax = 0.7521. The learning rate is the same. Figures 3.22 and 3.23 confirm the previous analysis regarding the PCA divergence. Figure 3.24 shows an estimation of the largest eigenvalue using the PCA neurons with initial conditions chosen randomly, for every component, in [−0.1, 0.1]. PCA EXIN is the fastest, but has sizable oscillations in the transient.

In [140], a neural network is proposed to find a basis for the principal subspace. For the adaptive extraction of the principal components, the stochastic gradient ascent (SGA) algorithm [139,142], the generalized Hebbian algorithm (GHA) [166], the PSA LUO algorithm [the learning law is eq. (3.21) but with the reversed sign of the weight increment] [125], the LEAP algorithm [16,17], and the APEX algorithm [53,111,112] are some of the most important methods. In particular, for PSA LUO, the convergence is guaranteed if the moduli of the weight vectors are greater than 1.


Figure 3.23 Deviation from the PC direction for the PCA neurons, measured as the squared sine of the angle between the weight vector and the PC direction. (See insert for color representation of the figure.)

Figure 3.24 Computation of the maximum eigenvalue using PCA neurons. (See insert for color representation of the figure.)


A PSA EXIN learning law can be conceived by reversing the sign of the weight increment in eq. (3.29):
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \frac{\alpha(t)\,y_j(t)}{\mathbf{w}_j^T(t)\,\mathbf{w}_j(t)} \left[ \mathbf{x}(t) - \sum_{i>j} y_i(t)\,\mathbf{w}_i(t) - \frac{\mathbf{w}_j^T(t)\left(\mathbf{x}(t) - \sum_{i>j} y_i(t)\,\mathbf{w}_i(t)\right)\mathbf{w}_j(t)}{\mathbf{w}_j^T(t)\,\mathbf{w}_j(t)} \right] \qquad (3.39)$$

This network has not been used in this book and will be the subject of future work. Because of its similarity to the PSA LUO learning law, the same convergence theorem applies, but we think that this convergence theorem can be extended to a more general constraint for the weight moduli because it is based only on upper bounds for the averaging ODE and subsequent formulas.