
1 TOTAL LEAST SQUARES PROBLEMS

1.1 INTRODUCTION

The problem of linear parameter estimation gives rise to an overdetermined set of linear equations Ax ≈ b, where A is the data matrix and b is the observation vector. In the (classical) least squares (LS) approach there is the underlying assumption that all errors are confined to the observation vector. This assumption is often unrealistic: The data matrix is not error-free because of sampling errors, human errors, modeling errors, and instrument errors. Methods for estimating the effect of such errors on the LS solution are given in [90] and [177]. The method of total least squares (TLS) is a technique devised to compensate for data errors. It was introduced in [74], where it has been solved by using singular value decomposition (SVD), as pointed out in [76] and more fully in [71]. Geometrical analysis of SVD brought Staar [176] to the same idea. This method of fitting has a long history in the statistical literature, where the method is known as orthogonal regression or errors-in-variables (EIV)¹ regression. Indeed, the univariate line-fitting problem had been considered in the nineteenth century [3]. Some important contributors are Pearson [151], Koopmans [108], Madansky [126], and York [197]. About 40 years ago, the technique was extended to multivariate problems and later to multidimensional problems (they deal with more than one observation vector b; e.g., [70,175]).

¹ EIV models are characterized by the fact that the true values of the observed variables satisfy unknown but exact linear relations.


A complete analysis of the TLS problems can be found in [98], where the algorithm of [74] is generalized to all cases in which it fails to produce a solution (nongeneric TLS). Most of the following theoretical presentation of the TLS problems is based on [98].

1.2 SOME TLS APPLICATIONS

There are many TLS applications in a wide variety of fields:

• Time-domain system identification and parameter estimation. This includes deconvolution techniques: for example, renography [100], transfer function models [57], estimates of the autoregressive parameters of an ARMA model from noisy measurements [179], and structural identification [8].

• Identification of state-space models from noisy input-output measurements. Examples, including the identification of an industrial plant, can be found in [134] and [135].

• Signal processing. A lot of algorithms have been proposed for the harmonic retrieval problem: the Pisarenko harmonic decomposition [152], the linear prediction–based work of Rahman and Yu [158], the ESPRIT (estimation of signal parameters via rotational invariance techniques) algorithm of Roy and Kailath [163], and the Procrustes rotations–based ESPRIT algorithm proposed by Zoltowski and Stavrinides [202]. Zoltowski [201] also applied TLS to the minimum variance distortionless response (MVDR) beamforming problem.

• Biomedical signal processing. This includes signal parameter estimates in the accurate quantification of in vivo magnetic resonance spectroscopy (MRS) and the quantification of chromophore concentration changes in the neonatal brain [94].

• Image processing. This includes the image reconstruction algorithm for computing images of the interior structure of highly scattering media (optical tomography) by using the conjugate gradient method [200], a method for removing noise from digital images corrupted with additive, multiplicative, and mixed noise [87], and the regularized constrained TLS method for restoring an image distorted by a linear space-invariant point-spread function which is not exactly known [130].

• Computer vision. This includes disparity-assisted stereo optical flow estimation [102] and robust and reliable motion analysis [24,137].

• Experimental modal analysis. Estimates of the frequency response functions from measured input forces and response signals are applied to mechanical structures [162].

• Nonlinear models. The true variables are related to each other nonlinearly (see [73]).


• Acoustic radiation problems, geology, inverse scattering, geophysical tomographic imaging, and so on. See, respectively, [79,58,171,104].

• Environmental sciences. This includes the fitting of Horton's infiltration curve using real data from field experiments and linear and nonlinear rainfall-runoff modeling using real data from watersheds in Louisiana and Belgium [159].

• Astronomy. This includes the determination of a preliminary orbit in celestial mechanics [19], its differential correction, and the estimation of various parameters of galactic kinematics [12].

1.3 PRELIMINARIES

The TLS method solves a set of m linear equations in the n × d unknown matrix X, represented in matrix form by

AX ≈ B    (1.1)

where A is the m × n data matrix and B is the m × d observation matrix. If m > n, the system is overdetermined. If d > 1, the problem is multidimensional. If n > 1, the problem is multivariate. X′ is the n × d minimum norm least squares solution and X̂ is the minimum norm total least squares solution of eq. (1.1).

Singular value decomposition (SVD) of the matrix A (m > n) in eq. (1.1) is denoted by

A = U′Σ′V′T    (1.2)

where

U′ = [U′1; U′2], U′1 = [u′1, . . . , u′n], U′2 = [u′n+1, . . . , u′m], u′i ∈ ℜm, U′TU′ = U′U′T = Im

V′ = [v′1, . . . , v′n], v′i ∈ ℜn, V′TV′ = V′V′T = In

Σ′ = diag(σ′1, . . . , σ′n) ∈ ℜm×n, σ′1 ≥ · · · ≥ σ′n ≥ 0

and the SVD of the m × (n + d) matrix [A; B] (m > n) in eq. (1.1) is denoted by

[A; B] = UΣVT    (1.3)

where

U = [U1; U2], U1 = [u1, . . . , un], U2 = [un+1, . . . , um], ui ∈ ℜm, UTU = UUT = Im

V = [V11 V12; V21 V22] = [v1, . . . , vn+d], vi ∈ ℜn+d, VTV = VVT = In+d, where V11 ∈ ℜn×n, V12 ∈ ℜn×d, V21 ∈ ℜd×n, V22 ∈ ℜd×d

Σ = [Σ1 0; 0 Σ2] = diag(σ1, . . . , σn+t) ∈ ℜm×(n+d), t = min{m − n, d}, Σ1 = diag(σ1, . . . , σn) ∈ ℜn×n, Σ2 = diag(σn+1, . . . , σn+t) ∈ ℜ(m−n)×d, σ1 ≥ · · · ≥ σn+t ≥ 0

For convenience of notation, σi = 0 if m < i ≤ n + d. (u′i, σ′i, v′i) and (ui, σi, vi) are, respectively, the singular triplets of A and [A; B].

Most of the book is devoted to the unidimensional (d = 1) case; that is,

Ax ≈ b    (1.4)

where A ∈ ℜm×n and b ∈ ℜm. However, in Section 1.6 we deal with the multidimensional case. The problem is defined as basic if (1) it is unidimensional, (2) it is solvable, and (3) it has a unique solution.

1.4 ORDINARY LEAST SQUARES PROBLEMS

Definition 1 (OLS Problem) Given the overdetermined system (1.4), the least squares (LS) problem searches for

min ‖b − b′‖2 over b′ ∈ ℜm, subject to b′ ∈ R(A)    (1.5)

where R(A) is the column space of A. Once a minimizing b′ is found, then any x′ satisfying

Ax′ = b′    (1.6)

is called an LS solution (the corresponding LS correction is Δb′ = b − b′).

Remark 2 Equations (1.5) and (1.6) are satisfied if b′ is the orthogonal projection of b into R(A).

Theorem 3 (Closed-Form OLS Solution) If rank(A) = n, eqs. (1.5) and (1.6) are satisfied for the unique LS solution given by

x′ = (AT A)−1AT b = A+b    (1.7)

(for an underdetermined system, this is also the minimal L2 norm solution).


Matrix A+ is called the Moore–Penrose pseudoinverse. The underlying assumption is that errors occur only in the observation vector and that the data matrix is known exactly.

The OLS solution can be computed as the minimizer of the following error function:

EOLS = ½ (Ax − b)T(Ax − b)    (1.8)
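As a concrete illustration of eqs. (1.7) and (1.8), the following minimal NumPy sketch (the data A, b, the noise level, and all names are hypothetical choices for the example, not taken from the book) computes the OLS solution both through the numerically stable lstsq routine and through the normal equations that mirror the closed form (1.7), assuming A has full column rank.

    import numpy as np

    # Hypothetical overdetermined system Ax ~ b (m = 20, n = 3), errors confined to b.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 3))
    x_true = np.array([1.0, -2.0, 0.5])
    b = A @ x_true + 0.01 * rng.standard_normal(20)

    # Numerically stable OLS solution x' = A^+ b.
    x_ols, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Closed form (1.7): x' = (A^T A)^{-1} A^T b, valid when rank(A) = n.
    x_ols_normal = np.linalg.solve(A.T @ A, A.T @ b)

Both computations minimize EOLS; the first is preferable numerically when A is ill conditioned.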

1.5 BASIC TLS PROBLEM

Definition 4 (Basic TLS Problem) Given the overdetermined set (1.4), the total least squares (TLS) problem searches for

min ‖[A; b] − [Â; b̂]‖F over [Â; b̂] ∈ ℜm×(n+1), subject to b̂ ∈ R(Â)    (1.9)

where ‖·‖F is the Frobenius norm. Once a minimizing [Â; b̂] is found, then any x̂ satisfying

Âx̂ = b̂    (1.10)

is called a TLS solution (the corresponding TLS correction is [ΔA; Δb] = [A; b] − [Â; b̂]).

Theorem 5 (Solution of the Basic TLS Problem) Given eq. (1.2) [respectively, (1.3)] as the SVD of A (respectively, [A; b]), if σ′n > σn+1, then

[Â; b̂] = UΣ̂VT where Σ̂ = diag(σ1, . . . , σn, 0)    (1.11)

with corresponding TLS correction matrix

[ΔA; Δb] = σn+1 un+1 vTn+1    (1.12)

solves the TLS problem (1.9) and

x̂ = −(1/vn+1,n+1) [v1,n+1, . . . , vn,n+1]T    (1.13)

exists and is the unique solution to Âx̂ = b̂.

Proof. (An outline; for the complete proof, see [98].) Recast eq. (1.4) as

[A; b][xT; −1]T ≈ 0    (1.14)

If σn+1 ≠ 0, rank[A; b] = n + 1; then no nonzero vector exists in the orthogonal complement of Rr([A; b]), where Rr(T) is the row space of matrix T. To reduce the rank to n, using the Eckart–Young–Mirsky matrix approximation theorem (see [54,131]), the best rank n TLS approximation [Â; b̂] of [A; b], which minimizes the deviations in variance, is given by (1.11). The minimal TLS correction is then

σn+1 = min ‖[A; b] − [Â; b̂]‖F over all [Â; b̂] with rank([Â; b̂]) = n    (1.15)

and is attained for the rank 1 TLS correction (1.12). Then the approximate set [Â; b̂][xT; −1]T ≈ 0 is compatible and its solution is given by the only vector vn+1 that belongs to the kernel of [Â; b̂]. The TLS solution is then obtained by scaling vn+1 until its last component is −1. □
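The constructive part of the proof translates directly into a short numerical sketch. The NumPy function below (the name tls_basic is ours, not the book's) computes the basic TLS solution of Ax ≈ b by taking the right singular vector of [A; b] associated with its smallest singular value and scaling it as in (1.13); it assumes the generic case σ′n > σn+1, that is, vn+1,n+1 ≠ 0.

    import numpy as np

    def tls_basic(A, b):
        """Basic TLS solution of Ax ~ b via the SVD of the augmented matrix [A; b].
        Assumes the generic case (last component of the smallest right singular
        vector is nonzero)."""
        C = np.column_stack([A, b])       # augmented m x (n + 1) matrix [A; b]
        _, _, Vt = np.linalg.svd(C)       # rows of Vt are the right singular vectors
        v = Vt[-1]                        # v_{n+1}: direction of the smallest singular value
        if np.isclose(v[-1], 0.0):
            raise ValueError("v_{n+1,n+1} is (close to) zero: nongeneric TLS problem")
        return -v[:-1] / v[-1]            # scale so that the last component equals -1, eq. (1.13)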

Proposition 6 The interlacing theorem for singular values (see [182]) implies that

σ1 ≥ σ′1 ≥ · · · ≥ σn ≥ σ′n ≥ σn+1    (1.16)

Proposition 7 As proved in [98, Cor. 3.4],

σ′n > σn+1 ⇐⇒ σn > σn+1 and vn+1,n+1 ≠ 0

Remark 8 If σn+1 = 0, rank[A; b] = n =⇒ (1.14) is compatible and no approximation is needed to obtain the exact solution (1.13).

The TLS solution is obtained by finding the closest subspace R([Â; b̂]) to the n + 1 columns of [A; b] such that the sum of the squared perpendicular distances from each column of [A; b] to R([Â; b̂]) is minimized, and each column is approximated by its orthogonal projection onto that subspace.

Theorem 9 (Closed-Form Basic TLS Solution) Given (1.2) [respectively, (1.3)] as the SVD of A (respectively, [A; b]), if σ′n > σn+1, then

x̂ = (AT A − σ²n+1 I)−1 AT b    (1.17)

Proof. The condition σ′n > σn+1 assures the existence and the uniqueness of the solution (see Proposition 7). Since the singular vectors vi are eigenvectors of [A; b]T[A; b], x̂ satisfies the following set:

[A; b]T[A; b] [x̂T; −1]T = [AT A  AT b; bT A  bT b] [x̂T; −1]T = σ²n+1 [x̂T; −1]T    (1.18)

Formula (1.17) derives from the top part of (1.18). □


Remark 10 (See [74].) Ridge regression is a way of regularizing the solution of an ill-conditioned LS problem [114, pp. 190ff.]; for example, the minimization of ‖b − Ax‖²2 + μ‖x‖²2, where μ is a positive scalar, is solved by xLS(μ) = (AT A + μI)−1 AT b, and ‖xLS(μ)‖2 becomes small as μ becomes large. But xTLS = xLS(−σ²n+1), which implies that the TLS solution is a deregularizing procedure, a reverse ridge regression. It follows that the condition of the TLS problem is always worse than that of the corresponding LS problem.
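A hedged sketch of the closed form (1.17), which is exactly the "deregularized ridge" solution xLS(−σ²n+1) described in Remark 10; the function name is ours, and the formula assumes σ′n > σn+1 so that AT A − σ²n+1 I is nonsingular.

    import numpy as np

    def tls_closed_form(A, b):
        """Closed-form basic TLS solution (1.17): ridge regression with the
        negative regularization parameter mu = -sigma_{n+1}^2 (Remark 10)."""
        C = np.column_stack([A, b])
        sigma = np.linalg.svd(C, compute_uv=False)   # singular values of [A; b]
        n = A.shape[1]
        return np.linalg.solve(A.T @ A - sigma[-1] ** 2 * np.eye(n), A.T @ b)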

Remark 11 Transform (1.18) as

[Σ′TΣ′  g; gT  ‖b‖²2] [zT; −1]T = σ²n+1 [zT; −1]T with g = Σ′T U′T b, z = V′T x̂    (1.19)

Then (Σ′TΣ′ − σ²n+1 I)z = g and σ²n+1 + gTz = ‖b‖²2. Substituting z in the latter expression by the former yields

σ²n+1 + gT(Σ′TΣ′ − σ²n+1 I)−1 g = ‖b‖²2    (1.20)

This is a version of the TLS secular equation [74,98].

Remark 12 If vn+1,n+1 ≠ 0, the TLS problem is solvable and is then called generic.

Remark 13 If σp > σp+1 = · · · = σn+1, any vector in the subspace spanned by the right singular vectors associated with the smallest singular value is a solution of the TLS problem (1.9); the same happens in the case m < n (underdetermined system), since then the conditions σm+1 = · · · = σn+1 = 0 hold.

Remark 14 The TLS correction ‖[A; b] − [Â; b̂]‖F is always smaller in norm than the LS correction ‖b − b′‖2.

1.5.1 OLS and TLS Geometric Considerations (Row Space)

The differences between TLS and OLS are considered from a geometric point of view in the row space Rr([A; b]) (see Figure 1.1 for the case n = 2).

If no errors are present in the data (EIV model), the set Ax ≈ b is compatible, rank[A; b] = n, and the space Rr([A; b]) is n-dimensional (a hyperplane). If errors occur, the set is no longer compatible and the rows r1, r2, . . . , rm are scattered around the hyperplane. The normal of the hyperplane, corresponding to the minor component of [A; b], gives the corresponding solution as its intersection with the hyperplane xn+1 = −1.

(Figure 1.1: plot of the rows r1, r2, r3 in the (x1, x2, x3) row space, with the subspaces Rr([A; b′]) and Rr([Â; b̂]) and the hyperplane xn+1 = −1.)

Figure 1.1 Geometry of the LS solution x′ (a) and of the TLS solution x̂ (b) for n = 2. Part (b) shows the TLS hyperplane.

Definition 15 The TLS hyperplane is the hyperplane xn+1 = −1.

The LS approach [for n = 2, see Figure 1.1(a)] looks for the best approximation b′ to b satisfying (1.5) such that the space Rr([A; b′]) generated by the LS approximation is a hyperplane. Only the last components of r1, r2, . . . , rm can vary. This approach assumes random errors along one coordinate axis only. The TLS approach [for n = 2, see Figure 1.1(b)] looks for a hyperplane Rr([Â; b̂]) such that (1.9) will be satisfied. The data changes are not restricted to being along the coordinate axis xn+1. All correction vectors Δri given by the rows of [ΔA; Δb] are parallel with the solution vector [x̂T; −1]T [this follows from (1.12) and (1.13)]. The TLS solution is parallel to the right singular vector corresponding to the minimum singular value of [A; b], which can be expressed as a Rayleigh quotient (see Section 2.1):

σ²n+1 = ‖[A; b][x̂T; −1]T‖²2 / ‖[x̂T; −1]T‖²2 = (∑i=1..m |aTi x̂ − bi|²) / (1 + x̂T x̂) = ∑i=1..m ‖Δri‖²2    (1.21)

with aTi the ith row of A and ‖Δri‖²2 the square of the distance from [aTi; bi]T ∈ ℜn+1 to the nearest point in the subspace

Rr([Â; b̂]) = { [aT; b] | a ∈ ℜn, b ∈ ℜ, b = x̂T a }

Thus, the TLS solution x̂ minimizes the sum of the squares of the orthogonal distances (weighted squared residuals):

ETLS(x) = (∑i=1..m |aTi x − bi|²) / (1 + xT x)    (1.22)

which is the Rayleigh quotient (see Section 2.1) of [A; b]T[A; b] constrained to xn+1 = −1. It can also be rewritten as

ETLS(x) = (Ax − b)T(Ax − b) / (1 + xT x)    (1.23)

This formulation is very important from the neural point of view because it can be considered as the energy function to be minimized for the training of a neural network whose final weights represent the TLS solution. This is the basic idea of the TLS EXIN neuron.
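To make this neural reading concrete, the sketch below performs plain batch gradient descent on ETLS(x) as written in (1.23). It is only an illustration that minimizing this energy yields the TLS solution; it is not the TLS EXIN learning law itself (which is sequential and is derived in Chapter 3), and the fixed step size and iteration count are arbitrary choices that may need tuning for a given data set.

    import numpy as np

    def tls_energy_descent(A, b, lr=0.01, n_iter=5000):
        """Batch gradient descent on E_TLS(x) = ||Ax - b||^2 / (1 + x^T x), eq. (1.23).
        The gradient is 2 (A^T r - E_TLS(x) x) / (1 + x^T x) with r = Ax - b."""
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            r = A @ x - b
            denom = 1.0 + x @ x
            energy = (r @ r) / denom          # current value of the Rayleigh-quotient energy
            grad = 2.0 * (A.T @ r - energy * x) / denom
            x -= lr * grad
        return x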

1.6 MULTIDIMENSIONAL TLS PROBLEM

1.6.1 Unique Solution

Definition 16 (Multidimensional TLS Problem) Given the overdetermined set (1.1), the total least squares (TLS) problem searches for

min ‖[A; B] − [Â; B̂]‖F over [Â; B̂] ∈ ℜm×(n+d), subject to R(B̂) ⊆ R(Â)    (1.24)

Once a minimizing [Â; B̂] is found, then any X̂ satisfying

ÂX̂ = B̂    (1.25)

is called a TLS solution (the corresponding TLS correction is [ΔA; ΔB] = [A; B] − [Â; B̂]).

Theorem 17 (Solution of the Multidimensional TLS Problem) Given (1.3) as the SVD of [A; B], if σn > σn+1, then

[Â; B̂] = U diag(σ1, . . . , σn, 0, . . . , 0) VT = U1Σ1[VT11; VT21]    (1.26)

with corresponding TLS correction matrix

[ΔA; ΔB] = U2Σ2[VT12; VT22]    (1.27)

solves the TLS problem (1.24) and

X̂ = −V12V22−1    (1.28)

exists and is the unique solution to ÂX̂ = B̂.

Proof. See [98, p. 52]. □

Theorem 18 (Closed-Form Multidimensional TLS Solution) Given (1.2) [respectively, (1.3)] as the SVD of A (respectively, [A; B]), if σ′n > σn+1 = · · · = σn+d, then

X̂ = (AT A − σ²n+1 I)−1 AT B    (1.29)

Proof. See [98, Th. 3.10]. □

Proposition 19 (Existence and Uniqueness Condition) See [98, p. 53]:

σ′n > σn+1 =⇒ σn > σn+1 and V22 is nonsingular.

Remark 20 The multidimensional problem AX ≈ B, with B ∈ ℜm×d, can also be solved by computing the TLS solution of each subproblem Axi ≈ bi, i = 1, . . . , d, separately. The multidimensional TLS solution is better at least when all data are equally perturbed and all subproblems Axi ≈ bi have the same degree of incompatibility.
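A minimal sketch of the solution (1.28) (the function name is ours); it assumes the generic case σn > σn+1, so that V22 is nonsingular.

    import numpy as np

    def tls_multidimensional(A, B):
        """Multidimensional TLS solution X = -V12 V22^{-1} from the SVD of [A; B]."""
        n, d = A.shape[1], B.shape[1]
        C = np.column_stack([A, B])           # m x (n + d) augmented matrix [A; B]
        _, _, Vt = np.linalg.svd(C)
        V = Vt.T
        V12 = V[:n, n:]                       # n x d block
        V22 = V[n:, n:]                       # d x d block, nonsingular in the generic case
        return -V12 @ np.linalg.inv(V22)      # eq. (1.28)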

1.6.2 Nonunique Solution

Theorem 21 (Closed-Form Minimum Norm TLS Solution) Given the TLS problem (1.24) and (1.25), assuming that σp > σp+1 = · · · = σn+d with p ≤ n, let (1.3) be the SVD of [A; B], partitioning V as

V = [V11 V12; V21 V22] with V11 ∈ ℜn×p, V12 ∈ ℜn×(n−p+d), V21 ∈ ℜd×p, V22 ∈ ℜd×(n−p+d)

If V22 is of full rank, the multidimensional minimum norm (minimum 2-norm and minimum Frobenius norm) TLS solution X̂ is given by

X̂ = −V12V22+ = (AT A − σ²n+1 I)+ AT B    (1.30)

Proof. See [98, p. 62]. □

1.7 NONGENERIC UNIDIMENSIONAL TLS PROBLEM

The TLS problem (1.9) fails to have a solution if σp > σp+1 = · · · = σn+1, p ≤ n, and all vn+1,i = 0, i = p + 1, . . . , n + 1 (nongeneric TLS problem). In this case the existence conditions of the preceding theorems are not satisfied. These problems occur whenever A is rank-deficient (σ′p ≈ 0) or when the set of equations is highly conflicting (σ′p ≈ σp+1 large). The latter situation is detected by inspecting the size of the smallest σi for which vn+1,i ≠ 0. If this singular value is large, the user can simply reject the problem as irrelevant from a linear modeling point of view. Alternatively, the problem can be made generic by adding more equations. This is the case if the model is EIV and the observation errors are statistically independent and equally sized (same variance). For σ′p ≈ 0, it is possible to remove the dependency among the columns of A by removing appropriate columns of A such that the remaining submatrix has full rank and then apply TLS to the reduced problem (subset selection, [88,89,95]). Although exact nongeneric TLS problems seldom occur, close-to-nongeneric TLS problems are not uncommon. The generic TLS solution can still be computed, but it is unstable and becomes even very sensitive to data errors when σ′p − σp+1 is very close to zero [74].

Theorem 22 (Properties of Nongeneric Unidimensional TLS) (See [98].) Let (1.2) [respectively, (1.3)] be the SVD of A (respectively, [A; b]); let b′ be the orthogonal projection of b onto R(A) and [Â; b̂] the rank n approximation of [A; b], as given by (1.11). If V′(σj) [respectively, U′(σj)] is the right (respectively, left) singular subspace of A associated with σj, then the following relations can be proven:

vn+1,j = 0 ⇐⇒ vj = [v′T; 0]T with v′ ∈ V′(σj)    (1.31)

vn+1,j = 0 =⇒ σj = σ′k with k = j − 1 or k = j and 1 ≤ k ≤ n    (1.32)

vn+1,j = 0 =⇒ b⊥u′ with u′ ∈ U′(σj)    (1.33)

vn+1,j = 0 ⇐⇒ uj = u′ with u′ ∈ U′(σj)    (1.34)

vn+1,j = 0 =⇒ b′⊥u′ with u′ ∈ U′(σj)    (1.35)

vn+1,j = 0 =⇒ b̂⊥u′ with u′ ∈ U′(σj)    (1.36)

If σj is an isolated singular value, the converse of relations (1.33) and (1.35) also holds.

Proof. See [92, Th. 1–3]. □

Corollary 23 (See [98].) If σn > σn+1 and vn+1,n+1 = 0, then

σn+1 = σ′n, un+1 = ±u′n, vn+1 = [±v′nT; 0]T, b, b′, b̂ ⊥ u′n    (1.37)

The generic TLS approximation and corresponding TLS correction matrix minimize ‖[ΔA; Δb]‖F but do not satisfy the constraint b̂ ∈ R(Â) and therefore do not solve the TLS problem (1.9). Moreover, (1.37) yields

[Â; b̂]vn+1 = 0 =⇒ Âv′n = 0    (1.38)

Then the solution v′n describes an approximate linear relation among the columns of A instead of estimating the desired linear relation between A and b. The singular vector vn+1 is called a nonpredictive multicollinearity in linear regression since it reveals multicollinearities in A that are of no (or negligible) value in predicting the response b. Also, since b⊥u′ = un+1, there is no correlation between A and b in the direction of u′ = un+1. The strategy of nongeneric TLS is to eliminate those directions in A that are not at all correlated with the observation vector b; the additional constraint [xT; −1]T ⊥ vn+1 is then introduced. Latent root regression [190] uses the same constraint in order to stabilize the LS solution in the presence of multicollinearities. Generalizing:

Definition 24 (Nongeneric Unidimensional TLS Problem) Given the set (1.4), let (1.3) be the SVD of [A; b]; the nongeneric TLS problem searches for

min ‖[A; b] − [Â; b̂]‖F over [Â; b̂] ∈ ℜm×(n+1), subject to b̂ ∈ R(Â) and [x̂T; −1]T ⊥ vj, j = p + 1, . . . , n + 1 (provided that vn+1,p ≠ 0)    (1.39)

Once a minimizing [Â; b̂] is found, then any x̂ satisfying

Âx̂ = b̂    (1.40)

is called a nongeneric TLS solution (the corresponding nongeneric TLS correction is [ΔA; Δb] = [A; b] − [Â; b̂]).


Theorem 25 (Nongeneric Unidimensional TLS Solution) Let (1.3) be the SVD of [A; b], and assume that vn+1,j = 0 for j = p + 1, . . . , n + 1, p ≤ n. If σp−1 > σp and vn+1,p ≠ 0, then

[Â; b̂] = UΣ̂VT where Σ̂ = diag(σ1, . . . , σp−1, 0, σp+1, . . . , σn+1)    (1.41)

with corresponding nongeneric TLS correction matrix

[ΔA; Δb] = σp up vTp    (1.42)

solves the nongeneric TLS problem (1.39) and

x̂ = −(1/vn+1,p) [v1,p, . . . , vn,p]T    (1.43)

exists and is the unique solution to Âx̂ = b̂.

Proof. See [98, p. 72]. □

Corollary 26 If vn+1,n+1 = 0, vn+1,n ≠ 0, and σn−1 > σn, then (1.43) becomes

[x̂T; −1]T = −vn / vn+1,n    (1.44)

Theorem 27 (Closed-Form Nongeneric TLS Solution) Let (1.2) [respectively, (1.3)] be the SVD of A (respectively, [A; b]), and assume that vn+1,j = 0 for j = p + 1, . . . , n + 1, p ≤ n. If σp−1 > σp and vn+1,p ≠ 0, the nongeneric TLS solution is

x̂ = (AT A − σ²p In)−1 AT b    (1.45)

Proof. See [98, p. 74]. □
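A hedged sketch of the nongeneric solution (1.43): scan the right singular vectors of [A; b] from the smallest singular value upward and use the first one whose last component does not vanish. The scan and the tolerance are our simplifications; the gap condition σp−1 > σp of the theorem is assumed to hold.

    import numpy as np

    def tls_nongeneric(A, b, tol=1e-12):
        """Nongeneric unidimensional TLS sketch: use v_p, the right singular vector
        of [A; b] with the smallest singular value among those whose last component
        is nonzero."""
        C = np.column_stack([A, b])
        _, _, Vt = np.linalg.svd(C)
        for v in Vt[::-1]:                    # v_{n+1}, v_n, ... in order of increasing singular value
            if abs(v[-1]) > tol:
                return -v[:-1] / v[-1]        # eq. (1.43), scaled so the last component is -1
        raise ValueError("no right singular vector with nonzero last component")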

Remark 28 The nongeneric TLS algorithm must identify the close-to-nongeneric or nongeneric situation before applying the corresponding formula. In the following it will be shown that the TLS EXIN neuron solves both generic and nongeneric TLS problems without changing its learning law (i.e., automatically).


1.8 MIXED OLS–TLS PROBLEM

If n1 columns of the m × n data matrix A are known exactly, the problem is called mixed OLS–TLS [98]. It is natural to require that the TLS solution not perturb the exact columns. After some column permutations in A such that A = [A1; A2], where A1 ∈ ℜm×n1 is made of the n1 exact columns and A2 ∈ ℜm×n2, perform n1 Householder transformations Q on the matrix [A; B] (QR factorization) so that

[A1; A2; B] = Q [R11 R12 R1b; 0 R22 R2b]    (1.46)

where R11 is an n1 × n1 upper triangular matrix, the two block rows have n1 and m − n1 rows, and the column blocks have n1, n − n1, and d columns, respectively. Then compute the TLS solution X2 of R22X ≈ R2b. X2 yields the last n − n1 components of each solution vector xi. To find the first n1 rows X1 of the solution matrix X = [XT1; XT2]T, solve (OLS) R11X1 = R1b − R12X2. Thus, the entire method amounts to a preprocessing step, a TLS problem, an OLS problem, and a postprocessing step (inverse row permutations) [72].
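The whole procedure fits into a short sketch (all names are ours; a full reduced QR factorization is used instead of only n1 Householder transformations, which leaves the singular values of the TLS subproblem unchanged). It reuses the tls_multidimensional sketch from Section 1.6 for the TLS step.

    import numpy as np

    def mixed_ols_tls(A1, A2, B):
        """Mixed OLS-TLS sketch: A1 holds the exactly known columns, A2 the noisy ones.
        QR-factor [A1, A2, B], solve the TLS subproblem R22 X2 ~ R2b, then back-substitute
        (OLS) for X1."""
        n1, n2 = A1.shape[1], A2.shape[1]
        Q, R = np.linalg.qr(np.column_stack([A1, A2, B]))
        R11, R12, R1b = R[:n1, :n1], R[:n1, n1:n1 + n2], R[:n1, n1 + n2:]
        R22, R2b = R[n1:, n1:n1 + n2], R[n1:, n1 + n2:]
        X2 = tls_multidimensional(R22, R2b)        # TLS step on the noisy block
        X1 = np.linalg.solve(R11, R1b - R12 @ X2)  # OLS step on the exact block
        return np.vstack([X1, X2])                 # undo any column permutation of A afterward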

Theorem 29 (Closed-Form Mixed OLS–TLS Solution) Let rank A1 = n1; denote by σ′ (respectively, σ) the smallest [respectively, the (n2 + 1)th] singular value of R22 (respectively, [R22; R2b]); assume that the smallest singular values σ = σn2+1 = · · · = σn2+d coincide. If σ′ > σ, then the mixed OLS–TLS solution is

X = (AT A − σ² [0 0; 0 In2])−1 AT B    (1.47)

Proof. It is a special case of [97, Th. 4]. □

1.9 ALGEBRAIC COMPARISONS BETWEEN TLS AND OLS

Comparing (1.29) with the LS solution

X′ = (AT A)−1 AT B    (1.48)

shows that σn+1 completely determines the difference between the solutions. Assuming that A is of full rank, σn+1 = 0 means that both solutions coincide (AX ≈ B compatible or underdetermined). As σn+1 deviates from zero, the set AX ≈ B becomes more and more incompatible and the differences between the TLS and OLS solutions become larger and larger.


If σ′n > σn+1 = · · · = σn+d, then

‖X̂‖F ≥ ‖X′‖F    (1.49)

σn+1 also influences the difference in condition between the TLS and OLS problems and thus the difference in numerical accuracy of their respective solutions in the presence of worst-case perturbations. σ′²n − σ²n+1 is a measure of how close AX ≈ B is to the class of nongeneric TLS problems. Assuming that A0X ≈ B0 is the corresponding unperturbed set, rank A0 = n, and the perturbations in A and B have approximately the same size, the TLS solution is more accurate than the OLS solution, provided that the ratio (σn − σ0n+1)/σ′n > 1, where σ0n+1 is the (n + 1)th singular value of [A0; B0]. The advantage of TLS is more remarkable with increasing ratio: for example, when σ′n ≈ 0, when ‖B0‖F is large, or when X0 becomes close to the singular vector v′n of A0 associated with its smallest singular value.

1.9.1 About the Residuals

Proposition 30 Define the LS residual R′ as B − AX′ and the TLS residual R̂ as B − AX̂; then

R̂ − R′ = σ²n+1 A(AT A − σ²n+1 I)−1 X′    (1.50)

‖R̂‖F ≥ ‖R′‖F    (1.51)

The TLS and LS residuals approach each other if:

1. σn+1 is small (slightly incompatible set).
2. ‖B‖F is small (i.e., the TLS solution becomes close to the LS solution).
3. σ′n ≫ σn+1 [i.e., A may not be (nearly) rank deficient].
4. B is close to the largest singular vectors of A.

Proposition 31 If σn+1 = 0, then R̂ = R′ = 0.

1.10 STATISTICAL PROPERTIES AND VALIDITY

Under the assumption that all errors in the augmented matrix [A; B] are row-wise independently and identically distributed (i.i.d.) with zero mean and common covariance matrix of the form σ²ν Σ0, with Σ0 known and positive definite (e.g., the identity matrix), the TLS method offers the best estimate and is more accurate than the LS solution in estimating the parameters of a model. The most suitable model for the TLS concept is the errors-in-variables (EIV) model, which assumes an unknown but exact linear relation (zero residual problems) among the true variables that can only be observed with errors.


Definition 32 (Multivariate Linear EIV Model)

B0 = 1mαT + A0X0,   B0 ∈ ℜm×d, A0 ∈ ℜm×n, α ∈ ℜd
A = A0 + ΔA
B = B0 + ΔB    (1.52)

where 1m = [1, . . . , 1]T. X0 is the n × d matrix of the true but unknown parameters to be estimated. The intercept vector α is either zero (no-intercept model) or unknown (intercept model) and must be estimated.

Proposition 33 (Strong Consistency) If, in the EIV model, it is assumed that the rows of [ΔA; ΔB] are i.i.d. with common zero-mean vector and common covariance matrix of the form Σ = σ²ν In+d, where σ²ν > 0 is unknown, then the TLS method is able to compute strongly consistent estimates of the unknown parameters X0, A0, α, and σν.

EIV models are useful when:

1. The primary goal is to estimate the true parameters of the model generating the data rather than prediction, and there is no a priori certainty that the observations are error-free.

2. The goal is the application of TLS to eigenvalue–eigenvector analysis or the SVD (TLS gives the hyperplane that passes through the intercept and is parallel to the plane spanned by the first right singular vectors of the data matrix [174]).

3. It is important to treat the variables symmetrically (i.e., there are no independent and dependent variables).

The ordinary LS solution X′ of (1.52) is generally an inconsistent estimate of the true parameters X0 (i.e., LS is asymptotically biased). Large errors (large Σ, σν), an ill-conditioned A0, as well as, in the unidimensional case, a solution oriented close to the lowest right singular vector v′n of A0, increase the bias and make the LS estimate more and more inaccurate. If Σ is known, the asymptotic bias can be removed and a consistent estimator, called corrected least squares (CLS), can be derived [60,106,168]. CLS and TLS asymptotically yield the same consistent estimator of the true parameters [70,98]. Under the given assumption about the errors of the model, the TLS estimators X̂, α̂, Â, and [d/(n + d)]σ̂² [with σ̂² = (1/mt) ∑i=1..t σ²n+i and t = min{m − n, d}] are the (unique with probability 1) maximum likelihood estimators of X0, α, A0, and σ²ν [70].

Remark 34 (Scaling) The assumption about the errors seems somewhat restrictive: It requires that all measurements in A and B be affected by errors and, moreover, that these errors must be uncorrelated and equally sized. If these conditions are not satisfied, the classical TLS solution is no longer a consistent estimate of the model parameters. Provided that the error covariance matrix Σ is known up to a factor of proportionality, the data [A; B] can be transformed to the new data [A∗; B∗] = [A; B]C−1, where C is a square root (Cholesky factor) of Σ (= CTC), such that the error covariance matrix of the transformed data is now diagonal with equal error variances. Then the TLS algorithm can be used on the new set, and finally, its solution must be converted to a solution of the original set of equations [69,70].

Remark 35 (Covariance Knowledge) The knowledge of the form of the error covariance matrix up to a constant scalar multiple may still appear too restrictive, as this type of information is not always available to an experimenter. Assuming that independent repeated observations are available for each variable observed with error, this type of replication provides enough information about the error covariance matrix to derive consistent unbiased estimates of Σ [60,62]. Using these consistent estimates instead of Σ does not change the consistency properties of the parameter estimators [60].

The TLS estimators in both the intercept and no-intercept models are asymptotically normally distributed. For the unidimensional case, the covariance matrix of the TLS estimator x̂ is larger than the covariance matrix of the LS estimator x′, even if A is noisy [98].

Summarizing, a comparison of the accuracy of the TLS and LS solutions with respect to their bias, total variance, and mean squared error (MSE; total variance + squared bias) gives the following:

• The bias of TLS is much smaller than the bias of LS and decreases with increasing m (i.e., with an increasing degree of overdetermination).

• The total variance of TLS is larger than that of LS.

• At the smallest noise variances, the MSE is comparable for TLS and LS; as the noise in the data increases, the differences in MSE become greater, showing the better performance of TLS. The TLS solution is more accurate especially when the set of equations is more overdetermined, but the better performance is already apparent for moderate sample sizes m.

All these conclusions hold even if the errors are not Gaussian but are exponentially distributed or t-distributed with 3 degrees of freedom.

1.10.1 Outliers

TLS, which is characterized by larger variances, is less stable than LS. This instability has quite serious implications for the accuracy of the estimates in the presence of outliers in the data (i.e., large errors in the measurements) [20,105]. In this case a dramatic deterioration of the TLS estimates occurs. The LS estimates also encounter serious stability problems, although less dramatically. Robust procedures, which are quite efficient and rather insensitive to outliers, should therefore be considered (e.g., down-weighting measurement samples that have high residuals). In the following, robust nonlinear neurons are introduced to overcome this problem.

1.11 BASIC DATA LEAST SQUARES PROBLEM

The TLS problem can be viewed as an unconstrained perturbation problem because all columns of [A; b] can have error perturbations. The OLS problem constrains the columns of A to be errorless; the opposite case is the data least squares (DLS) problem, because the error is assumed to lie only in the data matrix A.

Definition 36 (Basic DLS Problem) Given the overdetermined set (1.4), the data least squares (DLS) problem searches for

min ‖A − A′′‖F over A′′ ∈ ℜm×n, subject to b ∈ R(A′′)    (1.53)

Once a minimizing A′′ is found, then any x′′ satisfying

A′′x′′ = b    (1.54)

is called a DLS solution (the corresponding DLS correction is ΔA′′ = A − A′′).

The DLS case is particularly appropriate for certain deconvolution problems, such as those that may arise in system identification or channel equalization [51].

Theorem 37 The DLS problem (1.53) is solved by

x′′ = (bT b / bT Avmin) vmin,   bT Avmin ≠ 0    (1.55)

where vmin is the right singular vector corresponding to the smallest singular value of the matrix P⊥b A, where P⊥b = I − b(bT b)−1bT is a projection matrix that projects the column space of A into the orthogonal complement of b. If the smallest singular value is repeated, the solution is not unique. The minimum norm solution is given by

x′′ = [bT b / ((bT AVmin)(VTmin AT b))] Vmin(VTmin AT b),   (bT AVmin)(VTmin AT b) ≠ 0    (1.56)

where Vmin is the matrix of right singular vectors corresponding to the repeated smallest singular value of P⊥b A.

Proof. See [51], which derives its results from the constrained total least squares (CTLS) approach [1,2]. □
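A minimal sketch of (1.55) (the function name is ours), assuming the smallest singular value of P⊥b A is simple and bT Avmin ≠ 0.

    import numpy as np

    def dls_basic(A, b):
        """Basic DLS solution (1.55): smallest right singular vector of the projected
        data matrix, rescaled so that A'' x'' = b is consistent."""
        m = len(b)
        P_perp = np.eye(m) - np.outer(b, b) / (b @ b)   # projector onto the complement of b
        _, _, Vt = np.linalg.svd(P_perp @ A)
        v_min = Vt[-1]                                  # right singular vector of the smallest singular value
        denom = b @ A @ v_min                           # must be nonzero in the generic case
        return (b @ b) / denom * v_min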


1.12 PARTIAL TLS ALGORITHM

Considering that the TLS solution of AX ≈ B is deduced from a basis of the right singular subspace associated with the smallest singular values of [A; B], considerable savings in computation time are possible by calculating only the desired basis vectors. This can be done in a direct way by modifying the SVD algorithm or in an iterative way by making use of start vectors. An improved algorithm, PSVD (partial SVD), which computes this singular subspace, is presented in [98]. There are three reasons for its higher efficiency vs. the classical SVD [75]:

1. The Householder transformations of the bidiagonalization are applied only to the basis vectors of the desired singular subspace.
2. The bidiagonal is only partially diagonalized.
3. An appropriate choice is made between QR and QL iteration steps [99].

Definition 38 (Motor Gap) The motor gap is the gap between the singular values associated with the desired and undesired vectors.

Depending on the motor gap (it must be large), the desired numerical accuracy, and the dimension of the desired subspace (it must be small), PSVD can be three times as fast as the classical SVD, while the same accuracy can be maintained. Incorporating the PSVD algorithm into the TLS computations results in an improved partial TLS (PTLS) [96]. Typically, PTLS reduces the computation time by a factor of 2.

1.13 ITERATIVE COMPUTATION METHODS

In the estimation of parameters of nonstationary systems that vary slowly with time, space, or frequency, a priori information is available for the TLS algorithm. Indeed, in this case, slowly varying sets of equations must be solved at each time instant and the TLS solution at the last step is usually a good initial guess for the solution at the next step. If the changes in these systems are of small norm but of full rank (e.g., when all elements of the data matrix change slowly from step to step), the computation time can be better reduced by using an iterative method. There are also other advantages in using this type of algorithm:

• Each step supplies a new and better estimate of the solution, permitting us to control the level of convergence depending on the perturbations of the data.

• It is easy to code.

• Some iterative routines use the given matrix over and over again without modification, exploiting its particular structure or sparsity.


1.13.1 Direct Versus Iterative Computation Methods

The direct computation methods are the classical TLS and the PTLS; their efficiency is determined essentially by the dimensions of the data matrix, the desired accuracy, the dimension p of the desired singular subspace, and the motor gap. The iterative methods are efficient in solving TLS problems if:

1. The start matrix is good and the problem is generic.
2. The desired accuracy is low.
3. The dimension p of the desired singular subspace is known.
4. The dimension d of the problem is low.
5. The data matrix dimensions are moderate.
6. The gap is sufficiently large.

In contrast to the iterative methods, the direct methods always converge to the desired solution.

In the sequel a list of the principal nonneural iterative methods is presented, together with their own fields of application (see [98]).

1.13.2 Inverse Iteration

Many authors have studied the inverse iteration (II) method in (non)symmetric eigenvalue problems (see [75,150]). According to Wilkinson [193], it is the most powerful and accurate method for computing eigenvectors.

Let S = CTC, where C = [A; B]. The iteration matrix Qk is given by

Qk = (S − λ0I)−kQ0    (1.57)

with Q0 a start matrix and λ0 a chosen shift. Given the dimension p of the TLS eigensubspace of S, taking λ0 zero or such that the ratio |σ²n−p − λ0| / |σ²n−p+1 − λ0| is high enough, the matrix Qk converges to the desired minor eigensubspace. The iteration destroys the structure of the matrix S (for details, see [98]).
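A hedged sketch of inverse (subspace) iteration on S = CTC as in (1.57), with re-orthonormalization at every step; the random start matrix, the shift, and the iteration count are arbitrary illustrative choices.

    import numpy as np

    def inverse_iteration(C, p, shift=0.0, n_iter=50, seed=0):
        """Inverse iteration (1.57): Q_k converges to the p-dimensional minor
        eigensubspace of S = C^T C when the motor gap is large."""
        S = C.T @ C
        n = S.shape[0]
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(rng.standard_normal((n, p)))   # random orthonormal start matrix Q_0
        M = S - shift * np.eye(n)
        for _ in range(n_iter):
            Q = np.linalg.solve(M, Q)                      # apply (S - lambda_0 I)^{-1}
            Q, _ = np.linalg.qr(Q)                         # keep the basis orthonormal
        return Q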

Remark 39 (Convergence Property) If the motor gap (the gap between σ²n−p and σ²n−p+1) is large, fast convergence occurs (this requirement is satisfied for many TLS problems).

1.13.3 Chebyshev Iteration

When the motor gap is small, the convergence can be accelerated by applying Chebyshev polynomials (see [136,150,164,193]), instead of the inverse power functions as before, to the matrix CTC for the ordinary Chebyshev iteration (OCI) and to the matrix (CTC)−1 for the inverse Chebyshev iteration (ICI). The Chebyshev polynomials Tk[y, z](x) (of degree k, adapted to the interval [y, z]) are orthogonal over the interval [y, z] with respect to the density function 1/√(1 − x²). By choosing the interval [y, z] as small as possible so that it contains all the undesired eigenvalues of a matrix S, the ordinary Chebyshev iteration method will converge to a basis of the eigensubspace associated with the remaining eigenvalues outside [y, z]: in particular, in multidimensional TLS problems, where the desired subspace is the singular subspace associated with the p smallest singular values of C, [y, z] must contain the n − p largest eigenvalues of S = CTC = [A; B]T[A; B]. The inverse Chebyshev iteration is applied to the matrix S = (CTC)−1, and then the interval [z, y] must be chosen as small as possible such that it contains all undesired eigenvalues of S (i.e., the inverses of the squares of the singular values of [A; B]). Only OCI does not alter the matrix C.

Usually, for small motor gaps [typically, for ratios σr/σr+1 < 10 for the gap between the singular values associated with the desired (≤ σr+1) and undesired (≥ σr) singular subspaces of matrix C], ICI is recommended. Indeed, this method is proven [98, p. 160] always to converge faster than OCI. Generally, the gain in speed is very significant. In contrast to OCI, and provided that the undesired singular value spectrum is not too small (e.g., σ1/σr ≥ 2), the convergence rate of ICI is hardly influenced by the spread of this spectrum or by the quality of its lower bound z ≤ 1/σ²1 [98, p. 152]. Moreover, ICI is proved [98, pp. 152–157] always to converge faster than II, provided that an optimal estimation of the bound y is given. The smaller the motor gap, the larger the gain in speed. OCI is shown [98, p. 139] to be efficient only in problems characterized by a very dense singular value spectrum, which is not the case for most TLS problems. However, it is recommended for solving very large, sparse, or structured problems, because it does not alter the matrix C. Table 1.1 summarizes these comparisons.

1.13.4 Lanczos Methods

Lanczos methods are iterative routines that bidiagonalize an arbitrary matrix C or tridiagonalize a symmetric matrix S directly, without any orthogonal updates as in the Householder approach to the classical SVD. The original matrix structure is not destroyed and little storage is needed; thus, these methods work well for large, sparse, or structured matrices [98]. They are used for TLS problems with a small motor gap; if the Lanczos process is applied to S (LZ), it has a minimally faster convergence than OCI with optimal bounds. Like OCI, the convergence rate depends not only on the motor gap but also on the relative spread of the undesired singular value spectrum. If it is applied to S−1 [the inverse Lanczos method (ILZ)], it has the same convergence properties as ICI with optimal bounds. However, round-off errors make all Lanczos methods difficult to use in practice [75]. For a summary, see Table 1.1.

Table 1.1 Link between the singular value spectrum and the best convergence (indep. = independent)

                        II       OCI      ICI      RQI      LZ       ILZ
Motor gap               large    small    small    small    small    small
Undesired SV spread     indep.   small    indep.   indep.   small    indep.
Bounds                  no       optimal  optimal  no       no       no

1.13.5 Rayleigh Quotient Iteration

The Rayleigh quotient iteration (RQI) can be used to accelerate the convergence rate of the inverse iteration process when the motor gap is small (see Table 1.1), no adequate shift λ0 can be computed, and convergence to only one singular vector of the desired singular subspace is required. RQI is a variant of II obtained by applying a variable shift λ0(k), which is the RQ r(qk) (see Section 2.1) of the iteration vector qk, defined by

r(qk) = qTk S qk / qTk qk, minimizing f(λ) = ‖(S − λI)qk‖2    (1.58)

Then the RQI is given by

(S − λ0(k)I)qk = qk−1    (1.59)

With regard to II, RQI need not estimate the fixed shift λ0, because it generates the best shift in each step automatically and therefore converges faster. Parlett [149] has shown that the RQI convergence is ultimately cubic. However, RQI requires more computations per iteration step than II. Furthermore, RQI applies only to square symmetric matrices S, and it is impossible to avoid the explicit formation of S = CTC, which may affect the numerical accuracy of the solution. A good start vector q0 should be available to allow RQI to converge to the desired solution. For example, most time-varying problems are characterized by abrupt changes at certain time instants that seriously affect the quality of the start vector and cause RQI to converge to an undesired singular triplet (see [98]). If convergence to several basis vectors of the desired singular subspace is required, RQI extensions must be used (e.g., inverse subspace iteration with Ritz acceleration) [48].

1.14 RAYLEIGH QUOTIENT MINIMIZATION NONNEURAL AND NEURAL METHODS

From eqs. (1.22) and (1.21) it is evident that ETLS(x) corresponds to the Rayleigh quotient (see Section 2.1) of [A; b], and therefore the TLS solution corresponds to its minimization. This minimization is equivalent to minor components analysis (MCA), because it corresponds to the search for the eigenvector associated with the minimum eigenvalue of [A; b], followed by a scaling of the solution into the TLS hyperplane. This idea will be explained at greater length in the following two chapters. Among the nonneural methods, the first to use such an idea in a recursive TLS algorithm was Davila [50]; there the desired eigenvector was updated with a correction vector chosen as a Kalman filter gain vector, and the scalar step size was determined by minimizing the RQ. Bose et al. [11] applied recursive TLS to reconstruct high-resolution images from undersampled low-resolution noisy multiframes. In [200] the minimization is performed by using the conjugate gradient method, which requires more operations but is very fast and is particularly suitable for large matrices.

The neural networks applied to TLS problems can be divided into two categories: the neural networks for MCA, which are described in Chapter 2, and the neural networks that iterate only in the TLS hyperplane and therefore give the TLS solution directly, which are described in Chapter 3. They are the following:

• The Hopfield-like neural network of Luo, Li, and He [120,121]. This network is made up of 3(m + n) + 2 neurons (m and n are the dimensions of the data matrix) grouped in a main network and four subnetworks; the output of the main network gives the TLS solution, and the available data matrix and observation vector are taken directly as the interconnections and the bias current of the network; thus, the principal limit of this network is the fact that it is linked to the dimensions of the data matrix and cannot be used without structural changes for other TLS problems. The authors demonstrate² the stability of the network for batch operation: that is, working on all the equations together (as opposed to sequential or online operation, which updates at every equation presentation). The initial conditions cannot be null. The network is based on an analog circuit architecture which has continuous-time dynamics. The authors apply the network to the TLS linear prediction frequency estimation problem.

• The linear neuron of Gao, Ahmad, and Swamy [63,64]. This is a single linear neuron associated with a constrained anti-Hebbian learning law, which follows from the linearization of ETLS(x), so it is sufficiently accurate only for small gains and, above all, for weight norms much smaller than 1. The weight vector gives the TLS solution after the learning phase. It has been applied to adaptive FIR and IIR parameter estimation problems. In the first problem, after learning, the output of the neuron gives the error signal of the adaptive filter; this is a useful property because in a great many adaptive filter applications the error signal has the same importance as the filter parameters and other signal quantities. From now on, this neuron will be termed TLS GAO.

• The linear neurons of Cichocki and Unbehauen [21,22]. The authors propose linear neurons with different learning laws to deal with OLS, TLS, and DLS problems. For each algorithm an appropriate cost energy to be minimized is devised, which contains the classical minimization [e.g., the minimization of ETLS(x) for the TLS] plus regularization terms for ill-conditioned problems and nonlinear functions for the robustness of the method. Then the system of differential equations describing the gradient flow of this energy is implemented in an analog network for the continuous-time learning law and in a digital network for the discrete-time learning law. The analog network consists of analog integrators, summers, and multipliers; the network is driven by independent source signals (zero-mean, high-frequency, uncorrelated i.i.d. random signals) multiplied by the incoming data aij, bi (i = 1, 2, . . . , m; j = 1, 2, . . . , n) from [A; b]. The artificial neuron, with an on-chip adaptive learning algorithm, allows both complete simultaneous processing of the input information and the sequential strategy. In the digital neuron the difference equations of the gradient flow are implemented in CMOS switched-capacitor (SC) technology. The neurons for the TLS do not work on the exact gradient flow, but on its linearization:³ it gives the same learning law as TLS GAO for a particular choice of the independent signals. The DLS learning law is introduced empirically, without justification. The examples of [22] are used as a benchmark for comparison purposes.

(i = 1, 2, . . . , m; j = 1, 2, . . . , n) from [A; b]. The artificial neuron, with anon-chip adaptive learning algorithm, allows both complete processing of theinput information simultaneously and the sequential strategy. In the digitalneuron the difference equations of the gradient flow are implemented byCMOSswitched-capacitor (SC) technology. The neurons for the TLS do notwork on the exact gradient flow, but on its linearization:3 It gives the samelearning law of TLS GAO for a particular choice of the independent signals.The DLS learning lawis introduced empirically, without justification. Theexamples of [22] are used as a benchmark for comparison purposes.

Remark 40 TLS GAO and the linear neurons of Cichocki and Unbehauen for the TLS problems have learning laws that are not gradient flows of the error function because of the linearization, which forbids the use of acceleration techniques based on the Hessian of the error function and the conjugate gradient method.

3It is evident that as for TLS GAO, the norm of the weight must be much less than unity forlinearization; curiously, the authors forget to cite this limit.