TRANSCRIPT
Lect. 4. Toward MLA in tensor-product formats B. Khoromskij, Leipzig 2007(L4) 1
Contents of Lecture 4
1. Structured representation of high-order tensors revisited.
- Tucker model.
- Canonical (PARAFAC) model.
- Two-level and mixed models.
2. Multi-linear algebra (MLA) with Kronecker-product data.
- Invariance of some matrix properties.
- Commutator, matrix exponential, eigen-value problem.
- Lyapunov equation.
- Complexity issues.
3. Algebraic methods of tensor-product decomposition.
Rank-(r1, ..., rd) Tucker model
Tucker Model ($T_{\mathbf r}$) (with an orthonormalised set $V^{(\ell)}_{k_\ell}\in\mathbb{R}^{I_\ell}$):
$$A_{(\mathbf r)} = \sum_{k_1=1}^{r_1}\cdots\sum_{k_d=1}^{r_d} b_{k_1\ldots k_d}\, V^{(1)}_{k_1}\times\cdots\times V^{(d)}_{k_d}\ \in\ \mathbb{R}^{I_1\times\ldots\times I_d}.$$
The core tensor $B = \{b_{\mathbf k}\}\in\mathbb{R}^{r_1\times\ldots\times r_d}$ is not unique (determined only up to rotations).
Complexity (p = 1): $r^d + rdn \ll n^d$ with $r = \max_\ell r_\ell \ll n$.
[Figure: visualization of the Tucker model with d = 3 — the $I_1\times I_2\times I_3$ tensor A factored into a core tensor B of size $r_1\times r_2\times r_3$ and factor matrices $V^{(1)}, V^{(2)}, V^{(3)}$.]
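The Tucker construction above can be illustrated with a minimal numpy sketch (not part of the original slides; all names are illustrative): a random core and orthonormal factors are assembled into the full tensor, and the storage count $r^d + drn$ is compared with $n^d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10, 3                      # mode size n and Tucker rank r, d = 3

# Orthonormal factor matrices V(ell) in R^{n x r}, obtained via QR
V = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(3)]
B = rng.standard_normal((r, r, r))            # core tensor

# A = sum over (k1,k2,k3) of b_{k1 k2 k3} V1[:,k1] x V2[:,k2] x V3[:,k3]
A = np.einsum('abc,ia,jb,kc->ijk', B, V[0], V[1], V[2])

# Storage: r^d + d*r*n numbers instead of n^d
print(r**3 + 3*r*n, 'vs', n**3)   # 117 vs 1000
```

The einsum contraction is exactly the triple sum in the formula above; the compression factor grows rapidly with n.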
CANDECOMP/PARAFAC (CP) tensor format
CP Model ($C_r$). Approximate A by a sum of rank-1 tensors:
$$A_{(r)} = \sum_{k=1}^{r} b_k\, V^{(1)}_k\times\cdots\times V^{(d)}_k \approx A,\qquad b_k\in\mathbb{R},$$
with normalised $V^{(\ell)}_k\in\mathbb{R}^{n^p}$. Uniqueness is due to J. Kruskal '77.
Complexity: $r + rdn$.
The minimal such number r is called the tensor rank of $A_{(r)}$.
[Figure 1: visualization of the CP model for d = 3 — A as a sum of r rank-1 terms $b_k\, V^{(1)}_k\times V^{(2)}_k\times V^{(3)}_k$.]
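The CP sum can likewise be sketched in numpy (an illustration, not the lecture's code; names are assumptions): r weighted rank-1 terms with column-normalised factors are accumulated into the full tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, d = 8, 4, 3
b = rng.standard_normal(r)                       # weights b_k
# d factor matrices of shape (n, r); column k holds V(ell)_k
V = [rng.standard_normal((n, r)) for _ in range(d)]
V = [M / np.linalg.norm(M, axis=0) for M in V]   # normalise each V(ell)_k

# A = sum_k b_k V1[:,k] x V2[:,k] x V3[:,k]
A = np.einsum('k,ik,jk,lk->ijl', b, V[0], V[1], V[2])

print(A.shape)      # storage r + d*r*n = 100 numbers vs n^d = 512
```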
Two-level and mixed models
Two-level Tucker model $T_{(U,\mathbf r,q)}$:
$$A_{(\mathbf r,q)} = B\times_1 V^{(1)}\times_2 V^{(2)}\cdots\times_d V^{(d)}\ \in\ T_{(U,\mathbf r,q)}\subset C_{(n,q)},$$
where
1. $B\in\mathbb{R}^{r_1\times\ldots\times r_d}$ is represented by the rank-q CP model $C_{(\mathbf r,q)}$;
2. $V^{(\ell)} = [V^{(\ell)}_1\, V^{(\ell)}_2\ldots V^{(\ell)}_{r_\ell}]\in U$, $\ell = 1,\ldots,d$, where U spans a fixed (uniform/adaptive) basis.
⇒ The $O(r^d)$ core cost with $r = \max_{\ell\le d} r_\ell$ is reduced to $O(dqr)$ (independent of n!).
Mixed model $M_{C,T}$:
$$A = A_1 + A_2,\qquad A_1\in C_{r_1},\ A_2\in T_{\mathbf r_2}.$$
Applies to "ill-conditioned" tensors.
Examples of two-level models
(I) Tensor-product sinc-interpolation:
analytic functions with point singularities,
$\mathbf r = (r,\ldots,r)$, $r = q = O(\log n\,|\log\varepsilon|)$ ⇒ $O(dqr)$.
(II) Adaptive two-level approximation:
Tucker + CP decomposition of B with $q \le |\mathbf r|$ ⇒ $O(dqn)$.
(III) Sparse grids: regularity of mixed derivatives,
$\mathbf r = (n_1,\ldots,n_d)$, hyperbolic cross selection ⇒ $q = n\log^d n$ ⇒ $O(n\log^d n)$.
Structured tensor-product models (d-th order tensors of size $n^d$):

| Model | Notation | Memory / A·x | A·B | Comput. tools |
|---|---|---|---|---|
| Canonical (CP) | $C_r$ | $drn$ | $drn^2$ | ALS/Newton |
| HKT-CP | $C_{H,r}$ | $dr\sqrt{n}\log^q n$ | $drn\log^q n$ | Analytic (quadr.) |
| Nested CP | $C_{T(I),L}$ | $dr\log d\cdot n + rd$ | $dr\log d\cdot n$ | SVD/QR/orthog. iter. |
| Tucker | $T_{\mathbf r}$ | $r^d + drn$ | – | Orthogonal ALS |
| Two-level Tucker | $T_{(U,\mathbf r,q)}$ | $drq$ / $dr\,r_0 q n^2$ | $dr^2q^2$ (mem.) | Analyt. (interp.) + CP |
Challenge of multi-factor analysis
Paradigm: linear algebra vs. multi-linear algebra.
CP/Tucker tensor-product models have plenty of merits:
1. $A_{(r)}$ is represented with low cost $drn$ (resp. $drn + r^d$) $\ll n^d$.
2. $V^{(\ell)}_k$ can be represented in data-sparse form:
H-matrix (HKT), wavelet-based (WKT), uniform basis.
3. The core tensor $B = \{b_{\mathbf k}\}$ can be "sparsified" via the CP model.
4. Efficient numerical MLA ⇔ highly nonlinear problems.
Remark. The CP decomposition (unique!) cannot be retrieved by rotation and truncation of the Tucker model:
$C_r = T_{\mathbf r}$ if $r = 1 \vee d = 2$, but $C_r \not\subset T_{\mathbf r}$ if $r = |\mathbf r| \ge 2 \wedge d \ge 3$.
Little analogy between the cases d ≥ 3 and d = 2
I. rank(A) depends on the number field (say, $\mathbb{R}$ or $\mathbb{C}$).
II. No finite algorithm is known to compute $r = \mathrm{rank}(A)$, except for simple bounds:
$$\mathrm{rank}(A) \le n^{d-1};\qquad \mathrm{rank}(A) \le \mathrm{rank}(A_1) + \ldots + \mathrm{rank}(A_{n^{d-2}}).$$
III. For fixed d and n the exact value of maxrank(A) is not known. J. Kruskal '75 proved that:
– for any 2 × 2 × 2 tensor, maxrank(A) = 3 < 4;
– for 3 × 3 × 3 tensors, maxrank(A) = 5 < 9.
IV. "Probabilistic" properties of rank(A): in the set of 2 × 2 × 2 tensors, about 79% are rank-2 tensors and about 21% are rank-3 tensors, while rank-1 tensors appear with probability 0.
Clearly, for n × n matrices we have $P\{\mathrm{rank}(A) = n\} = 1$.
V. However, it is possible to prove a very important uniqueness property within equivalence classes.
Two CP-type representations are considered equivalent if either
(a) they differ only in the order of the terms, or
(b) for some set of parameters $a^\ell_k\in\mathbb{R}$ with $\prod_{\ell=1}^{d} a^\ell_k = 1$ ($k = 1,\ldots,r$), one is obtained from the other by the transform $V^{(\ell)}_k \to a^\ell_k V^{(\ell)}_k$.
A simplified version of the general uniqueness result is the following (all factors have the same full rank r).
Prop. 1 (J. Kruskal, 1977). Let, for each $\ell = 1,\ldots,d$, the vectors $V^{(\ell)}_k$ ($k = 1,\ldots,r$) with $r = \mathrm{rank}(A)$ be linearly independent. If
$$(d-2)\,r \ge d-1,$$
then the CP decomposition is uniquely determined up to the equivalence (a)–(b) above.
Properties of the Kronecker product
A tensor $A\in\mathbb{R}^{I_1\times\ldots\times I_d}$ can be viewed as:
A. An element of a linear space of vectors with the $\ell_2$ inner product and the related Frobenius norm, i.e., a multi-variate function of the discrete argument, $A : I_1\times\ldots\times I_d\to\mathbb{R}$.
B. A mapping $A:\mathbb{R}^{I_1\times\ldots\times I_q}\to\mathbb{R}^{I_{q+1}\times\ldots\times I_d}$ (hence requiring matrix operations in the tensor format).
Def. 4.1. The Kronecker product (KP) $A\otimes B$ of two matrices
$$A = [a_{ij}]\in\mathbb{R}^{m\times n},\qquad B\in\mathbb{R}^{h\times g}$$
is the $mh\times ng$ matrix with the block representation $[a_{ij}B]$.
Ex. 4.1. In general $A\otimes B \ne B\otimes A$. What is the condition on A and B that provides $A\otimes B = B\otimes A$?
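A small numpy experiment (a sketch, not part of the slides) makes the non-commutativity concrete; it also shows the well-known fact that the two products agree up to a fixed "perfect shuffle" permutation, which is a useful hint for the exercise.

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])

K1, K2 = np.kron(A, B), np.kron(B, A)
print(np.allclose(K1, K2))       # False: A x B != B x A in general

# The two products are nevertheless permutation-similar:
# P (A x B) P^T = B x A for a fixed perfect-shuffle permutation P.
m, h = 2, 2
P = np.zeros((m*h, m*h))
for i in range(m):
    for j in range(h):
        P[j*m + i, i*h + j] = 1.0
print(np.allclose(P @ K1 @ P.T, K2))   # True
```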
1. Let C ∈ Rs×t, then the KP satisfies the associative law,
(A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) = A ⊗ B ⊗ C ∈ Rmhs×ngt,
and therefore we do not use brackets.
2. Let $C\in\mathbb{R}^{n\times r}$, $D\in\mathbb{R}^{g\times s}$; then the matrix-matrix product in the Kronecker format takes the form
$$(A\otimes B)(C\otimes D) = (AC)\otimes(BD).$$
The extension to d-th order tensors is
(A1 ⊗ ... ⊗ Ad)(B1 ⊗ ... ⊗ Bd) = (A1B1) ⊗ ... ⊗ (AdBd).
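The mixed-product rule is easy to verify numerically; the following sketch (illustrative, with arbitrary compatible shapes) checks it on random rectangular factors.

```python
import numpy as np

rng = np.random.default_rng(2)
# Shapes chosen so that AC and BD are both well defined
A, C = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
B, D = rng.standard_normal((2, 5)), rng.standard_normal((5, 3))

lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))     # True
```

Note that forming `np.kron(A, B)` explicitly is exactly what the tensor formats avoid; the identity is the reason one can work with the small factors instead.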
3. We have the distributive law
(A + B) ⊗ (C + D) = A ⊗ C + A ⊗ D + B ⊗ C + B ⊗ D.
4. Rank relation: rank(A ⊗ B) = rank(A)rank(B).
Invariance of some matrix properties:
(1) If A and B are diagonal then A ⊗ B is also diagonal, and conversely (if A ⊗ B ≠ 0).
(2) $(A\otimes B)^T = A^T\otimes B^T$, $(A\otimes B)^* = A^*\otimes B^*$.
(3) Let A and B be Hermitian/normal matrices ($A^* = A$ resp. $A^*A = AA^*$). Then A ⊗ B is of the corresponding type.
(4) $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{m\times m}$ ⇒ $\det(A\otimes B) = (\det A)^m(\det B)^n$.
Hint: $A\otimes B = (I_n\otimes B)(A\otimes I_m)$.
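The determinant rule (4) can be checked directly (a numerical sanity check, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
A = rng.standard_normal((n, n))   # A is n x n
B = rng.standard_normal((m, m))   # B is m x m

# det(A x B) = det(A)^m * det(B)^n
print(np.isclose(np.linalg.det(np.kron(A, B)),
                 np.linalg.det(A)**m * np.linalg.det(B)**n))   # True
```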
Matrix operations with the Kronecker product
Thm. 4.1. Let $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{m\times m}$ be invertible matrices. Then
$$(A\otimes B)^{-1} = A^{-1}\otimes B^{-1}.$$
Proof. Since $\det(A)\ne 0$ and $\det(B)\ne 0$, property (4) above gives $\det(A\otimes B)\ne 0$. Thus $(A\otimes B)^{-1}$ exists, and
$$(A^{-1}\otimes B^{-1})(A\otimes B) = (A^{-1}A)\otimes(B^{-1}B) = I_{nm}.$$
Lem. 4.1. Let $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{m\times m}$ be unitary matrices. Then $A\otimes B$ is a unitary matrix.
Proof. Since $A^* = A^{-1}$ and $B^* = B^{-1}$, we have
$$(A\otimes B)^* = A^*\otimes B^* = A^{-1}\otimes B^{-1} = (A\otimes B)^{-1}.$$
Define the commutator [A, B] := AB − BA.
Lem. 4.2. Let $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{m\times m}$. Then
$$[A\otimes I_m,\ I_n\otimes B] = 0\in\mathbb{R}^{nm\times nm}.$$
Proof.
$$[A\otimes I_m,\ I_n\otimes B] = (A\otimes I_m)(I_n\otimes B) - (I_n\otimes B)(A\otimes I_m) = A\otimes B - A\otimes B = 0.$$
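A quick numpy check of this commutation (a sketch; the identity factors are sized as $I_m$ and $I_n$ so that both Kronecker products are $nm\times nm$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))

X = np.kron(A, np.eye(m))        # A x I_m, size nm x nm
Y = np.kron(np.eye(n), B)        # I_n x B, size nm x nm
print(np.allclose(X @ Y - Y @ X, 0))   # True: the commutator vanishes
```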
Lem. 4.3. Let $A, B\in\mathbb{R}^{n\times n}$ and $C, D\in\mathbb{R}^{m\times m}$ with $[A,B] = 0$ and $[C,D] = 0$. Then
$$[A\otimes C,\ B\otimes D] = 0.$$
Proof. Apply the identity $(A\otimes B)(C\otimes D) = (AC)\otimes(BD)$.
Lem. 4.4. Let $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{m\times m}$. Then
$$\mathrm{tr}(A\otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B).$$
Proof. Since $\mathrm{diag}(a_{ii}B) = a_{ii}\,\mathrm{diag}(B)$, we have
$$\mathrm{tr}(A\otimes B) = \sum_{i=1}^{n}\sum_{j=1}^{m} a_{ii}b_{jj} = \sum_{i=1}^{n} a_{ii}\sum_{j=1}^{m} b_{jj}.$$
Thm. 4.2. Let A, B, I ∈ Rn×n. Then
exp(A ⊗ I + I ⊗ B) = (expA) ⊗ (expB).
Proof. Since [A ⊗ I, I ⊗ B] = 0, we have
exp(A ⊗ I + I ⊗ B) = exp(A ⊗ I) exp(I ⊗ B).
Furthermore, since
$$\exp(A\otimes I) = \sum_{k=0}^{\infty}\frac{(A\otimes I)^k}{k!},\qquad \exp(I\otimes B) = \sum_{m=0}^{\infty}\frac{(I\otimes B)^m}{m!},$$
the generic term in $\exp(A\otimes I)\exp(I\otimes B)$ is given by
$$\frac{1}{k!}\,\frac{1}{m!}\,(A\otimes I)^k(I\otimes B)^m.$$
Using
$$(A\otimes I)^k(I\otimes B)^m = (A^k\otimes I)(I\otimes B^m) = A^k\otimes B^m,$$
we finally arrive at
$$\frac{1}{k!}\,\frac{1}{m!}\,(A\otimes I)^k(I\otimes B)^m = \Big(\frac{A^k}{k!}\Big)\otimes\Big(\frac{B^m}{m!}\Big).$$
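Thm. 4.2 can be checked numerically. The sketch below (not the lecture's code) uses a truncated Taylor series for the matrix exponential, which converges quickly for the small, modest-norm matrices used here; `expm_taylor` is an illustrative helper, not a library routine.

```python
import numpy as np

def expm_taylor(M, terms=30):
    """Truncated Taylor series for exp(M); adequate for small-norm M."""
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(5)
n, m = 3, 2
A = 0.1 * rng.standard_normal((n, n))   # scaled down so the series converges fast
B = 0.1 * rng.standard_normal((m, m))

# exp(A x I + I x B) = exp(A) x exp(B)
lhs = expm_taylor(np.kron(A, np.eye(m)) + np.kron(np.eye(n), B))
rhs = np.kron(expm_taylor(A), expm_taylor(B))
print(np.allclose(lhs, rhs))     # True
```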
Thm. 4.2 can be extended to the case of a many-term sum:
$$\exp(A_1\otimes I\otimes\cdots\otimes I + I\otimes A_2\otimes\cdots\otimes I + \ldots + I\otimes\cdots\otimes I\otimes A_d) = e^{A_1}\otimes\cdots\otimes e^{A_d}.$$
Other simple properties:
$$\sin(I_n\otimes A) = I_n\otimes\sin(A),$$
$$\sin(A\otimes I_m + I_n\otimes B) = \sin(A)\otimes\cos(B) + \cos(A)\otimes\sin(B).$$
Eigenvalue problem
Lem. 4.5. Let $A\in\mathbb{R}^{m\times m}$ and $B\in\mathbb{R}^{n\times n}$ have the eigen-data $\{\lambda_j, u_j\}$, $j = 1,\ldots,m$, and $\{\mu_k, v_k\}$, $k = 1,\ldots,n$, respectively. Then $A\otimes B$ has the eigenvalues $\lambda_j\mu_k$ with the corresponding eigenvectors $u_j\otimes v_k$, $1\le j\le m$, $1\le k\le n$.
Thm. 4.3. Under the conditions of Lem. 4.5, the eigenvalues/eigenvectors of $A\otimes I_n + I_m\otimes B$ are given by $\lambda_j + \mu_k$ and $u_j\otimes v_k$, respectively.
Proof. Due to Lem. 4.5 we have
(A ⊗ In + Im ⊗ B)(uj ⊗ vk) = (A ⊗ In)(uj ⊗ vk) + (Im ⊗ B)(uj ⊗ vk)
= (Auj) ⊗ (Invk) + (Imuj) ⊗ (Bvk)
= (λjuj) ⊗ vk + uj ⊗ (µkvk)
= (λj + µk)(uj ⊗ vk).
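The eigenpair relation of Thm. 4.3 can be verified directly for one pair $(j, k)$ (a sketch; `np.linalg.eig` may return complex eigen-data for real A, B, which is fine here):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 2
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))

# Kronecker sum A x I_n + I_m x B
K = np.kron(A, np.eye(n)) + np.kron(np.eye(m), B)

lam, U = np.linalg.eig(A)
mu, V = np.linalg.eig(B)
j, k = 0, 1
w = np.kron(U[:, j], V[:, k])            # candidate eigenvector u_j x v_k
print(np.allclose(K @ w, (lam[j] + mu[k]) * w))   # True
```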
Lyapunov/Sylvester equations
For a matrix $A\in\mathbb{R}^{m\times n}$ we use the vector representation $A\to\mathrm{vec}(A)\in\mathbb{R}^{mn}$, where vec(A) is the $mn\times 1$ vector obtained by "stacking" A's columns (column-major, FORTRAN-style ordering):
$$\mathrm{vec}(A) := [a_{11},\ldots,a_{m1},a_{12},\ldots,a_{mn}]^T.$$
In this way, vec(A) is a rearranged version of A.
The matrix Sylvester equation for $X\in\mathbb{R}^{m\times n}$,
$$AX + XB^T = G\in\mathbb{R}^{m\times n}$$
with $A\in\mathbb{R}^{m\times m}$, $B\in\mathbb{R}^{n\times n}$, can be written in vector form as
$$(I_n\otimes A + B\otimes I_m)\,\mathrm{vec}(X) = \mathrm{vec}(G).$$
In the special case A = B we have the Lyapunov equation.
Now the solvability conditions and certain solution methods can be derived (cf. the results for eigenvalue problems).
The Sylvester equation is uniquely solvable iff
$$\lambda_j(A) + \mu_k(B) \ne 0\quad\text{for all } j, k.$$
Moreover, since $I_n\otimes A$ and $B\otimes I_m$ commute, we can apply all methods proposed below to represent the inverse
$$(I_n\otimes A + B\otimes I_m)^{-1}\ \Big(= \int_0^\infty e^{-(I_n\otimes A + B\otimes I_m)t}\,dt\Big).$$
In particular, if A and B correspond to the discrete elliptic
operators in Rd with separable coefficients, we obtain the
low-rank tensor-product decomposition to the Sylvester
solution operator (cf. Lect. 7/2005).
Kronecker Hadamard product
Lemma 4.6 states a simple (but important) property of the Hadamard product of two tensors $A, B\in\mathbb{R}^{I^d}$, defined by the entry-wise multiplication
$$C = A\odot B = [c_{i_1\ldots i_d}]_{(i_1\ldots i_d)\in I^d},\qquad c_{i_1\ldots i_d} = a_{i_1\ldots i_d}\cdot b_{i_1\ldots i_d}.$$
Lem. 4.6. Let both A and B be represented by the CP model with the Kronecker ranks $r_A$, $r_B$ and with $V^{(\ell)}_k$ substituted by $A^\ell_k\in\mathbb{R}^I$ and $B^\ell_k\in\mathbb{R}^I$, respectively. Then $A\odot B$ is a tensor with the Kronecker rank $r = r_A r_B$ given by
$$A\odot B = \sum_{k=1}^{r_A}\sum_{m=1}^{r_B} c_k c_m\,(A^1_k\odot B^1_m)\otimes\cdots\otimes(A^d_k\odot B^d_m).$$
Proof. It is easy to check that
$$(A_1\otimes B_1)\odot(A_2\otimes B_2) = (A_1\odot A_2)\otimes(B_1\odot B_2),$$
and similarly for d-fold products. Applying these relations, we obtain
$$A\odot B = \Big(\sum_{k=1}^{r_A} c_k\bigotimes_{\ell=1}^{d} A^\ell_k\Big)\odot\Big(\sum_{m=1}^{r_B} c_m\bigotimes_{\ell=1}^{d} B^\ell_m\Big) = \sum_{k=1}^{r_A}\sum_{m=1}^{r_B} c_k c_m\Big(\bigotimes_{\ell=1}^{d} A^\ell_k\Big)\odot\Big(\bigotimes_{\ell=1}^{d} B^\ell_m\Big),$$
and the assertion follows.
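Lem. 4.6 can be illustrated numerically for d = 3 (a sketch with illustrative names): the entry-wise product of two CP tensors equals the CP tensor whose weights are all products $c_k c_m$ and whose factors are the entry-wise products of the original factors.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 5, 3
rA, rB = 2, 3

cA = rng.standard_normal(rA); FA = [rng.standard_normal((n, rA)) for _ in range(d)]
cB = rng.standard_normal(rB); FB = [rng.standard_normal((n, rB)) for _ in range(d)]

def cp_full(c, F):
    """Assemble a full d=3 tensor from CP weights c and factor matrices F."""
    return np.einsum('k,ik,jk,lk->ijl', c, *F)

# Entry-wise product of the full tensors ...
lhs = cp_full(cA, FA) * cp_full(cB, FB)

# ... equals a CP tensor of rank rA*rB with factors A_k^l (.) B_m^l:
cC = np.kron(cA, cB)   # all weight products c_k c_m, index (k, m) -> k*rB + m
FC = [np.einsum('ik,im->ikm', FA[l], FB[l]).reshape(n, rA*rB) for l in range(d)]
print(np.allclose(lhs, cp_full(cC, FC)))   # True
```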
Complexity of the HKT-matrix arithmetics
Complexity issues
Let $V^\ell_k\in\mathcal{M}_{\mathcal{H},s}(T_{I\times I},P)$ in the CP representation and let $N = n^d$.
• Data compression.
The storage for A is $O(rdsn\log n)$, $r = O(\log^\alpha N)$, $\alpha > 0$.
Hence, we enjoy sub-linear complexity.
• Matrix-by-vector complexity of $Ax$, $x\in\mathbb{C}^N$.
For general x one has the linear cost $O(rdsN\log n)$.
If $x = x_1\times\ldots\times x_d$, $x_i\in\mathbb{C}^n$, we again arrive at the sub-linear complexity $O(rdsn\log n)$.
• Matrix-by-matrix complexity of $AB$ and $A\odot B$.
The H-matrix structure of the Kronecker factors leads to $O(r^2ds^2n\log^q n)$ operations instead of $O(N^3)$.
How to construct a Kronecker product?
1. d = 2: SVD and ACA methods in the case of two-fold
decompositions.
2. d ≥ 2: Analytic approximation for the function-related d-th order tensors (considered in Lect. 5).
Def. 4.2. Given the multi-variate function
$$g:\Omega\subset\mathbb{R}^{d}\to\mathbb{R}\quad\text{with } d = dp,\ p, d\in\mathbb{N},\ d\ge 2,$$
$$\Omega = \{\zeta = (\zeta_1,\ldots,\zeta_d)\in\mathbb{R}^d : \|\zeta_\ell\|_\infty\le L,\ \ell = 1,\ldots,d\},\quad L > 0,$$
where $\|\cdot\|_\infty$ denotes the $\ell_\infty$-norm of $\zeta_\ell\in\mathbb{R}^p$ (here p = 1).
Introduce the function-generated d-th order tensor
$$A\equiv A(g) := [a_{i_1\ldots i_d}]\in\mathbb{R}^{I^d}\quad\text{with}\quad a_{i_1\ldots i_d} := g(\zeta^{(1)}_{i_1},\ldots,\zeta^{(d)}_{i_d}).\tag{1}$$
Approximation tools: sinc-methods, exponential fitting.
3. d ≥ 3: Algebraic recompression methods.
3A. Greedy algorithms with the dictionary
$$D := \{V^{(1)}\times V^{(2)}\times\cdots\times V^{(d)} : V^{(\ell)}\in\mathbb{R}^n,\ \|V^{(\ell)}\| = 1\}.$$
(a) Fit the original tensor A by a rank-one tensor $A_1$;
(b) Subtract $A_1$ from the original tensor A;
(c) Approximate the residual $A - A_1$ by another rank-one tensor.
For the best rank-1 approximation one solves the minimisation problem
$$\min\|A - V^{(1)}\otimes\cdots\otimes V^{(d)}\|_F,\qquad V^{(\ell)}\in\mathbb{R}^{n^p},$$
by ALS or the Newton iteration (with proven convergence).
In general, the convergence theory for greedy algorithms is still an open question (see Lect. 1).
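Steps (a)–(c) with an ALS rank-1 fit can be sketched as follows for d = 3 (illustrative code, not the lecture's implementation; `rank1_als` is a hypothetical helper). Each ALS update is the exact least-squares solution for one factor with the other two fixed.

```python
import numpy as np

def rank1_als(A, iters=50):
    """ALS for the best rank-1 fit A ~ u x v x w of a 3rd-order tensor."""
    n1, n2, n3 = A.shape
    rng = np.random.default_rng(0)
    u, v, w = (rng.standard_normal(n) for n in (n1, n2, n3))
    for _ in range(iters):
        # update each factor with the other two fixed (a linear LS problem)
        u = np.einsum('ijk,j,k->i', A, v, w) / ((v @ v) * (w @ w))
        v = np.einsum('ijk,i,k->j', A, u, w) / ((u @ u) * (w @ w))
        w = np.einsum('ijk,i,j->k', A, u, v) / ((u @ u) * (v @ v))
    return u, v, w

# greedy step: fit a rank-1 term, then subtract it from A
A = np.einsum('i,j,k->ijk', *[np.arange(1., 4.)]*3)   # an exactly rank-1 tensor
u, v, w = rank1_als(A)
R = A - np.einsum('i,j,k->ijk', u, v, w)              # residual to fit next
print(np.linalg.norm(R) < 1e-8)   # True: a rank-1 tensor is fully recovered
```

On a tensor of higher rank the residual would not vanish, and the greedy loop would continue with step (c) on R.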
Def. 4.3. A tensor $A\in C_r$ is orthogonally decomposable if
$$(V^{(\ell)}_k, V^{(\ell)}_{k'}) = \delta_{k,k'}\qquad (k, k' = 1,\ldots,r;\ \ell = 1,\ldots,d).$$
Thm. 4.5. (Zhang, Golub) If a tensor of order d ≥ 3 is
orthogonally decomposable, then this decomposition is
unique, and the OGA correctly computes it.
Proof: See Lect. 1.
(3B) The Newton algorithm to solve the Lagrange equations of the constrained minimisation: find $A\in C_r$ and $\lambda^{(k,\ell)}\in\mathbb{R}$ s.t.
$$f(A) := \|A - A_0\|_F^2 + \sum_{k=1}^{r}\sum_{\ell=1}^{d}\lambda^{(k,\ell)}\big(\|V^{(\ell)}_k\|^2 - 1\big)\to\min.\tag{2}$$
Efficient implementation of the Newton algorithm (M. Espig, MPI MIS).
(3C) Alternating least-squares (ALS).
Update the components mode by mode: fix all $V^{(\ell)}$, $\ell\ne m$, and solve for $V^{(m)}$ ($m = 1,\ldots,d$).
Convergence theory exists only for r = 1 (Golub, Zhang; Kolda '01).
Under certain simplifications, the constrained ALS minimisation algorithm can be implemented in $O(m^2 n + K_{it}\,d\,r^2 m)$ op. (see Lect. 5).
The convergence theory behind these algorithms is not complete; moreover, the solution might not be unique, or might not even exist.
Summary I
Motivation:
Basic linear algebra can be performed using one-dimensional
operations, thus avoiding the exponential scaling in d.
Bottleneck:
Lack of finite algebraic methods for the robust multi-fold Kronecker decomposition of high-order tensors (d ≥ 3).
Difficulties with recompression in matrix operations. There are, however, efficient and robust ALS/Newton algorithms.
Observation:
Analytic approximation methods are of principal importance.
Classical example: approximation by Gaussians.
Recent proposals: sinc methods, exponential fitting, sparse grids.