Outline: Introduction · The Kronecker and Box Product · Matrix Differentiation · Optimization
Efficient Automatic Differentiation of Matrix Functions

Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel
IBM T.J. Watson Research Center

Automatic Differentiation 2012, Fort Collins, CO
July 25, 2012
Introduction: Computing Derivatives · Automatic Differentiation · Matrix-Matrix Derivatives · Linear Matrix Functions
How can we compute the derivative?

Numerical: methods that do numeric differentiation by formulas like the finite difference

    f′(x) ≈ (f(x + h) − f(x)) / h.

These methods are simple to program, but lose half of all significant digits.

Symbolic: symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement it. In general, symbolic derivatives are as accurate as they come.

Automatic: something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).
Finite difference:

    f′(x) ≈ (f(x + h) − f(x)) / h

Central difference:

    f′(x) ≈ (f(x + h) − f(x − h)) / (2h)

For h < εx, where ε is the machine precision, x + h rounds to x and these approximations become zero. Thus h has to be chosen with care.
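A quick numerical illustration (a Python/NumPy sketch of our own, not code from the talk) of how the forward- and central-difference errors depend on h for f(x) = e^x at x = 1:

```python
import numpy as np

# Forward vs. central differences for f(x) = exp(x) at x = 1.
# The error shrinks with h until roundoff takes over, at which point
# roughly half of the significant digits are lost.
f, x, exact = np.exp, 1.0, np.exp(1.0)
for h in [1e-2, 1e-4, 1e-8, 1e-12]:
    fwd = (f(x + h) - f(x)) / h
    ctr = (f(x + h) - f(x - h)) / (2 * h)
    print(f"h={h:.0e}  forward err={abs(fwd - exact):.1e}  "
          f"central err={abs(ctr - exact):.1e}")
```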
The error in the finite difference is bounded by

    ‖f′′‖_∞ h / 2 + 2ε‖f‖_∞ / h,

whose minimum, 2 √(ε ‖f‖_∞ ‖f′′‖_∞), is achieved for

    h = 2 √(ε ‖f‖_∞ / ‖f′′‖_∞).
Numerical Differentiation Error

[Figure: error in approximating the derivative of f(x) = e^x, as a function of h.]
An imaginary derivative

The error stemming from the subtractive cancellation can be completely removed by using complex numbers. Let f: R → R; then

    f(x + ih) ≈ f(x) + ih f′(x) + (ih)^2/2! f′′(x) + (ih)^3/3! f⁽³⁾(x)
              = f(x) − h^2/2 f′′(x) + ih (f′(x) − h^2/6 f⁽³⁾(x)).

Thus the derivative can be approximated by

    f′(x) ≈ Im(f(x + ih)) / h.
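Because there is no subtraction, h can be taken absurdly small. A minimal complex-step sketch (our Python/NumPy illustration, not the talk's code):

```python
import numpy as np

def complex_step(f, x, h=1e-200):
    # f'(x) ~= Im(f(x + i*h)) / h: no subtractive cancellation, so even
    # h = 1e-200 gives a derivative accurate to machine precision.
    return np.imag(f(x + 1j * h)) / h

print(complex_step(np.exp, 1.0))                   # 2.718281828459045
print((np.exp(1.0 + 1e-8) - np.exp(1.0)) / 1e-8)   # forward difference: ~8 digits
```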
Imaginary Differentiation Error

[Figure: error in approximating the derivative of f(x) = e^x, as a function of h.]
By using dual numbers a + bE, where E^2 = 0, we can further reduce the error. For second-order derivatives we can use hyper-dual numbers (J. Fike, 2012).

Complex number implementations usually come with most programming languages. For dual or hyper-dual numbers we need to download or implement code.
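A minimal dual-number sketch (our own illustration, not Fike's implementation): propagating (a, b) with E^2 = 0 through arithmetic gives exact derivatives, with no truncation error at all.

```python
class Dual:
    """Dual number a + b*E with E^2 = 0: f(Dual(x, 1)).b is exactly f'(x)."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        # (a1 + b1*E)(a2 + b2*E) = a1*a2 + (a1*b2 + b1*a2)*E, since E^2 = 0
        o = self._coerce(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def f(x):
    return x * x * x + 2 * x          # f'(x) = 3x^2 + 2

d = f(Dual(2.0, 1.0))
print(d.a, d.b)                       # 12.0 14.0: f(2) and f'(2), both exact
```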
What is Automatic Differentiation?

“Algorithmic Differentiation (often referred to as Automatic Differentiation or just AD) uses the software representation of a function to obtain an efficient method for calculating its derivatives. These derivatives can be of arbitrary order and are analytic in nature (do not have any truncation error).” — B. Bell, author of CppAD

Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator-overloading implementations as in CppAD.
The Speelpenning function is:

    f(y) = ∏_{i=1}^n x_i(y).

We consider x_i = (y − t_i), so that f(y) = ∏_{i=1}^n (y − t_i). If we compute the symbolic derivative we get

    f′(y) = ∑_{i=1}^n ∏_{j≠i} (y − t_j),

which requires n(n − 1) multiplications to compute with a straightforward implementation. Rewriting the expression as

    f′(y) = ∑_{i=1}^n f(y) / (y − t_i)

reduces the computational cost, but is not numerically stable if y ≈ t_i.
Reverse Mode Differentiation

Define f_k = ∏_{i=1}^k (y − t_i) and r_k = ∏_{i=k}^n (y − t_i); then

    f′(y) = ∑_{i=1}^n ∏_{j≠i} (y − t_j) = ∑_{i=1}^n f_{i−1} r_{i+1}   (with f_0 = r_{n+1} = 1),

and f_i = (y − t_i) f_{i−1}, f_1 = y − t_1; r_i = (y − t_i) r_{i+1}, r_n = y − t_n. Note that f(y) = f_n = r_1.

The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, …, f_n, the derivative can be computed with an additional n − 1 additions and 2n − 1 multiplications.
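A sketch of this prefix/suffix scheme (our Python illustration of the slide's recurrences), cross-checked against the unstable division form:

```python
import numpy as np

def speelpenning(y, t):
    """f(y) = prod_i (y - t_i) and f'(y) = sum_i f_{i-1} * r_{i+1}, in O(n)."""
    n = len(t)
    fwd = np.ones(n + 1)                   # fwd[k] = f_k, prefix products
    for i in range(n):
        fwd[i + 1] = fwd[i] * (y - t[i])
    rev = np.ones(n + 2)                   # rev[k] = r_k, suffix products
    for i in range(n - 1, -1, -1):
        rev[i + 1] = (y - t[i]) * rev[i + 2]
    df = sum(fwd[i] * rev[i + 2] for i in range(n))
    return fwd[n], df

t = np.array([0.5, 1.5, 2.5])
f, df = speelpenning(2.0, t)
print(f, df)                               # -0.375 -0.25
print(sum(f / (2.0 - ti) for ti in t))     # -0.25 via the unstable formula
```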
Reverse Mode Differentiation

“Under quite realistic assumptions the evaluation of a gradient requires never more than five times the effort of evaluating the underlying function by itself.” — Andreas Griewank (1988)
Reverse mode differentiation (1971) is essentially applying the chain rule to a function in the reverse order in which the function is computed.

This is the same technique as applied in the backpropagation algorithm (1969) for neural networks and the forward-backward algorithm (1970) for training HMMs.

Reverse mode differentiation applies to very general classes of functions.
Reverse mode differentiation was independently invented by

- G. M. Ostrovskii (1971)
- S. Linnainmaa (1976)
- John Larson and Ahmed Sameh (1978)
- Bert Speelpenning (1980)

Not surprisingly, the fame went to B. Speelpenning, who admittedly told the full story at AD 2012. A. Sameh was on B. Speelpenning's thesis committee. The Speelpenning function was suggested as a motivation for Speelpenning's thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.
Why worry about matrix differentiation?

- The differentiation of functions of a matrix argument is still not well understood.
- There is no simple “calculus” for computing matrix derivatives.
- Matrix derivatives can be complicated to compute, even for quite simple expressions.

Goal: build on known results to define a matrix-derivative calculus.
Quiz

The Tikhonov-regularized maximum (log-)likelihood for learning the covariance Σ given an empirical covariance S is

    f(Σ) = −log det(Σ) − trace(SΣ^{-1}) − ‖Σ^{-1}‖_F^2.

What are f′(Σ) and f′′(Σ)? Statisticians can compute it. Can you? Can a calculus be reverse engineered?
Terminology

Classical calculus:
- scalar-scalar functions (f: R → R)
- scalar-vector functions (f: R^d → R)

Matrix calculus:
- scalar-matrix functions (f: R^{m×n} → R)
- matrix-matrix functions (F: R^{m×n} → R^{k×l})

Note:
1. Derivatives of scalar-matrix functions require derivatives of matrix-matrix functions (why?)
2. Matrix-matrix derivatives should be computed implicitly wherever possible (why?)
Optimizing Scalar-Matrix Functions

Useful quantities for optimizing a scalar-matrix function f(X): R^{d×d} → R:

The gradient is a matrix-matrix function: f′(X): R^{d×d} → R^{d×d}.
The Hessian is also a matrix-matrix function: f′′(X): R^{d×d} → R^{d^2×d^2}.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that we can directly use it to compute the Hessian.
Optimizing Scalar-Matrix Functions (continued)

Taking the scalar-matrix derivative of f(G(X)) will require the information in the matrix-matrix derivative ∂G/∂X.

Desiderata: the derivative of a matrix-matrix function should be a matrix, so that a convenient chain rule can be established.
The Matrix-Matrix Derivative

We define the matrix-matrix derivative to be:

    ∂F/∂X := ∂vec(F^T)/∂vec^T(X^T) =

        [ ∂f_11/∂x_11   ∂f_11/∂x_12   ⋯   ∂f_11/∂x_mn ]
        [ ∂f_12/∂x_11   ∂f_12/∂x_12   ⋯   ∂f_12/∂x_mn ]
        [      ⋮              ⋮        ⋱        ⋮      ]
        [ ∂f_kl/∂x_11   ∂f_kl/∂x_12   ⋯   ∂f_kl/∂x_mn ]

Note that the matrix-matrix derivative of a scalar-matrix function is not the same as the scalar-matrix derivative:

    ∂mat(f)/∂X = vec^T((∂f/∂X)^T).
Derivatives of Linear Matrix Functions

1. A matrix-matrix derivative that is a matrix outer product:

       F(X) = trace(AX)B,   ∂F/∂X = vec(B^T) vec^T(A).

2. A matrix-matrix derivative that is a Kronecker product:

       F(X) = AXB,   ∂F/∂X = A ⊗ B^T.

3. A matrix-matrix derivative that cannot be expressed with standard operations. We call the new operation the box product:

       F(X) = AX^T B,   ∂F/∂X = A ⊠ B^T.
The Kronecker and Box Product: Definition · Properties · Identity Box Products · The Fast Fourier Transform
Direct Matrix Products

A direct matrix product, X = A ~ B, is a matrix whose elements are

    x_{(i1 i2)(i3 i4)} = a_{i_σ(1) i_σ(2)} b_{i_σ(3) i_σ(4)},

where (i1 i2) is shorthand for i1 n2 + i2 − 1 with i2 ∈ {1, 2, …, n2}, and σ is a permutation over Z_4.

The direct matrix products are central to matrix-matrix differentiation. Of the 24 such products, 8 can be expressed using Kronecker products:

    A ⊗ B    A^T ⊗ B    A ⊗ B^T    A^T ⊗ B^T
    B ⊗ A    B^T ⊗ A    B ⊗ A^T    B^T ⊗ A^T

Another 8 can be expressed using matrix outer products, and the final 8 using the box product.
Definition

Let A ∈ R^{m1×n1} and B ∈ R^{m2×n2}.

Definition (Kronecker Product): A ⊗ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A ⊗ B)_{(i−1)m2+j, (k−1)n2+l} = a_{ik} b_{jl} = (A ⊗ B)_{(ij)(kl)}.

Definition (Box Product): A ⊠ B ∈ R^{(m1 m2)×(n1 n2)} is defined by

    (A ⊠ B)_{(i−1)m2+j, (k−1)n1+l} = a_{il} b_{jk} = (A ⊠ B)_{(ij)(kl)}.
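A direct NumPy sketch of the two definitions (0-based indices; our code, not the authors'). The test exercises the vector-multiplication identity stated on a later slide, (B^T ⊠ A)vec(X) = vec(AX^T B), with column-major vec:

```python
import numpy as np

def box(A, B):
    """(A box B)[i*m2 + j, k*n1 + l] = A[i, l] * B[j, k] (0-based indices)."""
    m1, n1 = A.shape
    m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n2 * n1)

A = np.random.randn(3, 4)        # m x n
B = np.random.randn(2, 5)        # p x q
X = np.random.randn(2, 4)        # p x n, so that A X^T B is defined
lhs = box(B.T, A) @ X.flatten(order='F')     # column-major vec(X)
rhs = (A @ X.T @ B).flatten(order='F')       # vec(A X^T B)
print(np.allclose(lhs, rhs))                 # True
```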
An example Kronecker and box product

To see the structure, consider the Kronecker and box products of two 2×2 matrices A and B:

    A ⊗ B =
        a11 b11   a11 b12   a12 b11   a12 b12
        a11 b21   a11 b22   a12 b21   a12 b22
        a21 b11   a21 b12   a22 b11   a22 b12
        a21 b21   a21 b22   a22 b21   a22 b22

    A ⊠ B =
        a11 b11   a12 b11   a11 b12   a12 b12
        a11 b21   a12 b21   a11 b22   a12 b22
        a21 b11   a22 b11   a21 b12   a22 b12
        a21 b21   a22 b21   a21 b22   a22 b22
As a visual example, take the 10×10 matrices

    A with a_{ij} = i (every row constant: 1, 2, 3, …, 10),
    B with b_{ij} = j (every column constant: 1, 2, 3, …, 10).

[Figures: heatmaps of A ⊗ B, A ⊠ B, and vec(A) vec^T(B) for these matrices.]
The box product behaves similarly to the Kronecker product:

1. Vector multiplication:

       (B^T ⊗ A) vec(X) = vec(AXB)        (B^T ⊠ A) vec(X) = vec(AX^T B)

2. Matrix multiplication:

       (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)       (A ⊠ B)(C ⊠ D) = (AD) ⊗ (BC)

3. Inverse and transpose:

       (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}     (A ⊠ B)^{-1} = B^{-1} ⊠ A^{-1}
       (A ⊗ B)^T = A^T ⊗ B^T              (A ⊠ B)^T = B^T ⊠ A^T

4. Mixed products:

       (A ⊗ B)(C ⊠ D) = (AC) ⊠ (BD)       (A ⊠ B)(C ⊗ D) = (AD) ⊠ (BC)
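Numerical spot-checks of these rules (our sketch; square matrices for simplicity, with box() as defined after the box-product definition above):

```python
import numpy as np

def box(A, B):
    m1, n1 = A.shape; m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n2 * n1)

d = 3
A, B, C, D = (np.random.randn(d, d) for _ in range(4))
inv = np.linalg.inv

print(np.allclose(box(A, B) @ box(C, D), np.kron(A @ D, B @ C)))   # (A box B)(C box D) = AD kron BC
print(np.allclose(np.kron(A, B) @ box(C, D), box(A @ C, B @ D)))   # mixed product rule
print(np.allclose(inv(box(A, B)), box(inv(B), inv(A))))            # inverse rule
print(np.allclose(box(A, B).T, box(B.T, A.T)))                     # transpose rule
```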
Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}. Then:

Trace:

    trace(A ⊗ B) = trace(A) trace(B)        trace(A ⊠ B) = trace(AB)

Determinant (here m1 = n1 and m2 = n2 is required):

    det(A ⊗ B) = (det(A))^{m2} (det(B))^{m1}
    det(A ⊠ B) = (−1)^{C(m1,2) C(m2,2)} (det(A))^{m2} (det(B))^{m1}

(where C(m,2) denotes the binomial coefficient “m choose 2”).

Associativity:

    (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)        (A ⊠ B) ⊠ C = A ⊠ (B ⊠ C),

but not for mixed products. In general we have

    (A ⊗ B) ⊠ C ≠ A ⊗ (B ⊠ C)        (A ⊠ B) ⊗ C ≠ A ⊠ (B ⊗ C).
Identity Box Products

The identity box product I_m ⊠ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.

Orthonormal: (I_m ⊠ I_n)^T (I_m ⊠ I_n) = I_{mn}
Transposition: (I_{m1} ⊠ I_{n1}) vec(A) = vec(A^T)
Connector: converting a Kronecker product to a box product, (A ⊗ B)(I_{n1} ⊠ I_{n2}) = A ⊠ B; converting a box product to a Kronecker product, (A ⊠ B)(I_{n2} ⊠ I_{n1}) = A ⊗ B.
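For instance, the transposition and orthonormality properties can be checked directly (our sketch, reusing the box() helper from above):

```python
import numpy as np

def box(A, B):
    m1, n1 = A.shape; m2, n2 = B.shape
    return np.einsum('il,jk->ijkl', A, B).reshape(m1 * m2, n2 * n1)

m, n = 3, 4
A = np.random.randn(m, n)
P = box(np.eye(m), np.eye(n))                # I_m box I_n, an mn x mn permutation
print(np.allclose(P @ A.flatten(order='F'),  # vec(A) -> vec(A^T)
                  A.T.flatten(order='F')))
print(np.allclose(P.T @ P, np.eye(m * n)))   # orthonormal
```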
Box products are old things in a new wrapping

Although the notation for box products is new, I_m ⊠ I_n has long been known as T_{m,n} by physicists, or as a stride permutation L^{mn}_m by others. These objects are identical, but the box product allows us to express more complex identities more compactly. For example, let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}; then

    (I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2}) vec((A ⊗ B)^T) = vec(vec(A) vec^T(B)).

If the box-product notation weren't being used, the initial permutation matrix would have to be written

    I_{m1} ⊠ I_{n1 m2} ⊠ I_{n2} = (I_{m1} ⊗ T_{n1 m2, m1}) T_{m1, n1 m1 m2}.
Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors.

Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by F_n = [e^{−2πikl/n}]_{0≤k,l<n} = [ω_n^{kl}]_{0≤k,l<n}, and therefore

    F_2 = [ 1   1 ]        F_4 = [ 1   1   1   1 ]
          [ 1  −1 ]              [ 1  −i  −1   i ]
                                 [ 1  −1   1  −1 ]
                                 [ 1   i  −1  −i ]
We can factor the matrix F_4 into a butterfly stage, a diagonal twiddle-factor stage, and a permuted butterfly stage (the three explicit 4×4 factors are omitted here), or more compactly:

    F_4 = (F_2 ⊗ I_2) diag(vec([ 1  1 ; 1  −i ])) (I_2 ⊠ F_2)
In general, if we define the matrix

    V_{N,M}(α) = [ 1    1         1           …   1               ]
                 [ 1    α         α^2         …   α^{M−1}         ]
                 [ 1    α^2       α^4         …   α^{2(M−1)}      ]
                 [ ⋮    ⋮         ⋮           ⋱   ⋮               ]
                 [ 1    α^{N−1}   α^{2(N−1)}  …   α^{(N−1)(M−1)}  ],

then for N = km we have the following factorizations of the DFT matrix:

    F_N = (F_k ⊗ I_m) diag(vec(V_{m,k}(ω_N))) (I_k ⊠ F_m)
    F_N = (F_m ⊠ I_k) diag(vec(V_{m,k}(ω_N))) (F_k ⊗ I_m)
This allows FFT_N(x) = y = F_N x to be computed as

    F_N x = vec((V_{m,k}(ω_N) ∘ (F_m X^T)) F_k^T),

where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k^2 m^2), whereas the formula above does the job in O(km(k + m)) operations. Repeated use of the identity for N = 2^n leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.)
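A sketch of one such radix step in NumPy (our code; the row/column vec conventions here are our own and may be transposed relative to the slides' notation), verified against np.fft.fft:

```python
import numpy as np

def fft_radix_step(x, k, m):
    """One Cooley-Tukey step for N = k*m: k-point DFTs, twiddles, m-point DFTs."""
    X = x.reshape(k, m)                  # x[l1*m + l2] = X[l1, l2]
    Fk = np.exp(-2j * np.pi * np.outer(np.arange(k), np.arange(k)) / k)
    Fm = np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(m)) / m)
    V = np.exp(-2j * np.pi * np.outer(np.arange(k), np.arange(m)) / (k * m))
    Y = (V * (Fk @ X)) @ Fm.T            # twiddled column DFTs, then row DFTs
    return Y.reshape(-1, order='F')      # y[j1 + k*j2] = Y[j1, j2]

x = np.random.randn(12) + 1j * np.random.randn(12)
print(np.allclose(fft_radix_step(x, 3, 4), np.fft.fft(x)))   # True
```

Cost: O(k^2 m) for the column DFTs plus O(k m^2) for the row DFTs, i.e. O(km(k + m)) as claimed above.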
The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n on a given platform!
Matrix Differentiation: Product and Chain Rule · Matrix Powers · Trace and Log Determinant · Forward and Reverse Mode Differentiation
Here we show the role of the Kronecker and box products in the matrix-matrix differentiation framework.
Basic Differentiation Identities

Let X ∈ R^{m×n}.

Identity: F(X) = X, F′(X) = I_{mn}
Transpose: F(X) = X^T, F′(X) = I_m ⊠ I_n
Chain rule: F(X) = G(H(X)), F′(X) = G′(H(X)) H′(X)
Product rule: F(X) = G(X)H(X), F′(X) = (I ⊗ H^T(X)) G′(X) + (G(X) ⊗ I) H′(X)
More Derivative Identities

Assume X ∈ R^{m×m} is a square matrix.

Square: F(X) = X^2, F′(X) = I_m ⊗ X^T + X ⊗ I_m
Inverse: F(X) = X^{-1}, F′(X) = −X^{-1} ⊗ X^{-T}
Inverse transpose: F(X) = X^{-T}, F′(X) = −X^{-T} ⊠ X^{-1}
Square root: F(X) = X^{1/2}, F′(X) = (I ⊗ (X^{1/2})^T + X^{1/2} ⊗ I)^{-1}
Integer power: F(X) = X^k, F′(X) = ∑_{i=0}^{k−1} X^i ⊗ (X^{k−1−i})^T
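A numerical spot-check of the square rule (our sketch, using the row-major vec convention of the matrix-matrix derivative):

```python
import numpy as np

d, h = 3, 1e-7
X = np.random.randn(d, d)
J = np.zeros((d * d, d * d))
for idx in range(d * d):                 # finite-difference Jacobian of F(X) = X^2
    dX = np.zeros((d, d))
    dX.flat[idx] = h
    J[:, idx] = (((X + dX) @ (X + dX) - X @ X) / h).ravel()
I = np.eye(d)
print(np.allclose(J, np.kron(I, X.T) + np.kron(X, I), atol=1e-4))   # True
```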
First some simple identities:

    f(X) = trace(AX)       f′(X) = A^T
    f(X) = trace(AX^T)     f′(X) = A
    f(X) = log det(X)      f′(X) = X^{-T}

Then the chain rule for more general expressions:

    vec((∂/∂X log det(G(X)))^T) = (∂G/∂X)^T vec((G(X))^{-1})
    vec((∂/∂X trace(G(X)))^T) = (∂G/∂X)^T vec(I)
The chain rule to get scalar-matrix derivatives is awkward to use. Instead we have some short-cuts:

    ∂/∂X trace(G(X)H(X)) = ∂/∂X trace(H(Y)G(X) + H(X)G(Y)) |_{Y=X},

    ∂/∂X trace(A F^{-1}(X)) = −∂/∂X trace(F^{-1}(Y) A F^{-1}(Y) F(X)) |_{Y=X},

and

    ∂ log det(F(X)) / ∂X = ∂/∂X trace(F^{-1}(Y) F(X)) |_{Y=X}.
Use the formulas on the previous page to compute

    ∂/∂X trace((X − A)(X − B)^{-1}(X − C)).

The answer is:

    (X − C)^T (X − B)^{-T} + (X − B)^{-T} (X − A)^T
        − (X − B)^{-T} (X − A)^T (X − C)^T (X − B)^{-T}.
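The answer (including the sign of the last term) can be verified entry by entry with complex-step differentiation, since the expression is holomorphic in each entry of X. A sketch (our code):

```python
import numpy as np

d, h = 3, 1e-100
A, B, C, X = (np.random.randn(d, d) for _ in range(4))

def f(X):
    return np.trace((X - A) @ np.linalg.inv(X - B) @ (X - C))

G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        dX = np.zeros((d, d), dtype=complex)
        dX[i, j] = 1j * h
        G[i, j] = np.imag(f(X + dX)) / h     # complex-step partial derivative

Bi = np.linalg.inv(X - B).T                  # (X - B)^{-T}
expected = (X - C).T @ Bi + Bi @ (X - A).T - Bi @ (X - A).T @ (X - C).T @ Bi
print(np.allclose(G, expected))              # True
```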
Finally, if

    r(x) = q(x)/p(x) = (∑_{i=1}^n a_i x^i) / (∑_{j=1}^n b_j x^j)

is a scalar-scalar function, we can form a matrix-matrix function by simply substituting a matrix X for the scalar x. Then

    ∂/∂X trace(r(X)) = (r′(X))^T,

and if h(x) = log(r(x)), then

    ∂/∂X log det(r(X)) = (h′(X))^T.
Compute f(X) = trace(F(X)) = trace(((I + X)^{-1} X^T) X).

[The slides draw this as an expression tree with X and I at the leaves.] The tree is evaluated bottom-up, storing the intermediate results:

    T1 = I + X
    T2 = T1^{-1}
    T3 = X^T
    T4 = T2 T3
    T5 = T4 X
    f  = trace(T5)
Forward mode computation of f′(X): matrix-matrix derivatives are propagated up the tree, starting from I ⊗ I at the X leaves and 0 at the constant leaf I:

    T1′ = 0 + I ⊗ I                        (sum rule: (G + H)′ = G′ + H′)
    T2′ = −(T2 ⊗ T2^T) T1′                 (chain rule: (T1^{-1})′ = −(T1^{-1} ⊗ T1^{-T}) T1′)
    T3′ = I ⊠ I                            (transpose rule: (X^T)′ = I ⊠ I)
    T4′ = (I ⊗ T3^T) T2′ + (T2 ⊗ I) T3′    (product rule: (GH)′ = (I ⊗ H^T)G′ + (G ⊗ I)H′)
    T5′ = (I ⊗ X^T) T4′ + (T4 ⊗ I)(I ⊗ I)  (product rule)
    f′  = vec^{-1}(vec^T(I) T5′)           (using ((trace(F))′)^T = vec^{-1}(vec^T(I) F′))
Forward mode differentiation:
- Requires huge intermediate matrices.
- Only the last step reduces the size.

Critical points:
- F′(X) is composed of Kronecker, box (and outer) products for a large class of functions. (wow!)
- These can be “unwound” by multiplication with vectorized scalar-matrix derivatives (gasp!):

      vec^T(C)(A ⊗ B) = vec^T(B^T C A)

Reverse mode differentiation:
- Evaluate the derivative from top to bottom.
- Small scalar-matrix derivatives are propagated down the tree.
Reverse Mode Differentiation: (f′)^T = vec^{-1}(vec^T(I) F′).

Starting from R0 = I at the root, small scalar-matrix quantities R_i are propagated down the tree, unwinding each node's Kronecker or box factor with identities such as vec^T(R)(I ⊗ X^T) = vec^T(XR):

    R0 = I                        (root: trace)
    R1 = X R0,  R2 = R0 T4        (product node T5 = T4 X)
    R3 = T3 R1,  R4 = R1 T2       (product node T4 = T2 T3)
    R5 = −T2 R3 T2                (inverse node T2 = T1^{-1}: vec^T(R3)(−T2 ⊗ T2^T) = vec^T(−T2 R3 T2))
    R6 = R4^T                     (transpose node T3 = X^T: vec^T(R4)(I ⊠ I) = vec^T(R4^T))
    R7 = 0,  R8 = R5              (sum node T1 = I + X: the constant branch contributes 0)

Collecting the contributions at the X leaves:

    f′(X) = R2^T + R6^T + R8^T.
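The whole reverse sweep fits in a few lines of NumPy (a sketch following the slides' naming, checked against a finite-difference gradient):

```python
import numpy as np

d = 4
X = np.random.randn(d, d)
I = np.eye(d)

# Forward sweep: evaluate the tree, storing intermediates.
T1 = I + X
T2 = np.linalg.inv(T1)
T3 = X.T
T4 = T2 @ T3
T5 = T4 @ X                      # f = trace(T5)

# Reverse sweep: propagate small matrices down the tree.
R0 = I
R1, R2 = X @ R0, R0 @ T4         # product node T5 = T4 * X
R3, R4 = T3 @ R1, R1 @ T2        # product node T4 = T2 * T3
R5 = -T2 @ R3 @ T2               # inverse node T2 = T1^{-1}
R6 = R4.T                        # transpose node T3 = X^T
R8 = R5                          # sum node T1 = I + X (constant branch gives 0)
grad = R2.T + R6.T + R8.T

# Finite-difference check of the gradient.
def fval(X):
    return np.trace(np.linalg.inv(np.eye(d) + X) @ X.T @ X)

h, G = 1e-6, np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = h
        G[i, j] = (fval(X + E) - fval(X)) / h
print(np.allclose(grad, G, atol=1e-4))   # True
```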
Optimization: Derivative Patterns · Hessian Forms · Newton's Method · Future Work
There is growing interest in machine learning in scalar-matrix objective functions:

- Probabilistic graphical models
- Covariance selection
- Optimization in graphs and networks
- Data mining in social networks

Can we use this theory to help optimize such functions? We are on the look-out for interesting problems.
The Anatomy of a Matrix-Matrix Function

Theorem. Let R(X) be a rational matrix-matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)^{-1}) and transposition ((·)^T). Then the derivative of the matrix-matrix function is of the form

    ∑_{i=1}^{k1} A_i ⊗ B_i + ∑_{i=k1+1}^{K} A_i ⊠ B_i.

The matrices A_i and B_i are computed as parts of R(X).
Hessian Forms

The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore, the Hessian is of the form

    ∑_{i=1}^{k1} A_i ⊗ B_i + ∑_{i=k1+1}^{K} A_i ⊠ B_i,

where K is the number of times X occurs in the expression for the derivative. If the A_i, B_i are d×d matrices, then multiplication by H can be done with O(Kd^3) operations. Generalizations of this result can be found in the paper.
Let f: R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).

Derivative: G = f′(X) ∈ R^{d×d}; Hessian: H = f′′(X) ∈ R^{d^2×d^2}.

The Newton direction vec(V) = H^{-1} vec(G) can be computed efficiently if K ≪ d:

    K     Algorithm                             Complexity       Storage
    1     (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}        O(d^3)           O(Kd^2)
    2     Bartels-Stewart algorithm             O(d^3)           O(Kd^2)
    ≥ 3   Conjugate gradient algorithm          O(Kd^5)          O(Kd^2)
    ≥ d   General matrix inversion              O(d^6 + Kd^4)    O(d^4)

The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + X^T B = C.
What about optimizing functions of the form f(X) + ‖vec(X)‖_1? Common approaches:

    Strategy          Method               Uses Hessian structure?
    Newton-Lasso      Coordinate descent   yes
    Newton-Lasso      FISTA                yes
    Orthantwise ℓ1    CG                   yes
    Orthantwise ℓ1    L-BFGS               no

It is not obvious how to take advantage of the Hessian structure for all these methods. For orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.
Future Work

1. We have applied this methodology to the covariance selection problem; a paper is forthcoming.
2. We are digging deeper into the theory of matrix differentiation and the properties of box products.
3. We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!