Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group, Data61 | CSIRO
and College of Engineering and Computer Science
The Australian National University
Canberra, February – June 2018
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part III
Linear Algebra
Linear Curve Fitting - Input Specification
N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, i = 1, . . . , N
t_i ∈ R, i = 1, . . . , N

[Figure: scatter plot of the N = 10 data points, with x on the horizontal axis and t on the vertical axis.]
Strategy in this course
Estimate best predictor = training = learning.
Given data (x_1, y_1), . . . , (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results
Linear Curve Fitting - Least Squares
N = 10
x ≡ (x_1, . . . , x_N)^T,  t ≡ (t_1, . . . , t_N)^T,  x_i, t_i ∈ R,  i = 1, . . . , N

Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x 1]
Least-squares solution: w∗ = (X^T X)^{-1} X^T t

[Figure: the fitted line y(x, w∗) drawn through the N = 10 data points (x, t).]
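A minimal NumPy sketch of this least-squares fit. The data below is made-up illustration data (the generating slope, intercept and noise level are my own choices, not the data set from the slides):

    import numpy as np

    # Hypothetical data standing in for the N = 10 points shown on the slide.
    rng = np.random.default_rng(0)
    N = 10
    x = np.linspace(0.0, 10.0, N)
    t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

    # Design matrix X = [x 1] and the normal-equations solution w* = (X^T X)^{-1} X^T t.
    X = np.column_stack([x, np.ones(N)])
    w_star = np.linalg.solve(X.T @ X, X.T @ t)   # [w1, w0]

    # Fitted line y(x, w*) = w1 * x + w0, evaluated at the data points.
    y_fit = X @ w_star

In practice np.linalg.lstsq(X, t) is usually preferred over forming X^T X explicitly, since it is numerically more stable.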
Intuition
Arithmetic: zero, positive/negative numbers; addition and multiplication.
Geometry: points and lines; vector addition and scaling. Humans have experience with 3 dimensions (less so with 1 or 2, though).
Generalisation to N dimensions (possibly N → ∞): a line becomes a vector space V, a point becomes a vector x ∈ V. Example: X ∈ R^{n×m}.
The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic.
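As a quick illustration of this isomorphism, here is a small NumPy sketch (the particular 2×3 matrix is just an example of mine); any fixed ordering of the entries gives a bijection between R^{n×m} and R^{n·m}:

    import numpy as np

    X = np.array([[1., 2., 3.],
                  [4., 5., 6.]])   # a matrix in R^{2x3}
    v = X.reshape(-1)              # the corresponding vector in R^{6}
    X_back = v.reshape(2, 3)       # and back again, losing nothing

    assert np.array_equal(X, X_back)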
Vector Space V, underlying Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following holds:
Associativity of addition: u + (v + w) = (u + v) + w.
Commutativity of addition: v + w = w + v.
Identity element of addition: ∃ 0 such that v + 0 = v, ∀ v ∈ V.
Inverse of addition: for all v ∈ V, ∃ w ∈ V such that v + w = 0. The additive inverse is denoted −v.
Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.
Matrix-Vector Multiplication
    # A is an m x n matrix, V a vector of length n, R the result vector of length m
    for i in range(m):
        R[i] = 0.0
        for j in range(n):
            R[i] = R[i] + A[i, j] * V[j]

Listing 1: Code for elementwise matrix-vector multiplication.
A V = R:

    [ a_11 a_12 ... a_1n ] [ v_1 ]   [ a_11 v_1 + a_12 v_2 + ... + a_1n v_n ]
    [ a_21 a_22 ... a_2n ] [ v_2 ] = [ a_21 v_1 + a_22 v_2 + ... + a_2n v_n ]
    [ ...                ] [ ... ]   [ ...                                  ]
    [ a_m1 a_m2 ... a_mn ] [ v_n ]   [ a_m1 v_1 + a_m2 v_2 + ... + a_mn v_n ]
Matrix-Vector Multiplication

The same product A V = R can also be accumulated column by column:

    R = A[:, 0] * V[0]
    for j in range(1, n):
        R += A[:, j] * V[j]

Listing 3: Code for columnwise matrix-vector multiplication.
Matrix-Vector Multiplication
Denote the n columns of A by a1, a2, . . . , an
Each ai is now a (column) vector
A V = [a_1 a_2 . . . a_n] (v_1, v_2, . . . , v_n)^T = a_1 v_1 + a_2 v_2 + · · · + a_n v_n = R
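A small check of my own (not from the slides) that the elementwise loop of Listing 1, the columnwise accumulation of Listing 3, and NumPy's built-in product all compute the same R = A V:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 4, 3
    A = rng.normal(size=(m, n))
    V = rng.normal(size=n)

    # Elementwise (row by row), as in Listing 1.
    R_elem = np.zeros(m)
    for i in range(m):
        for j in range(n):
            R_elem[i] += A[i, j] * V[j]

    # Columnwise (linear combination of the columns of A), as in Listing 3.
    R_col = np.zeros(m)
    for j in range(n):
        R_col += A[:, j] * V[j]

    assert np.allclose(R_elem, R_col)
    assert np.allclose(R_elem, A @ V)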
Transpose of R
Given R = AV,

R = a_1 v_1 + a_2 v_2 + · · · + a_n v_n

What is R^T?

R^T = v_1 a_1^T + v_2 a_2^T + · · · + v_n a_n^T

NOT equal to A^T V^T! (In fact, A^T V^T is not even defined.)
Transpose
Reverse order rule: (AV)^T = V^T A^T

V^T A^T = R^T:

[v_1 v_2 . . . v_n] [a_1^T; a_2^T; . . . ; a_n^T] = v_1 a_1^T + v_2 a_2^T + · · · + v_n a_n^T

where the a_i^T are the rows of A^T, i.e. the transposed columns of A.
Trace
The trace of a square matrix A is the sum of all diagonal elements of A:

tr{A} = ∑_{k=1}^{n} A_kk

Example:

    A = [ 1 2 3 ]
        [ 4 5 6 ]
        [ 7 8 9 ]

tr{A} = 1 + 5 + 9 = 15
The trace does not exist for a non-square matrix.
vec(X) operator

Define vec(X) as the vector which results from stacking all columns of the matrix X on top of each other.
Example:

    A = [ 1 2 3 ]
        [ 4 5 6 ]
        [ 7 8 9 ]

vec(A) = (1, 4, 7, 2, 5, 8, 3, 6, 9)^T

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

tr{X^T Y} = vec(X)^T vec(Y)
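A quick numerical check of this identity (a sketch; the random matrices are my own choice). NumPy's reshape is row-major by default, so stacking columns uses order='F':

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 4, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, p))

    vec = lambda M: M.reshape(-1, order="F")   # stack the columns on top of each other

    lhs = np.trace(X.T @ Y)
    rhs = vec(X) @ vec(Y)
    assert np.isclose(lhs, rhs)                # tr{X^T Y} = vec(X)^T vec(Y)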
Inner Product
A mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

〈·, ·〉 : V × V → F

Conjugate symmetry: 〈x, y〉 equals the complex conjugate of 〈y, x〉.
Linearity: 〈ax, y〉 = a〈x, y〉, and 〈x + y, z〉 = 〈x, z〉 + 〈y, z〉.
Positive-definiteness: 〈x, x〉 ≥ 0, and 〈x, x〉 = 0 for x = 0 only.
Canonical Inner Products
Inner product for real numbers x, y ∈ R:

〈x, y〉 = x y

Dot product between two vectors: given x, y ∈ R^n,

〈x, y〉 = x^T y ≡ ∑_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + · · · + x_n y_n

Canonical inner product for matrices: given X, Y ∈ R^{n×p},

〈X, Y〉 = tr{X^T Y} = ∑_{k=1}^{p} (X^T Y)_kk = ∑_{k=1}^{p} ∑_{l=1}^{n} (X^T)_kl (Y)_lk = ∑_{k=1}^{p} ∑_{l=1}^{n} X_lk Y_lk
Calculations with Matrices
Denote the (i, j)th element of a matrix X by X_ij.
Transpose: (X^T)_ij = X_ji
Product: (XY)_ij = ∑_k X_ik Y_kj

Proof of the linearity of the canonical matrix inner product:

〈X + Y, Z〉 = tr{(X + Y)^T Z}
           = ∑_{k=1}^{p} ((X + Y)^T Z)_kk
           = ∑_{k=1}^{p} ∑_{l=1}^{n} ((X + Y)^T)_kl Z_lk = ∑_{k=1}^{p} ∑_{l=1}^{n} (X_lk + Y_lk) Z_lk
           = ∑_{k=1}^{p} ∑_{l=1}^{n} X_lk Z_lk + Y_lk Z_lk
           = ∑_{k=1}^{p} ∑_{l=1}^{n} (X^T)_kl Z_lk + (Y^T)_kl Z_lk
           = ∑_{k=1}^{p} (X^T Z)_kk + (Y^T Z)_kk
           = tr{X^T Z} + tr{Y^T Z}
           = 〈X, Z〉 + 〈Y, Z〉
Projection
In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

P^2 = P

[Figure: the point (a, b) in the x–y plane is mapped to (a − b, 0) on the x-axis.]

    A [ a ] = [ a − b ]     with    A = [ 1 −1 ]
      [ b ]   [   0   ]                 [ 0  0 ]
Orthogonal Projection
[Figure: vectors x and y with their projections Px and Py; the residual y − Py is orthogonal to the projection.]

Orthogonality: we need an inner product 〈x, y〉.
Choose two arbitrary vectors x and y. Then Px and y − Py are orthogonal:

0 = 〈Px, y − Py〉 = (Px)^T (y − Py) = x^T (P^T − P^T P) y
Orthogonal Projection
An orthogonal projection satisfies

P^2 = P    and    P = P^T

Example: given some unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

P = u u^T

Proof: P = P^T, and (since u^T u = 1)

P^2 x = (u u^T)(u u^T) x = u u^T x = P x
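A sketch of this line projection in NumPy (the vectors u and x are my own examples; u is normalised to unit length):

    import numpy as np

    u = np.array([3.0, 4.0])
    u = u / np.linalg.norm(u)      # unit vector characterising the line

    P = np.outer(u, u)             # P = u u^T
    x = np.array([2.0, 7.0])

    assert np.allclose(P @ P, P)   # P^2 = P
    assert np.allclose(P, P.T)     # P = P^T
    x_proj = P @ x                 # projection of x onto the line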
Orthogonal Projection
Given a matrix A ∈ R^{n×p} and a vector x, what is the closest point x̃ to x in the column space of A?
Orthogonal projection onto the column space of A (assuming A has full column rank, so that A^T A is invertible):

x̃ = A (A^T A)^{-1} A^T x

Projection matrix:

P = A (A^T A)^{-1} A^T

Proof:

P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P

Orthogonal projection?

P^T = (A (A^T A)^{-1} A^T)^T = A (A^T A)^{-1} A^T = P
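A sketch of the column-space projection (the random A and x are my own illustration; a random tall Gaussian matrix has full column rank with probability one, so A^T A is invertible):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(5, 2))            # tall matrix with full column rank
    x = rng.normal(size=5)

    P = A @ np.linalg.inv(A.T @ A) @ A.T   # P = A (A^T A)^{-1} A^T
    x_tilde = P @ x                        # closest point to x in the column space of A

    assert np.allclose(P @ P, P)           # idempotent
    assert np.allclose(P, P.T)             # symmetric, hence an orthogonal projection
    assert np.allclose(A.T @ (x - x_tilde), 0)   # residual is orthogonal to the columns of A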
Linear Curve Fitting - Least Squares (revisited)

For the model y(x, w) = w_1 x + w_0 with X ≡ [x 1], the least-squares solution w∗ = (X^T X)^{-1} X^T t gives fitted values X w∗ = X (X^T X)^{-1} X^T t — exactly the orthogonal projection of the targets t onto the column space of X.
Linear Independence of Vectors
A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.
Given a vector space V, k vectors v_1, . . . , v_k ∈ V and scalars a_1, . . . , a_k: the vectors are linearly independent if and only if

a_1 v_1 + a_2 v_2 + · · · + a_k v_k = 0 ≡ (0, 0, . . . , 0)^T

has only the trivial solution

a_1 = a_2 = · · · = a_k = 0
Rank
The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal, and are called the rank of A.
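Numerically, the rank can be computed with NumPy (a sketch; the example matrix is mine, and its second column is twice the first, so the rank is 2 rather than 3):

    import numpy as np

    A = np.array([[1., 2., 0.],
                  [2., 4., 1.],
                  [3., 6., 5.]])
    print(np.linalg.matrix_rank(A))   # 2, because column 2 = 2 * column 1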
Determinant
Let S_n be the set of all permutations of the numbers {1, . . . , n}. The determinant of a square matrix A is then given by

det{A} = ∑_{σ ∈ S_n} sgn(σ) ∏_{i=1}^{n} A_{i,σ(i)}

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

det{AB} = det{A} det{B} for A, B square matrices.
det{A^{-1}} = det{A}^{-1}.
det{A^T} = det{A}.
det{cA} = c^n det{A} for A ∈ R^{n×n} and c ∈ R.
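A numerical spot-check of these determinant properties (a sketch with random matrices of my own choosing):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 3
    A = rng.normal(size=(n, n))
    B = rng.normal(size=(n, n))
    c = 2.5

    assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
    assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
    assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
    assert np.isclose(np.linalg.det(c * A), c**n * np.linalg.det(A))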
Matrix Inverse
Identity matrix I:

I = A A^{-1} = A^{-1} A

The matrix inverse A^{-1} only exists for square matrices which are NOT singular.
A singular matrix has at least one eigenvalue equal to zero, and determinant |A| = 0.
Inversion 'reverses the order' (like transposition):

(AB)^{-1} = B^{-1} A^{-1}

(Only then is (AB)(AB)^{-1} = (AB) B^{-1} A^{-1} = A A^{-1} = I.)
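A short check of the reverse-order rule (random matrices of my choosing), together with a practical note: numerical code usually avoids forming A^{-1} explicitly and solves the linear system instead:

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.normal(size=(3, 3))
    B = rng.normal(size=(3, 3))

    assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

    # Preferred in practice: solve A x = b directly rather than computing inv(A) @ b.
    b = rng.normal(size=3)
    x = np.linalg.solve(A, b)
    assert np.allclose(A @ x, b)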
Matrix Inverse and Transpose
What is (A^{-1})^T?
We need to assume that A^{-1} exists.
Rule: (A^{-1})^T = (A^T)^{-1}
Useful Identity
(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

How to analyse and prove such an equation?
A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
Same for C^{-1}, so let's assume C ∈ R^{m×m}.
Therefore, B ∈ R^{m×n}.
Prove the Identity
(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

Multiply both sides on the right by (B A B^T + C):

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T

Simplify the left-hand side:

(A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B (A B^T) + B^T]
  = (A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B + A^{-1}] A B^T
  = A B^T
Woodbury Identity
(A + B D^{-1} C)^{-1} = A^{-1} − A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns) and C is very wide (few rows, many columns).
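A numerical check of the Woodbury identity in exactly that regime, with a large diagonal A and low-rank B, C (all matrices here are my own random examples; only a small k×k matrix needs to be inverted on the right-hand side):

    import numpy as np

    rng = np.random.default_rng(6)
    n, k = 50, 3                                  # big n, small k
    A = np.diag(rng.uniform(1.0, 2.0, size=n))    # diagonal, easy to invert
    B = rng.normal(size=(n, k))                   # very tall
    C = rng.normal(size=(k, n))                   # very wide
    D = np.eye(k)

    lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)

    A_inv = np.diag(1.0 / np.diag(A))             # invert the diagonal directly
    small = np.linalg.inv(D + C @ A_inv @ B)      # only a k x k inverse
    rhs = A_inv - A_inv @ B @ small @ C @ A_inv

    assert np.allclose(lhs, rhs)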
Manipulating Matrix Equations
Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by non-square matrices.
Don't assume a matrix inverse exists just because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a·0 = b·0.
Eigenvectors
Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs

A x = λ x

where x ∈ C^n and λ ∈ C.
Example:

    [  0 1 ] x = λ x
    [ −1 0 ]

λ = {−ı, ı}

x = { (ı, 1)^T, (−ı, 1)^T }
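The same example in NumPy (a sketch; np.linalg.eig returns the eigenvalues and the eigenvectors as columns, normalised to unit length, so the vectors may differ from those above by a scalar factor and the ordering may vary):

    import numpy as np

    A = np.array([[0., 1.],
                  [-1., 0.]])
    eigvals, eigvecs = np.linalg.eig(A)
    print(eigvals)                     # approximately [1j, -1j]

    # Each column of eigvecs satisfies A x = lambda x.
    for lam, x in zip(eigvals, eigvecs.T):
        assert np.allclose(A @ x, lam * x)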
Eigenvectors
How many eigenvalue/eigenvector pairs?

A x = λ x

is equivalent to

(A − λI) x = 0

which has a non-trivial solution only for det{A − λI} = 0: a polynomial of nth order in λ, with at most n distinct solutions.
Real Eigenvalues
How can we enforce real eigenvalues?
Let's look at matrices with complex entries A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint, e.g.

    [ 1 + ı2   3 + ı4 ]^H  =  [ 1 − ı2   5 − ı6 ]
    [ 5 + ı6   7 + ı8 ]       [ 3 − ı4   7 − ı8 ]

Denote the complex conjugate of a complex number λ by λ̄.
Real Eigenvalues
How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A).
Calculate

x^H A x = λ x^H x

for an eigenvector x ∈ C^n of A.
Another way to calculate x^H A x:

x^H A x = x^H A^H x       (A is Hermitian)
        = (x^H A x)^H     (reverse order)
        = (λ x^H x)^H     (eigenvalue)
        = λ̄ x^H x

and therefore λ = λ̄ (λ is real).

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
Singular Value Decomposition
Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

A = U Σ V^T

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
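A sketch of the decomposition in NumPy (the random matrix is my own example; np.linalg.svd returns V^T directly, and full_matrices=True gives square U and V as in the statement above):

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 5, 3
    A = rng.normal(size=(n, p))

    U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: n x n, s: min(n, p) values, Vt: p x p

    Sigma = np.zeros((n, p))
    Sigma[:p, :p] = np.diag(s)                        # nonnegative singular values on the diagonal

    assert np.allclose(U @ Sigma @ Vt, A)
    assert np.allclose(U.T @ U, np.eye(n))
    assert np.allclose(Vt @ Vt.T, np.eye(p))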
Inverse of a matrix
Assume a full-rank symmetric real matrix A. Then A = U^T Λ U, where
Λ is a diagonal matrix with real eigenvalues, and
U contains the eigenvectors.

A^{-1} = (U^T Λ U)^{-1}
       = U^{-1} Λ^{-1} U^{-T}     (the inverse reverses the order)
       = U^T Λ^{-1} U             (U^T U = I)

The inverse of a diagonal matrix is obtained by taking the reciprocal of each diagonal element.
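A sketch of this computation in NumPy. Note the convention difference: np.linalg.eigh returns the eigenvectors as the columns of a matrix Q, so A = Q Λ Q^T and hence A^{-1} = Q Λ^{-1} Q^T (the random symmetric matrix below is my own example, shifted to make it full rank):

    import numpy as np

    rng = np.random.default_rng(8)
    M = rng.normal(size=(4, 4))
    A = M @ M.T + 4.0 * np.eye(4)        # symmetric and full rank

    eigvals, Q = np.linalg.eigh(A)       # A = Q diag(eigvals) Q^T, eigenvalues are real

    A_inv = Q @ np.diag(1.0 / eigvals) @ Q.T   # invert the diagonal elementwise
    assert np.allclose(A_inv, np.linalg.inv(A))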
How to calculate a gradient?
Recall the gradient of a univariate function f : R → R.
Example:

f(x) = x^4 + 7x^3 + 5x^2 − 17x + 3

whose gradient (derivative) is given by

df/dx = 4x^3 + 21x^2 + 10x − 17

[Figure: plot of the objective x^4 + 7x^3 + 5x^2 − 17x + 3 against the value of the parameter x.]
Gradients and optima
We find the stationary points by calculating the derivative, setting it to zero and solving the resulting equation.
Example:

df/dx = 4x^3 + 21x^2 + 10x − 17 = 0

has three stationary points (the derivative is a cubic polynomial, and here all three of its roots are real).
We calculate the second derivative (d²f/dx²) to check whether a stationary point is a minimum (d²f/dx² > 0) or a maximum (d²f/dx² < 0).
The example has two minima and one maximum.
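The stationary points of the example can be found numerically, e.g. with np.roots (a sketch of my own; it simply solves the cubic 4x^3 + 21x^2 + 10x − 17 = 0 and classifies each root with the second derivative 12x^2 + 42x + 10):

    import numpy as np

    # Coefficients of df/dx, highest power first.
    roots = np.roots([4.0, 21.0, 10.0, -17.0])

    for x in np.sort(roots.real):
        curvature = 12 * x**2 + 42 * x + 10
        print(f"x = {x:+.3f}: {'minimum' if curvature > 0 else 'maximum'}")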
Gradient descent
We may not be able to solve the problem analytically and have to resort to gradient descent.
Idea: take a step in the negative gradient direction.
There are many different approaches to gradient descent.

[Figure: plot of f(x) against x.]
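A minimal gradient-descent sketch on the example objective from the previous slides (the starting point, step size and iteration count are arbitrary choices of mine):

    def f(x):
        return x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3

    def grad_f(x):
        return 4 * x**3 + 21 * x**2 + 10 * x - 17

    x = 2.0       # starting point
    step = 0.01   # step size (learning rate)
    for _ in range(200):
        x = x - step * grad_f(x)   # step in the negative gradient direction

    print(x, f(x))   # ends up near a local minimum of f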
Rules of differentiation
We denote the derivative of f as f′.

Sum rule:       (f(x) + g(x))′ = f′(x) + g′(x)
Product rule:   (f(x) g(x))′ = f′(x) g(x) + f(x) g′(x)
Quotient rule:  (f(x) / g(x))′ = (f′(x) g(x) − f(x) g′(x)) / (g(x))^2
Chain rule:     (g(f(x)))′ = (g ∘ f)′(x) = g′(f) f′(x)
How to calculate a gradient?
Given a multivariate function f : R^D → R:
Compute the partial derivative with respect to each element, and collect the results in the appropriate shape.
Use knowledge of scalar differentiation.
We recommend the convention that gradients are row vectors.
Example:

f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3

has partial derivatives

∂f(x_1, x_2)/∂x_1 = 2 x_1 x_2 + x_2^3
∂f(x_1, x_2)/∂x_2 = x_1^2 + 3 x_1 x_2^2

and therefore the gradient is (∈ R^{1×2})

∇_x f(x) = df/dx = [ ∂f(x_1, x_2)/∂x_1   ∂f(x_1, x_2)/∂x_2 ] = [ 2 x_1 x_2 + x_2^3   x_1^2 + 3 x_1 x_2^2 ].
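A quick finite-difference check of this gradient (a sketch; the test point and the step size h are arbitrary choices of mine):

    import numpy as np

    def f(x1, x2):
        return x1**2 * x2 + x1 * x2**3

    def grad(x1, x2):
        # Row-vector convention, as recommended above.
        return np.array([2 * x1 * x2 + x2**3, x1**2 + 3 * x1 * x2**2])

    x1, x2, h = 1.3, -0.7, 1e-6
    numeric = np.array([(f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
                        (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)])

    assert np.allclose(grad(x1, x2), numeric, atol=1e-5)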
Gradient with trace
We can combine matrix multiplication rules with partial differentiation.
Example: ∇_A tr{AB} = B^T
Recall (writing a_1, . . . , a_n for the rows of A and b_1, . . . , b_n for the columns of B):

tr{AB} = ∑_{i=1}^{m} a_1i b_i1 + ∑_{i=1}^{m} a_2i b_i2 + . . . + ∑_{i=1}^{m} a_ni b_in

Therefore:

∂ tr{AB} / ∂a_ij = b_ji
Books
Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, Steven R. Lay, Judi J. McDonald, "Linear Algebra and Its Applications", Pearson, 2015.
Jonathan S. Golan, "The Linear Algebra a Beginning Graduate Student Ought to Know", Springer, 2012.
Kaare B. Petersen, Michael S. Pedersen, "The Matrix Cookbook", 2012.