Outline

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra. Semester One, 2020.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part II
Introduction
Flavour of this course
Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Polynomial Curve Fitting
Some artificial data created from the function sin(2πx) plus random noise, for x between 0 and 1.

[Figure: the noisy data points over x ∈ [0, 1]; axes x and t.]
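A minimal sketch of how such toy data could be generated (assuming NumPy; the noise scale 0.3 and the fixed seed are illustrative choices, not the course's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

N = 10                          # number of data points
x = np.linspace(0, 1, N)        # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # targets = sin(2*pi*x) + Gaussian noise
```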
Polynomial Curve Fitting - Input Specification
N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, i = 1, . . . , N
t_i ∈ R, i = 1, . . . , N
Polynomial Curve Fitting - Model Specification
M : order of the polynomial

y(x, w) = w_0 + w_1 x + w_2 x² + · · · + w_M x^M = ∑_{m=0}^M w_m x^m

nonlinear function of x
linear function of the unknown model parameters w
How can we find good parameters w = (w_0, . . . , w_M)^T?
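Because the model is linear in w, evaluating y(x, w) reduces to a matrix-vector product with a design matrix whose columns are the powers of x. A minimal sketch, with illustrative helper names (design_matrix is an assumption of this sketch, not course code):

```python
import numpy as np

def design_matrix(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def y(x, w):
    """Evaluate the polynomial y(x, w) = sum_m w_m x^m."""
    return design_matrix(np.atleast_1d(x), len(w) - 1) @ w
```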
Learning is Improving Performance
[Figure: for each data point, the error measures the vertical displacement between the model prediction y(x_n, w) and the target t_n; axes x and t.]

Performance measure: error between the targets and the predictions of the model on the training data

E(w) = (1/2) ∑_{n=1}^N ( y(x_n, w) − t_n )²

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they?)
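Setting the gradient of E(w) to zero gives a linear least-squares problem; one sufficient condition for a unique minimum is that the design matrix has full column rank (e.g. distinct x_n and M + 1 ≤ N). A hedged sketch using NumPy's solver:

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimise E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)      # design matrix, columns x^0 .. x^M
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w_star
```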
Model Comparison or Model Selection
y(x, w) = ∑_{m=0}^M w_m x^m |_{M=0} = w_0

[Figure: the constant fit M = 0 to the data; axes x and t.]
Model Comparison or Model Selection
y(x, w) = ∑_{m=0}^M w_m x^m |_{M=1} = w_0 + w_1 x

[Figure: the linear fit M = 1 to the data; axes x and t.]
Model Comparison or Model Selection
y(x, w) = ∑_{m=0}^M w_m x^m |_{M=3} = w_0 + w_1 x + w_2 x² + w_3 x³

[Figure: the cubic fit M = 3 follows the underlying sin(2πx) closely; axes x and t.]
Model Comparison or Model Selection
y(x, w) = ∑_{m=0}^M w_m x^m |_{M=9} = w_0 + w_1 x + · · · + w_8 x⁸ + w_9 x⁹

overfitting

[Figure: the M = 9 fit passes through every data point but oscillates wildly between them; axes x and t.]
Testing the Model
Train the model and get w⋆
Get 100 new data points
Root-mean-square (RMS) error:

E_RMS = √( 2 E(w⋆) / N )

[Figure: E_RMS on the training and test sets as a function of the polynomial order M; the test error rises sharply at M = 9.]
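A sketch of the experiment behind the plot, carrying over the illustrative data-generation assumptions from earlier (the noise scale, seed, and use of linspace for the test inputs are all assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N, rng):
    """Toy data: sin(2*pi*x) plus Gaussian noise."""
    x = np.linspace(0, 1, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

x_train, t_train = make_data(10, rng)
x_test, t_test = make_data(100, rng)   # 100 new data points

for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    for name, x_, t_ in (("train", x_train, t_train), ("test", x_test, t_test)):
        pred = np.vander(x_, M + 1, increasing=True) @ w_star
        e_rms = np.sqrt(np.mean((pred - t_) ** 2))  # = sqrt(2 E(w*) / N)
        print(f"M={M} {name} E_RMS={e_rms:.3f}")
```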
Testing the Model
        M = 0    M = 1     M = 3         M = 9
w⋆_0     0.19     0.82      0.31          0.35
w⋆_1             -1.27      7.99        232.37
w⋆_2                      -25.43      -5321.83
w⋆_3                       17.37      48568.31
w⋆_4                                -231639.30
w⋆_5                                 640042.26
w⋆_6                               -1061800.52
w⋆_7                                1042400.18
w⋆_8                                -557682.99
w⋆_9                                 125201.43

Table: Coefficients w⋆ for polynomials of various order.
More Data
N = 15

[Figure: the M = 9 fit with N = 15 data points; the oscillations are reduced.]
More Data
N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: the M = 9 fit with N = 100 data points; the overfitting problem disappears.]
Regularisation
How can we constrain the growth of the coefficients w? Add a regularisation term to the error function:

E(w) = (1/2) ∑_{n=1}^N ( y(x_n, w) − t_n )² + (λ/2) ‖w‖²

Squared norm of the parameter vector w:

‖w‖² ≡ w^T w = w_0² + w_1² + · · · + w_M²

Unique minimum of E(w) at argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?)
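Setting the gradient of the regularised error to zero gives w⋆ = (λI + Φ^TΦ)^{−1} Φ^T t, which is unique for any λ > 0 since the matrix is then positive definite. A minimal sketch (penalising all coefficients including w_0, as the slide's ‖w‖² does; whether to exempt w_0 is a modelling choice):

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Minimise 0.5 * ||Phi w - t||^2 + 0.5 * lam * ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi  # positive definite for lam > 0
    return np.linalg.solve(A, Phi.T @ t)   # unique minimiser w*
```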
Regularisation
M = 9

[Figure: the regularised M = 9 fit with ln λ = −18; the wild oscillations are suppressed.]
Regularisation
M = 9

[Figure: the regularised M = 9 fit with ln λ = 0; the regularisation is too strong and the fit is overly smooth.]
Regularisation
M = 9

[Figure: E_RMS on the training and test sets as a function of ln λ, for ln λ between −35 and −20.]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!
Probability Theory
[Figure: histogram of a joint distribution p(X, Y) over nine values of X and two values of Y (Y = 1 and Y = 2).]
Probability Theory
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

[Figure: histogram of the joint distribution p(X, Y) corresponding to the counts above.]
Sum Rule
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

p(X = d) = ∑_Y p(X = d, Y)

p(X) = ∑_Y p(X, Y)
Sum Rule
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

p(X) = ∑_Y p(X, Y)
p(Y) = ∑_X p(X, Y)

[Figure: the marginal histograms p(X) and p(Y).]
Product Rule
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = ∑_X p(X, Y = 1) = 34/60

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is renormalisation of relative frequencies:

p(X | Y) = p(X, Y) / p(Y)
Sum and Product Rules
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

p(X) = ∑_Y p(X, Y)

p(X | Y) = p(X, Y) / p(Y)

[Figure: the marginal p(X) and the conditional p(X | Y = 1) as histograms over X.]
Sum Rule and Product Rule
Sum Rule
p(X) = ∑_Y p(X, Y)

Product Rule
p(X, Y) = p(X | Y) p(Y)

These rules form the basis of Bayesian machine learning, and this course!
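A sketch verifying both rules numerically on the count table above (assuming NumPy; the row order Y = 1 then Y = 2 is a choice made here for illustration):

```python
import numpy as np

# Joint counts from the table; rows are Y = 1 and Y = 2, columns a..i.
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
p_xy = counts / counts.sum()       # joint p(X, Y) from 60 observations

p_x = p_xy.sum(axis=0)             # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)             # sum rule: p(Y) = sum_X p(X, Y)
p_x_given_y = p_xy / p_y[:, None]  # renormalised rows: p(X | Y) = p(X, Y) / p(Y)

assert np.allclose(p_x_given_y * p_y[:, None], p_xy)  # product rule: p(X, Y) = p(X | Y) p(Y)
print(p_x_given_y[0, 3])           # p(X = d | Y = 1) = 8/34
```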
Bayes Theorem
Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes Theorem (only defined for p(X) > 0):

p(Y | X) = p(X | Y) p(Y) / p(X)

and

p(X) = ∑_Y p(X, Y)          (sum rule)
     = ∑_Y p(X | Y) p(Y)    (product rule)
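Continuing the sketch, Bayes' theorem reverses the conditioning on the same count table (the column index for X = d is an assumption of this illustrative layout):

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
p_xy = counts / counts.sum()
p_y = p_xy.sum(axis=1)
p_x_given_y = p_xy / p_y[:, None]

# Bayes: p(Y = 1 | X = d) = p(X = d | Y = 1) p(Y = 1) / p(X = d)
d = 3                                       # column index of X = d
p_x_d = (p_x_given_y[:, d] * p_y).sum()     # denominator via sum + product rules
posterior = p_x_given_y[0, d] * p_y[0] / p_x_d
print(posterior)                            # 8/9, since column d has counts 8 (Y=1) and 1 (Y=2)
```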
Probability Densities
Real-valued variable x ∈ R
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.

p(x ∈ (a, b)) = ∫_a^b p(x) dx

[Figure: a density p(x) and its cumulative distribution function P(x); the shaded strip has width δx.]
Constraints on p(x)
Nonnegative:
p(x) ≥ 0

Normalisation:
∫_{−∞}^{∞} p(x) dx = 1

[Figure: a density p(x) and its cumulative distribution function P(x).]
Cumulative distribution function P(x)
P(x) = ∫_{−∞}^x p(z) dz

or

(d/dx) P(x) = p(x)

[Figure: a density p(x) and its cumulative distribution function P(x).]
Multivariate Probability Density
Vector x ≡ (x_1, . . . , x_D)^T

Nonnegative:
p(x) ≥ 0

Normalisation:
∫_{−∞}^{∞} p(x) dx = 1

This means

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} p(x) dx_1 . . . dx_D = 1.
Sum and Product Rule for Probability Densities
Sum Rule
p(x) = ∫_{−∞}^{∞} p(x, y) dy

Product Rule
p(x, y) = p(y | x) p(x)
Expectations
Weighted average of a function f(x) under the probability distribution p(x):

E[f] = ∑_x p(x) f(x)        (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx       (probability density p(x))
How to approximate E[f]

Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] ≃ (1/N) ∑_{n=1}^N f(x_n)

How to draw points from a probability distribution p(x)? See the upcoming lecture on "Sampling".
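A minimal sketch of the finite-sum approximation, using NumPy's Gaussian sampler as a stand-in for drawing from p(x); the choices f(x) = x² and p(x) = N(x | 0, 1) are illustrative, and for them the exact answer is E[f] = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[f] with f(x) = x^2 under p(x) = N(x | 0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # draws x_n from p(x)
estimate = np.mean(samples ** 2)                        # (1/N) sum_n f(x_n)
print(estimate)                                         # close to 1 for large N
```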
Expectation of a function of several variables

Arbitrary function f(x, y):

E_x[f(x, y)] = ∑_x p(x) f(x, y)        (discrete distribution p(x))
E_x[f(x, y)] = ∫ p(x) f(x, y) dx       (probability density p(x))

Note that E_x[f(x, y)] is a function of y.
Conditional Expectation
Arbitrary function f(x):

E_x[f | y] = ∑_x p(x | y) f(x)        (discrete distribution p(x))
E_x[f | y] = ∫ p(x | y) f(x) dx       (probability density p(x))

Note that E_x[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why?)

E_y[E_x[f(x) | y]] = ∑_y p(y) E_x[f | y] = ∑_y p(y) ∑_x p(x | y) f(x)
                   = ∑_{x,y} f(x) p(x, y) = ∑_x f(x) p(x)
                   = E_x[f(x)]
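A sketch checking this tower property on the earlier count table (the function f over the columns is arbitrary, chosen here only for illustration):

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
p_xy = counts / counts.sum()
p_y = p_xy.sum(axis=1)
p_x = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[:, None]

f = np.arange(9.0) ** 2              # arbitrary f(x), indexed by column
inner = p_x_given_y @ f              # E_x[f | y], one value per y
assert np.isclose(p_y @ inner, p_x @ f)  # E_y[E_x[f | y]] == E_x[f]
```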
Variance
Arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

Special case f(x) = x:

var[x] = E[(x − E[x])²] = E[x²] − E[x]²
Covariance
Two random variables x ∈ R and y ∈ R:

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])]
          = E_{x,y}[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_{x,y}[x] − a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] − a b − a b + a b = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]

(using E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)

Expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
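A sketch checking that the two forms of the covariance agree on samples (the linear relationship y = 2x + noise is illustrative; its true covariance with x is 2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)                 # correlated with x

cov_def = np.mean((x - x.mean()) * (y - y.mean()))   # E[(x - E[x])(y - E[y])]
cov_alt = np.mean(x * y) - x.mean() * y.mean()       # E[xy] - E[x]E[y]
assert np.isclose(cov_def, cov_alt)
print(cov_def)                                       # close to 2
```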
Covariance for Vector Valued Variables
Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])]
          = E_{x,y}[x y^T] − E[x] E[y^T]
The Gaussian Distribution
x ∈ R
Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = (2πσ²)^{−1/2} exp{ −(x − µ)² / (2σ²) }

[Figure: the Gaussian density N(x | µ, σ²), centred at µ with width governed by σ.]
The Gaussian Distribution
N(x | µ, σ²) > 0
∫_{−∞}^{∞} N(x | µ, σ²) dx = 1

Expectation over x:
E[x] = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ

Expectation over x²:
E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²

Variance of x:
var[x] = E[x²] − E[x]² = σ²
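A sketch checking the normalisation and the first two moments by simple numerical integration on a grid (the grid limits, spacing, and the values of µ and σ are pragmatic assumptions):

```python
import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 100_001)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

dx = x[1] - x[0]                    # Riemann-sum approximation of the integrals
print((pdf * dx).sum())             # ~1               (normalisation)
print((pdf * x * dx).sum())         # ~mu              (E[x])
print((pdf * x ** 2 * dx).sum())    # ~mu^2 + sigma^2  (E[x^2])
```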
Strategy in this course
Estimate best predictor = training = learning
Given data (x_1, y_1), . . . , (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results