Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University
Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part II
Introduction
Flavour of this course
Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Polynomial Curve Fitting
Some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

[Figure: the sampled data points (x, t), with t plotted against x.]
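Such a dataset can be generated in a few lines. A minimal sketch; the noise standard deviation (0.3) and the random seed are illustrative assumptions, not values stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10
x = rng.uniform(0.0, 1.0, size=N)                          # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # targets: sin(2*pi*x) + noise
```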
Polynomial Curve Fitting - Input Specification
N = 10
x ≡ (x_1, …, x_N)^T
t ≡ (t_1, …, t_N)^T
x_i ∈ R, i = 1, …, N
t_i ∈ R, i = 1, …, N
Polynomial Curve Fitting - Model Specification
M : order of the polynomial

y(x, w) = w_0 + w_1 x + w_2 x² + ⋯ + w_M x^M = ∑_{m=0}^{M} w_m x^m

a nonlinear function of x
a linear function of the unknown model parameters w
How can we find good parameters w = (w_0, w_1, …, w_M)^T?
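The model y(x, w) is just a dot product between the powers of x and the coefficient vector w, which is what makes it linear in w. A sketch (the coefficient values are hypothetical):

```python
import numpy as np

def poly(x, w):
    """Evaluate y(x, w) = sum_m w_m * x**m for an array of inputs x."""
    M = len(w) - 1
    # Design matrix with columns 1, x, x^2, ..., x^M (one row per input).
    powers = np.vander(np.atleast_1d(x), M + 1, increasing=True)
    return powers @ np.asarray(w)

w = [1.0, -2.0, 0.5]                       # M = 2, hypothetical coefficients
y = poly(np.array([0.0, 0.5, 1.0]), w)     # linear in w, nonlinear in x
```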
Learning is Improving Performance
[Figure: for a data point (x_n, t_n), the error is the vertical distance between the model prediction y(x_n, w) and the target t_n.]

Performance measure: the error between the targets and the predictions of the model on the training data,

E(w) = (1/2) ∑_{n=1}^{N} (y(x_n, w) − t_n)²

Unique minimum of E(w) at the argument w*, under certain conditions (what are they?)
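Because y(x, w) is linear in w, minimising E(w) is a linear least-squares problem over the design matrix Φ with entries Φ_{nm} = x_n^m. A sketch, with illustrative data (the seed and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M = 3                                             # polynomial order
Phi = np.vander(x, M + 1, increasing=True)        # design matrix, Phi[n, m] = x_n**m
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimises sum_n (Phi @ w - t)**2

E = 0.5 * np.sum((Phi @ w_star - t) ** 2)         # training error at the minimum
```

At the minimum, the gradient of E(w) vanishes, i.e. Φ^T(Φw* − t) = 0 (the normal equations).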
Model Comparison or Model Selection
y(x, w) = ∑_{m=0}^{M} w_m x^m │_{M=0} = w_0

[Figure: fit with M = 0 to the data points.]
y(x, w) = ∑_{m=0}^{M} w_m x^m │_{M=1} = w_0 + w_1 x

[Figure: fit with M = 1 to the data points.]
y(x, w) = ∑_{m=0}^{M} w_m x^m │_{M=3} = w_0 + w_1 x + w_2 x² + w_3 x³

[Figure: fit with M = 3 to the data points.]
y(x, w) = ∑_{m=0}^{M} w_m x^m │_{M=9} = w_0 + w_1 x + ⋯ + w_8 x⁸ + w_9 x⁹

overfitting

[Figure: fit with M = 9, passing through every data point but oscillating wildly between them.]
Testing the Model
Train the model and get w*.
Get 100 new data points.
Root-mean-square (RMS) error:

E_RMS = √( 2 E(w*) / N )

[Figure: training and test E_RMS versus polynomial order M.]
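The division by N makes E_RMS comparable across data sets of different sizes. A sketch of the train/test comparison for M = 9 (data generation is illustrative); with as many parameters as training points, the fit interpolates the training data, so the training error is near zero while the test error is not:

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N) for a polynomial with coefficients w."""
    Phi = np.vander(x, len(w), increasing=True)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return np.sqrt(2 * E / len(x))

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = rng.uniform(0, 1, 100)                      # 100 new data points
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

Phi = np.vander(x_train, 10, increasing=True)        # M = 9: 10 parameters, 10 points
w9, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)

train_rms = rms_error(w9, x_train, t_train)          # near zero: interpolation
test_rms = rms_error(w9, x_test, t_test)             # much larger: overfitting
```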
         M = 0    M = 1    M = 3         M = 9
w*_0      0.19     0.82     0.31          0.35
w*_1              -1.27     7.99        232.37
w*_2                      -25.43      -5321.83
w*_3                       17.37      48568.31
w*_4                                -231639.30
w*_5                                 640042.26
w*_6                               -1061800.52
w*_7                                1042400.18
w*_8                                -557682.99
w*_9                                 125201.43

Table: Coefficients w* for polynomials of various order.
More Data
N = 15

[Figure: M = 9 fit with N = 15 data points.]
N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: M = 9 fit with N = 100 data points.]
Regularisation
How do we constrain the growth of the coefficients w? Add a regularisation term to the error function:

E(w) = (1/2) ∑_{n=1}^{N} (y(x_n, w) − t_n)² + (λ/2) ‖w‖²

Squared norm of the parameter vector w:

‖w‖² ≡ w^T w = w_0² + w_1² + ⋯ + w_M²

Unique minimum of E(w) at the argument w*, under certain conditions (what are they for λ = 0? for λ > 0?)
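The regularised error is still quadratic in w, so its minimiser has a closed form: w* = (Φ^TΦ + λI)^{-1} Φ^T t. A sketch (note it penalises w_0 too, as the slide's ‖w‖² does; the data generation is illustrative):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimise (1/2) sum_n (y(x_n, w) - t_n)**2 + (lam/2) * ||w||**2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

w_small_lam = fit_ridge(x, t, M=9, lam=np.exp(-18))  # ln(lambda) = -18, as on the slide
w_large_lam = fit_ridge(x, t, M=9, lam=np.exp(0))    # ln(lambda) = 0

# Larger lambda shrinks the coefficient vector.
```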
[Figure: M = 9 fit with regularisation, ln λ = −18.]
[Figure: M = 9 fit with regularisation, ln λ = 0.]
[Figure: training and test E_RMS versus ln λ for M = 9.]
What is Machine Learning?
Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!
Probability Theory
[Figure: joint distribution p(X, Y) shown as histograms over X for Y = 1 and Y = 2.]
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60
Sum Rule
p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

p(X = d) = ∑_Y p(X = d, Y)

p(X) = ∑_Y p(X, Y)
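The sum rule can be checked directly on the frequency table above, by normalising the counts and summing out one variable:

```python
import numpy as np

# Joint counts from the table: rows Y = 1 and Y = 2, columns a..i.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()        # joint distribution p(X, Y); total count is 60

p_x = p_xy.sum(axis=0)              # sum rule: marginal p(X), sum over Y
p_y = p_xy.sum(axis=1)              # marginal p(Y), sum over X
```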
p(X) = ∑_Y p(X, Y)

p(Y) = ∑_X p(X, Y)

[Figure: marginal distributions p(X) and p(Y).]
Product Rule
Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = ∑_X p(X, Y = 1) = 34/60

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is renormalisation of relative frequencies:

p(X | Y) = p(X, Y) / p(Y)
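The product rule can also be verified on the table: renormalising a row gives the conditional, and multiplying it back by the marginal recovers the joint:

```python
import numpy as np

counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1, columns a..i
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()

p_y1 = p_xy[0].sum()                # p(Y = 1) = 34/60
p_d_given_y1 = p_xy[0, 3] / p_y1    # p(X = d | Y = 1): renormalised row frequency

# Product rule: p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)
joint = p_d_given_y1 * p_y1
```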
Sum and Product Rules
p(X) = ∑_Y p(X, Y)

p(X | Y) = p(X, Y) / p(Y)

[Figure: marginal p(X) and conditional p(X | Y = 1) over the values of X.]
Sum Rule and Product Rule
Sum Rule:     p(X) = ∑_Y p(X, Y)

Product Rule: p(X, Y) = p(X | Y) p(Y)

These rules form the basis of Bayesian machine learning, and of this course!
Bayes Theorem
Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes' Theorem:

p(Y | X) = p(X | Y) p(Y) / p(X)

only defined for p(X) > 0, where

p(X) = ∑_Y p(X, Y)          (sum rule)
     = ∑_Y p(X | Y) p(Y)    (product rule)
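Putting the two rules together, Bayes' theorem can be evaluated numerically on the same frequency table, e.g. to find p(Y = 1 | X = d):

```python
import numpy as np

counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1, columns a..i
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()

p_y = p_xy.sum(axis=1)              # prior p(Y)
p_x_given_y = p_xy / p_y[:, None]   # likelihood p(X | Y), one row per value of Y
p_x = p_xy.sum(axis=0)              # evidence p(X), by the sum rule

# Bayes: p(Y = 1 | X = d) = p(X = d | Y = 1) p(Y = 1) / p(X = d)
posterior = p_x_given_y[0, 3] * p_y[0] / p_x[3]
```

Here posterior = (8/34)(34/60)/(9/60) = 8/9: of the 9 observations with X = d, 8 have Y = 1.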
Probability Densities
Real-valued variable x ∈ R.
The probability that x falls in the interval (x, x + δx) is given by p(x) δx, for infinitesimally small δx.

p(x ∈ (a, b)) = ∫_a^b p(x) dx

[Figure: density p(x), an interval of width δx, and the cumulative distribution P(x).]
Constraints on p(x)
Nonnegative:

p(x) ≥ 0

Normalisation:

∫_{−∞}^{∞} p(x) dx = 1
Cumulative distribution function P(x)
P(x) = ∫_{−∞}^{x} p(z) dz

or equivalently

(d/dx) P(x) = p(x)
Multivariate Probability Density
Vector x ≡ (x_1, …, x_D)^T

Nonnegative:

p(x) ≥ 0

Normalisation:

∫_{−∞}^{∞} p(x) dx = 1

This means

∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} p(x) dx_1 … dx_D = 1.
Sum and Product Rule for Probability Densities
Sum Rule:     p(x) = ∫_{−∞}^{∞} p(x, y) dy

Product Rule: p(x, y) = p(y | x) p(x)
Expectations
Weighted average of a function f(x) under the probability distribution p(x):

E[f] = ∑_x p(x) f(x)      (discrete distribution p(x))

E[f] = ∫ p(x) f(x) dx     (probability density p(x))
How to approximate E[f]

Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)

How do we draw points from a probability distribution p(x)? See the upcoming lecture on "Sampling".
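This finite-sum approximation is the Monte Carlo estimate of an expectation. A sketch for a case where the answer is known exactly (the choice p = N(0, 1) and f(x) = x², with E[f] = 1, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f] = int p(x) f(x) dx by the average of f over samples from p.
N = 100_000
samples = rng.standard_normal(N)    # draws from p = N(0, 1)
estimate = np.mean(samples ** 2)    # approximates E[x^2] = 1
```

The error of this estimate shrinks like O(1/√N) as more samples are drawn.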
Expectation of a function of several variables
Arbitrary function f(x, y):

E_x[f(x, y)] = ∑_x p(x) f(x, y)      (discrete distribution p(x))

E_x[f(x, y)] = ∫ p(x) f(x, y) dx     (probability density p(x))

Note that E_x[f(x, y)] is a function of y.
Conditional Expectation
Arbitrary function f(x):

E_x[f | y] = ∑_x p(x | y) f(x)      (discrete distribution p(x))

E_x[f | y] = ∫ p(x | y) f(x) dx     (probability density p(x))

Note that E_x[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why?)

E_y[E_x[f(x) | y]] = ∑_y p(y) E_x[f | y] = ∑_y p(y) ∑_x p(x | y) f(x)
                   = ∑_{x,y} f(x) p(x, y) = ∑_x f(x) p(x)
                   = E_x[f(x)]
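This identity (the tower property) can be checked numerically on any discrete joint distribution; the joint table and f below are made up for illustration:

```python
import numpy as np

# A small joint distribution over x in {0, 1, 2} and y in {0, 1} (hypothetical).
p_xy = np.array([
    [0.10, 0.20, 0.10],   # y = 0
    [0.30, 0.10, 0.20],   # y = 1
])
x_vals = np.array([0.0, 1.0, 2.0])
f = x_vals ** 2                         # an arbitrary f(x)

p_y = p_xy.sum(axis=1)
p_x = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[:, None]

E_f_given_y = p_x_given_y @ f           # E_x[f | y]: one value per y
lhs = p_y @ E_f_given_y                 # E_y[ E_x[f(x) | y] ]
rhs = p_x @ f                           # E_x[f(x)]
```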
Variance
Arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

Special case: f(x) = x

var[x] = E[(x − E[x])²] = E[x²] − E[x]²
Covariance
Two random variables x ∈ R and y ∈ R:

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])]
          = E_{x,y}[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_x[x] − a E_y[y] + a b      (since E_{x,y}[x] = E_x[x], E_{x,y}[y] = E_y[y], E_{x,y}[1] = 1)
          = E_{x,y}[x y] − a b − a b + a b = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]

Expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
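Both claims can be illustrated by sampling; the particular construction (y = 2x + ε with cov[x, y] = 2, versus an independent y) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200_000
x = rng.standard_normal(N)
eps = rng.standard_normal(N)

y_dep = 2 * x + eps                   # y varies with x: cov[x, y] = 2
y_ind = eps                           # y independent of x: cov[x, y] = 0

def cov(a, b):
    """Estimate cov[a, b] = E[ab] - E[a]E[b] from samples."""
    return np.mean(a * b) - np.mean(a) * np.mean(b)

c_dep = cov(x, y_dep)
c_ind = cov(x, y_ind)
```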
Covariance for Vector Valued Variables
Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])]
          = E_{x,y}[x y^T] − E[x] E[y^T]
The Gaussian Distribution
x ∈ R. Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = 1 / (2πσ²)^{1/2} · exp{ −(x − µ)² / (2σ²) }

[Figure: the Gaussian density N(x | µ, σ²), centred at µ with width 2σ marked.]
N(x | µ, σ²) > 0 and ∫_{−∞}^{∞} N(x | µ, σ²) dx = 1

Expectation over x:

E[x] = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ

Expectation over x²:

E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²

Variance of x:

var[x] = E[x²] − E[x]² = σ²
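These three moments can be checked with the sampling approximation from earlier; the particular values µ = 1.5, σ = 0.5 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.5
N = 200_000
x = rng.normal(mu, sigma, N)               # samples from N(x | mu, sigma^2)

mean_est = np.mean(x)                      # approximates E[x] = mu
second_moment = np.mean(x ** 2)            # approximates E[x^2] = mu^2 + sigma^2
var_est = second_moment - mean_est ** 2    # approximates var[x] = sigma^2
```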
Strategy in this course
Estimate the best predictor = training = learning.
Given data (x_1, y_1), …, (x_n, y_n), find a predictor f_w(·).

1 Identify the type of input x and output y data
2 Propose a (linear) mathematical model for f_w
3 Design an objective function or likelihood
4 Calculate the optimal parameter w
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in Python)
7 Interpret and diagnose results