Statistical Machine Learning · Probability Theory, Probability Densities, Expectations and Covariances


Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outline

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61

and

College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part II

Introduction


Flavour of this course

Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.


Polynomial Curve Fitting

Some artificial data created from the function sin(2πx) plus random noise, for x ranging over [0, 1].

[Figure: the training data points (x_n, t_n), with x on the horizontal axis and t on the vertical axis.]
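The slides later mention numerical algorithms in Python; as a concrete illustration (not part of the original deck), here is a minimal sketch of how such a toy data set could be generated. The noise level of 0.3 and the evenly spaced inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                                    # number of training points
x = np.linspace(0.0, 1.0, N)              # inputs spread over [0, 1]
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)   # sin(2*pi*x) plus Gaussian noise
```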



Polynomial Curve Fitting - Input Specification

$N = 10$

$\mathbf{x} \equiv (x_1, \dots, x_N)^T$

$\mathbf{t} \equiv (t_1, \dots, t_N)^T$

$x_i \in \mathbb{R}, \quad i = 1, \dots, N$

$t_i \in \mathbb{R}, \quad i = 1, \dots, N$


Polynomial Curve Fitting - Model Specification

$M$: order of the polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

$y(x, \mathbf{w})$ is a nonlinear function of $x$, but a linear function of the unknown model parameters $\mathbf{w}$.

How can we find good parameters $\mathbf{w} = (w_0, w_1, \dots, w_M)^T$?


Learning is Improving Performance

[Figure: the error at a single training point, shown as the distance between the prediction $y(x_n, \mathbf{w})$ and the target $t_n$.]

Performance measure: the error between the targets and the predictions of the model on the training data,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2$$

$E(\mathbf{w})$ has a unique minimum at some argument $\mathbf{w}^\star$ under certain conditions (what are they?).
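As an illustrative sketch (assuming the toy data x, t from the earlier snippet), the minimiser of E(w) can be computed by forming the design matrix with entries x_n^m and solving a linear least-squares problem; the helper names fit_polynomial and predict are made up here for clarity.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimise E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 for a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)    # column m holds x_n^m
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solution of Phi w ~ t
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_m w_m x^m."""
    return np.vander(x, len(w), increasing=True) @ w

w_star = fit_polynomial(x, t, M=3)                # e.g. the M = 3 fit from the slides
```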


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=0} = w_0$$

[Figure: the fitted polynomial with $M = 0$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=1} = w_0 + w_1 x$$

[Figure: the fitted polynomial with $M = 1$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

[Figure: the fitted polynomial with $M = 3$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9$$

Overfitting!

[Figure: the fitted polynomial with $M = 9$.]


Testing the Model

Train the model and get $\mathbf{w}^\star$.

Get 100 new data points.

Root-mean-square (RMS) error:

$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star)/N}$$

[Figure: training and test $E_{\mathrm{RMS}}$ as a function of the polynomial order $M$.]
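A possible way to reproduce the training/test comparison, reusing the sketch above; the 100 test points follow the slide, while the uniform test inputs and noise level are assumptions.

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error."""
    E = 0.5 * np.sum((predict(x, w) - t) ** 2)
    return np.sqrt(2 * E / len(x))

# 100 fresh test points from the same sin(2*pi*x) + noise source
x_test = rng.uniform(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(100)

for M in (0, 1, 3, 9):
    w_star = fit_polynomial(x, t, M)
    print(M, rms_error(w_star, x, t), rms_error(w_star, x_test, t_test))
```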


Testing the Model

         M = 0     M = 1     M = 3         M = 9
w*_0      0.19      0.82      0.31          0.35
w*_1               -1.27      7.99        232.37
w*_2                        -25.43      -5321.83
w*_3                         17.37      48568.31
w*_4                                  -231639.30
w*_5                                   640042.26
w*_6                                 -1061800.52
w*_7                                  1042400.18
w*_8                                  -557682.99
w*_9                                   125201.43

Table: Coefficients w* for polynomials of various order.


More Data

N = 15

[Figure: the fit with N = 15 data points.]


More Data

N = 100

Heuristic: have no fewer than 5 to 10 times as many data points as parameters.

But the number of parameters is not necessarily the most appropriate measure of model complexity!

Later: the Bayesian approach.

[Figure: the fit with N = 100 data points.]


Regularisation

How can we constrain the growth of the coefficients $\mathbf{w}$? Add a regularisation term to the error function:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector $\mathbf{w}$:

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$$

$E(\mathbf{w})$ has a unique minimum at some argument $\mathbf{w}^\star$ under certain conditions (what are they for $\lambda = 0$? for $\lambda > 0$?).
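For the regularised error function, setting the gradient to zero gives the linear system (Φ^T Φ + λI) w = Φ^T t, which can be solved directly. A sketch under the same assumptions as the earlier snippets (the slide's ‖w‖² penalises all coefficients, including w_0, and so does this code):

```python
import numpy as np

def fit_polynomial_regularised(x, t, M, lam):
    """Minimise 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)       # regularised normal equations
    return np.linalg.solve(A, Phi.T @ t)

w_reg = fit_polynomial_regularised(x, t, M=9, lam=np.exp(-18))   # the ln(lambda) = -18 fit
```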


Regularisation

$M = 9$

[Figure: the regularised fit with $\ln \lambda = -18$.]


Regularisation

$M = 9$

[Figure: the regularised fit with $\ln \lambda = 0$.]


Regularisation

$M = 9$

[Figure: training and test $E_{\mathrm{RMS}}$ as a function of $\ln \lambda$.]


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!


Probability Theory

[Figure: histogram of the joint distribution $p(X, Y)$ over the values of $X$, for $Y = 1$ and $Y = 2$.]


Probability Theory

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as a histogram of $p(X, Y)$ for $Y = 1$ and $Y = 2$.]


Sum Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X = d, Y = 1) = 8/60$$

$$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60$$

$$p(X = d) = \sum_{Y} p(X = d, Y)$$

$$p(X) = \sum_{Y} p(X, Y)$$


Sum Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(Y) = \sum_{X} p(X, Y)$$

[Figure: histograms of the marginal distributions $p(X)$ and $p(Y)$.]


Product Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

Conditional probability:

$$p(X = d \mid Y = 1) = 8/34$$

Calculate $p(Y = 1)$:

$$p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60$$

$$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1)$$

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Another intuitive view is renormalisation of relative frequencies:

$$p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$
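To make these rules concrete, a short NumPy sketch that recovers the numbers above from the counts table (the array layout is an assumption; the counts are those on the slide):

```python
import numpy as np

# rows: Y = 1 and Y = 2; columns: X = a, ..., i (counts from the slide, 60 in total)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()       # joint distribution p(X, Y)

p_x = p_xy.sum(axis=0)             # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)             # sum rule: p(Y) = sum_X p(X, Y)
p_x_given_y1 = p_xy[0] / p_y[0]    # p(X | Y = 1) = p(X, Y = 1) / p(Y = 1)

d = 3                              # column index of X = d
print(p_x_given_y1[d])             # 8/34
print(p_x_given_y1[d] * p_y[0])    # product rule: equals p(X = d, Y = 1) = 8/60
```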


Sum and Product Rules

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$

[Figure: histograms of the marginal $p(X)$ and the conditional $p(X \mid Y = 1)$.]


Sum Rule and Product Rule

Sum Rule:

$$p(X) = \sum_{Y} p(X, Y)$$

Product Rule:

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

These rules form the basis of Bayesian machine learning, and of this course!


Bayes Theorem

Use the product rule:

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes' theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

only defined for $p(X) > 0$, and

$$p(X) = \sum_{Y} p(X, Y) \quad \text{(sum rule)} \;=\; \sum_{Y} p(X \mid Y)\, p(Y) \quad \text{(product rule)}$$
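As a worked illustration (added here, using the counts table from the earlier slides), Bayes' theorem gives for instance

$$p(Y = 1 \mid X = d) = \frac{p(X = d \mid Y = 1)\, p(Y = 1)}{p(X = d)} = \frac{(8/34)\,(34/60)}{9/60} = \frac{8}{9}.$$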


Probability Densities

Real-valued variable $x \in \mathbb{R}$. The probability of $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for infinitesimally small $\delta x$:

$$p(x \in (a, b)) = \int_a^b p(x)\, dx$$

[Figure: a density $p(x)$, its cumulative distribution $P(x)$, and an interval of width $\delta x$.]


Constraints on p(x)

Nonnegative:

$$p(x) \geq 0$$

Normalisation:

$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$


Cumulative distribution function P(x)

$$P(x) = \int_{-\infty}^{x} p(z)\, dz$$

or

$$\frac{d}{dx} P(x) = p(x)$$
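A small numerical sketch of this relationship, assuming a standard Gaussian density for concreteness: the cumulative sum of p on a grid approximates P(x), and differentiating P recovers p.

```python
import numpy as np

z = np.linspace(-6.0, 6.0, 2001)                  # grid covering essentially all the mass
dz = z[1] - z[0]
p = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)    # density p(z): standard Gaussian

P = np.cumsum(p) * dz                             # P(x) ~ integral of p(z) up to x
print(P[-1])                                      # ~ 1 (normalisation)
print(np.max(np.abs(np.gradient(P, dz) - p)))     # small: d/dx P(x) ~ p(x)
```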


Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T$.

Nonnegative:

$$p(\mathbf{x}) \geq 0$$

Normalisation:

$$\int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1$$

This means

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{x})\, dx_1 \dots dx_D = 1.$$


Sum and Product Rule for Probability Densities

Sum Rule:

$$p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$$

Product Rule:

$$p(x, y) = p(y \mid x)\, p(x)$$


Expectations

Weighted average of a function $f(x)$ under the probability distribution $p(x)$:

$$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$


How to approximate E[f]

Given a finite number $N$ of points $x_n$ drawn from the probability distribution $p(x)$, approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

How do we draw points from a probability distribution $p(x)$? A lecture on "Sampling" is coming.
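A minimal sketch of this finite-sum approximation, drawing the x_n from a Gaussian so that the exact value E[x²] = µ² + σ² is available for comparison (the choice of f and of a Gaussian p(x) is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
N = 100_000

x_n = rng.normal(mu, sigma, N)        # samples from p(x) = N(x | mu, sigma^2)
approx = np.mean(x_n ** 2)            # (1/N) * sum_n f(x_n) with f(x) = x^2
exact = mu ** 2 + sigma ** 2          # E[x^2] for the Gaussian

print(approx, exact)                  # agree up to Monte Carlo error
```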


Expectation of a function of several variables

Arbitrary function $f(x, y)$:

$$\mathbb{E}_x[f(x, y)] = \sum_{x} p(x)\, f(x, y) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of $y$.


Conditional Expectation

Arbitrary function $f(x)$:

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of $y$. Other notation used in the literature: $\mathbb{E}_{x|y}[f]$.

What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it? This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)

$$\mathbb{E}_y\bigl[\mathbb{E}_x[f(x) \mid y]\bigr] = \sum_{y} p(y)\, \mathbb{E}_x[f \mid y] = \sum_{y} p(y) \sum_{x} p(x \mid y)\, f(x) = \sum_{x,y} f(x)\, p(x, y) = \sum_{x} f(x)\, p(x) = \mathbb{E}_x[f(x)]$$


Variance

Arbitrary function $f(x)$:

$$\operatorname{var}[f] = \mathbb{E}\bigl[(f(x) - \mathbb{E}[f(x)])^2\bigr] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

Special case $f(x) = x$:

$$\operatorname{var}[x] = \mathbb{E}\bigl[(x - \mathbb{E}[x])^2\bigr] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$


Covariance

Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$:

$$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\bigl[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\bigr] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$:

$$\begin{aligned}
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b] \\
&= \mathbb{E}_{x,y}[x y] - b\,\underbrace{\mathbb{E}_{x,y}[x]}_{=\mathbb{E}_x[x]} - a\,\underbrace{\mathbb{E}_{x,y}[y]}_{=\mathbb{E}_y[y]} + a b\,\underbrace{\mathbb{E}_{x,y}[1]}_{=1} \\
&= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
\end{aligned}$$

Expresses how strongly $x$ and $y$ vary together. If $x$ and $y$ are independent, their covariance vanishes.


Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$:

$$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\bigl[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T])\bigr] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\,\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^T]$$
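A quick numerical check of the scalar covariance from the previous slide (the particular joint distribution of x and y below is an assumption): cov[x, y] = E[xy] − E[x]E[y], estimated from samples and compared against NumPy's np.cov.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

x = rng.standard_normal(N)
y = 0.5 * x + rng.standard_normal(N)          # y depends on x, so cov[x, y] should be about 0.5

cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[xy] - E[x] E[y]
cov_numpy = np.cov(x, y, bias=True)[0, 1]               # off-diagonal of the 2x2 covariance matrix

print(cov_manual, cov_numpy)
```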


The Gaussian Distribution

$x \in \mathbb{R}$. Gaussian distribution with mean $\mu$ and variance $\sigma^2$:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Bigl\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Bigr\}$$

[Figure: the Gaussian density $\mathcal{N}(x \mid \mu, \sigma^2)$, centred at $\mu$.]


The Gaussian Distribution

$$\mathcal{N}(x \mid \mu, \sigma^2) > 0, \qquad \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1$$

Expectation over $x$:

$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu$$

Expectation over $x^2$:

$$\mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2$$

Variance of $x$:

$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2$$
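A hedged sketch that evaluates the Gaussian density and checks the three integrals above by numerical integration on a grid (the grid limits, spacing and parameter values are arbitrary choices):

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """N(x | mu, sigma^2) = (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 0.5
x = np.linspace(mu - 10.0, mu + 10.0, 20001)
dx = x[1] - x[0]
p = gaussian(x, mu, sigma2)

print(np.sum(p) * dx)            # ~ 1                 (normalisation)
print(np.sum(p * x) * dx)        # ~ mu                (E[x])
print(np.sum(p * x ** 2) * dx)   # ~ mu^2 + sigma^2    (E[x^2])
```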


Strategy in this course

Estimate the best predictor = training = learning.
Given data $(x_1, y_1), \dots, (x_n, y_n)$, find a predictor $f_w(\cdot)$.

1. Identify the type of input $x$ and output $y$ data
2. Propose a (linear) mathematical model for $f_w$
3. Design an objective function or likelihood
4. Calculate the optimal parameter $w$
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results
