Statistical Machine Learning · Probability Theory, Probability Densities, Expectations and Covariances


Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outline

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61

and

College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part II

Introduction


Flavour of this course

Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.


Polynomial Curve Fitting

Some artificial data created from the function sin(2πx) plus random noise, for x ranging over [0, 1].

[Figure: the training data points (x_n, t_n), with x on the horizontal axis and t on the vertical axis.]
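The slides later mention numerical algorithms in Python; as a concrete illustration (not part of the original deck), here is a minimal sketch of how such a toy data set could be generated. The noise level of 0.3 and the evenly spaced inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                                    # number of training points
x = np.linspace(0.0, 1.0, N)              # inputs spread over [0, 1]
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)   # sin(2*pi*x) plus Gaussian noise
```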



Polynomial Curve Fitting - Input Specification

$N = 10$

$\mathbf{x} \equiv (x_1, \dots, x_N)^T$

$\mathbf{t} \equiv (t_1, \dots, t_N)^T$

$x_i \in \mathbb{R}, \quad i = 1, \dots, N$

$t_i \in \mathbb{R}, \quad i = 1, \dots, N$


Polynomial Curve Fitting - Model Specification

$M$: order of the polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

$y(x, \mathbf{w})$ is a nonlinear function of $x$, but a linear function of the unknown model parameters $\mathbf{w}$.

How can we find good parameters $\mathbf{w} = (w_0, w_1, \dots, w_M)^T$?


Learning is Improving Performance

[Figure: the error at a single training point, shown as the distance between the prediction $y(x_n, \mathbf{w})$ and the target $t_n$.]

Performance measure: the error between the targets and the predictions of the model on the training data,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2$$

$E(\mathbf{w})$ has a unique minimum at some argument $\mathbf{w}^\star$ under certain conditions (what are they?).
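As an illustrative sketch (assuming the toy data x, t from the earlier snippet), the minimiser of E(w) can be computed by forming the design matrix with entries x_n^m and solving a linear least-squares problem; the helper names fit_polynomial and predict are made up here for clarity.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimise E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 for a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)    # column m holds x_n^m
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solution of Phi w ~ t
    return w

def predict(x, w):
    """Evaluate y(x, w) = sum_m w_m x^m."""
    return np.vander(x, len(w), increasing=True) @ w

w_star = fit_polynomial(x, t, M=3)                # e.g. the M = 3 fit from the slides
```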


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=0} = w_0$$

[Figure: the fitted polynomial with $M = 0$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=1} = w_0 + w_1 x$$

[Figure: the fitted polynomial with $M = 1$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

[Figure: the fitted polynomial with $M = 3$.]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \;\Bigg|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9$$

Overfitting!

[Figure: the fitted polynomial with $M = 9$.]


Testing the Model

Train the model and get $\mathbf{w}^\star$.

Get 100 new data points.

Root-mean-square (RMS) error:

$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star)/N}$$

[Figure: training and test $E_{\mathrm{RMS}}$ as a function of the polynomial order $M$.]
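A possible way to reproduce the training/test comparison, reusing the sketch above; the 100 test points follow the slide, while the uniform test inputs and noise level are assumptions.

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error."""
    E = 0.5 * np.sum((predict(x, w) - t) ** 2)
    return np.sqrt(2 * E / len(x))

# 100 fresh test points from the same sin(2*pi*x) + noise source
x_test = rng.uniform(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(100)

for M in (0, 1, 3, 9):
    w_star = fit_polynomial(x, t, M)
    print(M, rms_error(w_star, x, t), rms_error(w_star, x_test, t_test))
```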


Testing the Model

         M = 0     M = 1     M = 3         M = 9
w*_0      0.19      0.82      0.31          0.35
w*_1               -1.27      7.99        232.37
w*_2                        -25.43      -5321.83
w*_3                         17.37      48568.31
w*_4                                  -231639.30
w*_5                                   640042.26
w*_6                                 -1061800.52
w*_7                                  1042400.18
w*_8                                  -557682.99
w*_9                                   125201.43

Table: Coefficients w* for polynomials of various order.


More Data

N = 15

[Figure: the fit with N = 15 data points.]


More Data

N = 100

Heuristic: have no fewer than 5 to 10 times as many data points as parameters.

But the number of parameters is not necessarily the most appropriate measure of model complexity!

Later: the Bayesian approach.

[Figure: the fit with N = 100 data points.]


Regularisation

How can we constrain the growth of the coefficients $\mathbf{w}$? Add a regularisation term to the error function:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector $\mathbf{w}$:

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$$

$E(\mathbf{w})$ has a unique minimum at some argument $\mathbf{w}^\star$ under certain conditions (what are they for $\lambda = 0$? for $\lambda > 0$?).
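For the regularised error function, setting the gradient to zero gives the linear system (Φ^T Φ + λI) w = Φ^T t, which can be solved directly. A sketch under the same assumptions as the earlier snippets (the slide's ‖w‖² penalises all coefficients, including w_0, and so does this code):

```python
import numpy as np

def fit_polynomial_regularised(x, t, M, lam):
    """Minimise 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)       # regularised normal equations
    return np.linalg.solve(A, Phi.T @ t)

w_reg = fit_polynomial_regularised(x, t, M=9, lam=np.exp(-18))   # the ln(lambda) = -18 fit
```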


Regularisation

$M = 9$

[Figure: the regularised fit with $\ln \lambda = -18$.]


Regularisation

$M = 9$

[Figure: the regularised fit with $\ln \lambda = 0$.]


Regularisation

$M = 9$

[Figure: training and test $E_{\mathrm{RMS}}$ as a function of $\ln \lambda$.]


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!


Probability Theory

[Figure: histogram of the joint distribution $p(X, Y)$ over the values of $X$, for $Y = 1$ and $Y = 2$.]


Probability Theory

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as a histogram of $p(X, Y)$ for $Y = 1$ and $Y = 2$.]


Sum Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X = d, Y = 1) = 8/60$$

$$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60$$

$$p(X = d) = \sum_{Y} p(X = d, Y)$$

$$p(X) = \sum_{Y} p(X, Y)$$


Sum Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(Y) = \sum_{X} p(X, Y)$$

[Figure: histograms of the marginal distributions $p(X)$ and $p(Y)$.]


Product Rule

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

Conditional probability:

$$p(X = d \mid Y = 1) = 8/34$$

Calculate $p(Y = 1)$:

$$p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60$$

$$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1)$$

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Another intuitive view is renormalisation of relative frequencies:

$$p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$
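To make these rules concrete, a short NumPy sketch that recovers the numbers above from the counts table (the array layout is an assumption; the counts are those on the slide):

```python
import numpy as np

# rows: Y = 1 and Y = 2; columns: X = a, ..., i (counts from the slide, 60 in total)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
p_xy = counts / counts.sum()       # joint distribution p(X, Y)

p_x = p_xy.sum(axis=0)             # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=1)             # sum rule: p(Y) = sum_X p(X, Y)
p_x_given_y1 = p_xy[0] / p_y[0]    # p(X | Y = 1) = p(X, Y = 1) / p(Y = 1)

d = 3                              # column index of X = d
print(p_x_given_y1[d])             # 8/34
print(p_x_given_y1[d] * p_y[0])    # product rule: equals p(X = d, Y = 1) = 8/60
```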


Sum and Product Rules

Y \ X    a   b   c   d   e   f   g   h   i   sum
Y = 2    0   0   0   1   4   5   8   6   2    26
Y = 1    3   6   8   8   5   3   1   0   0    34
sum      3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$

[Figure: histograms of the marginal $p(X)$ and the conditional $p(X \mid Y = 1)$.]


Sum Rule and Product Rule

Sum Rule:

$$p(X) = \sum_{Y} p(X, Y)$$

Product Rule:

$$p(X, Y) = p(X \mid Y)\, p(Y)$$

These rules form the basis of Bayesian machine learning, and of this course!


Bayes Theorem

Use the product rule:

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes' theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

only defined for $p(X) > 0$, and

$$p(X) = \sum_{Y} p(X, Y) \quad \text{(sum rule)} \;=\; \sum_{Y} p(X \mid Y)\, p(Y) \quad \text{(product rule)}$$
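As a worked illustration (added here, using the counts table from the earlier slides), Bayes' theorem gives for instance

$$p(Y = 1 \mid X = d) = \frac{p(X = d \mid Y = 1)\, p(Y = 1)}{p(X = d)} = \frac{(8/34)\,(34/60)}{9/60} = \frac{8}{9}.$$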


Probability Densities

Real-valued variable $x \in \mathbb{R}$. The probability of $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for infinitesimally small $\delta x$:

$$p(x \in (a, b)) = \int_a^b p(x)\, dx$$

[Figure: a density $p(x)$, its cumulative distribution $P(x)$, and an interval of width $\delta x$.]


Constraints on p(x)

Nonnegative:

$$p(x) \geq 0$$

Normalisation:

$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$


Cumulative distribution function P(x)

$$P(x) = \int_{-\infty}^{x} p(z)\, dz$$

or

$$\frac{d}{dx} P(x) = p(x)$$
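A small numerical sketch of this relationship, assuming a standard Gaussian density for concreteness: the cumulative sum of p on a grid approximates P(x), and differentiating P recovers p.

```python
import numpy as np

z = np.linspace(-6.0, 6.0, 2001)                  # grid covering essentially all the mass
dz = z[1] - z[0]
p = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)    # density p(z): standard Gaussian

P = np.cumsum(p) * dz                             # P(x) ~ integral of p(z) up to x
print(P[-1])                                      # ~ 1 (normalisation)
print(np.max(np.abs(np.gradient(P, dz) - p)))     # small: d/dx P(x) ~ p(x)
```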


Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T$.

Nonnegative:

$$p(\mathbf{x}) \geq 0$$

Normalisation:

$$\int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1$$

This means

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{x})\, dx_1 \dots dx_D = 1.$$


Sum and Product Rule for Probability Densities

Sum Rule:

$$p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$$

Product Rule:

$$p(x, y) = p(y \mid x)\, p(x)$$


Expectations

Weighted average of a function $f(x)$ under the probability distribution $p(x)$:

$$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$


How to approximate E[f]

Given a finite number $N$ of points $x_n$ drawn from the probability distribution $p(x)$, approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

How do we draw points from a probability distribution $p(x)$? A lecture on "Sampling" is coming.
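A minimal sketch of this finite-sum approximation, drawing the x_n from a Gaussian so that the exact value E[x²] = µ² + σ² is available for comparison (the choice of f and of a Gaussian p(x) is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
N = 100_000

x_n = rng.normal(mu, sigma, N)        # samples from p(x) = N(x | mu, sigma^2)
approx = np.mean(x_n ** 2)            # (1/N) * sum_n f(x_n) with f(x) = x^2
exact = mu ** 2 + sigma ** 2          # E[x^2] for the Gaussian

print(approx, exact)                  # agree up to Monte Carlo error
```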


Expectation of a function of several variables

Arbitrary function $f(x, y)$:

$$\mathbb{E}_x[f(x, y)] = \sum_{x} p(x)\, f(x, y) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of $y$.


Conditional Expectation

Arbitrary function $f(x)$:

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of $y$. Other notation used in the literature: $\mathbb{E}_{x|y}[f]$.

What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it? This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)

$$\mathbb{E}_y\bigl[\mathbb{E}_x[f(x) \mid y]\bigr] = \sum_{y} p(y)\, \mathbb{E}_x[f \mid y] = \sum_{y} p(y) \sum_{x} p(x \mid y)\, f(x) = \sum_{x,y} f(x)\, p(x, y) = \sum_{x} f(x)\, p(x) = \mathbb{E}_x[f(x)]$$


Variance

Arbitrary function $f(x)$:

$$\operatorname{var}[f] = \mathbb{E}\bigl[(f(x) - \mathbb{E}[f(x)])^2\bigr] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

Special case $f(x) = x$:

$$\operatorname{var}[x] = \mathbb{E}\bigl[(x - \mathbb{E}[x])^2\bigr] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$


Covariance

Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$:

$$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\bigl[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\bigr] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$:

$$\begin{aligned}
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b] \\
&= \mathbb{E}_{x,y}[x y] - b\,\underbrace{\mathbb{E}_{x,y}[x]}_{=\mathbb{E}_x[x]} - a\,\underbrace{\mathbb{E}_{x,y}[y]}_{=\mathbb{E}_y[y]} + a b\,\underbrace{\mathbb{E}_{x,y}[1]}_{=1} \\
&= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b \\
&= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
\end{aligned}$$

Expresses how strongly $x$ and $y$ vary together. If $x$ and $y$ are independent, their covariance vanishes.


Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$:

$$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\bigl[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T])\bigr] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\,\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^T]$$
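A quick numerical check of the scalar covariance from the previous slide (the particular joint distribution of x and y below is an assumption): cov[x, y] = E[xy] − E[x]E[y], estimated from samples and compared against NumPy's np.cov.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

x = rng.standard_normal(N)
y = 0.5 * x + rng.standard_normal(N)          # y depends on x, so cov[x, y] should be about 0.5

cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[xy] - E[x] E[y]
cov_numpy = np.cov(x, y, bias=True)[0, 1]               # off-diagonal of the 2x2 covariance matrix

print(cov_manual, cov_numpy)
```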


The Gaussian Distribution

$x \in \mathbb{R}$. Gaussian distribution with mean $\mu$ and variance $\sigma^2$:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Bigl\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Bigr\}$$

[Figure: the Gaussian density $\mathcal{N}(x \mid \mu, \sigma^2)$, centred at $\mu$.]


The Gaussian Distribution

$$\mathcal{N}(x \mid \mu, \sigma^2) > 0, \qquad \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1$$

Expectation over $x$:

$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu$$

Expectation over $x^2$:

$$\mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2$$

Variance of $x$:

$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2$$
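A hedged sketch that evaluates the Gaussian density and checks the three integrals above by numerical integration on a grid (the grid limits, spacing and parameter values are arbitrary choices):

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """N(x | mu, sigma^2) = (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 0.5
x = np.linspace(mu - 10.0, mu + 10.0, 20001)
dx = x[1] - x[0]
p = gaussian(x, mu, sigma2)

print(np.sum(p) * dx)            # ~ 1                 (normalisation)
print(np.sum(p * x) * dx)        # ~ mu                (E[x])
print(np.sum(p * x ** 2) * dx)   # ~ mu^2 + sigma^2    (E[x^2])
```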


Strategy in this course

Estimate the best predictor = training = learning.
Given data $(x_1, y_1), \dots, (x_n, y_n)$, find a predictor $f_w(\cdot)$.

1. Identify the type of input $x$ and output $y$ data
2. Propose a (linear) mathematical model for $f_w$
3. Design an objective function or likelihood
4. Calculate the optimal parameter $w$
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results
