
Machine Learning for Signal Processing

Linear Gaussian Models

Class 21. 12 Nov 2013

Instructor: Bhiksha Raj


Administrivia

• HW3 is up


• Projects – please send us an update


Recap: MAP Estimators

• MAP (Maximum A Posteriori): Find a “best guess” for y (statistically), given known x

\hat{y} = \arg\max_y P(y|x)


Recap: MAP estimation

• x and y are jointly Gaussian

• z = [x; y] is Gaussian:

E[z] = \mu_z = [\mu_x; \mu_y]

Var(z) = C_{zz} = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}, \qquad C_{xy} = E[(x - \mu_x)(y - \mu_y)^T]

P(z) = N(z; \mu_z, C_{zz}) = \frac{1}{\sqrt{(2\pi)^D |C_{zz}|}} \exp\!\left(-\tfrac{1}{2}(z - \mu_z)^T C_{zz}^{-1} (z - \mu_z)\right)

MAP estimation: Gaussian PDF

[Figure: the joint Gaussian density over X and Y]

MAP estimation: The Gaussian at a particular value of X

[Figure: the slice of the joint Gaussian at X = x0]

Conditional Probability of y|x

• The conditional probability of y given x is also Gaussian

– The slice in the figure is Gaussian

• The mean of this Gaussian is a function of x

• The variance of y reduces if x is known

– Uncertainty is reduced

P(y|x) = N\!\left(y;\; \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x),\; C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}\right)

E[y|x] = \mu_{y|x} = \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x)

Var(y|x) = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}
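As a minimal numpy sketch of these conditional-Gaussian formulas (the toy means and covariances below are hypothetical, chosen only so the code runs; they are not from the lecture):

import numpy as np

# Hypothetical joint Gaussian over (x, y), partitioned as on the slides:
# mu_z = [mu_x; mu_y],  C_zz = [[C_xx, C_xy], [C_yx, C_yy]]
mu_x = np.array([1.0, 0.0])
mu_y = np.array([2.0])
C_xx = np.array([[2.0, 0.5], [0.5, 1.0]])
C_xy = np.array([[0.8], [0.3]])            # C_yx = C_xy.T
C_yy = np.array([[1.5]])

def map_mmse_estimate(x):
    """E[y|x] = mu_y + C_yx C_xx^-1 (x - mu_x); for a Gaussian this is also the MAP estimate."""
    gain = C_xy.T @ np.linalg.inv(C_xx)    # C_yx C_xx^-1
    mean = mu_y + gain @ (x - mu_x)        # conditional mean of y
    cov = C_yy - gain @ C_xy               # Var(y|x): smaller than C_yy, uncertainty is reduced
    return mean, cov

y_hat, y_var = map_mmse_estimate(np.array([1.5, -0.2]))
print(y_hat, y_var)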

MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional Gaussian at X = x0; the most likely value of y lies at its peak]

MAP Estimation of a Gaussian RV

\hat{y} = \arg\max_y P(y|x) = E[y|x]

It's also a minimum mean-squared-error estimate

• Minimize error:

Err = E[(y - \hat{y})^T (y - \hat{y}) \mid x] = E[y^T y \mid x] - 2\hat{y}^T E[y \mid x] + \hat{y}^T \hat{y}

• Differentiating and equating to 0:

\frac{d\,Err}{d\hat{y}} = 2\hat{y} - 2E[y \mid x] = 0 \quad\Rightarrow\quad \hat{y} = E[y \mid x]

• The MMSE estimate is the mean of the distribution

For the Gaussian: MAP = MMSE

• The most likely value is also the mean value

• This would be true of any symmetric distribution

MMSE estimates for mixture distributions


• Let P(y|x) be a mixture density:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• The MMSE estimate of y is given by

E[y|x] = \int y \sum_k P(k|x) P(y|x, k) \, dy = \sum_k P(k|x) \int y \, P(y|x, k) \, dy = \sum_k P(k|x) E[y|x, k]

• Just a weighted combination of the MMSE estimates from the component distributions

MMSE estimates from a Gaussian mixture

• Let P(x, y) be a Gaussian mixture:

P(z) = P(x, y) = \sum_k P(k) \, N(z; \mu_k, C_k), \qquad z = [x; y]

• P(y|x) is also a Gaussian mixture:

P(y|x) = \frac{P(x, y)}{P(x)} = \sum_k \frac{P(x, y, k)}{P(x)} = \sum_k \frac{P(k) P(x|k) P(y|x, k)}{P(x)} = \sum_k P(k|x) P(y|x, k)

MMSE estimates from a Gaussian mixture

• P(y|x) is a Gaussian mixture:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• With each joint Gaussian partitioned as

\mu_k = [\mu_{k,x}; \mu_{k,y}], \qquad C_k = \begin{bmatrix} C_{k,xx} & C_{k,xy} \\ C_{k,yx} & C_{k,yy} \end{bmatrix}

• the component conditionals are Gaussian:

P(y|x, k) = N\!\left(y;\; \mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x}),\; C_{k,yy} - C_{k,yx} C_{k,xx}^{-1} C_{k,xy}\right)

P(y|x) = \sum_k P(k|x) \, N\!\left(y;\; \mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x}),\; C_{k,yy} - C_{k,yx} C_{k,xx}^{-1} C_{k,xy}\right)

MMSE estimates from a Gaussian mixture

• P(y|x) is a mixture Gaussian density:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• E[y|x] is also a mixture:

E[y|x] = \sum_k P(k|x) E[y|x, k] = \sum_k P(k|x) \left(\mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x})\right)

MMSE estimates from a Gaussian mixture

• Weighted combination of MMSE estimates obtained from individual Gaussians!

E[y|x] = \sum_k P(k|x) \left(\mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x})\right)

• The weight P(k|x) is easily computed too:

P(k|x) = \frac{P(k, x)}{P(x)} = \frac{P(k) \, N(x; \mu_{k,x}, C_{k,xx})}{\sum_{k'} P(k') \, N(x; \mu_{k',x}, C_{k',xx})}

MMSE estimates from a Gaussian mixture


A mixture of estimates from individual Gaussians
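A rough numpy/scipy sketch of this mixture estimator (the two-component joint GMM below is hypothetical and hard-coded; in practice it would first be learned from joint (x, y) vectors, e.g. with EM, which is not shown here):

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component joint GMM over z = [x; y], with x and y both 1-dimensional.
priors = np.array([0.6, 0.4])                                  # P(k)
mus    = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]          # mu_k = [mu_kx; mu_ky]
covs   = [np.array([[1.0, 0.7], [0.7, 1.5]]),
          np.array([[0.8, -0.3], [-0.3, 0.6]])]                # C_k, partitioned below

def mmse_from_gmm(x, dx=1):
    """E[y|x] = sum_k P(k|x) (mu_ky + C_kyx C_kxx^-1 (x - mu_kx))."""
    x = np.atleast_1d(x)
    # P(k|x) is proportional to P(k) N(x; mu_kx, C_kxx)
    lik = np.array([p * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                    for p, m, C in zip(priors, mus, covs)])
    post = lik / lik.sum()
    est = np.zeros(mus[0].size - dx)
    for pk, m, C in zip(post, mus, covs):
        gain = C[dx:, :dx] @ np.linalg.inv(C[:dx, :dx])        # C_kyx C_kxx^-1
        est += pk * (m[dx:] + gain @ (x - m[:dx]))             # weighted component MMSE estimate
    return est

print(mmse_from_gmm(1.2))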

Voice Morphing

• Align training recordings from both speakers

– Cepstral vector sequence

• Learn a GMM on joint vectors

• Given speech from one speaker, find MMSE estimate of the other

• Synthesize from cepstra


MMSE with GMM: Voice Transformation

- Festvox GMM transformation suite (Toda)

[Audio demo grid: voice transformations between each pair of the speakers awb, bdl, jmk, slt]

MAP / ML / MMSE

• General statistical estimators

• All used to predict a variable, based on other parameters related to it..

• Most common assumption: Data are Gaussian, all RVs are Gaussian

– Other probability densities may also be used..

• For Gaussians, the relationships are linear, as we saw..


Gaussians and more Gaussians..

• Linear Gaussian Models..

• But first a recap


A Brief Recap

• Principal component analysis: Find the K bases that best explain the given data

• Find B and C such that the difference between D and BC is minimum

– While constraining that the columns of B are orthonormal

D \approx B C
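A minimal numpy sketch of this decomposition, under my assumption (not stated on the slide) that B and C come from a rank-K truncated SVD of D, which is the classical solution for an orthonormal B minimizing ||D - BC||:

import numpy as np

def pca_bases(D, K):
    """Return B (orthonormal columns) and C minimizing ||D - B C|| at rank K."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    B = U[:, :K]                    # the K principal directions
    C = np.diag(s[:K]) @ Vt[:K]     # weights of every data column in those directions
    return B, C

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 100))      # toy data matrix: 100 ten-dimensional columns
B, C = pca_bases(D, K=3)
print(np.linalg.norm(D - B @ C))    # residual left unexplained by the 3 bases

The data are not centered here, which matches the deck's later simplifying assumption of data centered at the origin.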

Remember Eigenfaces

• Approximate every face f as f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + .. + w_{f,k} V_k

• Estimate V to minimize the squared error

• Error is unexplained by V_1 .. V_k

• Error is orthogonal to Eigenfaces



Karhunen Loeve vs. PCA

• Eigenvectors of the Correlation matrix:

– Principal directions of the tightest ellipse centered on the origin

– Directions that retain maximum energy

• Eigenvectors of the Covariance matrix:

– Principal directions of the tightest ellipse centered on the data

– Directions that retain maximum variance

Karhunen Loeve vs. PCA

• If the data are naturally centered at origin, KLT == PCA

• Following slides refer to PCA!

– Assume data centered at origin for simplicity

• Not essential, as we will see..



Eigen Representation

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): a face x_1 = w_{11} V_1 + e_1]

Representation

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_2 = w_{12} V_1 + e_2; the error is at 90° to the eigenface]

Representation

• K-dimensional representation

– Error is orthogonal to representation

• All data with the same representation w V_1 lie on a plane orthogonal to w V_1

With 2 bases

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_1 = w_{11} V_1 + w_{21} V_2 + e_1; the error is at 90° to the eigenfaces]

With 2 bases

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_2 = w_{12} V_1 + w_{22} V_2 + e_2; the error is at 90° to the eigenfaces]

In Vector Form

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

X_i = w_{1i} V_1 + w_{2i} V_2 + e_i

X_i = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} w_{1i} \\ w_{2i} \end{bmatrix} + e_i

[Figure: the error e_i is at 90° to the eigenfaces V_1, V_2]

In Vector Form

X_i = w_{1i} V_1 + w_{2i} V_2 + e_i

x = V w + e

• K-dimensional representation

• x is a D-dimensional vector

• V is a D x K matrix

• w is a K-dimensional vector

• e is a D-dimensional vector

Learning PCA

• For the given data: find the K-dimensional subspace such that it captures most of the variance in the data

– Variance in the remaining subspace is minimal

Constraints

x = V w + e

• V^T V = I : the eigenvectors are orthogonal to each other

• For every vector, the error is orthogonal to the eigenvectors

– e^T V = 0

• Over the collection of data

– Average w w^T = diagonal : the eigen representations are uncorrelated

– e^T e is minimized : the error variance is minimum

• The mean of the error is 0

A Statistical Formulation of PCA

• x is a random variable generated according to a linear relation

x = V w + e

• w is drawn from a K-dimensional Gaussian with diagonal covariance

w \sim N(0, B)

• e is drawn from a 0-mean (D-K)-rank D-dimensional Gaussian

e \sim N(0, E)

• Estimate V (and B) given examples of x

Linear Gaussian Models!!

• x is a random variable generated according to a linear relation

x = V w + e

• w is drawn from a Gaussian: w \sim N(0, B)

• e is drawn from a 0-mean Gaussian: e \sim N(0, E)

• Estimate V given examples of x

– In the process also estimate B and E


Linear Gaussian Models

• Observations are linear functions of two uncorrelated Gaussian random variables

– A “weight” variable w

– An “error” variable e

– Error not correlated to weight: E[e^T w] = 0

• Learning LGMs: Estimate parameters of the model given instances of x

– The problem of learning the distribution of a Gaussian RV

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

LGMs: Probability Density

• The mean of x:

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

E[x] = \mu + V E[w] + E[e] = \mu

• The covariance of x:

E[(x - \mu)(x - \mu)^T] = V E[w w^T] V^T + E[e e^T] = V B V^T + E

The probability of x

• x is a linear function of Gaussians: x is also Gaussian

• Its mean and variance are as given

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

x \sim N(\mu, V B V^T + E)

P(x) = \frac{1}{\sqrt{(2\pi)^D |V B V^T + E|}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T (V B V^T + E)^{-1} (x - \mu)\right)
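A quick simulation sketch of this claim (the sizes, seed, and tolerance below are arbitrary choices of mine): samples of x = μ + Vw + e should have empirical mean close to μ and empirical covariance close to V B V^T + E.

import numpy as np

rng = np.random.default_rng(1)
D, K, N = 5, 2, 200_000

mu = rng.normal(size=D)
V  = rng.normal(size=(D, K))
B  = np.diag(rng.uniform(0.5, 2.0, size=K))    # diagonal covariance of w
E  = np.diag(rng.uniform(0.1, 0.5, size=D))    # covariance of the error e

w = rng.multivariate_normal(np.zeros(K), B, size=N)
e = rng.multivariate_normal(np.zeros(D), E, size=N)
x = mu + w @ V.T + e                           # x = mu + V w + e, one sample per row

print(np.allclose(x.mean(axis=0), mu, atol=0.1))              # E[x] is close to mu
print(np.allclose(np.cov(x.T), V @ B @ V.T + E, atol=0.1))    # Cov(x) is close to V B V^T + E

Both checks should typically print True with this many samples.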

Estimating the variables of the model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are \mu, V, B and E

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V B V^T + E)

Estimating the model

• The model is indeterminate:

– V w = V C C^{-1} w = (V C)(C^{-1} w) for any invertible C

– We need extra constraints to make the solution unique

• Usual constraint : B = I

– Variance of w is an identity matrix

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V B V^T + E)

Estimating the variables of the model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are \mu, V, and E

x = \mu + V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V V^T + E)

The Maximum Likelihood Estimate

• The ML estimate of \mu does not depend on the covariance of the Gaussian

x \sim N(\mu, V V^T + E)

• Given the training set x_1, x_2, .., x_N, find \mu, V, E

\hat{\mu} = \frac{1}{N}\sum_i x_i

Centered Data

• We can safely assume “centered” data

– \mu = 0

• If the data are not centered, “center” it

– Estimate mean of data

• Which is the maximum likelihood estimate

– Subtract it from the data

Simplified Model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are V and E

x = V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(0, V V^T + E)

Estimating the model

• Given a collection of x_i terms

– x_1, x_2, .., x_N

• Estimate V and E

• w is unknown for each x

• But if we assume we know w for each x, what do we get?

x = V w + e, \qquad x \sim N(0, V V^T + E)

Estimating the Parameters

• We will use a maximum-likelihood estimate

• The log-likelihood of x_1 .. x_N, knowing their w_i:

x_i = V w_i + e, \qquad P(e) = N(0, E), \qquad P(x|w) = N(V w, E)

P(x|w) = \frac{1}{\sqrt{(2\pi)^D |E|}} \exp\!\left(-\tfrac{1}{2}(x - V w)^T E^{-1} (x - V w)\right)

\log P(x_1 .. x_N | w_1 .. w_N) = \tfrac{N}{2}\log|E^{-1}| - \tfrac{1}{2}\sum_i (x_i - V w_i)^T E^{-1} (x_i - V w_i)

Maximizing the log-likelihood

• Differentiating w.r.t. V and setting to 0

LL = \tfrac{N}{2}\log|E^{-1}| - \tfrac{1}{2}\sum_i (x_i - V w_i)^T E^{-1} (x_i - V w_i)

\sum_i E^{-1}(x_i - V w_i) w_i^T = 0 \quad\Rightarrow\quad V = \left(\sum_i x_i w_i^T\right)\left(\sum_i w_i w_i^T\right)^{-1}

• Differentiating w.r.t. E^{-1} and setting to 0:

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i w_i x_i^T
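In numpy, the two closed-form updates above look like this (a sketch under the temporary assumption that the w_i are observed; X and W are hypothetical D x N and K x N arrays holding the x_i and w_i as columns):

import numpy as np

def ml_V_E_given_w(X, W):
    """ML estimates of V and E for x_i = V w_i + e, given observed weights w_i."""
    N = X.shape[1]
    V = (X @ W.T) @ np.linalg.inv(W @ W.T)     # (sum_i x_i w_i^T)(sum_i w_i w_i^T)^-1
    E = (X @ X.T) / N - V @ (W @ X.T) / N      # (1/N) sum_i x_i x_i^T - (1/N) V sum_i w_i x_i^T
    return V, E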

Estimating LGMs: If we know w

• But in reality we don’t know the w for each x

– So how to deal with this?

• EM..

V = \left(\sum_i x_i w_i^T\right)\left(\sum_i w_i w_i^T\right)^{-1}, \qquad E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i w_i x_i^T

x_i = V w_i + e, \qquad P(e) = N(0, E)

Recall EM

• We figured out how to compute parameters if we knew the missing information

• Then we “fragmented” the observations according to the posterior probability P(z|x) and counted as usual

• In effect we took the expectation with respect to the a posteriori probability of the missing data: P(z|x)

[Figure: the dice example from the EM lecture — each observed number (e.g. a 6) is assigned to the collection of "blue" numbers or the collection of "red" numbers; when the dice is unknown, the observation is fragmented between the two collections according to the posterior probability of each dice]

EM for LGMs

• Replace unseen data terms with expectations taken w.r.t. P(w|x_i)

x_i = V w_i + e, \qquad P(e) = N(0, E)

V = \left(\sum_i x_i \, E_{w|x_i}[w]^T\right)\left(\sum_i E_{w|x_i}[w w^T]\right)^{-1}

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i E_{w|x_i}[w]\, x_i^T


Expected Value of w given x

• x and w are jointly Gaussian!

– x is Gaussian

– w is Gaussian

– They are linearly related

x = V w + e, \qquad P(e) = N(0, E), \qquad P(w) = N(0, I), \qquad P(x) = N(0, V V^T + E)

z = [x; w], \qquad P(z) = N(z; \mu_z, C_{zz})

Expected Value of w given x

• x and w are jointly Gaussian!

z = [x; w], \qquad \mu_z = \begin{bmatrix} \mu_x \\ \mu_w \end{bmatrix} = 0, \qquad C_{zz} = \begin{bmatrix} C_{xx} & C_{xw} \\ C_{wx} & C_{ww} \end{bmatrix}

C_{xw} = E[x w^T] = E[(V w + e) w^T] = V

C_{zz} = \begin{bmatrix} V V^T + E & V \\ V^T & I \end{bmatrix}

The conditional expectation of w given x

• P(w|x) is a Gaussian

P(w|x) = N\!\left(w;\; C_{wx} C_{xx}^{-1} x,\; C_{ww} - C_{wx} C_{xx}^{-1} C_{xw}\right)

With C_{xx} = V V^T + E, \; C_{wx} = V^T, \; C_{ww} = I:

E_{w|x_i}[w] = V^T (V V^T + E)^{-1} x_i

E_{w|x_i}[w w^T] = Var(w|x_i) + E_{w|x_i}[w]\,E_{w|x_i}[w]^T = I - V^T (V V^T + E)^{-1} V + E_{w|x_i}[w]\,E_{w|x_i}[w]^T

LGM: The complete EM algorithm

• Initialize V and E

• E step:

E_{w|x_i}[w] = V^T (V V^T + E)^{-1} x_i

E_{w|x_i}[w w^T] = I - V^T (V V^T + E)^{-1} V + E_{w|x_i}[w]\,E_{w|x_i}[w]^T

• M step:

V = \left(\sum_i x_i \, E_{w|x_i}[w]^T\right)\left(\sum_i E_{w|x_i}[w w^T]\right)^{-1}

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i E_{w|x_i}[w]\, x_i^T
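Putting the E and M steps together, a compact numpy sketch of this EM loop under the slides' assumptions (w ~ N(0, I), zero-mean data); the function name, initialization, and iteration count are my own choices, not from the lecture:

import numpy as np

def lgm_em(X, K, iters=50, seed=0):
    """EM for x = V w + e, w ~ N(0, I), e ~ N(0, E).  X is D x N, assumed zero-mean."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(D, K))
    E = np.eye(D)
    for _ in range(iters):
        # E step: posterior moments of w for every x_i
        G = V.T @ np.linalg.inv(V @ V.T + E)      # V^T (V V^T + E)^-1
        Ew = G @ X                                # columns hold E[w | x_i]
        cov_w = np.eye(K) - G @ V                 # posterior covariance, same for every i
        sum_wwT = N * cov_w + Ew @ Ew.T           # sum_i E[w w^T | x_i]
        # M step: the closed-form updates from the slides
        V = (X @ Ew.T) @ np.linalg.inv(sum_wwT)
        E = (X @ X.T) / N - V @ (Ew @ X.T) / N
    return V, E

# Toy usage: recover a 2-dimensional subspace buried in 5-dimensional noisy data.
rng = np.random.default_rng(1)
V_true = rng.normal(size=(5, 2))
X = V_true @ rng.normal(size=(2, 2000)) + 0.1 * rng.normal(size=(5, 2000))
V_est, E_est = lgm_em(X, K=2)

The posterior covariance is computed once per iteration because it does not depend on the individual x_i.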

So what have we achieved

• Employed a complicated EM algorithm to learn a Gaussian PDF for a variable x

• What have we gained???

• Next class:

– PCA

• Sensible PCA

• EM algorithms for PCA

– Factor Analysis

• FA for feature extraction


LGMs : Application 1 Learning principal components

• Find directions that capture most of the variation in the data

• Error is orthogonal to these variations

x = V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E)

LGMs : Application 2 Learning with insufficient data

• The full covariance matrix of a Gaussian has D^2 terms

• Fully captures the relationships between variables

• Problem: Needs a lot of data to estimate robustly

[Figure: a full-covariance Gaussian fit]
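A rough worked comparison of the parameter counts (my own numbers, assuming a diagonal E as in the factor-analysis model previewed for next class):

D = 100,\; K = 10: \qquad D^2 = 10{,}000 \ \text{full-covariance terms} \quad \text{vs.} \quad DK + D = 100 \cdot 10 + 100 = 1{,}100 \ \text{parameters for } V V^T + E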

To be continued..

• Other applications..

• Next class

