
Machine Learning for Signal Processing

Linear Gaussian Models

Class 21. 12 Nov 2013

Instructor: Bhiksha Raj


Administrivia

• HW3 is up


• Projects – please send us an update


Recap: MAP Estimators

• MAP (Maximum A Posteriori): Find a “best guess” for y (statistically), given known x

\hat{y} = \arg\max_y P(y|x)


Recap: MAP estimation

• x and y are jointly Gaussian

• z = [x; y] is Gaussian:

E[z] = \mu_z = [\mu_x; \mu_y]

Var(z) = C_{zz} = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}, \qquad C_{xy} = E[(x - \mu_x)(y - \mu_y)^T]

P(z) = N(z; \mu_z, C_{zz}) = \frac{1}{\sqrt{(2\pi)^D |C_{zz}|}} \exp\!\left(-\tfrac{1}{2}(z - \mu_z)^T C_{zz}^{-1} (z - \mu_z)\right)

MAP estimation: Gaussian PDF

[Figure: the joint Gaussian density over X and Y]

MAP estimation: The Gaussian at a particular value of X

[Figure: the slice of the joint Gaussian at X = x0]

Conditional Probability of y|x

• The conditional probability of y given x is also Gaussian

– The slice in the figure is Gaussian

• The mean of this Gaussian is a function of x

• The variance of y reduces if x is known

– Uncertainty is reduced

P(y|x) = N\!\left(y;\; \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x),\; C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}\right)

E[y|x] = \mu_{y|x} = \mu_y + C_{yx} C_{xx}^{-1}(x - \mu_x)

Var(y|x) = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}
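As a minimal numpy sketch of these conditional-Gaussian formulas (the toy means and covariances below are hypothetical, chosen only so the code runs; they are not from the lecture):

import numpy as np

# Hypothetical joint Gaussian over (x, y), partitioned as on the slides:
# mu_z = [mu_x; mu_y],  C_zz = [[C_xx, C_xy], [C_yx, C_yy]]
mu_x = np.array([1.0, 0.0])
mu_y = np.array([2.0])
C_xx = np.array([[2.0, 0.5], [0.5, 1.0]])
C_xy = np.array([[0.8], [0.3]])            # C_yx = C_xy.T
C_yy = np.array([[1.5]])

def map_mmse_estimate(x):
    """E[y|x] = mu_y + C_yx C_xx^-1 (x - mu_x); for a Gaussian this is also the MAP estimate."""
    gain = C_xy.T @ np.linalg.inv(C_xx)    # C_yx C_xx^-1
    mean = mu_y + gain @ (x - mu_x)        # conditional mean of y
    cov = C_yy - gain @ C_xy               # Var(y|x): smaller than C_yy, uncertainty is reduced
    return mean, cov

y_hat, y_var = map_mmse_estimate(np.array([1.5, -0.2]))
print(y_hat, y_var)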

MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional Gaussian at X = x0; the most likely value of y lies at its peak]

MAP Estimation of a Gaussian RV

\hat{y} = \arg\max_y P(y|x) = E[y|x]

It's also a minimum mean-squared-error estimate

• Minimize error:

Err = E[(y - \hat{y})^T (y - \hat{y}) \mid x] = E[y^T y \mid x] - 2\hat{y}^T E[y \mid x] + \hat{y}^T \hat{y}

• Differentiating and equating to 0:

\frac{d\,Err}{d\hat{y}} = 2\hat{y} - 2E[y \mid x] = 0 \quad\Rightarrow\quad \hat{y} = E[y \mid x]

• The MMSE estimate is the mean of the distribution

For the Gaussian: MAP = MMSE

• The most likely value is also the mean value

• This would be true of any symmetric distribution

MMSE estimates for mixture distributions


• Let P(y|x) be a mixture density:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• The MMSE estimate of y is given by

E[y|x] = \int y \sum_k P(k|x) P(y|x, k) \, dy = \sum_k P(k|x) \int y \, P(y|x, k) \, dy = \sum_k P(k|x) E[y|x, k]

• Just a weighted combination of the MMSE estimates from the component distributions

MMSE estimates from a Gaussian mixture

• Let P(x, y) be a Gaussian mixture:

P(z) = P(x, y) = \sum_k P(k) \, N(z; \mu_k, C_k), \qquad z = [x; y]

• P(y|x) is also a Gaussian mixture:

P(y|x) = \frac{P(x, y)}{P(x)} = \sum_k \frac{P(x, y, k)}{P(x)} = \sum_k \frac{P(k) P(x|k) P(y|x, k)}{P(x)} = \sum_k P(k|x) P(y|x, k)

MMSE estimates from a Gaussian mixture

• P(y|x) is a Gaussian mixture:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• With each joint Gaussian partitioned as

\mu_k = [\mu_{k,x}; \mu_{k,y}], \qquad C_k = \begin{bmatrix} C_{k,xx} & C_{k,xy} \\ C_{k,yx} & C_{k,yy} \end{bmatrix}

• the component conditionals are Gaussian:

P(y|x, k) = N\!\left(y;\; \mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x}),\; C_{k,yy} - C_{k,yx} C_{k,xx}^{-1} C_{k,xy}\right)

P(y|x) = \sum_k P(k|x) \, N\!\left(y;\; \mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x}),\; C_{k,yy} - C_{k,yx} C_{k,xx}^{-1} C_{k,xy}\right)

MMSE estimates from a Gaussian mixture

• P(y|x) is a mixture Gaussian density:

P(y|x) = \sum_k P(k|x) P(y|x, k)

• E[y|x] is also a mixture:

E[y|x] = \sum_k P(k|x) E[y|x, k] = \sum_k P(k|x) \left(\mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x})\right)

MMSE estimates from a Gaussian mixture

• Weighted combination of MMSE estimates obtained from individual Gaussians!

E[y|x] = \sum_k P(k|x) \left(\mu_{k,y} + C_{k,yx} C_{k,xx}^{-1}(x - \mu_{k,x})\right)

• The weight P(k|x) is easily computed too:

P(k|x) = \frac{P(k, x)}{P(x)} = \frac{P(k) \, N(x; \mu_{k,x}, C_{k,xx})}{\sum_{k'} P(k') \, N(x; \mu_{k',x}, C_{k',xx})}

MMSE estimates from a Gaussian mixture


A mixture of estimates from individual Gaussians
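A rough numpy/scipy sketch of this mixture estimator (the two-component joint GMM below is hypothetical and hard-coded; in practice it would first be learned from joint (x, y) vectors, e.g. with EM, which is not shown here):

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component joint GMM over z = [x; y], with x and y both 1-dimensional.
priors = np.array([0.6, 0.4])                                  # P(k)
mus    = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]          # mu_k = [mu_kx; mu_ky]
covs   = [np.array([[1.0, 0.7], [0.7, 1.5]]),
          np.array([[0.8, -0.3], [-0.3, 0.6]])]                # C_k, partitioned below

def mmse_from_gmm(x, dx=1):
    """E[y|x] = sum_k P(k|x) (mu_ky + C_kyx C_kxx^-1 (x - mu_kx))."""
    x = np.atleast_1d(x)
    # P(k|x) is proportional to P(k) N(x; mu_kx, C_kxx)
    lik = np.array([p * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                    for p, m, C in zip(priors, mus, covs)])
    post = lik / lik.sum()
    est = np.zeros(mus[0].size - dx)
    for pk, m, C in zip(post, mus, covs):
        gain = C[dx:, :dx] @ np.linalg.inv(C[:dx, :dx])        # C_kyx C_kxx^-1
        est += pk * (m[dx:] + gain @ (x - m[:dx]))             # weighted component MMSE estimate
    return est

print(mmse_from_gmm(1.2))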

Voice Morphing

• Align training recordings from both speakers

– Cepstral vector sequence

• Learn a GMM on joint vectors

• Given speech from one speaker, find MMSE estimate of the other

• Synthesize from cepstra


MMSE with GMM: Voice Transformation

- Festvox GMM transformation suite (Toda)

[Audio demo grid: voice transformations between each pair of the speakers awb, bdl, jmk, slt]

MAP / ML / MMSE

• General statistical estimators

• All used to predict a variable, based on other parameters related to it..

• Most common assumption: Data are Gaussian, all RVs are Gaussian

– Other probability densities may also be used..

• For Gaussians, the relationships are linear, as we saw..


Gaussians and more Gaussians..

• Linear Gaussian Models..

• But first a recap


A Brief Recap

• Principal component analysis: Find the K bases that best explain the given data

• Find B and C such that the difference between D and BC is minimum

– While constraining that the columns of B are orthonormal

D \approx B C
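A minimal numpy sketch of this decomposition, under my assumption (not stated on the slide) that B and C come from a rank-K truncated SVD of D, which is the classical solution for an orthonormal B minimizing ||D - BC||:

import numpy as np

def pca_bases(D, K):
    """Return B (orthonormal columns) and C minimizing ||D - B C|| at rank K."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    B = U[:, :K]                    # the K principal directions
    C = np.diag(s[:K]) @ Vt[:K]     # weights of every data column in those directions
    return B, C

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 100))      # toy data matrix: 100 ten-dimensional columns
B, C = pca_bases(D, K=3)
print(np.linalg.norm(D - B @ C))    # residual left unexplained by the 3 bases

The data are not centered here, which matches the deck's later simplifying assumption of data centered at the origin.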

Remember Eigenfaces

• Approximate every face f as f = w_{f,1} V_1 + w_{f,2} V_2 + w_{f,3} V_3 + .. + w_{f,k} V_k

• Estimate V to minimize the squared error

• Error is unexplained by V_1 .. V_k

• Error is orthogonal to Eigenfaces



Karhunen Loeve vs. PCA

• Eigenvectors of the Correlation matrix:

– Principal directions of the tightest ellipse centered on the origin

– Directions that retain maximum energy

• Eigenvectors of the Covariance matrix:

– Principal directions of the tightest ellipse centered on the data

– Directions that retain maximum variance

Karhunen Loeve vs. PCA

• If the data are naturally centered at origin, KLT == PCA

• Following slides refer to PCA!

– Assume data centered at origin for simplicity

• Not essential, as we will see..



Eigen Representation

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): a face x_1 = w_{11} V_1 + e_1]

Representation

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_2 = w_{12} V_1 + e_2; the error is at 90° to the eigenface]

Representation

• K-dimensional representation

– Error is orthogonal to representation

• All data with the same representation w V_1 lie on a plane orthogonal to w V_1

With 2 bases

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_1 = w_{11} V_1 + w_{21} V_2 + e_1; the error is at 90° to the eigenfaces]

With 2 bases

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

[Figure (illustration assuming 3D space): x_2 = w_{12} V_1 + w_{22} V_2 + e_2; the error is at 90° to the eigenfaces]

In Vector Form

• K-dimensional representation

– Error is orthogonal to representation

– Weight and error are specific to data instance

X_i = w_{1i} V_1 + w_{2i} V_2 + e_i

X_i = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} w_{1i} \\ w_{2i} \end{bmatrix} + e_i

[Figure: the error e_i is at 90° to the eigenfaces V_1, V_2]

In Vector Form

X_i = w_{1i} V_1 + w_{2i} V_2 + e_i

x = V w + e

• K-dimensional representation

• x is a D-dimensional vector

• V is a D x K matrix

• w is a K-dimensional vector

• e is a D-dimensional vector

Learning PCA

• For the given data: find the K-dimensional subspace such that it captures most of the variance in the data

– Variance in the remaining subspace is minimal

Constraints

x = V w + e

• V^T V = I : the eigenvectors are orthogonal to each other

• For every vector, the error is orthogonal to the eigenvectors

– e^T V = 0

• Over the collection of data

– Average w w^T = diagonal : the eigen representations are uncorrelated

– e^T e is minimized : the error variance is minimum

• The mean of the error is 0

A Statistical Formulation of PCA

• x is a random variable generated according to a linear relation

x = V w + e

• w is drawn from a K-dimensional Gaussian with diagonal covariance

w \sim N(0, B)

• e is drawn from a 0-mean (D-K)-rank D-dimensional Gaussian

e \sim N(0, E)

• Estimate V (and B) given examples of x

Linear Gaussian Models!!

• x is a random variable generated according to a linear relation

x = V w + e

• w is drawn from a Gaussian: w \sim N(0, B)

• e is drawn from a 0-mean Gaussian: e \sim N(0, E)

• Estimate V given examples of x

– In the process also estimate B and E


Linear Gaussian Models

• Observations are linear functions of two uncorrelated Gaussian random variables

– A “weight” variable w

– An “error” variable e

– Error not correlated to weight: E[e^T w] = 0

• Learning LGMs: Estimate parameters of the model given instances of x

– The problem of learning the distribution of a Gaussian RV

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

LGMs: Probability Density

• The mean of x:

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

E[x] = \mu + V E[w] + E[e] = \mu

• The covariance of x:

E[(x - \mu)(x - \mu)^T] = V E[w w^T] V^T + E[e e^T] = V B V^T + E

The probability of x

• x is a linear function of Gaussians: x is also Gaussian

• Its mean and variance are as given

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E)

x \sim N(\mu, V B V^T + E)

P(x) = \frac{1}{\sqrt{(2\pi)^D |V B V^T + E|}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T (V B V^T + E)^{-1} (x - \mu)\right)
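A quick simulation sketch of this claim (the sizes, seed, and tolerance below are arbitrary choices of mine): samples of x = μ + Vw + e should have empirical mean close to μ and empirical covariance close to V B V^T + E.

import numpy as np

rng = np.random.default_rng(1)
D, K, N = 5, 2, 200_000

mu = rng.normal(size=D)
V  = rng.normal(size=(D, K))
B  = np.diag(rng.uniform(0.5, 2.0, size=K))    # diagonal covariance of w
E  = np.diag(rng.uniform(0.1, 0.5, size=D))    # covariance of the error e

w = rng.multivariate_normal(np.zeros(K), B, size=N)
e = rng.multivariate_normal(np.zeros(D), E, size=N)
x = mu + w @ V.T + e                           # x = mu + V w + e, one sample per row

print(np.allclose(x.mean(axis=0), mu, atol=0.1))              # E[x] is close to mu
print(np.allclose(np.cov(x.T), V @ B @ V.T + E, atol=0.1))    # Cov(x) is close to V B V^T + E

Both checks should typically print True with this many samples.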

Estimating the variables of the model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are \mu, V, B and E

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V B V^T + E)

Estimating the model

• The model is indeterminate:

– V w = V C C^{-1} w = (V C)(C^{-1} w) for any invertible C

– We need extra constraints to make the solution unique

• Usual constraint : B = I

– Variance of w is an identity matrix

x = \mu + V w + e, \qquad w \sim N(0, B), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V B V^T + E)

Estimating the variables of the model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are \mu, V, and E

x = \mu + V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(\mu, V V^T + E)

The Maximum Likelihood Estimate

• The ML estimate of \mu does not depend on the covariance of the Gaussian

x \sim N(\mu, V V^T + E)

• Given the training set x_1, x_2, .., x_N, find \mu, V, E

\hat{\mu} = \frac{1}{N}\sum_i x_i

Centered Data

• We can safely assume “centered” data

– \mu = 0

• If the data are not centered, “center” it

– Estimate mean of data

• Which is the maximum likelihood estimate

– Subtract it from the data

Simplified Model

• Estimating the variables of the LGM is equivalent to estimating P(x)

– The variables are V and E

x = V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E), \qquad x \sim N(0, V V^T + E)

Estimating the model

• Given a collection of x_i terms

– x_1, x_2, .., x_N

• Estimate V and E

• w is unknown for each x

• But if we assume we know w for each x, what do we get?

x = V w + e, \qquad x \sim N(0, V V^T + E)

Estimating the Parameters

• We will use a maximum-likelihood estimate

• The log-likelihood of x_1 .. x_N, knowing their w_i:

x_i = V w_i + e, \qquad P(e) = N(0, E), \qquad P(x|w) = N(V w, E)

P(x|w) = \frac{1}{\sqrt{(2\pi)^D |E|}} \exp\!\left(-\tfrac{1}{2}(x - V w)^T E^{-1} (x - V w)\right)

\log P(x_1 .. x_N | w_1 .. w_N) = \tfrac{N}{2}\log|E^{-1}| - \tfrac{1}{2}\sum_i (x_i - V w_i)^T E^{-1} (x_i - V w_i)

Maximizing the log-likelihood

• Differentiating w.r.t. V and setting to 0

LL = \tfrac{N}{2}\log|E^{-1}| - \tfrac{1}{2}\sum_i (x_i - V w_i)^T E^{-1} (x_i - V w_i)

\sum_i E^{-1}(x_i - V w_i) w_i^T = 0 \quad\Rightarrow\quad V = \left(\sum_i x_i w_i^T\right)\left(\sum_i w_i w_i^T\right)^{-1}

• Differentiating w.r.t. E^{-1} and setting to 0:

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i w_i x_i^T
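In numpy, the two closed-form updates above look like this (a sketch under the temporary assumption that the w_i are observed; X and W are hypothetical D x N and K x N arrays holding the x_i and w_i as columns):

import numpy as np

def ml_V_E_given_w(X, W):
    """ML estimates of V and E for x_i = V w_i + e, given observed weights w_i."""
    N = X.shape[1]
    V = (X @ W.T) @ np.linalg.inv(W @ W.T)     # (sum_i x_i w_i^T)(sum_i w_i w_i^T)^-1
    E = (X @ X.T) / N - V @ (W @ X.T) / N      # (1/N) sum_i x_i x_i^T - (1/N) V sum_i w_i x_i^T
    return V, E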

Estimating LGMs: If we know w

• But in reality we don’t know the w for each x

– So how to deal with this?

• EM..

V = \left(\sum_i x_i w_i^T\right)\left(\sum_i w_i w_i^T\right)^{-1}, \qquad E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i w_i x_i^T

x_i = V w_i + e, \qquad P(e) = N(0, E)

Recall EM

• We figured out how to compute parameters if we knew the missing information

• Then we “fragmented” the observations according to the posterior probability P(z|x) and counted as usual

• In effect we took the expectation with respect to the a posteriori probability of the missing data: P(z|x)

[Figure: the dice example from the EM lecture — each observed number (e.g. a 6) is assigned to the collection of "blue" numbers or the collection of "red" numbers; when the dice is unknown, the observation is fragmented between the two collections according to the posterior probability of each dice]

EM for LGMs

• Replace unseen data terms with expectations taken w.r.t. P(w|x_i)

x_i = V w_i + e, \qquad P(e) = N(0, E)

V = \left(\sum_i x_i \, E_{w|x_i}[w]^T\right)\left(\sum_i E_{w|x_i}[w w^T]\right)^{-1}

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i E_{w|x_i}[w]\, x_i^T


Expected Value of w given x

• x and w are jointly Gaussian!

– x is Gaussian

– w is Gaussian

– They are linearly related

x = V w + e, \qquad P(e) = N(0, E), \qquad P(w) = N(0, I), \qquad P(x) = N(0, V V^T + E)

z = [x; w], \qquad P(z) = N(z; \mu_z, C_{zz})

Expected Value of w given x

• x and w are jointly Gaussian!

z = [x; w], \qquad \mu_z = \begin{bmatrix} \mu_x \\ \mu_w \end{bmatrix} = 0, \qquad C_{zz} = \begin{bmatrix} C_{xx} & C_{xw} \\ C_{wx} & C_{ww} \end{bmatrix}

C_{xw} = E[x w^T] = E[(V w + e) w^T] = V

C_{zz} = \begin{bmatrix} V V^T + E & V \\ V^T & I \end{bmatrix}

The conditional expectation of w given x

• P(w|x) is a Gaussian

P(w|x) = N\!\left(w;\; C_{wx} C_{xx}^{-1} x,\; C_{ww} - C_{wx} C_{xx}^{-1} C_{xw}\right)

With C_{xx} = V V^T + E, \; C_{wx} = V^T, \; C_{ww} = I:

E_{w|x_i}[w] = V^T (V V^T + E)^{-1} x_i

E_{w|x_i}[w w^T] = Var(w|x_i) + E_{w|x_i}[w]\,E_{w|x_i}[w]^T = I - V^T (V V^T + E)^{-1} V + E_{w|x_i}[w]\,E_{w|x_i}[w]^T

LGM: The complete EM algorithm

• Initialize V and E

• E step:

E_{w|x_i}[w] = V^T (V V^T + E)^{-1} x_i

E_{w|x_i}[w w^T] = I - V^T (V V^T + E)^{-1} V + E_{w|x_i}[w]\,E_{w|x_i}[w]^T

• M step:

V = \left(\sum_i x_i \, E_{w|x_i}[w]^T\right)\left(\sum_i E_{w|x_i}[w w^T]\right)^{-1}

E = \frac{1}{N}\sum_i x_i x_i^T - \frac{1}{N} V \sum_i E_{w|x_i}[w]\, x_i^T
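Putting the E and M steps together, a compact numpy sketch of this EM loop under the slides' assumptions (w ~ N(0, I), zero-mean data); the function name, initialization, and iteration count are my own choices, not from the lecture:

import numpy as np

def lgm_em(X, K, iters=50, seed=0):
    """EM for x = V w + e, w ~ N(0, I), e ~ N(0, E).  X is D x N, assumed zero-mean."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(D, K))
    E = np.eye(D)
    for _ in range(iters):
        # E step: posterior moments of w for every x_i
        G = V.T @ np.linalg.inv(V @ V.T + E)      # V^T (V V^T + E)^-1
        Ew = G @ X                                # columns hold E[w | x_i]
        cov_w = np.eye(K) - G @ V                 # posterior covariance, same for every i
        sum_wwT = N * cov_w + Ew @ Ew.T           # sum_i E[w w^T | x_i]
        # M step: the closed-form updates from the slides
        V = (X @ Ew.T) @ np.linalg.inv(sum_wwT)
        E = (X @ X.T) / N - V @ (Ew @ X.T) / N
    return V, E

# Toy usage: recover a 2-dimensional subspace buried in 5-dimensional noisy data.
rng = np.random.default_rng(1)
V_true = rng.normal(size=(5, 2))
X = V_true @ rng.normal(size=(2, 2000)) + 0.1 * rng.normal(size=(5, 2000))
V_est, E_est = lgm_em(X, K=2)

The posterior covariance is computed once per iteration because it does not depend on the individual x_i.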

So what have we achieved

• Employed a complicated EM algorithm to learn a Gaussian PDF for a variable x

• What have we gained???

• Next class:

– PCA

• Sensible PCA

• EM algorithms for PCA

– Factor Analysis

• FA for feature extraction


LGMs : Application 1 Learning principal components

• Find directions that capture most of the variation in the data

• Error is orthogonal to these variations

x = V w + e, \qquad w \sim N(0, I), \qquad e \sim N(0, E)

LGMs : Application 2 Learning with insufficient data

• The full covariance matrix of a Gaussian has D^2 terms

• Fully captures the relationships between variables

• Problem: Needs a lot of data to estimate robustly

[Figure: a full-covariance Gaussian fit]
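A rough worked comparison of the parameter counts (my own numbers, assuming a diagonal E as in the factor-analysis model previewed for next class):

D = 100,\; K = 10: \qquad D^2 = 10{,}000 \ \text{full-covariance terms} \quad \text{vs.} \quad DK + D = 100 \cdot 10 + 100 = 1{,}100 \ \text{parameters for } V V^T + E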

To be continued..

• Other applications..

• Next class

