TRANSCRIPT
Machine Learning for Signal Processing
Linear Gaussian Models
Class 21. 12 Nov 2013
Instructor: Bhiksha Raj
Administrivia
• HW3 is up
• Projects – please send us an update
Recap: MAP Estimators
• MAP (Maximum A Posteriori): Find a “best
guess” for y (statistically), given known x
y = argmax Y P(Y|x)
Recap: MAP estimation
• x and y are jointly Gaussian
• z = [x; y] is Gaussian, with mean and covariance
  E[z] = μ_z = [μ_x ; μ_y]
  Var(z) = C_zz = [ C_xx  C_xy ; C_yx  C_yy ],  where C_xy = E[ (x - μ_x)(y - μ_y)^T ]
• The Gaussian PDF:
  P(z) = N(z; μ_z, C_zz) = (1 / sqrt((2π)^D |C_zz|)) exp( -0.5 (z - μ_z)^T C_zz^{-1} (z - μ_z) )
MAP estimation: Gaussian PDF
[Figure: a two-dimensional Gaussian density over variables X and Y]
MAP estimation: The Gaussian at a particular value of X
[Figure: the joint Gaussian evaluated along the slice x = x0]
Conditional Probability of y|x
• The conditional probability of y given x is also Gaussian
– The slice in the figure is Gaussian
• The mean of this Gaussian is a function of x
• The variance of y reduces if x is known
– Uncertainty is reduced
  P(y|x) = N(y; μ_y + C_yx C_xx^{-1} (x - μ_x),  C_yy - C_yx C_xx^{-1} C_xy)
  E[y|x] = μ_{y|x} = μ_y + C_yx C_xx^{-1} (x - μ_x)
  Var(y|x) = C_yy - C_yx C_xx^{-1} C_xy
MAP estimation: The Gaussian at a particular value of X
[Figure: the conditional Gaussian slice at x = x0; its peak marks the most likely value of y]
MAP Estimation of a Gaussian RV
  ŷ = argmax_y P(y|x) = E[y|x]
It's also a minimum mean squared error (MMSE) estimate
• Minimize error:
• Differentiating and equating to 0:
  Err = E[ (y - ŷ)^T (y - ŷ) | x ]
  Err = E[ y^T y - 2 ŷ^T y + ŷ^T ŷ | x ] = E[ y^T y | x ] - 2 ŷ^T E[ y | x ] + ŷ^T ŷ
  d Err / d ŷ = -2 E[ y | x ] + 2 ŷ = 0
  ŷ = E[ y | x ]  :  the MMSE estimate is the mean of the distribution
For the Gaussian: MAP = MMSE
[Figure: for a Gaussian, the most likely value is also the mean value]
• This would be true of any symmetric distribution
MMSE estimates for mixture distributions
• Let P(y|x) be a mixture density:
  P(y|x) = Σ_k P(k|x) P(y|x, k)
• The MMSE estimate of y is given by
  E[y|x] = ∫ y Σ_k P(k|x) P(y|x, k) dy = Σ_k P(k|x) ∫ y P(y|x, k) dy = Σ_k P(k|x) E[y|x, k]
• Just a weighted combination of the MMSE estimates from the component distributions
MMSE estimates from a Gaussian mixture
• Let P(x, y) be a Gaussian mixture:
  P(x, y) = P(z) = Σ_k P(k) N(z; μ_k, C_k),  where z = [x; y]
• P(y|x) is also a Gaussian mixture:
  P(y|x) = P(x, y) / P(x) = Σ_k P(k) P(x, y | k) / P(x) = Σ_k P(k) P(x|k) P(y|x, k) / P(x)
  P(y|x) = Σ_k P(k|x) P(y|x, k)
MMSE estimates from a Gaussian mixture
• P(y|x) is a Gaussian mixture:
  P(y|x) = Σ_k P(k|x) P(y|x, k)
• For each component k, partition the mean and covariance:
  z = [x; y],  μ_k = [μ_x,k ; μ_y,k],  C_k = [ C_xx,k  C_xy,k ; C_yx,k  C_yy,k ]
  P(y|x, k) = N(y; μ_y,k + C_yx,k C_xx,k^{-1} (x - μ_x,k),  C_yy,k - C_yx,k C_xx,k^{-1} C_xy,k)
• So
  P(y|x) = Σ_k P(k|x) N(y; μ_y,k + C_yx,k C_xx,k^{-1} (x - μ_x,k),  C_yy,k - C_yx,k C_xx,k^{-1} C_xy,k)
MMSE estimates from a Gaussian mixture
• P(y|x) is a Gaussian mixture density:
  P(y|x) = Σ_k P(k|x) N(y; μ_y,k + C_yx,k C_xx,k^{-1} (x - μ_x,k),  C_yy,k - C_yx,k C_xx,k^{-1} C_xy,k)
• E[y|x] is also a mixture:
  E[y|x] = Σ_k P(k|x) E[y|x, k]
  E[y|x] = Σ_k P(k|x) ( μ_y,k + C_yx,k C_xx,k^{-1} (x - μ_x,k) )
MMSE estimates from a Gaussian mixture
  E[y|x] = Σ_k P(k|x) ( μ_y,k + C_yx,k C_xx,k^{-1} (x - μ_x,k) )
• A weighted combination of MMSE estimates obtained from the individual Gaussians!
• The weight P(k|x) is easily computed too:
  P(k|x) = P(k) P(x|k) / P(x),  where P(x|k) = N(x; μ_x,k, C_xx,k)
MMSE estimates from a Gaussian mixture
A mixture of estimates from individual Gaussians
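As a rough illustration of the estimator above, here is a numpy/scipy sketch that assumes the joint GMM over z = [x; y] has already been trained; the function name, the array shapes, and the use of scipy.stats.multivariate_normal are choices made here for illustration, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_mmse_estimate(x, weights, means, covs, dx):
    """E[y|x] under a joint GMM over z = [x; y].

    weights: (K,) priors P(k);  means: (K, D) joint means;  covs: (K, D, D);
    dx: dimensionality of x (the first dx coordinates of z).
    """
    K = len(weights)
    log_px_k = np.empty(K)      # log P(x|k) = log N(x; mu_x,k, C_xx,k)
    cond_means = []             # E[y|x, k] for each component
    for k in range(K):
        mu_x, mu_y = means[k][:dx], means[k][dx:]
        Cxx, Cyx = covs[k][:dx, :dx], covs[k][dx:, :dx]
        log_px_k[k] = multivariate_normal.logpdf(x, mu_x, Cxx)
        cond_means.append(mu_y + Cyx @ np.linalg.solve(Cxx, x - mu_x))
    # P(k|x) is proportional to P(k) P(x|k); compute it in the log domain for stability
    log_post = np.log(weights) + log_px_k
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Weighted combination of the per-component MMSE estimates
    return post @ np.vstack(cond_means)
```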
Voice Morphing
• Align training recordings from both speakers
– Cepstral vector sequence
• Learn a GMM on joint vectors
• Given speech from one speaker, find MMSE estimate of the other
• Synthesize from cepstra
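A hypothetical end-to-end sketch of the conversion step, reusing gmm_mmse_estimate from the previous block; extract_cepstra and synthesize_from_cepstra are placeholder names standing in for whatever analysis/synthesis front end is used, not real functions from the slides or from Festvox.

```python
import numpy as np

def convert_voice(source_cepstra, weights, means, covs):
    """Map each source cepstral frame to the MMSE estimate of the target frame.

    source_cepstra: (T, dx) array; the GMM was trained on stacked [source; target] frames.
    """
    dx = source_cepstra.shape[1]
    return np.array([gmm_mmse_estimate(frame, weights, means, covs, dx)
                     for frame in source_cepstra])

# target_cepstra = convert_voice(extract_cepstra(source_wav), weights, means, covs)
# output_wav = synthesize_from_cepstra(target_cepstra)   # hypothetical synthesis step
```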
MMSE with GMM: Voice Transformation
• Festvox GMM transformation suite (Toda)
[Audio demo matrix: transformations among speakers awb, bdl, jmk, slt]
MAP / ML / MMSE
• General statistical estimators
• All used to predict a variable, based on other parameters related to it..
• Most common assumption: Data are Gaussian, all RVs are Gaussian
– Other probability densities may also be used..
• For Gaussians, the relationships are linear, as we saw
Gaussians and more Gaussians..
• Linear Gaussian Models..
• But first a recap
A Brief Recap
• Principal component analysis: Find the K bases that
best explain the given data
• Find B and C such that the difference between D and
BC is minimum
– While constraining that the columns of B are orthonormal
  D ≈ B C
  [Figure: the data matrix D approximated as the product of a basis matrix B and a weight matrix C]
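One way to compute this decomposition is through the SVD; a small numpy sketch follows, with the matrix shapes, the random example, and the centering step as illustrative assumptions (centering is discussed a few slides later).

```python
import numpy as np

def pca_bases(D, K):
    """Return B (orthonormal columns) and C minimizing ||D - B C||_F for a data matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    B = U[:, :K]        # top-K left singular vectors: orthonormal columns
    C = B.T @ D         # least-squares weights, using B^T B = I
    return B, C

# Example with random data: 500 ten-dimensional points, 3 bases
X = np.random.randn(10, 500)
B, C = pca_bases(X - X.mean(axis=1, keepdims=True), K=3)   # center first (see later slides)
print(np.allclose(B.T @ B, np.eye(3)))                     # orthonormality constraint holds
```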
Remember Eigenfaces
• Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + ... + w_f,K V_K
• Estimate V to minimize the squared error
• Error is unexplained by V_1 .. V_K
• Error is orthogonal to the Eigenfaces
Karhunen-Loève vs. PCA
• Eigenvectors of the Correlation matrix:
  – Principal directions of the tightest ellipse centered on the origin
  – Directions that retain maximum energy
• Eigenvectors of the Covariance matrix:
  – Principal directions of the tightest ellipse centered on the data
  – Directions that retain maximum variance
Karhunen-Loève vs. PCA
• If the data are naturally centered at origin, KLT == PCA
• Following slides refer to PCA!
– Assume data centered at origin for simplicity
• Not essential, as we will see..
Remember Eigenfaces
• Approximate every face f as f = w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + ... + w_f,K V_K
• Estimate V to minimize the squared error
• Error is unexplained by V_1 .. V_K
• Error is orthogonal to the Eigenfaces
Eigen Representation
• K-dimensional representation
– Error is orthogonal to representation
– Weight and error are specific to data instance
[Figure: a face image = w_11 × (eigenface V_1) + error e_1; illustration assuming 3D space]
Representation
• K-dimensional representation
– Error is orthogonal to representation
– Weight and error are specific to data instance
[Figure: another face = w_12 × V_1 + e_2; illustration assuming 3D space; the error is at 90° to the eigenface]
Representation
• K-dimensional representation
– Error is orthogonal to representation
[Figure: all data with the same representation w V_1 lie on a plane orthogonal to wV_1]
With 2 bases
• K-dimensional representation
– Error is orthogonal to representation
– Weight and error are specific to data instance
[Figure: face 1 = w_11 V_1 + w_21 V_2 + e_1; illustration assuming 3D space; the error is at 90° to the eigenfaces]
With 2 bases
• K-dimensional representation
– Error is orthogonal to representation
– Weight and error are specific to data instance
[Figure: face 2 = w_12 V_1 + w_22 V_2 + e_2; illustration assuming 3D space; the error is at 90° to the eigenfaces]
In Vector Form
• K-dimensional representation
– Error is orthogonal to representation
– Weight and error are specific to data instance
  X_i = w_1i V_1 + w_2i V_2 + e_i
  X_i = [ V_1  V_2 ] [ w_1i ; w_2i ] + e_i
  [Figure: the error e_i is at 90° to the eigenfaces]
In Vector Form
  X_i = w_1i V_1 + w_2i V_2 + e_i
  x = V w + e
• K-dimensional representation
• x is a D-dimensional vector
• V is a D x K matrix
• w is a K-dimensional vector
• e is a D-dimensional vector
Learning PCA
• For the given data: find the K-dimensional subspace such that it captures most of the variance in the data
– Variance in the remaining subspace is minimal
Constraints
  x = V w + e
• V^T V = I : eigenvectors are orthogonal to each other
• For every vector, the error is orthogonal to the eigenvectors
  – e^T V = 0
• Over the collection of data
  – Average w w^T = Diagonal : eigen representations are uncorrelated
  – Determinant of the average e e^T = minimum : error variance is minimum
• Mean of error is 0
A Statistical Formulation of PCA
• x is a random variable generated according to a linear relation
  x = V w + e,  w ~ N(0, B),  e ~ N(0, E)
• w is drawn from a K-dimensional Gaussian with diagonal covariance B
• e is drawn from a 0-mean, (D-K)-rank, D-dimensional Gaussian
• Estimate V (and B) given examples of x
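A tiny sketch of this generative story, just sampling data from the model; V, B, E, the dimensions, and the way the rank-deficient error covariance is built are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 1000

# Orthonormal bases V and their orthogonal complement U
V_full, _ = np.linalg.qr(rng.standard_normal((D, D)))
V, U = V_full[:, :K], V_full[:, K:]

B = np.diag([4.0, 1.0])      # diagonal covariance of w
E = 0.1 * (U @ U.T)          # rank (D-K) error covariance, orthogonal to span(V)

w = rng.multivariate_normal(np.zeros(K), B, size=N)
e = rng.multivariate_normal(np.zeros(D), E, size=N)
x = w @ V.T + e              # x = V w + e, row-wise

# Empirical covariance should approach V B V^T + E for large N
print(np.abs(np.cov(x.T) - (V @ B @ V.T + E)).max())
```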
Linear Gaussian Models!!
• x is a random variable generated according to a linear relation
• w is drawn from a Gaussian
• e is drawn from a 0-mean Gaussian
• Estimate V given examples of x
  – In the process also estimate B and E
  x = V w + e,  w ~ N(0, B),  e ~ N(0, E)
Linear Gaussian Models
• Observations are linear functions of two uncorrelated Gaussian random variables
– A “weight” variable w
– An “error” variable e
– Error not correlated to weight: E[eTw] = 0
• Learning LGMs: Estimate parameters of the model given instances of x
– The problem of learning the distribution of a Gaussian RV
  x = μ + V w + e,  w ~ N(0, B),  e ~ N(0, E)
LGMs: Probability Density
• The mean of x:
  x = μ + V w + e,  w ~ N(0, B),  e ~ N(0, E)
  E[x] = E[μ] + V E[w] + E[e] = μ
• The Covariance of x:
  E[ (x - E[x]) (x - E[x])^T ] = V B V^T + E
The probability of x
• x is a linear function of Gaussians: x is also Gaussian
• Its mean and variance are as given
  x = μ + V w + e,  w ~ N(0, B),  e ~ N(0, E)
  x ~ N(μ, V B V^T + E)
  P(x) = (1 / sqrt((2π)^D |V B V^T + E|)) exp( -0.5 (x - μ)^T (V B V^T + E)^{-1} (x - μ) )
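A short sketch evaluating this density as a log-likelihood; the use of scipy and the argument layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lgm_loglik(X, mu, V, B, E):
    """Total log-likelihood of the rows of X under x ~ N(mu, V B V^T + E)."""
    return multivariate_normal.logpdf(X, mean=mu, cov=V @ B @ V.T + E).sum()
```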
Estimating the variables of the model
• Estimating the variables of the LGM is equivalent to estimating P(x)
– The variables are μ, V, B and E
  x = μ + V w + e,  w ~ N(0, B),  e ~ N(0, E)
  x ~ N(μ, V B V^T + E)
Estimating the model
• The model is indeterminate:
– V w = V C C^{-1} w = (V C)(C^{-1} w)
– We need extra constraints to make the solution unique
• Usual constraint : B = I
– Variance of w is an identity matrix
  x = μ + V w + e,  w ~ N(0, B),  e ~ N(0, E)
  x ~ N(μ, V B V^T + E)
Estimating the variables of the model
• Estimating the variables of the LGM is equivalent to estimating P(x)
– The variables are μ, V, and E
  x = μ + V w + e,  w ~ N(0, I),  e ~ N(0, E)
  x ~ N(μ, V V^T + E)
The Maximum Likelihood Estimate
• The ML estimate of μ does not depend on the covariance of the Gaussian
  x ~ N(μ, V V^T + E)
• Given training set x_1, x_2, .., x_N, find μ, V, E
  μ = (1/N) Σ_i x_i
Centered Data
• We can safely assume “centered” data
– μ = 0
• If the data are not centered, “center” it
– Estimate mean of data
• Which is the maximum likelihood estimate
– Subtract it from the data
Simplified Model
• Estimating the variables of the LGM is equivalent to estimating P(x)
– The variables are V and E
  x = V w + e,  w ~ N(0, I),  e ~ N(0, E)
  x ~ N(0, V V^T + E)
Estimating the model
• Given a collection of observations
  – x_1, x_2, .., x_N
• Estimate V and E
• w is unknown for each x
• But if we assume we know w for each x, what do we get?
  x = V w + e,  x ~ N(0, V V^T + E)
Estimating the Parameters
• We will use a maximum-likelihood estimate
• The log-likelihood of x_1 .. x_N, knowing their w_i's:
  x_i = V w_i + e,  P(e) = N(0, E),  P(x|w) = N(V w, E)
  P(x|w) = (1 / sqrt((2π)^D |E|)) exp( -0.5 (x - V w)^T E^{-1} (x - V w) )
  log P(x_1 .. x_N | w_1 .. w_N) = 0.5 N log|E^{-1}| - 0.5 Σ_i (x_i - V w_i)^T E^{-1} (x_i - V w_i) + const
Maximizing the log-likelihood
• Differentiating w.r.t. V and setting to 0
  LL = 0.5 N log|E^{-1}| - 0.5 Σ_i (x_i - V w_i)^T E^{-1} (x_i - V w_i)
  dLL/dV = E^{-1} Σ_i (x_i - V w_i) w_i^T = 0
  ⇒  V = ( Σ_i x_i w_i^T ) ( Σ_i w_i w_i^T )^{-1}
• Differentiating w.r.t. E^{-1} and setting to 0
  ⇒  E = (1/N) Σ_i ( x_i x_i^T - V w_i x_i^T )
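A quick numpy transcription of these two closed-form updates for the case where the w_i are known; the function and variable names are illustrative.

```python
import numpy as np

def mstep_known_w(X, W):
    """Closed-form V and E for data X (N x D) with known weights W (N x K)."""
    V = (X.T @ W) @ np.linalg.inv(W.T @ W)     # V = (sum x_i w_i^T)(sum w_i w_i^T)^{-1}
    E = (X.T @ X - V @ W.T @ X) / X.shape[0]   # E = (1/N) sum (x_i x_i^T - V w_i x_i^T)
    return V, E
```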
Estimating LGMs: If we know w
• But in reality we don’t know the w for each x
– So how to deal with this?
• EM..
  V = ( Σ_i x_i w_i^T ) ( Σ_i w_i w_i^T )^{-1}
  E = (1/N) Σ_i ( x_i x_i^T - V w_i x_i^T )
  x_i = V w_i + e,  P(e) = N(0, E)
Recall EM
• We figured out how to compute parameters if we knew the missing information
• Then we “fragmented” the observations according to the posterior probability P(z|x) and counted as usual
• In effect we took the expectation with respect to the a posteriori probability of the missing data: P(z|x)
[Figure: EM recap with the two-dice example; each observed number (e.g. "6") is fragmented between the "blue" and "red" collections in proportion to the posterior probability of each die]
EM for LGMs
• Replace unseen data terms with expectations taken w.r.t. P(w|xi)
  With known w_i:  V = ( Σ_i x_i w_i^T ) ( Σ_i w_i w_i^T )^{-1},   E = (1/N) Σ_i ( x_i x_i^T - V w_i x_i^T )
  Replacing w_i and w_i w_i^T by their expectations under P(w|x_i):
  V = ( Σ_i x_i E[w|x_i]^T ) ( Σ_i E[w w^T | x_i] )^{-1}
  E = (1/N) Σ_i ( x_i x_i^T - V E[w|x_i] x_i^T )
Expected Value of w given x
• x and w are jointly Gaussian!
– x is Gaussian
– w is Gaussian
– They are linearly related
  x = V w + e,  P(e) = N(0, E),  P(w) = N(0, I),  P(x) = N(0, V V^T + E)
  z = [x; w],  P(z) = N(z; μ_z, C_zz)
Expected Value of w given x
• x and w are jointly Gaussian!
  z = [x; w],  μ_z = 0,  C_zz = [ C_xx  C_xw ; C_wx  C_ww ]
  C_xw = E[ x w^T ] = E[ (V w + e) w^T ] = V
  C_zz = [ V V^T + E    V ;  V^T    I ]
The conditional expectation of w given z
• P(w|x) is a Gaussian:
  P(w|x) = N(w; C_wx C_xx^{-1} x,  C_ww - C_wx C_xx^{-1} C_xw)
  With C_zz = [ V V^T + E    V ;  V^T    I ], this gives
  P(w|x) = N(w; V^T (V V^T + E)^{-1} x,  I - V^T (V V^T + E)^{-1} V)
  E[w|x_i] = V^T (V V^T + E)^{-1} x_i
  E[w w^T | x_i] = Var(w|x_i) + E[w|x_i] E[w|x_i]^T
  E[w w^T | x_i] = I - V^T (V V^T + E)^{-1} V + E[w|x_i] E[w|x_i]^T
LGM: The complete EM algorithm
• Initialize V and E
• E step:
  E[w|x_i] = V^T (V V^T + E)^{-1} x_i
  E[w w^T | x_i] = I - V^T (V V^T + E)^{-1} V + E[w|x_i] E[w|x_i]^T
• M step:
  V = ( Σ_i x_i E[w|x_i]^T ) ( Σ_i E[w w^T | x_i] )^{-1}
  E = (1/N) Σ_i ( x_i x_i^T - V E[w|x_i] x_i^T )
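A compact numpy sketch of this EM loop, keeping E as a full D x D matrix to match the update above; the initialization, iteration count, and the synthetic example are illustrative choices, not from the slides.

```python
import numpy as np

def lgm_em(X, K, n_iter=50, seed=0):
    """EM for x = V w + e with w ~ N(0, I), e ~ N(0, E). X is (N, D), assumed centered."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    V = rng.standard_normal((D, K))
    E = np.eye(D)
    for _ in range(n_iter):
        # E step: posterior moments of w for every x_i
        Cinv = np.linalg.inv(V @ V.T + E)                   # (V V^T + E)^{-1}
        Ew = X @ Cinv @ V                                   # row i is E[w|x_i]^T,  (N, K)
        S = N * (np.eye(K) - V.T @ Cinv @ V) + Ew.T @ Ew    # sum_i E[w w^T | x_i]
        # M step: closed-form updates
        V = (X.T @ Ew) @ np.linalg.inv(S)                   # (sum x_i E[w|x_i]^T)(sum E[w w^T|x_i])^{-1}
        E = (X.T @ X - V @ Ew.T @ X) / N                    # (1/N) sum (x_i x_i^T - V E[w|x_i] x_i^T)
        E = (E + E.T) / 2                                   # keep E symmetric against numerical drift
    return V, E

# Synthetic check: recover a 2-dimensional subspace from 5-dimensional data
rng = np.random.default_rng(1)
W_true = rng.standard_normal((1000, 2))
V_true = rng.standard_normal((5, 2))
X = W_true @ V_true.T + 0.1 * rng.standard_normal((1000, 5))
V_est, E_est = lgm_em(X - X.mean(axis=0), K=2)
```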
So what have we achieved
• Employed a complicated EM algorithm to learn a Gaussian PDF for a variable x
• What have we gained???
• Next class:
– PCA
• Sensible PCA
• EM algorithms for PCA
– Factor Analysis
• FA for feature extraction
LGMs : Application 1 - Learning principal components
• Find directions that capture most of the variation in the data
• Error is orthogonal to these variations
  x = V w + e,  w ~ N(0, I),  e ~ N(0, E)
LGMs : Application 2 - Learning with insufficient data
• The full covariance matrix of a Gaussian has D^2 terms
• Fully captures the relationships between variables
• Problem: needs a lot of data to estimate robustly
[Figure: full covariance matrix]
To be continued..
• Other applications..
• Next class