LING 696B: Mixture model and linear dimension reduction

Page 1: LING 696B:  Mixture model and linear dimension reduction

1

LING 696B: Mixture model and linear dimension reduction

Page 2: LING 696B:  Mixture model and linear dimension reduction

2

Statistical estimation

Basic setup:

The world: distributions p(x; θ), θ -- parameters. "All models may be wrong, but some are useful."

Given a parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the "likelihood" p(x|θ))

Observations: X = {x1, x2, …, xN} generated from some p(x; θ). N is the number of observations

Model-fitting: based on some examples X, make guesses (learning, inference) about θ

Page 3: LING 696B:  Mixture model and linear dimension reduction

3

Statistical estimation: example

Assume people's height follows a normal distribution with parameters θ = (mean, variance)

p(x; θ) = the probability density function of the normal distribution

Observations: measurements of people's height

Goal: estimate the parameters of the normal distribution

Page 4: LING 696B:  Mixture model and linear dimension reduction

4

Maximum likelihood estimation (MLE)

Likelihood function: the examples xi are independent of one another, so L(θ) = p(x1; θ) p(x2; θ) … p(xN; θ)

Among all possible values of θ, choose the θ̂ for which L(θ) is the biggest

[Figure: the likelihood function L(θ) plotted against θ]

Consistency: as N grows, θ̂ converges to the true θ -- provided the true distribution lies in the hypothesis space H!
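To make this concrete, here is a minimal sketch of the MLE for the height example, assuming a 1-D normal model; the measurements below are made-up numbers, and the closed form (sample mean and biased sample variance) is exactly what maximizing L(θ) gives for a Gaussian.

```python
import numpy as np

# Hypothetical height measurements (in cm); any 1-D sample would do.
x = np.array([172.0, 165.5, 180.2, 158.9, 175.4, 169.1, 162.3, 177.8])
N = len(x)

# The MLE for a normal distribution has a closed form:
# the sample mean and the (divide-by-N) sample variance.
mu_hat = x.mean()
var_hat = np.sum((x - mu_hat) ** 2) / N

# Log likelihood of the data under the fitted parameters,
# log L(theta) = sum_i log p(x_i; mu, sigma^2).
log_lik = np.sum(-0.5 * np.log(2 * np.pi * var_hat)
                 - (x - mu_hat) ** 2 / (2 * var_hat))

print(f"mu_hat = {mu_hat:.2f}, var_hat = {var_hat:.2f}, log L = {log_lik:.2f}")
```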

Page 5: LING 696B:  Mixture model and linear dimension reduction

5

H matters a lot!

Example: curve fitting with polynomials
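The following small experiment (made-up data: a noisy sine curve) illustrates the point: as the polynomial degree grows, the fit to the training examples always improves, while generalization eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy sine curve (an assumption for illustration only).
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit within H_degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```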

Page 6: LING 696B:  Mixture model and linear dimension reduction

6

Clustering

Need to divide x1, x2, …, xN into clusters, without a priori knowledge of where the clusters are

An unsupervised learning problem: fitting a mixture model to x1, x2, …, xN

Example: the heights of males and females follow two different distributions, but we don't know the gender behind each xi

Page 7: LING 696B:  Mixture model and linear dimension reduction

7

The K-means algorithm

Start with a random assignment, calculate the means

Page 8: LING 696B:  Mixture model and linear dimension reduction

8

The K-means algorithm

Re-assign members to the closest cluster according to the means

Page 9: LING 696B:  Mixture model and linear dimension reduction

9

The K-means algorithm

Update the means based on the new assignments, and iterate
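A minimal numpy sketch of these two alternating steps (random initial assignment is one common choice; empty clusters are not handled here):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate between assigning points to the nearest
    mean and recomputing each mean from its members."""
    rng = np.random.default_rng(seed)
    # Start with a random assignment, then calculate the means.
    labels = rng.integers(K, size=len(X))
    for _ in range(n_iter):
        # Note: a cluster that loses all its members is not handled in this sketch.
        means = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Re-assign each point to the closest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no change: converged
            break
        labels = new_labels
    return means, labels
```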

Page 10: LING 696B:  Mixture model and linear dimension reduction

10

Why does K-means work?

In the beginning, the centers are poorly chosen, so the clusters overlap a lot

But if the centers are moving away from each other, then the clusters tend to separate better

Vice versa: if the clusters are well separated, then the centers will stay away from each other

Intuitively, these two steps "help each other"

Page 11: LING 696B:  Mixture model and linear dimension reduction

11

Interpreting K-means as statistical estimation

Equivalent to fitting a mixture of Gaussians with: spherical covariance; uniform prior (equal weights on each Gaussian)

Problems: ambiguous data should have gradient membership; the shape of a cluster may not be spherical; the size of a cluster should play a role

Page 12: LING 696B:  Mixture model and linear dimension reduction

12

Multivariate Gaussian

1-D: N(μ, σ²); N-D: N(μ, Σ), where μ is an N×1 vector and Σ is an N×N matrix with Σ(i,j) = σij ~ the covariance between dimensions i and j

Probability calculation: p(x; μ, Σ) = (2π)^(-N/2) |Σ|^(-1/2) exp{ -(1/2) (x-μ)^T Σ^(-1) (x-μ) }

Intuitive meaning of Σ^(-1): it defines how to calculate the distance from x to μ ("T" = transpose, "^(-1)" = matrix inverse)

Page 13: LING 696B:  Mixture model and linear dimension reduction

13

Multivariate Gaussian: log likelihood and distance

Spherical covariance matrix: Σ^(-1) = (1/σ²) I, so the distance is ordinary Euclidean distance, weighted equally in every direction

Diagonal covariance matrix: Σ^(-1) = diag(1/σ1², …, 1/σN²), so each dimension is weighted by its own variance

Full covariance matrix: Σ^(-1) gives the general (Mahalanobis) distance (x-μ)^T Σ^(-1) (x-μ), which also accounts for correlations
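A small numerical sketch of the log likelihood and the squared Mahalanobis distance under the three covariance structures; the values of μ, x, and the covariance matrices below are made up for illustration.

```python
import numpy as np

def gaussian_log_density(x, mu, cov):
    """Log of N(x; mu, cov) for a full covariance matrix:
    -0.5 * [ d*log(2*pi) + log|cov| + (x-mu)^T cov^{-1} (x-mu) ]."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([0.0, 0.0])
x = np.array([1.0, 1.0])

spherical = 2.0 * np.eye(2)                   # same variance in every direction
diagonal = np.diag([2.0, 0.5])                # per-dimension variances, no correlation
full = np.array([[2.0, 0.9],                  # variances plus correlation
                 [0.9, 0.5]])

for name, cov in [("spherical", spherical), ("diagonal", diagonal), ("full", full)]:
    print(f"{name:9s}: log density = {gaussian_log_density(x, mu, cov):.3f}")
```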

Page 14: LING 696B:  Mixture model and linear dimension reduction

14

Learning a mixture of Gaussians: the EM algorithm

Expectation: putting "soft" labels on the data -- for two clusters, a pair (γ, 1-γ)

[Figure: data points labeled with soft memberships such as (0.5, 0.5), (0.05, 0.95), (0.8, 0.2)]

Page 15: LING 696B:  Mixture model and linear dimension reduction

15

Learning a mixture of Gaussians: the EM algorithm

Maximization: doing maximum likelihood with the weighted data

Notice everyone is wearing a hat! (the estimates μ̂, Σ̂, and the mixing weights are all recomputed from the current soft labels)
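A compact sketch of the two steps for a two-component 1-D mixture of Gaussians; the initialization scheme here is an arbitrary choice (K-means is a common alternative).

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (a minimal sketch).
    E-step: soft labels gamma; M-step: weighted ML estimates (the 'hats')."""
    # Crude initialization: put the two means at the extremes of the data.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = (pi / np.sqrt(2 * np.pi * var)
                * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum likelihood with the weighted data.
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi
```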

Page 16: LING 696B:  Mixture model and linear dimension reduction

16

EM vs. K-means

Same: iterative optimization, provably converges (see demo)

EM better captures the intuition: ambiguous data are assigned gradient membership; clusters can be arbitrarily shaped "pancakes"; the size of each cluster is a parameter; it allows flexible control based on prior knowledge (see demo)

Page 17: LING 696B:  Mixture model and linear dimension reduction

17

EM is everywhere

Our problem: the labels are important, yet not observable -- "hidden variables"

This situation is common for complex models, where maximum likelihood leads to EM: Bayesian networks, Hidden Markov models, probabilistic context-free grammars, linear dynamical systems

Page 18: LING 696B:  Mixture model and linear dimension reduction

18

Beyond maximum likelihood? Statistical parsing

An interesting remark from Mark Johnson:
Initialize a PCFG with treebank counts
Train the PCFG on the treebank with EM

A large amount of NLP research tries to dump the first and improve the second

[Figure: log likelihood over EM iterations vs. the measure of success]

Page 19: LING 696B:  Mixture model and linear dimension reduction

19

What's wrong with this? Mark Johnson's ideas:

Wrong data: humans don't just learn from strings

Wrong model: human syntax isn't context-free

Wrong way of calculating likelihood: p(sentence | PCFG) isn't informative

(Maybe) wrong measure of success?

Page 20: LING 696B:  Mixture model and linear dimension reduction

20

End of excursion: mixtures of many things

Any generative model can be combined with a mixture model to deal with categorical data

Examples: mixture of Gaussians, mixture of HMMs, mixture of factor analyzers, mixture of expert networks

It all depends on what you are modeling

Page 21: LING 696B:  Mixture model and linear dimension reduction

21

Applying this to the speech domain

Speech signals are high-dimensional. Use the front-end acoustic modeling from speech recognition: Mel-Frequency Cepstral Coefficients (MFCC)

Speech sounds are dynamic. Dynamic acoustic modeling: MFCC plus deltas; the mixture components are Hidden Markov Models (HMMs)

Page 22: LING 696B:  Mixture model and linear dimension reduction

22

Clustering speech with K-means: phones from TIMIT

Page 23: LING 696B:  Mixture model and linear dimension reduction

23

Clustering speech with K-means: diphones and words

Page 24: LING 696B:  Mixture model and linear dimension reduction

24

What's wrong here?

Longer sound sequences are more distinguishable for people, yet doing K-means on static feature vectors misses the change over time

The mixture components must be able to capture dynamic data

Solution: a mixture of HMMs

Page 25: LING 696B:  Mixture model and linear dimension reduction

25

Mixture of HMMs

[Diagram: individual HMMs (with states such as silence, burst, transition) combined into an HMM mixture]

Learning: EM for each HMM + EM for the mixture
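The slides do not show code, but as a rough sketch, a hard-assignment variant of this training can be put together with the third-party hmmlearn package (an assumption, not part of the course materials): fit one Gaussian HMM per cluster, reassign every sequence to the HMM that scores it best, and repeat. The soft EM version described on the slide would use weighted responsibilities instead of hard labels. Each element of `seqs` is assumed to be a (frames × MFCC-dimensions) array.

```python
import numpy as np
from hmmlearn import hmm   # third-party package, not part of the original slides

def cluster_sequences(seqs, K, n_states=3, n_rounds=5, seed=0):
    """Hard-assignment sketch of an HMM mixture: fit one Gaussian HMM per
    cluster, reassign each sequence to the best-scoring HMM, and repeat.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(seqs))          # random initial clustering
    for _ in range(n_rounds):
        models = []
        for k in range(K):
            members = [s for s, l in zip(seqs, labels) if l == k]
            X = np.vstack(members)                    # concatenate member sequences
            lengths = [len(s) for s in members]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=20, random_state=seed)
            m.fit(X, lengths)                         # EM for this cluster's HMM
            models.append(m)
        # "EM for the mixture", hard version: assign each sequence to the
        # HMM under which it has the highest log likelihood.
        labels = np.array([np.argmax([m.score(s) for m in models]) for s in seqs])
    return models, labels
```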

Page 26: LING 696B:  Mixture model and linear dimension reduction

26

Mixture of HMMs

Model-based clustering; front-end: MFCC + delta; algorithm: initial guess by K-means, then EM

Gaussian mixture for single frames; HMM mixture for whole sequences

Page 27: LING 696B:  Mixture model and linear dimension reduction

27

Mixture of HMMs vs. K-means

Phone clustering: 7 phones from 22 speakers

*1 – 5: cluster index

Page 28: LING 696B:  Mixture model and linear dimension reduction

28

Mixture of HMMs vs. K-means

Diphone clustering: 6 diphones from 300+ speakers

Page 29: LING 696B:  Mixture model and linear dimension reduction

29

Mixture of HMMs vs. K-means

Word clustering: 3 words from 300+ speakers

Page 30: LING 696B:  Mixture model and linear dimension reduction

30

Growing the model

Guessing 6 clusters at once is hard, but 2 is easy. Hill-climbing strategy: start with 2, then 3, 4, ...

Implementation: split the cluster with the maximum gain in likelihood

Intuition: discriminate within the biggest pile
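A hedged sketch of the splitting criterion, using scikit-learn's GaussianMixture as a stand-in GMM fitter (an assumption; the original work used its own EM code): for each current cluster, compare a one-Gaussian fit against a two-Gaussian fit and split the cluster whose split gains the most likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # an assumption; any GMM fitter would do

def best_split(X, labels):
    """For each current cluster, try splitting it into 2 Gaussians and
    measure the gain in total log likelihood over a single Gaussian.
    Return the index of the cluster whose split gains the most."""
    gains = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        if len(Xk) < 4:                       # too small to split
            continue
        one = GaussianMixture(n_components=1, random_state=0).fit(Xk)
        two = GaussianMixture(n_components=2, random_state=0).fit(Xk)
        # .score() is the average log likelihood per sample; scale back up.
        gains[k] = (two.score(Xk) - one.score(Xk)) * len(Xk)
    return max(gains, key=gains.get)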

Page 31: LING 696B:  Mixture model and linear dimension reduction

31

Learning categories and features with the mixture model

Procedure: apply the mixture model and EM algorithm, inductively finding clusters

Each split is followed by a retraining step using all data

[Diagram: a tree of clusters -- the data split into clusters 1 and 2; 1 splits into 11 and 12; 2 splits into 21 and 22]

Page 32: LING 696B:  Mixture model and linear dimension reduction

32

[Figure: for each phone (IPA / TIMIT labels), the percentage classified as Cluster 1 vs. Cluster 2. The first split of all the data separates Cluster 1 (obstruents) from Cluster 2 (sonorants)]

Page 33: LING 696B:  Mixture model and linear dimension reduction

33

[Figure: the percentage of each phone classified as Cluster 11 vs. Cluster 12. Cluster 1 (obstruents) splits into Cluster 11 (fricatives) and Cluster 12]

Page 34: LING 696B:  Mixture model and linear dimension reduction

34

[Figure: the percentage of each phone classified as Cluster 21 vs. Cluster 22. Cluster 2 (sonorants) splits into Cluster 21 (back sonorants) and Cluster 22]

Page 35: LING 696B:  Mixture model and linear dimension reduction

35

[Figure: the percentage of each phone classified as Cluster 121 vs. Cluster 122. Cluster 12 splits into Cluster 121 (oral stops) and Cluster 122 (nasal stops)]

Page 36: LING 696B:  Mixture model and linear dimension reduction

36

[Figure: the percentage of each phone classified as Cluster 221 vs. Cluster 222. Cluster 22 splits into Cluster 221 (front low sonorants) and Cluster 222 (front high sonorants). The full tree so far: all data splits into obstruents vs. sonorants; obstruents into fricatives vs. stops, and stops into oral vs. nasal; sonorants into back sonorants vs. front sonorants, and front sonorants into low vs. high]

Page 37: LING 696B:  Mixture model and linear dimension reduction

37

Summary: learning features

Discovered features: distinctions between natural classes based on spectral properties

[Diagram: all data splits into [-sonorant] vs. [+sonorant]; then [+fricative] vs. [-fricative] and [+back] vs. [-back]; then [-nasal] vs. [+nasal] and [+high] vs. [-high]]

For individual sounds, the feature values are gradient rather than binary (Ladefoged, 2001)

Page 38: LING 696B:  Mixture model and linear dimension reduction

38

Evaluation: phone classification

How well do the "soft" classes fit into the "hard" ones?

[Tables: classification results on the training set and on the test set]

Are "errors" really errors?

Page 39: LING 696B:  Mixture model and linear dimension reduction

39

Level 2: learning segments + phonotactics

Segmentation is a kind of hidden structure; the iterative strategy works here too

Optimization over the augmented model p(words | units, phonotactics, segmentation):

Units: argmax_U p({wi} | U, P, {si}) -- clustering, = argmax p(segments | units), as in Level 1

Phonotactics: argmax_P p({wi} | U, P, {si}) -- estimating the transition probabilities of a Markov chain

Segmentation: argmax_{si} p({wi} | U, P, {si}) -- Viterbi decoding

Page 40: LING 696B:  Mixture model and linear dimension reduction

40

Iterative learning as coordinate-wise ascent

Each step increases the likelihood score, and the procedure eventually reaches a local maximum

[Figure: level curves of the likelihood score over the segmentation and units/phonotactics coordinates; the initial value comes from Level-1 learning]

Page 41: LING 696B:  Mixture model and linear dimension reduction

41

Level 3: the lexicon can be a mixture too

Re-clustering of words using the mixture-based lexical model

Initial values (mixture components, weights) come from bottom-up learning (Stage 2)

Iterating steps: classify each word as the best exemplar of a given lexical item (also inferring the segmentation); then update the lexical weights + units + phonotactics

Page 42: LING 696B:  Mixture model and linear dimension reduction

42

Big question: how to choose K?

Basic problem: nested hypothesis spaces H_{K-1} ⊂ H_K ⊂ H_{K+1} ⊂ …

As K goes up, the likelihood always goes up

Recall the polynomial curve fitting; the same holds for the mixture model (see demo)

Page 43: LING 696B:  Mixture model and linear dimension reduction

43

Big question: how to choose K?

Idea #1: don't just look at the likelihood; look at a combination of the likelihood and something else

Bayesian Information Criterion: -2 log L(θ̂) + d log N

Minimum Description Length: -log L(θ̂) + description length(θ)

Akaike Information Criterion: -2 log L(θ̂) + 2d

In practice, you often need magical "weights" in front of the something else
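As an illustration of Idea #1 (using scikit-learn's GaussianMixture, an outside tool rather than the course code), fitting mixtures with increasing K on made-up two-cluster data shows the log likelihood rising monotonically while BIC and AIC bottom out near the true K.

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # an assumption, not the course's own EM code

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two well-separated Gaussians.
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]).reshape(-1, 1)

for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    # Lower BIC/AIC is better; the likelihood alone always favors larger K.
    print(f"K={K}: log L={gmm.score(X) * len(X):8.1f}, "
          f"BIC={gmm.bic(X):8.1f}, AIC={gmm.aic(X):8.1f}")
```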

Page 44: LING 696B:  Mixture model and linear dimension reduction

44

Big question: how to choose K?

Idea #2: use one set of data for learning and another for testing generalization

Cross-validation: run EM until the likelihood starts to hurt on the test set (see demo)

What if you have a bad test set? Jack-knife procedure: cut the data into 10 parts, and do 10 rounds of training and testing
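A minimal sketch of Idea #2 on the same kind of made-up data: hold out part of the data and compare the per-sample log likelihood on the training set and the held-out set as K grows (again using scikit-learn's GaussianMixture as the fitter; a jack-knife would rotate ten such splits).

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # an assumption, as above

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)]).reshape(-1, 1)
rng.shuffle(X)
X_train, X_test = X[:400], X[400:]            # a single train/test split

for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X_train)
    # The training likelihood keeps rising with K; the held-out likelihood
    # flattens or drops once the extra components start fitting noise.
    print(f"K={K}: train {gmm.score(X_train):.3f}, test {gmm.score(X_test):.3f}")
```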

Page 45: LING 696B:  Mixture model and linear dimension reduction

45

Big question: how to choose K?

Idea #3: treat K as a "hyper" parameter and do Bayesian learning on K

More flexible: K can grow and shrink depending on the amount of data

Allow K to grow to infinity: Dirichlet process / Chinese restaurant process mixtures

Need "hyper-hyper" parameters to control how readily K grows; also computationally intensive

Page 46: LING 696B:  Mixture model and linear dimension reduction

46

Big question: how to choose K?

There is really no elegant universal solution

One view: statistical learning searches within H_K, but does not come up with H_K itself

How do people choose K? (also see later reading)

Page 47: LING 696B:  Mixture model and linear dimension reduction

47

Dimension reduction

Why dimension reduction? Example: estimating a continuous probability distribution by counting histograms of samples

[Figure: histograms of the same sample with 10, 20, and 30 bins]

Page 48: LING 696B:  Mixture model and linear dimension reduction

48

Dimension reduction

Now think about 2D, 3D, … How many bins do you need?

Estimating the density with a Parzen window: p̂(x) ≈ (number of data points in the window around x) / (N × window volume), for a window of radius r

How big does the window radius r need to grow as the dimension goes up?
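A quick numerical illustration of why r must grow with the dimension: for points drawn uniformly from [-1, 1]^d (a made-up setting), the fraction of data falling inside a fixed-radius window around the query point collapses as d increases.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 100_000, 0.5

for d in (1, 2, 5, 10, 20):
    # Points uniform on [-1, 1]^d; count how many fall inside a window
    # of radius r around the origin.
    X = rng.uniform(-1, 1, size=(N, d))
    inside = np.sum(np.linalg.norm(X, axis=1) < r)
    print(f"d={d:2d}: fraction of data in the window = {inside / N:.5f}")
```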

Page 49: LING 696B:  Mixture model and linear dimension reduction

49

Curse of dimensionality

Discrete distributions -- a phonetics experiment: M speakers × N sentences × P stresses × Q segments × …

Decision rules: (K-)nearest neighbor. How big a K is safe? How long do you have to wait until you are really sure they are your nearest neighbors?

Page 50: LING 696B:  Mixture model and linear dimension reduction

50

One obvious solution

Assume we know something about the distribution; this translates to a parametric approach

Example: counting histograms for 10-D data needs a huge number of bins (on the order of d^10, with d bins per dimension), but knowing the data forms a "pancake" allows us to fit a Gaussian -- d^10 parameters vs. how many?

Page 51: LING 696B:  Mixture model and linear dimension reduction

51

Linear dimension reduction

Principal Component Analysis, Multidimensional Scaling, Factor Analysis, Independent Component Analysis

As we will see, we still need to assume we know something…

Page 52: LING 696B:  Mixture model and linear dimension reduction

52

Principal Component Analysis

Many names (eigenmodes, KL transform, etc.) and many relatives

The key is to understand how to make a pancake: centering, rotating, and smashing

Step 1: moving the dough to the center, X ← X − μ

Page 53: LING 696B:  Mixture model and linear dimension reduction

53

Principal Component Analysis

Step 2: finding a direction of projection that has the maximal "stretch"

Linear projection of X onto a vector w: Proj_w(X) = X_{N×d} w_{d×1} (with X centered)

Now measure the stretch: this is the sample variance, Var(Xw)

[Figure: data points projected onto the direction w]

Page 54: LING 696B:  Mixture model and linear dimension reduction

54

Principal Component Analysis

Step 3: formulate this as a constrained optimization problem

Objective of the optimization: Var(Xw)

Need a constraint on w (otherwise it can explode); only consider the direction

So formally: argmax_{||w||=1} Var(Xw)

Page 55: LING 696B:  Mixture model and linear dimension reduction

55

Principal Component Analysis

Some algebra (homework): Var(x) = E[(x − E[x])²] = E[x²] − (E[x])²

Applied to matrices (homework): Var(Xw) = wᵀXᵀX w = wᵀ Cov(X) w (why? X is centered, up to a 1/N factor)

Cov(X) is a d×d matrix (homework): symmetric (easy); for any y, yᵀ Cov(X) y ≥ 0 (tricky)

Page 56: LING 696B:  Mixture model and linear dimension reduction

56

Principal Component Analysis

Going back to the optimization problem: w1 = argmax_{||w||=1} Var(Xw) = argmax_{||w||=1} wᵀ Cov(X) w

The solution is the eigenvector of Cov(X) with the largest eigenvalue: w1, the first principal component!

Page 57: LING 696B:  Mixture model and linear dimension reduction

57

More principal components

We keep looking for w2 among all the directions perpendicular to w1

Formally: w2 = argmax_{||w2||=1, w2 ⊥ w1} w2ᵀ Cov(X) w2

This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue

Together, w1 and w2 define new coordinates!
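Putting the steps together, here is a minimal numpy sketch of PCA as described: center X, form Cov(X), and take the leading eigenvectors w1, w2, … as the new coordinate directions.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by eigendecomposition of the covariance matrix:
    center the data, compute Cov(X), and keep the top eigenvectors."""
    Xc = X - X.mean(axis=0)                     # Step 1: move the dough to the center
    cov = Xc.T @ Xc / len(Xc)                   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues in descending order
    W = eigvecs[:, order[:n_components]]        # w1, w2, ... as columns
    # Return the new coordinates, the directions, and the variances along them.
    return Xc @ W, W, eigvals[order]
```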

Page 58: LING 696B:  Mixture model and linear dimension reduction

58

Rotation

We can keep going until we pick up all d eigenvectors, perpendicular to each other

Putting these eigenvectors together, we get a big matrix W = (w1, w2, …, wd)

W is called an orthogonal matrix; multiplying by W corresponds to a rotation of the pancake

The rotated pancake has no correlation between dimensions