A Sober Look at Spectral Learning
Han Zhao and Pascal Poupart
June 17, 2014
Spectral Learning
What is spectral learning?
- New methods in machine learning for tackling mixture models and graphical models with latent variables.
- Dates back to Karl Pearson's method-of-moments approach for solving mixtures of Gaussians.
- An alternative to the principle of maximum likelihood estimation and to Bayesian inference.
- Widely applied to various models, including Hidden Markov Models [1, 2], mixtures of Gaussians [3], topic models [4, 5, 6], latent junction trees [7, 8], etc.
Today I will focus on the spectral algorithm for Hidden Markov Models.
HMM
Hidden Markov Model
- A discrete-time stochastic process.
- Satisfies the Markov property.
- The state of the system at each time step is hidden; only an observation of the system is visible.
HMM
An HMM can be defined as a triple 〈T, O, π〉:
- Transition matrix T ∈ R^{m×m}, with T_{ij} = Pr(s_{t+1} = i | s_t = j).
- Observation matrix O ∈ R^{n×m}, with O_{ij} = Pr(o_t = i | s_t = j).
- Initial distribution π ∈ R^m, with π_i = Pr(s_1 = i).
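As a concrete reference point for the rest of the talk, here is a minimal numpy sketch of this parameterization, with made-up toy values for T, O and π (the dimensions and column-stochastic convention follow the definition above; everything else is illustrative, not from the slides):

```python
import numpy as np

# Toy parameters: m = 2 hidden states, n = 3 observation symbols.
# T[i, j] = Pr(s_{t+1} = i | s_t = j) and O[i, j] = Pr(o_t = i | s_t = j),
# so the columns of T and O are probability distributions.
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])            # m x m, columns sum to 1
O = np.array([[0.5, 0.1],
              [0.4, 0.2],
              [0.1, 0.7]])            # n x m, columns sum to 1
pi = np.array([0.6, 0.4])             # initial state distribution

def sample_hmm(T, O, pi, length, rng=np.random.default_rng(0)):
    """Sample a (states, observations) pair of the given length from the HMM <T, O, pi>."""
    m, n = T.shape[0], O.shape[0]
    states, obs = [], []
    s = rng.choice(m, p=pi)
    for _ in range(length):
        states.append(s)
        obs.append(rng.choice(n, p=O[:, s]))   # emit an observation from the current state
        s = rng.choice(m, p=T[:, s])           # transition to the next hidden state
    return states, obs

states, obs = sample_hmm(T, O, pi, length=5)
```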
HMM
Given an HMM H = 〈T, O, π〉, we are interested in two inference problems:
1. Marginal inference (estimation problem). Compute the marginal probability
     Pr(o_{1:t}) = Σ_{s_{1:t}} Pr(o_{1:t}, s_{1:t}) = Σ_{s_{1:t}} Pr(s_{1:t}) Pr(o_{1:t} | s_{1:t})
   Dynamic programming!
2. MAP inference (decoding problem). Compute the sequence s*_{1:t} maximizing the posterior probability
     s*_{1:t} = argmax_{s_{1:t}} Pr(s_{1:t} | o_{1:t})
   Viterbi algorithm!
What about the learning problem?
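For the marginal inference problem, a minimal sketch of the dynamic-programming (forward) recursion might look as follows, reusing the toy T, O, pi and the sampled obs from the earlier snippet; the function name and structure are one possible implementation, not the authors' code:

```python
def forward_marginal(T, O, pi, obs):
    """Compute Pr(o_{1:t}) by the forward algorithm, where alpha_t(s) = Pr(o_{1:t}, s_t = s)."""
    alpha = O[obs[0], :] * pi                 # alpha_1(s) = Pr(o_1 | s_1 = s) Pr(s_1 = s)
    for x in obs[1:]:
        alpha = O[x, :] * (T @ alpha)         # propagate one step, then weight by Pr(o_t | s_t)
    return alpha.sum()                        # marginalize over the final hidden state

print(forward_marginal(T, O, pi, obs))        # marginal probability of the sampled sequence
```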
HMM Reparametrization
Let H = 〈T, O, π〉 be an HMM and define the following observable operators:
     A_x := T diag(O_{x,1}, ..., O_{x,m}),  ∀x ∈ [n]
H = 〈π, {A_x}_{x ∈ [n]}〉 is an equivalent parameterization of the HMM.
     A_x[i, j] = Pr(s_{t+1} = i | s_t = j) × Pr(o_t = x | s_t = j) = Pr(s_{t+1} = i, o_t = x | s_t = j).
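Continuing the toy numpy example from above, the observable operators are a direct transcription of this definition; storing them as an n × m × m array indexed by the observation symbol x is an implementation choice, not something from the slides:

```python
def observable_operators(T, O):
    """A[x] = T @ diag(O[x, :]), so A[x][i, j] = Pr(s_{t+1} = i, o_t = x | s_t = j)."""
    n = O.shape[0]
    return np.stack([T @ np.diag(O[x, :]) for x in range(n)])

A = observable_operators(T, O)
# Sanity check: summing A[x] over all observation symbols x recovers the transition matrix T.
assert np.allclose(A.sum(axis=0), T)
```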
HMM Reparametrization
We can express the marginal probability in terms of the observable operators:
     Pr(o_{1:t}) = Σ_{s_{1:t+1}} Pr(o_{1:t}, s_{1:t+1})
                 = Σ_{s_{1:t+1}} [Pr(s_{t+1} | s_t) Pr(o_t | s_t)] ··· [Pr(s_2 | s_1) Pr(o_1 | s_1)] Pr(s_1)
                 = Σ_{s_{1:t+1}} A_{o_t}[s_{t+1}, s_t] ··· A_{o_1}[s_2, s_1] π_{s_1}
                 = 1^T A_{o_t} ··· A_{o_1} π
Goal of learning: estimate the observable operators from sequences of observations.
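As a quick numerical check on this identity, one can evaluate the operator product and compare it with the forward-algorithm value from the earlier sketch (same toy parameters; purely illustrative):

```python
def operator_marginal(A, pi, obs):
    """Compute Pr(o_{1:t}) = 1^T A_{o_t} ... A_{o_1} pi by applying the operators in order."""
    v = pi.copy()
    for x in obs:                 # multiply by A_{o_1}, then A_{o_2}, ...
        v = A[x] @ v
    return v.sum()                # 1^T v

assert np.isclose(operator_marginal(A, pi, obs), forward_marginal(T, O, pi, obs))
```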
Spectral Learning for HMM [1]
Assumption 1: π > 0 element-wise, and T and O are full rank (rank(T) = rank(O) = m). Define the first three moments of the observations:
     P_1[i] = Pr(x_1 = i)
     P_{2,1}[i, j] = Pr(x_2 = i, x_1 = j)
     P_{3,x,1}[i, j] = Pr(x_3 = i, x_2 = x, x_1 = j),  ∀x ∈ [n]
Let U ∈ R^{n×m} be the matrix of top-m left singular vectors of P_{2,1}, and define the following observable operators:
     b_1 = U^T P_1
     b_∞ = (P_{2,1}^T U)^+ P_1
     B_x = (U^T P_{3,x,1})(U^T P_{2,1})^+,  ∀x ∈ [n]
where M^+ denotes the Moore-Penrose pseudoinverse of the matrix M.
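A minimal sketch of the resulting plug-in learning procedure, assuming access to i.i.d. triples of consecutive observations: estimate P_1, P_{2,1} and P_{3,x,1} empirically, take the top-m left singular vectors, and apply the formulas above. The triple sampling via sample_hmm and all variable names are illustrative, not the authors' code:

```python
def spectral_learn(triples, n, m):
    """Estimate the observable representation (b1, binf, B) from observation triples (x1, x2, x3)."""
    P1 = np.zeros(n)
    P21 = np.zeros((n, n))
    P3x1 = np.zeros((n, n, n))                       # indexed [x, i, j] for P_{3,x,1}[i, j]
    for x1, x2, x3 in triples:
        P1[x1] += 1
        P21[x2, x1] += 1
        P3x1[x2, x3, x1] += 1
    P1 /= len(triples); P21 /= len(triples); P3x1 /= len(triples)

    U, _, _ = np.linalg.svd(P21)
    U = U[:, :m]                                     # top-m left singular vectors of P_{2,1}
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    B = np.stack([U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)])
    return b1, binf, B

# Triples obtained by restarting the toy HMM; any i.i.d. source of (x1, x2, x3) triples would do.
triples = [tuple(sample_hmm(T, O, pi, 3)[1]) for _ in range(20000)]
b1, binf, B = spectral_learn(triples, n=O.shape[0], m=T.shape[0])
```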
Spectral Learning for HMM [1]
Theorem (Observable HMM Representation [1])
Assume the HMM obeys Assumption 1. Then
1. b_1 = (U^T O) π
2. b_∞^T = 1^T (U^T O)^{-1}
3. B_x = (U^T O) A_x (U^T O)^{-1},  ∀x ∈ [n]
4. Pr(o_{1:t}) = b_∞^T B_{x_t} ··· B_{x_1} b_1
b_1, b_∞ and B_x depend only on the first three moments of the observations, free of hidden states!
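Item 4 says that sequence probabilities can be read off from the learned representation just as with the A_x operators. A quick sanity check against the true model, using the sketches above (approximate equality only, since the moments were estimated from samples):

```python
def spectral_marginal(b1, binf, B, obs):
    """Compute Pr(o_{1:t}) ≈ binf^T B_{x_t} ... B_{x_1} b1 with the learned operators."""
    v = b1.copy()
    for x in obs:
        v = B[x] @ v
    return float(binf @ v)

test_seq = [0, 2, 1]                                   # an arbitrary short observation sequence
print(spectral_marginal(b1, binf, B, test_seq))        # estimated probability
print(operator_marginal(A, pi, test_seq))              # true probability from <pi, A_x>
```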
Spectral Learning for HMM [1]
The main result for the spectral learning algorithm for HMMs:
Theorem (Sample Complexity)
There exists a constant C > 0 such that the following holds. Pick any 0 < ε, η < 1 and t ≥ 1. Assume the HMM obeys Assumption 1, and
     N ≥ C · (t²/ε²) · ( m · log(1/ε) / (σ_m(O)² σ_m(P_{2,1})⁴) + m · n_0(ε) · log(1/ε) / (σ_m(O)² σ_m(P_{2,1})²) )
With probability at least 1 − η, the model returned by the spectral learning algorithm for HMMs satisfies
     Σ_{x_1,...,x_t} |Pr(x_{1:t}) − P̂r(x_{1:t})| ≤ ε
where n_0(ε) = O(ε^{1/(1−s)}) for some constant s > 1.
Compared with EM
Expectation-Maximization [9]:
- A local-search heuristic based on the principle of maximum likelihood estimation.
- Suffers from local optima.
- No consistency guarantees.
For a given t ≥ 1 and 0 < ε, η < 1, the spectral learning algorithm:
- Has finite sample complexity sufficient for consistency in terms of the L1 error on marginal probabilities.
- Has no local optima, since it only solves an SVD and involves no local search.
EM vs. Spectral Algorithm
Two synthetic experiments:

                            SmallSyn   LargeSyn
  # states                  4          50
  # observations            8          100
  test set size             4096       10,000
  length of test sequence   4          50
Measure: normalized L1 prediction error on the test set T:
     L1 = Σ_{x_{1:t} ∈ T} |Pr(x_{1:t}) − P̂r(x_{1:t})|^{1/t}
EM vs. Spectral Algorithm
[Figure: Normalized L1 error. Top row: error vs. rank hyperparameter on SmallSyn and LargeSyn, for training sizes 10,000, 50,000, 100,000, 500,000 and 1,000,000. Bottom row: error vs. training size on SmallSyn (m = 4) and LargeSyn (m = 50), comparing LearnHMM (spectral) and EM.]
EM vs. Spectral Algorithm
Negative probability problem with the spectral learning algorithm:
- Size of the training data.
- Estimation of the rank hyperparameter.
- Length of the test sequence.
Proportion of negative probabilities:
     NEG PROP = |{x_{1:t} ∈ T : P̂r(x_{1:t}) < 0}| / |T|
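For completeness, this statistic is trivial to compute once the estimated probabilities of the test sequences are in hand (a plain-Python sketch; `probs` is assumed to hold P̂r(x_{1:t}) for each test sequence):

```python
def neg_prop(probs):
    """NEG PROP: fraction of test sequences whose estimated probability is negative."""
    return sum(p < 0 for p in probs) / len(probs)
```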
[Figure: Proportion of negative probabilities vs. rank hyperparameter on SmallSyn and LargeSyn, for training sizes 10,000 through 1,000,000.]
Compared with EM
Why does EM succeed in practice?
What if the log-likelihood function of the model parameters tends to be concave/quasi-concave as the sample size goes to infinity?
1. Local search algorithms (for example, the EM algorithm in our case) will converge to the global optimum and hence obtain the maximum likelihood estimator [10].
2. Consistency. The sequence of MLEs converges in probability to the true model parameters (provided the model is identifiable by its parameters) [11].
3. Asymptotic normality. The distribution of the MLE tends to a Gaussian with mean equal to the true parameters and covariance matrix equal to the inverse of the Fisher information matrix, i.e., it becomes more and more concentrated [11].
4. The MLE is the most statistically efficient consistent estimator of the model parameters [11].
Synthetic Experiment
Is our conjecture true for HMMs? An HMM with a single parameter, for easy visualization:
     H = 〈 T = [θ, 1−θ; 1−θ, θ],  O = [0.7, 0.3; 0.3, 0.7],  π = (0.5, 0.5) 〉
Beta distribution with the uniform distribution as prior. Exact Bayesian updating with more and more observations.
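The curves in the following figure can be reproduced with a small grid computation: evaluate the likelihood of an observation sequence on a grid of θ values using the forward algorithm from the earlier sketch, then normalize. This is a hedged reconstruction of the setup described above, not the authors' code; the grid resolution, sequence length and the assumed true θ are arbitrary choices:

```python
def likelihood_curve(obs, thetas):
    """Normalized likelihood of the single-parameter HMM as a function of theta."""
    O1 = np.array([[0.7, 0.3], [0.3, 0.7]])
    pi1 = np.array([0.5, 0.5])
    lik = np.array([
        forward_marginal(np.array([[th, 1 - th], [1 - th, th]]), O1, pi1, obs)
        for th in thetas
    ])
    return lik / lik.mean()            # normalize so the curve integrates to roughly 1 over the grid

thetas = np.linspace(0.01, 0.99, 99)
true_T = np.array([[0.8, 0.2], [0.2, 0.8]])          # assumed true theta = 0.8, for illustration only
_, obs_seq = sample_hmm(true_T, np.array([[0.7, 0.3], [0.3, 0.7]]), np.array([0.5, 0.5]), 50)
curve = likelihood_curve(obs_seq, thetas)            # grows more peaked as the sequence lengthens
```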
Synthetic Experiment
[Figure: Normalized likelihood as a function of θ, shown after 10, 20, ..., 100 observations; the curve becomes increasingly peaked as more observations are added.]
Synthetic Experiment
Another small synthetic experiment: an HMM with 2 states, 2 observations and 4 free parameters.
[Figure: Log-likelihood comparison as a function of training size, showing the EM log-likelihood against the true log-likelihood.]
Conclusions
Spectral learning for HMMs
Pros:
1. Additive L1 error bound with finite sample complexity.
2. No local optima.
Cons:
1. Negative probabilities.
2. Not the most statistically efficient.
3. Slow to converge.
Conclusions
EM for HMMs
Pros:
1. Fast to converge.
2. Statistically efficient.
3. An optimization-based approach.
Cons:
1. A local-search heuristic, with no provable guarantee of reaching the global optimum.
2. Can get stuck in local optima of the non-convex optimization.
References I
[1] D. Hsu, S. M. Kakade, and T. Zhang, "A spectral algorithm for learning Hidden Markov Models," Journal of Computer and System Sciences, vol. 78, pp. 1460–1480, Sept. 2012.
[2] A. Anandkumar, D. Hsu, and S. M. Kakade, "A method of moments for mixture models and Hidden Markov Models," arXiv preprint arXiv:1203.0683, 2012.
[3] D. Hsu and S. M. Kakade, "Learning mixtures of spherical Gaussians: moment methods and spectral decompositions," in Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 11–20, ACM, 2013.
[4] A. Anandkumar, D. P. Foster, and D. Hsu, "A spectral algorithm for Latent Dirichlet Allocation," in NIPS, pp. 926–934, 2012.
References II
[5] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, "Tensor decompositions for learning latent variable models," arXiv preprint arXiv:1210.7559, pp. 1–55, 2012.
[6] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," arXiv preprint arXiv:1212.4777, 2012.
[7] A. Parikh, G. Teodoru, M. Ishteva, and E. P. Xing, "A spectral algorithm for latent junction trees," arXiv preprint arXiv:1210.4884, 2012.
[8] A. Parikh and E. P. Xing, "A spectral algorithm for latent tree graphical models," in Proceedings of the 28th International Conference on Machine Learning, pp. 1065–1072, 2011.
References III
[9] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[10] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[11] G. Casella and R. L. Berger, Statistical Inference, vol. 70. Duxbury Press, Belmont, CA, 1990.