[Paper introduction] Efficient Lattice Rescoring using Recurrent Neural Network Language Models
TRANSCRIPT
Efficient Lattice Rescoring using Recurrent Neural Network Language Models. X. Liu, Y. Wang, X. Chen, M. J. F. Gales & P. C. Woodland. Proc. of ICASSP 2014.
Introduced by Makoto Morishita, 2016/02/25, MT Study Group
What is a Language Model
• Language models assign a probability to each sentence.
W1 = speech recognition system
W2 = speech cognition system
W3 = speck podcast histamine
P(W1) = 4.021 × 10^-3 ← Best!
P(W2) = 8.932 × 10^-4
P(W3) = 2.432 × 10^-7
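(For reference, not on the slides: a language model factorizes the sentence probability with the chain rule.)

    P(W) = P(w_1) · P(w_2 | w_1) · … · P(w_N | w_1, …, w_{N-1})

e.g. P(speech recognition system) = P(speech) · P(recognition | speech) · P(system | speech, recognition)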
In this paper…
• The authors propose two new methods to efficiently re-score speech recognition lattices.
[Figure: an example speech recognition lattice; nodes 0–9 with competing word arcs such as "hi / high / hy", "this", "is", "my", "mobile", "phone / phones".]
Language Models
n-gram back-off model
• Use the previous n-1 words to estimate the next word's probability.
[Figure: the sentence "This is my mobile phone" with word positions 1–5 marked; if the model is a bi-gram, only the word directly before the predicted one is used.]
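(For reference, the standard Katz-style back-off recursion for a bi-gram model; P* is a discounted estimate and α(w_{i-1}) the back-off weight. This notation is mine, not from the slides.)

    P(w_i | w_{i-1}) = P*(w_i | w_{i-1})      if count(w_{i-1}, w_i) > 0
                     = α(w_{i-1}) · P(w_i)    otherwise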
Feedforward neural network language model
• Use the previous n-1 words as input to a feedforward neural network. [Y. Bengio et al., 2003]
[Figure: feedforward NNLM architecture; image from http://kiyukuta.github.io/2013/12/09/mlac2013_day9_recurrent_neural_network_language_model.html]
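(A minimal runnable sketch of the feedforward NNLM forward pass; the sizes V, d, h and all variable names are my assumptions, not from the slides.)

    import numpy as np

    V, d, h = 10000, 100, 200                   # vocab, embedding, hidden sizes (assumed)
    n = 3                                       # tri-gram: predict from the previous 2 words
    C = np.random.randn(V, d) * 0.01            # word embedding table
    W = np.random.randn((n - 1) * d, h) * 0.01  # embeddings -> hidden
    U = np.random.randn(h, V) * 0.01            # hidden -> output

    def ffnn_lm_step(context_ids):
        """P(w_i | previous n-1 words) with a feedforward NNLM."""
        x = np.concatenate([C[i] for i in context_ids])  # concatenated embeddings
        hidden = np.tanh(x @ W)                          # non-linear hidden layer
        logits = hidden @ U
        e = np.exp(logits - logits.max())                # stable softmax over the vocab
        return e / e.sum()

    p = ffnn_lm_step([42, 7])   # distribution P(w_i | w_{i-2}=42, w_{i-1}=7)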
Recurrent neural network language model
• Use the full history context via a recurrent neural network. [T. Mikolov et al., 2010]
[Figure: RNNLM architecture. The current word w_{i-1} (1-of-k coded, e.g. 0…010…0) and the previous hidden state s_{i-2} feed a sigmoid hidden layer that produces the new state s_{i-1}; a softmax output layer gives P(w_i | w_{i-1}, s_{i-2}).]
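(A minimal sketch of one RNNLM step matching the figure; weight names and sizes are assumptions, not the paper's code.)

    import numpy as np

    V, h = 10000, 200                       # vocab and hidden sizes (assumed)
    W_in = np.random.randn(V, h) * 0.01     # input word -> hidden
    W_rec = np.random.randn(h, h) * 0.01    # previous state -> hidden
    W_out = np.random.randn(h, V) * 0.01    # hidden -> output

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rnnlm_step(word_id, s_prev):
        """One step: build s_{i-1} from w_{i-1} and s_{i-2}; return P(w_i | w_{i-1}, s_{i-2}) and s_{i-1}."""
        s = sigmoid(W_in[word_id] + s_prev @ W_rec)   # sigmoid hidden layer
        logits = s @ W_out
        e = np.exp(logits - logits.max())             # softmax output layer
        return e / e.sum(), s

    p, s = rnnlm_step(42, np.zeros(h))   # first word after the sentence start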
Language Model States
LM states
• To use an LM for the re-scoring task, we need to store LM states so that hypotheses can be scored efficiently.
bi-gram
[Figure: an SR (speech recognition) lattice with nodes 0–3 and word arcs a–e, and its bi-gram LM-state expansion: 0<s>; 1a, 1b; 2c, 2d; 3e. With a bi-gram LM, each state only needs to remember the last word.]
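(A sketch of how lattice nodes expand into LM states keyed by the last n-1 words, using the lattice from these slides; helper names are mine.)

    def expand_lm_states(arcs, n):
        """Expand lattice nodes into LM states keyed by the last n-1 words.
        arcs: (from_node, to_node, word) triples, topologically sorted; node 0 is the start."""
        states = {(0, ('<s>',))}                          # (lattice node, truncated history)
        for src, dst, word in arcs:
            for node, hist in list(states):
                if node == src:
                    new_hist = (hist + (word,))[-(n - 1):]   # keep only the last n-1 words
                    states.add((dst, new_hist))
        return states

    # The lattice from the slides: nodes 0-3, arcs a-e.
    arcs = [(0, 1, 'a'), (0, 1, 'b'), (1, 2, 'c'), (1, 2, 'd'), (2, 3, 'e')]
    print(sorted(expand_lm_states(arcs, 2)))   # bi-gram:  one state per (node, last word)
    print(sorted(expand_lm_states(arcs, 3)))   # tri-gram: node 2 splits into four states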
tri-gram
[Figure: the same SR lattice expanded with tri-gram LM states; each state must now remember the last two words (<s>,a; <s>,b; a,c; a,d; b,c; b,d; c,e; d,e), so nodes split into more states.]
States become larger!
Difference
• n-gram back-off model & feedforward NNLM: use only a fixed n-gram context.
• Recurrent NNLM: uses the whole word history, so LM states grow rapidly and the computational cost becomes high.
We want to reduce the number of recurrent NNLM states.
Hypothesis
Context information gradually diminishes
• We don’t have to distinguish all of the histories.
• e.g. "I am presenting the paper about RNNLM." ≒ "We are presenting the paper about RNNLM."
Similar histories make similar vectors
• We don’t have to distinguish all of the histories.
• e.g. "I am presenting the paper about RNNLM." ≒ "I am introducing the paper about RNNLM."
Proposed Method
n-gram based history clustering
• "I am presenting the paper about RNNLM." ≒ "We are presenting the paper about RNNLM."
• If the last n-gram of two histories is the same, we use the same history vector (see the sketch below).
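(A minimal sketch of the idea with my own hypothetical helper names, not the paper's code: paths whose truncated histories match share the vector cached for the first such path.)

    def shared_history_vector(full_history, s_full, cache, n=3):
        """n-gram based history clustering: paths whose last n-1 words match
        reuse the history vector cached for the first of them."""
        key = tuple(full_history[-(n - 1):])   # truncate the history to n-1 words
        if key not in cache:
            cache[key] = s_full                # first full-history vector is reused later
        return cache[key]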
History vector based clustering
• "I am presenting the paper about RNNLM." ≒ "I am introducing the paper about RNNLM."
• If the history vector is similar enough to an existing vector, we use that existing history vector (see the sketch below).
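(Again a sketch with assumed names; the Euclidean distance and the beam value are illustrative, not the paper's exact settings.)

    import numpy as np

    def shared_cluster_vector(s_new, clusters, beam=0.1):
        """History vector based clustering: if the new history vector lies within
        'beam' of an existing one, reuse that vector instead of adding a state."""
        for s in clusters:
            if np.linalg.norm(s_new - s) < beam:   # similar history vectors merge
                return s
        clusters.append(s_new)                     # otherwise keep it as a new state
        return s_new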
Experiments
Experimental results
[Table shown as a figure: Baseline 4-gram back-off LM, Feedforward NNLM, RNNLM 10k-best re-ranking, RNNLM n-gram based history clustering, RNNLM history vector based clustering. Highlighted findings across the result slides:]
• Comparable WER and a 70% reduction in lattice size.
• Same WER and a 45% reduction in lattice size.
• Same WER and a 7% reduction in lattice size.
• Comparable WER and a 72.4% reduction in lattice size.
Conclusion
• The proposed methods achieve WER comparable to 10k-best re-ranking, with over 70% compression in lattice size.
• A smaller lattice makes the computational cost smaller!
References
• "This is also Deep Learning in a sense: the Recurrent Neural Network Language Model" [MLAC2013, Day 9] http://kiyukuta.github.io/2013/12/09/mlac2013_day9_recurrent_neural_network_language_model.html
Prefix tree structuring