Part 5: Language Model
CSE717, SPRING 2008
CUBS, Univ at Buffalo
Examples of Good & Bad Language Models (excerpt from Herman, comic strips by Jim Unger)
What’s a Language Model?
A language model is a probability distribution over word sequences
P(“And nothing but the truth”) ≈ 0.001
P(“And nuts sing on the roof”) ≈ 0
What’s a language model for?
Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation
(and anyone doing statistical modeling)
The Equation
best word sequence = argmax over word sequences of Pr(word sequence | observations)
                   = argmax Pr(observations | word sequence) Pr(word sequence) / Pr(observations)
                   = argmax Pr(observations | word sequence) Pr(word sequence)
The observation can be image features (handwriting recognition), acoustics (speech recognition), word sequence in another language (MT), etc.
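As an illustration only, a tiny Python sketch of this decision rule; the candidate sentences and both scoring functions are hypothetical numbers, not from any real recognizer:

def decode(candidates, observation_score, lm_prob):
    # argmax over word sequences of P(observations | word sequence) * P(word sequence)
    return max(candidates, key=lambda w: observation_score(w) * lm_prob(w))

candidates = ["and nothing but the truth", "and nuts sing on the roof"]
# Hypothetical scores: both are acoustically plausible, but the language model
# strongly prefers the first
observation_score = {"and nothing but the truth": 0.4, "and nuts sing on the roof": 0.5}.get
lm_prob = {"and nothing but the truth": 0.001, "and nuts sing on the roof": 1e-9}.get
print(decode(candidates, observation_score, lm_prob))  # "and nothing but the truth"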
How Language Models work
Hard to compute P(“And nothing but the truth”) directly
Decompose the probability using the chain rule:
P(“and nothing but the truth”) = P(“and”) P(“nothing” | “and”) P(“but” | “and nothing”) P(“the” | “and nothing but”) P(“truth” | “and nothing but the”)
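In general form, this is the chain rule, which factors the probability of any word sequence into conditional probabilities:
P(w_1, w_2, …, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_n | w_1 … w_{n-1})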
The Trigram Approximation
Assume each word depends only on the previous two words
P(“the” | “and nothing but”) ≈ P(“the” | “nothing but”)
P(“truth” | “and nothing but the”) ≈ P(“truth” | “but the”)
How to find probabilities?
Count from real text
Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
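A minimal Python sketch of this counting scheme; the toy corpus, tokenization, and function names are illustrative:

from collections import defaultdict

def trigram_counts(tokens):
    # Count each trigram and its two-word context in one pass
    tri, bi = defaultdict(int), defaultdict(int)
    for i in range(len(tokens) - 2):
        bi[(tokens[i], tokens[i + 1])] += 1
        tri[(tokens[i], tokens[i + 1], tokens[i + 2])] += 1
    return tri, bi

def trigram_prob(tri, bi, x, y, z):
    # Maximum-likelihood estimate: Pr(z | x y) = c(x y z) / c(x y)
    context = bi[(x, y)]
    return tri[(x, y, z)] / context if context else 0.0

tokens = "and nothing but the truth and nothing but the truth".split()
tri, bi = trigram_counts(tokens)
print(trigram_prob(tri, bi, "nothing", "but", "the"))  # 1.0 on this tiny corpus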
Evaluation
How can you tell a good language model from a bad one?
Run a speech recognizer (or your application of choice) and calculate the word error rate: this is slow and specific to your recognizer
Perplexity
An example
Data: “the whole truth and nothing but the truth”
Lexicon: L = {the, whole, truth, and, nothing, but}
Model 1: unigram, Pr(“the”) = … = Pr(“but”) = 1/6 (uniform over L)
Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8
Perplexity of a test text w_1, …, w_T:
PP(w) = [Pr(w_1, …, w_T)]^(-1/T)
where Pr(w_1, …, w_T) is the probability that the test text is generated by the given model.
Model 1: PP(w) = [(1/6)^8]^(-1/8) = 6
Model 2: PP(w) = [(1/4)^4 (1/8)^4]^(-1/8) ≈ 5.657
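A short Python sketch that reproduces the two values above from the unigram models in the example (the helper name is illustrative):

import math

def perplexity(probs):
    # PP(w) = [Pr(w_1, ..., w_T)]^(-1/T), computed in log space for stability
    T = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / T)

test = "the whole truth and nothing but the truth".split()
model1 = {w: 1 / 6 for w in ["the", "whole", "truth", "and", "nothing", "but"]}
model2 = {"the": 1 / 4, "truth": 1 / 4, "whole": 1 / 8, "and": 1 / 8, "nothing": 1 / 8, "but": 1 / 8}

print(perplexity([model1[w] for w in test]))  # 6.0
print(perplexity([model2[w] for w in test]))  # ~5.657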
Perplexity: Is lower better?
Remarkable fact: the “true” model for data has the lowest possible perplexity
The lower the perplexity, the closer we are to the true model.
Perplexity correlates well with the error rate of the recognition task
Correlates better when both models are trained on the same data
Doesn’t correlate well when training data changes
Smoothing
Terrible on test data: if there are no occurrences of c(xyz), the probability is 0
P(“sing” | “nuts”) = 0 leads to infinite perplexity!
Pr(z | y) = c_y(z) / c(y) = c_y(z) / Σ_w c_y(w)
where y is the history (the previous words) and c_y(z) is the count of word z following y.
Smoothing: Add One
Add-one smoothing:
Pr(z | y) = (c_y(z) + 1) / (c(y) + |L|)
Add-delta smoothing:
Pr(z | y) = (c_y(z) + δ) / (c(y) + δ|L|)
Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated.
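A minimal Python sketch of add-one / add-delta smoothing, assuming counts are stored per history y; the toy counts and names are illustrative:

def add_delta_prob(counts_y, c_y, z, lexicon_size, delta=1.0):
    # Pr(z | y) = (c_y(z) + delta) / (c(y) + delta * |L|); delta = 1 is add-one
    return (counts_y.get(z, 0) + delta) / (c_y + delta * lexicon_size)

# History y = "nothing but" seen twice, always followed by "the"
counts_y = {"the": 2}
print(add_delta_prob(counts_y, 2, "the", lexicon_size=6))    # (2 + 1) / (2 + 6) = 0.375
print(add_delta_prob(counts_y, 2, "truth", lexicon_size=6))  # (0 + 1) / (2 + 6) = 0.125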
Smoothing: Simple Interpolation
Interpolate Trigram, Bigram, Unigram for best combination
Almost good enough
Pr(z | xy) = λ c(xyz) / c(xy) + μ c(yz) / c(y) + (1 - λ - μ) c(z) / c(·)
where c(·) is the total number of words in the training text.
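A sketch of the interpolation step in Python, assuming the three component estimates are already computed; the weights shown are arbitrary illustrative values that would in practice be tuned on held-out data:

def interpolated_prob(p_trigram, p_bigram, p_unigram, lam=0.6, mu=0.3):
    # Pr(z | x y) = lam*P(z|xy) + mu*P(z|y) + (1 - lam - mu)*P(z)
    return lam * p_trigram + mu * p_bigram + (1 - lam - mu) * p_unigram

# Even when the trigram estimate is 0, the interpolated probability stays positive
print(interpolated_prob(0.0, 0.05, 0.01))  # 0.016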
Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87]
Discounting:
Pr(z | y) = (c_y(z) - δ_y(z)) / c(y), for c_y(z) > 0, with 0 ≤ δ_y(z) ≤ c_y(z)
Discounted probability mass:
Σ_z δ_y(z) / c(y)
Redistribution, backing off to the (n-1)-gram:
If c(y_1 … y_n z) = 0,
Pr(z | y_1 … y_n) = k_y Pr(z | y_2 … y_n)
where k_y is selected so that Σ_z Pr(z | y_1 … y_n) = 1
The factor can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95]
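A simplified bigram sketch of the discount-and-redistribute idea, not the full Katz algorithm: seen successors keep a discounted estimate, and the freed mass is spread over unseen successors in proportion to a lower-order (here unigram) estimate. All names and the constant discount value are illustrative assumptions:

def backoff_prob(bigram_counts, unigram_counts, vocab, y, z, delta=0.5):
    # c(y): total count of history y as a context
    c_y = sum(bigram_counts.get((y, w), 0) for w in vocab)
    seen = [w for w in vocab if bigram_counts.get((y, w), 0) > 0]
    if bigram_counts.get((y, z), 0) > 0:
        # discounted probability for a seen event
        return (bigram_counts[(y, z)] - delta) / c_y
    # probability mass freed by discounting, redistributed over unseen successors
    freed = delta * len(seen) / c_y
    unseen_total = sum(unigram_counts[w] for w in vocab if w not in seen)
    return freed * unigram_counts[z] / unseen_total

bigram_counts = {("nuts", "are"): 3, ("nuts", "and"): 1}
unigram_counts = {"are": 3, "and": 1, "sing": 2, "nuts": 2}
vocab = list(unigram_counts)
print(backoff_prob(bigram_counts, unigram_counts, vocab, "nuts", "sing"))  # 0.125, no longer zero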
Linear Discount
Pr(z | y) = (1 - α) c_y(z) / c(y), for c_y(z) > 0
The discount is proportional to the count: δ_y(z) = α c_y(z), with α < 1
Generalization
More general formulation: δ_y(z) = α(y) c_y(z), α(y) < 1
α(y): a function of y, determined by cross-validation
Requires more data
Computation is expensive
Absolute Discounting
Drawback of linear discount: the counts of frequently observed events are modified the most, going against the “law of large numbers”
The discount is an absolute value: δ_y(z) = δ
Pr(z | y) = (c_y(z) - δ) / c(y), for c_y(z) > 0
Works pretty well, and is easier than linear discounting
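A one-line sketch of the seen-event case under absolute discounting; δ = 0.5 is an arbitrary illustrative value, and the freed mass would be redistributed to unseen events as in the backing-off scheme above:

def absolute_discount_prob(c_yz, c_y, delta=0.5):
    # Pr(z | y) = (c_y(z) - delta) / c(y) for a seen event
    return (c_yz - delta) / c_y

print(absolute_discount_prob(3, 10))  # (3 - 0.5) / 10 = 0.25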
References
[1] Katz S, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987
[2] Ney H, Essen U, Kneser R, On the estimation of “small” probabilities by leaving-one-out, IEEE Trans. on PAMI 17(12):1202-1212, 1995
[3] Joshua Goodman, A tutorial on language modeling: The State of the Art in Language Modeling, research.microsoft.com/~joshuago/lm-tutorial-public.ppt