Parameter estimation in IBM Models:
Ling 572
Fei Xia
Week ??
Outline
• IBM Model 1 review (from LING 571):
  – Word alignment
  – Modeling
  – Training
• Formulae
IBM Model Basics
• Classic paper: Brown et al. (1993)
• Translation: F → E (or Fr → Eng)
• Resource required:
  – Parallel data (a set of “sentence” pairs)
• Main concepts:
  – Source channel model
  – Hidden word alignment
  – EM training
Intuition
• Sentence pairs: word mapping is one-to-one.
  – (1) S: a b c d e    T: l m n o p
  – (2) S: c a e        T: p n m
  – (3) S: d a c        T: n p l
• From these pairs we can deduce (b, o), (d, l), (e, m), and either (a, p), (c, n) or (a, n), (c, p); the sketch below verifies this by brute force.
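A minimal sketch of that deduction (my own code, not from the lecture): enumerate every one-to-one mapping of the source vocabulary onto the target vocabulary, keep only mappings consistent with all three pairs, and report each word's surviving translations.

```python
from itertools import permutations

# Brute-force all one-to-one vocabulary mappings and keep those where each
# sentence pair's source words map exactly onto its target words.
pairs = [("a b c d e", "l m n o p"),
         ("c a e",     "p n m"),
         ("d a c",     "n p l")]

src_vocab = sorted({w for s, _ in pairs for w in s.split()})
tgt_vocab = sorted({w for _, t in pairs for w in t.split()})

surviving = {s: set() for s in src_vocab}
for perm in permutations(tgt_vocab):
    mapping = dict(zip(src_vocab, perm))
    if all({mapping[w] for w in s.split()} == set(t.split())
           for s, t in pairs):
        for s, t in mapping.items():
            surviving[s].add(t)

for s in src_vocab:
    print(s, "->", sorted(surviving[s]))
# a -> ['n', 'p'], b -> ['o'], c -> ['n', 'p'], d -> ['l'], e -> ['m']
```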
Source channel model for MT
$$E^* = \arg\max_E P(E) \cdot P(F \mid E)$$

[Figure: noisy channel: Eng sent → noisy channel → Fr sent, with P(E) the source model and P(F | E) the channel model]
Two types of parameters:
• Language model: P(E)
• Translation model: P(F | E)
Word alignment

• An alignment, a, is a function from Fr word position to Eng word position: a(j) = i (also written aj = i) means that fj is generated by ei.
• a = (a1, …, am)
• The constraint: each Fr word is generated by exactly one Eng word (including e0).
• Ex:
  – F: f1 f2 f3 f4 f5
  – E: e1 e2 e3 e4
  – a4 = 3
  – a = (0, 1, 1, 3, 2)
Modeling P(F | E) with alignment

$$P(F \mid E) = \sum_a P(F, a \mid E) = \sum_a P(a \mid E) \cdot P(F \mid a, E)$$
Notation
• E: the Eng sentence: E = e1 … el
• ei: the i-th Eng word.
• F: the Fr sentence: F = f1 … fm
• fj: the j-th Fr word.
• e0: the Eng NULL word.
• f0: the Fr NULL word.
• aj: the position of the Eng word that generates fj.
Notation (cont)
• l: Eng sent length
• m: Fr sent length
• i: Eng word position
• j: Fr word position
• e: an Eng word
• f: a Fr word
• $e_1^i$ is $e_1 \ldots e_i$
• $a_1^j$ is $a_1 \ldots a_j$
• $f_1^j$ is $f_1 \ldots f_j$
Generative process
• To generate F from E:
  – Pick a length m for F, with prob P(m | l).
  – Choose an alignment a, with prob P(a | E, m).
  – Generate the Fr sent given the Eng sent and the alignment, with prob P(F | E, a, m).
• Another way to look at it (sketched in code below):
  – Pick a length m for F, with prob P(m | l).
  – For j = 1 to m:
    • Pick an Eng word index aj, with prob P(aj | j, m, l).
    • Pick a Fr word fj according to the Eng word ei, where aj = i, with prob P(fj | ei).
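A minimal sketch of the second formulation, assuming Model 1's uniform alignment choice; the data structures (length_prob, a t table as a dict of dicts) are my own stand-ins, not from the lecture:

```python
import random

# E: Eng sentence as a word list with E[0] = e0 (NULL);
# length_prob maps m -> P(m | l); t[e][f] = t(f | e).
def generate(E, length_prob, t, rng=random):
    l = len(E) - 1                                  # Eng length, excluding e0
    ms, ps = zip(*length_prob.items())
    m = rng.choices(ms, weights=ps)[0]              # pick m with prob P(m | l)
    F, a = [], []
    for j in range(1, m + 1):
        i = rng.randrange(l + 1)                    # a_j uniform over 0..l
        fs, qs = zip(*t[E[i]].items())
        F.append(rng.choices(fs, weights=qs)[0])    # f_j ~ t(. | e_i)
        a.append(i)
    return F, a
```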
Decomposition
$$P(F \mid E) = P(F, m \mid E) = \sum_a P(F, a, m \mid E) = \sum_a P(m \mid E) \cdot P(a \mid E, m) \cdot P(F \mid a, E, m)$$
Approximation
• Fr sent length depends only on Eng sent length:

$$P(m \mid E) = P(m \mid \mathrm{length}(E)) = P(m \mid l)$$

• Fr word depends only on the Eng word that generates it:

$$P(F \mid a, E, m) = \prod_{j=1}^{m} P(f_j \mid a_1^j, f_1^{j-1}, E, m) = \prod_{j=1}^{m} P(f_j \mid e_{a_j})$$

• Estimating P(a | E, m): all alignments are equally likely:

$$P(a \mid E, m) = \frac{1}{(l+1)^m}$$
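As a worked instance (my numbers, using the earlier example with l = 4 and m = 5):

```latex
\[
  (l+1)^m = (4+1)^5 = 3125 \text{ possible alignments}, \qquad
  P(a \mid E, m) = \frac{1}{(l+1)^m} = \frac{1}{3125}.
\]
```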
Decomposition
$$
\begin{aligned}
P(F \mid E) &= \sum_a P(m \mid E) \cdot P(a \mid E, m) \cdot P(F \mid a, E, m) \\
            &= P(m \mid l) \cdot \sum_a \frac{1}{(l+1)^m} \prod_{j=1}^{m} P(f_j \mid e_{a_j}) \\
            &= \frac{P(m \mid l)}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} P(f_j \mid e_{a_j}) \\
            &= \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)
\end{aligned}
$$
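The last step swaps the sum over alignments with the product over positions, turning (l+1)^m terms into m small sums. A toy numeric check (my own numbers, not from the slides) that the swap is an identity:

```python
from itertools import product

# p[j][i] stands in for P(f_j | e_i), with l + 1 = 3 Eng positions, m = 2.
p = [[0.1, 0.5, 0.4],    # position j = 1: P(f_1 | e_0), P(f_1 | e_1), P(f_1 | e_2)
     [0.3, 0.3, 0.2]]    # position j = 2

# Sum over all (l+1)^m alignments of the per-alignment product...
lhs = sum(p[0][a1] * p[1][a2] for a1, a2 in product(range(3), repeat=2))
# ...equals the product of per-position sums.
rhs = sum(p[0]) * sum(p[1])
print(lhs, rhs)          # both 0.8: the sum-product swap is exact
```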
Final formula and parameters for Model 1
$$P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)$$
Two types of parameters:
• Length prob: P(m | l)
• Translation prob: P(fj | ei), or t(fj | ei)
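A minimal sketch of evaluating this formula, leaving out the length prob P(m | l) (a single multiplicative factor); the dict-of-dicts layout for t is my own stand-in:

```python
# F, E: word lists, with E[0] the NULL word e0; t[e][f] = t(f | e).
def model1_likelihood(F, E, t):
    l, m = len(E) - 1, len(F)
    prob = 1.0 / (l + 1) ** m                       # the 1/(l+1)^m factor
    for f in F:                                     # product over j = 1..m
        prob *= sum(t[e].get(f, 0.0) for e in E)    # sum over i = 0..l
    return prob
```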
Training
• Mathematically motivated:
  – Having an objective function to optimize
  – Using several clever tricks
• The resulting formulae:
  – are intuitively expected
  – can be calculated efficiently
• EM algorithm:
  – Hill climbing; each iteration is guaranteed to improve the objective function
  – It is not guaranteed to reach the global optimum
Training: Fractional counts
• Let Ct(f, e) be the fractional count of the (f, e) pair in the training data, given alignment prob P:

$$Ct(f, e) = \sum_{(E,F)} \sum_a \Big( P(a \mid E, F) \cdot \sum_{j=1}^{|F|} \delta(f, f_j) \, \delta(e, e_{a_j}) \Big)$$

Here P(a | E, F) is the alignment prob, and the inner sum is the actual count of times e and f are linked in (E, F) by alignment a.

$$t(f \mid e) = \frac{Ct(f, e)}{\sum_{x \in V_F} Ct(x, e)}$$
Estimating P(a|E,F)
• We could list all the alignments and estimate P(a | E, F) directly (see the sketch below):

$$P(a \mid E, F) = \frac{P(a, F \mid E)}{P(F \mid E)} = \frac{P(a, F \mid E)}{\sum_a P(a, F \mid E)}$$

$$P(a, F \mid E) = P(a \mid E) \cdot P(F \mid a, E) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$$
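A minimal sketch of this enumeration (exponential in m, so illustration only), using the same dict-of-dicts t table as before; the 1/(l+1)^m factor cancels in the ratio:

```python
from itertools import product

# F, E: word lists, with E[0] the NULL word e0; t[e][f] = t(f | e).
def alignment_posterior(F, E, t):
    l, m = len(E) - 1, len(F)
    scores = {}
    for a in product(range(l + 1), repeat=m):       # all (l+1)^m alignments
        p = 1.0
        for j, i in enumerate(a):
            p *= t[E[i]].get(F[j], 0.0)             # t(f_j | e_{a_j})
        scores[a] = p
    Z = sum(scores.values())                        # sum_a P(a, F | E), up to a constant
    return {a: p / Z for a, p in scores.items()}
```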
Formulae so far
$$Ct(f, e) = \sum_{(E,F)} \sum_a \Big( P(a \mid E, F) \cdot \sum_{j=1}^{|F|} \delta(f, f_j) \, \delta(e, e_{a_j}) \Big)$$

$$t(f \mid e) = \frac{Ct(f, e)}{\sum_{x \in V_F} Ct(x, e)}$$

$$P(a \mid F, E) = \frac{P(a, F \mid E)}{\sum_a P(a, F \mid E)} = \frac{\prod_{j=1}^{m} t(f_j \mid e_{a_j})}{\sum_a \prod_{j=1}^{m} t(f_j \mid e_{a_j})}$$
New estimate for t(f|e)
The algorithm
1. Start with an initial estimate of t(f | e): e.g., uniform distribution
2. Calculate P(a | F, E)
3. Calculate Ct(f, e); normalize to get t(f | e)
4. Repeat Steps 2-3 until the “improvement” is too small.
No need to enumerate all word alignments
• Luckily, for Model 1, there is a way to calculate Ct(f, e) efficiently:

$$Ct(f, e) = \sum_{(E,F)} \frac{t'(f \mid e)}{\sum_{i'=0}^{|E|} t'(f \mid e_{i'})} \cdot \Big( \sum_{j=1}^{|F|} \delta(f, f_j) \Big) \cdot \Big( \sum_{i=0}^{|E|} \delta(e, e_i) \Big)$$

where t' is the estimate of t from the previous iteration.

$$t(f \mid e) = \frac{Ct(f, e)}{\sum_{x \in V_F} Ct(x, e)}$$
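Putting the efficient count together with "The algorithm" above gives a compact trainer. A minimal sketch, with my own data layout (corpus as (F, E) word-list pairs, e0 prepended to each E) and a fixed iteration count standing in for the "improvement is too small" test:

```python
from collections import defaultdict

def train_model1(corpus, n_iter=10):
    f_vocab = {f for F, _ in corpus for f in F}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))   # step 1: uniform t(f | e)
    for _ in range(n_iter):
        count = defaultdict(lambda: defaultdict(float))     # Ct(f, e)
        for F, E in corpus:
            for f in F:                                     # token loops cover the deltas
                denom = sum(t[e][f] for e in E)             # sum_{i'} t'(f | e_{i'})
                for e in E:
                    count[e][f] += t[e][f] / denom          # fractional count
        for e in count:                                     # normalize to get new t(f | e)
            total = sum(count[e].values())
            for f in count[e]:
                t[e][f] = count[e][f] / total
    return t
```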
Summary of Model 1
• Modeling:
  – Pick the length of F with prob P(m | l).
  – For each position j:
    • Pick an Eng word position aj, with prob P(aj | j, m, l).
    • Pick a Fr word fj according to the Eng word ei, with prob t(fj | ei), where i = aj.
  – The resulting formula can be calculated efficiently.
• Training: EM algorithm. The update can be done efficiently.
• Finding the best alignment: can be easily done (see the sketch below).
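A minimal sketch of why the best alignment is easy, assuming the same t table as before: the alignment positions are independent given E, so each fj independently picks the Eng word (including NULL at index 0) with the highest t(fj | ei).

```python
def best_alignment(F, E, t):
    # For each f_j, choose the index i maximizing t(f_j | e_i).
    return [max(range(len(E)), key=lambda i: t[E[i]].get(f, 0.0)) for f in F]
```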
New stuff
EM algorithm
• EM: expectation-maximization
• In a model with hidden states (e.g., word alignment), how can we estimate model parameters?
• EM does the following:
  – E-step: take an initial model parameterization and calculate the expected values of the hidden data.
  – M-step: use the expected values to maximize the likelihood of the training data.
Objective function
$$
\begin{aligned}
\theta_1 &= \arg\max_\theta \prod_{(E,F)} P(F, E \mid \theta) \\
         &= \arg\max_\theta \sum_{(E,F)} \log P(F, E \mid \theta) \\
         &= \arg\max_\theta \sum_{(E,F)} \big( \log P(F \mid E, \theta) + \log P(E) \big)
\end{aligned}
$$

$$\theta_2 = \arg\max_\theta \sum_{(E,F)} \log P(F \mid E, \theta)$$

Since P(E) does not depend on θ, the two objectives have the same maximizer: θ1 = θ2.
Training Summary
• Mathematically motivated:
  – Having an objective function to optimize
  – Using several clever tricks
• The resulting formulae:
  – are intuitively expected
  – can be calculated efficiently
• EM algorithm:
  – Hill climbing; each iteration is guaranteed to improve the objective function
  – It is not guaranteed to reach the global optimum