A Phrase-Based, Joint Probability Model for Statistical Machine Translation [Marcu and Wong, 2005] Miloš Stanojević

DESCRIPTION

Slides for the joint model of phrase alignment by Marcu and Wong

TRANSCRIPT

  • A Phrase-Based, Joint Probability Model for Statistical Machine Translation

    [Marcu and Wong, 2005]

    Miloš Stanojević

  • Problems with aligning with conditional models (IBM models)

    They produce 1-to-many alignments, which cannot capture units bigger than words that should be translated together (in the paper these are called intuitive alignments)

    If we want phrase alignments, we need a non-conditional, joint probabilistic model

    Possible ways to achieve this:

    Symmetrizing word alignments (seen in the last lecture)

    Learning phrase alignments directly from corpora (this lecture)

  • Difference between symmetrized word alignments and directly learned phrase alignments

    In the joint model, concepts generate words on both sides; with symmetrized word alignments, words on each side generate words on the opposite side.

  • How to find concepts in data?

    Using EM in a similar way as in the IBM models. Two models are proposed:

    Model 1 ignores reordering (each order is equally probable)

    Model 2 adds a distortion model to Model 1

  • Model 1

    Generative model which describes the following stochastic process:

    1. generation of concepts
    2. generation of phrase pairs from concepts
    3. linear ordering of the phrases on each side

  • Model 1

    p(E, F) = \sum_{C \in \mathcal{C} \mid L(E, F, C)} \prod_{c_i \in C} t(e_i, f_i)

    \mathcal{C} ranges over configurations (bags) of concepts; L(E, F, C) is a predicate which checks whether C covers the sentence pair (E, F) completely, without overlapping words; t(e_i, f_i) is the joint probability of the phrase pair generated by concept c_i.
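To make the formula concrete, here is a minimal Python sketch that scores one concept configuration under Model 1; the phrase table `t` and the example phrases are made-up toy values, not numbers from the paper. The full p(E, F) would sum such scores over every configuration C that covers the sentence pair.

```python
# Minimal sketch: Model 1 score of a single concept configuration C,
# i.e. the product of joint phrase probabilities t(e_i, f_i).
# The table below is a hypothetical toy example, not from the paper.
t = {
    ("the house", "das haus"): 0.4,
    ("is small", "ist klein"): 0.3,
    ("the", "das"): 0.2,
    ("house", "haus"): 0.1,
}

def model1_score(configuration):
    """configuration: a list of (e_i, f_i) phrase pairs, one per concept."""
    p = 1.0
    for e_i, f_i in configuration:
        p *= t[(e_i, f_i)]
    return p

# Two configurations that both cover "the house is small" / "das haus ist klein";
# p(E, F) is the sum of such terms over all covering configurations.
print(model1_score([("the house", "das haus"), ("is small", "ist klein")]))  # 0.12
print(model1_score([("the", "das"), ("house", "haus"),
                    ("is small", "ist klein")]))                             # 0.006
```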

  • Model 1

    The authors say that this model is good for alignment, but

    it is unsuitable for decoding because it does not help in deciding how the phrases on the target side should be ordered.

  • Model 2

    Generative model which describes the following stochastic process (a toy sketch of this process follows this slide):

    1. generate a bag of concepts C
    2. initialize E and F to empty sequences
    3. randomly take a concept c_i from C, remove it from C, and generate a phrase pair (e_i, f_i)
    4. append f_i to F, starting at position k (the current end of F)
    5. insert e_i at position l in E, if no other phrase occupies any of the positions l to l + |e_i|
    6. repeat steps 3 to 5 until C is empty
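As an illustration of steps 2 to 6 above, the toy Python sketch below builds a sentence pair from a given bag of concepts. Step 1 is skipped (the bag is passed in rather than sampled), E is represented by a fixed-size slot array, and the phrases and sizes are hypothetical examples.

```python
import random

# Toy sketch of steps 2-6 of Model 2's generative story. Step 1 is skipped:
# the bag of concepts is given directly instead of being sampled.
# Phrases, lengths and the slot-array size are hypothetical toy choices.
def sample_sentence_pair(concepts, max_len=10):
    C = list(concepts)            # step 1 (given): a bag of concepts
    F = []                        # step 2: E and F start empty
    E_slots = [None] * max_len    #         E is a slot array we fill in
    while C:                      # step 6: repeat until C is empty
        e_i, f_i = C.pop(random.randrange(len(C)))   # step 3: pick a concept
        F.extend(f_i.split())                        # step 4: f_i starts at k = end of F
        e_words = e_i.split()
        while True:                                  # step 5: find a free span l..l+|e_i|
            l = random.randrange(max_len - len(e_words) + 1)
            if all(E_slots[p] is None for p in range(l, l + len(e_words))):
                E_slots[l:l + len(e_words)] = e_words
                break
    E = [w for w in E_slots if w is not None]
    return " ".join(E), " ".join(F)

print(sample_sentence_pair([("the house", "das haus"), ("is small", "ist klein")]))
```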

  • Model 2

    Absolute distortion distribution for each concept: if f_i starts at position k in F and e_i starts at position l in E, then

    dist(f_i, e_i) = \prod_{p = k}^{k + |f_i| - 1} d(p, \; l + |e_i| / 2)

    i.e. every absolute position p covered by f_i in F is distorted with respect to the center of e_i in E.

    Joint probability of the parallel sentences:

    p(E, F) = \sum_{C \in \mathcal{C} \mid L(E, F, C)} \prod_{c_i \in C} [\, t(e_i, f_i) \cdot dist(f_i, e_i) \,]
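A matching sketch of how one alignment would be scored under Model 2; the joint table `t` is the same hypothetical toy table as before, and the exponential decay used for d(p, center) is only a stand-in for the learned distortion distribution.

```python
import math

# Sketch: Model 2 score of one concept configuration, assuming a toy joint
# table t and a simple exponential decay as a stand-in for the learned
# absolute distortion distribution d(p, center).
t = {("the house", "das haus"): 0.4, ("is small", "ist klein"): 0.3}

def d(p, center):
    return 0.5 * math.exp(-abs(p - center))   # hypothetical stand-in for d

def model2_score(alignment):
    """alignment: list of (e_i, f_i, l, k) with l = start of e_i in E,
    k = start of f_i in F (0-based positions)."""
    score = 1.0
    for e_i, f_i, l, k in alignment:
        score *= t[(e_i, f_i)]
        center = l + len(e_i.split()) / 2             # center of e_i in E
        for p in range(k, k + len(f_i.split())):      # positions covered by f_i
            score *= d(p, center)
    return score

print(model2_score([("the house", "das haus", 0, 0),
                    ("is small", "ist klein", 2, 2)]))
```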

  • Training

    Complete EM is impossible because of the huge number of possible alignments.

    Even if we consider only the most frequent n-grams, the space of possible alignments is big.

    When the distribution over alignments is uniform (the first iteration of EM), there are tricks which can be applied to compute the expectation step efficiently, but for later iterations we need to use approximations.

  • Training

    1. Determine high-frequency n-grams
    2. Initial iteration of EM with tricks
    3. Additional iterations of EM with approximations
    4. Generate conditional model probabilities

  • Initial iteration of EM

    Let C(n, k) be the number of all possible partitions of a sentence of n words into k non-empty concepts.

    Then there are k! * C(l, k) * C(m, k) alignments that can be built for a sentence pair (E, F) of lengths l and m using k concepts.

    The expectation for a phrase pair (concept) is the ratio between the number of alignments consistent with that phrase pair and the number of all possible alignments.

  • Initial iteration of EM

    For a phrase pair (e_i, f_i) spanning a words of E and b words of F (with |E| = l, |F| = m):

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)}

  • How to compute the number of ways in which n words can be partitioned into k concepts, C(n, k)?

    [figure: a sentence as word positions 1 2 3 ... n-1 n]

    In the paper they use the Stirling number of the second kind:

    S(n, k) = \frac{1}{k!} \sum_{i=0}^{k-1} (-1)^i \binom{k}{i} (k - i)^n

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)} = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, S(l-a, k) \, S(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, S(l, k) \, S(m, k)}
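The closed-form count above is easy to check numerically. The short sketch below implements S(n, k) from the formula on the slide and the fractional count p(e_i, f_i); the sentence and phrase lengths (l, m, a, b) are made-up toy values, not numbers from the paper.

```python
from math import comb, factorial

# Sketch of the closed-form fractional count used in the uniform first
# EM iteration, with the Stirling-number choice of C(n, k).
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty subsets."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k)) // factorial(k)

def frac_count(l, m, a, b, C):
    """p(e_i, f_i): alignments consistent with the pair / all alignments."""
    num = sum(factorial(k) * C(l - a, k) * C(m - b, k)
              for k in range(1, min(l - a, m - b) + 1))
    den = sum(factorial(k) * C(l, k) * C(m, k)
              for k in range(1, min(l, m) + 1))
    return num / den

print(frac_count(6, 7, 2, 3, stirling2))   # toy example: l=6, m=7, |e_i|=2, |f_i|=3
```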

  • How to compute the number of ways in which n words can be partitioned into k concepts, C(n, k)?

    [figure: word positions 1 2 3 ... n-1 n with k-1 boundaries chosen among the n-1 gaps between adjacent words]

    If concepts must be contiguous phrases, this is the number of ways to choose k-1 boundaries out of the n-1 gaps between words:

    C(n, k) = \binom{n-1}{k-1} = \frac{(n-1)!}{(n-k)! \, (k-1)!}

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)} = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, \binom{l-a-1}{k-1} \binom{m-b-1}{k-1}}{\sum_{k=1}^{\min(l,\, m)} k! \, \binom{l-1}{k-1} \binom{m-1}{k-1}}
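For comparison, here is the same fractional count with the contiguous-phrase count C(n, k) = C(n-1 choose k-1) proposed on this slide; the toy lengths are the same made-up values as in the previous sketch.

```python
from math import comb, factorial

# Same fractional count as in the previous sketch, but counting only
# contiguous partitions: C(n, k) = comb(n - 1, k - 1).
def frac_count(l, m, a, b, C):
    num = sum(factorial(k) * C(l - a, k) * C(m - b, k)
              for k in range(1, min(l - a, m - b) + 1))
    den = sum(factorial(k) * C(l, k) * C(m, k)
              for k in range(1, min(l, m) + 1))
    return num / den

contiguous = lambda n, k: comb(n - 1, k - 1)   # k-1 boundaries in n-1 gaps
print(frac_count(6, 7, 2, 3, contiguous))      # compare with the Stirling-based value
```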

  • Later EM iterations

    The authors say that, because full EM is impossible, they train on Viterbi derivations.

    This is not completely correct, since they only approximate the Viterbi alignment with a greedy search.

    [DeNero and Klein, 2008] proved that the problem of finding the Viterbi phrase alignment is NP-complete.

    Even if we could train on true Viterbi alignments, that procedure is not guaranteed to work.

  • Later EM iterations

    With the greedy search they first find the best complete alignment according to the t-probabilities alone (without distortion) and then hill-climb to better derivations by:

    breaking and merging concepts

    swapping words between concepts

    moving words across concepts

    All alignments found during hill-climbing are used for EM training (a rough sketch of the hill-climbing loop follows this slide).
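The sketch below illustrates the hill-climbing idea with only one of the operators from the slide (merging adjacent concepts); the phrase table is hypothetical and, for simplicity, the starting alignment is the word-by-word one rather than the greedily found best alignment.

```python
# Rough sketch of greedy hill-climbing over concept configurations, scored
# by t-probabilities alone. Only the "merge adjacent concepts" operator is
# implemented; the toy table t is hypothetical.
t = {
    ("the house", "das haus"): 0.4,
    ("the", "das"): 0.2, ("house", "haus"): 0.15,
    ("is small", "ist klein"): 0.3,
    ("is", "ist"): 0.1, ("small", "klein"): 0.1,
}

def score(alignment):
    p = 1.0
    for pair in alignment:
        p *= t.get(pair, 1e-6)     # tiny floor for unseen phrase pairs
    return p

def neighbours(alignment):
    # merge two adjacent concepts into one bigger concept
    for i in range(len(alignment) - 1):
        (e1, f1), (e2, f2) = alignment[i], alignment[i + 1]
        merged = (e1 + " " + e2, f1 + " " + f2)
        yield alignment[:i] + [merged] + alignment[i + 2:]
    # (breaking concepts, swapping and moving words would be added here)

def hill_climb(alignment):
    current, improved = alignment, True
    while improved:
        improved = False
        for n in neighbours(current):
            if score(n) > score(current):
                current, improved = n, True
    return current

start = [("the", "das"), ("house", "haus"), ("is", "ist"), ("small", "klein")]
print(hill_climb(start))   # merges into the two phrase-level concepts
```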

  • Making a conditional probability model

    The joint t(e_i, f_i) and d(pos(f_i), pos(e_i)) are used for training, but the conditional t(f_i | e_i) and d(pos(f_i) | pos(e_i)) are used for decoding.

    By making the probabilities conditional they become better suited for the task we are solving.
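A small sketch of that conversion for the phrase table: the joint t(e_i, f_i) is normalised over all f seen with a given e to obtain t(f_i | e_i); the numbers are toy values, not from the paper.

```python
from collections import defaultdict

# Turn a learned joint table t(e_i, f_i) into the conditional t(f_i | e_i)
# used for decoding, by normalising over all f seen with each e.
t_joint = {("house", "haus"): 0.06, ("house", "gebäude"): 0.02,
           ("small", "klein"): 0.05}

totals = defaultdict(float)
for (e, f), p in t_joint.items():
    totals[e] += p

t_cond = {(e, f): p / totals[e] for (e, f), p in t_joint.items()}
print(t_cond[("house", "haus")])   # 0.75
```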

  • Results

    Even though the authors made many approximations and applied a lot of black magic (hacking), they still get better translation results than models trained with IBM Model 4.

    Clearly, using bigger chunks in translation helps.

  • Possible improvements

    Using the number of combinations instead of the Stirling number of the second kind for the initial EM iteration

    Constraining the search using symmetrized IBM alignments [Birch et al., 2006] or using ITG constraints [Cherry and Lin, 2007]

    and many others

    With all the improvements that could be made, this method is still very slow on big corpora [Koehn, 2010] and does not give good results because it leads to over-fitting.

  • Thank you

    Questions?
