A Phrase-Based, Joint Probability Model for Statistical Machine Translation [Marcu and Wong, 2005] Miloš Stanojević

DESCRIPTION

Slides for the joint model of phrase alignment by Marcu and Wong

TRANSCRIPT

  • A Phrase-Based, Joint Probability Model for Statistical Machine Translation

    [Marcu and Wong, 2005]

    Miloš Stanojević

  • Problems with aligning with conditional models (IBM models)

    They produce 1-to-many alignments, which cannot capture units bigger than words that should be translated together (in the paper these are called intuitive alignments)

    If we want phrase alignments, we need a non-conditional, joint probabilistic model

    Possible ways to achieve this:

    Symmetrizing word alignments (seen in the last lecture)

    Learning phrase alignments directly from corpora (this lecture)

  • Difference between symmetrized word alignments and directly learned phrase alignments

    In the joint model, concepts generate words on both sides; with symmetrized word alignments, words on each side generate words on the opposite side.

  • How to find concepts in data?

    Using EM in a similar way as in the IBM models. Two models are proposed:

    Model 1 ignores reordering (each order is equally probable)

    Model 2 adds a distortion model to Model 1

  • Model 1

    Generative model which describes the following stochastic process:

    1. generation of concepts
    2. generation of phrase pairs from concepts
    3. linear ordering of the phrases on each side

  • Model 1

    p(E, F) = \sum_{C \in \mathcal{C} \mid L(E, F, C)} \prod_{c_i \in C} t(e_i, f_i)

    \mathcal{C} ranges over configurations (bags) of concepts; L(E, F, C) is a predicate which checks whether C covers the sentence pair (E, F) completely, without overlapping words; t(e_i, f_i) is the joint probability of the phrase pair generated by concept c_i.
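To make the formula concrete, here is a minimal Python sketch that scores one concept configuration under Model 1; the phrase table `t` and the example phrases are made-up toy values, not numbers from the paper. The full p(E, F) would sum such scores over every configuration C that covers the sentence pair.

```python
# Minimal sketch: Model 1 score of a single concept configuration C,
# i.e. the product of joint phrase probabilities t(e_i, f_i).
# The table below is a hypothetical toy example, not from the paper.
t = {
    ("the house", "das haus"): 0.4,
    ("is small", "ist klein"): 0.3,
    ("the", "das"): 0.2,
    ("house", "haus"): 0.1,
}

def model1_score(configuration):
    """configuration: a list of (e_i, f_i) phrase pairs, one per concept."""
    p = 1.0
    for e_i, f_i in configuration:
        p *= t[(e_i, f_i)]
    return p

# Two configurations that both cover "the house is small" / "das haus ist klein";
# p(E, F) is the sum of such terms over all covering configurations.
print(model1_score([("the house", "das haus"), ("is small", "ist klein")]))  # 0.12
print(model1_score([("the", "das"), ("house", "haus"),
                    ("is small", "ist klein")]))                             # 0.006
```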

  • Model 1

    The authors say that this model is good for alignment, but

    it is unsuitable for decoding because it does not help in deciding how the phrases on the target side should be ordered.

  • Model 2

    Generative model which describes the following stochastic process (a toy sketch of this process follows this slide):

    1. generate a bag of concepts C
    2. initialize E and F to empty sequences
    3. randomly take a concept c_i from C, remove it from C, and generate a phrase pair (e_i, f_i)
    4. append f_i to F, starting at position k (the current end of F)
    5. insert e_i at position l in E, if no other phrase occupies any of the positions l to l + |e_i|
    6. repeat steps 3 to 5 until C is empty
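As an illustration of steps 2 to 6 above, the toy Python sketch below builds a sentence pair from a given bag of concepts. Step 1 is skipped (the bag is passed in rather than sampled), E is represented by a fixed-size slot array, and the phrases and sizes are hypothetical examples.

```python
import random

# Toy sketch of steps 2-6 of Model 2's generative story. Step 1 is skipped:
# the bag of concepts is given directly instead of being sampled.
# Phrases, lengths and the slot-array size are hypothetical toy choices.
def sample_sentence_pair(concepts, max_len=10):
    C = list(concepts)            # step 1 (given): a bag of concepts
    F = []                        # step 2: E and F start empty
    E_slots = [None] * max_len    #         E is a slot array we fill in
    while C:                      # step 6: repeat until C is empty
        e_i, f_i = C.pop(random.randrange(len(C)))   # step 3: pick a concept
        F.extend(f_i.split())                        # step 4: f_i starts at k = end of F
        e_words = e_i.split()
        while True:                                  # step 5: find a free span l..l+|e_i|
            l = random.randrange(max_len - len(e_words) + 1)
            if all(E_slots[p] is None for p in range(l, l + len(e_words))):
                E_slots[l:l + len(e_words)] = e_words
                break
    E = [w for w in E_slots if w is not None]
    return " ".join(E), " ".join(F)

print(sample_sentence_pair([("the house", "das haus"), ("is small", "ist klein")]))
```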

  • Model 2

    Absolute distortion distribution for each concept: if f_i starts at position k in F and e_i starts at position l in E, then

    dist(f_i, e_i) = \prod_{p = k}^{k + |f_i| - 1} d(p, \; l + |e_i| / 2)

    i.e. every absolute position p covered by f_i in F is distorted with respect to the center of e_i in E.

    Joint probability of the parallel sentences:

    p(E, F) = \sum_{C \in \mathcal{C} \mid L(E, F, C)} \prod_{c_i \in C} [\, t(e_i, f_i) \cdot dist(f_i, e_i) \,]
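A matching sketch of how one alignment would be scored under Model 2; the joint table `t` is the same hypothetical toy table as before, and the exponential decay used for d(p, center) is only a stand-in for the learned distortion distribution.

```python
import math

# Sketch: Model 2 score of one concept configuration, assuming a toy joint
# table t and a simple exponential decay as a stand-in for the learned
# absolute distortion distribution d(p, center).
t = {("the house", "das haus"): 0.4, ("is small", "ist klein"): 0.3}

def d(p, center):
    return 0.5 * math.exp(-abs(p - center))   # hypothetical stand-in for d

def model2_score(alignment):
    """alignment: list of (e_i, f_i, l, k) with l = start of e_i in E,
    k = start of f_i in F (0-based positions)."""
    score = 1.0
    for e_i, f_i, l, k in alignment:
        score *= t[(e_i, f_i)]
        center = l + len(e_i.split()) / 2             # center of e_i in E
        for p in range(k, k + len(f_i.split())):      # positions covered by f_i
            score *= d(p, center)
    return score

print(model2_score([("the house", "das haus", 0, 0),
                    ("is small", "ist klein", 2, 2)]))
```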

  • Training

    Complete EM is impossible because of the huge number of possible alignments.

    Even if we consider only the most frequent n-grams, the space of possible alignments is big.

    When the distribution over alignments is uniform (the first iteration of EM), there are tricks which can be applied to compute the expectation step efficiently, but for later iterations we need to use approximations.

  • Training

    1. Determine high-frequency n-grams
    2. Initial iteration of EM with tricks
    3. Additional iterations of EM with approximations
    4. Generate conditional model probabilities

  • Initial iteration of EM

    Let C(n, k) be the number of all possible partitions of a sentence of n words into k non-empty concepts.

    Then there are k! * C(l, k) * C(m, k) alignments that can be built for a sentence pair (E, F) of lengths l and m using k concepts.

    The expectation for a phrase pair (concept) is the ratio between the number of alignments consistent with that phrase pair and the number of all possible alignments.

  • Initial iteration of EM

    For a phrase pair (e_i, f_i) spanning a words of E and b words of F (with |E| = l, |F| = m):

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)}

  • How to compute the number of ways in which n words can be partitioned into k concepts, C(n, k)?

    [figure: a sentence as word positions 1 2 3 ... n-1 n]

    In the paper they use the Stirling number of the second kind:

    S(n, k) = \frac{1}{k!} \sum_{i=0}^{k-1} (-1)^i \binom{k}{i} (k - i)^n

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)} = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, S(l-a, k) \, S(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, S(l, k) \, S(m, k)}
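The closed-form count above is easy to check numerically. The short sketch below implements S(n, k) from the formula on the slide and the fractional count p(e_i, f_i); the sentence and phrase lengths (l, m, a, b) are made-up toy values, not numbers from the paper.

```python
from math import comb, factorial

# Sketch of the closed-form fractional count used in the uniform first
# EM iteration, with the Stirling-number choice of C(n, k).
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty subsets."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k)) // factorial(k)

def frac_count(l, m, a, b, C):
    """p(e_i, f_i): alignments consistent with the pair / all alignments."""
    num = sum(factorial(k) * C(l - a, k) * C(m - b, k)
              for k in range(1, min(l - a, m - b) + 1))
    den = sum(factorial(k) * C(l, k) * C(m, k)
              for k in range(1, min(l, m) + 1))
    return num / den

print(frac_count(6, 7, 2, 3, stirling2))   # toy example: l=6, m=7, |e_i|=2, |f_i|=3
```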

  • How to compute the number of ways in which n words can be partitioned into k concepts, C(n, k)?

    [figure: word positions 1 2 3 ... n-1 n with k-1 boundaries chosen among the n-1 gaps between adjacent words]

    If concepts must be contiguous phrases, this is the number of ways to choose k-1 boundaries out of the n-1 gaps between words:

    C(n, k) = \binom{n-1}{k-1} = \frac{(n-1)!}{(n-k)! \, (k-1)!}

    p(e_i, f_i) = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, C(l-a, k) \, C(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k! \, C(l, k) \, C(m, k)} = \frac{\sum_{k=1}^{\min(l-a,\, m-b)} k! \, \binom{l-a-1}{k-1} \binom{m-b-1}{k-1}}{\sum_{k=1}^{\min(l,\, m)} k! \, \binom{l-1}{k-1} \binom{m-1}{k-1}}
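For comparison, here is the same fractional count with the contiguous-phrase count C(n, k) = C(n-1 choose k-1) proposed on this slide; the toy lengths are the same made-up values as in the previous sketch.

```python
from math import comb, factorial

# Same fractional count as in the previous sketch, but counting only
# contiguous partitions: C(n, k) = comb(n - 1, k - 1).
def frac_count(l, m, a, b, C):
    num = sum(factorial(k) * C(l - a, k) * C(m - b, k)
              for k in range(1, min(l - a, m - b) + 1))
    den = sum(factorial(k) * C(l, k) * C(m, k)
              for k in range(1, min(l, m) + 1))
    return num / den

contiguous = lambda n, k: comb(n - 1, k - 1)   # k-1 boundaries in n-1 gaps
print(frac_count(6, 7, 2, 3, contiguous))      # compare with the Stirling-based value
```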

  • Later EM iterations

    The authors say that, because full EM is impossible, they train on Viterbi derivations.

    This is not completely correct, since they only approximate the Viterbi alignment with a greedy search.

    [DeNero and Klein, 2008] proved that the problem of finding the Viterbi phrase alignment is NP-complete.

    Even if we could train on true Viterbi alignments, that procedure is not guaranteed to work.

  • Later EM iterations

    With the greedy search they first find the best complete alignment according to the t-probabilities alone (without distortion) and then hill-climb to better derivations by:

    breaking and merging concepts

    swapping words between concepts

    moving words across concepts

    All alignments found during hill-climbing are used for EM training (a rough sketch of the hill-climbing loop follows this slide).
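The sketch below illustrates the hill-climbing idea with only one of the operators from the slide (merging adjacent concepts); the phrase table is hypothetical and, for simplicity, the starting alignment is the word-by-word one rather than the greedily found best alignment.

```python
# Rough sketch of greedy hill-climbing over concept configurations, scored
# by t-probabilities alone. Only the "merge adjacent concepts" operator is
# implemented; the toy table t is hypothetical.
t = {
    ("the house", "das haus"): 0.4,
    ("the", "das"): 0.2, ("house", "haus"): 0.15,
    ("is small", "ist klein"): 0.3,
    ("is", "ist"): 0.1, ("small", "klein"): 0.1,
}

def score(alignment):
    p = 1.0
    for pair in alignment:
        p *= t.get(pair, 1e-6)     # tiny floor for unseen phrase pairs
    return p

def neighbours(alignment):
    # merge two adjacent concepts into one bigger concept
    for i in range(len(alignment) - 1):
        (e1, f1), (e2, f2) = alignment[i], alignment[i + 1]
        merged = (e1 + " " + e2, f1 + " " + f2)
        yield alignment[:i] + [merged] + alignment[i + 2:]
    # (breaking concepts, swapping and moving words would be added here)

def hill_climb(alignment):
    current, improved = alignment, True
    while improved:
        improved = False
        for n in neighbours(current):
            if score(n) > score(current):
                current, improved = n, True
    return current

start = [("the", "das"), ("house", "haus"), ("is", "ist"), ("small", "klein")]
print(hill_climb(start))   # merges into the two phrase-level concepts
```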

  • Making a conditional probability model

    The joint t(e_i, f_i) and d(pos(f_i), pos(e_i)) are used for training, but the conditional t(f_i | e_i) and d(pos(f_i) | pos(e_i)) are used for decoding.

    By making the probabilities conditional they become better suited for the task we are solving.
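A small sketch of that conversion for the phrase table: the joint t(e_i, f_i) is normalised over all f seen with a given e to obtain t(f_i | e_i); the numbers are toy values, not from the paper.

```python
from collections import defaultdict

# Turn a learned joint table t(e_i, f_i) into the conditional t(f_i | e_i)
# used for decoding, by normalising over all f seen with each e.
t_joint = {("house", "haus"): 0.06, ("house", "gebäude"): 0.02,
           ("small", "klein"): 0.05}

totals = defaultdict(float)
for (e, f), p in t_joint.items():
    totals[e] += p

t_cond = {(e, f): p / totals[e] for (e, f), p in t_joint.items()}
print(t_cond[("house", "haus")])   # 0.75
```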

  • Results

    Even though the authors made many approximations and applied a lot of black magic (hacking), they still get better translation results than models trained with IBM Model 4.

    Clearly, using bigger chunks in translation helps.

  • Possible improvements

    Using the number of combinations instead of the Stirling number of the second kind for the initial EM iteration

    Constraining the search using symmetrized IBM alignments [Birch et al., 2006] or using ITG constraints [Cherry and Lin, 2007]

    and many others

    With all the improvements that could be made, this method is still very slow on big corpora [Koehn, 2010] and does not give good results because it leads to over-fitting.

  • Thank you

    Questions?
