DP-based Search Algorithms for Statistical Machine Translation
Presenter: Mauricio Zuluaga. Based on Christoph Tillmann's presentation and on "Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation", C. Tillmann, H. Ney.
Computational Challenges in M.T.
Source sentence f (French); target sentence e (English). Bayes' rule:
Pr(e|f) = Pr(e) · Pr(f|e) / Pr(f)
Computational Challenges in M.T.
1. Estimating the language-model probability Pr(e) (L.M. problem; trigram)
2. Estimating the translation-model probability Pr(f|e) (translation problem)
3. Finding an efficient way to search for the English sentence that maximizes the product (search problem). We want to focus only on the most likely hypotheses during the search.
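The noisy-channel decision rule above can be sketched as a maximization over candidate English sentences; the probability tables below are purely hypothetical toy values for illustration:

```python
def decode(f, candidates, lm_prob, tm_prob):
    """Noisy-channel decoding: pick the e maximizing Pr(e) * Pr(f|e).

    Pr(f) in Bayes' rule is constant for a fixed input f, so it is
    dropped from the maximization.
    """
    return max(candidates, key=lambda e: lm_prob[e] * tm_prob[(f, e)])

# Toy, hypothetical probability tables for illustration only
lm_prob = {"the house": 0.06, "house the": 0.001}
tm_prob = {("la maison", "the house"): 0.5,
           ("la maison", "house the"): 0.5}
best = decode("la maison", ["the house", "house the"], lm_prob, tm_prob)
```

With equal translation probabilities, the language model alone breaks the tie in favor of the fluent word order.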
Approach based on Bayes' rule:
[Diagram: source-language text → transformation → global search → inverse transformation → target-language text]
Global search: maximize Pr(e_1^I) · Pr(f_1^J | e_1^I) over all target sentences e_1^I
- Language model: p(e_1^I)
- Translation model: p(f_1^J | e_1^I)
Model Details
Trigram language model. Translation model (simplified):
1. Lexicon probabilities: p(f|e)
2. Fertilities: Φ(n|e)
3. Class-based distortion probabilities: p(j | j', J)
"Here, j is the currently covered input sentence position and j' is the previously covered input sentence position. The input sentence length J is included, since we would like to think of the distortion probability as normalized according to J." [Tillmann]
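The three model components can be combined into a score for a fixed alignment. The sketch below is a hypothetical simplification: the tables `lex`, `fert`, and `dist` are toy stand-ins for p(f|e), Φ(n|e), and p(j | j', J), coverage is assumed monotonic, and the NULL word and normalization details of the IBM models are omitted.

```python
def model_score(f, e, align, lex, fert, dist):
    """Toy score for the simplified translation model: product of
    fertility, lexicon, and distortion probabilities for one alignment.

    f, e   -- source / target word lists
    align  -- align[j] = index i of the target word generating f[j]
    """
    J = len(f)
    score = 1.0
    # Fertility term: how many source words each target word generates
    for i, ew in enumerate(e):
        n = sum(1 for a in align if a == i)
        score *= fert[(n, ew)]
    # Lexicon and distortion terms, covering source positions left to right
    prev_j = -1  # previously covered source position (-1 = start marker)
    for j in range(J):
        score *= lex[(f[j], e[align[j]])] * dist[(j, prev_j, J)]
        prev_j = j
    return score

# Toy, hypothetical tables: one English word generating two French words
lex = {("f1", "e1"): 0.5, ("f2", "e1"): 0.25}
fert = {(2, "e1"): 0.4}
dist = {(0, -1, 2): 0.5, (1, 0, 2): 0.5}
score = model_score(["f1", "f2"], ["e1"], [0, 0], lex, fert, dist)
```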
Model Details (Model 4 vs. Model 3):
Same except in the handling of distortion probabilities. In Model 4 there are 2 separate distortion probabilities: one for the head of a tablet, and one for the remaining words of the tablet. The probability depends on the previous tablet and on the identity (class) of the French word being placed (e.g., adjectives appear before nouns in English but after them in French). "We expect d1(−1 | A(e), B(f)) to be larger than d1(+1 | A(e), B(f)) when e is an adjective and f is a noun. Indeed, this is borne out in the trained distortion probabilities for Model 4, where we find that d1(−1 | A(government's), B(développement)) is 0.7986, while d1(+1 | A(government's), B(développement)) is 0.0168." A and B are class functions of the English and French words (in this implementation |A| = |B| = 50 classes).
Decoder
Others have followed different approaches for decoders.
This is the part where we have to be efficient!
Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, C. Tillmann, H. Ney
DP-based beam search decoder for IBM-model 4 (this is the one described in the previous paper)
Example Alignment
[Figure: word alignment between German "In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen ." and English "In this case my colleague cannot visit you on the fourth of May ."]
Hidden Alignment:
p(f_1^J | e_1^I) = Σ over a_1^J of p(f_1^J, a_1^J | e_1^I)
Word-to-word alignment (source to target): j → a_j = i
Inverted Alignments
Inverted alignment (target to source): i → b_i = j
Coverage constraint: introduce a coverage vector C over the source positions.
Traveling Salesman Problem
- Problem: visit J cities; costs for transitions between cities; visit each city exactly once, minimizing overall costs
- Dynamic programming (Held & Karp, 1962)
- Cities correspond to source-sentence positions (words; coverage constraint)
- Costs: negative logarithm of the product of the translation, alignment, and language-model probabilities
Traveling Salesman Problem
DP with auxiliary quantity Q(C, j): shortest path from city 1 to city j, visiting all cities in C.

Q(C, j) = min over j' ∈ C\{j} of { Q(C\{j}, j') + d(j', j) }

Complexity using DP: J! → J^2 · 2^J

- The order in which the cities are visited is not important
- Only the costs of the best path reaching j have to be stored
- Remember: the minimum-edit-distance formulation was also a DP search problem
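The Held-Karp recursion can be sketched directly; states are pairs (C, j) of a visited set and a last city, which is exactly what reduces the J! orderings to J^2 · 2^J states (the example distance matrix is arbitrary):

```python
from itertools import combinations

def held_karp(dist):
    """Held-Karp DP for the TSP: dist[i][j] is the cost from city i
    to city j. Returns the cost of the cheapest tour that starts at
    city 0, visits every city exactly once, and returns to city 0.
    """
    n = len(dist)
    # Q[(C, j)] = cost of the best path from 0 to j covering exactly set C
    Q = {(frozenset([0]), 0): 0.0}
    for size in range(2, n + 1):          # cardinality-synchronous expansion
        for rest in combinations(range(1, n), size - 1):
            C = frozenset(rest) | {0}
            for j in rest:
                Q[(C, j)] = min(Q[(C - {j}, k)] + dist[k][j]
                                for k in C - {j} if (C - {j}, k) in Q)
    full = frozenset(range(n))
    # Close the tour by returning to city 0
    return min(Q[(full, j)] + dist[j][0] for j in range(1, n))

# Small asymmetric instance
cost = held_karp([[0, 1, 15, 6],
                  [2, 0, 7, 3],
                  [9, 6, 0, 12],
                  [10, 4, 8, 0]])
```

Here the cheapest tour is 0 → 1 → 3 → 2 → 0 with cost 21; only the best cost per (C, j) state is stored, never per ordering.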
[Figure: Held-Karp state expansion for J = 5 — states (C, j) grow from ({1}, 1) through subsets of increasing cardinality, e.g. ({1,2}, 2) and ({1,2,3}, 3), up to the final states ({1,2,3,4,5}, j).]
M.T. Recursion Equation

Q(e, C, j) = p(f_j | e) · max over e', j' ∈ C\{j} of { p(j | j', J) · p(e | e') · Q(e', C\{j}, j') }

Complexity: E^2 · J^2 · 2^J, where E is the size of the target-language vocabulary (still too large…)
Maximum approximation:

max over e_1^I of { p(e_1^I) · p(f_1^J | e_1^I) } ≈ max over e_1^I, b_1^I of { p(e_1^I) · p(f_1^J, b_1^I | e_1^I) }

*Q(e, C, j) is the probability of the best partial hypothesis (e_1, …, e_i; b_1, …, b_i) where C = {b_k | k = 1, …, i}, b_i = j, e_i = e, and e_{i-1} = e'.
DP-based Search Algorithm

Input: source string f_1 … f_j … f_J
Initialization
for each cardinality c = 1, 2, …, J do
  for each pair (C, j), where j ∈ C and |C| = c, do
    for each target word e ∈ E do
      Q(e, C, j) = p(f_j | e) · max over e', j' ∈ C\{j} of { p(j | j', J) · p(e | e') · Q(e', C\{j}, j') }
Trace back:
- Find the shortest tour
- Recover the optimal word sequence
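The triple loop above can be sketched directly. In this minimal sketch, `lex(fw, e)`, `lm(e, e_prev)`, and `dist(j, j_prev, J)` are hypothetical callables standing in for the lexicon, bigram language-model, and distortion probabilities; fertilities, the NULL word, and all pruning are omitted, so it is exponential in J just like the raw recursion.

```python
from itertools import combinations

def dp_decode(f, vocab, lex, lm, dist):
    """Cardinality-synchronous DP over states (e, C, j):
    Q(e, C, j) = p(f_j|e) * max over e', j' in C\{j} of
                 p(j|j', J) * p(e|e') * Q(e', C\{j}, j').
    Returns the score of the best full hypothesis."""
    J = len(f)
    Q = {}
    for j in range(J):                       # cardinality 1: start hypotheses
        for e in vocab:
            Q[(e, frozenset([j]), j)] = lex(f[j], e) * lm(e, None) * dist(j, None, J)
    for c in range(2, J + 1):                # expand by cardinality
        for subset in combinations(range(J), c):
            C = frozenset(subset)
            for j in subset:
                rest = C - {j}
                for e in vocab:
                    Q[(e, C, j)] = lex(f[j], e) * max(
                        dist(j, jp, J) * lm(e, ep) * Q[(ep, rest, jp)]
                        for jp in rest for ep in vocab)
    full = frozenset(range(J))
    return max(Q[(e, full, j)] for e in vocab for j in range(J))

# Toy, hypothetical model: uniform LM and distortion, peaked lexicon
best = dp_decode(
    ["a", "b"], ["x", "y"],
    lex=lambda fw, e: 0.9 if (fw, e) in {("a", "x"), ("b", "y")} else 0.1,
    lm=lambda e, ep: 0.5,
    dist=lambda j, jp, J: 0.5)
```

A real decoder would add the beam pruning described next; the recursion itself is unchanged.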
IBM-Style Re-ordering (S3)
Procedural restriction: select one of the first 4 empty positions (to extend the hypothesis).
Upper bound for word-reordering complexity: E^3 · J^4
Verb Group Re-ordering (GE)
[Figure: the same alignment example — German "In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen ." and English "In this case my colleague cannot visit you on the fourth of May ." — with the verb group re-ordered.]
Mostly monotonic traversal from left to right.
Complexity: E^2 · J · (R^2 + L·R)
Beam Search Pruning
The search proceeds cardinality-synchronously over coverage vectors C with |C| = c. Three pruning types:
1. Coverage pruning
2. Cardinality pruning
3. Observation pruning (the number of target words produced by a source word f is limited)
Beam Search Pruning
4 kinds of thresholds:
• the coverage pruning threshold t_C
• the coverage histogram threshold n_C
• the cardinality pruning threshold t_c (looks only at the cardinality)
• the cardinality histogram threshold n_c (looks only at the cardinality)
Define new probabilities based on the uncovered positions (using only trigram and lexicon probabilities). Maintain only the hypotheses above the thresholds.
Beam Search Pruning
Compute the best score and apply the threshold:
1. For each coverage vector C
2. For each cardinality c: use histogram pruning
Observation pruning: for each source word f, select the best target word e by p(f|e) · p_uni(e)
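The threshold-plus-histogram pattern applied to one set of competing hypotheses can be sketched as below. Representing hypotheses as plain dict keys with a single probability score is a hypothetical simplification; `t` and `n` play the roles of the pruning and histogram thresholds above.

```python
def prune(hyps, t, n):
    """Beam-pruning sketch on one set of competing hypotheses.

    hyps maps a hypothesis to its score (a probability).
    Threshold pruning keeps hypotheses whose score is within a
    factor t of the best one; histogram pruning then keeps at
    most n hypotheses.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    survivors = {h: s for h, s in hyps.items() if s >= t * best}   # threshold
    top = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)                                               # histogram

kept = prune({"h1": 1.0, "h2": 0.5, "h3": 0.05, "h4": 0.4}, t=0.1, n=2)
```

With t = 0.1, hypothesis h3 falls below the threshold 0.1 · 1.0; histogram pruning then keeps only the two best survivors.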
German-English Verbmobil
German to English, IBM Model 4
Evaluation measures: m-WER and SSER
Training: 58K sentence pairs
Vocabulary: 8K (German), 5K (English)
Test-331 (held-out data; used to tune the scaling factors for the language and distortion models)
Test-147 (evaluation)
Effect of Coverage Pruning (by re-ordering restriction)

Re-ordering | −log(t_C) | CPU time [sec] | m-WER [%]
GE          | 0.01      | 0.21           | 73.5
GE          | 0.1       | 0.43           | 53.1
GE          | 1.0       | 1.43           | 30.3
GE          | 2.5       | 4.75           | 25.8
GE          | 5.0       | 29.6           | 24.6
GE          | 10.0      | 630            | 24.9
S3          | 0.01      | 5.48           | 70.0
S3          | 0.1       | 9.21           | 50.9
S3          | 1.0       | 46.2           | 31.6
S3          | 2.5       | 190            | 28.4
S3          | 5.0       | 830            | 28.3
TEST-147: Translation Results

Re-ordering          | CPU [sec] | m-WER [%] | SSER [%]
MON (no re-ordering) | 0.2       | 40.6      | 28.6
GE (verb group)      | 5.2       | 33.4      | 21.4
S3 (like IBM patent) | 13.7      | 34.2      | 20.3
References
- Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, C. Tillmann, H. Ney
- A DP-based Search Using Monotone Alignments in Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga
- The Mathematics of Statistical Machine Translation: Parameter Estimation, P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, R. L. Mercer
- Accelerated DP-based Search for Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf
- Word Re-ordering and DP-based Search in Statistical Machine Translation, H. Ney, C. Tillmann