DP-based Search Algorithms for Statistical Machine Translation
Presenter: Mauricio Zuluaga. Based on Christoph Tillmann's presentation and on "Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation", C. Tillmann, H. Ney.
Computational Challenges in M.T.
Source sentence f (French); target sentence e (English). Bayes' rule:
Pr(e|f) = Pr(e) · Pr(f|e) / Pr(f)
Computational Challenges in M.T.
1. Estimating the language-model probability Pr(e) (L.M. problem; trigram)
2. Estimating the translation-model probability Pr(f|e) (translation problem)
3. Finding an efficient way to search for the English sentence that maximizes the product (search problem). We want to focus only on the most likely hypotheses during the search.
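The noisy-channel decision rule above can be sketched as a maximization over candidate English sentences; the probability tables below are purely hypothetical toy values for illustration:

```python
def decode(f, candidates, lm_prob, tm_prob):
    """Noisy-channel decoding: pick the e maximizing Pr(e) * Pr(f|e).

    Pr(f) in Bayes' rule is constant for a fixed input f, so it is
    dropped from the maximization.
    """
    return max(candidates, key=lambda e: lm_prob[e] * tm_prob[(f, e)])

# Toy, hypothetical probability tables for illustration only
lm_prob = {"the house": 0.06, "house the": 0.001}
tm_prob = {("la maison", "the house"): 0.5,
           ("la maison", "house the"): 0.5}
best = decode("la maison", ["the house", "house the"], lm_prob, tm_prob)
```

With equal translation probabilities, the language model alone breaks the tie in favor of the fluent word order.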
Approach based on Bayes' rule:
[Diagram: source-language text → transformation → global search → inverse transformation → target-language text]
Global search: maximize Pr(e_1^I) · Pr(f_1^J | e_1^I) over all target sentences e_1^I
- Language model: p(e_1^I)
- Translation model: p(f_1^J | e_1^I)
Model Details
Trigram language model. Translation model (simplified):
1. Lexicon probabilities: p(f|e)
2. Fertilities: Φ(n|e)
3. Class-based distortion probabilities: p(j | j', J)
"Here, j is the currently covered input sentence position and j' is the previously covered input sentence position. The input sentence length J is included, since we would like to think of the distortion probability as normalized according to J." [Tillmann]
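The three model components can be combined into a score for a fixed alignment. The sketch below is a hypothetical simplification: the tables `lex`, `fert`, and `dist` are toy stand-ins for p(f|e), Φ(n|e), and p(j | j', J), coverage is assumed monotonic, and the NULL word and normalization details of the IBM models are omitted.

```python
def model_score(f, e, align, lex, fert, dist):
    """Toy score for the simplified translation model: product of
    fertility, lexicon, and distortion probabilities for one alignment.

    f, e   -- source / target word lists
    align  -- align[j] = index i of the target word generating f[j]
    """
    J = len(f)
    score = 1.0
    # Fertility term: how many source words each target word generates
    for i, ew in enumerate(e):
        n = sum(1 for a in align if a == i)
        score *= fert[(n, ew)]
    # Lexicon and distortion terms, covering source positions left to right
    prev_j = -1  # previously covered source position (-1 = start marker)
    for j in range(J):
        score *= lex[(f[j], e[align[j]])] * dist[(j, prev_j, J)]
        prev_j = j
    return score

# Toy, hypothetical tables: one English word generating two French words
lex = {("f1", "e1"): 0.5, ("f2", "e1"): 0.25}
fert = {(2, "e1"): 0.4}
dist = {(0, -1, 2): 0.5, (1, 0, 2): 0.5}
score = model_score(["f1", "f2"], ["e1"], [0, 0], lex, fert, dist)
```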
Model Details (Model 4 vs. Model 3):
Same except in the handling of distortion probabilities. In Model 4 there are 2 separate distortion probabilities: one for the head of a tablet, and one for the remaining words of the tablet. The probability depends on the previous tablet and on the identity (class) of the French word being placed (e.g., adjectives appear before nouns in English but after them in French). "We expect d1(−1 | A(e), B(f)) to be larger than d1(+1 | A(e), B(f)) when e is an adjective and f is a noun. Indeed, this is borne out in the trained distortion probabilities for Model 4, where we find that d1(−1 | A(government's), B(développement)) is 0.7986, while d1(+1 | A(government's), B(développement)) is 0.0168." A and B are class functions of the English and French words (in this implementation |A| = |B| = 50 classes).
Decoder
Others have followed different approaches for decoders.
This is the part where we have to be efficient!
Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, C. Tillmann, H. Ney
DP-based beam search decoder for IBM-model 4 (this is the one described in the previous paper)
Example Alignment
[Figure: word alignment between German "In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen ." and English "In this case my colleague cannot visit you on the fourth of May ."]
Hidden Alignment:
p(f_1^J | e_1^I) = Σ over a_1^J of p(f_1^J, a_1^J | e_1^I)
Word-to-word alignment (source to target): j → a_j = i
Inverted Alignments
Inverted alignment (target to source): i → b_i = j
Coverage constraint: introduce a coverage vector C over the source positions.
Traveling Salesman Problem
- Problem: visit J cities; costs for transitions between cities; visit each city exactly once, minimizing overall costs
- Dynamic programming (Held & Karp, 1962)
- Cities correspond to source-sentence positions (words; coverage constraint)
- Costs: negative logarithm of the product of the translation, alignment, and language-model probabilities
Traveling Salesman Problem
DP with auxiliary quantity Q(C, j): shortest path from city 1 to city j, visiting all cities in C.

Q(C, j) = min over j' ∈ C\{j} of { Q(C\{j}, j') + d(j', j) }

Complexity using DP: J! → J^2 · 2^J

- The order in which the cities are visited is not important
- Only the costs of the best path reaching j have to be stored
- Remember: the minimum-edit-distance formulation was also a DP search problem
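The Held-Karp recursion can be sketched directly; states are pairs (C, j) of a visited set and a last city, which is exactly what reduces the J! orderings to J^2 · 2^J states (the example distance matrix is arbitrary):

```python
from itertools import combinations

def held_karp(dist):
    """Held-Karp DP for the TSP: dist[i][j] is the cost from city i
    to city j. Returns the cost of the cheapest tour that starts at
    city 0, visits every city exactly once, and returns to city 0.
    """
    n = len(dist)
    # Q[(C, j)] = cost of the best path from 0 to j covering exactly set C
    Q = {(frozenset([0]), 0): 0.0}
    for size in range(2, n + 1):          # cardinality-synchronous expansion
        for rest in combinations(range(1, n), size - 1):
            C = frozenset(rest) | {0}
            for j in rest:
                Q[(C, j)] = min(Q[(C - {j}, k)] + dist[k][j]
                                for k in C - {j} if (C - {j}, k) in Q)
    full = frozenset(range(n))
    # Close the tour by returning to city 0
    return min(Q[(full, j)] + dist[j][0] for j in range(1, n))

# Small asymmetric instance
cost = held_karp([[0, 1, 15, 6],
                  [2, 0, 7, 3],
                  [9, 6, 0, 12],
                  [10, 4, 8, 0]])
```

Here the cheapest tour is 0 → 1 → 3 → 2 → 0 with cost 21; only the best cost per (C, j) state is stored, never per ordering.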
[Figure: Held-Karp state expansion for J = 5 — states (C, j) grow from ({1}, 1) through subsets of increasing cardinality, e.g. ({1,2}, 2) and ({1,2,3}, 3), up to the final states ({1,2,3,4,5}, j).]
M.T. Recursion Equation

Q(e, C, j) = p(f_j | e) · max over e', j' ∈ C\{j} of { p(j | j', J) · p(e | e') · Q(e', C\{j}, j') }

Complexity: E^2 · J^2 · 2^J, where E is the size of the target-language vocabulary (still too large…)
Maximum approximation:

max over e_1^I of { p(e_1^I) · p(f_1^J | e_1^I) } ≈ max over e_1^I, b_1^I of { p(e_1^I) · p(f_1^J, b_1^I | e_1^I) }

*Q(e, C, j) is the probability of the best partial hypothesis (e_1, …, e_i; b_1, …, b_i) where C = {b_k | k = 1, …, i}, b_i = j, e_i = e, and e_{i-1} = e'.
DP-based Search Algorithm

Input: source string f_1 … f_j … f_J
Initialization
for each cardinality c = 1, 2, …, J do
  for each pair (C, j), where j ∈ C and |C| = c, do
    for each target word e ∈ E do
      Q(e, C, j) = p(f_j | e) · max over e', j' ∈ C\{j} of { p(j | j', J) · p(e | e') · Q(e', C\{j}, j') }
Trace back:
- Find the shortest tour
- Recover the optimal word sequence
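The triple loop above can be sketched directly. In this minimal sketch, `lex(fw, e)`, `lm(e, e_prev)`, and `dist(j, j_prev, J)` are hypothetical callables standing in for the lexicon, bigram language-model, and distortion probabilities; fertilities, the NULL word, and all pruning are omitted, so it is exponential in J just like the raw recursion.

```python
from itertools import combinations

def dp_decode(f, vocab, lex, lm, dist):
    """Cardinality-synchronous DP over states (e, C, j):
    Q(e, C, j) = p(f_j|e) * max over e', j' in C\{j} of
                 p(j|j', J) * p(e|e') * Q(e', C\{j}, j').
    Returns the score of the best full hypothesis."""
    J = len(f)
    Q = {}
    for j in range(J):                       # cardinality 1: start hypotheses
        for e in vocab:
            Q[(e, frozenset([j]), j)] = lex(f[j], e) * lm(e, None) * dist(j, None, J)
    for c in range(2, J + 1):                # expand by cardinality
        for subset in combinations(range(J), c):
            C = frozenset(subset)
            for j in subset:
                rest = C - {j}
                for e in vocab:
                    Q[(e, C, j)] = lex(f[j], e) * max(
                        dist(j, jp, J) * lm(e, ep) * Q[(ep, rest, jp)]
                        for jp in rest for ep in vocab)
    full = frozenset(range(J))
    return max(Q[(e, full, j)] for e in vocab for j in range(J))

# Toy, hypothetical model: uniform LM and distortion, peaked lexicon
best = dp_decode(
    ["a", "b"], ["x", "y"],
    lex=lambda fw, e: 0.9 if (fw, e) in {("a", "x"), ("b", "y")} else 0.1,
    lm=lambda e, ep: 0.5,
    dist=lambda j, jp, J: 0.5)
```

A real decoder would add the beam pruning described next; the recursion itself is unchanged.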
IBM-Style Re-ordering (S3)
Procedural restriction: select one of the first 4 empty positions (to extend the hypothesis).
Upper bound for word-reordering complexity: E^3 · J^4
Verb Group Re-ordering (GE)
[Figure: the same alignment example — German "In diesem Fall kann mein Kollege Sie am vierten Mai nicht besuchen ." and English "In this case my colleague cannot visit you on the fourth of May ." — with the verb group re-ordered.]
Mostly monotonic traversal from left to right.
Complexity: E^2 · J · (R^2 + L·R)
Beam Search Pruning
The search proceeds cardinality-synchronously over coverage vectors C with |C| = c. Three pruning types:
1. Coverage pruning
2. Cardinality pruning
3. Observation pruning (the number of target words produced by a source word f is limited)
Beam Search Pruning
4 kinds of thresholds:
• the coverage pruning threshold t_C
• the coverage histogram threshold n_C
• the cardinality pruning threshold t_c (looks only at the cardinality)
• the cardinality histogram threshold n_c (looks only at the cardinality)
Define new probabilities based on the uncovered positions (using only trigram and lexicon probabilities). Maintain only the hypotheses above the thresholds.
Beam Search Pruning
Compute the best score and apply the threshold:
1. For each coverage vector C
2. For each cardinality c: use histogram pruning
Observation pruning: for each source word f, select the best target word e by p(f|e) · p_uni(e)
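The threshold-plus-histogram pattern applied to one set of competing hypotheses can be sketched as below. Representing hypotheses as plain dict keys with a single probability score is a hypothetical simplification; `t` and `n` play the roles of the pruning and histogram thresholds above.

```python
def prune(hyps, t, n):
    """Beam-pruning sketch on one set of competing hypotheses.

    hyps maps a hypothesis to its score (a probability).
    Threshold pruning keeps hypotheses whose score is within a
    factor t of the best one; histogram pruning then keeps at
    most n hypotheses.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    survivors = {h: s for h, s in hyps.items() if s >= t * best}   # threshold
    top = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)                                               # histogram

kept = prune({"h1": 1.0, "h2": 0.5, "h3": 0.05, "h4": 0.4}, t=0.1, n=2)
```

With t = 0.1, hypothesis h3 falls below the threshold 0.1 · 1.0; histogram pruning then keeps only the two best survivors.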
German-English Verbmobil
German to English, IBM Model 4
Evaluation measures: m-WER and SSER
Training: 58K sentence pairs
Vocabulary: 8K (German), 5K (English)
Test-331 (held-out data; used to tune the scaling factors for the language and distortion models)
Test-147 (evaluation)
Effect of Coverage Pruning (by re-ordering restriction)

Re-ordering | −log(t_C) | CPU time [sec] | m-WER [%]
GE          | 0.01      | 0.21           | 73.5
GE          | 0.1       | 0.43           | 53.1
GE          | 1.0       | 1.43           | 30.3
GE          | 2.5       | 4.75           | 25.8
GE          | 5.0       | 29.6           | 24.6
GE          | 10.0      | 630            | 24.9
S3          | 0.01      | 5.48           | 70.0
S3          | 0.1       | 9.21           | 50.9
S3          | 1.0       | 46.2           | 31.6
S3          | 2.5       | 190            | 28.4
S3          | 5.0       | 830            | 28.3
TEST-147: Translation Results

Re-ordering          | CPU [sec] | m-WER [%] | SSER [%]
MON (no re-ordering) | 0.2       | 40.6      | 28.6
GE (verb group)      | 5.2       | 33.4      | 21.4
S3 (like IBM patent) | 13.7      | 34.2      | 20.3
References
- Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, C. Tillmann, H. Ney
- A DP-based Search Using Monotone Alignments in Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga
- The Mathematics of Statistical Machine Translation: Parameter Estimation, P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, R. L. Mercer
- Accelerated DP-based Search for Statistical Translation, C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf
- Word Re-ordering and DP-based Search in Statistical Machine Translation, H. Ney, C. Tillmann