Direct MT, Example-based MT, Statistical MT
Issues in Machine Translation
• Orthography
– Writing from left-to-right vs. right-to-left
– Character sets (alphabetic, logograms, pictograms)
– Segmentation into word/word-like units
• Morphology
• Lexical: word senses
– bank: "river bank" vs. "financial institution"
• Syntactic: word order
– subject-verb-object vs. subject-object-verb
• Semantic: meaning
– "ate pasta with a spoon", "ate pasta with marinara", "ate pasta with John"
• Pragmatic: world knowledge
– "Can you pass me the salt?"
• Social: conversational norms
– pronoun usage depends on the conversational partner
• Cultural: idioms and phrases
– "out of the ballpark", "came from leftfield"
• Contextual
• In addition, for speech translation:
– Prosody: JOHN eats bananas; John EATS bananas; John eats BANANAS
– Pronunciation differences
– Speech recognition errors
• In a multilingual environment:
– Code switching: use of the linguistic apparatus of one language to express ideas in another language
MT Approaches: Different levels of meaning transfer
[Figure: the MT pyramid. Direct MT at the bottom, Transfer-based MT in the middle, Interlingua at the top. Moving from source to target, depth of analysis increases: parsing into a source syntactic structure, semantic interpretation, then semantic generation and syntactic generation from the target syntactic structure.]
Spanish : ajá quiero usar mi tarjeta de crédito
English : yeah I wanna use my credit card
Alignment : 1 3 4 5 7 0 6
Direct Machine Translation
• Words are replaced using a dictionary
– Some amount of morphological processing
• Word reordering is limited
• Quality depends on the size of the dictionary and the closeness of the languages
English : I need to make a collect call
Japanese : 私は コレクト コールを かける 必要があります
Alignment : 1 5 0 3 0 2 4
Translation Memory
• Idea is to reuse translations that were done in the past
– Useful for technical terminology
– Ideally used in a sub-language translation
• System helps in matching new instances against previously translated instances
• Choices are presented to a human translator through a GUI
• The human translator selects and "stitches" the available options to cover the source-language sentence
• If no match is found, the translator introduces a new translation pair into the translation memory
• Pros:
– Maintains consistency in translation across multiple translators
– Improves efficiency of the translation process
• Issues: how is the matching done?
– Word-level matching, morphological root matching
– Determines the robustness of the translation memory
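A minimal sketch of the matching step, assuming word-level edit distance over the stored source sentences (the memory entries in the usage example are invented for illustration):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(a)][len(b)]

def best_match(memory, query):
    """Return the stored (source, target) pair whose source is closest to the query."""
    return min(memory, key=lambda pair: edit_distance(pair[0].split(), query.split()))
```

A morphological-root variant would simply stem both token lists before comparison; the choice of matching level determines the robustness of the memory, as noted above.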
Example-based MT
Translation-by-analogy requires:
a. A collection of source/target text pairs
b. A matching metric
c. A word- or phrase-level alignment
d. A method for recombination
ATR EBMT system (E. Sumita, H. Iida, 1991); CMU Pangloss EBMT (R. Brown, 1996)
[Figure: the EBMT pyramid. MATCHING (analysis) of the source, ALIGNMENT (transfer), and RECOMBINATION (generation) of the target; an exact match amounts to direct translation.]
Example run of EBMT
English-Japanese examples in the corpus:
1. He buys a notebook → Kare wa noto o kau
2. I read a book on international politics → Watashi wa kokusai seiji nitsuite kakareta hon o yomu
Translation input: He buys a book on international politics
Translation output: Kare wa kokusai seiji nitsuite kakareta hon o kau
• Challenge: finding a good matching metric
– He bought a notebook
– A book was bought
– I read a book on world politics
Variations in EBMT
• Database of sentence-aligned corpus
• Analysis of the SL
– Depends on how the database is stored
– Full sentences, sentence fragments, tree fragments
• Matching metric: the idea is to arrive at semantic closeness
– Exact match
– N-gram match
– Fuzzy match
– Similarity-based match
– Matching with variables
• Regeneration of the TL
– Depends on how the database produces the output
Issues in EBMT
• Parallel corpora
• Granularity of examples
• Size of the example base
– Does accuracy improve by growing the example base?
• Suitability of examples
– Diversity and consistency of examples
– Contradictory examples
– Exceptional examples
(a) Watashi wa komputa o kyoyosuru → I share the use of a computer
(b) Watashi wa kuruma o tsukau → I use a car
(c) Watashi wa dentaku o shiyosuru → I share the use of a calculator / I use a calculator
Issues in EBMT
• How are examples stored?
– Context-based examples
• "OK" depends on dialog context:
– "wakarimashita (I understand)"
– "iidesu yo (I agree)"
– or "ijo desu (let's change the subject)"
– Annotated tree structures
• E.g. Kanojo wa kami ga nagai (She has long hair)
• Trees with linking nodes
– Multi-level lattices with typographic, orthographic, lexical, syntactic and other information
• POS information, predicate-argument structure, chunks, dependency trees
– Generalized examples
• Dates, names, cities, gender, number, tense are replaced by generalized tokens
• Precision-recall tradeoff
• A continuum from plain strings to context-sensitive rules
Issues in EBMT
String-based:
• Sochira ni okeru → We will send it to you
• Sochira wa jimukyoku desu → This is the office
Generalized string:
• X o onegai shimasu → may I speak to the X
• X o onegai shimasu → please give me the X
Template format:
• N1 N2 N3 → N2' N3' for N1' (N1 = sanka "participation", N2 = moshikomi "application", N3 = yoshi "form")
Distance in a thesaurus is used to select the method.
Issues in EBMT
• Matching:
– Metric used to measure the similarity of the SL input to the SL side of the example database
– Exact character-based matching
– Edit-distance-based matching
– Word-based matching
• Thesaurus/WordNet-based similarity:
– A man eats vegetables → Hito wa yasai o taberu
– Acid eats metal → San wa kinzoku o okasu
– He eats potatoes → Kare wa jagaimo o taberu
– Sulphuric acid eats iron → Ryusan wa tetsu o okasu
– Thesaurus-free similarity matching based on distributional clustering
– Annotated word-based matching
• POS-based matching
• Relaxation techniques
– Exact match allowing deletions and insertions, word-order differences, morphological variants, POS differences
Matching in EBMT (contd.)
• Structure-based matching
– Tree-based edit distance
– Case-frame-based matching
• Partial matching
– The entire input need not match the example database
– Chunks, substrings, fragments can match
– Assembling the TL output is more challenging
Adaptability and Recombination in EBMT
Problem:
a. Identify which portion of the associated translation corresponds to the matched portion of the source text (adaptability)
b. Recombine the portions in an appropriate manner
Alignment can be done using statistical techniques or bilingual dictionaries.
Boundary friction problem: for English-Japanese, translations of noun phrases can be reused regardless of whether they are subjects or objects:
The handsome boy entered the room
The handsome boy ate his breakfast
I saw the handsome boy
Not in German, where case marking changes the noun phrase:
Der schöne Junge aß sein Frühstück
Ich sah den schönen Jungen
Adaptability
Example retrieval can be scored on two counts: (a) the closeness of the match between the input text and the example, and (b) the adaptability of the example, based on the relationship between the representations of the example and its translation.
Input: Use the Offset Command to increase the spacing between the shapes.
Matching examples:
a. Use the Offset Command to specify the spacing between the shapes.
b. Mit der Option Abstand legen Sie den Abstand zwischen den Formen fest.
a. Use the Save Option to save your changes to disk.
b. Mit der Option Speichern können Sie Ihre Änderungen auf Diskette speichern.
Recombination options are ranked using an n-gram model:
a. Ich sah den schönen Jungen.
b. * Ich sah der schöne Junge.
Flavors of EBMT
• EBMT is used as a component in an MT system which also has more traditional elements
• EBMT may be used
– in parallel with these other "engines",
– or just for certain classes of problems,
– or when some other component cannot deliver a result
• EBMT may be better suited to some kinds of applications than others
• The dividing line between EBMT and so-called "traditional" rule-based approaches may not be obvious
When to apply EBMT
When one of the following conditions holds true for a linguistic phenomenon, [rule-based] MT is less suitable than EBMT:
(a) Translation rule formation is difficult.
(b) The general rule cannot accurately describe [the] phenomen[on] because it represents a special case.
(c) Translation cannot be made in a compositional way from target words.
Learning translation patterns
Kare wa kuruma o kuji de ateru.
HE-topic CAR-obj LOTTERY-inst STRIKES
(a) Lit. 'He strikes a car with the lottery.'
(b) He wins a car as a prize in the lottery.
Learn a pattern (c) that corrects (a) to be like (b).
Generation of Translation Templates
• "Two-phase" EBMT methodology: "learning" of templates (i.e. transfer rules) from a corpus
• Parse the translation pairs; align the syntactic units with the help of a bilingual dictionary
• Generalize by replacing the coupled units with variables marked for syntactic category:
a. X[NP] no nagasa wa saidai 512 baito de aru. → The maximum length of X[NP] is 512 bytes.
b. X[NP] no nagasa wa saidai Y[N] baito de aru. → The maximum length of X[NP] is Y[N] bytes.
• Any coupled unit pair can be replaced by variables. Refine templates which give rise to a conflict:
a. play baseball → yakyu o suru
b. play tennis → tenisu o suru
c. play X[NP] → X[NP] o suru
a. play the piano → piano o hiku
b. play the violin → baiorin o hiku
c. play X[NP] → X[NP] o hiku
• Templates are "refined" by the addition of "semantic categories":
a. play X[NP/sport] → X[NP] o suru
b. play X[NP/instrument] → X[NP] o hiku
• There are also automatic generalization techniques that work directly from paired strings
Statistical Machine Translation
Can all the steps of the EBMT technique be induced from a parallel corpus?
What are the parameters of such a model?
What are the components of SMT?
Slides adapted from Dorr and Monz, Knight, Schafer and Smith
Word-Level Alignments
Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:
Where do we get the sentence pairs from?
Parallel Resources
Newswire: DE-News (German-English), Hong Kong News, Xinhua News (Chinese-English)
Government: Canadian Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, …)
Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
Web pages: STRAND project (Philip Resnik)
Sentence Alignment
If document De is a translation of document Df, how do we find the translation of each sentence?
The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df.
In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments.
Approximately 90% of sentence alignments are 1:1.
Sentence Alignment (cont'd)
There are several sentence alignment algorithms:
• Align (Gale & Church): aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well.
• Char-align (Church): aligns based on shared character sequences. Works fine for similar languages or technical domains.
• K-Vec (Fung & Church): induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
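A simplified sketch of length-based alignment in the spirit of Gale & Church: a dynamic program over sentence "beads" (1:1, 1:0, 0:1, 2:1, 1:2), scoring each bead by character-length mismatch. The cost function and bead penalties here are toy stand-ins for their Gaussian length model:

```python
def align_sentences(src, tgt):
    # (n_src, n_tgt, penalty): 1:1 beads are cheapest, matching the ~90% figure above.
    BEADS = [(1, 1, 0), (1, 0, 5), (0, 1, 5), (2, 1, 2), (1, 2, 2)]
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, pen in BEADS:
                if i + di <= n and j + dj <= m:
                    ls = sum(len(s) for s in src[i:i + di])
                    lt = sum(len(t) for t in tgt[j:j + dj])
                    c = cost[i][j] + pen + abs(ls - lt)  # toy length-mismatch cost
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    # Backtrace the cheapest bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((tuple(range(i - di, i)), tuple(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]
```

Each returned bead pairs a tuple of source-sentence indices with a tuple of target-sentence indices; an empty tuple on one side represents a 1:0 or 0:1 alignment.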
Computing Translation Probabilities
Given a parallel corpus we can estimate P(e | f).
The maximum likelihood estimate of P(e | f) is: freq(e,f)/freq(f)
Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!
P(e | f) could be re-defined as:
  P(e | f) ≈ Π_j max_i P(e_i | f_j)
Problem: the English words maximizing P(e | f) might not result in a readable sentence.
Decoding
The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation:
  argmax_e P(e | f) = argmax_e P(f | e) P(e)
The choice of word e' as a translation of f' depends on the translation probability P(f' | e') and on the context, i.e. the other English words preceding e'.
Noisy Channel Model for Translation
Translation Modeling
Determines the probability that the foreign word f is a translation of the English word e.
How do we compute P(f | e) from a parallel corpus?
Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another.
Finding Translations in a Parallel Corpus
Into which foreign words f, …, f' does e translate?
Commonly, four factors are used:
• How often do e and f co-occur? (translation)
• How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language.
• How likely is e to translate into more than one word? (fertility) For example: "defeated" can translate into "eine Niederlage erleiden".
• How likely is a foreign word to be spuriously generated? (null translation)
Translation Model?
  Mary did not slap the green witch
  Maria no dió una bofetada a la bruja verde
Generative approach: source-language morphological analysis → source parse tree → semantic representation → generate target structure.
Generative story: what are all the possible moves and their associated probability tables?
The Classic Translation Model
Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]
Generative story:
  Mary did not slap the green witch
  → fertility, n(3 | slap):  Mary not slap slap slap the green witch
  → NULL insertion, P-Null:  Mary not slap slap slap NULL the green witch
  → translation, t(la | the):  Maria no dió una bofetada a la verde bruja
  → distortion, d(j | i):  Maria no dió una bofetada a la bruja verde
Probabilities can be learned from raw bilingual text.
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
• Initially: all word alignments equally likely; all P(french-word | english-word) equally likely.
• "la" and "the" are observed to co-occur frequently, so P(la | the) is increased.
• "house" co-occurs with both "la" and "maison", but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of "the" (pigeonhole principle).
• The estimates settle down after further iterations.
• Inherent hidden structure is revealed by EM training!
For details, see:
• "A Statistical MT Tutorial Workbook" (Knight, 1999)
• "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
• Software: GIZA++
For a new French sentence, the trained translation probabilities, e.g.
  P(juste | fair) = 0.411, P(juste | correct) = 0.027, P(juste | right) = 0.020, …
propose possible English translations, to be rescored by the language model.
IBM Models 1–5
Model 1: bag of words
• Unique local maximum
• Efficient EM algorithm (Models 1–2)
Model 2: general alignment: a(e_pos | f_pos, e_length, f_length)
Model 3: fertility: n(k | e)
• No full EM; count only neighbors (Models 3–5)
• Deficient (Models 3–4)
Model 4: relative distortion, word classes
Model 5: extra variables to avoid deficiency
IBM Model 1
Given an English sentence e1 … el and a foreign sentence f1 … fm, we want to find the 'best' alignment a, where a is a set of pairs of the form {(i, j), …, (i', j')}, 0 <= i, i' <= l and 1 <= j, j' <= m.
Note that if (i, j) and (i', j) are both in a, then i = i', i.e. no many-to-one alignments are allowed.
Note that we add a spurious NULL word to the English sentence at position 0.
In total there are (l + 1)^m different alignments A.
Allowing many-to-many alignments would result in 2^(lm) possible alignments.
IBM Model 1
The simplest of the IBM models.
Does not consider word order (bag-of-words approach).
Does not model one-to-many alignments.
Computationally inexpensive.
Useful for parameter estimates that are passed on to more elaborate models.
IBM Model 1
Translation probability in terms of alignments:
  P(f | e) = Σ_{a ∈ A} P(f, a | e)
where:
  P(f, a | e) = P(a | e) P(f | a, e) = 1/(l+1)^m Π_{j=1}^{m} P(f_j | e_{a_j})
and therefore:
  P(f | e) = 1/(l+1)^m Σ_{a ∈ A} Π_{j=1}^{m} P(f_j | e_{a_j})
IBM Model 1
We want to find the most likely alignment:
  argmax_{a ∈ A} 1/(l+1)^m Π_{j=1}^{m} P(f_j | e_{a_j})
Since P(a | e) is the same for all a, this equals:
  argmax_{a ∈ A} Π_{j=1}^{m} P(f_j | e_{a_j})
Problem: we still have to enumerate all alignments.

IBM Model 1
Since P(f_j | e_i) is independent of P(f_j' | e_i'), we can find the maximum alignment by looking at the individual translation probabilities only.
Let a = (a_1, …, a_m); then for each a_j:
  a_j = argmax_{0 <= i <= l} P(f_j | e_i)
The best alignment can be computed in a quadratic number of steps: (l + 1) × m.
Computing Model 1 Parameters
How do we compute translation probabilities for Model 1 from a parallel corpus?
Step 1: Determine candidates. For each English word e, collect all foreign words f that co-occur at least once with e.
Step 2: Initialize P(f | e) uniformly, i.e. P(f | e) = 1/(number of co-occurring foreign words).
Computing Model 1 Parameters
Step 3: Iteratively refine translation probabilities:

for n iterations:
  set tc to zero
  for each sentence pair (e, f) of lengths (l, m):
    for j = 1 to m:
      total = 0
      for i = 0 to l:
        total += P(f_j | e_i)
      for i = 0 to l:
        tc(f_j | e_i) += P(f_j | e_i) / total
  for each word e:
    total = 0
    for each word f s.t. tc(f | e) is defined:
      total += tc(f | e)
    for each word f s.t. tc(f | e) is defined:
      P(f | e) = tc(f | e) / total
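The pseudocode above can be turned into a runnable sketch (my own Python rendering, not an official implementation; candidates come from co-occurrence, a NULL word is added to each English sentence, and each iteration redistributes the fractional counts tc(f | e)):

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """corpus: list of (english_sentence, foreign_sentence) string pairs."""
    # Step 1+2: collect co-occurrence candidates, initialise uniformly.
    cooc = defaultdict(set)
    for e_sent, f_sent in corpus:
        for e in ["NULL"] + e_sent.split():
            cooc[e].update(f_sent.split())
    p = {e: {f: 1.0 / len(fs) for f in fs} for e, fs in cooc.items()}
    # Step 3: iterative refinement (EM).
    for _ in range(iterations):
        tc = defaultdict(lambda: defaultdict(float))
        for e_sent, f_sent in corpus:
            e_words = ["NULL"] + e_sent.split()
            for f in f_sent.split():
                total = sum(p[e][f] for e in e_words)
                for e in e_words:
                    tc[e][f] += p[e][f] / total   # fractional count
        for e in tc:                               # renormalise per English word
            norm = sum(tc[e].values())
            for f in tc[e]:
                p[e][f] = tc[e][f] / norm
    return p
```

On the toy corpus of the following example, one iteration already yields P(le | the) = 0.5, matching the worked numbers on the next slides (exact values at later iterations depend on how non-co-occurring pairs are initialised).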
IBM Model 1 Example
Parallel 'corpus':
the dog :: le chien
the cat :: le chat
Step 1+2 (collect candidates and initialize uniformly):
P(le | the) = P(chien | the) = P(chat | the) = 1/3
P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3
IBM Model 1 Example
Step 3: Iterate
NULL the dog :: le chien
• j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 1
tc(le | NULL) += P(le | NULL)/1 = 0 += .333/1 = 0.333
tc(le | the) += P(le | the)/1 = 0 += .333/1 = 0.333
tc(le | dog) += P(le | dog)/1 = 0 += .333/1 = 0.333
• j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .333/1 = 0.333
tc(chien | the) += P(chien | the)/1 = 0 += .333/1 = 0.333
tc(chien | dog) += P(chien | dog)/1 = 0 += .333/1 = 0.333
IBM Model 1 Example
NULL the cat :: le chat
• j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 1
tc(le | NULL) += P(le | NULL)/1 = 0.333 += .333/1 = 0.666
tc(le | the) += P(le | the)/1 = 0.333 += .333/1 = 0.666
tc(le | cat) += P(le | cat)/1 = 0 += .333/1 = 0.333
• j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .333/1 = 0.333
tc(chat | the) += P(chat | the)/1 = 0 += .333/1 = 0.333
tc(chat | cat) += P(chat | cat)/1 = 0 += .333/1 = 0.333
IBM Model 1 Example
Re-compute translation probabilities:
• total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
  P(le | the) = tc(le | the)/total(the) = 0.666 / 1.333 = 0.5
  P(chien | the) = tc(chien | the)/total(the) = 0.333 / 1.333 ≈ 0.25
  P(chat | the) = tc(chat | the)/total(the) = 0.333 / 1.333 ≈ 0.25
• total(dog) = tc(le | dog) + tc(chien | dog) = 0.666
  P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.666 = 0.5
  P(chien | dog) = tc(chien | dog)/total(dog) = 0.333 / 0.666 = 0.5
IBM Model 1 Example
Iteration 2:
NULL the dog :: le chien
• j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0 += .5/1.5 = 0.333
tc(le | the) += P(le | the)/1.5 = 0 += .5/1.5 = 0.333
tc(le | dog) += P(le | dog)/1.5 = 0 += .5/1.5 = 0.333
• j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
tc(chien | NULL) += P(chien | NULL)/1 = 0 += .25/1 = 0.25
tc(chien | the) += P(chien | the)/1 = 0 += .25/1 = 0.25
tc(chien | dog) += P(chien | dog)/1 = 0 += .5/1 = 0.5
IBM Model 1 Example
NULL the cat :: le chat
• j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/1.5 = 0.333 += .5/1.5 = 0.666
tc(le | the) += P(le | the)/1.5 = 0.333 += .5/1.5 = 0.666
tc(le | cat) += P(le | cat)/1.5 = 0 += .5/1.5 = 0.333
• j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
tc(chat | NULL) += P(chat | NULL)/1 = 0 += .25/1 = 0.25
tc(chat | the) += P(chat | the)/1 = 0 += .25/1 = 0.25
tc(chat | cat) += P(chat | cat)/1 = 0 += .5/1 = 0.5
IBM Model 1 Example
Re-compute translation probabilities (iteration 2):
• total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
  P(le | the) = tc(le | the)/total(the) = 0.666 / 1.166 ≈ 0.571
  P(chien | the) = tc(chien | the)/total(the) = 0.25 / 1.166 ≈ 0.214
  P(chat | the) = tc(chat | the)/total(the) = 0.25 / 1.166 ≈ 0.214
• total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
  P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.833 = 0.4
  P(chien | dog) = tc(chien | dog)/total(dog) = 0.5 / 0.833 = 0.6
IBM Model 1 Example
After 5 iterations:
P(le | NULL) = 0.755608028335301
P(chien | NULL) = 0.122195985832349
P(chat | NULL) = 0.122195985832349
P(le | the) = 0.755608028335301
P(chien | the) = 0.122195985832349
P(chat | the) = 0.122195985832349
P(le | dog) = 0.161943319838057
P(chien | dog) = 0.838056680161943
P(le | cat) = 0.161943319838057
P(chat | cat) = 0.838056680161943
IBM Model 1 Recap
IBM Model 1 allows for an efficient computation of translation probabilities.
No notion of fertility, i.e. it is possible that the same English word is the best translation for all foreign words.
No positional information, i.e. depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence.
IBM Model 3
IBM Model 3 offers two additional features compared to IBM Model 1:
• Fertility: how likely is an English word e to align to k foreign words?
• Distortion (positional information): how likely is a word in position i to align to a word in position j?
IBM Model 3: Fertility
The best Model 1 alignment could be that a single English word aligns to all foreign words.
This is clearly not desirable, and we want to constrain the number of words an English word can align to.
Fertility models a probability distribution that word e aligns to k words: n(k | e).
Consequence: translation probabilities can no longer be computed independently of each other.
IBM Model 3 has to work with full alignments; note there are up to (l+1)^m different alignments.
IBM Model 1 + Model 3
Iterating over all possible alignments is computationally infeasible.
Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging).
Model 3 takes this restricted set of alignments as input.
Pegging
Given an alignment a, we can derive additional alignments from it by making small changes:
• Changing a link (j, i) to (j, i')
• Swapping a pair of links (j, i) and (j', i') to (j, i') and (j', i)
The resulting set of alignments is called the neighborhood of a.
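A small sketch of this neighborhood construction (my own illustration; an alignment is represented as a set of (j, i) links from foreign position j to English position i, with l the English length and position 0 the NULL word):

```python
from itertools import combinations

def neighborhood(a, l):
    """All alignments reachable from a by one move or one swap."""
    neighbors = set()
    links = sorted(a)
    # Moves: change one link (j, i) to (j, i') for every other English position.
    for (j, i) in links:
        for i2 in range(l + 1):
            if i2 != i:
                neighbors.add(frozenset((a - {(j, i)}) | {(j, i2)}))
    # Swaps: exchange the English positions of two links.
    for (j, i), (j2, i2) in combinations(links, 2):
        if i != i2:
            neighbors.add(frozenset((a - {(j, i), (j2, i2)}) | {(j, i2), (j2, i)}))
    return neighbors
```

Model 3 then restricts its count collection to such neighborhoods of the Model 1 Viterbi alignment instead of enumerating all (l+1)^m alignments.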
IBM Model 3: Distortion
The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences: d(j | i, l, m).
Note that positions are absolute positions.
Deficiency
Problem with IBM Model 3: it assigns probability mass to impossible strings.
• Well-formed string: "This is possible"
• Ill-formed but possible string: "This possible is"
• Impossible string: e.g. one in which two words occupy the same position
Impossible strings are due to distortion values that generate different words at the same position.
Impossible strings can still be filtered out in later stages of the translation process.
Limitations of IBM Models
Only 1-to-N word mapping.
Handling fertility-zero words (difficult for decoding).
Almost no syntactic information:
• Word classes
• Relative distortion
Long-distance word movement.
Fluency of the output depends entirely on the English language model.
Decoding
How do we translate new sentences?
A decoder uses the parameters learned on a parallel corpus:
• Translation probabilities
• Fertilities
• Distortions
In combination with a language model, the decoder generates the most likely translation.
Standard algorithms can be used to explore the search space (A*, greedy searching, …).
Similar to the traveling salesman problem.
Decoding for "Classic" Models
Of all conceivable English word strings, find the one maximizing P(e) × P(f | e).
Decoding is an NP-complete challenge (Knight, 1999).
Several search strategies are available.
Each potential English output is called a hypothesis.
Dynamic Programming Beam Search
[Figure: a search graph from start to end, with columns of hypotheses for the 1st, 2nd, 3rd, 4th target words, ending when all source words are covered; each hypothesis records a best-predecessor link for backtracking.]
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of the source sentence
- Language model and translation model scores (so far)
[Jelinek, 1969; Brown et al., 1996 US Patent; Och, Ueffing, and Ney, 2001]
The Classic Results
la politique de la haine . (Foreign Original)
politics of hate . (Reference Translation)
the policy of the hatred . (IBM4+N-grams+Stack)

nous avons signé le protocole . (Foreign Original)
we did sign the memorandum of agreement . (Reference Translation)
we have signed the protocol . (IBM4+N-grams+Stack)

où était le plan solide ? (Foreign Original)
but where was the solid plan ? (Reference Translation)
where was the economic base ? (IBM4+N-grams+Stack)

Sample system output (Chinese-English):
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
Limitations of Word-Based MT
Multiple English words for one French word
• IBM models can do one-to-many (fertility) but not many-to-one
Phrasal translation
• "real estate", "note that", "interest in"
Syntactic transformations
• Verb at the beginning in Arabic
• The translation model penalizes any proposed re-ordering
• The language model is not strong enough to force the verb to move to the right place
Phrase-Based Statistical MT
Foreign input is segmented into phrases
• A "phrase" is any sequence of words
Each phrase is probabilistically translated into English
• P(to the conference | zur Konferenz)
• P(into the meeting | zur Konferenz)
Phrases are probabilistically re-ordered
See [Koehn et al., 2003] for an intro.
This is state-of-the-art!

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada
Advantages of Phrase-Based
Many-to-many mappings can handle non-compositional phrases.
Local context is very useful for disambiguating:
• "interest rate" …
• "interest in" …
The more data, the longer the learned phrases
• Sometimes whole sentences
How to Learn the Phrase Translation Table?
One method: "alignment templates" (Och et al., 1999)
Start with word alignment, build phrases from that.
[Figure: word alignment matrix between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch".]
This word-to-word alignment is a by-product of training a translation model like IBM Model 3.
This is the best (or "Viterbi") alignment.
IBM Models are 1-to-Many
Run an IBM-style aligner in both directions, then merge:
• E→F best alignment
• F→E best alignment
• MERGE: union or intersection
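The merge step can be sketched in a few lines (my own illustration; each directional alignment is a set of (e_pos, f_pos) links):

```python
def symmetrize(e2f, f2e, method="intersection"):
    """Combine the two directional link sets.

    intersection: high precision (only links both runs agree on).
    union: high recall (any link proposed by either run).
    """
    if method == "intersection":
        return e2f & f2e
    if method == "union":
        return e2f | f2e
    raise ValueError(method)
```

In practice, heuristics between these two extremes (growing the intersection with selected union links) are also used, but intersection and union are the two poles named on the slide.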
How to Learn the Phrase Translation Table?
Collect all phrase pairs that are consistent with the word alignment.
[Figure: the same alignment matrix for "Maria no dió una bofetada a la bruja verde" / "Mary did not slap the green witch", with one example phrase pair boxed.]
Consistent with Word Alignment
A phrase alignment must contain all alignment points for all the words in both phrases!
[Figure: three candidate phrase pairs over "Maria no dió" / "Mary did not slap" - one consistent, two inconsistent because an alignment point falls inside the phrase on one side but outside it on the other.]
Word Alignment Induced Phrases
[Figure: alignment matrix between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch".]
Extracted phrase pairs, from smallest to largest:
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
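The consistency-based extraction above can be sketched as follows (my own illustration: enumerate source and target spans, and keep a pair when every alignment link touching either span falls inside both, with at least one link inside):

```python
def extract_phrases(f_words, e_words, links, max_len=7):
    """links: set of (f_pos, e_pos) word-alignment points (0-based)."""
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            for e1 in range(len(e_words)):
                for e2 in range(e1, min(e1 + max_len, len(e_words))):
                    inside = [(f, e) for (f, e) in links
                              if f1 <= f <= f2 and e1 <= e <= e2]
                    # A link violates consistency if it is inside the span
                    # on one side but outside it on the other.
                    violated = any((f1 <= f <= f2) != (e1 <= e <= e2)
                                   for (f, e) in links)
                    if inside and not violated:
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs
```

Unaligned words (like "a" in the example) may be absorbed into neighboring phrases, which is why both (la, the) and (a la, the) are extracted.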
Phrase Pair Probabilities
A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
• We hope so!
So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?
Phrase Pair Probabilities
Basic idea:
• No EM training
• Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)
Important refinements:
• Smooth using word probabilities P(f | e) for individual words connected in the word alignment
– Some low-count phrase pairs now have high probability, others have low probability
• Discount for ambiguity
– If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
• Count BAD events too
– If phrase e-e-e doesn't map onto any contiguous French phrase, increment event count(BAD, e-e-e)
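The basic relative-frequency estimate (before the refinements) can be sketched directly from the extracted phrase-pair tokens (a toy illustration, ignoring smoothing, ambiguity discounting, and BAD events):

```python
from collections import Counter

def phrase_table(phrase_pairs):
    """phrase_pairs: list of (f_phrase, e_phrase) tokens, one per extraction.

    Returns P(f | e) = count(f, e) / count(e) for every observed pair.
    """
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for _, e in phrase_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}
```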
Advanced Training Methods
Basic Model, Revisited
argmax_e P(e | f) =
argmax_e P(e) × P(f | e) / P(f) =
argmax_e P(e) × P(f | e)

Basic Model, Revisited
argmax_e P(e | f) =
argmax_e P(e) × P(f | e) / P(f) =
argmax_e P(e)^2.4 × P(f | e) … works better!

Basic Model, Revisited
argmax_e P(e | f) =
argmax_e P(e) × P(f | e) / P(f) =
argmax_e P(e)^2.4 × P(f | e) × length(e)^1.1
Rewards longer hypotheses, since these are unfairly punished by P(e).

Basic Model, Revisited
argmax_e P(e)^2.4 × P(f | e) × length(e)^1.1 × KS^3.7 × …
Lots of knowledge sources vote on any given hypothesis.
"Knowledge source" = "feature function" = "score component".
A feature function simply scores a hypothesis with a real value. (May be binary, as in "e has a verb".)
Problem: how to set the exponent weights?
MT Evaluation
• Intrinsic
– Human evaluation
– Automatic (machine) evaluation
• Extrinsic: how useful is MT system output for…
– Deciding whether a foreign-language blog is about politics?
– Cross-language information retrieval?
– Flagging news stories about terrorist attacks?
– …
Human Evaluation
Source: Je suis fatigué.
                      Adequacy  Fluency
Tired is I.              5         2
Cookies taste good!      1         5
I am exhausted.          5         5
Human Evaluation
PRO: High quality
CON: Expensive!
• A person (preferably bilingual) must make a time-consuming judgment per system hypothesis.
• Expense prohibits frequent evaluation of incremental system modifications.
Automatic Evaluation
PRO: Cheap. Given available reference translations, free thereafter.
CON: We can only measure some proxy for translation quality (such as n-gram overlap or edit distance).
Automatic Evaluation: Bleu Score
Bleu score: brevity penalty times the geometric mean of the n-gram precisions:
  Bleu = B · exp( (1/N) Σ_{n=1..N} log p_n )
Brevity penalty (penalizes hypotheses shorter than the reference):
  B = e^(1 − |ref|/|hyp|)  if |ref| > |hyp|,  1 otherwise
N-gram precision p_n: the count of each hypothesis n-gram is bounded above by the highest count of that n-gram in any reference sentence:
  p_n = Σ_{n-gram ∈ hyp} count_clip(n-gram) / Σ_{n-gram ∈ hyp} count(n-gram)
Automatic Evaluation: Bleu Score
hypothesis 1: I am exhausted
hypothesis 2: Tired is I
hypothesis 3: I I I
reference 1: I am tired
reference 2: I am ready to sleep now and so exhausted

Modified n-gram precisions:
              1-gram  2-gram  3-gram
hypothesis 1   3/3     1/2     0/1
hypothesis 2   1/3     0/2     0/1
hypothesis 3   1/3     0/2     0/1
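The clipped n-gram counts above can be reproduced with a short sketch of the modified precision (function names are my own; this is the precision component only, without the brevity penalty):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, refs, n):
    """Each hypothesis n-gram count is clipped by its max count in any single reference."""
    hyp_counts = ngrams(hyp.split(), n)
    if not hyp_counts:
        return 0.0
    clipped = 0
    for gram, count in hyp_counts.items():
        max_ref = max(ngrams(r.split(), n)[gram] for r in refs)
        clipped += min(count, max_ref)
    return clipped / sum(hyp_counts.values())
```

Clipping is what keeps a degenerate hypothesis like "I I I" at 1/3 rather than 3/3: "I" occurs at most once in any single reference.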
Maximum BLEU Training (Och, 2003)
[Figure: a feedback loop. Farsi input passes through a trainable, automatic translation system (language model #1, translation model, language model #2, length model, other features) to produce English MT output; an automatic evaluator compares it against English reference translations (sample "right answers") and returns a BLEU score; a learning algorithm adjusts the weights to directly reduce translation error.]
Yields big improvements in quality.
Minimizing Error/Maximizing Bleu
• Adjust parameters to minimize error (L) when translating a training set
• Error as a function of the parameters is
– nonconvex: not guaranteed to find the optimum
– piecewise constant: slight changes in parameters might not change the output
• Usual method: optimize one parameter at a time with a line search
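A toy sketch of this coordinate-wise error minimization: hold all weights but one fixed, try values for that weight, and keep the value giving the lowest corpus error (a grid stand-in for the exact 1-D line minimization used in practice; all names are my own):

```python
def optimize_weight(weights, k, candidates, error, grid):
    """candidates: per sentence, a list of (feature_vector, hypothesis) pairs."""
    def corpus_error(w):
        total = 0.0
        for cands in candidates:
            # Pick the highest-scoring hypothesis under weights w ...
            _, best = max(cands, key=lambda c: sum(wi * fi for wi, fi in zip(w, c[0])))
            # ... and charge its error (piecewise constant in w).
            total += error(best)
        return total
    best_val = min(grid, key=lambda v: corpus_error(weights[:k] + [v] + weights[k + 1:]))
    return weights[:k] + [best_val] + weights[k + 1:]
```

Note how the objective only changes when the argmax hypothesis flips, which is exactly the piecewise-constant behavior described above.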
Generative/Discriminative Reunion
Generative models can be cheap to train: "count and normalize" when nothing's hidden.
Discriminative models focus on the problem: "get better translations".
Popular combination:
• Estimate several generative translation and language models using relative frequencies.
• Find their optimal (log-linear) combination using discriminative techniques.
Generative/Discriminative Reunion
Score each hypothesis with several generative models:
  score(t, s) = p_phrase(t | s)^θ1 · p_phrase(s | t)^θ2 · p_lexical(t | s)^θ3 · … · p_LM(t)^θ7 · (#words)^θ8
If necessary, renormalize into a probability distribution:
  p(t_i | s) = (1/Z) exp(θ · f_i),  where Z = Σ_k exp(θ · f_k)
and k ranges over all hypotheses (f_i is the vector of log feature scores of hypothesis i).
Exponentiation makes the score positive.
Renormalization is unnecessary if the thetas sum to 1 and the p's are all probabilities.
Minimizing Risk
Instead of the error of the 1-best translation, compute the expected error (risk) over the k-best translations; this makes the objective differentiable:
  p(t_i | s_i) = exp[γ θ · f_i] / Σ_k exp[γ θ · f_k]
  Risk = E_p[ L(s, t) ]
Smooth the probability estimates using γ to even out local bumpiness; gradually increase γ (e.g. from 1.0 toward 10) to approach the 1-best error.
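The smoothed risk over a k-best list is straightforward to compute (a minimal sketch with my own function name; scores[i] stands for θ·f_i and losses[i] for L(s, t_i)):

```python
import math

def expected_loss(scores, losses, gamma):
    """Risk = sum_i p_gamma(i) * loss_i over a k-best list.

    As gamma grows, p_gamma peaks on the highest-scoring hypothesis,
    so the risk approaches the plain 1-best error.
    """
    weights = [math.exp(gamma * s) for s in scores]
    z = sum(weights)
    return sum(w / z * l for w, l in zip(weights, losses))
```

At gamma = 0 the distribution is uniform (risk = average loss); at large gamma it concentrates on the 1-best hypothesis, recovering the non-smooth objective.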
Synchronous grammars