Improving SMT with Phrase-to-Phrase Translations
Joy Ying Zhang, Ashish Venugopal,
Stephan Vogel, Alex Waibel
Carnegie Mellon University
Project: Mega-RADD
2
CMU Mega RADD
The Mega-RADD Team:
SMT: Stephan Vogel, Alex Waibel, John Lafferty
EBMT: Ralf Brown, Bob Frederking
Chinese: Joy Ying Zhang, Ashish Venugopal, Bing Zhao, Fei Huang
Arabic: Alicia Tribble, Ahmed Badran
3
Overview
• Goals:
  – Develop Data-Driven General-Purpose MT Systems
  – Train on Large and Small Corpora, Evaluate to Test Portability
• Approaches:
  – Two Data-Driven Approaches: Statistical, Example-Based
  – Also a Grammar-Based Translation System
  – Multi-Engine Translation
• Languages: Chinese and Arabic
• Statistical Translation:
  – Exploit Structure in Language: Phrases
  – Determine Phrases from Mono- and Bilingual Co-occurrences
  – Determine Phrases from Lexical and Alignment Information
4
Arabic: Initial System
• 1 million words of UN data, 300 sentences for testing
• Preprocessing: separation of punctuation marks, lower case for English, correction of corrupted numbers
• Adding human knowledge: cleaning the statistical lexicon for the 100 most frequent words; building lists of names, simple date expressions, and numbers (total: 1,000 entries; total effort: two part-timers × 4 weeks)
• Alignment: IBM1 plus HMM training, lexicon plus phrase translations
• Language Model: trained on 1m sub-corpus
• Results (20 May 2002):
  – UN test data (300 sentences): Bleu = 0.1176
  – NIST devtest (203 sentences): Bleu = 0.0242, NIST = 2.0608
5
Arabic: Portability to a New Language
• Training on subset of UN corpus chosen to cover vocabulary of test data
• Training English to Arabic for extraction of phrase translations
• Minimalist morphology: strip/add suffixes for ~200 unknown words (NIST: 5.5368 → 5.6700)
• Adapting LM: select stories from 2 years of English Xinhua stories according to an 'Arabic' keyword list (280 entries); size 6.9m words (NIST: 5.5368 → 5.9183)
• Results:
  – 20 May (devtest): 2.0608
  – 13 June (devtest): 6.5805
  – 14 June (evaltest): 5.4662 (final training not completed)
  – 17 June (evaltest): 6.4499 (after completed training)
  – 19 July (devtest): 7.0482
6
Two Approaches
• Determine Phrases from Mono- and Bilingual Co-occurrences – Joy
• Determine Phrases from Lexical and Alignment Information – Ashish
7
Why phrases?
• Mismatch between languages: word-to-word translation doesn't work
• Phrases encapsulate the context of words, e.g. verb tense
8
Why phrases? (Cont.)
• Local reordering, e.g. Chinese relative clause
• Using phrases to mitigate word segmentation failures
9
Utilizing bilingual information
• Given a sentence pair (S,T),
S=<s1,s2,…,si,…sm>
T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.
• Given an m×n matrix B, where
  B(i,j) = co-occurrence(si, tj) = χ²(si, tj) = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d))
  where N = a+b+c+d, and a, b, c, d are the sentence-pair counts from the 2×2 contingency table:

             tj    ~tj
     si      a     b
     ~si     c     d
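A minimal Python sketch of this χ² co-occurrence computation (the count bookkeeping and the helper names chi_square and cooccurrence_matrix are illustrative, not from the original system):

def chi_square(a, b, c, d):
    """Chi-square association between a source word s and a target word t,
    from the 2x2 contingency table of sentence-pair counts:
      a = #pairs with both s and t   b = #pairs with s but not t
      c = #pairs with t but not s    d = #pairs with neither"""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def cooccurrence_matrix(src_sent, tgt_sent, counts, num_pairs):
    """Build the m x n matrix B for one sentence pair.
    `counts` is assumed to hold corpus-wide sentence-pair counts:
    counts['st'][(s, t)], counts['s'][s], counts['t'][t]."""
    B = [[0.0] * len(tgt_sent) for _ in src_sent]
    for i, s in enumerate(src_sent):
        for j, t in enumerate(tgt_sent):
            a = counts['st'].get((s, t), 0)
            b = counts['s'].get(s, 0) - a
            c = counts['t'].get(t, 0) - a
            d = num_pairs - a - b - c
            B[i][j] = chi_square(a, b, c, d)
    return B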
10
Utilizing bilingual information (Cont.)
• Goal: find a partition over matrix B, under the constraint that one src/tgt word can only align to one tgt/src word or phrase (adjacent word sequence)
[Figure: a legal segmentation with an imperfect alignment vs. an illegal segmentation with a perfect alignment]
11
Utilizing bilingual information (Cont.)

For each sentence pair in the training data:
  While (there is still a row or column not aligned) {
    Find cell[i,j] where B(i,j) is the maximum among all available (not yet aligned) cells;
    Expand cell[i,j] with similarity sim_thresh into region[RowStart, RowEnd; ColStart, ColEnd];
    Mark all the cells in the region as aligned;
  }
  Output the aligned regions as phrases
-----------------------------------------------------
Sub expand cell[i,j] with sim_thresh {
  current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]
  While (still ok to expand) {
    if all cells[m,n] with m=RowStart-1, ColStart<=n<=ColEnd have B(m,n) similar to B(i,j)
      then RowStart--;   // expand to north
    if all cells[m,n] with m=RowEnd+1, ColStart<=n<=ColEnd have B(m,n) similar to B(i,j)
      then RowEnd++;     // expand to south
    …                    // expand to east
    …                    // expand to west
  }
}
Define similar(x,y) = true, if abs((x-y)/y) < 1 - similarity_thresh
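A runnable Python sketch of this greedy region-growing procedure (expansion order, tie-breaking, and the default sim_thresh are simplifying assumptions, not taken from the original system):

def similar(x, y, sim_thresh):
    # Two association scores are "similar" if their relative difference
    # is below 1 - sim_thresh, as defined on the slide.
    return y != 0 and abs((x - y) / y) < 1.0 - sim_thresh

def expand(B, i, j, aligned, sim_thresh):
    """Grow a rectangular region around cell (i, j) while every newly
    added row/column of cells stays similar to B[i][j]."""
    r0 = r1 = i
    c0 = c1 = j
    changed = True
    while changed:
        changed = False
        if r0 > 0 and all(similar(B[r0 - 1][n], B[i][j], sim_thresh)
                          and (r0 - 1, n) not in aligned for n in range(c0, c1 + 1)):
            r0 -= 1; changed = True            # expand to north
        if r1 < len(B) - 1 and all(similar(B[r1 + 1][n], B[i][j], sim_thresh)
                                   and (r1 + 1, n) not in aligned for n in range(c0, c1 + 1)):
            r1 += 1; changed = True            # expand to south
        if c0 > 0 and all(similar(B[m][c0 - 1], B[i][j], sim_thresh)
                          and (m, c0 - 1) not in aligned for m in range(r0, r1 + 1)):
            c0 -= 1; changed = True            # expand to west
        if c1 < len(B[0]) - 1 and all(similar(B[m][c1 + 1], B[i][j], sim_thresh)
                                      and (m, c1 + 1) not in aligned for m in range(r0, r1 + 1)):
            c1 += 1; changed = True            # expand to east
    return r0, r1, c0, c1

def partition(B, sim_thresh=0.8):
    """Greedily partition matrix B into aligned regions (phrase pairs)."""
    aligned, regions = set(), []
    rows_done, cols_done = set(), set()
    while len(rows_done) < len(B) or len(cols_done) < len(B[0]):
        # pick the highest-scoring cell that is not yet aligned
        best = max(((B[i][j], i, j) for i in range(len(B)) for j in range(len(B[0]))
                    if (i, j) not in aligned), default=None)
        if best is None:
            break
        _, i, j = best
        r0, r1, c0, c1 = expand(B, i, j, aligned, sim_thresh)
        regions.append((r0, r1, c0, c1))
        for m in range(r0, r1 + 1):
            for n in range(c0, c1 + 1):
                aligned.add((m, n))
        rows_done.update(range(r0, r1 + 1))
        cols_done.update(range(c0, c1 + 1))
    return regions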
12
Utilizing bilingual information (Cont.)
[Figure: expanding the aligned region to the north, south, east, and west]
13
Integrating monolingual information
• Motivation:
  – Use more information in the alignment
  – Easier for aligning phrases
  – There is much more monolingual data than bilingual data
[Example place names illustrating monolingual collocations: Santa Monica, Santa Clarita, Pittsburgh, Los Angeles, Somerset, Uniontown, Corona]
14
Integrating monolingual information (Cont.)
• Given a sentence pair (S,T),
S=<s1,s2,…,si,…sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.
• Construct m*m matrix A, where A(i,j) = collocation(si, sj); Only A(i,i-1) and A(i,i+1) have values.
• Construct n*n matrix C, where C(i,j) = collocation(ti, tj); Only C(j-1,j) and C(j+1,j) have values.
• Construct m*n matrix B, where B(i,j)= co-occurrence(si, tj).
15
Integrating monolingual information (Cont.)
• Normalize A so that: Σj A(i,j) = 1
• Normalize C so that: Σi C(i,j) = 1
• Normalize B so that: Σi Σj B(i,j) = 1
• Calculate the new src-tgt matrix: B' = A · B · C
[Figure: matrix B before vs. B' after incorporating the monolingual information]
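A small numpy sketch of these normalizations and of B' = A·B·C (the epsilon guard against empty rows/columns is an added assumption):

import numpy as np

def smooth_cooccurrence(A, B, C):
    """Combine the monolingual collocation matrices A (m x m, source side)
    and C (n x n, target side) with the bilingual co-occurrence matrix
    B (m x n), using the normalizations stated on the slide."""
    eps = 1e-12                                    # guard against empty rows/columns
    A = A / (A.sum(axis=1, keepdims=True) + eps)   # rows of A sum to 1
    C = C / (C.sum(axis=0, keepdims=True) + eps)   # columns of C sum to 1
    B = B / (B.sum() + eps)                        # B sums to 1 overall
    return A @ B @ C                               # B' = A · B · C, shape m x n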
16
Discussion and Results
• Simple
• Efficient
  – Partitioning the matrix is linear: O(min(m,n))
  – The construction of A·B·C is O(m*n)
• Effective
  – Improved the translation quality from the baseline (NIST = 6.3775, Bleu = 0.1417) to (NIST = 6.7405, Bleu = 0.1681) on the small data track dev-test
17
Utilizing alignment information: Motivation
• Alignment model associates words and their translations on the sentence level.
• Context and co-occurrence are represented when considering a set of sentence level alignments.
• Extract phrase relations from the alignment information.
18
Processing Alignments
• Identification – Selection of target phrase candidates for each source phrase.
• Scoring – Assigning a score to each candidate phrase pair to create a ranking.
• Pruning – Reducing the set of candidate translations to a computationally tractable number.
19
Identification
• Extraction from sentence level alignments.• For each source phrase identify the sentences in
which they occur and load the sentence alignment• Form a sliding/expanding window in the
alignment to identify candidate translations.
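A hedged sketch of how such a sliding/expanding window could enumerate candidate target phrases, assuming the target positions aligned to the source phrase are already known; max_len and the restriction of start positions to the aligned span are illustrative choices, not details from the original system:

def candidate_translations(tgt_sent, aligned_positions, max_len=8):
    """Enumerate contiguous target word spans (a sliding/expanding window)
    that overlap the positions aligned to the source phrase.
    `aligned_positions` is the set of target indices linked to the source
    phrase by the sentence-level alignment."""
    candidates = set()
    lo, hi = min(aligned_positions), max(aligned_positions)
    for start in range(lo, hi + 1):
        for end in range(start, min(start + max_len, len(tgt_sent))):
            candidates.add(' '.join(tgt_sent[start:end + 1]))
    return candidates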
20
Identification Example - I
21
Identification Example - II
- is
-is in step with the
-is in step with the establishment
-is in step with the establishment of
-is in step with the establishment of its
-is in step with the establishment of its legal
-is in step with the establishment of its legal system
-the
-the establishment
-the establishment of
-……
-the establishment of its legal system
-……
-establishment
-establishment of
-establishment of its
-….
22
Scoring - I
• This candidate set H needs to be scored and ranked before pruning.
• Alignment-based scores.
• Similarity clustering
  – Assume that the hypothesis set contains several similar phrases (across several sentences) and several noisy phrases.
  – SimScore(h) = Mean(EditDistance(h, h')/AvgLen(h, h')) for h, h' in H
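A possible Python rendering of SimScore using a word-level edit distance (a hypothesis with a low score sits close to the bulk of the set and is therefore less likely to be noise); the fallback when a hypothesis is the only member of H is an added assumption:

def edit_distance(a, b):
    """Word-level Levenshtein distance between two phrases (strings)."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def sim_score(h, hypotheses):
    """SimScore(h) = mean over h' in H of EditDistance(h, h') / AvgLen(h, h')."""
    others = [h2 for h2 in hypotheses if h2 != h] or hypotheses
    return sum(edit_distance(h, h2) / ((len(h.split()) + len(h2.split())) / 2.0)
               for h2 in others) / len(others)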
23
Scoring Example
24
Scoring - II
• Lexicon augmentation
  – Weight each point in the alignment score by its lexical probability P(si | tj), where I, J represent the area of the translation hypothesis being considered. Only word pairs linked by the alignment are considered.
  – Calculate the translation probability of the hypothesis:
    Σi Πj P(si | tj), where all words in the hypothesis are considered.
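A sketch of the hypothesis translation probability exactly as written on the slide, Σi Πj P(si | tj); the dictionary layout of `lexicon` and the smoothing floor for unseen pairs are assumptions:

def lexical_score(src_phrase, tgt_phrase, lexicon):
    """Sum over source words of the product over target words of P(s|t).
    `lexicon[(s, t)]` is assumed to hold the Model 1 probability P(s|t);
    unseen pairs fall back to a small floor value."""
    floor = 1e-7  # illustrative smoothing constant
    score = 0.0
    for s in src_phrase:
        prod = 1.0
        for t in tgt_phrase:
            prod *= lexicon.get((s, t), floor)
        score += prod
    return score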
25
Combining Scores
• FinalScore(h) = Πk Scorek(h), the product over all scoring methods k.
• Due to additional morphology present in English as compared to Chinese, a length model is used to adjust the final score to prefer longer phrases.
• DiffRatio = (I − J) / J, if I > J
• FinalScore(h) = FinalScore(h) * (1.0 + c * e^(−1.0 * DiffRatio))
– c is an experimentally determined constant
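A small sketch of the score combination and length bonus; the value c = 0.5 is only a placeholder for the experimentally determined constant:

import math

def final_score(scores, eng_len, chi_len, c=0.5):
    """Combine the individual scoring methods by product, then apply the
    length bonus when the English hypothesis (length I) is longer than
    the Chinese phrase (length J)."""
    score = 1.0
    for s in scores:                      # product over all scoring methods
        score *= s
    if eng_len > chi_len:                 # I > J
        diff_ratio = (eng_len - chi_len) / chi_len
        score *= 1.0 + c * math.exp(-1.0 * diff_ratio)
    return score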
26
Pruning
• This large candidate list is now sorted by score and is ready for pruning.
• Difficult to pick a threshold that will work across different phrases. We need a split point that separates the useful and the noisy candidates.
• Split point = argmax_p { MeanScore(h < p) − MeanScore(h >= p) }, where h ranges over the hypotheses in the ordered set H.
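A sketch of the split-point search, assuming the hypothesis scores are already sorted in descending order; everything after the split point is pruned as noise:

def split_point(scores):
    """Position p maximizing MeanScore(first p) - MeanScore(rest)."""
    best_p, best_gap = 1, float('-inf')
    for p in range(1, len(scores)):
        above = sum(scores[:p]) / p
        below = sum(scores[p:]) / (len(scores) - p)
        if above - below > best_gap:
            best_gap, best_p = above - below, p
    return best_p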
27
Experiments
• Alignment model: experimented with one-way (EF) and two-way (EF-FE union/intersection) alignments for IBM Models 1-4.
  – Best results were found using the union (high-recall model) from Model 4.
• Both the lexical augmentation scores (using the Model 1 lexicon) and the length bonus were applied.
28
Results and Thoughts
NIST scores:
                          Small Track    Large Track
Baseline (IBM1+LDC-Dic)   6.3775         6.52
+ Phrases                 6.7405         7.316
– More effective pruning techniques would significantly reduce the experimentation cycle.
– Improved alignment models that better combine bi-directional alignment information.
29
Combining Methods
NIST scores on the Small Data Track (Dec-01 data), with standard vs. improved segmentation:

                          Segmentation
                          standard    improved
+ Phrases Joy & Ashish    6.6427      6.8790
+ Phrases Joy             6.5624      6.7987
+ Phrases Ashish          6.5295      6.7405
Baseline (IBM1+LDC-Dic)   6.2381      6.3775