Grammatical Agreement in SMT
Post on 18-Nov-2014
Institut für Anthropomatik, 24.06.13, Simon Hummel – Lehrstuhl Prof. Waibel
Grammatical Agreement in SMT
Seminar: Speech-to-Speech Translation, Summer Semester 2013
Inflection and Agreement
Inflection
– Modification of a word
– Signals grammatical variants (tense, gender, case, …)
– e.g. walk vs. walked
Agreement
– Inflections of related words in a sentence have to agree
– e.g. das Haus (correct) vs. die Haus (incorrect)
Some languages are weakly inflected (e.g. English)
Some are highly inflected (e.g. German, Arabic, …)
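To make the notion of agreement concrete, here is a minimal sketch that checks determiner-noun gender agreement for the German example above. The tiny lexicons `NOUN_GENDER` and `DET_GENDER` and the function `agrees` are illustrative assumptions (nominative singular only); they are not part of the talk.

```python
# Toy lexicons (hypothetical, nominative singular only).
NOUN_GENDER = {"Haus": "neut", "Katze": "fem", "Hund": "masc"}
DET_GENDER = {"das": "neut", "die": "fem", "der": "masc"}


def agrees(determiner: str, noun: str) -> bool:
    """Return True if determiner and noun carry the same gender."""
    return DET_GENDER[determiner] == NOUN_GENDER[noun]


print(agrees("das", "Haus"))  # True
print(agrees("die", "Haus"))  # False
```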
Agreement Errors
Local Agreement Errors
Ref:  the-car_F go_F with-speed
Hypo: the-car_F go_M with-speed
Long-distance Agreement Errors
Ref:  celle qui parle , c’est ma femme
      one_F who speak , is my wife_F
Hypo: celui qui parle est ma femme
      one_M who speak is my spouse_F
Approaches for SMT
Morphological Generation
– Produce uninflected stems and inflect them with predicted inflections
Agreement Constraints
– Use a synchronous context-free grammar (SCFG) for the target language and add constraints to it
Class-based Agreement Model
– Use morphological word classes, e.g. “Noun+Def+Sg+Fem”
Morphological Generation: Idea
“Generating Complex Morphology for Machine Translation” (Minkov and Toutanova, 2007)
Convert MT output to stem sequence
Predict an inflection for every stem
Predicted inflections should reflect the meaning of the source and comply with agreement rules
Morphological Generation: Lexicons
Morphology analysis and generation
Operations:
– Stemming
– Inflection
– Morphological analysis
Can be created manually
Can be created automatically from data
Here: assumed to be given
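The three lexicon operations can be pictured with a small in-memory lexicon. This is only a sketch: the class `Lexicon`, its method names, the toy English entries, and the choice to return sorted lists of alternatives are all assumptions made for illustration, not the format used in the paper.

```python
from collections import defaultdict

# Hypothetical entries: (surface form, stem, morphological analysis).
ENTRIES = [
    ("walk", "walk", "Verb+Pres"),
    ("walks", "walk", "Verb+Pres+3Sg"),
    ("walked", "walk", "Verb+Past"),
    ("cat", "cat", "Noun+Sg"),
    ("cats", "cat", "Noun+Pl"),
]


class Lexicon:
    """Toy lexicon offering stemming, inflection and morphological analysis."""

    def __init__(self, entries):
        self._stems = defaultdict(set)     # surface form -> stems
        self._forms = defaultdict(set)     # stem -> surface forms
        self._analyses = defaultdict(set)  # surface form -> analyses
        for surface, stem, analysis in entries:
            self._stems[surface].add(stem)
            self._forms[stem].add(surface)
            self._analyses[surface].add(analysis)

    def stems(self, word):
        """Stemming: possible stems of a surface form."""
        return sorted(self._stems.get(word, ()))

    def inflections(self, stem):
        """Inflection: surface forms that share the given stem."""
        return sorted(self._forms.get(stem, ()))

    def analyses(self, word):
        """Morphological analysis: possible analyses of a surface form."""
        return sorted(self._analyses.get(word, ()))


lexicon = Lexicon(ENTRIES)
print(lexicon.stems("walked"))
print(lexicon.inflections("walk"))
print(lexicon.analyses("cats"))
```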
Morphological Generation: Inflection Prediction
Maximum entropy Markov model (2nd order)
Features:
– Monolingual, bilingual
– Lexical, morphological, syntactic
$$p(y \mid x) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \quad y_t \in I_t$$
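The factorization above can be sketched in a few lines: each position has a candidate inflection set I_t, a local model conditions on the two previous predictions, and the sequence probability is the product of the local terms. The weight function `toy_weight`, the German toy input, and the exhaustive search are assumptions for illustration; a real system would use trained maximum entropy weights and dynamic programming or beam search.

```python
import itertools

START = "<s>"  # padding symbol for the two-label history


def toy_weight(y, prev1, prev2, x):
    """Hypothetical unnormalized feature weight; not taken from the paper."""
    # Favor the adjective form that agrees with a preceding "die".
    if prev1 == "die" and y == "kleine":
        return 5.0
    return 1.0


def local_prob(y, prev1, prev2, x, candidates):
    """p(y_t | y_{t-1}, y_{t-2}, x_t), normalized over the candidate set I_t."""
    z = sum(toy_weight(c, prev1, prev2, x) for c in candidates)
    return toy_weight(y, prev1, prev2, x) / z


def sequence_prob(ys, xs, candidate_sets):
    """Product of the local probabilities along the sequence."""
    prob = 1.0
    prev1 = prev2 = START
    for y, x, candidates in zip(ys, xs, candidate_sets):
        prob *= local_prob(y, prev1, prev2, x, candidates)
        prev1, prev2 = y, prev1
    return prob


def best_sequence(xs, candidate_sets):
    """Exhaustive argmax over all candidate sequences; toy inputs only."""
    return max(
        itertools.product(*candidate_sets),
        key=lambda ys: sequence_prob(ys, xs, candidate_sets),
    )


stems = ["d-", "klein", "Katze"]
inflection_sets = [["die"], ["kleiner", "kleine", "kleines"], ["Katze"]]
print(best_sequence(stems, inflection_sets))
```

Because each local distribution is normalized over its own candidate set, the probabilities of all candidate sequences sum to one.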
Morphological Generation: Evaluation
English-Russian and English-Arabic
Technical (software manual) domain
Input: aligned sentence pairs from reference translations rather than MT system output, to reduce noise
Accuracy (%) results
Morphological Generation: Conclusion
Needed resources:
– Large corpus of aligned sentence pairs
– Lexicons (source and target) with the three operations
+ Better accuracy than simple LM (even with small training data)
+ Easy to add to existing MT system
- Expensive creation of lexicons
Constraints: Idea
“Agreement Constraints for Statistical Machine Translation into German” (Williams and Koehn, 2011)
String-to-tree model
Synchronous grammar for target language
Adding learned constraints and probabilities
Evaluation of constraints during decoding
Constraints: Feature Structure
Feature structure
Unification
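As a rough illustration of unification, the sketch below merges two feature structures represented as nested dictionaries and reports failure when atomic values conflict. The function `unify`, the feature names (`INFL`, `CASE`, `GENDER`, `NUMBER`) and the example values are assumptions; the sketch ignores disjunctive values and shared substructures, which a full implementation would need.

```python
def unify(a, b):
    """Unify two feature structures given as (possibly nested) dicts.

    Returns the merged structure, or None if the two structures conflict.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            return None  # conflicting atomic values: unification fails
    return result


# Hypothetical feature structures for single words.
die = {"INFL": {"CASE": "nom", "GENDER": "fem", "NUMBER": "sg"}}
katze = {"INFL": {"GENDER": "fem", "NUMBER": "sg"}}
haus = {"INFL": {"GENDER": "neut", "NUMBER": "sg"}}

print(unify(die, katze))  # merged structure: the words agree
print(unify(die, haus))   # None: gender clash
```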
Constraints: Grammar
Synchronous grammar learned from parallel corpus
Extended with constraints on the target side
Sample rule/constraint:
NP-SB → the X_1 cat | die AP_1 Katze
Constraints: Training
Propagation rules to capture NP/PP agreements:
Applied bottom-up
Constraints: Decoding
Model:
$$\hat{t} = \arg\max_{t} p(t \mid s)$$
$$p(t \mid s) = \frac{1}{Z} \sum_{i=1}^{n} \lambda_i h_i(s, t)$$
Every element of a rule/constraint has a feature structure
Constraint evaluation: each hypothesis stores the set of feature structures corresponding to its root rule element
Recombination of hypotheses is possible
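The decoding model on this slide scores each translation hypothesis by a weighted sum of feature functions and picks the best one. Since Z does not depend on t, it can be ignored when taking the argmax. The sketch below assumes made-up feature names, weights and values; in particular the `agreement_violations` feature is hypothetical and only illustrates how agreement could influence the ranking.

```python
def score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i(s, t)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest model score."""
    return max(hypotheses, key=lambda h: score(h["features"], weights))


# Hypothetical weights (lambda_i) and feature values (h_i).
weights = {"lm": 1.0, "tm": 1.0, "agreement_violations": -2.0}
hypotheses = [
    {"text": "die Haus", "features": {"lm": -2.0, "tm": -1.0, "agreement_violations": 1}},
    {"text": "das Haus", "features": {"lm": -2.5, "tm": -1.0, "agreement_violations": 0}},
]

print(best_hypothesis(hypotheses, weights)["text"])
```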
Constraints: Evaluation
English-German
Europarl and News Commentary
Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit
Treebank for target
Grammar: ~140 million rules
BLEU scores and p-values for three test sets
Constraints: Conclusion
Needed resources:
– Parallel corpus
– Heuristics for constraint extraction
+ Improvement in translation accuracy
- Improvement is quite small
Class-Based: Idea
“A Class-Based Agreement Model for Generating Accurately Inflected Translations” (Green and DeNero, 2012)
During decoding
Target-side
Three steps:
1. Segmentation
2. Tagging
3. Scoring
Class-Based: Segmentation
Train conditional random field (CRF)
During decoding, not as a preprocessing step
Features:
– Centered 5-character window
Labels:
– I: Continuation (inside)
– O: Outside (whitespace)
– B: Beginning
– F: Non-native characters
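Once a character-level labeler has assigned one of the four labels to every character, the segments can be read off the label sequence. The sketch below shows only that decoding step, not the CRF itself. The function `labels_to_segments`, the transliterated toy input, and the rule that a run of F-labeled characters forms its own segment are assumptions made for illustration.

```python
def labels_to_segments(text, labels):
    """Turn per-character labels (I, O, B, F) into a list of segments."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments = []
    current = ""
    prev_label = None
    for ch, label in zip(text, labels):
        if label == "O":
            # Whitespace closes the current segment and is dropped.
            if current:
                segments.append(current)
            current = ""
        else:
            # Assumption: B opens a segment, and so does the first F of a run.
            starts_new = label == "B" or (label == "F" and prev_label != "F")
            if starts_new and current:
                segments.append(current)
                current = ""
            current += ch
        prev_label = label
    if current:
        segments.append(current)
    return segments


# Hypothetical transliterated input: clitic "w" + word "ktb", then a digit.
print(labels_to_segments("wktb 7", ["B", "B", "I", "I", "O", "F"]))
```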
Class-Based: Tagging
Train CRF on full sentences with gold classes
Features:
– Current and previous words, affixes, etc.
Labels:
– Morphological classes: gender, number, person, definiteness
– e.g. 89 classes for Arabic
Example:
'the car'
Tagged: “Noun+Def+Sg+Fem”
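The feature types listed for the tagger (current and previous words, affixes) could be extracted per token roughly as follows. This is a sketch under assumptions: the function `token_features`, the feature names, the padding symbol and the fixed affix length of two characters are all hypothetical, and the slide's "etc." indicates that the real feature set is larger.

```python
def token_features(words, i, affix_len=2):
    """Return a feature dictionary for the word at position i."""
    word = words[i]
    # Assumption: "<s>" pads the history at the start of the sentence.
    prev_word = words[i - 1] if i > 0 else "<s>"
    return {
        "word": word,
        "prev_word": prev_word,
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
    }


sentence = ["the", "cars"]
print(token_features(sentence, 1))
```

Feature dictionaries of this kind are the usual input format for off-the-shelf CRF toolkits, which then learn to map them to class labels such as “Noun+Def+Sg+Fem”.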
Class-Based: Scoring
Scores of word sequences are not comparable across hypotheses
→ Score class sequences with a generative model
Simple bigram LM over gold class sequences (add-1 smoothed)
$$\tau' = \arg\max_{\tau} p(\tau \mid s)$$
$$q(e) = p(\tau') = \prod_{i=1}^{I} p(\tau'_i \mid \tau'_{i-1})$$
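The scoring step can be sketched as an add-1 smoothed bigram model over class sequences. The training sequences, the start symbol `<s>`, and the function names below are assumptions for illustration; the class labels merely imitate the “Noun+Def+Sg+Fem” style used on the slides.

```python
from collections import Counter

START = "<s>"  # assumed start-of-sequence symbol


def train_bigram(class_sequences):
    """Count bigrams and their histories over gold class sequences."""
    history_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for seq in class_sequences:
        prev = START
        for tag in seq:
            bigram_counts[(prev, tag)] += 1
            history_counts[prev] += 1
            vocab.add(tag)
            prev = tag
    return history_counts, bigram_counts, vocab


def bigram_prob(tag, prev, model):
    """Add-1 smoothed estimate of p(tag | prev)."""
    history_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, tag)] + 1) / (history_counts[prev] + len(vocab))


def class_sequence_score(seq, model):
    """q(e): product of bigram probabilities over the class sequence."""
    prob, prev = 1.0, START
    for tag in seq:
        prob *= bigram_prob(tag, prev, model)
        prev = tag
    return prob


# Hypothetical gold class sequences.
gold = [
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
]
model = train_bigram(gold)
agreeing = class_sequence_score(["Noun+Def+Sg+Fem", "Verb+Sg+Fem"], model)
clashing = class_sequence_score(["Noun+Def+Sg+Fem", "Verb+Sg+Masc"], model)
print(agreeing > clashing)  # True
```

With these toy counts the agreeing sequence scores 3/14 and the clashing one 1/14, so a hypothesis with a gender mismatch receives the lower agreement score.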
Class-Based: Evaluation
English-Arabic
Training data: variety of sources (e.g. web)
Development and Test: NIST sets (Newswire and mixed genre [broadcast news, newsgroups, weblog])
Phrase-based decoder
BLEU score for newswire sets
BLEU score for mixed genre sets
Class-Based: Conclusion
Needed resources:
– Treebank for target (existing for many languages)
– Large target corpus
+ Improves translation quality
+ Easy to integrate in existing MT system
- Increases decoding time
- Not very good for mixed genres
References
Green, S. and DeNero, J. (2012). “A Class-Based Agreement Model for Generating Accurately Inflected Translations”. In: ACL.
Williams, P. and Koehn, P. (2011). “Agreement Constraints for Statistical Machine Translation into German”. In: Sixth Workshop on Statistical Machine Translation.
Minkov, E. and Toutanova, K. (2007). “Generating Complex Morphology for Machine Translation”. In: ACL.