Anastasiou1Idioms in EBMT
Idiom Processing within the EBMT System METIS-II
Dimitra Anastasiou
Institut für Angewandte Informationsforschung (IAI) Saarland University, Germany
School of Computing, Dublin City University, Dublin
15th October 2008
Aim-Methods
Aim
• Enhancement of translation quality of idiomatic expressions (idiomatic VPs in particular) within the German-to-English EBMT system METIS-II
Resources
• Bilingual idiom dictionary
• Monolingual corpus
• Syntactic rules according to the German topological field model
Outlook
• EBMT: statistical or rule-based MT?
• Interpretation of idioms
• Topological field model
• Treatment of idioms by MT
• METIS-II idiom resources
• Translation process of METIS-II
• Evaluation of METIS-II
EBMT: Statistical or Rule-Based MT?
Two tendencies of EBMT:
1) Combinations of EBMT with rule-based MT (RBMT) as hybrid systems [Sumita et al., 1990];
2) Pure EBMT systems [Sato & Nagao, 1990].
EBMT lies between RBMT and statistical MT (SMT) [Carl & Way, 2005].
Reason: The transfer between SL and TL is always guided by translation examples, even if the replacement and/or modification of the sub-sequences are completely rule- or data-based.
Outlook
• EBMT: statistical or rule-based MT?
• Interpretation of idioms
    Identification by MT
    Semantics
    Syntax
    Grammatical and lexical variants
• Topological field model
• Treatment of idioms by MT
• METIS-II idiom resources
• Translation process of METIS-II
• Evaluation of METIS-II
Interpretation of Idioms
• Diverse terms (and accordingly definitions): idiom, semi-idiom, (cranberry) collocation, idiomatic/figurative/fixed/periphrastic phrase/expression, phraseologism, (dead) metaphor, etc.
• Irregularity of idioms depends on:
    Fixedness of constituents [Moon, 1998; Trawinski, 2008];
    Degree of compositionality;
        • Syntactic opaqueness: kick the bucket – die [Jackendoff, 1997; Gazdar et al., 1985];
    Poetic marking of the form, e.g.
        klipp und klar (clear as daylight)
        mit Rat und Tat (help and advice)
Idiom Identification by MT
jmdn. mit Argusaugen beobachten
so.-with-Argus eyes-observe
watch so. like a hawk
Er beobachtete den Mann, der die Bank betrat, mit Argusaugen.
He was watching the man, who entered the bank, like a hawk.
To identify the idiom, the MT system has to recognize:
• The contiguous parts of the idiom (mit Argusaugen);
• The discontinuous parts of the idiom (beobachten) in any of its inflected forms;
• The syntactic requirements of the idiom;
• The clause boundaries (the idiom usually stays within one clause).
More information can be found in Volk (1998).
Semantics (Degree of Compositionality)
1) Non-compositional: cranberry/unique constituents, e.g.:
    as happy as a sandboy
    on tenterhooks
   A recent study on cranberry expressions in English and German is that of Trawinski et al. (2008);
2) Partially compositional: light-verb constructions (SVCs), e.g.:
    außer Betrieb gehen – go out of service
    außer Betrieb sein – be out of order
   A recent study on German PP-verb SVCs is that of Krenn (2008);
3) Strictly compositional: collocations, e.g.:
    Maßnahmen ergreifen – take measures
Syntax
• Syntactic categories of idioms
• Realization of idioms with respect to syntactic gaps
Continuous (without gaps)
Discontinuous (with gaps)
Syntactic Categories
• Noun phrase (NP): pink slip
• Prepositional phrase (PP): by hook or crook
• Combination NP-PP: danger for life and health
• Adjective: prim and proper
• Verbal phrase (iVP)
    NP-Verb: kick the bucket
    PP-Verb: fall on deaf ears
    NP-PP-Verb: throw out the baby with the bath water
• Proverb: less is sometimes more
• Saying: gimme a break
Grammatical Variants (1)
• Number: pull up stakes vs. *pull up stake
    Exception: keep tabs on sb/sth vs. keep a tab on sb/sth
• Case: auf die Strasse gehen vs. *auf der Strasse gehen (take to the streets)
• Determiner: play a role vs. *play the role
• Possessive:
    in Verbindung treten vs. *in Pos.Pron. Verbindung treten (contact)
    Pos.Pron. Ohr leihen vs. *das Ohr leihen (listen)
Grammatical Variants (2)
• Negation:
    eine Rolle spielen (play a role)
    keine Rolle spielen (no-role-play)
    *nicht eine Rolle spielen (not-a-role-play)
    auf keinen grünen Zweig kommen (never get anywhere)
    nicht/nie auf einen grünen Zweig kommen
• Passivization: The more syntactically opaque an idiom is, the less likely it is to undergo passivization.
    opaque: [kick] [the bucket] – die
        *The bucket was kicked by him (only literal meaning)
    transparent: [spill] [the beans] – [tell] [a secret]
        The beans were spilled by him
Lexical Variants
• Substitution:
    kick the bucket vs. *kick the pail
    hit the sack vs. *hit the hay
• Modifiers
    Adjective: keep tabs on vs. keep close tabs on
    Adverb: noch grün hinter den Ohren sein vs. noch absolut grün hinter den Ohren sein (be half-baked)
Outlook
• EBMT: statistical or rule-based MT?
• Interpretation of idioms
• Topological field model
    Realization of idioms
    Discontinuous patterns
• Treatment of idioms by MT
• METIS-II idiom resources
• Translation process of METIS-II
• Evaluation of METIS-II
Topological Field Model for German
The German clauses are divided into five fields; each field can be occupied by a certain number and kind of constituents [Drach, 1963; DUDEN, 1998; Dürscheid, 2000]:
• pre-field (PF): only 1 constituent!;
• left bracket (LB): finite verb (e.g. modal/auxiliary verb);
• middle field (MF): many constituents and in free order;
• right bracket (RB): non-finite verb (infinitive/participle form);
• post-field (PF): subclause(s).
Realization of Idioms
• Continuous form: ( iNP_MF | iPP_MF | [iNP_MF iPP_MF] ) iV_RB
Er will nicht bei den Argumenten ständig den Bock (iNP_MF) zum Gärtner (iPP_MF) machen (iV_RB)!
He-wants-not-during-the-arguments-always-the-bock-to-the-gardener-make!
He does not always want to set the fox to keep the geese during the argumentation!
• Discontinuous form: iV_LB (Adverb)*_MF ( iNP_MF | iPP_MF | [iNP_MF iPP_MF] )
Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
He-makes-often-the-bock-to-the-gardener.
He often sets the fox to keep the geese.
Discontinuous patterns
Den Bock zum Gärtner machen (set the fox to keep the geese)
Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
Er hat den Bock (iNP_MF) zum Gärtner (iPP_MF) oft gemacht (iV_RB).
?Den Bock (iNP_PF) zum Gärtner (iPP_PF) hat er oft gemacht (iV_RB).
?Den Bock (iNP_PF) hat er oft zum Gärtner (iPP_MF) gemacht (iV_RB).
Outlook
• EBMT: statistical or rule-based MT?
• Interpretation of idioms
• Topological field model
• Treatment of idioms by MT
    Idioms suitable for EBMT
• METIS-II idiom resources
• Translation process of METIS-II
• Evaluation of METIS-II
Treatment of Idioms by MT
• Bar-Hillel (1952): “The only way for a machine to treat idioms is - not to have idioms!”
• Power Translator Pro user manual (2000) warns the user to avoid inputting sentences containing idioms!
• Power Translator Pro, SYSTRAN, T1 Langenscheidt cannot identify discontinuous idioms.
Idioms suitable for EBMT
• Idiomatic expressions are not suitable for rule-based MT (RBMT), but are suitable for EBMT.
“Translation of an idiomatic expression can only be used to translate the same idiomatic expression; it cannot be used to translate a similar expression.”
(Sumita et al., 1990: 210).
• By contrast, Nomiyama (1992) emphasizes the disadvantage of EBMT’s using only thesauri to define a general semantic distance, resulting in over-generalization, which is a major problem in translating idiomatic expressions.
• Related work: Santos (1990), Wehrli (1998), Ryu et al. (1999), and Gangadharaiah & Balakrishnan (2006).
Outlook
• EBMT: statistical or rule-based MT?
• Interpretation of idioms
• Topological field model
• Treatment of idioms by MT
• METIS-II idiom resources
    Idiom lexicon
    German corpus (annotation), (statistical analysis)
    Syntactic rules
• Translation process of METIS-II
• Evaluation of METIS-II
Idiom Resources
• Bilingual idiom dictionary of 871 entries
• Monolingual German corpus of 486 sentences
• Syntactic rules according to the German topological field model
METIS-II Project
• Hybrid MT system (EBMT, RBMT, SMT);
• Time span: 2004-2007;
• SLs: Dutch, German, Greek, Spanish;
• TL: British English;
• Based on pattern matching;
• Sources:
    Huge monolingual TL corpus (BNC);
    Bilingual dictionaries;
    Tokenizer;
    PoS tagger, chunker, lemmatizer;
    Manually constructed matching rules.
Idiom Dictionary
871 entries
Entry example:
{de=den_Bock_zum_Gärtner_machen, mde={c=verb}, en=set_the_fox_to_keep_the_geese, men={c=verb}}.
• 826 entries with equal PoS:
    598 verbs/VPs
    163 interjections
    37 NPs
    28 PPs
• 45 entries with different PoS (verb/VP-interjection)
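The entry notation can be read mechanically. The following is a minimal Python sketch, not the METIS-II lexicon reader: `parse_entry` and `_parse_value` are hypothetical helper names, and the grammar it assumes (comma-separated `key=value` pairs, nested braces, a closing period) is inferred from the single example entry only.

```python
def _parse_value(text: str, pos: int):
    """Parse a value starting at `pos`; return (value, position after it)."""
    if text[pos] == "{":
        result = {}
        pos += 1
        while text[pos] != "}":
            eq = text.index("=", pos)
            key = text[pos:eq].strip()
            # A value is either a nested {...} group or a plain atom.
            result[key], pos = _parse_value(text, eq + 1)
            if text[pos] == ",":
                pos += 1
        return result, pos + 1
    # Plain atom: read up to the next separator or closing brace.
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    return text[pos:end].strip(), end


def parse_entry(text: str) -> dict:
    """Parse one '{key=value, ...}.' dictionary entry into a nested dict."""
    text = text.strip().rstrip(".")
    value, pos = _parse_value(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return value


ENTRY = ("{de=den_Bock_zum_Gärtner_machen, mde={c=verb}, "
         "en=set_the_fox_to_keep_the_geese, men={c=verb}}.")

entry = parse_entry(ENTRY)
# Underscores join the words of a multi-word expression in the entry.
german_words = entry["de"].split("_")
# "Equal PoS" is read here as: source and target category are identical.
same_pos = entry["mde"]["c"] == entry["men"]["c"]
```

Under these assumptions, `entry["mde"]["c"] == entry["men"]["c"]` is the check that would separate the equal-PoS entries from the different-PoS ones.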
Idiom Corpus
Three corpus resources:
• Europarl (EP)
• Mixture of data sets (MDS): manually constructed (IAI) and real examples (Internet)
• DWDS (Digital lexicon of the German language in the 20th century)
Corpus   MWEs   Continuous   Discontinuous
EP       80     63 (79%)     17 (21%)
MDS      275    205 (75%)    70 (25%)
DWDS     131    91 (69%)     40 (31%)
Annotation of Idioms in the German Corpus
Continuous form:
Er will nicht bei den Argumenten ständig <MWE id=1> den Bock zum Gärtner machen </MWE id=1>.
He does not always want to set the fox to keep the geese during the argumentation.
Discontinuous form:
Er <MWE id=1> macht </MWE id=1> oft <MWE id=1> den Bock zum Gärtner </MWE id=1>.
He often sets the fox to keep the geese.
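An annotation of this kind can be read back with a regular expression. The sketch below is illustrative only: `extract_mwes` and `is_discontinuous` are hypothetical names, and it assumes the tag shape shown in the two examples (an opening `<MWE id=N>` and a closing `</MWE id=N>` that repeats the id). An idiom counts as discontinuous when its id labels more than one fragment.

```python
import re
from collections import defaultdict

# One annotated fragment: opening tag, text, closing tag with the same id.
MWE_SPAN = re.compile(r"<MWE id=(\d+)>\s*(.*?)\s*</MWE id=\1>")


def extract_mwes(annotated: str) -> dict:
    """Return {idiom id: annotated fragments in sentence order}."""
    fragments = defaultdict(list)
    for match in MWE_SPAN.finditer(annotated):
        fragments[match.group(1)].append(match.group(2))
    return dict(fragments)


def is_discontinuous(annotated: str, mwe_id: str) -> bool:
    """An idiom is discontinuous if its id marks more than one fragment."""
    return len(extract_mwes(annotated)[mwe_id]) > 1


continuous = ("Er will nicht bei den Argumenten ständig "
              "<MWE id=1> den Bock zum Gärtner machen </MWE id=1>.")
discontinuous = ("Er <MWE id=1> macht </MWE id=1> oft "
                 "<MWE id=1> den Bock zum Gärtner </MWE id=1>.")
```

Counting fragments per id in this way is one possible route to continuous/discontinuous totals per corpus; the slides do not say how the original counts were produced.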
Statistical Analysis of iVPs’ Syntactic Patterns
Continuous form patterns
Pattern    EP corpus   MDS corpus   DWDS corpus
NP-V       8           65           15
PP-V       29          106          60
NP-PP-V    4           21           6
Discontinuous form patterns
Pattern    EP corpus   MDS corpus   DWDS corpus
V-NP       1           8            13
V-PP       16          25           18
V-NP-PP    -           22           9
Syntactic Rule for Continuous Idioms
Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen!
En Bloc Pattern =
    A: match=yes, last idiom’s word=no, [den Bock, zum Gärtner]
    B: match=yes, last idiom’s word=yes, [machen]
    C: mark_as_continuous_iVP.
where
    A: idiom constituents from the first up to the one before the last
    B: last idiom constituent
    C: command to identify/match as continuous
No alien element between A and B!
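The en bloc condition can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the METIS-II rule engine: `match_en_bloc` is a hypothetical name, the sentence is split on whitespace, and words are compared by surface form, whereas the real system works on lemmatized and chunked input.

```python
from typing import List, Optional, Tuple


def match_en_bloc(tokens: List[str], idiom: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the idiom if it occurs without gaps, else None."""
    # A: every idiom word except the last; B: the last idiom word.
    *a_part, b_part = idiom
    for start in range(len(tokens) - len(idiom) + 1):
        end_a = start + len(a_part)
        # B must directly follow A, so no alien element can intervene.
        if tokens[start:end_a] == a_part and tokens[end_a] == b_part:
            return start, end_a + 1  # C: mark this span as a continuous iVP
    return None


sentence = "Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen !".split()
idiom = ["den", "Bock", "zum", "Gärtner", "machen"]
span = match_en_bloc(sentence, idiom)

# An adverb between A and B blocks the en bloc match.
blocked = match_en_bloc("Er hat den Bock zum Gärtner oft gemacht .".split(), idiom)
```

The second call returns `None` for two reasons in this simplified version: the adverb `oft` intervenes, and the participle `gemacht` does not equal the stored form `machen`. A lemma-based comparison would remove the second reason but not the first.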
Syntactic Rule for Discontinuous Idioms
Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF).
Discontinuous Pattern_LBMF =
    A: match=yes, field=LB, c=verb, [macht]
    B: [match=no, field=MF]*, [oft]
    C: match=yes, field=MF, [den Bock, zum Gärtner]
    D: mark_as_discontinuous_iVP.
where
    A: idiom’s verb in the left bracket
    B: arbitrarily many elements
    C: matched idiom’s constituents
    D: command to identify/match as discontinuous
Alien element(s) between A and C!
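The left-bracket/middle-field pattern can be sketched as follows. Everything named here is hypothetical: the `Token` class, its attribute names, and `match_lb_mf` are illustrative, and the field and category labels are assumed to come from an earlier analysis step. The verb is compared by lemma so that any inflected form matches; the remaining idiom words are compared by surface form for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    form: str   # surface form as it appears in the sentence
    lemma: str  # base form, e.g. "machen" for "macht"
    field: str  # topological field label, e.g. "PF", "LB", "MF", "RB"
    cat: str    # coarse category, e.g. "verb", "adv", "noun"


def match_lb_mf(tokens: List[Token], verb_lemma: str,
                rest: List[str]) -> Optional[Tuple[int, List[int], int]]:
    """Return (verb index, alien indices, start of the idiom's remaining part)."""
    for i, tok in enumerate(tokens):
        # A: the idiom's verb in the left bracket
        if not (tok.field == "LB" and tok.cat == "verb" and tok.lemma == verb_lemma):
            continue
        j = i + 1
        while j < len(tokens) and tokens[j].field == "MF":
            window = tokens[j:j + len(rest)]
            # C: the remaining idiom constituents, all in the middle field
            if [t.form for t in window] == rest and all(t.field == "MF" for t in window):
                return i, list(range(i + 1, j)), j  # D: mark as discontinuous
            # B: a non-idiom middle-field element; skip it and keep looking
            j += 1
    return None


sentence = [
    Token("Er", "er", "PF", "pron"),
    Token("macht", "machen", "LB", "verb"),
    Token("oft", "oft", "MF", "adv"),
    Token("den", "der", "MF", "det"),
    Token("Bock", "Bock", "MF", "noun"),
    Token("zum", "zu", "MF", "prep"),
    Token("Gärtner", "Gärtner", "MF", "noun"),
    Token(".", ".", "", "punct"),
]
result = match_lb_mf(sentence, "machen", ["den", "Bock", "zum", "Gärtner"])
```

For the example sentence the function reports the verb at index 1, the alien adverb at index 2, and the rest of the idiom starting at index 3.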
Outlook
• History of EBMT
• Interpretation of idioms
• Topological field model
• Treatment of idioms by MT
• METIS-II idiom resources
• Translation process of METIS-II
    METIS-II Idiom Matching Process
• Evaluation of METIS-II
METIS-II Translation Process
1) SL analysis (tokenization, PoS-tagging, lemmatization, and chunking or shallow parsing);
2) SL-to-TL matching
    i) The bilingual idiom dictionary;
    ii) The syntactic matching rules.
3) TL generation (the main TL resource, BNC, is used as a data-set of examples). The token generator is described in Carl & Schütz (2005).
METIS-II Idiom Matching Process
Users
• Store an idiom in the bilingual dictionary;
• Load the syntactic matching rules;
• Enter an input sentence/corpus.
System
• The system reads the sentence word by word;
• If the idiom is continuous and in the same form as stored in the dictionary, it is translated directly and correctly;
• If the idiom is discontinuous, the system reads the syntactic matching rules (rule by rule) until it finds the appropriate one, which is then applied.
Outlook
• History of EBMT
• Interpretation of idioms
• Topological field model
• Treatment of idioms by MT
• METIS-II idiom resources
• Translation process of METIS-II
• Evaluation of METIS-II
    For continuous idioms
    For discontinuous idioms
Evaluation of METIS-II
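Hit: correct matching/correct translation
Miss: no matching/reuse of German input
Noise: false matching/literal translation
• Precision: $Pr = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{noise}}$
• Recall: $Re = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{misses}}$
• f-score: $\mathrm{fscore} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$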
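The three measures on this slide (precision as hits over hits plus noise, recall as hits over hits plus misses, and the f-score as their harmonic mean) can be computed as in the sketch below. The function names are illustrative, and the counts in the example are hypothetical; they are not taken from the METIS-II evaluation.

```python
def precision(hits: int, noise: int) -> float:
    """Share of matched idioms that were matched correctly."""
    return hits / (hits + noise) if hits + noise else 0.0


def recall(hits: int, misses: int) -> float:
    """Share of idioms in the input that were matched correctly."""
    return hits / (hits + misses) if hits + misses else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, for illustration only.
hits, misses, noise = 90, 5, 10
p = precision(hits, noise)
r = recall(hits, misses)
f = f_score(p, r)
```

With these hypothetical counts, precision is 0.9, recall is 90/95, and the f-score equals 180/195, which is the same as 2·hits / (2·hits + misses + noise).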
Evaluation Results for Continuous iVPs
Corpus                                                    Recall   Precision   f-score
Europarl Corpus                                           98.3%    96.8%       96.8%
Manually constructed examples and examples from the Web   99%      96.2%       97.4%
DWDS                                                      98.9%    96.7%       97.4%
Evaluation Results for Discontinuous iVPs
Corpus                                                    Recall   Precision   f-score
Europarl Corpus                                           88.2%    78.9%       83.2%
Manually constructed examples and examples from the Web   95.7%    84.8%       88.8%
DWDS                                                      92.5%    90.2%       90.6%
Conclusion
• Continuous idioms: more than 95% recall and precision
• Discontinuous idioms: roughly 90% recall (88.2% to 95.7%) and roughly 80% precision or more (78.9% to 90.2%).
• All evaluation figures for continuous idioms are higher than those for discontinuous idioms.
This is attributed to the fact that discontinuous idioms are more difficult to identify because their constituents are spread across the sentence.
Thank you for your attention!
Dimitra Anastasiou
www.d-anastasiou.com
References (1)
• Bar-Hillel, Y., (1952), “The Treatment of ‘idioms’ by a Translating Machine”, presented at the Conference on Mechanical Translation at Massachusetts Institute of Technology, June 1952.
• Brown, R. D., (1999), “Adding Linguistic Knowledge to a Lexical Example-based Translation System”, in: 8th TMI 1999, Chester, England 22-32.
• Carl, M.; Schütz, J., (2005), “A Reversible Lemmatizer/Token-generator for English”, in: EBMT Workshop 2005, MT Summit X, Phuket, Thailand.
• Drach, E., (1963), Grundgedanken der deutschen Satzlehre, Darmstadt: Wissenschaftliche Buchgesellschaft.
• DUDEN Redaktion, (1998), Grammatik der deutschen Gegenwartssprache, Mannheim.
• Dürscheid, C., (2000), Syntax: Grundlagen und Theorien, Wiesbaden.
• Gangadharaiah, R.; Balakrishnan, N., (2006), “Application of Linguistic Rules to Generalized Example Based Machine Translation for Indian Languages”, in: Proceedings of the First National Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL), Mumbai, India.
• Gazdar, G.; Klein, E.; Pullum, G.; Sag, I., (1985), Generalized Phrase Structure Grammar, Oxford: Basil Blackwell.
• Jackendoff, R., (1997), The Architecture of the Language Faculty, Cambridge, Mass.: MIT Press.
• Krenn, B., (2008), “Description of evaluation resource – German PP-verb data”, in: MWE Workshop 2008, at LREC Conference, 7-11.
References (2)
• Moon, R., (1998), Fixed Expressions and Idioms in English: A Corpus-based Approach, Oxford, England: Clarendon Press.
• Ryu, B. R.; Kim, Y. K.; Yuh, S. H.; Park, S. K., (1999), “FromTo K/E: A Korean English Machine Translation system based on idiom recognition and fail softening”, in: MT Summit VII, Singapore, 469-475.
• Santos, D., (1990), “Lexical gaps and idioms in Machine Translation”, in: Karlgren, H. (Ed.), 13th COLING 1990, Helsinki, Finland, 330-335.
• Sumita, E.; Iida, H.; Kohyama, H., (1990), “Translating with Examples: A New Approach to Machine Translation”, in: 3rd TMI 1990, Texas, USA, 203-212.
• Trawinski, B.; Sailer, M.; Soehn, J. P.; Lemnitzer, L.; Richter, F., (2008), “Cranberry Expressions in English and German”, in: MWE Workshop 2008, at LREC Conference, 35-39.
• Volk, M., (1998), “The Automatic Translation of Idioms. Machine Translation vs. Translation Memory Systems”, in: Nico Weber (Ed.): Machine Translation: Theory, Applications, and Evaluation. An assessment of the state of the art. St. Augustin: Gardez-Verlag.
• Wehrli, E. (1998), “Translating Idioms”, in: 17th COLING 1998, Vol. 2, 1388-1392.