Language Divergences and Solutions
Advanced Machine Translation Seminar
Alison Alvarez
Overview
Introduction
Morphology Primer
Translation Mismatches
  Types
  Solutions
Translation Divergences
  Types
  Solutions
Different MT Systems
  Generation-Heavy Machine Translation
  DUSTer
Source ≠ Target
Languages don’t encode the same information in the same way
Makes MT complicated
Keeps all of us employed
Morphology in a Nutshell
Morphemes are word parts
  Work +er
  Iki +ta +ku +na +ku +na +ri +ma +shi +ta
Types of Morphemes
  Derivational: makes a new word
  Inflectional: adds information to an existing word
Morphology in a Nutshell
Analytic/Isolating
  Little or no inflectional morphology, separate words
  Vietnamese, Chinese
  I was made to go
Synthetic
  Lots of inflectional morphology
  Fusional vs. Agglutinating
  Romance Languages, Finnish, Japanese, Mapudungun
  Ika (to go) +se (to make/let) +rare (passive) +ta (past tense)
  He need +s (3rd person singular) it.
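The segmentations above can be written down as (morpheme, gloss) pairs. The following is a minimal Python sketch, assuming hand-segmented data; the helper names `surface` and `gloss` and the data layout are illustrative and not part of the seminar material.

```python
# Hand-segmented examples taken from the slide; the list-of-pairs layout is an assumption.
IKASERARETA = [("ika", "go"), ("se", "make/let"), ("rare", "passive"), ("ta", "past")]
NEEDS = [("need", "need"), ("s", "3rd person singular")]

def surface(morphemes):
    """Concatenate the morphemes into the surface word."""
    return "".join(form for form, _ in morphemes)

def gloss(morphemes):
    """Join the glosses with ' + ', one gloss per morpheme."""
    return " + ".join(meaning for _, meaning in morphemes)

print(surface(IKASERARETA), "=", gloss(IKASERARETA))
print(surface(NEEDS), "=", gloss(NEEDS))
```

The agglutinating example packs four morphemes into one word, where the analytic English rendering ("was made to go") spreads the same information over separate words.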
Translation Differences
Types
Translation Mismatches
  Different information from source to target
Translation Divergences
  Same information from source to target, but the meaning is distributed differently in each language
Translation Mismatches
“…the information that is conveyed is different in the source and target languages”
Types:
  Lexical level
  Typological level
Lexical Mismatches
A lexical item in one language may have more distinctions than in another
Brother
  弟 (otouto): Younger Brother
  兄さん (Ani-san): Older Brother
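The brother example can be sketched as a toy bilingual lexicon. This is an illustration only: the dictionary entries, the `relative_age` feature name, and both helper functions are assumptions made for the sketch.

```python
# Toy lexicon built from the slide's example; feature names are hypothetical.
JA_TO_EN = {"otouto": "brother", "ani-san": "brother"}
EN_TO_JA = {("brother", "younger"): "otouto", ("brother", "older"): "ani-san"}

def ja_to_en(word):
    # Japanese -> English: the age distinction is simply dropped.
    return JA_TO_EN[word]

def en_to_ja(word, relative_age=None):
    # English -> Japanese: without an age feature we can only list the candidates.
    if relative_age is None:
        return sorted(ja for (en, _), ja in EN_TO_JA.items() if en == word)
    return [EN_TO_JA[(word, relative_age)]]

print(ja_to_en("otouto"))
print(en_to_ja("brother"))
print(en_to_ja("brother", "older"))
```

Going from Japanese to English always yields one answer, while the reverse direction stays ambiguous until something supplies the missing distinction.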
Typological Mismatches
Mismatch between languages with different levels of grammaticalization
One language may be more structurally complex
Examples: source marking, obligatory subjects
Typological Mismatches
Source: Quechua vs. English
  (they say) s/he was singing --> takisharansi
  taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative)
Obligatory Arguments: English vs. Japanese
  Kusuri wo nonda --> (I, you, etc.) took medicine.
  Makasemasu! --> (I’ll) leave (it) to (you)
Translation Mismatch Solutions
More information --> Less information (easy)
Less information --> More information (hard)
Context clues
Language Models
Generalization
Formal representations
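As one illustration of the "context clues" idea, a dropped Japanese subject could be filled from the preceding discourse. The sketch below assumes a deliberately naive heuristic (reuse the most recent subject); the function `fill_subject` and its parameters are hypothetical and do not describe any particular system.

```python
def fill_subject(predicate, context_subjects, default="(someone)"):
    """Supply a subject for a subjectless source sentence.

    predicate: English rendering of the predicate, e.g. "took medicine"
    context_subjects: subjects of the preceding sentences, most recent last
    default: placeholder used when the context offers nothing
    """
    # Naive assumption: the most recently mentioned subject is still the topic.
    subject = context_subjects[-1] if context_subjects else default
    return f"{subject} {predicate}"

print(fill_subject("took medicine", ["My mother", "She"]))
print(fill_subject("took medicine", []))
```

With an empty context the sketch can only emit a placeholder, which is why the less-to-more direction is the hard one.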
Translation Divergences
“…the same information is conveyed in source and target texts”
Divergences are quite common
  They occur in about one out of every three sentences in the TREC El Norte Newspaper corpus (Spanish-English)
Sentences can have multiple kinds of divergences
Translation Divergence Types
Categorial Divergence
Conflational Divergence
Structural Divergence
Head Swapping Divergence
Thematic Divergence
Categorial Divergence
Translation that uses different parts of speech
Tener hambre (have hunger) --> be hungry
Noun --> adjective
Conflational Divergence
The translation of two words using a single word that combines their meaning
Can also be called a lexical gap
  X stab Z --> X dar puñaladas a Z (X give stabs to Z)
  glastuinbouw --> cultivation under glass
Structural Divergence
A difference in the realization of incorporated arguments
PP to Object
  X entrar en Y (X enter in Y) --> X enter Y
  X ask for a referendum --> X pedir un referendum (ask-for a referendum)
Head Swapping Divergence
Involves the demotion of a head verb and the promotion of a modifier verb to head position
[S [NP [N Yo]] [VP [V entro] [PP en el cuarto] [VP corriendo]]]
[S [NP [N I]] [VP [V ran] [PP into the room]]]
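The head swap can be sketched as two table lookups: the Spanish head verb of motion becomes an English preposition, and the Spanish manner modifier becomes the English head verb. The tables and the function `head_swap` are illustrative assumptions; the English verb forms are hard-coded in the past tense to match the slide's English sentence.

```python
# Toy tables; entries are assumptions chosen to cover the slide's example.
PATH_VERB_TO_PREP = {"entrar": "into", "cruzar": "across"}
MANNER_TO_HEAD_VERB = {"corriendo": "ran", "nadando": "swam"}

def head_swap(subject, head_verb, location, manner):
    """Demote the Spanish head verb to a preposition and promote the manner
    modifier to the English head verb."""
    prep = PATH_VERB_TO_PREP[head_verb]
    verb = MANNER_TO_HEAD_VERB[manner]
    return f"{subject} {verb} {prep} {location}"

# "Yo entro en el cuarto corriendo": head verb entrar, modifier corriendo
print(head_swap("I", "entrar", "the room", "corriendo"))
```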
Thematic Divergence
This divergence occurs when sentence arguments switch argument roles from one language to another
X gustar a Y (X please to Y) --> Y like X
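Structural and thematic divergences can both be pictured as a remapping of argument slots. The sketch below is a hand-written toy, assuming a flat frame representation; the slot names (`subj`, `obj`, `pp_en`, `pp_a`) and the `RULES` table are hypothetical.

```python
# Each rule maps Spanish slots to English slots; the rules are illustrative only.
RULES = {
    # structural divergence: the PP argument becomes a direct object
    "entrar": {"verb": "enter", "map": {"subj": "subj", "pp_en": "obj"}},
    # thematic divergence: the two arguments swap roles
    "gustar": {"verb": "like", "map": {"subj": "obj", "pp_a": "subj"}},
}

def transfer(frame):
    """Rewrite a Spanish verb frame into an English one using RULES."""
    rule = RULES[frame["verb"]]
    english = {"verb": rule["verb"]}
    for src_slot, tgt_slot in rule["map"].items():
        english[tgt_slot] = frame[src_slot]
    return english

def realize(frame):
    """Linearize an English frame as subject, verb, object."""
    return f'{frame["subj"]} {frame["verb"]} {frame["obj"]}'

print(realize(transfer({"verb": "entrar", "subj": "X", "pp_en": "Y"})))
print(realize(transfer({"verb": "gustar", "subj": "X", "pp_a": "Y"})))
```

A rule table like this needs one entry per verb, which hints at why hand-written transfer rules become costly as coverage grows.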
Divergence Solutions and Statistical/EBMT Systems
Not really addressed explicitly in SMT
Covered in EBMT only if it is covered extensively in the data
Divergence Solutions and Transfer Systems
Hand-written transfer rules
Automatic extraction of transfer rules from bi-texts
Problematic with multiple divergences
Divergence Solutions and Interlingua Systems
Mel’čuk’s Deep Syntactic Structure
Jackendoff’s Lexical Semantic Structure
Both require “explicit symmetric knowledge” from both source and target language
Expensive
Divergence Solutions and Interlingua Systems
John swam across a river
Juan cruzó el río nadando
[event CAUSE JOHN
  [event GO JOHN
    [path ACROSS JOHN
      [position AT JOHN RIVER]]]
  [manner SWIM+INGLY]]
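The bracketed structure above can be held as nested tuples, which makes it easy to print or walk. This is a minimal sketch, assuming a (type, primitive, arguments...) tuple layout chosen for illustration; it is not the representation used by any particular interlingua system.

```python
# The slide's structure as nested tuples: (type, primitive, *arguments).
LCS = ("event", "CAUSE", "JOHN",
       ("event", "GO", "JOHN",
        ("path", "ACROSS", "JOHN",
         ("position", "AT", "JOHN", "RIVER"))),
       ("manner", "SWIM+INGLY"))

def render(node):
    """Print a node back in the bracketed notation."""
    if isinstance(node, str):
        return node
    kind, *rest = node
    return "[" + kind + " " + " ".join(render(part) for part in rest) + "]"

def primitives(node):
    """Collect the primitive of every typed node, depth-first."""
    if isinstance(node, str):
        return []
    found = [node[1]]
    for child in node[2:]:
        found.extend(primitives(child))
    return found

print(render(LCS))
print(primitives(LCS))
```

Both the English sentence (manner in the verb, path in the preposition) and the Spanish sentence (path in the verb, manner in the gerund) would have to be generated from this one structure.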
Generation-Heavy MT
Built to address language divergences
Designed for source-poor/target-rich translation
Non-Interlingual
Non-Transfer
Uses symbolic overgeneration to account for different translation divergences
Generation-Heavy MT
Source language
  syntactic parser
  translation lexicon
Target language
  lexical semantics, categorial variations & subcategorization frames for overgeneration
  Statistical language model
GHMT System
Analysis Stage
Independent of Target Language
Creates a deep syntactic dependency
  Only argument structure, top-level conceptual nodes & thematic-role information
  Should normalize over syntactic & morphological phenomena
Translation Stage
Converts SL lexemes to TL lexemes
Maintains dependency structure
Analysis/Translation Stage
GIVE (v) [cause go]
  agent: I
  theme: STAB (n)
  goal: JOHN
Generation Stage
Lexical & Structural Selection
Conversion to a thematic dependency
  Uses syntactic-thematic linking map
  “loose” linking
Structural expansion
  Addresses conflation & head-swapped divergences
Turn thematic dependency to TL syntactic dependency
  Addresses categorial divergence
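One way to picture structural expansion for a conflation case is to take the GIVE / STAB / JOHN dependency and propose a conflated variant headed by the verb form of the theme noun. The sketch below is a toy under stated assumptions: the dictionary layout, the `LIGHT_VERBS` set, the noun-to-verb table, and the decision to keep the role label "goal" are all illustrative, not the GHMT implementation.

```python
# Toy resources; both tables are assumptions for illustration.
LIGHT_VERBS = {"give", "make", "have", "take"}
NOUN_TO_VERB = {"stab": "stab"}  # categorial variation: noun -> verb

def conflate(dep):
    """Return a conflated variant of a light-verb dependency, if one applies.

    dep maps "head" and "theme" to (lemma, pos) pairs and
    "agent" and "goal" to plain strings.
    """
    head, _ = dep["head"]
    theme, theme_pos = dep["theme"]
    if head in LIGHT_VERBS and theme_pos == "n" and theme in NOUN_TO_VERB:
        # The theme noun is promoted to head verb; the light verb disappears.
        return {"head": (NOUN_TO_VERB[theme], "v"),
                "agent": dep["agent"],
                "goal": dep["goal"]}
    return dep

give_stab = {"head": ("give", "v"), "agent": "I",
             "theme": ("stab", "n"), "goal": "John"}
print(conflate(give_stab))
```

In an overgenerating system both the expanded and the conflated variant would be kept, leaving the choice to a later ranking step.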
Generation Stage: Structural Expansion
Generation Stage
Linearization Step
  Creates a word lattice to encode different possible realizations
  Implemented using oxyGen engine
Sentences ranked & extracted
  Nitrogen’s statistical extractor
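The overgenerate-then-rank idea can be sketched with a tiny lattice and a toy bigram scorer. This does not use or imitate the oxyGen or Nitrogen interfaces; the lattice, the log-probabilities, and the function names are all made up for illustration.

```python
from itertools import product

# A simple lattice: one list of alternatives per position (toy data).
LATTICE = [["I"], ["stabbed", "gave stabs to", "knifed"], ["John"]]

# Made-up bigram log-probabilities; unseen bigrams get a floor score.
BIGRAM_LOGP = {("I", "stabbed"): -1.0, ("stabbed", "John"): -1.5,
               ("I", "gave"): -2.0, ("to", "John"): -1.0}

def paths(lattice):
    """Enumerate every sentence the lattice encodes."""
    for choice in product(*lattice):
        yield " ".join(choice)

def score(sentence, floor=-10.0):
    """Sum bigram log-probabilities over adjacent word pairs."""
    words = sentence.split()
    return sum(BIGRAM_LOGP.get(pair, floor) for pair in zip(words, words[1:]))

def best(lattice):
    """Pick the highest-scoring realization."""
    return max(paths(lattice), key=score)

for candidate in sorted(paths(LATTICE), key=score, reverse=True):
    print(score(candidate), candidate)
```

Enumerating every path only works for toy lattices; a real extractor would search the lattice without expanding it.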
GHMT Results
4 of 5 Spanish-English divergences “can be generated using structural expansion & categorial variations”
The remaining 1 out of 5 needed more world knowledge or idiom handling
SL syntactic parser can still be hard to come by
Divergences and DUSTer
Helps to overcome divergences for word alignment & improve coder agreement
Changes an English sentence structure to resemble another language
More accurate alignment and projection of dependency trees without training on dependency tree data
DUSTer
Motivation for the development of automatic correction of divergences
1. “Every Language Pair has translation divergences that are easy to recognize”
2. “Knowing what they are and how to accommodate them provides the basis for refined word level alignment”
3. “Refined word-level alignment results in improved projection of structural information from English to another language”
DUSTer
Bi-text parsed on English side only
“Linguistically Motivated” & common search terms
Conducted on Spanish & Arabic (and later Chinese & Hindi)
Uses all of the divergences mentioned before, plus a “light verb” divergence
  try --> put to trying (poner a prueba)
DUSTer Rule Development Methods
Identify canonical transformations for each divergence type
Categorize English sentences into divergence type or “none”
Apply appropriate transformations
Humans align E E’ foreign language
DUSTer Rules
# "kill" => "LightVB kill(N)" (LightVB = light verb)
# Presumably, this will work for "kill" => "give death to"
# "borrow" => "take lent (thing) to"
# "hurt" => "make harm to"
# "fear" => "have fear of"
# "desire" => "have interest in"
# "rest" => "have repose on"
# "envy" => "have envy of"
type1.B.X [English{2 1 3} Spanish{2 1 3 4 5} ]
[ Verb<1,i,CatVar:V_N> [ Noun<2,j,Subj> ] [ Noun<3,k,Obj> ] ]
<-->
[ LightVB<1,Verb> [ Noun<2,j,Subj> ] [ Noun<3,i,Obj> ] [ Oblique<4,Pred,Prep> [ Noun<5,k,PObj> ] ] ]
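The effect of the light-verb rule on an English subject-verb-object triple can be sketched as follows. The table reuses correspondences listed in the rule comments; the function `to_e_prime` and the flat token-list output are assumptions for illustration and do not reproduce DUSTer's rule engine.

```python
# verb -> (light verb, nominal form, preposition), taken from the rule comments.
LIGHT_VERB_FORMS = {
    "hurt": ("make", "harm", "to"),
    "fear": ("have", "fear", "of"),
    "desire": ("have", "interest", "in"),
    "envy": ("have", "envy", "of"),
}

def to_e_prime(subj, verb, obj):
    """Rewrite 'subj verb obj' as 'subj lightverb noun prep obj' when a rule applies.

    Returns the tokens of the restructured English sentence (E'), or the
    original three tokens when the verb has no light-verb entry.
    """
    if verb not in LIGHT_VERB_FORMS:
        return [subj, verb, obj]
    light, noun, prep = LIGHT_VERB_FORMS[verb]
    return [subj, light, noun, prep, obj]

print(" ".join(to_e_prime("John", "fear", "dogs")))
print(" ".join(to_e_prime("John", "see", "dogs")))
```

The rewritten English has five positions, mirroring the rule's Spanish{2 1 3 4 5} side, so each token has a closer counterpart for word alignment.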
DUSTer Results
Conclusion
Divergences are common
They are not handled well by most MT systems
GHMT can account for divergences, but still needs development
DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge
The End
Questions?
References
Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, pp. 597--633, 1994.
Dorr, Bonnie J. and Nizar Habash, "Interlingua Approximation: A Generation-Heavy Approach," In Proceedings of Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1--6, 2002.
Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker, "Concept Based Lexical Selection," Proceedings of the AAAI-94 fall symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21--30, 1994.
Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash, "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment," Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31--43, 2002.
Habash, Nizar and Bonnie J. Dorr, "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation," In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84--93, 2002.
Haspelmath, Martin, Understanding Morphology. Oxford University Press, 2002.
Kameyama, Megumi, Ryo Ochitani, and Stanley Peters, "Resolving Translation Mismatches With Information Flow," Annual Meeting of the Association for Computational Linguistics, 1991.