Effects of Lemmatization on Czech-English Statistical MT
Ashley Gill
University of Washington
Seattle, [email protected]
Parinita
University of Washington
Seattle, [email protected]
Motivation
• Morphologically rich language -> English
• Source (Czech)
- functions expressed as endings (inflections)
- fewer instances of the surface form of a word (prefix+stem+suffix) occur in the corpus, causing data sparsity
- free word order
• Target (English)
- word order
- function words
• Goal
- to improve word alignments
• Approach
- analyze surface word forms into lemma and morphology, e.g.: car +plural
- translate lemma and morphology separately
- generate target surface form
- experiment with the different POS
Experiments
• The most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006).
• Baseline: no changes.
• ALemma: all words were lemmatized.
• NLemma: only nouns were lemmatized.
• VLemma: only verbs were lemmatized.
System Overview
[Figure: Source Corpus -> Lemmatizer (ALemma, NLemma, VLemma) -> Lemmatized Source Corpus -> Moses Toolkit -> Target Translation. The Baseline passes the unlemmatized source corpus to the Moses Toolkit.]
Lemmatizer
• ‘The Free Morphology (FM)’ tool (Hajic, 2001).
• Universal (i.e., language-independent) morphology tool (FMAnalyze.pl).
• Analyzes word forms for inflective languages.
• Includes a frequency-based, high-coverage Czech dictionary.
• Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. We used the tags for nouns and verbs.
Examples:
Input: Prezident rezignoval na svou funkci.
Output:
<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D><d>.<MMl>.<MMt>Z:-------------
</csts>
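The FM output shown above can be read with a few regular expressions. The following is a minimal sketch, not the authors' code: the function names `parse_fm_line` and `clean_lemma` are hypothetical, only the first lemma of a token is kept, and the stripping of lemma suffixes such as `_:T` or `-1_^(...)` is an assumption, since the slides do not say whether these were removed.

```python
import re

# Word form follows <f ...> (word) or <d> (punctuation); lemma follows <MMl>;
# each candidate positional tag follows <MMt>.
FORM_RE = re.compile(r"<[fd](?:\s[^>]*)?>([^<]*)")
LEMMA_RE = re.compile(r"<MMl>([^<]*)")
TAG_RE = re.compile(r"<MMt>([^<]*)")


def parse_fm_line(line):
    """Return (form, lemma, tags) for one token line, or None if no token is found."""
    form = FORM_RE.search(line)
    lemma = LEMMA_RE.search(line)
    if form is None or lemma is None:
        return None
    return form.group(1), lemma.group(1), TAG_RE.findall(line)


def clean_lemma(lemma):
    """Assumption: drop the technical suffix after '_' and a trailing sense number."""
    base = lemma.split("_", 1)[0]
    base = re.sub(r"-\d+$", "", base)
    return base or lemma


if __name__ == "__main__":
    line = "<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----"
    print(parse_fm_line(line))
    print(clean_lemma("svůj-1_^(přivlast.)"))
```

Ambiguous tokens such as "funkci" carry several candidate tags; here they are all returned so that a later step can decide which one to trust.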
Preprocessing
• The FM output has one token per line.
• There is no markup for sentence delimiters.
• We inserted a simple sentence delimiter, “*”, into the corpus (it does not occur naturally in the corpus).
• For each word from the FM file:
- ALemma experiment: use the lemma instead of the original word.
- NLemma experiment: use the lemma only if the first position of the FM output tag is “N” (denoting a noun).
- VLemma experiment: use the lemma only if the first position of the FM output tag is “V” (denoting a verb).
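The selection rule for the three experiments can be sketched as follows. This is an illustration under assumptions: the names `choose_token` and `lemmatize_corpus` are hypothetical, tokens are given as already-parsed (form, lemma, tag) triples, and for ambiguous tokens only the first candidate tag is consulted.

```python
def choose_token(form, lemma, tag, mode):
    """Pick the lemma or the surface form according to the experiment."""
    if mode == "alemma":
        return lemma
    if mode == "nlemma" and tag.startswith("N"):
        return lemma
    if mode == "vlemma" and tag.startswith("V"):
        return lemma
    return form  # baseline, or a POS that this experiment leaves untouched


def lemmatize_corpus(tokens, mode, delimiter="*"):
    """Rebuild one sentence per line from a token stream split by the delimiter."""
    sentences, current = [], []
    for form, lemma, tag in tokens:
        if form == delimiter:
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(choose_token(form, lemma, tag, mode))
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    tokens = [
        ("Prezident", "prezident", "NNMS1-----A----"),
        ("rezignoval", "rezignovat", "VpYS---XR-AA---"),
        ("na", "na", "RR--4----------"),
        ("svou", "svůj", "P8FS4---------1"),
        ("funkci", "funkce", "NNFS3-----A----"),
        (".", ".", "Z:-------------"),
        ("*", "*", ""),
    ]
    for mode in ("baseline", "alemma", "nlemma", "vlemma"):
        print(mode, lemmatize_corpus(tokens, mode))
```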
Corpus: News Commentary corpus

                     Input sentences    Output sentences
baseline             35000              14453
ALemma               70048              13136
NLemma               70048              15737
VLemma               70048              20686
original baseline    70048              62610
For comparison, the baseline uses a corpus of about half the size: 35,000 lines, of which 14,453 remain after removing sentences longer than 40 tokens.
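A length filter of this kind can be sketched as below. The slides do not state which tool was used or whether the limit was applied to both sides of the parallel corpus, so the function name `filter_parallel`, whitespace tokenization, the two-sided check, and the removal of empty lines are all assumptions.

```python
def filter_parallel(src_lines, tgt_lines, max_tokens=40):
    """Keep sentence pairs where both sides have between 1 and max_tokens tokens."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if 1 <= n_src <= max_tokens and 1 <= n_tgt <= max_tokens:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    src = ["prezident rezignovat na svůj funkce .", " ".join(["slovo"] * 41)]
    tgt = ["the president resigned from his post .", "a short sentence ."]
    print(len(filter_parallel(src, tgt)))  # only the first pair survives
```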
Results:
BASELINE: BLEU = 4.24, 27.9/9.3/2.2/0.7 (BP=0.931, ratio=0.933, hyp_len=46470, ref_len=49805)
ALEMMA: BLEU = 8.60, 36.4/13.7/5.2/2.1 (BP=1.000, ratio=1.177, hyp_len=58645, ref_len=49805)
NLEMMA: BLEU = 10.09, 40.0/15.7/6.2/2.7 (BP=1.000, ratio=1.108, hyp_len=55174, ref_len=49805)
VLEMMA: BLEU = 13.06, 44.1/19.1/8.5/4.1 (BP=1.000, ratio=1.017, hyp_len=50652, ref_len=49805)
ORIGINAL BASELINE (full corpus): BLEU = 18.89, 53.0/27.0/14.1/7.9 (BP=0.946, ratio=0.947, hyp_len=47182, ref_len=49805)
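BLEU improves over the half-size baseline with lemmatization: the score roughly doubles for ALEMMA (8.60 vs. 4.24) and roughly triples for VLEMMA, where only verbs are lemmatized (13.06 vs. 4.24).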
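Each score line above lists the four n-gram precisions, the brevity penalty (BP), and the hypothesis and reference lengths. As a rough check, BLEU can be recomputed from these components with the standard formula: BP times the geometric mean of the precisions. The sketch below uses hypothetical function names, and because the reported precisions are rounded to one decimal, the recomputed scores land close to, but not exactly on, the reported ones.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """BP = 1 if the hypothesis is at least as long as the reference, else exp(1 - ref/hyp)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_components(precisions_pct, hyp_len, ref_len):
    """BLEU (0-100) from n-gram precisions given in percent."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


if __name__ == "__main__":
    # Values copied from the BASELINE and VLEMMA lines of the results.
    print(round(brevity_penalty(46470, 49805), 3))
    print(bleu_from_components([27.9, 9.3, 2.2, 0.7], 46470, 49805))
    print(bleu_from_components([44.1, 19.1, 8.5, 4.1], 50652, 49805))
```

The baseline's BP below 1 reflects output shorter than the reference, while the lemmatized systems produce output longer than the reference and so incur no penalty.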
BASELINE OUTPUT:
rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné aaverze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALEMMA OUTPUT:
rasov~R , divided europe
in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this iis an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .
Dictionary

Czech       English       POS
rasismus    racialism     Noun
rasismus    racism        Noun
penova      foam          Noun
averze      abhorrence    Noun
averze      disliking     Noun
averze      loathing      Noun
imigrant    immigrant     Noun

VLemma-Dict-Output:
race-specific divided europe
in fact the extreme right is its rasismus [racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova[foam] national fronts - all of this is happening parties or movement would be held and of from the common averse[loathing] towards immigrant[immigrants] and pushing of makes it easier to this view , to question the immigrants .

VLemma Output:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averse towards immigrant and pushing of makes it easier to this view , to question the immigrants .
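The dictionary step, which attaches a bracketed gloss to Czech words left untranslated in the VLemma output, can be sketched as a simple lookup. The function name `annotate_with_glosses` is hypothetical, and the gloss table below only copies the glosses that appear in brackets on this slide; how the authors chose among several dictionary entries for one word is not stated.

```python
def annotate_with_glosses(sentence, glosses):
    """Append '[gloss]' to every token that has an entry in the gloss table."""
    annotated = []
    for token in sentence.split():
        gloss = glosses.get(token)
        annotated.append(f"{token}[{gloss}]" if gloss else token)
    return " ".join(annotated)


if __name__ == "__main__":
    # Glosses as shown in brackets in the VLemma-Dict-Output.
    glosses = {
        "rasismus": "racism",
        "penova": "foam",
        "averze": "loathing",
        "imigrant": "immigrants",
    }
    print(annotate_with_glosses("the common averze towards imigrant", glosses))
```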
Limitations
• No use of syntax/POS in sentence reordering.
• Phenomena such as ‘pronoun dropping’, which occurs in Czech, are not tested for translation accuracy.
• No human cross-evaluation to better understand the improvement in results.
• Does not cover the effect of target-language morphology on translation (Zhang et al., 2007).
Future Direction
• Add syntactic information to improve word reordering and language modeling.
• Carry out experiments with other languages.
• Test a lemmatization pipeline to improve word alignment. But what about syntax?
[Figure: proposed lemmatization pipeline. Recoverable labels: Source Language, VLEMMA output (identical to the VLEMMA output shown under Results), Lemmatize nouns, English.]