lemmatizer czechtoenglish ml

Post on 11-Jun-2015


Effects of Lemmatization on Czech-English Statistical MT

Ashley Gill
University of Washington, Seattle, WA
gillak@u.washington.edu

Parinita
University of Washington, Seattle, WA
parinita@u.washington.edu

Motivation

• Morphologically rich language -> English
• Source (Czech)
- functions are expressed as endings (inflections)
- each surface form of a word (prefix+stem+suffix) occurs fewer times in the corpus, causing data sparsity
- free word order
• Target (English)
- word order
- function words
• Goal
- to improve word alignments
• Approach
- analyze surface word forms into lemma and morphology, e.g.: car +plural
- translate lemma and morphology separately
- generate the target surface form
- experiment with the different POS

Experiments

• The most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006).

• Baseline - no changes
• ALemma - all words were lemmatized
• NLemma - only nouns were lemmatized
• VLemma - only verbs were lemmatized

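System Overview

[Diagram: Source Corpus -> Lemmatizer (ALemma, NLemma, VLemma) -> Lemmatized Source Corpus -> Moses Toolkit -> Target Translation. For the Baseline, the source corpus goes to the Moses Toolkit without lemmatization.]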

Lemmatizer

• ‘The Free Morphology (FM)’ tool (Hajic 2001)
• universal (i.e., language-independent) morphology tool (FMAnalyze.pl)
• analyzes word forms for inflective languages
• includes a frequency-based, high-coverage Czech dictionary
• Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category – we used the tags for nouns and verbs

Example:

Input: Prezident rezignoval na svou funkci.

Output:
<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>

Preprocessing

• The output from the FM is one token per line
• There is no markup for sentence delimiters
• We inserted a simple sentence delimiter, “*”, in the corpus (it does not occur naturally in the corpus)
• For each word from the FM file:
- ALemma experiment - use the lemma instead of the original word
- NLemma experiment - use the lemma only if the first position of the FM output markup is “N” (denoting a noun)
- VLemma experiment - use the lemma only if the first position of the FM output markup is “V” (denoting a verb)
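The selection rule for the three experiments can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function names, the regular expression, and the choice to look only at the first tag when FM lists several candidates are all assumptions. Lemmas are kept exactly as FM prints them, including technical suffixes such as `_:T`; the slides do not say whether these were stripped.

```python
import re

# One token in the FM output, following the example on the Lemmatizer slide:
#   <f ...>form<MMl>lemma<MMt>tag[<MMt>tag ...]
# Punctuation tokens use <d> in place of <f>.
TOKEN_RE = re.compile(r"<[fd][^>]*>([^<]+)<MMl>([^<]+)((?:<MMt>[^<]+)+)")


def parse_fm(text):
    """Return (form, lemma, first_tag) triples found in FM output text."""
    triples = []
    for form, lemma, tags in TOKEN_RE.findall(text):
        # Assumption: when FM lists several candidate tags, use the first one.
        first_tag = tags.split("<MMt>")[1]
        triples.append((form.strip(), lemma.strip(), first_tag.strip()))
    return triples


def choose(form, lemma, tag, mode):
    """Pick the lemma or the surface form according to the experiment."""
    if mode == "alemma":
        return lemma
    if mode == "nlemma" and tag.startswith("N"):
        return lemma
    if mode == "vlemma" and tag.startswith("V"):
        return lemma
    return form  # baseline, or a word the experiment leaves untouched


def lemmatize(text, mode):
    """Apply one experiment ('alemma', 'nlemma', 'vlemma') to FM output."""
    return [choose(f, l, t, mode) for f, l, t in parse_fm(text)]


def split_sentences(tokens, delimiter="*"):
    """Rebuild sentences from a flat token list using the inserted delimiter."""
    sentences, current = [], []
    for tok in tokens:
        if tok == delimiter:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences
```

Calling `lemmatize(fm_text, "vlemma")` on the FM output for "Prezident rezignoval na svou funkci." would replace only `rezignoval` with its lemma, since it is the only token whose tag starts with "V".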

Corpus: News Commentary corpus

                     Input sentences   Output sentences
baseline             35000             14453
ALemma               70048             13136
NLemma               70048             15737
VLemma               70048             20686
original baseline    70048             62610

As the baseline for comparison, we used a corpus of about half the size: 35,000 lines, which comes to 14453 lines after removing sentences longer than 40 tokens.
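The length filter mentioned here can be sketched as below. The slides only state the 40-token limit; that a sentence pair is dropped when either side is too long or empty, and the name `filter_by_length`, are assumptions of this sketch rather than a description of the script actually used.

```python
def filter_by_length(src_lines, tgt_lines, max_tokens=40):
    """Keep only sentence pairs where both sides have 1..max_tokens tokens."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        # Assumption: a pair is removed if either side is empty or too long.
        if 1 <= n_src <= max_tokens and 1 <= n_tgt <= max_tokens:
            kept.append((src, tgt))
    return kept
```

Filtering pairs rather than single sides keeps the source and target files aligned line by line, which word alignment requires.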

Results:

BASELINE: BLEU = 4.24, 27.9/9.3/2.2/0.7 (BP=0.931, ratio=0.933, hyp_len=46470, ref_len=49805)

ALEMMA: BLEU = 8.60, 36.4/13.7/5.2/2.1 (BP=1.000, ratio=1.177, hyp_len=58645, ref_len=49805)

NLEMMA: BLEU = 10.09, 40.0/15.7/6.2/2.7 (BP=1.000, ratio=1.108, hyp_len=55174, ref_len=49805)

VLEMMA: BLEU = 13.06, 44.1/19.1/8.5/4.1 (BP=1.000, ratio=1.017, hyp_len=50652, ref_len=49805)

original baseline (full corpus)

BLEU = 18.89, 53.0/27.0/14.1/7.9 (BP=0.946, ratio=0.947, hyp_len=47182, ref_len=49805)

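BLEU improves over the baseline in every lemmatized setting: it roughly doubles for ALEMMA (4.24 to 8.60) and roughly triples for VLEMMA, where only verbs are lemmatized (4.24 to 13.06).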
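Each result line reports the BLEU score, the four n-gram precisions (1-gram to 4-gram, in percent), and the brevity penalty BP. As a check on how these fit together, the sketch below recomputes BLEU from the reported precisions and lengths using the standard definition (BP times the geometric mean of the precisions). Because the precisions are rounded to one decimal, the recomputed scores only agree with the reported ones to within roughly 0.1. The function names are illustrative.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is long enough."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(precisions, hyp_len, ref_len):
    """BLEU on a 0-100 scale from n-gram precisions given in percent."""
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# (precisions, hyp_len, reported BLEU), copied from the result lines above.
REPORTED = {
    "BASELINE": ([27.9, 9.3, 2.2, 0.7], 46470, 4.24),
    "ALEMMA": ([36.4, 13.7, 5.2, 2.1], 58645, 8.60),
    "NLEMMA": ([40.0, 15.7, 6.2, 2.7], 55174, 10.09),
    "VLEMMA": ([44.1, 19.1, 8.5, 4.1], 50652, 13.06),
    "original baseline": ([53.0, 27.0, 14.1, 7.9], 47182, 18.89),
}
REF_LEN = 49805

if __name__ == "__main__":
    base = REPORTED["BASELINE"][2]
    for name, (precisions, hyp_len, reported) in REPORTED.items():
        score = bleu(precisions, hyp_len, REF_LEN)
        print(f"{name}: recomputed {score:.2f}, reported {reported:.2f}, "
              f"{reported / base:.2f}x baseline")
```

The last column makes the "double" and "triple" claims concrete as ratios of the reported scores to the reported baseline.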

BASELINE OUTPUT:
rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné aaverze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALEMMA OUTPUT:
rasov~R , divided europe
in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this iis an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLEMMA OUTPUT:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Dictionary

Czech      English      POS
rasismus   racialism    Noun
rasismus   racism       Noun
penova     foam         Noun
averze     abhorrence   Noun
averze     disliking    Noun
averze     loathing     Noun
imigrant   immigrant    Noun

VLemma-Dict-Output:
race-specific divided europe
in fact the extreme right is its rasismus[racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova[foam] national fronts - all of this is happening parties or movement would be held and of from the common averze[loathing] towards imigrant[immigrants] and pushing of makes it easier to this view , to question the immigrants .

VLemma Output:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .
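The dictionary step shown on this slide, where Czech words left untranslated in the VLemma output are followed by an English gloss in square brackets, can be sketched as below. The slides do not say how one gloss was chosen when the dictionary lists several, so the one-gloss-per-word mapping and the name `annotate` are assumptions made for illustration.

```python
# Czech lemma -> one English gloss, taken from the dictionary entries on this
# slide. Choosing a single gloss per word is an assumption of this sketch.
GLOSSES = {
    "rasismus": "racism",
    "penova": "foam",
    "averze": "loathing",
    "imigrant": "immigrant",
}


def annotate(output, glosses):
    """Append '[gloss]' to every token that has a dictionary entry."""
    annotated = []
    for tok in output.split():
        if tok in glosses:
            annotated.append(f"{tok}[{glosses[tok]}]")
        else:
            annotated.append(tok)
    return " ".join(annotated)
```

For example, `annotate("the common averze towards imigrant", GLOSSES)` yields `the common averze[loathing] towards imigrant[immigrant]`. Note that a plain lookup also glosses proper nouns by mistake, as with `penova[foam]`.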

Limitations

• No use of syntax/POS in sentence reordering
• Phenomena such as ‘pronoun dropping’, which occurs in Czech, are not tested for accuracy in translation
• No human cross-evaluation for a better understanding of the improvement in results
• Does not cover the effect of target-language morphology on translation (Zhang et al., 2007)

Future Direction

• Add syntactic information to improve word reordering and language modeling
• Carry out experiments with other languages
• Test a pipeline of lemmatization steps to improve word alignment. But what about syntax?

[Diagram: proposed pipeline, with boxes labeled Source Language, VLEMMA output, Lemmatize nouns, and English.]
