
Effects of Lemmatization on Czech-English Statistical MT

Ashley Gill

University of Washington

Seattle, WA

[email protected]

Parinita

University of Washington

Seattle, WA

[email protected]

Abstract

This paper examines whether lemmatization affects Czech-to-English phrase-based machine translation. We vary the translation scenario by lemmatizing different parts of speech in Czech. Experimental results show a substantial improvement in translation quality, measured by BLEU, over a baseline trained on a corpus of comparable size. We then propose a simple post-processing step to improve the translation output. This approach is applicable to language pairs that differ in morphological richness.

1 Introduction

Czech, as a Slavic language, is highly inflectional and has almost free word order. Most of the functions that Czech expresses through endings (inflections) are rendered in English by word order and function words. As a result, each surface form of a word (prefix+stem+suffix) occurs fewer times in the corpus. This phenomenon, called data sparsity, is one of the factors that degrade statistical machine translation (SMT). Research in SMT increasingly makes use of linguistic analysis to improve performance. It is argued that by including abstract categories such as lemmas and parts of speech (POS) in the models, systems become better at handling sentences for which word-level training data is sparse. Existing work has shown that using morpho-syntactic information is an effective remedy for data sparseness. We present a phrase-based statistical machine translation approach that uses linguistic analysis in the preprocessing phase. The analysis consists of a morphological transformation: we apply lemmatization in several configurations and measure its effect on word alignment and vocabulary reduction.

2 System Overview

Previous work has shown that the most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006). In our experiments we aim to improve word alignments between Czech and English by replacing inflected verbs and nouns with their lemmas, which raises the frequency of each base form and should therefore improve alignment. Czech is a pro-drop language: pronouns representing the subject are usually left out, but the morphology of the verb indicates explicitly which pronoun was meant. By lemmatizing only verbs we hope to reduce this misalignment.

For this task we ran four experiments. The baseline experiment was carried out with no changes to the corpus. The other three used lemmatization: ALemma, where all words in the corpus were lemmatized; NLemma, where only words tagged as nouns were lemmatized; and VLemma, where only words tagged as verbs were lemmatized.

3 Components

3.1 Corpus

Training data is taken from the new News Commentary corpus. The released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format and is provided as raw text. To tune our system during development, we used a development set of 1057 sentences (News Commentary, nc-dev2007). To test our system during development, we used 2007 sentences (News Commentary, nc-test2007).

We tokenized the corpus and lowercased it. We removed sentences longer than 40 words, as required for GIZA++.

                     Input sentences   Output sentences
original baseline    70048             62610

Table 1: Corpus size before and after cleaning
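
The cleaning step can be sketched as follows. This is a minimal illustration, not the Moses cleaning script the experiments relied on: `clean_parallel` and its `max_len` parameter are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
def clean_parallel(src_lines, tgt_lines, max_len=40):
    """Lowercase both sides and drop sentence pairs that are empty or too long.

    Whitespace splitting is an assumption made for brevity; the paper
    used the Moses tooling for tokenization.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok = src.lower().split()
        tgt_tok = tgt.lower().split()
        # Empty sentences occur in the released data and cannot be aligned.
        if not src_tok or not tgt_tok:
            continue
        # Pairs with a side longer than max_len words are removed.
        if len(src_tok) > max_len or len(tgt_tok) > max_len:
            continue
        kept.append((" ".join(src_tok), " ".join(tgt_tok)))
    return kept
```

Dropping the pair when either side is too long keeps the two sides of the corpus parallel, which word alignment requires.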

The Moses toolkit (Koehn et al., 2007) represents the phrase-based approach to machine translation and operates only on the word level. We used this toolkit for preparing the data, building the language model, training the translation model, tuning, running the system on the development test set, and evaluation.

3.2 Lemmatizer

For lemmatization we used the Free Morphology (FM) tool (Hajic, 2001). FM is a pair of universal (i.e., language-independent) morphology tools (FMAnalyze.pl, FMGenerate.pl) for the analysis and generation of word forms in inflective languages. It comes with a frequency-based, high-coverage Czech dictionary. It takes a text file as input and returns its output in csts-like SGML markup, with one token per line.

Example:

Input: Prezident rezignoval na svou funkci.

Output:
<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>

Table 2: Sample FM analyzer output

FM can also work for other morphologically rich inflective languages that can be described by segmenting a word form into two parts: a root and an ending. Even if this is not quite justified linguistically, many phenomena that would normally break this simple rule can be made to work in this framework.

Special provision is made in the code for up to two "inflectional" prefixes, which may both be present in some word forms. Such prefixes are found in many Slavic languages, such as Czech, Slovak, and Polish.

Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. Categories that are not relevant for a given lemma (e.g. tense for nouns) are assigned a special value ("-" in the sample output in Table 2). We used this positional information from the lemmatizer output to re-create our lemmatized corpus for the different experiments.
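
As an illustration of how a positional tag can be read, the sketch below decodes the first five positions of a tag such as those in Table 2. The field names (`pos`, `subpos`, `gender`, `number`, `case`) follow the usual description of the Czech positional tagset and are an assumption here; the experiments in this paper rely only on the first position.

```python
# Names for the first five tag positions (assumed, see the note above).
POSITIONS = ("pos", "subpos", "gender", "number", "case")


def decode_tag(tag):
    """Map the leading positions of a positional tag to named fields."""
    return dict(zip(POSITIONS, tag[:len(POSITIONS)]))


def is_noun(tag):
    """A tag whose first position is 'N' denotes a noun."""
    return tag[:1] == "N"


def is_verb(tag):
    """A tag whose first position is 'V' denotes a verb."""
    return tag[:1] == "V"
```

For the tag `NNFS4-----A----` of "funkci", `decode_tag` yields part of speech `N`, gender `F`, number `S` and case `4`.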

3.3 PreProcessor

For each experiment, the training and test datasets are run through the FM lemmatizer. Because FM writes one token per line, sentence boundaries are not directly visible in its output, so we first insert a simple sentence delimiter, "*", which does not occur naturally in the corpus. This lets us determine where to place line breaks between sentences without disturbing the naturally occurring sentence punctuation. Next, for each word in the FM output file, we decide whether to use the original word or the lemma, depending on the experiment. For the ALemma experiment, we used the lemma instead of the original word for every word in the FM output file. For the NLemma experiment, we used the lemma only if the first position of the FM tag is "N" (denoting a noun). For the VLemma experiment, we used the lemma only if the first position of the FM tag is "V" (denoting a verb). After preprocessing, we saved the outputs in UTF-8 to ensure compatibility with GIZA++ and Moses.
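
The selection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessor used in the experiments: the function names are hypothetical, the regular expression assumes the token layout shown in Table 2, the first lemma and first tag are taken when FM lists several analyses, and the "*" delimiter is assumed to come back from FM as an ordinary token line.

```python
import re

# Matches a token line such as "<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----"
# and captures the word form, the first lemma and the first tag.
TOKEN_RE = re.compile(r"^<[fd][^>]*>([^<]*)<MMl>([^<]*)<MMt>([^<]*)")


def choose_form(fm_line, mode):
    """Return the form or the lemma for one FM line, or None for non-token lines."""
    match = TOKEN_RE.match(fm_line.strip())
    if match is None:
        return None
    form, lemma, tag = match.groups()
    if mode == "ALemma":
        return lemma
    if mode == "NLemma" and tag.startswith("N"):
        return lemma
    if mode == "VLemma" and tag.startswith("V"):
        return lemma
    return form


def rebuild_sentences(fm_lines, mode, delimiter="*"):
    """Reassemble one sentence per line, splitting at the delimiter token."""
    sentences, current = [], []
    for line in fm_lines:
        token = choose_form(line, mode)
        if token is None:
            continue
        if token == delimiter:
            if current:
                sentences.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        sentences.append(" ".join(current))
    return sentences
```

The sketch returns lemmas exactly as they appear in the markup, including technical suffixes such as the one in `rezignovat_:T`; whether such suffixes were stripped is not stated here, so the sketch leaves them in place.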

3.4 Corpus after Preprocessing

After preprocessing, the lemmatized corpora contained more words per sentence. We again removed sentences longer than 40 words, which reduced the corpus size substantially. A comparison with the original baseline, trained on the full corpus, would therefore not be a fair one. To get a better idea, we took a smaller subset of the unlemmatized corpus and treated it as the baseline system. This does not compare word alignments over the same sentences, but it gives a fairer comparison of systems trained on corpora of similar size with different settings.

            Input sentences   Output sentences
baseline    35000             14453
ALEMMA      70048             13136
NLEMMA      70048             15737
VLEMMA      70048             20686

Table 3: Experiment corpus size before and after cleaning

The output sentences were then used for building the language models and for training.

Lemmatization reduced data sparsity, as has been reported in various earlier papers. The figures for our data set are listed in Table 4.

            Total number   Vocabulary       Singletons
            of words       (unique words)   (words occurring only once)
Baseline    364762         23368            12258
ALEMMA      336507         13502            6560
NLEMMA      398979         17333            8772
VLEMMA      537715         21475            10136

Table 4: Reduction in data sparseness
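
The three quantities in Table 4 can be computed with a simple token count. The sketch below is illustrative only: `sparsity_stats` is a hypothetical helper, and it assumes the corpus is already tokenized with tokens separated by whitespace.

```python
from collections import Counter


def sparsity_stats(sentences):
    """Return (total tokens, vocabulary size, number of singletons)."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    total = sum(counts.values())
    vocabulary = len(counts)
    # Singletons are word types that occur exactly once in the corpus.
    singletons = sum(1 for count in counts.values() if count == 1)
    return total, vocabulary, singletons
```

Running the same function over the baseline and over each lemmatized corpus gives directly comparable rows.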

4 Results

The BLEU scores improve markedly on the lemmatized corpora relative to the reduced baseline. Lemmatizing all words roughly doubles the baseline score. Lemmatizing only nouns increases the score further, and the best BLEU score is obtained when we lemmatize only the verbs: about three times the baseline score. All three remain below the original baseline trained on the full corpus. Lemmatizing verbs is useful not only for improving BLEU but can also be used to improve the readability of the translation. Once the verbs are aligned correctly, nouns are mostly the only words that remain to be translated. Thus, after the VLemma translation is complete, a simple post-processing step that replaces the untranslated nouns using a dictionary can improve the translations further. Table 5 shows the BLEU scores for the experiments we carried out.

Experiment                        BLEU    n-gram precisions (1/2/3/4)   BP      ratio   hyp_len   ref_len
BASELINE                          4.24    27.9/9.3/2.2/0.7              0.931   0.933   46470     49805
ALEMMA                            8.60    36.4/13.7/5.2/2.1             1.000   1.177
NLEMMA                            10.09   40.0/15.7/6.2/2.7             1.000   1.108
VLEMMA                            13.06   44.1/19.1/8.5/4.1             1.000   1.017
original baseline (full corpus)   18.89   53.0/27.0/14.1/7.9            0.946   0.947

Table 5: BLEU scores
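
The BLEU figures in Table 5 combine the four n-gram precisions with a brevity penalty (BP). The sketch below recomputes the score from those components using the standard BLEU definition (geometric mean of the precisions, multiplied by BP). The function names are illustrative, and because the table reports rounded precisions, the recomputed value only approximates the reported one.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is long enough."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_components(precisions_pct, bp):
    """BLEU (in percent) from n-gram precisions given in percent and a BP."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * bp * math.exp(log_mean)


# Components taken from the BASELINE and VLEMMA rows of Table 5.
baseline_bp = brevity_penalty(46470, 49805)
vlemma_bleu = bleu_from_components([44.1, 19.1, 8.5, 4.1], 1.0)
```

The brevity penalty only lowers the score when the output is shorter than the reference, which is why it is 1.000 for the three lemmatized systems, whose length ratios exceed 1.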

The BLEU scores agree with the decoded translations: the translated text improves noticeably, as can be seen in Table 6.

BASELINE: rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné averze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALemma: rasov~R , divided europe
in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this is an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLemma: race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLemma: race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Table 6: Sample outputs

We propose that lemmatizing a specific part of speech improves the word alignments, and that two possible approaches to a complete translation follow from this. One is to create a pipeline of translations: for example, the output of NLemma becomes the source language of VLemma, and translation is done from a "verb-lemmatized NLemma output" to English. The second approach is to apply a dictionary or lexicon to the translation output of the lemmatized corpus. Looking at the VLemma output in Table 6, we can apply a simple dictionary such as the one in Table 7.

Czech      English      POS
rasismus   racialism    noun
rasismus   racism       noun
penova     foam         noun
averze     abhorrence   noun
averze     disliking    noun
averze     loathing     noun
imigrant   immigrant    noun

Table 7: Sample dictionary

In the resulting output, the previously unresolved nouns are given dictionary translations (shown in brackets in Table 8).

VLEMMA-DICT: race-specific divided europe
in fact the extreme right is its rasismus[racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova[foam] national fronts - all of this is happening parties or movement would be held and of from the common averze[loathing] towards immigrant[immigrants] and pushing of makes it easier to this view , to question the immigrants .

Table 8: Post-processed output
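
The dictionary step can be sketched as a token-by-token lookup. This is a minimal illustration, not the procedure used to produce Table 8: `annotate_with_dictionary` is a hypothetical name, the bracketed annotation format follows Table 8, and letting the last dictionary entry win when a Czech word has several translations is an assumption, since the text does not say how such ambiguity is resolved.

```python
def annotate_with_dictionary(sentence, entries):
    """Append a bracketed dictionary translation to every token found in entries.

    entries is a list of (czech, english) pairs; when a Czech word appears
    more than once, the last pair overwrites the earlier ones (assumption).
    """
    lookup = dict(entries)
    annotated = []
    for token in sentence.split():
        gloss = lookup.get(token)
        annotated.append(f"{token}[{gloss}]" if gloss else token)
    return " ".join(annotated)


# Entries taken from Table 7.
SAMPLE_ENTRIES = [
    ("rasismus", "racialism"),
    ("rasismus", "racism"),
    ("penova", "foam"),
    ("averze", "abhorrence"),
    ("averze", "disliking"),
    ("averze", "loathing"),
    ("imigrant", "immigrant"),
]
```

A context-free lookup of this kind cannot tell when a dictionary entry is the wrong sense for the sentence, so it inherits any errors in the dictionary.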

5 Limitations

The results are compared using BLEU scores only. Specific phenomena such as the pronoun dropping that occurs in Czech were not tested for translation accuracy. The preprocessing steps were geared more towards improving BLEU than towards improving the actual translations. While BLEU is a standard evaluation metric, human evaluation of such phenomena would have given a better understanding of the improvement in results.

The experiments do not cover the effect of target-language morphology on translation. Experiments such as measuring the effect of lemmatizing both the source and the target language on word alignment could also be carried out (Zhang and Sumita, 2007).

6 Future Paths

Due to time constraints we could not carry out experiments to measure whether a pipeline of lemmatization steps improves translation quality. In the future we would like to compare the dictionary method and the pipeline method using both BLEU scores and human evaluation.

We would also like to add more syntactic information to improve word reordering and language modelling, and to carry out experiments on other languages.

7 Conclusion

We have studied the effect of lemmatizing different parts of speech on word alignment for Czech-to-English SMT. We found that lemmatizing verbs yields the largest improvement in BLEU score. We have concluded with an approach that improves translations by lemmatizing verbs to improve alignments and then replacing the unresolved nouns and adjectives using a Czech-English dictionary. This approach is applicable to translation from any morphologically rich language into a morphologically simpler one.

References

Adria de Gispert, Deepa Gupta, Maja Popovic, Patrik Lambert, Jose B. Marino, Marcello Federico, Hermann Ney and Rafael Banchs. 2006. Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Advances in Natural Language Processing, Vol. 4139:368-379.

Bettina Schrader. 2004. Improving word alignment quality using linguistic knowledge. In Workshop Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 46-49.

Maria Holmqvist, Sara Stymne and Lars Ahrenberg. 2007. Getting to know Moses: Initial experiments on German-English factored translation. In Proceedings of ACL Second Workshop in Statistical Machine Translation, 181-184.

Ondřej Bojar and Magdalena Prokopová. 2006. Czech-English Word Alignment. In Proceedings of LREC'06, 1236-1239. ELRA.

Ondřej Bojar, Evgeny Matusov and Hermann Ney. 2006. Czech-English Phrase-Based Machine Translation. Institute of Formal and Applied Linguistics, Czech Republic.

Ondřej Bojar, David Marecek, Vaclav Novak, Martin Popel, Jan Ptacek, Jan Rous and Zdenek Zabokrtsky. 2009. English-Czech MT in 2008. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 125-129.

Ondřej Bojar and Jan Hajic. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, 143-146.

Sharon Goldwater and David McClosky. 2005. Improving Statistical MT through Morphological Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 676-683.

Ruiqiang Zhang and Eiichiro Sumita. 2007. Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation. National Institute of Information and Communications Technology, Spoken Language Communication Research Laboratories, Japan.

Hua Wu, Haifeng Wang and Zhanyi Liu. 2006. Boosting Statistical Word Alignment Using Labeled and Unlabeled Data. Toshiba Research and Development Center, China.

Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan. Association for Computational Linguistics.

Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 486-494, Suntec, Singapore, August. Association for Computational Linguistics.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, 177-180. Association for Computational Linguistics, Prague, Czech Republic.