Effects of Lemmatization on Czech-English Statistical MT
Ashley Gill
University of Washington
Seattle, WA
Parinita
University of Washington
Seattle, WA
Abstract
This paper examines whether lemmatization affects Czech-to-English phrase-based machine translation. We vary the translation scenario by lemmatizing different parts of speech in the Czech source. Experimental results show a significant improvement in translation quality in terms of BLEU. We then propose a simple step to improve the translation output further. The approach is applicable to language pairs that differ in morphological richness.
1 Introduction
Czech, as a Slavic language, is highly inflectional and has almost free word order. Most of the functions that Czech expresses through endings (inflections) are rendered in English by word order and function words. As a result, each surface form of a word (prefix + stem + suffix) occurs fewer times in the corpus. This phenomenon, called data sparsity, is one of the factors that degrade statistical machine translation (SMT). Research in SMT increasingly makes use of linguistic analysis in order to improve performance. By including abstract categories such as lemmas and parts of speech (POS) in the models, it is argued, systems become better at handling sentences for which word-level training data is sparse. Existing work has shown that using morpho-syntactic information is an effective remedy for data sparseness. We present a phrase-based statistical machine translation approach that uses linguistic analysis in the preprocessing phase. The analysis applies a morphological transformation, lemmatization, in several combinations to measure its effect on word alignment and vocabulary size.
2 System Overview
Previous work has shown that the most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006). In our experiments we aim to improve word alignment between Czech and English by stripping the inflection of verbs and nouns, which raises the frequency of the base forms and should therefore improve alignment. Czech is a pro-drop language: pronouns representing the subject are usually left out, but the morphology of the verb indicates explicitly which pronoun was meant. By lemmatizing verbs only, we hope to reduce the resulting misalignment.
For this task we ran four experiments. The baseline experiment used the corpus unchanged. The other three applied lemmatization: ALemma, where all words in the corpus were lemmatized; NLemma, where only words tagged as nouns were lemmatized; and VLemma, where only words tagged as verbs were lemmatized.
3 Components
3.1 Corpus
Training data is taken from the new News Commentary corpus. The released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. To tune our system during development, we used a development set of 1057 sentences (News Commentary, nc-dev2007). The data is provided in raw text format. To test our system during development, we used 2007 sentences (News Commentary, nc-test2007).
We tokenized and lowercased the corpus. We removed sentences longer than 40 words, as required by GIZA++.
                   Input sentences   Output sentences
original baseline  70048             62610
Table 1: Corpus size before and after cleaning
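The cleaning step summarized in Table 1 can be illustrated with a minimal Python sketch. This is not the Moses preparation script the experiments relied on; it is an assumed stand-in that takes already tokenized sentence pairs, lowercases them, and drops pairs that are empty or longer than the 40-word limit mentioned in the text. The function name `clean_parallel` and the constant `MAX_LEN` are hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_LEN = 40  # length limit stated in the text (needed by GIZA++)


def clean_parallel(
    pairs: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> List[Tuple[str, str]]:
    """Lowercase both sides and drop pairs that are empty or too long.

    Assumes each sentence is already tokenized (tokens separated by spaces).
    """
    kept = []
    for src, tgt in pairs:
        src_tokens = src.lower().split()
        tgt_tokens = tgt.lower().split()
        # Empty sentences occur in the released data and cannot be aligned.
        if not src_tokens or not tgt_tokens:
            continue
        # Drop the whole pair if either side exceeds the limit.
        if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
            continue
        kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


if __name__ == "__main__":
    demo = [
        ("Prezident rezignoval .", "The president resigned ."),
        ("", "An empty source side ."),
        ("slovo " * 41, "word " * 41),
    ]
    print(clean_parallel(demo))
```

Dropping a pair when either side is too long keeps the two files parallel, which word alignment depends on.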
The Moses toolkit (Koehn et al., 2007) implements phrase-based machine translation; we used it at the word level only. We used the toolkit for preparing the data, building the language model, training the translation model, tuning, running the system on the development test set, and evaluation.
3.2 Lemmatizer
For lemmatization we used the "Free Morphology" (FM) tool (Hajic, 2001). FM is a pair of universal (i.e., language-independent) morphology tools (FMAnalyze.pl, FMGenerate.pl) for the analysis and generation of word forms in inflective languages. It comes with a frequency-based, high-coverage Czech dictionary. It takes a text file as input and returns output in csts-like SGML markup, with one token per line.
Examples:
Input: Prezident rezignoval na svou funkci.
Output:
<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>
Table 2: Sample FM analyzer output
FM can also work for other morphologically rich inflective languages that can be described by segmenting a word form into two parts, a root and an ending. Even if this is not quite justified linguistically, many phenomena that would normally break this simple rule can be made to work in this framework.
Special provision is made in the code for up to two "inflectional" prefixes, which may both be present in some word forms. Such prefixes are found in many Slavic languages, such as Czech, Slovak and Polish.
Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. Categories that are not relevant for a given lemma (e.g. tense for nouns) are assigned a special value ("-" in Table 2). We used this positional information from the lemmatizer output to re-create our lemmatized corpus for the different experiments.
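As an illustration of how a positional tag can be read, the following Python sketch decodes the 15-character tags shown in Table 2. The position names are an assumption taken from the Prague positional tagset as commonly documented (the last three positions are reserve or variant slots, which is consistent with 12 actively used positions); the function name `decode_tag` is hypothetical.

```python
from typing import Dict

# Assumed position names (Prague positional tagset); not stated in this paper.
POSITION_NAMES = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each position of a positional tag to its category name.

    Positions holding "-" (category not relevant) are left out.
    """
    if len(tag) != len(POSITION_NAMES):
        raise ValueError(f"expected {len(POSITION_NAMES)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITION_NAMES, tag) if value != "-"}


if __name__ == "__main__":
    # Tag of "Prezident" from the sample analyzer output.
    print(decode_tag("NNMS1-----A----"))
    # Tag of "rezignoval".
    print(decode_tag("VpYS---XR-AA---"))
```

Only the first position (the part of speech) is needed to separate nouns from verbs; the remaining positions are decoded here just to show the structure of the tag.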
3.3 Preprocessor
For each experiment, the training and test datasets are run through the FM lemmatizer. Because FM outputs one token per line, we first insert a simple sentence delimiter, "*", which does not occur naturally in the corpus. This lets us determine where to place line breaks between sentences without disturbing the naturally occurring sentence punctuation. Next, for each word in the FM output, we decide whether to use the original word or the lemma, depending on the experiment. For the ALemma experiment, we used the lemma for every word in the FM output file. For the NLemma experiment, we used the lemma only if the first position of the morphological tag is "N" (denoting a noun). For the VLemma experiment, we used the lemma only if the first position of the tag is "V" (denoting a verb). After preprocessing, we saved the output in UTF-8 to ensure compatibility with GIZA++ and Moses.
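The selection logic described above can be sketched in Python as follows. This is a minimal illustration, not the original preprocessor. It makes several assumptions that the text does not confirm: that the first tag listed for an ambiguous word decides its part of speech, that technical lemma suffixes such as "_:T" or "-1_^(...)" are stripped, that the "*" delimiter comes back from FM as an ordinary token line, and that the output is lowercased. All names (`select_tokens`, `clean_lemma`, `SAMPLE`) are hypothetical.

```python
import re
from typing import Iterable, List

SENT_DELIM = "*"  # sentence delimiter described in the text

# One analysed token per line: <f ...>form<MMl>lemma<MMt>tag[<MMt>tag...]
TOKEN_RE = re.compile(
    r"^<[fd][^>]*>(?P<form>.*?)<MMl>(?P<lemma>.*?)<MMt>(?P<tag>[^<\s]+)"
)
# Fallback for a token line that carries no analysis (assumed possible).
FORM_RE = re.compile(r"^<[fd][^>]*>(?P<form>[^<]+)")
# Assumption: technical lemma suffixes are removed before use.
LEMMA_SUFFIX_RE = re.compile(r"(-\d+|_[:;,^]|`).*$")


def clean_lemma(lemma: str) -> str:
    """Strip technical suffixes, e.g. 'rezignovat_:T' -> 'rezignovat'."""
    return LEMMA_SUFFIX_RE.sub("", lemma) or lemma


def select_tokens(fm_lines: Iterable[str], mode: str) -> List[str]:
    """Rebuild sentences, choosing lemma or word form per token.

    mode: "A" lemmatize everything, "N" nouns only, "V" verbs only.
    """
    sentences, current = [], []
    for raw in fm_lines:
        line = raw.strip()
        match = TOKEN_RE.match(line)
        if match:
            form, lemma, tag = match.group("form", "lemma", "tag")
        else:
            fallback = FORM_RE.match(line)
            if not fallback:
                continue  # structural markup such as <csts> or <D>
            form = lemma = fallback.group("form")
            tag = ""
        if form == SENT_DELIM:
            sentences.append(" ".join(current))
            current = []
            continue
        # The first tag position holds the part of speech.
        use_lemma = mode == "A" or tag[:1] == mode
        token = clean_lemma(lemma) if use_lemma else form
        current.append(token.lower())
    if current:
        sentences.append(" ".join(current))
    return sentences


SAMPLE = [
    "<csts>",
    "<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----",
    "<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---",
    "<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------",
    "<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1",
    "<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----",
    "<D>",
    "<d>.<MMl>.<MMt>Z:-------------",
    "</csts>",
]

if __name__ == "__main__":
    for mode in ("A", "N", "V"):
        print(mode, select_tokens(SAMPLE, mode))
```

Keeping the three experiments behind a single `mode` argument means the same FM output file can be reused for ALemma, NLemma and VLemma without re-running the analyzer.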
3.4 Corpus after Preprocessing
After preprocessing, the lemmatized corpus contained more words per sentence. We again removed sentences longer than 40 words, which reduced the corpus size substantially, so a comparison with the original baseline would not be fair. To get a better picture, we used a smaller subset of the unlemmatized corpus as the baseline system. This does not compare word alignment on the same sentences, but it gives a fairer comparison of alignment on corpora of similar size with different settings.
          Input sentences   Output sentences
baseline  35000             14453
ALEMMA    70048             13136
NLEMMA    70048             15737
VLEMMA    70048             20686
Table 3: Experiment corpus before and after cleaning
The output sentences were then used for building the language models and for training.
Lemmatization reduced data sparsity, as earlier work has also reported. The reductions for our data set are listed in Table 4.
          Total words   Vocabulary (unique words)   Singletons (words occurring once)
Baseline  364762        23368                       12258
ALEMMA    336507        13502                       6560
NLEMMA    398979        17333                       8772
VLEMMA    537715        21475                       10136
Table 4: Reduction in data sparseness
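The three quantities reported in Table 4 can be computed with a few lines of Python. The sketch below assumes a tokenized corpus with one sentence per line and tokens separated by spaces; the function name `sparsity_stats` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def sparsity_stats(sentences: Iterable[str]) -> Dict[str, int]:
    """Count running words, distinct words, and words seen exactly once."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {
        "total_words": sum(counts.values()),
        "vocabulary": len(counts),
        "singletons": sum(1 for c in counts.values() if c == 1),
    }


if __name__ == "__main__":
    surface = ["prezident rezignoval", "prezidenta odvolali", "prezidentem byl"]
    lemmas = ["prezident rezignovat", "prezident odvolat", "prezident být"]
    print(sparsity_stats(surface))
    print(sparsity_stats(lemmas))
```

In the toy example, the three inflected forms of "prezident" collapse into one lemma, which lowers both the vocabulary size and the number of singletons while the total word count stays the same.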
4 Results
The BLEU scores show a marked improvement for the lemmatized corpora over the reduced-size baseline. Lemmatizing all words roughly doubles the baseline score (8.60 vs. 4.24). Lemmatizing only nouns increases the score further (10.09), but the best BLEU score is obtained when we lemmatize only the verbs (13.06), about three times the baseline score. Lemmatizing verbs is useful not only for improving BLEU but also for improving the readability of the translation. Once the verbs are aligned correctly, nouns are mostly the only words that remain untranslated. Thus, after the VLemma translation is complete, a simple post-processing step that replaces these nouns using a dictionary can improve the translations further. Table 5 shows the BLEU scores for the experiments we carried out.
Experiment                       BLEU    1/2/3/4-gram precision   BP      ratio   hyp_len   ref_len
BASELINE                         4.24    27.9/9.3/2.2/0.7         0.931   0.933   46470     49805
ALEMMA                           8.60    36.4/13.7/5.2/2.1        1.000   1.177   58645     49805
NLEMMA                           10.09   40.0/15.7/6.2/2.7        1.000   1.108   55174     49805
VLEMMA                           13.06   44.1/19.1/8.5/4.1        1.000   1.017   50652     49805
original baseline (full corpus)  18.89   53.0/27.0/14.1/7.9       0.946   0.947   47182     49805
Table 5: BLEU scores
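To show how the columns of Table 5 relate to each other, the following Python sketch recombines the reported n-gram precisions and lengths into a BLEU score using the standard definition: the brevity penalty times the geometric mean of the 1- to 4-gram precisions. It is not the evaluation script used in the experiments, and the function names are hypothetical. Because the precisions in the table are rounded to one decimal, the recomputed scores should agree with the reported ones only to within roughly 0.1 BLEU.

```python
import math
from typing import Sequence


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the hypothesis is not shorter
    than the reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_precisions(
    precisions_pct: Sequence[float], hyp_len: int, ref_len: int
) -> float:
    """Combine n-gram precisions (in percent) into a BLEU score (in percent)."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


if __name__ == "__main__":
    rows = {
        "BASELINE": ([27.9, 9.3, 2.2, 0.7], 46470, 49805),
        "ALEMMA": ([36.4, 13.7, 5.2, 2.1], 58645, 49805),
        "NLEMMA": ([40.0, 15.7, 6.2, 2.7], 55174, 49805),
        "VLEMMA": ([44.1, 19.1, 8.5, 4.1], 50652, 49805),
    }
    for name, (precisions, hyp_len, ref_len) in rows.items():
        print(name, round(bleu_from_precisions(precisions, hyp_len, ref_len), 2))
```

The sketch also makes visible why only the two baseline systems are penalized for length: their hypotheses are shorter than the reference, whereas the lemmatized systems produce output longer than the reference and so receive a brevity penalty of 1.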
The BLEU scores appear consistent with the decoded translations: the translated text improves visibly, as Table 6 shows.
BASELINE:
rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné averze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALemma:
rasov~R , divided europe
in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this is an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLemma:
race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLemma:
race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Table 6: Sample outputs
We propose that lemmatizing a specific part of speech improves word alignment, and that two approaches to a complete translation can be derived from this. The first is to create a pipeline of translations: for example, the output of NLemma becomes the source language of VLemma, and translation is done from a "verb-lemmatized NLemma output" to English. The second is to apply a dictionary or lexicon to the translation output of the lemmatized corpus. Looking at the VLemma output in Table 6, we can apply a simple dictionary such as the one in Table 7.
Czech     English      POS
rasismus  racialism    noun
rasismus  racism       noun
penova    foam         noun
averze    abhorrence   noun
averze    disliking    noun
averze    loathing     noun
imigrant  immigrant    noun
Table 7: Sample dictionary
In the resulting output, the previously untranslated nouns are glossed with their dictionary translations, as shown in Table 8.
VLEMMA-DICT:
race-specific divided europe
in fact the extreme right is its rasismus [racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova [foam] national fronts - all of this is happening parties or movement would be held and of from the common averze [loathing] towards imigrant [immigrants] and pushing of makes it easier to this view , to question the immigrants .
Table 8: Post-processed output
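The dictionary step can be sketched in Python as a token-by-token lookup over the decoder output. This is only an illustration under stated assumptions: the lexicon below keeps a single English gloss per Czech word, chosen by hand from the sample dictionary, because the text does not say how one of several candidate translations is selected. The names `annotate_with_dictionary` and `LEXICON` are hypothetical.

```python
from typing import Dict

# Assumption: one hand-picked gloss per Czech noun from the sample dictionary.
LEXICON: Dict[str, str] = {
    "rasismus": "racism",
    "penova": "foam",
    "averze": "loathing",
    "imigrant": "immigrant",
}


def annotate_with_dictionary(sentence: str, lexicon: Dict[str, str]) -> str:
    """Append a bracketed gloss to every token found in the lexicon."""
    out = []
    for token in sentence.split():
        gloss = lexicon.get(token)
        out.append(f"{token} [{gloss}]" if gloss is not None else token)
    return " ".join(out)


if __name__ == "__main__":
    decoded = "in fact the extreme right is its rasismus and that use immigration"
    print(annotate_with_dictionary(decoded, LEXICON))
```

A plain lookup like this has no way to tell a proper name from a common noun, which is why "penova" is glossed as "foam" in the sample output; restricting the lookup to tokens the decoder left untranslated, or adding a name list, would be a natural refinement.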
5 Limitations
The results are compared using BLEU scores only. Specific phenomena that occur in Czech, such as pronoun dropping, were not tested for translation accuracy. The preprocessing steps were geared more towards improving the BLEU score than towards improving the actual translations. While BLEU is a standard evaluation metric, human evaluation of such phenomena would have given a better understanding of the improvement in results.
The experiments do not cover the effect of target-language morphology on translation. Further experiments could examine, for example, the effect of lemmatizing both the source and the target language on word alignment (Zhang and Sumita, 2007).
6 Future Paths
Due to time constraints, we could not carry out experiments to measure whether a pipeline of lemmatization steps improves translation quality. In future work we would like to compare the dictionary method and the pipeline method using both BLEU scores and human evaluation.
We would also like to add more syntactic information to improve word reordering and language modelling, and to carry out experiments on other languages.
7 Conclusion
We have studied the effect of lemmatizing different parts of speech on word alignment for Czech-to-English SMT. We found that lemmatizing verbs yields the largest improvement in BLEU scores. We concluded with an approach that improves translations by lemmatizing verbs to improve alignment and then replacing the unresolved nouns and adjectives using a Czech-English dictionary. This approach is applicable to translation from any morphologically rich language into a morphologically simpler one.
References
Adria de Gispert, Deepa Gupta, Maja Popovic, Patrik Lambert, Jose B. Marino, Marcello Federico, Hermann Ney and Rafael Banchs. 2006. Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Advances in Natural Language Processing, Vol. 4139:368-379.
Bettina Schrader. 2004. Improving word alignment quality using linguistic knowledge. In Workshop proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 46-49.
Maria Holmqvist, Sara Stymne and Lars Ahrenberg. 2007. Getting to know Moses: Initial experiments on German-English factored translation. In Proceedings of ACL Second Workshop in Statistical Machine Translation, 181-184.
Ondřej Bojar and Magdalena Prokopová. 2006. Czech-English Word Alignment. In Proceedings of LREC'06, 1236-1239. ELRA.
Ondřej Bojar, Evgeny Matusov and Hermann Ney. 2006. Czech-English Phrase-Based Machine Translation. Institute of Formal and Applied Linguistics, Czech Republic.
Ondřej Bojar, David Marecek, Vaclav Novak, Martin Popel, Jan Ptacek, Jan Rous and Zdenek Zabokrtsky. 2008. English-Czech MT in 2009. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 125-129.
Ondřej Bojar and Jan Hajic. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, 143-146.
Sharon Goldwater and David McClosky. 2005. Improving Statistical MT through Morphological Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 676-683.
Ruiqiang Zhang and Eiichiro Sumita. 2007. Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation. National Institute of Information and Communications Technology, Spoken Language Communication Research Laboratories, Japan.
Hua Wu, Haifeng Wang and Zhanyi Liu. 2006. Boosting Statistical Word Alignment Using Labeled and Unlabeled Data. Toshiba Research and Development Center, China.
Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan. Association for Computational Linguistics.
Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 486-494, Suntec, Singapore. Association for Computational Linguistics.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, 177-180, Prague, Czech Republic. Association for Computational Linguistics.