agenda - universitetet i bergen

Agenda

1.  Object of analysis
    1.  Problem description
    2.  Preliminary premises and hypotheses
2.  Nominal compounds
    1.  German nominal compounds
    2.  Spanish syntagmatic compounds
    3.  Spanish-German correspondences
3.  Word alignment techniques
    1.  IBM 1-5 Models
    2.  HMM Models
4.  Methodology
    1.  Corpus
    2.  Work plan
5.  Expected results

Bergen, 22 June 2011 Multilingual Resources and Tools CLARA

1. Object of analysis

•  Spanish phraseological units that are translated as nominal compounds in German/Norwegian
   → 1:n alignments (DE/NO : ES)
•  NOT covered in this project:
   –  n:n alignments
   –  1:1 alignments of nominal compounds in DE/NO that correspond to a single word in ES, e.g. Straßenlampe > farola


1.1. Problem description

•  DE/NO → strong tendency to use nominal compounds

Human translators
→ ES translators need to find the corresponding phraseological unit
→ DE/NO translators have to produce compounds from phraseological units

MT systems
→ DE/NO > ES: unable to translate compounds correctly
→ ES > DE/NO: unable to generate compounds


1.1. Problem description

•  Consequences:
   –  Human translators spend a lot of time finding the proper translation correspondences
   –  MT systems do not produce accurate, high-quality translations, and their output does not sound natural


1.1. Problem description

Mistakes produced by ES > DE MT Systems


1.1. Problem description

Mistakes produced by DE > ES MT Systems


1.2. Preliminary premises and hypotheses

1.  There may be latent linguistic clues for detecting phraseological units in ES that are prone to become compounds in DE/NO
2.  There currently seem to be more compounds in texts originally written in DE/NO than in texts translated into DE/NO
3.  Nominal compounds and phraseological units referring to a common topic/domain tend to be repeated within the same original text and in other texts across the domain they belong to → frequency of occurrence may be another useful hint


1.2. Preliminary premises and hypotheses

1.  Lemmatization will be needed to unify frequencies of occurrence
2.  Newly created compounds will have more characters per word than existing, lexicalized compounds, e.g. Zusatz·stoff-Zulassung·s·verordnung: Reglamento relativo a la autorización de aditivos
3.  Frequent compounds will probably have been lexicalized and thus included in dictionaries, e.g. Rohmilch (leche cruda); Reifezeit (periodo de maduración)
4.  Terms referring to the main topic of a text are potential candidates to produce compounds
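The point about unifying frequencies through lemmatization can be illustrated with a minimal sketch. It assumes the tokens have already been lemmatized by some external tool; the function name `lemma_frequencies` and the toy input are hypothetical and only serve the illustration.

```python
from collections import Counter

def lemma_frequencies(tagged_tokens):
    """Count occurrences per lemma.

    tagged_tokens: iterable of (surface_form, lemma) pairs, assumed to come
    from an external lemmatizer (not part of this sketch).
    """
    return Counter(lemma for _surface, lemma in tagged_tokens)

# Toy input (invented): two inflected forms of the same compound.
tokens = [
    ("Reifezeiten", "Reifezeit"),
    ("Reifezeit", "Reifezeit"),
    ("Rohmilch", "Rohmilch"),
]
freq = lemma_frequencies(tokens)
```

Without lemmatization, "Reifezeiten" and "Reifezeit" would be counted as two separate items with frequency 1 each; counted by lemma they form one candidate with frequency 2.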


2. Nominal compounds

•  DE/NO nominal compounds are translated as phraseological units in Spanish
   –  That are a Spanish syntagmatic compound
   –  That are NOT a Spanish syntagmatic compound
•  AIM = determine which phraseological units in ES correspond to a DE/NO nominal compound → identify common and distinguishing features


2.1. German nominal compounds

•  Lexicalized → appear in dictionaries
•  Not lexicalized → translational equivalent has to be determined

–  Bushaltestelle
–  Handbremsvorrichtung

2.1. German nominal compounds

(a) Compound = 2 or more nouns
-  Head = the rightmost noun of the word
   a)  Non-head = complement → translation with a PP
   b)  Non-head = modifier → unpredictable translation

Bus·fahrer (conductor de autobús); Programm·entwicklung (desarrollo de programas); Daten·schutz (protección de datos); Problem·lösung (resolución de problemas)

Land·haus (casa de campo), Fabrik·arbeiter (trabajador de fábrica), Nord·see·öl (petróleo del Mar del Norte)

Metall·industrie (industria metalúrgica), Haupt·aufgabe (tarea principal), Grund·fähigkeit (capacidad básica), End·produkt (producto final), Schlüssel·wort (palabra clave), Mitglied·staat (estado miembro)

2.1. German nominal compounds

(b) Compound = verbal root + noun
–  Non-head is always a modifier
   a)  Head fills a thematic role in the argument frame of the modifier → translation: participle / deverbal adjective

Schwimm·kran (grúa flotante) → Kran = THEME of schwimmen
Klapp·stuhl (silla plegable) → Stuhl = THEME of klappen
Hänge·brücke (puente colgante) → Brücke = THEME of hängen
Wasch·kleid (vestido lavable) → Kleid = THEME of waschen

2.1. German nominal compounds

(b) Compound = verbal root + noun
–  Non-head is always a modifier
   b)  Head fills no thematic role of the verb; the verb is only a modifier → translation: deverbal noun

Prüf·verfahren: proceso de inspección
Bade·anzug: traje de baño
Schwimm·lehrer: profesor de natación
Mal·wettbewerb: concurso de pintura

2.1. German nominal compounds

(c) Compound = adjective + noun

–  The adjective = modifier of the noun


Gesamt·ausgabe (edición completa); Höchst·geschwindigkeit (velocidad máxima); Zentral·einheit (unidad central); Privat·bereich (sector privado).

2.2. Spanish syntagmatic compounds

•  According to Val Álvaro (1999):
   –  Lexical compounds (not relevant for us)
   –  Syntagmatic compounds:
      •  They have fixed syntactic structures;
      •  They refer to a single conceptual unit;
      •  They usually allow their non-compositional meaning to be de-automatised;
      •  They usually tend to be less cohesive when their semantic transparency is higher


2.2. Spanish syntagmatic compounds

•  Syntactic fixedness can be recognised when:
   –  The constituents only appear in a fixed order
   –  The constituents cannot be replaced by other lexical units
   –  Modifiers, determiners or quantifiers may not be changed
   –  Only the phrase as a whole can be changed
   –  None of the constituents may be separated from the others (i.e. question mark), and it is also not possible to make a pronominal reference to only one of its constituents
   –  Ellipsis is not allowed, for instance in the case of phrase coordination


2.3. Spanish-German correspondences


DE-ES phraseological units ES-DE phraseological units







3.1. IBM 1-5 Models

•  Models 1 & 2: based on the length of a string
   –  M1: word order in e and f does not affect Pr(f|e)
      → All possible connections for a position in French have the same probability
   –  M2: Pr(f|e) depends on the word order in e and f
      → The probability of a connection depends on the positions it aligns and the lengths of both strings
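How Model 1 learns word translation probabilities from sentence-aligned text can be sketched with a few lines of expectation-maximisation. This is a minimal illustration under stated assumptions, not the GIZA++ implementation: the function name `train_ibm_model1`, the `NULL` token convention and the toy German-Spanish bitext are all introduced here for the example.

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    """EM training of lexical probabilities t[(f, e)] = Pr(f | e).

    bitext: list of (f_tokens, e_tokens) sentence pairs.
    Word positions are ignored, as in Model 1.
    """
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    # Uniform start: every connection is equally likely.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in bitext:
            e_sent = ["NULL"] + e_sent  # allow f words with no counterpart
            for f in f_sent:
                # E-step: share one count for f among all e in the sentence.
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per e word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy bitext (invented): f side German, e side Spanish.
bitext = [
    (["das", "Auto"], ["el", "coche"]),
    (["das", "Buch"], ["el", "libro"]),
    (["ein", "Buch"], ["un", "libro"]),
]
t = train_ibm_model1(bitext)
```

On this toy data the probability mass for "Buch" concentrates on "libro", because the two words co-occur in two sentence pairs while the competing articles are explained by "das" and "ein".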


3.1. IBM 1-5 Models

Models 3, 4, and 5 develop the French string by choosing, for each word in the English string:
-  The number of words in the French string that will be aligned to it;
-  The identity of these French words; and
-  The actual positions in the French string that these words will occupy

-  M3: the probability of an alignment depends on the positions it aligns and the lengths of the English and French strings
-  M4: the probability of an alignment depends, in addition, on the identities of the French and English words aligned and on the alignments of any other French words that are aligned to the same English word
-  M5: a reformulation of M4 that removes its deficiency (probability mass spent on impossible alignments) and improves on its results


3.1. IBM 1-5 Models

•  As we do not usually have corpora aligned at the subsentential level, the word alignments have to be learned from sentence-aligned text

•  Model 1 only takes words into account and is quick and easy to train

•  Model 1 results are used as the basis for Model 2, and so on → my preprocessing module would allow us to train the aligner directly with Model 1


3.2. HMM Models

•  They are based on the assumption that alignments tend to preserve locality → neighboring words in the source language are often aligned with neighboring words in the target language
•  Each alignment decision is conditioned on previous decisions
•  Disadvantage: they do not usually allow for multiword alignment → Deng & Byrne (2005) propose a system with 1:n alignments that we may try
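One common way to encode the locality assumption, assumed here rather than taken from the slides, is to let the transition probability depend only on the jump width between the target positions of two consecutive source words. The sketch below estimates such a jump distribution from toy alignments; `jump_distribution` and the data are hypothetical.

```python
from collections import Counter

def jump_distribution(alignments):
    """Relative frequency of jump widths between consecutive alignments.

    alignments: list of lists, where a[j] is the target position aligned
    to source position j (one target position per source word).
    """
    jumps = Counter()
    for a in alignments:
        for prev, cur in zip(a, a[1:]):
            jumps[cur - prev] += 1
    n = sum(jumps.values())
    return {width: c / n for width, c in jumps.items()}

# Toy alignments (invented): mostly monotone, with a few local swaps.
toy_alignments = [[0, 1, 2, 3], [0, 2, 1, 3], [1, 0, 2, 3, 4]]
p_jump = jump_distribution(toy_alignments)
```

In this toy sample a jump of +1 (the next source word aligns to the next target word) is the most frequent case, which is the behaviour the locality assumption predicts. Note that each source word gets exactly one target position here, which is why plain HMM aligners do not directly yield multiword alignments.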


4. Methodology

•  Improve the accuracy of 1:n statistical word alignments by packing phraseological units in a preprocessing module
•  The preprocessing module shall pack as single words those Spanish phraseological units that will correspond to a German/Norwegian nominal compound, using:
   –  Linguistic rules
   –  Statistical analysis results
•  Word aligners will then be trained on the preprocessed corpus
•  Packed units will be unpacked to produce the corresponding correct 1:n alignments
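The pack / align / unpack cycle can be sketched as follows. This is only an illustration under assumptions: the list of units to pack is given by hand (in the project it would come from the linguistic rules and statistical analysis), the separator `_` is assumed not to occur inside tokens, and the names `pack` and `unpack_alignment` are hypothetical.

```python
def pack(tokens, units, sep="_"):
    """Join each listed multiword unit into a single token (longest match first).

    units: collection of tuples of tokens, e.g. ("leche", "cruda").
    """
    by_len = sorted(units, key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for unit in by_len:
            if tuple(tokens[i:i + len(unit)]) == unit:
                out.append(sep.join(unit))
                i += len(unit)
                break
        else:  # no unit starts at position i
            out.append(tokens[i])
            i += 1
    return out

def unpack_alignment(packed_tokens, links, sep="_"):
    """Expand (de_index, packed_es_index) links to original ES word positions."""
    spans, pos = [], 0
    for tok in packed_tokens:
        n = len(tok.split(sep))
        spans.append(range(pos, pos + n))
        pos += n
    return [(de, es) for de, p in links for es in spans[p]]

# Toy example (invented): "die Reifezeit der Rohmilch"
es = "el periodo de maduración de la leche cruda".split()
units = {("periodo", "de", "maduración"), ("leche", "cruda")}
packed = pack(es, units)
# Links as an aligner might return them on the packed text:
# Reifezeit (1) -> periodo_de_maduración (1), Rohmilch (3) -> leche_cruda (4)
links = unpack_alignment(packed, [(1, 1), (3, 4)])
```

After unpacking, each German compound is linked to every word of its Spanish phraseological unit, which yields the intended 1:n alignment.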


4.1. Corpus

•  Needed as it offers the linguistic evidence required to identify translation correspondences (Melamed: 1996)
•  German-Spanish corpus
   –  DG Enterprise project of the EC
   –  Specialized and divided into domains and subdomains:
      •  Specialized texts tend to have a greater number of compounds
      •  Will help us to determine whether our approach requires domain tuning
•  Norwegian-Spanish corpus
   –  Norwegian literature compilation translated into Spanish
   –  European Law translations?


4.1. Corpus

German-Spanish corpus: current status

FIELD                                       # FILES AUSTRIA   # FILES GERMANY   # FILES SPAIN
B00: CONSTRUCTION                           205               174               39
C00A: AGRICULTURE, FISHING AND FOODSTUFFS   52                60                78
C00C: CHEMICALS                             16                19                12
C00P: PHARMACEUTICALS AND COSMETICS         3                 17                3
H00: DOMESTIC AND LEISURE EQUIPMENT         12                7                 36
I00: MECHANICS                              28                8                 45
N00E: ENERGY, MINERALS, WOOD                22                14                14
S00E: ENVIRONMENT                           24                27                12
S00S: HEALTH, MEDICAL EQUIPMENT             4                 1                 2
SERV: 98/48/EC SERVICES                     15                38                9
T00T: TRANSPORT                             0                 0                 0
V00T: TELECOMS                              0                 0                 0
X00M: GOODS AND MISCELLANEOUS PRODUCTS      0                 0                 0

4.2. Work plan

1.  Corpus
    1.  Compilation, clean-up and alignment at sentence level
    2.  Splitting of the aligned corpus into separate language files
    3.  Lemmatization and POS-tagging
2.  Manual analysis of a subpart of the corpus
    1.  Gold standard establishment
    2.  Creation of a training and test set
    3.  Establishment of rules to filter candidates to be packed


4.2. Work plan

3.  Experiments:
    1.  Experiment 0:
        1.  Corpus processing with Giza++ (IBM model 1)
        2.  Analysis of results
    2.  Implementation of the first set of rules (A)
    3.  Experiment 1:
        1.  Corpus processing with Giza++ (IBM model 1)
        2.  Analysis of results using the set of rules A
    4.  Comparison of results between experiments 0 and 1
    5.  Error analysis and establishment of set of rules B
    6.  Experiment 2:
        1.  Corpus processing with Giza++ (IBM model 1)
        2.  Analysis of results using the set of rules B
    7.  Comparison of results across experiments 0-2
    8.  Error analysis and final results


4.2. Work plan

•  The experiments and the development of the preprocessing module will first focus on the German-Spanish language pair; Norwegian will be included at a later stage

•  The Alignment Error Rate (AER, Och & Ney: 2003) will be used to evaluate alignment results

•  We may also test whether MT (Moses) quality increases when the preprocessing module is used, and compare it with other MT systems, whether statistical (Google), rule-based (Lucy Software) or hybrid (Apertium/Systran)
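For reference, AER compares a hypothesised alignment A with a gold standard that distinguishes sure links S from possible links P (with S contained in P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch follows; the function name and the toy links are invented for the illustration.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    All arguments are collections of (source_index, target_index) links.
    Sure links are also treated as possible links.
    Assumes that hypothesis and sure are not both empty.
    """
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Toy gold standard and aligner output (invented).
sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
hypothesis = {(0, 0), (1, 2), (2, 3)}
aer = alignment_error_rate(hypothesis, sure, possible)
```

Here one of the two sure links is found, one hypothesised link is merely possible and one is wrong, so AER = 1 - (1 + 2) / (3 + 2) = 0.4. Lower values are better; 0 means every sure link was found and no link outside P was proposed.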








5. Expected results

•  If results are positive, we could consider implementing a semi-supervised model for detecting phraseological units

•  That model could be used to improve the preprocessing module and run similar experiments

•  If the approach proves successful for both German and Norwegian, other Germanic-Romance language pairs may be considered for similar experiments


Expected results

1.  Specialized parallel corpora that can be reused for other purposes once the project has finished
2.  90% accuracy of the preprocessing module
3.  Improvement of 1:n alignments in statistical word aligners (e.g. Giza++) by chunking phraseological units in Spanish
4.  Automatic compound dictionary extraction
5.  Other NLP areas that can also benefit from this project are:
    –  Computer Assisted Translation (CAT)
    –  Cross-lingual information retrieval
    –  Computer Assisted Language Learning (CALL)
    –  NLP applications involving 2 or more languages


Questions?

Thank you very much for your attention!
