agenda - universitetet i bergen
Agenda
1. Object of analysis
   1. Problem description
   2. Preliminary premises and hypotheses
2. Nominal compounds
   1. German nominal compounds
   2. Spanish syntagmatic compounds
   3. Spanish-German correspondences
3. Word alignment techniques
   1. IBM 1-5 Models
   2. HMM Models
4. Methodology
   1. Corpus
   2. Work plan
5. Expected results
Bergen, 22 June 2011 Multilingual Resources and Tools CLARA 1
1. Object of analysis
• Spanish phraseological units that are translated as nominal compounds in German/Norwegian
  → 1:n alignments (DE/NO : ES)
• NOT covered in this project:
  – n:n alignments
  – 1:1 alignments of nominal compounds in DE/NO that correspond to a single word in ES, e.g. Straßenlampe > farola
1.1. Problem description
• DE/NO → great tendency to use nominal compounds
• Human translators
  → ES translators need to find the corresponding phraseological unit
  → DE/NO translators have to produce compounds from phraseological units
• MT systems
  → DE/NO > ES: unable to translate compounds correctly
  → ES > DE/NO: unable to generate compounds
1.1. Problem description
• Consequences:
  – Human translators devote a lot of time to finding the proper translation correspondences
  – MT systems do not produce accurate, high-quality translations, and their output does not sound natural
1.1. Problem description
Mistakes produced by ES > DE MT systems
1.1. Problem description
Mistakes produced by DE > ES MT systems
1.2. Preliminary premises and hypotheses
1. There may be some latent linguistic clues for detecting phraseological units in ES that are prone to become compounds in DE/NO
2. There currently seem to be more compounds in texts originally written in DE/NO than in texts translated into DE/NO
3. Nominal compounds and phraseological units referring to a common topic/domain tend to be repeated within the same original text and in other texts across the domain they belong to
   → Frequency of occurrence may be another useful hint
1.2. Preliminary premises and hypotheses
1. Lemmatization will be needed to unify frequencies of occurrence
2. Newly created compounds will have more characters per word than already existing, lexicalized compounds, e.g. Zusatz·stoff-Zulassung·s·verordnung: Reglamento relativo a la autorización de aditivos
3. Frequent compounds will probably have been lexicalized and thus included in dictionaries, e.g. Rohmilch (leche cruda); Reifezeit (periodo de maduración)
4. Terms referring to the main topic of a text are potential candidates to produce compounds
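The first two premises above (lemmatization to unify frequencies, and word length as a hint for new compounds) can be sketched as follows. This is only an illustration: the lemma table `TOY_LEMMAS` and both helper functions are hypothetical, and a real pipeline would rely on a proper lemmatizer rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lemma table, for illustration only; a real pipeline would
# call a lemmatizer instead of this hand-written dictionary.
TOY_LEMMAS = {"Reifezeiten": "Reifezeit"}


def lemma_frequencies(tokens, lemmas=TOY_LEMMAS):
    """Count tokens by lemma so that inflected forms share one frequency."""
    return Counter(lemmas.get(tok, tok) for tok in tokens)


def by_length(compounds):
    """Sort compounds from longest to shortest (characters per word)."""
    return sorted(compounds, key=len, reverse=True)


tokens = ["Rohmilch", "Reifezeit", "Reifezeiten", "Rohmilch",
          "Zusatzstoff-Zulassungsverordnung"]
freq = lemma_frequencies(tokens)   # "Reifezeit" and "Reifezeiten" are merged
ranked = by_length(freq)           # longest candidates come first
```

Under these assumptions, frequent short lemmas would be checked against a dictionary first, while long, rare ones are the likelier candidates for newly created compounds.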
2. Nominal compounds
• DE/NO nominal compounds are translated as phraseological units in Spanish
  – That are a Spanish syntagmatic compound
  – That are NOT a Spanish syntagmatic compound
• AIM = determine which phraseological units in ES correspond to a DE/NO nominal compound
  → Identify common/differentiating features
2.1. German nominal compounds
• Lexicalized → appear in dictionaries
  – Bushaltestelle
• Not lexicalized → translational equivalent has to be determined
  – Handbremsvorrichtung
2.1. German nominal compounds
(a) Compound = 2 or more nouns
– Head = rightmost noun of the word
  a) Non-head = complement → translation with PP
     Bus·fahrer (conductor de autobús); Programm·entwicklung (desarrollo de programas); Daten·schutz (protección de datos); Problem·lösung (resolución de problemas)
     Land·haus (casa de campo); Fabrik·arbeiter (trabajador de fábrica); Nord·see·öl (petróleo del Mar del Norte)
  b) Non-head = modifier → unpredictable translation
     Metall·industrie (industria metalúrgica); Haupt·aufgabe (tarea principal); Grund·fähigkeit (capacidad básica); End·produkt (producto final); Schlüssel·wort (palabra clave); Mitglied·staat (estado miembro)
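As a small illustration of the right-headedness described above, the sketch below splits a compound that has already been segmented with the "·" marker used in these examples. The function name `split_compound` is hypothetical, and the segmentation itself is assumed to be given; the sketch does not attempt to find the boundaries.

```python
def split_compound(segmented, sep="·"):
    """Return (non_heads, head) for a pre-segmented German compound.

    The head is taken to be the rightmost element; everything to its
    left is treated as non-head material.
    """
    parts = segmented.split(sep)
    return parts[:-1], parts[-1]


non_heads, head = split_compound("Nord·see·öl")
```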
2.1. German nominal compounds
(b) Compound = verbal root + noun
– Non-head is always a modifier
  a) Head bears a thematic role in the argument frame of the modifier → Translation: participle / deverbal adjective
     Schwimm·kran (grúa flotante) → Kran = THEME of schwimmen
     Klapp·stuhl (silla plegable) → Stuhl = THEME of klappen
     Hänge·brücke (puente colgante) → Brücke = THEME of hängen
     Wasch·kleid (vestido lavable) → Kleid = THEME of waschen
2.1. German nominal compounds
(b) Compound = verbal root + noun
– Non-head is always a modifier
  b) Head bears no thematic role of the verb, which acts only as a modifier → Translation: deverbal noun
     Prüf·verfahren: proceso de inspección
     Bade·anzug: traje de baño
     Schwimm·lehrer: profesor de natación
     Mal·wettbewerb: concurso de pintura
2.1. German nominal compounds
(c) Compound = adjective + noun
– The adjective = modifier of the noun
  Gesamt·ausgabe (edición completa); Höchst·geschwindigkeit (velocidad máxima); Zentral·einheit (unidad central); Privat·bereich (sector privado).
2.2. Spanish syntagmatic compounds
• According to Val Álvaro (1999):
  – Lexical compounds (not relevant for us)
  – Syntagmatic compounds:
    • They have fixed syntactic structures;
    • They refer to a single conceptual unit;
    • They usually allow their non-compositional meaning to be de-automatized;
    • They usually resist cohesion more strongly when their semantic transparency is higher
2.2. Spanish syntagmatic compounds
• Syntactic fixedness can be recognized when:
  – The constituents only appear in a fixed order
  – It is not possible to replace the constituents with other lexical units
  – Modifiers, determiners or quantifiers may not be changed
  – It is only possible to change the whole phrase as such
  – None of the constituents may be separated from the others (i.e. question mark), and it is also not possible to make a pronominal reference to only one of its constituents
  – Ellipsis is not allowed, for instance in the case of phrase coordination
2.3. Spanish-German correspondences
[Figure: DE-ES phraseological units; ES-DE phraseological units]
3.1. IBM 1-5 Models
• Models 1 & 2: first choose the length of the French string, then a connection for each French position
  – M1: word order in e and f does not affect Pr(f|e)
    → All possible connections for a position in French have the same probability
  – M2: Pr(f|e) depends on the word order in e and f
    → The probability of a connection depends on the positions it connects and the lengths of both strings
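A minimal sketch of how Model 1 can be trained with EM is given below, to make the "all connections equally likely" idea concrete. It is a simplified illustration, not the Giza++ implementation: the NULL word is omitted, the function name `train_ibm1` is hypothetical, and the three-sentence German-English corpus is a toy example chosen only to show the behaviour.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate lexical probabilities t(f|e) with EM (IBM Model 1 style).

    pairs: list of (f_sentence, e_sentence) token lists.
    Simplification: no NULL word is added to the e side.
    """
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)          # t[(f, e)], uniform start
    for _ in range(iterations):
        count = defaultdict(float)            # expected count of (f, e) links
        total = defaultdict(float)            # expected count of e
        for f_sent, e_sent in pairs:
            for f in f_sent:
                # Word positions play no role: every e word competes equally.
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts per e word.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(toy_corpus)
```

Even on this tiny corpus, the co-occurrence statistics alone should push probability mass towards pairs such as (Haus, house), which is why Model 1 is a convenient starting point for the later models.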
3.1. IBM 1-5 Models
Models 3, 4, and 5 develop the French string by choosing, for each word in the English string:
- The number of words in the French string that will be aligned to it;
- The identity of these French words; and
- The actual positions in the French string that these words will occupy
- M3: the probability of an alignment depends on the positions it aligns and the lengths of the English and French strings
- M4: the probability of an alignment depends in addition on the identities of the French and English words aligned and on the alignments of any other French words that are aligned to the same English word
- M5: very similar to M4, but without its deficiency (M3 and M4 assign some probability to strings that are not valid French strings)
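The first choice listed above, the number of words aligned to each source word, can be read directly off an alignment. The sketch below shows this for a 1:n case; the function name `fertilities` and the (source index, target index) link format are assumptions made for illustration.

```python
from collections import Counter


def fertilities(links, n_source):
    """Number of target words aligned to each source word.

    links: iterable of (source_index, target_index) pairs.
    """
    counts = Counter(src for src, _ in links)
    return [counts[i] for i in range(n_source)]


# Reifezeit <-> periodo de maduración: one source word, three target words
source = ["Reifezeit"]
target = ["periodo", "de", "maduración"]
links = [(0, 0), (0, 1), (0, 2)]
fert = fertilities(links, len(source))
```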
3.1. IBM 1-5 Models
• As we do not usually have corpora aligned at the subsentential level, we need to obtain the alignments from somewhere
• Model 1 only takes words into account and is easy and quick to train
• Model 1 results are used as the basis for Model 2, and so on
  → My preprocessing module would allow us to directly train the aligner with Model 1
3.2. HMM Models
• They are based on the assumption that alignments tend to preserve locality
  → Neighboring words in the original language are often aligned with neighboring words in the target language
• Each alignment decision is conditioned on previous decisions
• Disadvantage: they do not usually allow for multiword alignment
  → Deng & Byrne (2005) propose a system with 1:n alignments that we may try
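The locality assumption above can be sketched as a transition table in which the probability of the next aligned position depends only on the jump from the previous one. The weighting function `locality_weight` is a made-up example (it simply favours moving one position forward); a trained HMM aligner would estimate these jump probabilities from data.

```python
def jump_transitions(length, jump_weight):
    """Build p[prev][cur], proportional to jump_weight(cur - prev).

    Each row is normalised so that it sums to 1.
    """
    rows = []
    for prev in range(length):
        weights = [jump_weight(cur - prev) for cur in range(length)]
        z = sum(weights)
        rows.append([w / z for w in weights])
    return rows


def locality_weight(jump):
    # Hypothetical weighting: a jump of +1 is preferred, and the weight
    # halves for every position the jump deviates from +1.
    return 0.5 ** abs(jump - 1)


p = jump_transitions(5, locality_weight)
```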
4. Methodology
• Improve the accuracy of 1:n statistical word alignments by packaging phraseological units in a preprocessing module
• The preprocessing module shall pack as single words those Spanish phraseological units that will correspond to a German/Norwegian nominal compound, using:
  • Linguistic rules
  • Statistical analysis results
• Word aligners will then be trained with the preprocessed corpus
• Packed units will be unpacked to produce the correct corresponding 1:n alignments
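The pack / align / unpack cycle described above could look roughly like the sketch below. It is not the project's actual module: the unit list is supplied by hand here (in the project it would come from the linguistic rules and statistical analysis), the function names are hypothetical, and the alignment links stand in for the output of a word aligner.

```python
def pack(tokens, units, joiner="_"):
    """Greedily join known multiword units into single tokens.

    Returns the packed tokens and, for each packed token, the list of
    original token positions it covers.
    """
    units = sorted((tuple(u) for u in units if u), key=len, reverse=True)
    packed, spans, i = [], [], 0
    while i < len(tokens):
        for unit in units:
            if tuple(tokens[i:i + len(unit)]) == unit:
                packed.append(joiner.join(unit))
                spans.append(list(range(i, i + len(unit))))
                i += len(unit)
                break
        else:  # no unit starts at position i
            packed.append(tokens[i])
            spans.append([i])
            i += 1
    return packed, spans


def unpack(links, spans):
    """Expand (de_index, packed_es_index) links to original ES positions."""
    return [(de, es) for de, packed_idx in links for es in spans[packed_idx]]


es = ["la", "leche", "cruda", "tiene", "un", "periodo", "de", "maduración"]
units = [["leche", "cruda"], ["periodo", "de", "maduración"]]
packed, spans = pack(es, units)

# Hypothetical 1:1 aligner output for: Rohmilch hat eine Reifezeit
links = [(0, 1), (1, 2), (2, 3), (3, 4)]
expanded = unpack(links, spans)
```

The aligner only ever sees 1:1 links over the packed sentence; the 1:n links are recovered afterwards from the stored spans.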
4.1. Corpus
• Needed as it offers the linguistic evidence required to identify translation correspondences (Melamed: 1996)
• German-Spanish corpus
  – DG Enterprise project of the EC
  – Specialized and divided into domains and subdomains:
    • Specialized texts tend to have a greater number of compounds
    • Will help us to determine whether our approach requires domain tuning
• Norwegian-Spanish corpus
  – Norwegian literature compilation translated into Spanish
  – European Law translations?
4.1. Corpus
German-Spanish corpus: current status

FIELD                                      # FILES AUSTRIA  # FILES GERMANY  # FILES SPAIN
B00: CONSTRUCTION                          205              174              39
C00A: AGRICULTURE, FISHING AND FOODSTUFFS  52               60               78
C00C: CHEMICALS                            16               19               12
C00P: PHARMACEUTICALS AND COSMETICS        3                17               3
H00: DOMESTIC AND LEISURE EQUIPMENT        12               7                36
I00: MECHANICS                             28               8                45
N00E: ENERGY, MINERALS, WOOD               22               14               14
S00E: ENVIRONMENT                          24               27               12
S00S: HEALTH, MEDICAL EQUIPMENT            4                1                2
SERV: 98/48/EC SERVICES                    15               38               9
T00T: TRANSPORT                            0                0                0
V00T: TELECOMS                             0                0                0
X00M: GOODS AND MISCELLANEOUS PRODUCTS     0                0                0
4.2. Work plan
1. Corpus
   1. Compilation, clean-up and alignment at sentence level
   2. Alignment split into separate language files
   3. Lemmatization and POS-tagging
2. Manual analysis of a subpart of the corpus
   1. Gold Standard establishment
   2. Creation of a training and test set
   3. Establishment of rules to filter candidates to be packed
4.2. Work plan
3. Experiments:
   1. Experiment 0:
      1. Corpus processing with Giza++ (IBM model 1)
      2. Analysis of results
   2. Implementation of the first set of rules (A)
   3. Experiment 1:
      1. Corpus processing with Giza++ (IBM model 1)
      2. Analysis of results using the set of rules A
   4. Comparison of results between experiments 0 and 1
   5. Error analysis and establishment of set of rules B
   6. Experiment 2:
      1. Corpus processing with Giza++ (IBM model 1)
      2. Analysis of results using the set of rules B
   7. Comparison of results across experiments 0-2
   8. Error analysis and final results
4.2. Work plan
• The experiments and the development of the preprocessing module will first focus on the German-Spanish language pair; Norwegian will be included at a later stage
• The Alignment Error Rate (AER – Och & Ney: 2003) will be used to evaluate alignment results
• We may also test whether MT (Moses) quality increases when using the preprocessing module, and compare it with other MT systems, whether statistical (Google), rule-based (Lucy Software) or hybrid (Apertium/Systran)
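For reference, the AER mentioned above compares a predicted alignment A against gold-standard sure links S and possible links P (with S contained in P): AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|). The sketch below computes it over sets of index pairs; the function name and the link representation are assumptions, not part of any particular toolkit.

```python
def alignment_error_rate(predicted, sure, possible=()):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), lower is better.

    predicted, sure, possible: iterables of (source_index, target_index).
    Sure links are always counted as possible links too.
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible)
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


gold_sure = {(0, 1), (1, 3)}
gold_possible = {(0, 2)}
perfect = alignment_error_rate({(0, 1), (0, 2), (1, 3)}, gold_sure, gold_possible)
partial = alignment_error_rate({(0, 1), (2, 2)}, gold_sure, gold_possible)
```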
5. Expected results
• If results are positive, implementing a semi-supervised model for detecting phraseological units could be considered
• That model could be used to improve the preprocessing module and run similar experiments
• If the approach proves successful for both German and Norwegian, other Germanic-Romance language pairs may be considered for similar experiments
Expected results
1. Specialized parallel corpora that can be reused for other purposes once the project has finished
2. 90% accuracy of the preprocessing module
3. Improvement of 1:n alignments in statistical word aligners (e.g. Giza++) by chunking phraseological units in Spanish
4. Automatic compound dictionary extraction
5. Other NLP areas that can also benefit from this project are:
   – Computer Assisted Translation (CAT)
   – Cross-lingual information retrieval
   – Computer Assisted Language Learning (CALL)
   – NLP applications involving 2 or more languages
Questions?
Thank you very much for your attention!