agenda - universitetet i bergen

Page 1: Agenda - Universitetet i Bergen

Agenda

1.  Object of analysis
    1.  Problem description
    2.  Preliminary premises and hypotheses
2.  Nominal compounds
    1.  German nominal compounds
    2.  Spanish syntagmatic compounds
    3.  Spanish-German correspondences
3.  Word alignment techniques
    1.  IBM 1-5 Models
    2.  HMM Models
4.  Methodology
    1.  Corpus
    2.  Work plan
5.  Expected results

Bergen, 22 June 2011 Multilingual Resources and Tools CLARA 1

Page 3: Agenda - Universitetet i Bergen

1. Object of analysis

•  Spanish phraseological units that are translated as nominal compounds in German/Norwegian
   → 1:n alignments (DE/NO : ES)
•  NOT covered in this project:
   –  n:n alignments
   –  1:1 alignments of nominal compounds in DE/NO that correspond to a single word in ES, e.g. Straßenlampe > semáforo

Page 4: Agenda - Universitetet i Bergen

1.1. Problem description

•  DE/NO → strong tendency to use nominal compounds

Human translators
   → ES translators need to find the corresponding phraseological unit
   → DE/NO translators have to produce compounds from phraseological units

MT systems
   → DE/NO > ES: unable to translate compounds correctly
   → ES > DE/NO: unable to generate compounds

Page 5: Agenda - Universitetet i Bergen

1.1. Problem description

•  Consequences:
   –  Human translators spend a lot of time finding the proper translation correspondences
   –  MT systems do not produce accurate, high-quality translations, and the translations do not sound natural

Page 6: Agenda - Universitetet i Bergen

1.1. Problem description

Mistakes produced by ES > DE MT Systems

Page 7: Agenda - Universitetet i Bergen

1.1. Problem description

Mistakes produced by DE > ES MT Systems

Page 8: Agenda - Universitetet i Bergen

1.2. Preliminary premises and hypotheses

1.  There may be latent linguistic clues for detecting phraseological units in ES that are prone to become compounds in DE/NO
2.  There currently seem to be more compounds in texts originally written in DE/NO than in texts translated into DE/NO
3.  Nominal compounds and phraseological units referring to a common topic/domain tend to be repeated within the same original text and in other texts across the domain they belong to
    → Frequency of occurrence may be another useful hint

Page 9: Agenda - Universitetet i Bergen

1.2. Preliminary premises and hypotheses

1.  Lemmatization will be needed to unify frequencies of occurrence
2.  Newly created compounds will have more characters per word than already existing, lexicalized compounds,
    e.g. Zusatz·stoff-Zulassung·s·verordnung: Reglamento relativo a la autorización de aditivos
3.  Frequent compounds will probably have been lexicalized and thus included in dictionaries,
    e.g. Rohmilch (leche cruda); Reifezeit (periodo de maduración)
4.  Terms referring to the main topic of a text are potential candidates to produce compounds
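
The first two hypotheses can be illustrated with a minimal Python sketch. The lemma table, the dictionary set and the 20-character threshold are assumptions made up for this illustration; a real setup would take lemmas from a lemmatizer and tune the threshold on the corpus.

```python
from collections import Counter


def lemma_frequencies(tokens, lemma_of):
    """Count occurrences per lemma so that inflected forms share one count.

    `lemma_of` is a hypothetical lookup table (lower-cased form -> lemma);
    forms missing from the table count as their own lemma.
    """
    return Counter(lemma_of.get(tok.lower(), tok.lower()) for tok in tokens)


def long_unlisted_words(tokens, dictionary, min_chars=20):
    """Return words that are long and absent from the dictionary.

    Long, unlisted words are treated here as possible newly created
    compounds. `min_chars` is an arbitrary illustrative threshold.
    """
    return [
        tok for tok in tokens
        if len(tok) >= min_chars and tok.lower() not in dictionary
    ]


tokens = ["Reifezeiten", "Reifezeit", "Rohmilch", "Zusatzstoff-Zulassungsverordnung"]
freqs = lemma_frequencies(tokens, {"reifezeiten": "reifezeit"})
new_compounds = long_unlisted_words(tokens, {"rohmilch", "reifezeit"})
```

Here both forms of Reifezeit are counted under one lemma, and only the long unlisted word is flagged as a possible new compound.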

Page 11: Agenda - Universitetet i Bergen

2. Nominal compounds

•  DE/NO nominal compounds are translated as phraseological units in Spanish
   –  That are a Spanish syntagmatic compound
   –  That are NOT a Spanish syntagmatic compound
•  AIM = determine which phraseological units in ES correspond to a DE/NO nominal compound
   → Identify common/differentiating features

Page 12: Agenda - Universitetet i Bergen

2.1. German nominal compounds

•  Lexicalized → appear in dictionaries
   –  Bushaltestelle
•  Not lexicalized → translational equivalent has to be determined
   –  Handbremsvorrichtung

Page 13: Agenda - Universitetet i Bergen

2.1. German nominal compounds

(a) Compound = 2 or more nouns
    -  Head = rightmost noun of the word
    a)  Non-head = complement → translation with PP
    b)  Non-head = modifier → unpredictable translation

Bus·fahrer (conductor de autobús); Programm·entwicklung (desarrollo de programas); Daten·schutz (protección de datos); Problem·lösung (resolución de problemas)

Land·haus (casa de campo); Fabrik·arbeiter (trabajador de fábrica); Nord·see·öl (petróleo del Mar del Norte)

Metall·industrie (industria metalúrgica); Haupt·aufgabe (tarea principal); Grund·fähigkeit (capacidad básica); End·produkt (producto final); Schlüssel·wort (palabra clave); Mitglied·staat (estado miembro)
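
Translations with a PP, as in the examples above, suggest one kind of surface pattern that could be searched for on the Spanish side. The following Python sketch only illustrates that idea: the tag names NOUN, ADP and DET are an assumption rather than the output format of a particular tagger, and the pattern is not a rule taken from the project.

```python
def noun_de_noun_candidates(tagged):
    """Return Spanish 'NOUN de (DET) NOUN' sequences as candidate units.

    `tagged` is a list of (token, pos) pairs. The tag names NOUN, ADP and
    DET are assumed for this sketch.
    """
    candidates = []
    for i, (_, pos) in enumerate(tagged):
        # A candidate needs a noun followed by at least two more tokens.
        if pos != "NOUN" or i + 2 >= len(tagged):
            continue
        prep, prep_pos = tagged[i + 1]
        if prep_pos != "ADP" or prep.lower() not in {"de", "del"}:
            continue
        j = i + 2
        if tagged[j][1] == "DET" and j + 1 < len(tagged):
            j += 1  # skip an optional article, as in "de la", "de los"
        if tagged[j][1] == "NOUN":
            candidates.append(" ".join(tok for tok, _ in tagged[i:j + 1]))
    return candidates


sentence = [("el", "DET"), ("conductor", "NOUN"), ("de", "ADP"),
            ("autobús", "NOUN"), ("llegó", "VERB")]
found = noun_de_noun_candidates(sentence)
```

A pattern like this would over-generate, so its output could only serve as a list of candidates to be filtered further.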

Page 14: Agenda - Universitetet i Bergen

2.1. German nominal compounds

(b) Compound = verbal root + noun
    –  Non-head is always a modifier
    a)  Head fills a thematic role in the argument frame of the modifier
        → Translation: participle / deverbal adjective

Schwimm·kran (grúa flotante) → Kran = THEME of schwimmen
Klapp·stuhl (silla plegable) → Stuhl = THEME of klappen
Hänge·brücke (puente colgante) → Brücke = THEME of hängen
Wasch·kleid (vestido lavable) → Kleid = THEME of waschen

Page 15: Agenda - Universitetet i Bergen

2.1. German nominal compounds

(b) Compound = verbal root + noun
    –  Non-head is always a modifier
    b)  Head has no thematic role in the argument frame of the verb, which is only a modifier
        → Translation: deverbal noun

Prüf·verfahren: proceso de inspección
Bade·anzug: traje de baño
Schwimm·lehrer: profesor de natación
Mal·wettbewerb: concurso de pintura

Page 16: Agenda - Universitetet i Bergen

2.1. German nominal compounds

(c) Compound = adjective + noun
    –  The adjective = modifier of the noun

Gesamt·ausgabe (edición completa); Höchst·geschwindigkeit (velocidad máxima); Zentral·einheit (unidad central); Privat·bereich (sector privado).

Page 17: Agenda - Universitetet i Bergen

2.2. Spanish syntagmatic compounds

•  According to Val Álvaro (1999):
   –  Lexical compounds (not relevant for us)
   –  Syntagmatic compounds:
      •  They have fixed syntactic structures;
      •  They refer to a single conceptual unit;
      •  They usually allow their non-compositional meaning to be deautomatised;
      •  They usually resist cohesion more strongly the higher their semantic transparency is

Page 18: Agenda - Universitetet i Bergen

2.2. Spanish syntagmatic compounds

•  Syntactic fixedness can be recognized when:
   –  The constituents only appear in a fixed order
   –  The constituents cannot be replaced by other lexical units
   –  Modifier determiners or quantifiers may not be changed
   –  Only the whole phrase as such can be changed
   –  None of the constituents may be separated from the others (i.e. question mark), and no pronominal reference can be made to only one of its constituents
   –  Ellipsis is not allowed, for instance in the case of phrase coordination

Page 19: Agenda - Universitetet i Bergen

2.3. Spanish-German correspondences

DE-ES phraseological units ES-DE phraseological units

Page 21: Agenda - Universitetet i Bergen

3.1. IBM 1-5 Models

•  Models 1 & 2: based on the length of a string
   –  M1: word order in e and f does not affect Pr(f|e)
      → All possible connections for a position in French have the same probability
   –  M2: Pr(f|e) depends on the word order in e and f
      → The probability of a connection depends on the positions it connects and on the lengths of both strings
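
As an illustration of Model 1, the following Python sketch estimates lexical translation probabilities t(f|e) with expectation-maximisation on a toy corpus. It is a textbook-style sketch and not the Giza++ implementation; the function name, the NULL token convention and the toy sentences are assumptions for this example.

```python
def train_ibm_model1(bitext, iterations=10):
    """Estimate t(f|e) with EM, ignoring word order as Model 1 does.

    `bitext` is a list of (f_sentence, e_sentence) token lists. A NULL
    word is added to every e sentence so that f words may stay unaligned.
    """
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    uniform = 1.0 / len(f_vocab)  # initial value for every pair
    t = {}
    for _ in range(iterations):
        count, total = {}, {}
        for f_sent, e_sent in bitext:
            e_words = ["NULL"] + list(e_sent)
            for f in f_sent:
                # E-step: share one count for f among all e words.
                norm = sum(t.get((f, e), uniform) for e in e_words)
                for e in e_words:
                    share = t.get((f, e), uniform) / norm
                    count[(f, e)] = count.get((f, e), 0.0) + share
                    total[e] = total.get(e, 0.0) + share
        # M-step: renormalise the expected counts per e word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


toy_bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(toy_bitext)
```

On this toy corpus the probability mass for "house" moves towards "Haus", because "das" is better explained by "the", which co-occurs with it twice.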

Page 22: Agenda - Universitetet i Bergen

3.1. IBM 1-5 Models

Models 3, 4, and 5 develop the French string by choosing, for each word in the English string:
-  The number of words in the French string that will be aligned to it;
-  The identity of these French words; and
-  The actual positions in the French string that these words will occupy

-  M3: the probability of an alignment depends on the positions it aligns and the lengths of the English and French strings
-  M4: the probability of an alignment depends in addition on the identities of the French and English words aligned and on the alignments of any other French words that are aligned to the same English word
-  M5: improves the results obtained with M4 by removing its deficiency (M3 and M4 waste probability mass on placements that cannot occur, e.g. two words in the same position)

Page 23: Agenda - Universitetet i Bergen

3.1. IBM 1-5 Models

•  As corpora aligned at the subsentential level are not usually available, the alignments have to be obtained in some other way
•  Model 1 only takes words into account and is quick and easy to train
•  Model 1 results are used as the basis for Model 2, and so on
   → My preprocessing module would allow us to train the aligner directly with Model 1

Page 24: Agenda - Universitetet i Bergen

3.2. HMM Models

•  They are based on the assumption that alignments tend to preserve locality
   → Neighboring words in the original language are often aligned with neighboring words in the target language
•  Each alignment decision is conditioned on previous decisions
•  Disadvantage: they do not usually allow for multiword alignment
   → Deng & Byrne (2005) propose a system with 1:n alignments that we may try
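
The locality assumption can be made concrete with a small Python sketch in which the probability of the next aligned position depends only on the jump width from the previous one. The function names, the add-one smoothing and the toy position sequences are assumptions for illustration; this is not the model of Deng & Byrne (2005).

```python
from collections import Counter


def jump_counts(position_sequences):
    """Count jump widths a_j - a_(j-1) over sequences of aligned positions."""
    counts = Counter()
    for positions in position_sequences:
        for prev, cur in zip(positions, positions[1:]):
            counts[cur - prev] += 1
    return counts


def transition_prob(prev, cur, length, counts, smoothing=1.0):
    """p(cur | prev, length), normalised over positions 0 .. length - 1.

    `smoothing` is an assumed add-one style constant so that unseen jump
    widths keep a small probability.
    """
    numerator = counts[cur - prev] + smoothing
    denominator = sum(counts[i - prev] + smoothing for i in range(length))
    return numerator / denominator


counts = jump_counts([[0, 1, 2, 3], [0, 2, 1, 3]])
p_next = transition_prob(1, 2, 4, counts)
p_skip = transition_prob(1, 3, 4, counts)
```

With these toy sequences a jump of +1 is the most frequent, so moving to the neighboring position receives the highest probability.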

Page 26: Agenda - Universitetet i Bergen

4. Methodology

•  Improve the accuracy of 1:n statistical word alignments by packing phraseological units in a preprocessing module
•  The preprocessing module shall pack as single words those Spanish phraseological units that will correspond to a German/Norwegian nominal compound, using:
   –  Linguistic rules
   –  Statistical analysis results
•  Word aligners will then be trained on the preprocessed corpus
•  Packed units will be unpacked to produce the correct corresponding 1:n alignments
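
The pack / align / unpack cycle described above can be sketched in Python as follows. The function names, the underscore joiner and the hand-written unit list are assumptions for this illustration; in the project the units would come from the linguistic rules and statistical analysis, and the links from the word aligner.

```python
def pack(tokens, units, joiner="_"):
    """Join known multiword units into single tokens (longest match first).

    Returns the packed tokens and, for each packed token, the list of
    original token positions it covers.
    """
    max_len = max((len(u) for u in units), default=1)
    packed, spans = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tok.lower() for tok in tokens[i:i + n])
            if n == 1 or candidate in units:
                packed.append(joiner.join(tokens[i:i + n]))
                spans.append(list(range(i, i + n)))
                i += n
                break
    return packed, spans


def unpack_alignment(links, spans):
    """Expand (de_index, packed_es_index) links into 1:n links
    over the original Spanish token positions."""
    return sorted((de, es) for de, p in links for es in spans[p])


es_tokens = "la protección de datos es importante".split()
packed, spans = pack(es_tokens, {("protección", "de", "datos")})
# Suppose the aligner linked German token 0 (Datenschutz) to packed token 1.
links = unpack_alignment([(0, 1)], spans)
```

Because the aligner only sees one Spanish token for the unit, a single 1:1 link is enough; unpacking turns it back into the 1:n alignment over the original words.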

Page 27: Agenda - Universitetet i Bergen

4.1. Corpus

•  Needed because it offers the linguistic evidence required to identify translation correspondences (Melamed, 1996)
•  German-Spanish corpus
   –  DG Enterprise project of the EC
   –  Specialized and divided into domains and subdomains:
      •  Specialized texts tend to have a greater number of compounds
      •  Will help us determine whether our approach requires domain tuning
•  Norwegian-Spanish corpus
   –  Compilation of Norwegian literature translated into Spanish
   –  European law translations?

Page 28: Agenda - Universitetet i Bergen

4.1. Corpus

German-Spanish corpus: current status

FIELD                                       # FILES AUSTRIA   # FILES GERMANY   # FILES SPAIN
B00: CONSTRUCTION                           205               174               39
C00A: AGRICULTURE, FISHING AND FOODSTUFFS   52                60                78
C00C: CHEMICALS                             16                19                12
C00P: PHARMACEUTICALS AND COSMETICS         3                 17                3
H00: DOMESTIC AND LEISURE EQUIPMENT         12                7                 36
I00: MECHANICS                              28                8                 45
N00E: ENERGY, MINERALS, WOOD                22                14                14
S00E: ENVIRONMENT                           24                27                12
S00S: HEALTH, MEDICAL EQUIPMENT             4                 1                 2
SERV: 98/48/EC SERVICES                     15                38                9
T00T: TRANSPORT                             0                 0                 0
V00T: TELECOMS                              0                 0                 0
X00M: GOODS AND MISCELLANEOUS PRODUCTS      0                 0                 0

Page 29: Agenda - Universitetet i Bergen

4.2. Work plan

1.  Corpus
    1.  Compilation, clean-up and alignment at sentence level
    2.  Alignment split into separate language files
    3.  Lemmatization and POS-tagging
2.  Manual analysis of a subpart of the corpus
    1.  Gold Standard establishment
    2.  Creation of a training and test set
    3.  Establishment of rules to filter candidates to be packed
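
The step of splitting the sentence alignment into separate language files can be sketched in Python as below. The function name and the in-memory pair format are assumptions; the sketch only produces two parallel lists of lines, and writing them to files (one sentence per line) is left to the caller.

```python
def split_language_sides(pairs):
    """Split sentence-aligned (de, es) pairs into two parallel line lists.

    Whitespace is normalised, and pairs with an empty side are dropped so
    that line i of one list always corresponds to line i of the other.
    """
    de_lines, es_lines = [], []
    for de, es in pairs:
        de, es = " ".join(de.split()), " ".join(es.split())
        if de and es:
            de_lines.append(de)
            es_lines.append(es)
    return de_lines, es_lines


aligned = [
    ("Das  ist ein Test .", "Esto es una prueba ."),
    ("", "Línea sin pareja"),
    ("Rohmilch\n", "leche cruda"),
]
de_lines, es_lines = split_language_sides(aligned)
```

Keeping the two lists strictly parallel matters more than anything else here, since the word aligner relies on line numbers to pair the sentences.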

Page 30: Agenda - Universitetet i Bergen

4.2. Work plan

3.  Experiments:
    1.  Experiment 0:
        1.  Corpus processing with Giza++ (IBM Model 1)
        2.  Analysis of results
    2.  Implementation of the first set of rules (A)
    3.  Experiment 1:
        1.  Corpus processing with Giza++ (IBM Model 1)
        2.  Analysis of results using the set of rules A
    4.  Comparison of results between experiments 0 and 1
    5.  Error analysis and establishment of set of rules B
    6.  Experiment 2:
        1.  Corpus processing with Giza++ (IBM Model 1)
        2.  Analysis of results using the set of rules B
    7.  Comparison of results across experiments 0-2
    8.  Error analysis and final results

Page 31: Agenda - Universitetet i Bergen

4.2. Work plan

•  The experiments and the development of the preprocessing module will first focus on the German-Spanish language pair; Norwegian will be included at a later stage

•  The Alignment Error Rate (AER; Och & Ney, 2003) will be used to evaluate alignment results

•  We may also test whether MT (Moses) quality increases when the preprocessing module is used, and compare it with other MT systems, whether statistical (Google), rule-based (Lucy Software) or hybrid (Apertium/Systran)
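
For reference, AER compares a hypothesised alignment A with gold-standard sure links S and possible links P (where S is a subset of P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal Python sketch follows; the function name and the toy links are assumptions for illustration.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better.

    Links are (source_index, target_index) pairs. Sure links are always
    treated as possible links too.
    """
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
aer = alignment_error_rate({(0, 0), (1, 2), (2, 2)}, sure, possible)
```

In this toy case one sure link is found, one proposed link is only possible and one is wrong, giving an AER of 0.4; a hypothesis identical to the sure links would score 0.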

Page 33: Agenda - Universitetet i Bergen

5. Expected results

•  If the results are positive, we could consider implementing a semi-supervised model for detecting phraseological units

•  That model could be used to improve the preprocessing module and run similar experiments

•  If the approach proves successful for both German and Norwegian, other Germanic-Romance language pairs may be considered for similar experiments

Page 34: Agenda - Universitetet i Bergen

Expected results

1.  Specialized parallel corpora that can be reused for other purposes once the project has finished
2.  90% accuracy of the preprocessing module
3.  Improvement of 1:n alignments in statistical word aligners (e.g. Giza++) by chunking phraseological units in Spanish
4.  Automatic compound dictionary extraction
5.  Other NLP areas that can also benefit from this project:
    –  Computer Assisted Translation (CAT)
    –  Cross-lingual information retrieval
    –  Computer Assisted Language Learning (CALL)
    –  NLP applications involving 2 or more languages

Page 35: Agenda - Universitetet i Bergen

Questions?

Thank you very much for your attention!