Machine Translation using Tectogrammatics Zdeněk Žabokrtský IFAL, Charles University in Prague

Page 1

Machine Translation using Tectogrammatics

Zdeněk Žabokrtský
IFAL, Charles University in Prague

Page 2

Overview

Part I - theoretical background

Part II - TectoMT system

Page 3

MT pyramid (in terms of PDT)

Key question in MT: optimal level of abstraction?

Our answer: somewhere around tectogrammatics

high generalization over different language characteristics, but still computationally (and mentally!) tractable

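[Figure: MT triangle. Corners: source language, target language, interlingua (apex). Vertical axis "level of abstraction": raw text, morpho., surf. synt., tectogram. Horizontal axis: "transfer distance". A "?" marks the open question of the optimal level.]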

Page 4

Basic facts about "Tecto"

introduced by Petr Sgall in the 1960s

implemented in the Prague Dependency Treebank 2.0

each sentence represented as a deep-syntactic dependency tree

functional words accompanying an autosemantic word "collapse" with it into a single t-node, labeled with the autosemantic t-lemma

added t-nodes (e.g. because of pro-drop)

semantically indispensable syntactic and morphological categories rendered by a complex system of t-node attributes (functors+subfunctors, grammatemes for tense, number, degree of comparison, etc.)

Page 5

SMT and limits of growth

current state-of-the-art approaches to MT: n-grams + large parallel (and also monolingual) corpora + huuuuge computational power

n-grams are very greedy!

availability (or even existence!) of more data?

example: Czech-English parallel data (MW = million words, GW = billion words)

~1 MW - easy (just download and align some tens of e-books)

~10 MW - doable (parallel corpus CzEng)

~100 MW - not now, but maybe in a couple of years...

~1 GW - ?

~10 GW (~ 100 000 books) - Was it ever translated???

Page 6

How could tecto help SMT?

n-gram view: manifestations of lexemes are mixed with manifestations of language means expressing the relations between the lexemes and of other grammar rules

inflectional endings, agglutinative affixes, functional words, word order, punctuation, orthographic rules ...

example: It will be delivered to Mr. Green's assistants at the nearest meeting.

training data sparsity

how could tecto ideas help?

within each sentence, clear separation of meaningful "signs" from "signs" which are only imposed by grammar (e.g. imposed by agreement)

clear separation of lexical, syntactical and morphological meaning components

modularization of the translation task

potential for a better structuring of statistical models

more effective exploitation of the limited training data

Page 7

"Semitecto"

abstract sentence representation, tailored for MT purposes

motivation:

not to make decisions which are not really necessary for the MT process (such as distinguishing between many types of temporal and directional semantic complementations)

given the target-language "semitecto" tree, we want the sentence generation to be deterministic

slightly "below" tecto (w.r.t. the abstraction axis):

adopting the idea of separating lexical, syntactical and morphological meaning components

adopting the t-tree topology principles

adopting many t-node attributes (especially grammatemes, coreference, etc.)

but (almost) no functors, no subfunctors, no WSD, no pointers to valency dictionary, no tfa...

closer to the surface syntax

main innovation: concept of formemes

Page 8

Formemes

formeme = morphosyntactic language means expressing the dependency relation

n:v+6 (in Czech) = semantic noun which is expressed on the surface as a prepositional group in the locative with the preposition "v"

v:that+fin/a (in English) = semantic verb expressed in active voice as the head of a subordinate clause introduced by the subordinating conjunction "that"

obviously, the sets of formeme values are specific to each of the four semantic parts of speech

in fact, formemes are edge labels partially substituting functors

what is NOT captured by formemes:

morphological categories imposed by grammar rules (esp. by agreement), such as gender, number and case for adjectives in attributive positions

morphological categories already represented by grammatemes, such as degree of comparison for adjectives, tense for verbs, number for nouns
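The formeme labels above follow a regular pattern, which can be illustrated with a small parser. This is only a sketch in Python (not the system's own code): the `Formeme` class, its field names and `parse_formeme` are hypothetical, and the reading of the label as `sempos:function_word+form/voice` is inferred from the two examples on this slide. The value "p" for passive voice is likewise an assumption based on "a" being glossed as active.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Formeme:
    """Hypothetical container for the parts of a formeme label."""
    sempos: str                   # semantic part of speech: n, v, adj, adv
    function_word: Optional[str]  # preposition or conjunction, if any
    form: str                     # e.g. case number, "fin", "inf", "attr"; may be empty
    voice: Optional[str]          # "a" = active (assumed: "p" = passive); verbs only


def parse_formeme(label: str) -> Formeme:
    """Split a label such as 'n:v+6' or 'v:that+fin/a' into its parts."""
    sempos, _, rest = label.partition(":")
    rest, _, voice = rest.partition("/")
    if "+" in rest:
        function_word, _, form = rest.partition("+")
    else:
        function_word, form = None, rest
    return Formeme(sempos, function_word, form, voice or None)


czech = parse_formeme("n:v+6")
english = parse_formeme("v:that+fin/a")
print(czech)
print(english)
```

Keeping the function word and the form as separate fields mirrors the point made above: the label records how the dependency is expressed, while agreement categories and grammatemes stay outside it.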

Page 9

Formemes in the tree

Example: It is extremely important that Iraq held elections to a constitutional assembly.

Page 10

Some more examples of proposed formemes

Czech
968  adj:attr
604  n:1
552  n:2
497  v:fin/a
308  n:4
260  adv:
169  n:v+6
133  adj:compl
117  v:inf
104  n:poss
 86  n:7
 82  v:že+fin/a
 77  v:rc/a
 63  n:s+7
 53  n:k+3
 53  n:attr
 50  n:na+6
 47  n:na+4
 42  v:aby+fin/a

English
661  adj:attr
568  n:attr
456  n:subj
413  n:obj
370  v:fin/a
273  n:of+X
238  adv:
160  n:poss
160  n:in+X
146  v:to+inf/a
 92  adj:compl
 91  n:to+X
...
 62  v:rc/a
...
 51  v:that+fin/a
...
 39  v:ger/a

Page 11

Three-way transfer

translation process (I have been asked by him to come -> Požádal mě, abych přišel):

1. source language sentence analysis up to the "semitecto" layer

2. transfer of

lexemes (ask -> požádat, come -> přijít)

formemes (v:fin/p -> v:fin/a, v:to+inf -> v:aby+fin/a)

grammatemes (tense=past -> past, 0 -> verbmod=cdn)

3. target language sentence synthesis from the "semitecto" layer
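The transfer step for the example sentence can be sketched as three independent lookups per node. This is a minimal illustration in Python, not the system's implementation: the `Node` class and the three tables are hypothetical, the table entries are just the pairs from the example, and the grammateme line is read here as "past tense is carried over, and verbmod=cdn is added on the Czech side". Tying the added grammateme to the target formeme is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Hypothetical 'semitecto' node: lemma, formeme and grammatemes."""
    lemma: str
    formeme: str
    grammatemes: Dict[str, str] = field(default_factory=dict)


# Toy tables holding only the pairs from the slide's example.
LEMMA_TABLE = {"ask": "požádat", "come": "přijít"}
FORMEME_TABLE = {"v:fin/p": "v:fin/a", "v:to+inf": "v:aby+fin/a"}
# Assumption: grammatemes introduced on the target side depend on the target formeme.
ADDED_GRAMMATEMES = {"v:aby+fin/a": {"verbmod": "cdn"}}


def transfer(node: Node) -> Node:
    """Translate lemma, formeme and grammatemes of a single node separately."""
    formeme = FORMEME_TABLE[node.formeme]
    grammatemes = dict(node.grammatemes)  # e.g. tense=past is kept as is
    grammatemes.update(ADDED_GRAMMATEMES.get(formeme, {}))
    return Node(LEMMA_TABLE[node.lemma], formeme, grammatemes)


source = [Node("ask", "v:fin/p", {"tense": "past"}), Node("come", "v:to+inf")]
for node in source:
    print(transfer(node))
```

Because the three components are looked up separately, each table can be replaced by a statistical model without touching the other two.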

Page 12

Adding statistics...

[Figure: source-language and target-language trees, annotated with the models below.]

translation model (e.g. from the parallel corpus CzEng, 30 MW): P(l_T | l_S), P(f_T | f_S)

"binode" language model (e.g. from the partially parsed Czech National Corpus, 100 MW): P(l_gov, l_dep, f)
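One way the three distributions could work together is to score every candidate pair of target lemma and target formeme for a node by the product of the two translation probabilities and the binode probability under the already chosen governing lemma. The slides do not say how the models are combined, so the product below is an assumption, `best_translation` is a hypothetical helper, and all probabilities in the usage part are invented toy values, not estimates from CzEng or the Czech National Corpus.

```python
import math
from itertools import product


def best_translation(l_src, f_src, l_gov, p_lemma, p_formeme, p_binode):
    """Pick (l_T, f_T) maximizing P(l_T|l_S) * P(f_T|f_S) * P(l_gov, l_T, f_T)."""
    best, best_score = None, -math.inf
    candidates = product(p_lemma[l_src].items(), p_formeme[f_src].items())
    for (l_tgt, p_l), (f_tgt, p_f) in candidates:
        p_b = p_binode.get((l_gov, l_tgt, f_tgt), 0.0)
        if p_l <= 0.0 or p_f <= 0.0 or p_b <= 0.0:
            continue  # unseen combination, skipped in this sketch (no smoothing)
        score = math.log(p_l) + math.log(p_f) + math.log(p_b)
        if score > best_score:
            best, best_score = (l_tgt, f_tgt), score
    return best


# Invented toy values, for illustration only.
p_lemma = {"Iraq": {"Irák": 1.0}}
p_formeme = {"n:subj": {"n:1": 0.8, "n:v+6": 0.2}}
p_binode = {
    ("proběhnout", "Irák", "n:1"): 0.001,
    ("proběhnout", "Irák", "n:v+6"): 0.01,
}

result = best_translation("Iraq", "n:subj", "proběhnout", p_lemma, p_formeme, p_binode)
print(result)
```

With these toy values the binode model outweighs the formeme translation model, so the English subject ends up as a Czech prepositional group, as in "v Iráku proběhly volby".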

Page 13

Part II - TectoMT system

Page 14

Goals

primary goal:

to build a high-quality linguistically motivated MT system using the PDT layered framework, starting with the English -> Czech direction

secondary goals:

to create a system for testing the true usefulness of various NLP tools within a real-life application

to exploit the abstraction power of tectogrammatics

to supply data and technology for other projects

Page 15

Main design decisions

Linux + Perl

set of well-defined, linguistically relevant levels of language representation

neutral w.r.t. chosen methodology (e.g. rules vs. statistics)

in-house OO architecture as the backbone, but easy incorporation of external tools (parsers, taggers, lemmatizers etc.)

accent on modularity: translation scenario as a sequence of translation blocks (modules corresponding to individual NLP subtasks)

[Figure: MT triangle. Corners: source language, target language, interlingua (apex). Levels from bottom to top: raw text, morpho., surf. synt., tectogram.]

Page 16

TectoMT - example of analysis (1)

Sample sentence: It is extremely important that Iraq held elections to a constitutional assembly.

Page 17

TectoMT - example of analysis (2)

phrase-structure tree:

Page 18

TectoMT - example of analysis (3)

analytical tree:

Page 19

TectoMT - example of analysis (4)

tectogrammatical tree (with formemes):

Page 20

Heuristic alignment

Sentence pair:

It is extremely important that Iraq held elections to a constitutional assembly.

Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.

Page 21

Formeme pairs extracted from parallel aligned trees

count  Czech        English
593    adj:attr     adj:attr
290    v:fin/a      v:fin/a
282    n:1          n:subj
214    adj:attr     n:attr
165    n:2          n:of+X
152    adv:         adv:
149    n:4          n:obj
102    n:2          n:attr
 86    n:v+6        n:in+X
 79    n:poss       n:poss
 73    n:1          n:obj
 61    n:2          n:obj
 51    v:inf        v:to+inf/a
 50    adj:compl    adj:compl
 39    n:2          n:
 34    n:4          n:subj
 34    n:attr       n:attr
 32    v:že+fin/a   v:that+fin/a
 32    n:2          n:poss
 27    n:4          n:attr
 27    n:2          n:subj
 26    adj:attr     n:poss
 25    v:rc/a       v:rc/a
 20    v:aby+fin/a  v:to+inf/a
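Pair counts like these can be turned into a formeme translation model by relative frequency. The sketch below, in Python, estimates P(Czech formeme | English formeme) for the English -> Czech direction from a few of the counts listed above (Czech formeme first, English formeme second in each pair). The function name `estimate` is hypothetical, and because the slide shows only the most frequent pairs, the resulting numbers are normalized over the listed pairs only and are not the system's actual model.

```python
from collections import defaultdict

# (count, Czech formeme, English formeme), copied from the list above.
PAIR_COUNTS = [
    (282, "n:1", "n:subj"),
    (149, "n:4", "n:obj"),
    (73, "n:1", "n:obj"),
    (61, "n:2", "n:obj"),
    (34, "n:4", "n:subj"),
    (27, "n:2", "n:subj"),
]


def estimate(pair_counts):
    """Relative-frequency estimate of P(f_cs | f_en) from aligned pair counts."""
    totals = defaultdict(int)
    for count, _f_cs, f_en in pair_counts:
        totals[f_en] += count
    model = defaultdict(dict)
    for count, f_cs, f_en in pair_counts:
        model[f_en][f_cs] = count / totals[f_en]
    return dict(model)


model = estimate(PAIR_COUNTS)
for f_cs, p in sorted(model["n:subj"].items(), key=lambda kv: -kv[1]):
    print(f"P({f_cs} | n:subj) = {p:.3f}")
```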

Page 22

Processing blocks in the current prototype

[Figure: layers used in the prototype: input text, src-m-layer, src-p-layer, src-a-layer, src-t-layer, trg-t-layer, output text]

1) segment the input text into sentences

2) tokenize the sentences

3) morphological tagging

4) lemmatize each token

5) phrase-structure parsing

6) mark phrase heads

7) run constituency -> dependency transformation

8) mark subject nodes

9) derive the t-tree topology

10) label t-nodes with t-lemmas

11) assign coordination/apposition functors

12) mark finite clauses

13) detect grammatical coreference in relative clauses

14) determine the semantic part of speech

15) fill grammateme attributes (number, tense, degree...)

16) detect the sentence modality

17) detect formeme

18) clone the source-language t-tree

19) translate t-lemmas using a simple 1:1 probabilistic lexicon

20) set the gender attribute according to the noun lemma

21) set the aspect attribute according to the verb lemma

22) predict the target-language formeme

23) resolve morphological agreement

24) expand complex verb forms

25) add prepositions and conjunctions

26) perform conjugation and declension

27) resolve word order

28) add punctuation

29) perform vocalization of prepositions

30) concatenate the tokens into final sentence string
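The idea of a translation scenario as a sequence of blocks can be sketched as a list of functions applied in order to a shared document. The real system is written in Perl; the Python below is only an illustration, and the names `run_scenario`, `segment_sentences` and `tokenize`, the dictionary-based document, and the naive regular expressions are all assumptions standing in for blocks 1) and 2).

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, object]
Block = Callable[[Document], Document]


def segment_sentences(doc: Document) -> Document:
    """Block 1 (naive): split after ., ! or ? followed by whitespace."""
    text = str(doc["text"]).strip()
    doc["sentences"] = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return doc


def tokenize(doc: Document) -> Document:
    """Block 2 (naive): separate word characters from punctuation."""
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def run_scenario(blocks: List[Block], doc: Document) -> Document:
    """Apply the blocks one after another; each block enriches the document."""
    for block in blocks:
        doc = block(doc)
    return doc


scenario = [segment_sentences, tokenize]
doc = run_scenario(scenario, {"text": "It is important. Iraq held elections."})
print(doc["sentences"])
print(doc["tokens"])
```

Since every block has the same interface, a rule-based block can be swapped for a statistical one, or for a wrapper around an external tool, without changing the rest of the scenario.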

Page 23

Thank you!