Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein

Post on 11-Jan-2016

Using Alignment for Multilingual-Text Compression

Ehud S. Conley and Shmuel T. Klein

Outline

• Multilingual text

• Problem definition

• Multilingual-text alignment

• Compression of multilingual texts using alignment
– Algorithm
– Results

• Future work

Multilingual text

• Same contents in two or more (natural) languages
– Legislative texts of the European Union in all EU languages

Subject: Supplies of military equipment to Iraq

Objet: Livraisons de matériel militaire à l’Irak

Problem definition

• How can multilingual texts be compressed more efficiently relative to compression of each language separately?
– Can semantic equivalence be exploited to reduce aggregate corpus size?

Multilingual-text alignment (1)

• Mapping of equivalent text fragments to each other
– Paragraph/sentence and word/phrase levels
– Algorithms for both levels
  • Tokenization, lemmatization, shallow parsing
– Alignment possibly partial

Multilingual-text alignment (2)

Subject : Supplies of military equipment to Iraq

Objet : Livraisons de matériel militaire à l’ Irak

[Figure: word-level alignment between the two tokenized sentences]

Linear alignment

• Given two parallel fragments S and T, the linear alignment of a token $t_j$ in T is the token $s_i$ in S such that:

$$ i = \left\lfloor j \cdot \frac{|S|}{|T|} + 0.5 \right\rfloor $$
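The rule can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes 1-based token positions and the rounding form floor(j·|S|/|T| + 0.5), which agrees with the |S| = 8, |T| = 9 example used in these slides. The function name `linear_alignment` is hypothetical.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Return the 1-based position i in S linearly aligned with position j in T."""
    # floor(j * |S| / |T| + 0.5), written in integer arithmetic
    # to avoid floating-point rounding at exact .5 boundaries.
    return (2 * j * len_s + len_t) // (2 * len_t)


# Example sentence pair from the slides: |S| = 8 (English), |T| = 9 (French).
positions = [linear_alignment(j, 8, 9) for j in range(1, 10)]
```

For a very short S the result can fall to 0, which is outside the 1-based range; the slides do not say how that case is handled.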

Correct vs. linear alignment

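$$ i = \left\lfloor j \cdot \frac{8}{9} + 0.5 \right\rfloor, \qquad |S| = 8,\ |T| = 9 $$

Token positions: 1 2 3 4 5 6 7 8 9

S: Subject : Supplies of military equipment to Iraq

T: Objet : Livraisons de matériel militaire à l’ Irak

[Figure: correct vs. linear alignment links]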

Offset from linear alignment

• Signed distance between correct and linear alignments

– Usually very small values (mostly [-10, 10])

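offset = 2

Token positions: 1 2 3 4 5 6 7 8 9

S: Subject : Supplies of military equipment to Iraq

T: Objet : Livraisons de matériel militaire à l’ Irak

[Figure: correct vs. linear alignment links. For matériel (j = 5) the linear alignment is position 4 and the correct alignment is position 6 (equipment), giving offset = 2.]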
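The offsets for the example sentence pair can be recomputed with a few lines of Python. This is a sketch under assumptions: token positions are 1-based, linear alignment is taken as floor(j·|S|/|T| + 0.5), and the `correct` map is read off the pointer example in these slides rather than produced by an aligner. All names are hypothetical.

```python
S = ["Subject", ":", "Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Objet", ":", "Livraisons", "de", "matériel", "militaire", "à", "l’", "Irak"]

# Correct alignment: 1-based position in T -> 1-based position in S
# (assumed from the slides' pointer example; unaligned tokens are omitted).
correct = {1: 1, 3: 3, 5: 6, 6: 5, 9: 8}


def linear(j: int, len_s: int, len_t: int) -> int:
    """floor(j * |S| / |T| + 0.5) in integer arithmetic."""
    return (2 * j * len_s + len_t) // (2 * len_t)


# Signed distance between the correct and the linear alignment.
offsets = {T[j - 1]: i - linear(j, len(S), len(T)) for j, i in correct.items()}
```

Only matériel deviates from its linear position here; every other aligned word has offset 0, which is the kind of skew toward small values the scheme relies on.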

Compression of multilingual texts using alignment: Basic idea (1)

• Compress by replacing words/phrases with pointers to their translations within the other text
– Original text restored using bilingual dictionary

• Store offsets relative to linear alignment
– Small values → small number of values → efficient encoding

Compression of multilingual texts using alignment: Basic idea (2)

• Store number of words in pointed fragment
– Might be a multi-word phrase
– bilan → balance sheet

• Single pointer may replace multi-word phrase
– matériel militaire → pointer to military equipment
– chemin de fer → railway

Basic scheme: Example (option 1)

• Prefixes: 0 - word, 1 - pointer

• 1(offset, length)

offset = 2

Token positions: 1 2 3 4 5 6 7 8 9

S: Subject : Supplies of military equipment to Iraq

T: Objet : Livraisons de matériel militaire à l’ Irak

[Figure: correct vs. linear alignment links]

Encoding of T: 1(0, 1) 0(:) 1(0, 1) 0(de) 1(2, 1) 1(0, 1) 0(à) 0(l’) 1(0, 1)

Words replaced by pointers: Objet, Livraisons, matériel, militaire, Irak
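The encoding on this slide can be reproduced with a small encoder. This is a sketch, assuming 1-based positions, linear alignment computed as floor(j·|S|/|T| + 0.5), single-word pointers only (length always 1), and a hand-written `correct` alignment map. The function names and the textual `1(offset, length)` / `0(word)` output are for illustration, not the binary format.

```python
def linear(j: int, len_s: int, len_t: int) -> int:
    """floor(j * |S| / |T| + 0.5) in integer arithmetic."""
    return (2 * j * len_s + len_t) // (2 * len_t)


def encode_option1(t_tokens, len_s, correct):
    """Emit 1(offset, 1) for aligned tokens of T and 0(word) for the rest."""
    out = []
    for j, word in enumerate(t_tokens, start=1):
        if j in correct:
            offset = correct[j] - linear(j, len_s, len(t_tokens))
            out.append(f"1({offset}, 1)")
        else:
            out.append(f"0({word})")
    return out


S = ["Subject", ":", "Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Objet", ":", "Livraisons", "de", "matériel", "militaire", "à", "l’", "Irak"]
correct = {1: 1, 3: 3, 5: 6, 6: 5, 9: 8}  # T position -> S position (assumed)

encoded = encode_option1(T, len(S), correct)
```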

Basic scheme: Example (option 2)

• matériel militaire → pointer to military equipment

• Offset relative to first words

offset = 1

Token positions: 1 2 3 4 5 6 7 8 9

S: Subject : Supplies of military equipment to Iraq

T: Objet : Livraisons de matériel militaire à l’ Irak

[Figure: correct vs. linear alignment links]

Encoding of T: 1(0, 1) 0(:) 1(0, 1) 0(de) 1(1, 2) 0(à) 0(l’) 1(0, 1)

Words replaced by pointers: Objet, Livraisons, matériel militaire, Irak
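Decoding the option 2 stream can be sketched as follows. Several things here are assumptions rather than statements from the slides: the decoder is taken to know |T| in advance, the position j advances by the number of words each pointer restores, and `dictionary` is a toy phrase table with exactly one translation per entry. All names are hypothetical.

```python
def linear(j: int, len_s: int, len_t: int) -> int:
    """floor(j * |S| / |T| + 0.5) in integer arithmetic."""
    return (2 * j * len_s + len_t) // (2 * len_t)


def decode_option2(codes, s_tokens, len_t, dictionary):
    """Restore T from word codes and (offset, length) pointers into S."""
    out, j = [], 1
    for code in codes:
        if code[0] == "word":
            out.append(code[1])
            j += 1
        else:
            _, offset, length = code
            i = linear(j, len(s_tokens), len_t) + offset  # first pointed word in S
            phrase = tuple(s_tokens[i - 1 : i - 1 + length])
            translation = dictionary[phrase]
            out.extend(translation)
            j += len(translation)  # a pointer may restore several words of T
    return out


S = ["Subject", ":", "Supplies", "of", "military", "equipment", "to", "Iraq"]
dictionary = {  # toy bilingual phrase table (assumed)
    ("Subject",): ["Objet"],
    ("Supplies",): ["Livraisons"],
    ("military", "equipment"): ["matériel", "militaire"],
    ("Iraq",): ["Irak"],
}
codes = [("ptr", 0, 1), ("word", ":"), ("ptr", 0, 1), ("word", "de"),
         ("ptr", 1, 2), ("word", "à"), ("word", "l’"), ("ptr", 0, 1)]

restored = decode_option2(codes, S, 9, dictionary)
```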

Complication: Words with multiple possible translations

• Sometimes more than one possible translation per word
– equipment →
  1. équipement
  2. matériel

• Must encode correct translation within pointer
– Store index of translation

Complication: Morphological variants (1)

• Bilingual dictionary must use one morphological form (lemma)
– go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}

Complication: Morphological variants (2)

• Texts include inflected forms
– More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup
– Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T

Complication: Morphological variants (3)

LEMMA DICTIONARY
lower: 0. low (adj.) 1. lower (verb)
bound: 0. bound 1. bind

BILINGUAL DICTIONARY
low: 0. bas 1. déprimé 2. grave 3. ignoble 4. inférieur 5. …
bound: 0. bondir 1. limite 2. borne 3. bond 4. …

VARIANT DICTIONARY
borne: 0. borne (sing.) 1. bornes (pl.)
inférieur: 0. inférieur (masc.) 1. inférieure (fem.) 2. inférieurs 3. inférieures

Example: lower bound → borne inférieure

1(1,1,0,2,0) → borne
1(-1,1,0,4,1) → inférieure

• 1(offset, length, lemma(s), translation, variant(s))
• Multiple values for multiple words
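The lookup chain for one pointer (lemma dictionary, then bilingual dictionary, then variant dictionary) can be traced in code. This sketch handles only single-word pointers (length 1) and assumes 1-based positions with linear alignment floor(j·|S|/|T| + 0.5). The dictionaries hold only the entries listed on the slide; entries the slide elides with "…" are left out. Function and variable names are hypothetical.

```python
LEMMAS = {"lower": ["low", "lower"], "bound": ["bound", "bind"]}
BILINGUAL = {
    "low": ["bas", "déprimé", "grave", "ignoble", "inférieur"],  # further entries elided
    "bound": ["bondir", "limite", "borne", "bond"],              # further entries elided
}
VARIANTS = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}


def linear(j: int, len_s: int, len_t: int) -> int:
    """floor(j * |S| / |T| + 0.5) in integer arithmetic."""
    return (2 * j * len_s + len_t) // (2 * len_t)


def decode_pointer(j, pointer, s_tokens, len_t):
    """Resolve 1(offset, length, lemma, translation, variant) for a single word."""
    offset, length, lemma_idx, translation_idx, variant_idx = pointer
    assert length == 1, "sketch covers single-word pointers only"
    i = linear(j, len(s_tokens), len_t) + offset
    lemma = LEMMAS[s_tokens[i - 1]][lemma_idx]     # inflected source word -> lemma
    translation = BILINGUAL[lemma][translation_idx]  # lemma -> target lemma
    return VARIANTS[translation][variant_idx]        # target lemma -> inflected form


S = ["lower", "bound"]
first = decode_pointer(1, (1, 1, 0, 2, 0), S, 2)
second = decode_pointer(2, (-1, 1, 0, 4, 1), S, 2)
```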

Optimizations

• No encoding for single option
– Relevant for all 3 dictionaries

• Sort options by descending order of frequencies
– Large number of small values → better encoding

• Encode length as (length – 1)
– length never 0
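The three optimizations can be illustrated with small helper functions. This is only a sketch: the helper names are hypothetical, and the usage counts for the translations of "equipment" are invented for the example and do not come from the slides.

```python
from collections import Counter


def rank_options(observed):
    """Order options by descending frequency so common choices get small indices."""
    return [option for option, _ in Counter(observed).most_common()]


def index_field(options, choice):
    """Return the index to store, or None when a single option needs no encoding."""
    if len(options) == 1:
        return None
    return options.index(choice)


def length_field(length):
    """A pointed fragment has at least one word, so store length - 1."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return length - 1


# Invented usage counts (assumption): "matériel" seen 5 times, "équipement" twice.
observed = ["matériel"] * 5 + ["équipement"] * 2
options = rank_options(observed)
```

With this ordering the more frequent translation receives index 0, so the index stream is dominated by small values.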

Binary encoding (1)

• Use 3 Huffman codes
– H1: words + pointer prefix
– H2: absolute values of offsets
  • sign bit follows, except for 0
– H3: lengths + indices
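How H2 and the sign bit fit together can be sketched with a textbook Huffman construction. This is not the authors' implementation: the sample `offsets` list is invented, the sign-bit convention (1 for negative) is an assumption, and bit strings stand in for real bit output.

```python
import heapq
from collections import Counter


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from {symbol: count}."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap items: (weight, unique tie-breaker, {symbol: partial codeword}).
    heap = [(w, k, {sym: ""}) for k, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]


def encode_offset(offset, h2):
    """H2 codeword of |offset|, followed by a sign bit unless the offset is 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"  # sign convention assumed
    return bits


offsets = [0, 0, 2, 0, 0, -1, 1, 0]  # invented sample, mostly zeros
h2 = huffman_code(Counter(abs(o) for o in offsets))
```

Coding absolute values halves the alphabet of H2, and because 0 needs no sign bit, the most common offset costs only its Huffman codeword.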

Binary encoding (2)

• Words:

H1(lemma) [H3(variant)]

• Pointers: l = length, m = (# of words in translation)

H1(ptr_prefix) H2(offset) [sign_bit] H3(l – 1) [H3(lemma_0)] … [H3(lemma_{l–1})] [H3(translation)] [H3(variant_0)] … [H3(variant_{m–1})]
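The field order of a pointer can be made concrete by listing which code each value would go through. This sketch stays symbolic and emits (code, value) pairs instead of bits. It assumes that a bracketed field is dropped when its dictionary entry has a single option, represented here by `None`, and that the sign bit is 1 for negative offsets. The function name is hypothetical.

```python
def pointer_fields(offset, length, lemma_idxs, translation_idx, variant_idxs):
    """List the (code, value) fields of one pointer in the order given on the slide."""
    fields = [("H1", "ptr_prefix"), ("H2", abs(offset))]
    if offset != 0:
        fields.append(("sign_bit", 1 if offset < 0 else 0))  # convention assumed
    fields.append(("H3", length - 1))
    # One lemma index per pointed source word, skipped when there is only one lemma.
    fields += [("H3", k) for k in lemma_idxs if k is not None]
    if translation_idx is not None:
        fields.append(("H3", translation_idx))
    # One variant index per word of the translation, skipped for single variants.
    fields += [("H3", k) for k in variant_idxs if k is not None]
    return fields


# The pointer 1(1,1,0,2,0) for "borne" in the lower bound example.
borne = pointer_fields(1, 1, [0], 2, [0])
```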

Empirical results

• English-French responsa collection of the European Parliament (ARCADE project)

• Sizes exclude the codes for HWORD and TRANS, and also the dictionaries for TRANS
– Dictionaries exist anyway in large IR systems
– Heaps' law: dictionary size is $\alpha N^{\beta}$, where $0.4 \le \beta \le 0.6$
  • For large corpora, size negligible
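Why the dictionary becomes negligible follows from the ratio of dictionary size to corpus size. A short derivation, assuming N is the number of word tokens in the corpus and that the size of the text itself grows roughly linearly with N:

```latex
\frac{\alpha N^{\beta}}{N} \;=\; \alpha N^{\beta - 1} \;\longrightarrow\; 0
\quad \text{as } N \to \infty,
\qquad \text{since } 0.4 \le \beta \le 0.6 < 1 .
```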

Empirical results (2)

Future work

• Other test corpora
– Other languages

• Compress target using lemmatized source

• Improve encoding

• Bidirectional scheme

• Pattern matching within compressed text

• Improved model for k languages