Using Alignment for Multilingual-Text Compression
Ehud S. Conley and Shmuel T. Klein
Outline
• Multilingual text
• Problem definition
• Multilingual-text alignment
• Compression of multilingual texts using alignment
  – Algorithm
  – Results
• Future work
Multilingual text
• Same contents in two or more (natural) languages
  – Legislative texts of the European Union in all EU languages
Subject: Supplies of military equipment to Iraq
Objet: Livraisons de matériel militaire à l’Irak
Problem definition
• How can multilingual texts be compressed more efficiently than by compressing each language separately?
  – Can semantic equivalence be exploited to reduce aggregate corpus size?
Multilingual-text alignment (1)
• Mapping of equivalent text fragments to each other
  – Paragraph/sentence and word/phrase levels
  – Algorithms for both levels
    • Tokenization, lemmatization, shallow parsing
  – Alignment possibly partial
Multilingual-text alignment (2)
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
Linear alignment
• Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:
  $i = \left\lfloor j \cdot \frac{|S|}{|T|} + 0.5 \right\rfloor$
Correct vs. linear alignment
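$|S| = 8$, $|T| = 9$, so $i = \left\lfloor j \cdot \frac{8}{9} + 0.5 \right\rfloor$
Token positions: 1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
[Figure: correct and linear alignment links between the two sentences]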
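The linear-alignment rule can be tried out on the example sentence pair. The following is a minimal sketch, assuming the rule is i = floor(j · |S| / |T| + 0.5) with 1-based token positions; the function name `linear_alignment` is illustrative and does not come from the slides.

```python
import math

def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """1-based position in S that is linearly aligned with position j in T."""
    return math.floor(j * len_s / len_t + 0.5)

source = "Subject : Supplies of military equipment to Iraq".split()      # |S| = 8
target = "Objet : Livraisons de matériel militaire à l’ Irak".split()   # |T| = 9

# Linear alignment of every target position j = 1..9.
linear = [linear_alignment(j, len(source), len(target))
          for j in range(1, len(target) + 1)]
print(linear)  # [1, 2, 3, 4, 4, 5, 6, 7, 8]
```

Under this rule, target positions 4 and 5 both map to source position 4, because the target is one token longer than the source.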
Offset from linear alignment
• Signed distance between correct and linear alignments
– Usually very small values (mostly [-10, 10])
Example: offset = 2
Token positions: 1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
[Figure: correct and linear alignment links; the marked offset is 2]
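The offset computation can be sketched as follows, assuming the linear rule i = floor(j · |S| / |T| + 0.5) and a hand-made word alignment for the example sentence pair. The `correct` mapping is an illustration written for this sketch, not the output of an alignment algorithm.

```python
import math

def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """1-based position in S that is linearly aligned with position j in T."""
    return math.floor(j * len_s / len_t + 0.5)

def offset(i_correct: int, j: int, len_s: int, len_t: int) -> int:
    """Signed distance between the correct and the linear alignment of t_j."""
    return i_correct - linear_alignment(j, len_s, len_t)

# Hand-made alignment: target position -> source position (1-based).
# Objet->Subject, Livraisons->Supplies, matériel->equipment,
# militaire->military, Irak->Iraq. Unaligned tokens are left out.
correct = {1: 1, 3: 3, 5: 6, 6: 5, 9: 8}

offsets = {j: offset(i, j, 8, 9) for j, i in correct.items()}
print(offsets)  # {1: 0, 3: 0, 5: 2, 6: 0, 9: 0}
```

Only matériel (position 5) deviates from its linear position, which gives the offset of 2.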
Compression of multilingual texts using alignment: Basic idea (1)
• Compress by replacing words/phrases with pointers to their translations within the other text
  – Original text restored using bilingual dictionary
• Store offsets relative to linear alignment
  – Small values → small number of values → efficient encoding
Compression of multilingual texts using alignment: Basic idea (2)
• Store number of words in pointed fragment
  – Might be a multi-word phrase
  – bilan → balance sheet
• Single pointer may replace multi-word phrase
  – matériel militaire → pointer to military equipment
  – chemin de fer → railway
Basic scheme: Example (option 1)
• Prefixes: 0 - word, 1 - pointer
• 1(offset, length)
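Example: offset = 2
Token positions: 1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
[Figure: correct and linear alignment links; the marked offset is 2]
Encoding of the French sentence:
1(0, 1) 0(:) 1(0, 1) 0(de) 1(2, 1) 1(0, 1) 0(à) 0(l’) 1(0, 1)
The pointers stand for: Objet, Livraisons, matériel, militaire, Irak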
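The option 1 encoding can be reproduced with a short sketch. It assumes single-word pointers only, the linear rule i = floor(j · |S| / |T| + 0.5), and a hand-made alignment; `encode_option1` and the `alignment` mapping are hypothetical names introduced for this illustration.

```python
import math

def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """1-based position in S that is linearly aligned with position j in T."""
    return math.floor(j * len_s / len_t + 0.5)

def encode_option1(target, len_s, alignment):
    """Emit 1(offset, 1) for aligned tokens and 0(word) for all others.

    `alignment` maps 1-based target positions to 1-based source positions.
    """
    items = []
    for j, token in enumerate(target, start=1):
        if j in alignment:
            off = alignment[j] - linear_alignment(j, len_s, len(target))
            items.append(f"1({off}, 1)")
        else:
            items.append(f"0({token})")
    return " ".join(items)

target = "Objet : Livraisons de matériel militaire à l’ Irak".split()
alignment = {1: 1, 3: 3, 5: 6, 6: 5, 9: 8}  # hand-made for this example
encoded = encode_option1(target, 8, alignment)
print(encoded)
```

The printed sequence has the same shape as the one on the slide: five pointers and four literal words.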
Basic scheme: Example (option 2)
• matériel militaire → pointer to military equipment
• Offset relative to first words
Example: offset = 1
Token positions: 1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
[Figure: correct and linear alignment links; the marked offset is 1]
Encoding of the French sentence:
1(0, 1) 0(:) 1(0, 1) 0(de) 1(1, 2) 0(à) 0(l’) 1(0, 1)
The pointers stand for: Objet, Livraisons, matériel militaire, Irak
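Restoring the French sentence from the option 2 sequence can be sketched as below. The sketch assumes that the decoder knows the target length |T| = 9, that each pointed fragment has exactly one translation, and that offsets are relative to the linear alignment floor(j · |S| / |T| + 0.5) of the first restored word. `DICTIONARY` is a toy, hypothetical dictionary covering only this example.

```python
import math

# Toy bilingual dictionary (hypothetical, covers only this example).
DICTIONARY = {
    "Subject": "Objet",
    "Supplies": "Livraisons",
    "military equipment": "matériel militaire",
    "Iraq": "Irak",
}

def decode(items, source, len_t):
    """Rebuild the target tokens from literal words and (offset, length) pointers."""
    target = []
    for item in items:
        if item[0] == "word":
            target.append(item[1])
            continue
        _, offset, length = item
        j = len(target) + 1                       # position of the next target token
        start = math.floor(j * len(source) / len_t + 0.5) + offset
        fragment = " ".join(source[start - 1:start - 1 + length])
        target.extend(DICTIONARY[fragment].split())
    return target

source = "Subject : Supplies of military equipment to Iraq".split()
items = [("ptr", 0, 1), ("word", ":"), ("ptr", 0, 1), ("word", "de"),
         ("ptr", 1, 2), ("word", "à"), ("word", "l’"), ("ptr", 0, 1)]
restored = decode(items, source, 9)
print(" ".join(restored))
```

The pointer 1(1, 2) is resolved at target position 5: the linear alignment is 4, the offset moves it to 5, and the two-word fragment "military equipment" is looked up as one phrase.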
Complication: Words with multiple possible translations
• Sometimes more than one possible translation per word
  – equipment
    1. équipement
    2. matériel
• Must encode correct translation within pointer
  – Store index of translation
Complication: Morphological variants (1)
• Bilingual dictionary must use one morphological form (lemma)
  – go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va etc.}
Complication: Morphological variants (2)
• Texts include inflected forms
  – More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup
  – Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T
Complication: Morphological variants (3)
LEMMA DICTIONARY
  lower: 0. low (adj.) 1. lower (verb)
  bound: 0. bound 1. bind
BILINGUAL DICTIONARY
  low: 0. bas 1. déprimé 2. grave 3. ignoble 4. inférieur 5. …
  bound: 0. bondir 1. limite 2. borne 3. bond 4. …
VARIANT DICTIONARY
  borne: 0. borne (sing.) 1. bornes (pl.)
  inférieur: 0. inférieur (masc.) 1. inférieure (fem.) 2. inférieurs 3. inférieures
Example: lower bound → borne inférieure
  1(1,1,0,2,0) 1(-1,1,0,4,1)
  (pointers for borne and inférieure, respectively)
• 1(offset, length, lemma(s), translation, variant(s))
• Multiple values for multiple words
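Decoding the two pointers of the "lower bound" example with the three dictionaries can be sketched as follows. The dictionary contents are copied from the slide (entries elided with "…" are left out); the function `decode_pointer`, the data layout, and the assumption that the decoder knows |T| = 2 are illustrative choices, not the authors' implementation.

```python
import math

LEMMAS = {"lower": ["low", "lower"], "bound": ["bound", "bind"]}
TRANSLATIONS = {
    "low": ["bas", "déprimé", "grave", "ignoble", "inférieur"],
    "bound": ["bondir", "limite", "borne", "bond"],
}
VARIANTS = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}

def decode_pointer(source, j, len_t, offset, length,
                   lemma_ids, translation_id, variant_ids):
    """Resolve 1(offset, length, lemma(s), translation, variant(s)) at target position j."""
    start = math.floor(j * len(source) / len_t + 0.5) + offset   # 1-based
    fragment = source[start - 1:start - 1 + length]
    lemmas = [LEMMAS[word][k] for word, k in zip(fragment, lemma_ids)]
    translation = TRANSLATIONS[" ".join(lemmas)][translation_id]
    return [VARIANTS[word][k] for word, k in zip(translation.split(), variant_ids)]

source = ["lower", "bound"]
pointers = [(1, 1, [0], 2, [0]), (-1, 1, [0], 4, [1])]

target = []
for ptr in pointers:
    target += decode_pointer(source, len(target) + 1, 2, *ptr)
print(" ".join(target))  # borne inférieure
```

The first pointer reaches "bound" (offset 1), picks lemma 0, translation 2 (borne) and variant 0; the second reaches "lower" (offset -1), picks lemma 0 (low), translation 4 (inférieur) and variant 1 (inférieure).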
Optimizations
• No encoding for single option
  – Relevant for all 3 dictionaries
• Sort options by descending order of frequencies
  – Large number of small values → better encoding
• Encode length as (length - 1)
  – length never 0
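The three optimizations can be illustrated with a small sketch. The usage counts are made-up numbers for illustration only, and the helper names (`rank_options`, `index_field`, `length_field`) are hypothetical.

```python
from collections import Counter

def rank_options(options, usage_counts):
    """Order dictionary options by descending frequency, so frequent ones get small indices."""
    return sorted(options, key=lambda option: -usage_counts[option])

def index_field(options, chosen):
    """Index to store for `chosen`; nothing is stored when there is only one option."""
    if len(options) == 1:
        return []
    return [options.index(chosen)]

def length_field(length):
    """A fragment always has at least one word, so store length - 1."""
    return length - 1

# Hypothetical usage counts for the translations of "equipment".
counts = Counter({"matériel": 7, "équipement": 3})
ranked = rank_options(["équipement", "matériel"], counts)

print(ranked)                            # ['matériel', 'équipement']
print(index_field(ranked, "matériel"))   # [0]
print(index_field(["Irak"], "Irak"))     # []
print(length_field(1))                   # 0
```

With these choices the most common values are 0 or absent, which is the skewed distribution a Huffman code benefits from.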
Binary encoding (1)
• Use 3 Huffman codes
  – H1: words + pointer prefix
  – H2: absolute values of offsets
    • sign bit follows, except for 0
  – H3: lengths + indices
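How H2 and the sign bit fit together can be sketched with a generic Huffman construction. This is a textbook Huffman builder, not necessarily the one used by the authors; the sample offsets and the sign convention (0 for positive, 1 for negative) are assumptions made for this illustration.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bit string} for a table of symbol frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker, so the dicts are never compared
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

def encode_offset(offset, h2):
    """H2(|offset|), followed by a sign bit unless the offset is 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"   # assumed convention
    return bits

sample_offsets = [0, 0, 2, 0, 0, 1, -1]       # made-up sample
h2 = huffman_code(Counter(abs(o) for o in sample_offsets))
encoded_offsets = [encode_offset(o, h2) for o in sample_offsets]
print(h2, encoded_offsets)
```

Because 1 and -1 share one H2 codeword, the code has fewer symbols than a code over signed values would, and the frequent offset 0 pays no sign bit at all.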
Binary encoding (2)
• Words:
  H1(lemma) [H3(variant)]
• Pointers: l = length, m = (# of words in translation)
  H1(ptr_prefix) H2(|offset|) [sign_bit] H3(l - 1) [H3(lemma_0)] … [H3(lemma_{l-1})] [H3(translation)] [H3(variant_0)] … [H3(variant_{m-1})]
• Fields in square brackets are optional
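The field order of a pointer can be made concrete with a sketch that lists which code each field would go through, without producing actual bits. `pointer_fields` and the tags `"H1"`, `"H2"`, `"H3"`, `"sign"` are hypothetical names; `None` marks an optional index that is omitted because there is only one option, and the sign convention (1 for negative) is an assumption.

```python
def pointer_fields(offset, length, lemma_ids, translation_id, variant_ids):
    """List (code, value) pairs for one pointer, in the order they are written."""
    fields = [("H1", "PTR"), ("H2", abs(offset))]
    if offset != 0:
        fields.append(("sign", 1 if offset < 0 else 0))
    fields.append(("H3", length - 1))
    # One optional lemma index per source word (l values).
    fields += [("H3", k) for k in lemma_ids if k is not None]
    if translation_id is not None:
        fields.append(("H3", translation_id))
    # One optional variant index per translation word (m values).
    fields += [("H3", k) for k in variant_ids if k is not None]
    return fields

# Second pointer of the "lower bound" example: 1(-1,1,0,4,1).
layout = pointer_fields(-1, 1, [0], 4, [1])
print(layout)
# [('H1', 'PTR'), ('H2', 1), ('sign', 1), ('H3', 0), ('H3', 0), ('H3', 4), ('H3', 1)]

# A pointer with offset 0 and single-option dictionaries needs only three fields.
minimal = pointer_fields(0, 1, [None], None, [None])
print(minimal)  # [('H1', 'PTR'), ('H2', 0), ('H3', 0)]
```

A decoder can tell which optional fields are present because it consults the same dictionaries and sees how many options each lookup offers.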
Empirical results
• English-French responsa collection of the European Parliament (ARCADE project)
• Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS
  – Dictionaries exist anyway in large IR systems
  – Heaps' law: dictionary size is $\alpha N^{\beta}$, where $0.4 \le \beta \le 0.6$
    • For large corpora, size negligible
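The claim that the dictionary becomes negligible follows from Heaps' law: the dictionary-to-corpus ratio is αN^β / N = αN^(β-1), which shrinks as N grows whenever β < 1. A quick numeric sketch, with α = 10 and β = 0.5 chosen purely for illustration and both sizes counted in the same unit:

```python
def dictionary_share(n_tokens: int, alpha: float, beta: float) -> float:
    """Dictionary size divided by corpus size under Heaps' law (alpha * N**beta)."""
    return alpha * n_tokens ** beta / n_tokens

# Hypothetical parameters; only the trend matters here.
shares = [dictionary_share(n, alpha=10.0, beta=0.5) for n in (10**6, 10**8, 10**10)]
print(shares)
```

With these illustrative parameters the share drops by a factor of ten for every hundredfold increase in corpus size, from about 1% at a million tokens to about 0.01% at ten billion.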
Empirical results (2)
Future work
• Other test corpora
  – Other languages
• Compress target using lemmatized source
• Improve encoding
• Bidirectional scheme
• Pattern matching within compressed text
• Improved model for k languages