align algorithm - jorge

7/28/2019 Align Algorithm - Jorge

1/28

A P ro g ra m f o r A l i g n i ng Sent ences i nBi l ingua l CorporaW i l l ia m A . G a l e *A T & T B e ll L a b o r a t o r i e s

K e n n e t h W . C h u r c h *A T & T B e l l L a b o r a t o r i e s

Researchers in both ma chine translation (e.g., Brown et al. 1990) and b ilingual lexicography (e.g.,Klavans and Tzou kerma nn 1990) have recently become interested in stu dy ing bilingual corpora,bodies of text such as the Can adian Hansards (parliamentary proceedings), which are availablein multiple languages (such as French and English). One useful step is to align the sentences,that is, to identify correspondences between sentences in one language and sentences in the otherlanguage.

This pa per will describe a method a nd a program (align) for aligning sentences based on asimple statistical mo del of character lengths. The program uses the fact that longer sentences inone language tend to be translated into longer sentences in the other language, and that shortersentences tend to be translated into shorter sentences. A probabilistic score is assigned to eachproposed correspondence of sen tences, based on the scaled difference of lengths of the two sentences(in ch aracters) and the variance o f this d ifference. This probabilistic score is used in a dyn amicprogramming fram ework to f ind the m axim um l ikelihood al ignment o f sentences.

It is remarkable that such a simple approach works as well as i t does. A n evaluation wasperformed based on a trilingual corpus of economic reports issued by the U nion Bank o f Switzer-land (UBS) in English, French, and German. The method correctly aligned all but 4% of thesentences. Moreov er, it is possible to extract a large subcorpus that has a muc h smaller errorrate. B y selecting the best-scoring 80% of the alignm ents, the error rate is reduced from 4% to0.7%. The re were more errors on the English-French subcorpus than on the English-Germansubcorpus, sho wing that error rates will dep end on the corpus considered; howeve r, bo th weresmall enough to hope that the method w ill be useful for m any language pairs.

To furth er research on bilingual corpora, a m uch larger sample o f Canadian Hansards (ap-proximately 90 million words, ha lf in Eng lish and and ha lf in French) has been aligned wit h theal ign program and will be available through the Data Collection Initiative of the Associationfor Com putational Linguistics (ACL /DC I). In addition, in ord er to facilitate replication of theal ign program , an appen dix is provided w ith d etailed c-code of the more difficult core of the al ignprogram.1 . In t r o d u c t i o nR e s e a r c h e r s i n b o t h m a c h i n e t r a n s l a t i o n ( e . g . , B r o w n e t a l . 1 9 9 0 ) a n d b i l i n g u a l l e x i -c o g r a p h y ( e. g., K l a v a n s a n d T z o u k e r m a n n 19 90 ) h a v e r e c e n t l y b e c o m e i n t e r e s t e d ins t u d y i n g b i l i n g u a l c o r p o ra , b o d i e s o f te x t s u c h a s t h e C a n a d i a n H a n s a r d s (p a r l i a me n -tary d e b a t e s) , w h i c h are avai lab le in mul t ip le l a n g u a g e s ( s u c h a s F r e n c h a n d E n g l is h ) .T h e s e n t e n c e a l i g n m e n t t a s k is t o i d e n t i fy c o r re s p o n d e n c e s b e t w e e n s e n t e n ce s i n o n e

* AT&T BellLabo ratories 600 M ountain A venue Mu rray Hill, NJ, 07974

(~) 1993 Asso ciation for Com putational Linguistics


2/28

Co m put a t i ona l L i ngu i st ic s Vo l um e 19, N um ber 1

T a b l e 1I n p u t t o a l i g n m e n t p r o g r a m .Eng l i s h F r enchAcco r d i ng t o o u r s u r vey , 198 8 s a le s o fm i n e r a l w a t e r a n d s o f t d r i n k s w e r e m u c hh i ghe r t han i n 1987 , r e f le c t ing t he g r ow -i ng popu l a r i t y o f t he s e p r oduc t s . Co l a d r i nkm a n u f a c t u r e r s i n p a r t ic u l a r a c h i e v e d a b o v e -a v e r a g e g r o w t h r a t e s . T h e h i g h e r t u r n o v e rwas l a r ge l y due t o an i nc r ea s e i n t he s a l e sv o l u m e . E m p l o y m e n t a n d i n v e s t m e n t l e v e l sa l s o c l i m bed . Fo l l owi ng a t wo- yea r t r ans i -t i ona l pe r i od , t he new Foods t u f f s Or d i nancef o r Mi ne r a l Wa t e r c am e i n t o e f f ec t on Ap r i l1, 1988 . Specif ically, i t co ntain s m ore s t r in-gen t r equ i r em en t s r ega r d i ng qua l i t y cons i s -t e n c y a n d p u r i t y g u a r a n t e e s.

Q u a n t a u x e a u x m i n 6 ra l es e t a u x l im o n a d e s ,e l le s r encon t r en t t ou j ou r s p l u s d ' adep t e s . Ene f fe t , no t r e s onda ge f a i t r e s s o r ti r de s v en t e sne t t em en t s up 6r i eu r e s a c e l le s de 1987 , po u rl e s bo i s s ons ~ ba s e de co l a no t am m en t . Lap r og r e s s i on de s ch i ff r es d ' a f fa i r e s r 6 s u l te eng r a n d e p a r t i e d e l ' a c c r o i s s e m e n t d u v o l u m edes ven t e s . Uem pl o i e t l e s i nves t i s s em en t so n t 6 g a l e m e n t a u g m e n t 6 . L a n o u v e l l e o rd o n -nance f 6d6r a l e s u r l e s den r 6es a l i m en t a i r e sconce r nan t en t r e au t r e s l e s e aux m i n6r a l e s ,en t r 6e en v i gue u r l e l e r av r i l 1988 ap r 6s unep6r i ode t r ans i t o ir e de d eux ans , ex i ge s u r t ou tune p l us g r ande cons t ance dans l a qua l i t 6 e tune ga r an t i e de l a pu r e t 6 .

l a n g u a g e a n d s e n t e n c e s in t h e o t h e r l a n g u a g e . T h i s t a s k i s a f ir s t s t e p t o w a r d t h e m o r ea m b i t io u s t as k f i n d in g c o r r e s p o n d e n c e s a m o n g w o r d s . 1

T h e i n p u t i s a p a i r o f t e x t s s u c h a s T a b le 1. T h e o u t p u t i d e n t i f ie s t h e a l i g n m e n tb e t w e e n s e n te n c e s . M o s t E n g l i s h s e n t e n c e s m a t c h e x a c t ly o n e F r e n c h s e n t e n c e , b u t i tis p o s s i b l e f o r a n E n g l i s h s e n t e n c e to m a t c h t w o o r m o r e F r e n c h s e n t e n c e s . T h e f i rs tt w o E n g l i s h s e n t e n c e s i n T a b l e 2 i l lu s t r a t e a p a r t i c u l a r l y h a r d c a s e w h e r e t w o E n g l i s hs e n t e n c e s a li g n t o t w o F r e n c h s e n t e n c e s. N o s m a l l e r a l i g n m e n t s a r e p o s si b l e b e c a u s et h e cl a u se " . . . s a le s . . . w e r e h i g h e r . . . " i n t h e fi rs t E n g l i s h s e n t e n c e c o r r e s p o n d s t o( p a r t of ) t h e s e c o n d F r e n c h s e n t e n c e . T h e n e x t t w o a l i g n m e n t s b e l o w i ll u s tr a te t h em o r e t y p i c a l ca s e w h e r e o n e E n g l i s h s e n t e n c e a l ig n s w i t h e x a c t l y o n e F r e n c h s e n t e n c e .T h e f i n a l a l i g n m e n t m a t c h e s t w o E n g l i s h s e n t e n c e s to a s in g l e F r e n c h s e n t e n c e . T h e s ea l ig n m e n t s a g r e e d w i t h t h e r e su l ts p r o d u c e d b y a h u m a n j ud g e .

A l i g n i n g s e n t e n c e s i s j u s t a f ir s t s t e p t o w a r d c o n s t r u c t i n g a p r o b a b i li s t ic d i c t i o n a r y( T a b le 3 ) f o r u s e i n a l ig n i n g w o r d s i n m a c h i n e t r a n s l a t i o n ( B r o w n e t al . 19 9 0) , o r f o rc o n s t r u c t i n g a b i li n g u a l c o n c o r d a n c e ( T a b le 4 ) f o r u s e i n l e x i c o g r a p h y ( K l a v a n s a n dT z o u k e r m a n n 1 9 9 0 ) .

A l t h o u g h t h e r e h a s b e e n s o m e p r e v i o u s w o r k o n t h e se n t e n c e a l i g n m e n t ( e.g .,B r o w n , L a i , a n d M e r c e r 1 9 9 1 [ a t I B M ] , K a y a n d R 6 s c h e i s e n [ t h i s i s s u e ; a t X e r o x ] ,a n d C a t i z o n e , R u s s el l, a n d W a r w i c k , i n p r e s s [a t I S S C O ] , t h e a l i g n m e n t t a s k r e m a i n s as i g n if ic a n t o b s t ac l e p r e v e n t i n g m a n y p o t e n t i a l u se r s f r o m r e a p i n g m a n y o f th e b e n e f i t so f b i l in g u a l c o r p o r a , b e c a u s e t h e p r o p o s e d s o l u t io n s a r e o f t e n u n a v a i l a b le , u n r e l ia b l e ,a n d / o r c o m p u t a t io n a l l y p r o h ib i ti v e .

M o s t o f t h e p r e v i o u s w o r k o n s e n t e n c e a l i g n m e n t h a s y e t t o b e p u b l i s h e d . K a y ' sd r a f t ( K a y a n d R 6 s c h e is e n ; th is is s ue ) , f o r e x a m p l e , w a s w r i t t e n m o r e t h a n t w o y e a r sa g o a n d i s s ti ll u n p u b l i s h e d . S i m i l a r l y t h e IB M w o r k is a l s o s e v e r a l y e a r s o l d , b u t n o t

1 In statistics, string-matching problems are divided into two classes:alignmentproblems andcorrespondenceproblems. Crossing dependencies are possible in the latter, but not in the former.

76


3/28

W i l li am A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n te n c es

T a b l e 2O u t p u t f r o m a li g n m e n t p r o g r a m .E n g l i s h F r e n c hA c c o r d i n g t o o u r s u r v e y , 1 9 8 8 s a l e s o f m i n -e r al w a t e r a n d s o ft d r i n k s w e r e m u c h h i g h e rt h a n i n 1 9 87 , r e f le c t in g t h e g r o w i n g p o p u l a r -i ty o f t h e s e p r o d u c t s . C o l a d r i n k m a n u f a c -t u r e r s i n p a r t i c u l a r a c h i e v e d a b o v e - a v e r a g eg r o w t h r a t e s .

Q u a n t a u x e a u x m i n 6 r a le s e t a u x l i m o n a d e s ,e l l e s r e n c o n t r e n t t o u j o u r s p l u s d ' a d e p t e s . E ne f f e t , n o t r e so n d ag e f a i t r e s so r t i r d e s v en t e sn e t t em en t su p 6 r i eu re s a c e l l e s d e 1 9 8 7 , p o u rl e s b o i s s o n s a b a s e d e c o l a n o t a m m e n t .T h e h i g h e r t u r n o v e r w a s l a r g e l y d u e t o a n L a p r o g r e s s i o n d e s c h i ff r e s d ' a f f a ir e s r 6 s u lt ei n c re a s e i n t h e s a le s v o l u m e , e n g r a n d e p a r t i e d e l ' a c c r o i s s e m e n t d u v o l -u m e d e s v e n t e s .E m p l o y m e n t a n d i n v e s t m e n t l e v el s a l s o L ' e m p l o i e t l es i n v e s t i s s e m e n t s o n t 6 g a le -c l im b e d , m e n t a u g m e n t 6 .F o l l o w i n g a t w o - y e a r tr a n s i ti o n a l p e r i o d , t h en e w F o o d s t u ff s O r d i n a n c e f o r M i n e ra l W a -t e r came in to e f fec t on Apr i l 1 , 1988 . Spec i f -i ca l ly , i t conta ins more s t r i ngent requi re -m e n t s r e g a r d i n g q u a l i t y c o n s is t e n cy a n d p u -r i ty g u a r a n t e e s .

L a n o u v e l l e o r d o n n a n c e f 6 d 6 r a l e s u r l e sd e n r 6 e s a l i m e n t a i r e s c o n c e r n a n t e n t r e a u t r e sl e s e a u x m i n 6 r a l e s , e n t r 6 e e n v i g u e u r l e l e rav r i l 1 9 8 8 ap r6 s u n e p 6 r i o d e t r an s i t o i r e d ed e u x a n s , e xi ge s u r t o u t u n e p l u s g r a n d e c o n -s t an ce d an s l a q u a l i t 6 e t u n e g a ran t i e d e l ap u re t 6 .

T a b l e 3A n en t ry i n a p ro b ab i l i s t i c d i c t i o n a ry .

( f rom Brown e t a l . 1990)E n g l i s h F r e n c h P r o b ( F r e n ch l E n g li s h)the l e 0 .610the la 0.178th e 1' 0.083the les 0.023the ce 0.013the i l 0 .012t h e d e 0 . 0 0 9the ~ 0 .007t h e q u e 0 .0 0 7

v e r y w e l l d o c u m e n t e d i n t h e p u b l i s h e d l i t e r at u r e; c o n s e q u e n t l y , th e r e h a s b e e n a l oto f u n n e c e s s a r y s u b s e q u e n t w o r k a t IS S C O a n d e l s e w h e r e . 2

T h e m e t h o d w e d e s c r ib e h a s t h e s a m e s e n t e n c e q e n g t h b a s is a s d o e s t h a t o f B r o w n ,L a i, a n d M e r c e r , w h i l e t h e t w o d i f f e r c o n s i d e r a b l y f r o m t h e l ex i ca l a p p r o a c h e s t r ie db y K a y a n d R 6 s c h e i s e n a n d b y C a t i z o n e , R u s se ll , a n d W a r w i c k .

T h e f e a s ib i l i ty o f o t h e r m e t h o d s h a s v a r i e d g r e a t l y . K a y ' s a p p r o a c h i s a p p a r e n t l yq u i t e s lo w . A t le a st , w i t h t h e c u r r e n t l y i n e f fi c ie n t i m p l e m e n t a t i o n , i t m i g h t t a k e h o u r s

2 After we finished mo st of this w ork, i t cam e to our at tention that the IBM MT gro up has at least fourpap ers that m entio n sentence alignment. (Brown et al. 1988a,b) start fro m a set of aligned sentences,sugg esting that the y had a solution to the sentence al ignment problem back in 1988. Brown et al . (1990)me ntion that sentence lengths form ed the basis of their metho d. The draft by Brown, Lai, and Mercer(1991) describes their process without giving equations.

7 7


4/28

Computational Linguistics Volume 19, Number 1

T a b l e 4A b i l i n g u a l c o n c o r d a n c e .

b a n k / b a n q u e ( " m o n e y " s e ns e)i t c o u l d a l s o b e a p l a c e w h e r e w e w o u l d h a v e a b a n k o f e x p e r t s . S E N T i k n o w s e v e r a l p e o p l e w h o a

f tr e l e l ie u o t i s e re t r o u v e r a i t u n e e s p 6 c e d e b a n q u e d ' e x p e r t s . S E N T j e c o n n a i s p l u s i e u r s p e r sf f i n a n c e (m r . w i ls o n ) a n d t h e g o v e r n o r o f th ee s fi n a n c e s ( m . w i l s o n ) e t le g o u v e r n e u r d e l a

r e d u c e d b y o v e r 8 0 0 p e r c e n t in o n e w e e k t h r o u g hu s d e 8 0 0 p . 1 0 0 e n u n e s e m a i n e a c a u s e d ' u n e

b a n k o f c a n a d a h a v e f r e q u e n t ly o n b e h a l f o f th e c ab a n q u e d u c a n a d a o n t f r 6 q u e m m e n t u ti li s6 a u c ob a n k a c ti o n. S E N T th e r e w a s a h a b e r d a s h e r w h o w o ub a n q u e . S E N T v o i l a u n c h e m i s i e r q u i a u r a i t a p p r

b a n k / b a n c ( " p la c e " s e n se )h a f o r u m . S E N T s u c h w a s t h e c a s e in t h e g e o r g e se n t r e l e s 6 t a t s- u n i s e t le c a n a d a a p r o p o s d u

h a n i d i d . S E N T h e s a i d t h e n o s e a n d t a il o f t h eg o u v e r n e m e n t a v a i t c 6 d 6 l e s e x t r 6 m i t 6 s d u

h e f i s h i n g p r i v i l e g e s o n t h e n o s e a n d t ai l o f t h el e s p r iv i l 6 g e s d e p ~ c h e a u x e x t r 6 m i t6 s d u

b a n k i s s u e w h i c h w a s s e tt l ed b e t w e e n c a n a d a a n d t hb a n c d e g e o r g e . S E N T c ' e s t d a n s l e b u t d e r 6b a n k w e r e s u r r e n d e r e d b y t h i s g o v e r n m e n t . S E N T t hb a n c . S E N T e n f a i t, lo r s d e s n 6 g o c i a t i o n s d e 1b a n k w e n t d o w n t h e tu b e b e f o r e w e e v e n n e g o t i a t e db a n c o n t 6 t 6 l i q u i d 6 s a v a n t r h y m e q u ' o n a i

to align a single Scientific American article (Kay, personal communication). It ough t tobe possible to achieve fairly reasonable results with much less computation. The IBMalgorithm is much more efficient since they were able to extract nearly 3 million pairsof sentences from Hansard materials in 10 days of running time on an IBM Model3090 mainframe computer with access to 16 megabytes of virtual memory (Brown,Lai, and Mercer 1991).The evaluation of results has been absent or rudimentary. Kay gives positive ex-amples of the alignment process, but no counts of error rates. Brown, Lai, and Mercer(1991) report that they achieve a 0.6% error rate when the algorithm suggests aligningone sentence with one sentence. However, they do not characterize its performanceoverall or on the more difficult cases.Since the research community has not had access to a practical sentence alignmentprogram, we t hough t that it wou ld be helpful to describe such a program ( a l i g n ) and toevaluate its results. In addition, a large sample of Canad ian Hansards (approximately90 million words, half in French and half in English) has been aligned with the a l i g nprogram and has been made available to the general research commun ity through theData Collection Initiative of the Association for Computa tiona l Linguistics (ACL/DCI).In order to facilitate replication of the a l i g n program, an appendix is provided withdetailed c-code of the more difficult core of the a l i g n program.The a l i g n program is based on a very simple statistical model of character lengths.The model makes use of the fact that longer sentences in one language tend to betranslated into longer sentences in the other language, and that shorter sentencestend to be translated into shorter sentences. A probabilistic score is assigned to eachpair of proposed sentence pairs, based on the ratio of lengths of the two sentences(in characters) and the variance of this ratio. This probabilistic score is used in adynamic programming framework in order to find the maxi mum likelihood alignmentof sentences.It is remarkable that such a simple approach can work as well as it does. Anevaluation was per formed based on a trilingual corpus of 15 economic reports issuedby the Union Bank of Switzerland (UBS) in English, French, and German (14,680words, 725 sentences, and 188 paragraphs in English and corresponding numbers in

78


5/28

William A. Gale and Kenneth W. Church Program for Aligning Sentences

the other two languages). The method correctly aligned all but 4% of the sentences.Moreover, it is possible to extract a large subcorpus that has a much smaller errorrate. By selecting the best-scoring 80% of the alignments, the error rate is reducedfrom 4% to 0.7%. There were more errors on the English-French subcorpus than onthe English-German subcorpus, showing that error rates will depend on the corpusconsidered; however, both were small enough for us to hope that the me thod will beuseful for many language pairs. We believe that the error rate is considerably lowerin the Canadian Hansards because the translations are more literal.2 . P a r a gr a ph A l i g n m e n tThe sentence alignment program is a two-step process. First paragraphs are aligned,and then sentences within a paragraph are aligned. It is fairly easy to align paragraphsin our trilingual corpus of Swiss banking reports since the boundaries are usuallyclearly marked. However, there are some short headings and signatures that can beconfused with paragraphs. Moreover, these short "pseudo-paragraphs" are not alwaystranslated into all languages. On a corpus this small the paragraphs could have beenaligned by hand. It turns out that "pseudo-paragraphs" usually have fewer than 50characters and that real paragraphs usually have more than 100 characters. We usedthis fact to align the paragraphs automatically, checking the result by hand.

The procedure correctly aligned all of the English and German paragraphs. How-ever, one of the French documents was badly translated and could not be alignedbecause of the omission of one long paragraph and the duplication of a short one.This document was excluded for the purposes of the remainder of this experiment.We will show below that paragraph alignment is an important step, so it is fortu-nate that it is not particularly difficult. In aligning the Hansards, we found tha t para-graphs were often already aligned. For robustness, we decided to align paragraphswithin certain fairly reliable regions (denoted by certain Hansard-specific formattingconventions) using the same method as that described below for aligning sentenceswithin each paragraph.3 . A Dy na m ic P ro g ra m m ing F ra m ewo rkNow, let us consider how sentences can be aligned within a paragraph. The programmakes use of the fact that longer sentences in one language tend to be translated intolonger sentences in the other language, and that shorter sentences tend to be trans-lated into shorter sentences. 3 A probabilistic score is assigned to each proposed pairof sentences, based on the ratio of lengths of the two sentences (in characters) andthe variance of this ratio. This probabilistic score is used in a dynamic programmingframework in order to find the max imu m likelihood alignment of sentences. The fol-

3 W e w i ll h a v e l it tl e t o s a y a b o u t h o w s e n t e n c e b o u n d a r i e s a r e i d e n t if i e d . Id e n t i f y i n g s e n t e n c eb o u n d a r i e s i s n o t a l w a y s a s e a s y a s i t m i g h t a p p e a r f o r re a s o n s d e s c r i b e d i n L i b e r m a n a n d C h u r c h ( inp r e s s ). It w o u l d b e m u c h e a s i e r i f p e r i o d s w e r e a l w a y s u s e d t o m a r k s e n t e n c e b o u n d a r i e s ; b u tu n f o r t u n a t e ly , m a n y p e r i o d s h a v e o t h e r p u r p o s e s . I n t h e B r o w n C o r p u s , f o r e x a m p l e , o n l y 90 % o f th ep e r i o d s a re u s e d t o m a r k s e n t e n c e b o u n d a r i e s ; t h e r e m a i n i n g 1 0 % a p p e a r i n n u m e r i c a l e x p r e s s i o n s ,a b b r e v i a t i o n s , a n d s o f o r t h . I n t h e Wall Street Journal, t h e r e is e v e n m o r e d i s c u s s i o n o f d o l l a r a m o u n t sa n d p e r c e n t a g e s , a s w e l l a s m o r e u s e o f a b b r e v i a t e d t i tl e s s u c h a s Mr.; c o n s e q u e n t l y , o n l y 5 3% o f t h ep e r i o d s i n t h e Wall Street Journal a r e u s e d t o id e n t i f y s e n t e n c e b o u n d a r i e s . F o r t h e U B S d a ta , a s i m p l es e t o f h e u r i s t ic s w e r e u s e d t o i d e n t i fy s e n t e n c e s b o u n d a r i e s . T h e d a t a s e t w a s s u f f ic i e n tl y s m a l l t h a t i tw a s p o s s i b l e t o c o r r e c t t h e r e m a i n i n g m i s t a k e s b y h a n d . F o r a la r g e r d a t as e t , s u c h a s t h e C a n a d i a nH a n s a r d s , i t w a s n o t p o s s i b l e to c h e c k th e r e s u lt s b y h a n d . W e u s e d t h e s a m e p r o c e d u r e t h a t i s u s e d i nC h u r c h ( 19 88 ). T h i s p r o c e d u r e w a s d e v e l o p e d b y K a t h r y n B a k e r ( u n p u b l i s h e d ) .

79


6/28


f . -

e -f-Q .t~

e ~ -

E

00

00

P ,

o OlI0

***., *

f * * *

. . , * * *

* * t - * *

I I I5 0 0 1 0 0 0 1 5 0 0Engl ish paragraph length

Figure 1Paragraph lengths are highly correlated. The horizontal axis shows the length of Englishparagraphs, while the vertical scale shows the lengths of the corresponding Germanparagraphs. Note that the correlation is quite large (.991).

lowing striking figure could easily lead one to this approach. Figure 1 shows that thelengths (in characters) of English and German paragraphs are highly correlated (.991).Dynamic programming is often used to align two sequences of symbols in a vari-ety of settings, such as genetic code sequences from different species, speech sequencesfrom different speakers, gas chroma tograph sequences from different compounds, andgeologic sequences f rom different locations (Sankoff and Kruskal 1983). We could ex-pect these matching techniques to be useful, as long as the order of the sentences doesnot differ too radically between the two languages. Details of the alignment techniquesdiffer considerably from one application to another, but all use a distance measure tocompare two individual elements within the sequences and a dynamic progr ammingalgorithm to minimize the total distances between aligned elements within two se-quences. We have found that the sentence alignment problem fits fairly well into thisframework, th ough it is necessary to introduce a fairly interesting innovation into thestructure of the distance measure.Kruskal and Liberman (1983) describe distance measures as belonging to one oftwo classes: trace and t ime-warp. The difference becomes important when a singleelement of one sequence is being matched with multiple elements from the other. Intrace applications, such as genetic code matching, the single element is matched withjust one of the multiple elements, and all of the others will be ignored. In contrast,in time-warp applications such as speech template matching, the single element ismatched with each of the multiple elements, and the single element will be usedin multiple matches. Interestingly enough, our application does not fit into either of

80


7/28

W i l l ia m A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n t e n c es

" 0

0

tO0

0

d

0

0d] I I I I [ ]

-3 -2 -1 0 1 2 3d e l t a

Figure 2Del t a i s app rox ima te ly no rma l . The ho r i zon t a l ax i s shows ~ , wh i l e t he ve r t i c a l s ca l e shows t hee m p i r i c a l d e n s i t y o f d e l t a f o r th e h a n d - a l i g n e d r e g io n s a s p o i n ts , a n d a n o r m a l ( 0 ,1 ) d e n s i t yp lo t ( l i ne s ) f o r compar i son The emp i r i c a l dens i t y i s s l i gh t l y more peaked t han no rma l ( and i t sm e a n i s n o t q u i t e z e r o ), b u t t h e d i f f e re n c e s a r e s m a l l e n o u g h f o r t h e p u r p o s e s o f t h e a l g o r i th m .

K r u s k a l a n d L i b e r m a n ' s cl as se s b e c a u s e o u r d i s ta n c e m e a s u r e n e e d s t o c o m p a r e t h es i n g le e l e m e n t w i t h a n a g g r e g a t e o f t h e m u l t i p l e e l e m e n t s .4 . T h e D i s t a n c e M e a s u r eI t i s c o n v e n i e n t f o r t h e d i s t a n c e m e a s u r e t o b e b a s e d o n a p r o b a b i l i s t ic m o d e l s o t h a ti n f o r m a t i o n c a n b e c o m b i n e d i n a c o n s is t en t w a y . O u r d i s t a n c e m e a s u r e i s a n e s t i m a t eo f - l o g P r o b ( m a t c h I 6 ), w h e r e ~ d e p e n d s o n 11 a n d / 2 , t h e l e n g t h s o f t h e t w o p o r t i o n so f t e x t u n d e r c o n s i d e r a t io n . T h e l o g is i n t r o d u c e d h e r e s o th a t a d d i n g d i s t a n c e s w i l lp r o d u c e d e s i r a b l e r e s u l t s .

T h i s d i s t a n c e m e a s u r e i s b a s e d o n t h e a s s u m p t i o n t h a t e a c h c h a r a c t e r i n o n e l a n -g u a g e , L ~, g i v e s r is e to a r a n d o m n u m b e r o f c h a r a c t e r s i n t h e o t h e r l a n g u a g e , L2. W ea s s u m e t h e s e r a n d o m v a r i a b le s a r e i n d e p e n d e n t a n d i d e n ti c a l ly d i s t r i b u t e d w i t h an o r m a l d i s t ri b u t i o n . T h e m o d e l is t h e n s p e c i f i e d b y t h e m e a n , c , a n d v a r i a n c e , s 2, o ft h is d i s t r i b u ti o n , c i s t h e e x p e c t e d n u m b e r o f c h a r a c t e r s i n L2 p e r c h a r a c t e r i n L b a n ds 2 i s th e v a r i a n c e o f t h e n u m b e r o f c h a r a c t e r s i n L2 p e r c h a r a c t e r i n L1. W e d e f i n e ~ t obe (12 - l l C ) / V ~ l s2 s o t h a t it h a s a n o r m a l d i s t r ib u t i o n w i t h m e a n z e r o a n d v a r i a n c eo n e ( a t l e as t w h e n t h e t w o p o r t i o n s o f te x t u n d e r c o n s i d e r a t io n a c t u a l l y d o h a p p e n t ob e t r a n s l a t i o n s o f o n e a n o t h e r ) .

F i g u r e 2 i s a c h e c k o n t h e a s s u m p t i o n t h a t 6 is n o r m a l l y d i s t r i b u t e d . T h e f i g u r e isc o n s t r u c t e d u s i n g t h e p a r a m e t e r s c a n d s 2 e s t i m a t e d f o r th e p r o g r a m .

81


8/28

Co m p ut a t i ona l L i ngu i s ti c s Vo l um e 19, N um ber 1

0

~ eJ"0

t ' -

Q.

0I I b I0 5 0 0 1 0 0 0 1 5 0 0

Eng l i s h pa r ag r aph l eng t hFigure 3Var i ance is m o de l e d p r o por t i on a l t o leng t h . Th e ho r i zon t a l ax i s p l o ts t he l eng t h o f Eng l i s hpa r ag r aph s , w h i l e t he ve r t ic a l ax i s s how s t he s q ua r e o f t he d i ff e r ence o f Eng l i s h and G e r m anl eng t hs , an e s t i m a t e o f va r iance . The p l o t i nd i ca te s t ha t v a r i ance i nc r eas e s w i t h l eng t h , a sp r ed i c t ed b y t he m ode l . The l ine s how s t he r e s u l t o f a r obus t r eg r e s s i on ana lys i s. F i ve ex t r em epo i n t s l y i ng above t he t op o f t h i s f i gu r e have been s uppr e s s ed s i nce t hey d i d no t con t r i bu t e t othe robus t regress ion .

T h e p a r a m e t e r s c a n d S 2 ar e d e t e r m i n e d e m p i r i c a l l y f r o m t h e U B S d a t a. W e c o u l de s ti m a te c b y c o u n t i n g t h e n u m b e r o f c h a r ac t e rs i n G e r m a n p a r a g r a p h s t h e n d i v id -i n g b y t h e n u m b e r o f c h a r a c t e r s i n c o r r e s p o n d i n g E n g l i s h p a r a g r a p h s . W e o b t a i n8 1 1 0 5 /7 3 4 8 1 ~ 1 .1 . T h e s a m e c a l c u l a t i o n o n F r e n c h a n d E n g l i s h p a r a g r a p h s y i e l d sc ~ 7 2 3 0 2 / 6 8 4 5 0 ~ 1 .0 6 a s t h e e x p e c t e d n u m b e r o f F r e n c h c h a r a c t e r s p e r E n g l i s hc h a ra c te r . A s w i l l b e e x p l a i n e d l ate r, p e r f o r m a n c e d o e s n o t s e e m t o b e v e r y s e n s i t iv et o t h e s e p r e c is e l a n g u a g e - d e p e n d e n t q u a n t i ti e s , a n d t h e r e f o r e w e s i m p l y a s s u m e t h el a n g u a g e - i n d e p e n d e n t v a l u e c ~ 1 , w h i c h s i m p l i fi e s t h e p r o g r a m c o n s i d er a b ly . T h isv a l u e w o u l d c l e a rl y b e i n a p p r o p r i a t e f o r E n g l i s h - C h i n e s e a l i g n m e n t , b u t i t s e e m sl ik e l y t o b e u s e f u l f o r m o s t p a i r s o f E u r o p e a n l a n g u a g e s .

s 2 i s e s t i m a t e d f r o m F i g u r e 3. T h e m o d e l a s s u m e s t h a t s 2 is p r o p o r t i o n a l t o l e n g t h .T h e c o n s t a n t o f p r o p o r t i o n a l i t y is d e t e r m i n e d b y t h e s l o p e o f t h e r o b u s t r e g r e s s i o nl in e s h o w n i n t h e f i g u r e . T h e r e s u l t f o r E n g l i s h - - G e r m a n is s 2 = 7 .3 , a n d f o r E n g l i s h -F r e n c h i s s 2 = 5 .6 . A g a i n , w e w i l l s e e t h a t t h e d i f f e r e n c e i n t h e t w o s l o p e s i s n o tt o o im p o r t a n t . T h e r e f o r e , w e c a n c o m b i n e t h e d a t a a c ro s s l a n g u a g e s , a n d a d o p t t h es i m p l e r l a n g u a g e - i n d e p e n d e n t e s t im a t e s 2 ~ 6 .8 , w h i c h is w h a t is a c tu a l l y u s e d i n t h ep r o g r a m .

W e n o w a p p e a l t o B a y es T h e o r e m t o es t i m a t e P r o b ( m a t c h ] 6 ) a s a c o n s t a n t t i m e sP r o b ( 6 I m a t c h ) P r o b ( m a t c h ) . T h e c o n s t a n t c a n b e i g n o r e d s i n c e it w i l l b e t h e s a m e f o r

8 2


9/28

W i l li am A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n in g S e n t e nc e s

T a b l e 5P r ob( m a t ch )C a t e g o r y F r e q u e n c y P r o b ( m a tc h )1-1 1167 0.891-0 or 0-1 13 0.00992-1 or 1-2 117 0.0892-2 15 0.011

1312 1.00

a ll p r o p o s e d m a t c h e s . T h e c o n d i t i o n a l p r o b a b i l i t y P rob(~ I match) c a n b e e s t i m a t e d b yP rob(~ ] match) = 2(1 - Prob(]~l))

w h e r e Prob(]61) is th e p r o b a b i l i t y t h a t a r a n d o m v a r ia b l e , z , w i t h a s t a n d a r d i z e d ( m e a nz e r o , v a r i a n c e o n e ) n o r m a l d i s t r ib u t i o n , h a s m a g n i t u d e a t le a s t a s la r g e a s 16]. T h a t i s,

Prob(~)- 1 f~~ oo e - z2 /2 dz.T h e p r o g r a m c o m p u t e s 6 d i r e c tl y f r o m t h e l e n g t h s o f t h e t w o p o r t i o n s o f te x t, 11 a n d12, a n d t h e t w o p a r a m e t e r s , c a n d s 2 . T h a t i s , 6 = (/2 - llC)/IX /~lS . T h e n , Prob(]6]) sc o m p u t e d b y i n t e g r a ti n g a s t a n d a r d n o r m a l d i s t r i b u t io n ( w i th m e a n z e r o a n d v a r i a n c eo n e ) . M a n y s t at is ti c s te x t b o o k s i n c l u d e a t a b le f o r c o m p u t i n g t h is . T h e c o d e i n th ea p p e n d i x u s e s t h e pnorm f u n c t io n , w h i c h i s b a s e d o n a n a p p r o x i m a t i o n d e s c r i b e d b yA b r a m o w i t z a n d S t e g u n ( 19 64 ; p . 9 32 , e q u a t i o n 2 6 .2 .1 7 ).

T h e p r i o r p r o b a b i l i t y o f a m a t c h , Prob(match), i s f it w i t h t h e v a l u e s i n T a b l e 5 , w h i c hw e r e d e t e r m i n e d f r o m t he h a n d - m a r k e d U B S d ata . W e h a v e f o u n d t h a t a s e n te n c e ino n e l a n g u a g e n o r m a l l y m a t c h e s e x a c t ly o n e s e n t e n c e in t h e o t h e r l a n g u a g e (1 -1 ). T h r e ea d d i t i o n a l p o s s i b i l i ti e s a r e a l s o c o n s i d e r e d : 1 -0 ( in c l u d i n g 0 -1 ), 2 -1 ( i n c l u d i n g 1 -2 ), a n d2 -2 . T a b l e 5 s h o w s a l l f o u r p o s s i b il i ti e s .

T h i s c o m p l e t e s t h e d i s c u s s i o n o f t h e d i s t a n c e m e a s u r e . P rob(match I 6 ) i s c o m p u t e da s a n ( i r r e l e v a n t ) c o n s t a n t t i m e s P rob(~ ] match)Prob(match). P rob(match) i s c o m p u t e du s i n g t h e v a l u e s i n T a b l e 5 . Prob(6 ] match) is c o m p u t e d b y a s s u m i n g t h a t Prob(6 ]match) = 2 ( 1 - P r o b ( ] ~ ] ) ), w h e r e Prob(16]) h a s a s t a n d a r d n o r m a l d i s t r ib u t i o n . W e f ir s tc a l cu l a t e 6 a s (12 - llc)/Ix/~lS a n d t h e n Prob(]6[) i s c o m p u t e d b y i n te g r a t i n g a s t a n d a r dn o r m a l d i s t r i b u t io n . S e e t h e c - fu n c t i o n two_side_distance i n t h e a p p e n d i x f o r a n e x a m p l eo f a c - c o d e i m p l e m e n t a t i o n o f t h e s e c a l c u la t io n s .

T h e d i s t an c e f u n c t i o n d , r e p r e s e n t e d i n th e p r o g r a m a s two,side_distance, i s d e f i n e di n a g e n e r a l w a y t o a ll o w f o r i n s e r ti o n s , d e l e t i o n , s u b s t i t u t io n , e tc . T h e f u n c t i o n t a k e sf o u r a r gu m en t s : Xl~ Yl , x2 , y2 .

1 . Le t d(xl,yl , 0~ 0 ) b e t h e c o s t o f s u b s t i t u t i n g X l w i t h y l ,2. d(xl, 0 ; 0 , 0 ) b e t h e c o s t o f d e l e t i n g X l,3. d(O,yl; 0 , 0 ) b e t h e c o s t o f i n s e r t i o n o f Y l,4. d(Xl,yl; x 2, 0 ) b e t h e c o s t o f c o n t r a c t i n g X l a n d x 2 t o y l ,

8 3


10/28


5. d(X l,yl;O, y2) be the cost of expanding X1 to yl and y2, and6 . d (X l ,y l ;x2 ,y2) be the cost of merging xl and x2 and matching with Yl

and yR.5 . T h e D y n a m i c P r o g r a m m i n g A l g o r i t h mThe algorithm is summarized in the following recursion equation. Let si, i = 1 ... I, bethe sentences of one language, and tj , j -- 1 -. . J, be the translat ions of those sentences inthe other language. Let d be the distance function described in the previous section, andlet D(i, j ) be the minimum distance between sentences s l , . . . s i and their translationst l , . . . t j , under the maximu m likelihood alignment. D(i, j ) is computed by minimizingover six cases (substitution, deletion, insertion, contraction, expansion, and merger)which, in effect, impose a set of slope constraints. That is, D(i, j ) is defined by thefollowing recurrence with the initial condition D(i, j ) = O.

D( i , j - 1) + d(O, tj;O,O)D ( i - 1,j) + d(si, O;O,O)D ( i - 1 , j - 1) + d(si, tj;O,O)D( i , j ) = min D ( i - l , j - 2 ) q - d (s i, tj ;O, t j _ l )D ( i - 2 , j - 1 ) + d (s i, t y; Si -l ,0 )D ( i - 2 , j - 2 ) + d (s i, t f i s i- l , tj - 1 )

6 . E v a l u a t i o n

To evaluate align, its results were compared with a human alignment. All of the UBSsentences were aligned by a primary judge, a native speaker of English with a readingknowledge of French and German. Two additional judges, a native speaker of Frenchand a native speaker of German, respectively, were used to check the primary judge on43 of the more difficult paragraphs having 230 sentences (out of 118 total paragraphswith 725 sentences). Both of the additional judges were also fluent in English, havingspent the last few years living and working in the United States, though they wereboth more comfortable with their native language than with English.The materials were prepared in order to make the task somewhat less tedious forthe judges. Each paragraph was printed in three columns, one for each of the threelanguages: English, French, and German. Blank lines were inserted between sentences.The judges were asked to draw lines between matching sentences. The judges werealso permitted to draw a line between a sentence and "null" if they thought that thesentence was not translated. For the purposes of this evaluation, two sentences weredefined to "match" if they shared a common clause. (In a few cases, a pair of sentencesshared only a phrase or a word, rather than a clause; these sentences did not count asa "match" for the purposes of this experiment.)After checking the primary judge with the other two judges, it was decided thatthe primary judge's results were sufficiently reliable that they could be used as astandard for evaluating the program. The primary judge made only two mistakes onthe 43 hard paragraphs (one French mistake and one German mistake), whereas theprogram made 44 errors on the same materials. Since the pr imary judge's error rate isso much lower than that of the program, it was decided that we needn't be concernedwith the primary judge's error rate. If the program and the judge disagree, we canassume that the program is probably wrong.The 43 "hard" paragraphs were selected by looking for sentences that mappedto something other than themselves after going through both German and French.

84


11/28

Wi l li am A. Ga l e and K enne t h W. Ch ur ch P r og r am f o r Al i gn i ng Sen t ences

T a b l e 6Com pl ex m a t ches a r e m or e d i f f i cu l t .c a t e g o r y E n g l i s h -F r e n c h E n g l i s h - G e r m a n t o ta l

N e r r % N e r r % N e r r %1-01-12-12-23-13-2

8 8 100542 14 2.659 8 149 3 331 1 1001 1 100

5 5 100625 9 1.458 2 3.46 2 331 1 1000 0 - -

13 13 1001167 23 2.0117 10 915 5 332 2 1001 1 100

S p ec if ic al ly , f o r e a ch E n g l is h s e n t e n c e , w e a t t e m p t e d t o f in d t h e c o r r e s p o n d i n g G e r m a ns e n te n c e s , a n d t h e n f o r e a c h o f t h e m , w e a t t e m p t e d t o f in d t h e c o r r e s p o n d i n g F r e n c hs e n te n c e s , a n d t h e n w e a t t e m p t e d t o f in d t h e c o r r e s p o n d i n g E n g l i s h s e n te n c e s , w h i c hs h o u l d h o p e f u l l y g e t u s b a c k t o w h e r e w e s t a rt e d . T h e 43 p a r a g r a p h s i n c l u d e d a lls e n t e n c e s in w h i c h t h is p r o c e s s c o u l d n o t b e c o m p l e t e d a r o u n d t h e l o op . T h i s r e la t iv e l ys m a l l g r o u p o f p a r a g r a p h s ( 23 % o f a ll p a r a g r a p h s ) c o n t a i n e d a r e l a t iv e l y l a r g e fr a c t i o no f t h e p r o g r a m ' s e r r o r s (8 2 % ). T h u s , t h e r e s e e m s t o b e s o m e v e r i f i c a t io n th a t t h i st r il i n g u a l c r i t e r io n d o e s i n f a c t s u c c e e d i n d i s t i n g u i s h i n g m o r e d i f fi c u lt p a r a g r a p h sf r o m l e s s d i f f i c u l t o n e s .

T h e r e a re th r e e p a i rs o f l a n g u a g e s: E n g l i s h - G e r m a n , E n g l i s h - F r e n c h , a n d F r e n c h -G e r m a n . W e w i ll r e p o r t o n j u s t th e f ir st tw o . ( T h e th i r d p a i r i s p r o b a b l y d e p e n d e n to n t h e f i r s t t w o . ) E r r o r s a r e r e p o r t e d w i t h r e s p e c t t o t h e j u d g e ' s r e s p o n s e s . T h a t i s ,f o r e a c h o f t h e " m a t c h e s " t h a t t h e p r i m a r y j u d g e f o u n d , w e r e p o r t th e p r o g r a m a sc o r r e c t i f i t f o u n d t h e " m a t c h " a n d i n c o r r e c t i f i t d i d n ' t . T h i s p r o c e d u r e is b e t t e r t h a nc o m p a r i n g o n t h e b as is o f a l i g n m e n t s p r o p o s e d b y t h e a l g o r i t h m f o r tw o r e as o n s .F i rs t, it m a k e s t h e t r ia l " b l i n d , " t h a t i s, t h e j u d g e d o e s n o t k n o w t h e a l g o r i t h m ' s r e s u l tw h e n j u d g i n g . S e c o n d , i t a l lo w s c o m p a r i s o n o f r e s u lt s f o r d i f fe r e n t a l g o r i t h m s o n ac o m m o n b as is .

T h e p r o g r a m m a d e 3 6 e r r o r s o u t o f 6 21 t o ta l a l ig n m e n t s (5 .8 % ) f o r E n g l i s h - F r e n c ha n d 1 9 e r r o r s o u t o f 6 95 (2 .7 % ) a l i g n m e n t s f o r E n g l i s h - G e r m a n . O v e r a l l , th e r e w e r e5 5 e r r o r s o u t o f a t o t a l o f 13 16 a l i g n m e n t s (4 .2 % ) . T h e h i g h e r e r r o r r a t e f o r E n g l i s h -F r e n c h a l i g n m e n t s m a y r e s u lt f r o m t h e G e r m a n b e i n g t h e o r i g in a l, s o t h a t th e E n g l i s ha n d G e r m a n d i ff e r b y o n e t r an s l a ti o n , w h i l e t h e E n g l i sh a n d F r e n c h d i f fe r b y t w ot r a n s l a t i o n s .

T a b le 6 b r e a k s d o w n t h e e r r o r s b y c a t e g o ry , il lu s t ra t in g t h a t c o m p l e x m a t c h e sa r e m o r e d i f f i c u l t . 1 - 1 a l i g n m e n t s a r e b y f a r t h e e a s i e s t . T h e 2 - 1 a l i g n m e n t s , w h i c hc o m e n e x t , h a v e f o u r t i m e s t h e e r r o r r a t e f o r 1 -1. T h e 2 - 2 a l i g n m e n t s a r e h a r d e r s ti ll ,b u t a m a j o r i t y o f t h e a l i g n m e n t s a r e f o u n d . T h e 3-1 a n d 3 - 2 a l i g n m e n t s a r e n o t e v e nc o n s i d e r e d b y t h e a l g o r i th m , s o n a t u r a l l y a l l t h r e e i n s t a n c e s o f t h e s e a r e c o u n t e d a se r r o r s . T h e m o s t e m b a r r a s s i n g c a t e g o r y i s 1 -0 , w h i c h w a s n e v e r h a n d l e d c o r re c tl y . I na d d i t i o n , w h e n t h e a l g o r i t h m a s s i g n s a s e n t e n c e t o th e 1 -0 c a t e g o r y , i t is al s o a l w a y sw r o n g . C l e a rl y , m o r e w o r k is n e e d e d t o d e a l w i t h t h e 1 -0 c a t e g o r y . I t m a y b e n e c e s s a r yt o c o n s id e r l a n g u a g e -s p e c i fi c m e t h o d s i n o r d e r t o d e a l a d e q u a t e l y w i t h t hi s ca se .

S i n c e t h e a l g o r i t h m a c h i e v e s s u b s t a n t i a l l y b e t t e r p e r f o r m a n c e o n t h e 1 - 1 r e g i o n s ,o n e i n t e r p r e t a t i o n o f t h e s e r e s u l t s is th a t t h e o v e r a l l l o w e r r o r r a t e is d u e t o t h eh i g h f r e q u e n c y o f 1-1 a l i g n m e n t s i n E n g l i s h - F r e n c h a n d E n g l i s h - G e r m a n t ra n s la t io n s .

85


12/28

Co m p ut a t i ona l L i ngu i st ic s Vo l um e 19, N um ber 1

T a b l e 7The d i s tance m eas u r e is t he be s t p r ed i c t o r o f er r o r s .Var iable Coef . S td . Dev.Dis tance M easure .071 .011Ca t eg or y Typ e .52 .47P a r a g ra p h L e n g t h . 0 0 0 3 . 0 0 0 5S e n t e n ce L e n g t h . 0 0 1 3 . 0 0 2 9

Coe f . / S t d . Dev .6.51.10.60.5

T r a n s l a t i o n s t o l i n g u i st i c a ll y m o r e d i f f e r e n t l a n g u a g e s , s u c h a s H e b r e w o r J a p a n e s e ,m i g h t e n c o u n t e r a h i g h e r p r o p o r t i o n o f h a r d m a t c h es .

W e i n v e s t i g a t e d t h e p o s s i b l e d e p e n d e n c e o f t h e e r r o r r a te o n f o u r v a r ia b l es :1 . S e n t e n c e L e n g t h2 . P a r a g r a p h L e n g t h3. C a t e g o r y T y p e4 . D i s t a n c e M e a s u r e .

W e u s e d l og is ti c r e g r e s s io n ( H o s m e r a n d L e m e s h o w 1 98 9) t o s e e h o w w e l l e a c h o ft h e f o u r v a r i a b l e s p r e d i c t e d t h e e r r o rs . T h e c o e ff i ci e n t s a n d t h e i r s t a n d a r d d e v i a t i o n sa r e s h o w n i n T a b le 7. A p p a r e n t l y , t h e d is t a n c e m e a s u r e i s t h e m o s t u s e f u l p r e d i c t o r ,a s i n d i c a t e d b y t h e l a st c o l u m n . I n f ac t, n o n e o f t h e o t h e r t h r e e f a c t o r s w a s f o u n d t oc o n t r i b u t e s i g n i f ic a n t l y b e y o n d t h e e f fe c t o f t h e d i s t a n c e m e a s u r e , i n d i c a t i n g t h a t t h ed i s t an c e m e a s u r e i s a l r e a d y d o i n g a n e x c e l le n t j ob , a n d w e s h o u l d n o t e x p e c t m u c hi m p r o v e m e n t if w e w e r e t o t r y t o a u g m e n t t h e m e a s u r e t o t a k e th e s e a d d i t i o n a l f a c to r si n t o a c c o u n t .T h e f a c t t h a t t h e s c o re is s u c h a g o o d p r e d i c t o r o f p e r f o r m a n c e c a n b e u s e d t o e x-t r a c t a la r g e s u b c o r p u s t h a t h a s a m u c h s m a l l e r e r r o r ra t e . B y s e le c t i n g t h e b e s t s c o r i n g8 0 % o f t h e a l i g n m e n t s , t h e e r r o r ra t e c a n b e r e d u c e d f r o m 4 % t o 0 .7 % . I n g e n e r a l , w ec a n t r a d e o f f t h e s iz e o f t h e s u b c o r p u s a n d t h e a c c u r a c y b y s e t t in g a t h r e s h o l d , a n dr e j ec t i n g a l i g n m e n t s w i t h a s c o r e a b o v e t h is t h r e s h o l d . F i g u r e 4 e x a m i n e s t h i s tr a d e - o f fi n m o r e d e t a i l .L e s s f o r m a l t e s ts o f t h e e r r o r r a t e in t h e H a n s a r d s s u g g e s t t h a t t h e o v e r a l l e r r o rr a t e is a b o u t 2 % , w h i l e t h e e r r o r r a t e f o r t h e e a s y 8 0 % o f th e s e n t e n c e s is a b o u t 0 . 4% .A p p a r e n t l y t h e H a n s a r d t r a n s l a t i o n s a r e m o r e l i t e r a l t h a n t h e U B S r e p o r t s . I t t o o k2 0 h o u r s o f r e a l t i m e o n a s u n 4 to a l ig n 3 6 7 d a y s o f H a n s a r d s , o r 3 .3 m i n u t e s p e rH a n s a r d - d a y . T h e 3 6 7 d a y s o f H a n s a r d s c o n t a i n e d a b o u t 8 90 ,0 00 s e n t e n c e s o r a b o u t 3 7m i l l io n " w o r d s " ( to k e n s) . A b o u t h a l f o f t h e c o m p u t e r t i m e i s s p e n t i d e n t i f y i n g t o k e n s ,s e n t en c e s , a n d p a r a g r a p h s , a n d a b o u t h a l f o f t h e t im e i s s p e n t i n t h e a l i g n p r o g r a mi t se l f .T h e o v e r a l l e r ro r , 4 .2 % , t h a t w e g e t o n t h e U B S c o r p u s is c o n s i d e r a b l y h i g h e rt h a n t h e 0 . 6 % e r r o r r e p o r t e d b y B r o w n , L a i , a n d M e r c e r ( 1 9 9 1 ) . H o w e v e r , a d i r e c tc o m p a r i s o n is m i s l e a d i n g b e c a u s e o f t h e d i f fe r e n c e s i n c o r p o r a a n d t h e d i f f e re n c e s ins a m p l i n g . W e h a v e o b s e r v e d t h a t th e H a n s a r d s a r e m u c h e a s i e r t h a n t h e U B S. O u re r r o r r a te d r o p s b y a b o u t 5 0 % i n t h a t c as e . A l i g n i n g t h e U B S F r e n c h a n d E n g l i s h t e x t s ism o r e d i ff ic u l t t h a n a l ig n i n g t h e E n g l is h a n d G e r m a n , b e c a u s e t h e F r e n c h a n d E n g l i s h

86


13/28

W i l l ia m A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n t e n ce s

03

o.~, 04c -O

0. .

0

. . . . . . . . . . . . . ~_ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

I I I I I2 0 40 60 80 1O 0

pe r cen t o f r e t a ined a l i gnm en t sFigure 4Ext r ac t i ng a subco rpus w i th l ower e r ro r r a t e . The f ac t t ha t t he s co re i s such a good p r ed i c to ro f p e r f o r m a n c e c a n b e u s e d t o e x t ra c t a la r g e s u b c o r p u s t h a t h a s a m u c h s m a l l e r e r r o r ra t e. I ng e n e r a l, w e c a n t r a d e o f f t h e s i ze o f t h e s u b c o r p u s a n d t h e a c c u r a c y b y s e t t in g a t h r e s h o l d a n dre j ec t i ng a l i gnm en t s w i th a s co re abo ve th i s t h r e sho ld . The ho r i zon t a l ax i s shows t he s i z e o ft h e s u b c o r p u s , a n d t h e v e r ti c a l a x is s h o w s t h e c o r r e s p o n d i n g e r r o r r a t e. A n e r r o r r a te o f a b o u t2 / 3 % c a n b e o b t a i n e d b y s e l ec t in g a t h re s h o l d t h a t w o u l d r e t ai n a p p r o x i m a t e l y 8 0% o f t h eco rpus .

v e r s i o n s a r e s e p a r a t e d b y t w o t r a n s la t io n s , b o t h b e i n g t r a n s l a t i o n s o f t h e G e r m a no r ig i n al . I n a d d i t i o n , I B M s a m p l e s o n l y t h e 1-1 a l i g n m e n t s , w h i c h a r e m u c h e a s i e rt h a n a n y o t h e r c a t eg o r y , a s o n e c a n s e e f r o m T a b l e 6.G i v e n t h e s e d i f f e r e n c e s i n t e s t i n g m e t h o d o l o g y a s w e l l a s t h e d i f f e r e n c e s i n t h ea l g o r i t h m s , w e f i n d th e m e t h o d s g i v i n g b r o a d l y s i m i l a r r e su l ts . B o t h m e t h o d s g i v er e s u l t s w i t h s u f fi c ie n t a c c u r a c y t o u s e t h e r e s u l t i n g a l i g n m e n t s , o r s e l ec t e d p o r t i o n st h e re o f , f o r a c q u i s i ti o n o f l ex i ca l i n f o r m a t i o n . A n d n e i t h e r m e t h o d a c h i e v e s h u m a na c c u r a c y o n t h e ta s k . ( N o t e t h a t o n e d if f e re n c e b e t w e e n t h e i r m e t h o d a n d o u r s i s t h a tt h e y n e v e r f in d 2 -2 a l ig n m e n t s . T h i s w o u l d g i v e th e i r m e t h o d a m i n i m u m o v e r a l l e r r o rr a t e o f 1 .4 % o n t h e U B S c o r p u s , t h r e e t i m e s t h e h u m a n e r r o r ra t e o n h a r d p a r a g r a p h s . )W e c o n c l u d e t h a t a s e n t e n c e a l i g n m e n t m e t h o d t h a t a c h i ev e s h u m a n a c c u r a c y w i l ln e e d t o h a v e l e xi ca l in f o r m a t i o n a v a i l a b l e t o it .7 . V a r i a t i o n s a n d E x t e n s i o n s7 .1 M e a s u r i n g L e n g t h i n T e r m s O f W o r d s R a t h e r th a n C h a r a c te r sI t is i n t e r e s ti n g t o c o n s i d e r w h a t h a p p e n s i f w e c h a n g e o u r d e f i n i t i o n o f l e n g t h t o c o u n tw o r d s r a t h e r t h a n c h a r a c t e r s . I t m i g h t s e e m t h a t a w o r d is a m o r e n a t u r a l l i n g u i st icu n i t t h a n a c h ar a ct e r . H o w e v e r , w e h a v e f o u n d t h a t w o r d s d o n o t p e r f o r m a s w e l l a s

8 7


14/28


characters. In fact, the "words" variation increases the number of errors dramatically(from 36 to 50 for English-French and from 19 to 35 for English-German). The totalerrors were thereby increased from 55 to 85, or from 4.2% to 6.5%.

We believe that characters are better because there are more of them, and there-fore there is less uncertainty. On the average, there are 117 characters per sentence(including white space) and only 17 words per sentence. Recall that we have modeledvariance as proportional to sentence length, V(I) = s21. Using the character data, wefound previously that s2 ~ 6.5. The same a rgument applied to words yields s2 ~ 1.9.For comparison's sake, it is useful to consider the ratio of x / - V ~ / m (or equivalently,s/x/-m), where m is the mean sentence length. We obtain x /V (m) /m ratios of 0.22 forcharacters and 0.33 for words, indicating that characters are less noisy than words,and are therefore more suitable for use in align.

Although Brown, Lai, and Mercer (1991) used lengths measured in words, com-parisons of error rates between our work and theirs will not test whether charactersor words are more useful. As set out in the previous section, there are numerousdifferences in testing methodology and materials. Furthermore, there are apparentlymany differences between the IBM algorithm and ours other than the units of mea-surement, which could also account for any difference on performance. Appropriatemethodology is to compare methods with only one factor varying, as we do here.

7 .2 I g n o r i n g P a r a gr a p h B o u n d a r i e sRecall that align is a two-step process. First, paragraph boundaries are identified andthen sentences are aligned within paragraphs. We considered eliminating the first stepand found a threefold degradation in performance. The English-French errors wereincreased from 36 to 84, and the English-German errors from 19 to 86. The overallerrors were increased from 55 to 170. Thus the two-step approach reduces errors bya factor of three. It is possible that performance might be improved further still byintroducing additional alignment steps at the clause and/or phrase levels, but testingthis hypothesis would require access to robust parsing technology.

7 . 3 A d d i n g a 2 - 2 C a t e g o r yThe original version of the program did not consider the category of 2-2 alignments.Table 6 shows that the program was right on 10 of 15 actual 2-2 alignments. This wasachieved at the cost of introducing 2 spurious 2-2 alignments. Thus in 12 tries, theprogram was right 10 times, wrong 2 times. This is significantly better t han chance,since there is less than 1% chance of getting 10 or more heads out of 12 flips of a faircoin. Thus it is worthwhile to include the 2-2 alignment possibility.

7 . 4 U s i n g M o r e A c c u r a t e P a r a m e t e r E s t i m a t e sWhen we discussed the estimation of the model parameters, c and s2, we mentionedthat it is possible to fit the parameters more accurately if we estimate different valuesfor each language pair, but that doing so did not seem to increase performance by verymuch. In fact, we foun d exactly the same total number of errors, although the errors areslightly different. Changing the parameters resulted in four changes to the ou tpu t forEnglish-French (two right and two wrong), and two changes to the output for English-German (one right and one wrong). Since it is more convenient to use language-independent parameter values, and doing so doesn't seem to hurt performance verymuch (if at all), we have decided to adopt the language-independent values.

88


15/28


7 . 5 E x t e n s i o n s7.5.1 H a r d a n d S o f t B o u n d a r i e s . Recall that we rejected one of the French documentsbecause one paragraph was omitted and two paragraphs were duplicated. We couldhave handled this case if we had employed a more powerful paragraph alignmentalgorithm. In fact, in aligning the Canad ian Hansards, we found that it was necessaryto do something more elaborate than we did for the UBS data. We decided to usemore or less the same procedure for aligning paragraphs within a document as theprocedure that we used for aligning sentences within a paragraph. Let us introducethe distinction between hard and soft delimiters. The alignment program is definedto move soft delimiters as necessary within the constraints of the hard delimiters.Hard delimiters cannot be modified, and there must be equal numbers of them. Whenaligning sentences within a paragraph, the program considers paragraph boundariesto be "hard" and sentence boundaries to be "soft." When aligning paragraphs withina document, the program considers document boundaries to be "hard" and paragraphboundaries to be "soft." This entension has been incorporated into the implementationpresented in the appendix.

7.5.2 A u g m e n t i n g t h e D i c t i o n a r y F u n c t i o n t o C o n s i d e r W o r d s . Many alternativealignment procedures such as Kay and R6scheisen (unpublished) make use of words. Itought to help to know that the English string "house" and the French string "maison"are likely to correspond. Dates and numbers are perhaps an even more extreme exam-ple. It really ought to help to know that the English string "1988" and the French string"1988" are likely to correspond. We are currently exploring ways to integrate thesekinds of clues into the framewor k described above. However, at present, the algorithmdoes not have access to lexical constraints, which are clearly very important. We expectthat once these clues are properly integrated, the program will achieve performancecomparable to that of the primary judge. However, we are still not convinced that itis necessary to process these lexical clues, since the current performance is sufficientfor many applications, such as building a probabilistic dictionary. It is remarkable justhow well we can do without lexical constraints. Adding lexical constraints might slowdown the program and make it less useful as a first pass.

8. C o n c l u s i o n sThis paper has proposed a method for aligning sentences in a bilingual corpus, basedon a simple probabilistic model, described in Section 3. The model was motivatedby the observation that longer regions of text tend to have longer translations, andthat shorter regions of text tend to have shorter translations. In particular, we foundthat the correlation between the length of a paragraph in characters and the length ofits translation was extremely high (0.991). This high correlation suggests that lengthmight be a strong clue for sentence alignment.Although this method is extremely simple, it is also quite accurate. Overall, therewas a 4.2% error rate on 1316 alignments, averaged over both English-French andEnglish-German data. In addition, we find that the probability score is a good predictorof accuracy, and consequently, it is possible to select a subset of 80% of the a lignmentswith a much smaller error rate of only 0.7%.The method is also fairly language- independen t. Both English-French and English-German data were processed using the same parameters. If necessary, it is possible tofit the six parameters in the model with language-specific values, though, thus far, wehave not found it necessary to do so.

89


16/28

C o m p u t a t i o n a l L i n g ui st ic s V o l u m e 1 9, N u m b e r 1

W e h a v e e x a m i n e d a n u m b e r o f v a r ia t io n s . I n p a r ti c u la r , w e f o u n d t h a t i t is b e t t e rt o u s e c h a r a c t e r s r a th e r t h a n w o r d s i n c o u n t i n g s e n t e n c e le n g th . A p p a r e n t l y , t h e p e r -f o r m a n c e i s b e t t e r w i t h c h a r a c t e r s b e c a u s e t h e r e i s l es s v a r i a b i li t y in t h e d i f f e r e n c e s o fs e n t e n c e l en g t h s s o m e a s u r e d . U s i n g w o r d s a s u n it s i n c r ea s e s th e e r r o r r a te b y h a l lf r o m 4 . 2 % t o 6 . 5 % .

I n t h e f u tu r e , w e w o u l d h o p e t o e x t e n d t h e m e t h o d t o m a k e u s e o f l ex i ca l c o n -s tr a in t s. H o w e v e r , i t is r e m a r k a b l e j u s t h o w w e l l w e c a n d o w i t h o u t s u c h c o n s t r a in t s .W e m i g h t a d v o c a t e o u r s i m p l e c h a r a c t e r a l ig n m e n t p r o c e d u r e a s a fi rs t p a s s, e v e n t ot h o se w h o a d v o c a t e t h e u s e o f l e x ic a l c o n st ra in t s . O u r p r o c e d u r e w o u l d c o m p l e m e n ta le x ic a l a p p r o a c h q u i t e w e ll . O u r m e t h o d is q u i c k b u t m a k e s a f e w p e r c e n t e r ro r s ;a le x ic a l a p p r o a c h is p r o b a b l y s l o w e r , t h o u g h p o s s i b l y m o r e a c c u r at e . O n e m i g h t g ow i t h o u r a p p r o a c h w h e n t h e s co r e s a r e sm a l l , a n d b a c k o f f t o a le x i c al -b a s e d a p p r o a c ha s n e c e s s a r y .AcknowledgmentsW e t h a n k S u s a n n e W o l ff a n d E v e l y n eT z o u k e r m a n n f o r t h e ir p a i n s i n a l ig n i n gs e n te n c e s. S u s a n W a r w i ck p r o v i d e d u s w i t ht he UBS t r il i ngua l co r pus an d conv i nc ed ust o w o r k o n t h e s e n t e n c e a l i g n m e n t p r o b l e m .ReferencesAbramowi tz , M. , and Stegun, I . (1964) .Handbook of Mathematical Functions. U SGover nm en t P r i n t i ng Of f i ce .Brown, P. ; Cocke, J . ; Del la Piet ra , S. ; Del laPietra, V.; Jelinek, F.; Mercer, R.; andRoossin, P. (1988a) . "A stat is t icala p p r o a c h t o F r e n c h / E n g l i s h t ra n s l at io n . "In Proceedings, RIA088 Conference.C a m b r i d g e , M A .Brown, P. ; Cocke, J . ; Del la Piet ra , S. ; Del laPietra, V.; Jelinek, F.; Mercer, R.; andRoossin, P. (1988b) . "A stat is t icalapp r oach t o l anguage t r ans l a t i on . " I nP roceedings, 13 th International Conference onComputational Linguistics (COLIN G-88).

B u d a p e s t , H u n g a r y .Brown, P. ; Cocke, J . ; Del la Piet ra , S. ; Del laPie tra, V.; Jelinek , F.; La fferty , J. ; M erce r,R. ; and Roossin, P. (1990) . "A stat is t icalapp r oach t o m ach i ne t r ans l a t i on . "Computational Linguistics, 16, 79-85.Bro w n, P. ; Lai , J. ; an d M ercer , R. (1991)."Al i gn i ng s en t ences i n pa r a l l e l co r po r a . "In P roceedings, 47th An nua l Meeting of theAssociation for Computational Linguistics.Cat izone , R. ; Russe l l , G. ; and Warwick , S .

( in p r e ss ) . "D e r i v i ng t r ans l a ti on d a t a f r ombi l ingual t ex t s . " In Lexical Acquisition:Using on-line Resources to Build a Lexicon,e d i t e d b y Z e r n ik . L a w r e n c e E r lb a u m .Church , K. (1988) . "A s tochas t i c par t sp r o g r a m a n d n o u n p h r a s e p a rs e r f orunres t r i c ted t ex t . " In P roceedings, SecondConference on Applied Natural LanguageProcessing. Au s t i n , TX.Hosmer , D. , and Lemeshow, S . (1989) .Applied Logistic Regression. Wiley.Kl avans , J. , and Tz ouk e r m ann , E . (1990)."Th e BI CORD s ys t em . " I n Proceedings,15 th International Conference onComputational Linguistics (COLING-90),174-179.Kay, M. , and R6scheisen, M. (1988) ."Tex t - t rans l a t ion a l i gnm en t . " Xe r ox P a l oAl t o R es ea r ch Cen t e r.Kruska l , J . , and Liberman, M. (1983) . "Thes y m m e t r ic t im e - w a r p i n g p r o b l e m : F r o mcon t i nuous t o d i s c r e te . " I n Time Warps,String Edits, and Macro Molecules: TheTheory and P ractice of Sequence Comparison,ed i t ed b y D. S ankof f and J . Kr us ka l .Add i s on - Wes l ey .Liberman, M. , and Church , K. ( in press ) ." T e x t a n a ly s is a n d w o r d p r o n u n c i a t i o n i nt ex t - to - s peech s yn t hes i s . " I n Advances inSpeech Signal Processing, ed i t ed by S . Fu r u iand M. Sondh i .Sankoff , D. , and Kruskal , J . (1983) . TimeWarps, String Edits, and M acromolecules: TheTheory and P ractice of Seq uence Comparison.Addi s on - Wes l ey .

90


17/28


Ap pend ix: Programwith Michael D. RileyThe following code is the core of align. It is a C language program that inputs twotext files, with one token (word) per line. The text files contain a number of delimitertokens. There are two types of delimiter tokens: "ha rd" and "soft." The hard regions(e.g., paragraphs) may not be changed, and there must be equal numbers of them inthe two input files. The soft regions (e.g., sentences) may be deleted (1-0), inserted (0-1), substituted (1-1), contracted (2-1), expanded (1-2), or merged (2-2) as necessary sothat the output ends up with the same number of soft regions. The program generatestwo output files. The two output files contain an equal number of soft regions, eachon a line. If the -v command line option is included, each soft region is preceded byits probability score.# i n c l u d e < f c n t l . h># i n c l u d e < m a l l o c . h ># i n c l u d e < m a t h . h ># i n c l u d e < s t d i o . h ># i n c l u d e < s t r i n g . h ># i n c l u d e < s y s / m m a n . h ># i n c l u d e < s y s / t y p e s . h ># i n c l u d e < v a l u e s . h ># i n c l u d e < s y s / s t a t . h >/*

u s a g e :a l i g n r e g i o n s - D ' . P A RA ' - d ' . En d o f S e n t e n c e ' f i l e 1 f i l e 2o u t p u t s t w o f i l es : f i l e l . a l ~ f i l e 2 . a lh a r d r e g i o n s a r e d e l i m i t e d b y t h e - D a r gs o f t r e g i o n s a r e d e l i m i t e d b y t h e - d a r g*/

# d e f i n e d i s t ( x , y ) d i s t a n c e s [ ( x ) * (( ny ) + 1 ) + ( y ) ]# d e f i n e p a t h x ( x , y ) p a t h x [ ( x ) * ( (n y) + i) + ( y ) ]# d e f i n e p a t h y ( x , y ) p a t h _ y [ ( x ) * (( n y) + 1) + ( y ) ]# d e f i n e M A X _ F I L E N A M E 2 8 6# d e f i n e B I G D I S T A N C E 2 8 0 0/ * D y n a m i c P r o g r a m m i n g O p t i m i z a t i o n * /s t r u c t a l i g n m e n t {

int xl;int yl;int x2;int y2;int d;

};c h a r * h a r d _ d e l i m i t e r = N U L L ;c h a r * s o f t _ d e l i m i t e r = N U L L ;i n t v e r b o s e = O ;/ * u t i l i t y f u n c t i o n s * /

I ~ - D a r g * II * - d a r g * II * - v a r g * I

91


18/28

Com putat ional Linguist ics Volume 19, N um ber 1

char *readchars(), ** readlines(), **substri ngs();void err();/ *

s e q _ a l i g n b y M i k e R i l e yx and y are se q uences of ob j ects, repr esen ted as non-z ero ints,to be aligned.dist_fun ct(xl, yl, x2, y2) is a distan ce funct ion of 4 args:

dist_fu nct(xl, yl, O, O) gives cost of subs titu tion of xl by yl.dist_fu nct(xl, O, O, 01 gives cost of dele tion of xl.dist_fu nct(O, yl, O, 01 gives cost of inse rtion of yl.dist_fu nct(xl, yl, x2, 01 gives cost of contract ion of (xl,x2) to yl.dist_fu nct(xl, yl, O, y2) gives cost of expa nsio n of xi to (yl,y2).dist_fun ct(xl, yl, x2, y2) gives cost to mat ch (xl,x2) to (yl,y2).

align is the alignment, with (align[i].xl, align[i]. x2) align edw i t h ( a li g n [ i ]. y l , a l i g n [ i ] . y2 ) . Z e r o i n a l i g n [ ] . x l a n d a l i g n [ ] . y lc o r r e s p o n d t o i n s e r t i o n a n d d e l e t i on , r e s p e c t i v e l y . N o n - z e r o i na l i g n [ ] . x 2 a n d a l i g n [ ] . y 2 c o r r e s p o n d t o c o n t r a c t i o n a n d e x p a n s i on ,r e s p e c t i v e l y , a l i g n [ ] . d g i v e s t h e d i s t a n c e f o r t h a t p a i r i ng .

T h e f u n c t i o n r e t u r n s t h e l e n g t h o f t he a l i g n m e n t .* /intse q _al ign(x, y, nx, ny, dist_funct, align)

int *x, *y, nx, ny;int (*dist_funct)();struct alignment **align;

int *distances, *path_x, *path_y, n;int i, j, oi, oj, di, dj, dl, d2, d3, d4, d5, d6, dmin;struct alignment *ralign;dis tan ces = (int *) ma llo c( (nx + I) * (ny + i) * size of(i nt)) ;pat h_x = (int *) mal loc ((n x + i) * (ny + I) * size of(i nt)) ;pat h_y = (int *) mal loc ((n x + I) * (ny + i) * size of(i nt));ralig n = (struct alignme nt *) mall oc(( nx + ny) s i z e o f ( s t r u c t a l i g n m e n t) ) ;for( j = O; j O ? /* subst itut ion */

dist(i-l, j -l) + (*dist_ funct)( x[i-l], y[ j -1], O, O): M A X I N T ;

d2 = i>O ? /* dele tion */dist(i-l, j ) + (*dist _funct) (x[i-l] , O, O, O)

: M A X I N T ;d 3 = j >O ? /* inser tion */

dist(i, j -l) + (*dist_funct) (O, y[ j -1], O, O)

9 2


19/28

W i l l ia m A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n t e n c es

: M A X I N T ;d 4 = i > l & & j > O ? / * c o n t r a c t i o n * /

d i s t ( i - 2 , j - l ) + ( ~ d i s t _ f u n c t ) ( x [ i - 2 ] , y [ j - l ] , x [ i - l ] , O ): M A X I N T ;

d 5 = i > O & & j > l ? / ~ e x p a n s i o n * /d i s t ( i - l , j - 2 ) + ( ~ d i s t _ f u n c t ) ( x [ i - l ] , y [ j - 2 ] , O , y [ j - l ] )

: M A X I N T ;d 6 = i > l & & j > l ? / ~ m e l d i n g ~ /

d i s t ( i - 2 , j - 2 ) + ( * d i s t _ f u n c t ) ( x [ i - 2 ] , y [ j - 2 ] , x [ i - l ] , y [ j - l ] ): M A X I N T ;

d m i n = d i ;i f ( d 2 < d m i n ) d m i n = d 2 ;i f ( d 3 < d m i n ) d m i n = d 3 ;i f ( d 4 < d m i n ) d m i n = d 4 ;i f ( d 5 < d m i n ) d m i n = d S ;i f ( d 6 < d m i n ) d m i n = d 6 ;i f ( d m i n = = M A X I N T ) {

d i s t ( i , j ) = O ;}e l s e i f ( d m i n = = d l ) {

d i s t ( i , j ) = d l ;p a t h x ( i , j ) = i - l ;p a t h y ( i , j ) = j - l ;}

e l s e i f ( d m i n = = d 2 ) {d i s t ( i , j ) = d 2 ;p a t h x ( i , j ) = i - l ;p a t h y ( i , j ) = j ;}

e l s e i f ( d m i n = = d 3 ) {d i s t ( i , j ) = d 3 ;p a t h x ( i , j ) = i ;p a t h y ( i , j ) = j - l ;

}e l s e i f ( d m i n = = d 4 ) {d i s t ( i , j ) = d 4 ;p a t h x ( i , j ) = i - 2 ;p a t h y ( i , j ) = j - l ;}

e l s e i f ( d r a i n = = d 5 ) {d i s t ( i , j ) = d 5 ;p a t h x ( i , j ) = i - l ;p a t h y ( i , j ) = j - 2 ;

}e l s e / * d m i n = = d 6 * / {d i s t ( i , j ) = d 6 ;p a t h x ( i , j ) = i - 2 ;p a t h y ( i , j ) = j - 2 ;}

9 3


20/28

Com putat ional Linguis tics Volume 19, N um ber 1

}n = O ;f o r ( i = n x , 3 = n y ; i > O I I j > O ; i = o i , j = o j ) {

o i = p a t h x ( i , j ) ;o j = p a t h y ( i , j ) ;d i = i - o i ;d j = j - o j ;i f ( d i = = i a ~ d j = = I ) { / * s u b s t i t u t i o n * /

r a l i g n [ n ] . x l = x [ i - i ] ;r a l i g n [ n ] . y l = y [ j - i ] ;r a l i g n [ n ] . x 2 = O ;r a l i g n I n] . y 2 = 0 ;r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i - i , j - i ) ;

e l s e i f ( d i = = 1 ~ a d j = = O ) { / * d e l e t i o n * /r a l i g n [ n ] . x l = x [ i - l ] ;r a l i g n [ n ] . y l = O ;r a l i g n [ n ] . x 2 = O ;r a l i g n [ n] . 2 = 0 ;r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i - l , j ) ;

e l s e i f ( d i = = 0 ~ d j = = I ) { / * i n s e r t i o n * /r a l i g n [ n ] . x l = O ;r a l i g n [ n ] . y l = y [ j - l ] ;r a l i g ~ I n S . x 2 = O ;r a l i g n [ n ] . 2 = 0 ;r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i , j - l ) ;

e l s e i f ( d j = = i ) { / * c o n t r a c t i o n * /r a l i g n [ n ] . x l = x [ i - 2 ] ;r a l i g n [ n ] . y l = y [ j - l ] ;r a l i g ~ [ n ] . x 2 = x [ i - l ] ;r a l i g n [ n ] . 2 = 0 ;r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i - 2 , j - l ) ;

e l s e i f ( d i = = i ) { / * e x p a n s i o n * /r a l i g n [ n ] . x l = x [ i - l ] ;r a l i g n I n ] . l = y [ j - 2 ] ;r a l i g n [ n ] . x 2 = O ;r a l i g n [ n ] . y 2 = y [ j - l ] ;r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i - l , j - 2 ) ;}

e l s e / * d i = = 2 a a d j = = 2 * / { / * m e l d i n g * /r a l i g n [ n ] . x l = x [ i - 2 ] ;r a l i g n I n ] . l = y [ j - 2 ] ;r a l i g n [ n ] . x 2 = x [ i - l ] ;r a l i g n [ n ] . y 2 = y [ j - l ] ;

9 4


21/28

William A. Gale and Kenneth W. Chur ch Progr am for Alig ning Sentences

r a l i g n [ n + + ] . d = d i s t ( i , j ) - d i s t ( i - 2 , j - 2 ) ;}}* a l i g n = ( s t r u c t a l i g n m e n t * ) m a l l o c ( n * s i z e o f ( s t r u c t a l i g n m e n t ) ) ;f o r ( i = O ; i < n ; i + + )

b c o p y ( r a l i g n + i , ( * a l i g n ) + ( n - i - l ) , s i z e o f ( s t r u c t a l i g n m e n t ) ) ;f r e e ( d i s t a n c e s ) ;f r e e ( p a t h _ x ) ;f r e e ( p a t h _ y ) ;f r e e ( r a l i g n ) ;r e t u r n ( n ) ;}

/ * L o c a l D i s t a n c e F u n c t i o n * // * R e t u r n s t h e a r e a u n d e r a n o r m a l d i s t r i b u t i o n

f r o m - i n f t o z s t a n d a r d d e v i a t i o n s * /d o u b l ep n o r m ( z )

d o u b l e z ;{d o u b l e t , p d ;t = 1 / ( 1 + 0 . 2 3 1 6 4 1 9 * z ) ;

p d = 1 - 0 , 3 9 8 9 4 2 3 *ex p ( - z * z / 2 ) *

( ( ( (1 . 330 2744 29 * t - 1 .821255978) * t+ 1.78 14779 37) * t - 0.3565 6378 2) * t + 0.3 1938 1530 ) * t ;

/* see Abramowi tz , M. , and I . S tegu n (1964) , 26 . 2 .1 7 p . 932 * /r e t u r n ( p d ) ;}

/ * Re t u r n -1 0 0 * l o g p ro b ab i l i t y t h a t an E n g l i s h s en t en ce o f l en g t hl en l i s a t r an s l a t i o n o f a fo r e i g n s en t en ce o f l en g t h l en 2 . T hep ro b a b i l i t y i s b a s ed o n t wo p a ram e t e r s , t h e mean and v a r i a n ce o fn umber o f fo r e i g n ch a ra c t e r s p e r E n g l i s h ch a rac t e r .* /

i n tm a t c h ( l e n l , l e n 2 )

i n t l e n l , l e n 2 ;{doub le z , pd , mean ;double c = 1;doub le s2 = 6 .8 ;i f ( l e n l = = O & & l e n 2 = = O ) r e t u r n ( O ) ;m e a n = ( l e n l + l e n 2 / c ) / 2 ;z = ( c * l e n l - l e n 2 ) / s q r t ( s 2 * m e a n ) ;/ * N e e d t o d e a l w i t h b o t h s i d e s o f t h e n o r m a l d i s t r i b u t i o n * /i f ( z < O ) z = - z ;

95


22/28

Com puta t iona l L ingu i s ti c s Vo lume 19, N um ber 1

pd = 2 * (1 - pnor m(z)) ;if(pd > O) retu rn(( int) (-lO 0 * log(pd)));e l s e r e t u r n ( B I G _ D I S T A N C E ) ;

}inttwo_s ide_di stance (xl, yl, x2, y2)

int xl, yl, x2, y2;int penal ty21 = 2 30;log([ prob of 2-1 match]int penal ty22 = 440 ;Iog([ prob of 2-2 match]int penal tyOl = 450 ;log([p rob of 0-i match] / [prob of I-i match])

/* -i00 */ [prob of 1-1 matc h]) */

/ * - 1 0 0 */ [ p r o b o f 1 -1 m a t c h ] ) * /

/ * - 1 0 0 * * /if(x2 == 0 ~& y2 == O)

if(x1 == O) /* insert ion */return (match (x1, yl) + penalty0 1);

else if(y1 == O) /* delet ion */return( match(x 1, yl) + penalty01);

else retur n (match(xl, yl)); /* subst ituti on */else if(x2 == O) /* expa nsio n */

retur n (match(x1, yl + y2) + penalty2 1);else if(y2 == O) /* contr actio n */

retur n(mat ch(xl + x2, yl) + penalty21);else /* merge r */

retur n(mat ch(x1 + x2, yl + y2) + penalty22 );}/ * F u n c t i o n s f o r M a n i p u l a t i n g R e g i o n s * /s t r u c t r e g i o n {

char **lines;int length;

};v o i dprint_ region( fd, region, score)

int score;FILE *fd;struct region *region;

{char **lines, **end;lines = region->lines;

9 6


23/28

W i l l ia m A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n t e n c e s

e n d = l i n e s + r e g i o n - > l e n g t h ;f o r ( ; l i n e s < e n d ; l i n e s + + )

f p r i n t f ( f d , " Z s \ n " , * l i n e s ) ;}in tl e n g t h _ o f _ a _ r e g i o n ( r e g i o n )

s t r u c t r e g i o n * r e g i o n ;{i n t r e s u l t ;c h a r * * l i n e s , * * e n d ;l i n e s = r e g i o n - > l i n e s ;e n d = l i n e s + r e g i o n - > l e n g t h ;r e s u l t = e n d - l i n e s ;f o r ( ; l i n e s < e n d ; l i n e s + + )

r e s u l t + = s t r l e n ( * l i n e s ) ;r e t u r n ( r e s u l t ) ;}

int *r e g i o n _ l e n g t h s ( r e g i o n s , n )

s t r u c t r e g i o n * r e g i o n s ;int n;{

int i;i n t * r e s u l t ;r e s u l t = ( in t * ) m a l l o c ( n * s i z e o f ( i n t ) ) ;i f ( r e s u l t = = N U L L ) e r r ( " m a l l o c f a i l e d " ) ;fo r( i = O; i < n; i++ )

r e s u lt [ i ] = l e n g t h o f _ a _ r e g i o n ( r e g i o n s [ i ] ) ;r e t u r n ( r e s u l t ) ;}

s t r u c t r e g i o n *f i n d _ s u b _ r e g i o n s ( r e g i o n , d e l i m i te r , l e n _ pt r )

s t r u c t r e g i o n * r e g i o n ;c h a r * d e l i m i t e r ;i n t * l e n _ p t r ;

s t r u c t r e g i o n * r e s u l t ;c h a r ** i , * * l i n e s , * * e n d ;int n = O;l i n e s = r e g i o n - > l i n e s ;e n d = l i n e s + r e g i o n - > l e n g t h ;f o r ( l = l i n e s ; i < e n d ; I + + )

i f ( d e l i m i t e r & & s t r c m p ( * l , d e l i m i t e r ) = = O ) n + + ;

9 7


24/28

Com puta t iona l L ingu i s ti c s Vo lume 19, Nu m ber 1

r e s u l t = ( s t r u c t r e g i o n ~ ) c a l l o c ( n + l , s i z e o f ( s t r u c t r e g i o n ) ) ;i f ( r e s u l t = = N U L L ) e r r ( " m a l l o c f a i l e d " ) ;* l e n _ p t r = n ;n = O;r e s u l t [ O ] . l i n e s = l i n e s;for (1 = lines; 1 < end; l++)

i f ( d e l i m i t e r & ~ s t r c m p ( * l , d e l i m i t e r ) = = O ) {r e s u l t [ n ] . l e n g t h = 1 - r e s u l t [ n ] . l i n e s ;r e s u l t [ n + l ] . l i n e s = I +i ;n++;}

r e s u l t [ n ] . l e n g t h = 1 - r e s u l t [ n ] . l i n e s ;if(n != *len_p tr) {

f p r i n t f ( s t d e r r , " f i n d _ s u b _ r e g i o n s : n = ~ d , * l e n _ p t r = ~ d \ n " , n ,* l e n _ p t r ) ;e x i t ( 2 ) ;

}r e t u r n ( r e s u l t ) ;}

/ * T o p L e v e l M a i n F u n c t i o n * /intm a i n ( a r g c , a r g v )

int argc;c h a r * * a r g v ;

c h a r * * l i n e s l , ~ * l i n e s 2 ;i nt n u m b e r _ o f l i n e s l , n u m b e r _o f _ l i n es 2 ;s t r u c t r e g i o n ~ h a r d _ r e g i o n sl , ~ h a r d _ r e g i o n s 2 , ~ s o f t r e g i o n s l ,~ s o f t _ r e g i o n s 2 ;s t r u c t r e g i o n ~ h a r d _ e n d l , ~ h a r d _ e n d 2 , t m p ;i nt n u m b e r _ o f h a r d _ r e g i o n s l ;i n t n u m b e r _ o f h a r d _ r e g i o n s 2 ;i nt n u m b e r _ o f s o f t _ r e g i o n s l ;i n t n u m b e r _ o f s o f t _ r e g i o n s 2 ;int ~lenl, ~len2;int c, n, i, ix, iy, prevx, prevy;s t r u c t a l i g n m e n t ~ a l i g n , ~ a;FILE *outl, ~out2;c ha r f i l e n a m e [ M A X F I L E N A M E ] ;e x t e r n c h a r ~ o p t a r g ;e x t e r n i n t o p t i n d ;/ * p a r s e a r g u m e n t s * /w h i l e ( ( c = g e t o p t ( a r g c , a r g v, " v d : D : " ) )

s w i t c h ( c ) {case 'v''

verbose = 1;b r e a k ;

case 'd':s o f t d e l i m i t e r = s t r d u p ( o p t a r g ) ;

!= EOF)

9 8


25/28


brea k ;case 'D':

h a r d _ d e l i m i t e r = s t r d u p ( o p t ar g ) ;brea k ;

default:fprintf(s tderr, " usage: align_re gions [d (soft delimiter)]( h a r d d e l i m i t e r ) ] \ n " ) ;exit(2);}

if(argc ! = optind + 2) err( " wro ng nu mber of arguments" );/* open output files */sprintf (filena me, " ~s.al" , argv[optind]) ;out1 = fopen(fil ename, " w" );if(outl == NULL) {

fprintf( stderr, " can't open ~s\ n" , filename);exit(2);}sprintf (filena me, " ~s.al" , argv[opti nd+l]);out2 = fopen(fi lename, " w" );if(out2 == NULL) {

fprintf(s tderr, " can't open ~s\ n" , filename);exit(2);}

[D

l i n e s l = r e a d l i n e s ( a r g v [ o p t i n d ] , & n u m b e r _ o f _ l i n e s l ) ;l i n e s 2 = r e a d l i n e s ( a r g v [ o p t i n d + l ] , & n u m b e r _ o f _ l i n e s 2 ) ;tmp.l ines = linesl;t m p . l e n g t h = n u m b e r _ o f _ l i n e s l ;h a r d _ r eg i o n s l = f i n d _ s u b r e g i o n s ( & t m p , h a r d _ de l i m i te r ,

~ n u m b e r _ o f _ h a r d _ r e g i o n s l ) ;tmp.li nes = lines2;t m p . l e n g t h = n u m b e r _ o f _ l i n e s 2 ;h a r d _ r e g i o n s 2 = f i n d _ s u b _ r e g i o n s ( & t m p , h a r d _ d e l i m i t e r ,& n u m b e r _ o f _ h a r d _ r e g i o n s 2 ) ;i f ( n u m b e r _ o f h a r d _ r e g i o n s l ! = n u m b e r _ o f_ h a r d _ r e g io n s 2 )

err(" a lign_r egions: input files do not contain thesame numbe r of hard regions " );

h a r d _ e n d l = h a r d _ r e g i o n s l + n u m b e r _ o f _ h a r d _ r e g i o n s l ;h a r d _ e n d 2 = h a r d r e g i o n s 2 + n u m b e r o f _ h a r d r e g i o n s2 ;f o r ( ; h a r d _ r e g i o n s l < h a r d _ e n d l ; h a r d r eg i o n s l + + , h a r d _ r e g i o n s 2 + + ) {

s o f t _ r e g i o n s l = f i n d _ s u b _ r e g i o n s ( h a r d _ r e g i o n s l [ O ] , s o f t _ d e l i m i t e r ,& n u m b e r _ o f _ s o f t _ r e g i o n s l ) ;

s o f t _ r e g i o n s 2 = f i n d _ s u b _ r e g i o n s ( h a r d _ r e g i o n s 2 [ O ] , s o f t _ d e l i m i t e r ,& n u m b e r _ o f _ s o f t _ r e g i o n s 2 ) ;

l e n l = r e g i o n _ l e n g t h s ( s o f t _ r e g i o n s l , n u m b e r _ o f _ s o f t _ r e g i o n s l ) ;

9 9


26/28

Com puta t iona l L ingu i st i cs Vo lume 19, Nu m ber 1

l e n 2 = r e g i o n _ l e n g t h s ( s o f t _ r e g i o n s 2 , n u m b e r _ o f _ s o f t _ r e g i o n s 2 ) ;n = s e q _ a l i g n ( l e n l , l e n2 , n u m b e r _ o f _ s o f t _ r e g i o n s l ,

n u m b e r _ o f _ s o f t _ r e g i o n s 2 ,t w o _ s i d e _ d i s t a n c e , ~ a l i g n ) ;prevx = prevy = ix = iy = O;fo r(i = O; i < n; i++) {

a = aalign[i];if(a ->x2 > O) ix++; els e if(a ->xl == O) ix--;if(a ->y2 > O) iy++; el se if( a->yl == O) iy--;if(a->xl == 0 a~ a->yl == 0 ~ & a->x2 == 0 ~ a->y2 == O){ix++; iy++;}ix++;iy++;if(verbose) {

fprintf( outl, ".Score ~d kn ", a->d);fprintf( out2, ".Score ~d kn ", a->d);}

for( ; pre vx < ix; prevx ++)p r i n t _ r e g i o n ( o u t l , s o f t _ r e g i o n s l [ p r e v x ] , a - > d ) ;

f p r i n t f ( o u t l , " ~ s in " , so f t _ d e l i m i t e r ) ;for( ; pre vy < iy; prevy++)

p r i n t _ r e g i o n ( o u t 2 , s o f t _ r e g i o n s 2 [ p r e v y ] , a - > d ) ;f p r i n t f ( o u t 2 , " Z s k n " , s o f t _ d e l i m i t e r );}f p r i n t f ( o u t l , " ~ s k n " , h a r d _ d e l i m i t e r ) ;f p r i n t f ( o u t 2 , " ~ s k n " , h a r d _ d e l i m i t e r ) ;f r e e ( a l i g n ) ;f r e e ( s o f t _ r e g i o n s l ) ;f r e e ( s o f t _ r e g i o n s 2 ) ;free(lenl);free (len2) ;

}}/ ~ U t i l i t y F u n c t i o n s ~ /v o i derr(msg)

char ~msg;{f p r i nt f ( s td e r r , " ~ E R R O R ~ : % s \ n " , m sg ) ;exit(2);

}/~ return the conte nts of the file as a string

and stuff the length of this str ing into len_ptr ~/charr e a d c h a r s ( f i l e n ~ n e , l e n _ p t r )

char ~filename;

100


27/28

W i l l ia m A . G a l e a n d K e n n e t h W . C h u r c h P r o g r a m f o r A l i g n i n g S e n t e n ce s

i n t ~ l e n p t r ;F I L E * f d ;c h a r * r e s u l t ;s t r u c t s t a t s t a t _ b u f ;f d = f o p e n ( f i l e n a m e , " r " );i f ( f d = = N U L L ) e r r ( " o p e n f a i l e d " ) ;i f ( f s t a t ( f i l e n o ( f d ) , & s t a t _ b u f ) = = - I)

e r r ( " s t a t f a i l e d " ) ;* l e n _ p t r = s t a t b u f . s t _ s i z e ;r e s u l t = m a l l o c ( * l e n _ p t r ) ;i f ( r e s u l t = = N U L L ) e r r ( " m a l l o c f a i l e d \ n " ) ;i f ( f r e a d ( r e s u l t , s i z e o f ( c h a r ) , * l e n _ p t r , fd ) != ~ l e n _ p t r )

e r r ( " f r e a d f a i l e d " ) ;i f ( f c l o s e ( f d ) = = - i)

e r r ( " f c l o s e f a i l e d " ) ;r e t u r n ( r e s u l t ) ;}

/ * s p l i t s t r i n g i n t o a n u m b e r o f s u b s t r i n g s d e l i m i t e d b y a d e l i m i t e rc h a r a c t e rr e t u r n a n a r r a y o f s u b s t r i n g ss t u f f t h e l e n g t h o f t h i s a r r a y i n t o l e n _ p t r * /

c h a r * *s u b s t r i n g s ( s t r i n g , e n d, d e l i m i t e r , l e n _ p t r )

c h a r * s t r i n g , * e n d , d e l i m i t e r ;i n t * l e n _ p t r ;

c h a r * s , * * r e s u l t ;int i = O;w h i l e ( s t r i n g < e n d & & * s t r i n g = = d e l i m i t e r ) s t r i n g + + ;for( s = string

align algorithm - jorge

Documents