Catch the Link! Combining Clues for Word Alignment Jörg Tiedemann Uppsala University [email protected]


Page 1: Catch the Link!  Combining Clues for Word Alignment

Catch the Link! Combining Clues for Word Alignment

Jörg Tiedemann
Uppsala University
[email protected]

Page 2: Catch the Link!  Combining Clues for Word Alignment

Outline

Background
• What do we want?
• What do we have?
• What do we need?

Clue Alignment
• What is a clue?
• How do we find clues?
• How do we use clues?
• What do we get?

Page 3: Catch the Link!  Combining Clues for Word Alignment

What do we want?

[Diagram: Source, Translation 1, Translation 2 -> sentence aligner -> parallel corpus -> word aligner -> aligned corpus, token links, type links. Annotations: automatically, language independent.]

Page 4: Catch the Link!  Combining Clues for Word Alignment

What do we have?
• tokeniser (ca 99%)
• POS tagger (ca 96%)
• lemmatiser (ca 99%)
• shallow parser (ca 92%), parser (> 80%)
• sentence aligner (ca 96%)
• word aligner: 75% precision, 45% recall

Page 5: Catch the Link!  Combining Clues for Word Alignment

What’s the problem with Word Alignment?

Word alignment challenges:
• non-linear mapping
• grammatical/lexical differences
• translation gaps
• translation extensions
• idiomatic expressions
• multi-word equivalences

(1) Our Hasid is in his late twenties.
(2) Vår chassid är bortåt de trettio.

(Saul Bellow “To Jerusalem and back: a personal account”)

(1) I take the middle seat, which I dislike, but I am not really put out.

(2) Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket.

(Saul Bellow “To Jerusalem and back: a personal account”)

(1) Armén kommer att reformeras och effektiviseras.

(2) The army will be reorganized with the aim of making it more effective.

(The Declarations of the Swedish Government, 1988)

(1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.

(2) Our policy of neutrality is underpinned by a strong defence.

(The Declarations of the Swedish Government, 1988)

(1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics, ..."

(2) Alsop förklarar: "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik, ..."

(Saul Bellow “To Jerusalem and back: a personal account”)

Page 6: Catch the Link!  Combining Clues for Word Alignment

So what? What are the real problems?

Word alignment
• uses simple, fixed tokenisation
• fails to identify appropriate translation units
• ignores contextual dependencies
• ignores relevant linguistic information
• uses poor morphological analyses

Page 7: Catch the Link!  Combining Clues for Word Alignment

What do we need?
• flexible tokenisation
• possible multi-word units
• linguistic tools for several languages
• integration of linguistic knowledge
• combination of knowledge resources
• alignment in context

Page 8: Catch the Link!  Combining Clues for Word Alignment

Let’s go!

Clue Alignment!
• finding clues
• combining clues
• aligning words

Page 9: Catch the Link!  Combining Clues for Word Alignment

Word Alignment Clues

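English: The United Nations conference has started today .
POS:     DT NNP NNP NN VBZ VBN RB
Chunks:  [NP The United Nations conference] [VP has started] [ADVP today]

Swedish: Idag började FN-konferensen .
POS:     RG0S V@IIAS NCUSN@DS
Chunks:  [ADVP Idag] [VC började] [NP FN-konferensen]

String similarity: conference / konferensen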

Page 10: Catch the Link!  Combining Clues for Word Alignment

Word Alignment Clues

Def.: A word alignment clue Ci(s,t) is a probability which indicates an association between two lexical items, s and t, from parallel texts.

Def.: A lexical item is a set of words with associated features attached to it.

Page 11: Catch the Link!  Combining Clues for Word Alignment

How do we find clues? (1)

Clues can be estimated from association scores:

$C_i(s,t) = w_i \cdot A_i(s,t)$

co-occurrence:
• Dice coefficient: $A_1(s,t) = \mathrm{Dice}(s,t)$
• Mutual information: $A_2(s,t) = I(s;t)$

string similarity:
• longest common subsequence ratio: $A_3(s,t) = \mathrm{LCSR}(s,t)$
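The two simplest association scores can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides. It assumes the usual Dice definition over co-occurrence counts and an LCSR normalised by the length of the longer string; the function names are chosen here for illustration.

```python
def dice(cooc: int, freq_s: int, freq_t: int) -> float:
    """Dice coefficient from a co-occurrence count and two marginal counts."""
    total = freq_s + freq_t
    return 2.0 * cooc / total if total else 0.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-wise dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0
```

With this normalisation, `lcsr("conference", "FN-konferensen")` is 8/14 and `lcsr("Nations", "FN-konferensen")` is 4/14, which round to the string-similarity scores 0.57 and 0.29 used in the later examples.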

Page 12: Catch the Link!  Combining Clues for Word Alignment

How do we find clues? (2)

Clues can be estimated from training data:

$C_i(s,t) = w_i \cdot P(f_t \mid f_s) \approx w_i \cdot \frac{\mathrm{freq}(f_t, f_s)}{\mathrm{freq}(f_s)}$

$f_s$, $f_t$ are features of s and t, e.g.
• part-of-speech sequences of s, t
• phrase category (NP, VP etc), syntactic function
• word position
• context features
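The relative-frequency estimate above can be sketched as follows. The sketch assumes that both freq(ft, fs) and freq(fs) are counted over already linked pairs; the slide does not say where freq(fs) is counted. The function name and the `weight` parameter (standing in for wi) are illustrative.

```python
from collections import Counter


def learn_feature_clue(linked_pairs, weight=1.0):
    """Estimate weight * P(f_t | f_s) from (f_s, f_t) feature pairs.

    linked_pairs: iterable of (source_feature, target_feature) tuples
    taken from existing links (assumption: counts come from links only).
    """
    pair_freq = Counter()
    src_freq = Counter()
    for f_s, f_t in linked_pairs:
        pair_freq[(f_s, f_t)] += 1
        src_freq[f_s] += 1
    return {
        (f_s, f_t): weight * count / src_freq[f_s]
        for (f_s, f_t), count in pair_freq.items()
    }


# Hypothetical training links, using POS tags as features.
pairs = [("NN", "NCUSN@DS")] * 3 + [("NN", "V@IIAS")]
pos_clue = learn_feature_clue(pairs)
```

Any feature that can be read off a lexical item (POS sequence, chunk label, position) can be plugged in as `f_s` and `f_t` in the same way.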

Page 13: Catch the Link!  Combining Clues for Word Alignment

How do we use clues? (1)
• Clues are simply sets of association measures
• The crucial point: we have to combine them!

If $C_i(s,t) = P(a_i)$, define the total clue as
$C_{all}(s,t) = P(A) = P(a_1 \cup a_2 \cup \dots \cup a_n)$

Clues are not mutually exclusive!
$P(a_1 \cup a_2) = P(a_1) + P(a_2) - P(a_1 \cap a_2)$

Assume independence!
$P(a_1 \cap a_2) = P(a_1) \cdot P(a_2)$
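Reading the total clue as the probability of a union of independent events gives a closed form for any number of clues. The derivation below is a short worked step under exactly that independence assumption:

```latex
\begin{aligned}
P(a_1 \cup a_2) &= P(a_1) + P(a_2) - P(a_1)\,P(a_2)
                 = 1 - \bigl(1 - P(a_1)\bigr)\bigl(1 - P(a_2)\bigr) \\
C_{all}(s,t) = P(a_1 \cup \dots \cup a_n) &= 1 - \prod_{i=1}^{n} \bigl(1 - C_i(s,t)\bigr)
\end{aligned}
```

For example, two clues of 0.4 and 0.3 for the same pair combine to 1 - 0.6 * 0.7 = 0.58.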

Page 14: Catch the Link!  Combining Clues for Word Alignment

How do we use clues? (2)

Clues can refer to any set of tokens from source and target language segments.
• overlaps
• inclusions

Def.: A clue shares its indication with all member tokens!
This allows clue combinations at the level of single tokens.

Page 15: Catch the Link!  Combining Clues for Word Alignment

Clue overlaps - an example

The United Nations conference has started today.
Idag började FN-konferensen.

Clue 1 (co-occurrence)
United Nations        FN-konferensen   0.4
Nations conference    FN-konferensen   0.5
United                FN-konferensen   0.3

Clue 2 (string similarity)
conference            FN-konferensen   0.57
Nations               FN-konferensen   0.29

Clue_all
United                FN-konferensen   0.58
Nations               FN-konferensen   0.787
conference            FN-konferensen   0.785
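The combined scores in this example can be reproduced with a small sketch. It assumes that each clue is shared with every member token pair and that clues for the same pair are combined as a union of independent events. Tokens are identified by their word form here for brevity; a real aligner would use token positions. All names are illustrative.

```python
from collections import defaultdict


def combine(probabilities):
    """Union of independent events: 1 - prod(1 - p)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining


def token_level_clues(clues):
    """Share each clue with all member token pairs, then combine per pair.

    clues: iterable of (source_tokens, target_tokens, score).
    """
    per_pair = defaultdict(list)
    for src_tokens, trg_tokens, score in clues:
        for s in src_tokens:
            for t in trg_tokens:
                per_pair[(s, t)].append(score)
    return {pair: combine(scores) for pair, scores in per_pair.items()}


clues = [
    (["United", "Nations"], ["FN-konferensen"], 0.4),      # clue 1
    (["Nations", "conference"], ["FN-konferensen"], 0.5),  # clue 1
    (["United"], ["FN-konferensen"], 0.3),                 # clue 1
    (["conference"], ["FN-konferensen"], 0.57),            # clue 2
    (["Nations"], ["FN-konferensen"], 0.29),               # clue 2
]
total = token_level_clues(clues)
```

Up to floating-point rounding, `total` holds 0.58 for United, 0.787 for Nations and 0.785 for conference, each paired with FN-konferensen.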

Page 16: Catch the Link!  Combining Clues for Word Alignment

The Clue Matrix

Clue 1 (co-occurrence)
The United Nations    FN-konferensen   0.5
United Nations        FN-konferensen   0.4
has                   började          0.2
started               började          0.6
started today         idag             0.3
Nations conference    började          0.4

Clue 2 (string similarity)
conference            FN-konferensen   0.57
Nations               FN-konferensen   0.29
today                 idag             0.4

              Idag    började   FN-konferensen
The           0       0         0.5
United        0       0         0.7
Nations       0       0.4       0.787
conference    0       0.4       0.57
has           0       0.2       0
started       0.3     0.72      0
today         0.58    0         0

Page 17: Catch the Link!  Combining Clues for Word Alignment

Clue Alignment (1)

general principles:
• combine all clues and fill the matrix
• highest score = best link
• allow overlapping links only
  - if there is no better link for both tokens
  - if tokens are next to each other
• links which overlap at one point form a link cluster

Page 18: Catch the Link!  Combining Clues for Word Alignment

Clue Alignment (2)

the alignment procedure:
1. find the best link
2. remove the best link (set its value to 0)
3. check for overlaps
   • accept: add to set of link clusters
   • dismiss otherwise
4. continue with 1 until no more links are found (or all values are below a certain threshold)
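The procedure above can be sketched as a best-first loop over the clue matrix. This is one possible reading of the overlap rules, not necessarily the exact implementation behind the slides. It assumes a link that overlaps an existing cluster is accepted only if its other token is still unlinked and sits next to a token already in that cluster. Visiting links in descending score order stands in for repeatedly finding and zeroing the best cell. Token indices and the function name are illustrative.

```python
def clue_align(matrix, threshold=0.0):
    """Greedy best-first alignment.

    matrix: dict mapping (source_index, target_index) -> combined clue score.
    Returns a list of link clusters as (source_indices, target_indices) sets.
    """
    links = sorted(
        ((score, s, t) for (s, t), score in matrix.items() if score > threshold),
        reverse=True,
    )
    clusters = []
    src_cluster, trg_cluster = {}, {}  # token index -> cluster it belongs to
    for _score, s, t in links:
        cs, ct = src_cluster.get(s), trg_cluster.get(t)
        if cs is None and ct is None:
            # No overlap: start a new link cluster.
            cluster = ({s}, {t})
            clusters.append(cluster)
            src_cluster[s] = trg_cluster[t] = cluster
        elif cs is None and any(abs(s - x) == 1 for x in ct[0]):
            # Target already linked; free source token is adjacent to the cluster.
            ct[0].add(s)
            src_cluster[s] = ct
        elif ct is None and any(abs(t - y) == 1 for y in cs[1]):
            # Source already linked; free target token is adjacent to the cluster.
            cs[1].add(t)
            trg_cluster[t] = cs
        # Otherwise: both tokens already have better links, or the free
        # token is not adjacent, so the link is dismissed.
    return clusters


# Source: The(0) United(1) Nations(2) conference(3) has(4) started(5) today(6)
# Target: Idag(0) började(1) FN-konferensen(2)
matrix = {
    (0, 2): 0.5, (1, 2): 0.7, (2, 2): 0.787, (2, 1): 0.4, (3, 2): 0.57,
    (3, 1): 0.4, (4, 1): 0.2, (5, 1): 0.72, (5, 0): 0.3, (6, 0): 0.58,
}
clusters = clue_align(matrix)
```

On the clue matrix of the running example this yields three clusters: The United Nations conference -> FN-konferensen, has started -> började, and today -> Idag.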

Page 19: Catch the Link!  Combining Clues for Word Alignment

Clue Alignment (3)

              Idag    började   FN-konferensen
The           0       0         0.5
United        0       0         0.7
Nations       0       0.4       0.787
conference    0       0.4       0.57
has           0       0.2       0
started       0.3     0.72      0
today         0.58    0         0

1. Best link: Nations -> FN-konferensen (0.787)
   Link clusters:
   Nations -> FN-konferensen

2. Best link: started -> började (0.72)
   Link clusters:
   Nations -> FN-konferensen
   started -> började

3. Best link: United -> FN-konferensen (0.7)
   Link clusters:
   United Nations -> FN-konferensen
   started -> började

4. Best link: today -> idag (0.58)
   Link clusters:
   United Nations -> FN-konferensen
   started -> började
   today -> idag

5. Best link: conference -> FN-konferensen (0.57)
   Link clusters:
   United Nations conference -> FN-konferensen
   started -> började
   today -> idag

6. Best link: The -> FN-konferensen (0.5)
   Link clusters:
   The United Nations conference -> FN-konferensen
   started -> började
   today -> idag

7. Best link: has -> började (0.2)
   Link clusters:
   The United Nations conference -> FN-konferensen
   has started -> började
   today -> idag

Page 20: Catch the Link!  Combining Clues for Word Alignment

Bootstrapping
• again: clues can be estimated from training data
• self-training: use available links as training data
• goal: learn new clues for the next step
• risk: increased noise (lower precision)

Page 21: Catch the Link!  Combining Clues for Word Alignment

Learning Clues

POS clue:
• assumption: word pairs with certain POS-tags are more likely to be translations of each other than other word pairs
• features: POS-tag sequences

position clue:
• assumption: translations are relatively close to each other (esp. in related languages)
• features: relative word positions
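A position clue of this kind could be learned as a relative-frequency table over position offsets. The sketch below assumes that the "relative word position" is the target index minus the source index within a sentence pair, and that the training links come from an earlier alignment pass. Both are assumptions for illustration, and the function name is hypothetical.

```python
from collections import Counter


def learn_position_clue(links, weight=1.0):
    """Relative frequency of position offsets among existing links.

    links: iterable of (source_position, target_position) pairs.
    Returns {offset: weight * freq(offset) / number_of_links}.
    """
    offsets = Counter(trg - src for src, trg in links)
    total = sum(offsets.values())
    if total == 0:
        return {}
    return {offset: weight * count / total for offset, count in offsets.items()}


# Hypothetical links from a previous alignment step.
position_clue = learn_position_clue([(0, 0), (1, 1), (2, 1), (3, 4)])
```

Offsets near zero should dominate for closely related languages, which matches the assumption stated for the position clue.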

Page 22: Catch the Link!  Combining Clues for Word Alignment

So much for the theory! Results?!

The setup: corpus and basic tools
• Saul Bellow’s “To Jerusalem and back: a personal account”, English/Swedish, about 170,000 words
• English POS-tagger (Grok), trained on Brown, PTB
• English shallow parser (Grok), trained on PTB
• English stemmer, suffix truncation
• Swedish POS-tagger (TnT), trained on SUC
• Swedish CFG parser (Megyesi), rule-based
• Swedish lemmatiser, database taken from SUC

Page 23: Catch the Link!  Combining Clues for Word Alignment

Results!?! … not yet

basic clues:
• Dice coefficient (0.3)
• LCSR (0.4), 3 characters/string

learned clues:
• POS clue
• position clue

clue alignment threshold = 0.4
uniform normalisation (0.5)

Page 24: Catch the Link!  Combining Clues for Word Alignment

Results!!! Come on!

Preliminary results (… work in progress …)
• Evaluation: 500 random samples have been linked manually (gold standard)
• Metrics: precision_PWA & recall_PWA (Ahrenberg et al., 2000)

alignment & clues          precision   recall     F
-----------------------------------------------------------
Dice+LCSR (best-first)     79.377%     32.454%    46.071%
Dice+LCSR                  71.225%     41.065%    52.095%
Dice+LCSR+POS              70.667%     48.566%    57.568%
Dice+LCSR+POS+position     72.820%     51.561%    60.374%
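The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A minimal sketch for checking the table rows (the function name is illustrative):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


rows = {
    "Dice+LCSR (best-first)": (79.377, 32.454),
    "Dice+LCSR": (71.225, 41.065),
    "Dice+LCSR+POS": (70.667, 48.566),
    "Dice+LCSR+POS+position": (72.820, 51.561),
}
f_scores = {name: f_measure(p, r) for name, (p, r) in rows.items()}
```

Each value in `f_scores` agrees with the reported F to within rounding in the third decimal place.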

Page 25: Catch the Link!  Combining Clues for Word Alignment

Give me more numbers!

The impact of parsing. How much do we gain?

Alignment results with n-grams, (shallow) parsing, and both:

chunks+ngrams    precision   recall     F
---------------------------------------------------
ngrams           74.712%     51.501%    60.972%
chunks           78.410%     52.909%    63.183%
ngrams+chunks    72.820%     51.561%    60.374%

Page 26: Catch the Link!  Combining Clues for Word Alignment

One more thing.

Stemming, lemmatisation and all that … Do we need morphological analyses for Swedish and English?

word/lemma/stem                   precision   recall     F
--------------------------------------------------------------------
words                             79.490%     48.827%    60.495%
Swedish & English stems           77.401%     45.338%    57.181%
Swedish lemmas + English stems    78.410%     52.909%    63.183%

Page 27: Catch the Link!  Combining Clues for Word Alignment

Conclusions
• Combining clues helps to find links
• Linguistic knowledge helps
  - POS tags are valuable clues
  - word position gives hints for related languages
  - parsing helps with the segmentation problem
  - lemmatisation gives higher recall
• We need more experiments, tests with other language pairs, more/other clues
• recall and precision are still low

Page 28: Catch the Link!  Combining Clues for Word Alignment
Page 29: Catch the Link!  Combining Clues for Word Alignment

POS clues - examples

score                source          target
----------------------------------------------------------
0.915479582146249    VBZ             [email protected]
                     WRB             RH0S
0.761904761904762    VBP             [email protected]
                     RB              RG0S
0.674033149171271    VBD             [email protected]
                     DT NNP NN       [email protected]
                     PRP VBZ         PF@USS@S [email protected]
                     NNS NNP         [email protected]
                     VB              [email protected]
                     RBR             RGCS
0.5                  DT JJ JJ NN     DF@US@S AQP0SNDS NCUSN@DS

Page 30: Catch the Link!  Combining Clues for Word Alignment

Position clues - examples

score                 mapping
------------------------------------
0.245022348638765     x -> 0
0.12541095637398      x -> -1
0.0896900742491966    x -> 1
0.0767611096745595    x -> -2
0.0560378264563555    x -> -3
0.0514572790070555    x -> 2
0.0395256916996047    x -> 6 7 8

Page 31: Catch the Link!  Combining Clues for Word Alignment

Open Questions
• Normalisation: How do we estimate the wi’s?
• Non-contiguous phrases: Why not allow long distance clusters?
• Independence assumption: What is the impact of dependencies?
• Alignment clues: What is a bad clue, what is a good one?
• Contextual clues

Page 32: Catch the Link!  Combining Clues for Word Alignment

Clue alignment - example

          min  fru  undrar  road  varför  jag  beställde  en  koscherlunch  .
amused      0    0       0     0       0    0          0   0             0   0
,           0    0       0     0       0    0          0   0             0  48
my         81   63       0     0       0    0          0   0             0   0
wife       58   80       0     0       0    0          0   0             0   0
asks        0    0      42     0       0    0          0   0             0   0
why         0    0       0     0      74    0          0   0             0   0
i           0    0       0     0       0    0          0   0             0   0
ordered     0    0       0     0       0    0         36   0             0   0
the         0    0       0     0       0    0          0  70            70   0
kosher      0   34       0     0       0    0          0  53            86   0
lunch       0   34       0     0       0    0          0  41            81   0
.           0    0       0     0       0    0          0   0             0  76

Page 33: Catch the Link!  Combining Clues for Word Alignment

Alignment - examples

the Middle East            Mellersta Östern
afford                     kosta på
at least                   åtminstone
an American satellite      en satellit
common sense               sunda förnuftet
Jerusalem area             Jerusalemområdet
kosher lunch               koscherlunch
leftist anti-Semitism      vänsterantisemitism
left-wing intellectuals    vänsterintellektuella
literary history           litteraturhistoriska
manuscript collection      handskriftsamling
Marine orchestra           marinkårsorkester
marionette theater         marionetteatern
mathematical colleagues    matematikkolleger
mental character           mentalitet
far too                    alldeles

Page 34: Catch the Link!  Combining Clues for Word Alignment

Alignment - examples

a banquet                      en bankett
a battlefield                  ett slagfält
a day                          dagen
the Arab states                arabstaterna
the Arab world                 arabvärlden
the baggage carousel           bagagekarusellen
the Communist dictatorships    kommunistdiktaturerna
The Fatah terrorists           Al Fatah-terroristerna
the defense minister           försvarsministern
the defense minister           försvarsminister
the daughter                   dotter
the first President            förste president

Page 35: Catch the Link!  Combining Clues for Word Alignment

Alignment - examples

American imperial interests    amerikanska imperialistintressena
Chicago schools                Chicagos skolor
decidedly anti-Semitic         avgjort antisemitiska
his identity                   sin identitet
his interest                   sitt intresse
his interviewer                hans intervjuare
militant Islam                 militanta muhammedanismen
no longer                      inte längre
sophisticated arms             avancerade vapen
still clearly                  uppenbarligen ännu
dozen Russian                  dussin ryska
exceedingly intelligent        utomordentligt intelligent
few drinks                     några drinkar
goyish democracy               gojernas demokrati
industrialized countries       industrialiserade länderna
has become                     har blivit

Page 36: Catch the Link!  Combining Clues for Word Alignment

Gold standard - MWUs

link: Secretary of State -> Utrikesminister
link type: regular
unit type: multi -> single

source text: Secretary of State Henry Kissinger has won the Middle Eastern struggle by drawing Egypt into the American camp.

target text: Utrikesminister Henry Kissinger har vunnit slaget om Mellanöstern genom att dra in Egypten i det amerikanska lägret.

Page 37: Catch the Link!  Combining Clues for Word Alignment

Gold standard - fuzzy links

link: unrelated -> inte tillhör hans släkt
link type: fuzzy
unit type: single -> multi

source text: And though he is not permitted to sit beside women unrelated to him or to look at them or to communicate with them in any manner (all of which probably saves him a great deal of trouble), he seems a good-hearted young man and he is visibly enjoying himself.

target text: Och fastän han inte får sitta bredvid kvinnor som inte tillhör hans släkt eller se på dem eller meddela sig med dem på något sätt (alltsammans saker som utan tvivel besparar honom en mängd bekymmer) verkar han vara en godhjärtad ung man, och han ser ut att trivas gott.

Page 38: Catch the Link!  Combining Clues for Word Alignment

Gold standard - null links

link: do ->
link type: null
unit type: single -> null

source text: "How is it that you do not know English?"

target text: "Hur kommer det sig att ni inte talar engelska?"

Page 39: Catch the Link!  Combining Clues for Word Alignment

Gold standard - morphology

link: the masses -> massorna
link type: regular
unit type: multi -> single

source text: Arafat was unable to complete the classic guerrilla pattern and bring the masses into the struggle.

target text: Arafat har inte kunnat fullborda det klassiska gerillamönstret och föra in massorna i kampen.

Page 40: Catch the Link!  Combining Clues for Word Alignment

Evaluation metrics

$$Q = \frac{C_{src} + C_{trg}}{\max(S_{src}, G_{src}) + \max(S_{trg}, G_{trg})}$$

$$\mathrm{recall}_{PWA} = \frac{\sum Q}{n(I) + n(P) + n(C) + n(M)}$$

$$\mathrm{precision}_{PWA} = \frac{\sum Q}{n(I) + n(P) + n(C)}$$

• $C_{src}$: number of overlapping source tokens in (partially) correct link proposals; $C_{src} = 0$ for incorrect link proposals
• $C_{trg}$: number of overlapping target tokens in (partially) correct link proposals; $C_{trg} = 0$ for incorrect link proposals
• $S_{src}$: number of source tokens proposed by the system
• $S_{trg}$: number of target tokens proposed by the system
• $G_{src}$: number of source tokens in the gold standard
• $G_{trg}$: number of target tokens in the gold standard

Page 41: Catch the Link!  Combining Clues for Word Alignment

Evaluation metrics - example

             source                  target              precision_PWA     recall_PWA
---------------------------------------------------------------------------------------
reference    Reläventil TC           TC relay valve
proposed     Reläventil              Relay valve         (3/5 = 0.6) +     (3/5 = 0.6) +
             TC                      TC                  (2/5 = 0.4) = 1   (2/5 = 0.4) = 1

reference    ordinarie               ordinary
proposed     ordinarie skruv         ordinary            2/3 ≈ 0.66        2/3 ≈ 0.66

reference    kommer att indikeras    will be indicated
proposed     det kommer              will                (2/7 ≈ 0.286) +   (2/7 ≈ 0.286) +
             att                     the                 (0/7 = 0) +       (0/7 = 0) +
             indikeras               indicated           (2/7 ≈ 0.286)     (2/7 ≈ 0.286)

reference    vill                    wants
proposed     -                       -                   0                 0

reference    vatten                  -
proposed     -                       -                   1                 1

reference    to                      till
proposed     to                      att                 0                 0

reference    Scanias chassier        Scania chassis
proposed     Scanias                 Scania chassis      3/4 = 0.75        3/4 = 0.75

                                                         Σ/6 ≈ 0.663       Σ/7 ≈ 0.569
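The example can be recomputed with a short sketch of the PWA scores. It assumes that Q is computed once per gold-standard link, with token counts summed over all proposals for that link, and that the sum of Q is divided by the number of links with a proposal (precision) or by that number plus the missing links (recall). The null link is scored 1 as on the slide. Function names are illustrative.

```python
def q_score(c_src, c_trg, s_src, s_trg, g_src, g_trg):
    """Partial-match score Q for one gold-standard link."""
    return (c_src + c_trg) / (max(s_src, g_src) + max(s_trg, g_trg))


def pwa_scores(q_values, n_missing):
    """Return (precision_PWA, recall_PWA) from per-link Q values."""
    total = sum(q_values)
    n_proposed = len(q_values)  # n(I) + n(P) + n(C)
    return total / n_proposed, total / (n_proposed + n_missing)


q_values = [
    q_score(2, 3, 2, 3, 2, 3),  # Reläventil TC: 5/5 = 1
    q_score(1, 1, 2, 1, 1, 1),  # ordinarie: 2/3
    q_score(2, 2, 4, 3, 3, 3),  # kommer att indikeras: 4/7
    1.0,                        # vatten: correct null link, scored 1 on the slide
    q_score(0, 0, 1, 1, 1, 1),  # to: incorrect proposal, Q = 0
    q_score(1, 2, 1, 2, 2, 2),  # Scanias chassier: 3/4
]
precision, recall = pwa_scores(q_values, n_missing=1)  # "vill" has no proposal
```

This gives about 0.665 and 0.570. The slide's 0.663 and 0.569 differ only because it sums the rounded intermediate values.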

Page 42: Catch the Link!  Combining Clues for Word Alignment

Corpus markup (Swedish)

<s lang="sv" id="9">
  <c id="c-1" type="NP">
    <w span="0:3" pos="PF@NS0@S" id="w9-1" stem="det">Det</w>
  </c>
  <c id="c-2" type="VC">
    <w span="4:2" pos="V@IPAS" id="w9-2" stem="vara">är</w>
  </c>
  <c id="c-3">
    <w span="7:3" pos="CCS" id="w9-3" stem="som">som</w>
  </c>
  <c id="c-4" type="NPMAX">
    <c id="c-5" type="NP">
      <w span="11:3" pos="DI@NS@S" id="w9-4" stem="en">ett</w>
      <w span="15:5" pos="NCNSN@IS" id="w9-5">besök</w>
    </c>
    <c id="c-6" type="PP">
      <c id="c-7">
        <w span="21:1" pos="SPS" id="w9-6" stem="1">i</w>
      </c>
      <c id="c-8" type="NP">
        <w span="23:9" pos="NCUSN@DS" id="w9-7" stem="barndom">barndomen</w>
      </c>
    </c>
  </c>
</s>

Page 43: Catch the Link!  Combining Clues for Word Alignment

Corpus markup (English)

<s lang="en" id="9">
  <chunk type="NP" id="c-1">
    <w span="0:2" pos="PRP" id="w9-1">It</w>
  </chunk>
  <chunk type="VP" id="c-2">
    <w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w>
  </chunk>
  <chunk type="NP" id="c-3">
    <w span="6:2" pos="PRP$" id="w9-3">my</w>
    <w span="9:9" pos="NN" id="w9-4">childhood</w>
  </chunk>
  <chunk type="VP" id="c-4">
    <w span="19:9" pos="VBD" id="w9-5">revisited</w>
  </chunk>
  <chunk id="c-5">
    <w span="28:1" pos="." id="w9-6">.</w>
  </chunk>
</s>
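Markup of this shape can be read with Python's standard `xml.etree.ElementTree` module to recover the token features that clues are built from. The sketch below embeds the English sentence with plain ASCII quotes and assumes flat, non-nested `chunk` elements, as in this example; the nested `c` elements of the Swedish markup would need a recursive walk. The function name and the dictionary keys are illustrative.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s lang="en" id="9">
  <chunk type="NP" id="c-1"><w span="0:2" pos="PRP" id="w9-1">It</w></chunk>
  <chunk type="VP" id="c-2"><w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w></chunk>
  <chunk type="NP" id="c-3"><w span="6:2" pos="PRP$" id="w9-3">my</w>
    <w span="9:9" pos="NN" id="w9-4">childhood</w></chunk>
  <chunk type="VP" id="c-4"><w span="19:9" pos="VBD" id="w9-5">revisited</w></chunk>
  <chunk id="c-5"><w span="28:1" pos="." id="w9-6">.</w></chunk>
</s>"""


def read_tokens(xml_text):
    """Return one dict per <w> element with its word form and features."""
    sentence = ET.fromstring(xml_text)
    tokens = []
    for chunk in sentence.iter("chunk"):
        for word in chunk.findall("w"):
            tokens.append({
                "id": word.get("id"),
                "word": word.text,
                "pos": word.get("pos"),
                "stem": word.get("stem"),    # None when no stem attribute is given
                "chunk": chunk.get("type"),  # None for untyped chunks
            })
    return tokens


tokens = read_tokens(SENTENCE)
pos_sequence = [token["pos"] for token in tokens]
```

`pos_sequence` then holds the tag sequence PRP VBZ PRP$ NN VBD ., the kind of feature a POS clue is estimated from.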

Page 44: Catch the Link!  Combining Clues for Word Alignment

… is that all?

How good are the new clues?

Alignment results with learned clues only (neither LCSR nor Dice):

clues only    precision   recall     F
------------------------------------------------
POS           55.178%     20.383%    29.769%
position      37.169%     21.550%    27.282%