CSC401/2511– Spring2017 4
Ancient Egyptian (c. 3000 BCE)
• Few writers
• Stone tablets
• Many (>1500) symbols representing ideas (e.g., apple)
• A few (~140) symbols representing sounds (e.g., gah)

Demotic (c. 650 BCE)
• Many writers
• Papyrus sheets
• More purposes (e.g., recipes, contracts)
• Fewer symbols
• Higher proportion of symbols representing sounds

The Rosetta stone
• The Rosetta stone dates from 196 BCE.
• It was re-discovered by French soldiers during Napoleon’s invasion of Egypt in 1799 CE.
• It contains three parallel texts in different languages, only the last of which was understood:
  • Ancient Egyptian hieroglyphs
  • Egyptian Demotic
  • Ancient Greek
• By 1799, ancient Egyptian had been forgotten.

Writing systems
• Logographic: adj. Describes writing systems whose symbols denote semantic ideas.
• Phonographic: adj. Describes writing systems whose symbols denote sounds. E.g., in English the symbols ‘sh’ mean /ʃ/.
• Some writing systems are a mix of these qualities:
  • 妈 mā ‘mother’, formed from:
    • 女 nǚ (means like) ‘woman’
    • 马 mǎ (sounds like) ‘horse’

Writing systems
• Logographic: Symbols refer to ideas.
• Phonographic: Symbols refer to sounds.
• English carries logographic heritage.
• Is ancient Egyptian logographic or phonographic?

[Figure: letter forms traced from Proto-Sinaitic through Phoenician to Cyrillic. Proto-Sinaitic names: “alph” (ox), “bet” (house), “kaf” (palm), “mem” (water), “en” (eye), “ro” (head). Cyrillic letters: A, b, K, M, O, P.]

Deciphering Rosetta
• During 1822–1824, Jean-François Champollion worked on the Rosetta stone. He noticed:
  1. The circled Egyptian symbols appeared in roughly the same positions as the word ‘Ptolemy’ in the Greek.
  2. The number of Egyptian hieroglyph tokens was much larger than the number of Greek words → Egyptian seemed to have been partially phonographic.
  3. Cleopatra’s cartouche was written [cartouche image not preserved].
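
Aside – deciphering Rosetta
• So if [cartouche 1] was ‘Ptolemy’ and [cartouche 2] was ‘Cleopatra’, and the symbols corresponded to sounds, can we match up the symbols?

[Figure: the hieroglyphs of the two cartouches labelled with the letters of ‘Ptolemy’ and ‘Cleopatra’; shared letters such as P, L and O are matched across the two names.]

• This approach demonstrated the value of working from parallel texts to decipher an unknown language:
  • It would not have been possible without aligning unknown words (hieroglyphs) to known words (Greek)…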

Today
• Introduction to statistical machine translation (SMT).
• What we want is a system to take utterances/sentences in one language and transform them to another:
  Ne mange pas ce chat! → Don’t eat that cat!

Direct translation
• A bilingual dictionary that aligns words across languages can be helpful, but only for simple cases.
  • Spanish: ¿ Dónde está la biblioteca ?
    English: Where is the library ?
    French: Où est la bibliothèque ?
  • Spanish: Mi nombre es T-bone
    English: My name is T-bone
    French: Mon nom est T-bone
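
A minimal sketch of word-for-word dictionary lookup, assuming a tiny hand-made Spanish-to-English dictionary. The entries and the function name `translate_word_by_word` are illustrative and are not taken from the lecture.

```python
# Toy Spanish-to-English dictionary (illustrative entries only).
ES_EN = {
    "¿": None,            # no English counterpart, so it is dropped
    "dónde": "where",
    "está": "is",
    "la": "the",
    "biblioteca": "library",
    "?": "?",
}


def translate_word_by_word(tokens):
    """Replace each source token by its dictionary entry, keeping source order."""
    output = []
    for token in tokens:
        # Words missing from the dictionary (e.g., names) pass through unchanged.
        target = ES_EN.get(token.lower(), token)
        if target is not None:
            output.append(target)
    return output


if __name__ == "__main__":
    print(" ".join(translate_word_by_word("¿ Dónde está la biblioteca ?".split())))
```

This only works here because Spanish and English happen to share the word order of this sentence; the challenges on the following slides show where such a lookup breaks down.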

Challenge 1: lexical ambiguity
• A word token in one language may have many possible translations in another:
• E.g.,
  • book the flight → reservar
  • read the book → libro
  • the chair in the chair → président, chaise
  • kill the queen → tuer la reine
  • kill the Queen → éteindre la musique de Queen

Challenge 2: differing word orders
• English: subject – (trans.) verb – object
  Japanese: subject – object – (trans.) verb
  • e.g., English: IBM bought Lotus
    Japanese: ~IBM Lotus bought
• English: determiner – adjective – noun
  French: determiner – noun – adjective
  • e.g., English: the fast zombie
    French: le zombie rapide

Challenge 3: unpreserved syntax
• Differences in syntax between languages are felt over longer distances than simple word alternations.
• E.g.,
  The bottle floated into the cave
  La botella entró a la cueva flotando
  (the bottle entered to the cave floating)
• This implies that we’d need high-level grammars of the source and target languages.

Challenge 4: syntactic ambiguity
• Syntactic ambiguity in the source makes it difficult to produce a single sentence in the target language.
• E.g., Rick hit the zombie with the stick
  • Rick golpeó el zombie con el palo (the stick was used)
  • Rick golpeó el zombie que tenía el palo (the zombie had the stick)

Challenge 5: idiosyncrasies
• Languages have their own idioms, and “feel”.
• E.g.,
  • We have to burn the midnight oil → Il faut travailler tard (literal rendering: Il faut brûler l’huile de minuit)
  • Estie de sacramouille → By golly! (literal rendering: Host of the sacrament)

Classical MT: Dictionaries
• Early MT involved merely looking up each word in a bilingual dictionary of rules.
• E.g., translate ‘much’ or ‘many’ into Russian:

  if preceding word is how: return skol’ko
  else if preceding word is as: return stol’ko zhe
  else if word is much:
      if preceding word is very: return nil
      else if following word is a noun: return mnogo
  else (word is many):
      if preceding word is a preposition and next word is a noun: return mnogii
      else: return mnogo

  From Jurafsky & Martin
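
The rule above can be sketched as a small function. This is only an illustration: the part-of-speech tag names (`"NOUN"`, `"PREP"`), the function name, and the fall-through behaviour for ‘much’ when neither sub-rule fires are assumptions, since the slide does not specify them.

```python
def translate_much_many(word, prev_word=None, prev_pos=None, next_pos=None):
    """Return a Russian rendering of 'much'/'many', or None for no output.

    prev_pos and next_pos are coarse part-of-speech tags; the tag names
    "NOUN" and "PREP" are an assumption of this sketch.
    """
    if prev_word == "how":
        return "skol'ko"
    if prev_word == "as":
        return "stol'ko zhe"
    if word == "much":
        if prev_word == "very":
            return None          # 'nil' in the rule: emit nothing
        if next_pos == "NOUN":
            return "mnogo"
        return None              # assumption: case not covered by the rule
    # Otherwise the word is 'many'.
    if prev_pos == "PREP" and next_pos == "NOUN":
        return "mnogii"
    return "mnogo"


if __name__ == "__main__":
    print(translate_much_many("much", prev_word="how"))
    print(translate_much_many("many", prev_word="of", prev_pos="PREP", next_pos="NOUN"))
```

Even this single pair of words needs context words and part-of-speech information, which hints at how quickly hand-written rule sets grow.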

Classical MT: Dictionaries
• This approach causes some problems, e.g.,
• It’s difficult/impossible to capture long-range re-orderings:
  • English: Sources said that IBM bought Lotus yesterday
    Japanese: ~Sources yesterday IBM Lotus bought that said
• It’s difficult to disambiguate parts-of-speech:
  • English: They said that I punched that zombie
  • French: Ils ont dit que j'ai frappé ce zombie
• Having experts write lots of rules can become unruly.
  • …and expensive... and full of mistakes…

Classical MT: Transfer-based approach
• Transfer-based MT involves three phases:
  • Analysis: e.g., build syntactic parse trees of the source sentence.
  • Transfer: e.g., convert the source-language parse tree to a target-language parse tree.
  • Generation: e.g., produce an output sentence from the target-language parse tree.
• These systems can involve fairly deep analysis, often including semantic analysis.

Example of syntactic transfer
[Figure from Regina Barzilay at MIT]
• See csc485/2501 for more on computational approaches to parse trees.

Example of syntactic transfer
[Figure from Regina Barzilay at MIT]
• Transformations are defined at the syntactic level.

Classical MT: Transfer-based approach
• Transferring between parse trees allows us to encode more general rules with long-term dependencies.
• However, if we want to translate between L languages, we’d need O(L²) sets of transformation rules.
  • This would involve lots of experts in each language ($$).
  • This can be somewhat mitigated by abstracting beyond syntax into an interlingua: a conceptual space common to all languages.
  • We might need a workable theory of neurolinguistics to do this properly, but ‘hacks’ are getting some good results.

Statistical machine translation
• Machine translation seemed to be an intractable problem until a change in perspective…

  “When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
  Warren Weaver, March 1947

[Figure: Claude Shannon’s noisy channel (July 1948): a transmitter with distribution $P(X)$ sends $X$ through a noisy channel, and the receiver, modelled by $P(Y \mid X)$, observes $Y$.]

How not to use the noisy channel
• The model $P(E, F)$ tells us how likely an English sentence $E$ and a French sentence $F$ are to correspond to each other.
• Imagine that you’re given a French sentence, $F$, and you want to convert it to the best corresponding English sentence, $E^*$
  • i.e., $E^* = \operatorname{argmax}_E P(E, F)$
• Others may be tempted to model this as $E^* = \operatorname{argmax}_E P(E \mid F)\,P(F)$
  • The factor $P(F)$ is useless if you’re always given $F$.

How not to use the noisy channel
• Others may be tempted to model this as $E^* = \operatorname{argmax}_E P(E \mid F)\,P(F)$
  • The factor $P(F)$ is useless if you’re always given $F$.
• If $P(E \mid F)$ is a model that translates word-to-word, then we cannot account for differing word orders across languages.
  • E.g., Source French: le zombie rapide
    Target English: the zombie fast
• If $P(E \mid F)$ includes syntax, it becomes very difficult to learn without experts or specially-annotated data.

The noisy channel
[Figure: a source $P(E)$ (the language model) emits $E'$; the channel $P(F \mid E)$ (the translation model) turns it into $F'$; given the observed $F$, the decoder outputs $E^*$.]

$E^* = \operatorname{argmax}_E P(F \mid E)\,P(E)$
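
The decoder’s objective can be connected to the direct model $P(E \mid F)$ through Bayes’ rule. The short derivation below uses the same symbols as the slides; the only step it adds is that $P(F)$ is constant with respect to $E$ once $F$ has been observed, so it can be dropped from the maximisation.

```latex
\begin{aligned}
E^* &= \operatorname*{argmax}_E \; P(E \mid F) \\
    &= \operatorname*{argmax}_E \; \frac{P(F \mid E)\,P(E)}{P(F)} \\
    &= \operatorname*{argmax}_E \; P(F \mid E)\,P(E)
\end{aligned}
```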

How to use the noisy channel
• How does this work?
  $E^* = \operatorname{argmax}_E P(F \mid E)\,P(E)$
• $P(E)$ is a language model (e.g., N-gram) and encodes knowledge of word order.
• $P(F \mid E)$ is a word-level translation model that encodes only knowledge on an unordered word-by-word basis.
• Combining these models can give us naturalness and fidelity, respectively.

How to use the noisy channel
• Example from Koehn and Knight using only conditional likelihoods of Spanish words given English words.
• Que hambre tengo yo →
  • What hunger have I: $P(S \mid E)$ = 1.4E-5
  • Hungry I am so: $P(S \mid E)$ = 1.0E-6
  • I am so hungry: $P(S \mid E)$ = 1.0E-6
  • Have I that hunger: $P(S \mid E)$ = 2.0E-5
  • …

How to use the noisy channel
• …and with the English language model
• Que hambre tengo yo →
  • What hunger have I: $P(S \mid E)\,P(E)$ = 1.4E-5 × 1.0E-6
  • Hungry I am so: $P(S \mid E)\,P(E)$ = 1.0E-6 × 1.4E-6
  • I am so hungry: $P(S \mid E)\,P(E)$ = 1.0E-6 × 1.0E-4
  • Have I that hunger: $P(S \mid E)\,P(E)$ = 2.0E-5 × 9.8E-7
  • …
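
The effect of adding the language model can be checked by multiplying the figures listed above. The sketch below only re-uses those numbers; the dictionary layout and function names are illustrative.

```python
# (translation model P(S|E), language model P(E)) for each English candidate,
# copied from the Koehn and Knight example above.
CANDIDATES = {
    "What hunger have I": (1.4e-5, 1.0e-6),
    "Hungry I am so":     (1.0e-6, 1.4e-6),
    "I am so hungry":     (1.0e-6, 1.0e-4),
    "Have I that hunger": (2.0e-5, 9.8e-7),
}


def best_by_translation_model(candidates):
    """Pick the candidate with the highest P(S|E), ignoring word order."""
    return max(candidates, key=lambda e: candidates[e][0])


def best_by_noisy_channel(candidates):
    """Pick the candidate with the highest P(S|E) * P(E)."""
    return max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])


if __name__ == "__main__":
    print("translation model only:", best_by_translation_model(CANDIDATES))
    print("with language model:   ", best_by_noisy_channel(CANDIDATES))
```

With the translation model alone, ‘Have I that hunger’ scores highest; once the language model is included, ‘I am so hungry’ wins, which is the naturalness-plus-fidelity point made earlier.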

How to learn $P(F \mid E)$?
• Solution: collect statistics on vast parallel texts
  • e.g., the Canadian Hansards: bilingual Parliamentary proceedings
  • English: …citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership…
  • French: …citoyen canadien a le droit de vote et est éligible aux élections législatives fédérales ou provinciales…

Bilingual data
[Figure from Chris Manning’s course at Stanford]
• Data from Linguistic Data Consortium at University of Pennsylvania.

Alignment
• In practice, words and phrases can be out of order.
  • French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d’adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.
  • English: According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above average growth rates.

[Figure: alignment links between the French and English phrases. From Manning & Schütze]

Alignment
• Also in practice, we’re usually not given the alignment.

[Figure: the same French and English passages as on the previous slide, shown without alignment links. From Manning & Schütze]

Sentence alignment
• Sentences can also be unaligned across translations.
• E.g., He was happy. (E1) He had bacon. (E2) → Il était heureux parce qu'il avait du bacon. (F1)

[Figure: a one-to-one alignment E1–F1, E2–F2, …, E7–F7, contrasted with an alignment in which E1 and E2 both map to F1, E3 to F2, E4 to F3, E5 to F4 and F5, E6 to F6, and E7 to F7.]

Sentence alignment
• We often need to align sentences before we can align words.
• We’ll look at two broad classes of methods:
  1. Methods that only look at sentence length,
  2. Methods based on lexical matches, or “cognates”.

1. Sentence alignment by length (Gale and Church, 1993)
• Assuming the paragraph alignment is known,
  • $\mathcal{L}_E$ is the # of words in an English sentence,
  • $\mathcal{L}_F$ is the # of words in a French sentence.
• Assume $\mathcal{L}_E$ and $\mathcal{L}_F$ have Gaussian/normal distributions with $\mu = c\,\mathcal{L}_E$ and $\sigma^2 = s^2 \mathcal{L}_E$.
  • Empirical constants $c$ and $s$ set ‘by hand’.
  • The penalty, $\mathrm{Cost}(\mathcal{L}_E, \mathcal{L}_F)$, of aligning sentences with different lengths is based on the divergence of these Gaussians.
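
1. Sentence alignment by length
• We can associate costs with different types of alignments.
• $C_{i,j}$ is the prior cost of aligning $i$ sentences to $j$ sentences.

$$
\begin{aligned}
\mathrm{Cost} ={} & \mathrm{Cost}(\mathcal{L}_{E_1} + \mathcal{L}_{E_2},\ \mathcal{L}_{F_1}) + C_{2,1} \\
{}+{} & \mathrm{Cost}(\mathcal{L}_{E_3},\ \mathcal{L}_{F_2}) + C_{1,1} \\
{}+{} & \mathrm{Cost}(\mathcal{L}_{E_4},\ \mathcal{L}_{F_3}) + C_{1,1} \\
{}+{} & \mathrm{Cost}(\mathcal{L}_{E_5},\ \mathcal{L}_{F_4} + \mathcal{L}_{F_5}) + C_{1,2} \\
{}+{} & \mathrm{Cost}(\mathcal{L}_{E_6},\ \mathcal{L}_{F_6}) + C_{1,1}
\end{aligned}
$$

[Figure: the alignment being scored: E1 and E2 map to F1, E3 to F2, E4 to F3, E5 to F4 and F5, and E6 to F6.]

• Find distribution of sentence breaks with minimum cost using dynamic programming.
• It’s a bit more complicated – see paper on course webpage.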
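
The dynamic program can be sketched as follows. This is a simplified illustration, not the procedure from the paper: the constants `C_MEAN` and `S2`, the prior probabilities in `PRIOR`, the use of a two-sided Gaussian tail as the length penalty, and all function names are assumptions chosen for the example.

```python
import math

# Hypothetical constants (c and s^2 in the slide's notation).
C_MEAN = 1.0
S2 = 6.8

# Hypothetical prior probabilities for each alignment type (i English : j French).
# The prior cost C_{i,j} is taken to be -log of these values.
PRIOR = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
         (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}


def length_cost(len_e, len_f):
    """Penalty for pairing len_e English words with len_f French words."""
    variance = S2 * max(len_e, 1)           # guard against zero variance
    delta = (len_f - C_MEAN * len_e) / math.sqrt(variance)
    tail = math.erfc(abs(delta) / math.sqrt(2))   # two-sided Gaussian tail
    return -math.log(max(tail, 1e-300))


def align(lens_e, lens_f):
    """Return the minimum-cost list of (English indices, French indices) groups."""
    n, m = len(lens_e), len(lens_f)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIOR.items():
                if i < di or j < dj or best[i - di][j - dj] == inf:
                    continue
                cost = (best[i - di][j - dj]
                        + length_cost(sum(lens_e[i - di:i]), sum(lens_f[j - dj:j]))
                        - math.log(prior))
                if cost < best[i][j]:
                    best[i][j] = cost
                    back[i][j] = (di, dj)
    # Trace back from the end to recover the chosen groups.
    groups, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        groups.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return groups[::-1]


if __name__ == "__main__":
    # Word counts: two short English sentences against one long French one,
    # followed by a pair of equal length.
    print(align([3, 3, 5], [8, 5]))
```

With these assumed constants, the two three-word English sentences are grouped against the eight-word French sentence, mirroring the ‘He was happy. He had bacon.’ example.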

2. Sentence alignment by cognates
• Cognates: n. pl. Words that have a common etymological origin.
• Etymological: adj. Pertaining to the historical derivation of a word. E.g., porc → pork
• The intuition is that words that are related across languages have similar spellings.
  • e.g., zombie/zombie, government/gouvernement
  • Not always: son (male offspring) vs. son (sound)
• Cognates can “anchor” sentence alignments between related languages.

2. Sentence alignment by cognates
• Cognates should be spelled similarly…
• N-graph: n. Similar to N-grams, but computed at the character-level, rather than at the word-level.
  • E.g., $\mathrm{Count}(s, h, i)$ is a trigraph model
• Church (1993) tracks all 4-graphs which are identical across two texts.
  • He calls this a ‘signal-based’ approximation to cognate identification.
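
Character-level N-graphs are simple to extract. A minimal sketch (the function names are illustrative):

```python
from collections import Counter


def char_ngraphs(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngraph_counts(text, n):
    """Count how often each character N-graph occurs in text."""
    return Counter(char_ngraphs(text, n))


if __name__ == "__main__":
    print(char_ngraphs("zombie", 4))       # the 4-graphs of one word
    print(ngraph_counts("shish", 3)["shi"])  # Count(s, h, i) in a toy string
```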

2a. Church’s method
1. Concatenate paired texts.
2. Place a ‘dot’ where the $i$-th French and the $j$-th English 4-graph are equal.
3. Search for a short path ‘near’ the bilingual diagonals.

[Figure: dot plot with the concatenated English and French texts on both axes; e.g., the $i$-th French 4-graph is equal to the $j$-th English 4-graph. From Manning & Schütze]
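
Step 2 can be sketched as below. This simplification compares the French text directly against the English text (one quadrant of the full plot over the concatenated texts) and leaves out the path search in step 3; the function names are illustrative.

```python
from collections import defaultdict


def char_ngraphs(text, n=4):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dot_plot(french, english, n=4):
    """Return (i, j) pairs where the i-th French and j-th English n-graph match."""
    positions = defaultdict(list)
    for j, graph in enumerate(char_ngraphs(english, n)):
        positions[graph].append(j)
    dots = []
    for i, graph in enumerate(char_ngraphs(french, n)):
        for j in positions.get(graph, []):
            dots.append((i, j))
    return dots


if __name__ == "__main__":
    # Shared 4-graphs here are "e go", "vern" and "ment".
    print(dot_plot("le gouvernement", "the government"))
```

Even on this tiny pair the dots fall close to the diagonal, which is what the path search exploits on full texts.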

2a. Church’s method
• Each point along this path is considered to represent a match between languages.
• The relevant English and French sentences are ∴ aligned.

[Figure: dot plot with the concatenated English and French texts on both axes and the chosen path marked; e.g., a particular French sentence is aligned to a particular English sentence. From Manning & Schütze]

2b. Melamed’s method
• $\mathrm{LCS}(A, B)$ is the longest common subsequence of characters (with gaps allowed) in words $A$ and $B$.
• Melamed (1993) measures similarity of words $A$ and $B$ with the ‘LCS Ratio’:
  $\mathrm{LCSR}(A, B) = \dfrac{\mathrm{length}(\mathrm{LCS}(A, B))}{\max(\mathrm{length}(A), \mathrm{length}(B))}$
• e.g., $\mathrm{LCSR}(\text{government}, \text{gouvernement}) = \dfrac{10}{12}$
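
The LCS Ratio can be computed with the standard dynamic program for the longest common subsequence. A minimal sketch (function names are illustrative; returning 0.0 for two empty strings is an assumption):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS Ratio: LCS length divided by the length of the longer word."""
    if not a and not b:
        return 0.0  # assumption: undefined case treated as no similarity
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    print(lcs_length("government", "gouvernement"))
    print(lcsr("government", "gouvernement"))
```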

2b. Melamed’s method
• Excludes stop words from both languages (e.g., the, a, le, un).
• Melamed empirically declared that cognates occur when $\mathrm{LCSR} \geq 0.58$ (i.e., there’s a lot of overlap in those words).
  • ∴ 25% of words in Canadian Hansard are cognates.
• As with Church, construct a “bitext” graph.
  • Put a point at position $(i, j)$ whenever $\mathrm{LCSR}(i, j) \geq 0.58$.
  • Find a near-diagonal alignment, as before.
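
The point-placing step can be sketched as below, using the 0.58 threshold from the slide. The stop-word list, the example sentences and the function names are illustrative assumptions, and the near-diagonal search is left out.

```python
THRESHOLD = 0.58
# Illustrative stop-word list; a real system would use fuller lists per language.
STOP_WORDS = {"the", "a", "of", "le", "un", "du"}


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS Ratio of two non-empty words."""
    return lcs_length(a, b) / max(len(a), len(b))


def bitext_points(english_words, french_words):
    """Return (i, j) for every non-stop-word pair whose LCSR meets the threshold."""
    points = []
    for i, e_word in enumerate(english_words):
        if e_word in STOP_WORDS:
            continue
        for j, f_word in enumerate(french_words):
            if f_word in STOP_WORDS:
                continue
            if lcsr(e_word, f_word) >= THRESHOLD:
                points.append((i, j))
    return points


if __name__ == "__main__":
    english = "the government of canada".split()
    french = "le gouvernement du canada".split()
    print(bitext_points(english, french))
```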

From sentences to words
• We’ve computed the sentence alignments.
• What about word alignments?

Word alignment
• Word alignments can be 1:1, N:1, 1:N, 0:1, 1:0, … E.g.,
  • “zero fertility” word: not translated (1:0)
  • “spurious” words: generated from ‘nothing’ (0:1)
  • One word translated as several words (1:N)
• Note that this is only one possible alignment.

Intuition of statistical MT
• The words ‘the’ and ‘maison’ co-occur frequently, but not as frequently as ‘the’ and ‘la’.
• $P(\text{la} \mid \text{the})$ should be higher than $P(\text{fleur} \mid \text{the})$, $P(\text{bleue} \mid \text{the})$, and even $P(\text{maison} \mid \text{the})$.
• Note: we consider all possible word alignments…
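
The co-occurrence intuition can be illustrated with simple counting. The three-sentence corpus below is a made-up toy example, and the estimate is plain relative co-occurrence frequency rather than a trained alignment model such as IBM-1.

```python
from collections import Counter

# Hypothetical toy parallel corpus (English, French).
CORPUS = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]


def cooccurrence_estimate(pairs, english_word):
    """Relative frequency of each French word in sentences paired with english_word."""
    counts = Counter()
    for english, french in pairs:
        if english_word in english.split():
            counts.update(french.split())
    total = sum(counts.values())
    return {f_word: count / total for f_word, count in counts.items()}


if __name__ == "__main__":
    estimates = cooccurrence_estimate(CORPUS, "the")
    for f_word, prob in sorted(estimates.items(), key=lambda kv: -kv[1]):
        print(f"P({f_word} | the) ~ {prob:.3f}")
```

In this toy corpus ‘la’ appears in every sentence that contains ‘the’, so it receives the largest share, ahead of ‘maison’, ‘bleue’ and ‘fleur’.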

Assignment 2 – content
• Build N-gram language models, with smoothing.
• Learn word-level alignments with the IBM-1 model using data from the Canadian Hansard.
• Combine the language and alignment models into a simple French-to-English translator.
• There are some bonus marks available for substantially going beyond the minimal requirements.

Assignment 2 – languages
• Sentences have already been split and aligned for you.
  • Words have not been aligned.
• You don’t need to know French for this assignment.
  • French is more ‘rigid’ than English, so its use of contractions, for example, is more regular.
• You have to do some pre-processing of French sentences, but those rules are given to you explicitly.

Assignment 2 – practical
• Will be posted by Monday 13 February.
• Will be programmed in Matlab.
  • Various support functions for this assignment will be available on CDF.
• Hamed Heydari will give a tutorial on Matlab on 10 February.
• Marks will be given more for understanding the algorithms and concepts than for specific results.

Puzzle: Machine translation
• Puzzle (for fun): Translate this Centauri phrase to Arcturan:
  “farok crrrok hihok yorok clok kantok ok-yurp”

  1a. ok-voon ororok sprok .
  1b. at-voon bichat dat .
  2a. ok-drubel ok-voon anok plok sprok .
  2b. at-drubel at-voon pippat rrat dat .
  3a. erok sprok izok hihok ghirok .
  3b. totat dat arrat vat hilat .
  4a. ok-voon anok drok brok jok .
  4b. at-voon krat pippat sat lat .
  5a. wiwok farok izok stok .
  5b. totat jjat quat cat .
  6a. lalok sprok izok jok stok .
  6b. wat dat krat quat cat .
  7a. lalok farok ororok lalok sprok izok enemok .
  7b. wat jjat bichat wat dat vat eneat .
  8a. lalok brok anok plok nok .
  8b. iat lat pippat rrat nnat .
  9a. wiwok nok izok kantok ok-yurp .
  9b. totat nnat quat oloat at-yurp .
  10a. lalok mok nok yorok ghirok clok .
  10b. wat nnat gat mat bat hilat .
  11a. lalok nok crrrok hihok yorok zanzanok .
  11b. wat nnat arrat mat zanzanat .
  12a. lalok rarok nok izok hihok mok .
  12b. wat nnat forat arrat vat gat .

Puzzle: Machine translation
• Hint to get started:
  “farok crrrok hihok yorok clok kantok ok-yurp”

[The same twelve Centauri/Arcturan sentence pairs (1a/1b to 12a/12b) as on the previous slide; the highlighting that carried the hint is not preserved in this transcript.]
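
One possible way to get started, sketched under the assumption that a word’s translation appears in every Arcturan sentence paired with a Centauri sentence containing that word. Only pairs 5 and 7 (the two that contain ‘farok’) are included; the function name is illustrative and this is not necessarily the hint the slide intended.

```python
# Sentence pairs 5 and 7 from the puzzle (Centauri, Arcturan).
PAIRS = [
    ("wiwok farok izok stok .", "totat jjat quat cat ."),
    ("lalok farok ororok lalok sprok izok enemok .",
     "wat jjat bichat wat dat vat eneat ."),
]


def candidate_translations(word, sentence_pairs):
    """Intersect the Arcturan vocabularies of all pairs whose Centauri side has word."""
    candidates = None
    for centauri, arcturan in sentence_pairs:
        if word in centauri.split():
            targets = set(arcturan.split()) - {"."}   # ignore punctuation
            candidates = targets if candidates is None else candidates & targets
    return candidates or set()


if __name__ == "__main__":
    print(candidate_translations("farok", PAIRS))
```

Repeating this for the other words of the phrase, and using already-resolved words to eliminate candidates, is essentially a manual version of learning word alignments from parallel text.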