
  • Machine Translation Overview

    April 23, 2020, Junjie Hu

    Materials largely borrowed from Austin Matthews

  • One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’

    Warren Weaver to Norbert Wiener, March 1947

  • Parallel Corpus

    • We are given a corpus of sentence pairs in two languages to train our machine translation models.

    • The source language is also called the foreign language, denoted f.
    • Conventionally, the target language is English, denoted e.

  • Greek / Egyptian

    [Figure: parallel Greek and Egyptian text.]

  • Noisy Channel MT

    We want a model of p(e|f), where e is a possible English translation and f is the confusing foreign sentence.

  • Noisy Channel MT

    [Diagram: an “English” sentence e is drawn from p(e), passed through a channel p(f|e) to produce the “Foreign” sentence f; decoding reverses the channel to recover e.]

  • Noisy Channel MT

    p(e|f) ∝ p(e) · p(f|e): the “Language Model” times the “Translation Model”

  • Noisy Channel Division of Labor

    • Language model: p(e)
      • Is the translation fluent, grammatical, and idiomatic?
      • Use any model of p(e), typically an n-gram model

    • Translation model: p(f|e)
      • The “reverse” translation probability
      • Ensures adequacy of the translation
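    As a concrete illustration of this division of labor, here is a minimal decoding sketch; the candidate list and the two scoring functions are hypothetical stand-ins for a real language model and translation model, not anything specified in the slides.

        def noisy_channel_decode(f, candidates, lm_logprob, tm_logprob):
            """Pick the English candidate e maximizing p(e) * p(f|e),
            i.e. log p(e) + log p(f|e)."""
            def score(e):
                return lm_logprob(e) + tm_logprob(f, e)   # fluency + adequacy
            return max(candidates, key=score)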

  • Language Model Failure

    My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother.

  • Translation Model

    • p(f|e) gives the channel probability: the probability of translating an English sentence into a foreign sentence

    • f = je voudrais un peu de fromage
      • e1 = I would like some cheese            p(f|e1) = 0.4
      • e2 = I would like a little of cheese     p(f|e2) = 0.5
      • e3 = There is no train to Barcelona      p(f|e3) = 0.00001

  • Translation Model

    • How do we parameterize p(f|e)?

    • There are a lot of possible sentences (close to an infinite number):
      • We can only count the sentences in our training data
      • This won’t generalize to new inputs

  • Lexical Translation

    • How do we translate a word? Look it up in a dictionary!
      Haus: house, home, shell, household

    • Multiple translations
      • Different word senses, different registers, different inflections
      • house, home are common
      • shell is specialized (the Haus of a snail is its shell)

  • How common is each translation?

    Translation   Count
    house          5000
    home           2000
    shell           100
    household        80

  • Maximum Likelihood Estimation (MLE)

    p_MLE(e | Haus) = count(Haus, e) / count(Haus), e.g. p(house | Haus) = 5000 / 7180 ≈ 0.70
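    A minimal sketch of this MLE estimate computed from the counts above:

        # translation counts for the German word "Haus" (from the table above)
        counts = {"house": 5000, "home": 2000, "shell": 100, "household": 80}
        total = sum(counts.values())                       # 7180
        p_mle = {e: c / total for e, c in counts.items()}  # relative frequencies
        print(p_mle["house"])                              # ~0.697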

  • Lexical Translation

    • Goal: a model p(e|f, m), where e and f are complete English and foreign sentences

    • Lexical translation makes the following assumptions:
      • Each word e_i in e is generated from exactly one word in f
      • Thus, we have a latent alignment a_i that indicates which word e_i “came from.” Specifically, it came from f_{a_i}.
      • Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}.

  • Lexical Translation

    • Putting our assumptions together, we have:

      p(e, a | f, m) = p(a | f, m) · ∏_{i=1}^{m} p(e_i | f_{a_i})
                     = p(Alignment) · p(Translation | Alignment)

      where a is an m-dimensional latent vector with each element a_i in the range [0, n].

  • Word Alignment

    • Most of the research for the first 10 years of SMT was here. Word translations weren’t the problem; word order was hard.

    • Alignments can be visualized by drawing links between two sentences, and they are represented as vectors of positions.

  • Reordering

    • Words may be reordered during translation

  • Word Dropping

    • A source word may not be translated at all

  • Word Insertion

    • Words may be inserted during translation
    • E.g. English just does not have an equivalent
    • But these words must be explained; we typically assume every source sentence contains a NULL token

  • One-to-many Translation

    • A source word may translate into more than one target word

  • Many-to-one Translation

    • More than one source word may translate as a unit; this cannot be modeled in lexical translation

  • IBM Model 1

    • Simplest possible lexical translation model
    • Additional assumptions (see the sketch below):
      • The m alignment decisions are independent
      • The alignment distribution for each a_i is uniform over all source words and NULL
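    A minimal sketch of the resulting likelihood; the translation table t is a hypothetical nested dict with t[f_word][e_word] = p(e_word | f_word):

        def model1_prob(e, f, t):
            """p(e | f, m) under IBM Model 1.
            e: list of target words; f: list of source words, with f[0] = NULL."""
            prob = 1.0
            for e_i in e:
                # uniform alignment over NULL + source words, times the lexical
                # translation probability, summed over all possible alignments
                prob *= (1.0 / len(f)) * sum(t.get(f_j, {}).get(e_i, 0.0) for f_j in f)
            return prob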

  • Translating with Model 1

    Language model says: 🙂 (fluent candidate)

    Language model says: 🙁 (disfluent candidate)

  • Learning Lexical Translation Models

    • How do we learn the parameters p(e|f) on the training corpus of (f, e) sentence pairs?

    • “Chicken and egg” problem
      • If we had the alignments, we could estimate the translation probabilities (MLE estimation)
      • If we had the translation probabilities, we could find the most likely alignments (greedy)

  • Expectation-Maximization (EM) Algorithm

    • Pick some random (or uniform) starting parameters
    • Repeat until bored (~5 iterations for lexical translation models; see the sketch below):
      • Using the current parameters, compute “expected” alignments p(a_i | e, f) for every target word token in the training data
      • Keep track of the expected number of times f translates into e throughout the whole corpus
      • Keep track of the number of times f is used as the source of any translation
      • Use these frequency estimates in the standard MLE equation to get a better set of parameters
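    A minimal sketch of this EM loop for Model 1, assuming the corpus is a list of (f_words, e_words) pairs with the NULL token already added to each source sentence:

        from collections import defaultdict

        def train_model1(corpus, iterations=5):
            e_vocab = {e for _, e_sent in corpus for e in e_sent}
            # start from uniform translation probabilities t(e|f)
            t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))
            for _ in range(iterations):
                count = defaultdict(lambda: defaultdict(float))  # expected c(e, f)
                total = defaultdict(float)                       # expected c(f)
                # E-step: expected alignments for every target word token
                for f_sent, e_sent in corpus:
                    for e in e_sent:
                        norm = sum(t[f][e] for f in f_sent)
                        for f in f_sent:
                            p_align = t[f][e] / norm   # p(a_i = j | e, f)
                            count[f][e] += p_align
                            total[f] += p_align
                # M-step: standard MLE on the expected counts
                for f in count:
                    for e in count[f]:
                        t[f][e] = count[f][e] / total[f]
            return t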

  • EM for IBM Model 1

  • Convergence

  • Extensions: Lexical to Phrase Translation

    • Phrase-based MT:
      • Allow multiple words to translate as chunks (including many-to-one)
      • Introduce another latent variable, the source segmentation

  • Extensions: Alignment Heuristics

    • Alignment priors:
      • Instead of assuming the alignment decisions are uniform, impose (or learn) a prior over alignment grids

    Chahuneau et al. (2013)

  • Extensions: Hierarchical Phrase-based MT

    • Syntactic structure
    • Rules of the form:
      • X之一 → one of the X

    Chiang (2005), Galley et al. (2006)

  • MT Evaluation

    • How do we evaluate translation systems’ output?
    • Central idea: “The closer a machine translation is to a professional human translation, the better it is.”

    • The most commonly used metric is called BLEU: the geometric mean of the n-gram precisions against the human translations, combined with a length penalty term.

  • BLEU: An Example

    Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

    Reference 3: It is the practical guide for the army always to heed directions of the party.

    Unigram Precision: 17/18

    Adapted from slides by Arthur Chan

  • Issue of N-gram Precision

    • What if some words are over-generated? e.g. “the”
    • An extreme example:

      Candidate: the the the the the the the.
      Reference 1: The cat is on the mat.
      Reference 2: There is a cat on the mat.

    • N-gram precision: 7/7
    • Solution: a reference word should be exhausted after it is matched.

    Adapted from slides by Arthur Chan

  • Issue of N-gram Precision

    • What if some words are just dropped?
    • Another extreme example:

      Candidate: the.
      Reference 1: My mom likes the blue flowers.
      Reference 2: My mother prefers the blue flowers.

    • N-gram precision: 1/1
    • Solution: add a penalty if the candidate is too short.

    Adapted from slides by Arthur Chan

  • BLEU

    • Clipped n-gram precisions for N = 1, 2, 3, 4
    • Geometric average
    • Brevity penalty

    • Ranges from 0.0 to 1.0, but usually shown multiplied by 100
    • An increase of +1.0 BLEU is usually a conference paper
    • MT systems usually score in the 10s to 30s
    • Human translators usually score in the 70s and 80s
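    A minimal sketch of the computation just described (clipped n-gram precisions, geometric mean, brevity penalty); real implementations such as sacrebleu also handle tokenization, smoothing, and corpus-level aggregation:

        import math
        from collections import Counter

        def bleu(candidate, references, max_n=4):
            """candidate: list of tokens; references: list of token lists."""
            def ngrams(tokens, n):
                return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

            log_prec = 0.0
            for n in range(1, max_n + 1):
                cand = ngrams(candidate, n)
                # clip each n-gram count by its maximum count in any single reference
                max_ref = Counter()
                for ref in references:
                    for ng, c in ngrams(ref, n).items():
                        max_ref[ng] = max(max_ref[ng], c)
                clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
                p_n = clipped / sum(cand.values()) if cand else 0.0
                log_prec += math.log(p_n) if p_n > 0 else float("-inf")
            log_prec /= max_n                     # geometric mean in log space

            # brevity penalty against the closest reference length
            ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
            bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
            return bp * math.exp(log_prec)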

  • A Short Segue

    • Word- and phrase-based (“symbolic”) models were cutting edge for decades (up until ~2014)

    • Such models are still the most widely used in commercial applications
    • Since 2014 most research on MT has focused on neural models

  • “Neurons”

  • “Neural” Networks

  • “Softmax”

  • “Deep”

  • “Recurrent”

  • Design Decisions

    • How to represent inputs and outputs?
    • Neural architecture?
      • How many layers? (Requires non-linearities to improve capacity!)
      • How many neurons?
      • Recurrent or not?
      • What kind of non-linearities?

  • Representing Language

    • “One-hot” vectors
      • Each position in a vector corresponds to a word type

    • Distributed representations
      • Vectors encode “features” of input words (character n-grams, morphological features, etc.)

    [Figure: the one-hot vector for “dog” has a 1 in the position for Dog and 0s for every other type (Aardvark, Abalone, Abandon, Abash, …); the distributed vector for “dog” is a dense vector of feature values.]
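    A minimal sketch contrasting the two representations (the toy vocabulary and the 8-dimensional embedding size are illustrative only):

        import numpy as np

        vocab = ["aardvark", "abalone", "abandon", "abash", "dog"]
        word_to_id = {w: i for i, w in enumerate(vocab)}

        # "One-hot": all zeros except a single 1 at the word's position
        one_hot_dog = np.zeros(len(vocab))
        one_hot_dog[word_to_id["dog"]] = 1.0

        # Distributed representation: a dense vector of feature values,
        # here randomly initialized for illustration (normally learned)
        embedding_table = np.random.randn(len(vocab), 8)
        dense_dog = embedding_table[word_to_id["dog"]]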

  • Training Neural Networks

    • Neural networks are supervised models: you need a set of inputs paired with outputs

    • Algorithm (see the sketch below)
      • Run until bored:
        • Give input to the network, see what it predicts
        • Compute loss(y, y*)
        • Use the chain rule (aka “backpropagation”) to compute the gradient with respect to the parameters
        • Update parameters (SGD, Adam, LBFGS, etc.)
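    A minimal PyTorch sketch of that loop; the model, random data, and hyperparameters are placeholders, not anything specified in the slides:

        import torch

        model = torch.nn.Linear(100, 10)                  # some parameterized network
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        for step in range(1000):                          # "run until bored"
            x = torch.randn(32, 100)                      # a batch of inputs
            y_true = torch.randint(0, 10, (32,))          # their gold outputs
            y_pred = model(x)                             # see what it predicts
            loss = torch.nn.functional.cross_entropy(y_pred, y_true)
            loss.backward()                               # backpropagation (chain rule)
            optimizer.step()                              # parameter update
            optimizer.zero_grad()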

  • Neural Language Models

    [Figure: the feed-forward neural LM of Bengio et al. (2003): concatenated word embeddings → tanh hidden layer → softmax over the vocabulary, predicting the next word.]

    Bengio et al. (2003)
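    A minimal PyTorch sketch in the spirit of that architecture (the dimensions and context size are illustrative):

        import torch
        import torch.nn as nn

        class FeedForwardLM(nn.Module):
            """Predict the next word from the previous `context` words."""
            def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, context=3):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                self.hidden = nn.Linear(context * emb_dim, hidden_dim)
                self.out = nn.Linear(hidden_dim, vocab_size)

            def forward(self, context_ids):                 # (batch, context)
                x = self.embed(context_ids).flatten(1)      # concatenate embeddings
                h = torch.tanh(self.hidden(x))              # tanh hidden layer
                return torch.log_softmax(self.out(h), dim=-1)  # softmax over next word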

  • Neural Features for Translation

    • Turn Bengio et al. (2003) into a translation model
    • Conditional model: generate the next English word conditioned on
      • The previous n English words you generated
      • The aligned source word and its m neighbors

    Devlin et al. (2014)

  • Neural Features for Translation

    [Figure: the same embeddings → tanh → softmax architecture, now conditioned on both the target history and the aligned source words.]

    Devlin et al. (2014)

  • Notation Simplification

  • RNNs Revisited

  • Fully Neural Translation

    • Fully end-to-end RNN-based translation model
    • Encode the source sentence using one RNN
    • Generate the target sentence one word at a time using another RNN

    [Figure: an encoder RNN reads “I am a student”; a decoder RNN then emits “je suis étudiant” one word at a time.]

    Sutskever et al. (2014)
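    A bare-bones PyTorch sketch of such an encoder-decoder (GRUs and the dimensions are illustrative choices, not the exact Sutskever et al. setup):

        import torch.nn as nn

        class Seq2Seq(nn.Module):
            def __init__(self, src_vocab, tgt_vocab, dim=256):
                super().__init__()
                self.src_embed = nn.Embedding(src_vocab, dim)
                self.tgt_embed = nn.Embedding(tgt_vocab, dim)
                self.encoder = nn.GRU(dim, dim, batch_first=True)
                self.decoder = nn.GRU(dim, dim, batch_first=True)
                self.out = nn.Linear(dim, tgt_vocab)

            def forward(self, src_ids, tgt_ids):
                _, h = self.encoder(self.src_embed(src_ids))    # compress the source
                dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), h)
                return self.out(dec_states)                     # next-word scores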

  • Attentional Model

    • The encoder-decoder model struggles with long sentences
    • An RNN is trying to compress an arbitrarily long sentence into a finite-length vector

    • What if we only look at one (or a few) source words when we generate each output word?

    Bahdanau et al. (2014)
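    A minimal sketch of one attention step (dot-product scoring is used here for brevity; Bahdanau et al. score with a small feed-forward network):

        import torch

        def attention(decoder_state, encoder_states):
            """decoder_state: (hidden,); encoder_states: (src_len, hidden)."""
            scores = encoder_states @ decoder_state     # one score per source word
            weights = torch.softmax(scores, dim=0)      # attention distribution
            context = weights @ encoder_states          # weighted sum = context vector
            return context, weights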

  • The Intuition

    Our large black dog bit the poor mailman .

    うち の 大きな 黒い 犬 が 可哀想な 郵便屋 に 噛み ついた 。

    Bahdanau et al. (2014)

  • The Attention Model

    [Figure, built up over several slides: an encoder RNN reads “I am a student”; at each decoder step the attention model computes a softmax over the encoder states, forms a context vector, and the decoder emits the next target word, producing “je suis étudiant” one word at a time.]

    Bahdanau et al. (2014)

  • Convolutional Encoder-Decoder

    • CNN:
      • Encodes words within a fixed-size window
      • Parallel computation
      • Shorter paths to cover a wider range of words

    • RNN:
      • Sequentially encodes a sentence from left to right
      • Hard to parallelize

    Gehring et al. (2017)

  • The Transformer

    • Idea: Instead of using an RNN to encode the source sentence and the partial target sentence, use self-attention!

    [Figure: a standard RNN encoder vs. a self-attention encoder, each mapping the raw word vectors of “I am a student” to word-in-context vectors.]

    Vaswani et al. (2017)

  • The Transformer

    [Figure: the same encoder-decoder layout (“I am a student” → “je suis étudiant”) with attention and context vectors, but with the recurrent encoder and decoder replaced by self-attention.]

    Vaswani et al. (2017)

  • Transformer

    • Traditional attention:
      • Query: decoder hidden state
      • Key and Value: encoder hidden states
      • Attend to source words based on the current decoder state

    • Self-attention (see the sketch below):
      • Query, Key, and Value come from the same sequence
      • Attend to surrounding source words based on the current source word
      • Attend to preceding target words based on the current target word

    Vaswani et al. (2017)
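    A minimal single-head self-attention sketch (the projection matrices Wq, Wk, Wv are assumed to be learned parameters; multi-head attention, masking, and layer normalization are omitted):

        import torch

        def self_attention(X, Wq, Wk, Wv):
            """X: (seq_len, d_model); returns word-in-context vectors."""
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / K.shape[-1] ** 0.5     # scaled dot-product
            weights = torch.softmax(scores, dim=-1)   # each word attends to all words
            return weights @ V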

  • Visualization of Attention Weights

    • Self-attention weights can detect long-term dependencies within a sentence, e.g., make … more difficult

  • The Transformer

    • Computation is easily parallelizable
    • Shorter path from each target word to each source word, giving stronger gradient signals
    • Empirically stronger translation performance
    • Empirically trains substantially faster than more serial models

  • Current Research Directions on Neural MT

    • Incorporating syntax into neural MT
    • Handling of morphologically rich languages
    • Optimizing translation quality (instead of corpus probability)
    • Multilingual models
    • Document-level translation