
  • Machine Translation Overview

    April 23, 2020, Junjie Hu

    Materials largely borrowed from Austin Matthews

  • One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’

    Warren Weaver to Norbert Wiener, March 1947

  • Parallel Corpus

    • We are given a corpus of sentence pairs in two languages to train our machine translation models.

    • The source language is also called the foreign language, denoted f.
    • Conventionally, the target language is English, denoted e.

  • Greek / Egyptian

    [Figure: parallel Greek and Egyptian text.]

  • Noisy Channel MT

    We want a model of p(e|f), where e is a possible English translation and f is the confusing foreign sentence.

  • Noisy Channel MT

    [Diagram: an “English” sentence e is drawn from p(e), passed through a channel p(f|e) to produce the “Foreign” sentence f; decoding reverses the channel to recover e.]

  • Noisy Channel MT

    p(e|f) ∝ p(e) · p(f|e): the “Language Model” times the “Translation Model”

  • Noisy Channel Division of Labor

    • Language model: p(e)
      • Is the translation fluent, grammatical, and idiomatic?
      • Use any model of p(e), typically an n-gram model

    • Translation model: p(f|e)
      • The “reverse” translation probability
      • Ensures adequacy of the translation
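    As a concrete illustration of this division of labor, here is a minimal decoding sketch; the candidate list and the two scoring functions are hypothetical stand-ins for a real language model and translation model, not anything specified in the slides.

        def noisy_channel_decode(f, candidates, lm_logprob, tm_logprob):
            """Pick the English candidate e maximizing p(e) * p(f|e),
            i.e. log p(e) + log p(f|e)."""
            def score(e):
                return lm_logprob(e) + tm_logprob(f, e)   # fluency + adequacy
            return max(candidates, key=score)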

  • Language Model Failure

    My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother.

  • Translation Model

    • p(f|e) gives the channel probability: the probability of translating an English sentence into a foreign sentence

    • f = je voudrais un peu de fromage
      • e1 = I would like some cheese            p(f|e1) = 0.4
      • e2 = I would like a little of cheese     p(f|e2) = 0.5
      • e3 = There is no train to Barcelona      p(f|e3) = 0.00001

  • Translation Model

    • How do we parameterize p(f|e)?

    • There are a lot of possible sentences (close to an infinite number):
      • We can only count the sentences in our training data
      • This won’t generalize to new inputs

  • Lexical Translation

    • How do we translate a word? Look it up in a dictionary!
      Haus: house, home, shell, household

    • Multiple translations
      • Different word senses, different registers, different inflections
      • house, home are common
      • shell is specialized (the Haus of a snail is its shell)

  • How common is each translation?

    Translation   Count
    house          5000
    home           2000
    shell           100
    household        80

  • Maximum Likelihood Estimation (MLE)

    p_MLE(e | Haus) = count(Haus, e) / count(Haus), e.g. p(house | Haus) = 5000 / 7180 ≈ 0.70
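    A minimal sketch of this MLE estimate computed from the counts above:

        # translation counts for the German word "Haus" (from the table above)
        counts = {"house": 5000, "home": 2000, "shell": 100, "household": 80}
        total = sum(counts.values())                       # 7180
        p_mle = {e: c / total for e, c in counts.items()}  # relative frequencies
        print(p_mle["house"])                              # ~0.697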

  • Lexical Translation

    • Goal: a model p(e|f, m), where e and f are complete English and foreign sentences

    • Lexical translation makes the following assumptions:
      • Each word e_i in e is generated from exactly one word in f
      • Thus, we have a latent alignment a_i that indicates which word e_i “came from.” Specifically, it came from f_{a_i}.
      • Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}.

  • Lexical Translation

    • Putting our assumptions together, we have:

      p(e, a | f, m) = p(a | f, m) · ∏_{i=1}^{m} p(e_i | f_{a_i})
                     = p(Alignment) · p(Translation | Alignment)

      where a is an m-dimensional latent vector with each element a_i in the range [0, n].

  • Word Alignment

    • Most of the research for the first 10 years of SMT was here. Word translations weren’t the problem; word order was hard.

    • Alignments can be visualized by drawing links between two sentences, and they are represented as vectors of positions.

  • Reordering

    • Words may be reordered during translation

  • Word Dropping

    • A source word may not be translated at all

  • Word Insertion

    • Words may be inserted during translation
    • E.g. English just does not have an equivalent
    • But these words must be explained; we typically assume every source sentence contains a NULL token

  • One-to-many Translation

    • A source word may translate into more than one target word

  • Many-to-one Translation

    • More than one source word may translate as a unit; this cannot be modeled in lexical translation

  • IBM Model 1

    • Simplest possible lexical translation model
    • Additional assumptions (see the sketch below):
      • The m alignment decisions are independent
      • The alignment distribution for each a_i is uniform over all source words and NULL
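    A minimal sketch of the resulting likelihood; the translation table t is a hypothetical nested dict with t[f_word][e_word] = p(e_word | f_word):

        def model1_prob(e, f, t):
            """p(e | f, m) under IBM Model 1.
            e: list of target words; f: list of source words, with f[0] = NULL."""
            prob = 1.0
            for e_i in e:
                # uniform alignment over NULL + source words, times the lexical
                # translation probability, summed over all possible alignments
                prob *= (1.0 / len(f)) * sum(t.get(f_j, {}).get(e_i, 0.0) for f_j in f)
            return prob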

  • Translating with Model 1

    Language model says: 🙂 (fluent candidate)

    Language model says: 🙁 (disfluent candidate)

  • Learning Lexical Translation Models

    • How do we learn the parameters p(e|f) on the training corpus of (f, e) sentence pairs?

    • “Chicken and egg” problem
      • If we had the alignments, we could estimate the translation probabilities (MLE estimation)
      • If we had the translation probabilities, we could find the most likely alignments (greedy)

  • Expectation-Maximization (EM) Algorithm

    • Pick some random (or uniform) starting parameters
    • Repeat until bored (~5 iterations for lexical translation models; see the sketch below):
      • Using the current parameters, compute “expected” alignments p(a_i | e, f) for every target word token in the training data
      • Keep track of the expected number of times f translates into e throughout the whole corpus
      • Keep track of the number of times f is used as the source of any translation
      • Use these frequency estimates in the standard MLE equation to get a better set of parameters
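    A minimal sketch of this EM loop for Model 1, assuming the corpus is a list of (f_words, e_words) pairs with the NULL token already added to each source sentence:

        from collections import defaultdict

        def train_model1(corpus, iterations=5):
            e_vocab = {e for _, e_sent in corpus for e in e_sent}
            # start from uniform translation probabilities t(e|f)
            t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))
            for _ in range(iterations):
                count = defaultdict(lambda: defaultdict(float))  # expected c(e, f)
                total = defaultdict(float)                       # expected c(f)
                # E-step: expected alignments for every target word token
                for f_sent, e_sent in corpus:
                    for e in e_sent:
                        norm = sum(t[f][e] for f in f_sent)
                        for f in f_sent:
                            p_align = t[f][e] / norm   # p(a_i = j | e, f)
                            count[f][e] += p_align
                            total[f] += p_align
                # M-step: standard MLE on the expected counts
                for f in count:
                    for e in count[f]:
                        t[f][e] = count[f][e] / total[f]
            return t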

  • EM for IBM Model 1

  • Convergence

  • Extensions: Lexical to Phrase Translation

    • Phrase-based MT:
      • Allow multiple words to translate as chunks (including many-to-one)
      • Introduce another latent variable, the source segmentation

  • Extensions: Alignment Heuristics

    • Alignment priors:
      • Instead of assuming the alignment decisions are uniform, impose (or learn) a prior over alignment grids

    Chahuneau et al. (2013)

  • Extensions: Hierarchical Phrase-based MT

    • Syntactic structure
    • Rules of the form:
      • X之一 → one of the X

    Chiang (2005), Galley et al. (2006)

  • MT Evaluation

    • How do we evaluate translation systems’ output?
    • Central idea: “The closer a machine translation is to a professional human translation, the better it is.”

    • The most commonly used metric is called BLEU: the geometric mean of the n-gram precisions against the human translations, combined with a length penalty term.

  • BLEU: An Example

    Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.

    Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

    Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

    Reference 3: It is the practical guide for the army always to heed directions of the party.

    Unigram Precision: 17/18

    Adapted from slides by Arthur Chan

  • Issue of N-gram Precision

    • What if some words are over-generated? e.g. “the”
    • An extreme example:

      Candidate: the the the the the the the.
      Reference 1: The cat is on the mat.
      Reference 2: There is a cat on the mat.

    • N-gram precision: 7/7
    • Solution: a reference word should be exhausted after it is matched.

    Adapted from slides by Arthur Chan

  • Issue of N-gram Precision

    • What if some words are just dropped?
    • Another extreme example:

      Candidate: the.
      Reference 1: My mom likes the blue flowers.
      Reference 2: My mother prefers the blue flowers.

    • N-gram precision: 1/1
    • Solution: add a penalty if the candidate is too short.

    Adapted from slides by Arthur Chan

  • BLEU

    • Clipped n-gram precisions for N = 1, 2, 3, 4
    • Geometric average
    • Brevity penalty

    • Ranges from 0.0 to 1.0, but usually shown multiplied by 100
    • An increase of +1.0 BLEU is usually a conference paper
    • MT systems usually score in the 10s to 30s
    • Human translators usually score in the 70s and 80s
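    A minimal sketch of the computation just described (clipped n-gram precisions, geometric mean, brevity penalty); real implementations such as sacrebleu also handle tokenization, smoothing, and corpus-level aggregation:

        import math
        from collections import Counter

        def bleu(candidate, references, max_n=4):
            """candidate: list of tokens; references: list of token lists."""
            def ngrams(tokens, n):
                return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

            log_prec = 0.0
            for n in range(1, max_n + 1):
                cand = ngrams(candidate, n)
                # clip each n-gram count by its maximum count in any single reference
                max_ref = Counter()
                for ref in references:
                    for ng, c in ngrams(ref, n).items():
                        max_ref[ng] = max(max_ref[ng], c)
                clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
                p_n = clipped / sum(cand.values()) if cand else 0.0
                log_prec += math.log(p_n) if p_n > 0 else float("-inf")
            log_prec /= max_n                     # geometric mean in log space

            # brevity penalty against the closest reference length
            ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
            bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
            return bp * math.exp(log_prec)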

  • A Short Segue

    • Word- and phrase-based (“symbolic”) models were cutting edge for decades (up until ~2014)

    • Such models are still the most widely used in commercial applications
    • Since 2014 most research on MT has focused on neural models

  • “Neurons”

  • “Neural” Networks

  • “Softmax”

  • “Deep”

  • “Recurrent”

  • Design Decisions

    • How to represent inputs and outputs?
    • Neural architecture?
      • How many layers? (Requires non-linearities to improve capacity!)
      • How many neurons?
      • Recurrent or not?
      • What kind of non-linearities?

  • Representing Language

    • “One-hot” vectors
      • Each position in a vector corresponds to a word type

    • Distributed representations
      • Vectors encode “features” of input words (character n-grams, morphological features, etc.)

    [Figure: the one-hot vector for “dog” has a 1 in the position for Dog and 0s for every other type (Aardvark, Abalone, Abandon, Abash, …); the distributed vector for “dog” is a dense vector of feature values.]
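    A minimal sketch contrasting the two representations (the toy vocabulary and the 8-dimensional embedding size are illustrative only):

        import numpy as np

        vocab = ["aardvark", "abalone", "abandon", "abash", "dog"]
        word_to_id = {w: i for i, w in enumerate(vocab)}

        # "One-hot": all zeros except a single 1 at the word's position
        one_hot_dog = np.zeros(len(vocab))
        one_hot_dog[word_to_id["dog"]] = 1.0

        # Distributed representation: a dense vector of feature values,
        # here randomly initialized for illustration (normally learned)
        embedding_table = np.random.randn(len(vocab), 8)
        dense_dog = embedding_table[word_to_id["dog"]]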

  • Training Neural Networks

    • Neural networks are supervised models: you need a set of inputs paired with outputs

    • Algorithm (see the sketch below)
      • Run until bored:
        • Give input to the network, see what it predicts
        • Compute loss(y, y*)
        • Use the chain rule (aka “backpropagation”) to compute the gradient with respect to the parameters
        • Update parameters (SGD, Adam, LBFGS, etc.)
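    A minimal PyTorch sketch of that loop; the model, random data, and hyperparameters are placeholders, not anything specified in the slides:

        import torch

        model = torch.nn.Linear(100, 10)                  # some parameterized network
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        for step in range(1000):                          # "run until bored"
            x = torch.randn(32, 100)                      # a batch of inputs
            y_true = torch.randint(0, 10, (32,))          # their gold outputs
            y_pred = model(x)                             # see what it predicts
            loss = torch.nn.functional.cross_entropy(y_pred, y_true)
            loss.backward()                               # backpropagation (chain rule)
            optimizer.step()                              # parameter update
            optimizer.zero_grad()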

  • Neural Language Models

    [Figure: the feed-forward neural LM of Bengio et al. (2003): concatenated word embeddings → tanh hidden layer → softmax over the vocabulary, predicting the next word.]

    Bengio et al. (2003)
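    A minimal PyTorch sketch in the spirit of that architecture (the dimensions and context size are illustrative):

        import torch
        import torch.nn as nn

        class FeedForwardLM(nn.Module):
            """Predict the next word from the previous `context` words."""
            def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, context=3):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                self.hidden = nn.Linear(context * emb_dim, hidden_dim)
                self.out = nn.Linear(hidden_dim, vocab_size)

            def forward(self, context_ids):                 # (batch, context)
                x = self.embed(context_ids).flatten(1)      # concatenate embeddings
                h = torch.tanh(self.hidden(x))              # tanh hidden layer
                return torch.log_softmax(self.out(h), dim=-1)  # softmax over next word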

  • Neural Features for Translation

    • Turn Bengio et al. (2003) into a translation model
    • Conditional model: generate the next English word conditioned on
      • The previous n English words you generated
      • The aligned source word and its m neighbors

    Devlin et al. (2014)

  • Neural Features for Translation

    [Figure: the same embeddings → tanh → softmax architecture, now conditioned on both the target history and the aligned source words.]

    Devlin et al. (2014)

  • Notation Simplification

  • RNNs Revisited

  • Fully Neural Translation

    • Fully end-to-end RNN-based translation model
    • Encode the source sentence using one RNN
    • Generate the target sentence one word at a time using another RNN

    [Figure: an encoder RNN reads “I am a student”; a decoder RNN then emits “je suis étudiant” one word at a time.]

    Sutskever et al. (2014)
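    A bare-bones PyTorch sketch of such an encoder-decoder (GRUs and the dimensions are illustrative choices, not the exact Sutskever et al. setup):

        import torch.nn as nn

        class Seq2Seq(nn.Module):
            def __init__(self, src_vocab, tgt_vocab, dim=256):
                super().__init__()
                self.src_embed = nn.Embedding(src_vocab, dim)
                self.tgt_embed = nn.Embedding(tgt_vocab, dim)
                self.encoder = nn.GRU(dim, dim, batch_first=True)
                self.decoder = nn.GRU(dim, dim, batch_first=True)
                self.out = nn.Linear(dim, tgt_vocab)

            def forward(self, src_ids, tgt_ids):
                _, h = self.encoder(self.src_embed(src_ids))    # compress the source
                dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), h)
                return self.out(dec_states)                     # next-word scores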

  • Attentional Model

    • The encoder-decoder model struggles with long sentences
    • An RNN is trying to compress an arbitrarily long sentence into a finite-length vector

    • What if we only look at one (or a few) source words when we generate each output word?

    Bahdanau et al. (2014)
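    A minimal sketch of one attention step (dot-product scoring is used here for brevity; Bahdanau et al. score with a small feed-forward network):

        import torch

        def attention(decoder_state, encoder_states):
            """decoder_state: (hidden,); encoder_states: (src_len, hidden)."""
            scores = encoder_states @ decoder_state     # one score per source word
            weights = torch.softmax(scores, dim=0)      # attention distribution
            context = weights @ encoder_states          # weighted sum = context vector
            return context, weights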

  • The Intuition

    Our large black dog bit the poor mailman .

    うち の 大きな 黒い 犬 が 可哀想な 郵便屋 に 噛み ついた 。

    Bahdanau et al. (2014)

  • The Attention Model

    [Figure, built up over several slides: an encoder RNN reads “I am a student”; at each decoder step the attention model computes a softmax over the encoder states, forms a context vector, and the decoder emits the next target word, producing “je suis étudiant” one word at a time.]

    Bahdanau et al. (2014)

  • Convolutional Encoder-Decoder

    • CNN:
      • Encodes words within a fixed-size window
      • Parallel computation
      • Shorter paths to cover a wider range of words

    • RNN:
      • Sequentially encodes a sentence from left to right
      • Hard to parallelize

    Gehring et al. (2017)

  • The Transformer

    • Idea: Instead of using an RNN to encode the source sentence and the partial target sentence, use self-attention!

    [Figure: a standard RNN encoder vs. a self-attention encoder, each mapping the raw word vectors of “I am a student” to word-in-context vectors.]

    Vaswani et al. (2017)

  • The Transformer

    [Figure: the same encoder-decoder layout (“I am a student” → “je suis étudiant”) with attention and context vectors, but with the recurrent encoder and decoder replaced by self-attention.]

    Vaswani et al. (2017)

  • Transformer

    • Traditional attention:
      • Query: decoder hidden state
      • Key and Value: encoder hidden states
      • Attend to source words based on the current decoder state

    • Self-attention (see the sketch below):
      • Query, Key, and Value come from the same sequence
      • Attend to surrounding source words based on the current source word
      • Attend to preceding target words based on the current target word

    Vaswani et al. (2017)
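    A minimal single-head self-attention sketch (the projection matrices Wq, Wk, Wv are assumed to be learned parameters; multi-head attention, masking, and layer normalization are omitted):

        import torch

        def self_attention(X, Wq, Wk, Wv):
            """X: (seq_len, d_model); returns word-in-context vectors."""
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / K.shape[-1] ** 0.5     # scaled dot-product
            weights = torch.softmax(scores, dim=-1)   # each word attends to all words
            return weights @ V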

  • Visualization of Attention Weights

    • Self-attention weights can detect long-term dependencies within a sentence, e.g., make … more difficult

  • The Transformer

    • Computation is easily parallelizable
    • Shorter path from each target word to each source word, giving stronger gradient signals
    • Empirically stronger translation performance
    • Empirically trains substantially faster than more serial models

  • Current Research Directions on Neural MT

    • Incorporating syntax into neural MT
    • Handling of morphologically rich languages
    • Optimizing translation quality (instead of corpus probability)
    • Multilingual models
    • Document-level translation