Machine Translation Overview
April 23, 2020 · Junjie Hu
Materials largely borrowed from Austin Matthews
-
One naturally wonders if the problem of translation could conceivably be
treated as a problem in cryptography. When I look at an article in Russian, I
say: ‘This is really written in English, but it has been coded in some strange symbols. I will now
proceed to decode.’
Warren Weaver to Norbert Wiener, March 1947
-
Parallel corpus
• We are given a corpus of sentence pairs in two languages to train our machine translation models.
• The source language is also called the foreign language, denoted f.
• Conventionally, the target language is referred to as English, denoted e.
-
(Figure: the same text in Greek and Egyptian — an early parallel corpus.)
-
Noisy Channel MT
• We want a model of p(e | f): given a confusing foreign sentence f, score each possible English translation e.
-
Noisy Channel MT
• Generative story: an "English" sentence e is drawn from p(e) and passed through a noisy channel p(f | e), producing the "Foreign" sentence f. Translation is decoding: recovering e from f.
-
Noisy Channel MT
• By Bayes' rule: p(e | f) ∝ p(e) · p(f | e), where p(e) is the "Language Model" and p(f | e) is the "Translation Model".
-
Noisy Channel Division of Labor
• Language model, p(e):
  • Is the translation fluent, grammatical, and idiomatic?
  • Use any model of p(e); typically an n-gram model.
• Translation model, p(f | e):
  • "Reverse" translation probability.
  • Ensures adequacy of the translation.
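The division of labor follows from Bayes' rule: the decoder searches for the English sentence that best trades off fluency against adequacy (p(f) is constant for a given input, so it drops out of the argmax):

```latex
\hat{e} \;=\; \arg\max_{e} \, p(e \mid f)
       \;=\; \arg\max_{e} \, \frac{p(e)\, p(f \mid e)}{p(f)}
       \;=\; \arg\max_{e} \, \underbrace{p(e)}_{\text{language model}} \, \underbrace{p(f \mid e)}_{\text{translation model}}
```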
-
Language Model Failure
My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother.
-
Translation Model
• p(f | e) gives the channel probability: the probability of translating an English sentence into a foreign sentence.
• f = je voudrais un peu de fromage
  • e1 = I would like some cheese: p(f | e1) = 0.4
  • e2 = I would like a little of cheese: p(f | e2) = 0.5
  • e3 = There is no train to Barcelona: p(f | e3) < 0.00001
-
Translation Model
• How do we parameterize p(f | e)?
• There are a huge number of possible sentences (close to infinite):
  • We can only count the sentences in our training data.
  • This won't generalize to new inputs.
-
Lexical Translation
• How do we translate a word? Look it up in a dictionary!
  Haus: house, home, shell, household
• Multiple translations:
  • Different word senses, different registers, different inflections.
  • house, home are common.
  • shell is specialized (the Haus of a snail is its shell).
-
How common is each translation?

Translation   Count
house          5000
home           2000
shell           100
household        80
-
Maximum Likelihood Estimation (MLE)
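The MLE for lexical translation probabilities simply normalizes the counts above. A minimal sketch, using the Haus counts from the previous slide:

```python
from fractions import Fraction

# Translation counts for German "Haus" (from the table above).
counts = {"house": 5000, "home": 2000, "shell": 100, "household": 80}
total = sum(counts.values())  # 7180

# MLE: p(e | Haus) = count(Haus -> e) / count(Haus)
p = {e: Fraction(c, total) for e, c in counts.items()}

print(float(p["house"]))  # ≈ 0.696
```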
-
Lexical Translation
• Goal: a model p(e | f, m), where e and f are complete English and foreign sentences and m is the length of e.
• Lexical translation makes the following assumptions:
  • Each word e_i in e is generated from exactly one word in f.
  • Thus, we have a latent alignment a_i that indicates which word e_i "came from." Specifically, it came from f_{a_i}.
  • Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}.
-
Lexical Translation
• Putting our assumptions together, we have:

  p(e, a | f, m) = ∏_{i=1..m} p(a_i) · p(e_i | f_{a_i})

  where a is an m-dimensional latent vector with each element a_i in the range [0, n]. The first factor is p(Alignment); the second is p(Translation | Alignment).
-
Word Alignment
• Most of the research for the first 10 years of SMT was here. Word translations weren't the problem. Word order was hard.
-
Word Alignment
• Alignments can be visualized by drawing links between two sentences, and they are represented as vectors of positions:
-
Reordering
• Words may be reordered during translation.
-
Word Dropping
• A source word may not be translated at all.
-
Word Insertion
• Words may be inserted during translation.
• E.g. the English word "just" does not have an equivalent.
• But these words must be explained; we typically assume every source sentence contains a NULL token.
-
One-to-many Translation
• A source word may translate into more than one target word.
-
Many-to-one Translation
• More than one source word may need to translate as a unit, which lexical translation cannot model.
-
IBM Model 1
• Simplest possible lexical translation model.
• Additional assumptions:
  • The m alignment decisions are independent.
  • The alignment distribution for each a_i is uniform over all source words and NULL.
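Under these assumptions, marginalizing out the alignments gives p(e | f, m) = ∏_i Σ_j 1/(n+1) · t(e_i | f_j): each target word sums over all source positions, including NULL. A minimal sketch with a made-up translation table:

```python
def model1_likelihood(src, tgt, t):
    """IBM Model 1: p(e | f, m) = prod_i sum_j 1/(n+1) * t(e_i | f_j)."""
    src = ["NULL"] + src          # position 0 is the NULL token
    p = 1.0
    for e in tgt:
        # Uniform alignment over the n+1 source positions.
        p *= sum(t.get((f, e), 0.0) for f in src) / len(src)
    return p

# Toy translation table t(e | f) (illustrative numbers, not learned).
t = {
    ("das", "the"): 0.9, ("Haus", "house"): 0.8,
    ("Haus", "home"): 0.15, ("NULL", "the"): 0.05,
}

print(model1_likelihood(["das", "Haus"], ["the", "house"], t))  # ≈ 0.0844
```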
-
Translating with Model 1
-
Translating with Model 1
Language model says: 🙂
-
Translating with Model 1
Language model says: 🙁
-
Learning Lexical Translation Models
• How do we learn the parameters p(e | f) on the training corpus of (f, e) sentence pairs?
• "Chicken and egg" problem:
  • If we had the alignments, we could estimate the translation probabilities (MLE estimation).
  • If we had the translation probabilities, we could find the most likely alignments (greedy).
-
Expectation-Maximization (EM) Algorithm
• Pick some random (or uniform) starting parameters.
• Repeat until bored (~5 iterations for lexical translation models):
  • Using the current parameters, compute "expected" alignments p(a_i | e, f) for every target word token in the training data.
  • Keep track of the expected number of times f translates into e throughout the whole corpus.
  • Keep track of the number of times f is used in the source of any translation.
  • Use these frequency estimates in the standard MLE equation to get a better set of parameters.
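The loop above can be sketched as EM for Model 1. A minimal version (no NULL token, tiny made-up parallel corpus):

```python
from collections import defaultdict

# Tiny parallel corpus: (foreign, English) sentence pairs.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

# Uniform starting parameters t(e | f).
e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for _ in range(5):  # "repeat until bored"
    count = defaultdict(float)   # expected counts of f -> e
    total = defaultdict(float)   # expected counts of f as a source word
    for fs, es in corpus:
        for e in es:
            # E-step: expected alignments p(a_i | e, f) for this token.
            z = sum(t[(f, e)] for f in fs)
            for f in fs:
                count[(f, e)] += t[(f, e)] / z
                total[f] += t[(f, e)] / z
    # M-step: standard MLE equation with the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[f]

print(round(t[("Haus", "house")], 3))  # should beat t(Haus -> the)
```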
-
EM for IBM Model 1
-
Convergence
-
Extensions: Lexical to Phrase Translation
• Phrase-based MT:
  • Allow multiple words to translate as chunks (including many-to-one).
  • Introduce another latent variable, the source segmentation.
-
Extensions: Alignment Heuristics
• Alignment priors:
  • Instead of assuming the alignment decisions are uniform, impose (or learn) a prior over alignment grids:
Chahuneau et al. (2013)
-
Extensions: Hierarchical Phrase-based MT
• Syntactic structure.
• Rules of the form:
  • X之一 → one of the X
Chiang (2005), Galley et al. (2006)
-
MT Evaluation
• How do we evaluate translation systems' output?
• Central idea: "The closer a machine translation is to a professional human translation, the better it is."
• The most commonly used metric is called BLEU, which is the geometric mean of the n-gram precisions against the human translations, multiplied by a length penalty term.
-
BLEU: An Example
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
Unigram precision: 17/18
Adapted from slides by Arthur Chan
-
Issue of N-gram Precision
• What if some words are over-generated?
  • e.g. "the"
• An extreme example:
  Candidate: the the the the the the the.
  Reference 1: The cat is on the mat.
  Reference 2: There is a cat on the mat.
• N-gram precision: 7/7
• Solution: a reference word should be exhausted after it is matched.
Adapted from slides by Arthur Chan
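The fix above is "clipped" counting: a candidate word is credited at most as many times as it appears in any single reference. A minimal sketch for the extreme example (final period dropped for simplicity):

```python
from collections import Counter
from fractions import Fraction

def clipped_unigram_precision(candidate, references):
    """Each candidate word is credited at most max-reference-count times."""
    cand = Counter(candidate.lower().split())
    refs = [Counter(r.lower().split()) for r in references]
    clipped = sum(min(c, max(r[w] for r in refs)) for w, c in cand.items())
    return Fraction(clipped, sum(cand.values()))

print(clipped_unigram_precision(
    "the the the the the the the",
    ["The cat is on the mat", "There is a cat on the mat"],
))  # 2/7: "the" appears at most twice in any one reference
```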
-
Issue of N-gram Precision
• What if some words are just dropped?
• Another extreme example:
  Candidate: the.
  Reference 1: My mom likes the blue flowers.
  Reference 2: My mother prefers the blue flowers.
• N-gram precision: 1/1
• Solution: add a penalty if the candidate is too short.
Adapted from slides by Arthur Chan
-
BLEU
• Clipped n-gram precisions for N = 1, 2, 3, 4.
• Geometric average.
• Brevity penalty.
• Ranges from 0.0 to 1.0, but usually shown multiplied by 100.
• An increase of +1.0 BLEU is usually a conference paper.
• MT systems usually score in the 10s to 30s.
• Human translators usually score in the 70s and 80s.
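Putting the pieces together — clipped precisions for n = 1..4, their geometric mean, and the brevity penalty — a minimal single-reference sketch (real implementations also handle multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(count, r[g]) for g, count in c.items())
        total = sum(c.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)  # geometric mean

print(bleu("it is a guide to action", "it is a guide to action"))  # 1.0
```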
-
A Short Segue
• Word- and phrase-based ("symbolic") models were cutting edge for decades (up until ~2014).
• Such models are still the most widely used in commercial applications.
• Since 2014 most research on MT has focused on neural models.
-
“Neurons”
-
“Neural” Networks
(Figure: builds of a feed-forward network, ending with a “softmax” output layer.)
-
“Deep”
-
“Recurrent”
-
Design Decisions
• How to represent inputs and outputs?
• Neural architecture?
  • How many layers? (Requires non-linearities to improve capacity!)
  • How many neurons?
  • Recurrent or not?
  • What kind of non-linearities?
-
Representing Language
• "One-hot" vectors:
  • Each position in a vector corresponds to a word type.
• Distributed representations:
  • Vectors encode "features" of input words (character n-grams, morphological features, etc.).
(Figure: "dog" as a one-hot vector with a 1 in the Dog position, versus a short dense distributed vector.)
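A minimal sketch of the two representations (tiny made-up vocabulary and vector values, for illustration only):

```python
# "One-hot": a |V|-dimensional vector with a single 1 at the word's index.
vocab = ["aardvark", "abalone", "abandon", "cat", "dog"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

print(one_hot("dog"))  # [0.0, 0.0, 0.0, 0.0, 1.0]

# Distributed: a short dense vector of learned/derived features.
embedding = {"dog": [0.21, -1.3, 0.05], "cat": [0.18, -1.1, 0.40]}
print(embedding["dog"])
```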
-
Training Neural Networks
• Neural networks are supervised models: you need a set of inputs paired with outputs.
• Algorithm:
  • Run until bored:
    • Give input to the network, see what it predicts.
    • Compute loss(y, y*).
    • Use the chain rule (aka "backpropagation") to compute the gradient with respect to the parameters.
    • Update parameters (SGD, Adam, L-BFGS, etc.).
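The loop above, sketched end to end for a made-up toy problem (XOR) with a one-hidden-layer tanh network; the gradients are hand-derived via the chain rule and applied with plain SGD:

```python
import math, random

random.seed(0)
# Toy supervised data: inputs paired with outputs.
data = [([0., 0.], 0.), ([0., 1.], 1.), ([1., 0.], 1.), ([1., 1.], 0.)]

H = 4  # hidden units
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [math.tanh(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    return h, sum(W2[j] * h[j] for j in range(H)) + b2

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

init_loss, lr = loss(), 0.1
for _ in range(2000):                     # "run until bored"
    for x, t in data:
        h, y = forward(x)                 # see what the network predicts
        dy = 2 * (y - t) / len(data)      # dL/dy for squared-error loss
        for j in range(H):
            dh = dy * W2[j] * (1 - h[j] ** 2)  # chain rule through tanh
            W2[j] -= lr * dy * h[j]            # SGD parameter updates
            W1[j][0] -= lr * dh * x[0]
            W1[j][1] -= lr * dh * x[1]
            b1[j] -= lr * dh
        b2 -= lr * dy

print(init_loss > loss())  # the loss should have gone down
```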
-
Neural Language Models
(Figure: a feed-forward LM — concatenated context word vectors, a tanh hidden layer, and a softmax output over the vocabulary.)
Bengio et al. (2003)
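The architecture in the figure as a forward pass only, in numpy, with tiny made-up dimensions and random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, H = 10, 4, 2, 8   # vocab size, embedding dim, context length, hidden

C = rng.normal(size=(V, d))        # word embedding table
W = rng.normal(size=(H, n * d))    # hidden layer weights
U = rng.normal(size=(V, H))        # output layer weights

def next_word_probs(context_ids):
    x = np.concatenate([C[i] for i in context_ids])  # embed + concatenate
    h = np.tanh(W @ x)                               # tanh hidden layer
    logits = U @ h
    e = np.exp(logits - logits.max())                # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 7])   # p(next word | two context words)
print(round(p.sum(), 6))      # sums to 1: a proper distribution
```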
-
Neural Features for Translation
• Turn Bengio et al. (2003) into a translation model.
• Conditional model: generate the next English word conditioned on
  • the previous n English words you generated,
  • the aligned source word and its m neighbors.
Devlin et al. (2014)
-
(Figure: the same tanh + softmax architecture, with source-side context words added to the input.)
Devlin et al. (2014)
-
Neural Features for Translation
Devlin et al. (2014)
-
Notation Simplification
-
RNNs Revisited
-
Fully Neural Translation
• Fully end-to-end RNN-based translation model:
  • Encode the source sentence using one RNN.
  • Generate the target sentence one word at a time using another RNN.
(Figure: the encoder reads "I am a student"; the decoder emits "je suis étudiant".)
Sutskever et al. (2014)
-
Attentional Model
• The encoder-decoder model struggles with long sentences.
• An RNN is trying to compress an arbitrarily long sentence into a finite-length vector.
• What if we only look at one (or a few) source words when we generate each output word?
Bahdanau et al. (2014)
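The idea in code: score each encoder state against the current decoder state, softmax the scores into weights, and take the weighted sum as the context vector. A numpy sketch with dot-product scores (Bahdanau et al. actually score with a small learned MLP):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 4, 8                          # source length, hidden size
enc = rng.normal(size=(T, d))        # encoder states, one per source word
dec = rng.normal(size=d)             # current decoder state

scores = enc @ dec                   # one score per source word
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: attention weights
context = weights @ enc              # weighted sum of encoder states

print(round(weights.sum(), 6), context.shape)  # 1.0 (8,)
```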
-
The Intuition
Our large black dog bit the poor mailman .
うち の 大きな 黒い 犬 が 可哀想な 郵便屋 に 噛み ついた 。
Bahdanau et al. (2014)
-
The Attention Model
(Figure: an encoder over "I am a student"; at each decoder step the attention model computes a context vector over the encoder states, and the decoder emits "je suis étudiant" one word at a time.)
Bahdanau et al. (2014)
-
Convolutional Encoder-Decoder
• CNN:
  • Encodes words within a fixed-size window.
  • Parallel computation.
  • Shorter paths connecting a wider range of words.
• RNN:
  • Sequentially encodes a sentence from left to right.
  • Hard to parallelize.
Gehring et al. (2017)
-
The Transformer
• Idea: Instead of using an RNN to encode the source sentence and the partial target sentence, use self-attention!
(Figure: a standard RNN encoder versus a self-attention encoder over "I am a student"; both map raw word vectors to word-in-context vectors.)
Vaswani et al. (2017)
-
The Transformer
(Figure: encoder over "I am a student"; the decoder generates "je suis étudiant" using attention-derived context vectors.)
Vaswani et al. (2017)
-
Transformer
• Traditional attention:
  • Query: decoder hidden state.
  • Key and Value: encoder hidden states.
  • Attend to source words based on the current decoder state.
• Self-attention:
  • Query, Key, and Value come from the same sequence.
  • Attend to surrounding source words based on the current source word.
  • Attend to preceding target words based on the current target word.
Vaswani et al. (2017)
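A minimal single-head, unmasked scaled dot-product self-attention sketch in numpy; queries, keys, and values are all projections of the same input sequence:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 4, 8                        # sequence length, model dim
X = rng.normal(size=(T, d))        # one vector per input word

# Q, K, V are projections of the SAME sequence: that's "self"-attention.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)             # (T, T) word-to-word scores
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)        # each row is a distribution
out = A @ V                               # word-in-context vectors

print(A.shape, out.shape)  # (4, 4) (4, 8)
```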
-
Visualization of Attention Weights
• Self-attention weights can detect long-term dependencies within a sentence, e.g., make … more difficult.
-
The Transformer
• Computation is easily parallelizable.
• Shorter path from each target word to each source word → stronger gradient signals.
• Empirically stronger translation performance.
• Empirically trains substantially faster than more serial models.
-
Current Research Directions on Neural MT
• Incorporating syntax into neural MT.
• Handling of morphologically rich languages.
• Optimizing translation quality (instead of corpus probability).
• Multilingual models.
• Document-level translation.