
Domain Adaptation for Neural Machine Translation

Kevin Duh, Johns Hopkins University

May 2019

Outline

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
4. Promising Research Directions

Domain Adaptation Problem: Machine Learning Perspective

• Training data:
  – (x1, y1), (x2, y2), (x3, y3), …, i.i.d. samples from distribution D
  – Build model p(y|x)

• If test data is not from D, p(y|x) may be operating in a space it wasn't built for.
  – Two cases for what we mean by "not from D"

Visualization: Fitting p(y|x)

[Scatter plot: training points in (x, y) space with a fitted curve p(y|x).]

Case 1: Test is not in input domain (Covariate Shift)

[Plot: fitted p(y|x) with a test sample xt outside the training input range.]

Where do you predict yt?

Case 2: Input-output relation changes

[Plot: test sample xt inside the training input range; we expect yt on the fitted curve, but unexpectedly yt is elsewhere.]

Examples in Machine Translation (MT)

• Domain mismatch example:
  – Training data consists of Patent sentences
  – Test sample is Social Media

• Case 1: Test is not in input domain
  – can translate technical words like "NMT"
  – no idea how to translate "OMG"

• Case 2: Input-output relation changes
  – "CAT" translates to a word that means "Computer Aided Translation" rather than "cute furry animal"

"Domain" is a fuzzy concept in practice

• Corpora differ by:
  – Topics: patents, politics, news, medicine
  – Style: formal, informal
  – Modality: written, spoken

• Often use "domain" to refer to data source:
  – e.g. Europarl, OpenSubtitles, TED, Paracrawl

• Both case 1 and case 2 mismatches occur at multiple levels: lexical, syntactic, etc.

Example sentences (case 1): which is Patent, TED, Subtitles, Europarl?

1. We live in a digital world, but we're fairly analog creatures.
2. The tablets exhibit improved bioavailability of the active ingredient.
3. So, um… she's kidding.
4. Resumption of the session

Example bitext (case 2)

Why is Domain Adaptation an important problem in MT?

• It may be expensive to obtain training bitexts that are both large and relevant to the test domain

• Often have to work with whatever we can get

Relevance to test domain × Data Size:

                Small   Large
  Irrelevant            ✔
  Relevant      ✔       ✔✔

Terminology (1 of 2)

• Example: Test domain is Social Media
• In-domain data
  – Data that is relevant to the test domain: SNS corpus
• Out-of-domain data
  – Data that is less relevant to the test domain: Europarl
• General-domain data
  – May be used interchangeably with Out-of-Domain
  – May mean a mixed corpus:
    • Europarl + Patent + TED
    • Europarl + Patent + TED + SNS

Terminology (2 of 2)

• Supervised adaptation methods:
  – Assume OOD bitext & ID bitext

• Unsupervised adaptation methods:
  – Assume OOD bitext & ID monotext

Relevance to test domain × Data Size:

                Small             Large
  Irrelevant                      Out-of-Domain (OOD)
  Relevant      In-Domain (ID)

Outline

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
4. Promising Research Directions

A taxonomy of domain adaptation methods for NMT

From: Chenhui Chu and Rui Wang, A Survey of Domain Adaptation for Neural Machine Translation, COLING 2018

Data-centric adaptation methods

Note: I'll present an assortment of methods, picked mainly to demonstrate the variety but not necessarily representative of the literature.

Synthetic Data Augmentation (forward or back translation)

[Diagram: a Large Out-of-Domain Bitext trains an Out-of-Domain NMT Model, which translates In-domain Monotext (src or trg side) to produce synthetic src-trg pairs.]
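The recipe above can be sketched in a few lines. This is a toy illustration, not the pipeline from any specific paper: `make_synthetic_bitext` and `toy_translate` are hypothetical names, and the string-reversing "model" merely stands in for a real out-of-domain NMT system.

```python
def make_synthetic_bitext(monotext, translate, direction="back"):
    """Turn in-domain monolingual text into synthetic bitext.

    direction="back": monotext is target-side; the model translates
    target -> source, giving (synthetic source, real target) pairs.
    direction="forward": monotext is source-side; pairs are
    (real source, synthetic target).
    """
    bitext = []
    for sent in monotext:
        hyp = translate(sent)
        if direction == "back":
            bitext.append((hyp, sent))   # synthetic src, real trg
        else:
            bitext.append((sent, hyp))   # real src, synthetic trg
    return bitext

# Toy stand-in for an out-of-domain target->source model.
def toy_translate(sentence):
    return " ".join(reversed(sentence.split()))

mono = ["OMG that is great", "see you tomorrow"]
synthetic = make_synthetic_bitext(mono, toy_translate, direction="back")
```

The resulting pairs are then mixed into (or fine-tuned on top of) the out-of-domain training data; back-translation keeps the real text on the target side, which is why it tends to help output fluency.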

Filtering Out-of-Domain Bitext for relevant data subsets (esp. for case 2)

[Diagram: a relevance model, e.g. an n-gram language model trained on the Small In-domain Bitext, scores and filters the Large Out-of-Domain Bitext.]

Robert C. Moore and William Lewis. Intelligent selection of language model training data. ACL 2010

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. EMNLP 2011

Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. Adaptation data selection using neural language models: Experiments in machine translation. ACL 2013

Marcin Junczys-Dowmunt. Dual conditional cross-entropy filtering of noisy parallel corpora. WMT 2018
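Moore-Lewis-style selection (the first reference above) scores each candidate sentence by the cross-entropy difference between an in-domain LM and a general-domain LM, keeping sentences that look most in-domain. A minimal sketch with add-one-smoothed unigram LMs (the function names are my own, and real systems use much stronger language models):

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Add-one-smoothed unigram log-prob function over a token corpus."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(sentence):
        return sum(math.log((counts[t] + 1) / (total + vocab))
                   for t in sentence.split())
    return logprob

def moore_lewis_rank(candidates, in_domain, general):
    """Rank candidates by cross-entropy difference H_ID(s) - H_GEN(s);
    lower difference = more in-domain-like, so it sorts first."""
    lp_id, lp_gen = unigram_lm(in_domain), unigram_lm(general)
    def xent(lp, s):
        return -lp(s) / max(len(s.split()), 1)
    return sorted(candidates, key=lambda s: xent(lp_id, s) - xent(lp_gen, s))

in_domain = ["the tablet exhibits improved bioavailability",
             "the patent claims a method"]
general = ["resumption of the session", "so she is kidding"]
candidates = ["the patent claims improved tablet", "she is kidding"]
ranked = moore_lewis_rank(candidates, in_domain, general)
```

Here the patent-like candidate ranks first; in practice one keeps the top-k sentences of the out-of-domain bitext under this score.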

Training-objective-centric methods

Continued Training (a.k.a. fine-tuning)

[Diagram: a Randomly Initialized NMT Model is trained on the Large Out-of-Domain Bitext to give an Out-of-Domain NMT Model, which is then trained on the Small In-domain Bitext to give the Continued-Training NMT Model.]

This seems to be the first citation on NMT continued training: Minh-Thang Luong and Chris Manning. Stanford Neural Machine Translation Systems for Spoken Language Domains. IWSLT 2015

Continued Training: General Domain → Patent Domain

[Bar chart: BLEU (0–70 scale) for German, Korean, Russian, and Chinese, comparing General-Domain, In-Domain, and Continued-Training models; continued training improves BLEU by margins ranging from +0.4 to +10.1.]

Results from JHU SCALE Workshop 2018: Resilient Machine Translation in New Domains

General algorithm:
1. Train model to convergence on dataset A (A = OOD bitext)
2. Continue training on dataset B (B = in-domain bitext)

Continued Training Variants:
• Details of learning rate, etc. in step 2 matter
• Adding a regularization term or fixing subnetworks in step 2
  – Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. Regularization techniques for fine-tuning in neural machine translation. EMNLP 2017
  – Huda Khayrallah, Brian Thompson, Kevin Duh, and Philipp Koehn. Regularized training objective for continued training for domain adaptation in neural machine translation. WNMT 2018
  – Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, Philipp Koehn. Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation. WMT 2018

• Different ways to mix data (e.g. A+B in step 2) or order data
  – Chenhui Chu, Raj Dabre, and Sadao Kurohashi. An empirical comparison of domain adaptation methods for neural machine translation. ACL 2017
  – Marlies van der Wees, Arianna Bisazza, and Christof Monz. Dynamic data selection for neural machine translation. EMNLP 2017
  – Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. Denoising neural machine translation training with trusted data and online data selection. WMT 2018
  – Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat and Kevin Duh. Curriculum Learning for Domain Adaptation in Neural Machine Translation. NAACL 2019

• Ensembling out-of-domain model and continued-trained model:
  – Markus Freitag and Yaser Al-Onaizan. 2016. Fast Domain Adaptation for Neural Machine Translation. arXiv abs/1612.06897.
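The two-step algorithm, and the effect of the regularization-term variant, can be seen even in a toy scalar model. This is a sketch under invented assumptions (a one-parameter model fit by SGD, with an L2 penalty pulling the weight back toward the out-of-domain solution); it is not the objective of any one paper above:

```python
def sgd_fit(data, w0=0.0, lr=0.1, steps=200, anchor=None, reg=0.0):
    """Minimize mean squared error of a scalar weight w to the data,
    optionally with an L2 penalty reg * (w - anchor)^2 pulling w
    toward `anchor` (e.g. the out-of-domain weights)."""
    w = w0
    for _ in range(steps):
        grad = sum(2 * (w - y) for y in data) / len(data)
        if anchor is not None:
            grad += 2 * reg * (w - anchor)
        w -= lr * grad
    return w

# Step 1: train to convergence on out-of-domain data A.
w_ood = sgd_fit([0.0, 2.0])                              # converges near 1.0
# Step 2: continue training on in-domain data B, anchored to w_ood.
w_id = sgd_fit([10.0], w0=w_ood, anchor=w_ood, reg=1.0)  # lands between 1 and 10
```

Without the anchor term, step 2 drives the weight all the way to the in-domain optimum, forgetting the out-of-domain solution entirely; the regularizer trades in-domain fit against retention.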

Instance Weighting

Boxing Chen, Colin Cherry, George Foster, Samuel Larkin. Cost Weighting for Neural Machine Translation Domain Adaptation. WNMT 2017

Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, Eiichiro Sumita. Instance Weighting for Neural Machine Translation Domain Adaptation. EMNLP 2017
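The common core of these methods is scaling each sentence pair's training loss by a domain-relevance weight, so in-domain-like examples dominate the gradient. A minimal sketch (the softmax-style weight transform and both function names are my own illustration, not the exact scheme of either paper):

```python
import math

def relevance_weights(scores, temperature=1.0):
    """Map domain-relevance scores (higher = more in-domain-like)
    to positive per-instance weights whose mean is 1."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [len(scores) * e / z for e in exps]

def weighted_loss(losses, weights):
    """Instance-weighted training objective: each sentence pair's
    loss is scaled by its domain-relevance weight."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

scores = [2.0, 0.0]   # e.g. from a Moore-Lewis-style relevance model
weights = relevance_weights(scores)
loss = weighted_loss([1.0, 1.0], weights)
```

Keeping the mean weight at 1 leaves the overall loss scale (and hence the effective learning rate) unchanged while shifting emphasis between examples.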

Architecture/Decoder-Centric methods

• (I prefer to consider the two types of methods together, since it is sometimes arbitrary to differentiate what's part of the whole architecture and what's only part of the decoding process)

Fusion of two models

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

Translation model vs. Language model

• Problem: Language model can be too strong

• One solution:

Toan Nguyen and David Chiang. Improving Lexical Choice in Neural Machine Translation. NAACL 2018 (note this paper addresses the general problem of improper lexical choice, but this is a frequent problem in domain adaptation)

Averaged source word representation at decode time t
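The fusion idea above — combining a translation model with an in-domain language model — can be illustrated with shallow fusion, where the two models' next-token scores are interpolated in log space with a weight beta that controls how strong the LM is allowed to be. A toy sketch over explicit probability dictionaries (my own simplification; the actual work fuses full neural models during beam search):

```python
import math

def shallow_fusion(tm_probs, lm_probs, beta=0.3):
    """Combine translation-model and language-model next-token
    distributions: score = log p_TM + beta * log p_LM, renormalized."""
    fused = {tok: math.log(tm_probs[tok])
                  + beta * math.log(lm_probs.get(tok, 1e-10))
             for tok in tm_probs}
    m = max(fused.values())                      # for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in fused.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

tm = {"cat": 0.6, "CAT": 0.4}   # general translation model prefers "cat"
lm = {"CAT": 0.9, "cat": 0.1}   # in-domain LM prefers "CAT"
fused = shallow_fusion(tm, lm, beta=1.0)
```

With a large beta the in-domain LM flips the decision toward "CAT"; shrinking beta toward 0 recovers the translation model alone, which is one way to keep the LM from being "too strong".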

Other adaptation methods

Adaptation at the token level: Subword Regularization

Taku Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018

Train over different subword segmentations, randomly sampled
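The idea of training on multiple randomly sampled segmentations can be mimicked with a crude sampler. Note this is only a stand-in: Kudo's method samples from a unigram-LM subword lattice, whereas this toy just flips a coin at each character boundary:

```python
import random

def sample_segmentation(word, rng, p_split=0.3):
    """Randomly split a word into subword pieces by independently
    choosing split points -- a toy stand-in for sampling segmentations
    from a subword lattice as in subword regularization."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < p_split:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

rng = random.Random(0)           # fixed seed for reproducibility
seg1 = sample_segmentation("translation", rng)
seg2 = sample_segmentation("translation", rng)
```

Feeding a different segmentation of the same sentence at each epoch acts as a regularizer and exposes the model to subword units it might need in a new domain.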

Outline

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
4. Promising Research Directions

The truth is, it's hard to analyze MT errors

• It was hard to pinpoint why a translation was incorrect in the SMT days
• It's perhaps even harder for NMT
• But we try anyway. At least it gives a way to think about the problem.

S4 Analysis (originally developed for SMT)

• SEEN error: Never seen this source word before in the training data (case 1 in the 1st part of this talk)

• SENSE error: The source word appears in the training data, but is not used in this sense (e.g. case 2)

• SCORE error: The source word and its translation appear in the training data, but the correct translation is scored lower

• SEARCH error: The correct translation is scored higher, but somehow got lost in the search process

Ann Irvine, John Morgan, Marine Carpuat, Hal Daume III, and Dragos Munteanu. 2013. Measuring machine translation errors in new domains. Transactions of the Association for Computational Linguistics (TACL)

S4 Analysis (requires reference and alignment)


S4 Analysis (for NMT?)

• First, run an external word aligner to determine correct/incorrect words (but how much can we trust this?)

• SEEN error:
  – Check out-of-vocabulary words on the source side
• SENSE error:
  – For a given source word f, check if the desired translation never appears on the target side of the training bitext where f appears?
• SCORE error:
  – When none of the above is true?
• SEARCH error:
  – Not sure how to check besides increasing beam, but..
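The first three checks above can be sketched as a decision cascade over aligned words. This is my own toy reading of the slide, not the procedure from Irvine et al.; `s4_classify` and its inputs are hypothetical names, and SEARCH errors are deliberately left out since the slide notes they are hard to detect:

```python
def s4_classify(src_word, ref_word, hyp_word, train_bitext):
    """Toy S4-style diagnosis for one aligned source word.

    train_bitext: list of (source tokens, target tokens) pairs.
    Returns "SEEN", "SENSE", "SCORE", or "OK".
    """
    # SEEN: the source word never occurs in the training data.
    if not any(src_word in src for src, _ in train_bitext):
        return "SEEN"
    # SENSE: the word occurs, but never alongside the desired translation.
    if not any(src_word in src and ref_word in trg
               for src, trg in train_bitext):
        return "SENSE"
    # SCORE: the correct translation was available but the model chose another.
    if hyp_word != ref_word:
        return "SCORE"
    return "OK"

bitext = [(["CAT", "tool"], ["computer", "aided", "translation", "tool"])]
```

For example, "OMG" with this bitext is a SEEN error, while "CAT" with desired translation "cat" (the animal) is a SENSE error.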

Fluently Inadequate Translations

• Source: (TED talk)

• Un-adapted system output: I'm afraid I'm not going to have to go to bed.

• Gloss: Crow parents seem to be teaching their young these skills.

• Adapted system output: And their parents also taught their children how to do it.

• Translations that are fluent but have nothing to do with the source are very dangerous!

Marianna J. Martindale, Marine Carpuat, Kevin Duh and Paul McNamee. Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation. MT Summit 2019

Outline

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
4. Promising Research Directions (my opinion)

Personalized Adaptation, e.g. for Computer Assisted Translation (CAT)

• CAT presents many interesting opportunities for research (with real user impact!)

• Example interface at Lilt.com

Paul Michel, Graham Neubig. Extreme Adaptation for Personalized Neural Machine Translation. ACL 2018

Sachith Sri Ram Kothur, Rebecca Knowles and Philipp Koehn. Document-Level Adaptation for Neural Machine Translation. WNMT 2018

Adaptation to New Languages

• Given bitext in language pairs A->B, C->D
  – Build a translator for A->D
  – Build a translator for E->B, where E is a related language to A
  – Assumes some shared representation; can use continued training, etc.

• Crazy idea but potentially large impact

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. EMNLP 2016; Graham Neubig and Junjie Hu. Rapid Adaptation of Neural Machine Translation to New Languages. EMNLP 2018

Understanding adaptation errors as a way to understand NMT behavior

• Domain Adaptation provides a good testbed for understanding overfitting, etc.

• What triggers a fluently inadequate translation?

• Why does catastrophic forgetting happen? Thompson et al. NAACL 2019; Saunders et al. ACL 2019; Kirkpatrick et al. PNAS 2017

Better adaptation algorithms

• Many opportunities for new ideas in NMT, e.g.:
• Batch vs. Online setup
  – Online setup makes curriculum learning easier
• Easy to design new architectures
  – Multitask learning for sharing parameters & data
• What ideas to borrow from SMT?
  – Modularization of lexical choice, reordering, syntax, etc.

Re-cap

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
4. Promising Research Directions

Case 1: Test is not in input domain (Covariate Shift)

[Plot: fitted p(y|x) with a test sample xt outside the training input range.]

Where do you predict yt?

Case 2: Input-output relation changes

[Plot: test sample xt inside the training input range; we expect yt on the fitted curve, but unexpectedly yt is elsewhere.]

Why is Domain Adaptation an important problem in MT?

• Expensive to obtain training bitexts that are both large and relevant to the test domain

Relevance to test domain × Data Size:

                Small   Large
  Irrelevant            ✔
  Relevant      ✔       ✔✔

A taxonomy of domain adaptation methods for NMT

From: Chenhui Chu and Rui Wang, A Survey of Domain Adaptation for Neural Machine Translation, COLING 2018

Continued Training

[Diagram: a Randomly Initialized NMT Model is trained on the Large Out-of-Domain Bitext, then continued on the Small In-domain Bitext to give the Continued-Training NMT Model.]

Outline

1. Problem definition
2. Survey of adaptation methods
3. Error Analysis
   – S4 & Fluently Inadequate Translations
4. Promising Research Directions
   – CAT, new-language adaptation, adaptation as a window to understand NMT, etc.
