CS388: Natural Language Processing, Lecture 19: Pretrained Transformers (gdurrett/courses/fa2019)
TRANSCRIPT
CS388: Natural Language Processing
Greg Durrett
Lecture 19: Pretrained Transformers
Credit: ???
Administrivia
‣ Project 2 due Tuesday
‣ Presentation day announcements next week
Recall: Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a “query” which then computes attention over each word
‣ Multiple “heads” analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors
‣ Each attention weight is a scalar; each x'_i is a sum of scalar * vector (e.g., x'_4 is a weighted sum over all positions):

\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j) \qquad x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j

\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j) \qquad x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
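The two formulas above can be sketched directly in NumPy. This is a toy illustration: the sequence length, dimension, number of heads, and random parameters are all made up here, and the scaling and output projection of the full Transformer are omitted.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Ws, Vs):
    """X: (n, d) word vectors; Ws, Vs: per-head (d, d) parameter matrices.
    Returns one (n, d) output per head: x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j."""
    heads = []
    for W, V in zip(Ws, Vs):
        alpha = softmax(X @ W @ X.T)      # alpha_{k,i,j}, softmax over j
        heads.append(alpha @ (X @ V.T))   # weighted sum of transformed vectors
    return heads

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # e.g. "the movie was great"
Ws = [rng.normal(size=(8, 8)) for _ in range(2)]  # two heads
Vs = [rng.normal(size=(8, 8)) for _ in range(2)]
heads = multi_head_self_attention(X, Ws, Vs)
```

Each row of `alpha` sums to 1, so each output position is a convex combination of transformed input vectors, matching the "vector = sum of scalar * vector" picture.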
Recall: Transformers
Vaswani et al. (2017)
the movie was great
‣ Augment word embeddings with position embeddings; each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products
‣ Works essentially as well as just encoding position as a one-hot vector
the movie was great + emb(1), emb(2), emb(3), emb(4)
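A minimal sketch of these sinusoidal position embeddings (the sequence length and dimension below are illustrative; the 10000 base is the one used by Vaswani et al.):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """emb(pos)[2i] = sin(pos / 10000^(2i/d)); emb(pos)[2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    emb = np.zeros((n_positions, d_model))
    emb[:, 0::2] = np.sin(angles)  # even dims: sine waves
    emb[:, 1::2] = np.cos(angles)  # odd dims: cosine waves
    return emb

P = sinusoidal_positions(50, 16)
```

Nearby positions get higher dot products, the property the slide highlights: `P[10] @ P[11]` exceeds `P[10] @ P[40]`.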
This Lecture
‣ BERT
‣ GPT/GPT2
‣ Analysis/Visualization
BERT
BERT
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT came out in October 2018
BERT
Devlin et al. (2019)
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
A stunning ballet dancer, Copeland is one of the best performers to see live.
[Figure: the two ELMo directions see “ballet dancer” and “performer” separately; BERT sees “ballet dancer/performer” together]
‣ ELMo reprs look at each direction in isolation; BERT looks at them jointly
BERT
‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?
[Figure: ELMo (language modeling) predicts “visited Madag. yesterday …” left-to-right from “John visited Madagascar yesterday”; BERT sees the whole sentence at once]
‣ Transformer LMs have to be “one-sided” (only attend to previous tokens), not what we want
Masked Language Modeling
Devlin et al. (2019)
‣ How to prevent cheating? Next-word prediction fundamentally doesn't work for bidirectional models; instead, do masked language modeling
‣ BERT formula: take a chunk of text, predict 15% of the tokens
‣ For 80% (of the 15%), replace the input token with [MASK]: John visited [MASK] yesterday → predict Madagascar
‣ For 10%, replace with a random token: John visited of yesterday
‣ For 10%, keep the same: John visited Madagascar yesterday
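The 15% / 80-10-10 recipe above can be sketched as follows (a toy implementation; the tiny vocabulary and the function name are invented for illustration):

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, rng):
    """Pick ~15% of positions to predict; of those, 80% become [MASK],
    10% become a random token, 10% are left unchanged."""
    inputs, targets = list(tokens), {}
    n_pred = max(1, round(0.15 * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_pred):
        targets[i] = tokens[i]             # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK               # e.g. John visited [MASK] yesterday
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # e.g. John visited of yesterday
        # else: keep the token unchanged

    return inputs, targets

rng = random.Random(0)
tokens = "John visited Madagascar yesterday".split()
inputs, targets = mask_for_mlm(tokens, vocab=["of", "the", "and"], rng=rng)
```

The loss is computed only at the selected positions; the other 85% of tokens are pure context.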
Next “Sentence” Prediction
Devlin et al. (2019)
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
[Figure: stacked Transformers over “[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna.”, predicting Madagascar, enjoyed, like, and NotNext]
‣ BERT objective: masked LM + next sentence prediction
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the “true” next one
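The 50/50 pair construction can be sketched like this (the chunk texts and function name are invented for illustration):

```python
import random

def nsp_example(chunks, idx, rng):
    """Pair chunk idx with its true successor half the time (IsNext),
    otherwise with a random other chunk (NotNext)."""
    first = chunks[idx]
    if rng.random() < 0.5:
        return first, chunks[idx + 1], "IsNext"
    # any chunk except the true successor
    second = rng.choice([c for j, c in enumerate(chunks) if j != idx + 1])
    return first, second, "NotNext"

chunks = ["John visited Madagascar yesterday.",
          "He really enjoyed the trip.",
          "I like Madonna."]
rng = random.Random(0)
first, second, label = nsp_example(chunks, 0, rng)
```

The two chunks are then concatenated as [CLS] first [SEP] second, and the [CLS] position predicts the IsNext/NotNext label.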
BERT Architecture
Devlin et al. (2019)
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets pre-trained on a large corpus
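The 110M/340M totals can be roughly reproduced from these hyperparameters. This is a back-of-the-envelope sketch: 30,522 wordpieces and 512 positions are the released BERT values (the slide rounds to 30k), and biases, LayerNorm, and the pooler are ignored.

```python
def approx_bert_params(layers, d, vocab=30522, max_pos=512, segments=2, ff_mult=4):
    """Rough parameter count: embeddings + per-layer attention and FFN weights."""
    embeddings = (vocab + max_pos + segments) * d
    attention = 4 * d * d        # Q, K, V, and output projections
    ffn = 2 * ff_mult * d * d    # d -> 4d -> d feedforward
    return embeddings + layers * (attention + ffn)

base = approx_bert_params(12, 768)    # ~109M, close to the quoted 110M
large = approx_bert_params(24, 1024)  # ~334M, close to the quoted 340M
```

Note that the embedding table alone is ~24M parameters for Base, which is one motivation for sharing it across tasks via pre-training.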
What can BERT do?
Devlin et al. (2019)
‣ The [CLS] token is used to provide classification decisions
‣ BERT can also do tagging by predicting tags at each word piece
‣ Sentence pair tasks (entailment): feed both sentences into BERT
What can BERT do?
‣ How does BERT model this sentence-pair stuff?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn’t really cause this to happen
[Figure: stacked Transformers over “[CLS] A boy plays in the snow [SEP] A boy is outside”, predicting Entails]
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ It is not an autoregressive model; you can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat
‣ Masked language models are intended to be used primarily for “analysis” tasks
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5
‣ Large changes to weights up here (particularly in the last layer, to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean the weights don’t change much
‣ More complex “triangular learning rate” schemes exist
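One such triangular schedule, linear warmup followed by linear decay, can be sketched as follows (the warmup fraction is an illustrative choice; only the 2e-5 peak comes from the slide's range):

```python
def triangular_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

lrs = [triangular_lr(s, total_steps=100) for s in range(101)]
```

The warmup phase keeps early updates small so the pretrained weights are not destroyed before the classifier head has learned anything useful.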
Fine-tuning BERT
Peters, Ruder, Smith (2019)
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Evaluation: GLUE
Wang et al. (2019)
Results
Devlin et al. (2018)
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effective at “sentence pair” tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
RoBERTa
Liu et al. (2019)
‣ “Robustly optimized BERT”
‣ 160GB of data instead of 16GB
‣ Dynamic masking: standard BERT uses the same MASK scheme for every epoch; RoBERTa recomputes them
‣ New training + more data = better performance
GPT/GPT2
OpenAI GPT/GPT2
Radford et al. (2019)
‣ “ELMo with transformers” (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
‣ Because it's a language model, we can generate from it
OpenAI GPT2
slide credit: OpenAI
Open Questions
1) How novel is the stuff being generated? (Is it just doing nearest neighbors on a large corpus?)
2) How do we understand and distill what is learned in this model?
3) How do we harness these priors for conditional generation tasks (summarization, generating a report of a basketball game, etc.)?
4) Is this technology dangerous? (OpenAI has only released the 774M-parameter model, not 1.5B yet)
Grover
Zellers et al. (2019)
‣ Sample from a large language model conditioned on a domain, date, authors, and headline
‣ Humans rank Grover-generated propaganda as more realistic than real “fake news”
‣ NOTE: not a GAN; the discriminator is trained separately from the generator
‣ Fine-tuned Grover can detect Grover propaganda easily; the authors argue for releasing it for this reason
Pre-Training Cost (with Google/AWS)
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
‣ XLNet (BERT variant): $30,000-$60,000 (unclear)
‣ Grover-MEGA: $25,000
‣ BERT: Base $500, Large $7,000
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)
Pushing the Limits
NVIDIA blog (Narasimhan, August 2019)
‣ NVIDIA: trained an 8.3B-parameter GPT model (5.6x the size of GPT-2)
‣ Arguably these models are still underfit: larger models still get better held-out perplexities
Google T5
Raffel et al. (October 23, 2019)
‣ We still haven't hit the limit of bigger data being useful
‣ Colossal Cleaned Common Crawl: 750GB of text
BART
Lewis et al. (October 30, 2019)
‣ Sequence-to-sequence BERT variant: permute/mask/delete tokens, then predict the full sequence autoregressively
‣ For downstream tasks: feed the document into both encoder and decoder, use the decoder hidden state as the output
‣ Good results on dialogue and summarization tasks
Analysis
What does BERT learn?
Clark et al. (2019)
‣ Heads on transformers learn interesting and diverse things: content heads (attend based on content), positional heads (based on position), etc.
What does BERT learn?
Clark et al. (2019)
‣ Still way worse than what supervised systems can do, but interesting that this is learned organically
Probing BERT
Tenney et al. (2019)
‣ Try to predict POS, etc. from each layer, learning mixing weights over layers to form a representation of wordpiece i for task τ
‣ Plot shows the mixing weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
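The layer-mixing step can be sketched as a softmax-weighted sum of per-layer vectors (a toy version: the scalar gamma follows Tenney et al.'s setup, while the example vectors and function name are invented):

```python
import numpy as np

def scalar_mix(layer_reprs, s_logits, gamma=1.0):
    """Mix per-layer representations of one wordpiece for task tau:
    softmax-normalized weights s over layers, scaled by a learned gamma."""
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()
    return gamma * sum(w * h for w, h in zip(s, layer_reprs))

layers = [np.full(4, float(l)) for l in range(3)]  # toy layer vectors 0, 1, 2
mixed = scalar_mix(layers, s_logits=np.zeros(3))   # uniform weights -> mean
```

The learned weights `s` are what the probing plots visualize: a task that puts most of its weight on lower layers is "solved" early in the network.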
Compressing BERT
Michel et al. (2019)
‣ Can remove 60+% of BERT’s heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the parameters of BERT (via knowledge distillation)
Open Questions
‣ These techniques are here to stay; it's unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come purely from academia
‣ BERT-based systems are state-of-the-art for nearly every major text analysis task
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though this cost should come down)