Deep Learning III
CS 760: Machine Learning, Spring 2018
Mark Craven and David Page
www.biostat.wisc.edu/~craven/cs760
Goals for the Lecture
• You should understand the following concepts:
• generative adversarial networks (GANs)
• competitive training
• Nash equilibrium
• long short-term memory (LSTM) networks
• bidirectional LSTMs (BiLSTMs)
• attention
• embeddings
• word embeddings
• Word2Vec
Generative Adversarial Networks
• Combine a deep net such as a CNN with adversarial training
• One network tries to generate data to fool a second network
• The second network tries to discriminate fake from real data
• The objective for one is (roughly) the negative of the other's objective
• Often can generate realistic new data
• Some background on adversarial training…
Prisoners' Dilemma
Two suspects in a major crime are held in separate cells. There is enough evidence to convict each of them of a minor offense, but not enough evidence to convict either of them of the major crime unless one of them acts as an informer against the other (defects). If they both stay quiet, each will be convicted of the minor offense and spend one year in prison. If one and only one of them defects, he will be freed and used as a witness against the other, who will spend four years in prison. If they both defect, each will spend three years in prison.
Players: The two suspects.
Actions: Each player's set of actions is {Quiet, Defect}.
Preferences: Suspect 1's ordering of the action profiles, from best to worst, is (Defect, Quiet) (he defects and suspect 2 remains quiet, so he is freed), (Quiet, Quiet) (he gets one year in prison), (Defect, Defect) (he gets three years in prison), (Quiet, Defect) (he gets four years in prison). Suspect 2's ordering is (Quiet, Defect), (Quiet, Quiet), (Defect, Defect), (Defect, Quiet).
3 represents best outcome, 0 worst, etc.
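A reconstruction of the payoff matrix implied by the preferences above (each cell gives Suspect 1's payoff, Suspect 2's payoff, with 3 = best outcome and 0 = worst):

                      Suspect 2: Quiet   Suspect 2: Defect
Suspect 1: Quiet           2, 2               0, 3
Suspect 1: Defect          3, 0               1, 1

The unique Nash equilibrium is (Defect, Defect), even though both suspects would be better off at (Quiet, Quiet).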
Nash Equilibrium
• A Nash equilibrium is an action profile (one action per player) such that no player can improve their own payoff by unilaterally switching to a different action.
Thanks, Wikipedia.
Another Example
Thanks, Prof. Osborne of U. Toronto, Economics
…And Another Example
Thanks again, Prof. Osborne of U. Toronto, Economics
Minimax with Simultaneous Moves
• maximin value: the largest value a player can be assured of without knowing the other players' actions
• minimax value: the smallest value the other players can force this player to receive without knowing this player's action
• minimax is an upper bound on maximin
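In symbols (a standard formulation, not taken from the slide itself), for player i with utility u_i, own action a_i, and the other players' joint action a_{-i}:

maximin_i = max over a_i of [ min over a_{-i} of u_i(a_i, a_{-i}) ]
minimax_i = min over a_{-i} of [ max over a_i of u_i(a_i, a_{-i}) ]

and maximin_i ≤ minimax_i always holds.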
Key Result
• Utility: numeric reward for actions
• Game: 2 or more players take turns or take simultaneous actions. Moves lead to states, and states have utilities.
• A game is like an optimization problem, but each player tries to maximize their own objective function (utility function)
• Zero-sum game: each player's gain or loss in utility is exactly balanced by the others'
• In a zero-sum game, the Minimax solution is the same as the Nash Equilibrium
Generative Adversarial Networks
• Approach: set up a zero-sum game between deep nets:
• Generator: generate data that looks like the training set
• Discriminator: distinguish between real and synthetic data
• Motivation:
• Building accurate generative models is hard (e.g., learning and sampling from a Markov net or Bayes net)
• Want to use all our great progress on supervised learners to do this unsupervised learning task better
• Deep nets may be our favorite supervised learner, especially for image data, if the nets are convolutional (use the tricks of sliding windows with parameter tying, cross-entropy transfer function, batch normalization)
Does It Work?
Thanks, Ian Goodfellow, NIPS 2016 Tutorial on GANs, for this and most of what follows…
A Bit More on the GAN Algorithm
The Rest of the Details
• Use deep convolutional neural networks for the Discriminator D and the Generator G
• Let x denote the training set and z denote random, uniform input
• Set up a zero-sum game by giving D the following objective (reproduced below), and G the negation of it
• Let D and G compute their gradients simultaneously, each make one step in the direction of its gradient, and repeat until neither can make progress… Minimax
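The objective itself appeared as an equation on the slide; the standard value function from Goodfellow et al. (2014), which D maximizes and G minimizes, is

V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log(1 − D(G(z))) ]

so training solves min over G of max over D of V(D, G) — hence "Minimax".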
Not So Fast
• While the preceding version is theoretically elegant, in practice the gradient for G vanishes before we reach the best practical solution
• While no longer true Minimax, use the same objective for D but change the objective for G to the non-saturating version shown below
• Sometimes better if instead of using one minibatch at a time to compute the gradient and do batch normalization, we also keep a fixed subset of the training set, and use a combination of the fixed subset and the current minibatch
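The modified generator objective was also an equation on the slide; the standard non-saturating heuristic (from the same Goodfellow tutorial) keeps D's objective but has G maximize

E_{z ~ p_z}[ log D(G(z)) ]

instead of minimizing E_{z ~ p_z}[ log(1 − D(G(z))) ], which keeps G's gradient from vanishing when D confidently rejects G's samples.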
Comments on GANs
• Potentially can use our high-powered supervised learners to build better, faster data generators (can they replace MCMC, etc.?)
• While there is some nice theory based on Nash Equilibria, results are better in practice if we move a bit away from the theory
• In general, many in the ML community have a strong concern that we don't really understand why deep learning works, including GANs
• Still much research into figuring out why this works better than other generative approaches for some types of data, how we can improve performance further, and how to take these from image data to other data types where CNNs might not be the most natural deep network structure
Long Short-Term Memory Networks (LSTMs)
• Recurrent neural networks are well-suited to sequence data
• Learning from sentences or other NLP
• Learning from temporal sequences
• Recurrent neural networks model human short-term memory
• Take into account recent thought processes, in addition to new input
• The last hidden state and the current input state both influence the new hidden state and the new output
• Sometimes the start of a sentence is relevant to the ending output, so we may want to keep early events in short-term memory a long time
• Garden-path sentences, when the output at each step is a part of speech:
• "The complex houses married and single soldiers and their families."
Recall RNNs (thanks Ed Choi, Jimeng Sun)
• Recurrent Neural Network (RNN)
• Binary classification
Hidden layer h_1:
h_1 = σ(W_i^T x_1)
Input x_1 (first element in the sequence, "The")
Recurrent Neural Networks
• Recurrent Neural Network (RNN)
• Binary classification
[Diagram: inputs x_1 ("The") and x_2 ("patient") feed hidden states h_1 and h_2]
h_2 = σ(W_h^T h_1 + W_i^T x_2)
Recurrent Neural Networks
• Recurrent Neural Network (RNN)
• Binary classification
[Diagram: the chain continues through x_9 ("knee") and x_10 ("."), producing hidden states h_9 and h_10]
h_10 = σ(W_h^T h_9 + W_i^T x_10)
Recurrent Neural Networks
• Recurrent Neural Network (RNN)
• Binary classification
Output: ŷ = σ(w_o^T h_10), an outcome between 0.0 and 1.0
[Diagram: the same chain of hidden states h_1 … h_10, with weights W_i (input-to-hidden), W_h (hidden-to-hidden), and w_o (hidden-to-output)]
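A minimal sketch of this forward pass in NumPy, assuming one-hot inputs of dimension V, hidden dimension H, and the weight names W_i, W_h, w_o used above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_i, W_h, w_o):
    """Run a simple RNN over a sequence and return a binary prediction.

    xs  : list of input vectors x_1 ... x_T (each of shape (V,))
    W_i : input-to-hidden weights, shape (V, H)
    W_h : hidden-to-hidden weights, shape (H, H)
    w_o : hidden-to-output weights, shape (H,)
    """
    h = np.zeros(W_h.shape[0])                  # h_0 = 0
    for x in xs:
        h = sigmoid(W_h.T @ h + W_i.T @ x)      # h_t = σ(W_h^T h_{t-1} + W_i^T x_t)
    return sigmoid(w_o @ h)                     # ŷ = σ(w_o^T h_T), in (0, 1)

# toy usage: a 10-step sequence over a vocabulary of size 5
V, H = 5, 4
rng = np.random.default_rng(0)
xs = [np.eye(V)[rng.integers(V)] for _ in range(10)]
y_hat = rnn_forward(xs, rng.normal(size=(V, H)), rng.normal(size=(H, H)), rng.normal(size=H))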
Long Chain Equals Deep Learning
• Backpropagation must work its way back through the sequence: backpropagation through time (BPTT)
• Harder to learn long-distance relationships
• LSTMs and related architectures try to make this easier by making long short-term memory the default, so that effort is required to forget
LSTM Cell in a Picture (Graves, Jaitly, Mohamed, 2013)
LSTM Equations (GJM, 2013)
Remember that the input, hidden, output, sigmoids (σ), and tanh's are all vectors. Weight matrices from cell to gate, such as W_ci, are all diagonal, so element m of a gate vector receives input only from element m of the cell vector.
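The equations themselves were shown as an image; the standard LSTM formulation from Graves, Jaitly, and Mohamed (2013), with the diagonal peephole (cell-to-gate) weights described above, is:

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)             (input gate)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)             (forget gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)   (cell state)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)                 (output gate)
h_t = o_t ⊙ tanh(c_t)                                             (hidden output)

where ⊙ denotes element-wise multiplication.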
Bidirectional LSTM (BiLSTM) in a Picture (GJM, 2013)
BLSTM with Attention, in a Picture (thanks Zhou, Shi, Tian, Qi, Li, Hao, Xu, ACL 2016)
Comments on BLSTMs with Attention
• Can have one y or a different output y_i at each time/location in the sequence
• At each time/location in the sequence, choose a (possibly different) subset of times to attend to for computing the output
• BLSTMs with Attention are now the best approach for many NLP tasks (Manning, 2017)
• BLSTMs and other RNN models require more sequential processing than CNNs; therefore there is less speedup with GPUs for RNNs than for CNNs, though the speedup is still substantial
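A generic sketch of the attention computation referenced above (my own formulation of the usual weighted-sum attention, not copied from the Zhou et al. slide): given BLSTM hidden states h_1, …, h_T,

e_t = score(h_t)                          (a learned scoring function, e.g. w^T tanh(W_a h_t))
α_t = exp(e_t) / Σ_{t'} exp(e_{t'})       (softmax attention weights over positions)
c   = Σ_t α_t h_t                         (attended summary used to compute the output y)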
Word embeddings
• standard representation for words is a one-hot (1-of-k) encoding
[Figure: one-hot vectors over a vocabulary (aardvark, anteater, apple, …, cat, …, dog, …, zymergy); each word's vector has a single 1 in its own position and 0s elsewhere, e.g. the vectors shown for "cat" and "dog"]
• this encoding doesn’t capture any information about the semantic similarity among words which may be useful for many NLP tasks
w_cat · w_dog = 0
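A tiny NumPy illustration of this point, assuming a toy 6-word vocabulary (chosen purely for illustration):

import numpy as np

vocab = ["aardvark", "anteater", "apple", "cat", "dog", "zymergy"]
V = len(vocab)

def one_hot(word):
    """Return the 1-of-k encoding of a word as a length-V vector."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# the dot product of any two distinct one-hot vectors is 0:
# the encoding carries no notion of semantic similarity
print(one_hot("cat") @ one_hot("dog"))   # 0.0
print(one_hot("cat") @ one_hot("cat"))   # 1.0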
Word embeddings
• How can we find dense, lower-dimensional vectors that capture semantic similarity among words?
w_cat = [.26, .01, .37, …, .07]
w_dog = [.24, .04, .38, …, .13]
• The key is to learn to model the context around words. Firth 1957: “You shall know a word by the company it keeps.”
“…what if your dog has cold paws…”
The skip-gram model
• learn to predict context words in a window around a given word
• collect a large corpus of text
• define a window size s
• construct training instances that map from word → other word in window
“…what if your dog has cold paws…”
dog → what
dog → if
dog → your
dog → has
dog → cold
dog → paws
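A minimal sketch of constructing these pairs in Python, assuming a whitespace-tokenized corpus and window size s (the function and variable names here are illustrative, not from the slides):

def skipgram_pairs(tokens, s):
    """Yield (center_word, context_word) training pairs with window size s."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - s), min(len(tokens), i + s + 1)):
            if j != i:
                yield center, tokens[j]

tokens = "what if your dog has cold paws".split()
# window size s = 3, restricted to center word "dog", reproduces the pairs shown above
pairs = [p for p in skipgram_pairs(tokens, 3) if p[0] == "dog"]
# [('dog', 'what'), ('dog', 'if'), ('dog', 'your'), ('dog', 'has'), ('dog', 'cold'), ('dog', 'paws')]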
The skip-gram model
[Network diagram: the input word x_t is a one-hot encoding using a V-dimensional vector; the weight matrix W (V×N) maps it to a dense encoding of the word using an N-dimensional vector; a second weight matrix W′ (N×V) maps that hidden vector to each output position, a one-hot encoding of a context word x_{t−s}, …, x_{t+s} using a V-dimensional vector]
The skip-gram model
[Diagram: x_t → W (V×N) → hidden layer → W′ (N×V) → a single context output, e.g. x_{t−1}]
• same weights are used for all context positions, so effectively:
• the network has one set of output units
• training instances are pairs (dog → what)
The skip-gram model
E(w) = −(1/T) Σ_{t=1}^{T}  Σ_{−s ≤ j ≤ s, j ≠ 0}  log P(x_{t+j} | x_t)
• objective function: learn weights that minimize negative log probability (maximize probability) of words in context for each given word
The skip-gram model
P(x_o | x_c) = exp(w_{x_o}^T w_{x_c}) / Σ_x exp(w_x^T w_{x_c})
• output units use the softmax function to compute P(x_o | x_c)
• hidden units are linear
• since the input vector uses a one-hot encoding, it effectively selects a column of the weight matrix (denoted by w_{x_c})
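A small NumPy sketch of this forward computation, assuming a vocabulary of size V, embedding size N, an input weight matrix W (V×N), and an output weight matrix W_prime (N×V) as in the diagrams above (names are illustrative):

import numpy as np

def skipgram_forward(center_index, W, W_prime):
    """Return P(context word | center word) over the whole vocabulary.

    center_index : index of the center word in the vocabulary
    W            : V x N input weight matrix (one embedding per word)
    W_prime      : N x V output weight matrix
    """
    h = W[center_index]                    # the one-hot input picks out a single vector of W
    scores = h @ W_prime                   # one score per vocabulary word
    scores -= scores.max()                 # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # softmax over the vocabulary

# toy usage: V = 6 words, N = 3 dimensions
rng = np.random.default_rng(0)
probs = skipgram_forward(3, rng.normal(size=(6, 3)), rng.normal(size=(3, 6)))
print(probs.sum())   # 1.0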
Using word embeddings
• after training, it's the hidden layer (equivalently, the first layer of weights) we care about
• the hidden units provide a dense, low-dimensional representation for each word
• we can use these low-dimensional vectors in any NLP application (e.g. as the input representation for a deep network)
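A short sketch of how the learned vectors might be pulled out and compared, assuming a trained V×N matrix W as in the sketch above and an illustrative vocabulary list:

import numpy as np

def embedding(word, vocab, W):
    """Look up a word's dense vector: its row of the trained input weight matrix W."""
    return W[vocab.index(word)]

def cosine(u, v):
    """Cosine similarity; unlike one-hot vectors, semantically similar words can score near 1."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

vocab = ["aardvark", "anteater", "apple", "cat", "dog", "zymergy"]
W = np.random.default_rng(0).normal(size=(len(vocab), 3))   # stand-in for trained weights
print(cosine(embedding("cat", vocab, W), embedding("dog", vocab, W)))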
An alternative: the Continuous Bag of Words (CBOW) model
• learn to predict a word given its context
[Network diagram: each context word x_{t−s}, …, x_{t+s} is a V-dimensional one-hot input mapped by the shared matrix W (V×N) to the hidden layer; W′ (N×V) then maps the hidden layer to the predicted center word x_t]
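One standard way to write the CBOW computation (a sketch consistent with the diagram above, not copied from the slides), in which the hidden layer averages the context words' embeddings:

h = (1 / 2s) Σ_{−s ≤ j ≤ s, j ≠ 0} W^T x_{t+j}
P(x_t | context) = softmax(W′^T h)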
word2vec
• skip-gram model developed by Google [Mikolov et al. 2013]
• V = 692k
• N = 300
• trained on a billion-word corpus
• includes additional tricks to improve training time