

Deep Learning III

CS 760: Machine Learning, Spring 2018

Mark Craven and David Page

www.biostat.wisc.edu/~craven/cs760


Goals for the Lecture

• You should understand the following concepts:

• generative adversarial networks (GANs)
• competitive training
• Nash equilibrium

• long short-term memory (LSTM) networks
• bidirectional LSTMs (BiLSTMs)
• attention

• embeddings
• word embeddings
• Word2Vec



Generative Adversarial Networks

• Combine a deep net such as a CNN with adversarial training

• One network tries to generate data to fool a second network

• The second network tries to discriminate fake from real data

• The objective for one is (roughly) the negative of the other's objective

• Often can generate realistic new data

• Some background on adversarial training…

Prisoners' Dilemma

Two suspects in a major crime are held in separate cells. There is enough evidence to convict each of them of a minor offense, but not enough evidence to convict either of them of the major crime unless one of them acts as an informer against the other (defects). If they both stay quiet, each will be convicted of the minor offense and spend one year in prison. If one and only one of them defects, he will be freed and used as a witness against the other, who will spend four years in prison. If they both defect, each will spend three years in prison.

Players: The two suspects.

Actions: Each player's set of actions is {Quiet, Defect}.

Preferences: Suspect 1's ordering of the action profiles, from best to worst, is (Defect, Quiet) (he defects and suspect 2 remains quiet, so he is freed), (Quiet, Quiet) (he gets one year in prison), (Defect, Defect) (he gets three years in prison), (Quiet, Defect) (he gets four years in prison). Suspect 2's ordering is (Quiet, Defect), (Quiet, Quiet), (Defect, Defect), (Defect, Quiet).


3 represents the best outcome, 0 the worst, etc.
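Based on the preference orderings above (3 = best outcome, 0 = worst), the payoff matrix being shown is:

                        Suspect 2: Quiet     Suspect 2: Defect
Suspect 1: Quiet             2, 2                 0, 3
Suspect 1: Defect            3, 0                 1, 1

(Defect, Defect) is the unique Nash equilibrium: holding the other suspect's action fixed, neither suspect can improve his payoff by switching to Quiet.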

Nash Equilibrium

Thanks, Wikipedia.


Another Example

Thanks, Prof. Osborne of U. Toronto, Economics

…And Another Example

Thanks again, Prof. Osborne of U. Toronto, Economics


Minimax with Simultaneous Moves

• maximin value: largest value a player can be assured of without knowing the other player's actions

• minimax value: smallest value the other players can force this player to receive without knowing this player's action

• minimax is an upper bound on maximin
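As a small illustration of these definitions, here is a sketch in Python for a two-player game given as a payoff matrix for the row player; the matrix values are made up purely for illustration.

# Payoff matrix for the row player (toy values).
payoff = [
    [3, 1],
    [0, 2],
]

# maximin: the row player picks the action whose worst-case payoff is largest.
maximin = max(min(row) for row in payoff)

# minimax: the smallest value the column player can hold the row player to,
# assuming the row player responds as well as possible to each column.
minimax = min(max(payoff[i][j] for i in range(len(payoff)))
              for j in range(len(payoff[0])))

print(maximin, minimax)   # prints "1 2"; maximin <= minimax, as stated above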

Key Result

• Utility: numeric reward for actions

• Game: 2 or more players take turns or take simultaneous actions. Moves lead to states, and states have utilities.

• A game is like an optimization problem, but each player tries to maximize their own objective function (utility function)

• Zero-sum game: each player's gain or loss in utility is exactly balanced by the others'

• In a zero-sum game, the Minimax solution is the same as the Nash Equilibrium


Generative Adversarial Networks

• Approach: set up a zero-sum game between deep nets to
• Generator: generate data that looks like the training set
• Discriminator: distinguish between real and synthetic data

• Motivation:
• Building accurate generative models is hard (e.g., learning and sampling from a Markov net or Bayes net)

• Want to use all our great progress on supervised learners to do this unsupervised learning task better

• Deep nets may be our favorite supervised learner, especially for image data, if the nets are convolutional (use tricks of sliding windows with parameter tying, cross-entropy transfer function, batch normalization)

Does It Work?

Thanks, Ian Goodfellow, NIPS 2016 Tutorial on GANs, for this and most of what follows…


A Bit More on the GAN Algorithm

The Rest of the Details

• Use deep convolutional neural networks for Discriminator D and Generator G

• Let x denote the training set and z denote random, uniform input

• Set up the zero-sum game by giving D the following objective, and G the negation of it:

• Let D and G compute their gradients simultaneously, each make one step in the direction of the gradient, and repeat until neither can make progress… Minimax
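The objective referred to here is the standard GAN value function from Goodfellow et al. (2014):

V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z}[ log(1 − D(G(z))) ]

D is trained to maximize V and G to minimize it (equivalently, G's objective is the negation of D's).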


Not So Fast

• While the preceding version is theoretically elegant, in practice the gradient for G vanishes before we reach the best practical solution

• While no longer true Minimax, use the same objective for D but change the objective for G to:

• Sometimes better if, instead of using one minibatch at a time to compute the gradient and do batch normalization, we also have a fixed subset of the training set, and use a combination of the fixed subset and the current minibatch
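The modified objective for G mentioned above is, in the standard non-saturating formulation, to maximize log D(G(z)) rather than minimize log(1 − D(G(z))). A minimal PyTorch-style sketch of the resulting training loop is below; the small fully connected networks, optimizer settings, and random stand-in data are illustrative assumptions (the slides use convolutional nets on real images).

import torch
import torch.nn as nn

# Tiny stand-in networks; the slides use deep convolutional nets instead.
d = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
g = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784))

opt_d = torch.optim.Adam(d.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    x = torch.randn(64, 784)   # stand-in for a minibatch of real training data
    z = torch.rand(64, 100)    # random, uniform input to the generator

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0,
    # i.e., maximize log D(x) + log(1 - D(G(z))).
    d_loss = bce(d(x), torch.ones(64, 1)) + bce(d(g(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating): push D(G(z)) toward 1,
    # i.e., maximize log D(G(z)), which keeps gradients from vanishing early on.
    g_loss = bce(d(g(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()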

Comments on GANs

• Potentially can use our high-powered supervised learners to build better, faster data generators (can they replace MCMC, etc.?)

• While there is some nice theory based on Nash Equilibria, we get better results in practice if we move a bit away from the theory

• In general, many in the ML community have a strong concern that we don't really understand why deep learning works, including GANs

• Still much research into figuring out why this works better than other generative approaches for some types of data, how we can improve performance further, and how to take these from image data to other data types where CNNs might not be the most natural deep network structure


Long Short-Term Memory Networks (LSTMs)

• Recurrent neural networks are well-suited to sequence data
• Learning from sentences or other NLP
• Learning from temporal sequences

• Recurrent neural networks model human short-term memory
• Take into account recent thought processes, in addition to new input
• Last hidden state and current input state both influence the new hidden state and new output

• Sometimes the start of a sentence is relevant to the ending output, so we may want to keep early events in short-term memory a long time

• Garden path sentences, when the output at each step is a part of speech
• "The complex houses married and single soldiers and their families."

Recall RNNs (thanks Ed Choi, Jimeng Sun)

• Recurrent Neural Network (RNN)
• Binary classification

Hidden layer: h_1 = σ(W_i^T x_1)

Input: x_1 (first element in the sequence, "The")


Recurrent Neural Networks

• Recurrent Neural Network (RNN)
• Binary classification

[Figure: RNN unrolled over the first two inputs, x_1 = "The" and x_2 = "patient"]

h_2 = σ(W_h^T h_1 + W_i^T x_2)

Recurrent Neural Networks

• Recurrent Neural Network (RNN)
• Binary classification

[Figure: RNN unrolled over the full sequence x_1 = "The", x_2 = "patient", …, x_9 = "knee", x_10 = "."]

h_10 = σ(W_h^T h_9 + W_i^T x_10)


Recurrent Neural Networks

• Recurrent Neural Network (RNN)
• Binary classification

[Figure: the same unrolled RNN with an output unit reading h_10; weights W_i (input → hidden), W_h (hidden → hidden), w_o (hidden → output)]

Output: ŷ = σ(w_o^T h_10), an outcome between 0.0 and 1.0
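A minimal NumPy sketch of this unrolled computation, following the equations on these slides; the dimensions and random stand-in inputs are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_in, d_h = 50, 32                       # input (word-vector) and hidden sizes
Wi = 0.1 * np.random.randn(d_in, d_h)    # input-to-hidden weights
Wh = 0.1 * np.random.randn(d_h, d_h)     # hidden-to-hidden weights
wo = 0.1 * np.random.randn(d_h)          # hidden-to-output weights

xs = [np.random.randn(d_in) for _ in range(10)]   # stand-ins for x_1 .. x_10

h = np.zeros(d_h)
for x in xs:
    h = sigmoid(Wh.T @ h + Wi.T @ x)     # h_t = sigma(Wh^T h_{t-1} + Wi^T x_t)

y_hat = sigmoid(wo @ h)                  # binary-classification output in (0, 1)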

Long Chain Equals Deep Learning

• Backpropagation must work its way back through the sequence: backpropagation through time (BPTT)

• Harder to learn long-distance relationships

• LSTMs and related architectures try to make this easier by making long short-term memory the default, so effort is required to forget


LSTM Cell in a Picture (Graves, Jaitly, Mohamed, 2013)

LSTM Equations (GJM, 2013)

Remember that the input, hidden, and output vectors, sigmoids (σ), and tanh's are all vectors. Weight matrices from cell to gate, such as W_ci, are all diagonal, so element m of a gate vector receives input only from element m of the cell vector.
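For reference, the LSTM equations in the Graves, Jaitly, and Mohamed (2013) formulation (⊙ denotes element-wise multiplication) are:

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where i, f, and o are the input, forget, and output gates, c is the cell, and W_ci, W_cf, W_co are the diagonal cell-to-gate ("peephole") matrices noted above.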


Bidirectional LSTM (BiLSTM) in a Picture (GJM, 2013)

BLSTM with Attention, in a Picture (thanks Zhou, Shi, Tian, Qi, Li, Hao, Xu, ACL 2016)


Comments on BLSTMs with Attention

• Can have one y or a different output y_i at each time/location in the sequence

• At each time/location in the sequence, choose a (possibly different) subset of times to attend to for computing the output

• BLSTMs with Attention are now the best approach for many NLP tasks (Manning, 2017)

• BLSTMs and other RNN models require more sequential processing than CNNs; therefore there is less speedup from GPUs with RNNs than with CNNs, though the speedup is still substantial
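As a rough sketch of the attention computation these slides refer to: score each BiLSTM hidden state, normalize the scores with a softmax, and use the weighted sum of the hidden states to compute the output. This follows the general pattern of Zhou et al. (2016) rather than their exact implementation; the scoring vector w_a and all dimensions below are illustrative assumptions.

import numpy as np

T, d_h = 10, 64
H = np.random.randn(T, d_h)      # stand-ins for BiLSTM hidden states h_1 .. h_T
w_a = np.random.randn(d_h)       # learned attention scoring vector (assumed)

scores = np.tanh(H) @ w_a                        # one score per time step
alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights (softmax)
context = alpha @ H                              # weighted sum of hidden states
# `context` is then fed to the output layer in place of a single hidden state.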

Word embeddings

• standard representation for words is a one-hot (1-of-k) encoding

[Example: one-hot vectors over the vocabulary (aardvark, anteater, apple, …, cat, …, dog, …, zymergy) — each word's vector has a single 1 in its own position and 0s everywhere else]

• this encoding doesn't capture any information about the semantic similarity among words, which may be useful for many NLP tasks

w_cat · w_dog = 0
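A tiny illustration of this point, using words from the vocabulary above: one-hot vectors for two distinct words are orthogonal, so their dot product is 0 no matter how similar the words are semantically.

import numpy as np

vocab = ["aardvark", "anteater", "apple", "cat", "dog", "zymergy"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("cat") @ one_hot("dog"))   # 0.0: no similarity information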


Word embeddings

• How can we find dense, lower-dimensional vectors that capture semantic similarity among words?

cat → (.26, .01, .37, …, .07)
dog → (.24, .04, .38, …, .13)

• The key is to learn to model the context around words. Firth 1957: “You shall know a word by the company it keeps.”

“…what if your dog has cold paws…”

The skip-gram model

• learn to predict context words in a window around a given word
• collect a large corpus of text
• define a window size s
• construct training instances that map from word → other word in window

“…what if your dog has cold paws…”

dog → what

dog → if

dog → your

dog → has

dog → cold

dog → paws
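A minimal sketch of this instance-construction step; with window size s = 3 it reproduces the six "dog → …" pairs listed above.

def skipgram_pairs(tokens, s):
    """Map each word to every other word within a window of size s."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - s), min(len(tokens), i + s + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "what if your dog has cold paws".split()
pairs = skipgram_pairs(tokens, 3)
# pairs includes ('dog', 'what'), ('dog', 'if'), ('dog', 'your'),
# ('dog', 'has'), ('dog', 'cold'), ('dog', 'paws'), among others.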


The skip-gram model

[Figure: skip-gram network — a one-hot input word x_t mapped through weight matrix W (V×N) to a dense hidden layer, then through W′ (N×V) to output units for each context position x_{t−s}, …, x_{t+s}]

• one-hot encoding of each context word using a V-dimensional vector
• one-hot encoding of the input word using a V-dimensional vector
• dense encoding of the word using an N-dimensional vector

The skip-gram model

[Figure: the input word x_t and a single context position, connected by W (V×N) and W′ (N×V)]

• same weights used for all context positions, so effectively
• the network has one set of output units
• training instances are pairs (dog → what)


The skip-gram model

[Figure: skip-gram network, as before]

E = −(1/T) Σ_{t=1}^{T}  Σ_{−s ≤ j ≤ s, j ≠ 0}  log P(x_{t+j} | x_t)

• objective function: learn weights that minimize the negative log probability (maximize the probability) of the words in the context of each given word

The skip-gram model

[Figure: skip-gram network, as before]

P(x_o | x_i) = exp(w′_{x_o}^T w_{x_i}) / Σ_x exp(w′_x^T w_{x_i})

• output units use a softmax function to compute P(x_o | x_i)
• hidden units are linear
• since the input vector uses a one-hot encoding, it effectively selects a column of the weight matrix (denoted by w_{x_i})
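A minimal NumPy sketch of this forward pass; the vocabulary size V and embedding size N are illustrative assumptions. The one-hot input simply picks out the word's dense N-dimensional vector from W, and a softmax over the W′ scores gives P(x_o | x_i).

import numpy as np

V, N = 10000, 300
W = 0.01 * np.random.randn(V, N)         # input-side word vectors  (V x N)
W_prime = 0.01 * np.random.randn(N, V)   # output-side word vectors (N x V)

def p_context_given_word(i):
    h = W[i]                              # linear hidden layer: word i's vector
    scores = h @ W_prime                  # one score per candidate context word
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()                    # P(x_o | x_i) for every word o

probs = p_context_given_word(42)          # distribution over the whole vocabulary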


Using word embeddings

• after training, it's the hidden layer (equivalently, the first layer of weights) we care about
• the hidden units provide a dense, low-dimensional representation for each word
• we can use these low-dimensional vectors in any NLP application (e.g., as the input representation for a deep network)
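A small illustration of this: after training, row i of the first weight matrix W is word i's dense vector, and cosine similarity between rows can serve as a semantic-similarity measure. The tiny W and vocabulary below are made-up stand-ins for a trained model (the cat/dog values echo the earlier slide).

import numpy as np

vocab = ["cat", "dog", "apple"]
W = np.array([[0.26, 0.01, 0.37, 0.07],    # "cat"  (values echoing the earlier slide)
              [0.24, 0.04, 0.38, 0.13],    # "dog"
              [0.90, 0.62, 0.05, 0.44]])   # "apple" (made-up)

def embedding(word):
    return W[vocab.index(word)]

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embedding("cat"), embedding("dog")))    # ~0.99: similar words
print(cosine(embedding("cat"), embedding("apple")))  # ~0.54: less related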


An alternative: the Continuous Bag of Words (CBOW) model

• learn to predict a word given its context

[Figure: CBOW network — one-hot context words x_{t−s}, …, x_{t+s} each mapped through the shared weight matrix W (V×N) to the hidden layer, then through W′ (N×V) to predict the center word x_t]


word2vec

• skip-gram model developed by Google [Mikolov et al. 2013]
• V = 692k
• N = 300
• trained on a billion-word corpus
• includes additional tricks to improve training time