CSE 490U: Deep Learning, Spring 2016
Yejin Choi
Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer
Human Neurons
• Switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4–10^5
• Scene recognition time: ~0.1 seconds
• Number of cycles per scene recognition? ~100 → much parallel computation!
Perceptron as a Neural Network
This is one neuron:
– Input edges x1 ... xn, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g
Sigmoid Neuron
Just change g!
• Why would we want to do this?
• Notice the new output range [0,1]. What was it before?
• Look familiar?
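To make the swap concrete, here is a minimal sketch (NumPy; the example weights and names are ours, not from the slides) contrasting the perceptron's hard threshold with the sigmoid:

```python
import numpy as np

def step(a):
    """Perceptron activation: hard threshold at 0, output in {0, 1}."""
    return float(a >= 0)

def sigmoid(a):
    """Sigmoid activation: smooth, differentiable, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

w0, w = -1.5, np.array([1.0, 1.0])   # an example unit
x = np.array([1.0, 1.0])
a = w0 + w @ x                       # pre-activation sum: 0.5
print(step(a), sigmoid(a))           # 1.0 vs ~0.62: same decision, but graded
```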
(Recap equations from earlier lectures: feature maps, the kernel trick, and PAC sample complexity)

Φ(x) = [x^(1), ..., x^(n), x^(1)x^(2), x^(1)x^(3), ..., e^(x^(1)), ...]

∂L/∂w = w − Σ_j α_j y_j x_j

Φ(u)·Φ(v) = [u_1, u_2]·[v_1, v_2] = u_1 v_1 + u_2 v_2 = u·v

Φ(u)·Φ(v) = [u_1^2, u_1 u_2, u_2 u_1, u_2^2]·[v_1^2, v_1 v_2, v_2 v_1, v_2^2]
          = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2
          = (u_1 v_1 + u_2 v_2)^2
          = (u·v)^2

Φ(u)·Φ(v) = (u·v)^d

P(error_true(h) ≥ ε) ≤ |H| e^(−mε) ≤ δ
ln(|H| e^(−mε)) ≤ ln δ
ln|H| − mε ≤ ln δ
m ≥ (ln|H| + ln(1/δ)) / ε
ε ≥ (ln|H| + ln(1/δ)) / m

Optimizing a neuron
We train to minimize the sum-squared error:

l(w) = (1/2) Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)]^2
∂l/∂w_i = −Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)] · ∂/∂w_i g(w_0 + Σ_i w_i x_i^j)
Solution just depends on g’: derivative of activation function!
Sigmoid units: we have to differentiate g

∂l/∂w_i = −Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)] · ∂/∂w_i g(w_0 + Σ_i w_i x_i^j)

∂/∂w_i g(w_0 + Σ_i w_i x_i^j) = x_i^j g′(w_0 + Σ_i w_i x_i^j)

using the chain rule ∂/∂x f(g(x)) = f′(g(x)) g′(x), and, for the sigmoid,

g′(x) = g(x)(1 − g(x))
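As a sanity check on this derivation, a minimal NumPy sketch (variable names and the toy data are our own) of batch gradient descent for a single sigmoid unit under the sum-squared error:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_step(w0, w, X, y, eta=0.1):
    """One gradient step on l = 1/2 * sum_j (y^j - g(w0 + w.x^j))^2."""
    z = sigmoid(w0 + X @ w)        # predictions g(w0 + sum_i w_i x_i^j)
    err = y - z                    # (y^j - g(...))
    gprime = z * (1 - z)           # g'(x) = g(x)(1 - g(x))
    # dl/dw_i = -sum_j err_j * g'_j * x_i^j ; for w0 the input is 1
    dw = -(err * gprime) @ X
    dw0 = -(err * gprime).sum()
    return w0 - eta * dw0, w - eta * dw

# toy data: learn OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 1.])
w0, w = 0.0, np.zeros(2)
for _ in range(2000):
    w0, w = grad_step(w0, w, X, y)
print(np.round(sigmoid(w0 + X @ w)))   # should print [0. 1. 1. 1.]
```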
Perceptron, linear classification, Boolean functions: x_i ∈ {0, 1}
• Can it learn x1 ∨ x2?  • −0.5 + x1 + x2
• Can it learn x1 ∧ x2?  • −1.5 + x1 + x2
• Can it learn any conjunction or disjunction?
  • disjunction: −0.5 + x1 + … + xn
  • conjunction: (−n + 0.5) + x1 + … + xn
• Can it learn majority?  • (−0.5·n) + x1 + … + xn
• What are we missing? The dreaded XOR!, etc.
Going beyond linear classification: Solving the XOR problem

y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2) = −1.5 + 2·x1 − x2
v2 = (x2 ∧ ¬x1) = −1.5 + 2·x2 − x1
y = v1 ∨ v2 = −0.5 + v1 + v2

[Network diagram: inputs x1, x2 and a bias input 1 feed hidden units v1 (weights 2, −1, −1.5) and v2 (weights −1, 2, −1.5); v1, v2 and the bias feed output y (weights 1, 1, −0.5).]
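A quick check (a sketch, not from the slides) that these hand-set weights really compute XOR with hard-threshold units:

```python
def step(a):
    # hard-threshold activation: fires iff the weighted sum is >= 0
    return 1 if a >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        v1 = step(-1.5 + 2 * x1 - x2)   # x1 AND (NOT x2)
        v2 = step(-1.5 + 2 * x2 - x1)   # x2 AND (NOT x1)
        y = step(-0.5 + v1 + v2)        # v1 OR v2
        print(x1, x2, y)                # prints the XOR truth table: 0, 1, 1, 0
```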
Hidden layer
• Single unit: out(x) = g(w_0 + Σ_i w_i x_i)
• 1-hidden layer: out(x) = g(w_0 + Σ_k w_k g(w_0^(k) + Σ_i w_i^(k) x_i))
• No longer a convex function!
Example data for NN with hidden layer
Learned weights for hidden layer
Why “representation learning”?
• MaxEnt (multinomial logistic regression):
y = softmax(w · f(x, y))
– You design the feature vector
• NNs:
y = softmax(w · φ(Ux))
y = softmax(w · φ(U^(n) φ(··· φ(U^(2) φ(U^(1) x)))))
– Feature representations are “learned” through the hidden layers
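The contrast can be made concrete with a small sketch (shapes, seeds, and names are our assumptions): each layer U^(k) re-represents its input before the final softmax.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

phi = np.tanh                    # any elementwise nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=5)           # raw input
U1 = rng.normal(size=(8, 5))     # first hidden layer
U2 = rng.normal(size=(8, 8))     # second hidden layer
w = rng.normal(size=(3, 8))      # class weights (3 classes)

h = phi(U2 @ phi(U1 @ x))        # learned feature representation of x
y = softmax(w @ h)               # y = softmax(w . phi(U2 phi(U1 x)))
print(y, y.sum())                # a distribution over the 3 classes
```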
Very deep models in computer vision
RECURRENT NEURAL NETWORKS

[Diagram: an unrolled RNN with inputs x_1 … x_n and hidden states h_1 … h_n]
Recurrent Neural Networks (RNNs)
• Each RNN unit computes a new hidden state using the previous state and a new input
• Each RNN unit (optionally) makes an output using the current hidden state
• Hidden states are continuous vectors
– Can represent very rich information
– Possibly the entire history from the beginning
• Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

h_t = f(x_t, h_{t-1}),  h_t ∈ R^D
y_t = softmax(V h_t)
Recurrent Neural Networks (RNNs)
• Generic RNNs:
h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)
• Vanilla RNN:
h_t = tanh(U x_t + W h_{t-1} + b)
y_t = softmax(V h_t)

[Diagram: the unrolled RNN with inputs x_1 … x_n and hidden states h_1 … h_n]
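A minimal NumPy sketch of the vanilla RNN recurrence (dimensions, initialization, and the random input sequence are our assumptions):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

D, E, V_out = 4, 3, 5                    # hidden size, input size, vocab size
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, E))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(V_out, D))

h = np.zeros(D)                          # h_0
for x_t in rng.normal(size=(6, E)):      # a length-6 input sequence
    h = np.tanh(U @ x_t + W @ h + b)     # h_t = tanh(U x_t + W h_{t-1} + b)
    y = softmax(V @ h)                   # y_t = softmax(V h_t)
    print(y.argmax())                    # note: U, W, b, V are shared across steps
```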
Many uses of RNNs
1. Classification (seq to one)
• Input: a sequence
• Output: one label (classification)
• Example: sentiment classification

h_t = f(x_t, h_{t-1})
y = softmax(V h_n)

[Diagram: inputs x_1 … x_n feed h_1 … h_n; only the final state h_n produces the label y]
2. one to seq
• Input: one item
• Output: a sequence
• Example: image captioning

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: a single input x_1 (e.g., an image) feeds h_1 … h_n, which emit outputs: “Cat sitting on top of ….”]
Many uses of RNNs
3. sequence tagging
• Input: a sequence
• Output: a sequence (of the same length)
• Example: POS tagging, Named Entity Recognition
• How about Language Models?
– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: inputs x_1 … x_n feed h_1 … h_n; every state produces an output]
Many uses of RNNs
4. Language models
• Input: a sequence of words
• Output: the next word, or a sequence of next words
• During training, x_t is the actual word in the training sentence.
• During testing, x_t is the word predicted from the previous time step.
• Do RNN LMs make the Markov assumption?
– i.e., does the next word depend only on the previous N words?

[Diagram: inputs x_1 … x_n feed h_1 … h_n; each state predicts the next word]
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)
• Input: a sequence
• Output: a sequence (of different length)
• Examples?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: an encoder reads x_1 … x_n into hidden states h_1 … h_n; a decoder then generates an output sequence of a different length]
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)
• Conversation and Dialogue
• Machine Translation

Figure from http://www.wildml.com/category/conversational-agents/
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)

Parsing! – “Grammar as a Foreign Language” (Vinyals et al., 2015)

[Diagram: the encoder reads “John has a dog”; the decoder emits the linearized parse tree]
Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• LSTMs (Long Short-Term Memory Networks):

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)

There are many known variations to this set of equations!

c_t: cell state, h_t: hidden state
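One way to write a single LSTM step directly from these equations (a sketch: parameter shapes and the dict layout are our assumptions, and real implementations differ in detail):

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; P holds U, W, b for gates f, i, o and candidate c."""
    f = sigma(P['Uf'] @ x_t + P['Wf'] @ h_prev + P['bf'])   # forget gate
    i = sigma(P['Ui'] @ x_t + P['Wi'] @ h_prev + P['bi'])   # input gate
    o = sigma(P['Uo'] @ x_t + P['Wo'] @ h_prev + P['bo'])   # output gate
    c_tilde = np.tanh(P['Uc'] @ x_t + P['Wc'] @ h_prev + P['bc'])
    c = f * c_prev + i * c_tilde     # c_t = f_t * c_{t-1} + i_t * c~_t (elementwise)
    h = o * np.tanh(c)               # h_t = o_t * tanh(c_t)
    return h, c

D, E = 4, 3
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(D, E) if k.startswith('U') else (D, D))
     for k in ('Uf', 'Wf', 'Ui', 'Wi', 'Uo', 'Wo', 'Uc', 'Wc')}
P.update({k: np.zeros(D) for k in ('bf', 'bi', 'bo', 'bc')})

h, c = np.zeros(D), np.zeros(D)
for x_t in rng.normal(size=(5, E)):
    h, c = lstm_step(x_t, h, c, P)
print(h, c)
```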
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

[Diagram: one LSTM cell maps (c_{t-1}, h_{t-1}, x_t) to (c_t, h_t)]

Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
sigmoid: [0,1]; tanh: [-1,1]

Forget gate: forget the past or not
f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))

Input gate: use the input or not
i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))

New cell content (temp):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))

New cell content: mix the old cell with the new temp cell
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t

Output gate: output from the new cell or not
o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))

Hidden state:
h_t = o_t ∘ tanh(c_t)

[Diagram: the gates annotated on the LSTM cell, mapping (c_{t-1}, h_{t-1}, x_t) to (c_t, h_t)]

Figures by Christopher Olah (colah.github.io)
Vanishing gradient problem for RNNs
• The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
• The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.

Example from Graves 2012
Preservation of gradient information by LSTM
• For simplicity, all gates are either entirely open (‘O’) or closed (‘—’).
• The memory cell ‘remembers’ the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

[Diagram labels: Forget gate, Input gate, Output gate]

Example from Graves 2012
Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• GRUs (Gated Recurrent Units):

z_t = σ(U^(z) x_t + W^(z) h_{t-1} + b^(z))
r_t = σ(U^(r) x_t + W^(r) h_{t-1} + b^(r))
h̃_t = tanh(U^(h) x_t + W^(h) (r_t ∘ h_{t-1}) + b^(h))
h_t = (1 − z_t) ∘ h_{t-1} + z_t ∘ h̃_t

Fewer parameters than LSTMs. Easier to train for comparable performance!
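The corresponding GRU step, again as a sketch with assumed shapes; note there is no separate cell state and only three parameter sets instead of four:

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, P):
    z = sigma(P['Uz'] @ x_t + P['Wz'] @ h_prev + P['bz'])   # update gate
    r = sigma(P['Ur'] @ x_t + P['Wr'] @ h_prev + P['br'])   # reset gate
    h_tilde = np.tanh(P['Uh'] @ x_t + P['Wh'] @ (r * h_prev) + P['bh'])
    return (1 - z) * h_prev + z * h_tilde   # interpolate old and new state

D, E = 4, 3
rng = np.random.default_rng(1)
P = {k: rng.normal(scale=0.1, size=(D, E) if k.startswith('U') else (D, D))
     for k in ('Uz', 'Wz', 'Ur', 'Wr', 'Uh', 'Wh')}
P.update({k: np.zeros(D) for k in ('bz', 'br', 'bh')})

h = np.zeros(D)
for x_t in rng.normal(size=(5, E)):
    h = gru_step(x_t, h, P)
print(h)
```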
Recursive Neural Networks
• Sometimes, inference over a tree structure makes more sense than a sequential structure
• An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

Recursive Neural Networks
• NNs connected as a tree
• Tree structure is fixed a priori
• Parameters are shared, similarly to RNNs

Example from Iyyer et al., 2014
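A sketch of the core composition in the style of standard recursive nets (the exact form in Iyyer et al. differs): every internal node combines its children's vectors with the same shared parameters.

```python
import numpy as np

D = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 2 * D))   # shared across all tree nodes
b = np.zeros(D)

def compose(node):
    """node is either a leaf vector or a (left, right) pair of subtrees."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    # parent vector from the concatenated child vectors, shared W and b
    return np.tanh(W @ np.concatenate([compose(left), compose(right)]) + b)

# a tiny fixed parse tree: ((w1, w2), w3)
w1, w2, w3 = (rng.normal(size=D) for _ in range(3))
root = compose(((w1, w2), w3))
print(root)   # the root vector can feed a classifier (e.g., a bias label)
```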
LEARNING: BACKPROPAGATION
Error Backpropagation
• Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}, or for brevity: θ = {w_ij, w_jk, w_kl}

[Diagram: a feedforward network with inputs x_0, x_1, x_2, …, x_P, weight layers w_ij, w_jk, w_kl, and output f(x, θ)]

Next 10 slides on backpropagation are adapted from Andrew Rosenberg
Error Backpropagation
• Model parameters: θ = {w_ij, w_jk, w_kl}
• Let a and z be the input and output of each node

[Diagram: the same network, annotated: z_i → w_ij → a_j, z_j → w_jk → a_k, z_k → w_kl → a_l, z_l]
Error Backpropagation
• Let a and z be the input and output of each node:

a_j = Σ_i w_ij z_i    z_j = g(a_j)
a_k = Σ_j w_jk z_j    z_k = g(a_k)
a_l = Σ_k w_kl z_k    z_l = g(a_l)
Training: minimize loss

Empirical Risk Function:

R(θ) = (1/N) Σ_{n=1..N} L(y_n, f(x_n))
     = (1/N) Σ_{n=1..N} (1/2) (y_n − f(x_n))^2
     = (1/N) Σ_{n=1..N} (1/2) (y_n − g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i}))))^2
Error Backpropagation
Optimize the last layer weights w_kl, with L_n = (1/2)(y_n − f(x_n))^2. By the calculus chain rule:

∂R/∂w_kl = (1/N) Σ_n [∂L_n/∂a_{l,n}] [∂a_{l,n}/∂w_kl]
         = (1/N) Σ_n [∂ (1/2)(y_n − g(a_{l,n}))^2 / ∂a_{l,n}] [∂(z_{k,n} w_kl)/∂w_kl]
         = (1/N) Σ_n [−(y_n − z_{l,n}) g′(a_{l,n})] z_{k,n}
         = (1/N) Σ_n δ_{l,n} z_{k,n}

where δ_{l,n} = −(y_n − z_{l,n}) g′(a_{l,n}).
Error Backpropagation
Repeat for all previous layers:

∂R/∂w_kl = (1/N) Σ_n [∂L_n/∂a_{l,n}] [∂a_{l,n}/∂w_kl] = (1/N) Σ_n [−(y_n − z_{l,n}) g′(a_{l,n})] z_{k,n} = (1/N) Σ_n δ_{l,n} z_{k,n}

∂R/∂w_jk = (1/N) Σ_n [∂L_n/∂a_{k,n}] [∂a_{k,n}/∂w_jk] = (1/N) Σ_n [Σ_l δ_{l,n} w_kl g′(a_{k,n})] z_{j,n} = (1/N) Σ_n δ_{k,n} z_{j,n}

∂R/∂w_ij = (1/N) Σ_n [∂L_n/∂a_{j,n}] [∂a_{j,n}/∂w_ij] = (1/N) Σ_n [Σ_k δ_{k,n} w_jk g′(a_{j,n})] z_{i,n} = (1/N) Σ_n δ_{j,n} z_{i,n}

(recall a_j = Σ_i w_ij z_i and z_j = g(a_j))
Backprop Recursion: each δ_j is computed from the next layer's δ_k through the weights w_jk, sweeping backwards through the network.

Learning: Gradient Descent

w^{t+1}_ij = w^t_ij − η ∂R/∂w_ij
w^{t+1}_jk = w^t_jk − η ∂R/∂w_jk
w^{t+1}_kl = w^t_kl − η ∂R/∂w_kl
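Putting the recursion and the update rule together, a sketch of training this three-weight-layer network with sigmoid units everywhere (shapes, learning rate, and the toy data are our assumptions):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid; g'(a) = g(a)(1 - g(a))

rng = np.random.default_rng(0)
P, J, K = 3, 5, 4                        # input width, two hidden widths; 1 output
Wij = rng.normal(scale=0.5, size=(P, J))
Wjk = rng.normal(scale=0.5, size=(J, K))
Wkl = rng.normal(scale=0.5, size=(K, 1))
X = rng.normal(size=(20, P))             # toy data
y = (X[:, :1] > 0).astype(float)         # toy labels

eta, N = 0.5, len(X)
for _ in range(500):
    # forward sweep: store all intermediate values
    zj = g(X @ Wij); zk = g(zj @ Wjk); zl = g(zk @ Wkl)
    # backward sweep: deltas, reusing stored values (dynamic programming)
    dl = -(y - zl) * zl * (1 - zl)       # delta_l = -(y - z_l) g'(a_l)
    dk = (dl @ Wkl.T) * zk * (1 - zk)    # delta_k = sum_l delta_l w_kl g'(a_k)
    dj = (dk @ Wjk.T) * zj * (1 - zj)    # delta_j = sum_k delta_k w_jk g'(a_j)
    # gradient descent: w <- w - eta * dR/dw, with dR/dw = (1/N) sum_n delta*z
    Wkl -= eta * zk.T @ dl / N
    Wjk -= eta * zj.T @ dk / N
    Wij -= eta * X.T @ dj / N
print(np.mean((zl > 0.5) == (y > 0.5)))  # training accuracy on the toy data
```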
Backpropagation
• Starts with a forward sweep to compute all the intermediate function values
• Through backprop, computes the partial derivatives recursively
• A form of dynamic programming
– Instead of considering the exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.
• A type of automatic differentiation. (There are other variants, e.g., recursive differentiation only through forward propagation.)
Backpropagation toolkits and their primary interface languages:
• TensorFlow (https://www.tensorflow.org/) – Python
• Torch (http://torch.ch/) – Lua
• Theano (http://deeplearning.net/software/theano/) – Python
• CNTK (https://github.com/Microsoft/CNTK) – C++
• cnn (https://github.com/clab/cnn) – C++
• Caffe (http://caffe.berkeleyvision.org/) – C++
Cross Entropy Loss (aka log loss, logistic loss)
• Cross Entropy (p: true prob, q: predicted prob):

H(p, q) = E_p[−log q] = H(p) + D_KL(p‖q)
H(p, q) = −Σ_y p(y) log q(y)

• Related quantities
– Entropy: H(p) = −Σ_y p(y) log p(y)
– KL divergence (the distance between two distributions p and q): D_KL(p‖q) = Σ_y p(y) log (p(y)/q(y))

• Use Cross Entropy for models that should have a more probabilistic flavor (e.g., language models)
• Use Mean Squared Error loss for models that focus on correct/incorrect predictions: MSE = (1/2)(y − f(x))^2
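In code, with p the true distribution and q the model's prediction (a sketch; the example distributions are ours):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_y p(y) log p(y)"""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y)"""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(p||q) = sum_y p(y) log(p(y)/q(y))"""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution
# the identity from the slide: H(p, q) = H(p) + D_KL(p||q)
print(cross_entropy(p, q), entropy(p) + kl(p, q))   # the two values agree
```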
RNN Learning: Backprop Through Time (BPTT)
• Similar to backprop with non-recurrent NNs
• But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
• Backprop the gradients of the parameters of each unit as if they were different parameters
• When updating the parameters using the gradients, use the average gradients throughout the entire chain of units.
Convergence of backprop
• Without non-linearity or hidden layers, learning is convex optimization
– Gradient descent reaches global minima
• Multilayer neural nets (with nonlinearity) are not convex
– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– NNs have made a HUGE comeback in the last few years
• Neural nets are back with a new name
– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs
Overfitting in NNs
• Are NNs likely to overfit?
– Yes, they can represent arbitrary functions!!!
• Avoiding overfitting?
– More training data
– Fewer hidden nodes / better topology
– Random perturbation to the graph topology (“Dropout”) – see the sketch after this list
– Regularization
– Early stopping
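A sketch of the dropout idea mentioned above, in its common “inverted dropout” form (the scaling convention is our assumption, not from the slides):

```python
import numpy as np

def dropout(z, keep_prob=0.5, rng=np.random.default_rng(0)):
    """Randomly zero each hidden activation during training; scale the
    survivors by 1/keep_prob so the expected activation is unchanged
    (so nothing needs to change at test time)."""
    mask = rng.random(z.shape) < keep_prob
    return z * mask / keep_prob

h = np.ones(8)        # some hidden-layer activations
print(dropout(h))     # roughly half are zeroed; the rest are scaled to 2.0
```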