CSE 490U: Deep Learning, Spring 2016
Yejin Choi
Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer
Human Neurons
• Switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4–10^5
• Scene recognition time: ~0.1 seconds
• Number of cycles per scene recognition? ~100 → much parallel computation!
Perceptron as a Neural Network
This is one neuron:
– Input edges x1 ... xn, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g
Sigmoid Neuron
Just change g!
• Why would we want to do this?
• Notice the new output range [0,1]. What was it before?
• Look familiar?
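To make the swap concrete, here is a minimal sketch (NumPy; the example weights and names are ours, not from the slides) contrasting the perceptron's hard threshold with the sigmoid:

```python
import numpy as np

def step(a):
    """Perceptron activation: hard threshold at 0, output in {0, 1}."""
    return float(a >= 0)

def sigmoid(a):
    """Sigmoid activation: smooth, differentiable, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

w0, w = -1.5, np.array([1.0, 1.0])   # an example unit
x = np.array([1.0, 1.0])
a = w0 + w @ x                       # pre-activation sum: 0.5
print(step(a), sigmoid(a))           # 1.0 vs ~0.62: same decision, but graded
```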
(Recap equations from earlier lectures: feature maps, the kernel trick, and PAC sample complexity)

Φ(x) = [x^(1), ..., x^(n), x^(1)x^(2), x^(1)x^(3), ..., e^(x^(1)), ...]

∂L/∂w = w − Σ_j α_j y_j x_j

Φ(u)·Φ(v) = [u_1, u_2]·[v_1, v_2] = u_1 v_1 + u_2 v_2 = u·v

Φ(u)·Φ(v) = [u_1^2, u_1 u_2, u_2 u_1, u_2^2]·[v_1^2, v_1 v_2, v_2 v_1, v_2^2]
          = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2
          = (u_1 v_1 + u_2 v_2)^2
          = (u·v)^2

Φ(u)·Φ(v) = (u·v)^d

P(error_true(h) ≥ ε) ≤ |H| e^(−mε) ≤ δ
ln(|H| e^(−mε)) ≤ ln δ
ln|H| − mε ≤ ln δ
m ≥ (ln|H| + ln(1/δ)) / ε
ε ≥ (ln|H| + ln(1/δ)) / m

Optimizing a neuron
We train to minimize the sum-squared error:

l(w) = (1/2) Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)]^2
∂l/∂w_i = −Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)] · ∂/∂w_i g(w_0 + Σ_i w_i x_i^j)
Solution just depends on g’: derivative of activation function!
Sigmoid units: we have to differentiate g

∂l/∂w_i = −Σ_j [y^j − g(w_0 + Σ_i w_i x_i^j)] · ∂/∂w_i g(w_0 + Σ_i w_i x_i^j)

∂/∂w_i g(w_0 + Σ_i w_i x_i^j) = x_i^j g′(w_0 + Σ_i w_i x_i^j)

using the chain rule ∂/∂x f(g(x)) = f′(g(x)) g′(x), and, for the sigmoid,

g′(x) = g(x)(1 − g(x))
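As a sanity check on this derivation, a minimal NumPy sketch (variable names and the toy data are our own) of batch gradient descent for a single sigmoid unit under the sum-squared error:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_step(w0, w, X, y, eta=0.1):
    """One gradient step on l = 1/2 * sum_j (y^j - g(w0 + w.x^j))^2."""
    z = sigmoid(w0 + X @ w)        # predictions g(w0 + sum_i w_i x_i^j)
    err = y - z                    # (y^j - g(...))
    gprime = z * (1 - z)           # g'(x) = g(x)(1 - g(x))
    # dl/dw_i = -sum_j err_j * g'_j * x_i^j ; for w0 the input is 1
    dw = -(err * gprime) @ X
    dw0 = -(err * gprime).sum()
    return w0 - eta * dw0, w - eta * dw

# toy data: learn OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 1.])
w0, w = 0.0, np.zeros(2)
for _ in range(2000):
    w0, w = grad_step(w0, w, X, y)
print(np.round(sigmoid(w0 + X @ w)))   # should print [0. 1. 1. 1.]
```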
Perceptron, linear classification, Boolean functions: x_i ∈ {0, 1}
• Can it learn x1 ∨ x2?  • −0.5 + x1 + x2
• Can it learn x1 ∧ x2?  • −1.5 + x1 + x2
• Can it learn any conjunction or disjunction?
  • disjunction: −0.5 + x1 + … + xn
  • conjunction: (−n + 0.5) + x1 + … + xn
• Can it learn majority?  • (−0.5·n) + x1 + … + xn
• What are we missing? The dreaded XOR!, etc.
Going beyond linear classification: Solving the XOR problem

y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2) = −1.5 + 2·x1 − x2
v2 = (x2 ∧ ¬x1) = −1.5 + 2·x2 − x1
y = v1 ∨ v2 = −0.5 + v1 + v2

[Network diagram: inputs x1, x2 and a bias input 1 feed hidden units v1 (weights 2, −1, −1.5) and v2 (weights −1, 2, −1.5); v1, v2 and the bias feed output y (weights 1, 1, −0.5).]
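A quick check (a sketch, not from the slides) that these hand-set weights really compute XOR with hard-threshold units:

```python
def step(a):
    # hard-threshold activation: fires iff the weighted sum is >= 0
    return 1 if a >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        v1 = step(-1.5 + 2 * x1 - x2)   # x1 AND (NOT x2)
        v2 = step(-1.5 + 2 * x2 - x1)   # x2 AND (NOT x1)
        y = step(-0.5 + v1 + v2)        # v1 OR v2
        print(x1, x2, y)                # prints the XOR truth table: 0, 1, 1, 0
```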
Hidden layer
• Single unit: out(x) = g(w_0 + Σ_i w_i x_i)
• 1-hidden layer: out(x) = g(w_0 + Σ_k w_k g(w_0^(k) + Σ_i w_i^(k) x_i))
• No longer a convex function!
Example data for NN with hidden layer
Learned weights for hidden layer
Why “representation learning”?
• MaxEnt (multinomial logistic regression):
y = softmax(w · f(x, y))
– You design the feature vector
• NNs:
y = softmax(w · φ(Ux))
y = softmax(w · φ(U^(n) φ(··· φ(U^(2) φ(U^(1) x)))))
– Feature representations are “learned” through the hidden layers
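The contrast can be made concrete with a small sketch (shapes, seeds, and names are our assumptions): each layer U^(k) re-represents its input before the final softmax.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

phi = np.tanh                    # any elementwise nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=5)           # raw input
U1 = rng.normal(size=(8, 5))     # first hidden layer
U2 = rng.normal(size=(8, 8))     # second hidden layer
w = rng.normal(size=(3, 8))      # class weights (3 classes)

h = phi(U2 @ phi(U1 @ x))        # learned feature representation of x
y = softmax(w @ h)               # y = softmax(w . phi(U2 phi(U1 x)))
print(y, y.sum())                # a distribution over the 3 classes
```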
Very deep models in computer vision
RECURRENT NEURAL NETWORKS

[Diagram: an unrolled RNN with inputs x_1 … x_n and hidden states h_1 … h_n]
Recurrent Neural Networks (RNNs)
• Each RNN unit computes a new hidden state using the previous state and a new input
• Each RNN unit (optionally) makes an output using the current hidden state
• Hidden states are continuous vectors
– Can represent very rich information
– Possibly the entire history from the beginning
• Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

h_t = f(x_t, h_{t-1}),  h_t ∈ R^D
y_t = softmax(V h_t)
Recurrent Neural Networks (RNNs)
• Generic RNNs:
h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)
• Vanilla RNN:
h_t = tanh(U x_t + W h_{t-1} + b)
y_t = softmax(V h_t)

[Diagram: the unrolled RNN with inputs x_1 … x_n and hidden states h_1 … h_n]
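A minimal NumPy sketch of the vanilla RNN recurrence (dimensions, initialization, and the random input sequence are our assumptions):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

D, E, V_out = 4, 3, 5                    # hidden size, input size, vocab size
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, E))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(V_out, D))

h = np.zeros(D)                          # h_0
for x_t in rng.normal(size=(6, E)):      # a length-6 input sequence
    h = np.tanh(U @ x_t + W @ h + b)     # h_t = tanh(U x_t + W h_{t-1} + b)
    y = softmax(V @ h)                   # y_t = softmax(V h_t)
    print(y.argmax())                    # note: U, W, b, V are shared across steps
```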
Many uses of RNNs
1. Classification (seq to one)
• Input: a sequence
• Output: one label (classification)
• Example: sentiment classification

h_t = f(x_t, h_{t-1})
y = softmax(V h_n)

[Diagram: inputs x_1 … x_n feed h_1 … h_n; only the final state h_n produces the label y]
2. one to seq
• Input: one item
• Output: a sequence
• Example: image captioning

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: a single input x_1 (e.g., an image) feeds h_1 … h_n, which emit outputs: “Cat sitting on top of ….”]
Many uses of RNNs
3. sequence tagging
• Input: a sequence
• Output: a sequence (of the same length)
• Example: POS tagging, Named Entity Recognition
• How about Language Models?
– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: inputs x_1 … x_n feed h_1 … h_n; every state produces an output]
Many uses of RNNs
4. Language models
• Input: a sequence of words
• Output: the next word, or a sequence of next words
• During training, x_t is the actual word in the training sentence.
• During testing, x_t is the word predicted from the previous time step.
• Do RNN LMs make the Markov assumption?
– i.e., does the next word depend only on the previous N words?

[Diagram: inputs x_1 … x_n feed h_1 … h_n; each state predicts the next word]
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)
• Input: a sequence
• Output: a sequence (of different length)
• Examples?

h_t = f(x_t, h_{t-1})
y_t = softmax(V h_t)

[Diagram: an encoder reads x_1 … x_n into hidden states h_1 … h_n; a decoder then generates an output sequence of a different length]
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)
• Conversation and Dialogue
• Machine Translation

Figure from http://www.wildml.com/category/conversational-agents/
Many uses of RNNs
5. seq2seq (aka “encoder-decoder”)

Parsing! – “Grammar as a Foreign Language” (Vinyals et al., 2015)

[Diagram: the encoder reads “John has a dog”; the decoder emits the linearized parse tree]
Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• LSTMs (Long Short-Term Memory Networks):

f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))
i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))
o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))
c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)

There are many known variations to this set of equations!

c_t: cell state, h_t: hidden state
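One way to write a single LSTM step directly from these equations (a sketch: parameter shapes and the dict layout are our assumptions, and real implementations differ in detail):

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; P holds U, W, b for gates f, i, o and candidate c."""
    f = sigma(P['Uf'] @ x_t + P['Wf'] @ h_prev + P['bf'])   # forget gate
    i = sigma(P['Ui'] @ x_t + P['Wi'] @ h_prev + P['bi'])   # input gate
    o = sigma(P['Uo'] @ x_t + P['Wo'] @ h_prev + P['bo'])   # output gate
    c_tilde = np.tanh(P['Uc'] @ x_t + P['Wc'] @ h_prev + P['bc'])
    c = f * c_prev + i * c_tilde     # c_t = f_t * c_{t-1} + i_t * c~_t (elementwise)
    h = o * np.tanh(c)               # h_t = o_t * tanh(c_t)
    return h, c

D, E = 4, 3
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(D, E) if k.startswith('U') else (D, D))
     for k in ('Uf', 'Wf', 'Ui', 'Wi', 'Uo', 'Wo', 'Uc', 'Wc')}
P.update({k: np.zeros(D) for k in ('bf', 'bi', 'bo', 'bc')})

h, c = np.zeros(D), np.zeros(D)
for x_t in rng.normal(size=(5, E)):
    h, c = lstm_step(x_t, h, c, P)
print(h, c)
```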
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

[Diagram: one LSTM cell maps (c_{t-1}, h_{t-1}, x_t) to (c_t, h_t)]

Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
sigmoid: [0,1]; tanh: [-1,1]

Forget gate: forget the past or not
f_t = σ(U^(f) x_t + W^(f) h_{t-1} + b^(f))

Input gate: use the input or not
i_t = σ(U^(i) x_t + W^(i) h_{t-1} + b^(i))

New cell content (temp):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t-1} + b^(c))

New cell content: mix the old cell with the new temp cell
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t

Output gate: output from the new cell or not
o_t = σ(U^(o) x_t + W^(o) h_{t-1} + b^(o))

Hidden state:
h_t = o_t ∘ tanh(c_t)

[Diagram: the gates annotated on the LSTM cell, mapping (c_{t-1}, h_{t-1}, x_t) to (c_t, h_t)]

Figures by Christopher Olah (colah.github.io)
Vanishing gradient problem for RNNs
• The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
• The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.

Example from Graves 2012
Preservation of gradient information by LSTM
• For simplicity, all gates are either entirely open (‘O’) or closed (‘—’).
• The memory cell ‘remembers’ the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

[Diagram labels: Forget gate, Input gate, Output gate]

Example from Graves 2012
Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t-1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t-1} + b)
• GRUs (Gated Recurrent Units):

z_t = σ(U^(z) x_t + W^(z) h_{t-1} + b^(z))
r_t = σ(U^(r) x_t + W^(r) h_{t-1} + b^(r))
h̃_t = tanh(U^(h) x_t + W^(h) (r_t ∘ h_{t-1}) + b^(h))
h_t = (1 − z_t) ∘ h_{t-1} + z_t ∘ h̃_t

Fewer parameters than LSTMs. Easier to train for comparable performance!
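The corresponding GRU step, again as a sketch with assumed shapes; note there is no separate cell state and only three parameter sets instead of four:

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, P):
    z = sigma(P['Uz'] @ x_t + P['Wz'] @ h_prev + P['bz'])   # update gate
    r = sigma(P['Ur'] @ x_t + P['Wr'] @ h_prev + P['br'])   # reset gate
    h_tilde = np.tanh(P['Uh'] @ x_t + P['Wh'] @ (r * h_prev) + P['bh'])
    return (1 - z) * h_prev + z * h_tilde   # interpolate old and new state

D, E = 4, 3
rng = np.random.default_rng(1)
P = {k: rng.normal(scale=0.1, size=(D, E) if k.startswith('U') else (D, D))
     for k in ('Uz', 'Wz', 'Ur', 'Wr', 'Uh', 'Wh')}
P.update({k: np.zeros(D) for k in ('bz', 'br', 'bh')})

h = np.zeros(D)
for x_t in rng.normal(size=(5, E)):
    h = gru_step(x_t, h, P)
print(h)
```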
Recursive Neural Networks
• Sometimes, inference over a tree structure makes more sense than a sequential structure
• An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

Recursive Neural Networks
• NNs connected as a tree
• Tree structure is fixed a priori
• Parameters are shared, similarly to RNNs

Example from Iyyer et al., 2014
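A sketch of the core composition in the style of standard recursive nets (the exact form in Iyyer et al. differs): every internal node combines its children's vectors with the same shared parameters.

```python
import numpy as np

D = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 2 * D))   # shared across all tree nodes
b = np.zeros(D)

def compose(node):
    """node is either a leaf vector or a (left, right) pair of subtrees."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    # parent vector from the concatenated child vectors, shared W and b
    return np.tanh(W @ np.concatenate([compose(left), compose(right)]) + b)

# a tiny fixed parse tree: ((w1, w2), w3)
w1, w2, w3 = (rng.normal(size=D) for _ in range(3))
root = compose(((w1, w2), w3))
print(root)   # the root vector can feed a classifier (e.g., a bias label)
```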
LEARNING: BACKPROPAGATION
Error Backpropagation
• Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}, or for brevity: θ = {w_ij, w_jk, w_kl}

[Diagram: a feedforward network with inputs x_0, x_1, x_2, …, x_P, weight layers w_ij, w_jk, w_kl, and output f(x, θ)]

Next 10 slides on backpropagation are adapted from Andrew Rosenberg
Error Backpropagation
• Model parameters: θ = {w_ij, w_jk, w_kl}
• Let a and z be the input and output of each node

[Diagram: the same network, annotated: z_i → w_ij → a_j, z_j → w_jk → a_k, z_k → w_kl → a_l, z_l]
Error Backpropagation
• Let a and z be the input and output of each node:

a_j = Σ_i w_ij z_i    z_j = g(a_j)
a_k = Σ_j w_jk z_j    z_k = g(a_k)
a_l = Σ_k w_kl z_k    z_l = g(a_l)
Training: minimize loss

Empirical Risk Function:

R(θ) = (1/N) Σ_{n=1..N} L(y_n, f(x_n))
     = (1/N) Σ_{n=1..N} (1/2) (y_n − f(x_n))^2
     = (1/N) Σ_{n=1..N} (1/2) (y_n − g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i}))))^2
Error Backpropagation
Optimize the last layer weights w_kl, with L_n = (1/2)(y_n − f(x_n))^2. By the calculus chain rule:

∂R/∂w_kl = (1/N) Σ_n [∂L_n/∂a_{l,n}] [∂a_{l,n}/∂w_kl]
         = (1/N) Σ_n [∂ (1/2)(y_n − g(a_{l,n}))^2 / ∂a_{l,n}] [∂(z_{k,n} w_kl)/∂w_kl]
         = (1/N) Σ_n [−(y_n − z_{l,n}) g′(a_{l,n})] z_{k,n}
         = (1/N) Σ_n δ_{l,n} z_{k,n}

where δ_{l,n} = −(y_n − z_{l,n}) g′(a_{l,n}).
Error Backpropagation
Repeat for all previous layers:

∂R/∂w_kl = (1/N) Σ_n [∂L_n/∂a_{l,n}] [∂a_{l,n}/∂w_kl] = (1/N) Σ_n [−(y_n − z_{l,n}) g′(a_{l,n})] z_{k,n} = (1/N) Σ_n δ_{l,n} z_{k,n}

∂R/∂w_jk = (1/N) Σ_n [∂L_n/∂a_{k,n}] [∂a_{k,n}/∂w_jk] = (1/N) Σ_n [Σ_l δ_{l,n} w_kl g′(a_{k,n})] z_{j,n} = (1/N) Σ_n δ_{k,n} z_{j,n}

∂R/∂w_ij = (1/N) Σ_n [∂L_n/∂a_{j,n}] [∂a_{j,n}/∂w_ij] = (1/N) Σ_n [Σ_k δ_{k,n} w_jk g′(a_{j,n})] z_{i,n} = (1/N) Σ_n δ_{j,n} z_{i,n}

(recall a_j = Σ_i w_ij z_i and z_j = g(a_j))
Backprop Recursion: each δ_j is computed from the next layer's δ_k through the weights w_jk, sweeping backwards through the network.

Learning: Gradient Descent

w^{t+1}_ij = w^t_ij − η ∂R/∂w_ij
w^{t+1}_jk = w^t_jk − η ∂R/∂w_jk
w^{t+1}_kl = w^t_kl − η ∂R/∂w_kl
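Putting the recursion and the update rule together, a sketch of training this three-weight-layer network with sigmoid units everywhere (shapes, learning rate, and the toy data are our assumptions):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid; g'(a) = g(a)(1 - g(a))

rng = np.random.default_rng(0)
P, J, K = 3, 5, 4                        # input width, two hidden widths; 1 output
Wij = rng.normal(scale=0.5, size=(P, J))
Wjk = rng.normal(scale=0.5, size=(J, K))
Wkl = rng.normal(scale=0.5, size=(K, 1))
X = rng.normal(size=(20, P))             # toy data
y = (X[:, :1] > 0).astype(float)         # toy labels

eta, N = 0.5, len(X)
for _ in range(500):
    # forward sweep: store all intermediate values
    zj = g(X @ Wij); zk = g(zj @ Wjk); zl = g(zk @ Wkl)
    # backward sweep: deltas, reusing stored values (dynamic programming)
    dl = -(y - zl) * zl * (1 - zl)       # delta_l = -(y - z_l) g'(a_l)
    dk = (dl @ Wkl.T) * zk * (1 - zk)    # delta_k = sum_l delta_l w_kl g'(a_k)
    dj = (dk @ Wjk.T) * zj * (1 - zj)    # delta_j = sum_k delta_k w_jk g'(a_j)
    # gradient descent: w <- w - eta * dR/dw, with dR/dw = (1/N) sum_n delta*z
    Wkl -= eta * zk.T @ dl / N
    Wjk -= eta * zj.T @ dk / N
    Wij -= eta * X.T @ dj / N
print(np.mean((zl > 0.5) == (y > 0.5)))  # training accuracy on the toy data
```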
Backpropagation
• Starts with a forward sweep to compute all the intermediate function values
• Through backprop, computes the partial derivatives recursively
• A form of dynamic programming
– Instead of considering the exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.
• A type of automatic differentiation. (There are other variants, e.g., recursive differentiation only through forward propagation.)
Backpropagation toolkits and their primary interface languages:
• TensorFlow (https://www.tensorflow.org/) – Python
• Torch (http://torch.ch/) – Lua
• Theano (http://deeplearning.net/software/theano/) – Python
• CNTK (https://github.com/Microsoft/CNTK) – C++
• cnn (https://github.com/clab/cnn) – C++
• Caffe (http://caffe.berkeleyvision.org/) – C++
Cross Entropy Loss (aka log loss, logistic loss)
• Cross Entropy (p: true prob, q: predicted prob):

H(p, q) = E_p[−log q] = H(p) + D_KL(p‖q)
H(p, q) = −Σ_y p(y) log q(y)

• Related quantities
– Entropy: H(p) = −Σ_y p(y) log p(y)
– KL divergence (the distance between two distributions p and q): D_KL(p‖q) = Σ_y p(y) log (p(y)/q(y))

• Use Cross Entropy for models that should have a more probabilistic flavor (e.g., language models)
• Use Mean Squared Error loss for models that focus on correct/incorrect predictions: MSE = (1/2)(y − f(x))^2
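In code, with p the true distribution and q the model's prediction (a sketch; the example distributions are ours):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_y p(y) log p(y)"""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y)"""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(p||q) = sum_y p(y) log(p(y)/q(y))"""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution
# the identity from the slide: H(p, q) = H(p) + D_KL(p||q)
print(cross_entropy(p, q), entropy(p) + kl(p, q))   # the two values agree
```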
RNN Learning: Backprop Through Time (BPTT)
• Similar to backprop with non-recurrent NNs
• But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
• Backprop the gradients of the parameters of each unit as if they were different parameters
• When updating the parameters using the gradients, use the average gradients throughout the entire chain of units.
Convergence of backprop
• Without non-linearity or hidden layers, learning is convex optimization
– Gradient descent reaches global minima
• Multilayer neural nets (with nonlinearity) are not convex
– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– NNs have made a HUGE comeback in the last few years
• Neural nets are back with a new name
– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs
Overfitting in NNs
• Are NNs likely to overfit?
– Yes, they can represent arbitrary functions!!!
• Avoiding overfitting?
– More training data
– Fewer hidden nodes / better topology
– Random perturbation to the graph topology (“Dropout”) – see the sketch after this list
– Regularization
– Early stopping
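A sketch of the dropout idea mentioned above, in its common “inverted dropout” form (the scaling convention is our assumption, not from the slides):

```python
import numpy as np

def dropout(z, keep_prob=0.5, rng=np.random.default_rng(0)):
    """Randomly zero each hidden activation during training; scale the
    survivors by 1/keep_prob so the expected activation is unchanged
    (so nothing needs to change at test time)."""
    mask = rng.random(z.shape) < keep_prob
    return z * mask / keep_prob

h = np.ones(8)        # some hidden-layer activations
print(dropout(h))     # roughly half are zeroed; the rest are scaled to 2.0
```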