TRANSCRIPT
“Deep” Learning
Big picture: natural language analyzers

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics

[Figure: a natural language analyzer built from components: tokenizer, spell corrector, POS tagger, named entity recognizer, syntactic parser, semantic parser, coreference resolution, sentiment analyzer, machine translator, speech recognizer, classification.]
Today: deep learning for NLP components

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Running example: document classification

[Figure: the same pipeline of components, with the classification component highlighted; its output is a document category.]
Running example: document classification

Categories: Politics, Business, Science, Sports, Health

output = argmax_l f(l, d)

Example: d = "Barcelona lost to Real Madrid" → l = Sports
How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)

[Figure: a sparse binary feature vector g(l, d) (0 0 0 1 0 0 1 0 0 0 0 …) dotted with a dense weight vector w (0.4 -1.2 0.2 0.2 -0.4 -1.0 5.1 1.1 2.3 0.8 -0.1 …). Example features: the number of times Barcelona appears in a document labeled Sports; the number of times lost appears in a document labeled Sports. Here l = Sports, d = "Barcelona lost to Real Madrid".]
How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)
– Easy to implement
– Easy to optimize w

Two possible improvements:
– Define more complex functions
– Find better representations of (l, d)

Figure credits: Barbara Rosario
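A minimal sketch of this kind of linear model in Python; the label set, vocabulary, and feature function g are toy stand-ins, not the deck's actual setup:

import numpy as np

LABELS = ["Politics", "Business", "Science", "Sports", "Health"]
VOCAB = ["barcelona", "lost", "to", "real", "madrid"]

def g(label, doc):
    """Joint feature vector: count of each vocab word, in the block for `label`."""
    feats = np.zeros(len(LABELS) * len(VOCAB))
    offset = LABELS.index(label) * len(VOCAB)
    for word in doc.lower().split():
        if word in VOCAB:
            feats[offset + VOCAB.index(word)] += 1
    return feats

w = np.random.randn(len(LABELS) * len(VOCAB))  # learned from data in practice

def predict(doc):
    # output = argmax_l f(l, d) = argmax_l w · g(l, d)
    return max(LABELS, key=lambda l: w @ g(l, doc))

print(predict("Barcelona lost to Real Madrid"))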
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
neural network v1.0: linear model

Linear models: f(l, d) = w · g(l, d) = w(l) · x(d)
e.g., y1 = x1 w1,1 + x2 w2,1 + x3 w3,1 + x4 w4,1 + x5 w5,1 = w(1) · x(d)

[Figure: a single-layer network with inputs x (e.g., the number of times lost appears in a document, the number of times Barcelona appears in a document), weights w1,1 … w5,3, and one output y per label: Politics, Science, Sports.]
neural network v1.0: linear model

[Figure: the network diagram x → W → y is the same as the matrix equation y = W x, with one output per label: Politics, Science, Sports.]

Similar words still share no parameters!
neural network v2.0: representation learning

Big idea: induce low-dimensional dense feature representations of high-dimensional objects
neural network v2.1: representation learning

Big idea: embed words in a dense vector space and use the word embeddings as dense features

[Figure: x (now dense) → W → y]

Did this really solve the problem?
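A minimal sketch of the embedding idea, assuming a hypothetical vocabulary and an embedding matrix E that would be learned in practice:

import numpy as np

vocab = {"barcelona": 0, "lost": 1, "to": 2, "real": 3, "madrid": 4}
emb_dim = 50
E = np.random.randn(len(vocab), emb_dim) * 0.01  # |V| x d embedding matrix

def embed(doc):
    """Represent a document as the average of its word embeddings."""
    ids = [w for w in (vocab.get(t) for t in doc.lower().split()) if w is not None]
    return E[ids].mean(axis=0)  # a dense, low-dimensional x(d)

x = embed("Barcelona lost to Real Madrid")
print(x.shape)  # (50,) instead of a sparse |V|-dimensional count vector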
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W → y]

y = W x
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W1 → h1 → W2 → y]

y = W2 h1 = W2 (W1 x) = W x   Wait! Is that true?! (Yes: without a non-linearity, the composition of two linear layers collapses to a single linear map W = W2 W1.)
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W1 → a1 → h1 → W2 → y, with a plot of the logistic curve a1(z) against z]

y = W2 h1 = W2 a1(W1 x)

where a1 is a non-linear function, e.g., the logistic function a1(z) = 1 / (1 + e^-z), and h1 is a layer of induced features.

Universal approximation theorem: Cybenko, G. (1989)
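A minimal sketch of this one-hidden-layer computation with the logistic non-linearity; all dimensions are arbitrary placeholders:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))  # a1(z) = 1 / (1 + e^-z)

in_dim, hid_dim, out_dim = 50, 32, 3  # e.g., 3 labels
W1 = np.random.randn(hid_dim, in_dim) * 0.1
W2 = np.random.randn(out_dim, hid_dim) * 0.1

x = np.random.randn(in_dim)
h1 = logistic(W1 @ x)   # induced features
y = W2 @ h1             # one score per label
print(y)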
neural network v3.0: complex functions

Popular activation/transfer/non-linear functions: https://en.wikipedia.org/wiki/Activation_function
neural network v3.5: “deeper” networks

[Figure: x → W1 → h1 → W2 → h2 → W3 → y]

y = W3 h2 = W3 a2( W2 a1(W1 x) )

Wait, but why do we need more layers?
neural network v4.0: recurrent neural networks

Big idea: use hidden layers to represent sequential state

[Figure: a feed-forward neural network (x → y) next to a recurrent neural network unrolled over inputs x1, x2, x3. Figure credits: Andrej Karpathy]

How did we represent x for document classification? As a bag of words, where "Real …. Madrid" = "Real Madrid"; in a recurrent network, order and distance matter, so "Real …. Madrid" ≠ "Real Madrid".

Do we share parameters across states?
neural network v4.0: recurrent neural networks

[Figure: an unrolled recurrent neural network. Figure credits: Christopher Olah]

How to compute the hidden layers?
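One standard answer, sketched below, is the vanilla (Elman-style) RNN step h_t = tanh(W_xh x_t + W_hh h_{t-1}); note that the same weight matrices are reused at every time step, which also answers the parameter-sharing question above. Dimensions are placeholders:

import numpy as np

in_dim, hid_dim = 50, 64
W_xh = np.random.randn(hid_dim, in_dim) * 0.1
W_hh = np.random.randn(hid_dim, hid_dim) * 0.1

def run_rnn(xs):
    h = np.zeros(hid_dim)
    states = []
    for x in xs:                          # one step per input token
        h = np.tanh(W_xh @ x + W_hh @ h)  # shared parameters across steps
        states.append(h)
    return states

states = run_rnn([np.random.randn(in_dim) for _ in range(3)])  # x1, x2, x3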
neural network v4.1: output sequences

[Figure: RNN configurations that produce output sequences. Figure credits: Andrej Karpathy]
neural network v4.1: output sequences

Example: character-level language models

[Figure credits: Andrej Karpathy]
neural network v4.1: output sequences

Sample output (credits: Andrej Karpathy): Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery was swear to advance to the resources for those Socialism's rule, was starting to signing a major tripad of aid exile.]]
neural network v4.2: Long Short-Term Memory

[Figure: regular RNNs vs. LSTMs. Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
neural network v4.2: Long Short-Term Memory

[Figure: the internals of an LSTM cell. Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
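For concreteness, a sketch of one LSTM step following the standard gate equations from Olah's post (biases omitted; weights and dimensions are placeholders):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

in_dim, hid_dim = 50, 64
rng = np.random.default_rng(0)
# one weight matrix per gate, applied to the concatenation [h_{t-1}; x_t]
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (hid_dim, hid_dim + in_dim)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = logistic(Wf @ z)          # forget gate: what to erase from the cell
    i = logistic(Wi @ z)          # input gate: what to write
    o = logistic(Wo @ z)          # output gate: what to expose
    c_tilde = np.tanh(Wc @ z)     # candidate cell values
    c = f * c_prev + i * c_tilde  # long-term cell state
    h = o * np.tanh(c)            # short-term hidden state
    return h, c

h, c = lstm_step(np.random.randn(in_dim), np.zeros(hid_dim), np.zeros(hid_dim))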
neural network v4.3: bidirectional RNNs

[Figure: unidirectional RNNs vs. bidirectional RNNs. Figure credits: Christopher Olah]
neural network v4.4: attention models

Bahdanau et al. (2015)
neural network v5: convolutional NN

[Figure credits: Christopher Olah]
neural network v5: convolutional NN

[Figure: a feed-forward NN vs. a convolutional NN. Figure credits: Christopher Olah]
neural network v5: convolutional NN

[Figure: convolutional layer 1 feeding convolutional layer 2. Figure credits: Christopher Olah]

Do we share parameters of different convolutions? In the same layer? In different layers?
neural network v5: convolutional NN

[Figure: convolutional layer 1, a max pooling layer, convolutional layer 2; 2D convolutions. Figure credits: Christopher Olah]
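A minimal sketch of a 1D convolutional layer over a sequence of word vectors followed by max pooling; the same filter weights W are shared across all window positions in a layer (sizes are placeholders):

import numpy as np

emb_dim, window, n_filters = 50, 3, 16
W = np.random.randn(n_filters, window * emb_dim) * 0.1

def conv1d(xs):
    """Apply every filter to each window of `window` consecutive vectors."""
    feats = []
    for t in range(len(xs) - window + 1):
        patch = np.concatenate(xs[t:t + window])
        feats.append(np.maximum(0.0, W @ patch))  # ReLU non-linearity
    return np.array(feats)  # (positions, n_filters)

xs = [np.random.randn(emb_dim) for _ in range(10)]
pooled = conv1d(xs).max(axis=0)  # max pooling: one value per filter
print(pooled.shape)  # (16,)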
neural network v5.1: recursive NNs
neural network v6: dropout
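The slide gives no further detail; as a reminder of the idea, here is a sketch of one common variant ("inverted" dropout), which randomly zeroes hidden units during training and rescales the survivors so expected activations match test time:

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    if not train:
        return h                          # no dropout at test time
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale to preserve the expectation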
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
How to train NN models?
• argmax_l f(d, l) only tells us which label to predict.
• Supervised learning (we need input/output pairs).
• Loss function: e.g., cross-entropy between the empirical distribution and the model distribution, with a softmax layer on top of the network:
  -log p(l*|d) = -log ( e^f(d, l*) / Σ_l e^f(d, l) )
• Regression problems? Mean squared error: E[(y - y*)^2]
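A worked sketch of that cross-entropy loss with a softmax over the label scores, using a numerically stable log-sum-exp:

import numpy as np

def cross_entropy(scores, gold):
    """scores: one f(d, l) per label; gold: index of the correct label l*."""
    scores = scores - scores.max()         # for numerical stability
    log_z = np.log(np.exp(scores).sum())   # log Σ_l e^f(d, l)
    return -(scores[gold] - log_z)         # -log p(l*|d)

print(cross_entropy(np.array([1.0, 3.2, -0.5]), gold=1))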
How to optimize the loss?
• Stochastic gradient descent
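A minimal sketch of the SGD loop; loss_grad is a hypothetical stand-in for the gradient computation discussed below:

import random

def sgd(params, data, loss_grad, lr=0.1, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)
        for example in data:
            grads = loss_grad(params, example)    # one noisy gradient estimate
            for name in params:
                params[name] -= lr * grads[name]  # step against the gradient
    return params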
How to optimize the loss?
• Parameter initialization
  – Break the symmetry
  – Use small values
  – Random restarts
  – Popular choice: uniform with mean = zero and variance = 1 / (size of previous layer)
    http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
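A sketch of that recipe (uniform, zero mean, variance 1 / fan-in), in the spirit of the Xavier-style scheme in the linked post:

import numpy as np

def init_layer(fan_in, fan_out, rng=np.random.default_rng()):
    # uniform on [-b, b] has variance b^2 / 3, so b = sqrt(3 / fan_in)
    # gives mean 0 and variance 1 / fan_in
    bound = np.sqrt(3.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W1 = init_layer(50, 64)
print(W1.var())  # ≈ 1/50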
How to optimize the loss?
• Other optimization methods
  – Variants of stochastic gradient descent (e.g., averaged SGD, SGD with momentum); see http://research.microsoft.com/pubs/192769/tricks-2012.pdf
  – Adagrad
  – Adam
  – Adadelta
How to optimize the loss?
• Computing gradients: the hard way
  – Analytically derive the expression that represents the gradient with respect to each input.
  – Compute that expression.
• Computing gradients: automatic differentiation
  – Translate the loss function into a sequence of atomic operations.
  – The derivative of each atomic operation with respect to its inputs is hard-coded.
  – Recursively compute the gradient of the loss function with respect to the model parameters using the chain rule.
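A tiny toy sketch of reverse-mode automatic differentiation in this spirit: each atomic operation records its inputs and a hard-coded local derivative, and backward() applies the chain rule recursively:

class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):  # atomic op with hard-coded local derivatives
        return Var(self.value * other.value,
                   parents=[(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=[(self, 1.0), (other, 1.0)])

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, local_deriv in self.parents:
            parent.backward(grad * local_deriv)  # chain rule

x, w = Var(2.0), Var(3.0)
loss = w * x + x          # loss = wx + x
loss.backward()
print(w.grad, x.grad)     # dloss/dw = x = 2.0, dloss/dx = w + 1 = 4.0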
How to optimize the loss: automatic differentiation

[Slides: a worked example of automatic differentiation on a computation graph.]
How to optimize the loss: deep learning libraries

https://github.com/soumith/convnet-benchmarks/
Also see: CMU’s locally grown library at https://github.com/clab/cnn
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Major results: language modeling

Major results: image classification

Krizhevsky et al. (2012)
Major results: ImageNet

Krizhevsky et al. (2012): positive and negative examples
Major results: ImageNet

Krizhevsky et al. (2012): sample convolution filters
Major results: speech recognition

Graves et al. (2013)
Major results: translation

Sutskever et al. (2014)
Major results: translation

Bahdanau et al. (2015)
Major results: dependency parsing

Chen and Manning (2014)
Major results: dependency parsing
Dyer et al. (2015)
Important things we didn’t cover
• Dark knowledge
• Connection to graphical models
• Alternatives to the softmax output layer
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Open question: can we do without the intermediate linguistic abstractions?

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics

[Figure: the same pipeline of components: tokenizer, spell corrector, POS tagger, named entity recognizer, syntactic parser, semantic parser, coreference resolution, sentiment analyzer, machine translator, speech recognizer, classification.]