TRANSCRIPT
“Deep” Learning
Big picture: natural language analyzers

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics

[Figure: a natural language analyzer built from components: tokenizer, spell corrector, POS tagger, named entity recognizer, syntactic parser, semantic parser, coreference resolution, sentiment analyzer, machine translator, speech recognizer, classification.]
Today: deep learning for NLP components

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Running example: document classification

[Figure: the same pipeline of components, with the classification component highlighted; its output is a document category.]
Running example: document classification

Categories: Politics, Business, Science, Sports, Health

output = argmax_l f(l, d)

Example: d = "Barcelona lost to Real Madrid" → l = Sports
How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)

[Figure: a sparse binary feature vector g(l, d) (0 0 0 1 0 0 1 0 0 0 0 …) dotted with a dense weight vector w (0.4 -1.2 0.2 0.2 -0.4 -1.0 5.1 1.1 2.3 0.8 -0.1 …). Example features: the number of times Barcelona appears in a document labeled Sports; the number of times lost appears in a document labeled Sports. Here l = Sports, d = "Barcelona lost to Real Madrid".]
How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)
– Easy to implement
– Easy to optimize w

Two possible improvements:
– Define more complex functions
– Find better representations of (l, d)

Figure credits: Barbara Rosario
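A minimal sketch of this kind of linear model in Python; the label set, vocabulary, and feature function g are toy stand-ins, not the deck's actual setup:

import numpy as np

LABELS = ["Politics", "Business", "Science", "Sports", "Health"]
VOCAB = ["barcelona", "lost", "to", "real", "madrid"]

def g(label, doc):
    """Joint feature vector: count of each vocab word, in the block for `label`."""
    feats = np.zeros(len(LABELS) * len(VOCAB))
    offset = LABELS.index(label) * len(VOCAB)
    for word in doc.lower().split():
        if word in VOCAB:
            feats[offset + VOCAB.index(word)] += 1
    return feats

w = np.random.randn(len(LABELS) * len(VOCAB))  # learned from data in practice

def predict(doc):
    # output = argmax_l f(l, d) = argmax_l w · g(l, d)
    return max(LABELS, key=lambda l: w @ g(l, doc))

print(predict("Barcelona lost to Real Madrid"))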
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
neural network v1.0: linear model

Linear models: f(l, d) = w · g(l, d) = w(l) · x(d)
e.g., y1 = x1 w1,1 + x2 w2,1 + x3 w3,1 + x4 w4,1 + x5 w5,1 = w(1) · x(d)

[Figure: a single-layer network with inputs x (e.g., the number of times lost appears in a document, the number of times Barcelona appears in a document), weights w1,1 … w5,3, and one output y per label: Politics, Science, Sports.]
neural network v1.0: linear model

[Figure: the network diagram x → W → y is the same as the matrix equation y = W x, with one output per label: Politics, Science, Sports.]

Similar words still share no parameters!
neural network v2.0: representation learning

Big idea: induce low-dimensional dense feature representations of high-dimensional objects
neural network v2.1: representation learning

Big idea: embed words in a dense vector space and use the word embeddings as dense features

[Figure: x (now dense) → W → y]

Did this really solve the problem?
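A minimal sketch of the embedding idea, assuming a hypothetical vocabulary and an embedding matrix E that would be learned in practice:

import numpy as np

vocab = {"barcelona": 0, "lost": 1, "to": 2, "real": 3, "madrid": 4}
emb_dim = 50
E = np.random.randn(len(vocab), emb_dim) * 0.01  # |V| x d embedding matrix

def embed(doc):
    """Represent a document as the average of its word embeddings."""
    ids = [w for w in (vocab.get(t) for t in doc.lower().split()) if w is not None]
    return E[ids].mean(axis=0)  # a dense, low-dimensional x(d)

x = embed("Barcelona lost to Real Madrid")
print(x.shape)  # (50,) instead of a sparse |V|-dimensional count vector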
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W → y]

y = W x
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W1 → h1 → W2 → y]

y = W2 h1 = W2 (W1 x) = W x   Wait! Is that true?! (Yes: without a non-linearity, the composition of two linear layers collapses to a single linear map W = W2 W1.)
neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: x → W1 → a1 → h1 → W2 → y, with a plot of the logistic curve a1(z) against z]

y = W2 h1 = W2 a1(W1 x)

where a1 is a non-linear function, e.g., the logistic function a1(z) = 1 / (1 + e^-z), and h1 is a layer of induced features.

Universal approximation theorem: Cybenko, G. (1989)
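A minimal sketch of this one-hidden-layer computation with the logistic non-linearity; all dimensions are arbitrary placeholders:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))  # a1(z) = 1 / (1 + e^-z)

in_dim, hid_dim, out_dim = 50, 32, 3  # e.g., 3 labels
W1 = np.random.randn(hid_dim, in_dim) * 0.1
W2 = np.random.randn(out_dim, hid_dim) * 0.1

x = np.random.randn(in_dim)
h1 = logistic(W1 @ x)   # induced features
y = W2 @ h1             # one score per label
print(y)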
neural network v3.0: complex functions

Popular activation/transfer/non-linear functions: https://en.wikipedia.org/wiki/Activation_function
neural network v3.5: “deeper” networks

[Figure: x → W1 → h1 → W2 → h2 → W3 → y]

y = W3 h2 = W3 a2( W2 a1(W1 x) )

Wait, but why do we need more layers?
neural network v4.0: recurrent neural networks

Big idea: use hidden layers to represent sequential state

[Figure: a feed-forward neural network (x → y) next to a recurrent neural network unrolled over inputs x1, x2, x3. Figure credits: Andrej Karpathy]

How did we represent x for document classification? As a bag of words, where "Real …. Madrid" = "Real Madrid"; in a recurrent network, order and distance matter, so "Real …. Madrid" ≠ "Real Madrid".

Do we share parameters across states?
neural network v4.0: recurrent neural networks

[Figure: an unrolled recurrent neural network. Figure credits: Christopher Olah]

How to compute the hidden layers?
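One standard answer, sketched below, is the vanilla (Elman-style) RNN step h_t = tanh(W_xh x_t + W_hh h_{t-1}); note that the same weight matrices are reused at every time step, which also answers the parameter-sharing question above. Dimensions are placeholders:

import numpy as np

in_dim, hid_dim = 50, 64
W_xh = np.random.randn(hid_dim, in_dim) * 0.1
W_hh = np.random.randn(hid_dim, hid_dim) * 0.1

def run_rnn(xs):
    h = np.zeros(hid_dim)
    states = []
    for x in xs:                          # one step per input token
        h = np.tanh(W_xh @ x + W_hh @ h)  # shared parameters across steps
        states.append(h)
    return states

states = run_rnn([np.random.randn(in_dim) for _ in range(3)])  # x1, x2, x3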
neural network v4.1: output sequences

[Figure: RNN configurations that produce output sequences. Figure credits: Andrej Karpathy]
neural network v4.1: output sequences

Example: character-level language models

[Figure credits: Andrej Karpathy]
neural network v4.1: output sequences

Sample output (credits: Andrej Karpathy): Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery was swear to advance to the resources for those Socialism's rule, was starting to signing a major tripad of aid exile.]]
neural network v4.2: Long Short-Term Memory

[Figure: regular RNNs vs. LSTMs. Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
neural network v4.2: Long Short-Term Memory

[Figure: the internals of an LSTM cell. Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
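For concreteness, a sketch of one LSTM step following the standard gate equations from Olah's post (biases omitted; weights and dimensions are placeholders):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

in_dim, hid_dim = 50, 64
rng = np.random.default_rng(0)
# one weight matrix per gate, applied to the concatenation [h_{t-1}; x_t]
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (hid_dim, hid_dim + in_dim)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = logistic(Wf @ z)          # forget gate: what to erase from the cell
    i = logistic(Wi @ z)          # input gate: what to write
    o = logistic(Wo @ z)          # output gate: what to expose
    c_tilde = np.tanh(Wc @ z)     # candidate cell values
    c = f * c_prev + i * c_tilde  # long-term cell state
    h = o * np.tanh(c)            # short-term hidden state
    return h, c

h, c = lstm_step(np.random.randn(in_dim), np.zeros(hid_dim), np.zeros(hid_dim))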
neural network v4.3: bidirectional RNNs

[Figure: unidirectional RNNs vs. bidirectional RNNs. Figure credits: Christopher Olah]
neural network v4.4: attention models

Bahdanau et al. (2015)
neural network v5: convolutional NN

[Figure credits: Christopher Olah]
neural network v5: convolutional NN

[Figure: a feed-forward NN vs. a convolutional NN. Figure credits: Christopher Olah]
neural network v5: convolutional NN

[Figure: convolutional layer 1 feeding convolutional layer 2. Figure credits: Christopher Olah]

Do we share parameters of different convolutions? In the same layer? In different layers?
neural network v5: convolutional NN

[Figure: convolutional layer 1, a max pooling layer, convolutional layer 2; 2D convolutions. Figure credits: Christopher Olah]
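A minimal sketch of a 1D convolutional layer over a sequence of word vectors followed by max pooling; the same filter weights W are shared across all window positions in a layer (sizes are placeholders):

import numpy as np

emb_dim, window, n_filters = 50, 3, 16
W = np.random.randn(n_filters, window * emb_dim) * 0.1

def conv1d(xs):
    """Apply every filter to each window of `window` consecutive vectors."""
    feats = []
    for t in range(len(xs) - window + 1):
        patch = np.concatenate(xs[t:t + window])
        feats.append(np.maximum(0.0, W @ patch))  # ReLU non-linearity
    return np.array(feats)  # (positions, n_filters)

xs = [np.random.randn(emb_dim) for _ in range(10)]
pooled = conv1d(xs).max(axis=0)  # max pooling: one value per filter
print(pooled.shape)  # (16,)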
neural network v5.1: recursive NNs
neural network v6: dropout
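The slide gives no further detail; as a reminder of the idea, here is a sketch of one common variant ("inverted" dropout), which randomly zeroes hidden units during training and rescales the survivors so expected activations match test time:

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    if not train:
        return h                          # no dropout at test time
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale to preserve the expectation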
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
How to train NN models?
• argmax_l f(d, l) only tells us which label to predict.
• Supervised learning (we need input/output pairs).
• Loss function: e.g., cross-entropy between the empirical distribution and the model distribution, with a softmax layer on top of the network:
  -log p(l*|d) = -log ( e^f(d, l*) / Σ_l e^f(d, l) )
• Regression problems? Mean squared error: E[(y - y*)^2]
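A worked sketch of that cross-entropy loss with a softmax over the label scores, using a numerically stable log-sum-exp:

import numpy as np

def cross_entropy(scores, gold):
    """scores: one f(d, l) per label; gold: index of the correct label l*."""
    scores = scores - scores.max()         # for numerical stability
    log_z = np.log(np.exp(scores).sum())   # log Σ_l e^f(d, l)
    return -(scores[gold] - log_z)         # -log p(l*|d)

print(cross_entropy(np.array([1.0, 3.2, -0.5]), gold=1))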
How to optimize the loss?
• Stochastic gradient descent
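A minimal sketch of the SGD loop; loss_grad is a hypothetical stand-in for the gradient computation discussed below:

import random

def sgd(params, data, loss_grad, lr=0.1, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)
        for example in data:
            grads = loss_grad(params, example)    # one noisy gradient estimate
            for name in params:
                params[name] -= lr * grads[name]  # step against the gradient
    return params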
How to optimize the loss?
• Parameter initialization
  – Break the symmetry
  – Use small values
  – Random restarts
  – Popular choice: uniform with mean = zero and variance = 1 / (size of previous layer)
    http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
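A sketch of that recipe (uniform, zero mean, variance 1 / fan-in), in the spirit of the Xavier-style scheme in the linked post:

import numpy as np

def init_layer(fan_in, fan_out, rng=np.random.default_rng()):
    # uniform on [-b, b] has variance b^2 / 3, so b = sqrt(3 / fan_in)
    # gives mean 0 and variance 1 / fan_in
    bound = np.sqrt(3.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

W1 = init_layer(50, 64)
print(W1.var())  # ≈ 1/50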
How to optimize the loss?
• Other optimization methods
  – Variants of stochastic gradient descent (e.g., averaged SGD, SGD with momentum); see http://research.microsoft.com/pubs/192769/tricks-2012.pdf
  – Adagrad
  – Adam
  – Adadelta
How to optimize the loss?
• Computing gradients: the hard way
  – Analytically derive the expression that represents the gradient with respect to each input.
  – Compute that expression.
• Computing gradients: automatic differentiation
  – Translate the loss function into a sequence of atomic operations.
  – The derivative of each atomic operation with respect to its inputs is hard-coded.
  – Recursively compute the gradient of the loss function with respect to the model parameters using the chain rule.
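A tiny toy sketch of reverse-mode automatic differentiation in this spirit: each atomic operation records its inputs and a hard-coded local derivative, and backward() applies the chain rule recursively:

class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):  # atomic op with hard-coded local derivatives
        return Var(self.value * other.value,
                   parents=[(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=[(self, 1.0), (other, 1.0)])

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, local_deriv in self.parents:
            parent.backward(grad * local_deriv)  # chain rule

x, w = Var(2.0), Var(3.0)
loss = w * x + x          # loss = wx + x
loss.backward()
print(w.grad, x.grad)     # dloss/dw = x = 2.0, dloss/dx = w + 1 = 4.0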
How to optimize the loss: automatic differentiation

[Slides: a worked example of automatic differentiation on a computation graph.]
How to optimize the loss: deep learning libraries

https://github.com/soumith/convnet-benchmarks/
Also see: CMU’s locally grown library at https://github.com/clab/cnn
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Major results: language modeling

Major results: image classification

Krizhevsky et al. (2012)
Major results: ImageNet

Krizhevsky et al. (2012): positive and negative examples
Major results: ImageNet

Krizhevsky et al. (2012): sample convolution filters
Major results: speech recognition

Graves et al. (2013)
Major results: translation

Sutskever et al. (2014)
Major results: translation

Bahdanau et al. (2015)
Major results: dependency parsing

Chen and Manning (2014)
Major results: dependency parsing
Dyer et al. (2015)
Important things we didn’t cover
• Dark knowledge
• Connection to graphical models
• Alternatives to the softmax output layer
Agenda
• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results
Open question: can we do without the intermediate linguistic abstractions?

Natural language input signal: web page, question, search query, tweet, voice command
Output analysis: question, answer, command to a robot, trending topics

[Figure: the same pipeline of components: tokenizer, spell corrector, POS tagger, named entity recognizer, syntactic parser, semantic parser, coreference resolution, sentiment analyzer, machine translator, speech recognizer, classification.]