Part 2: Deep Learning - hitir.hit.edu.cn/~car/talks/ijcnlp17-tutorial/lec02-deep_learning.pdf


Part 2: Deep Learning


Part 2.1: Deep Learning Background


What is Machine Learning?

• From Data to Knowledge

[Figure: a traditional program maps Input + Algorithm → Output; an ML program maps Input + Output → “Algorithm” (the learned model).]

A Standard Example of ML

• The MNIST (Modified NIST) database of hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it

– 60,000 training + 10,000 test hand-written digits (28x28 pixels each)


Very hard to say what makes a 2


Traditional Model (before 2012)

• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts

SIFT, GIST, Shape Context


Deep Learning (after 2012)

• Learning Hierarchical Representations
• DEEP means more than one stage of non-linear feature transformation


Deep Learning Architecture


Deep Learning is Not New

• 1980s technology (Neural Networks)


About Neural Networks

• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets

• Cons
– Trouble with >3 layers
– Overfits
– Slow to train


Deep Learning beats NN

• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets

• Cons, and how deep learning addresses them
– Trouble with >3 layers → new activation functions (ReLU, …), gated mechanisms
– Overfits → Dropout, Maxout, Stochastic Pooling
– Slow to train → GPUs


Results on MNIST

• Naïve Neural Network – 96.59%

• SVM (default settings for libsvm) – 94.35%

• Optimal SVM [Andreas Mueller] – 98.56%

• The state of the art: Convolutional NN (2013) – 99.79%

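To make the baselines above concrete, here is a minimal, hedged sketch of an SVM baseline on MNIST using scikit-learn (an assumed illustration, not the tutorial's own code); default SVC settings and a training subsample are used, so the accuracy will only be roughly comparable to the figures quoted above.

```python
# Minimal sketch: an SVM baseline on MNIST with scikit-learn (assumed available).
# The accuracy lands in the same ballpark as the slide's numbers, not identical.
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                                # scale pixels to [0, 1]
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

clf = SVC()                                  # default RBF-kernel SVM
clf.fit(X_train[:10000], y_train[:10000])    # subsample: the full 60k is slow for an SVM
print("test accuracy:", clf.score(X_test, y_test))
```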

Deep Learning for Speech Recognition


DL for NLP: Representation Learning



DL for NLP: End-to-End Learning


[Figure: a transition-based dependency parser configuration (Stack: ROOT has_VBZ good_JJ; Buffer: Control_NN ._.; He_PRP already attached via nsubj). A traditional parser represents the configuration as a sparse binary feature vector (0 0 1 0 1 … 0 1 0 0), while a Stack-LSTM parser learns the representation end-to-end.]

Part 2.2: Feedforward Neural Networks


The Standard Perceptron Architecture

[Figure: input units → feature units (via hand-coded programs) → decision unit (via learned weights). Example: for the sentence “IPhone is very good .”, hand-coded feature units such as good, very/good, very, … receive learned weights (0.8, 0.9, 0.1, …), and the decision unit compares the weighted sum to a threshold (>5?).]

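A minimal sketch of the pipeline in this slide, with hand-coded feature programs and hand-set, purely illustrative weights and threshold (the 0.8/0.9/0.1 values mirror the figure; the sentence and feature names come from the slide):

```python
# Minimal sketch of the perceptron pipeline above: hand-coded feature extraction,
# "learned" (here: hand-set, hypothetical) weights, and a threshold decision unit.
def extract_features(sentence):
    """Hand-coded feature programs: binary indicators for words/bigrams."""
    tokens = sentence.lower().split()
    bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return {"good": "good" in tokens,
            "very": "very" in tokens,
            "very good": "very good" in bigrams}

# Hypothetical learned weights for each feature unit (values are illustrative).
weights = {"good": 0.8, "very good": 0.9, "very": 0.1}
threshold = 0.5

def perceptron_decide(sentence):
    feats = extract_features(sentence)
    score = sum(weights[f] for f, on in feats.items() if on)
    return "positive" if score > threshold else "negative"

print(perceptron_decide("IPhone is very good ."))  # -> positive
```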

The Limitations of Perceptrons

• The hand-coded features
– Have a great influence on the performance
– Finding suitable features is costly
• A linear classifier with a hyperplane
– Cannot separate non-linear data; for example, the XOR function cannot be learned by a single-layer perceptron (see the sketch after this slide)

[Figure: the four XOR inputs (0,0), (0,1), (1,0), (1,1) with outputs 0 and 1; the positive and negative cases cannot be separated by a single weight plane.]

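As a small illustration of the XOR limitation, the following sketch (with hand-chosen weights, assumed here only for clarity) shows that one hidden layer of two non-linear units reproduces the XOR truth table that no single hyperplane can:

```python
# Minimal sketch (assumed weights): XOR is not linearly separable, but one hidden
# layer of two non-linear units makes it separable.
import numpy as np

def step(z):                      # hard-threshold non-linearity
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden units: h1 = OR(x1, x2), h2 = AND(x1, x2); output = h1 AND NOT h2 = XOR.
W_hidden = np.array([[1.0, 1.0],      # weights into h1
                     [1.0, 1.0]])     # weights into h2
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])
b_out = -0.5

h = step(X @ W_hidden.T + b_hidden)
y = step(h @ w_out + b_out)
print(y)   # [0. 1. 1. 0.] -- the XOR truth table, which no single hyperplane produces
```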

Learning with Non-linear Hidden Layers


Feedforward Neural Networks

• Multi-layer Perceptron (MLP)
• The information is propagated from the inputs to the outputs
• NO cycles between outputs and inputs
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help; it is still linear
– Fixed output non-linearities are not enough

[Figure: inputs x1, x2, …, xn feeding a 1st hidden layer, a 2nd hidden layer, and an output layer.]

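A minimal sketch of the forward pass just described, with hypothetical layer sizes and random weights: two non-linear hidden layers followed by a softmax output layer.

```python
# Minimal sketch of an MLP forward pass (assumed sizes, random weights).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 6, 3          # hypothetical layer sizes

W1, b1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    h1 = np.tanh(W1 @ x + b1)       # 1st hidden layer (non-linear)
    h2 = np.tanh(W2 @ h1 + b2)      # 2nd hidden layer
    return softmax(W3 @ h2 + b3)    # output layer: class probabilities

x = rng.normal(size=n_in)
print(forward(x))                    # probabilities summing to 1.0
```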

Multiple Layer Neural Networks

• What are those hidden neurons doing?
– Maybe they represent outlines


General Optimizing (Learning) Algorithms

• Gradient Descent

• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1)

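A minimal sketch of minibatch SGD on a least-squares objective (the data, learning rate, and batch size are assumptions chosen for illustration); setting m = 1 gives online GD, and m = N recovers full-batch gradient descent.

```python
# Minimal sketch of (minibatch) stochastic gradient descent on a toy regression task.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=N)

w = np.zeros(d)
lr, m = 0.1, 16                       # learning rate and minibatch size (assumed)
for epoch in range(50):
    idx = rng.permutation(N)          # shuffle, then sweep in minibatches
    for start in range(0, N, m):
        batch = idx[start:start + m]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient of the mean squared error
        w -= lr * grad
print(w)   # close to [1.0, -2.0, 0.5]
```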

Computational/Flow Graphs

• Describing Mathematical Expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6

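A one-line-per-node sketch of the forward pass of the example graph e = (a + b) * (b + 1):

```python
# Minimal sketch: forward pass of the example computational graph, node by node.
a, b = 2.0, 1.0
c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6
print(c, d, e)     # 3.0 2.0 6.0
```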

Derivatives on Computational Graphs


Computational Graph Backward Pass (Backpropagation)

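A minimal sketch of the backward pass on the same example graph: each node's local derivative is multiplied along every path to the output and the paths are summed (the chain rule).

```python
# Minimal sketch: backward pass (backpropagation) on e = (a + b) * (b + 1).
a, b = 2.0, 1.0
c, d = a + b, b + 1
e = c * d

de_dc = d                           # ∂e/∂c = d
de_dd = c                           # ∂e/∂d = c
de_da = de_dc * 1.0                 # ∂c/∂a = 1, so ∂e/∂a = d
de_db = de_dc * 1.0 + de_dd * 1.0   # b reaches e through both c and d
print(de_da, de_db)                 # 2.0 5.0
```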

Part 2.3: Recurrent and Other Neural Networks


Language Models

• A language model computes a probability for a sequence of words, $P(w_1, \cdots, w_T)$, or predicts a probability for the next word, $P(w_{T+1} \mid w_1, \cdots, w_T)$

• Useful for machine translation, speech recognition, and so on
– Word ordering
• P(the cat is small) > P(small the is cat)
– Word choice
• P(there are four cats) > P(there are for cats)


Traditional Language Models

• An incorrect but necessary Markov assumption!
– Probability is usually conditioned on the n previous words
– $P(w_1, \cdots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \cdots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \cdots, w_{i-1})$
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history

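A minimal sketch of a count-based bigram model (the Markov assumption with n = 2) on a toy corpus; a real n-gram model would add smoothing for unseen n-grams.

```python
# Minimal sketch: a count-based bigram language model on a toy corpus.
from collections import Counter

corpus = "the cat is small . the cat is cute .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) estimated by relative frequency."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("cat", "the"))   # 1.0  (both occurrences of "the" are followed by "cat")
print(p_next("is", "cat"))    # 1.0
print(p_next("small", "is"))  # 0.5
```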

Recurrent Neural Networks (RNNs)

• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs

[Figure: an RNN unrolled over time; at each step t, the input x_t (through W_2) and the previous hidden state h_{t-1} (through W_1) produce h_t, which produces the output y_t (through W_3); the same weights are reused at every time step.]


Recurrent Neural Networks (RNNs)

• At a single time step t
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$


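A minimal sketch of the single-step equations above in NumPy, with hypothetical dimensions and random weights:

```python
# Minimal sketch of one RNN step: h_t = tanh(W1 h_{t-1} + W2 x_t), y_hat_t = softmax(W3 h_t).
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 5, 4, 3                 # input, hidden, and output sizes (assumed)
W1 = rng.normal(size=(d_h, d_h))
W2 = rng.normal(size=(d_h, d_x))
W3 = rng.normal(size=(d_y, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W1 @ h_prev + W2 @ x_t)
    y_hat_t = softmax(W3 @ h_t)
    return h_t, y_hat_t

# Run over a short random sequence, carrying the hidden state forward.
h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h, y_hat = rnn_step(h, x_t)
print(h, y_hat)
```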

Training RNNs is hard

• Ideally, inputs from many time steps ago can modify the output y
• For example, with 2 time steps



Back Propagation Through Time (BPTT)
• Total error is the sum of the error at each time step t
– $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial W_1}$ is hard (also for $W_2$)
• Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which depends on $W_1$ and $h_{t-2}$, and so on.
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W_1}$


The vanishing gradient problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$, where $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1\,\mathrm{diag}[\tanh'(\cdots)]$
• $\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \le \beta\,\|W_1\| \le \beta\gamma_1$
– where $\beta$ bounds $\|\mathrm{diag}[\tanh'(\cdots)]\|$ and $\gamma_1$ is the largest singular value of $W_1$
• $\left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\beta\gamma_1)^{t-k}$
– if $\beta\gamma_1 < 1$, this can become very small (vanishing gradient)
– if $\beta\gamma_1 > 1$, this can become very large (exploding gradient)
• Trick for the exploding gradient: the clipping trick (set a threshold)

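A minimal sketch of the clipping trick mentioned above: if the gradient norm exceeds a chosen threshold, rescale the gradient to that norm (the threshold value here is an arbitrary example).

```python
# Minimal sketch of gradient-norm clipping for exploding gradients.
import numpy as np

def clip_gradient(grad, threshold=5.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # rescale so the norm equals the threshold
    return grad

g = np.array([30.0, -40.0])            # norm 50, well above the threshold
print(clip_gradient(g))                # [ 3. -4.]  (norm rescaled to 5)
```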

A “solution”

• Intuition
– Ensure $\left\|\frac{\partial h_t}{\partial h_k}\right\| \ge 1$ to prevent vanishing gradients

• So…
– Proper initialization of the W
– Use ReLU instead of tanh or sigmoid activation functions


A better “solution”

• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then, $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
– On the other hand
• $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$



A better “solution” (cont.)

• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose a memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$

Long Short-Term Memory (LSTM)

• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating

[Figure: the LSTM cell; x_t and h_{t-1} feed three sigmoid gates and a tanh candidate, the cell state C_{t-1} is updated additively to C_t, and h_t is obtained from tanh(C_t) gated by the output gate.]

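A minimal sketch of one LSTM step following the gating equations of the previous slides (biases are omitted for brevity and the dimensions are hypothetical):

```python
# Minimal sketch of one LSTM step: forget/input/output gates plus an additive cell update.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 4                                          # assumed sizes
Wf, Uf = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wi, Ui = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wo, Uo = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wu, Uu = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)          # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)          # input gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)          # output gate
    u_t = np.tanh(Wu @ x_t + Uu @ h_prev)          # new candidate content
    c_t = f_t * c_prev + i_t * u_t                 # additive cell-state update
    h_t = o_t * np.tanh(c_t)                       # selectively exposed state
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h, c = lstm_step(x_t, h, c)
print(h, c)
```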

Gated Recurrent Units, GRU (Cho et al., 2014)

• Main ideas
– Keep around memories to capture long distance dependencies
– Allow error messages to flow at different strengths depending on the inputs

• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$

• Reset gate
– Similarly, but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$


GRU
• Memory at a time step combines the current and previous time steps
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
– The update gate z controls how much of the past state should matter now
• If z is close to 1, then we can copy information in that unit through many time steps → less vanishing gradient!
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
– If a reset gate unit r is close to 0, then this ignores the previous memory and only stores the new input information → allows the model to drop information that is irrelevant in the future

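A minimal sketch of one GRU step following the update-gate, reset-gate, and new-memory equations above (biases omitted, dimensions hypothetical):

```python
# Minimal sketch of one GRU step: update gate, reset gate, and interpolated state.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 4                                          # assumed sizes
Wz, Uz = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W,  U  = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))  # new memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde      # interpolate old and new state

h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h = gru_step(x_t, h)
print(h)
```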

LSTM vs. GRU

• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture

• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize

• If you have enough data, the greater expressive power of LSTMs may lead to better results.


More RNNs

• Bidirectional RNN
• Stacked Bidirectional RNN


Tree-LSTMs

• Traditional Sequential Composition

• Tree-Structured Composition


More Applications of RNNs

• Neural Machine Translation
• Handwriting Generation
• Image Caption Generation
• …


Convolutional Neural Network

CS231n: Convolutional Neural Networks for Visual Recognition.

[Figure: convolution and pooling operations.]


CNN for NLP

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

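A minimal sketch of the CNN-for-sentence-classification idea analyzed by Zhang & Wallace (2015): convolve filters of several widths over word embeddings, apply ReLU, and max-pool over time. The embeddings and filters are random here and the dimensions are assumptions for illustration.

```python
# Minimal sketch: 1D convolution over word embeddings with max-over-time pooling.
import numpy as np

rng = np.random.default_rng(0)
sent_len, emb_dim = 7, 6
embeddings = rng.normal(size=(sent_len, emb_dim))    # one row per word (assumed)

def conv_and_pool(X, filter_width, n_filters=3):
    filters = rng.normal(size=(n_filters, filter_width, X.shape[1]))
    n_windows = X.shape[0] - filter_width + 1
    feature_maps = np.empty((n_filters, n_windows))
    for i in range(n_windows):
        window = X[i:i + filter_width]               # filter_width x emb_dim slice
        # dot each filter with the window, then ReLU
        feature_maps[:, i] = np.maximum(0.0, np.tensordot(filters, window, axes=([1, 2], [0, 1])))
    return feature_maps.max(axis=1)                  # max-over-time pooling

# Concatenate pooled features from filters of widths 2, 3, and 4,
# then feed them to a classifier (not shown).
sentence_vec = np.concatenate([conv_and_pool(embeddings, w) for w in (2, 3, 4)])
print(sentence_vec.shape)   # (9,)
```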

Recursive Neural Network

Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.


Summary

• Deep Learning
– Representation Learning
– End-to-end Learning

• Popular Networks
– Feedforward Neural Networks
– Recurrent Neural Networks
– Convolutional Neural Networks

