Part 2: Deep Learning - hitir.hit.edu.cn/~car/talks/ijcnlp17-tutorial/lec02-deep_learning.pdf


Part 2: Deep Learning


Part 2.1: Deep Learning Background


What is Machine Learning?

• From Data to Knowledge

[Figure: a traditional program maps Input + Algorithm → Output; an ML program maps Input + Output → “Algorithm” (the learned model).]

A Standard Example of ML

• The MNIST (Modified NIST) database of hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it

– 60,000 training + 10,000 test hand-written digits (28x28 pixels each)


Very hard to say what makes a 2


Traditional Model (before 2012)

• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts

SIFT, GIST, Shape Context


Deep Learning (after 2012)

• Learning Hierarchical Representations
• DEEP means more than one stage of non-linear feature transformation


Deep Learning Architecture


Deep Learning is Not New

• 1980s technology (Neural Networks)


About Neural Networks

• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets

• Cons
– Trouble with >3 layers
– Overfits
– Slow to train


Deep Learning beats NN

• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets

• Cons, and how deep learning addresses them
– Trouble with >3 layers → new activation functions (ReLU, …), gated mechanisms
– Overfits → Dropout, Maxout, Stochastic Pooling
– Slow to train → GPUs


Results on MNIST

• Naïve Neural Network – 96.59%

• SVM (default settings for libsvm) – 94.35%

• Optimal SVM [Andreas Mueller] – 98.56%

• The state of the art: Convolutional NN (2013) – 99.79%

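To make the baselines above concrete, here is a minimal, hedged sketch of an SVM baseline on MNIST using scikit-learn (an assumed illustration, not the tutorial's own code); default SVC settings and a training subsample are used, so the accuracy will only be roughly comparable to the figures quoted above.

```python
# Minimal sketch: an SVM baseline on MNIST with scikit-learn (assumed available).
# The accuracy lands in the same ballpark as the slide's numbers, not identical.
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                                # scale pixels to [0, 1]
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

clf = SVC()                                  # default RBF-kernel SVM
clf.fit(X_train[:10000], y_train[:10000])    # subsample: the full 60k is slow for an SVM
print("test accuracy:", clf.score(X_test, y_test))
```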

Deep Learning for Speech Recognition


DL for NLP: Representation Learning



DL for NLP: End-to-End Learning


[Figure: a transition-based dependency parser configuration (Stack: ROOT has_VBZ good_JJ; Buffer: Control_NN ._.; He_PRP already attached via nsubj). A traditional parser represents the configuration as a sparse binary feature vector (0 0 1 0 1 … 0 1 0 0), while a Stack-LSTM parser learns the representation end-to-end.]

Part 2.2: Feedforward Neural Networks


The Standard Perceptron Architecture

[Figure: input units → feature units (via hand-coded programs) → decision unit (via learned weights). Example: for the sentence “IPhone is very good .”, hand-coded feature units such as good, very/good, very, … receive learned weights (0.8, 0.9, 0.1, …), and the decision unit compares the weighted sum to a threshold (>5?).]

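A minimal sketch of the pipeline in this slide, with hand-coded feature programs and hand-set, purely illustrative weights and threshold (the 0.8/0.9/0.1 values mirror the figure; the sentence and feature names come from the slide):

```python
# Minimal sketch of the perceptron pipeline above: hand-coded feature extraction,
# "learned" (here: hand-set, hypothetical) weights, and a threshold decision unit.
def extract_features(sentence):
    """Hand-coded feature programs: binary indicators for words/bigrams."""
    tokens = sentence.lower().split()
    bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return {"good": "good" in tokens,
            "very": "very" in tokens,
            "very good": "very good" in bigrams}

# Hypothetical learned weights for each feature unit (values are illustrative).
weights = {"good": 0.8, "very good": 0.9, "very": 0.1}
threshold = 0.5

def perceptron_decide(sentence):
    feats = extract_features(sentence)
    score = sum(weights[f] for f, on in feats.items() if on)
    return "positive" if score > threshold else "negative"

print(perceptron_decide("IPhone is very good ."))  # -> positive
```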

The Limitations of Perceptrons

• The hand-coded features
– Have a great influence on the performance
– Finding suitable features is costly
• A linear classifier with a hyperplane
– Cannot separate non-linear data; for example, the XOR function cannot be learned by a single-layer perceptron (see the sketch after this slide)

[Figure: the four XOR inputs (0,0), (0,1), (1,0), (1,1) with outputs 0 and 1; the positive and negative cases cannot be separated by a single weight plane.]

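As a small illustration of the XOR limitation, the following sketch (with hand-chosen weights, assumed here only for clarity) shows that one hidden layer of two non-linear units reproduces the XOR truth table that no single hyperplane can:

```python
# Minimal sketch (assumed weights): XOR is not linearly separable, but one hidden
# layer of two non-linear units makes it separable.
import numpy as np

def step(z):                      # hard-threshold non-linearity
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden units: h1 = OR(x1, x2), h2 = AND(x1, x2); output = h1 AND NOT h2 = XOR.
W_hidden = np.array([[1.0, 1.0],      # weights into h1
                     [1.0, 1.0]])     # weights into h2
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])
b_out = -0.5

h = step(X @ W_hidden.T + b_hidden)
y = step(h @ w_out + b_out)
print(y)   # [0. 1. 1. 0.] -- the XOR truth table, which no single hyperplane produces
```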

Learning with Non-linear Hidden Layers


Feedforward Neural Networks

• Multi-layer Perceptron (MLP)
• The information is propagated from the inputs to the outputs
• NO cycles between outputs and inputs
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help; it is still linear
– Fixed output non-linearities are not enough

[Figure: inputs x1, x2, …, xn feeding a 1st hidden layer, a 2nd hidden layer, and an output layer.]

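A minimal sketch of the forward pass just described, with hypothetical layer sizes and random weights: two non-linear hidden layers followed by a softmax output layer.

```python
# Minimal sketch of an MLP forward pass (assumed sizes, random weights).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 6, 3          # hypothetical layer sizes

W1, b1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    h1 = np.tanh(W1 @ x + b1)       # 1st hidden layer (non-linear)
    h2 = np.tanh(W2 @ h1 + b2)      # 2nd hidden layer
    return softmax(W3 @ h2 + b3)    # output layer: class probabilities

x = rng.normal(size=n_in)
print(forward(x))                    # probabilities summing to 1.0
```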

Multiple Layer Neural Networks

• What are those hidden neurons doing?
– Maybe they represent outlines


General Optimizing (Learning) Algorithms

• Gradient Descent

• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1)

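A minimal sketch of minibatch SGD on a least-squares objective (the data, learning rate, and batch size are assumptions chosen for illustration); setting m = 1 gives online GD, and m = N recovers full-batch gradient descent.

```python
# Minimal sketch of (minibatch) stochastic gradient descent on a toy regression task.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=N)

w = np.zeros(d)
lr, m = 0.1, 16                       # learning rate and minibatch size (assumed)
for epoch in range(50):
    idx = rng.permutation(N)          # shuffle, then sweep in minibatches
    for start in range(0, N, m):
        batch = idx[start:start + m]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient of the mean squared error
        w -= lr * grad
print(w)   # close to [1.0, -2.0, 0.5]
```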

Computational/Flow Graphs

• Describing Mathematical Expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6

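A one-line-per-node sketch of the forward pass of the example graph e = (a + b) * (b + 1):

```python
# Minimal sketch: forward pass of the example computational graph, node by node.
a, b = 2.0, 1.0
c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6
print(c, d, e)     # 3.0 2.0 6.0
```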

Derivatives on Computational Graphs


Computational Graph Backward Pass (Backpropagation)

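A minimal sketch of the backward pass on the same example graph: each node's local derivative is multiplied along every path to the output and the paths are summed (the chain rule).

```python
# Minimal sketch: backward pass (backpropagation) on e = (a + b) * (b + 1).
a, b = 2.0, 1.0
c, d = a + b, b + 1
e = c * d

de_dc = d                           # ∂e/∂c = d
de_dd = c                           # ∂e/∂d = c
de_da = de_dc * 1.0                 # ∂c/∂a = 1, so ∂e/∂a = d
de_db = de_dc * 1.0 + de_dd * 1.0   # b reaches e through both c and d
print(de_da, de_db)                 # 2.0 5.0
```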

Part 2.3: Recurrent and Other Neural Networks


Language Models

• A language model computes a probability for a sequence of words, $P(w_1, \cdots, w_T)$, or predicts a probability for the next word, $P(w_{T+1} \mid w_1, \cdots, w_T)$

• Useful for machine translation, speech recognition, and so on
– Word ordering
• P(the cat is small) > P(small the is cat)
– Word choice
• P(there are four cats) > P(there are for cats)


Traditional Language Models

• An incorrect but necessary Markov assumption!
– Probability is usually conditioned on the n previous words
– $P(w_1, \cdots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \cdots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \cdots, w_{i-1})$
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history

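A minimal sketch of a count-based bigram model (the Markov assumption with n = 2) on a toy corpus; a real n-gram model would add smoothing for unseen n-grams.

```python
# Minimal sketch: a count-based bigram language model on a toy corpus.
from collections import Counter

corpus = "the cat is small . the cat is cute .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) estimated by relative frequency."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("cat", "the"))   # 1.0  (both occurrences of "the" are followed by "cat")
print(p_next("is", "cat"))    # 1.0
print(p_next("small", "is"))  # 0.5
```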

Recurrent Neural Networks (RNNs)

• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs

[Figure: an RNN unrolled over time; at each step t, the input x_t (through W_2) and the previous hidden state h_{t-1} (through W_1) produce h_t, which produces the output y_t (through W_3); the same weights are reused at every time step.]


Recurrent Neural Networks (RNNs)

• At a single time step t
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$


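A minimal sketch of the single-step equations above in NumPy, with hypothetical dimensions and random weights:

```python
# Minimal sketch of one RNN step: h_t = tanh(W1 h_{t-1} + W2 x_t), y_hat_t = softmax(W3 h_t).
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 5, 4, 3                 # input, hidden, and output sizes (assumed)
W1 = rng.normal(size=(d_h, d_h))
W2 = rng.normal(size=(d_h, d_x))
W3 = rng.normal(size=(d_y, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W1 @ h_prev + W2 @ x_t)
    y_hat_t = softmax(W3 @ h_t)
    return h_t, y_hat_t

# Run over a short random sequence, carrying the hidden state forward.
h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h, y_hat = rnn_step(h, x_t)
print(h, y_hat)
```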

Training RNNs is hard

• Ideally, inputs from many time steps ago can modify the output y
• For example, with 2 time steps



Back Propagation Through Time (BPTT)
• Total error is the sum of the error at each time step t
– $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial W_1}$ is hard (also for $W_2$)
• Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which depends on $W_1$ and $h_{t-2}$, and so on.
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W_1}$


The vanishing gradient problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$, where $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1\,\mathrm{diag}[\tanh'(\cdots)]$
• $\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \le \beta\,\|W_1\| \le \beta\gamma_1$
– where $\beta$ bounds $\|\mathrm{diag}[\tanh'(\cdots)]\|$ and $\gamma_1$ is the largest singular value of $W_1$
• $\left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\beta\gamma_1)^{t-k}$
– if $\beta\gamma_1 < 1$, this can become very small (vanishing gradient)
– if $\beta\gamma_1 > 1$, this can become very large (exploding gradient)
• Trick for the exploding gradient: the clipping trick (set a threshold)

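A minimal sketch of the clipping trick mentioned above: if the gradient norm exceeds a chosen threshold, rescale the gradient to that norm (the threshold value here is an arbitrary example).

```python
# Minimal sketch of gradient-norm clipping for exploding gradients.
import numpy as np

def clip_gradient(grad, threshold=5.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # rescale so the norm equals the threshold
    return grad

g = np.array([30.0, -40.0])            # norm 50, well above the threshold
print(clip_gradient(g))                # [ 3. -4.]  (norm rescaled to 5)
```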

A “solution”

• Intuition
– Ensure $\left\|\frac{\partial h_t}{\partial h_k}\right\| \ge 1$ to prevent vanishing gradients

• So…
– Proper initialization of the W
– Use ReLU instead of tanh or sigmoid activation functions


A better “solution”

• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then, $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
– On the other hand
• $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$



A better “solution” (cont.)

• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose a memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$

Long Short-Term Memory (LSTM)

• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating

[Figure: the LSTM cell; x_t and h_{t-1} feed three sigmoid gates and a tanh candidate, the cell state C_{t-1} is updated additively to C_t, and h_t is obtained from tanh(C_t) gated by the output gate.]

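A minimal sketch of one LSTM step following the gating equations of the previous slides (biases are omitted for brevity and the dimensions are hypothetical):

```python
# Minimal sketch of one LSTM step: forget/input/output gates plus an additive cell update.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 4                                          # assumed sizes
Wf, Uf = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wi, Ui = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wo, Uo = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wu, Uu = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)          # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)          # input gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)          # output gate
    u_t = np.tanh(Wu @ x_t + Uu @ h_prev)          # new candidate content
    c_t = f_t * c_prev + i_t * u_t                 # additive cell-state update
    h_t = o_t * np.tanh(c_t)                       # selectively exposed state
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h, c = lstm_step(x_t, h, c)
print(h, c)
```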

Gated Recurrent Units, GRU (Cho et al., 2014)

• Main ideas
– Keep around memories to capture long distance dependencies
– Allow error messages to flow at different strengths depending on the inputs

• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$

• Reset gate
– Similarly, but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$


GRU
• Memory at a time step combines the current and previous time steps
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
– The update gate z controls how much of the past state should matter now
• If z is close to 1, then we can copy information in that unit through many time steps → less vanishing gradient!
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
– If a reset gate unit r is close to 0, then this ignores the previous memory and only stores the new input information → allows the model to drop information that is irrelevant in the future

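A minimal sketch of one GRU step following the update-gate, reset-gate, and new-memory equations above (biases omitted, dimensions hypothetical):

```python
# Minimal sketch of one GRU step: update gate, reset gate, and interpolated state.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 4                                          # assumed sizes
Wz, Uz = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W,  U  = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))  # new memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde      # interpolate old and new state

h = np.zeros(d_h)
for x_t in rng.normal(size=(6, d_x)):
    h = gru_step(x_t, h)
print(h)
```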

LSTM vs. GRU

• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture

• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize

• If you have enough data, the greater expressive power of LSTMs may lead to better results.


More RNNs

• Bidirectional RNN
• Stacked Bidirectional RNN


Tree-LSTMs

• Traditional Sequential Composition

• Tree-Structured Composition


More Applications of RNNs

• Neural Machine Translation
• Handwriting Generation
• Image Caption Generation
• …


Convolutional Neural Network

CS231n: Convolutional Neural Networks for Visual Recognition.

[Figure: convolution and pooling operations.]


CNN for NLP

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

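A minimal sketch of the CNN-for-sentence-classification idea analyzed by Zhang & Wallace (2015): convolve filters of several widths over word embeddings, apply ReLU, and max-pool over time. The embeddings and filters are random here and the dimensions are assumptions for illustration.

```python
# Minimal sketch: 1D convolution over word embeddings with max-over-time pooling.
import numpy as np

rng = np.random.default_rng(0)
sent_len, emb_dim = 7, 6
embeddings = rng.normal(size=(sent_len, emb_dim))    # one row per word (assumed)

def conv_and_pool(X, filter_width, n_filters=3):
    filters = rng.normal(size=(n_filters, filter_width, X.shape[1]))
    n_windows = X.shape[0] - filter_width + 1
    feature_maps = np.empty((n_filters, n_windows))
    for i in range(n_windows):
        window = X[i:i + filter_width]               # filter_width x emb_dim slice
        # dot each filter with the window, then ReLU
        feature_maps[:, i] = np.maximum(0.0, np.tensordot(filters, window, axes=([1, 2], [0, 1])))
    return feature_maps.max(axis=1)                  # max-over-time pooling

# Concatenate pooled features from filters of widths 2, 3, and 4,
# then feed them to a classifier (not shown).
sentence_vec = np.concatenate([conv_and_pool(embeddings, w) for w in (2, 3, 4)])
print(sentence_vec.shape)   # (9,)
```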

Recursive Neural Network

Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.


Summary

• Deep Learning
– Representation Learning
– End-to-end Learning

• Popular Networks
– Feedforward Neural Networks
– Recurrent Neural Networks
– Convolutional Neural Networks

