Deep Learning
Ronald Parr, CompSci 570
With thanks to Kris Hauser for some content
Late 1990's: Neural Networks Hit the Wall
• Recall that a 3-layer network can approximate any function arbitrarily closely (caveat: might require many, many hidden nodes)
• Q: Why not use big networks for hard problems?
• A: It didn't work in practice!
– Vanishing gradients
– Not enough training data
– Not enough training time
Why Deep?
• Deep learning is a family of techniques for building and training large neural networks
• Why deep and not wide?
– Deep sounds better than wide :)
– Maybe easier to structure; think about layers of computation vs. one flat, wide computation (consider CNF to DNF conversion from midterm)
Vanishing Gradients
• Recall backprop derivation:
• Activation functions often between -1 and +1
• The further you get from the output layer, the smaller the gradient gets
• Hard to learn when gradients are noisy and small
$$\delta_j = \sum_k \frac{\partial E}{\partial a_k}\,\frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k w_{kj}\,\delta_k$$
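A minimal sketch of this delta recursion for a chain of single-unit layers (tanh activation, weight, and pre-activation values are assumed, chosen only to show how the gradient shrinks with depth):

```python
import numpy as np

# delta_j = h'(a_j) * sum_k w_kj * delta_k, specialized to a chain of
# single-unit tanh layers with illustrative weights and activations.
def tanh_deriv(a):
    return 1.0 - np.tanh(a) ** 2

depth = 10
w = 0.5          # assumed weight magnitude on each backward step
a = 1.5          # assumed pre-activation at each layer
delta = 1.0      # delta at the output layer
for layer in range(depth, 0, -1):
    delta = tanh_deriv(a) * w * delta
    print(f"layer {layer}: delta = {delta:.2e}")
# Each step multiplies by h'(a)*w < 1, so delta decays geometrically
# toward zero as we move away from the output layer.
```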
Related Problem: Saturation
• Sigmoid gradient goes to 0 at tails
• Extreme values (saturation) anywhere along the backprop path cause the gradient to vanish
[Plot: a sigmoidal activation function over inputs from -10 to 10; the curve flattens at both tails, where the gradient approaches 0.]
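A quick numeric check of the tail behavior for the logistic sigmoid (the specific input values are just illustrative):

```python
import numpy as np

# Derivative of the logistic sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x)).
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x = {x:5.1f}  sigma'(x) = {s * (1 - s):.6f}")
# The derivative falls from 0.25 at x = 0 to about 4.5e-05 at x = 10: a saturated
# unit anywhere along the backprop path multiplies the gradient by nearly zero.
```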
Estimating the Gradient
• Recall: Backpropagation is gradient descent
• Computing the exact gradient of the loss function requires summing over all training samples
• Why not update after each training sample?
– Called online or stochastic gradient descent
– Possibility of more efficient learning
• Suppose you need only a small number of samples to estimate the gradient correctly?
• Why do lots of unnecessary computation?
– But, theoretically, can be unstable unless you use a small step size
Batch/Minibatch Methods
• Find a sweet spot by estimating the gradient using a subset of the samples
• Randomly sample subsets of the training data and sum gradient computations over all samples in the subset
• Take advantage of parallel architectures (multicore/GPU)
• Still requires careful selection of step size and step size adjustment schedule – art vs. science
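A minimal numpy sketch of this spectrum for linear least squares (the model, data, and step sizes are assumed for illustration; batch_size=1 gives online/stochastic updates, batch_size=len(X) recovers full-batch gradient descent):

```python
import numpy as np

def minibatch_sgd(X, y, step_size=0.01, batch_size=32, epochs=10, seed=0):
    """Minibatch gradient descent on squared error for a linear model y ~ X @ w.

    batch_size=1 is online/stochastic gradient; batch_size=len(X) is full batch.
    (Illustrative sketch, not tuned.)"""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # randomly sample subsets
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = X[idx].T @ err / len(idx)     # gradient summed over the subset
            w -= step_size * grad
    return w

# Usage on synthetic data (assumed setup for illustration):
X = np.random.randn(1000, 5)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * np.random.randn(1000)
print(minibatch_sgd(X, y, step_size=0.1, batch_size=32, epochs=20))
```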
Tricks for Speeding Things Up
• Second order methods, e.g., Newton's method – may be computationally intensive in high dimensions
• Conjugate gradient is more computationally efficient, though not yet widely used
• Momentum: Use a combination of previous gradients to smooth out oscillations (see the sketch below)
• Line search: (Binary) search in the gradient direction to find the biggest worthwhile step size
• Some methods try to get benefits of second order methods without the cost (without computing the full Hessian), e.g., Adam
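A small sketch of the classical momentum update (the loss, coefficients, and step size are hypothetical; Adam adds per-parameter scaling not shown here):

```python
import numpy as np

def momentum_step(w, v, grad, step_size=0.01, beta=0.9):
    """One classical momentum update: blend the new gradient with an exponentially
    decaying combination of previous gradients to smooth out oscillations."""
    v = beta * v + grad          # running combination of past gradients
    w = w - step_size * v        # step along the smoothed direction
    return w, v

# Usage on a hypothetical quadratic loss 0.5 * w^T A w:
A = np.diag([1.0, 100.0])                    # poorly conditioned -> oscillations
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, grad=A @ w, step_size=0.005, beta=0.9)
print(w)   # heads toward the minimum at the origin
```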
Tricks For Breaking Down Problems
• Build up deep networks by training shallow networks, then feeding their output into new layers (may help with vanishing gradients and other problems) – a form of "pretraining"
• Train the network to solve "easier" problems first, then train on harder problems – curriculum learning, a form of "shaping"
Convolutional Neural Networks (CNNs)
• Championed by LeCun (1998)
• Originally used for handwriting recognition
• Now used in state-of-the-art systems in many computer vision applications
• Well-suited to data with a grid-like structure
Convolutions
• What is a convolution?
• A way to combine two functions, e.g., x and w:
$$s(t) = \int x(a)\, w(t-a)\, da \qquad \text{(integral over the entire domain)}$$
• Discrete version:
$$s(t) = \sum_a x(a)\, w(t-a)$$
Example: Suppose s(t) is a decaying average of values of x around t, with w decreasing as a gets further from t.
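A small numpy sketch of that decaying-average example (the signal and decay rate are illustrative; np.convolve computes exactly the discrete sum above):

```python
import numpy as np

# Decaying-average example: w(d) falls off as |d| grows, so s(t) = sum_a x(a) w(t-a)
# is a smoothed version of x around t. Signal and decay rate are illustrative.
t = np.arange(100)
x = np.sin(t / 5.0) + 0.3 * np.random.randn(100)      # noisy signal

offsets = np.arange(-5, 6)                             # window of offsets d = t - a
w = np.exp(-np.abs(offsets))                           # weight decays away from t
w /= w.sum()                                           # normalize so s is an average

s = np.convolve(x, w, mode="same")                     # discrete convolution
print(x[:5].round(2), s[:5].round(2))
```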
Convolutions on Grids
• For image I
• Convolution "kernel" K:
$$S(i,j) = \sum_m \sum_n I(m,n)\, K(i-m,\, j-n) = \sum_m \sum_n I(i-m,\, j-n)\, K(m,n)$$
Examples: A convolution can blur/smooth/noise-filter an image by averaging neighboring pixels. A convolution can also serve as an edge detector.
Figure 9.6 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
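A small numpy sketch of those two examples on a tiny image (the image and kernels are illustrative; this is "valid" convolution without kernel flipping, as in the figure on the next slide):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D 'valid' convolution without kernel flipping: slide the kernel over
    every position where it fits entirely inside the image."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 0, 9, 9, 9],
                  [0, 0, 0, 9, 9, 9],
                  [0, 0, 0, 9, 9, 9],
                  [0, 0, 0, 9, 9, 9]], dtype=float)   # a vertical edge

blur = np.full((3, 3), 1.0 / 9.0)                     # averaging neighbors smooths
edge = np.array([[-1.0, 0.0, 1.0]] * 3)               # responds to horizontal change

print(conv2d_valid(image, blur))    # edge softened into intermediate values
print(conv2d_valid(image, edge))    # large responses only at the edge location
```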
Convolution on Grid Example

Input (3x4):
a b c d
e f g h
i j k l

Kernel (2x2):
w x
y z

Output (2x3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. The upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.

Figure 9.1 from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Application to Images & Nets
• Images have a huge input space: 1000x1000 = 1M pixels
• Fully connected layers = huge number of weights, slow training
• Convolutional layers reduce connectivity by connecting only an m x n window around each pixel
• Can use weight sharing to learn a common set of weights so that the same convolution is applied everywhere (or in multiple places)
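A back-of-the-envelope comparison of weight counts for the 1000x1000 example (the 5x5 window, hidden layer width, and number of feature maps are assumed for illustration):

```python
# Weight counts for a 1000x1000 image (layer sizes are illustrative).
pixels = 1000 * 1000

# Fully connected: every hidden unit sees every pixel.
hidden_units = 1000                                  # assumed hidden layer width
fc_weights = pixels * hidden_units                   # 10^9 weights

# Local connectivity only: each output unit sees a 5x5 window around its pixel.
window = 5 * 5
local_weights = pixels * window                      # 2.5 * 10^7 weights

# Local connectivity + weight sharing: one 5x5 kernel per feature map,
# reused at every position.
feature_maps = 16                                    # assumed number of maps
shared_weights = window * feature_maps               # 400 weights

print(fc_weights, local_weights, shared_weights)
```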
Additional Stages
• Convolutional stages (may) feed into detector stages
• Detectors are nonlinear, e.g., ReLU
• Detectors feed into pool stages
• Pooling stages summarize upstream nodes, e.g., average, 2-norm, max
[Figure illustrating these stages omitted. Source: Wikipedia]
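A minimal sketch of the convolution → detector → pool pattern on a 1-D signal (the signal, kernel, and pooling width are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # detector stage: elementwise nonlinearity

def max_pool_1d(x, width=2):
    """Pooling stage: summarize each non-overlapping window by its max."""
    return x[: len(x) // width * width].reshape(-1, width).max(axis=1)

signal = np.array([0.0, 1.0, 3.0, 1.0, 0.0, -1.0, -3.0, -1.0])
kernel = np.array([-1.0, 1.0])           # illustrative convolution kernel

conv_out = np.convolve(signal, kernel, mode="valid")   # convolutional stage
detected = relu(conv_out)                               # detector stage (ReLU)
pooled = max_pool_1d(detected, width=2)                 # pooling stage
print(conv_out, detected, pooled, sep="\n")
```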
ReLU vs. Sigmoid
• ReLU is faster to compute
• Derivative is trivial
• Only saturates on one side
• Worry about non-differentiability at 0?
• Can use a sub-gradient
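A small sketch comparing the two derivatives (the choice of sub-gradient value at 0 is a convention; 0 is assumed here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_subgradient(x):
    # The derivative is trivial: 1 for x > 0, 0 for x < 0. At exactly x = 0 the
    # function is not differentiable; any value in [0, 1] is a valid sub-gradient
    # (0 is assumed here).
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))               # [0, 0, 0, 1, 10]
print(relu_subgradient(x))   # [0, 0, 0, 1, 1]: saturates only on the negative side
print(sigmoid_grad(x))       # tiny at both tails: saturates on both sides
```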
Example Convolutional Network

[Architecture diagram:
INPUT 28x28
→ Convolution → feature maps 4@24x24
→ Subsampling → feature maps 4@12x12
→ Convolution → feature maps 12@8x8
→ Subsampling → feature maps 12@4x4
→ Convolution → OUTPUT 26@1x1]
From Convolutional Networks for Images, Speech, and Time-Series, LeCun & Bengio
N.B.: Subsampling = averaging
Weight sharing results in 2,600 weights shared over 100,000 connections.
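A sketch of this architecture in PyTorch (an assumed framework choice; the 5x5 and 4x4 kernels and 2x2 average pooling are inferred from the listed feature-map sizes, and the original LeCun & Bengio network differs in details such as partial connectivity between maps, so the weight count does not match the 2,600 above exactly):

```python
import torch
import torch.nn as nn

# Sketch of the slide's architecture. Kernel sizes are inferred from the
# feature-map sizes (28->24 and 12->8 imply 5x5 kernels; 4->1 implies 4x4);
# subsampling is modeled as 2x2 average pooling.
net = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5),    # 28x28 -> 4 @ 24x24
    nn.Tanh(),
    nn.AvgPool2d(2),                   # -> 4 @ 12x12
    nn.Conv2d(4, 12, kernel_size=5),   # -> 12 @ 8x8
    nn.Tanh(),
    nn.AvgPool2d(2),                   # -> 12 @ 4x4
    nn.Conv2d(12, 26, kernel_size=4),  # -> 26 @ 1x1 (one output per class)
)

x = torch.randn(1, 1, 28, 28)          # a single dummy 28x28 input
print(net(x).shape)                    # torch.Size([1, 26, 1, 1])
print(sum(p.numel() for p in net.parameters()))  # sharing keeps this small
```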
Why This Works
• ConvNets can use weight sharing to reduce the number of parameters learned – mitigates problems with big networks
• Can be structured to learn scale- and position-invariant feature detectors
• Final layers then combine features to learn the target function
• Can be viewed as doing simultaneous feature selection and classification
ConvNets in Practice
• Work surprisingly well in many examples, even those that aren't images
• Number of convolutional layers, form of pooling and detecting units may be application specific – art & science in picking these
Other Tricks
• Residual nets introduce connections across layers, which tends to mitigate the vanishing gradient problem
• Techniques such as image perturbation and dropout reduce overfitting and produce more robust solutions
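A small PyTorch sketch of both ideas (the layer width and dropout rate are assumed; this is a generic residual block, not a specific published architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A generic residual block: output = x + f(x). The skip connection gives
    gradients a direct path across the block, mitigating vanishing gradients."""
    def __init__(self, dim=64, p_drop=0.1):        # assumed width and dropout rate
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Dropout(p_drop),                    # randomly zero units during training
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)                       # skip connection across the layers

block = ResidualBlock()
x = torch.randn(8, 64)
print(block(x).shape)                              # torch.Size([8, 64])
```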
Putting It All Together
• Why is deep learning succeeding now when neural nets lost momentum in the 90's?
• New architectures (e.g., ConvNets) are better suited to (some) learning tasks and reduce the number of weights
• Smarter algorithms make better use of data and handle noisy gradients better
• Massive amounts of data make overfitting less of a concern (but still always a concern)
• Massive amounts of computation make handling massive amounts of data possible
• Large and growing bag of tricks for mitigating overfitting and vanishing gradient issues
Superficial(?) Limitations
• Deep learning results are not easily human-interpretable
• Computationally intensive
• Combination of art, science, rules of thumb
• Can be tricked:
– "Intriguing properties of neural networks"
Beyond Classification
• Deep networks (and other techniques) can be used for unsupervised learning
• Example: An autoencoder tries to compress inputs to a lower dimensional representation
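A minimal PyTorch autoencoder sketch (the input and bottleneck sizes are assumed; training would minimize the reconstruction error shown as a single loss evaluation here):

```python
import torch
import torch.nn as nn

# Minimal autoencoder: compress a 784-dimensional input (e.g., a 28x28 image)
# down to a 32-dimensional code, then reconstruct it. Sizes are illustrative.
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784))

x = torch.randn(16, 784)                  # a dummy batch of inputs
code = encoder(x)                         # lower dimensional representation
x_hat = decoder(code)                     # reconstruction
loss = nn.functional.mse_loss(x_hat, x)   # training would minimize this
print(code.shape, loss.item())
```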
Recurrent Networks
• Recurrent networks feed (part of) the output of the network back to the input
• Why?
– Can learn (hidden) state, e.g., state in a hidden Markov model
– Useful for parsing language
– Can learn a program
• LSTM: A variation on the RNN that handles long term memories better
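A small sketch of the recurrence itself (a plain tanh RNN cell; the dimensions and random weights are assumed, and LSTM cells add gating on top of this pattern):

```python
import numpy as np

# One step of a plain recurrent network: the new hidden state depends on the
# current input AND the previous hidden state fed back in. Sizes are illustrative.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)   # hidden state carries memory

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):      # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h.round(3))
```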
Deeper Limitations
• We get impressive results, but we don't always understand why or whether we really need all of the data and computation used
• Hard to explain results and hard to guard against adversarial special cases ("Intriguing properties of neural networks" and "Universal adversarial perturbations")
• Not clear how logic and high-level reasoning could be incorporated
• Not clear how to incorporate prior knowledge in a principled way