
Deep Learning: What, how, why? On the use of deep learning for Kaggle competitions

In about 45 minutes: ZMUV

In about 30 minutes: Bigger & deeper is better

In about 15 minutes

Who would if he had the time?

Who knows convolutional neural networks? Who works in machine learning? Who has worked on Kaggle problems? Who has worked with deep learning before?

Who knows what Kaggle is? Who knows what dropout is?

Who am I? Jonas Degrave, PhD student, UGent

Who are we? Former Reservoir Lab

Data Science Lab, IDLab

Research in neural networks since 2005

HIPSTER ALERT

What do we do? Machine learning, robotics, brain-inspired computing

What did we do?

Totalling $160k in prizes

Testimonials

Neural networks in 5 minutes

Input layer, hidden layers, output layer

Gradient descent

Backpropagation

Deep learning

Input layer, hidden layer, hidden layer, hidden layer, output layer
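As a minimal sketch (my own illustration, not the talk's code) of what such a stack of layers computes, a forward pass through an input layer, a few hidden layers and an output layer in numpy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Input layer -> hidden layers -> linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h.dot(W) + b)                    # hidden layers
    return h.dot(weights[-1]) + biases[-1]        # output layer

# toy network: 10 inputs, three hidden layers of 32 units, 3 outputs
sizes = [10, 32, 32, 32, 3]
rng = np.random.RandomState(0)
weights = [rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
print(forward(rng.randn(5, 10), weights, biases).shape)   # (5, 3)
```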

History
Artificial neural net: 1949
Backpropagation: 1975
Deep learning: 2012

What used to be the problem

Input layer, hidden layer, hidden layer, hidden layer, output layer

Vanishing gradients: and all the information is gone

For a long time, we didn't know

What has changed: GPUs, rectifiers, maxpool, dropout

They fight vanishing gradients!

Deep learning

Deep learning: state of the art for all problems with spatially correlated data

No more feature engineering!

Old-school bingo: Boltzmann machines, energy-based models, tanh or sigmoid activation, feature engineering, deep belief networks

How to do modern neural networks

Make the sets, make them well: train set, validation set & test set

Choose your error function. Always optimize the error function where possible!

Use the error function for your problem
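For instance, a competition scored on multiclass logloss can be trained on exactly that metric, since it is the cross-entropy; a minimal numpy sketch (my own, not from the talk):

```python
import numpy as np

def logloss(y_true, probs, eps=1e-15):
    """Multiclass logloss == cross-entropy: the Kaggle metric can be the training loss."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 2, 1])                      # true class labels
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])               # predicted probabilities
print(logloss(y_true, probs))
```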

Make train & validation curves: plot the error against training time for both the training set and the validation set

Underfitting & overfitting

Regularize. Bigger & deeper.

"Larger networks tend to work better. Make your network bigger and bigger until the accuracy stops increasing. Then regularize the hell out of it. Then make it bigger still." (Yoshua Bengio)

My first architecture

Start with standard components: conv layers, dense layers, max-pooling, dropout

Sparsity: make sure that, for each sample, only a few parameters are used

Dropout

Maxpool

Rectifier (aka ReLU): y = max(0, x)
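Rough numpy sketches of these standard components (illustrations only; a library such as Lasagne or Keras provides them ready-made), assuming feature maps of shape (channels, height, width):

```python
import numpy as np

def relu(x):
    """Rectifier: y = max(0, x)."""
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=np.random):
    """Training-time dropout: zero each unit with probability p, rescale the rest."""
    mask = rng.uniform(size=x.shape) > p
    return x * mask / (1.0 - p)

def maxpool2x2(x):
    """2x2 max-pooling over (channels, height, width) feature maps."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]              # drop odd rows/columns
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
```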

Convolution layers

No bigger than 3x3:
3x3 layer: 9 parameters, 3x3 receptive field
5x5 layer: 25 parameters, 5x5 receptive field
2 stacked 3x3 layers: 18 parameters, 5x5 receptive field

Output function: softmax (logloss, cross-entropy), sigmoid, identity (regression)

My first architecture

~ 1 million parameters

Let us optimize

Gradient Descent: the gradient is computed on the full train set.

Stochastic Gradient Descent: the gradient is computed on a small batch drawn from the train set.
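A minimal sketch of the difference (illustration only, with a linear model's gradient standing in for the network's): plain gradient descent would use all of X per step, SGD uses one small batch per step:

```python
import numpy as np

rng = np.random.RandomState(0)
X, y = rng.randn(1000, 20), rng.randn(1000)       # toy train set
w = np.zeros(20)
lr, batch_size = 0.01, 32

for step in range(1000):
    idx = rng.choice(len(X), batch_size, replace=False)  # draw a batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T.dot(Xb.dot(w) - yb) / batch_size   # gradient on the batch only
    w -= lr * grad                                        # SGD step
```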

Adam's update rule: the weight update step combines the gradients of successive batches (running averages of the gradient and its square) instead of using each batch's gradient directly.
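A minimal sketch of Adam's update (with the defaults from Kingma & Ba; not the talk's code): running averages of the gradient and of its square decide the step.

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad             # running mean of the gradient
    v = b2 * v + (1 - b2) * grad ** 2        # running mean of the squared gradient
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```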

Local minimum

You want generalization, not the global minimum on the train set!

My first architecture

~ 1 million parameters

Initialization

Weight matrices: random orthogonal initialization, with the correct amplitude

Most libraries provide this

Does not lose information
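Most libraries provide it, but as a sketch of what random orthogonal initialization means (gain of 1 by default here; a rectifier layer would typically use a larger amplitude, e.g. sqrt(2)):

```python
import numpy as np

def orthogonal(shape, gain=1.0, rng=np.random):
    """Random orthogonal weight matrix: it preserves the norm of the signal
    passing through the layer, so information is not lost or blown up."""
    a = rng.normal(size=shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == shape else vt         # pick the factor with the requested shape
    return gain * q

W = orthogonal((128, 128), gain=np.sqrt(2))   # e.g. for a rectifier layer
```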

Output layer: you have prior information! Initialize with zeros!

Bias

Bias sets the initial sparsity!
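One way to read these two slides together (my interpretation, not code from the talk): zero the output weights and put the prior in the bias, so the untrained network already predicts the class frequencies; for rectifier layers, the bias likewise decides how many units start out active, i.e. the initial sparsity.

```python
import numpy as np

# hypothetical 3-class output layer, with class frequencies known from the train set
class_freq = np.array([0.7, 0.2, 0.1])        # prior information
W_out = np.zeros((128, 3))                    # output weights initialized to zero
b_out = np.log(class_freq)                    # softmax(b_out) equals the prior

hidden = np.random.randn(5, 128)              # whatever the last hidden layer outputs
logits = hidden.dot(W_out) + b_out
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs[0])                               # ~[0.7, 0.2, 0.1] before any training
```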

Think about your initialization!

My first architecture

~ 1 million parameters

Does it work? Train on 1 sample. Train on 2 samples.
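That is, before training for real, check that the loss can be driven to (almost) zero on a single sample and then on two. A self-contained toy version of the check, with a tiny logistic regression standing in for the real network:

```python
import numpy as np

X = np.array([[1.0, 2.0], [-1.0, 0.5]])       # two training samples
y = np.array([1.0, 0.0])
w, b, lr = np.zeros(2), 0.0, 0.5

for n in (1, 2):                              # overfit 1 sample, then 2
    for step in range(500):
        p = 1.0 / (1.0 + np.exp(-(X[:n].dot(w) + b)))
        w -= lr * X[:n].T.dot(p - y[:n]) / n
        b -= lr * np.mean(p - y[:n])
    loss = -np.mean(y[:n] * np.log(p) + (1 - y[:n]) * np.log(1 - p))
    print(n, "samples, loss:", loss)          # should be close to zero both times
```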

Learning rate

Overshooting vs. learning too slowly

Learning rate

Data preprocessing: ZMUV your data

Zero mean, unit variance
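A sketch of ZMUV preprocessing; the statistics are computed on the train set only and reused for the validation and test sets:

```python
import numpy as np

X_train = np.random.randn(1000, 20) * 5 + 3   # stand-ins for the real feature matrices
X_valid = np.random.randn(200, 20) * 5 + 3

mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8              # avoid division by zero

X_train_zmuv = (X_train - mean) / std         # zero mean, unit variance
X_valid_zmuv = (X_valid - mean) / std         # same statistics: no leakage
```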

Batch normalization: insert a batchnorm layer between every pair of layers (input layer, hidden layers, output layer)
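A minimal sketch of what one batchnorm layer computes at training time (per-feature ZMUV over the batch, then a learned scale and shift); in practice the library also tracks running statistics for test time:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for activations of shape (batch, features)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learned scale and shift
```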

Regularization

Data augmentation
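A rough sketch of on-the-fly augmentation for images (random flip and random crop; which transformations are safe depends on the symmetries of the problem):

```python
import numpy as np

def augment(img, crop=24, rng=np.random):
    """img: (height, width) array. Random horizontal flip followed by a random crop."""
    if rng.rand() < 0.5:
        img = img[:, ::-1]                            # horizontal flip
    h, w = img.shape
    top = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop]      # random crop

batch = [augment(np.random.rand(28, 28)) for _ in range(32)]  # one augmented batch
```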

Unsupervised learning: learn on the test set. Pseudo-labeling, ladder networks.

Insert a priori information into the architecture

Insert a priori information

It's an art

Regularize. Bigger & deeper.

"Rinse and repeat" (Jonas Degrave)

Ensemble

The average prediction will always be better than the worst prediction.

Ensemble: optimized on the validation set
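A sketch of a weighted-average ensemble whose weights are chosen to minimize logloss on the validation set (scipy's Nelder-Mead used here as one possible optimizer; the function names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def logloss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

def ensemble_weights(valid_preds, y_valid):
    """valid_preds: list of (n_samples, n_classes) probability arrays, one per model."""
    def objective(w):
        w = np.abs(w) / np.abs(w).sum()               # positive weights summing to 1
        blend = sum(wi * p for wi, p in zip(w, valid_preds))
        return logloss(y_valid, blend)
    w0 = np.ones(len(valid_preds)) / len(valid_preds)
    res = minimize(objective, w0, method="Nelder-Mead")
    return np.abs(res.x) / np.abs(res.x).sum()
```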

Submit

Computing time: deadlines are fixed

End performance is proportional to the number of iterations, NOT the training time per model

Major take-aways: Everything has a reason. Don't buy into hypes. If it can't be explained in 1 minute why it works, it probably isn't working.

Skip connections

Wide convolutions

Thanks! Questions: @317070