
deep learning

Algorithms and Applications

Bernardete Ribeiro, bribeiro@dei.uc.pt

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015

III - Deep Learning Algorithms


elements 3: deep neural networks

outline

∙ Learning in Deep Neural Networks
∙ Deep Learning: Evolution Timeline
∙ Deep Architectures
∙ Restricted Boltzmann Machines (RBMs)
∙ Deep Belief Networks (DBNs)
∙ Deep Models Overall Characteristics


learning in deep neural networks

learning in deep neural networks

1. No general learning algorithm (no free lunch theorem by Wolpert, 1996)

2. Learning algorithms for specific tasks - perception, control, prediction, planning, reasoning, language understanding

3. Limitations of backpropagation (BP) - local minima, optimization challenges for non-convex objective functions

4. Hinton's deep belief networks (DBNs) as a stack of RBMs

5. LeCun's energy-based learning for DBNs


deep learning: evolution timeline

1. Perceptron [Frank Rosenblatt, 1959]
2. Neocognitron [K. Fukushima, 1980]
3. Convolutional Neural Network (CNN) [LeCun, 1989]
4. Multi-level Hierarchy Networks [Jürgen Schmidhuber, 1992]
5. Deep Belief Networks (DBNs) as a stack of RBMs [Geoffrey Hinton, 2006]


deep architectures

from brain-like computing to deep learning

∙ New empirical and theoretical results have brought deep architectures into the focus of Machine Learning (ML) researchers [Larochelle et al., 2007].

∙ Theoretical results suggest that deep architectures are fundamental to learn the kind of brain-like complicated functions that can represent high-level abstractions (e.g. vision, speech, language) [Bengio, 2009].


deep concepts main idea


deep neural networks

∙ Convolutional Neural Networks (CNNs) [LeCun et al., 1989]
∙ Deep Belief Networks (DBNs) [Hinton et al., 2006]
∙ AutoEncoders (AEs) [Bengio et al., NIPS 2006]
∙ Sparse Autoencoders [Ranzato et al., NIPS 2006]


convolutional neural networks (cnns)

∙ A Convolutional Neural Network consists of two basic operations:
  ∙ convolution
  ∙ pooling

∙ Convolutional and pooling layers are arranged alternately until high-level features are obtained

∙ Several feature maps in each convolutional layer

∙ Weights in the same map are shared

(Figure: a CNN applied to an input image, with alternating convolutional layers C1, C3 and subsampling/pooling layers S2, S4, followed by a fully connected NN stage.)

I. Arel, D. Rose & T. Karnowski, Deep Machine Learning: A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010.

convolutional neural networks (cnns)

∙ Convolution: suppose the layer has size d × d and the receptive fields have size r × r, and let γ and x denote respectively the values of the convolutional layer and of the previous layer. Then

\gamma_{ij} = g\Big(\sum_{m=1}^{r} \sum_{n=1}^{r} x_{i+m-1,\,j+n-1}\, w_{m,n} + b\Big), \qquad i, j = 1, \dots, (d - r + 1),

where g is a nonlinear function.

∙ Pooling follows convolution to reduce the dimensionality of the features and to introduce translational invariance into the CNN.
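A minimal NumPy sketch of these two operations (not from the slides; the 28 × 28 input, 5 × 5 receptive field, 2 × 2 pooling window and tanh nonlinearity are illustrative assumptions):

```python
import numpy as np

def conv_map(x, w, b, g=np.tanh):
    """Valid convolution of a d x d input with an r x r kernel, then nonlinearity g."""
    d, r = x.shape[0], w.shape[0]
    out = np.empty((d - r + 1, d - r + 1))
    for i in range(d - r + 1):
        for j in range(d - r + 1):
            out[i, j] = g(np.sum(x[i:i + r, j:j + r] * w) + b)
    return out

def max_pool(fmap, p=2):
    """Non-overlapping p x p max-pooling: fewer features, some translational invariance."""
    h, w_ = fmap.shape[0] // p, fmap.shape[1] // p
    return fmap[:h * p, :w_ * p].reshape(h, p, w_, p).max(axis=(1, 3))

x = np.random.rand(28, 28)            # toy input "image"
w = np.random.randn(5, 5) * 0.1       # one shared receptive field (kernel)
feature_map = conv_map(x, w, b=0.0)   # 24 x 24 feature map
pooled = max_pool(feature_map)        # 12 x 12 after pooling
```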


deep belief networks (dbns)

∙ Probabilistic generative models, contrasting with the discriminative nature of other NNs

∙ Generative models provide a joint probability distribution of data and labels

∙ Unsupervised greedy layer-wise pre-training followed by fine-tuning

(Figure: a DBN modelling 28 × 28 pixel images: a stack of RBM layers of alternating visible and hidden units, topped by a layer of top-level units connected to both the label units and the hidden units of the detection layer.)

Based on I. Arel, D. Rose & T. Karnowski, Deep Machine Learning: A New Frontier in Artificial Intelligence Research, IEEE Computational Intelligence Magazine, 2010.

autoencoders (aes)

∙ The auto-encoder has two components:
  ∙ the encoder f (mapping x to h), and
  ∙ the decoder g (mapping h to r)

∙ An auto-encoder is a neural network that tries to reconstruct its input to its output

(Figure: an auto-encoder: input x → encoder f → code h → decoder g → reconstruction r.)

Based on Y. Bengio, I. Goodfellow and A. Courville, Deep Learning, MIT Press book (in preparation), www.iro.umontreal.ca/~bengioy/dbook
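A minimal sketch of such an auto-encoder, assuming a single sigmoid hidden layer, a squared-error reconstruction loss and toy data (all sizes and the learning rate are illustrative, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 200 samples of dimension 64; code h of dimension 16.
X = rng.random((200, 64))
W_enc = rng.normal(0, 0.1, (64, 16)); b_enc = np.zeros(16)   # encoder f
W_dec = rng.normal(0, 0.1, (16, 64)); b_dec = np.zeros(64)   # decoder g

eta = 0.1
for epoch in range(100):
    h = sigmoid(X @ W_enc + b_enc)          # code h = f(x)
    r = sigmoid(h @ W_dec + b_dec)          # reconstruction r = g(h)
    err = r - X                             # gradient of the squared error at the output
    d_dec = err * r * (1 - r)               # backpropagate through the output sigmoids
    d_enc = (d_dec @ W_dec.T) * h * (1 - h) # ... and through the hidden sigmoids
    W_dec -= eta * h.T @ d_dec / len(X);  b_dec -= eta * d_dec.mean(0)
    W_enc -= eta * X.T @ d_enc / len(X);  b_enc -= eta * d_enc.mean(0)
```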

deep architectures versus shallow architectures

∙ Deep architectures can be exponentially more efficient than shallow architectures [Roux and Bengio, 2010].

∙ Functions that can be compactly represented with a Neural Network (NN) of depth d may require an exponential number of computational elements for a network with depth d − 1 [Bengio, 2009].

∙ Since the number of computational elements depends on the number of training samples available, using shallow architectures may result in poor generalization models [Bengio, 2009].

∙ As a result, deep architecture models tend to outperform shallow models such as Support Vector Machines (SVMs) [Larochelle et al., 2007].


Restricted Boltzmann Machines

Deep Belief Networks


restricted boltzmann machines

restricted boltzmann machines (rbms)

(Figure: an RBM as a bipartite graph: visible units v1, …, vI and hidden units h1, …, hJ, each layer with its own bias unit. The visible-to-hidden direction acts as an encoder and the hidden-to-visible direction as a decoder.)

restricted boltzmann machines (rbms)

∙ Unsupervised
  ∙ find complex regularities in the training data

∙ Bipartite graph
  ∙ visible and hidden layers

∙ Binary stochastic units
  ∙ on/off with a given probability

∙ 1 iteration
  ∙ update the hidden units
  ∙ reconstruct the visible units

∙ Maximum likelihood of the training data


restricted boltzmann machines (rbms)

∙ Training goal: the best probable reproduction of unsupervised data
  ∙ find the latent factors of the dataset
  ∙ adjust the weights to get the maximum probability of the input data


restricted boltzmann machines (rbms)

Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by:

E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j \qquad (1)

where W is the matrix of weights, and b and c are the biases of the hidden and visible layers, respectively.
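A small sketch of Eq. (1), assuming a weight matrix W of shape (J, I) and toy binary states (sizes and values are illustrative):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -c.v - b.h - h.W.v  (Eq. 1); W has shape (J, I)."""
    return -(c @ v) - (b @ h) - (h @ W @ v)

I, J = 6, 4                                  # toy sizes for the visible / hidden layers
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (J, I))
b, c = np.zeros(J), np.zeros(I)              # hidden and visible biases
v = rng.integers(0, 2, I)                    # a binary visible configuration
h = rng.integers(0, 2, J)                    # a binary hidden configuration
print(rbm_energy(v, h, W, b, c))
```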


restricted boltzmann machines (rbms)

The Restricted Boltzmann Machine (RBM) assigns a probability to each configuration (v, h), using:

p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z} \qquad (2)

where Z is a normalization constant called the partition function, obtained by summing over all possible (v, h) configurations [Bengio, 2009, Hinton, 2010, Carreira-Perpiñán and Hinton, 2005]:

Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \qquad (3)
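Because Z sums over all 2^(I+J) configurations, it is only tractable for tiny models; the brute-force sketch below (illustrative only, reusing the W, b, c convention above) makes Eqs. (2) and (3) concrete:

```python
from itertools import product
import numpy as np

def partition_function(W, b, c):
    """Z = sum of exp(-E(v, h)) over every (v, h); only feasible for very small RBMs."""
    J, I = W.shape
    Z = 0.0
    for v in product([0, 1], repeat=I):
        for h in product([0, 1], repeat=J):
            va, ha = np.array(v), np.array(h)
            Z += np.exp((c @ va) + (b @ ha) + (ha @ W @ va))   # exp(-E(v, h))
    return Z

def joint_probability(v, h, W, b, c, Z):
    """Eq. (2): p(v, h) = exp(-E(v, h)) / Z."""
    return np.exp((c @ v) + (b @ h) + (h @ W @ v)) / Z
```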


restricted boltzmann machines (rbms)

Since there are no connections between any two units within the same layer, given a particular random input configuration v, all the hidden units are independent of each other and the probability of h given v becomes:

p(\mathbf{h} \mid \mathbf{v}) = \prod_{j} p(h_j = 1 \mid \mathbf{v}) \qquad (4)

where

p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i=1}^{I} v_i W_{ji}\Big) . \qquad (5)


restricted boltzmann machines (rbms)

Similarly, given a specific hidden state h, the probability of v given h is obtained by (6):

p(\mathbf{v} \mid \mathbf{h}) = \prod_{i} p(v_i = 1 \mid \mathbf{h}) \qquad (6)

where:

p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j=1}^{J} h_j W_{ji}\Big) . \qquad (7)


restricted boltzmann machines (rbms)

Given a random training vector v, the state of a given hidden unit j is set to 1 with probability:

p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_{i} v_i W_{ji}\Big)

Similarly:

p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(c_i + \sum_{j} h_j W_{ji}\Big)

where σ(x) is the sigmoid squashing function 1/(1 + e^{-x}).
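A sketch of these two conditional samplers, following the same W (J × I), b, c convention as above (the helper names are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ji); returns probabilities and a binary sample."""
    p = sigmoid(b + W @ v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, c):
    """p(v_i = 1 | h) = sigmoid(c_i + sum_j h_j W_ji)."""
    p = sigmoid(c + W.T @ h)
    return p, (rng.random(p.shape) < p).astype(float)
```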


restricted boltzmann machines (rbms)

The marginal probability assigned to a visible vector v is given by (8):

p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \qquad (8)

Hence, given a specific training vector v, its probability can be raised by adjusting the weights and the biases in order to lower the energy of that particular vector while raising the energy of all the others.


restricted boltzmann machines (rbms)

To this end, we can perform a stochastic gradient ascent procedure on the log-likelihood of the training data vectors, using (9):

\frac{\partial \log p(\mathbf{v})}{\partial \theta} = \underbrace{-\sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}}_{\text{positive phase}} + \underbrace{\sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}}_{\text{negative phase}} \qquad (9)


training an rbm

training an rbm

The learning rule for performing stochastic steepest ascent in the log probability of the training data is:

\frac{\partial \log p(\mathbf{v})}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty \qquad (10)

where ⟨·⟩_0 denotes expectations under the data distribution (p_0 = p(h | v)) and ⟨·⟩_∞ denotes expectations under the model distribution p_∞(v, h) = p(v, h) [Roux and Bengio, 2008].


mcmc using alternating gibbs sampling

(Figure: the alternating Gibbs sampling chain. Starting from a training vector, v(0) = x, the hidden states h(0) are sampled from p(h_j = 1 | v) given by (5); the visible states v(1) are then reconstructed from p(v_i = 1 | h) given by (7). Alternating these two steps yields h(1), v(2), h(2), …, v(∞), h(∞). The statistics ⟨v_i h_j⟩_0 are collected at the start of the chain and ⟨v_i h_j⟩_∞ at equilibrium, i.e. under the model distribution.)
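A short sketch of this alternating chain, reusing the sample_h_given_v / sample_v_given_h helpers above (the fixed number of steps is an illustrative stand-in for "run to equilibrium"):

```python
def gibbs_chain(v0, W, b, c, steps=1000):
    """Alternate h ~ p(h|v) and v ~ p(v|h); a long chain approximates the model distribution."""
    v = v0
    for _ in range(steps):
        _, h = sample_h_given_v(v, W, b)
        _, v = sample_v_given_h(h, W, c)
    return v, h
```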

contrastive divergence algorithm

contrastive divergence (cd–k)

∙ To avoid running the Gibbs chain to equilibrium (which is what ⟨·⟩_∞ requires), Hinton proposed the Contrastive Divergence algorithm.

∙ CD–k replaces ⟨·⟩_∞ by ⟨·⟩_k for small values of k:

\Delta W_{ji} = \eta \big( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k \big) \qquad (11)


contrastive divergence (cd–k)

∙ v(0) ← x
∙ Compute the binary (feature) states of the hidden units, h(0), using v(0)
∙ for n ← 1 to k
  ∙ Compute the "reconstruction" states of the visible units, v(n), using h(n−1)
  ∙ Compute the "reconstruction" states of the hidden units, h(n), using v(n)
∙ end for
∙ Update the weights and biases according to:

\Delta W_{ji} = \eta \big( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k \big) \qquad (12)

\Delta b_j = \eta \big( \langle h_j \rangle_0 - \langle h_j \rangle_k \big) \qquad (13)

\Delta c_i = \eta \big( \langle v_i \rangle_0 - \langle v_i \rangle_k \big) \qquad (14)
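A sketch of one CD–k update for a single training vector, reusing the sampling helpers above; using the hidden probabilities for the statistics is a common choice of mine here, not something stated on the slide:

```python
import numpy as np

def cd_k_update(x, W, b, c, eta=0.1, k=1):
    """One CD-k parameter update for a single training vector x (Eqs. 12-14)."""
    p_h0, h = sample_h_given_v(x, W, b)       # positive-phase statistics at v(0) = x
    v = x
    for _ in range(k):                        # k steps of alternating Gibbs sampling
        _, v = sample_v_given_h(h, W, c)      # "reconstruction" of the visible units
        p_hk, h = sample_h_given_v(v, W, b)   # hidden states driven by the reconstruction
    W += eta * (np.outer(p_h0, x) - np.outer(p_hk, v))   # Delta W_ji  (Eq. 12)
    b += eta * (p_h0 - p_hk)                              # Delta b_j   (Eq. 13)
    c += eta * (x - v)                                    # Delta c_i   (Eq. 14)
    return W, b, c
```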

deep belief networks (dbns)

deep belief networks (dbns)

(Figure: greedy layer-wise construction of a DBN. A first RBM links the input x and the hidden layer h1 through p(h1 | x) and p(x | h1); a second RBM stacked on top links h1 and h2 through p(h2 | h1) and p(h1 | h2); a third links h2 and h3 through p(h3 | h2) and p(h2 | h3).)

deep belief networks (dbns)

∙ Start with a training vector on the visible units

∙ Update all the hidden units in parallel

∙ Update all the visible units in parallel to get a "reconstruction"

∙ Update the hidden units again


pre-training and fine tuning

(Figure: DBN training pipeline. A stack of RBMs is pre-trained greedily on the data: data → 500 hidden units → 300 hidden units → 100 hidden units → 10 hidden units. The resulting DBN model is then fine-tuned with backpropagation (BP), updating the weights until the error falls below 0.001.)
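A greedy layer-wise pre-training sketch for the stack shown above, built on the cd_k_update helper from the CD–k section; the epoch count, learning rate and layer sizes are illustrative, and fine-tuning with BP would follow as a separate step:

```python
import numpy as np

def train_rbm(data, n_hidden, epochs=10, eta=0.1, k=1):
    """Train one RBM with CD-k (see cd_k_update above) and return its parameters."""
    J, I = n_hidden, data.shape[1]
    rng = np.random.default_rng(0)
    W, b, c = rng.normal(0, 0.01, (J, I)), np.zeros(J), np.zeros(I)
    for _ in range(epochs):
        for x in data:
            W, b, c = cd_k_update(x, W, b, c, eta=eta, k=k)
    return W, b, c

def pretrain_dbn(data, layer_sizes=(500, 300, 100, 10)):
    """Greedy layer-wise pre-training: each RBM models the hidden activities of the one below."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    layers, inp = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(inp, n_hidden)
        layers.append((W, b, c))
        inp = sigmoid(inp @ W.T + b)   # propagate the data upward as input to the next RBM
    return layers                      # afterwards, fine-tune the whole stack with BP
```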


practical considerations

weights initialization


deep belief networks (dbns) - adaptive learning rate size

\eta_{ji} =
\begin{cases}
u \, \eta_{ji}^{(old)} & \text{if } \big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big)\big(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}\big) > 0 \\
d \, \eta_{ji}^{(old)} & \text{if } \big(\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k\big)\big(\langle v_i h_j \rangle_0^{(old)} - \langle v_i h_j \rangle_k^{(old)}\big) < 0
\end{cases}

Lopes et al., Towards adaptive learning with improved convergence of DBNs on GPUs, Pattern Recognition, 2014.
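A sketch of this sign-based rule, in the spirit of Lopes et al. (2014): the per-weight step grows when consecutive CD gradient estimates agree in sign and shrinks otherwise. The growth/decay factors u and d and the clipping bounds are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def adapt_step_size(eta, grad, grad_old, u=1.2, d=0.8, eta_min=1e-5, eta_max=1.0):
    """Per-weight learning rates: multiply by u where grad and grad_old agree in sign, by d otherwise.
    grad is the CD estimate <v_i h_j>_0 - <v_i h_j>_k for the current update, grad_old the previous one."""
    sign = grad * grad_old
    eta = np.where(sign > 0, eta * u, np.where(sign < 0, eta * d, eta))
    return np.clip(eta, eta_min, eta_max)
```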

adaptive step size

(Figure: average reconstruction error (RMSE) over 1000 training epochs for α = 0.1, 0.4 and 0.7; each panel compares the adaptive step size against fixed learning rates γ = 0.1, 0.4 and 0.7.)

convergence results (α = 0.1)

(Figure: training images and their reconstructions after 50, 100, 250, 500, 750 and 1000 epochs, comparing the adaptive step size with a fixed (optimized) learning rate η = 0.4.)

deep models characteristics

deep models characteristics

∙ Biological Plausibility

∙ DBNs are effective in a wide range of ML problems.

∙ Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs), demanding considerable effort.

∙ The adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results.

∙ Graphics Processing Units (GPUs) can significantly reduce the convergence time for the data-intensive tasks in DBNs.


references

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 33–40.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical report, Department of Computer Science, University of Toronto.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Roux, N. L. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.

Roux, N. L. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207.

Questions?


deep learning

Algorithms and Applications

Bernardete Ribeiro, bribeiro@dei.uc.pt
June 24, 2015

University of Coimbra, Portugal

INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
