Stefano Rovetta Introduction to neural networks 20/23-Jul-2016 1 / 109
Back to optimization
Stochastic optimization
• Optimize a cost that is a random variable
• Types of randomness:
- Measurement plus noise: R + ν
- Multiple effects mixed together (we might use a mixture model)
- Unknown statistical properties
Monte Carlo integration
• Expectation of a random variable X:
E{X} = ∫_E ξ px(ξ) dξ
(over the whole data space E)
• . . . But only a sample {x1, . . . , xn} is given (training set)
• Empirical distribution: Px(ξ) = (1/n) ∑_{l=1}^n δ(ξ − xl)
• Approximate (empirical) expectation of X:
E{X} ≈ ∫_E ξ Px(ξ) dξ = (1/n) ∑_{l=1}^n xl
• This is a Monte Carlo integral
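The collapse of the integral against the empirical distribution into a sample mean can be checked numerically. A minimal sketch (the Gaussian distribution, its mean, and the sample size are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a sample {x_1, ..., x_n} from some distribution p_x
# (here a Gaussian with true mean 2.0, an arbitrary choice).
n = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=n)

# Integrating xi against the empirical distribution P_x collapses
# the integral into the sample mean: the Monte Carlo estimate of E{X}.
estimate = x.sum() / n          # (1/n) * sum_l x_l

print(estimate)                 # close to the true mean 2.0
```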
• Suppose that R is classification performance (risk).
• We want to optimize the true risk, the one computed on all possible, infinite data:
R(w) = ∫ R(y(x), w) p(x) dx.
• This is a function of w (the weights identify one specific neural net)
• It is also a function of the data distribution p(x)
(the performance is estimated on the data)
• When training a neural network we don’t have p(x), but only the training set {x1, . . . , xn}
• From the training set we have the empirical distribution
Px(ξ) = (1/n) ∑_{l=1}^n δ(ξ − xl)
• so we can compute a Monte Carlo estimate of the risk
R(w, X) = (1/np) ∑_{l=1}^{np} R(y(xl), w)
this is the empirical risk.
Training by epoch
• Optimize using the whole training set to estimate the cost
• It means computing R (and the ∆W )
• on the basis of a Monte Carlo estimate of risk
• Finds the optimal value of an approximate (empirical) cost function
Stochastic approximation
• A special kind of stochastic optimization
• R is estimated at each input pattern using that pattern alone
• Extremely unreliable estimation – but it converges in probability!
• Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952
• Convergence in probability:
lim_{n→∞} Pr(|Rn − R| ≥ ε) = 0
• Rn is the estimate of R on a training set of size n
Stochastic approximation
• Given:
- A function R whose gradient ∇R we want to set to zero, or minimize (but we can’t compute it analytically)
- A sequence G1, G2, . . . , Gl, . . . of random samples of ∇R, affected by random noise
- A decreasing sequence η1, η2, . . . , ηl, . . . of step size coefficients
• Basic iteration: w(l + 1) = w(l) − ηl Gl
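As an illustration (not from the slides), here is the basic iteration on a toy one-dimensional risk R(w) = E{(w − x)²}/2, whose minimum is at w = E{x}; each observation x_l gives a noisy gradient sample G_l = w − x_l. The target mean, step schedule, and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy risk R(w) = E[(w - x)^2] / 2, minimized at w = E[x] = 3.0.
target_mean = 3.0
w = 0.0
for l in range(1, 20_001):
    x_l = rng.normal(target_mean, 1.0)      # one random observation
    G_l = w - x_l                           # noisy sample of the gradient
    eta_l = 1.0 / l                         # decreasing step sizes
    w = w - eta_l * G_l                     # w(l+1) = w(l) - eta_l * G_l

print(w)    # converges in probability to the true mean 3.0
```

With η_l = 1/l this iteration is exactly the running mean of the observations, a classic Robbins-Monro example.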
Stochastic approximation: The intuition
• Each sample gives a noisy (stochastic) estimate of the gradient
• ⇒ ∇R + noise
• By averaging over time, noise cancels out
• Random variations also make it possible to escape local minima
Results on convergence of stochastic approximation
• If R is twice differentiable and convex, then stochastic approximation converges with a rate of O(1/l)
• A condition of convergence (not of optimal rate of convergence):
0 < ∑_l ηl² = A < ∞
• Usually the hypotheses are not met (complex cost landscape) and we don’t have guarantees.
Training by pattern
• is computing R (and the ∆W)
• on the basis of an estimate of risk on a single point
• An extreme Monte Carlo estimate on a training set of one observation only
• Finds the approximate optimal value of an approximate cost function
Implementation of training
• By epoch: estimation loop, then update
• By pattern: estimation + update loop
• By pattern on a training set: l = random
• Learning rate η → By pattern: keep it low
• → By epoch: make it adaptive
Multi-layer neural networks
Connectionism and Parallel Distributed Processes
David Rumelhart James McClelland Geoffrey Hinton
What is connectionism?
• Connectionism is an approach to cognitive science that characterizes learning and memory through the discrete interactions between nodes of neural networks
• Representation of concepts and rules not concentrated in symbols with a lot of meaning, but in sub-symbolic “neural encodings” (neuron activations) which have a meaning only if taken collectively as patterns
• Neural networks are distributed and massively parallel
• They rely on spontaneously-generated internal representations
Network topologies
Most general: feedback.
Units may be visible or hidden
Network topologies
A special type of feedback is lateral connections
Network topologies
Less general: a topology where cycles are forbidden: feedforward.
Visible units may be input or output.
Network topologies
Least general: multi-layer
Why multi-layer?
Linear separability
Feature discovery
Hierarchies of abstractions
Example: Parity
Problem: Given any input string of d bits, tell whether the number of bits set (= 1) is even.
Generalizes XOR: it is not linearly separable
Example: Parity
The solution requires d hidden units
Universal approximation theorem
G. Cybenko 1989
A feed-forward network with a single hidden layer containing a finite number of neurons (with a sigmoid-type activation) can approximate any continuous function on compact subsets of Rd
How do we train a multi-layer neural network?
1 With a suitable algorithm
2 With a sequence of independent trainings
• As we have seen, learning (e.g., learning to recognize) can be cast as the problem of optimizing a suitable cost function (risk)
• But most optimization methods rely on the necessary minimum condition ∇E = 0 or on the direction of the gradient ∇E
→ requirement: E must be at least differentiable (even better if also convex, but that’s not always possible)
• Even if E is differentiable, for hidden units we cannot compute an error term like (t − a)² (mse)
→ requirement: we need a way to do this
A differentiable activation function
• Let’s write the discriminant function for a problem with two Gaussian, spherical, equal-variance classes.
• Translation of the origin, rotation of axes. . .
• 1-dimensional symmetrical problem in x with only two parameters:

p(x|ω1) = (1/(√(2π)σ)) exp[−(x − µ)²/(2σ²)]

p(x|ω2) = (1/(√(2π)σ)) exp[−(x + µ)²/(2σ²)]
By Bayes’ theorem:

P(ω1|x) = p(x|ω1)P(ω1) / [p(x|ω1)P(ω1) + p(x|ω2)P(ω2)]

P(ω2|x) = p(x|ω2)P(ω2) / [p(x|ω1)P(ω1) + p(x|ω2)P(ω2)]
2-class discriminant function (assuming equal priors P(ω1) = P(ω2)):

g(x) = P(ω1|x) − P(ω2|x)
     = exp[−(x−µ)²/(2σ²)] / (exp[−(x−µ)²/(2σ²)] + exp[−(x+µ)²/(2σ²)])
       − exp[−(x+µ)²/(2σ²)] / (exp[−(x−µ)²/(2σ²)] + exp[−(x+µ)²/(2σ²)])

removing the common factors 1/(√(2π)σ)
Expanding the squares, (x ∓ µ)² = x² ∓ 2xµ + µ²:

g(x) = (exp[−(x²+µ²)/(2σ²)] exp[xµ/σ²] − exp[−(x²+µ²)/(2σ²)] exp[−xµ/σ²]) /
       (exp[−(x²+µ²)/(2σ²)] exp[xµ/σ²] + exp[−(x²+µ²)/(2σ²)] exp[−xµ/σ²])

The common positive factor exp[−(x²+µ²)/(2σ²)] cancels out:

g(x) = (e^(xµ/σ²) − e^(−xµ/σ²)) / (e^(xµ/σ²) + e^(−xµ/σ²))
• We replace x with the score r = x · w′
• We can absorb the factor µ/σ² into the norm of w′:
w = (µ/σ²) w′
• We obtain
g(r) = (e^r − e^(−r)) / (e^r + e^(−r)),  r = x · w
g(r) = hyperbolic tangent activation, tanh(r)
• logistic or sigmoid activation:
σ(r) = 1/(1 + e^(−r)) = (tanh(r/2) + 1)/2
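Both identities can be checked numerically; note the r/2 inside the tanh in the sigmoid relation. A quick check (not part of the slides; the test points are arbitrary):

```python
import math

# sigma(r) = 1 / (1 + exp(-r)) = (tanh(r/2) + 1) / 2,
# and the derived discriminant g(r) is exactly tanh(r).
for r in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    sigmoid = 1.0 / (1.0 + math.exp(-r))
    assert abs(sigmoid - (math.tanh(r / 2.0) + 1.0) / 2.0) < 1e-12
    g = (math.exp(r) - math.exp(-r)) / (math.exp(r) + math.exp(-r))
    assert abs(g - math.tanh(r)) < 1e-12
print("ok")
```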
[Plot: sigmoid and tanh activations a as functions of r, for r ∈ [−10, 10]]
[Plot: Heaviside and sign step activations a as functions of r, for r ∈ [−10, 10]]
• The sigmoid is the solution of the logistic equation
y′ = y(1 − y)
• Therefore, by definition,
∂σ(r)/∂r = σ(r)(1 − σ(r))
• Also,
∂tanh(r)/∂r = 1 − tanh²(r)
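These derivative identities can be verified with central finite differences (a quick numerical check, not from the slides; step size and test points are arbitrary):

```python
import math

def sigma(r):
    return 1.0 / (1.0 + math.exp(-r))

h = 1e-6   # central-difference step
for r in [-2.0, 0.0, 1.5]:
    # sigma'(r) = sigma(r) (1 - sigma(r))
    num_dsigma = (sigma(r + h) - sigma(r - h)) / (2 * h)
    assert abs(num_dsigma - sigma(r) * (1 - sigma(r))) < 1e-8
    # tanh'(r) = 1 - tanh(r)^2
    num_dtanh = (math.tanh(r + h) - math.tanh(r - h)) / (2 * h)
    assert abs(num_dtanh - (1 - math.tanh(r) ** 2)) < 1e-8
print("ok")
```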
The error back-propagation algorithm
• Discovered by Amari/Werbos/Parker/Rumelhart/Hinton/Williams from 1974 to 1986
• The name appears in Rosenblatt’s book “Principles of Neurodynamics” in 1962
• A clever application of the chain rule of differential calculus
• We can perform gradient descent in a distributed way and without actually computing derivatives
• The responsibility for errors is back-propagated from the outputs back inside the network, and distributed among the hidden layers.
The chain rule
df(g(x))/dx = [df(y)/dy]_{y=g(x)} · dg(x)/dx

Where is the “chain”?

df(g(h(x)))/dx = df(g)/dg · dg(h)/dh · dh(x)/dx

which, for instance, can be used to prove that

∂σ(r)/∂wi = dσ(r)/dr · ∂r/∂wi = σ′(r) xi = σ(r)(1 − σ(r)) xi    (1)
np     number of patterns in the training set
ni     number of input units
nh     number of hidden units
no     number of output units
nw     total number of weights, nw = (ni + 1) nh + (nh + 1) no
i      index for input components
j      index for hidden units
k      index for output units
xi     i-th component of the input pattern
rj     net stimulus of the j-th hidden unit
rk     net stimulus of the k-th output unit
shj    j-th hidden unit activation value
sok    k-th component of the output
tgk    k-th component of the target
whiji  weight to the j-th hidden unit from the i-th input unit [(ni + 1) × nh]
wohkj  weight to the k-th output unit from the j-th hidden unit [(nh + 1) × no]
Loss function: λ(sok, tgk) = (tgk − sok)²
1 in general there may be several output units;
2 the overall cost function is not quadratic (a paraboloid) because the network is non-linear
Non-convex cost function
Expected cost
E = ∫ (1/2)(1/no) ∑_{k=1}^{no} (sok(x) − tgk(x))² p(x) dx    (2)

E is known only through its estimate on the training set (here by epoch):

E = (1/np) ∑_{l=1}^{np} (1/2)(1/no) ∑_{k=1}^{no} (sok(xl) − tgk(xl))²    (3)
Summation and differentiation are both linear and can therefore be exchanged freely.

E = (1/2)(1/no) ∑_{k=1}^{no} (sok − tgk)²    (4)

We only consider one pattern.
• For online training (= by pattern), we will apply the ∆w immediately, as we did with the perceptron and Adaline
• For training by epoch, we will sum several ∆w and apply them only at the end of each pass (a training epoch).
• For training by batch, we will sum several ∆w and apply them only after a fraction of a complete pass.
The operation of the multilayer perceptron is divided into two steps:
• activation forward-propagation
• error back-propagation.
Forward propagation
Forward propagation
∀j: rj = ∑_{i=0}^{ni} whiji xi  ⇒  shj = σ(rj)    (5)

∀k: rk = ∑_{j=0}^{nh} wohkj shj  ⇒  sok = σ(rk)    (6)

(index 0 carries the bias: x0 = sh0 = 1)
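Equations (5) and (6) in matrix form; a minimal NumPy sketch, where the layer sizes and the convention of storing the bias in column 0 (with x0 = sh0 = 1) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda r: 1.0 / (1.0 + np.exp(-r))

ni, nh, no = 3, 4, 2                      # illustrative sizes
whi = rng.normal(size=(nh, ni + 1))       # input-to-hidden, column 0 = bias
woh = rng.normal(size=(no, nh + 1))       # hidden-to-output, column 0 = bias

x = rng.normal(size=ni)                   # one input pattern
x_ext = np.concatenate(([1.0], x))        # prepend x_0 = 1 for the bias

r_h = whi @ x_ext                         # r_j = sum_i whi_ji x_i   (Eq. 5)
sh = sigma(r_h)
sh_ext = np.concatenate(([1.0], sh))      # sh_0 = 1 for the output bias

r_o = woh @ sh_ext                        # r_k = sum_j woh_kj sh_j  (Eq. 6)
so = sigma(r_o)                           # network outputs, each in (0, 1)

print(so.shape)   # (2,)
```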
Error back-propagation
Error back-propagation and update
We start from the computation of the partial derivatives, i.e., the gradient of the error.
w is generically any of the weights of the network.
We need all the components of the gradient ∇E.
These are ∂E/∂w for all possible w.
∂E/∂w = (1/2)(1/no) ∑_{k=1}^{no} ∂(sok − tgk)²/∂w = (1/no) ∑_{k=1}^{no} (sok − tgk) ∂sok/∂w    (7)

Depending on whether w is a woh or a whi we will have different expansions of the above expression.
Hidden-to-output weights wohkj
∂E/∂wohkj = (1/no) ∑_{k′=1}^{no} (sok′ − tgk′) ∂sok′/∂rk′ · ∂rk′/∂wohkj    (8)

We can drop all the terms with k′ ≠ k, which do not depend on wohkj:

∂E/∂wohkj = (1/no) (sok − tgk) ∂sok/∂rk · ∂rk/∂wohkj    (9)

We plug in quantities known from the forward pass:

∂E/∂wohkj = (1/no) (sok − tgk) σ′(rk) shj    (10)
If we define

δk = (sok − tgk) σ′(rk)    (11)

we have a generalization of the “delta” term which we have seen in the delta rule by Widrow and Hoff.
Generalized delta rule for the hidden-to-output weights:

∆wohkj = −η δk shj    (12)
Problem with the input-to-hidden weights: not all the terms are readily available.
We use the chain rule again to find another formulation for ∂E/∂whiji
∂E/∂whiji = (1/2)(1/no) ∑_{k=1}^{no} ∂(sok − tgk)²/∂whiji    (13)

= (1/no) ∑_{k=1}^{no} (sok − tgk) ∂sok/∂rk · ∂rk/∂shj · ∂shj/∂whiji    (14)
Now the quantities appearing in the last equation are available, again from either the forward pass or theory:
• (sok − tgk) ∂sok/∂rk = δk
• ∂rk/∂shj = wohkj
• ∂shj/∂whiji = ∂shj/∂rj · ∂rj/∂whiji = σ′(rj) xi
∂E/∂whiji = (1/no) ∑_{k=1}^{no} (sok − tgk) ∂sok/∂rk · ∂rk/∂shj · ∂shj/∂whiji    (15)

= (1/no) ∑_{k=1}^{no} [δk wohkj] [σ′(rj) xi]    (16)

(δk already contains the factor σ′(rk), by Eq. 11.)
Note that the summation here does not disappear
We can further manipulate the expression, by first isolating the terms that do not depend on the summation index:

= [(1/no) ∑_{k=1}^{no} δk wohkj] σ′(rj) xi    (17)

and then identifying the generalized delta for the input-to-hidden weights:

δj = [(1/no) ∑_{k=1}^{no} δk wohkj] σ′(rj)    (18)
Generalized delta rule
for the input-to-hidden weights:

∆whiji = −η δj xi    (19)

amazingly similar in form to that for the hidden-to-output weights
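Putting equations (11), (12), (18), and (19) together, one by-pattern update can be sketched as follows. The layer sizes, input, target, and learning rate are arbitrary illustrative choices; following Eq. (12), the hidden-to-output step omits the 1/no factor (absorbed into η), while δj keeps it as in Eq. (18):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda r: 1.0 / (1.0 + np.exp(-r))

ni, nh, no, eta = 3, 4, 2, 0.1            # illustrative sizes and step
whi = rng.normal(size=(nh, ni + 1))       # input-to-hidden, column 0 = bias
woh = rng.normal(size=(no, nh + 1))       # hidden-to-output, column 0 = bias
x, tg = rng.normal(size=ni), rng.random(size=no)   # one pattern and target

# Forward pass (Eqs. 5-6)
x_ext = np.concatenate(([1.0], x))
sh = sigma(whi @ x_ext)
sh_ext = np.concatenate(([1.0], sh))
so = sigma(woh @ sh_ext)

# Deltas (Eqs. 11 and 18); woh[:, 1:] skips the bias column
delta_k = (so - tg) * so * (1 - so)
delta_j = (woh[:, 1:].T @ delta_k) / no * sh * (1 - sh)

# Generalized delta rule (Eqs. 12 and 19), applied immediately (by pattern)
woh = woh - eta * np.outer(delta_k, sh_ext)
whi = whi - eta * np.outer(delta_j, x_ext)
```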
Important property of multi-layer networks
The layered network is the simplest possible connectivity that has the universal approximation property.
It should be large enough, or deep enough.
Generalization and overfitting
The number of weights needs to be high.
We must take care of controlling overfitting.
Overfitting
Is the situation where
• the empirical risk R̂ is low
• but |R − R̂|, the gap to the true risk, is high
Symptom: While training we are happy, but then tests fail!
No generalization, due to too much specialization (learning the training set, not the classification rule)
Multi-layer perceptrons: not a good model for the brain?
Some evidence that the brain uses sparse (localized) rather than dense (distributed) representations.
Probably both
Deep neural networks
David Hubel and Torsten Wiesel
Hubel and Wiesel placed electrodes in animals’ brains (visual cortex).
They discovered the columnar organization of neurons.
Each layer in a cortical column extracts features from the input it receives from the previous layer.
These features are more and more abstract:
Edges → simple shapes → composite shapes → eyes, mouths, noses. . . → grandmother (the Grandmother Cell hypothesis)
Learning features in neural networks
Internal representation in hidden layers
Hierarchy requires many layers (deep networks).
Learning: Limits of multi-layer networks
Error back-propagation does not work well with very deep structures
Vanishing gradient phenomenon: at each layer, the back-propagated components of the gradient become exponentially smaller.
To avoid the problem: use shallow networks (theoretically sufficient).
Example of a shallow architecture
Support vector machines
Representational advantage of depth
In the 80’s and early 90’s some works proved that some logical functions, which can be implemented with a depth of k layers, require exponentially more units if reduced to k − 1 layers
In the 2010’s: dependent inputs (variables) need very deep networks
How can we avoid training the whole network all at once?
Multi-level hierarchies of networks
Cascaded networks of unsupervised layers trained one after the other, plus a final classification layer.
The whole structure is finally trained with error back-propagation
The idea is not new: Neocognitron
K. Fukushima, 1987
Unsupervised learning principles
Information Bottleneck
Techniques using the "information bottleneck" principle
Using statistics and entropy
• Coding theory
• Stochastic complexity and minimum description length
Using errors
• Autoencoders
• PCA
• Rate-distortion theory
Autoencoders
An autoencoder is a special case of a multi-layer perceptron characterized by two aspects:
1 Structure: number of units in the input layer = number of units in the output layer > number of hidden units
2 Learned task: an autoencoder is trained to approximate the identity function (= replicate its input at the output)
An autoencoder is not a classifier
Autoencoders
What is interesting is not the output value (it is an approximation of the input) but the pattern present on the hidden layer.
Since we don’t use any target (the target coincides with the input), the autoencoder task is unsupervised. Sometimes termed “self-supervised”.
Learned features from a set of images
Recognizing handwritten digits
Features for recognizing ’0’ from ’8’
Features for recognizing ’1’ from ’8’
An example of an autoencoder for learning features from symbolic data
Task: diagnose Lyme disease from patient records
Problem: many features (observed signs and symptoms) are binary and very sparse
An example of an autoencoder for learning features from symbolic data
Learning the features
An example of an autoencoder for learning features from symbolic data
Using the learned features
Principal component analysis
Is an instance of factor analysis: discover the few unobservable factors that give rise to observable (measurable) variables
Example of factor analysis problem: discover the abilities underlying performance in school tests
Observed variables: marks in algebra test, geometry test, literature test, foreign language test, music test, essay
Hidden factors: linguistic ability, spatial ability, symbolic processing ability
Principal Component Analysis or PCA
is a linear solution to the factor analysis problem
Linear: factors are linear combinations of patterns
v = λ1 x1 + λ2 x2 + . . . + λd xd
PCA works on the covariance matrix of the data.
Covariance between input xi and input xj:
σi,j = σj,i = E{(xi − x̄i)(xj − x̄j)}
E{} expectation (or mean over the training set), x̄i mean of the i-th input

Σ = [ σ1,1 σ1,2 . . . σ1,d
      σ2,1 σ2,2 . . . σ2,d
      . . .
      σd,1 σd,2 . . . σd,d ]
Note: If X is the training set as a matrix and all inputs have zero mean, i.e., X ← X − X̄, then
Σ = XᵀX (up to the 1/n factor)
In Matlab: X = X - repmat(mean(X), size(X,1), 1)
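The same centering in NumPy, checked against the library covariance estimator (the data shape is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Center the data: the NumPy equivalent of X - repmat(mean(X), n, 1)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)

# With zero-mean data the covariance matrix is Sigma = X^T X / n
Sigma = Xc.T @ Xc / len(Xc)

# Agrees with np.cov (rowvar=False: rows are patterns; bias=True: 1/n norm)
assert np.allclose(Sigma, np.cov(Xc, rowvar=False, bias=True))
print(Sigma.shape)   # (3, 3)
```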
Principal components
The “factors” in PCA are called principal components and are given by the eigenvectors of Σ:
v1, . . . , vd
If we project pattern x = [x1, x2, . . . , xd] onto the component vi = [vi,1, vi,2, . . . , vi,d], we obtain the value of the i-th factor, or component, or feature, for pattern x:
ai = x · vi = ∑_{j=1}^d xj vi,j
OK, components; but why "principal"?
Property
1 Eigenvectors of Σ can be ordered by the corresponding eigenvalues, from largest to smallest
2 Eigenvectors are thus ordered by variance, or energy, or level of activity, from largest to smallest
3 Projection of the training set X onto the first r (principal) components gives the best rank-r approximation to X itself, when measured by mean square error
PCA is a form of lossy compression. The principal components are features useful to represent the data in a synthetic way (information bottleneck)
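The best-rank-r property can be illustrated directly: project onto the leading eigenvectors of Σ and compare the reconstruction error against an arbitrary rank-r projection (the data and r are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 centered patterns in 5-D with correlated components.
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

Sigma = X.T @ X / len(X)
eigvals, V = np.linalg.eigh(Sigma)          # eigh: ascending eigenvalues
V = V[:, np.argsort(eigvals)[::-1]]         # reorder largest-to-smallest

r = 2
A = X @ V[:, :r]                            # component values a_i (projection)
X_hat = A @ V[:, :r].T                      # rank-r reconstruction

mse_pca = np.mean((X - X_hat) ** 2)
# Any other rank-r orthogonal projection does no better:
Q, _ = np.linalg.qr(rng.normal(size=(5, r)))
mse_rand = np.mean((X - X @ Q @ Q.T) ** 2)
print(mse_pca <= mse_rand)   # True
```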
It has been proved that an autoencoder with linear activations learns the principal components.
This is because the objective is the mean squared reconstruction error of a lower-rank representation, the same as in PCA.
Oja’s neuron
A single-unit model with linear (identity) activation
a = x · w
Learning rule:
w ← w + η a (x − a w)
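A sketch of Oja’s rule on synthetic data with a known dominant direction; the covariance, step size, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
C = np.diag([4.0, 1.0, 0.25])            # true principal axis: e_1
w = rng.normal(size=d)
w /= np.linalg.norm(w)                   # random unit-norm start

eta = 0.005
for _ in range(40_000):
    x = rng.multivariate_normal(np.zeros(d), C)   # zero-mean sample
    a = x @ w                            # linear activation a = x . w
    w = w + eta * a * (x - a * w)        # Oja's rule

# w aligns (up to sign) with the principal eigenvector [1, 0, 0],
# while staying approximately unit-norm.
print(abs(w[0]))   # typically close to 1
```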
It can be proven that, for small η, Oja’s learning rule is a first-order Taylor approximation of the Rayleigh quotient iteration method for finding the principal eigenvector.
At convergence, w is the principal component of Σ.
Oja’s neuron is a neural principal component analyzer
Advantages over using explicit eigensolvers (e.g., the LAPACK eigensolver, or Matlab’s eig function):
1 Distributed
2 Online (big data!)
Disadvantages:
1 Stochastic (convergence in probability)
2 Slower because of the requirement of a small η
Restricted Boltzmann Machines
A generative model.
Invented by G. Hinton.
Started in the Eighties (Boltzmann machines), then developed in the following decades.
Boltzmann Machines:
• binary-valued units
• bi-directional connections
• symmetric weights (equal in the two directions)
• general topology (feedback possible)
The restricted version has the limitation that its topology must be a bipartite graph.
This makes it more tractable
Energy
• v = [vi] and h = [hj]: visible and hidden unit activation values, respectively
• wi,j: weight between vi and hj
• ai and bj: biases of visible and hidden units, respectively
then we can define an “energy”:

E(v, h) = −∑i ai vi − ∑j bj hj − ∑i ∑j vi wi,j hj
Probability of states
The probability of any possible network state is
P(v, h) = (1/Z) e^(−E(v,h))

with Z the partition function (normalizer)
Probability of states
Since intra-layer connections are not present, the probability of activation of one unit does not depend on that of the other units in the same layer, only on those in the other layer:

P(vi = 1 | h) = 1 / (1 + e^(−(ai + ∑j wi,j hj)))

P(hj = 1 | v) = 1 / (1 + e^(−(bj + ∑i wi,j vi)))
Training a RBM
Algorithm called contrastive divergence.
Uses random sampling from the probabilities (computed as above):
• Apply one input v
• Compute the probability P(h|v); sample from it to generate a hidden configuration h
• Compute a positive update step ∆w+ = v hᵀ (outer product)
• Generate one possible input v′ from the hidden configuration
• Compute the probability P(h′|v′) and sample a hidden configuration h′
• Compute a negative update step ∆w− = v′ h′ᵀ
• Apply the update: w ← w + η(∆w+ − ∆w−)
This does not optimize any explicit objective function!
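The steps above as CD-1 in NumPy. A sketch: the sizes, learning rate, and toy data are arbitrary, and sampled hidden states are used in the updates to match the slide (many implementations use the probabilities instead):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))

nv, nh, eta = 6, 3, 0.05                 # illustrative sizes and step
W = 0.01 * rng.normal(size=(nv, nh))
a, b = np.zeros(nv), np.zeros(nh)        # visible and hidden biases

data = rng.integers(0, 2, size=(50, nv)).astype(float)   # toy binary patterns

for _ in range(100):
    for v0 in data:
        ph0 = sigmoid(b + v0 @ W)                        # P(h_j = 1 | v0)
        h0 = (rng.random(nh) < ph0).astype(float)        # sample hidden config
        pv1 = sigmoid(a + W @ h0)                        # P(v_i = 1 | h0)
        v1 = (rng.random(nv) < pv1).astype(float)        # reconstructed input
        ph1 = sigmoid(b + v1 @ W)
        h1 = (rng.random(nh) < ph1).astype(float)
        W += eta * (np.outer(v0, h0) - np.outer(v1, h1)) # positive - negative
        a += eta * (v0 - v1)
        b += eta * (h0 - h1)

print(W.shape)   # (6, 3)
```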
Training RBMs of large size is not simple. There are tricks to make the task easier.
Example: weight sharing and convolutional neural networks. These help with data having correlated inputs, as in images, video, speech, and general time series.
Deep Belief Networks
• A DBN is a sequence of RBMs
• Each RBM can be trained independently of the following ones
• greedy strategy
• The last layer can be a classifier
Deep networks can be built out of RBMs, but also out of autoencoders.
Autoencoders are more sensitive to random noise.
Neural networks: Why bother?
Deep learning achieved success in very complex tasks and won many competitions.
Example: extracting words from audio and turning them into automatic subtitles (cf. YouTube).
THE END