

Slide 1

Neural Networks

EFREI 2010

Laurent Orseau (laurent.orseau@agroparistech.fr)

AgroParisTech

based on slides by Antoine Cornuejols

Slide 2

Plan

1. Introduction

2. The perceptron

3. The multi-layer perceptron (MLP)

4. Learning in MLP

5. Computational aspects

6. Methodological aspects of learning

7. Applications

8. Developments and perspectives

9. Conclusions

Slide 3: Plan (same as slide 2)

Slide 4

Introduction: Why neural networks?

• Biological inspiration

Natural brain: a very seductive model

– Robust and fault tolerant

– Flexible. Easily adaptable

– Can work with incomplete, uncertain, noisy data ...

– Massively parallel

– Can learn

Neurons

– ≈ 10^11 neurons in the human brain

– ≈ 10^4 connections (synapses + axons) per neuron

– Action potential / refractory period / neurotransmitters

– Excitatory / inhibitory signals

Slide 5

Introduction: Why neural networks?

• Some properties

Parallel computation

Directly implementable on dedicated circuits

Robust and fault tolerant (distributed representation)

Simple algorithms

Very general

• Some defects

Opacity of acquired knowledge

Slide 6

Historical notes (quickly)

Premises

– McCulloch & Pitts (1943): first formal neuron model.

The neuron as logical calculus: a basis of artificial intelligence.

– Hebb rule (1949): learning by reinforcing synaptic coupling

First realizations

– ADALINE (Widrow-Hoff, 1960)

– PERCEPTRON (Rosenblatt, 1958-1962)

– Analysis of Minsky & Papert (1969)

New models

– Kohonen (competitive learning), ...

– Hopfield (1982) (recurrent net)

– Multi-layer perceptron (1985)

Analysis and developments

– Control theory, generalization (Vapnik), ...

Slide 7

The perceptron

Rosenblatt (1958-1962)

Slide 8: Plan (same as slide 2)

Slide 9: Linear discrimination: the perceptron

[Rosenblatt, 1957, 1962]

Decision function: $y(x) = g\big(\sum_{i=0}^{d} w_i x_i\big)$, with $x_0 = 1$ the bias input and $g$ a threshold function

(Figure: input nodes, a bias node and an output node)

Slide 11: Linear discrimination: the perceptron

• Geometry - 2 classes

Slide 12: Linear discrimination: the perceptron (discrimination against all others)

• Geometry - multiclass

Ambiguous region

Slide 13: Linear discrimination: the perceptron (discrimination between two classes)

• Geometry – multiclass

• N(N-1)/2 discriminant functions

Slide 14: The perceptron: performance criterion

• Optimization criterion (error function): the total number of classification errors: NO (not differentiable)

Perceptron criterion: $E_P(w) = -\sum_{x_l \text{ misclassified}} w^\top x_l \, u_l$

For all forms (examples), we want: $w^\top x_l > 0$ for class 1 and $w^\top x_l < 0$ for class 2

Proportional to the distance to the decision surface (for all wrongly classified examples)

Piecewise linear and continuous function

Slide 15

Direct learning: pseudo-inverse method

• Direct solution (pseudo-inverse method) requires:

Knowledge of all pairs (xi,yi)

A matrix inversion (often ill-conditioned)

(only for a linear network and a quadratic error function)

• Hence an iterative method without matrix inversion is preferred: gradient descent
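For a purely linear network with quadratic error, the direct solution can be obtained with the pseudo-inverse. A minimal NumPy sketch of this closed-form fit (the toy data and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy data: m examples, d inputs, linear targets plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
u = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Add a bias column, then solve min_w ||Xb w - u||^2 via the pseudo-inverse
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.linalg.pinv(Xb) @ u          # equivalent to np.linalg.lstsq(Xb, u, rcond=None)[0]
print(w)                            # approx [0, 1.5, -2.0, 0.5]
```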

Slide 16: The perceptron: algorithm

• Exploration method of the hypothesis space H: gradient search

– Minimization of error function

– Principle: in the spirit of the Hebb rule:

modify connection proportionally to input and output

– Learn only if classification error

Algorithm:

if the example is correctly classified: do nothing

otherwise: $w(t+1) = w(t) + \eta \, u_i \, x_i$   (for a misclassified example $(x_i, u_i)$, $u_i \in \{-1, +1\}$)

Loop over all training examples until a stopping criterion is met

Convergence?
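A minimal sketch of this update rule in Python/NumPy (the toy data, learning rate and stopping criterion are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_train(X, u, eta=0.1, max_epochs=100):
    """Perceptron rule: update w only on misclassified examples."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, u_i in zip(X, u):                  # u_i in {-1, +1}
            if u_i * (w @ x_i) <= 0:                # misclassified (or on the boundary)
                w += eta * u_i * x_i                # w(t+1) = w(t) + eta * u_i * x_i
                errors += 1
        if errors == 0:                             # stopping criterion: no error left
            break
    return w

# Usage on a linearly separable toy problem (logical AND)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
u = np.array([-1, -1, -1, 1])
print(perceptron_train(X, u))
```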

Slide 17: The perceptron: convergence and memory capacity

• Questions:

What can be learned?

– Result from [Minsky & Papert,68]: linear separators

Convergence guarantees?

– Perceptron convergence theorem [Rosenblatt,62]

Reliability of learning and number of examples

– How many examples do we need to have some guarantee about what should be learned?

Slide 18

Expressive power: Linear separations

Slide 19

Expressive power: Linear separations

Slide 20: Plan (same as slide 2)

Slide 22: The multi-layer perceptron

• Usual topology

Signal flow: input layer → hidden layer → output layer

Input: x_k; output: y_k; desired output: u_k

Slide 23

The multi-layer perceptron: propagation

• For each neuron:

w_jk: weight of the connection from node j to node k

a_k: activation of node k, $a_k = \sum_{j=0}^{d} w_{jk}\, z_j$

g: activation function, e.g. the sigmoid $g(a) = \dfrac{1}{1 + e^{-a}}$, whose derivative is $g'(a) = g(a)\,\big(1 - g(a)\big)$

Output of node k: $y_k = g(a_k)$

(Figure: examples of activation functions: threshold, sigmoidal, radial basis, ramp; each plotted as output z_i against activation a_i)
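A minimal sketch of this forward propagation for one sigmoid layer (the array shapes, names and toy values are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    g = sigmoid(a)
    return g * (1.0 - g)            # g'(a) = g(a) (1 - g(a))

def forward_layer(z_prev, W, b):
    """a_k = sum_j w_jk z_j + b_k ; y_k = g(a_k)."""
    a = z_prev @ W + b
    return a, sigmoid(a)

# Usage: 2 inputs -> 3 hidden units, for a single example
rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(3)
a, y = forward_layer(np.array([0.5, -1.0]), W, b)
print(y)
```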

Slide 24

The multi-layer perceptron: the XOR example

(Figure: a two-input network computing XOR; inputs x1 and x2 feed hidden nodes A and B, which feed output node C producing y; the bias and connection weights shown on the slide include -0.5, -1.5, 1 and -1)
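A hedged reconstruction of such an XOR network with threshold units; the weight assignment below (hidden OR and AND units combined at the output) is one standard choice consistent with the values listed on the slide, not necessarily the slide's own:

```python
import numpy as np

def step(a):
    return (a > 0).astype(int)

def xor_net(x1, x2):
    A = step(1*x1 + 1*x2 - 0.5)     # hidden node A: OR   (bias -0.5)
    B = step(1*x1 + 1*x2 - 1.5)     # hidden node B: AND  (bias -1.5)
    C = step(1*A - 1*B - 0.5)       # output node C: A AND (NOT B) = XOR
    return C

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(np.array(x1), np.array(x2)))
```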

Slide 25

Example of network (JavaNNS)

Slide 26: Plan (same as slide 2)

Slide 27

The MLP: learning

• Find weights such that the network makes an input-output mapping consistent with the given examples

(same old generalization problem)

• Learning:

Minimize the loss function E(w, {x_l, u_l}) as a function of w

Use a gradient descent method (the gradient back-propagation algorithm)

Inductive principle: we assume that what works on the training examples (empirical risk minimization) should also work on unseen test examples (real risk minimization)

$\Delta w_{ij} = -\eta\, \dfrac{\partial E}{\partial w_{ij}}$

Slide 28

Learning: gradient descent

• Learning = search in the multidimensional parameter space (synaptic weights) to minimize the loss function

• Almost all learning rules

= gradient descent method

Optimal solution $w^*$ such that $\nabla E(w^*) = 0$, with $\nabla E = \left(\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_N}\right)^\top$

Gradient descent update: $w_{ij}^{(\tau+1)} = w_{ij}^{(\tau)} - \eta\,\dfrac{\partial E}{\partial w_{ij}}\Big|_{w^{(\tau)}}$, so that, to first order, $E^{(\tau+1)} \approx E^{(\tau)} + \Delta w \cdot \nabla_w E \le E^{(\tau)}$
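A minimal numerical illustration of this rule on a toy quadratic error (purely illustrative, not from the slides):

```python
import numpy as np

# Toy error function E(w) = ||w - w_star||^2 and its gradient
w_star = np.array([1.0, -2.0])
grad_E = lambda w: 2.0 * (w - w_star)

w, eta = np.zeros(2), 0.1
for t in range(100):
    w = w - eta * grad_E(w)          # w(t+1) = w(t) - eta * dE/dw
print(w)                              # converges towards w_star
```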

Slide 29

The multi-layer perceptron: learning

Goal: $w^* = \arg\min_w \dfrac{1}{m} \sum_{l=1}^{m} \big\| y(x_l; w) - u(x_l) \big\|^2$

Algorithm (gradient back-propagation): gradient descent

Iterative algorithm: $w^{(t)} = w^{(t-1)} - \eta(t)\, \nabla E\big|_{w^{(t)}}$

Off-line case (total gradient): $w_{ij}(t) = w_{ij}(t-1) - \eta(t)\, \dfrac{1}{m} \sum_{k=1}^{m} \dfrac{\partial R_E(x_k, w)}{\partial w_{ij}}$

On-line case (stochastic gradient): $w_{ij}(t) = w_{ij}(t-1) - \eta(t)\, \dfrac{\partial R_E(x_k, w)}{\partial w_{ij}}$

where $R_E(x_k, w) = \big[t_k - f(x_k, w)\big]^2$

Slide 30

The multi-layer perceptron: learning

1. Take one example from training set

2. Compute output state of network

3. Compute the error as a function of (output - desired output), e.g. $(y_l - u_l)^2$

4. Compute gradients

With gradient back-propagation algorithm

5. Modify synaptic weights

6. Stopping criterion

Based on global error, number of examples, etc.

7. Go back to 1
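A compact sketch of steps 1 through 7 for a single-hidden-layer sigmoid network trained by stochastic gradient descent (the architecture, toy XOR data, learning rate and epoch count are illustrative assumptions, not the slides' own settings):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy XOR problem
U = np.array([[0.], [1.], [1.], [0.]])

# Small random initial weights (2 inputs -> 3 hidden -> 1 output), biases kept separate
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)
eta = 0.5

for epoch in range(5000):                      # 6-7. loop until a stopping criterion
    for x, u in zip(X, U):                     # 1. take one example
        # 2. forward pass
        a1 = x @ W1 + b1;  z1 = sigmoid(a1)
        a2 = z1 @ W2 + b2; y  = sigmoid(a2)
        # 3-4. error and back-propagated deltas
        delta2 = (y - u) * y * (1 - y)             # output layer: g'(a) * (y - u)
        delta1 = (delta2 @ W2.T) * z1 * (1 - z1)   # hidden layer
        # 5. weight updates
        W2 -= eta * np.outer(z1, delta2); b2 -= eta * delta2
        W1 -= eta * np.outer(x, delta1);  b1 -= eta * delta1

# Should be close to [[0], [1], [1], [0]]; backprop can occasionally stall in a local minimum
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```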

Slide 31

MLP: gradient back-propagation

• The problem: determine responsibilities (the "credit assignment problem"): which connection is responsible for the error E, and by how much?

• Principle: compute the error on a connection as a function of the error on the next layer

• Two steps:

1. Evaluation of the error derivatives with respect to the weights

2. Use of these derivatives to compute the modification of each weight

Slide 32: MLP: gradient back-propagation

1. Evaluation of the error E_l due to each connection:

Idea: compute the error on connection w_ij as a function of the error after node j

For nodes in the output layer: $\delta_k = \dfrac{\partial E_l}{\partial a_k} = g'(a_k)\,\dfrac{\partial E_l}{\partial y_k} = g'(a_k)\,\big(u_k(x_l) - y_k\big)$

For nodes in the hidden layer: $\delta_j = \dfrac{\partial E_l}{\partial a_j} = \sum_k \dfrac{\partial E_l}{\partial a_k}\,\dfrac{\partial a_k}{\partial z_j}\,\dfrac{\partial z_j}{\partial a_j} = g'(a_j)\sum_k w_{jk}\,\delta_k$

For any connection: $\dfrac{\partial E_l}{\partial w_{ij}} = \dfrac{\partial E_l}{\partial a_j}\,\dfrac{\partial a_j}{\partial w_{ij}} = \delta_j\, z_i$

Slide 33

MLP: gradient back-propagation

a_i: activation of node i

z_i: output of node i

δ_i: error attached to node i

(Figure: node i feeds hidden node j through weight w_ij; node j feeds output node k through weight w_jk; activations a_j and a_k, outputs z_i, z_j and y_k, errors δ_j and δ_k)

Slide 34: MLP: gradient back-propagation

• 2. Modification of weights

We suppose a gradient step η(t) (constant or not)

If stochastic learning (after the presentation of each example): $\Delta w_{ji}(t) = \eta\, \delta_j\, a_i$

If batch learning (after the presentation of the whole set of examples): $\Delta w_{ji}(t) = \eta \sum_n \delta_j^{(n)}\, a_i^{(n)}$

Slide 35: MLP: forward and backward passes (summary)

For an input x, forward pass:

$a_i(x) = \sum_{j=1}^{d} w_j x_j + w_0$

$y_i(x) = g(a_i(x))$

$y_s(x) = \sum_{j=1}^{k} w_{js}\, y_j(x)$   (k neurons in the hidden layer)

(Figure: inputs x_1 ... x_d plus bias x_0, weights w_1 ... w_d and w_0, hidden outputs y_i(x), output y_s(x) through weights w_is)

Slide 36: MLP: forward and backward passes (summary)

For an input x, backward pass:

Output layer: $\delta_s = g'(a_s)\,(u_s - y_s)$ and $w_{is}(t+1) = w_{is}(t) + \eta(t)\, \delta_s\, a_i$

Hidden layer: $\delta_i = g'(a_i) \sum_{s \,\in\, \text{nodes of next layer}} w_{is}\, \delta_s$ and $w_{ei}(t+1) = w_{ei}(t) + \eta(t)\, \delta_i\, a_e$

(Figure: same network as in the forward pass: inputs x_1 ... x_d, bias x_0, hidden outputs y_i(x), output y_s(x))

Slide 37

MLP: gradient back-propagation

• Learning efficiency

O(|w|) for each learning pass, |w| = # weights

Usually several hundreds of passes (see below)

And learning must typically be repeated several dozen times with different initial random weights

• Recognition efficiency

Possibility of real time

Slide 40

Applications: multi-objective optimization

• cf [Tom Mitchell]

Predict both class and color

Instead of class only

Slide 41

Role of the hidden layer

Slide 42

Role of the hidden layer

Slide 43

Role of the hidden layer

Slide 44: MLP: Applications

• Control: identification and control of processes

(e.g. Robot control)

• Signal Processing (filtering, data compression, speech processing (recognition, prediction, production),…)

• Pattern recognition, image processing (handwriting recognition, automated postal code recognition (Zip codes, USA), face recognition, ...)

• Prediction (water, electricity consumption, meteorology, stock market, ...)

• Diagnosis (industry, medicine, science, ...)

Slide 45

Application to postal Zip codes

• [Le Cun et al., 1989, ...] (ATT Bell Labs: very smart team)

• ≈ 10000 examples of handwritten numbers

• Segmented and rescaled onto a 16 x 16 matrix

• Weight sharing

• Optimal brain damage

• 99% correct recognition (on training set)

• 9% reject (delegated to human recognition)

Slide 46

The database

Slide 47

Application to postal Zip codes

(Figure: network architecture for digits 0 to 9: a 16 x 16 input matrix, 12 segment detectors (8x8), 12 segment detectors (4x4), 30 hidden nodes, 10 output nodes)

Slide 48

Some mistakes made by the network

Slide 49

Regression

Slide 50

A failure: QSAR

• Quantitative Structure Activity Relations

Predict certain properties of molecules (e.g. biological activity) from chemical, geometric and electrical descriptions.

Slide 51: Plan (same as slide 2)

Slide 52: MLP: Practical view (1)

• Technical problems: how to improve the algorithm's performance?

MLP as an optimization method: variants

• Momentum

• Second order methods

• Hessian

• Conjugate gradient

Heuristics

• Sequential learning vs batch learning

• Choice of activation function

• Normalization of inputs

• Weights initializations

• Learning gains

Slide 53: MLP: gradient back-propagation (variants)

• Momentum: $\Delta w_{ji}(t+1) = -\eta\, \dfrac{\partial E}{\partial w_{ji}} + \alpha\, \Delta w_{ji}(t)$
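A minimal sketch of this momentum update (the coefficient names eta and alpha, and the toy gradient, are conventional assumptions):

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)"""
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity

# Usage: one step on a toy gradient
w, v = np.zeros(3), np.zeros(3)
w, v = momentum_step(w, grad=np.array([0.2, -0.1, 0.4]), velocity=v)
print(w, v)
```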

Slide 54

Convergence

• Learning step tweaking:

Slide 55

MLP: Convergence problems

• Local minima

Add momentum (inertia)

Conditioning of parameters

Adding noise to the training data

Online algorithm (stochastic vs. total gradient)

Variable gradient step (in time and for each node)

Use of second derivatives (Hessian); conjugate gradient

Slide 56: MLP: Convergence problems (variable gradient step)

• Adaptive gain

Increase the gain if the gradient does not change sign, decrease it otherwise

Much lower gain for stochastic than for total gradient

Specific gain for each layer (e.g. 1 / sqrt(# input nodes))

• More complex algorithms

Conjugate gradients

– Idea: try to minimize independently along each dimension, using a memory of previous search directions

Second order methods (Hessian)

– Faster convergence but slower computations

Slide 57: Plan (same as slide 2)

Slide 58

Overfitting

(Figure: real risk and empirical risk as a function of data quantity; the widening gap indicates overfitting)

Slide 59

Preventing overfitting: regularisation

• Principle: limit expressiveness of H

• New empirical risk:

• Some useful regularizers:

– Control of NN architecture

– Parameter control

• Soft-weight sharing

• Weight decay

• Convolution network

– Noisy examples

$R_{emp}(\theta) = \dfrac{1}{m} \sum_{l=1}^{m} L\big(h(x_l, \theta), u_l\big) + \lambda\, \Omega[h(\cdot, \theta)]$   (the second term is the penalization term)

Slide 60

Control by limiting the exploration of H

• Early stopping

• Weight decay
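A minimal sketch of these two controls: an L2 weight-decay term added to the gradient step, and early stopping on a validation set (the patience value, regularization strength and function names are illustrative assumptions):

```python
import numpy as np

def weight_decay_step(w, grad, eta=0.1, lam=1e-3):
    """Gradient step on E(w) + (lam / 2) * ||w||^2."""
    return w - eta * (grad + lam * w)

def train_with_early_stopping(train_step, val_error, w, patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_w, best_err, wait = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        w = train_step(w)                 # one pass over the training set
        err = val_error(w)                # error on a held-out validation set
        if err < best_err:
            best_w, best_err, wait = w.copy(), err, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_w
```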

Slide 61

Generalization: optimize the network structure

• Progressive growth

Cascade correlation [Fahlman,1990]

• Pruning

Optimal brain damage [Le Cun,1990]

Optimal brain surgeon [Hassibi,1993]

Slide 62

Introduction of prior knowledge

Invariances

• Symmetries in the example space

Translation / rotation / dilatation

• Cost functions involving derivatives

Slide 63: Plan (same as slide 2)

Slide 64

ANN Application Areas

• Classification

• Clustering

• Associative memory

• Control

• Function approximation

Slide 65

Applications for ANN Classifiers

• Pattern recognition

Industrial inspection

Fault diagnosis

Image recognition

Target recognition

Speech recognition

Natural language processing

• Character recognition

Handwriting recognition

Automatic text-to-speech conversion

Slide 66

Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000

Neural Network Approaches: ALVINN (Autonomous Land Vehicle In a Neural Network)

ALVINN

Slide 67

Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000

- Developed in 1993.

- Performs driving with Neural Networks.

- An intelligent VLSI image sensor for road following.

- Learns to filter out image details not relevant to driving.

(Figure: ALVINN network with input units, a hidden layer and output units)

ALVINN

Slide 68: Plan (same as slide 2)

Slide 69

MLP with Radial Basis Functions (RBF)

• Definition

Hidden layer uses radial basis activation function (e.g. Gaussian)

– Idea: “pave” the input space with “receptive fields”

Output layer: linear combination upon the hidden layer

• Properties

Still universal approximator ([Hartman et al.,90], ...)

But not parsimonious (combinatorial explosion with the input dimension)

Only for small input dimension problems

Strong links with fuzzy inference systems and neuro-fuzzy systems

Slide 70: MLP with Radial Basis Functions (RBF)

• Parameters to tune:

# hidden nodes

Initial positions of the receptive fields

Diameter of the receptive fields

Output weights

• Methods

Adaptation of back-propagation

Determination of each type of parameter with a specific method (usually more effective):

– Centers determined by clustering methods (k-means, ...)

– Diameters determined by covering-rate optimization (nearest neighbours, ...)

– Output weights by linear optimization (pseudo-inverse computation, ...)
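A minimal sketch of this recipe: fixed Gaussian receptive fields (a regular grid stands in for clustering here), and output weights obtained by linear least squares; all settings below are illustrative assumptions:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian receptive fields: phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
u = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)      # toy regression target

centers = np.linspace(-3, 3, 10).reshape(-1, 1)        # centers: a grid instead of k-means
Phi = rbf_features(X, centers, sigma=0.7)
Phi = np.hstack([np.ones((Phi.shape[0], 1)), Phi])     # bias
w, *_ = np.linalg.lstsq(Phi, u, rcond=None)            # output weights by linear optimization
print(np.abs(Phi @ w - u).mean())                      # small mean error expected
```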

Slide 71: Neural Networks for sequence processing

• Tasks: take the time dimension into account

Sequence recognition

E.g. recognize a word corresponding to a vocal signal

Reproduction of sequence

E.g. predict the next values of a sequence (e.g. electricity consumption prediction)

Temporal association

Production of one sequence in response to the recognition of another

Time Delay Neural Networks (TDNNs)

Duplicate the inputs over several past time steps (see the sliding-window sketch below)

Recurrent Neural Networks
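A minimal sketch of the TDNN-style input duplication: turning a time series into (window of past values, next value) training pairs; the window length and toy series are illustrative assumptions:

```python
import numpy as np

def sliding_windows(series, window=4):
    """x_t = (s_{t-window}, ..., s_{t-1}), target u_t = s_t."""
    X = np.array([series[t - window:t] for t in range(window, len(series))])
    u = np.array(series[window:])
    return X, u

# Usage on a toy "consumption" series
series = np.sin(np.linspace(0, 20, 100))
X, u = sliding_windows(series, window=4)
print(X.shape, u.shape)     # (96, 4) (96,)
```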

Slide 72

Recurrent ANN Architectures

• Feedback connections

• Dynamic memory: $y(t+1) = f\big(x(\tau), y(\tau), s(\tau)\big)$, $\tau \in \{t, t-1, \ldots\}$

• Models: Jordan/Elman ANNs

Hopfield

Adaptive Resonance Theory (ART)

Slide 73

Recurrent Neural Networks

• Can learn regular grammars

Finite State Machines

Back Propagation Through Time

• Can even model full computers with 11 neurons (!)

Very special use of RNNs…

Uses the property that a weight can be any real number, i.e. it provides unlimited memory

+ Chaotic dynamics

No learning algorithm for this

Slide 75

Recurrent Neural Networks

• Problems

Complex trajectories

– Chaotic dynamics

Limited memory of past

Learning is very difficult!

– Exponential decay of error signal in time

Slide 76

Long Short Term Memory (Hochreiter 1997)

• Idea:

Only some nodes are recurrent

Only self-recurrence

Linear activation function

– Error decays linearly, not exponentially

• Can learn

Regular languages (FSM)

Some Context-free (stack machine) and Context-sensitive grammars

– $a^n b^n$, $a^n b^n c^n$

Slide 77

Reservoir computing

• Idea:

Random recurrent neural network,

Learn only output layer weights

• Many internal dynamics

• Output layer selects interesting ones

• And combinations thereof

(Figure: input layer, random recurrent reservoir, output layer)
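A minimal echo-state-style sketch of this idea: a fixed random recurrent reservoir, with only the output weights learned by linear (ridge) regression; reservoir size, scaling and the toy task are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 100, 1
W_in  = rng.uniform(-0.5, 0.5, size=(n_res, n_in))        # fixed random input weights
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # scale spectral radius below 1

# Drive the reservoir with a toy input sequence; collect internal states
x_seq = np.sin(np.linspace(0, 30, 300)).reshape(-1, 1)
u_seq = np.roll(x_seq[:, 0], -1)                          # target: predict the next value
states = np.zeros((len(x_seq), n_res))
s = np.zeros(n_res)
for t, x in enumerate(x_seq):
    s = np.tanh(W_in @ x + W_res @ s)                     # reservoir dynamics (not trained)
    states[t] = s

# Learn only the output layer (ridge regression on the collected states)
lam = 1e-6
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_res), states.T @ u_seq)
print(np.abs(states @ W_out - u_seq).mean())
```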

Slide 79: Plan (same as slide 2)

Slide 80

Conclusions

• Limits

Learning is slow and difficult

Result is opaque

– Difficult to extract knowledge

– Difficult to use prior knowledge (but KBANN)

Incremental learning of new concepts is difficult: catastrophic forgetting

• Advantages

Can learn a wide variety of problems

Slide 81: Bibliography

• Books / articles

Bishop C. (95): Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

Haykin (98): Neural Networks. Prentice Hall, 1998.

Hertz, Krogh & Palmer (91): Introduction to the theory of neural computation. Addison Wesley, 1991.

Thiria, Gascuel, Lechevallier & Canu (97): Statistiques et methodes neuronales. Dunod, 1997.

Vapnik (95): The Nature of Statistical Learning Theory. Springer Verlag, 1995.

• Web sites

http://www.lps.ens.fr/~nadal/
