

Slide 1

Neural Networks

EFREI 2010

Laurent Orseau (laurent.orseau@agroparistech.fr)

AgroParisTech

based on slides by Antoine Cornuejols

Slide 2

Plan

1. Introduction

2. The perceptron

3. The multi-layer perceptron (MLP)

4. Learning in MLP

5. Computational aspects

6. Methodological aspects of learning

7. Applications

8. Developments and perspectives

9. Conclusions

Slide 3: Plan (same as slide 2)

Slide 4

Introduction: Why neural networks?

• Biological inspiration

Natural brain: a very seductive model

– Robust and fault tolerant

– Flexible. Easily adaptable

– Can work with incomplete, uncertain, noisy data ...

– Massively parallel

– Can learn

Neurons

– ≈ 10^11 neurons in the human brain

– ≈ 10^4 connections (synapses + axons) per neuron

– Action potential / refractory period / neurotransmitters

– Excitatory / inhibitory signals

Slide 5

Introduction: Why neural networks?

• Some properties

Parallel computation

Directly implementable on dedicated circuits

Robust and fault tolerant (distributed representation)

Simple algorithms

Very general

• Some defects

Opacity of acquired knowledge

Slide 6

Historical notes (quickly)

Premises

– McCulloch & Pitts (1943): first formal neuron model.

The neuron as logical calculus: a basis of artificial intelligence.

– Hebb rule (1949): learning by reinforcing synaptic coupling

First realizations

– ADALINE (Widrow-Hoff, 1960)

– PERCEPTRON (Rosenblatt, 1958-1962)

– Analysis of Minsky & Papert (1969)

New models

– Kohonen (competitive learning), ...

– Hopfield (1982) (recurrent net)

– Multi-layer perceptron (1985)

Analysis and developments

– Control theory, generalization (Vapnik), ...

Slide 7

The perceptron

Rosenblatt (1958-1962)

Slide 8: Plan (same as slide 2)

Slide 9: Linear discrimination: the perceptron

[Rosenblatt, 1957, 1962]

Decision function: $y(x) = g\big(\sum_{i=0}^{d} w_i x_i\big)$, with $x_0 = 1$ the bias input and $g$ a threshold function

(Figure: input nodes, a bias node and an output node)

Slide 11: Linear discrimination: the perceptron

• Geometry - 2 classes

Slide 12: Linear discrimination: the perceptron (discrimination against all others)

• Geometry - multiclass

Ambiguous region

Slide 13: Linear discrimination: the perceptron (discrimination between two classes)

• Geometry – multiclass

• N(N-1)/2 discriminant functions

Slide 14: The perceptron: performance criterion

• Optimization criterion (error function): the total number of classification errors: NO (not differentiable)

Perceptron criterion: $E_P(w) = -\sum_{x_l \text{ misclassified}} w^\top x_l \, u_l$

For all forms (examples), we want: $w^\top x_l > 0$ for class 1 and $w^\top x_l < 0$ for class 2

Proportional to the distance to the decision surface (for all wrongly classified examples)

Piecewise linear and continuous function

Slide 15

Direct learning: pseudo-inverse method

• Direct solution (pseudo-inverse method) requires:

Knowledge of all pairs (xi,yi)

A matrix inversion (often ill-conditioned)

(only for a linear network and a quadratic error function)

• Hence an iterative method without matrix inversion is preferred: gradient descent
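For a purely linear network with quadratic error, the direct solution can be obtained with the pseudo-inverse. A minimal NumPy sketch of this closed-form fit (the toy data and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy data: m examples, d inputs, linear targets plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
u = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Add a bias column, then solve min_w ||Xb w - u||^2 via the pseudo-inverse
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.linalg.pinv(Xb) @ u          # equivalent to np.linalg.lstsq(Xb, u, rcond=None)[0]
print(w)                            # approx [0, 1.5, -2.0, 0.5]
```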

Slide 16: The perceptron: algorithm

• Exploration method of the hypothesis space H: gradient search

– Minimization of error function

– Principle: in the spirit of the Hebb rule:

modify connection proportionally to input and output

– Learn only if classification error

Algorithm:

if the example is correctly classified: do nothing

otherwise: $w(t+1) = w(t) + \eta \, u_i \, x_i$   (for a misclassified example $(x_i, u_i)$, $u_i \in \{-1, +1\}$)

Loop over all training examples until a stopping criterion is met

Convergence?
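A minimal sketch of this update rule in Python/NumPy (the toy data, learning rate and stopping criterion are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_train(X, u, eta=0.1, max_epochs=100):
    """Perceptron rule: update w only on misclassified examples."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, u_i in zip(X, u):                  # u_i in {-1, +1}
            if u_i * (w @ x_i) <= 0:                # misclassified (or on the boundary)
                w += eta * u_i * x_i                # w(t+1) = w(t) + eta * u_i * x_i
                errors += 1
        if errors == 0:                             # stopping criterion: no error left
            break
    return w

# Usage on a linearly separable toy problem (logical AND)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
u = np.array([-1, -1, -1, 1])
print(perceptron_train(X, u))
```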

Slide 17: The perceptron: convergence and memory capacity

• Questions:

What can be learned?

– Result from [Minsky & Papert,68]: linear separators

Convergence guarantees?

– Perceptron convergence theorem [Rosenblatt,62]

Reliability of learning and number of examples

– How many examples do we need to have some guarantee about what should be learned?

Slide 18

Expressive power: Linear separations

Slide 19

Expressive power: Linear separations

Slide 20: Plan (same as slide 2)

Slide 22: The multi-layer perceptron

• Usual topology

Signal flow: input layer → hidden layer → output layer

Input: x_k; output: y_k; desired output: u_k

Slide 23

The multi-layer perceptron: propagation

• For each neuron:

w_jk: weight of the connection from node j to node k

a_k: activation of node k, $a_k = \sum_{j=0}^{d} w_{jk}\, z_j$

g: activation function, e.g. the sigmoid $g(a) = \dfrac{1}{1 + e^{-a}}$, whose derivative is $g'(a) = g(a)\,\big(1 - g(a)\big)$

Output of node k: $y_k = g(a_k)$

(Figure: examples of activation functions: threshold, sigmoidal, radial basis, ramp; each plotted as output z_i against activation a_i)
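A minimal sketch of this forward propagation for one sigmoid layer (the array shapes, names and toy values are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    g = sigmoid(a)
    return g * (1.0 - g)            # g'(a) = g(a) (1 - g(a))

def forward_layer(z_prev, W, b):
    """a_k = sum_j w_jk z_j + b_k ; y_k = g(a_k)."""
    a = z_prev @ W + b
    return a, sigmoid(a)

# Usage: 2 inputs -> 3 hidden units, for a single example
rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(3)
a, y = forward_layer(np.array([0.5, -1.0]), W, b)
print(y)
```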

Slide 24

The multi-layer perceptron: the XOR example

(Figure: a two-input network computing XOR; inputs x1 and x2 feed hidden nodes A and B, which feed output node C producing y; the bias and connection weights shown on the slide include -0.5, -1.5, 1 and -1)
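A hedged reconstruction of such an XOR network with threshold units; the weight assignment below (hidden OR and AND units combined at the output) is one standard choice consistent with the values listed on the slide, not necessarily the slide's own:

```python
import numpy as np

def step(a):
    return (a > 0).astype(int)

def xor_net(x1, x2):
    A = step(1*x1 + 1*x2 - 0.5)     # hidden node A: OR   (bias -0.5)
    B = step(1*x1 + 1*x2 - 1.5)     # hidden node B: AND  (bias -1.5)
    C = step(1*A - 1*B - 0.5)       # output node C: A AND (NOT B) = XOR
    return C

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(np.array(x1), np.array(x2)))
```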

Slide 25

Example of network (JavaNNS)

Slide 26: Plan (same as slide 2)

Slide 27

The MLP: learning

• Find weights such that the network makes an input-output mapping consistent with the given examples

(same old generalization problem)

• Learning:

Minimize the loss function E(w, {x_l, u_l}) as a function of w

Use a gradient descent method (the gradient back-propagation algorithm)

Inductive principle: we assume that what works on the training examples (empirical risk minimization) should also work on unseen test examples (real risk minimization)

$\Delta w_{ij} = -\eta\, \dfrac{\partial E}{\partial w_{ij}}$

Slide 28

Learning: gradient descent

• Learning = search in the multidimensional parameter space (synaptic weights) to minimize the loss function

• Almost all learning rules

= gradient descent method

Optimal solution $w^*$ such that $\nabla E(w^*) = 0$, with $\nabla E = \left(\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_N}\right)^\top$

Gradient descent update: $w_{ij}^{(\tau+1)} = w_{ij}^{(\tau)} - \eta\,\dfrac{\partial E}{\partial w_{ij}}\Big|_{w^{(\tau)}}$, so that, to first order, $E^{(\tau+1)} \approx E^{(\tau)} + \Delta w \cdot \nabla_w E \le E^{(\tau)}$
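A minimal numerical illustration of this rule on a toy quadratic error (purely illustrative, not from the slides):

```python
import numpy as np

# Toy error function E(w) = ||w - w_star||^2 and its gradient
w_star = np.array([1.0, -2.0])
grad_E = lambda w: 2.0 * (w - w_star)

w, eta = np.zeros(2), 0.1
for t in range(100):
    w = w - eta * grad_E(w)          # w(t+1) = w(t) - eta * dE/dw
print(w)                              # converges towards w_star
```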

Slide 29

The multi-layer perceptron: learning

Goal: $w^* = \arg\min_w \dfrac{1}{m} \sum_{l=1}^{m} \big\| y(x_l; w) - u(x_l) \big\|^2$

Algorithm (gradient back-propagation): gradient descent

Iterative algorithm: $w^{(t)} = w^{(t-1)} - \eta(t)\, \nabla E\big|_{w^{(t)}}$

Off-line case (total gradient): $w_{ij}(t) = w_{ij}(t-1) - \eta(t)\, \dfrac{1}{m} \sum_{k=1}^{m} \dfrac{\partial R_E(x_k, w)}{\partial w_{ij}}$

On-line case (stochastic gradient): $w_{ij}(t) = w_{ij}(t-1) - \eta(t)\, \dfrac{\partial R_E(x_k, w)}{\partial w_{ij}}$

where $R_E(x_k, w) = \big[t_k - f(x_k, w)\big]^2$

Slide 30

The multi-layer perceptron: learning

1. Take one example from training set

2. Compute output state of network

3. Compute the error as a function of (output - desired output), e.g. $(y_l - u_l)^2$

4. Compute gradients

With gradient back-propagation algorithm

5. Modify synaptic weights

6. Stopping criterion

Based on global error, number of examples, etc.

7. Go back to 1
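A compact sketch of steps 1 through 7 for a single-hidden-layer sigmoid network trained by stochastic gradient descent (the architecture, toy XOR data, learning rate and epoch count are illustrative assumptions, not the slides' own settings):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy XOR problem
U = np.array([[0.], [1.], [1.], [0.]])

# Small random initial weights (2 inputs -> 3 hidden -> 1 output), biases kept separate
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)
eta = 0.5

for epoch in range(5000):                      # 6-7. loop until a stopping criterion
    for x, u in zip(X, U):                     # 1. take one example
        # 2. forward pass
        a1 = x @ W1 + b1;  z1 = sigmoid(a1)
        a2 = z1 @ W2 + b2; y  = sigmoid(a2)
        # 3-4. error and back-propagated deltas
        delta2 = (y - u) * y * (1 - y)             # output layer: g'(a) * (y - u)
        delta1 = (delta2 @ W2.T) * z1 * (1 - z1)   # hidden layer
        # 5. weight updates
        W2 -= eta * np.outer(z1, delta2); b2 -= eta * delta2
        W1 -= eta * np.outer(x, delta1);  b1 -= eta * delta1

# Should be close to [[0], [1], [1], [0]]; backprop can occasionally stall in a local minimum
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```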

Slide 31

MLP: gradient back-propagation

• The problem: determine responsibilities (the "credit assignment problem"): which connection is responsible for the error E, and by how much?

• Principle: compute the error on a connection as a function of the error on the next layer

• Two steps:

1. Evaluation of the error derivatives with respect to the weights

2. Use of these derivatives to compute the modification of each weight

Slide 32: MLP: gradient back-propagation

1. Evaluation of the error E_l due to each connection:

Idea: compute the error on connection w_ij as a function of the error after node j

For nodes in the output layer: $\delta_k = \dfrac{\partial E_l}{\partial a_k} = g'(a_k)\,\dfrac{\partial E_l}{\partial y_k} = g'(a_k)\,\big(u_k(x_l) - y_k\big)$

For nodes in the hidden layer: $\delta_j = \dfrac{\partial E_l}{\partial a_j} = \sum_k \dfrac{\partial E_l}{\partial a_k}\,\dfrac{\partial a_k}{\partial z_j}\,\dfrac{\partial z_j}{\partial a_j} = g'(a_j)\sum_k w_{jk}\,\delta_k$

For any connection: $\dfrac{\partial E_l}{\partial w_{ij}} = \dfrac{\partial E_l}{\partial a_j}\,\dfrac{\partial a_j}{\partial w_{ij}} = \delta_j\, z_i$

Slide 33

MLP: gradient back-propagation

a_i: activation of node i

z_i: output of node i

δ_i: error attached to node i

(Figure: node i feeds hidden node j through weight w_ij; node j feeds output node k through weight w_jk; activations a_j and a_k, outputs z_i, z_j and y_k, errors δ_j and δ_k)

Slide 34: MLP: gradient back-propagation

• 2. Modification of weights

We suppose a gradient step η(t) (constant or not)

If stochastic learning (after the presentation of each example): $\Delta w_{ji}(t) = \eta\, \delta_j\, a_i$

If batch learning (after the presentation of the whole set of examples): $\Delta w_{ji}(t) = \eta \sum_n \delta_j^{(n)}\, a_i^{(n)}$

Slide 35: MLP: forward and backward passes (summary)

For an input x, forward pass:

$a_i(x) = \sum_{j=1}^{d} w_j x_j + w_0$

$y_i(x) = g(a_i(x))$

$y_s(x) = \sum_{j=1}^{k} w_{js}\, y_j(x)$   (k neurons in the hidden layer)

(Figure: inputs x_1 ... x_d plus bias x_0, weights w_1 ... w_d and w_0, hidden outputs y_i(x), output y_s(x) through weights w_is)

Slide 36: MLP: forward and backward passes (summary)

For an input x, backward pass:

Output layer: $\delta_s = g'(a_s)\,(u_s - y_s)$ and $w_{is}(t+1) = w_{is}(t) + \eta(t)\, \delta_s\, a_i$

Hidden layer: $\delta_i = g'(a_i) \sum_{s \,\in\, \text{nodes of next layer}} w_{is}\, \delta_s$ and $w_{ei}(t+1) = w_{ei}(t) + \eta(t)\, \delta_i\, a_e$

(Figure: same network as in the forward pass: inputs x_1 ... x_d, bias x_0, hidden outputs y_i(x), output y_s(x))

Slide 37

MLP: gradient back-propagation

• Learning efficiency

O(|w|) for each learning pass, |w| = # weights

Usually several hundreds of passes (see below)

And learning must typically be repeated several dozen times with different initial random weights

• Recognition efficiency

Possibility of real time

Slide 40

Applications: multi-objective optimization

• cf [Tom Mitchell]

Predict both class and color

Instead of class only

Slide 41

Role of the hidden layer

Slide 42

Role of the hidden layer

Slide 43

Role of the hidden layer

Slide 44: MLP: Applications

• Control: identification and control of processes

(e.g. Robot control)

• Signal Processing (filtering, data compression, speech processing (recognition, prediction, production),…)

• Pattern recognition, image processing (handwriting recognition, automated postal code recognition (Zip codes, USA), face recognition, ...)

• Prediction (water, electricity consumption, meteorology, stock market, ...)

• Diagnosis (industry, medicine, science, ...)

Slide 45

Application to postal Zip codes

• [Le Cun et al., 1989, ...] (ATT Bell Labs: very smart team)

• ≈ 10000 examples of handwritten numbers

• Segmented and rescaled onto a 16 x 16 matrix

• Weight sharing

• Optimal brain damage

• 99% correct recognition (on training set)

• 9% reject (delegated to human recognition)

Slide 46

The database

Slide 47

Application to postal Zip codes

(Figure: network architecture for digits 0 to 9: a 16 x 16 input matrix, 12 segment detectors (8x8), 12 segment detectors (4x4), 30 hidden nodes, 10 output nodes)

Slide 48

Some mistakes made by the network

Slide 49

Regression

Slide 50

A failure: QSAR

• Quantitative Structure Activity Relations

Predict certain properties of molecules (e.g. biological activity) from chemical, geometric and electrical descriptions.

Slide 51: Plan (same as slide 2)

Slide 52: MLP: Practical view (1)

• Technical problems: how to improve the algorithm's performance?

MLP as an optimization method: variants

• Momentum

• Second order methods

• Hessian

• Conjugate gradient

Heuristics

• Sequential learning vs batch learning

• Choice of activation function

• Normalization of inputs

• Weights initializations

• Learning gains

Slide 53: MLP: gradient back-propagation (variants)

• Momentum: $\Delta w_{ji}(t+1) = -\eta\, \dfrac{\partial E}{\partial w_{ji}} + \alpha\, \Delta w_{ji}(t)$
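A minimal sketch of this momentum update (the coefficient names eta and alpha, and the toy gradient, are conventional assumptions):

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)"""
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity

# Usage: one step on a toy gradient
w, v = np.zeros(3), np.zeros(3)
w, v = momentum_step(w, grad=np.array([0.2, -0.1, 0.4]), velocity=v)
print(w, v)
```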

Slide 54

Convergence

• Learning step tweaking:

Slide 55

MLP: Convergence problems

• Local minima

Add momentum (inertia)

Conditioning of parameters

Adding noise to the training data

Online algorithm (stochastic vs. total gradient)

Variable gradient step (in time and for each node)

Use of second derivatives (Hessian); conjugate gradient

Slide 56: MLP: Convergence problems (variable gradient step)

• Adaptive gain

Increase the gain if the gradient does not change sign, decrease it otherwise

Much lower gain for stochastic than for total gradient

Specific gain for each layer (e.g. 1 / sqrt(# input nodes))

• More complex algorithms

Conjugate gradients

– Idea: try to minimize independently along each dimension, using a memory of previous search directions

Second order methods (Hessian)

– Faster convergence but slower computations

Slide 57: Plan (same as slide 2)

Slide 58

Overfitting

(Figure: real risk and empirical risk as a function of data quantity; the widening gap indicates overfitting)

Slide 59

Preventing overfitting: regularisation

• Principle: limit expressiveness of H

• New empirical risk:

• Some useful regularizers:

– Control of NN architecture

– Parameter control

• Soft-weight sharing

• Weight decay

• Convolution network

– Noisy examples

$R_{emp}(\theta) = \dfrac{1}{m} \sum_{l=1}^{m} L\big(h(x_l, \theta), u_l\big) + \lambda\, \Omega[h(\cdot, \theta)]$   (the second term is the penalization term)

Slide 60

Control by limiting the exploration of H

• Early stopping

• Weight decay
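A minimal sketch of these two controls: an L2 weight-decay term added to the gradient step, and early stopping on a validation set (the patience value, regularization strength and function names are illustrative assumptions):

```python
import numpy as np

def weight_decay_step(w, grad, eta=0.1, lam=1e-3):
    """Gradient step on E(w) + (lam / 2) * ||w||^2."""
    return w - eta * (grad + lam * w)

def train_with_early_stopping(train_step, val_error, w, patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_w, best_err, wait = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        w = train_step(w)                 # one pass over the training set
        err = val_error(w)                # error on a held-out validation set
        if err < best_err:
            best_w, best_err, wait = w.copy(), err, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_w
```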

Slide 61

Generalization: optimize the network structure

• Progressive growth

Cascade correlation [Fahlman,1990]

• Pruning

Optimal brain damage [Le Cun,1990]

Optimal brain surgeon [Hassibi,1993]

Slide 62

Introduction of prior knowledge

Invariances

• Symmetries in the example space

Translation / rotation / dilatation

• Cost functions involving derivatives

Slide 63: Plan (same as slide 2)

Slide 64

ANN Application Areas

• Classification

• Clustering

• Associative memory

• Control

• Function approximation

Slide 65

Applications for ANN Classifiers

• Pattern recognition

Industrial inspection

Fault diagnosis

Image recognition

Target recognition

Speech recognition

Natural language processing

• Character recognition

Handwriting recognition

Automatic text-to-speech conversion

Slide 66

Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000

Neural Network Approaches: ALVINN (Autonomous Land Vehicle In a Neural Network)

ALVINN

Slide 67

Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000

- Developed in 1993.

- Performs driving with Neural Networks.

- An intelligent VLSI image sensor for road following.

- Learns to filter out image details not relevant to driving.

(Figure: ALVINN network with input units, a hidden layer and output units)

ALVINN

Slide 68: Plan (same as slide 2)

Slide 69

MLP with Radial Basis Functions (RBF)

• Definition

Hidden layer uses radial basis activation function (e.g. Gaussian)

– Idea: “pave” the input space with “receptive fields”

Output layer: linear combination upon the hidden layer

• Properties

Still universal approximator ([Hartman et al.,90], ...)

But not parsimonious (combinatorial explosion with the input dimension)

Only for small input dimension problems

Strong links with fuzzy inference systems and neuro-fuzzy systems

Slide 70: MLP with Radial Basis Functions (RBF)

• Parameters to tune:

# hidden nodes

Initial positions of the receptive fields

Diameter of the receptive fields

Output weights

• Methods

Adaptation of back-propagation

Determination of each type of parameter with a specific method (usually more effective):

– Centers determined by clustering methods (k-means, ...)

– Diameters determined by covering-rate optimization (nearest neighbours, ...)

– Output weights by linear optimization (pseudo-inverse computation, ...)
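A minimal sketch of this recipe: fixed Gaussian receptive fields (a regular grid stands in for clustering here), and output weights obtained by linear least squares; all settings below are illustrative assumptions:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian receptive fields: phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
u = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)      # toy regression target

centers = np.linspace(-3, 3, 10).reshape(-1, 1)        # centers: a grid instead of k-means
Phi = rbf_features(X, centers, sigma=0.7)
Phi = np.hstack([np.ones((Phi.shape[0], 1)), Phi])     # bias
w, *_ = np.linalg.lstsq(Phi, u, rcond=None)            # output weights by linear optimization
print(np.abs(Phi @ w - u).mean())                      # small mean error expected
```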

Slide 71: Neural Networks for sequence processing

• Tasks: take the time dimension into account

Sequence recognition

E.g. recognize a word corresponding to a vocal signal

Reproduction of sequence

E.g. predict the next values of a sequence (e.g. electricity consumption prediction)

Temporal association

Production of one sequence in response to the recognition of another

Time Delay Neural Networks (TDNNs)

Duplicate the inputs over several past time steps (see the sliding-window sketch below)

Recurrent Neural Networks
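A minimal sketch of the TDNN-style input duplication: turning a time series into (window of past values, next value) training pairs; the window length and toy series are illustrative assumptions:

```python
import numpy as np

def sliding_windows(series, window=4):
    """x_t = (s_{t-window}, ..., s_{t-1}), target u_t = s_t."""
    X = np.array([series[t - window:t] for t in range(window, len(series))])
    u = np.array(series[window:])
    return X, u

# Usage on a toy "consumption" series
series = np.sin(np.linspace(0, 20, 100))
X, u = sliding_windows(series, window=4)
print(X.shape, u.shape)     # (96, 4) (96,)
```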

Slide 72

Recurrent ANN Architectures

• Feedback connections

• Dynamic memory: $y(t+1) = f\big(x(\tau), y(\tau), s(\tau)\big)$, $\tau \in \{t, t-1, \ldots\}$

• Models: Jordan/Elman ANNs

Hopfield

Adaptive Resonance Theory (ART)

Slide 73

Recurrent Neural Networks

• Can learn regular grammars

Finite State Machines

Back Propagation Through Time

• Can even model full computers with 11 neurons (!)

Very special use of RNNs…

Uses the property that a weight can be any real number, i.e. it provides unlimited memory

+ Chaotic dynamics

No learning algorithm for this

Slide 75

Recurrent Neural Networks

• Problems

Complex trajectories

– Chaotic dynamics

Limited memory of past

Learning is very difficult!

– Exponential decay of error signal in time

Slide 76

Long Short Term Memory (Hochreiter 1997)

• Idea:

Only some nodes are recurrent

Only self-recurrence

Linear activation function

– Error decays linearly, not exponentially

• Can learn

Regular languages (FSM)

Some Context-free (stack machine) and Context-sensitive grammars

– $a^n b^n$, $a^n b^n c^n$

Slide 77

Reservoir computing

• Idea:

Random recurrent neural network,

Learn only output layer weights

• Many internal dynamics

• Output layer selects interesting ones

• And combinations thereof

(Figure: input layer, random recurrent reservoir, output layer)
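A minimal echo-state-style sketch of this idea: a fixed random recurrent reservoir, with only the output weights learned by linear (ridge) regression; reservoir size, scaling and the toy task are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 100, 1
W_in  = rng.uniform(-0.5, 0.5, size=(n_res, n_in))        # fixed random input weights
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # scale spectral radius below 1

# Drive the reservoir with a toy input sequence; collect internal states
x_seq = np.sin(np.linspace(0, 30, 300)).reshape(-1, 1)
u_seq = np.roll(x_seq[:, 0], -1)                          # target: predict the next value
states = np.zeros((len(x_seq), n_res))
s = np.zeros(n_res)
for t, x in enumerate(x_seq):
    s = np.tanh(W_in @ x + W_res @ s)                     # reservoir dynamics (not trained)
    states[t] = s

# Learn only the output layer (ridge regression on the collected states)
lam = 1e-6
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_res), states.T @ u_seq)
print(np.abs(states @ W_out - u_seq).mean())
```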

Slide 79: Plan (same as slide 2)

Slide 80

Conclusions

• Limits

Learning is slow and difficult

Result is opaque

– Difficult to extract knowledge

– Difficult to use prior knowledge (but KBANN)

Incremental learning of new concepts is difficult: catastrophic forgetting

• Advantages

Can learn a wide variety of problems

Slide 81: Bibliography

• Books / articles

Bishop C. (95): Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

Haykin (98): Neural Networks. Prentice Hall, 1998.

Hertz, Krogh & Palmer (91): Introduction to the theory of neural computation. Addison Wesley, 1991.

Thiria, Gascuel, Lechevallier & Canu (97): Statistiques et methodes neuronales. Dunod, 1997.

Vapnik (95): The Nature of Statistical Learning Theory. Springer Verlag, 1995.

• Web sites

http://www.lps.ens.fr/~nadal/
