
Page 1: Feed-Forward Artificial Neural Networks

1

Feed-Forward Artificial Neural Networks

AMIA 2003, Machine Learning TutorialConstantin F. Aliferis & Ioannis Tsamardinos

Discovery Systems LaboratoryDepartment of Biomedical Informatics

Vanderbilt University

Page 2: Feed-Forward Artificial Neural Networks

2

Binary Classification Example

[Scatter plot of the training examples: value of predictor 1 on the horizontal axis, value of predictor 2 on the vertical axis]

Example:

•Classification of breast cancer as malignant or benign from mammograms

•Predictor 1: lump thickness

•Predictor 2: single epithelial cell size

Page 3: Feed-Forward Artificial Neural Networks

3

Possible Decision Areas

[Scatter plot as before, with a possible decision area drawn]

Class area: red circles

Class area: green triangles

Page 4: Feed-Forward Artificial Neural Networks

4

Possible Decision Areas

[Scatter plot as before, with another possible decision area drawn]

Page 5: Feed-Forward Artificial Neural Networks

5

Possible Decision Areas

[Scatter plot as before, with yet another possible decision area drawn]

Page 6: Feed-Forward Artificial Neural Networks

6

Binary Classification Example

[Scatter plot as before, with a straight-line decision surface drawn]

The simplest non-trivial decision function is the straight line (in general a hyperplane)

One decision surface

Decision surface partitions space into two subspaces

Page 7: Feed-Forward Artificial Neural Networks

7

Specifying a Line

Line equation: w2x2+w1x1+w0 = 0

Classifier: If w2x2 + w1x1 + w0 > 0, output one class; else output the other class.

[Plot in the (x1, x2) plane: the line w2x2 + w1x1 + w0 = 0 divides the plane into the half-plane where w2x2 + w1x1 + w0 > 0 and the half-plane where w2x2 + w1x1 + w0 < 0]

Page 8: Feed-Forward Artificial Neural Networks

8

Classifying with Linear Surfaces

Use 1 for one class and -1 for the other. The classifier becomes

sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always.

[Plot in the (x1, x2) plane: the line w2x2 + w1x1 + w0 = 0 separates the region where the sum is > 0 from the region where it is < 0]

In general, with n the number of predictors: $\mathrm{sgn}\!\left(\sum_{i=0}^{n} w_i x_i\right) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$

Page 9: Feed-Forward Artificial Neural Networks

9

The Perceptron

Input: attributes of the patient to classify (x0 = 1 is the constant input)

Output: classification of the patient (malignant or benign)

[Perceptron diagram: input nodes x4, x3, x2, x1, x0 connected to a single output unit with weights W4 = 3, W3 = 0, W2 = -2, W1 = 4, W0 = 2]

Page 10: Feed-Forward Artificial Neural Networks

10

The Perceptron

Input: attributes of the patient to classify

Output: classification of the patient (malignant or benign)

[Perceptron diagram as before, now with example input values entered at the input nodes x4 ... x0]

Page 11: Feed-Forward Artificial Neural Networks

11

The Perceptron

Input: attributes of the patient to classify

Output: classification of the patient (malignant or benign)

[Perceptron diagram as before; for the example inputs, the weighted sum is $\sum_{i=0}^{n} w_i x_i = 3$]

Page 12: Feed-Forward Artificial Neural Networks

12

The Perceptron

Input: attributes of the patient to classify

Output: classification of the patient (malignant or benign)

[Perceptron diagram as before: the weighted sum is $\sum_{i=0}^{n} w_i x_i = 3$, and sgn(3) = 1]

Page 13: Feed-Forward Artificial Neural Networks

13

The Perceptron

Input: attributes of the patient to classify

Output: classification of the patient (malignant or benign)

[Perceptron diagram as before: $\sum_{i=0}^{n} w_i x_i = 3$, sgn(3) = 1, so the perceptron outputs 1]
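To make this forward pass concrete, here is a minimal Python sketch; the weights are the ones shown in the diagram, while the input vector is a hypothetical example (the slide's exact input values are not fully legible in this transcript):

# Perceptron forward pass: weighted sum of inputs followed by the sign function.
def sgn(y):
    return 1 if y > 0 else -1

def perceptron_output(weights, inputs):
    # weights and inputs are aligned lists; inputs[0] is the constant bias input x0 = 1
    total = sum(w * x for w, x in zip(weights, inputs))
    return sgn(total), total

weights = [2, 4, -2, 0, 3]    # W0, W1, W2, W3, W4 from the slide
inputs  = [1, 1, 4, 3.1, 2]   # x0 = 1 (bias); the remaining values are hypothetical attributes
label, s = perceptron_output(weights, inputs)
print(s, label)               # prints the weighted sum and its sign (here 4 and 1)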

Page 14: Feed-Forward Artificial Neural Networks

14

Learning with Perceptrons

Hypothesis space: $H = \{\, \langle w_1, \ldots, w_n \rangle \mid w_i \in \mathbb{R} \,\}$

Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function).

Search method: perceptron training rule, gradient descent, etc.

Remember: the problem is to find "good" weights.

Page 15: Feed-Forward Artificial Neural Networks

15

Training Perceptrons

•Start with random weights
•Update them in an intelligent way to improve them
•Intuitively:
  Decrease the weights that increase the sum
  Increase the weights that decrease the sum
•Repeat for all training instances until convergence

[Perceptron diagram with example inputs at x4 ... x0 and weights 3, 0, -2, 4, 2: the weighted sum is $\sum_{i=0}^{n} w_i x_i = 3$ and sgn(3) = 1, so the perceptron outputs 1, but the true output is -1]

Page 16: Feed-Forward Artificial Neural Networks

16

Perceptron Training Rule

t: the output the current example should have
o: the output of the perceptron
x_i: the value of predictor variable i
t = o: no change (for correctly classified examples)

$\Delta w_i = \eta\,(t - o)\,x_i$
$w_i' = w_i + \Delta w_i$
(η is the learning rate)

Page 17: Feed-Forward Artificial Neural Networks

17

Perceptron Training Rule

t = -1, o = 1: the sum will decrease
t = 1, o = -1: the sum will increase

$\Delta w_i = \eta\,(t - o)\,x_i, \qquad w_i' = w_i + \Delta w_i$

$\sum_i w_i' x_i = \sum_i (w_i + \Delta w_i)\,x_i = \sum_i w_i x_i + \eta\,(t - o)\sum_i x_i^2$

$o = \mathrm{sgn}\!\left(\sum_i w_i x_i\right)$

Page 18: Feed-Forward Artificial Neural Networks

18

Perceptron Training Rule

In vector form:

$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$

(componentwise: $\Delta w_i = \eta\,(t - o)\,x_i$, $w_i' = w_i + \Delta w_i$)
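A minimal Python sketch of this update rule (the function name and the default learning rate are illustrative, not from the slides; the example numbers anticipate the OR example on the following slides):

def perceptron_update(w, x, t, o, eta=0.5):
    # w' = w + eta * (t - o) * x, applied componentwise
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

# A misclassified instance (t = 1, o = -1) pushes the weights toward the input direction:
w_new = perceptron_update([-0.5, 1.0, 0.0], [1.0, 0.0, 1.0], t=1, o=-1)   # -> [0.5, 1.0, 1.0]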

Page 19: Feed-Forward Artificial Neural Networks

19

Example of Perceptron Training: The OR function

[Plot of the OR training examples in the (x1, x2) plane: (0, 0) is labeled -1; (0, 1), (1, 0), and (1, 1) are labeled 1]

Page 20: Feed-Forward Artificial Neural Networks

20

Example of Perceptron Training: The OR function

Initial random weights: 0·x2 + 1·x1 - 0.5 = 0

This defines the line x1 = 0.5.

[Plot of the OR training examples as before]

Page 21: Feed-Forward Artificial Neural Networks

21

Example of Perceptron Training: The OR function

Initial random weights: 0·x2 + 1·x1 - 0.5 = 0, which defines the line x1 = 0.5.

[Plot of the OR training examples with the vertical line x1 = 0.5 drawn: the area to its right is where the classifier outputs 1; the area to its left is where it outputs -1]

Page 22: Feed-Forward Artificial Neural Networks

22

Example of Perceptron Training: The OR function

[Plot of the OR training examples with the line x1 = 0.5]

Only misclassified example: x1 = 0, x2 = 1

Page 23: Feed-Forward Artificial Neural Networks

23

Example of Perceptron Training: The OR function

[Plot of the OR training examples with the line x1 = 0.5]

Only misclassified example: x1 = 0, x2 = 1 (t = 1, o = -1)

Old line: 0·x2 + 1·x1 - 0.5 = 0, i.e., w = <w2, w1, w0> = <0, 1, -0.5>

$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$, with x = <x2, x1, x0> = <1, 0, 1>:
$\mathbf{w}' = \langle 0, 1, -0.5 \rangle + 0.5\,(1 - (-1))\,\langle 1, 0, 1 \rangle$
$\mathbf{w}' = \langle 0, 1, -0.5 \rangle + \langle 1, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle$

Page 24: Feed-Forward Artificial Neural Networks

24

Example of Perceptron Training: The OR function

[Plot of the OR training examples with the old line x1 = 0.5]

Only misclassified example: x1 = 0, x2 = 1

New weights: 1·x2 + 1·x1 + 0.5 = 0
For x2 = 0, x1 = -0.5; for x1 = 0, x2 = -0.5.
So the new line passes through (-0.5, 0) and (0, -0.5).

Page 25: Feed-Forward Artificial Neural Networks

25

Example of Perceptron Training: The OR function

New line: 1·x2 + 1·x1 + 0.5 = 0
For x2 = 0, x1 = -0.5; for x1 = 0, x2 = -0.5.

[Plot of the OR training examples with the old line x1 = 0.5 and the new line x2 + x1 + 0.5 = 0, which crosses both axes at -0.5]

Page 26: Feed-Forward Artificial Neural Networks

26

Example of Perceptron Training: The OR function

[Plot of the OR training examples with the line x2 + x1 + 0.5 = 0]

Misclassified example: x1 = 0, x2 = 0 (t = -1, o = 1)

Next iteration:
$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle + 0.5\,(-1 - 1)\,\langle 0, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle + (-1)\,\langle 0, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, -0.5 \rangle$

Page 27: Feed-Forward Artificial Neural Networks

27

Example of Perceptron Training: The OR function

[Plot of the OR training examples with the final line x2 + x1 - 0.5 = 0]

Perfect classification

No change occurs next
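The whole OR example can be reproduced with a short script. This is a sketch under the same assumptions as the slides (learning rate 0.5, inputs augmented with the constant x0 = 1, weights initialized to the "random" values of slide 20, examples visited in the order shown):

# Perceptron training on the OR function, following the slides' example.
def sgn(y):
    return 1 if y > 0 else -1

def train_perceptron(examples, w, eta=0.5, max_epochs=100):
    # examples: list of (x, t) with x = [x0, x1, x2] and target t in {-1, +1}
    for _ in range(max_epochs):
        changed = False
        for x, t in examples:
            o = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if o != t:
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:            # converged: every example classified correctly
            return w
    return w

or_examples = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_perceptron(or_examples, w=[-0.5, 1.0, 0.0])   # w0, w1, w2 as on slide 20
print(w)   # [-0.5, 1.0, 1.0], i.e., the line x2 + x1 - 0.5 = 0 reached on the slides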

Page 28: Feed-Forward Artificial Neural Networks

28

Analysis of the Perceptron Training Rule

The algorithm will always converge within a finite number of iterations if the data are linearly separable.

Otherwise, it may oscillate (no convergence)

Page 29: Feed-Forward Artificial Neural Networks

29

Training by Gradient Descent

Similar, but:
•Always converges
•Generalizes to training networks of perceptrons (neural networks)

Idea:
•Define an error function
•Search for weights that minimize the error, i.e., find weights that zero the error gradient

Page 30: Feed-Forward Artificial Neural Networks

30

Setting Up the Gradient Descent

Mean squared error:
$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

Minima exist where the gradient is zero:
$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) = \sum_{d \in D} (t_d - o_d)\left(-\frac{\partial o_d}{\partial w_i}\right)$

Page 31: Feed-Forward Artificial Neural Networks

31

The Sign Function is not Differentiable

$\frac{\partial o_d}{\partial w_i} = 0$ everywhere except where $\sum_i w_i x_{i,d} = 0$, where the derivative is undefined.

[Plot of the OR training examples with the line x1 = 0.5]

Page 32: Feed-Forward Artificial Neural Networks

32

Use Differentiable Transfer Functions

Replace $\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$ with the sigmoid $\mathrm{sig}(\mathbf{w} \cdot \mathbf{x})$:

$\mathrm{sig}(y) = \frac{1}{1 + e^{-y}}, \qquad \frac{d\,\mathrm{sig}(y)}{dy} = \mathrm{sig}(y)\,(1 - \mathrm{sig}(y))$
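As a quick numerical illustration (a sketch, not part of the original slides):

import math

def sig(y):
    # The sigmoid (logistic) transfer function.
    return 1.0 / (1.0 + math.exp(-y))

def sig_prime(y):
    # Its derivative, expressed through the function itself: sig(y) * (1 - sig(y)).
    s = sig(y)
    return s * (1.0 - s)

print(sig(0.0), sig_prime(0.0))   # 0.5 and 0.25: the derivative is largest at y = 0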

Page 33: Feed-Forward Artificial Neural Networks

33

Calculating the Gradient

$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)\left(-\frac{\partial o_d}{\partial w_i}\right)$
$\quad = -\sum_{d \in D} (t_d - o_d)\,\frac{\partial}{\partial w_i}\,\mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d)$
$\quad = -\sum_{d \in D} (t_d - o_d)\,\mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,\frac{\partial (\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_i}$
$\quad = -\sum_{d \in D} (t_d - o_d)\,\mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,x_{i,d}$

In vector form:
$\nabla E(\mathbf{w}) = -\sum_{d \in D} (t_d - o_d)\,\mathrm{sig}_d\,(1 - \mathrm{sig}_d)\,\mathbf{x}_d$, where $\mathrm{sig}_d = \mathrm{sig}(\mathbf{w} \cdot \mathbf{x}_d)$.

Page 34: Feed-Forward Artificial Neural Networks

34

Updating the Weights with Gradient Descent

•Each weight update goes through all training instances
•Each weight update is more expensive but more accurate
•Always converges to a local minimum regardless of the data

$\mathbf{w}' = \mathbf{w} - \eta\,\nabla E(\mathbf{w})$
$\mathbf{w}' = \mathbf{w} + \eta \sum_{d \in D} (t_d - o_d)\,\mathrm{sig}_d\,(1 - \mathrm{sig}_d)\,\mathbf{x}_d$
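A short Python sketch of this batch update for a single sigmoid unit; the learning rate, the number of epochs, and the 0/1 coding of the targets (natural for an output in (0, 1)) are illustrative choices, not specified on the slide:

import math

def sig(y):
    return 1.0 / (1.0 + math.exp(-y))

def gradient_descent_epoch(w, data, eta=0.1):
    # One batch update: accumulate the gradient over all training instances, then move w.
    # data: list of (x, t) with x including the constant bias input x0 = 1.
    delta = [0.0] * len(w)
    for x, t in data:
        o = sig(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * o * (1.0 - o) * xi
    return [wi + di for wi, di in zip(w, delta)]

# Usage: repeat for many epochs (here on the OR data with targets coded as 0/1).
w = [0.0, 0.0, 0.0]
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
for _ in range(1000):
    w = gradient_descent_epoch(w, data)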

Page 35: Feed-Forward Artificial Neural Networks

35

Multiclass Problems

E.g., 4 classes. Two possible solutions:

•Use one perceptron (output unit) and interpret the results as follows:
  Output 0-0.25: class 1; 0.25-0.5: class 2; 0.5-0.75: class 3; 0.75-1: class 4
•Use four output units and a binary encoding of the output (1-of-m encoding).

Page 36: Feed-Forward Artificial Neural Networks

36

1-of-M Encoding

Assign to class with largest output

[Network diagram: inputs x4, x3, x2, x1, x0 feed four output units, one per class (output for class 1, class 2, class 3, class 4)]
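A sketch of 1-of-m output encoding and the "largest output wins" decision rule (illustrative code, not from the slides):

def one_of_m(class_index, m):
    # Encode class k (0-based) as a target vector with a 1 in position k and 0 elsewhere.
    return [1 if i == class_index else 0 for i in range(m)]

def predict_class(outputs):
    # Assign the example to the class whose output unit has the largest value.
    return max(range(len(outputs)), key=lambda i: outputs[i])

print(one_of_m(2, 4))                          # [0, 0, 1, 0]
print(predict_class([0.1, 0.7, 0.15, 0.05]))   # 1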

Page 37: Feed-Forward Artificial Neural Networks

37

Feed-Forward Neural Networks

[Network diagram: an input layer (x4, x3, x2, x1, x0), hidden layer 1, hidden layer 2, and an output layer, with each layer feeding the next]

Page 38: Feed-Forward Artificial Neural Networks

38

Increased Expressiveness Example: Exclusive OR

[Plot of the XOR training examples in the (x1, x2) plane: (0, 0) and (1, 1) are labeled -1; (0, 1) and (1, 0) are labeled 1]

No line (no set of three weights) can separate the training examples (learn the true function).

[Diagram: a single perceptron with inputs x2, x1, x0 and weights w2, w1, w0]

Page 39: Feed-Forward Artificial Neural Networks

39

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram: inputs x2, x1, x0 feed two hidden units with weights w2,1, w1,1, w0,1 and w2,2, w1,2, w0,2; the hidden units feed one output unit with weights w'1,1 and w'2,1]

Page 40: Feed-Forward Artificial Neural Networks

40

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram with hand-set weights: hidden unit H1 with weights 1 (from x1), -1 (from x2), -0.5 (bias x0); hidden unit H2 with weights -1 (from x1), 1 (from x2), -0.5 (bias x0); output unit O with weights 1 and 1 from H1 and H2]

X1 X2 C

0 0 -1

0 1 1

1 0 1

1 1 -1

Page 41: Feed-Forward Artificial Neural Networks

41

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before: H1 with weights 1 (x1), -1 (x2), -0.5 (x0); H2 with weights -1 (x1), 1 (x2), -0.5 (x0); output O with weights 1 and 1]

X1  X2   C   H1  H2   O
T1   0   0  -1
T2   0   1   1
T3   1   0   1
T4   1   1  -1

Page 42: Feed-Forward Artificial Neural Networks

42

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, with the example x1 = 0, x2 = 0 (and x0 = 1) presented at the inputs]

X1  X2   C   H1  H2   O
T1   0   0  -1
T2   0   1   1
T3   1   0   1
T4   1   1  -1

Page 43: Feed-Forward Artificial Neural Networks

43

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, evaluating example T1 (x1 = 0, x2 = 0): H1 = -1, H2 = -1, O = -1]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1
T3   1   0   1
T4   1   1  -1

Page 44: Feed-Forward Artificial Neural Networks

44

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, evaluating example T2 (x1 = 0, x2 = 1): H1 = -1, H2 = 1, O = 1]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1   -1   1   1
T3   1   0   1
T4   1   1  -1

Page 45: Feed-Forward Artificial Neural Networks

45

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, evaluating example T3 (x1 = 1, x2 = 0): H1 = 1, H2 = -1, O = 1]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1   -1   1   1
T3   1   0   1    1  -1   1
T4   1   1  -1

Page 46: Feed-Forward Artificial Neural Networks

46

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, evaluating example T4 (x1 = 1, x2 = 1): H1 = -1, H2 = -1, O = -1]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1   -1   1   1
T3   1   0   1    1  -1   1
T4   1   1  -1   -1  -1  -1

Page 47: Feed-Forward Artificial Neural Networks

47

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, with the complete set of hidden and output values]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1   -1   1   1
T3   1   0   1    1  -1   1
T4   1   1  -1   -1  -1  -1

Page 48: Feed-Forward Artificial Neural Networks

48

Increased Expressiveness Example

[Plot of the XOR training examples in the (x1, x2) plane]

[Network diagram as before, with the complete set of hidden and output values: the output O matches the class C for every training example]

X1  X2   C   H1  H2   O
T1   0   0  -1   -1  -1  -1
T2   0   1   1   -1   1   1
T3   1   0   1    1  -1   1
T4   1   1  -1   -1  -1  -1
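The hand-set weights above can be checked with a few lines of Python. This sketch assumes sign units with the hidden weights read off the diagram (1, -1, -0.5 for H1 and -1, 1, -0.5 for H2) and an output unit computing the OR of H1 and H2 with weights 1, 1; the output unit's bias of 1 is an assumed value, since it is not legible in the transcript:

def sgn(y):
    return 1 if y > 0 else -1

def xor_net(x1, x2):
    h1 = sgn(1 * x1 - 1 * x2 - 0.5)    # fires only for x1 = 1, x2 = 0
    h2 = sgn(-1 * x1 + 1 * x2 - 0.5)   # fires only for x1 = 0, x2 = 1
    o = sgn(1 * h1 + 1 * h2 + 1)       # OR of the two hidden units (bias of 1 is an assumption)
    return h1, h2, o

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))     # reproduces the H1, H2, O columns of the table above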

Page 49: Feed-Forward Artificial Neural Networks

49

From the Viewpoint of the Output Layer

[Diagram: the output unit O receiving H1 and H2 with weights 1 and 1]

C H1 H2 O

T1 -1 -1 -1 -1

T2 1 -1 1 1

T3 1 1 -1 1

T4 -1 -1 -1 -1

[Plot: in the original (x1, x2) space, T2 and T3 (class 1) cannot be separated from T1 and T4 (class -1) by a line]

Mapped by the hidden layer to the (H1, H2) space: T1 and T4 both map to (-1, -1); T2 maps to (-1, 1); T3 maps to (1, -1). In this space the two classes are linearly separable.

Page 50: Feed-Forward Artificial Neural Networks

50

From the Viewpoint of the Output Layer

[Plot: the examples T1 ... T4 in the original (x1, x2) space, and the same examples after being mapped by the hidden layer to the (H1, H2) space, where T1 and T4 coincide]

•Each hidden layer maps to a new instance space

•Each hidden node is a new constructed feature

•Original Problem may become separable (or easier)

Page 51: Feed-Forward Artificial Neural Networks

51

How to Train Multi-Layered Networks

•Select a network structure (number of hidden layers, hidden nodes, and connectivity).
•Select transfer functions that are differentiable.
•Define a (differentiable) error function.
•Search for weights that minimize the error function, using gradient descent or another optimization method.

BACKPROPAGATION

Page 52: Feed-Forward Artificial Neural Networks

52

How to Train Multi-Layered Networks

•Select a network structure (number of hidden layers, hidden nodes, and connectivity).
•Select transfer functions that are differentiable.
•Define a (differentiable) error function.
•Search for weights that minimize the error function, using gradient descent or another optimization method.

[Diagram: the two-layer example network with inputs x2, x1, x0, hidden weights w2,1, w1,1, w0,1, w2,2, w1,2, w0,2, and output weights w'1,1, w'2,1]

$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

Page 53: Feed-Forward Artificial Neural Networks

53

Back-Propagating the Error

[Diagram: a layered network with input units (index k, outputs x_k), hidden units (index j, outputs x_j), and output units (index i, outputs x_i); weights w_{jk} connect input unit k to hidden unit j, and weights w_{ij} connect hidden unit j to output unit i]

Notation:
w_{ji}: weight from unit j to unit i
x_j: the output of unit j
i: index for the output layer
j: index for the hidden layer
k: index for the input layer

Page 54: Feed-Forward Artificial Neural Networks

54

Back-Propagating the Error

For the whole training set:
$\frac{\partial E(\mathbf{w})}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\,\mathrm{sig}_d\,(1 - \mathrm{sig}_d)\,x_{i,d}$

Contribution of a single training example:
$\frac{\partial E_d(\mathbf{w})}{\partial w_i} = -(t_d - o_d)\,\mathrm{sig}_d\,(1 - \mathrm{sig}_d)\,x_{i,d}$

[Diagram: the layered network with weights w_{ij} (hidden to output) and w_{jk} (input to hidden), and unit outputs x_i, x_j, x_k]

Page 55: Feed-Forward Artificial Neural Networks

55

Back-Propagating the Error

[Diagram: the layered network with weights w_{ij} and w_{jk} and unit outputs x_i, x_j, x_k]

For a weight w_{ij} into an output unit:
$\frac{\partial E}{\partial w_{ij}} = -x_j\,\mathrm{sig}_i\,(1 - \mathrm{sig}_i)\,(t_i - o_i)$

Page 56: Feed-Forward Artificial Neural Networks

56

Back-Propagating the Error

[Diagram as before, now asking for the gradient with respect to a weight w_{jk} into a hidden unit]

$\frac{\partial E}{\partial w_{jk}} = x_k\,\mathrm{sig}_j\,(1 - \mathrm{sig}_j)\;\cdot\;?$

Page 57: Feed-Forward Artificial Neural Networks

57

Back-Propagating the Error

[Diagram as before]

$\frac{\partial E}{\partial w_{jk}} = -x_k\,\mathrm{sig}_j\,(1 - \mathrm{sig}_j) \sum_{i \in \mathrm{Outputs}} \delta_i\, w_{ij}$

(the δ terms are defined on the next slide)

Page 58: Feed-Forward Artificial Neural Networks

58

Back-Propagating the Error

[Diagram as before, with unit m in the output layer, output x_m, and weight w_{mi}]

For output units: $\delta_m = (t_m - x_m)$
For the layer below: $\delta_i = \sum_{m \in \mathrm{Outputs}} \delta_m\, w_{mi}$, and in general $\delta_j = \sum_{i \in \mathrm{NextLayer}} \delta_i\, w_{ij}$

Weight update:
$w_{tr}' = w_{tr} - \eta\,\frac{\partial E}{\partial w_{tr}} = w_{tr} + \eta\,\delta_t\,\mathrm{sig}_t\,(1 - \mathrm{sig}_t)\,x_r$
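Putting the pieces together, here is a compact Python sketch of backpropagation for a network with one hidden layer of sigmoid units and one sigmoid output. It uses stochastic updates (one example at a time); the variable names and the learning rate are illustrative, and targets are coded 0/1 rather than the slides' ±1:

import math, random

def sig(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(W_hid, w_out, x, t, eta=0.5):
    # Forward pass: x includes the bias x0 = 1; the hidden outputs also get a leading bias of 1.
    h = [1.0] + [sig(sum(wjk * xk for wjk, xk in zip(wj, x))) for wj in W_hid]
    o = sig(sum(wi * hi for wi, hi in zip(w_out, h)))
    # Backward pass: delta for the output unit, then deltas for the hidden units.
    d_out = (t - o) * o * (1.0 - o)
    d_hid = [d_out * w_out[j + 1] * h[j + 1] * (1.0 - h[j + 1]) for j in range(len(W_hid))]
    # Weight updates (gradient descent on the squared error of this one example).
    w_out = [wi + eta * d_out * hi for wi, hi in zip(w_out, h)]
    W_hid = [[wjk + eta * d_hid[j] * xk for wjk, xk in zip(W_hid[j], x)] for j in range(len(W_hid))]
    return W_hid, w_out

# Example usage: train on XOR with 2 hidden units; inputs carry a leading bias of 1.
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 0)]
W_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
for _ in range(10000):
    for x, t in data:
        W_hid, w_out = backprop_step(W_hid, w_out, x, t)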

Page 59: Feed-Forward Artificial Neural Networks

59

Training with BackPropagation

•Going once through all training examples and updating the weights: one epoch
•Iterate until a stopping criterion is satisfied
•The hidden layers learn new features and map to new spaces
•Training reaches a local minimum of the error surface

Page 60: Feed-Forward Artificial Neural Networks

60

Overfitting with Neural Networks

•If the number of hidden units (and weights) is large, it is easy to memorize the training set (or parts of it) and not generalize
•Typically, the optimal number of hidden units is much smaller than the number of input units
•Each hidden layer maps to a space of smaller dimension
•Training is stopped before a local minimum is reached

Page 61: Feed-Forward Artificial Neural Networks

61

Typical Training Curve

[Training curve: error (vertical axis) vs. epoch (horizontal axis); the error on the training set keeps decreasing, while the real error, or the error on an independent validation set, eventually starts increasing; the ideal point to stop training is marked where the validation error is lowest]

Page 62: Feed-Forward Artificial Neural Networks

62

Example of Training Stopping Criteria

•Split the data into training, validation, and test sets
•Train on the training set until the error on the validation set is increasing (by more than epsilon over the last m iterations), or until a maximum number of epochs is reached
•Evaluate final performance on the test set
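A sketch of such a stopping rule in Python; the split proportions, epsilon, m, and the helper functions train_one_epoch and error are hypothetical placeholders for whatever training code is used:

def train_with_early_stopping(model, train_set, valid_set,
                              train_one_epoch, error,
                              epsilon=1e-4, m=5, max_epochs=1000):
    # Stop when the validation error has not improved by more than epsilon
    # for m consecutive epochs, or when max_epochs is reached.
    best_err, best_model, bad_epochs = float("inf"), model, 0
    for _ in range(max_epochs):
        model = train_one_epoch(model, train_set)
        err = error(model, valid_set)
        if err < best_err - epsilon:
            best_err, best_model, bad_epochs = err, model, 0
        else:
            bad_epochs += 1
            if bad_epochs >= m:
                break
    return best_model   # final performance is then measured once on the held-out test set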

Page 63: Feed-Forward Artificial Neural Networks

63

Classifying with Neural Networks

•Determine the representation of the input. E.g., Religion ∈ {Christian, Muslim, Jewish}:
  Represent it as one input taking three different values (e.g., 0.2, 0.5, 0.8), or
  Represent it as three inputs taking 0/1 values
•Determine the representation of the output (for multiclass problems): a single output unit vs. multiple binary units

Page 64: Feed-Forward Artificial Neural Networks

64

Classifying with Neural Networks

•Select:
  Number of hidden layers
  Number of hidden units
  Connectivity
  Typically: one hidden layer, a number of hidden units that is a small fraction of the number of input units, and full connectivity
•Select an error function
  Typically: minimize the mean squared error or maximize the log likelihood of the data

Page 65: Feed-Forward Artificial Neural Networks

65

Classifying with Neural Networks

•Select a training method: typically gradient descent (corresponds to vanilla backpropagation). Other optimization methods can be used:
  Backpropagation with momentum
  Trust-region methods
  Line-search methods
  Conjugate gradient methods
  Newton and quasi-Newton methods
•Select a stopping criterion

Page 66: Feed-Forward Artificial Neural Networks

66

Classifying with Neural Networks

•Selecting a training method may also include searching for an optimal structure
•It may also include extensions to avoid getting stuck in local minima:
  Simulated annealing
  Random restarts with different weights

Page 67: Feed-Forward Artificial Neural Networks

67

Classifying with Neural Networks

Split the data into:
•Training set: used to update the weights
•Validation set: used in the stopping criterion
•Test set: used to evaluate generalization error (performance)

Page 68: Feed-Forward Artificial Neural Networks

68

Other Error Functions in Neural Networks

•Adding a penalty term for weight magnitude
•Minimizing cross-entropy with respect to the target values; the network outputs are then interpretable as probability estimates
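For concreteness, two common forms of these error functions (a sketch; the slide does not give the exact formulas):

$E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 + \lambda \sum_{i} w_i^2$  (squared error plus a penalty on weight magnitude, with $\lambda$ a regularization constant)

$E(\mathbf{w}) = -\sum_{d \in D}\left[\,t_d \ln o_d + (1 - t_d)\ln(1 - o_d)\,\right]$  (cross-entropy, for targets $t_d \in \{0, 1\}$ and sigmoid outputs $o_d$)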

Page 69: Feed-Forward Artificial Neural Networks

69

Representational Power

Perceptron: Can learn only linearly separable functions

Boolean Functions: learnable by a NN with one hidden layer

Continuous Functions: learnable with a NN with one hidden layer and sigmoid units

Arbitrary Functions: learnable with a NN with two hidden layers and sigmoid units

The number of hidden units needed is in all cases unknown

Page 70: Feed-Forward Artificial Neural Networks

70

Issues with Neural Networks

•No principled method for selecting the number of layers and units
  Tiling: start with a small network and keep adding units
  Optimal brain damage: start with a large network and keep removing weights and units
  Evolutionary methods: search in the space of structures for one that generalizes well
•No principled method for most other design choices

Page 71: Feed-Forward Artificial Neural Networks

71

Important but not Covered in This Tutorial

•Very hard to understand the classification logic from direct examination of the weights
•Large recent body of work on extracting symbolic rules and information from neural networks
•Recurrent networks, associative networks, self-organizing maps, committees of networks, adaptive resonance theory, etc.

Page 72: Feed-Forward Artificial Neural Networks

72

Why the Name Neural Networks?

Initial models that simulate real neurons to use for classification

Efforts to improve and understand classification independent of similarity to biological neural networks

Efforts to simulate and understand biological neural networks to a larger degree

Page 73: Feed-Forward Artificial Neural Networks

73

Conclusions

•Can deal with both real and discrete domains
•Can also perform density or probability estimation
•Very fast classification time
•Relatively slow training time (does not scale to thousands of inputs)
•One of the most successful classifiers yet
•Successful design choices are still a black art
•Easy to overfit or underfit if care is not applied