042407 Neural Networks
TRANSCRIPT
Michael S. Lewicki, Carnegie Mellon. Artificial Intelligence: Neural Networks.
The Iris dataset with decision tree boundaries
[Figure: Iris data, petal width (cm) vs. petal length (cm), with decision tree boundaries]
The optimal decision boundary for C2 vs. C3
[Figure: petal width (cm) vs. petal length (cm) scatter plot with the class-conditional densities p(petal length|C2) and p(petal length|C3) overlaid]
- The optimal decision boundary is determined from the statistical distribution of the classes.
- It is optimal only if the model is correct.
- It assigns a precise degree of uncertainty to the classification.
Optimal decision boundary
[Figure: class-conditional densities p(petal length|C2), p(petal length|C3) and posteriors p(C2|petal length), p(C3|petal length), with the optimal decision boundary where the posteriors cross]
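A minimal sketch of this computation, assuming (hypothetically) Gaussian class-conditional densities for petal length with illustrative parameters rather than the actual Iris estimates: Bayes' rule gives the posteriors, and the optimal boundary is where they are equal.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Gaussian class-conditional densities for petal length;
# the means and standard deviations are illustrative, not the Iris estimates.
mu2, sd2 = 4.3, 0.5    # p(petal length | C2)
mu3, sd3 = 5.5, 0.6    # p(petal length | C3)
prior2 = prior3 = 0.5  # equal class priors

x = np.linspace(1, 7, 601)
lik2 = norm.pdf(x, mu2, sd2)
lik3 = norm.pdf(x, mu3, sd3)

# Bayes' rule: p(Ck | x) = p(x | Ck) p(Ck) / p(x)
evidence = lik2 * prior2 + lik3 * prior3
post2 = lik2 * prior2 / evidence
post3 = lik3 * prior3 / evidence

# The optimal decision boundary is where the two posteriors are equal.
boundary = x[np.argmin(np.abs(post2 - post3))]
print(f"decision boundary near petal length = {boundary:.2f} cm")
```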
Can we do better?
[Figure: petal width (cm) vs. petal length (cm) scatter plot with the 1D densities p(petal length|C2) and p(petal length|C3)]
- The only way to do better is to use more information.
- Decision trees use both petal width and petal length.
Arbitrary decision boundaries would be more powerful
[Figure: petal width (cm) vs. petal length (cm) scatter plot with a non-linear class boundary]
- Decision boundaries could be non-linear.
Defining a decision boundary
[Figure: a line separating two classes in the (x1, x2) plane]
- Consider just two classes: we want points on one side of the line to be in class 1, otherwise class 2.
- 2D linear discriminant function: y = m^T x + b. This defines a plane over the 2D input, which leads to the decision (sketched in code below): class 1 if y >= 0, class 2 if y < 0.
- The decision boundary is the set of points where y = m^T x + b = 0.
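A minimal sketch of that decision rule; the weight vector m and bias b below are illustrative values, not ones from the lecture.

```python
import numpy as np

def linear_discriminant(x, m, b):
    """Return class 1 if y = m^T x + b >= 0, else class 2."""
    y = m @ x + b
    return 1 if y >= 0 else 2

# Illustrative parameters: a line in (petal length, petal width) space.
m = np.array([1.0, 2.5])
b = -8.0

print(linear_discriminant(np.array([4.0, 1.2]), m, b))  # y = -1.0 -> class 2
print(linear_discriminant(np.array([6.0, 2.0]), m, b))  # y =  3.0 -> class 1
```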
Linear separability
[Figure: petal width (cm) vs. petal length (cm) scatter plot; one pair of classes is linearly separable, the other is not]
- Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane
Diagramming the classifier as a neural network
The feedforward neural network is specified by weights w_i and bias b:
  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b
It can be written equivalently as
  y = w^T x = \sum_{i=0}^{M} w_i x_i
where w_0 = b is the bias and x_0 is a dummy input that is always 1 (see the sketch below).
[Diagram: input units x_1 ... x_M (plus dummy input x_0 = 1), weights w_1 ... w_M (plus w_0), bias b, and a single output unit y]
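A minimal sketch of the equivalence, with illustrative numbers: folding the bias in as w_0 with a dummy input x_0 = 1 gives the same output as the explicit-bias form.

```python
import numpy as np

x = np.array([4.0, 1.2])   # inputs x1..xM
w = np.array([1.0, 2.5])   # weights w1..wM
b = -8.0                   # bias

# Explicit bias: y = w^T x + b
y_explicit = w @ x + b

# Bias folded in: prepend a dummy input x0 = 1 and set w0 = b
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_folded = w_aug @ x_aug

assert np.isclose(y_explicit, y_folded)
print(y_explicit, y_folded)
```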
Determining, i.e. learning, the optimal linear discriminant
- First we must define an objective function, i.e. the goal of learning.
- Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
- Objective: minimize the sum-squared error over all patterns x_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
- Note the notation: x_n denotes a pattern vector, x_n = (x_1, ..., x_M)_n.
- We can define the desired class as:
  c_n = 0 if x_n is in class 1, 1 if x_n is in class 2
We've seen this before: curve fitting
[Figure: curve-fitting example from Bishop (2006), Pattern Recognition and Machine Learning: noisy targets t_n = sin(2*pi*x_n) + noise on x in [0, 1], with the fitted curve y(x_n, w)]
Neural networks compared to polynomial curve fitting
[Figure: polynomial fits to the noisy sin(2*pi*x) data for several polynomial orders; example from Bishop (2006), Pattern Recognition and Machine Learning]
  y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j
  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2
- For the linear network, M = 1 and there are multiple input dimensions. (A least-squares fitting sketch follows below.)
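A minimal least-squares sketch of the polynomial fit above, with an assumed noise level and polynomial order (the lecture does not specify these):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x), as in Bishop's curve-fitting example.
N, M = 10, 3                      # number of points, polynomial order (assumed)
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with columns x^0, x^1, ..., x^M, then minimize
# E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 by least squares.
X = np.vander(x, M + 1, increasing=True)
w, *_ = np.linalg.lstsq(X, t, rcond=None)

E = 0.5 * np.sum((X @ w - t) ** 2)
print("fitted weights:", w, "sum-squared error:", E)
```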
General form of a linear network
- A linear neural network is simply a linear transformation of the input:
  y_j = \sum_{i=0}^{M} w_{i,j} x_i
- Or, in matrix-vector form:
  y = W x
- Multiple outputs correspond to multivariate regression.
[Diagram: input units x_0 = 1, x_1 ... x_M (bias as dummy input), weight matrix W, and output units y_1 ... y_K]
Training the network: Optimization by gradient descent
- We can adjust the weights incrementally to minimize the objective function. This is called gradient descent (or gradient ascent if we're maximizing).
- The gradient descent rule for weight w_i is:
  w_i^{t+1} = w_i^t - \epsilon \, \partial E / \partial w_i
- Or in vector form:
  w^{t+1} = w^t - \epsilon \, \partial E / \partial w
- For gradient ascent, the sign of the gradient step changes. (A one-variable sketch follows below.)
[Figure: error surface with successive weight estimates w_1, w_2, w_3, w_4 stepping toward the minimum]
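A one-variable sketch of the update rule on an assumed toy objective, E(w) = (w - 3)^2:

```python
# Gradient descent on the toy objective E(w) = (w - 3)^2, with dE/dw = 2(w - 3).
# For gradient ascent, the sign of the step would flip.
epsilon = 0.1   # step size
w = 0.0
for t in range(50):
    grad = 2 * (w - 3.0)
    w = w - epsilon * grad
print(w)        # approaches the minimum at w = 3
```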
Computing the gradient
- Idea: minimize the error by gradient descent.
- Take the derivative of the objective function with respect to the weights:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
  \partial E / \partial w_i = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + ... + w_i x_{i,n} + ... + w_M x_{M,n} - c_n) x_{i,n}
                            = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}
- And in vector form (used in the training sketch below):
  \partial E / \partial w = \sum_{n=1}^{N} (w^T x_n - c_n) x_n
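A minimal training sketch using this gradient. The two-class data below are synthetic stand-ins (not the Iris measurements), and a dummy input x_0 = 1 carries the bias; the step size ε = 0.1/N matches the simulation slides that follow.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data standing in for the lecture's example; a dummy
# input x0 = 1 is prepended so that w0 plays the role of the bias.
X1 = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))   # class with target c = 0
X2 = rng.normal([1.0, 1.0], 0.5, size=(50, 2))     # class with target c = 1
X = np.vstack([X1, X2])
X = np.hstack([np.ones((X.shape[0], 1)), X])       # x0 = 1 column
c = np.concatenate([np.zeros(50), np.ones(50)])
N = len(c)

w = np.zeros(3)
epsilon = 0.1 / N                                  # step size, as on the slides
for t in range(1000):
    err = X @ w - c                                # (w^T x_n - c_n) for all n
    grad = X.T @ err                               # sum_n (w^T x_n - c_n) x_n
    w = w - epsilon * grad

E = 0.5 * np.sum((X @ w - c) ** 2)
print("weights:", w, "error:", E)
```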
Simulation: learning the decision boundary
- Each iteration updates the weights using the gradient:
  \partial E / \partial w_i = \sum_{n=1}^{N} (w^T x_n - c_n) x_{i,n}
  w_i^{t+1} = w_i^t - \epsilon \, \partial E / \partial w_i
- Epsilon is a small value: ε = 0.1/N.
- Epsilon too large: learning diverges.
- Epsilon too small: convergence is slow.
[Figure: the decision boundary in the (x1, x2) plane and the learning curve (error vs. iteration); this slide is repeated over several frames as the simulation progresses]
Simulation: learning the decision boundary
- Learning converges onto the solution that minimizes the error.
- For linear networks, this is guaranteed to converge to the minimum.
- It is also possible to derive a closed-form solution (covered later; a sketch follows below).
[Figure: the learned decision boundary in the (x1, x2) plane and the converged learning curve (error vs. iteration)]
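The closed-form solution is only mentioned here, but for the sum-squared error it is the standard least-squares result; a sketch, assuming the same X (rows x_n, including the x_0 = 1 column) and targets c as in the training sketch above:

```python
import numpy as np

# Setting the gradient sum_n (w^T x_n - c_n) x_n to zero gives the normal
# equations X^T X w = X^T c; their solution minimizes the sum-squared error
# (uniquely, when X^T X is invertible).
def closed_form_weights(X, c):
    return np.linalg.solve(X.T @ X, X.T @ c)
```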
Learning is slow when epsilon is too small
- Here, larger step sizes would converge more quickly to the minimum.
[Figure: error as a function of the weight w, with many small gradient steps]
Divergence when epsilon is too large
- If the step size is too large, learning can oscillate between different sides of the minimum.
[Figure: error as a function of the weight w, with large steps jumping across the minimum]
Multi-layer networks
- Can we extend our network to multiple layers? We have:
  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i
- Or in matrix form:
  z = V y = V W x
- Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW. It is not more powerful (see the sketch below).
- How do we address this?
[Diagram: x -> W -> y -> V -> z]
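A minimal numerical check of the collapse, with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers: y = Wx, z = Vy. Composing them gives z = (VW)x, so the
# two-layer linear network equals a one-layer network with U = VW.
W = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 hidden units
V = rng.normal(size=(2, 4))   # layer 2: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

z_two_layer = V @ (W @ x)
U = V @ W
z_one_layer = U @ x

print(np.allclose(z_two_layer, z_one_layer))   # True
```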
Non-linear neural networks
- Idea: introduce a non-linearity f:
  y_j = f( \sum_i w_{i,j} x_i )
  z_k = f( \sum_j v_{j,k} y_j ) = f( \sum_j v_{j,k} f( \sum_i w_{i,j} x_i ) )
- Now, multiple layers are not equivalent.
- Common nonlinearities (both sketched in code below):
  - threshold: y = 0 if x < 0, 1 if x >= 0
  - sigmoid: y = 1 / (1 + exp(-x))
[Figure: plots of the threshold and sigmoid nonlinearities]
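A minimal sketch of the two nonlinearities and of a two-layer nonlinear forward pass with arbitrary random weights:

```python
import numpy as np

def threshold(x):
    """Step nonlinearity: 0 for x < 0, 1 for x >= 0."""
    return (x >= 0).astype(float)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# With a nonlinearity f, the two-layer output z = f(V f(W x)) no longer
# collapses into a single linear map.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
V = rng.normal(size=(2, 4))
x = rng.normal(size=3)
print(sigmoid(V @ sigmoid(W @ x)))
```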
Modeling logical operators
- A one-layer binary-threshold network can implement the logical operators AND and OR, but not XOR. Why not? (See the sketch below.)
  y_j = f( \sum_i w_{i,j} x_i ),  with threshold f: y = 0 if x < 0, 1 if x >= 0
[Diagram: the four inputs (x1, x2) in {0, 1}^2 labeled by output for x1 AND x2, x1 OR x2, and x1 XOR x2]
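A sketch of one valid choice of weights (there are many) for AND and OR with a single threshold unit; XOR fails because its positive cases cannot be separated from the negative ones by any single line.

```python
import numpy as np

def threshold_unit(x, w, b):
    """Single binary-threshold unit: output 1 if w^T x + b >= 0, else 0."""
    return int(x @ w + b >= 0)

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]

# Illustrative weights (one of many valid choices).
for name, w, b in [("AND", np.array([1.0, 1.0]), -1.5),
                   ("OR",  np.array([1.0, 1.0]), -0.5)]:
    print(name, [threshold_unit(x, w, b) for x in inputs])
# AND [0, 0, 0, 1]
# OR  [0, 1, 1, 1]
# No (w, b) reproduces XOR [0, 1, 1, 0]: the problem is not linearly separable.
```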
Posterior odds interpretation of a sigmoid
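Only the title of this slide survives in the transcript; the standard identity it refers to is that, for two classes, the posterior is a sigmoid of the log posterior odds a:

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
  = \frac{1}{1 + e^{-a}} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
```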
The general classification/regression problem
- Data: D = {x_1, ..., x_T}, where each observation x_i = {x_1, ..., x_N}_i.
- Desired output: y = {y_1, ..., y_K}.
- Model parameters: θ = {θ_1, ..., θ_M}.
- Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.
- The input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
- The model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters.
- For classification: y_i = 1 if x_i is in class C_i, 0 otherwise.
- Regression: arbitrary y.
A general multi-layer neural network
- The error function is defined as before, where the target vector t_n defines the desired output for network output y_n:
  E = \frac{1}{2} \sum_{n=1}^{N} ( y_n(x_n, W_{1:L}) - t_n )^2
- The forward pass computes the outputs at each layer (sketched below):
  y^l_j = f( \sum_i w^l_{i,j} y^{l-1}_i ),  l = {1, ..., L}
  with input x = y^0 and output y^L.
[Diagram: input units x_0 = 1, x_1 ... x_M; hidden layers 1 and 2, each with a dummy unit y_0 = 1; output units y_1 ... y_K]
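A minimal forward-pass sketch. The sigmoid nonlinearity and the bias handling (a dummy unit y_0 = 1 folded into each weight matrix) are assumptions consistent with the diagrams; the layer sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, weights):
    """Compute y^l = f(W^l y^{l-1}) for l = 1..L, with y^0 = x.

    `weights` is a list of matrices W^1..W^L; a bias column is folded into
    each W^l via a dummy unit y0 = 1.
    """
    activations = [x]
    y = x
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))   # prepend dummy y0 = 1
        activations.append(y)
    return activations                                # activations[-1] is y^L

# Tiny two-layer example with arbitrary shapes and random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3 + 1)),   # layer 1: 3 inputs (+ bias) -> 4 units
           rng.normal(size=(2, 4 + 1))]   # layer 2: 4 units (+ bias) -> 2 outputs
print(forward_pass(rng.normal(size=3), weights)[-1])
```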
Deriving the gradient for a sigmoid neural network
- The mathematical procedure for training is gradient descent: the same as before, except the gradients are more complex to derive.
- Convenient fact for the sigmoid non-linearity:
  \frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x) (1 - \sigma(x))
- The backward pass computes the gradients: back-propagation (sketched below).
  E = \frac{1}{2} \sum_{n=1}^{N} ( y_n(x_n, W_{1:L}) - t_n )^2
  W^{t+1} = W^t - \epsilon \, \partial E / \partial W
- New problem: local minima.
[Diagram: the same multi-layer network, with the error propagated backward from the output layer]
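A minimal backward-pass sketch for a two-layer sigmoid network, using the sigma'(x) = sigma(x)(1 - sigma(x)) fact above. Biases are omitted for brevity, and the batch layout (rows are patterns) is an assumption, not the lecture's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(X, T, W1, W2, epsilon):
    """One gradient-descent step for a two-layer sigmoid network.

    Rows of X are input patterns x_n, rows of T are targets t_n, and
    E = 1/2 * sum_n ||y_n - t_n||^2. Biases are omitted for brevity.
    """
    # Forward pass
    H = sigmoid(X @ W1.T)                  # hidden activations, shape (N, n_hidden)
    Y = sigmoid(H @ W2.T)                  # outputs,            shape (N, n_out)

    # Backward pass, using sigma'(a) = sigma(a) * (1 - sigma(a))
    delta2 = (Y - T) * Y * (1 - Y)         # error signal at the output layer
    delta1 = (delta2 @ W2) * H * (1 - H)   # error back-propagated to the hidden layer

    grad_W2 = delta2.T @ H                 # dE/dW2
    grad_W1 = delta1.T @ X                 # dE/dW1

    # Gradient-descent update; with nonlinear units the error surface can have
    # local minima, so convergence to the global minimum is not guaranteed.
    return W1 - epsilon * grad_W1, W2 - epsilon * grad_W2
```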
Applications: Driving (output is analog: steering direction)
- Network with 1 layer (4 hidden units).
- Learns to drive on roads.
- Demonstrated at highway speeds over 100s of miles.
- D. Pomerleau. Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishing, 1993.
Real image input is augmented to avoid overfitting
- Training data: images + corresponding steering angle.
- Important: conditioning of the training data to generate new examples avoids overfitting.
Hand-written digits: LeNet
- Takes as input an image of a handwritten digit; each pixel is an input unit.
- Complex network with many layers; output is the digit class.
- Tested on a large (50,000+) database of handwritten samples.
- Real-time; used commercially.
LeNet
- Very low error rate.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
- http://yann.lecun.com/exdb/lenet/
Object recognition
- LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
- http://www.cs.nyu.edu/~yann/research/norb/
Summary
- Decision boundaries: Bayes optimal, linear discriminant, linear separability
- Classification vs. regression
- Optimization by gradient descent
- Degeneracy of a multi-layer linear network
- Non-linearities: threshold, sigmoid, others?
- Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set step size