

Multilayer Perceptron

Hong Chang

Institute of Computing Technology, Chinese Academy of Sciences

Machine Learning Methods (Fall 2012)


Outline

1 Introduction

2 Single Perceptron

3 Boolean Function Learning

4 Multilayer Perceptron


Artificial Neural Network

Cognitive scientists and neuroscientists: the aim is to understand the functioning of the brain by building models of the natural neural networks.
Machine learning researchers: the aim (more pragmatic) is to build better computer systems based on inspirations from studying the brain.

A human brain has:

Large number (about 10^11) of neurons as processing units
Large number (about 10^4) of synapses per neuron as memory units
Parallel processing capabilities
Distributed computation/memory
High robustness to noise and failure

Artificial neural networks (ANN) mimic some characteristics of the human brain, especially with regard to the computational aspects.


Perceptron

The output f is a weighted sum of the inputs x = (x_0, x_1, \ldots, x_d)^T:

f = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x

where x_0 is a special bias unit with x_0 = 1 and w = (w_0, w_1, \ldots, w_d)^T are called the connection weights.


What a Perceptron Does

Regression vs. classification:

To implement a linear discriminant function, we need the threshold function

s(a) = \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases}

to define the following decision rule:

Choose \begin{cases} C_1 & \text{if } s(w^T x) = 1 \\ C_2 & \text{otherwise} \end{cases}
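A minimal sketch of this decision rule (NumPy assumed; the weight vector includes the bias weight w_0, and the example values are hypothetical):

    import numpy as np

    def perceptron_predict(w, x):
        """Threshold perceptron: 1 (choose C1) if w^T x > 0, else 0 (choose C2)."""
        return 1 if w @ x > 0 else 0

    w = np.array([-0.5, 1.0, 1.0])    # w0, w1, w2
    x = np.array([1.0, 0.8, 0.1])     # bias unit x0 = 1, then x1, x2
    print(perceptron_predict(w, x))   # 1, since w^T x = 0.4 > 0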


Sigmoid Function

Instead of using the threshold function to give a discrete output in {0, 1}, we may use the sigmoid function

sigmoid(a) = \frac{1}{1 + \exp(-a)}

to give a continuous output in (0, 1):

f = sigmoid(w^T x)

The output may be interpreted as the posterior probability that the input x belongs to C_1.
The perceptron is cosmetically similar to logistic regression and least squares linear regression.
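A one-line sketch of the smoothed output (NumPy assumed):

    import numpy as np

    def sigmoid(a):
        """Continuous, differentiable version of thresholding."""
        return 1.0 / (1.0 + np.exp(-a))

    # With w^T x = 0.4 as in the previous sketch, f = sigmoid(0.4) ≈ 0.60,
    # interpretable as the posterior probability P(C1 | x).
    print(sigmoid(0.4))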


K > 2 Outputs

K perceptrons, each with a weight vector w_k:

f_k = \sum_{j=1}^{d} w_{kj} x_j + w_{k0} = w_k^T x \quad \text{or} \quad f = Wx

where w_{kj} is the weight from input x_j to output f_k, and each row of the K \times (d+1) matrix W is the weight vector of one perceptron.
W performs a linear transformation from a d-dimensional space to a K-dimensional space.


Classification

Classification: choose C_k if f_k = \max_l f_l.

If we need the posterior probabilities as well, we can use softmax to define f_k as:

f_k = \frac{\exp(w_k^T x)}{\sum_{l=1}^{K} \exp(w_l^T x)}
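A sketch of the K-output case (NumPy assumed; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the formula above, and the example weights are hypothetical):

    import numpy as np

    def softmax_perceptrons(W, x):
        """K linear outputs o = Wx followed by softmax; rows of W are the w_k."""
        o = W @ x
        e = np.exp(o - o.max())   # stability trick: shift by max(o)
        return e / e.sum()        # f_k values, summing to 1

    # Hypothetical 3-class example with a 2-D input (bias unit prepended).
    W = np.array([[0.1,  1.0, -1.0],
                  [0.0, -1.0,  1.0],
                  [0.2,  0.5,  0.5]])
    f = softmax_perceptrons(W, np.array([1.0, 0.3, 0.7]))
    print(f, f.argmax())   # posterior estimates and the chosen class index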


Perceptron Learning

Learning modes:
Online learning: instances are seen one by one.
Batch learning: the whole sample is seen all at once.

Advantages of online learning:
No need to store the whole sample.
Can adapt to changes in the sample distribution over time.
Can adapt to physical changes in system components.

The error function is not defined over the whole sample but on individual instances.
Starting from randomly initialized weights, the parameters are adjusted a little at each iteration to reduce the error without forgetting what was learned previously.


Stochastic Gradient Descent

If the error function is differentiable, gradient descent may be applied at each iteration to reduce the error.
Gradient descent for online learning is also known as stochastic gradient descent.
For regression, the error on a single instance (x^{(i)}, y^{(i)}) is

E(w | x^{(i)}, y^{(i)}) = \frac{1}{2} (y^{(i)} - f^{(i)})^2 = \frac{1}{2} (y^{(i)} - w^T x^{(i)})^2

which gives the following online update rule:

\Delta w_j^{(i)} = \eta (y^{(i)} - f^{(i)}) x_j^{(i)}

where \eta is a step size parameter which is decreased gradually over time for convergence.
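A sketch of one such online update (NumPy assumed; eta stands for the step size \eta):

    import numpy as np

    def sgd_regression_step(w, x, y, eta):
        """One stochastic gradient descent step for the squared error above.

        Implements Delta w_j = eta * (y - f) * x_j with f = w^T x.
        """
        f = w @ x
        return w + eta * (y - f) * x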


Binary Classification

Logistic discrimination for a single instance (x^{(i)}, y^{(i)}), where y^{(i)} = 1 if x^{(i)} \in C_1 and y^{(i)} = 0 if x^{(i)} \in C_2, gives the output:

f^{(i)} = sigmoid(w^T x^{(i)})

Likelihood:

L = (f^{(i)})^{y^{(i)}} (1 - f^{(i)})^{1 - y^{(i)}}

Cross-entropy error function:

E(w | x^{(i)}, y^{(i)}) = -\log L = -y^{(i)} \log f^{(i)} - (1 - y^{(i)}) \log(1 - f^{(i)})

Online update rule:

\Delta w_j^{(i)} = \eta (y^{(i)} - f^{(i)}) x_j^{(i)}
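Note that the update rule has exactly the same form as in the regression case; only the output function changes. A sketch (NumPy assumed):

    import numpy as np

    def logistic_step(w, x, y, eta):
        """One online update for logistic discrimination (cross-entropy error)."""
        f = 1.0 / (1.0 + np.exp(-(w @ x)))    # f = sigmoid(w^T x)
        return w + eta * (y - f) * x           # same delta rule as for regression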


K > 2 Classes

Softmax for a single instance (x^{(i)}, y^{(i)}), where y_k^{(i)} = 1 if x^{(i)} \in C_k and 0 otherwise, gives the outputs:

f_k^{(i)} = \frac{\exp(w_k^T x^{(i)})}{\sum_l \exp(w_l^T x^{(i)})}

Likelihood:

L = \prod_k (f_k^{(i)})^{y_k^{(i)}}

Cross-entropy error function:

E(\{w_k\} | x^{(i)}, y^{(i)}) = -\log L = -\sum_k y_k^{(i)} \log f_k^{(i)}

Online update rule:

\Delta w_{kj}^{(i)} = \eta (y_k^{(i)} - f_k^{(i)}) x_j^{(i)}


Perceptron Learning Algorithm

For k = 1, ..., K
    For j = 0, ..., d
        w_{kj} ← rand(-0.01, 0.01)
Repeat
    For all (x^t, y^t) ∈ X in random order
        For k = 1, ..., K
            o_k ← 0
            For j = 0, ..., d
                o_k ← o_k + w_{kj} x_j^t
        For k = 1, ..., K
            f_k ← exp(o_k) / Σ_l exp(o_l)
        For k = 1, ..., K
            For j = 0, ..., d
                w_{kj} ← w_{kj} + η (y_k^t - f_k) x_j^t
Until convergence

Update = Learning Factor × (Desired Output − Actual Output) × Input
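A runnable sketch of this loop (NumPy assumed; X is an N × (d+1) array with the bias unit prepended to each instance, Y is an N × K array of one-hot targets, and a fixed epoch count stands in for the convergence test):

    import numpy as np

    def train_softmax_perceptron(X, Y, eta=0.1, epochs=100, seed=0):
        """Online softmax perceptron training: one weight vector per class."""
        rng = np.random.default_rng(seed)
        N, D = X.shape                               # D = d + 1 (bias included)
        K = Y.shape[1]
        W = rng.uniform(-0.01, 0.01, size=(K, D))    # random initialization
        for _ in range(epochs):                      # "Repeat ... until convergence"
            for t in rng.permutation(N):             # instances in random order
                o = W @ X[t]
                f = np.exp(o - o.max())
                f /= f.sum()                         # softmax outputs f_k
                W += eta * np.outer(Y[t] - f, X[t])  # update = eta * (y - f) * x
        return W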


Online Learning Error of Perceptron

Theorem (Block, 1962; Novikoff, 1962)

Let a sequence of examples (x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)}) be given. Suppose that \|x^{(i)}\| \le M for all i, and further that there exists a unit-length vector u (\|u\|_2 = 1) such that y^{(i)} (u^T x^{(i)}) \ge \gamma for all examples in the sequence (i.e., u^T x^{(i)} \ge \gamma if y^{(i)} = 1 and u^T x^{(i)} \le -\gamma if y^{(i)} = -1, so that u separates the data with a margin of at least \gamma). Then the total number of mistakes that the perceptron algorithm makes on this sequence is at most (M/\gamma)^2.

Given a training example (x, y) with y \in \{-1, 1\}, the perceptron learning rule updates the parameters as follows: if f_\theta(x) = y, it makes no change to the parameters; otherwise, \theta \leftarrow \theta + yx.
Note that the bound on the number of mistakes has no explicit dependence on the number of examples N in the sequence or on the dimension d of the inputs.
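A sketch of this mistake-driven rule with ±1 labels (NumPy assumed; predicting -1 when \theta^T x = 0 is an arbitrary tie-breaking convention here):

    import numpy as np

    def perceptron_mistake_update(theta, x, y):
        """Classic perceptron rule: update only on a mistake (y in {-1, +1})."""
        prediction = 1 if theta @ x > 0 else -1
        if prediction != y:
            theta = theta + y * x    # theta <- theta + y x
        return theta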


Convergence of Perceptron

Perceptron convergence theorem: if an exact solution exists (i.e., if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
The number of steps required to achieve convergence could still be substantial.
For a linearly separable data set, there may be many solutions.
For a linearly nonseparable data set, the perceptron algorithm will never converge.


Illustration of Convergence of Perceptron

[Figure: four snapshots of the perceptron decision boundary during training on a two-dimensional data set; both axes range from -1 to 1.]


Learning Boolean AND

Learning a Boolean function is a two-class classification problem.
The AND function with 2 inputs and 1 output:

x1  x2  y
 0   0  0
 0   1  0
 1   0  0
 1   1  1

[Figure: a perceptron for AND and its geometric interpretation.]


Learning Boolean XOR

A simple perceptron can only learn linearly separable Boolean functions such as AND and OR, but not linearly nonseparable functions such as XOR.
The XOR function with 2 inputs and 1 output:

x1  x2  y
 0   0  0
 0   1  1
 1   0  1
 1   1  0


Learning Boolean XOR (2)

There do not exist w_0, w_1, w_2 that satisfy the following inequalities:

w_0 \le 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 \le 0

(Adding the second and third inequalities gives w_1 + w_2 + w_0 > -w_0 \ge 0, which contradicts the fourth.)

The VC dimension of a line in 2-D is 3. With 2 binary inputs there are 4 cases, so there exist problems with 2 inputs that are not solvable using a line.


Multilayer Perceptrons

A multilayer perceptron (MLP) has a hidden layer between the input and output layers.
An MLP can implement nonlinear discriminants (for classification) and nonlinear regression functions (for regression).
We call this a two-layer network because the input layer performs no computation.


Illustration of the Capability of an MLP

[Figure: MLP fits to four functions: (a) f(x) = x^2; (b) f(x) = sin(x); (c) f(x) = |x|; (d) a step function; N = 50 points sampled uniformly.]
3 hidden units with tanh activation functions, linear output units.


Forward Propagation

Input-to-hidden:

z_h = sigmoid(w_h^T x) = \frac{1}{1 + \exp[-(\sum_{j=1}^{d} w_{hj} x_j + w_{h0})]}

The hidden units must implement a nonlinear function (e.g., sigmoid or hyperbolic tangent), or else the network is equivalent to a simple perceptron.
Sigmoid can be seen as a continuous, differentiable version of thresholding.

Hidden-to-output:

f_k = v_k^T z = \sum_{h=1}^{H} v_{kh} z_h + v_{k0}

Regression: linear output units.
Classification: a sigmoid unit (for K = 2) or K output units with softmax (for K > 2).
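A sketch of one forward pass (NumPy assumed; storing the biases w_{h0} and v_{k0} as extra columns, with bias units prepended to x and z, is one common convention, not the only one):

    import numpy as np

    def mlp_forward(W, V, x):
        """Forward propagation through a one-hidden-layer MLP, linear outputs.

        W: H x (d+1) input-to-hidden weights; V: K x (H+1) hidden-to-output
        weights; x: input with bias unit x0 = 1 prepended.
        """
        z = 1.0 / (1.0 + np.exp(-(W @ x)))    # hidden activations z_h
        zb = np.concatenate(([1.0], z))       # prepend bias unit z0 = 1
        return V @ zb                         # outputs f_k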


Forward Propagation (2)

The hidden units perform a nonlinear transformation from the d-dimensional input space to the H-dimensional space spanned by the hidden units.
In this new H-dimensional space, the output layer implements a linear function.
Multiple hidden layers may be used to implement more complex functions of the inputs, but learning the weights of such deep networks is more complicated.


MLP for XOR

Any Boolean function can be represented as a disjunction of conjunctions, e.g.,

x_1 XOR x_2 = (x_1 AND NOT x_2) OR (NOT x_1 AND x_2)

which can be implemented by an MLP with one hidden layer.
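As a concrete sketch, the hand-picked weights below (one valid choice among many, assumed for illustration) make hidden unit 1 compute x_1 AND NOT x_2, hidden unit 2 compute NOT x_1 AND x_2, and the output unit OR them; the large first-layer weights drive the sigmoids to near-binary values:

    import numpy as np

    W = np.array([[-5.0,  10.0, -10.0],   # hidden unit 1: bias, x1, x2 weights
                  [-5.0, -10.0,  10.0]])  # hidden unit 2
    v = np.array([-0.5, 1.0, 1.0])        # output unit: bias, z1, z2 weights

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x = np.array([1.0, x1, x2])               # bias unit prepended
        z = 1.0 / (1.0 + np.exp(-(W @ x)))        # near-binary hidden activations
        f = v @ np.concatenate(([1.0], z))        # linear output
        print(x1, x2, int(f > 0))                 # reproduces the XOR truth table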


MLP as a Universal Approximator

The result for arbitrary Boolean functions can be extended to the continuous case.
Universal approximation: an MLP with one hidden layer can approximate any nonlinear function of the input, given sufficiently many hidden units.


Backpropagation Learning Algorithm

Extension of the perceptron learning algorithm to multiple layers by backpropagating the error from the outputs to the inputs:

Learning of the hidden-to-output weights: like perceptron learning, treating the hidden units as inputs.
Learning of the input-to-hidden weights: apply the chain rule to calculate the gradient:

\frac{\partial E}{\partial w_{hj}} = \sum_k \frac{\partial E}{\partial f_k} \frac{\partial f_k}{\partial z_h} \frac{\partial z_h}{\partial w_{hj}}

where the sum over k appears because z_h feeds every output unit.


MLP Learning for Nonlinear Regression

Assuming a single output:

f^{(i)} = f(x^{(i)}) = \sum_{h=1}^{H} v_h z_h^{(i)} + v_0

where z_h^{(i)} = sigmoid(w_h^T x^{(i)}).

Error function over the entire sample:

E(W, v | X) = \frac{1}{2} \sum_i (y^{(i)} - f^{(i)})^2

Update rule for the second-layer weights:

\Delta v_h = \eta \sum_i (y^{(i)} - f^{(i)}) z_h^{(i)}


MLP Learning for Nonlinear Regression (2)

Update rule for the first-layer weights:

\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}}
             = -\eta \sum_i \frac{\partial E}{\partial f^{(i)}} \frac{\partial f^{(i)}}{\partial z_h^{(i)}} \frac{\partial z_h^{(i)}}{\partial w_{hj}}
             = \eta \sum_i (y^{(i)} - f^{(i)}) v_h z_h^{(i)} (1 - z_h^{(i)}) x_j^{(i)}

(y^{(i)} - f^{(i)}) v_h acts like the error term for hidden unit h; it is backpropagated from the output to the hidden unit, with the weight v_h reflecting the responsibility of the hidden unit.
Either batch learning or online learning may be carried out.


Example

[Figure: evolution of the regression function and of the training error over time.]


MLP Learning for Nonlinear Multiple-Output Regression

Outputs:

f_k^{(i)} = \sum_{h=1}^{H} v_{kh} z_h^{(i)} + v_{k0}

Error function:

E(W, V | X) = \frac{1}{2} \sum_i \sum_k (y_k^{(i)} - f_k^{(i)})^2

Update rule for the second-layer weights:

\Delta v_{kh} = \eta \sum_i (y_k^{(i)} - f_k^{(i)}) z_h^{(i)}

Update rule for the first-layer weights:

\Delta w_{hj} = \eta \sum_i \left[ \sum_k (y_k^{(i)} - f_k^{(i)}) v_{kh} \right] z_h^{(i)} (1 - z_h^{(i)}) x_j^{(i)}


Algorithm

Initialize all v_{kh} and w_{hj} to rand(-0.01, 0.01)
Repeat
    For all (x^t, y^t) ∈ X in random order
        For h = 1, ..., H
            z_h ← sigmoid(w_h^T x^t)
        For k = 1, ..., K
            f_k ← v_k^T z
        For k = 1, ..., K
            Δv_k ← η (y_k^t - f_k) z
        For h = 1, ..., H
            Δw_h ← η (Σ_k (y_k^t - f_k) v_{kh}) z_h (1 - z_h) x^t
        For k = 1, ..., K
            v_k ← v_k + Δv_k
        For h = 1, ..., H
            w_h ← w_h + Δw_h
Until convergence
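A runnable sketch of this loop for the multiple-output regression case (NumPy assumed; X is N × (d+1) with bias units prepended, Y is N × K, and a fixed epoch count again stands in for the convergence test):

    import numpy as np

    def train_mlp_regression(X, Y, H=3, eta=0.1, epochs=100, seed=0):
        """Online backpropagation for a one-hidden-layer MLP, linear outputs."""
        rng = np.random.default_rng(seed)
        N, D = X.shape                                 # D = d + 1 (bias included)
        K = Y.shape[1]
        W = rng.uniform(-0.01, 0.01, size=(H, D))      # input-to-hidden weights
        V = rng.uniform(-0.01, 0.01, size=(K, H + 1))  # hidden-to-output (+ bias)
        for _ in range(epochs):
            for t in rng.permutation(N):
                z = 1.0 / (1.0 + np.exp(-(W @ X[t])))  # hidden activations z_h
                zb = np.concatenate(([1.0], z))        # prepend bias unit z0 = 1
                err = Y[t] - V @ zb                    # error terms (y_k - f_k)
                dV = eta * np.outer(err, zb)           # second-layer update
                dW = eta * np.outer((err @ V[:, 1:]) * z * (1 - z), X[t])
                V += dV                                # apply both updates only
                W += dW                                # after computing them
        return W, V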


MLP Learning for Nonlinear Multi-Class Discrimination

Outputs:

f_k^{(i)} = f_k(x^{(i)}) = \frac{\exp(o_k^{(i)})}{\sum_l \exp(o_l^{(i)})}

which approximate the posterior probabilities P(C_k | x^{(i)}), where o_k^{(i)} = \sum_{h=1}^{H} v_{kh} z_h^{(i)} + v_{k0}.

Error function:

E(W, V | X) = -\sum_i \sum_k y_k^{(i)} \log f_k^{(i)}

Update rules:

\Delta v_{kh} = \eta \sum_i (y_k^{(i)} - f_k^{(i)}) z_h^{(i)}

\Delta w_{hj} = \eta \sum_i \left[ \sum_k (y_k^{(i)} - f_k^{(i)}) v_{kh} \right] z_h^{(i)} (1 - z_h^{(i)}) x_j^{(i)}
