1 hassoun chap3 perceptron


TRANSCRIPT


Fundamentals of Artificial Neural Networks

Mohamad H. Hassoun


3 Learning Rules

One of the most significant attributes of a neural network is its ability to learn by interacting with its environment or with an information source. Learning in a neural network is normally accomplished through an adaptive procedure, known as a learning rule or algorithm, whereby the weights of the network are incrementally adjusted so as to improve a predefined performance measure over time.

In the context of artificial neural networks, the process of learning is best viewed as an optimization process. More precisely, the learning process can be viewed as a search in a multidimensional parameter (weight) space for a solution which gradually optimizes a prespecified objective (criterion) function. This view is adopted in this chapter, and it allows us to unify a wide range of existing learning rules which otherwise would have looked more like a diverse variety of learning procedures.

This chapter presents a number of basic learning rules for supervised, reinforcement, and unsupervised learning tasks. In supervised learning (also known as learning with a teacher or associative learning), each input pattern/signal received from the environment is associated with a specific desired target pattern. Usually, the weights are synthesized…



It is established that all of these learning rules can be systematically derived as minimizers of an appropriate criterion function.

3.1.1 Error Correction Rules

Error-correction rules were proposed initially as ad hoc rules for single-unit training. These rules essentially drive the output error of a given unit to zero. This section starts with the classic perceptron learning rule and gives a proof of its convergence. Other error-correction rules, such as Mays' rule and the α-LMS rule, are covered as well. Throughout this section an attempt is made to point out the criterion functions that are minimized by using each rule. These learning rules will also be cast as relaxation rules.


    3.1 Supervised Learning in a Single-Unit Setting

Consider a set of m training pairs {x^k, d^k}, where d^k (k = 1, 2, ..., m) is the desired target for the kth input vector x^k (usually the order of the training pairs is random). The entire collection of these pairs is called the training set.

The goal, then, is to design a perceptron such that for each input vector x^k of the training set, the perceptron output y^k matches the desired target d^k; that is, we require y^k = sgn(w^T x^k) = d^k for each k = 1, 2, ..., m. In this case we say that the perceptron correctly classifies the training set. Of course, designing an appropriate perceptron that correctly classifies the training set amounts to determining a weight vector w* such that the following relations are satisfied:

\[
(x^k)^T w^* > 0 \quad \text{if } d^k = +1, \qquad
(x^k)^T w^* < 0 \quad \text{if } d^k = -1
\]

Recall that the set of all x which satisfy x^T w = 0 defines a hyperplane in R^{n+1}.
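To make the classification requirement concrete, the following is a minimal Python sketch (not from the text) that checks the condition sgn(w^T x^k) = d^k over a training set. The function name classifies_all, the use of augmented input vectors with a trailing bias component of 1, and the toy AND-like data are illustrative assumptions.

```python
import numpy as np

def classifies_all(w, X, d):
    """Return True if sgn(w^T x^k) = d^k for every training pair (x^k, d^k).

    X is an m-by-(n+1) array of augmented input vectors (last component = 1
    for the bias); d is a length-m vector of bipolar targets in {-1, +1}.
    """
    # Equivalent to requiring d^k (x^k)^T w > 0 for all k.
    return bool(np.all(d * (X @ w) > 0))

# Toy example: AND-like data with an augmented bias component.
X = np.array([[0., 0., 1.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.]])
d = np.array([-1., -1., -1., +1.])
w = np.array([1., 1., -1.5])      # a weight vector that separates this data
print(classifies_all(w, X, d))    # True
```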



Notice that for ρ = 0.5, the perceptron learning rule can be written as

\[
w^1 \ \text{arbitrary}; \qquad
w^{k+1} =
\begin{cases}
w^k + z^k & \text{if } (z^k)^T w^k \le 0 \\
w^k & \text{otherwise}
\end{cases}
\tag{3.1.3}
\]

where

\[
z^k =
\begin{cases}
x^k & \text{if } d^k = +1 \\
-x^k & \text{if } d^k = -1
\end{cases}
\tag{3.1.4}
\]

That is, a correction is made if and only if a misclassification, indicated by

\[
(z^k)^T w^k \le 0
\tag{3.1.5}
\]

occurs. The addition of vector z^k to w^k in Equation (3.1.3) moves the weight vector directly toward, and perhaps across, the hyperplane (z^k)^T w = 0. The new inner product (z^k)^T w^{k+1} is larger than (z^k)^T w^k by the amount ||z^k||^2, and the correction Δw^k = w^{k+1} - w^k is clearly moving w^k in a good direction, the direction of increasing (z^k)^T w.
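As a concrete illustration, here is a minimal Python sketch of this rule applied by cycling through the training set. The function name, the zero initial weight vector, and the epoch cap are illustrative choices, not from the text, which only requires w^1 to be arbitrary and guarantees finite convergence for linearly separable data.

```python
import numpy as np

def perceptron_train(X, d, max_epochs=100, w_init=None):
    """Perceptron learning rule in the form of Eqs. (3.1.3)-(3.1.5).

    X: m-by-(n+1) array of augmented input vectors; d: bipolar targets.
    A correction w <- w + z^k is made only when (z^k)^T w <= 0.
    """
    Z = X * d[:, None]             # z^k = x^k if d^k = +1, -x^k if d^k = -1 (Eq. 3.1.4)
    w = np.zeros(X.shape[1]) if w_init is None else w_init.astype(float)
    for _ in range(max_epochs):
        corrected = False
        for z in Z:
            if z @ w <= 0:         # misclassification test, Eq. (3.1.5)
                w = w + z          # correction step, Eq. (3.1.3)
                corrected = True
        if not corrected:          # every pattern satisfies (z^k)^T w > 0
            break
    return w
```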



This sensitivity is responsible for the varying quality of the perceptron-generated separating surface observed in simulations.

The bound on the number of corrections k_0 given by Equation (3.1.14) depends on the choice of the initial weight vector w^1. If w^1 = 0, we get

\[
k_0 = \frac{\max_i \lVert z^i \rVert^2 \, \lVert w^* \rVert^2}{\bigl[\min_i \, (z^i)^T w^*\bigr]^2}
\tag{3.1.15}
\]

Here, k_0 is a function of the initially unknown solution weight vector w*. Therefore, Equation (3.1.15) is of no help for predicting the maximum number of corrections. However, the denominator of Equation (3.1.15) implies that the difficulty of the problem is essentially determined by the samples most nearly orthogonal to the solution vector.
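For intuition only, the bound in Equation (3.1.15) can be evaluated numerically once some separating vector is known. The sketch below (the helper name and the reuse of the earlier toy data are assumptions) simply plugs a given w* into the formula, keeping in mind that in practice w* is precisely what is unknown.

```python
import numpy as np

def correction_bound(X, d, w_star):
    """Evaluate the bound of Eq. (3.1.15) for a known separating vector w_star."""
    Z = X * d[:, None]                   # normalized patterns z^i = d^i x^i
    margins = Z @ w_star                 # (z^i)^T w*, all positive if w_star separates the data
    assert np.all(margins > 0), "w_star must correctly classify every pattern"
    return np.max(np.sum(Z ** 2, axis=1)) * np.dot(w_star, w_star) / np.min(margins) ** 2

# With the AND-like data and w = [1, 1, -1.5] from the earlier sketch:
# max_i ||z^i||^2 = 3, ||w*||^2 = 4.25, min_i (z^i)^T w* = 0.5,
# so the bound is 3 * 4.25 / 0.25 = 51 corrections.
```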

Generalizations of the Perceptron Learning Rule  The perceptron learning rule may be generalized to include a variable increment ρ^k and a fixed positive margin b. The generalized learning rule updates the weight vector whenever (z^k)^T w^k fails to exceed the margin b. Here, the algorithm for the weight vector update is given by

\[
w^1 \ \text{arbitrary}; \qquad
w^{k+1} =
\begin{cases}
w^k + \rho^k z^k & \text{if } (z^k)^T w^k \le b \\
w^k & \text{otherwise}
\end{cases}
\tag{3.1.16}
\]



If ρ^k satisfies suitable conditions (e.g., ρ^k = ρ/k or even ρ^k = ρk) and the training set is linearly separable, then w converges to a solution w* that satisfies (z^i)^T w* > b for i = 1, 2, ..., m. Furthermore, when ρ is fixed at a positive constant, this learning rule converges in finite time.
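A minimal Python sketch of this margin variant follows, under the assumption that the update of Equation (3.1.16) adds ρ^k z^k whenever (z^k)^T w^k fails to exceed b. The constant increment, zero initialization, and epoch cap are illustrative choices, not from the text.

```python
import numpy as np

def perceptron_margin_train(X, d, b=1.0, rho=1.0, max_epochs=100):
    """Variable-increment perceptron with a fixed positive margin b.

    The weight vector is corrected whenever (z^k)^T w fails to exceed the
    margin b. Here rho is kept constant; a schedule such as rho/k could
    be substituted.
    """
    Z = X * d[:, None]
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for z in Z:
            if z @ w <= b:            # margin not exceeded
                w = w + rho * z       # correction with increment rho
                updated = True
        if not updated:               # all patterns satisfy (z^i)^T w > b
            break
    return w
```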

Another variant of the perceptron learning rule is given by the batch update procedure

\[
w^1 \ \text{arbitrary}; \qquad
w^{k+1} = w^k + \rho \sum_{z \in Z(w^k)} z
\]

where Z(w^k) is the set of patterns z misclassified by w^k. Here, the weight vector change Δw^k = w^{k+1} - w^k is along the direction of the resultant vector of all misclassified patterns. In general, this update procedure converges faster than the perceptron rule, but it requires more storage.
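A minimal sketch of the batch procedure, assuming the same augmented, bipolar-target data representation as in the earlier examples (the function name and stopping test are illustrative):

```python
import numpy as np

def perceptron_batch_train(X, d, rho=1.0, max_epochs=100):
    """Batch update sketch: add rho times the resultant of all patterns
    currently misclassified by w (the set Z(w^k)) to the weight vector."""
    Z = X * d[:, None]
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = Z[Z @ w <= 0]         # patterns z with z^T w <= 0
        if misclassified.size == 0:           # nothing misclassified: done
            break
        w = w + rho * misclassified.sum(axis=0)
    return w
```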

In the nonlinearly separable case, the preceding algorithms do not converge. Few theoretical results are available on the behavior of these algorithms for nonlinearly separable problems [see Minsky and Papert (1969) and Block and Levin (1970)].



Given this objective function J(w), the search point w^k can be incrementally improved at each iteration by sliding downhill on the surface defined by J(w) in w space. Specifically, we may use J to perform a discrete gradient-descent search that updates w^k so that a step is taken downhill in the steepest direction along the search surface J(w) at w^k. This can be achieved by making Δw^k proportional to the gradient of J at the present location w^k; formally, we may write

\[
w^{k+1} = w^k - \rho \, \nabla J(w)\big|_{w = w^k}
        = w^k - \rho \left[\frac{\partial J}{\partial w_1} \;\; \frac{\partial J}{\partial w_2} \;\; \cdots \;\; \frac{\partial J}{\partial w_{n+1}}\right]^T_{w = w^k}
\tag{3.1.21}
\]

Here, the initial search point w^1 and the learning rate (step size) ρ are to be specified by the user. Equation (3.1.21) can be called the steepest gradient-descent search rule or, simply, gradient descent. Next, substituting the gradient

\[
\nabla J(w)\big|_{w = w^k} = -\sum_{z \in Z(w^k)} z
\tag{3.1.22}
\]

into Equation (3.1.21) leads to the weight update rule.
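For completeness, carrying out this substitution (a reconstruction of the step being described, not the book's own typesetting) gives

\[
w^{k+1} = w^k + \rho \sum_{z \in Z(w^k)} z ,
\]

which is precisely the batch update procedure introduced earlier; presumably this is the rule referred to below as Equation (3.1.23).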



By following a similar gradient-descent procedure as in Equations (3.1.21) through (3.1.23), it can be shown that

\[
J(w) = -\sum_{z^T w \,\le\, b} \bigl(z^T w - b\bigr)
\]

is the appropriate criterion function for the modified perceptron rule in Equation (3.1.16).
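To connect this criterion back to the learning rule (a routine reconstruction of the gradient step, not text from the book): wherever the gradient is defined,

\[
\nabla J(w) = -\sum_{z^T w \,\le\, b} z ,
\]

so a gradient-descent step w^{k+1} = w^k - ρ ∇J(w^k) adds ρ times every pattern that currently fails to exceed the margin; applying the same correction one pattern at a time recovers the modified rule of Equation (3.1.16).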

Before moving on, it should be noted that the gradient of J in Equation (3.1.22) is not mathematically precise. Owing to the piecewise-linear nature of J, sudden changes in the gradient of J occur every time the perceptron output goes through a transition at (z^k)^T w = 0. Therefore, the gradient of J is not defined at transition points satisfying (z^k)^T w = 0, k = 1, 2, ..., m. However, because of the discrete nature of Equation (3.1.21), the likelihood of w^k overlapping with one of these transition points is negligible, and thus we may still express ∇J as in Equation (3.1.22). The reader is referred to Problem 3.1.3 for further exploration into gradient descent on the perceptron criterion function.