Perceptrons for Regression & Classification



Neural Networks Assignment 1
Candidate Number: 19214

    Part A (1):

    Illustration of Training Process

    Setup of training process

Parameter             Setting
η (learning rate)     0.5
bias                  1.0
Error function        Perceptron criterion
Weight update rule    Sequential Gradient Descent
Activation function   Heaviside: y = +1 if a > 0, -1 otherwise
Input patterns        class 1 = (1, 1), (0, 1), target = +1
                      class 2 = (0, 0), (1, 0), target = -1
Initial weights       w = (w0, w1, w2) = (0, 0, 0)

I am using the Perceptron criterion as error function because of the information it provides about the current error gradient, and I am using Sequential Gradient Descent because, in a quick experimental evaluation, I found that it converges faster than Batch Gradient Descent (see the discussion in Part A (2) below).
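As a minimal sketch of this setup (not the appendix code; variable names are my own), one epoch of Sequential Gradient Descent with the Perceptron criterion could look as follows in Matlab:

    eta = 0.5; bias = 1.0;
    X = [1 1; 0 1; 0 0; 1 0];              % input patterns, one per row
    t = [1; 1; -1; -1];                    % targets: class 1 = +1, class 2 = -1
    w = [0 0 0];                           % initial weights, w(1) is the bias weight
    for i = 1:size(X, 1)                   % sequential: update after each pattern
        a = w * [bias; X(i, :)'];          % activation with constant bias input
        if a > 0, y = 1; else, y = -1; end % Heaviside hard limit at a > 0
        if y ~= t(i)                       % Perceptron criterion: only misclassified
            w = w + eta * t(i) * [bias, X(i, :)];  % patterns contribute to the update
        end
    end

Run with the patterns in the order listed, this epoch reproduces the weight vector w = (-0.5, 0, 0.5) discussed below.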

After initialisation the weight vector is at the origin; hence, with the given input patterns, (1, 1), (0, 1) and (1, 0) are misclassified after the first training epoch. The input (0, 0) is classified correctly: its activation value is 0 and, as the activation function has a hard limit at > 0, the pattern is therefore classified with a predicted target value of -1.


Table 1: Setup of Training Process

Plot 1: State of the Network after Initialisation


After the first epoch the weight vector is updated to w = (-0.5, 0, 0.5), resulting in a decision boundary of x2 = -w0/w2, which is a horizontal line at x2 = 1.

The next training epoch results in 2 input patterns, (1, 1) and (1, 0), being misclassified and an updated weight vector of w = (-0.5, 0, 1). This shifts the decision boundary down towards the origin, resulting in a horizontal line at x2 = 0.5. With this weight update all points are correctly classified, and training therefore ends after the next epoch, where the weights are left unchanged as the correct decision boundary has already been learnt. The final weight vector therefore is w = (-0.5, 0, 1) and is displayed in Plot 3 below.


Plot 2: State of the Network after Training Epoch 1

Plot 3: State of the Network after Training Epoch 2


    Learnability

The Perceptron is able to learn any linearly separable input; out of the 6 different input combinations, 4 are linearly separable and can be learnt by the Perceptron, while the other 2, representing XNOR and XOR respectively, are not linearly separable and hence cannot be learnt by the Perceptron. Table 2 below gives an overview of which input combinations the Perceptron can learn.

NB: In essence, the 6 different input combinations conform to 3 unique input patterns; that is, when class 1 = (0, 0), (1, 1) and class 2 = (0, 1), (1, 0), then there is an input combination where class 2 = class 1 and class 1 = class 2, which represents the same pattern just with different class adherence, and hence the Perceptron is able to learn 2 out of 3 unique input patterns.

#   Input Patterns                 Target   Learnable   Comment
1   class 1 = (0, 0), (0, 1)       +1       Yes
    class 2 = (1, 0), (1, 1)       -1
2   class 1 = (0, 0), (1, 0)       +1       Yes
    class 2 = (0, 1), (1, 1)       -1
3   class 1 = (0, 0), (1, 1)       +1       No          XNOR
    class 2 = (0, 1), (1, 0)       -1
4   class 1 = (0, 1), (1, 0)       +1       No          XOR
    class 2 = (0, 0), (1, 1)       -1
5   class 1 = (0, 1), (1, 1)       +1       Yes
    class 2 = (0, 0), (1, 0)       -1
6   class 1 = (1, 0), (1, 1)       +1       Yes
    class 2 = (0, 0), (0, 1)       -1


Table 2: Learnability Overview
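A hedged sketch of how the Table 2 entries can be checked: train on one input combination with an epoch cap and report whether a clean pass is ever reached. The cap of 50 epochs and the variable names are my own choices, not the assignment's; row 3 (the XNOR case) is shown.

    X = [0 0; 1 1; 0 1; 1 0]; t = [1; 1; -1; -1];  % row 3 of Table 2
    eta = 0.5; bias = 1.0; w = [0 0 0]; learnt = false;
    for epoch = 1:50                       % cap: XOR/XNOR never converge
        errors = 0;
        for i = 1:size(X, 1)
            a = w * [bias; X(i, :)'];
            if a > 0, y = 1; else, y = -1; end
            if y ~= t(i)
                w = w + eta * t(i) * [bias, X(i, :)];
                errors = errors + 1;
            end
        end
        if errors == 0, learnt = true; break; end  % one clean pass = converged
    end
    fprintf('learnable: %d (epochs: %d)\n', learnt, epoch);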

  • 8/11/2019 Perceptrons for Regression & Classification

    4/20

    Neural Networks Assignment 1Candidate Number: 19214

Epochs until Convergence

For the illustrated training procedure with η = 0.5, bias = 1.0 and an initial weight vector of w = (0, 0, 0), the learning algorithm converged after 2-3 epochs (for the linearly separable problems). Table 3 gives a short overview of the number of epochs until convergence per input pattern for the above-mentioned starting parameters.

Input Patterns                            Epochs until convergence
class 1 = (1, 0), (1, 1), target = +1     2
class 2 = (0, 0), (0, 1), target = -1
class 1 = (0, 0), (0, 1), target = +1     3
class 2 = (1, 0), (1, 1), target = -1
class 1 = (0, 0), (0, 1), target = +1     3
class 2 = (1, 0), (1, 1), target = -1
class 1 = (1, 1), (0, 1), target = +1     3
class 2 = (0, 0), (1, 0), target = -1

In general I found that convergence itself, that is, whether or not the problem is learnable, is independent of the values of η and the bias, and further that the value of η doesn't affect the number of epochs taken until convergence; however, the bias value does have an impact on the number of epochs for the given setup. Plot 4 displays the number of epochs until convergence for varying values of η = {1.5, 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001} and the bias = {-2, -1.5, -1.0, -0.5, 0, 0.5, 1, 1.5, 2}, with an initial weight vector of w = (0, 0, 0).


Table 3: Epochs until Convergence Overview

Plot 4: Error surface for different values of η and the bias
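A sketch of the grid sweep behind Plot 4, given the patterns X and targets t from above; trainEpochs is a hypothetical helper (not a function from the assignment) that would return the number of epochs until convergence, capped for unseparable runs:

    etas   = [1.5 1.0 0.5 0.1 0.05 0.01 0.005 0.001];
    biases = -2:0.5:2;
    epochs = zeros(numel(etas), numel(biases));
    for i = 1:numel(etas)
        for j = 1:numel(biases)
            epochs(i, j) = trainEpochs(X, t, etas(i), biases(j));  % hypothetical helper
        end
    end
    surf(biases, etas, epochs);            % error surface as in Plot 4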


Part A (2):

    Training Process

    Basic Setup of training process

Parameter             Setting
η                     0.05
bias                  1.0
Error function        Perceptron criterion
Weight update rule    Sequential Gradient Descent
Activation function   Heaviside: y = +1 if a > 0, -1 otherwise

I decided to use the Perceptron criterion as error function because it gives me information about the error gradient, which is needed to perform Gradient Descent. I further chose Sequential Gradient Descent in favour of Batch Gradient Descent because I found that Sequential Gradient Descent converged quicker in my experiments: over 100 test runs, Sequential Gradient Descent on average converged after 17 epochs, whereas Batch Gradient Descent took 21 epochs on average. I further decided to stick with a bias value of 1.0 and to choose η = 0.05, which after a few test runs appeared to be a reasonable choice between granularity and speed of convergence.

To Shuffle or not to Shuffle

Before every training epoch I shuffled the whole dataset so as to not overfit the data. In general I found that when shuffling is performed, convergence usually takes longer (over 100 test runs, a network that shuffles its data before every epoch converged in 17 epochs on average, whereas a network without shuffling converged in 13 epochs on average) but often results in more solid-looking decision boundaries. Hence I added shuffling of the input data to my training regime.
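Shuffling before each epoch amounts to one random permutation of the pattern indices; a minimal sketch:

    idx = randperm(size(X, 1));            % new random order each epoch
    X = X(idx, :);                         % shuffle the patterns...
    t = t(idx);                            % ...and their targets consistently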

Weight Initialisation

I found that initialising the weights in specific ways has a significant impact on the number of training epochs the network takes to converge. For the given task we were to sample 10 data points from 2 Gaussian distributions with means μ1 = 0 and μ2 = 4 respectively. Plot 5 shows the 2 distributions with the optimal decision boundary (in a Bayesian sense) at x = 2. Hence I expected that an initial decision boundary at x = 2 would often already be the correct decision boundary for the given data points.


Table 4: Basic Initialisation of the free parameters


This hypothesis turned out to be true: in an experiment with 1000 test runs, initialising the weights so that the decision boundary is a vertical line at x = 2 turned out to be the correct solution in 614 out of 1000 runs. The mean number of epochs until convergence was 15; however, this number is somewhat distorted by the fact that not all of the 1000 problems were linearly separable, in which case the algorithm terminated after 200 iterations. In comparison, random weight initialisation took at least 2 epochs to find a decision boundary, and this happened only 184 out of 1000 times, thus underlining the superiority of my weight initialisation method.
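One possible encoding of this initialisation (my own; the sign may need flipping depending on which class carries target +1, and any positive scaling gives the same line):

    w = [-2 / bias, 1, 0];     % w(1)*bias + w(2)*x1 = 0  <=>  x1 = 2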

However, I found an even better weight initialisation method than setting a vertical line at x = 2: I initialised the weights by using the Minimum Squared Error criterion, which essentially is Gradient Descent with Least Mean Squares as error function. As the gradient is available in closed form for this setup, no Gradient Descent was needed and the Least Mean Squares solution could be obtained directly. The caveat of this method is that it may fail to find a solution even if there is one; hence I only used it as a way of initialising the weights for the network. This initialisation resulted in a solution in 692 out of 1000 test runs, outperforming the vertical-line-at-x = 2 initialisation. The average number of epochs until convergence for this method was ~14.6; however, as already mentioned above, this number is slightly distorted due to the fact that not all problems were linearly separable. Taking the whole setup further, the Minimum Squared Error criterion would have given rise to the Ho-Kashyap procedure, which I started to implement, but lack of time prevented me from finishing an implementation [1].
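A minimal sketch of that initialisation, assuming patterns X (one per row) and targets t as a column vector; pinv gives the closed-form Least Mean Squares solution directly:

    A = [bias * ones(size(X, 1), 1), X];   % design matrix with a bias column
    w = (pinv(A) * t)';                    % minimises ||A*w' - t||^2 in closed form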

Note that, for purposes of better illustrating the network's learning progress, I generally initialised the weights with random values.

[1] Gutierrez-Osuna, Ricardo. L17: Linear Discriminant Functions. Texas: Texas A&M University. Available from: http://research.cs.tamu.edu/prism/lectures/pr/pr_l17.pdf (accessed 13th February 2014)


Plot 5: 2 Gaussian Distributions


Convergence & Decay

I also experimented with a decay factor for the learning rate, where I divided the learning rate by 2 every 20 epochs. This approach introduces an additional advantage as well as an additional disadvantage. The merit is that, in case the learning rate has initially been too large, which could lead to a global minimum being overshot, the decay of the learning rate acts as a regulator to scale η down until a minimum can be reached. The drawback is that it could slow down convergence and cause the algorithm to terminate without having found a solution even if there was one. As I didn't want that to happen, I disabled the decay factor for most experiments.
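The schedule itself is a two-liner inside the epoch loop:

    if mod(epoch, 20) == 0                 % every 20 epochs...
        eta = eta / 2;                     % ...halve the learning rate
    end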

Sigmoidal Activation vs Heaviside Activation

The major difference between the Heaviside function and sigmoids (apart from the fact that the latter can be differentiated whereas the former cannot) is that a sigmoidal activation function such as tanh is continuous, whereas the Heaviside activation function is undefined at activation magnitude = 0. The value returned by a sigmoidal activation function can be interpreted as the level of confidence in the current classification decision, and indeed the logistic activation function represents the probability of a given data point belonging to the given class (P(C|x)). In other words, the value returned by a sigmoidal activation function can also be seen as a distance measure to the current decision boundary: for the tanh activation function, the closer the value is to 0, the closer the current data point is to the decision boundary. A sigmoidal and the Heaviside activation function share the fact that at some point a hard limit needs to be applied in order to get a classification decision. For the Heaviside function as well as for tanh this is at activation magnitude = 0, where the data point needs to be mapped to the target space in some way.

I empirically evaluated the average epochs until convergence for a tanh activation function and the Heaviside activation function and found that, on average, a network trained with the Heaviside activation function converges slightly faster than one trained with tanh: over 100 test runs, a network with tanh converged after 14 epochs on average, whereas a network trained with Heaviside converged after 12.7 epochs on average.

But this is not the only difference; the resulting decision boundaries usually differ as well, as Plots 6 & 7 show, where classification was carried out with the same data points and the same initialisation of the network's free parameters. However, independent of the activation function, only linearly separable problems can be solved.
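A small sketch of the two activations and the shared hard limit (x stands for one input pattern; the names are mine):

    a = w * [bias; x(:)];                  % activation for pattern x
    if a > 0, yh = 1; else, yh = -1; end   % Heaviside: hard decision only
    yt = tanh(a);                          % tanh: graded value in (-1, 1),
                                           % |yt| ~ confidence / boundary distance
    if yt > 0, yc = 1; else, yc = -1; end  % hard limit applied to the tanh output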

    F of 20

  • 8/11/2019 Perceptrons for Regression & Classification

    8/20

    Neural Networks Assignment 1Candidate Number: 19214

Non-linearly separable Data

If no exit criterion after a given number of epochs had been supplied, the algorithm would not terminate for a non-linearly separable problem. As Plots 8-11 show, the network is somewhat desperately trying to find a solution that separates the 2 classes but doesn't find one. These 4 plots represent the network after 9, 10, 11 and 12 training epochs respectively and are based on sampling data from 2 Gaussian distributions with means μ1 = 0 and μ2 = 2. Plot 12 shows the corresponding error rate, which is heavily oscillating as the algorithm tries to fit a decision boundary. In contrast, Plot 13 shows the error rate for a linearly separable problem.


Plot 6: Resulting Decision Boundary with Heaviside Activation Function

Plot 7: Resulting Decision Boundary with tanh Activation Function

Plots 8-12: Change of Decision Boundary in a linearly non-separable problem.


Illustrating the Training Process

Plots 15-21 show the learning process for a linearly separable problem with the 3 previously described weight initialisation methods. Plot 15 represents the decision boundary when the weights are initialised with the Minimum Squared Error criterion. Plots 16 & 17 show the network's learning progress when the weights are initialised by setting the decision boundary at x = 2, Plot 18 shows the decision boundary with randomly initialised weights, and Plots 19-21 show the last 3 training epochs (out of 6 in total) of the network's learning progress.


Plot 13: Error rate for a non-linearly separable problem
Plot 14: Error rate for a linearly separable problem

Plot 15: Network Decision Boundary with Minimum Squared Error criterion weight initialisation.


Plot 16: Network Decision Boundary with x = 2 weight initialisation. Due to an outlier of class 2 at ~(1.8, 0.8), the initialised Decision Boundary is not yet a solution.

Plot 17: Network Decision Boundary with x = 2 weight initialisation. The Network was able to learn a correct Decision Boundary after only 1 training epoch.

Plot 18: Network Decision Boundary with randomly initialised weights before the first training epoch.

Plot 19: Network Decision Boundary with randomly initialised weights after training epoch 4 of 6.

Plot 20: Network Decision Boundary with randomly initialised weights after training epoch 5 of 6.

Plot 21: Network Decision Boundary with randomly initialised weights after training epoch 6 of 6. Decision Boundary successfully learnt.


Part B (1):

    Setup of training process

Parameter            Setting
η                    0.0001
bias                 3 + mean(ε)
Error function       Least Mean Squared Error
Weight update rule   Sequential Gradient Descent
Initial weights      w = (w0, w1) = (1, 0.4)

The task for the network is to find the best-fit line for the given data points. For a regression scenario like the given one, the goal is to predict a target variable y given inputs from 1 … n variables. In essence, for regression there is no need for an activation function, as the activation produced by the network already represents the quantity of interest, although strictly speaking one could argue that the network uses the identity function as its activation function.

Network Initialisation

The quantity ε is drawn from a uniform distribution in the interval [-10, +10] and is added to the intercept term of the function y = 0.4x + 3 + ε; hence, given an infinite amount of data points for the function, I would expect the mean of ε to converge to 0. As there is only 1 input parameter, the regression line will be a straight line of the form y = kx + d, with the gradient k being close to the gradient of the original function, so 0.4.

I chose to initialise the weight vector as w = (w0, w1) = (1, 0.4) and to use a bias value of 3 + mean(ε). With an initialisation close to the underlying real function, I was hoping to reduce the number of training epochs required.

As for classification, I also shuffled the data before each training epoch for the regression task and found that network convergence took a lot longer: over 100 test runs, a network that shuffled the data before every epoch required 211 epochs on average for convergence, whereas the average convergence without shuffling was 2 epochs. On the other hand, a network that used shuffling produced slightly better lines in terms of the average squared error: again over 100 test runs, the mean average squared error for a network with shuffling was 16.02, whereas for a network without shuffling the error was 17.2.

I used Sequential Gradient Descent in conjunction with Least Mean Squares as my error function. I chose Sequential Gradient Descent in favour of Batch Gradient Descent because, in my experiments on linear regression, Sequential Gradient Descent generally converged faster and resulted in a smaller error and therefore a better regression line. For 100 test runs, Sequential Gradient Descent converged after 210 epochs on average for the given setup, whereas Batch Gradient Descent converged only after 350 epochs on average. The average of the mean squared error over 100 test runs for Sequential Gradient Descent was 17.7918, whereas the mean squared error for Batch Gradient Descent was 19.2267.


Table 5: Basic Initialisation of the free parameters


I chose Least Mean Squares as my error function because it is simple to implement and, for the given setup and in conjunction with Gradient Descent, is guaranteed to converge to a global minimum (as long as the other parameters, i.e. the learning rate, are set accordingly).
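A single sequential Least Mean Squares update for the regression case (identity activation; the variable names are mine, with x and t one data point and its target):

    yhat = w * [bias; x];                  % network output = activation
    err  = t - yhat;                       % error for this data point
    w    = w + eta * err * [bias, x];      % gradient step on the squared error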

Convergence

For testing whether the algorithm has converged, I compared the current error to the error of the previous training epoch (the previous error). If the difference between the current error and the previous error is below a predefined threshold, the algorithm stops; I most commonly used 0.0001 or 0.00001 for the threshold. The second termination criterion was when a predefined number of epochs has been reached; I most commonly used values between 100 and 500, which is quite low but was sufficient for the given tasks.
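A sketch of the two termination criteria; trainOneEpoch is a hypothetical helper standing in for one pass over the data that returns the epoch error:

    theta = 0.0001; maxEpochs = 500; prevError = Inf;
    for epoch = 1:maxEpochs                % criterion 2: epoch cap
        currError = trainOneEpoch();       % hypothetical helper
        if abs(prevError - currError) < theta
            break;                         % criterion 1: error change below threshold
        end
        prevError = currError;
    end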

Of the Virtues of Preprocessing

When running the network to find a best-fit line, I found it makes a huge difference whether or not the data have been properly preprocessed; in the following paragraphs I will therefore frequently compare applying preprocessing to not applying any preprocessing.

All my preprocessing consisted of normalising the input and target values and, after having found the best-fit line, converting the data back to its original space (see Formulas 1 & 2).
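Formulas 1 & 2 themselves did not survive this transcript; a standard scheme consistent with the description (normalise inputs and targets, train, then convert the learnt line y = kx + d back) is z-score normalisation, sketched here as an assumption rather than the author's exact formulas:

    xm = mean(x); xs = std(x);             % Formula 1 (assumed): z-score
    ym = mean(y); ys = std(y);             % normalisation of inputs and targets
    xn = (x - xm) / xs;
    yn = (y - ym) / ys;
    % ... train on (xn, yn), yielding a line yn = k*xn + d ...
    k0 = k * ys / xs;                      % Formula 2 (assumed): convert the
    d0 = ym + ys * d - k0 * xm;            % learnt line back to the original space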

On Bias, Weights and Training Error

For the training runs where I didn't preprocess the data, finding a solution was hugely dependent on the value of the learning rate, which needed to be very small (η = 0.00001) in order for the network to converge and produce a good-fit line; however, the training error didn't constantly decrease as I would have expected but was heavily oscillating (see Plot 22). The weights for the network converged towards w1 ≈ 0.4 and w0 ≈ 1; the actual results for one test run were w1 = 0.4003 and w0 = 1.1599 respectively.


Formula 1: Data Normalisation
Formula 2: Postprocessing (converting the data back to its original space)


For the training runs where I normalised the data, the network was less dependent on specific values for η (I usually had η in the interval [0.0001, 0.001]). With the data normalised, the training error was now decreasing towards 0, which is illustrated in Plot 23. The weights for a network with normalised data converged towards w1 ≈ 1 and w0 ≈ 0 (the exact figures were w1 = 0.9139 and w0 = 0.0009), which makes sense as the normalised function has a gradient of k = 1 and an intercept of 0. Hence I would conclude that the weights for a regression task converge to the gradient and the intercept of the given function. Formula 2 from above had to be applied to use the weights learnt from a normalised model together with the original data.


Plot 22: Error rate for raw input data (no normalisation or other preprocessing carried out).

Plot 23: Error rate for normalised input data.


Turning up the noise

Increasing the noise results in data points where it is harder to recognise a straight line as the underlying function. By increasing the random fluctuations, the resulting regression line becomes more horizontal, which means that the general trend of the data, represented by the gradient of the underlying function, can no longer be reliably estimated. For example, the learnt weight for the gradient with ε drawn from [-50, +50] is no longer close to 1 but only 0.61, resulting in a gradient of ~0.25 for the regression line (see Plots 24 & 25).

Modifying the underlying function

I changed the function to y = 1.2x + 2 + ε and initialised the weights as w0 = 1, with a bias value of 2 + mean(ε), and w1 = 1.2. The resulting weights of the network again converged close to the gradient of the underlying function (w1 ≈ 1.2, the exact figure being w1 = 1.1707) and the intercept (w0 ≈ 1, the exact figure being w0 = 0.8448). For a network trained on normalised data, the weights converged towards w1 ≈ 1 and w0 ≈ 0 respectively (the exact figures being w1 = 0.9254 and w0 = 0.00002).


Plot 24: ε = [-10, 10]; the Network is still able to capture the general trend of the data well, with the learnt weights converging towards w1 ≈ 0.4 and w0 ≈ 1.

Plot 25: ε = [-50, 50]; the random fluctuations significantly distort the underlying function, resulting in the regression line being more horizontal and ending with a gradient quite different (~0.25) from the gradient of the original function (0.4).


A Note on the Closed Form Regression Line

For the given problem it would be possible to calculate the best-fit regression line in closed form instead of using an iterative process. As would be expected, the resulting closed-form regression line was always a better fit in terms of minimising least mean squared error than the iterative approaches. However, a little surprisingly, over 1000 test runs with η = 0.00001 for the iterative process, the difference in the mean of the average squared errors was quite small: the mean of the average squared errors for the closed-form approach was 16.3431 and the mean for the iterative approach was 16.3466. Out of interest, I increased the value of η to η = 0.001 and observed the mean of the average squared error over 1000 test runs again, resulting in a closed-form error of 16.3606 and an iterative error of 16.4676.
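The closed-form line referred to here is standard least squares via the normal equations; with data vectors x and y, e.g.:

    A = [ones(numel(x), 1), x(:)];         % design matrix: intercept and gradient columns
    w = A \ y(:);                          % w(1) = intercept, w(2) = gradient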

Illustration of what the Network is doing

Plot 26 shows the data points, the underlying original function, the closed-form regression line and the regression line retrieved from the network. Plot 27 shows the data points just with the regression line retrieved from the network. To further illustrate the learning process, Plots 28-32 show the training progress when the network weights have been randomly selected (to better illustrate learning progress) and Plot 33 shows the corresponding error rate. Note that the algorithm converged after 11 epochs and the plots show the line-fitting progress after epochs 0, 2, 6 and 11 respectively. The plots in between have been omitted for space reasons.


Plot 26: All-in-one plot containing the original function, the regression line learnt by the Network and the regression line obtained in closed form.

Plot 27: Plot just containing the regression line learnt by the Network.


Plot 28: State of the Network before the start of learning.
Plot 29: Regression line after the first training epoch.
Plot 30: Regression line after the 2nd training epoch.
Plot 31: Regression line after the 6th training epoch.
Plot 32: Regression line with converged Network parameters after 11 training epochs.
Plot 33: Error rate of the Network.


Part B (2)

Introductory Notes

For this task the dimensionality of the input space is increased to 2. The inputs are independent of each other. Performing regression for this function results in a regression plane.

After having appreciated the benefits of preprocessing in the previous part, I only experimented with normalised data for this section. Normalisation was performed the same way as in Part B (1) (see Formulas 1 & 2).

For weight initialisation I followed my previous approach of initialising the weight vector to be 1 for the bias weight and the gradients of the respective terms of the function otherwise; so for the given function y = 0.4x1 + 1.4x2 + 2 + ε, the initial weight vector was w = (w0, w1, w2) = (1, 0.4, 1.4). I also initialised the bias to 2 + mean(ε) as previously. I further used the same error function and weight update rule and also performed shuffling before each training epoch. Table 6 summarises the basic setup for this task.

I also used Sequential Gradient Descent in conjunction with Least Mean Squares as error function for learning the network parameters, for the same reasons as stated in the previous section.

    Setup of training process

Parameter            Setting
η                    0.0001
bias                 2 + mean(ε)
Error function       Least Mean Squared Error
Weight update rule   Sequential Gradient Descent
Initial weights      w = (w0, w1, w2) = (1, 0.4, 1.4)

On the weight and bias value

As I expected, w1 and w2 converged towards 0.4 and 1.4 respectively; however, this time the bias weight was a lot more volatile, converging towards 1 in some experiments and towards 0 in others. This was a bit surprising at first, but when I started printing the mean value of the random fluctuations, mean(ε), alongside the weights, I found that the closer mean(ε) was to 0, the closer w0 was to 1 (e.g. mean(ε) = 0.159, w0 = 0.999), and the further mean(ε) was away from 0, the closer w0 converged towards 0 (e.g. mean(ε) = 10.7259, w0 = 0.0019). In both cases the resulting value for the intercept would be in the interval [-1, +1].

The variance in the bias led me to run some more experiments, and I found that w1 and w2 actually don't converge towards 0.4 and 1.4 at all! The key was varying the value of η, whereupon the learnt weights changed quite significantly. At η = 0.01 the values were as reported above; when decreasing the value to η = 0.00001, w1 converged towards ~0.2, so


Table 6: Basic Initialisation of the free parameters


half the gradient value, and w2 converged towards ~0.7, also roughly half the gradient value. This behaviour seems to be confirmed by the values obtained from a closed-form solution. Also, interestingly, the value for w0 always converged towards 0 in the closed-form approach. Table 7 summarises the findings of the previous 2 paragraphs.

Experiment   Weight   Closed Form   Network (η = 0.01)   Network (η = 0.00001)   mean(ε)
1            w0       0             0.9958               0.1091                  0.4706
             w1       0.2811        0.3977               0.2811                  0.4706
             w2       0.7357        1.387                0.7376                  0.4706
2            w0       0             0.0052               0.0016                  7.4016
             w1       0.2838        0.3906               0.2976                  7.4016
             w2       0.7613        1.3406               0.8366                  7.4016
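The Closed Form column above can be reproduced with the normal equations on the two-input design matrix (a sketch; x1, x2 and y stand for the sampled data as column vectors):

    A = [ones(numel(x1), 1), x1(:), x2(:)];  % columns: bias, x1, x2
    w = A \ y(:);                            % w(1) = w0, w(2) = w1, w(3) = w2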

Modifying the Function

By changing the function to y = 1.2x1 + 0.6x2 + 0 + ε, I could validate my assumption that the weights converge roughly towards gradient/2 and definitely not towards the gradient value, as Table 8 below shows.

Experiment   Weight   Closed Form   Network (η = 0.01)   Network (η = 0.00001)   mean(ε)
1            w0       0             0.999                0.6008                  0.2257
             w1       0.6993        1.1902               0.6994                  0.2257
             w2       0.4315        0.5967               0.4315                  0.2257
2            w0       0             0.0026               0.001                   11.1796
             w1       0.7145        1.1781               0.7752                  11.1796
             w2       0.2505        0.5794               0.2948                  11.1796


Table 7: Interpretation of learnt weights.


What the Network is doing

Plots 34-37 show that the network is trying to find the best-fit plane for the given data.


Plot 34: Displaying the best-fit regression plane learnt by the Network for the given data points.
Plot 35: Displaying the regression plane obtained in closed form.
Plot 36: Displaying the regression plane obtained from the Network as well as the plane obtained in closed form.
Plot 37: Error rate of the Network.


Appendix

The Matlab code for this assignment is contained on the following pages.
