Machine Learning
Husnain Inayat Hussain

My Notes for my Machine Learning Class


List of Figures
1.1 Geometric Interpretation of Perceptron Learning
1.2 (a) When the Inputs are not Scaled. (b) When the Inputs are Scaled.
1.3 Gradient Descent in Action
1.4 (a) Data is Noisy. (b) Target is too Complex.
1.5 (a) Bias. (b) Variance.
1.6 The Ellipse as the Error and the Circle as Regularizer.
2.1 Two Data Points are Separated by Margin r.
2.2 Three Different Classifiers with Different Margins.

List of Tables
1.1 A Toy Example Classification Data.
1.2 Modified Classification Data with Bias Included.
1.3 (a) Badly Scaled Input Data. (b) Standardized Input Data.

Contents
1 Introduction
  1.1 Perceptron Learning
    1.1.1 Linear Algebraic Notation
    1.1.2 Perceptron Hypothesis
    1.1.3 Perceptron Learning Rule
  1.2 Input Scaling
  1.3 Stochastic Gradient Descent
  1.4 Noise
    1.4.1 Bias-Variance Analysis
    1.4.2 Regularization
  1.5 Conclusion
2 Supervised Learning Techniques
  2.1 Support Vector Machines
    2.1.1 Maximum Margin Classifier
    2.1.2 Quantifying the Margin

"Nations are born in the hearts of poets, they prosper and die in the hands of politicians."
Allama Mohammad Iqbal (1877-1938)

1 Introduction

This is a short guide whose aim is to provide you with the material asked in the third hourly. All the questions that you will see in the paper are covered here. The answers will not be explicit, but you should be able to respond to most of them if you attended the lectures and read this guide too.

1.1 Perceptron Learning

Perceptrons are by far the easiest learning algorithm to understand and the most comfortable to implement. As with most of the learning algorithms in this class, the first step is to analyze the data which needs to be classified (remember that perceptrons are used for classification). To understand how perceptrons work we will consider the toy example shown in Table (1.1). The data can be divided into two separate categories. One is the first two columns of the table. This is generally termed the raw input and is denoted by the letter x (the boldface letter x if x is multidimensional). The second category is the third column, y. If we just think about the third column without any explicit information, we could arrive at the following conclusion: y appears to be the result of the first two x values. If we can figure out the relationship between the input and the output, then our job is done.

In more technical jargon, our task is to find the mapping between the input and the output. This mapping, which could also be called a function, takes in any general input and produces an output which is close to the original output. So we are required to learn a function (hypothesis) given the output label y.
This kind of learning is called supervised learning. But the question is how we are going to learn the hypothesis from the data. To answer this question, let us consider the idea of a signal. OK, so we were already wondering about getting the function, and now a new term, signal, has been introduced. What are we going to do about it? One thing is certain: the signal has got something to do with the data. More precisely, the signal is connected with the input data. Let us look at Eq. (1.1) to see what the signal looks like for the first data point.

x1    x2    y
10    60    no
16    80    no
10    77    no
22    100   no
25    95    no
50    170   yes
45    180   yes
80    130   yes
55    150   yes
20    110   no
32    130   yes
100   190   ?

Table 1.1: A Toy Example Classification Data

s(x) = w_0 x_0^{(1)} + w_1 x_1^{(1)} + w_2 x_2^{(1)}    (1.1)

In Eq. (1.1), s(x) denotes the signal as a function of x, x_j^{(i)} is the input where the subscript j denotes the coordinate (dimension) and the bracketed superscript (i) denotes the example number. The w's are the parameters of the signal. Note, however, that the original data does not contain the coordinate x_0. Where has this coordinate come from and what is its purpose? The answer is that it is just a mathematical artifact (a technique) which helps solve our problem nicely. It is the threshold of the signal, which is its value at x = 0. This is also the reason why it is termed the bias. In order to include the bias in the problem, the data is slightly modified, as shown in Table (1.2): a column of 1s is added to act as the bias of the problem.

Let us recap what has been done so far:

- a dataset is given
- the rows of the data are identified as different instances or examples
- the data is categorized into raw inputs and output
- a signal is formed out of the first example of the dataset
- the signal is formed using a certain weight vector w and the first input example x^{(1)}

Let us shed more light on what the signal is. The word has its origin in electrical engineering. But we are not studying any electrical engineering here; we are only borrowing the term to make things simple for us. The signal is nothing but a linear combination of the weight vector w and the input vector x. A more compact way to represent the signal is shown in Eq. (1.2).

s(x) = w^T x    (1.2)

x0   x1    x2    y
1    10    60    no
1    16    80    no
1    10    77    no
1    22    100   no
1    25    95    no
1    50    170   yes
1    45    180   yes
1    80    130   yes
1    55    150   yes
1    20    110   no
1    32    130   yes
1    100   190   ?

Table 1.2: Modified Classification Data with Bias Included

1.1.1 Linear Algebraic Notation

Linear algebra has the necessary tools to make our job easy. Eq. (1.2) can be expanded in terms of the exact vectors to understand a little more clearly how the signal on the i-th example is formed. This is shown in Eq. (1.3).

s(x) = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}    (1.3)
So far we have been forming signals out of one example only. The problem is that we have an entire dataset, and there must be some way to get signals from all the examples in the dataset. One way to do this is to use matrix multiplication. First let us consider a data matrix, which is formed as follows:

X = \begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\
x_0^{(3)} & x_1^{(3)} & x_2^{(3)} & \dots & x_n^{(3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)}
\end{bmatrix}    (1.4)

Eq. (1.4) shows the data matrix X, whose dimensions are m x (n+1). Here, m denotes the number of examples and n denotes the dimension of the data. Note that although n is the dimension of the original input data, after adding the bias the dimension becomes n+1. But since the subscript for the bias dimension is simply 0, the notation for the last dimension is still n; counting the dimensions from 0, there are n+1 of them. Now that we know how to form the signal from one data point, we can form signals from all data points. Using the data matrix of Eq. (1.4), we can write the following relationship:

s(X) = \begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & \dots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & \dots & x_n^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & \dots & x_n^{(m)}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}    (1.5)

The result of the matrix multiplication in Eq. (1.5) is a vector of signals whose dimension is m x 1. In other words, s is a column vector with m components, where each component corresponds to one example in the dataset. In more compact notation, Eq. (1.5) is written as:

s = Xw    (1.6)
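To make Eqs. (1.4) to (1.6) concrete, the short Octave/Matlab sketch below builds the data matrix from the labelled rows of Table (1.1), prepends the column of 1s for the bias, and computes all the signals at once. The weight vector used here is an arbitrary illustrative choice, not one learned from the data.

% Raw inputs from Table 1.1 (the labelled rows only), one example per row
Xraw = [ 10  60;  16  80;  10  77;  22 100;  25  95;
         50 170;  45 180;  80 130;  55 150;  20 110;  32 130 ];

m = size(Xraw, 1);          % number of examples
X = [ones(m, 1), Xraw];     % prepend the bias column of 1s, as in Eq. (1.4)

w = [-60; 0.5; 0.3];        % an arbitrary weight vector, for illustration only

s = X * w;                  % all m signals at once, Eq. (1.6)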
1.1.2 Perceptron Hypothesis

At this stage we have successfully extracted all the signals from the initial dataset. The big question is how the perceptron classifies each of these signals into the two classes, yes and no. The perceptron uses the following rule to classify the signals:

h_w(x) = \begin{cases} +1 & \text{if } s(x) \geq 0 \\ -1 & \text{if } s(x) < 0 \end{cases}    (1.7)

In Eq. (1.7), h_w(x) denotes the hypothesis and is read as: the hypothesis as a function of x, parameterized by w. The hypothesis is pretty simple. It outputs a binary value: the result is either +1 or -1. We could think of +1 as the output of the hypothesis when y = yes and -1 when y = no.

1.1.3 Perceptron Learning Rule

Now that we have the perceptron hypothesis, how do we know that the initially selected weight vector produces the correct output? To do this, we use the perceptron learning rule. The rule says the following:

- pick an example at random
- obtain the value of the hypothesis from an initially selected set of weights
- compare the hypothesis response with the true label y
- in case of a mismatch, multiply the input vector by the correct label (+1 or -1)
- update by adding this multiplied input to the weight vector

The geometric interpretation of this rule is shown in Fig. 1.1. In the upper part of the figure the signal is > 0, but the true label is -1. To correct this, the input x is multiplied by the correct label y and added to the weight vector; the lower part of the figure shows that the signal is now < 0. In short: if h_w(x) is not equal to y, correct the weight by w = w + yx.

Figure 1.1: Geometric Interpretation of Perceptron Learning

Given the dataset X with m examples, the output vector y and a weight vector w, an Octave/Matlab implementation of this rule is sketched in Listing 1.1 below.
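Only a one-line fragment of the original Listing 1.1 survives in this copy, so the following is a minimal reconstruction of the learning rule just described, assuming labels coded as +1/-1 and X already containing the bias column; the names maxIter, count and err are illustrative.

Listing 1.1: Perceptron Algorithm in Octave/Matlab (sketch)

% X: m x (n+1) data matrix with the bias column, y: m x 1 labels coded as +1 / -1
[m, nCols] = size(X);
w = zeros(nCols, 1);                 % initial weight vector
maxIter = 1000;                      % illustrative cap on the number of updates
count = 0;
err = m;                             % start by assuming every example is wrong
while (err ~= 0) && (count < maxIter)
  i = randi(m);                      % pick an example at random
  h = 2 * (X(i, :) * w >= 0) - 1;    % hypothesis of Eq. (1.7): +1 if s >= 0, else -1
  if h ~= y(i)
    w = w + y(i) * X(i, :)';         % learning rule: add the correctly-signed input
  end
  pred = 2 * (X * w >= 0) - 1;       % classify every example with the current w
  err = sum(pred ~= y);              % number of examples still misclassified
  count = count + 1;
end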
1.2 Input Scaling

Sometimes the input coordinates are not on the same scale. To illustrate this point, let us look at an example table. In Table (1.3)(a), x1 is on a very different scale than x2. If we try to minimize the mean squared error loss, we might end up with something like Fig. 1.2(a). The long ellipses show that one of the dimensions is much larger than the other, and looking at the table it can be verified that this is indeed the case. There is a remedy for this situation, known as scaling. There are a few types of scaling available and you should consult lecture 5 for details. Here, one of those rescaling techniques, known as standardization of the input data, is elaborated. We take the mean and standard deviation of all the features, subtract the mean from all the features, and divide them by the standard deviation to normalize the data.

(a)                      (b)
x1   x2      y           x1        x2        y
10   6000    no          -0.7833   -1.1739   no
16   8000    no           0.2611    0.0457   no
10   7700    yes         -0.7833   -0.1372   yes
22   10000   yes          1.3055    1.2653   yes

Table 1.3: (a) Badly Scaled Input Data. (b) Standardized Input Data

After rescaling the data we should expect the contours of the error plot to become more circular. This is shown with the help of the standardized data in Table (1.3)(b) and also presented in Fig. 1.2(b). It must be noted that these figures are only illustrative and real cases may differ from what is depicted; the perfectly circular plot shown here is generally not the result. After rescaling, the features become such that the contours are almost circular. When gradient descent is taking too long, rescaling the data helps it converge to the minimum faster. An Octave/Matlab implementation of standardization is shown in Listing 1.2.

Figure 1.2: (a) When the Inputs are not Scaled: elliptic error-function contour plot. (b) When the Inputs are Scaled: circular error-function contour plot.

Listing 1.2: Input Scaling in Octave/Matlab

muX  = mean(X);
stdX = std(X);
mX = X;                        % mean-centred copy (bias column left untouched)
sX = X;                        % standardized copy
for j = 2 : (n + 1)
  mX(:, j) = X(:, j) - muX(j);
  sX(:, j) = mX(:, j) / stdX(j);
end

NOTE: The index j varies from 2 to n + 1. This is so because scaling is never applied to the bias (the column of 1s).
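As a quick check, the snippet below applies the standardization of Listing 1.2 to the badly scaled data of Table (1.3)(a); the values it prints should match Table (1.3)(b) up to rounding. The variable names are only illustrative.

X = [ 1 10  6000;            % Table 1.3(a) with the bias column prepended
      1 16  8000;
      1 10  7700;
      1 22 10000 ];
n = 2;                       % dimension of the raw input

muX  = mean(X);
stdX = std(X);
sX = X;
for j = 2 : (n + 1)
  sX(:, j) = (X(:, j) - muX(j)) / stdX(j);
end
disp(sX(:, 2:3));            % approximately the values of Table 1.3(b)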
1.3 Stochastic Gradient Descent

Gradient descent is one of the most commonly used iterative techniques in machine learning. The purpose of gradient descent is to find the minimum of an error function. The error function is generated using input values from some given dataset. Using our notation for the data, the mean squared error function is given as:

E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2    (1.8)
The gradient of the error function with respect to some weight coordinate j is given by:

\frac{\partial E(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (1.9)
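Before moving on to the update rule, here is a short, vectorized Octave/Matlab sketch of Eqs. (1.8) and (1.9), assuming a linear hypothesis h_w(x) = w^T x (so that the hypothesis over the whole dataset is simply X*w); the function name mseCostGrad is only an illustrative choice.

function [E, grad] = mseCostGrad(w, X, y)
  % X: m x (n+1) data matrix with bias column, y: m x 1 outputs, w: (n+1) x 1
  m    = size(X, 1);
  h    = X * w;                                 % linear hypothesis on every example
  E    = (1 / (2 * m)) * sum((h - y) .^ 2);     % Eq. (1.8)
  grad = (1 / m) * (X' * (h - y));              % all the partials of Eq. (1.9) at once
end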
Now the gradient descent algorithm says: update the weight vector using the following rule:

w_j = w_j - \alpha \frac{\partial E(w)}{\partial w_j}    (1.10)

Here, in addition to the previous notation, \alpha denotes the learning rate. The major difference between batch gradient descent and stochastic gradient descent is that instead of updating the weight vector by summing over all the examples, the update is performed based on only one randomly selected example. Therefore, if the i-th example is chosen, the error gradient is calculated only on the i-th example for the j-th coordinate, without summing over all the other examples. This error gradient then becomes:

\frac{\partial E(w)}{\partial w_j} = \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (1.11)
Now in both batch gradient descent and stochastic gradient descent, the update is made simultaneously to all the coordinates and not just the j-th coordinate. If you go through the first three lectures you will get a very good idea about that. So how are we going to implement stochastic gradient descent in Octave/Matlab? Let us look at the code sketch first (Listing 1.3) and then discuss the algorithm a little more.
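Only a fragment of the original Listing 1.3 survives here, so the following is a minimal sketch of the loop it described, again assuming a linear hypothesis h_w(x) = w^T x over a data matrix X with a bias column and outputs y; the learning rate, the stopping threshold epsTol and the cap maxIters are illustrative values.

Listing 1.3: Stochastic Gradient Descent in Octave/Matlab (sketch)

alpha    = 0.01;                 % learning rate
epsTol   = 1e-6;                 % stop when the cost changes by less than this
maxIters = 10000;                % hard cap on the number of iterations
m = size(X, 1);
w = zeros(size(X, 2), 1);

cost   = @(w) (1 / (2 * m)) * sum((X * w - y) .^ 2);   % Eq. (1.8)
Eold   = cost(w);
dError = Inf;
nIters = 0;
while (dError > epsTol) && (nIters < maxIters)
  i = randi(m);                               % one randomly selected example
  grad_i = (X(i, :) * w - y(i)) * X(i, :)';   % per-example gradient, Eq. (1.11)
  w = w - alpha * grad_i;                     % simultaneous update of all coordinates, Eq. (1.10)
  Enew   = cost(w);
  dError = abs(Eold - Enew);                  % progress made on the full cost
  Eold   = Enew;
  nIters = nIters + 1;
end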
Figure 1.3: Gradient Descent in Action (the annotations mark where the derivative of the error is < 0 and where it is > 0).

1.4 Noise

In every measurement system there are accuracy and precision issues. The result is that the data has uncertainty: we can never be sure of the accuracy of the data. This uncertainty in the data is known as noise. Let us look at an example to appreciate the concept. Fig. 1.4(a) shows a simple target along with the accompanying data points. It can easily be inferred from the figure that the data points do not agree with the actual curve. This type of noise is called stochastic noise.

Stochastic or random noise is a very common concept. A relatively uncommon notion is that of deterministic noise. To understand deterministic noise, let us look at Fig. 1.4(b). There it can be seen that the data points lie perfectly on the target function. So now the question is: where has the noise come from? The data has noise not because of uncertainty but because of target complexity. In other words, the source of any errors in training and testing is the fact that the target is too difficult to model.

Figure 1.4: (a) Data is Noisy: the data points do not lie on the simple target curve. (b) Target is too Complex: the data points fit the complex target curve perfectly.

To further understand this idea, consider the following example. Suppose you are a mathematics teacher in the primary section of a school. Of all the classes, Class IV-A performs exceptionally well, and the students ask you to teach them exclusive things in mathematics, since that is the best class and they are the best students. You become a little optimistic and teach them integration. After teaching them integration, something strange happens: they start performing poorly even on the easy questions. What happened? The answer is that their minds were too limited to understand integration. To them integration was complex. They did try to learn it. But now, even on slightly different questions that they think are difficult, they try to apply integration (which they do not really know) and make big mistakes.

1.4.1 Bias-Variance Analysis

The notion of noise and error in machine learning is closely related to bias and variance. So far, the error that we have dealt with is the error incurred while training. The task of machine learning is to see the response of your hypothesis on an unknown data point. The error it makes on a dataset (or data point) that your learning algorithm has never seen is called the true error of your hypothesis. This true error is also called the out-of-sample error. For notational convenience we denote the true error by the subscript "out" and the training error by the subscript "in". The out-of-sample error can then be decomposed in the following manner:

E_out = Bias + Variance + Noise    (1.12)

These three terms can be explained further as follows:

1. Bias: Bias is the component of the error which is due to the simplicity of your hypothesis. If you try to learn a very simple hypothesis you might not learn the true characteristics of the data, and hence you will make errors while training as well as while predicting. This is also called underfitting.

2. Variance: Variance is the component which is due to the complexity of your hypothesis and not because of the complexity of the data. In other words, if you try to learn the dataset too well, then although your training error will be very small, you will make big errors while testing. This is also called overfitting. Hence variance corresponds to the deterministic noise. You can overcome variance by using lots of examples. Note: the complexity of the learned hypothesis should not be confused with the complexity of the target function.

3. Noise: This is due to the stochastic (random) noise in the data.

Fig. 1.5(a) and (b) provide a pictorial presentation of bias and variance.

Figure 1.5: (a) Bias: error against the number of examples for a simple hypothesis (low-order polynomial). (b) Variance: error against the number of examples for a complex hypothesis (high-order polynomial).

1.4.2 Regularization

Regularization is a cure for noise and, in particular, for overfitting. As has already been explained, the error due to noise can be random or deterministic. The deterministic noise is the variance part of the error, which is also known as overfitting. So let us first understand what overfitting means. Fitting a curve is finding parameters which produce output as close to the given output as possible. Overfitting is fitting too well, so much so that all the points are perfectly covered while training. However, at testing time this can be a big problem, as explained above.

Regularization puts a restraint on the learning algorithm not to learn too well (too strong weights). To put this into effect we consider the following constrained optimization problem (with C a fixed budget on the size of the weights):

\min_w E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 \quad \text{subject to: } \|w\|^2 \leq C    (1.13)
Pictorially, this constrained problem can be represented as in Fig. 1.6.

Figure 1.6: The Ellipse as the Error and the Circle as Regularizer.

Instead of doing a detailed derivation, one can look at the diagram and infer the following mathematical relationship between the gradient of the error and the weight vector at the optimum point, marked by the green star in the figure:

\nabla E(w) = -\frac{\lambda}{m} w \quad \Longleftrightarrow \quad \nabla E(w) + \frac{\lambda}{m} w = 0    (1.14)

Eq. (1.14) tells me that if I could only transform the second term, the one with \lambda, into the differential of something, I could have another optimization function without the constraint. This is presented in the following equation:

\min_w E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \|w\|^2    (1.15)
Now this is almost the same as the original cost function except for an additive term, and we know the derivative of that additive term. Using this information we can rewrite the weight update and the exact solution for regression problems.

1. Weight Update:

w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]    (1.16)

2. Normal Equations:

w = (X^T X + \lambda J)^{-1} X^T y    (1.17)

Note: the bias term (the column of 1s) is never regularized. Therefore, J is the identity matrix, except that its first element is 0 instead of 1.
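As a small illustration of Eq. (1.17), here is a minimal Octave/Matlab sketch of the regularized normal equations, assuming X already carries the bias column; lambda = 1 is an arbitrary illustrative value.

lambda = 1;                       % regularization strength (illustrative)
nCols  = size(X, 2);              % n + 1 columns, including the bias
J = eye(nCols);
J(1, 1) = 0;                      % do not regularize the bias term
w = (X' * X + lambda * J) \ (X' * y);   % Eq. (1.17)

Using the backslash operator rather than an explicit matrix inverse is the usual, numerically safer way of solving Eq. (1.17) in Octave/Matlab.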
1.5 Conclusion

The conclusion will be in red, and note that you need to pay careful attention to it. Also go through the lecture on Neural Nets; I have asked very basic questions. Read the lecture on VC analysis; only conceptual questions will be asked.

1. The human brain is good at pattern recognition, while computers are good at number crunching.

2. The growth function is justified because of the VC bound.

I hope that reading this document and going through lectures 5 - 10 will get you good marks. All the best!

"With faith, discipline and self-less devotion to duty, there is nothing worthwhile that you cannot achieve."
Mohammad Ali Jinnah (1876-1948)

2 Supervised Learning Techniques

This chapter should serve as a guide for the final examination. Although the name implies supervised techniques in general, it will deal with neural nets, support vector machines and nearest neighbours.

2.1 Support Vector Machines

The Support Vector Machine (SVM) is arguably the most widely accepted technique used for classification. It relies on the concept of margins. The margin is defined as the distance between the separating hyperplane and the nearest example. An example here means a data point. For binary classification, the example closest to the hyperplane may be negative or positive. Let us assume that the closest example is positive. Does that mean there will only be one margin, and that too for the nearest positive example? Well, that is wrong. As you will see later, when margins are formed they take into account both the negative and positive examples. Let us look at Fig. 2.1 for the concept of margins.

Fig. 2.1 shows two examples from two classes. We see that the margins are formed in such a manner that the distance of the two examples from the separating hyperplane is the same. If r is the distance of each of the examples from the separating hyperplane, then we can express the total margin as the sum of the individual margins, as shown in Eq. (2.1).

margin_T = 2r    (2.1)

In Eq. (2.1), the subscript T signifies the word "total". We could sum the two distances without worrying about any projections or components. We summed them by their scalar magnitudes because the two margins separate the examples along one axis only. To further clarify, think of the two examples as sitting on the surface of a sphere. The line joining them is a straight line passing through the center of the sphere. If we know the radius of the sphere, we can easily find the distance between the two examples by summing the radius twice. In other words, the examples can be thought of as sitting on the surface of the sphere and separated by the diameter of the sphere.

Figure 2.1: Two Data Points are Separated by Margin r.

2.1.1 Maximum Margin Classifier

Knowing that the margin is simply the perpendicular distance of the nearest example to the separating hyperplane, we may ask ourselves: what does the margin really do for us in terms of machine learning, and is it really worth spending time on? The answer to this rather fundamental question can be given intuitively. Let us look at a figure and then things will be clearer.

Figure 2.2: Three Different Classifiers with Different Margins.

In Fig. 2.2, three classifiers are presented. We do not know which learning algorithm generated them, but we do know that there are differences in the way they separate an otherwise identical dataset.
Let us try to enumerate these differences:

- the three classifiers (separating planes) have different slopes.
- case (a) separates the examples such that the line is very close to each of the two classes of nearest examples.
- in case (b) the separating plane is rather far from the two examples.
- case (c) is where the separating plane is farthest from the two examples.
- the margin for case (a) is the narrowest, whereas the margin for case (c) is the widest.

Notice that I am using the terms separating plane, separating line and separating hyperplane interchangeably. They all mean the same thing here; although there are differences, for now they do not matter and the terms can be treated as the same. Getting back to our original question of how the margin helps us, we realize that perhaps the classifier with the widest margin is the best. But why? One way to think about this is to consider a new example. A new example is a data point which your training algorithm has not seen yet. What are the chances that the first separator will classify it correctly? What are the chances that the second or the third classifier will do its job without error? For a new example, the most error-tolerant (or accurate) will be the third classifier. To further strengthen the idea, let us assume that the new example is a little wayward (not normal, a little outside or beyond the normal region). Which classifier will tolerate it the most? The third one! Because it has a wider margin, and if an example lands a little here or there it can still lie within the wider margin and hence be correctly classified, whereas if the margins are narrow the chances are high that the new example will be misclassified.

2.1.2 Quantifying the Margin
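The section breaks off here. As a small illustration of the margin idea defined above (not necessarily the derivation this section goes on to give), the Octave/Matlab sketch below measures the margin of a given separating hyperplane w^T x + b = 0 on the toy data of Table (1.1): it computes the perpendicular distance of every example from the plane and takes the smallest one. The particular w and b are arbitrary illustrative values, not a trained classifier.

% Raw inputs of Table 1.1 and labels coded as +1 (yes) / -1 (no)
X = [ 10  60;  16  80;  10  77;  22 100;  25  95;
      50 170;  45 180;  80 130;  55 150;  20 110;  32 130 ];
y = [ -1; -1; -1; -1; -1; +1; +1; +1; +1; -1; +1 ];

w = [ 1; 1 ];                        % an illustrative separating direction
b = -155;                            % and an illustrative offset

dist    = abs(X * w + b) / norm(w);  % perpendicular distance of each point from the plane
margin  = min(dist);                 % the margin: distance of the nearest example
correct = all(sign(X * w + b) == y); % does this plane actually separate the data?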