CPSC 340: Machine Learning and Data Mining
Regularization (Spring 2020)
Admin
• Midterm is Friday.
– Feb 14th at 6:00pm (here, WESB 100).
– 100 minutes.
– Closed-book.
– One double-sided hand-written ‘cheat sheet’.
• Use of typed cheat sheets will be treated as an academic code violation.
– Bring your student ID; they will be checked at exam hand-in.
– Auditors do not take the midterm.
• There will be two types of questions on the midterm:
– ‘Technical’ questions requiring things like pseudo-code or derivations.
– ‘Conceptual’ questions testing understanding of key concepts.
• All lecture slide material except “bonus slides” is fair game.
Last Time: Feature Selection
• Last time we discussed feature selection:
– Choosing a set of “relevant” features.
• Most common approach is search and score:
– Define a “score”, and “search” for the features with the best score.
• But it’s hard to define the “score” and it’s hard to “search”.
– So we often use greedy methods like forward selection.
• Methods work OK on “toy” data, but are frustrating on real data.
– Different methods may return very different results.
– Defining whether a feature is “relevant” is complicated and ambiguous.
My advice if you want the “relevant” variables:
• Try the association approach.
• Try forward selection with different values of λ.
• Try out a few other feature selection methods too.
• Discuss the results with the domain expert.
– They probably have an idea of why some variables might be relevant.
• Don’t be overconfident:
– These methods are probably not discovering how the world truly works.
– “The algorithm has found that these variables are helpful in predicting yi.”
• Then a warning that these models are not perfect at finding the relevant variables.
Related: Survivorship Bias
• Plotting the location of bullet holes on planes returning from WW2:
• Where are the “relevant” parts of the plane to protect?
– “Relevant” parts are actually where there are no bullets.
– Planes shot in other places did not come back (armor was needed there).
https://en.wikipedia.org/wiki/Survivorship_bias
• This is an example of “survivorship bias”:
– Data is not IID because you only sample the “survivors”.
– Causes havoc for feature selection, and for ML methods in general.
• People come to wrong conclusions due to survivorship bias all the time.
– An article on the “secrets of success”, focusing on traits of successful people.
– But ignoring the number of non-super-successful people with the same traits.
– An article hypothesizing about various topics (allergies, mental illness, etc.).
“Feature” Selection vs. “Model” Selection?
• Model selection: “which model should I use?”
– KNN vs. decision tree, depth of a decision tree, degree of a polynomial basis.
• Feature selection: “which features should I use?”
– Using feature 10 or not, using xi² as part of the basis.
• These two tasks are highly related:
– It’s a different “model” if we add xi² to linear regression.
– But the xi² term is just a “feature” that could be “selected” or not.
– Usually, “feature selection” means choosing from some “original” features.
• You could say that “feature” selection is a special case of “model” selection.
(Figure-only slides with examples of model selection and feature selection.)
Can it help prediction to throw features away?
• Yes, because linear regression can overfit with large ‘d’.
– Even though it’s “just” a hyper-plane.
• Consider using d = n, with completely random features.
– With high probability, you will be able to get a training error of 0 (see the sketch below).
– But the features were random; this is completely overfitting.
• You could view the “number of features” as a hyper-parameter.
– The model gets more complex as you add more features.
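To see the d = n claim concretely, here is a minimal numpy sketch (the seed and problem size are made up for illustration): the training error is essentially zero even though the features carry no signal, so the perfect fit is pure overfitting.

```python
import numpy as np

np.random.seed(0)
n = d = 50                          # as many (random) features as examples
X = np.random.randn(n, d)           # completely random features
y = np.random.randn(n)              # labels unrelated to the features

w = np.linalg.solve(X, y)           # square system: fits the training data exactly
print(np.max(np.abs(X @ w - y)))    # training error ~ 0 (up to round-off)
```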
(pause)
Recall: Polynomial Degree and Training vs. Testing
• We’ve said that complicated models tend to overfit more.
• But what if we need a complicated model?
http://www.cs.ubc.ca/~arnaud/stat535/slides5_revised.pdf
Controlling Complexity
• Usually the “true” mapping from xi to yi is complex.
– Might need a high-degree polynomial.
– Might need to combine many features, and we don’t know the “relevant” ones.
• But complex models can overfit.
• So what do we do???
• Our main tools:
– Model averaging: average over multiple models to decrease variance.
– Regularization: add a penalty on the complexity of the model.
Would you rather?
• Consider the following dataset and 3 linear regression models:
• Which line should we choose?
• What if you are forced to choose between the red and green lines?
– And assume they have the same training error.
• You should pick green.
– Since the slope is smaller, a small change in xi leads to a smaller change in the prediction yi.
• The green line’s predictions are less sensitive to having ‘w’ exactly right.
– Since the green ‘w’ is less sensitive to the data, the test error might be lower.
Size of Regression Weights and Overfitting
• The regression weights wj with degree-7 are huge in this example.
• The degree-7 polynomial would be less sensitive to the data if we “regularized” the wj so that they are small.
L2-Regularization
• The standard regularization strategy is L2-regularization (the objective is written out below):
• Intuition: large slopes wj tend to lead to overfitting.
• The objective balances getting low error vs. having small slopes ‘wj’.
– “You can increase the training error if it makes ‘w’ much smaller.”
– Nearly always reduces overfitting.
– The regularization parameter λ > 0 controls the “strength” of the regularization.
• Large λ puts a large penalty on the slopes.
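Written out (consistently with the gradient given later in this lecture), the L2-regularized least squares objective shown on the slide is:

f(w) = ½ ||Xw − y||² + (λ/2) ||w||²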
• In terms of the fundamental trade-off:
– Regularization increases the training error.
– Regularization decreases the approximation error.
• How should you choose λ?
– Theory: as ‘n’ grows, λ should be in the range O(1) to O(√n).
– Practice: optimize the validation set or cross-validation error.
• This almost always decreases the test error.
L2-Regularization “Shrinking” Example
• Solution to “least squares with L2-regularization” for different values of λ:
• We get least squares with λ = 0.
– But we can achieve similar training error with a smaller ||w||.
• ||Xw − y|| increases with λ, and ||w|| decreases with λ.
– Though individual wj can increase or decrease with λ.
– Because we use the L2-norm, the large ones decrease the most.
λ        w1      w2      w3      w4      w5      ||Xw − y||²   ||w||²
0       -1.88    1.29   -2.63    1.78   -0.63       285.64      15.68
1       -1.88    1.28   -2.62    1.78   -0.64       285.64      15.62
4       -1.87    1.28   -2.59    1.77   -0.66       285.64      15.43
16      -1.84    1.27   -2.50    1.73   -0.73       285.71      14.76
64      -1.74    1.23   -2.22    1.59   -0.90       286.47      12.77
256     -1.43    1.08   -1.70    1.18   -1.05       292.60       8.60
1024    -0.87    0.73   -1.03    0.57   -0.81       321.29       3.33
4096    -0.35    0.31   -0.42    0.18   -0.36       374.27       0.56
Regularization Path
• The regularization path is a plot of the optimal weights ‘wj’ as ‘λ’ varies:
• Starts at the least squares solution with λ = 0, and the wj converge to 0 as λ grows (a sketch of computing such a path follows below).
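A minimal sketch of computing such a path, using made-up synthetic data (the names, seed, and λ grid are illustrative); each λ is solved with the normal equations derived on the next slide.

```python
import numpy as np

np.random.seed(0)
n, d = 100, 5
X = np.random.randn(n, d)
y = X @ np.array([-2.0, 1.0, -3.0, 2.0, -1.0]) + np.random.randn(n)

for lam in [0, 1, 4, 16, 64, 256, 1024, 4096]:
    # L2-regularized least squares: solve (X^T X + lam*I) w = X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lambda={lam:5d}  w={np.round(w, 2)}  ||w||^2={w @ w:.2f}")
```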
L2-Regularization and the Normal Equations
• When using the L2-regularized squared error, we can solve for ∇f(w) = 0.
• Loss before: f(w) = ½ ||Xw − y||²
• Loss after: f(w) = ½ ||Xw − y||² + (λ/2) ||w||²
• Gradient before: ∇f(w) = XᵀXw − Xᵀy
• Gradient after: ∇f(w) = XᵀXw − Xᵀy + λw
• Linear system before: XᵀXw = Xᵀy
• Linear system after: (XᵀX + λI)w = Xᵀy
• But unlike XᵀX, the matrix (XᵀX + λI) is always invertible (a small numerical check follows below):
– Multiply by its inverse for the unique solution: w = (XᵀX + λI)⁻¹Xᵀy
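A small numerical check of the “always invertible” claim, with a made-up perfectly collinear feature: XᵀX is singular, but XᵀX + λI can still be solved for a unique w.

```python
import numpy as np

np.random.seed(1)
n, d = 20, 3
X = np.random.randn(n, d)
X[:, 2] = X[:, 0] + X[:, 1]        # third feature is an exact sum: collinearity
y = np.random.randn(n)

A = X.T @ X
print(np.linalg.matrix_rank(A))    # 2 < d, so X^T X is singular (not invertible)

lam = 1.0
w = np.linalg.solve(A + lam * np.eye(d), X.T @ y)   # unique regularized solution
print(w)
```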
Gradient Descent for L2-Regularized Least Squares
• The L2-regularized least squares objective and its gradient:
– f(w) = ½ ||Xw − y||² + (λ/2) ||w||²
– ∇f(w) = Xᵀ(Xw − y) + λw
• Gradient descent iterations for L2-regularized least squares:
– w^{t+1} = w^t − α^t (Xᵀ(Xw^t − y) + λw^t)
• Cost of a gradient descent iteration is still O(nd) (a sketch implementation follows below).
– Can show that the number of iterations decreases as λ increases (not obvious).
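A minimal implementation sketch of these iterations (the function name and the constant step size are my own choices, not from the slides), assuming the objective f(w) = ½||Xw − y||² + (λ/2)||w||² above:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, n_iter=1000, alpha=None):
    """Gradient descent on f(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    n, d = X.shape
    if alpha is None:
        # One safe constant step size: 1/L, where L = max eigenvalue of X^T X, plus lam
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam * w    # O(nd) cost per iteration
        w = w - alpha * grad
    return w

# Usage (with X, y from any of the earlier sketches):
# w = ridge_gradient_descent(X, y, lam=16.0)
```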
Why use L2-Regularization?
• It’s a weird thing to do, but we (CS 340 professors) say “always use regularization”.
– “Almost always decreases the test error” should already convince you.
• But here are 6 more reasons:
1. The solution ‘w’ is unique.
2. XᵀX does not need to be invertible (no collinearity issues).
3. Less sensitive to changes in X or y.
4. Gradient descent converges faster (bigger λ means fewer iterations).
5. Stein’s paradox: if d ≥ 3, ‘shrinking’ moves us closer to the ‘true’ w.
6. Worst case: just set λ small and get the same performance.
(pause)
Features with Different Scales
• Consider continuous features with different scales:
• Should we convert to some standard ‘unit’?
– It doesn’t matter for decision trees or naïve Bayes.
• They only look at one feature at a time.
– It doesn’t matter for least squares:
• wj*(100 mL) gives the same model as wj*(0.1 L) with a different wj.
Egg (#)   Milk (mL)   Fish (g)   Pasta (cups)
0         250         0          1
1         250         200        1
0         0           0          0.5
2         250         150        0
• Should we convert to some standard ‘unit’?
– It matters for k-nearest neighbours:
• “Distance” will be affected more by large features than by small features.
– It matters for regularized least squares:
• Penalizing (wj)² means different things if the features ‘j’ are on different scales.
Standardizing Features
• It is common to standardize continuous features:
– For each feature ‘j’:
1. Compute its mean and standard deviation:
   μj = (1/n) Σi xij,   σj = sqrt((1/n) Σi (xij − μj)²)
2. Subtract the mean and divide by the standard deviation (“z-score”):
   replace xij with (xij − μj)/σj
– Now changes in ‘wj’ have a similar effect for any feature ‘j’.
• How should we standardize test data?
– Wrong approach: use the mean and standard deviation of the test data.
– Training and test mean and standard deviation might be very different.
– Right approach: use the mean and standard deviation of the training data.
• If we’re doing 10-fold cross-validation (a short sketch follows after this list):
– Compute μj and σj based on the 9 training folds (e.g., averaging over 9/10 of the data).
– Standardize the remaining (“validation”) fold with this “training” μj and σj.
– Re-standardize for different folds.
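A minimal sketch of the “right approach”, assuming numpy arrays (the helper names are mine, not from the course): the statistics come from the training folds only and are reused on the validation fold.

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and standard deviation, computed on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    """Apply the *training* statistics to any split (train, validation, or test)."""
    return (X - mu) / sigma

# Within 10-fold cross-validation, refit on each set of 9 training folds:
# mu, sigma = fit_standardizer(X[train_idx])
# X_train_std = apply_standardizer(X[train_idx], mu, sigma)
# X_valid_std = apply_standardizer(X[valid_idx], mu, sigma)
```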
Standardizing the Target
• In regression, we sometimes standardize the targets yi.
– Puts the targets on the same standard scale as the standardized features.
• With a standardized target, setting w = 0 predicts the average yi:
– High regularization makes us predict closer to the average value.
• Again, make sure you standardize the test data with the training stats (see the sketch below).
• Other common transformations of yi are logarithm/exponent:
– Makes sense for geometric/exponential processes.
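And a matching sketch for targets, again using only training statistics (the function and variable names are illustrative):

```python
def fit_target_standardizer(y_train):
    """Mean and standard deviation of the *training* targets."""
    return y_train.mean(), y_train.std()

def standardize_target(y, mean, std):
    return (y - mean) / std

def unstandardize_predictions(y_pred_std, mean, std):
    # Map predictions in standardized units back to the original target units.
    return y_pred_std * std + mean
```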
Regularizing the y-Intercept?
• Should we regularize the y-intercept?
• No! Why encourage it to be closer to zero? (It could be anywhere.)
– You should be allowed to shift the function up/down globally.
• Yes! It makes the solution unique and makes it easier to compute ‘w’.
• Compromise: regularize by a smaller amount than the other variables.
(pause)
Predicting the Future
• In principle, we can use any features xi that we think are relevant.
• This makes it tempting to use time as a feature, and predict the future.
https://gravityandlevity.wordpress.com/2009/04/22/the-fastest-possible-mile/
https://overthehillsports.wordpress.com/tag/hicham-el-guerrouj/
Predicting 100m times 400 years in the future?
https://plus.maths.org/content/sites/plus.maths.org/files/articles/2011/usain/graph2.gif
http://www.washingtonpost.com/blogs/london-2012-olympics/wp/2012/08/08/report-usain-bolt-invited-to-tryout-for-manchester-united/
Interpolation vs. Extrapolation
• Interpolation is the task of predicting “between the data points”.
– Regression models are good at this if you have enough data and the function is continuous.
• Extrapolation is the task of predicting outside the range of the data points.
– Without assumptions, regression models can be embarrassingly bad at this.
• If you run the 100m regression models backwards in time:
– They predict that humans used to be really, really slow!
• If you run the 100m regression models forwards in time:
– They might eventually predict arbitrarily small 100m times.
– The linear model actually predicts negative times in the future.
• These time-traveling races in 2060 should be pretty exciting!
• Some discussion here:
– http://callingbullshit.org/case_studies/case_study_gender_gap_running.html
https://www.smbc-comics.com/comic/rise-of-the-machines
No Free Lunch, Consistency, and the Future
Ockham’s Razor vs. No Free Lunch
• Ockham’s razor is a problem-solving principle:
– “Among competing hypotheses, the one with the fewest assumptions should be selected.”
– Suggests we should select the linear model.
• Fundamental trade-off:
– If two models have the same training error, pick the one less likely to overfit.
– A formal version of Ockham’s problem-solving principle.
– Also suggests we should select the linear model.
• No free lunch theorem:
– There exist possible datasets where you should select the green model.
No Free Lunch, Consistency, and the Future
• We can resolve “blue vs. green” by collecting more data:
Discussion: Climate Models
• Has the Earth warmed up over the last 100 years? (Consistency zone.)
– The data clearly says “yes”.
• Will the Earth continue to warm over the next 100 years? (Generalization error.)
– We should be more skeptical about models that predict future events.
https://en.wikipedia.org/wiki/Global_warming
• So should we all become global warming skeptics?
• If we average over models that overfit in *independent* ways, we expect the test error to be lower, so this gives more confidence:
– We should be skeptical of individual models, but agreeing predictions made by models with different data/assumptions are more likely to be true.
• All the near-future predictions agree, so they are likely to be accurate.
– And it’s probably reasonable to assume fairly continuous change (no big “jumps”).
• Variance is higher further into the future, so those predictions are less reliable.
– Relying more on assumptions and less on data.
Index Funds: Ensemble Extrapolation for Investing
• We want to do extrapolation when investing money.
– What will this be worth in the future?
• Index funds can be viewed as an ensemble method for investing.
– For example, buy stock in the top 500 companies, proportional to their value.
– Tries to follow the average price increase/decrease.
– This simple investing strategy outperforms most fund managers.
http://fibydesign.com/005-introduction-to-index-investing-stocks-index-funds-vtsax/
Summary
• Regularization:
– Adding a penalty on model complexity.
• L2-regularization: a penalty on the L2-norm of the regression weights ‘w’.
– Almost always improves the test error.
• Standardizing features:
– For some models it makes sense to have the features on the same scale.
• Interpolation vs. Extrapolation:
– Machine learning with large ‘n’ is good at predicting “between the data”.
– Without assumptions, it can be arbitrarily bad “away from the data”.
• Next time: learning with an exponential number of irrelevant features.
L2-Regularization
• The standard regularization strategy is L2-regularization:
• Equivalent to minimizing the squared error while keeping the L2-norm small (written out below).
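This equivalence can be stated as a constrained problem: for every λ ≥ 0 there is some radius τ ≥ 0 (depending on λ and the data) such that

minimizing  ½ ||Xw − y||² + (λ/2) ||w||²

gives the same solution w as

minimizing  ½ ||Xw − y||²   subject to   ||w||² ≤ τ.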
Regularization/Shrinking Paradox
• We throw darts at a target:
– Assume we don’t always hit the exact center.
– Assume the darts follow a symmetric pattern around the center.
• Shrinkage of the darts:
1. Choose some arbitrary location ‘0’.
2. Measure the distances from the darts to ‘0’.
3. Move the misses towards ‘0’, by a small amount proportional to the distance from 0.
• If the shrinkage is small enough, the darts will be closer to the center on average (a small simulation follows below).
• Visualization of the related higher-dimensional paradox that the mean of data coming from a Gaussian is not the best estimate of the mean of the Gaussian in 3 dimensions or higher: https://www.naftaliharris.com/blog/steinviz
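A small simulation in the spirit of the darts analogy (the center, noise level, and shrinkage amount are all made-up values): shrinking the darts slightly toward an arbitrary point ‘0’ reduces their average squared distance to the true center.

```python
import numpy as np

np.random.seed(0)
center = np.array([3.0, 1.0, -2.0])              # true center of the target
darts = center + np.random.randn(100_000, 3)     # symmetric misses around the center

eps = 0.1                                         # small shrinkage toward the origin
shrunk = (1 - eps) * darts

print(np.mean(np.sum((darts - center) ** 2, axis=1)))     # ~3.0
print(np.mean(np.sum((shrunk - center) ** 2, axis=1)))    # smaller on average (~2.57)
```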