CPSC 340: Machine Learning and Data Mining
Regularization (Spring 2020)
Admin
• Midterm is Friday.
– Feb 14th at 6:00pm (here, WESB 100).
– 100 minutes.
– Closed-book.
– One double-sided hand-written ‘cheat sheet’.
• Use of typed cheat sheets will be treated as an academic code violation.
– Bring your student ID; they will be checked at exam hand-in.
– Auditors do not take the midterm.
• There will be two types of questions on the midterm:
– ‘Technical’ questions requiring things like pseudo-code or derivations.
– ‘Conceptual’ questions testing understanding of key concepts.
• All lecture slide material except “bonus slides” is fair game.
Last Time: Feature Selection
• Last time we discussed feature selection:
– Choosing a set of “relevant” features.
• Most common approach is search and score:
– Define a “score”, and “search” for the features with the best score.
• But it’s hard to define the “score” and it’s hard to “search”.
– So we often use greedy methods like forward selection.
• Methods work OK on “toy” data, but are frustrating on real data.
– Different methods may return very different results.
– Defining whether a feature is “relevant” is complicated and ambiguous.
My advice if you want the “relevant” variables:
• Try the association approach.
• Try forward selection with different values of λ.
• Try out a few other feature selection methods too.
• Discuss the results with the domain expert.
– They probably have an idea of why some variables might be relevant.
• Don’t be overconfident:
– These methods are probably not discovering how the world truly works.
– “The algorithm has found that these variables are helpful in predicting yi.”
• Then a warning that these models are not perfect at finding the relevant variables.
Related: Survivorship Bias
• Plotting the location of bullet holes on planes returning from WW2:
• Where are the “relevant” parts of the plane to protect?
– “Relevant” parts are actually where there are no bullets.
– Planes shot in other places did not come back (armor was needed there).
https://en.wikipedia.org/wiki/Survivorship_bias
• This is an example of “survivorship bias”:
– Data is not IID because you only sample the “survivors”.
– Causes havoc for feature selection, and for ML methods in general.
• People come to wrong conclusions due to survivorship bias all the time.
– An article on the “secrets of success”, focusing on traits of successful people.
– But ignoring the number of non-super-successful people with the same traits.
– An article hypothesizing about various topics (allergies, mental illness, etc.).
“Feature” Selection vs. “Model” Selection?
• Model selection: “which model should I use?”
– KNN vs. decision tree, depth of a decision tree, degree of a polynomial basis.
• Feature selection: “which features should I use?”
– Using feature 10 or not, using xi² as part of the basis.
• These two tasks are highly related:
– It’s a different “model” if we add xi² to linear regression.
– But the xi² term is just a “feature” that could be “selected” or not.
– Usually, “feature selection” means choosing from some “original” features.
• You could say that “feature” selection is a special case of “model” selection.
(Figure-only slides with examples of model selection and feature selection.)
Can it help prediction to throw features away?
• Yes, because linear regression can overfit with large ‘d’.
– Even though it’s “just” a hyper-plane.
• Consider using d = n, with completely random features.
– With high probability, you will be able to get a training error of 0 (see the sketch below).
– But the features were random; this is completely overfitting.
• You could view the “number of features” as a hyper-parameter.
– The model gets more complex as you add more features.
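To see the d = n claim concretely, here is a minimal numpy sketch (the seed and problem size are made up for illustration): the training error is essentially zero even though the features carry no signal, so the perfect fit is pure overfitting.

```python
import numpy as np

np.random.seed(0)
n = d = 50                          # as many (random) features as examples
X = np.random.randn(n, d)           # completely random features
y = np.random.randn(n)              # labels unrelated to the features

w = np.linalg.solve(X, y)           # square system: fits the training data exactly
print(np.max(np.abs(X @ w - y)))    # training error ~ 0 (up to round-off)
```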
(pause)
Recall: Polynomial Degree and Training vs. Testing
• We’ve said that complicated models tend to overfit more.
• But what if we need a complicated model?
http://www.cs.ubc.ca/~arnaud/stat535/slides5_revised.pdf
Controlling Complexity
• Usually the “true” mapping from xi to yi is complex.
– Might need a high-degree polynomial.
– Might need to combine many features, and we don’t know the “relevant” ones.
• But complex models can overfit.
• So what do we do???
• Our main tools:
– Model averaging: average over multiple models to decrease variance.
– Regularization: add a penalty on the complexity of the model.
Would you rather?
• Consider the following dataset and 3 linear regression models:
• Which line should we choose?
• What if you are forced to choose between the red and green lines?
– And assume they have the same training error.
• You should pick green.
– Since the slope is smaller, a small change in xi leads to a smaller change in the prediction yi.
• The green line’s predictions are less sensitive to having ‘w’ exactly right.
– Since the green ‘w’ is less sensitive to the data, the test error might be lower.
Size of Regression Weights and Overfitting
• The regression weights wj with degree-7 are huge in this example.
• The degree-7 polynomial would be less sensitive to the data if we “regularized” the wj so that they are small.
L2-Regularization
• The standard regularization strategy is L2-regularization (the objective is written out below):
• Intuition: large slopes wj tend to lead to overfitting.
• The objective balances getting low error vs. having small slopes ‘wj’.
– “You can increase the training error if it makes ‘w’ much smaller.”
– Nearly always reduces overfitting.
– The regularization parameter λ > 0 controls the “strength” of the regularization.
• Large λ puts a large penalty on the slopes.
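Written out (consistently with the gradient given later in this lecture), the L2-regularized least squares objective shown on the slide is:

f(w) = ½ ||Xw − y||² + (λ/2) ||w||²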
• In terms of the fundamental trade-off:
– Regularization increases the training error.
– Regularization decreases the approximation error.
• How should you choose λ?
– Theory: as ‘n’ grows, λ should be in the range O(1) to O(√n).
– Practice: optimize the validation set or cross-validation error.
• This almost always decreases the test error.
L2-Regularization “Shrinking” Example
• Solution to “least squares with L2-regularization” for different values of λ:
• We get least squares with λ = 0.
– But we can achieve similar training error with a smaller ||w||.
• ||Xw − y|| increases with λ, and ||w|| decreases with λ.
– Though individual wj can increase or decrease with λ.
– Because we use the L2-norm, the large ones decrease the most.
λ        w1      w2      w3      w4      w5      ||Xw − y||²   ||w||²
0       -1.88    1.29   -2.63    1.78   -0.63       285.64      15.68
1       -1.88    1.28   -2.62    1.78   -0.64       285.64      15.62
4       -1.87    1.28   -2.59    1.77   -0.66       285.64      15.43
16      -1.84    1.27   -2.50    1.73   -0.73       285.71      14.76
64      -1.74    1.23   -2.22    1.59   -0.90       286.47      12.77
256     -1.43    1.08   -1.70    1.18   -1.05       292.60       8.60
1024    -0.87    0.73   -1.03    0.57   -0.81       321.29       3.33
4096    -0.35    0.31   -0.42    0.18   -0.36       374.27       0.56
Regularization Path
• The regularization path is a plot of the optimal weights ‘wj’ as ‘λ’ varies:
• Starts at the least squares solution with λ = 0, and the wj converge to 0 as λ grows (a sketch of computing such a path follows below).
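A minimal sketch of computing such a path, using made-up synthetic data (the names, seed, and λ grid are illustrative); each λ is solved with the normal equations derived on the next slide.

```python
import numpy as np

np.random.seed(0)
n, d = 100, 5
X = np.random.randn(n, d)
y = X @ np.array([-2.0, 1.0, -3.0, 2.0, -1.0]) + np.random.randn(n)

for lam in [0, 1, 4, 16, 64, 256, 1024, 4096]:
    # L2-regularized least squares: solve (X^T X + lam*I) w = X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lambda={lam:5d}  w={np.round(w, 2)}  ||w||^2={w @ w:.2f}")
```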
L2-Regularization and the Normal Equations
• When using the L2-regularized squared error, we can solve for ∇f(w) = 0.
• Loss before: f(w) = ½ ||Xw − y||²
• Loss after: f(w) = ½ ||Xw − y||² + (λ/2) ||w||²
• Gradient before: ∇f(w) = XᵀXw − Xᵀy
• Gradient after: ∇f(w) = XᵀXw − Xᵀy + λw
• Linear system before: XᵀXw = Xᵀy
• Linear system after: (XᵀX + λI)w = Xᵀy
• But unlike XᵀX, the matrix (XᵀX + λI) is always invertible (a small numerical check follows below):
– Multiply by its inverse for the unique solution: w = (XᵀX + λI)⁻¹Xᵀy
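A small numerical check of the “always invertible” claim, with a made-up perfectly collinear feature: XᵀX is singular, but XᵀX + λI can still be solved for a unique w.

```python
import numpy as np

np.random.seed(1)
n, d = 20, 3
X = np.random.randn(n, d)
X[:, 2] = X[:, 0] + X[:, 1]        # third feature is an exact sum: collinearity
y = np.random.randn(n)

A = X.T @ X
print(np.linalg.matrix_rank(A))    # 2 < d, so X^T X is singular (not invertible)

lam = 1.0
w = np.linalg.solve(A + lam * np.eye(d), X.T @ y)   # unique regularized solution
print(w)
```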
Gradient Descent for L2-Regularized Least Squares
• The L2-regularized least squares objective and its gradient:
– f(w) = ½ ||Xw − y||² + (λ/2) ||w||²
– ∇f(w) = Xᵀ(Xw − y) + λw
• Gradient descent iterations for L2-regularized least squares:
– w^{t+1} = w^t − α^t (Xᵀ(Xw^t − y) + λw^t)
• Cost of a gradient descent iteration is still O(nd) (a sketch implementation follows below).
– Can show that the number of iterations decreases as λ increases (not obvious).
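A minimal implementation sketch of these iterations (the function name and the constant step size are my own choices, not from the slides), assuming the objective f(w) = ½||Xw − y||² + (λ/2)||w||² above:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, n_iter=1000, alpha=None):
    """Gradient descent on f(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    n, d = X.shape
    if alpha is None:
        # One safe constant step size: 1/L, where L = max eigenvalue of X^T X, plus lam
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam * w    # O(nd) cost per iteration
        w = w - alpha * grad
    return w

# Usage (with X, y from any of the earlier sketches):
# w = ridge_gradient_descent(X, y, lam=16.0)
```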
Why use L2-Regularization?
• It’s a weird thing to do, but we (CS 340 professors) say “always use regularization”.
– “Almost always decreases the test error” should already convince you.
• But here are 6 more reasons:
1. The solution ‘w’ is unique.
2. XᵀX does not need to be invertible (no collinearity issues).
3. Less sensitive to changes in X or y.
4. Gradient descent converges faster (bigger λ means fewer iterations).
5. Stein’s paradox: if d ≥ 3, ‘shrinking’ moves us closer to the ‘true’ w.
6. Worst case: just set λ small and get the same performance.
(pause)
Features with Different Scales
• Consider continuous features with different scales:
• Should we convert to some standard ‘unit’?
– It doesn’t matter for decision trees or naïve Bayes.
• They only look at one feature at a time.
– It doesn’t matter for least squares:
• wj*(100 mL) gives the same model as wj*(0.1 L) with a different wj.
Egg (#)   Milk (mL)   Fish (g)   Pasta (cups)
0         250         0          1
1         250         200        1
0         0           0          0.5
2         250         150        0
• Should we convert to some standard ‘unit’?
– It matters for k-nearest neighbours:
• “Distance” will be affected more by large features than by small features.
– It matters for regularized least squares:
• Penalizing (wj)² means different things if the features ‘j’ are on different scales.
Standardizing Features
• It is common to standardize continuous features:
– For each feature ‘j’:
1. Compute its mean and standard deviation:
   μj = (1/n) Σi xij,   σj = sqrt((1/n) Σi (xij − μj)²)
2. Subtract the mean and divide by the standard deviation (“z-score”):
   replace xij with (xij − μj)/σj
– Now changes in ‘wj’ have a similar effect for any feature ‘j’.
• How should we standardize test data?
– Wrong approach: use the mean and standard deviation of the test data.
– Training and test mean and standard deviation might be very different.
– Right approach: use the mean and standard deviation of the training data.
• If we’re doing 10-fold cross-validation (a short sketch follows after this list):
– Compute μj and σj based on the 9 training folds (e.g., averaging over 9/10 of the data).
– Standardize the remaining (“validation”) fold with this “training” μj and σj.
– Re-standardize for different folds.
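A minimal sketch of the “right approach”, assuming numpy arrays (the helper names are mine, not from the course): the statistics come from the training folds only and are reused on the validation fold.

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and standard deviation, computed on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    """Apply the *training* statistics to any split (train, validation, or test)."""
    return (X - mu) / sigma

# Within 10-fold cross-validation, refit on each set of 9 training folds:
# mu, sigma = fit_standardizer(X[train_idx])
# X_train_std = apply_standardizer(X[train_idx], mu, sigma)
# X_valid_std = apply_standardizer(X[valid_idx], mu, sigma)
```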
Standardizing the Target
• In regression, we sometimes standardize the targets yi.
– Puts the targets on the same standard scale as the standardized features.
• With a standardized target, setting w = 0 predicts the average yi:
– High regularization makes us predict closer to the average value.
• Again, make sure you standardize the test data with the training stats (see the sketch below).
• Other common transformations of yi are logarithm/exponent:
– Makes sense for geometric/exponential processes.
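And a matching sketch for targets, again using only training statistics (the function and variable names are illustrative):

```python
def fit_target_standardizer(y_train):
    """Mean and standard deviation of the *training* targets."""
    return y_train.mean(), y_train.std()

def standardize_target(y, mean, std):
    return (y - mean) / std

def unstandardize_predictions(y_pred_std, mean, std):
    # Map predictions in standardized units back to the original target units.
    return y_pred_std * std + mean
```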
Regularizing the y-Intercept?
• Should we regularize the y-intercept?
• No! Why encourage it to be closer to zero? (It could be anywhere.)
– You should be allowed to shift the function up/down globally.
• Yes! It makes the solution unique and makes it easier to compute ‘w’.
• Compromise: regularize by a smaller amount than the other variables.
(pause)
Predicting the Future
• In principle, we can use any features xi that we think are relevant.
• This makes it tempting to use time as a feature, and predict the future.
https://gravityandlevity.wordpress.com/2009/04/22/the-fastest-possible-mile/
https://overthehillsports.wordpress.com/tag/hicham-el-guerrouj/
Predicting 100m times 400 years in the future?
https://plus.maths.org/content/sites/plus.maths.org/files/articles/2011/usain/graph2.gif
http://www.washingtonpost.com/blogs/london-2012-olympics/wp/2012/08/08/report-usain-bolt-invited-to-tryout-for-manchester-united/
Interpolation vs. Extrapolation
• Interpolation is the task of predicting “between the data points”.
– Regression models are good at this if you have enough data and the function is continuous.
• Extrapolation is the task of predicting outside the range of the data points.
– Without assumptions, regression models can be embarrassingly bad at this.
• If you run the 100m regression models backwards in time:
– They predict that humans used to be really, really slow!
• If you run the 100m regression models forwards in time:
– They might eventually predict arbitrarily small 100m times.
– The linear model actually predicts negative times in the future.
• These time-traveling races in 2060 should be pretty exciting!
• Some discussion here:
– http://callingbullshit.org/case_studies/case_study_gender_gap_running.html
https://www.smbc-comics.com/comic/rise-of-the-machines
No Free Lunch, Consistency, and the Future
Ockham’s Razor vs. No Free Lunch
• Ockham’s razor is a problem-solving principle:
– “Among competing hypotheses, the one with the fewest assumptions should be selected.”
– Suggests we should select the linear model.
• Fundamental trade-off:
– If two models have the same training error, pick the one less likely to overfit.
– A formal version of Ockham’s problem-solving principle.
– Also suggests we should select the linear model.
• No free lunch theorem:
– There exist possible datasets where you should select the green model.
No Free Lunch, Consistency, and the Future
• We can resolve “blue vs. green” by collecting more data:
Discussion: Climate Models
• Has the Earth warmed up over the last 100 years? (Consistency zone.)
– The data clearly says “yes”.
• Will the Earth continue to warm over the next 100 years? (Generalization error.)
– We should be more skeptical about models that predict future events.
https://en.wikipedia.org/wiki/Global_warming
• So should we all become global warming skeptics?
• If we average over models that overfit in *independent* ways, we expect the test error to be lower, so this gives more confidence:
– We should be skeptical of individual models, but agreeing predictions made by models with different data/assumptions are more likely to be true.
• All the near-future predictions agree, so they are likely to be accurate.
– And it’s probably reasonable to assume fairly continuous change (no big “jumps”).
• Variance is higher further into the future, so those predictions are less reliable.
– Relying more on assumptions and less on data.
Index Funds: Ensemble Extrapolation for Investing
• We want to do extrapolation when investing money.
– What will this be worth in the future?
• Index funds can be viewed as an ensemble method for investing.
– For example, buy stock in the top 500 companies, proportional to their value.
– Tries to follow the average price increase/decrease.
– This simple investing strategy outperforms most fund managers.
http://fibydesign.com/005-introduction-to-index-investing-stocks-index-funds-vtsax/
Summary
• Regularization:
– Adding a penalty on model complexity.
• L2-regularization: a penalty on the L2-norm of the regression weights ‘w’.
– Almost always improves the test error.
• Standardizing features:
– For some models it makes sense to have the features on the same scale.
• Interpolation vs. Extrapolation:
– Machine learning with large ‘n’ is good at predicting “between the data”.
– Without assumptions, it can be arbitrarily bad “away from the data”.
• Next time: learning with an exponential number of irrelevant features.
L2-Regularization
• The standard regularization strategy is L2-regularization:
• Equivalent to minimizing the squared error while keeping the L2-norm small (written out below).
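This equivalence can be stated as a constrained problem: for every λ ≥ 0 there is some radius τ ≥ 0 (depending on λ and the data) such that

minimizing  ½ ||Xw − y||² + (λ/2) ||w||²

gives the same solution w as

minimizing  ½ ||Xw − y||²   subject to   ||w||² ≤ τ.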
Regularization/Shrinking Paradox
• We throw darts at a target:
– Assume we don’t always hit the exact center.
– Assume the darts follow a symmetric pattern around the center.
• Shrinkage of the darts:
1. Choose some arbitrary location ‘0’.
2. Measure the distances from the darts to ‘0’.
3. Move the misses towards ‘0’, by a small amount proportional to the distance from 0.
• If the shrinkage is small enough, the darts will be closer to the center on average (a small simulation follows below).
• Visualization of the related higher-dimensional paradox that the mean of data coming from a Gaussian is not the best estimate of the mean of the Gaussian in 3 dimensions or higher: https://www.naftaliharris.com/blog/steinviz
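A small simulation in the spirit of the darts analogy (the center, noise level, and shrinkage amount are all made-up values): shrinking the darts slightly toward an arbitrary point ‘0’ reduces their average squared distance to the true center.

```python
import numpy as np

np.random.seed(0)
center = np.array([3.0, 1.0, -2.0])              # true center of the target
darts = center + np.random.randn(100_000, 3)     # symmetric misses around the center

eps = 0.1                                         # small shrinkage toward the origin
shrunk = (1 - eps) * darts

print(np.mean(np.sum((darts - center) ** 2, axis=1)))     # ~3.0
print(np.mean(np.sum((shrunk - center) ** 2, axis=1)))    # smaller on average (~2.57)
```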