Data manipulation · emeyers.scripts.mit.edu/.../cs149_slides/class11.pdf
TRANSCRIPT
![Page 1: Data manipula+onemeyers.scripts.mit.edu/.../CS149_slides/class11.pdf · Randomly split the data: • ½ if the data is in the training set • ½ of the data is in the test set Fit](https://reader031.vdocuments.us/reader031/viewer/2022040609/5ecdd0dc95d57f20f50e6aa1/html5/thumbnails/1.jpg)
Data manipulation
Outline for today
Better know a sport: Roberto Clemente
Review of multiple linear regression
Manipulating data with dplyr
Worksheet 5
Better know a player: Roberto Clemente
Announcements
Extra office hours this Friday. There is a signup link on Moodle, or email me.
Start thinking about your final project for the class. The guidelines for the final project are on Moodle.
A project proposal is due on Wednesday, March 29th.
Review
Regression
Regression is a method of using one variable to predict the value of a second variable. In linear regression we fit a line to the data, called the regression line.
ŷ = a + b·x
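As a minimal sketch of fitting a regression line in R, the example below uses a small made-up data set (the vectors x and y are hypothetical and not from the course data) and recovers the intercept a and slope b with lm().

```r
# Hypothetical data: x and y are made-up values for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit the least squares regression line y-hat = a + b * x
fit <- lm(y ~ x)

# Extract the intercept (a) and the slope (b) from the fitted model
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])

# Predictions made by plugging x into the fitted line
y.hat <- a + b * x
```

The predictions in y.hat should match what fitted(fit) returns for the same data.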
Measuring goodness of fit
Residual = Observed - Predicted = y - ŷ
We can measure how well the line fits the data using the mean squared error (MSE), the average of the squared residuals:
MSE = (1/n) · Σ (yᵢ - ŷᵢ)²
The least squares regression line minimizes the MSE.
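The residuals and the MSE can be computed by hand from a fitted model. The sketch below reuses a small made-up data set (x, y, and the comparison line 0.5 + 1.8·x are hypothetical) to illustrate that the least squares line has a smaller MSE than another candidate line.

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares fit and its predictions
fit <- lm(y ~ x)
y.hat <- predict(fit)

# Residual = Observed - Predicted
residuals.by.hand <- y - y.hat

# MSE: the mean of the squared residuals
mse <- mean(residuals.by.hand^2)

# MSE of an arbitrary alternative line, for comparison
other.y.hat <- 0.5 + 1.8 * x
other.mse <- mean((y - other.y.hat)^2)
```

Any other choice of intercept and slope gives an MSE at least as large as mse on this data.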
Multiple regression
We have observed that some statistics combine multiple types of measurements
Batting average: BA = [(1)·1B + (1)·2B + (1)·3B + (1)·HR] / AB
Slugging percentage: Slug = [(1)·1B + (2)·2B + (3)·3B + (4)·HR] / AB
Generic linear statistic: stat = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0
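To make the weighted-sum idea concrete, here is a short worked example. The season totals are made-up numbers for a hypothetical player; only the weights come from the formulas above.

```r
# Hypothetical season totals for one player (made-up numbers)
counts <- c(X1B = 100, X2B = 30, X3B = 5, HR = 20)
AB <- 500

# Batting average weights every hit equally
ba.weights <- c(1, 1, 1, 1)
BA <- sum(ba.weights * counts) / AB

# Slugging percentage weights each hit by the number of bases
slug.weights <- c(1, 2, 3, 4)
Slug <- sum(slug.weights * counts) / AB
```

Both statistics use the same counts and differ only in the weights, which is what motivates searching for better weights.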
Multiple regression
Multiple regression makes predictions (ŷ) using multiple variables
We can find the optimal weights (wi's) for a combination of basic statistics by minimizing the MSE on
ŷ = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0
fit <- lm(R ~ BB + HBP + H + X2B + X3B + HR, data = team.batting.162)
The R model fit contains the weights wi's for making predictions ŷ
coef(fit)
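Since team.batting.162 is course data, the sketch below uses a simulated stand-in (the data frame team.data and the true weights 0.3, 1.4, and 100 are assumptions chosen for illustration) to show how the weights from coef() reproduce the model's predictions.

```r
# Simulated stand-in for team batting data (hypothetical values)
set.seed(1)
n <- 100
team.data <- data.frame(
  BB = rpois(n, 500),
  HR = rpois(n, 150)
)
team.data$R <- 0.3 * team.data$BB + 1.4 * team.data$HR + 100 + rnorm(n, sd = 20)

# Fit a multiple regression with two predictors
fit <- lm(R ~ BB + HR, data = team.data)

# The weights: the intercept (w0) plus one weight per predictor
weights <- coef(fit)

# Predictions rebuilt by hand from the weights
y.hat.by.hand <- weights["(Intercept)"] +
  weights["BB"] * team.data$BB +
  weights["HR"] * team.data$HR
```

The hand-built predictions should agree with fitted(fit) up to floating point error.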
What is the best statistic we can create?
Suppose we fit:
fit <- lm(R ~ BB + H + X2B + X3B + HR + BRA + OBP + AB, data = team.batting.162)
RMSE:
sqrt(mean(fit$residuals^2))
23.49
How accurate would this RMSE be if we applied it to new data?
Overfitting
Fitting a model too precisely to the data at hand, in such a way that it does not generalize to new data, is called overfitting
If we used the same data to fit our model (find the wi's) as we did to evaluate whether it was a good fit, our estimate of RMSE might be too optimistic
• i.e., our estimate of the RMSE might be too small
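One way to see why the training RMSE can be too optimistic is to add predictors that are pure noise. In the simulated sketch below (all data and names such as sim.data and noise1 are hypothetical), the RMSE computed on the fitting data does not get worse even though the extra variables carry no real information.

```r
# Simulated data: y depends on x only (hypothetical values)
set.seed(42)
n <- 30
sim.data <- data.frame(x = rnorm(n))
sim.data$y <- 2 * sim.data$x + rnorm(n)

# Add 10 predictors that are pure noise
for (i in 1:10) {
  sim.data[[paste0("noise", i)]] <- rnorm(n)
}

# RMSE computed on the same data used to fit the model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

fit.small <- lm(y ~ x, data = sim.data)  # the sensible model
fit.big <- lm(y ~ ., data = sim.data)    # also uses the noise columns

rmse.small <- rmse(fit.small)
rmse.big <- rmse(fit.big)
```

Here rmse.big is no larger than rmse.small, yet the bigger model has only learned quirks of this particular sample.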
Overfitting
One should always use different data when fitting and evaluating a model!
Cross-validation
Cross-validation is a method for assessing the goodness of a model in a way that can avoid overfitting
What we do is build the model on one set of data, called the training set
• i.e., find the coefficients on one set of data
Then we evaluate whether the model fits well on a second set of data, called the test set
• i.e., predict the ŷ's based on x's from a new data set
If the model is truly good, we should get good predictions on the test set
• i.e., a small RMSE on the test set
Matthew's Batting Statistic (MBS)
Randomly split the data:
• ½ of the data is in the training set
• ½ of the data is in the test set
Fit the model using the training data for the MBS:
fit <- lm(R ~ BB + H + X2B + …, data = training.data)
Make predictions on the test data
predicted.yhats <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))
RMSE for predictions made using the same training data x's and y's: 23.18
RMSE for predictions made on the test data x's and y's: 24.63
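The split-fit-evaluate steps above can be sketched end to end as follows. The data frame team.data is simulated (its columns, weights, and noise level are assumptions), and sample() is one common way to pick the random half; the slides do not say how the split was actually done.

```r
# Simulated stand-in for the team batting data (hypothetical values)
set.seed(123)
n <- 200
team.data <- data.frame(
  BB = rpois(n, 500),
  H = rpois(n, 1400),
  HR = rpois(n, 150)
)
team.data$R <- with(team.data, 0.3 * BB + 0.4 * H + 1.0 * HR - 150) +
  rnorm(n, sd = 25)

# Randomly choose half of the rows for the training set
train.rows <- sample(n, size = n / 2)
training.data <- team.data[train.rows, ]
test.data <- team.data[-train.rows, ]

# Fit on the training data only
fit <- lm(R ~ BB + H + HR, data = training.data)

# RMSE on the data the model was fit to
training.RMSE <- sqrt(mean(residuals(fit)^2))

# RMSE on data the model has never seen
predicted.yhats <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))
```

Comparing training.RMSE with cross.validated.RMSE gives a sense of how optimistic the training estimate is for a given split.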
Manipulating data with dplyr
R packages add additional functions to R
library('package.name')
dplyr is a very useful package for manipulating data frames
library('dplyr')
There are several very useful functions in the dplyr package including:
• filter()
• select()
• mutate()
• group_by()
• summarize()
All these functions take a data frame as input and return a data frame as output
filter()
The filter() function returns a subset of rows in a data frame
Example:
all.data <- get.Lahman.batting.data()
red.sox.data <- filter(all.data, teamID == "BOS")
Question: How could we get all players who have fewer than 300 PA?
max.300PA <- filter(all.data, PA < 300)
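Because get.Lahman.batting.data() is a course-specific helper, the sketch below substitutes a tiny made-up data frame (toy.data and all of its values are hypothetical) so the two filter() calls can be run anywhere dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data
toy.data <- data.frame(
  playerID = c("a", "b", "c", "d"),
  teamID = c("BOS", "NYA", "BOS", "CHN"),
  yearID = c(2015, 2015, 2014, 2015),
  PA = c(650, 120, 310, 299),
  stringsAsFactors = FALSE
)

# Keep only the rows for Boston
red.sox.data <- filter(toy.data, teamID == "BOS")

# Keep only the rows with fewer than 300 plate appearances
max.300PA <- filter(toy.data, PA < 300)
```

Each call keeps every column and drops the rows where the condition is not TRUE.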
The pipe operator %>%
The pipe operator %>% allows us to chain commands together
all.data <- get.Lahman.batting.data()
red.sox.2015 <- all.data %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)
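The pipe passes the result on its left as the first argument of the function on its right. As a rough illustration on a made-up data frame (toy.data is hypothetical), the piped version below gives the same result as nesting the calls by hand.

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data
toy.data <- data.frame(
  playerID = c("a", "b", "c", "d"),
  teamID = c("BOS", "NYA", "BOS", "CHN"),
  yearID = c(2015, 2015, 2014, 2015),
  stringsAsFactors = FALSE
)

# Chained with the pipe: read top to bottom
piped <- toy.data %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)

# The same steps as nested function calls: read inside out
nested <- filter(filter(toy.data, teamID == "BOS"), yearID == 2015)
```

The piped form tends to be easier to read once more than two steps are involved.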
select()
The select() function returns a subset of the variables
• i.e., subset of the columns of a data frame
Example:
all.data <- get.Lahman.batting.data()
data.hits.and.walks <- select(all.data, H, BB)
Question: How could we select only home runs and doubles?
data.homeruns.and.doubles <- select(all.data, HR, X2B)
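A small runnable version of the select() examples, using a made-up data frame (toy.data and its values are hypothetical) in place of the course data:

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data
toy.data <- data.frame(
  playerID = c("a", "b", "c"),
  H = c(150, 100, 80),
  BB = c(60, 40, 20),
  HR = c(20, 10, 5),
  X2B = c(30, 20, 15),
  stringsAsFactors = FALSE
)

# Keep only the hits and walks columns
data.hits.and.walks <- select(toy.data, H, BB)

# Keep only the home runs and doubles columns
data.homeruns.and.doubles <- select(toy.data, HR, X2B)
```

The number of rows is unchanged; only the columns named in the call are kept, in the order given.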
mutate()
The mutate() function adds new variables to a data frame from variables that are already in the data frame
• i.e., creates new columns from old columns
Example:
data.with.1B <- mutate(all.data, X1B = H - X2B - X3B - HR)
Question:
• How can we add BRA (which is OBP * SlugPct) to our data frame?
data.with.BRA <- mutate(all.data, BRA = OBP * SlugPct)
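Both mutate() examples can be tried on a made-up data frame (toy.data and its values are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data
toy.data <- data.frame(
  H = c(150, 100),
  X2B = c(30, 20),
  X3B = c(5, 2),
  HR = c(20, 10),
  OBP = c(0.350, 0.300),
  SlugPct = c(0.500, 0.400)
)

# Singles are the hits that are not doubles, triples, or home runs
data.with.1B <- mutate(toy.data, X1B = H - X2B - X3B - HR)

# BRA is the product of on-base percentage and slugging percentage
data.with.BRA <- mutate(toy.data, BRA = OBP * SlugPct)
```

The original columns are kept and the new column is appended at the end.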
group_by()
The group_by() function groups the cases in a data frame by the values of a categorical variable
• by itself it does nothing, but it is useful in conjunction with the summarize() function as described on the next slide
Example:
data.team.grouped <- group_by(all.data, teamID)
Question:
• How can we group data by year?
data.year.grouped <- group_by(all.data, yearID)
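To see what group_by() records, the sketch below groups a made-up data frame (toy.data is hypothetical) and inspects the result with dplyr's group_vars() and n_groups() helpers. The rows themselves are left as they were.

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data
toy.data <- data.frame(
  teamID = c("BOS", "BOS", "NYA", "NYA"),
  yearID = c(2014, 2015, 2014, 2015),
  H = c(150, 140, 120, 130),
  stringsAsFactors = FALSE
)

# Mark the data as grouped by team
data.team.grouped <- group_by(toy.data, teamID)

# Which variable defines the groups, and how many groups there are
grouping.variable <- group_vars(data.team.grouped)
number.of.groups <- n_groups(data.team.grouped)
```

The grouped data frame has the same rows and columns as before; the grouping only matters once a function such as summarize() is applied.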
summarize()
The summarize() function reduces the data based on the grouping assigned by the group_by() function
• i.e., it takes many cases and creates summary statistics from these cases separately for each grouping.
Example:
data.team.grouped <- all.data %>%
  group_by(teamID) %>%
  summarize(sum(H, na.rm = TRUE))
Question: How can we get the total number of hits as a function of the year?
data.year.grouped <- all.data %>%
  group_by(yearID) %>%
  summarize(sum(H, na.rm = TRUE))
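A runnable sketch of group_by() followed by summarize(), on a made-up data frame with one missing value (toy.data is hypothetical). Naming the summary column, here total.H, is an addition for readability and is not in the slide's code.

```r
library(dplyr)

# Hypothetical stand-in for the course's batting data, with one missing H
toy.data <- data.frame(
  teamID = c("BOS", "BOS", "NYA", "NYA"),
  yearID = c(2014, 2015, 2014, 2015),
  H = c(150, NA, 120, 130),
  stringsAsFactors = FALSE
)

# Total hits per team; na.rm = TRUE skips the missing value
hits.by.team <- toy.data %>%
  group_by(teamID) %>%
  summarize(total.H = sum(H, na.rm = TRUE))

# Total hits per year
hits.by.year <- toy.data %>%
  group_by(yearID) %>%
  summarize(total.H = sum(H, na.rm = TRUE))
```

The result has one row per group. Without na.rm = TRUE, any group containing a missing value would sum to NA.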
Summary of some of what we have learned about descriptive statistics
Descriptive statistics: median, mean, standard deviation, percentiles, five number summary, range, interquartile range, z-scores, correlation
Plots: bar plots, pie charts, histograms, box plots, scatter plots
Regression: linear regression, multiple regression, residuals, RMSE, overfitting
A lot about baseball and analyzing data in R
Worksheet 5
get.worksheet(5)
Please get started on this worksheet early; some of the questions on the worksheet might be challenging!
After the break: probability and inferential statistics