Isaac Berg
STAT 139 Final Project
November 14, 2016

Introduction: Ever since the NHL was founded in 1917, teams have been searching for different ways in which they can get a leg up on their competition. As constantly evolving technology has made it easier to take more in-depth statistics within games, teams have turned to analysts to use this data and find different ways to give their team a competitive edge. Starting during the 2015-2016 season, hundreds of stats were kept on every player who played at least one game and posted to an Excel file that is open to the public. Using this data set, I will look at six different potential relationships that I think might be found within the data. These relationships include points scored vs. yearly salary, points vs. age, goals vs. age, assists vs. age, goalie heights vs. starts, and players' month of birth.

Data Used: I pulled this data from a website called hockeyabstract.com, pulled the data I needed out onto another Excel file, and then loaded it as a CSV file into R. In order to avoid outliers of players who were only called up for short stints, I decided to only use data from players who had played a minimum of 20 games in the season.
> install.packages("gsheet")
> library(gsheet)
> # Full skater sheet
> Hockey <- gsheet2tbl("https://docs.google.com/spreadsheets/d/17h48ZPLqbV8qv2mEJDZ3uIyf9USypixQq1L1aHPPLY0/edit?usp=sharing")
> # Reduced skater sheet used for the analysis
> mainhockey <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1qQoY7ofuQz8iK4h1zJtzk4zSSngrULK-nnc7-4JJv74/edit?usp=sharing")
> # Keep skaters with more than 20 games played
> mainhockey <- mainhockey[which(mainhockey$GP > 20), ]
> # Goalie sheet; keep goalies with more than 5 games started
> fullgoalie <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1E04cp2XPno5e4acyXR6bpMylszg_NoHpHrO7ggTEFYo/edit?usp=sharing")
> goalie <- fullgoalie[which(fullgoalie$GS > 5), ]
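The games-played filter above uses a strict inequality, while the text describes a minimum of 20 games. The sketch below shows the difference between the two conditions on a small made-up table; the `players` data frame and its values are hypothetical and are not taken from the project's spreadsheets.

```r
# Toy stand-in for the skater table; all values are made up for illustration.
players <- data.frame(
  Name = c("A", "B", "C", "D"),
  GP   = c(82, 20, 5, NA),
  PTS  = c(60, 8, 1, 0)
)

# Strictly more than 20 games, as in the filter above (GP == 20 is dropped).
# which() also drops rows where GP is missing.
more_than_20 <- players[which(players$GP > 20), ]

# At least 20 games, matching the "minimum of 20 games" wording.
at_least_20 <- players[which(players$GP >= 20), ]

nrow(more_than_20)
nrow(at_least_20)
```

Which condition is appropriate depends on whether players with exactly 20 games should count as meeting the minimum.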
Data/Results:

Figure A: Salary Vs. Points

> plot(mainhockey$Salary, mainhockey$PTS, main = "Salary Vs. Points")
> cor.test(mainhockey$Salary, mainhockey$PTS)

	Pearson's product-moment correlation

data:  mainhockey$Salary and mainhockey$PTS
t = 17.172, df = 663, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4998936 0.6053295
sample estimates:
      cor
0.5548354

This plot appears to show a positive correlation, and cor.test() confirms it, although at 0.55 the correlation is only moderate. In an attempt to show a stronger positive correlation, I decided to remove all defensemen, as they are generally paid to prevent goals rather than score them. After taking these players out, I redid both the scatterplot and the correlation test.
> salvpoints <- mainhockey[which(mainhockey$Pos != "D"), ]
> plot(salvpoints$Salary, salvpoints$PTS, main = "Salary vs. Points for Forwards")
> cor.test(salvpoints$Salary, salvpoints$PTS)

	Pearson's product-moment correlation

data:  salvpoints$Salary and salvpoints$PTS
t = 21.18, df = 590, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6088769 0.7006888
sample estimates:
      cor
0.6572141
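The numbers reported by cor.test() for Figure A can be reproduced from the correlation and the degrees of freedom alone. The sketch below uses the standard formulas for a Pearson correlation test (a t statistic with n - 2 degrees of freedom and a confidence interval from the Fisher z-transformation) and plugs in the values from the all-skaters output; the variable names are illustrative.

```r
# Values reported by cor.test() for salary vs. points (all skaters).
r  <- 0.5548354
df <- 663            # degrees of freedom, n - 2
n  <- df + 2

# t statistic for the null hypothesis of zero correlation.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results should agree closely with the t value and confidence interval printed above, which shows that the very small p-value comes mostly from the large sample size rather than from an unusually strong correlation.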
Figure B: Age Vs. Points

> plot(mainhockey$Age, mainhockey$PTS, main = "Age vs. Points")
> cor.test(mainhockey$Age, mainhockey$PTS)

	Pearson's product-moment correlation

data:  mainhockey$Age and mainhockey$PTS
t = 0.50098, df = 664, p-value = 0.6166
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.05661764  0.09526932
sample estimates:
       cor
0.01943799
This data has a very slight positive correlation at 0.019, but with a p-value of 0.62 it fails to reject the null hypothesis of no correlation. It would make sense for points to be roughly bell-shaped (normally distributed) across age, as younger players would score fewer points as they are introduced to the game, score their most points as their experience and athleticism peak, and then drop in points as their bodies begin to age.

> plot(model)   # diagnostic plots (including the Q-Q plot); model is the goals fit defined under Figure C
> hockey2 <- mainhockey[, c("G", "Age", "Month", "GP", "Salary")]
> pairs(hockey2)
> model1 <- lm(PTS ~ Age + Month + GP + Pos + Salary, data = mainhockey)
> summary(model1)

Call:
lm(formula = PTS ~ Age + Month + GP + Pos + Salary, data = mainhockey)

Residuals:
    Min      1Q  Median      3Q     Max
-27.786  -8.132  -0.611   6.529  42.707

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.33573    3.40921   4.498 8.12e-06 ***
Age           -1.16740    0.11368 -10.269  < 2e-16 ***
Month          0.02076    0.13476   0.154    0.878
GP             0.53271    0.02651  20.098  < 2e-16 ***
PosC/LW       -2.56243    2.13323  -1.201    0.230
PosC/LW/RW    -3.45631    5.80289  -0.596    0.552
PosC/N         5.08413   11.50769   0.442    0.659
PosC/RW        0.59291    2.55450   0.232    0.817
PosC/RW/LW    -1.96587    4.78371  -0.411    0.681
PosD          -9.61561    1.28959  -7.456 2.88e-13 ***
PosD/RW        6.07183    8.15749   0.744    0.457
PosLW          1.98615    1.91929   1.035    0.301
PosLW/C       -1.99046    2.33584  -0.852    0.394
PosLW/C/RW   -10.14135   11.47687  -0.884    0.377
PosLW/RW      -3.18540    2.03791  -1.563    0.119
PosLW/RW/C     3.73020    8.13804   0.458    0.647
PosRW          1.18813    1.90498   0.624    0.533
PosRW/C       -2.30766    3.34250  -0.690    0.490
PosRW/LW      -1.67797    1.94115  -0.864    0.388
Salary         4.24058    0.22858  18.551  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.41 on 645 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6629,	Adjusted R-squared:  0.653
F-statistic: 66.77 on 19 and 645 DF,  p-value: < 2.2e-16

Figure C: Age Vs. Goals

> plot(mainhockey$Age, mainhockey$G, main = "Age vs. Goals")
> cor.test(mainhockey$Age, mainhockey$G)

	Pearson's product-moment correlation

data:  mainhockey$Age and mainhockey$G
t = -0.29738, df = 664, p-value = 0.7663
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.08743500  0.06448892
sample estimates:
        cor
-0.01153963
Much like Figure B, the correlation between age and goals scored (here slightly negative) is so small that it should be treated as no correlation.

> model <- lm(G ~ Age + Month + GP + Pos + Salary, data = mainhockey)
> summary(model)

Call:
lm(formula = G ~ Age + Month + GP + Pos + Salary, data = mainhockey)

Residuals:
    Min      1Q  Median      3Q     Max
-12.426  -3.808  -0.650   2.854  24.306

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.804736   1.664988   4.688 3.38e-06 ***
Age          -0.504130   0.055521  -9.080  < 2e-16 ***
Month         0.005292   0.065816   0.080  0.93594
GP            0.201754   0.012945  15.586  < 2e-16 ***
PosC/LW      -0.191130   1.041825  -0.183  0.85450
PosC/LW/RW   -1.031605   2.834011  -0.364  0.71597
PosC/N        1.699149   5.620113   0.302  0.76250
PosC/RW       2.293083   1.247564   1.838  0.06651 .
PosC/RW/LW    0.505200   2.336264   0.216  0.82887
PosD         -6.752247   0.629807 -10.721  < 2e-16 ***
PosD/RW       0.838773   3.983946   0.211  0.83331
PosLW         2.542346   0.937342   2.712  0.00686 **
PosLW/C      -0.737991   1.140776  -0.647  0.51791
PosLW/C/RW   -5.127938   5.605064  -0.915  0.36060
PosLW/RW      1.077599   0.995271   1.083  0.27934
PosLW/RW/C    1.790214   3.974447   0.450  0.65255
PosRW         1.875312   0.930353   2.016  0.04425 *
PosRW/C      -0.530768   1.632406  -0.325  0.74518
PosRW/LW      1.189815   0.948015   1.255  0.20991
Salary        1.594899   0.111636  14.287  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.57 on 645 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6087,	Adjusted R-squared:  0.5971
F-statistic: 52.8 on 19 and 645 DF,  p-value: < 2.2e-16
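To make the coefficients of the goals model concrete, the sketch below computes a fitted value by hand for one made-up player. The player's age, birth month, games played, and salary are hypothetical. Salary is assumed to be recorded in millions of dollars, and the baseline position is assumed to be center ("C"), since that level does not appear among the Pos coefficients.

```r
# Coefficients copied from the summary(model) output for goals.
coefs <- c(intercept = 7.804736, Age = -0.504130, Month = 0.005292,
           GP = 0.201754, Salary = 1.594899, PosD = -6.752247)

# Hypothetical player (not from the data set): 25 years old, born in
# January, 82 games played, salary of 3 (assumed to be in millions).
age <- 25; month <- 1; gp <- 82; salary <- 3

# Fitted goals for the baseline position, assumed to be center ("C").
goals_center <- coefs[["intercept"]] + coefs[["Age"]] * age +
  coefs[["Month"]] * month + coefs[["GP"]] * gp +
  coefs[["Salary"]] * salary

# Same player listed as a defenseman: add the PosD coefficient.
goals_defense <- goals_center + coefs[["PosD"]]

round(c(center = goals_center, defense = goals_defense), 2)
```

Under these assumptions the model predicts roughly 16.5 goals for the center and roughly 9.8 for the defenseman. The intercept alone (about 7.8) corresponds to all numeric predictors being zero, so it is a reference point rather than a realistic player.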
Figure D: Age Vs. Assists

> plot(mainhockey$Age, mainhockey$A, main = "Age vs. Assists")
> cor.test(mainhockey$Age, mainhockey$A)

	Pearson's product-moment correlation

data:  mainhockey$Age and mainhockey$A
t = 1.0066, df = 664, p-value = 0.3145
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03704694  0.11466691
sample estimates:
       cor
0.03903494
Much like Figure B, there is only a slight positive correlation between age and assists, but it is so small that it should be treated as no correlation.

> model2 <- lm(A ~ Age + Month + GP + Pos + Salary, data = mainhockey)
> summary(model2)

Call:
lm(formula = A ~ Age + Month + GP + Pos + Salary, data = mainhockey)

Residuals:
    Min      1Q  Median      3Q     Max
-22.671  -5.036  -0.864   3.912  34.242

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.53100    2.35661   3.196  0.00146 **
Age          -0.66327    0.07858  -8.440  < 2e-16 ***
Month         0.01547    0.09315   0.166  0.86815
GP            0.33095    0.01832  18.063  < 2e-16 ***
PosC/LW      -2.37130    1.47459  -1.608  0.10830
PosC/LW/RW   -2.42471    4.01123  -0.604  0.54574
PosC/N        3.38498    7.95465   0.426  0.67059
PosC/RW      -1.70018    1.76579  -0.963  0.33599
PosC/RW/LW   -2.47107    3.30673  -0.747  0.45516
PosD         -2.86336    0.89142  -3.212  0.00138 **
PosD/RW       5.23306    5.63884   0.928  0.35373
PosLW        -0.55620    1.32670  -0.419  0.67519
PosLW/C      -1.25247    1.61464  -0.776  0.43821
PosLW/C/RW   -5.01341    7.93335  -0.632  0.52765
PosLW/RW     -4.26300    1.40870  -3.026  0.00258 **
PosLW/RW/C    1.93999    5.62539   0.345  0.73031
PosRW        -0.68718    1.31681  -0.522  0.60195
PosRW/C      -1.77689    2.31049  -0.769  0.44214
PosRW/LW     -2.86779    1.34181  -2.137  0.03295 *
Salary        2.64568    0.15801  16.744  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.884 on 645 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.5963,	Adjusted R-squared:  0.5844
F-statistic: 50.15 on 19 and 645 DF,  p-value: < 2.2e-16

Figure E: Distribution of Months

> hist(mainhockey$Month, main = "Distribution of Months")
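The histogram above is read by eye. One way to check more formally whether births are spread evenly across months is a chi-squared goodness-of-fit test, which is not part of the original analysis. The sketch below uses made-up monthly counts purely for illustration; they are not the project's data.

```r
# Hypothetical birth-month counts (January to December); NOT the project's data.
month_counts <- c(70, 62, 60, 58, 57, 55, 52, 50, 51, 49, 50, 51)
names(month_counts) <- month.abb

# Goodness-of-fit test against equal probability for each month.
# Equal probabilities ignore differing month lengths, which is a simplification.
fit <- chisq.test(month_counts)

fit$statistic   # chi-squared statistic
fit$parameter   # degrees of freedom: 12 months - 1
fit$p.value
```

With the real data, the same call could be applied to the observed counts, for example `chisq.test(table(mainhockey$Month))`, assuming Month is coded 1 through 12 and every month occurs at least once.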
Figure F: Goalie Heights vs. Starts

A trend I have noticed over the last few years is the increasing emphasis placed on having taller goalies on NHL teams, as they can take up more space in the net without sacrificing quickness; everyone in the NHL has become so quick that any difference in quickness has become almost negligible. I decided to test whether there was any correlation between the height of a goalie and the number of games they start in a season.
> plot(goalie$HT, goalie$GS, main = "Height vs. Games Started")
> cor.test(goalie$HT, goalie$GS)

	Pearson's product-moment correlation

data:  goalie$HT and goalie$GS
t = 1.2094, df = 69, p-value = 0.2306
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.09233123  0.36510728
sample estimates:
      cor
0.1440761
Figure G: Pairs Data Plot

> hockey2 <- mainhockey[, c("G", "Age", "Month", "GP", "Salary")]
> pairs(hockey2)
Conclusion:

For Figure A, there is a positive correlation between salary and points scored, which means that a higher salary generally goes with more points scored by that player. After taking out players who are categorized only as defensemen and running the tests again, the correlation did in fact jump from 0.55 to just over 0.65. This shows how teams in the NHL place such a high value on points for their forwards, as it is the easiest way to generalize success on the ice.

For Figure B, there is a very slight positive correlation between age and points scored, but it is close enough to zero that we fail to reject the null of no correlation. I then created a linear model to test points against all of the other variables. Based on this model, the intercept is about 15 points; this is the fitted value for the baseline position (which appears to be center) with the other predictors at zero, so it is a reference point rather than a realistic average. Holding the other variables constant, each additional year of age generally takes about 1 point off a player's total, each extra game played generally adds about 0.5 points, and each extra million dollars added to the player's salary generally increases their points by about 4.

For Figure C, there was a very slight negative correlation between age and goals scored, but it was so close to zero that it failed to reject the null hypothesis of no correlation. Next I created a linear model of this data and found that the intercept was about 8 goals, each additional year of age generally takes off about 0.5 of a goal, each additional game played adds about 0.2 of a goal, and each additional million added to a player's salary generally adds about 1.6 goals, in each case holding the other variables constant. I also included the Q-Q plot of this linear model, which was about the same for Figures B-D, and concluded that the residuals were approximately normally distributed except in the highest and lowest sections.

For Figure D, there was a very slight positive correlation between age and assists earned, but it was close enough to zero that it failed to reject the null hypothesis of no correlation. Next I created a linear model of this data and found that the intercept was about 7.5 assists, each additional year of age takes away about 0.66 of an assist, each additional game played adds about 0.33, and each extra million earned adds about 2.6 assists, again holding the other variables constant.

For Figure E, I decided to look at this data because of something called the "baseball effect" (more commonly known as the relative age effect). This is the idea that players who are born earlier in the year (January, February, March) are usually older than the other players in their youth hockey age group because of the cutoff date, and thus generally perform better and receive more attention from pro scouts when the time comes. Although this data did not show anything incredibly significant, the highest concentration of players was born in January, which is consistent with this idea to some extent. This phenomenon has already been looked into by many other people, for example (http://fans.canadiens.nhl.com/community/topic/21826-study-correlation-of-birth-month-and-of-canadians-in-the-nhl/).

For Figure F, there is a very slight positive correlation between goalie heights and games started, but at 0.144 it is not strong enough to reject the null. This does not show any significant evidence that taller goaltenders get started more often than shorter ones, which would make sense, as there are many more factors in play when deciding on a goalie for each NHL team than just height.

For Figure G, I used the pairs() function to explore different relationships within the data. The two main things that I noticed were that there was a positive exponential relationship between games played and goals, and that there was a linear relationship between salary and games played.