A Statistical Determination of the Characteristics of Playoff Teams in
The National Hockey League
Dan Foehrenbach & Chris [email protected] & [email protected]
UP-STAT 2013, April 6
Outline
Predicting which teams make the playoffs is difficult, yet achievable using multivariate techniques
Two-dimensional plot of playoff teams and non-playoff teams
Unsupervised learning and supervised learning
Cluster analysis
Four distinct clusters
Different modeling techniques to predict whether or not a given team will make the playoffs
Linear discriminant analysis
Current season predictions
Statement of the Problem
The main purpose of this report is to analyze team data and apply it to uncovering what makes a team successful
Discovery of common trends between good and bad teams
Data were collected over the past 11 years
The variables were chosen so that every aspect of hockey is represented in the analysis
Response variable and dummy variable (Playoffs, SC)
Variables
GG Goals Scored Per Game
GAG Goals Against Per Game
GFAR Goals For/Goals Against Ratio
PP Power Play Percentage
PK Penalty Kill Percentage
S.G Shots Per Game
SA.G Shots Against Per Game
Sc1 Win Percentage When Scoring First
Tr1st Win Percentage When Trailing After 1st Period
FO Faceoff Win Percentage
SV Save Percentage
Statistical Analysis
Exploratory Data Analysis
Gain an overall sense of the dataset, how playoff teams and Stanley Cup winners are dissimilar, or similar, to the rest of the teams
Multidimensional scaling (the 11 explanatory dimensions reduced to a 2D plot of the data)
Modeling Stanley Cup winners would be too difficult with only 11 data points
Also gain a sense of which variables best distinguish between playoff teams and non-playoff teams in a univariate sense
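The slides do not state which MDS variant or software produced the 2D plot. As a minimal sketch, assuming classical (Torgerson) MDS on Euclidean distances, the reduction could look like the following. The function name `classical_mds` and the four toy points are hypothetical and are not the team data.

```python
import math
import random


def classical_mds(points, dims=2, iters=500, seed=0):
    """Classical (Torgerson) MDS: embed `points` in `dims` dimensions."""
    n = len(points)
    # Squared Euclidean distance between every pair of observations.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
          for p in points]
    # Double-center the squared distances: B = -1/2 * J * D2 * J.
    row = [sum(r) / n for r in d2]
    total = sum(row) / n
    b = [[-0.5 * (d2[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenvector of the current B.
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                  for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


# Hypothetical toy rows in 3 dimensions (not the hockey data).
toy = [(0, 0, 5), (3, 0, 5), (0, 1, 5), (3, 1, 5)]
embedding = classical_mds(toy)
for point in embedding:
    print([round(c, 3) for c in point])
```

With real variables on very different scales (percentages versus per-game counts), standardizing each column before computing distances would probably be needed; that step is an assumption and is left out here.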
2D Plot of Data After Multi-Dimensional Scaling
Boxplots of Explanatory Variables
Unsupervised Learning
Clustering Analysis
Determine what similarities, or dissimilarities, exist among teams without any prior knowledge of group membership
Look for clear statistical distinctions that can separate teams into 2 or more groupings, such as playoff and non-playoff
Hierarchical clustering using Ward’s method (to determine the cluster means)
Refinement of Ward clusters using kMeans clustering
Four clusters yielded the most useful interpretations
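A minimal sketch of the refinement step, assuming it means running k-means with the Ward cluster means as starting centroids. The seeds below are the GG and GAG cluster means reported in this deck; the eight team rows and the name `kmeans_from_seeds` are hypothetical, and only two of the eleven variables are used to keep the example short.

```python
def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd's k-means started from the supplied centroids."""
    centroids = [list(c) for c in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Seeds: (GG, GAG) means of Clusters 1-4 from the cluster means table.
ward_means = [(2.71, 3.07), (3.02, 2.70), (2.49, 2.75), (2.73, 2.41)]
# Hypothetical (GG, GAG) rows for eight team-seasons.
teams = [(2.7, 3.1), (2.8, 3.0), (3.0, 2.7), (3.1, 2.6),
         (2.5, 2.8), (2.4, 2.7), (2.7, 2.4), (2.8, 2.4)]

labels, centroids = kmeans_from_seeds(teams, ward_means)
print(labels)
```

Seeding from the hierarchical solution keeps the cluster numbering aligned with the Ward clusters, which random starts would not guarantee.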
Number of Clusters
Cluster Interpretations and Means
Cluster 1: Teams that have a low win percentage and are offensively oriented
Cluster 2: Teams that have a high win percentage and are offensively oriented
Cluster 3: Teams that have a low win percentage and are defensively oriented
Cluster 4: Teams that have a high win percentage and are defensively oriented
Cluster Means
            GG    GAG   GFAR  PP     PK     S.G    SA.G   Sc1   Tr1st  FO     SV
Cluster 1   2.71  3.07  0.92  16.96  79.50  29.13  31.43  0.61  0.27   49.22  0.90
Cluster 2   3.02  2.70  1.09  20.29  82.52  30.77  29.03  0.70  0.36   50.46  0.90
Cluster 3   2.49  2.75  0.93  14.82  83.58  27.42  29.36  0.59  0.23   49.56  0.90
Cluster 4   2.73  2.41  1.12  17.01  85.25  29.37  26.52  0.69  0.34   50.81  0.90
kMeans Clusters Using Centroids from Ward Clusters
There is a clear distinction between the 4 clusters in only 2 dimensions
The majority of the playoff teams are coming from the clusters with higher win percentage
The playoff teams from clusters 2 and 3 tend to be borderline with cluster 1 or 4. This implies that these teams are likely “bubble” teams – those teams that are ranked 7th/8th in their respective conferences
Stanley Cup Winners
The Stanley Cup champions over the past 11 years and the subgroup each team belonged to
Stanley Cup Winners Over the Past 11 Years and Cluster Membership
Offensive Oriented Defensive Oriented
Boston Bruins 2010 Pittsburgh Penguins 2008
Anaheim Ducks 2006 Chicago Blackhawks 2009
Detroit Red Wings 2007 Tampa Bay Lightning 2003
Carolina Hurricanes 2005
New Jersey Devils 2002
Colorado Avalanche 2000
Detroit Red Wings 2001
Los Angeles Kings 2011
Cluster Summary
There is a clear statistical distinction between teams with high and low win percentages, as well as between teams that are offensively and defensively oriented
Cluster membership also distinguishes very well between playoff teams and non-playoff teams
The strongest discriminating criteria between clusters seem to be win-percentage metrics, scoring/goals-allowed metrics, and offensive/defensive metrics
Supervised Learning
Model Comparisons
Seek to find a model that best predicts whether or not a given team will make the playoffs
Several different model types were considered:
◦ Linear Discriminant Analysis
◦ Logistic Analysis
◦ PCA Logistic Analysis
◦ Kernel Support Vector Machine
◦ Recursive Partitioning
◦ Random Forests
Training and Test Sets
The data were split into training and test sets – each model was built on the training set and predictions were then made on the test set
The accuracy of each model was calculated as the proportion of correct predictions made on the test set
Random data points were selected to maintain independence of observations
100 iterations of the training/test splits
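The evaluation loop described above can be sketched as follows. The split ratio is not given in the slides, so the 70/30 split is an assumption, and a simple nearest-class-mean classifier stands in for the models actually compared. All names and the simulated (GFAR, Sc1) rows are hypothetical.

```python
import random


def nearest_mean_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each test row to the closest class mean."""
    means = {}
    for label in set(train_y):
        rows = [x for x, lab in zip(train_x, train_y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(means, key=lambda lab: dist2(x, means[lab])) for x in test_x]


def repeated_holdout(x, y, predict, n_iter=100, test_frac=0.3, seed=0):
    """Mean accuracy (proportion correct) over random training/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_test = max(1, int(round(test_frac * len(x))))
    accuracies = []
    for _ in range(n_iter):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        preds = predict([x[i] for i in train], [y[i] for i in train],
                        [x[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / n_test)
    return sum(accuracies) / n_iter


# Simulated rows: label 1 = playoff team, 0 = non-playoff team.
gen = random.Random(1)
data_x = [[gen.gauss(0.89, 0.03), gen.gauss(0.58, 0.03)] for _ in range(30)]
data_x += [[gen.gauss(1.13, 0.03), gen.gauss(0.71, 0.03)] for _ in range(30)]
data_y = [0] * 30 + [1] * 30

mean_acc = repeated_holdout(data_x, data_y, nearest_mean_predict)
print(round(mean_acc, 2))
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single partition of the data.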
Accuracy Comparison Between Different Models
Mean Accuracy
Several models have similar mean accuracy over the repeated training/test iterations
Given that model and variable interpretation is difficult with kSVM, LDA was chosen as the most useful model going forward
Mean Accuracy
LDA 0.89
Log 0.88
Log-PCA 0.89
kSVM 0.89
R.Part 0.80
RF 0.88
Linear Discriminant Analysis
The final LDA model was built on the entire data set and the priors were specified as Pr[0] = 14/30 and Pr[1] = 16/30 (since we know the exact number of teams that will make the playoffs each year)
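To illustrate how the specified priors enter the classification, here is a minimal two-class LDA sketch with a pooled covariance matrix. It is restricted to two features for brevity; the eight (GFAR, Sc1) rows and the names `fit_lda` and `posterior` are hypothetical, and this is not the code used for the slides.

```python
import math


def fit_lda(x, y, priors):
    """Fit class means and the inverse pooled covariance (2 features only)."""
    means = {}
    for c in priors:
        rows = [r for r, lab in zip(x, y) if lab == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    # Pooled within-class scatter, divided by n - (number of classes).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r, lab in zip(x, y):
        d = [r[0] - means[lab][0], r[1] - means[lab][1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    dof = len(x) - len(priors)
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    return means, inv, priors


def posterior(row, model):
    """Posterior class probabilities from the linear discriminant scores."""
    means, inv, priors = model
    scores = {}
    for c, mu in means.items():
        w = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
             inv[1][0] * mu[0] + inv[1][1] * mu[1]]
        scores[c] = (row[0] * w[0] + row[1] * w[1]
                     - 0.5 * (mu[0] * w[0] + mu[1] * w[1])
                     + math.log(priors[c]))
    top = max(scores.values())
    expd = {c: math.exp(v - top) for c, v in scores.items()}
    z = sum(expd.values())
    return {c: e / z for c, e in expd.items()}


# Hypothetical (GFAR, Sc1) rows; 0 = non-playoff, 1 = playoff.
rows = [(0.85, 0.55), (0.90, 0.60), (0.92, 0.56), (0.88, 0.62),
        (1.10, 0.70), (1.15, 0.72), (1.12, 0.68), (1.18, 0.74)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_lda(rows, labels, {0: 14 / 30, 1: 16 / 30})
means = model[0]
midpoint = [(means[0][k] + means[1][k]) / 2 for k in range(2)]
print(posterior(midpoint, model))
```

At the midpoint between the two class means the likelihoods are equal, so the posterior falls back to the priors. This shows how the 16/30 prior tilts borderline teams toward the playoff class.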
Group Means for LDA Model
     GG    GAG   GFAR  PP     PK     S.G    SA.G   Sc1   Tr1st  FO     SV
0    2.57  2.97  0.89  16.15  81.62  28.38  30.29  0.58  0.24   49.46  0.90
1    2.90  2.55  1.13  18.36  83.55  29.88  28.21  0.71  0.36   50.46  0.91
Linear Discriminant Coefficients
The results agree with the univariate exploratory analysis: GFAR, GAG, Sc1, and Tr1st show some of the largest coefficients and discriminating power between the two groups
        LD1
GG      1.67
GAG    -2.60
GFAR    0.33
PP      0.03
PK      0.03
S.G    -0.02
SA.G    0.00
Sc1     3.68
Tr1st   3.03
FO      0.02
SV    -16.73
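Because the coefficients apply to unstandardized variables, their raw size depends on each variable's scale (SV has the largest coefficient, but its group means differ by only 0.01). One hedged way to read them is to multiply each coefficient by the difference in group means, which splits the gap between the two group centroids on LD1 into per-variable contributions. The sketch below uses only the rounded values shown in this deck, so the results are approximate.

```python
# LD1 coefficients and group means as printed on the slides (rounded).
coef = {"GG": 1.67, "GAG": -2.60, "GFAR": 0.33, "PP": 0.03, "PK": 0.03,
        "S.G": -0.02, "SA.G": 0.00, "Sc1": 3.68, "Tr1st": 3.03,
        "FO": 0.02, "SV": -16.73}
mean_0 = {"GG": 2.57, "GAG": 2.97, "GFAR": 0.89, "PP": 16.15, "PK": 81.62,
          "S.G": 28.38, "SA.G": 30.29, "Sc1": 0.58, "Tr1st": 0.24,
          "FO": 49.46, "SV": 0.90}
mean_1 = {"GG": 2.90, "GAG": 2.55, "GFAR": 1.13, "PP": 18.36, "PK": 83.55,
          "S.G": 29.88, "SA.G": 28.21, "Sc1": 0.71, "Tr1st": 0.36,
          "FO": 50.46, "SV": 0.91}

# Contribution of each variable to the LD1 gap between group centroids.
contrib = {v: coef[v] * (mean_1[v] - mean_0[v]) for v in coef}
gap = sum(contrib.values())

for name, value in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:6s} {value:7.3f}")
print("centroid gap on LD1:", round(gap, 2))
```

On these rounded inputs, GAG, GG, Sc1, and Tr1st account for most of the separation, while SV's large coefficient contributes little because its group means are nearly identical.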
Linear Discriminant Analysis in 2 Dimensions
The result was reduced to 2 dimensions as a visual aid
A discriminating line was added
Difficulty of Predicting Stanley Cup
To predict the Stanley Cup winner, an LDA model was built on the entire data set with the priors specified as Pr[0] = 29/30 and Pr[1] = 1/30 (again, we know that there will be exactly 1 Stanley Cup winner per year)
Even without a training/test assessment, predictions on only 1 year of data (2010) show the difficulty of the task: no team is classified as a winner, and the highest posterior probability is only about 0.13
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Levels: 0 1
PITTSBURGH 2010      0.1307364781
PHILADELPHIA 2010    0.1263354558
WASHINGTON 2010      0.0849897554
VANCOUVER 2010       0.0544844286
SAN JOSE 2010        0.0486076493
DETROIT 2010         0.0401645795
BOSTON 2010          0.0358007681
NY RANGERS 2010      0.0274712183
CHICAGO 2010         0.0230399705
TAMPA BAY 2010       0.0207186326
LOS ANGELES 2010     0.0206135593
FLORIDA 2010         0.0166602529
COLUMBUS 2010        0.0144914773
ST LOUIS 2010        0.0135499447
BUFFALO 2010         0.0101636053
NASHVILLE 2010       0.0098853267
CALGARY 2010         0.0090187380
MONTREAL 2010        0.0071093793
NY ISLANDERS 2010    0.0066572890
NEW JERSEY 2010      0.0064761267
PHOENIX 2010         0.0050708140
WINNIPEG 2010        0.0043784593
CAROLINA 2010        0.0038735653
TORONTO 2010         0.0029203444
DALLAS 2010          0.0026230173
OTTAWA 2010          0.0012930889
COLORADO 2010        0.0007440940
ANAHEIM 2010         0.0005716120
MINNESOTA 2010       0.0004059411
EDMONTON 2010        0.0003390065
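A short worked example of why the 1/30 prior makes a positive classification so hard. By Bayes' rule in odds form, posterior odds equal prior odds times the likelihood ratio, so a team is only classified as a winner (posterior above 0.5) when the model's likelihood ratio exceeds 29. The helper name `likelihood_ratio` is hypothetical; the posterior is the top value listed above.

```python
def likelihood_ratio(posterior, prior):
    """Likelihood ratio implied by a posterior and a prior (odds form)."""
    posterior_odds = posterior / (1 - posterior)
    prior_odds = prior / (1 - prior)
    return posterior_odds / prior_odds


prior_win = 1 / 30                    # one Stanley Cup winner among 30 teams
top_posterior = 0.1307364781          # Pittsburgh 2010, from the output above

# Likelihood ratio at which the posterior would reach exactly 0.5.
needed = (1 - prior_win) / prior_win
# Likelihood ratio the model actually assigns to the top-ranked team.
implied = likelihood_ratio(top_posterior, prior_win)

print(round(needed, 1), round(implied, 2))
```

Under these numbers, the top-ranked team's evidence is well short of the factor of 29 needed, which is consistent with every team being predicted as 0.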
Predictions for This Year's Data
The current season was shortened to 48 games by a lockout at the beginning of the year
Data were collected as of April 2, 2013, and the model was used to predict which teams will make the playoffs this year
The following teams were predicted to make the playoffs:
Anaheim Ducks
Boston Bruins
Chicago Blackhawks
Detroit Red Wings
Los Angeles Kings
Minnesota Wild
Montreal Canadiens
New York Rangers
Ottawa Senators
Pittsburgh Penguins
San Jose Sharks
St. Louis Blues
Tampa Bay Lightning
Toronto Maple Leafs
Vancouver Canucks
Predictions for This Year's Data
Only 15 teams are predicted instead of 16, which is consistent with the misclassification rate of around 8% (1 team is being wrongly classified as not making the playoffs)
The teams with the highest probabilities for playoff entry are the following:
CHICAGO 2012        0.9999884
PITTSBURGH 2012     0.9999739
BOSTON 2012         0.9993562
MONTREAL 2012       0.9985149
ANAHEIM 2012        0.9981987
LOS ANGELES 2012    0.9948172
OTTAWA 2012         0.9844236
MINNESOTA 2012      0.9721751
TORONTO 2012        0.9680388
ST LOUIS 2012       0.9318136
SAN JOSE 2012       0.8345595
TAMPA BAY 2012      0.7729422
NY RANGERS 2012     0.7627002
VANCOUVER 2012      0.6917667
DETROIT 2012        0.6249559
PHOENIX 2012        0.4242253
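Since exactly 16 teams qualify, one alternative to the 0.5 cutoff is to rank teams by posterior probability and take the top 16. The sketch below applies both rules to the 16 probabilities listed above; it is an illustration, not the procedure used in the slides. The actual format admits eight teams per conference, and conference labels are not part of this listing, so a league-wide top 16 is only an approximation.

```python
# Posterior probabilities of playoff entry, as listed above.
posteriors = {
    "CHICAGO": 0.9999884, "PITTSBURGH": 0.9999739, "BOSTON": 0.9993562,
    "MONTREAL": 0.9985149, "ANAHEIM": 0.9981987, "LOS ANGELES": 0.9948172,
    "OTTAWA": 0.9844236, "MINNESOTA": 0.9721751, "TORONTO": 0.9680388,
    "ST LOUIS": 0.9318136, "SAN JOSE": 0.8345595, "TAMPA BAY": 0.7729422,
    "NY RANGERS": 0.7627002, "VANCOUVER": 0.6917667, "DETROIT": 0.6249559,
    "PHOENIX": 0.4242253,
}

# Rule 1: classify as a playoff team when the posterior exceeds 0.5.
above_half = [team for team, p in posteriors.items() if p > 0.5]

# Rule 2: use the known quota and keep the 16 highest posteriors.
ranked = sorted(posteriors, key=posteriors.get, reverse=True)
top_16 = ranked[:16]

# Teams that only the quota rule admits.
added = [team for team in top_16 if team not in above_half]
print(len(above_half), added)
```

The quota rule fills the sixteenth spot with the highest-ranked team below the cutoff, which here is Phoenix.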
Critical Evaluation and Future Plans
The analysis indicates that certain techniques and strategies help a hockey team achieve success
With the data collected, predicting the Stanley Cup winner was not achieved (additional data might have made it possible)
Future plans are to collect more data and to follow this season, comparing the actual results to the predictions
If all goes well and 89% of the predicted teams actually make the playoffs (especially in a shortened season), this model could prove lucrative in future endeavors
QUESTIONS?????