
A Statistical Determination of the Characteristics of Playoff Teams in The National Hockey League

Dan Foehrenbach & Chris Claeys

UP-STAT 2013, April 6


Outline

Predicting which teams make the playoffs is a difficult task, yet it can be achieved using multivariate techniques

Two-dimensional plot, playoff teams and non-playoff teams

Unsupervised learning and supervised learning

Cluster analysis

Four distinct clusters

Different modeling techniques to predict whether or not a given team will make the playoffs

Linear discriminant analysis

Current season predictions

Statement of the Problem

The main purpose of this report is to analyze the data and apply the results to uncover what makes a team successful

Discovery of common trends between good and bad teams

Data were collected over the past 11 years

The variables were chosen so that every aspect of hockey is represented in the analysis

Response variable and dummy variable (Playoffs, SC)

Variables

GG Goals Scored Per Game

GAG Goals Against Per Game

GFAR Goals For/Goals Against Ratio

PP Power Play Percentage

PK Penalty Kill Percentage

S.G Shots Per Game

SA.G Shots Against Per Game

Sc1 Win Percentage When Scoring First

Tr1st Win Percentage When Trailing After 1st Period

FO Faceoff Win Percentage

SV Save Percentage
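As a rough illustration of how the per-game and ratio variables above relate to raw season totals, here is a minimal sketch. The function name, its arguments, and the season totals are hypothetical and do not come from the collected data set.

```python
def team_features(goals_for, goals_against, shots_for, shots_against, games):
    """Derive a few of the listed variables from season totals (hypothetical helper)."""
    return {
        "GG": goals_for / games,            # goals scored per game
        "GAG": goals_against / games,       # goals against per game
        "GFAR": goals_for / goals_against,  # goals for / goals against ratio
        "S.G": shots_for / games,           # shots per game
        "SA.G": shots_against / games,      # shots against per game
    }


# Made-up totals for an 82-game season, for illustration only.
features = team_features(
    goals_for=246, goals_against=205, shots_for=2460, shots_against=2296, games=82
)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

A GFAR above 1 simply means the team scored more goals than it allowed over the season.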

Statistical Analysis

Exploratory Data Analysis

Gain an overall sense of the dataset, how playoff teams and Stanley Cup winners are dissimilar, or similar, to the rest of the teams

Multidimensional scaling (11D broken down into a 2D plot of data)

Modeling Stanley Cup winners would be too difficult with only 11 data points (one champion per season)

Also, gain a sense of which variables best distinguish playoff teams from non-playoff teams in a univariate sense

2D Plot of Data After Multi-Dimensional Scaling

Boxplots of Explanatory Variables

Unsupervised Learning

Clustering Analysis

Determine what similarities, or dissimilarities, exist among teams without any prior knowledge of group membership

Look for clear statistical distinctions that can separate teams into 2 or more groupings, such as playoff and non-playoff

Hierarchical clustering using Ward’s method (to determine the cluster means)

Refinement of Ward clusters using kMeans clustering

Four clusters yielded the most useful interpretations
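The refinement step can be sketched as a plain k-means (Lloyd's algorithm) run that starts from centroids supplied by an earlier clustering. This is a minimal sketch under stated assumptions, not the analysis actually run: it uses only two variables (GG, GAG), takes its starting centroids from the reported cluster means, and the eight "teams" are made-up points. The source does not say whether the variables were standardized first.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm started from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Starting centroids: (GG, GAG) means of the four reported clusters.
ward_centroids = [(2.71, 3.07), (3.02, 2.70), (2.49, 2.75), (2.73, 2.41)]

# Hypothetical team-seasons as (GG, GAG) pairs, for illustration only.
teams = [
    (2.75, 3.10), (2.65, 3.00), (3.05, 2.65), (3.00, 2.75),
    (2.45, 2.80), (2.50, 2.70), (2.70, 2.40), (2.78, 2.45),
]

labels, centroids = kmeans(teams, ward_centroids)
print(labels)
```

Seeding k-means with centroids from a hierarchical solution avoids the sensitivity to random starting points that k-means otherwise has.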

Number of Clusters

Cluster Interpretations and Means

Cluster 1: Teams that have a low win percentage and are offensive oriented

Cluster 2: Teams that have a high win percentage and are offensive oriented

Cluster 3: Teams that have a low win percentage and are defensive oriented

Cluster 4: Teams that have a high win percentage and are defensive oriented

Cluster Means

Cluster  GG    GAG   GFAR  PP     PK     S.G    SA.G   Sc1   Tr1st  FO     SV
1        2.71  3.07  0.92  16.96  79.50  29.13  31.43  0.61  0.27   49.22  0.90
2        3.02  2.70  1.09  20.29  82.52  30.77  29.03  0.70  0.36   50.46  0.90
3        2.49  2.75  0.93  14.82  83.58  27.42  29.36  0.59  0.23   49.56  0.90
4        2.73  2.41  1.12  17.01  85.25  29.37  26.52  0.69  0.34   50.81  0.90

kMeans Clusters Using Centroids from Ward Clusters

There is a clear distinction between the 4 clusters in only 2 dimensions

The majority of the playoff teams are coming from the clusters with higher win percentage

The playoff teams from clusters 2 and 3 tend to be borderline with cluster 1 or 4. This implies that these teams are likely “bubble” teams – those teams that are ranked 7th/8th in their respective conferences

kMeans Clusters Using Centroids from Ward Clusters

Stanley Cup Winners

The Stanley Cup champions over the past 11 years and the subgroup each team belonged to

Stanley Cup Winners Over the Past 11 Years and Cluster Membership

Offensive Oriented Defensive Oriented

Boston Bruins 2010 Pittsburgh Penguins 2008

Anaheim Ducks 2006 Chicago Blackhawks 2009

Detroit Red Wings 2007 Tampa Bay Lightning 2003

Carolina Hurricanes 2005

New Jersey Devils 2002

Colorado Avalanche 2000

Detroit Red Wings 2001

Los Angeles Kings 2011

Cluster Summary

There is a clear statistical distinction between teams with high and low win percentages as well as teams that are more offensive and defensive oriented

Cluster membership also distinguishes very well between playoff teams and non-playoff teams

The strongest discriminating criteria between clusters seem to be win percentage metrics, scoring/goals allowed metrics and offensive/defensive metrics

Supervised Learning

Model Comparisons

Seek to find a model that best predicts whether or not a given team will make the playoffs

Several different model types were considered:

◦ Linear Discriminant Analysis

◦ Logistic Analysis

◦ PCA Logistic Analysis

◦ Kernel Support Vector Machine

◦ Recursive Partitioning

◦ Random Forests

Training and Test Sets

The data were split into training and test sets – each model was built on the training set and predictions were then made on the test set

The accuracy of each model was calculated as the proportion of correct predictions made on the test set

Random data points were selected to maintain independence of observations

100 iterations of the training/test splits
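The repeated training/test procedure can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic, the 70/30 split fraction is not stated in the slides, and the one-variable midpoint-threshold "model" is a stand-in rather than any of the model types listed above. Only the overall loop (random split, fit on training, score on test, repeat 100 times, average) mirrors the description.

```python
import random

rng = random.Random(2013)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data: (GFAR, made_playoffs) for 11 seasons x 30 teams.
# The class centers 1.13 and 0.89 echo the group means reported later in the
# deck; the spread of 0.10 is an assumption.
data = []
for _ in range(11):
    data += [(rng.gauss(1.13, 0.10), 1) for _ in range(16)]
    data += [(rng.gauss(0.89, 0.10), 0) for _ in range(14)]


def class_mean(rows, label):
    values = [x for x, y in rows if y == label]
    return sum(values) / len(values)


def split_accuracy(rows, train_fraction, rng):
    """One random train/test split scored with a midpoint-threshold classifier."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    # "Fit" on the training set only: midpoint between the two class means.
    threshold = (class_mean(train, 0) + class_mean(train, 1)) / 2
    # Accuracy is the proportion of correct predictions on the held-out set.
    correct = sum(1 for x, y in test if (x > threshold) == (y == 1))
    return correct / len(test)


accuracies = [split_accuracy(data, 0.7, rng) for _ in range(100)]
mean_accuracy = sum(accuracies) / len(accuracies)
print(f"mean accuracy over {len(accuracies)} splits: {mean_accuracy:.2f}")
```

Averaging over many random splits reduces the chance that one lucky or unlucky split decides which model looks best.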

Accuracy Comparison Between Different Models

Mean Accuracy

The mean accuracy is similar for several models over the repeated training/test iterations

Given that model and variable interpretation is difficult with kSVM, LDA was chosen as the most useful model going forward

Mean Accuracy

LDA 0.89

Log 0.88

Log-PCA 0.89

kSVM 0.89

R.Part 0.80

RF 0.88

Linear Discriminant Analysis

The final LDA model was built on the entire data set and the priors were specified as Pr[0] = 14/30 and Pr[1] = 16/30 (since we know the exact number of teams that will make the playoffs each year)

Group Means for LDA Model

Group  GG    GAG   GFAR  PP     PK     S.G    SA.G   Sc1   Tr1st  FO     SV
0      2.57  2.97  0.89  16.15  81.62  28.38  30.29  0.58  0.24   49.46  0.90
1      2.90  2.55  1.13  18.36  83.55  29.88  28.21  0.71  0.36   50.46  0.91
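To illustrate what the priors do, here is a one-variable sketch of the discriminant rule: two Gaussian classes with a shared standard deviation, combined with the priors through Bayes' rule. It uses the GFAR group means (0.89 and 1.13) and the priors 14/30 and 16/30 from above. The standard deviation of 0.12 is an assumed value, and the real model uses all 11 variables jointly.

```python
import math


def posterior_playoff(x, mean0, mean1, sd, prior0, prior1):
    """P(class 1 | x) for two Gaussian classes sharing one standard deviation."""

    def density(value, mean):
        # The normalizing constant is the same for both classes and cancels.
        return math.exp(-0.5 * ((value - mean) / sd) ** 2)

    w0 = prior0 * density(x, mean0)
    w1 = prior1 * density(x, mean1)
    return w1 / (w0 + w1)


MEAN0, MEAN1 = 0.89, 1.13  # GFAR group means from the table above
SD = 0.12                  # assumed shared spread, not taken from the source

# Halfway between the group means the two densities are equal, so the
# posterior falls back to the prior for class 1.
midpoint = (MEAN0 + MEAN1) / 2
with_priors = posterior_playoff(midpoint, MEAN0, MEAN1, SD, 14 / 30, 16 / 30)
equal_priors = posterior_playoff(midpoint, MEAN0, MEAN1, SD, 0.5, 0.5)
print(with_priors, equal_priors)
```

With the unequal priors, a team sitting exactly between the two group means is nudged toward the playoff class, reflecting that 16 of 30 teams qualify.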

Linear Discriminant Coefficients

The results agree with the univariate exploratory analysis – GFAR, GAG, Sc1, and Tr1st show some of the largest coefficients and discriminating power between two groups

Variable  LD1
GG        1.67
GAG       -2.60
GFAR      0.33
PP        0.03
PK        0.03
S.G       -0.02
SA.G      0.00
Sc1       3.68
Tr1st     3.03
FO        0.02
SV        -16.73
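One way to read the coefficients is to apply them to the two group means and see how far apart the groups land on the discriminant axis. The sketch below uses the rounded coefficients and group means reported in this deck. Because no centering is applied here, the scores themselves are arbitrary and only their difference is meaningful. Raw coefficients also depend on each variable's scale, so the per-variable contribution (coefficient times mean difference) is used as a rough guide to which variables separate the groups.

```python
# Rounded LD1 coefficients as reported above.
coefficients = {
    "GG": 1.67, "GAG": -2.60, "GFAR": 0.33, "PP": 0.03, "PK": 0.03,
    "S.G": -0.02, "SA.G": 0.00, "Sc1": 3.68, "Tr1st": 3.03, "FO": 0.02,
    "SV": -16.73,
}

# Group means for non-playoff (0) and playoff (1) teams as reported above.
group_means = {
    0: {"GG": 2.57, "GAG": 2.97, "GFAR": 0.89, "PP": 16.15, "PK": 81.62,
        "S.G": 28.38, "SA.G": 30.29, "Sc1": 0.58, "Tr1st": 0.24, "FO": 49.46,
        "SV": 0.90},
    1: {"GG": 2.90, "GAG": 2.55, "GFAR": 1.13, "PP": 18.36, "PK": 83.55,
        "S.G": 29.88, "SA.G": 28.21, "Sc1": 0.71, "Tr1st": 0.36, "FO": 50.46,
        "SV": 0.91},
}


def ld1_score(team):
    """Uncentered linear discriminant score: sum of coefficient * value."""
    return sum(coefficients[name] * team[name] for name in coefficients)


gap = ld1_score(group_means[1]) - ld1_score(group_means[0])

# Contribution of each variable to the gap between the two group means.
contributions = {
    name: coefficients[name] * (group_means[1][name] - group_means[0][name])
    for name in coefficients
}
top_four = sorted(contributions, key=contributions.get, reverse=True)[:4]
print(f"gap between group means on LD1: {gap:.2f}")
print("largest contributors:", top_four)
```

With these rounded values, GAG, GG, Sc1 and Tr1st account for most of the separation between the two group means, even though SV has by far the largest raw coefficient.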

Linear Discriminant Analysis in 2 Dimensions

The result was reduced to 2 dimensions as a visual aid

A discriminating line was added

Linear Discriminant Analysis in 2 Dimensions

Difficulty of Predicting Stanley Cup

To predict the Stanley Cup winner an LDA model was built on the entire data set with the priors specified as Pr[0] = 29/30 and Pr[1] = 1/30 (again, we know that there is going to be exactly 1 Stanley Cup winner per year)

Even without a training/test assessment, predictions for a single year of data (2010) show the difficulty of the task: every team is assigned to class 0, and the estimated probabilities of winning the Cup are all small

Predicted classes (30 teams): all 0 (Levels: 0 1)

Team (2010)   Probability
PITTSBURGH    0.1307364781
PHILADELPHIA  0.1263354558
WASHINGTON    0.0849897554
VANCOUVER     0.0544844286
SAN JOSE      0.0486076493
DETROIT       0.0401645795
BOSTON        0.0358007681
NY RANGERS    0.0274712183
CHICAGO       0.0230399705
TAMPA BAY     0.0207186326
LOS ANGELES   0.0206135593
FLORIDA       0.0166602529
COLUMBUS      0.0144914773
ST LOUIS      0.0135499447
BUFFALO       0.0101636053
NASHVILLE     0.0098853267
CALGARY       0.0090187380
MONTREAL      0.0071093793
NY ISLANDERS  0.0066572890
NEW JERSEY    0.0064761267
PHOENIX       0.0050708140
WINNIPEG      0.0043784593
CAROLINA      0.0038735653
TORONTO       0.0029203444
DALLAS        0.0026230173
OTTAWA        0.0012930889
COLORADO      0.0007440940
ANAHEIM       0.0005716120
MINNESOTA     0.0004059411
EDMONTON      0.0003390065
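The all-zero class predictions follow from the usual rule of assigning class 1 only when its probability exceeds 0.5, which no team reaches. A small sketch using the seven highest 2010 probabilities shown above makes this concrete and shows the alternative of ranking teams instead. The ranking idea is an illustration, not a step described in the slides.

```python
# Seven highest 2010 probabilities, copied from the output above.
cup_probability = {
    "PITTSBURGH": 0.1307364781,
    "PHILADELPHIA": 0.1263354558,
    "WASHINGTON": 0.0849897554,
    "VANCOUVER": 0.0544844286,
    "SAN JOSE": 0.0486076493,
    "DETROIT": 0.0401645795,
    "BOSTON": 0.0358007681,
}

# Default rule: predict class 1 only if the probability exceeds 0.5.
predicted_class = {team: int(p > 0.5) for team, p in cup_probability.items()}

# Alternative view: rank teams by probability instead of thresholding.
ranking = sorted(cup_probability, key=cup_probability.get, reverse=True)

print(sum(predicted_class.values()), "teams predicted to win the Cup")
print("BOSTON rank:", ranking.index("BOSTON") + 1)
```

Boston, listed as the 2010 champion in the Stanley Cup table earlier in the deck, ranks only seventh by this measure, which underlines how hard the single-winner problem is.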

Predictions for This Year's Data

The current season was shortened to 48 games because of a lockout at the start of the season

Data was collected as of April 2, 2013 and the model was used to predict which teams will make the playoffs this year

The following teams were predicted to make the playoffs:

Anaheim Ducks

Boston Bruins

Chicago Blackhawks

Detroit Red Wings

Los Angeles Kings

Minnesota Wild

Montreal Canadiens

New York Rangers

Ottawa Senators

Pittsburgh Penguins

San Jose Sharks

St. Louis Blues

Tampa Bay Lightning

Toronto Maple Leafs

Vancouver Canucks

Predictions for This Year's Data

Only 15 teams were predicted rather than 16, which is in line with the misclassification rate of around 8% seen above (1 team is being wrongly classified as not making the playoffs)

The teams with the highest probabilities for playoff entry are the following:

Team (2012)  Probability
CHICAGO      0.9999884
PITTSBURGH   0.9999739
BOSTON       0.9993562
MONTREAL     0.9985149
ANAHEIM      0.9981987
LOS ANGELES  0.9948172
OTTAWA       0.9844236
MINNESOTA    0.9721751
TORONTO      0.9680388
ST LOUIS     0.9318136
SAN JOSE     0.8345595
TAMPA BAY    0.7729422
NY RANGERS   0.7627002
VANCOUVER    0.6917667
DETROIT      0.6249559
PHOENIX      0.4242253
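Since exactly 16 teams qualify, one possible alternative to the 0.5 cutoff is to take the 16 highest probabilities. The sketch below compares the two rules on the probabilities listed above. It assumes the 14 teams not shown all have lower probabilities than Phoenix, and it ignores conference structure (8 teams per conference), so it is a simplification rather than the procedure used for the predictions.

```python
# Playoff probabilities as listed above (the 16 highest values).
playoff_probability = {
    "CHICAGO": 0.9999884, "PITTSBURGH": 0.9999739, "BOSTON": 0.9993562,
    "MONTREAL": 0.9985149, "ANAHEIM": 0.9981987, "LOS ANGELES": 0.9948172,
    "OTTAWA": 0.9844236, "MINNESOTA": 0.9721751, "TORONTO": 0.9680388,
    "ST LOUIS": 0.9318136, "SAN JOSE": 0.8345595, "TAMPA BAY": 0.7729422,
    "NY RANGERS": 0.7627002, "VANCOUVER": 0.6917667, "DETROIT": 0.6249559,
    "PHOENIX": 0.4242253,
}

# Rule 1: classify as a playoff team when the probability exceeds 0.5.
above_half = [team for team, p in playoff_probability.items() if p > 0.5]

# Rule 2: take the 16 most probable teams, matching the number of berths.
top_16 = sorted(playoff_probability, key=playoff_probability.get, reverse=True)[:16]

print(len(above_half), "teams above 0.5")
print("16th team under the top-16 rule:", top_16[-1])
```

Under the top-16 rule, Phoenix would fill the last spot despite a probability below 0.5.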

Critical Evaluation and Future Plans

There are indications that certain techniques and strategies help a hockey team achieve success

With the data that was collected, predicting the Stanley Cup winner was not achieved (additional data might make this possible)

Future plans are to collect more data, as well as to follow this season and compare the actual results to the predictions

If all goes well and 89% of the teams predicted actually make the playoffs (especially in a shorter season), this model could prove lucrative in future endeavors

Questions?

