furman engaged pres
TRANSCRIPT
Pitcher Cluster Analysis in Major League Baseball
Dr. John Harris, Dr. Tom Lewis, Jamey McDowell, Ian McConnell
Issue:
When predicting an individual batter’s performance against an individual pitcher, the batter frequently has not faced that pitcher often enough to generate a meaningful sample size.
For example, Freddie Freeman has gone 2 for 10 against Cole Hamels this season; is this really an accurate predictor of how well he will perform against Hamels in his next at-bat?
Goal
We wanted to increase the sample size of batter-pitcher “matchups” in order to better predict future interactions between specific players
To accomplish this, we grouped pitchers with similar styles together; we did this through the use of clustering algorithms
Hypothesis
By seeing how well a batter does against a cluster of same-style pitchers, we can better predict how well he will fare against any particular pitcher contained within that cluster
Data Sources
Sean Forman, founder of Baseball-Reference.com
Baseball-Reference.com
Brooks Baseball
Sample Data Sheet:
Metrics Analyzed
List of every plate appearance of the 2014 regular season, sorted by date
Pitch type statistics by pitcher
Pitcher style
Batter hand (L/R/Switch)
Data Exclusions
Batters who only batted in second half of season
Batters who only sacrifice bunted in first half
Clustering Methods in Use
K-means: make initial guesses at k cluster centers, then adjust centers based on mean of observations in that cluster
Decision Tree Analysis: let the computer choose which pitcher characteristics most strongly affect opposing OBP (minimum cluster size 50)
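The K-means description above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `kmeans`, the two features (strike percentage and GB/FB ratio), and the six pitcher rows are all hypothetical, and real features would normally be rescaled to comparable ranges before clustering.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial guesses: k of the observations
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers

# Hypothetical pitchers: (strike percentage, GB/FB ratio)
pitchers = [(0.62, 0.80), (0.63, 0.90), (0.61, 0.85),
            (0.68, 1.60), (0.69, 1.70), (0.67, 1.65)]
labels, centers = kmeans(pitchers, k=2)
```

The cluster numbers themselves are arbitrary; only which pitchers share a label matters.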
Clustering Pitchers by...
Pitch independent stats (Strike percentage, GB/FB, etc.)
L-R
Batter performance against pitcher
Pitch similarity (i.e. fastballs thrown alike, etc.)
Pitch Independent Stats, K-means
K=17
Even spread across large clusters, obvious reasons for small clusters
Pitch Independent Stats, CART
8 clusters/leaves
Decided by OBP against
Important factors:
Strike percentage
Strikeout percentage
Velocity difference between top two pitches
Pitches per plate appearance
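To illustrate how a tree can pick a factor "by OBP against", here is a sketch of a single regression-tree split in the CART style: it tries each threshold on one feature and keeps the one that leaves the two groups with the least squared error in OBP against, subject to a minimum leaf size. The helper names and the six data rows are hypothetical, and a full tree would repeat this search over every feature and every resulting leaf.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(feature, target, min_leaf):
    """Return (threshold, total_sse) of the best binary split, or None."""
    rows = sorted(zip(feature, target))
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between identical feature values
        left = [t for _, t in rows[:i]]
        right = [t for _, t in rows[i:]]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            threshold = (rows[i - 1][0] + rows[i][0]) / 2
            best = (threshold, score)
    return best

# Hypothetical pitchers: strikeout percentage and OBP against
strikeout_pct = [0.15, 0.17, 0.18, 0.25, 0.27, 0.30]
obp_against = [0.340, 0.335, 0.330, 0.300, 0.295, 0.290]
split = best_split(strikeout_pct, obp_against, min_leaf=2)
```

With the deck's setting, `min_leaf` would be 50 rather than 2.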
Batter performance against
8 clusters
Pitchers in same cluster if same batters perform in a similar fashion against those pitchers
Clustering Batters
When we treat batters as individual entities, they do not have enough plate appearances to make accurate predictions
We solve this by treating all left-handed batters as the same “batter”, and doing the same with righties and switch hitters
Method
By a given cluster method, assign each pitcher to numbered cluster
Compile every plate appearance and total on-base for a pitcher cluster against a batter type in first half of season
Example
Left-handed batters are 200 for 860 against cluster 3
This is the predicted performance of LHBs against cluster 3 in second half of 2014
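The compilation step can be sketched as a simple aggregation. The record layout `(batter_hand, pitcher, reached_base)`, the pitcher names, and the per-pitcher counts are assumptions made for illustration; they are chosen only so that the totals reproduce the 200-for-860 example above.

```python
from collections import defaultdict

def predicted_obp(plate_appearances, pitcher_cluster):
    """Aggregate results by (batter hand, pitcher cluster) and return OBP."""
    totals = defaultdict(lambda: [0, 0])  # key -> [times on base, PA]
    for batter_hand, pitcher, reached_base in plate_appearances:
        key = (batter_hand, pitcher_cluster[pitcher])
        totals[key][0] += int(reached_base)
        totals[key][1] += 1
    return {key: on_base / pa for key, (on_base, pa) in totals.items()}

# Hypothetical first-half data: two pitchers, both assigned to cluster 3
pitcher_cluster = {"pitcher_a": 3, "pitcher_b": 3}
first_half = (
    [("L", "pitcher_a", True)] * 120 + [("L", "pitcher_a", False)] * 380
    + [("L", "pitcher_b", True)] * 80 + [("L", "pitcher_b", False)] * 280
)
prediction = predicted_obp(first_half, pitcher_cluster)
```

Here `prediction[("L", 3)]` is 200/860, roughly .233, which serves as the predicted OBP for left-handed batters against cluster 3 in the second half.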
Method (cont.)
Run the same compilation on second half of season
We test the accuracy of our prediction with a minimum variance test
Minimum Variance Test
∑(x_i-p)^2=S(1-2p)+Np^2
S=# of times on base
N=Number of opportunities (PA)
p=predicted OBP v. cluster
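The identity follows if each x_i is taken to be 1 when the batter reaches base and 0 otherwise (an assumption about notation the slide leaves implicit): then x_i^2 = x_i, so the sum expands to S - 2pS + Np^2. The sketch below checks the closed form against the direct sum. The second-half totals (210 for 850) are hypothetical.

```python
def squared_error(outcomes, p):
    """Direct sum of (x_i - p)^2 over 0/1 plate-appearance outcomes."""
    return sum((x - p) ** 2 for x in outcomes)

def squared_error_closed_form(S, N, p):
    """Same quantity from totals only: S(1 - 2p) + N p^2."""
    return S * (1 - 2 * p) + N * p ** 2

# Hypothetical second half: 210 times on base in 850 plate appearances
S, N = 210, 850
outcomes = [1] * S + [0] * (N - S)
p_cluster = 200 / 860  # prediction carried over from the first-half example

direct = squared_error(outcomes, p_cluster)
closed = squared_error_closed_form(S, N, p_cluster)

# The error is smallest when p equals the observed rate S / N, so a
# prediction closer to the observed second-half OBP scores lower.
floor = squared_error_closed_form(S, N, S / N)
```

Comparing this score across prediction methods on the same second-half data gives a like-for-like ranking, with lower being better.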
Results
Hypothesis confirmed: every method we tested better predicted the second half of a season than career history
Clustering methods which also cluster batters were the only ones that “beat” prediction based on first half OBP of batter
Conclusions
Computer-chosen decision statistics best separate pitchers into clusters
Sample sizes are large enough to accurately predict when all batters are treated as three clusters
Application
In-game decisions
Pre-game decisions
Future Work
Test on other years
Cluster batters in ways other than handedness
Probit Modeling