furman engaged pres

Pitcher Cluster Analysis in Major League Baseball

Dr. John Harris, Dr. Tom Lewis, Jamey McDowell, Ian McConnell

Issue

When predicting an individual batter’s performance against an individual pitcher, that batter has frequently not faced that pitcher often enough to generate a meaningful sample size.

For example, Freddie Freeman has gone 2 for 10 against Cole Hamels this season; is this really an accurate predictor of how well he will perform against Hamels in his next at-bat?

Goal

We wanted to increase sample size of batter-pitcher “matchups” in order to better predict future interactions between specific players

To accomplish this, we grouped pitchers with similar styles together; we did this through the use of clustering algorithms

Hypothesis

By seeing how well a batter does against a cluster of same-style pitchers, we can better predict how well he will fare against any particular pitcher contained within that cluster

Data Sources

Sean Forman, founder of Baseball-Reference.com

Baseball-Reference.com

Brooks Baseball

Sample Data Sheet:

Metrics Analyzed

List of every plate appearance of the 2014 regular season, sorted by date

Pitch type statistics by pitcher

Pitcher style

Batter hand (L/R/Switch)

Data Exclusions

Batters who only batted in second half of season

Batters who only sacrifice bunted in first half

Clustering Methods in Use

K-means: make initial guesses at k cluster centers, then adjust centers based on mean of observations in that cluster

Decision Tree Analysis: let the computer choose which pitcher characteristics most strongly affect opposing OBP (minimum cluster size 50)
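The k-means procedure described above can be sketched in a few lines. This is a minimal pure-Python illustration of the standard algorithm (Lloyd's algorithm), not the code used in the study. The function name `kmeans`, the two-feature pitcher vectors, and the choice of the first k points as the initial guesses are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=100):
    """Cluster points (tuples of floats) into k groups."""
    # Assumption: the first k points serve as the initial guesses at the centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each observation to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: move each center to the mean of its observations.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical pitchers described by (strike percentage, GB/FB ratio).
pitchers = [(0.60, 0.8), (0.61, 0.9), (0.68, 1.6), (0.67, 1.5)]
labels, centers = kmeans(pitchers, k=2)
```

Because k-means relies on Euclidean distance, features measured on different scales would normally be standardized first; the toy data skips that step for brevity.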

Clustering Pitchers by...

Pitch independent stats (Strike percentage, GB/FB, etc.)

L-R

Batter performance against pitcher

Pitch similarity (i.e. fastballs thrown alike, etc.)

Pitch Independent Stats, K-means

K=17

Even spread across the large clusters; clear reasons for the small clusters

Pitch Independent Stats, CART

8 clusters/leaves

Decided by OBP against

Important factors:

Strike percentage

Strikeout percentage

Velocity difference between top two pitches

Pitches per plate appearance
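To illustrate how a regression tree can "choose" a factor such as those listed above, here is a sketch of the core step: finding the best threshold on a single pitcher statistic by minimizing the squared error of OBP against in the two resulting groups. It is not the software used in the study. The names `sse` and `best_split` and the toy numbers are hypothetical, and reading the minimum cluster size of 50 as a minimum leaf size is an assumption.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(feature, target, min_leaf=50):
    """Return (threshold, total_sse) for the best binary split on one feature.

    Rows with feature <= threshold go left, the rest go right. Splits that
    would leave fewer than min_leaf rows on either side are skipped.
    Returns None when no admissible split exists.
    """
    rows = sorted(zip(feature, target))
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        # Only split between two distinct feature values.
        if rows[i - 1][0] == rows[i][0]:
            continue
        left = [t for _, t in rows[:i]]
        right = [t for _, t in rows[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            threshold = (rows[i - 1][0] + rows[i][0]) / 2
            best = (threshold, cost)
    return best

# Hypothetical data: strikeout percentage and OBP against for six pitchers.
strikeout_pct = [0.15, 0.17, 0.18, 0.25, 0.27, 0.28]
obp_against = [0.340, 0.335, 0.338, 0.300, 0.295, 0.298]
split = best_split(strikeout_pct, obp_against, min_leaf=2)
```

A full tree repeats this search over every candidate statistic, keeps the statistic and threshold with the lowest cost, and then recurses on each side until no admissible split remains.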

Batter performance against

8 clusters

Pitchers are placed in the same cluster if the same batters perform similarly against them

Clustering Batters

When we treat batters as individual entities, they do not have enough plate appearances to make accurate predictions

We solve this by treating all left-handed batters as the same “batter”, and do the same with righties and switch hitters

Method

By a given cluster method, assign each pitcher to a numbered cluster

Compile every plate appearance and the total times on base for each batter type against each pitcher cluster in the first half of the season

Example

Left-handed batters are 200 for 860 against cluster 3

This is the predicted performance of LHBs against cluster 3 in second half of 2014
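The compilation step and the example above can be sketched as follows. The record layout (batter hand, pitcher id, reached base) and the function name `pooled_obp` are assumptions for illustration; the study's actual data format is not shown in the slides.

```python
from collections import defaultdict

def pooled_obp(plate_appearances, pitcher_cluster):
    """Pool plate appearances by (batter hand, pitcher cluster).

    plate_appearances: iterable of (batter_hand, pitcher_id, reached_base)
    pitcher_cluster:   dict mapping pitcher_id -> cluster number
    Returns {(batter_hand, cluster): (times_on_base, opportunities, obp)}.
    """
    totals = defaultdict(lambda: [0, 0])  # [times on base, opportunities]
    for hand, pitcher_id, reached_base in plate_appearances:
        key = (hand, pitcher_cluster[pitcher_id])
        totals[key][0] += int(reached_base)
        totals[key][1] += 1
    return {key: (s, n, s / n) for key, (s, n) in totals.items()}

# The example from the slide: left-handed batters are 200 for 860 against cluster 3.
predicted_obp = 200 / 860  # about 0.233
```

The resulting first-half rate for each (batter type, cluster) pair is then used as the prediction for the second half.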

Method (cont.)

Run the same compilation on second half of season

We test the accuracy of our prediction with a minimum variance test

Minimum Variance Test

∑(x_i-p)^2=S(1-2p)+Np^2

S=# of times on base

N=Number of opportunities (PA)

p=predicted OBP v. cluster
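The identity above holds if x_i is read as 1 when the batter reached base in opportunity i and 0 otherwise. That reading is an assumption, since the slide does not define x_i. Under it, x_i^2 = x_i, so expanding the square gives S - 2pS + Np^2, which is S(1-2p) + Np^2. The sketch below checks the closed form against a direct sum; the function names and the sample numbers are hypothetical.

```python
def squared_error(on_base, opportunities, predicted_obp):
    """Closed form S(1 - 2p) + N p^2 for the sum of (x_i - p)^2."""
    s, n, p = on_base, opportunities, predicted_obp
    return s * (1 - 2 * p) + n * p ** 2

def squared_error_direct(outcomes, predicted_obp):
    """Direct sum over individual outcomes (1 = reached base, 0 = did not)."""
    return sum((x - predicted_obp) ** 2 for x in outcomes)

# Hypothetical second-half sample: on base 3 times in 10 plate appearances.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
closed = squared_error(sum(outcomes), len(outcomes), 0.25)
direct = squared_error_direct(outcomes, 0.25)
```

For a fixed S and N the expression is smallest at p = S/N, so when two predictions are scored on the same second-half data, the one with the lower total is the closer prediction.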

Results

Hypothesis confirmed: every method we tested predicted second-half performance better than career history did

Clustering methods which also cluster batters were the only ones that “beat” prediction based on first half OBP of batter

Conclusions

Computer-chosen decision statistics best separate pitchers into clusters

Sample sizes are large enough to accurately predict when all batters are treated as three clusters

Application

In-game decisions

Pre-game decisions

Future Work

Test on other years

Cluster batters in ways other than handedness

Probit Modeling
