Analysis of Baseball Pitchers Salaries
DATA MINING (901) Project Report, Group 7
4 June 2012
Authors: Brad Barker, Brenda Courtad and Ryan Goble
Abstract: The purpose of this paper is to determine what factors affect the salary of a pitcher in
major league baseball. Many variables can affect a player’s salary, including which team they
play for, the player’s age, the length of their contract, etc. We used the baseball database from
Sean Lahman (Lahman, 2012) to model pitchers’ salaries. To remove variation due to the fact
that the dataset spans 1985 through 2010, all salaries were adjusted for inflation, with 2010 as
the base year. Instead of using one specific type of model, we fit several types to compare which
is best for predicting a pitcher’s salary: GLM, GAM, and CART. We used standard modeling
techniques to find the best model of each type and then compared their MSE values to find the
best type of model for estimating salary. The MSE on the test data for each model is shown
below; GAM produced the best model for the test data.
Model                                   MSE (Test Set)
Generalized Additive Model              0.3658
Linear Model (Categorical Variables)    0.4753
Regression Tree                         0.5662
Generalized Linear Model                0.6344
Introduction
With the commercial success of Moneyball, we became interested in the business
analytics processes that were used to assemble the best-value team. As in the movie,
players who were not paid large salaries could have attributes that contributed to a higher
probability of team success, meaning that the best value was not the highest-paid player. We
are interested in finding similar nuances, and data on America’s pastime is readily available
online. The questions of interest are: What are the predictors of salary? Which modeling
technique best predicts salary? To begin our investigation we built logical relations between
multiple tables of baseball player performance data and their salaries. Given the significant
differences in predictors for players by position, we have chosen to focus our evaluation on one
position, the pitcher.
The data
The data for this project comes from the seanlahman.com baseball database (Lahman,
2012). We went through the available data and formed a hypothesis about which variables
would have a significant effect on salary. Those variables were included in the final data set
used in R for analysis. Results were reasonable, with many models achieving R-squared values
above 0.5. Instead of doubling our efforts in model building, we chose to invest in better
cleaning and transformation of our data.
First, we noticed that many of the pitcher statistics were heavily correlated with the
number of pitches thrown. The most obvious and successful method for us to mitigate that
correlation was to express all of those counts (Home Runs, Runs, Walks, etc.) as a ratio to total
pitches thrown. This caused some expected errors when fitting straight linear models, because
including every variable that accounts for total pitches creates a singular design matrix.
Conveniently, R detects this scenario and drops one of the linearly dependent variables.
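The singular-matrix effect can be reproduced on a small scale. The following is a minimal sketch, not the report's R code: it assumes hypothetical pitchers for whom every batter faced ends in exactly one of four listed outcomes, so the ratios sum to 1 and duplicate the intercept column. The data and the helper names to_rates and matrix_rank are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical pitchers: each batter faced ends in exactly one listed outcome,
# so the counts add up to nPitch (an assumption made for this illustration).
PITCHERS = [
    {"nPitch": 100, "FieldOut": 50, "StrikeOut": 20, "Hit": 20, "Walk": 10},
    {"nPitch": 200, "FieldOut": 110, "StrikeOut": 30, "Hit": 40, "Walk": 20},
    {"nPitch": 50, "FieldOut": 20, "StrikeOut": 15, "Hit": 10, "Walk": 5},
    {"nPitch": 80, "FieldOut": 40, "StrikeOut": 10, "Hit": 20, "Walk": 10},
    {"nPitch": 120, "FieldOut": 60, "StrikeOut": 30, "Hit": 18, "Walk": 12},
]
OUTCOMES = ["FieldOut", "StrikeOut", "Hit", "Walk"]


def to_rates(row):
    """Express each outcome count as an exact fraction of nPitch."""
    return [Fraction(row[name], row["nPitch"]) for name in OUTCOMES]


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination on exact fractions."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Design matrix: intercept column followed by all four rate columns.
full_design = [[Fraction(1)] + to_rates(p) for p in PITCHERS]
# Same matrix with one rate column (Walk) dropped.
reduced_design = [row[:-1] for row in full_design]

full_rank = matrix_rank(full_design)        # below the column count: dependent
reduced_rank = matrix_rank(reduced_design)  # equals the column count
```

Because the four ratios always sum to 1, the intercept is an exact linear combination of them; dropping any one ratio restores full column rank, which is in effect what R does when it reports a coefficient as not estimable.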
Second, we had to manage the issue of panel data. One approach would have been to
select only a single record or snapshot per player. The drawback of this approach is that it
removes too many records from evaluation. We chose to handle the panel data issue by
mitigating its effects, as opposed to eliminating its presence. We mitigated panel data in two ways:
Contract Renewals – By including only records from the year of a contract renewal,
we eliminate all the records where a player’s salary is ‘fixed’. This is appropriate in our model
concept as well. Since a player already has a multi-year contract, ‘predicting’ a salary in any year
after the first is not logical. Additionally, we included contract length as a predictor, which was
calculated from the number of years that the player’s salary remained fixed. The variable was
not always highly significant, but it almost always contributed to prediction accuracy.
Accumulated years – Including the total number of years a player had been to an
All-Star game and the total number of years played also helps mitigate the impact of panel
data by reducing the influence of a player’s recurring records. If a player always makes it to the
All-Star game, a Boolean All-Star indicator would repeat across that player’s records and
strengthen the panel effect. If instead every All-Star appearance increments a running count by
one, we introduce what we have found to be a highly significant variable for predicting salary:
the count of All-Star game appearances.
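The two mitigation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' database queries: a contract is taken to be a run of consecutive seasons with an identical salary, and accumulated counts include only seasons before the current one. The function names and the sample salaries are hypothetical.

```python
def contract_renewals(seasons):
    """Collapse one player's seasons into contract-renewal records.

    seasons: list of (yearID, salary) tuples sorted by year.
    A run of consecutive years with the same salary is treated as one
    contract (assumption); the run length becomes years_contract.
    """
    renewals = []
    for year, salary in seasons:
        last = renewals[-1] if renewals else None
        continues_contract = (
            last is not None
            and salary == last["salary"]
            and year == last["yearID"] + last["years_contract"]
        )
        if continues_contract:
            last["years_contract"] += 1
        else:
            renewals.append({"yearID": year, "salary": salary, "years_contract": 1})
    return renewals


def accumulated_years(event_years, year):
    """Count the years in event_years that fall strictly before `year`."""
    return sum(1 for y in event_years if y < year)


# Hypothetical player: one-year deal, three-year deal, then a new deal.
seasons = [(2001, 400000), (2002, 2000000), (2003, 2000000),
           (2004, 2000000), (2005, 3500000)]
renewals = contract_renewals(seasons)
all_star_before_2005 = accumulated_years([2002, 2004], 2005)
```

Only the renewal records would enter the model, each carrying its own years_contract, YearsPlayed and AllStarYears values.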
Not taken into consideration for panel data mitigation were the previous year’s salary,
any measure of salary growth, and any reduced weighting of a player’s individual records
to offset the recurrence of the player ID. Contract renewal and accumulated years were
both conceptually easy to calculate and sufficient to significantly improve our models. In actual
contract negotiations, previous contracts and performance are very influential; however,
including them would introduce missing-data complexity, since previous salary and
performance are not readily available for a player’s rookie year.
An inflation adjustment was necessary given the multi-year period we chose to
evaluate (1985 through 2010). The inflation factor was critical for a correct comparison between
years, and it is prior knowledge that inflation is considered during the contract renewal process.
After adjusting for inflation, we also took a log transform of the salaries. This is common with
financial data and is typically done to mitigate the impact of heteroscedasticity, since the spread
of salary prediction errors grows as the salary does. These transforms were applied in series,
and we write Y for the transformed response variable used in our models.
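The two transforms applied in series can be sketched as below. This is a minimal illustration, assuming the adjustment is a simple ratio of price-index values with 2010 as the base year; the report does not state which index was used, and the index values in the code are placeholders, not real CPI figures.

```python
import math

BASE_YEAR = 2010
# Placeholder price-index values for illustration only; NOT real CPI data.
PRICE_INDEX = {1985: 50.0, 2000: 80.0, 2010: 100.0}


def adjust_for_inflation(salary, year, index=PRICE_INDEX, base_year=BASE_YEAR):
    """Restate a salary from `year` in base-year dollars."""
    return salary * index[base_year] / index[year]


def transformed_response(salary, year):
    """Y = natural log of the inflation-adjusted salary."""
    return math.log(adjust_for_inflation(salary, year))


adj_1985 = adjust_for_inflation(500000, 1985)  # doubles under the placeholder index
y_2010 = transformed_response(1000000, 2010)   # base year: log of the raw salary
```

Working on the log scale means that a fixed difference in Y corresponds to a fixed percentage difference in salary, which is why the error spread becomes more even across low and high salaries.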
Reluctantly, we had to remove two groups of pitchers from the evaluation of our
models. First, the salaries of the most elite, highest-paid players are recognizably
disproportionate to their total performance; team owners draw them in with high salaries
for the ‘hype’ or ‘notoriety’ that they bring. This is intended to drive ticket sales more than
to improve a team’s winning potential. The top 5% of salaries by year were regrettably cut to
mitigate those players’ influence. Second, walk-on and mid-year players have skewed statistics
because of how they arrive mid-year. Players with five games or fewer and players within the
5th percentile of salaries were excluded.
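The exclusion rules can be sketched as a single filter. This is a minimal illustration under assumptions the report leaves open: both salary cut-offs are computed per year (the report says so only for the top 5%), the percentile boundaries are interpolated with Python's statistics.quantiles, and records exactly on a boundary are dropped. The function name filter_pitchers is hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def filter_pitchers(records, min_games=5):
    """Drop the top and bottom 5% of salaries per year and low-game records.

    records: list of dicts with keys yearID, salary and Game.
    Each year needs at least two records for the percentiles to be defined.
    """
    salaries_by_year = defaultdict(list)
    for r in records:
        salaries_by_year[r["yearID"]].append(r["salary"])

    bounds = {}
    for year, salaries in salaries_by_year.items():
        cuts = quantiles(salaries, n=20, method="inclusive")  # 5%, 10%, ..., 95%
        bounds[year] = (cuts[0], cuts[-1])

    kept = []
    for r in records:
        low, high = bounds[r["yearID"]]
        if low < r["salary"] < high and r["Game"] > min_games:
            kept.append(r)
    return kept
```

Computing the cut-offs within each year keeps the filter from removing mostly recent seasons, where nominal salaries are highest.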
Table relationships remain unchanged from the project proposal, except that there are now
no predictor variables taken from fielding or batting. Because the data formatting now requires
many nested tables, the relationships can no longer be shown visually for all records. The
database and all source code are found at http://homepages.uc.edu/~barkerba for
reference.
Table 1: Variable Descriptions
Field Name      Description
adj_salary      The salary for a contract renewal adjusted for inflation. NOTE: log transformation was handled in R.
Salary          The salary as provided in the Lahman’s source database
years_contract  The duration of years that a salary remained fixed; an indication of the number of years a contract was held for
playerID        Last name (shared last names are indexed by number)
yearID          Four digit year (filtered for years 1985 through 2010)
teamID          Unique 3 character designator for each team
Age             Age of the player during the given year
YearsPlayed     The total number of previous years a player has played MLB
AllStarYears    The total number of previous years a player played in the All Star game.
nPitch          The number of at bats a pitcher has pitched against
FieldOut        The ratio to nPitches of FieldOuts
StrikeOut       The ratio to nPitches of StrikeOuts
Hit             The ratio to nPitches of Hits
HomeRun         The ratio to nPitches of HomeRuns
Walk            The ratio to nPitches of Walks
IntWalk         The ratio to nPitches of Intentional Walks
WinLoss         The Win / Loss record over total games. -1 all losses, 0 equal number loss and wins, 1 all wins.
Game            The total number of games a pitcher pitched in and obtained a win or loss. NOTE: It is possible and common for a pitcher to pitch a game and earn neither a win nor a loss.
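The WinLoss scale described in Table 1 is consistent with the difference between wins and losses divided by the number of decisions. The report does not give the formula, so the following is our reading of the description, offered as a sketch.

```python
def win_loss(wins, losses):
    """Win/loss record scaled to [-1, 1]: (wins - losses) / (wins + losses).

    Assumed formula, inferred from the end points given in Table 1.
    """
    decisions = wins + losses  # corresponds to the Game field
    if decisions == 0:
        raise ValueError("pitcher has no wins or losses")
    return (wins - losses) / decisions
```

Under this reading a pitcher with 7 wins and 3 losses scores 0.4.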
The data can be split into two categories. The first contains the parameters that describe the
pitcher, such as age and win/loss record. The second contains the actual pitching statistics.
These were compared using the scatter plots shown in Figures 1 and 2.
Figure 1: Scatter Plot of Pitcher Descriptors
Figure 2: Scatter Plot of Pitching Statistics
Finally, Table 2 shows the mean, standard deviation, and correlations for each of the
variables used in the analysis. Not surprisingly, years played and age are highly correlated.
Years played and all-star years have the highest linear correlation with salary.
Table 2: Mean, SD, and correlation of variables.
Columns (1) through (15) give the correlation with the variable of the same number.
Variable                   Mean        SD     (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)    (12)    (13)    (14)    (15)
(1) adjusted salary   1,843,322 2,713,188       1
(2) years contract         1.04      0.23   0.198       1
(3) year ID                1998         7   0.254  -0.016       1
(4) Age                    29.4       4.3   0.360   0.092   0.057       1
(5) Years Played           3.57      3.56   0.592   0.119   0.247   0.715       1
(6) All Star Years         0.41      1.12   0.517   0.111  -0.029   0.419   0.440       1
(7) Number Pitch            421       285   0.298   0.058  -0.096  -0.068   0.065   0.171       1
(8) Field Out              0.51      0.05  -0.008   0.018  -0.248   0.063  -0.017   0.000   0.307       1
(9) Strike Out             0.16      0.05   0.138   0.009   0.193  -0.009   0.077   0.117  -0.052  -0.688       1
(10) Hit                   0.20      0.03  -0.024  -0.022  -0.005   0.042   0.016  -0.050  -0.014   0.052  -0.516       1
(11) Home Run              0.03      0.01   0.015   0.003   0.146   0.035   0.057  -0.029  -0.090  -0.098  -0.119  -0.034       1
(12) Walk                  0.08      0.03  -0.158  -0.017   0.051  -0.172  -0.136  -0.110  -0.276  -0.472   0.041  -0.264   0.027       1
(13) Int Walk              0.01      0.01  -0.149  -0.021  -0.068   0.074  -0.031  -0.082  -0.339  -0.155   0.036  -0.030  -0.116  -0.015       1
(14) Win Loss             -0.02      0.34   0.056   0.017  -0.013   0.011   0.024   0.049   0.176   0.106   0.148  -0.191  -0.192  -0.126  -0.066       1
(15) Game                  35.8      19.1  -0.011  -0.001   0.082   0.129   0.092  -0.007   0.034  -0.060   0.370  -0.288  -0.258  -0.183   0.234   0.137       1
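The entries of Table 2 are pairwise linear (Pearson) correlations. As a minimal sketch of how such a lower-triangular table can be computed, assuming each variable is available as a plain list of numbers (the function names are illustrative, and the report's own values came from R):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(columns):
    """Correlation of each variable with itself and every earlier variable.

    columns: dict mapping variable name to its list of values, in table order.
    """
    names = list(columns)
    return {
        name: [pearson(columns[name], columns[other]) for other in names[: i + 1]]
        for i, name in enumerate(names)
    }
```

Each row of the result holds one more entry than the previous row, matching the triangular layout of the table.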
Linear Model Analysis
First, a training set and a test set were created, with 80% of the data in the
training set and the remainder in the test set. The same sets were used for all of the model
analysis. A GLM was fit to the data, with one set of dummy variables for the categorical
variable team ID. AIC was used to step through the variables and determine the best-fit
model for salary prediction. The best-fit GLM resulted in an MSE of 0.6344 on the
test set. The fitted model is shown below in Table 3.
Table 3: Coefficients for Best Fit GLM
Term            Estimate  Std. Error  Pr(>|t|)    Term            Estimate  Std. Error  Pr(>|t|)
(Intercept)      -49.202       3.115     0.000    teamIDNYA          0.267       0.099     0.007
years_contract     0.372       0.044     0.000    teamIDNYN          0.094       0.100     0.348
yearID             0.031       0.002     0.000    teamIDOAK         -0.103       0.098     0.296
teamIDARI          0.106       0.114     0.354    teamIDPHI         -0.050       0.098     0.607
teamIDATL          0.029       0.099     0.768    teamIDPIT         -0.092       0.098     0.349
teamIDBAL          0.063       0.099     0.521    teamIDSDN         -0.073       0.099     0.457
teamIDBOS          0.121       0.099     0.224    teamIDSEA          0.031       0.100     0.753
teamIDCAL          0.023       0.114     0.843    teamIDSFN          0.127       0.098     0.195
teamIDCHA          0.043       0.099     0.666    teamIDSLN         -0.045       0.098     0.650
teamIDCHN          0.164       0.098     0.094    teamIDTBA         -0.130       0.109     0.232
teamIDCIN          0.040       0.099     0.688    teamIDTEX          0.043       0.099     0.667
teamIDCLE         -0.022       0.100     0.822    teamIDTOR          0.056       0.100     0.574
teamIDCOL          0.018       0.103     0.864    teamIDWAS         -0.127       0.138     0.357
teamIDDET          0.002       0.099     0.986    Age                0.039       0.003     0.000
teamIDFLO         -0.146       0.103     0.154    YearsPlayed        0.159       0.004     0.000
teamIDHOU         -0.001       0.100     0.991    AllStarYears       0.114       0.010     0.000
teamIDKCA          0.053       0.099     0.596    nPitch             0.001       0.000     0.000
teamIDLAA          0.399       0.134     0.003    FieldOut          -1.886       0.227     0.000
teamIDLAN          0.203       0.100     0.042    Hit               -1.939       0.332     0.000
teamIDMIL         -0.039       0.109     0.724    HomeRun           -3.977       0.812     0.000
teamIDMIN         -0.039       0.100     0.695    Walk              -2.956       0.401     0.000
teamIDML4         -0.169       0.111     0.128    WinLoss           -0.218       0.030     0.000
teamIDMON         -0.134       0.104     0.198    Game              -0.001       0.001     0.085
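The 80/20 split and the test-set MSE used to compare all of the models can be sketched as follows. This is a minimal illustration, assuming a simple random split with a fixed seed; the report does not say how records were assigned to the two sets, and the function names are hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into training and test sets."""
    rng = random.Random(seed)  # fixed seed so every model sees the same split
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mse(actual, predicted):
    """Mean squared error between observed and predicted responses."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Each model is fit on the training rows only; mse is then evaluated on the held-out test rows, with Y as the response.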
Next, the variables for number of all-star years, years played, and years of a contract were
changed to categorical and fitted with dummy variables. After stepping through the variables
using AIC as the selection criterion, this model had a lower AIC and a better R-squared value
than the model without the categorical variables. Therefore, it was chosen and used for
comparison with other models. The MSE for the test set was calculated to be 0.4753.
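Converting a numeric variable to categorical replaces it with one 0/1 dummy column per level, with one level held out as the reference. The sketch below mimics the default treatment coding in R, where the first level in sorted order is the reference and the column names join the variable name and the level; the function name dummy_code is hypothetical.

```python
def dummy_code(values, prefix):
    """Treatment-code a categorical variable.

    Returns the reference level and, for each observation, a dict of
    0/1 indicators for every non-reference level.
    """
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    rows = [{f"{prefix}{level}": int(v == level) for level in others} for v in values]
    return reference, rows


# Hypothetical contract lengths for four renewal records.
reference, coded = dummy_code([1, 2, 3, 1], "years_contract")
```

A record at the reference level has zeros in every dummy column, so each coefficient measures the difference from that level. This lets the model fit a separate salary effect for each contract length instead of a single slope.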
GAM Analysis
To determine the best GAM, we first evaluate which variables can be expressed
appropriately as spline functions. Two predictor variables that we wish to treat as categorical
are excluded from this step while we evaluate the rest: years_contract (due to the distribution
of contract years) and teamID.
Figure 3: GAM Spline Fits; Summary Results
Figure 4: GAM Spline Fits; Age, AllStarYears, FieldOut, Game
Panels: s(Age, 8.91), spline; s(AllStarYears, 4.64), spline; s(FieldOut, 5.65), spline; s(Game, 8.03), spline
Figure 5: GAM Spline Fits; Hit, HomeRun, IntWalk, nPitch
Panels: s(Hit, 0.86), linear; s(HomeRun, 0.98), linear; s(IntWalk, 0.99), linear; s(nPitch, 7.15), spline
Figure 6: GAM Spline Fits; StrikeOut, Walk, WinLoss, yearID
Panels: s(StrikeOut, 0.64), linear; s(Walk, 6.36), spline; s(WinLoss, 1), linear; s(yearID, 8.79), spline
Figure 7: GAM Spline Fits; YearsPlayed
Panel: s(YearsPlayed, 8.58)
Inspecting the initial spline results, we see that several variables should be converted
to linear terms and the model re-evaluated. The variables that should be treated as linear, given
their behavior when plotted as splines, are Hit, HomeRun, IntWalk, StrikeOut, and WinLoss. The
summary results also show that FieldOut and StrikeOut are not significant. We will
include them for the time being and evaluate how they perform alongside the categorical
variables. With the adjustments made for which variables should be splines and which
should be linear, we can run the model again including the categorical variables. As a result we
get a GCV score of 0.37669 from the following formula:
Formula: Y ~ s(Age) + s(AllStarYears) + s(Game) + s(nPitch) + s(Walk) + s(yearID) +
s(YearsPlayed) + FieldOut + Hit + HomeRun + IntWalk + StrikeOut + teamID + WinLoss +
years_contract
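The spline-versus-linear decision above can be expressed as a rule on the effective degrees of freedom reported for each smooth in the initial fit: a value near 1 indicates an essentially straight line. The sketch below is one way to reproduce the formula from those values. The threshold of 1.5 is our assumption, not a value from the report. FieldOut is a special case: its reported value is 5.65, yet the formula above enters it as a linear term, so the sketch treats it as a manual override (also an assumption).

```python
# Effective degrees of freedom read from the s(variable, edf) plot labels.
EDF = {
    "Age": 8.91, "AllStarYears": 4.64, "FieldOut": 5.65, "Game": 8.03,
    "Hit": 0.86, "HomeRun": 0.98, "IntWalk": 0.99, "nPitch": 7.15,
    "StrikeOut": 0.64, "Walk": 6.36, "WinLoss": 1.0, "yearID": 8.79,
    "YearsPlayed": 8.58,
}
CATEGORICAL = ["teamID", "years_contract"]
FORCED_LINEAR = {"FieldOut"}  # assumption: matches the reported formula


def build_formula(edf, categorical, forced_linear=(), threshold=1.5):
    """Build a GAM formula string: s() terms for high-edf variables."""
    smooth = [v for v, e in edf.items() if e > threshold and v not in forced_linear]
    linear = [v for v in edf if v not in smooth] + list(categorical)
    smooth.sort(key=str.lower)
    linear.sort(key=str.lower)
    return "Y ~ " + " + ".join([f"s({v})" for v in smooth] + linear)


formula = build_formula(EDF, CATEGORICAL, FORCED_LINEAR)
```

The resulting string uses the same s() notation as the formula above and could be passed to a GAM fitting routine that accepts that syntax.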
CART Analysis
A regression tree was fitted using our training dataset and the full model. The resulting tree is
shown below in Figure 8. Because our dependent variable Y is log(salary), we must undo the
log transformation to get actual salaries. We see that the predicted salary is $477,671 for a
player who has been a professional baseball player for less than 2.5 years, has fewer than 0.5
All-Star years (that is, none), and whose record is later than 2002.5 (the 2003 season onward).
Figure 9 below graphs the relative error against the size of the tree. It shows that the optimal
size for the tree is 9 nodes. From this we created a new tree, shown in Figure 10, which was
used to predict the test salaries. The resulting MSE on the test set was 0.5662. Interestingly, no
pitching statistics are included in the tree; the determining factors are descriptors of the
pitchers. The methods used to pay pitchers may not be based on their statistics, which supports
the finding in Moneyball that you can still have a winning team with less money.
Figure 8: Initial Regression Tree for Y
Figure 9: relative error versus the size of the tree
Figure 10: Final Regression Tree for Y
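Split points such as 2.5 years or 2002.5 arise because a regression tree tests thresholds midway between adjacent observed values and keeps the one that most reduces the squared error. The following is a minimal sketch of a single split together with the back-transformation of a leaf prediction. The data are hypothetical, and this is not the tree-fitting routine used for the report.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split on x."""
    levels = sorted(set(x))
    best = None
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent values
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def leaf_salary(leaf_mean_y):
    """Undo the log transform of a leaf prediction on Y."""
    return math.exp(leaf_mean_y)


# Hypothetical data: log salary jumps after the third year played.
years_played = [0, 1, 2, 3, 4, 5]
log_salary = [12.9, 13.0, 13.1, 14.5, 14.6, 14.7]
threshold, _ = best_split(years_played, log_salary)
```

Because a leaf predicts the mean of Y, exponentiating it gives the geometric mean of the salaries in that leaf, which is lower than their arithmetic mean whenever the salaries differ.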
Conclusions
After examining a generalized linear model, a generalized additive model, and a regression
tree, we find that the best model for predicting salary is the GAM. While the linear
model was a fairly good model for predicting salary, its inability to deal with the non-linear
nature of certain variables (Age, AllStarYears, Game, nPitch, Walk, yearID, and YearsPlayed)
made it inferior to the GAM. Both the linear model with categorical variables and the GAM
were superior to the regression tree for predicting player salary. It is likely that the regression
tree simply did not do well with the log transformation of the dependent variable, which
suppressed a large amount of variability in the salary data. After comparing the various types
of models and their predictive abilities, we conclude that the best method of predicting the
salary of a major league baseball pitcher is to use a GAM with the formula: Y ~ s(Age) +
s(AllStarYears) + s(Game) + s(nPitch) + s(Walk) + s(yearID) + s(YearsPlayed) + FieldOut + Hit +
HomeRun + IntWalk + StrikeOut + WinLoss.
Model                                   MSE (Test Set)
Generalized Additive Model              0.3658
Linear Model (Categorical Variables)    0.4753
Regression Tree                         0.5662
Generalized Linear Model                0.6344
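To put these values in salary terms, the MSE can be converted to a typical multiplicative error. The sketch below assumes the MSE is measured on Y, the log of adjusted salary, which the report implies but does not state outright. Under that assumption, the square root of the MSE is a typical error on the log scale, and exponentiating it gives the factor by which a predicted salary is typically off.

```python
import math

# Test-set MSE values from the table above.
TEST_MSE = {
    "Generalized Additive Model": 0.3658,
    "Linear Model (Categorical Variables)": 0.4753,
    "Regression Tree": 0.5662,
    "Generalized Linear Model": 0.6344,
}


def typical_error_factor(mse_on_log_scale):
    """exp(RMSE): multiplicative size of a typical error in salary terms."""
    return math.exp(math.sqrt(mse_on_log_scale))


best_model = min(TEST_MSE, key=TEST_MSE.get)
factors = {name: typical_error_factor(value) for name, value in TEST_MSE.items()}
```

For the GAM the factor is roughly 1.8, so even the best model's typical prediction misses the actual salary by a wide margin in dollar terms.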
Works Cited
Lahman, S. (2012, February 7). Sean Lahman. Retrieved April 25, 2012, from Sean Lahman:
http://www.seanlahman.com/baseball-archive/statistics
Wikipedia - Panel Data. (n.d.). Retrieved May 2, 2012, from Wikipedia:
http://en.wikipedia.org/wiki/Panel_data