
Analysis of Baseball Pitchers’ Salaries

DATA MINING (901) Project Report, Group 7

4 June 2012

Brad Barker

Brenda Courtad

Ryan Goble


Abstract: The purpose of this paper is to determine what factors affect the salary of a pitcher in
Major League Baseball. Many variables can affect a player’s salary, including the team he plays
for, his age, and the length of his contract. We used the baseball database from Sean Lahman
(Lahman, 2012) to model pitchers’ salaries. Because the dataset spans 1985 through 2010, all
salaries were adjusted for inflation, with 2010 as the base year. Rather than committing to one
type of model, we fit models using GLM, GAM, and CART methods, used standard modeling
techniques to find the best model of each type, and then compared their test-set MSE values.
The test-set MSE for each model is shown below; the GAM gave the best fit to the test data.

Model                                   MSE (Test Set)
Generalized Additive Model              0.3658
Linear Model (Categorical Variables)    0.4753
Regression Tree                         0.5662
Generalized Linear Model                0.6344

Introduction

With the commercial success of Moneyball, we became interested in the business analytics
processes used to assemble the best-value team. As in the movie, players who were not paid
large salaries could have attributes that contributed to a higher probability of team success,
meaning that the best value was not necessarily the highest-paid player. We are interested in
finding similar nuances, and data on America’s pastime is readily available online. The
questions of interest are: What are the predictors of salary? Which modeling technique best
predicts salary? To begin our investigation, we built logical relations between multiple tables of
baseball player performance data and salaries. Given the significant differences in predictors by
position, we chose to focus our evaluation on one position, the pitcher.

The data

The data for this project comes from the seanlahman.com baseball database (Lahman,
2012). We went through the available data and formed a hypothesis about which variables
would have a significant effect on salary. Those variables were included in the final data set
used in R for analysis. The results were reasonable, with many models reaching R-squared
values above 0.5. Instead of doubling our efforts on model building, we chose to invest in
better cleaning and transformation of our data.

First, we noticed that many of the pitcher statistics were heavily correlated with the
number of pitches thrown. The most obvious and successful way to mitigate that correlation
was to express each of those counts (Home Runs, Runs, Walks, etc.) as a fraction of total
pitches thrown (nPitch in Table 1). This caused some expected errors when fitting straight
linear models: including every variable that accounted for total pitches made the design matrix
singular. R detects this situation and drops one of the linearly dependent variables from the fit.
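The conversion from counts to rates, and the reason it can produce a singular design matrix, can be sketched as follows. This is a minimal illustration, not the authors' code: the season totals are hypothetical, and it assumes the outcome categories are mutually exclusive and cover every batter faced, which may not hold exactly in the real data.

```python
# Hypothetical season totals for one pitcher (illustrative numbers only,
# not taken from the Lahman database).
season = {"FieldOut": 310, "StrikeOut": 120, "Hit": 130, "Walk": 40}


def to_rates(counts):
    """Express each outcome count as a fraction of all recorded outcomes."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


rates = to_rates(season)

# If the outcomes are exhaustive, the rates add up to 1 for every pitcher.
# Together with an intercept column this makes the columns linearly
# dependent, so one rate has to be left out of a linear model.
print(rates)
print(sum(rates.values()))
```

Under that assumption any one rate is determined by the others, which is why a fitting routine has to drop one of them.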

Second, we had to manage the issue of panel data, since the same player appears in many
seasons. One approach would have been to keep only a single record, or snapshot, per player.
The drawback of that approach is that it removes too many records from the evaluation. We
chose to mitigate the effects of panel data rather than eliminate it, in two ways:

Contract Renewals: By including only records from the year of a contract renewal,
we eliminate all records where a player’s salary is ‘fixed’. This also fits our modeling concept:
once a player has a multi-year contract, ‘predicting’ a salary in any year after the first is not
logical. Additionally, we included contract length as a predictor, calculated as the number of
years that the player’s salary remained fixed. The variable was not always highly significant,
but it almost always contributed to prediction accuracy.

Accumulated years: Including the total number of years a player had been to an
All-Star game and the total number of years played also helps mitigate the impact of panel data
by reducing the effect of recurring records. If a player always makes the All-Star game, a
Boolean All-Star indicator would repeat on every one of his records and amplify his panel
influence. If instead every All-Star appearance increments a running count by one, each record
carries a different value, and we introduce what we found to be a highly significant variable for
predicting salary: the count of All-Star game appearances.
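Both mitigation steps can be sketched in a few lines. This is only an illustration under stated assumptions: the record keys ('year', 'salary', 'all_star') are hypothetical names, a renewal is inferred purely from a change in salary, and the running counts cover previous seasons only, following the wording of Table 1.

```python
def renewal_rows(seasons):
    """Keep the first season of each run of constant salary.

    `seasons` is one player's records sorted by year. Each record is a dict
    with the hypothetical keys 'year', 'salary' and 'all_star' (bool).
    """
    rows = []
    years_played = 0      # seasons completed before the current one
    all_star_years = 0    # All-Star seasons before the current one
    previous_salary = None
    for s in seasons:
        if s["salary"] != previous_salary:
            # Salary changed: treat this season as a contract renewal.
            rows.append({
                "year": s["year"],
                "salary": s["salary"],
                "years_contract": 1,
                "YearsPlayed": years_played,
                "AllStarYears": all_star_years,
            })
        else:
            # Salary unchanged: the current contract runs one more year.
            rows[-1]["years_contract"] += 1
        previous_salary = s["salary"]
        years_played += 1
        all_star_years += int(s["all_star"])
    return rows


# Hypothetical three-season career.
career = [
    {"year": 2001, "salary": 300000, "all_star": False},
    {"year": 2002, "salary": 300000, "all_star": True},
    {"year": 2003, "salary": 900000, "all_star": True},
]
rows = renewal_rows(career)
print(rows)
```

One limitation of inferring renewals this way is that a contract renewed at exactly the same salary would be counted as a continuation.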

Not taken into consideration for panel data mitigation were the previous year’s salary,
any measure of salary growth, and any reduced weighting of a player’s individual records to
account for recurrence of the player ID. The contract renewal and accumulated years
calculations were both conceptually simple and sufficient to significantly improve our models.
In actual contract negotiations, previous contracts and performance are very influential;
however, including them would introduce missing-data complexity, because previous salary and
performance are not readily available for a player’s rookie year.

Transformation for inflation was necessary given the multi-year period (1985 through
2010) we chose to evaluate. The inflation factor was critical for comparing salaries correctly
between years, and we know a priori that inflation is a consideration during the contract
renewal process. After adjusting for inflation, we also took a log transform of the salaries. This
is common with financial data and is typically done to mitigate heteroscedasticity, since the
spread of salary prediction errors grows as the salary does. These transforms were applied in
series, and we use Y to denote the transformed response variable used in our models.
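The two transforms applied in series can be sketched as below. The function and parameter names are hypothetical, the price index values are made-up numbers chosen to keep the arithmetic easy, and the natural logarithm is an assumption, since the report does not state the base of the log.

```python
import math


def transformed_salary(salary, price_index_year, price_index_base):
    """Return Y: the log of the salary expressed in base-year dollars.

    price_index_year is the price index for the season of the salary and
    price_index_base is the index for the base year (2010 in the report).
    """
    adjusted = salary * price_index_base / price_index_year
    return math.log(adjusted)  # natural log (assumed)


# Hypothetical index values: prices doubled between the season and the base year.
y = transformed_salary(500_000, price_index_year=50.0, price_index_base=100.0)
print(y)
print(round(math.exp(y)))  # the same salary in base-year dollars
```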

Reluctantly, we had to remove two groups of pitchers from the evaluation of our models.
First, the salaries of the most elite, highest-paid players are recognizably disproportionate to
their total performance; team owners draw them in with high salaries for the ‘hype’ or
‘notoriety’ that they bring, which is intended to drive ticket sales more than to improve a
team’s winning potential. The top 5% of salaries in each year were cut to mitigate those
players’ influence. Second, walk-on and mid-year players have skewed statistics because they
arrive mid-year. Players with five or fewer games and players within the 5th percentile of
salaries were excluded.
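A filter along these lines might look like the sketch below. It rests on several assumptions that the report does not settle: the record keys are hypothetical, both salary cuts are taken within each year (the text says so only for the top 5%), and the percentile definition is whatever Python's statistics.quantiles uses by default.

```python
from statistics import quantiles


def trim_records(records, min_games=6):
    """Drop the salary tails within each year and pitchers with few games.

    `records` is a list of dicts with the hypothetical keys 'yearID',
    'salary' and 'Game'. Each year needs at least two records.
    """
    salaries_by_year = {}
    for r in records:
        salaries_by_year.setdefault(r["yearID"], []).append(r["salary"])

    # n=20 gives 19 cut points: the first is the 5th percentile and the
    # last is the 95th percentile.
    cuts_by_year = {
        year: quantiles(values, n=20)
        for year, values in salaries_by_year.items()
    }

    kept = []
    for r in records:
        cuts = cuts_by_year[r["yearID"]]
        low, high = cuts[0], cuts[-1]
        # "Five games or less" are excluded, so at least six are required.
        if low < r["salary"] < high and r["Game"] >= min_games:
            kept.append(r)
    return kept
```

Computing the cut points per year keeps the trimming from removing mostly recent seasons, where even inflation-adjusted salaries tend to be higher.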

Table relationships remain unchanged from the project proposal, except that no predictor
variables are now taken from fielding or batting. Because the data formatting required many
nested tables, the relationships can no longer be shown visually for all records. The database
and all source code can be found at http://homepages.uc.edu/~barkerba for reference.

Table 1: Variable Descriptions

Field Name       Description
adj_salary       The salary for a contract renewal, adjusted for inflation.
                 NOTE: the log transformation was handled in R.
Salary           The salary as provided in Lahman’s source database
years_contract   The number of years that a salary remained fixed; an indication
                 of the number of years a contract was held for
playerID         Last name (shared last names are indexed by number)
yearID           Four-digit year (filtered for years 1985 through 2010)
teamID           Unique 3-character designator for each team
Age              Age of the player during the given year
YearsPlayed      The total number of previous years a player has played in MLB
AllStarYears     The total number of previous years a player played in the
                 All-Star game
nPitch           The number of at bats a pitcher has pitched against
FieldOut         The ratio of field outs to nPitch
StrikeOut        The ratio of strikeouts to nPitch
Hit              The ratio of hits to nPitch
HomeRun          The ratio of home runs to nPitch
Walk             The ratio of walks to nPitch
IntWalk          The ratio of intentional walks to nPitch
WinLoss          The win/loss record over total games: -1 all losses, 0 equal
                 number of losses and wins, 1 all wins
Game             The total number of games in which a pitcher pitched and
                 obtained a win or loss. NOTE: It is possible and common for a
                 pitcher to pitch in a game and earn neither a win nor a loss.
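One reading of the WinLoss definition that reproduces the stated endpoints (-1, 0, 1) is the difference between wins and losses divided by the number of decisions. The report does not give the formula, so the function below is an assumption for illustration only.

```python
def win_loss(wins, losses):
    """Scaled record in [-1, 1]: -1 all losses, 0 even, 1 all wins."""
    decisions = wins + losses  # the Game field, under this reading
    if decisions == 0:
        raise ValueError("pitcher has no wins or losses recorded")
    return (wins - losses) / decisions


print(win_loss(12, 8))
```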

The data can be split into two categories. The first is the set of parameters that describe the
pitcher, such as age and win/loss record. The second is the set of actual pitching statistics.
These were compared using the scatter plots shown in Figures 1 and 2.

Figure 1: Scatter Plot of Pitcher Descriptors

Figure 2: Scatter Plot of Pitching Statistics

Finally, Table 2 shows the mean, standard deviation, and correlations for each of the
variables used in the analysis. Not surprisingly, years played and age are highly correlated.
Years played and All-Star years have the highest linear correlation with salary.

Table 2: Mean, SD, and correlation of variables.

 #  Variable          Mean        Standard Deviation
 1  adjusted salary   1,843,322   2,713,188
 2  years contract    1.04        0.23
 3  year ID           1998        7
 4  Age               29.4        4.3
 5  Years Played      3.57        3.56
 6  All Star Years    0.41        1.12
 7  Number Pitch      421         285
 8  Field Out         0.51        0.05
 9  Strike Out        0.16        0.05
10  Hit               0.20        0.03
11  Home Run          0.03        0.01
12  Walk              0.08        0.03
13  Int Walk          0.01        0.01
14  Win Loss          -0.02       0.34
15  Game              35.8        19.1

Correlation (lower triangle; column numbers refer to the variables listed above)

                         1       2       3       4       5       6       7       8       9      10      11      12      13      14      15
 1 adjusted salary       1
 2 years contract    0.198       1
 3 year ID           0.254  -0.016       1
 4 Age               0.360   0.092   0.057       1
 5 Years Played      0.592   0.119   0.247   0.715       1
 6 All Star Years    0.517   0.111  -0.029   0.419   0.440       1
 7 Number Pitch      0.298   0.058  -0.096  -0.068   0.065   0.171       1
 8 Field Out        -0.008   0.018  -0.248   0.063  -0.017   0.000   0.307       1
 9 Strike Out        0.138   0.009   0.193  -0.009   0.077   0.117  -0.052  -0.688       1
10 Hit              -0.024  -0.022  -0.005   0.042   0.016  -0.050  -0.014   0.052  -0.516       1
11 Home Run          0.015   0.003   0.146   0.035   0.057  -0.029  -0.090  -0.098  -0.119  -0.034       1
12 Walk             -0.158  -0.017   0.051  -0.172  -0.136  -0.110  -0.276  -0.472   0.041  -0.264   0.027       1
13 Int Walk         -0.149  -0.021  -0.068   0.074  -0.031  -0.082  -0.339  -0.155   0.036  -0.030  -0.116  -0.015       1
14 Win Loss          0.056   0.017  -0.013   0.011   0.024   0.049   0.176   0.106   0.148  -0.191  -0.192  -0.126  -0.066       1
15 Game             -0.011  -0.001   0.082   0.129   0.092  -0.007   0.034  -0.060   0.370  -0.288  -0.258  -0.183   0.234   0.137       1
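Each entry of the correlation matrix in Table 2 is a pairwise linear correlation. A minimal sketch of the Pearson coefficient, assuming that is the statistic reported, is shown below; the two short series are hypothetical and are not drawn from the dataset.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum(
        (a - mean_x) * (b - mean_y) for a, b in zip(x, y)
    ) / (len(x) - 1)
    return covariance / (stdev(x) * stdev(y))


# Hypothetical ages and years played for five pitchers.
ages = [24, 27, 29, 31, 35]
years_played = [1, 3, 4, 8, 11]
print(round(pearson(ages, years_played), 3))
```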

Linear Model Analysis

First, a training set and a test set were created, with 80% of the data in the training
set and the remainder in the test set. These same sets were used for all of the model analysis.
A GLM was then fit to the data, with one set of dummy variables for the categorical variable
teamID. AIC was used to step through the variables and determine the best-fit model for
salary prediction. The best-fit GLM resulted in an MSE of 0.6344 on the test set. The fitted
model is shown below in Table 3.

Table 3: Coefficients for Best Fit GLM

Term             Estimate  Std. Error  Pr(>|t|)
(Intercept)       -49.202       3.115     0.000
years_contract      0.372       0.044     0.000
yearID              0.031       0.002     0.000
teamIDARI           0.106       0.114     0.354
teamIDATL           0.029       0.099     0.768
teamIDBAL           0.063       0.099     0.521
teamIDBOS           0.121       0.099     0.224
teamIDCAL           0.023       0.114     0.843
teamIDCHA           0.043       0.099     0.666
teamIDCHN           0.164       0.098     0.094
teamIDCIN           0.040       0.099     0.688
teamIDCLE          -0.022       0.100     0.822
teamIDCOL           0.018       0.103     0.864
teamIDDET           0.002       0.099     0.986
teamIDFLO          -0.146       0.103     0.154
teamIDHOU          -0.001       0.100     0.991
teamIDKCA           0.053       0.099     0.596
teamIDLAA           0.399       0.134     0.003
teamIDLAN           0.203       0.100     0.042
teamIDMIL          -0.039       0.109     0.724
teamIDMIN          -0.039       0.100     0.695
teamIDML4          -0.169       0.111     0.128
teamIDMON          -0.134       0.104     0.198
teamIDNYA           0.267       0.099     0.007
teamIDNYN           0.094       0.100     0.348
teamIDOAK          -0.103       0.098     0.296
teamIDPHI          -0.050       0.098     0.607
teamIDPIT          -0.092       0.098     0.349
teamIDSDN          -0.073       0.099     0.457
teamIDSEA           0.031       0.100     0.753
teamIDSFN           0.127       0.098     0.195
teamIDSLN          -0.045       0.098     0.650
teamIDTBA          -0.130       0.109     0.232
teamIDTEX           0.043       0.099     0.667
teamIDTOR           0.056       0.100     0.574
teamIDWAS          -0.127       0.138     0.357
Age                 0.039       0.003     0.000
YearsPlayed         0.159       0.004     0.000
AllStarYears        0.114       0.010     0.000
nPitch              0.001       0.000     0.000
FieldOut           -1.886       0.227     0.000
Hit                -1.939       0.332     0.000
HomeRun            -3.977       0.812     0.000
Walk               -2.956       0.401     0.000
WinLoss            -0.218       0.030     0.000
Game               -0.001       0.001     0.085
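The 80/20 split and the test-set MSE used to compare every model can be sketched as follows. The report does not say how rows were assigned to each set, so the seeded random shuffle here is an assumption, and the function names are hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reusable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mse(actual, predicted):
    """Mean squared error between observed and predicted responses."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


train, test = train_test_split(range(100))
print(len(train), len(test))
print(mse([1.0, 2.0], [1.0, 4.0]))
```

Because the response Y is a log salary, an MSE computed this way is on the log scale rather than in dollars, which is presumably how the reported values should be read.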

Next, the variables for number of All-Star years, years played, and years of contract were
changed to categorical variables and fitted with dummy variables. After stepping through the
variables using AIC as the selection criterion, this model had a lower AIC and a better
R-squared value than the model without the categorical variables. Therefore, it was chosen for
comparison with the other models. Its MSE on the test set was 0.4753.
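Treating a count such as years played as categorical means replacing the single numeric column with one indicator column per level. A minimal sketch of that dummy coding, with a hypothetical function name and column naming scheme, and with the lowest level used as the baseline, is:

```python
def dummy_code(values):
    """Return the baseline level and one 0/1 column per remaining level."""
    levels = sorted(set(values))
    baseline, others = levels[0], levels[1:]
    columns = {
        f"level_{level}": [int(v == level) for v in values]
        for level in others
    }
    return baseline, columns


# Hypothetical contract lengths for four player-seasons.
baseline, columns = dummy_code([1, 2, 1, 3])
print(baseline)
print(columns)
```

Each level then gets its own coefficient, so the model no longer forces the effect of, say, a third contract year to equal that of a second.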

GAM Analysis

To determine the best GAM, we first evaluate which variables can be expressed
appropriately as spline functions. There are two predictor variables we wish to treat as
categorical, years_contract (due to the distribution of contract years) and teamID; we exclude
them and evaluate the rest.

Figure 3: GAM Spline Fits; Summary Results

Figure 4: GAM Spline Fits; Age, AllStarYears, FieldOut, Game
[Panels, with the fit type labeled in the figure: s(Age, 8.91), spline; s(AllStarYears, 4.64), spline; s(FieldOut, 5.65), spline; s(Game, 8.03), spline]

Figure 5: GAM Spline Fits; Hit, HomeRun, IntWalk, nPitch
[Panels, with the fit type labeled in the figure: s(Hit, 0.86), linear; s(HomeRun, 0.98), linear; s(IntWalk, 0.99), linear; s(nPitch, 7.15), spline]

Figure 6: GAM Spline Fits; StrikeOut, Walk, WinLoss, yearID
[Panels, with the fit type labeled in the figure: s(StrikeOut, 0.64), linear; s(Walk, 6.36), spline; s(WinLoss, 1), linear; s(yearID, 8.79), spline]

Figure 7: GAM Spline Fits; YearsPlayed
[Panel: s(YearsPlayed, 8.58), spline]

From inspecting the initial spline results, we see that some variables should be
converted to linear terms and the model re-evaluated. The variables that should be treated as
linear, given their behavior when plotted as splines, are Hit, HomeRun, IntWalk, StrikeOut, and
WinLoss. From the summary results, we also see that FieldOut and StrikeOut are not
significant. We will include them for the time being and evaluate how they perform with the
categorical variables. With these adjustments made for which variables should be splines and
which should be linear, we can run the model again including the categorical variables. As a
result we get a GCV score of 0.37669 from the following formula:

Formula: Y ~ s(Age) + s(AllStarYears) + s(Game) + s(nPitch) + s(Walk) + s(yearID) +
s(YearsPlayed) + FieldOut + Hit + HomeRun + IntWalk + StrikeOut + teamID + WinLoss +
years_contract
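For readers unfamiliar with the GCV score, the sketch below uses the common definition for a Gaussian additive model, n * RSS / (n - edf)^2, where RSS is the residual sum of squares and edf is the total effective degrees of freedom. The report does not say how its score was computed, so this definition is an assumption, and the residuals are hypothetical.

```python
def gcv_score(residuals, effective_df):
    """Generalized cross-validation score: n * RSS / (n - edf)^2."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)  # residual sum of squares
    return n * rss / (n - effective_df) ** 2


# Hypothetical residuals from a fit that uses 2 effective degrees of freedom.
print(gcv_score([1.0, -1.0, 1.0, -1.0], effective_df=2))
```

The score penalizes extra wiggliness through the edf term, so a lower GCV indicates a better trade-off between fit and smoothness.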

CART Analysis

A regression tree was fitted using our training dataset and the full model. The resulting tree is
shown below in Figure 8. Because our dependent variable Y is log(salary), we must undo the
log transformation to get actual salaries. We see that the predicted salary is $477,671 for a
player who has been a professional baseball player for less than 2.5 years, has fewer than 0.5
All-Star years, and is playing in a season later than 2002.5. Figure 9 below graphs the relative
error against the size of the tree and shows that the optimal size for the tree is 9 nodes. From
this we created a new tree, shown in Figure 10, which was used to predict the test salaries. The
resulting test-set MSE was 0.5662. Interestingly, no pitching statistics are included in the tree;
the determining factors are descriptors of the pitchers. The methods used to pay pitchers may
not be based on their statistics, which supports the finding in Moneyball that you can still have
a winning team with less money.
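Undoing the log transformation for a tree leaf is a single exponentiation. The sketch below assumes the natural logarithm was used, which the report does not state, and the function name is hypothetical.

```python
import math


def predicted_salary(y_hat):
    """Convert a predicted log salary back to dollars."""
    return math.exp(y_hat)


# Round trip on the dollar figure quoted in the text.
leaf_value = math.log(477_671)
print(round(predicted_salary(leaf_value)))
```

A leaf value is the mean of Y over the training records in that leaf, so exponentiating it gives the geometric mean of those salaries, which is generally lower than their arithmetic mean.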

Figure 8: Initial Regression Tree for Y

Figure 9: relative error versus the size of the tree

Figure 10: Final Regression Tree for Y

Conclusions

After examining a generalized linear model, a generalized additive model, and a
regression tree, we see that the best model for predicting salary was the GAM. While the linear
model was a fairly good model for predicting salary, its inability to deal with the non-linear
nature of certain variables (Age, AllStarYears, Game, nPitch, Walk, yearID, and YearsPlayed)
made it inferior to the GAM. The linear model with categorical variables and the GAM were
both superior to the regression tree for predicting player salary. It is likely that the regression
tree did not do well with the log transformation of the dependent variable, which suppressed a
large amount of variability in the salary data. After comparing the predictive abilities of the
various types of models, we conclude that the best method of predicting the salary of a Major
League Baseball pitcher is a GAM with the formula: Y ~ s(Age) + s(AllStarYears) + s(Game) +
s(nPitch) + s(Walk) + s(yearID) + s(YearsPlayed) + FieldOut + Hit + HomeRun + IntWalk +
StrikeOut + WinLoss.

Model                                   MSE (Test Set)
Generalized Additive Model              0.3658
Linear Model (Categorical Variables)    0.4753
Regression Tree                         0.5662
Generalized Linear Model                0.6344

Works Cited

Lahman, S. (2012, February 7). Sean Lahman. Retrieved April 25, 2012, from Sean Lahman:
http://www.seanlahman.com/baseball-archive/statistics

Wikipedia - Panel Data. (n.d.). Retrieved May 2, 2012, from Wikipedia:
http://en.wikipedia.org/wiki/Panel_data
