8/4/2019 Using Binary Regression to Analyze Win Streaks in American Football
Working Paper
Using Logistic Regression to Analyze Win Streaks in American Football
Tom Gross
Logistic regression has become a standard tool for analyzing binary outcomes. Examples
vary as widely as habitat selection by turtles (Compton, Rhymer, & McCollough, 2002), blood
clots in stroke victims (Larrue, Kummer, Muller, & Bluhmki, 2001), and predicting election
results (Antonakis & Dalgas, 2009). Perhaps the most frequent use of binary regression is to
analyze wins and losses of sports teams. Recent examples include predicting wins in the NCAA
men's basketball tournament (Coleman & Lynch, 2009), baseball standings and streaks (Sire &
Redner, 2009), and the efficiency of betting markets (Ryall & Bedford, 2010). This application
of research methodology will use logistic regression to analyze the probability of an American
football team winning its next game, using the team's recent won/loss behavior as a predictor.
Logistic Regression
The value of logistic regression to sports research lies in the way in which outcomes are
recorded. The vast majority of sports contests have binary outcomes that determine conference
winners (i.e., a win or a loss), and most of the remaining contests have only three outcomes
(win, loss, or tie), with a tie often being rare. As a class, sports outcomes are usually treated as
restricted response variables, which cannot be analyzed by means of ordinary least-squares
regression. (An example of a sports outcome with less restrictive outputs is NASCAR cup
racing, which awards a maximum of 48 points for first place, and a minimum of one point for
forty-third place. The output variable is an interval integer. However, it does not present the
estimation problems described by Aldrich and Nelson (1984).) The nature of the response
variable, coupled with the popularity of sports, has resulted in an explosion of publications using
logistic regression. (A Google Scholar search pairing "sports" and "logistic regression" returns
over 14,000 hits on papers published since 2007.)
One popular phenomenon of sports is the streak. Thomas (2010) noted that one of the
most popular sports records is DiMaggio's 56-game hitting streak in baseball (p. 1). Further, the
author claimed that such records are often considered magical and unlikely to happen simply
by chance (p. 1). In contrast, Albert (2008) argued that several authors have examined
streakiness in baseball, with mixed results. Although there was some evidence of statistically
abnormal behavior, most evidence failed to show a significant difference between examples of
streaks and what would be expected by random distributions (p. 2). Similarly, Oppenheimer and
Monin (2009) argued that the gambler's fallacy is the widely held belief that streaks will end
sooner than random chance would suggest (p. 326). The authors also claimed that streaks in
general conflict with the popular belief that small samples are like miniature examples of the
population, leading people to believe that streaks are the product of a non-random process.
One potential area of research would be to compare the likelihood of winning to previous
win and loss streaks. These streaks are defined as consecutive wins or losses; as a random
variable, a streak has both a category (win or loss) and a magnitude (the number of consecutive
wins or losses). A win streak is time dependent, with its most recent value occurring at time t. If it is
assumed that the sequence of wins and losses from t to t+1 is a Markov process, then there
cannot be a causal effect between past values and future values. It follows that researchers who
wish to make use of the "no memory" property of Markov processes must justify the applicability of a
Markov process to their model. A simple way to do this is to model the outcome as the sum of a
defined Markov process and some other independent variable process. The second process could
be any well-defined function of the selected independent variable. Performing logistic regression
on a model parsed this way results in separate parameter estimates for the two processes. Further,
it is possible to create a regression equation that allows the second process to be estimated
independently. This method will be demonstrated below.
Wins and Win Streaks
One application of logistic regression using win and loss streak data is to test the
relationship between win streaks and the probability of winning. Sire and Redner (2009) used a
variation of logistic regression (the Bradley-Terry model) to show that, for the most part, win
streaks have a mean and distribution that agree with the assumed underlying distribution.
However, they found evidence of non-random behavior in their analysis of baseball results during
the period from 1901 to 1960 (p. 479). The methodology of Sire and Redner included
assumptions about the type of distribution underlying a team's probability of winning. They did not
specifically identify what the factors affecting probability were, making it difficult to account for
differences in winning percentages between teams. Instead, they limited their research to the
questions of team parity, and the evidence of winning streaks being abnormally long. In
particular, Sire and Redner did not test for a relationship between win streaks and changes in the
probability of winning. Although the authors claimed that this question continues to be
vigorously debated, the last citation they included was published in 2000 (p. 474).
Sire and Redner (2009) provided an excellent example of the difficulties facing
researchers attempting to explore the relationship between wins and losses, and any artifact of
how those data are recorded. Specifically, any attempt to determine a relationship between win
streaks and the probability of winning needs to model both the independent variable used to
predict outcomes, as well as the probability model itself. One method commonly used to do this
is the Logit model, which posits a linear explanatory variable and a continuous, though bounded,
response variable. The Logit model allows a wide variety of explanatory variables while
maintaining a response variable profile that can be mapped onto a probability distribution
function.
The Model
The Logit model is based on simple binary regression, where the probability of a win is
given by
$$p_Z = \frac{1}{1 + e^{-Z}} \qquad (1)$$
Equation (1) plots a sigmoid curve that represents the probability of event Z. The
equation for Z and its relationship to the underlying independent variables that affect p is given
by
$$Z = \ln\left(\frac{p_Z}{1 - p_Z}\right) = \alpha + \beta X \qquad (2)$$
Here Z can represent any outcome from a linear function of X, although in the regressions
reported below, X represents win and loss streaks. For a simple example, imagine that the
probability of solving a specific math problem on a test is a function of the number of similar
problems solved prior to the test. The variable X would represent the number of similar problems
solved, and Z would represent the additive function of the problems solved. Because Z affects
the probability of success, p_Z is a conditional probability, and a monotonically increasing
function of Z and X.
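As an illustration only, the sigmoid relationship between Z and the probability of success can be sketched in Python. The standard logistic form is assumed, and the values of `alpha` and `beta` in the math-problem example are hypothetical numbers chosen for demonstration, not estimates from this paper.

```python
import math

def logistic(z):
    """Map a linear score Z onto a probability between zero and one."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, alpha, beta):
    """Z as an additive (linear) function of X."""
    return alpha + beta * x

# Hypothetical coefficients for the math-problem example (assumed values).
alpha, beta = -1.0, 0.5

# Probability of solving the test problem after 0..6 similar problems solved.
probabilities = [logistic(linear_score(x, alpha, beta)) for x in range(7)]
for x, p in enumerate(probabilities):
    print(f"X = {x}: p = {p:.3f}")
```

With a positive `beta`, each additional problem solved raises the probability, but by a shrinking amount as the probability approaches one.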
Equation (1) is a solution to equation (2) expressed as a probability function. Often such
relationships are represented as an odds ratio instead of a probability; the argument in equation
(2) is a simple transformation of the odds ratio of event Z. (The odds ratio of an event is the
probability of an event happening divided by the probability of that event not happening,
expressed in terms of discrete outcomes.) Equations (1) and (2) imply that X (for example, the
number of problems solved) is the only determinant of the probability of an event (for example,
the probability of solving a specific problem on a test). This simple model of the causes affecting
the probability of an event is neither likely, nor even a stable series. It is therefore assumed that
the probability of a win is a function of a set of variables (Gamma) that have a fixed mean and
variance, which are uncorrelated with X, making the full model
(3)
Here X is defined as it is in equation (2). However, p represents the total probability of
winning. Gamma represents the value of a function of underlying causes of changes in the
probability of a win, not including X. For example, Gamma could be a function of home field
advantage (for example, see Levernier & Barilla, 2007), individual matchups (Brown & Sokol,
2010), or any other variable that would systematically affect the probability of winning.
Following the methodology of Brown and Sokol, it is assumed that the variable Gamma is
normally distributed. (This assumption is not necessary, but it simplifies the
analysis.)
Equation (3) assumes that the probability of a win can be held constant and evaluated in
X; it also implies that there is no correlation between the variables in Gamma and X. It follows that
equation (2) is the regression model used to determine the effect of a win streak on the
probability of winning the next game, whereas equation (3) represents the relationship between
all independent variables and the probability of a win. The variable X represents consecutive
wins and losses. As noted above, this variable has both categorical (wins or losses) and
magnitude components (number of wins or losses). Of these two components, the category is the
most important because the model is testing if, as was claimed by Sire and Redner (2009), wins
and losses are self-reinforcing (p. 474). (The magnitude of the win streak represents the most
restrictive form of the model.) Wins are given positive values and losses are given negative
values so that the values of X preserve both the categorical and magnitude aspects. Defining X
this way eliminates zero as a possible value for X. This does not affect the estimation of the
coefficient on X; on the contrary, it improves the estimation of the parameter by forcing the
intercept of the model to be zero. (The specifics of how the random variable X is generated are
shown below.) It should be noted that more complex model specifications are possible.
One interesting variation would be to include X as an element of Gamma; these variations are
more difficult to evaluate, and are not considered here.
In the regression model specification shown in equation (2), Z is bounded by the
underlying distribution of X. For example, because X represents consecutive wins at time t, the
maximum value for X (assuming every game was won) is the total number of games played. Noting that
professional American football teams play 16 regular season games, the value of X is bounded by plus
or minus 16. (In the real data, the maximum winning streak in 2010 was eight and the maximum
losing streak was 10.)
Equation (3) represents the relationship between the probability of an outcome (e.g., a
win) and all factors that might change the probability. However, the coefficient on X, beta, only
measures the effect of win streaks on the total probability of a win. The individual parameters for
Gamma represent the contributory effect of each of those variables. Examples of factors that
affect Gamma might include home field advantage, team matchups, or any other variable that
affects the probability of a win, except for the contributory effect of a win streak. (The obvious
concern that those factors might directly affect X, the win streak variable, is discussed in detail
below.)
Interpreting the regression results in terms of the full model is straightforward: if X is
uncorrelated with other determinants of p and adds no significant amount to the value of p, the beta
term will be estimated as not different from zero; this implies that win streaks do not affect the
probability of winning the next game. The specific hypothesis being tested by the regression
equation is the null hypothesis that beta equals zero ($H_0: \beta = 0$). The null can be rejected if
regression results indicate a significant
relationship, e.g., the p-value of the beta term is below the specified level of significance.
Typically, a researcher reports a significant relationship between the regression variable and the
outcome variable as justification for rejecting the null hypothesis and accepting the alternative.
Briefly stated, a very low p-value is usually considered both a necessary and a sufficient reason to
reject the null hypothesis. Unfortunately, although significance is necessary to reject the null, it
is not sufficient for the regression equation (2) because the independent variables used in the
regression equation are biased toward significance. This bias does not mean the model is not
valid; however, it does require discussion of the nature of the bias contained in the variable X.
Potential Bias in Estimates of Beta
The potential effect of bias in X can be demonstrated by assuming that the unconditional
probability of a win is determined by a probability function of a set of variables not including X,
as equation (3) does. Using the method of Kvam and Sokol (2006), the transition from one
state to another (e.g., from winning to losing), not including X, is modeled as a Markov chain;
sequences of this type have no memory, meaning that subsequent events are independent of the
outcomes of prior events. (The assumption that the determinants of a win, excluding X, constitute
a Markov process is not necessary; however, its inclusion simplifies the analysis, and does not
change the model expressed in equation (3).) Further, if we assume that the coefficient on X is
zero ($\beta = 0$), the model implies that changes in Gamma will not affect the value of X.
Unfortunately, this assumption can be shown to be false.
As noted above, the values of X (win streaks) are simply consecutive outcomes of the
same type (wins or losses). That is, a win streak of three at time t means that the outcomes from
t-3 to t-1 were all wins. However, the win streak is simply a random array of outcomes in periods
t-3 to t-1. It follows that the expected value of a win streak is given by
$$E[\mathit{WinStreak}] = \sum_{n=1}^{\infty} n\,p^{n} = \frac{p}{(1-p)^{2}} \qquad (4)$$
WinStreak is therefore a Taylor series in n and p, and more importantly, a function of p.
Although solving for the expected value of WinStreak is simple, characterizing the behavior of
the variable is more difficult. One way to do this is to describe WinStreak using known
distributional forms. Assuming p is fixed, we can define q as 1-p; it follows from equation
(4) that the expected value of WinStreak is the variance of a geometric distribution in q. (This is
a rather minor observation that seems not to have been noted in the literature.) Unfortunately,
WinStreak does not have a stable variance; further, higher moments of WinStreak probably do
not exist.i
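The link between the series and the geometric distribution can be checked numerically. The sketch below assumes that equation (4) is the sum of n times p to the power n, which is what the Taylor-series and geometric-variance remarks suggest; the function names are illustrative and do not come from the paper.

```python
def streak_series(p, terms=5000):
    """Partial sum of n * p**n for n = 1..terms (assumed form of equation (4))."""
    return sum(n * p ** n for n in range(1, terms + 1))

def geometric_variance(q):
    """Variance of a geometric distribution with success probability q."""
    return (1.0 - q) / q ** 2

for p in (0.25, 0.5, 0.75, 0.9):
    q = 1.0 - p
    print(f"p = {p}: series = {streak_series(p):.4f}, "
          f"geometric variance in q = {geometric_variance(q):.4f}")
```

Under that assumption the two columns agree, and the value rises steeply as p approaches one, consistent with the statement that the expected streak is a function of p.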
It is also possible to examine directly the effect of a change in the underlying probability
on the value of WinStreak. The increase in WinStreak for a change in p is given by
$$\frac{\partial E[\mathit{WinStreak}]}{\partial p} = \frac{1+p}{(1-p)^{3}} > 0 \qquad (5)$$
Equation (5) confirms that both WinStreak and NextGame (the term that represents the
game played at time t in the regressions below) are increasing in p. (The outcome of the next
game is assumed to be a function of p; equation (5) shows that WinStreak is also a function of p.)
Equation (5) also verifies that if we assume the value of beta in equation (2) is zero, there is no
memory in the sequence of wins, even though both WinStreak and NextGame are functions of p.
(This follows because it is possible to describe NextGame without reference to X.) The important
point is that as p gets large, both the number of wins and the magnitude of WinStreak increase.
Regressing NextGame on WinStreak will show spurious correlation because larger values of
WinStreak mean that wins will be more frequent. Literally, Gamma is a confounding variable
in regression equation (2). Regressions that demonstrate a significant relationship between
WinStreak and NextGame are necessary to reject the null hypothesis, but not sufficient because
the result can be a spurious effect of Gamma.
The behavior of WinStreak causes a significant bias when the probability of a win is very
large or very small. The bias will be smaller if the number of paired observations at a given p is
small or if p is approximately 50%. However, if the probability of a win is extreme and the number
of paired observations is large (for example, when the number of games played by a basketball
team is 82), simply regressing NextGame on WinStreak might indicate a significant relationship
when one does not exist.
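The confounding argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure used in this paper: every simulated team has no memory (each game is an independent draw), the win probabilities, the 82-game season length, and the seed are arbitrary choices, and a simple correlation stands in for the regression.

```python
import math
import random

def win_streak(outcomes):
    """Signed streak at each game: +k after k straight wins, -k after k straight losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def pooled_pairs(win_probs, games, rng):
    """Simulate memoryless seasons and pool (WinStreak at t, NextGame at t+1) pairs."""
    xs, ys = [], []
    for p in win_probs:
        outcomes = [1 if rng.random() < p else 0 for _ in range(games)]
        streaks = win_streak(outcomes)
        xs.extend(streaks[:-1])   # streak entering the next game
        ys.extend(outcomes[1:])   # outcome of the next game
    return xs, ys

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
r_mixed = correlation(*pooled_pairs([0.2, 0.35, 0.5, 0.65, 0.8] * 6, 82, rng))
r_equal = correlation(*pooled_pairs([0.5] * 30, 82, rng))
print(f"teams with different p: r = {r_mixed:.3f}")
print(f"teams with equal p:     r = {r_equal:.3f}")
```

When the teams differ in underlying strength, the pooled data show a clearly positive association between WinStreak and NextGame even though no game depends on the previous one; when every team has the same p, the association is close to zero.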
The results of equations (4) and (5) call into question the usefulness of using a Logit
model on win streaks in any form. However, because the bias is one-sided (indicating a
relationship when one does not exist), the model is still useful when it fails to reject the null.
Still, the existence of bias points to how difficult it is to reject the null hypothesis for the
regression equation (2). If the regressions show a significant relationship, an additional set of
regressions should be run after attempting to remove the bias from the WinStreak variables.
(How this might be done is unclear because all potential factors in Gamma would have to be
eliminated as the source of bias.) The gist of the matter is that it is possible for the beta term in
equation (2) to show significance when there is no relationship between the independent and
dependent variables. Therefore, it is important to run additional tests when any significant results
from the regression are discovered.
Data
The real data for regression were gathered from Sports-Reference.com (Sports Reference,
2011). This site has extensive historical data on professional and amateur sports. Data were
retrieved in March of 2011. The NFL win/loss data were for the 2010 year, and included only
regular season games (512 individual outcomes representing 256 total games played). Each win
was recorded as a one, and each loss was recorded as a zero. The data were arrayed as a single
set of values, representing all 512 outcomes. Although this represents a doubling of the actual
events, the only effect on the parameter estimation is a reduction of the standard errors by half.
This array of 512 outcomes was the set from which all other variables were created.
Win streaks for teams were calculated in three different ways, each of which replaces the
value of X in equation (2); however, only the first two were used for estimating the model. The
three variable names used were WinStreak, Pos_Neg, and Cumm_WinStreak. WinStreak
calculates win streaks and loss streaks separately and then combines the result. For example,
assume that a sequence of outcomes in the data set is {W, W, L, W, L, L}. Converting this to
numeric values, the sequence becomes {1, 1, 0, 1, 0, 0}. This sequence has four different streaks:
a win streak of two, a loss streak of one, a win streak of one, and a loss streak of two. However,
streaks are measured at time t, meaning that this sequence will have six different values for
WinStreak. Each win adds +1 and each loss adds -1. (As noted above, wins and losses are
calculated separately for technical reasons that do not affect the parameter estimation.) Starting
at t=1 through t=6, the six values for WinStreak are {1, 2, -1, 1, -1, -2}. WinStreak represents the
most restrictive form of the regression equation (2) because it retains not only the type of
-
8/4/2019 Using Binary Regression to Analyze Win Streaks in American Football
12/28
12
outcome (a win or a loss, represented by positive or negative numbers), but also the magnitude of
the variable. The values for WinStreak are the X values in the regression equation (2). This is the
way one would expect win and loss streaks to be reported, and what we mean when we say, "my
team has a two-game win streak." The only difference is that the sign of the value represents a
win (+) or a loss (-).
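The construction of WinStreak can be sketched in a few lines of Python. The function name and the reset logic are an illustrative reading of the description above, not code from the original analysis.

```python
def win_streak(outcomes):
    """Signed streak at each time t for a sequence of 1 (win) and 0 (loss) values."""
    streaks = []
    current = 0
    for won in outcomes:
        if won:
            # Extend a win streak, or start a new one after a loss.
            current = current + 1 if current > 0 else 1
        else:
            # Extend a loss streak, or start a new one after a win.
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

# The sequence {W, W, L, W, L, L} used in the text.
example = [1, 1, 0, 1, 0, 0]
print(win_streak(example))  # [1, 2, -1, 1, -1, -2]
```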
Pos_Neg is calculated in a similar way, except that the only values Pos_Neg can take are
positive one and negative one. For example, the string of results described above becomes {1, 1,
-1, 1, -1, -1}. Positive numbers represent win streaks and negative numbers represent loss
streaks; however, the length of the streak is lost. This is mathematically identical to coding an
independent variable as a zero or a one; the only difference is that using the values of negative
one and positive one forces the intercept to be zero. (A graph of the regression relationship would
go through the origin.) Pos_Neg represents a less restrictive version of the regression equation
(2) because it does not retain the magnitude of the win streak; it simply indicates if the streak is
one of winning or losing.
The final constructed independent variable is Cumm_WinStreak. This variable sums the
numerical value for each win and loss, resulting in a cumulative value for consecutive wins and
losses. Using the same string of values used for WinStreak, Cumm_WinStreak becomes {1, 2, 1,
2, 1, 0}. As teams accumulate wins, Cumm_WinStreak becomes larger and positive; losing
causes the value to become smaller, and eventually negative, if more games are lost than are
won. This variable was not considered appropriate for real data because of serious problems that
result when attempting to model its behavior. Unlike the previous two constructed variables,
Cumm_WinStreak does not reset after each change of state. (Moving from a win to a loss, or the
opposite, is a change of state.) It follows that the value of the variable can drift from its point of
origin, and that the drift can be arbitrarily large. For example, even assuming a Markov process
with $p = 0.5$, the value of Cumm_WinStreak can be arbitrarily large at any given time. (This is
a common feature of Markov chains, which is often forgotten by people investing in the stock
market.) Further, Cumm_WinStreak has an unusual limit behavior: as the number of unplayed
games approaches zero (i.e., the last games in the regular season are played), the value of
Cumm_WinStreak becomes a simple transform of the probability of winning (in particular, the
odds ratio). This means that Cumm_WinStreak will often show significant results even when the
other constructed variables do not.
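The other two constructed variables can be sketched the same way. The function names are illustrative; the logic follows the descriptions of Pos_Neg and Cumm_WinStreak given above.

```python
from itertools import accumulate

def pos_neg(outcomes):
    """+1 for each win and -1 for each loss; the length of the streak is discarded."""
    return [1 if won else -1 for won in outcomes]

def cumm_win_streak(outcomes):
    """Running sum of +1/-1 values; it never resets, so it can drift from zero."""
    return list(accumulate(pos_neg(outcomes)))

# The sequence {W, W, L, W, L, L} used in the text.
example = [1, 1, 0, 1, 0, 0]
print(pos_neg(example))          # [1, 1, -1, 1, -1, -1]
print(cumm_win_streak(example))  # [1, 2, 1, 2, 1, 0]
```

Because the running sum is never reset, its final value equals wins minus losses, which is why it ends up tracking a team's overall record rather than its current streak.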
The regression equation (2) matches the independent variable at time t with the outcome
at time t+1. This required that each of the independent variables be labeled by time; although
time is not a variable in the regression model, aligning the variables correctly is necessary. The
values for each of these variables at time t were paired in logistic regression with the outcome of
the next game played at time t+1, drawn from the original sequence of outcomes. The last value for
each set of 16 games was deleted because the independent variable could not be paired with a
next game. This reduced the total number of regression pairs to 480.
Using the array described above, the six values would be reduced to five; the
independent and dependent variables (representing WinStreak and NextGame) become {1, 2, -1,
1, -1} and {1, 0, 1, 0, 0} respectively. Note that the sequence of values for WinStreak represents
values calculated from t=1 to t=5, whereas the values of the NextGame represent values
calculated for periods t=2 to t=6.
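The alignment of WinStreak at time t with NextGame at time t+1 can be sketched as follows. The function names are illustrative, and the 32 placeholder seasons are made-up alternating sequences used only to confirm the count of regression pairs, not real NFL results.

```python
def win_streak(outcomes):
    """Signed streak at each time t for a sequence of 1 (win) and 0 (loss) values."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def regression_pairs(outcomes):
    """Pair WinStreak at time t with the outcome at time t+1 for one team."""
    streaks = win_streak(outcomes)
    return streaks[:-1], outcomes[1:]   # the last streak has no next game

def pooled_pairs(seasons):
    """Pair within each team's season, then pool the pairs across teams."""
    xs, ys = [], []
    for outcomes in seasons:
        x_team, y_team = regression_pairs(outcomes)
        xs.extend(x_team)
        ys.extend(y_team)
    return xs, ys

x, y = regression_pairs([1, 1, 0, 1, 0, 0])
print(x)  # [1, 2, -1, 1, -1]
print(y)  # [1, 0, 1, 0, 0]

placeholder_seasons = [[1, 0] * 8 for _ in range(32)]  # 32 teams, 16 games each
xs, ys = pooled_pairs(placeholder_seasons)
print(len(xs))  # 480
```

Pairing within each team before pooling keeps the last game of one team from being matched with the first game of another.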
Each of the win streak variables represented a slightly different set of assumptions for the
probability model. The simplest interpretation of equation (3) is that winning increases the
chance of winning, and (unfortunately for fans) losing increases the chance of losing. This is
captured by the independent variable Pos_Neg. Using Pos_Neg assumes that a win in period t
affects the probability of a win in period t+1. However, the magnitude of the string of wins
has no additional effect. Under this assumption, winning improves the chances of winning by a
fixed amount until the next loss, at which time the probability of a win reverts to the original
value. (Original values for the probability of a win are given by the value of Gamma.)
In contrast to Pos_Neg, the design of WinStreak assumes that consecutive wins or losses
have an additive effect; under this assumption, long streaks are self-perpetuating. Sire and
Redner characterized this, without being specific, as winning streaks being self-reinforcing
(2009, p. 474). It should be noted that this assumption leads to significant modeling problems; in
this case, equation (2) becomes a positive feedback function, as described above. Under this
scenario, if the sequence is long enough, p converges to its limit of one (i.e., no probability of a
loss) or zero (no probability of a win).
Although there are significant concerns about the modeling of the independent variables,
the dependent variable NextGame can be described as a Markov chain with a transition
probability of p. Assuming win streaks do not affect the probability of winning, equation (3)
becomes a simple Markov process. This is the same methodology used by Brown and Sokol
(2010), Kvam and Sokol (2006), as well as others. As noted above, significant results from the
regression equation (2) might be interpreted as evidence that the sequence of outcomes is not a
Markov chain. It follows that rejection of the null implies rejection of the distributional
assumptions on which equation (3) is based. In that event, it would be necessary to create a
different probability model for wins and losses.
Testing the Model
As noted above, using the independent variables in regression equation (2) raises
significant concerns about the ability to capture any relationship that might exist between win
streaks and the next game played. It is therefore appropriate to test the model on simulated data
prior to running regressions on actual data. The regression model was tested using simulated data
consisting of 1000 randomly generated values. The values were generated using an Excel add-in
(MegaStat), which produced random variables uniformly distributed from zero to
one. The value of N=1000 was chosen to avoid bias that can result from small
samples. A large sample will minimize the likelihood of a Type II error; in addition, the bias
inherent in estimating Logit models creates a significant likelihood of a Type I error with small
samples. This point was made by Nemes, Jonasson, Genell, and Steineck (2009) as a general
problem with logistic regression; it is an even larger problem here due to the inherent bias of the
variables used to represent win streaks.
Assuming a valid model specification of the relationship between independent and
response variables, regression using a random set should not yield significant results. The
random simulated data were subsequently modified with a bias to test the model when there was
a known effect; assuming that the model specification is valid, the results should be significant.
As noted, the randomly generated data were uniformly distributed between the values
of zero and one. These data were then converted into binary data (Y) using the following rule: if
the generated value is less than 0.500, Y = 0; otherwise, Y = 1. This yielded a set of 520 zero
values and 480 one values representing losses and wins, respectively. Win streaks were calculated
for consecutive wins and losses in the simulated data using the method described in the data
section above. For example,
WinStreak was calculated by assigning the first win in a string a value of one, the second
consecutive win a value of two, and so on. Each loss was assigned a value of negative one,
so that consecutive losses resulted in consecutively larger negative numbers. The resulting
WinStreak consisted of 1000 positive or negative numbers representing the consecutive wins or
losses to that point. These values were paired so that the value of WinStreak at time t was paired with the
outcome at time t+1, the NextGame variable in the regression model. (This reduced the number of
paired values to 999 because there is no next game after the last win streak value is calculated.)
Logistic regression was performed on the paired data to determine if there was a significant
relationship between the two simulated variables.
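The simulation test can be approximated in pure Python. This is a sketch under several assumptions and will not reproduce the reported counts or coefficients: Python's `random` module stands in for the MegaStat add-in, the seed is arbitrary, the threshold rule is taken to assign a loss below 0.5 and a win otherwise, and `fit_logit` is a hand-written Newton-Raphson fit substituted for whatever statistical package produced the tables.

```python
import math
import random

def win_streak(outcomes):
    """Signed streak at each time t for a sequence of 1 (win) and 0 (loss) values."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson fit of P(y=1) = 1 / (1 + exp(-(a + b*x))).

    Returns the intercept a, the slope b, and the standard error of b.
    Assumes xs is not constant (otherwise the Hessian is singular).
    """
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to a
            g1 += (y - p) * x      # gradient with respect to b
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    se_b = math.sqrt(h00 / det)    # from the information matrix at convergence
    return a, b, se_b

rng = random.Random(2011)                       # arbitrary seed
draws = [rng.random() for _ in range(1000)]     # uniform on [0, 1)
outcomes = [0 if u < 0.5 else 1 for u in draws]
streaks = win_streak(outcomes)
x, y = streaks[:-1], outcomes[1:]               # 999 (WinStreak, NextGame) pairs

a, b, se_b = fit_logit(x, y)
print(f"pairs = {len(x)}, intercept = {a:.4f}, slope = {b:.4f}, "
      f"SE = {se_b:.4f}, z = {b / se_b:.4f}")
```

Because the outcomes are independent draws, the slope on WinStreak should be close to zero relative to its standard error for most seeds.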
Assuming that consecutive wins and losses affect the probability of winning the next
game is tantamount to saying that beta is a significant and positive value in equation (2). The
interpretation is that long win streaks increase the probability of winning the next game, and long
losing streaks decrease that probability. By design, the random probabilities used to generate the
simulated data were uniformly distributed with a mean of 0.50; the value of each probability did not
depend on the ex post calculated win streaks. It follows that the change in probability for a given
change in X is given by
$$\frac{\partial p_Z}{\partial X} = 0 \qquad (6)$$
It also follows that the expected value of Z in equation (2) must be zero, and equation (1)
becomes
$$p_Z = \frac{1}{1 + e^{0}} = 0.5 \qquad (7)$$
Equation (7) implies that binary regression using the randomly generated values for Y
should not yield significant coefficients; that is, the p-values should exceed traditional
significance levels. The results of binary logistic regression on the unbiased random data, using
the value of WinStreak at time t and NextGame (the outcome at time t+1), are shown in Table 1:
Table 1
Logistic Regression Table for Simulated Data
Predictor    Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant     -0.084        0.0635   -1.3228   0.1860
WinStreak    -0.0107       0.0272   -0.3934   0.6930    0.99         0.94           1.04
Note: CI is Confidence Interval
As expected, the p-value on the variable WinStreak is large, implying that there is no
relationship between WinStreak and the probability of winning the next game. Given the
randomness of the data, the lack of significance using randomly generated data is encouraging; it
shows that a model that could capture an existing relationship is not necessarily tricked into
showing significance because of inherent bias.
Testing the model under conditions when it should not work is of little value for
demonstrating that the model will work when it should. Instead, it is necessary to show that the
model can accurately demonstrate a relationship between win streaks and the probability of
winning the next game when a relationship is known to exist. The original random probabilities
were modified to test this by biasing the probability assigned to the event at time t+1 using the value of
WinStreak at time t. The bias added consisted of adding 10% to the existing probability for each
ordinal number of the win streak in the previous period. The biased data are modeled on equation
(2); they have no Gamma variable as shown in equation (3). Thus, the new probability for the event
at time t+1 is given by:
(8)
The maximum increase in probability was 30.34%, whereas the maximum decrease was
69.16%. Further, the median value was -0.57%, with first and third quartile values of -7.36% and
6.32% respectively. Somewhat surprisingly, the number of wins went down from 480 to 443.
These values represent a significant change in probability, based on the previously paired
WinStreak value. Each of the new probabilities was converted to a binary value using the same
rule described previously, and new values for WinStreak were calculated. Binary logistic regression
was then run on the new biased values. The results are shown in Table 2:
Table 2
Logistic Regression Table of Biased Results
Predictor          Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant           -0.1441       0.0659   -2.1866   0.0290
Biased WinStreak   0.1250        0.0196   6.3776    0.0000    1.13         1.09           1.18
Note: CI is Confidence Interval
As the regression table shows, the model worked surprisingly well to capture the effect of
the bias introduced; the associated p-value is zero to four decimal places. (The results would be
significant even if the standard error value for the independent variable were doubled.) The
coefficient on the biased WinStreak indicates that each additional win increases the log-odds of
winning by an estimated 0.125 (an odds ratio of 1.13), and the actual value of 10% lies well
within the 95% confidence interval around that estimate. In addition, the sign of the coefficient
is positive, as predicted. Performing the same analysis using the binary data (where positive
values of WinStreak were given a value of one, and the remaining data were given a value of
negative one) gave similar results.
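Because the coefficient is estimated on the logit scale, its effect on the probability of winning depends on the starting point. The sketch below converts the Table 2 estimates into predicted probabilities for a few streak values. It assumes the standard logistic form P = 1 / (1 + exp(-(constant + beta * x))), and `win_probability` is a hypothetical helper name, not something defined in the paper.

```python
import math

def win_probability(constant, beta, win_streak):
    """Predicted P(win next game) under a logit model."""
    logit = constant + beta * win_streak
    return 1.0 / (1.0 + math.exp(-logit))

# Estimates as reported in Table 2
CONSTANT, BETA = -0.1441, 0.1250

# Multiplicative change in the odds of winning per additional win in the streak
odds_ratio = math.exp(BETA)
print(round(odds_ratio, 2))

for streak in (-3, -1, 0, 1, 3):
    print(streak, round(win_probability(CONSTANT, BETA, streak), 3))
```

Under these assumptions the change in probability per additional win is a few percentage points near a streak of zero and shrinks as the predicted probability moves toward 0 or 1.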
The results from all three regressions indicate that the model in all its forms will capture
any effect that we know is present in the next win. However, the simulated data are based on
equation (2); they do not imply that equation (2) can measure the coefficient on X if the actual
probability model is as represented by equation (3). As noted above, the significance of the
regression is a necessary but not sufficient condition to reject the null hypothesis. This follows
from the bias inherent in estimates of the coefficient on X when other factors, such as those noted
above, affect the probability of a win.
The existence of bias suggests a two-step process for evaluating the relationship between
win streaks and the probability that a sports team wins its next game. The first step in the
process is to estimate the beta term in equation (2) to determine if there is evidence of an effect.
If there is not, the null hypothesis cannot be rejected. If evidence of a significant relationship
does exist, additional tests should be run to determine if the evidence might be due to spurious
correlation. Although there are significant flaws in this approach, which are addressed in the
discussion below, this two-step process provides a reasonable method for evaluating the research
problem.
Results using American Football
The data used to evaluate the relationship between team win streaks and the probability
of winning the next game played consisted of the wins and losses of all 32 teams in the National
Football League for the 2010-2011 regular season. Data were downloaded from
www.Sports-Reference.com (2011) and were coded as described above. Each of the 32 NFL teams plays
16 games per regular season, resulting in 512 outcomes, representing 256 total games.
Each team had individual winning percentages calculated, as well as the independent variables
WinStreak, Pos_Neg, and Cumm_WinStreak, as described above. Regressions were run for the
teams individually; however, the results showed no significance. (This could easily have been
predicted because the sample size of 15 per team is too small to provide good results in logistic
regression.) The independent variables at time t were paired with the next game outcome at
time t + 1, resulting in a loss of one game per team, and reducing the total number of data pairs
to 480.
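The pairing step can be sketched in a few lines. The coding of the variables is defined earlier in the paper and is not repeated in this section, so the sketch assumes that WinStreak is a signed running streak (+k after k straight wins, -k after k straight losses) and that Pos_Neg is its sign. The function names and the sample season are hypothetical.

```python
def win_streak(outcomes):
    """Signed running streak: +k after k straight wins, -k after k straight losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def pair_with_next_game(outcomes):
    """Pair the streak after game t with the outcome of game t + 1."""
    streaks = win_streak(outcomes)
    return [
        {"WinStreak": s, "Pos_Neg": 1 if s > 0 else -1, "NextGame": nxt}
        for s, nxt in zip(streaks[:-1], outcomes[1:])
    ]

# Hypothetical 16-game season for one team (1 = win, 0 = loss)
season = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]
pairs = pair_with_next_game(season)
print(len(pairs))       # 15 pairs per team
print(32 * len(pairs))  # 480 pairs when all 32 teams are pooled
```

The last game of the season has no following outcome, which is why each team contributes 15 pairs rather than 16.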
Although the data were paired using the results for each team, the 32 sets of data were
combined into a single regression of 480 pairs. This decreases the standard errors for the
coefficient terms by a factor of two; however, it has no other effect on the estimates of the
parameters. This technique does raise some concerns beyond the effect on standard errors. First,
it is not obvious that a single stable relationship can be measured across several teams. A more
accurate description of X would be as a vector with j elements, each j representing a different
team. Then each x_ji is the ith pair in the regression for the jth team. This implies 32 different
independent coefficients. Even if some of the coefficients were non-zero, there is no guarantee
that the coefficient on the vector can be estimated. Second, even if a single relationship exists for
each team, there is no guarantee that the actual estimate of the vector beta will be non-zero.
Although these are valid questions, the response to both is that the regression model
posited in equation (2) specifically assumes that there is a single stable relationship between win
streaks and the probability of winning a team's next game, regardless of when it is measured.
The null hypothesis for that model can only be rejected if such a relationship is detected. If such
a relationship cannot be detected, it does not matter if it is because such a relationship does not
exist, or if it exists in a form that cannot be captured by the model. Failing to reject the null
hypothesis does not prove that there is no relationship; it merely presents evidence that the
relationship does not exist in the assumed form. In short, these concerns are further evidence
of how difficult it is to measure a relationship between win streaks and the probability of
winning. They do not, however, invalidate the model itself.
An advantage to combining all the pairs into a single data set is that there will be an equal
number of wins and losses. This has an effect of forcing the intercept through the origin, that is
E[ . This both provides additional evidence on the validity of the model and improves the
accuracy of the estimate of the beta term in equation (2).
Regressing WinStreak on NextGame using actual data had the expected results, as can be
seen in Table 3:
Table 3
Logistic Regression Table: WinStreak on NextGame
Predictor   Coefficient   SE       Z        p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant    0.00298       0.0914   0.0326   0.9740
WinStreak   0.0319908     0.0370   0.8654   0.3870    1.03         0.96           1.11
Note: CI is Confidence Interval
The regression results show that the model is not biased to the degree that it always
shows significance. The estimate of the coefficient on the beta term for equation (2) is not
significantly different from zero, and the null hypothesis cannot be rejected. The assumption
that the constant term is equal to zero also cannot be rejected, although the model using
WinStreak allows some variation for this term. These results are somewhat surprising, given the
tendency of the model to find significance when smaller samples were used, or when the winning
percentage tended toward extremes. (Additional regressions were run using individual team data,
N = 15, and as expected, frequently indicated significance. Those results are not shown because
they can be dismissed as being caused by a combination of small sample bias and the bias
inherent in WinStreak.)
As noted, the model using WinStreak is the most restrictive form of the regression
equation, because it requires that the beta term capture the effect of each additional win or loss in
the streak. A less restrictive form of the model uses Pos_Neg; this version posits that wins and
losses are reinforcing, but not additive. It follows that the threshold for significance is lower.
Nonetheless, the results for this regression again gave the expected theoretical results, as shown
in Table 4:
Table 4
Logistic Regression Table: Pos_Neg on NextGame
Predictor   Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant    0.0000        0.0913   0.0000    1.0000
Pos_Neg     -0.03334      0.0913   -0.3651   0.7150    0.9700       0.81           1.16
Note: CI is Confidence Interval
As expected, the Pos_Neg version of the model provides the cleanest estimate of the beta
coefficient in equation (2). This follows because the regression line must go through the origin
(i.e., the constant term must be zero) when the only possible values for X are -1 and +1. The
results show that the null hypothesis again cannot be rejected. There is no evidence of an effect
even in the less restricted model, where a win or loss streak simply influences the probability of a
win in the next game played.
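When X takes only the values -1 and +1, the logit model has as many parameters as there are groups, so its estimates can be written directly from a 2x2 table of counts. The sketch below shows that calculation. The counts are illustrative values chosen to be consistent with the coefficient and standard error reported in Table 4; they are not taken from the paper's data, and `pos_neg_logit` is a hypothetical helper name.

```python
import math

def pos_neg_logit(wins_after_pos, losses_after_pos, wins_after_neg, losses_after_neg):
    """Closed-form logit fit of NextGame on X in {-1, +1} from a 2x2 table of counts."""
    logit_pos = math.log(wins_after_pos / losses_after_pos)  # log-odds of a win when X = +1
    logit_neg = math.log(wins_after_neg / losses_after_neg)  # log-odds of a win when X = -1
    constant = (logit_pos + logit_neg) / 2.0
    beta = (logit_pos - logit_neg) / 2.0
    # Under the +/-1 coding both estimates share the same large-sample standard error
    se = 0.5 * math.sqrt(1.0 / wins_after_pos + 1.0 / losses_after_pos
                         + 1.0 / wins_after_neg + 1.0 / losses_after_neg)
    return constant, beta, se

# Illustrative counts for 480 pairs (assumed, not from the paper)
constant, beta, se = pos_neg_logit(118, 122, 122, 118)
print(round(constant, 4), round(beta, 5), round(se, 4))
```

In this sketch the constant is zero because the two rows of counts mirror each other; with unbalanced counts the same formulas would give a non-zero constant.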
The final regression used the variable Cumm_WinStreak. As noted above, this model is
not considered a valid example of the regression equation (2) because the independent
variable for each team converges to a simple transform of the winning percentage. As such, the
model should show the relationship that exists. The predicted results are that there will be a
significant relationship between the independent and response variable, even though the
preceding regressions found none. The results of this regression can be seen in Table 5:
Table 5
Logistic Regression Table: Cumm_WinStreak on NextGame
Predictor        Coefficient   SE       Z        p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant         0.0010        0.0927   0.0112   0.9910
Cumm_WinStreak   0.0964        0.0262   3.6840   0.0000    1.10         1.05           1.16
Note: CI is Confidence Interval
As expected, the regression results show a significant relationship between the
independent and response variable. (With the standard error of the coefficient doubled, Z would
fall to about 1.84, and the p-value would remain below 5% only in a one-sided test.) Although
the hypothesis that the constant term equals zero cannot
be rejected, the hypothesis that the regression coefficient equals zero can be rejected. These
results provide more evidence that caution needs to be exercised when building independent
variables from past outcomes in a sequence; some input variables will always show a relationship
with output variables. This is a valid concern even when the output sequence is a Markov chain,
as noted in the discussion of potential bias above.
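A small simulation illustrates why a cumulative variable can show a relationship even when games are independent. The sketch assumes that Cumm_WinStreak is the running total of wins minus losses (its exact definition appears earlier in the paper) and uses a hypothetical spread of team win probabilities between 0.25 and 0.75. No streak effect is built into the simulated outcomes.

```python
import random

def simulate_cumulative_bias(n_teams=32, n_games=16, n_seasons=200, seed=1):
    """Independent games with team-specific win probabilities and no streak effect."""
    rng = random.Random(seed)
    after_positive, after_negative = [], []
    for _ in range(n_seasons):
        for team in range(n_teams):
            # Hypothetical team strength, evenly spread from 0.25 to 0.75
            p = 0.25 + 0.5 * team / (n_teams - 1)
            outcomes = [1 if rng.random() < p else 0 for _ in range(n_games)]
            cumulative = 0  # wins minus losses to date (assumed form of Cumm_WinStreak)
            for t in range(n_games - 1):
                cumulative += 1 if outcomes[t] else -1
                if cumulative > 0:
                    after_positive.append(outcomes[t + 1])
                elif cumulative < 0:
                    after_negative.append(outcomes[t + 1])
    rate_pos = sum(after_positive) / len(after_positive)
    rate_neg = sum(after_negative) / len(after_negative)
    return rate_pos, rate_neg

rate_pos, rate_neg = simulate_cumulative_bias()
print(round(rate_pos, 3), round(rate_neg, 3))
```

The share of next-game wins following a positive cumulative record should exceed the share following a negative one, because the cumulative record acts as a proxy for each team's underlying win probability rather than for any streak effect.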
Discussion of Results
The relationship described in equation (3) assumes that there exist factors that could
affect the probability of winning. That is, the probability of winning is not simply a random
variable with an expected value of 0.50. Inasmuch as the factors that affect winning are stable
over time, it is reasonable to hypothesize a relationship between future performance and past
performance. Unfortunately, describing the nature of that relationship can be difficult. The
simplest model for the performance of a sports team (and many other contests) is that the
likelihood of winning their next game is a random expression of an underlying probability of
winning independent of time; an example is the Markov chain model used by Kvam and Sokol
(2006). Although the authors used this model to predict wins in the NCAA men's basketball
tournament with excellent results, their model was more complex than it first appeared. Kvam
and Sokol estimated the probability of a win based on individual matchups of paired teams rather
than the teams' overall winning percentages. In doing so, the authors demonstrated a difficulty
with simply using the winning percentage as an estimate of p for predicting the outcome of a
team's next game. Put simply, Kvam and Sokol estimated p, the probability of winning a
specific contest, as a conditional probability. Therein lies the problem of everyone who ever
wanted to predict the outcome of an event: prognosticators want to know the likelihood of
winning a specific contest given the expected circumstances of that contest, not the likelihood of
winning n times in N contests.
Conditional estimates allow the probability model to include a wide range of explanatory
variables that researchers may be able to test. Typical examples include the winning percentage
of the paired contestant (Kvam & Sokol, 2006), home field advantage (Foster & Washington,
2009), and the recent performance of contestants (Hvattum & Arntzen, 2010). One possible
conditional variable is the win streak. In its simplest form, this model assumes that wins and
losses are self-reinforcing. As Sire and Redner (2009, p. 474) noted, individuals frequently refer
to these win and loss streaks when making their predictions. Unfortunately, working out the
details of this model demonstrates several problems.
As noted above, win and loss streaks will be correlated with outcomes because both are
increasing functions of p. It follows that any constructed independent variable must take this
potential source of bias into account when testing models. This is a significant problem for
researchers, and one that may ultimately make the use of any win streak variable invalid.
However, the potential for bias matters only when results are significant; testing the model
might nonetheless prove useful, as was the case with the American football results for 2010-2011.
The results using American football data from 2010-2011 showed that regressing two
valid constructs of win and loss streaks on the next game played failed to show a significant
relationship between current win and loss streaks and the probability of winning the next game
played. These results are consistent with models that describe the probability of winning as based
on a set of underlying variables that are performance related, rather than related to previous wins
and losses. They argue in favor of a more parsimonious model for the probability of
winning that does not include win and loss streaks. Simply put, there is no evidence in
American football that a team's recent performance adds to or subtracts from its probability
of winning.
The results from American football do not prove that win streaks have no effect on the
probability of winning future games; they merely show that a valid application of the model
provided no evidence of such a relationship. The lack of evidence may be due to the choice of
variables, or the probability model itself; however, it is just as likely that there is no relationship
between win streaks and the probability of winning future games. As such, the burden of proof
lies on those who claim there is a relationship between win streaks and future wins to construct a
model and provide evidence of the relationship.
These results may be disappointing to those who harbor a belief that their team will
ultimately win (or lose) a given contest because of the team's recent success (or failure).
Nonetheless, the results are consistent with the typical modeling of the outcome sequence as a
Markov chain. As such, failure to show a relationship is probably more important than showing a
relationship using logistic regression.
Two important results can be gleaned from this study. First, the likelihood of persistent
bias when using models that simply regress win streaks on future outcomes was demonstrated by
an analysis of the variables, and verified by the regression results. As tempting as this type of
investigation is, it is more likely that any significant results will turn out to be artifacts of the
model and variable specifications. Second, the approach demonstrated the validity of using the
simple regression equation (2) when the full probability model can be expressed in the form
given by equation (3). This implies that it is possible to test relationships independently of other
underlying determinants of probability.
Conclusion
Conducting successful research requires that several skills be addressed in a logical and
appropriate way. As Trochim (2001) put it, research involves an eclectic blending of an
enormous range of skills and activities (p. 4). Although his statement is somewhat bombastic, it
accurately conveys the impression that research is not simply about having an idea and checking
to see if it might be correct. Leedy and Ormrod (2005) expanded on this point and argued that
successful research involves the systematic process of collecting, analyzing, and interpreting
information (p. 2). Good research requires not only a good question and adequate data, but also
appropriate methodology.
This paper has explored the range of research methodology available to researchers, and
tried to demonstrate how the choice of a research question and the available data affect the
methodology used. Not all methodologies are appropriate for every research question or type of
data; researchers must select a methodology that fits. One focus of this paper has been to identify
an appropriate research methodology for existing data that is restricted in the range of values that
can be expressed. As was shown, data of this type can require specialized transcription and
analysis. As part of this investigation, several examples of research were reviewed, most of
which used logistic regression to analyze binary response data. As was shown, both the data and
the method of analysis created challenges for researchers. The lessons learned from those
examples resulted in a detailed application of logistic regression on the win and loss results for
American football teams.
References
Albert, J. (2008). Streaky hitting in baseball. Journal of Quantitative Analysis in Sports,
4(1), 1-32.
Aldrich, J. H., & Nelson, F. D. (1984). Linear probability, logit, and probit models.
Sage Publications.
Brown, M., & Sokol, J. (2010). An improved LRMC method for NCAA basketball
prediction. Journal of Quantitative Analysis, 6(3), 1-21.
Foster, W. M., & Washington, M. (2009). Organizational structure and home team
performance. Team Performance Management, 15(3/4), 158-171.
Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match results prediction in
association football. International Journal of Forecasting, 26, 460-470.
Kvam, P., & Sokol, J. (2006). A logistic regression Markov chain model for NCAA
basketball. Naval Research Logistics, 53, 788-803.
Leedy, P. D., & Ormrod, J. E. (2005). Practical research. Upper Saddle River: Pearson
Education.
Nemes, S., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic
regression modelling and sample size. BMC Medical Research Methodology, 9(56), 1-5.
Oppenheimer, D. M., & Monin, B. (2009). The retrospective gambler's fallacy: Unlikely
events, constructing the past, and multiple universes. Judgement and Decision
Making, 4(5), 326-334.
Sire, C., & Redner, S. (2009). Understanding baseball team standings and streaks.
European Physical Journal B, 473-481.
Sports Reference. (2011). Pro-Football-Reference.com. Retrieved March 19, 2011, from
Sports Reference: http://www.pro-football-reference.com/
Thomas, A. C. (2010). That's the second-biggest hitting streak I've ever seen! Verifying
simulated historical extremes in baseball. Journal of Quantitative Analysis in Sports, 6(4), 1-34.
Trochim, W. (2001). Research methods knowledge base. Cincinnati, OH: Atomic Dog Publishing.
i. The variance of a random variable X is given by ; this evaluates to
.
The problem is that this term becomes negative at approximately ; variances cannot be
negative.