using binary regression to analyze win streaks in american football

8/4/2019 Using Binary Regression to Analyze Win Streaks in American Football


Working Paper

Using Logistic Regression to Analyze Win Streaks in American Football


Tom Gross

[email protected]



Logistic regression has become a standard tool for analyzing binary outcomes. Examples

vary as widely as habitat selection by turtles (Compton, Rhymer, & McCollough, 2002), blood

clots in stroke victims (Larrue, Kummer, Muller, & Bluhmki, 2001), and predicting election

results (Antonakis and Dalgas, 2009). Perhaps the most frequent use of binary regression is to

analyze wins and losses of sports teams. Recent examples include predicting wins in the NCAA

men's basketball tournament (Coleman & Lynch, 2009), baseball standings and streaks (Sire &

Redner, 2009), and the efficiency of betting markets (Ryall & Bedford, 2010). This application

of research methodology will use logistic regression to analyze the probability of an American

football team winning its next game, using the team's recent won/loss behavior as a predictor.

Logistic Regression

The value of logistic regression to sports research lies in the way in which outcomes are

recorded. The vast majority of sports contests have binary outcomes that determine conference

winners (i.e., a win or a loss), and most of the remaining contests have only three outcomes

(win, loss, or tie), with a tie often being rare. As a class, sports outcomes are usually treated as

restricted response variables, which cannot be analyzed by means of ordinary least-squares

regression. (An example of a sports outcome with less restrictive outputs is NASCAR cup

racing, which awards a maximum of 48 points for first place, and a minimum of one point for

forty-third place. The output variable is an interval integer. However, it does not present the

estimation problems described by Aldrich and Nelson (1984).) The nature of the response

variable, coupled with the popularity of sports, has resulted in an explosion of publications using

logistic regression. (A search of Google Scholar pairing "sports" and "logistic regression" results

in over 14,000 hits on papers published since 2007.)



One popular phenomenon of sports is the streak. Thomas (2010) noted that one of the

most popular sports records is DiMaggio's 56-game hitting streak in baseball (p. 1). Further, the

author claimed that such records are often considered magical and unlikely to happen simply

by chance (p. 1). In contrast, Albert (2008) argued that several authors have examined

streakiness in baseball, with mixed results. Although there was some evidence of statistically

abnormal behavior, most evidence failed to show a significant difference between examples of

streaks and what would be expected by random distributions (p. 2). Similarly, Oppenheimer and

Monin (2009) argued that the gambler's fallacy is the widely held belief that streaks will end

sooner than random chance would suggest (p. 326). The authors also claimed that streaks in

general conflict with the popular belief that small samples are like miniature examples of the

population, leading people to believe that streaks are the product of a non-random process.

One potential area of research would be to compare the likelihood of winning to previous

win and loss streaks. A streak is defined as a run of consecutive wins or losses; as a random

variable, it has both category (win or loss) and magnitude (the number of consecutive wins or

losses). A win streak is time dependent, with its most recent value occurring at time t. If it is

assumed that the sequence of wins and losses from t to t+1 is a Markov process, then there

cannot be a causal effect between past values and future values. It follows that researchers who

wish to make use of the "no memory" property of Markov processes must justify the applicability of a

Markov process to their model. A simple way to do this is to model the outcome as the sum of a

defined Markov process and some other independent variable process. The second process could

be any well-defined function of the selected independent variable. Performing logistic regression

on a model parsed this way results in separate parameter estimates for the two processes. Further,



it is possible to create a regression equation that allows the second process to be estimated

independently. This method will be demonstrated below.

Wins and Win Streaks

One application of logistic regression using win and loss streak data is to test the

relationship between win streaks and the probability of winning. Sire and Redner (2009) used a

variation of logistic regression (the Bradley-Terry model) to show that, for the most part, win

streaks have a mean and distribution that agree with the assumed underlying distribution.

However, they found evidence of non-random behavior in their analysis of baseball results during

the period from 1901 to 1960 (p. 479). The methodology of Sire and Redner included

assumptions of the type of distribution underlying a team's probability of winning. They did not

specifically identify what the factors affecting probability were, making it difficult to account for

differences in winning percentages between teams. Instead, they limited their research to the

questions of team parity, and the evidence of winning streaks being abnormally long. In

particular, Sire and Redner did not test for a relationship between win streaks and changes in the

probability of winning. Although the authors claimed that this question continues to be

vigorously debated, the last citation they included was published in 2000 (p. 474).

Sire and Redner (2009) provided an excellent example of the difficulties facing

researchers attempting to explore the relationship between wins and losses, and any artifact of

how those data are recorded. Specifically, any attempt to determine a relationship between win

streaks and the probability of winning needs to model both the independent variable used to

predict outcomes, as well as the probability model itself. One method commonly used to do this

is the Logit model, which posits a linear explanatory variable and a continuous, though bounded,

response variable. The Logit model allows a wide variety of explanatory variables while



maintaining a response variable profile that can be mapped onto a probability distribution

function.

The Model

The Logit model is based on simple binary regression, where the probability of a win is given by

$$p_Z = \frac{1}{1 + e^{-Z}} \tag{1}$$

Equation (1) plots a sigmoid curve that represents the probability of event Z. The equation for Z and its relationship to the underlying independent variables that affect p is given by

$$Z = \ln\left(\frac{p_Z}{1 - p_Z}\right) = \alpha + \beta X \tag{2}$$

where alpha is the intercept and beta is the coefficient on X. Here Z can represent any outcome from a linear function of X, although in the regressions reported below, X represents win and loss streaks. For a simple example, imagine that the probability of solving a specific math problem on a test is a function of the number of similar problems solved prior to the test. The variable X would represent the number of similar problems solved, and Z would represent the additive function of the problems solved. Because Z affects the probability of success, p_Z is a conditional probability, and a monotonically increasing function of Z and X.
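The relationship between the sigmoid curve and the log of the odds can be sketched in a few lines of Python. This is only an illustration: the function names and the intercept and slope values (0.0 and 0.1) are assumptions made here, not estimates from the paper.

```python
import math

def logistic(z: float) -> float:
    """Sigmoid curve: map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Inverse of logistic(): the natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical intercept and slope, chosen only for illustration.
alpha, beta = 0.0, 0.1

for x in (-3, -1, 1, 3):  # signed streak values; zero never occurs
    z = alpha + beta * x
    print(f"X = {x:+d}  Z = {z:+.2f}  p = {logistic(z):.4f}")
```

Because logistic() is monotonically increasing, any positive slope on X translates into a probability that rises with the streak value, and logit(logistic(z)) recovers z.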

Equation (1) is a solution to equation (2) expressed as a probability function. Often such relationships are represented in terms of odds instead of a probability; the argument in equation (2) is a simple transformation (the natural logarithm) of the odds of event Z. (The odds of an event are the probability of the event happening divided by the probability of that event not happening, expressed in terms of discrete outcomes.) Equations (1) and (2) imply that X (for example, the number of problems solved) is the only determinant of the probability of an event (for example, the probability of solving a specific problem on a test). This simple model of the causes affecting the probability of an event is neither likely, nor even a stable series. It is therefore assumed that the probability of a win is a function of a set of variables (Gamma) that have a fixed mean and variance, which are uncorrelated with X, making the full model

(3)

Here X is defined as it is in equation (2). However, p represents the total probability of winning. Gamma represents the value of a function of underlying causes of changes in the probability of a win, not including X. For example, Gamma could be a function of home field advantage (for example, see Levemier & Barilla, 2007), individual matchups (Brown & Sokol, 2010), or any other variable that would systematically affect the probability of winning. Following the methodology of Brown and Sokol, it is assumed that the variable Gamma is normally distributed. (This assumption is not necessary, but it simplifies the analysis.)

Equation (3) assumes that the probability of a win can be held constant and evaluated in X; it also implies that there is no correlation between variables in Gamma and X. It follows that equation (2) is the regression model used to determine the effect of a win streak on the probability of winning the next game, whereas equation (3) represents the relationship between all independent variables and the probability of a win. The variable X represents consecutive wins and losses. As noted above, this variable has both categorical (wins or losses) and magnitude components (number of wins or losses). Of these two components, the category is the more important because the model is testing whether, as was claimed by Sire and Redner (2009), wins and losses are self-reinforcing (p. 474). (The magnitude of the win streak represents the most restrictive form of the model.) Wins are given positive values and losses are given negative values so that the values of X preserve both the categorical and magnitude aspects. Defining X this way eliminates zero as a possible value for X. This does not affect the estimation of the coefficient on X; on the contrary, it improves the estimation of the parameter by forcing the intercept of the model to be zero. (The specifics of how the random variable X is generated are shown below.) It should be noted that more complex model specifications are possible. One interesting variation would be to include X as an element of Gamma; these variations are more difficult to evaluate, and are not considered here.

In the regression model specification shown in equation (2), Z is bounded by the underlying distribution of X. For example, because X represents consecutive wins at time t, the maximum value for X (assuming every game was won) is the total number of games played. Noting that professional American football teams play 16 regular season games, the value of X is bounded by plus or minus 16. (Looking at real data, the maximum winning streak in 2010 was eight and the maximum losing streak was 10.)

Equation (3) represents the relationship between the probability of an outcome (e.g., a win) and all factors that might change the probability. However, the first coefficient (the beta term on X) only measures the effect of win streaks on the total probability of a win. The individual parameters for Gamma represent the contributory effect of each of those variables. Examples of factors that affect Gamma might include home field advantage, team matchups, or any other variable that affects the probability of a win, except for the contributory effect of a win streak. (The obvious concern that those factors might directly affect X, the win streak variable, is discussed in detail below.)



Interpreting the regression results in terms of the full model is straightforward: if X is uncorrelated with other determinants of p and adds no significant amount to the value of p, the beta term will estimate as not different from zero; this implies that win streaks do not affect the probability of winning the next game. The specific hypothesis being tested by the regression equation is $H_0\colon \beta = 0$. The null can be rejected if regression results indicate a significant relationship, e.g., the p-value of the beta term is below the specified level of significance. Typically, a researcher reports a significant relationship between the regression variable and the outcome variable as justification for rejecting the null hypothesis and accepting the alternative. Briefly stated, a very low p-value is usually considered both a necessary and a sufficient reason to reject the null hypothesis. Unfortunately, although significance is necessary to reject the null, it is not sufficient for the regression equation (2) because the independent variables used in the regression equation are biased toward significance. This bias does not mean the model is not valid; however, it does require discussion of the nature of the bias contained in the variable X.

Potential Bias in Estimates of Beta

The potential effect of bias in X can be demonstrated by assuming that the unconditional probability of a win is determined by a probability function of a set of variables not including X (equation (3), by contrast, does include X). Using the method of Kvam and Sokol (2006), the transition from one state to another (e.g., from winning to losing), not including X, is modeled as a Markov chain; sequences of this type have no memory, meaning that subsequent events are independent of the outcomes of prior events. (The assumption that the determinants of a win, excluding X, constitute a Markov process is not necessary; however, its inclusion simplifies the analysis, and does not change the model expressed in equation (3).) Further, if we assume that the coefficient on X is zero (beta = 0), the model implies that changes in Gamma will not affect the value of X. Unfortunately, this assumption can be shown to be false.

As noted above, the values of X (win streaks) are simply consecutive outcomes of the same type (wins or losses). That is, a win streak of three at time t means that the outcomes from t-3 to t-1 were all wins. However, the win streak is simply a random array of outcomes in periods t-3 to t-1. It follows that the expected value of a win streak is given by

$$E[\mathit{WinStreak}] = \sum_{n=1}^{\infty} n\,p^{n} \tag{4}$$

WinStreak is therefore a Taylor series in n and p, and more importantly, a function of p. Although solving for the expected value of WinStreak is simple, characterizing the behavior of the variable is more difficult. One way to do this is to describe WinStreak using known distributional forms. Assuming p is fixed, we can define q as 1-p; it follows from equation (4) that the expected value of WinStreak is the variance of a geometric distribution in q. (This is a rather minor observation that seems not to have been noted in the literature.) Unfortunately, WinStreak does not have a stable variance; further, higher moments of WinStreak probably do not exist (see endnote i).
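The link between the series and the geometric distribution can be checked numerically. The sketch below assumes that equation (4) is the series of n times p to the power n, which is inferred from the description of it as a series in n and p whose value equals the variance of a geometric distribution in q; under that assumption the closed form is p / (1 - p)^2. The function names are illustrative.

```python
def expected_streak_series(p: float, terms: int = 10_000) -> float:
    """Partial sum of n * p**n for n = 1..terms (assumed form of equation (4))."""
    return sum(n * p**n for n in range(1, terms + 1))

def closed_form(p: float) -> float:
    """Closed form of the infinite series: p / (1 - p)**2."""
    return p / (1.0 - p) ** 2

def geometric_variance(q: float) -> float:
    """Variance of a geometric distribution with success probability q."""
    return (1.0 - q) / q**2

for p in (0.3, 0.5, 0.7):
    q = 1.0 - p
    print(p, expected_streak_series(p), closed_form(p), geometric_variance(q))
```

The three columns agree for each p, and the value grows quickly as p approaches one, which is the behavior the bias argument relies on.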

It is also possible to examine directly the effect of a change in the underlying probability on the value of WinStreak. The increase in WinStreak for a change in p is given by

$$\frac{\partial E[\mathit{WinStreak}]}{\partial p} = \sum_{n=1}^{\infty} n^{2}\,p^{\,n-1} \tag{5}$$

Equation (5) confirms that both WinStreak and NextGame (the term that represents the game played at time t in the regressions below) are increasing in p. (The outcome of the next game is assumed to be a function of p; equation (5) shows that WinStreak is also a function of p.) Equation (5) also verifies that if we assume the value of beta in equation (2) is zero, there is no memory in the sequence of wins, even though both WinStreak and NextGame are functions of p. (This follows because it is possible to describe NextGame without reference to X.) The important point is that as p gets large, both the number of wins and the magnitude of WinStreak increase. Regressing NextGame on WinStreak will show spurious correlation because larger values of WinStreak mean that wins will be more frequent. Literally, Gamma is a confounding variable in regression equation (2). Regressions that demonstrate a significant relationship between WinStreak and NextGame are necessary to reject the null hypothesis, but not sufficient because the result can be a spurious effect of Gamma.

The behavior of WinStreak causes a significant bias when the probability of a win is very large or very small. The bias will be smaller if the number of paired observations at a given p is small or if p is approximately 50%. However, if the probability of a win is extreme and the number of paired observations is larger (for example, when the number of games played by a basketball team is 82), simply regressing NextGame on WinStreak might indicate a significant relationship when one does not exist.
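A small simulation illustrates how a confounding win probability can produce an apparent streak effect. This is a sketch under stated assumptions, not the paper's procedure: the two hypothetical teams (win probabilities 0.2 and 0.8), the sample sizes, and the seed are arbitrary, and the comparison uses only the sign of the previous result rather than a fitted regression.

```python
import random

def simulate_team(p: float, n_games: int, rng: random.Random) -> list[int]:
    """Memoryless sequence of wins (1) and losses (0) with a fixed win probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n_games)]

def win_rate_by_previous_result(sequences):
    """Pool (previous result, next result) pairs across teams.

    Returns (win rate after a win, win rate after a loss)."""
    after_win, after_loss = [], []
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            (after_win if prev == 1 else after_loss).append(nxt)
    return sum(after_win) / len(after_win), sum(after_loss) / len(after_loss)

rng = random.Random(0)  # arbitrary seed, used only for repeatability

# Hypothetical league: one weak and one strong team, no streak effect built in.
pooled = win_rate_by_previous_result(
    [simulate_team(p, 5000, rng) for p in (0.2, 0.8)]
)

# Single team with p = 0.5: no confounding, so no apparent effect is expected.
single = win_rate_by_previous_result([simulate_team(0.5, 10000, rng)])

print("pooled teams, after win / after loss:", pooled)
print("single team,  after win / after loss:", single)
```

In the pooled data a win is far more likely to follow a win than a loss, even though every sequence is memoryless; the gap comes entirely from the difference in underlying win probabilities, which plays the role of Gamma.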

The results of equations (4) and (5) call into question the usefulness of using a Logit model on win streaks in any form. However, since the bias is one-sided (indicating a relationship when one does not exist), the model is still useful when it fails to reject the null. The existence of bias nevertheless points to how difficult it is to reject the null hypothesis for the regression equation (2). If the regressions show a significant relationship, an additional set of regressions should be run after attempting to remove the bias from the WinStreak variables. (How this might be done is unclear because all potential factors in Gamma would have to be eliminated as the source of bias.) The gist of the matter is that it is possible for the beta term in equation (2) to show significance when there is no relationship between the independent and dependent variables. Therefore, it is important to run additional tests when any significant results from the regression are discovered.

Data

The real data for regression were gathered from Sports-Reference.com (Sports Reference,

2011). This site has extensive historical data on professional and amateur sports. Data were

retrieved in March of 2011. The NFL win/loss data were for the 2010 year, and included only

regular season games (512 individual outcomes representing 256 total games played). Each win

was recorded as a one, and each loss was recorded as a zero. The data were arrayed as a single

set of values, representing all 512 outcomes. Although this represents a doubling of the actual

events, the only effect on the parameter estimation is to reduce the standard errors by half.

This array of 512 outcomes was the set from which all other variables were created.

Win streaks for teams were calculated in three different ways, each of which replaces the

value of X in equation (2); however, only the first two were used for estimating the model. The

three variable names used were WinStreak, Pos_Neg, and Cumm_WinStreak. WinStreak

calculates win streaks and loss streaks separately and then combines the result. For example,

assume that a sequence of outcomes in the data set is {W, W, L, W, L, L}. Converting this to

numeric values, the sequence becomes {1, 1, 0, 1, 0, 0}. This sequence has four different streaks:

a win streak of two, a loss streak of one, a win streak of one, and a loss streak of two. However,

streaks are measured at time t, meaning that this sequence will have six different values for

WinStreak. Each win adds +1 and each loss adds -1. (As noted above, wins and losses are

calculated separately for technical reasons that do not affect the parameter estimation.) Starting

at t=1 through t=6, the six values for WinStreak are {1, 2, -1, 1, -1, -2}. WinStreak represents the

most restrictive form of the regression equation (2) because it retains not only the type of



outcome (a win or a loss, represented by positive or negative numbers), but also the magnitude of

the variable. The values for WinStreak are the X values in the regression equation (2). This is the

way one would expect win and loss streaks to be reported, and what we mean when we say, "my

team has a two-game win streak." The only difference is that the sign of the value represents a

win (+) or a loss (-).

Pos_Neg is calculated in a similar way, except that the only values Pos_Neg can take are

positive one and negative one. For example, the string of results described above becomes {1, 1,

-1, 1, -1, -1}. Positive numbers represent win streaks and negative numbers represent loss

streaks; however, the length of the streak is lost. This is mathematically identical to coding an

independent variable as a zero or a one; the only difference is that using the values of negative

one and positive one forces the intercept to be zero. (A graph of the regression relationship would

go through the origin.) Pos_Neg represents a less restrictive version of the regression equation

(2) because it does not retain the magnitude of the win streak; it simply indicates if the streak is

one of winning or losing.

The final constructed independent variable is Cumm_WinStreak. This variable sums the

numerical value for each win and loss, resulting in a cumulative value for consecutive wins and

losses. Using the same string of values used for WinStreak, Cumm_WinStreak becomes {1, 2, 1,

2, 1, 0}. As teams accumulate wins, Cumm_WinStreak becomes larger and positive; losing

causes the value to become smaller, and eventually negative, if more games are lost than are

won. This variable was not considered appropriate for real data because of serious problems that

result when attempting to model its behavior. Unlike the previous two constructed variables,

Cumm_WinStreak does not reset after each change of state. (Moving from a win to a loss, or the

opposite, is a change of state.) It follows that the value of the variable can drift from its point of



origin, and that the drift can be arbitrarily large. For example, even assuming a Markov process

with p = 0.5, the value of Cumm_WinStreak can be arbitrarily large at any given time. (This is

a common feature of Markov chains, which is often forgotten by people investing in the stock

market.) Further, Cumm_WinStreak has an unusual limit behavior: as the number of un-played

games approaches zero (i.e., the last games in the regular season are played) the value of

Cumm_WinStreak becomes a simple transform of the probability of winning (in particular, the

odds ratio). This means that Cumm_WinStreak will often show significant results even when the

other constructed variables do not.
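The three constructed variables can be reproduced from the example sequence {1, 1, 0, 1, 0, 0} with a short Python sketch. The function names are illustrative; only the variable definitions and the example values come from the text.

```python
def win_streak(outcomes):
    """Signed streak length at each time t: +k after k straight wins, -k after k straight losses."""
    streaks = []
    current = 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def pos_neg(outcomes):
    """Sign of the current streak only: +1 for a win, -1 for a loss."""
    return [1 if won else -1 for won in outcomes]

def cumm_win_streak(outcomes):
    """Running total of +1 per win and -1 per loss; never resets on a change of state."""
    total = 0
    result = []
    for won in outcomes:
        total += 1 if won else -1
        result.append(total)
    return result

outcomes = [1, 1, 0, 1, 0, 0]  # {W, W, L, W, L, L}
print(win_streak(outcomes))       # [1, 2, -1, 1, -1, -2]
print(pos_neg(outcomes))          # [1, 1, -1, 1, -1, -1]
print(cumm_win_streak(outcomes))  # [1, 2, 1, 2, 1, 0]
```

The first two functions reset on every change of state, while cumm_win_streak() is a running sum, which is why it can drift arbitrarily far from zero.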

The regression equation (2) matches the independent variable at time t with the outcome at time t+1. This required that each of the independent variables be labeled by time; although time is not a variable in the regression model, aligning the variables correctly is necessary. The values for each of these variables at time t were paired in logistic regression with the outcome of the next game played at time t+1, drawn from the original sequence of outcomes. The last value for each set of 16 games was deleted because the independent variable could not be paired with a next game. This reduced the total number of regression pairs to 480.

Using the array described above, the six values would be reduced to five; the two

independent and dependent variables (representing WinStreak and NextGame) become {1, 2, -1,

1, -1} and {1, 0, 1, 0, 0} respectively. Note that the sequence of values for WinStreak represents

values calculated from t=1 to t=5, whereas the values of the NextGame represent values

calculated for periods t=2 to t=6.
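The alignment step can be sketched as a simple shift of the two lists. The helper name is hypothetical; the input and output values are those of the six-game example.

```python
def pair_with_next_game(streak_values, outcomes):
    """Pair the streak value at time t with the outcome at time t+1.

    The final streak value is dropped because it has no next game."""
    x = streak_values[:-1]  # values at t = 1 .. n-1
    y = outcomes[1:]        # outcomes at t = 2 .. n
    return x, y

win_streak_values = [1, 2, -1, 1, -1, -2]
outcomes = [1, 1, 0, 1, 0, 0]

x, y = pair_with_next_game(win_streak_values, outcomes)
print(x)  # [1, 2, -1, 1, -1]
print(y)  # [1, 0, 1, 0, 0]

# Applied per team: 16 games give 15 pairs, so 32 teams give 480 pairs.
print(32 * (16 - 1))  # 480
```

Pairing within each team's own sequence, rather than across the whole array, is what removes one observation per team.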

Each of the win streak variables represented a slightly different set of assumptions for the

probability model. The simplest interpretation of equation (3) is that winning increases the

chance of winning, and (unfortunately for fans) losing increases the chance of losing. This is



captured by the independent variable Pos_Neg. Using Pos_Neg assumes that a win in period t

affects the probability of a win in period t+1. However, the magnitude of the string of wins

has no additional effect. Under this assumption, winning improves the chances of winning by a

fixed amount until the next loss, at which time the probability of a win reverts to the original

value. (Original values for the probability of a win are given by the value of Gamma.)

In contrast to Pos_Neg, the design of WinStreak assumes that consecutive wins or losses have an additive effect; under this assumption, long streaks are self-perpetuating. Sire and Redner characterized this, without being specific, as winning streaks being self-reinforcing (2009, p. 474). It should be noted that this assumption leads to significant modeling problems; in this case, equation (2) becomes a positive feedback function, as described above. Under this scenario, if the sequence is long enough, p converges to its limit of one (i.e., no probability of a loss) or zero (no probability of a win).

Although there are significant concerns about the modeling of the independent variables,

the dependent variable NextGame can be described as a Markov chain with a transition

probability of p. Assuming win streaks do not affect the probability of winning, equation (3)

becomes a simple Markov process. This is the same methodology used by Brown and Sokol

(2010), Kvam and Sokol (2006), as well as others. As noted above, significant results from the

regression equation (2) might be interpreted as evidence that the sequence of outcomes is not a

Markov chain. It follows that rejection of the null implies rejection of the distributional

assumptions on which equation (3) is based. In that event, it would be necessary to create a

different probability model for wins and losses.

Testing the Model



As noted above, using the independent variables in regression equation (2) raises

significant concerns about the ability to capture any relationship that might exist between win

streaks and the next game played. It is therefore appropriate to test the model on simulated data

prior to running regressions on actual data. The regression model was tested using simulated data

consisting of 1000 randomly generated values. The values were generated using an Excel Add-In (MegaStat), which produced a random variable uniformly distributed from zero to one. The value of N=1000 was chosen to avoid bias that can result from small samples. A large sample will minimize the likelihood of a Type II error; in addition, the bias inherent in estimating Logit models creates a significant likelihood of a Type I error with small

samples. This point was made by Nemes, Jonasson, Genell, and Steineck (2009) as a general

problem with logistic regression; it is an even larger problem here due to the inherent bias of the

variables used to represent win streaks.

Assuming a valid model specification of the relationship between independent and

response variables, regression using a random set should not yield significant results. The

random simulated data were subsequently modified with a bias to test the model when there was

a known effect; assuming that the model specification is valid, the results should be significant.

As noted, the randomly generated data (Y) were uniformly distributed between the values of zero and one. These data were then converted into binary data using the rule: if Y is less than 0.500, the binary value is 0; otherwise, it is 1. This yielded a set of 520 zero values and 480 one values, representing losses and wins respectively. Win streaks were calculated for consecutive wins and losses in the simulated data using the method described in the data section above. For example, WinStreak was calculated by assigning the first win in a string a value of one, the second consecutive win a value of two, and so on. Each loss was assigned a value of negative one, so that consecutive losses resulted in consecutively larger negative numbers. The resulting WinStreak consisted of 1000 positive or negative numbers representing the consecutive wins or losses to that point. These values were paired so that the WinStreak value at time t was paired with the outcome at time t+1, the NextGame variable in the regression model. (This reduced the number of paired values to 999 because there is no next game after the last win streak value is calculated.) Logistic regression was performed on the paired data to determine if there was a significant relationship between the two simulated variables.
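The simulation test can be approximated in pure Python. This is a minimal sketch, not the original Excel/MegaStat workflow: the seed, the function names, and the hand-written Newton-Raphson fit are assumptions made here, and the estimates it prints will differ from those reported for the original draw.

```python
import math
import random

def win_streak(outcomes):
    """Signed streak length at each time t (+k for k straight wins, -k for losses)."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(a + b * x))).

    Returns (a, b, se_b), where se_b is the standard error of the slope."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient, intercept
            g1 += (yi - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    se_b = math.sqrt(h00 / det)     # slope variance from the inverse information matrix
    return a, b, se_b

rng = random.Random(2011)                       # arbitrary seed for repeatability
uniform = [rng.random() for _ in range(1000)]   # simulated probabilities
outcomes = [0 if u < 0.5 else 1 for u in uniform]
streaks = win_streak(outcomes)

x, y = streaks[:-1], outcomes[1:]               # 999 (WinStreak at t, NextGame at t+1) pairs
a, b, se_b = fit_logistic(x, y)
print(f"intercept = {a:.4f}, beta = {b:.4f}, SE = {se_b:.4f}, z = {b / se_b:.2f}")
```

With independent draws the slope estimate should be close to zero relative to its standard error; a statistics package would normally replace fit_logistic() in practice.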

Assuming that consecutive wins and losses affect the probability of winning the next game is tantamount to saying that beta is a significant and positive value in equation (2). The interpretation is that long win streaks increase the probability of winning the next game, and long losing streaks decrease that probability. By design, the random probabilities used to generate the simulated data were uniformly distributed with a mean of 0.50; the value of each did not depend on the ex post calculated win streaks. It follows that the change in probability for a given change in X is given by

$$\frac{\partial p_Z}{\partial X} = 0 \tag{6}$$

It also follows that the expected value of Z in equation (2) must be zero, and equation (1) becomes

$$p_Z = \frac{1}{1 + e^{0}} = 0.5 \tag{7}$$

Equation (7) implies that binary regression using the randomly generated values for Y should not yield significant coefficients; that is, the p-values should remain above traditional significance limits. Binary logistic regression on the unbiased random data, using the value of WinStreak at time t and NextGame (the outcome at time t+1), gives the results shown in Table 1:



Table 1

Logistic Regression Table for Simulated Data

Predictor    Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant     -0.084        0.0635   -1.3228   0.1860
WinStreak    -0.0107       0.0272   -0.3934   0.6930    0.99         0.94           1.04

Note: CI is Confidence Interval
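The odds ratio and confidence limits in the table follow from the coefficient and its standard error. The sketch below assumes the usual normal-approximation interval with a critical value of 1.96; the function name is illustrative.

```python
import math

def odds_ratio_with_ci(coef: float, se: float, z_crit: float = 1.96):
    """Exponentiate a logit coefficient and its approximate 95% confidence limits."""
    point = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return point, lower, upper

# WinStreak row of Table 1.
coef, se = -0.0107, 0.0272
point, lower, upper = odds_ratio_with_ci(coef, se)

print(round(coef / se, 4))                                # -0.3934 (the Z column)
print(round(point, 2), round(lower, 2), round(upper, 2))  # 0.99 0.94 1.04
```

An interval for the odds ratio that contains one corresponds to a coefficient that is not significantly different from zero.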

As expected, the p-value on the variable WinStreak is large, implying that there is no

relationship between WinStreak and the probability of winning the next game. Given the

randomness of the data, the lack of significance using randomly generated data is encouraging; it

shows that a model that could capture an existing relationship is not necessarily tricked into

showing significance because of inherent bias.

Testing the model under conditions when it should not work is of little value for

demonstrating that the model will work when it should. Instead, it is necessary to show that the

model can accurately demonstrate a relationship between win streaks and the probability of

winning the next game when a relationship is known to exist. The original random probabilities

were modified to test this by biasing the probability assigned to the event at time t+1 using the value of WinStreak at time t. The bias consisted of adding 10% to the existing probability for each ordinal number of the win streak in the previous period. The biased data are modeled on equation (2); they have no Gamma variable as shown in equation (3). Thus, the new probability for the event at time t+1 is given by:

(8)

The maximum increase in probability was 30.34%, whereas the maximum decrease was

69.16%. Further, the median value was -0.57%, with first and third quartile values of -7.36% and 6.32% respectively. Somewhat surprisingly, the number of wins went down from 480 to 443. These values represent a significant change in probability, based on the previously paired WinStreak value. Each of the new probabilities was converted to a binary value using the same rule described previously, and new values for WinStreak were calculated. Binary logistic regression was then run on the new biased values. The results are shown in Table 2:

Table 2

Logistic Regression Table of Biased Results

Predictor          Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
Constant           -0.1441       0.0659   -2.1866   0.0290
Biased WinStreak   0.1250        0.0196   6.3776    0.0000    1.13         1.09           1.18


As the regression table shows, the model worked surprisingly well to capture the effect of

the bias introduced; the associated p-value is zero to four decimal places. (The results would be

significant even if the standard error value for the independent variable were doubled.) The

coefficient on the biased WinStreak (0.125 on the log-odds scale, an odds ratio of 1.13) indicates

that each additional win increases the odds of winning by an estimated 13%; the 95% confidence

interval for the odds ratio (1.09 to 1.18) is consistent with the 10% bias that was introduced. In

addition, the sign of the coefficient is positive, as predicted. Performing the same analysis using

the binary data (where positive values of WinStreak were given a value of one, and the remaining

data were given a value of negative one) gave similar results.
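The figures in Table 2 can be checked with a short calculation. The sketch below is illustrative only: it assumes that the reported Z and P columns are Wald statistics with two-sided normal p-values and that the confidence limits use a critical value of 1.96. The function name wald_summary is hypothetical and is not part of the original analysis.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Wald z, two-sided normal p-value, odds ratio, and CI for one coefficient."""
    z = coef / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "z": z,
        "p": p_value,
        "odds_ratio": math.exp(coef),
        "ci": (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se)),
    }


# Coefficients and standard errors as reported in Table 2.
biased = wald_summary(0.1250, 0.0196)
constant = wald_summary(-0.1441, 0.0659)
# Same coefficient with the standard error doubled (the robustness remark in the text).
doubled = wald_summary(0.1250, 2 * 0.0196)

print(biased)
print(constant)
print(doubled["p"])
```

Under these assumptions the calculation reproduces the odds ratio of 1.13 and the interval from 1.09 to 1.18, and doubling the standard error of the biased WinStreak coefficient still leaves its p-value below 5%.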

The results from all three regressions indicate that the model in all its forms will capture

any effect that we know is present in the next win. However, the simulated data are based on

equation (2); they do not imply that equation (2) can measure the coefficient on X if the actual

probability model is as represented by equation (3). As noted above, the significance of the

regression is a necessary but not sufficient condition to reject the null hypothesis. This follows

from the bias inherent in estimates of the coefficient on X when other factors, such as those noted

above, affect the probability of a win.
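The kind of check described above, planting a known streak effect and then testing whether the regression recovers it, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 2. The bias rule (0.10 added to a base probability of 0.50 for each unit of the previous signed streak, clipped to the range 0 to 1), the use of the original streak as the predictor, the random seed, and the helper names signed_streaks and fit_logit are all assumptions made for the example. The fit uses a plain Newton-Raphson iteration for a one-predictor logistic model so that no statistics package is required.

```python
import math
import random


def signed_streaks(outcomes):
    """Signed streak length after each game: +k for k straight wins, -k for losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks


def fit_logit(x, y, iterations=25):
    """Fit logit P(y=1) = a + b*x by Newton-Raphson; return (a, b, se_b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the constant
            g1 += (yi - p) * xi   # score for the slope
            h00 += w              # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # Standard error of b from the information matrix at the last evaluated point.
    return a, b, math.sqrt(h00 / det)


random.seed(2011)  # arbitrary seed, for repeatability only
base = [random.random() < 0.5 for _ in range(481)]  # unbiased outcomes
streak = signed_streaks(base)

x, y = [], []
for t in range(1, len(base)):
    # Assumed bias rule: +0.10 per unit of the previous streak, clipped to [0, 1].
    p_biased = min(max(0.5 + 0.10 * streak[t - 1], 0.0), 1.0)
    x.append(streak[t - 1])
    y.append(1 if random.random() < p_biased else 0)

a, b, se_b = fit_logit(x, y)
print(f"constant={a:.4f}  slope={b:.4f}  se={se_b:.4f}  z={b / se_b:.2f}")
```

Because the effect is built into the data, a positive and clearly significant slope is the expected outcome; the sketch says nothing about whether the same model is unbiased when other factors drive the probability of a win.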



The existence of bias suggests a two-step process for evaluating the relationship between

win streaks and a sports team's probability of winning its next game. The first step in the

process is to estimate the beta term in equation (2) to determine if there is evidence of an effect.

If there is not, the null hypothesis cannot be rejected. If evidence of a significant relationship

does exist, additional tests should be run to determine if the evidence might be due to spurious

correlation. Although there are significant flaws in this approach, which are addressed in the

discussion below, this two-step process provides a reasonable method for evaluating the research

problem.

Results using American Football

The data used to evaluate the relationship between team win streaks and the probability

of winning the next game played consisted of the wins and losses of all 32 teams in the National

Football League, 2010-2011 regular season. Data were downloaded from www.Sports-Reference.com

(2011) and were coded as described above. There are 16 games played each

regular season by the 32 NFL teams, resulting in 512 outcomes, representing 256 total games.

Each team had individual winning percentages calculated, as well as the independent variables

WinStreak, Pos_Neg, and Cumm_WinStreak, as described above. Regressions were run for the

teams individually; however, the results showed no significance. (This could easily have been

predicted because the sample size of 15 per team is too small to provide good results in logistic

regression.) The independent variables at time t were paired with the next game outcome at

time t + 1, resulting in a loss of one game per team, and reducing the total number of data pairs

to 480.
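The coding and pairing step can be illustrated with a short sketch. The definitions used here are assumptions inferred from the surrounding text rather than the paper's exact coding rules: WinStreak is taken to be the signed length of the current streak, Pos_Neg is +1 when WinStreak is positive and -1 otherwise, and Cumm_WinStreak is taken to be the running total of wins minus losses. The function name code_team_season and the sample season are hypothetical.

```python
def code_team_season(results):
    """Pair streak variables after game t with the outcome of game t + 1.

    results: list of 1 (win) or 0 (loss) in game order for one team.
    Returns one row per pair, so a 16-game season yields 15 rows.
    """
    rows = []
    streak = 0       # signed streak length (assumed definition of WinStreak)
    cumulative = 0   # wins minus losses so far (assumed definition of Cumm_WinStreak)
    for t in range(len(results) - 1):
        if results[t]:
            streak = streak + 1 if streak > 0 else 1
            cumulative += 1
        else:
            streak = streak - 1 if streak < 0 else -1
            cumulative -= 1
        rows.append({
            "WinStreak": streak,
            "Pos_Neg": 1 if streak > 0 else -1,
            "Cumm_WinStreak": cumulative,
            "NextGame": results[t + 1],
        })
    return rows


# Hypothetical 16-game season, used only to show the coding.
season = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]
rows = code_team_season(season)
for row in rows:
    print(row)
print(len(rows), "pairs per team;", 32 * len(rows), "pairs for 32 teams")
```

Each 16-game season loses one observation to the pairing, which is how 32 teams produce 480 pairs rather than 512.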

Although the data were paired using the results for each team, the 32 sets of data were

combined into a single regression of 480 pairs. This decreases the standard errors for the



coefficient terms by a factor of two; however, it has no other effect on the estimates of the

parameters. This technique does raise some concerns beyond the effect on standard errors. First,

it is not obvious that a single stable relationship can be measured across several teams. A more

accurate description of X would be as a vector with j elements, each j representing a different

team. Then each x_ji is the i-th pair of regressions for the j-th team. This implies 32 different

independent coefficients. Even if some of the coefficients were non-zero, there is no guarantee

that the coefficient on the vector can be estimated. Second, even if a single relationship exists for

each team, there is no guarantee that the actual estimate of the vector beta will be non-zero.

Although these are valid questions, the response to both is that the regression model

posited in equation (2) specifically assumes that there is a single stable relationship between win

streaks and the probability of winning a team's next game, regardless of when it is measured.

The null hypothesis for that model can only be rejected if such a relationship is detected. If such

a relationship cannot be detected, it does not matter if it is because such a relationship does not

exist, or if it exists in a form that cannot be captured by the model. Failing to reject the null

hypothesis does not prove that there is no relationship; it merely presents evidence that the

relationship does not exist in the assumed form. In short, these concerns about the ability to

measure the relationship between win streaks and the probability of winning are further evidence

of how difficult it is to measure a relationship between win streaks and the probability of

winning. They do not, however, invalidate the model itself.

An advantage to combining all the pairs into a single data set is that there will be an equal

number of wins and losses. This has the effect of forcing the intercept through the origin; that is,

the expected value of the constant term is zero. This both provides additional evidence on the validity of the model and improves the

accuracy of the estimate of the beta term in equation (2).



Regressing WinStreak on NextGame using actual data had the expected results, as can be

seen in Table 3:

Table 3

Logistic Regression Table: WinStreak on NextGame

Predictor   Coef        SE Coef   Z        P        Odds Ratio   95% CI Lower   95% CI Upper
Constant    0.00298     0.0914    0.0326   0.9740
WinStreak   0.0319908   0.0370    0.8654   0.3870   1.03         0.96           1.11


The regression results show that the model is not biased to the degree that it always

shows significance. The estimate of the coefficient on the beta term for equation (2) is not

significantly different from zero, and the null hypothesis cannot be rejected. The assumption

that the constant term is equal to zero also cannot be rejected, although the model using

WinStreak allows some variation for this term. These results are somewhat surprising, given the

tendency of the model to find significance when smaller samples were used, or when the winning

percentage tended toward extremes. (Additional regressions were run using individual team data

(N=15), and as expected, frequently indicated significance. Those results are not shown because

they can be dismissed as being caused by a combination of small sample bias, and the bias

inherent in WinStreak.)

As noted, the model using WinStreak is the most restrictive form of the regression

equation, because it requires that the beta term capture the effect of each additional win or loss in

the streak. A less restrictive form of the model uses Pos_Neg; this version posits that wins and

losses are reinforcing, but not additive. It follows that the threshold for significance is lower.

Nonetheless, the results for this regression again gave the expected theoretical results, as shown

in Table 4:



Table 4

Logistic Regression Table: Pos_Neg on NextGame

Predictor   Coef       SE Coef   Z         P        Odds Ratio   95% CI Lower   95% CI Upper
Pos_Neg     -0.03334   0.0913    -0.3651   0.7150   0.9700       0.81           1.16


As expected, the Pos_Neg version of the model provides the cleanest estimate of the beta

coefficient in equation (2). This follows because the regression line must go through the origin

(i.e., the constant term must be zero) when the only possible values for X are -1 and +1. The

results show that the null hypothesis again cannot be rejected. There is no evidence of an effect

even in the less restricted model, where a win or loss streak simply influences the probability of a

win in the next game played.
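When the predictor takes only the values -1 and +1 and the model is fitted without a constant term (an assumption here, suggested by the absence of a constant row in Table 4), the maximum likelihood estimate has a closed form. If m of the n pairs "agree" (a win following Pos_Neg = +1, or a loss following Pos_Neg = -1), then the coefficient is ln(m / (n - m)) and its standard error is sqrt(1/m + 1/(n - m)). The sketch below applies these formulas; the count of 236 agreeing pairs out of 480 is a hypothetical value chosen for illustration because it yields a coefficient close to the one in Table 4, and it is not a figure reported in the paper.

```python
import math


def pos_neg_fit(agree, total):
    """Closed-form no-constant logistic fit for a predictor coded -1 / +1.

    agree: number of pairs where the sign of the streak matches the next outcome.
    total: number of pairs.
    Returns (coefficient, standard error, Wald z).
    """
    disagree = total - agree
    beta = math.log(agree / disagree)
    se = math.sqrt(1.0 / agree + 1.0 / disagree)
    return beta, se, beta / se


# Hypothetical count, for illustration only.
beta, se, z = pos_neg_fit(agree=236, total=480)
print(f"coef={beta:.4f}  se={se:.4f}  z={z:.4f}  odds ratio={math.exp(beta):.2f}")
```

Read this way, the coefficient is simply the log odds that the next game goes the same way as the current streak, so a value near zero means the streak direction predicts the next result about as well as a coin flip.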

The final regression used the variable Cumm_WinStreak. As noted above, this model is

not considered as a valid example of the regression equation (2) because the independent

variable for each team converges to a simple transform of the winning percentage. As such, the

model should show the relationship that exists. The predicted results are that there will be a

significant relationship between the independent and response variable, even though the

preceding regressions found none. The results of this regression can be seen in Table 5:

Table 5

Logistic Regression Table: Cumm_WinStreak on NextGame

Predictor        Coef     SE Coef   Z        P        Odds Ratio   95% CI Lower   95% CI Upper
Cumm_WinStreak   0.0964   0.0262    3.6840   0.0000   1.10         1.05           1.16


As expected, the regression results show a significant relationship between the

independent and response variables. (The p-value would still be below 5% if the standard error of

the coefficient were doubled.) Although the hypothesis that the constant term equals zero cannot

be rejected, the hypothesis that the regression coefficient equals zero can be rejected. These



results provide more evidence that caution needs to be exercised when building independent

variables from past outcomes in a sequence; some input variables will always show a relationship

with output variables. This is a valid concern even when the output sequence is a Markov chain,

as noted in the discussion of potential bias above.
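The point can be illustrated with a small simulation in which there is, by construction, no streak effect at all. The setup is an assumption made for the example: 32 hypothetical teams are given fixed win probabilities spread evenly from 0.2 to 0.8, every game is an independent draw, and Cumm_WinStreak is taken to be the running total of wins minus losses. The seed and the helper name pearson are likewise illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(32)  # arbitrary seed, for repeatability only
cumulative_values, next_outcomes = [], []
for team in range(32):
    p = 0.2 + 0.6 * team / 31  # hypothetical fixed win probability for this team
    games = [1 if random.random() < p else 0 for _ in range(16)]
    cumulative = 0
    for t in range(15):
        cumulative += 1 if games[t] else -1
        cumulative_values.append(cumulative)   # wins minus losses after game t
        next_outcomes.append(games[t + 1])     # independent of earlier games

r = pearson(cumulative_values, next_outcomes)
print(f"pairs={len(cumulative_values)}  correlation={r:.3f}")
```

Although each game is independent of the team's earlier results, the pooled correlation is expected to be positive, because both the cumulative variable and the next outcome reflect the same underlying team strength.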

Discussion of Results

The relationship described in equation (3) assumes that there exist factors that could

affect the probability of winning. That is, the probability of winning is not simply a random

variable with an expected value of 0.50. Inasmuch as the factors that affect winning are stable

over time, it is reasonable to hypothesize a relationship between future performance and past

performance. Unfortunately, describing the nature of that relationship can be difficult. The

simplest model for the performance of a sports team (and many other contests) is that the

likelihood of winning their next game is a random expression of an underlying probability of

winning independent of time; an example is the Markov chain model used by Kvam and Sokol

(2006). Although the authors used this model to predict wins in the NCAA men's basketball

tournament with excellent results, their model was more complex than it first appeared. Kvam

and Sokol estimated the probability of a win based on individual matchups of paired teams rather

than the teams' overall winning percentages. In doing so, the authors demonstrated a difficulty

with simply using the winning percentage as an estimate of p for predicting the outcome of a

team's next game. Put simply, Kvam and Sokol estimated p, the probability of winning a

specific contest, as a conditional probability. Therein lies the problem of everyone who ever

wanted to predict the outcome of an event: prognosticators want to know the likelihood of

winning a specific contest given the expected circumstances of that contest, not the likelihood of

winning n times in N contests.



Conditional estimates allow the probability model to include a wide range of explanatory

variables that researchers may be able to test. Typical examples include the winning percentage

of the paired contestant (Kvam & Sokol, 2006), home field advantage (Foster & Washington,

2009), and the recent performance of contestants (Hvattum & Arntzen, 2010). One possible

conditional variable is the win streak. In its most simple form, this model assumes that wins and

losses are self-reinforcing. As Sire and Redner (2009, p. 474) noted, individuals frequently refer

to these win and loss streaks when making their predictions. Unfortunately, working out the

details of this model demonstrates several problems.

As noted above, win and loss streaks will be correlated with outcomes because both are

increasing functions of p. It follows that any constructed independent variable must take this

potential source of bias into account when testing models. This is a significant problem for

researchers, and one that may ultimately make the use of any win streak variable invalid.

However, the potential for bias simply begs the question of significant results; testing the model

might nonetheless prove useful, as was the case in American football results for 2010-2011.

The results using American football data from 2010-2011 showed that regressing two

valid constructs of win and loss streaks on the next game played failed to show a significant

relationship between current win/loss streaks and the probability of winning the next game

played. These results are consistent with models that describe the probability of winning as based

on a set of underlying variables that are performance related, rather than related to previous wins

and losses. These models argue in favor of a more parsimonious model for the probability of

winning that does not include win and loss streaks. Simply put, there is no evidence in

American football that any additional probability of winning or losing is gained from an

individual's or a team's recent performance.



The results from American football do not prove that win streaks have no effect on the

probability of winning future games; they merely show that a valid application of the model

provided no evidence of such a relationship. The lack of evidence may be due to the choice of

variables, or the probability model itself; however, it is just as likely that there is no relationship

between win streaks and the probability of winning future games. As such, the burden of proof

lies on those who claim there is a relationship between win streaks and future wins to construct a

model and provide evidence of the relationship.

These results may be disappointing to those who harbor a belief that their team will

ultimately win (or lose) a given contest because of the team's recent success (or failure).

Nonetheless, the results are consistent with the typical modeling of the outcome sequence as a

Markov chain. As such, failure to show a relationship is probably more important than showing a

relationship using logistic regression.

Two important results can be gleaned from this study. First, the likelihood of persistent

bias when using models that simply regress win streaks on future outcomes was demonstrated by

an analysis of the variables, and verified by the regression results. As tempting as this type of

investigation is, it is more likely that any significant results will turn out to be artifacts of the

model and variable specifications. Second, the approach demonstrated the validity of using the

simple regression equation (2) when the full probability model can be expressed in the form

given by equation (3). This implies that it is possible to test relationships independently of other

underlying determinants of probability.

Conclusion

Conducting successful research requires that several skills be addressed in a logical and

appropriate way. As Trochim (2001) put it, research involves an eclectic blending of an



enormous range of skills and activities (p. 4). Although his statement is somewhat bombastic, it

accurately conveys the impression that research is not simply about having an idea and checking

to see if it might be correct. Leedy and Ormrod (2005) expanded on this point and argued that

successful research involves the systematic process of collecting, analyzing, and interpreting

information (p. 2). Good research requires not only a good question and adequate data, but also

appropriate methodology.

This paper has explored the range of research methodology available to researchers, and

tried to demonstrate how the choice of a research question and the available data affect the

methodology used. Not all methodologies are appropriate for every research question or type of

data; researchers must select a methodology that fits. One focus of this paper has been to identify

an appropriate research methodology for existing data that is restricted in the range of values that

can be expressed. As was shown, data of this type can require specialized transcription and

analysis. As part of this investigation, several examples of research were reviewed, most of

which used logistic regression to analyze binary response data. As was shown, both the data and

the method of analysis created challenges for researchers. The lessons learned from those

examples resulted in a detailed application of logistic regression on the win and loss results for

American football teams.



References

Albert, J. (2008). Streaky hitting in baseball. Journal of Quantitative Analysis in Sports, 4(1), 1-32.

Aldrich, J. H., & Nelson, F. D. (1984). Linear probability, logit, and probit models. Sage Publications.

Brown, M., & Sokol, J. (2010). An improved LRMC method for NCAA basketball prediction. Journal of Quantitative Analysis, 6(3), 1-21.

Foster, W. M., & Washington, M. (2009). Organizational structure and home team performance. Team Performance Management, 15(3/4), 158-171.

Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match results prediction in association football. International Journal of Forecasting, 26, 460-470.

Kvam, P., & Sokol, J. (2006). A logistic regression Markov chain model for NCAA basketball. Naval Research Logistics, 53, 788-803.

Leedy, P. D., & Ormrod, J. E. (2005). Practical research. Upper Saddle River: Pearson Education.

Nemes, S., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(56), 1-5.

Oppenheimer, D. M., & Monin, B. (2009). The retrospective gambler's fallacy: Unlikely events, constructing the past, and multiple universes. Judgment and Decision Making, 4(5), 326-334.

Sire, C., & Redner, S. (2009). Understanding baseball team standings and streaks. European Physical Journal B, 473-481.

Sports Reference. (2011). Pro-Football-Reference.com. Retrieved March 19, 2011, from Sports Reference: http://www.pro-football-reference.com/

Thomas, A. C. (2010). That's the second-biggest hitting streak I've ever seen! Verifying simulated historical extremes in baseball. Journal of Quantitative Analysis in Sports, 6(4), 1-34.

Trochim, W. (2001). Research methods knowledge base. Cincinnati, OH: Atomic Dog Publishing.



iThe variance of a random variable X is given by ; this evaluates to

.

The problem is that this term becomes negative at approximately ; variances cannot be

negative.
