using binary regression to analyze win streaks in american football

Upload: tom-gross

Post on 07-Apr-2018


  • 8/4/2019 Using Binary Regression to Analyze Win Streaks in American Football

    Working Paper

    Using Logistic Regression to Analyze Win Streaks in American Football

    Tom Gross

    Logistic regression has become a standard tool for analyzing binary outcomes. Examples

    vary as widely as habitat selection by turtles (Compton, Rhymer, & McCollough, 2002), blood

    clots in stroke victims (Larrue, Kummer, Muller, & Bluhmki, 2001), and predicting election

    results (Antonakis & Dalgas, 2009). Perhaps the most frequent use of binary regression is to

    analyze wins and losses of sports teams. Recent examples include predicting wins in the NCAA

    men's basketball tournament (Coleman & Lynch, 2009), baseball standings and streaks (Sire &

    Redner, 2009), and the efficiency of betting markets (Ryall & Bedford, 2010). This application

    of research methodology will use logistic regression to analyze the probability of an American

    football team winning its next game, using the team's recent win/loss behavior as a predictor.

    Logistic Regression

    The value of logistic regression to sports research lies in the way in which outcomes are

    recorded. The vast majority of sports contests have binary outcomes that determine conference

    winners (i.e., a win or a loss), and most of the remaining contests have only three outcomes

    (win, loss, or tie), with a tie often being rare. As a class, sports outcomes are usually treated as

    restricted response variables, which cannot be analyzed by means of ordinary least-squares

    regression. (An example of a sports outcome with less restrictive outputs is NASCAR cup

    racing, which awards a maximum of 48 points for first place, and a minimum of one point for

    forty-third place. The output variable is an interval integer. However, it does not present the

    estimation problems described by Aldrich and Nelson (1984).) The nature of the response

    variable, coupled with the popularity of sports, has resulted in an explosion of publications using

    logistic regression. (A search of Google Scholar pairing "sports" and "logistic regression" results

    in over 14,000 hits on papers published since 2007.)

    One popular phenomenon of sports is the streak. Thomas (2010) noted that one of the

    most popular sports records is DiMaggio's 56-game hitting streak in baseball (p. 1). Further, the

    author claimed that such records are often considered magical and unlikely to happen simply

    by chance (p. 1). In contrast, Albert (2008) argued that several authors have examined

    streakiness in baseball, with mixed results. Although there was some evidence of statistically

    abnormal behavior, most evidence failed to show a significant difference between examples of

    streaks and what would be expected by random distributions (p. 2). Similarly, Oppenheimer and

    Monin (2009) argued that the gambler's fallacy is the widely held belief that streaks will end

    sooner than random chance would suggest (p. 326). The authors also claimed that streaks in

    general conflict with the popular belief that small samples are like miniature examples of the

    population, leading people to believe that streaks are the product of a non-random process.

    One potential area of research would be to compare the likelihood of winning to previous

    win and loss streaks. A streak is defined as a run of consecutive wins or losses; as a random

    variable, it has both category (win or loss) and magnitude (the number of consecutive wins or

    losses). A win streak is time dependent, with its most recent value occurring at time t. If it is

    assumed that the sequence of wins and losses from t to t+1 is a Markov process, then there

    cannot be a causal effect between past values and future values. It follows that researchers who

    wish to make use of the no-memory property of Markov processes must justify the applicability of a

    Markov process to their model. A simple way to do this is to model the outcome as the sum of a

    defined Markov process and some other independent variable process. The second process could

    be any well-defined function of the selected independent variable. Performing logistic regression

    on a model parsed this way results in separate parameter estimates for the two processes. Further,

    it is possible to create a regression equation that allows the second process to be estimated

    independently. This method will be demonstrated below.

    Wins and Win Streaks

    One application of logistic regression using win and loss streak data is to test the

    relationship between win streaks and the probability of winning. Sire and Redner (2009) used a

    variation of logistic regression (the Bradley-Terry model) to show that, for the most part, win

    streaks have a mean and distribution that agree with the assumed underlying distribution.

    However, they found evidence of non-random behavior in their analysis of baseball results during

    the period from 1901 to 1960 (p. 479). The methodology of Sire and Redner included

    assumptions of the type of distribution underlying a team's probability of winning. They did not

    specifically identify what the factors affecting probability were, making it difficult to account for

    differences in winning percentages between teams. Instead, they limited their research to the

    questions of team parity, and the evidence of winning streaks being abnormally long. In

    particular, Sire and Redner did not test for a relationship between win streaks and changes in the

    probability of winning. Although the authors claimed that this question continues to be

    vigorously debated, the last citation they included was published in 2000 (p. 474).

    Sire and Redner (2009) provided an excellent example of the difficulties facing

    researchers attempting to explore the relationship between wins and losses, and any artifact of

    how those data are recorded. Specifically, any attempt to determine a relationship between win

    streaks and the probability of winning needs to model both the independent variable used to

    predict outcomes, as well as the probability model itself. One method commonly used to do this

    is the Logit model, which posits a linear explanatory variable and a continuous, though bounded,

    response variable. The Logit model allows a wide variety of explanatory variables while

    maintaining a response variable profile that can be mapped onto a probability distribution

    function.

    The Model

    The Logit model is based on simple binary regression, where the probability of a win is

    given by

    (1)

    Equation (1) plots a sigmoid curve that represents the probability of event Z. The

    equation for Z and its relationship to the underlying independent variables that affect p is given

    by

    (2)

    Here Z can represent any outcome from a linear function of X, although in the regressions

    reported below, X represents win and loss streaks. For a simple example, imagine that the

    probability of solving a specific math problem on a test is a function of the number of similar

    problems solved prior to the test. The variable X would represent the number of similar problems

    solved, and Z would represent the additive function of the problems solved. Because Z affects

    the probability of success, p_Z is a conditional probability, and a monotonically increasing

    function of Z and X.
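
    Equations (1) and (2) are not reproduced in this copy. As a hedged sketch only, the standard logit forms that match the description above (a sigmoid probability in Z, with Z linear in X) are shown below. The symbols alpha and beta are assumed notation, and alpha may be zero if the intercept is constrained.

```latex
% Assumed standard forms, not the author's typeset equations.
% Sigmoid probability of the event (compare the description of equation (1)):
p_Z = \frac{1}{1 + e^{-Z}}

% Log-odds linear in X (compare the description of equation (2)):
\ln\!\left(\frac{p_Z}{1 - p_Z}\right) = Z = \alpha + \beta X
```

    Under these assumed forms, solving the second expression for p_Z gives the first, which agrees with the statement that equation (1) is a solution to equation (2).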

    Equation (1) is a solution to equation (2) expressed as a probability function. Often such

    relationships are represented as odds instead of a probability; the argument in equation

    (2) is a simple transformation of the odds of event Z. (The odds of an event are the

    probability of the event happening divided by the probability of that event not happening,

    expressed in terms of discrete outcomes.) Equations (1) and (2) imply that X (for example, the

    number of problems solved) is the only determinant of the probability of an event (for example,

    the probability of solving a specific problem on a test). This simple model of the causes affecting

    the probability of an event is neither realistic nor even a stable series. It is therefore assumed that

    the probability of a win is a function of a set of variables (Gamma) that have a fixed mean and

    variance, which are uncorrelated with X, making the full model

    (3)

    Here X is defined as it is in equation (2). However, p represents the total probability of

    winning. Gamma represents the value of a function of underlying causes of changes in

    probability of a win, not including X. For example, Gamma could be a function of home field

    advantage (for example, see Levernier & Barilla, 2007), individual matchups (Brown & Sokol,

    2010), or any other variable that would systematically affect the probability of winning.

    Following the methodology of Brown and Sokol, it is assumed that the variable Gamma is

    normally distributed. (This assumption is not necessary, but it simplifies the

    analysis.)
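
    Equation (3) is not reproduced in this copy. One form consistent with the surrounding description, offered only as an assumption, adds a Gamma term to the log-odds. The additive placement of Gamma and the symbols mu and sigma are illustrative choices, not taken from the paper.

```latex
% Assumed form of the full model: streak effect plus all other determinants.
\ln\!\left(\frac{p}{1 - p}\right) = \beta X + \Gamma,
\qquad \Gamma \sim N(\mu, \sigma^{2}),
\qquad \operatorname{Cov}(X, \Gamma) = 0
```

    With beta equal to zero, this assumed form reduces to a win probability driven entirely by Gamma.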

    Equation (3) assumes that the probability of a win can be held constant and evaluated in

    X; it also implies that there is no correlation between variables in Gamma and X. It follows that

    equation (2) is the regression model used to determine the effect of a win streak on the

    probability of winning the next game, whereas equation (3) represents the relationship between

    all independent variables and the probability of a win. The variable X represents consecutive

    wins and losses. As noted above, this variable has both categorical (wins or losses) and

    magnitude components (number of wins or losses). Of these two components, the category is the

    more important because the model is testing if, as was claimed by Sire and Redner (2009), wins

    and losses are self-reinforcing (p. 474). (The magnitude of the win streak represents the most

    restrictive form of the model.) Wins are given positive values and losses are given negative

    values so that the values of X preserve both the categorical and magnitude aspects. Defining X

    this way eliminates zero as a possible value for X. This does not affect the estimation of the

    coefficient on X; on the contrary, it improves the estimation of the parameter by forcing the

    intercept of the model to be zero. (The specifics of how the random variable X is generated are

    shown below.) It should be noted that more complex model specifications are possible.

    One interesting variation would be to include X as an element of Gamma; these variations are

    more difficult to evaluate, and are not considered here.

    In the regression model specification shown in equation (2), Z is bounded by the

    underlying distribution of X. For example, because X represents consecutive wins at time t, the

    maximum value for X (assuming every game was won) is the total games played. Noting that

    professional American football plays 16 regular season games, the maximum value of X is plus

    or minus 16. (Looking at real data, the maximum winning streak in 2010 was eight and the maximum

    losing streak was 10.)

    Equation (3) represents the relationship between the probability of an outcome (e.g., a

    win) and all factors that might change the probability. However, the first coefficient, beta, only

    measures the effect of win streaks on the total probability of a win. The individual parameters for

    Gamma represent the contributory effect of each of those variables. Examples of factors that

    affect Gamma might include home field advantage, team matchups, or any other variable that

    affects the probability of a win, except for the contributory effect of a win streak. (The obvious

    concern that those factors might directly affect X (the win streak variable) is discussed in detail

    below.)

    Interpreting the regression results in terms of the full model is straightforward; if X is

    uncorrelated with other determinants of p and adds no significant amount to the value of p, the beta

    term will estimate as not different from zero; this implies that win streaks do not affect the

    probability of winning the next game. The specific null hypothesis being tested by the regression

    equation is that beta equals zero. The null can be rejected if regression results indicate a significant

    relationship, e.g., the p-value of the beta term is below the specified level of significance.

    Typically, a researcher reports a significant relationship between the regression variable and the

    outcome variable as justification for rejecting the null hypothesis and accepting the alternative.

    Briefly stated, a very low p-value is usually considered both a necessary and a sufficient reason to

    reject the null hypothesis. Unfortunately, although significance is necessary to reject the null, it

    is not sufficient for the regression equation (2) because the independent variables used in the

    regression equation are biased toward significance. This bias does not mean the model is not

    valid; however, it does require discussion of the nature of the bias contained in the variable X.

    Potential Bias in Estimates of Beta

    The potential effect of bias in X can be demonstrated by assuming, as equation (3) does, that the

    unconditional probability of a win is determined by a probability function of a set of variables not

    including X. Using the method of Kvam and Sokol (2006), the transition from one

    state to another (e.g., from winning to losing), not including X, is modeled as a Markov chain;

    sequences of this type have no memory, meaning that subsequent events are independent of the

    outcomes of prior events. (The assumption that the determinants of a win, excluding X, constitute

    a Markov process is not necessary; however, its inclusion simplifies the analysis, and does not

    change the model expressed in equation (3).) Further, if we assume that the coefficient on X is

    zero (beta = 0), the model implies that changes in Gamma will not affect the value of X.

    Unfortunately, this assumption can be demonstrated as false.

    As noted above, the values of X (win streaks) are simply consecutive outcomes of the

    same type (wins or losses). That is, a win streak of three at time t means that the outcomes from

    t-3 to t-1 were all wins. However, the win streak is simply a random array of outcomes in periods

    t-3 to t-1. It follows that the expected value of a win streak is given by

    (4)

    WinStreak is therefore a Taylor series in n and p, and more importantly, a function of p.

    Although solving for the expected value of WinStreak is simple, characterizing the behavior of

    the variable is more difficult. One way to do this is to describe WinStreak using known

    distributional forms. Assuming p is fixed, we can define q as 1-p; it follows from equation

    (4) that the expected value of WinStreak is the variance of a geometric distribution in q. (This is

    a rather minor observation that seems not to have been noted in the literature.) Unfortunately,

    WinStreak does not have a stable variance; further, higher moments of WinStreak probably do

    not exist.i
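
    Equation (4) is not reproduced in this copy. As a hedged sketch, one series that fits both descriptions above (a power series in n and p whose sum equals the variance of a geometric distribution in q) is given below. The summation form is an assumption. The closed form and its derivative with respect to p follow from it by standard algebra.

```latex
% Assumed series for the expected streak, with q = 1 - p and 0 < p < 1.
E[\mathit{WinStreak}] = \sum_{n=1}^{\infty} n\,p^{n}
  = \frac{p}{(1 - p)^{2}} = \frac{1 - q}{q^{2}}

% Sensitivity of the same closed form to the underlying win probability.
\frac{d}{dp}\left[\frac{p}{(1 - p)^{2}}\right] = \frac{1 + p}{(1 - p)^{3}} > 0
```

    The expression (1 - q)/q^2 is the variance of a geometric distribution with parameter q, and the positive derivative shows that the expected streak grows with p under this assumed form.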

    It is also possible to examine directly the effect of a change in the underlying probability

    on the value of WinStreak. The increase in WinStreak for a change in p is given by

    (5)

    Equation (5) confirms that both WinStreak and NextGame (the term that represents the

    game played at time t in the regressions below) are increasing in p. (The outcome of the next

    game is assumed to be a function of p; equation (5) shows that WinStreak is also a function of p.)

    Equation (5) also verifies that if we assume the value of beta in equation (2) is zero, there is no

    memory in the sequence of wins, even though both WinStreak and NextGame are functions of p.

    (This follows because it is possible to describe NextGame without reference to X.) The important

    point is that as p gets large, both the number of wins and the magnitude of WinStreak increase.

    Regressing NextGame on WinStreak will show spurious correlation because larger values of

    WinStreak mean that wins will be more frequent. Literally, Gamma is a confounding variable

    in regression equation (2). Regressions that demonstrate a significant relationship between

    WinStreak and NextGame are necessary to reject the null hypothesis, but not sufficient because

    the result can be a spurious effect of Gamma.
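
    The confounding described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's procedure: teams are given fixed win probabilities (0.2 and 0.8 are hypothetical values), games are independent, and there is no streak effect at all. The helper names are invented for illustration.

```python
import random


def signed_streaks(outcomes):
    """Signed running streak: +k after k straight wins, -k after k straight losses."""
    values, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        values.append(current)
    return values


def simulate(team_probs, games_per_team, rng):
    """Pool (streak at t, outcome at t+1) pairs over teams with NO streak effect."""
    pairs = []
    for p in team_probs:
        outcomes = [1 if rng.random() < p else 0 for _ in range(games_per_team)]
        streaks = signed_streaks(outcomes)
        pairs.extend(zip(streaks[:-1], outcomes[1:]))
    return pairs


def win_rate_after(pairs, after_win):
    """Share of next games won, given that the current streak is a win (or loss) streak."""
    next_games = [y for x, y in pairs if (x > 0) == after_win]
    return sum(next_games) / len(next_games)


rng = random.Random(0)
pooled = simulate([0.2, 0.8] * 50, 200, rng)  # weak and strong teams mixed
single = simulate([0.5] * 100, 200, rng)      # every team has the same p

# Gap in next-game win rate between "on a win streak" and "on a loss streak".
gap_pooled = win_rate_after(pooled, True) - win_rate_after(pooled, False)
gap_single = win_rate_after(single, True) - win_rate_after(single, False)
print(f"pooled gap: {gap_pooled:.3f}, single-p gap: {gap_single:.3f}")
```

    In the pooled case the gap is large even though no game depends on the previous one, because being on a win streak is evidence of belonging to a strong team. When every team shares the same p, the gap is close to zero.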

    The behavior of WinStreak causes a significant bias when the probability of a win is very

    large or very small. The bias will be smaller if the number of paired observations at a given p is

    small or if p is approximately 50%. However, if the probability of a win is extreme and the number

    of paired observations is large (for example, when the number of games played by a basketball

    team is 82), simply regressing NextGame on WinStreak might indicate a significant relationship

    when one does not exist.

    The results of equations (4) and (5) call into question the usefulness of using a Logit

    model on win streaks in any form. However, since the bias is one-sided (indicating a

    relationship when one does not exist), the model remains useful when it fails to reject the null.

    However, the existence of bias points to how difficult it is to reject the null hypothesis for the

    regression equation (2). If the regressions show a significant relationship, an additional set of

    regressions should be run after attempting to remove the bias from the WinStreak variables.

    (How this might be done is unclear because all potential factors in Gamma would have to be

    eliminated as the source of bias.) The gist of the matter is that it is possible for the beta term in

    equation (2) to show significance when there is no relationship between the independent and

    dependent variables. Therefore, it is important to run additional tests when any significant results

    from the regression are discovered.

    Data

    The real data for regression were gathered from Sports-Reference.com (Sports Reference,

    2011). This site has extensive historical data on professional and amateur sports. Data were

    retrieved in March of 2011. The NFL win/loss data were for the 2010 year, and included only

    regular season games (512 individual outcomes representing 256 total games played). Each win

    was recorded as a one, and each loss was recorded as a zero. The data were arrayed as a single

    set of values, representing all 512 outcomes. Although this represents a doubling of the actual

    events, the only effect on the parameter estimation is to reduce the standard errors, by a factor of roughly 1/sqrt(2) when the number of observations doubles.

    This array of 512 outcomes was the set from which all other variables were created.

    Win streaks for teams were calculated in three different ways, each of which replaces the

    value of X in equation (2); however, only the first two were used for estimating the model. The

    three variable names used were WinStreak, Pos_Neg, and Cumm_WinStreak. WinStreak

    calculates win streaks and loss streaks separately and then combines the result. For example,

    assume that a sequence of outcomes in the data set is {W, W, L, W, L, L}. Converting this to

    numeric values, the sequence becomes {1, 1, 0, 1, 0, 0}. This sequence has four different streaks:

    a win streak of two, a loss streak of one, a win streak of one, and a loss streak of two. However,

    streaks are measured at time t, meaning that this sequence will have six different values for

    WinStreak. Each win adds +1 and each loss adds -1. (As noted above, wins and losses are

    calculated separately for technical reasons that do not affect the parameter estimation.) Starting

    at t=1 through t=6, the six values for WinStreak are {1, 2, -1, 1, -1, -2}. WinStreak represents the

    most restrictive form of the regression equation (2) because it retains not only the type of

    outcome (a win or a loss, represented by positive or negative numbers), but also the magnitude of

    the variable. The values for WinStreak are the X values in the regression equation (2). This is the

    way one would expect win and loss streaks to be reported, and what we mean when we say, "my

    team has a two-game win streak." The only difference is that the sign of the value represents a

    win (+) or a loss (-).

    Pos_Neg is calculated in a similar way, except that the only values Pos_Neg can take are

    positive one and negative one. For example, the string of results described above becomes {1, 1,

    -1, 1, -1, -1}. Positive numbers represent win streaks and negative numbers represent loss

    streaks; however, the length of the streak is lost. This is mathematically identical to coding an

    independent variable as a zero or a one; the only difference is that using the values of negative

    one and positive one forces the intercept to be zero. (A graph of the regression relationship would

    go through the origin.) Pos_Neg represents a less restrictive version of the regression equation

    (2) because it does not retain the magnitude of the win streak; it simply indicates if the streak is

    one of winning or losing.

    The final constructed independent variable is Cumm_WinStreak. This variable sums the

    numerical value for each win and loss, resulting in a cumulative value for consecutive wins and

    losses. Using the same string of values used for WinStreak, Cumm_WinStreak becomes {1, 2, 1,

    2, 1, 0}. As teams accumulate wins, Cumm_WinStreak becomes larger and positive; losing

    causes the value to become smaller, and eventually negative, if more games are lost than are

    won. This variable was not considered appropriate for real data because of serious problems that

    result when attempting to model its behavior. Unlike the previous two constructed variables,

    Cumm_WinStreak does not reset after each change of state. (Moving from a win to a loss, or the

    opposite, is a change of state.) It follows that the value of the variable can drift from its point of

    origin, and that the drift can be arbitrarily large. For example, even assuming a Markov process

    with p = 0.5, the value of Cumm_WinStreak can be arbitrarily large at any given time. (This is

    a common feature of Markov chains, which is often forgotten by people investing in the stock

    market.) Further, Cumm_WinStreak has an unusual limit behavior: as the number of un-played

    games approaches zero (i.e., the last games in the regular season are played) the value of

    Cumm_WinStreak becomes a simple transform of the probability of winning (in particular, the

    odds ratio). This means that Cumm_WinStreak will often show significant results even when the

    other constructed variables do not.
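
    The three constructed variables can be sketched in a few lines. This is an illustrative implementation based on the descriptions above; the function names are hypothetical, and outcomes are assumed to be coded as 1 for a win and 0 for a loss.

```python
def win_streak(outcomes):
    """WinStreak: +k after k straight wins, -k after k straight losses."""
    values, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        values.append(current)
    return values


def pos_neg(outcomes):
    """Pos_Neg: keeps only the sign of the current streak (+1 win, -1 loss)."""
    return [1 if won else -1 for won in outcomes]


def cumm_win_streak(outcomes):
    """Cumm_WinStreak: running total of +1 per win and -1 per loss; never resets."""
    values, total = [], 0
    for won in outcomes:
        total += 1 if won else -1
        values.append(total)
    return values


example = [1, 1, 0, 1, 0, 0]  # the sequence {W, W, L, W, L, L}
print(win_streak(example))       # [1, 2, -1, 1, -1, -2]
print(pos_neg(example))          # [1, 1, -1, 1, -1, -1]
print(cumm_win_streak(example))  # [1, 2, 1, 2, 1, 0]
```

    The printed values match the worked sequences given in the text for the same six outcomes.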

    The regression equation (2) matches the independent variable at time t with the outcome

    at time t+1. This required that each of the independent variables be labeled by time; although

    time is not a variable in the regression model, aligning the variables correctly is necessary. The

    values for each of these variables at time t were paired in logistic regression with the outcome of

    the next game played at time t+1, drawn from the original sequence of outcomes. The last value for

    each set of 16 games was deleted because the independent variable could not be paired with a

    next game. This reduced the total number of regression pairs to 480.

    Using the array described above, the six values would be reduced to five; the two

    independent and dependent variables (representing WinStreak and NextGame) become {1, 2, -1,

    1, -1} and {1, 0, 1, 0, 0} respectively. Note that the sequence of values for WinStreak represents

    values calculated from t=1 to t=5, whereas the values of the NextGame represent values

    calculated for periods t=2 to t=6.
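
    The alignment step can be sketched as follows. The function name is hypothetical, and the per-team handling (dropping each team's last value) is an assumption based on the description of the 16-game sets above.

```python
def pair_with_next_game(predictor, outcomes):
    """Pair predictor[t] with outcomes[t+1]; the final predictor value has no partner."""
    if len(predictor) != len(outcomes):
        raise ValueError("predictor and outcomes must have the same length")
    return predictor[:-1], outcomes[1:]


streak = [1, 2, -1, 1, -1, -2]  # WinStreak at t = 1..6
games = [1, 1, 0, 1, 0, 0]      # outcomes at t = 1..6
x, next_game = pair_with_next_game(streak, games)
print(x)          # [1, 2, -1, 1, -1]
print(next_game)  # [1, 0, 1, 0, 0]

# Applying the same step to each of 32 teams with 16 games leaves 15 pairs per team.
n_pairs = 32 * (16 - 1)
print(n_pairs)    # 480
```

    Pairing must be done team by team so that the last game of one team is never matched with the first game of another.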

    Each of the win streak variables represented a slightly different set of assumptions for the

    probability model. The simplest interpretation of equation (3) is that winning increases the

    chance of winning, and (unfortunately for fans) losing increases the chance of losing. This is

    captured by the independent variable Pos_Neg. Using Pos_Neg assumes that a win in period t

    affects the probability of a win in period t+1. However, the magnitude of the string of wins

    has no additional effect. Under this assumption, winning improves the chances of winning by a

    fixed amount until the next loss, at which time the probability of a win reverts to the original

    value. (Original values for the probability of a win are given by the value of Gamma.)

    In contrast to Pos_Neg, the design of WinStreak assumes that consecutive wins or losses

    have an additive effect; under this assumption, long streaks are self-perpetuating. Sire and

    Redner characterized this, without being specific, as winning streaks being self-reinforcing

    (2009, p. 474). It should be noted that this assumption leads to significant modeling problems; in

    this case, equation (2) becomes a positive feedback function, as described above. Under this

    scenario, if the sequence is long enough, p converges to its limit of one (i.e., no probability of a

    loss) or zero (no probability of a win).

    Although there are significant concerns about the modeling of the independent variables,

    the dependent variable NextGame can be described as a Markov chain with a transition

    probability of p. Assuming win streaks do not affect the probability of winning, equation (3)

    becomes a simple Markov process. This is the same methodology used by Brown and Sokol

    (2010), Kvam and Sokol (2006), as well as others. As noted above, significant results from the

    regression equation (2) might be interpreted as evidence that the sequence of outcomes is not a

    Markov chain. It follows that rejection of the null implies rejection of the distributional

    assumptions on which equation (3) is based. In that event, it would be necessary to create a

    different probability model for wins and losses.

    Testing the Model

    As noted above, using the independent variables in regression equation (2) raises

    significant concerns about the ability to capture any relationship that might exist between win

    streaks and the next game played. It is therefore appropriate to test the model on simulated data

    prior to running regressions on actual data. The regression model was tested using simulated data

    consisting of 1000 randomly generated values. The values were generated using an Excel Add-In

    (MegaStat), which produced a random variable uniformly distributed from zero to

    one. The value of N=1000 was chosen to avoid bias that can result from small

    samples. A large sample will minimize the likelihood of a Type II error; in addition, the bias

    inherent in estimating Logit models creates a significant likelihood of a Type I error with small

    samples. This point was made by Nemes, Jonasson, Genell, and Steineck (2009) as a general

    problem with logistic regression; it is an even larger problem here due to the inherent bias of the

    variables used to represent win streaks.

    Assuming a valid model specification of the relationship between independent and

    response variables, regression using a random set should not yield significant results. The

    random simulated data were subsequently modified with a bias to test the model when there was

    a known effect; assuming that the model specification is valid, the results should be significant.

    As noted, the randomly generated data (Y) were uniformly distributed between the values

    of zero and one. These data were then converted into binary data by comparing each value with a

    threshold of 0.500. This yielded a set of 520 zero values and 480 one values,

    representing losses and wins respectively. Win streaks were calculated for consecutive wins and

    losses in the simulated data using the method described in the data section above. For example,

    WinStreak was calculated by assigning the first win in a string a value of 1; the second

    consecutive win was assigned a two, and so on. Each loss was assigned a value of negative one,

    so that consecutive losses resulted in consecutively larger negative numbers. The resulting

    WinStreak consisted of 1000 positive or negative numbers representing the consecutive wins or

    losses to that point. These values were paired so that the WinStreak value at time t was paired with the

    outcome at time t+1, the NextGame variable in the regression model. (This reduced the number of

    paired values to 999 because there is no next game after the last win streak value is calculated.)

    Logistic regression was performed on the paired data to determine if there was a significant

    relationship between the two simulated variables.
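
    The test on random data can be approximated with the sketch below. It is not the Excel/MegaStat procedure used in the paper. It assumes that draws below 0.5 are coded as losses, uses an arbitrary seed, and fits the logit by a hand-written Newton-Raphson routine (the function fit_logit is hypothetical) so that only the Python standard library is needed.

```python
import math
import random


def signed_streaks(outcomes):
    """Signed running streak: +k after k straight wins, -k after k straight losses."""
    values, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        values.append(current)
    return values


def fit_logit(x, y, iterations=25):
    """Newton-Raphson fit of P(y=1) = 1 / (1 + exp(-(a + b*x))).

    Returns (a, b, se_a, se_b). Standard errors come from the inverse of the
    information matrix evaluated in the final iteration.
    """
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b, math.sqrt(h11 / det), math.sqrt(h00 / det)


rng = random.Random(2011)                        # arbitrary seed for repeatability
draws = [rng.random() for _ in range(1000)]      # uniform on [0, 1)
outcomes = [0 if u < 0.5 else 1 for u in draws]  # assumed coding: below 0.5 is a loss
streaks = signed_streaks(outcomes)
x, next_game = streaks[:-1], outcomes[1:]        # 999 aligned pairs

a, b, se_a, se_b = fit_logit(x, next_game)
print(f"constant={a:.4f} (SE {se_a:.4f}), WinStreak={b:.4f} (SE {se_b:.4f}), z={b / se_b:.3f}")
```

    Because the draws are independent, the slope on the streak variable should be small relative to its standard error in most runs; a different seed gives a different but similarly unremarkable estimate.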

    Assuming that consecutive wins and losses affect the probability of winning the next

    game is tantamount to saying that beta is a significant and positive value in equation (2). The

    interpretation is that long win streaks increase the probability of winning the next game, and long

    losing streaks decrease that probability. By design, the random probabilities used to generate the

    simulated data were uniformly distributed with a mean of 0.50; the value of each draw did not

    depend on the ex post calculated win streaks. It follows that the change in probability for a given

    change in X is given by

    (6)

    It also follows that the expected value of Z in equation (2) must be zero, and equation (1)

    becomes

    (7)
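
    Equation (7) is not reproduced in this copy. Assuming equation (1) is the standard logistic function, setting Z to its expected value of zero gives the following, shown here only as a worked check:

```latex
% Assumes the standard logistic form for equation (1).
p = \frac{1}{1 + e^{-Z}} \Big|_{Z = 0} = \frac{1}{1 + e^{0}} = \frac{1}{2}
```

    That is, with no streak effect the implied probability of a win is 0.5, matching the mean of the simulated draws.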

    Equation (7) implies that binary regression using the randomly generated values for Y

    should not yield significant coefficients; that is, the p-values should exceed traditional limits. The results of

    binary logistic regression on the unbiased random data, using the value of WinStreak at time t

    and NextGame (at time t+1), are shown in Table 1:


    Table 1

    Logistic Regression Table for Simulated Data

    Predictor    Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
    Constant     -0.084        0.0635   -1.3228   0.1860
    WinStreak    -0.0107       0.0272   -0.3934   0.6930    0.99         0.94           1.04

    Note: CI is Confidence Interval

    As expected, the p-value on the variable WinStreak is large, providing no evidence of a

    relationship between WinStreak and the probability of winning the next game. Given the

    randomness of the data, the lack of significance using randomly generated data is encouraging; it

    shows that a model that could capture an existing relationship is not necessarily tricked into

    showing significance because of inherent bias.
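
    The null experiment can be reproduced in outline with the sketch below. It is an illustration under stated assumptions, not the author's procedure: the rule for converting a random probability into a win is assumed to be "win if the draw exceeds 0.5", the seed is arbitrary, and the helper names (win_streaks, score_and_information, fit_logistic, simulate_null) are hypothetical. The fit uses plain Newton-Raphson for a one-predictor logistic model, so the estimates will differ from Table 1 because the random draws differ.

```python
import math
import random


def win_streaks(outcomes):
    """Signed running streak: +1, +2, ... for wins, -1, -2, ... for losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks


def score_and_information(b0, b1, x, y):
    """Gradient of the logistic log-likelihood and its information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))).
    Returns (b0, b1, se0, se1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = gradient
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    # standard errors from the inverse information matrix at the estimates
    _, _, h00, h01, h11 = score_and_information(b0, b1, x, y)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)


def simulate_null(n=1000, seed=2011):
    """Independent games with win probability 0.5 (assumed conversion rule)."""
    rng = random.Random(seed)
    outcomes = [1 if rng.random() > 0.5 else 0 for _ in range(n)]
    streaks = win_streaks(outcomes)
    return streaks[:-1], outcomes[1:]


x, y = simulate_null()
b0, b1, se0, se1 = fit_logistic(x, y)
print(f"WinStreak coefficient {b1:.4f}, SE {se1:.4f}, Z {b1 / se1:.4f}")
```

    Because the simulated games are independent, the Z statistic on WinStreak should exceed conventional critical values only about as often as the chosen significance level would predict.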

    Testing the model under conditions when it should not work is of little value for

    demonstrating that the model will work when it should. Instead, it is necessary to show that the

    model can accurately demonstrate a relationship between win streaks and the probability of

    winning the next game when a relationship is known to exist. The original random probabilities

    were modified to test this by biasing the probability assigned to each event using the value of

    WinStreak from the previous period. The bias consisted of adding 10% to the existing probability for each

    ordinal number of the win streak in the previous period. The biased data are modeled on equation

    (2); they have no Gamma variable as shown in equation (3). Thus, the new probability for each event

    is given by:

    (8)

    The maximum increase in probability was 30.34%, whereas the maximum decrease was

    69.16%. Further, the median value was -0.57%, with first and third quartile values of -7.36% and

    6.32% respectively. Somewhat surprisingly, the number of wins went down from 480 to 443.

    These values represent a significant change in probability, based on the previously paired


    WinStreak value. Each of the new probabilities was converted to a binary value using the same

    rule described previously, and new values for WinStreak were calculated. Binary logistic regression

    was then run on the new biased values, the results of which are shown in Table 2:

    Table 2

    Logistic Regression Table of Biased Results

    Predictor          Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
    Constant           -0.1441       0.0659   -2.1866   0.0290
    Biased WinStreak   0.1250        0.0196   6.3776    0.0000    1.13         1.09           1.18

    Note: CI is Confidence Interval

    As the regression table shows, the model worked surprisingly well to capture the effect of

    the bias introduced; the associated p-value is zero to four decimal places. (The results would be

    significant even if the standard error value for the independent variable were doubled.) The

    coefficient of 0.125 on the biased WinStreak corresponds to an odds ratio of 1.13; that is, each additional win in the streak raises the odds of

    winning by an estimated 13%. The 95% confidence interval for the coefficient contains

    0.10, the size of the bias introduced. In addition, the sign of the coefficient is positive, as predicted.

    Performing the same analysis using the binary data (where positive values of WinStreak were

    given a value of one, and the remaining data were given a value of negative one) gave similar

    results.
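
    The derived columns of the regression tables (Z, p-value, odds ratio, and confidence limits) follow from the coefficient and its standard error. The sketch below assumes a two-sided Wald test with a normal reference distribution and a critical value of 1.96; the function name wald_summary is hypothetical. It is applied to the Biased WinStreak row of Table 2, and to the same row with the standard error doubled.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Wald Z, two-sided normal p-value, odds ratio, and 95% CI for the odds ratio."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    odds_ratio = math.exp(coef)
    ci = (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se))
    return z, p_value, odds_ratio, ci


# Biased WinStreak row of Table 2
z, p_value, odds_ratio, (ci_low, ci_high) = wald_summary(0.1250, 0.0196)
print(round(z, 4), round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))

# Same coefficient with the standard error doubled
z2, p2, _, _ = wald_summary(0.1250, 2 * 0.0196)
print(round(z2, 4), p2 < 0.05)
```

    The rounded values reproduce the Table 2 entries for Z, the odds ratio, and the confidence limits.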

    The results from all three regressions indicate that the model in all its forms will capture

    any effect that we know is present in the next win. However, the simulated data are based on

    equation (2); they do not imply that equation (2) can measure the coefficient on X if the actual

    probability model is as represented by equation (3). As noted above, the significance of the

    regression is a necessary but not sufficient condition to reject the null hypothesis. This follows

    from the bias inherent in estimates of the coefficient on X when other factors, such as those noted

    above, affect the probability of a win.


    The existence of bias suggests a two-step process for evaluating the relationship between

    a sports team's win streaks and its probability of winning the next game. The first step in the

    process is to estimate the beta term in equation (2) to determine if there is evidence of an effect.

    If there is not, the null hypothesis cannot be rejected. If evidence of a significant relationship

    does exist, additional tests should be run to determine if the evidence might be due to spurious

    correlation. Although there are significant flaws in this approach, which are addressed in the

    discussion below, this two-step process provides a reasonable method for evaluating the research

    problem.

    Results using American Football

    The data used to evaluate the relationship between team win streaks and the probability

    of winning the next game played consisted of the wins and losses of all 32 teams in the National

    Football League, 2010-2011 regular season. Data were downloaded from www.Sports-Reference.com

    (2011) and were coded as described above. There are 16 games played each

    regular season by the 32 NFL teams, resulting in 512 outcomes, representing 256 total games.

    Each team had individual winning percentages calculated, as well as the independent variables

    WinStreak, Pos_Neg, and Cumm_WinStreak, as described above. Regressions were run for the

    teams individually; however, the results showed no significance. (This could easily have been

    predicted because the sample size of 15 per team is too small to provide good results in logistic

    regression.) The independent variables at time t were paired with the next game outcome at

    time t + 1, resulting in a loss of one game per team, and reducing the total number of data pairs

    to 480.
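
    The bookkeeping behind the 480 pairs can be sketched as follows. Streaks are computed within each team's own schedule, so no pair crosses from one team to another. The function names and the placeholder season are hypothetical; the alternating results stand in for real data only to make the example runnable.

```python
def win_streaks(outcomes):
    """Signed running streak: +1, +2, ... for wins, -1, -2, ... for losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks


def pooled_pairs(results_by_team):
    """Pair each team's WinStreak after game t with that team's outcome in
    game t + 1, then pool the pairs from all teams into one list."""
    pairs = []
    for outcomes in results_by_team.values():
        streaks = win_streaks(outcomes)
        pairs.extend(zip(streaks[:-1], outcomes[1:]))
    return pairs


# Placeholder season (not real data): 32 teams, 16 games each
season = {f"team_{k}": [(k + g) % 2 for g in range(16)] for k in range(32)}
pairs = pooled_pairs(season)
print(len(pairs))  # 32 teams x 15 pairs per team
```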

    Although the data were paired using the results for each team, the 32 sets of data were

    combined into a single regression of 480 pairs. This decreases the standard errors for the


    coefficient terms by a factor of two; however, it has no other effect on the estimates of the

    parameters. This technique does raise some concerns beyond the effect on standard errors. First,

    it is not obvious that a single stable relationship can be measured across several teams. A more

    accurate description of X would be as a vector with j elements, each j representing a different

    team. Then each x_ji is the ith pair of regressions for the jth team. This implies 32 different

    independent coefficients. Even if some of the coefficients were non-zero, there is no guarantee

    that the coefficient on the vector can be estimated. Second, even if a single relationship exists for

    each team, there is no guarantee that the actual estimate of the vector beta will be non-zero.

    Although these are valid questions, the response to both is that the regression model

    posited in equation (2) specifically assumes that there is a single stable relationship between win

    streaks and the probability of winning a team's next game, regardless of when it is measured.

    The null hypothesis for that model can only be rejected if such a relationship is detected. If such

    a relationship cannot be detected, it does not matter if it is because such a relationship does not

    exist, or if it exists in a form that cannot be captured by the model. Failing to reject the null

    hypothesis does not prove that there is no relationship; it merely presents evidence that the

    relationship does not exist in the assumed form. In short, these concerns illustrate

    how difficult it is to measure a relationship between win streaks and the probability of

    winning. They do not, however, invalidate the model itself.

    An advantage to combining all the pairs into a single data set is that there will be an equal

    number of wins and losses. This has the effect of forcing the regression through the origin; that is,

    the expected value of the constant term is zero. This both provides additional evidence on the validity of the model and improves the

    accuracy of the estimate of the beta term in equation (2).


    Regressing WinStreak on NextGame using actual data had the expected results, as can be

    seen in Table 3:

    Table 3

    Logistic Regression Table: WinStreak on NextGame

    Predictor    Coefficient   SE       Z        p-value   Odds Ratio   95% CI Lower   95% CI Upper
    Constant     0.00298       0.0914   0.0326   0.9740
    WinStreak    0.0319908     0.0370   0.8654   0.3870    1.03         0.96           1.11

    Note: CI is Confidence Interval

    The regression results show that the model is not biased to the degree that it always

    shows significance. The estimate of the beta coefficient in equation (2) is not

    significantly different from zero, and the null hypothesis cannot be rejected. The assumption

    that the constant term is equal to zero also cannot be rejected, although the model using

    WinStreak allows some variation for this term. These results are somewhat surprising, given the

    tendency of the model to find significance when smaller samples were used, or when the winning

    percentage tended toward extremes. (Additional regressions run using individual team data,

    with N = 15, frequently indicated significance, as expected. Those results are not shown because

    they can be dismissed as being caused by a combination of small sample bias and the bias

    inherent in WinStreak.)

    As noted, the model using WinStreak is the most restrictive form of the regression

    equation, because it requires that the beta term capture the effect of each additional win or loss in

    the streak. A less restrictive form of the model uses Pos_Neg; this version posits that wins and

    losses are reinforcing, but not additive. It follows that the threshold for significance is lower.

    Nonetheless, the results for this regression again gave the expected theoretical results, as shown

    in Table 4:


    Table 4

    Logistic Regression Table: Pos_Neg on NextGame

    Predictor    Coefficient   SE       Z         p-value   Odds Ratio   95% CI Lower   95% CI Upper
    Constant     0.0000        0.0913   0.0000    1.0000
    Pos_Neg      -0.03334      0.0913   -0.3651   0.7150    0.97         0.81           1.16

    Note: CI is Confidence Interval

    As expected, the Pos_Neg version of the model provides the cleanest estimate of the beta

    coefficient in equation (2). This follows because the regression line must go through the origin

    (i.e., the constant term must be zero) when the only possible values for X are -1 and +1. The

    results show that the null hypothesis again cannot be rejected. There is no evidence of an effect

    even in the less restricted model, where a win or loss streak simply influences the probability of a

    win in the next game played.
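
    The behavior of the constant term in the Pos_Neg model can be made explicit with a short derivation. It is offered as a sketch under two assumptions: that equation (2) has the form Z = alpha + beta X with a logistic link, and that alpha and beta (symbols introduced here) denote the constant and the coefficient. Because X takes only two values, the fitted model reproduces the observed win rate in each group, written p_+ for games following a positive streak and p_- for games following a negative streak.

```latex
\alpha + \beta = \ln\frac{p_{+}}{1 - p_{+}}, \qquad
\alpha - \beta = \ln\frac{p_{-}}{1 - p_{-}}

\alpha = \tfrac{1}{2}\left[\ln\frac{p_{+}}{1 - p_{+}} + \ln\frac{p_{-}}{1 - p_{-}}\right], \qquad
\beta = \tfrac{1}{2}\left[\ln\frac{p_{+}}{1 - p_{+}} - \ln\frac{p_{-}}{1 - p_{-}}\right]

\alpha = 0 \iff p_{+} + p_{-} = 1
```

    Under these assumptions, the Table 4 estimates (constant 0.0000, coefficient -0.03334) correspond to win rates of about 0.492 after a positive streak and about 0.508 after a negative streak.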

    The final regression used the variable Cumm_WinStreak. As noted above, this model is

    not considered as a valid example of the regression equation (2) because the independent

    variable for each team converges to a simple transform of the winning percentage. As such, the

    model should show the relationship that exists. The predicted results are that there will be a

    significant relationship between the independent and response variable, even though the

    preceding regressions found none. The results of this regression can be seen in Table 5:

    Table 5

    Logistic Regression Table: Cumm_WinStreak on NextGame

    Predictor        Coefficient   SE       Z        p-value   Odds Ratio   95% CI Lower   95% CI Upper
    Constant         0.0010        0.0927   0.0112   0.9910
    Cumm_WinStreak   0.0964        0.0262   3.6840   0.0000    1.10         1.05           1.16

    Note: CI is Confidence Interval

    As expected, the regression results show a significant relationship between the

    independent and response variables. (The p-value would still be below 5% if the standard error of

    the coefficient were doubled.) Although the hypothesis that the constant term equals zero cannot

    be rejected, the hypothesis that the regression coefficient equals zero can be rejected. These


    results provide more evidence that caution must be exercised when building an independent

    variable from past outcomes in a sequence; some input variables will always show a relationship

    with output variables. This is a valid concern even when the output sequence is a Markov chain,

    as noted in the discussion of potential bias above.

    Discussion of Results

    The relationship described in equation (3) assumes that there exist factors that could

    affect the probability of winning. That is, the probability of winning is not simply a random

    variable with an expected value of 0.50. Inasmuch as the factors that affect winning are stable

    over time, it is reasonable to hypothesize a relationship between future performance and past

    performance. Unfortunately, describing the nature of that relationship can be difficult. The

    simplest model for the performance of a sports team (and many other contests) is that the

    likelihood of winning the next game is a random expression of an underlying probability of

    winning independent of time; an example is the Markov chain model used by Kvam and Sokol

    (2006). Although the authors used this model to predict wins in the NCAA men's basketball

    tournament with excellent results, their model was more complex than it first appeared. Kvam

    and Sokol estimated the probability of a win based on individual matchups of paired teams rather

    than the teams' overall winning percentages. In doing so, the authors demonstrated a difficulty

    with simply using the winning percentage as an estimate of p for predicting the outcome of a

    team's next game. Put simply, Kvam and Sokol estimated p, the probability of winning a

    specific contest, as a conditional probability. Therein lies the problem for everyone who has ever

    wanted to predict the outcome of an event: prognosticators want to know the likelihood of

    winning a specific contest given the expected circumstances of that contest, not the likelihood of

    winning n times in N contests.


    Conditional estimates allow the probability model to include a wide range of explanatory

    variables that researchers may be able to test. Typical examples include the winning percentage

    of the paired contestant (Kvam & Sokol, 2006), home field advantage (Foster & Washington,

    2009), and the recent performance of contestants (Hvattum & Arntzen, 2010). One possible

    conditional variable is the win streak. In its simplest form, this model assumes that wins and

    losses are self-reinforcing. As Sire and Redner (2009, p. 474) noted, individuals frequently refer

    to these win and loss streaks when making their predictions. Unfortunately, working out the

    details of this model demonstrates several problems.

    As noted above, win and loss streaks will be correlated with outcomes because both are

    increasing functions of p. It follows that any constructed independent variable must take this

    potential source of bias into account when testing models. This is a significant problem for

    researchers, and one that may ultimately make the use of any win streak variable invalid.

    However, the potential for bias matters only when results are significant; testing the model

    might nonetheless prove useful, as was the case with the American football results for 2010-2011.
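
    The source of this bias can be illustrated with a small simulation. The sketch assumes two hypothetical teams with fixed win probabilities of 0.2 and 0.8 whose games are fully independent, so there is no streak effect by construction. Pooling the teams still makes a positive streak look predictive: with these values the probability of a win after a positive streak is (0.2^2 + 0.8^2) / (0.2 + 0.8) = 0.68, against 0.32 after a negative streak. The function names and the seed are illustrative choices.

```python
import random


def win_streaks(outcomes):
    """Signed running streak: +1, +2, ... for wins, -1, -2, ... for losses."""
    streaks, current = [], 0
    for won in outcomes:
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
        streaks.append(current)
    return streaks


def next_game_rates(team_probs, games_per_team=2000, seed=7):
    """Pooled next-game win rates after a positive and after a negative streak.
    Every game is an independent draw with that team's fixed win probability."""
    rng = random.Random(seed)
    after_positive, after_negative = [], []
    for p in team_probs:
        outcomes = [1 if rng.random() < p else 0 for _ in range(games_per_team)]
        streaks = win_streaks(outcomes)
        for streak, next_outcome in zip(streaks[:-1], outcomes[1:]):
            if streak > 0:
                after_positive.append(next_outcome)
            else:
                after_negative.append(next_outcome)
    return (sum(after_positive) / len(after_positive),
            sum(after_negative) / len(after_negative))


rate_pos, rate_neg = next_game_rates([0.2, 0.8])
print(f"after positive streak: {rate_pos:.3f}, after negative streak: {rate_neg:.3f}")
```

    The gap between the two rates comes entirely from differences in team quality, which is the kind of spurious relationship the second step of the evaluation process is meant to detect.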

    The results using American football data from 2010-2011 showed that regressing two

    valid constructs of win and loss streaks on the next game played failed to show a significant

    relationship between current win and loss streaks and the probability of winning the next game

    played. These results are consistent with models that describe the probability of winning as based

    on a set of underlying variables that are performance related, rather than related to previous wins

    and losses. These models argue in favor of a more parsimonious model for the probability of

    winning that does not include win and loss streaks. Simply put, the National

    Football League data provide no evidence that a team's recent performance adds to or

    subtracts from its probability of winning the next game.


    The results from American football do not prove that win streaks have no effect on the

    probability of winning future games; they merely show that a valid application of the model

    provided no evidence of such a relationship. The lack of evidence may be due to the choice of

    variables, or the probability model itself; however, it is just as likely that there is no relationship

    between win streaks and the probability of winning future games. As such, the burden of proof

    lies on those who claim there is a relationship between win streaks and future wins to construct a

    model and provide evidence of the relationship.

    These results may be disappointing to those who harbor a belief that their team will

    ultimately win (or lose) a given contest because of the team's recent success (or failure).

    Nonetheless, the results are consistent with the typical modeling of the outcome sequence as a

    Markov chain. As such, failure to show a relationship is probably more important than showing a

    relationship using logistic regression.

    Two important results can be gleaned from this study. First, the likelihood of persistent

    bias when using models that simply regress win streaks on future outcomes was demonstrated by

    an analysis of the variables, and verified by the regression results. As tempting as this type of

    investigation is, it is likely that any significant results will turn out to be artifacts of the

    model and variable specifications. Second, the approach demonstrated the validity of using the

    simple regression equation (2) when the full probability model can be expressed in the form

    given by equation (3). This implies that it is possible to test relationships independently of other

    underlying determinants of probability.

    Conclusion

    Conducting successful research requires that several skills be addressed in a logical and

    appropriate way. As Trochim (2001) put it, research involves "an eclectic blending of an

    enormous range of skills and activities" (p. 4). Although his statement is somewhat bombastic, it

    accurately conveys the impression that research is not simply about having an idea and checking

    to see if it might be correct. Leedy and Ormrod (2005) expanded on this point and argued that

    successful research involves the systematic process of collecting, analyzing, and interpreting

    information (p. 2). Good research requires not only a good question and adequate data, but also

    appropriate methodology.

    This paper has explored the range of research methodology available to researchers, and

    tried to demonstrate how the choice of a research question and the available data affect the

    methodology used. Not all methodologies are appropriate for every research question or type of

    data; researchers must select a methodology that fits. One focus of this paper has been to identify

    an appropriate research methodology for existing data that is restricted in the range of values that

    can be expressed. As was shown, data of this type can require specialized transcription and

    analysis. As part of this investigation, several examples of research were reviewed, most of

    which used logistic regression to analyze binary response data. As was shown, both the data and

    the method of analysis created challenges for researchers. The lessons learned from those

    examples resulted in a detailed application of logistic regression on the win and loss results for

    American football teams.


    References

    Albert, J. (2008). Streaky hitting in baseball. Journal of Quantitative Analysis in Sports, 4(1), 1-32.

    Aldrich, J. H., & Nelson, F. D. (1984). Linear probability, logit, and probit models. Sage Publications.

    Brown, M., & Sokol, J. (2010). An improved LRMC method for NCAA basketball prediction. Journal of Quantitative Analysis in Sports, 6(3), 1-21.

    Foster, W. M., & Washington, M. (2009). Organizational structure and home team performance. Team Performance Management, 15(3/4), 158-171.

    Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match results prediction in association football. International Journal of Forecasting, 26, 460-470.

    Kvam, P., & Sokol, J. (2006). A logistic regression Markov chain model for NCAA basketball. Naval Research Logistics, 53, 788-803.

    Leedy, P. D., & Ormrod, J. E. (2005). Practical research. Upper Saddle River: Pearson Education.

    Nemes, S., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(56), 1-5.

    Oppenheimer, D. M., & Monin, B. (2009). The retrospective gambler's fallacy: Unlikely events, constructing the past, and multiple universes. Judgment and Decision Making, 4(5), 326-334.

    Sire, C., & Redner, S. (2009). Understanding baseball team standings and streaks. European Physical Journal B, 473-481.

    Sports Reference. (2011). Pro-Football-Reference.Com. Retrieved March 19, 2011, from Sports Reference: http://www.pro-football-reference.com/

    Thomas, A. C. (2010). That's the second-biggest hitting streak I've ever seen! Verifying simulated historical extremes in baseball. Journal of Quantitative Analysis in Sports, 6(4), 1-34.

    Trochim, W. (2001). Research methods knowledge base. Cincinnati, OH: Atomic Dog Publishing.


    i. The variance of a random variable X is given by ; this evaluates to

    .

    The problem is that this term becomes negative at approximately ; variances cannot be

    negative.