
Anthology of Statistics in Sports


ASA-SIAM Series on Statistics and Applied Probability

The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.


Editorial Board

Martin T. Wells, Cornell University, Editor-in-Chief
David Banks, Duke University
H. T. Banks, North Carolina State University
Richard K. Burdick, Arizona State University
Douglas M. Hawkins, University of Minnesota
Joseph Gardiner, Michigan State University
Susan Holmes, Stanford University
Lisa LaVange, Inspire Pharmaceuticals, Inc.
Francoise Seillier-Moiseiwitsch, University of Maryland, Baltimore County
Mark van der Laan, University of California, Berkeley

Albert, J., Bennett, J., and Cochran, J. J., eds., Anthology of Statistics in Sports
Smith, W. F., Experimental Design for Formulation
Baglivo, J. A., Mathematica Laboratories for Mathematical Statistics: Emphasizing Simulation and Computer Intensive Methods
Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O'Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement


Anthology of Statistics in Sports

Edited by

Jim Albert
Bowling Green State University
Bowling Green, Ohio

Jay Bennett
Telcordia Technologies
Piscataway, New Jersey

James J. Cochran
Louisiana Tech University
Ruston, Louisiana

Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania
American Statistical Association, Alexandria, Virginia


The correct bibliographic citation for this book is as follows: Albert, Jim, Jay Bennett, and James J. Cochran, eds., Anthology of Statistics in Sports, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2005.

Copyright © 2005 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Anthology of statistics in sports / [compiled by] Jim Albert, Jay Bennett, James J. Cochran.
    p. cm. - (ASA-SIAM series on statistics and applied probability)
On cover: The ASA Section on Statistics in Sports.
Includes bibliographical references.
ISBN 0-89871-587-3 (pbk.)
1. Sports-United States-Statistics. 2. Sports-United States-Statistical methods. I. Albert, Jim, 1953- II. Bennett, Jay. III. Cochran, James J. IV. American Statistical Association. Section on Statistics in Sports. V. Series.

GV741.A694 2005
796'.021-dc22
2005042540

SIAM is a registered trademark.


Contents

Acknowledgments

1   Introduction
    Jim Albert, Jay Bennett, and James J. Cochran

2   The Use of Sports in Teaching Statistics
    Jim Albert and James J. Cochran

PART I: STATISTICS IN FOOTBALL

3   Introduction to the Football Articles
    Hal Stern

4   A Geometry Model for NFL Field Goal Kickers
    Scott M. Berry

5   A State-Space Model for National Football League Scores
    Mark E. Glickman and Hal S. Stern

6   Predictions for National Football League Games via Linear-Model Methodology
    David Harville

7   The Best NFL Field Goal Kickers: Are They Lucky or Good?
    Donald G. Morrison and Manohar U. Kalwani

8   On the Probability of Winning a Football Game
    Hal Stern

PART II: STATISTICS IN BASEBALL

9   Introduction to the Baseball Articles
    Jim Albert and James J. Cochran

10  Exploring Baseball Hitting Data: What About Those Breakdown Statistics?
    Jim Albert

11  Did Shoeless Joe Jackson Throw the 1919 World Series?
    Jay Bennett

12  Player Game Percentage
    Jay M. Bennett and John A. Flueck

13  Estimation with Selected Binomial Information or Do You Really Believe That Dave Winfield Is Batting .471?
    George Casella and Roger L. Berger

14  Baseball: Pitching No-Hitters
    Cliff Frohlich

15  Answering Questions About Baseball Using Statistics
    Bill James, Jim Albert, and Hal S. Stern

16  The Progress of the Score During a Baseball Game
    G. R. Lindsey

PART III: STATISTICS IN BASKETBALL

17  Introduction to the Basketball Articles
    Robert L. Wardrop

18  Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information
    Bradley P. Carlin

19  It's Okay to Believe in the "Hot Hand"
    Patrick D. Larkey, Richard A. Smith, and Joseph B. Kadane

20  More Probability Models for the NCAA Regional Basketball Tournaments
    Neil C. Schwertman, Kathryn L. Schenk, and Brett C. Holbrook

21  The Cold Facts About the "Hot Hand" in Basketball
    Amos Tversky and Thomas Gilovich

22  Simpson's Paradox and the Hot Hand in Basketball
    Robert L. Wardrop

PART IV: STATISTICS IN ICE HOCKEY

23  Introduction to the Ice Hockey Articles
    Robin H. Lock

24  Statistical Methods for Rating College Hockey Teams
    Timothy J. Danehy and Robin H. Lock

25  Overtime or Shootout: Deciding Ties in Hockey
    William Hurley

26  It Takes a Hot Goalie to Raise the Stanley Cup
    Donald G. Morrison and David C. Schmittlein

PART V: STATISTICAL METHODOLOGIES AND MULTIPLE SPORTS

27  Introduction to the Methodologies and Multiple Sports Articles
    Scott Berry

28  Bridging Different Eras in Sports
    Scott M. Berry, C. Shane Reese, and Patrick D. Larkey

29  Data Analysis Using Stein's Estimator and Its Generalizations
    Bradley Efron and Carl Morris

30  Assigning Probabilities to the Outcomes of Multi-Entry Competitions
    David A. Harville

31  Basketball, Baseball, and the Null Hypothesis
    Robert Hooke

32  Lessons from Sports Statistics
    Frederick Mosteller

33  Can TQM Improve Athletic Performance?
    Harry V. Roberts

34  A Brownian Motion Model for the Progress of Sports Scores
    Hal S. Stern

PART VI: STATISTICS IN MISCELLANEOUS SPORTS

35  Introduction to the Miscellaneous Sports Articles
    Donald Guthrie

36  Shooting Darts
    Hal Stern and Wade Wilcox

37  Drive for Show and Putt for Dough
    Scott M. Berry

38  Adjusting Golf Handicaps for the Difficulty of the Course
    Francis Scheid and Lyle Calvin

39  Rating Skating
    Gilbert W. Bassett, Jr. and Joseph Persky

40  Modeling Scores in the Premier League: Is Manchester United Really the Best?
    Alan J. Lee

41  Down to Ten: Estimating the Effect of a Red Card in Soccer
    G. Ridder, J. S. Cramer, and P. Hopstaken

42  Heavy Defeats in Tennis: Psychological Momentum or Random Effect?
    David Jackson and Krzysztof Mosurski

43  Who Is the Fastest Man in the World?
    Robert Tibshirani

44  Resizing Triathlons for Fairness
    Howard Wainer and Richard D. De Veaux


Acknowledgments

Original Sources of Contributed Articles

Chapter 4 originally appeared in Chance, vol. 12, no. 3, 1999, pp. 51-56.
Chapter 5 originally appeared in Journal of the American Statistical Association, vol. 93, no. 441, 1998, pp. 25-35.
Chapter 6 originally appeared in Journal of the American Statistical Association, vol. 75, no. 371, 1980, pp. 516-524.
Chapter 7 originally appeared in Chance, vol. 6, no. 3, 1993, pp. 30-37.
Chapter 8 originally appeared in The American Statistician, vol. 45, no. 3, 1991, pp. 179-183.
Chapter 10 originally appeared in Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 1066-1074.
Chapter 11 originally appeared in The American Statistician, vol. 47, no. 4, 1993, pp. 241-250.
Chapter 12 originally appeared in American Statistical Association Proceedings of the Section on Statistics in Sports, 1992, pp. 64-66.
Chapter 13 originally appeared in Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 1080-1090.
Chapter 14 originally appeared in Chance, vol. 7, no. 3, 1994, pp. 24-30.
Chapter 15 originally appeared in Chance, vol. 6, no. 2, 1993, pp. 17-22, 30.
Chapter 16 originally appeared in American Statistical Association Journal, September 1961, pp. 703-728.
Chapter 18 originally appeared in The American Statistician, vol. 50, no. 1, 1996, pp. 39-43.
Chapter 19 originally appeared in Chance, vol. 2, no. 4, 1989, pp. 22-30.
Chapter 20 originally appeared in The American Statistician, vol. 50, no. 1, 1996, pp. 34-38.
Chapter 21 originally appeared in Chance, vol. 2, no. 1, 1989, pp. 16-21.
Chapter 22 originally appeared in The American Statistician, vol. 49, no. 1, 1995, pp. 24-28.
Chapter 24 originally appeared in American Statistical Association Proceedings of the Section on Statistics in Sports, 1993, pp. 4-9.
Chapter 25 originally appeared in Chance, vol. 8, no. 1, 1995, pp. 19-22.
Chapter 26 originally appeared in Chance, vol. 11, no. 1, 1998, pp. 3-7.
Chapter 28 originally appeared in Journal of the American Statistical Association, vol. 94, no. 447, 1999, pp. 661-676.
Chapter 29 originally appeared in Journal of the American Statistical Association, vol. 70, no. 350, 1975, pp. 311-319.
Chapter 30 originally appeared in Journal of the American Statistical Association, vol. 68, no. 342, 1973, pp. 312-316.
Chapter 31 originally appeared in Chance, vol. 2, no. 4, 1989, pp. 35-37.
Chapter 32 originally appeared in The American Statistician, vol. 51, no. 4, 1997, pp. 305-310.
Chapter 33 originally appeared in Chance, vol. 6, no. 3, 1993, pp. 25-29, 69.
Chapter 34 originally appeared in Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 1128-1134.
Chapter 36 originally appeared in Chance, vol. 10, no. 3, 1997, pp. 16-19.
Chapter 37 originally appeared in Chance, vol. 12, no. 4, 1999, pp. 50-55.
Chapter 38 originally appeared in American Statistical Association Proceedings of the Section on Statistics in Sports, 1995, pp. 1-5.
Chapter 39 originally appeared in Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 1075-1079.
Chapter 40 originally appeared in Chance, vol. 10, no. 1, 1997, pp. 15-19.
Chapter 41 originally appeared in Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 1124-1127.
Chapter 42 originally appeared in Chance, vol. 10, no. 2, 1997, pp. 27-34.
Chapter 43 originally appeared in The American Statistician, vol. 51, no. 2, 1997, pp. 106-111.
Chapter 44 originally appeared in Chance, vol. 7, no. 1, 1994, pp. 20-25.

Figure Permissions

The artwork on page 111 in Chapter 15 is used with permission of the artist, John Gampert.
The photograph on page 170 in Chapter 21 is used with permission of the photographer, Marcy Dubroff.
The photographs on pages 197 and 198 in Chapter 26 are used with permission of Getty Images.
The photograph on page 310 in Chapter 42 is used with permission of Time, Inc.


Chapter 1

Introduction

Jim Albert, Jay Bennett, and James J. Cochran

1.1 The ASA Section on Statistics in Sports (SIS)

The 1992 Joint Statistical Meetings (JSM) saw the creation of a new section of the American Statistical Association (ASA). Joining the host of traditional areas of statistics such as Biometrics, Survey Research Methods, and Physical and Engineering Sciences was the Section on Statistics in Sports (SIS). As stated in its charter, the section is dedicated to promoting high professional standards in the application of statistics to sports and fostering statistical education in sports both within and outside the ASA.

Statisticians worked on sports statistics long before the founding of SIS. Not surprisingly, some of the earliest sports statistics pieces in the Journal of the American Statistical Association (JASA) were about baseball and appeared in the 1950s. One of the first papers was Frederick Mosteller's 1952 analysis of the World Series (JASA, 47 (1952), pp. 355-380). Through the years, Mosteller has continued his statistical research in baseball as well as other sports. Fittingly, his 1997 paper "Lessons from Sports Statistics" (The American Statistician, 51-4 (1997), pp. 305-310) is included in this volume (Chapter 32) and provides a spirited example of the curiosity and imagination behind all of the works in this volume.

Just as the nation's sporting interests broadened beyond baseball in the 1960s, 1970s, and 1980s, so did the topics of research in sports statistics. Football, basketball, golf, tennis, ice hockey, and track and field were now being addressed (sometimes very passionately). Perhaps no question has been so fiercely debated as the existence of the "hot hand" in basketball. The basketball section (Part III) of this volume provides some examples of this debate, which remains entertainingly unresolved.

Until the creation of SIS, research on sports statistics had to be presented and published in areas to which it was only tangentially related—one 1984 paper published in this volume (Chapter 12) was presented at the JSM under the auspices of the Social Statistics Section since it had no other home. The continued fervent interest in sports statistics finally led to the creation of SIS in the 1990s. Since its creation, the section has provided a forum for the presentation of research at the JSM. This in turn has created an explosion of sports statistics papers in ASA publications as well as an annual volume of proceedings from the JSM (published by the section). The September 1994 issue of JASA devoted a section exclusively to sports statistics. The American Statistician typically has a sports statistics paper in each issue. Chance often has more than one article plus the regular column "A Statistician Reads the Sports Pages."

What lies in the future? Research to date has been heavily weighted in the areas of competition (rating players/teams and evaluating strategies for victory). This differs greatly from the research being performed in other parts of the world. Papers in publications of the International Sports Statistics Committee of the International Statistical Institute (ISI) have emphasized analysis of participation and popularity of sports. This is certainly one frontier of sports statistics that North American statisticians should explore in future work. Given the ongoing debate about the social and economic values of professional sports franchises, research in this area may become more common as well.


1.2 The Sports Anthology Project

Given this rich history of research, one goal of SIS since its inception has been to produce a collection of papers that best represents the research of statisticians in sports. In 2000, SIS formed a committee to review sports statistics papers from ASA publications. The committee was charged with producing a volume of reasonable length with a broad representation of sports, statistical concepts, and authorship. While some emphasis was placed on including recent research, the committee also wished to include older seminal works that have laid the foundation for current research.

This volume's basic organization is by sport. Each major spectator sport (football, baseball, basketball, and ice hockey) has its own section. Sports with less representation in the ASA literature (such as golf, soccer, and track and field) have been collected into a separate section. Another, separate, section presents research that has greater breadth and generality, with each paper addressing several sports. Each section has been organized (with an introduction) by a notable contributor to that area.

1.3 Organization of the Papers

Sport provides a natural organization of the papers, and many readers will start reading papers on sports that they are particularly interested in as a participant or a fan. However, there are several alternative ways of grouping the papers that may be useful particularly for the statistics instructor.

1.4 Organization by Technical Level

All of these articles have been published in the following journals of the American Statistical Association (ASA): Chance, Journal of the American Statistical Association (JASA), The American Statistician (TAS), and the Proceedings of the Statistics in Sports Section of the American Statistical Association. A graph of the distribution of the papers in the four journals is displayed in Figure 1.1.

Figure 1.1. Distribution of papers in four journals.

The Chance and TAS articles in this volume are written at a relatively modest technical level and are accessible for both beginner and advanced students of statistics. Chance is a magazine, jointly published by ASA and Springer-Verlag, about statistics and the use of statistics in society. Chance features articles that showcase the use of statistical methods and ideas in the social, biological, physical, and medical sciences. One special feature of Chance is the regular column "A Statistician Reads the Sports Pages" and some of the articles in this volume are taken from this column. TAS is the "general interest" journal of the ASA and publishes articles on statistical practice, the teaching of statistics, the history of statistics, and articles of a general nature. JASA is the main research journal of the ASA. Although the technical level of JASA articles is typically higher than that of Chance and TAS articles, JASA contains an "Application and Case Studies" section that features interesting applications of statistical methodology. As mentioned earlier, JASA featured in 1994 a special collection of papers on the application of statistics to sports. The SIS Proceedings contains papers that were presented at the annual JSM under the sponsorship of SIS. The three proceedings papers appearing here in Chapters 12, 24, and 38 are also very accessible for a broad range of statisticians with different levels of training. As a general rule of thumb for the purposes of this collection, the spectrum of technical sophistication from high to low for these journals is JASA, TAS, Chance, and SIS Proceedings.

1.5 Organization by Inferential Method

An alternative organization scheme is by the inferential method. Some of the papers deal with particular inferential methods that are relevant to answering statistical questions related to sports. Indeed, one could actually create an undergraduate or graduate course in statistics devoted to particular inferential topics that are important in the context of sports. Below we describe the methods and list the corresponding papers that are included in this anthology.

1.5.1 Prediction of Sports Events

(Papers by Glickman and Stern (Chapter 5), Harville (Chapter 6), Stern (Chapter 8), Carlin (Chapter 18), and Schwertman et al. (Chapter 20).)

A general statistical problem is to predict the winner of a particular sports game. This is obviously an important problem for the gambling industry, as millions of dollars are bet annually on the outcomes of sports events. An important problem is the development of good prediction methods. The articles in this volume focus on predicting the results of professional American football and college basketball games.

1.5.2 Hot Hand Phenomena

(Papers by Larkey et al. (Chapter 19), Tversky and Gilovich (Chapter 21), Wardrop (Chapter 22), Hooke (Chapter 31), Morrison and Schmittlein (Chapter 26), and Jackson and Mosurski (Chapter 42).)

Many people believe that the abilities of athletes and teams can go through short periods of highs and lows. They propose that these hot hand/cold hand phenomena account for streaky performance in sports. Psychologists generally believe that people misinterpret the streaky patterns in coin-tossing and sports data and think that players and teams are streaky when, really, they aren't. A general statistical problem is to detect true streakiness from sports data. The articles in this volume discuss the "hot hand" in basketball, baseball, hockey, and tennis.

1.5.3 Probability of Victory

(Papers by Bennett and Flueck (Chapter 12), Bennett (Chapter 11), Lindsey (Chapter 16), and Stern (Chapter 34).)

These papers address an interesting problem for sports fans. Suppose you are watching a game in progress and your team is leading by a particular margin. What is the probability that your team will win? This general question is addressed by these authors for different sports. Obviously the answer to this question has an impact on the fan's enjoyment of the game and also on the team's strategy in winning the game.

1.5.4 Learning from Selected Data

(Papers by Albert (Chapter 10) and Casella and Berger (Chapter 13).)

Often sports announcers will report "selected data." Specifically, they might talk about how a player or team performs in a given situation or report on an unusually high or low player or team performance in a short time period. There is an inherent bias in this information, since it has been reported because of its "interesting" nature. What has the fan actually learned about the ability of the player or team from this information? These two articles discuss this problem within the context of baseball, although it applies to many other sports.

1.5.5 Rating Players or Teams

(Papers by Danehy and Lock (Chapter 24), Bassett and Persky (Chapter 39), Morrison and Kalwani (Chapter 7), and Berry (Chapters 4 and 37).)

An interesting statistical problem is how to rate or rank players or teams based on their performances. In some sports, such as American college football, it is not possible to have a complete round-robin competition, thus presenting a challenge in ranking teams based on these incomplete data. This is not a purely academic exercise, as various ranking systems are used to select contenders for national championships (as in American college football) or to seed tournament positions. There are also interesting statistical issues in evaluating sports performers. How does one effectively compare two sports players when there are many measurements of a player's ability? How does one compare players from different eras? These papers illustrate statistical issues in ranking teams (college hockey teams) and individuals (professional football kickers, skaters, golfers, and hockey and baseball players).

1.5.6 Unusual Outcome

(Paper by Frohlich (Chapter 14).)

The sports media pays much attention to "unusual" outcomes. For example, in baseball, we are surprised by a no-hitter or if a batter gets a hit in a large number of consecutive games. From a statistical viewpoint, should we be surprised by these events, or are these events to be expected using standard probability models?
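To make the flavor of such a question concrete, here is a minimal back-of-the-envelope sketch (ours, not Frohlich's analysis): treat each opposing batter as an independent trial who gets a hit with probability equal to a typical batting average, and ask how rare 27 consecutive hitless batters would then be.

# Naive no-hitter calculation (illustrative only; the .250 batting average
# and the independence assumption are ours, not Frohlich's).
p_hit = 0.250                     # assumed chance that any one batter gets a hit
p_no_hitter = (1 - p_hit) ** 27   # 27 consecutive hitless batters (walks and errors ignored)

print(f"P(no-hitter in a game) = {p_no_hitter:.5f}")   # about 0.0004
print(f"roughly 1 game in {1 / p_no_hitter:,.0f}")     # roughly 1 in 2,400

Frohlich's chapter examines the question far more carefully, taking account of factors that this sketch ignores.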

1.5.7 The Rules of Sports

(Papers by Hurley (Chapter 25), Scheid and Calvin (Chapter 38), and Wainer and De Veaux (Chapter 44).)

Statistical thinking can be useful in deciding on rules in particular sports. For example, statistics can help determine a reasonable method of breaking a tie game in soccer or a good way of setting a golf player's handicap.

1.5.8 Game Strategy

(Paper by Harville (Chapter 30).)

Statistics can be useful in determining proper strategies in sports. In baseball, strategy is often based on "The Book," that is, a collection of beliefs based on anecdotal evidence absorbed by players and managers throughout baseball history. Through a careful analysis of baseball data, one can investigate the wisdom of particular strategies such as the sacrifice bunt, the intentional walk, and stealing of bases. It is not surprising that managers' intuition about the correct strategy is often inconsistent with the results of a careful statistical analysis.

1.5.9 Illustrating Statistical Methods

(Papers by Efron and Morris (Chapter 29) and Roberts (Chapter 33).)

Sports data can provide an accessible and interesting way of introducing new statistical methodology. The paper by Efron and Morris illustrates the advantages of shrinkage estimators using baseball batting averages as an example. The paper by Roberts illustrates the use of techniques in Total Quality Management in improving a student's ability in golf (specifically, putting) and in the game of pool.

1.5.10 Modeling Team Competition

(Papers by Lee (Chapter 40) and James, Albert, and Stern (Chapter 15).)

After a season of team sports competition has been played, fans assume that the winner of the championship game or series is truly the best team. But is this a reasonable conclusion? What have we really learned about the abilities of the teams based on these observed competition results? These two articles explore these issues for the sports of soccer and baseball.

1.5.11 The Use of These Articles in Teaching Statistics

Sports examples provide an attractive way of introducing statistical ideas, at both elementary and advanced levels. Many people are involved in sports, either as a participant or a fan, and so are naturally interested in statistical problems that are framed within the context of sports. Chapter 2 in this anthology gives an overview of the use of sports examples in teaching statistics. It describes statistics courses that have been offered with a sports theme and discusses a number of other articles that illustrate statistical analyses on interesting sports datasets. From this teaching perspective, the reader will see that many of the research papers in this volume are very appropriate for use in the classroom.

1.6 The Common Theme

Ignoring the differences in sport topics and statistical techniques applied, we can see that the papers presented in this volume have one noteworthy quality in common: All research was performed for the sheer joy of it. These statisticians did not perform this research to fulfill a contract, advance professionally, or promote their employers. They initiated this research out of an inner dissatisfaction with standard presentations of sports statistics in various media and a personal need to answer questions that were either being addressed inadequately or not at all. We hope that you enjoy reading their work as much as they enjoyed producing it.


Chapter 2

The Use of Sports in Teaching Statistics

Jim Albert and James J. Cochran

2.1 Motivation to Use Sports Examples in Teaching Statistics

Teaching introductory statistics is a challenging endeavor because the students have little prior knowledge about the discipline of statistics and many of them are anxious about mathematics and computation. Many students come to the introductory statistics course with a general misunderstanding of the science of statistics. These students regard "statistics" as a collection of numbers and they believe that the class will consist of a series of computations on these numbers. Indeed, although statistics texts will discuss the wide application of statistical methodology, many students will believe that they will succeed in the class if they are able to recall and implement a series of statistical recipes.

Statistical concepts and examples are usually presented in a particular context. However, one obstacle in teaching this introductory class is that we often describe the statistical concepts in a context (such as medicine, law, or agriculture) that is completely foreign to the student. The student is much more likely to understand concepts in probability and statistics if they are described in a familiar context. Many students are familiar with sports either as a participant or a spectator. They know of popular athletes, such as Tiger Woods and Barry Bonds, and they are generally knowledgeable about the rules of major sports, such as baseball, football, and basketball. To many students, sports is a familiar context in which an instructor can describe statistical thinking. Students often know and have some intuition about the issues in a particular sport. Since they are knowledgeable in sports, they can see the value of statistics in gaining additional insight on particular issues.

Statistical methodology is useful in addressing problems in many disciplines. The use of statistical thinking in the context of baseball is called sabermetrics. James (1982) defines sabermetrics as follows (note that his comments apply equally if the word "sabermetrics" is replaced with "statistics"):

"Sabermetrics does not begin with the numbers.It begins with issues. The numbers, the statisticsare not the subject of the discussion... The sub-ject is baseball. The numbers bear a relationshipto that subject and to us which is much like therelationship of tools to a machine and to the me-chanic who uses them. The mechanic does notbegin with a monkey wrench; basically, he is noteven interested in the damn monkey wrench. Allthat he wants from the monkey wrench is that itdo its job and not give him any trouble. He be-gins with the machine, with the things which hesees and hears there, and from those he forms anidea—a thesis—about what must be happeningin that machine. The tools are a way of takingthe thing apart so he can see if he was right or ifhe needs to try something else."

Sports provides a familiar setting where instructors can discuss questions or issues related to sports and show how statistical thinking is useful in answering these questions. Using James' analogy, we can use sports examples to emphasize that the main subject of the discussion is sports and the statistical methodology is a tool that contributes to our understanding of sports.


2.2 Mosteller's Work in Statistics and Sports

Frederick Mosteller was one of the first statisticians to extensively work on statistical problems relating to sports. Mosteller was also considered one of the earlier innovators in the teaching of statistics; his "one-minute survey" is currently used by many instructors to get quick feedback from the students on the "muddiest point in the lecture." The article by Mosteller (1997), reprinted in this volume as Chapter 32, summarizes his work on problems in baseball, football, basketball, and hockey. He describes several lessons he learned from working on sports, many of which are relevant to the teaching of statistics. First, if many reviewers (you can substitute the word "students" for "reviewers") are knowledgeable about the materials and interested in the findings, they will drive the author crazy with the volume, perceptiveness, and relevance of their suggestions. Mosteller's first lesson is relevant to classroom teaching: teaching in the sports context will encourage class discussion. A second lesson is that any inferential method in statistics can likely be applied to a sports example. Mosteller believes that statisticians can learn from newswriters, who are able to communicate with the general public. Perhaps statisticians (or perhaps statistics instructors or writers of statistics texts) could learn from newswriters on how to communicate technical content to the general public. Given the large number of sports enthusiasts in the United States, Mosteller believes that sports provides an opportunity for statisticians to talk to young people about statistics. In a paper on broadening the scope of statistical education, Mosteller (1988) talks about the need for books aimed at the general public that communicate statistical thinking in an accessible way. Albert and Bennett (2003) and Ross (2004) are illustrations of books written with a sports theme that are intended to teach ideas of probability and statistics to the general public.

2.3 The Use of Interesting Sports Datasets in Teaching

The use of sports data in teaching has several advantages. First, sports data are easily obtained from almanacs and the Internet. For example, there are a number of websites, such as the official Major League Baseball site (www.mlb.com) and the Baseball Reference site (www.baseball-reference.com), that give detailed data on baseball players and teams over the last 100 seasons of Major League Baseball competition. There are similar sites that provide data for other sports. Extensive statistics for professional football, basketball, hockey, golf, and soccer can be found at www.nfl.com, www.nba.com, www.nhl.com, www.pga.com, and www.mlsnet.com, respectively.

The second advantage of using sports in teaching statistics is that there are a number of interesting questions regarding sports players and teams that can be used to motivate statistical analyses, such as: When does an athlete peak in his/her career? What is the role of chance in determining the outcomes of sports events? Who is the best player in a given sport? How can one best rank college football teams who play a limited schedule? Are sports participants truly streaky?

The references to this chapter contain a number of articles in the Journal of Statistics Education (JSE), The American Statistician (TAS), the Proceedings of the Section on Statistics in Sports, and the Proceedings of the Section on Statistical Education that illustrate statistical analyses on interesting sports datasets. In addition, the regular column "A Statistician Reads the Sports Pages" (written by Hal Stern and Scott Berry) in Chance and the column "The Statistical Sports Fan" (written by Robin Lock) in STATS (the magazine for students of statistics) contain accessible statistical analyses on questions dealing with sports. (Some of these columns are reprinted in this volume.) The JSE articles are specifically intended to equip the instructors of statistics with interesting sports examples and associated questions that can be used for presentation or homework. Following are five representative "sports stories" from JSE that can be used to motivate statistical analyses. We conclude this section with a reference to an extensive dataset available at the Chance Course website (www.dartmouth.edu/~chance) to detect streakiness in sports performance.

2.3.1 Exploring Attendance at Major League Baseball Games (Cochran (2003))

Economic issues in major league sports can be addressed using the data provided here. This dataset contains measurements on team performance (wins and losses, runs scored and allowed, games behind division winner, rank finish within division), league and division affiliation, and total home game attendance for every Major League Baseball team for the 1969-2000 seasons. The author explains how he uses these data to teach regression and econometrics by assigning students the task of modeling the relationship between team performance and annual home attendance.


2.3.2 How Do You Measure the Strength of a Professional Football Team? (Watnik and Levine (2001))

This dataset gives a large number of performance variables for National Football League (NFL) teams for the 2000 regular season. The authors use principal components methodology to identify components of a football team's strength. The first principal component is identified with a team's offensive capability and the second component with a team's defensive capability. These components were then related to the number of wins and losses for the teams. A number of questions can be answered with these data, including: Which conference (NFC or AFC) tends to score more touchdowns? What variables are most helpful in predicting a team's success? How important are the special teams, such as the kicking and punting teams, in a team's success?

2.3.3 Who Will Be Elected to the Major League Baseball Hall of Fame? (Cochran (2000))

Election to the Hall of Fame is Major League Baseball's greatest honor, and its elections are often controversial and contentious. The author explains how he uses the data provided to teach descriptive statistics, classification, and discrimination by assigning students the task of using the various career statistics to model whether the players have been elected to the Hall of Fame. Career totals for standard baseball statistics (number of seasons played, games played, official at-bats, runs scored, hits, doubles, triples, home runs, runs batted in, walks, strikeouts, batting average, on base percentage, slugging percentage, stolen bases and times caught stealing, fielding average, and primary position played) are provided for each position player eligible for the Major League Baseball Hall of Fame as of 2000. In addition, various sabermetric composite measures (adjusted production, batting runs, adjusted batting runs, runs created, stolen base runs, fielding runs, and total player rating) are provided for these players. Finally, an indication is made of whether or not each player has been admitted into the Major League Baseball Hall of Fame and, if so, under what set of rules he was admitted.

2.3.4 Hitting Home Runs in the Summer of 1998 (Simonoff (1998))

The 1998 baseball season was one of the most exciting in history due to the competition between Mark McGwire and Sammy Sosa to break the season home run record established by Roger Maris in 1961. This article contains the game-to-game home run performance for both McGwire and Sosa in this season and gives an overview of different types of exploratory analyses that can be done with these data. In particular, the article explores the pattern of home run hitting of both players during the season; sees if McGwire's home runs helped his team win games during this season; and sees if the home run hitting varied between games played at home and games played away from home. In a closing section, the author outlines many other questions that can be addressed with these data.

2.3.5 Betting on Football Games (Lock (1997))

This dataset contains information for all regular season and playoff games played by professional football teams over a span of five years. This dataset provides an opportunity for the instructor to introduce casino bets on sports outcomes presented by means of point spreads and over/under values. Also, the dataset can be used to address a number of questions of interest to sports-minded students, such as: What are typical professional game scores and margins of victory? Is there a home-field advantage in football and how can we measure it? Is there a correlation between the scores of the home and away teams? Is there a relationship between the point spread and the actual game results? Can one develop useful rules for predicting the outcomes of football games?
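As a small illustration of how such questions can be attacked computationally, the sketch below computes an average home margin and a spread-cover count; the game records are invented for illustration and do not reflect the layout of Lock's actual dataset.

# Hypothetical game records: (home score, away score, point spread),
# where the spread is the number of points by which the home team is favored.
games = [
    (27, 20, 6.5),
    (14, 31, -3.0),
    (24, 21, 2.5),
    (10, 13, 1.0),
]

# Average margin of victory for the home team (a crude measure of home-field advantage).
margins = [home - away for home, away, _ in games]
home_edge = sum(margins) / len(margins)

# How often did the home team beat the spread?
covers = sum(1 for home, away, spread in games if (home - away) > spread)

print(f"average home margin: {home_edge:+.1f} points")
print(f"home team covered the spread in {covers} of {len(games)} games")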

2.3.6 Data on Streaks (from Chance Datasets, http://www.dartmouth.edu/~chance/teaching-aids/data.html)

Albright (1993) performed an extensive analysis to detect the existence of streakiness in the sequences of at-bats for all players in Major League Baseball during a season. The dataset available on the Chance datasets site provides 26 bits of information on the situation and outcome for each time at bat for a large number of players in both the American and National Leagues during the time period 1987-1990. Albright (1993) and Albert (1993) give illustrations of the types of exploratory and confirmatory analyses that can be performed to detect streakiness in player performances.
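The sketch below gives a rough sense of what a confirmatory streakiness check can look like. It is a simplification of the kinds of analyses in Albright (1993) and Albert (1993); the simulated at-bat sequence and the choice of test statistic (the longest hitless run) are ours.

import random

def longest_run(seq, value=0):
    """Length of the longest consecutive run of `value` in seq."""
    best = current = 0
    for x in seq:
        current = current + 1 if x == value else 0
        best = max(best, current)
    return best

# Hypothetical season of at-bats: 1 = hit, 0 = no hit.
random.seed(1)
at_bats = [1 if random.random() < 0.28 else 0 for _ in range(500)]

observed = longest_run(at_bats)
p_hat = sum(at_bats) / len(at_bats)

# Reference distribution under a "no streakiness" model: independent at-bats
# with a constant hit probability equal to the player's observed rate.
simulated = [longest_run([1 if random.random() < p_hat else 0
                          for _ in range(len(at_bats))])
             for _ in range(2000)]
p_value = sum(s >= observed for s in simulated) / len(simulated)

print(f"longest hitless run: {observed}, approximate p-value: {p_value:.3f}")

A small p-value would indicate a longest slump that is hard to reconcile with the constant-probability model; with real data, many other test statistics and models are worth examining.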


2.4 Special Probability and Statistics Classes Focused on Sports

A number of authors have created special undergraduate classes with a focus on the application of statistics to sports. One of these classes is designed to provide a complete introduction to the basic topics in data analysis and probability within the context of sports. Other courses are examples of seminar-type classes that use statistics from sports articles and books to motivate critical discussions of particular issues in sports.

2.4.1 An Introductory Statistics Course Using Baseball

Albert (2002, 2003) discusses a special section of an introductory statistics course where all of the course material is taught from a baseball perspective. This course was taken by students fulfilling a mathematics elective requirement. The topics of this class were typical of many introductory classes: data analysis for one and two variables, elementary probability, and an introduction to inference. This class was distinctive in that all of the material was taught within the context of baseball. Data analysis was discussed using baseball data on players and teams, both current and historical. Probability was introduced by the description of several tabletop baseball games and an all-star game was simulated using spinners constructed from players' hitting statistics. The distinction between a player's performance and his ability was used to introduce statistical inference, and the significance of situational hitting statistics and streaky data was discussed.

2.4.2 Using Baseball Board Games to Teach Probability Concepts

Cochran (2001) uses baseball simulation board games to introduce a number of concepts in probability. Comparing playing cards for two players from the Strat-O-Matic baseball game, he explains different approaches to assigning probabilities to events, various discrete probability distributions, transformations and sample spaces, the laws of multiplication and addition, conditional probability, randomization and independence, and Bayes' Theorem. Cochran finds that this twenty-minute demonstration dramatically improves the understanding of basic probability achieved by most students (even those who don't like or know anything about baseball).

2.4.3 A Sabermetrics Course

Costa and Huber (2003) describe a one-credit-hour course, developed and taught by Costa at Seton Hall University in 1998, that focused on sabermetrics (the science of studying baseball records). This course has evolved into a full three-credit-hour, baseball-oriented introductory statistics course similar to the course offered by Albert (2002, 2003) that is taught at both Seton Hall University and the U.S. Air Force Academy. This course also incorporates field trips to various baseball sites (such as the National Baseball Hall of Fame and Museum in Cooperstown, NY) and is team-taught.

2.4.4 A Course on Statistics and Sports

Reiter (2001) taught a special course on statistics and sports at Williams College. This course was offered during a three-week winter term period, where students were encouraged to enroll in a class outside of their major. The students had strong interests in sports but had varying backgrounds in statistics. Each student made two oral presentations in this class. In the first presentation, the student gave some background concerning the statistics used in a sports article of his/her choosing. For the second presentation, the student presented research from a second article or described his or her own data analysis. The seminar helped the students learn the concept of random variability and the dependence of the statistical methodology on assumptions such as independence and normality.

2.4.5 A Freshmen Seminar: Mathematics and Sports

Gallian (2001) taught a liberal arts freshmen seminar at the University of Minnesota at Duluth that utilized statistical concepts to analyze achievements and strategies in sports. The intent of the freshmen seminar is to provide an atypical course by a senior professor where all of the students actively participate. The students, working primarily in groups of two or three, give written and oral reports from a list of articles that apply statistical methodology to sports. Students are encouraged to ask questions of the speakers and participate in class discussion. Biographical and historical information about athletes and their special accomplishments (such as Joe DiMaggio's 56-game hitting streak) is provided via videotapes.


2.4.6 A Freshmen Seminar: Statistics and Mathematics of Baseball

Ken Ross taught a similar freshmen seminar at the University of Oregon on the statistics and mathematics of baseball. The goal of this seminar is to provide an interesting course with much student interaction and critical thinking. Most of the students were very knowledgeable about baseball. The books by Hoban (2000) and Schell (1999) were used for this seminar. Hoban (2000) seemed appropriate for this class since it is relatively easy to read and its conclusions are controversial, which provoked student discussion. Schell (1999) was harder to read for these students since it is more sophisticated statistically, and thus the instructor supplemented this book with additional discussion. A range of statistical topics was discussed in this class, including Simpson's paradox, correlation, and chi-square tests. Ross recently completed a book (Ross, 2004) that was an outgrowth of material from this freshmen seminar.

2.5 Summary

Sports examples have been used successfully by many statistics instructors and now comprise the entire basis of many introductory statistics courses. Furthermore, entire introductory statistics textbooks and "popular-style" books that exclusively use sports examples have recently been published. (Examples of these popular-style books are Albert and Bennett (2003), Haigh (2000), Ross (2004), Skiena (2001), and Watts and Bahill (2000).) Statistics instructors who use sports examples extensively have found a corresponding increase in comprehension and retention of concepts that students usually find to be abstract and obtuse. Instructors have found that most students, even those who are not knowledgeable about sports, enjoy and appreciate sports examples in statistics courses. Due to the popularity of sports in the general public and the easy availability of useful sports datasets, we anticipate that sports will continue to be a popular medium for communicating concepts of probability and statistics.

References

Albert, J. (1993), Discussion of "A statistical analysis of hitting streaks in baseball" by S. C. Albright, Journal of the American Statistical Association, 88, 1184-1188.

Albert, J. (2002), "A baseball statistics course," Journal of Statistics Education, 10, 2.

Albert, J. (2003), Teaching Statistics Using Baseball, Washington, DC: Mathematical Association of America.

Albert, J. and Bennett, J. (2003), Curve Ball, New York: Copernicus Books.

Albright, S. C. (1993), "A statistical analysis of hitting streaks in baseball," Journal of the American Statistical Association, 88, 1175-1183.

Bassett, G. W. and Hurley, W. J. (1998), "The effects of alternative HOME-AWAY sequences in a best-of-seven playoff series," The American Statistician, 52, 51-53.

Cochran, J. (2000), "Career records for all modern position players eligible for the Major League Baseball Hall of Fame," Journal of Statistics Education, 8, 2.

Cochran, J. (2001), "Using Strat-O-Matic baseball to teach basic probability for statistics," ASA Proceedings of the Section on Statistics in Sports.

Cochran, J. (2003), "Data management, exploratory data analysis, and regression analysis with 1969-2000 Major League Baseball attendance," Journal of Statistics Education, 10, 2.

Costa, G. and Huber, M. (2003), Whaddya Mean? You Get Credit for Studying Baseball? Technical report.

Gallian, J. (2001), "Statistics and sports: A freshman seminar," ASA Proceedings of the Section on Statistics in Sports.

Gould, S. J. (2003), Triumph and Tragedy in Mudville: A Lifelong Passion for Baseball, New York: W. W. Norton.

Haigh, J. (2000), Taking Chances, Oxford, UK: Oxford University Press.

Hoban, M. (2000), Baseball's Complete Players, Jefferson, NC: McFarland.

James, B. (1982), The Bill James Baseball Abstract, New York: Ballantine Books.

Lackritz, James R. (1981), "The use of sports data in the teaching of statistics," ASA Proceedings of the Section on Statistical Education, 5-7.

Lock, R. (1997), "NFL scores and pointspreads," Journal of Statistics Education, 5, 3.

McKenzie, John D., Jr. (1996), "Teaching applied statistics courses with a sports theme," ASA Proceedings of the Section on Statistics in Sports, 9-15.

Morris, Pamela (1984), "A course work project for examination," Teaching Statistics, 6, 42-47.

Mosteller, F. (1988), "Broadening the scope of statistics and statistical education," The American Statistician, 42, 93-99.

Mosteller, F. (1997), "Lessons from sports statistics," The American Statistician, 51, 305-310.

Nettleton, D. (1998), "Investigating home court advantage," Journal of Statistics Education, 6, 2.

Quinn, Robert J. (1997), "Investigating probability with the NBA draft lottery," Teaching Statistics, 19, 40-42.

Quinn, Robert J. (1997), "Anomalous sports performances," Teaching Statistics, 19, 81-83.

Reiter, J. (2001), "Motivating students' interest in statistics through sports," ASA Proceedings of the Section on Statistics in Sports.

Ross, K. (2004), A Mathematician at the Ballpark, New York: Pi Press.

Schell, M. (1999), Baseball All-Time Best Hitters, Princeton, NJ: Princeton University Press.

Simonoff, J. (1998), "Move over, Roger Maris: Breaking baseball's most famous record," Journal of Statistics Education, 6, 3.

Skiena, S. (2001), Calculated Bets, Cambridge, UK: Cambridge University Press.

Starr, Norton (1997), "Nonrandom risk: The 1970 draft lottery," Journal of Statistics Education, 5.

Watnik, M. (1998), "Pay for play: Are baseball salaries based on performance?" Journal of Statistics Education, 6, 2.

Watnik, M. and Levine, R. (2001), "NFL Y2K PCA," Journal of Statistics Education, 9, 3.

Watts, R. G. and Bahill, A. T. (2000), Keep Your Eye on the Ball, New York: W. H. Freeman.

Wiseman, Frederick and Chatterjee, Sangit (1997), "Major League Baseball player salaries: Bringing realism into introductory statistics courses," The American Statistician, 51, 350-352.


Part I
Statistics in Football


Chapter 3

Introduction to the Football Articles

Hal Stern

This brief introduction, a sort of "pregame show," provides some background information on the application of statistical methods in football and identifies particular research areas. The main goal is to describe the history of research in football, emphasizing the place of the five selected articles.

3.1 Background

Football (American style) is currently one of the most popular sports in the U.S. in terms of fan interest. As seems to be common in American sports, large amounts of quantitative information are recorded for each football game. These include summaries of both team and individual performances. For professional football, such data can be found for games played as early as the 1930s and, for college football, even earlier years. Somewhat surprisingly, despite the large amount of data collected, there has been little of the detailed statistical analysis that is common for baseball.

Two explanations for the relatively small number of statistics publications related to football are obvious. First, despite the enormous amount of publicity given to professional football, it is actually difficult to obtain detailed (play-by-play) information in computer-usable form. This is not to say that the data do not exist—they do exist and are used by the participating teams. The data have not been easily accessible to those outside the sport. Now play-by-play listings can be found on the World Wide Web at the National Football League's own site (www.nfl.com). These data are not in convenient form for research use, but one can work with them.

A second contributing factor to the shortage of research results is the nature of the game itself. Here are four examples of the kinds of things that can complicate statistical analyses: (1) scores occur in steps of size 2, 3, 6, 7, and 8 rather than just a single scoring increment; (2) the game is time-limited with each play taking a variable amount of time; (3) actions (plays) move the ball over a continuous surface; and (4) several players contribute to the success or failure of each play. Combined, these properties of football make the number of possible score-time-location situations that can occur extremely large and this considerably complicates analysis.

The same two factors that conspire to limit the amount of statistical research related to football also impact the types of work that are feasible. Most of the published research has focused on aspects of the game that can be easily separated from the ordinary progress of the game. For example, it can be difficult to rate the contribution of players because several are involved in each play. As a result, more work has been done evaluating the performance of kickers (including two articles published here) than of running backs. In the same way, the question of whether to try a one-point or two-point conversion after touchdown has received more attention than any other strategy-related question. In the remainder of this introduction, the history of statistical research related to football is surveyed and the five football articles in our collection are placed in context.

Since selecting articles for a collection like this invariably omits some valuable work, it is important to mention sources containing additional information. Stern (1998) provides a more comprehensive review of the work that has been carried out in football. The book by Carroll, Palmer, and Thorn (1988) is a sophisticated analysis of the game by three serious researchers with access to play-by-play data. Written for a general audience, the book does not provide many statistical details but is quite thought provoking. The collections of articles edited by Ladany and Machol (1977) and Machol, Ladany, and Morrison (1976) are also good academic sources.

3.2 Information Systems

Statistical contributions to the study of football have appeared in four primary areas: information systems, player performance evaluation, football strategy, and team performance evaluation. The earliest significant contribution of statistical reasoning to football was the development of computerized systems for studying opponents' performances. This remains a key area of research activity. Professional and college football teams prepare reports detailing the types of plays and formations favored by opponents in a variety of situations. The level of detail in the reports can be quite remarkable, e.g., they might indicate that Team A runs to the left side of the field 70% of the time on second down with five or fewer yards required for a new first down. These reports clearly influence team preparation and game-time decision making. These data could also be used to address strategic issues (e.g., whether a team should try to maintain possession when facing fourth down or kick the ball over to its opponent) but that would require more formal analysis than is typically done. It is interesting that much of the early work applying statistical methods to football involved people affiliated with professional or collegiate football (i.e., players and coaches) rather than statisticians. The author of one early computerized play-tracking system was 1960s professional quarterback Frank Ryan (Ryan, Francia, and Strawser, 1973). Data from such a system would be invaluable for statistically minded researchers but the data have not been made available.

3.3 Player Performance Evaluation
Evaluation of football players has always been important for selecting teams and rewarding players. Formally evaluating players, however, is a difficult task because several players contribute to each play. A quarterback may throw the ball 5 yards down the field and the receiver, after catching the ball, may elude several defensive players and run 90 additional yards for a touchdown. Should the quarterback get credit for the 95-yard touchdown pass or just the 5 yards the ball traveled in the air? What credit should the receiver get? The difficulty in apportioning credit to the several players that contribute to each play has meant that a large amount of research has focused on aspects of the game that are easiest to isolate, such as kicking. Kickers contribute to their team's scoring by kicking field goals (worth 3 points) and points after touchdown (worth 1 point). On fourth down a coach often has the choice of (1) attempting an offensive play to gain the yards needed for a new first down, (2) punting the ball to the opposition, or (3) attempting a field goal. A number of papers have concerned the performance of field goal kickers. The article by Berry (Chapter 4) is the latest in a series of attempts to model the probability that a field goal attempted from a given distance will be successful. His model builds on earlier geometric models for field goal kick data. Our collection also includes a second article about kickers, though focused on a very different issue. Morrison and Kalwani (Chapter 7) examine the performances of all professional kickers over a number of seasons and ask whether there are measurable differences in ability. Their somewhat surprising result is that the data are consistent with the hypothesis that all kickers have equivalent abilities. Though strictly speaking this is not likely to be true, the data are indicative of the large amount of variability in field goal kicking. As more play-by-play data are made available to researchers, one may expect further advances in the evaluation of performance by nonkickers.

3.4 Football Strategy
An especially intriguing goal for statisticians is to discover optimal strategies for football. Examples of the kinds of questions that one might address include optimal decision making on fourth down (to try to maintain possession, try for a field goal, or punt the ball to the opposing team), optimal point-after-touchdown conversion strategy (one-point or two-point conversion attempt), and perhaps even optimal offensive and defensive play calling. As is the case with evaluation of players, the complexities of football have meant that the majority of the research work has been carried out on the most isolated components of the game, especially point-after-touchdown conversion decisions. Attempts to address broader strategy considerations require being able to place a value on game situations, for example, to have the probability of winning the game for a given score, time remaining, and field position. Notable attempts in this regard include the work of Carter and Machol (1971)—that is, former NFL quarterback Virgil Carter—and Carroll, Palmer, and Thorn (1988).

3.5 Team Performance Evaluation
The assessment of team performance has received considerably more attention than the assessment of individual players. This may be partly a result of the enormous betting market for professional and college football in the U.S.

A key point is that teams usually play only a single game each week. This limits the number of games per team per season to between 10 and 20 (depending on whether we are thinking of college or professional football), and thus many teams do not play one another in a season. Because teams play unbalanced schedules, an unequivocal determination of the best team is not possible. This means there is an opportunity for developing statistical methods to rate teams and identify their relative strengths.

For a long time there has been interest in rating college football teams with unbalanced schedules. The 1920s and 1930s saw the development of a number of systems. Early methods relied only on the record of which teams had defeated which other teams (with no use made of the game scores). It has become more popular to make use of the scores accumulated by each team during its games, as shown in two of the articles included here. Harville (Chapter 6), in the second of his two papers on this topic, describes a linear model approach to predicting National Football League games. The linear model approach incorporates parameters for team strength and home field advantage. In addition, team strengths are allowed to vary over time. Building on Harville's work, Glickman and Stern (Chapter 5) apply a model using Bayesian methods to analyze the data. The methods of Harville and of Glickman and Stern rate teams on a scale such that the difference in estimated ratings for two teams is a prediction of the outcome of a game between them. The methods suggest that a correct prediction percentage of about 67% is possible; the game itself is too variable for better prediction.

This last observation is the key to the final chapter of Part I of this book. In Chapter 8, Stern examines the results of football games over several years and finds that the point differentials are approximately normal with mean equal to the Las Vegas betting point spread and standard deviation between 13 and 14 points. Stern's result means that it is possible to relate the point spread to the probability of winning. This relationship, and similar work for other sports, has made formal probability calculations possible in a number of settings where they had previously been quite difficult.
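To make the connection concrete, here is a minimal sketch (not code from any of the reprinted articles) of how the normal approximation converts a point spread into a win probability; the 14-point standard deviation is simply one value in the 13-14 range quoted above.

from math import erf, sqrt

def win_probability(point_spread, sd=14.0):
    """P(favorite wins) when the score margin is modeled as
    Normal(mean=point_spread, sd=sd), following Stern's approximation."""
    z = point_spread / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF evaluated at z

# Under this model a 7-point favorite wins roughly 69% of the time.
print(round(win_probability(7.0), 3))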

References

Berry, S. (1999), "A geometry model for NFL field goal kickers," Chance, 12 (3), 51-56.

Carroll, B., Palmer, P., and Thorn, J. (1988), The Hidden Game of Football, New York: Warner Books.

Carter, V. and Machol, R. E. (1971), "Operations research on football," Operations Research, 19, 541-545.

Glickman, M. E. and Stern, H. S. (1998), "A state-space model for National Football League scores," Journal of the American Statistical Association, 93, 25-35.

Harville, D. (1980), "Predictions for National Football League games via linear-model methodology," Journal of the American Statistical Association, 75, 516-524.

Ladany, S. P. and Machol, R. E. (editors) (1977), Optimal Strategy in Sports, Amsterdam: North-Holland.

Machol, R. E., Ladany, S. P., and Morrison, D. G. (editors) (1976), Management Science in Sports, Amsterdam: North-Holland.

Morrison, D. G. and Kalwani, M. U. (1993), "The best NFL field goal kickers: Are they lucky or good?" Chance, 6 (3), 30-37.

Ryan, F., Francia, A. J., and Strawser, R. H. (1973), "Professional football and information systems," Management Accounting, 54, 43-41.

Stern, H. S. (1991), "On the probability of winning a football game," The American Statistician, 45, 179-183.

Stern, H. S. (1998), "American football," in Statistics in Sport, edited by J. Bennett, London: Arnold, 3-23.

3.6 Summary
Enough of the pregame show. Now on to the action! The following five football articles represent how statistical thinking can influence the way that sports are watched and played. In the case of football, we hope these articles foreshadow future developments that will touch on more difficult work involving player evaluation and strategy questions.


Chapter 4

A STATISTICIAN READS THE SPORTS PAGES

Scott M. Berry, Column Editor

A Geometry Model for NFL Field Goal Kickers

My dad always made it a point whenever we had a problem to discuss it openly with us. This freedom to discuss ideas had its limits — there was one thing that could not be discussed. This event was too painful for him to think about — a weak little ground ball hit by Mookie Wilson that rolled through the legs of Bill Buckner. This error by Buckner enabled the New York Mets to beat the Boston Red Sox in the 1986 World Series. My dad is a life-long Red Sox fan and was crushed by the loss. This loss was devastating because it appeared as though the Red Sox were going to win and end the "Curse of the Babe" by winning their first championship since the trade of Babe Ruth after the 1919 season.

The Minnesota Vikings have suffered similar championship futility. They have lost four Super Bowls without winning one. The 1998 season appeared to be different. They finished the season 15-1, set the record for the most points scored in a season, and were playing the Atlanta Falcons for a chance to go to the 1999 Super Bowl. The Vikings had the ball on the Falcons' 21-yard line with a 7-point lead and just over two minutes remaining. Their place kicker, Gary Anderson, had not missed a field-goal attempt all season. A successful field goal and the Vikings would earn the right to play the Denver Broncos in the Super Bowl. Well, in Buckner-like fashion, he missed the field goal — wide left . . . for a life-long Viking fan, it was painful. The Falcons won the game, and the Vikings were eliminated. I took the loss much the same as my dad had taken the Mets beating the Red Sox — not very well.

When the pain subsided I got to thinking about modeling field goals. In an interesting geometry model, Berry and Berry (1985) decomposed each kick attempt into a distance and a directional component. In this column I use similar methods, but the data are more detailed and the current computing power enables a more detailed model. I model the direction and distance of each kick with separate distributions. They are combined to find estimates for the probability of making a field goal for each player, from each distance. By modeling distance and accuracy the intuitive result is that different kickers are better from different distances. There has also been some talk in the NFL that field goals are too easy. I address the effect of reducing the distance between the uprights. Although it may be painful, I also address the probability that Gary Anderson would miss a 38-yard field goal!

Column Editor: Scott M. Berry, Department of Statistics, Texas A&M University, 410B Blocker Building, College Station, TX 77843-3143, USA.

Background Information
A field goal attempt in the NFL consists of a center hiking the ball to a holder, seven yards from the line of scrimmage. The holder places the ball on the ground, where the place kicker attempts to kick it through the uprights. The uprights are 18 feet, 6 inches apart and are centered 10 yards behind the goal line. Therefore, if the line of scrimmage is the 21-yard line, the ball is "snapped" to the holder at the 28-yard line and the uprights are 38 yards from the position of the attempt, which is considered a 38-yard attempt. For the kick attempt to be successful, the place kicker must kick the ball through the uprights and above a crossbar, which is 10 feet off the ground. The information recorded by the NFL for each kick is whether it was successful, short, or wide, and if it is wide, which side it misses on. I could not find any official designation of how a kick is categorized if it is both short and wide. I assume that if a kick is short, that is the first categorization used for a missed kick. Therefore, a kick that is categorized as wide must have had enough distance to clear the crossbar. After a touchdown is scored, the team has an option to kick a point after touchdown (PAT). This is a field-goal try originating from the 2-yard line — thus a 19-yard field-goal try.

I collected the result of every place kick during the 1998 regular season. Each kick is categorized by the kicker, distance, result, and the reason for any unsuccessful kicks. Currently NFL kickers are ranked by their field-goal percentage. The trouble with these rankings is that the distance of the kick is ignored. A good field-goal kicker is generally asked to attempt longer and more difficult kicks. When a kicker has a 38-yard field goal attempt, frequently the TV commentator will provide the success proportion for the kicker from that decile yardage — that is, 30-39 yards. These percentages are generally based on small sample sizes. Information from longer kicks can also be informative for a 38-yard attempt. A success from a longer distance would be successful from shorter distances. Moreover, misses from shorter distances would also be misses from the longer distances. I model the accuracy and length for each of the place kickers and update from all the place kicks.

Figure 1. The accuracy necessary for a kicker is proportional to the distance of the kick. For a field-goal try of 20 yards, you can have an angle twice as big as that for a try from 40 yards.

Mathematical Model
To measure the accuracy of a place kicker, I model the angle of a kick from the center of the uprights. Refer to Fig. 1 for a description of the accuracy aspect of the model. For a shorter kick, there is a larger margin of error for the kicker. Let the angle from the center line for a kick be θ. Rather than model the angle explicitly, I model the distance from the center line, for an angle θ, if the kick were a 40-yard attempt. Let the random variable W be the number of feet from the center line if the kick were from 40 yards. I assume that each kicker is unbiased — that is, on average they hit the middle of the uprights. It may be that some kickers are biased to one side or the other, but the data are insufficient to investigate "biasedness" of a kicker. A kicked ball must be within 9.25 feet of the center line when it crosses the plane of the uprights for the kick to be successful (assuming it is far enough). Therefore, if the attempt is from 40 yards, |W| < 9.25 implies that the kick will be successful if it is long enough. For a 20-yard field goal the angle can be twice as big as for the 40-yard attempt and still be successful. Thus, a 20-yard field goal will be successful if |W| < 18.5. In general, a field goal from X yards will be successful if |W| < (9.25)(40/X).

I assume that the resulting distance from the center line, 40 yards from the holder, is normally distributed with a mean of 0 and a standard deviation of σi, for player i. I refer to σi as the "accuracy" parameter for a kicker. For kicker i, if a kick from X yards is long enough, the probability that it is successful is

P(|W| < 9.25(40/X)) = Φ(9.25(40/X)/σi) − Φ(−9.25(40/X)/σi) = 2Φ(9.25(40/X)/σi) − 1,

where Φ is the cumulative distribution function of a standard normal random variable.

For a kick to be successful, it also has to travel far enough to clear the crossbar 10 feet off the ground. For each attempt the distance for which it would be successful if it were straight enough is labeled d. I model the distance traveled for a kick attempt with a normal distribution. The mean distance for each kicker is μi and the standard deviation is τ, which is considered the same for all kickers. The parameter μi is referred to as the "distance" parameter for kicker i.
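The two pieces combine multiplicatively: a kick from X yards is good only if it is both accurate enough and long enough. The following short sketch of that calculation is ours, not the article's code; the 9.25-foot half-width is the value given above, and the kicker parameters plugged in are the "average kicker" values (μ = 55.21, σ = 7.17) and the estimate τ = 4.37 reported later in the article.

from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_field_goal(x_yards, sigma, mu, tau=4.37, half_width_ft=9.25):
    """Geometry model: P(make) = P(accurate enough) * P(long enough).
    sigma = accuracy sd in feet at 40 yards, mu = mean kick distance in yards,
    tau = common sd of kick distance, half_width_ft = half the upright width."""
    # Accuracy: |W| < half_width * (40 / X), with W ~ Normal(0, sigma^2).
    p_accurate = 2.0 * normal_cdf(half_width_ft * (40.0 / x_yards) / sigma) - 1.0
    # Distance: the kick carries at least X yards, with d ~ Normal(mu, tau^2).
    p_long_enough = 1.0 - normal_cdf((x_yards - mu) / tau)
    return p_accurate * p_long_enough

print(round(p_field_goal(40, sigma=7.17, mu=55.21), 2))   # about 0.80

A 40-yard try for this average kicker comes out near 80%, which matches the figure quoted later for the current uprights; narrowing the uprights corresponds simply to reducing half_width_ft.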

The distance traveled, d, for each kick attempt is assumed to be independent of the accuracy of each kick, W. Again this assumption is hard to check. It may be that on long attempts a kicker will try to kick it harder, thus affecting both distance and accuracy. Another concern may be that kickers try to "hook" or "slice" their kicks. All I am concerned with is the angle at which the ball passes the plane of the uprights. If the ball is kicked at a small angle with the center line and it hooks directly through the middle of the uprights, then for my purposes that is considered to have had an angle of 0 with the center line. Another possible complication is the spot of the ball on the line of scrimmage. The ball is always placed between the hash marks on the field. When the ball is placed on the hash marks this can change slightly the range of angles that will be successful. The data for ball placement are not available. I assume that there is no difference in distance or accuracy depending on the placement of the ball on the line of scrimmage.

Where Are the Data?
There are many sources on the Web for NFL football data. The official NFL Web site (nfl.com) has "gamebooks" for every game. It also contains a play-by-play account of every game. These are not in a great format to download, but they can be handled. This site also provides the usual statistics for the NFL. It has individual statistics for every current NFL player. ESPN (espn.go.com), CNNSI (cnnsi.com), The Sporting News (sportingnews.com), and USA Today (www.usatoday.com) also have statistics available. These sites also have live updates of games while they are in process. Each has different interesting features. I have made the data I used for this article available on the Web at stat.tamu.edu/berry.

The effect of wind is ignored. Clearly kicking with the wind or against the wind would affect the distance traveled. A side wind would also affect the accuracy of a kick. Data are not available on the wind speed and direction for each kick. Adverse weather such as heavy rain and/or snow could also have adverse impacts on both distance and accuracy. This is relatively rare, and these effects are ignored. Bilder and Loughlin (1998) investigated factors that affect field-goal success. Interestingly, they found that whether a kick will cause a lead change is a significant factor.

I use a hierarchical model for the distribution of the accuracy parameters and the distance parameters for the k kickers (see sidebar). A normal distribution is used for each: the distance parameters μ1, . . . , μk are modeled as draws from a common normal population, and the accuracy parameters σ1, . . . , σk are modeled as draws from a separate common normal population. The normal distribution for the σs is for convenience — technically the σs must be nonnegative, but the resulting distribution has virtually no probability less than 0. Morrison and Kalwani (1993) analyzed three years' worth of data for place kickers and found no evidence for differences in ability of those kickers.

Data and Results
During the 1998 regular season, there were k = 33 place kickers combining for 1,906 attempts (including PATs). For each kicker i we have the distance Xij for his jth kick. I ignore kicks that were blocked. Although in some cases this is an indication that the kicker has performed poorly, it is more likely a fault of the offensive line. This is a challenging statistical problem because the distance dij and accuracy Wij for each kick are not observed. Instead, we observe censored observations. This means that we don't observe W or d, but we do learn a range for them. If the kick is successful we know that dij > Xij and |Wij| < 9.25(40/Xij). If the kick is short we know that dij < Xij and we learn nothing about Wij. If the kick is wide we learn that |Wij| > 9.25(40/Xij) and that dij > Xij. A Markov chain Monte Carlo algorithm is used to find the joint posterior distribution of all of the parameters.
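The censoring rules just described turn each observed outcome into bounds on the latent distance and accuracy of that kick, and the data augmentation inside the MCMC works with exactly these ranges. A small sketch of the bookkeeping (the helper below is ours, not the article's):

def censoring_bounds(x_yards, outcome, half_width_ft=9.25):
    """Bounds implied for the latent carry distance d_ij (yards) and the latent
    40-yard offset |W_ij| (feet) by a kick attempted from x_yards.
    outcome is 'good', 'short', or 'wide'; None means unknown/unbounded."""
    w_cut = half_width_ft * (40.0 / x_yards)   # accuracy threshold at this distance
    if outcome == "good":       # carried far enough and split the uprights
        return {"d": (x_yards, None), "abs_w": (0.0, w_cut)}
    if outcome == "short":      # did not carry; direction tells us nothing
        return {"d": (None, x_yards), "abs_w": (None, None)}
    if outcome == "wide":       # carried far enough but missed to a side
        return {"d": (x_yards, None), "abs_w": (w_cut, None)}
    raise ValueError("outcome must be 'good', 'short', or 'wide'")

print(censoring_bounds(38, "wide"))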

Intuitively, we learn about σi from the frequency of made kicks. If the kicker has a high probability of making kicks then he must have a small value of σi. The same relationship holds for the distance parameter μi. If a kicker kicks it far enough when the distance is Xij, that gives information about the mean distance kicked. Likewise, if the kick is short, a good deal of information is learned about μi.

Table 1 presents the order of kickers in terms of their estimated accuracy parameter. The mean and standard deviation are reported for each σi and each μi. The probability of making a 30-yard and a 50-yard field goal are also presented.

Gary Anderson is rated as the best kicker from an accuracy standpoint. His mean σ is 5.51, which means that from 40 yards away he has a mean of 0 feet from the center line with a standard deviation of 5.51 feet. He is estimated to have a 97% chance of making a 30-yard field goal with a 78% chance of making a 50-yard field goal. The impact of the hierarchical model shows in the estimates for Anderson. Using just data from Anderson, who was successful on every attempt, it would be natural to claim he has perfect accuracy and a distance guaranteed to be greater than his longest kick. The hierarchical model regresses his performance toward the mean of the other kickers. This results in him being estimated to be very good, but not perfect.

Although Anderson is the most accurate, Jason Elam is estimated to be the longest kicker. The best kicker from 52 yards and closer is Gary Anderson, but Jason Elam is the best from 53 yards and farther. There can be large differences between kickers from different distances. Hall is estimated to be a very precise kicker with a 91% chance of making a 30-yard field goal. Brien has a 90% chance of making a 30-yard field goal. But from 50 yards Brien has a 51% chance, while Hall has only a 28% chance. Hall is an interesting case. Football experts would claim he has a strong leg, but the model estimates he has a relatively weak leg. Of his 33 attempts he was short from 60, 55, 46, and 32 yards. The 32 has a big influence on the results because of the normal assumption for distance. I don't know the circumstances behind this miss. There may have been a huge wind or he may have just "shanked" it. A more appropriate model might have a small probability of a shanked kick, which would be short from every distance. The fact that Hall was asked to attempt kicks from 60 and 55 yards is probably a strong indication that he does have a strong leg.

Figure 2. The probability of a successful field goal for Gary Anderson, Jason Elam, Mike Husted, and Adam Vinatieri.

Figure 3. The probability of success for Gary Anderson decomposed into distance and accuracy. The curve labeled "distance" is the probability that the kick travels far enough to be successful, and the curve labeled "accuracy" is the probability that the kick would go through the uprights if it is long enough.

Figure 2 shows the probability of making a field goal for four kickers with interesting profiles. Elam and Anderson are terrific kickers from short and long distances. Adam Vinatieri is a precise kicker with poor distance. Husted is not very precise but has a good leg. Figure 3 shows the decomposition of Anderson's probability of making a field goal into the accuracy component and the distance component. The probability of making a field goal is the product of the probabilities of the kick being far enough and accurate enough.

The hierarchical distributions are estimated from the data; the resulting "average" kicker, used below, has distance parameter μ = 55.21 and accuracy parameter σ = 7.17. The common standard deviation in the distance random variable, τ, is estimated to be 4.37.

To investigate goodness of fit for the model, I calculated the probability of success for each of the kicker's attempts. From these I calculated the expected number of successful kicks for each kicker. Table 2 presents these results. The predicted values are very close to the actual outcomes. Only 3 of the 33 kickers had a predicted number of successes that deviated from the actual by more than 2 (Davis, 4; Anderson, 3; Richey, 3). I also characterized each of the field-goal attempts by the model-estimated probability of success. In Table 3 I grouped these into the 10 deciles for the probability of success. The actual proportions of success for these deciles are presented. Again there is a very good fit. The last 5 deciles, which are the only ones with more than 12 observations, have incredibly good fit. It certainly appears as though kickers can be well modeled by their distance and accuracy.

During the 1998 season, 82% of all field-goal attempts were successful. Even from 50+ yards, there was a 55% success rate. Frequently, teams use less risky strategies because the field goal is such a likely event. Why should an offense take chances when they have an almost sure 3 points? This produces a less exciting, more methodical game. There has been talk (and I agree with it) that the uprights should be moved closer together. From the modeling of the accuracy of the kickers I can address what the resulting probabilities would be from the different distances if the uprights were moved closer.

Using the hierarchical models, I label the "average" kicker as one with μ = 55.21 and σ = 7.17. Figure 4 shows the probability of an average kicker making a field goal from the different distances. The label for each of the curves represents the number of feet each side of the upright is reduced. With the current uprights, from 40 yards, the average kicker has about an 80% chance of a successful kick. If the uprights are moved in 3 feet on each side, this probability will be reduced to 62%. There is not a huge difference in success probability for very long field goals, but this is because distance becomes more important. Reducing each upright by 5 feet would make field goals too unlikely and would upset the current balance between field goals and touchdowns. Although a slight change in the balance is desirable, a change of 5 feet would have too large an impact. I would suggest a reduction of between 1 and 3 feet on each upright.

Table 2 — The Predicted Number of Successes From the Model for Each Kicker for Their Field-Goal Attempts of 1998

Rank  Kicker         Attempts  Made  Predicted
 1    G. Anderson       35      35      32
 2    Brien             21      20      18
 3    Vanderjagt        30      27      25
 4    Johnson           31      26      25
 5    Hanson            33      29      27
 6    Del Greco         39      36      35
 7    Elam              26      23      22
 8    Cunningham        35      29      29
 9    Peterson          24      19      19
10    Stoyanovich       32      27      27
11    Kasay             25      19      19
12    Hall              33      25      25
13    M. Andersen       25      23      21
14    Vinatieri         38      30      31
15    Carney            30      26      25
16    Daluiso           26      21      22
17    Stover            27      21      22
18    Mare
19    Nedney
20    Hollis
21    Pelfrey
22    Blanton
23    Boniol
24    Akers
25    Christie
26    Jacke
27    Richey
28    Longwell
29    Wilkins
30    Jaeger
31    Blanchard
32    Davis
33    Husted

Table 3 — Actual Rates of Success for 1998 Field-Goal Attempts for Categories Defined by the Estimated Probability of Success

Predicted %   Attempts   Made   Actual %
0-10
11-20
21-30
31-40
41-50             12       4       33
51-60             43      24       56
61-70             81      49       60
71-80            194     143       74
81-90            228     193       85
91-100           292     284       97

Well, I have been dreading this part of the article. What is the probability that Gary Anderson would make a 38-yard field goal? The model probability is 92.3%. So close to a Super Bowl! Did Gary Anderson "choke"? Clearly that depends on your definition of choke. I don't think he choked. Actually, in all honesty, the Viking defense is more to blame than Anderson. They had a chance to stop the Falcon offense from going 72 yards for the touchdown and they failed. As a Viking fan I am used to it. Maybe the Vikings losing to the Falcons saved me from watching Anderson miss a crucial 25-yarder in the Super Bowl. Misery loves company, and thus there is comfort in knowing that Bills fans have felt my pain ... wide right!

Figure 4. The probability of a successful kick for an average kicker for different size uprights. The curve labeled "current" refers to the current rule, which has each upright 9 feet, 3 inches from the center of the uprights. The curves labeled "1," "3," and "5" are for each upright being moved in 1, 3, and 5 feet, respectively.

References and Further Reading

Berry, D. A., and Berry, T. D. (1985), "The Probability of a Field Goal: Rating Kickers," The American Statistician, 39, 152-155.

Bilder, C. R., and Loughlin, T. M. (1998), "'It's Good!' An Analysis of the Probability of Success for Placekicks," Chance, 11 (2), 20-24.

Morrison, D. G., and Kalwani, M. U. (1993), "The Best NFL Field Goal Kickers: Are They Lucky or Good?" Chance, 6 (3), 30-37.

Hierarchical Models
A hierarchical model is commonly used to model the individual parameters of subjects from a common population. The notion is that there is information about one subject from the others. These models are powerful for modeling the performance of athletes in all sports. They are being used increasingly in every branch of statistics but are commonly used in biostatistics, environmental statistics, and educational testing. The general idea is that each of the subjects has its own parameter θi. This parameter describes something about that subject — for example, ability to hit a home run, kick a field goal, or hit a golf ball. For each subject a random variable xi is observed that is informative about θi. The distribution of the subjects' θs is explicitly modeled by a distribution g, indexed by a parameter α, which describes the distribution of θ in the population. This distribution is generally unknown to the researcher.

We learn about θi from xi but also from g. The distribution g is unknown, but we learn about it from the estimated θs (based on each xi). Thus, there is information about θi from xi but also from the other xs. This is called "shrinkage," "borrowing strength," or "regression to the mean." In this article, the field-goal kickers are all fantastic and are similar. By observing one kicker, information is gathered about all kickers. This creates the effect that, even though Gary Anderson never missed a single kick, I still believe that he is not perfect — he is very good, but not perfect. This information comes from using a hierarchical model for the population of kickers. The mathematical calculation of the posterior distribution from these different sources of information is found using Bayes' theorem.
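A toy numerical illustration of the shrinkage idea, with normal observations and a known population distribution (none of these numbers come from the article):

def shrunken_estimate(x_i, obs_var, pop_mean, pop_var):
    """Posterior mean for one subject under a normal-normal model: a
    precision-weighted compromise between the subject's own data x_i
    and the population mean."""
    w = pop_var / (pop_var + obs_var)   # weight on the subject's own data
    return w * x_i + (1.0 - w) * pop_mean

# A perfect observed success rate (1.00) is pulled back toward the group mean (0.80).
print(round(shrunken_estimate(1.00, obs_var=0.02, pop_mean=0.80, pop_var=0.01), 3))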


Chapter 5

A State-Space Model for National Football League Scores

Mark E. GLICKMAN and Hal S. STERN

This article develops a predictive model for National Football League (NFL) game scores using data from the period 1988-1993. The parameters of primary interest—measures of team strength—are expected to vary over time. Our model accounts for this source of variability by modeling football outcomes using a state-space model that assumes team strength parameters follow a first-order autoregressive process. Two sources of variation in team strengths are addressed in our model: week-to-week changes in team strength due to injuries and other random factors, and season-to-season changes resulting from changes in personnel and other longer-term factors. Our model also incorporates a home-field advantage while allowing for the possibility that the magnitude of the advantage may vary across teams. The aim of the analysis is to obtain plausible inferences concerning team strengths and other model parameters, and to predict future game outcomes. Iterative simulation is used to obtain samples from the joint posterior distribution of all model parameters. Our model appears to outperform the Las Vegas "betting line" on a small test set consisting of the last 110 games of the 1993 NFL season.

KEY WORDS: Bayesian diagnostics; Dynamic models; Kalman filter; Markov chain Monte Carlo; Predictive inference.

1. INTRODUCTION

Prediction problems in many settings (e.g., finance, political elections, and in this article, football) are complicated by the presence of several sources of variation for which a predictive model must account. For National Football League (NFL) games, team abilities may vary from year to year due to changes in personnel and overall strategy. In addition, team abilities may vary within a season due to injuries, team psychology, and promotion/demotion of players. Team performance may also vary depending on the site of a game. This article describes an approach to modeling NFL scores using a normal linear state-space model that accounts for these important sources of variability.

The state-space framework for modeling a system over time incorporates two different random processes. The distribution of the data at each point in time is specified conditional on a set of time-indexed parameters. A second process describes the evolution of the parameters over time. For many specific state-space models, including the model developed in this article, posterior inferences about parameters cannot be obtained analytically. We thus use Markov chain Monte Carlo (MCMC) methods, namely Gibbs sampling (Gelfand and Smith 1990; Geman and Geman 1984), as a computational tool for studying the posterior distribution of the parameters of our model. Pre-MCMC approaches to the analysis of linear state-space models include those of Harrison and Stevens (1976) and West and Harrison (1990). More recent work on MCMC methods has been done by Carter and Kohn (1994), Fruhwirth-Schnatter (1994), and Glickman (1993), who have developed efficient procedures for fitting normal linear state-space models. Carlin, Polson, and Stoffer (1992), de Jong and Shephard (1995), and Shephard (1994) are only a few of the recent contributors to the growing literature on MCMC approaches to non-linear and non-Gaussian state-space models.

Mark E. Glickman is Assistant Professor, Department of Mathematics, Boston University, Boston, MA 02215. Hal S. Stern is Professor, Department of Statistics, Iowa State University, Ames, IA 50011. The authors thank the associate editor and the referees for their helpful comments. This work was partially supported by National Science Foundation grant DMS94-04479.

The Las Vegas "point spread" or "betting line" of a game, provided by Las Vegas oddsmakers, can be viewed as the experts' prior predictive estimate of the difference in game scores. A number of authors have examined the point spread as a predictor of game outcomes, including Amoako-Adu, Manner, and Yagil (1985), Stern (1991), and Zuber, Gandar, and Bowers (1985). Stern, in particular, showed that modeling the score difference of a game to have a mean equal to the point spread is empirically justifiable. We demonstrate that our model performs at least as well as the Las Vegas line for predicting game outcomes for the latter half of the 1993 season.

Other work on modeling NFL football outcomes (Stefani 1977, 1980; Stern 1992; Thompson 1975) has not incorporated the stochastic nature of team strengths. Our model is closely related to one examined by Harville (1977, 1980) and Sallas and Harville (1988), though the analysis that we perform differs in a number of ways. We create prediction inferences by sampling from the joint posterior distribution of all model parameters rather than fixing some parameters at point estimates prior to prediction. Our model also describes a richer structure in the data, accounting for the possibility of shrinkage towards the mean of team strengths over time. Finally, the analysis presented here incorporates model checking and sensitivity analysis aimed at assessing the propriety of the state-space model.

2. A MODEL FOR FOOTBALL GAME OUTCOMES

Let yii′ denote the outcome of a football game between team i and team i′, where teams are indexed by the integers from 1 to p. For our dataset, p = 28. We take yii′ to be the difference between the score of team i and the score of team i′. The NFL game outcomes can be modeled as approximately normally distributed with a mean that depends on the relative strength of the teams involved in the game and the site of the game. We assume that at week j of season k, the strength or ability of team i can be summarized by a parameter θ(k,j),i. We let θ(k,j) denote the vector of p team-ability parameters for week j of season k. An additional set of parameters, αi, i = 1, . . . , p, measures the magnitude of team i's advantage when playing at its home stadium rather than at a neutral site. These home-field advantage (HFA) parameters are assumed to be independent of time but may vary across teams. We let α denote the vector of p HFA parameters. The mean outcome for a game between team i and team i′ played at the site of team i during week j of season k is assumed to be θ(k,j),i − θ(k,j),i′ + αi.

We can express the distribution for the outcomes of all n(k,j) games played during week j of season k as

y(k,j) ~ N(X(k,j) β(k,j), φ⁻¹ I),

where y(k,j) is the vector of game outcomes, X(k,j) is the n(k,j) × 2p design matrix for week j of season k (described in detail later), β(k,j) = (θ(k,j), α) is the vector of p team-ability parameters and p HFA parameters, and φ is the regression precision of game outcomes. We let σ² = φ⁻¹ denote the variance of game outcomes conditional on the mean. The row of the matrix X(k,j) for a game between team i and team i′ has the value 1 in the ith column (corresponding to the first team involved in the game), −1 in the i′th column (the second team), and 1 in the (p + i)th column (corresponding to the HFA) if the first team played on its home field. If the game were played at the site of the second team (team i′), then home field would be indicated by a −1 in the (p + i′)th column. Essentially, each row has entries 1 and −1 to indicate the participants and then a single entry in the column corresponding to the home team's HFA parameter (1 if it is the first team at home; −1 if it is the second team). The designation of one team as the first team and the other as the second team is arbitrary and does not affect the interpretation of the model, nor does it affect inferences.
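A sketch of how one row of X(k,j) could be assembled for a game between teams i and i′ (zero-based indices and the helper name are ours, not the authors'):

import numpy as np

def design_row(i, i_prime, home, p=28):
    """One row of the n x 2p design matrix: +1 and -1 in the two teams' strength
    columns, and +1 or -1 in the home team's HFA column."""
    row = np.zeros(2 * p)
    row[i] = 1.0              # first team
    row[i_prime] = -1.0       # second team
    if home == "first":
        row[p + i] = 1.0          # game played at the first team's stadium
    elif home == "second":
        row[p + i_prime] = -1.0   # game played at the second team's stadium
    return row

# Team 0 hosting team 5: the mean outcome is theta_0 - theta_5 + alpha_0.
print(design_row(0, 5, home="first")[:8])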

We take K to be the number of seasons of available data. For our particular dataset there are K = 6 seasons. We let gk, for k = 1, . . . , K, denote the total number of weeks of data available in season k. Data for the entire season are available for season k, k = 1, . . . , 5, with gk varying from 16 to 18. We take g6 = 10, using the data from the remainder of the sixth season to perform predictive inference. Additional details about the structure of the data are provided in Section 4.

Our model incorporates two sources of variation related to the evolution of team ability over time. The evolution of strength parameters between the last week of season k and the first week of season k + 1 is assumed to be governed by

θ(k+1,1) ~ N(βs G θ(k,gk), (ωs φ)⁻¹ I),

where G is the matrix that maps the vector θ(k,gk) to θ(k,gk) − ave(θ(k,gk)), βs is the between-season regression parameter that measures the degree of shrinkage (βs < 1) or expansion (βs > 1) in team abilities between seasons, and the product ωs φ is the between-season evolution precision. This particular parameterization for the evolution precision simplifies the distributional calculus involved in model fitting. We let σs² = (ωs φ)⁻¹ denote the between-season evolution variance. Then ωs is the ratio of variances, ωs = σ²/σs².

The matrix G maps the vector θ(k,gk) to another vector centered at 0, and then shrunk or expanded around 0. We use this mapping because the distribution of the game outcomes y(k,j) is a function only of differences in the team ability parameters; the distribution is unchanged if a constant is added to or subtracted from each team's ability parameter. The mapping G translates the distribution of team strengths to be centered at 0, though it is understood that shrinkage or expansion is actually occurring around the mean team strength (which may be drifting over time). The season-to-season variation is due mainly to personnel changes (new players or coaches). One would expect βs < 1, because the player assignment process is designed to assign the best young players to the teams with the weakest performance in the previous season.

We model short-term changes in team performance by incorporating evolution of ability parameters between weeks,

θ(k,j+1) ~ N(βw G θ(k,j), (ωw φ)⁻¹ I),

where the matrix G is as before, βw is the between-week regression parameter, and ωw φ is the between-week evolution precision. Analogous to the between-season component of the model, we let σw² = (ωw φ)⁻¹ denote the variance of the between-week evolution, so that ωw = σ²/σw². Week-to-week changes represent short-term sources of variation, for example, injuries and team confidence level. It is likely that βw ≈ 1, because there is no reason to expect that such short-term changes will tend to equalize the team parameters (βw < 1) or accentuate differences (βw > 1).
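One step of this evolution is easy to simulate: center the strengths (the map G), regress them toward 0, and add noise. The sketch below uses placeholder parameter values, not posterior estimates from the article.

import numpy as np

def evolve_week(theta, beta_w, sigma_w, rng):
    """Draw theta(k, j+1) ~ N(beta_w * G theta(k, j), sigma_w^2 I), where G
    recenters the team strengths at zero before they are shrunk or expanded."""
    centered = theta - theta.mean()                     # the map G
    return beta_w * centered + rng.normal(0.0, sigma_w, size=theta.shape)

rng = np.random.default_rng(0)
theta = np.array([9.0, 4.0, 0.5, -3.0, -7.5])           # toy strengths for 5 teams
print(evolve_week(theta, beta_w=0.99, sigma_w=0.9, rng=rng).round(2))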

Several simplifying assumptions built into this model are worthy of comment. We model differences in football scores, which can take on integer values only, as approximately normally distributed conditional on team strengths. The rules of football suggest that some outcomes (e.g., 3 or 7) are much more likely than others. Rosner (1976) modeled game outcomes as a discrete distribution that incorporates the rules for football scoring. However, previous work (e.g., Harville 1980, Sallas and Harville 1988, Stern 1991) has shown that the normality assumption is not an unreasonable approximation, especially when one is not interested in computing probabilities for exact outcomes but rather for ranges of outcomes (e.g., whether the score difference is greater than 0). Several parameters, notably the regression variance σ² and the evolution variances σw² and σs², are assumed to be the same for all teams and for all seasons. This rules out the possibility of teams with especially erratic performance. We explore the adequacy of these modeling assumptions using posterior predictive model checks (Gelman, Meng, and Stern 1996; Rubin 1984) in Section 5.

Prior distributions of model parameters are centered at values that seem reasonable based on our knowledge of football. In each case, the chosen distribution is widely dispersed, so that before long the data will play a dominant role. We assume dispersed but proper prior distributions for the precision and regression parameters, described in what follows.

Our prior distribution on φ corresponds to a harmonic mean of 100 for the variance σ², which is roughly equivalent to a 10-point standard deviation for game outcomes conditional on knowing the teams' abilities. This is close to, but a bit lower than, Stern's (1991) estimate of σ = 13.86 derived from a simpler model. In combination with this prior belief about φ, the prior distributions on ωw and ωs assume harmonic means of σw² and of σs² equal to 100/60 and 100/16, indicating our belief that the changes in team strength between seasons are likely to be larger than short-term changes in team strength. Little information is currently available about σw² and σs², which is represented by the .5 df. The prior distributions on the regression parameters assume shrinkage toward the mean team strength, with a greater degree of shrinkage for the evolution of team strengths between seasons. In the context of our state-space model, it is not necessary to restrict the modulus of the regression parameters (which are assumed to be equal for every week and season) to be less than 1, as long as our primary concern is for parameter summaries and local prediction rather than long-range forecasts.

The only remaining prior distributions are those for the initial team strengths in 1988, θ(1,1), and the HFA parameters, α. For team strengths at the onset of the 1988 season, we could try to quantify our knowledge perhaps by examining 1987 final records and statistics. We have chosen instead to use an exchangeable prior distribution as a starting point, ignoring any pre-1988 information: θ(1,1) is taken to be normally distributed around 0 with precision ω0 φ, and a dispersed prior distribution is assumed for ω0. Let σ0² = (ω0 φ)⁻¹ denote the prior variance of initial team strengths. Our prior distribution for ω0 in combination with the prior distribution on φ implies that σ0² has prior harmonic mean of 100/6 based on .5 df. Thus the a priori difference between the best and worst teams would be about 4σ0 ≈ 16 points.

We assume that the αi have independent normal prior distributions with a common mean and with precision ωh φ, and a dispersed prior distribution is assumed for ωh. We assume a prior mean of 3 for the αi, believing that competing on one's home field conveys a small but persistent advantage. If we let σh² = (ωh φ)⁻¹ denote the prior variance of the HFA parameters, then our prior distributions for ωh and φ imply that σh² has prior harmonic mean of 100/6 based on .5 df.

3. MODEL FITTING AND PREDICTION

We fit and summarize our model using MCMC techniques, namely the Gibbs sampler (Gelfand and Smith 1990; Geman and Geman 1984). Let Y(K,gK) represent all observed data through week (K, gK). The Gibbs sampler is implemented by drawing alternately in sequence from three conditional posterior distributions. A detailed description of the conditional distributions appears in the Appendix. Once the Gibbs sampler has converged, inferential summaries are obtained by using the empirical distribution of the simulations as an estimate of the posterior distribution.

An important use of the fitted model is in the prediction of game outcomes. Assume that the model has been fit via the Gibbs sampler to data through week gK of season K, thereby obtaining m posterior draws of the final team-ability parameters θ(K,gK), the HFA parameters α, and the precision and regression parameters. Denote the entire collection of these parameters by η(K,gK). Given the design matrix for the next week's games, X(K,gK+1), the posterior predictive distribution of next week's game outcomes, y(K,gK+1), is given by (5), a normal distribution that combines the between-week evolution of team strengths with the observation model. A sample from this distribution may be simulated by randomly selecting values of η(K,gK) from among the Gibbs sampler draws and then drawing y(K,gK+1) from the distribution in (5) for each draw of η(K,gK). This process may be repeated to construct a sample of desired size. To obtain point predictions, we could calculate the sample average of these posterior predictive draws. It is more efficient, however, to calculate the sample average of the means in (5) across draws of η(K,gK).
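A sketch of the more efficient point-prediction recipe just described: average the predictive mean, rather than simulated outcomes, over the retained posterior draws (array names and shapes are ours):

import numpy as np

def point_predictions(X_next, theta_draws, alpha_draws, beta_w_draws):
    """Average the predictive mean of next week's outcomes over posterior draws.
    X_next: (n_games, 2p) design matrix; theta_draws: (m, p) team strengths;
    alpha_draws: (m, p) HFA parameters; beta_w_draws: (m,) between-week regression."""
    means = []
    for theta, alpha, beta_w in zip(theta_draws, alpha_draws, beta_w_draws):
        # Next week's expected strengths: recenter, then regress by beta_w.
        theta_next = beta_w * (theta - theta.mean())
        means.append(X_next @ np.concatenate([theta_next, alpha]))
    return np.mean(means, axis=0)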

4. POSTERIOR INFERENCES

We use the model described in the preceding section to analyze regular season results of NFL football games for the years 1988-1992 and the first 10 weeks of 1993 games.

The NFL comprised a total of 28 teams during these seasons. During the regular season, each team plays a total of 16 games. The 1988-1989 seasons lasted a total of 16 weeks, the 1990-1992 seasons lasted 17 weeks (each team had one off week), and the 1993 season lasted 18 weeks (each team had two off weeks). We use the last 8 weeks of 1993 games to assess the accuracy of predictions from our model. For each game we recorded the final score for each team and the site of the game. Although use of covariate information, such as game statistics like rushing yards gained and allowed, might improve the precision of the model fit, no additional information was recorded.

4.1 Gibbs Sampler Implementation

A single "pilot" Gibbs sampler with starting values at the prior means was run to determine regions of the parameter space with high posterior mass. Seven parallel Gibbs samplers were then run with overdispersed starting values relative to the draws from the pilot sampler. Table 1 displays the starting values chosen for the parameters in the seven parallel runs. Each Gibbs sampler was run for 18,000 iterations, and convergence was diagnosed from plots and by examining the potential scale reduction (PSR), as described by Gelman and Rubin (1992), of the parameters ωw, ωs, ω0, ωh, βw, and βs; the HFA parameters; and the most recent team strength parameters. The PSR is an estimate of the factor by which the variance of the current distribution of draws in the Gibbs sampler will decrease with continued iterations. Values near 1 are indicative of convergence. In diagnosing convergence, parameters that were restricted to be positive in the model were transformed by taking logs. Except for the parameter ωw, all of the PSRs were less than 1.2. The slightly larger PSR for ωw could be explained from the plot of successive draws versus iteration number; the strong autocorrelation in simulations of ωw slowed the mixing of the different series. We concluded that by iteration 17,000 the separate series had essentially converged to the stationary distribution. For each parameter, a sample was obtained by selecting the last 1,000 values of the 18,000 in each series. This produced the final sample of 7,000 draws from the posterior distribution for our analyses.
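A compact version of the potential scale reduction computation (the standard Gelman-Rubin form; a sketch, not the authors' code):

import numpy as np

def potential_scale_reduction(chains):
    """chains: (m_chains, n_iter) draws of one (possibly log-transformed) parameter.
    Returns the Gelman-Rubin PSR; values near 1 indicate convergence."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled posterior variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(1)
print(round(potential_scale_reduction(rng.normal(size=(7, 1000))), 3))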

4.2 Parameter Summaries

Tables 2 and 3 show posterior summaries of some model parameters. The means and 95% central posterior intervals for team parameters describe team strengths after the 10th week of the 1993 regular season. The teams are ranked according to their estimated posterior means. The posterior means range from 9.06 (Dallas Cowboys) to -7.73 (New England Patriots), which suggests that on a neutral field, the best team has close to a 17-point advantage over the worst team. The 95% intervals clearly indicate that a considerable amount of variability is associated with the team-strength parameters, which may be due to the stochastic nature of team strengths. The distribution of HFAs varies from roughly 1.6 points (Dallas Cowboys, Cleveland Browns) to over 7 points (Houston Oilers). The 7-point HFA conveyed to the Oilers is substantiated by the numerous "blowouts" they have had on their home field. The HFA parameters are centered around 3.2. This value is consistent with the results of previous modeling (Glickman 1993; Harville 1980; Sallas and Harville 1988).

Table 1. Starting Values for Parallel Gibbs Samplers

                       Gibbs sampler series
Parameter      1       2       3       4        5        6        7
ωw           10.0   100.0   200.0   500.0  1,000.0  1,000.0     10.0
ωs            1.0    20.0    80.0   200.0    800.0      1.0    800.0
ω0             .5     5.0    15.0   100.0    150.0    150.0       .5
ωh          100.0    20.0     6.0     1.0       .6       .3    100.0
βw             .6      .8     .99     1.2      1.8       .6      1.8
βs             .5      .8     .98     1.2      1.8      1.8       .6

Table 2. Summaries of the Posterior Distributions of Team Strength and HFA Parameters After the First 10 Weeks of the 1993 Regular Season

Team                     Mean strength            Mean HFA
Dallas Cowboys           9.06 (2.26, 16.42)       1.62 (-1.94, 4.86)
San Francisco 49ers      7.43 (.29, 14.40)        2.77 (-.76, 6.19)
Buffalo Bills            4.22 (-2.73, 10.90)      4.25 (.91, 7.73)
New Orleans Saints       3.89 (-3.04, 10.86)      3.44 (-.01, 6.87)
Pittsburgh Steelers      3.17 (-3.66, 9.96)       3.30 (.00, 6.68)
Miami Dolphins           2.03 (-4.79, 8.83)       2.69 (-.81, 6.14)
Green Bay Packers        1.83 (-4.87, 8.66)       2.19 (-1.17, 5.45)
San Diego Chargers       1.75 (-5.02, 8.62)       1.81 (-1.70, 5.12)
New York Giants          1.43 (-5.38, 8.21)       4.03 (.75, 7.53)
Denver Broncos           1.18 (-5.75, 8.02)       5.27 (1.90, 8.95)
Philadelphia Eagles      1.06 (-5.98, 7.80)       2.70 (-.75, 6.06)
New York Jets             .98 (-5.95, 8.00)       1.86 (-1.51, 5.15)
Kansas City Chiefs        .89 (-5.82, 7.77)       4.13 (.75, 7.55)
Detroit Lions             .80 (-5.67, 7.49)       3.12 (-.31, 6.48)
Houston Oilers            .72 (-6.18, 7.51)       7.28 (3.79, 11.30)
Minnesota Vikings         .25 (-6.57, 6.99)       3.34 (-.01, 6.80)
Los Angeles Raiders       .25 (-6.43, 7.10)       3.21 (-.05, 6.55)
Phoenix Cardinals        -.15 (-6.64, 6.56)       2.67 (-.69, 5.98)
Cleveland Browns         -.55 (-7.47, 6.25)       1.53 (-2.04, 4.81)
Chicago Bears           -1.37 (-8.18, 5.37)       3.82 (.38, 7.27)
Washington Redskins     -1.46 (-8.36, 5.19)       3.73 (.24, 7.22)
Atlanta Falcons         -2.94 (-9.89, 3.85)       2.85 (-.55, 6.23)
Seattle Seahawks        -3.17 (-9.61, 3.43)       2.21 (-1.25, 5.52)
Los Angeles Rams        -3.33 (-10.18, 3.37)      1.85 (-1.61, 5.23)
Indianapolis Colts      -5.29 (-12.11, 1.63)      2.45 (-.97, 5.81)
Tampa Bay Buccaneers    -7.43 (-14.38, -.68)      1.77 (-1.69, 5.13)
Cincinnati Bengals      -7.51 (-14.74, -.68)      4.82 (1.53, 8.33)
New England Patriots    -7.73 (-14.54, -.87)      3.94 (.55, 7.34)

NOTE: Values within parentheses represent central 95% posterior intervals.

The distributions of the standard deviation parameters σ, σw, σs, σ0, and σh are shown in Figures 1 and 2. The plots show that each of the standard deviations is approximately symmetrically distributed around its mean. The posterior distribution of σ is centered just under 13 points, indicating that the score difference for a single game conditional on team strengths can be expected to vary by about 4σ ≈ 50 points. The posterior distribution of σ0 shown in Figure 1 suggests that the normal distribution of teams' abilities prior to 1988 has a standard deviation somewhere between 2 and 5, so that the a priori difference between the best and worst teams is near 15. This range of team strength appears to persist in 1993, as can be calculated from Table 2. The distribution of σh is centered near 2.3, suggesting that teams' HFAs varied moderately around a mean of 3 points.

Table 3. Summaries of the Posterior Distributions of Standard Deviations and Regression Parameters After the First 10 Weeks of the 1993 Regular Season

Parameter      Mean
σ            12.78 (12.23, 13.35)
σw             .88 (.52, 1.36)
σs            2.35 (1.14, 3.87)
σ0            3.26 (1.87, 5.22)
σh            2.28 (1.48, 3.35)
βw             .99 (.96, 1.02)
βs             .82 (.52, 1.28)

NOTE: Values within parentheses represent central 95% posterior intervals.

As shown in the empirical contour plot in Figure 2, the posterior distribution of the between-week standard deviation, σw, is concentrated on smaller values and is less dispersed than that of the between-season evolution standard deviation, σs. This difference in magnitude indicates that the types of changes that occur between weeks are likely to have less impact on a team's ability than are the changes that occur between seasons. The distribution for the between-week standard deviation is less dispersed than that for the between-season standard deviation because the data provide much more information about weekly innovations than about changes between seasons. Furthermore, the contour plot shows a slight negative posterior correlation between the standard deviations. This is not terribly surprising if we consider that the total variability due to the passage of time over an entire season is the composition of between-week variability and between-season variability. If between-week variability is small, then between-season variability must be large to compensate. An interesting feature revealed by the contour plot is the apparent bimodality of the joint distribution. This feature was not apparent from examining the marginal distributions. Two modes of (σw, σs) appear at (.6, 3) and (.9, 2).

Figure 1. Estimated Posterior Distributions: (a) Regression Standard Deviation (σ); (b) the Initial Team Strength Standard Deviation (σ0); (c) HFA Standard Deviation (σh).

Figure 2. Estimated Joint Posterior Distribution of the Week-to-Week Evolution Standard Deviation (σw) and the Season-to-Season Evolution Standard Deviation (σs).

Figure 3. Estimated Joint Posterior Distribution of the Week-to-Week Regression Effect (βw) and the Season-to-Season Regression Effect (βs).

Figure 3 shows contours of the bivariate posterior distribution of the parameters βw and βs. The contours of the plot display a concentration of mass near βs ≈ .8 and βw ≈ 1.0, as is also indicated in Table 3. The plot shows a more marked negative posterior correlation between these two parameters than between the standard deviations. The negative correlation can be explained in a manner analogous to the negative correlation between standard deviations, viewing the total shrinkage over the season as being the composition of the between-week shrinkages and the between-season shrinkage. As with the standard deviations, the data provide more precision about the between-week regression parameter than about the between-season regression parameter.

4.3 Prediction for Week 11

Predictive summaries for the week 11 games of the 1993 NFL season are shown in Table 4. The point predictions were computed as the average of the mean outcomes across all 7,000 posterior draws. Intervals were constructed empirically by simulating single-game outcomes from the predictive distribution for each of the 7,000 Gibbs samples. Of the 13 games, six of the actual score differences were contained in the 50% prediction intervals. All of the widths of the intervals were close to 18-19 points. Our point predictions were generally close to the Las Vegas line. Games where predictions differ substantially (e.g., Oilers at Bengals) may reflect information from the previous week that our model does not incorporate, such as injuries of important players.

4.4 Predictions for Weeks 12 Through 18

Once game results for a new week were available, a single-series Gibbs sampler was run using the entire dataset to obtain a new set of parameter draws. The starting values for the series were the posterior mean estimates of ωw, ωs, ω0, ωh, βw, and βs from the end of week 10. Because the posterior variability of these parameters is small, the addition of a new week's collection of game outcomes is not likely to have a substantial impact on posterior inferences. Thus our procedure takes advantage of knowing a priori the regions of the parameter space that will have high posterior mass. Having obtained data from the results of week 11, we ran a single-series Gibbs sampler for 5,000 iterations, saving the last 1,000 for predictive inferences. We repeated this procedure for weeks 12-17 in an analogous manner. Point predictions were computed as described earlier. In practice, the model could be refit periodically using a multiple-chain procedure as an alternative to using this one-chain updating algorithm. This might be advantageous in reassessing the propriety of the model or determining whether significant shifts in parameter values have occurred.

4.5 Comparison with Las Vegas Betting Line

We compared the accuracy of our predictions with those of the Las Vegas point spread on the 110 games beyond the 10th week of the 1993 season. The mean squared error (MSE) for predictions from our model for these 110 games was 165.0. This is slightly better than the MSE of 170.5 for the point spread. Similarly, the mean absolute error (MAE) from our model is 10.50, whereas the analogous result for the point spread is 10.84. Our model correctly predicted the winners of 64 of the 110 games (58.2%), whereas the Las Vegas line predicted 63. Of the 110 predictions from our model, 65 produced mean score differences that "beat the point spread"; that is, they resulted in predictions that were greater than the point spread when the actual score difference was larger than the point spread, or in predictions that were lower than the point spread when the actual score difference was lower. For this small sample, the model fit outperforms the point spread, though the difference is not large enough to generalize.

Table 4. Forecasts for NFL Games During Week 11 of the 1993 Regular Season

Week 11 games             Predicted score difference    Las Vegas line    Actual score difference
Packers at Saints                  -5.49                     -6.0            2   (-14.75, 3.92)
Oilers at Bengals                   3.35                      8.5           35   (-6.05, 12.59)
Cardinals at Cowboys              -10.77                    -12.5           -5   (-20.25, -1.50)
49ers at Buccaneers                13.01                     16.0           24   (3.95, 22.58)
Dolphins at Eagles                 -1.74                      4.0            5   (-10.93, 7.55)
Redskins at Giants                 -6.90                     -7.5          -14   (-16.05, 2.35)
Chiefs at Raiders                  -2.57                     -3.5           11   (-11.63, 6.86)
Falcons at Rams                    -1.46                     -3.5           13   (-10.83, 7.93)
Browns at Seahawks                   .40                     -3.5          -17   (-8.98, 9.72)
Vikings at Broncos                 -6.18                     -7.0            3   (-15.42, 3.25)
Jets at Colts                       3.78                      3.5           14   (-5.42, 13.19)
Bears at Chargers                  -4.92                     -8.5            3   (-14.31, 4.28)
Bills at Steelers                  -2.26                     -3.0          -23   (-11.53, 7.13)

NOTE: Values within parentheses represent central 50% prediction intervals.


Figure 4. Estimated Bivariate Distributions for Regression Variance Diagnostics. (a) Scatterplot of the joint posterior distribution of D1(y; θ*) and D1(y*; θ*), where D1(·; ·) is a discrepancy measuring the range of regression variance estimates among the six seasons and (θ*, y*) are simulations from the posterior and posterior predictive distributions; (b) scatterplot of the joint posterior distribution of D2(y; θ*) and D2(y*; θ*) for D2(·; ·) a discrepancy measuring the range of regression variance estimates among the 28 teams.

However, the results here suggest that the state-space model yields predictions that are comparable to those implied by the betting line.
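For readers who wish to reproduce this style of comparison, the sketch below computes the MSE, MAE, winner percentage, and against-the-spread record from arrays of predictions, betting lines, and actual score differences; the five-game arrays are made-up placeholders, not the games analyzed here.

import numpy as np

# Placeholder data: model predictions, Las Vegas lines, and actual outcomes
# (home score minus visitor score) for a handful of hypothetical games.
pred   = np.array([-5.5,  3.4, -10.8, 13.0, -1.7])
line   = np.array([-6.0,  8.5, -12.5, 16.0,  4.0])
actual = np.array([ 2.0, 35.0,  -5.0, 24.0,  5.0])

mse = np.mean((actual - pred) ** 2)
mae = np.mean(np.abs(actual - pred))
winners = np.mean(np.sign(pred) == np.sign(actual))          # fraction of winners called
no_push = line != actual                                     # drop games tied against the line
ats = np.mean(np.sign(pred - line)[no_push] == np.sign(actual - line)[no_push])

print(round(mse, 1), round(mae, 2), winners, round(ats, 2))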

5. DIAGNOSTICS

Model validation and diagnosis is an important part of the model-fitting process. In complex models, however, diagnosing invalid assumptions or lack of fit often cannot be carried out using conventional methods. In this section we examine several model assumptions through the use of posterior predictive diagnostics. We include a brief description of the idea behind posterior predictive diagnostics. We also describe how model diagnostics were able to suggest an improvement to an earlier version of the model.

The approach to model checking using posterior predictive diagnostics has been discussed in detail by Gelman et al. (1996), and the foundations of this approach have been described by Rubin (1984). The strategy is to construct discrepancy measures that address particular aspects of the data that one suspects may not be captured by the model. Discrepancies may be ordinary test statistics, or they may depend on both data values and parameters. The discrepancies are computed using the actual data, and the resulting values are compared to the reference distribution obtained using simulated data from the posterior predictive distribution. If the actual data are "typical" of the draws from the posterior predictive distribution under the model, then the posterior distribution of the discrepancy measure evaluated at the actual data will be similar to the posterior distribution of the discrepancy evaluated at the simulated datasets. Otherwise, the discrepancy measure provides some indication that the model may be misspecified.

To be concrete, we may construct a "generalized" test statistic, or discrepancy, D(y; θ), which may be a function not only of the observed data, generically denoted by y, but also of model parameters, generically denoted by θ. We compare the posterior distribution of D(y; θ) to the posterior predictive distribution of D(y*; θ), where we use y* to denote hypothetical replicate data generated under the model with the same (unknown) parameter values. One possible summary of the evaluation is the tail probability, or p value, computed as

p = Pr(D(y*; θ) ≥ D(y; θ) | y)

or

p = Pr(D(y*; θ) ≤ D(y; θ) | y),

depending on the definition of the discrepancy. In practice, the relevant distributions or the tail probability can be approximated through Monte Carlo integration by drawing samples from the posterior distribution of θ and then the posterior predictive distribution of y* given θ.
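A minimal sketch of that Monte Carlo approximation follows, using a range-of-season-averages discrepancy of the kind applied in Section 5.1; the simulated "observed" data and parameter draws are placeholders rather than output from the actual model.

import numpy as np

# Sketch of a posterior predictive check: for each posterior draw, compare a
# discrepancy computed on the observed data with the same discrepancy
# computed on a replicate dataset simulated under the model.
rng = np.random.default_rng(0)
n_games, n_seasons, n_draws = 1200, 6, 300
season = rng.integers(0, n_seasons, n_games)           # season label per game
y_obs = rng.normal(0, 13, n_games)                      # stand-in observed residual-like data

def discrepancy(values, labels):
    """Range of per-season averages of squared values (a D1-style measure)."""
    means = [np.mean(values[labels == s] ** 2) for s in range(n_seasons)]
    return max(means) - min(means)

exceed = 0
for _ in range(n_draws):
    sigma = rng.uniform(12, 14)                         # stand-in posterior draw
    y_rep = rng.normal(0, sigma, n_games)               # replicate data under the model
    if discrepancy(y_rep, season) >= discrepancy(y_obs, season):
        exceed += 1
print(exceed / n_draws)                                 # approximate tail probability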

The choice of suitable discrepancy measures, D, depends on the problem. We try to define measures that evaluate the fit of the model to features of the data that are not explicitly accounted for in the model specification. Here we consider diagnostics that assess the homogeneity of variance assumptions in the model and diagnostics that assess assumptions concerning the HFA. The HFA diagnostics were useful in detecting a failure of an earlier version of the model. Our summary measures D are functions of Bayesian residuals as defined by Chaloner and Brant (1988) and Zellner (1975). As an alternative to focusing on summary measures D, the individual Bayesian residuals can be used to search for outliers or to construct a "distribution" of residual plots; we do not pursue this approach here.


5.1 Regression Variance

For a particular game played between teams i and i' at team i's home field, let the squared residual be the squared difference between the observed outcome and the expected outcome under the model. Averages of these squared residuals across games can be interpreted as estimates of the regression variance (the variance of the outcome given its mean). The model assumes that this variance is constant across seasons and for all competing teams. We consider two discrepancy measures that are sensitive to failures of these assumptions. Let D1(y; θ) be the difference between the largest of the six annual average squared residuals and the smallest of the six annual average squared residuals. Then D1(y*; θ*) is the value of this diagnostic evaluated at simulated parameters θ* and simulated data y*, and D1(y; θ*) is the value evaluated at the same simulated parameters but using the actual data. Based on 300 samples of parameters from the posterior distribution and simulated data from the posterior predictive distribution, the approximate bivariate posterior distribution of (D1(y; θ*), D1(y*; θ*)) is shown in Figure 4a. The plot shows that large portions of the distribution of the discrepancies lie both above and below the line D1(y; θ*) = D1(y*; θ*), with the relevant tail probability equal to .35. This suggests that the year-to-year variation in the regression variance of the actual data is quite consistent with that expected under the model (as evidenced by the simulated datasets).

As a second discrepancy measure, we can compute the average squared residual for each team and then calculate the difference between the maximum of the 28 team-specific estimates and the minimum of the 28 team-specific estimates. Let D2(y*; θ*) be the value of this diagnostic measure for the simulated data y* and simulated parameters θ*, and let D2(y; θ*) be the value for the actual data and simulated parameters. The approximate posterior distribution of (D2(y; θ*), D2(y*; θ*)) based on the same 300 samples of parameters and simulated data is shown in Figure 4b.

The value of D2 based on the actual data tends to be larger than the value based on the posterior predictive simulations. The relevant tail probability P(D2(y*; θ) ≥ D2(y; θ) | y) is not terribly small (.14), so we conclude that there is no evidence of heterogeneous regression variances for different teams. Thus we likely would not be interested in extending our model in the direction of a nonconstant regression variance.

5.2 Site Effect: A Model Diagnostics Success Story

We can use a slightly modified version of the game residuals, denoted rii', to search for a failure of the model in accounting for HFA. The rii' are termed site-effect residuals, because they take the observed outcome and subtract out the estimated team strengths but do not subtract out the HFA. As we did with the regression variance, we can examine differences in the magnitude of the HFA over time by calculating the average value of the rii' for each season, and then examining the range of these averages. Specifically, for a posterior predictive dataset y* and for a draw θ* from the posterior distribution of all parameters, let D3(y*; θ*) be the difference between the maximum and the minimum of the average site-effect residuals by season.

Figure 5. Estimated Bivariate Distributions for Site-Effect Diagnostics. (a) Scatterplot of the joint posterior distribution of D3(y; θ*) and D3(y*; θ*), where D3(·; ·) is a discrepancy measuring the range of HFA estimates among the six seasons and (θ*, y*) are simulations from the posterior and posterior predictive distributions; (b) scatterplot of the joint posterior distribution of D4(y; θ*) and D4(y*; θ*) for D4(·; ·) a discrepancy measuring the range of HFA estimates among the 28 teams.


Using the same 300 values of θ* and y* as before, we obtain the estimated bivariate distribution of (D3(y; θ*), D3(y*; θ*)) shown in Figure 5a.

The plot reveals no particular pattern, although there is a tendency for D3(y; θ*) to be less than the discrepancy evaluated at the simulated datasets. This seems to be a chance occurrence (the tail probability equals .21).

We also include one other discrepancy measure, although it will be evident that our model fits this particular aspect of the data. We examined the average site-effect residuals across teams to assess whether the site effect depends on team. We calculated the average value of rii' for each team. Let D4(y*; θ*) be the difference between the maximum and minimum of these 28 averages for simulated data y*. It should be evident that the model will fit this aspect of the data, because we have used a separate parameter for each team's advantage. The approximate bivariate distribution of (D4(y; θ*), D4(y*; θ*)) is shown in Figure 5b. There is no evidence of lack of fit (the tail probability equals .32).

This last discrepancy measure is included here, despite the fact that it measures a feature of the data that we have explicitly addressed in the model, because the current model was not the first model that we constructed. Earlier, we fit a model with a single HFA parameter for all teams. Figure 6 shows that for the single-HFA-parameter model, the observed values of D4(y; θ*) were generally greater than the values of D4(y*; θ*), indicating that the average site-effect residuals varied significantly more from team to team than was expected under the model (tail probability equal to .05).

This suggested the model presented here, in which each team has a separate HFA parameter.

Figure 6. Estimated Bivariate Distribution for the Site-Effect Diagnostic From a Poor-Fitting Model. The scatterplot shows the joint posterior distribution of D4(y; θ*) and D4(y*; θ*) for a model that includes only a single parameter for the site effect rather than 28 separate parameters, one for each team. The values of D4(y; θ*) are generally larger than the values of D4(y*; θ*), suggesting that the fitted model may not be capturing a source of variability in the observed data.

5.3 Sensitivity to Heavy Tails

Our model assumes that outcomes are normally distributed conditional on the parameters, an assumption supported by Stern (1991). Rerunning the model with t distributions in place of normal distributions is straightforward, because t distributions can be expressed as scale mixtures of normal distributions (see, e.g., Gelman, Carlin, Stern, and Rubin 1995; and Smith 1983). Rather than redo the entire analysis, we checked the sensitivity of our inferences to the normal assumption by reweighting the posterior draws from the Gibbs sampler by ratios of importance weights (relating the normal model to a variety of t models). The reweighting is easily done and provides information about how inferences would be likely to change under alternative models. Our conclusion is that using a robust alternative can slightly alter estimates of team strength but does not have a significant effect on the predictive performance. It should be emphasized that the ratios of importance weights can be unstable, so a more definitive discussion of inference under a particular t model (e.g., 4 df) would require a complete reanalysis of the data.
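A minimal sketch of the reweighting idea, assuming posterior draws are available as residuals and scale parameters, is given below; the function and variable names are ours, and the fake draws stand in for actual Gibbs output.

import numpy as np
from scipy import stats

# Posterior draws obtained under a normal observation model are reweighted by
# the ratio of t to normal likelihoods to approximate inferences under a
# heavier-tailed model.
def t_over_normal_logweight(resid, scale, nu):
    """Log importance ratio for one posterior draw.

    resid : residuals y - E(y | theta) implied by the draw
    scale : residual standard deviation implied by the draw
    nu    : degrees of freedom of the alternative t model
    """
    log_t = stats.t.logpdf(resid, df=nu, loc=0.0, scale=scale).sum()
    log_n = stats.norm.logpdf(resid, loc=0.0, scale=scale).sum()
    return log_t - log_n

# Example with fake draws: reweight the posterior mean of some scalar summary.
rng = np.random.default_rng(1)
n_draws, n_games = 1000, 200
summaries = rng.normal(size=n_draws)                  # e.g., a team-strength summary per draw
residuals = rng.normal(0, 13, size=(n_draws, n_games))
log_w = np.array([t_over_normal_logweight(residuals[s], 13.0, nu=4)
                  for s in range(n_draws)])
w = np.exp(log_w - log_w.max())
w /= w.sum()
print(np.sum(w * summaries))                          # reweighted mean (vs. summaries.mean())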

6. CONCLUSIONS

Our model for football game outcomes assumes that team strengths can change over time in a manner described by a normal state-space model. In previous state-space modeling of football scores (Harville 1977, 1980; Sallas and Harville 1988), some model parameters were estimated and then treated as fixed in making inferences on the remaining parameters. Such an approach ignores the variability associated with these parameters. The approach taken here, in contrast, is fully Bayesian in that we account for the uncertainty in all model parameters when making posterior or predictive inferences.

Our data analysis suggests that the model can be improved in several different dimensions. One could argue that teams' abilities should not shrink or expand around the mean from week to week, and because the posterior distribution of the between-week regression parameter Bw is not substantially different from 1, the model may be simplified by setting it to 1. Also, further exploration may be necessary to assess the assumption of a heavy-tailed distribution for game outcomes. Finally, as the game of football continues to change over time, it may be necessary to allow the evolution regression and variance parameters or the regression variance parameter to vary over time.

Despite the room for improvement, we feel that our model captures the main components of variability in football game outcomes. Recent advances in Bayesian computational methods allow us to fit a realistic complex model and to diagnose model assumptions in ways that would otherwise be difficult to carry out. Predictions from our model seem to perform as well, on average, as the Las Vegas point spread, so our model appears to track team strengths in a manner similar to that of the best expert opinion.


APPENDIX: CONDITIONAL DISTRIBUTIONS FOR MCMC SAMPLING

A.1 Conditional Posterior Distribution of the Team Strengths, HFA Parameters, and Observation Precision

The conditional posterior distribution of the team strength parameters, HFA parameters, and observation precision is normal-gamma: the conditional posterior distribution of the observation precision given the evolution precision and regression parameters (ω0, ωh, ωw, ωs, βw, βs) is gamma, and the conditional posterior distribution of the team strengths and home-field parameters given all other parameters is an (M + 1)p-variate normal distribution, where p is the number of teams and M is the total number of weeks for which data are available. It is advantageous to sample using results from the Kalman filter (Carter and Kohn 1994; Fruhwirth-Schnatter 1994; Glickman 1993) rather than to consider this (M + 1)p-variate conditional normal distribution as a single distribution. This idea is summarized here.

The Kalman filter (Kalman 1960; Kalman and Bucy 1961) is used to compute the normal-gamma posterior distribution of the final week's parameters, marginalizing over the previous weeks' vectors of team strength parameters. This distribution is obtained by a sequence of recursive computations that alternately update the distribution of parameters when new data are observed and then update the distribution reflecting the passage of time. A sample from this posterior distribution is drawn. Samples of team strengths for previous weeks are then drawn with a back-filtering algorithm, which draws recursively from the normal distributions for the parameters from earlier weeks, each conditional on the draw for the following week. The result of this procedure is a sample of values from the desired conditional posterior distribution.
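The forward-filtering, backward-sampling idea can be illustrated with a univariate sketch; the local-level setup below (a single strength series, with illustrative names phi, W, and V) is a simplification of the multivariate filter used in the paper, not its actual implementation.

import numpy as np

def ffbs(y, phi, W, V, m0=0.0, C0=10.0, rng=None):
    """Draw one sample of the state path x_1..x_T given observations y_1..y_T."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y)
    m = np.zeros(T)   # filtered means
    C = np.zeros(T)   # filtered variances
    for t in range(T):
        a = phi * (m[t - 1] if t > 0 else m0)              # one-step-ahead mean
        R = phi ** 2 * (C[t - 1] if t > 0 else C0) + W     # one-step-ahead variance
        K = R / (R + V)                                    # Kalman gain
        m[t] = a + K * (y[t] - a)
        C[t] = (1 - K) * R
    x = np.zeros(T)
    x[T - 1] = rng.normal(m[T - 1], np.sqrt(C[T - 1]))     # sample the final state
    for t in range(T - 2, -1, -1):                         # back-sample earlier states
        B = phi * C[t] / (phi ** 2 * C[t] + W)
        mean = m[t] + B * (x[t + 1] - phi * m[t])
        var = C[t] - B * phi * C[t]
        x[t] = rng.normal(mean, np.sqrt(var))
    return x

# Example: a noisy random-walk-like series of weekly "strengths."
rng = np.random.default_rng(3)
truth = np.cumsum(rng.normal(0, 1, 17))
sample_path = ffbs(truth + rng.normal(0, 3, 17), phi=1.0, W=1.0, V=9.0, rng=rng)
print(np.round(sample_path, 2))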

A.2 Conditional Posterior Distribution of the Precision Parameters

Conditional on the remaining parameters and the data, the parameters ω0, ωh, ωw, and ωs are independent gamma random variables.

A.3 Conditional Posterior Distribution of the Regression Parameters

Conditional on the remaining parameters and the data, βw and βs are independent random variables with normal distributions, with conditional means and variances that depend on the remaining parameters and the data.

[Received December 1996. Revised August 1997.]

REFERENCES

Amoako-Adu, B., Manner, H., and Yagil, J. (1985), "The Efficiency of Certain Speculative Markets and Gambler Behavior," Journal of Economics and Business, 37, 365-378.
Carlin, B. P., Polson, N. G., and Stoffer, D. S. (1992), "A Monte Carlo Approach to Nonnormal and Nonlinear State-Space Modeling," Journal of the American Statistical Association, 87, 493-500.
Carter, C. K., and Kohn, R. (1994), "On Gibbs Sampling for State-Space Models," Biometrika, 81, 541-553.
Chaloner, K., and Brant, R. (1988), "A Bayesian Approach to Outlier Detection and Residual Analysis," Biometrika, 75, 651-659.
de Jong, P., and Shephard, N. (1995), "The Simulation Smoother for Time Series Models," Biometrika, 82, 339-350.
Fruhwirth-Schnatter, S. (1994), "Data Augmentation and Dynamic Linear Models," Journal of Time Series Analysis, 15, 183-202.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 972-985.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Gelman, A., Meng, X., and Stern, H. S. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences," Statistical Science, 7, 457-511.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Glickman, M. E. (1993), "Paired Comparison Models With Time-Varying Parameters," unpublished Ph.D. dissertation, Harvard University, Dept. of Statistics.
Harrison, P. J., and Stevens, C. F. (1976), "Bayesian Forecasting," Journal of the Royal Statistical Society, Ser. B, 38, 240-247.
Harville, D. (1977), "The Use of Linear Model Methodology to Rate High School or College Football Teams," Journal of the American Statistical Association, 72, 278-289.
(1980), "Predictions for National Football League Games via Linear-Model Methodology," Journal of the American Statistical Association, 75, 516-524.
Kalman, R. E. (1960), "A New Approach to Linear Filtering and Prediction Problems," Journal of Basic Engineering, 82, 34-45.
Kalman, R. E., and Bucy, R. S. (1961), "New Results in Linear Filtering and Prediction Theory," Journal of Basic Engineering, 83, 95-108.
Rosner, B. (1976), "An Analysis of Professional Football Scores," in Management Science in Sports, eds. R. E. Machol, S. P. Ladany, and D. G. Morrison, New York: North-Holland, pp. 67-78.
Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
Sallas, W. M., and Harville, D. A. (1988), "Noninformative Priors and Restricted Maximum Likelihood Estimation in the Kalman Filter," in Bayesian Analysis of Time Series and Dynamic Models, ed. J. C. Spall, New York: Marcel Dekker, pp. 477-508.
Shephard, N. (1994), "Partial Non-Gaussian State Space," Biometrika, 81, 115-131.
Smith, A. F. M. (1983), "Bayesian Approaches to Outliers and Robustness," in Specifying Statistical Models From Parametric to Nonparametric, Using Bayesian or Non-Bayesian Approaches, eds. J. P. Florens, M. Mouchart, J. P. Raoult, L. Simar, and A. F. M. Smith, New York: Springer-Verlag, pp. 13-35.
Stern, H. (1991), "On the Probability of Winning a Football Game," The American Statistician, 45, 179-183.
(1992), "Who's Number One? Rating Football Teams," in Proceedings of the Section on Statistics in Sports, American Statistical Association, pp. 1-6.
Stefani, R. T. (1977), "Football and Basketball Predictions Using Least Squares," IEEE Transactions on Systems, Man, and Cybernetics, 7, 117-120.
(1980), "Improved Least Squares Football, Basketball, and Soccer Predictions," IEEE Transactions on Systems, Man, and Cybernetics, 10, 116-123.
Thompson, M. (1975), "On Any Given Sunday: Fair Competitor Orderings With Maximum Likelihood Methods," Journal of the American Statistical Association, 70, 536-541.
West, M., and Harrison, P. J. (1990), Bayesian Forecasting and Dynamic Models, New York: Springer-Verlag.
Zellner, A. (1975), "Bayesian Analysis of Regression Error Terms," Journal of the American Statistical Association, 70, 138-144.
Zuber, R. A., Gandar, J. M., and Bowers, B. D. (1985), "Beating the Spread: Testing the Efficiency of the Gambling Market for National Football League Games," Journal of Political Economy, 93, 800-806.


Chapter 6

Predictions for National Football League Games via Linear-Model Methodology

DAVID HARVILLE*

Results on mixed linear models were used to develop a procedure for predicting the outcomes of National Football League games. The predictions are based on the differences in score from past games. The underlying model for each difference in score takes into account the home-field advantage and the difference in the yearly characteristic performance levels of the two teams. Each team's yearly characteristic performance levels are assumed to follow a first-order autoregressive process. The predictions for 1,320 games played between 1971 and 1977 had an average absolute error of 10.68, compared with 10.49 for bookmaker predictions.

KEY WORDS: Football predictions; Mixed linear models; Variance components; Maximum likelihood; Football ratings.

1. INTRODUCTION

Suppose that we wish to predict the future price of a common stock or to address some other complex real-life prediction problem. What is the most useful role for statistics?

One approach is to use the available information rather informally, relying primarily on intuition and on past experience and employing no statistical methods or only relatively simple statistical methods. A second approach is to rely exclusively on some sophisticated statistical algorithm to produce the predictions from the relevant data. In the present article, these two approaches are compared in the context of predicting the outcomes of National Football League (NFL) games.

The statistical algorithm to be used is set forth in Section 2. It is closely related to an algorithm devised by Harville (1977b) for rating high school or college football teams.

The essentially nonstatistical predictions that are to be compared with the statistical predictions are those given by the betting line. The betting line gives the favored team for each game and the point spread, that is, the number of points by which the favorite is expected to win.

If a gambler bets on the favorite (underdog), he wins (loses) his bet when the favorite wins the game by more than the point spread, but he loses (wins) his bet when the favorite either loses or ties the game or wins the game by less than the point spread. On a $10 bet, the gambler pays the bookmaker an additional dollar (for a total of $11) when he loses his bet and receives a net of $10 when he wins. If the favorite wins the game by exactly the point spread, the bet is in effect cancelled (Merchant 1973). To break even, the gambler must win 52.4 percent of those bets that result in either a win or a loss (assuming that the bets are for equal amounts); the break-even rate is the p that satisfies 10p = 11(1 − p), namely p = 11/21 ≈ .524.

* David Harville is Professor, Department of Statistics, Iowa State University, Ames, IA 50011. This article is based on an invited paper (Harville 1978) presented at the 138th Annual Meeting of the American Statistical Association, San Diego, CA.

Merchant described the way in which the betting line is established. A prominent bookmaker devises an initial line, which is known as the outlaw line, the early line, or the service line. This line is such that, in his informed opinion, the probability of winning a bet on the favorite equals the probability of winning a bet on the underdog.

A select group of knowledgeable professional gamblers are allowed to place bets (in limited amounts) on the basis of the outlaw line. A series of small adjustments is made in the outlaw line until an approximately equal amount of the professionals' money is being attracted on either side. The betting line that results from this process is the official opening line, which becomes available on Tuesday for public betting.

Bets can be placed until the game is played, which is generally on Sunday but can be as early as Thursday or as late as Monday. If at any point during the betting period the bookmaker feels that there is too big a discrepancy between the amount being bet on the favorite and the amount being bet on the underdog, he may make a further adjustment in the line.

The nonstatistical predictions used in the present study are those given by the official opening betting line. These predictions can be viewed as the consensus opinion of knowledgeable professional gamblers.

The statistical algorithm that is set forth in Section 2 can be used to rate the various NFL teams as well as to make predictions. While the prediction and rating problems are closely related, there are also some important differences, which are discussed in Section 5.

2. STATISTICAL PREDICTIONS

Each year's NFL schedule consists of three parts: preseason or exhibition games, regular-season games, and postseason or playoff games. The statistical algorithm presented in Sections 2.2 through 2.4 translates scores from regular-season and playoff games that have already been played into predictions for regular-season and playoff games to be played in the future. The model that serves as the basis for this algorithm is described in Section 2.1.

© Journal of the American Statistical Association, September 1980, Volume 75, Number 371, Applications Section


The scores of exhibition games were not used in making the statistical predictions, and predictions were not attempted for future exhibition games. The rationale was that these games are hard to incorporate into the model and thus into the algorithm and that they have very little predictive value anyhow. Merchant (1973) argues that, in making predictions for regular-season games, it is best to forget about exhibition games.

2.1 Underlying Model

Suppose that the scores to be used in making the predictions date back to Year F. Ultimately, F is to be chosen so that the interlude between the beginning of Year F and the first date for which predictions are required is long enough that the effect of including earlier scores is negligible. For each year, number the regular-season and playoff games 1, 2, 3, ... in chronological order.

Number the NFL teams 1, 2, 3, .... New teams are formed by the NFL from time to time. If Team i is added after Year F, let F(i) represent the year of addition. Otherwise, put F(i) = F. The home team and the visiting team for the kth game in Year j are denoted by h(j, k) and v(j, k), respectively. (If the game were played on a neutral field, h(j, k) is taken arbitrarily to be one of the two participating teams, and v(j, k) is taken to be the other.)

Let Sjk equal the home team's score minus the visiting team's score for the kth game in Year j. The prediction algorithm presented in Sections 2.2 through 2.4 depends only on the scores, and depends on the scores only through the Sjk's.

Our model for the Sjk's involves conceptual quantities H and Tim (i = 1, 2, ...; m = F(i), F(i) + 1, ...). The quantity H is an unknown parameter that represents the home-field advantage (in points) that accrues to a team from playing on its own field rather than a neutral field. The quantity Tim is a random effect that can be interpreted as the characteristic performance level (in points) of Team i in Year m relative to that of an "average" team in Year m.

The model equation for Sjk is

Sjk = Th(j,k),j − Tv(j,k),j + Rjk

if the game is played on a neutral field, or

Sjk = H + Th(j,k),j − Tv(j,k),j + Rjk

if it is not. Here, Rjk is a random residual effect. Assume that E(Rjk) = 0, that var(Rjk) = σ_R², where σ_R² is an unknown, strictly positive parameter, and that the Rjk's are uncorrelated with each other and with the Tim's.

Suppose that cov(Tim, Ti'm') = 0 if i ≠ i'; that is, that the yearly characteristic performance levels of any given team are uncorrelated with those of any other team. The yearly characteristic performance levels of Team i are assumed to follow a first-order autoregressive process

Tim = ρ Ti,m−1 + Uim    (m = F(i) + 1, F(i) + 2, ...),

where Ui,F(i)+1, Ui,F(i)+2, ... are random variables that have zero means and common unknown variance σ_U² and that are uncorrelated with each other and with Ti,F(i), and where ρ is an unknown parameter satisfying 0 < ρ < 1.

It remains to specify assumptions, for each i, about E[Ti,F(i)] and var[Ti,F(i)], that is, about the mean and variance of the first yearly characteristic performance level for Team i. The sensitivity of the prediction procedure to these specifications depends on the proximity (in time) of the predicted games to the beginning of Year F(i) and on whether the predicted games involve Team i. For i such that F(i) = F, that is, for teams that date back to Year F, the prediction procedure will be relatively insensitive to these specifications, provided that F is sufficiently small, that is, provided the formation of the data base was started sufficiently in advance of the first date for which predictions are required.

Put σ_T² = σ_U²/(1 − ρ²), and for convenience assume that

E[Ti,F(i)] = 0 and var[Ti,F(i)] = σ_T²    (2.1)

for i such that F(i) = F. Then, for any given year, the yearly characteristic performance levels of those teams that date back to Year F have zero means and common variance σ_T², as would be the case if they were regarded as a random sample from an infinite population having mean zero and variance σ_T². Moreover, for any such team, the correlation between its characteristic performance levels for any two years m and m' is ρ^|m'−m|, which is a decreasing function of elapsed time.

For i such that F(i) > F, that is, for teams that came into being after Year F, it is assumed that

E[Ti,F(i)] = uF(i) and var[Ti,F(i)] = τF(i)²,

where uF(i) and τF(i)² are quantities that are to be supplied by the user of the prediction procedure. The quantities uF(i) and τF(i)² can be regarded as the mean and variance of a common prior distribution for the initial yearly characteristic performance levels of expansion teams. Information on the performance of expansion teams in their first year that predates Year F(i) can be used in deciding on values for uF(i) and τF(i)².

The model for the Sjk's is similar to that applied to high school and college football data by Harville (1977b). One distinguishing feature is the provision for data from more than one year.
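To make the model concrete, the following sketch simulates yearly team strengths and one week of score differences under assumed parameter values roughly in line with the NFL estimates reported later in Table 1; all variable names and the random schedule are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n_teams, n_years, games_per_week = 28, 6, 14
H = 2.3                             # home-field advantage (points)
rho = 0.80                          # year-to-year autoregression of team strength
sigma_R = np.sqrt(175.0)            # residual (within-teams) standard deviation
sigma_T = np.sqrt(0.27 * 175.0)     # stationary SD of yearly strengths (lambda * sigma_R^2)
sigma_U = sigma_T * np.sqrt(1 - rho ** 2)

# Yearly characteristic performance levels T[i, m].
T = np.zeros((n_teams, n_years))
T[:, 0] = rng.normal(0.0, sigma_T, n_teams)
for m in range(1, n_years):
    T[:, m] = rho * T[:, m - 1] + rng.normal(0.0, sigma_U, n_teams)

# Simulate one week of games in the final year on a random schedule.
home = rng.permutation(n_teams)[:games_per_week]
away = np.setdiff1d(np.arange(n_teams), home)[:games_per_week]
S = H + T[home, -1] - T[away, -1] + rng.normal(0.0, sigma_R, games_per_week)
print(np.round(S, 1))               # simulated home-minus-visitor score differences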

2.2 Preliminaries

We consider the problem of predicting SJK from SF1, SF2, ..., SLG, where either J = L and K > G or J > L; that is, the problem of predicting the winner and the margin of victory for a future game based on the information accumulated as of Game G in Year L.


This problem is closely related to that of estimating or predicting H and Tim (i = 1, 2, ...; m = F(i), F(i) + 1, ...).

Take λ = σ_T²/σ_R². If λ and ρ were given, H would have a unique minimum-variance linear unbiased (Aitken) estimator, which we denote by H(λ, ρ). Define Tim(λ, ρ, H) to be the conditional expectation of Tim given SF1, SF2, ..., SLG, where the conditional expectation is taken under the assumption that Tim and SF1, SF2, ..., SLG are jointly normal or, equivalently, is taken to be Hartigan's (1969) linear expectation. Put Tim(λ, ρ) = Tim(λ, ρ, H(λ, ρ)).

The quantity Tim(λ, ρ) is the best linear unbiased predictor (BLUP) of Tim (in the sense described by Harville 1976) for the case in which λ and ρ are given; and, for that same case, the quantity SJK(λ, ρ), defined as follows, is the BLUP of SJK:

SJK(λ, ρ) = Th(J,K),J(λ, ρ) − Tv(J,K),J(λ, ρ)

if the JKth game is played on a neutral field, or

SJK(λ, ρ) = H(λ, ρ) + Th(J,K),J(λ, ρ) − Tv(J,K),J(λ, ρ)

if it is not. Let MJK(σ_R², λ, ρ) denote the mean squared difference between SJK(λ, ρ) and SJK, that is, E{[SJK(λ, ρ) − SJK]²}. Specific representations for H(λ, ρ), Tim(λ, ρ), and MJK(σ_R², λ, ρ) can be obtained as special cases of representations given, for example, by Harville (1976).

2.3 Estimation of Model Parameters

In practice, λ and ρ (and σ_R²) are not given and must be estimated. One approach to the estimation of these parameters is to use Patterson and Thompson's (1971) restricted maximum likelihood procedure. (See, e.g., Harville's (1977a) review article for a general description of this procedure.)

Suppose that SF1, SF2, ..., SLG constitute the data available for estimating λ, ρ, and σ_R². For purposes of estimating these parameters, we assume that the data are jointly normal, and we take E[Ti,F(i)] and var[Ti,F(i)] to be of the form (2.1) for all i; however, we eliminate from the data set any datum Sjk for which h(j, k) or v(j, k) corresponds to an expansion team formed within a specified number of years of Year j (a cutoff that is to be chosen by the user). Extending the assumption (2.1) to all i simplifies the estimation procedure, while the exclusion of games involving expansion teams in their early years desensitizes the procedure to the effects of this assumption.

We write H and Tim for the quantities H(λ, ρ) and Tim(λ, ρ) defined in Section 2.2 (with allowances for the deletions in the data set and the change in assumptions). Let Xjk equal 0 or 1 depending on whether or not the jkth game is played on a neutral field.

The likelihood equations for the restricted maximum likelihood procedure can be put into the form

Qi − E(Qi) = 0    (i = 1, 2, 3),    (2.7)

where the Qi are quadratic forms that involve the data together with H and the Tim's. Equations (2.7) can be solved numerically by the same iterative numerical algorithm used by Harville (1977b, p. 288). This procedure calls for the repeated evaluation of the Qi's and their expectations for various trial values of λ and ρ. Making use of identities (2.11), (2.12), and (2.13) reduces the problem of evaluating the Qi's and their expectations for particular values of λ and ρ to the problem of evaluating H and the Tim's and various elements of their dispersion matrix. Kalman filtering and smoothing algorithms (suitably modified for mixed models, as described by Harville 1979) can be used for maximum efficiency in carrying out the computations associated with the latter problem.

The amount of computation required to evaluate H and the Tim's and the relevant elements of their dispersion matrix, for fixed values of λ and ρ, may not be feasible if data from a large number of years are being used. The procedure for estimating λ, ρ, and σ_R² can be modified in these instances by, for example, basing the "estimates" Ti,k+1 and Tik in the term (Ti,k+1 − ρTik)² of (2.8) on only those data accumulated through Year k + Y, for some Y, rather than on all the data. Such modifications can significantly reduce the amount of storage and computation required to evaluate the Qi's and their expectations. (The modifications reduce the amount of smoothing that must be carried out in the Kalman algorithm.)

This modified estimation procedure can be viewed as a particular implementation of the approximate restricted maximum likelihood approach outlined by Harville (1977a, Sec. 7). This approach seems to have produced a reasonable procedure even though it is based on the assumption of a distributional form (multivariate normal) that differs considerably from the actual distributional form of the Sjk's.

2.4 Prediction Algorithm

Let λ̂, ρ̂, and σ̂_R² represent estimates of λ, ρ, and σ_R², respectively. In particular, we can take λ̂, ρ̂, and σ̂_R² to be the estimates described in Section 2.3. Let

Ĥ = H(λ̂, ρ̂) and T̂im = Tim(λ̂, ρ̂)    (i = 1, 2, ...; m = F(i), F(i) + 1, ...).

The quantity Ĥ gives an estimate of H, and T̂im gives an estimate or prediction of Tim.

It can be shown that Ĥ can be expressed in terms of N, G, and the Dim's, where N equals the total number of games played minus the number of games played on neutral fields, G equals the grand total of all points scored by home teams minus the grand total for visiting teams, and Dim equals the number of games played in Year m by Team i on its home field minus the number played on its opponents' fields. The NFL schedule is such that, if it were not for playoff games and for games not yet played in Year L, all of the Dim's would equal zero, and Ĥ would coincide with the ordinary average N⁻¹G.

It can also be shown that, if F(i) = L, that is, if only one season of data or a partial season of data is available on Team i, then T̂iL takes the form of a "corrected total" for Team i divided by an inflated number of games, where NiL equals the number of games played (in Year L) by Team i, GiL equals the total points scored (in Year L) by Team i minus the total scored against it by its opponents, and r(j) equals Team i's opponent in its jth game (of Year L). Thus, if F(i) = L, the estimator T̂iL is seen to be a "shrinker"; that is, instead of the "corrected total" for the ith team being divided by NiL, it is divided by NiL plus a positive constant determined by λ̂. If F(i) < L, that is, if more than one season of data is available on Team i, the form of the estimator is similar.

The prediction for the outcome SJK of a future game is taken to be

ŜJK = Ĥ + T̂h(J,K),J − T̂v(J,K),J

(with Ĥ omitted if the game is to be played on a neutral field). An estimate of the mean squared error of this prediction is given by M̂JK = MJK(σ̂_R², λ̂, ρ̂). The estimate M̂JK underestimates the mean squared error to an extent that depends on the precision of the estimates λ̂ and ρ̂.

The quantities SJK(λ, ρ) and MJK(σ_R², λ, ρ) can be interpreted as the mean and the variance of a posterior distribution for SJK (Harville 1976, Sec. 4). Depending on the precision of the estimates λ̂, ρ̂, and σ̂_R², it may be reasonable to interpret ŜJK and M̂JK in much the same way.

Take BJK to be a constant that represents the difference in score given by the betting line for Game K of Year J. Relevant posterior probabilities for gambling purposes are Pr(SJK < BJK) and Pr(SJK > BJK). These posterior probabilities can be obtained from the posterior probabilities Pr(SJK = s) (s = ..., -2, -1, 0, 1, 2, ...), which we approximate by Pr(s − .5 < SJK* < s + .5), where SJK* is a normal random variable with mean ŜJK and variance M̂JK.

Due to the "lumpiness" of the actual distribution of the differences in score (as described, for college football, by Mosteller 1970), these approximations to the posterior probabilities may be somewhat crude; however, it is not clear how to improve on them by other than ad hoc procedures. Rosner (1976) took a somewhat different approach to the prediction problem in an attempt to accommodate this feature of the distribution.

Our prediction algorithm can be viewed as consisting of two stages. In the first stage, λ, ρ, and σ_R² are estimated. Then, in the second stage, Ĥ and the T̂iL (i = 1, 2, ...) and their estimated dispersion matrix are computed. By making use of the Kalman prediction algorithm (as described by Harville 1979), the output of the second stage can easily be converted into a prediction ŜJK and an estimated mean squared prediction error M̂JK for any future game.

The second-stage computations, as well as the first-stage computations, can be facilitated by use of the Kalman filtering algorithm. This is especially true in instances where the second-stage computations were previously carried out based on Sjk's available earlier and where λ and ρ have not been reestimated.

3. EMPIRICAL EVALUATION OF PREDICTIONS

The statistical algorithm described in Section 2 was used to make predictions for actual NFL games. These predictions were compared for accuracy with those given by the betting line. The games for which the comparisons were made were 1,320 regular-season and playoff games played between 1971 and 1977, inclusive.

The betting line for each game was taken to be the opening line. The primary source for the opening line was the San Francisco Chronicle.


1. Parameter Estimates Over Each of Seven Time Periods

Last Year
in Period      λ̂       ρ̂      σ̂_R²       Ĥ       Estimated SE of Ĥ
1970          .29     .79      185        —              —
1971          .26     .83      181       2.19           .50
1972          .27     .81      180       2.03           .44
1973          .28     .82      182       2.27           .40
1974          .26     .78      175       2.31           .37
1975          .27     .80      175       2.18           .34
1976          .25     .79      171       2.32           .32
1977           —       —        —        2.42           .30

In its Wednesday editions, the Chronicle ordinarily reported the opening line listed in Harrah's Tahoe Racebook. There were 31 games played between 1971 and 1977 for which no line could be found and for which no comparisons were made.

The statistical prediction for each of the 1,320 games was based on the outcomes of all NFL regular-season and playoff games played from the beginning of the 1968 season through the week preceding the game to be predicted. The estimates λ̂, ρ̂, and σ̂_R² used in making the predictions were based on the same data but were recomputed yearly rather than weekly. (In estimating λ, ρ, and σ_R² at the ends of Years 1970-1975, the expansion-team cutoff was taken to be zero, so that all accumulated games were used. However, in estimating these parameters at the end of 1976, games played during 1968 were excluded, as were games involving the expansion teams, Seattle and Tampa Bay, that began play in 1976.)

Games that were tied at the end of regulation play and decided in an overtime period were counted as ties when used in estimating λ, ρ, and σ_R² and in making predictions. The values assigned to uF(i) and τF(i)² for prediction purposes were -11.8 and 17.0, respectively.

The values obtained for λ̂, ρ̂, and σ̂_R² at the end of each year are listed in Table 1. The values of the estimate Ĥ of the home-field advantage and the estimated standard error of Ĥ as of the end of each year (based on the values of λ̂, ρ̂, and σ̂_R² obtained at the end of the previous year) are also given.

Some decline in the variability of the outcomes of the games (both "among teams" and "within teams") appears to have taken place beginning in about 1974. The estimates of the among-teams variance σ_T² and the within-teams variance σ_R² are both much smaller than those obtained by Harville (1977b) for Division I college football (42 and 171 vs. 104 and 214, respectively). The estimate of the home-field advantage is also smaller than for college football (2.42 vs. 3.35). Merchant (1973) conjectured that, in professional football, the home-field advantage is becoming a thing of the past. The results given in Table 1 seem to indicate otherwise.

Table 2 provides comparisons, broken down on a year-by-year basis, between the accuracy of the statistical predictions and the accuracy of the predictions given by the betting line. Three criteria were used to assess accuracy: the frequency with which the predicted winners actually won, the average of the absolute values of the prediction errors, and the average of the squares of the prediction errors, where prediction error is defined to be the actual (signed) difference in score between the home team and the visiting team minus the predicted difference. With regard to the first criterion, a predicted tie was counted as a success or half a success depending on whether the actual outcome was a tie, and, if a tie occurred but was not predicted, credit for half a success was given.
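The scoring convention for the winner criterion can be written as a small helper; this is a hypothetical illustration of the rule just described rather than code from the study.

def winner_credit(predicted_diff, actual_diff):
    """Credit for one game: a predicted tie scores 1 or 0.5 depending on whether
    the game was actually tied, an unpredicted actual tie scores 0.5, and
    otherwise a correctly called winner scores 1."""
    if predicted_diff == 0:
        return 1.0 if actual_diff == 0 else 0.5
    if actual_diff == 0:
        return 0.5
    return 1.0 if (predicted_diff > 0) == (actual_diff > 0) else 0.0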

The statistical predictions are seen to be somewhat less accurate on the average than the predictions given by the betting line. Comparisons with Harville's (1977b) results indicate that both types of predictions tend to be more accurate for professional football than for college football. The average absolute difference between the statistical predictions and the predictions given by the betting line was determined to be 2.48.

Table 3 gives the accuracy, as measured by average absolute error, of the two types of predictions for each of the 14 weeks of the regular season and for the playoff games. Both types of predictions were more accurate over a midseason period, extending approximately from Week 6 to Week 13, than at the beginning or end of the season. Also, the accuracy of the statistical predictions compared more favorably with that of the betting line during midseason than during the rest of the season. Specifically, the average absolute prediction error for Weeks 6 through 13 was 10.37 for the statistical predictions and 10.35 for the betting line (vs. the overall figures of 10.68 and 10.49, respectively).

2. Accuracy of the Statistical Procedure Versus That of the Betting Line

            Number      Percentage of Winners        Average Absolute Error       Average Squared Error
Year(s)     of Games    Statistical    Betting       Statistical    Betting       Statistical    Betting
                        Procedure      Line          Procedure      Line          Procedure      Line
1971          164          66.2         68.6            10.28        10.61           172.1        181.0
1972          189          66.4         71.4            11.35        10.94           201.8        192.6
1973          187          74.6         75.7            11.89        11.36           228.2        205.6
1974          187          65.5         68.5            10.07        10.16           159.8        161.6
1975          187          73.8         76.2            10.76        10.38           201.0        183.6
1976          203          72.7         72.9            10.79        10.60           185.9        182.6
1977          203          72.4         70.9             9.67         9.46           166.1        168.0
All         1,320          70.3         72.1            10.68        10.49           187.8        182.0


3. Week-by-Week Breakdown for Prediction Accuracy

                                Average Absolute Error
Week(s)      Number of Games    Statistical Procedure    Betting Line
1                  92                  11.55                 11.20
2                  80                  10.46                  9.79
3                  93                  11.31                 11.31
4                  92                  10.97                 10.39
5                  92                  11.02                 10.52
6                  93                  10.41                 10.27
7                  92                  10.99                 10.72
8                  93                  10.48                 10.78
9                  93                   8.95                  8.94
10                 93                   9.82                  9.95
11                 91                  10.34                 10.26
12                 93                  10.02                 10.12
13                 93                  11.96                 11.75
14                 84                  11.73                 11.18
Playoffs           46                   9.96                  9.73
All             1,320                  10.68                 10.49

It is not surprising that the statistical predictions are more accurate, relative to the betting line, during midseason than during earlier and later periods. The statistical predictions are based only on differences in score from previous games. A great deal of additional information is undoubtedly used by those whose opinions are reflected in the betting line. The importance of taking this additional information into account depends on the extent to which it is already reflected in available scores. Early and late in the season, the additional information is less redundant than at midseason. At the beginning of the season, it may be helpful to supplement the information on past scores with information on roster changes, injuries, exhibition-game results, and so on. During the last week or two of the regular season, it may be important to take into account which teams are still in contention for playoff berths.

The statistical predictions were somewhat more similar to the predictions given by the betting line during midseason than during earlier and later periods. For Weeks 6 through 13, the average absolute difference between the two types of predictions was found to be 2.27 (vs. the overall figure of 2.48).

There remains the question of whether the statistical predictions could serve as the basis for a successful betting scheme. Suppose that we are considering betting on a future game, say, Game K of Year J.

If ŜJK > BJK, that is, if the statistical prediction indicates that the chances of the home team are better than those specified by the betting line, then we might wish to place a bet on the home team. The final decision on whether to make the bet could be based on the approximation (discussed in Section 2.4) to the ratio

Pr(SJK > BJK) / [Pr(SJK > BJK) + Pr(SJK < BJK)],    (3.1)

that is, on the approximation to the conditional probability (as defined in Section 2.4) that the home team will win by more (or lose by less) than the betting line would indicate, given that the game does not end in a tie relative to the betting line. The bet would be made if the approximate value of this conditional probability were sufficiently greater than .5.

If ŜJK < BJK, then, depending on whether the approximation to the ratio (3.1) were sufficiently smaller than .5, we would bet on the visiting team. The actual success of such a betting scheme would depend on the frequency with which bets meeting our criteria arise and on the relative frequency of winning bets among those bets that do qualify.

In Table 4a, the predictions for the 1,320 games are divided into six categories depending on the (approximate) conditional probability that the team favored by the statistical algorithm relative to the betting line will win by more (or lose by less) than indicated by the betting line (given that the game does not end in a tie relative to the betting line).

4. Theoretical Versus Observed Frequency of Success for Statistical Predictions Relative to the Betting Line

a. All weeks

Probability    Number of Games     Average        Observed Relative    Cumulative Number of    Cumulative Average    Cumulative Observed
Interval       (Number of Ties)    Probability    Frequency            Games (Ties)            Probability           Relative Frequency
[.50, .55)        566 (16)            .525             .525               1,320 (48)                .570                   .528
[.55, .60)        429 (17)            .574             .534                 754 (32)                .604                   .530
[.60, .65)        221 (12)            .621             .483                 325 (15)                .643                   .526
[.65, .70)         78 (3)             .671             .627                 104 (3)                 .688                   .614
[.70, .75)         18 (0)             .718             .556                  26 (0)                 .736                   .577
>.75                8 (0)             .778             .625                   8 (0)                 .778                   .625

b. Weeks 6-13

Probability    Number of Games     Average        Observed Relative    Cumulative Number of    Cumulative Average    Cumulative Observed
Interval       (Number of Ties)    Probability    Frequency            Games (Ties)            Probability           Relative Frequency
[.50, .55)        337 (7)             .525             .503                 741 (26)                .564                   .541
[.55, .60)        248 (11)            .573             .570                 404 (19)                .598                   .574
[.60, .65)        112 (6)             .620             .528                 156 (8)                 .638                   .581
[.65, .70)         39 (2)             .672             .730                  44 (2)                 .682                   .714
[.70, .75)          3 (0)             .728             .333                   5 (0)                 .758                   .600
>.75                2 (0)             .805            1.000                   2 (0)                 .805                  1.000


For each category, the table gives the total number of games or predictions, the number of games that actually ended in a tie relative to the betting line, the average of the conditional probabilities, and the observed frequency with which the teams favored by the statistical algorithm relative to the betting line actually won by more (or lost by less) than predicted by the line (excluding games that actually ended in a tie relative to the line). Cumulative figures, starting with the category corresponding to the highest conditional probabilities, are also given. Table 4b gives the same information for the 741 games of Weeks 6 through 13.

The motivation for the proposed betting scheme is a suspicion that the relative frequency with which the teams favored by the statistical algorithm relative to the line actually beat the line might be a strictly increasing (and possibly approximately linear) function of the "theoretical" frequency (the approximate conditional probability), having a value of .50 at a theoretical relative frequency of .50. The results given in Table 4 tend to support this suspicion and to indicate that the rate of increase is greater for the midseason than for the entire season.

Fitting a linear function to the overall observed relative frequencies by iterative weighted least squares produced the following equation:

relative frequency = .50 + .285 (theoretical frequency − .50).

The fitted equation for Weeks 6 through 13 was:

relative frequency = .50 + .655 (theoretical frequency − .50).

The addition of quadratic and cubic terms to the equations resulted in only negligible improvements in fit.

The proposed betting scheme would generally have shown a profit during the 1971-1977 period. The rate of profit would have depended on whether betting had been restricted to midseason games and on what theoretical frequency had been used as the cutoff point in deciding whether to place a bet. Even if bets (of equal size) had been placed on every one of the 1,320 games, some profit would have been realized (since the overall observed relative frequency was .528 vs. the break-even point of .524).
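As a rough check on that last claim, the arithmetic can be laid out explicitly; the figures below come from Table 4a (1,320 games, 48 pushes against the line, a .528 win rate on decided bets) and the $11-to-win-$10 terms described in Section 1, and the rounding to whole bets is our own simplification.

# Back-of-the-envelope profit check for betting every game at equal stakes.
decided = 1320 - 48                    # games not tied against the line
wins = round(0.528 * decided)          # roughly 672 winning bets
losses = decided - wins
profit = 10 * wins - 11 * losses       # dollars, at $10-to-win / $11-to-lose stakes
print(wins, losses, profit)            # a small but positive overall profit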

4. DISCUSSION

4.1 Modification of the Prediction Algorithm

One approach to improving the accuracy of the statistical predictions would be to modify the underlying model. In particular, instead of assuming that the residual effects are uncorrelated, we could, following Harville (1977b), assume that

Rjk = Ch(j,k),j,w(j,k) − Cv(j,k),j,w(j,k) + Fjk,    (4.1)

where, taking the weeks of each season to be numbered 1, 2, 3, ..., w(j, k) = m if Game k of Year j were played during Week m. The quantities Cimn and Fjk represent random variables such that E(Cimn) = E(Fjk) = 0, var(Fjk) = σ_F², cov(Fjk, Fj'k') = 0 if j' ≠ j or k' ≠ k, cov(Cimn, Fjk) = 0, and cov(Cimn, Ci'm'n') = σ_C² α^|n−n'| if i' = i and m' = m, and = 0 otherwise. Here, σ_F², σ_C², and α are unknown parameters.

The correlation matrix of Cim1, Cim2, ... is that for a first-order autoregressive process. The quantity Cimn can be interpreted as the deviation in the performance level of Team i in Week n of Year m from the level that is characteristic of Team i in Year m. The assumption (4.1) allows the weekly performance levels of any given team to be correlated to an extent that diminishes with elapsed time.
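As a small numerical illustration of that AR(1) correlation pattern, the snippet below builds the implied correlation matrix for one team's weekly deviations; the value of α and the number of weeks are arbitrary choices for display.

import numpy as np

# Correlation structure implied by (4.1) for one team's weekly deviations
# C_im1, C_im2, ...: an AR(1) pattern alpha^|n - n'|.
alpha, n_weeks = 0.5, 5
lags = np.abs(np.subtract.outer(np.arange(n_weeks), np.arange(n_weeks)))
print(np.round(alpha ** lags, 3))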

If the assumption (4.1) were adopted and positive values were used for α and σ_C², the effect on the statistical prediction algorithm would be an increased emphasis on the most recent of those games played in the year for which the prediction was being made. The games played early in that year would receive less emphasis.

The parameters σ_T², ρ, σ_C², α, and σ_F² associated with the modified model were actually estimated from the 1968-1976 NFL scores by an approximate restricted maximum likelihood procedure similar to that described in Section 2.3 for estimating parameters of the original model. As in Harville's (1977b) study of college football scores, there was no evidence that α differed from zero.

A second way to improve the accuracy of the statistical algorithm would be to supplement the information in the past scores with other quantitative information. A mixed linear model could be written for each additional variate. The random effects or the residual effects associated with each variate could be taken to be correlated with those for the other variates. At least in principle, the new variates could be incorporated into the prediction algorithm by following the same approach used in Section 2 in devising the original algorithm. In practice, depending on the number of variates that are added and the complexity of the assumed linear models, the computations could be prohibitive.

One type of additional variate would be the (signed) difference for each regular-season and playoff game between the values of any given statistic for the home team and the visiting team. For example, the yards gained by the home team minus the yards gained by the visiting team could be used. The linear model for a variate of this type could be taken to be of the same form as that applied to the difference in score.

There are two ways in which the incorporation of an additional variate could serve to improve the accuracy of the statistical prediction for the outcome SJK of a future game.


It could contribute additional information about the yearly characteristic performance levels Th(J,K),J and Tv(J,K),J of the participating teams, or it could contribute information about the residual effect RJK. Comparison of the estimates of σ_R² given in Table 1 with the figures given in Table 2 for average squared prediction error indicates that the first type of contribution is unlikely to be important, except possibly early in the season. Variates that quantify injuries are examples of variates that might contribute in the second way.

4.2 Other Approaches

The statistical prediction algorithm presented in Sec-tion 2 is based on procedures for mixed linear modelsdescribed, for example, by Harville (1976, 1977a). Theseprocedures were derived in a frequentist framework;however, essentially the same algorithm could be arrivedat by an empirical Bayes approach like that described byHaff (1976) and Efron and Morris (1975) and used by thelatter authors to predict batting averages of baseballplayers.

Essentially statistical algorithms for predicting theoutcomes of NFL games were developed previously byGoode (as described in a nontechnical way by Marsha1974), Rosner (1976), and Stefani (1977). Mosteller(1973) listed some general features that seem desirablein such an algorithm.

Comparisons of the results given in Section 3 withStefani's results indicate that the predictions producedby the algorithm outlined in Section 2 tend to be moreaccurate than those produced by Stefani's algorithm.Moreover, Stefani reported that the predictions givenby his algorithm compare favorably with those given byGoode's algorithm and with various other statisticaland nonstatistical predictions.

There is some question whether it is possible for a bettor who takes an intuitive, essentially nonstatistical approach to beat the betting line (in the long run) more than 50 percent of the time. DelNagro (1975) reported a football prognosticator's claim that in 1974 he had made predictions for 205 college and NFL games relative to the betting line with 184 (89.8 percent) successes (refer also to Revo 1976); however, his claim must be regarded with some skepticism in light of subsequent well-documented failures (DelNagro 1977).

Winkler (1971) found that the collective rate of success of sportswriters' predictions for 153 college and NFL games was only 0.476, while Pankoff (1968), in a similar study, reported somewhat higher success rates.

Merchant (1973) followed the betting activities of two professional gamblers during the 1972 NFL season. He reported that they bet on 109 and 79 games and had rates of success of .605 and .567, respectively.

5. THE RATING PROBLEM

A problem akin to the football prediction problem is that of rating, ranking, or ordering the teams or a subset of the teams from first possibly to last. The rating may be carried out simply as a matter of interest, or it may be used to honor or reward the top team or teams.

The NFL has used what might be considered a rating system in picking its playoff participants. The NFL consists of two conferences, and each conference is divided into three divisions. Ten teams (eight before 1978) enter the playoffs: the team in each division with the highest winning percentage and the two teams in each conference that, except for the division winners, have the highest winning percentages. A tie for a playoff berth is broken in accordance with a complex formula. The formula is based on various statistics including winning percentages and differences in score for games between teams involved in the tie.

The prediction procedure described in Section 2 can also be viewed as a rating system. The ratings for a given year, say Year P, are obtained by ordering the teams in accordance with the estimates T_{1P}, T_{2P}, ... of their Year P characteristic performance levels. However, as a rating system, this procedure lacks certain desirable characteristics.

To insure that a rating system will be fair and will not affect the way in which the games are played, it should depend only on knowing the scores from the given season (Year P), should reward a team for winning per se, and should not reward a team for "running up the score" (Harville 1977b). The procedure in Section 2 can be converted into a satisfactory system by introducing certain modifications.

Define a "truncated" difference in score for Game k of Year P, denoted S_{Pk}(M), where M is some number of points; the truncated differences are given by expression (5.1). The modified estimates are defined by equations (5.2), in which Ŝ_{Pk}(M; T_{h(P,k),P}, T_{v(P,k),P}; R², λ, H) is the conditional expectation (based on an assumption that S_{P1}, S_{P2}, ..., T_{1P}, T_{2P}, ... are jointly normal) of S_{Pk} given S_{P1}(M), S_{P2}(M), ..., T_{1P}, T_{2P}, ..., that is, given all of the available truncated differences in score and all of the characteristic performance levels for Year P, or, equivalently, given S_{Pk}(M), T_{h(P,k),P}, and T_{v(P,k),P}.

Let T_{iP}(λ, H; S_{P1}, S_{P2}, ...) represent the conditional expectation of T_{iP} given S_{P1}, S_{P2}, .... In the modified rating procedure for Year P, the ratings are obtained by ordering estimates T_{1P}, T_{2P}, ..., where these estimates are based only on differences in score from Year P and where, instead of putting T_{iP} = T_{iP}(λ, H; S_{P1}, S_{P2}, ...) (as we would with the procedure in Section 2), we put T_{1P}, T_{2P}, ... as defined by equations (5.2). Here, Ŝ_{Pk} is an "estimated" difference in score, given by expression (5.3), where U is some number of points (say U = 21) and B (0 < B < 1) is some weight (say B = 1/3) specified by the user.

Equations (5.2) can be solved iteratively for T_{1P}, T_{2P}, ... by the method of successive approximations. On each iteration, we first compute new estimates of T_{1P}, T_{2P}, ... by using the procedure in Section 2 with the current estimated differences in score in place of the "raw" differences. Expressions (5.1) and (5.3) are then used to update the estimated differences in score.
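
To make the fixed-point structure of this iteration concrete, the following is a minimal sketch in Python of a generic successive-approximations loop. The helper names fit_ratings (standing in for the Section 2 procedure) and adjust_scores (standing in for the role played by expressions (5.1) and (5.3), which are not reproduced here), as well as the convergence rule, are illustrative assumptions and do not appear in the original article.

    def rate_by_successive_approximations(raw_diffs, fit_ratings, adjust_scores,
                                          tol=1e-6, max_iter=100):
        """Generic successive-approximations loop for an iterative rating scheme.

        raw_diffs     : list of raw score differences for Year P, one per game
        fit_ratings   : function mapping a list of (possibly adjusted) score
                        differences to a dict of team ratings (a stand-in for the
                        Section 2 procedure, treated here as a black box)
        adjust_scores : function mapping (raw_diffs, ratings) to the estimated
                        differences in score (the role of expressions (5.1) and (5.3))
        """
        diffs = list(raw_diffs)                 # start from the raw differences
        ratings = fit_ratings(diffs)
        for _ in range(max_iter):
            diffs = adjust_scores(raw_diffs, ratings)   # update estimated differences
            new_ratings = fit_ratings(diffs)            # re-estimate the ratings
            change = max(abs(new_ratings[t] - ratings[t]) for t in new_ratings)
            ratings = new_ratings
            if change < tol:                    # stop when the ratings stabilize
                break
        return ratings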

The underlying rationale for the proposed rating system is essentially the same as that given by Harville (1977b) for a similar, but seemingly less satisfactory, scheme.

If the proposed rating system is to be accepted, it should be understandable to the public, at least in general terms. Perhaps (2.15) could be used as the basis for a fairly simple description of the proposed rating system. Our basic procedure is very similar to a statistical procedure developed by Henderson (1973) for use in dairy cattle sire selection. It is encouraging to note that it has been possible to develop an intuitive understanding of this procedure among dairy cattle farmers and to sell them on its merits.

It can be shown that taking T_{1P}, T_{2P}, ... to be as defined by (5.2) is equivalent to choosing them to maximize the function L_M(T_{1P}, T_{2P}, ...; S_{P1}(M), S_{P2}(M), ...; R², λ, H), the logarithm of the joint probability "density" function of T_{1P}, T_{2P}, ..., S_{P1}(M), S_{P2}(M), ... that results from taking S_{P1}, S_{P2}, ..., T_{1P}, T_{2P}, ... to be jointly normal. When viewed in this way, the proposed rating system is seen to be similar in spirit to a system devised by Thompson (1975).

[Received July 1978. Revised June 1979.]

REFERENCES

DelNagro, M. (1975), "Tough in the Office Pool," Sports Illustrated, 43, 74-76.
(1977), "Cashing in a Sure Thing," Sports Illustrated, 47, 70-72.
Efron, Bradley, and Morris, Carl (1975), "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70, 311-319.
Haff, L.R. (1976), "Minimax Estimators of the Multinormal Mean: Autoregressive Priors," Journal of Multivariate Analysis, 6, 265-280.
Hartigan, J.A. (1969), "Linear Bayesian Methods," Journal of the Royal Statistical Society, Ser. B, 31, 446-454.
Harville, David A. (1976), "Extension of the Gauss-Markov Theorem to Include the Estimation of Random Effects," Annals of Statistics, 4, 384-395.
(1977a), "Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems," Journal of the American Statistical Association, 72, 320-338.
(1977b), "The Use of Linear-Model Methodology to Rate High School or College Football Teams," Journal of the American Statistical Association, 72, 278-289.
(1978), "Football Ratings and Predictions Via Linear Models," with discussion by Carl R. Morris, Proceedings of the American Statistical Association, Social Statistics Section, 74-82; 87-88.
(1979), "Recursive Estimation Using Mixed Linear Models With Autoregressive Random Effects," in Proceedings of the Variance Components and Animal Breeding Conference in Honor of Dr. C.R. Henderson, Ithaca, N.Y.: Cornell University, Biometrics Unit.
Henderson, Charles R. (1973), "Sire Evaluation and Genetic Trends," in Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. Jay L. Lush, Champaign, Ill.: American Society of Animal Science, 10-41.
Marsha, J. (1974), "Doing It by the Numbers," Sports Illustrated, 40, 42-49.
Merchant, Larry (1973), The National Football Lottery, New York: Holt, Rinehart & Winston.
Mosteller, Frederick (1970), "Collegiate Football Scores, U.S.A.," Journal of the American Statistical Association, 65, 35-48.
(1973), "A Resistant Adjusted Analysis of the 1971 and 1972 Regular Professional Football Schedule," Memorandum EX-5, Harvard University, Dept. of Statistics.
Pankoff, Lyn D. (1968), "Market Efficiency and Football Betting," Journal of Business, 41, 203-214.
Patterson, H.D., and Thompson, Robin (1971), "Recovery of Inter-Block Information When Block Sizes Are Unequal," Biometrika, 58, 545-554.
Revo, Larry T. (1976), "Predicting the Outcome of Football Games or Can You Make a Living Working One Day a Week," in Proceedings of the American Statistical Association, Social Statistics Section, Part II, 709-710.
Rosner, Bernard (1976), "An Analysis of Professional Football Scores," in Management Science in Sports, eds. R.E. Machol, S.P. Ladany, and D.G. Morrison, Amsterdam: North-Holland Publishing Co., 67-78.
Stefani, R.T. (1977), "Football and Basketball Predictions Using Least Squares," IEEE Transactions on Systems, Man, and Cybernetics, SMC-7, 117-121.
Thompson, Mark (1975), "On Any Given Sunday: Fair Competitor Orderings With Maximum Likelihood Methods," Journal of the American Statistical Association, 70, 536-541.
Winkler, Robert L. (1971), "Probabilistic Prediction: Some Experimental Results," Journal of the American Statistical Association, 66, 675-685.


Chapter 7

Data suggest that decisions to hire and fire kickers are often based on overreaction to random events.

The Best NFL Field Goal Kickers: Are They Lucky or Good?

Donald G. Morrison and Manohar U. Kalwani

The Question

In the September 7, 1992 issue of Sports Illustrated (SI) an article titled "The Riddle of the Kickers" laments the inability of teams to sign kickers who consistently make clutch field goals late in the game. Why do some kickers score better than others? In our article, we propose an answer to SI's "riddle." More formally, we ask the question: Do the observed data on the success rates across NFL field goal kickers suggest significant skill differences across kickers? Interestingly, we find that the 1989-1991 NFL field goal data are consistent with the hypothesis of no skill difference across NFL kickers. It appears then that, in searching for clutch field goal kickers, the NFL teams may be seeking a species that does not exist.

The Kicker and His Coach

In its February 18, 1992 issue, the New York Times reported that Ken Willis, the place kicker who had been left unsigned by the Dallas Cowboys, had accepted an offer of almost $1 million for two years to kick for Tampa Bay. The Cowboys, Willis's team of the previous season, had earlier agreed that they would pay him $175,000 for one year if Willis would not sign with another team during the Plan B free agency period. They were infuriated that Willis had gone back on his word.

"That scoundrel," said Dallas Coach Jimmy Johnson, when informed of Willis's decision. "He broke his word."

"I'm not disappointed as much about losing a player because we can find another kicker," Johnson told The Associated Press, "but I am disappointed that an individual compromised trust for money. When someone gives me their word, that's stronger to me than a contract."

"I did give the Cowboys my word," said Willis, "but under the circumstances, to not leave would have been ludicrous."

Is Ken Willis really a "scoundrel"? Was Jimmy Johnson more of a scoundrel than his former kicker? We leave these questions to the reader. Rather, we focus on Johnson's assessment

"... we can find another kicker."

Just how differentiated are NFL kickers? Are some really better than others—or are they interchangeable parts? Inquiring (statistical) minds want to know!

Some Caveats

We begin with an observation: Some NFL kickers have stronger legs than others. Morton Andersen of the New Orleans Saints, for example, kicked a 60-yard field goal in 1991; virtually none of the other kickers even try one that long. His kickoffs also consistently go deep into the end zone for touchbacks; many kickers rarely even reach the goal line. Thus, we concede that, everything considered, some NFL kickers really are better than others.

The specific question we are asking, however, is: Given the typical length of field goal attempts, is there any statistical evidence of skill difference across kickers?

Before giving our statistical analysis, a few more anecdotes are in order.

Tony Z Has a Perfect Season!

In 1989, Tony Zendejas, kicking for Houston, with a success rate of 67.6%, was among the NFL's bottom fourth of 28 kickers. In 1990, he was at the very bottom with 58.3%. In 1991, though, Tony did something no other kicker in NFL history has ever done—he was successful on all of his field goal attempts! (Note: Zendejas was then kicking for the inept Los Angeles Rams, and he had only 17 attempts all season.) Early in the 1992 season, however, Tony missed three field goals in one game!

One Miss and He's History

In 1991, Brad Daluiso, a rookie out of UCLA, won the kicking job for the Atlanta Falcons. Brad made his first two kicks and missed a game-winning 27 yarder late in the second half. The next day, Daluiso was cut. Did the Falcons have enough "evidence" for firing Brad?

Portrait of a Slump?

Jeff Jaeger of the Los Angeles Raiders had a great 1991 season. He made 85.3% of his kicks—second only to Tony Zendejas's unprecedented perfect season. After 4 games in 1992, Jeff was 5 for 11 and the media were all over him for his "slump." Two of the misses, however, were blocked and not Jaeger's fault. Three of the misses were 48-, 51-, and 52-yard attempts—hardly "sure shots." In fact, of Jeff's 11 attempts, a badly hooked 29-yard miss was the only really poor kick. This is a slump?

Wandering in the NFL Wilderness

Finally, we give our favorite kicker story. Nick Lowery of the Kansas City Chiefs is sure to be in the Hall of Fame. He has made more field goals of 50 yards or more than anyone and holds the all-time high career field goal percentage—just under 80%. Were the Chiefs clever observers who saw Lowery kick in college and signed him before others discovered his talent? No, almost every team in the NFL saw him either as a teammate or an opponent. The road to NFL fame for Lowery was rocky indeed; he was cut 12 times by 9 different teams before landing his present decade-long job with Kansas City. Was this just luck for the Chiefs or did something else play a role?

Lucky or Good?

These anecdotes—and remember the plural of "anecdote" is not "data"—suggest that the observed performance of NFL kickers depends a lot more on luck than skill. (Scott Norwood, the Buffalo Bills, and the whole city of Buffalo would be different if Norwood's 47-yard game-ending 1991 Super Bowl field goal attempt had not been 2 feet wide to the right.) A simple correlation analysis for all of the 1989, 1990, and 1991 NFL data will demonstrate this "luck overwhelming skill" conjecture in a qualitative manner. Later, a more appropriate analysis will quantify these effects.

Year-to-Year Correlations

The Appendix gives the number of attempts and successes for all kickers for three seasons, beginning in 1989. These data come from an annual publication called the Official National Football League Statistics. (A call to the NFL Office in New York City is all that is required to receive a copy by mail.) For our first analysis, we calculate the correlation across kickers for each pair of years, where the X variable is percentage made in, say, 1989, and the Y variable is the percentage made in, say, 1990. We used only those kickers who "qualified," namely, who had attempted at least 16 field goals in both the 16-game NFL seasons of a pair. The three resulting correlations are as follows:

    Pair of years     Correlation     p-value     Sample size
    1989, 1990           -.16           .48           22
    1990, 1991           +.38           .07           24
    1989, 1991           -.53           .02           20

Two of the three correlations are negative, suggesting that an NFL kicker's year of above-average performance is at least as likely to be followed by a year of below-average performance rather than another year of above-average performance. In other words, NFL kickers' records do not exhibit consistency from one year to another.

Admittedly, this is a very unsophisticated analysis; the number of attempts can vary greatly between years. Some kickers may try lots of long field goals one year and mostly short ones the next year. Our next analysis will take these and other factors into account. However, this naive analysis has already let the cat out of the bag. The imperfect—but reasonable—performance measure of percentage made is a very poor predictor of this same measure in the following year or two.
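
The year-pair correlations above are simple to compute. The following Python sketch shows one way to do so; since the Appendix data are not reproduced here, the records data structure, the function names, and the qualification cutoff argument are illustrative assumptions rather than the authors' code.

    from math import sqrt

    def pearson_r(x, y):
        """Plain Pearson correlation coefficient."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / sqrt(sxx * syy)

    def year_pair_correlation(records, year1, year2, min_attempts=16):
        """Correlate success percentages across kickers for two seasons.

        records: dict mapping kicker name -> {year: (made, attempted)}.
        Only kickers with at least min_attempts attempts in BOTH years qualify.
        """
        x, y = [], []
        for seasons in records.values():
            if year1 in seasons and year2 in seasons:
                m1, a1 = seasons[year1]
                m2, a2 = seasons[year2]
                if a1 >= min_attempts and a2 >= min_attempts:
                    x.append(100 * m1 / a1)
                    y.append(100 * m2 / a2)
        return pearson_r(x, y), len(x)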

An Illustrative Thought Experiment

Consider the following hypothetical experiment: We have 300 subjects each trying to kick 2 field goals from 30 yards away. Each subject is successful zero, one, or two times. Three possible scenarios for the results for all 300 are given in Table 1.

Scenario C—All Luck

Let's model each subject (kicker) as a Bernoulli process with some unobservable probability of successfully making each kick. Now recall what happens when you flip a fair coin (p = .5) twice. You will get zero, one, or two heads with probabilities .25, .50, and .25, respectively. Scenario C, therefore, is what we would expect if there were no skill differences across kickers and if each and every kicker had the same probability of success of p = .5. In Scenario C there is no skill difference—rather, the "0 for 2" kickers were "unlucky," whereas the "2 for 2's" were simply "lucky."

Scenario A—All Skill

The reader's intuition will lead to the obvious analysis of Scenario A. These data are consistent with half the kickers being perfect, that is, p = 1, and the other half being totally inept, with p = 0.

The intermediate Scenario B is consistent with the unobservable p-values being distributed uniformly between 0 and 1 across kickers. The spirit of our analysis on the NFL kicking data is to see which of these scenarios is most consistent with the data.

The Binomial All-Luck Benchmark

Assume, for illustration, every NFL kicker had the same number of attempts each year and all field goals were of the same length (as in our hypothetical example). Our analysis would proceed as follows: We would compute for each kicker x_i, the number of successful kicks out of n, the common number of attempts. If the success rates, p_i = x_i/n, for most kickers turn out to be close to p̄, which is the average success rate across all kickers, we would have the analogue of Scenario C, that is, indicating very little skill difference. The observed data would be consistent with each kicker having the same (or very close to) common p-value. As the data display greater than binomial variance across kickers (e.g., more toward Scenarios B and A), they would indicate more and more skill differences across kickers.
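
The benchmark comparison can be sketched in a few lines of Python. The function name is ours; the example data simply reproduce the Scenario C counts from Table 1, for which the observed spread of success rates essentially equals the binomial ("all luck") benchmark.

    def excess_variance_check(successes, n):
        """Compare the observed spread of kicker success rates with the spread
        expected from pure binomial ("all luck") variation.

        successes: list of x_i, the number of made kicks for each kicker
        n        : common number of attempts per kicker
        """
        k = len(successes)
        rates = [x / n for x in successes]
        p_bar = sum(rates) / k
        observed_var = sum((p - p_bar) ** 2 for p in rates) / (k - 1)
        binomial_var = p_bar * (1 - p_bar) / n   # expected spread if every kicker shares p_bar
        return p_bar, observed_var, binomial_var

    # Scenario C from Table 1: 75 kickers make 0 of 2, 150 make 1 of 2, 75 make 2 of 2.
    xs = [0] * 75 + [1] * 150 + [2] * 75
    print(excess_variance_check(xs, 2))   # observed spread roughly matches the binomial benchmark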

A Beta Binomial Analysis

To formalize the argument just illustrated, we construct a probability model for the performance of field goal kickers. The assumptions of our model are:

1. With respect to each field goal attempt, each kicker is a Bernoulli process with an unobservable probability p of making each kick.

2. The unobservable p-value has a probability density function f(p) across kickers.

The first assumption literally says that for a given kicker all field goals are equally difficult and whether or not he makes or misses a kick does not affect the probability of making any future kicks. The second assumption merely allows skill differences across kickers.

All Field Goals—or Segmented by Distance?

Table 2 displays the average proportions of field goals made by the NFL kickers during the 1989, 1990, and 1991 seasons.

Table 1—Three Possible Results of a Hypothetical Test of 300 Field Goal Kickers

                                 Frequency distribution
    No. of successes        Scenario A    Scenario B    Scenario C
    0                          150           100            75
    1                            0           100           150
    2                          150           100            75
    Total                      300           300           300
    Average percentage
    of successes                50            50            50

Note that all three scenarios yield an overall success rate of 50%.

A quick look at the figures shows that short field goals are made over 90% of the time, whereas long ones are made less than 60% of the time. Thus, given the yardage of the attempts made by a kicker, he is not a Bernoulli process, that is, the probability of success is not constant. But if we only know that a kick is attempted, we would have to weight all of these yardage-dependent p-values by the probability of each particular yardage attempt. This would give a common overall (weighted) p-value for each attempt. Thus, knowing only that a kicker tried, say, 30 field goals, the number made would have a binomial distribution, with n = 30, and this overall p-value. This is the context in which we model each kicker as a Bernoulli process for all kicks, irrespective of the distance from the goal posts at which the field goal attempt is made.

We also do separate analyses by yardage groups, namely, under 29 yards, 30-39 yards, and 40-49 yards. In these analyses, to the extent we have reduced the effect of the varying field goal lengths across kickers, differences in success proportions are more likely to be due to skill differences.

Beta Skill Variability

Because of its flexibility to describe each of the anticipated scenarios, we use the beta distribution to represent the heterogeneity in the p-values (or true probabilities of success) across kickers (see sidebar).

Table 2—NFL Field Goal Kickers' Success Proportions

                              Season                  Aggregate
                    1989      1990      1991       over 3 years
    All kicks       .731      .748      .737          .740
    <29 yards       .945      .958      .937          .950
    30-39 yards     .801      .795      .787          .793
    40-49 yards     .563      .635      .578          .587


For example, the beta distribution can be chosen to represent the situation characterized by Scenario C, in which all kickers have similar abilities. Or, it can be chosen to represent Scenario A, in which there are two types of kickers—good and bad. Conveniently, it turns out that a parameter of the beta distribution, namely, the polarization index, θ, can serve as an indicator of the amount and nature of heterogeneity in the p-values across kickers.

Lucky or Good Redux

Given our very reasonable assumptions of Bernoulli kickers with the probabilities of success having a beta distribution across kickers, all we have to do is estimate the parameters, μ (the mean success rate), and the polarization index, θ (a measure of variability), of the distribution to completely tell our story:

For each analysis, μ will give the average skill level (e.g., the mean p-value) and θ will say how much of the observed variability in performance is due to skill. It is all skill when θ = 1, and all luck at the other extreme when θ = 0.

Results

We report the maximum likelihood estimates of μ and θ for the field goal data from the 1989, 1990, and 1991 NFL seasons. Table 3 contains the results, which are very compelling and tell a very simple story. (Please recall our earlier caveat about the distance dimension of these kickers.) For the field goals that are attempted, the data overwhelmingly support the hypothesis of no skill differences across the elite group of NFL kickers. Half of the analyses have θ = 0. [In these cells, the data show slightly less variability than would be expected under a homogeneous (no skill difference) Bernoulli population of kickers.] The other half of the cells have positive, but very tiny, θ values; for example, they are all less than .03.

Table 3-Mean and Polarization for NFL Field Goals

Table 3 also displays the maximum likelihood estimates of μ and θ for field goal data aggregated across each kicker for the 1989, 1990, and 1991 seasons. The total number of field goals made and attempted, and the corresponding success rates are included in the Appendix. For the 38 kickers who kicked 16 or more field goals during at least 1 of the 3 seasons, the numbers of field goals attempted varied from 18 to 123, with an average of about 62. As Table 3 reveals, even in these aggregate data with larger sample size, the estimates of θ are very close to 0 for all kicks or kicks segmented by field goal length. These findings from aggregate data provide further support for our inference of a lack of skill differences among the NFL kickers.

Reliability of Model Results

Our inference of little, if any, skill differences among the elite group of NFL kickers relies on our estimate of the polarization index, θ, being 0 or close to it in almost all the cases analyzed. The question arises: How accurate are our estimates of θ, particularly since our sample sizes are not large? The number of qualified NFL kickers (with an average of at least one kick per game) in any given year is about 28. These kickers on the average attempt about 28 field goals over the course of a 16-game season. Our maximum likelihood estimators of θ and μ have good statistical properties but only when the sample sizes are large. Simulation results indicate that the sample sizes are adequate in the present problem for us to have great confidence in these results.

Should We Be Surprised?

If the readers of this article were lined up to attempt 20-yard field goals, we would find big skill differences.


The Beta Distribution


The beta distribution is frequently used to describe the distribution of probabilities across a population. Areas of application include biometrics, genetics, market research, opinion research, and psychometrics. The functional form of the beta distribution is given by

    f(p) = [Γ(a + B) / (Γ(a)Γ(B))] p^(a-1) (1 - p)^(B-1),   0 < p < 1,

where p denotes an individual kicker's true probability of success, a and B are parameters of the beta distribution, and Γ(·) denotes the gamma function. The mean and variance of this distribution are determined by the parameters μ = a/(a + B), the mean success rate across the population, and θ = 1/(a + B + 1), a polarization index. It turns out that the variance of the beta distribution can be written as

    var[p] = μ(1 - μ)θ.

Thus when θ = 0, there is zero variance in the p-values across kickers, implying no skill differences (see Fig. 1C). When θ = 1, the variance of p (given a mean μ) is the maximum possible value of μ(1 - μ) (see Fig. 1A).

The beta distribution is flexible and can take different shapes depending on the values of the parameters a and B.

[Figures 1A-1C: bimodal, uniform, and bell-shaped forms of the beta distribution.]

The bimodal form of the beta distribution (an extreme version of which is displayed in Fig. 1A) would be appropriate if the NFL kickers could be classified as either very good or very poor. The bell-shaped form of the beta distribution (again, an extreme form of which is shown in Fig. 1C) would imply that most NFL kickers have very similar true probabilities of success. The intermediate case displayed in Fig. 1B would imply that the true probabilities of success of the NFL kickers are distributed uniformly between 0 and 1. The three different shapes depicted in Figs. 1A, 1B, and 1C, of course, correspond to the Scenarios A, B, and C presented in the text.

Working with the Beta Binomial Model

The parameters μ and θ can be estimated by either the method of moments or the maximum likelihood approach; we used both approaches in fitting the model to the field goal data. Those interested in more details should read Colombo and Morrison (1988), which contains an analogous application to the success or failure of British Ph.D. students across universities. The appendix of Colombo and Morrison gives the formulas for the maximum likelihood estimators of μ and θ. To obtain method-of-moments estimates of μ and θ, we used an iterative approach due to Kleinman (1973). Reassuringly, they turned out to be very close to the maximum likelihood estimates of μ and θ in all the 12 cases considered in the text.

How good are the estimates in small samples? Evidence on the reliability of Kleinman's moment estimators in small samples is available in the simulation results of Lim (1992). The simulations were carried out in settings similar to our field goal data, in that the number of sampling units (kickers, in our case) was set at 30 and the number of trials per sampling unit was allowed to vary between 20 and 40, with a mean of about 30 trials. The value of the μ parameter was set at 0.75 and the polarization index θ was allowed to vary between 0.25 and 0.75. In these simulations, Lim was able to recover the polarization index parameter within 10% of the true value 95% of the time. In sum, we feel comfortable about the reliability of our estimates of θ for the field goal data because the maximum likelihood estimates and Kleinman's moment estimates agree, and Lim's simulation results suggest that the sample sizes are adequate for the moment estimators.
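
To make the estimation step concrete, here is a minimal Python sketch of fitting μ and θ to made/attempted counts by maximum likelihood. It is written directly from the beta-binomial assumptions stated above, not from the authors' code: the function names, the crude grid-search strategy, and the illustrative comment are our assumptions (the article itself used the Colombo and Morrison maximum likelihood formulas and Kleinman's iterative moment estimator).

    from math import lgamma
    import itertools

    def beta_binomial_loglik(mu, theta, data):
        """Log-likelihood of (mu, theta) for beta-binomial kicker data.

        data is a list of (made, attempted) pairs, one per kicker.
        mu is the mean success rate; theta = 1/(a + B + 1) is the polarization index.
        """
        a = mu * (1 - theta) / theta          # convert (mu, theta) to standard beta parameters
        b = (1 - mu) * (1 - theta) / theta
        ll = 0.0
        for x, n in data:
            ll += (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                   + lgamma(a + b) - lgamma(a) - lgamma(b)
                   + lgamma(x + a) + lgamma(n - x + b) - lgamma(n + a + b))
        return ll

    def fit_beta_binomial(data, grid=200):
        """Crude grid-search maximum likelihood estimates of (mu, theta)."""
        mus = [i / grid for i in range(1, grid)]
        thetas = [i / grid for i in range(1, grid)]
        return max(itertools.product(mus, thetas),
                   key=lambda mt: beta_binomial_loglik(mt[0], mt[1], data))

    # Hypothetical example: if 28 kickers each went 21 for 28, the fitted theta
    # would be driven to the smallest grid value, i.e., no evidence of skill differences.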

But we analyzed the "best of the best." Each year thousands of high school kickers get filtered into a few hundred college kickers. These few hundred then compete for the 28 NFL kicking jobs. No one can make every kick, but these guys come close. Even if some of these elite kickers are a little better than others, variability in performance is caused by a poor snap from center, an imperfect hold, a gust of wind, and so on. Upon reflection, we would be surprised if the results had shown even moderate skill differences. Undoubtedly, some small skill differences do exist across these NFL kickers. However, over the course of a season or two, there simply are not enough field goal attempts to separate the best of the kickers from the remaining NFL kickers.

Discussion

In the fall of 1989, only 8 of the 28 NFL kickers were kicking for the team that originally signed them. We have already documented Nick Lowery's odyssey through the NFL training camps. Matt Bahr has kicked for two Super Bowl winners, the Steelers and Giants, going from Pittsburgh to New York via the Cleveland Browns. What accounts for this mobility? Over a long career, an 80% kicker is great, a 70% kicker is a little below average, and a 60% kicker is terrible. But in one season each kicker makes about 30 attempts—sometimes many less. A p = .7 kicker has an expected value of 21 successes (70%) out of 30 kicks, but 18 successes (60%) and 24 successes (80%) are only slightly more than one standard deviation from the mean. Obviously, a good kicking coach can assess a kicker's skill level by watching how he kicks as well as seeing whether or not the kick went through the uprights. Nevertheless, it is our conjecture that kickers are very often hired and fired based solely on binomial luck variance. Our advice to NFL kickers is: Rent—don't buy.

We conclude by returning to the Jimmy Johnson/Ken Willis episode. If the Tampa Bay team had had access to this article, would they have paid Willis $500,000 a year to kick for them? Probably. Most coaches keep searching for that elusive kicker who will never miss in crunch time, and the Bucs also liked the strength of Willis's leg (he made more 50+ yarders than anyone in 1991). Also, getting "one up" on Jimmy Johnson must have pleased Tampa Bay. So, although this article may not have changed Tampa Bay's behavior, from a normative point of view, we think the Bucs made a financial mistake. Jimmy Johnson, after all, was correct when he said "we can find another kicker."

So to the ethics of Coach Johnson's statement, "He broke his word," we can only note that Johnson himself was not acting in the spirit of Plan B free agency.


Was Ken Willis smart to take the money and run? Well, when you are competing against kicking colleagues, all of whom are essentially interchangeable parts, and an owner offers to triple your salary . . . Willis made the right call. (We just hope he did not sign more than a two-year lease.)

Epilogue

As the review process for this article was concluding, the NFL office released the 1992 data. The means (μ), polarization indices (θ), and correlations for 1992 are all consistent with the 1989-1991 results.

[Appendix (not reproduced): field goal attempts and successes for each NFL kicker, 1989-1991. Source: Elias Sports Bureau.]

The only slight deviation is μ = .910 and θ = .046 for the field goals of 29 yards or less. Although still very small, this polarization index is the highest in the whole study. The success rate of 91% for these short kicks is about three points below the other years. It turns out that one kicker, Greg Davis of Phoenix, caused most of these deviations. In the 3 previous years, Greg was a perfect 18 for 18 from 29 yards or less. In 1992, he missed 4 out of 10 of these short kicks. (We expect to see a lot of kickers in the Phoenix training camp this summer.) The spirit of the 1989-1991 results is clearly maintained in 1992.

So what do we, the authors, conclude? The data are consistent with no skill differences across NFL field goal kickers. Have we proved no skill difference? No, but if there is some true skill difference, it is certainly small compared to the within-kicker binomial variance. Do we believe some kickers are better than others? Yes. We would like to have either the veteran Morton Andersen or the rookie Jason Hanson kicking for our team—but mostly because of how far they kick compared to the typical NFL kicker. When it comes to accuracy per se for the typical attempt of less than 50 yards, the addition of the 1992 results only reinforces our belief that the NFL caliber kickers are, indeed, interchangeable parts. There are certainly numerous coaches, fans, and especially kickers who will disagree with us. But with the data so overwhelmingly on our side, the burden of proof would appear to be on those who disagree with us.

Additional Reading

Colombo, R.A., and Morrison, D.G. (1988), "Blacklisting Social Sciences Departments With Poor Ph.D. Submission Rates," Management Science, 34, 696-706.
Irving, G.W., and Smith, H.A. (1976), "A Model of Football Field Goal Kicker," in Management Science in Sports, TIMS Studies in the Management Sciences, eds. R.E. Machol and S.P. Ladany, Amsterdam: North-Holland, Vol. 4, pp. 47-58.
Kleinman, J.C. (1973), "Proportions With Extraneous Variance: Single and Independent Samples," Journal of the American Statistical Association, 68, 46-54.
Lim, B. (1992), "The Application of Stochastic Models to Study Buyer Behavior in Consumer Durable Product Categories," Ph.D. dissertation, Purdue University, Krannert Graduate School of Management.


Chapter 8

On the Probability of Winning a Football Game

HAL STERN*

Based on the results of the 1981, 1983, and 1984 National Football League seasons, the distribution of the margin of victory over the point spread (defined as the number of points scored by the favorite minus the number of points scored by the underdog minus the point spread) is not significantly different from the normal distribution with mean zero and standard deviation slightly less than fourteen points. The probability that a team favored by p points wins the game can be computed from a table of the standard normal distribution. This result is applied to estimate the probability distribution of the number of games won by a team. A simulation is used to estimate the probability that a team qualifies for the championship playoffs.

KEY WORDS: Goodness-of-fit tests; Normal distribution.

1. INTRODUCTION

The perceived difference between two football teams is measured by the point spread. For example, New York may be a three-point favorite to defeat Washington. Bets can be placed at fair odds (there is a small fee to the person handling the bet) on the event that the favorite defeats the underdog by more than the point spread. In our example, if New York wins by more than three points, then those who bet on New York would win their bets. If New York wins by less than three points (or loses the game), then those who bet on New York would lose their bets. If New York wins by exactly three points then no money is won or lost. The point spread is set so that the amount bet on the favorite is approximately the same as the amount bet against the favorite. This limits the risk of the people who handle the bets.

The point spread is of natural interest as a predictor of the outcome of a game. Although it is not necessarily an estimate of the difference in scores, the point spread has often been used in this capacity. Pankoff (1968), Vergin and Scriabin (1978), Tryfos, Casey, Cook, Leger, and Pylypiak (1984), Amoako-Adu, Manner, and Yagil (1985), and Zuber, Gandar, and Bowers (1985) considered statistical tests of the relationship between the point spread and the outcome of the game. Due to the large variance in football scores, they typically found that significant results (either proving or disproving a strong relationship) are difficult to obtain. Several of these authors then searched for profitable wagering strategies based on the point spread. The large variance makes such strategies difficult to find.

*Hal Stern is Assistant Professor, Department of Statistics, Harvard University, Cambridge, MA 02138. The author thanks Thomas M. Cover and a referee for helpful comments on the development and presentation of this article.

Other authors (Thompson 1975; Stefani 1977, 1980; Harville 1980) attempted to predict game outcomes or rank football teams using information other than the point spread.

The results of National Football League (NFL) games seem to indicate that the true outcome of a game can be modeled as a normal random variable with mean equal to the point spread. This approximation is developed in some detail, and two applications of this approach are described.

2. DATA ANALYSIS

The data set consists of the point spread and the score of each NFL game during the 1981, 1983, and 1984 seasons. Many newspapers list the point spread each day under the heading "the betting line." The sources of the point spread for this data set are the New York Post (1981) and the San Francisco Chronicle (1983, 1984). There is some variability in the published point spreads (from day to day and from newspaper to newspaper); however, that variability is small (typically less than one point) and should not have a large impact on the results described here. An attempt was made to use point spreads from late in the week since these incorporate more information (e.g., injuries) than point spreads from early in the week. For reasons of convenience, the day on which the data were collected varied between Friday and Saturday. The 1982 results are not included because of a players' strike that occurred that year. The total number of games in the data set is 672. More recent data (from 1985 and 1986) are used later to validate the results of this section. For each game the number of points scored by the favorite (F), the number of points scored by the underdog (U), and the point spread (P) are recorded. The margin of victory over the point spread (M) is defined by

    M = F - U - P

for each game. The distribution of M is concentrated on multiples of one-half since F and U are integers, while P is a multiple of one-half.

A histogram of the margin of victory over the point spread appears in Figure 1. Each bin of the histogram has a width of 4.5 points. The chi-squared goodness-of-fit test indicates that the distribution of M is not significantly different from a Gaussian distribution with mean zero and standard deviation 13.86 (computed from the data). The sample mean of M is .07. This has been rounded to zero because it simplifies the interpretation of the formula for the probability of winning in the next section. All observations of M larger than 33.75 in magnitude are grouped together, leading to a chi-squared test on 17 bins. The chi-squared test statistic is 15.05, between the .5 and .75 quantile of the limiting chi-squared distribution (14 degrees of freedom—17 bins with two estimated parameters).


Figure 1. Histogram of the Margin of Victory Over the Point Spread (M). Goodness-of-fit tests indicate that the distribution of M is approximately Gaussian.

The hypothesis of normality is also consistent with histograms having larger and smaller bin widths. Naturally the normal distribution is just an approximation. The variable M is concentrated on multiples of one-half, and integer values occur twice as often as noninteger values. This would not be the case if normality provided a more exact fit.
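
The binning scheme just described can be sketched in Python as follows. The function name is ours, the code assumes numpy and scipy, and the exact grouping and rounding choices in the original analysis may have differed slightly; the sketch simply tests observed margins against a Normal(0, 13.86) distribution using 4.5-point bins with the two tails beyond 33.75 grouped.

    import numpy as np
    from scipy import stats

    def chisq_normal_fit(m, sigma=13.86, bin_width=4.5, tail_cut=33.75):
        """Chi-squared goodness-of-fit test of margins-over-the-spread against
        a Normal(0, sigma) distribution, with the outermost bins grouped.

        m is an array of observed values of M = F - U - P.
        """
        edges = np.arange(-tail_cut, tail_cut + bin_width / 2, bin_width)
        edges = np.concatenate(([-np.inf], edges, [np.inf]))   # 17 bins in all
        observed, _ = np.histogram(m, bins=edges)
        cdf = stats.norm.cdf(edges, loc=0.0, scale=sigma)
        expected = len(m) * np.diff(cdf)
        chisq = np.sum((observed - expected) ** 2 / expected)
        dof = len(observed) - 1 - 2            # two parameters estimated from the data
        p_value = stats.chi2.sf(chisq, dof)
        return chisq, dof, p_value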

The Kolmogorov-Smirnov test is more powerful than the chi-squared test. The value of this test statistic is .913. Since the parameters of the normal distribution have been estimated from the data, the usual Kolmogorov-Smirnov significance levels do not apply. Using tables computed by Lilliefors (1967), we reject normality at the .05 significance level but not at the .01 significance level. This test is sensitive to the fact that the mode of the data does not match the mode of the normal distribution. We continue with the normal approximation despite this difference.

The results of the 1985 and 1986 seasons, collected after the initial analysis, provide additional evidence in favor of the normal approximation. The chi-squared statistic, using the parameters estimated from the 1981-1984 data, is 16.96. The p value for the chi-squared test is larger than .25. The Kolmogorov-Smirnov test statistic is .810, indicating a better fit than the original data set (the p value is approximately .10). More recent data may be used to verify that the approximation continues to hold.

3. THE PROBABILITY OF WINNING A GAME

What is the probability that a p-point favorite wins a football game? The natural estimate is the proportion of p-point favorites in the sample that have won their game. This procedure leads to estimates with large standard errors because of the small number of games with any particular point spread. The normal approximation of the previous section can be used to avoid this problem.

The probability that a team favored by p points wins the game is

    Pr(F > U | P = p).

The argument in Section 2 shows that M = F - U - P is approximately normal. A more detailed analysis indicates that normality appears to be a valid approximation for F - U - P conditional on each value of P. This is difficult to demonstrate since there are few games with any particular value of P. A series of chi-squared tests were performed for games with similar point spreads. The smallest sample size was 69 games; the largest was 112 games. Larger bins were used in the chi-squared test (a bin width of 10.5 points instead of the 4.5 points used in Fig. 1) because of the size of the samples. Neighboring bins were combined so that each bin had an expected count of at least five. None of the eight tests was significant; the smallest p value was greater than .10. These tests seem to indicate that normality is an adequate approximation for each range of point spreads. If we apply normality for a particular point spread, p, then F - U is approximately normal with mean p and standard deviation 13.861. The probability of winning a game is then computed as

    Pr(F > U | P = p) = Pr(M > -p) ≈ Φ(p/13.861),

where Φ(·) is the cumulative distribution function of the standard normal random variable.

The normal approximation for the probability of victory is given for some sample point spreads (the odd numbers) in Table 1. The observed proportion of p-point favorites that won their game, P̂, and an estimated standard error are also computed. The estimates from the normal formula are consistent with the estimates made directly from the data. In addition, they are monotone increasing in the point spread. This is consistent with the interpretation of the point spread as a measure of the difference between two teams. The empirical estimates do not have this property. A linear approximation to the probability of winning is

    Pr(F > U | P = p) ≈ .50 + .03p.

This formula is accurate to within .0175 for |p| < 6.
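
The normal-approximation column of Table 1 below follows directly from Φ(p/13.861). A minimal Python sketch, using the error function for the standard normal cdf (the constant and function names are ours), is:

    from math import erf, sqrt

    SIGMA = 13.861   # estimated standard deviation of the margin over the spread

    def win_probability(spread):
        """Probability that a team favored by `spread` points wins the game,
        using the normal approximation Phi(spread / SIGMA)."""
        return 0.5 * (1 + erf(spread / (SIGMA * sqrt(2))))

    # Reproduces the normal-approximation column of Table 1:
    # win_probability(1) -> about .529, win_probability(3) -> about .586,
    # win_probability(7) -> about .693, win_probability(9) -> about .742.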

Table 1. The Normal Approximation and the Empirical Probability of Winning

    Point spread    Pr(F > U | P)      P̂       Standard error
         1              .529          .571          .071
         3              .586          .582          .055
         5              .641          .615          .095
         7              .693          .750          .065
         9              .742          .650          .107


4. APPLICATIONS

Conditional on the value of the point spread, the outcome of each game (measured by F - U) can be thought of as the sum of the point spread and a zero-mean Gaussian random variable. This is a consequence of the normal distribution of M. We assume that the zero-mean Gaussian random variables associated with different games are independent. Although successive football games are almost certainly not independent, it seems plausible that the random components (performance above or below the point spread) may be independent. The probability of a sequence of events is computed as the product of the individual event probabilities.

For example, the New York Giants were favored by two points in their first game and were a five-point underdog in their second game. The probability of winning both games is Φ(2/13.861)Φ(-5/13.861) = .226. Adding the probabilities for all (16 choose k) sequences of game outcomes that have k wins leads to the probability distribution in Table 2. The point spreads used to generate Table 2 are:

2, -5, -6, 6, -3, -3.5, -5, 0, -6, -7, 3, -1, 7, 3.5, -4, 9.

The Giants actually won nine games. This is slightly higher than the mean of the distribution, which is 7.7. Since this is only one observation, it is difficult to test the fit of the estimated distribution.
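
The summation over all (16 choose k) sequences is equivalent to a simple convolution over games, which the following Python sketch implements. The function and variable names are ours; the spreads are those listed above and SIGMA is the estimate from Section 2.

    from math import erf, sqrt

    SIGMA = 13.861
    GIANTS_1984_SPREADS = [2, -5, -6, 6, -3, -3.5, -5, 0,
                           -6, -7, 3, -1, 7, 3.5, -4, 9]   # from the text above

    def win_probability(spread):
        """Normal-approximation probability of winning given the point spread."""
        return 0.5 * (1 + erf(spread / (SIGMA * sqrt(2))))

    def win_distribution(spreads):
        """Distribution of the season win total, treating games as independent
        Bernoulli trials with spread-based win probabilities."""
        dist = [1.0]                       # probability of 0 wins after 0 games
        for s in spreads:
            p = win_probability(s)
            new = [0.0] * (len(dist) + 1)
            for wins, prob in enumerate(dist):
                new[wins] += prob * (1 - p)    # lose this game
                new[wins + 1] += prob * p      # win this game
            dist = new
        return dist

    dist = win_distribution(GIANTS_1984_SPREADS)
    mean_wins = sum(k * p for k, p in enumerate(dist))   # close to 7.7 for these spreads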

We use the results of all 28 teams over three years to assess the fit of the estimated distribution. Let

    p_ij(x) = probability that team i wins x games during season j, for x = 0, ..., 16,

    F_ij(x) = estimated cdf for the number of wins by team i during season j,

Table 2. Distribution of the Number of Wins by the 1984 New York Giants

    Number of wins    Probability
          0             .0000
          1             .0002
          2             .0020
          3             .0099
          4             .0329
          5             .0791
          6             .1415
          7             .1928
          8             .2024
          9             .1642
         10             .1028
         11             .0491
         12             .0176
         13             .0046
         14             .0008
         15             .0001
         16             .0000

and

    X_ij = observed number of wins for team i during season j

for i = 1, ..., 28 and j = 1, 2, 3. The index i represents the team and j the season (1981, 1983, or 1984). The distribution p_ij(·) and the cdf F_ij(·) represent the distribution of the number of wins when the normal approximation to the distribution of M is applied. Also, let U_ij be independent random variables uniformly distributed on (0, 1). According to a discrete version of the probability integral transform, if X_ij ~ F_ij, then F_ij(X_ij) - U_ij p_ij(X_ij) has the uniform distribution on the interval (0, 1). The U_ij represent auxiliary randomization needed to attain the uniform distribution. A chi-squared test is used to determine whether the transformed X_ij are consistent with the uniform distribution and therefore determine whether the X_ij are consistent with the distribution F_ij(·). The chi-squared statistic is computed from 84 observations grouped into 10 bins between 0 and 1. Four different sets of uniform variates were used, and in each case the data were found to be consistent with the uniform distribution. The maximum observed chi-squared statistic in the four trials was 13.1, between the .75 and the .90 quantiles of the limiting distribution. The actual records of NFL teams are consistent with predictions made using the normal approximation for the probability of winning a game.
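
The randomized transform F_ij(X_ij) - U_ij p_ij(X_ij) is straightforward to compute; a minimal sketch, with names of our choosing, is below. The commented usage assumes the win distribution computed in the earlier sketch.

    import random

    def discrete_pit(x, pmf, cdf):
        """Randomized probability integral transform for a discrete variable.

        If x really is drawn from the distribution with this pmf/cdf, the value
        returned is uniformly distributed on (0, 1).
        """
        u = random.random()
        return cdf(x) - u * pmf(x)

    # Example with the 1984 Giants win distribution computed earlier:
    # pmf = lambda k: dist[k]
    # cdf = lambda k: sum(dist[:k + 1])
    # discrete_pit(9, pmf, cdf)   # one transformed observation; the article pools
    #                             # 84 of these and tests them for uniformity.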

Using the point spreads of the games for an entire season as input, it is possible to determine the probability of a particular outcome of the season. This type of analysis is necessarily retrospective since the point spreads for the entire season are not available until the season has been completed. To find the probability that a particular team qualifies for the postseason playoffs, we could consider all possible outcomes of the season. This would involve extensive computations. Instead, the probability of qualifying for the playoffs is estimated by simulating the NFL season many times. In a simulated season, the outcome of each game is determined by generating a Bernoulli random variable with probability of success determined by the point spread of that game. For each simulated season, the 10 playoff teams are determined. Six playoff teams are determined by selecting the teams that have won each of the six divisions (a division is a collection of four or five teams). The winning team in a division is the team that has won the most games. If two or more teams in a division are tied, then the winner is selected according to the following criteria: results of games between tied teams, results of games within the division, results of games within the conference (a collection of divisions), and finally random selection. It is not possible to use the scores of games, since scores are not simulated. Among the teams in each conference that have not won a division, the two teams with the most wins enter the playoffs as "wildcard" teams (recently increased to three teams). Tie-breaking procedures for wildcard teams are similar to those mentioned above.
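
A stripped-down version of this Monte Carlo scheme is sketched below in Python. It only simulates game outcomes and tallies win totals; the division, wildcard, and tie-breaking logic described above is omitted, and the schedule format and function names are our assumptions rather than the article's.

    import random
    from math import erf, sqrt

    SIGMA = 13.861

    def win_probability(spread):
        # same normal approximation as in the earlier sketches
        return 0.5 * (1 + erf(spread / (SIGMA * sqrt(2))))

    def simulate_season(schedule, n_sims=10000, seed=0):
        """Estimate each team's expected win total by Monte Carlo simulation.

        schedule: list of (favorite, underdog, spread) triples for every game,
        with spread the number of points by which the favorite is favored.
        Returns a dict of average simulated wins per team.  (The article goes
        further, applying the NFL tie-breaking rules to pick playoff teams.)
        """
        rng = random.Random(seed)
        totals = {}
        for _ in range(n_sims):
            wins = {}
            for fav, dog, spread in schedule:
                p = win_probability(spread)
                winner = fav if rng.random() < p else dog
                wins[winner] = wins.get(winner, 0) + 1
            for team, w in wins.items():
                totals[team] = totals.get(team, 0) + w
        return {team: w / n_sims for team, w in totals.items()}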

The 1984 NFL season has been simulated 10,000 times.


Table 3. Results of 10,000 Simulations of the 1984 NFL Season

    Team             Pr(win division)   Pr(qualify for playoffs)   1984 actual result

    National Conference—Eastern Division
    Washington            .5602                .8157               division winner
    Dallas                .2343                .5669
    St. Louis             .1142                .3576
    New York              .0657                .2291               wildcard playoff team
    Philadelphia          .0256                .1209

    National Conference—Central Division
    Chicago               .3562                .4493               division winner
    Green Bay             .3236                .4170
    Detroit               .1514                .2159
    Tampa Bay             .1237                .1748
    Minnesota             .0451                .0660

    National Conference—Western Division
    San Francisco         .7551                .8771               division winner
    Los Angeles           .1232                .3306               wildcard playoff team
    New Orleans           .0767                .2291
    Atlanta               .0450                .1500

    American Conference—Eastern Division
    Miami                 .7608                .9122               division winner
    New England           .1692                .4835
    New York              .0607                .2290
    Buffalo               .0051                .0248
    Indianapolis          .0042                .0205

    American Conference—Central Division
    Pittsburgh            .4781                .5774               division winner
    Cincinnati            .3490                .4574
    Cleveland             .1550                .2339
    Houston               .0179                .0268

    American Conference—Western Division
    Los Angeles           .4555                .7130               wildcard playoff team
    Seattle               .2551                .5254               wildcard playoff team
    Denver                .1311                .3405               division winner
    San Diego             .1072                .2870
    Kansas City           .0511                .1686

For each team, the probability of winning its division has been computed. The probability of being selected for the playoffs has also been determined. The results appear in Table 3. Each estimated probability has a standard error that is approximately .005. Notice that over many repetitions of the season, the eventual Super Bowl champion San Francisco would not participate in the playoffs approximately 12% of the time.

5. SUMMARY

What is the probability that a team favored to win a football game by p points does win the game? It turns out that the margin of victory for the favorite is approximated by a Gaussian random variable with mean equal to the point spread and standard deviation estimated at 13.86. The normal cumulative distribution function can be used to compute the probability that the favored team wins a football game. This approximation can also be used to estimate the distribution of games won by a team or the probability that a team makes the playoffs. These results are based on a careful analysis of the results of the 1981, 1983, and 1984 National Football League seasons. More recent data (1985 and 1986) indicate that the normal approximation is valid outside of the original data set.

[Received June 1989. Revised December 1989.]

REFERENCES

Amoako-Adu, B., Manner, H., and Yagil, J. (1985), "The Efficiency of Certain Speculative Markets and Gambler Behavior," Journal of Economics and Business, 37, 365-378.
Harville, D. (1980), "Predictions for National Football League Games via Linear-Model Methodology," Journal of the American Statistical Association, 75, 516-524.
Lilliefors, H.W. (1967), "On the Kolmogorov-Smirnov Test for Normality With Mean and Variance Unknown," Journal of the American Statistical Association, 62, 399-402.
Pankoff, L.D. (1968), "Market Efficiency and Football Betting," Journal of Business, 41, 203-214.
Stefani, R.T. (1977), "Football and Basketball Predictions Using Least Squares," IEEE Transactions on Systems, Man, and Cybernetics, 7, 117-121.
(1980), "Improved Least Squares Football, Basketball, and Soccer Predictions," IEEE Transactions on Systems, Man, and Cybernetics, 10, 116-123.
Thompson, M.L. (1975), "On Any Given Sunday: Fair Competitor Orderings With Maximum Likelihood Methods," Journal of the American Statistical Association, 70, 536-541.
Tryfos, P., Casey, S., Cook, S., Leger, G., and Pylypiak, B. (1984), "The Profitability of Wagering on NFL Games," Management Science, 30, 123-132.
Vergin, R.C., and Scriabin, M. (1978), "Winning Strategies for Wagering on Football Games," Management Science, 24, 809-818.
Zuber, R.A., Gandar, J.M., and Bowers, B.D. (1985), "Beating the Spread: Testing the Efficiency of the Gambling Market for National Football League Games," Journal of Political Economy, 93, 800-806.


Part II

Statistics in Baseball


Chapter 9

Introduction to the Baseball Articles

Jim Albert and James J. Cochran

In this introduction we provide a brief background on the application of statistical methods in baseball and we identify particular research areas. We use the articles selected for this volume to describe the history of statistical research in baseball.

9.1 Background

Baseball, often referred to as the national pastime, is one of the most popular sports in the United States. Baseball began in the eastern United States in the mid 1800s. Professional baseball ensued near the end of the 19th century; the National League was founded in 1876 and the American League in 1900. Currently, in the United States there are 28 professional teams that make up the American and National Leagues, and millions of fans watch games in ballparks and on television.

Baseball is played between two teams, each consisting of nine players. A game of baseball is comprised of nine innings, each of which is divided into two halves. In the top half of the inning, one team plays in the field and the other team comes to bat; the teams reverse their roles in the bottom half of the inning. The team that is batting during a particular half-inning is trying to score runs. The team with the higher number of runs at the end of the nine innings is the winner of the game. If the two teams have the same number of runs at the end of nine innings, additional or "extra" innings are played until one team has an advantage in runs scored at the conclusion of an inning.

During an inning, a player on the team in the field (called the pitcher) throws a baseball toward a player of the team at-bat (who is called the batter). The batter will try to hit the ball using a wooden stick (called a bat) in a location out of the reach of the players in the field. By hitting the ball, the batter has the opportunity to run around four bases that lie in the field. If a player advances around all of the bases, he has scored a run. If a batter hits a ball that can be caught before it hits the ground, hits a ball that can be thrown to first base before he runs to that base, or is tagged with the ball while attempting to advance to any base beyond first base, he is said to be out and cannot score a run. A batter is also out if he fails to hit the baseball three times or if three good pitches (called strikes) have been thrown. The objective of the batting team during an inning is to score as many runs as possible before the defense records three outs.

9.2 Standard Performance Measures and Sabermetrics

One notable aspect of the game of baseball is the wealth of numerical information that is recorded about the game. The effectiveness of batters and pitchers is typically assessed by particular numerical measures. The usual measure of hitting effectiveness for a player is the batting average, computed by dividing the number of hits by the number of at-bats. This statistic gives the proportion of opportunities (at-bats) in which the batter succeeds (gets a hit). The batter with the highest batting average during a baseball season is called the best hitter that year. Batters are also evaluated on their ability to reach one, two, three, or four bases on a single hit; these hits are called, respectively, singles, doubles, triples, and home runs. The slugging average, a measure of this ability, is computed by dividing the total number of bases (in short, total bases) by the number of opportunities.


Since it weights hits by the number of bases reached, this measure reflects the ability of a batter to hit a long ball for distance. The most valued hit in baseball, the home run, allows a player to advance four bases on one hit (and allows all other players occupying bases to score as well). The number of home runs is recorded for all players and the batter with the largest number of home runs at the end of the season is given special recognition.

A number of statistics are also used in the evaluation of pitchers. For a particular pitcher, one counts the number of games in which he was declared the winner or loser and the number of runs allowed. Pitchers are usually rated in terms of the average number of "earned" runs (runs scored without direct aid of an error or physical mistake by one of the pitcher's teammates) allowed for every nine innings pitched. Other statistics are useful in understanding pitching ability. A pitcher records a strikeout when the batter fails to hit the ball in the field and records a walk when he throws four inaccurate pitches (balls) to the batter. A pitcher who can throw the ball very fast can record a high number of strikeouts. A pitcher who lacks control over his pitches is said to be "wild" and will record a relatively large number of walks.

Sabermetrics is the mathematical and statistical study of baseball records. One goal of researchers in this field is to find good measures of hitting and pitching performance. Bill James (1982) compares the batting records of two players, Johnny Pesky and Dick Stuart, who played in the 1960s. Pesky was a batter who hit a high batting average but hit few home runs. Stuart, in contrast, had a modest batting average, but hit a high number of home runs. Who was the more valuable hitter? James argues that a hitter should be evaluated by his ability to create runs for his team. From an empirical study of a large collection of team hitting data, he established the following formula for predicting the number of runs scored in a season based on the number of hits, walks, at-bats, and total bases recorded in a season:

RUNS = (HITS + WALKS) × (TOTAL BASES) / (AT-BATS + WALKS)

This formula reflects two important aspects in scoring runs in baseball. The number of hits and walks of a team reflects the team's ability to get runners on base, while the number of total bases of a team reflects the team's ability to move runners that are already on base. James' runs created formula can be used at an individual level to compute the number of runs that a player creates for his team. In 1942, Johnny Pesky had 620 at-bats, 205 hits, 42 walks, and 258 total bases; using the formula, he created 96 runs for his team. Dick Stuart in 1960 had 532 at-bats with 160 hits, 34 walks, and 309 total bases for 106 runs created. The conclusion is that Stuart in 1960 was a slightly better hitter than Pesky in 1942 since Stuart created a few more runs for his team (and in far fewer plate appearances). An alternative approach to evaluating batting performance is based on a linear weights formula. George Lindsey (1963) was the first person to assign run values to each event that could occur while a team was batting. By the use of recorded data from baseball games and probability theory, he developed the formula

RUNS = (.41) IB + (.82)2B + (1.06)3B + (1.42)HR

where 1B, 2B, 3B, and HR are, respectively, the number of singles, doubles, triples, and home runs hit in a game. One notable aspect of this formula is that it recognizes that a batter creates a run three ways. There is a direct run potential when a batter gets a hit and gets on base. In addition, the batter can advance runners that are already on base. Also, by not getting an out, the hitter allows a new batter a chance of getting a hit, and this produces an indirect run potential. Thorn and Palmer (1993) present a more sophisticated version of the linear weights formula which predicts the number of runs produced by an average baseball team based on all of the offensive events recorded during the game. Like James' runs created formula, the linear weights rule can be used to evaluate a player's batting performance. Although scoring runs is important in baseball, the basic objective is for a team to outscore its opponent. To learn about the relationship between runs scored and the number of wins, James (1982) looked at the number of runs produced, the number of runs allowed, the number of wins, and the number of losses during a season for a number of major league teams. James noted that the ratio of a team's wins to losses was approximately equal to the square of the ratio of runs scored to the runs allowed. Equivalently,

WINS / (WINS + LOSSES) = RUNS² / (RUNS² + (OPPOSITION RUNS)²)

This relationship can be used to measure a batter's performance in terms of the number of wins that he creates for his team.
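To make these formulas concrete, the short Python sketch below implements the runs created, linear weights, and Pythagorean relations and reproduces the Pesky and Stuart calculations given above. The coefficients and player totals are taken from the text; the function names are ours.

    def runs_created(hits, walks, total_bases, at_bats):
        # Bill James's runs created formula
        return (hits + walks) * total_bases / (at_bats + walks)

    def linear_weights_runs(singles, doubles, triples, home_runs):
        # George Lindsey's linear weights estimate of runs in a game
        return 0.41 * singles + 0.82 * doubles + 1.06 * triples + 1.42 * home_runs

    def pythagorean_win_pct(runs, opposition_runs):
        # James's relation between runs scored, runs allowed, and winning percentage
        return runs ** 2 / (runs ** 2 + opposition_runs ** 2)

    # Johnny Pesky, 1942: 205 hits, 42 walks, 258 total bases, 620 at-bats
    print(round(runs_created(205, 42, 258, 620)))   # about 96 runs created
    # Dick Stuart, 1960: 160 hits, 34 walks, 309 total bases, 532 at-bats
    print(round(runs_created(160, 34, 309, 532)))   # about 106 runs created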

Sabermetrics has also developed better ways of evaluating pitching ability. The standard pitching statistics, the number of wins and the earned runs per game (ERA), are flawed. The number of wins of a pitcher can just reflect the fact that he pitches for a good offensive (run-scoring) team. The ERA does measure the rate of a pitcher's efficiency, but it does not measure the actual benefit of this pitcher over an entire season. Thorn and Palmer (1993) developed the pitching runs formula

PITCHING RUNS = (INNINGS PITCHED) × (LEAGUE ERA / 9) − EARNED RUNS ALLOWED

The factor (LEAGUE ERA/9) measures the average runs allowed per inning for all teams in the league. This value is multiplied by the number of innings pitched by that pitcher—this product represents the number of runs that pitcher would allow over the season if he was average. Last, one subtracts the actual earned runs the pitcher allowed for that season. If the number of pitching runs is larger than zero, then this pitcher is better than average. This new measure appears to be useful in determining the efficiency and durability of a pitcher.
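A direct computation of this measure is shown below; the pitcher's line (220 innings, 75 earned runs) and the league ERA of 4.00 are hypothetical values chosen only to illustrate the formula.

    def pitching_runs(innings_pitched, league_era, earned_runs_allowed):
        # runs an average pitcher would allow in the same innings, minus the
        # earned runs this pitcher actually allowed
        return innings_pitched * (league_era / 9.0) - earned_runs_allowed

    print(pitching_runs(220, 4.00, 75))   # about 22.8 runs better than average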

9.3 Modeling Events

Probability models can be very helpful in understanding the observed patterns in baseball data. One of the earliest papers of this type was that of Lindsey (1961) (Chapter 16 in this volume), who modeled the runs scored in a game. The first step of his approach was to construct an empirical probability function for the number of runs scored in a half-inning. Lindsey used this probability function to model the progression of a game. He modeled the length of the game (in innings), the total runs scored by both teams, and the likelihood that one team would be leading the other by a particular number of runs after a certain number of innings. Suppose that a team is losing to another by two runs after seven innings—does the team have any chance of winning? From Lindsey (1961, Figure 5B), we would estimate this probability to be about 12%. Lindsey is careful to verify his modeling results with data from two baseball seasons. Several important assumptions are made in this paper that greatly simplify the modeling: teams are assumed to be homogeneous with respect to their ability to score, and runs scored in different half-innings are assumed independent. In his summary remarks, Lindsey gives several instances in which these results are helpful to both fans and team managers. For example, in the 1960 World Series, the Pirates defeated the Yankees despite being outscored by 16-3, 10-0, and 12-0 in the three losing games. Lindsey calculates that the probability of three one-sided games like these in seven games is approximately .015, so one can conclude that this series was an unusual occurrence.

Modeling is generally useful for understanding the significance of "rare" events in baseball. One of the most exciting rare events is the no-hitter, a game in which one pitcher pitches a complete game (typically nine innings) and doesn't allow a single base hit. Frohlich (1994) (Chapter 14 in this volume) notes that a total of 202 no-hitters have occurred in Major League Baseball in the period 1900-1993 and uses probability modeling to see if this is an unusually high number. In Frohlich's simple probability (SP) model, he assumes that any batter gets a hit with fixed probability p throughout history—he shows that this model predicts only 135 no-hitters since 1900. Of course, this model is an oversimplification since batters and pitchers have different abilities. Frohlich makes a major improvement to his SP model by assuming that the expected number of hits allowed in a nine-inning game is not constant, but instead varies across pitchers according to a normal distribution with a given spread. This variable pitcher (VP) model is shown to do a much better job in predicting the number of no-hitters.
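The flavor of an SP-type calculation can be sketched in a few lines. Under the crude simplifications that a no-hitter requires retiring about 27 batters without a hit and that every at-bat produces a hit with the same fixed probability, the chance of a no-hitter in a single pitcher-game is roughly (1 − p)^27, and the expected count over a period is that chance times the number of pitcher-games. The hit probability and the number of pitcher-games below are illustrative assumptions only; Frohlich's actual calculation treats batters faced and year-by-year hit rates more carefully, so this toy version is not expected to reproduce his figure of 135.

    def expected_no_hitters(p_hit, pitcher_games, batters_faced=27):
        # expected no-hitters if every at-bat yields a hit with probability p_hit
        p_no_hitter = (1.0 - p_hit) ** batters_faced
        return pitcher_games * p_no_hitter

    # illustrative values only: a .260 per-at-bat hit probability, 250,000 pitcher-games
    print(expected_no_hitters(0.260, 250_000))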

Using these models, Frohlich comes to some interesting conclusions regarding the pitchers and teams that are involved in no-hitters. He notes that 23% of all no-hitters have been pitched by the best 60 pitchers in history (ranked by the number of hits allowed per nine innings), and this percentage is significantly higher than the percentages predicted using either the SP or VP model. On the other hand, weak-hitting teams are not more likely to receive no hits. Moreover, the assumption of batter variation appears to be much less important than the assumption of pitcher variability in predicting the number of no-hitters. One factor that may be important in a no-hitter is the decision of the official scorer—plays may be scored as "errors" instead of "hits" and these decisions affect the statistics of the pitcher. Using some reasonable probability calculations, Frohlich predicts that this scoring bias may increase the number of no-hitters about 5-10% over the number predicted by the SP model. The rate of no-hitters across time is also explored. There is no general evidence that no-hitters are more common now than in the past. However, Frohlich notes that an unusually large number of no-hitters (16) occurred in 1990 and 1991 and there is no explanation found for this number—we just happened to observe a rare event.

9.4 Comparing Performances of Individual Players

Useful measures of hitting, pitching, and fielding performances of baseball players have been developed. However, these statistics do not directly measure a player's contribution to a win for his team. Bennett and Flueck (1984) (Chapter 12 in this volume) developed the Player Game Percentage (PGP) method for evaluating a player's game contribution. This work extends work of Lindsey (1963) and Mills and Mills (1970). Using observed game data culled from several seasons, Lindsey (1963) was able to estimate the expected numbers of runs scored in the remainder of a half-inning given the number of outs and the on-base situation (runners present at first, second, and third base). By taking the difference between the expected runs scored before and after a plate appearance, one can judge the benefit of a given batting play. Mills and Mills (1970) took this analysis one step further by estimating the probability that a team would win at a particular point during a baseball game. Bennett and Flueck (1984) extend the methodology of Mills and Mills (1970) in several ways. First, they developed tables of win probabilities for each inning given the run differential. Second, they measured the impact of a play directly by the change in win probabilities, allocating half of this change to the offensive performer and half to the defensive player. One can measure a player's contribution to winning a game by summing the changes in win probabilities for each play in which the player has participated. The PGP statistic is used by Bennett (1993) (Chapter 11 in this volume) to evaluate the batting performance of Joe Jackson. This player was banished from baseball for allegedly throwing the 1919 World Series. A statistical analysis using the PGP showed that Jackson played to his full potential during this series. Bennett looks further to see if Jackson failed to perform well in "clutch" situations. He looks at traditional clutch hitting statistics, investigates whether Jackson's PGP measure was small given his slugging percentage, and does a resampling analysis to see if Jackson's PGP value was unusual for hitters with similar batting statistics. In his conclusions, Bennett highlights a number of players on the team that had weak performances during this World Series.

Baseball fans are often interested in comparing batters or pitchers from different eras. In making these comparisons, it is important to view batting or pitching statistics in the context in which they were achieved. For example, Bill Terry led the National League in 1930 with a batting average of .401, a mark that has been surpassed since by only one hitter. In 1968, Carl Yastrzemski led the American League in hitting with an average of .301. It appears on the surface that Terry clearly was the superior hitter. However, when viewed relative to the hitters that played during the same time, both hitters were approximately 27% better than the average hitter (Thorn and Palmer, 1993). The hitting accomplishments of Terry in 1930 and Yastrzemski in 1968 were actually very similar. Likewise, there are significant differences in hitting in different ballparks, and hitting statistics need to be adjusted for the ballpark played in to make accurate comparisons between players.

9.5 Streaks

Another interesting question concerns the existence of streakiness in hitting data. During a season it is observed that some ballplayers will experience periods of "hot" hitting, where they will get a high proportion of hits. Other hitters will go through slumps or periods of hitting with very few hits. However, these periods of hot and cold hitting may be just a reflection of the natural variability observed in coin tossing. Is there statistical evidence for a "hot hand" among baseball hitters where the probability of obtaining a hit is dependent on recent at-bats? Albright (1993) looked at a large collection of baseball hitting data and used a number of statistics, such as the number of runs, to detect streakiness in hitting data. His main conclusion was that there is little statistical evidence generally for a hot hand in baseball hitting.

Currently there is great interest among fans and the media in situational baseball data. The hitting performance of batters is recorded for a number of different situations, such as day versus night games, grass versus artificial turf fields, right-handed versus left-handed pitchers, and home versus away games. There are two basic questions in the statistical analysis of this type of data. First, are there particular situations that can explain a significant amount of variation in the hitting data? Second, are there ballplayers who perform particularly well or poorly in a given situation? Albert (1994) (Chapter 10 in this volume) analyzes a large body of published situational data and used Bayesian hierarchical models to combine data from a large group of players. His basic conclusion is that there do exist some important situational differences. For example, batters hit on average 20 points higher when facing a pitcher of the opposite arm, and hit 8 points higher when they are playing in their home ballpark. Many of these situational differences appear to affect all players in the same way. Coors Field in Colorado is a relatively easy ballpark in which to hit home runs, and all players' home run hitting abilities will be increased in this ballpark by the same amount. It is relatively unusual to see a situation that is a so-called ability effect, where players' hitting abilities are changed by a different amount depending on the situation. One example of this type of ability effect occurs in the pitch count. Good contact hitters, such as Tony Gwynn, hit approximately the same when they are behind or ahead in the count, and other hitters (especially those who strike out frequently) have significantly smaller batting averages when behind in the count. Because of the small sample sizes inherent in situational data, most of the observed variation in situational data is essentially noise, and it is difficult to detect situational abilities based on a single season of data.

9.6 Projecting Player and Team Performances

Watching a baseball game raises questions that motivate interesting statistical analyses. During the broadcast of a game, a baseball announcer will typically report selected hitting data for a player. For example, it may be reported that Barry Bonds has 10 hits in his most recent 20 at-bats. What have you learned about Bonds' batting average on the basis of this information? Clearly, Bonds' batting average can't be as large as 10/20 = .500 since this data was chosen to maximize the reported percentage. Casella and Berger (1994) (Chapter 13 in this volume) show how one can perform statistical inference based on this type of selected data. Suppose that one observes a sequence of n Bernoulli trials, and one is given the positive integers k* and n*, where the ratio k*/n* is the largest ratio of hits to at-bats in the entire batting sequence. Casella and Berger construct the likelihood function for a player's true batting average on the basis of this selected information and use modern sampling methodology to find the maximum likelihood (ml) estimate. They are interested in comparing this estimate with the ml estimate of the complete data set. They conclude that this selected data provides relatively little insight into the batting average that is obtained from batting records over the entire season. However, the complete data ml estimate is generally within one standard deviation of the selected data ml estimate. Also, the total number of at-bats n is helpful in understanding the significance of the ratio k*/n*. As one might expect, Bonds' reported performance of 10 hits in his last 20 at-bats would be less impressive if he had 600 at-bats instead of 400 at-bats in a season.

James, Albert, and Stern (1993) (Chapter 15 in this volume) discuss the general problem of interpreting baseball statistics. There are several measures of batting and pitching performance that define excellence. For example, a pitcher is considered "great" if he wins 20 games or a batter is "great" if he hits 50 home runs or has over 120 runs batted in. But there is usually no consideration of the role of chance variability in these observed season performances. One pitcher who wins 20 games in a season will be considered superior to another pitcher who wins only 15. But it is very plausible that the first pitcher won more games than the second pitcher due solely to chance variability. In other words, if two players with equal ability pitch an equal number of games, it is plausible that, by chance variability, one pitcher will win five games more than the second. To illustrate this point, James, Albert, and Stern focus on the issue of team competition. Suppose that the Yankees win the World Series—are they truly the "best" team in baseball that year? To answer this question, the authors construct a simple model for baseball competition. Teams are assumed to possess abilities that are normally distributed; the spread of this normal curve is set so that the performances of the teams match the performances of modern-day teams. Then a Bradley-Terry choice model is used to represent team competition. This model is used to simulate 1000 baseball seasons, and the relationship between the participating teams' abilities and their season performances is explored. The authors reach some interesting conclusions from their simulation. There is a 30% chance that "good" teams (in the top 10% of the ability distribution) will have less than good seasons. Of the World Series winners in the simulations, only 46% corresponded to good teams. The cream will generally rise to the top, but teams of "average" or "above-average" abilities can have pretty good seasons.
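A stripped-down version of this simulation experiment can be written in a few lines of Python. Team abilities are drawn from a normal distribution and a Bradley-Terry (logistic) model turns ability differences into game-winning probabilities. The number of teams, the ability spread, and the schedule length below are illustrative choices, not the settings used by James, Albert, and Stern.

    import math
    import random

    def simulate_season(n_teams=14, ability_sd=0.19, games_per_pair=12, rng=random):
        # Simulate one round-robin season under a Bradley-Terry model and return
        # (index of the most talented team, index of the team with the most wins).
        abilities = [rng.gauss(0.0, ability_sd) for _ in range(n_teams)]
        wins = [0] * n_teams
        for i in range(n_teams):
            for j in range(i + 1, n_teams):
                # probability that team i beats team j in a single game
                p = 1.0 / (1.0 + math.exp(-(abilities[i] - abilities[j])))
                for _ in range(games_per_pair):
                    if rng.random() < p:
                        wins[i] += 1
                    else:
                        wins[j] += 1
        best_ability = max(range(n_teams), key=lambda k: abilities[k])
        best_record = max(range(n_teams), key=lambda k: wins[k])
        return best_ability, best_record

    # How often does the most talented team also finish with the best record?
    agree = sum(a == b for a, b in (simulate_season() for _ in range(1000)))
    print(agree / 1000)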

Cochran (2000) also uses a Bradley-Terry choice model as the basis for simulating final divisional standings. Expected win/loss percentages for each team in a four-team division are systematically changed in accordance with a complete-block design, and 10,000 seasons are simulated for each unique set of expected win/loss percentages. He then conducts a logistic regression of the probability that the team with the best win/loss percentage in the division actually wins the division, using the differences between the best win/loss percentage in the division and the expected win/loss percentages of the other three teams in the division as independent variables. He found that there is very little marginal benefit (with regard to winning a division title) to improving a team once its expected regular season wins exceed the expected wins for every other team in its division by five games.

9.7 Summary

Currently, Major League Baseball games are recorded in very fine detail. Information about every single ball pitched, fielded, and hit during a game is noted, creating a large database of baseball statistics. This database is used in a number of ways. Public relations departments of teams use the data to publish special statistics about their players. The statistics are used to help determine the salaries of major league ballplayers. Specifically, statistical information is used as evidence in salary arbitration, a legal proceeding which sets salaries. A number of teams have employed full-time professional statistical analysts, and some managers use statistical information in deciding on strategy during a game. Bill James and other baseball statisticians have shown that it is possible to answer a variety of questions about the game of baseball by means of statistical analyses.

The seven baseball articles included in this volume characterize how statistical thinking can influence baseball strategy and performance evaluation. Statistical research in baseball continues to grow rapidly as the level of detail in the available data increases and the use of statistical analyses by professional baseball franchises proliferates.

References

Albert, J. (1994), "Exploring baseball hitting data: What about those breakdown statistics?" Journal of the American Statistical Association, 89, 1066-1074.

Albright, S. C. (1993), "A statistical analysis of hitting streaks in baseball," Journal of the American Statistical Association, 88, 1175-1183.

Bennett, J. M. (1993), "Did Shoeless Joe Jackson throw the 1919 World Series?" The American Statistician, 47, 241-250.

Bennett, J. M. and Flueck, J. A. (1984), "Player game percentage," Proceedings of the Social Science Section, American Statistical Association, 378-380.

Casella, G. and Berger, R. L. (1994), "Estimation with selected binomial information or do you really believe that Dave Winfield is batting .471?" Journal of the American Statistical Association, 89, 1080-1090.

Cochran, J. J. (2000), "A power analysis of the 162 game Major League Baseball schedule," in Joint Statistical Meetings, Indianapolis, IN, August 2000.

Frohlich, C. (1994), "Baseball: Pitching no-hitters," Chance, 7, 24-30.

Harville, D. (1980), "Predictions for National Football League games via linear-model methodology," Journal of the American Statistical Association, 75, 516-524.

James, B. (1982), The Bill James Baseball Abstract, New York: Ballantine Books.

James, B., Albert, J., and Stern, H. S. (1993), "Answering questions about baseball using statistics," Chance, 6, 17-22, 30.

Lindsey, G. R. (1961), "The progress of the score during a baseball game," American Statistical Association Journal, September, 703-728.

Lindsey, G. R. (1963), "An investigation of strategies in baseball," Operations Research, 11, 447-501.

Mills, E. and Mills, H. (1970), Player Win Averages, South Brunswick, NJ: A. S. Barnes.

Thorn, J. and Palmer, P. (1993), Total Baseball, New York: HarperCollins.


Chapter 10

Exploring Baseball Hitting Data: What About Those Breakdown Statistics?

Jim ALBERT*

During a broadcast of a baseball game, a fan hears how baseball hitters perform in various situations, such as at home and on the road, on grass and on turf, in clutch situations, and ahead and behind in the count. From this discussion by the media, fans get the misleading impression that much of the variability in players' hitting performance can be explained by one or more of these situational variables. For example, an announcer may state that a particular player struck out because he was behind in the count and was facing a left-handed pitcher. In baseball one can now investigate the effect of various situations, as hitting data is recorded in very fine detail. This article looks at the hitting performance of major league regulars during the 1992 baseball season to see which situational variables are "real" in the sense that they explain a significant amount of the variation in hitting of the group of players. Bayesian hierarchical models are used in measuring the size of a particular situational effect and in identifying players whose hitting performance is very different in a particular situation. Important situational variables are identified together with outstanding players who make the most of a given situation.

KEY WORDS: Hierarchical modeling; Outliers; Situational variables.

1. INTRODUCTION

After the end of every baseball season, books are published that give detailed statistical summaries of the batting and pitching performances of all major league players. In this article we analyze baseball hitting data that was recently published in Cramer and Dewan (1992). This book claims to be the "most detailed statistical account of every major league player ever published," which enables a fan to "determine the strengths and weaknesses of every player."

For hitters, this book breaks down the usual set of batting statistics (e.g., hits, runs, home runs, doubles) by numerous different situations. Here we restrict discussion to the fundamental hitting statistics—hits, official at-bats, and batting average (hits divided by at-bats)—and look at the variation of this data across situations. To understand the data that will be analyzed, consider the breakdowns for the 1992 season of Wade Boggs presented in Table 1. This table shows how Boggs performed against left- and right-handed pitchers and pitchers that induce mainly groundballs and flyballs. In addition, the table gives hitting statistics for day and night games, games played at and away from the batter's home ballpark, and games played on grass and artificial turf. The table also breaks down hits and at-bats by the pitch count ("ahead on count" includes 1-0, 2-0, 3-0, 2-1, and 3-1) and the game situation ("scoring position" is having at least one runner at either second or third, and "none on/out" is when there are no outs and the bases are empty). Finally, the table gives statistics for the batting position of the hitter and different time periods of the season.

What does a fan see from this particular set of statistical breakdowns? First, several situational variables do not seem very important. For example, Boggs appears to hit the same for day and night games and before and after the All-Star game. But other situations do appear to matter. For example, Boggs hit .243 in home games and .274 in away games—a 31-point difference. He appears to be more effective against flyball pitchers compared to groundball pitchers, as the difference in batting averages is 57 points. The most dramatic situation appears to be pitch count. He hit .379 on the first pitch, .290 when he was ahead in the count, but only .197 when he had two strikes on him.

* Jim Albert is Professor, Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403. The author is grateful to Bill James, the associate editor, and two referees for helpful comments and suggestions.

What does a fan conclude from this glance at Boggs's batting statistics? First, it is difficult to gauge the significance of these observed situational differences. It seems that Boggs bats equally well during day or night games. It also appears that there are differences in Boggs's "true" batting behavior during different pitch counts (the 93-point difference between the "ahead in count" and "two strikes" averages described earlier). But consider the situation "home versus away." Because Boggs bats 31 points higher in away games than in home games, does this mean that he is a better hitter away from Fenway Park? Many baseball fans would answer "yes." Generally, people overstate the significance of seasonal breakdown differences. Observed differences in batting averages such as these are often mistakenly interpreted as real differences in true batting behavior. Why do people make these errors? Simply, they do not understand the general variation inherent in coin tossing experiments. There is much more variation in binomial outcomes than many people realize, and so it is easy to confuse this common form of random variation with the variation due to real situational differences in batting behavior.

How can fans gauge the significance of differences of situational batting averages? A simple way is to look at the 5-year hitting performance of a player for the same situations. If a particular observed seasonal situational effect is real for Boggs, then one might expect him to display a similar situational effect during recent years. Cramer and Dewan (1992) also gave the last 5 years' (including 1992) hitting performance of each major league player for all of the same situations of Table 1. Using these data, Table 2 gives situational differences in batting averages for Boggs for 1992 and the previous 4-year period (1988-1991).

Table 2 illustrates the volatility of the situational differences observed in the 1992 data.

© 1994 American Statistical Association. Journal of the American Statistical Association, September 1994, Vol. 89, No. 427, Statistics in Sports.


Table 1. Situational 1992 Batting Record of Wade Boggs

                      AVG    AB     H
1992 season          .259   514   133
versus left          .272   158    43
versus right         .253   356    90
groundball           .235   136    32
flyball              .292   144    42
home                 .243   251    61
away                 .274   263    72
day                  .259   193    50
night                .259   321    83
grass                .254   437   111
turf                 .286    77    22
1st pitch            .379    29    11
ahead in count       .290   169    49
behind in count      .242   157    38
two strikes          .197   213    42
scoring position     .311   106    33
close and late       .322    90    29
none on/out          .254   142    36
batting #1           .222   221    49
batting #3           .287   289    83
other                .250     4     1
April                .253    75    19
May                  .291    86    25
June                 .242    95    23
July                 .304    79    24
Aug                  .198    96    19
Sept/Oct             .277    83    23
Pre-All-Star         .263   278    73
Post-All-Star        .254   236    60

NOTE: Data from Cramer and Dewan (1992).

For example, in 1992 Boggs hit flyball pitchers 57 points better than groundball pitchers. But in the 4-year period preceding 1992, he hit groundball pitchers 29 points higher than flyball pitchers. The 31-point 1992 home/away effect (favoring away) appears spurious, because Boggs was much better at Fenway Park for the previous 4-year period. Looking at the eight situations, only the night/day and pre-/post-All-Star situational effects appear to be constant over time.

From this simple analysis, one concludes that it is difficult to interpret the significance of seasonal batting averages for a single player. In particular, it appears to be difficult to conclude that a player is a clutch or "two-strike" hitter based solely on batting statistics for a single season. But a large number of major league players bat in a particular season, and it may be easier to detect any significant situational patterns by pooling the hitting data from all of these players. The intent of this article is to look at situational effects in batting averages over the entire group of hitters for the 1992 season.

Here we look at the group of 154 regular major league players during the 1992 season. We define "regular" as a player who had at least 390 official at-bats; the number 400 was initially chosen as a cutoff, but it was lowered to 390 to accommodate Rob Deer, a hitter with unusual talents (power with a lot of strikeouts).

Using data from Cramer and Dewan (1992) for the 154 regulars, we investigate the effects of the following eight situations (with the abbreviation for the situation that we use given in parentheses):

• opposite side versus same side (OPP-SAME)
• groundball pitcher versus flyball pitcher (GBALL-FBALL)
• home versus away
• day versus night
• grass versus turf
• ahead in count versus two strikes in count (AHEAD-2 STRIKE)
• scoring position versus none on/out (SCORING-NONE ON/OUT)
• pre-All-Star game versus post-All-Star game (PRE/AS-POST/AS).

A few remarks should be made about this choice of situations. First, it is well known that batters hit better against pitchers who throw from the opposite side from the batter's hitting side. For this reason, I look at the situation "opposite side versus same side"; batters who switch-hit will be excluded from this comparison. Next, for ease of comparison of different situations, it seemed helpful to create two nonoverlapping categories for each situation. All of the situations of Table 1 are of this type except for pitch count, clutch situations, and time. Note that the pitch categories are overlapping (one can be behind in the count and have two strikes) and so it seemed simpler to just consider the nonoverlapping cases "ahead in count" and "two strikes." Likewise, one can simultaneously have runners in scoring position and the game be close and late, so I considered only the "scoring position" and "none on/out" categories. The batting data across months is interesting; however, because the primary interest is in comparing the time effect to other situational variables, I used only the pre- and post-All-Star game data.

When we look at these data across the group of 1992 regulars, there are two basic questions that we will try to answer.

Table 2. Situational Differences in Batting Averages (One Unit = .001) for Wade Boggs for 1992 and 1988-1991

Year          Right-left   Flyball-groundball   Home-away   Night-day
1992              -19              57               -31          0
1988-1991          61             -29                85          6

Year          Turf-grass   Ahead-2 strikes   None out-scoring position   Pre-All Star-Post-All Star
1992               32              93                  -57                          9
1988-1991         -30             137                    1                         -8


First, for a particular situation it is of interest to look for a general pattern across all players. Baseball people believe that most hitters perform better in various situations. In particular, managers often make decisions under the following assumptions:

• Batters hit better against pitchers throwing from the opposite side.
• Batters hit better at home.
• Batters hit better during day games (because it is harder to see the ball at night).
• Batters hit better when they are ahead in the count (instead of being behind two strikes).
• Batters hit better on artificial turf than on grass (because groundballs hit on turf move faster and have a better chance of reaching the outfield for a hit).

The other three situations in the list of eight above are not believed to be generally significant. One objective here is to measure and compare the general sizes of these situational effects across all players.

Once we understand the general situational effects, we can then focus on individuals. Although most players may display, say, a positive home effect, it is of interest to detect players who perform especially well or poorly at home. It is easy to recognize great hitters such as Wade Boggs partly because his success is measured by a well-known statistic, batting average. The baseball world is less familiar with players who bat especially well or poorly in given situations and the statistics that can be used to measure this unusual performance. So a second objective in this article is to detect these unusual situational players. These outliers are often the most interesting aspect of baseball statistics. Cramer and Dewan (1992), like many others, list the leading hitters with respect to many statistical criteria in the back of their book.

This article is outlined as follows. Section 2 sets up the basic statistical model and defines parameters that correspond to players' situational effects. The estimates of these situational effects from a single season can be unsatisfactory and often can be improved by adjusting or shrinking them towards a common value. This observation motivates the consideration of a prior distribution that reflects a belief in similarity of the set of true situational effects. Section 3 summarizes the results of fitting this Bayesian model to the group of 1992 regulars for each one of the eight situational variables. The focus, as explained earlier, is on looking for general situational patterns and then finding players that deviate significantly from the general patterns. Section 4 summarizes the analysis and contrasts it with the material presented by Cramer and Dewan (1992).

2. THE MODEL

2.1 Basic Notation

Consider one of the eight situational variables, say the home/away breakdown. From Cramer and Dewan (1992), we obtain hitting data for N = 154 players; for each player, we record the number of hits and official at-bats during home and away games. For the ith player, this data can be represented by a 2 × 2 contingency table,

            HITS    OUTS    AT-BATS
  home      h_i1    o_i1     ab_i1
  away      h_i2    o_i2     ab_i2

where h_i1 denotes the number of hits, o_i1 the number of outs, and ab_i1 the number of at-bats during home games. (The variables h_i2, o_i2, and ab_i2 are defined similarly for away games.) Let p_i1 and p_i2 denote the true probabilities that the ith hitter gets a hit home and away. If we assume that the batting attempts are independent Bernoulli trials with the aforementioned probabilities of success, then the numbers of hits h_i1 and h_i2 are independently distributed according to binomial distributions with parameters (ab_i1, p_i1) and (ab_i2, p_i2).

For ease of modeling and exposition, it will be convenient to transform these data to approximate normality using the well-known logit transformation. Define the observed logits

  y_ij = log(h_ij / o_ij),  i = 1, ..., N,  j = 1, 2.

Then, approximately, y_i1 and y_i2 are independent normal, where y_ij has mean μ_ij = log(p_ij / (1 - p_ij)) and variance σ²_ij = (ab_ij p_ij (1 - p_ij))^(-1). Because the sample sizes are large, we can accurately approximate σ²_ij by an estimate where p_ij is replaced by the observed batting average h_ij/ab_ij. With this substitution, σ²_ij ≈ 1/h_ij + 1/o_ij.

Using the foregoing logistic transformation, we represent the complete data set for a particular situation, say home/away, as a 2 × N table:

               Player
           1      2     ...     N
  home    y_11   y_21   ...    y_N1
  away    y_12   y_22   ...    y_N2

The observation in the (i, j) cell, y_ij, is the logit of the observed batting average of the ith player during the jth situation. We model μ_ij, the mean of y_ij, as

  μ_ij = μ_i + α_ij,

where μ_i measures the hitting ability of player i and α_ij is a situational effect that measures the change in this player's hitting ability due to the jth situation. The model as stated is overparameterized, so we express the situational effects as α_i1 = α_i and α_i2 = -α_i. With this change, the parameter α_i represents the change in the hitting ability of the ith player due to the first situational category.
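The logit transformation and its variance approximation are simple to compute. The Python sketch below applies them to Boggs's 1992 home/away line from Table 1 (61 hits in 251 home at-bats, 72 hits in 263 away at-bats); the function name is ours.

    import math

    def observed_logit(hits, at_bats):
        # observed logit y = log(h / o) and its approximate variance 1/h + 1/o
        outs = at_bats - hits
        return math.log(hits / outs), 1.0 / hits + 1.0 / outs

    y_home, var_home = observed_logit(61, 251)   # home games
    y_away, var_away = observed_logit(72, 263)   # away games
    print(y_home, var_home)
    print(y_away, var_away)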

2.2 Shrinking Toward the Mean

For a given situational variable, it is of interest to estimate the player situational effects α_1, ..., α_N. These parameters represent the "true" situational effects of the players if they were able to play an infinite number of games.

Is it possible to estimate accurately a particular player's situational effect based on his hitting data from one season? To answer this question, suppose that a player has a true batting average of .300 at home and .200 away, a 100-point differential. If he has 500 at-bats during a season, half home and half away, then one can compute that his observed seasonal home batting average will be between .252 and .348 with probability .9 and his away batting average will be between .158 and .242 with the same probability. So although the player's true differential is 100 points, his seasonal batting average differential can be between 10 and 190 points. Because this is a relatively wide interval, one concludes that it is difficult to estimate a player's situational effect using only seasonal data.
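The 90% ranges quoted here follow from the usual normal approximation to the binomial; the short check below reproduces them (1.645 is the standard normal 95th percentile).

    import math

    def ninety_pct_range(true_avg, at_bats):
        # central 90% range for a season batting average under the binomial model
        se = math.sqrt(true_avg * (1.0 - true_avg) / at_bats)
        return true_avg - 1.645 * se, true_avg + 1.645 * se

    print(ninety_pct_range(0.300, 250))   # roughly (.252, .348) at home
    print(ninety_pct_range(0.200, 250))   # roughly (.158, .242) on the road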

How can we combine the data from all players to obtain better estimates? In the situation where one is simultaneously estimating numerous parameters of similar size, it is well known in the statistics literature that improved estimates can be obtained by shrinking the parameters toward a common value. Efron and Morris (1975) illustrated the benefit of one form of these shrinkage estimators in the prediction of final season batting averages in 1970 for 18 ballplayers based on the first 45 at-bats. (See also Steffey 1992, for a discussion of the use of shrinkage estimates in estimating a number of batting averages.) Morris (1983) used a similar model to estimate the true batting average of Ty Cobb from his batting statistics across seasons.

In his analysis of team breakdown statistics, James (1986) discussed the related problem of detecting significant effects. He observed that many team distinctions for a season (e.g., how a team plays against right- and left-handed pitching) will disappear when studied over a number of seasons. A similar pattern is likely to hold for players' breakdowns. Some players will display unusually high or low situational statistics for one season, suggesting extreme values of the parameters α_i. But if these situational data are studied over a number of seasons, the players will generally have less extreme estimates of α_i. This "regression to the mean" phenomenon is displayed for the "ahead in count-two strikes" situation in Figure 1. This figure plots the 1992 difference in batting averages (batting average ahead in count minus batting average with 2 strikes) against the 1988-1991 batting average difference for all of the players. Note that there is an upward tilt in the graph, indicating that players who have larger batting average differences in 1992 generally had larger differences in 1988-1991. But also note that there is less variability in the 4-year numbers (standard deviation .044 versus .055 for the 1992 data). One way of understanding this difference is by the least squares line placed on top of the graph. The equation of this line is y = .126 + (1 - .683)(x - .122), with the interpretation that the 4-year batting average difference generally adjusts the 1992 batting average difference 68% toward the average difference (.12) across all players.
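The fitted line can be applied directly to see the size of this pull toward the overall mean; the 1992 differences plugged in below are illustrative values.

    def predicted_four_year_diff(diff_1992):
        # least squares line from Figure 1: shrink a 1992 difference toward the mean
        return 0.126 + (1 - 0.683) * (diff_1992 - 0.122)

    print(predicted_four_year_diff(0.200))   # about .15: pulled toward the mean difference
    print(predicted_four_year_diff(0.000))   # about .09: pulled toward the mean difference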

Suppose that the underlying batting abilities of the players do not change significantly over the 5-year period. Then the 4-year batting average differences (based on a greater number of at-bats) are better estimates than the 1-year differences of the situational abilities of the players. In that case, it is clear from Figure 1 that the observed seasonal estimates of the α_i should be shrunk toward some overall value to obtain more accurate estimates.

Figure 1. Scatterplot of 1992 Pitch Count Difference in Batting Averages (AHEAD-2 STRIKES) Against Previous 4-Year Difference for All 1992 Players.

In the next section we describe a prior distribution on the situation parameters that reflects a belief in similarity of the effects and results in sample estimates that will shrink the season values toward a pooled value.

2.3 The Prior Distribution

The model discussed in Section 2.1 contains 2N parameters, the hitting abilities μ_1, ..., μ_N, and the situational effects α_1, ..., α_N. Because the ability parameters in this setting are nuisance parameters, we assume for convenience that the μ_i are independently assigned flat noninformative priors.

Because the focus is on the estimation of the situational effects α_1, ..., α_N, we wish to assign a prior that reflects subjective beliefs about the locations of these parameters. Recall from our earlier discussion that it seems desirable for the parameter estimates to shrink the individual estimates toward some common value. This shrinkage can be accomplished by assuming a priori that the effects α_1, ..., α_N are independently distributed from a common population π(α). This prior reflects the belief that the N effects are similar in size and come from one population of effects π(α).

Because the α_i are real-valued parameters, one reasonable form for this population model is a normal distribution with mean μ_α and variance σ²_α. A slightly preferable form used in this article is a t distribution with mean μ_α, scale σ_α, and known degrees of freedom ν (here we use the relatively small value ν = 4). The parameters of this distribution are used to explain the general size of the situational effect and to identify particular players who have unusually high or low situational effects. The parameters μ_α and σ_α describe the location and spread of the distribution of effects across all players. To see how this model can identify outliers, note that a t(μ_α, σ_α, ν) distribution can be represented as the mixture α_i | λ_i distributed N(μ_α, σ²_α/λ_i), λ_i distributed gamma(ν/2, ν/2). The new scale parameters λ_1, ..., λ_N will be seen to be useful in picking out individuals who have situational effects set apart from the main group of players.
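This scale-mixture representation is easy to simulate, and it also shows how outliers are flagged: a small simulated λ_i inflates the variance of the corresponding α_i. The sketch below is a generic draw from the representation, not the Gibbs sampler used later in the article; the population values are illustrative.

    import math
    import random

    def draw_effect(mu_alpha, sigma_alpha, nu=4, rng=random):
        # lambda ~ gamma(shape nu/2, rate nu/2); alpha | lambda ~ N(mu_alpha, sigma_alpha^2 / lambda)
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)   # gammavariate takes (shape, scale)
        alpha = rng.gauss(mu_alpha, sigma_alpha / math.sqrt(lam))
        return alpha, lam

    print([draw_effect(0.0, 0.11) for _ in range(5)])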

To complete our prior specification, we need to discuss what beliefs exist about the parameters μ_α and σ²_α that describe the t population of situational effects. First, we assume that we have little knowledge about the general size of the situational effect; to reflect this lack of knowledge, the mean μ_α is assigned a flat noninformative prior.


Table 3. Posterior Means of Parameters of Population Distribution of the Situation Effects and Summary Statistics of the Posterior Means of the Batting Average Differences p_i1 - p_i2 Across All Players

                                            Summary statistics of {E(p_i1 - p_i2)} (one unit = .001)
Situation                  E(μ_α)   E(σ_α)     Q1       M       Q3     Q3 - Q1
GRASS-TURF                 -.002     .107      -17      -3      15        32
SCORING-NONE ON/OUT         .000     .108      -13       0      17        30
DAY-NIGHT                   .004     .105      -13       2      16        29
PRE/AS-POST/AS              .007     .101       -9       3      17        26
HOME-AWAY                   .016     .103       -8       8      21        29
GBALL-FBALL                 .023     .109       -7      11      24        31
OPP-SAME                    .048     .106        5      20      32        27
AHEAD-2 STRIKES             .320     .110      104     123     142        38

We must be more careful about the assignment of the prior distribution on σ²_α, because this parameter controls the size of the shrinkage of the individual player situational estimates toward the common value. In empirical work, it appears that the use of the standard noninformative prior for σ²_α, 1/σ²_α, can lead to too much shrinkage. So to construct an informative prior for σ²_α, we take the home/away variable as one representative situational variable among the eight and base the prior of σ²_α on a posterior analysis of these parameters based on home/away data from an earlier season. So we first assume that μ_α and σ²_α are independent, with μ_α assigned a flat prior and σ²_α distributed according to an inverse gamma distribution with parameters a = 1/2 and b = 1/2 (vague prior specification), and then fit this model to 1991 home/away data for all of the major league regulars. From this preliminary analysis, we obtain posterior estimates for σ²_α that are matched to an inverse gamma distribution with parameters a = 53.2 and b = .810.

To summarize, the prior can be written as follows:

• μ_1, ..., μ_N independent with π(μ_i) = 1.
• α_1, ..., α_N independent t(μ_α, σ_α, ν), where ν = 4.
• μ_α, σ²_α independent with π(μ_α) = 1 and π(σ²_α) = K (1/σ²_α)^(a+1) exp(-b/σ²_α), with a = 53.2 and b = .810.

This prior is used in the posterior analysis for each of the situational variables.

3. FITTING THE MODEL

3.1 General Behavior

Figure 2. Boxplots of the Posterior Means of the Differences in Batting Averages for the Eight Situational Variables.

The description of the joint posterior distribution and the use of the Gibbs sampler in summarizing the posterior are outlined in the Appendix. For each of the eight situational variables, a simulated sample of size 1,000 from the joint posterior distribution of ({μ_i}, {α_i}, μ_α, σ²_α) was obtained. These simulated values can be used to estimate any function of the parameters of interest. For example, if one is interested in the difference in breakdown probabilities for player i, p_i1 - p_i2, then one can simulate values of this difference by noting that p_i1 - p_i2 = exp(μ_i + α_i)/(1 + exp(μ_i + α_i)) - exp(μ_i - α_i)/(1 + exp(μ_i - α_i)), and then performing this transformation on the simulated values of {μ_i} and {α_i}.
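Given simulated draws of μ_i and α_i, this transformation is a one-line computation. The draws below are made-up values standing in for actual Gibbs output; the function name is ours.

    import math

    def prob_difference(mu_i, alpha_i):
        # convert one (mu_i, alpha_i) draw into p_i1 - p_i2 on the batting average scale
        def inv_logit(x):
            return math.exp(x) / (1.0 + math.exp(x))
        return inv_logit(mu_i + alpha_i) - inv_logit(mu_i - alpha_i)

    draws = [(-1.00, 0.20), (-0.95, 0.12), (-1.05, 0.17)]   # made-up posterior draws for one player
    diffs = [prob_difference(mu, a) for mu, a in draws]
    print(sum(diffs) / len(diffs))   # posterior mean of the difference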

Our first question concerns the existence of a general situational effect. Do certain variables generally seem more important than others in explaining the variation in hitting performance? One can answer this question by inspection of the posterior distribution of the parameters μ_α and σ_α, which describe the population distribution of the situational effects {α_i}. The posterior means of these hyperparameters for each of the eight situations are given in Table 3. Another way to look at the general patterns is to consider the posterior means of the true batting average differences p_i1 - p_i2 across all players. Table 3 gives the median, quartiles, and interquartile range for this set of 154 posterior means; Figure 2 plots parallel boxplots of these differences for the eight situations.

Note from Table 3 and Figure 2 that there are significant differences between the average situational effects. The posterior mean of μ_α is a measure of the general size of the situational effect on a logit scale. Note from Table 3 that the "ahead in count-2 strikes" effect stands out. The posterior mean of μ_α is .32 on the logit scale; the corresponding median of the posterior means of the batting average differences is 123 points. Batters generally hit much better when ahead versus behind in the pitch count. Compared to this effect, the other seven effects are relatively insignificant. Closer examination reveals that "opposite-same arm" is the next most important variable, with a median batting average difference of 20 points. The "home-away" and "groundball-flyball" effects follow in importance, with median batting average differences of 8 and 11 batting average points. The remaining four situations appear generally insignificant, as the posterior mean of μ_α is close to 0 in each case.

The posterior means of σ_α give some indication of the relative spreads of the population of effects for the eight situations. The general impression from Table 3 and Figure 2 is that all of the situational populations have roughly the same spread, with the possible exception of the pitch count situation. So the differences between the sets of two different situational effects can be described by a simple shift. For example, the "groundball-flyball" effects are approximately 10 batting average points higher than the "day-night" effects.

3.2 Individual Effects—What is an Outlier?

In the preceding section we made some observations regarding the general shape of the population of situation effects {α_i}. Here we want to focus on the situation effects for individual players. Figure 3 gives stem-and-leaf diagrams (corresponding to the Fig. 2 boxplots) for these differences in batting averages for each of the eight situations. These plots confirm the general comments made in Section 3.1 about the comparative locations and spreads of the situational effect distributions.

Recall from Section 2 that it is desirable to shrink the batting average differences that we observe for a single season toward a common value. Figure 4 plots the posterior means of the differences p_i1 - p_i2 against the season batting average differences for the effect "ahead in count-2 strikes." The line y = x is plotted on top of the graph for comparison purposes. This illustrates that these posterior estimates shrink the seasonal estimates approximately 50% toward the average effect size of 122 points. Note that some of the seasonal batting average differences are negative; these players actually hit better in 1992 with a pitch count of 2 strikes. But the posterior means shrink these values toward positive values. Thus we have little faith that these negative differences actually relate to true negative situational effects.

We next consider the issue of outliers. For instance, for the particular effect "home-away," are there players that bat particularly well or poorly at home relative to away games? Returning to the displays of the estimates of the batting average differences in Figure 4, are there any players whose estimates deviate significantly from the main group? We observe some particular estimates, say the -95 in the "grass-turf" variable, that are set apart from the main distribution of estimates. Are these values outliers? Are they particularly noteworthy?

We answer this question by first looking at some famous outliers in baseball history. With respect to batting average, a number of Hall of Fame or future Hall of Fame players have achieved a high batting average during a single season. In particular, we consider Ted Williams's .406 batting average in 1941, Rod Carew's .388 average in 1977, George Brett's .390 average in 1980, and Wade Boggs's .361 average in 1983. Each of these batting averages can be considered an outlier, because these accomplishments received much media attention and these averages were much higher than the batting averages of other players during that particular season.

These unusually high batting averages were used to calibrate our Bayesian model. For each of the four data sets—Major League batting averages in 1941 and American League batting averages in 1977, 1980, and 1983—an exchangeable model was fit similar to that described in Section 2. In each model fit we computed the posterior mean of the scale parameter λ_i corresponding to the star's high batting average. A value λ_i = 1 corresponds to an observation consistent with the main body of data; an outlier corresponds to a small positive value of λ_i. The posterior mean of this scale parameter was .50 for Williams, .59 for Carew, .63 for Brett, and .75 for Boggs. Thus Williams's accomplishment was the greatest outlier in the sense that it deviated the most from American League batting averages in 1941.

This brief analysis of famous outliers is helpful in understanding which individual situational effects deviate significantly from the main population of effects. For each of the eight situational analyses, the posterior means of the scale parameters λ_i were computed for all of the players. Table 4 lists players for all situations where the posterior mean of λ_i is smaller than .75. To better understand a player's unusual accomplishment, this table gives his 1992 batting average in each category of the situation and the batting average difference ("season difference"). Next, the table gives the posterior mean of the difference in true probabilities p_i1 - p_i2. Finally, it gives the difference in batting averages for the previous 4 years (given in Cramer and Dewan 1992).

What do we learn from Table 4? First, relatively few players are outliers using our definition—approximately one per situational variable. Note that the posterior estimates shrink the observed season batting average differences approximately halfway toward the average situational effect. The amount of shrinkage is greatest for the situational variables (such as "grass-turf") where the number of at-bats for one of the categories is small. The last column addresses the question of whether these nine unusual players exhibited similar situational effects in the previous 4 years. The general answer to this question appears to be negative. Seven of the nine players had 1988-1991 effects that were opposite in sign from the 1992 effect. The only player who seems to display a constant situational effect over the last 5 years is Tony Gwynn; he hits for approximately the same batting average regardless of the pitch count.

4. SUMMARY AND DISCUSSION

What have we learned from the preceding analysis of this hitting data? First, if one looks at the entire group of 154 baseball regulars, some particular situational variables stand out. The variation in batting averages by the pitch count is dramatic—batters generally hit 123 points higher when ahead in the count than with 2 strikes. But three other variables appear important. Batters on average hit 20 points higher when facing a pitcher of the opposite arm, 11 points higher when facing a groundball pitcher (as opposed to a flyball pitcher), and 8 points higher when batting at home. Because these latter effects are rather subtle, one may ask if these patterns carry over to other seasons. Yes they do. The same model (with the same prior) was fit to data from the 1990 season and the median opposite arm, groundball pitcher, and home effects were 25, 9, and 5 points respectively, which are close to the 1992 effects discussed in Section 3.

Do players have different situational effects? Bill James(personal communication) views situational variables as ei-ther "biases" or "ability splits." A bias is a variable that hasthe same effect on all players, such as "grass-turf," "day-


Figure 3. Stem-and-Leaf Diagrams of the Posterior Means of the Differences in Batting Averages for the Eight Situational Variables.


Figure 4. Posterior Means of the Differences in Batting Averages Plotted Against the Seasonal Difference in Batting Averages for the Situation "Ahead in Count/2 Strikes."

night", or "home-away." James argues that a player's abilityto hit does not change just because he is playing on a differentsurface, or a different time of day, or at a particular ballpark,and so it is futile to look for individual differences in thesesituational variables. In contrast, the pitch count is an "abilitysplit." There will exist individual differences in the "aheadin count-2 strikes" split, because one's batting average inthe 2-strike setting is closely related to one's tendency tostrike out. This statement is consistent with Figure 1, whichindicates that players with high situational effects for pitchcount during the previous 4-year period were likely to havehigh effects during the 1992 season. But there is much scatterin this graph, indicating that season performance is an im-perfect measurement of this intrinsic ability.

Although there are clear situational patterns for the entire group of players, it is particularly difficult to see these patterns for individual players. We notice this in our brief study of nine unusual players in Section 3.2. These players had extreme estimated effects for the 1992 season, but many of them displayed effects of opposite sign for the previous 4-year period. The only player who seemed to have a clear outlying situational split was Tony Gwynn. But this does not mean that our search for players of high and low situational splits is futile. Rather, it means that we need more data to detect these patterns at an individual level.

Let us return to the book by Cramer and Dewan (1992), where the data were obtained. How does this book summarize the situational batting statistics that are listed? There is little discussion about the general size of the situational effects, so it is difficult to judge the significance of individual batting accomplishments. For example, suppose that a given hitter bats 100 points higher when ahead in the count compared with 2 strikes: Is that difference large or small? By our work, we would regard this as a relatively small difference, because it is smaller than the average of 123 batting average points for all players.

The book lists the 1992 batting leaders at the back. In the category of home games, we find that Gary Sheffield had the highest batting average at home. But in this list, the players' hitting abilities and situational abilities are confounded; Sheffield's high average at home may reflect only the fact that Sheffield is a good hitter. In this article we have tried to isolate players' hitting abilities from their abilities to hit better or worse in different situations.

5. RELATED WORK

Because breakdown batting statistics are relatively new, there has been little statistical analysis of these data. There has been much discussion on one type of situational variable—hitting streaks or slumps during a season. Albright (1993) summarized recent literature on the detection of streaks and did his own analysis on season hitting data for 500 players. These data are notable, because a batter's hitting performance and the various situational categories are recorded for each plate appearance during this season. Albert (1993), in his discussion of Albright's paper, performed a number of stepwise regressions on this plate appearance data for 200 of the players. His results complement and extend the results described here. The "home-away" and "opposite arm-same arm" effects were found to be important for the aggregate of players. In addition, players generally hit for a higher batting average against weaker pitchers (deemed as such based on their high final season earned run averages). Other new variables that seemed to influence hitting were number of outs and runners on base, although the degree of these effects was much smaller than for the pitcher strength variable. The size of these latter effects appeared to be similar to the "home-away" effects. Players generally hit for a lower average with two outs in an inning and for a higher average with runners on base. This brief study suggests that there may be more to learn by looking at individual plate appearance data.

APPENDIX: DESCRIPTION OF THE POSTERIOR DISTRIBUTION OF THE SITUATIONAL EFFECTS AND SUMMARIZATION OF THE DISTRIBUTION USING THE GIBBS SAMPLER

The use of Bayesian hierarchical prior distributions to model structural beliefs about parameters has been described by Lindley and Smith (1972). The use of Gibbs sampling to simulate posterior distributions in hierarchical models was outlined by Gelfand, Hills, Racine-Poon, and Smith (1990). Albert (1992) used an outlier model, similar to that described in Section 2.3, to model homerun hitting data.

The complete model can be summarized as follows. For a given breakdown variable, we observe {(y_i1, y_i2), i = 1, ..., N}, where y_ij is the logit of the seasonal batting average of batter i in the jth category of the situation. We assume that the y_ij are independent, where y_i1 is N(μ_i + α_i, σ²_i1) and y_i2 is N(μ_i − α_i, σ²_i2), where the variances σ²_i1 and σ²_i2 are assumed known. The unknown parameters are α = (α_1, ..., α_N) and μ = (μ_1, ..., μ_N). Using the representation of a t density as a scale mixture of normals, the prior distribution for (α, μ) is written as the following two-stage distribution:

Stage 1. Conditional on the hyperparameters μ_α, σ²_α, and λ = (λ_1, ..., λ_N), α and μ are independent, with μ distributed according to the vague prior π(μ) = c and the situational effect components α_1, ..., α_N independent with α_i distributed N(μ_α, σ²_α/λ_i).

Stage 2. The hyperparameters μ_α, σ²_α, and λ are independent, with μ_α distributed according to the vague prior π(μ_α) = c, σ²_α distributed according to the inverse gamma(a, b) density with kernel (σ²_α)^{−(a+1)} exp(−b/σ²_α), and λ_1, ..., λ_N iid from the gamma(ν/2, ν/2) density with kernel λ^{ν/2−1} exp(−νλ/2). The hyperparameters a, b, and ν are assumed known.


Table 4. Outlying Situational Players Where the Posterior Mean of the Scale Parameter λ_i < .75

Player              Situation              Batting     Batting     Season       Estimate of   Previous
                                           average 1   average 2   difference   p_i1 − p_i2   4 years
Terry Steinbach     grass-turf             .251        .423        -.172        -.095          .033
Darrin Jackson      grass-turf             .278        .156         .122         .075          .020
Kevin Bass          scoring-none on/out    .205        .376        -.171        -.085          .063
Joe Oliver          scoring-none on/out    .172        .319        -.147        -.073          .079
Kent Hrbek          pre/AS-post/AS         .294        .184         .110         .074         -.007
Keith Miller        day-night              .167        .325        -.158        -.086          .022
Mike Devereaux      day-night              .193        .309        -.116        -.075          .035
Mickey Morandini    groundball-flyball     .324        .155         .169         .082         -.045
Tony Gwynn          ahead-2 strikes        .252        .291        -.039         .033          .061

Combining the likelihood and the prior, the joint posterior density of the parameters α, μ, μ_α, σ²_α, and λ is given by

g(α, μ, μ_α, σ²_α, λ | y) ∝ ∏_{i=1}^{N} [ exp{−(y_i1 − μ_i − α_i)²/(2σ²_i1)} exp{−(y_i2 − μ_i + α_i)²/(2σ²_i2)} (λ_i/σ²_α)^{1/2} exp{−λ_i(α_i − μ_α)²/(2σ²_α)} λ_i^{ν/2−1} exp{−νλ_i/2} ] × (σ²_α)^{−(a+1)} exp{−b/σ²_α}.    (A.1)

To implement the Gibbs sampler, we require the set of full conditional distributions; that is, the posterior distributions of each parameter conditional on all remaining parameters. From (A.1), these fully conditional distributions are given as follows:

(a) Conditional on all remaining parameters, the μ_i are independent normal, with means and variances obtained by combining the variates y_i1 − α_i and y_i2 + α_i with precisions 1/σ²_i1 and 1/σ²_i2.

(b) The α_i are independent normal, with means and variances obtained by combining the variates y_i1 − μ_i and μ_i − y_i2 with the N(μ_α, σ²_α/λ_i) prior.

(c) μ_α is normal, with mean equal to the λ_i-weighted average of the α_i and variance σ²_α/Σλ_i.

(d) σ²_α is inverse gamma with parameters a + N/2 and b + Σλ_i(α_i − μ_α)²/2.

(e) The λ_i are independent from gamma distributions with parameters (ν + 1)/2 and ν/2 + (α_i − μ_α)²/(2σ²_α).

To implement the Gibbs sampler, one starts with an initial guess at (μ, α, μ_α, σ²_α, λ) and simulates in turn from the full conditional distributions (a), (b), (c), (d), and (e), in that order. For a particular conditional simulation (say μ), one conditions on the most recent simulated values of the remaining parameters (α, μ_α, σ²_α, and λ). One simulation of all of the parameters is referred to as one cycle. One typically continues cycling until reaching a large number, say 1,000, of simulated values of all the parameters. Approximate convergence of the sampling to the joint posterior distribution is assessed by graphing the sequence of simulated values of each parameter and computing numerical standard errors for posterior means using the batch means method (Bratley, Fox, and Schrage 1987). For models such as these, the convergence (approximately) of this procedure to the joint posterior distribution takes only a small number of cycles, and the entire simulated sample generated can be regarded as an approximate sample from the distribution of interest.
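As a concrete illustration of the cycling just described, the following is a minimal Python (NumPy) sketch of one way such a sampler could be coded. The full conditional formulas in the comments are derived from the model stated above rather than quoted from the original article, and the function name, starting values, and default hyperparameter settings are illustrative assumptions only.

import numpy as np

def gibbs_sampler(y1, y2, s2_1, s2_2, a=2.0, b=0.1, nu=4.0, n_cycles=1000, seed=0):
    """Gibbs sampler sketch for the two-category hierarchical outlier model.

    y1, y2     : logit batting averages in the two situational categories
    s2_1, s2_2 : known sampling variances of y1 and y2
    a, b, nu   : hyperparameters of the inverse gamma and gamma priors
    """
    rng = np.random.default_rng(seed)
    N = len(y1)
    mu = (y1 + y2) / 2.0                 # crude starting values
    alpha = (y1 - y2) / 2.0
    mu_a, sig2_a, lam = alpha.mean(), alpha.var() + 1e-6, np.ones(N)
    draws = []
    for _ in range(n_cycles):
        # (a) mu_i | rest: precision-weighted average of y1 - alpha and y2 + alpha
        w1, w2 = 1.0 / s2_1, 1.0 / s2_2
        var_mu = 1.0 / (w1 + w2)
        mu = rng.normal(var_mu * (w1 * (y1 - alpha) + w2 * (y2 + alpha)),
                        np.sqrt(var_mu))
        # (b) alpha_i | rest: two normal likelihood terms plus the N(mu_a, sig2_a/lam_i) prior
        prec = w1 + w2 + lam / sig2_a
        mean_a = (w1 * (y1 - mu) + w2 * (mu - y2) + lam * mu_a / sig2_a) / prec
        alpha = rng.normal(mean_a, np.sqrt(1.0 / prec))
        # (c) mu_a | rest: flat prior, so a lambda-weighted mean of the alphas
        mu_a = rng.normal((lam * alpha).sum() / lam.sum(),
                          np.sqrt(sig2_a / lam.sum()))
        # (d) sig2_a | rest: inverse gamma
        rate = b + 0.5 * (lam * (alpha - mu_a) ** 2).sum()
        sig2_a = 1.0 / rng.gamma(a + N / 2.0, 1.0 / rate)
        # (e) lam_i | rest: gamma; small draws flag outlying situational effects
        lam = rng.gamma((nu + 1.0) / 2.0,
                        1.0 / (nu / 2.0 + (alpha - mu_a) ** 2 / (2.0 * sig2_a)))
        draws.append((mu.copy(), alpha.copy(), mu_a, sig2_a, lam.copy()))
    return draws

In use, one would pass the logit batting averages and their estimated sampling variances for all N players of a given breakdown variable and then summarize the retained draws, for example by the posterior means of the λ_i.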

[Received April 1993, Revised July 1993.]

REFERENCES

Albert, J. (1992), "A Bayesian Analysis of a Poisson Random Effects Model for Homerun Hitters," The American Statistician, 46, 246-253.

(1993), Discussion of "A Statistical Analysis of Hitting Streaks in Baseball" by S. C. Albright, Journal of the American Statistical Association, 88, 1184-1188.

Albright, S. C. (1993), "A Statistical Analysis of Hitting Streaks in Baseball," Journal of the American Statistical Association, 88, 1175-1183.

Bratley, P., Fox, B., and Schrage, L. (1987), A Guide to Simulation, New York: Springer-Verlag.

Cramer, R., and Dewan, J. (1992), STATS 1993 Player Profiles, STATS, Inc.

Efron, B., and Morris, C. (1975), "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70, 311-319.

Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), "Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling," Journal of the American Statistical Association, 85, 972-985.

James, B. (1986), The Bill James Baseball Abstract, New York: Ballantine.

Lindley, D. V., and Smith, A. F. M. (1972), "Bayes Estimates for the Linear Model," Journal of the Royal Statistical Society, Ser. B, 135, 370-384.

Morris, C. (1983), "Parametric Empirical Bayes Inference: Theory and Applications" (with discussion), Journal of the American Statistical Association, 78, 47-65.

Steffey, D. (1992), "Hierarchical Bayesian Modeling With Elicited Prior Information," Communications in Statistics, 21, 799-821.


Chapter 11

Did Shoeless Joe Jackson Throw the 1919 World Series?

Jay BENNETT*

Joe Jackson and seven other White Sox were banned from major league baseball for throwing the 1919 World Series. This article examines the validity of Jackson's banishment with respect to his overall performance in the series. A hypothesis test of Jackson's clutch batting performance was performed by applying a resampling technique to Player Game Percentage, a statistic that measures a player's contribution to team victory. The test provides substantial support to Jackson's subsequent claims of innocence.

KEY WORDS: Clutch hitting; Hypothesis test; Player Game Percentage; Player Win Average; Resampling

In 1919, the Cincinnati Reds upset the Chicago White Sox in a best-of-nine World Series in eight games. A year later, two key Chicago players confessed to participating in a conspiracy to throw the 1919 World Series. Eight Chicago players (the Black Sox) were tried and found innocent of this crime. Nonetheless, all eight players were banned from major league baseball by the newly appointed commissioner Kenesaw Mountain Landis and never reinstated. The foremost player among the Black Sox was "Shoeless" Joe Jackson, who at the time of his banishment had compiled a lifetime .356 batting average (BA), the third highest average in baseball history. There is no doubt that his alleged participation in the fix is the only factor that prevents his election to the Hall of Fame.

Many questions have persisted about the 1919 World Series since the opening game of that Series. What evidence exists that the Black Sox did not play to their full ability? How well did players not accused in the scandal play in the Series? Especially mysterious is the case of Joe Jackson. He was one of the initial confessors, but later retracted his confession. Even if he had taken money for throwing the Series, his .375 batting average and record 12 hits in the Series indicate that he may have played on the level anyway. Detractors have commented that he may have compiled impressive statistics, but he didn't hit in the clutch.

This article presents a statistical analysis of the 1919 World Series. It uses the data from this Series to highlight the effectiveness of a new baseball statistic, Player Game Percentage (PGP). PGP's capability to account for game situation allows questions of clutch performance on the field to be answered. Thus PGP is uniquely capable of determining the extent of the scandal from the one true piece of available evidence: the record of play on the field. The focal point of the article is the

*Jay Bennett is Member of Technical Staff, Bellcore, Red Bank, NJ 07701. The author thanks John Flueck and John Healy for their helpful comments.

analysis of the contention that Joe Jackson threw the 1919 World Series while batting .375.

1. PLAYER GAME PERCENTAGE

Fifty years after the Black Sox scandal, Mills and Mills (1970) introduced a remarkable baseball statistic, the Player Win Average (PWA). PWA was developed on the premise that performance of baseball players should be quantified based on the degree that he increases (or decreases) his team's chance for victory in each game. Of course, all established baseball statistics attempt to do this indirectly by counting hits, total bases, and runs. The novel part of the Mills' idea was that they estimated this contribution directly.

Consider this example which the Mills brothers estimated to be the biggest offensive play of the 1969 World Series, in which the Miracle Mets defeated the heavily favored Orioles. Al Weis of the Mets came to the plate in the top of the ninth inning of Game 2 with the score tied. The Mets had runners on first and third with two outs. The probability of a Mets victory was 51.1%. If we define the win probability (WP) as the probability of a home team victory, then WP = 48.9%. Weis singled, which placed runners at first and second and knocked in the go-ahead run (the eventual Game Winning RBI [GWRBI]). His hit increased the Mets' probability of winning to 84.9%. The difference ΔWP in the win probabilities before and after the at-bat outcome is awarded to the batter and pitcher as credits and debits. Win (Loss) Points are assigned on a play-by-play basis to players who increased (decreased) their team's chances of winning. Win and Loss Points are calculated as the product of |ΔWP| and a scale factor of 20 chosen by the Mills brothers. Thus Weis was credited 20 × 33.8 = 676 Win Points and the Orioles pitcher Dave McNally was awarded an equal number of Loss Points. The Win and Loss Points are accumulated for each player throughout the game or series of games. The Player Win Average for the game or series is the ratio of Win Points to total points accumulated:

PWA = Win Points / (Win Points + Loss Points).

The Mills' system is deceptively simple and yet solves many of the problems of standard baseball statistics:

• Standard baseball statistics do not consider the game situation. A walk with bases loaded and the game tied in the bottom of the ninth inning is much more important than a home run in the ninth inning of a 10-0 rout. The PWA system was expressly developed by the Mills brothers to take this "clutch" factor into account.

• Standard baseball statistics of hitters and pitchers are not comparable. PWA gives a single value for evaluating the performance of all players.


• Standard baseball statistics do a poor job of evaluating relief pitchers. Since PWA accounts for the game status when the relief pitcher enters the game, PWA allows meaningful comparisons among relief pitchers and between starters and relievers.

• Standard baseball statistics do a poor job of evaluating fielding performance. In the Mills system, the defensive player can be a fielder rather than the pitcher. Thus if a fielder makes an error, he receives the Loss Points that ordinarily would be awarded to the pitcher. Similarly, if the fielder makes a great play, the fielder would be awarded Win Points for the out rather than the pitcher. Fielding is evaluated within the context of the game like hitting and pitching.

Given all of these significant benefits, one wonderswhy PWA is not better known and commonly used.One major drawback is that the calculation of PWA isa greater data collection burden than standard baseballscorekeeping. A bigger obstacle is the estimation of winprobabilities for every possible game situation. The Millsbrothers outlined the basic estimation technique. First,data were collected on the fraction of time each stan-dard baseball event occurred (e.g., the fraction of timethat a home run was hit for each out and baserunnercombination). Using these data, thousands of baseballgames were simulated to obtain win probabilities foreach situation. The Mills brothers provided many winprobability estimates in their analysis of the 1969 WorldSeries, but a complete set was not published.

Bennett and Flueck (1984) devised a technique for estimating these win probability values based on data collected from the 1959 and 1960 seasons (Lindsey 1961, 1963). The appendix describes the techniques used in these estimates. They demonstrated that these estimates were close to the values published by the Mills brothers in their 1969 World Series analysis. In addition, they replaced the Mills' PWA with a new statistic, Player Game Percentage (PGP). A player's PGP for a game generally is calculated as the sum of changes in the probability of winning the game for each play in which the player has participated. For most plays, half the change is credited to an offensive player (e.g., batter) and half to a defensive player (e.g., pitcher). Thus home players generally receive ΔWP/2 for each play and visiting players receive −ΔWP/2. In terms of Win and Loss Points, the Player Game Percentage for a game is

PGP = (Win Points − Loss Points) / (2 × 20).

The denominator in the above equation accounts for this conversion to half credit and for the differences in scale.
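For readers who want to experiment with the bookkeeping, here is a small Python sketch of the two summary statistics; the helper names are our own, and the scale factor of 20 and the half-credit denominator follow the description above.

def pwa(win_points, loss_points):
    """Player Win Average: Win Points over total points."""
    return win_points / (win_points + loss_points)

def pgp(win_points, loss_points, scale=20.0):
    """Player Game Percentage: half of each win-probability change, in percentage points."""
    return (win_points - loss_points) / (2.0 * scale)

# Al Weis's ninth-inning single moved the Mets' win probability from 51.1% to 84.9%,
# so the play is worth 20 * 33.8 = 676 Win Points to Weis (and 676 Loss Points to McNally).
weis_win_points = 20 * (84.9 - 51.1)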

To clarify the PGP analysis process, Table 1 provides excerpts from the PGP analysis for the most exciting game of the 1919 World Series, the sixth game in which the White Sox came from behind to win in extra innings by a 5-4 score. The scoring for this game (and all games in the Series) was derived from the play-by-play descriptions in Cohen, Neft, Johnson, and Deutsch (1976). The table presents the following information for each play of the game (all of which are required for PGP calculations):

• Inning
• Visiting team's runs
• Home team's runs
• Runners on base at the play's conclusion
• Number of outs at the play's conclusion
• Major offensive player
• Major defensive player
• Change in the win probability ΔWP (expressed as a percentage) for the home team from the previous play
• Win probability WP (expressed as a percentage) for the home team at the play's conclusion

In plays with a major offensive player o and a major defensive player d, if the play occurs in the bottom of the inning, then o receives ΔWP/2 and d receives −ΔWP/2,

Table 1. Excerpts From PGP Analysis of Game 6 in the 1919 World Series

Inning   Sox   Reds   Bases   Outs   Offense    Defense   Play    ΔWP      WP
2h        0     0      0       1     Duncan     Kerr      go*     -2.21    52.58
2h        0     0      1       0     —          Risberg   e        5.71    58.29
2h        0     0      12      0     Kopf       Kerr      bb       6.01    64.30
2h        0     0      12      1     Neale      Kerr      go      -4.80    59.50
2h        0     0      13      2     Rariden    Kerr      go      -4.09    55.41
2h        0     0      0       3     Ruether    Kerr      go      -5.41    50.00
10v       4     4      2       0     Weaver     Ring      2b     -18.36    31.64
10v       4     4      13      0     Jackson    Ring      1b     -13.80    17.84
10v       4     4      13      1     Felsch     Ring      k       13.73    31.57
10v       5     4      13      1     Gandil     Ring      1b     -20.94    10.63
10v       5     4      0       3     Risberg    Ring      dp       7.87    18.50
10h       5     4      0       1     Roush      Kerr      go      -8.25    10.25
10h       5     4      0       2     Duncan     Kerr      fo      -5.65     4.60
10h       5     4      0       3     Kopf       Kerr      go      -4.60      .00

NOTE: 1b, single; 2b, double; bb, base on balls; dp, double play; e, error; fo, fly out; go, ground out; k, strikeout; go*, ground out if not for error.


and if the play occurs in the top of the inning, then o receives −ΔWP/2 and d receives ΔWP/2.

If no offensive player is given, ΔWP is assigned to the defensive player in its entirety. This always occurs for an error (e) and hit by pitcher. It is also given for a sacrifice hit and intentional walk if ΔWP is in favor of the defensive team; this is done so that the offensive player is not penalized for strategic moves made by managers. An "*" next to the play indicates that the play would have had this result if the defense had not made an error or an extraordinary play. For example, as shown in Table 1, in the bottom of the second inning with no score, Duncan of the Reds led off the inning with a ground ball to Risberg at shortstop. Risberg should have thrown out Duncan but he bobbled the ball and Duncan was safe. Risberg was charged with an error. For the purposes of PGP, this event was separated into two plays:

1. The groundout which should have occurred. This play decreases Duncan's PGP by 1.1 and increases the pitcher Kerr's PGP by 1.1. Thus, for an error, the batter is penalized (as in standard batting statistics) and the pitcher rewarded.

2. The error which actually occurred. Risberg's error turned an out with no runners on base into a no-out situation with a runner on first. His PGP was penalized 5.71.

In most plays, the major offensive player is the batter and the major defensive player is the pitcher. For stolen base and caught-stealing plays, the major offensive player is the runner and the major defensive player is the catcher. For a runner's advance or a runner thrown out, the major offensive player is the runner and the major defensive player is the fielder most involved in the play.

Using win probabilities from Table 1, the progress ofa baseball game can be plotted in a manner comparableto graphs produced by Westfall (1990) for basketball.Figure 1 shows such a graph for Game 6 of the 1919World Series. The figure plots win probability as a func-tion of plate appearances. Innings are noted numeri-cally along the x axis. As runs are scored, they are notedat the top of the graph for the home team and belowthe graph for the visiting team. Additional pointers in-dicate plays involving Joe Jackson. The plot dramati-cally describes how the Reds went out to an early leadand how the White Sox tied the game in the sixth inningand won it in the tenth. Large jumps in the graph areobserved when important runs are scored (e.g., theReds' first two runs at the end of the third inning). Theimportance of leadoff hits to establish scoring threatsis emphasized by smaller jumps (e.g., Jackson's walkto start the eighth inning). Note how the jumps becomelarger in the later innings when each event becomesmore important in determining the winner.

There are several advantages to PGP over PWA:

1. PGP has a simpler interpretation than PWA. A positive (negative) PGP value represents winning percentage above (below) average (i.e., the .500 level).

Figure 1. PGP Analysis of Game 6 of the 1919 World Series.


For example, if a player has 2,600 Win points and 2,400 Loss points, PGP = 5, which is easily interpreted as adding .050 to a team's winning percentage or about 8 wins to an average team's total wins for the year. The PWA value of .520 does not have such a simple interpretation.

2. PGP is a more valid quantification of a player's contribution to victory. Consider two sequences of plays for a relief pitcher's single inning of work:

• Sequence 1: The pitcher gives up two infield singles to the first two batters and then strikes out the side.

• Sequence 2: The pitcher faces three batters and strikes out the side.

Both sequences are the same with respect to thepitcher's contribution to team victory since the side wasretired with no runs scored. PGP recognizes this andgives the same value for both sequences. PWA doesnot. The second sequence has no Loss Points for thepitcher; thus PWA is 1.000. The first sequence doeshave Loss Points from the singles; thus PWA is lessthan 1.000.

Questions have been posed concerning the heavyweighting that the PWA/PGP systems place in key events.For example, a PGP evaluation indicates that Kirk Gib-son's only at-bat in the 1988 World Series made himthe most valuable player of the Series. Gibson's game-winning home run raised the probability of a Dodgervictory in Game 1 from 13.3% to 100%. This results ina 8.7 PGP average over the five-game Series, whichtops that of his closest contender, teammate OrelHershiser (the Series MVP as voted by the sportswrit-ers) who had a 5.4 PGP Series average. While the sam-ple size is too small for this to be a statistically significantassessment of their abilities, it is a reasonable assess-ment of the values of the actual performances of thetwo players in the Series. Gibson alone lifted his teamfrom imminent defeat to a certain triumph. Hershiserpitched well, but unlike Gibson he received the supportof other players (5.5 Dodger runs per game) to achievehis two victories. Thus, in general, baseball does havethese critical points and PGP assesses them in a rea-sonable fashion.

Another question concerns the use of a game as thestandard winning objective. For example, the win prob-

ability for the series could be used instead of summingthe win probabilities for each game. In this way, eventsin early games would have less weight than events inthe seventh game. Using this system, Bill Mazeroski'shome run which won the 1960 World Series in the finalplay of Game 7 would probably be the most valuableplay in World Series history. This system was consid-ered at one point but was rejected primarily becausethe game is the basic unit of baseball achievement withrespect to winning and losing. Using the game win prob-ability allows the PGP system to be applied meaning-fully and comparably to all levels of baseball play (sea-son, championship playoffs, and World Series).

2. PGP ANALYSIS OF THE 1919 WORLD SERIES

For each Series game, Table 2 shows the score, theMost Valuable Player (MVP), and the Least ValuablePlayer (LVP), with respect to PGP. Typically, the MVP(LVP) is from the winning (losing) team and this holdstrue for each game of the 1919 World Series. All MVP'sand LVP's were pitchers except for Felsch, Rath, Ris-berg, and Roush. Several players appear more thanonce. Chicago pitcher Dickie Kerr was MVP in bothgames he started. Pitchers Ring (Reds) and Cicotte(Sox) were MVPs and LVPs. A Black Sox player wasthe LVP in all five games lost by Chicago and was MVPin one of their three wins. Joe Jackson was neither MVPnor LVP in any game.

If all contributions to each player's PGP are summed and then divided by the number of games (8), the PGP results for the entire 1919 World Series are obtained as presented in Table 3. The eight Black Sox players have their names capitalized. One of these players, Swede Risberg, was the LVP of the Series. The MVP was Dickie Kerr who played on the losing side. To give some perspective to these performances, if Risberg played at this level throughout an entire season, he alone would reduce a .500 team to a .449 team. On the other hand, Kerr would raise a .500 team to a .541 winning percentage. Joe Jackson stands out clearly as the Black Sox player with the best performance.

3. JOE JACKSON'S PERFORMANCE

As shown in the previous section, Joe Jackson's contribution to team victory in the 1919 World Series was

Table 2. 1919 World Series Game Results

                 Score              Most Valuable              Least Valuable
Game   White Sox   Reds     Player        PGP          Player         PGP
1          1          9     Ruether       18.68        CICOTTE       -20.66
2          2          4     Sallee        19.30        RISBERG       -14.21
3          3          0     Kerr          17.85        Fisher        -14.33
4          0          2     Ring          29.44        CICOTTE       -10.04
5          0          5     Eller         22.33        FELSCH         -8.02
6          5          4     Kerr          15.34        Ring          -12.07
7          4          1     CICOTTE       18.28        Rath          -11.28
8          5         10     Roush         10.00        WILLIAMS      -14.58

NOTE: Black Sox players are capitalized.


Table 3. PGP per Game for the 1919 World Series Players

White Sox (in descending order of PGP per game): Kerr, p; Schalk, c; JACKSON, lf; Lowdermilk, p; Smith, if; Mayer, p; Lynn, c; Wilkinson, p; MCMULLIN, 3b; Murphy, of; WEAVER, 3b; GANDIL, 1b; S. Collins, of; James, p; Leibold, rf; CICOTTE, p; FELSCH, cf; E. Collins, 2b; WILLIAMS, p; RISBERG, ss.

Reds (in descending order of PGP per game): Eller, p; Ring, p; Wingo, c; Sallee, p; Roush, cf; Ruether, p; Duncan, lf; Luque, p; Magee, of; Kopf, ss; Rariden, c; Daubert, 1b; Neale, rf; Fisher, p; Rath, 2b; Groh, 3b.

PGP per game for all 36 players combined, in descending order: 4.15, 3.22, 2.86, 2.17, 2.04, 1.54, 1.45, 1.21, .66, .19, .14, .08, .00, .00, .00, -.05, -.05, -.21, -.27, -.34, -.40, -.57, -.84, -1.06, -1.14, -1.16, -1.19, -1.30, -1.55, -1.61, -2.01, -2.79, -2.89, -3.46, -3.69, -5.11.

NOTE: Black Sox players are capitalized.

superior not only to most players on his team, but also superior to that of most players on the opposing team. Table 4 summarizes his PGP performance in each game and traces his PGP average through the course of the Series. Clearly, Jackson's hitting was the strongest feature of his game; his batting PGP was the highest among all players in the Series. The only negatives are in stealing and base running. His most serious mistake occurred when he was doubled off second base for the third out with the score tied in the eighth inning of Game 6; however, his positive batting and fielding contributions in this game far outweighed this mistake. The fact that his key hit in the tenth inning resulted from his hustle

to beat out a bunt gives further support to Jackson's case.

Jackson's batting average and slugging average (.563) were higher for the Series than during the regular season. Still, detractors state that indeed Jackson batted well, but he did not hit in the clutch (e.g., ". . . Joe Jackson's batting average only camouflaged his intentional failings in clutch situations, . . ." [Neft, Johnson, Cohen, and Deutsch 1974, p. 96]). We will examine this contention in several ways:

• By using traditional batting statistics for clutch situations;

• By using linear regression to determine if Jackson's batting PGP was low for a batter with a .563 slugging average (SA); and

• By using resampling techniques to create a PGP distribution for a .563 SA hitter batting in the situations encountered.

3.1 Traditional Statistics

Currently, the most complete analysis of clutch hit-ting is that performed by the Elias sports bureau in itsannual publication (Siwoff, Hirdt, Hirdt, and Hirdt 1989).The analysis consists of examining standard baseballstatistics (batting average, slugging average, and on-base average [OBA]) for certain clutch situations. Theydefine a Late Inning Pressure (LIP) situation as a plateappearance occurring in the seventh inning or later when1) the score is tied; 2) the batter's team trails by one,two, or three runs; or 3) the batter's team trails by fourruns with the bases loaded. Table 5 shows the resultsof such an analysis applied to Jackson's plate appear-ances. Generally, Jackson performed well in these sit-uations. By any yardstick, he performed especially wellin LIP situations and when leading off an inning. Thisanalysis gives exploratory-level indications that Jacksondid hit in the clutch. However, it is difficult to draw aconclusion because of the multitude of clutch categoriesand the small sample sizes within each category.

3.2 Regression Analysis

PGP, in a sense, weights the Elias clutch categories and allows them to be pooled into an overall evaluator of clutch batting performance. We expect that a cor-

Table 4. Jackson's PGP for Each Game of the 1919 World Series

Game       Total    Batting   Stealing   Running   Fielding   Cumulative Average
1          -2.64     -2.64       .00        .00        .00        -2.64
2           6.36      6.36       .00        .00        .00         1.86
3           -.05       .76      -.81        .00        .00         1.22
4          -1.07     -1.07       .00        .00        .00          .65
5          -4.65     -4.65       .00        .00        .00         -.41
6           7.57     11.25       .00      -4.27        .59          .92
7           7.81      8.80       .00       -.99        .00         1.90
8          -1.77     -1.76       .00        .00        .00         1.45
Average     1.45      2.13      -.10       -.66        .07


Table 5. Standard Situational Baseball Statistics for Joe Jackson's Plate Appearances in the 1919 World Series

Situation                   AB    H    TB    BB      BA      SA     OBA
Leading off                  9    4     6     1     .444    .667    .500
Runners on                  17    6     7     0     .353    .412    .353
Scoring position            14    5     6     0     .357    .429    .357
2 Outs/runners on            6    2     2     0     .333    .333    .333
2 Outs/scoring position      5    2     2     0     .400    .400    .400
LIP                          3    2     2     1     .667    .667    .750
LIP/runners on               1    1     1     0    1.000   1.000   1.000
LIP/scoring position         1    1     1     0    1.000   1.000   1.000
All                         32   12    18     1     .375    .563    .394

relation should exist between batting PGP and tradi-tional batting statistics. Since the traditional statisticsdo not account for the game situation, batting PGPshould be higher than expected if the batter tended tohit better than expected in clutch situations and lowerif the batter tended to hit worse in such situations. Thus,one method of determining if Jackson hit in clutch sit-uations is to develop a linear regression line for therelationship between batting PGP and a standard bat-ting statistic and see where Jackson's batting PGP standswith respect to this line. Cramer (1977) used a similartechnique on PWA in an examination of the existenceof clutch hitting.

Figure 2 plots batting PGP versus SA for all 17 playerswith a minimum of 16 at bats in the 1919 World Series.Similar results were obtained when BA and OBA wereused in place of SA. The regression line shown wasestimated using the data from all players except theBlack Sox. Jackson's batting PGP was well above theregression line. Black Sox Risberg and Felsch not onlybatted poorly but also had lower batting PGP's thanexpected for those averages. Black Sox Weaver had ahigh SA but did not hit as well as expected in clutchsituations. Weaver was banned from baseball not be-

Figure 2. Regression Analysis of Batting PGP Versus Slugging Average in the 1919 World Series.

cause of his participation in throwing games, but be-cause he knew about the fix and did not reveal it to thebaseball authorities. Part of his defense was his .324BA in the Series. This analysis indicates that his highaverage was not as valuable as it might appear. In fact,it was no more valuable than that of the Black SoxGandil who had a poor SA but did as well as expectedin clutch situations with respect to that average. Giventhe limited nature of the data used to establish theregression, it is difficult to place any level of significanceon the degree to which Jackson (or any other player)hit in the clutch using this analysis.

3.3 Resampling Analysis

In order to establish such a level of significance for clutch performance, a new technique using resampling was developed. Table 6 presents the situations and results for each of Joe Jackson's 33 plate appearances in the 1919 World Series. The ΔWP/2 values were calculated with the standard PGP approach using the before and after plate appearance situations. However, the resampling technique to be proposed required the classification of each plate appearance as one of the following:

1B/1—Single with runners advancing one base
1B/2—Single with runners advancing two bases
1B/U—Single that cannot be classified as 1B/1 or 1B/2. For example, if the single occurred with no runners on base, such a classification could not be made.
2B/2—Double with runners advancing two bases
2B/3—Double with all runners scoring
2B/U—Double that cannot be classified as 2B/2 or 2B/3.
3B—Triple
HR—Home run
BB—Base on balls
K—Strikeout
FO/NA—Fly out with no advance (e.g., foul pop-up)
SF—Deep fly out in which a runner did score or could have scored from third
FO/A—Fly out that advances all runners one base
FO/U—Fly out that cannot be classified as FO/NA, FO/A, or SF


Table 6. Joe Jackson's Plate Appearances in the 1919 World Series

                      Before At Bat            After At Bat
Game   Inning     RD    Bases   Outs       RD    Bases   Outs      Play      ΔWP/2
1      2V         -1      0      0         -1      0      1        GO/U      -1.06
1      4V          0      0      1          0      0      2        GO/U       -.87
1      6V         -5     12      1         -5     23      2        GO/A       -.72
1      9V         -8      0      0         -8      0      1        SF          .00
2      2V          0      0      0          0      2      0        2B/U       3.65
2      4V          0      1      0          0     12      0        1B/1       3.53
2      6V         -3      2      1         -3      2      2        K         -1.88
2      8V         -2      0      2         -2      1      2        1B/U       1.06
3      2H          0      0      0          0      1      0        1B/U       1.75
3      3H          2     12      0          2     12      1        FO/NA     -1.48
3      6H          3      0      0          3      1      0        1B/U        .49
4      2H          0      0      0          0      2      0        2B/U       3.62
4      3H          0      2      2          0      0      3        GO/U      -1.78
4      6H         -2      0      0         -2      0      1        GO/U      -1.67
4      8H         -2      0      1         -2      0      2        K         -1.24
5      1H          0     13      1          0     13      2        FO/NA     -2.86
5      4H          0      0      1          0      0      2        GO/U       -.87
5      7H         -4      0      0         -4      0      1        GO/U       -.79
5      9H         -5      3      2         -5      0      3        GO/U       -.12
6      1V          0      1      2          0      0      3        FO/NA     -1.10
6      4V         -2      0      1         -2      0      2        FO/NA      -.81
6      6V         -3      2      0         -2      1      0        1B/2       2.95
6      8V          0      0      0          0      1      0        BB         3.29
6      10V         0      2      0          0     13      0        1B/1       6.90
7      1V          0      2      2          1      1      2        1B/2       4.84
7      3V          1      2      2          2      1      2        1B/2       5.02
7      5V          2     12      1          2     23      2        GO/A       -.80
7      7V          3      0      1          3      0      2        GO/U       -.26
8      1H         -4     23      1         -4     23      2        FO/NA     -3.48
8      3H         -5      0      2         -4      0      2        HR         2.21
8      6H         -8      1      0         -8      1      1        SF         -.18
8      8H         -9     23      1         -7      2      1        2B/U        .14
8      9H         -5     23      2         -5      0      3        GO/U       -.45

NOTE: RD = (White Sox Runs) - (Reds Runs).

GO/DP—Ground ball that was or could have been turned into a double play
GO/F—Ground ball resulting in a force out
GO/A—Ground ball in which all runners advance one base
GO/U—Ground ball that cannot be classified as GO/DP, GO/F, or GO/A

The following procedure was used in the resampling technique. A random sample of 1,000 permutations of the 33 plays (listed in the "Play" column of Table 6) was created. For each permutation, a new batting PGP was calculated. This was done by calculating a new ΔWP/2 for the first play in the permutation occurring in the first situation (Game 1, top of the second inning, no outs, bases empty, White Sox trailing by a run), a new ΔWP/2 for the second play in the permutation occurring in the second situation (Game 1, top of the fourth inning, one out, bases empty, score tied), and so on through all 33 plays in the permutation for all 33 situations. The new batting PGP for the permutation is the sum of these 33 new ΔWP/2 values.

Calculation of new ΔWP/2 values for plays that had not actually occurred required the definition of reasonable rules for baserunner advancement and outs. Several plays (e.g., home run, walk) are clearly defined in their effects on baserunners, but most are less clear. For each of the play types described above (excluding the U plays such as GO/U), a table describing the motion of base runners, runs scored, and outs was constructed. Table 7 is an example of such a table for GO/

Table 7. Basic Resolution of GO/DP

Initial            Final             Runs
Base Situation     Base Situation    Scored    Outs
0                  0                 0         1
1                  0                 0         2
2                  2                 0         1
12                 3                 0         2
3                  3                 0         1
13                 0                 1         2
23                 23                0         1
123                3                 1         2


DP. In some cases, the tables are overridden by special rules. For example, if a play results in the third out, no runs are scored (e.g., with bases loaded and one out, GO/DP ends the inning with no additional runs scored despite the indication in Table 7).
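The following is a sketch of how such a resolution table might be encoded in Python, using the GO/DP entries as reconstructed in Table 7 and the third-out override described above; base states are written as strings of occupied bases, and the function name is our own.

# Table 7 as a lookup: initial bases -> (final bases, runs scored, outs added).
GO_DP = {
    "":    ("",   0, 1),
    "1":   ("",   0, 2),
    "2":   ("2",  0, 1),
    "12":  ("3",  0, 2),
    "3":   ("3",  0, 1),
    "13":  ("",   1, 2),
    "23":  ("23", 0, 1),
    "123": ("3",  1, 2),
}

def resolve_go_dp(bases, outs):
    """Apply the GO/DP resolution, with the special rule that no runs score
    once the play produces the third out."""
    final_bases, runs, outs_added = GO_DP[bases]
    new_outs = outs + outs_added
    if new_outs >= 3:
        return "", 0, 3      # inning over, any runs on the play are wiped out
    return final_bases, runs, new_outs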

U plays required special procedures. The existence of U plays reflects the uncertainty of play type categorization because of the limited description of the play provided in the World Series record and because of the limited ability to predict the effect of each play in situations other than the one in which it actually occurred. The following assumptions were made here:

• GO/U—equal probability of being resolved as GO/A, GO/DP, or GO/F
• FO/U—equal probability of being resolved as FO/NA or SF
• 2B/U—equal probability of being resolved as 2B/2 or 2B/3
• 1B/U—2/3 probability of 1B/1 and 1/3 probability of 1B/2 unless there are two outs in which case the probabilities are reversed

A separate random number was selected for each U play of each permutation and the probabilities described above were used to select the appropriate play resolution. For example, Joe Jackson's batting record in the 1919 World Series had nine GO/U plays. In each permutation, nine separate random numbers were selected to determine whether each GO/U should be treated as a GO/A, GO/DP, or GO/F. Once the determination of play type was made, the normal tables and rules were used to determine ΔWP/2. Since a new random number was selected for each U play in each permutation, the variability resulting from the uncertainty of the play description was incorporated into the resulting distribution for batting PGP.
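The permutation scheme itself is straightforward to sketch in Python. In the sketch below, the callable delta_wp_half is a stand-in for the win-probability machinery of the appendix (including the random resolution of the U plays); its name, like the other function names, is ours rather than the paper's, and everything else follows the procedure just described.

import random

def resample_batting_pgp(plays, situations, delta_wp_half, n_perm=1000, n_games=8, seed=1):
    """Permutation distribution of batting PGP/Game, holding the 33 game
    situations fixed and shuffling which play lands in which situation."""
    rng = random.Random(seed)
    dist = []
    for _ in range(n_perm):
        shuffled = plays[:]
        rng.shuffle(shuffled)
        total = sum(delta_wp_half(p, s) for p, s in zip(shuffled, situations))
        dist.append(total / n_games)
    return dist

def quantile_of(value, dist):
    """Empirical quantile of an observed PGP/Game within the permutation
    distribution (Jackson's 2.13 falls at the .686 quantile in the paper)."""
    return sum(d < value for d in dist) / len(dist)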

Using the resampling technique described above, both the batting situations and the batting performance (e.g., BA, SA, OBA) were kept constant. The only thing that changed the PGP was the pairing of play to situation. Using the 1,000 permutations, a sampling distribution for batting PGP/Game was established where the situations and batting performance were fixed. (Another

Figure 3. Sampling Distribution of Batting PGP/Game for Joe Jackson's Batting Record in the 1919 World Series. Darkened area indicates PGP/Game values less than Jackson's 2.13.

possible approach would have been to generate randomplay results for each situation based on the player'sseason or career statistics. This was not done for threereasons. First, this would require the development ofanother model to generate play results from the avail-able batting data. Second, and more importantly, theconditioning on the actual batting performance wouldbe lost. The question to be tested here is whether Jack-son "placed" his hits in a manner that was detrimentalto his team. Third, given that Jackson's World SeriesBA and SA were higher than in the 1919 season andhis career, the analysis performed provides a toughertest of his innocence.)

Since the distribution is a reflection only of the clutch factor of the batting performance, it can be used in a hypothesis test for clutch effect:

H0: Joe Jackson's batting performance had no clutch effect (i.e., he batted as well as expected in clutch situations).

H1: Joe Jackson did not hit as well as expected in the clutch.

To test at a .10 level of significance, the critical region for rejecting H0 contains the values for the PGP sampling distribution lower than the .10 quantile.

Figure 3 shows the resulting distribution for the batting performance presented in the "Play" column of Table 6 when matched with the situations shown in that table. The mean PGP/Game value for this distribution is 1.56 and the median is 1.59. Jackson's PGP/Game value of 2.13 is the .686 quantile of this distribution. That is, Jackson's 1919 World Series batting had more clutch value than 68.6% of batters with his combination of batting results occurring in random association with his batting situations. Thus, since the proposed hypothesis test has a .686 p value, hypothesis H0 that Jackson's batting showed no clutch effect is accepted.

Of additional interest in this distribution is its wide variability. A player having a high BA and high SA and neutral clutch ability has an 8% chance of being a detriment to his team at bat. Thus, in one sense, Jackson's detractors were right in that it is possible to have high traditional batting statistics and yet have a negative impact on your team's chances to win. This quantitatively demonstrates that in a short series of baseball games the "when" of batting is just as (if not more) important as the quantity of hits or bases.

4. CONCLUSION

Did Shoeless Joe Jackson throw the 1919 World Series? Almost every statistical view of the game data supports the contention that Joe Jackson played to his full potential in the 1919 World Series. Not only did Jackson have higher batting statistics than he did during the regular season, but his batting PGP was also higher than expected from those statistics. An hypothesis test based on a PGP distribution developed from Jackson's World Series batting record strongly supports the null hypothesis of no clutch effect in his batting


performance. This conclusion is also supported by the following analysis results:

• Jackson was the third most valuable player in the Series for his team and the seventh most valuable overall.

• As a batter, Jackson made a greater contribution to his team's chances for victory than any other batter in the Series.

• Jackson made a positive overall contribution toward White Sox victory in the Series while all other Black Sox had negative impacts.

• Jackson had high traditional batting statistics in most clutch situations (especially leading off and in late inning pressure).

The analysis also brought to light some interestingfindings concerning other players. Buck Weaver ap-pears to be the player to whom the high-BA, no-clutchcontention appears to be most applicable. All of thecontroversy surrounding the Black Sox has obscuredEddie Collins' poor performance which was worse thansix of the Black Sox. Collins had no positive contri-butions in any category (batting, fielding, baserunning,or stealing). His batting PGP was low even given hislow SA for the Series. Only three White Sox had overallpositive PGP values for the Series. Thus Chicago's lossin the Series cannot be blamed totally on the Black Sox.

This analysis has highlighted the possible applications of situational baseball statistics such as PGP. It has revealed the great degree to which game outcomes are determined not only by how many hits and of what type occur but also by when they occur. Since PGP can be directly interpreted in terms of additional games won or lost, it possesses great applicability in evaluating the value of a player to his team. Further research will focus on using the resampling techniques outlined in this article to determine batting PGP/Game distributions for different levels of BA, SA, and OBA in regular season play.

APPENDIX: ESTIMATION OF WIN PROBABILITIES

At any point in a baseball game, the probability P(W | RD, I, H, O, B) of the home team winning may be defined given the following parameters:

• The run differential RD (i.e., home team runs minus visiting team runs)

• The half-inning (H and I) where H = 1 (2) indicates the top (bottom) half of an inning (e.g., bottom of the third inning)

• The number of outs O
• The on-base situation B (e.g., runners on first and third bases).

This appendix describes the techniques used to estimate these win probabilities as developed by Bennett and Flueck (1984).

Using data acquired from the 1959 and 1960 seasons, Lindsey (1961, 1963) estimated:

• The win probability at the end of each inning I given the score, P(W | RD, I). Lindsey gave win probabilities only for cases where |RD| < 7. Win probabilities for |RD| ≥ 7 were estimated by extrapolation, using a functional form in constants C_I derived by fitting that form to the win probabilities for high run differentials. The values for C_I are shown in the following table.

  Inning I    1     2     3     4     5     6     7     8     9
  C_I        .65   .64   .60   .57   .54   .49   .48   .30   .30

• The probability of scoring R more runs in an inning given the state of the inning, P(R | O, B). Lindsey provided values only for cases in which R < 3. Run probabilities for R ≥ 3 were estimated using a functional form fitted to the expected number of runs scored E(R | O, B) provided by Lindsey.

Using P(R\O, B) and P(W\RD, I), the probabilityof a home team victory at each stage of a game maybe estimated. Clearly, by the definition of victory inbaseball:

The win probabilities at the end of each half-inning are

P(W\RD, 1, 1, 3, B)

and

The win probabilities at the start of each half-inning are

and

Extra innings are assumed to be identical to the ninthinning:


Each win probability within a half-inning is the sum of the products of 1) the probability of scoring R more runs in the inning given the out-base situation and 2) the probability of winning given R more runs are scored:

P(W | RD, I, H, O, B) = Σ_R P(R | O, B) P(W | RD ± R, I, H, 3, B),

where the R runs are subtracted from RD when the visiting team is batting (H = 1) and added when the home team is batting (H = 2).
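One way to organize this computation is as a small recursion over game states. The Python sketch below assumes two lookup tables estimated from Lindsey's data, p_end_of_inning[(rd, inning)] for P(W | RD, I) and p_runs[(outs, bases)][r] for P(R | O, B); those names, the truncation at max_runs, and the function itself are our own conventions for illustration rather than anything published in the original.

def win_prob(rd, inning, half, outs, bases, p_end_of_inning, p_runs, max_runs=10):
    # Definition of victory: home team ahead with the game effectively decided.
    if rd > 0 and (inning > 9 or (inning == 9 and half == 2)):
        return 1.0
    if rd < 0 and inning >= 9 and half == 2 and outs == 3:
        return 0.0
    # Extra innings are treated as repeats of the ninth.
    if inning > 9:
        inning = 9
    if outs == 3:
        if half == 1:
            # End of the top half: average over the runs the home team will score.
            return sum(p_runs[(0, 0)][r] *
                       win_prob(rd + r, inning, 2, 3, 0,
                                p_end_of_inning, p_runs, max_runs)
                       for r in range(max_runs))
        # End of the bottom half: Lindsey's end-of-inning table applies.
        return p_end_of_inning[(rd, inning)]
    # Within a half-inning: average over the runs still to come this half.
    sign = -1 if half == 1 else 1    # visitors' runs lower RD, home runs raise it
    return sum(p_runs[(outs, bases)][r] *
               win_prob(rd + sign * r, inning, half, 3, 0,
                        p_end_of_inning, p_runs, max_runs)
               for r in range(max_runs))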

The techniques described here can be applied to the appropriate baseball data from any period. The win probability estimates from the Lindsey data representing the 1959 and 1960 seasons were found to be close to the values published by the Mills brothers in their 1969 World Series analysis. In 1919 though, baseball was still in the deadball era. While batting averages and on-base percentages were quite similar, slugging averages (.388 vs. .348) and runs per game (8.6 vs. 7.7) were higher in 1960 than in 1919. Since runs were more valuable in 1919, we would expect that the |ΔWP/2| values for hits (especially those producing RBI's) are actually larger than those estimated from the 1960 data. Since probability tables such as those produced by Lindsey are not available for 1919, currently it is not possible to estimate the magnitude of these differences. However, given that the resampling analysis of Jackson's performance was performed on a relative basis and given the .686 p value for that test, it is unlikely that the use of 1919 data would lead to a different conclusion concerning Jackson's participation in the fix.

[Received December 1991. Revised April 1993.]

REFERENCES

Bennett, J. M., and Flueck, J. A. (1984), "Player Game Percentage," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 378-380.

Cohen, R. M., Neft, D. S., Johnson, R. T., and Deutsch, J. A. (1976), The World Series, New York: Dial.

Cramer, R. D. (1977), "Do Clutch Hitters Exist?" Baseball Research Journal, Sixth Annual Historical and Statistical Review of the Society for American Baseball Research, 74-79.

Lindsey, G. R. (1961), "The Progress of the Score During a Baseball Game," Journal of the American Statistical Association, 56, 703-728.

(1963), "An Investigation of Strategies in Baseball," Operations Research, 11, 4, 477-501.

Mills, E. G., and Mills, H. D. (1970), Player Win Averages, New York: A. S. Barnes.

Neft, D. S., Johnson, R. T., Cohen, R. M., and Deutsch, J. A. (1974), The Sports Encyclopedia: Baseball, New York: Grosset & Dunlap.

Siwoff, S., Hirdt, S., Hirdt, T., and Hirdt, P. (1989), The 1989 Elias Baseball Analyst, New York: Collier.

Westfall, P. H. (1990), "Graphical Presentation of a Basketball Game," The American Statistician, 44, 305.


Chapter 12

PLAYER GAME PERCENTAGE

Jay M. Bennett, Bell Communications Research
John A. Flueck, NOAA/ERL and NCAR/FOF

1. Introduction

Player Game Percentage (PGP) is a statistical technique for evaluating a major league baseball player with respect to his contribution to winning games for his team. It is unique among baseball statistics in its ability to synthesize batting, pitching, fielding, and base running evaluations into a single value rating the player. Thus, not only may batters be compared with other batters and pitchers with other pitchers, but batters may be compared with pitchers in their ultimate contribution to team victory. PGP also simplifies the comparison of relief pitchers with other relievers as well as with starting pitchers. PGP has its genesis with the Player Win Average (PWA) concept [1]. While most baseball statistics are direct (e.g. ERA, RBI) or indirect (e.g. BA) measures of runs scored [2], PWA goes directly to the heart of the matter by measuring changes in the probability of winning with each game event. At any point in a baseball game, the probability P of the home team winning may be estimated given the following parameters:

• The run differential RD (i.e. home team runs minus visiting team runs)
• The half-inning (H and I) (e.g. bottom of the third inning)
• The number of outs O
• The on-base situation B (e.g. runners on first and third bases).

A baseball event may then be defined as an occurrence which changes any of the above parameters. Thus, baseball events include not only hits, walks, double plays but also stolen bases, passed balls, and balks.

Before each baseball event, there is a prior probability of the home team winning P(E−), where E− is the state of the game prior to the event. Similarly, after the event there is an analogous probability P(E+), where E+ is the state after the event. The net effect of the event E on the outcome of the game is

N(E) = P(E+) − P(E−).

Most baseball events have a major offensive player and a major defensive player who are most responsible for the event's occurrence. Each of these players may then be credited with half the change in probability from the event. The home team player is credited with N(E)/2 percentage points while the visiting player is credited with the negative of this same amount.

The Player Game Percentage (PGP) for a player is the sum of these percentage point credits over the measurement period desired. A player's PGP may be calculated over a single game, a series of games such as the World Series, a baseball season, or an entire baseball career. The PGP Average is the total PGP divided by the number of games considered.

PGP differs from PWA in several important ways:

• While PWA converts the win probabilities into Win/Loss Points on a scale of -1000 to 1000, PGP deals directly in probabilities in the form of percentages. This form enhances understanding and reduces unnecessary arithmetic conversions.

• While PWA is analogous to PGP Average, PWA does not use number of games in its denominator. Instead it uses the sum of credits (Win Points) and debits (Loss Points) in the denominator. The disadvantage of this technique is the elimination of the additive properties of the averages. It also tends to reduce the averages of players in critical situations and increase those of players in non-critical situations.

• PGP adds several new scoring concepts including an intermediate event evaluation (described in Section 4) to more precisely assign credit for events.

• PGP uses a different set of win probabilities. Their calculation is discussed in the next section.

2. Estimation of Win Probabilities

From the above discussion, it is clear that the key to the PGP concept lies in the estimation of the win probability for each possible state of a baseball game. Mills and Mills [1] obtained their estimates by recording the frequencies of each event in each state during the 1969 baseball season. They then used this information to simulate thousands of games and thus obtain the required probabilities. Unfortunately, a complete set of these probability values was not published.

For the estimation of the win probabilities for PGP, we relied on data from other published sources. Using data acquired from the 1959 and 1960 seasons, Lindsey estimated:

• The win probability at the end of each inning I given the score, P(W | RD, I) [3]. Since Lindsey gave win probabilities for only a limited number of cases (|RD| < 7), the remaining values were extrapolated from a fitted functional form.

• The probability of scoring R more runs in an inning given the state of the inning, P(R | O, B) [4]. Again, since Lindsey provides a limited number of cases (R < 3), the probabilities for R > 2 were estimated from a functional form fitted to the expected number of runs scored.

With Lindsey's data, it is possible to calculate the probability of a home team victory at each stage of a game, P(W | RD, I, H, O, B), where H = 1 (2) indicates the visiting (home) team at bat. Clearly, by the definition of victory in baseball:

• P(W | RD, I, H, O, B) = 1 if RD > 0 and either a) I = 9 and H = 2 or b) I > 9.
• P(W | RD, I, 2, 3, B) = 0 if RD < 0 and I > 8.

The probabilities at the end of each half-inning are

• P(W | RD, I, 1, 3, B) = Σ_R P(R | 0, 0) P(W | RD + R, I)
• P(W | RD, I, 2, 3, B) = P(W | RD, I).

The probabilities within half-innings are the sums of the products of the probability of scoring R more runs in the inning given the out-base situation and the probability of winning given R more runs are scored:

P(W | RD, I, H, O, B) = Σ_R P(R | O, B) P(W | RD ± R, I, H, 3, B),

with the runs subtracted in the top half (H = 1) and added in the bottom half (H = 2).

To complete the calculations, we assume that all extra innings are identical in nature to the ninth inning:

P(W | RD, I, H, O, B) = P(W | RD, 9, H, O, B) if I > 9 and RD < 1.

We also assume that the game starts as an even contest:

P(W | 0, 1, 1, 0, 0) = .5.

Clearly, for 1 < I < 10,

P(W | RD, I, 1, 0, 0) = P(W | RD, I − 1, 2, 3, B),

and, for 0 < I < 10,

P(W | RD, I, 2, 0, 0) = P(W | RD, I, 1, 3, B).


3. Comparison of PGP and PWA Win Probabilities

A limited comparison of P(W | RD, I, H, O, B) values used for PWA with those calculated using the techniques described above was made. Tables 1 and 2 show the P(W | RD, I, H, 3, 0) used by Mills and Mills [1] in their evaluation of the 1969 World Series along with the comparable values calculated using the procedures from Section 2. Across both tables, the median difference of PWA − PGP is 0.0% and the median absolute value difference of PWA − PGP is 0.4%. The means are the same when rounded to the nearest tenth of a percent.

Thus, although the PGP probabilities are not exactly equal tothe PWA probabilities, they are reasonably close to be useful inplayer evaluations. These results are quite remarkable when oneconsiders that the PGP probabilities are based on data from majorleague baseball a decade prior to PWA. Over this period therewere many changes in baseball most notably the expansion of bothleagues. In addition, PGP uses a deterministic set of equations tosimply calculate the probabilities while PWA probabilities requiredextensive simulation for their creation. This allows PGP tocalculate probabilities as needed instead of referencing a largedata base of probabilities. Thus, PGP can easily fit on thesmallest of personal computers (such as the TI-99/4A on whichsome of these calculations were made) to make it easily accessibleto the general public.

4. A PGP Example

One of the great strengths of PGP is its ability to analyze defensive contributions along with offensive contributions. As an illustration of a PGP calculation, we have selected a great defensive play from the 1980 World Series. In Game Three, with the score tied in the top of the tenth inning, Mike Schmidt of the Philadelphia Phillies came to bat against Dan Quisenberry of the Kansas City Royals. The Phillies had one out and runners on first and second bases. The probability P(W|0, 10, 1, 1, 1&2) of a home victory for the Royals at this point was 40.7%. Schmidt hit a low liner that appeared to be a sure single past second base. If the ball had gone through for a hit, conservatively the Phillies would have had the bases loaded and still only one out; the probability of a Royals' victory would have been reduced to P(W|0, 10, 1, 1, 1&2&3) = 26.2%. However, the Royals' second baseman Frank White made an extraordinary play to intercept the liner and double the runner on second off the bag. The final result of the play was the end of the Phillies' turn at bat with no runs scored. The Royals' probability of victory increased to P(W|0, 10, 1, 3, 0) = 62.6%.

In the player PGP evaluations for this event, Schmidt is responsible on the offensive side. His PGP total for the game is therefore debited half of the overall change in the probability

-10.95% = -(62.6% - 40.7%)/2.

Evaluation on the defensive side is more complex. Quisenberry is responsible for allowing Schmidt to almost get a hit loading the bases; so, Quisenberry's PGP total is debited half of the intermediate change in probability

-7.25% = (26.2% - 40.7%)/2.

White is responsible for preventing Schmidt's hit and doubling the runner off second; his PGP total is increased by half of the other intermediate change

18.20% = (62.6% - 26.2%)/2.

Unlike any other scoring system, PGP correctly gives credit to White for getting Quisenberry out of a jam.
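The bookkeeping in this example is purely arithmetic; the short sketch below (our illustration, using the win probabilities quoted above) makes the half-credit split explicit.

```python
# Sketch of the PGP accounting for the Schmidt / Quisenberry / White play.
# Each change in the Royals' win probability is split evenly between the offensive
# and defensive players judged responsible for it.

p_before = 40.7   # P(W|0,10,1,1,1&2): Royals' win probability before Schmidt's at-bat
p_if_hit = 26.2   # P(W|0,10,1,1,1&2&3): had the liner gone through for a hit
p_after  = 62.6   # P(W|0,10,1,3,0): after White's play ended the inning

schmidt_pgp     = -(p_after - p_before) / 2    # batter debited half the overall change
quisenberry_pgp = (p_if_hit - p_before) / 2    # pitcher debited for nearly allowing a hit
white_pgp       = (p_after - p_if_hit) / 2     # fielder credited for erasing the threat

print(schmidt_pgp, quisenberry_pgp, white_pgp)  # approximately -10.95, -7.25, 18.20
```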

5. PGP Evaluations of the 1980 World Series

Tables 3 and 4 present the PGP evaluations of the Phillies and Royals, respectively, on a game-by-game basis in addition to an overall PGP series average (i.e., PGP player total/6). Unser, Schmidt, and McBride were the outstanding players on the World Champion Phillies, while Aikens and Otis were their counterparts on the Royals. All of these players were everyday players except for Unser, who contributed two outstanding pinch hits for the Phillies. One of these pinch hits was the single most significant play in the 1980 World Series according to PGP. In the top of the ninth inning of Game Five, the Royals held a one-run lead. Unser came to bat against the Royals' Dan Quisenberry with no outs and a runner on first base. Unser ripped a double which tied the game and put him in position to score the go-ahead run. Unser's hit raised the probability of a Phillies' victory from 32.3% to 74.2%, a change of 41.9%. Unser's performance can be contrasted with that of the Royals' key pinch hitter Cardenal, who did not perform as well in that role, resulting in the second worst PGP of all participants in the series. PGP thus highlights the importance of pinch hitting at critical points of the game.

One of the best features of PGP is its ability to directly assess the impact of player substitution on winning. If Willie Aikens (playing at his 1980 World Series level) had replaced an average player (PGP average = 0%) on a .500 team, PGP predicts that his addition alone would have lifted that team to divisional contender status with a .580 record:

New Team Record = Old Team Record + PGP Average/100.

A similar substitution involving Quisenberry would have reduced the same team to a .403 record. However, these cases are extreme ones; it is unlikely that such high and low PGP averages would be recorded for an entire season.

The worst performance of the series was that of Quisenberry, who lost several late leads for the Royals. In fact, relief pitching in general was not outstanding in this series. McGraw, widely regarded as a Phillies pitching hero, in reality had only an average series (PGP average near zero), mostly a result of his generally neglected loss of Game Three.

One of the surprises from the PGP analysis of the series is the relatively high evaluation of Willie Wilson's performance. In spite of his record-setting 12 strikeouts, Wilson had the third highest PGP average on the Royals. Wilson's efforts in scoring the winning run of Game Three and his outstanding fielding are highlighted by PGP.

6. Conclusion

Assessing player performance in baseball using probabilities of victory has lain dormant since its inception in 1969. Much of this concept's relative obscurity is a result of the reticence of the concept's innovators to reveal the probabilities which are at the heart of the system. Thus, baseball researchers and fans have been unable to test and experiment with these ideas. This paper shows how the required probabilities may be calculated from published data. Although these data are nearly a quarter century old, the calculated probabilities were shown to be quite close to those used in PWA ten years later. The techniques themselves do not date and will be as applicable to 1984 baseball data as they are to 1960 baseball data. A study of how (and if) these probabilities change with time would be quite interesting.

An analysis of the 1980 World Series was shown as an example of the insights provided by PGP. The great advantages of PGP are:

• A synthesis of hitting, pitching, running, and fielding evaluations into one quantitative value

• A quantitative tool to measure previously immeasurable player contributions such as the advancement of base runners on outs

• A direct measure of how much a player contributes to winning

• Evaluation of pinch hitters and relief pitchers conditional on the state of the game at the moment of their entrance.

The only drawbacks to PGP are the detail of record-keeping in a game required for the proper state evaluations and the somewhat subjective assignment of responsibility for plays. However, it should be emphasized that PGP is not intended to be the final arbiter in player evaluations. Like diagnostic measures for the detection of outliers in regression, PGP is a tool to aid the baseball researcher and fan in achieving a previously unknown perspective for player evaluation.


Baseball has been relatively amenable to this probabilistic approach because of its limited, quantized nature. Efforts should be made to extend these concepts to other sports such as football and basketball.

References

1. MILLS, ELDON G., and MILLS, HARLAN D. (1970), Player Win Averages, New York: A. S. Barnes.

2. BENNETT, JAY M., and FLUECK, JOHN A. (1983), "An Evaluation of Major League Baseball Offensive Performance Models," The American Statistician, 37, 1, 76-82.

3. LINDSEY, G. R. (1961), "The Progress of the Score During a Baseball Game," American Statistical Association Journal, 56, 703-728.

4. LINDSEY, G. R. (1963), "An Investigation of Strategies in Baseball," Operations Research, 11, 4, 477-501.

TABLE 1. P(W|RD, I, 1, 3, 0) Used in PWA and in PGP

 I   RD    PWA    PGP   PWA-PGP
 1    0   54.5   54.5     0.0
 2    0   54.8   54.8     0.0
 2    1   66.3   65.7     0.6
 3   -3   21.0   21.9    -0.9
 3    0   55.1   55.2    -0.1
 3    1   67.7   67.3     0.4
 3    3   85.5   85.0     0.5
 4   -3   19.0   19.9    -0.9
 4   -1   41.0   41.6    -0.6
 4    1   69.5   69.1     0.4
 4    3   87.5   87.0     0.5
 5   -3   16.6   17.3    -0.7
 5   -1   39.4   40.3    -0.9
 5    1   71.9   71.3     0.6
 5    3   89.8   89.5     0.3
 5    4   94.2   94.0     0.2
 6   -3   13.8   14.4    -0.6
 6   -1   37.2   37.9    -0.7
 6    1   75.2   74.8     0.4
 6    3   92.3   92.1     0.2
 6    4   95.9   95.7     0.2
 7   -1   33.6   33.3     0.3
 7    1   80.1   81.1    -1.0
 7    3   95.0   95.1    -0.1
 7    4   97.6   98.3    -0.7
 8    0   59.7   60.0    -0.3
 8    1   87.7   87.4     0.3
 8    3   97.6   97.8    -0.2
 8    4   99.0   98.9     0.1
 9   -1   17.8   18.5    -0.7
 9    0   62.2   62.6    -0.4
10    0   62.2   62.6    -0.4

TABLE 2. P(W|RD, I, 2, 3, 0) Used in PWA and in PGP

 I   RD    PWA    PGP   PWA-PGP
 1    0   50.0   50.0     0.0
 1    1   61.3   61.0     0.3
 2    0   50.0   50.0     0.0
 2    1   62.3   61.7     0.6
 2    3   81.5   80.3     1.2
 3    0   50.0   50.0     0.0
 3    1   63.6   63.0     0.6
 3    3   83.4   82.7     0.7
 4    0   50.0   50.0     0.0
 4    1   65.3   64.7     0.6
 4    3   85.6   84.9     0.7
 4    4   91.3   90.8     0.5
 5    1   67.6   66.8     0.8
 5    3   88.1   87.7     0.4
 5    4   93.2   93.0     0.2
 6    1   71.1   70.5     0.6
 6    4   95.2   94.9     0.3
 7    0   50.0   50.0     0.0
 7    1   76.5   77.6    -1.1
 7    3   94.0   94.0     0.0
 7    4   97.1   98.0    -0.9
 8    0   50.0   50.0     0.0
 8    1   85.2   84.6     0.6
 8    2   93.4   93.6    -0.2
 8    3   97.1   97.4    -0.3
 8    5   99.5   99.5     0.0

TABLE 3. Phillies PGP Evaluations (PGP series averages; game-by-game values not reproduced)

PLAYER         PGP AVE
Unser            5.44
Schmidt          4.63
McBride          3.94
Carlton          2.03
Boone            1.45
Ruthven          1.27
Rose             0.76
Noles            0.27
Brusstar         0.26
McGraw           0.13
Reed             0.08
Saucier          0.02
Bowa            -0.14
Moreland        -0.19
Gross           -0.30
Luzinski        -0.50
Smith           -1.57
Maddox          -2.15
Bystrom         -2.45
Walk            -2.46
Christenson     -2.81
Trillo          -3.91

TABLE 4. Royals PGP Evaluations (PGP series averages; game-by-game values not reproduced)

PLAYER         PGP AVE
Aikens           8.04
Otis             7.02
Wilson           1.49
Chalk            0.91
Martin           0.39
White            0.10
Pattin           0.02
Hurdle          -0.10
Splittorf       -0.54
Gura            -0.59
Wathan          -1.23
Gale            -1.71
Brett           -2.05
Washington      -2.09
McRae           -2.15
Porter          -2.36
Leonard         -4.13
Cardenal        -4.16
Quisenberry     -9.73


Chapter 13

Estimation With Selected Binomial Information or Do You Really Believe That Dave Winfield Is Batting .471?

George CASELLA and Roger L. BERGER*

Often sports announcers, particularly in baseball, provide the listener with exaggerated information concerning a player's performance. For example, we may be told that Dave Winfield, a popular baseball player, has hit safely in 8 of his last 17 chances (a batting average of .471). This is biased, or selected, information, as the "17" was chosen to maximize the reported percentage. We model this as observing a maximum success rate of a Bernoulli process and show how to construct the likelihood function for a player's true batting ability. The likelihood function is a high-degree polynomial, but it can be computed exactly. Alternatively, the problem yields to solutions based on either the EM algorithm or Gibbs sampling. Using these techniques, we compute maximum likelihood estimators, Bayes estimators, and associated measures of error. We also show how to approximate the likelihood using a Brownian motion calculation. We find that although constructing good estimators from selected information is difficult, we seem to be able to estimate better than expected, particularly when using prior information. The estimators are illustrated with data from the 1992 Major League Baseball season.

KEY WORDS: Brownian motion; EM algorithm; Gibbs sampling; Selection bias.

1. INTRODUCTION

Sports announcers, in particular baseball announcers, often use hyperbolic descriptions of a player's ability. For example, when Dave Winfield, a popular baseball player, is batting, rather than report his current batting average (number of hits divided by number of at-bats), it might be said that "he's really hitting well these days; he's 8 for his last 17." This is clearly selectively reported data, biased upward from the player's actual average. But with models that take this bias into account, we should be able to use the selectively reported data to recover an estimate of the player's true ability. In this article we explore various methodologies for doing this.

1.1 Background

Research in estimation and modeling from selectively reported data has always been of interest and has many applications other than analyzing baseball data. We will not attempt a thorough literature review here but will describe some general directions that such research has taken. Perhaps the most widespread use of selection-bias methodology is in the area of meta-analysis. Starting from the work of Rosenthal (1979), researchers have worried about the effect of selectively reported data when combining results of different studies, where the selection is mainly through publication bias (publishing only significant studies). These concerns have been summarized and reviewed by Iyengar and Greenhouse (1988) and, more recently, in a trio of papers in Statistical Science (Dear and Begg 1992; Hedges 1992; Mosteller and Chalmers 1992). Cleary (1993) has used these selection models, along with likelihood theory and Gibbs sampling, to construct estimates of effects based on publication-biased data.

* George Casella is Professor, Biometrics Unit, Cornell University, Ithaca, NY 14853. Roger L. Berger is Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695. This research was supported by National Science Foundation Grant DMS9100839 and National Security Agency Grant 90F-073. The authors thank Steve Hirdt of the Elias Sports Bureau for providing detailed data for the 1992 Major League Baseball season. They also thank Marty Wells for numerous conversations concerning the Gibbs/EM algorithm implementation and the editors and referees for many constructive comments on an earlier version of this article.

A Bayesian approach to inference from selected data was taken by Bayarri and DeGroot (1986a,b; 1991). A major lesson to be learned from their work is that the uncorrected maximum likelihood estimator (MLE) can be exceedingly bad. In our baseball data, it is quite obvious that the naive estimate of Winfield's batting ability (8/17 = .471) is vastly incorrect. In other, more complicated situations, this might not be so obvious.

Perhaps the methodology most similar to ours here is that of Dawid and Dickey (1977). They were concerned with the influence of selectively reported data on the likelihood function, and how such influence can be accounted for. In particular, they considered an example where the selectively reported data is the maximum of sums of Bernoulli random variables, an example closely related to our situation. More recently, Carlin and Gelfand (1992) have studied parametric likelihood inference in "record-breaking" data; that is, data that are a series of records, or maxima. They discussed many applications of their models, including sporting events, meteorology, and industrial stress testing. In particular, they modeled an underlying regression that attempts to explain the increasing sequence of means and illustrated their techniques using data on Olympic record high jumps. The selected data that we are concerned with here may be thought of as a special case of "record-breaking" data. But our models and estimation methodologies are different from previous approaches.

1.2 Information

To make inferences from our selected data, we must make some assumptions about the data we see. For example, when we are told that Dave Winfield is 8 for his last 17, we assume that the "17" is chosen because the ratio 8/17 is the maximum ratio of hits to at-bats. There is some hidden information in this number. For example, we know that on his 18th and 19th previous at-bats he did not get a hit; otherwise, the announcer would have reported 9/18 > 8/17 or 9/19 > 8/17.

© 1994 American Statistical Association. Journal of the American Statistical Association, September 1994, Vol. 89, No. 427, Statistics in Sports.


Moreover, our naive estimate of his ability should not be 8/17 = .471, but rather 8/19 = .421, because we know that the two previous at-bats (the 18th and 19th) were failures. More precisely, we assume that a baseball player's sequence of at-bats is a sequence of Bernoulli(θ) random variables, X1, ..., Xn, where the Xi's are in chronological order. After the nth at-bat, the player's batting average is p̂ = Σ_{i=1}^{n} Xi/n. This is the MLE based on observing the entire (complete) data set. But we assume that the data reported are k* hits out of the last m* at-bats, where k* and m* satisfy

Note that the quantity (Xn + Xn-1 + ··· + Xn-i)/(i + 1) is just a player's batting average in the previous i + 1 at-bats. Thus we are assuming only that there is no higher hits-to-at-bats ratio in the unreported data than the reported ratio of r* = k*/m*. The value of r* also tells us the number of at-bats previous to m* that we know to be failures, in that r* < 1/j implies failure on the previous j at-bats. (Here r* = 8/17 < 1/2, which implies failure on the previous two at-bats.) There may be a higher ratio within the last m* at-bats; for example, perhaps the batter was 1 for 1 in his last at-bat. In practice, with the exception of similar trivial cases, r* will usually represent the maximum ratio of hits to at-bats. Also notice that the exact mechanism of choosing m* need not be known; we only need assume that (1) is satisfied.

1.3 Data

We will illustrate the selected data information on data from the 1992 Major League Baseball season. Our first data set is the record of all of Dave Winfield's 1992 at-bats and whether each at-bat resulted in a hit or an out. (Winfield actually made 670 plate appearances in 1992, but 87 of these were not official at-bats, because they resulted in a walk, sacrifice, hit-by-pitch, or other outcome that does not count as an at-bat. Thus Winfield had 583 at-bats.)

Figure 1. The 1992 Hitting Record of Dave Winfield. The complete data MLE (dashed line) is p̂ = ratio of hits to at-bats, the selected maximum (solid line) is r* of (1), and the selected data MLE (dotted line) is θ̂ of Section 3.2.

Figure 2. The 1992 Won-Loss Record of the New York Mets. The complete data MLE (dashed line) is p̂ = ratio of wins to games, the selected maximum (solid line) is r* of (1), and the selected data MLE (dotted line) is θ̂ of Section 3.2.

The data for Dave Winfield are displayed in Figure 1. The dashed line represents his batting average (i.e., ratio of hits to at-bats) for each at-bat. It can be seen that this value settles down quickly and remains close to 169/583 = .290, Winfield's final batting average for the 1992 season. For each at-bat, this ratio is also the MLE given that we have observed the entire sequence of all previous at-bats or, equivalently, that we know the total number of hits up to the given at-bat.

The solid line in Figure 1 is a running sequence of values of r*. For each at-bat, this number is the maximum ratio of hits to at-bats, counting backward in time from the given at-bat. In the calculation of r*, we required m* > 10, which merely serves to eliminate trivial cases (e.g., 1 for his last 1) and smooths out the picture somewhat (eliminating multiple peaks at 2 for 3, 4 for 7, and so on). Thus Figure 1 shows 573 at-bats, starting with at-bat number 11.
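The running value of r* plotted in Figure 1 is straightforward to recompute. The sketch below is our own illustration (the paper's computations were done in Gauss); hits is assumed to be a 0/1 list of at-bat outcomes in chronological order.

```python
# Sketch: the selected maximum r* = k*/m* after the last at-bat in `hits`,
# scanning all backward-looking windows of length at least min_window
# (the paper requires a minimum window of about 10 to rule out trivial cases).

def selected_maximum(hits, min_window=10):
    """Return (k_star, m_star) maximizing hits/at-bats over the last m at-bats."""
    n = len(hits)
    best = None                          # (k*, m*)
    running_hits = 0
    for m in range(1, n + 1):            # m counts back from the most recent at-bat
        running_hits += hits[n - m]
        if m < min_window:
            continue
        # keep the window with the strictly larger ratio (compare k/m without dividing)
        if best is None or running_hits * best[1] > best[0] * m:
            best = (running_hits, m)
    return best

# Example with hypothetical data:
# k_star, m_star = selected_maximum(winfield_1992_hits)   # e.g., (8, 17) gives r* = .471
```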

Finally, the dotted line in Figure 1 is a running plot of the selected data MLE, the MLE of Winfield's batting ability based on observing only k*, m*, and n = at-bat number. This estimator is one of the main objects of investigation in this article and is discussed in detail in later sections.

We also analyze a similar data set comprising the 1992 won-loss record of the New York Mets, pictured in Figure 2. The Mets played 162 games; we show the 152 games from game 11 to game 162. The values of r* (solid line) are somewhat less variable than Winfield's, and again the complete data MLE (dashed line) quickly settles down to the final winning percentage of the Mets, 72/162 = .444. The selected data MLE still remains quite variable, however.

1.4 Summary

In Section 2 we show how to calculate the exact likelihood based on observing the selected information k*, m*, and n. We do the calculations two ways: one method uses an exact combinatoric derivation of the likelihood function, and one method is based on a Gibbs sampling algorithm.


The combinatorial derivation and the resulting likelihood are quite complicated. But an easily implementable (albeit computer-intensive) Gibbs algorithm yields likelihood functions that are virtually identical. In Section 3 we consider maximum likelihood point estimation and estimation of standard errors, and also show how to implement Bayesian estimation via the Gibbs sampler. We also calculate maximum likelihood point estimates in a number of ways, using the combinatorial likelihood, the EM algorithm, and a Gibbs sampling-based approximation, and show how to estimate standard errors for these point estimates.

In Section 4 we adapt methodology from sequential analysis to derive a Brownian motion-based approximation to the likelihood. The approximation also yields remarkably accurate MLE values. A discussion in Section 5 relates all of this methodology back to the baseball data that originally suggested it. Finally, in the Appendix we provide some technical details.

2. LIKELIHOOD CALCULATIONS

In this section we show how to calculate the likelihood function exactly. We use two methods, one based on a combinatorial derivation and one based on Markov chain methods.

Recall that the data X1, ..., Xn are in chronological order. But for selectively reported data like we are considering, it is easier to think in terms of the reversed sequence, looking backward from time n, so we now redefine the data in terms of the reversed sequence. Also, we want to distinguish between the reported data and the unreported data. We define Y = (Y1, ..., Ym*+1) by Yi = Xn-i+1 and Z = (Z1, ..., Zm) by Zi = Xn-m*-i, where m = n - (m* + 1). Thus Y is the reported data (with Y1 being the most recent at-bat, Y2 being the next most recent at-bat, and so on), including Ym*+1 = Xn-m*, which we know to be 0. There are k* 1s in Y. Z is the unreported data. We know that the vector Z satisfies

This is assumption (1), that there is no higher ratio than the reported r* in the unreported data.

The likelihood, given the reported data Y = y, is denoted by L(θ|y). (Generally, random variables will be denoted by uppercase letters and their observed values denoted by the lowercase counterparts.) It is proportional to

where Z* is the set of all vectors z = (z1, ..., zm) that never give a higher ratio than r* [see (2)], and Sz = Σ_{i=1}^{m} zi = number of 1s in the unreported data. Dawid and Dickey (1977) called the factor θ^{k*}(1 - θ)^{m*-k*} the "face-value likelihood" and the remainder of the expression the "correction factor." The correction factor is the correction to the likelihood that is necessary because the data were selectively reported.

2.1 Combinatorial Calculations

An exact expression for the sum over Z* can be given in terms of constants i*, n1, ..., ni*, and c1, ..., ci*, which we now define. Let i* be the largest integer that is less than (n - m*)(1 - r*) and define

where [a] is the greatest integer less than or equal to a. Now define the constants ci recursively by c1 = 1 and

Then (3) can be written as

The equivalence of (3) and (4) is proved in Appendix A. If i* = 0, which will be true if r* is large and n - m* is small, then the sum in (4) is not present and the likelihood is just

The constants ci grow very rapidly as i increases. So if i* is even moderately large, care must be taken in their computation. They can be computed exactly with a symbolic processor, but this can be time-consuming. So we now look at alternate ways of computing the likelihood, and in Section 4 we consider an approximation of L(θ|y).

2.2 Sampling-Based Calculations

As an alternative to the combinatorial approach to calculating L(θ|y), we can implement a sampling-based approach using the Gibbs sampler. We can interpret Equation (3) as stating

where z = (z1, ..., zm) are the unobserved Bernoulli outcomes, Z* is the set of all such possible vectors, and L(θ|y, z) is the likelihood based on the complete data. Equation (5) bears a striking resemblance to the assumed relationship between the "complete data" and "incomplete data" likelihoods for implementation of the EM algorithm and, in fact, can be used in that way. We will later see how to implement an EM algorithm, but first we show how to use (5) to calculate L(θ|y) using the Gibbs sampler.

We assume that L(θ|y) can be normalized in θ; that is, ∫ L(θ|y) dθ < ∞. (This is really not a very restrictive assumption, as most likelihoods seem to have finite integrals. The likelihood L(θ|y) in (3) is the finite sum of terms that each have a finite integral, so the integrability condition is satisfied.) Denote the normalized likelihood by L*(θ|y), so

Because L(θ|y) can be normalized, so can L(θ|y, z) of (5). Denoting that normalized likelihood by L*(θ|y, z), we now can consider both L*(θ|y) and L*(θ|y, z) as density functions in θ.


Finally, from the unnormalized likelihoods, we define

an equation reminiscent of the EM algorithm. The function k(z|y, θ) is a density function that defines the density of Z conditional on y and θ. If we think of z as the "missing data" or, equivalently, think of y as the "incomplete data" and (y, z) as the "complete data," we can use the unnormalized likelihoods in a straightforward implementation of the EM algorithm. But we have more. If we iteratively sample between k(z|y, θ) and L*(θ|y, z) (i.e., sample a sequence z1, θ1, z2, θ2, z3, θ3, ...), then we can approximate the actual normalized likelihood by

with the approximation improving as M → ∞ (see the Appendix for details). Thus we have a sampling-based exact calculation of the true likelihood function.

Note that this sampling-based strategy for calculating the likelihood differs from some other strategies that have been used. The techniques of Geyer and Thompson (1992) for calculating likelihoods are based on a different type of Monte Carlo calculation, one not based on Gibbs sampling. That is also the technique used by Carlin and Gelfand (1992) and Gelfand and Carlin (1991). These approaches do not require an integrable likelihood, as they sample only over z. Here we sample over both z and θ, which is why we require an integrable likelihood. For our extra assumption, we gain the advantage of easy estimation of the entire likelihood function. The technique used here, which closely parallels the implementation of the EM algorithm, was discussed (but not implemented) by Smith and Roberts (1993).

Implementing Equation (8) is quite easy. The likelihood L*(θ|y, z) is the normalized complete data likelihood, so

(recall that Sz = Σ_{i=1}^{m} zi and m = n - m* - 1). Thus to calculate L*(θ|y), we use the following algorithm:

Because L*(θ|y, z) is a beta distribution with parameters k* + Sz + 1 and n - k* - Sz + 1, it is easy to generate the θ's. To generate the z's from k(z|y, θ), the following simple rejection algorithm runs very quickly:

1. Generate z = (z1, ..., zm), zi iid Bernoulli(θ).

2. Calculate Si = (k* + Σ_{j=1}^{i} zj)/(m* + 1 + i), i = 1, ..., m.

3. If Si ≤ r* for every i = 1, ..., m, accept z; otherwise, reject z. (11)

Implementing this algorithm using a 486 DX2 computer with the Gauss programming language is very simple, and the running time is often quite short. The running time was increased only in situations where n is much larger than m*, when more constrained Bernoulli sequences had to be generated. Typical acceptance rates were approximately 60% (for r* = 8/17 and n = 25-500), which, we expect, could be improved with a more sophisticated algorithm.
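For concreteness, here is a small Python sketch of the sampler just described; it is our illustration rather than the authors' Gauss code, and it assumes only numpy for the random draws.

```python
import numpy as np

# Sketch of the Gibbs sampler of Section 2.2 for selected data (k*, m*, n).
# Each sweep draws z from k(z | y, theta) by the rejection algorithm (11), then draws
# theta from the complete-data likelihood L*(theta | y, z) = Beta(k*+Sz+1, n-k*-Sz+1).

def draw_z(theta, k_star, m_star, n, rng):
    """Rejection step (11): return Sz for one accepted vector of unreported at-bats."""
    m = n - m_star - 1                       # number of unreported at-bats
    r_star = k_star / m_star
    while True:
        z = rng.random(m) < theta            # z_i iid Bernoulli(theta)
        s = k_star + np.cumsum(z)            # k* + z_1 + ... + z_i
        if np.all(s / (m_star + 1 + np.arange(1, m + 1)) <= r_star):
            return int(z.sum())              # accept: no higher ratio in unreported data

def gibbs_theta_samples(k_star, m_star, n, sweeps=1000, theta0=0.3, seed=0):
    rng = np.random.default_rng(seed)
    theta, draws = theta0, []
    for _ in range(sweeps):
        s_z = draw_z(theta, k_star, m_star, n, rng)
        theta = rng.beta(k_star + s_z + 1, n - k_star - s_z + 1)
        draws.append(theta)
    return draws    # averaging the Beta(k*+Sz+1, n-k*-Sz+1) densities over sweeps gives (8)

# Example: thetas = gibbs_theta_samples(k_star=8, m_star=17, n=100)
```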

Figure 3 illustrates the Gibbs-sampled likelihoods for M = 1,000, for r* = 8/17 with various values of n. As can be seen, the modes and variances decrease as n increases. If we plot the likelihoods calculated from the combinatorial formula (4), the differences are imperceptible.

3. ESTIMATION

One goal of this article is to assess our ability to recover a good estimate of θ from the selectively reported data. We would be quite happy if our point estimate from the selected data was close to the MLE of the unselected data. But as we shall see, this is generally not the case. Although in some cases we can do reasonably well, only estimation with strong prior information will do well consistently.

3.1 Exact Maximum Likelihood Estimation

Based on the exact likelihood of (4), we can calculate the MLE by finding the zeros of the derivative of L(θ|y). The likelihood is a high-degree polynomial in θ, but symbolic manipulation programs can compute the constants and symbolically differentiate the polynomial. But the zeros must be solved for numerically, as the resulting expressions are too involved for analytical evaluation. In all the examples we have calculated, L(θ|y) is a unimodal function for 0 ≤ θ < 1, and no difficulties were experienced in numerically finding the root.

Figure 3. Likelihood Functions for k* = 8 and m* = 17, for n = 19 (Solid Line), 25 (Long-Dashed Line), 50 (Dotted Line), and 100 (Short-Dashed Line), Normalized to Have Area = 1. The value n = 19 gives the "naive" likelihood.


We have calculated the MLE for several different data sets; results for four values of m* and r* and five values of m are given in Table 1. For each value of m*, r*, and m, two values are given. The first is the exact MLE, computed by the method just described. The second is an approximate MLE, which is discussed in Section 4. Just consider the exact values for now.

The exact MLE's exhibit certain patterns that would be expected for these data:

1. The MLE never exceeds the naive estimate k*/(m* + 1) = m*r*/(m* + 1).

2. For fixed reported data m* and r*, the MLE decreases as m, the amount of unreported data, increases. It appears to approach a nonzero limit as m grows. Knowing that the ratio does not exceed r* in a long sequence of unreported data should lead to a smaller estimate of θ than knowing only that the ratio does not exceed r* in a short sequence.

3. For fixed r* and m, the MLE increases to r* as m*, the amount of reported data, increases.

This method of finding the MLE requires a symbolic manipulation program to calculate the constants ci, or else some careful programming to deal with large factorials. Also, the method can be slow if m is large. The values for m = 200 in Table 1 each took several minutes to calculate. Thus we are led to investigate other methods of evaluating the MLE, methods that do not use direct calculation of L(θ|y). Although these other methods are computationally intensive, they avoid the problem of dealing with the complicated exact likelihood function.

3.2 The EM Algorithm

As in Section 2.2, the incomplete data interpretation of the likelihood function allows for easy implementation of the EM algorithm. With y = incomplete data and (y, z) = complete data, we compute an EM sequence θ1, θ2, ... by

where E(Sz|θi) is the expected number of successes in the missing data. [The E step and the M step are combined into one step in (12).] More precisely, Sz = Σ_{i=1}^{m} Zi, where the Zi are iid Bernoulli(θi) and the partial sums satisfy the restrictions in (11). Such an expected value is virtually impossible to calculate analytically, but is quite easy to approximate using Monte Carlo methods (Wei and Tanner 1990). The resulting sequence θ1, θ2, ... converges to the exact selected data MLE. In all of our calculations, the value of the EM-calculated MLE is indistinguishable from the MLE resulting from (4).
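A Monte Carlo version of this iteration can reuse the rejection sampler from the Gibbs sketch in Section 2.2. The sketch below is our illustration; it assumes the combined update θ_{j+1} = (k* + E(Sz|θ_j))/n implied by the beta form of the complete-data likelihood.

```python
import numpy as np  # draw_z is the rejection sampler from the Gibbs sketch in Section 2.2

# Sketch of the Monte Carlo EM iteration of Section 3.2:
# approximate E(Sz | theta) by averaging constrained draws, then update theta.

def mcem_selected_mle(k_star, m_star, n, iters=50, draws=200, theta0=0.3, seed=1):
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(iters):
        expected_sz = np.mean([draw_z(theta, k_star, m_star, n, rng) for _ in range(draws)])
        theta = (k_star + expected_sz) / n     # E step and M step combined, as in (12)
    return theta

# Example: mcem_selected_mle(8, 17, 100)  # should be close to the exact selected data MLE
```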

3.3 Approximate MLE's from the Gibbs Sampler

Equation (8), which relates the exact likelihood to the average of the complete data likelihoods, forms a basis for a simple approximation scheme for the MLE. Although the average of the maxima is not the maximum of the average, we can use a Taylor series approximation to estimate the incomplete data MLE as a weighted average of the complete data MLE's.

Table 1. Exact (First Entry) and Approximate (Second Entry) MLE's Calculated From the Exact Likelihood (4) and the Approximation Described in (23)

                 m* = 5           m* = 25          m* = 45          m* = 200
 r*     m    Exact  Approx.   Exact  Approx.   Exact  Approx.   Exact  Approx.
1/5     5    .109   .096      .170   .167      .183   .180      .196   .196
1/5    20    .088   .079      .150   .145      .167   .164      .191   .190
1/5    60    .084   .078      .139   .134      .156   .152      .185   .184
1/5   100    .084   .078      .136   .132      .152   .149      .182   .181
1/5   200    .084   .078      .135   .131      .149   .147      .178   .177
2/5     5    .261   .263      .359   .360      .376   .377      .394   .394
2/5    20    .232   .232      .332   .332      .356   .356      .389   .389
2/5    60    .227   .229      .317   .316      .341   .341      .381   .381
2/5   100    .227   .228      .314   .313      .336   .336      .377   .377
2/5   200    .227   .228      .312   .311      .333   .333      .372   .372
3/5     5    .437   .448      .553   .559      .572   .577      .593   .594
3/5    20    .408   .412      .527   .531      .554   .556      .588   .589
3/5    60    .403   .407      .511   .513      .538   .541      .580   .581
3/5   100    .403   .407      .507   .510      .534   .535      .577   .578
3/5   200    .403   .407      .506   .508      .530   .531      .571   .572
4/5     5    .632   .644      .754   .763      .773   .779      .794   .795
4/5    20    .612   .615      .734   .742      .759   .765      .789   .791
4/5    60    .609   .609      .721   .726      .746   .750      .783   .785
4/5   100    .609   .609      .718   .722      .742   .745      .780   .782
4/5   200    .609   .609      .717   .720      .739   .742      .776   .777

An obvious approach is to expand each complete data likelihood in (8) around its MLE, θ̂i = (k* + Szi)/n, to get

because L*′(θ̂i|y, zi) = 0. Now substituting into (8) yields


and differentiating with respect to θ yields the approximate MLE

But it turns out that this approximation is not very accurate. A possible reason for this is the oversimplification of (13), which ignores most of the computed information. In particular, for j ≠ i, the information in θ̂j is not used when expanding L*(θ|y, zi). Thus we modify (13) into a "double" Taylor approximation. We first calculate an average approximation for each L*(θ|y, zi), averaging over all θ̂j, and then average over all L(θ|y, zi). We now approximate the incomplete data likelihood L*(θ|y) with


Differentiating with respect to θ yields the approximate MLE

Table 2 compares the approximate MLE of (14) to the exact value found by differentiating the exact likelihood of (4). It can be seen that the Gibbs approximation is reasonable, but certainly not as accurate as the EM calculation.

3.4 Bayes Estimation

It is relatively easy to incorporate prior information into our estimation techniques, especially when using the Gibbs sampling methodology of Section 2.2. More importantly, in major league baseball there is a wealth of prior information. For any given player or team, the past record is readily available in sources such as the Baseball Encyclopedia (Reichler 1988).

If we assume that there is prior information available in the form of a beta(a, b) distribution, then, analogous to Section 2.2, we have the two full-conditional posterior distributions

and

k(z|y, θ, a, b) = as in (7) and (11),

with Bernoulli parameter θ. (15)

Running a Gibbs sampler on (15) is straightforward, and the posterior distribution of interest is given by

For point estimation, we usually use the posterior mean, given by

By using a beta(1, 1) prior, we get the likelihood function as the posterior distribution. Table 2 also shows the values of this point estimator, and we see that it is a very reasonable estimate.

Because the available prior information in baseball is so good, the Bayes posteriors are extraordinarily good, even though the data are not very informative. Figure 4 shows posterior distributions for the New York Mets using historical values for the prior parameter values. It can be seen that once the prior information is included, the selected MLE produces an excellent posterior estimate.
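With a beta(a, b) prior, only the beta parameters in the sampler change. The sketch below is our illustration of the posterior-mean calculation, reusing draw_z from the Gibbs sketch of Section 2.2; the prior values shown are the Mets' historical parameters quoted in Figure 4.

```python
import numpy as np  # draw_z is the rejection sampler from the Gibbs sketch in Section 2.2

# Sketch of Bayes estimation (Section 3.4) with a beta(a, b) prior.  The posterior mean
# is estimated by averaging the conditional beta means (a + k* + Sz)/(a + b + n).

def bayes_posterior_mean(k_star, m_star, n, a, b, sweeps=2000, theta0=0.4, seed=2):
    rng = np.random.default_rng(seed)
    theta, cond_means = theta0, []
    for _ in range(sweeps):
        s_z = draw_z(theta, k_star, m_star, n, rng)
        theta = rng.beta(a + k_star + s_z, b + n - k_star - s_z)
        cond_means.append((a + k_star + s_z) / (a + b + n))
    return float(np.mean(cond_means))

# Mets-style example from Figure 4 (r* = 8/11 at game 131):
# bayes_posterior_mean(k_star=8, m_star=11, n=131, a=9.957, b=11.967)
```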

3.5 Variance of the Estimates

When using a maximum likelihood estimate θ̂, a common measure of variance is -1/l″(θ̂), where l is the log-likelihood, l = log L. In our situation, where L is expressed as a sum of component likelihoods (8), taking logs is not desirable. But a few simple observations allow us to derive an approximation for the variance.

Because l = log L, it follows that

We have L′(θ̂) = 0 and -l″(θ̂) = -L″(θ̂)/L(θ̂); using (8) yields the approximation

Table 2. Comparison of the Combinatoric MLE [Obtained by Differentiating the Likelihood (4)], the MLE From the EM Algorithm, the Gibbs/Likelihood Approximation of (14), the Bayes Posterior Mean Using a Beta(1, 1) Prior (the Mean Likelihood Estimate), the Brownian Motion-Based Approximation, and p̂ (the Complete Data MLE)

                                Combinatoric    EM     Gibbs approx.   Bayes   Brownian
At-bat   k*   m*    r*    p̂         MLE         MLE        MLE         mean      MLE
 187     55  187  .294  .294       .294        .294       .294         .296     .294
 188      4   11  .364  .298       .241        .240       .232         .237     .241
 189      5   12  .417  .302       .292        .289       .276         .278     .293
 190      5   13  .385  .300       .267        .267       .253         .258     .268
 191      5   11  .455  .304       .321        .322       .303         .299     .324
 339     12   39  .308  .298       .241        .240       .238         .225     .241
 340     47  155  .303  .297       .273        .273       .273         .272     .273
 341     13   41  .317  .299       .251        .251       .250         .243     .251
 342     13   42  .310  .298       .245        .244       .242         .238     .245
 343     14   43  .326  .300       .260        .261       .258         .252     .260
 344     14   44  .318  .299       .254        .254       .246         .247     .254
 345      4   11  .364  .301       .241        .240       .222         .234     .241
 346      5   11  .455  .303       .321        .313       .300         .302     .325

NOTE: The at-bats were chosen from Dave Winfield's 1992 season.


using (9) and (15). Of course, for the selected data the variances are much higher than they would have been had we observed the entire data set. We thus modify (19) to account for the fact that we did not observe n Bernoulli trials, but only m* + 1. Because our calculations are analogous to those in the EM algorithm, we could adjust (19) as in equation (3.26) of Dempster, Laird, and Rubin (1977), where they showed that the ratio of the complete-data variance to incomplete-data variance is given by the derivative of the EM mapping. But in our case we have an even simpler answer, and assume var(θ|y, z)/var(θ|y) = (m* + 1)/n. Thus we modify (19) to

It turns out that this approximation works quite well, better than the "single" Taylor series approximation for the variance of the MLE of θ. But the double Taylor series argument of Section 3.3 results in an improved approximation. Starting from the fact that l″ = [LL″ - (L′)²]/L², we write

Table 3 compares the approximation in (20) to values obtained by calculating l″(θ̂) exactly (using a symbolic processor). It can be seen that the approximation is quite good for moderate to large m* and still is acceptable for small m*.

We can also compute Bayesian variance estimates. Proceeding analogously to Section 3.4, the Bayesian variance would be an average of beta variances,

var(θ|y, a, b)

But as before, we must adjust this variance to account for the fact that we observe only m* + 1 trials. We do this adjustment by replacing the term (n + a + b + 1) by (m* + 1 + a + b + 1). The resulting estimate behaves quite reasonably, yielding estimates close to the exact values for moderate m* and a = b = 1. These values are also displayed in Table 3.

As expected, the standard error is sensitive to m*, yielding large limits when m* is small. Figures 5 and 6 show these limits for the 1992 season of Dave Winfield and the Mets. Although the estimates and standard errors are quite variable, note that the true batting average and winning percentage are always within one standard deviation of the selected MLE.

Figure 4. Posterior Distributions for the New York Mets. The value r* = 8/11 = .727 was actually achieved at games 19 and 131. The solid lines are posterior distributions based on beta priors with parameters a = 9.957 and b = 11.967 (representing a mean of .454 and a standard deviation of .107, which are the Mets' overall past parameters), and are based on n = 19 and 131 observations, with modes decreasing in n. The short-dashed lines are posteriors using a beta(1, 1) prior with n = 19 and 131, and hence are likelihood functions, again with modes decreasing in n. The long-dashed line is the naive likelihood, which assumes n = 12.

4. BROWNIAN MOTION APPROXIMATION TO THE LIKELIHOOD

The last terms in the expressions (3) and (4),

are complicated to compute. But we can approximate these terms with functions derived from a consideration of Brownian motion. The resulting approximate likelihood can then be maximized to find an approximate MLE.

This was done for the data in Table 1. The second entry in each case is the approximate MLE. It can be seen that the approximate MLE's are excellent. In the 48 cases with m ≥ 60, the approximate and exact MLE's never differ by more than .006. In fact, even for the smaller values of m, the exact and approximate MLE's are very close. In only three cases, all with m = 5, do the two differ by more than .01.

To develop our approximation, note from Appendix A that the expression in (22) is equal to Pθ(Sj/(j + 1) ≤ r* for j = 1, ..., m), where Z1, Z2, ... are independent Bernoulli(θ) random variables and Sj = Σ_{i=1}^{j} Zi. We rewrite the inequality Sj/(j + 1) ≤ r* as

Now the vector (S*1, ..., S*m) has the same means, variances, and covariances as (W(1), ..., W(m)), where W(t) is standard (mean 0 and variance 1) Brownian motion.


Table 3. Comparison of Standard Deviations Based on Exact Differentiation of the Log-Likelihood, the Gibbs/Likelihood Approximation of (20), and the Bayes Posterior Standard Deviation Using a Beta(1, 1) Prior

                                        Standard deviation
At-bat   k*   m*    p̂    MLE      Exact   Gibbs/approximate   Bayes
 187     55  187  .294  .294      .033         .033            .033
 188      4   11  .298  .241      .083         .121            .117
 189      5   12  .302  .292      .086         .122            .118
 190      5   13  .300  .267      .080         .115            .112
 191      5   11  .304  .321      .093         .129            .125
 339     12   39  .298  .241      .046         .067            .065
 340     47  155  .297  .273      .029         .036            .035
 341     13   41  .299  .251      .046         .067            .065
 342     13   42  .298  .245      .045         .065            .064
 343     14   43  .300  .260      .046         .066            .064
 344     14   44  .299  .254      .045         .064            .063
 345      4   11  .301  .241      .083         .119            .116
 346      5   11  .303  .321      .093         .131            .126

NOTE: The at-bats were chosen from Dave Winfield's 1992 season.

So we can approximate

by

If we define τ as the first passage time of W(t) through the linear boundary bθ + ηθt (i.e., τ = inf{t: W(t) ≥ bθ + ηθt}), then

where Φ is the standard normal cdf and the last equality is from (3.15) of Siegmund (1985).

Because S*j is a discrete process, the first time that S*j ≥ bθ + ηθj it will in fact exceed the boundary by a positive amount. Also, W(t) may exceed bθ + ηθt for some 0 < t < m, even if (W(1), ..., W(m)) does not. So the probability we want is in fact larger than the approximation. Siegmund (1985, p. 50) suggests that this approximation will be improved if bθ is replaced by bθ + ρ, where ρ is an appropriately chosen constant. By trial and error, we found that ρ = .85 produced good approximate MLE's. Thus, to obtain the approximate MLE's in Table 1, Expression (22) was replaced by

in (4), and the approximate likelihood was numerically maximized. (The zero of the derivative was found using a symbolic manipulation program, just as the exact MLE's were found.)

Figure 5. The 1992 Hitting Record of Dave Winfield. The dashed line is p̂, the complete data MLE, and the solid line is θ̂, the selected data MLE. The standard deviation limits (dotted lines) are based on the selected data MLE.

5. DISCUSSION

If we are told that Dave Winfield is "8 for his last 17," the somewhat unhappy conclusion is that really not very much information is being given. But the somewhat surprising observation is that there is some information. Although we cannot hope to recover the complete data MLE with any degree of accuracy, we see in Figures 5 and 6 that ±2 standard deviations of the selected data MLE always contains the complete data MLE. Indeed, in almost every case the complete data MLE is within one standard deviation of the selected data MLE.

Figure 6. The 1992 Won-Loss Record of the New York Mets. The dashed line is p̂, the complete data MLE, and the solid line is θ̂, the selected data MLE. The standard deviation limits (dotted lines) are based on the selected data MLE.


Moreover, the selected data estimates behave as expected. In particular, as n (the number of either games or at-bats) increases, the ratio 8 for 17 looks worse; that is, it results in a smaller value of the selected MLE. This is as it should be, because for a given success probability, longer strings (larger values of n) will produce larger maxima. Also, due mainly to the method of construction, the standard deviation of the selected MLE directly reflects the amount of information that it contains, through the ratio m*/n.

Baseball is a sport well known for its accumulation of data. This readily translates into an enormous amount of prior information that can be used for estimation. In Figure 4 we saw how the New York Mets' prior information completely overwhelms the selected data (and produces very good estimates). This is in fact not an extreme case. If a picture similar to Figure 4 is constructed for Dave Winfield (with prior mean .285 and standard deviation .021), the resulting posterior is virtually a spike, no matter what data are used.

Throughout this article we have assumed that the observed selected data consist of k*, m*, and n. But typically the value of n is not reported, so the data are really only k* and m*. During the baseball season, it is quite easy to estimate n, especially for ballplayers who play regularly. Moreover, once n reaches a moderate value, its value has very little effect on that of the selected data MLE. For example, for an everyday player we expect n ≈ 100 by May, so the value of n will have little effect on MLE's based on m* ≈ 20. This is evident in Table 1, in the likelihood functions of Figure 3, and also in Table 4, which explores some limiting behavior of the MLE. Although we do not know the exact expression for the limit as n → ∞, two points are evident. Besides the fact that the effect of n diminishes as n grows, it is clear that r* = 8/17 and r* = 16/34 have different limits. Thus much more information is contained in the pair (k*, m*) than in the single number r*.

Finally, we report an observation that a colleague (Chuck McCulloch) made when looking at Figure 1, an observation that may have interest for the baseball fan. When m* and n are close together, a number of things occur. First, the selected data and complete data MLE are close; second, the selected data standard deviation is smallest. Thus the selected data MLE is a very good estimator. But McCulloch's observation is that when r* ≈ p̂ (i.e., the complete data MLE), which usually implies m* ≈ n, then a baseball player is in a batting slump (i.e., his current batting average is his maximum success ratio). This definition of a slump is based only on the player's relative performance, relative to his own "true" ability. A major drawback of our current notion of a slump is that it is usually based on some absolute measure of hitting ability, making it more likely that a .225 hitter, rather than a .300 hitter, would appear to be in a slump. (If a player is 1 for his last 10, is he in a slump? The answer depends on how good a hitter he actually is. For example, Tony Gwynn's slump could be Charlie O'Brien's hot streak!) If we examine Figure 1, Dave Winfield was in a bad slump during at-bats 156-187 (he was 5 for 31 = .161) and at-bats 360-377 (3 for 17 = .176), for in both cases his maximum hitting ability, r*, is virtually equal to the MLE's. Similar observations can be made for Figure 2 and the Mets, particularly for games 45-70 (although many would say the New York Mets' entire 1992 season was a slump!).

Table 4. Limiting Behavior of the MLE for Fixed r* = .471, as n → ∞

                        k*/m*
   n      8/17    16/34    64/136    128/272
   25     .395
   50     .368    .418
   75     .360    .401
  100     .359    .396
  150     .358    .392     .456
  200     .358    .391     .442
  300     .358    .391     .435      .460
  400     .357    .390     .431      .451
  500     .357    .390     .431      .447
 1000     .357    .390     .429      .442

But the message is clear: You are in a slump if your complete data MLE is equal to your selected data MLE, for then your maximum hitting (or winning) ability is equal to your average ability.
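McCulloch's slump criterion can be checked mechanically from the at-bat record; the sketch below is our illustration, reusing selected_maximum from the sketch in Section 1.3.

```python
# Sketch: flag at-bats at which the selected maximum r* essentially equals the complete
# data MLE p-hat, McCulloch's definition of a slump (selected_maximum is from Section 1.3).

def slump_at_bats(hits, min_window=10, tol=0.005):
    flagged = []
    for n in range(min_window, len(hits) + 1):
        k_star, m_star = selected_maximum(hits[:n], min_window)
        p_hat = sum(hits[:n]) / n
        if abs(k_star / m_star - p_hat) <= tol:
            flagged.append(n)
    return flagged
```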

APPENDIX A: DERIVATION OF COMBINATORIAL FORMULA FOR LIKELIHOOD

In this section we derive the exact expressions (3) and (4) for L(θ|y). The reported data are k* successes in the last m* trials. We have not specified exactly how this report was determined. In the baseball example, we do not know exactly how the announcer decided to report "k* out of m*." But what we have assumed is that the complete data (y, z) consist of a vector y, with k* 1s and m* + 1 - k* 0s, and a vector z ∈ Z*, which satisfies (2). The likelihood is then

where the sum is over all (y, z) that give the reported data, and Σi yi + Σi zi = k* + Sz. We have not specified exactly what all the possible y vectors are, but for each possible y, z can be any element in Z*. Thus if C is the number of possible y vectors, then the likelihood is

Dropping the constant C, which is unimportant for likelihood analysis, yields (3).

Let Z denote the set of all sequences (z1, ..., zm) of length m of 0s and 1s. Then the sum in (A.1) is

This sum over Z*c is the sum that appears in (4), as we now explain. The set Z* is the set of all z's that satisfy (2). Let Sj = Σ_{i=1}^{j} zi.

Then Z*c is the set of all z's that satisfy

for some j = 1, ..., m.

But (k* + Sj)/(m* + 1 + j) > k*/m* if and only if Sj/(j + 1) > r*. So Z*c is the set of all z's that satisfy


Now to complete our derivation of (4), we must show that the sums in (A.2) and (4) are equal; that is,

To show this, we must explain what the constants i*, ni, and ci are.

First, the value of i* - 1 is the maximum number of 0s that can occur in (z1, ..., zj) if Sj/(j + 1) > r*. This is because Sj/(j + 1) > r* if and only if Sj > r*(j + 1), that is, j - Sj < j - r*(j + 1), and hence

j - Sj < j - r*(j + 1) = j(1 - r*) - r* ≤ m(1 - r*) - r*
= (n - m* - 1)(1 - r*) - r*
= (n - m*)(1 - r*) - 1.

Next, suppose that when Sj/(j + 1) first exceeds r*, the number of 0s in (z1, ..., zj) is j - Sj = j′ - 1. Then this must happen on trial j = nj′, because if on trial nj′ we have nj′ - Snj′ = j′ - 1, then

But if j - Sj = j′ - 1 and j < nj′, then

To compute the sum in (A.3), we partition Z*c into sets Z0, ..., Zi*-1, where Zi is the set of z's such that (z1, ..., zj) contains exactly i 0s if Sj/(j + 1) is the first term to exceed r*; that is,

Zi = {z: j - Sj = i at the j where Sj/(j + 1) > r* and Sl/(l + 1) ≤ r* for 1 ≤ l < j}.

If z ∈ Zi, then in fact j = ni+1 from our foregoing argument. Let (z1, ..., zni+1) be a sequence such that Sni+1/(ni+1 + 1) > r* and Sl/(l + 1) ≤ r* for 1 ≤ l < ni+1. [So the vector (z1, ..., zni+1) contains i 0s and ni+1 - i 1s.] This initial sequence can be completed in any way to produce a z ∈ Zi. The sum of θ^Sz(1 - θ)^(m-Sz) over all z's with this initial sequence is θ^(ni+1-i)(1 - θ)^i, because the sum over all the parts that could be added to this initial sequence is 1. Note that we get the same value θ^(ni+1-i)(1 - θ)^i regardless of which initial sequence we choose. So if ci+1 is the number of different initial sequences that could form z's in Zi, then

which is Equation (A.3). It remains only to verify that the formula in Section 2.1 is the correct formula for ci. The value of c1 is the number of initial sequences with 1 - 1 = 0 0s. Of course, c1 = 1, as defined. Suppose c1, ..., ci are correctly defined. Then we will show that the formula

from Section 2.1, is correct. The value of the binomial coefficient (ni+1 - 1 choose i) is the number of all sequences (z1, ..., zni+1) that end in 1 and have exactly i 0s. From this we must subtract those sequences for which Sj/(j + 1) > r* for some j < ni+1. If Sj/(j + 1) > r* for the first time at j′, and if j′ - Sj′ = i′, then j′ must equal ni′+1. Among all sequences (z1, ..., zni+1) that end in 1 and have exactly i 0s, there are ci′+1 (ni+1 - 1 - ni′+1 choose i - i′) that first exceed r* at ni′+1 with i′ 0s. The value of ci′+1 is the number of initial sequences (z1, ..., zni′+1), and the combinatorial term is the number of sequences (zni′+1+1, ..., zni+1-1) containing the remaining i - i′ 0s. Summing these terms for i′ = 0, ..., i - 1 and changing the summation index to j = i′ + 1 yields the sum in (A.4). Thus the formula for c1, ..., ci* is correct.

APPENDIX B: CALCULATING LIKELIHOODS WITH GIBBS SAMPLING

Gibbs sampling calculation of likelihood functions actually involves a mixture of some EM algorithm ideas (Dempster et al. 1977) and an implementation of successive substitution sampling (Gelfand and Smith 1990).

As in the EM algorithm, we start with L(θ|y) as the likelihood of interest, based on the "incomplete" (but observed) data y. The augmented data are denoted by z, yielding the complete data likelihood L(θ|y, z) that satisfies

The set Z* may be quite complicated, taking into account all the restrictions imposed on the incomplete data likelihood. But it is often the case that we will be able to sample from this set.

Now normalize both likelihoods (as in Sec. 2.2) to L*(θ|y) and L*(θ|y, z) and define k(z|y, θ) as in (7). Then

To verify Equation (A.6), write

by interchanging the order of the sum and integral and noting that k(z|y, θ′)L*(θ′|y) = L*(θ′|y, z). Now the integral on the right side of (A.7) is equal to 1, and the remaining sum is just (A.5). Thus from Equation (A.6) and the results of Gelfand and Smith (1990), we can calculate L*(θ|y) by successively sampling from L*(θ|y, z) and k(z|y, θ).

Note that the implementation of the Gibbs sampler is a frequentist implementation, relying on the finiteness of the integral ∫ L*(θ|y) dθ. We can, however, interpret the finiteness of this integral as using a flat prior for θ; that is, π(θ) = 1. With the additional "parameter" z, we then have the two full posterior distributions π(θ|y, z) (= L*(θ|y, z)) and π(z|y, θ) (= k(z|y, θ)). Sampling from these densities will yield a sample that is (approximately) from the marginal posterior π(θ|y) (= L*(θ|y)).

[Received April 1993. Revised November 1993.]

REFERENCES

Bayarri, M. J., and DeGroot, M. (1986a), "Bayesian Analysis of Selection Models," Technical Report 365, Carnegie-Mellon University, Dept. of Statistics.

Bayarri, M. J., and DeGroot, M. (1986b), "Information in Selection Models," Technical Report 368, Carnegie-Mellon University, Dept. of Statistics.

Bayarri, M. J., and DeGroot, M. (1991), "The Analysis of Published Significant Results," Technical Report 91-21, Purdue University, Dept. of Statistics.

Carlin, B. P., and Gelfand, A. E. (1992), "Parameter Likelihood Inference for Record-Breaking Problems," technical report, University of Minnesota, Division of Biostatistics.


Cleary, R. J. (1993), "Models for Selection Bias in Meta-analysis," Ph.D. thesis, Cornell University, Biometrics Unit.

Dawid, A. P., and Dickey, J. M. (1977), "Likelihood and Bayesian Inference From Selectively Reported Data," Journal of the American Statistical Association, 72, 845-850.

Dear, K. B. G., and Begg, C. B. (1992), "An Approach for Assessing Publication Bias Prior to Performing a Meta-Analysis," Statistical Science, 7, 237-245.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data Via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1-37.

Gelfand, A. E., and Carlin, B. P. (1991), "Maximum Likelihood Estimation for Constrained or Missing Data Models," technical report, University of Minnesota, Division of Biostatistics.

Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Geyer, C. J., and Thompson, E. A. (1992), "Constrained Monte Carlo Maximum Likelihood for Dependent Data" (with discussion), Journal of the Royal Statistical Society, Ser. B, 54, 657-699.

Hedges, L. V. (1992), "Modeling Publication Selection Effects in Meta-Analysis," Statistical Science, 7, 246-255.

Iyengar, S., and Greenhouse, J. B. (1988), "Selection Models and the File Drawer Problem," Statistical Science, 3, 109-135.

Mosteller, F., and Chalmers, T. C. (1992), "Some Progress and Problems in Meta-Analysis of Clinical Trials," Statistical Science, 7, 227-236.

Reichler, J. L. (ed.) (1988), Baseball Encyclopedia (7th ed.), New York: Macmillan.

Rosenthal, R. (1979), "The 'File Drawer' Problem and Tolerance for Null Results," Psychological Bulletin, 86, 638-641.

Siegmund, D. (1985), Sequential Analysis, New York: Springer-Verlag.

Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation Via a Gibbs Sampler and Related Markov Chain Monte Carlo Methods" (with discussion), Journal of the Royal Statistical Society, Ser. B, 55, 3-24.

Wei, G. C. G., and Tanner, M. A. (1990), "A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms," Journal of the American Statistical Association, 85, 699-704.


Chapter 14

Good pitchers make no-hitters happen, but poor-hitting teams aren't especially vulnerable.

Baseball: Pitching No-Hitters

Cliff Frohlich

On August 11, 1991, as I watched from my seat in the third row behind the Oriole dugout at Baltimore's Memorial Stadium, Wilson Alvarez of the Chicago White Sox pitched a no-hitter, that is, in the entire nine-inning game no Oriole batter reached base except on walks and one Chicago error. In interviews after the game, Alvarez humbly gave credit for his performance to his catcher, who had chosen the type and location of pitches thrown to each batter, and to his center fielder, who had made a truly remarkable play late in the game, preventing a seemingly sure hit.

Alvarez's performance that day raises several questions about what factors are responsible for the occurrence of no-hitters. How important is the pitcher's talent and experience? Alvarez seemed an unlikely candidate for a no-hitter because this was only his second major league game, and he was not even listed in my program. In his only previous game he had failed to retire a single batter and had given up two home runs. How important is the talent of the hitting team? Alvarez's victim, the Baltimore Orioles, had the third worst winning percentage of all major league teams in 1991. How often do no-hitters come about because of questionable decisions by the official scorer, a newspaper reporter who, on each play, decides whether it was the batter's hit or a fielder's error that allowed the batter to reach base? In the seventh inning of Alvarez's game, Cal Ripken bunted and reached first safely on a close play, yet the official scorer ruled an error because the Chicago catcher had "thrown poorly." Finally, was something different about baseball in 1991 that made no-hitters especially likely? Alvarez's performance was 1 of 16 no-hitters pitched in the 1990 and 1991 seasons, the most ever pitched in any 2-year period in baseball history. There were none in 1989, two in 1992, and three in 1993. Why were there so many in 1990-1991? More generally, who pitches no-hitters, against whom, and what can we learn about why they happen?

A Probability Model for Hits/Game

Let's construct a simple probability (SP) model for the distribution of hits/game in nine-inning games. In the model we make assumptions that every student of baseball knows aren't realistic; despite this, the model provides an excellent starting point for the evaluation of no-hitters. When batters face pitchers, we assume that p is the probability that each batter makes a hit and q = 1 − p is the probability that each batter

103

Page 115: Anthology of Statistics in Sports

Chapter 14 Baseball: Pitching No-Hitters

makes an out. We don't count walks, errors, and the like because under this simple model they affect neither the number of hits nor the number of outs. A game is just a sampling regime from which we sample batters until there are 27 outs (9 innings). This is just like drawing red balls (hits) and white balls (outs) from a hat until we have accumulated 27 white balls. Thus, the distribution of hits/game will be given by the negative binomial formula (see sidebar) and the probability a team will have no hits in a game is (1 − p)^27. This model assumes that p is the same for all batters and pitchers and that serial opportunities to "draw" hits or outs are independent events (see sidebar).

Does the SP model correctly predict the incidence of no-hit games? All the available data concerning no-hit or low-hit games (Fig. 1) indicate that the SP model generally underestimates the incidence of low-hit games. For example, in 1993, 28 major league teams each started 162 games, for a total of 4,536 games. Major league pitchers averaged 9.1 hits/9 innings, which corresponds to p = 9.1/(27 + 9.1) = .252. Thus, the predicted number of no-hitters is 4,536(1 − .252)^27 = 1.8. There were three no-hitters during the 1993 season. Now, 1.8 and 3.0 aren't very different. But, if we perform this same calculation for each year for the American and National Leagues using information reported in The Baseball Encyclopedia about no-hitters, hits/9 innings, and games played, we predict that there should have been 135 no-hit games since 1900, but, in fact, 202 have occurred. Thus, the SP model only explains about two-thirds of the no-hit games observed. What is responsible for the 67 "extra" no-hitters?
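The SP calculation is easy to reproduce. The short Python sketch below is not part of the original article; the function name p_hits and the use of Python are mine. It evaluates the negative binomial probability of H hits in a nine-inning game and repeats the 1993 back-of-the-envelope estimate.

    from math import comb

    def p_hits(h, p):
        """SP model: probability that one team gets exactly h hits before its 27th out."""
        return comb(h + 26, h) * p**h * (1 - p)**27

    p = 9.1 / (27 + 9.1)            # 1993: 9.1 hits per 9 innings
    print(p_hits(0, p))             # P(no-hitter) for one team in one game, about 4e-4
    print(4536 * p_hits(0, p))      # expected number of 1993 no-hitters, about 1.8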

An obvious oversimplification in the SP model is that it assumes implicitly that the performance of pitchers, batters, and official scorers is the same throughout a

Three Models for Hitting

If the probability of getting a hit is p and the probability of making an out is q = 1 − p, then under the simple probability (SP) model the probability that a team gets exactly H hits in a nine-inning game is

P(H hits) = [(H + 26)!/(H! 26!)] p^H q^27.

Here the coefficient containing factorials accounts for the possible combinations of hits and outs, given that the last batter always makes an out. For the 19,385 nine-inning major league games occurring between 1989 and 1993 the distribution predicted by this simple probability (SP) model is generally similar to the observed distribution (Fig. 1). The SP model, however, predicts fewer than half of the number of no-hitters observed (Table 1). It also predicts lower-than-observed numbers for games with 9 or fewer hits and higher-than-observed numbers for games with 10 or more hits.

How does variability in the abilities of pitchers affect the distribution of hits/game? Suppose that not all pitchers are equal in their ability to get batters out or that the ability of individual pitchers varies from one game to another. In this variable pitcher (VP) model, p is not a constant but has a central value p0 and varies such that the hits/9 innings for individual pitchers are normally distributed around h0 = 27p0/(1 − p0) with standard deviation σ (Fig. 2). Then the overall distribution of hits/game is obtained by averaging the SP distribution over this normal distribution of hits/9 innings, with p = h/(27 + h). We call this a VP(σ=1.0) model if σ is 1.0 hits/9 innings, a VP(σ=1.5) model if σ is 1.5 hits/9 innings, and so on. To find the distribution, the integral is evaluated numerically over values of h within 4σ of the central value h0.

How does batter variability affect the incidence of no-hitters? In the variable batter (VB) model, we simulate games on the computer, assuming that in each game the values of p for individual batters are normally distributed with standard deviation σ. Before each "game" the computer randomly selects hitting averages p_i for batters i = 1, 2, ..., 9 from the assumed distribution. Then, each time batter i is "up" we generate a random Bernoulli trial with the probability of a hit equal to p_i and the probability of an out equal to 1 − p_i. The game ends when there are 27 outs. If hitting probabilities are normally distributed with mean p0 and standard deviation σ, then we call this a VB(σ) model. Because our computer simulation is stochastic, for each p0 and σ we simulate a large number of games N (4×10^7 for Table 1) and find the number N_H having H hits. Then, we estimate the probability of pitching a k-hitter as N_k/N.

The SP, VP, and VB models are identical if σ is 0.0. The VP(σ=1.0) and VB(σ=.050) models correctly predict more games with 0 and 1 hits than the SP model. All three models, however, predict more games than observed with about 11 and more hits. Presumably this is because baseball managers are more likely to change pitchers in games in which there are many hits.

A Test of Independence

Are batting outs independent? Or are hits bunched together so that strings of two or more hits are more common than they would be if hits were independent? To test this, between 1990 and 1993 I kept records of how often strings of 1, 2, 3, ... hits occurred in the 119 games I attended, played by 27 major league teams at 20 different stadiums (Table 2). If there are N_h hits, and outs and hits are independent, then the number N_k of strings of hits of length k should be

N_k = N_h p^(k−1) (1 − p)^2,

where p = (number of hits)/(number of hits + outs). The data agree remarkably well with this model, supporting the assumption of event independence.



game and from game to game. We now consider how variability affects the results of the SP model.

Pitching

We can certainly attribute some of the "extra" no-hitters to the fact that not all pitchers are of equal ability and that the performance of each individual pitcher varies from day to day. Using data compiled from annual summaries of innings pitched and hits allowed, Fig. 2 demonstrates that the distribution of hits/9 innings for starting pitchers is approximately normal with a standard deviation of about 1.0 hits/9 innings. This represents the variation among the season-long performance of pitchers. Of course, the performance of each individual varies on a day-to-day basis; thus the effective standard deviation of the random quantity hits/9 innings is presumably somewhat higher than 1.0.

To evaluate the effect of variable pitching ability on the incidence of no-hitters, I construct a variable pitcher (VP) model by allowing p in the SP model to vary so that hits/9 innings has a normal distribution. I call this a VP(σ=1.0) model if the standard deviation of hits/9 innings is 1.0, a VP(σ=1.5) model if the standard deviation is 1.5, and so on. For games since 1900, the VP(σ=1.0) model predicts 181 no-hitters (see Table 1). Thus, apparently at least two-thirds of the 67 "extra" no-hitters occur just because all major league pitchers are not of equal ability. A VP(σ=1.5) model predicts 264 no-hitters, significantly more than actually occurred.
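A rough numerical version of the VP calculation can be written in a few lines. The sketch below is mine, not the author's code; it simply averages the SP no-hitter probability over a normal distribution of hits/9 innings on a grid, with a central value of 8.7 hits/9 innings used for illustration.

    from math import exp, pi, sqrt

    def sp_no_hitter(p):
        return (1 - p)**27

    def vp_no_hitter(h0=8.7, sigma=1.0, steps=400):
        """Average the SP no-hitter probability over hits/9 innings ~ Normal(h0, sigma)."""
        lo, hi = h0 - 4 * sigma, h0 + 4 * sigma
        dh = (hi - lo) / steps
        total = 0.0
        for i in range(steps):
            h = lo + (i + 0.5) * dh                        # midpoint of the grid cell
            w = exp(-(h - h0)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
            total += w * sp_no_hitter(h / (27 + h)) * dh   # convert hits/9 innings to p
        return total

    print(sp_no_hitter(8.7 / (27 + 8.7)))   # SP rate at the central value
    print(vp_no_hitter(sigma=1.0))          # VP(sigma=1.0): noticeably larger
    print(vp_no_hitter(sigma=1.5))          # VP(sigma=1.5): larger still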

Are no-hitters pitched mostly by "average" pitchers having good days or by elite pitchers like Nolan Ryan? Nolan Ryan pitched seven no-hitters during his career; Sandy Koufax pitched four; in fact, Ryan and Koufax are the two pitchers with the lowest career average of hits allowed per nine

Figure 1. Distribution of hits/game in nine-inning major league games occurring between 1989 and 1993 (histogram) and distribution predicted by the simple probability (SP) model (continuous line), assuming p = .2482.

Figure 2. Distribution of hits/9 innings reported annually for all pitchers who pitched 100 or more innings between 1989 and 1993. Bin width is 0.2 hits/9 innings; mean and standard deviation for the 1989-1993 data are 8.72 ± 1.10 hits/9 innings.


innings. Indeed, since 1900, 48 no-hitters, or 23% of the total, belong to pitchers who are among the top 60 pitchers with respect to this statistic. Thus, the top pitchers really are responsible for a substantial fraction of no-hitters.

It seems they also may get more than their fair share. Ryan averaged 6.56 hits/9 innings and started 773 games over his 27-year career; 6.56 hits/9 innings corresponds to p = .196. Now, 773(1 − .196)^27 = 2.2 games, so the SP model estimates that Ryan would have pitched two no-hitters. If we perform this calculation for each of the top 60 pitchers on the all-time hits/game list, we estimate that they should have pitched 24 no-hitters, exactly half of the number observed. A VP(σ=1.0) model predicts the top 60 should have pitched 33 no-hitters.

Incidentally, an analysis of seasonal hits/9 innings data since 1900 for individual no-hit pitchers indicates that they are generally having superior seasons. In particular, during the season that they perform no-hitters, their hits/9 innings rate averages 0.82 less than the league average. Only one-quarter of all no-hitters are by pitchers with seasonal hits/9 innings rates exceeding the league average.

Batting

Are no-hitters pitched mostly against poor teams or weak-hitting teams? No, definitely not. On the average, since 1900, teams that lost no-hitters actually won 49.2% of their games during those very seasons, and their seasonal batting average was only .003 lower than the league average.

Batters seem to hit better in home games than on the road; thus, we might expect that no-hitters are more likely to be pitched by the home team. For example, between 1990 and 1993, home batting averages in the major leagues were .0065 higher than away averages. If p is .2465 on the road and .2400 at home, the SP model predicts 4.80 no-hitters/10,000 games away and 6.05 no-hitters/10,000 games at home. This amounts to 56% of all no-hitters occurring at home. The data do, indeed, indicate that more no-hitters are pitched by the home team, and this was especially true before about 1980. Between 1900 and 1979, 113 of 168 (67%) no-hitters occurred at home; since 1980, 18 of 34 (53%) were at home.

How much is the incidence of no-hitters influenced by the fact that individual batters differ in their hitting ability? The SP and VP models assume that within any game there is no variation in batter hitting ability. Yet one could never pitch a no-hitter against any team if one batter had an average of 1.000, even if all the other batters were 0.000 hitters, combining


to form an average p of .111. Clearly, batter variability is a factor, at least in extreme cases.

To investigate this, we conduct simulated games on the computer using a variable batter (VB) model, assuming that p for individual batters is normally distributed. To find a reasonable value for the standard deviation, we use individual batting averages and at-bats for the 1992 season reported in the American League Red Book and the National League Green Book. The results are that σ is .027 and .048 for the American League and National League, respectively, with the difference occurring because pitchers (who are notoriously weak hitters) bat in the National League. If pitchers are excluded, σ for the National and American Leagues is about the same.

As expected, the VB model does produce more no-hitters than the SP model. For a σ of .025, however, the increase is only about 2.5%, and for a σ of .050, the increase is 11.9%. I conclude that individual batter variability is less important than pitcher variability; that is, batter variability increases the no-hitter rate no more than about .5 no-hitters/10,000 games above the rate predicted by the SP model.
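The VB simulation can be sketched directly. The code below is an illustrative Monte Carlo version, mine rather than the author's program; the number of simulated games and the seed are arbitrary, and negative sampled averages are simply truncated at zero.

    import random

    def vb_no_hitter_rate(p0=0.25, sigma=0.05, n_games=100000, seed=7):
        """Fraction of simulated games in which a lineup with batter-to-batter
        variation in hitting ability is held hitless."""
        rng = random.Random(seed)
        no_hitters = 0
        for _ in range(n_games):
            lineup = [max(0.0, rng.gauss(p0, sigma)) for _ in range(9)]
            outs, hits, batter = 0, 0, 0
            while outs < 27:
                if rng.random() < lineup[batter % 9]:
                    hits += 1
                else:
                    outs += 1
                batter += 1
            if hits == 0:
                no_hitters += 1
        return no_hitters / n_games

    print(vb_no_hitter_rate(sigma=0.0))     # reproduces the SP rate
    print(vb_no_hitter_rate(sigma=0.05))    # a modest increase, as described above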

Scoring

Official scorers may be responsible for some no-hitters by calling close plays "errors" rather than hits. Newspaper accounts suggest that scoring bias clearly is an influence in some no-hitters. Official scorers are supplied by the home team, which may also explain why home teams pitch no-hitters more often than visiting teams. For example, in no-hitters recorded by Addie Joss in 1910, Ernie Koob in 1917, and Virgil Trucks in 1952, plays originally scored as hits were changed to errors six or more innings after they occurred. Presumably there are fewer such blatant scoring changes in the modern era, when most games are televised both at home and away and scorers have the benefit (or burden) of instant replay. Indeed, this may explain why the proportion of no-hitters pitched at home is lower since 1980 than previously.

Nevertheless, scoring a play as a hit or an error is subjective, and even today my own subjective observation is that scorers are more likely to score plays as errors if there are no hits in the game, especially after about the sixth inning. There are also clear differences among stadiums with regard to what is ruled an error. For example, there were few infield errors scored in three games I attended in 1993 at Mile High Stadium in Denver; rather, it appeared that the scorer ruled a "hit" on all ground balls if the batter reached base, regardless of how badly the infielder muffed the play.

How can we estimate how many one-hitters might be turned into no-hitters because of questionable scoring? Let's assume that scoring bias only affects a play that would ordinarily be a hit if it occurs after the sixth inning, that is, in one-third of all one-hitters. Furthermore, let's suppose that scorers only fudge on "questionable" plays and that they designate as errors exactly half of all such late-inning, questionable plays. Thus, we estimate that the fraction of one-hitters that becomes no-hitters is the product of the probability that the hit occurs after the sixth inning (1/3), the probability that such a hit occurs during a "questionable play," and the probability that the scorer rules it an error (1/2).

How common are "questionable" hits? To estimate this, in games I attended during the 1993 season I carefully noted all "questionable plays," or plays that in my judgment might have been scored as either a hit or an error. In 27 games played by 20 different major league teams at 10 different stadiums, batters reached base on 500 plays, of which official scorers ruled 481 as base hits and 19 as errors. Of these plays I scored 34 of the hits and 6 of the errors as "questionable plays." Although scoring is highly subjective, I believe that most who try this will find that under ordinary conditions (i.e., in games in which a no-hitter is no longer possible) between about 5% and 10% of all plays scored as hits occur on "questionable plays."

Of these, as we assume that half would be scored as errors if they occurred in the last third of a game with no previous hits, the fraction of one-hitters that becomes no-hitters is between 0.008 and 0.017. Moreover, the SP model predicts that in the absence of scoring bias there will be 27p one-hitters for each no-hitter. Thus, if p is .250, these calculations suggest that scoring bias may increase the number of no-hitters about 5-10% over that predicted by the SP model.
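In code, the estimate is just a product of the factors above times the SP ratio of one-hitters to no-hitters. This is a sketch of the arithmetic only; the 5% and 10% figures for questionable plays are the observational bounds quoted in the text.

    def scoring_bias_increase(p=0.250, frac_questionable=0.075):
        """Approximate fractional increase in no-hitters due to scorers' calls."""
        one_hitters_per_no_hitter = 27 * p                         # SP ratio P(1 hit)/P(0 hits)
        frac_converted = (1 / 3) * frac_questionable * (1 / 2)     # late inning x questionable x ruled an error
        return one_hitters_per_no_hitter * frac_converted

    for f in (0.05, 0.10):
        print(f, round(scoring_bias_increase(frac_questionable=f), 3))   # roughly 5% to 11%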


Variations Over Time

Are no-hitters more common during periods when the league is weaker because of expansion? Since 1960 the major leagues have expanded from 16 to 28 teams, with new teams obtaining personnel by drafting players from established teams. This dilution of the talent pool may increase the rate of occurrence of no-hitters. The expansions occurred in 1961, 1962, 1969, 1977, and 1993. To investigate whether expansion affected no-hitters, I compare the number of no-hitters pitched in 2-year periods prior to expansion with the number pitched in 2-year periods following expansion. The record shows that there were 30 no-hitters pitched in the preexpansion years (1959, 1960, 1967, 1968, 1975, 1976, 1991, and 1992) and 27 no-hitters pitched in the postexpansion years (1961, 1962, 1963, 1969, 1970, 1977, 1978, and 1993). Thus, there were actually fewer postexpansion no-hitters even though the number of games played was about 10% greater. I conclude that it is simply wrong to argue that no-hitters are pitched by pitchers feasting on leagues weakened by expansion.

Are no-hitters more likely after September 1, when teams expand their roster from 25 to 40 to try out inexperienced players? In the 1993 season, two of the three no-hitters recorded occurred in September. The teams that were no-hit were the Cleveland Indians and the New York Mets, two of the weakest teams in the entire major leagues. Is it reasonable to suggest, as did a September 20, 1993 editorial in the Sporting News, that such September no-hitters occur because the weaker teams just don't care? Since 1900 there have been 48 no-hitters pitched in the month of September compared to 36, 33, and 32 in the mid-season months of May, June, and August. There have also been 22 in July, which has fewer games played in recent years because of the all-star break, as well as 28 and 3 in the "short" months of April and October when the season begins and ends.

Do the 48 no-hitters in September represent a significantly higher rate than the 101 no-hitters observed for the "regular" mid-season months of May, June, and August? If we presume no-hitters are distributed according to a Poisson distribution, the mean and variance are equal for samples taken over T-day periods. Thus, after scaling the rates and variances to represent occurrences reported in 30-day months, the September and mid-season since-1900 monthly rate estimates are 48/month and 33/month, respectively. Furthermore, these monthly rates are different enough so that I conclude that there probably is a "September effect," producing about one "extra" no-hitter every 6 years.
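One crude way to check this comparison is to condition on the total count: if the four months had a common Poisson rate and roughly equal exposure, September's share of the 149 no-hitters would be binomial with probability 1/4. The sketch below makes that simplifying equal-exposure assumption, which is mine rather than the article's.

    from math import comb

    def binom_tail(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # 48 September no-hitters versus 101 in May, June, and August combined
    print(binom_tail(48, 48 + 101, 0.25))   # roughly .03, consistent with a September effect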

Is it fair, however, to attribute the September effect to lack of caring? Not necessarily. In the previous sections I have demonstrated that no-hitters are more common when there is an increase in variance for either pitchers or batters. When teams expand their rosters from 25 to 40 players in September it is plausible that there is an increased variance in performance due to the combination of inexperienced players competing alongside regulars whose skills are honed to late-season form. Incidentally, the season-average winning percentage of teams that are no-hit in September and October is 50.5%; thus, it is simply not true that late-season no-hitters are always pitched against weaker teams, as occurred in 1993.

Are no-hitters more common nowadays than in the past? The data indicate that the incidence of no-hitters differs during different baseball eras (Table 3). For example, during the "dead ball" era from 1900 to 1919, p was a low .241, and there were about 11 no-hitters per 10,000 games pitched. Then, during the "lively ball" period from 1920 to 1939, p rose to .268, whereas the number of no-hitters dropped to 3.7 per 10,000 games. In most decades the rate of no-hitters is approximately 1.5 times the rate predicted by the SP


Table 3—Number of and Rates of Nine-Inning No-Hitters Observed in Each Decade Since 1900 and Rates Predicted by the SP and VP Models (no. of no-hitters; rate/10,000 games played). [Individual table entries are not legible in this reproduction.]


Jered Weaver pitching for the 2003 USA Baseball Team. Photo courtesy of USA Baseball, © 2003.

model. An outstanding exception is the 1980-1989 decade, when there were actually fewer no-hitters than predicted by the SP model. The rate for the 1990-1993 period is unusually high; however, this is only because so many occurred in 1990 and 1991. Otherwise, there is no evidence that no-hitters are more common now than in the past.

What Happened in 1990-1991?

This is a puzzle. There were eight no-hitters in 1990, and eight in 1991. For a Poisson process with typical rates of about 7.5 no-hitters/10,000 games, the probability of observing 8 or more in a 4,200-game season is about 1%. Yet it happened 2 years in a row.
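The 1% figure is a Poisson tail probability and is easy to verify; a minimal sketch, using the 4,200-game season and the 7.5/10,000 rate quoted above:

    from math import exp, factorial

    def poisson_tail(k, lam):
        """P(X >= k) for X ~ Poisson(lam)."""
        return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

    lam = 7.5 / 10000 * 4200        # expected no-hitters in one season, about 3.15
    print(poisson_tail(8, lam))     # about .01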

I can find nothing unusual about the 1990-1991 seasons to explain this anomaly. When compared with data from all years since 1900, the distributions for 1990-1991 of the factors that might be important are in no way unusual; these factors are hits/game/team, hits/9 innings for all pitchers in the majors, hits/9 innings for no-hit pitchers, the seasonal winning percentages of teams being no-hit, and team batting averages of teams being no-hit. There were no obvious changes in baseball in 1990-1991, such as expansion, rule changes, and the like. In addition, following 1991, the number of no-hitters has been normal, with two in 1992 and three in 1993. Thus, in the absence of any better explanation, I tentatively conclude that the high number of no-hitters in 1990-1991 is due simply to chance.

Discussion and Conclusions

So, what have we learned about no-hitters? The data indicate clearly that good pitchers make no-hitters happen, but, surprisingly, poor-hitting teams do not seem especially vulnerable. No-hitters occur more often late in the season and, prior to 1980, more often at home. They are not, however, more common against weak teams or in years of league expansion.

How well does the SP model explain the incidence of no-hitters? The SP model is highly successful as a reference model for evaluating how various factors affect the distribution of hits/game, even though it ignores many factors that a baseball fan knows are important; for example, pitchers tire as games progress, all hitters are not of equal ability, and so on. In this respect it is like other reference models, such as the ideal gas law in physics, or the statistical assumption that a distribution is normal—it often provides useful answers even though we know the model isn't quite true.

The variable pitcher (VP) and variable batter (VB) models are natural modifications of the simple model that evaluate the importance of two of its obvious inadequacies—variation in player ability for pitchers and hitters. All three models ignore the obvious fact that baseball is a game of strategy in which game-specific objectives such as bunting (sacrificing an out to advance a runner) or the possibility of double plays embody the fact that individual batting outs are not strictly independent.

To summarize our quantitative findings, the simple model predicts that we would have expected 135 no-hit games from 1900 to the present even if identical, average pitchers pitched against identical, average batters. We expect an additional 45 or so because pitchers aren't identical—a few are Nolan Ryans. About 15 more occur because batter ability is also variable, including several that appear to reflect a "September effect." Perhaps 10 are attributable to bias in scoring, especially prior to about 1980. Thus, the various factors that we have considered combine to explain the 202 observed no-hitters.


Statistics and baseball go hand in hand, but how much of the game is just plain luck?

Chapter 15

Answering Questions About Baseball Using Statistics

Bill James, Jim Albert, and Hal S. Stern

Is it possible for a last-place baseball team, a team with the least ability in its division, to win the World Series just because of luck? Could an average pitcher win 20 games during a season or an average player achieve a .300 batting average? The connection between baseball and statistics is a strong one. The Sunday newspaper during baseball season contains at least one full page of baseball statistics. Children buy baseball cards containing detailed summaries of players' careers. Despite the large amount of available information, however, baseball discussion is noticeably quiet on questions that involve luck.

Interpreting Baseball Statistics

Statistics are prominently featured in almost every discussion on baseball. Despite this, the way that baseball statistics are understood by the public, sports writers, and even baseball professionals is very different from how statistics are normally used by statisticians to analyze issues. In fact, one might say that the role of statistics in baseball is unique in our culture. Baseball statistics form a kind of primitive literature that is

Illustration by John Gampert


unfolded in front of us each day in the daily newspaper. This is the way that they are understood by baseball fans and baseball professionals—not as numbers at all, but as words, telling a story box score by box score, or line by line on the back of a baseball card. That is why young children often love those baseball numbers, even though they might hate math: They love the stories that the numbers tell. People who do not analyze data on a regular basis are able to examine the career record of a player and describe, just by looking at the numbers, the trajectory of the career—the great rookie year followed by several years of unfulfilled potential, then a trade and four or five excellent years with the new team followed by the slow decline of the aging player.

Baseball has a series of standards, measures of season-long performance that are widely understood and universally accepted as marks of excellence—a .300 batting average (3 hits in every 10 attempts on average), 200 hits, 30 home runs, 20 wins by a pitcher. In fact, standards exist and have meaning in enormous detail for players at all levels of ability; a .270 hitter is viewed as being a significantly different type of player from a .260 hitter, despite the fact that the standard deviation associated with 500 attempts is about .020. These baseball standards have taken on a meaning above and beyond that suggested by the quantitative information conveyed in the statistics. To baseball fans, sports writers, and professionals, all too often a .300 batting average does not suggest 30% of anything—.300 means excellence. Similarly, 20 wins by a pitcher is usually interpreted as great success in a way that 19 wins is not. Pitchers with 20 wins are typically described as "hardworking," and "they know how to win." The percentage of games won and other measures of pitching effectiveness are often relegated to secondary consideration.

Baseball people tend, as a consequence of how they normally understand statistics, to overestimate by a large amount the practical significance of small differences, making it very difficult to educate them about what inferences can and cannot be drawn from the data available. Take, for example, the difference between a pitcher who wins 20 games and a 15-game winner. How likely is it for an average pitcher—that is, a pitcher with average ability—to win 20 games just because he was lucky? If an average pitcher playing for an average team (we assume such a pitcher has probability .5 of winning each decision) has 30 decisions (wins or losses) in a year, then the probability of winning 20 or more games by dumb luck is .05 (actually .0494). Because these 30 decisions are influenced in a very strong way by the quality of the team for which the pitcher plays, the chance of an average pitcher on a better-than-average team winning 20 games is even larger. There are many pitchers who have about 30 decisions in a season, and, therefore, the chance that some pitcher somewhere would win 20 games by dumb luck is much greater than that.
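Both of the binomial tail probabilities used in this chapter, the 20-game winner here and the .300 season for a .260 hitter discussed below, can be checked with a few lines of Python; this is a sketch, and the helper name binom_tail is mine.

    from math import comb

    def binom_tail(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(binom_tail(20, 30, 0.5))      # average pitcher wins 20+ of 30 decisions: about .0494
    print(binom_tail(150, 500, 0.26))   # a .260 hitter reaches .300 in 500 at-bats: about .0246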

For an average pitcher to win 20 games by chance is not really a true fluke; it is something that any statistical model will show could easily happen to the same pitcher twice in a row or even three times in a row if he is pitching for a quality team. Furthermore, real baseball provides abundant examples of seemingly average pitchers who do win 20 games in a season. For example, in 1991, Bill Gullickson of the Detroit Tigers, who had been very nearly a .500 pitcher (probability of success .5) in the surrounding seasons, won 20 games in 29 decisions. Any statistician looking at this phenomenon would probably conclude that nothing should be inferred from it, that it was simply a random event.

But to a baseball person, the suggestion that an average pitcher might win 20 games by sheer luck is anathema. Such an argument would be received by many of them as, at best, ignorance of the finer points of the game, and, at worst, as a frontal attack on their values. If it were suggested to a teammate or a coach or a manager of Bill Gullickson that he had won 20 games in 1991 simply because he was lucky, this would be taken as an insult to Mr. Gullickson. The problem is that such a suggestion would be messing with their language, trying to tell them that these particular words, "20-game winner," do not mean what they take them to mean, that excellence does not mean excellence.

The economics of baseball lock these misunderstandings into place. The difference between a pitcher who wins 20 out of 30 decisions and a pitcher who wins 15 out of 30 decisions is not statistically significant, meaning that the results are consistent with the possibility of two pitchers having equal underlying ability. But in today's market, the difference between the two pitchers' paychecks is more than a million dollars a year. A million dollars is a significant amount of money. So from the vantage point of baseball fans, players, and owners, it makes no sense to say that the difference between 20 wins and 15 wins is not significant. It is a highly significant difference.

This misunderstanding of the role of chance is visible throughout baseball. Is it reasonably possible, for example, that a .260 hitter (probability of success on each attempt is .26) might hit .300 in a given season simply because he is lucky? A baseball fan might not believe it, but it quite certainly is. Given 500 trials, which in baseball would be 500 at-bats, the probability that a "true" .260 hitter would hit .300 or better is .0246, or about one in 41. A .260 hitter could hit .300, .320, or even possibly


How Does One Simulate a Baseball Season?

Before considering one simulation model in detail, it is a good idea to review the basic structure of major league baseball competition. Baseball teams are divided into the National League with 12 teams and the American League with 14 teams. (This alignment will be changed in the 1993 season when two expansion teams join the National League.) The teams in each league are divided into Eastern and Western divisions, and the objective of a team during the regular season is to win more games than any other team in its division. In the National League, every team plays every other team in its division 18 times and every team in the other division 12 times for a total of 162 games. Teams in the American League also play 162 games, but they play 13 games against opponents in the same division and 12 games against teams in the other division. In post-season play in each league, the winners of the Eastern and Western divisions play in a "best-of-seven" play-off to decide the winner of the league pennant. The pennant winners of the two leagues play in a "best-of-seven" World Series to determine the major league champion.

Because a baseball season is a series of competitions between pairs of teams, one attractive probability model for the simulation is the choice model introduced by Ralph Bradley and Milton Terry in an experimental design context in 1952. Suppose that there are T teams in a league and the teams have different strengths. We represent the strengths of the T teams by positive numbers λ1, ..., λT. Now suppose two teams, say Philadelphia and Houston, play one afternoon. Let λphil and λhous denote the strengths of these two teams. Then, under the Bradley-Terry choice model, the probability that Philadelphia wins the game is given by λphil/(λphil + λhous).

Can we assume that the teams have equal strengths? If this is true, then the result of any baseball game is analogous to the result of tossing a fair coin. The chance that a particular team wins a game against any opponent is 1/2, and the number of wins of the team during a season has a binomial distribution with sample size 162 and probability of success 1/2. If this coin-tossing model is accurate, the observed variation of winning percentages across teams and seasons should look like a binomial distribution with a probability of success equal to 1/2. If one looks at the winning percentages of major league teams from recent seasons, one observes that this binomial model is not a good fit—the winning percentages have greater spread than that predicted under this coin-tossing model. So it is necessary to allow for different strengths among the teams.

We do not know the values of the Bradley-Terry team strengths λ1, ..., λT. It is reasonable to assume, however, that there is a hypothetical population of possible baseball team strengths and the strengths of the T teams in a league for a particular season is a sample that is drawn from this population. Because it is convenient to work with normal distributions on real-valued parameters, we will assume that the logarithms of the Bradley-Terry parameters, ln λ1, ..., ln λT, are a random sample from a normal population distribution with mean 0 and known standard deviation σ.

How do we choose the spread of this strength distribution? Generally, we choose a value of the standard deviation so that the simulated distribution of season-winning percentages from the model is close to the observed distribution of winning percentages of major league teams from recent years. One can mathematically show that if we choose the standard deviation σ = .19, then the standard deviation of the simulated winning percentages is approximately 6.5%, which agrees with the actual observed standard deviation of season winning percentages from the past seven years. With this final assumption, the model is completely described, and we can use it to simulate one major league season. Here is how we perform one simulation.

1. Because there are 26 major league teams, we simulate a set of 26 team strengths for a particular season from the hypothetical population of strengths. In this step, we simulate values of the logarithms of the strengths from a normal distribution with mean 0 and standard deviation .19 and then exponentiate these values to obtain values for the team abilities. A team with strength λi has expected winning percentage p approximately described by ln(p/(1 − p)) = 1.1 ln λi (this relationship was developed empirically). In Tables 2 and 3 of the article, teams are characterized by their expected winning percentage.

2. Simulate a full season of games for each league using these values of the team strengths. In the National League, suppose Philadelphia plays Houston 12 times. If λphil and λhous denote the strength numbers for the two teams, then the number of games won by Philadelphia has a binomial distribution with 12 trials and probability of success λphil/(λphil + λhous). In a similar fashion we simulate all of the games played during the season.

3. After all the games have been played, the number of wins and losses for each team are recorded.

4. The division winners from the simulated data are determined by identifying the team with the most wins. It may happen that there are ties for the winner of a particular division and one game must be played to determine the division winner. The simulated season is completed by simulating the results of the pennant championships and the World Series using the Bradley-Terry model.

The simulation was repeated for 1000 seasons. For each team in each season, or team-season, we record the "true" Bradley-Terry strength, its simulated season winning percentage, and whether it was successful in winning its division, pennant, or World Series. Using this information, we can see how teams of various true strengths perform during 162-game seasons.
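A stripped-down version of this kind of simulation fits in a short script. The sketch below is mine and simplifies the schedule to a single balanced division of six teams playing about 160 games each, so its output is only a rough check on the division-winner figures reported in the article; it is not the authors' program.

    import random
    from math import exp

    def simulate_division(n_teams=6, sigma=0.19, games_per_pair=32, rng=random):
        """Draw Bradley-Terry strengths for one division and play a balanced schedule."""
        strengths = [exp(rng.gauss(0.0, sigma)) for _ in range(n_teams)]
        wins = [0] * n_teams
        for i in range(n_teams):
            for j in range(i + 1, n_teams):
                p_i = strengths[i] / (strengths[i] + strengths[j])
                for _ in range(games_per_pair):
                    if rng.random() < p_i:
                        wins[i] += 1
                    else:
                        wins[j] += 1
        return strengths, wins

    random.seed(1952)
    n_seasons = 1000
    best_wins_title = 0
    for _ in range(n_seasons):
        strengths, wins = simulate_division()
        if wins.index(max(wins)) == strengths.index(max(strengths)):
            best_wins_title += 1
    print(best_wins_title / n_seasons)   # roughly comparable to the division-winner rate in the article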


as high as .340 in a given season, simply because he is lucky. Although it is true that the odds against this last event are quite long (1 in 23,000), it is also true that there are hundreds of players who participate each season, any one of whom might be the beneficiary of that luck. In discussing an unexpected event—let us say the fact that Mike Bordick of the Oakland A's, a career .229 hitter until the 1992 season, when he improved to .300—one should keep prominently in mind the possibility that he may simply have been very lucky. There are other possible explanations, such as a new stadium or a new hitting strategy. These basic probability calculations merely illustrate the large effects that can result due to chance alone.

On the other hand, an unusually poor performance by a player may also be explained by chance. The probability that a .300 hitter will have a run of 20 consecutive outs (ignoring walks and sacrifices) during a season is about .11. In fact, this calculation does not take into account that some opposing pitchers are tougher than others and that, therefore, the .300 hitter may not have the same probability of success against each pitcher. There are other explanations for a slump besides bad luck—a player does not see the ball well, a player is unsatisfied with his defensive assignment, problems at home, and so forth. All too often these alternatives are the explanation preferred by fans and sportswriters. It is rare to hear a run of successes or failure attributed to chance.

Baseball fans do not completely ignore the role of chance in baseball. Many well-hit baseballs end up right at a defensive player and many weak fly balls fall in just the right place. Conventional wisdom suggests that these events even out in the course of the season. The previous calculations suggest otherwise, that chance does not have to balance out over a season or even over several seasons.
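The slump probability quoted above can be checked by simulation; the sketch below is mine, with an assumed 500 at-bats and an arbitrary seed, and asks how often a season of independent at-bats contains 20 consecutive outs.

    import random

    def season_has_slump(avg=0.300, at_bats=500, run_length=20, rng=random):
        """True if a simulated season contains a run of run_length consecutive outs."""
        current = 0
        for _ in range(at_bats):
            if rng.random() < avg:
                current = 0                 # a hit ends the run of outs
            else:
                current += 1
                if current >= run_length:
                    return True
        return False

    random.seed(3)
    trials = 20000
    frequency = sum(season_has_slump() for _ in range(trials)) / trials
    print(frequency)                        # roughly .11, as stated in the text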

Does the Best Team Win?

It is possible for a .260 hitter to hit .300, and it is possible for an average pitcher to win 20 games, but is it possible for a last-place team, more precisely a team with last-place ability, to win the pennant simply because they are lucky from the beginning of the season to the end? Needless to say, a baseball fan would writhe in agony even to consider such a question, and for good reason: It undermines his/her entire world.

To study this issue, a model of the modern major leagues is constructed—26 teams, 2 leagues, 14 teams in 1 league, 12 in the other. In real baseball, the exact quality of each team is unknown. If a team finishes a season with wins in 50% of their games (a .500 winning proportion), it could be that they were a .500 team, but it could also be that they were a lucky .450 team or a .550 team that fell on hard luck. In the model teams are randomly assigned abilities so that the quality of each team is known. Then the model is used to simulate baseball seasons and the probability of certain events—like the probability that the best team in baseball wins the World Series—is estimated. Some subtlety is required in creating these simulation models; it would seem natural to assign winning percentages in the model that are consistent with actual observed winning percentages, but that would create a problem. As a season is played out, there is a tendency for the empirical distribution of the simulated winning percentages to be more spread out than the randomly assigned percentages. Essentially, the differences between the records of the best teams and those of the worst teams will normally appear greater in the simulated results than the differences in assigned abilities because some of the better teams will be lucky and some of the weaker teams will be unlucky. The randomly assigned ratings were adjusted so that the simulated distribution of team records matched the observed distribution. (See the sidebar for a discussion of the simulation model used to generate the results in this article.)

One thousand seasons of baseball were simulated, a total of 26,000 team-seasons, and the answers to the following basic questions were collected:

• How often does the best team in a division (as measured by the randomly assigned ability) win the divisional title?

• How often does the best team in baseball win the World Series?

• How often does the best team in baseball fail to win even its own division?

• How often does an average team win its division?

• How often does an average team win the World Series?

• Is it possible for a last-place team (the weakest team in its division) to win the World Series simply because they are lucky?

In each simulated season it is possible to identify the "best" team in each of the four baseball divisions and the "best" team in the entire league as measured by the teams' randomly assigned abilities. The best team in the division wins its division slightly more than one-half the time. In the 1000-year simulation, 4 divisions each year, 56.4% of the 4000 divisional races were won by the best team in the division. The results are similar in the two leagues, although the larger divisions in the American League (beginning in 1993 the leagues will be the same size) lead to slightly fewer wins by the best team. The best team in baseball, the team with the highest randomly assigned ability, won the World Series in 259 out of


1000 seasons. The best team in baseball fails to win even its own division 31.6% of the time. Even if the best team in baseball does win its division, it must still win two best-of-seven play-off series to win the World Series. The probability that the best team survives the play-offs, given that it won its division, is about .38.

Table 1 defines five categories of teams—good, above average, average, below average, and poor. The categories are described in terms of the percentile of the pool of baseball teams and in terms of the expected success of the team. Good teams are the top 10% of baseball teams with expected winning proportion .567 or better (equivalent to 92 or more wins in a 162-game season). Note that in any particular season it may be that more than 10% or fewer than 10% of the teams actually achieve winning proportions of .567 or better because some average teams will be lucky and some of the good teams will be unlucky.

Table 2 describes the results of the 1000 simulated seasons. The total number of team-seasons with randomly assigned ability in each of the five categories is shown along with the actual performance of the teams during the simulated seasons. For example, of the 2491 good team-seasons in the simulations, more than half had simulated winning proportions that

Table 1—Defining Five Categories of Baseball Teams

Category           Percentiles of distribution    Winning proportion
Poor               0-10                           .000-.433
Below average      10-35                          .433-.480
Average            35-65                          .480-.520
Above average      65-90                          .520-.567
Good               90-100                         .567-1.000

put them in the good category. A not insignificant number, 1%, had winning proportions below .480, the type of performance expected of below-average or poor teams.

Table 3 records the number of times that teams of different abilities achieved various levels of success. As we would expect, most of the 4000 division winners are either above-average or good teams. It is still a relatively common event, however, that an average or below-average team (defined as a team with true quality below .520) wins a divisional title. Over 1000 seasons, 871 of the 4000 division winners were average or below in true quality. Apparently, an average team wins just less than one of the four divisions per season on average. There are three reasons why this event is so common. First, an average team can win 92 or more games in a season by sheer luck; this is certainly not an earth-shaking event. In the simulations, 2.5% of the average or worse team-seasons had good records by chance. Second, there are a lot of average teams; average teams have good teams badly outnumbered. The third and largest reason is that it is relatively common for there to be a division that does not, in fact, have a good team. In real baseball, it may not be obvious that a division lacks a good team because the nature of the game is that somebody has to win. But it might be that in baseball there are five good teams with three of those teams in one division, one in a second, one in a third, but no good team in the fourth division. In another year, there might be only three good teams in baseball, or only two good teams; there is no rule of nature that says there have to be at least four good

Table 2—Simulated Performance of Teams in Each Category

                                                 Performance in simulated season (percent)
Randomly           No. of team-    Percent
assigned ability   seasons         of total    Poor    Below average    Average    Above average    Good
Poor               2595            10.0        58.9    32.0              8.9        1.1              .0
Below average      6683            25.7        20.5    42.7             27.4        8.7              .7
Average            7728            29.7         5.0    26.0             37.9       26.2             4.9
Above average      6503            25.0          .6     8.1             26.9       42.9            21.5
Good               2491             9.6          .0     1.0              7.4       31.9            59.7


Table 3—Frequency of Winning Title for Teams in Each Category

                                               Number of times
Randomly           No. of team-    Won         Won       Won World
assigned ability   seasons         division    league    Series
Poor               2595            8           2         1
Below average      6683            156         42        9
Average            7728            707         287       122
Above average      6503            1702        844       403
Good               2491            1427        825       465
Total              26000           4000        2000      1000

teams every year. So, as many years as not, there simply is not a very good team in one division or another, and then a team that is in reality a .500 or .510 team has a fighting chance to win. It is fairly common.

However, most of those average teams that get to the play-offs or World Series will lose. A slight advantage in team quality is doubled in the seven-game series that constitutes baseball's play-offs and World Series. If a team has a 51% chance of winning a single game against an opponent, then they would have essentially a 52% chance of winning a seven-game series (ignoring home-field advantages and assuming independence, the actual probability is .5218). If a team has a 53% chance of winning a single game against a given opponent, that will become about 56% in a seven-game series. So if an average team is in the play-offs with three good, or at least above-average teams, their chances of coming out on top are not very good, although still non-negligible. In fact, Table 3 shows that teams of average or less ability won 132 of the 1000 simulated World Series championships.
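The seven-game-series figures follow from a simple binomial calculation; in this sketch, home-field advantage is ignored and games are treated as independent, as in the text.

    from math import comb

    def series_win_prob(p, games=7):
        """Probability of winning a best-of-`games` series with single-game probability p
        (equivalent to winning a majority if all games were played)."""
        need = games // 2 + 1
        return sum(comb(games, k) * p**k * (1 - p)**(games - k) for k in range(need, games + 1))

    print(series_win_prob(0.51))   # about .522
    print(series_win_prob(0.53))   # about .56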

The simulation results suggest that a more-or-less average team wins the World Series about one year in seven. But is it possible for a last-place team, the team with the least ability in its division, to win the World Championship simply because of chance? Yes, it is. In 1 of the 1000 simulated seasons, a last-place team, a team that should by rights have finished sixth in a six-team race with a record of about 75-87, did get lucky, won 90 games instead of 75, and then survived the play-offs against three better teams. The chance of a last-place team winning the World Championship can be broken down into elements, and none of those elements is, by itself, all that improbable. The combination of these elements is unlikely but clearly not impossible.

A complementary question is: How good does a team have to be so that it has a relatively large chance of winning the World Series? The simulations suggest that even teams in the top one-tenth of 1% of the distribution of baseball teams, with an expectation of 106 wins in the 162-game schedule, have only a 50-50 chance of winning the World Championship.

Conclusions

What do all these simulation results mean from the perspective of baseball fans and professionals? From their perspective, it is appalling. In baseball, every success and every failure is assumed to have a specific origin. If a team succeeds, if a team wins the World Championship, this event is considered to have not a single cause but a million causes. There has never been a member of a World Championship team who could not describe 100 reasons why his team won. It is much rarer to hear the role of chance discussed. An exception is the story of old Grover Cleveland Alexander, who scoffed at the idea that he was the hero of the 1926 World Series. Just before Alexander struck out Tony Lazzeri in the seventh and final game of that World Series, Lazzeri hit a long drive, a home-run ball that just curved foul at the last moment. Alexander always mentioned that foul ball and always pointed out that if Lazzeri's drive had stayed fair, Lazzeri would have been the hero of the World Series, and he would have been the goat.

The simulations suggest that, indeed, there might not be any real reason why a team wins a World Championship; sometimes it is just luck. That is an oversimplification, of course, for even assuming that an average team might win the World Series, it still requires an enormous effort to be a part of an average professional baseball team. Baseball teams are relatively homogeneous in ability. It is surprisingly difficult to distinguish among baseball teams during the course of the 162-game


regular season and best-of-seven play-off series and World Series.

It might be argued that baseball managers and players prefer not to think of events as having a random element because this takes the control of their own fate out of their hands. The baseball player, coach, or manager has to believe that he can control the outcome of the game, or else what is the point of working so hard? This desire to control the outcome of the game can lead to an overreliance on small samples. There are many examples of baseball managers choosing a certain player to play against a particular pitcher because the player has had success in the past (perhaps 5 hits in 10 attempts). Or perhaps a pitcher is allowed to remain in the game despite allowing three long line drives that are caught at the outfield fence in one inning but removed after two weak popups fall in for base hits in the next inning. Can a manager afford to think that these maneuvers represent an overreaction to the results of a few lucky or unlucky trials?

In any case, this conflict between the way that statisticians see the game of baseball and the way that baseball fans and baseball professionals see the game can sometimes make communication between the groups very difficult.

Additional Reading

Bradley, R. A., and Terry, M. E. (1952), "Rank Analysis of Incomplete Block Designs. I. The Method of Paired Comparisons," Biometrika, 39, 324-345.

James, B. (1992), The Baseball Book 1992, New York: Villard Press.

Ladany, S. P., and Machol, R. E. (1977), Optimal Strategies in Sports, New York: North-Holland.

Lindsey, G. R. (1959), "Statistical Data Useful to the Operation of a Baseball Team," Oper. Res., 7, 197-207.


Chapter 16

THE PROGRESS OF THE SCORE DURING A BASEBALL GAME

G. R. LINDSEY

Defence Systems Analysis Group, Defence Research Board, Ottawa, Canada

Since a baseball game consists of a sequence of half-innings commencing in an identical manner, one is tempted to suppose that the progress of the score throughout a game would be well simulated by a sequence of random drawings from a single distribution of half-inning scores. On the other hand, the instincts of a baseball fan are offended at so simple a suggestion. The hypothesis is examined by detailed analysis of 782 professional games, and a supplementary analysis of a further 1000 games. It is shown that the scoring does vary significantly in the early innings, so that the same distribution cannot be used for each inning. But, with a few exceptions, total scores, establishments of leads, overcoming of leads, and duration of extra-inning games as observed in the actual games show good agreement with theoretical calculations based on random sampling of the half-inning scoring distributions observed.

1. INTRODUCTION

ABASEBALL game consists of a sequence of half-innings, all commencing inan identical manner. It might therefore be supposed that a good ap-

proximation to the result of a large sample of real games could be obtainedfrom a mathematical model consisting of random drawings from a single popu-lation of half-inning scores. To be more explicit, two simple assumptions canbe postulated, which could be labelled "homogeneity" and "independence."

"Homogeneity" implies that each half-inning of a game offers the same apriori probability of scoring—i.e. the distributions of runs for each half-inningare identical. "Independence" implies that the distribution of runs scored sub-sequent to the completion of any half-inning is unaffected by the scoring historyof the game previous to that time.

On the other hand, there are reasons to suggest that the scoring in a gamemay have a structure more complicated than that represented by a simplesuperposition of identical independent innings. Pitchers may tire and losetheir dominance over batters as the game progresses. Once a lead is estab-lished the tactics may alter as the leading team tries to maintain its lead andthe trailing team tries to overcome it. Reminiscences suggest that specialevents occur in the last half of the ninth inning. The home crowd never fails toremind the Goddess of Fortune of the arrival of the Lucky Seventh. Surely agame so replete with lore and strategy must be governed by laws deeper thanthose of Blind Chance!

This paper attempts to examine such questions.The main body of observed data are taken from the results of 782 games

played in the National, American, and International Leagues during the lastthree months of the 1958 season. All games reported during this period were re-corded.

The distributions of runs are found for each inning, and compared with one another to test the postulate of homogeneity. Using the observed distributions


and assuming them to be independent, the theoretical probability distributions of a number of variables such as winning margin and frequency of extra innings are calculated. These theoretical distributions are then compared with those actually observed in the sample of 782 games, in order to test the postulate of independence.

Some of the unexpected results are examined in a further sample of 1000 games chosen at random from the 1959 season of the National and American Leagues, and some of the conventional tests for correlation are applied.

2. DISTRIBUTION OF SCORES BY INNINGS

The distribution of scoring of runs in each half-inning is shown in Table 1. The columns headed 0, 1, ..., >5 show the relative frequency with which one team scored x runs in the ith inning. The column headed "N_i" shows the number of ith half-innings recorded. Games abandoned due to weather before becoming legal contests, or abandoned with the score tied, were not counted, so that the 782 games necessarily produced 1564 first, second, third and fourth half-innings. A small number of games were called in the fifth, sixth, seventh or eighth inning. Many games did not require the second half of the ninth inning, and only a few required extra innings. International League games scheduled for seven innings only were excluded.

The third-last column shows the mean number of runs scored by one team in the ith inning, and the second-last column gives the standard deviation σ of the distribution. The last column, headed σ/√N_i, gives the standard error of the mean. It is immediately evident that the means differ considerably and significantly from inning to inning.

The means and standard deviations are also shown on the top half of Figure 1, in which the solid vertical black bars are centered at the mean number of runs and have length 2σ/√N_i.

TABLE 1. RELATIVE FREQUENCY WITH WHICH ONE TEAM SCORED x RUNS IN THE iTH INNING
(BASED ON 782 GAMES FROM NL, AL, IL, 1958)

Inning      x=0    x=1    x=2    x=3    x=4    x=5    >5      N_i    Mean  sigma  sigma/sqrt(N_i)
1          .709   .151   .087   .029   .012   .008   .004   1,564    .53   1.05    .03
2          .762   .142   .065   .017   .008   .004   .002   1,564    .38    .83    .02
3          .745   .119   .064   .034   .020   .010   .008   1,564    .53   1.06    .03
4          .746   .142   .063   .026   .015   .006   .002   1,564    .44    .94    .02
5          .748   .140   .060   .030   .016   .004   .002   1,564    .45    .95    .02
6          .715   .152   .077   .033   .010   .008   .005   1,558    .52   1.05    .03
7          .743   .140   .067   .026   .014   .005   .005   1,558    .46    .99    .03
8          .720   .162   .064   .027   .012   .012   .003   1,554    .50   1.04    .03
9          .737   .148   .074   .021   .011   .008   .001   1,191    .45    .87    .03
10         .72    .13    .10    .04    .01    .00    .00      134    .51    .93    .08
11         .80    .14    .03    .03    .00    .00    .00       64    .30           .08
12         .78    .11    .09    .02    .00    .00    .00       44    .4
13         .92    .08    .00    .00    .00    .00    .00       26
14         .72    .18    .05    .00    .00    .05    .00       22
15         .6     .2     .1     .0     .1     .0     .0        10
16         .0     .5     .5     .0     .0     .0     .0         2
All Extra  .762   .132   .076   .027   .010   .003   .000      302    .42    .86    .05
All        .737   .144   .069   .027   .013   .007   .003   13,993    .475  1.00    .008


The bottom row of Table 1 shows the overall frequency distribution when all innings are combined. If the numbers in this bottom row are combined with N_i to give the frequencies expected in the other rows for the individual innings, on the assumption that all innings are merely samples of the same aggregate, and all extra innings are combined into one row, then the differences between the number of runs observed and expected produce a value of chi-square which has a probability of only about 0.005 of being exceeded by chance alone.
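
The chi-square homogeneity test described here can be traced with a small numerical sketch. The counts below are made up for illustration (they are not the Table 1 data), and the run categories are collapsed to 0, 1, 2, and 3+; the pooled row plays the role of Table 1's bottom line.

import numpy as np
from scipy.stats import chi2

# Hypothetical counts of half-innings producing 0, 1, 2, 3+ runs, one row per inning.
counts = np.array([
    [1109, 236, 136, 83],   # inning 1 (illustrative only)
    [1192, 222, 102, 48],   # inning 2
    [1165, 186, 100, 113],  # inning 3
])
pooled = counts.sum(axis=0) / counts.sum()        # aggregate relative frequencies
expected = np.outer(counts.sum(axis=1), pooled)   # expected counts if innings are homogeneous
chi_sq = ((counts - expected) ** 2 / expected).sum()
df = (counts.shape[0] - 1) * (counts.shape[1] - 1)
print(f"chi-square = {chi_sq:.1f}, df = {df}, p = {chi2.sf(chi_sq, df):.4f}")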

FIG. 1. Scoring by individual half-innings. Black: 782 games of 1958; White: 1000 games of 1959.

Thus it can be concluded that the distribution of runs is not the same from inning to inning, and the "postulate of homogeneity" is untrue. Examination of Table 1 or Figure 1 shows that the mean numbers of runs are greatest in the first, third and sixth innings, and least in the second and in the extra innings. There is nothing remarkable about the seventh or the ninth innings, both of which are very similar to the aggregate distribution.

The most notable differences occur in the first three innings, where the mean scores in the first and third innings are 0.53, while the mean score in the second inning is only 0.38. The probabilities of deviations as large as these from the overall mean of 0.475 are approximately 0.03, 0.001 and 0.03 for a homogeneous sample. Deviations of the means for all later innings, and chi-square for the later innings, are small enough to allow them to be considered as samples from the overall aggregate (at a level of ε = .05).



A possible explanation for this peculiar pattern of scoring in the first three innings is that the batting order is designed to produce maximum effectiveness in the first inning. The weak tail of the order tends to come up in the second inning, the strong head tends to reappear in the third, and the weak tail in the fourth.

This pattern is even more evident if one plots the frequency of scoring three or more runs, obtained by adding the columns for x = 3, 4, 5, and >5 runs. The frequencies are shown by the solid black circles on the lower half of Figure 1. It is seen that the third inning is the most likely to produce a big score.

TABLE 2. RELATIVE FREQUENCY WITH WHICH ONE TEAM SCORED x RUNS IN THE iTH INNING
(BASED ON 1000 GAMES FROM NL, AL, 1959)

Inning              x=0    x=1    x=2    x=3    x=4    x=5    >5      N_i    Mean  sigma  sigma/sqrt(N_i)
1                  .700   .159   .081   .031   .018   .010   .001    2,000    .54   1.03    .02
2                  .768   .139   .053   .023   .009   .003   .005    2,000    .40    .92    .02
3                  .730   .131   .079   .029   .016   .009   .006    2,000    .53   1.10    .03
4                  .719   .151   .078   .027   .013   .007   .005    2,000    .51   1.03    .02
5                  .730   .145   .074   .033   .011   .005   .002    2,000    .48    .98    .02
6                  .721   .157   .062   .036   .012   .008   .004    2,000    .51   1.03    .02
7                  .731   .138   .071   .029   .020   .008   .003    2,000    .51   1.04    .02
8                  .710   .162   .073   .027   .019   .006   .003    2,000    .52   1.03    .02
9 (completed)      .770   .129   .060   .027   .010   .004   .000    1,576    .40    .87    .02
9 (incomplete)     .000   .46    .26    .23    .03    .02    .00        61   1.88    .98    .13
Extra (completed)  .825   .112   .033   .012   .003   .006   .009      331    .32    .95    .05
Extra (incomplete) .00    .73    .17    .06    .04    .00    .00        53   1.40    .76    .10
All extra          .711   .198   .052   .018   .008   .005   .008      384    .47   1.00    .05
All completed      .730   .146   .070   .029   .014   .007   .004   17,907    .487  1.01    .008
All incomplete     .00    .59    .22    .15    .03    .01    .00       114   1.66    .92    .09
All                .726   .148   .071   .030   .014   .007   .004   18,021    .493  1.05    .008

The main arguments in this paper are based on the data from the 782 games played in 1958. However, in order to obtain additional evidence regarding results that appeared to deserve further examination, data were also collected from a sample of 1000 games selected at random from the 1959 seasons of the National and American Leagues. No abandoned games were included in this sample. Table 2 shows the frequency distributions of inning-by-inning scores, and the means and frequencies of scoring three or more runs are plotted on Figure 1, using hollow white bars and circles.

Comparison of the two tables shows very close agreement. A chi-square test shows no significant difference between the two samples. The low scoring in the second inning as compared to the first and third is evident again.

At this point we must abandon our postulate of homogeneity, and treat the inning-by-inning distributions individually.

In the subsequent analysis, the scoring probabilities are deduced directly from the observed data of Table 1, without any attempt to replace the observations by a mathematical law. However, examination of the shape of the distributions is interesting for its own sake, and is described in Appendix A.


Since baseball games are terminated as soon as the decision is certain, the last half of the ninth (tenth, eleventh, ...) inning is incomplete for games won by the home team in that particular half-inning. The results in Table 2 show separate rows for complete and incomplete ninth and extra half-innings. The frequency distributions are very different for complete and incomplete half-innings. A winning (and therefore incomplete) last half must have at least one run. A large proportion of winning last halves of the ninth show two or three runs, and the mean is 1.88. One run is usually enough to win in an extra inning, and the mean score for winning last halves is 1.40. Completed halves include all the visiting team's record, but exclude all winning home halves, so that the mean scores are low: 0.40 for completed ninth half-innings, and lower still, 0.32, for completed extra halves.

The basic probability function required for the calculation is f_i(x), the a priori probability that a team will score x runs in its half of the ith inning, provided that the half-inning is played, and assuming it to be played to completion. If half-innings were always played to completion, the distributions observed in Tables 1 and 2 would provide direct estimates of f_i(x). However, for i ≥ 9 the distributions for complete half-innings recorded in the tables cannot be used directly, since they have excluded the decisive winning home halves, while the distributions for incomplete halves are ineligible just because they are incomplete.

The problem of estimating f_i(x) from the observed frequencies for i ≥ 9 is discussed in Appendix B. There it is concluded that f_9(x) is found to be very nearly the same as f_18(x), the mean of the distributions for the first eight innings, so that the ninth inning does not present any unusual scoring pattern. There are too few extra innings to allow calculation of the tenth, eleventh, ... innings separately, so a grouped distribution f_E(x) applying to all extra innings is sought. f_E(x) as calculated from Table 1 (the 782 games of 1958) does not agree with f_18(x), showing a substantial deficit in ones and an excess of twos, threes, and fours. But the 1000 games of 1959 (Table 2) produce distributions consistent with the hypothesis that f_E(x) = f_18(x).

3. INHOMOGENEITY INHERENT IN THE SAMPLE

The 1782 games from which all of the data in this paper have been obtained include two seasons, three leagues, twenty-four teams, day and night games, and single games and doubleheaders. If the sample were subdivided according to these, or other categories, it is possible that significant differences in the populations might be discernible. However, as was pointed out in an earlier study of batting averages [4, p. 200], subdivision of the sample soon reduces the numbers to the point where the sampling error exceeds the magnitude of any small effects being examined, while extension of the period of time covered will introduce new sources of inhomogeneity such as changes in the personnel of the teams. Therefore, many small effects will be measurable only if they appear consistently in large samples necessarily compounded from many categories of games. These are the most interesting effects for the general spectator, although a manager would prefer to know their application to the games of his own team exclusively.


In any case, it is the contention of the author that professional baseball games offer a very homogeneous sample. Teams differ from one another so little that it is very unusual for a team to win less than one-third or more than two-thirds of their games in a season. Of the sixteen major league teams, the lowest mean total score for the 1958 season was 3.40, the highest 4.89, the mean 4.39, and the standard deviation between mean scores was 0.375. Therefore the difference between teams, if they represented sixteen separate populations with means as measured over the season, would show a standard deviation of approximately one-third of 0.375, or 0.12, for each of the nine innings, which is a small magnitude as compared to the standard deviation of approximately 1.0 for the game-to-game variation in the pooled sample.

One hundred Chicago and Washington games from the 1959 American League were extracted from the 1000-game sample, and their inning-by-inning distributions analyzed separately (as in Table 1). (These teams finished first and last, respectively.) The mean scores and standard deviations for the first eight innings combined were a mean of 0.40 with σ = 0.94 for Chicago, and a mean of 0.47 with σ = 0.85 for Washington. They do show slightly smaller standard deviations than the pooled distribution, but σ is still about twice the mean.

A small inhomogeneity probably does exist between the visiting and home halves of any inning. The home team won 55.0% of the major league games of 1958, and 54.7% of the 1000 games of 1959. In the latter sample the total score of the home teams showed a mean of 4.50 with σ = 3.03, while for visitors the result was a mean of 4.38 with σ = 3.13. The differences between these are negligible as compared to the inning-to-inning differences of the pooled distribution.

While it might be interesting to extract separate sources of variance due to various inhomogeneities, it seems much more profitable in this exploratory study to use pooled data and seek effects which appear in spite of whatever inhomogeneity may be present.

4. INDEPENDENCE BETWEEN HALF-INNINGS

The postulate of independence could be tested by conventional methods, such as computation of coefficients of correlation and comparison of conditional distributions. This is in fact done in Appendix C, for certain pairs of half-innings, and shows no significant correlation. The method is applied only to pairs of half-innings, whereas the interesting questions pertain to tendencies noticeable over the whole game.

A different approach to the testing of independence is followed in the main body of this paper, which seems more closely related to the question at the forefront of the minds of the participants and spectators during the course of a baseball game, and which is likely to be more sensitive than the methods of correlation coefficients or conditional distributions. This is to examine the probability that the team which is behind at a particular stage of the game will be able to overcome the lead and win. It is possible to deduce this by probability theory, if independence is assumed and the f_i(x) derived from the data of Table 1 used, and the predicted result can be compared with that actually observed in practice. Also, in addition to the probability of the lead being overcome, several other distributions of interest, such as the winning margin, the total score of each team, and the length of the game, can be computed and compared with observed results.



5. LENGTH OF GAME

It is shown in Appendix B that if we know the distribution of runs scored by a team in each half-inning (whether homogeneous or not), and if the scores obtained in successive half-innings are independent, then it is possible to calculate the probability that the score will be tied at the end of nine innings, and the probability that the two teams make equal scores during any single extra inning. For the data collected here, these probabilities are 0.106 and 0.566, which implies that, of all the games which had a ninth inning, a fraction 0.106 would require a tenth, and that of all the games which required a tenth (eleventh, twelfth, ...) inning a fraction 0.566 would require an eleventh (twelfth, thirteenth, ...) inning.

TABLE 3. THE STAGES AT WHICH VARIOUS GAMES CONCLUDED

Inning i   Prob.(n = i)   1958 N_i(obs)   1958 N_i(exp)   1959 N_i(obs)   1959 N_i(exp)
  9           .894            777                            1000
 10           .046             67              82              95             106
 11           .026             32              38              44              54
 12           .015             22              22              20              25
 13           .008             13              12              15              11
 14           .005             11               7               8               8
 15           .003              5               4               4               5
 16           .002              1               2               4               2
 17           .001              0               1               1               2
 18           .001              0               0               0               1

Game finished (1000 games of 1959)     N(obs)   N(exp)
At end of V9                             424      400
During H9                                 61       47
At end of H9                             420      447
During extra innings                      95      106
Total                                   1000     1000


The row labelled "Prob( = i')" of Table 3 shows the calculated a priori prob-ability that a game will require exactly i innings. The third row shows thenumber of games (out of the 782 in 1958) which actually did require an ith in-ning. The fourth row shows the number that would be expected, based not onthe a priori probability but on the formula

N_10(exp) = 0.106 N_9(obs)

N_i(exp) = 0.566 N_{i-1}(obs)   for i ≥ 11,

which is based on the number of games actually requiring the inning previous to the ith.
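
The recursion above is easy to trace numerically. The sketch below is a minimal illustration of it, with the observed 1958 counts taken from Table 3; the constants 0.106 and 0.566 are the tie probabilities quoted in the paper.

P_TIE_AFTER_9 = 0.106   # G_1,9(0): score tied after nine innings
P_TIE_EXTRA = 0.566     # probability a single extra inning leaves the score still tied

def expected_counts(observed):
    """observed: dict {inning: games observed to require that inning}, starting at 9."""
    expected = {}
    for i in sorted(observed)[1:]:
        rate = P_TIE_AFTER_9 if i == 10 else P_TIE_EXTRA
        expected[i] = rate * observed[i - 1]
    return expected

obs_1958 = {9: 777, 10: 67, 11: 32, 12: 22, 13: 13, 14: 11, 15: 5, 16: 1, 17: 0}
print(expected_counts(obs_1958))   # e.g. about 82 tenth innings expected vs 67 observed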

Similar results are shown for the 1000 games of 1959. For both samples, the number of games actually requiring a tenth and eleventh inning is rather less than predicted, but the difference is not significant at the 10% level of chi-square.

Another examination of the lengths of games can be made by observing the number that end without requiring the home half of the ninth inning (designated as H9), the number that are concluded during H9, the number ending with the conclusion of H9, and the number requiring one or more extra innings.


These are listed in Table 3 for the 1000 games of 1959. Also listed are the numbers that would be predicted by the independent model, using the theory of Appendix B. The number of games decided in the last half of the ninth exceeds the predicted number (significantly at the level ε = 0.04).

Of all the 531 extra innings observed in both samples, a fraction 0.538 were tied (as compared to the expected fraction 0.566: the difference is not significant).

It seems fair to conclude that the length of games actually observed is consistent with the prediction made on the hypothesis of independence between innings, although there is a slight tendency for games to be concluded in the last of the ninth, or in the tenth, more often than predicted.

6. TOTAL SCORE

The distribution of the total number of runs scored by one team in an entire game is shown in Table 4, in the rows labelled "Observed." The frequencies are compared with those predicted on the model of independence, by calculations described in Appendix B, and employing estimates of f_i(x) obtained from the data of Table 1. The mean score is 4.25, and the standard deviation is 3.11.

Figure 2 shows the same results in graphical form, expressed as normalized probabilities. The continuous line shows F(x), the probability of one team scoring x runs in a complete game as predicted by the theory. The midpoints of the vertical black bars represent the frequencies observed in the 782 games of 1958. The length of each bar is 2√(F(x)[1 − F(x)]/N), where N is the number of observations, 1564. About two-thirds of the bars would be expected to overlap the predicted values on the continuous line.

The vertical white bars show the results observed for the 1000 games of 1959.

TABLE 4. DISTRIBUTION OF TOTAL SCORE BY ONE TEAM

Total runs:        0     1     2     3     4     5     6     7     8     9    10
1958 Observed    100   193   225   240   193   138   136   115    70    51    34
1958 Expected     95   167   218   233   219   184   145   108    77    52    31
1958 (O-E)              26               -26   -46*
1959 Observed    134   196   249   276   286   241   171   140   100    76    41
1959 Expected    122   214   278   298   280   236   186   138    98    66    40
1959 (O-E)                   -29

Total runs:       11    12    13    14    15    16    17    18    19    20   Sum
1958 Observed     24    18    13     7     1     2     1     2     1     0  1564
1958 Expected     17     9     5     2     2     0     0     0     0     0  1564
1958 (O-E)         7
1959 Observed     42    14    13    10     3     3     3     1     0     1  2000
1959 Expected     22    12     6     2     2     0     0     0     0     0  2000
1959 (O-E)        20*


The results for the larger scores are also shown with an expanded vertical scale on the right of the diagram.

The rows labelled (O-E) in Table 4 show the difference (observed minus expected) only when chi-square is larger than would be expected on 10% of occasions. When the value has an asterisk this signifies that chi-square exceeds the value for ε = 0.001. There are two anomalous points (a very small number of 5-run games in 1958 and a very large number of 11-run games in 1959) and a general tendency to observe more large scores (over 10 runs) than predicted. With these exceptions, the agreement is good.

FIG. 2. Distribution of total scores by one team. Black: 782 games of 1958; White: 1000 games of 1959.


The excess number of large scores suggests the possibility that a team against which a large score is being made might regard the game as hopelessly lost, and decide to leave the serving but ineffective pitcher to absorb additional runs, instead of tiring out more relief pitchers in a continuing effort to halt their opponents. However, this would imply that games in which a large total score was obtained by one team would show a disproportionate share of their runs coming in the later innings, but an examination of those games in which the winner scored more than 10 runs showed no such effect.

It may be that there are a small number of occasions on which the batters are in abnormally good form, and for which the statistics do not belong to the general population.



7. ESTABLISHMENT OF A LEAD

As well as the number of runs scored in each half-inning, the difference between the scores of the two teams totalled up to the end of each full inning was recorded. Table 5 shows the number of games in which a lead of l runs had been established by the end of the ith inning, and at the end of the game, for the 782 games of 1958.

TABLE 5. ESTABLISHMENT OF A LEAD

                       Lead l:    0      1      2      3      4    5-9    >=10
Inning 1 (782 games)
  Established                   420    192    105     31     18     16      0
  Expected                      418    192    105     36     16     16      0
Inning 2 (782)
  Established                   286    254    135     60     28     28      1
  Expected                      281    242    139     60     30     31      0
Inning 3 (782)
  Established                   203    239    131     83     59     64      3
  Expected                      196    231    152     88     52     63      2
  (O-E)                                       -21
Inning 4 (782)
  Established                   147    247    138    105     62     78      5
  Expected                      152    222    158    100     64     84      3
  (O-E)                                 25
Inning 5 (782)
  Established                   128    211    167     87     76    106      7
  Expected                      126    206    158    109     72    104      5
  (O-E)                                              -22
Inning 6 (779)
  Established                   118    187    154    100     85    118     17
  Expected                      106    188    152    114     81    126      9
  (O-E)                                                                      8
Inning 7 (779)
  Established                   109    164    144    113     79    150     20
  Expected                       96    176    148    114     84    148     12
  (O-E)                                                                      8
Inning 8 (777), epsilon = .02
  Established                    82    160    146    101     90    168     30
  Expected                       87    163    142    114     87    168     16
  (O-E)                                                                     14*
Final, 1958 (782)
  Established                     0    239    145    113     79    178     28
  Expected                        0    256    147    113     85    163     18
  (O-E)                                                                     10*
Final, 1959 (1000)
  Established                     0    316    178    146    103    226     31
  Expected                        0    327    188    145    109    208     23
  (O-E)                                                                      8

In Appendix B, a formula is derived for the probability that one team will have established a lead of l at the end of i innings, on the assumption that the probability of scoring x runs in the ith inning is f_i(x), and that there is no correlation between innings or with the score being made by the opposing team. The lines in Table 5 labelled "Expected" show the number predicted by this formula.

The differences between the numbers observed and expected are entered in the lines of Table 5 labelled (O-E) only when the difference has a chi-square corresponding to ε ≤ 0.1. When ε ≤ 0.02, an asterisk is attached to the number.

In the bottom line, the value of ε appropriate to chi-square for all observations in the inning is shown only if ε < 0.1.

Figures 3A and 3B show the results in graphical form, with the observed frequencies of leads of 0, 2, 4, 6, and 8 runs being shown by the circles on Figure 3A and leads of odd value on Figure 3B.


FIG. 3A. Probability of establishing a lead. 782 games of 1958 (even numbers only).

FIG. 3B. Probability of establishing a lead. 782 games of 1958 (odd numbers only).


The observed frequencies for l = 0 are shown as solid black circles, to distinguish them from the other hollow circles. Circles are shown only where the number has been observed in at least 10 games.

The continuous lines on Figures 3A and 3B represent the theoretical frequencies based on the postulate of independence between innings. The length of the vertical bars on the circles representing observed frequencies indicates twice the standard deviation, 2√(pq/N), where Np is the theoretical frequency.

All curves except l = 0 start from 0, since the lead is 0 when the game begins. The probability of the score being tied (i.e., l = 0) decreases as the game proceeds, dropping to 0 for the final result, since extra innings must be played until l ≠ 0. The probability of a small lead (l = 1 or 2) increases for a few innings, but then decreases in the later innings because of the steady rise in the probability of larger leads (l > 2).

The practice of stopping the game as soon as the decision is certain causes sudden deviation in these probabilities for the scores at the end of the ninth inning and at the end of the game. If the home team leads by l ≥ 1 at the end of the eighth, and the visitors reduce the lead but do not erase it in their half of the ninth, then the home half of the ninth is not played and there is no opportunity to increase the lead once more. The home half of the ninth, and of extra innings, always commences with the visitors ahead or the score tied, and ceases if and when the home team achieves a lead of 1 (except in the circumstance that the play that scores the winning run also scores additional runs, as might occur if a tie were broken by a home run with men on base). Thus, almost all games won in the lower half of the ninth or extra innings will show a final margin of one. The formulae of Appendix B predict that 82% of games decided in extra innings will be by a margin of one run.

The consequence of these two factors (i.e., not starting the home half of the ninth if the home team is ahead, and not finishing the home half of the ninth or extra inning if the home team gets ahead) is to raise the probability for a lead of l = 1 at the end of the ninth, and especially at the end of the game, with a corresponding adjustment to the other probabilities.

The final winning margin at the end of the game, indicated in the last columns of Table 5, and above the label "F" on Figures 3A and 3B, is given for the 1000 games of 1959 as well and is also displayed separately on Figure 4. The continuous line shows the margin calculated on the hypothesis of independence, expressed as a normalized probability. The vertical black bars show the results observed for the 782 games of 1958, and the vertical white bars for the 1000 games of 1959. As before, the centre of the bar marks the observed frequency, and the total length of the bar is 2√(pq/N), where the predicted frequency is Np.

It is evident from inspection of Figure 4 that the fit is extremely satisfactory, except for the excess of margins of ten or more. If we combine the results for 1958 and 1959 the excess for l ≥ 10 is significant at the level of ε = .005.

The slight excess of large final margins may be caused by the excess number of games with high scores already noted in Table 4 and Figure 2, since games with one team making a large total score will tend to show larger than average winning margins. (The mean winning margin for the games of 1959 in which the winner scored more than 10 runs was 7.6 runs.)


FIG. 4. Distribution of final winning margins. Black: 782 games of 1958; White: 1000 games of 1959.


The observed mean winning margin for all games is 3.3 runs.

8. OVERCOMING A LEAD AND WINNING

When the frequency of establishment of a lead was recorded, it was also noted whether the leading team eventually won or lost the game.

Table 6 gives the number of occasions on which the lead of l (established at the end of the ith inning with frequency given in Table 5) was overcome, and the game won by the team which had been behind. Appendix B shows how this frequency can be predicted on the postulate of independence. Table 7 shows the computed frequencies expressed as probabilities, while Table 6 shows them as numbers of games (out of 782) in the lines labelled "Expected."

TABLE 6. OVERCOMING OF A LEAD AND WINNING

Inning:            1     2     3     4     5     6     7     8
Total games:     782   782   782   782   782   779   779   777

Lead 1
  Overcome        59    79    87    80    72    53    36    19
  Expected        75    97    88    87    70    55    40    24
  (O-E)          -16   -18

Lead 2
  Overcome        31    30    32    42    39    26    15    10
  Expected        30    38    34    33    35    26    18     9

Lead 3
  Overcome         9    16    22    16    12    11     6     1
  Expected        10    15    24    24    18    15    10     4


FIG. 5A. Probability of overcoming a lead and winning. 782 games of 1958 (odd numbers only).

The lines labelled (O-E) show the difference between observed and expected only when chi-square would attain such a magnitude 10% or less of the time. Chi-square for the combined totals of each inning is less than the value for ε = 0.1 for every inning.

FIG. 5B. Probability of overcoming a lead and winning. 782 games of 1958 (even numbers only).


TABLE 7. CALCULATED PROBABILITY THAT TEAM WITH LEAD l AT END OF iTH INNING WILL WIN THE GAME

Lead l    i=1    i=2    i=3    i=4    i=5    i=6    i=7    i=8
   6     .939   .944   .959   .970   .979   .988   .994   .998
   5     .905   .913   .932   .947   .961   .974   .987   .995
   4     .857   .867   .889   .908   .930   .949   .980   .987
   3     .793   .803   .827   .849   .877   .907   .940   .974
   2     .700   .720   .741   .764   .793   .831   .878   .936
   1     .610   .617   .630   .647   .668   .705   .776   .846
   0     .500   .500   .500   .500   .500   .500   .500   .500
  -1     .389   .383   .368   .353   .331   .295   .244   .153
  -2     .290   .280   .257   .236   .207   .168   .122   .063
  -3     .207   .196   .171   .150   .122   .093   .060   .025
  -4     .142   .133   .109   .091   .070   .050   .029   .011
  -5     .095   .087   .067   .053   .038   .025   .013   .004
  -6     .061   .054   .039   .029   .020   .012   .005   .000

Figures 5A and 5B show the observed and expected frequencies of overcoming a lead of l (i.e., of winning after having established a lead of −l) at the end of the ith inning. Figure 5A shows observed frequencies for margins of l = −1 and −3, indicated by the circles with standard deviation bars, and calculated frequencies for l = −1, −3, −5 and −7, indicated by the continuous lines. Figure 5B shows observed values for l = −2 and −4, and calculated values for l = −2, −4, −6 and −8. In all cases, as might be expected, the probability of overcoming a given deficit decreases as the game proceeds. The theoretical values at i = 0 have no significance in an ordinary game, since it begins with a score of 0-0. They would apply only if there were a handicap, or in the last game of a series played with the total run score to count, a practice not adopted in baseball. The curve for l = 0 would be a straight line at P = 0.5, since with the score tied, either team is equally likely to win.

To illustrate, a team that is two runs behind at the end of the fourth inning has a theoretical probability of 0.236 of winning the game, as given by Table 7. Table 5 shows that a lead of 2 was established in 138 games (out of 782). Table 6 shows that the team that was 2 runs behind eventually won 42 of them. The expected number is 0.236 × 138 = 33. Figure 5B shows the observed point at 42/138 = 0.304, with a standard error of σ = √(pq/N) = √((.236)(.764)/138) = .036. Chi-square is (42 − 33)²/33 = 2.5, which for 1 degree of freedom has ε > 0.1. Therefore, the lead was overcome rather more often than predicted by theory, but the difference is not statistically significant at the level of ε = 0.1.
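
The worked example above can be re-traced in a few lines. This is only a sketch of the arithmetic just described (note that the chi-square printed here uses the unrounded expected count, so it comes out slightly above the 2.5 quoted in the text, which rounds the expectation to 33 first).

import math

p_win = 0.236          # Table 7: P(win | 2 runs behind at end of inning 4)
n_games = 138          # Table 5: games with a 2-run lead after four innings
observed_wins = 42     # Table 6: trailing team eventually won

expected_wins = p_win * n_games                       # about 32.6, quoted as 33
std_err = math.sqrt(p_win * (1 - p_win) / n_games)    # about 0.036 on the observed fraction
chi_sq = (observed_wins - expected_wins) ** 2 / expected_wins
print(round(expected_wins), round(observed_wins / n_games, 3),
      round(std_err, 3), round(chi_sq, 1))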

It may be concluded from these data that the predicted frequency of overcoming a lead is confirmed quite well by observation, so that the numbers in Table 7 (based on the postulate of independence) provide a measure of the probability that a team can overcome a lead and win the game.

9. CONCLUSIONS

1. The distribution of runs scored per half-inning is not the same for all innings. The expected number of runs scored is significantly less in the second inning than in the first or third. The mean expectation per half-inning is 0.48 runs.


2. The total score achieved by one team is consistent with a model based on random sampling from separate and independent half-inning samples, except that the number of cases of large scores of more than ten runs is somewhat in excess of the prediction based on the theory of independence. The mean total score for one team is 4.2 runs.

3. The establishment of a lead appears to follow the postulate of independence quite well except for a slight tendency toward very large final winning margins (of 10 runs or more).

4. The frequency with which a lead is overcome and the game won agrees very well with the postulate of independence. It does not appear that there is any significant tendency for the trailing team to overcome the lead either more or less frequently than would be predicted by the model of random drawings from successive half-innings. From Table 7 it is possible to estimate the probability that the game will eventually be won by the team which is ahead.

5. The number and length of extra-inning games is not inconsistent with the postulate of independence, although the number of games ended in the ninth and tenth innings is somewhat greater than would be expected.

6. It seems logical to attribute the inhomogeneity of the first three innings to the structure of the batting order.

10. REMARKS

Perhaps these findings will be considered disappointing. There is no observable tendency for the underdog to reverse the position. Although the game is never over until the last man is out in the ninth inning, it is lost at the 2.5% level of probability if the team is more than 2 runs behind when they start the ninth inning. There is nothing unusual about the seventh inning. The peculiar innings from the point of view of scoring are the first three. The number of runs scored in the ninth inning is normal, but the 6% of games won during the last half of the ninth exceed the expected proportion.

One way in which these results could be useful to a manager is when he is considering changing personnel during the game. If he wishes to rest a winning pitcher when the game appears to be won, or to risk the use of a good but injured player in order to overcome a deficit, he will want an estimate of the probability that the present status will be reversed. He would use the information in conjunction with other statistical data [4] and his evaluation of the immediate form and special skills of his players.

Another application is to test certain scoring records believed to be unusual. For instance, the Chicago White Sox won 35 games by one run during the 1959 American League season, which was generally considered to be remarkable. From Figure 4, the expected number would be ½ × 0.327 × 154 ≈ 25 (assuming that .327 of their games would be decided by one run, and that they would win half of these). Taking σ = √(Npq) = 4.6, the difference (O − E) = 10 = 2.2σ would be expected to occur with probability 0.028. In other words their accomplishment had a probability of about 3%, and would be expected to be repeated or exceeded by one of sixteen teams about once in every two seasons.
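
A minimal normal-approximation check of the White Sox calculation above is sketched below; the per-game probability 0.327/2 of a one-run win is the assumption quoted in the text, and the printed two-sided tail probability lands near the 0.028 given there (it differs slightly because the text rounds the expected count to 25 before forming the z-score).

import math

n_games = 154
p = 0.327 / 2                       # win half of the ~32.7% of games decided by one run
expected = n_games * p              # about 25 one-run wins
sigma = math.sqrt(n_games * p * (1 - p))   # about 4.6
z = (35 - expected) / sigma
p_two_sided = math.erfc(z / math.sqrt(2))
print(round(expected, 1), round(sigma, 1), round(z, 1), round(p_two_sided, 3))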

In a full season of two eight-team major leagues we would expect the number of games lasting eighteen innings or more to be 154 × 8 × 0.106 × (0.566)^8 ≈ 1.


In the World Series of 1960, the three games won by New York were by scores of 16-3, 10-0, and 12-0. Pittsburgh won the last game by 10-9.

From Figure 4, the observed probability of the final margin exceeding nine runs is 0.033. Therefore the probability of three or more games out of seven having margins of ten or more is

Σ_{k=3}^{7} C(7, k) (0.033)^k (0.967)^{7−k}.

From Table 4, the observed probability of one team scoring more than nine runs in a game is 0.0655. Therefore the probability that there will be four or more scores of 10 or more among the fourteen team scores of seven games is approximately

Σ_{k=4}^{14} C(14, k) (0.0655)^k (0.9345)^{14−k}.

Therefore we can conclude that the three one-sided games did constitute a very unusual combination, to be expected only once or twice in 1000 seven-game series. The four large scores are less extraordinary, and would be expected to occur on about 1.5% of seven-game series.
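
The two binomial tails written out above can be evaluated directly; this sketch does so (the first comes out near 0.001, consistent with "once or twice in 1000 series"; the second comes out close to, though a little below, the 1.5% quoted in the text, which may reflect an approximation in the original calculation).

from math import comb

def tail(n, p, k_min):
    """P(at least k_min successes in n independent trials of probability p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(tail(7, 0.033, 3))     # 3+ of 7 games decided by a margin of 10 or more
print(tail(14, 0.0655, 4))   # 4+ team scores of 10+ among the 14 team-games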

Thus even if this investigation has committed the sin of exploding some cherished beliefs, it does permit estimates to be made of the probability of occurrence of certain rare accomplishments.

APPENDIX A

DISTRIBUTION OF RUNS WITHIN AN INNING

The shape of the frequency distributions of Table 1 could form the subject of an interesting analysis. Since the standard deviations are about twice as great as the means, the distributions do not follow a Poisson law.

If the negative binomial distribution [3, p. 155; 5, p. 179]

φ(x; a, p) = [Γ(a + x) / (Γ(a) x!)] p^a (1 − p)^x

is fitted to the bottom line of Table 1 by choosing p = μ/σ² = 0.475 and a = μ²/(σ² − μ) = 0.43, so that it has the same mean and standard deviation as the experimental distribution, the values listed in the line of Table 8 labelled φ(x; a, p) are obtained. Use of such a law might allow mathematical manipulation of the distributions, but it is difficult to suggest any mathematical model of baseball which would indicate a law of this precise form.
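
The moment fit just described is easy to reproduce. The following sketch assumes the mean 0.475 and standard deviation 1.00 of the "All" row of Table 1 and evaluates the fitted negative binomial; the printed values should be close to the φ(x; a, p) line of Table 8.

from math import gamma

mu, sigma = 0.475, 1.00
p = mu / sigma**2                  # 0.475
a = mu**2 / (sigma**2 - mu)        # 0.43

def neg_binom(x, a, p):
    """phi(x; a, p) = Gamma(a+x)/(Gamma(a) x!) * p^a * (1-p)^x."""
    return gamma(a + x) / (gamma(a) * gamma(x + 1)) * p**a * (1 - p)**x

print([round(neg_binom(x, a, p), 3) for x in range(6)])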

TABLE 8

Function              a       p      x=0    x=1    x=2    x=3    x=4    x=5    >5
Observed f(x)         --      --     .737   .144   .069   .027   .013   .007   .003
phi(x; a, p)         0.43    .475    .726   .164   .062   .026   .012   .006   .008
phi(x; 3, p)            3    .685    .321   .304   .191   .101   .047   .021   .015
f(x) for lambda=.35     3    .685    .749   .133   .065   .030   .013   .006   .004


A model representing an extension of the negative binomial distribution is suggested by the fact that, if a set of trials for which the individual probability of success is p is repeated until a successes have been obtained, then the probability that y failures will have occurred before the ath success is φ(y; a, p). If a trial consists of the fielding team dealing with a batter, and putting him out is considered a success, then the probability that y men will bat but not be put out before the inning is over will be φ(y; 3, p), where p is the probability that a player who appears at bat is put out during the inning. A player appearing at bat but not put out must either score or be left on base. The number left on base, L, can be 0, 1, 2 or 3. The number of runs scored is x = y − L.

To be accurate, the mathematical model would need to account for all possible plays and for their relative frequencies of occurrence. Models somewhat simpler than this have been tested [1, 2, 7] with success.

A crude approximation would be to assume that in a fraction λ of the innings in which runs were scored, one man was left on base, while two were left on in the remaining (1 − λ) scoring innings. The distribution of runs would then be given by

f(x) = λ φ(x + 1; 3, p) + (1 − λ) φ(x + 2; 3, p)   for x ≥ 1,

where the remaining probability is assigned to x = 0.

TABLE 9. DATA FOR AMERICAN AND NATIONAL LEAGUES (1958)

Event                                      A.L.      N.L.     Total
(1) Official Times at Bat                41,684    42,143    83,827
(2) Bases on Balls                        4,062     4,065     8,127
(3) Hit by Pitcher                          252       247       499
(4) Sacrifice Hits                          531       515     1,046
(5) Sacrifice Flies                         322       322       644
(6) (1)+(2)+(3)+(4)+(5)                  46,851    47,292    94,143
(7) Safe Hits                            10,595    11,026    21,621
(8) Errors                                1,002     1,083     2,085
(9) Double Plays                          1,313     1,287     2,600
(10) (2)+(3)+(7)+(8)-(9)                 14,598    15,134    29,732

29,732

To estimate the constant p, the probability that a man appearing at bat willbe put out during the inning, we could note that in many nine-inning gamessome players make four and some five appearances at bat, so that out of9X4 1/2 40 batters, 27 are put out, and p 0.68. Or, to be more methodical, wecould use data from the major league season of 1958 [6] listed in Table 9. Theseshow 94,143 batters appearing, of which 29,732 were not put out, so thatp = l-.315 = .685.


The last line of Table 8 shows the values of f(x) calculated for λ = 0.35, for which reasonable agreement is found with the distribution actually observed.
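
A minimal sketch of this "men left on base" calculation is given below, using p = 0.685 and λ = 0.35 as above and the mixture form stated for the model (the treatment of x = 0 as the remaining probability mass is an assumption of this sketch). The printed values should be close to the last line of Table 8.

from math import comb

p, lam = 0.685, 0.35

def phi(y, a=3):
    """Negative binomial: probability of y batters not put out before a outs."""
    return comb(a + y - 1, y) * p**a * (1 - p)**y

def runs_pmf(x):
    if x >= 1:
        return lam * phi(x + 1) + (1 - lam) * phi(x + 2)
    return 1.0 - sum(runs_pmf(k) for k in range(1, 30))   # remaining mass at zero runs

print([round(runs_pmf(x), 3) for x in range(6)])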

APPENDIX B

THE FIRST EIGHT INNINGS

Let f_i(x) be the probability that team A will score exactly x runs in the ith inning, where 1 ≤ i ≤ 8.

We assume throughout that both teams have the same f_i(x). Let F_{i,j}(x) be the probability that team A will score exactly x runs between the ith and jth innings (inclusive), where 1 ≤ i ≤ j ≤ 8, with

F_{i,i}(x) = f_i(x).

Tables of F_{i,j}(x) can be accumulated by successive summations such as

F_{1,j}(x) = Σ_{u=0}^{x} F_{1,j−1}(u) f_j(x − u)

and

F_{i,j}(x) = Σ_{u=0}^{x} f_i(u) F_{i+1,j}(x − u).

The probability that A will increase their lead by exactly x during the ith inning is

G_i(x) = Σ_u f_i(x + u) f_i(u),

and G_i(−x) = G_i(x) = probability that A's lead will decrease by x during the ith inning. A negative lead is, of course, the same as a deficit.

The probability that A will increase their lead by exactly x between the ith and jth innings (inclusive) is

G_{i,j}(x) = Σ_u F_{i,j}(x + u) F_{i,j}(u).
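
These convolutions are straightforward to carry out numerically. The sketch below is only an illustration: the half-inning distribution used is a made-up four-point pmf, not the Table 1 data, and run totals are truncated at a practical maximum.

import numpy as np

MAX_RUNS = 30

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent non-negative run counts."""
    return np.convolve(dist_a, dist_b)[:MAX_RUNS]

def lead_change(f):
    """G(x): distribution of (A's runs minus B's runs) for one inning, both drawn from f."""
    g = np.zeros(2 * MAX_RUNS - 1)
    for a, pa in enumerate(f):
        for b, pb in enumerate(f):
            g[a - b + MAX_RUNS - 1] += pa * pb
    return g

f = np.zeros(MAX_RUNS)
f[:4] = [0.737, 0.144, 0.069, 0.050]     # illustrative half-inning pmf
F = f.copy()
for _ in range(8):                        # accumulate nine innings: F_{1,9}
    F = convolve(F, f)
print(round(lead_change(f)[MAX_RUNS - 1], 3))              # G_i(0): lead unchanged in one inning
print(round(sum(F[x] ** 2 for x in range(MAX_RUNS)), 3))   # G_{1,9}(0): tied after nine innings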

INNINGS WHICH MAY BE INCOMPLETE

For i = 9, the home half-inning (H9) will not be started if the home team is ahead at the end of the visitors' half (V9). For i ≥ 9, the home half, if started at all, will be terminated if and when the home team achieves a lead.

Define g*(x) to be the probability that team A would score x runs in their half of the ith inning if it were played through to completion (even after the result was assured). Define S_Vi(x) and S_Hi(x) as the probabilities that the visitors' and home halves of the ith inning, if started at all, will produce x runs as baseball is actually played.

S_Vi(x) = f_i(x),

since the visitors' half is always completed. Thus we could use the measured distribution of visitors' scores to estimate f_i(x). However, the assumption that the visiting and home teams have the same distribution is open to question, and it seems preferable to use the data from the home halves as well.

Define g(l) as the probability that the visitors have a lead of l at the end of V9. The probability that H9 is required is then

a = Σ_{l ≥ 0} g(l).

Out of a large sample of N games we would observe N V9's and Na H9's. Of the Na H9's, N G_{1,9}(0) would leave the score tied and require extra innings, and N[1 − G_{1,9}(0)]/2 would leave the game won by the visitors. These N[1 + G_{1,9}(0)]/2 H9's would all be complete. In them the home team would score x runs only when x ≤ l.

If we recorded only completed ninth half-innings (as in Table 2) and computed the frequency S_9C(x) of scoring x runs, the result could be expressed in terms of f_9(x) and g(l).

The remaining Na − N[1 + G_{1,9}(0)]/2 H9's would result in victory for the home team, and be terminated with less than three out. The winning margin will almost always be one run, but can be greater on occasion when the final play that scores the decisive run produces additional runs. For example, if the home team were one behind in H9, but produced a home run with the bases full, the final margin would be recorded as three runs.

If we neglect the infrequent cases when the winning play produces a margin in excess of one, then all scores of x runs in incomplete H9's will be associated with leads of l = x − 1 for the visitors at the end of V9. But the probability of H scoring x under these circumstances is no longer simply f_9(x). We could imagine that the inning is played to completion, but a score of (l + 1) only recorded.

If we recorded only incomplete halves of the ninth inning (as in Table 2) and computed S_H9I(x), the result would satisfy

S_H9I(0) = 0.


If we combine all halves of the ninth (as in Table 1), and compute S_9(x), the result has

S_9(0) = f_9(0).

For extra innings, paucity of data will make it advisable to group the results and seek f_E(x), the probability that a team starting its half of any extra inning would score x runs if it played the inning to completion.

Reasoning similar to that applied to the ninth inning allows us to deduce S_EC(x), the distribution of runs for completed extra half-innings, S_EI(x) for incomplete extra half-innings, and S_E(x) for all extra innings.

NUMERICAL RESULTS

Starting with S_9(x) as tabulated in Table 1 (T1), we can compute f_9(x). The resulting distribution is shown in the first line of Table 10, and beneath it is f_18(x), the mean of the distributions f_i(x) for 1 ≤ i ≤ 8. They are obviously very close, as may be confirmed by chi-square, so that the ninth inning is not an unusual one.

TABLE 10

Function     Source                        x=0    x=1    x=2    x=3    x=4    x=5       N     epsilon (<0.1)
f_9(x)       Calc. from S_9(x) of T1      .737   .140   .076   .023   .013   .010    1,191
f_18(x)      Obs. in T1                   .736   .143   .069   .028   .013   .007   12,500
f_18(x)      Obs. in T2                   .726   .148   .071   .029   .013   .007   16,000
S_9C(x)      Calc. from f_18(x) of T1     .757   .137   .062   .023   .010   .005
S_9C(x)      Obs. in T2                   .770   .129   .060   .027   .010   .004    1,576
S_H9I(x)     Calc. from f_18(x) of T1     .00    .58    .26    .10    .04    .02
S_H9I(x)     Obs. in T2                   .00    .46    .26    .23    .03    .02         61     .04
S_E(x)       Calc. from f_18(x) of T1     .736   .187   .047   .016   .007   .004
S_E(x)       Obs. in T1                   .752   .132   .076   .027   .010   .003       302     .001
S_E(x)       Obs. in T2                   .711   .198   .052   .018   .008   .005       384
S_EC(x)      Calc. from f_18(x) of T1     .827   .102   .039   .017   .007   .004
S_EC(x)      Obs. in T2                   .825   .112   .033   .012   .003   .006       331
S_EI(x)      Calc. from f_18(x) of T1     .00    .90    .08    .02    .00    .00
S_EI(x)      Obs. in T2                   .00    .73    .17    .06    .04    .00         53    <.001


In subsequent calculations, f_18(x) from T1 was used as the best estimate of f_9(x); f_18(x) calculated from T2 is not significantly different.

The observed results of Table 2, as separated between complete and incomplete ninth innings, are compared with the predicted functions. The agreement is close for the 1576 complete half-innings, but less satisfactory for the 61 incomplete half-innings.

For extra innings, the distributions are calculated on the assumption that f_E(x) = f_18(x), and compared with those observed. There is a significant difference from the 302 observations in Table 1, for which fewer ones and more twos and threes were found than predicted. If the calculation is reversed, and f_E(x) derived from the observed S_E(x), the result shows f_E(1) = .061, f_E(2) = .119, f_E(3) = .043, and f_E(4) = .019, which is radically different from f_18(x) or any other f_i(x). However, S_E(x) and S_EC(x) as calculated from f_18(x) agree quite well with the distributions observed in the 384 extra innings and 331 completed extra innings of Table 2. The agreement is poor for the 53 incomplete extra innings.

All cases of disagreement involve incomplete innings and show a deficit of one-run scores, suggesting that the neglect of scoring plays resulting in margins of more than one run may be partially to blame.

These results for extra innings are contradictory. However, in view of the excellent agreement with the 331 completed extra innings of Table 2, it would seem unjustifiable to conclude that extra innings offer any substantially different probability of scoring from the first nine innings.

LENGTH OF GAME

The probability that a tenth inning will be necessary is the same as the probability that A's lead will be 0 after nine innings, i.e., G_{1,9}(0).

The probability that an eleventh inning will be necessary is G_{1,9}(0) G_{10}(0). If we group all extra innings together, so that every extra inning has the same scoring distribution f_E(x) and

G_E(x) = Σ_u f_E(x + u) f_E(u),

then the probability that an ith inning will be required is

G_{1,9}(0) [G_E(0)]^{i−10}   for i ≥ 10.

To state the probability in a manner more suitable for comparison with observed results, if N_9 games are observed to require a ninth inning, we would expect N_9 G_{1,9}(0) to require a tenth. If N_i games are observed to require an ith inning (for i > 9), then we would expect N_i G_E(0) to require an (i+1)th inning.

The probability that a game will require exactly i innings is

P(n = 9) = 1 − G_{1,9}(0)

and

P(n = i) = G_{1,9}(0) [G_E(0)]^{i−10} [1 − G_E(0)]   for i ≥ 10.


DISTRIBUTION OF RUNS SCORED IN ENTIRE GAME

The distribution of runs accumulated over nine complete innings is F_{1,9}(x). However, the distribution of total game scores must include the possibilities that the home half of the ninth (H9) is not played, or is started but not completed, that the game lasts more than nine innings, and that the home half of the final extra inning is not completed.

During the first nine innings, the visiting team has probability F_{1,9}(x) of scoring x runs. The probability that the home team will score y consists of four terms:

(a) (V win)
(b) (H win without needing H9)
(c) (H win by one run during an incomplete H9)
(d) (Extra innings are needed)

If we set F_{1,9D}(x) = [F_{1,9}(x) + (a) + (b) + (c)], this will represent the distribution of scores for games decided in nine innings, for both visiting and home team.

The distribution of runs scored in each indecisive extra inning is [f_E(x)]². In the decisive extra inning, the visitors' score has distribution f_E(x), while the home score consists of two terms:

(e) (V win)
(f) (H win in an incomplete half-inning) (x ≥ 1)

If we set F_ED(x) = [f_E(x) + (e) + (f)], this will represent the distribution of scores for the decisive extra inning.

The distribution of scores for games decided in 10 innings, and for games decided in i ≥ 11 innings, is obtained in the same way. When these are calculated we can compute the probability F(x) that a team will score a total of x runs in a whole game, whatever its length.


F(x) as computed in this way from the 1958 data is shown in Table 4 and Figure 2.

ESTABLISHMENT AND OVERCOMING OF LEAD

The probability that team A will have established a lead of exactly l by the end of the ith inning is G_{1,i}(l) for i ≤ 8. The "expected" figures of Table 5 and the curves of Figures 3A and 3B are calculated from this function for i ≤ 8.

The probability that team A will increase its lead by l or more over the ith to jth innings, inclusive, is

Σ_{m ≥ l} G_{i,j}(m).

If team A is exactly l runs behind at the end of the ith inning, the probability that they will win is

Σ_{m ≥ l+1} G_{i+1,9}(m) + ½ G_{i+1,9}(l),

the first term representing a win in nine innings, the second a win in extra innings, where the assumption is made that when a game enters extra innings each team has probability ½ of winning. The numbers of Tables 6 and 7, and the curves of Figures 5A and 5B, are computed from this function.

To calculate the distribution of the winning margin at the end of the game it is necessary to allow for the possibilities that H9 will not be needed, and that the game will be won in H9 or an HE with the decisive half-inning being terminated as soon as a lead of one run is obtained.

As in the preceding section, the margin at the end of nine innings has a probability composed of four terms:

(a) G_{1,9}(l) (V win by l), l > 0;
(b) g(−l) (H win by l without needing H9);
(c) a probability of a − [1 + G_{1,9}(0)]/2 that H win by 1 in the decisive incomplete H9;
(d) a probability of G_{1,9}(0) that the score is tied.

For those games which are decided in an extra inning, the probability of a final margin of l is composed of two terms:

(e) a term proportional to G_E(l) (V win by l), l > 0;
(f) a probability of ½ that H wins by l = 1.

Combining these, and normalizing where necessary, we obtain the probability G(l) that the winning margin will be l runs.

G(0) = 0

This is the function used to calculate the last columns of Table 5, the points above "F" on Figures 3A and 3B, and the curve of Figure 4.


TABLE 11. CORRELATIONS BETWEEN HALF-INNINGS (1000 games of 1959)

Pair of half-innings       r        N     sigma_r
V1-V2                   -0.001    1000      .032
H1-H2                   +0.05      500      .047
V1-H1                   +0.045     500      .047
H1-V2                   -0.024     500      .047
V7-V8                   -0.014     500      .047
H7-H8                   +0.009     500      .047
7-8                     -0.002    1000      .032

APPENDIX C

CORRELATIONS BETWEEN HALF-INNINGS AND BETWEEN TOTAL SCORES

A common method of testing for independence between two distributions is to compute the correlation coefficient.

The pair of scores for the visiting team's first and second innings (V1 and V2) for the 1000 games of 1959 show a correlation coefficient of −0.001. The standard deviation for small coefficients and large numbers of readings is very nearly N^(−1/2) = 0.032. Table 11 shows several other correlation coefficients, and it is evident that none of them is significant. Thus, there is no evidence here of any linear correlation between the scores in half-innings of the same or of the opposing team. Absence of linear correlation does not, however, constitute proof of independence [3, p. 222].
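
A minimal sketch of this correlation check is given below. The paired half-inning scores are simulated stand-ins drawn from a Table-1-like pmf (they are not the 1959 data), and the large-sample standard error 1/√N is the one quoted above.

import numpy as np

rng = np.random.default_rng(0)
pmf = [0.737, 0.144, 0.069, 0.027, 0.013, 0.007, 0.003]   # illustrative run distribution
v1 = rng.choice(len(pmf), size=1000, p=pmf)
v2 = rng.choice(len(pmf), size=1000, p=pmf)

r = np.corrcoef(v1, v2)[0, 1]
se = 1 / np.sqrt(len(v1))          # ~0.032 for N = 1000
print(round(r, 3), round(se, 3))   # r should lie within a couple of se of zero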

Another method of testing for independence is to draw up conditional distributions for particular half-innings in which the scores in other half-innings have been below, or above, average. Some of these are shown in Table 12. S_V2(x | V1 > 0) represents the measured distribution of runs in the visitors' half of the second inning for all cases (out of 1000 games) in which the visitors scored one or more runs in their half of the first inning.

TABLE 12. CONDITIONAL DISTRIBUTIONS FOR CERTAIN INNINGS (1000 games of 1959)

                                        x=0    x=1    x=2    x>=3   epsilon
S_V2(x | V1 > 0)                        246     31     14     14
  Exp.                                  237     40     16     12      0.4
S_H2(x | H1 > 0)                        111     20     12      9
  Exp.                                  112     23     10      7      0.5
S_H1(x | V1 > 0)                        108     15     11      9
  Exp.                                  101     21     14      8      0.3
S_V2(x | H1 > 0)                        119     21      3      9
  Exp.                                  119     20      7      6      0.5
S_V8(x | V7 > 0) + S_H8(x | H7 > 0)     191     42     25     20
  Exp.                                  199     41     23     15      0.5


"Exp" shows the number that would be expected on the null hypothesis that S_V2(x | V1 > 0) = S_V2(x | V1 = 0). Five such conditional distributions are shown in Table 12, with ε giving the probability of occurrence of a chi-square as large as observed. No single difference between "observed" and "expected" for which "expected" was 10 or more showed a chi-square with ε < .05. Thus there is no evidence of dependence between the half-innings investigated.

The coefficient of correlation between the total scores of the home and visiting teams for the 1000 games is +0.090 ± 0.032. This is significant at the 1% level, and indicates a slight tendency for the two teams to score together. Such a tendency would reduce the number of games with large winning margins below the predictions of the independent model. Since the number actually observed is greater than predicted (see Table 5 and Figure 4 for margins of 10 or more), the small correlation must be overcome by the excess number of games with a large total score and an associated large margin.

REFERENCES

[1] Briggs, Hexner, Meyers, and Stewart, "A Simulation of a Baseball Game," Bulletin of the Operations Research Society of America, 8, Supplement 2 (1960), B-99.

[2] D'Esopo, Donato, and Lefkowitz, Benjamin, "The Distribution of Runs in the Game of Baseball," Stanford Research Institute, August 1960.

[3] Feller, William, An Introduction to Probability Theory and Its Applications, Volume I, Second Edition. New York: John Wiley and Sons, Inc., 1957.

[4] Lindsey, George, "Statistical Data Useful for the Operation of a Baseball Team," Operations Research, 7 (1959), 197.

[5] Parzen, Emanuel, Modern Probability Theory and Its Applications. New York: John Wiley and Sons, Inc., 1960.

[6] Spink, J. G. Taylor, Baseball Guide and Record Book, 1959. St. Louis, Missouri: Charles C. Spink and Son, 1959.

[7] Trueman, Richard, "A Monte Carlo Approach to the Analysis of Baseball Strategy," Bulletin of the Operations Research Society of America, 7, Supplement 2 (1959), B-98.


Part III
Statistics in Basketball


Chapter 17

Introduction to the Basketball Articles

Robert L. Wardrop

17.1 Background

Basketball was invented in 1891 by James Naismith, a physical education instructor at the YMCA Training School in Springfield, Massachusetts, USA. The game achieved almost immediate acceptance and popularity, and the first collegiate game, with five players on each team, was played in 1896 in Iowa City, Iowa, USA. Professional basketball in the United States dates from the formation of the National Basketball League in 1898, which survived for six years. A later NBL was formed in 1937 and existed until 1949, when it merged with the three-year-old Basketball Association of America to become the National Basketball Association (NBA). Currently, there is one women's professional basketball league in the United States and a number of men's and women's professional leagues around the world. Basketball is one of the core sports played at high schools and colleges in the United States.

Considering the popularity of basketball, the amount of statistical research on the sport has been small compared with other sports. The topics of the chapters in this section are representative of the basketball research topics in various statistical journals. Two chapters of this section consider modeling the National Collegiate Athletic Association (NCAA) basketball tournament. The remaining three chapters investigate modeling the outcomes of individual shots. For a more in-depth analysis of basketball statistics research, including a discussion of these five papers, see the book Statistics in Sport, edited by Jay Bennett.

17.2 Modeling BasketballTournaments

The NCAA basketball tournament consists of four regionaltournaments, the winners of which advance to the "FinalFour" to determine the U.S. collegiate champion. Datingback to 1985, a regional tournament has consisted of 16teams, seeded 1, 2 , . . . , 16, by a panel of experts, with the1 seed going to the team perceived to be the best, the 2seed going to the team perceived to be second best, andso on. Let P(i; j) denote the probability that seed i willdefeat seed j and let W, denote the probability that seed iwill win a regional tournament.

In 1991, Schwertman, McCready, and Howard proposedthree models for the P(i; j). None of their models has anyunknown parameters. The motivation for the models wasto provide classroom examples that illustrate how individ-ual probabilities can be combined to obtain the probabilityof an event of interest, in this case, the Wi 's.

In Chapter 20 of this section, Schwertman, Schenk, and Holbrook propose and examine eight regression models that use data from 10 years of tournaments to estimate the values of P(i, j). Once the P(i, j)'s have been estimated, they can be combined, as in Schwertman, McCready, and Howard (1991), to provide estimates of the Wi's. The chapter discusses the performances of the eight models at estimating the P(i, j)'s and the performances of the 11 models (eight regression plus three from Schwertman, McCready, and Howard (1991)) at estimating the Wi's.
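To make the combination step concrete, here is a minimal Python sketch of how a matrix of head-to-head probabilities can be reduced to regional championship probabilities by walking the standard 16-team bracket round by round. It is illustrative only (it is not the Fortran program mentioned in these papers); the function name and the convention that P[i][j] holds the probability that seed i beats seed j are assumptions made here.

    # Bracket order of seeds in a 16-team NCAA regional (standard first-round pairings).
    BRACKET = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

    def regional_win_probs(P):
        """Return {seed: probability of winning the four-round regional}.
        P[i][j] is assumed to hold the probability that seed i beats seed j."""
        surv = {s: 1.0 for s in BRACKET}          # P(seed still alive) before each round
        for rnd in range(4):
            size = 2 ** (rnd + 1)                 # bracket-group size this round: 2, 4, 8, 16
            new_surv = {}
            for g in range(0, 16, size):
                group = BRACKET[g:g + size]
                left, right = group[:size // 2], group[size // 2:]
                for side, other in ((left, right), (right, left)):
                    for s in side:
                        # s must already be alive and must beat whichever opponent emerges.
                        new_surv[s] = surv[s] * sum(surv[t] * P[s][t] for t in other)
            surv = new_surv
        return surv

With a consistent input (P[j][i] = 1 - P[i][j]), the sixteen returned probabilities sum to one; these are the Wi's discussed above.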

Chapter 18 by Carlin provides an alternative approach. The earlier works use seed position as the only predictor of outcome. Carlin suggests that one might get improved models by using a computer ranking of teams, in particular the Sagarin ratings published in USA Today, or casino point spreads, as predictors. Carlin's suggestions have two new features. First, the probability of seed i winning is allowed to vary from region to region and from year to year. Second, the probability of winning the region need not decrease with seed number.

17.3 The Hot Hand Phenomena

Chapters 19, 21, and 22 explore individual shot attempts. Note that each chapter title includes the phrase "hot hand." The basic question addressed in these chapters is simple: Is the model of Bernoulli trials adequate for describing the outcomes of a player's shots? If the answer is no, then one explanation is that there is nonstationarity in the sequence; that is, the probability that the player makes a shot changes over time. Alternatively, there may be an autocorrelation structure in the shooting sequence, where the probability of making a shot will depend on a player's previous performance.

The authors of Chapter 21 believe that the Bernoulli trials model is adequate for modeling shooting data. Their chapter begins with a discussion of difficulties researchers have with predicting random sequences. For example, researchers tend to predict too many switches (from success to failure or failure to success) in a random sequence. Next, the authors analyze data from three sources: professional (NBA) game shooting from the floor; NBA game free-throw shooting; and collegiate practice shooting from the floor. Their analyses of these data certainly support their conclusion, but they have several serious problems. First, they search for only one type of departure from Bernoulli trials, which is autocorrelation. They make no attempt to search for nonstationarity, which may be a reasonable description of the shooting patterns. Second, the sample sizes in the collegiate study are so small that there is little power for any reasonable autocorrelation alternative. Third, when the authors find significant results, they discount them by distorting the size of the estimated effect; for example, for a player who has a 60% success rate after a success compared to only a 40% success rate after a miss, the authors describe this difference as an autocorrelation of 0.20, which they say is "quite low." No serious basketball fan would argue that the practical difference between 60% and 40% shooters (from a fixed location) is unimportant!
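As a concrete illustration of the two quantities at issue, the short Python sketch below computes a player's hit probability conditional on the outcome of the previous shot and the lag-1 serial correlation from a 0/1 sequence of outcomes. The function name is ours, and this is only the simplest version of the tests discussed in these chapters.

    import numpy as np

    def hot_hand_summary(shots):
        """shots: a player's field-goal outcomes (1 = hit, 0 = miss) in order.
        Returns P(hit | previous hit), P(hit | previous miss), lag-1 correlation."""
        x = np.asarray(shots, dtype=float)
        prev, curr = x[:-1], x[1:]
        p_after_hit = curr[prev == 1].mean() if (prev == 1).any() else float("nan")
        p_after_miss = curr[prev == 0].mean() if (prev == 0).any() else float("nan")
        r1 = np.corrcoef(prev, curr)[0, 1]    # serial correlation between successive shots
        return p_after_hit, p_after_miss, r1

A 60% success rate after a hit and a 40% success rate after a miss, for example, corresponds to a lag-1 correlation of roughly 0.20, which is the effect size discussed above.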

In Chapter 19 Larkey, Smith, and Kadane examine data from 39 NBA telecasts from the 1987-88 NBA season. They analyze 18 players, including Detroit's Vinnie Johnson, who had a reputation for being an extremely streaky shooter, and many of the top stars in the NBA. They do not attempt statistical inference; their work is purely descriptive. They want to see whether they can find any justification for the widespread belief that Vinnie Johnson is a particularly streaky shooter. They conjecture that whatever Johnson is doing, it must be noticeable and memorable, and perhaps unlikely. They present many clever ideas. First, they focus on nonstationarity rather than autocorrelation and do this by searching the data for streaks of successes. Second, they incorporate game-time into their analysis in the following way. They argue that if a player makes five successive shots in a short period of time, say three minutes of game-time, this feat is more noticeable and memorable than making five successive shots equally spaced (in time) during a 48-minute game. From their analysis, they conclude that Vinnie Johnson's reputation as a streak shooter is apparently well deserved; he is different from other players in the data in terms of noticeable, memorable field goal shooting accomplishments.

In the final chapter in this section, Wardrop reexamines Tversky and Gilovich's free-throw data from Chapter 21, which introduced this set of data with the following question asked of a sample of basketball fans: When shooting free-throws, does a player have a better chance of making his second shot after making his first shot than after missing his first shot? 68% of the fans answered yes. Tversky and Gilovich analyze data on nine members of the Boston Celtics for the seasons of 1980-81 and 1981-82 and conclude that the correct answer is no, and, hence, that a majority of the fans are incorrect. Wardrop suggests a different interpretation of these free-throw data. He shows that if the data are aggregated over players, then the correct answer to the question becomes yes. He points out that aggregation might be appropriate for the purpose of trying to understand why the fans believe what they do. In order for a fan to adapt Tversky and Gilovich's analysis of the Celtics to his/her experiences, it must be assumed that the fan has separate two-by-two contingency tables for each of the hundreds, if not thousands, of players he/she has seen play. This is a big assumption to make! Wardrop suggests that it might be more reasonable to assume that the fan has a table only for the aggregated data. Thus, instead of concluding that, as Tversky and Gilovich write, "People ... 'detect' patterns even where none exist," it would be more productive to teach people the dangers of aggregation. Wardrop also shows that the Celtics players, as a group, were statistically significantly better free-throw shooters on their second attempts than on their first.
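Wardrop's aggregation point is easy to reproduce with invented numbers. In the sketch below, two hypothetical players each have completely independent first and second free throws, yet the aggregated data answer the fans' question with "yes" simply because making the first shot is informative about which player is shooting; all rates and counts here are made up for illustration.

    # Two hypothetical players, each with NO dependence between their two free
    # throws, but with different skill levels (rates and pair counts are invented).
    players = [(0.85, 400), (0.60, 400)]          # (per-shot probability, pairs shot)

    hits_after_hit  = sum(p * n * p for p, n in players)        # expected 2nd makes after a 1st make
    first_hits      = sum(p * n for p, n in players)
    hits_after_miss = sum((1 - p) * n * p for p, n in players)  # expected 2nd makes after a 1st miss
    first_misses    = sum((1 - p) * n for p, n in players)

    print(hits_after_hit / first_hits)     # roughly .75
    print(hits_after_miss / first_misses)  # roughly .67 -- aggregation alone creates the gap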

Reference

Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability models for the NCAA regional basketball tournaments," The American Statistician, 45, 35-38.


Chapter 18

Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information

Bradley P. CARLIN

Several models for estimating the probability that a given team in an NCAA basketball tournament emerges as the regional champion were presented by Schwertman, McCready, and Howard. In this article we improve these probability models by taking advantage of external information concerning the relative strengths of the teams and the point spreads available at the start of the tournament for the first round games. The result is a collection of regional championship probabilities that are specific to a given region and tournament year. The approach is illustrated using data from the 1994 NCAA basketball tournament.

KEY WORDS: Computer ratings; Oddsmaking; Sports outcome prediction.

1. INTRODUCTION

For many years the analysis of sports statistics was limited to huge tables of sums, means, and the occasional descriptive display. More recently, however, statisticians have begun to undertake more sophisticated analyses of these data. In teaching they can provide interesting and more easily grasped illustrations of important concepts (see, e.g., Albert 1993). In research they often enable testing of new approaches for handling difficult modeling scenarios, such as nonrandomly missing data (Casella and Berger 1994) and time-dependent selection and ranking (Barry and Hartigan 1993). And, of course, in many cases the data are interesting in and of themselves; recent papers have investigated the existence of the "hot hand" in basketball (Tversky and Gilovich 1989; Larkey, Smith, and Kadane 1989) and the likelihood of "Shoeless" Joe Jackson's complicity in the famous "Black Sox" scandal (Bennett 1993). The recent creation of a new section of the American Statistical Association devoted to sports statistics provides further testimony to their increasing popularity.

Perhaps the oldest inferential problem related to sports statistics is that of predicting the ultimate winner of some event, based on whatever information is available concerning the various competitors. In the realm of college basketball the most talked-about such event is the NCAA men's tournament, held every year in March and early April. In this tournament 64 teams (some invited by a selection committee, others receiving automatic bids thanks to their having won their own conference tournaments) are divided into four regional tournaments (West, Midwest, East, and Southeast) of 16 teams each. Some effort is made by the committee to balance the overall team strength in each region, while at the same time to place teams in the appropriate geographical region. The teams in each region are then "seeded" (ranked) based on their relative strengths as perceived by the committee. In a given region the tournament begins by having the strongest team (seed 1) play the weakest team (seed 16), the second-strongest (seed 2) play the second-weakest (seed 15), and so on. The winners of these eight first round games then play off in a predetermined order (e.g., the 1-16 winner plays the 8-9 winner) in four second round games, and so on until a single regional champion is determined after four rounds of play. Finally, the four regional champions face each other in fifth and sixth round games to determine a single national champion. For the time being we focus only on prediction of the regional champions (the "Final Four").

Bradley P. Carlin is Associate Professor, Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455-0392. The author thanks Prof. Neil Schwertman for supplying the Fortran code to convert the P matrix into the regional championship probabilities, Prof. Jim Albert for supplying the 1994 pretournament Sagarin ratings, Prof. Hal Stern for invaluable advice and for suggesting the scoring rule used in Table 4, and Profs. Lance Waller and Alan Gelfand for helpful discussions.

In a recent paper, Schwertman, McCready, and Howard (1991) consider three alternatives for specifying a 16 × 16 matrix P of regional win probabilities. That is, P(i, j) is the probability that seed i defeats seed j in a contest between the two on a neutral court where, of course, i ≠ j and P(j, i) = 1 − P(i, j). Together with the assumption that the games are independent, they derive the probability that seed i wins the region for i = 1, ..., 16 using elementary (although fairly tedious) calculations implemented in a Fortran program. Their models for P(i, j) are somewhat ad hoc, although the most sophisticated (and best fitting) plausibly assumes a normal distribution of national team strengths, with the 64 tournament teams comprising the upper tail of this distribution. Subsequent work by Schwertman, Schenk, and Holbrook (1993) refines the approach by using past NCAA tournament data to fit linear and logistic regression models for P(i, j) as a function of the difference in either team seeds or normal scores of the seeds.

In this article, we extend this approach by taking advantage of valuable external information available at the tournament's outset. Specifically, we may employ any of the various computer rankings of the teams, such as RPI index, Sagarin ratings, and so on, which typically arise as a linear function of several variables (team record, opponents' records, strength of conference, etc.) monitored over the course of the season preceding the tournament. These rankings provide more refined information concerning relative team strengths than is captured by the regional seedings. Such rankings also enable differentiation between identically seeded teams in different regions.

A second source of information for the first round games is the collection of point spreads offered by casinos and sports wagering services in states that allow gambling on college basketball. A point spread is a predicted amount by which one team (the "favorite") will defeat the other (the "underdog"); gamblers may bet on whether or not the favorite's actual margin of victory will exceed the point spread ("cover the spread"). Point spreads are potentially even more valuable as pregame data than computer rankings because, besides team strengths, they account for game- and time-specific information, such as injuries to key players. Previous work by Harville (1980) and Stern (1992) shows that point spread information is the "gold standard" against which all other pregame information as to outcome must be judged.

Table 1. Data from Round 1 of the 1994 NCAA Tournament

Region      j - i   S(i) - S(j)    Yij     Rij
West         15        20.28       20       23
West         13        18.43       24       26
West         11        11.95       18        9
West          9         9.81       11       14
West          7         4.78        6       -4
West          5         6.39        8.5     14
West          3         1.20        4        3
West          1         1.64        4       -8
Midwest      15        23.59       28       15
Midwest      13        13.32       16       18
Midwest      11         8.97       11.5      4
Midwest       9         8.58       12       10
Midwest       7         4.45        4      -10
Midwest       5         5.71        7       14
Midwest       3         1.04       -1.5     -8
Midwest       1         1.43        2       -7
East         15        20.37       25       20
East         13        15.78       18.5     18
East         11        10.98       10.5      2
East          9         9.31       11.5     22
East          7         2.00        4.5     12
East          5         5.53        5      -10
East          3         2.53        4       -5
East          1         -.19       -3.5     -3
Southeast    15        20.87       23       31
Southeast    13        18.32       20.5     12
Southeast    11        17.31       18       13
Southeast     9         9.90       10.5     29
Southeast     7         5.71        9       10
Southeast     5         4.96        7       22
Southeast     3         2.60        2       11
Southeast     1         5.63        5       -6

Unfortunately, point spreads for potential games in rounds 2-4 will be unavailable at the tournament's outset. To remedy this, in Section 2 we describe an approach for imputing point spreads for these later games, and subsequently converting the resulting point spread matrix into the win probability matrix P. Section 3 applies our approach to data from the 1994 NCAA men's basketball tournament. Finally, Section 4 summarizes our findings and comments briefly on the prediction of the ultimate NCAA basketball national champion.

2. DETERMINATION OF WIN PROBABILITIES

Working with data from three seasons of professional football, Stern (1991) showed that the favored team's actual margin of victory, R, was reasonably approximated by a normal distribution with mean equal to the point spread, Y, and standard deviation σ = 13.86. That is,

Pr(favorite defeats underdog) = Pr(R > 0) ≈ Φ(Y/σ),    (1)

where Φ(·) denotes the cumulative distribution function of the standard normal distribution. This rather surprising result indicates that the group of bettors who determine the point spreads are, on the average, correct in their predictions of game outcome. (Note that it would be wrong to give too much credit for this accuracy of the point spread to the bookies, who merely set an initial spread and subsequently raise or lower it so that roughly the same total amount is bet on both the favorite and the underdog.)

Subsequent unpublished analysis by Stern of two seasons of professional basketball data indicates that (1) again holds, this time with σ = 11.5. Intuition suggests that this σ value may be a bit large for our purposes because college basketball is generally a lower scoring game, and we would expect the variability in the victory margin to increase with the total points scored. Indeed, data from the first four rounds of the 1994 NCAA tournament produce a value of σ = 8.83, although using this precise value in our analysis would of course be unfair because it was unavailable at the tournament's outset. In what follows we take σ = 10 as a reasonable and somewhat conservative compromise.

[Figure 1. Regression of Round 1 Point Spreads on Differences in Seeding and Sagarin Rating. Fitted lines: Yij = 2.312 + 0.100(j − i)², R² = .883 (seed differences); Yij = 1.165[S(i) − S(j)], R² = .981 (Sagarin differences).]

Table 2. Comparison of Pr(Seed Wins Southeast Region) Across Models

Seed  Team              Schwertman  Seed        Sagarin      Sagarin     Sagarin regression
                        method      regression  differences  regression  with R1 spreads
 1    Purdue              .459        .326         .316        .343          .349
 2    Duke                .188        .235         .151        .148          .150
 3    Kentucky            .110        .155         .245        .260          .255
 4    Kansas              .068        .110         .111        .108          .103
 5    Wake Forest         .047        .064         .032        .024          .027
 6    Marquette           .036        .045         .026        .019          .021
 7    Michigan State      .026        .028         .033        .026          .025
 8    Providence          .015        .018         .067        .061          .058
 9    Alabama             .011        .008         .005        .003          .003
10    Seton Hall          .011        .005         .010        .006          .007
11    SW Louisiana        .009        .003         .002        .001          .001
12    Charleston          .006        .001         .002        .001          .000
13    TN-Chattanooga      .005        .001         .001        .000          .000
14    Tennessee State     .004        .000         .000        .000          .000
15    Texas Southern      .003        .000         .000        .000          .000
16    Central Florida     .001        .000         .000        .000          .000
      Sum                1.000       1.000        1.000       1.000         1.000
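Equation (1) is the only probabilistic machinery needed to turn a real or imputed point spread into a win probability. The following minimal Python sketch performs that conversion with the σ = 10 adopted above; the function name is ours and is not part of the original article.

    from scipy.stats import norm

    SIGMA = 10.0   # margin-of-victory standard deviation adopted in the text

    def win_prob(spread, sigma=SIGMA):
        """Probability that the favored team wins, via equation (1): Phi(Y / sigma)."""
        return norm.cdf(spread / sigma)

For example, win_prob(4.0) is about .66, so a 4-point favorite is treated as roughly a two-to-one proposition.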

Hence for a given region, we may use Equation (1) with the point spreads for the first round games to determine P(i, 17 − i), i = 1, ..., 16. But this fills in only the antidiagonal (lower left to upper right) of P; how should we determine the remaining entries? A natural solution would be to obtain a general prediction equation for the point spread Y in terms of the difference in team seeding (perhaps after some suitable transformation), and then again use (1) to complete the P matrix. Relevant data for this calculation from the 32 first round games in the 1994 tournament are displayed in Table 1. Besides the seeding differences (j − i) and point spreads Yij for each game matching seeds i and j where i < j, the table also shows the actual victory margin Rij for comparison. Our Yij values were obtained immediately prior to the beginning of tournament play from the Thursday morning edition of a local newspaper, which in turn got them from a prominent Las Vegas oddsmaker. A negative value of Yij implies that the team with the poorer seeding (team j) was favored by the bettors to win the game; a negative value of Rij indicates that the result was, in fact, a victory by the poorer seed (an "upset").

Table 3. Comparison of Pr(Seed Wins Region) Across Regions

Seed    West    Midwest    East    Southeast
 1      .182      .310     .349      .349
 2      .379      .232     .306      .150
 3      .147      .176     .089      .255
 4      .103      .093     .109      .103
 5      .053      .046     .043      .027
 6      .072      .047     .041      .021
 7      .009      .013     .020      .025
 8      .037      .044     .009      .058
 9      .011      .020     .017      .003
10      .003      .011     .004      .007
11      .002      .002     .002      .001
12      .003      .005     .010      .000
13      .000      .000     .000      .000
14      .000      .001     .000      .000
15      .000      .000     .000      .000
16      .000      .000     .000      .000
Sum    1.000     1.000    1.000     1.000

The first column of plots in Figure 1 shows the results of regressing point spread on squared seeding difference. The fitted regression line obtained is Yij = 2.312 + 0.100(j − i)², where i < j. The extremely close agreement between this fitted regression line (solid line in upper panel) and the lowess smoothing line (dotted line) indicates a high degree of linearity on this scale, and the standardized residual plot in the lower panel does not indicate any failure in the usual regression assumptions. Further, the R² value of .883 indicates reasonably good fit. Note that the data line up in eight vertical columns because there are four games pitting seed i against seed (17 − i) for i = 1, ..., 8.

In the second column of plots in Figure 1, we replace squared seed difference as the predictor variable with the difference in Sagarin rating, a numerical measure of team strength that we denote by S(k) for seed k. These ratings, which account for the won-lost record in Division I games and strength of schedule, are published every Monday during the season in the newspaper USA Today. Table 1 gives the relevant differences from the collection of rankings published on the Monday immediately prior to the 1994 first round games. The ratings are designed to produce hypothetical point spreads when differenced, and our results bear this out: the data seem to require no transformation to achieve linearity, and the intercept in the full regression model is not significantly different from zero. Forcing the line through the origin, we obtain the fitted model Yij = 1.165[S(i) − S(j)], where i < j. The fitted slope coefficient suggests that the rating difference tends to slightly underestimate the point spread in matches between opponents of widely differing strengths. The improved R² value of .981 confirms the visual impression from the figure that the Sagarin rating difference is superior to seed difference as a predictor of point spread. We remark that while the best 39 of the 301 Division I teams (as measured by Sagarin rating) were included in the 1994 tournament, the four #16 seeds had Sagarin rankings 166, 173, 196, and 216, calling into question the assumption of Schwertman et al. (1991) that the 64 tournament teams may be safely thought of as the best 64 teams in the country.

Table 4. Information Scores for the Five Tournament Probability Estimation Methods

Region       Schwertman  Seed        Sagarin      Sagarin     Sagarin regression
             method      regression  differences  regression  with R1 spreads
West           -.116       -.111       -.106        -.101         -.102
Midwest        -.134       -.147       -.134        -.134         -.127
East           -.154       -.148       -.149        -.152         -.145
Southeast      -.114       -.103       -.116        -.114         -.111
All            -.517       -.508       -.505        -.502         -.485
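Putting the pieces together, a sketch of how the win probability matrix described here could be assembled is given below. The function name and data structures (a dict of Sagarin ratings keyed by seed, and an optional dict of actual first round spreads) are our own illustrative choices, not code from the article.

    import numpy as np
    from scipy.stats import norm

    def build_P(sagarin, r1_spreads=None, slope=1.165, sigma=10.0):
        """Assemble the 16 x 16 matrix of head-to-head win probabilities.
        sagarin: dict {seed: Sagarin rating}.  r1_spreads, if supplied, maps the
        eight first round pairings (i, j), i < j, to the actual point spread of
        seed i over seed j.  All other spreads are imputed from the fitted
        no-intercept model Y = 1.165 [S(i) - S(j)]."""
        P = np.full((17, 17), np.nan)       # index by seed 1..16 for convenience
        for i in range(1, 17):
            for j in range(1, 17):
                if i == j:
                    continue
                y = slope * (sagarin[i] - sagarin[j])   # imputed spread, i over j
                if r1_spreads is not None:
                    if (i, j) in r1_spreads:
                        y = r1_spreads[(i, j)]
                    elif (j, i) in r1_spreads:
                        y = -r1_spreads[(j, i)]
                P[i, j] = norm.cdf(y / sigma)           # equation (1) with sigma = 10
        return P

Because the spread changes sign when the seeds are interchanged, the construction automatically satisfies P(j, i) = 1 − P(i, j); the resulting matrix can then be reduced to regional championship probabilities with a bracket-walking routine such as the one sketched in the introduction to these basketball articles.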

3. APPLICATION TO THE 1994 NCAA TOURNAMENT

We begin by comparing the results of several approaches suggested by Figure 1 using the 1994 Southeast regional tournament data, because differences among the approaches are most apparent using data from this region. Table 2 gives the estimated probability of emerging as the regional champion for each of the 16 teams. The first method listed is the one recommended by Schwertman et al. (1991). This method assigns nearly 50% of the mass to the #1 seed, while giving only 5% to the entire lower half of the bracket (seeds 9-16). The next column provides results obtained by using the regression of point spread on squared seed difference to obtain the win probabilities. This method gives slightly more mass to the upper seeds other than #1, but even less mass to the lower division teams (total probability less than 2%). Like the Schwertman method, it uses only seed information to determine win probabilities, so the results in these columns would apply to any of the four regional tournaments. The seed regression results are specific to this year, however, because 1994 spread data were used to fit the model.

Listed next are results obtained by simply taking the unadjusted difference of the Sagarin ratings as the hypothetical point spread between the two teams. Note that the probability of a triumph by the #1 seed has dropped again, and total support for the lower division teams remains a mere 2%. Note also that these probabilities are specific to both year and region because teams with identical seeds in different regions need not have identical Sagarin ratings. Indeed, the probabilities are no longer strictly decreasing from seed 1 down to seed 16, due to ordering conflicts between the seedings and the Sagarin ratings (e.g., #2 Duke rated 89.90, #3 Kentucky rated 91.59). The next column in the table adjusts the Sagarin differences using the regression model obtained in the previous section before converting them to win probabilities via Equation (1). This adjustment amounts to giving a boost to the two most highly rated teams in the region (Purdue and Kentucky) at the expense of the others; notice that the total probability allocated to the lower half of the bracket is now barely 1%. Finally, the last column replaces the imputed first round point spreads used in the previous method with the actual point spreads. This results in subtle changes to only 16 entries in the P matrix, but because they are the entries corresponding to the first games played, the effect on the regional championship probabilities is apparent, occasionally visible in the second decimal place.

Table 3 applies this final method (using the Sagarin regression model plus actual first round spreads) to data from each of the four 1994 regional tournaments. There are several reversals of the seeding order, the most interesting of which is the prediction of #2 Arizona, not #1 Missouri, as the team most likely to win the West regional. (To the model's credit this was indeed the outcome in this region.) Support for seeds in the lower half of each region is again quite low. The largest single probability in the table is .379 (Arizona), substantially lower than that given to any #1 seed by the Schwertman model, and perhaps indicative of the increased "parity" in college basketball often mentioned by sportswriters and coaches during the 1994 season. Again, the tournament results bear this out somewhat, as the actual regional champions were seeded 2 (Arizona), 1 (Arkansas), 3 (Florida), and 2 (Duke).

Schwertman et al. (1991) explore model fit more formally by comparing observed and expected numbers of teams with a given seeding to become regional champions over the first six years of 64-team NCAA tournament play (i.e., 24 regional champions). This method of model validation is unavailable to us, however, because the probabilities in Table 3 are specific to both team and year, not just seeding. Instead, we judge a method that produces the win probability matrix P using the scoring rule

I(P, Z) = (1/K) Σ_k [Z_k log p_k + (1 − Z_k) log(1 − p_k)],    (2)

where Z = {Z_1, ..., Z_K}, Z_k = 1 if the team with the more favorable seeding wins game k and Z_k = 0 otherwise, and p_k is the probability the method assigns to the more favorably seeded team winning game k.


Equation (2) is simply the average log-win probability predicted by the model for those teams that actually won, so that models are rewarded for assigning high probability to these teams. I(P, Z) can also be thought of more formally as the average Shannon information in the likelihood (see, e.g., Lindley 1956).

Table 4 gives the information scores for each of the five methods compared in Table 2. The bottom line sums over all K = 60 games in Rounds 1-4 of the 1994 tournament. To provide a reference point for the scores on this line, the completely naive method that assumes every game is an even toss-up (all P(i, j) = .5) would have a score of log(.5) = −.693. Notice the monotonic improvement in score as we move from left to right, with our final method (the one used in Table 3) emerging as noticeably better than the rest. From the component scores within each region we see that this final method scores higher in every region than the Schwertman method, which does not intend to be team- or region-specific. Note however that there is little difference in score among the methods for the Southeast region, where Kentucky's early exit and Duke's ultimate win contradicted the Sagarin ratings. But the methods based on these ratings perform quite well in the West region, where the ratings correctly predicted that #2 Arizona would beat #1 Missouri.
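A minimal sketch of the information-score calculation, assuming the model probabilities and game outcomes have been collected into parallel arrays, is given below; the function and variable names are ours.

    import numpy as np

    def information_score(probs, outcomes):
        """Average log-probability score of equation (2).
        probs[k]    -- model probability that the better-seeded team wins game k
        outcomes[k] -- 1 if the better-seeded team actually won game k, else 0"""
        p = np.asarray(probs, dtype=float)
        z = np.asarray(outcomes, dtype=float)
        return np.mean(z * np.log(p) + (1 - z) * np.log(1 - p))

    # Sanity check: a model that calls every game a toss-up scores log(.5) = -0.693.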

4. CONCLUSION

In this article we have developed a method for improved probability modeling of NCAA regional basketball tournaments. The method requires only elementary ideas in probability theory, statistical graphics, and linear regression analysis, and as such should provide an interesting and instructive exercise for students. Implementation for a given year requires only the Sagarin ratings for the appropriate 64 teams, perhaps the collection of first round point spreads (if refitting the regression model is desired), and the Fortran program for computing the P matrix and reducing it to the collection of regional championship probabilities (extended from the original program by Schwertman, and available from the author upon request via electronic mail).

We have argued on behalf of the use of (actual or imputed) point spreads in determining win probabilities, on the grounds that true point spreads are superior to computer rankings, which are in turn superior to the crude summaries provided by tournament seedings. Our regression analyses in Section 2 and the information scores in Table 4 support this belief, as does other, more anecdotal evidence from the 1994 tournament. For example, Table 1 shows that, in two first round games, the lower seed was actually favored by the bettors: East #9 (Boston College) 3.5 points over East #8 (Washington State), and Midwest #10 (Maryland) 1.5 points over Midwest #7 (St. Louis). The Sagarin ratings corrected the seeding error in the former case, but not in the latter (Maryland rating 83.43, St. Louis rating 84.47, a difference of 1.04). Perhaps the bettors were influenced in this case by their additional knowledge that St. Louis had lost three of their last four games prior to the tournament, or that the team's tallest player was injured, suggesting that they would have a hard time guarding Maryland's 6-foot, 10-inch center, Joe Smith. As it turned out, Smith had 29 points and 15 rebounds in the game, and Maryland won by 8 points.

As a final comment we note that in some cases interest may focus not on the prediction of the Final Four, but on the prediction of the ultimate national champion. While our ideas could, of course, be extended to the case of a single 64 × 64 P matrix, the programming involved in reducing this matrix to the vector of championship probabilities would be almost unbearably tedious. As a simple alternative we might assume that the Final Four teams are more or less evenly matched, and thus select as our national champion the team most likely to make it this far in the tournament (in our case, West #2 Arizona). It is worth pointing out, however, that this logic, along with Table 3, suggests that no single team would be likely to have even a 10% chance at the tournament's outset of winning the required six consecutive games, so that any such prediction is almost certain to be incorrect.

[Received April 1994. Revised October 1994.]

REFERENCES

Albert, J. H. (1993), "Teaching Bayesian Statistics Using Sampling Methods and MINITAB," The American Statistician, 47, 182-191.

Barry, D., and Hartigan, J. A. (1993), "Choice Models for Predicting Divisional Winners in Major League Baseball," Journal of the American Statistical Association, 88, 766-774.

Bennett, J. (1993), "Did Shoeless Joe Jackson Throw the 1919 World Series?" The American Statistician, 47, 241-250.

Casella, G., and Berger, R. L. (1994), "Estimation with Selected Binomial Information or Do You Really Believe that Dave Winfield is Batting .471?," Journal of the American Statistical Association, 89, 1080-1090.

Harville, D. (1980), "Predictions for National Football League Games via Linear-Model Methodology," Journal of the American Statistical Association, 75, 516-524.

Larkey, P. D., Smith, R. A., and Kadane, J. B. (1989), "It's Okay to Believe in the 'Hot Hand,'" Chance, 2(4), 22-30.

Lindley, D. V. (1956), "On the Measure of Information Provided by an Experiment," Annals of Mathematical Statistics, 27, 986-1005.

Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability Models for the NCAA Regional Basketball Tournaments," The American Statistician, 45, 35-38.

Schwertman, N. C., Schenk, K. L., and Holbrook, B. C. (1993), "More Probability Models for the NCAA Regional Basketball Tournaments," Technical Report, Department of Mathematics and Statistics, California State University, Chico.

Stern, H. (1991), "On the Probability of Winning a Football Game," The American Statistician, 45, 179-183.

Stern, H. (1992), "Who's Number One?—Rating Football Teams," in Proceedings of the Section on Sports Statistics (Vol. 1), Alexandria, VA: American Statistical Association, pp. 1-6.

Tversky, A., and Gilovich, T. (1989), "The Cold Facts about the 'Hot Hand' in Basketball," Chance, 2(1), 16-21.


Basketball players who do the improbable and memorable do have shooting streaks according to a new data set that rebuts the Tversky and Gilovich case against them.

Chapter 19

It's Okay to Believe in the "Hot Hand"

Patrick D. Larkey, Richard A. Smith, Joseph B. Kadane

If economics is known as the "dismal science" for its temerity to insist that resources are always scarce and allocation decisions always difficult, perhaps psychology should be known as the "debilitating science" for its continuing fascination with our gross inadequacies as thinking and acting organisms. Psychologists of an earlier era told us that much of our adult behavior was ultimately caused by unresolved emotional conflicts in childhood that arrest our development. Psychologists more recently have experimentally confirmed a host of limitations on our ability to remember and reason. Now two psychologists, Amos Tversky and Thomas Gilovich, tell us that if we are basketball fans and believe in the "hot hand" and "streak shooting," we are misconceiving the laws of chance and suffering from, of all things, a "cognitive illusion" (see "The Cold Facts About the 'Hot Hand' in Basketball," Chance, Winter 1989).

Tversky and Gilovich boldly claim that "this misconception of chance has direct consequences for the conduct of the game. Passing the ball to the hot player, who is guarded closely by the opposing team, may be a non-optimal strategy if other players who do not appear hot have a better chance of scoring. Like other cognitive illusions, the belief in the hot hand could be costly." While it is not entirely clear why offensive coaches would be more vulnerable than defensive coaches to holding and acting upon such a fallacious belief, Tversky and Gilovich's claim raises the possibility that legendary coaches such as Red Auerbach and Johnny Wooden suffered from cognitive illusions throughout their careers; if they had only conferred with the right psychologists and understood the error of their ways, they might have improved on their strategies and won even more games and championships.

Is nothing sacred? Are fans and coaches really wrong to believe that all basketball players occasionally get hot in their shooting and that a few players have a greater propensity than others to shoot in streaks? Perhaps not. This paper briefly reviews Tversky and Gilovich's conceptualization of the hot hand and streak shooting, proposes a different conception of how observers' beliefs in streak shooting are based on National Basketball Association (NBA) player shooting performances, and tests this alternative conception on data from the 1987-1988 NBA season.

For the Existence of Streak Shooting

Tversky and Gilovich argue that fan beliefs in the hot hand or streak shooting imply two specific departures in player shooting sequences from the simple binomial with a constant hit rate: (1) "... the probability of a hit should be greater following a hit than following a miss (i.e., positive association); and (2) ... the number of streaks of successive hits or misses should exceed the number produced by a chance process with a constant hit rate (i.e., nonstationarity)."

While Tversky and Gilovich report the results of questionnaire studies and shooting experiments, the empirical centerpiece of their argument is an analysis of field goal shooting data from the 48 home games of the Philadelphia 76ers and their opponents during the 1980-1981 NBA season contrasted with the expectations of a simple binomial process. They check for dependence and nonstationarity using a variety of tests including conditional probabilities, first order serial correlations, Wald-Wolfowitz, Lexis ratios, and a comparison of player results on four-shot sequences with expectations from binomials based on the players' sample shooting percentages.

Their data analysis yields no evidence of either dependence between shots or of nonstationarity. Tversky and Gilovich "attribute the discrepancy between the observed basketball statistics and the intuitions of highly interested and informed observers to a general misconception of the laws of chance that induces the expectation that random sequences will be far more balanced than they generally are, and creates the illusion that there are patterns or streaks in independent sequences."

There is, however, a serious problem with both their conceptualization and their data analyses for analyzing the origination, maintenance, and validity of beliefs about the streakiness of particular players. The shooting data that they analyze are in a very different form than the data usually available to observers qua believers in streak shooting. The data analyzed by Tversky and Gilovich consisted of isolated individual player shooting sequences by game. The data available to observers including fans, players, and coaches for analysis are individual players' shooting efforts in the very complicated context of an actual game.

To the extent that beliefs in relative player propensities for streak shooting are connected to shooting performance data, they are almost certainly connected to the data as presented to the believers. In order to do any analysis at all, of course, it is necessary to abstract radically from the enormous complexity of the observations of an actual basketball game. Tversky and Gilovich go beyond mere abstraction. Extracting individual player shooting sequences from an actual game is a very complex task. This task is clearly beyond the abilities of the average observer without external aids to memory. Even team and media "statisticians" do not routinely preserve shot sequence information on individual players.

Observers of NBA games see a succession of field goal shooting opportunities, interrupted by a lot of nonshooting activity, from the beginning to the end of a game. Each of these opportunities can be taken by any one of the ten players on the floor with the result of a hit or a miss. Most observers of real basketball games attend to the unfolding sequence of shooting opportunities in the game, not to the efforts of an individual player. Observers note individual players primarily when they do notable things in a cognitively manageable chunk of shooting opportunities. While we have no systematic basis for knowing precisely how large a "manageable chunk of opportunities" is (it would vary with both the observers and the overall content of the chunk), we estimate that it is not longer than 20 field goal opportunities, or a little less than half a quarter in the average NBA game in our data set. Longer chunks increase the probability of failures in the observers' attention and memory.

The observers' focus on the unfolding sequence of shot opportunities rather than the activities of an individual player suggests a very different model of player shooting activities than the one used by Tversky and Gilovich. It also suggests that their model, the separate binomial for each player, is wrong both descriptively and normatively. It is almost certainly not the model used by any observers of a real NBA game as the basis of their expectations about a player's shooting performance (i.e., whether that player is "hot" or "cold") because of the extreme difficulty in sorting out and remembering the individual shooting performances of ten to twenty players over the course of a real game. It is not a model that observers should use as the basis of their expectations about a player's shooting performance on the next shot or for an entire game because the model ignores game context, how that player's shooting activities interact with the activities of the other players.

Consider two players A and V who each have a run of 5 consecutive field goal successes in the same game in which 220 field goals were attempted. Beginning with the 70th shot, V makes 5 consecutive field goals for the entire floor; he takes and makes the 70th shot, the 71st shot, ..., and the 74th shot. A's 5 shots are maximally dispersed across the 220 shots in the game; he takes and makes the 1st shot, the 55th shot, the 110th shot, the 165th shot, and the 220th shot.

Suppose that both players have the same percentage of success, Pi = .5, in shooting field goals, that shots are independent, and that Pi is stationary across samples of shots. In the Tversky and Gilovich analysis each streak that is identical in length and content (hits and misses) for a particular player is equally probable. Yet, few of us would believe that the probabilities of A and V accomplishing their respective streak shooting feats are equal. We also do not believe that the probabilities of an observer seeing, noticing, and remembering the streaks of A and V are equal. The way a streak falls in the context of a game matters a lot to observers. Some streaks are much less probable and much more noticeable and memorable to observers than other streaks of the same length and identical hit and miss content.

The sequence for V is much less likely than the sequence for A. V's accomplishment is so unlikely that we did not see even one consecutive streak of 5 hits by one player with no intervening field goal activity by other players in our data on 39 NBA games with over 9000 field goal attempts by 139 different players. Indeed, we did not see even one 4-shot streak of this type. There were only two 3-shot streaks of this type in all the data. As for A's accomplishment, there were approximately 100 instances in our data where players made 5 consecutive shots with some intervening activity by other players.

In the example, any fan watching would notice and remember V's performance in the 70th to 74th shot in the game context. The noticeability and memorability of this sequence might be lessened if the shots were all tip-ins or if the break for the end of the first quarter occurred between the 71st and 72nd shot. It is hard to imagine plausible circumstances that leave this event as something other than the most memorable shooting sequence in the entire game. If all of the shots were 30-foot jumpers at the end of a close seventh game in the NBA championship series, the sequence and V would be instant legends.

On the other hand, A's sequence would probably be missed by everyone but his agent, relatives, and close friends watching the game. Player A would never be accused by anyone, except perhaps psychologists analyzing isolated player shooting sequences or A's agent renegotiating his contract, of having had the hot hand in this hypothetical game.

These differences in the probability, noticeability, and memorability of sequences in an unfolding context lead us to the following hypothesis: The field goal shooting patterns of players with reputations for streakiness will differ from the patterns of reputationless players; a streak shooter will accomplish low-probability, highly noticeable and memorable events with greater frequency than reputationless players in the data set and with greater frequency than would be expected of him in the context of a game.

All that remains is to test this hypothesis.

The Hot Hand In and Out of Context

Our data consist of 39 games telecast in the Pittsburgh market during the 1987-1988 season. The games were videotaped and all shots were coded by the authors in sequence from the beginning to the end of the game. The data include teams with varying frequency:

1. Detroit Pistons - 20 games
2. Boston Celtics - 17 games
3. Los Angeles Lakers - 16 games
4. Atlanta Hawks - 6 games
5. Dallas Mavericks - 5 games
6. Chicago Bulls - 4 games
7. New York Knicks - 3 games
8. Utah Jazz - 2 games

and one game each for the Philadelphia 76ers, the Washington Bullets, the Denver Nuggets, the Milwaukee Bucks, and the Cleveland Cavaliers.

Twenty-four of the games were playoff games with the balance in the regular season. The data set was driven primarily by the scheduling habits of Pittsburgh television stations, WTBS (Atlanta), and WOR (New York) from January to the conclusion of the 1987-1988 playoffs. While every effort was made to tape the games of the Celtics, Lakers, Hawks, Bulls, and Mavericks, taping Piston games had the highest priority. One of the Pistons' players, Vincent ("Vinnie") Johnson, "the Microwave," has arguably the greatest current reputation in the NBA for streak shooting. His entry in The Complete Handbook of Pro Basketball (15th Edition, 1989, Signet, pp. 244-245) is:

The Microwave was only lukewarm. ... Had worst shooting season (.443) since rookie year. ... And his 12.2 ppg was lowest in 6 1/2. ... Still the most lethal streak shooter in game who can pour in line-drive jumpers by the bushel. ... When he's on, nothing known to man can stop him. ... Best reserved for reserve role due to his streakiness. And when he hits his first two shots, opponents can light novena candles ...

In contrast, the same publication does not mention the hot hand or streak shooting in describing any of the other players analyzed. For our analysis Johnson's reputation as a streak shooter is fairly current and distinctively different from the reputations of the other players. While data from Vinnie's best years, the years in which he was building his reputation as the "most lethal streak shooter" in the game, would afford the best chance of differentiating him from other players, such data are not available.

One hundred and thirty-nine different players took at least one shot in the 39 games. We reduced the set of players for analysis to 18, including those who attempted more than 100 field goals or who attempted 10 or more field goals in 4 or more games. The Pistons, Celtics, and Lakers are disproportionately represented, but these were 3 of the best teams in the NBA. Our data include at least a few games for a number of the most skilled offensive players from other teams in the NBA such as Michael Jordan, Dominique Wilkins, and Mark Aguirre.

We hypothesize that Vinnie's field goal shooting patterns will differ from the patterns of reputationless players. Further, these differences will take the form of Vinnie accomplishing low-probability, highly noticeable and memorable events with greater frequency than other players in the data set and with greater frequency than would be expected of him in the context of a game. We do not expect Vinnie to look different from the other players in the data set from acontextual analyses of his shooting patterns much like those of Tversky and Gilovich. If he does look different on this portion of our analysis, it implies that Tversky and Gilovich lacked a streak shooter, at least one of Vinnie's distinction, in their data set.

We began with acontextual player shooting sequences and repeated two of the analyses utilized by Tversky and Gilovich: serial correlations between the outcomes of successive shots and the probabilities of a hit conditioned on 1, 2, and 3 hits or misses. The results are shown in Table 1.

Table 1. Conditional Probabilities of a Hit and Autocorrelation

Player       p(H/3M)    p(H/2M)    p(H/1M)    p(H)       p(H/1H)    p(H/2H)    p(H/3H)    ACF(1)
Jordan       .57(7)     .47(17)    .53(43)    .55(104)   .56(57)    .47(32)    .47(15)     .027
Bird         .49(57)    .40(103)   .38(177)   .44(338)   .49(145)   .51(69)    .47(34)     .141
McHale       .58(12)    .62(37)    .61(108)   .57(270)   .53(146)   .60(72)    .55(40)    -.068
Parish       .56(9)     .58(27)    .48(166)   .52(163)   .54(80)    .61(41)    .65(23)     .107
D. Johnson   .54(26)    .45(56)    .44(109)   .41(201)   .35(75)    .42(24)    .50(8)     -.107
Ainge        .40(25)    .41(49)    .43(96)    .42(184)   .44(71)    .46(28)    .36(11)     .014
D. Wilkins   .62(21)    .53(45)    .51(92)    .47(176)   .42(78)    .33(33)    .27(11)    -.088
E. Johnson   .61(28)    .45(60)    .46(123)   .43(230)   .40(91)    .50(34)    .33(15)    -.045
A-Jabbar     .38(24)    .48(50)    .49(103)   .47(209)   .49(90)    .61(41)    .50(22)     .015
Worthy       .73(22)    .54(59)    .48(124)   .47(269)   .48(119)   .48(54)    .64(25)     .025
Scott        .60(20)    .56(48)    .50(109)   .52(246)   .54(121)   .55(60)    .55(31)     .037
Aguirre      .70(10)    .54(24)    .47(47)    .46(93)    .39(41)    .40(15)    .33(6)     -.077
Dantley      .33(21)    .43(42)    .50(101)   .50(224)   .51(104)   .50(50)    .50(24)     .014
Laimbeer     .41(17)    .47(45)    .47(103)   .47(219)   .43(96)    .45(40)    .39(18)    -.084
Dumars       .60(25)    .46(52)    .47(115)   .45(234)   .41(99)    .40(40)    .40(15)    -.035
Thomas       .49(45)    .47(93)    .44(187)   .44(361)   .44(154)   .41(66)    .50(26)    -.001
V. Johnson   .47(17)    .45(42)    .44(97)    .46(213)   .49(96)    .51(45)    .82(23)     .036
Rodman       1.00(3)    .68(13)    .63(38)    .62(112)   .55(55)    .78(23)    .92(12)    -.057
Wt. Mean     .5248      .4809      .4727      .4782      .4883      .5034      .5402      -.0016

The results from this data set are not as clearly against the existence of streak shooting as Tversky and Gilovich's results. Half of the players have positive serial correlations, indicating some positive dependence between pairs of shots with like results. Larry Bird has the highest positive serial correlation at .141; this is the only one significantly different from a zero expectation of the simple binomial assuming independence. At .036, Vinnie has the third largest positive serial correlation.

Vinnie is one of six players whose conditional probabilities increase monotonically with the number of prior hits. He is not, however, the most interesting case. Robert Parish of the Boston Celtics and Dennis Rodman of the Detroit Pistons show considerable evidence that for them "success breeds success" in shooting field goals. For Rodman, it also looks as if "failure breeds success."

The entries in Table 2 are simple proportions: the number of times that each player accomplished a hit sequence of a given length from 8 to 3 divided by the number of sequence opportunities of that length that the player had. The numerator is the number of times that a player went 8 for 8, 7 for 7, and so on. The denominator is the number of times the player took a sequence of shots of length L (L = 8, 7, 6, ..., 3). If a player takes 9 shots in a game, this game contributes two 8-shot opportunities, three 7-shot opportunities, four 6-shot opportunities, and so forth.
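The occurrences-to-opportunities counting just described is easy to express in code. The sketch below handles the perfect-run case of Table 2 for one game; the function name is ours, and Table 3's one-miss variant would simply count windows containing exactly one miss instead.

    def runs_and_opportunities(shots, L):
        """Occurrences and opportunities for perfect hit runs of length L within
        one game's 0/1 shot sequence, following the counting rule in the text."""
        n = len(shots)
        opportunities = max(n - L + 1, 0)
        occurrences = sum(1 for k in range(opportunities)
                          if all(shots[k:k + L]))   # all L shots in the window are hits
        return occurrences, opportunities

    # Example: a 9-shot game contributes two 8-shot opportunities, three 7-shot
    # opportunities, and so on, matching the counting rule described above.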

In 4% of all the 8-shot sequences that Robert Parish took in our data set, he made all 8 shots. Dennis Rodman stands out in this table as a candidate streak shooter. He clearly dominates all of the other players on this simple proportional measure. A look at the data reveals that Rodman had 1 game in which he made 9 consecutive shots, the longest isolated player streak of hits in the entire data set. Because both occurrences and opportunities are counted completely, the single long sequence of hits contributes substantially to all sequence lengths. Rodman also had the highest shooting percentage in our data set (see Table 1). Vinnie Johnson, purportedly the most lethal streak shooter in the NBA, disappears completely relative to other players by this measure.

Table 3 differs from Table 2 only in that the runs are imperfect; there was one miss in each sequence of given length. Here we see Michael Jordan and Kevin McHale, who were not notable at all in Table 2 on perfect runs,

Table 2. Perfect Runs of Hits-Acontextual 'Occurrences to

Opportunities"

Player 8/8 7/7 6/6 5/5 4/4 3/3 Player

7/8 6/7 5/6 4/5 3/4 2/3

JordanBirdMcHaleParish D. Johnson

Ainge D. Wilkins E. JohnsonA-Jabbar WorthyScottAguirre DantleyLaimbeer DumarsThomasV. JohnsonRodman

.04

.001

.17

.02.01.02.07

.02.03.03

.02

.01.01.03.13

.05.02.05.09.01.01

.01.04.05.06

.04

.01.01.02.04.16

.08

.06.10.13.03.03.02.03.07.08.09.03.07

.04.03.04.0818

.18.11.18.19.06.0907.09.14.11.15.07.13.10.08.08.13.24

JordanBirdMcHale Parish D. Johnson Ainge D. WilkinsE. JohnsonA-JabbarWorthyScottAguirreDantley Laimbeer Dumars ThomasV. JohnsonRodman

.11.03.0604

.01

.03

.04

.10

.04

.04

.11

.16

.06

.14

.03

.01

.01

.01

.04

.08

.06

.10

.06

.01

.01

.06

.14

.05

.19

.06

.19

.06

.0304

.03

.09

.12.09.13.01.0908

.03

.0817

.10

.24

12

.25

.14

.09

.11

.11

.13

.16

.12

.16

.10

.16

.17

.10

.11

.22

.11

.38

.18.32

.23

.12

.17

.22

.21

.24

.18

.23

.17

.26

.25.19.19.26.22

.49

.29.42.32.26.31.40.28.31.35.35.33.39.36.36.35.34.36

.06

.02

.01

.01

.01

.15

Table 3. One Miss in Hit Run-Acontextual "Occurrences to

Opportunities"


Table 4. Perfect Runs of Hits—Acontextual "Occurrences to Expectations"

Table 5. One Miss in Hit Run—Acontextual "Occurrences to Expectations"

emerge as the most interesting imperfect streak shooters. Jordan is tied with Vinnie in the 7 for 8 category and either Jordan or McHale have the highest proportion in all other categories. Vinnie is more apparent here with the second highest proportions in the 6 for 7 and 5 for 6 categories and with the third highest in the 4 for 5 and 3 for 4 categories. It would be difficult, however, to argue that his reputation for streak shooting is warranted based on the data in the first three tables.

Tables 4 and 5 parallel Tables 2 and 3 with one important difference. While the numerators are identical, the denominators in Tables 4 and 5 are expectations computed as the simple binomial probability for the player for the given type of sequence times the number of opportunities in the data set that the player had to accomplish the sequence. For example, the expectations in the denominators for Table 4 were calculated as

expected count = P_i^L × Σ_g (T_ig − L + 1),   summed over the G games with T_ig ≥ L,

where

P_i = probability of a hit given a shot by player i,
T_ig = shots taken by the ith player in the gth game,
L = length of run, and
G = number of games.

The expectation for each entry in Tables 4 and 5 is 1.0. If a player is below 1.0 for any sequence length, it means that he accomplished that sequence less often than expected in this data set. If the player is above 1.0, he accomplished the sequence more often than expected. An entry of 2.0 means that the player accomplished the sequence twice as often as expected. For example, a player with a .5 shooting percentage has, by the binomial assuming independence, a .0039 probability of making 8 field goals in a row. If this player took 100 8-shot sequences in our data set, we expect to see .39 successes. If the player had one success in the data set, the table entry would be 1/.39 = 2.5641.
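Under the expectation formula above, the Table 4 entries can be computed with a few lines of Python; the function name and the representation of a player's games as lists of 0/1 outcomes are our own illustrative choices.

    def occurrences_to_expectations(games, L, p_hit):
        """Ratio used in Table 4: observed perfect hit runs of length L divided by
        the binomial expectation p_hit**L times the number of opportunities,
        accumulated over a player's games (each game is a list of 0/1 outcomes)."""
        observed = 0
        opportunities = 0
        for shots in games:
            n = len(shots)
            for k in range(max(n - L + 1, 0)):
                opportunities += 1
                observed += all(shots[k:k + L])
        expected = opportunities * p_hit ** L
        return observed / expected if expected > 0 else float("nan")

    # For a .5 shooter, 100 eight-shot opportunities give an expectation of .39,
    # so a single success yields a ratio of about 2.56, as in the worked example above.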

Dennis Rodman and Robert Parish are the streak shooters by Table 4's measure. Vinnie Johnson does not stand out as a streak shooter on perfect sequences.

In Table 5, the ratio of occurrences to expectations on imperfect sequences, Vinnie emerges as different from other players including several great players. He accomplishes the 7 for 8 sequence 5 times more often than expected; the closest players on 7 for 8 are Isiah Thomas of the Pistons and Byron Scott of the Lakers at about 2 times the expectation. Vinnie is above all other players in the 4 longest sequences and a close second in 3 for 4 sequences.

To this point in the analysis we have, like Tversky and Gilovich, looked only at isolated player shooting sequences. While our streak shooter looks somewhat different by at least one measure from the other 17 players, we have not yet considered context, which we hypothesize is what really enables observers of NBA basketball to differentiate streak shooters from the other players.

Tables 6 and 7 explore the same measures as Tables 4 and 5 with a contextual restriction. Context is defined as a sequence of 20 consecutive field goal attempts taken by all players in a game. In a game in which all players attempted N

[Tables 4 and 5 appear here in the original: for each of the 18 players (Jordan, Bird, McHale, Parish, D. Johnson, Ainge, D. Wilkins, E. Johnson, A.-Jabbar, Worthy, Scott, Aguirre, Dantley, Laimbeer, Dumars, Thomas, V. Johnson, Rodman), the ratio of observed occurrences to binomial expectations for perfect shot sequences (8/8 through 3/3, Table 4) and imperfect shot sequences (7/8 through 2/3, Table 5). The individual entries are not reliably recoverable from this scan.]



The numerator for each entry is the number of times that a player accomplishes the sequence of a given length in context. The denominator for each entry in Tables 6 and 7 is an expectation: the number of shot opportunities times the probability of a player taking T or more shots (where T is greater than or equal to the sequence length, L) in a 20 field goal context and of making r of L shots regardless of position in the T shots.

For example, the expectation in the denominators of entries for Table 6 is [the closed-form expression is not recoverable from this scan], where

P_i = probability of a hit given a shot by player i,
L = length of run,
m = number of possible shots (size of context) = 20,
G = number of games,

and where the probability of player i taking a given shot is estimated from the game totals, with

A_g = all field goal attempts in game g.

The probability of any player taking the next shot is a rough approximation using averages across games in our data. The estimate is the proportion of total field goals the player's team takes on average times the proportion of his team's field goals that the player takes on average. While it would be better to build the model conditional on information as to which specific players are on the floor for each particular shot opportunity, this approach is much more complicated and requires data that was not coded. An examination of season data on minutes played and shots taken by the players analyzed here indicates no substantial bias from this simplification.
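The exact closed-form expectation is not reproduced in this scan, but the following Monte Carlo sketch (Python; all names and numerical inputs are illustrative, not values from the authors' data set) conveys the idea: within each 20-shot context, every attempt is assigned to player i with his estimated take probability and made with his hit probability, and we ask how often he strings together L consecutive made attempts of his own.

```python
# Sketch: Monte Carlo approximation of the contextual expectation used in
# Tables 6 and 7, under the simplifying assumptions in the text.  The hit
# probability p_i, take probability q_i, and number of contexts are
# illustrative, not estimates from the actual data set.
import random

def prob_perfect_run_in_context(p_i, q_i, L, m=20, reps=100_000, seed=1):
    """Estimate P(player i makes L consecutive own attempts within one m-shot context)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(reps):
        run = best = 0
        for _ in range(m):
            if rng.random() < q_i:          # player i takes this attempt
                run = run + 1 if rng.random() < p_i else 0
                best = max(best, run)
            # an attempt by a teammate is assumed not to break player i's own run
        if best >= L:
            successes += 1
    return successes / reps

# Expected occurrences = (number of 20-shot contexts in the data) * probability.
contexts = 3000                      # hypothetical total of N - 20 + 1 windows over all games
p_i, q_i = 0.5, 0.15                 # illustrative hit and take probabilities
print(contexts * prob_perfect_run_in_context(p_i, q_i, L=4))
```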

As hypothesized, Vinnie Johnson does the highly unlikely, highly noticeable, and memorable in shooting field goals much more frequently than the rest of the players examined. We also explored shooting streaks in spans of 10 and 15 field goals of context. In general, the results strengthen further: The purer the sequence (fewer misses), the longer the sequence of hits, and the briefer the context, the more distinctive Vinnie is from other players in this data set.

We will leave it to the reader to apply Berkson's Interocular Traumatic Test in examining Tables 6 and 7. We conclude by this test that Vinnie is different from the other players, particularly on the most improbable and most memorable feats. Players who looked interesting as potential streak shooters in the acontextual analysis, such as Dennis Rodman, disappear almost completely when context is considered. In examining the data, the reason for the disappearance is obvious. Rodman's 9 for 9 streak fell in the following positions of field goal attempts in that game: 72, 86, 120, 126, 136, 146, 150, 156, 160 (i.e., Rodman took and made the 72nd field goal, the 86th field goal . . .).

Table 6. Perfect Shot Sequences in 20 FG Context—"Occurrences to Expectations"

Table 7. Imperfect Shot Sequences in 20 FG Context—"Occurrences to Expectations"

[Tables 6 and 7 appear here in the original: for the same 18 players, the ratio of occurrences to expectations for perfect sequences (8/8 through 3/3) and imperfect sequences (7/8 through 2/3) within 20-field-goal contexts. The individual entries are not reliably recoverable from this scan; as discussed in the text, V. Johnson's entries for the longest, purest sequences stand far above those of the other players.]




Conclusion

At least one streak shooter with an occasional hot hand was alive and well and living in Detroit during the 1987-1988 season. While 1987-1988 was not his best year and we cannot from this research support him as "the most lethal streak shooter in basketball," Vinnie Johnson's reputation as a streak shooter is apparently well deserved; he is different from other players in the data in terms of noticeable, memorable field goal shooting accomplishments.

The coaches, players, and fans accused of "misconceiving chance processes" stand somewhat vindicated. At least in the case of the Microwave and our other players for this data set, observers are able to notice and remember improbable shooting feats. They can also apparently make proper reputational attributions to those players who do the improbable and memorable more regularly than other players.

Attributing error in reasoning about chance processes requires at the outset that you know the correct model for the observations about which subjects are reasoning. Before you can identify errors in reasoning and explain those errors as the product of a particular style of erroneous reasoning, you must first know the correct reasoning. It is much easier to know the correct model in an experimental setting than in a natural setting. In the experimental setting you can choose it. In a natural setting such as professional basketball you must first discover it.

Tversky and Gilovich employed a model, the set of simple binomials for isolated individual player shooting sequences, in their data analysis that could not and, indeed, should not be used by observers of NBA games in understanding sequences of shots in real games. Their model uses shooting data in a form that knowledgeable fans who believe in hot hands and streak shooting never encounter.

The binomials for isolated player sequences are not very useful models for formulating or critiquing game strategies. The model cannot be used to reproduce the sequence of shots in a basketball game; there is nothing in the model to indicate which player shoots next. This model, which assumes that defensive responses are irrelevant to a player's shooting success, has some strange strategic implications. For example, the model suggests the optimal strategy for allocating field goal shots is to have the player on a team with the highest shooting percentage take all shots. Coaches such as Red Auerbach and Johnny Wooden probably never even considered such a strategy. If they had considered and adopted such a strategy there would be many fewer championship banners hanging in the rafters of the Boston Garden and Pauley Pavilion.

Basketball fans and coaches who once believed in the hot hand and streak shooting and who have been worried about the adequacy of their cognitive apparatus since the publication of Tversky and Gilovich's original work can relax and once again enjoy watching the game. It is even okay to admire the feats of a Vinnie Johnson and to think about him as fundamentally different from other shooters, even great shooters like Larry Bird, Michael Jordan, and Isiah Thomas.

Additional Reading

Gilovich, T., Vallone, R., and Tversky, A. (1985), "The Hot Hand in Basketball: On the Misperception of Random Sequences," Cognitive Psychology, 17, 295-314.

Tversky, A., and Gilovich, T. (1989), "The Cold Facts About the 'Hot Hand' in Basketball," Chance, 2(1), 16-21.



Chapter 20

More Probability Models for the NCAA Regional Basketball Tournaments

Neil C. SCHWERTMAN, Kathryn L. SCHENK, and Brett C. HOLBROOK

Sports events and tournament competitions provide excellent opportunities for model building and using basic statistical methodology in an interesting way. In this article, National Collegiate Athletic Association (NCAA) regional basketball tournament data are used to develop simple linear regression and logistic regression models using seed position for predicting the probability of each of the 16 seeds winning the regional tournament. The accuracy of these models is assessed by comparing the empirical probabilities not only to the predicted probabilities of winning the regional tournament but also the predicted probabilities of each seed winning each contest.

KEY WORDS: Basketball; Logistic regression; Regression.

1. INTRODUCTION

Enthusiasm for the study of probability is enhanced when the concepts are illustrated by real examples of interest to students. Athletic competitions afford many such opportunities to demonstrate the concepts of probability and have been extensively studied in the literature; see, for example, Mosteller (1952), Searls (1963), Moser (1982), Monahan and Berger (1977), David (1959), Glenn (1960), Schwertman, McCready, and Howard (1991), and Ladwig and Schwertman (1992). One excellent probability analysis opportunity for use in the classroom occurs each spring when "March Madness," as the media calls it, occurs. "March Madness" is the National Collegiate Athletic Association (NCAA) regional and Final Four basketball tournaments that culminate in a National Collegiate Championship game. The NCAA selects (actually, certain conference champions or tournament winners are included automatically) 64 teams, 16 for each of 4 regions, to compete for the national championship. The NCAA committee of experts not only selects the 64 teams from 292 teams in Division 1-A, but assigns a seed position to each team in the four regions based on their consensus of team strengths. The format for each regional tournament is predetermined following the pattern in Figure 1, where the number one seed (strongest team) plays the sixteenth seed (weakest team), the number two seed (next strongest team) plays the fifteenth seed (second weakest), etc. The experts attempt to evenly distribute the teams to the regional tournaments to achieve parity in the quality of each region.

Neil C. Schwertman is Chairman and Professor of Statistics, Department of Mathematics and Statistics, California State University, Chico, CA 95929-0525. Kathryn L. Schenk is Instructional Support Coordinator, Computer Center, California State University, Chico, CA 95929-0525. Brett C. Holbrook is Student, Department of Experimental Statistics, New Mexico State University, Las Cruces, NM 88003-0003.

Schwertman et al. (1991) suggested three rather ad hoc probability models that predicted remarkably well the empirical probability of each seed winning its regional tournament and advancing to the "final four." The validity of the three models was measured only by each seed's probability of winning its regional tournament. In this article we use the NCAA regional basketball tournament data as an example to illustrate ordinary least squares and logistic regression in developing prediction models. The parameter estimates for the simple models considered are based on the 600 games played (1985-1994) during the first ten years using the 64-team format. Validity of the eight new empirical and the three previous models in Schwertman et al. (1991) are assessed by comparing the empirical probabilities not only to the predicted probabilities for each seed winning the regional tournament but also to the predicted probabilities of each seed winning each contest.

2. TOURNAMENT ANALYSIS

Predicting the probability of each seed winning the regional tournament (and advancing to the final four) requires the consideration of all possible paths and opponents. Even though there are 16 teams in each region, the single elimination format (only the winning team survives in the tournament, i.e., one loss and the team is eliminated) is relatively easy to analyze compared to a double-elimination format. [See, for example, the analysis of the college baseball world series by Ladwig and Schwertman (1992).] In the first game each seed has only one possible opponent, but in the second game there are two possible opponents, four possible in the third game and eight possible in the regional finals. Hence there are 1 · 2 · 4 · 8 = 64 potential sets of opponents for each seed to play in order to eventually win the regional tournament. For example, for the number 2 seed to win, it must defeat seed 15 in game 5, either 7 or 10 in game 11, either 3, 14, 6, or 11 in game 14, and either 1, 16, 8, 9, 4, 13, 5, or 12 in game 15. The probability analysis for the regional championship must include not only the probability of defeating each potential opponent, but also the probability of each potential opponent advancing to that particular game. To illustrate, suppose the second seed wins the regional tournament by defeating seeds 15, 7, 6, and 1 in games 5, 11, 14, and 15, respectively. Then the probability that this occurs is P(2,15) · P(2,7) · P(7 plays in game 11) · P(2,6) · P(6 plays in game 14) · P(2,1) · P(1 plays in game 15), where P(i, j) is the probability that an ith seed defeats a jth seed and P(j, i) = 1 − P(i, j). A more detailed explanation of the various paths and the associated probability analysis is contained in Schwertman et al. (1991). As in that article and most all such probability analyses of athletic competitions, we assume that the games are



independent and the probabilities remain constant throughout the tournament. To complete the analysis we now must find probability models for determining P(i, j).
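As a sketch of this path-by-path calculation, the Python fragment below propagates each seed's survival probability round by round through the fixed 16-team bracket (1 v 16, 8 v 9, 5 v 12, 4 v 13, 6 v 11, 3 v 14, 7 v 10, 2 v 15). The pairwise win probability used here is only a stand-in, the simple linear model 10 described later in the article, and the function and variable names are our own.

```python
# Sketch: probability of each seed winning its regional, given any pairwise
# win-probability function P(i, j), assuming independent games with constant
# probabilities as in the text.
def regional_win_probs(order, P):
    # order: seeds listed in bracket order; P(i, j) = probability seed i beats seed j
    slots = [{s: 1.0} for s in order]              # each seed starts in its own slot
    while len(slots) > 1:
        nxt = []
        for a, b in zip(slots[::2], slots[1::2]):  # winners of adjacent slots meet next round
            merged = {}
            for s, ps in a.items():
                merged[s] = ps * sum(pt * P(s, t) for t, pt in b.items())
            for t, pt in b.items():
                merged[t] = pt * sum(ps * P(t, s) for s, ps in a.items())
            nxt.append(merged)
        slots = nxt
    return slots[0]                                # seed -> probability of winning the regional

order = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]
model10 = lambda i, j: 0.5 + 0.03125 * (j - i)     # note P(j, i) = 1 - P(i, j) automatically
champ = regional_win_probs(order, model10)
print({seed: round(pr, 3) for seed, pr in sorted(champ.items())})
```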

3. PROBABILITY MODELS

The purpose of the probability models is to incorporate the relative strength of the teams in estimating the probability of each team winning in each game. It seems reasonable to use some function of seed positions because these were determined by a consensus of experts. In order to have the broadest possible use in the classroom we use the simplest linear straight line model E(Y) = β0 + β1(S(i) − S(j)), where S(i) is some function of i, the team's seed position. Clearly, multiple regression models could be used for more advanced classes, but our basic model is appropriate even for most introductory classes. In addition to the three models used in Schwertman et al. (1991), eight other models for assigning probabilities of success in each individual game are considered. The eight models are defined by the 2³ possible combinations of three factors: (1) type of regression (ordinary or logistic), (2) type of intercept β0 (estimated or specified constant), and (3) type of independent variable (linear or nonlinear function of seed positions).

Figure 2. Points in Scatterplot May Represent as Few as 1 or as Many as 97 Games.

There are obviously many functions of seed position that could be used in the models. The choice of S(i) and S(j) is quite arbitrary. We have chosen two rather simple functions for our investigation that fortunately provide excellent predictor models. The first is S1(i) = −i for all i, which is simply using the difference in seed position as a single independent variable. This function of seed position suggests a linearity in team strengths, for example, the difference in strength between seeds 1 and 3 is the same as between seeds 14 and 16. Intuitively it seems likely that there is a greater difference in quality between a number 1 and 3 seed than between a 14 and 16 seed. Thus this linearity may not be appropriate, and therefore we consider a nonlinear function S2(i) for incorporating team strength. Since the normal distribution occurs naturally in describing many random variables and is included in most introductory classes, it seems reasonable to use the normal distribution to describe a nonlinear relationship in team strengths. If we assume that the strength of the 292 teams is normally distributed and that the experts properly ordered the top 64 teams for the tournament from 229 to 292 with 229 the weakest and 292 the strongest, we can then determine a percentile and a corresponding z score for that percentile. That is, adding a correction for continuity to 292, S2(i) = Φ⁻¹((294.5 − 4i)/292.5), where Φ is the cumulative distribution function of the standard normal. For example, the transformed seed position z score for the number 1 seeds (289, 290, 291, 292 when ordered from weakest to strongest) was calculated from the percentile of this group's median, that is, 290.5/292.5 = 99.316 percentile, which corresponds to a z score of 2.466. Similarly, the number 2 seed's (285, 286, 287, 288) z score is calculated from the 286.5/292.5 = 97.9487 percentile, corresponding to a z score of 2.044. For seeds 3-16 the corresponding z scores are: 1.823, 1.666, 1.542, 1.438, 1.348, 1.267, 1.194, 1.127, 1.064, 1.006, .951, .898, .848, and .800, respectively.
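A short check of this transformation (Python with SciPy; presented only as a sketch of the computation just described) reproduces the z scores quoted above.

```python
# Sketch: the normal-score transformation S2(i), using the median rank of the
# four teams on each seed line among 292 teams, with the continuity correction
# from the text.
from scipy.stats import norm

def S2(i):
    return norm.ppf((294.5 - 4 * i) / 292.5)

print([round(S2(i), 3) for i in range(1, 17)])   # 2.466, 2.044, 1.823, ... as listed above
```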

The dependent variable is 1 if the lower seed (stronger team) defeats the higher seed and 0 otherwise. It should be noted that the logistic regression models (models 5-8) are equivalent to linear regression models using log odds; for example, model 5 is equivalent to log[P(i,j)/(1 − P(i,j))] = −.0328 + .177(j − i). It is intuitively appealing to use .5 for the y intercept in the ordinary regression models (models 2 and 4) since, if i = j (two clones play), we would want the probability of a win by either to be a half. Similarly, for the logistic regression we would want the y intercept to be zero (models 6 and 8) since, if i = j, then the ratio of p/q = 1 implies that the probability is the intuitive .5.

Figure 1. NCAA Regional Basketball Tournament Pairings.

Figure 3. Points in Scatterplot Represent as Few as 1 or as Many as 40 Games.

The eight new models with the estimated parameters are

P1(i,j) = .535329 + .029922(j − i)

P2(i,j) = .5 + .033746(j − i)

[The fitted equations for models 3-8 are not recoverable from this scan.]
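As a quick check of the two fitted equations shown above (a sketch only; the coefficients are simply plugged in), consider the most lopsided pairing, seed 1 versus seed 16, where j − i = 15.

```python
# Sketch: evaluating the two fitted equations above for the 1 vs. 16 pairing.
P1 = lambda i, j: 0.535329 + 0.029922 * (j - i)
P2 = lambda i, j: 0.5 + 0.033746 * (j - i)
print(P1(1, 16))   # about 0.984
print(P2(1, 16))   # about 1.006 -- above 1, which is why some models are later capped at .99999
```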


Table 1. Empirical and Estimated Probabilities of Seed i Defeating Seed j in the NCAA Regional Basketball Tournament Games

[Table 1 lists, for each of the 52 seed pairings that have occurred, the number of games played, the number of wins by the lower-numbered (stronger) seed, the empirical probability of the stronger seed winning, and the estimated probabilities P1(i,j) through P11(i,j) from the eleven models. The individual rows are not reliably recoverable from this scan. The summed chi-square values in the table's final rows are:

Model:             1       2       3       4       5       6       7       8       9       10      11
χ² (52 pairings):  51.899  51.388  52.168  53.443  57.925  57.924  56.986  56.963  60.103  52.022  52.551
χ² (26 pairings):  26.849  25.688  20.971  19.818  31.235  31.406  24.980  25.905  27.651  26.779  24.051]

NOTE: The first χ² is based on all 52 pairings of seeds that have occurred; the second χ² is based only on the 26 pairings that have had at least 5 games.



Figures 2 and 3 display the graphs of models 1-8 and a scatterplot of the data.

The other three models, 9-11, used by Schwertman et al. (1991), consist of one nonlinear type (model 9), P9(i,j) = j/(i + j); a linear type (model 10), P10(i,j) = .5 + .03125(S1(i) − S1(j)); and one based on normal scoring using the S2(i) (model 11), P11(i,j) = .5 + .2813625(S2(i) − S2(j)). For details see Schwertman et al. (1991).

Estimates of the unspecified parameter(s) in the first four models were obtained by ordinary least squares, while the next four models (5-8) were determined using the SAS logistic procedure. (See SAS/STAT User's Guide, Vol. 2, Version 6, 4th ed., pp. 1069-1126 for details.)

4. COMPARISON OF MODELS

The 11 different models for assigning probabilities of winning for each seed in each individual game were compared in three ways by using a chi-square statistic as a measure of the relative fit of the models. Of the possible 120 pairings of seeds (16 · 15/2) only 52 have occurred. Table 1 lists the pairs that have occurred, the number of games played between these seeds, the number of wins by the lower seed number (stronger team), and the empirical and estimated probabilities of the lower seed number winning from the 11 models. Using the empirical data for the seed pairings, a chi-square goodness-of-fit statistic, Σ (observed − expected)²/expected, for each of the 52 seed pairs was computed, and the sum of these chi-squares is given for each model. Twenty-six of the seed pairs had fewer than five games played, and the small expected numbers in these cells, being used as a divisor, may place too much emphasis on these cells and distort the chi-square values. Hence a second set of chi-square statistics based on just the 26 seed pairings with at least 5 games was computed and is also given in Table 1. Models 1-8 use the data to estimate the model parameters, and consequently these chi-square values are not entirely independent. Nevertheless, the chi-square values do provide a measure of the relative accuracy of the various models.

Table 2. Predicted Probabilities of Each Seed Winning the NCAA Regional Basketball Tournament

Seed                                    Model
          1      2      3      4      5      6      7      8      9     10     11
P(1)    .309   .293   .524   .526   .295   .298   .508   .504   .519   .275   .459
P(2)    .224   .219   .195   .194   .225   .226   .207   .208   .216   .208   .188
P(3)    .154   .157   .104   .104   .164   .163   .109   .110   .107   .154   .110
P(4)    .107   .110   .060   .059   .112   .112   .060   .061   .057   .111   .068
P(5)    .071   .077   .037   .038   .076   .075   .038   .038   .034   .080   .047
P(6)    .050   .053   .028   .028   .050   .049   .025   .026   .022   .058   .036
P(7)    .033   .035   .018   .018   .032   .031   .016   .017   .014   .040   .026
P(8)    .020   .021   .010   .009   .019   .019   .010   .010   .009   .026   .015
P(9)    .012   .013   .006   .006   .012   .011   .007   .007   .006   .017   .011
P(10)   .008   .009   .006   .006   .007   .007   .006   .006   .005   .012   .011
P(11)   .005   .006   .005   .005   .004   .004   .004   .004   .004   .008   .009
P(12)   .003   .003   .003   .002   .002   .002   .003   .003   .003   .005   .006
P(13)   .002   .002   .002   .002   .001   .001   .002   .002   .002   .003   .005
P(14)   .001   .001   .002   .001   .001   .001   .002   .002   .002   .002   .004
P(15)   .000   .000   .001   .001   .000   .000   .001   .001   .001   .001   .003
P(16)   .000   .000   .000   .000   .000   .000   .001   .001   .001   .000   .001

Table 3. Goodness of Fit Analysis

                          Expected numbers, by probability model number
Group (seed no.)  Obs.      1       2       3       4       5       6       7       8       9      10      11
1                  16     11.22   10.50   18.51   18.76   10.76   10.59   17.94   17.88   18.69    9.89   16.52
2                   8      8.07    7.80    7.06    6.96    8.12    8.06    7.47    7.46    7.78    7.50    6.77
3                   4      5.51    5.64    3.80    3.77    5.80    5.85    3.98    3.98    3.84    5.55    3.97
4                   3      3.86    3.95    2.22    2.15    4.01    4.04    2.24    2.25    2.04    4.00    2.45
5 or more           5      7.34    8.11    4.40    4.36    7.31    7.47    4.38    4.42    3.65    9.06    6.29
χ²                 (4)    3.3837  4.7865   .8276  1.0070  4.0921  4.4287   .5937   .5652  1.3545  6.3057   .6283
p value                    .4958   .3099   .9347   .9087   .3937   .3511   .9638   .9669   .8521   .1775   .9599



Models 2, 3, and 4 produced P(1,16) values that were greater than 1.0. When this occurred the probabilities were set to .99999 in order to compute the chi-square statistics.

The third comparison of the models was done by using P(i,j) to compute the probabilities of each seed winning the regional tournament. These probabilities are displayed in Table 2, and a chi-square goodness-of-fit to the empirical probabilities, used for evaluating the 11 models, is given in Table 3.
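The goodness-of-fit computation behind Table 3 can be sketched as follows (Python; the observed counts and the model 1 expectations are the values recovered in Table 3 above).

```python
# Sketch: chi-square goodness of fit of predicted regional winners to the
# observed counts, grouped by seed (1, 2, 3, 4, 5-or-more), as in Table 3.
observed = [16, 8, 4, 3, 5]
expected = [11.22, 8.07, 5.51, 3.86, 7.34]      # model 1 column of Table 3

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 3))   # about 3.39, close to the 3.3837 reported for model 1
```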

5. CONCLUSIONS

For the set of 26 pairings with 5 games or more, both the ordinary and logistic models using the z scoring of seed number S2(i) had smaller chi-square values than the corresponding models using just the seed position. In many cases the z scoring substantially improved the fit of the predicted value to the empirical data, and seems to be a worthwhile technique.

Two unexpected results of the analysis occurred. The first is that when predicting the probability of success in each game, P(i,j), the regressions with the y intercept specified (.5 for ordinary least squares and zero for logistic regression) occasionally provided somewhat smaller chi-square values than the unrestricted models. Because least squares minimizes the squared deviations between predicted and observed values it was anticipated that the unrestricted models (1, 3, 5, 7) would have slightly smaller chi-square values than the corresponding restricted models (2, 4, 6, 8). The chi-square statistic, however, is a weighted sum of these squared deviations, and hence is not necessarily a minimum when the unweighted sum is minimized.

The second unexpected result was that the models that were best (the smaller chi-square values) at predicting P(i,j) for each game did not do as well as some of the other models at predicting the overall regional tournament champion. Models 7 and 8 (logistic, with and without intercept, z-scored seeds) were the best at predicting the regional winner but were about in the middle (when ranked) of the models for predicting individual games. On the other hand, models 3 and 4 (ordinary least squares, with or without specified intercept, z-scored seeds) were the best prediction models for the individual games, but only fourth or fifth best for predicting the regional champion.

If the objective is to develop a model for predicting the winner between various seed pairs, then model 3 (ordinary least squares, no specified intercept, z-scored seeds) seems to be the most satisfactory, whereas if the objective is to predict the regional winner, then the logistic models, 7 and 8 (logistic regression, with and without intercept, z-scored seeds), are the most satisfactory models. Interestingly, the ad hoc model 11 seemed to be very adequate at predicting both.

Obviously there are numerous models that could be used. We have focused on the simplest straight line models and elementary methods of incorporating team strength in order to make the methodology accessible to a broad spectrum of students. We have attempted to present an application of ordinary least squares, logistic regression, and probability that should be of interest to many students. The ever-increasing interest in "March Madness" can be used to motivate and stimulate this instructive, timely application of several principles and methods of probability and statistics. Students seem to learn better when they can see application of the subject to something of interest to them. We believe that this analysis of the regional basketball tournaments can promote student learning and enthusiasm for studying probability and statistics.

[Received August 1993. Revised October 1994.]

REFERENCES

David, H. A. (1959), "Tournaments and Paired Comparisons," Biometrika, 46, 139-149.

Glenn, W. A. (1960), "A Comparison of the Effectiveness of Tournaments," Biometrika, 47, 253-262.

Ladwig, J. A., and Schwertman, N. C. (1992), "Using Probability and Statistics to Analyze Tournament Competitions," Chance, 5, 49-53.

Monahan, J. P., and Berger, P. D. (1977), "Playoff Structures in the National Hockey League," in Optimal Strategies in Sports, eds. S. P. Ladany and R. E. Machol, Amsterdam: North-Holland, pp. 123-128.

Moser, L. E. (1982), "A Mathematical Analysis of the Game of Jai Alai," The American Mathematical Monthly, 89, 292-300.

Mosteller, F. (1952), "The World Series Competition," Journal of the American Statistical Association, 47, 355-380.

Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability Models for the NCAA Regional Basketball Tournaments," The American Statistician, 45, 35-38.

Searls, D. T. (1963), "On the Probability of Winning with Different Tournament Procedures," Journal of the American Statistical Association, 58, 1064-1081.





Chapter 21

The Cold Facts About the "Hot Hand" in Basketball

Do basketball players tend to shoot in streaks? Contrary to the belief of fans and commentators, analysis shows that the chances of hitting a shot are as good after a miss as after a hit.

Amos Tversky and Thomas Gilovich

You're in a world all your own. It's hard to describe. But the basket seems to be so wide. No matter what you do, you know the ball is going to go in.

—Purvis Short, of the NBA's Golden State Warriors

This statement describes a phenomenon known to everyone who plays or watches the game of basketball, a phenomenon known as the "hot hand." The term refers to the putative tendency for success (and failure) in basketball to be self-promoting or self-sustaining. After making a couple of shots, players are thought to become relaxed, to feel confident, and to "get in a groove" such that subsequent success becomes more likely. The belief in the hot hand, then, is really one version of a wider conviction that "success breeds success" and "failure breeds failure" in many walks of life. In certain domains it surely does—particularly those in which a person's reputation can play a decisive role. However, there are other areas, such as most gambling games, in which the belief can be just as strongly held, but where the phenomenon clearly does not exist.

What about the game of basketball? Does success in this sport tend to be self-promoting? Do players occasionally get a "hot hand"?

Misconceptions of Chance Processes

One reason for questioning the widespread belief in the hot hand comes from research indicating that people's intuitive conceptions of randomness do not conform to the laws of chance. People commonly believe that the essential characteristics of a chance process are represented not only globally in a large sample, but also locally in each of its parts. For example, people expect even short sequences of heads and tails to reflect the fairness of a coin and to contain roughly 50% heads and 50% tails. Such a locally representative sequence, however, contains too many alternations and not enough long runs.

This misconception produces two systematic errors. First, it leads many people to believe that the probability of heads is greater after a long sequence of tails than after a long sequence of heads; this is the notorious gamblers' fallacy. Second, it leads people to question the randomness of sequences that contain the expected number of runs because even the occurrence of, say, four heads in a row—which is quite likely in even relatively small samples—makes the sequence appear nonrepresentative. Random sequences just do not look random.

Perhaps, then, the belief in the hot hand is merely one manifestation of this fundamental misconception of the laws of chance. Maybe the streaks of consecutive hits that lead players and fans to believe in the hot hand do not exceed, in length or frequency, those expected in any random sequence.

[Photo: Does Cornell senior Mike Pascal have a "hot hand"? Photos courtesy of Joe Labolito, Temple University Photography Department, © 2004.]

To examine this possibility, we first asked a group of 100 knowledgeable basketball fans to classify sequences of 21 hits and misses (supposedly taken from a basketball player's performance record) as streak shooting, chance shooting, or alternating shooting. Chance shooting was defined as runs of hits and misses that are just like those generated by coin tossing. Streak shooting and alternating shooting were defined as runs of hits and misses that are longer or shorter, respectively, than those observed in coin tossing. All sequences contained 11 hits and 10 misses, but differed in the probability of alternation, p(a), or the probability that the outcome of a given shot would be different from the outcome of the previous shot. In a random (i.e., independent) sequence, p(a) = .5; streak shooting and alternating shooting arise when p(a) is less than or greater than .5, respectively. Each respondent evaluated six sequences, with p(a) ranging from .4 to .9. Two (mirror image) sequences were used for each level of p(a) and presented to different respondents.

Figure 1. Percentage of basketball fans classifying sequences of hits and misses as examples of streak shooting or chance shooting, as a function of the probability of alternation within the sequences.
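As an illustration of how such stimuli can be constructed (a sketch only, not the authors' exact procedure), the following Python fragment generates a 21-shot sequence with a specified probability of alternation and exactly 11 hits by rejection sampling.

```python
# Sketch: generate a hit/miss sequence (1 = hit) with alternation probability
# p_a and exactly 11 hits out of 21 shots, by rejection sampling.
import random

def sequence_with_alternation(p_a, n=21, hits=11, seed=0):
    rng = random.Random(seed)
    while True:                                  # reject until the hit count is exactly `hits`
        seq = [rng.randint(0, 1)]
        for _ in range(n - 1):
            seq.append(1 - seq[-1] if rng.random() < p_a else seq[-1])
        if sum(seq) == hits:
            return seq

seq = sequence_with_alternation(0.5)             # p(a) = .5 corresponds to chance (coin-tossing) shooting
runs = 1 + sum(seq[k] != seq[k - 1] for k in range(1, len(seq)))
print(seq, runs)
```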

The percentage of respondents who classified each sequence as "streak shooting" or "chance shooting" is presented in Figure 1 as a function of p(a). (The percentage of "alternating shooting" is the complement of these values.) As expected, people perceive streak shooting where it does not exist. The sequence of p(a) = .5, representing a perfectly random sequence, was classified as streak shooting by 65% of the respondents. Moreover, the perception of chance shooting was strongly biased against long runs: The sequences selected as the best examples of chance shooting were those with probabilities of alternation of .7 and .8 instead of .5.

It is clear, then, that a common misconception about the laws of chance can distort people's observations of the game of basketball: Basketball fans "detect" evidence of the hot hand in perfectly random sequences. But is this the main determinant of the widespread conviction that basketball players shoot in streaks? The answer to this question requires an analysis of shooting statistics in real basketball games.

Cold Facts from the NBA

Although the precise meaning of terms like "the hot hand" and "streak shooting" is unclear, their common use implies a shooting record that departs from coin tossing in two essential respects (see accompanying box). First, the frequency of streaks (i.e., moderate or long runs of successive hits) must exceed what is expected by a chance process with a constant hit rate. Second, the probability of a hit should be greater following a hit than following a miss, yielding a positive serial correlation between the outcomes of successive shots.

To examine whether these patterns accurately describe the performance of players in the NBA, the field-goal records of individual players were obtained for 48 home games of the Philadelphia 76ers during the 1980-81 season. Table 1 presents, for the nine major players of the 76ers, the probability of a hit conditioned on 1, 2, and 3 hits and misses. The overall hit rate for each player, and the number of shots he took, are presented in column 5. A comparison of columns 4 and 6 indicates that for eight of the nine players the probability of a hit is actually higher following a miss (mean = .54) than following a hit (mean = .51), contrary to the stated beliefs of both players and fans. Column 9 presents the (serial) correlations between the outcomes of successive shots. These correlations are not significantly different than zero except for one player (Dawkins) whose correlation is negative. Comparisons of the other matching columns (7 vs. 3, and 8 vs. 2) provide further evidence against streak shooting. Additional analyses show that the probability of a hit (mean = .57) following a "cold" period (0 or 1 hits in the last 4 shots) is higher than the probability of a hit (mean = .50) following a "hot" period (3 or 4 hits in the last 4 shots). Finally, a series of Wald-Wolfowitz runs tests revealed that the observed number of runs in the players' shooting records does not depart from chance expectation except for one player (Dawkins) whose data, again, run counter to the streak-shooting hypothesis. Parallel analyses of data from two other teams, the New Jersey Nets and the New York Knicks, yielded similar results.
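The quantities in Table 1 are straightforward to compute from a player's 0/1 shot record. The sketch below (Python; the record shown is fabricated purely for illustration, not any player's actual data) computes the conditional hit probabilities and the lag-one serial correlation.

```python
# Sketch: conditional hit probabilities and serial correlation for one
# player's field-goal record (1 = hit, 0 = miss).  The record is fabricated.
shots = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

def p_hit_after(shots, pattern):
    """P(hit | the immediately preceding shots equal `pattern`), e.g. (1,) or (0, 0)."""
    k = len(pattern)
    follows = [shots[t] for t in range(k, len(shots)) if tuple(shots[t - k:t]) == pattern]
    return sum(follows) / len(follows) if follows else float("nan")

def serial_correlation(shots):
    """Lag-one correlation between consecutive shot outcomes."""
    x, y = shots[:-1], shots[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

print(p_hit_after(shots, (1,)), p_hit_after(shots, (0,)), serial_correlation(shots))
```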

Although streak shooting entails a positive dependence between the outcomes of successive shots, it could be argued that both the runs test and the test for a positive correlation are not sufficiently powerful to detect occasional "hot" stretches embedded in longer stretches of normal performance. To obtain a more sensitive test of stationarity (suggested by David Freedman) we partitioned the entire record of each player into non-overlapping series of four consecutive shots. We then counted the number of series in which the player's performance was high (3 or 4 hits), moderate (2 hits) or low (0 or 1 hits). If a player is occasionally "hot," his record must include more high-performance series than expected by chance. The numbers of high, moderate, and low series for each of the nine Philadelphia 76ers were compared to the expected values, assuming independent shots with a constant hit rate (taken from column 5 of Table 1). For example, the expected percentages of high-, moderate-, and low-performance series for a player with a hit rate of .50 are 31.25%, 37.5%, and 31.25%, respectively. The results provided no evidence for non-stationarity or streak shooting as none of the nine chi-squares approached statistical significance. The analysis was repeated four times (starting the partition into quadruples at the first, second, third, and fourth shot of each player), but the results were the same.

Combining the four analyses, the overall observed percentages of high, medium, and low series are 33.5%, 39.4%, and 27.1%, respectively, whereas the expected percentages are 34.4%, 36.8%, and 28.8%. The aggregate data yield slightly fewer high and low series than expected by independence, which is the exact opposite of the pattern implied by the presence of hot and cold streaks.
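The expected split quoted above follows directly from the binomial distribution; a minimal sketch (Python) for an arbitrary hit rate p:

```python
# Sketch: expected percentages of "high" (3-4 hits), "moderate" (2 hits), and
# "low" (0-1 hits) series of four consecutive shots, assuming independent
# shots with constant hit rate p.
from math import comb

def series_split(p):
    probs = [comb(4, k) * p ** k * (1 - p) ** (4 - k) for k in range(5)]
    low, moderate, high = probs[0] + probs[1], probs[2], probs[3] + probs[4]
    return low, moderate, high

print(series_split(0.50))   # (0.3125, 0.375, 0.3125), the percentages quoted for a .50 shooter
```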

At this point, the lack of evidence for streak shooting could be attributed to the contaminating effects of shot selection and defensive strategy. Streak shooting may exist, the argument goes, but it may be masked by a hot player's tendency to take more difficult shots and to receive more attention from the defensive team. Indeed, the best shooters on the team (e.g., Andrew Toney) do not have the highest hit rate, presumably because they take more difficult shots. This argument, however, does not explain why players and fans erroneously believe that the probability of a hit is greater following a hit than following a miss, nor can it account for the tendency of knowledgeable observers to classify random sequences as instances of streak shooting. Nevertheless, it is instructive to examine the performance of players when the difficulty of the shot and the defensive pressure are held constant. Free-throw records provide such data. Free throws are shot, usually in pairs, from the same location and without defensive pressure. If players shoot in streaks, their shooting percentage on the second free throws should be higher after having made their first shot than after having missed their first shot.


Table 1. Probability of making a shot conditioned on the outcome of previous shots for nine members of the Philadelphia 76ers; hits are denoted H, misses are M.

Player              P(H/3M)  P(H/2M)  P(H/1M)   P(H)        P(H/1H)  P(H/2H)  P(H/3H)   Serial correlation r
Clint Richardson      .50      .47      .56     .50 (248)     .49      .50      .48         -.020
Julius Erving         .52      .51      .51     .52 (884)     .53      .52      .48          .016
Lionel Hollins        .50      .49      .46     .46 (419)     .46      .46      .32         -.004
Maurice Cheeks        .77      .60      .60     .56 (339)     .55      .54      .59         -.038
Caldwell Jones        .50      .48      .47     .47 (272)     .45      .43      .27         -.016
Andrew Toney          .52      .53      .51     .46 (451)     .43      .40      .34         -.083
Bobby Jones           .61      .58      .56     .54 (433)     .53      .47      .53         -.049
Steve Mix             .70      .56      .52     .52 (351)     .51      .48      .36         -.015
Darryl Dawkins        .80      .73      .71     .62 (403)     .57      .58      .51         -.142*
Weighted Mean         .56      .53      .54     .52           .51      .50      .46         -.039

NOTE: The number of shots taken by each player is given in Column 5. *p < .01



What People Mean by the "Hot Hand" and "Streak Shooting"

Although all that people mean by streak shooting and the hot hand can be rather complex, there is a strong consensus among those close to the game about the core [...]al dependence. To document this consensus, we interviewed a sample of 100 avid basketball fans from Cornell and Stanford. A summary of their responses is given below. We asked similar questions of the players whose data we analyzed—members of the Philadelphia 76ers—and their responses matched those we report here.

Does a player have a better chance of making a shot after having just made his last two or three shots than he does after having just missed his last two or three shots?
Yes 91%   No 9%

When shooting free throws, does a player have a better chance of making his second shot after making his first shot than after missing his first shot?
Yes 68%   No 32%

Is it important to pass the ball to someone who has just made several (2, 3, or 4) shots in a row?
Yes 84%   No 16%

Consider a hypothetical player who shoots 50% from the field. What is your estimate of his field goal percentage for those shots that he takes after having just made a shot?
Mean = 61%

What is your estimate of his field goal percentage for those shots that he takes after having just missed a shot?
Mean = 42%

Table 2 presents the probability of hitting the second free throw conditioned on the outcome of the first free throw for nine Boston Celtics players during the 1980-81 and the 1981-82 seasons.

These data provide no evidence that the outcome of the second shot depends on the outcome of the first. The correlation is negative for five players and positive for the remaining four, and in no case does it approach statistical significance.

The Cold Facts from Controlled Experiments

To test the hot hand hypothesis under controlled conditions, we recruited 14 members of the men's varsity team and 12 members of the women's varsity team at Cornell University to participate in a shooting experiment. For each player, we determined a distance from which his or her shooting percentage was roughly 50%, and we drew two 15-foot arcs at this distance from which the player took 100 shots, 50 from each arc. When shooting baskets, the players were required to move along the arc so that consecutive shots were never taken from exactly the same spot.

The analysis of the Cornell data parallels that of the 76ers. The overall probability of a hit following a hit was .47, and the probability of a hit following a miss was .48. The serial correlation was positive for 12 players and negative for 14 (mean r = .02). With the exception of one player (r = .37) who produced a significant positive correlation (and we might expect one significant result out of 26 just by chance), both the serial correlations and the distribution of runs indicated that the outcomes of successive shots are statistically independent.

We also asked the Cornell players to predict their hits and misses by betting on the outcome of each upcoming shot.


Table 2. Probability of hitting a second free throw (H2) conditioned on the outcome of the first free throw (H1 or M1) for nine members of the Boston Celtics.

Player               P(H2/M1)    P(H2/H1)    Serial correlation r
Larry Bird           .91 (53)    .88 (285)      -.032
Cedric Maxwell       .76 (128)   .81 (302)       .061
Robert Parish        .72 (105)   .77 (213)       .056
Nate Archibald       .82 (76)    .83 (245)       .014
Chris Ford           .77 (22)    .71 (51)       -.069
Kevin McHale         .59 (49)    .73 (128)       .130
M. L. Carr           .81 (26)    .68 (57)       -.128
Rick Robey           .61 (80)    .59 (91)       -.019
Gerald Henderson     .78 (37)    .76 (101)      -.022

NOTE: The number of shots on which each probability is based is given in parentheses.



Before every shot, each player chose whether to bet high, in which case he or she would win 5 cents for a hit and lose 4 cents for a miss, or to bet low, in which case he or she would win 2 cents for a hit and lose 1 cent for a miss. The players were advised to bet high when they felt confident in their shooting ability and to bet low when they did not. We also obtained betting data from another player who observed the shooter and decided, independently, whether to bet high or low on each trial. The players' payoffs included the amount of money won or lost on the bets made as shooters and as observers.

The players were generally unsuccessful in predicting their performance. The average correlation between the shooters' bets and their performance was .02, and the highest positive correlation was .22. The observers were also unsuccessful in predicting the shooter's performance (mean r = .04). However, the bets made by both shooters and observers were correlated with the outcome of the shooters' previous shot (mean r = .40 for the shooters and .42 for the observers). Evidently, both shooters and observers relied on the outcome of the previous shot in making their predictions, in accord with the hot-hand hypothesis. Because the correlation between successive shots was negligible (again, mean r = .02), this betting strategy was not superior to chance, although it did produce moderate agreement between the bets of the shooters and the observers (mean r = .22).

The Hot Hand as Cognitive Illusion

To summarize what we have found, we think it may be helpful to clarify what we have not found. Most importantly, our research does not indicate that basketball shooting is a purely chance process, like coin tossing. Obviously, it requires a great deal of talent and skill. What we have found is that, contrary to common belief, a player's chances of hitting are largely independent of the outcome of his or her previous shots. Naturally, every now and then, a player may make, say, nine of ten shots, and one may wish to claim—after the fact—that he was hot. Such use, however, is misleading if the length and frequency of such streaks do not exceed chance expectation.

Our research likewise does not imply that the number of points that a player scores in different games or in different periods within a game is roughly the same. The data merely indicate that the probability of making a given shot (i.e., a player's shooting percentage) is unaffected by the player's prior performance. However, players' willingness to shoot may well be affected by the outcomes of previous shots. As a result, a player may score more points in one period than in another not because he shoots better, but simply because he shoots more often. The absence of streak shooting does not rule out the possibility that other aspects of a player's performance, such as defense, rebounding, shots attempted, or points scored, could be subject to hot and cold periods. Furthermore, the present analysis of basketball data does not say whether baseball or tennis players, for example, go through hot and cold periods. Our research does not tell us anything general about sports, but it does suggest a generalization about people, namely that they tend to "detect" patterns even where none exist, and to overestimate the degree of clustering in sports events, as in other sequential data. We attribute the discrepancy between the observed basketball statistics and the intuitions of highly interested and informed observers to a general misconception of the laws of chance that induces the expectation that random sequences will be far more balanced than they generally are, and creates the illusion that there are patterns or streaks in independent sequences.

This account explains both the formation and maintenance of the belief in the hot hand. If independent sequences are perceived as streak shooting, no amount of exposure to such sequences will convince the player, the coach, or the fan that the sequences are actually independent. In fact, the more basketball one watches, the more one encounters what appears to be streak shooting. This misconception of chance has direct consequences for the conduct of the game. Passing the ball to the hot player, who is guarded closely by the opposing team, may be a non-optimal strategy if other players who do not appear hot have a better chance of scoring. Like other cognitive illusions, the belief in the hot hand could be costly.

Additional Reading

Gilovich, T., Vallone, R., and Tversky, A. (1985), "The Hot Hand in Basketball: On the Misperception of Random Sequences," Cognitive Psychology, 17, 295-314.

Kahneman, D., Slovic, P., and Tversky, A. (1982), Judgment Under Uncertainty: Heuristics and Biases, New York: Cambridge University Press.

Tversky, A., and Kahneman, D. (1971), "Belief in the Law of Small Numbers," Psychological Bulletin, 76, 105-110.

Tversky, A., and Kahneman, D. (1974), "Judgment Under Uncertainty: Heuristics and Biases," Science, 185, 1124-1131.

Wagenaar, W. A. (1972), "Generation of Random Sequences by Human Subjects: A Critical Survey of Literature," Psychological Bulletin, 77, 65-72.



Chapter 22

Simpson's Paradox and the Hot Hand in Basketball

Robert L. WARDROP

A number of psychologists and statisticians are interested in how laypersons make judgments in the face of uncertainties, assess the likelihood of coincidences, and draw conclusions from observation. This is an important and exciting area that has produced a number of interesting articles. This article uses an extended example to demonstrate that researchers need to use care when examining what laypersons believe. In particular, it is argued that the data available to laypersons may be very different from the data available to professional researchers. In addition, laypersons unfamiliar with a counterintuitive result, such as Simpson's paradox, may give the wrong interpretation to the pattern in their data. This paper gives two recommendations to researchers and teachers. First, take care to consider what data are available to laypersons. Second, it is important to make the public aware of Simpson's paradox and other counterintuitive results.

KEY WORDS: Hot hand phenomenon; McNemar's test; Multiple analyses; Simpson's paradox.

1. INTRODUCTION

Schoolchildren routinely learn to identify optical illusions. It is arguably as important that the general public learn to identify statistical illusions. Many outstanding researchers have addressed this issue. As examples, Diaconis and Mosteller (1989) investigate computing the probabilities of coincidences; Kahneman, Slovic, and Tversky (1983) consider judgments made in the presence of uncertainty; and Tversky and Gilovich (1989) investigate the popular belief in the hot hand phenomenon in basketball. This article examines some of the data presented by Tversky and Gilovich.

Suppose that a basketball player plans to attempt 20 shots, with each shot resulting in a hit or a miss. A statistician might assume tentatively that the assumptions of Bernoulli trials are appropriate for this experiment. Suppose next that the experiment is performed and the player obtains the following data:

HMHMM MHHHM HHHMM HMHHH

Do these data provide convincing evidence against the tentative assumption of Bernoulli trials? Are the three occurrences of three successive hits convincing evidence of the player having a "hot hand"? These are difficult questions to answer because of the myriad of possible alternatives to Bernoulli trials that exist. It is mathematically and conceptually convenient to restrict attention to alternatives that allow the probability of success on any trial to depend on the outcome of the previous trial or, perhaps, the outcomes of some small number of previous trials. (This restriction may be unrealistic, but that issue will not be addressed in this article.) With the restrictive class of alternatives described here, Tversky and Gilovich devised a clever experiment to obtain convincing evidence that knowledgeable basketball fans are much too ready to detect occurrences of streak shooting—the hot hand—in sequences that are, in fact, the outcomes of Bernoulli trials.

Having established that basketball fans detect the hot hand in simulated random data, Tversky and Gilovich next examined three sets of real data. The data sets are: shots from the field during National Basketball Association (NBA) games; pairs of free throws shot during NBA games; and a controlled experiment using college varsity men and women basketball players. Using the restrictive alternatives described above, Tversky and Gilovich found no evidence of the hot hand phenomenon in any of their data sets. In addition, using a test statistic that is sensitive to certain time trends in the probability of success, they again found no evidence of the hot hand phenomenon.

This article examines the free throw data presented by Tversky and Gilovich. Tversky and Gilovich began by asking a sample of 100 "avid basketball fans" from Cornell and Stanford: "When shooting free throws, does a player have a better chance of making his second shot after making his first shot than after missing his first shot?" A "Yes" response was interpreted as indicating belief in the existence of the hot hand phenomenon, and a "No" as indicating disbelief. (Actually, a "No" response combines persons who believe in independence with those who believe in a negative association between shots; but the researchers apparently were not interested in separating these groups.) Sixty-eight of the fans responded "Yes" and the other 32 "No." Thus, a large majority of those questioned believed in the hot hand phenomenon for free throw shooting. Tversky and Gilovich investigated the above question empirically by examining data they obtained on a small group of well-known and widely viewed basketball players, namely, nine regulars on the 1980-1981 and 1981-1982 Boston Celtics basketball team.

After their analysis of the Celtics data, Tversky and Gilovich concluded that "These data provide no evidence that the outcome of the second shot depends on the outcome of the first." Section 2 of this article will examine the Celtics data with the goal of reconciling what Tversky and Gilovich found and what their basketball fans believed. In particular, it will be shown that, in a certain sense, the prevalent fan belief in the hot hand is not necessarily at odds with Tversky and Gilovich's conclusion.

The analysis presented in Section 3 of this paper indicates that several Celtics players were better at their second shots than at their first.

2. INDEPENDENCE

Robert L. Wardrop is Associate Professor, Department of Statistics, University of Wisconsin—Madison, Madison, WI 53706. The author thanks the referees and associate editor for helpful comments.

It is instructive to begin by considering just two of the nine Boston Celtics players who are represented in the free throw data, namely, Larry Bird and Rick Robey.


Table 1. Observed Frequencies for Pairs of Free Throws by Larry Bird and Rick Robey, and the Collapsed Table

Larry Bird
                         First:
Second:          Hit    Miss    Total
  Hit            251      48      299
  Miss            34       5       39
  Total          285      53      338

Rick Robey
                         First:
Second:          Hit    Miss    Total
  Hit             54      49      103
  Miss            37      31       68
  Total           91      80      171

Collapsed Table
                         First:
Second:          Hit    Miss    Total
  Hit            305      97      402
  Miss            71      36      107
  Total          376     133      509

During the 1980-1981 and 1981-1982 seasons, Larry Bird shot a pair of free throws on 338 occasions. Five times he missed both shots, 251 times he made both shots, 34 times he made only the first shot, and 48 times he made only the second shot. These data are presented in Table 1, as are the same data for Rick Robey. Let p_hit and p_miss denote the proportion of first shot hits that are followed by a hit and the proportion of first shot misses that are followed by a hit, respectively. For Bird, p_hit = 251/285 = .881 and p_miss = 48/53 = .906. For Robey, these numbers are .593 and .612, respectively. Note that, contrary to the hot hand theory, each player shot slightly better after a miss than after a hit, although, as shown below, the differences are not statistically significant.

It is possible, of course, to ignore the identity of the player attempting the shots and examine the data in the collapsed table in Table 1. For example, on 509 occasions either Bird or Robey attempted two free throws, on 305 of those occasions both shots were hit, and so on. For the collapsed table, p_hit = .811 and p_miss = .729. These values support the hot hand theory—a hit was much more likely than a miss to be followed by a hit.

The data from Bird and Robey illustrate Simpson's paradox (Simpson 1951), namely, p_hit < p_miss in each component table, but p_hit > p_miss in the collapsed table. For further examples and discussion of Simpson's paradox, see Shapiro (1982), Wagner (1982), the essay by Alan Agresti in Kotz and Johnson (1983), and their references.

Figure 1 provides a visual explanation of Simpson's paradox. The top picture in the figure presents the proportion of second-shot successes after a hit for Bird, Robey, and the collapsed table. The bottom picture in the figure presents the same three proportions for second shots attempted after a miss. It is easy to verify algebraically that the proportion of successes for a collapsed table equals the weighted average of the individual players' proportions, with weights equal to the proportion of data in the collapsed table that comes from the player. For the after-a-hit condition, for example, the weight for Bird is 285/376 = .758, the weight for Robey is 91/376 = .242, and the proportion of successes for the collapsed table, 305/376 = .811, is

(.758)(.881) + (.242)(.593) = .811.

In Figure 1, the heights of the four rectangles above the Bird and Robey proportions equal the weights associated with the relevant player-condition pair. For example, the height of the rectangle for Bird in the after-a-hit condition equals .758, in agreement with the computation of the previous paragraph. Thus, the proportion of successes for each collapsed table in the figure is located at the center of gravity of the two rectangles. As a result, even though both Bird and Robey shot better after a miss than after a hit, the collapsed values show the reverse pattern due to the huge variation in weights associated with each player. In short, Simpson's paradox has occurred because the after-a-miss condition, when compared to the after-a-hit condition, has a disproportionately large share of its data originating from the far inferior shooter Robey.
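The arithmetic behind this weighted-average explanation is easy to verify directly. The short sketch below is not part of the original article; it simply recomputes the conditional proportions for each player and for the collapsed table from the counts in Table 1.

    # Second-shot outcomes from Table 1, recorded as (hits, attempts).
    after_hit = {"Bird": (251, 285), "Robey": (54, 91)}    # second shots taken after a first-shot hit
    after_miss = {"Bird": (48, 53), "Robey": (49, 80)}     # second shots taken after a first-shot miss

    for label, tables in (("after a hit", after_hit), ("after a miss", after_miss)):
        total_hits = sum(h for h, n in tables.values())
        total_attempts = sum(n for h, n in tables.values())
        for player, (h, n) in tables.items():
            # Each player's weight in the collapsed table is n / total_attempts.
            print(f"{player} {label}: {h}/{n} = {h / n:.3f} (weight {n / total_attempts:.3f})")
        print(f"Collapsed {label}: {total_hits}/{total_attempts} = {total_hits / total_attempts:.3f}")

Both players' proportions are higher after a miss, yet the collapsed proportions reverse the pattern, because the after-a-miss column draws a much larger share of its attempts from Robey.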

When I first examined the Bird and Robey data several years ago, my immediate reactions were that this is an interesting example of Simpson's paradox, the analysis of individual tables is "correct," and the analysis of the collapsed table is "incorrect." Now I believe these labels were applied too hastily. The reasons I changed my mind are discussed below after the entire data set is examined.

Table 2 introduces symbols to represent the various numbers in a 2 x 2 table. The values n1, n2, m1, and m2 denote the marginal totals, and the values a, b, c, and d denote the cell counts. The null hypothesis states that the outcome of the second shot is statistically independent of the outcome of the first shot. If the null hypothesis is true, then conditional on the values of the marginal totals, the cell count a has a hypergeometric distribution with

Figure 1. A Visual Explanation of Simpson's Paradox for the Free Throw Study.



Table 2. Standard Notation for a 2 x 2 Table

                         First:
Second:          Hit    Miss    Total
  Hit             a       c       m1
  Miss            b       d       m2
  Total           n1      n2      n

expectation and variance:

E(a) = n1 m1 / n    (1)

and

var(a) = n1 n2 m1 m2 / [n^2 (n - 1)].    (2)

The null distribution of

z = [a - E(a)] / sqrt(var(a))    (3)

can be approximated by the standard normal curve. For Larry Bird, a = 251, E(a) = 252.12, and var(a) = 4.575. Substituting these values into Equation (3) gives

z = (251 - 252.12) / sqrt(4.575) = -.52.

Thus, as stated earlier, the results are not statistically significant. For Robey, z = -.25, and for the collapsed table, z = 1.99. Thus, an analysis of the collapsed table alone would lead one to conclude that there is statistically significant evidence in support of the hot hand theory.
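As a sketch of how these calculations can be carried out (not part of the original article), the following code evaluates Equations (1)-(3) for the three tables in Table 1; the cell counts a, b, c, d follow the notation of Table 2.

    from math import sqrt

    def independence_z(a, b, c, d):
        """z of Equation (3); a = hit-hit, b = hit-miss, c = miss-hit, d = miss-miss
        (first-shot outcome listed first)."""
        n1, n2 = a + b, c + d            # first-shot hit and miss totals
        m1, m2 = a + c, b + d            # second-shot hit and miss totals
        n = n1 + n2
        expected_a = n1 * m1 / n                               # Equation (1)
        var_a = n1 * n2 * m1 * m2 / (n * n * (n - 1))          # Equation (2)
        return (a - expected_a) / sqrt(var_a)

    print(independence_z(251, 34, 48, 5))    # Larry Bird: about -0.52
    print(independence_z(54, 37, 49, 31))    # Rick Robey: about -0.25
    print(independence_z(305, 71, 97, 36))   # collapsed table: about 1.99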

Tversky and Gilovich report data for all nine men who played regularly for the Celtics during 1980-1982. The summaries needed for analysis are given in Table 3. The first column of the table lists the players' names. The second and third columns list the values of p_hit and p_miss defined above. The fourth, fifth, and sixth columns list the values of a, E(a), and var(a) which are obtained from their data and Equations (1) and (2). The seventh column lists the value of z from Equation (3) for each player. The men are listed in the table by decreasing values of p_hit - p_miss which, not too surprisingly, also lists them by decreasing values of z. Thus, McHale, with a difference of 73 - 59 = 14 percentage points, is listed first and Carr, with a difference of 68 - 81 = -13 percentage points, is listed last. In terms of either the point estimates or the test statistic value, McHale provides the strongest evidence in support of the hot hand theory, and Carr provides the strongest evidence in support of an inverse relationship between the outcomes of the two shots. Note that four players—McHale, Maxwell, Parish, and Archibald—shot better after a hit, while the remaining five players shot better after a miss.

The data for McHale give a one-sided approximate P value of .0418. This is not particularly noteworthy for two reasons:

(1) It is difficult to justify the use of a one-sided alternative, especially given that five players shot better after a miss and four shot better after a hit.

Table 3. Selected Statistics for the Investigation of Independence of Shots for Nine Members of the Boston Celtics

Player              p_hit   p_miss      a      E(a)    var(a)       z
Kevin McHale         .73      .59       93     88.23    7.633     1.73
Cedric Maxwell       .81      .76      245    240.20   14.667     1.25
Robert Parish        .77      .72      164    160.75   13.061      .90
Nate Archibald       .83      .82      203    202.26    8.380      .26
Rick Robey           .59      .61       54     54.81   10.257     -.25
Gerald Henderson     .76      .78       77     77.58    4.858     -.26
Larry Bird           .88      .91      251    252.12    4.575     -.52
Chris Ford           .71      .77       36     37.03    3.100     -.58
M. L. Carr           .68      .81       39     41.20    3.620    -1.16

(2) Even if one believes a one-sided alternative is appropriate, on the assumption that all nine players have independence between shots, the approximate probability is 1 - (1 - .0418)^9 = .32, or about one-third, that at least one of the nine P values would be as small or smaller than McHale's.

Table 4 presents the observed frequencies and row proportions for the free throw data collapsed over the nine Celtics under investigation. For the collapsed table, the relative frequency of a hit after a hit is 78.9 - 74.3 = 4.6 percentage points higher than the relative frequency of a hit after a miss. Moreover, for the collapsed table, it can be shown that a = 1,162, E(a) = 1,143.03, and var(a) = 72.015, yielding z = 2.24, which is statistically significant.

To summarize, separate analyses of individual players indicate that four players shot better after a hit and five players shot better after a miss, but none of the individual player patterns is convincing. By contrast, the analysis of the collapsed table gives statistically significant evidence in support of the hot hand phenomenon.

In view of the Celtics data, what, if anything, are we to make of the fact that 68 out of 100 of Tversky and Gilovich's avid basketball fans believe in the hot hand phenomenon for free throw shooting? Perhaps these fans have been watching players who do exhibit the hot hand. Perhaps these fans see patterns in data where no patterns exist. I prefer the following explanation.

I am an avid basketball fan. Over the past 30 years, I have observed several thousand different players shooting free throws. It is difficult to imagine that I (or any other basketball fan) could remember the equivalent of thousands of 2 x 2 tables. Yet these individual tables are exactly what I would need in order to investigate properly the question of the hot hand phenomenon.

Table 4. Observed Frequencies and Row Proportions for Free Throw Data Collapsed Over Nine Celtics

Observed frequencies
                        Second:
First:           Hit    Miss    Total
  Hit          1,162     311    1,473
  Miss           428     148      576
  Total        1,590     459    2,049

Row proportions
                        Second:
First:           Hit    Miss    Total
  Hit           .789    .211    1.000
  Miss          .743    .257    1.000


It is much more reasonable to assume that I have a single 2 x 2 table in my mind, namely, the collapsed table for all players I have seen. Just like the Celtics data, my collapsed table indicates that a success is more likely than a failure to be followed by a success. Thus, there is a pattern in the data that are reasonably available to me and, I conjecture, in the data that are reasonably available to Gilovich and Tversky's 100 basketball fans. It seems reasonable to suggest to basketball fans that the mental equivalent of Simpson's paradox could lead to a cognitive statistical illusion that results in their "seeing patterns in the data that do not exist."

3. STATIONARITY

Tversky and Gilovich correctly concluded that there is no evidence of the hot hand phenomenon in the free throw data. In this section, it is demonstrated, however, that the simple model of Bernoulli trials is also inappropriate. In particular, it is shown that several of the Celtics players shot significantly better on their second free throw, perhaps as a result of the practice afforded by the first shot.

Look at Table 1 again. Larry Bird made 84.3% (285 of 338) of his first shots compared to 88.5% (299 of 338) of his second shots. Thus, there is evidence that he improved on his second shot. The null hypothesis that his probability of success was constant can be investigated with McNemar's test, which uses the fact that the null distribution of

z1 = (b - c) / sqrt(b + c)    (4)

can be approximated by the standard normal curve. (Recall that b and c are defined in Table 2.) For Larry Bird, b = 34 and c = 48, giving

z1 = (34 - 48) / sqrt(34 + 48) = -1.55.

The same analysis can be performed for the other eight Celtics; the results are given in Table 5. The first column of the table lists the players' names. The second and third columns list, respectively, the relative frequencies of successes on the first and second shots. The remaining columns list the values of b and c from each player's 2 x 2 table and the value of z1 computed from Equation (4). The players are listed according to the difference in relative frequencies between the first and second shots.

Table 5. Selected Statistics for Comparing the Success Rates on the First and Second Free Throws for Nine Members of the Boston Celtics

Player              P(S1)   P(S2)      b      c        z1
Cedric Maxwell       .70     .80       57     97     -3.22
Robert Parish        .67     .75       49     76     -2.41
Nate Archibald       .76     .83       42     62     -1.96
Rick Robey           .53     .60       37     49     -1.29
Larry Bird           .84     .88       34     48     -1.55
Gerald Henderson     .73     .77       24     29      -.69
M. L. Carr           .69     .72       18     21      -.48
Chris Ford           .70     .73       15     17      -.35
Kevin McHale         .72     .69       35     29       .75
Total                 --      --      311    428    z2 = -4.30

Thus, Maxwell, who shot ten percentage points better on the second shot than on the first, is listed first, and McHale, who shot three percentage points better on the first shot, is listed last. Note the following features of the data.

(1) Eight of nine players had a higher success rate on their second shots.

(2) Three players had one-sided approximate P values below .05: Maxwell (.0006), Parish (.0080), and Archibald (.0250). The interpretation of these P values should take into account that nine tests were performed. If, in fact, each player had a constant success rate on his two shots, the approximate probability of obtaining at least one P value equal to or smaller than .0006 is 1 - (1 - .0006)^9 = .0054. Similarly, the approximate probability of obtaining at least two P values equal to or smaller than .0080 is .0022. Finally, the approximate probability of obtaining at least three P values equal to or smaller than .0250 is .0012. Thus, the three statistically significant results do not seem to be attributable to the execution of many tests.

(3) McNemar's test can be viewed as testing that a Bernoulli trial success probability equals .5 based on a sample of size b + c. Thus, several of the analyses of individual players presented in Table 5 are based on very little data and, hence, have very low power. To combat this difficulty, it is instructive to combine the data across the nine players. In particular, if the null hypothesis of constant success probability is true for all nine players, then the observed value of

z2 = Σ(b - c) / sqrt(Σ(b + c)),

where the sum is taken over the nine tables, can be viewed as an observation from a distribution that is approximately the standard normal curve. The observed value of z2 is -4.30, given in the bottom row of Table 5. This value indicates that there is overwhelming evidence against the assumption that all nine null hypotheses are true.
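A short sketch (not part of the original article) of these calculations, using the (b, c) counts from Table 5:

    from math import sqrt

    counts = {  # b = hit first shot, missed second; c = missed first shot, hit second
        "Maxwell": (57, 97), "Parish": (49, 76), "Archibald": (42, 62),
        "Robey": (37, 49), "Bird": (34, 48), "Henderson": (24, 29),
        "Carr": (18, 21), "Ford": (15, 17), "McHale": (35, 29),
    }

    for player, (b, c) in counts.items():
        print(player, round((b - c) / sqrt(b + c), 2))      # Equation (4), per player

    # Pooled over the nine players: sum(b - c) / sqrt(sum(b + c)) = -4.30
    b_total = sum(b for b, c in counts.values())
    c_total = sum(c for b, c in counts.values())
    print("z2 =", round((b_total - c_total) / sqrt(b_total + c_total), 2))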

4. SUMMARY

This article puts forth an argument to reconcile what avid basketball fans believe and what Tversky and Gilovich found. It is argued that the fans and the researchers were analyzing different sets of data. While the researchers' data had no pattern, the fans' data had a pattern. This pattern, however, was due to the effects of aggregation and not the hot hand phenomenon. This finding indicates that researchers should take care to consider what data are available to laypersons. In addition, this finding underscores the importance of increasing the awareness of statistical fallacies among the general public.

This article also demonstrates that several Celtics players showed a significant improvement in their shooting ability on the second free throw. Thus, while the hot hand phenomenon is not supported by these free throw data, neither is the simple model of Bernoulli trials.

[Received March 1992. Revised November 1993.]



REFERENCES

Diaconis, P., and Mosteller, F. (1989), "Methods for Studying Coincidences," Journal of the American Statistical Association, 84, 853-861.

Kahneman, D., Slovic, P., and Tversky, A. (1983), Judgement Under Uncertainty: Heuristics and Biases, Cambridge, U.K.: Cambridge University Press.

Kotz, S., and Johnson, N. L. (eds.) (1983), Encyclopedia of Statistical Science (Vol. 3), New York: John Wiley, pp. 24-28.

Shapiro, S. H. (1982), "Collapsing a Contingency Table—A Geometric Approach," The American Statistician, 36, 43-46.

Simpson, E. H. (1951), "The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, Ser. B, 13, 238-241.

Tversky, A., and Gilovich, T. (1989), "The Cold Facts About the 'Hot Hand' in Basketball," CHANCE: New Directions for Statistics and Computing, 2, 16-21.

Wagner, C. H. (1982), "Simpson's Paradox in Real Life," The American Statistician, 36, 46-47.


Part IV
Statistics in Ice Hockey


Chapter 23

Introduction to the Ice Hockey Articles

Robin H. Lock

We provide a short description of ice hockey and its history in this introduction. We also briefly discuss the application of statistical methods in hockey and we identify particular research areas. We use the articles selected for this part of the volume to give the reader a sense of the history of statistical research in hockey.

23.1 Background

Ice hockey is a fast-paced winter sport that naturally enjoys its greatest popularity among sports enthusiasts in the upper northern hemisphere. The sport is believed to have been first played in the early nineteenth century in Windsor, Nova Scotia; Kingston, Ontario; or Montreal, Quebec; the first known rules were published in 1877 by the Montreal Gazette. The first U.S. collegiate ice hockey game was played between Yale and Johns Hopkins in 1896. Professional hockey was established in the early twentieth century; the National Hockey League (NHL) was founded in 1917 and currently includes 30 professional teams across North America. After ice hockey was introduced as an Olympic sport at the 1920 Summer Olympics, its international popularity and stature gradually grew, culminating in the introduction of women's ice hockey as an Olympic sport in 1994.

In an ice hockey game, two opposing teams of skaters use long, curved sticks to try to drive a puck (a hard rubber disk) into each other's goal net. The team that scores the most goals by hitting the puck into its opponent's goal net with their sticks wins the game. If the two opponents have scored the same number of goals at the conclusion of regulation play, some leagues will declare the game a tie. In other leagues, this leads to an overtime period that is played under "sudden death" (i.e., the first team to score in overtime is declared the winner). If neither team scores during the overtime period, some leagues then declare the game a tie, while others may use some other means of determining the winner. Additional overtime periods can be played until one team scores, or the teams can have a shoot-out, where a series of players from each team alternate in taking shots on goal.

The sport may be played outdoors or indoors on a structure called a rink. A rink consists of an oval ice surface, surrounded by a wall (usually referred to as the boards), with goals on both ends. Parallel lines are painted across the ice to divide the rink into zones. The rink is divided in half by a red centerline, and blue lines between the centerline and the goals divide the rink further. The ice between the two blue lines is referred to as the neutral zone; the ice outside of the neutral zone that contains a team's goal is called their defending zone, and the ice outside of the neutral zone that contains their opponent's goal is called their attacking zone. Collectively, the defending and attacking zones are referred to as the end zones. The rink also has a blue circle in its center and eight red circles placed strategically around the perimeter of the rink. Play often begins or resumes after a stoppage at one of these circles.

Red lines (referred to as goal lines) also exist at each end of the rink, where the boards begin to curve. The goals (also called nets) are placed in the middle of these lines. A half-circle, called the crease, is painted in front of each goal. Attacking players may not enter the crease unless the puck is already there, and they may not make contact with the goalie (the opponent's player who is designated with the responsibility of defending the goal) while in the crease.

Each team may have up to six players on the ice at any time in a regulation ice hockey game. These players occupy three positions: forward, defense, and goalie. The three forwards—the center, left wing, and right wing—form a unit (called a line) that is primarily responsible for their team's offense. Centers, who are usually their team's best "passers," generally skate between and feed the puck to the wings, who are usually their team's best shooters. The two defenders comprise the last line of defense before the goalie; they attempt to disrupt their opponent's offense before they are able to shoot (attempt to score). Finally, the goalie (or goaltender) is responsible for guarding his team's goal and preventing the opponent's shots from entering his net. The goalie may stop a shot on his goal with any part of his body (which is usually protected with heavy padding, a helmet, a collar, and a mask), the small glove on his stick hand (the blocker), his stick, or the large leather glove on his free hand (the trapper). Although the names of the positions imply certain duties and responsibilities, the rules do not prevent players (other than the goalie) from skating to any part of the rink; forwards help with defense, defenders sometimes score, and goalies will even risk leaving the net unmanned late in a game their team is losing to allow a substitute player to generate additional offense.

At the highest competitive levels (professional, international, and college), games are comprised of three 20-minute periods with 15-minute intermissions after the first and second periods. The ice is usually resurfaced during the intermissions. At lower levels of competition, the three periods generally last 10 or 15 minutes. Each period begins with a face-off at the blue circle at center ice. One player from each team lines up at the blue circle with his stick blade on the ice. The referee drops the puck onto the ice and the two players attempt to gain control of it. Play stops when a goal is scored, the puck leaves the rink, a player is injured, a penalty or other infraction is called, or a goalie covers the puck. Substitute players may enter the game during any stoppage of play or even jump in "on the fly" when a teammate steps off the ice as play continues.

A player who violates a rule may be charged with a penalty. A penalized player must leave the ice and spend time in the penalty box. Minor penalties last two minutes and are usually charged to players who illegally impede an opponent's progress by tripping or holding. A player can also be charged with a minor penalty for delaying the game or risking injury to an opponent. Major penalties last five minutes and are charged to players who commit more serious fouls.

A misconduct penalty may be assessed against a player or team for a variety of infractions. A player charged with a misconduct penalty must leave the game for 10 minutes, but his team may immediately place a substitute on the ice. Game misconduct penalties are charged for dangerous play, often when the official deems that the player's intent is to injure an opposing player. A game misconduct penalty results in ejection of the penalized player and assessment of a minor penalty against his team. While goalies do not serve minor, major, or misconduct penalties (their penalties are served by a teammate who was on the ice at the time of the infraction), they may be ejected for game misconduct penalties.

When penalties leave one team with more skaters on the ice than its opponent, the team with more players on the ice is said to have a power play, and this advantage lasts for the duration of the penalty or until they score a goal (unless it's a major penalty). Only two minor penalties can be served simultaneously—if a team accumulates more than two overlapping minor penalties, the additional penalty is served immediately after one of the two current penalties expires. If a goal is scored on a team that is serving two penalties, only one of the penalized players may return to the game, and the team remains shorthanded. If a player is ejected from the game, his team plays shorthanded for five minutes and may then substitute for the ejected player.

Other rule violations do not result in the assessment of penalty time. An offsides penalty occurs when an offensive player crosses into the attacking zone ahead of the puck. At this point a face-off is held in the neutral zone, giving the opponent an opportunity to regain control of the puck. Some leagues also use a two-line offsides rule that prohibits an offensive player from making a pass across the centerline from the defensive zone. At this point, a face-off is held in the red circle nearest the point from where the illegal pass was made.

Icing occurs when the puck moves from behind the centerline to a point beyond the opponent's goal line without being touched by another player. In some leagues, this infraction is automatically called, while other leagues call icing only after a defender has touched the puck behind the goal line. Once icing is called, a face-off is held at the red circle closest to the penalized team's net.

Available data for the line and defenders include position played (center, left wing, right wing, defense, goalie), games played, goals scored, assists, points (the sum of goals and assists), plus-minus (difference between goals scored and goals allowed while the player is on the ice), penalty minutes, power play goals, shorthanded goals, game-winning goals (goals that give the winning team just enough goals to win), game-tying goals (the final goal in a tie game), shots on goal, shooting percentage, average number of shifts per game, average time on ice per game, face-offs won, face-offs lost, and percentage of face-offs won. Available data for goalies include games played, wins (a goalie is credited with a win if he is on the ice when his team scores the game-winning goal), losses (a goalie is charged with a loss if he is on the ice when the opposing team scores the game-winning goal), ties (a goaltender is credited with a tie if he is on the ice when the game-tying goal is scored), goals against, shots against, goals-against average (goals allowed per full game played), saves, save percentage, shutouts (a goalie must play the entire game to receive credit for a shutout), and penalty minutes.

Team standings in the National Hockey League are determined by a point system. A team is awarded two points for a win, one point for a tie, and one point for an overtime loss. Other available team data include aggregated individual player statistics.

23.2 Determining the Superior Team

Although the papers in this section appear here in chronological order, they also follow a natural order as a typical season reaches its conclusion. In Chapter 24, Danehy and Lock look at ways to compare teams from different leagues who play a regular season schedule with varying degrees of overlapping interleague play. One goal of the methods they develop is to provide statistically reasonable procedures for selecting teams to compete in a postseason tournament. Once the tournament begins, each game must be played to a definite conclusion so that only one team moves on after each contest. In Chapter 25 Hurley investigates several schemes for reaching that decision when two teams are tied at the end of regulation game-time. In Chapter 26 Morrison and Schmittlein complete the playoff cycle by examining the factors that play a role in determining the winning team in the "best of seven playoff" format used by the National Hockey League to award the Stanley Cup.

Harville (1977) considered the use of linear regression methods to produce comparative ratings of teams based on individual game scores in football. Danehy and Lock looked for similar models to apply to ice hockey and introduced two significant revisions. Whereas Harville and other authors (e.g., Stern, 1995) use a single rating value for each team (based on the winning and losing margins of its games), Danehy and Lock suggest using separate offensive and defensive ratings to capture both important aspects of a team's performance. This also allows them to predict individual game scores and distributions based on the ratings given to the participating teams. When analyzing this model with actual college hockey scores, they found that it was susceptible to large discrepancies between the ratings given to different leagues when there was little interleague play. This motivated the introduction of a method for parameterizing the model to move smoothly from one that gives heavy weight to the quality of an opponent to one that ignores this factor completely. Thus the model can be fine-tuned to reflect quality of opponents when producing the ratings while remaining robust to a few extreme scores.

As the earliest paper in this group (Chapter 24), Danehy and Lock's work on rating methods for college hockey has generated the most follow-up work to date. They published a second paper (Danehy and Lock, 1995) that extends the method from an ordinary least squares model to consider scores to be generated by a pair of Poisson processes with the team ratings determining the values of the Poisson scoring rates for any particular contest. This was consistent with work done by Reep, Pollard, and Benjamin (1971), who found that game scores in football (soccer), cricket, baseball, and ice hockey followed a negative binomial distribution that reflected mixtures of Poisson distributions with parameters varying with the teams in each match. In later work, Danehy and Lock (1997) modified the model further to allow the offensive, defensive, and home ice parameters to interact multiplicatively, rather than use the additive model introduced in the original paper. Details of the current methods can be found on the College Hockey Offensive and Defensive Ratings (CHODR) website at http://it.stlawu.edu/~chodr. This site also contains historical ratings and predictive results collected since 1996 and was expanded in 1999 to include ratings for NCAA Division I Women's Ice Hockey (WCHODR).

In the 2000 NCAA Division I Men's Ice Hockey playoffs, St. Lawrence University (SLU) played Boston University (BU) in a second round game at the Knickerbocker Arena in Albany, New York. The score was tied 2-2 after 60 minutes of regulation play and the teams proceeded to play three full 20-minute overtime periods before SLU's Robin Carruthers scored a goal at 3:53 of the fourth overtime to end the longest-ever NCAA tournament game. The two goalies (both freshmen) each smashed the previous record for most saves in an NCAA playoff game. The two other teams (Michigan and Maine) that were scheduled to play in the second game that day had to wait until well after the time their game was scheduled to finish before even taking the ice; this created havoc with pregame meal schedules, travel arrangements, broadcasters' voices, and fans' nerves. Perhaps these disruptions could have been avoided if NCAA officials had adopted some proposals made in Hurley's 1995 article, presented in Chapter 25.

Hurley compares overtime and the shootout as methods for determining the victor in a tied contest. His work was motivated, in part, by a series of high-profile hockey and soccer tournaments the previous year in which the championship games were decided by shootout rather than a goal scored by traditional team play. The shootouts had been instituted to prevent the sort of protracted overtime affairs typified by the SLU-BU contest. But some players, fans, and sportswriters have heavily criticized the shootout as an inappropriate way to determine a very important outcome for a sport that emphasizes team play. Liu and Schultz (1994) looked at previous overtime methods employed by the National Hockey League to examine how often the tie was broken and the chances of the better team winning. Hurley suggests a novel alternative to the two extremes. First, hold the shootout and then play the overtime period. The team that wins the shootout starts the overtime with a one-goal advantage so that if neither team scores during the overtime they are declared the winner, but the shootout loser still has a chance to recover through traditional play to retie the match. Hurley provides analysis of the likely time needed to implement this procedure and considers the probability that the better team will prevail.

Mosteller (1952) performed an early analysis of the results of a best 4 out of 7 playoff system by analyzing World Series (baseball) data to estimate the chances that the better team will come out on top in the series. Maisel (1966) follows this up with an extensive analysis of best k of 2k - 1 competitions that, ironically, looks at properties both with and without the possibility of some contests ending in ties. But Stanley Cup games are always played to conclusion and we could always resort to Hurley's tie-breaking scheme if the prospect of televising unreasonably long overtime games becomes a problem!

Morrison and Schmittlein (Chapter 26 in this volume) examine the past history of Stanley Cup series and compare the number of games played to what one would expect if equally matched teams played a best 4 out of 7 series with independent game results. While theory would suggest that 6- and 7-game series are the most likely, the actual results show far more sweeps (4-game series) and fewer 7-game series than one would expect. What might cause this discrepancy? The authors suggest and examine three possibilities. Perhaps the teams are not well matched and one is actually significantly better than its opponent so that the probability of winning each game moves away from the theoretical 0.5. Or the discrepancy might be due to a home ice advantage since an odd number of games allows one team an extra contest at its home rink. Finally, the independence assumption might be faulty. Could one team ride a "hot" goalie to several quick wins and an early series victory? This chapter examines each of these possibilities to see which might be consistent with previous Stanley Cup results.
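The baseline claim about series lengths is a standard binomial calculation. The sketch below (an illustration, not taken from the chapter) gives the probability that a best-of-seven series between evenly matched teams with independent games lasts exactly g games; 6- and 7-game series are indeed the most likely, each with probability 0.3125.

    from math import comb

    def series_length_prob(g, p=0.5):
        # Either team can clinch in game g by winning it after taking exactly
        # 3 of the first g - 1 games.
        q = 1 - p
        return comb(g - 1, 3) * (p**4 * q**(g - 4) + q**4 * p**(g - 4))

    for g in range(4, 8):
        print(g, series_length_prob(g))   # 0.125, 0.25, 0.3125, 0.3125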

23.3 Summary

A great deal of game-level information is collected about individual hockey players and teams. While much of the focus of statistical research in hockey has been devoted to ranking or comparing teams, a great deal of opportunity exists for analyzing performances in a manner similar to baseball.

While the three papers included in this section are based on the structure and results of ice hockey competitions, the methods they introduce could be generalized to other sports, particularly soccer, lacrosse, field hockey, and water polo, which share the same basic format of teams trying to get an object past the opposing team's players into a confined space that is guarded by a goalie as the final line of defense.

References

Danehy, T. J. and Lock, R. H. (1995), "CHODR—Using statistics to predict college hockey," STATS, 13, 10-14.

Danehy, T. J. and Lock, R. H. (1997), "Using a Poisson model to rate teams and predict scores in ice hockey," in 1997 Proceedings of the Section on Statistics in Sports, Alexandria, VA: American Statistical Association, 25-30.

Harville, D. (1977), "The use of linear-model methodology to rate high school or college football teams," Journal of the American Statistical Association, 72, 278-289.

Liu, Y. and Schultz, R. (1994), "Overtime in the National Hockey League: Is it a valid tie-breaking procedure?" in 1994 Proceedings of the Section on Statistics in Sports, Alexandria, VA: American Statistical Association, 55-60.

Maisel, H. (1966), "Best k of 2k - 1 comparisons," Journal of the American Statistical Association, 61, 329-344.

Mosteller, F. (1952), "The World Series competition," Journal of the American Statistical Association, 47, 355-380.

Reep, C., Pollard, R., and Benjamin, B. (1971), "Skill and chance in ball games," Journal of the Royal Statistical Society, Series A (General), 134, 623-629.

Stern, H. (1995), "Who's number 1 in college football? ... And how might we decide?" Chance, 8, 7-14.


Chapter 24

Statistical Methods for Rating College Hockey Teams

Timothy J. Danehy, Clarkson University; Robin H. Lock, St. Lawrence University
Robin H. Lock, Mathematics Department, St. Lawrence University, Canton, NY 13617

KEY WORDS: Sports rating models; Poisson regression; Schedule graphs; Sports rankings

ABSTRACT: We investigate methods for rating sports teams based solely on past game results. Techniques are illustrated using data from NCAA Division I Men's Ice Hockey competition, although the methods can easily be applied to other sports and levels of play. The proposed systems produce offensive, defensive, and overall ratings based on past performance and the quality of opponents. These ratings can then be used to compare teams and forecast outcomes of future games. A significant challenge in such rating schemes, particularly in the college environment, is the lack of connectivity in the schedule graph. We demonstrate how an insufficient amount of inter-league play can cause traditional regression methods to break down and produce clearly inappropriate ratings. We suggest a modified procedure designed to avert such pitfalls and examine the effectiveness of various models in predicting future game outcomes.

1. INTRODUCTION

In this paper we investigate statistical models for rating sports teams, with specific applications to college hockey. Although the methods we develop are designed for college hockey, the general principles could be easily applied in other settings. In the next section we describe some overall goals which might be addressed by a rating system. Section 3 gives an overview of the schedule structure in college hockey, describes some characteristics of the game which make it amenable to ratings, and presents two existing ratings systems. A traditional additive model and least squares approach to estimating offensive and defensive ratings as well as a home ice advantage is presented in Section 4. We then demonstrate a deficiency in this model, using an example from early in the 1992-93 hockey season. A mechanism for dealing with problems arising from the lack of many inter-league connections in the schedule graph is described in Section 6. Finally we examine the performance of these ratings methods when predicting game outcomes in the 1992-93 regular season and playoffs.

2. GOALS OF A RATING SYSTEM

We must be clear to distinguish between rating systems and ranking systems. Most sports "Top 10" type polls are rankings - they provide a relative ordering of the teams by (perceived) ability, but do not give any numerical measure (other than number of votes) of that ability. A rating system should go further to produce direct estimates of a team's strength on some interpretable scale. Some functions of a rating system might include:

a. To rank order all teams (or individuals).
b. To compare specific teams (or individuals).
c. To adjust ratings for the quality of opponents.
d. To predict game outcomes (winner/loser).
e. To predict specific game differentials.
f. To predict specific game scores.

One obvious rating system is a simple won-lost percentage which could be used to accomplish tasks (a), (b), and (d). Another common rating system would be to take each team's average points scored and subtract its average points given up. Such a rating would allow for the prediction of game differentials or could be used to predict individual game scores. However, neither of these simplistic methods would address the issue of adjusting ratings to account for strengths of opponents.

3. COLLEGE HOCKEY

3.1 Structure of Division I Ice Hockey

In the 1992-93 season forty-four schools competed in Men's Ice Hockey at the NCAA Division I level. Most of these teams were organized into four leagues - Hockey East (HE), ECAC, CCHA, and WCHA. League members played from 22 (ECAC) to 32 (WCHA) of their games within their own league, with as few as zero (St. Cloud) to as many as 11 (Lowell and Maine) against non-league opponents.

We consider a schedule graph consisting of 44 vertices, representing individual teams, and a set of edges connecting teams involved in each game. The graph consists of four main blocks, corresponding to the leagues, with lots of edges (in fact multiple complete subgraphs) within each league block, but relatively few connections between the blocks. For example, in 1992-93 there were no regular season games between ECAC and WCHA teams and only two between the WCHA and HE. Overall, 537 of 683 matchups (79%) in the 1992-93 regular season were league contests. Within each league we might be confident in using standings based on a balanced league schedule as a basis for ranking teams. However, we need an alternate mechanism to reliably compare teams between different leagues.

3.2 Why College Hockey?

Several aspects of college hockey are particularly well-suited for investigating ratings methods. It is generally well-recognized that a team's performance can be measured primarily by its ability to score goals and to prevent its opponent from scoring. Scores directly reflect the number of goals scored, as compared to football or basketball where each "score" can yield a different number of "points." Each goal is an individual event without a clustering of scores as might be found in baseball. The scoring rate is relatively low (averaging around four goals per game), but considerably higher than in soccer. Finally, there is considerable fan interest in comparisons between the leagues, particularly as the regular season winds down and 12 teams are selected and seeded to play for the national championship in the NCAA post-season tournament.

3.3 Previous Hockey Rating Systems

One rating system which is used in the NCAA tournament selection process (for basketball as well as hockey) is the Ratings Percentage Index (RPI). The system considers a team's own strength as well as strength of schedule via a weighted average of three parts: winning percentage (20%), opponents' winning percentage (40%), and opponents' opponents' winning percentage (40%). Another model is TCHCR (The College Hockey Computer Rating) (Instone 1992). This system uses a least squares optimization (Leake 1976) to match the ratings differences between two teams as closely as possible with actual game outcomes, as measured by a performance function (see Stern 1992 for a discussion of performance functions in football). The function used in TCHCR depends only on the game outcome (win, loss, or tie) so it primarily reflects winning percentage and strength of schedule, just as the RPI. Both systems allow us to compare specific teams while adjusting for the quality of opponents and predict game outcomes (winner/loser) with the differences in ratings providing a vague indication of the disparity between two teams. However, neither of these systems is able to predict specific game differentials or actual scores.

3.4 Assumptions

As a database for computing ratings we use only games played between NCAA Division I schools, excluding any contests with other divisions or Canadian schools. Since we model regulation time scoring rates, we ignore all goals scored in overtime. We assume a "home ice" advantage which is estimated as part of the ratings. A team's expected scoring rate in a particular game is assumed to depend on its offensive ability, the defensive ability of its opponent, and the site of the contest. When computing ratings, data from past games is limited to game scores and indicators for home ice and overtime.

4. AN ADDITIVE MODEL

Let S_ijk denote the goals scored by Team i against opposing Team j, with the index k reflecting a game counter. If n games are played, each producing two scores, we have k = 1, 2, ..., 2n. Our additive model produces an expected scoring rate according to

S_ijk = O_i + D_j - μ + H I_k,    (1)

where
O_i = offensive rating for team i,
D_j = defensive rating for team j,
H = home ice adjustment,
I_k = (+1 if i home, 0 if neutral, -1 if road),
μ = mean scoring rate (goals/game).

The key rating quantities here are O_i and D_j. One may think of O_i as the expected scoring rate for Team i against a hypothetical "average" team on neutral ice. Similarly, D_j represents the expected number of goals which Team j will give up when playing the "average" team. Since the object of the game is to score more goals than you give up, we define an overall rating of a team's strength by

R_i = O_i - D_i.

Thus the "average" team should have an overall rating of zero. Direct comparisons are easily managed since R_i - R_j gives the expected goal differential when Team i plays Team j on neutral ice.

Actual ratings (O_i and D_j) and other parameters (μ and H) can be estimated based on past game data using a least squares fit. To obtain unique estimates we add the condition that the (weighted) average offensive and defensive ratings be the same and, consequently, equal to μ. Thus

(Σ n_i O_i) / (Σ n_i) = (Σ n_i D_i) / (Σ n_i) = μ,

where n_i counts the games played by Team i.


The least squares solution provides the following relations among scores and estimates:

where
S = average goals scored,
S_h, S_r = average goals scored at home/on the road,
O_h = average offensive rating for home teams,
O_r, D_h, D_r are defined analogously,
A_j = {k : kth score was AGAINST team j},
B_i = {k : kth score was BY team i}.

Although one could use a standard package to calculate the least squares estimates, we have chosen an iterative method which can be adapted more easily to the modified situations which we discuss shortly. We start with initial estimates O_i = average goals scored by team i, D_j = average goals scored against team j, μ = S (the overall average score), and H = (S_h - S_r)/2. We then adjust H using (5) and the initial offensive and defensive estimates, recompute new offensive estimates with (6), and finally calculate new defensive ratings using (7). This cycle is continued until the estimates converge to fixed values.
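A minimal sketch of this iterative fit appears below. It is not the authors' code, and since the printed updating equations (5)-(7) are not reproduced in this reprint, the updates used here are simply the natural least-squares ones implied by the additive model in Equation (1); treat their exact form, and all names, as assumptions. Each record is a single score, (team_i, opponent_j, goals_by_i, I_k), with I_k = +1 if team i was at home, 0 on neutral ice, and -1 on the road.

    def _avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    def fit_ratings(records, n_iter=50):
        teams = {i for i, _, _, _ in records} | {j for _, j, _, _ in records}
        mu = _avg([s for _, _, s, _ in records])                      # mean scoring rate
        O = {t: _avg([s for i, _, s, _ in records if i == t]) for t in teams}
        D = {t: _avg([s for _, j, s, _ in records if j == t]) for t in teams}
        H = (_avg([s for _, _, s, I in records if I == 1]) -
             _avg([s for _, _, s, I in records if I == -1])) / 2      # initial home-ice adjustment
        for _ in range(n_iter):
            # Home-ice adjustment from the current offensive and defensive estimates.
            H = _avg([I * (s - O[i] - D[j] + mu) for i, j, s, I in records if I != 0])
            # Offensive ratings: average score after removing opponent defense and site effects.
            for t in teams:
                O[t] = _avg([s - D[j] + mu - H * I for i, j, s, I in records if i == t])
            # Defensive ratings: average goals allowed after removing opponent offense and site effects.
            for t in teams:
                D[t] = _avg([s - O[i] + mu - H * I for i, j, s, I in records if j == t])
        # The paper also pins down the offense/defense split by requiring the weighted
        # averages of the O's and D's (weights = games played) to both equal mu.
        return O, D, H, mu

A predicted score for Team i against Team j is then O[i] + D[j] - mu + H * I, as in Equation (1).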

5. SPARSE CONNECTIONS

The lack of frequent connections between league blocks in the schedule graph can lead to some serious difficulties in computing comparative ratings. A vivid example can be seen in the fifth week of the 1992-93 season when Air Force (an independent team) traveled to Colorado College (of the WCHA) and lost by a score of 12-3. At that early point in the season there was only one other connection between the WCHA and the rest of the schedule graph. The subsequent ratings accounted for the extreme CC-AF score by raising the offensive ratings of all the WCHA teams by a considerable margin and compensating by significantly lowering their defensive ratings. Predicted scores between two WCHA teams were still reasonable, but comparisons with the rest of college hockey made little sense (see Table 1 - WCHA teams in bold). Although this dramatic lack of stability is magnified by its occurrence early in the season (only 129 games and few interleague connections), more subtle problems can be attributed to sensitivity to sparse connections throughout the season.

6. MODIFIED ADDITIVE MODEL

As an alternate model, one might consider predicting the number of goals a team should score based only on its offensive ability, totally disregarding the defensive strength of its opponent. Using the notation introduced in Section 4 we would have

Thus the ratings estimates as well as each predicted score would depend only on each team's average goals scored with an adjustment for home ice.

But why should the offense be so important? An equally plausible model might place the burden of prediction solely on the opponent's defensive ability.

We could even combine both viewpoints to calculate an expected score by averaging.

Although this gives the original model (1), we might consider using (8) and (9) in the iterative procedure for computing the estimates. We would eliminate the difficulty of one unusual game affecting estimates throughout a league; however, we would also lose any ability to adjust ratings based on the quality of opponents. Thus a team could enhance its ratings by scheduling weak opponents and building up attractive goals for and goals against averages.

To minimize the liabilities of both approaches, we include an additional parameter α that allows us to vary smoothly between them.

Clearly α = 0 gives our original model, while α = 1 produces equations (8) and (9). We still use (1) to obtain score predictions, but the updating equations in the iterative procedure become

Table 2 shows how the introduction of an α value as small as 0.05 or 0.10 can alleviate many of the problems caused by extreme instances such as the CC-AF game, while still giving significant weight to quality of opponents.
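The printed updating equations for the modified model are also not reproduced in this reprint. One plausible form, consistent with the description above (α = 0 recovers the least-squares updates, α = 1 ignores the opponent entirely), shrinks the opponent-strength adjustment by a factor 1 - α; the short sketch below records that form and should be read as an assumption rather than the authors' exact equations. Records have the same (team_i, opponent_j, goals_by_i, I_k) layout as in the earlier sketch.

    def update_offense(t, records, D, mu, H, alpha):
        # At alpha = 0 this is the least-squares update; at alpha = 1 it is the
        # team's average goals scored, adjusted only for home ice.
        vals = [s - (1 - alpha) * (D[j] - mu) - H * I
                for i, j, s, I in records if i == t]
        return sum(vals) / len(vals)

    def update_defense(t, records, O, mu, H, alpha):
        vals = [s - (1 - alpha) * (O[i] - mu) - H * I
                for i, j, s, I in records if j == t]
        return sum(vals) / len(vals)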


7. RATING THE 1992-93 SEASON

7.1 Comparison of Models

Data from the 1992-93 season were used to evaluate the effectiveness of these models in predicting college hockey scores and game outcomes. The first seven weeks of the season (200 games) were applied to develop initial ratings. Thereafter (553 games), we forecast each week's games using ratings based only on data available prior to that week. We compare results for the modified additive model using α = 0, 0.05, 0.10, and 1.

Measures for evaluating the accuracy of the ratings are presented in Table 3. Several of these were suggested by Stern (1992) in his analysis of football rating systems. The most basic quantity is the percentage of games for which the predicted scores accurately forecast the winner of the contest. The percentages in Table 3 are based on the 473 games after Week 7 in which a winner was determined in regulation time. To test a model's ability to forecast the margin of victory we use the mean absolute deviation between the predicted and actual goal differentials. An information statistic is computed as the average negative logarithm of the joint probability of the observed game scores, using a Poisson probability model based on predicted scoring rates. We also include the square root of the average squared prediction errors and the median absolute deviation between predicted and actual team scores.
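As an illustration only (not the authors' code), the sketch below computes these measures from paired lists of predicted and actual (team 1, team 2) scores for decided games; the information statistic assumes strictly positive predicted scoring rates.

    import math
    from statistics import median

    def evaluate(predicted, actual):
        n = len(predicted)
        correct = 0
        mad_diff = 0.0
        neg_log_prob = 0.0
        sq_err = 0.0
        abs_errs = []
        for (p1, p2), (a1, a2) in zip(predicted, actual):
            correct += (p1 - p2) * (a1 - a2) > 0               # predicted the right winner
            mad_diff += abs((p1 - p2) - (a1 - a2))              # goal-differential error
            for rate, goals in ((p1, a1), (p2, a2)):
                # negative log of the Poisson probability of the observed score
                neg_log_prob += rate - goals * math.log(rate) + math.lgamma(goals + 1)
                sq_err += (rate - goals) ** 2
                abs_errs.append(abs(rate - goals))
        return {
            "percent correct": correct / n,
            "MAD goal differential": mad_diff / n,
            "average -log(prob)": neg_log_prob / n,
            "root MSE goals": math.sqrt(sq_err / (2 * n)),
            "median abs prediction error": median(abs_errs),
        }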

Although many of the differences in Table 3 are small, the trends are quite consistent. The α = 0.10 case is invariably better than the two extremes, while the models allowing no adjustment for opponent's strength (α = 1) are inevitably the least effective. To help assess the percentage correct data, one can check that the home team won 256 of the 421 (60.8%) non-tie, non-neutral ice games during this period. The additive least squares model (with α = 0) forecast 319 (75.8%) of those outcomes correctly.

7.2 Forecasting the 1992-93 NCAA Playoffs

As another means of assessing the effectiveness of our ratings systems, we examine performance in predicting the outcomes in postseason tournaments. Despite the fact that only one regular season leader won its conference tournament (Maine in HE), the modified additive ratings (α = 0.10) correctly predicted the winners for 41 of the 52 (78.8%) non-tie games in the league playoffs. Predictions in the NCAA tournament should be more challenging since the opponents are frequently from different leagues.

Table 4 gives the ratings (α = 0.10) for all 44 Division I teams following conference playoffs. At that point in the season an NCAA selection committee (using RPI ratings as one of the criteria) chose 12 teams and seeded them into its tournament. The participants in 1993 included the top 9 teams in Table 4, plus Northern Michigan (#13), Brown (#14), and Minnesota (#18) which received an automatic bid for winning the WCHA tournament. Maine, Michigan, Lake Superior, and Boston University received the top seeds and first round byes. The four first round games featured three "upsets" according to the seedings and our ratings (see Table 5). After that point, the ratings correctly predicted winners of the final seven games in the playoffs, including Maine's victory over Lake Superior State for the National Championship.
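As a quick illustration (not from the paper) of how the Table 5 forecasts follow from the published ratings, the championship-game prediction can be reproduced from Equation (1), the Table 4 ratings, μ = 3.95, and the assumption of neutral ice (I = 0); small differences are due to the one-decimal rounding of the published ratings.

    mu = 3.95
    maine = {"O": 6.1, "D": 2.5}
    lake_superior = {"O": 4.7, "D": 2.9}

    maine_goals = maine["O"] + lake_superior["D"] - mu     # about 5.05
    lssu_goals = lake_superior["O"] + maine["D"] - mu      # about 3.25
    print(maine_goals, lssu_goals)   # in line with the published 5.0-3.2 forecast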

8. CONCLUSION

We have proposed statistical models for rating systems to satisfy the criteria set forth in Section 2. By the end of a full season, when most teams have played a balanced league schedule and some (but never enough) connections exist between the leagues, the ratings and predictions produced for various values of α are not terribly different. However, it is desirable to account for strength of opponents in producing reasonable ratings for comparing teams between conferences at earlier points in the season. Section 5 clearly demonstrates that traditional regression methods may not be optimal for this task. Thus we have suggested a method for moving smoothly between regression methods (which can be overly sensitive to sparse connections) and simple averages (which completely disregard opponent's strength). The results from the 1992-93 season indicate that the modified system is likely to be superior to either extreme. These methods have been used to produce a weekly rating service called CHODR (College Hockey Offensive and Defensive Ratings) which is disseminated through the HOCKEY-L electronic discussion list.

REFERENCES

Instone, K. (1992), "Inside the College Hockey Computer Rating," unpublished manuscript available through e-mail at [email protected].

Leake, R.J. (1976), "A Method for Ranking Teams with an Application to College Football," in Management Science in Sports, North Holland, eds. R.E. Machol, S.P. Ladany, D.G. Morrison, 27-46.

Stern, H. (1992), "Who's Number One - Probability and Statistics in Sports," 1992 Proceedings of the ASA Section on Statistics in Sports.


Table 1. Additive Model - Least Squares Regression (α = 0.00) (Through November 14, 1993 - Week 5)

Without CC - AF
Rank   Team                  Off    Def   Overall
  1    Maine                 8.4    1.9     6.5
  2    Denver                7.0    1.6     5.4
  3    Yale                  8.7    3.5     5.3
  4    Wisconsin             6.5    2.1     4.4
  5    Michigan Tech         5.5    1.4     4.1
  6    Princeton             6.2    2.5     3.8
  7    North Dakota          6.7    3.2     3.5
  8    Minnesota-Duluth      5.9    2.4     3.5
  9    St Cloud              5.4    2.1     3.3
 10    Cornell               3.8    0.7     3.2
 11    Colorado College      6.0    3.0     3.0
 12    Minnesota             5.4    2.5     2.9
 18    Air Force             5.0    3.1     1.9
 21    Northern Michigan     4.9    4.1     0.8

With CC - AF
Rank   Team                  Off    Def   Overall   Δ Overall
  1    Denver                8.1    0.2     7.9        2.6
  2    Wisconsin             7.7    1.0     6.7        2.3
  3    Michigan Tech         6.5    0.0     6.5        2.4
  4    North Dakota          8.2    2.1     6.1        2.6
  5    Minnesota-Duluth      7.2    1.1     6.1        2.6
  6    Colorado College      7.9    2.1     5.8        2.8
  7    St Cloud              6.7    0.9     5.8        2.4
  8    Maine                 7.7    2.3     5.4       -1.1
  9    Minnesota             6.4    1.0     5.4        2.4
 10    Yale                  8.2    3.7     4.5       -0.8
 11    Princeton             5.7    2.7     3.0       -0.8
 12    Northern Michigan     6.0    3.2     2.8        2.0
 13    Cornell               3.6    1.3     2.3       -0.9
 20    Air Force             5.1    4.6     0.5       -1.4

Table 2. Additive Model - Least Squares Regression (With CC - AF) (Through November 14, 1993 - Week 5)

α = 0.05
Rank   Team                 Overall     Δ
  1    Maine                  5.6     -0.3
  2    Yale                   3.6     -0.2
  3    Denver                 2.6      0.7
  4    Princeton              2.1     -0.2
  5    Wisconsin              1.8      0.5
  6    Michigan Tech          1.6      0.7
  7    Lake Superior          1.4      0.0
  8    Cornell                1.3     -0.3
  9    Clarkson               1.2     -0.5
 10    Colorado College       1.2      1.5
 14    North Dakota           0.8      0.7
 17    Minnesota-Duluth       0.7      0.6
 20    Minnesota              0.5      0.7
 21    St Cloud               0.5      0.5
 31    Air Force             -1.1     -1.8
 35    Northern Michigan     -1.5      0.5

α = 0.10
Rank   Team                 Overall     Δ
  1    Maine                  5.1     -0.2
  2    Yale                   3.1      0.0
  3    Denver                 2.1      0.4
  4    Princeton              1.8      0.0
  5    Lake Superior          1.7      0.0
  6    Wisconsin              1.4      0.3
  7    Clarkson               1.3     -0.3
  8    Michigan Tech          1.2      0.4
  9    Miami (Ohio)           1.0      0.0
 10    Michigan               1.0      0.0
 14    Colorado College       0.7      1.2
 20    North Dakota           0.3      0.4
 21    Minnesota-Duluth       0.2      0.3
 22    Minnesota              0.2      0.4
 24    St Cloud               0.0      0.2
 34    Air Force             -1.3     -1.6
 38    Northern Michigan     -1.7      0.2

α = 1.00
Rank   Team                 Overall
  1    Maine                  2.8
  2    Yale                   1.8
  3    Lake Superior          1.3
  4    Clarkson               1.2
  5    Princeton              1.1
  6    Denver                 1.0
  7    St Lawrence            0.9
  8    Wisconsin              0.8
  9    Harvard                0.8
 10    Boston University      0.7
 12    Michigan Tech          0.6
 19    Colorado College       0.1
 24    Minnesota              0.0
 26    North Dakota          -0.3
 27    Minnesota-Duluth      -0.3
 29    St Cloud              -0.5
 39    Air Force             -1.1
 42    Northern Michigan     -1.2

Δ = Change in overall rating as a result of including CC - AF game.

Table 3. Comparison of Model Effectiveness (Weeks 8-25)


Method           α      Percent    MAD Goal        Average      Root MSE    Med Abs
                        Correct    Differential    -log(prob)   Goals       Pred Err
Additive-LSQ    0.00     74.0%         2.22           2.08         2.02       1.34
Additive-LSQ    0.05     74.8%         2.21           2.07         2.00       1.34
Additive-LSQ    0.10     74.8%         2.20           2.07         2.00       1.33
Additive-LSQ    1.00     70.4%         2.34           2.09         2.04       1.38
RPI               --     68.9%
TCHCR             --     71.0%


Table 4. Additive Model - Least Squares Regression (a = 0.10)
(Through March 21, 1993 - Week 23)

Overall
 Rank   Team                 Offense   Defense   Overall
   1    Maine                  6.1       2.5       3.6
   2    Michigan               5.8       2.7       3.2
   3    Lake Superior          4.7       2.9       1.8
   4    Boston University      4.8       3.1       1.7
   5    Clarkson               4.5       2.9       1.6
   6    Miami (Ohio)           4.6       3.1       1.5
   7    Minnesota-Duluth       4.9       3.6       1.3
   8    Harvard                4.3       3.1       1.2
   9    Wisconsin              4.4       3.4       1.0
  10    Michigan State         4.0       3.3       0.6
  11    RPI                    4.0       3.4       0.6
  12    Michigan Tech          4.1       3.5       0.6
  13    Northern Michigan      4.3       3.7       0.5
  14    Brown                  4.4       4.0       0.4
  15    New Hampshire          4.1       3.7       0.4
  16    UMass-Lowell           4.2       3.8       0.4
  17    St Lawrence            4.1       3.8       0.3
  18    Minnesota              4.0       3.7       0.3
  19    Providence             4.0       3.9       0.1
  20    St Cloud               3.8       3.7       0.1
  21    Ferris State           3.7       3.7       0.0
  22    Yale                   4.1       4.1      -0.1
  23    Western Michigan       3.8       3.9      -0.1
  24    Alaska-Fairbanks       4.2       4.3      -0.1
  25    Bowling Green          4.1       4.2      -0.1
  26    Denver                 3.7       4.1      -0.4
  27    Kent                   4.0       4.6      -0.6
  28    Vermont                2.9       3.5      -0.6
  29    North Dakota           3.7       4.4      -0.6
  30    Northeastern           4.1       5.0      -0.9
  31    Colgate                3.6       4.6      -1.0
  32    Dartmouth              3.5       4.6      -1.1
  33    Illinois-Chicago       3.3       4.4      -1.1
  34    Boston College         3.4       4.5      -1.1
  35    Princeton              3.1       4.4      -1.3
  36    Merrimack              3.6       5.0      -1.4
  37    Colorado College       3.8       5.2      -1.4
  38    Alaska-Anchorage       2.5       4.0      -1.4
  39    Notre Dame             3.0       4.7      -1.7
  40    Cornell                2.8       4.5      -1.7
  41    Union                  2.3       4.9      -2.6
  42    Ohio State             2.8       5.6      -2.8
  43    Army                   2.5       5.6      -3.0
  44    Air Force              2.3       5.6      -3.3

Table 5. 1993 NCAA Division I Men's Ice Hockey Tournament

Date       Prediction   Game                                      Score
3/26/93    4.2-2.9      Clarkson vs Minnesota                     1-2
           4.1-3.4      Harvard vs Northern Michigan              2-3 (OT)
           4.9-4.1      Minnesota-Duluth vs Brown                 7-3
           4.0-3.6      Miami (Ohio) vs Wisconsin                 1-3
3/27/93    5.8-2.5      Maine vs Minnesota                        6-2
           4.5-3.4      Boston University vs Northern Michigan    4-1
           4.3-3.9      Lake Superior vs Minnesota-Duluth         4-3
           5.3-3.1      Michigan vs Wisconsin                     4-3 (OT)
4/01/93    3.8-3.7      Lake Superior vs Boston University        6-1
           4.8-4.3      Maine vs Michigan                         4-3 (OT)
4/03/93    5.0-3.2      Maine vs Lake Superior                    5-4


μ = 3.95, H = ±0.44 goals per game.


Shootouts have been criticized as unfair methods for deciding tie games. A statistical analysis compares the shootout and two alternative methods.

Chapter 25

Overtime or Shootout: Deciding Ties in Hockey

William Hurley

Introduction

The year 1994 appears to have been the Year of the Shootout. A shootout, in which five players from each team attempt to score in a series of one-on-one contests with the opposing team's goaltender, decided the Olympic Gold Medal hockey game, the deciding game in the World Hockey Championship, and, in soccer, the World Cup. Most fans see shootouts as exciting. The players, however, argue to a man that they are a poor way to decide an important championship, largely because a shootout results in a more random outcome.

There is a great deal to be said for the players' position. A championship hockey game ought to be decided in the traditional way—five skaters against five skaters until a winner is determined. However, there is a need for some mechanism to shorten the tournament playoff games that precede the championship game. For instance, a semifinal game that went to four overtime periods before one of the teams finally scored would seriously damage that team's chances in a championship game played two days later.

To my knowledge, there is minimal literature on tie-breaking mechanisms. Liu and Schutz (1994) examine the National Hockey League's regular season 5-minute overtime period and find that the better team has a 65% chance of scoring first in overtime play.

This article compares three formats for deciding tie games. In traditional overtime (OT), the teams play until one of the teams scores. In a shootout format (SH), teams play overtime for up to y minutes and if no team has scored, there is a shootout. A third format is a variation of the existing shootout format. The only way to have shootouts and soothe the players is to have a period of regular play following a shootout. Here is one possibility if two teams are tied at the end of regulation time. Step 1: Immediately run a shootout. The team winning the shootout would be awarded a goal. Step 2: Then play an overtime period, termed a recourse period, having a design length of y minutes. If the team winning the shootout scores first, the game is over. If there is no goal scored in the overtime, the game is over. If the team that lost the shootout scores, however, stop


Figure 1. Frequency plot of the time to the winning goal in overtime.

play and repeat steps 1 and 2. I term this overtime format a shootout with recourse (SR).

In this article, we compare the three formats with regard to the expected length of the overtime, the variance of this length, and the stronger team's probability of winning.

The Length of Overtime

An important factor in this analysis is the distribution of the time between goals in a hockey game. To estimate this distribution, I examined the 251 National Hockey League playoff overtime games between 1970 and 1993. The data are taken from The NHL Official Guide and Record Book, 1993-94 (The National Hockey League, New York, 1993). A frequency plot of the time of the winning goal is shown in Fig. 1. These data are consistent with an exponential distribution having a mean of 9.15 minutes (see Box).

Using the exponential distribution, and assuming that the two teams are of similar strength (quality), the expected length of an overtime period and its variance can be calculated for each overtime format. Traditional overtime is over at the point of the first goal, so the expected length is just the mean of the exponential distribution, 9.15 minutes. The standard deviation of the length is also 9.15.

To obtain the expected length of a traditional shootout, we must consider two possibilities—either the game ends on a goal during the sudden-death period or it ends with a shootout. Given these possibilities and a sudden-death period of length y, the expected length of overtime can be shown to be (1 − e^{−λy})/λ, where 1/λ is the mean time to a goal (9.15 minutes). The standard deviation of the length of overtime is √(1 − 2λy e^{−λy} − e^{−2λy})/λ. The first two columns of Table 1 show the expected length and standard deviation for various overtime lengths. Note that as the length of the sudden-death period gets larger, a shootout is less likely, and, thus, for large y, this overtime format is equivalent to the traditional format.

Computing the expected length and standard deviation of the shootout with recourse requires

The Exponential Distribution

The exponential distribution is often used to approximate the distribution of a random variable that is the waiting time until some event, for example, a goal in a hockey game. The probability density function is p(x) = λe^{−λx} and the mean waiting time is 1/λ, as is the standard deviation. The probability of waiting beyond some time T for the event is e^{−λT}.

The exponential distribution has a certain "memoryless" property. For instance, given that we have waited T time units with no event, the expected waiting time is 1/λ, just as it was at the start. In many applications, like waiting for parts to fail, this memoryless property is not realistic. The exponential distribution seems to provide a good approximation to the waiting time for hockey scores, however.

Table 1—Mean Length and Standard Deviations for the SH and SR Formats for Various Design Lengths

                    SH                            SR
  y     Mean length   S.D. length     Mean length   S.D. length
  5         3.9           6.7             4.9           6.8
 10         6.1           8.3             9.1          10.8
 15         7.4           8.8            12.4          13.2
 20         8.1           9.0            14.6          14.7
 25         8.6           9.1            16.1          15.8


an assumption about the probability of winning the initial shootout. We take this to be .5 for each team. The expected length is found by averaging over three possibilities:

1. The team that won the shootout scores first in the recourse period and, hence, wins the game.

2. Neither team scores in the recourse period and, hence, the shootout winner wins the game.

3. The shootout loser scores first in the recourse period, thus giving rise to another shootout.

This last possibility makes the shootout with recourse longer than the traditional shootout format if the recourse period has the same length as the sudden-death period of the traditional shootout.

Expressions for the expected length and standard deviation of the SR format are developed in the sidebar. Values of the expected length and standard deviation for various recourse period design lengths are presented in Table 1. Note that a recourse period having a design length of 5 minutes has an expected length of 4.9 minutes, or about half the length of the standard overtime format (9.15 minutes).

For the 5-minute SR, the probability that there is a second shootout is just the probability that the team losing the shootout scores in the first recourse period, or p = .5 ∫_0^5 λe^{−λt} dt = .21, and the expected number of shootouts is 1(1−p) + 2p(1−p) + 3p²(1−p) + ... = 1/(1−p) = 1.27.
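These quantities, and the mean lengths in Table 1, follow directly from the exponential model. The short script below is not part of the original article; it is simply a check of the calculations under the stated assumptions that the time to the first overtime goal is exponential with mean 9.15 minutes and that the two teams are equally likely to score first.

    import math

    MEAN_GOAL_TIME = 9.15            # minutes, from the 251 NHL playoff overtimes
    lam = 1.0 / MEAN_GOAL_TIME       # exponential rate

    for y in (5, 10, 15, 20, 25):    # design lengths used in Table 1
        # SH format: play stops at the first goal or after y minutes.
        e_sh = (1.0 - math.exp(-lam * y)) / lam
        # Probability the shootout loser scores first in a recourse period
        # (equal teams, so each team is equally likely to score first).
        p_repeat = 0.5 * (1.0 - math.exp(-lam * y))
        # SR format: a repeat restarts the whole process, so the expected
        # length solves E = e_sh + p_repeat * E.
        e_sr = e_sh / (1.0 - p_repeat)
        n_shootouts = 1.0 / (1.0 - p_repeat)
        print(f"y={y:2d}  E[SH]={e_sh:4.1f}  E[SR]={e_sr:4.1f}  "
              f"P(repeat)={p_repeat:.2f}  E[shootouts]={n_shootouts:.2f}")

For y = 5 this gives a repeat probability of about .21 and roughly 1.27 shootouts on average, and the mean lengths agree with Table 1.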

Computing Expected Length of Overtime and the Probability That the Stronger Team Will Win

The calculations to produce Tables 1 and 2 are averages over the possible outcomes. We illustrate two of them here.

The expected length of the standard shootout is either the time of the first goal (exponential distribution with mean 1/λ) or y minutes, whichever comes first. Denoting the length of this format by L_SH, the average length is

E(L_SH) = ∫_0^y t λe^{−λt} dt + y e^{−λy},

where the first term is the case in which a goal is scored and the second term is the case in which no goal is scored.

The shootout with recourse is a little trickier because of the possibility of multiple periods. Letting L_SR denote the length of this format, the expected length is now computed in three pieces:

E(L_SR) = ∫_0^y [t + E(L_SR)] (λ/2) e^{−λt} dt + ∫_0^y t (λ/2) e^{−λt} dt + y e^{−λy}.

The first integral covers the case in which the loser of the shootout scores at time t with resulting expected length t + E(L_SR), the second integral covers the case in which the team winning the shootout scores in the recourse period, and the third the case in which there is no goal scored in the recourse period. Standard deviations are calculated using the same approach to determine the expected value of the length squared.

If two teams score goals independently and exponentially with parameters λ_S and λ_W, then the probability that the stronger team will win a traditional overtime is λ_S/(λ_S + λ_W). For shootouts, the probability that the stronger team will win is computed by listing all of the possible outcomes. In a shootout with recourse, the stronger team will win if

* the stronger team wins the shootout, and no goals are scored in overtime

* the stronger team wins the shootout, and the stronger team scores first in overtime

* the stronger team wins the shootout, the weak team scores first in overtime, and the stronger team wins given the remainder of the process

* the weak team wins the shootout, the stronger team scores first in overtime, and the stronger team wins given the remainder of the process

If we take p_S to be the probability that the stronger team wins a shootout, p to be the stronger team's overall probability of winning, f_s(y) to be the probability that the stronger team scores first during a recourse period of length y minutes, and f_w(y) to be the probability that the weak team will score first during a recourse period of length y minutes, then

p = p_S[1 − f_s(y) − f_w(y)] + p_S f_s(y) + p_S f_w(y) p + (1 − p_S) f_s(y) p,

which can be solved for p. The probabilities that the strong or weak team scores first, f_s and f_w, are computed from two independent exponential distributions.

Probability That the Stronger Team Wins

A typical objection to the standard shootout is that it results in outcomes determined mainly by luck or by an arbitrarily chosen skill. In this section, we compare the degree to which the three overtime formats reward the stronger team.

Suppose that two teams of unequal strength are tied at the end of regulation time. Label the teams S (strong) and W (weak). Suppose that the time between goals for team S has an exponential distribution with mean 1/λ_S, and the time between goals for team W has an exponential distribution with mean 1/λ_W. We assume that the two goal-scoring processes operate independently. No doubt this last assumption is not completely realistic. For instance, the


Table 2—Probabilities That the Stronger Team Wins for Various Parameter Values (the probability that the stronger team wins in traditional overtime is .6055)

                    SH                              SR
  y      ps=.5    ps=.6    ps=.7        ps=.5    ps=.6    ps=.7
  5      .5006    .5949    .6892        .5003    .6003    .7002
 10      .5034    .5859    .6683        .5019    .6018    .7016
 15      .5089    .5780    .6472        .5053    .6050    .7044
 20      .5164    .5729    .6294        .5105    .6100    .7087
 25      .5251    .5706    .6161        .5173    .6165    .7143

distribution of time between goals for one team clearly depends on the quality of the opposing team. Nevertheless, in thinking about a game between two teams of fixed quality, the independence assumption may not be far off because we can incorporate the dependence into the choice of approximate exponential parameters.

For traditional overtime, the probability that the stronger team wins is λ_S/(λ_S + λ_W). What are plausible values for λ_S and λ_W? I examined NHL Stanley Cup Final Series data for the years between 1980 and 1991. Using the scores for each of the 62 games, I calculated the total goals scored by Stanley Cup winning teams (252) and the total goals scored by Stanley Cup losing teams (164). The average time between goals for winning teams is 14.76 minutes (λ_S = .0677) and for losing teams 22.68 minutes (λ_W = .0441). These estimates ought to be consistent with the average time to the first goal of 9.15 minutes computed earlier in the article. Given our assumptions, the time to the first goal has an exponential distribution with parameter λ_S + λ_W, and using λ_S and λ_W, the average time to the first goal is 8.94 minutes, which is not significantly different from 9.15 minutes. Hence, based on these estimates for λ_S and λ_W, the probability that the stronger team wins a traditional overtime is .6055.
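These estimates can be reproduced directly from the game and goal totals; the small script below is illustrative only (not from the article) and recovers the same quantities up to rounding.

    GAMES = 62                        # Stanley Cup final games, 1980-1991
    MINUTES = 60 * GAMES              # regulation minutes played
    GOALS_WINNERS = 252               # goals scored by Cup-winning teams
    GOALS_LOSERS = 164                # goals scored by Cup-losing teams

    lam_s = GOALS_WINNERS / MINUTES   # per-minute scoring rate, stronger team
    lam_w = GOALS_LOSERS / MINUTES    # per-minute scoring rate, weaker team

    print(f"mean time between goals: S = {1 / lam_s:.2f} min, W = {1 / lam_w:.2f} min")
    print(f"lambda_S = {lam_s:.4f}, lambda_W = {lam_w:.4f}")
    print(f"mean time to first goal = {1 / (lam_s + lam_w):.2f} min")
    print(f"P(stronger team wins traditional OT) = {lam_s / (lam_s + lam_w):.4f}")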

The probability that the stronger team wins under the standard shootout or the shootout with recourse depends on the length of the associated overtime, y, and the probability that the stronger team wins the shootout. It would seem that the probability that the stronger team wins a shootout should be greater than .5, but it may actually be close to .5 if the weaker team has five relatively good shooters. Table 2 presents probabilities that the stronger team wins for each shootout format and for various values of the length of overtime. A brief explanation of the computation of these probabilities is given in Box 2.

The interesting aspect of the probabilities in Table 2 is that the probability that the stronger team wins is dominated by its probability of winning the shootout. This is especially true for short overtime periods. As the overtime period gets longer, however, this probability gets closer to the traditional overtime result. The exception appears to be the column for ps = .6, where the probabilities appear to be moving away from .6055. However, this is not the case. If y is sufficiently large, we again reach .6055.

Summary

The purpose of this article has been to compare three overtime formats on the basis of expected length of the game and the probability that the stronger team wins the overtime.

Based on National Hockey League data, the shootout with recourse format, having a design length of 5 minutes, would have a much smaller expected length than a standard international shootout format in which there is a shootout after 20 minutes. Even with this short design length, there is no guarantee that the SR format will always be shorter. One solution would be to limit the number of recourse periods.

Finally, the probability that a stronger team wins any overtime format having a shootout is dominated by the stronger team's probability of success in a shootout. Assuming that a stronger team's probability of success in a shootout is less than it is in regular play, overtime formats with shootouts give the weaker team a better chance of winning.

Additional Reading

Liu, Y., and Schutz, R. W. (1994), "Overtime in the National Hockey League: Is It a Valid Tie-Breaking Procedure?" School of Human Kinetics, University of British Columbia, Canada.


The role of team ability, home ice, and the "hot hand" in the Stanley Cup finals.

Chapter 26

It Takes a Hot Goalie to Raise the Stanley Cup

Donald G. Morrison and David C. Schmittlein

In May of 1997, Mike Vernon skated around the Philadelphia Spectrum with the National Hockey League's Stanley Cup raised high in his hands. Vernon played spectacularly well during the playoffs, including the finals against the Philadelphia Flyers. He was voted the Most Valuable Player (MVP) of the playoffs. Vernon's Detroit Red Wings ended a 42-year quest to regain the Stanley Cup. Was Vernon the latest in a long line of Hot Goalies to lead his team to victory? We think so—and the statistical evidence is quite compelling.

In ice hockey as in most professional sports, winning the regular season "title" is much less important than winning the postseason playoffs. Many years ago, there were only six teams in the National Hockey League and the top four made the playoffs. They played two rounds of best-of-seven series. Now there are 26 teams and 16 teams make the four-round (again, best-of-seven) playoff series. At one point in the 1970s, there were 21 teams and all but five made the playoffs. This caused many to feel that the 80-game regular season was almost meaningless. Quotes such as "they play 80


games merely to determine home ice advantage" appeared frequently. That is, when Teams A and B meet in the playoffs, the seven games (if necessary) are played on A and B's home arenas in the following order: A A B B A B A. When one team wins four games, the series is terminated. Thus, if a crucial seventh game is played, it is on the home ice of Team A, the team with the better regular season record. We will return to the home ice factor, but first we give a related anecdote.

Mike Vernon, the Hot Goalie

After a 42-year hiatus, the treasured Stanley Cup returned to Detroit as the Red Wings swept the Philadelphia Flyers four games to none. The hero and MVP of the playoffs was Mike Vernon, the Red Wings' veteran goalie. In fact Vernon played every one of the games in the four rounds of the playoffs. During the regular season, Vernon played about one-third of the games as Chris Osgood's back-up. Osgood had the better winning percentage and goals against average during the regular season. Legendary coach Scotty Bowman felt Vernon was the "hot" goalie as the playoffs started, however, and went with his hot goalie for all of the playoff games. (This is not uncommon because all coaches now use two or more goalies during the regular season and many stick with just one goalie in the playoffs.)

In fact, two years earlier the Red Wings lost the Stanley Cup final series—in another 4-0 sweep—to New Jersey. In that series New Jersey's goaltending was seen as superior, and Detroit's was much criticized. It was this disappointing final series that led Detroit to go out and acquire the services of Mike Vernon. We thank an alert reviewer for this note on the acquisition of Vernon by Detroit.

Hot Goalie Versus Home Ice

If goalies did not get hot and home ice had no effect, and if the two teams were of equal ability, the outcome of the Stanley Cup final series games would be a Bernoulli process, with each team having a .5 probability of winning each game. The duration of the series would be four games if Team A wins the first four or Team B wins the first four. This would happen with probability (1/2)^4 + (1/2)^4 = 1/8 = .125. Similar reasoning gives the full probability distribution:

Duration of series    Probability
        4               .1250
        5               .2500
        6               .3125
        7               .3125

Despite being the goalie of choice for the Detroit Red Wings during the 1997 Stanley Cup Finals, Mike Vernon found himself playing for the San Jose Sharks the following season.

If instead the teams were of equal ability, but the home ice advantage was so strong that the home team always won, then every series would go to seven games because each team plays three home games out of the first six. On the other hand, if the team with the hot goalie always won, then each series would last the minimum of four games. In the less extreme cases, home ice advantage will push the preceding distribution more to 6- and 7-game series. For example, if the home team won each game with probability .7 (approximately the league-wide figure for the regular season), the probability of a four-game sweep would drop from the value of .1250 to .0882. Conversely, the impact of a "hot goalie" will cause more 4- and 5-game series.
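These duration probabilities are easy to check by enumerating game outcomes. The sketch below (illustrative code, not part of the original article) does this for a best-of-seven series played on the A A B B A B A home pattern, for an equal-teams Bernoulli model, a pure home-advantage model, and a dominant-team model.

    from itertools import product

    HOME_PATTERN = "AABBABA"   # rink for games 1-7 (A = team with the better record)

    def duration_dist(p_game):
        """p_game(g) = probability that team A wins game g (0-indexed).
        Returns {series length: probability} for a best-of-seven series."""
        dist = {4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0}
        for outcome in product("AB", repeat=7):
            prob, a_wins, b_wins, length = 1.0, 0, 0, None
            for g, winner in enumerate(outcome):
                prob *= p_game(g) if winner == "A" else 1.0 - p_game(g)
                a_wins += winner == "A"
                b_wins += winner == "B"
                if length is None and 4 in (a_wins, b_wins):
                    length = g + 1          # game at which the series is decided
            dist[length] += prob            # games after the clincher do not matter
        return dist

    equal = duration_dist(lambda g: 0.5)
    home_070 = duration_dist(lambda g: 0.7 if HOME_PATTERN[g] == "A" else 0.3)
    dominant = duration_dist(lambda g: 0.73)

    for name, d in [("equal teams", equal), ("home team wins w.p. .7", home_070),
                    ("dominant team p=.73", dominant)]:
        print(name, {k: round(v, 4) for k, v in d.items()})

With equal teams the four probabilities are .1250, .2500, .3125, and .3125; the pure home-advantage model gives a sweep probability of .0882; and setting a constant p = .73 gives the series-length distribution used for the expected counts later in the article.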


Stanley Cup Final Duration

The Stanley Cup final series has been contested 59 times in a best-of-seven-games series format. The frequency table of duration for these series is:

Duration of series    Number
        4               19
        5               15
        6               15
        7               10
      Total             59


Clearly this shows a lot more sweeps (4-game series) than one would expect even with no home-ice advantage and equal ability.

We also calculated the maximum likelihood estimate of the game-victory probability p for the better of the two teams assuming no home-ice advantage—that is, still a Bernoulli process, but allowing p to be different from .5. This resulted in an estimated value of p = .73. The observed frequency table of duration for the final series and the two sets of expected values are:

Series      Observed       Expected # of series:    Expected # of series:
duration    # of series    Bernoulli p = .5         Bernoulli p = .73
   4            19                7.4                     17.1
   5            15               14.7                     19.0
   6            15               18.4                     13.9
   7            10               18.4                      9.0

The Bernoulli p = .5 model is strongly rejected because χ² = 22.65, compared to a critical value of χ²(.01) = 11.30 with 3 df. The Bernoulli p = .73 model is not rejected; that is, χ² = 1.25, while χ²(.05) = 5.99 with 2 df.
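The two goodness-of-fit statistics can be verified in a few lines (illustrative code only, not from the original article):

    def chi_square(observed, expected):
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    observed = [19, 15, 15, 10]                # series lasting 4, 5, 6, 7 games
    expected_p50 = [7.4, 14.7, 18.4, 18.4]     # Bernoulli p = .5, scaled to 59 series
    expected_p73 = [17.1, 19.0, 13.9, 9.0]     # Bernoulli p = .73 (the MLE)

    print(f"chi-square, p = .5 : {chi_square(observed, expected_p50):.2f}  (3 df)")
    print(f"chi-square, p = .73: {chi_square(observed, expected_p73):.2f}  (2 df)")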

Allowing heterogeneity across years in the game-victory probability p (for the p = .73 model) does not help much—it would boost the expected number of 7-game series (a help), but reduce the number of 4-game series (a hurt).

Explaining the Data: Hot Goalie, Home Ice, and/or Dominant-Team Effects

The hot-goalie theory clearly can explain the far too many 4-game series and the almost as dramatic dearth of 7-game series. But so does the Bernoulli (no home-ice advantage) model with the better team being "much better" with p = .73—that is, a dominant-team hypothesis. Which of these competing models is more plausible? And where does a home-ice advantage fit in?

During the regular season the elite teams do not win three-fourths of their games. Of course, some regular season games end in ties (postseason games do not), but during the regular season, the elite teams play many very poor teams. The Stanley Cup finals usually pit two elite teams against each other. The better team may be a little better, but p = .73 implies the better team would win over the long run three times as often as they lose.

To put this in perspective, consider baseball, which with its 162-game regular season schedule comes the closest to "the long run." No team in this century has won three-fourths of its games. The inaugural 1962 Amazin' Mets went 40 and 120 (mercifully two rainouts were never played). Thus, the worst baseball team in 100 years lost three-quarters of its games. For the p = .73 model to be plausible means that the two teams in the Stanley Cup finals are as disparate in their abilities as the 1962 Mets versus the average remaining National League teams of 1962. This just doesn't seem reasonable. Actually, examining our series-duration histogram in slightly more detail will show that such a p = .73 dominant-team hypothesis does not adequately explain the data. Specifically, the inclination of the home team to win Stanley Cup final series of varying duration will enable us to sort out much more clearly the dominant-team, home-ice, and hot-goalie effects. We will be able to rule out one of these three effects, and see that the other two together (but neither alone) suffice to explain the data.

Of the 59 Stanley Cup finals in a best-of-seven game format, the distribution for winner's record in the series, and whether the series was won by the home or away team (i.e., "home team" is the team with most regular-season points and so the home-ice advantage if the series goes seven games) is indicated in Table 1. If the two teams' point totals were equal, the number of games won in the regular season is the tiebreaker to determine the team with the home-ice advantage. So, overall, the home team (i.e., better regular season record) won 46/59 = 78% of these series.


Table 1—Success of the Home Team Versus That of the Away Team in Stanley Cup Final Series of Varying Duration

Series record   Frequency   Number of series        Number of series
                            won by the home team    won by the away team
    4-0            19               14                       5
    4-1            15               12                       3
    4-2            15               12                       3
    4-3            10                8                       2
   Total           59               46                      13


This seems a reasonably large fraction. It is also remarkably close to our game-victory probability for the "better" team, estimated for a homogeneous Bernoulli process (i.e., .73). Maybe having home ice is indeed worth fighting for.

Is the Home-Ice Effect Real?

As suggested earlier, the home team would be expected to prevail in Stanley Cup finals as a result of two distinct phenomena:

1. Dominant team effect: Because home ice goes by design to the team with the better regular-season record, that team should generally be better even in the absence of a "real" home-ice effect (i.e., increased propensity to win when playing at home).

2. "Real" home-ice effect: The "home" team gets to play the "odd" game number seven (if game seven is required) at home.

Table 1 shows clearly that the high 78% series-win probability by the home team has everything to do with factor 1 above (dominant team) and probably nothing at all to do with factor 2—that is, a real home-ice effect. This is seen in two ways:

1. Only 10 of the 59 series went seven games—that is, enabling a real home-ice effect to arise. About twice as many series (19) went only four games—and were therefore balanced with respect to home ice (two home games, two away games).

2. Comparing the series-win percentage in 4-0 series with that in 7-game series, if a "real" home-ice effect occurred we would expect home teams to win substantially more of the 7-game series than the 4-game series. In fact, the "home" team won 14/19 = 74% of the 4-game series and 8/10 = 80% of the 7-game series. This very slight increase is not significant, substantively or statistically.

We should acknowledge that series making it to seven games will tend to be composed of teams that are relatively evenly matched, so it might be argued that the seventh game at home "restores" to the home (better-record) team a win-percentage that it would otherwise have seen erode through six games' worth of Bayesian updating (observing a 3-3 record in those games). Such an explanation is easily ruled offsides, however, by the fact that the home team's win percentage (see Table 1) in 7-game series (80%) is identical to its win percentage in both 5-game series (12/15 = 80%) and in 6-game series (12/15 = 80%), the latter being balanced in home ice (3 home games, 3 away games) for each team.

Accordingly there is no noteworthy evidence of a "real" home-ice effect in Stanley Cup final series.

Since the Home-Ice Team Is Demonstrably the Dominant Team, What Happened to the Hot-Goalie Hypothesis?

Recall that the home team won 78% of Stanley Cup finals and, from the series-record frequency table, we estimated that the "better" team's game-victory probability was .73, assuming a homogeneous Bernoulli process. Does the similarity of these two numbers mean that the dominant-team hypothesis—that is, the fact that the home team had a better regular season record—suffices to account for the series records observed and, further, eliminate an apparent hot-goalie effect? Nothing could be further from the truth.

Imagine that the home team in each series were indeed that "better" team in the Bernoulli process, having game-victory probability = .73. Then, in such a process, where repeated games' outcomes are independent (e.g., assuming no hot-goalie effect), the probability that the "better" (home) team wins in four games is .73^4 = .2840; and the probability that the "worse" (away) team wins in four is (1 − .73)^4 = .0053. Consequently, among series that go exactly four games the dominant-team hypothesis would expect the percentage of such series won by the home team to be .2840/(.2840 + .0053) = .982.

Table 1 showed that home teams won 73.7% (14 of 19) of the 4-game series played. Even with our modest sample size, this observed percentage is significantly different from the 98.2% expected by the dominant-team hypothesis.

As we noted earlier, such 4-game series are balanced with respect to any real home-ice effect: Each team plays two home games and two away games. The other balanced series are those decided in six games. Absent a hot-goalie effect, we can calculate the probability that the home (better) team


wins in six games (with p = .73 as previously) as:

10(.73)^4(.27)^2.

Similarly the probability that the away team wins in six games is:

10(.27)^4(.73)^2.

Thus, among six-game series the expected proportion won by the home team is (.73)^2/(.73^2 + .27^2) = .880. This is again greater than the actual proportion of six-game series won by the home team (12/15 = .80).

For these six-game series the observed proportion (.8) is only one standard deviation below the theoretical value and so not significantly different from it. Nonetheless, when taken together with the preceding four-game series results it is clear that the dominant-team hypothesis cannot explain the home team's observed inclination to win the Stanley Cup finals. The home team wins too few series, and this decrement is statistically significant. Furthermore, a hot-goalie effect—that is, nonindependence between successive game outcomes within a series—can readily close this gap and account for the data observed. In these relatively short series, either team's goalie is of course eminently capable of getting "hot." Thus, one consequence of a hot-goalie effect would be to equalize (to some degree) the overall series wins between home teams and away teams.
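Both conditional probabilities follow from the binomial form of the series-length distribution; the snippet below (illustrative only) recomputes them and compares them with the observed proportions.

    p, q = 0.73, 0.27   # game-win probabilities for the home (better) and away teams

    # Series decided in exactly four games: win the first four.
    home_4, away_4 = p**4, q**4
    print(f"P(home wins | 4-game series) = {home_4 / (home_4 + away_4):.3f}, "
          f"observed {14 / 19:.3f}")

    # Series decided in exactly six games: win 3 of the first 5, then win game 6.
    home_6, away_6 = 10 * p**4 * q**2, 10 * q**4 * p**2
    print(f"P(home wins | 6-game series) = {home_6 / (home_6 + away_6):.3f}, "
          f"observed {12 / 15:.3f}")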

A second observation also supports a hot-goalie interpretation—namely, that the shortage in home-team series victories (relative to the Bernoulli process expectation) is greater both arithmetically and statistically for four-game series than it is for the longer six-game series. Naturally, it is more likely for a goalie to get hot and stay hot for a four-game sweep than for the six-game series. Accordingly, our equalization effect stemming from the hot-goalie phenomenon should be greater for series going four games than for those that go six games, and this is indeed what we observed.

In summary, although a hot-goalie effect cannot be proved conclusively, it could be readily discarded through various possible patterns of empirical results. Instead, the empirical patterns observed in this article all point toward, rather than away from, such an effect.

Specifically, we find that the team gaining the series home-ice advantage through possessing a better season record does indeed fare better in the Stanley Cup finals. The benefit seems to stem entirely from the dominant team effect—that is, a team with a dominant regular season record probably is the more talented of the two. There is no compelling evidence that having home ice for a decisive seventh game (i.e., a "real" home-ice effect) is of any consequence.

Finally, the dominant-team effect could suffice to explain the duration of Stanley Cup finals—so many series ending so quickly—but only if we accept that the better team has three-to-one odds of beating the weaker team game in and game out. This seems unlikely, to say the least. Furthermore, the dominant-team effect does not satisfactorily explain the percent of series won by the home team: Home teams win too few, especially in four-game series. Both of these latter observations are predicted and explained by a hot-goalie phenomenon.

Conclusion

The 82-game regular season in the National Hockey League does indeed determine which teams get the home ice advantage if a seventh game is necessary. The duration pattern of the 59 Stanley Cup finals played, however, rejects the notion that home ice is in fact a significant advantage. The hot-goalie theory is at least strongly consistent with the observed data. The Bernoulli p = .73 model would be more plausible if data showed that one of the finalists typically had more key players injured than the other team—but, of course, we have no data on that. The best that we can say on the Stanley Cup finals is the following:

1. The home ice advantage is not a factor overall;

2. The "home" team (i.e., team getting home ice for game 7) does dominate these series but due solely to its inherent ability, reflected by its having had the better record during the regular season;

3. The hot goalie hypothesis is alive and well.

Finally, what was Vernon's reward for being the hot goalie who led the Red Wings in from 42 years in the Stanley Cupless wilderness? He was released: The Red Wings could only protect two goalies in the upcoming expansion draft. They kept the much younger (and probably better) Chris Osgood and a young backup goalie. Mike with his championship ring and hot goalie resume—not to mention his large salary—was sent to San Jose for goaltending duties with the Sharks. Thus, the Red Wings implicitly bought into the notion of the hot goalie—as a transient phenomenon that only lasts for the time it takes to stage the Stanley Cup playoffs.


Part V
Statistical Methodologies and Multiple Sports


Chapter 27

Introduction to the Methodologies and Multiple Sports Articles

Scott Berry

27.1 Introduction

This section includes a wonderful blend of papers of a general nature that address multiple sports, multiple topics, and nontraditional sports. They address important sports problems, important statistical problems, and all are very interesting and well done. They are sorted into three broad classes: hypothesis tests, prediction, and estimation.

27.1.1 Hypothesis Tests

Over the last 20 years the issue of the existence of a "hot hand" has captured the focus and imagination of the statistics-in-sports community. The hot hand is essentially an example of the age-old cliche that success breeds success. Statistically it can be modeled in many ways, but the classical idea is whether success or failure on one trial changes the probability of success or failure on the next trial. Tversky and Gilovich (1989) (also Chapter 21 in this volume) wrote that there was no evidence for the existence of the hot hand effect in basketball. To the sports community the existence of the hot hand is a tautology. This is part of what made the questioning of the existence of such an effect such a powerful idea.

Addressing the existence of the hot hand is a challenge statistically. The standard method is to assume no hot hand effect exists—this assumption forms the null hypothesis. Data is then collected to see if it agrees or, more conclusively, disagrees with what is expected. This classical hypothesis testing is the standard approach that is used in looking for a hot hand effect. Hooke (Chapter 31 in this volume) was one of two papers in the 1989 Fall Chance issue to examine the hot hand (also see Larkey, Smith, and Kadane (Chapter 19 in this volume)). Hooke addresses the approach of others, specifically Tversky and Gilovich (Chapter 21), in which they conclude that the data they collected are consistent with the null hypothesis of no hot hand effect. Hooke discusses a historically very difficult problem of making conclusions when classical tests find the data consistent with the null hypothesis. He states that the existence of the hot hand is still very much up in the air. Its effect, if it exists, is smaller than many may have thought, but it clearly could still exist. Not only is this paper important for sports statisticians, but for researchers in every field.

27.1.2 Prediction

This second class includes three papers on the prediction of sports outcomes. In a horse race the outcome of interest is the order of finish of the horses, a permutation of the entrants. Harville (Chapter 30) models the probabilities for the various permutations using a probability that each horse finishes first. He assumes that the odds on a horse winning represent the true probability for a horse finishing first. To reduce the hopelessly large class of permutations, he uses the assumption that the probability of finishing second for horse B, given that horse A has won the race, is proportional to the probability that horse B would finish first in a race with the remaining horses. Checking his model with results from 335 races, Harville finds a nice match—with some small deviations. This can result in bets that are expected money winners, which is rare in horse racing; in particular, show and place betting reaped a positive expected payout in some races (an amazing thing in the world of 16% take pari-mutuel betting!).
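Harville's assumption translates directly into a simple calculation: the probability of a complete finishing order is built by repeatedly renormalizing the win probabilities over the horses that have not yet finished. The sketch below is only an illustration of that idea; the win probabilities are hypothetical, not data from Harville's paper.

    def harville_order_prob(win_probs, order):
        """Probability of a complete finishing order under Harville's assumption:
        at each stage a horse's chance of finishing next is its win probability
        renormalized over the horses not yet placed."""
        remaining = dict(win_probs)
        prob = 1.0
        for horse in order:
            prob *= remaining[horse] / sum(remaining.values())
            del remaining[horse]
        return prob

    # Hypothetical win probabilities (e.g., as implied by the betting odds).
    win_probs = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
    print(harville_order_prob(win_probs, ["A", "B", "C", "D"]))   # about 0.133
    print(harville_order_prob(win_probs, ["D", "C", "B", "A"]))   # a long-shot order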

Except for possibly a big win by the underdog, the most dramatic event in sports is the big comeback. Games where


a team trails by a large margin and comes back to win are analyzed forever. Was it the winning team believing in themselves or was it a choke by the losing team? How difficult was the comeback, really? Stern (Chapter 34) models the progress of a game using a Brownian motion model. While others have modeled the abilities of teams and looked at the probabilities of teams winning based on a normal distribution, this paper looks at these same quantities continuously throughout the game. The Brownian motion approach models the difference between the scores for each team as a normal distribution. The time remaining in the game determines the mean and standard deviation for the remaining point differential. Using this approach, and by modeling the ability of each team, Stern finds the probability that each team wins a game, conditional on the score and the time remaining.

He applies the method to National Basketball Association games and Major League Baseball games. The model works very well in basketball because the Brownian motion model assumes a continuous time and scoring structure. While these assumptions are not exactly true in basketball, they are very close. The model works reasonably well in baseball, but does have shortcomings near the end of the game. Stern compares the results to those developed by Lindsey (1961) (Lindsey appears as Chapter 16 in the baseball section (Part II) of this volume).
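The calculation underlying this approach is compact. If a team's point advantage is modeled as a Brownian motion with drift mu and standard deviation sigma per game, then its chance of finishing ahead given a current lead and the fraction of the game already played is a normal probability; this is the form used in Stern (1994). The numbers below are invented for illustration and are not estimates from the paper.

    from math import sqrt
    from statistics import NormalDist

    def win_prob(lead, frac_elapsed, mu, sigma):
        """P(the team ends the game ahead) when it currently leads by `lead` points,
        a fraction `frac_elapsed` of the game has been played, and its point
        advantage evolves as Brownian motion with drift mu and standard deviation
        sigma over a full game (ties ignored for simplicity)."""
        remaining = 1.0 - frac_elapsed
        return NormalDist().cdf((lead + mu * remaining) / (sigma * sqrt(remaining)))

    # Invented values: a slightly better team (mu = 1 point per game, sigma = 12
    # points per game) leading by 5 points at halftime.
    print(round(win_prob(lead=5, frac_elapsed=0.5, mu=1.0, sigma=12.0), 3))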

Mosteller (Chapter 32) presents a collection of sports-related analyses that he has done over his distinguished career. He describes each of these analyses and the lessons each of them has taught him. First he addresses the seemingly innocuous problem of estimating the relative ability of two teams, conditional on the results of a seven-game series. Because the stopping rule for the series depends on the results (when one team wins four games), the unbiased estimator for the probability that one team beats another has some undesirable properties. This result was very important in the history of statistics. As Mosteller says in Chapter 32: "The existence of such unreasonable results has downgraded somewhat the importance of unbiasedness."

Next, he describes a robust statistical approach for ranking National Football League teams using the 1972 season as an example. Interestingly, he did not rank as first the undefeated Super Bowl champion, the Miami Dolphins. In spirit his rankings are similar to those that are now used to partially determine the NCAA champion football team. He writes the following about a lesson he learned: "The nation is so interested in robust sports statistics that it can hog the newspaper space even at an AAAS annual meeting."

He continues with modeling the number of runs in a half-inning of baseball and the 18-hole score of a professional golfer. Throughout the article, Mosteller brings clarity to the main goals of statistical analyses in sports. He highlights very important statistical ideas, answers critical questions about statistics in sports, and provides a means of bringing statistics to the general public.

27.1.3 Estimation

The evaluation of athletes is the focus of this third class of papers. While articles and studies rating athletes are ubiquitous in sports today, there is a lack of good papers on the experimental design of studies for optimizing performance. Roberts (1993) in Chapter 33 brings the revolutionary ideas of Total Quality Management (TQM) to the optimization of athletic performance. The methods of TQM are a natural choice for this application because of its design and analysis for optimizing the performance of specific processes. Roberts discusses and presents data for two interesting examples. The first is an example in which a golfer alters his putting technique. By using intervention analysis Roberts shows clearly that one of the two grips results in a better putting performance. In the second example, a billiards player designs and carries out a 2 x 2 experiment on the eye position and bridge used for the shot. Roberts also informally describes his career as a distance runner and the thinking he has used to hone his training techniques. One of the interesting aspects of his approach is that one technique may be better for one athlete and worse for another. Each of these examples addresses the challenging problem of finding the correct technique for a single athlete—not a population of athletes.

Efron and Morris (1975) in Chapter 29 address an example of estimating the true batting average of baseball players based on partial season data. The maximum likelihood estimator (MLE) is the player's current batting average. Stein showed that this estimate is inadmissible—there existed estimators which were uniformly better with respect to a squared-error loss function. It turns out that regressing each player's batting average toward the mean of all the players' averages is a uniformly better estimator, a 350% improvement in the example considered.
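The flavor of the estimator is easy to convey: transform each early-season average to roughly stabilize its variance, shrink the transformed values toward their common mean by a factor estimated from the data, and transform back. The sketch below uses the positive-part James-Stein form of this idea; the batting averages are invented for illustration and are not the data analyzed by Efron and Morris.

    import math

    def james_stein(values, sigma2):
        """Shrink each value toward the grand mean by the James-Stein factor,
        treating each value as Normal(theta_i, sigma2)."""
        k = len(values)
        grand_mean = sum(values) / k
        ss = sum((v - grand_mean) ** 2 for v in values)
        shrink = max(0.0, 1.0 - (k - 3) * sigma2 / ss)   # positive-part estimator
        return [grand_mean + shrink * (v - grand_mean) for v in values]

    n = 45   # at bats per player early in the season (invented)
    averages = [0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.267, 0.244, 0.222, 0.200]

    # arcsin(sqrt(p)) approximately stabilizes the variance at 1 / (4n).
    z = [math.asin(math.sqrt(p)) for p in averages]
    z_shrunk = james_stein(z, sigma2=1.0 / (4 * n))
    estimates = [round(math.sin(v) ** 2, 3) for v in z_shrunk]
    print(estimates)   # every estimate is pulled toward the overall mean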

This idea is now commonly accepted by statisticians, but at the time was shocking. Efron and Morris's seminal paper was a forerunner to the modern movement of hierarchical models and random effects models, which are now ubiquitous in statistics. It is generally thought of as the paper that brought this type of problem—the empirical Bayes problem—into mainstream statistical use. Consequently, this paper is one of the most important in statistics literature, and certainly the most important to statistics in


sports literature. While the paper's significance extends well beyond the arena of sports, as in Mosteller's work, it shows that sports can provide the inspiration for many novel statistical applications.

One of the hot button topics in sports is that of comparing players from different eras. How does Babe Ruth compare to today's sluggers? How does Rocket Richard compare to today's goal scorers? How do Ben Hogan and Jack Nicklaus compare to Tiger Woods? These questions frequently bring about heated arguments which almost always end with "It's impossible to compare players from different eras!" In Chapter 28, Berry, Reese, and Larkey (1999) use statistical tools to compare the players from different eras in Major League Baseball, the National Hockey League, and professional golf. While Babe Ruth never played with Mark McGwire, Ruth played with players, who played with players, who played with players, who did play with McGwire. These overlaps are bridges to estimate the relative abilities for all players in the history of a sport. Using a Bayesian hierarchical model approach, Berry, Reese, and Larkey simultaneously estimate the aging effects, season effects, and abilities of the players.

Comparisons of players are interesting, and their conclusions about the changing nature of the talent pool in each sport are fascinating. They find that many more good home run hitters are playing today than played years ago. Babe Ruth would still be a great home run hitter (third best of all-time players), but there are many more players similar to him in today's game. In each of the three sports, the talent pool gets better and better through time. In particular, the middle-level player is getting much better through time while the best players of the different eras are comparable.

Comments by Jim Albert, Jay Kadane, and Michael Schell on this article provide fuel for discussion (for the comments, see JASA, 94 (1999), pp. 677-686). Their enthusiasm and interest lend further evidence to the powerful nature of the problem of comparing players from different eras.

27.2 Summary

As stimulating, inventive, and revolutionary as the papers presented in this section are, they do not provide the last word on the questions addressed. The reader is encouraged to continue exploring these research topics with the following papers. The hot hand: Albright (1993), Berry (1997), Berry (1999a), Jackson and Mosurski (1997), and Stern (1995). Probabilities of ranked results: Graves, Reese, and Fitzgerald (2001), Stern (1990), and Stern (1998). Prediction of games from intermediate results: Berry (2000), Cooper, DeNeve, and Mosteller (1992), and Zaman (2001). Distance running performance: Martin and Buoncristiani (1999). Regression to mean player performance: Berry (1999b) and Schall and Smith (2000b). Comparison of players from different eras: Schell (1999). Aging effects on player performance: Albert (1992) and Schall and Smith (2000a).

References

Albert, J. (1992), "A Bayesian analysis of a Poisson random effects model for homerun hitters," The American Statistician, 46, 246-253.

Albright, S. C. (1993), "A statistical analysis of hitting streaks in baseball," Journal of the American Statistical Association, 88, 1175-1183 (with discussion).

Berry, S. M. (1997), "Judging who's hot and who's not," Chance, 10 (2), 40-43.

Berry, S. M. (1999a), "Does 'the zone' exist for home-run hitters?" Chance, 12 (1), 51-56.

Berry, S. M. (1999b), "How many will Big Mac and Sammy hit in '99," Chance, 12 (2), 51-55.

Berry, S. M. (2000), "My Triple Crown," Chance, 13 (3), 56-61.

Berry, S. M., Reese, C. S., and Larkey, P. D. (1999), "Bridging different eras in sports," Journal of the American Statistical Association, 94, 661-676 (with discussion).

Cooper, H., DeNeve, K. M., and Mosteller, F. (1992), "Predicting professional game outcomes from intermediate game scores," Chance, 5 (3/4), 18-22.

Efron, B. and Morris, C. (1975), "Data analysis using Stein's estimator and its generalizations," Journal of the American Statistical Association, 70, 311-319.

Graves, T., Reese, C. S., and Fitzgerald, M. (2001), Hierarchical Models for Permutations: Analysis of Auto Racing Results, Los Alamos National Laboratory Technical Report, Los Alamos, NM.

Harville, David A. (1973), "Assigning probabilities to the outcomes of multi-entry competitions," Journal of the American Statistical Association, 68, 312-316.


Hooke, R. (1989), "Basketball, baseball, and the null hypothesis," Chance, 2 (4), 35-37.

Jackson, D. and Mosurski, K. (1997), "Heavy defeats in tennis: Psychological momentum or random effect?" Chance, 10 (2), 27-34.

Larkey, P. D., Smith, R. A., and Kadane, J. B. (1989), "It's okay to believe in the 'hot hand,'" Chance, 2 (4), 22-30.

Lindsey, G. R. (1961), "The progress of the score during a baseball game," American Statistical Association Journal, September, 703-728.

Martin, D. E. and Buoncristiani, J. F. (1999), "The effects of temperature on marathon runners' performance," Chance, 12 (4), 20-24.

Mosteller, F. (1997), "Lessons from sports statistics," The American Statistician, 51, 305-310.

Roberts, H. V. (1993), "Can TQM improve athletic performance?" Chance, 6 (3), 25-29, 69.

Schall, T. and Smith, G. (2000a), "Career trajectories in baseball," Chance, 13 (4), 35-38.

Schall, T. and Smith, G. (2000b), "Do baseball players regress to the mean?" The American Statistician, 54, 231-235.

Schell, M. J. (1999), Baseball's All-Time Best Hitters, Princeton, NJ: Princeton University Press.

Stern, H. (1990), "Models for distributions on permutations," Journal of the American Statistical Association, 85, 558-564.

Stern, H. (1994), "A Brownian motion model for the progress of sports scores," Journal of the American Statistical Association, 89, 1128-1134.

Stern, H. (1995), "Who's hot and who's not," in Proceedings of the Section on Statistics in Sports, Alexandria, VA: American Statistical Association, 26-35.

Stern, H. (1998), "How accurate are the posted odds?" Chance, 11 (4), 17-21.

Tversky, A. and Gilovich, T. (1989), "The cold facts about the 'hot hand' in basketball," Chance, 2 (1), 16-21.

Zaman, Z. (2001), "Coach Markov pulls goalie Poisson," Chance, 14 (2), 31-35.


Chapter 28

Bridging Different Eras in Sports
Scott M. BERRY, C. Shane REESE, and Patrick D. LARKEY

This article addresses the problem of comparing abilities of players from different eras in professional sports. We study National Hockey League players, professional golfers, and Major League Baseball players from the perspectives of home run hitting and hitting for average. Within each sport, the careers of the players overlap to some extent. This network of overlaps, or bridges, is used to compare players whose careers took place in different eras. The goal is not to judge players relative to their contemporaries, but rather to compare all players directly. Hence the model that we use is a statistical time machine. We use additive models to estimate the innate ability of players, the effects of aging on performance, and the relative difficulty of each year within a sport. We measure each of these effects separated from the others. We use hierarchical models to model the distribution of players and specify separate distributions for each decade, thus allowing the "talent pool" within each sport to change. We study the changing talent pool in each sport and address Gould's conjecture about the way in which populations change. Nonparametric aging functions allow us to estimate the league-wide average aging function. Hierarchical random curves allow for individuals to age differently from the average of athletes in that sport. We characterize players by their career profile rather than a one-number summary of their career.

KEY WORDS: Aging function; Bridge model; Hierarchical model; Population dynamics; Random curve.

1. INTRODUCTION

This article compares the performances of athletes from different eras in three sports: baseball, hockey, and golf. A goal is to construct a statistical time machine in which we estimate how an athlete from one era would perform in another era. For examples, we estimate how many home runs Babe Ruth would hit in modern baseball, how many points Wayne Gretzky would have scored in the tight-checking National Hockey League (NHL) of the 1950s, and how well Ben Hogan would do with the titanium drivers and extra-long golf balls of today's game.

Comparing players from different eras has long been pub fodder. The topic has been debated endlessly, generally to the conclusion that such comparisons are impossible. However, the data available in sports are well suited for such comparisons. In every sport there is a great deal of overlap in players' careers. Although a player that played in the early 1900s never played against contemporary players, they did play against players, who played against players, ..., who played against contemporary players. This process forms a bridge from the early years of sport to the present that allows comparisons across eras.

A complication in making this bridge is that the overlapping of players' careers is confounded with the players' aging process; players in all sports tend to improve, peak, and then decline. To bridge the past to the present, the effects of aging on performance must be modeled. We use a nonparametric function to model these effects in each sport.

Scott M. Berry is Assistant Professor, Department of Statistics, Texas A&M University, College Station, TX 77843. C. Shane Reese is Technical Staff Member, Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545. Patrick D. Larkey is Professor, H. John Heinz III School of Public Policy and Management, Carnegie Mellon University, Pittsburgh, PA 15213. The authors thank Wendy Reese and Tammy Berry for their assistance in collecting and typing in data, Sean Lahman for providing the baseball data, and Marino Parascenzo for his assistance in finding birth years for the golfers. The authors are also grateful for discussions with Jim Calvin, Ed Kambour, Don Berry, Hal Stern, Jay Kadane, Brad Carlin, Jim Albert, and Michael Schell. The authors thank the editor, the associate editor, and three referees for encouraging and helpful comments and suggestions.

An additional difficulty in modeling the effects of age on performance is that age does not have the same effect on all players. To handle such heterogeneity, we use random effects for each player's aging function, which allows for modeling players that deviate from the "standard" aging pattern. A desirable effect of using random curves is that each player is characterized by a career profile, rather than by a one-number summary. Player A may be better than player B when they are both 23 years old, and player A may be worse than player B when they are both 33 years old. Section 3.4 discusses the age effect model.

By modeling the effects of age on the performance of each individual, we can simultaneously model the difficulty of each year and the ability of each player. We use hierarchical models (see Draper et al. 1992) to estimate the innate ability of each player. To capture the changing pool of players in each sport, we use separate distributions for each decade. This allows us to study the changing distribution of players in each sport over time. We also model the effect that year (season) has on player performance. We find that, for example, in the last 40 years, improved equipment and course conditions in golf have decreased scoring by approximately 1 shot per 18 holes. This is above and beyond any improvement in the abilities of the players over time. The estimated innate ability of each player, and the changing evolution of each sport, is discussed in Section 7.
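To fix ideas, the kind of additive decomposition described here can be thought of schematically as observed performance ≈ player ability + season effect + aging effect + noise. The snippet below is only a toy illustration of that structure with made-up numbers; it is not the authors' model, which is specified hierarchically and sport by sport later in the article.

    import random

    random.seed(1)

    def simulate_performance(theta, season_effect, age_effect, noise_sd=0.5):
        """Toy additive decomposition: innate ability + season difficulty
        + aging effect + random noise."""
        return theta + season_effect + age_effect + random.gauss(0.0, noise_sd)

    theta = 5.0                                 # innate ability (arbitrary units)
    season_effects = {1988: -0.3, 1989: 0.0, 1990: 0.2, 1991: 0.4}   # made-up season effects
    peak_age = 27

    def age_effect(age):
        """Simple made-up peaked aging curve (rises to a peak, then declines)."""
        return -0.05 * (age - peak_age) ** 2

    for season, age in zip(sorted(season_effects), range(25, 29)):
        y = simulate_performance(theta, season_effects[season], age_effect(age))
        print(season, age, round(y, 2))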

Gould (1996) has hypothesized that the population of players in sport is continually improving. He claimed there is a limit to human ability—a wall that will never be crossed. There will always be players close to this wall, but as time passes and the population increases, more and more players will be close to this wall. He believes there are great players in all eras, but the mean players and lower end of the tail players in each era are closer to the "wall." By separating out the innate ability of each player, we study the dynamic nature of the population of players. Section

© 1999 American Statistical Association
Journal of the American Statistical Association
September 1999, Vol. 94, No. 447, Applications and Case Studies


8 describes our results regarding the population dynamics. We provide a discussion of Gould's claims as well.

We have four main goals:

1. To describe the effects of aging on performance ineach of the sports, including the degree of heterogeneityamong players. Looking at the unadjusted performance ofplayers over their careers is confounded with the chang-ing nature of the players and the changing structure of thesports. We separate out these factors to address the agingeffects.

2. To describe the effects of playing in each year in each of the sports. We want to separate out the difficulty of playing in each era from the quality of players in that era. These effects may be due to rule changes, changes in the quality of the opponents, changes in the available (and legal) equipment, and the very nature of the sport.

3. To characterize the talent of each player, independent of the era or age of the player.

4. To characterize the changing structure of the population of players. In a sport involving one player playing against an objective measure with the same equipment that has always been used (e.g., throwing a shot put, lifting weights), it is clear that the quality of players is increasing. We want to know if that is true in these three professional sports.

Addressing any factor that affects performance requires addressing all such factors. If the league-wide performance is used as a measure of the difficulty of a particular year, then this is confounded with the players' ability in that year. In hockey, if an average of 3 goals are scored per game in 1950 and 4 goals are scored per game in 1990, it is not clear whether scoring has become easier or the offensive players are more talented. Our aim is to separate out each effect, while accounting for the other effects.

We have found little research in this area. Riccio (1994) examined the aging pattern of golfer Tom Watson in his U.S. Open performances. Berry and Larkey (1998) compared the performance of golfers in major tournaments. Albert (1998) looked at the distribution of home runs by Mike Schmidt over his career. Schell (1998) ranked the greatest baseball players of all time on their ability to hit for average. He used a z-score method to account for the changing distribution of players and estimated the ballpark effects. He ignored the aging effects by requiring a minimum number of at bats to qualify for his method. Both of these effects are estimated separately without accounting for the other changing effects. Our goal is to construct a comprehensive model that makes the necessary adjustments simultaneously, rather than a series of clever adjustments.

The next section examines the measures of performance in each sport and the available data for each. Section 3 describes the models used and the key assumptions of each. Section 4 discusses the Markov chain Monte Carlo (MCMC) algorithms used. The algorithms are standard MCMC successive substitution algorithms. Section 5 looks at the goodness of fit for each of the models. To address the aging effects, Section 6 presents nonparametric aging functions. Random curves are used to allow for variation in aging across individuals. Section 7 discusses the results for each sport, including player aging profiles, top peak performers, and the changes over time for each sport. Section 8 discusses the population dynamics within each sport, and Section 9 discusses the results and possible extensions.

2. SPORTS SPECIFICS AND AVAILABLE DATA

In our study of hockey, we model the ability of NHL players to score points. In hockey, two teams battle continuously to shoot the puck into the other team's goal. Whoever shoots the puck into the goal receives credit for scoring a goal. If the puck was passed from teammates to the goal scorer, then up to the last two players to pass the puck on the play receive credit for an assist. A player receives credit for a point for either a goal or an assist. In hockey there are three categories of players: forwards, defensemen, and goalies. A main task of defensemen and goalies is to prevent the other team from scoring. A main task of forwards is to score goals. Therefore, we consider forwards only. We recorded the number of points in each season for the 1,136 forwards playing at least 100 games between 1948 and 1996. All hockey data are from Hollander (1997). We deleted all seasons played by players age 40 and older. There were very few such seasons, and thus the age function was not well defined beyond 40. Any conclusions in this article about these players are based strictly on their ability to score points, which is not necessarily reflective of their "value" to a hockey team. Some forwards are well known for their defensive abilities; thus their worth is not accurately measured by their point totals.

Considered among the most physically demanding of sports, hockey requires great physical endurance, strength, and coordination. As evidence of this, forwards rotate throughout the game, with three or four lines (sets of three forwards) playing in alternating shifts. In no other major sport do players participate for such a small fraction of the time. We do not have data on which players were linemates. The NHL has undergone significant changes over the years. The league has expanded from 6 teams in 1948 to 30 teams in 1996. Recent years have brought a dramatic increase in the numbers of Eastern European and American players, as opposed to the almost exclusively Canadian players of the early years. Technological developments have made an impact in the NHL. The skates that players use today are vastly superior to those of 25 years ago. The sticks are stronger and curved, helping players control the puck better and shoot more accurately. The style of play has also changed. At different times in NHL history coaches have stressed offense or stressed defense.

In golf, it takes a long time for a player to reach his or her peak. Golf requires a great deal of talent, but it does not take the physical toll that hockey does. It seems reasonable to expect that the skills needed to play golf do not deteriorate as quickly in aging players as do speed and strength in hockey. Therefore, the playing careers of golfers are much longer. Technology is believed to have played an enormous role in golf. Advances in club and ball design have aided modern players.


The conditions of courses today are far superior to conditions of 50 years ago: modern professional golfers experience very few bad lies of the ball on the fairways of today's courses. The speed of the greens has increased over the years, which may increase scores, but this may be offset by a truer roll. The common perception is that technology has made the game easier.

We model the scoring ability of male professional golfers in the four major tournaments, considered the most important events of each golf season. We have individual round scores for every player in the Masters and U.S. Open from 1935-1997 and in the Open Championship (labeled the British Open by Americans) and the PGA of America Championship from 1961-1997. (The Masters and U.S. Open were not played in 1943-1945 because of World War II.) A major tournament comprises four rounds, each of 18 holes of play. A "cut" occurs after the second round, and thus playing in a major generally consists of playing either two or four rounds. We found the birth years for 488 players who played at least 10 majors in the tournaments we are considering. We did not find the ages of 38 players who played at least 10 majors. The birth years for current players were found at various web sites (pgatour.com; www.golfweb.com). For older players, we consulted golf writer Marino Parascenzo. We had trouble finding the birth years for marginal players from past eras. This bias has consequences in our analysis of the population dynamics in Section 8.

Baseball is rich in data. We have data on every player (nonpitcher) who has batted in Major League Baseball (MLB) in the modern era (1901-1996). We have the year of birth and the home ballpark for each player during each season. The number of at bats, base hits, and home runs are recorded for each season. An official at bat is one in which the player reaches base safely from a base hit or makes an out. An at bat does not include a base on balls, sacrifice, or hit by pitch. (Interestingly, sacrifices were considered at bats before 1950 but not thereafter.) A player's batting average is the proportion of at bats in which he gets a base hit. We also model a player's home run average, which is the proportion of at bats in which he hits a home run.

In terms of player aging, baseball is apparently between golf and hockey. Hand-eye coordination is crucial, but the game does not take an onerous physical toll on players. A common perception is that careers in baseball are longer than in hockey, but shorter than in golf. Baseball prides itself on being a traditional game, and there have been relatively few changes in the rules during the twentieth century. Some changes include lowering the mound, reducing the size of the strike zone, and modifications to the ball. The first 20 years of this century were labeled the "dead-ball era." The most obvious change in the population of players came in the late 1940s, when African-Americans were first allowed to play in the major leagues. MLB has historically been played mainly by U.S. athletes, although Latin Americans have had an increasing influence over the last 40 years.

3. MODELS

In this section we present the bridging model, with details of the model for each sport. To compare players from different eras, we select the most recent season in the dataset as the benchmark season. All evaluations of players are relative to the benchmark season. The ability of every player that played during the benchmark season can be estimated by their performance in that season. In the home run example, this includes current sluggers like Mark McGwire (1987-present), Ken Griffey, Jr. (1989-present), and Mike Piazza (1992-present). The ability of players whose careers overlapped with the current players can be estimated by comparing their performances to the current players' performances in common years. In the home run example, this includes comparing players like Reggie Jackson (1967-1987), Mike Schmidt (1972-1989), and Dale Murphy (1976-1992) to McGwire, Griffey, and Piazza. The careers of Jackson, Schmidt, and Murphy overlapped with the careers of players who preceded them, such as Mickey Mantle (1951-1968), Harmon Killebrew (1954-1975), and Hank Aaron (1954-1976). The abilities of Mantle, Killebrew, and Aaron can be estimated from their performances relative to Jackson, Schmidt, and Murphy in their common years. The network of thousands of players with staggered careers extends back to the beginning of baseball. All three sports considered in this article have similar networks.

We estimate a league-wide age effect by comparing each player's performance as they age with their estimated ability. The difficulty of a particular season can be estimated by comparing each player's performance in the season with their estimated ability and estimated age effect during that season. We can estimate other effects, such as ballpark and individual rounds in golf, in an analogous fashion. This explanation is an iterative one, but the estimates of these effects are produced simultaneously.

There are two critical assumptions for each model used in this article. The first is that outcomes across events (games, rounds, and at bats) are independent. A success or failure in one trial does not affect the results of other trials. One example of dependence between trials is the "hot-hand" effect: success breeds success, and failure breeds failure. This topic has received a great deal of attention in the statistics literature. We have found no conclusive evidence of a hot-hand effect. (For interesting studies of the hot hand, see Albert 1993, Albright 1993, Jackson and Mosurski 1997, Larkey, Smith, and Kadane 1989, Stern 1995, Stern and Morris 1993, and Tversky and Gilovich 1989a,b.) We do not take up this issue here, but we do believe that golf is the most likely sport to have a hot-hand effect (and we have found no analysis of the hot-hand effect in golf).

All of the models used are additive. Therefore, the second critical assumption is that there are no interactions. An interaction in this context would mean that the performances of different players are affected differently by a predictor. For example, if player A is more successful in the modern game than player B, then had they both played 50 years ago, player A would still have been better than player B.


We address the question of interactions in the discussion section.

We use the same parameters across sports to represent player and year effects. When necessary, superscripts h, g, a, and r are used to represent hockey, golf, batting averages in baseball, and home runs in baseball.

3.1 Hockey

For the hockey data, we have k = 1,136 players. The number of seasons played by player i is n_i, and the age of player i in his jth season is a_ij. The year in which player i played his jth season is y_ij, the number of points scored in that season is x_ij, and the number of games played is g_ij. Counting the number of points for a player in a game is counting rare events in time, which we model using the Poisson distribution.

Per-game scoring for a season is difficult to obtain. To address the appropriateness of the Poisson distribution for one player, we collected data on the number of points scored in each game by Wayne Gretzky in the 1995-1996 season, as shown in Table 1. The Poisson appears to be a reasonable match for the points scored per game (the chi-squared goodness-of-fit test statistic is 5.72, with a p value of .22).
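As a rough illustration of this kind of check, the minimal Python sketch below recomputes the fitted Poisson column of Table 1 from the per-game counts and forms a chi-squared statistic. The exact value of the statistic and its p value depend on how the tail cells are grouped and on the degrees of freedom used, so the sketch shows the mechanics rather than reproducing the reported 5.72 exactly.

    import numpy as np
    from scipy import stats

    # Gretzky 1995-96: games with 0, 1, 2, 3, 4, 5 points (Table 1)
    counts = np.array([23, 34, 18, 4, 4, 0])
    points = np.arange(6)
    n_games = counts.sum()                      # 83
    lam = (points * counts).sum() / n_games     # sample mean, about 1.18

    # Fitted Poisson probabilities for 0-5 points (the "Poisson" column of Table 1)
    pois_probs = stats.poisson.pmf(points, lam)
    print(np.round(pois_probs, 3))              # approximately .31 .36 .21 .08 .02 .006

    # Chi-squared comparison of observed and expected counts,
    # lumping 5-or-more points into the last cell
    expected = n_games * np.append(pois_probs[:-1], 1 - pois_probs[:-1].sum())
    chi2 = ((counts - expected) ** 2 / expected).sum()
    df = len(counts) - 1 - 1                    # one estimated parameter (lam)
    print(chi2, stats.chi2.sf(chi2, df))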

We assume that the points scored in a game are independent of those scored in other games, conditionally on the player and year, and that the points scored by one player are independent of the points scored by other players. The model is

x_ij ~ Poisson(g_ij λ_ij),

where the x_ij are independent conditional on the λ_ij's. Assume that

log(λ_ij) = θ_i + δ_{y_ij} + f_i(a_ij).

In this log-linear model, θ_i represents the player-specific ability; that is, exp(θ_i) is the average number of points per game for player i when he is playing at his peak age (f_i = 0) in 1996 (δ_1996 = 0). There are 49 δ's, one for each year in our study. They represent the difficulty of each year relative to 1996. Therefore, we constrain δ_1996 = 0. We refer to 1996 as the benchmark year. The function f_i represents the aging effects for player i. We use a random curve to model the aging, as discussed in Section 3.4. The function f_i is restricted to be 0 for some age a (player i's peak age).

A conditionally independent hierarchical model is used for the θ's. To allow the distribution of players to change over time, we model a separate distribution for the θ's for each decade.

Table 1. The Points Scored in Each of Wayne Gretzky's 83 Games in the 1995-96 Season

Points    Gretzky            Poisson
0         23/83 = .28        .31
1         34/83 = .41        .36
2         18/83 = .22        .21
3          4/83 = .05        .08
4          4/83 = .05        .02
5          0/83 = 0          .006

NOTE: For each point total, the probability of that occurrence, assuming a Poisson distribution with a mean of 1.18, is shown in the Poisson column.

Let d_i be the decade in which player i was born. In the hockey example, the first decade is 1910-1919, the second decade is 1920-1929, and the last decade, the seventh, is 1970-1979. The model is

θ_i | μ_{d_i}, σ_{d_i} ~ N(μ_{d_i}, σ²_{d_i}),

where N(μ, σ²) refers to a normal distribution with a mean of μ and a variance of σ². The hyperparameters have the distributions

μ_d ~ N(m, s²)

and

σ²_d ~ IG(a, b),

where IG(a, b) refers to an inverse gamma distribution with mean 1/[b(a - 1)] and variance 1/[b²(a - 1)²(a - 2)]. The θ_i are independent conditional on the μ's and σ's. The δ's are independent with prior distributions

δ_y ~ N(0, τ²).

The average forward scores approximately 40 points in a season, or approximately .5 points per game. Thus we set m = log(.5), and allow for substantial variability around this number by setting s = .5. For the distribution of σ²_d, we set a = 3 and b = 3. This distribution has mean .167 and standard deviation .167. We chose τ = 1. We specified prior distributions that we thought were reasonable and open minded. This prior represents the notion that the year effects are not huge, but is flexible enough so that the posterior is controlled by the data. We find little difference in the results for the priors that we considered reasonable.
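To see how the pieces of this specification fit together numerically, here is a minimal Python sketch that evaluates the Poisson log-likelihood of one player-season and the normal decade prior for that player's θ. All parameter values are invented for illustration, and the aging term is treated as a given number.

    import numpy as np
    from scipy import stats

    # Hypothetical values for one player-season (illustrative only)
    theta = np.log(0.9)     # player ability: 0.9 points per game at peak in 1996
    delta = 0.25            # year effect (scoring easier than in 1996)
    f_age = -0.05           # aging effect at this player's age
    games, points = 78, 85

    lam = np.exp(theta + delta + f_age)          # points-per-game rate for this season
    log_lik = stats.poisson.logpmf(points, games * lam)

    # Decade-specific prior contribution for theta (hypothetical decade mean and SD)
    mu_decade, sigma_decade = np.log(0.5), 0.4
    log_prior = stats.norm.logpdf(theta, loc=mu_decade, scale=sigma_decade)

    print(round(log_lik, 2), round(log_prior, 2))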

3.2 Golf

The golf study involves k = 488 players, with player i playing n_i rounds of golf (a round consisting of 18 holes). For the jth round of player i, the year in which the round is played is y_ij, the score of the round is x_ij, the age of the player is a_ij, and the round number in year y_ij is r_ij. The round number ranges from 1 to 16 in any particular year, corresponding to the chronological order.

We adopt the following model for golf scores:

x_ij ~ N(μ_ij, σ²),

where the x_ij are independent given the μ_ij's and σ². Assume that

μ_ij = θ_i + δ_{y_ij} + ω_{y_ij, r_ij} + f_i(a_ij).

Parameter θ_i represents the mean score for player i when that player is at his peak (f_i = 0), playing a round of average difficulty in 1997 (δ = 0 and ω = 0). The benchmark year is 1997; thus δ_1997 = 0, and each δ represents the difficulty of that year's major tournaments relative to 1997. There is variation in the difficulty of rounds within a year. Some courses are more difficult than others; the course setup can be relatively difficult or relatively easy, and the weather plays a major role in scoring.


The ω's represent the difficulty of rounds within a year. Thus ω_{u,v} is the mean difference, in strokes, for round v from the average round in year u. To preserve identifiability, and thus interpretability, we restrict

Σ_v ω_{u,v} = 0 for each year u.
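One simple way to impose a sum-to-zero constraint like this inside a sampler is to recenter the round effects within each year after they are updated. The short Python sketch below illustrates only that bookkeeping, under the assumption that the round effects for a year are held in a one-dimensional array; it is not the authors' code.

    import numpy as np

    def center_round_effects(omega_by_year):
        """Recenter round effects so they sum to zero within each year.

        omega_by_year: dict mapping a year to a 1-D array of round effects
        (one entry per round recorded in that year).
        """
        centered = {}
        for year, omega in omega_by_year.items():
            omega = np.asarray(omega, dtype=float)
            centered[year] = omega - omega.mean()   # enforces sum_v omega[u, v] = 0
        return centered

    # Example with two hypothetical years of 4 and 3 recorded rounds
    example = {1996: [0.4, -0.1, 0.3, 0.2], 1997: [1.0, -0.5, 0.1]}
    print(center_round_effects(example))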

The aging function f_i is discussed in Section 3.4. A decade-specific hierarchical model is used for the θ's. Let d_i be the decade in which a golfer was born. There are seven decades: 1900-1909, 1910-1919, ..., 1960+. Only three players in the dataset were born in the 1970s, so they were combined into the 1960s. Let the θ's be independent conditional on the μ's and σ's and be distributed as

θ_i ~ N(μ_{d_i}, σ²_{d_i}),

where

μ_d ~ N(m, s²)

and

σ²_d ~ IG(a, b).

The δ's are independent with prior distributions

δ_u ~ N(0, τ²),

and the ω_{u,v}'s are independent with normal priors centered at 0. We specify the hyperparameters as m = 73, s = 3, a = 3, b = 3, τ = 3, and a prior standard deviation of 3 for the ω's. As in the hockey study, the results from priors similar to this one are virtually identical. The distribution of golf scores has been discussed by Mosteller and Youtz (1993). They modeled golf scores as 63 plus a Poisson random variable. Their resulting distribution looked virtually normal, with a slight right skew. They developed their model based on combining the scores of all professional golfers. Scheid (1990) studied the scores of 3,000 amateur golfers and concluded that the normal fits well, except for a slightly heavier right tail. There are some theoretical reasons why normality is attractive. Each round score is the sum of 18 individual hole scores. The distribution of scores on each hole is somewhat right-skewed, because scores are positive and unlimited from above. A score of 2, 3, or 4 over par on one hole is not all that rare, whereas 2, 3, or 4 under par on a hole is extremely rare, if not impossible. A residual normal probability plot shown in Section 5 demonstrates the slight right-skewed nature of golf scores. We checked models with a slight right skew and found the results to be virtually identical (not shown). The only resulting difference that we noticed was in predicting individual scores (in which we are not directly interested). Because of its computational ease and reasonable fit, we adopt the normality assumption.

3.3 Baseball

The baseball studies involve k = 7,031 players, with player i playing in n_i seasons. For player i in his jth season, x_ij is the number of hits, h_ij is the number of home runs, m_ij is the number of at bats, a_ij is the player's age, y_ij is the year of play, and t_ij is the player's home ballpark. (Players play half their games in their home ballpark and the other half at various ballparks of the other teams.)

We model at bats as independent Bernoulli trials, with the probability of success for player i in his jth year equal to π_ij. We study both hits and home runs as successes; therefore, we label π^a_ij and π^r_ij as the probability of getting a hit and of hitting a home run, respectively. Thus

x_ij ~ Binomial(m_ij, π^a_ij),

where

logit(π^a_ij) = θ^a_i + δ^a_{y_ij} + ξ^a_{t_ij} + f_i(a_ij).

For the baseball home run study, we use a similar model,

h_ij ~ Binomial(m_ij, π^r_ij),

where

logit(π^r_ij) = θ^r_i + δ^r_{y_ij} + ξ^r_{t_ij} + f_i(a_ij).

The δ parameters are indicator functions for seasons, and the ξ parameters are indicator functions for home ballparks. We include the ξ parameters to account for the possibility that certain stadiums are "hitters'" parks and others are "pitchers'" parks. The aging function f_i is discussed in the following subsection.
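As a small illustration of how the pieces of this logistic specification combine, the Python sketch below evaluates the hit probability and the binomial log-likelihood for one hypothetical player-season. The numerical values of θ, δ, ξ, and f are invented for the example and are not estimates from this chapter.

    import numpy as np
    from scipy import stats
    from scipy.special import expit   # inverse logit

    # Hypothetical effects for one player-season (illustrative values only)
    theta = -0.85       # player ability at peak, benchmark year, log-odds scale
    delta = -0.05       # season (year) effect
    xi = 0.03           # home-ballpark effect
    f_age = -0.10       # aging effect at this player's age

    pi_hit = expit(theta + delta + xi + f_age)   # probability of a hit per at bat
    at_bats, hits = 550, 160
    log_lik = stats.binom.logpmf(hits, at_bats, pi_hit)
    print(round(pi_hit, 3), round(log_lik, 2))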

Let d_i be the decade in which player i was born. There are 12 decades for the baseball players: 1860-1869, ..., 1970+. A decade-specific conditionally independent hierarchical model is used:

θ^a_i ~ N(μ^a_{d_i}, (σ^a_{d_i})²),

where the θ^a's are independent conditional on the μ^a's and (σ^a)²'s. Assume that

μ^a_d ~ N(m_a, s²_a)

and

(σ^a_d)² ~ IG(a_a, b_a).

The δ^a's are independent with prior distributions

δ^a_y ~ N(0, τ²_a),

and the ξ^a's are independent with normal prior distributions centered at 0. The parameters are selected as m_a = -1, s_a = 1, a_a = 3, b_a = 3, τ_a = 1, and a prior standard deviation of 1 for the ξ^a's.

For the home run data, the following decade-specific hierarchical model is used:

θ^r_i ~ N(μ^r_{d_i}, (σ^r_{d_i})²),


where the θ^r's are independent conditional on the μ^r's and (σ^r)²'s. Assume that

μ^r_d ~ N(m_r, s²_r) and (σ^r_d)² ~ IG(a_r, b_r).

The δ^r's are independent with prior distributions

δ^r_y ~ N(0, τ²_r),

and the ξ^r's are independent with normal prior distributions centered at 0. We set m_r = -3.5, s_r = 1, a_r = 3, b_r = 3, τ_r = 1, and a prior standard deviation of 1 for the ξ^r's. In both the batting average and the home run studies, the selection of the parameters in the priors has essentially no effect on the final conclusions.

3.4 Aging Functions

A common perception of aging functions is that players improve in ability as they mature, up to a peak level, then slowly decline. It is generally believed that players improve faster while maturing than they decrease in ability while declining. The aging curve is clearly different for different sports with regard to both peak age and the rate of change during maturity and decline. Moreover, some players tend to play at near-peak performance for a long period of time, whereas others have short periods of peak performance. This may be due to conditioning, injuries, or genetics. We assume a mean aging curve for each sport. We model the variation in aging for each player using hierarchical models, with the mean aging curve as the standard. In each model, θ_i represents the ability of player i at peak performance in a benchmark year. Thus each player's ability is characterized by a profile rather than one number; it may be that player A is better than player B when they are both 22 years old, but player B is better than player A when they are both 35. For convenience we round off ages, assuming that all players were born on January 1. Lindley and Smith (1972) proposed using random polynomial curves. Shi, Weiss, and Taylor (1996) used random spline curves to model CD4 cell counts in infants over time. Their approach is similar to ours in that it models an effect in longitudinal data with a flexible random curve.

We let g(a) denote the mean aging curve in each sport. Without loss of generality, we assume that g equals 0 at the peak age. We use the following model for player i's aging curve:

f_i(a) = β_{1i} g(a) for a < a_M,
f_i(a) = g(a) for a_M ≤ a ≤ a_D,
f_i(a) = β_{2i} g(a) for a > a_D.

The parameter β_i = (β_{1i}, β_{2i}) represents player i's variation from the mean aging curve. We define the maturing period as any age less than a_M and the declining period as any age greater than a_D. To preserve the interpretation of β_{1i} and β_{2i} as aging parameters, we select a_M and a_D where the aging on each side becomes significant. We fit the mean aging function for every player, then select ages (or knots) a_M and a_D to represent players after their rise and before their steady decline. For ages a such that a_M ≤ a ≤ a_D, each player ages the same. This range was determined from initial runs of the algorithm. We selected a region in which the players' performance was close to the peak performance. Part of the motivation for a range of values unaffected by individual aging patterns is to ensure stability in the calculations. In each study we use a hierarchical model in which (β_{1i}, β_{2i}) follows a bivariate normal distribution. We use IG(10, 1) priors for the variances of β_{1i} and β_{2i} (mean .11 and standard deviation .039). Due to the large number of players in each example, the priors that we considered reasonable had virtually identical results.

In the golf model, g(a) represents the additional number of strokes worse than peak level for the average professional golfer at age a. The maturing and declining parameters for each player have a multiplicative effect on the additional number of strokes. A player with β_{1i} = 1 matures the same as the average player. If β_{1i} > 1, then the player averages more strokes over his peak value than the average player would at the same age a < a_M. If β_{1i} < 1, then the player averages fewer strokes over his peak value than the average player would at the same age a < a_M. The same interpretation holds for β_{2i}, only for players of age a > a_D.

The quantity exp(f_i(a)) has a multiplicative effect on the mean points per game parameter in hockey and on the odds of success in baseball. Therefore, exp(g(a)) represents the proportion of peak performance for the average player at age a.

We use a nonparametric form for the mean aging function in each sport:

g(a) = α_a,

where the α_a's are parameters, one for each age. The only restriction is that α_a = 0 for some value a. We select which α_a is set to 0 by initial runs of the algorithm to find the peak age. This preserves the interpretation of the θ's as the peak performance values. This model allows the average aging function to be of arbitrary form on both sides of the peak age. In particular, the aging function may not be monotone. Although this may be nonintuitive, it allows for complete flexibility. A restriction is that the age of peak performance is the same across a sport. We believe that this is a reasonable assumption. The model is robust against small deviations in the peak age, because the aging function will reflect the fact that players performed well at those ages. By allowing players to age differently during the maturing and declining stages, each player's aging function can better represent good performance away from the league-wide peak. An alternative would be to model the peak age as varying across the population using a hierarchical model.
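To make the structure of the individual aging curves concrete, here is a minimal Python sketch of f_i(a) as just described: a lookup into the nonparametric mean curve g (one α per age, zero at the peak), scaled by a player's maturing or declining multiplier outside the knots a_M and a_D. The curve values and knots used in the example are invented for illustration.

    import numpy as np

    # Hypothetical mean aging curve (log scale) for ages 20-40, zero at a peak of 27
    ages = np.arange(20, 41)
    g = {a: -0.015 * (a - 27) ** 2 for a in ages}   # illustrative alpha values

    def f_i(a, beta1, beta2, a_M=25, a_D=29):
        """Player-specific aging effect: scale the mean curve g outside the knots."""
        if a < a_M:
            return beta1 * g[a]      # maturing phase
        if a > a_D:
            return beta2 * g[a]      # declining phase
        return g[a]                  # near peak, every player ages the same

    # A fast maturer (beta1 < 1) loses less than average at 21;
    # a fast decliner (beta2 > 1) loses more than average at 36
    print(f_i(21, beta1=0.5, beta2=1.4), f_i(36, beta1=0.5, beta2=1.4))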

We tried alternative aging functions that were parametric. We used a quadratic form and an exponential decay (growth) model. Both of these behaved very similarly to the nonparametric form close to the peak value.


The parametric forms behaved differently for very young and very old players. The parametric form was too rigid in that it predicted far worse performance for older players. A piecewise parametric form may be more reasonable.

4. ALGORITHMS

In this section we describe the Markov chain Monte Carlo algorithms used to calculate the posterior distributions. The structure of the programs is to successively generate values one at a time from the complete conditional distributions (see Gelfand and Smith 1990; Tierney 1994).

In the golf model, all of the complete conditional distributions are available in closed form. In the hockey, batting average, and home run models, a Metropolis-Hastings step is used for most of the complete conditional distributions (see Chib and Greenberg 1995). In all of the models, the decade-specific means and standard deviations can be generated in closed form.

Our results are based on runs with burn-in lengths of 5,000. Every third observation from the joint distribution is selected from one chain until 10,000 observations are collected. We used Fortran programs on a 166 MHz Sun Sparc Ultra 1. The golf programs took about 15 minutes, the hockey programs about 30 minutes, and each baseball program took about 80 minutes. With thousands of parameters, monitoring convergence is difficult. We found that most of the parameters depended on the year effects, and so concentrated our diagnostic efforts on the year effects. The algorithm appeared to converge very quickly to a stable set of year effects. Waiting for convergence for 5,000 observations is probably overkill. Monitoring the mixing of the chain is also difficult. Again, the year effects were important. We also monitored those effects generated with a Metropolis step. We varied the candidate distributions to assure that the chain was mixing properly.
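The successive-substitution structure described here can be sketched generically in Python as a loop over block updates with the burn-in and thinning settings quoted above. The `sample_blocks` functions stand in for the model-specific complete-conditional or Metropolis-Hastings updates; they are placeholders, not the authors' Fortran code.

    import numpy as np

    rng = np.random.default_rng(0)

    def run_chain(state, sample_blocks, burn_in=5000, n_keep=10000, thin=3):
        """Generic successive-substitution (Metropolis-within-Gibbs style) driver.

        state: dict of current parameter values.
        sample_blocks: list of functions; each draws one block of parameters
        and returns the updated state.
        """
        draws = []
        total = burn_in + n_keep * thin
        for it in range(total):
            for update in sample_blocks:
                state = update(state, rng)
            if it >= burn_in and (it - burn_in) % thin == 0:
                draws.append(dict(state))
        return draws

    # Toy example: Gibbs updates for a standard bivariate normal with correlation .5
    def update_mu(state, rng):
        state["mu"] = rng.normal(loc=0.5 * state["theta"], scale=np.sqrt(0.75))
        return state

    def update_theta(state, rng):
        state["theta"] = rng.normal(loc=0.5 * state["mu"], scale=np.sqrt(0.75))
        return state

    draws = run_chain({"mu": 0.0, "theta": 0.0}, [update_mu, update_theta],
                      burn_in=500, n_keep=1000, thin=3)
    print(len(draws))   # 1000 retained draws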

To validate our approach, we designed simulations and compared the results with the known values. We set up scenarios where players were constant over time, getting gradually better, and getting gradually worse. We crossed this with differing year effects. Some of the aspects of the models were developed using this technique. For example, we adopted different means and standard deviations for each decade based on their increased performance in the simulations. Our models did very well in the simulations. In particular, we found no systematic bias from these models.

Figure 1. The Residuals in the Hockey Study Plotted Against the Fitted Values. The lines are ±1, 2, and 3 times the square root of the predicted values. These are the standard deviations assuming the model and parameters are correct and the data are truly Poisson. The percentage of observations in each of the regions partitioned by the ±1, 2, and 3 standard deviation lines is reported on the graph.

5. GOODNESS OF FIT

In this section we consider the appropriateness of our models and address their overall fit. For each sport we present an analysis of the residuals. We look at the sum of squared errors for our fitted model (referred to as the full model) and several alternative models. The no individual aging model is a subset of the full model, with the restriction that β_{1i} = 1 and β_{2i} = 1 for all i. The no aging effects model assumes that f_i(a) = 0 for all i and a. The null model is a one-parameter model that assumes all players are identical and there are no other effects. Although this one-parameter model is not taken seriously, it does provide some information about the fit of the other models.

The objective sum of squares is the expected sum of squares if the model and parameters are correct. This is an unattainable goal in practice but gives an overall measure of the combined fit of the model and the parameters. We provide an analog to R², which is the proportion of the sum of squares explained by each model. For each model M, this is defined as

R²_M = 1 - SS_M / SS_N,

where SS_M refers to the sum of squared deviations using model M and the subscript N indexes the null model. Myers (1990) discussed R² in the context of log-linear models.

In calculating the sum of squares, we estimate the parameters with their posterior means.
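The following Python fragment shows the bookkeeping behind this R² analog for the hockey study, using the sums of squares reported in Table 2; the only inputs are the SS values themselves, so it simply reproduces the proportion-of-variation arithmetic described above.

    # Sums of squared deviations from Table 2 (hockey example)
    ss = {
        "objective": 346_000,
        "full": 838_000,
        "no individual aging": 980_000,
        "no aging effects": 1_171_000,
        "null": 3_928_000,
    }

    # Analog to R^2: proportion of the null model's sum of squares explained
    for model, ss_m in ss.items():
        r2 = 1 - ss_m / ss["null"]
        print(f"{model:>20s}  SS = {ss_m:>9,d}  R2 = {r2:.2f}")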

5.1 Hockey

Figure 1 plots the residual points per player season against the predicted points per player season. We include curves for ±1, 2, and 3 times the square root of the predicted values. These curves represent the ±1, 2, and 3 standard deviations of the residuals, assuming that the parameters and model are correct. The percent of residuals in each region is also plotted. The residual plot demonstrates a lack of fit of the model.

Table 2 presents the sum of squared deviations for each model. The sum of squares for each model is the sum of the squared differences between the model-estimated point total and the actual point total, over every player season.

Table 2. The Sum of Squared Deviations (SS) Between the Predicted Point Totals for Each Model and the Actual Point Totals in the Hockey Example

Model                    SS          R²
Objective                  346,000   .91
Full                       838,000   .79
No individual aging        980,000   .75
No aging effects         1,171,000   .70
Null                     3,928,000


The objective sum of squares is Σ_ij g_ij λ_ij, where λ_ij is the estimate of the points per game parameter from the full model. This represents the expected sum of squares if the model and parameters are exactly correct.

We feel that the model is reasonable but clearly demonstrates a lack of fit. The full model is a huge improvement over the null model, but it still falls well short of the objective. Of the three examples (golf has no objective), hockey represents the biggest gap between the objective and the full model. We believe that this is because strong interactions are likely in hockey. Of the three sports studied, hockey is the most team oriented, in which the individual statistics of a player are the most affected by the quality of his teammates. For example, Bernie Nicholls scored 78 points in the 1987-88 season without Wayne Gretzky as a teammate, and scored 150 points the next season as Gretzky's teammate.

There is also strong evidence that the aging effects and the individual aging effects are important. The R² is increased by substantial amounts by adding the age effects and, additionally, the individual aging effects. We think that the aging functions have a large effect because hockey is a physically demanding sport in which a slight loss of physical skill and endurance can have a big impact on scoring ability. Two players who have slight differences in aging patterns can exhibit large differences in point totals (relative to their peaks).

5.2 Golf

Figure 2 is a normal probability plot of the standardized residuals in the golf example. The pattern in the residual q-q plot is interesting, showing a deviation from normality. The left tail is "lighter" than that of a normal distribution, and the right tail is "heavier" than that of a normal distribution. As discussed in Section 3.2, this makes intuitive sense. It is very difficult to score low, and it is reasonably likely to score high. We tried various right-skewed distributions but found little difference in the results. The only difference we can see is in predicting individual scores, which is not a goal of this article.

Table 3 presents the sum of squared deviations between the estimated scores from each model and the actual scores.

Table 3. The Sum of Squared Deviations (SS) Between the Predicted Score for Each Model and the Actual Score in the Golf Example

Model                    SS        R²
Full                     366,300   .30
No individual aging      366,600   .30
No aging effects         372,000   .29
Null                     527,000

Figure 2. Normal Probability Plot of the Residuals From the Golf Model.

Because of the normal model, there is no objective sum of squares with which to compare the fit. The variance in the scores, σ², is a parameter fitted by the model and thus naturally reflects the fit of the model. Despite the small improvement between the null model and the full model, we feel that this is a very good-fitting model. This conclusion is based on the estimate of σ, which is 2.90. The R² for this model is only .30, which is small, but we believe there is a large amount of variability in a golf score that will never be modeled. We were pleased with a standard error of prediction of 2.90. There is little evidence that aging plays an important role in scoring. This is partly due to the fact that most of the scores in the dataset are recorded when players are in their prime. Few players qualified for the majors when they were very old or very young, and for these ages there is an effect. There is also little evidence that individual aging effects are needed, but this suffers from the same problem just mentioned.

5.3 Baseball

The residual plot for each baseball example indicated no serious departures from the model. The normal probability plots showed almost no deviations from normality. Table 4 presents the home run sum of squares; Table 5, the batting average sum of squares. The batting average example presents the sum of squared deviations of the predicted number of base hits from the model from the actual number of base hits. The R² is .60 for the full model, very close to the objective value of .62. We believe the batting average model is a good-fitting model. The home run model does not fit as well as the batting average example. Despite an R² of .80, it falls substantially short of the objective sum of squares. The high R² is due to the large spread in home run ability across players, which the null model does not capture.

Aging does not play a substantial role in either measure. This is partly due to the large number of observations close to peak, where aging does not matter, but also can be attributed to the lack of a strong effect due to aging.

Table 4. The Sum of Squared Deviations (SS) Between the Predicted Number of Home Runs for Each Model and the Actual Number of Home Runs in the Home Run Example

Model                    SS          R²
Objective                  171,000   .86
Full                       238,000   .80
No individual aging        242,500   .80
No aging effects           253,700   .78
Null                     1,203,000


Table 5. The Sum of Squared Deviations (SS) Between the Predicted Number of Hits for Each Model and the Actual Number of Hits in the Batting Average Example

Model                    SS           R²
Objective                1,786,000    .62
Full                     1,867,000    .60
No individual aging      1,897,000    .60
No aging effects         1,960,000    .58
Null                     4,699,000

The contrast between the four examples in the role of aging and the individual aging effects is interesting. In the most physically demanding of the sports, hockey, aging plays the greatest role. In the least physically demanding sport, golf, the aging effect plays the smallest role.

6. AGE EFFECT RESULTS

Figures 3-6 illustrate the four mean age effect (g) functions. Figure 3 shows the hockey age function. The y-axis represents the proportion of peak performance for a player of age a. Besides keeping track of the mean of the aging function for each age, we also keep track of the standard deviation of the values of the curve. The dashed lines are the ±2 standard deviation curves. This graph is very steep on both sides of the peak age, 27. The sharp increase during the maturing years is surprising—20- to 23-year-old players are not very close to their peak. Because of the sharp peak, we specified 29 and older as declining and 25 and younger as maturing.

Figure 4 presents the average aging function for golf. In this model g represents the average number of strokes from the peak. The peak age for golfers is 34, but the range 30-35 is essentially a "peak range." The rate of decline for golfers is more gradual than the rate of maturing. An average player is within .25 shots per round (1 shot per tournament) from peak performance when they are in the 25-40 age range. An average 20-year-old and an average 50-year-old are both 2 shots per round off their peak performance. Because of the peak range from 30-35, we specified the declining stage as 36 and older and the maturing phase as 29 and younger.

Figures 5 and 6 present the aging functions for home runs and batting averages. The home run aging function presents the estimated number of home runs for a player who is a 20-home run hitter at his peak. The peak age for home runs is 29. A 20-home run hitter at peak is within 2 home runs of his peak at 25-35 years old. There is a sharp increase for maturers. Apparently, home run hitting is a talent acquired through experience and learning, rather than being based on brute strength and bat speed. The ability to hit home runs does not decline rapidly after the peak level—even a 40-year-old 20-home-run-at-peak player is within 80% of peak performance.

The age effects for batting average are presented for a hitter who is a .300 hitter at his peak. Hitting for average does differ from home run hitting—27 is the peak age, and younger players are relatively better at hitting for average than hitting home runs. An average peak .300 hitter is expected to be a .265 hitter at age 40. For batting average and home runs, the maturing phase is 25 and younger and the declining phase is 31 and older.

7. PLAYER AND SPORT RESULTS

This section presents the results for the individual players and the year effects within each sport. To understand the rankings of the players, it is important to see the relative difficulty within each sport over the years. Each player is characterized by θ_i, his value at peak performance in a benchmark year, and by his aging profile. We present tables that categorize players by their peak performance, but we stress that their career profiles are a better categorization of the players. For example, in golf Jack Nicklaus is the best player when the players are younger than 43, but Ben Hogan is the best for players over 43. The means of the maturing and declining parameters are presented for comparison.

7.1 Hockey

The season effect in hockey is strong. Figure 7a shows the estimated multiplicative effects, relative to 1996. From 1948-1968 there were only six teams in the NHL, and the game was defensive in nature. In 1969 the league added six teams. The league continued to expand to the present 30 teams. With this expansion, goal scoring increased. The 1970s and early 1980s were the height of scoring in the NHL. As evidence of the scoring effects over the years, many players who played at their peak age in the 1960s with moderate scoring success played when they were "old" in the 1970s and scored better than ever before (e.g., Gordie Howe, Stan Mikita, and Jean Beliveau). In 1980 the wide-open, offensive-minded World Hockey Association, a competitor to the NHL, folded, and the NHL absorbed some of the teams and many of the players. This added to the offensive nature and style of the NHL. In the 1980s the NHL began to attract players from the Soviet bloc, and the United States also began to produce higher-caliber players. This influx again changed the talent pool.

Scoring began to wane beginning in 1983. This is attributed in part to a change in the style of play. Teams went from being offensive in nature to defensively oriented. "Clutching and grabbing" has become a common term to describe the style of play in the 1990s. As evidence of this, in 1998 the NHL made rule changes intended to increase scoring. The seasonal effects are substantial. The model predicts that a player scoring 100 points in 1996 would have scored 140 points in the mid-1970s or early 1980s.

Table 6 presents the top 25 players, rated on their peak level. Figure 8 presents profiles of some of these best players. It demonstrates the importance of a profile over a one-number summary. Mario Lemieux is rated as the best peak-performance player, but Wayne Gretzky is estimated to be better when they are young, whereas Lemieux is estimated to be the better player after peak. The fact that Lemieux is ahead of Gretzky at peak may seem surprising. Gretzky played during the most wide-open era in the NHL, whereas Lemieux played more of his career in a relatively defensive-minded era. Lemieux's career and season totals are a bit misleading, because he rarely played a full season. He missed many games throughout his career, and we rate players on their per game totals.


Figure 3. The Estimated Mean Aging Function and Pointwise ±2 Standard Deviation Curves for the Hockey Study. The y-axis is the proportion of peak for a player of age a.

As a cross-validation, we present the model-predicted point totals for the 10 highest-rated peak players who are still active. We estimated the 1997 season effect, δ_1997 = -.075, by the log of the ratio of goals scored in 1997 to goals scored in 1996. Table 7 presents the results. We calculated the variance of each predicted point total using the variance of the Poisson model and the points per game parameter, λ_ij (the standard deviation is reported in Table 7). With the exception of Pavel Bure, the predictions are very close.
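A posterior-predictive version of this kind of point-total forecast can be sketched in Python as follows. The posterior draws of the points-per-game rate used here are simulated stand-ins (the chapter does not publish them), so the output illustrates the mechanics rather than reproducing the Table 7 entries.

    import numpy as np

    rng = np.random.default_rng(1)

    def predict_points(lambda_draws, games, delta_1997=-0.075, n_sim=4000):
        """Posterior-predictive mean and SD of a season point total.

        lambda_draws: posterior draws of the player's points-per-game rate on the
        1996 scale; the 1997 season effect enters multiplicatively as exp(delta).
        """
        lam = rng.choice(lambda_draws, size=n_sim) * np.exp(delta_1997)
        totals = rng.poisson(lam * games)
        return totals.mean(), totals.std()

    # Hypothetical posterior draws for one player expected to play 76 games
    lambda_draws = rng.normal(loc=1.75, scale=0.15, size=2000).clip(min=0.01)
    print(predict_points(lambda_draws, games=76))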

7.2 Golf

Figure 7b shows the estimate for the difficulty of each round of the Masters tournament. The mean for each year is also plotted. We selected the Masters because it is the one tournament played on the same course (Augusta National) each year and the par has stayed constant at 72. These estimates measure the difficulty of each round, separated from the ability of the players playing those rounds. These estimates may account for weather conditions, course difficulty, and the equipment of the time. Augusta in the 1940s played easier than in the 1950s or 1960s; we are unsure why. There is approximately a 1 shot decrease from the mid-1950s to the present.

Figure 4. The Estimated Mean Aging Function and Pointwise ±2 Standard Deviation Curves for the Golf Study. The y-axis is the number of shots more than peak value for a player of age a.

Figure 5. The Estimated Mean Aging Function and Pointwise ±2 Standard Deviation Curves for the Home Run Study. The y-axis is the number of home runs for a player who is a 20-home run hitter at peak performance.

We attribute the decrease from the 1950s to the present to the improved equipment available to the players. Although it does appear that Augusta National is becoming easier to play, the effects of improved equipment do not appear to be as strong as public perception would have one believe. Augusta is a challenging course in part because of the speed and undulation of the greens. It may be that the greens have become faster, and more difficult, over the years. If this is true, then the golfers are playing a more difficult course and playing it 1 shot better than before. Such a case would imply the equipment has helped more than 1 shot.

Table 8 shows the top 25 players of all time at their peak. Figure 9 shows the profiles of six of the more interesting careers. The y-axis is the predicted mean score for each player at the respective age. Jack Nicklaus at his peak is nearly .5 shot better than any other player. Nicklaus essentially aged like the average player. Ben Hogan, who is .7 shot worse than Nicklaus at peak, aged very well. He is estimated to be better than Nicklaus at age 43 and older. The beauty of the hierarchical models comes through in the estimation of Tiger Woods' ability.

Figure 6. The Estimated Mean Aging Function and Pointwise ±2 Standard Deviation Curves for the Batting Average Study. The y-axis is the batting average for a player who is a .300 hitter at peak performance.


Figure 7. The Yearly Effects for the Hockey (a), Golf (b), Batting Average (c), and Home Run (d) Studies. The hockey plot shows the multiplicative effect on scoring for each year, relative to 1996. The golf plot shows the additional number of strokes for each round in the Masters, relative to the average of 1997. The line is the average for each year. The home run plot shows the estimated number of home runs for a 20-home run hitter in 1996. The batting average plot shows the estimated batting average for a player who is a .300 hitter in 1996.

Woods has played very well as a 21-year-old player (winning the Masters by 12 shots). He averaged 70.4 during the benchmark 1997 year. If he aged like the average player, which means that during 1997 he was 1.8 shots per round off his peak performance, then he would have a peak performance of 68.6. If he aged like the average player, then he would be by far the best player of all time. This is considered very unlikely because of the distribution of players. It is more likely that he is a quick maturer and is playing closer to his peak than the average 21-year-old. Thus his maturing parameter is estimated to be .52. The same phenomenon is seen for both Ernie Els and Justin Leonard, who are performing very well at young ages.
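The arithmetic behind this shrinkage argument can be written out in a few lines. The decomposition below applies the multiplicative maturing parameter to the 1.8-shot average age penalty quoted above, together with Woods' Table 8 peak estimate; the "expected 1997 average" and "luck" quantities are an illustrative calculation under the model as described, not figures reported in the chapter.

    # Numbers quoted in the text and Table 8 (Tiger Woods, age 21 in 1997)
    observed_1997_avg = 70.4      # benchmark-year scoring average
    avg_age_penalty = 1.8         # strokes off peak for the average 21-year-old
    beta1_estimate = 0.52         # posterior mean maturing parameter
    theta_estimate = 71.77        # posterior mean peak score (Table 8)

    # If Woods matured like the average player, his implied peak would be 68.6
    peak_if_average_ager = observed_1997_avg - avg_age_penalty

    # Under the fitted model, his expected 1997 average and the remainder
    # attributed to a "lucky" benchmark year (illustrative only)
    expected_1997 = theta_estimate + beta1_estimate * avg_age_penalty
    luck = observed_1997_avg - expected_1997
    print(peak_if_average_ager, round(expected_1997, 1), round(luck, 1))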

7.3 Baseball

Figures 7c and 7d illustrate the yearly effects for home runs and batting average. They show that after 1920, when the dead-ball era ended, the difficulty of hitting home runs has not changed a great deal. A 20-home run hitter in 1996 is estimated to have hit about 25 in the mid-1920s. Home run hitting has slowly decreased over the years, perhaps because of the increasing ability of pitchers. The difficulty of getting a base hit for a batter of constant ability has also increased since the early 1920s. The probability of getting a hit bottomed out in 1968 and has increased slightly since then. The slight increase after 1968 has been attributed to the lowering of the pitcher's mound, thus decreasing the pitchers' ability, and also to expansion in MLB. Most baseball experts believe that umpires are now using a smaller strike zone, which may also play a role. We attribute part of the general increase over the century in the difficulty of getting a base hit to the increasing depth and ability of pitchers.

Tables 9 and 10 show the top 25 peak batting average and home run hitters. The posterior means for peak performance in the benchmark year of 1996, the maturing parameter (β1), and the declining parameter (β2) are provided. Figures 10 and 11 show the career profiles for the batting average and home run examples. The model selects Mark McGwire as the greatest home run per at bat hitter in history. The model estimates that Babe Ruth at his prime would hit 5 fewer home runs than McGwire in 1996. Interestingly, the all-time career home run king (with 755), Hank Aaron, is only 23rd on the peak performance list. Aaron declined very slowly (the slowest of the top 100). He is higher on the batting average list (13th) than on the home run list!


Table 6. The Top 25 Peak Players in the Hockey Study

Rank

123456789

10111213141516171819202122232425

Name

M. LemieuxW. GretzkyE. LindrosJ. JagrP. KariyaP. ForsbergS. YzermanJ. SakicG. HoweT. SelanneP. BureJ. BeliveauP. EspositoA. MogilnyP. TurgeonS. FederovM. MessierP. LaFontaineBo. HullM. BossyBr. HullM. SundinJ. RoenickP. StastnyJ. Kurri

Born

19651961

19731972

1974

1973

19651969192819701971

19311942

1969196919691961

1965

1939195719641971

197019561960

Points in 1996

187(7)181 (5)157 (16)152(9)129 (15)124(10)120(5)119(6)119(7)113(6)113(8)112(5)112(5)112(6)110(6)110(5)110(4)109(5)108(4)108(4)107(5)106(7)106 (6)105(4)105 (4)

β1

1.18.66.93

1.37

.95

.84

.91

.951.04

.78

.81

.671.82

1.18.95

1.051.51

1.20.94.86

1.15.99.67

1.201.11

β2

.891.6611111.431.69

11.90

1.36111.55

1.32

1.291.021.12111.121.30

NOTE: The means of β1 and β2 are also presented. The Points in 1996 column represents the mean points (with standard deviations given in parentheses) for the player in 1996 if the player was at his peak performance.

Willie Stargell and Darryl Strawberry provide an interesting contrast in profiles. At peak they are both considered 41-home run hitters. Strawberry is estimated to have matured faster than Stargell, whereas Stargell maintained a higher performance during the declining phase.

Ty Cobb, who played in his prime about 80 years ago, is still considered the best batting average hitter of all time. Tony Gwynn is estimated to decline slowly (β2 = .78) and is considered a better batting average hitter than Cobb after age 34. Paul Molitor is estimated to be the best decliner of the top 100 peak players. At age 40, in 1996, he recorded a batting average of .341.


Figure 8. A Profile of Some of the Best Players in the Hockey Study. The estimated mean number of points for each age of the player, if that season were 1996.

Alex Rodriguez exhibits the same regression-to-the-mean characteristics as Tiger Woods does in golf. In Rodriguez's second year, the benchmark year of 1996, he led the American League in hitting at .358. The model predicts that at his peak in 1996 he would hit .336. Because of the shrinkage factor, as a result of the hierarchical model, it is more likely that Rodriguez is closer to his peak than the average player (i.e., is a rapid maturer) and that 1996 was a "lucky" year for him.

We recorded 78 ballparks in use in MLB beginning in 1901. When a ballpark underwent significant alterations, we included the "before" and "after" parks as different. The constraint for the parks is that ξ_{new Fenway} = 0. (There is an old Fenway, from 1912-1933, and a new Fenway, 1933-present. Significant changes were made in 1933, including moving the fences in substantially.) We report the three easiest and three hardest ballparks for home runs and batting average. (We ignore those with less than 5 years of use unless they are current ballparks.) For a 20-home run hitter in new Fenway, the expected numbers of home runs in the three easiest home run parks are 30.1 in South End Grounds (Boston Braves, 1901-1914), 28.6 in Coors Field (Colorado, 1995-), and 26.3 in new Oakland Coliseum (Oakland, 1996-). The 20-home run hitter would be expected to hit 14.5 at South Side (Chicago White Sox, 1901-1909), 14.8 at old Fenway (Boston, 1912-1933), and 15.9 at Griffith Stadium (Washington, 1911-1961), which are the three most difficult parks. The average of all ballparks for a 20-home run hitter at new Fenway is 20.75 home runs.

For a .300 hitter in new Fenway, the three easiest parks in which to hit for average are .320 at Coors Field, .306 at Connie Mack Stadium (Philadelphia, 1901-1937), and .305 at Jacobs Field (Cleveland, 1994-). The three hardest parks in which to hit for average are .283 at South Side, .287 at old Oakland Coliseum (Oakland, 1968-1995), and .287 at old Fenway. New Fenway is a good (batting average) hitters' park. A .300 hitter at new Fenway would be a .294 hitter in the average of the other parks. Some of the changes to the ballparks have been dramatic. Old Fenway was a very difficult park in which to hit for average or home runs, but after the fences were moved in, the park became close to average. The Oakland Coliseum went from a very difficult park to a very easy park after the fences were moved in and the outfield bleachers were enclosed in 1996.

Table 7. The Predicted and Actual Points for the Top 10 Model-Estimated Peak Players Who Played in 1997

Rank   Name          Age in 1997   Games played   Predicted points   Actual points
1      M. Lemieux    32            76             135 (14.9)         122
2      W. Gretzky    36            82             103 (13.9)          97
3      E. Lindros    24            52              84 (12.0)          79
4      J. Jagr       25            63              99 (13.1)          97
5      P. Kariya     23            69              86 (12.8)          99
6      P. Forsberg   24            65              83 (12.5)          86
7      S. Yzerman    32            81              84 (13.1)          85
8      J. Sakic      28            65              87 (12.6)          74
10     T. Selanne    27            78              99 (13.6)         109
11     P. Bure       27            63              81 (12.3)          55

NOTE: The model used data only from 1996 and prior to predict the point totals.


Table 8. The Top 25 Peak Players in the Golf Study

Rank

123456789

10111213141516171819202122232425

Name

J. NicklausT. WatsonB. HoganN. FaldoA. PalmerG. NormanJ. LeonardE. ElsG. PlayerF. CouplesH. IrwinC. PeeteJ. BorosR. FloydL. TrevinoS. SneadJ. OlazabalT. KiteB. CrenshawT. WoodsB. CasperB. NelsonP. MickelsonL. WadkinsT. Lehman

Born

1940194919121957192919551972196919351959194519431920194219391912196619491952197519311912197019491959

θ

70.42 (.29)70.82 (.23)71.12 (.29)71. 19 (.21)71 .33 (.28)71 .39 (.19)71 .40 (.45)71 .45 (.34)71 .45 (.23)71. 50 (.21)71 .56 (.26)71 .56 (.36)71 .62 (.37)71 .63 (.24)71 .63 (.29)71 .64 (.27)71. 69 (.39)71.71 (.23)71 .74 (.22)71 .77 (.64)71 .77 (.26)71 .78 (.31)71 .79 (.44)71. 79 (.22)71 .82 (.30)

β1

1.03.92

1.131.191.191.21.68.78,87

1.001.02111.221.001.10.74.98.43.52

1.001.00.79

1.131.05

β2

.991.19.27

1.21.95.64

11.62.97.68.80.61.38.72.21

1.70

1.2211.091.111.78.79

Table 9. The Top 25 Peak Players for the Batting Average Study

Rank

123456789

10111213141516171819202122232425

Name

T. CobbT. GwynnT. WilliamsW. BoggsR. CarewJ. JacksonN. LajoieS. MusialF. ThomasE. DelahantyT. SpeakerR. HornsbyH. AaronA. RodriguezP. RoseH. WagnerR. ClementeG. BrettD. MattinglyK. PuckettM. PiazzaE. CollinsE. MartinezP. MolitorW. Mays

Born

1886196019181958194518891874192019681867188818961934197519411874193419531961196119681887196319561931

Average

.368 (.005)

.363 (.006)

.353 (.006)

.353 (.005)

.351 (.005)

.347 (.007)

.345 (.009)

.345 (.005)

.344 (.008)

.340 (.001)

.339 (.006)

.338 (.005)

.336 (.006)

.336 (.001)

.335 (.004)

.333 (.007)

.332 (.005)

.331 (.005)

.330 (.006)

.330 (.006)

.330 (.009)

.329 (.004)

.328 (.008)

.328 (.005)

.328 (.005)

β1

1.141.08.95

1.051.06.86

1.98.99

1.99

1.02.89.85

1.2511.37.92.88

1.141.04.96

1.22.94.99

β2

1.31.78.93

1.17.92

1.121.361.1611.021.321.021.251.89

1.30.50

1.161.07.93

11.01.79.31

1.19

NOTE: The standard deviations are in parentheses. The means of β1 and β2 are also presented.

As cross-validation, we present the model predictions for 1997 performance. Recall that the baseball study uses data from 1996 and earlier in the estimation. We estimate the season effect of 1997 by the league-wide performance relative to 1996. For batting average, the estimated year effect for 1997 is -.01. Table 11 presents the model-predicted batting averages for the 10 highest-rated peak batting average players of all time who are still active. The estimates are good, except for Piazza and Gwynn, both of whom had batting averages approximately two standard deviations above the predicted values.
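A binomial analog of the hockey forecast sketch can be written the same way. The posterior draws of the hit probability used here are simulated placeholders rather than published output, and the 1997 year effect is applied on the log-odds scale following the logistic form sketched in Section 3.3, so this only illustrates how a predicted batting average and its spread would be computed.

    import numpy as np

    rng = np.random.default_rng(2)

    def predict_average(pi_draws, at_bats, delta_1997=-0.01, n_sim=4000):
        """Posterior-predictive mean and SD of a season batting average.

        pi_draws: posterior draws of the player's 1996-scale hit probability;
        delta_1997 shifts the log-odds for the 1997 season.
        """
        logit = np.log(pi_draws / (1 - pi_draws)) + delta_1997
        pi_1997 = 1 / (1 + np.exp(-logit))
        hits = rng.binomial(at_bats, rng.choice(pi_1997, size=n_sim))
        avg = hits / at_bats
        return avg.mean(), avg.std()

    # Hypothetical posterior draws for a player with 592 at bats in 1997
    pi_draws = np.clip(rng.normal(0.33, 0.01, size=2000), 0.01, 0.99)
    print(predict_average(pi_draws, at_bats=592))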

Table 12 presents the model-predicted and actual numbers of home runs, conditional on the number of at bats, for the 10 highest-rated peak home run hitters of all time who are still active.

Figure 9. A Profile of Some of the Best Players in the Golf Study. The estimated mean score for each age of the player, if that round were an average 1997 round.

The estimated year effect for 1997 is -.06. Palmer, Belle, and Canseco did worse than their projected values. The model provided a nice fit for Griffey and McGwire, each of whom posted historical years that were not so unexpected by the model. Standard errors of prediction were calculated using the error of the binomial model and the error in the estimates of player abilities, age effects, and ballpark effects.

Table 10. The Top 25 Peak Players in the Home Run Study

Rank  Name           Born  Peak home run rate
 1    M. McGwire     1963  .104 (.006)
 2    J. Gonzalez    1969  .098 (.008)
 3    B. Ruth        1895  .094 (.004)
 4    D. Kingman     1948  .093 (.004)
 5    M. Schmidt     1949  .092 (.005)
 6    H. Killebrew   1936  .090 (.005)
 7    F. Thomas      1968  .089 (.007)
 8    J. Canseco     1964  .088 (.004)
 9    R. Kittle      1958  .086 (.006)
10    W. Stargell    1940  .084 (.003)
11    W. McCovey     1938  .084 (.004)
12    D. Strawberry  1962  .084 (.005)
13    B. Jackson     1962  .083 (.006)
14    T. Williams    1918  .083 (.004)
15    R. Kiner       1922  .083 (.004)
16    P. Seerey      1923  .081 (.009)
17    R. Jackson     1946  .081 (.004)
18    K. Griffey     1969  .080 (.006)
19    A. Belle       1966  .080 (.006)
20    R. Allen       1942  .080 (.004)
21    B. Bonds       1964  .079 (.004)
22    D. Palmer      1968  .079 (.007)
23    H. Aaron       1934  .078 (.003)
24    J. Foxx        1907  .078 (.003)
25    M. Piazza      1968  .078 (.006)

NOTE: The standard deviations are in parentheses.


Figure 10. A Profile of Some of the Best Players in the Batting Average Study. The estimated batting average for each age of the player, if that year were 1996.

8. POPULATION DYNAMICS

In this section we address the changing distribution of players within each study. Figures 12-15 present graphs of the peak value estimate for each player, graphed against the year the player was born. These player effects are separated from all the other effects; thus the players can be compared directly.

In hockey there is some slight bias on each end of the population distribution (see Fig. 12). Players born early in the century were fairly old when our data began (1948). They are in the dataset only if they are good players. The restriction that each player plays at least 100 games was harder for a player to reach earlier in this century because a season consisted of 48 games, rather than the current 82 games. Therefore, there is a bias, overestimating the percentiles of the distribution of players for the early years.

Figure 11. A Profile of Some of the Best Players in the Home Run Study. The estimated number of home runs, conditional on 500 at bats, for each age of the player, if that year were 1996.

Table 11. The Predicted and Actual Batting Averages (BA) for the Top 10 Model-Estimated Peak Players Who Played in 1997

Rank  Name          Age in 1997  At bats  Predicted BA   Actual BA
  2   T. Gwynn          37         592    .329 (.021)      .372
  4   W. Boggs          39         353    .318 (.027)      .292
  9   F. Thomas         29         530    .328 (.023)      .347
 14   A. Rodriguez      22         587    .312 (.022)      .300
 21   M. Piazza         29         556    .316 (.023)      .362
 23   E. Martinez       34         542    .309 (.022)      .330
 24   P. Molitor        41         538    .290 (.022)      .305
 29   R. Alomar         29         412    .316 (.026)      .333
 39   K. Griffey        28         608    .313 (.022)      .304
 47   M. Grace          33         555    .308 (.022)      .319

NOTE: The model used 1901-1996 data to predict 1997 totals. Standard deviations are in parentheses.

Of the players born late in this century (after 1970), it is more likely that the good ones are included. Thus the percentiles are probably overestimated slightly.

For hockey players born after 1940, there is a clear increase in ability. Of the top 25 players, 9 are current players who have yet to reach their peak (36%, where only 8% of the players in our data had not reached their peak). It is hard to address Gould's claim with the hockey distribution. This is because not everyone in this dataset is trying to score points. Many hockey players are role players, with the job of playing defense or even just picking fights with the opposition! The same is true of the distribution of home run hitters in baseball. Many baseball players are not trying to hit home runs; their role may focus more on defense or on hitting for average or otherwise reaching base. The same type of pattern shows up in home run hitting. The top 10% of home run hitters are getting better with time (see Fig. 13). This could be attributed to the increasing size and strength of the population from which players are produced, the inclusion of African-Americans and Latin Americans, or an added emphasis by major league managers on hitting home runs.

It is easier to address Gould's claim with the batting average and golf studies. In baseball, every player is trying to get a base hit. Every player participates in the offense an equal amount, and even the defensive-minded players try to get hits.

Table 12. The Predicted and Actual Home Runs (HR) for the Top 10 Model-Estimated Peak Players Who Played in 1997

Rank  Name           Age in 1997  At bats  Predicted HR  Actual HR
  1   M. McGwire         34         540     55 (7.63)       58
  2   J. Gonzalez        28         541     41 (7.08)       42
  7   F. Thomas          29         530     42 (7.17)       35
  8   J. Canseco         33         388     37 (6.36)       23
 12   D. Strawberry      35          29      2 (1.58)        0
 18   K. Griffey         28         608     49 (7.74)       56
 19   A. Belle           31         634     46 (7.15)       30
 21   B. Bonds           33         532     37 (6.48)       40
 22   D. Palmer          29         556     34 (6.51)       23
 25   M. Piazza          29         542     36 (6.63)       40

NOTE: The model used 1901-1996 data to predict 1997 totals. Standard deviations are in parentheses.


Figure 12. The Estimated Peak Season Scoring Performance of Each Player Plotted Against the Year They Were Born. The y-axis represents the mean number of points scored for each player, at their peak, if the year was 1996. The three curves are the smoothed 10th, 50th, and 90th percentiles.

In golf, every player tries to minimize his scores—the only goal for the golfer. In the golf study there is a bias in the players who are in our dataset: Only players with 10 majors are included. It was harder to achieve this in the early years, because we have data on only two majors until 1961. It was also hard to find the birth dates for marginal players from the early years. We believe we have dates for everyone born after 1940, but we are missing dates for about 25% of the players born before then. There is also a slight bias on each end of the batting average graph. Only the great players born in the 1860s were still playing after 1900, and only the best players born in the early 1970s are in the dataset.

Except for the tails of Figures 12 and 13, there is a clear increase in ability. The golf study supports Gould's conjecture.

Figure 13. The Estimated Peak Home Run Performance of Each Player Plotted Against the Year They Were Born. The y-axis represents the mean number of home runs for each player, at their peak, if the year was 1996. The three curves are the smoothed 10th, 50th, and 90th percentiles.

Figure 14. The Estimated Peak Batting Average Performance of Each Player Plotted Against the Year They Were Born. The y-axis represents the mean probability of a hit for each player, at their peak, if the year was 1996. The three curves are the smoothed 10th, 50th, and 90th percentiles.

The best players are getting slightly better, but there are great players in every era. The median and 10th percentile are improving rapidly (see Fig. 15). The current 10th percentile player is almost 2 shots better than the 10th percentile fifty years ago. This explains why nobody dominates golf the way Hogan, Snead, and Nelson dominated in the 1940s and 1950s. The median player, and even the marginal player, can have a good tournament and win. Batting average exhibits a similar pattern. The best players are increasing in ability, but the 10th percentile is increasing faster than the 90th percentile (see Fig. 14). It appears as though batting averages have increased steadily, whereas golf is in a period of rapid growth.

These conclusions coincide with the histories of these sports. American sports are experiencing increasing diversity in the regions from which they draw players.

Figure 15. The Estimated Peak Scoring Performance of Each Player Plotted Against the Year They Were Born. The y-axis represents the mean score for each player, at their peak, if the year was 1997. The three curves are the smoothed 10th, 50th, and 90th percentiles.


The globalization has been less pronounced in MLB, where players are drawn mainly from the United States and other countries in the Americas. Baseball has remained fairly stable within the United States, where it has been an important part of the culture for more than a century. On the other hand, golf has experienced a huge recent boom throughout the world.

9. DISCUSSION

In this article we have developed a model for comparing players from different eras in their performance in three different sports. Overlapping careers in each sport provide a network of bridges from the past to the present.

For each sport we constructed additive models to account for various sources of error. The ability of each player, the difficulty of each year, and the effects of aging on performance were modeled. To account for different players aging differently, we used random curves to represent the individual aging effects. The changing population in each sport was modeled with separate hierarchical distributions for each of the decades.

Because of multiple sources of variation not accounted for in scoring, the model for the scoring ability of NHL players did not fit as well as the models in the other three studies. It still provided reasonable estimates, however, and the face validity of the results is very high. The different years in hockey play an important role in scoring. Career totals for individuals are greatly influenced by the era in which they played. Wayne Gretzky holds nearly every scoring record in hockey and yet we estimate him to be the second-best scorer of all time. The optimal age for a hockey player is 27, with a sharp decrease after age 30. A hockey player at age 34, the optimal golf age, is at only 75% of his peak value. Many of the greatest scorers of all time are playing now; NHL hockey has greatly expanded its talent pool in the last 20 years, and the number of great players has increased as well.

The golf model provided a very good fit, with results that are intuitively appealing. Players' abilities have increased substantially over time, and the golf data support Gould's conjecture. The best players in each era are comparable, but the median and below-average players are getting much better over time. The 10th percentile player has gotten about 2 shots better over the last 40 years. The optimal age for a professional golfer is 34, though the range 30-35 is nearly optimal. A golfer at age 20 is approximately equivalent to the same golfer at age 50—both are about 2 shots below their peak level. We found evidence that playing Augusta National now, with the equipment and conditions of today, is about 1 shot easier than playing it with the equipment and conditions of 1950. Evidence was also found that golf scores are not normal. The left tail of scores is slightly shorter than a normal distribution and the right tail slightly heavier than a normal distribution.

The baseball model fit very well. The ability of players to hit home runs has increased dramatically over the century. Many of the greatest home run hitters ever are playing now. Batting average does not have the same increase over the century. There is a gradual increase in the ability of players to hit for average, but the increase is not nearly as dramatic as for home runs. The distribution of batting average players lends good support to Gould's conjecture. The best players are increasing in ability, but the median and 10th percentile players are increasing faster over the century. It has gotten harder for players of a fixed ability to hit for average. This may be due to the increasing ability of pitchers.

Extensions of this work include collecting more complete data in hockey and golf. The aging curve could be extended to allow for different peak ages for the different players. Model selection could be used to address how the populations are changing over time—including continuously indexed hierarchical distributions.

[Received May 1998. Revised January 1999.]

REFERENCES

Albert, J. (1993), Comment on "A Statistical Analysis of Hitting Streaks in Baseball," by S. Albright, Journal of the American Statistical Association, 88, 1184-1188.

——— (1998), "The Homerun Hitting of Mike Schmidt," Chance, 11, 3-11.

Albright, S. (1993), "A Statistical Analysis of Hitting Streaks in Baseball," Journal of the American Statistical Association, 88, 1175-1183.

Berry, S., and Larkey, P. (1998), "The Effects of Age on the Performance of Professional Golfers," Science and Golf III, London: E & FN SPON.

Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis-Hastings Algorithm," The American Statistician, 49, 327-335.

Draper, D., Gaver, D., Goel, P., Greenhouse, J., Hedges, L., Morris, C., Tucker, J., and Waternaux, C. (1992), Combining Information: Statistical Issues and Opportunities for Research (Vol. 1 of Contemporary Statistics), Washington, DC: National Academy Press.

Gelfand, A., and Smith, A. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Gilks, W., Richardson, S., and Spiegelhalter, D. (Eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.

Gould, S. (1996), Full House: The Spread of Excellence from Plato to Darwin, New York: Three Rivers Press.

Hollander, Z. (Ed.) (1997), Inside Sports Hockey, Detroit: Visible Ink Press.

Jackson, D., and Mosurski, K. (1997), "Heavy Defeats in Tennis: Psychological Momentum or Random Effect?," Chance, 10, 27-34.

Larkey, P., Smith, R., and Kadane, J. (1989), "It's Okay to Believe in the Hot Hand," Chance, 2, 22-30.

Lindley, D., and Smith, A. (1972), "Bayes Estimates for the Linear Model," Journal of the Royal Statistical Society, Ser. B, 34, 1-41.

Mosteller, F., and Youtz, C. (1993), "Where Eagles Fly," Chance, 6, 37-42.

Myers, R. (1990), Classical and Modern Regression With Applications, Belmont, CA: Duxbury Press.

Riccio, L. (1994), "The Aging of a Great Player: Tom Watson's Play," Science and Golf II, London: E & FN SPON.

Scheid, F. (1990), "On the Normality and Independence of Golf Scores," Science and Golf, London: E & FN SPON.

Schell, M. (1999), Baseball's All-Time Best Hitters, Princeton, NJ: Princeton University Press.

Shi, M., Weiss, R., and Taylor, J. (1996), "An Analysis of Paediatric CD4 Counts for Acquired Immune Deficiency Syndrome Using Flexible Random Curves," Applied Statistics, 45, 151-163.

Stern, H. (1995), "Who's Hot and Who's Not," in Proceedings of the Section on Statistics in Sports, American Statistical Association.

Stern, H., and Morris, C. (1993), Comment on "A Statistical Analysis of Hitting Streaks in Baseball," by S. Albright, Journal of the American Statistical Association, 88, 1189-1194.

Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions," The Annals of Statistics, 22, 1701-1762.

Tversky, A., and Gilovich, T. (1989a), "The Cold Facts About the 'Hot Hand' in Basketball," Chance, 2, 16-21.

——— (1989b), "The 'Hot Hand': Statistical Reality or Cognitive Illusion?," Chance, 2, 31-34.


Chapter 29

Data Analysis Using Stein's Estimator and Its Generalizations

BRADLEY EFRON and CARL MORRIS*

In 1961, James and Stein exhibited an estimator of the mean of a multivariate normal distribution having uniformly lower mean squared error than the sample mean. This estimator is reviewed briefly in an empirical Bayes context. Stein's rule and its generalizations are then applied to predict baseball averages, to estimate toxoplasmosis prevalence rates, and to estimate the exact size of Pearson's chi-square test with results from a computer simulation. In each of these examples, the mean square error of these rules is less than half that of the sample mean.

1. INTRODUCTION

Charles Stein [15] showed that it is possible to make a uniform improvement on the maximum likelihood estimator (MLE) in terms of total squared error risk when estimating several parameters from independent normal observations. Later James and Stein [13] presented a particularly simple estimator for which the improvement was quite substantial near the origin, if there are more than two parameters. This achievement leads immediately to a uniform, nontrivial improvement over the least squares (Gauss-Markov) estimators for the parameters in the usual formulation of the linear model. One might expect a rush of applications of this powerful new statistical weapon, but such has not been the case. Resistance has formed along several lines:

1. Mistrust of the statistical interpretation of the mathematical formulation leading to Stein's result, in particular the sum of squared errors loss function;

2. Difficulties in adapting the James-Stein estimator to the many special cases that invariably arise in practice;

3. Long familiarity with the generally good performance of the MLE in applied problems;

4. A feeling that any gains possible from a "complicated" procedure like Stein's could not be worth the extra trouble. (J.W. Tukey at the 1972 American Statistical Association meetings in Montreal stated that savings would not be more than ten percent in practical situations.)

We have written a series of articles [5, 6, 7, 8, 9, 10, 11] that cover Points 1 and 2. Our purpose here, and in a lengthier version of this report [12], is to illustrate the methods suggested in these articles on three applied problems and in that way deal with Points 3 and 4. Only one of the three problems, the toxoplasmosis data, is "real" in the sense of being generated outside the statistical world.

* Bradley Efron is professor, Department of Statistics, Stanford University, Stanford, Calif. 94305. Carl Morris is statistician, Department of Economics, The RAND Corporation, Santa Monica, Calif. 90406.

The other two problems are contrived to illustrate in a realistic way the genuine difficulties and rewards of procedures like Stein's. They have the added advantage of having the true parameter values available for comparison of methods. The examples chosen are the first and only ones considered for this report, and the favorable results typify our previous experience.

To review the James-Stein estimator in the simplest setting, suppose that for given θ_i

  X_i | θ_i ~ind N(θ_i, 1),  i = 1, 2, ..., k,   (1.1)

meaning the {X_i} are independent and normally distributed with mean E(X_i | θ_i) = θ_i and variance Var(X_i | θ_i) = 1. The example (1.1) typically occurs as a reduction to this canonical form from more complicated situations, as when X_i is a sample mean with known variance that is taken to be unity through an appropriate scale transformation. The unknown vector of means θ = (θ_1, ..., θ_k)' is to be estimated with loss being the sum of squared component errors

  L(θ, θ̂) = Σ_{i=1}^{k} (θ̂_i − θ_i)²,   (1.2)

where θ̂ = (θ̂_1, ..., θ̂_k)' is the estimate of θ. The MLE, which is also the sample mean, δ^0(X) ≡ X ≡ (X_1, ..., X_k)', has constant risk k,

  R(θ, δ^0) ≡ E_θ L(θ, δ^0(X)) = k,   (1.3)

E_θ indicating expectation over the distribution (1.1). James and Stein [13] introduced the estimator δ^1(X) with components

  δ_i^1(X) = μ_i + (1 − (k − 2)/S)(X_i − μ_i),  i = 1, 2, ..., k,   (1.4)

with μ = (μ_1, ..., μ_k)' any initial guess at θ and S ≡ Σ_j (X_j − μ_j)². This estimator has risk

  R(θ, δ^1) = k − (k − 2)² E_θ(1/S)   (1.5)

           ≤ k − (k − 2)²/[k − 2 + Σ_i (θ_i − μ_i)²],   (1.6)

being less than k for all θ, and if θ_i = μ_i for all i the risk is two, comparing very favorably to k for the MLE.

© Journal of the American Statistical Association, June 1975, Volume 70, Number 350, Applications Section


The estimator (1.4) arises quite naturally in an empirical Bayes context. If the {θ_i} themselves are a sample from a prior distribution,

  θ_i ~ind N(μ_i, τ²),  i = 1, 2, ..., k,   (1.7)

then the Bayes estimate of θ_i is the a posteriori mean of θ_i given the data,

  θ_i* = μ_i + (1 − 1/(1 + τ²))(X_i − μ_i).   (1.8)

In the empirical Bayes situation, τ² is unknown, but it can be estimated because marginally the {X_i} are independently normal with means {μ_i} and

  S ≡ Σ_j (X_j − μ_j)² ~ (1 + τ²) χ²_k,   (1.9)

where χ²_k is the chi-square distribution with k degrees of freedom. Since k ≥ 3, the unbiased estimate

  E[(k − 2)/S] = 1/(1 + τ²)   (1.10)

is available, and substitution of (k − 2)/S for the unknown 1/(1 + τ²) in the Bayes estimate θ_i* of (1.8) results in the James-Stein rule (1.4). The risk of δ_i^1, averaged over both X and θ, is, from [6] or [8],

  1 − [(k − 2)/k] · 1/(1 + τ²),   (1.11)

E_τ denoting expectation over the distribution (1.7). The risk (1.11) is to be compared to the corresponding risks of 1 for the MLE and 1 − 1/(1 + τ²) for the Bayes estimator. Thus, if k is moderate or large, δ^1 is nearly as good as the Bayes estimator, but it avoids the possible gross errors of the Bayes estimator if τ² is misspecified.

It is clearly preferable to use min{1, (k − 2)/S} as an estimate of 1/(1 + τ²) instead of (1.10). This results in the simple improvement

  δ_i^{1+}(X) = μ_i + (1 − (k − 2)/S)⁺ (X_i − μ_i),   (1.12)

with a⁺ ≡ max(0, a). That R(θ, δ^{1+}) < R(θ, δ^1) for all θ is proved in [2, 8, 10, 17]. The risks R(θ, δ^1) and R(θ, δ^{1+}) are tabled in [11].
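As a concrete illustration of (1.4) and (1.12), here is a minimal sketch in Python of the positive-part James-Stein rule shrinking toward an initial guess μ. The code and the toy data are our own, not from the original article.

import numpy as np

def james_stein_plus(x, mu):
    """Positive-part James-Stein estimate (1.12): shrink x toward the guess mu."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    k = x.size                     # number of coordinates; the rule needs k >= 3
    s = np.sum((x - mu) ** 2)      # S = sum of squared deviations from the guess
    shrink = max(0.0, 1.0 - (k - 2) / s)
    return mu + shrink * (x - mu)

# Toy example: unit-variance observations whose true means are all zero.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10)
print(james_stein_plus(x, np.zeros(10)))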

2. USING STEIN'S ESTIMATOR TO PREDICT BATTING AVERAGES

The batting averages of 18 major league players through their first 45 official at bats of the 1970 season appear in Table 1. The problem is to predict each player's batting average over the remainder of the season using only the data of Column (1) of Table 1. This sample was chosen because we wanted between 30 and 50 at bats to assure a satisfactory approximation of the binomial by the normal distribution while leaving the bulk of at bats to be estimated. We also wanted to include an unusually good hitter (Clemente) to test the method with at least one extreme parameter, a situation expected to be less favorable to Stein's estimator. Batting averages are published weekly in the New York Times, and by April 26, 1970 Clemente had batted 45 times. Stein's estimator requires equal variances,¹ or in this situation, equal at bats, so the remaining 17 players are all whom either the April 26 or May 3 New York Times reported with 45 at bats.

Let Y_i be the batting average of Player i, i = 1, ..., 18 (k = 18) after n = 45 at bats. Assuming base hits occur according to a binomial distribution with independence between players, nY_i ~ind bin(n, p_i), i = 1, 2, ..., 18, with p_i the true season batting average, so EY_i = p_i. Because the variance of Y_i depends on the mean, the arc-sin transformation for stabilizing the variance of a binomial distribution is used: X_i = f_45(Y_i), i = 1, ..., 18, with

  f_n(y) ≡ n^{1/2} arcsin(2y − 1).   (2.1)

Then X_i has nearly unit variance² independent of p_i. The mean³ θ_i of X_i is given approximately by θ_i = f_n(p_i). Values of X_i, θ_i appear in Table 1. From the central limit theorem for the binomial distribution and continuity of f_n, we have approximately

  X_i | θ_i ~ind N(θ_i, 1),  i = 1, ..., 18,   (2.2)

the situation described in Section 1.

the situation described in Section 1.We use Stein's estimator (1.4), but we estimate the

common unknown value = i/k by X = X i /k ,shrinking all Xi toward X, an idea suggested by Lindley[6, p. 285-7]. The resulting estimate of the ith com-ponent i of is therefore

i

with V (Xi - X)2 and with fc-3=(k-l)-2as the appropriate constant since one parameter is esti-mated. In the empirical Bayes case, the appropriatenessof (2.3) follows from estimating the Bayes rule (1.8) byusing the unbiased estimates X for u and (k — 3)/V for1/(1 + )2 from the marginal distribution of X, analogousto Section 1 (see also [6, Sec. 7]). We may use theBayesian model for these data because (1.7) seems atleast roughly appropriate, although (2.3) also can bejustified by the non-Bayesian from the suspicion that

( i - )2 is small, since the risk of (2.3), analogous to(1.6), is bounded by

For our data, the estimate of 1/(1 + 2) is (k - 3)/F= .791 or = 0.514, representing considerable a prioriinformation. The value of X is —3.275 so

¹ The unequal variances case is discussed in Section 3.

² An exact computer computation showed that the standard deviation of X_i is within .036 of unity for n = 45 for all p_i between 0.15 and 0.85.

³ For most of this discussion we will regard the values of p_i of Column (2), Table 1 and θ_i as the quantities to be estimated, although we actually have a prediction problem because these quantities are estimates of the mean of Y_i. Accounting for this fact would cause Stein's method to compare even more favorably to the sample mean because the random error in p_i increases the losses for all estimators equally. This increases the errors of good estimators by a higher percentage than poorer ones.


1. 1970 Batting Averages for 18 Major League Players and Transformed Values X_i, θ_i

 i  Player                  Y_i (1)  p_i (2)  At bats (3)  X_i (4)  θ_i (5)
 1  Clemente (Pitts, NL)      .400     .346       367      -1.35    -2.10
 2  F. Robinson (Balt, AL)    .378     .298       426      -1.66    -2.79
 3  F. Howard (Wash, AL)      .356     .276       521      -1.97    -3.11
 4  Johnstone (Cal, AL)       .333     .222       275      -2.28    -3.96
 5  Berry (Chi, AL)           .311     .273       418      -2.60    -3.17
 6  Spencer (Cal, AL)         .311     .270       466      -2.60    -3.20
 7  Kessinger (Chi, NL)       .289     .263       586      -2.92    -3.32
 8  L. Alvarado (Bos, AL)     .267     .210       138      -3.26    -4.15
 9  Santo (Chi, NL)           .244     .269       510      -3.60    -3.23
10  Swoboda (NY, NL)          .244     .230       200      -3.60    -3.83
11  Unser (Wash, AL)          .222     .264       277      -3.95    -3.30
12  Williams (Chi, AL)        .222     .256       270      -3.95    -3.43
13  Scott (Bos, AL)           .222     .303       435      -3.95    -2.71
14  Petrocelli (Bos, AL)      .222     .264       538      -3.95    -3.30
15  E. Rodriguez (KC, AL)     .222     .226       186      -3.95    -3.89
16  Campaneris (Oak, AL)      .200     .285       558      -4.32    -2.98
17  Munson (NY, AL)           .178     .316       408      -4.70    -2.53
18  Alvis (Mil, NL)           .156     .200        70      -5.10    -4.32

NOTE: Column (1) is the batting average for the first 45 at bats, Column (2) the batting average for the remainder of the season, and Column (3) the at bats for the remainder of the season.

The results are striking. The sample mean X has total squared prediction error Σ(X_i − θ_i)² of 17.56, but δ^1(X) ≡ (δ_1^1(X), ..., δ_18^1(X)) has total squared prediction error of only 5.01. The efficiency of Stein's rule relative to the MLE for these data is defined as Σ(X_i − θ_i)²/Σ(δ_i^1(X) − θ_i)², the ratio of squared error losses. The efficiency of Stein's rule is 3.50 (= 17.56/5.01) in this example. Moreover, δ_i^1 is closer than X_i to θ_i for 15 batters, being worse only for Batters 1, 10, 15. The estimates (2.5) are retransformed in Table 2 to provide estimates p̂_i^1 = f_n^{-1}(δ_i^1) of p_i.
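A compact sketch of the whole pipeline of Section 2 (arc-sin transform, shrinkage toward the grand mean as in (2.3), back-transformation) might look like the following in Python. This is our own illustrative code, not the authors'; the 18 first-45-at-bat averages are typed in from Table 1.

import numpy as np

def f(y, n=45):
    """Arc-sin variance-stabilizing transform (2.1)."""
    return np.sqrt(n) * np.arcsin(2.0 * y - 1.0)

def f_inv(x, n=45):
    """Inverse of the transform, back to the batting-average scale."""
    return (np.sin(x / np.sqrt(n)) + 1.0) / 2.0

def stein_toward_mean(x):
    """Estimator (2.3): shrink each X_i toward the grand mean X-bar."""
    k = x.size
    xbar = x.mean()
    v = np.sum((x - xbar) ** 2)
    shrink = 1.0 - (k - 3) / v          # (k-3)/V estimates 1/(1 + tau^2)
    return xbar + shrink * (x - xbar)

# First-45-at-bat averages for the 18 players (Clemente first).
y = np.array([.400, .378, .356, .333, .311, .311, .289, .267, .244, .244,
              .222, .222, .222, .222, .222, .200, .178, .156])
p_hat = f_inv(stein_toward_mean(f(y)))
print(np.round(p_hat, 3))               # shrunken batting-average estimates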

- l ( i) of pi.Stein's estimators achieve uniformly lower aggregate

risk than the MLE but permit considerably increasedrisk to individual components of the vector 6. As a func-

2. Batting Averages and Their Estimates

 i   p_i (season   MLE    Retransform of      Retransform    Retransform
     remainder)    Y_i    Stein's estimator   of δ_i^0.9     of δ_i^0.8
 1     .346        .400        .290               .334           .351
 2     .298        .378        .286               .313           .329
 3     .276        .356        .281               .292           .308
 4     .222        .333        .277               .277           .287
 5     .273        .311        .273               .273           .273
 6     .270        .311        .273               .273           .273
 7     .263        .289        .268               .268           .268
 8     .210        .267        .264               .264           .264
 9     .269        .244        .259               .259           .259
10     .230        .244        .259               .259           .259
11     .264        .222        .254               .254           .254
12     .256        .222        .254               .254           .254
13     .303        .222        .254               .254           .254
14     .264        .222        .254               .254           .254
15     .226        .222        .254               .254           .254
16     .285        .200        .249               .249           .242
17     .316        .178        .244               .233           .218
18     .200        .156        .239               .208           .194

As a function of θ, the risk for estimating θ_1 by δ_1^1, for example, can be as large as k/4 times as great as the risk of the MLE X_1. This phenomenon is discussed at length in [5, 6], where "limited translation estimators" δ^s(X), 0 ≤ s ≤ 1, are introduced to reduce this effect. The MLE corresponds to s = 0, Stein's estimator to s = 1. The estimate δ_i^s(X) of θ_i is defined to be as close as possible to δ_i^1(X) subject to the condition that it not differ from X_i by more than [(k − 1)(k − 3)/(kV)]^{1/2} D_{k−1}(s) standard deviations of X_i, D_{k−1}(s) being a constant taken from [6, Table 1]. If s = 0.8, then D_17(s) = 0.786, so δ_i^0.8(X) may differ from X_i by no more than

  [(17)(15)/(18V)]^{1/2} (0.786) = 0.68 standard deviations of X_i.

This modification reduces the maximum component risk of 4.60 for δ_i^1 to 1.52 for δ_i^0.8, while retaining 80 percent of the savings of Stein's rule over the MLE. The retransformed values p̂_i^0.8 of the limited translation estimates f_n^{-1}(δ_i^0.8(X)) are given in the last column of Table 2, the estimates for the top three and bottom two batters being affected. Values for s = 0.9 are also given in Table 2.

Clemente (i = 1) was known to be an exceptionally good hitter from his performance in other years. Limiting translation results in a much better estimate for him, as we anticipated, since δ_1^1(X) differs from X_1 by an excessive 1.56 standard deviations of X_1. The limited translation estimators are closer than the MLE for 16 of the 18 batters, and the case s = 0.9 has better efficiency (3.91) for these data relative to the MLE than Stein's rule (3.50), but the rule with s = 0.8 has lower efficiency (3.01). The maximum component error occurs for Munson (i = 17) with all four estimators. The Bayesian effect is so strong that this maximum error |δ_17 − θ_17| decreased from 2.17 for s = 0, to 1.49 for s = 0.8, to 1.25 for s = 0.9, to 1.08 for s = 1.


Limiting translation, therefore, increases the worst error in this example, just opposite to the maximum risks.
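The mechanism can be sketched as a simple clipping rule. This is an illustration of the idea only, in our own Python; the bound c would come from the D_{k−1}(s) constants of [6], roughly 0.68 here for s = 0.8.

import numpy as np

def limited_translation(x, delta1, c):
    """Keep each Stein estimate delta1_i, but never move more than c from x_i."""
    x = np.asarray(x, dtype=float)
    delta1 = np.asarray(delta1, dtype=float)
    return np.clip(delta1, x - c, x + c)

# Example: clip the Stein estimates of Section 2 to within 0.68 of each X_i.
x = np.array([-1.35, -4.70, -5.10])       # Clemente, Munson, Alvis
d1 = -3.275 + 0.209 * (x + 3.275)         # rule (2.5)
print(limited_translation(x, d1, 0.68))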

3. A GENERALIZATION OF STEIN'S ESTIMATOR TO UNEQUAL VARIANCES FOR ESTIMATING THE PREVALENCE OF TOXOPLASMOSIS

One of the authors participated in a study of toxoplasmosis in El Salvador [14]. Sera obtained from a total sample of 5,171 individuals of varying ages from 36 El Salvador cities were analyzed by a Sabin-Feldman dye test. From the data given in [14, Table 1], toxoplasmosis prevalence rates X_i for City i, i = 1, ..., 36 were calculated. The prevalence rate X_i has the form (observed minus expected)/expected, with "observed" being the number of positives for City i and "expected" the number of positives for the same city based on an indirect standardization of prevalence rates to the age distribution of City i. The variances D_i ≡ Var(X_i) are known from binomial considerations and differ because of unequal sample sizes.

These data X_i, together with the standard deviations D_i^{1/2}, are given in Columns 2 and 3 of Table 3. The prevalence rates satisfy a linear constraint Σ d_i X_i = 0 with known coefficients d_i > 0.

3. Estimates and Empirical Bayes Estimates of Toxoplasmosis Prevalence Rates

  i     X_i    D_i^{1/2}   θ̂_i(X)     Â_i      k_i      B̂_i(S)
  1    .293     .304        .035     .0120   1334.1     .882
  2    .214     .039        .192     .0108     21.9     .102
  3    .185     .047        .159     .0109     24.4     .143
  4    .152     .115        .075     .0115     80.2     .509
  5    .139     .081        .092     .0112     43.0     .336
  6    .128     .061        .100     .0110     30.4     .221
  7    .113     .061        .088     .0110     30.4     .221
  8    .098     .087        .062     .0113     48.0     .370
  9    .093     .049        .079     .0109     25.1     .154
 10    .079     .041        .070     .0109     22.5     .112
 11    .063     .071        .045     .0111     36.0     .279
 12    .052     .048        .044     .0109     24.8     .148
 13    .035     .056        .028     .0110     28.0     .192
 14    .027     .040        .024     .0108     22.2     .107
 15    .024     .049        .020     .0109     25.1     .154
 16    .024     .039        .022     .0108     21.9     .102
 17    .014     .043        .012     .0109     23.1     .122
 18    .004     .085        .003     .0112     46.2     .359
 19   -.016     .128       -.007     .0116    101.5     .564
 20   -.028     .091       -.017     .0113     51.6     .392
 21   -.034     .073       -.024     .0111     37.3     .291
 22   -.040     .049       -.034     .0109     25.1     .154
 23   -.055     .058       -.044     .0110     28.9     .204
 24   -.083     .070       -.060     .0111     35.4     .273
 25   -.098     .068       -.072     .0111     34.2     .262
 26   -.100     .049       -.085     .0109     25.1     .154
 27   -.112     .059       -.089     .0110     29.4     .210
 28   -.138     .063       -.106     .0110     31.4     .233
 29   -.156     .077       -.107     .0112     40.0     .314
 30   -.169     .073       -.120     .0111     37.3     .291
 31   -.241     .106       -.128     .0114     68.0     .468
 32   -.294     .179       -.083     .0118    242.4     .719
 33   -.296     .064       -.225     .0111     31.9     .238
 34   -.324     .152       -.114     .0117    154.8     .647
 35   -.397     .158       -.133     .0117    171.5     .665
 36   -.665     .216       -.140     .0119    426.8     .789

The means θ_i ≡ EX_i, which also satisfy Σ d_i θ_i = 0, are to be estimated from the {X_i}. Since the {X_i} were constructed as sums of independent random variables, they are approximately normal; and except for the one linear constraint on the k = 36 values of X_i, they are independent. For simplicity, we will ignore the slight improvement in the independence approximation that would result from applying our methods to an appropriate 35-dimensional subspace and assume that the {X_i} have the distribution of the following paragraph.

To obtain an appropriate empirical Bayes estimation rule for these data we assume that

  X_i | θ_i ~ind N(θ_i, D_i),  i = 1, ..., k = 36,   (3.1)

and

  θ_i ~ind N(0, A),   (3.2)

A being an unknown constant. These assumptions are the same as (1.1), (1.7), which lead to the James-Stein estimator if D_i = D_j for all i, j. Notice that the choice of a priori mean zero for the θ_i is particularly appropriate here because the constraint Σ d_i θ_i = 0 forces the parameters to be centered near the origin.

We require k ≥ 3 in the following derivations. Define

  B_i ≡ D_i/(A + D_i).   (3.3)

Then (3.1) and (3.2) are equivalent to

  X_i ~ind N(0, A + D_i)  and  θ_i | X_i ~ind N((1 − B_i)X_i, D_i(1 − B_i)).   (3.4)

For squared error loss⁴ the Bayes estimator is the a posteriori mean

  δ_i*(X) = (1 − B_i)X_i,   (3.5)

with Bayes risk Var(θ_i | X_i) = (1 − B_i)D_i being less than the risk D_i of δ_i^0 ≡ X_i.

Here, A is unknown, but the MLE Â of A on the basis of the data S_j ≡ X_j² ~ (A + D_j)χ²_1, j = 1, 2, ..., k, is the solution to

  Σ_j I_j(Â)[S_j − (Â + D_j)] = 0,   (3.6)

with

  I_j(A) ≡ 1/[2(A + D_j)²]   (3.7)

being the Fisher information for A in S_j. We could use Â from (3.6) to define the empirical Bayes estimator of θ_i as (1 − D_i/(Â + D_i))X_i. However, this rule does not reduce to Stein's when all D_j are equal, and we instead use a minor variant of this estimator derived in [8] which does reduce to Stein's. The variant rule estimates a different value Â_i for each city (see Table 3). The difference between the rules is minor in this case, but it might be important if k were smaller.

Our estimates θ̂_i(X) of the θ_i are given in the fourth column of Table 3 and are compared with the unbiased estimate X_i in Figure A.

⁴ Or for any other increasing function of |θ̂_i − θ_i|.


Figure A illustrates the "pull-in" effect of θ̂_i(X), which is most pronounced for Cities 1, 32, 34, 35, and 36. Under the empirical Bayes model, the major explanation for the large |X_i| for these cities is large D_i rather than large |θ_i|. This figure also shows that the rankings of the cities on the basis of θ̂_i(X) differ from those based on the X_i, an interesting feature that does not arise when the X_i have equal variances.

A. Estimates of Toxoplasmosis Prevalence Rates

The values Â_i, k_i, and B̂_i(S) defined in [8] are given in the last three columns of Table 3. The value Â of (3.6) is Â = 0.0122 with standard deviation σ(Â) estimated as 0.0041 (if A = 0.0122) by the Cramer-Rao lower bound on σ(Â). The preferred estimates Â_i are all close to but slightly smaller than Â, and their estimated standard deviations vary from 0.00358 for the cities with the smallest D_i to 0.00404 for the city with the largest D_i.

The likelihood function of the data plotted as a function of A (on a log scale) is given in Figures B and C as LIKELIHOOD. The curves are normalized to have unit area as a function of α ≡ log A. The maximum value of this function of α is at α̂ = log(Â) = log(.0122) = −4.40. The curves are almost perfectly normal with mean −4.40 and standard deviation σ_α ≈ .371. The likely values of A therefore correspond to α differing from α̂ by no more than three standard deviations, |α − α̂| ≤ 3σ_α, or equivalently, .0040 ≤ A ≤ .0372.
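Because the marginal distribution is simply X_i ~ N(0, A + D_i), the likelihood curve on the α = log A scale is easy to tabulate; a small sketch of that calculation follows (our own code; with only a few cities typed in, the curve will not reproduce the α̂ = −4.40 and σ_α ≈ .371 quoted above for all 36 cities).

import numpy as np

def log_likelihood_of_A(a_grid, x, d):
    """Marginal log likelihood of A when X_i ~ N(0, A + D_i) independently."""
    v = np.asarray(a_grid)[:, None] + np.asarray(d)[None, :]
    return np.sum(-0.5 * np.log(v) - (np.asarray(x)[None, :] ** 2) / (2.0 * v), axis=1)

alpha = np.linspace(-7.0, -2.0, 400)              # grid for alpha = log A
x = np.array([.293, .214, .185])                  # first cities only, for illustration
d = np.array([.304, .039, .047]) ** 2
ll = log_likelihood_of_A(np.exp(alpha), x, d)
density = np.exp(ll - ll.max())
density /= np.trapz(density, alpha)               # unit area as a function of alpha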

B. Likelihood Function of A and Aggregate Operating Characteristics of Estimates as a Function of A, Conditional on Observed Toxoplasmosis Data

In the region of likely values of A, Figure B also graphs two risks: BAYES RISK and EB RISK (for empirical Bayes risk), each conditional on the data X. EB RISK⁵ is the conditional risk of the empirical Bayes rule, defined (with D̄ ≡ (1/k) Σ D_i) as

  EB RISK(A) ≡ (1/(kD̄)) Σ_i E_A[(θ̂_i(X) − θ_i)² | X],   (3.8)

and BAYES RISK is the corresponding quantity for the Bayes rule (3.5),

  BAYES RISK(A) ≡ (1/(kD̄)) Σ_i (1 − B_i)D_i.   (3.9)

Since A is not known, BAYES RISK yields only a lower envelope for empirical Bayes estimators, agreeing with EB RISK at A = .0122. Table 4 gives values to supplement Figure B. Not graphed because it is too large to fit in Figure B is MLE RISK, the conditional risk of the MLE, defined as

  MLE RISK(A) ≡ (1/(kD̄)) Σ_i E_A[(X_i − θ_i)² | X].   (3.10)

MLE RISK exceeds EB RISK by factors varying from 7 to 2 in the region of likely values of A, as shown in Table 4. EB RISK tends to increase and MLE RISK to decrease as A increases, these values crossing at A = .0650, about 4 standard deviations above the mean of the distribution of A.

4. Conditional Risks for Different Values of A

                            A
Risk            .0040   .0122   .0372   .0650     ∞
EB RISK          .35     .39     .76     1.08    2.50
MLE RISK        2.51    1.87    1.27     1.08    1.00
P(EB CLOSER)    1.00    1.00     .82      .50     .04

The remaining curve in Figure B graphs the probability that the empirical Bayes estimator is closer to θ than the MLE X, conditional on the data X. It is defined as

  P(EB CLOSER)(A) ≡ P_A{ Σ_i (θ̂_i(X) − θ_i)² < Σ_i (X_i − θ_i)² | X }.   (3.11)

This curve, denoted P(EB CLOSER), decreases as A increases but is always very close to unity in the region of likely values of A. It reaches one-half at about 4 standard deviations from the mean of the likelihood function and then decreases as A → ∞ to its asymptotic value .04 (see Table 4).

The data suggest that almost certainly A is in the interval .004 ≤ A ≤ .037, and for all such values of A, Figure B and Table 4 indicate that the numbers θ̂_i(X) are much better estimators of the θ_i than are the X_i. Non-Bayesian versions of these statements may be based on a confidence interval for A.

Figure A illustrates that the MLE and the empirical Bayes estimators order the {θ_i} differently.

⁵ In (3.8) the θ̂_i(X) are fixed numbers—those given in Table 3. The expectation is over the a posteriori distribution (3.4) of the θ_i.


Define the correlation of an estimator θ̂ of θ by

  r(θ̂, θ) ≡ the sample correlation between the components of θ̂ and those of θ,   (3.12)

as a measure of how well θ̂ orders θ. We denote by P(r_EB > r_MLE) the probability that the empirical Bayes estimate θ̂ orders θ better than X, i.e.,

  P(r_EB > r_MLE) ≡ P_A{ r(θ̂(X), θ) > r(X, θ) | X }.   (3.13)

The graph of (3.13) given in Figure C shows that P(r_EB > r_MLE) > .5 for A ≤ .0372. The value at A = ∞ drops to .046.

C. Likelihood Function of A and Individual and Ordering Characteristics of Estimates as a Function of A, Conditional on Observed Toxoplasmosis Data

Although X_1 > X_2, the empirical Bayes estimator for City 2 is larger, θ̂_2(X) > θ̂_1(X). This is because D_1 ≫ D_2, indicating that X_1 is large under the empirical Bayes model because of randomness while X_2 is large because θ_2 is large. The other curve in Figure C is

  P_A{ θ_2 > θ_1 | X }   (3.14)

and shows that θ_2 > θ_1 is quite probable for likely values of A. This probability declines as A → ∞, being .50 at A = .24 (eight standard deviations above the mean) and .40 at A = ∞.

4. USING STEIN'S ESTIMATOR TO IMPROVE THE RESULTS OF A COMPUTER SIMULATION

A Monte Carlo experiment is given here in which several forms of Stein's method all double the experimental precision of the classical estimator. The example is realistic in that the normality and variance assumptions are approximations to the true situation.

We chose to investigate Pearson's chi-square statistic for its independent interest and selected the particular parameters (m ≤ 24) from our prior belief that empirical Bayes methods would be effective for these situations. Although our beliefs were substantiated, the outcomes in this instance did not always favor our pet methods.

The simulation was conducted to estimate the exact size of Pearson's chi-square test. Let Y_1 and Y_2 be independent binomial random variables, Y_1 ~ bin(m, p′), Y_2 ~ bin(m, p″), so EY_1 = mp′, EY_2 = mp″. Pearson advocated the statistic and critical region

  T ≡ 2m(Y_1 − Y_2)²/[(Y_1 + Y_2)(2m − Y_1 − Y_2)] ≥ 3.84   (4.1)

to test the composite null hypothesis H_0: p′ = p″ against all alternatives for the nominal size α = 0.05. The value 3.84 is the 95th percentile of the chi-square distribution with one degree of freedom, which approximates that of T when m is large.

The true size of the test under H_0 is defined as

  α(p, m) ≡ P{T ≥ 3.84 | p′ = p″ = p},   (4.2)

which depends on both m and the unknown value p ≡ p′ = p″. The simulation was conducted for p = 0.5 and the k = 17 values of m, with m_j = 7 + j, j = 1, ..., k. The k values α_j ≡ α(0.5, m_j) were to be estimated. For each j we simulated (4.1) n = 500 times on a computer and recorded Z_j as the proportion of times H_0 was rejected. The data appear in Table 5. Since nZ_j ~ bin(n, α_j) independently, Z_j is the unbiased and maximum likelihood estimator usually chosen⁶ to estimate α_j.
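The Monte Carlo step itself is straightforward to sketch. The code below is our own, not the original simulation program; with n = 500 replications each estimate Z_j of α_j carries a standard error near .01.

import numpy as np

def reject(y1, y2, m):
    """Pearson chi-square test (4.1) at nominal size .05; True where H0 is rejected."""
    tot = y1 + y2
    valid = (tot > 0) & (tot < 2 * m)      # statistic undefined at the boundary
    t = np.zeros_like(y1, dtype=float)
    t[valid] = 2 * m * (y1[valid] - y2[valid]) ** 2 / (tot[valid] * (2 * m - tot[valid]))
    return t >= 3.84

def simulated_size(m, p=0.5, n=500, seed=0):
    """Proportion Z of n simulated tests that reject when p' = p'' = p."""
    rng = np.random.default_rng(seed)
    y1 = rng.binomial(m, p, size=n)
    y2 = rng.binomial(m, p, size=n)
    return reject(y1, y2, m).mean()

print([simulated_size(m) for m in range(8, 25)])   # Z_1, ..., Z_17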

5. Maximum Likelihood Estimates and True Values for p = 0.5

  j    m_j    MLE Z_j    True value α_j
  1     8      .082         .07681
  2     9      .042         .05011
  3    10      .046         .04219
  4    11      .040         .05279
  5    12      .054         .06403
  6    13      .084         .07556
  7    14      .036         .04102
  8    15      .036         .04559
  9    16      .040         .05151
 10    17      .050         .05766
 11    18      .078         .06527
 12    19      .030         .05306
 13    20      .036         .04253
 14    21      .060         .04588
 15    22      .052         .04896
 16    23      .046         .05417
 17    24      .054         .05950

Under H_0 the standard deviation of Z_j is approximately σ ≡ {(.05)(.95)/500}^{1/2} = .009747. The variables X_j ≡ (Z_j − .05)/σ have expectations θ_j ≡ (α_j − .05)/σ

⁶ We ignore an extensive bibliography of other methods for improving computer simulations. Empirical Bayes methods can be applied simultaneously with other methods, and if better estimates of α_j than Z_j were available then the empirical Bayes methods could instead be applied to them. But for simplicity we take Z_j itself as the quantity to be improved.


and approximately the distribution

  X_j | θ_j ~ind N(θ_j, 1),  j = 1, ..., 17,   (4.3)

described in earlier sections. The average value Z̄ = .051 of the 17 points supports the choice of the "natural origin" α_0 = .05. Stein's rule (1.4) applied to the transformed data (4.3) and then retransformed according to α̂_j = .05 + σθ̂_j yields

  α̂_j^1 = .05 + (1 − B̂)(Z_j − .05),   (4.4)

where B̂ ≡ (k − 2)/S and S ≡ Σ_j X_j² = Σ_j (Z_j − .05)²/σ².

All 17 true values α_j were obtained exactly through a separate computer program and appear in Figure D and Table 5, so the loss function, taken to be the normalized sum of squared errors Σ_j (α̂_j − α_j)²/σ², can be evaluated.⁷ The MLE has loss 18.9, Stein's estimate (4.4) has loss 10.2, and the constant estimator, which always estimates α_j as .05, has loss 23.4. Stein's rule therefore dominates both extremes between which it compromises.
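Rule (4.4) and the three losses just quoted can be computed directly from Table 5; the following sketch (our own Python, with the Z_j and α_j typed in from the table) shows the calculation.

import numpy as np

z = np.array([.082, .042, .046, .040, .054, .084, .036, .036, .040,
              .050, .078, .030, .036, .060, .052, .046, .054])
alpha = np.array([.07681, .05011, .04219, .05279, .06403, .07556, .04102, .04559,
                  .05151, .05766, .06527, .05306, .04253, .04588, .04896, .05417, .05950])

sigma = np.sqrt(.05 * .95 / 500)           # approximate sd of each Z_j under H0
x = (z - .05) / sigma                      # transformed data, roughly N(theta_j, 1)
k = len(z)
b_hat = (k - 2) / np.sum(x ** 2)           # shrinkage factor of rule (4.4)
stein = .05 + (1 - b_hat) * (z - .05)

def loss(est):
    """Normalized sum of squared errors against the exact alpha_j."""
    return np.sum((est - alpha) ** 2) / sigma ** 2

print(loss(z), loss(stein), loss(np.full(k, .05)))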

Figure D displays the maximum likelihood estimates, Stein estimates, and true values. The true values show a surprising periodicity, which would frustrate attempts at improving the MLE by smoothing.

D. MLE, Stein Estimates, and True Values for p = 0.5

On theoretical grounds we know that the approximation α(p, m) ≈ .05 improves as m increases, which suggests dividing the data into two groups, say 8 ≤ m ≤ 16 and 17 ≤ m ≤ 24. In the Bayesian framework [9] this disaggregation reflects the concern that A_1, the expectation of A_1* ≡ Σ_{j=1}^{9} (α_j − .05)²/(9σ²), may be much larger than A_2, the expectation of A_2* ≡ Σ_{j=10}^{17} (α_j − .05)²/(8σ²), or equivalently that the pull-in factor B_1 = 1/(1 + A_1) for Group 1 really should be smaller than B_2 = 1/(1 + A_2) for Group 2.

⁷ Exact rejection probabilities for other values of p are given in [12].

The combined estimator (4.4), having B̂_1 = B̂_2, is repeated in the second row of Table 6 with loss components for each group. The simplest way to utilize separate estimates of B_1 and B_2 is to apply two separate Stein rules, as shown in the third row of the table.

6. Values of B and Losses for Data Separated into Two Groups, Various Estimation Rules

                                      8 ≤ m ≤ 16           17 ≤ m ≤ 24
Rule                                 B̂_1   Group 1 loss    B̂_2   Group 2 loss   Total loss
Maximum likelihood estimator        .000        7.3        .000       11.6          18.9
Stein's rule, combined data         .325        4.2        .325        6.0          10.2
Separate Stein rules                .232        4.5        .376        5.4           9.9
Separate Stein rules,
  bigger constant                   .276        4.3        .460        4.6           8.9
All estimates at .05               1.000       18.3       1.000        5.1          23.4

In [8, Sec. 5] we suggest using the bolder estimate

  B̂_i = (k_i − .66)/S_i,  with S_1 ≡ Σ_{j=1}^{9} X_j²,  S_2 ≡ S − S_1,  k_1 = 9,  k_2 = 8.

The constant k_i − .66 is preferred because it accounts for the fact that the positive part (1.12) will be used, whereas the usual choice k_i − 2 does not. The fourth row of Table 6 shows the effectiveness of this choice.

The estimate of .05, which is nearly the mean of the 17 values, is included in the last row of the table to show that the Stein rules substantially improve the two extremes between which they compromise.

The actual values are

  A_1* = 2.04 for Group 1 and A_2* = .63 for Group 2,

so B_1* = 1/(1 + A_1*) = .329 and B_2* = 1/(1 + A_2*) = .612. The true values of B_1* and B_2* are somewhat different, as the estimates for the separate Stein rules suggest. Rules with B_1 and B_2 near these true values will ordinarily perform better for data simulated from these parameters, p = 0.5, m = 8, ..., 24.

5. CONCLUSIONS

In the baseball, toxoplasmosis, and computer simulation examples, Stein's estimator and its generalizations increased efficiencies relative to the MLE by about 350 percent, 200 percent, and 100 percent.


These examples were chosen because we expected empirical Bayes methods to work well for them and because their efficiencies could be determined. But we are aware of other successful applications to real data⁸ and have suppressed no negative results. Although blind application of these methods would gain little in most instances, the statistician who uses them sensibly and selectively can expect major improvements.

Even when they do not significantly increase efficiency, there is little penalty for using the rules discussed here because they cannot give larger total mean squared error than the MLE and because the limited translation modification protects individual components. As several authors have noted, these rules are also robust to the assumption of the normal distribution, because their operating characteristics depend primarily on the means and variances of the sampling distributions and of the unknown parameters. Nor is the sum of squared error criterion especially important. This robustness is borne out by the experience in this article since the sampling distributions were actually binomial rather than normal. The rules not only worked well in the aggregate here, but for most components the empirical Bayes estimators ranged from slightly to substantially better than the MLE, with no substantial errors in the other direction.

Tukey's comment, that empirical Bayes benefits are unappreciable (Section 1), actually was directed at a method of D.V. Lindley. Lindley's rules, though more formally Bayesian, are similar to ours in that they are designed to pick up the same intercomponent information in possibly related estimation problems. We have not done justice here to the many other contributors to multiparameter estimation, but refer the reader to the lengthy bibliography in [12]. We have instead concentrated on Stein's rule and its generalizations, to illustrate the power of the empirical Bayes theory, because the main gains are derived by recognizing the applicability of the theory, with lesser benefit attributable to the particular method used. Nevertheless, we hope other authors will compare their methods with ours on these or other data.

The rules of this article are neither Bayes nor admissible, so they can be uniformly beaten (but not by much; see [8, Sec. 6]). There are several published admissible, minimax rules which also would do well on the baseball data, although probably not much better than the rule used there, for none yet given is known to dominate Stein's rule with the positive part modification. For applications, we recommend the combination of simplicity, generalizability, efficiency, and robustness found in the estimators presented here.

The most favorable situation for these estimators occurs when the statistician wants to estimate the parameters of a linear model that are known to lie in a high dimensional parameter space H_1, but he suspects that they may lie close to a specified lower dimensional parameter space H_0 ⊂ H_1.⁹ Then estimates unbiased for every parameter vector in H_1 may have large variance, while estimates restricted to H_0 have smaller variance but possibly large bias. The statistician need not choose between these extremes but can instead view them as endpoints on a continuum and use the data to determine the compromise (usually a smooth function of the likelihood ratio statistic for testing H_0 versus H_1) between bias and variance through an appropriate empirical Bayes rule, perhaps Stein's or one of the generalizations presented here.

⁸ See, e.g., [3] for estimating fire alarm probabilities and [4] for estimating reaction times and sunspot data.

⁹ One excellent example [17] takes H_0 as the main effects in a two-way analysis of variance and H_1 − H_0 as the interactions.

We believe many applications embody these features and that most data analysts will have good experiences with the sensible use of the rules discussed here. In view of their potential, we believe empirical Bayes methods are among the most underutilized in applied data analysis.

[Received October 1973. Revised February 1975.]

REFERENCES

[1] Anscombe, F., "The Transformation of Poisson, Binomial and Negative-Binomial Data," Biometrika, 35 (December 1948), 246-54.

[2] Baranchik, A.J., "Multiple Regression and Estimation of the Mean of a Multivariate Normal Distribution," Technical Report No. 51, Stanford University, Department of Statistics, 1964.

[3] Carter, G.M. and Rolph, J.E., "Empirical Bayes Methods Applied to Estimating Fire Alarm Probabilities," Journal of the American Statistical Association, 69, No. 348 (December 1974), 880-5.

[4] Efron, B., "Biased Versus Unbiased Estimation," Advances in Mathematics, New York: Academic Press (to appear 1975).

[5] ——— and Morris, C., "Limiting the Risk of Bayes and Empirical Bayes Estimators—Part I: The Bayes Case," Journal of the American Statistical Association, 66, No. 336 (December 1971), 807-15.

[6] ——— and Morris, C., "Limiting the Risk of Bayes and Empirical Bayes Estimators—Part II: The Empirical Bayes Case," Journal of the American Statistical Association, 67, No. 337 (March 1972), 130-9.

[7] ——— and Morris, C., "Empirical Bayes on Vector Observations—An Extension of Stein's Method," Biometrika, 59, No. 2 (August 1972), 335-47.

[8] ——— and Morris, C., "Stein's Estimation Rule and Its Competitors—An Empirical Bayes Approach," Journal of the American Statistical Association, 68, No. 341 (March 1973), 117-30.

[9] ——— and Morris, C., "Combining Possibly Related Estimation Problems," Journal of the Royal Statistical Society, Ser. B, 35, No. 3 (November 1973; with discussion), 379-421.

[10] ——— and Morris, C., "Families of Minimax Estimators of the Mean of a Multivariate Normal Distribution," P-5170, The RAND Corporation, March 1974, submitted to Annals of Mathematical Statistics (1974).

[11] ——— and Morris, C., "Estimating Several Parameters Simultaneously," to be published in Statistica Neerlandica.

[12] ——— and Morris, C., "Data Analysis Using Stein's Estimator and Its Generalizations," R-1394-OEO, The RAND Corporation, March 1974.

[13] James, W. and Stein, C., "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press, 1961, 361-79.

[14] Remington, J.S., et al., "Studies on Toxoplasmosis in El Salvador: Prevalence and Incidence of Toxoplasmosis as Measured by the Sabin-Feldman Dye Test," Transactions of the Royal Society of Tropical Medicine and Hygiene, 64, No. 2 (1970), 252-67.

[15] Stein, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press, 1955, 197-206.

[16] ———, "Confidence Sets for the Mean of a Multivariate Normal Distribution," Journal of the Royal Statistical Society, Ser. B, 24, No. 2 (1962), 265-96.

[17] ———, "An Approach to the Recovery of Inter-Block Information in Balanced Incomplete Block Designs," in F.N. David, ed., Festschrift for J. Neyman, New York: John Wiley & Sons, Inc., 1966, 351-66.


Chapter 30

Assigning Probabilities to the Outcomes of Multi-Entry Competitions

DAVID A. HARVILLE*

The problem discussed is one of assessing the probabilities of the various possible orders of finish of a horse race or, more generally, of assigning probabilities to the various possible outcomes of any multi-entry competition. An assumption is introduced that makes it possible to obtain the probability associated with any complete outcome in terms of only the 'win' probabilities. The results were applied to data from 335 thoroughbred horse races, where the win probabilities were taken to be those determined by the public through pari-mutuel betting.

1. INTRODUCTION

A horse player wishes to make a bet on a given horse race at a track having pari-mutuel betting. He has determined each horse's 'probability' of winning. He can bet any one of the entries to win, place (first or second), or show (first, second, or third). His payoff on a successful place or show bet depends on which of the other horses also place or show. Our horse player wishes to make a single bet that maximizes his expected return. He finds that not only does he need to know each horse's probability of winning, but that, for every pair of horses, he must also know the probability that both will place, and, for every three, he must know the probability that all three will show. Our better is unhappy. He feels that he has done a good job of determining the horses' probabilities of winning; however he must now assign probabilities to a much larger number of events. Moreover, he finds that the place and show probabilities are more difficult to assess. Our better looks for an escape from his dilemma. He feels that the probability of two given horses both placing or of three given horses all showing should be related to their probabilities of winning. He asks his friend, the statistician, to produce a formula giving the place and show probabilities in terms of the win probabilities.

The problem posed by the better is typical of a class of problems that share the following characteristics:

1. The members of some group are to be ranked in order from first possibly to last, according to the outcome of some random phenomena, or the ranking of the members has already been effected, but is unobservable.

2. The 'probability' of each member's ranking first is known or can be assessed.

3. From these probabilities alone, we wish to determine the probability that a more complete ranking of the members will equal a given ranking or the probability that it will fall in a given collection of such rankings.

* David A. Harville is research mathematical statistician, Aerospace Research Laboratories, Wright-Patterson Air Force Base, Ohio 45433. The author wishes to thank the Theory and Methods editor, an associate editor and a referee for their useful suggestions.

Dead heats or ties will be assumed to have zero probability. For situations where this assumption is unrealistic, the probabilities of the various possible ties must be assessed separately.

We assign no particular interpretation to the 'probability' of a given ranking or collection of rankings. We assume only that the probabilities of these events satisfy the usual axioms. Their interpretation will differ with the setting.

Ordinarily, knowledge of the probabilities associated with the various rankings will be of most interest in situations like the horse player's where only the ranking itself, and not the closeness of the ranking, is important. The horse player's return on any bet is completely determined by the horses' order of finish. The closeness of the result may affect his nerves but not his pocketbook.

2. RESULTS

We will identify the n horses in the race or members in the group by the labels 1, 2, ..., n. Denote by p_k[i_1, i_2, ..., i_k] the probability that horses or members i_1, i_2, ..., i_k finish or rank first, second, ..., kth, respectively, where k ≤ n. For convenience, we use p[i] interchangeably with p_1[i] to represent the probability that horse or member i finishes or ranks first. We wish to obtain p_k[i_1, i_2, ..., i_k] in terms of p[1], p[2], ..., p[n], for all i_1, i_2, ..., i_k and for k = 2, 3, ..., n. In a sense, our task is one of expressing the probabilities of elementary events in terms of the probabilities of more complex events.

Obviously, we must make additional assumptions to obtain the desired formula. Our choice is to assume that, for all i_1, i_2, ..., i_k and for k = 2, 3, ..., n, the conditional probability that member i_k ranks ahead of members i_{k+1}, i_{k+2}, ..., i_n, given that members i_1, i_2, ..., i_{k−1} rank first, second, ..., (k−1)th, respectively, equals the conditional probability that i_k ranks ahead of i_{k+1}, i_{k+2}, ..., i_n given that i_1, i_2, ..., i_{k−1} do not rank first. That is,

  p_k[i_1, i_2, ..., i_k] / p_{k−1}[i_1, i_2, ..., i_{k−1}] = p[i_k] / q_{k−1}[i_1, i_2, ..., i_{k−1}],   (2.1)

where

  q_k[i_1, i_2, ..., i_k] ≡ 1 − p[i_1] − p[i_2] − ... − p[i_k],

so that, for the sought-after formula, we obtain

  p_k[i_1, i_2, ..., i_k] = p[i_1] · (p[i_2]/q_1[i_1]) · (p[i_3]/q_2[i_1, i_2]) ··· (p[i_k]/q_{k−1}[i_1, ..., i_{k−1}]).   (2.2)

© Journal of the American Statistical Association, June 1973, Volume 68, Number 342, Applications Section

In the particular case k = 2, the assumption (2.1) is equivalent to assuming that the event that member i_2 ranks ahead of all other members, save possibly i_1, is stochastically independent of the event that member i_1 ranks first.

The intuitive meaning and the reasonableness of the assumption (2.1) will depend on the setting. In particular, our horse player would probably not consider the assumption appropriate for every race he encounters. For example, in harness racing, if a horse breaks stride, the driver must take him to the outside portion of the track and keep him there until the horse regains the proper gait. Much ground can be lost in this maneuver. In evaluating a harness race in which there is a horse that is an 'almost certain' winner unless he breaks, the bettor would not want to base his calculations on assumption (2.1). For such a horse, there may be no such thing as an intermediate finish. He wins when he doesn't break, but finishes 'way back' when he does.

In many, though not all, cases, there is a variate (other than rank) associated with each member of the group such that the ranking is strictly determined by ordering their values. For example, associated with each horse is its running time for the race. Denote by X_i the variate corresponding to member i, i = 1, 2, ..., n. Clearly, the assumption (2.1) can be phrased in terms of the joint probability distribution of X_1, X_2, ..., X_n. It seems natural to ask whether there exist other conditions on the distribution of the X_i's which imply (2.1) or which follow from it, and which thus would aid our intuition in grasping the implications of that assumption. The answer in general seems to be no. In particular, it can easily be demonstrated by constructing a counterexample that stochastic independence of the X_i's does not in itself imply (2.1). Nor is the converse necessarily true. In fact, in many situations where assumption (2.1) might seem appropriate, it is known that the X_i's are not independent. For example, we would expect the running times of the horses to be correlated in most any horse race. An even better example is the ordering of n baseball teams according to their winning percentages over a season of play. These percentages are obviously not independent, yet assumption (2.1) might still seem reasonable.

The probability that the ranking belongs to any given collection of rankings can be readily obtained in terms of p[1], p[2], ..., p[n] by using (2.2) to express the probability of each ranking in the collection in terms of the p[i]'s, and by then adding. For example, the horse player can compute the probability that both entry i and entry j place from

p_2[i, j] + p_2[j, i].

A probability of particular interest in many situations is the probability that entry or member r finishes or ranks kth or better, for which we write

p_k*[r] = Σ p_k[i_1, i_2, ..., i_k],   (2.3)

where the summation is over all rankings i_1, i_2, ..., i_k for which i_u = r for some u. If assumption (2.1) holds, then p_k*[r] > p_k*[s] if and only if p[r] > p[s]. This statement can be proved easily by comparing the terms of the right side of (2.3) with the terms of the corresponding expression for p_k*[s]. Each term of (2.3), whose indices are such that i_u = r and i_v = s for some u, v, appears also in the second expression. Thus, it suffices to show that any term p_k[i_1, i_2, ..., i_k], for which i_j ≠ s, j = 1, 2, ..., k, but i_u = r for some u, is made smaller by putting i_u = s if and only if p[r] > p[s]. That the latter assertion is true follows immediately from (2.2).

3. APPLICATION

In pari-mutuel betting, the payoffs on win bets are determined by subtracting from the win pool (the total amount bet to win by all bettors on all horses) the combined state and track take (a fixed percentage of the pool, generally about 16 percent, but varying from state to state), and by then distributing the remainder among the successful bettors in proportion to the amounts of their bets. (Actually, the payoffs are slightly smaller because of 'breakage,' a gimmick whereby the return on each dollar is reduced to a point where it can be expressed in terms of dimes.) In this section, we take the 'win probability' on each of the n horses to be in inverse proportion to what a successful win bet would pay per dollar, so that every win bet has the same 'expected return.' Note that these 'probabilities' are established by the bettors themselves and, in some sense, represent a consensus opinion as to each horse's chances of winning the race. We shall suppose that, in any sequence of races in which the number of entries and the consensus probabilities are the same from race to race, the horses going off at a given consensus probability win with a long-run frequency equal to that probability. The basis for this supposition is that, once the betting on a race has begun, the amounts bet to win on the horses are flashed on the 'tote' board for all to see and this information is updated periodically, so that, if at some point during the course of the betting the current consensus probabilities do not coincide with the bettors' experience as to the long-run win frequencies for 'similar' races, these discrepancies will be noticed and certain of the bettors will place win bets that have the effect of reducing or eliminating them.

By adopting assumption (2.1) and applying the results of the previous section, we can compute the long-run frequencies with which any given order of finish is encountered over any sequence of races having the same number


1. APPLICATION OF THEORETICAL RESULTS TO THIRD RACE OF SEPTEMBER 6, 1971, AT RIVER DOWNS RACE TRACK

                   Amounts bet to win, place,      Theoretical probability      Expected payoff per dollar
                   and show as percentages
                   of totals
Name                Win    Place   Show            Win     Place    Show         Place bet    Show bet

Moonlander         27.6    20.0    22.3            .275    .504     .688           1.11         1.01
E'Thon             16.5    14.2    11.1            .165    .332     .499            .94         1.06
Golden Secret       3.5     4.7     6.3            .035    .076     .126            .58          .42
Antidote           17.3    18.8    20.0            .175    .350     .521            .80          .80
Beviambo            4.0     6.2     7.8            .040    .087     .144            .51          .41
Cedar Wing         11.9    10.4    10.4            .118    .215     .382            .90          .86
Little Flitter      8.5    11.2     9.9            .085    .180     .288            .62          .68
Hot and Humid      10.7    14.4    12.2            .107    .224     .353            .62          .72

of entries and the same consensus win probabilities. In particular, we can compute the 'probability' that any three given horses in a race finish first, second, and third, respectively. As we shall now see, these probabilities are of something more than academic interest, since they are the ones needed to compute the 'expected payoff' for each place bet (a bet that a particular horse will finish either first or second) and each show bet (a bet that the horse will finish no worse than third).

Like the amounts bet to win, the amounts bet on each horse to place and to show are made available on the 'tote' board as the betting proceeds. The payoff per dollar on a successful place (show) bet consists of the original dollar plus an amount determined by subtracting from the final place (show) pool the combined state and track take and the total amounts bet to place (show) on the first two (three) finishers, and by then dividing a half (third) of the remainder by the total amount bet to place (show) on the horse in question. (Here again, the actual payoffs are reduced by breakage.) By using the probabilities computed on the basis of assumption (2.1) and the assumption that consensus win probabilities equal appropriate long-run frequencies, we can compute the expected payoff per dollar for a given place or show bet on any particular race, where the expectation is taken over a sequence of races exhibiting the same number of entries and the same pattern of win, place, and show betting. If, as the termination of betting on a given race approaches, any of the place or show bets are found to have potential expected payoffs greater than one, there is a possibility that a bettor, by making such place and show bets, can 'beat the races'. Of course, if either assumption (2.1) or the assumption that the consensus win probabilities equal long-run win frequencies for races with similar betting patterns is inappropriate, then this system will not work. It will also fail if there tend to be large last-minute adverse changes in the betting pattern, either because of the system player's own bets or because of the bets of others. However, at a track with considerable betting volume, it is not likely that such changes would be so frequent as to constitute a major stumbling block.
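The place-bet calculation just described can be sketched in a few lines of Python (this is an illustrative reading of the rule, not code from the paper; it ignores breakage, and the take, pool, and bet amounts in the example are invented):

```python
from itertools import permutations

def expected_place_payoff(win_probs, place_bets, place_pool, horse, take=0.16):
    """Expected payoff per dollar of a place bet on 'horse' (breakage ignored)."""
    expected = 0.0
    for i, j in permutations(win_probs, 2):            # i first, j second
        if horse not in (i, j):
            continue                                   # a losing bet returns nothing
        # Probability of this 1-2 finish under assumption (2.1):
        prob = win_probs[i] * win_probs[j] / (1.0 - win_probs[i])
        # Pari-mutuel rule: the net place pool is split in half between the two place horses.
        net_pool = place_pool * (1.0 - take) - place_bets[i] - place_bets[j]
        payoff = 1.0 + (net_pool / 2.0) / place_bets[horse]
        expected += prob * payoff
    return expected

# Hypothetical win probabilities and place-pool dollar amounts:
win = {"A": 0.40, "B": 0.35, "C": 0.25}
bets = {"A": 4000.0, "B": 3500.0, "C": 2500.0}
print(expected_place_payoff(win, bets, place_pool=10000.0, horse="A"))
```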

In Table 1, we exemplify our results by applying them to a particular race, the third race of the September 6, 1971, program at River Downs Race Track. The final win, place, and show pools were $45,071, $16,037, and $9,740, respectively. The percentage of each betting pool bet on each horse can be obtained from the table. The table also gives, for each horse, the consensus win probability, the overall probabilities of placing and showing, and the expected payoffs per dollar of place and show bets. The race was won by E'Thon who, on a per-dollar basis, paid $5.00, $3.00, and $2.50 to win, place, and show, respectively; Cedar Wing was second, paying $3.80 and $2.70 per dollar to place and show; and Beviambo finished third, returning $3.20 for each dollar bet to show.

In order to check assumption (2.1) and the assumption that the consensus win probabilities coincide with the long-run win frequencies over any sequence of races having the same number of entries and a similar betting pattern, data were gathered on 335 thoroughbred races from several Ohio and Kentucky race tracks. Data from races with finishes that involved dead heats for one or more of the first three positions were not used. Also, in the pari-mutuel system, two or more horses are sometimes lumped together and treated as a single entity for betting purposes. Probabilities and expectations for the remaining horses were computed as though these 'field' entries consisted of single horses and were included in the data, though these figures are only approximations to the 'true' figures. However, the field entries themselves were not included in the tabulations.

As one check on the correspondence between consensus win probabilities and the long-run win frequencies over races with similar patterns of win betting, the horses were divided into eleven classes according to their consensus win probabilities. Table 2 gives, for each class, the associated interval of consensus win probabilities, the average consensus win probability, the actual frequency

2. FREQUENCY OF WINNING—ACTUAL VS. THEORETICAL

Theoretical       Number     Average        Actual        Estimated
probability         of      theoretical    frequency      standard
of winning        horses    probability    of winning       error

.00 - .05           946        .028           .020           .005
.05 - .10           763        .074           .064           .009
.10 - .15           463        .124           .127           .016
.15 - .20           313        .175           .169           .021
.20 - .25           192        .225           .240           .031
.25 - .30           114        .272           .289           .042
.30 - .35            71        .324           .394           .058
.35 - .40            49        .373           .306           .066
.40 - .45            25        .423           .610           .096
.45 - .50            12        .464           .583           .142
.50 +                10        .554           .700           .145


3. FREQUENCY OF FINISHING SECOND—ACTUAL VS. THEORETICAL

Theoretical         Number     Average        Actual frequency    Estimated
probability of        of      theoretical      of finishing       standard
finishing second    horses    probability         second            error

.00 - .05             776        .030             .046              .008
.05 - .10             750        .074             .095              .011
.10 - .15             548        .124             .128              .014
.15 - .20             426        .175             .155              .018
.20 - .25             283        .223             .170              .022
.25 - .30             164        .269             .226              .033
.30 +                  11        .311             .364              .145

of winners, and an estimate of the standard error associated with the actual frequency. The actual frequencies seem to agree remarkably well with the theoretical probabilities, though there seems to be a slight tendency on the part of the bettors to overrate the chances of long shots and to underestimate the chances of the favorites and near-favorites. Similar results, based on an extensive amount of data from an earlier time period and from different tracks, were obtained by Fabricand [1].

Several checks were also run on the appropriateness of assumption (2.1). These consisted of first partitioning the horses according to some criterion involving the theoretical probabilities of second and third place finishes and then comparing the actual frequency with the average theoretical long-run frequency for each class. Tables 3-6 give the results when the criterion is the probability of finishing second, finishing third, placing, or showing, respectively. In general, the observed frequencies of second and third place finishes are in reasonable accord with the theoretical long-run frequencies, though there seems to be something of a tendency to overestimate the chances of a second or third place finish for horses with high theoretical probabilities of such finishes and to underestimate the chances of those with low theoretical probabilities, with the tendency being more pronounced for third place finishes than for second place finishes. A logical explanation for the

4. FREQUENCY OF FINISHING THIRD—ACTUAL VS. THEORETICAL

Theoretical         Number     Average        Actual frequency    Estimated
probability of        of      theoretical      of finishing       standard
finishing third     horses    probability          third            error

.00 - .05             587        .032             .049              .009
.05 - .10             713        .074             .105              .011
.10 - .15             691        .124             .126              .013
.15 - .20             838        .175             .147              .012
.20 - .25             115        .212             .130              .031
.25 +                  14        .273             .214              .110

5. FREQUENCY OF PLACING—ACTUAL VS. THEORETICAL

Theoretical       Number     Average        Actual        Estimated
probability         of      theoretical    frequency      standard
of placing        horses    probability    of placing       error

.00 - .05           330        .034           .036           .010
.05 - .10           526        .074           .091           .013
.10 - .15           404        .125           .121           .016
.15 - .20           358        .174           .179           .020
.20 - .25           268        .224           .257           .027
.25 - .30           240        .274           .271           .029
.30 - .35           193        .326           .306           .033
.35 - .40           175        .375           .354           .036
.40 - .45           117        .425           .359           .044
.45 - .50           109        .472           .440           .048
.50 - .55            73        .525           .425           .058
.55 - .60            51        .578           .667           .066
.60 - .65            48        .623           .625           .070
.65 - .70            29        .673           .621           .090
.70 - .75            22        .724           .909           .095
.75 +                15        .808           .867           .088

conformity of the actual place results to those predicted by the theory which is evident in Table 5 is that those horses with high (low) theoretical probabilities of finishing second generally also have high (low) theoretical

6. FREQUENCY OF SHOWING—ACTUAL VS. THEORETICAL

Theoretical       Number     Average        Actual        Estimated
probability         of      theoretical    frequency      standard
of showing        horses    probability    of showing       error

.00 - .05           111        .038           .045           .020
.05 - .10           316        .075           .092           .016
.10 - .15           328        .124           .180           .021
.15 - .20           266        .174           .222           .025
.20 - .25           253        .227           .257           .027
.25 - .30           243        .274           .284           .029
.30 - .35           201        .326           .303           .032
.35 - .40           196        .374           .439           .035
.40 - .45           169        .425           .426           .038
.45 - .50           150        .477           .460           .041
.50 - .55           158        .525           .468           .040
.55 - .60           137        .574           .474           .043
.60 - .65            97        .625           .577           .050
.65 - .70           100        .672           .500           .050
.70 - .75            67        .722           .627           .059
.75 - .80            67        .777           .731           .054
.80 - .85            49        .823           .816           .055
.85 - .90            30        .874           .867           .062
.90 +                20        .930          1.000           .056


7. PAYOFFS ON PLACE AND SHOW BETS—ACTUAL VS. THEORETICAL

                      Number of       Average        Average
Expected payoff       different       expected       actual        Estimated
  per dollar          place and       payoff         payoff        standard
                      show bets      per dollar     per dollar       error

 .00 -  .25               80            .216           .088           .062
 .25 -  .35              214            .303           .286           .068
 .35 -  .45              386            .404           .609           .091
 .45 -  .55              628            .504           .570           .071
 .55 -  .65              904            .601           .730           .072
 .65 -  .75              980            .700           .660           .047
 .75 -  .85              958            .800           .947           .066
 .85 -  .95              819            .898           .938           .050
 .95 - 1.05              546            .995           .983           .090
1.05 - 1.15              286           1.090           .989           .060
1.15 - 1.25               90           1.186           .974           .108
1.25 +                    25           1.320          1.300           .258

probabilities of finishing first, so that the effects of the overestimation (underestimation) of their chances of finishing second are cancelled out by the underestimation (overestimation) of their chances of finishing first. While a similar phenomenon is operative in the show results, the cancellation is less complete and there seems to be a slight tendency to overestimate the show chances of those horses with high theoretical probabilities and to underestimate the chances of those with low theoretical probabilities.

Finally, the possible place and show bets were divided into classes according to the theoretical expected payoffs of the bets as determined from the final betting figures. The average actual payoff per dollar for each class can then be compared with the corresponding average expected payoff per dollar. The necessary figures are given in Table 7. The results seem to indicate that those place and show bets with high theoretical expected payoffs per dollar actually have expectations that are somewhat lower, giving further evidence that our assumptions are not entirely realistic, at least not for some races.

The existence of widely different expected payoffs for the various possible place and show bets implies that either the bettors 'do not feel that assumption (2.1) is entirely appropriate' or they 'believe in assumption (2.1)' but are unable to perceive its implications. Our results indicate that to some small extent the bettors are successful in recognizing situations where assumption (2.1) may not hold and in acting accordingly, but that big differences in the expected place and show payoffs result primarily from 'incorrect assessments' as to when assumption (2.1) is not appropriate or from 'ignorance as to the assumption's implications.'

A further implication of the results presented in Table 7 is that a bettor could not expect to do much better than break even by simply making place and show bets with expected payoffs greater than one.

[Received January 1972. Revised September 1972.]

REFERENCE

[1] Fabricand, Burton P., Horse Sense, New York: David McKay Company, Inc., 1965.


". . . and thereby return our gameto the pure world of numbers, whereit belongs. "—Roger Angell

Chapter 31

Basketball, Baseball, and the Null Hypothesis

Robert Hooke

Tversky and Gilovich (Chance, Winter 1989) and Gould (Chance, Spring 1989) write persuasively on the nonexistence of hot and cold streaks in basketball and baseball. As a statistician, I find no fault with their methods, but as a sometime competitor (at very low levels) in various sports and games I feel uncomfortable with their conclusions. Gould speaks of "a little homunculus in my head [who] continues to jump up and down shouting at me" that his intuitive feeling is right regardless of the mathematics. I, too, have such a homunculus, who has goaded me into raising questions about the conclusions of these articles and the use of the null hypothesis in general.

Every statistician knows that people (even statisticians) tend to see patterns in data that are actually only random fluctuations. However, in almost every competitive activity in which I've ever engaged (baseball, basketball, golf, tennis, even duplicate bridge), a little success generates in me a feeling of confidence which, as long as it lasts, makes me do better than usual. Even more obviously, a few failures can destroy this confidence, after which for a while I can't do anything right. If any solid evidence of such experiences can be found, it seemingly must be found outside of the statistical arguments of the aforementioned papers, because there are no apparent holes in these arguments. If the mathematics is all right and the conclusions still seem questionable, the place to look is at the model, which is the connection between the mathematics and reality.

If the model "explains" the data, then the model is correct and unique.

(True or False?)

Everybody knows this is false. (Well, almost everybody.) Tversky and Gilovich seem to know this, because they show that their data do not confirm the existence of the so-called hot hand, and the casual reader might conclude that they have shown its nonexistence, but they don't actually say so. Gould, though, does say about baseball: "Everybody knows about hot hands. The only problem is that no such phenomenon exists."

Statisticians are trained to speak precisely, and usually they remember to do so, being careful to say, perhaps, "The normal distribution is not contradicted by the data at the 5% level, so we may safely proceed as if the normal distribution actually holds." Careful speech may become tiresome, though, and some people are even offended by it. Years of trying to placate such customers sometimes drives statisticians to make statements such as, "The data show that the distribution is normal," hoping that this rash conclusion will not reach the ears of any colleague.

The two Chance articles use the standard approach to the problem. First, they look at what would happen if only chance were involved, see what observed data would look like under this assumption, and then compare the result with real data to see if there are major differences. If only chance is involved, the mathematical model for the real situation is the usual coin tossing or "Bernoulli trials" model, in which each event has a probability of success that is constant and independent of previous events. Using


this model and observing no significant differences, they conclude that there is no evidence to dispute the null hypothesis that only chance is operating. While we do know very well what happens if only chance is involved, we do not have a good idea of how data should turn out if there really is a psychological basis for "streaks."

In statistical language, we don't have a well-formulated alternative hypothesis to test against the null hypothesis. Thus we invent various measures (such as the serial correlations of Tversky and Gilovich), and we state how we intuitively think these measures will behave if the null hypothesis is not true. If they don't appear to behave this way, then we can conclude fairly safely that the null hypothesis is at least approximately true, or that the opposing effect, if true, is at least not very large.
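As a rough sketch of the kind of null-hypothesis benchmark being described (this is an editor-style illustration, not from Hooke's article; the shot count, success rate, and simulation size are invented), one can simulate how a lag-1 serial-correlation measure behaves for independent Bernoulli trials:

```python
import random

def lag1_corr(xs):
    """Lag-1 autocorrelation of a 0/1 make-miss sequence."""
    mean = sum(xs) / len(xs)
    num = sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(len(xs) - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(1)
# 2,000 simulated "shooters", each taking 200 independent 50% shots:
sims = sorted(lag1_corr([1 if random.random() < 0.5 else 0 for _ in range(200)])
              for _ in range(2000))
print(sims[50], sims[-50])   # rough 2.5% and 97.5% points under the null
```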

My intuition tells me that the alternative hypothesis is not that there is a "hot hand" effect that is the same for everyone, but that the real situation is much more complex. Some people are slaves to their recent past, some can ignore it altogether, and others lie somewhere in between. The slaves, who become discouraged after a few failures, probably don't make it to the professional level in competition unless they have an unusual excess of talent. If they are in the minority among professional athletes, it would take a very large amount of data to show how their behavior affects the overall statistics. Also, if a player only has a hot hand sometimes, how do we know how many successes are required for the hot hand to take over? With one player this number may be one, with another it may be three or four. A measure that is appropriate in detecting the effect for one of these types may not be very powerful for another.

Why does a statistician continue to look with skepticism on these negative results? For one thing, if there are no hot hands there are also no slumps. Thus no matter how many hits or walks a good pitcher has allowed in a game, the manager should not take him out unless he has some physical problem. Of all the slumps that I've observed, the one of most majestic proportions was endured by Steve Blass, a pitcher for the Pittsburgh Pirates from 1964 to 1974. From a fair start he gradually became a star in 1971 and 1972, but in 1973 he became a disaster. An anecdote such as this is in itself no argument for the

existence of slumps, since his numbers, bad as they were, might possibly have occurred by chance. Additional data, though, were available to observers without getting into the box scores: Blass's pitches were often very wild, missing the plate by feet, not inches. In 1974 he tried again, pitched in one game and then retired from baseball. So far as I know, no physical reason for all this was ever found.

I was once asked: "At the end of a baseball season, is there real statistical evidence that the best team won?" The statistician's first attack on this (not the final one,


by any means) is to suppose that for each game the assumptions of Bernoulli trials hold. In some years this null hypothesis is rejected, but often not. Even when the null hypothesis is not rejected, the statistics on such additional measures as runs scored and runs allowed may show conclusively that the top teams (if not the top team) were consistently performing better than the others. Thus, we have statistics that seem to show that the game was all luck, while more detailed statistics may be available to contradict this conclusion.

Then there is the issue of defense. In basketball, some teams are alleged by the experts to play much better defense than others. In a given game, a player takes a series of shots against the same team, whose defensive capabilities may be considerably greater or less than the league's average. Can this be true without some

effect showing up in the serial correlations? Do the same statistics that fail to show the existence of the hot hand also show that defense is not important?

Baseball has a similar feature. Gould quotes statistical results from a colleague to the effect that "Nothing ever happened in baseball above and beyond the frequency predicted by coin-tossing models," but he gives no details. One assumes that the effect of various opposing pitchers was not part of the model used, since this would introduce enormous complications. Yet a batter usually faces the same pitcher several times in a row. If the statistics do not show some sort of dependence on the opposition, then the statistical procedures are simply not powerful enough to detect effects of interest such as streakiness.

In short, my conclusion is that the data examined and analyzed to date show that the hot hand effect is probably smaller than we think. No statistician would deny that people, even statisticians, tend to see patterns that are not there. I would not say, however, that the hot hand doesn't exist. Were I a Bayesian, I would assign a very high prior probability to the existence of hot hands and challenge others to produce data that would contradict it.

Additional Reading

Angell, R. (1988), Season Ticket, Boston: Houghton Mifflin.

Gould, S. J. (1989), "The Streak of Streaks," Chance, 2(2), 10-16.

Hooke, R. (1983), How to Tell the Liars from the Statisticians, New York: Marcel Dekker.

Tversky, A. and Gilovich, T. (1989), "The Cold Facts About the 'Hot Hand' in Basketball," Chance, 2(1), 16-21.


Chapter 32

Lessons from Sports Statistics

Frederick MOSTELLER

The author reviews and comments on his work in sports statistics, illustrating with problems of estimation in baseball's World Series and with a model for the distribution of the number of runs in a baseball half inning. Data on collegiate football scores have instructive distributions that indicate more about the strengths of the teams playing than their absolute values would suggest. A robust analysis of professional football scores led to widespread publicity with the help of professional newswriters. Professional golf players on the regular tour are so close in skill that a few rounds do little to distinguish their abilities. A simple model for golf scoring is "base + X" where the base is a small score for a round rarely achieved, such as 64, and X is a Poisson distribution with mean about 8. In basketball, football, and hockey the leader at the beginning of the final period wins about 80% of the time, and in baseball the leader at the end of seven full innings wins 95% of the time. Empirical experience with runs of even and odd numbers in tossing a die millions of times fits closely the theoretical distributions.

KEY WORDS: Baseball; Football; Golf; Dice.

1. WORLD SERIES

My first paper on statistics in sports dealt with the World Series of major-league baseball (Mosteller 1952). At a cocktail party at Walter and Judith Rosenblith's home someone asked: What is the chance that the better team in the series wins? Some people did not understand the concept that there might be a "best" or "better" team, possibly different from the winner. It occurred to me that this question provided an excellent application of work on unbiased estimation for quality control that Jimmie Savage and I completed during World War II. I drafted the World Series paper with the considerable assistance of Doris Entwisle, now a professor at The Johns Hopkins University, and I submitted it to the Journal of the American Statistical Association. W. Allen Wallis, then the editor of JASA, and who had prompted our work on estimation when he directed the Statistical Research Group of Columbia, sent it out to a number of referees who were intensely interested in baseball. The referees had a variety of good ideas that led to extensions of the paper.

Frederick Mosteller is Roger I. Lee Professor of Mathematical Statistics, Emeritus, Department of Statistics, Harvard University, Cambridge, MA 02138. This is a revised version of a talk presented at the Chicago Meeting of the American Statistical Association, August 5, 1996, on the occasion of the award of Sports Statistician of the Year to the author by the Section on Statistics in Sports. A referee and the associate editor have made many suggestions for improving the final paper (see Lesson 1). Tables 1-4 and Figures 1 and 2 are reproduced with permission of the American Statistical Association. Table 5 is used with permission of Psychometrika. Figure 3 appeared in Chance, and is reproduced with permission of Springer-Verlag Inc.

That led me to my first lesson from mixing science and statistics in sports.

Lesson 1. If many reviewers are both knowledgeable about the materials and interested in the findings, they will drive the author crazy with the volume, perceptiveness, and relevance of their suggestions.

The paper doubled in size in response to the first round of suggestions, and the second round lengthened it further.

The second lesson came from the work on unbiased estimates that applied to some models used in the World Series paper.

Lesson 2. If you develop some inferential statistical methods, you are likely to have a use for them in a paper on sports statistics.

In the World Series analysis we are making inferences from the statistics of a truncated series (once a winner is determined, the series stops without carrying out the seven games). Jimmie and I had a theorem about how to get the unique unbiased estimate from binomial processes (Girshick, Mosteller, and Savage 1946). We showed that there were unreasonable results in unbiased estimation. The existence of such unreasonable results has downgraded somewhat the importance of unbiasedness. To see an example of this unreasonableness, I turn to a very short binomial game in which p is the probability of a success on any trial, trials are independent, and you play until either you get one success or two failures, and then stop.

Exhibit 1. Play Until You Get a Success or Two Failures

The sequence begins at (0 failures, 0 successes) with stopping points (0, 1), (1, 1), and (2, 0). Let the value of the estimate of p at these three points be x, y, and z, respectively. Unbiasedness implies

x p + y qp + z q² = p.

Rewriting in terms of q gives

x(1 - q) + yq(1 - q) + zq² = 1 - q

1·x + q(-x + y) + q²(-y + z) = 1 - q.

Equating coefficients of the powers of q on the two sides yields x = 1, y = 0, z = 0. It is annoying to many that the estimate of p for (1, 1) is 0 although the observed proportion of successes is 1/2.


Exhibit 2. Grid Showing Boundary Points for Best-of-Seven-Games Series


The material in Exhibit 1 shows that the unique unbiased estimate of p is 1 if the first trial is a success, 0 if the second trial is a success, and 0 if the first two trials are failures. Most people find it unreasonable that when half the trials are successes, the value of the estimate of probability of success is 0.

For a best-of-seven series we can represent the sequences of wins and losses of games in a series by paths in a rectangular grid (see Exhibit 2) consisting of points (x, y), 0 ≤ x, y ≤ 4, excluding (4, 4). The point (x, y) represents x wins by the National League and y wins by the American League. A path starts at (0, 0). The binomial probability p is the chance that the next game is won by the American League, which would add the step from (x, y) to (x, y + 1) to the path, or 1 - p that the National League wins, which would add the step from (x, y) to (x + 1, y) to the path. The boundary points (x, 4), x = 0, 1, 2, 3, correspond to series wins by the American League, and (4, y), y = 0, 1, 2, 3, to wins by the National League. Paths stop when they reach a boundary point.

The unique value for the unbiased estimate for p at a given boundary point is given by a ratio:

(number of paths from (0, 1) to the boundary point) / (number of paths from (0, 0) to the boundary point).
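The path counts in this ratio are easy to compute by recursion on the grid. The short sketch below is illustrative (it is not from Mosteller's paper); it reproduces the 6/10 value for the boundary point (2, 4) discussed in the Example that follows:

```python
from functools import lru_cache

def paths(start, end, stop=4):
    """Number of game sequences from 'start' to boundary point 'end' in which
    neither team reaches 'stop' wins before the final step."""
    @lru_cache(maxsize=None)
    def count(x, y):
        if (x, y) == end:
            return 1
        if x >= stop or y >= stop:      # series already decided elsewhere
            return 0
        return count(x + 1, y) + count(x, y + 1)
    return count(*start)

# Unbiased estimate of p at boundary point (2, 4): 6/10 = 3/5.
print(paths((0, 1), (2, 4)), "/", paths((0, 0), (2, 4)))
```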

Table 2 shows the unbiased estimate associated with the boundary points for the best-of-seven series.

Table 1. Number of World Series Won by the American League in 12-Year Intervals

Years            No. won by AL

1903-1915a          7 of 12
1916-1927b          7 of 12
1928-1939           9 of 12
1940-1951           8 of 12
1952-1963           6 of 12
1964-1975           6 of 12
1976-1987           6 of 12
1988-1995c          4 of 7

Totals             53 of 91

a No series in 1904.
b Includes NL victory in 1919, year of "Black Sox Scandal."
c No series in 1994.

Table 2. Outcomes of the 87 Best-of-Seven Games in a World Series

  Games won in series       Unbiased estimate of
  NL (x)      AL (y)          P (win by AL)          Frequency

     4           0                  0                     7
     4           1                 1/4                    6
     4           2                 2/5                    5
     4           3                 3/6                   18
     3           4                 3/6                   14
     2           4                 3/5                   15
     1           4                 3/4                   14
     0           4                  1                     8

                                            Total        87

NOTE: The number of paths from (0, 0) to (2, 4) is C(5, 2) = 10 because the only way to get to (2, 4) is first to reach (2, 3) and then go to (2, 4) with an American League win.
NOTE: Average of unbiased estimates = .540.

Example. Consider the boundary point (2, 4). The number of paths from (0, 1) to (2, 4) is 6, the number from (0, 0) to (2, 4) is 10, and the value of the estimate is 6/10 or 3/5. It is amusing that (3, 4) has the value 1/2, as does (4, 3).

The formula applies to much more general patterns of boundary points than just best-of-n series. We do require that the sum of probabilities of paths hitting boundary points is 1. The uniqueness requires that there be no interior points that can end the series. For example, if we added the rule of the best-of-seven series that if the state is ever (2, 2) we stop the series and declare a tie, then the estimate described above would not be unique.

Getting back to statistics in sports, there have been about 43 World Series since my original paper was written. The American League had been winning more than half the series. Table 1 shows that in the first 48 series (1903-1951) they won 31 or 65%, and in the most recent 43 series (1952-1995) they won 22 or 51% and dropped back to nearly even. Of the 91 series, 87 were best-of-seven-game series, based on games played until one team won four. Four series were based on best-of-nine games played until one team won five.

For the 87 best-of-seven-game series the average value of the unbiased estimates of the probability p of the American League winning a given game is .540 (Table 2). In computing this average value we weighted the estimate associated with each outcome by the number of series of that type that occurred. The model used is independent Bernoulli trials with p fixed for the series.

We should try to examine the model more carefully. I have the examined impression that the independent binomial trials model was reasonable during the first 48 series, but an unexamined impression that it may not be appropriate during the last 43.

In the first 48 series there seemed to be no home-field advantage, and I wonder whether that may have changed.

To close our World Series discussion as of 1952, in answer to the motivating question: we estimated the probability that the better team won the World Series according to Model A (fixed p across years as well as within series) at .80, and for Model B (normally distributed p, but fixed within a series) at .76. How that may have changed, I do not know.
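The building block behind a Model A calculation is the chance that a team with a fixed single-game probability p wins a best-of-seven series. A minimal sketch of that computation (illustrative only; the p = 0.6 example value is invented, and this does not reproduce the .80 figure, which also depends on the estimated distribution of p):

```python
from math import comb

def series_win_prob(p, wins_needed=4):
    """Probability of winning a best-of-(2*wins_needed - 1) series with
    independent games, each won with probability p."""
    # Sum over the number of games k the opponent wins before the clincher.
    return sum(comb(wins_needed - 1 + k, k) * (1 - p) ** k * p ** wins_needed
               for k in range(wins_needed))

print(series_win_prob(0.6))   # chance a p = 0.6 team takes a best-of-seven series
```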

Lesson 3. There is always more to do.


2. RUNS IN AN INNING OF MAJOR LEAGUE BASEBALL

Bernard Rosner has allowed my associate, Cleo Youtz, and me to participate with him as he developed a theory of the distribution of the number of runs in a major league baseball inning (runs per three outs). We started by assigning a negative binomial distribution to the number of persons at bat in a three-out inning, but found that the results underestimated the number of innings with three players at bat and overestimated the number of innings with four players at bat, but otherwise the number of batters faced fitted well. By making a brute-force adjustment that added and subtracted a parameter to correct for these deviations Rosner was able to develop a theory for the number of runs. The expected values in Table 3 show the theoretical distribution of runs in innings (really half innings) as compared with the observed 1990 American League results for 77 principal pitchers (Rosner, Mosteller, and Youtz 1996).

Along the way Rosner developed parameters for the pitchers' properties; but the description of these is too long to include here. However, the parameters did show that in 1990 Roger Clemens had the best pitching record, with an expected earned run average of 1.91. Thus, however disappointing the Red Sox may be, we owe them something for Roger Clemens.

Lesson 4. Wait until next year.

3. COLLEGIATE FOOTBALL

I stumbled across a whole season's collegiate football scores somewhere, and was most impressed with the highest tie score. In 1967 Alabama and Florida State tied at 37-37; in 1968 Harvard and Yale tied 29-29. I wondered what the probability of winning was with a given score. Figure 1 lays that out as the estimate of P(winning | score) for 1967. Roughly 16 is the median winning score (see Fig. 1) (Mosteller 1970).

I thought that the most interesting regression was of the losing score on the winning score; see Figure 2. It would be clearer if the vertical axis were horizontal, so please look at it sideways. The point seems to be that when one team scores many points, there is no time (and perhaps little ability) for the other team to score.

Table 3. Observed and Expected Distribution of Number of Runs Scored in an Individual Half Inning (x) in Baseball (Based on 77 Starting Pitchers in the 1990 American League Season)

Runs scored    Observed number            Expected number
    (x)          of innings      (%)        of innings      (%)

     0              4,110        (73)         4,139.9       (73)
     1                903        (16)           920.1       (16)
     2                352         (6)           319.9        (6)
     3                172         (3)           137.5        (2)
     4                 65         (1)            66.9        (1)
     5                 29         (1)            33.3        (1)
     6                  5         (0)            15.6        (0)
     7                  3         (0)             5.9        (0)

  Total             5,639                      5,639

Figure 1. Proportion of Times a Given Score Wins, 1967 Collegiate Football Scores.

The winning score associated with the highest average losing score is about 32, and then the loser averages around 17.

These results all seem to fall into the realm of descriptive statistics. Ties occur about half as often as one-point differences. This can be argued from trivial combinatorics.

From Table 4 we can see that some scores are especially lucky and others unlucky. For example, a score of 17 is more than twice as likely to win as to lose, whereas the higher score of 18 won less often than it lost. Again, 21 is lucky, winning 69% of its games, but 22 is not, winning only 59%. Table 4 shows the irregularity of these results rather than a monotonic rise in probability of winning given the absolute score. (Ties gave half a win to each team.)

Lesson 5. Collegiate football scores contain extra information about the comparative performances of a pair of teams beyond the absolute size of the scores.

4. PROFESSIONAL FOOTBALL

I wanted to carry out a robust analysis of professional football scores adjusted for the strength of the opposition

Figure 2. Graph of Average Losing Score for Each Winning Score, Collegiate Football, 1967.


Table 4. Distributions of Team Scores Up to Scores of 29

Score    Winning %    Total        Score    Winning %    Total

  0         1.4         222          18        46.2         26
  2          .0           6          19        61.1         36
  3        16.3          43          20        65.5         84
  4          .0           1          21        69.5        118
  6         5.8         121          22        58.8         34
  7        15.5         220          23        81.4         43
  8         8.6          35          24        80.3         66
  9        40.6          32          25        61.5         13
 10        33.3          72          26        93.1         29
 11        20.0           5          27        80.8         52
 12        23.6          55          28        84.7         72
 13        35.1         111          29        81.8         22
 14        36.6         202
 15        39.5          38
 16        48.2          56          Total                2,316
 17        69.3          75

because blowout scores occur when a team is desperately trying to win, often after having played a close game. In 1972 Miami was undefeated, and no other team was, but at least on the surface it looked as if Miami had played weaker teams than had other high-ranking teams. I planned to present this paper at the post-Christmas meetings of the American Association for the Advancement of Science (AAAS). AAAS makes very extensive efforts for the press. Speakers are asked to prepare and deposit materials for newswriters in advance. The head of the Harvard News Office and I were acquainted, and I described these practices to him. He asked if I had ever considered having my paper rewritten for the press by a newswriter, and, of course, I had not. He then suggested that after I had prepared the paper, I should give it to him for someone to do a rewrite, which I did. After a couple of rounds of rewriting by a brilliant writer whose name I wish I could recall, we completed it and sent it to the AAAS News Office. When I arrived at my hotel room in Washington, the phone was already ringing off the hook as various newswriters wanted interviews.

The basic concept of the robust analysis was to create an index for a team that would be the difference between a robust measure of its offensive scoring strength against its opponents and the corresponding index for its opponents' offensive score (or equivalently the team's weakness in defense). The higher the difference, the better the ranking.

I essentially used Tukey's trimeans (Tukey 1977),

(1·(lower quartile) + 2·(median) + 1·(upper quartile))/4,

on both a specific team's scores for the season and on its opponents' scores against the team. And their difference was an index of the performance of the team for the season. We adjusted each team's score for all of the teams it played against. We also made a second-order adjustment for the opponents' scores based on the quality of their opponents.
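A minimal sketch of the trimean itself (illustrative code, not from the paper; the quantile interpolation rule and the example season of scores are invented):

```python
def trimean(scores):
    """Tukey's trimean: (lower quartile + 2 * median + upper quartile) / 4."""
    xs = sorted(scores)
    def quantile(q):                      # simple interpolated quantile
        pos = q * (len(xs) - 1)
        lo, frac = int(pos), pos - int(pos)
        return xs[lo] + frac * (xs[min(lo + 1, len(xs) - 1)] - xs[lo])
    return (quantile(0.25) + 2 * quantile(0.5) + quantile(0.75)) / 4

# A team's index would then be roughly
#   trimean(points scored) - trimean(points allowed),
# with further adjustment for the quality of the opposition.
print(trimean([24, 17, 31, 13, 27, 20, 34, 10, 21, 28, 14, 23, 30, 17]))
```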

In the end we used the robust index to obtain a ranking for each team based on its adjusted scores against its opponents and on its opponents' adjusted scores. Our estimates for the ranking adjusted for quality of scheduled opponents ranked Miami second. (Miami did win the Superbowl that season.)

The newswriters spread this information all across the country. I got hate letters from many Miami fans, including a number who claimed to be elderly women.

Lesson 6. Some people are bonded to their local teams.

Lesson 7. The nation is so interested in robust sports statistics that it can hog the newspaper space even at an AAAS annual meeting.

Lesson 8. Maybe it would pay statisticians to have more of their papers rewritten by newswriters.

5. GOLF

Youtz and I were surprised to find that the top professional men golf players were so close in skill that a small number of rounds could do little to distinguish them. Indeed, the standard deviation among players of mean true scores (long-run) for one round at par 72 was estimated to be about .1 of a stroke (Mosteller and Youtz 1992, 1993).

By equating courses using adjusted scores (adding 2 at a par 70 course, adding 1 at a par 71, and using par 72 scores as they stand) we were able to pool data from the last two rounds of all four-round men's professional golf tournaments in 1990 in the U.S.P.G.A. tour. We modeled scores as

y = base + X

where the base was a low score rarely achieved, such as 62, and X was a Poisson variable. We might think of the base as a score for a practically perfect round. We used for fitting all 33 tournaments:

base = 63,  X: Poisson with mean 9.3

(1 score in 2,500 is less than 63). For fine-weather days we used

base = 64,  X: Poisson with mean 8.1.

For windy weather we used

base = 62,  X: Poisson with mean 10.4.

The smaller base for windy weather is contrary to intuition, but it may flow partly from the unreliability of the estimate of the base. We found it surprising that the average scores for fine-weather days (72.1) and windy days (72.4) were so close.
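A short simulation sketch of the "base + X" model (illustrative only, not from the paper; it uses the fine-weather fit quoted above and a dependency-free Poisson draw by CDF inversion):

```python
import math
import random

def simulate_round(base=64, mean=8.1):
    """One simulated score from the fine-weather fit: base + Poisson(mean)."""
    u, k = random.random(), 0
    p = math.exp(-mean)        # P(X = 0)
    cum = p
    while u > cum:             # invert the Poisson CDF
        k += 1
        p *= mean / k
        cum += p
    return base + k

random.seed(2)
rounds = [simulate_round() for _ in range(10000)]
print(sum(rounds) / len(rounds))   # should land near 64 + 8.1 = 72.1
```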

Although the negative binomial is attractive as a mixture of Poissons, the variance among professional players' long-run true scores is so small that it offers little advantage.

Figure 3 shows the fit of the distribution of scores to the Poisson model for the season.

Lesson 9. Stay off the course in thunderstorms; Lee Trevino was once struck by lightning while waiting for play to resume.

6. DO WE NEED TO WATCH WHOLE GAMES?

In basketball, football, and hockey the leader at the beginning of the final period (quarter or period) wins the game about 80% of the time (Cooper, DeNeve, and Mosteller 1992). In baseball the leader at the end of seven full innings wins 95% of the time.


Figure 3. Frequency Distribution of Adjusted Golf Scores in Rounds 3 and 4 in Ten Tournaments in 1990 Having Fine Weather Compared to the Poisson Distribution with Mean 8.1 (solid: golf scores, dashed: Poisson). Source: Mosteller and Youtz (1993, Fig. 2). Reprinted with permission of Springer-Verlag New York Inc.

Lesson 10. We can afford to turn off the TV at the beginning of the final period unless the game is very close.

"Home" teams win about 60% of the time. "Home" teamsin basketball make more last-quarter comebacks from be-hind than "away" teams by a factor of 3 to 1.

The News Office of the National Academy of Sciences wrote a news release about the results given in that paper, and the story appeared in some form in dozens of papers throughout the country.

Lesson 11. Those newswriters know both how to shorten a paper and what will grab readers' attention.

7. RUNS IN TOSSES OF DICE

Statisticians have a strong need to know how accurately their mathematical models of probabilistic events imitate their real-life counterparts. Even for theories of coin tossing and of dice and the distribution of shuffled cards, the gap between theory and practice may be worth knowing. Consequently, I have always been eager to know of demonstrations where these outcomes are simulated.

About 1965 Mr. Willard Longcor came to my office and explained that he had a hobby, and wondered whether this hobby might be adapted to some scientific use because he was going to practice it anyway. He explained that in his retiring years he had, as one intense hobby, recording the results of coin tossing and die tossing. He had discussed applying his hobby with several probabilists and statisticians, and his visit to my office was part of his tour to explore whether his coin and dice throwing might be made more useful than for his personal enjoyment.

I had often thought that, although means and variances might work out all right in tosses of coins and dice, perhaps there were problems with the actual distribution of runs of like outcomes. I was well acquainted with Alexander Mood's paper (1940) on the theory of runs, and so Mr. Longcor's proposal awakened me at once to an opportunity to see some practical results on runs with a human carrying out the tossing and the recording. Mr. Longcor's experience was in tossing a coin or in tossing a die, but recording only whether the outcome was odd or even.

I explained my interest in the distribution of runs. We worked out a plan to record the outcomes of runs of evens

in a systematic way. I do not know whether anyone else proposed a project to him, but he told me that he returned to at least one probabilist to get independent assurance that our program might be a useful task.

We planned to use dice, both the ordinary ones with drilled pips that we called Brand X and some of the highest quality to be obtained from professional gambling supply houses. Inexpensive dice with holes for the pips with a drop of paint in each hole might show bias. Precision-made dice have sharp edges meeting at 90° angles, and pips are either lightly painted or back-filled with extremely thin disks. We used three different high-class brands A, B, and C. Each die was to be used for 20,000 tosses and then set aside together with its data. Work began with occasional contact by phone and letter.

Many months later we received a large crate containing the results for millions of throws very neatly recorded, each dataset in the envelope with its own die. We did a lot of checking, and found the results to be in excellent order. Analyzing that crate of data kept a team of four of us busy for a long time.

Some of the findings are shown in Table 5. Brand X did turn out to be biased, more biased than Weldon's dice in his historical demonstration (as given in Fry 1965). Weldon's dice were more biased than the high-quality dice Mr. Longcor purchased.

Lesson 12. Getting additional people involved in statistical work is a beneficial activity, and they should not have to recruit themselves. Can we do more of this?

As for the number of runs, perhaps a good thing to look at is the mean number of runs of 10 or more per 20,000 tosses. The theoretical value is 9.76. Brands A and B are very close at 10.08 and 9.67, Brand C is a little higher at 10.52, and Brand X is 3.5 standard deviations off at 11.36. However, if we use its observed probability of "even" as .5072, then Brand X has a standard score of only .6, and so its "long-runs" deviation is small, given the bias. This gives us a little reassurance about the model.
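The quoted 9.76 can be checked by simulation. The sketch below is an illustration (not from the paper) and assumes the count refers to runs of the recorded outcome, "even," of length ten or more in 20,000 tosses of a fair die:

```python
import random

def long_even_runs(n=20000, p_even=0.5, min_len=10):
    """Count runs of 'even' of length >= min_len in n simulated tosses."""
    count, run = 0, 0
    for _ in range(n):
        if random.random() < p_even:       # an "even" toss
            run += 1
            if run == min_len:             # count each long run exactly once
                count += 1
        else:
            run = 0
    return count

random.seed(3)
sims = [long_even_runs() for _ in range(200)]
print(sum(sims) / len(sims))    # close to the theoretical 9.76 for a fair die
```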

8. CONCLUDING REMARKS

My experience with newswriters in relation to sports statistics strongly suggests to me that statisticians should do more about getting statistical material intended for the public written in a more digestible fashion. We tend to give seminars for newswriters so that they will know more about our field. Maybe we should be taking seminars ourselves from newswriters to improve our communication with the consumers of our work. Thus self-improvement might be one direction to go. An alternative would be to have more statistical work that is for public consumption rewritten by newswriters. Such efforts do take a couple of iterations because first translations often run afoul of technical misunderstandings, but with good will on both sides, newswriters can clean out the misunderstandings between statistician and writer. What can we do to take more advantage of the newswriters' skills?

Because the ASA Section on Statistics in Sports has many members, and because many young people have an interest


Table 5. Percentage Distributions for Blocks of 20,000 Throws According to the Number of Even Throws in the Blocks for Theoretical Distributions and for Six Sets of Data, Together with Observed Means, Variances, and Standard Deviations, and Standard Scores for Mean

                                Percentage distribution of blocks of 20,000 throws

Number of                                    Brands                RAND random   Pseudorandom   Brand
even throws          Theoretical      A        B        C           numbers R      numbers P       X

10,281-10,320                                                                                       3
10,241-10,280                          1                                                            7
10,201-10,240                          1                                               1           12
10,161-10,200              1           1                3               4              2           19
10,121-10,160              3           5       10       6               3              2           16
10,081-10,120              8           6       17      10               7              5           22
10,041-10,080             16          19        7      10              11             17           17
10,001-10,040             22          21        7      19              17             28            3
 9,961-10,000             22          20       20      35              21             21
 9,921- 9,960             16          15       17      13              19             14
 9,881- 9,920              8           6       20       3              12              8
 9,841- 9,880              3           4        3                       5              1
 9,001- 9,840              1           1                                1              1

Total(a)                 100%        100%     101%     99%            100%           100%          99%

Number of blocks
  of 20,000 throws                    100       30      31             100            100           58
Mean - 10,000               0           9       -1      14              -7              6          145
Variance of
  block totals          5,000       6,124    6,651   4,776           6,348          4,618        4,933
Standard deviation         71          78       82      69              80             68           70
Standard deviation
  of mean                             7.8     14.9    12.4             8.0            6.8          9.2
Standard score for mean
  based on observed S.D.             1.15     -.06    1.16            -.82            .86        15.70

a Totals may not add to 100 because of rounding.
Source: Iversen et al. (1971, p. 7). Reprinted with permission from Psychometrika and the authors.

in sports, perhaps additional interest in statistics could be promoted by having more statisticians speak about sports statistics to young people, for example, in student mathematics clubs in high schools. Some high-school mathematics teachers do already use sports for illustrating some points in mathematics.

When I think of the number of sports enthusiasts in the United States, I feel that more of these people should be involved in statistical matters. But so far, we do not seem to have organized a practical way to relate their interests to the more general problems of statistics. Perhaps mentioning this will encourage others to think of some ways of improving our connections.

Turning to activities of the Section on Statistics in Sports, I believe that we could profit from a lesson from the mathematicians. They have written out many important problems in lists for their researchers to solve. If we had a list of sports questions, whether oriented to strategies or to questions that may not be answerable, or to problems that might be solved by new methods of data gathering, these questions might attract more focused attention by researchers, and lead to new findings of general interest. And so I encourage the production of some articles oriented to lists of problems.

[Received September 1996. Revised March 1997.]

REFERENCES

Cooper, H., DeNeve, K. M., and Mosteller, F. (1992), "Predicting Professional Sports Game Outcomes from Intermediate Game Scores," Chance, 5(3-4), 18-22.

Fry, T. C. (1965), Probability and Its Engineering Uses (2nd ed.), Princeton, NJ: Van Nostrand, pp. 312-316.

Girshick, M. A., Mosteller, F., and Savage, L. J. (1946), "Unbiased Estimates for Certain Binomial Sampling Problems with Applications," Annals of Mathematical Statistics, 17, 13-23.

Iversen, G. R., Longcor, W. H., Mosteller, F., Gilbert, J. P., and Youtz, C. (1971), "Bias and Runs in Dice Throwing and Recording: A Few Million Throws," Psychometrika, 36, 1-19.

Mood, A. M. (1940), "The Distribution Theory of Runs," Annals of Mathematical Statistics, 11, 367-392.

Mosteller, F. (1952), "The World Series Competition," Journal of the American Statistical Association, 47, 355-380.

Mosteller, F. (1970), "Collegiate Football Scores, U.S.A.," Journal of the American Statistical Association, 65, 35-48.

Mosteller, F. (1979), "A Resistant Analysis of 1971 and 1972 Professional Football," in Sports, Games, and Play: Social and Psychological Viewpoints, ed. J. H. Goldstein, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 371-399.

Mosteller, F., and Youtz, C. (1992), "Professional Golf Scores are Poisson on the Final Tournament Days," in 1992 Proceedings of the Section on Statistics in Sports, American Statistical Association, pp. 39-51.

Mosteller, F., and Youtz, C. (1993), "Where Eagles Fly," Chance, 6(2), 37-42.

Rosner, B., Mosteller, F., and Youtz, C. (1996), "Modeling Pitcher Performance and the Distribution of Runs per Inning in Major League Baseball," The American Statistician, 50, 352-360.

Tukey, J. W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.


A statistician and athlete finds a wide range of applications of Total Quality Management.

Chapter 33

Can TQM Improve Athletic Performance?

Harry V. Roberts

Introduction

As a statistician and athlete (much of the athletics coming in age-group distance running and triathlon competition after the age of 50), I have kept careful records of training and competition over the last 20 years in hope that statistical analysis would benefit my performance. Among the many questions I sought to answer were: Is even-pacing during a marathon a good strategy? What type of training regimen is appropriate? Is it really necessary to drink fluids during long distance races in hot weather? Standard statistical methods, such as randomized controlled experiments, have been of limited use in answering such

questions. Over the years, I have drawn on techniques from Total Quality Management (TQM), ideas that have been effectively used in manufacturing and now increasingly in service industries, to improve athletic performance and found the methods not only helped improve my performance but also had a wide range of application.

Group Versus Individual Studies

The title asks "Can TQM ImproveAthletic Performance?" ratherthan "Can Statistics Improve Ath-letic Performance?" Statistics can,in principle, improve anything,

but the statistician's orientation is often toward research to obtain knowledge about a population. TQM, on the other hand, relies on statistical methods to focus on the improvement of specific processes. The TQM focus is valuable in helping one take the direct road to process improvement rather than the more leisurely path to new knowledge about a population, which may or may not apply to a specific individual.

A typical research approach to athletic performance is to experiment on subject groups in hope of finding general relationships. For example, after a base period, a group of runners could be randomized into two subgroups, one of which trains twice a day and


the other once a day. Subsequent performance relative to the base period could be compared by standard statistical techniques. The lessons from such studies can then be applied to improve individual athletic performance. Studies of this kind, however, have not yet been very helpful.

By contrast, the individual athlete in the TQM approach is viewed as an ongoing process to be improved. Historical data suggest hypotheses on how improvement should be sought, that is, what type of intervention might improve the process. A study is then designed and performance measured; this is essentially the Deming Plan-Do-Check-Act cycle (PDCA) that is central to statistical process control in industry. The basic statistical methodology is often Box and Tiao's "Intervention Analysis," a term they introduced in a classic article in the Journal of the American Statistical Association in 1975 entitled "Intervention Analysis with Applications to Economic and Environmental Problems."

When time-ordered performance measurements on an underlying process are in a state of statistical control—that is, the data behave like independent random drawings from a fixed distribution—intervention analysis calls for a comparison of the mean performance level before and after a particular improvement is attempted using techniques for two independent samples. When an underlying process is not in control—for example, when there are autocorrelated variation, trend, or day-of-week effects—intervention analysis essentially uses regression analysis to disentangle the effect of the intervention from the effects of the other sources of variation, random or systematic.

TQM for Improving Technique

The quickest opportunities for TQM to improve athletic performance come in matters of technique; students of mine have often done studies aimed at improvement in this area. Popular subjects have been free-throw shooting and three-point shooting in basketball, tennis service, place kicking, archery, target shooting, skeet shooting, swimming, cycling, and golf. Students often come up with ideas for improvement when recording and analyzing the results of regular practice.

In most studies, students' performance has been in a state of statistical control during the base period and in control at a higher level after the alteration of technique. A student who was initially in statistical control with a free-throw success rate of 75%, for example, changed his aim point, to the back of the rim, and increased the arch. Subsequent to this change, he found he was in control with a success rate of 82%. This is like starting with a biased coin that has probability of heads of 0.75, independently from toss to toss, and modifying the coin so that the probability of heads rises to 0.82. The success of intervention analysis can be further illustrated in three detailed examples.

Example 1. Improvement of Putting

A student, an excellent golfer who had never been satisfied with his putting, set up an indoor putting green for careful practice and experimentation. Results of the first 2,000 putts of his study, summarized by 20 groups of 100 each, are presented. Putts sunk per 100 trials from fixed distance:

47, 57, 57, 52, 59, 64, 45, 58, 61, 57, 71, 61, 67, 59, 64, 66, 76, 58, 61, 65

At the end of the first 10 groups of 100, he noticed that 136 of 443 misses were left misses and 307 were right misses. He reasoned that the position of the ball relative to the putting stance was a problem. To correct this, he proposed "moving the ball several inches forward in my stance, keeping it just inside the left toe." The final 10 observations were made with the modified stance, and all groups were displayed in a simple time-series plot (see Fig. 1) to help visualize the change.

Figure 1. Scatterplot of scores for 20 groups of 100 putts each shows scores improved when stance was modified for the final 10 observations.

Examination of the plot suggests that he is improving: On average, the last 10 points are higher than the first. Simple regression analysis with an indicator variable for change of stance suggests that the improvement is genuine; diagnostic checks of adequacy of the regression model are satisfactory. The process appears to have been in statistical control before the intervention and to have continued in statistical control at a higher level subsequent to the intervention. A more subtle question is whether we are seeing the steady improvement that comes with practice, a sharp improvement due to the change in stance, or both combined. Comparison of two regression models, one that incorporates a time trend and one that incorporates a single-step improvement, suggests that the sharp improvement model fits the data better than a trend one.
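As a rough illustration of this kind of intervention analysis, the two competing regressions can be fit to the 20 group totals listed above. The sketch below uses Python with statsmodels; the variable names (putts, step) are ours rather than the student's, and comparing residual sums of squares is only one of several reasonable ways to contrast the step and trend models.

```python
import numpy as np
import statsmodels.api as sm

# Putts sunk per 100 trials: first 10 groups before the stance change,
# last 10 groups after the change.
putts = np.array([47, 57, 57, 52, 59, 64, 45, 58, 61, 57,
                  71, 61, 67, 59, 64, 66, 76, 58, 61, 65])
groups = np.arange(1, 21)
step = (groups > 10).astype(float)   # indicator for the modified stance

# Model 1: single-step improvement at the intervention.
step_fit = sm.OLS(putts, sm.add_constant(step)).fit()

# Model 2: steady improvement with practice (linear time trend).
trend_fit = sm.OLS(putts, sm.add_constant(groups.astype(float))).fit()

print("Step model:  intercept %.1f, shift %.1f, residual SS %.1f"
      % (step_fit.params[0], step_fit.params[1], step_fit.ssr))
print("Trend model: intercept %.1f, slope %.2f, residual SS %.1f"
      % (trend_fit.params[0], trend_fit.params[1], trend_fit.ssr))
```

For these 20 values the step model leaves a noticeably smaller residual sum of squares than the trend model, in line with the conclusion described in the text; a fuller comparison would also examine residual diagnostics.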

The estimated improvement is substantial; it translates into several strokes per round. (The student confirmed the results with further work and also discovered that the "baseball grip," recommended by Lee Trevino, was at least as good as the standard "reverse overlap grip.")

This student's example is an application of intervention analysis. There are no randomized controls, as in "true" statistical experimentation, but if one takes care in interpretation, conclusions from intervention analysis rest on nearly as firm ground. When randomized controls are possible, statistical analysis can be even more conclusive.

Example 2. A Pool Experiment

In a study of technique for the game of pool, it was possible to do a randomized experiment on alternative techniques. The techniques being considered were randomized in blocks instead of just making a single transition from one to the other, as was done in the putting example. The student was interested in alterations of technique affecting two aspects of pool and had been using an unconventional approach: an unorthodox upside-down V bridge with eye focused on the object ball. The standard approach was closed bridge and eye focused on the cue ball. Which of these four combinations is best? The experiment that was carried out was a two-level, two-factor design with blocking and with randomization of techniques within blocks. There were five sessions consisting of eight games each. The data set and details of the design are presented in Table 1.


Table 1—Study to Improve Game of Pool

[For each of the five sessions, the eight games are listed in the order played, with three values per game: Shots, Bridge, and Eye.]

Shots: the number of shots from the break to get all the balls in.
Session: eight games per session on a given day, five sessions.
Bridge: = -1 for unorthodox upside-down V bridge (starting method); = 1 for standard closed bridge (standard method).
Eye: = -1 for eye focused on object ball (starting method); = 1 for eye focused on cue ball (standard method).

The eight games of each session are blocked into two blocks of four, and the four treatment combinations are randomized within each block of four. Hence, the listing in Table 1 is in the sequence in which the games were played.

Figure 2 displays SHOTS as a time series. A careful examination of the plot in Fig. 2 suggests that several systematic effects are happening in the data. The experimental variations of technique are clearly superimposed on a process that is not in control. In particular:

1. Variability of SHOTS appears to be higher when the level of SHOTS is higher. A logarithmic or other transformation?

2. Overall downtrend is evident. Can this be the result of improvement with practice?

3. There is an uptrend within each of the five days. Is this a fatigue effect?

It is important to emphasize that this process was not in a state of statistical control, yet the experiment was valid because the systematic factors—the trends within and between days—causing the out-of-control condition were modeled in the statistical analysis, as we now show.

In the final regression by ordinary least squares, the sources of variation are sorted out as follows. LSHOTS is the log of the number of shots to clear the table. STANDARD is an indicator for the standard bridge and eye focus; it takes the value 1 for games played using these techniques and 0 for other games. This variable was defined when preliminary tabular examination of a two-way table of mean LSHOTS suggested that any departure from the standard bridge and focus led to a similar degradation. TIME is the linear trend variable across all 40 observations. ORDER is the sequence of games within each day. (Block effects turned out to be insignificant.)

The estimated regression equation is

Fitted LSHOTS = 3.84 − 0.248 (STANDARD) − 0.00998 (TIME) + 0.0518 (ORDER)

Diagnostic checks of model adequacy were satisfactory.

The standard error of STANDARD was 0.06045 and the t-ratio was −4.10. The standard technique, not the student's unorthodox technique, worked best. Using the standard technique led to an estimated 22% improvement in estimated SHOTS [exp{−0.248} = 0.78].

Note that there were two trends: an upward trend within each day, presumably reflecting fatigue, and a downward overall trend, presumably reflecting the effects of practice. For given ORDER within a day, the fitted scores drop about 8(0.01) = 0.08 from day to day; because we are working in log units, this translates to a trend improvement due to practice of about 8% per day.
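The interpretation of the fitted coefficients can be checked with a few lines of arithmetic. The snippet below simply evaluates the multiplicative effects implied by the reported log-scale coefficients; it does not refit the regression, since the game-by-game data in Table 1 are not reproduced cleanly here.

```python
import math

# Coefficients from the reported regression of LSHOTS (log shots to clear the table)
b_standard = -0.248   # standard bridge and eye focus
b_time = -0.00998     # linear trend across all 40 games
b_order = 0.0518      # order of the game within a day

# Multiplicative effect of the standard technique on SHOTS
print("standard vs. unorthodox:", round(math.exp(b_standard), 2))     # about 0.78

# Day-to-day practice effect: 8 games per day, so TIME advances by 8 between days
print("day-to-day practice factor:", round(math.exp(8 * b_time), 2))  # about 0.92

# Within-day fatigue effect per additional game
print("per-game fatigue factor:", round(math.exp(b_order), 2))        # about 1.05
```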

The pool example shows that in trying to improve technique, one can go beyond simple intervention analysis to designed experiments, even when the underlying process is not in a state of statistical control. Moreover, two aspects of technique—bridge and eye focus—were studied simultaneously.

The full advantages of multifactor experimentation, however, are difficult to realize in athletics. This is because change of athletic technique entails considerable effort. In the pool experiment, it was relatively easy to switch around the bridge and eye focus from trial to trial, but this is atypical. A swimmer, for example, might have a much harder time attempting to change technique on each lap with the four combinations of stroke technique defined by (1) moderate versus extreme body roll and (2) ordinary versus bilateral breathing.

Example 3. Even Pacing in Running

In cross-country and track races in the 1940s, I learned from painful experience that a fast start always led to an agonizing slowing down for the balance of a race. However, the coaching wisdom of the time always called for starting very fast and hoping that eventually one could learn to maintain the pace all the way. (A more refined version was that one should go out fast, slow down in the middle of the race, and then sprint at the end.)

Eventually, a distaste for suffering led me to experiment with a more reasonable starting pace; that was the intervention. In my next cross-country race, a 3-mile run, I was dead last at the end of the first quarter mile in a time of 75 seconds; the leaders ran it in about 60 seconds. (Had they been able to maintain that pace, they would have run a 4-minute mile on the first of the 3 miles.) For the rest of the race, I passed runners steadily and finished in the upper half of the pack, not because I speeded up but because the others slowed down. In the remaining races that season, my performance relative to other runners was much improved.

Decades afterward, as an age-group runner, I was able to validate the youthful experiment. At age 59, in my best marathon, I averaged 7:06 per mile throughout the race with almost no variation. In other marathons, I discovered that an even slightly too-fast pace (as little as 5-10 seconds per mile) for the first 20 miles was punished by agony and drastic slowing in the final 10 kilometers, with consequent inflation of the overall marathon time. In my best marathons, all run at an even pace, I was only pleasantly tired in the final 10 kilometers and was able to pass large numbers of runners who had run the first 20 miles too fast.

Figure 2. Plot of shots over five sessions of eight games of pool reveals overall downward trend and upward trends in each of the five sessions.

What worked for me apparently works for others. In recent decades, even-pacing has become the generally accepted practice. Most world distance records are nearly evenly paced.

In my later experiences in age-group competition, I did two other intervention studies that ran against the conventional wisdom of the time. I found that long training runs for the marathon were not necessary so long as my total training mileage was adequate (roughly 42 miles per week). I also discovered that it was unnecessary to drink fluids during long races in hot weather provided I was supersaturated with fluids at the start of the race and then, instead of drinking, doused myself with water at every opportunity. In this way, I kept cool and conserved body fluids but did not have the discomfort and mild nausea that came with drinking while running.

These personal examples illustrate the point that the individual athlete can improve performance without having to locate and draw on group data in the scientific literature. In each instance, the idea for improvement came from a study of my actual experience. The applications also illustrate the usefulness of simple before-after comparisons between what happened before the intervention and what happened subsequently.

In the example of a change in training regimen—no long-distance training runs—the abrupt intervention is the only practicable approach because the effects of training are cumulative and lagged. One cannot switch training regimens on and off for short periods of time and hope to trace the lagged effects; the same observation applies to attempts to improve general fitness.

The same ideas, of course, apply in industrial settings and have been applied there for decades in conjunction with use of Shewhart control charts by workers. The control chart makes it easier to see quickly, even without formal statistical testing, whether an intervention to improve the process has or has not succeeded.

The Limitations of Happenstance Data

All my examples have involved some conscious intervention in an ongoing process. "Happenstance data" also may be helpful in suggesting hypotheses for improvement. For this reason, it seems useful to maintain athletic diaries, recording training methods, workouts, competitive performances, injuries, weight, resting pulse rates, and other pertinent information. But there are dangers in relying on happenstance data if one lacks elementary statistical skills. First, it is often difficult to infer causation from regression analysis of happenstance data. Second, it is tempting to overreact to apparently extreme individual values and jump to a hasty process change; this is what Deming calls "tampering," and it usually makes processes worse, not better. Data plotting and statistical analysis are needed to ensure a proper conclusion based on sound observations and studies.

What Works Best

In improving athletic technique, the appropriate experimental strategy is likely to be intervention analysis because randomized experimentation often is not feasible. Moreover, for practical reasons, we are usually limited to changing one thing at a time. Nevertheless, careful application of the methods of TQM may enhance our ability to make causal judgments about the effects of our interventions. As illustrated in the pool example, it is not necessary for the athletic processes of interest to be in a state of statistical control before we can profitably intervene to improve them.

The examples presented above show that individual athletes can enhance their abilities without having to locate and draw on experimental group data in the scientific literature that may suggest how to improve performance. The ideas for improvement may come from the study of actual experience or even the advice of experts. Individual athletes can then collect and analyze personal data to see if the ideas work for them, and by so doing, determine ways to better their scores, finishing times, or other performance measures.


Chapter 34

A Brownian Motion Model for the Progress of Sports Scores

Hal S. Stern*

The difference between the home and visiting teams' scores in a sports contest is modeled as a Brownian motion process defined on t ∈ (0, 1), with drift μ points in favor of the home team and variance σ². The model obtains a simple relationship between the home team's lead (or deficit) ℓ at time t and the probability of victory for the home team. The model provides a good fit to the results of 493 professional basketball games from the 1991-1992 National Basketball Association (NBA) season. The model is applied to the progress of baseball scores, a process that would appear to be too discrete to be adequately modeled by the Brownian motion process. Surprisingly, the Brownian motion model matches previous calculations for baseball reasonably well.

KEY WORDS: Baseball; Basketball; Probit regression

1. INTRODUCTION

Sports fans are accustomed to hearing that "team A rarely loses if ahead at halftime" or that "team B had just accomplished a miracle comeback." These statements are rarely supported with quantitative data. In fact the first of the two statements is not terribly surprising; it is easy to argue that approximately 75% of games are won by the team that leads at halftime. Suppose that the outcome of a half-game is symmetrically distributed around 0 so that each team is equally likely to "win" the half-game (i.e., assume that two evenly matched teams are playing). In addition, suppose that the outcomes of the two halves of a game are independent and identically distributed. With probability .5 the same team will win both half-games, and in that case the team ahead at halftime certainly wins the game. Of the remaining probability, it seems plausible that the first half winner will defeat the second half winner roughly half the time. This elementary argument suggests that in contests among fairly even teams, the team ahead at halftime should win roughly 75% of the time. Evaluating claims of "miraculous" comebacks is more difficult. Cooper, DeNeve, and Mosteller (1992) estimated the probability that the team ahead after three quarters of the game eventually wins the contest for each of the four major sports (basketball, baseball, football, hockey). They found that the leading team won more than 90% of the time in baseball and about 80% of the time in the other sports. They also found that the probability of holding a lead is different for home and visiting teams. Neither the Cooper et al. result nor the halftime result described here considers the size of the lead, an important factor in determining the probability of a win.

The goal here is to estimate the probability that the home team in a sports contest wins the game given that they lead by ℓ points after a fraction t ∈ (0, 1) of the contest has been completed. Of course, the probability for the visiting team is just the complement. The main focus is the game of basketball.

Among the major sports, basketball has scores that can most reasonably be approximated by a continuous distribution. A formula relating ℓ and t to the probability of winning allows for more accurate assessment of the propriety of certain strategies or substitutions. For example, should a star player rest at the start of the fourth quarter when his team trails by 8 points or is the probability of victory from this position too low to risk such a move? In Section 2 a Brownian motion model for the progress of a basketball score is proposed, thereby obtaining a formula for the probability of winning conditional on ℓ and t. The model is applied to the results of 493 professional basketball games in Section 3. In Section 4 the result is extended to situations in which it is known only that ℓ > 0. Finally, in Section 5 the Brownian motion model is applied to a data set consisting of the results of 962 baseball games. Despite the discrete nature of baseball scores and baseball "time" (measured in innings), the Brownian motion model produces results quite similar to those of Lindsey (1977).

* Hal S. Stern is Associate Professor, Department of Statistics, Harvard University, Cambridge, MA 02138. Partial support for this work was provided by Don Rubin's National Science Foundation Grant SES-8805433. The author thanks Tom Cover for suggesting the problem and the halftime argument several years ago and Elisabeth Burdick for the baseball data. Helpful comments were received from Tom Belin, Andrew Gelman, and Carl Morris.

2. THE BROWNIAN MOTION MODEL

To begin, we transform the time scale of all sports contests to the unit interval. A time t ∈ (0, 1) refers to the point in a sports contest at which a fraction t of the contest has been completed. Let X(t) represent the lead of the home team at time t. The process X(t) measures the difference between the home team's score and the visiting team's score at time t; this may be positive, negative, or 0. Westfall (1990) proposed a graphical display of X(t) as a means of representing the results of a basketball game. Naturally, in most sports (including the sport of most interest here, basketball), X(t) is integer valued. To develop the model, we ignore this fact, although we return to it shortly. We assume that X(t) can be modeled as a Brownian motion process with drift μ per unit time (μ > 0 indicates a μ point per game advantage for the home team) and variance σ² per unit time. Under the Brownian motion model,

X(t) ~ N(μt, σ²t),

and X(s) − X(t), s > t, is independent of X(t) with

X(s) − X(t) ~ N(μ(s − t), σ²(s − t)).

© 1994 American Statistical Association. Journal of the American Statistical Association, September 1994, Vol. 89, No. 427, Statistics in Sports.


The probability that the home team wins a game is Pr(X(1) > 0) = Φ(μ/σ), and thus the ratio μ/σ indicates the magnitude of the home field advantage. In most sports, the home team wins approximately 55-65% of the games, corresponding to values of μ/σ in the range .12-.39. The drift parameter μ measures the home field advantage in points (typically thought to be 3 points in football and 5-6 points in basketball).

Under the random walk model, the probability that the home team wins [i.e., X(1) > 0] given that they have an ℓ point advantage (or deficit) at time t [i.e., X(t) = ℓ] is

Pμ,σ(ℓ, t) = Φ( (ℓ + μ(1 − t)) / (σ √(1 − t)) ),

where Φ is the cdf of the standard normal distribution. Of course, as t → 1 for fixed ℓ ≠ 0, the probability tends to either 0 or 1, indicating that any lead is critically important very late in a game. For fixed t, the lead ℓ must be relatively large compared to the remaining variability in the contest for the probability of winning to be substantial.
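A minimal sketch of this calculation in Python, using scipy and the parameter values estimated later in the article, shows how the formula is evaluated; the function name and default arguments are ours.

```python
from scipy.stats import norm

def prob_home_win(lead, t, mu=4.87, sigma=15.82):
    """P(home team wins | lead at fraction t of the game) under the Brownian motion model."""
    return norm.cdf((lead + mu * (1 - t)) / (sigma * (1 - t) ** 0.5))

# A home team down 5 at halftime, and up 5 entering the last quarter:
print(round(prob_home_win(-5, 0.50), 2))   # about .41
print(round(prob_home_win(5, 0.75), 2))    # about .78
```

Both values agree with the corresponding entries in Table 2.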

The preceding calculation treats X(t) as a continuous random variable, although it is in fact discrete. A continuity correction is obtained by assuming that the observed score difference is the value of X(t) rounded to the nearest integer. If we further assume that contests tied at t = 1 [i.e., X(1) = 0] are decided in favor of the home team with probability .5, then a suitably modified version of Pμ,σ(ℓ, t) is obtained. In practice, the continuity correction seems to offer little improvement in the fit of the model and causes only minor changes in the estimates of μ and σ. It is possible to obtain a more accurate continuity correction that accounts for the drift in favor of the home team in deciding tied contests. In this case .5 is replaced by a function of μ, σ, and the length of the overtime used to decide the contest.

The Brownian motion model motivates a relatively simple formula for Pμ,σ(ℓ, t), the probability of winning given the lead ℓ and elapsed time t. A limitation of this formula is that it does not take into account several potentially important factors. First, the probability that a home team wins, conditional on an ℓ point lead at time t, is assumed to be the same for any basketball team against any opponent. Of course, this is not true; Chicago (the best professional basketball team during the period for which data has been collected here) has a fairly good chance of making up a 5-point halftime deficit (ℓ = -5, t = .50) against Sacramento (one of the worst teams), whereas Sacramento would have much less chance of coming from behind against Chicago. One method for taking account of team identities would be to replace μ with an estimate of the difference in ability between the two teams in a game, perhaps the Las Vegas point spread. A second factor not accounted for is whether the home team is in possession of the ball at time t and thus has the next opportunity to score. This is crucial information in the last few minutes of a game (t > .96 in a 48-minute basketball game). Despite the omission of these factors, the formula appears to be quite useful in general, as demonstrated in the remainder of the article.

3. APPLICATION TO PROFESSIONAL BASKETBALL

Data from professional basketball games in the United States are used to estimate the model parameters and to assess the fit of the formula for Pμ,σ(ℓ, t). The results of 493 National Basketball Association (NBA) games from January to April 1992 were obtained from the newspaper. This sample size represents the total number of games available during the period of data collection and represents roughly 45% of the complete schedule. We assume that these games are representative of modern NBA basketball games (the mean score and variance of the scores were lower years ago). The differences between the home team's score and the visiting team's score at the end of each quarter are recorded as X(.25), X(.50), X(.75), and X(1.00) for each game. For the ith game in the sample, we also represent these values as Xij, j = 1, ..., 4. The fourth and final measurement, X(1.00) = Xi4, is the eventual outcome of the game, possibly after one or more overtime periods have been played to resolve a tie score at the end of four quarters. The overtime periods are counted as part of the fourth quarter for purposes of defining X. This should not be a problem, because X(1.00) is not used in obtaining estimates of the model parameters. In a typical game, on January 24, 1992, Portland, playing Atlanta at home, led by 6 points after one quarter and by 9 points after two quarters, trailed by 1 point after three quarters, and won the game by 8 points. Thus Xi1 = 6, Xi2 = 9, Xi3 = -1, and Xi4 = 8.

Are the data consistent with the Brownian motion model? Table 1 gives the mean and standard deviation for the results of each quarter and for the final outcome. In Table 1 the outcome of quarter j refers to the difference Xij − Xi,j−1 and the final outcome refers to Xi4 = X(1.00). The first three quarters are remarkably similar; the home team outscores the visiting team by approximately 1.5 points per quarter, and the standard deviation is approximately 7.5 points. The fourth quarter seems to be different; there is only a slight advantage to the home team. This may be explained by the fact that if a team has a comfortable lead, then it is apt to ease up or use less skillful players. The data suggests that the home team is much more likely to have a large lead after three quarters; this may explain the fourth quarter results in Table 1. The normal distribution appears to be a satisfactory approximation to the distribution of score differences in each quarter, as indicated by the QQ plots in Figure 1.

Table 1. Results by Quarter of 493 NBA Games

Quarter   Variable              Mean    Standard deviation
1         X(.25)                1.41    7.58
2         X(.50) - X(.25)       1.57    7.40
3         X(.75) - X(.50)       1.51    7.30
4         X(1.00) - X(.75)       .22    6.99
Total     X(1.00)               4.63    13.18


The correlations between the results of different quarters are negative and reasonably small (r12 = -.13, r13 = -.04, r14 = -.01, r23 = -.06, r24 = -.05, and r34 = -.11). The standard error for each correlation is approximately .045, suggesting that only the correlations between the two quarters in each half of the game, r12 and r34, are significantly different from 0. The fact that teams with large leads tend to ease up may explain these negative correlations; a single successful quarter may be sufficient to create a large lead. The correlation of each individual quarter's result with the final outcome is approximately .45. Although the fourth quarter results provide some reason to doubt the Brownian motion model, it seems that the model may be adequate for the present purposes. We proceed to examine the fit of the formula Pμ,σ(ℓ, t) derived under the model.

The formula Pμ,σ(ℓ, t) can be interpreted as a probit regression model relating the game outcome to the transformed variables ℓ/√(1 − t) and √(1 − t) with coefficients 1/σ and μ/σ. Let Yi = 1 if the home team wins the ith game [i.e., X(1) > 0] and 0 otherwise. For now, we assume that the three observations generated for each game, corresponding to the first, second, and third quarters, are independent. Next we investigate the effect of this independence assumption. The probit regression likelihood L can be expressed as

L = ∏i ∏j=1..3 { Φ( αXij/√(1 − tj) + β√(1 − tj) ) }^Yi { 1 − Φ( αXij/√(1 − tj) + β√(1 − tj) ) }^(1−Yi),

where tj = j/4, α = 1/σ, and β = μ/σ. Maximum likelihood estimates of α and β (and hence μ and σ) are obtained using a Fortran program to carry out a Newton-Raphson procedure. Convergence is quite fast (six iterations), with α̂ = .0632 and β̂ = .3077, implying

μ̂ = 4.87 and σ̂ = 15.82.

An alternative method for estimating the model parameters directly from the Brownian motion model, rather than through the implied probit regression, is discussed later in this section. Approximate standard errors of μ̂ and σ̂ are obtained via the delta method from the asymptotic variance and covariance of α̂ and β̂:

s.e.(μ̂) = .90 and s.e.(σ̂) = .89.
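For readers who want to reproduce this kind of fit without writing a Newton-Raphson routine, the sketch below simulates Brownian motion games and fits the implied probit regression with statsmodels; the simulated data, seed, and variable names are our own stand-ins for the NBA data, which are not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
mu, sigma, n_games = 4.87, 15.82, 493

# Simulate quarter-by-quarter score differences under the Brownian motion model.
quarters = rng.normal(mu / 4, sigma / 2, size=(n_games, 4))  # variance sigma^2/4 per quarter
leads = quarters.cumsum(axis=1)                              # X(.25), X(.50), X(.75), X(1.00)
win = (leads[:, 3] > 0).astype(float)

# One observation per game per quarter j = 1, 2, 3, treated as independent.
rows, t = [], np.array([.25, .50, .75])
for j in range(3):
    x1 = leads[:, j] / np.sqrt(1 - t[j])       # carries coefficient alpha = 1/sigma
    x2 = np.full(n_games, np.sqrt(1 - t[j]))   # carries coefficient beta = mu/sigma
    rows.append(np.column_stack([x1, x2]))
X = np.vstack(rows)
y = np.tile(win, 3)

fit = sm.Probit(y, X).fit(disp=0)
alpha, beta = fit.params
print("sigma-hat:", round(1 / alpha, 2), " mu-hat:", round(beta / alpha, 2))
```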

Figure 1. Q-Q Plots of Professional Basketball Score Differences by Quarter. These are consistent with the normality assumption of the Brownian motion model.

These standard errors are probably somewhat optimistic, because they are obtained under the assumption that individual quarters contribute independently to the likelihood, ignoring the fact that groups of three quarters come from the same game and have the same outcome Yi. We investigate the effect of the independence assumption by simulation using two different types of data. "Nonindependent" data, which resemble the NBA data, are obtained by simulating 500 Brownian motion basketball games with fixed μ, σ and then using the three observations from each game (the first, second, and third quarter results) to produce data sets consisting of 1,500 observations. Independent data sets consisting of 1,500 independent observations are obtained by simulating 1,500 Brownian motion basketball games with fixed μ, σ and using only one randomly chosen quarter from each game. Simulation results using "nonindependent" data suggest that parameter estimates are approximately unbiased but the standard errors are 30-50% higher than under the independence condition. The standard errors above are computed under the assumption of independence and are therefore too low. Repeated simulations, using "nonindependent" data with parameters equal to the maximum likelihood estimates, yield improved standard error estimates, s.e.(μ̂) = 1.3 and s.e.(σ̂) = 1.2.

The adequacy of the probit regression fit can be measured relative to the saturated model that fits each of the 158 different (ℓ, t) pairs occurring in the sample with its empirical probability. Twice the difference between the log-likelihoods is 134.07, which indicates an adequate fit when compared to the asymptotic chi-squared reference distribution with 156 degrees of freedom. As is usually the case, there is little difference between the probit regression results and logistic regression results using the same predictor variables. We use probit regression to retain the easy interpretation of the regression coefficients in terms of μ, σ. The principal contribution of the Brownian motion model is that regressions based on the transformations of (ℓ, t) suggested by the Brownian motion model, ℓ/√(1 − t) and √(1 − t), appear to provide a better fit than models based on the untransformed variables. As mentioned in Section 2, it is possible to fit the Brownian motion model with a continuity correction. In this case the estimates for μ and σ are 4.87 and 15.80, almost identical to the previous estimates. For simplicity, we do not use the continuity correction in the remainder of the article.

Under the Brownian motion model, it is possible to obtain estimates of μ, σ without performing the probit regression. The game statistics in Table 1 provide direct estimates of the mean and standard deviation of the assumed Brownian process. The mean estimate, 4.63, and the standard deviation estimate, 13.18, obtained from Table 1 are somewhat smaller than the estimates obtained by the probit model. The differences can be attributed in part to the failure of the Brownian motion model to account for the results of the fourth quarter. The probit model appears to produce estimates that are more appropriate for explaining the feature of the games in which we are most interested—the probability of winning.

Table 2 gives the probability of winning for several values of ℓ, t. Due to the home court advantage, the home team has a better than 50% chance of winning even if it is behind by two points at halftime (t = .50). Under the Brownian motion model, it is not possible to obtain a tie at t = 1 so this cell is blank; we might think of the value there as being approximately .50. In professional basketball t = .9 corresponds roughly to 5 minutes remaining in the game. Notice that home team comebacks from 5 points in the final 5 minutes are not terribly unusual. Figure 2 shows the probability of winning given a particular lead; three curves are plotted corresponding to t = .25, .50, .75. In each case the empirical probabilities are displayed as circles with error bars (± two binomial standard errors). To obtain reasonably large sample sizes for the empirical estimates, the data were divided into bins containing approximately the same number of games (the number varies from 34 to 59). Each circle is plotted at the median lead of the observations in the bin. The model appears consistent with the pattern in the observed data.

Figure 3 shows the probability of winning as a function of time for a fixed lead ℓ. The shape of the curves is as expected. Leads become more indicative of the final outcome as time passes and, of course, larger leads appear above smaller leads. The ℓ = 0 line is above .5, due to the drift in favor of the home team. A symmetric graph about the horizontal line at .5 is obtained if we fix μ = 0. Although the probit regression finds μ is significantly different than 0, the no-drift model P0,σ(ℓ, t) = Φ(ℓ/(σ√(1 − t))) also provides a reasonable fit to the data with estimated standard deviation 15.18.

Figure 4 is a contour plot of the function Pμ,σ(ℓ, t) with time on the horizontal axis and lead on the vertical axis. Lines on the contour plot indicate game situations with equal probability of the home team winning. As long as the game is close, the home team has a 50-75% chance of winning.

4. CONDITIONING ONLY ON THE SIGN OF THE LEAD

Informal discussion of this subject, including the introduction to this article, often concerns the probability of winning given only that a team is ahead at time t (ℓ > 0) with the exact value of the lead unspecified. This type of partial information may be all that is available in some circumstances.

Table 2. Pμ,σ(ℓ, t) for Basketball Data

                         Time t
Lead        .00    .25    .50    .75    .90    1.00
ℓ = -10            .32    .25    .13    .03    .00
ℓ = -5             .46    .41    .32    .18    .00
ℓ = -2             .55    .52    .46    .38    .00
ℓ = 0       .62    .61    .59    .56    .54
ℓ = 2              .66    .65    .66    .69    1.00
ℓ = 5              .74    .75    .78    .86    1.00
ℓ = 10             .84    .87    .92    .98    1.00

Integrating Pμ,σ(ℓ, t) over the distribution of the lead ℓ at time t yields (after some transformation)

Pμ,σ(t) = [ ∫ Φ( (μ/σ + z√t) / √(1 − t) ) φ(z) dz ] / Φ( (μ/σ)√t ),

where the integral is over z > −(μ/σ)√t and φ denotes the standard normal density; this depends only on the parameters μ and σ through the ratio μ/σ. The integral is evaluated at the maximum likelihood estimates of μ and σ using a Fortran program to implement Simpson's rule. The probability that the home team wins given that it is ahead at t = .25 is .762, the probability at t = .50 is .823, and the probability at t = .75 is .881. The corresponding empirical values, obtained by considering only those games in which the home team led at the appropriate time point (263 games for t = .25, 296 games for t = .50, 301 games for t = .75), are .783, .811, and .874, each within a single standard error of the model predictions.

If it is assumed that μ = 0, then we obtain the simplification

P0(t) = 1/2 + arcsin(√t)/π,

with P0(.25) = 2/3, P0(.50) = 3/4, and P0(.75) = 5/6. Because there is no home advantage when μ = 0 is assumed, we combine home and visiting teams together to obtain empirical results. We find that the empirical probabilities (based on 471, 473, and 476 games) respectively are .667, .748, and .821. Once again, the empirical results are in close agreement with the results from the probit model.
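The arcsine form is easy to check numerically; the few lines of Python below (the function name is ours) evaluate it at quarter, half, and three-quarter time.

```python
import math

def p0(t):
    """P(team ahead at time t wins) under the Brownian motion model with no drift."""
    return 0.5 + math.asin(math.sqrt(t)) / math.pi

for t in (0.25, 0.50, 0.75):
    print(t, round(p0(t), 3))   # 0.667, 0.75, 0.833 -- that is, 2/3, 3/4, 5/6
```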

5. OTHER SPORTS

Of the major sports, basketball is best suited to the Brownian motion model because of the nearly continuous nature of the game and the score. In this section we report the results of applying the Brownian motion model to the results of the 1986 National League baseball season. In baseball, the teams play nine innings; each inning consists of two half-innings, with each team on offense in one of the half-innings. The half-inning thus represents one team's opportunity to score. The average score for one team in a single half-inning is approximately .5. More than 70% of the half-innings produce 0 runs. The data consist of 962 games (some of the National League games were removed due to data entry errors or because fewer than nine innings were played).


Figure 2. Smooth Curves Showing Estimates of the Probability of Winning a Professional Basketball Game, Pμ,σ(ℓ, t), as a Function of the Lead ℓ Under the Brownian Motion Model. The top plot is t = .25; the middle plot is t = .50; and the bottom plot is t = .75. Circles ± two binomial standard errors are plotted indicating the empirical probability. The horizontal coordinate of each circle is the median of the leads for the games included in the calculations for the circle.


Figure 3. Estimated Probability of Winning a Professional Basketball Game, Pμ,σ(ℓ, t), as a Function of Time t for Leads of Different Sizes.

Figure 5. Estimated Probability of Winning a Baseball Game, Pμ,σ(ℓ, t), as a Function of Time t for Leads of Different Sizes.

Clearly, the Brownian motion model is not tailored to baseball as an application, although one might still consider whether it yields realistic predictions of the probability of winning given the lead and the inning. Lindsey (1961, 1963, 1977) reported a number of summary statistics, not repeated here, concerning the distribution of runs in each inning. The innings do not appear to be identically distributed due to the variation in the ability of the players who tend to bat in a particular inning. Nevertheless, we fit the Brownian motion model to estimate the probability that the home team wins given a lead ℓ at time t (here t ∈ {1/9, ..., 8/9}). The probit regression obtains the point estimates μ̂ = .34 and σ̂ = 4.04. This mean and standard deviation are in good agreement with the mean and standard deviation of the margin of victory for the home team in the data. The asymptotic standard errors for μ̂ and σ̂ obtained via the delta method are .09 and .10. As in the basketball example, these standard errors are optimistic, because each game is assumed to contribute eight independent observations to the probit regression likelihood, when the eight observations from a single game share the same outcome. Simulations suggest that the standard error of μ̂ is approximately .21 and the standard error of σ̂ is approximately .18. The likelihood ratio test statistic, comparing the probit model likelihood to the saturated model, is 123.7 with 170 degrees of freedom. The continuity correction again has only a small effect.
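Given these estimates, the baseball entries in Table 3 follow from the same win-probability formula used for basketball; the short sketch below (our code, not part of the original analysis) evaluates the μ = .34, σ = 4.04 column for leads of 0 through 4 runs.

```python
from scipy.stats import norm

mu, sigma = 0.34, 4.04
for innings in (3, 5, 7):
    t = innings / 9
    probs = [norm.cdf((lead + mu * (1 - t)) / (sigma * (1 - t) ** 0.5))
             for lead in range(5)]
    print(f"after {innings} innings:", [round(p, 2) for p in probs])
```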

Figure 5 shows the probability of winning in baseball as a function of time for leads of different sizes; circles are plotted at the time points corresponding to the end of each inning, t ∈ {1/9, ..., 8/9}.

Table 3. Pμ,σ(ℓ, t) for Baseball Compared to Lindsey's Results

t      ℓ    μ = .34, σ = 4.04    μ = 0, σ = 4.02    Lindsey
3/9    0    .53                  .50                .50
3/9    1    .65                  .62                .63
3/9    2    .75                  .73                .74
3/9    3    .84                  .82                .83
3/9    4    .90                  .89                .89
5/9    0    .52                  .50                .50
5/9    1    .67                  .65                .67
5/9    2    .79                  .77                .79
5/9    3    .88                  .87                .88
5/9    4    .94                  .93                .93
7/9    0    .52                  .50                .50
7/9    1    .71                  .70                .76
7/9    2    .86                  .85                .88
7/9    3    .95                  .94                .94
7/9    4    .98                  .98                .97

Figure 4. Contour Plot Showing Combinations of Home Team Lead and Fraction of the Game Completed for Which the Probability of the Home Team Winning is Constant for Professional Basketball Data.


Despite the continuous curves in Figure 5, it is not possible to speak of the probability that the home team wins at times other than those indicated by the circles, because of the discrete nature of baseball time. We can compare the Brownian motion model results with those of Lindsey (1977). Lindsey's calculations were based on a Markov model of baseball with transition probabilities estimated from a large pool of data collected during the late 1950s. He essentially assumed that μ = 0. Table 3 gives a sample of Lindsey's results along with the probabilities obtained under the Brownian motion model with μ = 0 (σ = 4.02 in this case) and the probabilities obtained under the Brownian motion model with μ unconstrained. The agreement is fairly good. The inadequacy of the Brownian motion model is most apparent in late game situations with small leads. The Brownian motion model does not address the difficulty of scoring runs in baseball, because it assumes that scores are continuous. Surprisingly, the continuity correction does not help. We should note that any possible model failure is confounded with changes in the nature of baseball scores between the late 1950s (when Lindsey's data were collected) and today. The results in Table 3 are somewhat encouraging for more widespread use of the Brownian motion model.

[Received May 1993. Revised July 1993.]

REFERENCES

Cooper, H., DeNeve, K. M., and Mosteller, F. (1992), "Predicting Professional Game Outcomes From Intermediate Game Scores," Chance, 5(3-4), 18-22.

Lindsey, G. R. (1961), "The Progress of the Score During a Baseball Game," Journal of the American Statistical Association, 56, 703-728.

(1963), "An Investigation of Strategies in Baseball," Operations Research, 11, 477-501.

(1977), "A Scientific Approach to Strategy in Baseball," in Optimal Strategies in Sports, eds. R. E. Machol and S. P. Ladany, Amsterdam: North-Holland, pp. 1-30.

Westfall, P. H. (1990), "Graphical Representation of a Basketball Game," The American Statistician, 44, 305-307.


Part VI
Statistics in Miscellaneous Sports


Chapter 35

Introduction to the Miscellaneous Sports Articles

Donald Guthrie

35.1 Introduction

Statistical studies of sports seem to fall into three categories: analysis of outcomes, analysis of rules and strategies, and analysis of extent of participation. The nine chapters of this section fall into the first two categories; analyses of participation are rare in statistical literature.

Americans, and not just the statistically inclined, seem fascinated with the collection of data about our sports. Every day the sports pages of newspapers are full of raw and summary data about various sports. Consider baseball as a model: elaborate data collection methods have been developed from which we have derived summary statistics, and these statistics have been used to understand and interpret the games. The rules of baseball have remained substantially unchanged for over a century, and data collection on the standard scorecard has left a historical archive of the game. Football, basketball, and hockey have been more subject to rule changes and less amenable to longitudinal measurement, but fans' seemingly relentless need for detailed analysis has led to instrumentation of data as well. Each sport has its repertoire of summary statistics familiar to its fans.

But the collection and interpretation of data is not necessarily limited to the major sports. In the following, we reprint several contributions from sports that occupy less newspaper space but are every bit as important to the participants. Temptation is great to organize them according to the particular game, but that would not reflect the true clustering along categorical lines.

With the possible exception of golf, the sports analyzed in these chapters do not have strong followings in the United States. Worldwide, however, football (soccer to Americans) is certainly at the forefront. Numerical analysis of football has perhaps suffered from the absence of American zeal for detail. Indeed, with the increasing popularity of soccer in the United States and increasing worldwide television coverage, we have seen more and more ad hoc statistics associated with soccer.

Instead of the popularity of the games discussed, the papers in the following chapters have been chosen to reflect their application of statistical reasoning and for their innovative use in understanding the sport discussed. Participation in these games ranges from extensive (golf, tennis) to limited (figure skating, darts). All of the chapters, however, offer insight into the games and into the corresponding statistical methodology. In all chapters, the authors are proposing a model for the conduct of the competition, applying that model, and interpreting it in the context of the sport.

The chapters are organized into two groups, the first dealing with rules and scoring and the second concerned with results.

35.2 Rules and Strategy of Sports

Chapters 36, 38, 39, 41, and 44 of this section focus on rules and strategy. In Chapter 39, "Rating Skating," Bassett and Persky give an elegant description of the scoring rules for high-level ice skating competition. These authors formalize two requirements for a function determining placing, and they show that the median rank rule provides the proper placing given their two requirements. Finally, they argue convincingly that the current system for rating ice skating correctly captures the spirit of majority rule, and that the method effectively controls for measurement error. They conclude by commending skating officials for having arrived at a reasonable system, even though there is no evidence that it was determined with these or similar criteria in mind.

In Chapter 36, Stern and Wilcox, in the Chance article "Shooting Darts," explore alternative strategies for skillful darts competitors. Using some empirical evidence (presumably collected in an unnamed British pub), they model the error distribution of the dart locations using Weibull-based methods and then use simulation to identify optimal target areas. The results of this chapter suggest aiming strategies based on the accuracy of the shooter.

In Chapter 41, "Down to Ten: Estimating the Effect of a Red Card in Soccer," Ridder, Cramer, and Hopstaken estimate the effect of player disqualification (receiving a "red card") in soccer/football. They gather data from 140 games in a Dutch professional league, develop a model for the progress of the score in a match that adjusts for possible differences in team strength, and estimate the likelihood that a team losing a player eventually loses the game. The interesting result is that a red card, especially early but even later in a match, is devastating. They then use their scoring model to estimate the effect of a penalty period rather than disqualification. They conclude that a 15-minute penalty would have an effect similar to that of disqualification with 15 minutes remaining in the match, but the effect on the game outcome would be the same regardless of when the penalty occurred.

Scheid and Calvin, both consultants to the United States Golf Association (USGA), contribute Chapter 38, "Adjusting Golf Handicaps for the Difficulty of the Course." Handicaps are designed for matching players of unequal skills in order to provide level competition. Handicaps are widely used by golfers below the highest level, and are thus a vital element of the popularity of the game. Ordinarily, a player will have a rating (his/her handicap) determined at the course he/she plays most frequently; this rating may or may not be appropriate on another course. This chapter performs the practical service of explaining clearly the history of the USGA method of rating courses and suggesting corrections in handicaps. It describes the history of the Slope System and its application to golf courses. Every golfing statistician—and there are many—should benefit from this information.

In Chapter 44, Wainer and De Veaux, both competitive swimmers, argue in "Resizing Triathlons for Fairness" that the classical triathlon is unfair. These authors give statistical evidence that the swimming portion of the competition is unduly less challenging than the running and cycling segments. They suggest a method of determining segment lengths that would provide fairer competition. Although one may not accept all of their arguments (e.g., basing marathon standards on a record set at Rotterdam, the flattest course known), they make a good case for adjustment of distances, but they only casually mention energy expenditure as a standard. Like darts and golf, racing is a popular form of competition, and a triathlon combines several types of racing (running, biking, swimming) requiring several talents. The paper has stimulated and will continue to stimulate vigorous discussions among competitors.

35.3 Analysis of Sports Results

Chapters 37, 40, 42, and 43 of this section evaluate the results of competition between players and teams. Although head-to-head competition is a standard method, many observers are not convinced that the winner of this type of competition is actually the best team or player. For example, one major league baseball team may win the largest number of games and be declared the "best team," even though this team may lose a large fraction of the face-to-face encounters with particular opponents. Chapters 40 and 43 in this section address the issue of team comparison in the presence of direct competition, but with consideration given to play against common opponents or in different settings. While these chapters look at soccer/football and sprint foot racing, it seems likely that similar analyses could be applied to other sports. Occasional upsets, thankfully, add interest to competition; both individual sports (e.g., tennis, golf, running) and team sports bring upsets. The purpose of these statistical analyses is to add credibility to declaring the overall leader.

In Chapter 40, "Modeling Scores in the Premier League: Is Manchester United Really the Best?," Lee uses Poisson regression to model team scores in the 1995/96 Premier League soccer league. As a general opinion, Manchester United has been regarded as the soccer equivalent of the New York Yankees—they do not win all of their games, but are consistently at the top of the standings. One thousand replications of that season are simulated using the fitted Poisson model, and 38% show United to be the league winner, although their number of standings points in the actual season (82) is somewhat greater than the average simulated points (75.5). The simulations, however, justify the disappointment of Liverpool fans since their team actually scored only 71 points, but the simulations suggested that the team should have scored 74.9, nearly equal to United; Liverpool "won" 33% of the simulated seasons. Football as played in the Premier League has lower scores than are typical in Major League Baseball, yet the methods used in this analysis could possibly be applied to baseball or ice hockey. Perhaps the Boston Red Sox could be vindicated for their difficulties with the Yankees!


In Chapter 43 Tibshirani addresses the question of "Who Is the Fastest Man in the World?" He tries to compare the brilliant sprinters Michael Johnson (USA) and Donovan Bailey (Canada) based on the 1996 Olympic outcomes at 100 and 200 meters. After building and applying several models to account for the fact that the two did not compete against one another, he concludes by presenting arguments favoring each runner, and thus leaves the question unanswered. This discussion, however, reflects one good aspect of statistical practice. That is, it is useful to reformulate questions in a more accessible way and in the context of the problem at hand. There is little doubt that Bailey attained the faster maximum speed (he needed to run only 100 meters), but Johnson maintained his speed over a longer time (he ran only the 200 meter race).

Berry's analysis of data provided by the Professional Golf Association (PGA) in Chapter 37, "Drive for Show and Putt for Dough," opens some interesting issues. He addresses the relative value of longer and shorter shots in the accumulation of the scores of high-level professional golfers. His analysis is very thoughtfully conducted given that data collected by the PGA are rather tangential to his objective, and indeed to an understanding of the game. Nevertheless, he manages to give support to the slogan in his title and to quantify these components. His graphical summary vividly illustrates the compromise between driving distance and accuracy and the effects of each of these qualities on the final score. Finally, he gently encourages the collection of more specific data on the golf tour. Let's imagine the fun we could have in data analysis if each golf shot were monitored by a global positioning system (GPS) device to set its exact position.

Jackson and Mosurski continue the "hot hand" debate in Chapter 42, "Heavy Defeats in Tennis: Psychological Momentum or Random Effect?" These authors look at data from a large number of sets played at the U.S. Open and Wimbledon tennis tournaments. They question whether the heavy defeats observed in these matches are due to a dependence structure in the data, represented by a psychological momentum (PM) model, or due to random variation in players' abilities from match to match, represented by a random effects model. In their analysis, the PM model provides a better fit to these data than the random effects model. However, the PM is not as successful in explaining the variation in the results in some of the titanic rivalries in professional tennis.


Chapter 36

A STATISTICIAN READS THE SPORTS PAGES

Hal S. Stern, Column Editor

Shooting Darts

Hal S. Stern and Wade Wilcox

The major team sports (baseball, basketball, football, hockey) receive most of the attention from the popular media and have been the subject of much statistical research. By contrast, the game of darts is rarely covered by television or written about on the sports pages of the local newspaper. A project carried out while one of the authors (Wilcox) was an undergraduate demonstrates that even the friendly, neighborhood game of darts can be used to demonstrate the benefits of measuring and studying variability. Of course, darts is more than just a friendly, neighborhood game; there are also professional players, corporate sponsors, leagues, and tournaments. An effort to model dart throwing was motivated by a series of observations of computerized dart games—when the game was running on "automatic" its performance was too good. The computer occasionally missed its target but did not produce the occasional sequences of mistakes that haunt even top players. We wondered how one would build a realistic model for dart playing and how such a model could be used.

A Brief Introduction to Darts

We begin by describing the dart board in some detail andthen briefly describing how dart games are played. A pictureof a dart board is presented in Fig. 1. Darts are thrown at theboard and points are awarded for each dart based on wherethe dart lands. The small circle in the center corresponds tothe double bullseye (denoted DB), which is worth 50 points,

Column Editor: Hal S. Stern, Department of Statistics, Iowa State University of Science and Technology, Snedecor Hall, Ames, Iowa 50011-1210, USA; [email protected].

and the ring directly surrounding this is the single bullseye (SB) worth 25 points. Beyond those areas, the circular board is divided into 20 equal-sized arcs with basic point values given by the numbers printed on the arcs (which run from 1 to 20). There are two rings that pass through each of the 20 arcs, one ring in the middle of each arc, and one ring at the outside edge of each arc. A dart landing in the middle ring receives triple value—that is, three times the number of points indicated on the relevant arc. A dart landing in the outer ring receives double value—that is, twice the number of points indicated on the relevant arc. We refer to the point values associated with different regions by referring to the arc and the multiple of points that are awarded, so that T18 refers to triple value in the arc labeled 18 (54 points), D18 refers to double value in the arc labeled 18 (36 points), and S18 refers to single value in the arc labeled 18 (18 points). It follows that the most points that can be earned with a single dart is 60, by hitting the triple value ring of the arc valued at 20 points—that is, T20.

A fundamental aspect of the most popular dart games (called 301 and 501) is that players try to accumulate points as fast as possible (some other games place a greater premium on accuracy). The fastest way to accumulate points is to continually hit the triple 20 (T20), which is worth 60 points. Note that this is considerably more valuable than the double bullseye (50 points). In fact, the next most valuable target is T19. A closer examination of the board shows that the arcs next to the arc labeled 20 are worth only 1 and 5 points, but the arcs that surround the arc labeled 19 are worth 3 and 7 points. Is it possible that a player would accumulate more points in the long run by aiming for T19 rather than T20? Of course, the answer may well depend on the skill level of the player. An additional aspect of these games puts a premium on accuracy; for example, the game 301 requires a player to accumulate exactly 301 points, but the player must begin and end by placing a dart in the double-value ring. Are some strategies for playing this game more effective than others? A model that takes into account the skill level of a player and produces realistic throws would allow us to answer questions like those posed here.


Figure 1. Picture of the point-scoring area of a dart board. The innermost circle is the double bullseye worth 50 points; this is surrounded by the single-bullseye area worth 25 points. The remainder of the board is split into 20 arcs with basic point values indicated on each arc. The small area on the outermost ring of each arc has double point value, and the area in the intermediate ring of each arc has triple point value.

How to Model Dart Throws

A Sherlock Holmes quote from The Adventure of the Copper Beeches seems relevant here: "'Data! data! data!' he cried impatiently. 'I can't make bricks without clay.'" We will have a hard time developing or testing a probability model for dart throws without some data. Four different points on the dart board were chosen (the center of the regions corresponding to T20, T19, D16, and DB) and approximately 150 throws were made at each of the targets. Naturally not every throw hits the target point or even the target region. The darts distribute themselves in a scatter around the target point. The distance from the target was measured for each dart (to the nearest millimeter) and the angle at which the dart landed in relation to the target was also recorded. There were no obvious differences among the four sets of throws (corresponding to the four different targets), so the data have been combined—in all, 590 throws were made. The error distribution for the 590 throws is provided in Fig. 2. Notice that 57% of the darts are 25 mm or less from the target (this is about one inch) and 76% of the darts are 33 mm or less from the target. The angles appeared consistent with a uniform distribution; no direction appeared to occur more frequently than any other. The only formal statistical test that was performed found that the proportion of throws in each of the four quadrants surrounding the target point did not differ significantly from what would be expected under a uniform distribution for the angle of error.

The error distribution of Fig. 2 clearly does not follow a normal distribution. This is to be expected because the

errors as measured are restricted to be positive and are not likely to follow a symmetric distribution. An alternative model immediately suggests itself. Suppose that horizontal and vertical throwing errors are independent of each other, with both having normal distributions with mean 0 and variance σ^2. The radial distance R that we have measured is the square root of the sum of the squared horizontal and vertical errors. Under the assumptions that we have made, the radial error would follow a Weibull distribution (with shape parameter two) with probability density function

f(r) = (r/σ^2) exp(-r^2/(2σ^2))

for r > 0. The Weibull distribution is a flexible two-parameter distribution (the shape parameter and a scale parameter) that is used often in research related to the reliability of products or machine parts. It should be pointed out that, as often occurs, the continuous Weibull model is used even though the data are only recorded to the nearest millimeter.

The scale parameter of the Weibull distribution, σ, is a measure of player accuracy. (Technically the scale parameter for the Weibull distribution that we have set up is σ√2, but we find it more convenient to focus our attention on the vertical/horizontal standard deviation.) The value of σ will differ from player to player, with smaller values of σ indicating better (more accurate) throwers. There are several statistical techniques that can be used to estimate the parameter σ

for a given dataset. The most natural estimate is based on the fact that the radial error squared, R^2, should be near 2σ^2 on average. Therefore we can estimate σ by taking the square root of one-half of the average squared radial error.

Figure 2. Histogram indicating the distribution of the distance from the target point. The total number of tosses is 590, spread over four different targets. The superimposed curve is the Weibull distribution with σ = 19.5 (the formula for the probability density function is provided in the text).


Table 1—Actual and Simulated Distribution of Outcomes From Dart Tosses Aimed at Triple-Value Twenty (T20)

Outcome   Actual %   Simulated %
T20         12.0        10.9
S20         52.0        46.7
T5           2.7         3.6
S5           8.0        15.4
T1           6.7         4.1
S1          17.3        16.1
Other        1.3         3.2

Note: The actual distribution is based on 150 throws; the simulated distribution is based on 10,000 simulated throws.

The estimate of σ for the data in Fig. 2 is 19.5 with approximate error bounds ±.8. The Weibull distribution with σ set equal to this value is superimposed on the histogram in Fig. 2. We have intentionally used a display with very narrow bins to preserve the information about the data values.
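A minimal sketch of this moment estimator, assuming the radial errors have already been measured in millimeters (the values below are made up for illustration, not the 590 recorded throws):

```python
import numpy as np

# Hypothetical radial errors (mm); in practice these would be the measured throws.
r = np.array([12.0, 8.0, 31.0, 22.0, 17.0, 40.0, 9.0, 26.0, 14.0, 19.0])

# Under the model E[R^2] = 2*sigma^2, so estimate sigma by the square root of
# one-half of the average squared radial error.
sigma_hat = np.sqrt(np.mean(r ** 2) / 2.0)
print(f"estimated sigma: {sigma_hat:.1f} mm")
```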

Does the Model Fit?

The fit of the Weibull model to the errors observed in throwing darts can be assessed in several ways. The histogram of Fig. 2 provides some evidence that the model fits. A histogram with wider bins (not shown) provides a smoother picture of the error distribution, and the Weibull curve provides a closer match in that case. Figure 3 shows the information from the sample in terms of the cumulative distribution function that measures the proportion of throws with errors less than or equal to a given value. The smooth curve is the Weibull model and the step function that accompanies it is based on the data. As an additional check on the fit of the model, three other dart players agreed to throw 600 darts at the same targets. We have created versions of Figs. 2 and 3 for these players (not shown), and the resulting data appear to be consistent with the Weibull model. Two professional players (these are players who are basically supported by sponsors and prize money) had standard deviations of 13.4 and 13.2. Errors bigger than 40 mm (about 1.5 inches) occur in only 1 of 100 tosses for such players.

As a final check on the fit of the model, we wrote a computer program to simulate dart throwing. The simulation program can be used to simulate throws at any target point (e.g., the target point T20 represents the central point in the small arc associated with triple value and 20 points). A radial error is generated randomly from the Weibull distribution and then a random angle is generated uniformly over the circle. A simulated outcome is obtained by computing the number of points that correspond to the computer-generated position. Table 1 gives the frequency of the most relevant outcomes for 150 actual tosses and 10,000 simulated tosses at T20. It should be pointed out that the exact probabilities of each outcome under the Weibull model could be evaluated

Figure 3. Cumulative distribution function for the observed error distribution (step function) and the Weibull distribution with σ = 19.5 (smooth curve).

exactly using numerical integration. Simulation provides a relatively fast and accurate approximation in this case. In general the agreement between the actual and simulated outcomes is quite good; an overall chi-squared goodness-of-fit test indicates that the actual tosses are not significantly different from what would be expected under the Weibull model (treating the simulated values as if they were exact). The largest discrepancy that we observe is that the number of S5 outcomes observed is smaller than expected under the Weibull model. There is no obvious explanation for this discrepancy based on reviewing the location of the actual tosses.
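The simulation is easy to reproduce. The sketch below assumes conventional dartboard dimensions and sector ordering (these numbers are not given in the article); the throwing model follows the text: a Weibull radial error with shape two (equivalently a Rayleigh error with scale σ) and a uniformly distributed angle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conventional dartboard geometry (mm), not taken from the article.
SECTORS = [20, 1, 18, 4, 13, 6, 10, 15, 2, 17, 3, 19, 7, 16, 8, 11, 14, 9, 12, 5]
R_DB, R_SB = 6.35, 15.9              # double and single bullseye outer radii
R_TRI_IN, R_TRI_OUT = 99.0, 107.0    # triple ring
R_DBL_IN, R_DBL_OUT = 162.0, 170.0   # double ring (edge of the scoring area)

def points(x, y):
    """Score of a dart landing at (x, y), with the origin at the board center."""
    r = np.hypot(x, y)
    if r <= R_DB:
        return 50
    if r <= R_SB:
        return 25
    if r > R_DBL_OUT:
        return 0
    theta = np.degrees(np.arctan2(x, y)) % 360.0      # clockwise from the 20 sector
    base = SECTORS[int((theta + 9.0) // 18.0) % 20]
    if R_TRI_IN < r <= R_TRI_OUT:
        return 3 * base
    if R_DBL_IN < r <= R_DBL_OUT:
        return 2 * base
    return base

def simulate(target, sigma, n):
    """Simulate n throws aimed at target = (x, y) with accuracy parameter sigma (mm)."""
    tx, ty = target
    radius = sigma * np.sqrt(2.0) * rng.weibull(2.0, size=n)   # Rayleigh(sigma) radius
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return [points(tx + r * np.cos(a), ty + r * np.sin(a)) for r, a in zip(radius, angle)]

# Aim at the center of T20 (straight up, middle of the triple ring).
t20 = (0.0, 0.5 * (R_TRI_IN + R_TRI_OUT))
scores = simulate(t20, sigma=19.5, n=10_000)
vals, counts = np.unique(scores, return_counts=True)
for v, c in sorted(zip(vals, counts), key=lambda vc: -vc[1]):
    print(f"{v:3d}: {100 * c / len(scores):5.1f}%")
```

Mapping the scores back to named outcomes (60 = T20, 20 = S20, and so on) should roughly reproduce the structure of Table 1, and averaging the scores for different values of σ gives points-per-dart figures of the kind shown in Table 2.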

Table 2—Expected Number of Points per Dart When Aiming at T19 and T20, Based on 10,000 Simulated Dart Throws (Standard Errors Are in the Range .16-.19)

  σ      Pts per dart      Pts per dart
         if aimed at T19   if aimed at T20
20.0         18.0              17.6
19.5         18.2              17.9
19.0         18.5              18.2
18.5         18.6              18.3
18.0         18.9              18.7
17.5         19.4              19.3
17.0         20.2              20.2
16.5         20.3              20.4
16.0         20.9              21.0
15.5         21.4              21.5
15.0         22.3              22.2


Applying the Model

If we accept the model then it is possible to begin addressing strategy issues. What is the fastest way to accumulate points? The highest point value on the board is T20, which is worth 60 points. This target point, however, is surrounded by the arc worth five points and the arc worth one point, but T19 (worth 57 points) is surrounded by arcs worth seven and three points. An extremely accurate thrower will accumulate points fastest by aiming at T20. Is there an accuracy level (as measured by σ) for which it is better to aim for the T19? Table 2 indicates that there is such a point! Players with high standard deviations (σ > 17) appear to earn more points per dart by aiming at T19, but players with lower standard deviations (more accurate throwers) are better off aiming for T20. The difference between the two strategies is relatively small over the range given in Table 2; the largest difference in means is .34 per dart, which translates into only about 8 points total over the approximately 25 throws that would be required to accumulate 500 points. Even this small amount might be important in a game between evenly matched players.

Of course, to apply Table 2 one needs to know his or her own standard deviation. This can be obtained directly by measuring the error distance for a number of throws at a single target as we have done here. A somewhat easier approach to estimating one's standard deviation would be to throw a number of darts at T20, compute the average score per dart, and use Table 2 to find the corresponding value of σ. For example, an average score of 19.5 (e.g., 1,950 points in 100 darts) would suggest that σ is approximately 17.5 or perhaps a bit lower.
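The lookup just described is simply a linear interpolation in the T20 column of Table 2 (using the values as reconstructed above); a minimal sketch:

```python
import numpy as np

# sigma (mm) and simulated points per dart when aiming at T20, from Table 2.
sigma   = np.array([20.0, 19.5, 19.0, 18.5, 18.0, 17.5, 17.0, 16.5, 16.0, 15.5, 15.0])
t20_pts = np.array([17.6, 17.9, 18.2, 18.3, 18.7, 19.3, 20.2, 20.4, 21.0, 21.5, 22.2])

def sigma_from_average(avg):
    """Rough accuracy parameter implied by an average score per dart at T20."""
    return float(np.interp(avg, t20_pts, sigma))   # t20_pts is increasing

print(sigma_from_average(19.5))   # about 17.4, i.e. "17.5 or perhaps a bit lower"
```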

It is possible to apply our approach to more detailed questions of strategy, but we have not yet pursued that area. For example, suppose that a player in the game 301 has 80 points remaining (recall the last throw must be a double). There are several possible strategies; for example, T16, D16 (i.e., triple 16 followed by double 16) requires only two darts, whereas S20, S20, D20 requires three darts. Which is a better choice? The answer may well depend on

the accuracy of a player because aiming for S20 has a much greater probability of success than any double or triple.

As always, refinements of the model are possible: We have ignored the possibility of systematic bias by assuming a uniform distribution for the angle of error, we have assumed independence of the horizontal and vertical errors, and we have not addressed the issue of darts bouncing off the metal rims that separate areas. Our relatively simple, one-parameter model appears to perform quite well in describing the variability observed in dart throws and provides useful information for determining strategy. In particular, it appears that aiming for the obvious target, T20, is not the best point-accumulating strategy for some players.

Reference and Further Reading

Townsend, M. S. (1984), Mathematics in Sport, Chichester, UK: Wiley.


A STATISTICIAN READS THE SPORTS PAGES

Scott M. Berry, Column Editor

Chapter 37

Drive for Show and Putt for Dough

I hit a beautiful tee shot, 250 yards right down the middle of the fairway. I follow that with a well-struck 5-iron to the green of the difficult 420-yard par 4. I have 20 feet for a birdie. I hit a firm putt that rolls by the hole, and I am left with a four-footer coming back. I push the par putt and tap in for my bogey. It took two shots to travel 420 yards and then three putts to go the remaining 20 feet. All the while my playing partner sliced his drive, mishit a 3-wood to the right rough, then chipped it 12 feet from the hole and made the par-saving putt. If he is kind he says nothing, but usually he utters the well-known cliche, "You drive for show and putt for dough!"

How accurate is this well-known cliche? In this article I investigate the different attributes in golf and their importance to scoring. Many different types of shots are hit by golfers. Is it the awesome length of players or the deft touch around the greens that differentiates golfers in their ability to score well? I study the very best golfers—those on the United States Professional Golfer's Association Tour (PGA Tour). Clearly there are many differences between beginners, 10-handicappers, and professionals. I want to study what differentiates the very best golfers in the world.

Terminology and Available Data

There are par 3, par 4, and par 5 holes on a golf course. Par for a hole is the number of shots it should take, if the hole is played well, to get the ball in the hole. Par is determined by the number of shots it should take to get on the green and then two putts to get the ball in the hole. Therefore, on a par 4, the most common par for a hole, it should reasonably take two shots to get on to the green and then two

Column Editor: Scott M. Berry, Department of Statistics, Texas A&M University, 410B Blocker Building, College Station, TX 77843-3143, USA; E-mail [email protected].

putts for a par. If a golfer hits his ball on the green in two shots less than par, this is referred to as hitting the green in regulation. The first shot on a hole is called the tee shot. For a par 4 or par 5, the tee shot is called a drive—the goal of which is to hit the ball far and straight. For each par 4 and par 5 there is a fairway from the tee-off position to the green. The fairway has short grass that provides an optimal position for the next shot. On either side of the fairway can be rough, water, sand traps, and various other hazards. These hazards make it more difficult for the next shot or add penalty strokes to your score. For a par 3, the green is close enough that the golfer attempts to hit his tee shot on the green. This is usually done with an iron, which provides precision in accuracy and distance.

I decompose a golfer's skill into six different attributes. These are driving distance, driving accuracy, iron-shot skill, chipping ability, sand-trap skill, and putting ability. There are other skills for a golfer, but these six are the most important. A goal of this article is to find the importance of each



of these attributes for professional golfers. The PGA Tour keeps various statistics for each golfer, and these will be used to measure each of the preceding attributes.

For each round played (this consists of an 18-hole score), two holes are selected that are flat and are situated in opposite directions. Holes are selected where players generally hit drivers. The hope is that by measuring the distance traveled for tee shots on these holes the wind will be balanced out and the slope of the terrain will not affect the results. The distance of the tee shots for each of the golfers on these holes is measured, and the average of these drives is reported (I label it distance). For each of the par 4 and par 5 holes the PGA Tour records whether the golfer hit his tee shot in the fairway or not. This variable, the percentage of times the golfer hits the fairway with his tee shot, is a good measure of driving accuracy (labeled accuracy). The PGA Tour has recently changed the way it rates putting ability. It used to keep track of how many putts each golfer had per round. This is a poor measure of putting because there may be very different attempts by each player. One player may hit every green in regulation and therefore his first putt is usually a long one. Another player may miss many greens in regulation and then chip the ball close to the hole. This leaves many short putts. Therefore, the PGA Tour records the average number of putts per hole when the green is hit in regulation. There can still be variation in the length of the putts, but it is a much better measure of putting ability (labeled putts).

Measuring chipping ability is not as straightforward. The PGA Tour records the percentage of times that a player makes par or better when the green is not hit in regulation. The majority of the time this means that the ball is around the green and the player must attempt to chip it close to the hole. If the player sinks the next putt, then a par is made. This statistic, scramble, is a measure of a player's chipping ability but also of the player's putting ability (I delete all attempts from a sand trap; these will be used to measure sand ability). The player may be a bad chipper but a great putter; thus he is successful at scrambling. To measure the chipping, separated from the putting ability, I create a new statistic, labeled

chipping. I ran a simple linear regression, estimating scramble from putts (technically, scramble is a percent and a logistic regression would be more reasonable, but the values for scramble are between 40% and 70% and the linear regression does just as well):

scramble = 249.3 - 106.0 putts

The goal of this method is to estimate how much of the scramble statistic is explained by their putting ability. For each player, the aspect of scramble that cannot be explained by putting is his chipping ability, which is represented by the residuals from this regression. These residuals are used to measure chipping (and labeled chipping). In Fig. 1, scramble is plotted against putts. The regression line is also plotted. As this figure shows, some of the variability in scramble is because of putting ability. The residuals from this regression, the distance between the points and the line, is the part of scramble that is not explained by putting. This residual is what is used as the measure of chipping ability. As can be seen in Fig. 1, Brad Faxon is a mediocre player for scramble, but because of his great putting ability he should be a much better scrambler. Therefore, his residual is negative and large in magnitude, showing poor chipping ability. Robert Allenby has a pretty good scramble value, which is that much more amazing because he is not a great putter. Therefore, his residual is quite large, and positive, showing great chipping skill.

The same problem exists for the sand saving ability of a player. The PGA Tour records the percentage of times that a player saves par when he hits his ball in a green side sand trap. Again putting ability helps to save par. The residuals from the following regression are used as a measure of sand play (and labeled sand):

Figure 1. A scatterplot of scramble against putts. The regression line is shown. The residuals from this regression line are used to measure chipping ability.


sand save = 190.3 - 78.4 putts

The PGA Tour does not keep any good statistics to measure iron-play ability. What would be ideal is to know how close a player hits the ball to the hole when he has a reasonable chance. For example, in the U.S. Open Championship, the sponsoring organization, the United States Golf Association, kept track of the percentage of greens hit in regulation when the fairway was hit. The PGA Tour does not keep track of the same statistic. The PGA Tour does keep track of the performance for each player on par 3 holes. The relation to par for each player on all the par 3's is reported. Virtually every course has four par 3 holes. Therefore, the average score on each par 3 can be calculated. Each player has an ideal starting position, similar to a second shot on a par 4 when the player has hit a fairway. This score is also dependent on the putting, chipping, and sand play of the players. A player may be a bad iron player yet score well because he chips and putts well. Therefore, as with the chipping and sand variables, I regress the average score on the par 3 holes on the putts, chipping, and sand variables:

par 3 = 1.89 + 0.67 putts - .0010 sand - .0020 chipping

The residuals from this regression are used as the measure of iron play (labeled irons). A positive value of chipping is good, a positive value of sand is good, and a negative value of irons is good.
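A sketch of this residual construction, assuming a data frame of raw PGA Tour statistics (the column names and illustrative values are hypothetical): the chipping, sand, and irons measures are simply least-squares residuals from the regressions described above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-player data with the raw statistics described in the text.
df = pd.DataFrame({
    "putts":    [1.76, 1.79, 1.82, 1.78, 1.80],   # putts per green in regulation
    "scramble": [62.0, 58.5, 54.0, 60.0, 56.5],   # par-or-better % when green missed (non-sand)
    "sandsave": [55.0, 50.0, 47.0, 52.0, 49.0],   # par-or-better % from greenside sand
    "par3":     [3.02, 3.08, 3.11, 3.05, 3.09],   # average score on par-3 holes
})

def residuals(y, X):
    """Least-squares residuals of y regressed on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Chipping and sand are the parts of scrambling not explained by putting;
# irons is the part of par-3 scoring not explained by putting, sand, or chipping.
df["chipping"] = residuals(df["scramble"].to_numpy(), df[["putts"]].to_numpy())
df["sand"]     = residuals(df["sandsave"].to_numpy(), df[["putts"]].to_numpy())
df["irons"]    = residuals(df["par3"].to_numpy(),
                           df[["putts", "sand", "chipping"]].to_numpy())
print(df[["chipping", "sand", "irons"]])
```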

I downloaded each of the preceding statistics for the 1999 season from the PGA Tour Web site, pgatour.com. This was done after the first 28 tournaments of the 1999 season—the Mercedes Championship through the Greater Milwaukee Open. Each of the 195 players who played at least 25 rounds during this time is included in the dataset. The scoring averages for each of these 195 players are also recorded.

The main goal of this article is to show the importance to scoring of each of the attributes—not necessarily a judgment of the players. This is important because the players do not all play the same tournaments, and thus the courses have varying difficulty. Although this may affect comparisons across players it does not create as many difficulties in evaluating the effect of the attributes. Courses are easier or harder because the fairways are smaller, the greens more difficult, or the course longer. Therefore, if a player plays an "easy" course, he will hit more fairways, make more putts, and so forth, but his scores will be lower. That golfer may not be as good as another who has a higher scoring average but plays harder courses, but the effect of the attributes will still hold. The PGA has a scoring average measure adjusted by the difficulty of each round, but I do not use this measure for the reasons just described. These

adjusted scores do not vary much from the unadjusted scoring average—thus the fact that different courses are played should have minimal effect on this analysis.

Descriptive Analysis and the Model

Tables 1 and 2 present some summary numbers for the population of players for each of the variables. Figure 2 presents the scatterplot for each of the attributes and scoring. Seemingly there is not a large variation in putting across players. The standard deviation is .03 putts per green in regulation. If a player hits 10 greens per round, this is a difference of .3 shots per round. Although this does not seem that large, for this group of top-notch players putting is the most highly correlated attribute with scoring. This strong pattern is also demonstrated in the scatterplot. Sand is the least correlated with scoring. This is most likely because hitting sand-trap shots is not very common. Interestingly, distance has the second smallest correlation with scoring, though the direction of the correlation is intuitive—the longer the drive the smaller the average scores. Driving distance and accuracy have the highest absolute correlation among the attributes. This correlation is negative, as would be expected; the farther the ball travels, the less accuracy there will be. Irons is uncorrelated with sand, chipping, and putts because of its construction as the residuals from the regression of par 3 using these three variables. Likewise sand and chipping are uncorrelated with putts.

The linear model with no interactions for modeling scoring from the six attributes has a residual standard deviation of .346 with an R^2 of .829. Each of the attributes is highly significant. I tested each of the two-way interactions. The only significant interaction is between distance and accuracy.

Table 1—Descriptive Statistics for the Seven Variables

Statistic    Scoring   Distance   Accuracy   Putts    Sand     Chipping   Iron
Mean          71.84     272.2      67.4%      1.79      0         0         0
Std. Dev.       .82       8.12      5.23%      .030     6.01      7.58      .051
Min           69.76     250        50.7%      1.71    -16.7    -31.17     -.11
Max           74.69     306        80.90%     1.88     14.34    16.30      .18

Table 2—Correlation Matrix for the Seven Variables

           Score   Distance   Accuracy   Putts    Sand    Chipping   Iron
Score      1.000    -.194      -.378      .657    -.163    -.366      .298
Distance   -.194    1.000      -.413     -.047    -.091    -.068      .108
Accuracy   -.378    -.413      1.000     -.045    -.012     .265     -.258
Putts       .657    -.047      -.045     1.000     .000     .000      .000
Sand       -.163    -.091      -.012      .000    1.000     .057      .000
Chipping   -.366    -.068       .265      .000     .057    1.000      .000
Iron        .298     .108      -.258      .000     .000     .000     1.000


Figure 2. Scatterplots for each of the six attributes plotted against scoring average.

Table 3—Summary of Multiple Regression Model for the Scoring Average Based on the Six Attributes

              Coefficient   Std. error   T value   P value
Intercept        27.5          8.46        3.24     .0014
Distance          .067          .032       2.13     .0347
Accuracy          .373          .128       2.91     .0040
Putts           16.87           .83       20.32     .0000
Sand             -.0257         .0041     -6.35     .0000
Chipping         -.0303         .0033     -9.11     .0000
Iron             3.75           .49        7.65     .0000
Dist*Accu        -.0016         .0005     -3.37     .0009

The resulting model, including this interaction, is summarized in Table 3. Although there is not a huge increase in the model fit over the "no interaction" model, it makes sense that there would be an interaction between distance and accuracy, and thus I use the model with the interaction. The implications of this interaction are discussed in the next section.
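A sketch of how the Table 3 model could be fit with standard software; the file name is hypothetical, and the data frame is assumed to hold one row per player with the variables constructed above.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pga1999.csv")   # hypothetical file with the 195-player dataset

# 'distance * accuracy' expands to both main effects plus their interaction,
# mirroring the model summarized in Table 3.
fit = smf.ols("scoring ~ distance * accuracy + putts + sand + chipping + irons",
              data=df).fit()
print(fit.summary())
print("residual standard deviation:", fit.mse_resid ** 0.5)
```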

David Seawell, one golfer, is an outlier in both scoring and scrambling. He has by far the worst chipping statistic, -31.17 (the second lowest is -19.0), and by far the worst scoring average, 74.69 (the second worst is 73.86). There was virtually no change in the model when Seawell was removed; therefore, I left him in to fit the model.

To give some notion of importance to these attributes, I present seven different golfers. "Median Mike" has the population median for each of the attributes. Each of the

following players is at the median in every category except one—in which they are at the 90th percentile (or the 10th percentile if low values are good). These players are Distance Dan, Accurate Al, Putting Pete, Sandy Sandy, Chipping Chip, and Iron Ike. The category in which they are better than the median should be clear!

The estimated scoring average for each of these players is Putting Pete 71.14, Distance Dan 71.29, Accurate Al 71.32, Iron Ike 71.43, Chipping Chip 71.46, Sandy Sandy 71.49, and Median Mike 71.70. As the title cliche claims, you putt for dough! Putting Pete is the best of these players. Thus, if a player is mediocre at everything—except one thing at which he is pretty good—he is best off if that one thing is putting. Distance is not just for show and neither is accuracy. Distance Dan is the second best player, and Accurate Al is the third best. I was surprised that chipping was not more important. This may be because the quality of these

players is such that they do not need to chip much. Their ability to save par when they miss a green is also partially explained by their putting ability, which is important.

Total Driving

This analysis clearly shows that putting is very important for PGA Tour players—if not the most important attribute. This does not mean that what separates them from the 20-handicapper is their putting. I think you would find putting relatively unimportant

when comparing golfers of a wider class of overall ability. The long game (driving ability and iron play) would be very important. There is another interesting question that can be addressed from this model: How can you characterize driving ability? The PGA Tour ranks players in total driving. It ranks each of the golfers in distance and accuracy separately and then sums the ranks. I think combining ranks is a poor method of combining categories, but aside from this, it weighs each of them equally. From the model presented here, I can characterize the importance of each. I fix each attribute at the population median value, except distance and accuracy. Figure 3 shows the contours for distance and accuracy in which the estimated scoring average is the same. Two players on the same contour have the same estimated mean score and thus have the same driving ability.


This graph demonstrates the importance of the interaction between distance and accuracy. If there were no interaction in the model, these contours would be linear, with a slope of .66. The linear contours would imply that, regardless of your current driving status, you will improve your score an identical amount if you gain one yard in average


driving distance or a .66% increase in fairways hit percentage. Figure 3 shows that this is not true with the interaction present. Compare John Daly (distance of 306 yards) to Fred Funk (accuracy of 80.9%). Daly would need a huge increase in distance to improve, whereas a smaller increase in accuracy would result in a substantial improvement. The opposite is true for Funk; a large increase in accuracy would be needed to improve, but adding distance would result in an appreciable increase. Funk hits the fairway so often that adding yardage would improve virtually every iron shot (making it shorter). When Daly misses the fairway, it limits

Figure 3. The scatterplot of distance versus accuracy. The lines represent contours of the same estimated scoring average based on these two variables and the player being at the median for putts, sand, chipping, and irons.

the second shot, and, with him missing so often, it doesn't matter how much closer he is when he hits the fairways. He does well on those holes. It is surely the holes on which he misses the fairway that are causing troubles.

Most likely, if John Daly tries to improve his accuracy, it will decrease distance, or if Fred Funk tries to increase distance, it will decrease his accuracy. This function will enable them to decide whether the trade-off is beneficial or not. For example, Daly's current estimated scoring average is 72.34 (actual is 72.73). If he could sacrifice 10 yards distance for a 5% increase in accuracy, his estimated scoring average would improve by .30 shots per round. Fred Funk's estimated scoring average is 70.75 (actual is 70.70). If he could sacrifice 5% in accuracy for an extra 10 yards in distance (the opposite of Daly) his estimated scoring average would improve by .25 shots per round. In this example, 10 yards to Daly is worth less than 5% in accuracy, but 5% in accuracy is worth less than 10 yards to Funk.
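The arithmetic behind these trade-offs uses only the distance, accuracy, and interaction coefficients from Table 3 (the other attributes cancel when a player is compared with himself). In the sketch below the baseline accuracy for Daly and baseline distance for Funk are placeholders, since the article quotes only Daly's distance and Funk's accuracy.

```python
# Coefficients from Table 3 (distance in yards, accuracy in percent).
B_DIST, B_ACC, B_INT = 0.067, 0.373, -0.0016

def score_change(dist, acc, d_dist, d_acc):
    """Predicted change in scoring average for a player at (dist, acc) who
    moves to (dist + d_dist, acc + d_acc); negative values are improvements."""
    new_dist, new_acc = dist + d_dist, acc + d_acc
    return (B_DIST * d_dist + B_ACC * d_acc
            + B_INT * (new_dist * new_acc - dist * acc))

# Daly: give up 10 yards for 5% more accuracy (65.0 is a placeholder accuracy).
print(round(score_change(306, 65.0, -10, +5), 2))
# Funk: give up 5% accuracy for 10 more yards (262 is a placeholder distance).
print(round(score_change(262, 80.9, +10, -5), 2))
```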

Tiger Woods (distance 293 and accuracy 70.9%) is the highest-rated driver according to this model. He is rated second in the PGA Tour total driving statistic, behind Hal Sutton (distance 277 and accuracy 75.5%).

Discussion

From this model I have concluded that you drive for dough and putt for more dough! A crucial assumption for this conclusion is that you believe that the variables I have used adequately measure the attribute they are designed to measure. I believe they are currently the best available measures. Larkey and Smith (1998) discussed the adequacy of

these measures and made suggestions about ways to improve them. A great reference for any statistics in sports topic is Statistics in Sports (Bennett 1998). Within this volume, Larkey (1998) gave an overview of statistics research within golf.

I believe the PGA Tour could keep much better records of the performance of players. For example, the National Football League keeps track of every play. It reports the down and distance and the yard line for every play. It reports whether it was a run or a pass, the yards gained and lost, and so forth. Any statistic I want can easily be constructed from this data. The PGA Tour could do the same. A record could be kept on each shot taken—whether the shot was in the fairway, the rough, a trap, and so forth. The resulting position of the ball could be reported—this includes the distance to the hole and whether it is on the green, the fringe, in water, in a sand trap, and so forth. From this dataset it would be easy to construct any statistic of interest. Ultimately I would like to


construct a dynamic programming model that could give value to each position of the ball. This would be an amazing tool for golfers and the PGA Tour. It would enable a decision model to be constructed as to whether a player should go for the green on a long par 5: Is the risk worth it to hit over the water or should I lay the ball up short of the water? It would also help in setting up courses for play. I think golf, like baseball, is an ideal game for statistics because each event is an isolated discrete event that can easily be characterized. While I wait for such data to be kept, I will continue playing golf and telling my wife that it is all in the name of research!

References and Further Reading

Bennett, J. (1998), Statistics in Sports, New York: Oxford University Press.

Larkey, P. D. (1998), "Statistics in Golf," in Statistics in Sports, ed. J. Bennett, New York: Oxford University Press.

Larkey, P. D., and Smith, A. A. (1998), "Improving the PGA Tour's Measures of Player Skill," in Science and Golf III, London: E & FN SPON.

Where is the Data?

There are numerous golf sites on the Web. Many will sell you clubs or give information on thousands of different courses. Here are the two best Web sites for information and data. The data used in the article are available for easy download on my web site, stat.tamu.edu/berry.

• pgatour.com: This is the official site of the PGA Tour. It has the official statistics for the PGA Tour, the Nike Tour, and the PGA Senior Tour. The data are not in a good format to download, but they are available. This site also provides live scoring updates during tournaments.

• www.golfweb.com: This site, affiliated with CBS Sports, provides information for the PGA, LPGA, Senior, European, and Asian Tours, as well as amateur events. The statistics are not as extensive, but there is information on more than just the U.S. PGA. This site also has live scoring.


Chapter 38

ADJUSTING GOLF HANDICAPS FOR THE DIFFICULTY OF THE COURSE

Francis Scheid, Professor Emeritus, Boston University, 135 Elm Street, Kingston, MA 02364

Lyle Calvin, Professor Emeritus, Oregon State University

KEY WORDS: Golf, slope, ratings, handicaps.

1. The problem.

In many sports differences in playing ability are wider under difficult circumstances and more narrow when the going is easier. In golf, ability is measured by a player's handicap, which estimates the difference between his or her ability and that of a standard called scratch, in this country the level of the field in the Amateur Championships. Handicaps are generally larger at difficult courses, meaning that differences from scratch are larger. This magnification of ability differences introduces inequity when players move from one course to another.

Early in this century British golfers developed a procedure to ease this inequity but it was abandoned as ineffective. In the early 1970s the problem was revived (1, 2)

and in 1973 GOLF DIGEST published an article (3) in which handicap adjustments were described. In the late 1970s the United States Golf Association (USGA) organized a handicap research team (HRT) to study problems of equity in play. In one of its recent studies (4) the magnification of ability differences was consistently detected for courses differing in length by only 400 yards, about the difference between back tee and front tee length at many courses. The present report describes the process developed by the HRT and used by the USGA to deal with the magnification problem. It is called the Slope System.

2. Slope.

Plots of expected score against home course (local) handicap for play at dozens of courses strongly suggest a straight line relationship, and regressions usually lead to correlations in the .90s. Although the data for extremely high and low handicaps is thin, it is assumed that linearity prevails across the entire handicap spectrum. No evidence to the contrary has been found.

The problem of achieving handicap portability has led to the idea of measuring golfers by their performance on a standardized golf course, affectionately called Perfect Valley. It is assumed that the resulting relationship between expected scores and standardized handicaps, to be called indexes, will still be linear. Figure 1 exhibits this assumption in a typical plot of expected scores against index at golf course C.


The line for course C has slope Sc and the product 100 Sc is the slope of the course. Expected differentials, which are scores minus scratch rating (a measure of course difficulty), for players with indexes I and 0 are shown starting at the level of the rating and running upward to the line, and we see that

(1) dc(I) = Sc I + dc(0)

where d is a differential. It holds for any course and any I, assuming only the linearity. It follows easily from (1) that

(1a) dc(I1) - dc(I2) = Sc (I1 - I2)

for any indexes I1 and I2. This is the magnification effect. If course A has higher slope than course B, the performance measure on the left side of (1a) will be larger at A, for the same I1 and I2.

3. The mapping.

Since (1) holds for any course we can take C to be our reference course Perfect Valley, for which Spv has been defined as

(2) Spv = 1.13,

the average figure found when scores were first regressed against local handicaps (now closer to 1.18).


In this special case the index I also serves as the local handicap. From (1) and (2) it follows that

(3) dpv(I) = (1.13/Sc) dc(I) + T

where

T = dpv(0) - (1.13/Sc) dc(0)

depends only on the 0 index player, called scratch. Equation (3) holds for any course C and any index I.

Now assume that (3) can be applied not only to the expected differentials but to any pair dc(I) and dpv(I). This is a broad assumption with interesting consequences. It is based largely on intuition and amounts to saying that a good day at course C would have been a good day at PV, and likewise for bad days. For each value of I it provides a mapping of the distribution of differentials at C to one at the reference course PV. (See Figure 2.)

The mapping is thus defined by

(4) dpv(I) = (1.13/Sc)dc(I) + T

The fact that expected values map to one another is the content of (3).

A bit of background will be useful in simplifying this mapping. Golf handicaps are not found by averaging all of a player's scores relative to scratch ratings, that is, not all the differentials. For a number of reasons, to emphasize a player's potential rather than average ability and to preserve some bonus for excellence, only the better half are used. And for consistency, golf courses are not rated

using all scores at the Amateur Championships, only the better half. Scratch players are those whose better half performances match the scratch ratings. It follows that these players will have handicap 0 on all courses. This "scratch on one course, scratch on all" is the scratch principle. (It must also be mentioned that better half averages are multiplied by .96 to produce handicaps, adding slightly to the emphasis on potential ability.)

Now, since (4) is assumed for all differentials it also holds for the expected better halves at each I level, again taken over all players with index I:

(5) bpv(I) = (1.13/Sc) bc(I) + T

where b denotes an expected better-half average. Choose I = 0, the scratch level. Both better half terms in (5) are then 0 by the scratch principle, which means that T = 0 and (4) simplifies to

(6) dpv(I) = (1.13/Sc) dc(I).

4. The Slope System.

Equation (6) is the input procedure for the Slope System. It maps differentials shot at course C to images at PV for use in obtaining or updating a player's index I. Inverting it brings

(7) dc(I) = (Sc/1.13) dpv(I).

Since this applies to all differentials it can be applied to a player's expected better half,

bc(I) = (Sc/1.13) bpv(I),

which, multiplied by .96, becomes

(8) Hc = (Sc/1.13) I


since a Perfect Valley handicap is an index. Equation (8) is the system's output procedure. It converts the player's index I to a suitable handicap for play at course C. Together (6) and (8) convert scores shot at any set of courses into a standardized handicap, and the standardized handicap into a local handicap for local purposes.
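A compact sketch of the two procedures, assuming slope ratings expressed on the usual 100 Sc scale (so Perfect Valley has slope 113) and the better-half-times-.96 definition of the index quoted above; details of the full USGA procedure (how many scores are kept, rounding rules) are omitted.

```python
def to_perfect_valley(differential, slope):
    """Input procedure, equation (6): map a differential shot at a course of the
    given slope rating to its Perfect Valley image."""
    return differential * 113.0 / slope

def index_from_differentials(pv_differentials):
    """Index = .96 times the average of the better (lower) half of the PV images."""
    d = sorted(pv_differentials)
    better = d[: max(1, len(d) // 2)]
    return 0.96 * sum(better) / len(better)

def course_handicap(index, slope):
    """Output procedure, equation (8): convert an index to a local handicap."""
    return index * slope / 113.0

# Differentials (score minus scratch rating) shot at courses of slope 135 and 110.
rounds = [(18.0, 135), (22.0, 135), (15.0, 110), (20.0, 110)]
index = index_from_differentials([to_perfect_valley(d, s) for d, s in rounds])
print(round(index, 1), round(course_handicap(index, 125), 1))
```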

In summary, the assumptions made in this development of the Slope System are:

(a) linearity of expected scores vs. index
(b) the mapping
(c) the scratch principle.

The first of these is supported by all the evidence in hand and implies that the system treats all handicap levels in the same way. The second is a broad leap, supported largely by intuition. The last is a pillar of handicapping. (An earlier approach (5, 6) made somewhat different assumptions which led to the same input and output procedures.)

5. Implementation.

Designing a system and making it operational over thousands of golf courses are two different problems. The measurement of course difficulty has traditionally been done by teams of raters who note bunkers and water holes, measure the widths of fairways and the depth of rough, judge the difficulty of greens and other obstacles. In the past this was done for scratch players only and the scratch rating was the standard from which all others were measured. The Slope System also requires a bogey rating, an estimate of course difficulty for the more or less average player, which doubles the task of the rating teams.

Efforts using multiple regression to assess the importance of the various obstacles that golfers encounter were a first step, the main difficulty being that length alone explains well over ninety percent of scores, leaving very little to distribute among bunkers and such. The dominance of length also has a positive side, since it makes it harder to be too far wrong. The work of refining ratings based on yardage alone has fallen to the traditional rating teams, but a detailed (inch thick) field manual (7) has been prepared to assist them and training seminars are regularly offered. As a result, ratings made by the various regional associations have been found (8) to be very consistent. Tests conducted by regional associations, in particular the Colorado Golf Association, have provided feedback and led to revisions, a process which is ongoing. Regressions have also been used (9) to detect the more serious outliers and to provide smoothing. These are described in the following section.

6. Outliers and smoothing.

Scratch and slope ratings are subject to errors from a number of sources. In addition to the errors which may result from inaccurate values assigned by raters or from

improper weighting of these values in calculating the scratch and bogey ratings, other sources of variation include measurement error, both intra- and inter-team errors, and model error caused by the unique placement of obstacles on a course. With ratings taken at only two handicap levels, scratch and bogey, this latter error is not immediately apparent. If ratings were taken at a number of handicap levels, rather than at only two, it would easily be recognized that the best fitting line would not necessarily pass through the two points established for the scratch and bogey golfers.

Although such a line is a reasonable basis for the estimation of slope, other estimates might also be used. In particular, one might assume that all courses are drawn from a population with a common relationship between scratch and bogey ratings. An estimate of scratch rating for a course might then be obtained from the regression of the scratch rating on the bogey rating. A still better estimate might be obtained by combining the two, the original from the rating team and the one from the regression, as obtained from the population of courses in the association (or state, region or country).

The same procedure can also be used to obtain a weighted estimate of the bogey rating, combining the original from the rating team and an estimate from the regression of bogey rating on scratch. These two estimates can then be used to obtain another estimate of the slope rating by substituting the weighted estimates of scratch and bogey ratings into the slope formula. The slope rating so obtained will be called the weighted slope rating.

This procedure has been tried at the golf association level and at the national level. As an example, the plot of bogey rating against scratch rating is shown in Figure 3 for all courses in the Colorado Golf Association. The correlation coefficient for the two ratings is .959 and has comparable values in other associations. The ratings were those assigned for the men's primary tees, usually designated as the white tees. The regression estimate of the scratch rating for any course is

SRr = a + b BRo

where the subscripts o and r refer to the original and regression estimates respectively. For the Colorado Golf Association this equation was

SRr = 15.14 + .588 BRo

The regression estimate of the bogey rating for any course is

BRr = c + d SRo


which, for Colorado, is

BRr = -16.43 + 1.563 SRo

and the combined estimates are

SRw = wo SRo + wr SRr
BRw = wo BRo + wr BRr

where wo + wr = 1.

Weights were assigned proportional to the inverse of

the variances of the original and regression estimates of the ratings. Direct estimates of variances for the original ratings were not available; indirect estimates were made from the deviations from regression of the scratch and bogey ratings against yardage. Yardage is responsible for about 85% of the variation so this estimate should include most of the error variation. Variances for the regression estimates were taken from the residual mean squares for the scratch and bogey ratings regressed against each other. We do not claim that these variance estimates are the only ones, or perhaps even the best, but they do appear to be reasonable proxies. For Colorado, the variance estimates for scratch ratings are .4366 and .5203 for the regressions on yardage and bogey ratings respectively, and the corresponding variance estimates for bogey ratings are 3.8534

and 1.3850. From these values, the weights for scratch and bogey combined estimates were calculated to be

        Scratch   Bogey
wo       .544      .264
wr       .456      .736

Using these, weighted estimates of scratch and bogey ratings, and from them the weighted slope ratings, were obtained. Figure 4 shows the plot of the weighted scratch and bogey ratings. The correlation coefficient for these ratings has increased to .998. We have some concern that the variance estimates as given by the regression of the scratch and bogey ratings on yardage may be too large and, therefore, that the weights for wo may be too low. If so, the increase in the correlation, as a result of the weighting, would be less than is shown.
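A sketch of the weighting step using the Colorado regressions and weights quoted above. The last line converts the weighted scratch and bogey ratings into a weighted slope rating; the 5.381 multiplier is the USGA men's slope formula, which is not stated in this excerpt, and the input ratings are illustrative.

```python
def weighted_ratings(sr_o, br_o):
    """Combine rating-team values (subscript o) with regression estimates (r)."""
    sr_r = 15.14 + 0.588 * br_o          # scratch regressed on bogey (Colorado)
    br_r = -16.43 + 1.563 * sr_o         # bogey regressed on scratch (Colorado)
    sr_w = 0.544 * sr_o + 0.456 * sr_r   # inverse-variance weights from the text
    br_w = 0.264 * br_o + 0.736 * br_r
    return sr_w, br_w

sr_w, br_w = weighted_ratings(sr_o=71.2, br_o=95.0)   # illustrative ratings
slope_w = 5.381 * (br_w - sr_w)                       # men's slope formula (assumed)
print(round(sr_w, 1), round(br_w, 1), round(slope_w))
```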

It is not suggested that these weighted estimates of course and slope ratings should replace the original estimates. Where the revised estimates have been tried, less than ten percent of the slope ratings are different by more than six points and only occasionally is there any appreciable change in the Course Rating. Since the executive directors of the golf associations already have the prerogative of changing the course and slope ratings when they believe an error has been made, it would be preferable for them to use the revised estimates as alternatives


to support a change. This has been tried in a few cases and seems to work well. This procedure gives them a quantitative basis for modifying the ratings when they believe there is a need.

Since the two parameters used in the Slope System are the Course Rating (equal to the scratch rating) and slope rating, one might consider what effect this revision has on the relationship between the two. Figure 5 shows the plot of the original slope and scratch ratings while Figure 6 shows the plot of the weighted slope and scratch ratings. The correlation coefficient has increased from .773 to .966. This means that the weighted scratch rating could be used as a very good predictor of the slope rating.

This raises an interesting question. If Course Rating can be used to predict the Slope Rating, why shouldn't the Course Rating and the Slope Rating for a course be simply calculated from the scratch rating, without bothering to take a bogey rating? The estimate can be made with rather high precision with about half as much work on the part of the rating team. One concern, however, might be that one cannot be sure that the relationship between course and slope ratings would necessarily remain the same without having the bogey ratings to continually

test and estimate the relationship. This idea will be examined further and tested on a number of golf associations.

7. References.

(1) Soley and Bogevold; Proposed Refinement to the USGA Golf Handicap System; report to the USGA, 1971.
(2) Scheid; Does your handicap hold up on tougher courses?; GOLF DIGEST, 1973.
(3) Riccio; How Course Ratings may Affect the Handicap Player; report to the HRT, 1978.
(4) Scheid; On the Detectability of M.A.D.; report to the HRT, 1995.
(5) Stroud; The Slope Method—A Technique for Accurately Handicapping Golfers of all Abilities Playing on Various Golf Courses; report to the HRT, 1982.
(6) Stroud and Riccio; Mathematical Underpinnings of the Slope Handicap System; in Science and Golf, edited by A. J. Cochran, 1990.
(7) Knuth and Simmons (principally); USGA Course Rating System Manual.
(8) Calvin; The Consistency of Ratings Among the Golf Associations; report to the HRT, 1994.
(9) Calvin; A Modified Model for Slope and Course Ratings; report to the HRT, 1992.


Chapter 39

Rating Skating

Gilbert W. BASSETT, Jr.* and JOSEPH PERSKY*

Among judged sports, figure skating uses a unique method of median ranks for determining placement. This system responds positively to increased marks by each judge and follows majority rule when a majority of judges agree on a skater's rank. It is demonstrated that this is the only aggregation system possessing these two properties. Median ranks provide strong safeguards against manipulation by a minority of judges. These positive features do not require the sacrifice of efficiency in controlling measurement error. In a Monte Carlo study, the median rank system consistently outperforms alternatives when judges' marks are significantly skewed toward an upper limit.

KEY WORDS: Breakdown; Majority rule; Median; Ranks.

1. INTRODUCTION

Early during the 1992 Winter Olympics, Scott Hamilton, the former Olympic champion now working as an announcer for CBS, made a valiant but unsuccessful try to explain how judges' marks are aggregated to determine the placements of figure skaters. Once burned, he gingerly avoided the subject for the duration. Hamilton's difficulties reflected the complex and at first glance arcane procedures dictated by the rulebooks. Yet officials of the United States Figure Skating Association (USFSA) and the International Skating Union (ISU) have long claimed that their approach to judging avoids arbitrary mechanisms in favor of the logic of majority rule. The purpose of this article is to explore this claim.

Figure skating is one of the most graceful and aesthetic of sports. It also involves quick movements and subtle variations. These characteristics of skating, similar in many respects to diving and gymnastics, make the appropriate ranking of competitors quite difficult. Not surprisingly, all three of these sports rely on expert judging to determine placements in competitions. Unwilling to trust such responsibility to a single individual, the rules call for a panel of judges. Yet using such a panel creates a difficult and interesting question: how best to combine the judges' marks?

In considering this question, it seems reasonable to first ask why judges' rankings should ever differ. Why don't all the judges in a competition agree? What we want is a "model" of the generating process that gives rise to such differences. Unfortunately, even a cursory acquaintance with figure skating suggests a whole spectrum of possibilities.

At one extreme, we might imagine that the quality of each skater's performance is an objective entity subject only to errors in measurement. In such a view, judges' marks differ in the same way as clocks timing a swimming event might differ. This judge might have blinked at an important moment, or that judge might have concentrated on one member of a pair of skaters just when the partner wobbled. In this model, judges are imperfect measuring devices. The aggregation problem here is essentially one of measurement error.

At the other extreme, the differences among judges might represent not errors of measurement but rather genuine differences in tastes. Judges' preferences would presumably

* Gilbert Bassett and Joseph Persky are Professors of Economics, University of Illinois at Chicago, Chicago, IL 60607. The authors would like to thank two anonymous reviewers, Benjamin Wright of the International Skating Union, Dale Mitch of the United States Figure Skating Association, Tammy Findlay, Shiela Lonergan, Nicole Persky, and all the mothers of the McFetridge Ice Skating Rink in Chicago.

reflect real aesthetic differences, although they might also be influenced by national pride and other less relevant motivations. In a world of complex aesthetics, we face all the aggregation problems of collective decision making (see Arrow 1963 or Sen 1970). Moreover, skating officials worry continually about the strategic behavior of their judges. Where judges hold strongly to their preferences or have a personal stake in the outcome, we must also worry about the distortions that can be produced by strategic voting behavior.

Both of these models are obviously oversimplifications. Yet even at this level, it is easy to appreciate that a system of aggregation that showed highly desirable properties with respect to one of these might fail to perform well with respect to the other. But both models have some plausibility; at any given competition, one or the other might dominate. At prestigious competitions such as the Olympics or the World Championships, judges' marks are very likely influenced by both tastes and problems of measurement. At local or regional meets measurement problems probably dominate. Any system of aggregating judges' rankings must be considered in light of both models. Thus the rating skating problem requires us to search for a set of aggregation criteria relevant to both measurement error and preference aggregation.

Skating officials have long maintained that their placement system was desirable because it embodied the principle of majority rule. Although the concept of majority rule is open to a number of interpretations, we show that the system that has been adopted is essentially a median rank method with tie-breaking rules. The identification of the method with median ranks does not seem to have been noticed in the skating literature. Further, although the system has evolved only informally, we show that it is actually the only method that satisfies a majority rule and incentive compatibility (or monotonicity) requirement in which a skater's final rank cannot be decreased by a judge who gives the skater a better mark. Hence the skating associations have settled on the unique aggregation method of median ranks, which is resistant to manipulation by a minority subset of judges and also satisfies a reasonable incentive compatibility requirement.

These results relate to the problem of aggregating tastes. One might expect that a ranking system well tuned to handling such aggregation might perform poorly when evaluated from the perspective of measurement error.

© 1994 American Statistical Association. Journal of the American Statistical Association, September 1994, Vol. 89, No. 427, Statistics in Sports.


                              Skaters
Judge                     A    B    C    D    E    F
1                         5    2    3    1    6    4
2                         2    1    3    6    5    4
3                         2    3    5    1    4    6
4                         2    1    5    3    4    6
5                         4    2    3    1    6    5
6                         4    2    3    1    6    5
7                         1    3    2    5    4    6
8                         2    1    5    3    4    6
9                         4    3    2    1    6    5
Place of lowest majority  2    2    3    1    5    5
Size of lowest place
  majority                5    6    6    5    5    5
Final                     3    2    4    1    5    6

Figure 1. Hypothetical Ordinals for a Component Event.

Yet we find that the official aggregation procedures deal effectively with the persistent measurement problem of "score inflation" that skews marks toward the upper limit of the scoring range. Thus we conclude that even in competitions where no manipulation is likely but judges' marks are subject to random error, the system performs well.

In Section 2 we provide a brief overview of the official scoring system used by the ISU and the USFSA. We explain that the system is essentially median ranks, but with somewhat arbitrary rules for breaking ties. In Section 3 we show that the median rank method can be justified in terms of a majority rule incentive requirement. In Section 4 we go on to consider the relative performance of the system compared to alternatives in the context of a simple model of error generation. Finally, we summarize results in Section 5.

2. THE RULES

At all major USFSA events, as well as the World Championships and the Olympics, a skater's place is determined by a weighted average of two component events. The short, or original, program is weighted one-third, and the long free-skating program is weighted two-thirds. Each component event is scored by a panel consisting of nine judges. At lesser competitions, there are still three component events: compulsory figures (20%), original program (30%), and free skating (50%). Moreover, there often are fewer judges (but always an odd number) at such events. The compulsory figures component has been dropped from more prestigious competitions, because the slow and tedious etching of school figures makes a less than dramatic video scene.

For each component, a judge gives two cardinal marks on a scale of 1 to 6. For the original program the marks are for required elements and presentation; for free skating the marks are for technical merit and composition style. These marks are the ones displayed prominently at competitions. But there is a long trail from marks to placement.

Take, for example, the placements in the original program. (Placements in the free-skating program are determined in exactly the same way.) First, for each judge an ordinal ranking of skaters is determined from total marks, the sum of the two subcomponent cardinal scores. These ordinal ranks and not the raw scores become the basis for determining placements.

As presented by the USFSA rulebook, the procedure continues as follows: "The competitor(s) placed first by the absolute majority (M) of judges is first; the competitor(s) placed second or better by an absolute majority of judges is second, and so on" (USFSA CR 26:32). Note here the expression "second or better." In calculating majorities for second place, both first and second ranks are included. In calculating a majority for third place, firsts, seconds, and thirds are included, and so on for lower places. If for any place there is no majority, then that place goes to the skater with a majority for the nearest following place.

Now of course below first place, there can be numerous ties in this system. (If a judge has given more than one first because of a tied mark, then there can also be a tie for first.) The basic rule for breaking ties is that the place goes to the competitor with the greater majority. If after the application of the greater majority rule there is still a tie, then the place goes to the skater with the "lowest total of ordinals from those judges forming the majority." And if this does not work, then the place goes to the skater with the lowest total of ordinals from all judges. In all cases of ties, the skaters involved in the ties must be placed before other skaters are considered.

To demonstrate how this all works, consider Figure 1, which contains hypothetical ordinal rankings for a component event. Notice that the tie for second between A and B goes to skater B because of the greater size of B's majority for second. Skater A then gets third place, because A must be placed before anyone else is considered. Because no one has a majority of fourths, we go on to consider E and F, each of whom has a majority of fifths. Because each has five judges in their majority, breaking the tie depends on the sum of ordinals in each majority. E then wins with the lower sum, 21, as compared to F's 23.
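The tie-breaking logic is easier to follow in code. The short Python sketch below is our own simplification, not an official USFSA routine: it orders skaters by lowest-majority place, then by the greater majority, then by the lower total of ordinals within the majority, then by the lower total of all ordinals, and it reproduces the Figure 1 placements. Rarer provisions of the official rules are omitted.

def usfsa_placements(ordinals):
    """ordinals[skater] = list of ranks (1 = best), one per judge.

    Simplified sketch of the ISU/USFSA placement rule: order skaters by
    lowest-majority place, then greater majority, then the lower total of
    ordinals within the majority, then the lower total of all ordinals.
    """
    n_judges = len(next(iter(ordinals.values())))
    majority = n_judges // 2 + 1                     # absolute majority of judges

    def sort_key(skater):
        ranks = ordinals[skater]
        # smallest place p such that a majority of judges gave rank <= p;
        # for an odd panel this is the median ordinal
        lowest_majority = sorted(ranks)[majority - 1]
        in_majority = [r for r in ranks if r <= lowest_majority]
        return (lowest_majority, -len(in_majority), sum(in_majority), sum(ranks))

    order = sorted(ordinals, key=sort_key)
    return {skater: place for place, skater in enumerate(order, start=1)}

# Ordinals of Figure 1 (rows are skaters A-F; entries are the nine judges' ranks)
figure1 = {
    "A": [5, 2, 2, 2, 4, 4, 1, 2, 4],
    "B": [2, 1, 3, 1, 2, 2, 3, 1, 3],
    "C": [3, 3, 5, 5, 3, 3, 2, 5, 2],
    "D": [1, 6, 1, 3, 1, 1, 5, 3, 1],
    "E": [6, 5, 4, 4, 6, 6, 4, 4, 6],
    "F": [4, 4, 6, 6, 5, 5, 6, 6, 5],
}
print(usfsa_placements(figure1))   # {'D': 1, 'B': 2, 'A': 3, 'C': 4, 'E': 5, 'F': 6}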

After just a bit of reflection, it is clear that the placement system used in figure skating starts from a ranking of the median ordinals received by skaters. As defined by the rules, a skater's initial placement depends on the "lowest majority." But this "lowest majority" is just equal to the median ordinal. A majority of judges ranked the skater at the skater's median ordinal or better. It is true of course that a number of tie-breaking devices are applied. These rules involve several other concepts. But under the current procedures, a skater with a lower (better) median will never be ranked worse than one with a higher (worse) median. Such a result is explicitly ruled out, because all tied skaters must be placed before any remaining skaters are considered. In particular, all skaters tied at a given "lowest majority" or median rank must be placed before any other skaters are considered. Notice that in the absence of this rule, a reversal vis-a-vis the median rank rule could easily occur.


For example, referring to Figure 1, if after failing in a tie-breaking situation for second place, skater A had to compete with skater C for third place, then the winner would be skater C (despite A's median of 2) because of a greater majority of "3s or better"; C has six "3s or better," whereas A has only five.

Although over the years there have been a number of changes in the various tie-breaking mechanisms, since 1895 the ISU has used its concept of majority rule to determine placements. The only exception we have discovered was an experiment in 1950 that used a trimmed mean. The system has now evolved to a point where it is clearly one of median ranks.

3. MEDIAN RANKS AND MAJORITY RULE

Why use the median rather than the average ordinal, the sum of the raw scores, or a trimmed mean? As in other sports involving subjective judging, ice skating has been plagued by charges of strategic manipulation. This problem is a common one in the theory of constitution building. (There is of course a large literature addressing this issue; see Arrow 1963.) The most obvious reason for using medians is to limit the effect of one or two outliers on the final rankings. But there are any number of ways to begin to guard against such manipulation. In defense of their system, skating officials from the ISU and USFSA have often claimed that it embodies the essence of majority rule. The heart of their argument is that a skater ranked best by a majority of judges should be ranked best overall.

In addition to its relation to majority rule, a system of median ranks has at least one other attractive property: If an individual judge raises a skater's mark, then that action will never decrease that skater's placement. Thus if a judge raises the mark of one skater, that skater will either move up or stay the same in overall placement.

These two properties are attractive characteristics of median ranks that suggest it for serious consideration. But in fact we can make a stronger statement. If these two simple conditions are considered to be necessary, median ranking is the only system that will satisfy both. The result follows from the median as a high-breakdown estimator and the fact that such estimates satisfy an exact-fit property that is equivalent to a majority requirement in the aggregation context (see Bassett 1991).

To formally demonstrate the result, let m_j(s) denote the raw mark and let r_j(s) denote the rank of the sth skater by the jth judge, where s = 1, ..., S and j = 1, ..., J. We suppose that higher-valued cardinal marks are assigned to better performances, and skaters are then ranked with "1" as best. (We assume that there are no tied marks, so that the marks yield a complete ordering for each judge.) The final rank of the sth skater is denoted by RANK(s).

An initial ranking is determined by a place function, denoted by P. The P function takes the matrix of marks and produces a vector p with elements p(s), which provides a partial order of skaters. The ranking RANK(s) is obtained by breaking the ties of p.

The total mark is a particularly simple example of a P function. Here p(s) = m_1(s) + m_2(s) + ... + m_J(s).

Observe that this rule can be "manipulated" by a single judge. The skater from the "good" country who is clearly best in the eyes of all but the judge from the "bad" country can lose a competition if the "bad" judge gives that skater a very low mark. A trimmed mean is also a placement function, and of course trimming can eliminate the influence of a single "bad" country judge. But despite this, trimming can still violate our conception of majority rule.

We now formalize the requirements of a place function:

1. Incentive compatibility. A skater's final rank cannot be made worse by a judge who improves the skater's mark. In terms of P functions, this says that if d > 0 and m_j(s) + d is substituted for m_j(s), then p(s) cannot fall.

2. Rank majority. If the rank matrix is such that skater s has rank r_0 for at least half the judges and skater s' has rank q_0 for at least half the judges, where r_0 < q_0, then p(s) < p(s').

Note that the rank majority requirement considers only situations in which more than half of the judges agree on the precise rank of skater s and more than half (not necessarily the same "more than half") agree on the precise rank of skater s'. The rank majority requirement sets no explicit conditions on any other situation.

Many placement functions meet Requirement 2; for example, the shortest half or least median of squares (LMS) (see Rousseeuw 1984). The LMS identifies for each skater the half subset with the most similar or closest ranks and assigns as an initial placement function the midpoint of that interval. Clearly this satisfies the rank majority rule; but it does not satisfy Requirement 1. To illustrate this fact, consider a skater with the following ranks given by five judges: 1, 1, 3, 4, and 7. With LMS, this skater's placement function value is 2. But if the last judge improves the seventh place rank to a fourth place finish, then the skater's placement function actually falls to 3.5.
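This failure is easy to verify numerically. The following small Python function is our own sketch of the shortest-half rule for an odd panel; applied to the example above, improving the last judge's mark worsens the placement value.

def lms_placement(ranks):
    """Shortest-half (LMS) placement value: the midpoint of the narrowest
    interval of sorted ranks that contains a majority of the judges."""
    ranks = sorted(ranks)
    half = len(ranks) // 2 + 1
    windows = [ranks[i:i + half] for i in range(len(ranks) - half + 1)]
    shortest = min(windows, key=lambda w: w[-1] - w[0])
    return (shortest[0] + shortest[-1]) / 2

print(lms_placement([1, 1, 3, 4, 7]))   # 2.0
print(lms_placement([1, 1, 3, 4, 4]))   # 3.5: a better fifth mark worsens the value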

Theorem. Any place function that satisfies Requirements 1 and 2 is equivalent to the median rank place function.

Proof. It is easy to see that the median satisfies Requirements 1 and 2. To see that only the median rank and no other placement function satisfies these two requirements, we proceed by contradiction. Let M and R be marks and ranks evaluated by a P function satisfying Requirements 1 and 2, where

p(1) ≤ p(2),    (1)

but

med{r_1(1), . . . , r_J(1)} = x_0 > y_0 = med{r_1(2), . . . , r_J(2)}.    (2)

We are going to change these marks without affecting either the relative placement of skaters 1 and 2 or their median ranks; however, after the change, a majority of judges will have given an identical rank score to skater 1 that is greater than an identical rank score given by a majority of judges to skater 2. But this will violate the majority requirement of a place function.


Consider the set of judges whose rank for skater 1 is ≥ x_0; notice that this set includes a majority of judges. For each such judge, adjust marks so that (a) if r_j(1) = x_0, then do nothing; leave the mark and rank at their original values, or (b) if r_j(1) > x_0, then increase skater 1's mark so that the rank is decreased to x_0. It can be verified that this remarking and reranking leaves the median relation (2) unchanged, and, because the rank value for skater 1 goes down, the relation (1) also still holds (by the incentive requirement). Further, there are now a majority of judges for whom the rank of skater 1 is x_0.

We now perform a similar operation for skater 2. Consider the set of judges whose rank for skater 2 is ≤ y_0; notice that this set includes a majority of judges. For each judge in this majority set, (a) if r_j(2) = y_0, then do nothing, or (b) if r_j(2) < y_0, then decrease skater 2's mark so that the rank is increased to y_0. It can again be verified that this does not change either (1) or (2). Further, there now is a majority of judges for whom the rank of skater 2 is y_0. Hence, by majority rule, p(1) > p(2), which contradicts (1) and completes the proof.

We conclude that the median rank is the only placement function that possesses these two desirable properties. Of course, median ranks cannot perform miracles. Like all social welfare functions, this choice rule will, under specific circumstances, violate Arrow's list of properties. In particular, the winner of a competition as judged by USFSA rules can easily depend on "irrelevant alternatives." A new entrant into a competition can change the outcome, just as a spoiler entering a three-way election can upset a favored candidate.

At the same time, we should also note that our choice of Requirement 2 to represent majority rule is subject to dispute. This is only one of the possible interpretations of majority rule. Indeed the more familiar representation of this concept performs pairwise comparisons between alternatives. If the majority prefers x to y, then society prefers x to y. This is a different idea of majority rule than that contained in median ranks, and it is easy to construct examples (see Fig. 2) in which a majority of judges prefer x to y but x obtains a worse median rank than y. The well-known problem here is that such a ranking generally will not be transitive.

4. RANKING AS A MEASUREMENT ERROR PROBLEM

The median ranks used in placing figure skaters capture an interesting meaning of majority rule and offer obvious advantages in limiting strategic manipulation. Yet in the vast majority of competitions where there is little concern with such issues of preference, one can reasonably ask whether the present system is unnecessarily cumbersome or worse. For most competitions, the problem is not one of preference aggregation but rather one of statistical estimation, where the concern is measurement error. Our first thought was that in these settings, the USFSA system would be less attractive than simpler aggregates, because its emphasis on median ranks ignores considerable information in determining placement. To look at this question, we conducted a series of Monte Carlo experiments comparing the official system to one of simple addition of cardinal marks. For completeness, we also included a trimmed mean similar to that used in diving competitions.

Skaters were assigned a "true" point score, which in turn defined a "true" ranking. The scores measured by individual judges were set equal to the true score plus a random error term. A normal error distribution was used, but as in actual meets, all scores were truncated at 6.0. As a simple measure of how well a system did, we calculated both the proportion of times that it picked the true winner and the average absolute error of placements.

In our first set of meet simulations, we treated the competition as consisting of only one component event judged on a simple six-point scale. Each meet consisted of five judges and six skaters (one through six), with true scores ranging in .2-point intervals from 5.8 to 4.8. The random error was taken to be normal with mean 0 and variance 1. (But as noted earlier, judges' scores were truncated at 6.0, thus skewing the distribution of scores.) We ran 20,000 "meets" of this type. The simple addition of judges' scores correctly identified the true first place in 46% of the meets. But the USFSA system picked the correct first place finisher in 54% of meets. The trimmed mean did about the same as the sum, picking 45% of the correct first place finishers. The straightforward sum of ranks had an average absolute error in the estimated rank of a skater of 1.10, a figure identical to that for the USFSA system. The trimmed mean did only a tad worse, with an average absolute error of 1.12.
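A sketch of this experiment (our own reconstruction of the setup just described, in Python, with a simplified median-rank rule standing in for the full USFSA procedure) is shown below; with these settings the median-rank aggregate should pick the true winner more often than the sum of marks, in line with the 54% versus 46% reported above.

import numpy as np

rng = np.random.default_rng(0)
true_scores = np.array([5.8, 5.6, 5.4, 5.2, 5.0, 4.8])   # skater 0 is truly best
n_judges, n_meets = 5, 20_000
majority = n_judges // 2 + 1

def median_rank_winner(ordinals):
    """ordinals: (judges x skaters) array of ranks, 1 = best."""
    def key(s):
        col = np.sort(ordinals[:, s])
        lowest_majority = col[majority - 1]        # median ordinal for an odd panel
        in_majority = col[col <= lowest_majority]
        return (lowest_majority, -len(in_majority), in_majority.sum(), col.sum())
    return min(range(ordinals.shape[1]), key=key)

hits = {"sum of marks": 0, "median rank": 0}
for _ in range(n_meets):
    marks = np.minimum(true_scores + rng.normal(0.0, 1.0, (n_judges, 6)), 6.0)
    ordinals = 6 - marks.argsort(axis=1).argsort(axis=1)  # each judge's ranking
    hits["sum of marks"] += int(marks.sum(axis=0).argmax() == 0)
    hits["median rank"] += int(median_rank_winner(ordinals) == 0)

print({name: round(n / n_meets, 3) for name, n in hits.items()})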

The result surprised us initially. But in hindsight, we realized that the success of the median ranking system was largely due to the mark ceiling imposed on the judges. In this situation, downward measurement errors for a good skater cannot easily be offset by upward measurement errors. Hence the average or total judges' score of a very good skater is systematically biased downward.

To demonstrate, we redid the simulation, but this time the highest skater had a true score of only 3.6 and the other skaters had scores again at .2-point intervals. The result, as we now expected, was that the USFSA system found a lower percentage of appropriate winners than the total system (46% vs. 48%).

                        Judge 1   Judge 2   Judge 3   Judge 4   Judge 5
Fourth place finisher      4         4         4         7         6
Fifth place finisher       3         3         5         5         5

Figure 2. Conflicting Conceptions of Majority Rule. Notice that every judge but Judge 3 prefers the fifth place finisher, who has a majority of fives, to the fourth place finisher, who has a majority of fours. Also, in this case the fifth place finisher has a better (lower) total score and a better (lower) trimmed mean.


The trimmed mean also came in at 46%. The average absolute error in placement was now a good deal higher for the USFSA system, 1.15, as compared to 1.07 for the sum of marks and 1.11 for the trimmed mean.

Although hardly conclusive, these simulations suggest that the USFSA system may actually help in distinguishing among skaters of different performance levels when questions of preference are not seriously at issue. This result depends critically on the mark ceiling of six points, which strongly skews judges' marks. The median rank method works well with skewed scores.

5. SUMMARY

Like gymnastics and diving, figure skating requires a method to aggregate judges' marks. Unlike other judged sports, however, figure skating has adopted a system based on median ranks. Skating officials have often bragged that their system represents majority rule. We have shown that median ranks uniquely capture an important meaning of majority rule and provide strong protection against manipulation by a minority of judges.

One might have expected that these positive features would have required the scoring system to sacrifice efficiency in the more mundane world of measurement error. Yet, somewhat accidentally as the result of persistent mark inflation, we find that median ranks do a better job in controlling measurement error than two alternatives, total marks and the trimmed mean.

Although we can find no historical evidence that skating officials ever had this end in mind, they have picked a system particularly well suited to serve as both a method of statistical estimation and a means of preference aggregation as the situation warrants.

[Received April 1993. Revised November 1993.]

REFERENCES

Arrow, K. (1963), Social Choice and Individual Values (2nd ed.), New York: John Wiley.

Bassett, G. W. (1991), "Equivariant, Monotone, 50% Breakdown Estimators," The American Statistician, May, 135-137.

Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.

Sen, A. (1970), Collective Choice and Social Welfare, Oakland: Holden-Day.

United States Figure Skating Association (1992), USFSA Rulebook, Colorado Springs, CO: Author.


Chapter 40

A game of luck or a game of skill?

Modeling Scores in the Premier League: Is Manchester United Really the Best?

Alan J. Lee

In the United Kingdom, Association football (soccer) is the major winter professional sport, and the Football Association is the equivalent of the National Football League in the United States. The competition is organized into divisions, with the Premier League comprising the best clubs. There are 20 teams in the league. In the course of the season, every team plays every other team exactly twice. Simple arithmetic shows that there are 380 = 20 × 19 games in the season. A win gets a team three points and a draw one point. In the 1995/1996 season, Manchester United won the competition with a total of 82 points. Did they deserve to win?

On one level, clearly Manchester United deserved to win because it played every team twice and got the most points. But some of the teams are very evenly matched, and some games are very close, with the outcome being essentially due to chance. A lucky goal or an unfortunate error may decide the game.

The situation is similar to a game of roulette. Suppose a player wins a bet on odds/evens. This event alone does not convince us that the player is more likely to win (is a better team) than the house. Rather, it is the long-run advantage expressed as a probability that is important, and this favors the house, not the player. In a similar way, the team that deserves to win the Premier League could be thought of as the team that has the highest probability of winning. This is not necessarily the same as the team that actually won.

How can we calculate the probability that a given team will win the Premier League? One way of doing this is to consider the likely outcome when two teams compete. For example, when Manchester United plays, what is the probability that it will win? That there will be a draw? Clearly these probabilities will depend on which team Manchester United is playing and also on whether the game is at home or away. (There are no doubt many other pertinent factors, but we shall ignore them.)

If we knew these probabilities for every possible pair of teams in the league, we could in principle calculate the probability that a given team will "top the table." This is an enormous calculation, however, if we want an exact result. A much simpler alternative is to use simulation to estimate this probability to any desired degree of accuracy. In essence, we can simulate as many seasons as we wish and estimate the "top the table" probability by the proportion of the simulated seasons that Manchester United wins. We can then rate the teams by ranking their estimated probabilities of winning the competition.


The Data

The first step in this program is to gather some data. The Internet is a good source of sports data in machine-readable form. The Web site http://dspace.dial.pipex.com/r-johnson/home.html has complete scores of all 380 games played in the 95/96 season, along with home and away information.

Modeling the Scores

Let's start by modeling the distribution of scores for two teams, say Manchester United playing Arsenal at home. We will assume that the number of goals scored by the home team (Manchester United) has a Poisson distribution with a mean λ_HOME. Similarly, we will assume that the number of goals scored by the away team (Arsenal) also has a Poisson distribution, but with a different mean λ_AWAY. Finally, we will assume that the two scores are independent so that the number of goals scored by the home team doesn't affect the distribution of the away team's score.

This last assumption might seem a bit far-fetched. If we cross-tabulate the home and away scores for all 380 games (not just games between Manchester U and Arsenal), however, we get the following table:

                            Home team score
Away team score        0     1     2     3     4+
0                     27    59    28    19     7
1                     29    53    32    14     8
2                     10    14    14     7    10
3                      9    12    12     4     2
4+                     2     4     4     1     0

A standard statistical test, the χ² test, shows that there is no evidence against the assumption of independence (χ² = 8.6993 on 16 df, p = .28). Accordingly, we will assume independence in our model.

The next step is to model the distribution of the home team's score. This should depend on the following factors:

Using Poisson Regression to Model Team Scores

We will assume that the score X of a particular team in a particular game has a Poisson distribution, so that

P(X = x) = e^(-λ) λ^x / x!,   x = 0, 1, 2, ....

We want the mean of this distribution to reflect the strength of the team, the quality of the opposition, and the home advantage, if it applies. One way of doing this is to express the logarithm of each mean as a linear combination of the factors. This neatly builds in the requirement that the mean of the Poisson has to be positive. Our equation for the logarithm of the mean of the home team is (say, when Manchester U plays Arsenal at home)

log λ_HOME = b + HOME + OFFENSE(Man. U.) + DEFENSE(Arsenal).

Similarly, to model the score of the away team, Arsenal, we assume the log of the mean is

log λ_AWAY = b + OFFENSE(Arsenal) + DEFENSE(Man. U.).

We have expressed these mean scores λ_HOME and λ_AWAY in terms of "parameters," which can be interpreted as follows. First, there is an overall constant b, which expresses the average score in a game; then a parameter HOME, which measures the home-team advantage. Next comes a series of parameters OFFENSE, one for each team, that measure the offensive power of the team. Finally, there is a set of parameters DEFENSE, again one for each team, that measures the strength of the defense.

The model just described is called a generalized linear model in the theory of statistics. Such models have been intensively studied in the statistical literature. We can estimate the values of these parameters, assuming independent Poisson distributions, by using the method of maximum likelihood. The actual calculations can be done using a standard statistical computer package. We used S-Plus for our calculations.

The parameters calculated by S-Plus are shown in Table 2, and they allow us to compute the distribution of the joint score for any combination of teams home and away. For example, if Manchester U plays Arsenal at home, the probability that Manchester scores h goals and Arsenal scores a goals is

P(h, a) = [e^(-λ_HOME) λ_HOME^h / h!] × [e^(-λ_AWAY) λ_AWAY^a / a!],

with λ_HOME = 1.44 and λ_AWAY = .76. Thus, if Manchester U played Arsenal at Manchester many times, on average Manchester U would score 1.44 goals and Arsenal .76 goals. To calculate the probability of a home-side win, we simply total the probabilities of all combinations of scores (h, a) with h > a. Similarly, to calculate the probability of a draw, we just total all the probabilities of scores where h = a and, for a loss, where h < a. A selection of these probabilities is shown in Table 3.
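This calculation is straightforward to carry out. The Python sketch below (our own; the 1.44 and .76 means are those quoted above, and the infinite sums are truncated at 10 goals, which is ample at these means) recovers the probabilities reported in Table 3 for this fixture.

from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def outcome_probs(lam_home, lam_away, max_goals=10):
    """P(home win), P(draw), P(home loss) for independent Poisson scores."""
    win = draw = loss = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                loss += p
    return win, draw, loss

# Manchester United at home to Arsenal: means 1.44 and 0.76
print([round(p, 2) for p in outcome_probs(1.44, 0.76)])   # about [0.53, 0.27, 0.20]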


• How potent is the offense of the home team? We expect Manchester U to get more goals than Bolton Wanderers, at the bottom of the table.

• How good is the away team's defense? A good opponent will not allow the home team to score so many goals.

• How important is the home-ground advantage?

We can study how these factors contribute to a team's score against a particular opponent by fitting a statistical regression model, which includes an intercept to measure the average score across all teams, both home and away, a term to measure the offensive capability of the team, a term to measure the defensive capability of the opposition, and finally an indicator for home or away. A similar model is used for the mean score of the away team.

These models are Poisson regression models, which are special cases of generalized linear models. The Poisson regression model is described in more detail in the sidebar.
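An equivalent fit can be sketched with a generalized linear model routine; here we use Python and statsmodels rather than the authors' S-Plus, and the long-format layout of the data (one row per team per game) is our own assumption about how the 380 results would be arranged.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per team per game: goals scored by `team` against `opponent`,
# with home = 1 when `team` played at its own ground.  Three illustrative
# rows only; the real fit would use all 760 rows (380 games x 2 teams).
rows = pd.DataFrame({
    "team":     ["Man. U.", "Arsenal", "Liverpool"],
    "opponent": ["Arsenal", "Man. U.", "Newcastle U."],
    "home":     [1, 0, 1],
    "goals":    [1, 0, 2],
})

# log E(goals) = intercept + home + offense(team) + defense(opponent)
model = smf.glm("goals ~ home + C(team) + C(opponent)",
                data=rows, family=sm.families.Poisson())
# result = model.fit()        # estimable only with the full season of data
# print(result.params)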

Data Analysis

Before we fit the Poisson regression model, let us calculate some averages that shed light on the home-ground advantage, the strength of the team, and the strength of the opposition. First, if we average the "home" scores in each of the 380 games, we get a mean of 1.53 goals per game. The corresponding figure for the "away" scores is 1.07, so the home-team advantage is about .46 goals per game, a significant advantage.

What about the offensive strength of each team? We can measure this in a crude way by calculating the average number of goals scored per game by each team. Admittedly, this takes no account of who played whom. Similarly, we can evaluate the defensive strength of each team by calculating the number of goals scored against each team. These values are given in Table 1. We see that Manchester United has the best offense, but Arsenal has the best defense.

Table 1 — Average Goals for and Against

[For each of the 20 teams: average goals scored per game, average goals conceded per game, win-loss-draw record, and competition points for the 1995/96 season. Manchester United has the highest average goals for (1.92 per game), and Arsenal the lowest average goals against (.84 per game).]

Table 2 — Team and Opposition Parameters From the Generalized Linear Model

Team             Offensive    Offensive     Defensive    Defensive
                 parameter    multiplier    parameter    multiplier
Arsenal             .00          1.00          -.41          .67
Aston Villa         .06          1.07          -.31          .73
Blackburn R.        .24          1.27          -.01          .99
Bolton Wan.        -.19           .83           .38         1.46
Chelsea            -.05           .95          -.09          .91
Coventry C.        -.12           .88           .22         1.24
Everton             .28          1.33          -.07          .93
Leeds U.           -.18           .84           .16         1.18
Liverpool           .36          1.43          -.32          .72
Man. City          -.37           .69           .17         1.19
Man. U.             .40          1.50          -.29          .75
Middlesbro         -.32           .73           .02         1.03
Newcastle U.        .31          1.36          -.24          .78
Nottm. Forest       .05          1.05           .12         1.13
QPR                -.23           .80           .16         1.17
Sheff. Wed.         .01          1.01           .24         1.27
Southampton        -.34           .71           .06         1.07
Tottenham H.        .03          1.03          -.23          .79
West Ham. U.       -.11           .90           .08         1.08
Wimbledon           .16          1.17           .38         1.47


Now we "fit the model" and estimate the parameters. The intercept is .0165, and the home-team advantage parameter is .3518. The first value means that a "typical" away team will score 1.0166 (= e^.0165) goals, and the second means that, on average, the home team can expect to score 100 × e^.3518 = 142% of the goals scored by their opposition. This agrees with the preceding crude estimate; 1.5263 is 142% of 1.0737.

Next we come to the offensive and defensive parameters. The estimates of these are contained in Table 2. We see that Manchester United has the largest offensive parameter (.4041) and Arsenal the smallest defensive parameter (-.4075), which is consistent with the preceding preliminary analysis.

Table 3 — Probabilities of a Win, Draw, or Loss for Selected Match-ups

Home team       Away team       Prob. of win   Prob. of draw   Prob. of loss
Man. U.         Liverpool            .48             .25             .26
Liverpool       Man. U.              .47             .25             .27
Man. U.         Newcastle U.         .53             .24             .23
Newcastle U.    Man. U.              .44             .26             .30
Newcastle U.    Liverpool            .43             .26             .31
Liverpool       Newcastle U.         .52             .25             .23
Man. U.         Arsenal              .53             .27             .20
Arsenal         Man. U.              .37             .30             .33
Arsenal         Liverpool            .37             .30             .33
Liverpool       Arsenal              .52             .28             .20
Arsenal         Newcastle U.         .41             .30             .29
Newcastle U.    Arsenal              .49             .29             .22

Table 4 — Results From Simulating the Season

Team            Actual points   Poisson model     Simulated     Simulated std.   Proportion at
                    95/96       expected points   mean points   dev. points      top of table
Man. U.               82             75.7             75.5           7.1               .38
Newcastle U.          78             70.7             70.5           7.8               .16
Liverpool             71             74.9             74.9           7.5               .33
Arsenal               63             63.8             63.6           7.7               .03
Aston Villa           63             63.7             63.6           7.4               .03
Blackburn R.          61             61.2             61.4           7.4               .03
Everton               61             64.9             65.0           7.5               .04
Tottenham H.          61             60.2             60.8           7.5               .01
Nottm. Forest         58             50.0             49.5           7.4               .00
West Ham. U.          51             46.3             46.1           7.7               .00
Chelsea               50             53.4             53.5           7.4               .00
Leeds U.              43             41.4             41.4           7.4               .00
Middlesbro            43             41.5             41.8           7.4               .00
Wimbledon             41             44.7             44.7           7.6               .00
Sheff. Wed.           40             44.8             44.9           7.2               .00
Coventry C.           38             41.2             41.4           7.6               .00
Man. City             38             35.7             35.4           6.9               .00
Southampton           38             39.6             39.5           7.0               .00
QPR                   33             39.9             40.1           7.3               .00
Bolton Wan.           29             33.9             34.0           7.2               .00

To get the expected score for a team, we multiply the "typical away team" score (1.0166) by the offensive multiplier and by the defensive multiplier. In addition, if the team is playing at home, we multiply by 1.4216 (= e^.3518). Note that these parameters are relative rather than absolute: The average of the offensive and defensive parameters has been arbitrarily set to 0 and the "typical team" parameter adjusted accordingly.
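Multiplying out the Manchester United versus Arsenal example with these parameter values recovers the means used in the sidebar; the small check below (in Python, using the parameter estimates quoted in the text and in Table 2) is ours.

from math import exp

intercept, home = 0.0165, 0.3518
offense = {"Man. U.": 0.4041, "Arsenal": 0.00}      # offensive parameters
defense = {"Man. U.": -0.29, "Arsenal": -0.4075}    # defensive parameters

# Expected goals when Manchester United host Arsenal
man_u   = exp(intercept + home + offense["Man. U."] + defense["Arsenal"])
arsenal = exp(intercept + offense["Arsenal"] + defense["Man. U."])
print(round(man_u, 2), round(arsenal, 2))           # 1.44 0.76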

What do we get from this more complicated analysis that we didn't get from the simple calculation of means? First, the model neatly accounts for the offensive and defensive strengths of both the home team and the opposition. In addition, using the model, we can calculate the chance of getting any particular score for any pair of teams. In particular, the model gives us the probability of a win, a loss, or a draw.

The results in Tables 1 and 2 are in agreement, giving the same orderings for offense and defense. This is a consequence of every team playing every other team the same number of times.

If we perform the calculations described in the sidebar, we can calculate the probability of win, lose, and draw for any pair of teams, home and away. For example, Table 3 gives these probabilities for the top few teams. To continue our example, we see from these tables that when Manchester United plays Arsenal at Manchester, they will win with probability .53, draw with probability .27, and lose with probability .20.

Simulating the Season

Now we can approach the problem of whether or not Manchester United was lucky to top the table in the 95/96 season. As we noted previously, the Poisson regression approach allows us to calculate the chance of a win, loss, or draw for a game between any pair of teams. In principle, this allows us to calculate exactly the chance a given team will top the table. The calculation is too large to be practical, however, so we resort instead to simulation.

For each of the 380 games played, we can simulate the outcome of each game. Essentially, for each game, we throw a three-sided die (conceptually only) whose faces are win, lose, and draw. The probabilities of these three outcomes are similar to those given in the preceding tables. From these 380 simulated games, we can calculate the points table for the season, awarding three points for a win and one for a draw, and see which team topped the table.

In fact we used a computer program to simulate the 95/96 season 1,000 times. We can calculate the mean and standard deviation of 1,000 simulated points totals for each team and also the expected number of points under the Poisson model described previously. We can also count the proportion of times each team topped the table in the 1,000 simulated seasons, which gives an estimate of the probability of topping the table. Table 4 gives this information.
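The bookkeeping for such a simulation is simple. The Python sketch below is our own and is deliberately generic: the fixtures list and the probs dictionary of win/draw/loss probabilities would be built from the fitted model (as in Table 3), whereas the toy values shown here are placeholders.

import random
from collections import Counter

def simulate_seasons(fixtures, probs, n_sims=1000, seed=1):
    """fixtures: list of (home, away); probs[(home, away)] = (p_win, p_draw, p_loss).
    Returns the proportion of simulated seasons in which each team finishes top."""
    rng = random.Random(seed)
    tops = Counter()
    for _ in range(n_sims):
        points = Counter()
        for home, away in fixtures:
            p_win, p_draw, _ = probs[(home, away)]
            u = rng.random()
            if u < p_win:                    # home win: 3 points
                points[home] += 3
            elif u < p_win + p_draw:         # draw: 1 point each
                points[home] += 1
                points[away] += 1
            else:                            # away win: 3 points
                points[away] += 3
        tops[max(points, key=points.get)] += 1
    return {team: count / n_sims for team, count in tops.items()}

# Toy example: three teams playing home and away, with placeholder probabilities
teams = ["Man. U.", "Newcastle U.", "Liverpool"]
fixtures = [(h, a) for h in teams for a in teams if h != a]
probs = {f: (0.45, 0.27, 0.28) for f in fixtures}
print(simulate_seasons(fixtures, probs))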

Manchester seems to have been a little lucky, but it still has the highest average score. Liverpool was definitely unlucky and according to our model is really a better team than Newcastle United, who actually came second.

Of course, our approach to modeling the scores is a little simplistic. We have taken no account of the fact that teams differ from game to game due to injuries, trades, and suspensions. In addition, we are assuming that our model leads to reasonable probabilities for winning/losing/drawing games. Teams that tend to "run up the score" against weak opponents may be overrated by a model that looks only at scores, and teams that settle into a "defensive shell" once they have got the lead may be underrated. Still, our results do seem to correspond fairly well to the historical result of the 95/96 season.

References and Further Reading

Groeneveld, R. A. (1990), "Ranking Teams in a League With Two Divisions of t Teams," The American Statistician, 44, 277-281.

Hill, I. D. (1974), "Association Football and Statistical Inference," Applied Statistics, 23, 203-208.

Keller, J. B. (1994), "A Characterization of the Poisson Distribution and the Probability of Winning a Game," The American Statistician, 48, 294-298.

McCullagh, P., and Nelder, J. A. (1989), Generalised Linear Models, London: Chapman and Hall.

Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability Models for the NCAA Basketball Tournaments," The American Statistician, 45, 179-183.

Stern, H. S. (1995), "Who's Number 1 in College Football? . . . And How Might We Decide?," Chance, 8(3), 7-14.


Chapter 41

Down to Ten: Estimating the Effect of a Red Card in Soccer

G. RIDDER, J. S. CRAMER, and P. HOPSTAKEN*

We investigate the effect of the expulsion of a player on the outcome of a soccer match by means of a probability model for the score. We propose estimators of the expulsion effect that are independent of the relative strength of the teams. We use the estimates to illustrate the expulsion effect on the outcome of a match.

KEY WORDS: Conditional likelihood; Poisson process; Soccer; Unobserved heterogeneity.

1. INTRODUCTION

Professional soccer (known outside the United States as football) is popular all over the world; in Europe and South America it is the dominant spectator sport. Because soccer is a low scoring game, the rules have been often revised so as to raise the number of goals scored by either side and thus increase the play's appeal. Since 1990, players can be expelled for the rest of a match for illegal defensive actions, such as repeated flagrant fouls and preventing an adverse goal by illegal means. The referee expels the player by showing him a red card.

In this article we investigate the effect of such an expulsion on the outcome of a match. Popular opinion holds widely different views on the effectiveness of the red card, but as far as we know the question has not been submitted to empirical research. We propose a model for the effect of the red card that allows for initial differences in the strengths of the teams and for variation in the scoring intensity during the match. More specifically, we propose a time-inhomogeneous Poisson model with a match-specific effect for the score of either side. We estimate the differential effect of the red card by a conditional maximum likelihood (CML) estimator that is independent of the match-specific effects. This estimator was introduced in econometrics by Hausman, Hall, and Griliches (1984), building on ideas of Andersen (1973).

In Section 2 we specify the model, in Section 3 we discuss estimation, and in Section 4 we give the results. We consider some implications of the estimates in Section 5.

2. A MODEL FOR THE SCORE IN A SOCCER MATCH

First, we introduce some notation. The subscript i denotes a match, and j = 1, 2 denotes the two sides in that match; a team in a match is thus identified by two subscripts ij. We restrict attention to matches with a red card, and we always take it that the red card is given against the second side, j = 2. Time is measured in minutes from 0 to 90, which is the official duration of a match. In soccer the clock is not stopped when play is interrupted, but the referee can allow for lost time at the end of the first and second halves, after 45 and 90 minutes. Recorded time is measured from the beginning of the match and from its resumption after the interval, however. As a result, there may be some minutes when there is no play at all, whereas the 45th and 90th minutes may last longer than a full minute; but this is a minor distortion.

* G. Ridder is Professor, Department of Econometrics, Free University, Amsterdam, The Netherlands. J. S. Cramer is Professor, Department of Economics, and P. Hopstaken is Senior Research Fellow, Foundation for Economic Research, University of Amsterdam, The Netherlands. The authors thank Gusta Renes for helpful comments, Tony Lancaster for spotting an embarrassing error in a previous version, and the editor and two referees for comments that have improved the article considerably.

Let

T_i = minute in which a player is expelled from team 2,
N_ij = total number of goals scored in match i by team j,
K_ij = number of goals scored before T_i,
M_ij = number of goals scored after T_i,
λ_ij(t) = scoring rate or intensity of team j in match i at the tth minute of play,
θ_j = multiplicative effect on λ_ij(t) of the expulsion of a player from team 2, and
γ_ij = relative strength of team j in match i as compared with the overall average scoring rate, λ(t).

We make the following three assumptions:

1. The two teams score according to two independent Poisson processes. As a consequence, the number of goals scored by team 1 is stochastically independent of the number of goals scored by team 2. Moreover, the time intervals between subsequent goals are stochastically independent. The scoring intensities are not constant during the match; thus the Poisson processes are nonhomogeneous.

2. The ratio of the scoring intensities of the two full teams is a constant for each game; that is, λ_ij(t) = γ_ij λ(t) for matches of 11 against 11 players, with λ(t) the average scoring intensity at the tth minute of play of full sides of 11 against 11.

3. After the red card, for t > T_i, team 2 has 10 players, and the scoring intensities are θ_j γ_ij λ(t), j = 1, 2.

In Assumption 1 we describe the score in a match as a random phenomenon that is only partly predictable. It depends on the playing time, on the relative strength of the teams, and on the effect of the red card. As we show, the scoring intensity increases with the time played. If we do not allow for this, then the effect of the red card will be overstated, because we confound it with the time effect. Of course the score is strongly affected by the relative strength of the teams. The incidence of red cards may be related to the relative strength, so that a comparison of red card games to uninterrupted games gives a biased estimate of the effect.

© 1994 American Statistical Association. Journal of the American Statistical Association, September 1994, Vol. 89, No. 427, Statistics in Sports.


In addition, the timing of red cards may also be related to the relative strength of the teams, and again this biases the effect. The third factor is the effect of the red card, which by Assumption 3 is measured by θ_1 and θ_2.

It is not our aim to predict the outcome of soccer matches, which requires an estimate of γ_ij. Our estimate of the effect of the red card is independent of γ_ij, which is of great help, because finding a good estimate of γ_ij is difficult, as experience shows.

The Poisson assumption and its implications form Assumption 1. It is not difficult to relax the Poisson assumption at the cost of a more complex statistical model, but our limited number of observations will not support this.

3. STATISTICAL ANALYSIS

3.1 Estimation of the Average Scoring Intensity

In Table 1 goals scored in 340 full matches in the two professional soccer divisions in the Netherlands in the 1991-1992 season are classified by 15-minute intervals of play. This shows that the rate of scoring increases monotonically over the match, as has also been observed in England by Morris (1981).

If we assume that the average scoring intensity increases linearly during the match, λ(t) = a + bt, then the expected number of goals scored by team j in match i in interval s is γ_ij ∫_s λ(t) dt / 90, so that the average number of goals scored by one team in interval s is ∫_s λ(t) dt / 90, where we take γ̄ = 1. This implicitly defines a scale for γ_ij; for example, if γ_ij = 2, then team j has a scoring intensity in match i that is two times the average.

The average number of goals per minute in time interval s equals the entry in Table 1 divided by 680, twice the number of contests. Estimates of a and b are then easily obtained by ordinary least squares (OLS) regression. With λ(t) as the scoring intensity for a 90-minute game, we find (R² = .95; standard errors in parentheses)

a = 1.050 (.024) and b = .00776 (.00072).

Note that the reported standard errors are consistent in the presence of heteroscedasticity. Inclusion of a quadratic term did not improve the fit. In the sequel we ignore the sampling variance of these estimates. This simplifies the computation of variances and is an acceptable approximation, as they are small. The estimates imply that the scoring intensity increases during a 90-minute game from 1.05 in the first minute to 1.75 in the final minute.

Table 1. Goals Scored in the 1991-1992 Season by 15-Minute Intervals

Time interval (min)    Number of goals
0-15                        128
16-30                       140
31-45                       147
46-60                       169
61-75                       170
76-90                       198
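This fit can be reproduced directly from Table 1. In the Python sketch below (our own reading of the procedure: each interval count is converted to a per-team rate on a 90-minute scale by dividing by 680 team-games and multiplying by 6, and then regressed on the interval midpoints), ordinary least squares returns essentially the published estimates.

import numpy as np

goals = np.array([128, 140, 147, 169, 170, 198])         # Table 1 counts
midpoints = np.array([7.5, 22.5, 37.5, 52.5, 67.5, 82.5])

# per-team rate on a 90-minute scale: count / 680 team-games, times 90/15
rate = goals / 680 * 6

# ordinary least squares fit of rate = a + b * t
b, a = np.polyfit(midpoints, rate, deg=1)
print(round(a, 3), round(b, 5))                # approximately 1.050 and 0.00777
print(round(a, 2), round(a + b * 90, 2))       # intensity rises from about 1.05 to 1.75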

3.2 A Conditional Maximum Likelihood Estimator

Because the incidence of red cards is probably related to the relative strength γ_ij, a comparison of red card matches with other matches may give a biased estimate of the red card effect. For that reason, we propose an estimator that does not depend on the γ_ij or on their distribution. This estimator is based on a comparison of the number of goals


scored by the same team before and after the red card. More precisely, we consider the fraction of the goals scored after the red card, which we denote by y_ij. It is intuitively clear that this fraction is independent of the time-constant match-specific effect.

Under Assumptions 1 to 3 (with P denoting the Poisson distribution),

K_ij ~ P(γ_ij A_i)   and   M_ij ~ P(θ_j γ_ij B_i),   j = 1, 2,

where in the sequel we denote

A_i = ∫_0^{T_i} λ(t) dt / 90   and   B_i = ∫_{T_i}^{90} λ(t) dt / 90.

The conditional distribution of M_ij, given N_ij, is

M_ij | N_ij ~ B(N_ij, g_ij(θ_j)),

where B denotes the binomial distribution and

g_ij(θ_j) = θ_j B_i / (A_i + θ_j B_i).

The conditional distribution is degenerate if N_ij = 0, and y_ij is defined only if N_ij ≥ 1. In the CML procedure we omit observations with N_ij = 0. The estimator of the red card effect is not biased by this restriction, as we shall see presently.

In this conditional distribution and in the conditional likelihood, the match-specific effects γ_ij cancel. Up to an additive constant that does not depend on θ_j, the log-likelihood is

L(θ_1, θ_2) = Σ_{j=1,2} Σ_{i=1}^{n_j} [ M_ij log g_ij(θ_j) + K_ij log(1 - g_ij(θ_j)) ],

with n_1, n_2 denoting the number of observations on teams that do not and do receive a red card. Because we condition on the total scores, N_ij, we can treat them as nonstochastic constants. Hence omitting observations with a given total score (in particular, observations with N_ij = 0) does not affect the CML estimator.

The likelihood equation is

Σ_i N_ij [ y_ij - g_ij(θ_j) ] = 0,   j = 1, 2.


This is a moment equation, equating a weighted average of the y_ij's to a weighted average of their expectations. The weights are the total scores N_ij, which by conditioning can be treated as known constants.

In deriving the properties of the CML estimator, we note that the binomial parameter g_ij(θ_j) can be written in the logit form. Hence the log-likelihood is globally concave in log(θ_j), so that the CML estimator for θ_j is uniquely defined. The asymptotic variance of the CML estimator can be obtained in the usual way.
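Computationally, the CML estimator is a one-dimensional root-finding problem for each team. The sketch below is our own: it assumes the linear intensity λ(t) = a + bt fitted in Section 3.1 and the binomial conditional distribution given above, solves the moment equation Σ_i N_ij [y_ij - g_ij(θ)] = 0 by bisection, and is demonstrated on simulated matches rather than the original data.

import numpy as np

rng = np.random.default_rng(42)
a, b = 1.050, 0.00776                          # fitted intensity lambda(t) = a + b*t

def cum_intensity(t0, t1):
    """Expected goals of an average team between minutes t0 and t1."""
    return (a * (t1 - t0) + b * (t1**2 - t0**2) / 2) / 90

def cml_theta(K, M, A, B, lo=1e-3, hi=20.0):
    """Solve sum_i N_i * (y_i - theta*B_i / (A_i + theta*B_i)) = 0 by bisection."""
    N = K + M
    def moment(theta):
        g = theta * B / (A + theta * B)
        return np.sum(M - N * g)               # equals sum_i N_i * (y_i - g_i)
    for _ in range(200):                       # moment() is decreasing in theta
        mid = 0.5 * (lo + hi)
        if moment(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Simulated red-card matches with true effects theta_1 = 1.9 and theta_2 = 1.0
n, theta_true = 140, (1.9, 1.0)
T = rng.uniform(10, 85, n)                     # minute of the red card
gamma = rng.lognormal(0.0, 0.3, (n, 2))        # match-specific relative strengths
A = cum_intensity(0, T)                        # average-team goals before the card
B = cum_intensity(T, 90)                       # average-team goals after the card
for j, theta in enumerate(theta_true, start=1):
    K = rng.poisson(gamma[:, j - 1] * A)       # goals before the expulsion
    M = rng.poisson(theta * gamma[:, j - 1] * B)   # goals after the expulsion
    keep = (K + M) > 0                         # CML drops teams that never scored
    est = cml_theta(K[keep], M[keep], A[keep], B[keep])
    print(f"theta_{j}: true {theta}, estimated {est:.2f}")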

3.3 OLS Estimation

With an additional assumption, we can estimate the effect of the red card by linear regression. From the Poisson model above,

K_ij = γ̄_j A_i + v_1ij

and

M_ij = θ_j γ̄_j B_i + v_2ij.

Here we allow the average relative strength in red card games to differ from the overall average 1: γ̄_1 and γ̄_2 indicate the average strengths of the teams before a player of team 2 is expelled. The disturbances v_1ij and v_2ij are independent, and an additional assumption is required for consistent estimates, viz. cov(γ_ij, A_i) = cov(γ_ij, B_i) = 0. A sufficient condition for this is that T_i and γ_ij are stochastically independent. Under this assumption, we can estimate θ_j by the ratio of the regression coefficients in these two regressions.

4. ESTIMATION RESULTS

We apply CML estimation to data on 140 red card games in the seasons 1989-1990, 1990-1991, and 1991-1992 in both divisions of the Dutch professional football league. In 13 of these matches, two or more red cards were given. Because we estimate the effect of being one player up or down, the part after the second expulsion is omitted. In only two matches were a red card and a penalty kick given jointly. Because for the CML estimator we must omit observations where a team has not scored at all, the effective number of observations is 112 for teams with 11 players and 93 for teams with 10 players. We obtain the following results (standard errors in parentheses):

Table 2. Probabilities of the Outcome of the Match by Minute of the Red Card

Minute of red card   Pr(team of 11 wins)   Pr(draw)   Pr(team of 10 wins)
0                            .65              .17             .18
15                           .62              .18             .20
30                           .58              .20             .22
45                           .54              .21             .25
60                           .49              .23             .28
75                           .44              .24             .32
90                           .375             .25             .375

Table 3. Probabilities of the Outcome of a Match With a 15-Minute Exclusion

Minute of start of penalty   Pr(team of 11 wins)   Pr(draw)   Pr(team of 10 wins)
0                                    .42              .24             .34
15                                   .42              .24             .34
30                                   .43              .24             .33
45                                   .43              .24             .33
60                                   .43              .24             .33
75                                   .44              .24             .32

θ̂_CML,1 = 1.88 (.29) and θ̂_CML,2 = .95 (.20).

According to the CML estimates, the scoring intensity increases by 88% for the team with 11 players; this effect is statistically significant. The scoring intensity for the team with 10 players (team 2) hardly changes; the effect is not significantly different from 1.

The OLS estimator gives rather different results (with the standard errors consistent in the presence of heteroscedasticity). The estimated increase in the scoring intensity for team 1 is much smaller than for the CML estimator (but highly significant). More surprisingly, the OLS estimator shows a statistically significant increase in the scoring intensity for team 2. Hence using between-game information gives rather different estimates that moreover are hard to interpret. The first-stage regressions of the OLS estimator show that teams that receive the red card have the same scoring intensity as the average (γ̄_2 = 1.03 (.09)), but the opposing team is much stronger (γ̄_1 = 1.33 (.09)). Hence the red card usually is given to the already weaker team.

By stratifying our sample, we can investigate whether the estimates are robust against changes in the specification. First, we test whether the red card effect depends on the venue of play. This captures, among other things, the home advantage, and the estimate should be invariant to this distinction. The LR statistic is .53 for the team with 11 players and .66 for the team with 10 players; hence we cannot reject invariance. The estimates are θ̂_1,home = 2.00 (.36), θ̂_1,away = 1.56 (.46), and θ̂_2,home = .73 (.28), θ̂_2,away = 1.07 (.27). The estimates are also invariant to stratification on the total score in a match. In the sequel we use the CML estimates to illustrate the effect of the red card on the outcome of a match.

5. IMPLICATIONS OF THE ESTIMATES

We can use the results to illustrate the effect of the red card on a soccer match. In Table 2 we give the probabilities of the three possible outcomes of the match between equally strong teams as a function of the minute in which the red card is given. The last row of the table shows that the probability of a draw between two teams of average strength is .25. This is an indication of the role of chance in the outcome of a soccer match. The role of chance was also stressed by Osmond (1993). A red card early in the match increases team 1's probability of victory substantially.
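The entries of Table 2 can be reproduced from the fitted quantities. The Python sketch below is our own check: it combines the intensity λ(t) = a + bt from Section 3.1 with the CML estimates 1.88 and .95, assumes equally strong teams (γ = 1), truncates the Poisson sums at 15 goals, and returns the win, draw, and loss probabilities for each red-card minute.

from math import exp, factorial

a, b = 1.050, 0.00776
theta1, theta2 = 1.88, 0.95            # CML estimates for the teams of 11 and 10

def cum_intensity(t0, t1):
    """Expected goals of an average team between minutes t0 and t1."""
    return (a * (t1 - t0) + b * (t1**2 - t0**2) / 2) / 90

def pois(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def outcome(minute, max_goals=15):
    A = cum_intensity(0, minute)
    B = cum_intensity(minute, 90)
    mu1, mu2 = A + theta1 * B, A + theta2 * B      # expected goals, equal teams
    win = sum(pois(h, mu1) * pois(k, mu2)
              for h in range(max_goals + 1) for k in range(max_goals + 1) if h > k)
    draw = sum(pois(h, mu1) * pois(h, mu2) for h in range(max_goals + 1))
    return round(win, 2), round(draw, 2), round(1 - win - draw, 2)

for minute in (0, 15, 30, 45, 60, 75, 90):
    print(minute, outcome(minute))     # close to Table 2, up to rounding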


Table 4. Expected Number of Goals in Match by Minute of Red Card

Minute of red card   Expected number of goals
0                            3.95
15                           3.80
30                           3.63
45                           3.45
60                           3.25
75                           3.03
90                           2.80


Team 2's probability of victory decreases even more, whereas the change in the probability of a draw is relatively small.

With a red card, a player is expelled for the remainder of the match. In indoor soccer and ice hockey, a player can be excluded for a certain period. In Table 3 we show the effect of a 15-minute time penalty for equally strong teams. Although the effect depends on the time at which the penalty is imposed, this dependence is rather weak.

As noted in Section 1, a motivation for the more frequent use of the red card is to increase the number of goals scored in a match. Table 4 shows that it has the desired effect.

We also consider the dilemma of a defender who faces a player who threatens to break through the defense. If the opposing player has a clear way to the goal, tripping up the player results in a red card for the defender. If the player goes past the last defender, he will score with a high probability. In our calculation we assume that the objective of the defender is to minimize the probability of losing the match. There is a unique moment in a contest at which the optimal action of the defender changes. After that moment, it is optimal to trip up the opposing player. These times, which depend on the probability that the attacker will score and on the relative strength of the defender's team (with the attacker's team of average strength, γ = 1), are reported in Table 5. The weaker side has a stronger incentive to resort to illegal defense. This is consistent with our observation that the red card is usually given to the weaker side. It may also induce a correlation between T_i and γ_ij, and such a correlation biases the OLS estimates.

Table 5. Time (Minute of Game) After Which a Defender Should Stop a Breaking-Away Player, by Probability of Score and Relative Strength of the Defender's Team

Relative strength          Probability of score
of teams                 .3        .6         1
.5                       70        42         0
1                        71        48        16
2                        72        52        30

APPENDIX: DATA USED IN ANALYSIS

The symbols are introduced in Section 2.

[Data table: for each of the 140 red card matches, the minute of the red card T_i and the goals scored by each team before (K_i1, K_i2) and after (M_i1, M_i2) the expulsion; matches in which two or more red cards were given are marked with an asterisk.]

[Received January 1993. Revised March 1994.]

REFERENCES

Andersen, E. B. (1973), Conditional Inference and Models for Measuring, Copenhagen: Mental Hygiejnisk Forlag.

Hausman, J. A., Hall, B. H., and Griliches, Z. (1984), "Econometric Models for Count Data With an Application to the Patents-R&D Relationship," Econometrica, 52, 909-938.

Morris, D. (1981), The Soccer Tribe, London: Jonathan Cape.

Osmond, C. (1993), "Random Premiership?," RSS News, November, 5.


Chapter 42

Getting slammed during your first set might affect your next!

Heavy Defeats in Tennis: Psychological Momentum or Random Effect?

David Jackson and Krzysztof Mosurski

Sports statistics is a very diverse area. This article is concerned with (1) contests between individuals that are decided not by a single trial but by a series of trials and (2) the dependency structure that may exist between trials in such contests.

Psychological Momentum

There is a widespread belief in many walks of life, not just sports, that "success breeds success and failure breeds failure." If winning a trial increases the probability of winning the next trial, then that kind of dependency structure is quite properly called psychological momentum (PM). Unfortunately "momentum" is, at present, a much abused word that has found its way into the vocabulary of practically every sports commentator and fan alike to account for even the most mundane sequences of successes or failures. If one can demonstrate that PM is truly a factor in a given sport, however, then heavy defeats are a consequence of that dependency structure. There is clearly a strong positive relationship between PM and sequences of successes or failures.

A best-of-five-sets tennis match is a good example of the type of contest that is of interest. And the interest is in the possibly changing probability of winning a set as the match progresses. If PM is a factor, we are talking about a true dependency structure for the probability of winning a set, not merely an updated estimate of an unchanging probability based on additional data. The reason, of course, that one doubts that the sets of a best-of-five-sets tennis match are independent is that the memoryless property, from which assumptions of independence usually gain their strength, is missing in such a series of trials. No matter how much either participant might wish otherwise, the outcomes of the previous sets that have led to the present score in the match are known to both contestants. Perhaps we are made of such stuff that knowledge of what has happened earlier does not affect our probability of winning the next set. But perhaps it does. It is regrettable, but nonetheless true, that in analyzing data, and not just sports data, assumptions of independence are often very casually made in the literature.


The Search for Psychological Momentum in Sport

It is generally accepted in contests that are decided by a series of trials that PM can play a major role in the outcome. It is a long road, however, from being "generally accepted" to being "well known," and the search by authors for evidence of the existence of PM in sport has generated a fair amount of sometimes heated debate in recent years. It is an area that has seen a considerable research effort with numerous works, mainly on basketball, baseball, or tennis, since the seminal article on the subject in the statistics literature by Tversky and Gilovich (1989). Their analysis of consecutive shots in basketball shows that contrary to popular belief the chances of a player hitting a shot are as good after a miss as after a hit. In baseball, analysis of hitting streaks (Albert 1993; Albright 1993; Stern 1993) also failed to detect any significant effect on the probability of making a hit, due to a player's recent history of success or failures. According to Stern (1995), the most credible evidence, so far, for the existence of psychological momentum in sport has been provided by tennis (Jackson 1993, 1995). Those works show that, when the odds in the first set of a match are estimated from explanatory variables, then a "success-breeds-success" model provides a much better fit to data from the 1987 Wimbledon and the U.S. Open tennis tournaments than an independent-sets model. These data exhibit far more heavy defeats than can be accommodated by the independence model, which assumes that the probability of winning a set remains constant in a given match. The success-breeds-success model—that is, PM—explains the tennis data extremely well.

Random Variation in Player Ability and Heavy Defeats

There is a possible alternative explanation, however, for the apparent overabundance of heavy defeats that we observe in tennis, and that is random variation in player ability from day to day.

Table 1—Model Comparisons

Wimbledon and U.S. Open tennis tournaments 1987-1988

Data for 1847 sets from 501 matches, which includes current rankings of players in each match.

"Simple" independence, Model (A):
  a = .510 (s.e. = .03)
  Degrees of freedom = 1,846; Deviance = 2,329

Odds model, Model (B):
  a = .441 (s.e. = .035)
  log(k) = .391 (s.e. = .05)  =>  k = 1.48
  Degrees of freedom = 1,845; Deviance = 2,264

Independence with a normal random effect, Model (AR):
  a = .532 (s.e. = .04)
  σ² = .625 (s.e. = .12)
  Degrees of freedom = 1,845; Deviance = 2,291

Odds model with a normal random effect, Model (BR):
  a = .459 (s.e. = .036)
  log(k) = .332 (s.e. = .05)  =>  k = 1.32
  σ² = .142 (s.e. = .09)
  Degrees of freedom = 1,844; Deviance = 2,261

NOTE: Some parameter estimates with standard errors and goodness-of-fit statistics.

A random-effects model for player ability provides a good explanation of a common occurrence in sport in which a player inflicts a heavy defeat on his opponent on one day but himself suffers a heavy defeat from the same opponent on the next day. If a player's ability varies randomly from day to day (but remains relatively constant on any given day), then such apparent reversals of form are to be expected because, for the same two players, the probability of winning a set may vary substantially from day to day. Of course, PM explains such reversals of form equally well. The question we are posing in the title of this article is: Should the apparent overabundance of heavy defeats that we observe in tennis (3/0 to either player) be put down to PM, or could it equally well be attributed solely to a random day-to-day fluctuation in the ability of the contestants? We answer this question by comparing these alternatives on the basis of two years of data from the Wimbledon and U.S. Open tennis tournaments. In addition the models are fitted to a dataset containing the career "head-to-head" records of Ivan Lendl versus Jimmy Connors and John McEnroe versus Bjorn Borg.

The Evidence for Psychological Momentum in Tennis

We were fortunate that, at the same time the basketball and baseball work was taking place, we were trying to detect these psychological effects in tennis (Jackson 1993, 1995)—fortunate because in the other two sports the magnitude of any psychological effect was likely to be small and hence difficult to detect. Even if successive attempts at a shot in basketball or at making a hit in baseball were independent, there was never any possibility that they were identically distributed; that is, the probability of success in both those sports depends to a large extent on the situational variables. In tennis we were not faced with this latter problem because the sets are supposedly identical in the sense that the format is for practical purposes identical and designed not to convey an advantage to either player. And as it turned out, for the models we fit to our tennis data, the magnitude of the psychological effect is considerable.


The Wimbledon and U.S. Open Tennis Data

The main dataset of Jackson (1993) consists of the 251 completed best-of-five-sets matches from the men's singles at Wimbledon and the U.S. Open in 1987. Matches lasted of necessity 3, 4, or 5 sets and in total 918 sets were played. For each match the order in which the sets were won is available, which allows the score in sets at the commencement of a set to be included in any model. Moreover, the official rankings of the players as given by the Association of Tennis Professionals (ATP) are available. These ranks are treated as explanatory variables from which information on the relative abilities of the players in each match is extracted. In particular this prior information is used to obtain an estimate of the odds in the first set of each match. This allows a more thorough investigation of the possible dependency structure between the outcome of sets within a match.

The Relationship Between Ranks and Odds for Professional Tennis Players

The professional players in our dataset are the elite players from a large population of tennis players. What this implies is that, if we treat tennis ability as an attribute that has some standard but unknown distribution, then the expected relationships between the varying amounts of this attribute for the elite players are just those that apply in the tail of this unknown distribution. How one estimates odds in the first set of a contest between two such players when the ranks of the players are known is not central to the issues we are addressing here. Suffice it to say that there are standard procedures available in ranking and rating theory, which depend only on what form one assumes for the tail of the distribution. If we define O(r,s) to be the odds that a player ranked r beats a player ranked s in the first set, then the particular estimator that we use is

O(r,s) = (ratio of ranks)^a.   (1)

Or taking logs, equivalently,

log(odds of success) = a · log(ratio),   (1a)

where a is a parameter to be determined from the data.

Odds Model

O(i,j) = k^(i-j) · O(0,0),

where (i,j) is the score in sets and O(0,0) are the odds in the first set.

Figure 1. The odds model: Winning a set increases the odds of winning the next set by a factor k.

Because for this estimator only the ratio of the ranks is relevant, this implies that, for example, the highest ranked player has the same probability of success against the 4th ranked player as the 20th ranked player has against the 80th. The parameter a determines what this probability is. Small values of the parameter a imply a large random element to the outcome regardless of differences in rank, but large a implies that even small differences in rank lead to a high probability of success for the higher ranked player.
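To make the rank-based estimator concrete, here is a minimal sketch in Python (not from the original article; the value of a used below is simply the odds-model estimate reported in Table 1) of the first-set probability implied by equation (1a):

import math

def first_set_prob(rank_favorite, rank_opponent, a):
    """P(higher-ranked player wins the first set) under equation (1a):
    log-odds = a * log(ratio of ranks)."""
    ratio = rank_opponent / rank_favorite      # e.g., rank 1 vs. rank 4 gives ratio 4
    log_odds = a * math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Only the ratio matters: rank 1 vs. rank 4 gives the same probability as 20 vs. 80.
a = 0.441
print(first_set_prob(1, 4, a), first_set_prob(20, 80, a))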

Although the relationship between ranks and odds for the elite players as given by the preceding equation has a strong theoretical basis and has been used by several authors, its usefulness depends on the accuracy of the ranking system used by the ATP. If the ranking system is poor and lesser players have been ranked above better players, then the predictive value of the estimator will suffer. It is necessary to test whether our model for odds in the first set of a match actually fits the data. Because we are dealing with individual successes or failures in each set, we need to group the Wimbledon and U.S. Open tennis data to test for goodness of fit. When this is done it can be shown that the model does indeed provide a good fit to the data. This not only validates our use of this particular estimator but also lends support to what is a widely accepted view among professional tennis players that the ranking system provides a fair and reasonably accurate guide to the relative merits of the tournament players.

The Odds Model

We now introduce a model for the odds of winning a set, the odds model, that incorporates the effect due to the "score in sets" at the commencement of the set. The odds model was one of several models that allowed for the existence of PM that were fitted to the Wimbledon and U.S. Open tennis data. We define O(i,j) to be the odds of winning the next set (the i+j+1st set) when the score is (i,j) in sets. Then, in simple terms, the odds model states that "Winning a set increases the odds of winning the next set by a factor k" (see Fig. 1).

For the odds model, the odds for success in the next set depend only on the difference between the number of successes (sets won) and failures up to that set and on O(0,0), the odds for success in the first set:

O(i,j) = k^(i-j) · O(0,0).   (3)

Because the ranks of the players in each match are known, we can exploit this information by using Equation (1a) to estimate O(0,0), the odds of winning the first set. We can then rewrite Equation (3), which defines the odds model, as follows:

log O(i,j) = a · log(ratio) + log(k) · (i − j),   (3a)

where (i,j) is the score in sets and ratio is the ratio of the ranks of the players in that match.

Taking k = 1 in the model statement eliminates the dependence on the score and therefore makes the independence model a special case of the odds model. Both the odds model and the independence model, in which the odds in the first set of each match are estimated from the ranks of the players, are fitted to the original 1987 Wimbledon and U.S. Open tennis data of Jackson (1993). The independence model provides a very poor fit, whereas the odds model, which includes the effect caused by the score, explains that data extremely well.


For the odds model, the estimate for the parameter k, which is a measure of psychological momentum, is k = 1.6 with standard error .12. It is particularly noticeable for the independence model that it badly underestimates the number of heavy defeats in the data. Later in the article we fit both of these models and some additional models to a new Wimbledon and U.S. Open tennis dataset, which contains the results of matches from 1987 and 1988, and the improvement in the fit by introducing the effect due to the score into the model is just as marked (see Table 1) for the larger dataset as it was previously. Moreover, as before, the number of heavy defeats in the larger dataset (see Table 2) is very poorly accounted for by the independence model.
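The way the score effect propagates through a whole match can be illustrated with a short enumeration. The sketch below (Python; it is not the authors' MLn code, and the parameter values are simply the odds-model estimates from Table 1) computes the probability of each final score in a best-of-five match for the higher-ranked player:

import math

def p_win_set(i, j, ratio, a, k):
    """P(higher-ranked player wins the next set) at score (i, j), equation (3a)."""
    log_odds = a * math.log(ratio) + math.log(k) * (i - j)
    return 1.0 / (1.0 + math.exp(-log_odds))

def match_outcome_probs(ratio, a, k, sets_to_win=3):
    """Probability of each final score (won, lost) for the higher-ranked player."""
    probs = {}

    def walk(i, j, p_path):
        if i == sets_to_win or j == sets_to_win:
            probs[(i, j)] = probs.get((i, j), 0.0) + p_path
            return
        p = p_win_set(i, j, ratio, a, k)
        walk(i + 1, j, p_path * p)
        walk(i, j + 1, p_path * (1.0 - p))

    walk(0, 0, 1.0)
    return probs

# Evenly matched players (ratio of ranks = 1) under the fitted odds model.
print(match_outcome_probs(1.0, a=0.441, k=1.48))

Summing probabilities of this kind over all 501 matches, each with its own rank ratio, is how the expected counts reported later in Table 2 are assembled.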

So far we have summarized the evidence for PM in tennis and mentioned some of the work that has taken place in the ongoing search for evidence of PM in other sports. Although the evidence for PM in tennis is strong, perhaps it is possible to tell a different story that explains the apparent overabundance of heavy defeats that we see in tennis. A possible candidate is a random-effects model for player ability.

Random-Effects Model for Player Ability

One may accept (we do but with certain reservations) that the ATP ranking system is adequate and also accept, as we do, that the ratio of the ranks of the players is a reasonable function to use in estimating odds in the first set of a match. Yet one may still argue that a player's ability varies from day to day. In that case one may argue that a player's ranking is only an indicator of his average ability and that the function a · log(ratio) is an estimate of the average log-odds in the first set of a match between contestants with known ranks.

For instance a frequent occurrence in tennis is that a player inflicts a heavy defeat on his opponent on one day but himself suffers a heavy defeat by the same opponent in a subsequent match.

Results of two matches between the same players:

Day 1: A beats B 3/0
Day 2: A loses to B 0/3

In this example we have only heavy defeats.

Model 1: Psychological momentum

We can explain these data by PM, as before, by saying that there is a probability of .5 of winning the first set in each match but that whoever wins the first set has probability 1 of winning any subsequent set, a true dependency structure. [See Jackson (1995) for more detailed discussion of this type of dataset.]

Model 2: Player ability varies from day to day

For player A:
P = 1 for each set on Day 1
P = 0 for each set on Day 2

Alternatively we can say that on average these players are of equal ability but that player A had probability 1 of winning every set on the first day and probability 0 on the second. In this case player ability varies substantially from day to day but remains fixed on any given day.

The second model is an example of a random-effects model for player ability, and it explains the apparent overabundance of heavy defeats that we see in the data just as well as PM; that is, the likelihood of the observation is the same for both models. It is also an independence model because the probability of winning a set does not vary within a match. If we were to predict what would happen in a subsequent match, we would say that one of the players will have a probability 1 of winning every set but that player is equally likely to be player A or B.

Random-Effects Models and Heavy Defeats

We can generalize this relationship between random effects and heavy defeats. If the true probability (P) of winning a set for Player A against Player B is a random variable from match to match with mean p, this implies that heavy defeats are more likely (for both players) than if Player A's probability of winning a set remained a constant (p) from match to match. This addition of a random effect because of an apparent overabundance of heavy defeats in the data is very similar to the introduction, in other circumstances, of a random effect into a model in an attempt to compensate for overdispersion.

Linear Logistic Model With a Random Effect

We want to adopt a linear logistic model for the relationship between the true probability P of winning a set, which is assumed to be a random variable independently chosen from day to day, and any explanatory variables. In that case an independence model is that the log-odds for success in a set is a fixed effect, which is particular to that match and based on the relative abilities of the players, plus some random effect that is chosen independently for each match:

log(odds for success in a set) = fixed effect + r   (4)

log(odds for success in a set) = a · log(ratio) + r   (4a)

• Independence model
• r is a random-match effect

As before, for players of known ranks, we choose to use a · log(ratio) for the fixed effect, leading to the model specified by Equation (4a). We also assume that the random component r has zero mean. In that case we associate the fixed term in the model with the average log-odds for that match. This is an independence model because the probability of success in a set is constant for a given match, although it will vary from match to match for the same two players due to the random effect. Similarly for matches between different players in which the ratio of the ranks is the same in both matches, the probability of winning a set will vary between matches because the random effects have been independently chosen.

To fit such a model we need to specify the distribution of the random variable r.


Model Formulas

Models A and B are straightforward logistic regressions; models AR and BR are logistic regressions with a random effect.

Model   Formula
A       log O_m(i,j) = a · log(ratio_m)
B       log O_m(i,j) = a · log(ratio_m) + log(k) · (i − j)
AR      log O_m(i,j) = a · log(ratio_m) + r_m
BR      log O_m(i,j) = a · log(ratio_m) + log(k) · (i − j) + r_m

where log O_m(i,j) are the log-odds of the higher ranked player winning the next set when the score in sets is (i,j) in match m, and ratio_m is the ratio of the ranks of the players in match m as previously defined. The random effect r_m ~ N(0, σ²) and the population parameters a, k, and σ² are to be estimated.

Here we assume that the random effects are chosen independently from a Normal(0, σ²) distribution, although that is only one of many distributions that could reasonably have been chosen. Of course, whatever distribution is chosen one needs specialist software to fit any of these random-effects models, including the Normal. We have used the MultiLevel modeling package (MLn) for all models, whether or not they include a random effect.
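For the random-effects models the match effect r is not observed, so any unconditional probability involves an integral over its Normal distribution. The sketch below (Python with NumPy; a stand-in for the numerical integration described later for Table 2, not the MLn fit) computes the unconditional probability of a 3/0 win under model AR by Gauss-Hermite quadrature:

import numpy as np

def p_three_nil_given_r(ratio, a, r):
    """Conditional P(3/0 win for the higher-ranked player) under model AR:
    within a match the sets are independent with log-odds a*log(ratio) + r."""
    p_set = 1.0 / (1.0 + np.exp(-(a * np.log(ratio) + r)))
    return p_set ** 3

def p_three_nil(ratio, a, sigma2, n_nodes=40):
    """Unconditional P(3/0): average the conditional probability over r ~ N(0, sigma2)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)  # weight exp(-x^2/2)
    vals = p_three_nil_given_r(ratio, a, nodes * np.sqrt(sigma2))
    return np.sum(weights * vals) / np.sqrt(2.0 * np.pi)

# Model AR estimates from Table 1, for evenly matched players (ratio = 1).
print(p_three_nil(1.0, a=0.532, sigma2=0.625))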

Parameter Estimation and Model Comparison: The New Wimbledon and U.S. Open Data

The results of the matches in the 1987 men's singles tournaments at Wimbledon and the U.S. Open made up the original dataset to which the odds model and some other dependent trials models (but not any random-effects models) were fitted in earlier works. Here we have added the matches from the 1988 tournaments at both venues. The dataset now consists of 1847 sets from 501 matches. In the accompanying sidebar we summarize the results of fitting the following four models to this dataset. The models that we wish to compare and which we specify in full in the sidebar are

A. The simple independence model.
B. The odds model. This allows for the existence of PM, but does not preclude simple independence; that is, k = 1.
AR. The independence model with a N(0, σ²) random effect. If σ² = 0 this again reduces to simple independence.
BR. The odds model with a Normal random effect.

The year and the venue were also considered as explanatory variables; however, because these did not have any significant effect we do not report the results here.

Model comparisons and interpretation of the analysis follow:

(1) AR and A. By comparing deviances, we see from Table 1 that model AR, the independence model with a Normal random effect, is a big improvement when compared to the simple independence model. The estimate for σ² of .625 (s.e. = .12) is significantly different from 0, and there is a reduction in the deviance of 38.

(2) B and A. The improvement in fit for the odds model (B) over simple independence is significantly greater than that achieved by the random-effects model (AR). For the odds model there is a reduction of 65 in the deviance, and the estimate for k, the index of PM, is 1.48.

(3) BR and B. The addition of a Normal random effect to the odds model does not significantly improve the fit.

The improvement in fit for the random-effects model (AR) over simple independence is not unexpected because we know that one of the main flaws in the simple independence model is that it badly underestimates the number of heavy defeats, and we suspected that the introduction of a random effect was likely to go some way toward correcting that defect. It doesn't go far enough, however. The analysis confirms that the impact of PM cannot be ignored. For the odds model, the estimate for k of approximately 1.5 is not only statistically significant but also has a large practical impact. For example, for two evenly matched players it implies that the winner of the first set will have a probability of .6 of winning the second set. If he wins that set then his probability of winning the third set is .69.
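Written out for evenly matched players (first-set odds O(0,0) = 1) with k = 1.5, the arithmetic behind these two figures is:

\[
O_{1,0} = k\,O_{0,0} = 1.5, \qquad p = \frac{1.5}{1 + 1.5} = .60;
\qquad
O_{2,0} = k^{2}\,O_{0,0} = 2.25, \qquad p = \frac{2.25}{1 + 2.25} \approx .69 .
\]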

The estimate for the variance of the random effect in models AR and BR enables us to calculate the reasonable range of probabilities of winning the first set in each of these models. Even for model BR (it is much larger for model AR) this estimate of the variance, σ² = .142, implies a considerable level of variation in player ability from day to day. In this case, for evenly matched players there is at least a 5% likelihood that the probability of winning the first set on any given day is outside the range .315 to .685.

The primary question we seek to answer is "Is it possible to rescue the concept of independent sets within matches solely by the addition of a random effect?"

Table 2—Results for Higher Ranked Player of 501 Matches From the 1987 and 1988 Wimbledon and U.S. Open Tournaments

Expected values for the following models (parameters as given in Table 1).

Result   Observed   Model A (simple    Model B         Model AR (independence with
                    independence)      (odds model)    a Normal random effect)
3/0        191         158.3             193.7             172.5
3/1        104         132.8             112.3             115.3
3/2         57          87.2              58.8              70.2
2/3         36          52.7              36.7              49.7
1/3         54          44.6              49.1              51.9
0/3         59          25.5              50.5              41.3


In other words, can we produce an independence model that is comparable to PM as an explanation of these data, or must we necessarily abandon the idea of independence, which is a much stronger statement? Well, we haven't rescued it yet, which is not to say that it cannot be rescued, perhaps by some radically different model for random variation in player ability than the one we have been considering. What is clear, however, is:

• The independence model with a normal random effect is not comparable to the odds model as an explanation of these data.

• The proposed model for variation in a player's ability contributes little to the overall fit, whereas the effect due to the score is substantial.

Heavy Defeats at Wimbledon and the U.S. Open

The number of heavy defeats that occurred at these two tournaments is an aspect of the data that is of considerable interest. Table 2 gives the results of matches for the higher ranked player in terms of sets won and lost and the expected numbers of these results for the various models. It includes both the number of heavy defeats suffered by the higher ranked player (the 0/3 results) and the number of heavy defeats suffered by the lower ranked player (the 3/0 results).

The order in which the sets were won and lost has been suppressed in the view of the data contained in Table 2, although knowledge of the order was used in estimating some of the parameters associated with the models. The expected values were obtained by using the known ranks of the players in each match, together with the fitted parameters, to calculate the probability of a 3/0, 3/1, 3/2 result for both players in each match and summing these probabilities over all 501 matches. For the simple independence model and for the odds model, this is a straightforward calculation. For the random-effects model, however, the likelihood of a 3/0, 3/1, 3/2 result in a given match is dependent on the particular value of the random effect in that match and it is necessary to evaluate (numerically) some rather inelegant-looking integrals to calculate the unconditional likelihood for each result.

The simple independence model underestimates the number of heavy defeats in these data, considerably so for the lower ranked player (the 3/0 results) and dramatically so for the number of heavy defeats suffered by the higher ranked player (the 0/3 results). The independence model with a Normal random effect, model AR, fits this aspect of the data much better but is still not comparable to the odds model, so even if we were to judge solely on the criteria of how well the models fit to the heavy-defeats aspect of the data, the proposed model for random variation in player ability from day to day is not going to rescue the concept of an "independent-sets-within-matches" model as an explanation of these data.


In Table 2, we have refrained from including the expected values for model BR—that is, the odds model with the addition of a Normal random effect. There are two main reasons for this. First, we are primarily concerned with the comparison between the odds model and an independent-sets-within-matches model. Second, as we saw in Table 1, the fit to the data for this full model, which includes both the PM effect and the effect due to random variation in player ability, is not significantly better than the model for PM on its own—namely, the odds model. Indeed the expected numbers of 3/0 and 0/3 results are very similar for both models. As the expected values are extremely burdensome to compute, we did not proceed with the computations for the other possible results for this model. If one accepts the existence of PM, however, then the full model is a reasonable starting point in any investigation of the relative contributions of random variation in player ability and PM to the observed outcomes.

Fundamental Dependency in the Tennis Data

A summary of the results of many matches may provide evidence that the sets within matches were not independent. For example, when a player wins by a score of 3/1, there are three different sequences that may occur—namely, LWWW, WLWW, and WWLW. For a 3/2 result, there are six sequences. If sets within matches are independent, then each of the three sequences for a 3/1 result will have equal likelihood and similarly for the 3/2 results, because for any independence model each sequence is equally likely (irrespective of the constant probability of winning a set in a given match). Hence, we would expect approximately equal numbers for each of these sequences in our data. If this is not the case, then this is evidence of fundamental dependency in the data.
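To spell out why equal counts are expected, write p for the eventual winner's (constant) probability of taking a set in that match; under independence every ordering of his three wins and one loss has the same probability,

\[
P(\mathrm{LWWW}) = P(\mathrm{WLWW}) = P(\mathrm{WWLW}) = p^{3}(1-p),
\]

whatever the value of p, and likewise each of the six orderings behind a 3/2 result has probability p^3 (1 − p)^2.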

In the Wimbledon and U.S. Open dataset, 158 matches finished 3/1 and 93 matches finished 3/2 for one or the other of the players (see Table 2); the other matches were straight sets wins. A preliminary chi-squared investigation as to whether the numbers for each of the sequences leading to a 3/1 result are significantly different proved inconclusive. Similarly, for the six sequences leading to a 3/2 result. Because the categories in which the winner of the match loses a set—that is, 1st, 2nd, 3rd, or 4th set—are clearly ordinal, however, a model that includes this ordinality was fitted to the data. For both the 3/1 results and the 3/2 results, there is evidence that the set or sets lost by the winner in these matches occurred earlier rather than later in the match, which would not be so for independent sets. For instance for the 3/1 results there are 60 results in which the loss occurred in the first set—that is, LWWW—and 41 in which the loss occurred in the third set—that is, WWLW. Of the 93 matches that lasted five sets, in 23 the winner lost the first two sets, more than for any other sequence, and there is an overall trend for losses in earlier rather than later sets. This is a weak test for dependency in this type of data because we cannot make use of the 3/0 or 0/3 results;


however, in this case it does produce evidence of fundamental dependency that implies that any independent-sets-within-matches model will provide an inadequate description of the data. The test provides evidence of dependency, although it is not immediately obvious that it is evidence of PM. If PM exists, however, then in general it is easier for the eventual loser of a match to win a set early in the match rather than later when the effect of PM is more pronounced.

Some Conclusions for the Wimbledon and U.S. Open Data

We have seen that the independence model with a Normal random effect does not rival the odds model as an explanation of the data. Indeed the evidence of fundamental dependency implies that, if we did produce an independent-sets-within-matches model that fitted as well as the odds model, we would be forced to conclude that both models were inadequate descriptions of the data. It appears then that we must abandon the idea of independence. To abandon independence, however, is not to say that one must reject the common-sense idea that player ability varies from day to day, only that on its own such a model is unlikely to be successful.

Table 4—Model Comparisons for the Head-to-Head Datasets: Parameter Estimates and Goodness-of-Fit Statistics

Lendl vs. Connors
  (1) Independence, i.e., odds constant:
      Deviance = 58.3
  (2) Odds model, i.e., O(i,j) = k^(i-j) · O(0,0):
      Estimates: d = 1.15, k = 2.01; Deviance = 54
  (3) Random effect, i.e., log(odds) = const. + r, r ~ N(0, σ²):
      Estimates: const. = .74, σ² = 1.02; Deviance = 56.0

McEnroe vs. Borg
  (1) Independence, i.e., odds constant:
      Estimate: 1.10; Deviance = 60.9
  (2) Odds model, i.e., O(i,j) = k^(i-j) · O(0,0):
      Estimates: d = 1.13, k = .75; Deviance = 60.4
  (3) Random effect, i.e., log(odds) = const. + r, r ~ N(0, σ²):
      Estimates: const. = .09, σ² = 0; Deviance = 60.9

Whatever the contribution of random variation in a player's ability from day to day may be, our analysis suggests that psychological momentum is certainly a major factor in the outcome of matches at the Wimbledon and U.S. Open tennis tournaments.

Head-to-Head Records: Lendl/Connors and Borg/McEnroe

When a number of matches take place between the same two players over a period of time, it is reasonable, under certain circumstances, to make the assumption that the expected probability of winning the first set (at least) remains the same in each of the matches.

Table 3—Matches Won by Winning Score for Ivan Lendl versus Jimmy Connors, 1982-1985 (16 matches), and for John McEnroe versus Bjorn Borg, 1978-1981 (14 matches)

Head-to-head records

Lendl vs. Connors, 1982-1985: Matches won
Winner's score   Lendl   Connors
2/0                7        1
2/1                2        1
3/0                2        0
3/1                0        3
3/2                0        0

McEnroe vs. Borg, 1978-1981: Matches won
Winner's score   McEnroe   Borg
2/0                 2        3
2/1                 1        3
3/0                 0        0
3/1                 3        0
3/2                 1        1

For instance, in the middle stages of a player's career one might assume that his average ability (allowing for possible random day-to-day variation) remains constant. Because our interest is in a possibly changing probability of winning a set within a match, this simplifies matters somewhat. It is no longer necessary, by means of the ranks or other explanatory variables, to estimate a changing underlying probability from match to match. The expected probability of winning the first set is assumed to remain constant for all matches between those two players. Unfortunately, such datasets tend to be small. That is certainly true for the head-to-head records we look at here. The data themselves are interesting because they relate to some of the greatest players of all time. They are presented here mainly for that reason, but it is doubtful if the head-to-head records, on their own, of any two professional tennis players could provide sufficient data to make possible anything other than a crude assessment of the relative abilities of the players.

Table 3 contains (a) the head-to-head record of Ivan Lendl and Jimmy Connors from 1982-1985, a period when both players can be considered to be near the peak of their abilities,


Lendl having just reached his and Connors not much past his best, and (b) the lifetime head-to-head record for John McEnroe and Bjorn Borg, a classic series of 14 matches over a three-year period from 1978-1981, between two players of similar age, competing for the number 1 spot in their sport. As previously, the order in which the sets were won and lost has been suppressed in this view of the data.

The matches in these head-to-head records were played using either a best-of-three- or best-of-five-sets format. For these small datasets it is assumed that the parameter k in the odds model is the same for either format. Of course any evidence of PM obtained by fitting the odds model to these datasets is applicable only to matches between the two named players. This differs from the approach taken earlier where the parameter k that is estimated is a population effect, applicable to all matches between players from the population being considered.

• For Lendl/Connors there appear to be many heavy defeats. Nine of Lendl's eleven wins were in straight sets, and none of the best-of-five matches went to the fifth set.

• For Borg/McEnroe, there were no 3/0 results for either player, and approximately half of their short matches went to a final set. There is no apparent evidence of an overabundance of heavy defeats—if anything, the reverse.

By comparing deviances we see that for Lendl/Connors, the odds model is superior to the independence model with a Normal random effect (see Table 4), although both pick up the relatively high number of heavy defeats between these two players and both provide a better fit than simple independence. The estimate of the index of PM, the parameter k in the odds model, is 2.01 and for the random-effects model the estimate for the variance of the random effect is 1.02. Of the two estimates, it is only the parameter estimate for k in the odds model that is marginally significant.

For McEnroe/Borg, the estimate for the random effect is identically 0, indicating that any variation in player ability will result in a lesser fit than the simple independence model. The odds model, however, does provide a nonredundant estimate for the index of PM between these players, although the fit to the data is practically the same as for simple independence and the estimate for k is less than 1. It is worth pointing out that values of k < 1 in the odds model indicate that success breeds failure or, if you prefer, failure breeds success, and it is a feature of the odds model that it is equally capable of picking up that type of dependency where it exists, as well as what is generally believed to be the more common form of PM—namely, success breeds success.

PM may be one explanation for the stunning victory of Bjorn Borg(above) over John McEnroe in their Wimbledon finals match.

McEnroe and Momentum: "You Cannot Be Serious"

So for the Borg/McEnroe series of matches there is no evidence that the probability of winning a set in any of their matches was influenced by the score or that the probability of winning a set varied from match to match. For Lendl/Connors the data do lend a little weight to the conclusion that either PM (k > 1) or variation in player ability from day to day (or perhaps both) was a factor in that series of matches. If indeed that was the case, and it is far from proven from this analysis, it is for others to speculate as to why that might have been so.

Thanks to Bill Benter of Hong Kong who first suggested to us that it might be worth investigating whether "if players' abilities did fluctuate from their overall rankings...this might salvage the independence model," and to the editor and referees of Chance for some helpful comments. And finally sincere and heartfelt thanks to the odds compilers at Ladbrokes, Hills, and Corals who, by their absolute reliance on independence in a series of trials, have unknowingly and unwittingly supported this research over many years. It is literally true to say that without their generous and regular contributions to the advancement of science this work would not have been possible.

References and Further Reading

Albert, J. (1993), Comment on "A Statistical Analysis of Hitting Streaks in Baseball," Journal of the American Statistical Association, 88, 1184-1188.

Albright, S. C. (1993), "A Statistical Analysis of Hitting Streaks in Baseball," Journal of the American Statistical Association, 88, 1175-1183.

Jackson, D. A. (1993), "Independent Trials Are a Model for Disaster," Applied Statistics, 42, 211-220.

Jackson, D. A. (1995), "Tennis in Lilliput: A Fable on Sports and Psychology," Chance, 8 (3), 7-40.

Kruskal, W. (1988), "Miracles and Statistics: The Casual Assumption of Independence," Journal of the American Statistical Association, 83, 929-940.

Stern, H. (1995), "Who's Hot and Who's Not," Proceedings of the Section on Statistics in Sports, American Statistical Association, pp. 26-35.

Stern, H., and Morris, C. (1993), Comment on "A Statistical Analysis of Hitting Streaks in Baseball," Journal of the American Statistical Association, 88, 1189-1194.

Tversky, A., and Gilovich, T. (1989), "The Cold Facts About the 'Hot Hand' in Basketball," Chance, 2 (1).


Chapter 43

Who is the Fastest Man in the World?

Robert Tibshirani

I compare the world record sprint races of Donovan Bailey and Michael Johnson in the 1996 Olympic Games, and try to answer the questions: 1. Who is faster?, and 2. Which performance was more remarkable? The statistical methods used include cubic spline curve fitting, the parametric bootstrap, and Keller's model of running.

KEY WORDS: Sprinting; World record; Curve fitting.

1. INTRODUCTION

At the 1996 Olympic Summer Games in Atlanta both Donovan Bailey (Canada) and Michael Johnson (United States) won gold medals in track and field. Bailey won the 100 meter race in 9.84 seconds, while Johnson won the 200 meter race in 19.32 seconds. Both marks were world records. After the 200 m race, an excited United States television commentator "put Johnson's accomplishment into perspective" by pointing out that his record time was less than twice that of Bailey's, implying that Johnson had run faster. Of course, this is not a fair comparison because the start is the slowest part of a sprint, and Johnson only had to start once, not twice.

Ato Bolton, the sprinter who finished third in both races, was also overwhelmed by Johnson's performance. He said that, although normally the winner of the 100 meter race is considered the fastest man in the world, he thought that Johnson was now the fastest.

In this paper I carry out some analyses of these two world record performances. I do not produce a definitive answer to the provocative question in the title, as that depends on what one means by "fastest." Hopefully, some light is shed on this interesting and fun debate. Some empirical data might soon become available on this issue: a 150 meter match race between the two runners is tentatively scheduled for June 1997.

2. SPEED CURVES

The results of the races are shown in Tables 1 and 2. A straightforward measure of a running performance is the speed achieved by the runner as a function of time. The first line of Table 3 gives the interval times for Bailey,

Robert Tibshirani is Professor, Department of Preventive Medicine and Biostatistics and the Department of Statistics, University of Toronto, Toronto, Ont., Canada M5S 1A8. The author thanks Trevor Hastie, Geoff Hinton, Joseph Keller, Bruce Kidd, Keith Knight, David MacKay, Carl Morris, Don Redelmeier, James Stafford, three referees, and two editors for helpful comments, Cecil Smith for providing Bailey's official split times from Swiss Timing, and Guy Gibbons of Seagull Inc. for providing the corrected version of the Swiss Timing results. This work was supported by the Natural Sciences and Engineering Research Council of Canada.


Table 1. Results for 1996 Olympic 100 m Final; The Reaction Time is the Time it Takes for the Sprinter to Push Off the Blocks after the Firing of the Starter's Pistol; DQ Means Disqualified

     Name                                    Time    Reaction time
1.   Bailey, Donovan (Canada)                9.84    .174
2.   Fredericks, Frank (Namibia)             9.89    .143
3.   Bolton, Ato (Tobago)                    9.90    .164
4.   Mitchell, Dennis (United States)        9.99    .145
5.   Marsh, Michael (United States)         10.00    .147
6.   Ezinwa, Davidson (Nigeria)             10.14    .157
7.   Green, Michael (Jamaica)               10.16    .169
8.   Christie, Linford (Great Britain)         DQ

Wind speed: +.7 m/s

obtained from Swiss Timing and reported in the Toronto Sun newspaper. These times were not recorded for Johnson. The value 7.7 at 70 m is almost surely wrong, as it would imply an interval time of only 0.5 seconds for 10 m. I contacted Swiss Timing about their possible error, and they rechecked their calculations. As it turned out, the split times were computed using a laser light placed 20 m behind the starting blocks, and they had neglected to correct for this 20 m gap in both the 70 and 80 m split times. The corrected times are shown in Table 3.

The estimated times at each distance shown in Table 4 were obtained manually from a videotape of the races. Here is how I estimated these times. I had recorded the 100 m hurdles race on the same track. Using the known positioning of the hurdles, I established landmarks on the infield whose distance from the start I could determine. Then by watching a video of the sprint races in slow motion, with the race clock on the screen, I estimated the time it took to reach each of these markings.

Table 5 compares the estimated and official split times. After the 40 m mark, the agreement is fairly good. The disagreement at 10, 20, and 30 m is due to the paucity of data and the severe camera angle for that part of the race. Fortunately, these points do not have a large influence on the results, as our error analysis later shows. Overall, this agreement gives us some confidence about the estimated times

Table 2. Results for 1996 Olympic 200 m Final; "?" Means the Information was Not Available

     Name                                    Time    Time at 100 m   Reaction time
1.   Johnson, Michael (United States)       19.32       10.12           .161
2.   Fredericks, Frank (Namibia)            19.68       10.14           .200
3.   Bolton, Ato (Trinidad and Tobago)      19.80       10.18           .208
4.   Thompson, Obadele (Barbados)           20.14         ?             .202
5.   Williams, Jeff (United States)         20.17         ?             .182
6.   Garcia, Ivan (Cuba)                    20.21         ?             .229
7.   Stevens, Patrick (Belgium)             20.27         ?             .151
8.   Marsh, Michael (United States)         20.48         ?             .167

Wind speed: +.4 m/s


Table 3. Official Times at Given Distances for Bailey; The "?" Indicates a Suspicious Time, Later Found to be in Error

Distance (m)          0     10    20    30    40    50    60    70     80    90    100
Original time (s)    .174   1.9   3.1   4.1   4.9   5.6   6.5   7.7?   8.2   9.0   9.84
Corrected time (s)   .174   1.9   3.1   4.1   4.9   5.6   6.5   7.2    8.1   9.0   9.84

Table 4. Estimated Times at Given Distances; Bailey Starts at the 100 m Mark; "+" Denotes Distance Past 100 m: For Example, "+12.9" Means 112.9 m

Distance (m)    0      50     100     +12.9   +40.3   +49.4   +67.7   +76.9   +86.0   +100
Bailey:        .174    --      --       2.8     5.0     5.7     7.0     7.8     8.5    9.84
Johnson:       .161    6.3   10.12     11.4    14.0    14.8    16.2    17.0    17.8   19.32

Table 5. Comparison of Official and Estimated Interval Times for Bailey

Distance (m)    0     10    20    30    40    50    60    70    80    90    100
Official       .174   1.9   3.1   4.1   4.9   5.6   6.5   7.2   8.1   9.0   9.84
Estimated      .174   2.1   3.4   4.3   5.1   5.7   6.4   7.2   8.0   8.9   9.84

Table 6. Estimated Times (seconds) for Johnson for Distances over 100 m

Distance (m)   100    110    120    130    140    150    160    170    180    190    200
Johnson       10.12  11.10  12.09  13.06  13.97  14.83  15.61  16.40  17.26  18.23  19.32

for Johnson, and some idea of the magnitude of their error. The speed curves were estimated by fitting a cubic smoothing spline to the first differences of the times, constraining the curves to be 0 at the start of the race. The curves for each runner are shown in the top panel of Figure 1. Because Bailey's 100 m was much faster than Johnson's first 100 m but slower than his second 100 m, it seems most interesting to make the latter comparison. Hence I have shifted Bailey's curve to start at time 10.12 s, Johnson's time at 100 m.
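A minimal sketch of this construction in Python with SciPy (the choice of UnivariateSpline, the smoothing parameter, and the pseudo-observation used to pin the curve at zero are my assumptions, not details taken from the paper), applied to Bailey's corrected split times from Table 3:

import numpy as np
from scipy.interpolate import UnivariateSpline

# Bailey's corrected cumulative times (s) at 0, 10, ..., 100 m (Table 3);
# the reaction time is treated as the "time" at 0 m.
dist = np.arange(0, 101, 10)
time = np.array([0.174, 1.9, 3.1, 4.1, 4.9, 5.6, 6.5, 7.2, 8.1, 9.0, 9.84])

# Interval speeds from first differences, placed at the interval midpoints in time.
speed = np.diff(dist) / np.diff(time)
t_mid = (time[1:] + time[:-1]) / 2.0

# Add a pseudo-observation of zero speed at the start, then smooth.
t_fit = np.concatenate(([0.0], t_mid))
v_fit = np.concatenate(([0.0], speed))
curve = UnivariateSpline(t_fit, v_fit, k=3, s=1.0)

grid = np.linspace(0.0, time[-1], 200)
print("estimated maximum speed (m/s):", curve(grid).max())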

If Johnson's speed curve always lay above Bailey's, then this analysis would have provided convincing evidence in favor of Johnson because he achieved his speed despite having already run 100 meters. However, Bailey's curve does rise above Johnson's, and achieves a higher maximum (13.2 m/s for Bailey, 11.8 m/s for Johnson). A 95% confidence interval for the difference between the maxima, computed using the parametric bootstrap, is (−.062, 1.15). Hence there is no definitive conclusion from this comparison. The Appendix gives details of the computation of this confidence interval.

We note that the estimate of 13.2 m/s for Bailey's maximum speed differs from the figure of 12.1 m/s reported by Swiss Timing. Bailey's estimated final speed is 12.4 m/s versus 11.5 m/s reported by Swiss Timing. This size of discrepancy is not unexpected because the interval times are only given to within .1 of a second. When a sprinter is running at top speed, he covers 10 m in approximately .8 s, giving a speed of 10/.8 = 12.5 m/s. Now if each of the interval times are off by .05 s, then the estimated speed ranges from 10/.9 = 11.1 m/s to 10/.7 = 14.3 m/s.

Who would win a race of say 150 meters? Here is a simple-minded approach to the question. Bailey's speed at the 100 m mark was 12.4 m/s, and his speed was decreasing by only .036 m/s every 10 m. Johnson's estimated time at 150 m was 14.83 s, as given in Table 6. In order to beat that time Bailey would need "only" to maintain an average speed of more than 10.02 m/s for another 50 m. Of course, it is not clear whether he could do this. In the next section we appeal to a parametric model to perform the necessary extrapolation.

For interest, in the bottom panel of Figure 1 we compare Bailey's curve to that from Ben Johnson's 1987 9.83 s world record race (he was later disqualified for drug usage). They achieved roughly the same time in quite different ways: Ben Johnson got a fast start, and then maintained his velocity; Bailey accelerated much more slowly, but achieved a higher maximum speed.

3. PREDICTIONS FROM KELLER'S MODEL

Keller (1973) developed a model of competitive running that predicts the form of the velocity curve for a sprinter using his resources in an optimal way. Here we use his model to predict the winner of a 150 m race.

According to Keller's theory, the force f(t) per unit mass at time t, applied by a sprinter in the direction of motion, may be written as

f(t) = dv(t)/dt + v(t)/τ,   (1)

where v(t) is the velocity and τ is a damping coefficient. This is just Newton's second law, where it is assumed that the resistance force per unit mass is v(t)/τ.

Keller estimated τ to be .892 s from various races. Excellent overviews of Keller's work are given by Pritchard (1993) and Pritchard and Pritchard (1994).


Figure 1. Top Panel: Estimated Speed Curves for Bailey and Johnson. Bailey's curve has been shifted to start at time 10.12 s, Johnson's time at 100 m. Bottom panel: estimated speed curves for Bailey and Ben Johnson, from the latter's 1987 world record race.

Starting with assumption (1) and a model for energy storage and usage, Keller shows that the optimal strategy for a runner is to apply his maximum force F during the entire race, leading to a velocity curve

v(t) = Fτ(1 − e^(−t/τ)).   (2)

Figure 2. Top Row: Optimal Velocity Curves (Broken) for Bailey's 100 m. The top left panel uses Keller's model (2); the top right panel uses the modified model (4). The middle and bottom rows show the fit of the Keller and modified models for Johnson's 200 m. In all panels the solid curve is the corresponding actual (estimated) velocity curve from the top panel of Figure 1.

Figure 3. Estimated Distance Curves and Actual Distances (Points) from Least Squares Fit of Model (4).

This applies to races of less than 291 m. For greater distances there is a different optimal strategy. By integrating (2) we obtain the distance traveled in time t:

D(t) = Fτ[t − τ(1 − e^(−t/τ))].   (3)

Figure 2 (top left and middle panels) shows the optimal speed curves for the 100 and 200 m races, with Bailey's and Johnson's superimposed. We used least squares on the (time, distance) measurements to find the best values of τ and F for each runner in equation (3): these were (1.74, 7.16) for Bailey and (1.69, 6.80) for Johnson.
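A sketch of this fitting step in Python with SciPy (the optimizer, the starting values, and the handling of the reaction time are my choices; the data are Bailey's corrected splits from Table 3 rather than the exact measurements used in the paper):

import numpy as np
from scipy.optimize import brentq, curve_fit

def keller_distance(t, tau, F):
    """Distance after t seconds of running under Keller's model, equation (3)."""
    return F * tau * (t - tau * (1.0 - np.exp(-t / tau)))

# Running time = official split time minus the 0.174 s reaction time.
dist = np.arange(10, 101, 10).astype(float)
run_time = np.array([1.9, 3.1, 4.1, 4.9, 5.6, 6.5, 7.2, 8.1, 9.0, 9.84]) - 0.174

(tau_hat, F_hat), _ = curve_fit(keller_distance, run_time, dist, p0=[1.0, 8.0])
print("tau, F:", tau_hat, F_hat)

# Predicted 150 m time: solve keller_distance(t) = 150, then add back the reaction time.
t150 = brentq(lambda t: keller_distance(t, tau_hat, F_hat) - 150.0, 5.0, 30.0)
print("predicted 150 m time (s):", t150 + 0.174)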

We can use (3) to predict the times for a 150 m race; note that the reaction times must be included as well. The predictions are 13.97 s (Bailey) and 15.00 s (Johnson).

The same model also predicts a completely implausible 200 m time of 17.72 s for Bailey. One shortcoming of the model is the fact that the velocity curve (2) never decreases, but observed velocity curves usually do.

Figure 4. Estimated Optimal Velocity and Distance Curves over 150 m for Bailey (Dotted) and Johnson (Dashed).


Figure 5. The Predicted Race in 1 s Snapshots. Shown is the Estimated Distance Traveled by Bailey (B) and Johnson (J) at Time = 0 s, 1 s, ..., 14 s, and at the End of the Race (Time 14.73 s).

To rectify this it seems reasonable to assume that a sprinter is unable to maintain his maximum force F over the entire race, but instead applies a force F − c · t for some c ≥ 0. Using this in (1) leads to velocity and distance curves

v(t) = k(1 − e^(−t/τ)) − cτt,   D(t) = k[t − τ(1 − e^(−t/τ))] − cτt²/2,   (4)

where k = Fτ + τ²c. We fit this model to the observed distances by least squares, giving parameter estimates for (τ, F, c) of (2.39, 6.41, .20) and (2.06, 6.10, .05) for Bailey and Johnson, respectively. The fitted distance values are plotted with the actual ones in Figure 3. Note that the estimated maximum force is greater for Bailey than Johnson, but decreases more quickly. Bailey also has a higher estimated resistance.

The estimated curves are shown in the top right and bottom panels of Figure 2. The estimated 150 m times from this model are 14.73 s for Bailey and 14.82 s for Johnson. The latter is very close to the estimated time of 14.83 s at 150 m in the Olympic 200 m race from Table 6.

Figure 4 shows the estimated optimal velocity and distance curves over 150 m from the model, and Figure 5 depicts the predicted race in 1 s snapshots. Bailey is well ahead at the early part of the race, but starts to slow down earlier. Johnson gains on Bailey in the latter part of the race, but does not quite catch him at the end. The estimated winning margin for Bailey is .09 s. The bootstrap percentile 95% confidence interval for the difference is (.03 s, .26 s), and the bias-corrected 95% bootstrap confidence interval is (.02 s, .19 s). One thousand bootstrap replications were used—see the Appendix for details. Figure 6 shows a boxplot of the difference in the predicted 150 m times from the bootstrap replications.

Note that this model does not capture a possible change of strategy by either runner in a 150 m race. This might result in different values for the parameters.

From Keller's theory one can also predict world record times at various distances as a function of F and τ. Keller fit his predicted world record times to the actual ones, for distances from 50 yards to 10,000 m, in 1973. From this he obtained the estimates F = 12.2 m/s², τ = .892 s.

Figure 6. Boxplot of Johnson's Minus Bailey's Predicted 150 m Times from 1,000 Bootstrap Replications.

The fit was quite good: for 100 m—9.9 s (actual), 10.07 s (predicted); for 200 m—19.5 s (actual), 19.25 s (predicted). (The world records that Keller reports in 1973 of 9.9 and 19.5 s are questionable. The 100 m record was 9.95 s, although 9.9 s was the best hand-timed performance. The 200 m record was 19.83 s.) It is interesting that at the time, the 100 m record was faster than expected, but the 200 m record was slower. Johnson's performance brings the 200 m world record close to the predicted value. It may be that the 200 m record has been a little "soft," with runners focusing on the more glamourous 100 m race. Note that the predictions do not include a component for reaction time: with Johnson's reaction time of .161 s, the predicted record would be 19.41 s.

4. ADJUSTMENT FOR THE CURVE

Johnson's first 100 m (10.12 s) was run on a curve, and Bailey's was run on a straight track. Figure 7 shows the sprint track.

In the previous analysis we ignored this difference. Assuming we want to predict the performance of the runners over a straight 150 m course (the course type for the May 1997 race has not been announced at the time of this writing), we should adjust Johnson's 200 m performance accordingly. Intuitively, he should be given credit for having achieved his time on the more difficult curved course.

What is the appropriate adjustment? The centripetal acceleration of an object moving at a velocity v around a circle of radius r is a = v²/r. The radius of the circular part of the track is 100/π = 31.83 m. With Johnson's velocity ranging from 0 to 11.8 m/s, his centripetal acceleration ranges from 0 to 4.37 m/s². We cannot simply add this acceleration to the acceleration in the direction of motion because the centripetal acceleration is at right angles to the direction of motion. However, he does biological work in achieving this acceleration, and hence spends energy. Unfortunately, just how much energy is expended is

Figure 7. The Sprint Track, Showing Start and Finish Lines for the100 and 200 meter Races.


Figure 8. Percentage that Each Runner Achieved as a Function of the Winning Time in the Race for the 100 meter (Solid Curve) and 200 meter (Broken Curve) Races. The "1s" and "2s" correspond to the previous nine Olympic 100 and 200 m finals, respectively.

difficult to measure, in the opinion of the physicists that I consulted. Hence I have not been able to quantify this effect.

5. COMPARISON TO OTHER RACE COMPETITORS

In the rest of this paper I focus on the question of which of the two performances was more remarkable.

Figure 9. The Top Panels Show the Evolution of the 100 m (Left) and 200 m (Right) World Records. The middle panel shows the proportion improvement of the existing record that was achieved each time in the 100 m (left) and 200 m (right). The bottom left panel shows the evolution of the ratio of the 200 to 100 m world record times.

These two races were particularly unique because the same two runners (Fredericks and Bolton) finished second and third in both. This suggests an interesting comparison. Figure 8 shows the percentage that each runner achieved as a function of the winning time in the race for the 100 (solid curve) and 200 meter race (broken curve). Johnson's winning margin was particularly impressive. Also plotted in the figures are the corresponding percentages achieved in the previous nine Olympic games, going back to 1952. (Throughout this analysis I restrict attention to post-1950 races because before that time races were run on both straight and curved tracks, and it was not always recorded which type of course had been used.) There has never been a winning margin as large as Johnson's in a 200 meter Olympic race, and only once before in a 100 meter race. This was Robert Hayes' 10.05 s performance in 1964 versus 10.25 s for the second place finisher. Johnson's margin over the second place Fredericks is also larger than the margin between the winner and third place in all but two of the races.

6. EVOLUTION OF THE RECORDS

The top panels of Figure 9 show the evolution of the 100 and 200 meter records since 1950. The proportion improvements, relative to the existing record, are shown in the middle panels of Figure 9. Johnson's 19.32 performance represented a 1.7% improvement in the existing record, the largest ever. (Tommie Smith lowered the 200 m world record to 20.00 s in 1968, a 1.0% improvement from the existing world record of 20.2 s. However, the 20.2 s value was a hand-timed record: the existing automatic-timed record was 20.36 s, which Smith improved by 1.8%.) If we include Johnson's 19.66 world record in the 1996 U.S. Olympic Trials, then overall he lowered Pietro Mennea's 19.72 world record by 2.02% in 1996.

The bottom left panel of Figure 9 shows the ratio of the 200 meter world record versus the 100 meter world record from 1950 onward. The ratio has hovered both above and below 2.0, with Johnson's world record moving it to an all-time low of 1.963. The average speed in the 100 m record race was 10.16 m/s, and that for the 200 m race was 10.35 m/s, the fastest average speed of any of the sprint or distance races. A ratio of below 2.0 is predicted by the mathematical model of Keller (1973).

7. CONCLUSIONS

Who is faster, Bailey or Johnson? The answer depends on the definition of "faster," and there is no unique way of comparing two performances at different distances. Our results are inconclusive on this issue:

• It is not fair to compare the average speeds (higher for Johnson) because the start is the slowest part of the race, and Johnson had to start only once.

• Bailey appeared to achieve a higher maximum speed, although the difference in maxima was not statistically significant at the .05 level; Johnson maintained a very high speed over a long time interval.

• Predictions from an extended version of Keller's optimal running model suggest that Bailey would win a (straight) 150 m race by .09 s. However, they do not account for the fact that Johnson's times are based on a curved initial 100 m.


(straight) 150 m race by .09 s. However, they do not ac-count for the fact that Johnson's times are based on a curvedinitial 100 m.

It would clearly be a close race, and there are a numberof factors I have not accounted for. This entire comparisonis based on just one race for each runner: consistency andcompetitiveness come into play any race. Perhaps most im-portant is the question of strategy. Each runner would trainfor and run a 150 meter race differently than a 100 or 200meter race. The effect of strategy is impossible to quantifyfrom statistical considerations alone.

Whose performance was more remarkable? Here, Johnson has the clear edge:

• Johnson's winning margin over the second and third place finishers (the same runners in both races!) was much larger than Bailey's, and was the second largest in any Olympic 100 or 200 m final race.

• Johnson's percentage improvement of the existing world record was the largest ever for a 100 or 200 m race. However, the 200 m record might have been a little "soft" because it was well above the record as predicted by Keller's theory.

SOURCES

• David Wallechinsky, The Complete Book of the Summer Olympics, Boston, New York, Toronto, London: Little, Brown, 1996.

• Mika Perkiömäki, [email protected], World-Wide Track & Field Statistics On-Line, [http://www.uta.fi/~csmipe/sport/index.html].

• IBM, Official Results of the 1996 Centennial Olympic Games, [http://results.atlanta.olympic.org].

• Track and Field News (Oct. 1996).

APPENDIX: ERROR ANALYSIS

The data in Table 4 were obtained manually from a videotape of the race, and hence are subject to measurement error. To assess the effect of this error, I applied the parametric bootstrap (Efron and Tibshirani 1993). The maximum amount of error in the times was thought to be around ±.05 s for Bailey's times, ±.20 s for Johnson's early times, and ±.15 s for Johnson's last 100 m times. Therefore, I added uniform noise on these ranges to each time measurement. For each resulting dataset I estimated the speed curves for Bailey and Johnson. This process was repeated 1,000 times.

The observed difference in the maximum speeds was 13.2 - 11.8 = 1.4 m/s. The upper and lower 2.5% points of the 1,000 observed differences were (-.062, 1.15). The same parametric bootstrap procedure was used for the error analysis of the fit of the Keller model in Section 3, and was used to produce Figure 6.
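A minimal sketch of this resampling scheme follows (our own illustration, not the original code): the split-time arrays and the crude speed-curve estimator max_speed are placeholders standing in for the videotape data of Table 4 and the actual estimator.

# Sketch of the parametric bootstrap described above: perturb each split time
# with uniform measurement noise, re-estimate, and take percentile points.
import numpy as np

rng = np.random.default_rng(0)

def max_speed(times_s, dists_m):
    """Stand-in estimator: largest average speed between successive splits."""
    t, d = np.asarray(times_s, float), np.asarray(dists_m, float)
    return np.max(np.diff(d) / np.diff(t))

def bootstrap_diff(t_a, d_a, err_a, t_b, d_b, err_b, B=1000):
    """95% percentile interval for max_speed(a) - max_speed(b) when
    uniform(+/- err) noise is added independently to every split time."""
    diffs = np.empty(B)
    for i in range(B):
        ta = np.asarray(t_a, float) + rng.uniform(-1, 1, len(t_a)) * err_a
        tb = np.asarray(t_b, float) + rng.uniform(-1, 1, len(t_b)) * err_b
        diffs[i] = max_speed(ta, d_a) - max_speed(tb, d_b)
    return np.percentile(diffs, [2.5, 97.5])

The error arguments may be scalars or per-split arrays, so Johnson's splits can be given ±.20 s early and ±.15 s over the last 100 m; with the actual split times and speed-curve estimator substituted in, the same percentile calculation produces an interval of the kind reported above.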

[Received October 1996. Revised January 1997.]

REFERENCES

Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

Keller, J. (1973), "A Theory of Competitive Running," Physics Today, 42-45.

Pritchard, W. (1993), "Mathematical Models of Running," SIAM Review, 359-379.

Pritchard, W., and Pritchard, J. (1994), "Mathematical Models of Running," American Scientist, 546-553.


Chapter 44

Do runners and cyclists enjoy an advantage in today's triathlons?

Resizing Triathlons for Fairness

Howard Wainer and Richard D. De Veaux

In the country of Brogdinnian, a free college education is offered to top scorers on a test of general academic ability. The test is made up of two parts: a verbal part and a mathematical part. Women do better on the verbal part, whereas men do better on the mathematical part. There are 100 questions worth 1 point each on the test; 80 are mathematical questions, and 20 are verbal questions. Almost all of the scholarships go to men. In a recent class action suit, women claimed that the contest was unfair. The defense attorney countered that because the variance of math tests is smaller than that of verbal tests, more math questions are needed to spread out the competition. The judge, however, saw through this argument and ruled in favor of the women, pointing out that although variance is, indeed, important, the test proportions must take into account the mean performance for each group.

Although most people would agree that such an academic competition is unfair, the most common triathlon races are grossly unfair to a large potential pool of participants. The typical triathlon is composed of three parts: swimming, bicycling, and running. The winner is determined by the total amount of time needed to complete all three parts. It is clear that to be fair to athletes with special expertise in any one of the three components, the distances for each event should be chosen carefully. A race of 100 yards of cycling, 25 yards of swimming, and 25 miles of running would be considered unfair to everyone except experienced marathon runners. Certainly, no one anticipating competing in such a triathlon would waste much time training for anything but the run.

The triathlon distances described here illustrate our point in the extreme. Although existing distances for the triathlon components are less extreme than these, they are still a long way from being fair to all athletes who might consider participating. Of course, to make progress toward a fair triathlon, we must define what we mean by fair. The ideal distances should be symmetric in some sense, but how to measure the symmetry is unclear. We would like a relatively strong or weak performance in any one segment to be equally rewarded or penalized. If we use total time to determine the winner, some contend that we need to make the variances of the times of each segment equal. This argument has been used to justify the relatively long cycling segments of the current Iron Man and Olympic Triathlons. We maintain that the argument is specious in this context and fails to take into account the variance that would obtain if a different group of athletes, specifically those less predisposed toward cycling, were to participate.

Instead, we will take "fairness"to mean that a cyclist, runner, andswimmer, all equally proficient,can each traverse the associatedsegment of the triathlon in ap-proximately equal times; that is,the best swimmer in the world cancomplete the swimming segmentin about the same amount of timeas the best runner in the world can

317

Page 329: Anthology of Statistics in Sports

Chapter 44 Resizing Triathlons for Fairness

ede Barry, cyclist. Photo courtesy of USA Cycling, © 2004.

complete the running segment,and the best cyclist the cyclingsegment.

In this article we use this definition to derive fair triathlon proportions for various total elapsed times. We discuss the equal-variance argument further in the discussion section. We conclude with a plan for the Ultimate Paris-to-London Triathlon.

How Long Is a Triathlon?

Triathlons come in all sizes and shapes; a sampling of them is shown in Table 1. Here we show two well-known triathlons, the Iron Man and the Standard International, or Olympic Triathlon, along with the Garden State Tin Man.

How were the proportions selected? Legend has it that the first triathlon came about as the result of a bar bet by some sailors stationed in Hawaii. Previously, some of them had participated in the annual events known as the Waikiki Rough Water Swim, the Around the Island Bicycle Race, and the Honolulu Marathon. One proposed the challenge to complete the equivalent of all three events in one day. Thus, the Iron Man Triathlon was born. It has now been contested every October since 1981 in the village of Kailua-Kona on the island of Hawaii. It consists of a 2.4-mile swim, a 26.2-mile run, and a 112-mile bicycle ride, precisely the distances of the preexisting events. [The record for this race of 8 hours, 9 minutes, 15 seconds (8:09:15) was set by Mark Allen in 1989.] The order of the three events, for logistical and physiological reasons, is swimming, cycling, and running. Although it would be interesting to consider the effects of other orderings, we will assume the standard order throughout this article, thus also ignoring the changing effect fatigue would play in a different sequence.


Table 1-Some Typical Triathlons

Race                        Swimming    Running    Cycling    Total
Iron Man
  Distance (km)                3.9         42.1      180.0     226.0
  Proportions                  1.7%        19%        80%      100%
International (Olympic)
  Distance (km)                1.5         10.0       40.0      51.5
  Proportions                  2.9%        19%        78%      100%
Garden State Tin Man
  Distance (km)                0.8         10.0       37.0      47.8
  Proportions                  1.6%        21%        77%      100%

Note that the ratio of distances is 1 to 11 to 48: the length of the run is 11 times the length of the swim, and the length of the cycling course is 48 times the length of the swim. Although organizers of triathlons are free to choose any distances they want, many are roughly scaled versions of the Iron Man, with a few exceptions. Are these proportions fair to all potential participants? To judge this, let us hold the running segment of the contest constant and calculate "fair" lengths of the other two segments.

The great Ethiopian runner Belayneh Densimo ran the 1988 Rotterdam marathon in just under 2 hours and 7 minutes (2:06:50), the fastest marathon ever. Marathon courses vary considerably, but there is reasonable consistency among marathon times, and 2:07 seems to be a plausible figure to represent the current best possible time.

How far can the best bicyclists go in 2 hours and 7 minutes? This year, Spain's Miguel Indurain won the 21-stage, 2,490-mile Tour de France in just under 101 hours, thus averaging almost 25 mph. This provides us with a lower bound of 53 miles for 2 hours and 7 minutes. Courses differ, and surely cyclists would be able to go faster if the race were to be only 2 hours. As we shall derive later, a good estimate for what would be a world record cycling distance for 2:07 is about 60 miles. Thus, we see that a fair Iron Man would shrink the cycling leg almost in half.

What about swimmers? How far can the best swimmers go in 2 hours and 7 minutes? This is hard to estimate because the longest pool race is 1,500 m, and the world record for that, held by Australia's Kieren Perkins, is 14 minutes, 43.48 seconds (14:43.48). This record means that for 15 minutes he can maintain a pace of just under 59 seconds per 100 m. There are open-water marathon races that take many hours, but currents, low water temperatures, and other nonstandard conditions make them a poor source of data for estimating optimal human performance. Another count against using open-water races is that marathon swimming is not a particularly popular sport, and hence most of the greatest swimmers do not participate. Instead we have opted to use a less formal source to estimate swimming ability: performance in practice. Records kept by Rob Orr, coach of the Princeton men's swimming team, revealed that sometimes in a 2-hour practice the swimmers can complete 10,000 m. Kieren Perkins, whose practices are legendary, has maintained a near 1 minute (1:02) per 100 m pace for 2 hours. Using this pace as a standard would yield a total of approximately 12,300 m in 2 hours and 7 minutes. Making a more conservative estimate (about 1:03.5 per 100 m) still yields 12,000 m (about 7.5 miles). Thus, a fair Iron Man would need to more than triple the swimming leg while simultaneously halving the cycling portion.

Most triathlons are open-water swims, rather than pool swims; therefore, our estimate of 12,000 m is quite possibly a bit optimistic (depending on the direction of the current!). However, because running is the last event, the estimate of 26.2 miles for 2:07 will also be optimistic in the context of a triathlon. We use 12,000 m as a starting point for a discussion of a fair triathlon, with a view toward reevaluating the proportions using actual split times for participants once they become available.

Consequently, a triathlon like the Iron Man, in which each leg was scaled to be a shade over 2 hours long for the best in the world (competing in peak circumstances and without the distraction of the other two events), would not be proportioned 1 to 11 to 48 as is currently the case, but rather 1 to 3.5 to 8. The current Iron Man has a cycling leg that is six times longer than it ought to be relative to swimming! It is no wonder that very few triathletes describe themselves as primarily swimmers: The contest is so tilted against them that it is hardly worth a swimmer's effort to compete.
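To make the arithmetic behind the 1 to 3.5 to 8 claim explicit, here is a small sketch in plain Python using the estimates quoted above (a 12,000-m swim, the 42.2-km marathon, and roughly 96.6 km of cycling, the metric equivalent of 60 miles); the variable names are ours.

# Sketch: proportions implied by "about 2 hours 7 minutes per leg at world-record pace."
swim_km, run_km, bike_km = 12.0, 42.2, 96.6

print(round(run_km / swim_km, 1), round(bike_km / swim_km, 1))   # ~3.5 and ~8.1
total = swim_km + run_km + bike_km
for name, d in [("swim", swim_km), ("run", run_km), ("bike", bike_km)]:
    print(name, f"{100 * d / total:.0f}%")                       # roughly 8%, 28%, 64%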

So far, we have focused primarily on a single race, the Iron Man, and derived fair proportions for it. As we indicated in Table 1, not all triathlons are the same length, nor do they have the same proportions as the Iron Man.


[Table 2-World Records in Three Sports: cycling, running, and swimming world records at various distances (distance in meters, time in h:m:s). The 12,000-m swimming entry is a world record imputed from practice performance.]

Table 3-Equilateral Triathlons

World record time                  Distances (km)          Distances (miles)
for each leg (min)             Swim    Run    Bike       Swim    Run    Bike
 10                             1.0    3.9     8.5        0.6    2.4     5.3
 15                             1.5    5.7    12.5        0.9    3.5     7.8
 27 (Olympic)                   2.7   10.0    22.4        1.7    6.2    13.9
 30                             2.9   10.8    24.2        1.8    6.7    15.0
 45                             4.3   15.8    35.7        2.7    9.8    22.2
 60                             5.7   20.7    47.0        3.6   12.8    29.2
 75                             7.1   25.5    58.1        4.4   15.9    36.1
 90                             8.6   30.4    69.2        5.3   18.9    43.0
105                            10.0   35.2    80.2        6.2   21.9    49.8
120                            11.4   40.0    91.1        7.1   24.8    56.6
127 (Iron Man)                 12.0   42.2    96.2        7.5   26.2    59.7

For example, the Triathlon World Championship race, which is held annually, consists of a 1.5-km swim, a 40-km bike ride, and a 10-km run. (The record of 1:48:20 was set by Miles Stewart of Australia in 1991.) These distances have become the standard distances for international competitions and will be used in the Olympics. The ratio of these distances is roughly 1 to 7 to 27. Although this is fairer to swimmers than the Iron Man, it is still not entirely fair, the running section being twice as long as it should be relative to the swim, and the cycling leg over three times as long.

Data and Analyses

Table 2 shows the world records for the three sports at various distances. (The cycling data are from Van Dorn 1991; running and swimming data from Meserole 1993.) We have included an imputed world record for 12,000 m swimming based on practice performance.

Fitting a mathematical function to these record times allows us to interpolate accurately between them and produce estimated distances traversed for all three sports for any intermediate time. Regression, fitting the logarithm of distance (in meters) to polynomials in the logarithm of time (in seconds), is used to estimate the relationships. The functions that were fit are shown in Fig. 1.
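The sketch below shows one simple way such a fit could be carried out, assuming arrays of world-record (time, distance) pairs like those in Table 2 are available; the example records are rough illustrative values, and the polynomial degree and software actually used by the authors are not stated in the article.

# Sketch: regress log(distance) on a polynomial in log(time), then interpolate.
import numpy as np

def fit_record_curve(times_s, dists_m, degree=2):
    """Return a function giving an estimated record distance (m) at time t (s)."""
    coefs = np.polyfit(np.log(times_s), np.log(dists_m), degree)
    return lambda t: float(np.exp(np.polyval(coefs, np.log(t))))

# Rough illustrative running records (seconds, meters), not the actual Table 2 values:
run_times = [600.0, 900.0, 1800.0, 3600.0, 7620.0]
run_dists = [3900.0, 5700.0, 10800.0, 20700.0, 42200.0]
predict_run = fit_record_curve(run_times, run_dists)
print(round(predict_run(45 * 60)))   # estimated record running distance for 45 minutes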

Table 3 provides guidelines for what might properly be called "Equilateral Triathlons" of various durations, based on world-record times for each segment. The distance proportions for equilateral triathlons are roughly 1 to 3.5 to 8 (depending on the duration of the segments), or 8% to 28% to 64%.


Figure 1. A graph that allows one to construct equilateral triathlons of many different durations. Equilateral versions of the two canonical durations are indicated by the vertical bars.

The range extends from a "sprint" triathlon, which most competitors should be able to complete in under 1 hour, to an Equilateral Iron Man that might take a professional triathlete 8-10 hours to finish. For example, the entry for the equilateral equivalent of the Olympic Triathlon (keeping the running leg constant) is found on the third line of the table and consists of a 2.7-km swim, a 10-km run, and a 22.4-km cycling leg. Compare these to the current Olympic distances of a 1.5-km swim, a 10-km run, and a 40-km cycling leg.

Discussion

In this account we tried to expose the unfairness of triathlons as they are currently configured. We offered alternative proportions that make the three segments of the triathlon, in some real sense, equal. By instituting these proportions we feel there will be greater participation from athletes whose best sport is swimming, rather than the current domination by cyclists and runners. Our recommendations have not met with universal approbation; some feel that our definition of fairness is incorrect. One common complaint was first offered by Sean Ramsay, a well-known Canadian triathlete, who suggested that the current distances were chosen to spread out the competitors equally. Hence, the limited discrimination among competitors that can be achieved in a relatively short swimming race had to be made up by a much longer cycling race.

Ramsay's observation is true for the athletes currently competing in triathlons. Would it be true if the races were proportioned differently?

Let us imagine a different pool of competitors for the equilateral triathlons. Perhaps more swimmers will compete. The variation of cycling times will become much larger because, in addition to the current variation we see in cycling times, there is the additional variation due to participants for whom cycling is not their best sport. Therefore, the decrease in variation due to the relatively shorter cycling leg of an equilateral triathlon may be compensated for by the increased variation expected for the new pool of participants. To determine the extent to which this is true requires gathering data from equilateral triathlons. It is likely these data would include some athletes who currently do not participate. We expect the pool of participants to change as the triathlons become fairer (see sidebar). The analysis of such data will then allow the iterative refinement of the triathlon proportions based on actual split times.

To conclude our exposition, we extrapolate our definition of a fair triathlon a little further (further perhaps than ought to be done). We would like to propose a new race from Paris to London.

The Ultimate Paris-to-London Triathlon

Imagine the race beginning at dawn on a sunny day in mid-August from beneath the Arc de Triomphe in Paris. Crowds cheer as competitors set out on bicycles for Calais, 250 km away. The best of them begin to arrive in Calais at about noon. They strip away their cycling clothes and grease their bodies for the 46-km swim across the English Channel to Dover.


The Effects of Selection on Observed Data

The notion of a bias in inferences drawn from the results of existing triathlons, due to the self-selection of athletes into the event, is subtle, although the size of the bias may be profound. Even Harald Johnson, publisher of Swim Bike Run magazine and a perceptive observer of triathlons, [was not immune to it. In an article] he wrote to his subscribers, Johnson [proposed proportions for fair] triathlons. He analyzed the split results of three 1981 triathlons and arrived at proportions of 1:4:11. These are closer to our recommendations than current practice but still not quite right. The difference is accountable because he used split times from existing triathlons. "Analysis of actual triathlons is the only way to do this kind of research," he said. But, of course, the split times would be quite different if the proportions were different. The subtle nature of this effect of self-selection is best exposed if we express the problem mathematically.

Let S_i equal the time for person i to complete the swimming leg, R_i equal the time for person i to complete the running leg, and C_i equal the time for person i to complete the cycling leg. Moreover, let K_i be an indicator variable that takes the value 1 if person i decides to participate in the triathlon, and 0 if not. So far, we have chosen distances so that World Record(S) = World Record(R) = World Record(C) over all individuals i. Ramsay's criticism was that the distances should be chosen so that

    var(S) = var(R) = var(C)                               (1)

Perhaps Ramsay's objection is reasonable, but all we can observe is that

    var(S | K = 1) = var(R | K = 1) = var(C | K = 1)       (2)

The variance of S, over all individuals, is composed of two sets of terms. One set is observable: var(S | K = 1) and E(S | K = 1). The second component is unobservable because it contains both the mean and the variance of performance for those potential triathletes who did not participate: var(S | K = 0) and E(S | K = 0). Moreover, because the pool of potential participants is defined only loosely, we also cannot observe P(K = 1). There is a similar term for each of the other two segments as well. If (2) is observed to be true, then (1) is certainly true if the unobserved variances are all equal [i.e., var(S | K = 0) = var(R | K = 0) = var(C | K = 0)] and the unobserved means equal the observed means [E(S | K = 1) = E(S | K = 0), E(R | K = 1) = E(R | K = 0), and E(C | K = 1) = E(C | K = 0)]. These assumptions are too far-fetched to be credible.

It is reasonable to believe that if the swimming segment is tripled, participation among swimmers would increase. It is also likely that they would be worse cyclists on average than those who are already participating [i.e., E(C | K = 0) > E(C | K = 1)], and so var(C | K = 1) would increase. How much the variance would increase is unknown and cannot be known until we see how changing the race proportions changes the participation rates from various subpopulations of athletes. Perhaps the change in participants would increase the variance enough to counteract the variance shrinkage that will occur from shortening the cycling distance.
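The following small simulation is our own illustration of the sidebar's point, not part of the original article: when entry into the race is self-selected in favor of strong cyclists, the observed variance var(C | K = 1) understates the variance var(C) over the full pool of potential triathletes. All numbers are arbitrary.

# Sketch: self-selection shrinks the observed variance of cycling times.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

cycling_time = rng.normal(loc=70.0, scale=8.0, size=n)    # minutes, full pool
# Slower cyclists are less likely to enter today's cycling-heavy races.
p_enter = 1 / (1 + np.exp((cycling_time - 70.0) / 4.0))   # logistic self-selection
K = rng.random(n) < p_enter

print(round(cycling_time.var(), 1))      # var(C) over the whole pool
print(round(cycling_time[K].var(), 1))   # var(C | K = 1): noticeably smaller

If the swimming leg were lengthened and more swimmers (weaker cyclists on average) entered, the participants' cycling variance would move back toward the full-pool value, which is the compensating effect described in the discussion above.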

Although the weather is perfect and they have caught the tide, few will come close to Richard Davey's 1988 record of 8 hours and 5 minutes. The leaders start emerging 9-10 hours later. The sun has already set. These hardy souls grab a quick bite to eat, change into running shoes, and set off on shaky legs for London, a mere 115 km away. A good ultramarathoner, starting fresh, would cross the finish line in Trafalgar Square after about 6 hours; indeed, a world-class race walker could do it in about 7 1/2 hours. But no one is fresh, and no one who has emerged from the Channel is a running specialist. In fact, the only ones who have finished the swimming segment so far are chubby channel swimmers. The waiting crowd looks anxiously through the darkness for the sleek cyclists and runners who traditionally win triathlons. As the sun begins to rise, none emerge. At the finish line, the winner arrives, half jogging, half walking, more than 24 hours after the beginning of the race. She gracefully accepts the trophy and prize money and then heads for a bath, breakfast, and a bed.

The careful reader will notice that the Paris-to-London swimming segment is a bit longer than our equilateral analysis suggests. (It also places the swimming segment second for logistical reasons.) This could be corrected by beginning the race a bit east of Paris and finishing the run somewhat north of London. But the disparity here is small in comparison to the disadvantage to which swimmers are ordinarily put. In addition, the appellation "The Ultimate La Queue en Brie to Chigwell Triathlon" has neither the cachet nor the euphony of our proposal.

Additional Reading

Keller, J. B. (1977), "A Theory of Competitive Running," in Optimal Strategies in Sports, eds. S. P. Ladany and R. E. Machol, The Hague: North Holland, pp. 172-178.

Meserole, M. (ed.) (1993), The 1993 Information Please Sports Almanac, Boston: Houghton Mifflin.

Smith, R. L. (1988), "Forecasting Records by Maximum Likelihood," Journal of the American Statistical Association, 83, 331-338.

Tryfos, P., and Blackmore, R. (1985), "Forecasting Records," Journal of the American Statistical Association, 80, 46-50.

Van Dorn, W. G. (1991), "Equations for Predicting Record Human Performance in Cycling, Running and Swimming," Cycling Science, September and December 1991, 13-16.
