What a coincidence! It’s not as unlikely as you think

February 2012 © 2012 The Royal Statistical Society


Coincidences

A coincidence is a surprising concurrence of events, perceived as meaningfully related, with no apparent causal connection. (Diaconis and Mosteller¹)

We tend to “personalize” coincidences. Our coincidences are much more surprising than yours. We are stunned if we run into an old friend in a Shanghai bar. But we would be quite blasé if you told us that something similar happened to you, because we know that old friends and acquaintances run into each other all the time, in the strangest of places. These are, in fact, just patterns in large data sets, and – the important part – the patterns found were not specified ahead of time. In scientific endeavours, the dangers of mining large data sets for patterns not specified in advance are well known² (see Significance, September 2011, for an article by Stan Young on this). In the example of meeting in a bar, the data set contains all the chance meetings that take place at any time in any bar in the world from Shanghai to London or Miami or, say, Casablanca. A pretty large set!

The availability of the internet (surely one of the world’s largest data sets), and the vast amount of information it contains, means that it is easy to “find” what some might consider to be “significant” (or even “sinister”) connections between famous people and/or events. As an example, one of us did a Google search using the words “presidents” and “coincidences”, and got 136 000 hits.

Here are some similarities between Abraham Lincoln and John Fitzgerald Kennedy:

• They each have seven letters in their last names.

• They were elected 100 years apart (1860 and 1960).

• They were both assassinated on a Friday in the presence of their wives.

• Lincoln was shot in Ford’s Theatre; Kennedy was shot in a Ford car.

• Both assassins were known by three names – John Wilkes Booth and Lee Harvey Oswald, with 15 letters in each complete name.

• Booth shot Lincoln in a theatre and fled to a warehouse; Oswald shot Kennedy from a warehouse and fled to a theatre.

• Both succeeding vice-presidents were southern Democrats and former senators named Johnson (Andrew and Lyndon), with 13 letters in their names and born 100 years apart (1808 and 1908).

The similarities observed were not predicted in advance; they were found after a data-mining exercise. Any two people (you and your next-door neighbour, for example) could get together and construct a list of personal similarities. As Diaconis and Mosteller¹ noted: “We are swimming in an ocean of coincidences.” That we are just beginning to be aware of this has much to do with the availability of immense amounts of information – statisticians call it data – and the relative ease with which it can be mined.

Diaconis and Mosteller also noted that, in the definition of a coincidence that we quoted at the beginning, our psychology enters through the words “surprising”, “perceived”, “meaningfully”, and “apparent”. “Coincidence” is not solely a statistical term; human nature comes into it as well. They also introduced an adage that they called the law of truly large numbers: with a large enough sample, almost anything outrageous will happen.

Data often accumulates over time, in which case this adage may also be expressed as: “given enough time, we will see outrageous events”. Now, whether these events are actually as “outrageous” as they might seem at first rather depends on the situation. One of the messages we would like to get across is that people, even statisticians, are not good at accurately assessing probabilities and asking pertinent questions.

Evelyn Adams of New Jersey won the lottery – twice. A stranger at a party may have something unlikely in common with us. It is human nature to believe such coincidences are rare. However, Byron Jones and Robb Muirhead explain that they are common, and that we should be much more surprised if we never heard about them. People are not good at assessing probabilities, and we often do not ask the right or appropriate questions.

Two examples where the probabilities of the “coincidences” can be calculated relatively easily are strangers at a party and lottery winners. Other examples that could be given in a longer and more technical article include assessing the links between clusters of cancers and geographical location – are there really more leukemias under power lines? – and assessing links between vaccinations and sudden infant death syndrome. Both of these have caused public concern, so understanding the nature of coincidences is important.

Meeting strangers at a party

We may be surprised when, in a social setting, it turns out that we have something in common with a person we are chatting with – we go to the same dentist, or our home towns are the same, or our spouses have the same first names, or we have the same accountant, etc. These are examples of “matches”. The best-known problem in this class is known as the “birthday problem”.

Ignoring leap years, if you took a group of 366 people, you would be certain that at least two people have the same birthday. One question that may be asked is: how large a group do we need in order to have a 50:50 chance that there is a multiple birthday (that is, two or more people have the same birthday)?

It turns out that the answer is that it only takes 23 people. This is a result that seems surprising to most people when they first hear of it. (We are assuming that birthdays are randomly distributed. By this we mean that, if we pick a person at random, his or her birthday is equally likely to be any of the 365 days of the year.) Here is how it is solved.

Consider N people, numbered 1, 2, …, N. Person number 1 has some birthday. The chance that person number 2 has a birthday that is different from this is 364/365. The chance that person number 3 has a birthday that is different from the first two people is 363/365, and so on. It follows, then, that the chance that all N people have different birthdays is

$$\frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - N + 1}{365}.$$

Thus the chance (probability) that there is at least one multiple birthday (the event that occurs when the N people do not all have different birthdays) is p(N), where

$$p(N) = 1 - \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - N + 1}{365}.$$

It may be checked that the smallest value of N for which p(N) > ½ is 23 (and p(N) is greater than ½ for all values of N larger than 23). We can also show that we need about 48 people in order to have a 95% chance of a multiple birthday.

[Figure: ‘Of all the gin joints in all the towns in all the world, she walks into mine.’ Bogart on Bergman in Casablanca. Coincidence, or was it bound to happen? Credit: mptvimages.com]
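The birthday calculation can be checked numerically. The following is a minimal Python sketch, assuming 365 equally likely birthdays as in the text; the function names `prob_shared_birthday` and `smallest_group` are ours, chosen for illustration.

```python
def prob_shared_birthday(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    if n > days:
        return 1.0  # more people than days: a match is certain
    prob_all_different = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        prob_all_different *= (days - k) / days
    return 1.0 - prob_all_different


def smallest_group(target: float, days: int = 365) -> int:
    """Smallest group size whose match probability exceeds target."""
    n = 1
    while prob_shared_birthday(n, days) <= target:
        n += 1
    return n


print(smallest_group(0.5))                   # 23
print(round(prob_shared_birthday(23), 4))    # 0.5073
```

Changing `days` gives the same calculation for any other category whose values are equally likely.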

The birthday problem has obvious extensions. Here is one. In a group of seemingly unconnected people, various types of coincidences (matches) can occur – the same birthday, same profession, same dentist, and so on. It seems obvious that if we include more than one category, we would need fewer people than we would need for one category, if we asked the question: how many people are needed to have a 50:50 chance of at least one match in at least one of the categories? For example, consider three categories: birthdays, dentists, and cars. There are 365 possible birthdays. The local telephone directory for the town where one of us lives lists about 200 dentists. There are about 400 different makes and models of cars. It turns out that only 12 people are needed in order to have a 50:50 chance of a match in at least one of these categories.
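A sketch of the multi-category calculation, in Python, is given here under assumptions the text leaves implicit: within each category every value is equally likely, and the categories are independent of one another. The function name `prob_any_match` is illustrative. With the category sizes quoted above (365, 200 and 400), this exact product puts the chance for 12 people very close to one half.

```python
def prob_any_match(n: int, category_sizes: list[int]) -> float:
    """Probability that n people produce a match in at least one category.

    Assumes uniform values within each category and independence
    between categories (both are simplifying assumptions).
    """
    prob_no_match = 1.0
    for size in category_sizes:
        # Same product as the birthday problem, once per category.
        for k in range(min(n, size + 1)):
            prob_no_match *= (size - k) / size
    return 1.0 - prob_no_match


sizes = [365, 200, 400]  # birthdays, dentists, makes and models of cars
print(f"{prob_any_match(12, sizes):.3f}")
```

Passing a single category, for example `prob_any_match(23, [365])`, reduces this to the ordinary birthday problem.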

Winning the lottery twice

We hear about people who seem to have incredibly good (or bad) luck. Here is a case involving both. In October 1985, a woman named Evelyn Adams won a $3.9 million jackpot in the New Jersey State Lottery. In February 1986 (a scant 4 months later) she won again, this time to the tune of $1.5 million. What a coincidence! And you might think: how incredibly unlikely. Or is it?

A front-page headline of the New York Times on February 14th, 1986 proclaimed: “Odds-Defying Jersey Woman Hits Lottery Jackpot Second Time”. The story began:

Defying odds in the realm of the preposterous – 1 in 17 trillion – a woman who won $3.9 million in the New Jersey lottery last October has hit the jackpot again.

First let us see where the “1 in 17 trillion” odds came from. The first time Ms Adams won, she correctly picked 6 out of 39 numbers. The format changed sometime in the next four months and the second time she won, she correctly picked 6 out of 42 numbers. The 1 in 17 trillion chance stated in the New York Times is correct if the question of interest is: if Ms Adams (a specified person) bought a single ticket in each of these two specified lotteries, what is the probability that she would win both? In the first lottery there were 3 262 623 possible tickets; in the second lottery there were 5 245 786 possible tickets. The probability that two tickets, one in each lottery, would both win is

$$\frac{1}{3\,262\,623} \times \frac{1}{5\,245\,786} \approx \frac{1}{17.1 \times 10^{12}},$$

or about 1 in 17 trillion. This is the probability the New York Times reported. But we have no special interest in Ms Adams, or in the two dates on which she won her jackpots, or even in which state lottery was involved. Surely the interesting and pertinent question is: among all previous state lottery winners in the US, what is the chance that someone will win again?
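The ticket counts behind the 1 in 17 trillion figure are binomial coefficients (the number of ways of choosing 6 numbers from 39, and 6 from 42) and are easy to verify. A short Python sketch, using only the standard library; the variable names are ours:

```python
from math import comb

tickets_first = comb(39, 6)    # 6 numbers chosen from 39
tickets_second = comb(42, 6)   # 6 numbers chosen from 42

# One specified ticket in each lottery: the chances multiply.
combined = tickets_first * tickets_second

print(tickets_first)    # 3262623
print(tickets_second)   # 5245786
print(f"about 1 in {combined / 1e12:.1f} trillion")   # about 1 in 17.1 trillion
```

This reproduces the newspaper's arithmetic only; it says nothing about whether that was the pertinent question to ask.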

Two statisticians at Purdue University, George McCabe and Steve Samuels, recognized this and wrote to the New York Times. In a letter that appeared on February 27, 1986 they said:

We’re two professional statisticians who differ ever so slightly from your odds makers. What your guys called a “1 in 17 trillion” long shot, we call practically a sure thing!

McCabe and Samuels came to this conclusion through straightforward probability modelling. They made some assumptions about the number of tickets bought in state lotteries by previous winners, and about how likely it is that each ticket wins. We will do the same here.

State lotteries do not differ very much, so we will assume a player chooses 6 numbers from the integers 1 to 44. The total number of different choices (i.e., different possible tickets) is about 7 million. Let us assume that every ticket bought in a state lottery has 1 chance in 7 million of winning the jackpot.

Next we need an estimate for the number of tickets bought by people who have previously won. There are about 40 states with lotteries, and most of these lotteries have been going for several years. Let us take the last 10-year period, and assume lotteries are drawn once a week. Then the number of former winners would be about 40 × 52 × 10 = 20 800. To be conservative, let us assume that former winners buy a total of 15 000 tickets a week.

Given the assumptions we have made, what is the chance that a former winner wins again in a four-month period? (Four months was the time between the wins of the New Jersey woman.) Well, four months is about 17 weeks, so (according to our assumption) former winners buy a total of about 17 × 15 000 = 255 000 tickets over that period. A calculation shows that the probability that at least one of these is a winner is about 1 in 28. So it is not anywhere near as unlikely as the New York Times suggested by answering an uninteresting question – and in fact the wrong one.

Over a seven-year period, former winners buy a total of about 5 250 000 tickets, and the probability that at least one of these is a winner is more than half (it is 53%). Put another way, if you were to bet (at even odds) that someone who has won a state lottery before will win again in the next seven years, you would have a better than 50:50 chance of winning.
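The two probabilities quoted above follow from the stated assumptions. As a rough Python sketch: each ticket is treated as an independent draw with a 1 in 7 million chance of winning (independence between tickets is our simplifying assumption), and the helper name `prob_at_least_one_win` is illustrative.

```python
def prob_at_least_one_win(tickets: int, odds: int = 7_000_000) -> float:
    """Chance that at least one of `tickets` independent tickets wins."""
    return 1.0 - (1.0 - 1.0 / odds) ** tickets


TICKETS_PER_WEEK = 15_000   # the text's assumption for all former winners combined

four_months = prob_at_least_one_win(17 * TICKETS_PER_WEEK)   # 255 000 tickets
seven_years = prob_at_least_one_win(5_250_000)               # total quoted in the text

print(f"four months: about 1 in {1 / four_months:.0f}")   # about 1 in 28
print(f"seven years: {seven_years:.0%}")                   # 53%
```

The answer is sensitive to the assumed number of tickets bought each week, which is why the text describes its figure of 15 000 as conservative.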

What happened to Ms Adams, the woman who won the New Jersey lottery twice? In an article about lottery winners who lost their winnings, Goodstein³ writes:

“Winning the lottery isn’t always what it’s cracked up to be”, says Evelyn Adams, who won the New Jersey lottery not just once but twice (1985, 1986) to the tune of $5.4 million. Today the money is all gone and Adams lives in a trailer.

In conclusion

The “law of truly large numbers” tells us that we should expect to see unusual or surprising patterns or signals (coincidences) just because of chance. There is a human tendency to first look for patterns, and then to construct explanations for the patterns we find. Moreover, we often see patterns that we think are surprising and/or meaningful, because we tend not to ask the right questions, and we end up calculating probabilities for events that are not pertinent. Therefore, instead of exclaiming “Wow! That’s Amazing” when an apparently rare coincidence occurs, let us not forget what Sir Ronald Fisher, arguably the father of modern statistics, said⁴:

the one “chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.

References

1. Diaconis, P. and Mosteller, F. (1989) Methods for studying coincidences. Journal of the American Statistical Association, 84, 853–861.

2. Young, S. S. (2007) Gaming the system: Chaos from multiple testing. Institute of Mathematical Statistics Bulletin, December.

3. Goodstein, E. (2004) Unlucky lottery winners who lost their money. Available online at http://www.bankrate.com/brm/news/advice/20041108a1.asp

4. Fisher, R. A. (1966) Design of Experiments (8th edition). Edinburgh: Oliver and Boyd.

Byron Jones is a Biometrical Fellow in Statistical Methodology at Novartis Pharma AG, Basel, Switzerland. Robb Muirhead is a statistical consultant located in Lyme, Connecticut, USA.