is the birthday problem more than a curiosity? · probability, 1976 [after developing a formula for...
TRANSCRIPT
11/9/2014
1
Steven J. Wilson
Johnson County Community College
AMATYC 2014, Nashville, Tennessee
How many people must
be in a room before the
probability of at least two
sharing a birthday is
greater than 50%?
Johnny Carson’s stab at it …
http://www.cornell.edu/video/
the-tonight-show-with-johnny-
carson-feb-6-1980-excerpt
Also Feb 7, Feb 8
Johnny Carson, 1925-2005
Ed McMahon, 1923-2009
11/9/2014
2
P(at least 2 share) = 1 – P(no one shares)
To find P(no one shares), do probabilities
of choosing a non-matching birthday:
So P(at least 2 share)
365
365
364
365
363
365
362
365
11
365
n
365 364 363 (365 ( 1))
365n
n
365365!
(365 )! 365 365
n
n n
P
n
3651365
n
n
P
P(at least 2 share)
For n = 23, P(at least 2 share) = 50.73%
3651365
n
n
P
n P(sharing)
5 2.71%
10 11.69%
15 25.29%
20 41.14%
25 56.87%
30 70.63%
35 81.44%
40 89.12%
45 94.10%
50 97.04%
The birthday problem used to be a splendid illustration of the advantages of pure thought over mechanical manipulation … … what calculators do not yield is understanding, or mathematical facility, or a solid basis for more advanced, generalized theories.
--- Paul Halmos (1916-2006), I Want to Be a Mathematician, 1985
11/9/2014
3
Is this only a curiosity?
How was this computed historically?
What is the underlying distribution?
Generalizations?
Applications?
[After solving the Birthday Problem]
“The next example in this section not
only possesses the virtue of giving rise
to a somewhat surprising answer, but it
is also of theoretical interest.”
- Sheldon Ross, A First Course in
Probability, 1976
[After developing a formula for the
probability that no point appears twice when sampling with replacement]
“A novel and rather surprising
application of [the formula] is the
so-called birthday problem.”
- Hoel, Port, & Stone, Introduction to
Probability Theory, 1971
11/9/2014
4
Richard von Mises (1883-1953)
often gets credit for posing it in
1939, but …
… he sought the expected
number of repetitions as
a function of the
number of people.
Ball & Coxeter included it in their 11th
edition of Mathematical Recreations
and Essays, published in 1939.
They gave credit to Harold Davenport
(1907-1969), who shared it about 1927.
But he did not think he was the
originator either.
So how did they do the computations back then?
Paul Halmos: “The birthday problem used to be a splendid illustration of the advantages of pure thought over mechanical manipulation …”
11/9/2014
5
P(no one shares)
Recall the Taylor Series:
First-order approximation: P(no one shares)
365 364 363 362 365 ( 1)
365 365 365 365 365
n
0 1 2 11 1 1 1
365 365 365 365
n
2
12 !
nx x x
e xn
0/365 1/365 2/365 ( 1)/365ne e e e (1 2 ( 1))/3651 ne
( ( 1)/2)/365n ne
Solving P(at least 2 share) > 0.50
( ( 1)/2)/365
( ( 1)/2)/365
1 0.5
0.5
( 1)ln(0.5)
2(365)
( 1) (365) ln 4 505.997
n n
n n
e
e
n n
n n
1 1 4(506) 1 202523
2 2n
2 506 0n n
2
( 1) ln 4
ln 4 0
1 1 4 ln 4
2
n n d
n n d
dn
0.5 0.25 ln 4
0.5 1.177
n d
n d
11/9/2014
6
So n is roughly proportional to the
square root of d, with
proportionality constant about 1.2.
1.2 365 22.93
0.5 1.177 365 22.99
n
n
(Pat’sBlog attributes the factor 1.2 to Persi Diaconis)
0.5 0.25 ln 4
0.5 1.177
1.2
n d
n d
n d
P(2 people not sharing) =
Among n people,
number of pairs =
oFor 23 people, there are 253 pairs!
oThat’s 253 chances of a birthday
match!
P(2 of n people sharing) ≈
oAlarm Bells! Independence assumed!
364
365
2
( 1)
2n
n nC
( 1)
23641
365
n n
P(2 of n people sharing) ≈
o Alarm Bells temporarily suppressed!
Solving:
gives n > 22.98
( 1)
23641
365
n n
( 1)
2
2
3641 0.5
365
505.3 0
n n
n n
11/9/2014
7
Discrete or Continuous?
These answer the wrong question
Discrete
Distribution
Random
Variable
Assumptions
Binomial No. of Successes Fixed n, constant p, independent
Hypergeometric No. of Successes Fixed n, p varies, dependent
Poisson No. of Successes Fixed time, constant λ,
independent
Geometric First Success n varies, constant p, independent
Negative
Binomial
k-th Success n varies, constant p, independent
Multinomial No. of Each Type Fixed n, constant p, independent
Place n indistinguishable balls
into d distinguishable bins.
Balls → People, Bins → Birthdays
When does one bin contain two
(or more) balls?
Bernoulli ???
Multinomial ???
Want “first match” !
AKA: occupancy problem,
collision counting
11/9/2014
8
How many trials to get the first match?
Variables:
x = number of trials to get the first match
d = number of equally likely options
(birthdays)
Pattern:
Not Not Not … Not Match
1 2 ( 2) 1d d d d x x
d d d d d
Pattern:
Not Not Not … Not Match
Probability:
1 2 ( 2) 1d d d d x x
d d d d d
1
( 1)( ) for 2,3,4,..., 1d xx
xP X x P x d
d
PDF CDF
When n=23,
the probability
of at least one
match first
exceeds 50%
The probability
of obtaining
the first match
is highest when
n = 20
MODE
MEAN
E(X) ≈ 24.6166
MEDIAN
11/9/2014
9
The PDF is discrete, so avoid calculus.
For which x is P(x) > P(x+1) ?
For d=365, we get x>19.61, so 19 or 20
Approximating:
1 1
1
2
1
( 1) ! !
( 1)! ( )!
( 1) ( 1)
0
1 1 4
2
d x d xx x
x x
x xP P
d d
x d x d
d d x d d x
d x x d x
x x d
dx
EXACT FORMULA!
x d for large d
Since
then
which is related to Ramanujan’s Q-function,
specifically
which is asymptotic:
and for d=365, we get:
1
( 1)( ) for 2,3,4,..., 1d xx
xP X x P x d
d
1
1
2
1( )
d
d xxx
xE X x P
d
( ) 1 ( )E X Q d
1 1 4( ) 1
2 3 12 2 135
dE X
d d
365 2 1 4( ) 24.6166
2 3 12 730 49275E X
A Jovian day = 9.9259 Earth-hours
A Jovian year = 11.86231 Earth-years
Both 1047625 and 10476P25 cause a
TI-84 overflow!
365.24 24 11 11.86231
1 1 9.9259
10476
Edays Ehrs JdayJyr Eyrs
Eyr Eday Ehrs
Jdays
10476 1
( 1)( ) for 2,3,4,...,10477
10476xx
xP X x P x
11/9/2014
10
Proportionality?
Heuristically?
1.2 10476 122.823n
ln 4( 1)
10476ln
10475
n n
( 1)
2104751 0.5
10476
n n
121.009n
Series Approximation:
Mathematica: for n = 121,
P(match) = 50.13%
( 1) (10476)ln4n n
10476 10475 10476 ( 1)0.5
10476 10476 10476
n
121.012n
Mean:
Median:
Mode:
1 1 4102.854
2
dx
2 1 4( ) 128.947
2 3 12 2 135
dE X
d d
1 1 4 ln 4121.012
2
dn
11/9/2014
11
Digitally
signed document
Your PIN
(private key)
F(Message)
Message
F
The hash function F
is not one-to-one!
Impostor Message
Alt 4
Alt 1
Alt 2
Alt 5
Alt 3
Imp 1
Imp 2
Imp 3
Imp 4 Imp 5
Birthday Attack: Find a good message
and a fraudulent message with the same value of F(Message).
F
G
Given n good messages, n fraudulent
messages, and a hash function with d
possible outputs, what is the probability
of a match between 2 sets?
Assume , so n outputs are
probably all different.
P(message 1 in set 1 doesn’t match
set 2) =
n d
1n
d
P(no message in set 1 matches
anything in set 2) =
50% probability of at least one
match between the sets:
Probability p of at least one match
between the sets:
2/ /1
nn
n d n dne e
d
2 /
2
1 0.50
ln 2
0.83
n de
n d
n d
1ln
1n d
p
11/9/2014
12
With a 16-bit hash function, there
are possible outputs
50% probability of a match with
messages:
1% probability of a match with
messages
162 65,536
0.83 65536 213n
165536ln 26
0.99n
Increase the size of the hash function:
oWith 64-bits, outputs are possible
o50% probability of a match with 3.5 billion messages (11 messages per US citizen)
o1% probability of a match with 431 million messages
642 18,446,744,073,709,551,616
Is this only a curiosity?
No, it’s part of a class of problems
How was this computed historically?
Using approximations (including Taylor Series)
What is the underlying distribution?
“First Match” Distribution
Generalizations?
Vary days in a year (other planets)
Applications?
Digital signatures
11/9/2014
13
Steven J. Wilson
Johnson County Community College
Overland Park, KS 66210
A PDF of the presentation is available at:
http://www.milefoot.com/about/presentations/birthday.pdf