power law size distributions - principles of complex ... · power law size distributions...
TRANSCRIPT
Power Law Size Distributions
Definition
Examples
Wild vs. Mild
CCDFs
Zipf’s law
Zipf ⇔ CCDF
References
1 of 36
Power Law Size Distributions
Principles of Complex Systems
CSYS/MATH 300, Fall, 2011
Prof. Peter Dodds
Department of Mathematics & Statistics
Center for Complex Systems
Vermont Advanced Computing Center
University of Vermont
Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Outline
Definition
Examples
Wild vs. Mild
CCDFs
Zipf’s law
Zipf ⇔ CCDF
References
Size distributions
The sizes of many systems’ elements appear to obey an inverse power-law size distribution:
$P(\text{size} = x) \sim c\, x^{-\gamma}$
where $0 < x_{\min} < x < x_{\max}$ and $\gamma > 1$.
• Exciting class exercise: sketch this function.
• $x_{\min}$ = lower cutoff
• $x_{\max}$ = upper cutoff
• Negative linear relationship in log-log space:
$\log_{10} P(x) = \log_{10} c - \gamma \log_{10} x$
• We use base 10 because we are good people.
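The straight line in log-log space can be checked numerically. The following is a minimal sketch, assuming illustrative values c = 2 and γ = 2.5 that do not come from the slides; `power_law` and `loglog_slope` are hypothetical helper names.

```python
import math

def power_law(x, c=2.0, gamma=2.5):
    """Unnormalized power law P(x) = c * x**(-gamma); c and gamma are illustrative."""
    return c * x ** (-gamma)

def loglog_slope(x1, x2, c=2.0, gamma=2.5):
    """Slope of log10 P(x) against log10 x between two sizes x1 and x2."""
    dy = math.log10(power_law(x2, c, gamma)) - math.log10(power_law(x1, c, gamma))
    dx = math.log10(x2) - math.log10(x1)
    return dy / dx

# The slope is -gamma for any pair of sizes; the intercept at x = 1 is log10 c.
slope = loglog_slope(10.0, 1000.0)
intercept = math.log10(power_law(1.0))
print(slope, intercept)
```

Any other pair of sizes gives the same slope, which is what makes a power law look like a straight line on log-log axes.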
Size distributions
Usually, only the tail of the distribution obeys a power law:
$P(x) \sim c\, x^{-\gamma}$ for $x$ large.
• Still use term ‘power law distribution.’
• Other terms:
  • Fat-tailed distributions.
  • Heavy-tailed distributions.
Beware:
• Inverse power laws aren’t the only ones: lognormals, Weibull distributions, ...
Size distributions
Many systems have discrete sizes $k$:
• Word frequency
• Node degree in networks: # friends, # hyperlinks, etc.
• # citations for articles, court decisions, etc.
$P(k) \sim c\, k^{-\gamma}$
where $k_{\min} \le k \le k_{\max}$
• Obvious fail for $k = 0$.
• Again, typically a description of distribution’s tail.
The statistics of surprise—words:
Brown Corpus (∼ $10^6$ words):

rank   word        % q
1.     the         6.8872
2.     of          3.5839
3.     and         2.8401
4.     to          2.5744
5.     a           2.2996
6.     in          2.1010
7.     that        1.0428
8.     is          0.9943
9.     was         0.9661
10.    he          0.9392
11.    for         0.9340
12.    it          0.8623
13.    with        0.7176
14.    as          0.7137
15.    his         0.6886

rank   word        % q
1945.  apply       0.0055
1946.  vital       0.0055
1947.  September   0.0055
1948.  review      0.0055
1949.  wage        0.0055
1950.  motor       0.0055
1951.  fifteen     0.0055
1952.  regarded    0.0055
1953.  draw        0.0055
1954.  wheel       0.0055
1955.  organized   0.0055
1956.  vision      0.0055
1957.  wild        0.0055
1958.  Palmer      0.0055
1959.  intensity   0.0055
The statistics of surprise—words:
First—a Gaussian example:
$P(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}\, dx$
[Figure: the Gaussian $P(x)$ on linear axes ($x$ from 0 to 20) and on log-log axes ($\log_{10} P(x)$ versus $\log_{10} x$).]
mean $\mu = 10$, variance $\sigma^2 = 1$.
The statistics of surprise—words:
Raw ‘probability’ (binned):
[Figure: binned counts $N_q$ versus $q$ on linear axes, and $\log_{10} N_q$ versus $\log_{10} q$ on log-log axes.]
The statistics of surprise—words:
‘Exceedance probability’:
[Figure: $N_{>q}$ versus $q$ on linear axes, and $\log_{10} N_{>q}$ versus $\log_{10} q$ on log-log axes.]
My, what big words you have...
• Test capitalizes on word frequency following a heavily skewed frequency distribution with a decaying power law tail.
• Let’s do it collectively...
The statistics of surprise:
Gutenberg-Richter law
• Log-log plot
• Base 10
• Slope = -1
$N(M > m) \propto m^{-1}$
• From both the very awkwardly similar Christensen et al. and Bak et al.: “Unified scaling law for earthquakes” [3, 1]
The statistics of surprise:
From: “Quake Moves Japan Closer to U.S. and Alters Earth’s Spin” by Kenneth Chang, March 13, 2011, NYT:
What is perhaps most surprising about the Japan earthquake is how misleading history can be. In the past 300 years, no earthquake nearly that large—nothing larger than magnitude eight—had struck in the Japan subduction zone. That, in turn, led to assumptions about how large a tsunami might strike the coast.
“It did them a giant disservice,” said Dr. Stein of the geological survey. That is not the first time that the earthquake potential of a fault has been underestimated. Most geophysicists did not think the Sumatra fault could generate a magnitude 9.1 earthquake, ...
Great:
Two things we have poor cognitive understanding of:
1. Probability
  • Ex. The Monty Hall Problem
  • Ex. Son born on Tuesday.
2. Logarithmic scales.
On counting and logarithms:
• Listen to Radiolab’s “Numbers.”
• Later: Benford’s Law.
[Figure: twelve log-log panels, (a) to (l), captioned as follows.]
FIG. 4 Cumulative distributions or “rank/frequency plots” of twelve quantities reputed to follow power laws. The distributions were computed as described in Appendix A. Data in the shaded regions were excluded from the calculations of the exponents in Table I. Source references for the data are given in the text.
(a) Numbers of occurrences of words in the novel Moby Dick by Herman Melville.
(b) Numbers of citations to scientific papers published in 1981, from time of publication until June 1997.
(c) Numbers of hits on web sites by 60 000 users of the America Online Internet service for the day of 1 December 1997.
(d) Numbers of copies of bestselling books sold in the US between 1895 and 1965.
(e) Number of calls received by AT&T telephone customers in the US for a single day.
(f) Magnitude of earthquakes in California between January 1910 and May 1992. Magnitude is proportional to the logarithm of the maximum amplitude of the earthquake, and hence the distribution obeys a power law even though the horizontal axis is linear.
(g) Diameter of craters on the moon. Vertical axis is measured per square kilometre.
(h) Peak gamma-ray intensity of solar flares in counts per second, measured from Earth orbit between February 1980 and November 1989.
(i) Intensity of wars from 1816 to 1980, measured as battle deaths per 10 000 of the population of the participating countries.
(j) Aggregate net worth in dollars of the richest individuals in the US in October 2003.
(k) Frequency of occurrence of family names in the US in the year 1990.
(l) Populations of US cities in the year 2000.
Size distributions
Examples:
• Earthquake magnitude (Gutenberg-Richter law): [1] $P(M) \propto M^{-2}$
• Number of war deaths: [9] $P(d) \propto d^{-1.8}$
• Sizes of forest fires [4]
• Sizes of cities: [10] $P(n) \propto n^{-2.1}$
• Number of links to and from websites [2]
• See in part Simon [10] and M.E.J. Newman [6] “Power laws, Pareto distributions and Zipf’s law” for more.
• Note: the reported exponents carry varying amounts of error.
Size distributions
Examples:
• Number of citations to papers: [7, 8] $P(k) \propto k^{-3}$.
• Individual wealth (maybe): $P(W) \propto W^{-2}$.
• Distributions of tree trunk diameters: $P(d) \propto d^{-2}$.
• The gravitational force at a random point in the universe: [5] $P(F) \propto F^{-5/2}$.
• Diameter of moon craters: [6] $P(d) \propto d^{-3}$.
• Word frequency: [10] e.g., $P(k) \propto k^{-2.2}$ (variable)
Power law distributions
Gaussians versus power-law distributions:
• Mediocristan versus Extremistan
• Mild versus Wild (Mandelbrot)
• Example: Height versus wealth.
• See “The Black Swan” by Nassim Taleb. [11]
Turkeys...
From “The Black Swan” [11]
Taleb’s table [11]
Mediocristan / Extremistan
• Most typical member is mediocre / Most typical is either giant or tiny
• Winners get a small segment / Winner-take-almost-all effects
• When you observe for a while, you know what’s going on / It takes a very long time to figure out what’s going on
• Prediction is easy / Prediction is hard
• History crawls / History makes jumps
• Tyranny of the collective / Tyranny of the rare and accidental
Size distributions
Power law size distributions are sometimes called Pareto distributions after Italian scholar Vilfredo Pareto.
• Pareto noted wealth in Italy was distributed unevenly (80–20 rule; misleading).
• Term used especially by practitioners of the Dismal Science.
Devilish power law distribution details:
Exhibit A:
• Given $P(x) = c\,x^{-\gamma}$ with $0 < x_{\min} < x < x_{\max}$, the mean is ($\gamma \neq 2$):
$\langle x \rangle = \frac{c}{2-\gamma} \left( x_{\max}^{2-\gamma} - x_{\min}^{2-\gamma} \right).$
• Mean ‘blows up’ with upper cutoff if $\gamma < 2$.
• Mean depends on lower cutoff if $\gamma > 2$.
• $\gamma < 2$: Typical sample is large.
• $\gamma > 2$: Typical sample is small.
Insert question from assignment 1
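The dependence of the mean on the two cutoffs can be seen by evaluating the closed form directly. This is a rough sketch, assuming the distribution is normalized on [x_min, x_max] with γ ≠ 1 and γ ≠ 2; the cutoff values and the helper names `normalization` and `mean_size` are illustrative choices, not taken from the slides.

```python
def normalization(gamma, xmin, xmax):
    """Constant c that makes c * x**(-gamma) integrate to 1 on [xmin, xmax] (gamma != 1)."""
    return (gamma - 1.0) / (xmin ** (1.0 - gamma) - xmax ** (1.0 - gamma))

def mean_size(gamma, xmin, xmax):
    """Closed-form mean c/(2-gamma) * (xmax**(2-gamma) - xmin**(2-gamma)), gamma != 2."""
    c = normalization(gamma, xmin, xmax)
    return c / (2.0 - gamma) * (xmax ** (2.0 - gamma) - xmin ** (2.0 - gamma))

# Raise the upper cutoff by a factor of 100 and see how the mean responds.
for gamma in (1.5, 2.5):
    small = mean_size(gamma, 1.0, 1e4)
    large = mean_size(gamma, 1.0, 1e6)
    print(gamma, small, large, large / small)
```

With γ = 1.5 the mean grows tenfold when the upper cutoff grows a hundredfold, while with γ = 2.5 it barely moves and stays close to a value set by the lower cutoff.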
And in general...
Moments:
• All moments depend only on cutoffs.
• No internal scale that dominates/matters.
• Compare to a Gaussian, exponential, etc.
For many real size distributions: $2 < \gamma < 3$
• mean is finite (depends on lower cutoff)
• $\sigma^2$ = variance is ‘infinite’ (depends on upper cutoff)
• Width of distribution is ‘infinite’
• If $\gamma > 3$, distribution is less terrifying and may be easily confused with other kinds of distributions.
Insert question from assignment 1
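As a worked step behind these statements, the $n$th moment of a pure power law $P(x) = c\,x^{-\gamma}$ on $x_{\min} < x < x_{\max}$ follows from the same integral as the mean. This is a sketch that assumes $n + 1 \neq \gamma$:

```latex
\langle x^n \rangle
  = \int_{x_{\min}}^{x_{\max}} x^n \, c\, x^{-\gamma} \, dx
  = \frac{c}{n + 1 - \gamma}
    \left( x_{\max}^{\,n+1-\gamma} - x_{\min}^{\,n+1-\gamma} \right),
  \qquad n + 1 \neq \gamma .
```

For $2 < \gamma < 3$, the case $n = 1$ has a negative exponent $2 - \gamma$, so the lower cutoff dominates the mean. The case $n = 2$ has a positive exponent $3 - \gamma$, so the second moment grows without bound as the upper cutoff increases.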
Moments
Standard deviation is a mathematical convenience:
• Variance is nice analytically...
• Another measure of distribution width:
Mean absolute deviation (MAD) = $\langle |x - \langle x \rangle| \rangle$
• For a pure power law with $2 < \gamma < 3$:
$\langle |x - \langle x \rangle| \rangle$ is finite.
• But MAD is mildly unpleasant analytically...
• We still speak of infinite ‘width’ if $\gamma < 3$.
Insert question from assignment 2
How sample sizes grow...
Given $P(x) \sim c\,x^{-\gamma}$:
• We can show that after $n$ samples, we expect the largest sample to be
$x_1 \gtrsim c'\, n^{1/(\gamma-1)}$
• Sampling from a finite-variance distribution gives a much slower growth with $n$.
• e.g., for $P(x) = \lambda e^{-\lambda x}$, we find
$x_1 \gtrsim \frac{1}{\lambda} \ln n.$
Insert question from assignment 2
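Complementary Cumulative Distribution Function:
CCDF:
$P_{\ge}(x) = P(x' \ge x) = 1 - P(x' < x)$
$= \int_{x'=x}^{\infty} P(x')\, dx'$
$\propto \int_{x'=x}^{\infty} (x')^{-\gamma}\, dx'$
$= \left. \frac{1}{-\gamma+1} (x')^{-\gamma+1} \right|_{x'=x}^{\infty}$
$\propto x^{-\gamma+1}$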
Complementary Cumulative Distribution Function:
CCDF:
$P_{\ge}(x) \propto x^{-\gamma+1}$
• Use when tail of $P$ follows a power law.
• Increases exponent by one.
• Useful in cleaning up data.
[Figure: PDF, $\log_{10} N_q$ versus $\log_{10} q$, beside CCDF, $\log_{10} N_{>q}$ versus $\log_{10} q$.]
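Part of what makes the CCDF useful for cleaning up data is that it can be computed from raw samples without choosing bins. A minimal sketch follows; the sample values are made up and `empirical_ccdf` is a hypothetical helper name.

```python
def empirical_ccdf(sizes):
    """Return sorted (x, fraction of samples with size >= x) pairs for each distinct size."""
    ordered = sorted(sizes)
    n = len(ordered)
    ccdf = []
    for i, x in enumerate(ordered):
        # At the first occurrence of a value, the n - i samples from here on are all >= x.
        if i == 0 or ordered[i - 1] != x:
            ccdf.append((x, (n - i) / n))
    return ccdf

sample = [1, 1, 2, 3, 3, 3, 8]  # made-up sizes, for illustration only
for x, p in empirical_ccdf(sample):
    print(x, p)
```

Plotting the logarithm of the second column against the logarithm of the first gives the kind of CCDF plot described above.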
Complementary Cumulative Distribution Function:
• Discrete variables:
$P_{\ge}(k) = P(k' \ge k) = \sum_{k'=k}^{\infty} P(k') \propto k^{-\gamma+1}$
• Use integrals to approximate sums.
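How well the integral approximates the sum can be checked directly. The sketch below assumes γ = 2.5 and truncates the infinite sum at an arbitrary large `kmax`; both choices and the helper names are illustrative.

```python
def tail_sum(k, gamma, kmax=10**5):
    """Discrete tail: sum of k'**(-gamma) for k' = k, ..., kmax (kmax truncates the infinite sum)."""
    return sum(kp ** (-gamma) for kp in range(k, kmax + 1))

def tail_integral(k, gamma):
    """Continuous approximation: integral of x**(-gamma) from k to infinity (gamma > 1)."""
    return k ** (1.0 - gamma) / (gamma - 1.0)

gamma = 2.5
for k in (1, 10, 100):
    s, i = tail_sum(k, gamma), tail_integral(k, gamma)
    print(k, s, i, s / i)
```

The ratio of sum to integral approaches 1 as k grows, so the approximation is best in the tail, which is where the power law is meant to hold.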
Zipfian rank-frequency plots
George Kingsley Zipf:
• Noted various rank distributions followed power laws, often with exponent -1 (word frequency, city sizes...)
• Zipf’s 1949 Magnum Opus: “Human Behaviour and the Principle of Least-Effort” [12]
• We’ll study Zipf’s law in depth...
Zipfian rank-frequency plots
Zipf’s way:
• Given a collection of entities, rank them by size, largest to smallest.
• $x_r$ = the size of the $r$th ranked entity.
• $r = 1$ corresponds to the largest size.
• Example: $x_1$ could be the frequency of occurrence of the most common word in a text.
• Zipf’s observation:
$x_r \propto r^{-\alpha}$
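Zipf's ranking procedure can be sketched in a few lines for word counts. The toy text and the helper name `rank_frequency` are assumptions for illustration; a real corpus would also need tokenization and case handling.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank r, size x_r) pairs, where size is a word's count and r = 1 is the largest."""
    counts = Counter(words)
    sizes = sorted(counts.values(), reverse=True)
    return [(r, x) for r, x in enumerate(sizes, start=1)]

text = "the cat and the dog and the bird"  # toy text, for illustration only
ranked = rank_frequency(text.split())
print(ranked)  # [(1, 3), (2, 2), (3, 1), (4, 1), (5, 1)]
```

Plotting log size against log rank for a large text gives the Zipf plot, and a straight line of slope -α indicates a rank distribution that follows a power law.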
Size distributions
Brown Corpus (1,015,945 words):
[Figure: CCDF, $\log_{10} N_{>q}$ versus $\log_{10} q$, beside Zipf plot, $\log_{10} q_i$ versus $\log_{10}$ rank $i$.]
• The, of, and, to, a, ... = ‘objects’
• ‘Size’ = word frequency
• Beep: CCDF and Zipf plots are related...
Size distributions
Brown Corpus (1,015,945 words):
[Figure: CCDF, $\log_{10} N_{>q}$ versus $\log_{10} q$, beside Zipf plot with axes flipped, $\log_{10}$ rank $i$ versus $\log_{10} q_i$.]
• The, of, and, to, a, ... = ‘objects’
• ‘Size’ = word frequency
• Beep: CCDF and Zipf plots are related...
Observe:
• $N P_{\ge}(x)$ = the number of objects with size at least $x$, where $N$ = total number of objects.
• If an object has size $x_r$, then $N P_{\ge}(x_r)$ is its rank $r$.
• So
$x_r \propto r^{-\alpha} = (N P_{\ge}(x_r))^{-\alpha} \propto x_r^{(-\gamma+1)(-\alpha)}$ since $P_{\ge}(x) \sim x^{-\gamma+1}$.
We therefore have $1 = (-\gamma + 1)(-\alpha)$ or:
$\alpha = \frac{1}{\gamma - 1}$
• A rank distribution exponent of $\alpha = 1$ corresponds to a size distribution exponent $\gamma = 2$.
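The relation between the two exponents can be checked on idealized data. The sketch below assumes sizes that follow x_r = r^(-α) exactly, with α = 0.8 and N = 1000 chosen arbitrarily; the helper names are hypothetical.

```python
import math

def gamma_from_alpha(alpha):
    """Size-distribution exponent implied by a Zipf exponent: gamma = 1 + 1/alpha."""
    return 1.0 + 1.0 / alpha

def alpha_from_gamma(gamma):
    """Zipf exponent implied by a size-distribution exponent: alpha = 1/(gamma - 1)."""
    return 1.0 / (gamma - 1.0)

# Idealized data: N objects whose sizes follow x_r = r**(-alpha) exactly.
alpha, N = 0.8, 1000
sizes = [r ** (-alpha) for r in range(1, N + 1)]

def ccdf(x):
    """Fraction of objects with size at least x."""
    return sum(1 for s in sizes if s >= x) / N

# Slope of log CCDF against log size between ranks 10 and 100 should equal -(gamma - 1).
xa, xb = sizes[9], sizes[99]
slope = (math.log(ccdf(xa)) - math.log(ccdf(xb))) / (math.log(xa) - math.log(xb))
print(slope, -(gamma_from_alpha(alpha) - 1.0))
```

Both printed numbers are -1.25 up to rounding, which is -1/α, as the derivation predicts. Real data only approximates this because the power law usually holds in the tail alone.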
The Don.
Extreme deviations in test cricket:
[Figure: horizontal axis running from 0 to 100 in steps of 10.]
• Don Bradman’s batting average = 166% of the next best.
References I
[1] P. Bak, K. Christensen, L. Danon, and T. Scanlon. Unified scaling law for earthquakes. Phys. Rev. Lett., 88:178501, 2002.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–511, 1999.
[3] K. Christensen, L. Danon, T. Scanlon, and P. Bak. Unified scaling law for earthquakes. Proc. Natl. Acad. Sci., 99:2509–2513, 2002.
[4] P. Grassberger. Critical behaviour of the Drossel-Schwabl forest fire model. New Journal of Physics, 4:17.1–17.15, 2002.
References II
[5] J. Holtsmark. ? Ann. Phys., 58:577–, 1917.
[6] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, pages 323–351, 2005.
[7] D. J. d. S. Price. Networks of scientific papers. Science, 149:510–515, 1965.
[8] D. J. d. S. Price. A general theory of bibliometric and other cumulative advantage processes. J. Amer. Soc. Inform. Sci., 27:292–306, 1976.
References III
[9] L. F. Richardson. Variation of the frequency of fatal quarrels with magnitude. J. Amer. Stat. Assoc., 43:523–546, 1949.
[10] H. A. Simon. On a class of skew distribution functions. Biometrika, 42:425–440, 1955.
[11] N. N. Taleb. The Black Swan. Random House, New York, 2007.
[12] G. K. Zipf. Human Behaviour and the Principle of Least-Effort. Addison-Wesley, Cambridge, MA, 1949.