info 2950, lecture 19power laws in log-log space y = cx k (k=1/2,1,2)log 10 y = k ∗ log 10 x +log...

Prob Set 6: tomorrow sometime (likely just first part)

tentative due dates for the rest:ps6 due Mon 23 Apr [longest]ps7 due Wed 2 Mayps8 due Wed 9 May (last day classes, auto-extension to end week)

Info 2950, Lecture 19 12 Apr 2018

Z = X . Y10c = 10a . 10b = 10a+b

c = a + blog10Z = log10X + log10Y

Y = cXk

log10Y = klog10X + log10c(of the form y = ax + b)

Power Laws in log-log space

y = cxk (k=1/2,1,2) log10 y = k ∗ log10 x + log10 c

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

sqrt(x)x

x**2

1

10

100

1 10 100

sqrt(x)x

x**2

2 / 25


y = cxk (k=1/2,1,2) log10 y = k ∗ log10 x + log10 c

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

sqrt(x)x

x**2

1

10

100

1 10 100

sqrt(x)x

x**2

2 / 25


y = cx−k (k=1/2,1,2) log10 y = −k ∗ log10 x + log10 c

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

100/sqrt(x)100/x

100/x**2

1

10

100

1 10 100

100/sqrt(x)100/x

100/x**2

6 / 25


y = cx−k (k=1/2,1,2) log10 y = −k ∗ log10 x + log10 c

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

100/sqrt(x)100/x

100/x**2

1

10

100

1 10 100

100/sqrt(x)100/x

100/x**2

6 / 25

Zipf’s law

Now we have characterized the growth of the vocabulary incollections.

We also want to know how many frequent vs. infrequentterms we should expect in a collection.

In natural language, there are a few very frequent terms andvery many very rare terms.

Zipf’s law (linguist/philologist George Zipf, 1935):The i th most frequent term has frequency proportional to 1/i .

cf i ∝1i

cf i is collection frequency: the number of occurrences of theterm ti in the collection.

3 / 25

Zipf’s law

Now we have characterized the growth of the vocabulary incollections.

We also want to know how many frequent vs. infrequentterms we should expect in a collection.

In natural language, there are a few very frequent terms andvery many very rare terms.

Zipf’s law (linguist/philologist George Zipf, 1935):The i th most frequent term has frequency proportional to 1/i .

cf i ∝1i

cf i is collection frequency: the number of occurrences of theterm ti in the collection.

3 / 25

http://en.wikipedia.org/wiki/Zipf’s law

Zipf’s law: the frequency of any word is inversely proportional toits rank in the frequency table. Thus the most frequent word willoccur approximately twice as often as the second most frequentword, which occurs twice as often as the fourth most frequentword, etc. Brown Corpus:

“the”: 7% of all word occurrences (69,971 of!>1M).

“of”: ∼3.5% of words (36,411)

“and”: 2.9% (28,852)

Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English

is a carefully compiled selection of current American English, totaling

about a million words drawn from a wide variety of sources . . . for many

years among the most-cited resources in the field.

4 / 25




“of”: ∼3.5% of words (36,411)

“and”: 2.9% (28,852)






4 / 25




“of”: ∼3.5% of words (36,411)

“and”: 2.9% (28,852)






4 / 25

(Recall ps3#3 for 120 article dataset)

Zipf’s law

Zipf’s law: The i th most frequent term has frequencyproportional to 1/i .

cf i ∝1i

cf is collection frequency: the number of occurrences of theterm in the collection.

So if the most frequent term (the) occurs cf1 times, then thesecond most frequent term (of) has half as many occurrencescf2 =

12cf1 . . .

. . . and the third most frequent term (and) has a third asmany occurrences cf3 =

13cf1 etc.

Equivalent: cf i = cik and log cf i = log c + k log i (for k = −1)

Example of a power law

5 / 25

Zipf’s law for Reuters

0 1 2 3 4 5 6 7

01

23

45

67

log10 rank

log1

0 cf

Fit far from perfect, but nonetheless key insight:Few frequent terms, many rare terms.

7 / 25

more from http://en.wikipedia.org/wiki/Zipf’s law

“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is rank of a word in the

frequency table; y is the total number of the words occurrences. Most popular words are “the”, “of” and “and”, as

expected. Zipf’s law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.”

8 / 25

Another Wikipedia count (15 May 2010)

http://imonad.com/seo/wikipedia-word-frequency-list/

All articles in the English version of Wikipedia, 21GB in XMLformat (five hours to parse entire file, extract data from markuplanguage, filter numbers, special characters, extract statistics):

Total tokens (words, no numbers): T = 1,570,455,731

Unique tokens (words, no numbers): M = 5,800,280

11 / 25

“Word frequency distribution follows Zipf’s law”

12 / 25

rank 1–50 (86M-3M), stop words (the, of, and, in, to, a, is,. . .)

rank 51–3K (2.4M-56K), frequent words (university, January,tea, sharp, . . .)

rank 3K–200K (56K-118), words from large comprehensivedictionaries (officiates, polytonality, neologism, . . .)above rank 50K mostly Long Tail words

rank 200K–5.8M (117-1), terms from obscure niches,misspelled words, transliterated words from other languages,new words and non-words (euprosthenops, eurotrochilus,lokottaravada, . . .)

13 / 25

Some selected words and associated counts

Google 197920

Twitter 894

domain 111850

domainer 22

Wikipedia 3226237

Wiki 176827

Obama 22941

Oprah 3885

Moniker 4974

GoDaddy 228

14 / 25

Project Gutenberg (per billion)

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg

Over 36,000 items (Jun 2011), average of > 50 new e-books / weekhttp://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

the 56271872

of 33950064

and 29944184

to 25956096

in 17420636

I 11764797

that 11073318

was 10078245

his 8799755

he 8397205

it 8058110

with 7725512

is 7557477

for 7097981

as 7037543

had 6139336

you 6048903

not 5741803

be 5662527

her 5202501

. . . 100, 000th

15 / 25

If a city is 10 times as populous,does it have 10 times as many gas stations?

Empirically, G = c P.77 (economies of scale)

similarly for miles of roadway, length of electrical cables, …,k ranges from .7 to .9

https://opinionator.blogs.nytimes.com/2009/05/19/math-and-the-city/

Power laws more generally

E.g., consider power law distributions of the form c r−k ,describing the number of book sales versus sales-rank r of a book,or the number of Wikipedia edits made by the r th most frequentcontributor to Wikipedia.

Amazon book sales: c r−k , k ≈ .87

number of Wikipedia edits: c r−k , k ≈ 1.7

(More on power laws and the long tail here:Networks, Crowds, and Markets:

Reasoning About a Highly Connected World

by David Easley and Jon KleinbergChpt 18: http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf)

9 / 25

Power Laws more generally

0

200

400

600

800

1000

0 200 400 600 800 1000

Wik

iped

ia e

dits

/mon

th |

Amaz

on s

ales

/wee

k

User|Book rank r

40916 / r^{.87}

1258925 / r^{1.7}

Normalization given by the roughly1 sale/week for the200,000th ranked Amazon title:

40916r−.87

and by the10 edits/month for the1000th ranked Wikipedia editor:

1258925r−1.7

0.1

1

10

100

1000

10000

100000

1e+06

1e+07

1 10 100 1000 10000 100000 1e+06

Wik

iped

ia e

dits

/mon

th |

Amaz

on s

ales

/wee

k

User|Book rank r

1258925 / r^{1.7}

40916 / r^{.87}

Long tail: about a quarter ofAmazon book sales estimatedto come from the long tail,i.e., those outside the top100,000 bestselling titles

10 / 25

0

200

400

600

800

1000

0 200 400 600 800 1000

Wik

iped

ia e

dits

/mon

th |

Amaz

on s

ales

/wee

k

User|Book rank r

40916 / r^{.87}

1258925 / r^{1.7}

Normalization given by the roughly1 sale/week for the200,000th ranked Amazon title:

40916r−.87

and by the10 edits/month for the1000th ranked Wikipedia editor:

1258925r−1.7

0.1

1

10

100

1000

10000

100000

1e+06

1e+07

1 10 100 1000 10000 100000 1e+06

Wik

iped

ia e

dits

/mon

th |

Amaz

on s

ales

/wee

k

User|Book rank r

1258925 / r^{1.7}

40916 / r^{.87}

Long tail: about a quarter ofAmazon book sales estimatedto come from the long tail,i.e., those outside the top100,000 bestselling titles

10 / 25

Suppose y = cxk

then log(y) = k log(x) + log(c),and the plot of log(y) vs log(x) is a straight line with slope k (and y-intercept log(c) when x=1)

Now suppose some particular x0 value has associated y0 = cxk .Then x1 = 10x0 has y1 = cxk = c(10x0)k = 10k cxk = 10k y0 .So if x increases by a factor of 10, then y increases by a factor of 10k,and the value of k is easily determined.

More generally, if x1 = r x0 and y1 = s y0 for two points (x0, y0) and (x1, y1)on the curve y = cxk then s = y1 / y0 = (x1 / x0)k = rk

and comparison of r (ratio of x1 and x0) to s (ratio of y1 and y0)provides the value of the exponent k.

e.g. x -> 10x, y -> 10ky ; or generally x -> 102x, y-> 102ky

0

1 0

k = 3 k = -2

k = .5 k = -1

k = 1/3 k = -1/2

10003.5k/2 = 10-4

so k = -8/10.5 = -.76

“moment magnitude” is a logarithmic scale,from 6 to 8 is 1000 times more powerful,

(so, e.g., from 5 to 6 is 31.6 times)

f = c Mk

(Figures on next three slides from: N. Silver, “The signal and the noise” (2012))

f(9.1) = 1/13000 = .00008f(9.1) = 1/300 = .0033

11 Mar 2011: 9.1

1000k = 10-2

so k = -2/3f = c Mk f = c Mk 1000k/2 = 10-2.1

so k = -1.4

info 2950, lecture 19power laws in log-log space y = cx k (k=1/2,1,2)log 10 y = k ∗ log 10 x +log...

Documents