1
A Random Text Model for the Generation of
Statistical Language Invariants
Chris Biemann, University of Leipzig, Germany
HLT-NAACL 2007, Rochester, NY, USA
Monday, April 23, 2007
2
Outline
• Previous random text models
• Large-scale measures for text
• A novel random text model
• Comparison to natural language text
3
Necessary property: Zipf's Law

• Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word at rank r and its rank is given by f(r) ~ r^{-z}, where z is the exponent of the power law, corresponding to the slope of the curve in a log-log plot. For word frequencies in natural language, z ≈ 1 (a code sketch for estimating z follows the figure below).
• Zipf-Mandelbrot: f(r) ~ (r + c1)^{-(1+c2)}: approximates the lower frequencies at very high ranks.
[Figure: rank-frequency plot (frequency vs. rank, log-log), spoken English vs. power law z=1.4 vs. Zipf-Mandelbrot with c1=10, c2=0.4]
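For concreteness (not from the original slides), a minimal Python sketch of how z can be estimated from a tokenized corpus; the file name corpus.txt is a placeholder:

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate z in f(r) ~ r^-z: negative slope of a least-squares line
    fitted to the rank-frequency curve in log-log space."""
    freqs = sorted(Counter(tokens).values(), reverse=True)  # frequencies by rank 1, 2, ...
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

tokens = open('corpus.txt').read().split()  # 'corpus.txt' is a placeholder file name
print('z =', zipf_exponent(tokens))
```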
4
Previous Random Text Models

B. B. Mandelbrot (1953)
• Sometimes called the "monkey at the typewriter".
• With probability w, a word separator is generated at each step;
• with probability (1-w)/N, a letter from an alphabet of size N is generated.

H. A. Simon (1955)
• No alphabet of single letters.
• At each time step, a previously unseen new word is added to the stream with probability α, whereas with probability (1-α) the next word is chosen amongst the words at previous positions.
• This yields a frequency distribution that follows a power law with exponent z = (1-α).
• Modified by Zanette and Montemurro (2002):
  - sublinear vocabulary growth for higher exponents
  - Zipf-Mandelbrot law via a maximum probability threshold
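As a hedged illustration, both generators can be sketched in a few lines of Python (the parameter names w, N, alpha follow the slide; all other details are assumptions):

```python
import random

def mandelbrot_monkey(w=0.2, N=26, steps=100000):
    """Monkey at the typewriter: emit a word separator with probability w,
    otherwise a uniformly chosen letter from an alphabet of size N."""
    alphabet = [chr(ord('A') + i) for i in range(N)]
    chars = [' ' if random.random() < w else random.choice(alphabet)
             for _ in range(steps)]
    return ''.join(chars).split()

def simon(alpha=0.1, steps=100000):
    """Simon's model: with probability alpha append a previously unseen word;
    otherwise repeat the word at a uniformly chosen earlier position,
    i.e. reuse words in proportion to their past frequency."""
    stream = ['w0']
    for t in range(1, steps):
        if random.random() < alpha:
            stream.append('w%d' % t)          # new, previously unseen word
        else:
            stream.append(random.choice(stream))
    return stream
```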
5
Critique of Previous Models

• Mandelbrot: All words of the same length are equiprobable, since all letters are equiprobable. Ferrer i Cancho and Solé (2002): initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from?
• Simon: No concept of "letter" at all.
• Both:
  - no concept of sentence
  - no word order restrictions: Simon = bag of words, Mandelbrot does not take the generated stream into account at all
6
Large-scale Measures for Text
• Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (probability of frequencies) should follow a power law with z ≈ 2 (a Pareto distribution).
• Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004).
• Sentence length: should also be distributed as in natural language, following the same gamma distribution.
• Significant neighbour-based co-occurrence graph: should be similar to natural language in terms of degree distribution and connectivity.
7
A Novel Random Text Model
Two parts:
• Word Generator
• Sentence Generator

Both follow the principle of beaten tracks:
• Memorize what has been generated before.
• Generate with higher probability what has been generated more often before.

Inspired by small-world network generation, especially (Kumar et al. 1999).
8
Word Generator

• Initialisation:
  - Letter graph of N letters.
  - Vertices are connected to themselves with weight 1.
• Choice:
  - When generating a word, the generator chooses a letter x according to its probability P(x), computed as the normalized weight sum of its outgoing edges:

    P(x) = weightsum(x) / \sum_{v \in V} weightsum(v),
    with weightsum(y) = \sum_{u \in neigh(y)} weight(y, u)

• Parameter:
  - At every position, the word ends with probability w ∈ (0,1), or the next letter is generated according to the letter production probability given above.
• Update:
  - For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one.
• Effect: self-reinforcement of letter probabilities:
  - the more often a letter is generated, the higher its weight sum will be in subsequent steps,
  - leading to an increased generation probability.
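A minimal Python sketch of this word generator, assuming letters are drawn independently from P(x) at every position and the bigram update happens once per finished word, as the slide describes (the class is reused in the sentence generator sketch further below):

```python
import random
from collections import defaultdict

class WordGenerator:
    """Beaten-tracks word generator: letter probabilities are the normalized
    weight sums of outgoing edges in a growing letter graph."""

    def __init__(self, n_letters=26, w=0.4):
        self.w = w                                   # word-end probability
        self.letters = [chr(ord('A') + i) for i in range(n_letters)]
        self.out = defaultdict(lambda: defaultdict(int))
        for x in self.letters:                       # initialisation: self-loops of weight 1
            self.out[x][x] = 1

    def _weightsum(self, y):
        return sum(self.out[y].values())             # sum of outgoing edge weights

    def _next_letter(self):
        # P(x) = weightsum(x) / sum over all letters v of weightsum(v)
        weights = [self._weightsum(x) for x in self.letters]
        return random.choices(self.letters, weights=weights)[0]

    def word(self):
        letters = [self._next_letter()]
        while random.random() >= self.w:             # continue with probability (1 - w)
            letters.append(self._next_letter())
        for a, b in zip(letters, letters[1:]):       # update: +1 on every bigram edge
            self.out[a][b] += 1
        return ''.join(letters)
```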
9
Word Generator Example
The small numbers next to the edges are edge weights. The probabilities for the letters in the next step are:
P(A) = 0.4, P(B) = 0.4, P(C) = 0.2
10
Measures on the Word Generator
• The word generator fulfills the measures much better than the Mandelbrot model.
• For the other measures, we need something extra...
[Figure: rank-frequency plot (frequency vs. rank, log-log), word generator w=0.2 vs. power law z=1 vs. Mandelbrot model]
[Figure: lexical spectrum (P(frequency) vs. frequency, log-log), word generator w=0.2 vs. power law z=2 vs. Mandelbrot model]
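For reference, a small sketch of how the lexical spectrum plotted above can be computed from a token stream (assuming the spectrum is normalized over word types):

```python
from collections import Counter

def lexical_spectrum(tokens):
    """P(frequency): fraction of word types that occur with a given frequency."""
    type_freq = Counter(tokens)                # word type -> frequency
    spectrum = Counter(type_freq.values())     # frequency -> number of types
    n_types = len(type_freq)
    return {f: c / n_types for f, c in sorted(spectrum.items())}
```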
11
Sentence Generator I

• Initialisation:
  - The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS.
• Word graph (directed):
  - Vertices correspond to words.
  - Edge weights correspond to the number of times two words were generated in sequence.
• Generation:
  - A random walk on the directed edges starts at the BOS vertex.
  - With probability (1-s), an existing edge is followed from the current vertex to the next vertex (s is the new-word probability, see the next slide).
  - The probability of choosing endpoint X from the endpoints of all outgoing edges of the current vertex C is given by

    P_word(X) = weight(C, X) / \sum_{N \in neigh(C)} weight(C, N)
12
Sentence Generator II

• Parameter:
  - With probability s ∈ (0,1), a new word is generated by the word generator model.
  - Its successor is chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as the successor of a newly generated word N is given by

    P_word(E) = indgw(E) / \sum_{v \in V} indgw(v),
    with indgw(X) = \sum_{v \in V} weight(v, X)

• Update:
  - For each sequence of two words generated, the weight of the directed edge between them is increased by 1.
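A combined Python sketch of both sentence generator slides, reusing the WordGenerator class from the earlier sketch; the vertex labels <BOS>/<EOS> and the once-per-walk update are assumptions about details the slides leave open:

```python
import random
from collections import defaultdict

BOS, EOS = '<BOS>', '<EOS>'

class SentenceGenerator:
    """Random walk on a growing word graph. With probability s a new word comes
    from the word generator and its successor is chosen by weighted indegree;
    otherwise an existing edge is followed."""

    def __init__(self, word_gen, s=0.08):
        self.word_gen = word_gen
        self.s = s
        self.out = defaultdict(lambda: defaultdict(int))   # out[a][b] = weight(a, b)
        self.indgw = defaultdict(int)                      # weighted indegree
        self.out[BOS][EOS] = 1                             # initial BOS -> EOS edge
        self.indgw[EOS] = 1

    def _follow_edge(self, current):
        # P_word(X) = weight(C, X) / sum of weights of C's outgoing edges
        succ = list(self.out[current])
        return random.choices(succ, weights=[self.out[current][x] for x in succ])[0]

    def _successor_of_new_word(self):
        # P_word(E) = indgw(E) / sum over all vertices v of indgw(v)
        verts = list(self.indgw)
        return random.choices(verts, weights=[self.indgw[v] for v in verts])[0]

    def sentence(self):
        walk = [BOS]
        while walk[-1] != EOS:
            if random.random() < self.s:
                walk.append(self.word_gen.word())          # new word
                walk.append(self._successor_of_new_word())
            else:
                walk.append(self._follow_edge(walk[-1]))
        for a, b in zip(walk, walk[1:]):                   # update: +1 per traversed edge
            self.out[a][b] += 1
            self.indgw[b] += 1
        return walk[1:-1]   # may be empty; empty sentences are skipped in the output

# Usage: generate 1000 sentences, skipping the empty ones (cf. the next slide).
wg = WordGenerator(w=0.4)
sg = SentenceGenerator(wg, s=0.08)
sentences = [s for s in (sg.sentence() for _ in range(1000)) if s]
```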
13
Sentence Generator Example
• In the last step, the second CA was generated as a new word by the word generator.
• Empty sentences are generated frequently; they are omitted in the output.
14
Comparison to Natural Language

• Corpus for comparison: the first 1 million words of the BNC, spoken English.
• 26 letters, uppercase, punctuation removed → same alphabet in the word generator
• 125,395 sentences → set s=0.08, remove the first 50K generated sentences
• Average sentence length: 7.975 words
• Average word length: 3.502 letters → w=0.4

Sample utterances from the corpus (a preprocessing sketch follows the samples):
OOH
OOH
ERM
WOULD LIKE A CUP OF THIS ER
MM
SORRY NOW THAT S
NO NO I DID NT
I KNEW THESE PEWS WERE HARD
OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION
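A small sketch of the preprocessing and the two averages used to set the parameters; the exact normalization (e.g., what counts as punctuation) is an assumption:

```python
import re

def preprocess(line):
    """Uppercase and keep only the letters A-Z, mirroring the setup above;
    the treatment of punctuation and digits is an assumption."""
    return re.sub(r'[^A-Z ]+', ' ', line.upper()).split()

def averages(sentences):
    """Average sentence length (in words) and average word length (in letters),
    the two statistics matched when setting s and w."""
    words = [w for s in sentences for w in s]
    return (sum(len(s) for s in sentences) / len(sentences),
            sum(len(w) for w in words) / len(words))
```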
15
Word Frequency
[Figure: rank-frequency plot (frequency vs. rank, log-log), sentence generator vs. English, with a power law z=1.5 for comparison]

• Zipf-Mandelbrot distribution
• Smooth curve
• Similar to English
16
Word Length
• More 1-letter words in the sentence generator
• Longer words in the sentence generator
• Curve is similar
• Gamma distribution here: f(x) ~ x^{1.5} \cdot 0.45^x (evaluated in the sketch after the figure)
[Figure: word length distribution (frequency vs. length in letters), sentence generator vs. English vs. fitted gamma distribution]
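To compare the fitted curve with an observed length histogram, the unnormalized f(x) ~ x^{1.5} · 0.45^x can be turned into a probability mass function; the normalization range below is an assumption:

```python
def length_pmf(a=1.5, b=0.45, max_len=30):
    """Normalize the fitted curve f(x) ~ x^a * b^x into a probability mass
    function over lengths 1..max_len (the cutoff is an assumption)."""
    raw = {x: x ** a * b ** x for x in range(1, max_len + 1)}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}

print(length_pmf()[3])  # probability mass at word length 3
```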
17
Sentence Length
• Longer sentences in English
• More 2-word sentences in English
• Curve is similar
[Figure: sentence length distribution (number of sentences vs. length in words), sentence generator vs. English]
18
Neighbor-based Co-occurrence Graph
• Minimum co-occurrence frequency = 2, minimum log-likelihood ratio = 3.84 (a sketch of the significance test follows after the table)
• The NB-graph is a small world
• Qualitatively, English and the sentence generator are similar
• The word generator shows far fewer co-occurrences
• Clustering coefficient and number of vertices differ by about a factor of 2 between English and the sentence generator
[Figure: degree distribution (number of vertices vs. degree interval, log-log), sentence generator vs. English vs. word generator, with a power law z=2 for comparison]
                     English sample   sentence gen.   word gen.   random graph (ER)
# of vertices        7154             15258           3498        10000
avg. shortest path   2.933            3.147           3.601       4.964
avg. degree          9.445            6.307           3.069       7
clustering coeff.    0.2724           0.1497          0.0719      6.89E-4
z                    1.966            2.036           2.007       -
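The threshold 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom, which suggests a log-likelihood ratio test in Dunning's (1993) formulation; a sketch under that assumption (the contingency-table construction is hypothetical):

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 neighbor co-occurrence table:
    k11 = positions where word A is directly followed by word B,
    k12 = A followed by not-B, k21 = not-A followed by B, k22 = neither.
    Values above 3.84 are significant at p = 0.05 (chi-square, 1 d.f.)."""
    def h(*ks):                        # sum of k * log(k / total)
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)     # row sums
                - h(k11 + k21, k12 + k22))    # column sums
```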
19
Formation of Sentences
• The word graph grows and, at every time step, contains the full vocabulary generated so far.
• Random walks starting from BOS always end in EOS.
• Sentence length slowly increases: the random walk has more and more possibilities before finally arriving at the EOS vertex.
• Sentence length is influenced by both parameters of the model:
  - the word-end probability w in the word generator,
  - the new-word probability s in the sentence generator.
[Figure: sentence length growth (average sentence length vs. text interval, log-log) for (w=0.4, s=0.08), (w=0.4, s=0.1), (w=0.17, s=0.22), (w=0.3, s=0.09), compared to x^0.25]
20
Conclusion
Novel random text model:
• obeys Zipf's law
• obeys the word length distribution
• obeys the sentence length distribution
• shows similar neighbour-based co-occurrence behaviour

First model that:
• produces a smooth lexical spectrum without initial letter probabilities
• incorporates the notion of a sentence
• models word order restrictions
21
Sentence generator at work
Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF . XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U . RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF . R . Z U .
Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY . FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN . TA KV XJP O EGV J HQY KMQ U .
22
Questions?
Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm