1
A Random Text Model for the Generation of
Statistical Language Invariants
Chris Biemann, University of Leipzig, Germany
HLT-NAACL 2007, Rochester, NY, USA
Monday, April 23, 2007
2
Outline
• Previous random text models
• Large-scale measures for text
• A novel random text model
• Comparison to natural language text
3
Necessary property: Zipf's Law

• Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word at rank r and its rank is given by f(r) ~ r^{-z}, where z is the exponent of the power law, corresponding to the slope of the curve in a log-log plot. For word frequencies in natural language, z ≈ 1 (a code sketch for estimating z follows the figure below).
• Zipf-Mandelbrot: f(r) ~ (r + c1)^{-(1+c2)}: approximates the lower frequencies at very high ranks.
[Figure: rank-frequency plot (frequency vs. rank, log-log), spoken English vs. power law z=1.4 vs. Zipf-Mandelbrot with c1=10, c2=0.4]
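For concreteness (not from the original slides), a minimal Python sketch of how z can be estimated from a tokenized corpus; the file name corpus.txt is a placeholder:

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate z in f(r) ~ r^-z: negative slope of a least-squares line
    fitted to the rank-frequency curve in log-log space."""
    freqs = sorted(Counter(tokens).values(), reverse=True)  # frequencies by rank 1, 2, ...
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

tokens = open('corpus.txt').read().split()  # 'corpus.txt' is a placeholder file name
print('z =', zipf_exponent(tokens))
```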
4
Previous Random Text Models

B. B. Mandelbrot (1953)
• Sometimes called the "monkey at the typewriter".
• With probability w, a word separator is generated at each step;
• with probability (1-w)/N, a letter from an alphabet of size N is generated.

H. A. Simon (1955)
• No alphabet of single letters.
• At each time step, a previously unseen new word is added to the stream with probability α, whereas with probability (1-α) the next word is chosen amongst the words at previous positions.
• This yields a frequency distribution that follows a power law with exponent z = (1-α).
• Modified by Zanette and Montemurro (2002):
  - sublinear vocabulary growth for higher exponents
  - Zipf-Mandelbrot law via a maximum probability threshold
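As a hedged illustration, both generators can be sketched in a few lines of Python (the parameter names w, N, alpha follow the slide; all other details are assumptions):

```python
import random

def mandelbrot_monkey(w=0.2, N=26, steps=100000):
    """Monkey at the typewriter: emit a word separator with probability w,
    otherwise a uniformly chosen letter from an alphabet of size N."""
    alphabet = [chr(ord('A') + i) for i in range(N)]
    chars = [' ' if random.random() < w else random.choice(alphabet)
             for _ in range(steps)]
    return ''.join(chars).split()

def simon(alpha=0.1, steps=100000):
    """Simon's model: with probability alpha append a previously unseen word;
    otherwise repeat the word at a uniformly chosen earlier position,
    i.e. reuse words in proportion to their past frequency."""
    stream = ['w0']
    for t in range(1, steps):
        if random.random() < alpha:
            stream.append('w%d' % t)          # new, previously unseen word
        else:
            stream.append(random.choice(stream))
    return stream
```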
5
Critique of Previous Models

• Mandelbrot: All words of the same length are equiprobable, since all letters are equiprobable. Ferrer i Cancho and Solé (2002): initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from?
• Simon: No concept of "letter" at all.
• Both:
  - no concept of sentence
  - no word order restrictions: Simon = bag of words, Mandelbrot does not take the generated stream into account at all
6
Large-scale Measures for Text
• Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (probability of frequencies) should follow a power law with z ≈ 2 (a Pareto distribution).
• Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004).
• Sentence length: should also be distributed as in natural language, following the same gamma distribution.
• Significant neighbour-based co-occurrence graph: should be similar to natural language in terms of degree distribution and connectivity.
7
A Novel Random Text Model
Two parts:
• Word Generator
• Sentence Generator

Both follow the principle of beaten tracks:
• Memorize what has been generated before.
• Generate with higher probability what has been generated more often before.

Inspired by small-world network generation, especially (Kumar et al. 1999).
8
Word Generator

• Initialisation:
  - Letter graph of N letters.
  - Vertices are connected to themselves with weight 1.
• Choice:
  - When generating a word, the generator chooses a letter x according to its probability P(x), computed as the normalized weight sum of its outgoing edges:

    P(x) = weightsum(x) / \sum_{v \in V} weightsum(v),
    with weightsum(y) = \sum_{u \in neigh(y)} weight(y, u)

• Parameter:
  - At every position, the word ends with probability w ∈ (0,1), or the next letter is generated according to the letter production probability given above.
• Update:
  - For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one.
• Effect: self-reinforcement of letter probabilities:
  - the more often a letter is generated, the higher its weight sum will be in subsequent steps,
  - leading to an increased generation probability.
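A minimal Python sketch of this word generator, assuming letters are drawn independently from P(x) at every position and the bigram update happens once per finished word, as the slide describes (the class is reused in the sentence generator sketch further below):

```python
import random
from collections import defaultdict

class WordGenerator:
    """Beaten-tracks word generator: letter probabilities are the normalized
    weight sums of outgoing edges in a growing letter graph."""

    def __init__(self, n_letters=26, w=0.4):
        self.w = w                                   # word-end probability
        self.letters = [chr(ord('A') + i) for i in range(n_letters)]
        self.out = defaultdict(lambda: defaultdict(int))
        for x in self.letters:                       # initialisation: self-loops of weight 1
            self.out[x][x] = 1

    def _weightsum(self, y):
        return sum(self.out[y].values())             # sum of outgoing edge weights

    def _next_letter(self):
        # P(x) = weightsum(x) / sum over all letters v of weightsum(v)
        weights = [self._weightsum(x) for x in self.letters]
        return random.choices(self.letters, weights=weights)[0]

    def word(self):
        letters = [self._next_letter()]
        while random.random() >= self.w:             # continue with probability (1 - w)
            letters.append(self._next_letter())
        for a, b in zip(letters, letters[1:]):       # update: +1 on every bigram edge
            self.out[a][b] += 1
        return ''.join(letters)
```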
9
Word Generator Example
The small numbers next to the edges are edge weights. The probabilities for the letters in the next step are:
P(A) = 0.4, P(B) = 0.4, P(C) = 0.2
10
Measures on the Word Generator
• The word generator fulfills the measures much better than the Mandelbrot model.
• For the other measures, we need something extra...
[Figure: rank-frequency plot (frequency vs. rank, log-log), word generator w=0.2 vs. power law z=1 vs. Mandelbrot model]
[Figure: lexical spectrum (P(frequency) vs. frequency, log-log), word generator w=0.2 vs. power law z=2 vs. Mandelbrot model]
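For reference, a small sketch of how the lexical spectrum plotted above can be computed from a token stream (assuming the spectrum is normalized over word types):

```python
from collections import Counter

def lexical_spectrum(tokens):
    """P(frequency): fraction of word types that occur with a given frequency."""
    type_freq = Counter(tokens)                # word type -> frequency
    spectrum = Counter(type_freq.values())     # frequency -> number of types
    n_types = len(type_freq)
    return {f: c / n_types for f, c in sorted(spectrum.items())}
```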
11
Sentence Generator I

• Initialisation:
  - The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS.
• Word graph (directed):
  - Vertices correspond to words.
  - Edge weights correspond to the number of times two words were generated in sequence.
• Generation:
  - A random walk on the directed edges starts at the BOS vertex.
  - With probability (1-s), an existing edge is followed from the current vertex to the next vertex (s is the new-word probability, see the next slide).
  - The probability of choosing endpoint X from the endpoints of all outgoing edges of the current vertex C is given by

    P_word(X) = weight(C, X) / \sum_{N \in neigh(C)} weight(C, N)
12
Sentence Generator II

• Parameter:
  - With probability s ∈ (0,1), a new word is generated by the word generator model.
  - Its successor is chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as the successor of a newly generated word N is given by

    P_word(E) = indgw(E) / \sum_{v \in V} indgw(v),
    with indgw(X) = \sum_{v \in V} weight(v, X)

• Update:
  - For each sequence of two words generated, the weight of the directed edge between them is increased by 1.
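A combined Python sketch of both sentence generator slides, reusing the WordGenerator class from the earlier sketch; the vertex labels <BOS>/<EOS> and the once-per-walk update are assumptions about details the slides leave open:

```python
import random
from collections import defaultdict

BOS, EOS = '<BOS>', '<EOS>'

class SentenceGenerator:
    """Random walk on a growing word graph. With probability s a new word comes
    from the word generator and its successor is chosen by weighted indegree;
    otherwise an existing edge is followed."""

    def __init__(self, word_gen, s=0.08):
        self.word_gen = word_gen
        self.s = s
        self.out = defaultdict(lambda: defaultdict(int))   # out[a][b] = weight(a, b)
        self.indgw = defaultdict(int)                      # weighted indegree
        self.out[BOS][EOS] = 1                             # initial BOS -> EOS edge
        self.indgw[EOS] = 1

    def _follow_edge(self, current):
        # P_word(X) = weight(C, X) / sum of weights of C's outgoing edges
        succ = list(self.out[current])
        return random.choices(succ, weights=[self.out[current][x] for x in succ])[0]

    def _successor_of_new_word(self):
        # P_word(E) = indgw(E) / sum over all vertices v of indgw(v)
        verts = list(self.indgw)
        return random.choices(verts, weights=[self.indgw[v] for v in verts])[0]

    def sentence(self):
        walk = [BOS]
        while walk[-1] != EOS:
            if random.random() < self.s:
                walk.append(self.word_gen.word())          # new word
                walk.append(self._successor_of_new_word())
            else:
                walk.append(self._follow_edge(walk[-1]))
        for a, b in zip(walk, walk[1:]):                   # update: +1 per traversed edge
            self.out[a][b] += 1
            self.indgw[b] += 1
        return walk[1:-1]   # may be empty; empty sentences are skipped in the output

# Usage: generate 1000 sentences, skipping the empty ones (cf. the next slide).
wg = WordGenerator(w=0.4)
sg = SentenceGenerator(wg, s=0.08)
sentences = [s for s in (sg.sentence() for _ in range(1000)) if s]
```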
13
Sentence Generator Example
• In the last step, the second CA was generated as a new word by the word generator.
• Empty sentences are generated frequently; they are omitted in the output.
14
Comparison to Natural Language

• Corpus for comparison: the first 1 million words of the BNC, spoken English.
• 26 letters, uppercase, punctuation removed → same alphabet in the word generator
• 125,395 sentences → set s=0.08, remove the first 50K generated sentences
• Average sentence length: 7.975 words
• Average word length: 3.502 letters → w=0.4

Sample utterances from the corpus (a preprocessing sketch follows the samples):
OOH
OOH
ERM
WOULD LIKE A CUP OF THIS ER
MM
SORRY NOW THAT S
NO NO I DID NT
I KNEW THESE PEWS WERE HARD
OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION
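A small sketch of the preprocessing and the two averages used to set the parameters; the exact normalization (e.g., what counts as punctuation) is an assumption:

```python
import re

def preprocess(line):
    """Uppercase and keep only the letters A-Z, mirroring the setup above;
    the treatment of punctuation and digits is an assumption."""
    return re.sub(r'[^A-Z ]+', ' ', line.upper()).split()

def averages(sentences):
    """Average sentence length (in words) and average word length (in letters),
    the two statistics matched when setting s and w."""
    words = [w for s in sentences for w in s]
    return (sum(len(s) for s in sentences) / len(sentences),
            sum(len(w) for w in words) / len(words))
```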
15
Word Frequency
[Figure: rank-frequency plot (frequency vs. rank, log-log), sentence generator vs. English, with a power law z=1.5 for comparison]

• Zipf-Mandelbrot distribution
• Smooth curve
• Similar to English
16
Word Length
• More 1-letter words in the sentence generator
• Longer words in the sentence generator
• Curve is similar
• Gamma distribution here: f(x) ~ x^{1.5} \cdot 0.45^x (evaluated in the sketch after the figure)
[Figure: word length distribution (frequency vs. length in letters), sentence generator vs. English vs. fitted gamma distribution]
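To compare the fitted curve with an observed length histogram, the unnormalized f(x) ~ x^{1.5} · 0.45^x can be turned into a probability mass function; the normalization range below is an assumption:

```python
def length_pmf(a=1.5, b=0.45, max_len=30):
    """Normalize the fitted curve f(x) ~ x^a * b^x into a probability mass
    function over lengths 1..max_len (the cutoff is an assumption)."""
    raw = {x: x ** a * b ** x for x in range(1, max_len + 1)}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}

print(length_pmf()[3])  # probability mass at word length 3
```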
17
Sentence Length
• Longer sentences in English
• More 2-word sentences in English
• Curve is similar
[Figure: sentence length distribution (number of sentences vs. length in words), sentence generator vs. English]
18
Neighbor-based Co-occurrence Graph
• Minimum co-occurrence frequency = 2, minimum log-likelihood ratio = 3.84 (a sketch of the significance test follows after the table)
• The NB-graph is a small world
• Qualitatively, English and the sentence generator are similar
• The word generator shows far fewer co-occurrences
• Clustering coefficient and number of vertices differ by about a factor of 2 between English and the sentence generator
[Figure: degree distribution (number of vertices vs. degree interval, log-log), sentence generator vs. English vs. word generator, with a power law z=2 for comparison]
                     English sample   sentence gen.   word gen.   random graph (ER)
# of vertices        7154             15258           3498        10000
avg. shortest path   2.933            3.147           3.601       4.964
avg. degree          9.445            6.307           3.069       7
clustering coeff.    0.2724           0.1497          0.0719      6.89E-4
z                    1.966            2.036           2.007       -
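The threshold 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom, which suggests a log-likelihood ratio test in Dunning's (1993) formulation; a sketch under that assumption (the contingency-table construction is hypothetical):

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 neighbor co-occurrence table:
    k11 = positions where word A is directly followed by word B,
    k12 = A followed by not-B, k21 = not-A followed by B, k22 = neither.
    Values above 3.84 are significant at p = 0.05 (chi-square, 1 d.f.)."""
    def h(*ks):                        # sum of k * log(k / total)
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)     # row sums
                - h(k11 + k21, k12 + k22))    # column sums
```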
19
Formation of Sentences
• The word graph grows and, at every time step, contains the full vocabulary generated so far.
• Random walks starting from BOS always end in EOS.
• Sentence length slowly increases: the random walk has more and more possibilities before finally arriving at the EOS vertex.
• Sentence length is influenced by both parameters of the model:
  - the word-end probability w in the word generator,
  - the new-word probability s in the sentence generator.
[Figure: sentence length growth (average sentence length vs. text interval, log-log) for (w=0.4, s=0.08), (w=0.4, s=0.1), (w=0.17, s=0.22), (w=0.3, s=0.09), compared to x^0.25]
20
Conclusion
Novel random text model:
• obeys Zipf's law
• obeys the word length distribution
• obeys the sentence length distribution
• shows similar neighbour-based co-occurrence behaviour

First model that:
• produces a smooth lexical spectrum without initial letter probabilities
• incorporates the notion of a sentence
• models word order restrictions
21
Sentence generator at work
Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF . XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U . RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF . R . Z U .
Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY . FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN . TA KV XJP O EGV J HQY KMQ U .
22
Questions?
Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm