![Page 1: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/1.jpg)
Learning linguistic structure
John Goldsmith
Computer Science Department
University of Chicago
February 7, 2003
![Page 2: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/2.jpg)
A large part of the field of computational linguistics has moved during the 1990s from
developing grammars, speech recognition engines, etc., that simply work, to
developing systems that learn language-specific parameters from large amounts of data.
![Page 3: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/3.jpg)
Credo…
Statistically-driven methods of data analysis, when applied to natural language data, will produce results which shed light on linguistic structure.
![Page 4: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/4.jpg)
Unsupervised learning
Input: large texts in a natural language, with no prior knowledge of the language.
![Page 5: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/5.jpg)
A bit more about the goal
What’s the input? “Data”, which comes to the learner in acoustic form, unsegmented:
Sentences not broken up into words
Words not broken up into their components (morphemes)
Words not assigned to lexical categories (noun, verb, article, etc.)
With a meaning representation?
![Page 6: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/6.jpg)
Idealization of the language-learning scheme
1. Segment the soundstream into words; the words form the lexicon of the language.
2. Discover internal structure of words; this is the morphology of the language.
3. Infer a set of lexical categories for words; each word is assigned to (at least) one lexical category.
4. Infer a set of phrase-structure rules for the language.
![Page 7: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/7.jpg)
Idealization?
While these tasks are individually coherent, we make no assumption that any one must be completed before another can be begun.
![Page 8: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/8.jpg)
Today’s task
To develop an algorithm capable of learning the morphology of a language, given knowledge of the words of the language, and of a large sample of utterances.
![Page 9: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/9.jpg)
Goals
Given a corpus, learn:
The set of word-roots, prefixes, and suffixes, and principles of combinations;
Principles of automatic alternations (e.g., e drops before the suffixes -ing, -ity and -ed, but not before -s);
Which suffixes have one grammatical function (-ness) and which have more (e.g., -s: song-s versus sing-s).
![Page 10: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/10.jpg)
Why?
Practical applications:
Automatic stemming for multilingual information retrieval
A corpus broken into morphemes is far superior to a corpus broken into words for statistically-driven machine translation
Develop morphologies for speech recognition automatically
![Page 11: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/11.jpg)
Theoretically
There is a strong bias currently in linguistics to underestimate the difficulty of language learning: for example, to identify language learning with the selection of a phrase-structure grammar, or with the independent setting of a small number of parameters.
![Page 12: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/12.jpg)
Morphology
The learning of morphology is a very difficult task, in the sense that every word W of length |W| can potentially be divided into 1, 2, …, L morphemes $m_i$, constrained only by $\sum_i |m_i| = |W|$, and that’s ignoring labeling (which is the stem, which the affix).
The number of potential morphologies for a given corpus is enormous.
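To get a feel for the size of this space, here is a minimal sketch that enumerates every way to cut a single word into contiguous pieces. The function names are illustrative only. Each of the |W| - 1 gaps between letters is either cut or not, so a word has 2^(|W| - 1) segmentations before any labeling of stems and affixes.

```python
from itertools import combinations

def segmentations(word):
    """Yield every way to cut a non-empty `word` into contiguous, non-empty pieces."""
    n = len(word)
    cut_points = range(1, n)  # positions between adjacent letters
    for k in range(n):  # k cuts give k + 1 pieces
        for cuts in combinations(cut_points, k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def count_segmentations(word):
    """Closed form: each of the len(word) - 1 gaps is either cut or left alone."""
    return 2 ** (len(word) - 1)

all_cuts = list(segmentations("jumps"))
print(len(all_cuts), count_segmentations("jumps"))
print(count_segmentations("government"))
```

A morphology has to make such a choice for every word in the corpus, so the number of candidate morphologies grows as the product of these counts.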
![Page 13: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/13.jpg)
So the task is a reality check for discussions of language learning
![Page 14: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/14.jpg)
Ideally
We would like to pose the problem of grammar-selection as an optimization problem, and cut our task into two parts:
1. Specification of the objective function to be optimized, and
2. Development of practical search techniques to find optima in reasonable time.
![Page 15: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/15.jpg)
Current status
Linguistica: a C++ Windows-based program available for download at
http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000
Technical discussion in Computational Linguistics (June 2001)
Good results with 5,000 words, very fine-grained results with 500,000 words (corpus length, not lexicon count), especially in European languages.
![Page 16: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/16.jpg)
Today’s talk
1. Specify the task in explicit terms
2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria.
3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics.
4. Morphology assigns a probability distribution over its words.
5. Computing the length of the morphology.
![Page 17: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/17.jpg)
Today’s talk (continued)
6. Results
7. Some work in progress: learning syntax to learn about morphology
![Page 18: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/18.jpg)
Given a text (but no prior knowledge of its language), we want:
List of stems, suffixes, and prefixes
List of signatures.
A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.
Hence, a stem in a corpus has a unique signature.
A signature has a unique set of stems associated with it
![Page 19: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/19.jpg)
Example of signature in English
NULL.ed.ing.s
ask call point
summarizes:
ask asked asking asks
call called calling calls
point pointed pointing points
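The definition of a signature can be sketched in a few lines. This assumes the words have already been analyzed into (stem, suffix) pairs, with the empty suffix written as NULL; the function name `signatures` is illustrative and is not taken from Linguistica.

```python
from collections import defaultdict

def signatures(analyses):
    """Map each signature to the sorted list of stems that carry it.

    `analyses` is an iterable of (stem, suffix) pairs; the empty suffix is "NULL".
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)
    stems_of = defaultdict(list)
    for stem, sufs in suffixes_of.items():
        # A signature is the alphabetized suffix list, joined with dots.
        stems_of[".".join(sorted(sufs))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}

pairs = [(stem, suf)
         for stem in ("ask", "call", "point")
         for suf in ("NULL", "ed", "ing", "s")]
sigs = signatures(pairs)
print(sigs)
```

Because each stem contributes exactly one suffix set, every stem ends up under exactly one signature, which is the uniqueness property stated earlier.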
![Page 20: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/20.jpg)
We would like to characterize the discovery of a signature as an optimization problem
Reasonable tack: formulate the problem in terms of Minimum Description Length (Rissanen, 1989)
![Page 21: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/21.jpg)
Today’s talk
1. Specify the task in explicit terms
2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria.
3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics.
4. Morphology assigns a probability distribution over its words.
5. Computing the length of the morphology.
![Page 22: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/22.jpg)
Minimum Description Length (MDL)
Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)
Work by Michael Brent and Carl de Marcken on word-discovery using MDL in the mid-1990s.
![Page 23: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/23.jpg)
Essence of MDL
If we are given
1. a corpus, and
2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
Then we can compute an over-all measure (“description length”) which we can seek to minimize over the space of all possible analyses.
![Page 24: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/24.jpg)
Description length of a corpus C, given a morphology M
The length, in bits, of the shortest formulation of the morphology expressible on a given Turing machine
+
Optimal compressed length of the corpus, using that morphology.
![Page 25: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/25.jpg)
Probabilistic morphology
To serve this function, the morphology must assign a distribution over the set of words it generates, so that the optimal compressed length of an actual, occurring corpus (the one we’re learning from) is -1 * log of the probability the morphology assigns to it.
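As a rough illustration of the compressed-length term, the sketch below sums -log2 prob(w) over the tokens of a toy corpus. The unigram distribution over whole words is only a stand-in assumption for the distribution a real morphology would assign, and the names are illustrative.

```python
import math
from collections import Counter

def compressed_length_bits(corpus_tokens, prob):
    """Sum of -log2 prob(w) over the tokens of the corpus, in bits."""
    return sum(-math.log2(prob[w]) for w in corpus_tokens)

tokens = "the dog jumps the dogs jump".split()
counts = Counter(tokens)
# Stand-in model: relative frequency of each whole word in the corpus.
unigram = {w: c / len(tokens) for w, c in counts.items()}
bits = compressed_length_bits(tokens, unigram)
print(round(bits, 2))
```

A model that assigns higher probability to the words that actually occur yields a shorter compressed corpus.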
![Page 26: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/26.jpg)
Essence of MDL…
The goodness of the morphology is also measured by how compact the morphology is.
We can measure the compactness of a morphology in information theoretic bits.
![Page 27: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/27.jpg)
How can we measure the compactness of a morphology?
Let’s consider a naïve version of description length: count the number of letters.
This naïve version is nonetheless helpful in seeing the intuition involved.
![Page 28: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/28.jpg)
Naive Minimum Description Length
Corpus:
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total: 61 letters
Analysis:
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing
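The naive letter count is easy to reproduce. The sketch below simply sums word lengths; counting the twelve corpus words letter by letter gives 61, and the analysis comes to 29. The last line assumes that the stem sing must be kept (singing still needs it), so analyzing sing as s+ing only adds a one-letter stem s.

```python
def letters(items):
    """Total number of letters in a list of strings."""
    return sum(len(x) for x in items)

corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

corpus_letters = letters(corpus)
analysis_letters = letters(stems) + letters(suffixes) + letters(unanalyzed)
# Assumption: "sing" stays in the stem list, and a new stem "s" is added.
with_s_plus_ing = analysis_letters + letters(["s"])

print(corpus_letters, analysis_letters, with_s_plus_ing)
```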
![Page 29: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/29.jpg)
Essence of MDL…
The best overall theory of a corpus is the one for which the sum of
-1 * log prob (corpus) + length of the morphology
(that’s the description length) is the smallest.
![Page 30: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/30.jpg)
Essence of MDL…
[Figure: bar chart with a vertical axis from 0 to 700,000, comparing three analyses: “Best analysis”, “Elegant theory that works badly”, and “Baroque theory modeled on data”. Each bar shows two quantities: length of morphology and log prob of corpus.]
![Page 31: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/31.jpg)
Overall logic
Search through morphology space for the morphology which provides the smallest description length.
![Page 32: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/32.jpg)
Brief foreshadowing of our calculation of the length of the morphology
A morphology is composed of three lists: a list of stems, a list of suffixes (say), and a list of ways in which the two can be combined (“signatures”).
Information content of a list =
$$\log(\text{length of list}) + \sum_{\text{each item}} \text{compressed length}(\text{item})$$
![Page 33: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/33.jpg)
Stem list
$$\log(\text{length of list}) + \sum_{\text{each stem } t} \; \sum_{\text{each letter } l_i \text{ in } t} -\log \text{prob}(l_i \mid l_{i-1})$$
![Page 34: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/34.jpg)
Today’s talk
1. Specify the task in explicit terms
2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria.
3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics.
4. Morphology assigns a probability distribution over its words.
5. Computing the length of the morphology.
![Page 35: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/35.jpg)
Bootstrap heuristic
1. Find a method to locate likely places to cut a word.
2. Allow no more than 1 cut per word (i.e., maximum of 2 morphemes).
3. Assume this is stem + suffix.
4. Associate with each stem an alphabetized list of its suffixes; call this its signature.
5. Accept only those word analyses associated with robust signatures…
![Page 36: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/36.jpg)
…where a robust signature is one with a minimum of 5 stems (and at least two suffixes).
Robust signatures are pieces of secure structure.
![Page 37: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/37.jpg)
Heuristic to find likely cuts…
Best is a modification of a good idea of Zellig Harris (1955):
Current variant:
Cut words at certain peaks of successor frequency.
Problems: can over-cut; can under-cut; and can put cuts too far to the right (“aborti-” problem). [Not a problem!]
![Page 38: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/38.jpg)
Successor frequency
g o v e r n
Empirically, only one letter follows “gover”: “n”
![Page 39: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/39.jpg)
Successor frequency
g o v e r n
Empirically, 6 letters follow “govern”: m, i, o, s, e, # (# marks the end of a word)
![Page 40: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/40.jpg)
Successor frequency
g o v e r n m
Empirically, 1 letter follows “governm”: “e”
g o v e r 1 n 6 m 1 e
The peak of successor frequency (6) comes after “govern”.
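Successor frequency can be computed directly from a word list. The sketch below is a minimal version under two assumptions: the end of a word counts as a successor symbol, written #, and the six-word lexicon is a hypothetical one chosen so that the same six symbols follow “govern”.

```python
from collections import defaultdict

def successor_sets(words):
    """Map each prefix to the set of symbols that follow it; '#' marks end of word."""
    followers = defaultdict(set)
    for w in words:
        marked = w + "#"
        for i in range(1, len(marked)):
            followers[marked[:i]].add(marked[i])
    return followers

# Hypothetical toy lexicon.
lexicon = ["govern", "governs", "governed", "governing", "government", "governor"]
sf = successor_sets(lexicon)

word = "government"
profile = [len(sf[word[:i]]) for i in range(1, len(word) + 1)]
print(sorted(sf["govern"]))
print(profile)
```

The profile is flat at 1 except for a single peak of 6 after “govern”, which is where the heuristic proposes a cut.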
![Page 41: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/41.jpg)
Lots of errors…
c 9 o 18 n 11 s 6 e 4 r 1 v 2 a 1 t 1 i 2 v 1 e 1 s
Peaks of successor frequency: after “co” (18): wrong; after “conserv” (2): right; after “conservati” (2): wrong
![Page 42: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/42.jpg)
Even so…
We set conditions:
Accept cuts with stems at least 5 letters in length;
Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment)
Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems.
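These conditions can be sketched as a small filter. This is an illustration under stated assumptions, not Linguistica's code: a cut is accepted at the first position where the successor frequency pattern is 1, N > 1, 1; a word that is itself a discovered stem is given the suffix NULL; and the lexicon is a hypothetical toy one. All function names are illustrative.

```python
from collections import defaultdict

MIN_STEM_LEN = 5       # stems at least 5 letters long
MIN_STEMS_PER_SIG = 5  # signatures need at least 5 stems

def successor_counts(words):
    """Number of distinct symbols following each prefix ('#' = end of word)."""
    followers = defaultdict(set)
    for w in words:
        marked = w + "#"
        for i in range(1, len(marked)):
            followers[marked[:i]].add(marked[i])
    return {prefix: len(s) for prefix, s in followers.items()}

def find_cut(word, sf):
    """First position i whose successor frequencies form a clear peak 1, N, 1."""
    for i in range(MIN_STEM_LEN, len(word)):
        before = sf.get(word[:i - 1], 0)
        here = sf.get(word[:i], 0)
        after = sf.get(word[:i + 1], 0)
        if before == 1 and here > 1 and after == 1:
            return i
    return None

def bootstrap(words):
    words = set(words)
    sf = successor_counts(words)
    suffixes_of = defaultdict(set)
    for w in words:
        cut = find_cut(w, sf)
        if cut is not None:
            suffixes_of[w[:cut]].add(w[cut:])
    for stem in suffixes_of:
        if stem in words:
            suffixes_of[stem].add("NULL")  # assumption: bare stem = stem + NULL
    stems_of = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        if len(sufs) >= 2:
            stems_of[".".join(sorted(sufs))].add(stem)
    return {sig: stems for sig, stems in stems_of.items()
            if len(stems) >= MIN_STEMS_PER_SIG}

toy_stems = ["govern", "absorb", "depart", "extend", "inform"]
lexicon = [s + suf for s in toy_stems for suf in ("", "ed", "ing", "s")] + ["the"]
robust = bootstrap(lexicon)
print(robust)
```

With only four of the five stems in the lexicon, the signature would fall below the threshold and nothing would be accepted.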
![Page 43: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/43.jpg)
Words->SuccessorFreq1(GetStems_Suffixed(), GetSuffixes(), GetSignatures(), SF1);
CheckSignatures();
ExtendKnownStemsToKnownSuffixes();
TakeSignaturesFindStems();
ExtendKnownStemsToKnownSuffixes();
FromStemsFindSuffixes();
ExtendKnownStemsToKnownSuffixes();
LooseFit();
CheckSignatures();
![Page 44: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/44.jpg)
2. Incremental heuristics
Enormous amount of detail being skipped…let’s look at one simple case:
Loose fit: suffixes and signatures to split:
Collect any string that precedes a known suffix.
Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis.
![Page 45: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/45.jpg)
Using MDL to judge a potential stem and potential signature
Suppose we find: act, acted, action, acts.
We have the suffixes NULL, ed, ion, and s, but not the signature NULL.ed.ion.s
Let’s compute cost versus savings of signature NULL.ed.ion.s
![Page 46: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/46.jpg)
Savings
Stem savings: 3 copies of the stem act: that’s 3 x 3 = 9 letters = 40.5 bits (taking 4.5 bits/letter).
Suffix savings: ed, ion, s: 6 letters, another 27 bits.
Total of 67.5 bits.
![Page 47: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/47.jpg)
Cost of NULL.ed.ion.s
A pointer to each suffix:
$$\log\frac{[W]}{[\text{NULL}]} + \log\frac{[W]}{[\text{ed}]} + \log\frac{[W]}{[\text{ion}]} + \log\frac{[W]}{[\text{s}]}$$
To give a feel for this: $\log\frac{[W]}{[\text{ed}]} \approx 5$
Total cost of suffix list: about 30 bits.
Cost of pointer to signature: total cost is
$$\log\frac{[W]}{[\#\ \text{stems that use this sig}]} \approx 13 \text{ bits}$$
(all the stems using it chip in to pay for its cost, though).
![Page 48: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/48.jpg)
Cost of signature: about 43 bits
Savings: about 67 bits
Slight worsening in the compressed length of these 4 words.
So MDL says: Do it! Analyze the words as stem + suffix.
Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
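The arithmetic of this decision can be written out as a short sketch. It uses the figures quoted in the talk (4.5 bits per letter, about 30 bits for the suffix pointers, about 13 bits for the signature pointer) and, as a simplifying assumption, ignores the slight worsening in the compressed length of the four words.

```python
BITS_PER_LETTER = 4.5  # figure quoted in the talk

def letters_to_bits(n_letters):
    return n_letters * BITS_PER_LETTER

stem = "act"
suffixes = ["ed", "ion", "s"]
extra_stem_copies = 3  # acted, action, acts no longer each spell out "act"

stem_savings = letters_to_bits(extra_stem_copies * len(stem))
suffix_savings = letters_to_bits(sum(len(s) for s in suffixes))
savings = stem_savings + suffix_savings

suffix_pointer_cost = 30.0     # "about 30 bits"
signature_pointer_cost = 13.0  # "about 13 bits"
cost = suffix_pointer_cost + signature_pointer_cost

net = savings - cost
accept = net > 0  # positive net saving: adopt the stem + suffix analysis
print(savings, cost, net, accept)
```

If one of the suffixes did not already exist, its letters would have to be added to the cost side, shrinking or reversing the net saving.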
![Page 49: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/49.jpg)
Today’s talk
1. Specify the task in explicit terms
2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria.
3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics.
4. Morphology assigns a probability distribution over its words.
5. Computing the length of the morphology.
![Page 50: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/50.jpg)
Frequency of analyzed word
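$$\text{Freq}(T+F) = \text{Freq}(\sigma) \cdot \text{Freq}(T \mid \sigma) \cdot \text{Freq}(F \mid \sigma) = \frac{[\sigma]}{[W]} \cdot \frac{[T]}{[\sigma]} \cdot \frac{[F \text{ in } \sigma]}{[\sigma]}$$
W is analyzed as belonging to signature σ, with stem T and suffix F.
[x] means the count of x’s in the corpus (token count), where [W] is the total number of words.
Actually what we care about is the log of this: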
![Page 51: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/51.jpg)
$$\text{Compressed length}(\text{word} = T+F) = -\log \text{freq}(T) - \log \text{freq}(F \mid \sigma) = \log\frac{[W]}{[T]} + \log\frac{[\sigma]}{[F \text{ in } \sigma]}$$
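A small numeric check may help here. The sketch assumes the word frequency is the product of the signature's frequency, the stem's frequency within the signature, and the suffix's frequency within the signature, so that the compressed length is log2([W]/[T]) + log2([σ]/[F in σ]). The token counts are hypothetical.

```python
import math

def word_frequency(n_words, n_stem, n_sig, n_suffix_in_sig):
    """Freq(sig) * Freq(stem | sig) * Freq(suffix | sig), from token counts."""
    return (n_sig / n_words) * (n_stem / n_sig) * (n_suffix_in_sig / n_sig)

def compressed_word_length(n_words, n_stem, n_sig, n_suffix_in_sig):
    """Bits needed for one word analyzed as stem + suffix within a signature."""
    return math.log2(n_words / n_stem) + math.log2(n_sig / n_suffix_in_sig)

# Hypothetical token counts: corpus size, stem, signature, suffix within signature.
W, T, SIG, F_IN_SIG = 100_000, 50, 2_000, 400
freq = word_frequency(W, T, SIG, F_IN_SIG)
bits = compressed_word_length(W, T, SIG, F_IN_SIG)
print(freq, round(bits, 3))
```

The signature count cancels between the first two factors, which is why only two log terms remain.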
![Page 52: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/52.jpg)
Today’s talk
1. Specify the task in explicit terms
2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria.
3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics.
4. Morphology assigns a probability distribution over its words.
5. Computing the length of the morphology.
![Page 53: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/53.jpg)
The length of a morphology
A morphology is a set of 3 things:
A list of stems;
A list of suffixes;
A list of signatures with the associated stems.
We’ll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.
![Page 54: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/54.jpg)
Length of a list
A header telling us how long the list is, of length (roughly) log2 N, where N is the length.
N entries. What’s in an entry?
Raw lists: a list of strings of letters, where the length of each letter is log2 (26), the information content of a letter (we can use a more accurate conditional probability).
Pointer lists: a list of pointers to the entries.
Someday: the information contained in the meaning of each morpheme
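For a raw list, the length described above can be sketched as a header of log2 N bits plus log2(26) bits per letter. This uses the uniform per-letter cost rather than the more accurate conditional probability, and the function name is illustrative.

```python
import math

def raw_list_bits(entries, alphabet_size=26):
    """Header of log2(N) bits plus log2(alphabet_size) bits per letter.

    Assumes a non-empty list and a uniform cost per letter.
    """
    header = math.log2(len(entries))
    per_letter = math.log2(alphabet_size)
    return header + sum(len(e) for e in entries) * per_letter

suffix_list = ["ed", "s", "ing", "ion", "able"]
bits = raw_list_bits(suffix_list)
print(round(bits, 2))
```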
![Page 55: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/55.jpg)
Connections across lists
Raw suffix list: ed s ing ion able …
Signature 1: Suffixes:
• pointer to “ing”
• pointer to “ed”
Signature 2: Suffixes:
• pointer to “ing”
• pointer to “ion”
The length of each pointer is
$$\log_2 \frac{\#\ \text{suffixed words}}{\#\ \text{occurrences of this suffix}}$$
(usually cheaper than the letters themselves)
![Page 56: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/56.jpg)
The fact that a pointer to a symbol has a length that shrinks as its frequency grows (it is the log of the inverse frequency) is the key:
We want the shortest overall grammar; so
That means maximizing the re-use of units (stems, affixes, signatures, etc.)
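The effect of re-use on pointer length can be seen with a few hypothetical counts. The sketch compares the pointer length log2(total / count) for each suffix with the cost of spelling the suffix out at log2(26) bits per letter; the counts are invented for illustration.

```python
import math

def pointer_bits(count, total):
    """Length of a pointer to an item used `count` times out of `total`."""
    return math.log2(total / count)

# Hypothetical token counts of suffixed words, by suffix.
suffix_counts = {"s": 4000, "ed": 2000, "ing": 1500, "ion": 400, "able": 100}
total = sum(suffix_counts.values())

pointers = {f: pointer_bits(c, total) for f, c in suffix_counts.items()}
spelled_out = {f: len(f) * math.log2(26) for f in suffix_counts}

for f in suffix_counts:
    print(f, round(pointers[f], 2), round(spelled_out[f], 2))
```

The more often a unit is re-used, the shorter its pointer, so a grammar that re-uses stems, affixes, and signatures is a shorter grammar.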
![Page 57: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/57.jpg)
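$$\text{Suffix list:} \quad \sum_{f \in \text{Suffixes}} \left( \lambda \cdot |f| + \log\frac{[W_A]}{[f]} \right)$$
$$\text{Stem list:} \quad \sum_{t \in \text{Stems}} \left( \lambda \cdot |t| + \log\frac{[W]}{[t]} \right)$$
In each sum, the first term (number of letters, at λ bits per letter) pays for the letters, and the log term pays for the structure.
+ Signatures, which we’ll get to shortly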
![Page 58: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/58.jpg)
Information contained in the Signature component
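$$\sum_{\sigma \in \text{Signatures}} \log\frac{[W]}{[\sigma]} \quad \text{(list of pointers to signatures)}$$
$$+ \sum_{\sigma \in \text{Signatures}} \left( \log \langle \text{stems}(\sigma) \rangle + \log \langle \text{suffixes}(\sigma) \rangle \right)$$
$$+ \sum_{\sigma \in \text{Sigs}} \left( \sum_{t \in \text{Stems}(\sigma)} \log\frac{[W]}{[t]} + \sum_{f \in \text{Suffixes}(\sigma)} \log\frac{[\sigma]}{[f \text{ in } \sigma]} \right)$$
<X> indicates the number of distinct elements in X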
![Page 59: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/59.jpg)
Repair heuristics: using MDL
We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths.
Original morphology + compressed data <> Revised morphology + compressed data
![Page 60: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/60.jpg)
But it’s better to have a more thoughtful approach.
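Let’s define $\Delta x = \log \frac{x_{\text{state 1}}}{x_{\text{state 2}}}$
Then the size of the punctuation for the 3 lists is:
(i) $\log \langle \text{Suffixes} \rangle + \log \langle \text{Stems} \rangle + \log \langle \text{Signatures} \rangle$
Then the change of the size of the punctuation in the lists:
$\Delta \langle \text{Suffixes} \rangle + \Delta \langle \text{Stems} \rangle + \Delta \langle \text{Signatures} \rangle$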
![Page 61: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/61.jpg)
Size of the suffix component, remember:
$$\text{(ii) Suffix list:} \quad \sum_{f \in \text{Suffixes}} \left( \lambda \cdot |f| + \log\frac{[W_A]}{[f]} \right)$$
(λ is the number of bits per letter.)
Change in its size when we consider a modification to the morphology:
1. Global effects of change of number of suffixes;
2. Effects on change of size of suffixes in both states;
3. Suffixes present only in state 1;
4. Suffixes present only in state 2.
![Page 62: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/62.jpg)
Suffix component change:
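$$\Delta W_A \cdot \langle \text{Suffixes}_{(1,2)} \rangle \;-\; \sum_{f \in \text{Suffixes}_{(1,2)}} \Delta f \;+\; \sum_{f \in \text{Suffixes}_{(1,\sim 2)}} \left( \lambda \cdot |f| + \log\frac{[W_A]_1}{[f]_1} \right) \;-\; \sum_{f \in \text{Suffixes}_{(\sim 1,2)}} \left( \lambda \cdot |f| + \log\frac{[W_A]_2}{[f]_2} \right)$$
The four terms are, in order:
Global effect of change on all suffixes
Suffixes whose counts change
Contribution of suffixes that appear only in State 1
Contribution of suffixes that appear only in State 2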
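One way to sanity-check such a decomposition is to compare it with the brute-force difference between the two states. The sketch below rests on several assumptions: the suffix list costs λ·|f| + log2([W_A]/[f]) per suffix, λ is taken to be log2(26), the change is measured as state 1 minus state 2 (matching Δx = log(x in state 1 / x in state 2)), and the counts are hypothetical.

```python
import math

LAMBDA = math.log2(26)  # assumed bits per letter

def suffix_list_bits(counts, n_analyzed):
    """Sum over suffixes f of LAMBDA * |f| + log2([W_A] / [f])."""
    return sum(LAMBDA * len(f) + math.log2(n_analyzed / c)
               for f, c in counts.items())

def delta(x1, x2):
    return math.log2(x1 / x2)

def suffix_component_change(counts1, wa1, counts2, wa2):
    both = counts1.keys() & counts2.keys()
    only1 = counts1.keys() - counts2.keys()
    only2 = counts2.keys() - counts1.keys()
    global_effect = delta(wa1, wa2) * len(both)
    count_changes = -sum(delta(counts1[f], counts2[f]) for f in both)
    state1_only = sum(LAMBDA * len(f) + math.log2(wa1 / counts1[f]) for f in only1)
    state2_only = -sum(LAMBDA * len(f) + math.log2(wa2 / counts2[f]) for f in only2)
    return global_effect + count_changes + state1_only + state2_only

# Hypothetical states: state 2 adds the suffix "ion" and shifts two counts.
state1 = {"ed": 200, "s": 500, "ing": 300}
state2 = {"ed": 210, "s": 505, "ing": 300, "ion": 40}
wa1, wa2 = sum(state1.values()), sum(state2.values())

direct = suffix_list_bits(state1, wa1) - suffix_list_bits(state2, wa2)
decomposed = suffix_component_change(state1, wa1, state2, wa2)
print(round(direct, 6), round(decomposed, 6))
```

The point of the decomposition is that only the suffixes touched by a change need to be revisited, instead of recomputing the whole description length.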
![Page 63: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/63.jpg)
Digression on entropy, MDL, and morphology
Why using MDL is closely related to measuring the complexity of the space of possible vocabularies
You better save this for another day, John – you’ve only got 15 minutes left.
![Page 64: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/64.jpg)
Today’s talk (continued)
6. Results
7. Some work in progress: learning syntax to learn about morphology
![Page 65: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/65.jpg)
How good?
In practice, on a large naturally-occurring corpus of a European language: precision and recall in the low 80s (percent).
Precision: proportion of predicted cuts that are correct
Recall: proportion of actual cuts that are predicted.
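These two measures can be computed over cut positions as in the sketch below. The `+` notation for segmented words, the function names, and the four example words are assumptions made for illustration; they are not the evaluation data behind the reported figures.

```python
def cut_set(segmented_words):
    """Turn strings like 'govern+ment' into a set of (word, cut position) pairs."""
    cuts = set()
    for seg in segmented_words:
        word = seg.replace("+", "")
        pos = 0
        for piece in seg.split("+")[:-1]:
            pos += len(piece)
            cuts.add((word, pos))
    return cuts

def precision_recall(predicted, gold):
    """Precision and recall of predicted cuts; assumes both sets are non-empty."""
    p, g = cut_set(predicted), cut_set(gold)
    correct = len(p & g)
    return correct / len(p), correct / len(g)

predicted = ["govern+ment", "conservati+ve+s", "ask+ed", "the"]
gold = ["govern+ment", "conserv+ative+s", "ask+ed", "the"]
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```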
![Page 66: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/66.jpg)
These numbers go as high as 98% if we use an artificial corpus with all of the inflected forms of a word.
![Page 67: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/67.jpg)
Real-life challenges include: alumnus, Johnson, Acheson, Adrianople, adenomas, Adirondacks, Abolition, Los Angeles
![Page 68: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/68.jpg)
Today’s talk (continued)
6. Results
7. Some work in progress: learning syntax to learn about morphology
![Page 69: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/69.jpg)
Current research projects
1. Allomorphy: Automatic discovery of relationship between stems (lov~love, win~winn)
2. Use of syntax (automatic learning of syntactic categories)
3. Rich morphology: other languages (e.g., Swahili), other sub-languages (e.g., biochemistry sub-language) where the mean # morphemes/word is much higher
4. Ordering of morphemes
![Page 70: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/70.jpg)
Allomorphy: Automatic discovery of relationship between stems
Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures.
E.g., delete stem-final -e in English before suffixes -ing, -ed, -ion (etc.).
![Page 71: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/71.jpg)
Automatic learning of syntactic categories
Work in progress with Misha Belkin.
Finding the eigenvector decomposition of a graph that represents word neighbors.
Using eigenvectors of the bigram graph to infer morpheme identity. With Mikhail Belkin. Proceedings of the Morphology/Phonology Learning Workshop of ACL-02. Association for Computational Linguistics.
![Page 72: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/72.jpg)
Disambiguating morphs?
Automatic learning of morphology can provide us with a signature associated with a given stem:
Signature = alphabetized list of affixes associated with a given stem in a corpus.
![Page 73: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/73.jpg)
For example:
Signature NULL.ed.ing.s: aid, ask, call, claim, help, kick
Signature NULL.ed.ing: add, assist, attend, consider
Signature NULL.s: achievement, acre, action, administrator, affair
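As a rough illustration of the definition, signatures can be computed from a list of stem/suffix analyses. The input format (hypothetical `(stem, suffix)` pairs, with the empty suffix written `NULL`) and the function name `signatures` are assumptions made for this sketch; how the analyses themselves are found is a separate problem.

```python
from collections import defaultdict


def signatures(analyses):
    """Group stems by signature.

    `analyses` is an iterable of (stem, suffix) pairs; the empty suffix is
    written "NULL". The input format is an assumption for this sketch.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        # A signature is the alphabetized list of suffixes, joined with ".".
        # Plain sorting puts the upper-case "NULL" before lower-case suffixes.
        signature = ".".join(sorted(suffixes))
        stems_of[signature].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}


pairs = [
    ("aid", "NULL"), ("aid", "ed"), ("aid", "ing"), ("aid", "s"),
    ("add", "NULL"), ("add", "ed"), ("add", "ing"),
    ("acre", "NULL"), ("acre", "s"),
]
by_signature = signatures(pairs)
```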
![Page 74: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/74.jpg)
The signature NULL.ed.ing is much more clearly a subsignature of NULL.ed.ing.s than NULL.s is, because of the ambiguity of -s (noun, verb).
![Page 75: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/75.jpg)
How can we determine whether a given morph (“ed”, “s”) represents more than 1 morpheme?
I don’t think that we can do this on the basis of morphological information.
![Page 76: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/76.jpg)
Goal: find a way of describing syntactic behavior in a way that is dependent only on a corpus.
That is, in a fashion that is language-independent but corpus-dependent – though the global structure that is induced from 2 corpora from the same language will be very similar.
![Page 77: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/77.jpg)
[Figure: French. Labeled regions: feminine singular nouns, plural nouns, finite verbs.]
![Page 78: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/78.jpg)
With such a method…
We can look at words formed with the “same” suffix, putting words into buckets based on the signature their stem is in:
Bucket 1 (NULL.ed.ing.s): aided, asked, called
Bucket 2 (NULL.ed.ing): added, assisted, attended.
Q: do the average positions from each of the buckets form a tight cluster?
![Page 79: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/79.jpg)
If the average locations of the buckets of –ed words form a tight cluster, then –ed is not ambiguous.
If the average locations of the buckets (from distinct signatures) do not form a tight cluster, the morpheme is not the same across signatures.
![Page 80: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/80.jpg)
Method
Not a clustering method; neither top-down nor bottom-up.
Two-step procedure:
1. Construct a nearest-neighbor graph.
2. Reduce the graph to two dimensions by means of eigenvector decomposition.
![Page 81: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/81.jpg)
Nearest neighbors
Following a long list of researchers: we begin by assuming that a word W’s distribution can be described by a vector L describing all of its left-hand neighbors and a vector R describing all of its right-hand neighbors.
![Page 82: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/82.jpg)
V = size of the corpus’s vocabulary.
$L_w$, $R_w$ are vectors that live in $\mathbb{R}^V$.
If the vocabulary is ordered alphabetically, then
$L_w = (4, 0, 0, 0, \dots)$
where the successive entries are the number of occurrences of “a” before w, of “abandoned” before w, of “abatuna” before w, and so on.
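A minimal sketch of how such left-neighbor vectors might be built from a tokenized corpus. The function name `left_neighbor_vectors` and the toy sentence are assumptions for illustration; a real corpus would need sparse storage rather than dense lists of length V.

```python
from collections import Counter


def left_neighbor_vectors(tokens):
    """Return (vocab, vectors): vectors[w][i] counts how often vocab[i]
    occurs immediately before w. Dense lists are used only for clarity."""
    vocab = sorted(set(tokens))  # alphabetical order, as on the slide
    counts = {w: Counter() for w in vocab}
    for previous, word in zip(tokens, tokens[1:]):
        counts[word][previous] += 1
    vectors = {w: [counts[w][v] for v in vocab] for w in vocab}
    return vocab, vectors


tokens = "a dog saw a cat and a dog".split()
vocab, L = left_neighbor_vectors(tokens)
# vocab is ['a', 'and', 'cat', 'dog', 'saw'];
# L['dog'] is [2, 0, 0, 0, 0] because "a" precedes "dog" twice.
```

The right-hand vectors R can be built the same way by swapping the roles of `previous` and `word`.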
![Page 83: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/83.jpg)
Similarity of syntactic behavior is modeled as closeness of L-vectors
…where “closeness” of 2 vectors is modeled as the angle between them.
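$$\cos(v, w) = \frac{v \cdot w}{\|v\|_2 \, \|w\|_2} = \frac{\sum_{i=1}^{V} v_i w_i}{\sqrt{\sum_{i=1}^{V} v_i^2}\,\sqrt{\sum_{i=1}^{V} w_i^2}}$$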
![Page 84: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/84.jpg)
Construct a (non-directed) graph:
Its vertices are the words W in V.
For each word W:
Pick the K most-similar words (K = 20, 50) (by angle of L-vector).
Add an edge to the graph connecting W to each of those words.
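The graph construction above could be sketched as follows, using the cosine of the angle between L-vectors as the similarity. The names `cosine` and `knn_graph`, the brute-force all-pairs comparison, and the toy vectors are assumptions made for the example; with a real vocabulary one would want something faster than comparing every pair.

```python
import math


def cosine(v, w):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k):
    """Undirected graph: each word is linked to its k most similar words.

    Returns a set of frozenset({w, u}) edges. Because edges are undirected,
    a word can end up with more than k neighbors.
    """
    words = sorted(vectors)
    edges = set()
    for w in words:
        others = [u for u in words if u != w]
        others.sort(key=lambda u: cosine(vectors[w], vectors[u]), reverse=True)
        for u in others[:k]:
            edges.add(frozenset((w, u)))
    return edges


toy = {"cat": [2, 0, 1], "dog": [4, 0, 2], "ran": [0, 3, 0], "sat": [0, 5, 1]}
edges = knn_graph(toy, k=1)
```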
![Page 85: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/85.jpg)
Canonical matrix representation of a graph:
M(i,j) = 1 iff there is an edge connecting $w_i$ and $w_j$, that is, iff $w_i$ and $w_j$ are similar words as regards how they interact with the word immediately to the left.
![Page 86: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/86.jpg)
Where is this matrix M?
It’s a point in a space of size V(V-1)/2. Not very helpful, really.
How can we optimally reduce it to a space of small dimension?
Find the eigenvectors of the normalized laplacian of the graph.
See Chung, Malik and Shi, Belkin and Niyogi…
![Page 87: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/87.jpg)
A graph and its matrix M
The degree of a vertex (= word) is the number of edges adjacent (linked) to it.
Notice that this is not fixed across words.
The degree of vertex vi is the sum of the entries of the ith row.
![Page 88: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/88.jpg)
The laplacian of the graph
Let D be the V × V diagonal matrix such that diagonal entry D(i,i) = degree of $v_i$.
D – M is the Laplacian of the graph.
Its rows sum to 0.
![Page 89: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/89.jpg)
Normalized laplacian:
For each i, divide all entries in the ith row by √d(i).
For each i, divide all entries in the ith column by √d(i).
Result: diagonal elements are all 1. Generally, for i ≠ j, the (i,j) entry is
$$-\frac{M(i,j)}{\sqrt{d(i)\,d(j)}}$$
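A small sketch of both constructions, the Laplacian D – M and its normalized form, on a dense adjacency matrix. The function name `laplacians` and the three-vertex path graph are assumptions for illustration; the code also assumes no self-loops and no isolated vertices, so that every degree is positive.

```python
import math


def laplacians(M):
    """Return (degree, laplacian, normalized_laplacian) for a symmetric 0/1
    adjacency matrix M given as a list of rows.

    Assumes M has a zero diagonal and every vertex has at least one edge.
    """
    n = len(M)
    # The degree of vertex i is the sum of the entries of the ith row.
    degree = [sum(row) for row in M]
    # Laplacian: D - M, with D the diagonal matrix of degrees.
    lap = [[(degree[i] if i == j else 0) - M[i][j] for j in range(n)]
           for i in range(n)]
    # Normalized: divide row i and column j by sqrt(d(i)) and sqrt(d(j)).
    normalized = [[lap[i][j] / math.sqrt(degree[i] * degree[j])
                   for j in range(n)] for i in range(n)]
    return degree, lap, normalized


path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
degree, lap, normalized = laplacians(path)
```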
![Page 90: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/90.jpg)
Eigenvector decomposition
The eigenvectors form a spectrum, ranked by the value of their eigenvalues.
Eigenvalues run from 0 to 2 (L is positive semi-definite).
The eigenvector with eigenvalue 0 reflects each word’s frequency.
But the next smallest gives us a good representation of the words…
![Page 91: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/91.jpg)
…in the sense that the values associated with each word show how close the words are in the original graph.
We can graph the first two eigenvectors of the Left (or Right) graph: each word is located at the coordinates corresponding to it in the eigenvector(s):
![Page 92: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/92.jpg)
[Figure: Spanish (left). Labeled regions: masculine plurals, feminine plurals, finite verbs, feminine singular nouns, masculine singular nouns, past participles.]
![Page 93: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/93.jpg)
[Figure: German (left). Labeled regions: neuter singular nouns; names of places; feminine singular nouns; numbers, centuries.]
![Page 94: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/94.jpg)
[Figure: English (right). Labels: prepositions, + “to”, + of, nouns, modals.]
![Page 95: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/95.jpg)
[Figure: English (left). Labels: infinitives, the + modals, past verbs.]
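The reduction step behind plots of this kind can be sketched without any library: take the normalized Laplacian of the graph and find the eigenvectors with the smallest nonzero eigenvalues. The sketch below is not the talk's implementation. It uses power iteration with deflation on 2I minus the normalized Laplacian (a swapped-in technique chosen to keep the example dependency-free), and the name `embed`, the iteration count, and the toy graph are all assumptions. It presumes a connected graph with no self-loops, more vertices than requested dimensions, and distinct eigenvalues for the vectors it extracts.

```python
import math


def embed(M, dims=2, iterations=500):
    """Coordinates for each vertex from the eigenvectors of the normalized
    Laplacian with the smallest nonzero eigenvalues.

    Power iteration on (2I - normalized Laplacian), whose largest eigenvalues
    correspond to the Laplacian's smallest ones; found vectors are projected
    out at every step.
    """
    n = len(M)
    degree = [sum(row) for row in M]
    lap = [[(1.0 if i == j else 0.0) - M[i][j] / math.sqrt(degree[i] * degree[j])
            for j in range(n)] for i in range(n)]
    # The eigenvalue-0 eigenvector is proportional to sqrt(degree).
    total = math.sqrt(sum(degree))
    found = [[math.sqrt(d) / total for d in degree]]
    for c in range(dims):
        x = [math.sin(i + c + 1.0) for i in range(n)]  # fixed start vector
        for _ in range(iterations):
            # Multiply by (2I - lap).
            x = [2.0 * x[i] - sum(lap[i][j] * x[j] for j in range(n))
                 for i in range(n)]
            # Remove the components along eigenvectors already found.
            for u in found:
                overlap = sum(a * b for a, b in zip(x, u))
                x = [a - overlap * b for a, b in zip(x, u)]
            size = math.sqrt(sum(a * a for a in x))
            x = [a / size for a in x]
        found.append(x)
    return [tuple(vec[i] for vec in found[1:]) for i in range(n)]


# Two triangles (vertices 0-2 and 3-5) joined by the edge 2-3.
M = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
coords = embed(M)
```

On this toy graph the first coordinate takes one sign on the first triangle and the opposite sign on the second, which is the sense in which nearby vertices in the graph receive nearby coordinates. For a real vocabulary a sparse eigensolver would be the practical choice.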
![Page 96: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/96.jpg)
Results of experiment
If we define the size of the minimal box that includes all of the vocabulary as being 1 by 1, then we find a small ( < 0.10 ) average distance to mean for unambiguous suffixes (e.g., -ed (English), -ait (French) ) – only for them.
![Page 97: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/97.jpg)
Measure
To repeat: we find the “virtual” location of the conflation of all of the stems of a given signature, plus the suffix in question, e.g., NULL.ed.ing_ed
We do this for all signatures containing “ed”
We compute average distance to the mean.
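The measure can be sketched as below. One assumption is made loudly here: the “virtual” location of a bucket is approximated by averaging the 2-D coordinates of the words in it, which may differ from how the conflation was actually computed. The names `bucket_position` and `average_distance_to_mean` and the toy coordinates are hypothetical, and coordinates are taken to be already rescaled so that the whole vocabulary fits in a 1 by 1 box.

```python
import math


def bucket_position(words, coords):
    """Mean 2-D position of a bucket of words (an assumed stand-in for the
    'virtual' location of a signature plus suffix)."""
    xs = [coords[w][0] for w in words]
    ys = [coords[w][1] for w in words]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def average_distance_to_mean(positions):
    """Average Euclidean distance of the bucket positions to their mean."""
    mean_x = sum(x for x, _ in positions) / len(positions)
    mean_y = sum(y for _, y in positions) / len(positions)
    return sum(math.hypot(x - mean_x, y - mean_y)
               for x, y in positions) / len(positions)


# Toy coordinates, invented for the example.
coords = {"aided": (0.10, 0.20), "asked": (0.12, 0.22),
          "added": (0.11, 0.18), "assisted": (0.13, 0.20)}
buckets = {"NULL.ed.ing.s": ["aided", "asked"],
           "NULL.ed.ing": ["added", "assisted"]}
positions = [bucket_position(words, coords) for words in buckets.values()]
spread = average_distance_to_mean(positions)
```

A small `spread` relative to the unit box would then count as evidence that the suffix is the same morpheme across signatures.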
![Page 98: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/98.jpg)
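Expect coherence (average <= 0.10):

| Suffix | LeftGraph | RightGraph |
| --- | --- | --- |
| -ed | 0.050 | 0.054 |
| -ly | 0.032 | 0.100 |
| 's | 0.000 | 0.012 |
| -al | 0.002 | 0.002 |
| -ate | 0.069 | 0.080 |
| -ment | 0.012 | 0.009 |
| -ait | 0.068 | 0.034 |
| -er | 0.055 | 0.071 |
| -a | 0.023 | 0.029 |
| -ant | 0.063 | 0.088 |

Expect little/no coherence (average > 0.10):

| Suffix | LeftGraph | RightGraph |
| --- | --- | --- |
| -s | 0.265 | 0.145 |
| -ing | 0.096 | 0.143 |
| NULL | 0.312 | 0.192 |
| -e | 0.290 | 0.130 |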
![Page 99: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/99.jpg)
Rich morphologies
A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo-nucle-ic, etc.)
Swahili, Hungarian, Turkish, etc.
![Page 100: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/100.jpg)
The End
![Page 101: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/101.jpg)
Appendices
![Page 102: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/102.jpg)
Corpus
Pick a large corpus from a language: 5,000 to 1,000,000 words.
![Page 103: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/103.jpg)
[Diagram: Corpus, Bootstrap heuristic]
Feed it into the “bootstrapping” heuristic...
![Page 104: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/104.jpg)
[Diagram: Corpus, Bootstrap heuristic, Morphology]
Out of which comes a preliminary morphology, which need not be superb.
![Page 105: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/105.jpg)
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics]
Feed it to the incremental heuristics...
![Page 106: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/106.jpg)
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
Out comes a modified morphology.
![Page 107: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/107.jpg)
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
Is the modification an improvement? Ask MDL.
![Page 108: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/108.jpg)
[Diagram: Corpus, Bootstrap heuristic, Morphology, modified morphology, Garbage]
If it is an improvement, replace the morphology...
![Page 109: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/109.jpg)
[Diagram: Corpus, Bootstrap heuristic, incremental heuristics, modified morphology]
Send it back to the incremental heuristics again...
![Page 110: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/110.jpg)
[Diagram: Morphology, incremental heuristics, modified morphology]
Continue until there are no improvements to try.
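The loop walked through in this appendix can be sketched generically. Everything named here is hypothetical: `propose` stands in for the incremental heuristics, `description_length` stands in for the MDL measure, and the toy usage replaces a morphology with a plain integer just to exercise the control flow.

```python
def refine(morphology, propose, description_length):
    """Keep applying proposed modifications while they shorten the
    description length; stop when no proposal is an improvement.

    `propose(morphology)` yields candidate modified morphologies and
    `description_length(morphology)` returns a cost in bits. Both are
    placeholders supplied by the caller.
    """
    best_cost = description_length(morphology)
    improved = True
    while improved:
        improved = False
        for candidate in propose(morphology):
            cost = description_length(candidate)
            if cost < best_cost:
                # An improvement: replace the morphology, discard the old one.
                morphology, best_cost = candidate, cost
                improved = True
                break
    return morphology


# Toy stand-ins: a "morphology" is an integer, and cost is lowest at 3.
result = refine(10, lambda m: [m - 1, m + 1], lambda m: (m - 3) ** 2)
```

The design choice worth noting is that the heuristics only propose; the description length alone decides whether a proposal is kept.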
![Page 111: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/111.jpg)
Consider the space of all words of length L, built from an alphabet of size b.
How many ways are there to build a vocabulary of size N? Call that U(b,L,N).
Clearly,
$$U(b,L,N) = \binom{b^L}{N} = \frac{b^L!}{N!\,(b^L - N)!}$$
![Page 112: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/112.jpg)
Compare that operation (choosing a set of N words of length L, alphabet size b) with the operation of choosing a set of T stems (of length t) and a set of F suffixes (of length f), where t + f = L.
If we take the complexity of each task to be measured by the log of its size, then we’re asking the size of:
$$\log\frac{U(b,L,N)}{U(b,t,T)\,U(b,f,F)} = \log\frac{\binom{b^L}{N}}{\binom{b^t}{T}\binom{b^f}{F}}$$
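To get a feel for the size of this quantity, the log of a binomial coefficient can be computed directly as a sum of logs, without forming the huge factorials. The function names `log2_choose` and `complexity_gap` and the example sizes are assumptions made for illustration; logs are taken base 2 so that the result reads as bits.

```python
import math


def log2_choose(n, k):
    """log2 of the binomial coefficient C(n, k), as a sum over k terms:
    C(n, k) = product over i < k of (n - i) / (i + 1)."""
    return sum(math.log2(n - i) - math.log2(i + 1) for i in range(k))


def complexity_gap(b, L, N, t, T, f, F):
    """log U(b,L,N) - log U(b,t,T) - log U(b,f,F), in bits, with
    U(b, length, count) = C(b**length, count)."""
    return (log2_choose(b ** L, N)
            - log2_choose(b ** t, T)
            - log2_choose(b ** f, F))


# Hypothetical sizes: 1,000 ten-letter words over a 26-letter alphabet,
# versus 100 seven-letter stems and 10 three-letter suffixes.
gap = complexity_gap(b=26, L=10, N=1000, t=7, T=100, f=3, F=10)
```

A positive `gap` means that the space of unanalyzed vocabularies is larger than the space of stem-plus-suffix analyses of the same sizes.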
![Page 113: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/113.jpg)
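$\log\binom{b^L}{N}$ is easy to approximate, however.
$$\begin{aligned}
\log\binom{b^L}{N} &= \log\frac{b^L!}{(b^L - N)!\,N!} \\
&= \log\frac{b^L!}{(b^L - N)!} - \log N! \\
&\approx N\log b^L - \log N! \\
&\approx NL\log b - (N\log N - N) \quad \text{(Stirling)} \\
&= NL\log b - N(\log N - 1)
\end{aligned}$$
Remember: if $a \gg b$ then
$$\frac{a!}{(a-b)!} = \frac{a(a-1)\cdots(a-b+1)\,(a-b)\cdots 1}{(a-b)\cdots 1} \approx a^b$$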
![Page 114: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/114.jpg)
$$\log\binom{b^L}{N} \approx NL\log b - N(\log N - 1)$$
$NL\log b$: the number of bits needed to list all the words (the analysis).
$N(\log N - 1)$: the length of all the pointers to all the words (the compressed corpus).
Thus the log of the number of vocabularies = description length of that vocabulary, in the terms we’ve been using.
![Page 115: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/115.jpg)
That means that the difference in the sizes of the spaces of possible vocabularies is equal to the difference in description length in the two cases. Hence:
Difference between the complexity of the “simplex word” analysis and the complexity of the analyzed-word analysis
= log U(b,L,N) – log U(b,t,T) – log U(b,f,F)
$$\approx \bigl(NL - (Tt + Ff)\bigr)\log b \;-\; \bigl[N(\log N - 1) - T(\log T - 1) - F(\log F - 1)\bigr]$$
First term: difference in size of morphologies.
Second term: difference in size of compressed data.
![Page 116: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/116.jpg)
But we’ve (over)simplified in this case by ignoring the frequencies inherent in real corpora. What’s of great interest in real life is the fact that some suffixes are used often, others rarely, and similarly for stems.
![Page 117: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/117.jpg)
We know something about the distribution of words, but nothing about distribution of stems and especially suffixes.
But suppose we wanted to think about the statistics of vocabulary choice in which words could be selected more than once….
![Page 118: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/118.jpg)
We want to select N words of length L, and the same word can be selected. How many ways of doing this are there?
You can have any number of occurrences of a word, and 2 sets of the same number of them are indistinguishable. How many such vocabularies are there, then?
$$\frac{(b^L)^N}{\prod_{i=1}^{N} (i!)^{Z(i)}}$$
![Page 119: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/119.jpg)
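$$U(b,L,N) = \frac{(b^L)^N}{\prod_{i=1}^{N} (i!)^{Z(i)}}$$
where Z(i) is the number of words of frequency i.
(‘Z’ stands for “Zipf”).
We don’t know much about frequencies of suffixes, but Zipf’s law says that
$$Z(i) = \frac{K}{i}$$
hence for a morpheme set that obeyed the Zipf distribution:
$$\begin{aligned}
\log U(b,L,N) &= NL\log b - \sum_{i=1}^{N} Z(i)\,\log(i!) \\
&\approx NL\log b - \sum_{i=1}^{N} i\,Z(i)\,\log i \\
&= NL\log b - K\sum_{i=1}^{N} \log i
\end{aligned}$$
$K \approx 0.1 \times \text{CorpusSize}$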
![Page 120: Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d615503460f94a43bf6/html5/thumbnails/120.jpg)
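$$\log U(b,L,N) \approx NL\log b - K\sum_{i=1}^{N} \log i$$
Since $\int \ln x\,dx = x\ln x - x + C$,
$$\log U(b,L,N) \approx NL\log b - K'(N\ln N - N)$$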
End of digression