
Proceedings of COLING 2012: Technical Papers, pages 3121–3136, COLING 2012, Mumbai, December 2012.

Long-tail Distributions and Unsupervised Learning of Morphology

Qiuye Zhao(1) Mitch Marcus(1)

(1) University of Pennsylvania
[email protected], [email protected]

Abstract

In previous work on unsupervised learning of morphology, the long-tail pattern in the rank-frequency distribution of words, as well as of morphological units, is usually considered as following Zipf's law (power-law). We argue that these long-tail distributions can also be considered as lognormal. Since we know the conjugate prior distribution for a lognormal likelihood, we propose to generate morphology data from lognormal distributions. When the performance is evaluated by a token-based criterion, which gives more weight to the results of frequent words, the proposed model performs significantly better than the other models in discussion. Moreover, we capture the statistical properties of morphological units with a Bayesian approach, rather than a rule-based approach as studied in (Chan, 2008) and (Zhao and Marcus, 2011). Given the multiplicative property of lognormal distributions, we can directly capture the long-tail distribution of word frequency, without the need of an additional generative process as studied in (Goldwater et al., 2006).

Keywords: Morphological Learning, Zipf's law, Lognormal distribution, Long-tail distribution, Gibbs Sampling, Bayesian approach.


1 Introduction

Unsupervised learning of morphology is an active research area. In this work, we will focus on learning segmentations of words¹. A segmentation of word w can be denoted as w = t.f, which means that concatenating stem t and suffix f gives word w.

Assuming that stems and suffixes are independently distributed, a baseline generative morphology model is as follows,

    k_T, k_F = number of distinct stem types, number of distinct suffix types
    α^T, α^F = constant hyper-parameters
    θ^T, θ^F ∼ Symmetric-Dirichlet_{k_T}(α^T), Symmetric-Dirichlet_{k_F}(α^F)
    t_{i=1...N}, f_{i=1...N} ∼ Multinomial(θ^T), Multinomial(θ^F)
    w_{i=1...N} ∼ I(w = t.f) P(t|θ^T) P(f|θ^F)

where N is the number of words and I(w = t.f) is the indicator function taking on value 1 when concatenating stem t and suffix f gives word w, and 0 otherwise.
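As a concrete illustration of this generative story, the following Python sketch draws words from the baseline model; the stem and suffix inventories, hyper-parameter values and corpus size are hypothetical, chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical inventories of stem and suffix types (illustration only).
    stems = ["laugh", "walk", "analyz", "start"]
    suffixes = ["", "s", "ed", "ing"]
    k_T, k_F = len(stems), len(suffixes)
    alpha_T, alpha_F = 0.5, 0.5        # constant hyper-parameters
    N = 10                             # number of word tokens to generate

    # theta^T, theta^F ~ Symmetric-Dirichlet(alpha^T), Symmetric-Dirichlet(alpha^F)
    theta_T = rng.dirichlet([alpha_T] * k_T)
    theta_F = rng.dirichlet([alpha_F] * k_F)

    # t_i, f_i ~ Multinomial(theta^T), Multinomial(theta^F); w_i = t.f
    for _ in range(N):
        t = stems[rng.choice(k_T, p=theta_T)]
        f = suffixes[rng.choice(k_F, p=theta_F)]
        print(t + f)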

Figure 1: Observing long-tail distributions of word frequency in WSJ Penn Treebank. (a) The rank-frequency plot of words. (b) The log-log rank-frequency plot. (c) The CCDF plot of words.

In the above baseline model, no special statistical property of word frequency is captured, nor of morphological units (e.g. stems or suffixes). As depicted in Figure 1-a, it is well known that given a large corpus, a rank-frequency plot of words generally exhibits a long tail that is distinctively heavier than in normal (Gaussian) distributions. An alternative way to show this long-tail pattern is to plot the corresponding Complementary Cumulative Distribution Function (CCDF) on logarithmic scales. As depicted in Figure 1-c, the CCDF plot behaves like a straight line and is smoother than the corresponding log-log rank-frequency plot in Figure 1-b, which also approaches a straight line for long-tail distributions. When words are segmented into stems and suffixes, the distributions of these units can be examined in the same way.

As we are going to show in Section 2.3, the rank-frequency distributions of morphological units also show up as straight lines in log-log rank-frequency plots and CCDF plots. In previous work on morphological learning, such as (Chan, 2008), (Zhao and Marcus, 2011) and (Goldwater et al., 2006, 2011), these straight lines on logarithmic scales are usually interpreted as following Zipf's law (Zipf, 1949). On the other hand, lognormal distributions with large variance also yield straight lines on both the log-log rank-frequency plot and the CCDF plot.

¹ A segmentation model may not be well-defined for morphology, especially when morphologically rich languages are considered; however, different forms of morphological analyses may be compared as simple segmentation analyses.


In Section 2, we discuss these long-tail distributions in more detail and argue that the signature straight lines may suggest either a power-law or a lognormal distribution.

So as to take advantage of the special statistical property of morphological units, which is considered as following Zipf's law in (Chan, 2008), Chan proposes a rule-based bootstrapping algorithm for morphology learning, which is revised in (Zhao and Marcus, 2011) for acquiring functional elements. Even though Zipf's law is discussed in both works, its specific definition does not really matter in the design of the algorithms. For the sake of comparison with other models, we implement a reduced version of the bootstrapping algorithm as described in (Zhao and Marcus, 2011), eliminating all ad-hoc linguistic assumptions encoded in the original algorithm. With this rule-based algorithm, the acquired segmentation model performs rather well when evaluated with the type-based criterion, but notably badly with the token-based evaluation, which gives more weight to the results of frequent words than the type-based evaluation does. We describe this algorithm in more detail in Section 3.1.

The rule-based bootstrapping algorithm utilizes type frequencies only, no matter what form of input is given. On the other hand, as shown in (Goldwater et al., 2006), when a Dirichlet-multinomial model is assumed, the option of utilizing token frequency in the generative model doesn't help; on the contrary, it hurts the inference of the generative model. Goldwater et al. (2006) argued that this morphology model doesn't capture the special statistical property of word frequency; therefore, an additional generative process is introduced to transform the word frequencies to exhibit the desired distribution. Since the long-tail pattern in word frequency is considered as following Zipf's law, a generalized Chinese restaurant process, the Pitman-Yor process (Pitman and Yor, 1997), is exploited for producing power-law distributions. For the sake of comparison, we re-implement the morphology model in (Goldwater et al., 2006), and conduct more experiments with different configurations.

We propose to compute lognormal likelihood, instead of multinomial likelihood, for both stems and suffixes in the generative morphology model. With this lognormal model, the option of utilizing token frequency, i.e. taking as input the unprocessed text data, does help the inference of the generative model, especially when the token-based evaluation is performed. When evaluated by the token-based criterion, which gives more weight to the results of frequent words, the proposed model performs significantly better than the other models in discussion, no matter what form of input is fed to the multinomial model or the rule-based algorithm.

With the proposed model, the particular statistical properties of stems and suffixes are utilized in a Bayesian model instead of a rule-based model. Furthermore, as we will discuss in Section 2.3, given the multiplicative property of lognormal distributions, the word frequency distribution can also be predicted as lognormal. Therefore, we can directly capture the statistical property of word frequencies without the need of an additional generative process. In particular, the proposed generative model is more accurate with the token-based evaluation when utilizing token frequency, and more accurate with the type-based evaluation when utilizing type frequency. This result pattern suggests that the proposed model is able to adapt to the real data distribution by itself; therefore, we need not concern ourselves with justifying the appearance of type frequencies in morphology learning, as pursued in (Goldwater et al., 2006).

We are going to use Gibbs sampling, a standard Markov Chain Monte Carlo (MCMC) method, for the inference of generative models. In each iteration, the morphological analysis of each word is sequentially re-sampled from its conditional posterior given the morphological analyses of all other words. Since the sampling process is much more complex and time-consuming for the lognormal model than for the multinomial model, we propose to constrain the learning of generative models with the acquisition outputs of the rule-based model. Since the rule-based bootstrapping algorithm takes as input raw corpora only, and so do the generative models, the combination of these two processes results in a totally unsupervised learning process as well. Even though this method is motivated by the concern of training efficiency, the proposed use of acquisition outputs from a rule-based model also significantly improves the performance of the generative models, consistently for both the lognormal and the multinomial model.

2 Long-tail distributions

2.1 The long-tail pattern

Given a large corpus, we can compute the word frequency of each word type by counting its occurrences in the corpus. When we plot word frequency against its rank, as in Figure 1-a, there is a long tail of the curve composed of the large number of words that occur with low frequency. When plotted on a logarithmic scale, as in Figure 1-b, the rank-frequency plot behaves like a straight line. An equivalent form of the rank-frequency approach is to plot the corresponding Complementary Cumulative Distribution Function (CCDF). Instead of plotting a function of rank, we plot P(F ≥ f) as a function of frequency f. As shown in Figure 1-c, the CCDF plot also behaves like a straight line on the logarithmic scale, and is smoother than the log-log rank-frequency plot.
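To make the plotting procedure concrete, a minimal sketch of an empirical CCDF on logarithmic scales follows; the toy token list stands in for a real corpus and is purely illustrative.

    import numpy as np
    import matplotlib.pyplot as plt
    from collections import Counter

    def ccdf(tokens):
        """Empirical P(F >= f) over word-type frequencies."""
        freqs = np.sort(np.array(list(Counter(tokens).values())))
        # fraction of types whose frequency is at least freqs[i]
        p = 1.0 - np.arange(len(freqs)) / len(freqs)
        return freqs, p

    tokens = "the cat sat on the mat the cat ran".split()  # toy corpus
    f, p = ccdf(tokens)
    plt.loglog(f, p, marker="o")  # a straight line suggests a long tail
    plt.xlabel("frequency f")
    plt.ylabel("P(F >= f)")
    plt.show()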

We generally refer to this kind of distribution as a long-tail distribution, observing that a large portion of its population is composed of low-frequency events, which form a longer tail than in normal (Gaussian) distributions. Long-tail patterns have been widely observed in various fields, but they may be studied as different distributions. For example, economists may be familiar with this pattern as the Pareto distribution, which is also known as the '80-20' rule. In Pareto's original study (Pareto, 1896), the long-tail pattern is shown on CCDF plots, so it took a while for people to understand that it is a power-law distribution and is synonymous with Zipf's law (Newman, 2005). Zipf's law was proposed by the linguist George Kingsley Zipf in his study of vocabulary distribution, and is widely used to interpret the straight lines on logarithmic scales in the study of language. In more recent work, the convention of treating long-tail patterns as power-law distributions has been challenged. For example, Downey (2001) argues that many network metrics, such as file sizes and transfer times, should be modeled as lognormal distributions. Lognormal distributions with large variance also yield straight lines on the log-log rank-frequency plot and the CCDF plot.

2.2 Generating Zipf's law and lognormal distributions

Based on the idea of preferential attachment, i.e. a 'richer-get-richer' process, if we generate new word occurrences more likely of popular word types than of rarely seen word types in the previous process, then the word frequency of the generated corpus may follow Zipf's law or be lognormal, depending on subtle differences in the generative processes.

More specifically, suppose that we are given i words for a start, i ≥ 1. Let n_{ik} denote the number of occurrences of all the words that occur exactly k times in the previous i words. Let P(w_{i+1} = k) denote the probability that the (i+1)th occurrence is of a word that has already appeared k times in the previous i words. Consider the following process as described in (Simon, 1955),

    P(w_{i+1} = k) = α n_0 + F_i n_{ik},

where n_0 and α are constants. If F_i = (1−α)/i, then asymptotically, P(w_i = k) will approach a power-law distribution. On the other hand, if the constant term is removed from the above process,

    P(w_{i+1} = k) = F_i n_{ik},

and the F_i are independent and identically distributed variables with finite mean and variance, then asymptotically, P(w_i = k) will approach a lognormal distribution.
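A minimal simulation sketch of Simon's process follows, under the usual preferential-attachment reading: with probability α the next token is a new word type; otherwise it repeats the type of a uniformly chosen earlier token, so that P(w_{i+1} = k) = ((1−α)/i) · n_{ik}. The parameter values are illustrative only; dropping the α-branch and letting the copying rate itself vary randomly corresponds to the lognormal variant described above.

    import numpy as np

    def simon_process(n_steps, alpha, rng):
        """Yule-Simon preferential attachment; frequencies approach a power law."""
        tokens = [0]       # word types encoded as integers
        n_types = 1
        for i in range(1, n_steps):
            if rng.random() < alpha:
                tokens.append(n_types)                  # new word type
                n_types += 1
            else:
                tokens.append(tokens[rng.integers(i)])  # copy an earlier token
        counts = np.bincount(np.array(tokens))
        return np.sort(counts)[::-1]                    # rank-frequency

    rng = np.random.default_rng(1)
    freqs = simon_process(100_000, alpha=0.1, rng=rng)
    print(freqs[:10])  # a few very frequent types and a long tail of rare ones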

An even more naive generative model for Zipf's law is Miller's monkey (Miller, 1957), who can not only type with a keyboard, but also distinguish the space bar from other keys. If Miller's monkey manages to hit the space bar with a constant probability and never hits the space bar twice in a row, then the word frequency in the monkey's output follows a power law. One crucial assumption in Miller's demonstration is that all non-space letters are hit with equal probabilities. However, for the case where any two letters are hit with different probabilities, Perline (1996) argues that for all words of length up to a constant, their rank-frequency distribution converges to a lognormal distribution.

After reviewing a brief history of generative models for power-law and lognormal distributions, Mitzenmacher (2004) suggests that "It might be reasonable to use whichever distribution makes it easier to obtain results."²

Figure 2: Observing long-tail distributions for morphological units. (a) The CCDF plot of stems. (b) The CCDF plot of suffixes. (c) The CCDF plot of verbs.

2.3 Lognormal distributions

One advantage of modeling long-tail patterns as lognormal is that the multiplicative property holds, i.e. the product of independent lognormally-distributed random variables is itself lognormally distributed. For example, consider the baseline generative morphology model described in the introduction, P(w) = P(t)P(f). If both stems and suffixes are lognormally distributed, then word frequency is also lognormally distributed.

So as to examine the distributions of morphological units, we segment all verbs in the WSJ Penn Treebank (Marcus et al., 1993) into stems and suffixes with the help of the gold part-of-speech annotations. As shown in Figure 2-a and b, the signature straight lines on the CCDF plots suggest that both stems and suffixes of verbs can be modeled as lognormal. Then, given the multiplicative property of lognormal distributions, verbs should also be lognormally distributed, which is confirmed by their CCDF plot in Figure 2-c.
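The multiplicative property is easy to verify empirically. The sketch below draws independent lognormal 'stem' and 'suffix' frequencies with arbitrary illustrative parameters and checks that the logarithm of their product is normal with the summed parameters.

    import numpy as np

    rng = np.random.default_rng(2)

    # If log t ~ Normal(1.0, 1.5^2) and log f ~ Normal(0.5, 1.0^2) independently,
    # then log(t*f) = log t + log f is Normal(1.5, 1.5^2 + 1.0^2), i.e. the
    # product t*f is itself lognormal.
    t = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)
    f = rng.lognormal(mean=0.5, sigma=1.0, size=100_000)
    w = t * f

    log_w = np.log(w)
    print(log_w.mean(), log_w.var())  # approx. 1.5 and 1.5^2 + 1.0^2 = 3.25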

Another reason to consider modeling long-tail patterns as lognormal distributions is that we know the conjugate prior distribution for a lognormal likelihood, but not for a power-law likelihood. Considering the generative model P(w) = P(t)P(f), if we assume that t and f are lognormal, their means µ and variances σ² can be drawn from Normal priors and Inverse-Gamma priors respectively.

² As also pointed out in (Mitzenmacher, 2004), if a power-law distribution can have infinite mean and variance, then it is inaccurate to analyze it as lognormal. In the present examples, we assume the exponent of power-law distributions is greater than 0, thus it is safe for us to experiment with either distribution.


3 A rule-based model

Given a set of stems T and a set of suffixes F, we can divide a word w into stem t and suffix f if t ∈ T, f ∈ F and w = t.f. For example, if T = {'laugh-', 'analyz-'} and F = {'-ed', '-s'}, then 'analyzed' can be segmented into 'analyz-' and '-ed', but 'red' won't be segmented. So as to learn such a rule-based morphology model, we need to acquire a set of stems and a set of suffixes.

3.1 A bootstrapping algorithm for acquiring morph units

In this section, we describe a bootstrapping algorithm adapted from the algorithm for acquiring functional elements in (Zhao and Marcus, 2011). This line of algorithms is especially designed to account for the long-tail pattern observed for stems and suffixes, but is not specific to either Zipf's law or lognormal distributions. As stated in (Chan, 2008), in which a bootstrapping algorithm is originally proposed for acquiring transformation-based morphological rules, "what matters, though, is that the number of types per inflection decreases rapidly, ..., there are few highly frequent inflections, and many more infrequent ones."

Following (Zhao and Marcus, 2011), our algorithm is built upon a distributional property of 'functional elements': they occur in diverse contexts. In the case of learning morphological segmentations, the 'functional elements' are suffixes in the context of stems. For example, the inflectional ending '-ed' can be concatenated to most verb stems to derive past tense forms, in contrast to which a nonsense suffix '-roached' can only be seen in a few particular word types.

If all prefixes in all possible divisions of a word count as stems, then as computed from the WSJ corpus, the top three suffixes with the highest contextual diversity are '-s', '-d' and '-e', two of which do not comply with the common sense of morphological suffixes in English. In other words, for acquiring suffixes, we want to compute their contextual diversity according to properly justified stems only. The simplest way of justifying a stem as a proper context for suffixes is to check whether it serves as the context of more than one type of suffix. For example, the stem 'laughin-' should not be justified, because except for the particular suffix '-g', it cannot be concatenated with other suffixes to form legal words.

Given a set of properly justified stems T, we measure the contextual diversity of a suffix f as

    div(f, T) = Σ_{t∈T} δ(t.f),

where δ(t.f) is set to 1 if t.f forms any word, and 0 otherwise. For example, if we are given a set of properly justified stems that includes 'laugh-' but not 'b-', the diversity measure of '-ing' will increase by one for the existence of the word 'laughing', but not for the word 'bing'.
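A direct transcription of this measure into code might look as follows; words is the set of word types observed in the corpus, and the example sets are hypothetical.

    def div(f, T, words):
        """Contextual diversity of suffix f: the number of justified stems
        t in T such that the concatenation t.f forms a word in the corpus."""
        return sum(1 for t in T if t + f in words)

    words = {"laugh", "laughs", "laughed", "laughing", "bing"}
    T = {"laugh"}                # 'b-' is not a justified stem
    print(div("ing", T, words))  # 1: 'laughing' counts, 'bing' does not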

Algorithm 1 The bootstrapping algorithm for acquiring stems and suffixes
Require: A corpus C containing raw text only.
  Initialize set F_0 to be empty and set T_0 to contain all possible prefixes
  for k = 1...K iterations do
    Let F_k contain the top k suffixes with the highest diversities measured by div(f, T_{k−1}).
    Let T_k contain stems that form legal words with suffixes in F_k.
  end for
  return F_K and T_K

We implement a reduced version of the bootstrapping algorithm in (Zhao and Marcus, 2011), eliminating all ad-hoc linguistic assumptions encoded in the original algorithm. As depicted in Algorithm 1, the algorithm generates two sets of acquisition outputs during the bootstrapping process, each of which serves as the justified set for measuring the contextual diversity of the other. As the two sets alternately update during the bootstrapping process, the diversity measurement of either set is expected to become more and more accurate.

The only required input to this algorithm is a corpus C of raw text without any form of annotation. Set F_0 is initialized to be empty and set T_0 is initialized to contain all prefixes in all possible divisions of all words in corpus C. At the kth bootstrapping iteration, k > 0, we compute set F_k as the top k suffixes of the highest contextual diversity according to set T_{k−1}. Set T_k then contains stems that can form legal words with suffixes in F_k. Since the diversity measurement of suffixes varies over iterations with respect to the updated set of stems, a suffix that is selected for output at some iteration is not guaranteed to be selected in the following iterations.
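A minimal runnable sketch of Algorithm 1 follows. It simplifies the original in at least two ways, both our assumptions: empty suffixes are excluded from the diversity ranking, and stems are justified only by forming a legal word with some selected suffix.

    from collections import Counter

    def bootstrap(words, K=20):
        """words: set of word types extracted from raw text."""
        # T0: all prefixes in all possible divisions of all words
        T = {w[:i] for w in words for i in range(1, len(w) + 1)}
        F = set()
        for k in range(1, K + 1):
            # rank non-empty suffixes by contextual diversity w.r.t. current stems
            diversity = Counter()
            for w in words:
                for i in range(1, len(w)):
                    t, f = w[:i], w[i:]
                    if t in T:
                        diversity[f] += 1
            F = {f for f, _ in diversity.most_common(k)}
            # keep stems that form legal words with the selected suffixes
            T = {w[:len(w) - len(f)] for w in words for f in F
                 if w.endswith(f) and len(w) > len(f)}
        return F, T

    words = {"laugh", "laughs", "laughed", "laughing", "walk", "walks", "walked"}
    print(bootstrap(words, K=3))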

kth iter.   set F                                                                           size of set T
1st         -s                                                                              668
5th         -ed, -ing, -s, -e, -es                                                          2274
10th        -ed, -ing, -e, -s, -es, -er, -ers, -ion, -ions, -ly                             2776
20th        ..., -ion, -ers, -y, -ions, -al, -or, -ors, -ings, -able, -ive, -ly, -aly, -ies 2993

Fix set T with all prefixes in all possible divisions of words:
k=20        -s, -d, -e, -ed, -g, -n, -ng, -y, -ing, -t, -r, -es, -er, -on, -l, -rs, -a, -ly, -ion, -0   7091

Table 1: The acquisition outputs by Algorithm 1 over WSJ Penn Treebank.

We run this bootstrapping algorithm to acquire stems and suffixes from the WSJ corpus. For the sake of comparison, we also experiment without updating set T during the bootstrapping. With the unchanged set T, which is initialized to contain all prefixes in all possible divisions of all words in corpus C, the bootstrapping algorithm degenerates to a simple counting function which, for a given k, returns the top k most frequent suffixes. So as to compare with other models more fairly, we didn't implement the mechanisms in the original algorithm for removing complex suffixes such as -ers, -ings and -ors, nor the trick for removing the most noisy suffix -e.

4 A generative model with multinomial likelihood

In this section, we describe a generative morphology model that involves one more random variable than the baseline model we sketched in the introduction. In the baseline generative model, we assume without any condition that both stems and suffixes are independently and multinomially distributed. In the current model, stems and suffixes are independently and multinomially distributed within each inflectional class. A morphological analysis of word w can be denoted as (c, t, f), which means that w = t.f and this analysis belongs to inflectional class c.

Assume a multinomial distribution over k_C inflectional classes, with parameters θ^C. To make predictions about new classes, we take symmetric Dirichlet priors α^C on the parameters θ^C, which means that the way each inflectional class is used has little variation. When k_C is set to 1, this model degenerates to the baseline generative model that assumes stems and suffixes are independently distributed. Again, let N be the number of words and let I(w = t.f) denote the indicator function taking on value 1 when concatenating stem t and suffix f gives word w, and 0 otherwise.


Our generative morphology model is as follows,

    k_C = number of inflectional classes
    α^C, α^T, α^F = constant hyper-parameters
    θ^C ∼ Symmetric-Dirichlet(α^C)
    θ^T_{i=1...k_C}, θ^F_{i=1...k_C} ∼ Symmetric-Dirichlet(α^T), Symmetric-Dirichlet(α^F)
    c_{i=1...N} ∼ Multinomial(θ^C)
    t_{i=1...N}, f_{i=1...N} ∼ Multinomial(θ^T_{c_i}), Multinomial(θ^F_{c_i})
    w_{i=1...N} ∼ I(w_i = t.f) P(c_i = c|θ^C) P(t|c, θ^T) P(f|c, θ^F)

This is the morphology model of choice in (Goldwater et al., 2011), following which we also use a Gibbs sampler, a simple and widely-used Markov Chain Monte Carlo method, for inference. We assume the exchangeability of morphological analyses: the finite set of morphological analyses {a_1, ..., a_N} is exchangeable if, for any permutation π of the integers from 1 to N,

    P(a_1, ..., a_N) = P(a_{π(1)}, ..., a_{π(N)}).

At each iteration, starting from a = {a_1, ..., a_N}, sample a′_1 given the morphological analyses of all other words, i.e. A_{−1} = {a_2, ..., a_N}; then move to {a′_1, a_2, ..., a_N} and so on, until we reach {a′_1, a′_2, ..., a′_N} = a′. It can be shown that this sampling process defines a Markov chain on a, a′, a′′, .... After a sufficient amount of time, the probability values are independent of the starting values and tend towards the stationary distribution P(a). More about Gibbs sampling will be discussed in Section 5.2, where the sampling processes for the two generative models are compared.

Related work

The Dirichlet-multinomial model is able to capture neither the particular statistical property of word frequencies, as studied in (Goldwater et al., 2006), nor the particular statistical property of morphological units, as studied in (Chan, 2008) and (Zhao and Marcus, 2011). As described in Section 3.1, a rule-based bootstrapping algorithm can be designed to take advantage of the long-tail distributions of both stems and suffixes. However, as we will show with experimental results in Section 6, the rule-based model performs notably badly with the token-based evaluation, which gives more weight to the results of frequent words than the type-based evaluation does.

In the two-stage learning framework proposed in (Goldwater et al., 2006), the morphology model introduced above is used as a 'generator' for producing words. Word frequencies are then transformed to exhibit a power law by an additional generative process, called the 'adaptor'. In particular, a two-parameter generalization of the Chinese Restaurant Process (Pitman and Yor, 1997) is used as the adaptor. The Pitman-Yor process implements the principle of preferential attachment and guarantees the exchangeability of its outputs. As we will also show in Section 6, the generator, i.e. the Dirichlet-multinomial model, learns reasonably well from the input of distinct word types, i.e. type-based input; however, the multinomial model itself cannot adapt to the input of unprocessed text data, i.e. token-based input. Augmented with an adaptor, the generator may achieve its best performance with all forms of input; however, the introduction of such an adaptor does not improve the overall performance of the generator. In the following section, we will propose a generative model that learns well from both forms of input, type-based or token-based, without the need of an adaptor. The proposed model is able to utilize token frequency by itself; therefore, we need not concern ourselves with justifying the appearance of type frequency in morphology learning, as pursued in (Goldwater et al., 2006).

5 A generative model with lognormal likelihood

We propose to compute lognormal likelihood, instead of multinomial likelihood, for both stems t and suffixes f. In this way, the particular statistical property of stems and suffixes is utilized in a Bayesian model rather than a rule-based model. Furthermore, as we discussed in Section 2, given the multiplicative property of lognormal distributions, for each inflectional class, the generated word distribution is also lognormal. Therefore, with the proposed model, we can directly capture the statistical property of word frequencies without the need of an additional generative process.

5.1 The probability model

Again, assume that stems and suffixes are independent given inflectional classes. We still have a multinomial distribution over k_C classes, with parameters θ^C, and take symmetric Dirichlet priors α^C on θ^C. Now, in each inflectional class c, we assume a lognormal distribution of frequency for both stems and suffixes. This is equivalent to assuming that the logarithms of stem/suffix frequency are normally distributed over the rank. For example, if the logarithm of stem frequency, L(t), is normally distributed with mean µ^T and variance (σ^T)², then stem frequency t is lognormally distributed with mean e^{µ^T} and variance (σ^T)².

For a random variable X that is normally distributed, if both its mean µ and variance σ² are random, we will use the following distribution as a prior, which can be shown to be conjugate to the normal likelihood. Assume µ_0, γ_0, α, and β are constant hyper-parameters; then we have

    σ² ∼ Inverse-Gamma(α, β)
    µ | σ² ∼ Normal(µ_0, γ_0/σ²)
    x | µ, σ² ∼ Normal(µ, σ²).
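The sketch below demonstrates the conjugate update for this prior, under our reading of Normal(µ_0, γ_0/σ²) in the standard conjugate form where the conditional variance of µ is σ²/γ_0 (cf. Jordan, 2010); this reading, and the toy data, are assumptions made for illustration.

    import numpy as np

    def sample_mu_sigma2(x, mu0, gamma0, alpha, beta, rng):
        """One posterior draw of (mu, sigma^2) given data x under the
        Normal-Inverse-Gamma prior; here x would be log-frequencies."""
        n, xbar = len(x), np.mean(x)
        gamma_n = gamma0 + n
        mu_n = (gamma0 * mu0 + n * xbar) / gamma_n
        alpha_n = alpha + n / 2.0
        beta_n = (beta + 0.5 * np.sum((x - xbar) ** 2)
                  + gamma0 * n * (xbar - mu0) ** 2 / (2.0 * gamma_n))
        # sigma^2 ~ Inverse-Gamma(alpha_n, beta_n), via 1/Gamma(shape, scale)
        sigma2 = 1.0 / rng.gamma(alpha_n, 1.0 / beta_n)
        mu = rng.normal(mu_n, np.sqrt(sigma2 / gamma_n))
        return mu, sigma2

    rng = np.random.default_rng(3)
    log_freqs = np.log(np.array([120.0, 80, 15, 7, 3, 2, 1, 1]))  # toy data
    print(sample_mu_sigma2(log_freqs, 0.0, 1.0, 1.0, 1.0, rng))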

In our case, we construct the probability distributions as follows:

    θ^C ∼ Dirichlet(α^C)
    c_{i=1...N} ∼ Multinomial(θ^C)
    (σ^T_{i=1...k_C})², (σ^F_{i=1...k_C})² ∼ Inverse-Gamma(α^T, β^T), Inverse-Gamma(α^F, β^F)
    µ^T_{i=1...k_C}, µ^F_{i=1...k_C} ∼ Normal(µ^T_0, γ^T_0/(σ^T_i)²), Normal(µ^F_0, γ^F_0/(σ^F_i)²)
    t_{i=1...N}, f_{i=1...N} ∼ Log-Normal(e^{µ^T_{c_i}}, (σ^T_{c_i})²), Log-Normal(e^{µ^F_{c_i}}, (σ^F_{c_i})²)
    w_{i=1...N} ∼ I(w_i = t.f) P(t|c_i, µ^T_{c_i}, (σ^T_{c_i})²) P(f|c_i, µ^F_{c_i}, (σ^F_{c_i})²)

where α^C, µ^T_0, γ^T_0, α^T, β^T, µ^F_0, γ^F_0, α^F and β^F are constant hyper-parameters.

5.2 Gibbs Sampling

As discussed in Section 4, we use a Gibbs sampler for the inference of generative models. At each iteration, we need to sample a morphological analysis for each word given the morphological analyses of all other words. For example, at initialization, each word receives a random analysis; then we proceed by sampling a morphological analysis a′_1 of the first word, w_1, given the random analyses of all other words, A_{−1} = {a_2, a_3, ..., a_N}. Then we sample a morphological analysis a′_2 of word w_2, given A_{−2} = {a′_1, a_3, ..., a_N}, and so on until it stabilizes.

Table 2: Sampling a morphological analysis of the word w_i, for the multinomial (first row) and the lognormal (second row) model.

For the multinomial model, we sample (c, t, f) together, with the following conditional posterior probability,

    P((c, t, f)|w_i, A_{−i}) ∝ I(w_i = t.f) P(c_i = c|θ^C, A_{−i}) P(t|c, θ^T, A_{−i}) P(f|c, θ^F, A_{−i}).

Take the second term of the above equation as an example. As a result of our choice of conjugate priors, the posterior distribution P(t|c, θ^T, A_{−i}) is also multinomial, but with updated parameters θ′ ∼ Dirichlet(α^T + m_{c,t}), where m_{c,t} is the number of analyses that contain both inflectional class c and stem t. Therefore, the second term can be reduced to a form with θ^T integrated out,

    P(t_i = t|A_{−i}, c) = (α^T + m_{c,t}) / (k_T α^T + m_c),

where m_c is the number of analyses assigned to class c. Similar reductions can be done to the other terms as well, and putting the results together, we obtain the conditional probability for sampling (c, t, f) given the current word and the morphological analyses of all other words, which is shown in the first row of Table 2.
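In code, this collapsed conditional for the stem term might be computed as follows; the count arrays and parameter values are illustrative only.

    import numpy as np

    def stem_posterior(t, c, m_ct, m_c, alpha_T, k_T):
        """P(t_i = t | A_-i, c) for the multinomial model, with counts
        m_ct[c, t] and m_c[c] taken over all OTHER analyses A_-i."""
        return (alpha_T + m_ct[c, t]) / (k_T * alpha_T + m_c[c])

    m_ct = np.array([[5, 2, 0],     # toy counts: 2 classes, 3 stem types
                     [1, 0, 3]])
    m_c = m_ct.sum(axis=1)
    print(stem_posterior(0, 0, m_ct, m_c, alpha_T=0.5, k_T=3))  # (0.5+5)/(1.5+7)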

For the lognormal model, we sample the inflectional class c first, from the posterior distribution discussed above. Then we sample (t, f) from its posterior distribution given the sampled c and the morphological analyses of all other words. Again, given our choice of conjugate priors, the posterior distributions of stem t and suffix f are still lognormal, with updated mean and variance. For the lognormal model, so as to update these parameters, we need to alternately sample the variance and the mean from their respective posterior distributions. In practice, the updated mean and variance are sampled with respect to the normal distribution of the logarithms of stem/suffix frequency, L(t/f). The specific sampling process is depicted in the second row of Table 2, for comparison with the multinomial model³. We won't go into the details of computing the updated parameters for sampling the new mean and variance⁴, but it is worth noting that the data samples are logarithms of frequencies.

The sampling of lognormal distributions is obviously much more complex and time-consuming than that of multinomial distributions. We are thus motivated to constrain the learning of generative models with the acquisition outputs from the rule-based model, which runs very fast by itself. As we will show with experimental results in Section 6, the constrained learning is not only much faster but also significantly improves the performance.

³ Table 2 is constructed based on one of our reviewers' suggestions.
⁴ A detailed discussion can be found in (Jordan, 2010).

6 Experiments

In this section, we run experiments on unsupervised learning of morphology and compare the approaches we describe in Sections 3, 4 and 5. Following (Goldwater et al., 2011), we learn morphological segmentations for verbs in the WSJ corpus with the input of raw text only. Following the Penn Treebank guidelines, we consider words associated with the tags 'VB', 'VBP', 'VBZ', 'VBD', 'VBN' or 'VBG' as verbs. Using the gold part-of-speech annotations, we extracted 137,899 verbs from the whole WSJ corpus, which belong to 7,728 distinct word types. With heuristics based on part-of-speech tags and spellings, we automatically segment each verb into a stem, which cannot be empty, and a suffix, which may be empty, and use these segmentations as gold standards for evaluation.

Given the gold analysis of each word, the accuracy of a morphology model can be evaluated in two ways. For a type-based evaluation, we compute the accuracy as the percentage of correctly analyzed word types out of all distinct word types ever seen in the corpus. For a token-based evaluation, we compute the accuracy as the percentage of correctly analyzed tokens out of all occurrences of words in the whole corpus. In most previous work on unsupervised learning of morphology, only the type-based evaluation is reported. However, we agree with Goldwater et al. (2006) that the token-based evaluation gives more weight to the results of frequent words and thus better reflects the performance of each approach as applied to real text data.
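The two criteria can be stated in a few lines; the helper below is a sketch assuming analyses are compared as exact (stem, suffix) pairs, with hypothetical example inputs.

    def accuracies(pred, gold, token_counts):
        """pred, gold: word -> (stem, suffix); token_counts: word -> frequency.
        Returns (type-based accuracy, token-based accuracy)."""
        correct = [w for w in gold if pred.get(w) == gold[w]]
        type_acc = len(correct) / len(gold)
        token_acc = sum(token_counts[w] for w in correct) / sum(token_counts.values())
        return type_acc, token_acc

    gold = {"laughed": ("laugh", "ed"), "runs": ("run", "s")}
    pred = {"laughed": ("laugh", "ed"), "runs": ("runs", "")}
    counts = {"laughed": 3, "runs": 7}
    print(accuracies(pred, gold, counts))  # (0.5, 0.3): frequent errors weigh more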

Different forms of input

In the formal study of morphology, the acquisition input is usually taken as the list of distinct word types. For example, as shown in Section 3.1, the bootstrapping algorithm measures contextual diversity with type frequency only. On the other hand, natural text data typically uses most types of words more than once. Furthermore, when a model is trained with the input of distinct types only, each word occurrence of the same type will always receive the same analysis by the model. However, if a model is trained with real text data, then with a generative model, a word may receive different analyses on different occurrences.

Figure 3: Experiment with different forms of input. (a) Type-based evaluation. (b) Token-based evaluation.

First, we replicate the experiments in (Goldwater et al., 2006). The morphology model is the multinomial one discussed in Section 4. As shown in Figure 3, the multinomial model learns well with the input of distinct word types, but poorly with the token-based input.


Inflectional classes

Figure 4: Experiment with different numbers of inflectional classes. (a) Type-based evaluation. (b) Token-based evaluation.

In the above experiment, the number of inflectional classes was set to 6, following (Goldwater et al., 2006). However, this choice is rather arbitrary; therefore, we experiment with different settings of this parameter. As shown in Figure 4, different numbers of inflectional classes do not make significant differences in the training results.

Constrained learning

As we described in Section 5, the sampling of the lognormal model is much more complex and time-consuming than the sampling of the multinomial model. Motivated by the concern of training efficiency, we propose to constrain the learning of generative models with the acquisition outputs from the rule-based model that we described in Section 3.1. Since the rule-based model takes as input raw text only, and so do the generative models, the combination of these two processes results in a totally unsupervised learning process as well. More specifically, suppose that we have acquired a set of suffixes containing -es and -s only. With constrained learning, the only possible segmentations we need to consider for the word porches are porch+es, porche+s, and porches+"", instead of all 7 possible segmentations of this 7-character word, as sketched below. In practice, we constrain the learning of generative models with the 20 suffixes acquired by the bootstrapping algorithm.
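A sketch of the constrained candidate enumeration, assuming the empty suffix is always permitted and the stem must be non-empty:

    def candidate_splits(word, suffixes):
        """Only segmentations whose suffix is in the acquired set are considered."""
        splits = [(word, "")]                # the empty-suffix analysis
        for f in suffixes:
            if word.endswith(f) and len(word) > len(f):
                splits.append((word[:len(word) - len(f)], f))
        return splits

    print(candidate_splits("porches", ["es", "s"]))
    # [('porches', ''), ('porch', 'es'), ('porche', 's')]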

Figure 5: Experiment with constrained learning of multinomial models. (a) Type-based evaluation. (b) Token-based evaluation.

As shown in Figure 5, the constrained learning converges much faster than its unconstrained counterpart. Since the bootstrapping algorithm is very fast by itself, taking almost no time compared to the training of generative models, the total training time is greatly reduced. Moreover, even though this method is motivated by the concern of training efficiency, it also significantly improves the performance. As clearly shown in Figure 5, the constrained learning achieves notably higher performance under both the type-based and the token-based evaluations.

The generative model with lognormal likelihood

As discussed in Section 5, by replacing the multinomial likelihood with a lognormal likelihood, we can take advantage of the particular statistical property of morphological units with a Bayesian approach; moreover, we can capture the statistical property of word frequency without the need of an additional generative process. We experiment with the lognormal model over different forms of input, and evaluate it by different criteria. Based on the above experimental results, in this experiment we set the number of inflectional classes to 1, and apply constrained learning with the 20 suffixes acquired by the bootstrapping algorithm.

Figure 6: Experiment with generative models of multinomial or lognormal distributions. (a) Type-based evaluation. (b) Token-based evaluation.

Overall, the proposed model performs significantly better than the multinomial model, especially when trained with unprocessed text data. As shown in Figure 6-a, when both are given the input of distinct types, the results of the lognormal and multinomial models are not distinguishable by the type-based evaluation. However, in contrast to the multinomial model, the lognormal model is able to learn from unprocessed text data as well. As shown in Figure 6-b, by the token-based evaluation, the proposed lognormal model achieves its best performance with the input of text data, which is much more accurate than the best of the multinomial model. It is interesting to observe that the proposed lognormal model is more accurate with the token-based evaluation when trained with token-based learning, but more accurate with the type-based evaluation when trained with type-based learning. This result pattern suggests that the proposed model is able to adapt to real data distributions by itself, without the need of an additional generative process.

Comparing all three models

We have shown the acquisition outputs of the bootstrapping algorithm in Table 1, upon which we can build a rule-based segmentation model. In contrast to the learning progress of generative models, which converge to a relatively steady state, we stop the acquisition process after 20 bootstrapping iterations, following the previous experiments. Furthermore, so as to compare the generative models with the rule-based model, for each generative model we compute the average accuracy of its last 5 training iterations.


model          input form    type-based evaluation   token-based evaluation
bootstrapping  either        83.59%                  64.04%
multinomial    type-based    79.98%                  81.06%
lognormal      type-based    78.85%                  73.10%
multinomial    token-based   42.36%                  58.06%
lognormal      token-based   75.46%                  87.79%

Table 3: Comparing all three models with different forms of input.

As shown in Table 3, the rule-based model achieves a type-based accuracy as high as 83.59%, significantly higher than either generative model. However, by the token-based evaluation, the rule-based model performs rather badly. The highest token-based accuracy, 87.79%, is achieved by the lognormal generative model. No matter what form of input is fed to the multinomial model, this level of token-based accuracy cannot be achieved.

7 Conclusion and future work

In previous work on unsupervised learning of morphology, the long-tail pattern observed in the rank-frequency distribution of words, as well as of morphological units, is usually considered as following Zipf's law (power-law). We argue that the signature straight lines on logarithmic scales may suggest either a power-law or a lognormal distribution. We have also discussed that, both being based on the idea of preferential attachment, the generative processes for Zipf's law and for lognormal distributions have only subtle differences. The advantage of considering the long-tail distributions of morphological units as lognormal is that we can utilize this statistical property in a Bayesian model. Moreover, given the multiplicative property of lognormal distributions, we can directly capture the long-tail distribution of word frequency without the need of an adaptor.

The experimental results show that the proposed model performs significantly better than the other models in discussion, especially when it is evaluated by a token-based criterion that better respects the real distribution of text data. Moreover, the proposed model can learn not only from the list of distinct word types, which other models can handle as well, but also from unprocessed text data, which other models cannot handle. In particular, the proposed generative model is more accurate with the token-based evaluation when trained by token-based learning, and more accurate with the type-based evaluation when trained by type-based learning. This result pattern suggests that the proposed model is able to adapt to the real data distribution by itself.

In this work, our primary goal is to provide an alternative perspective on modeling the long-tail distributions for morphology learning, rather than to develop a state-of-the-art morphology learning system. We are aware of recent work on morphology learning that utilizes extra information and achieves good results on more data. Extra information that has been shown to be useful for morphology models includes syntactic context (Lee et al., 2011), document boundaries (Moon et al., 2009) and so on. The proposed model has the potential to be developed into a more complex learning system; thus, in future work, we plan to extend our model to integrate such extra information and compare with more benchmark systems.

References

Chan, E. (2008). Structures and distributions in morphology learning. PhD thesis, University of Pennsylvania.


Downey, A. B. (2001). The structural cause of file size distributions. In Proceedings of the Ninth International Symposium in Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS '01, Washington, DC, USA.

Goldwater, S., Griffiths, T., and Johnson, M. (2006). Interpolating between types and tokens by estimating power-law generators. In NIPS.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2011). Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(Jul):2335–2382.

Jordan, M. I. (2010). The conjugate prior for the normal distribution. Lecture notes on Stat 260: Bayesian Modeling and Inference.

Lee, Y. K., Haghighi, A., and Barzilay, R. (2011). Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, Oregon, USA. Association for Computational Linguistics.

Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Miller, G. A. (1957). Some effects of intermittent silence. American Journal of Psychology, 70:311–314.

Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1:226–251.

Moon, T., Erk, K., and Baldridge, J. (2009). Unsupervised morphological segmentation and clustering with document boundaries. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP '09, pages 668–677, Stroudsburg, PA, USA. Association for Computational Linguistics.

Newman, M. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351.

Pareto, V. (1896). Cours d'Economie Politique. Droz, Geneva.

Perline, R. (1996). Zipf's law, the central limit theorem, and the random division of the unit interval. Physical Review, 54(1):220–223.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900.

Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42(3/4):425–440.

Zhao, Q. and Marcus, M. (2011). Functional elements and POS categories. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1198–1206, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.
