outline morphological productivity: corpus-based...

IntroductionQuantitative productivity: Baayen’s approach

Methodological issues in measuring quantitative productivityThe interpretation of (quantitative) productivity

Conclusion

Morphological Productivity:Corpus-Based Approaches

Marco Baroni

University of Bologna

Granada “Morphology and Corpora” Seminar

Marco Baroni Productivity



Conclusion

MotivationMorphological productivityAvailabilityDegrees of productivity

Outline

1 Introduction

2 Quantitative productivity: Baayen’s approach

3 Methodological issues in measuring quantitative productivity

4 The interpretation of (quantitative) productivity

5 Conclusion




Conclusion


Attested and possible words

Morphology is about defining what is a possible word (andexplaining why it is possible)Attested words are subset of possible wordsNeed to delimit set of possible but unattested words




Conclusion


Word-formation and productivity

Possible unattested words are (mostly) derived byword-formation processes (rules/schemas/. . . )However, not all wf processes are (equally) available tospeakersE.g.:

-ness vs. -ity vs. -thNN compounding in Germanic vs. Romance languages

-ness and Germanic NN compounding are productiveprocesses




Conclusion


The centrality of productivity in morphology

Objective measure of productivity neededbecause any theory of morphology/word formation must:

focus on productive processesexplain why only certain processes are productive

Vast literature on productivity (see refs.)




Conclusion


Productivity: the classic definitionSchultink (1961), translated by Booij

Productivity as morphological phenomenon is the possibilitywhich language users have to form an in principle uncountablenumber of new words unintentionally, by means of amorphological process which is the basis of the form-meaningcorrespondence of some words they know.




Conclusion


Some issues

“Unintentionally”“In principle uncountable” (the step- problem)How do we find out if a process can form an “in principleuncountable” number of new words?Is productivity an all-or-nothing phenomenon? Does therate at which different productive processes grows towardsuncountably many forms matter?How should productivity be interpreted? Is productivity aninherent property of a process, or an epiphenomenon?(Plag 1998)




Conclusion


Productivity as all-or-nothing

Availability (Bauer 2001): -ness is available, -th is notDifferent from “grammatical” vs. “ungrammatical”: kingdom,growth are “grammatical”




Conclusion


Problems with the availability approach

Common expressions like “very productive”, “marginallyproductive” betray shared intuition that productivity is not“black or white”re- is more productive than de-, but de- is more productivethan be-Once available always available?Baayen (2003) finds productive uses of -th on the Net(“Maintainance (sic) of greenth”)




Conclusion


Measuring (degrees of) productivity

How do we measure how many words could be generatedby a process, when the words that have already beengenerated are all we can see?Early proposals (e.g., Aronoff 1976) haveoperationalization problemsBaayen and colleagues (see refs.) ground study ofproductivity in corpora and tradition of lexical statisticsCorpora: you need to count something, to find out that X ismore productive than Y (dictionary entries not appropriate)Lexical statistics: we must count properties of our sample(instances of wf process attested in the corpus) to inferproperties of the population our sample is taken from (allpossible instances of wf process)




Conclusion

Lexical statisticsBaayen’s PApplications of P

Outline

1 Introduction




5 Conclusion




Conclusion


Lexical statisticsZipf 1949/1961, Baayen 2001, Evert 2005

Comparison of vocabulary size and other measures oflexical richnessE.g., for stylometry (does Joyce use richer vocabulary thanH. James?), language acquisition (how many words do7-year old know? is the L2 learners’ vocabularysignificantly smaller than the one of natives?),genre/register analysis (is spoken English lexically poorerthan written English)Productivity as a nuisance: target sample (text, corpus)does not contain full vocabularyDevelopment of methods to assess “growth rate” ofvocabulary and estimate vocabulary size (and othermeasures) in whole population




Conclusion


Basic terminology

N: sample/corpus size, number of tokens in the sampleV: vocabulary size, number of distinct types in the sampleV1: hapax legomena count, number of word types thatoccur only once in the sample (for hapaxes, Count(types)= Count(tokens))A sample: a b b c a a b a

N: 8; V: 3; V1: 1




Conclusion


Vocabulary growth curve

The sample: a b b c a a b a

N: 1, V: 1, V1: 1N: 3, V: 2, V1: 1N: 5, V: 3, V1: 1N: 8, V: 3, V1: 1(Most VGCs below smoothed with binomial interpolation)




Conclusion


Vocabulary growth curve of LOB corpus

2e+05 4e+05 6e+05 8e+05 1e+06

010

000

3000

050

000

7000

0

LOB VGC

N

V a

nd V

1




Conclusion


Frequency spectrum

The sample: a b b c a a b a d

Frequency classes: 1 (c, d), 3 (b), 4 (a)Frequency spectrum:

m V(m)1 23 14 1




Conclusion


Frequency spectrum of LOB corpus

●

●

●

●●

● ● ● ● ● ● ● ● ● ●

2 4 6 8 10 12 14

050

0010

000

2000

0

LOB Frequency Spectrum

frequency class

type

s




Conclusion


Morphology, productivity and lexical statistics

N: number of tokens characterized by target wf process incorpusV: number distinct types characterized by target wf processV1: number of hapax legomena characterized by target wfprocess




Conclusion


ri- in Italian la Repubblica corpus

0 200000 600000 1000000 1400000

020

040

060

080

010

00

ri− VGC

N

V a

nd V

1




Conclusion



●

●

●

● ●● ●

● ●● ●

● ● ● ●

2 4 6 8 10 12 14

050

100

150

200

250

300

350

ri− Frequency Spectrum

frequency class

type

s




Conclusion


Pronouns in Italian la Repubblica corpus

5e+02 5e+03 5e+04 5e+05 5e+06

020

4060

80

pronouns VGC

N (log)

V a

nd V

1




Conclusion



0 2000 4000 6000 8000 10000

020

4060

80

pronouns VGC (fragment)

N

V a

nd V

1




Conclusion



●● ● ● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●

1 100 10000

0.6

0.8

1.0

1.2

1.4

pronouns Frequency Spectrum

frequency class (log)

type

s




Conclusion


V as a measure of productivity

Valid only for corpora/samples of equal size!Good first approximation, but it is measuring attestedness,not potential:

(According to rough BNC counts) de- verbs have V of 141,un- verbs have V of 119, contra our intuitionWe want productivity index of pronouns to be 0, not 72!

(V of whole population could measure potential – seebelow)




Conclusion


Hapax legomena and productivity

Plots show relation between productivity and hapaxlegomenaIntuition: hapax legomena are words we did not see beforein our sample, until the moment in which we sample themThere is a close relation between hapax legomena andwords-yet-to-be-seen




Conclusion


Hapax legomena and productivity

If the word we sample after seeing N tokens is a hapaxlegomenon, this means that the word was not in the Ntokens seen up to that point, i.e., it is a new word at thatpointA productive process is a process that is more likely thanothers to produce new wordsThus, the more a process is productive, the more it is likelythat the next word we see that has been generated by thatprocess is a new word




Conclusion


Baayen’s P

Operationalize productivity of a process as probability thatthe next token created by the process that we sample is anew wordThis is same as probability that next token in sample ishapax legomenonThus, we can estimate probability of sampling a new wordas relative frequency of hapax legomena in our sample:P = V1

N(where V1 and N are limited to words displaying therelevant process)




Conclusion


Baayen’s P

P = V1N

Probability to sample token representing type we will neverencounter again (token labeled “hapax”) at first stage ofsampling (when we are at the beginning ofN-token-sample) is given by the proportion of hapaxes inthe whole N-token-sample divided by the total number oftokens in the sampleThus, this must also be probability that last token sampledrepresents new typeP as productivity measure matches intuition thatproductivity should measure potential of process togenerate new forms




Conclusion


P as vocabulary growth rate

P measures the potentiality of growth of V in a very literalway, i.e., it is the growth rate of V, the rate at whichvocabulary size increasesP is (approximation to) the derivative of V at N, i.e., theslope of the tangent to the vocabulary growth curve at N(Baayen 2001, pp. 49-50)Again, “rate of growth” of vocabulary generated by wfprocess seems good match for intuition about productivityof wf process




Conclusion



0 200000 600000 1000000 1400000

200

400

600

800

1000

ri− VGC with tangent at N = 280K

N

V




Conclusion



0 2000 4000 6000 8000 10000

020

4060

80

pronouns VGC with tangent at N = 5000

N

V




Conclusion


Baayen’s P and intuition

class V V1 N Pit. ri- 1098 346 1,399,898 0.00025it. pronouns 72 0 4,313,123 0en. un- 119 25 7,618 .00328en. de- 141 16 86,130 .000185




Conclusion


Applications of P

Extensive tradition of corpus-based analyses ofderivational morphology based on P (and V), by Baayenand colleagues (esp. English and Dutch), but not only(e.g., Lüdeling and Evert on German morphology, Gaetaand Ricca on Italian morphology)P used as an exploratory tool




Conclusion


Baayen & Lieber 1991 on English derivation

Large scale study of English derivation based on P and Vwith statistics extracted from CELEX database (= 18Mtokens version of COBUILD)A few results:

-ness > -ity; -ish > -ous; un- > in-; -ation > -al, -ment. . . but affixes such as -ity, -ous and in- have P > 0 (andthere are category-of-the-base effects)dual nature of re-: few high frequency forms make it lookunproductive (remove, recite, recall. . . ) (but see below onre-, P and sample size)




Conclusion


Plag, Dalton-Puffer & Baayen (1999)Productivity across speech and writing

P and V in written, demographic spoken andcontext-governed spoken sections of BNC (keep affixconstant, change corpus)Strong effect of “register”, with productivity higher in writtenthan spoken, and context-governed spoken higher thandemographic spoken (productivity as a dimension ofregister variation)Different productivity of different affixes in differentregisters: -like is “written-only”, -ness is strongly “written”,productivity reversal of -ize and -ish in writtenvs. demographicProductivity is affected by register, it cannot be explained inpurely structural terms




Conclusion


Other examples

Gaeta and Ricca (2003): on the extremely high productivityof the Italian “neo-”prefixes mega-, iper-, super-, maxi-,ultra-Lüdeling and Evert (2005): medical and non-medical -itis inXXth century German (with focus on methodologicalaspects)




Conclusion

Pre-processingP and sample size

Outline

1 Introduction




5 Conclusion




Conclusion


Methodological issues

Pre-processing/preparing the dataEffect of N on P




Conclusion


Pre-processing

IT IS IMPORTANT!!!Baayen, strangely, does not seem to worryAt least two aspects:

Problems of (automated) data-cleaning/complex wordidentification (Evert and Lüdeling 2001)Theoretical issues (delimitation and identification ofapplication of a wf process) (Gaeta and Ricca 2003, toappear)




Conclusion


Automated data-cleaning/complex word identification

Often necessary (13,850 types begin with re- in BNC,103,941 types begin with ri- in itWaC)We can rely on:

POS taggingLemmatizationPattern matching heuristics (e.g., candidate prefixed formmust be analyzable as PRE+VERB, with VERBindependently attested in corpus)




Conclusion


The problem with low frequency words

Correct analysis of low frequency words is fundamental tomeasure productivityHowever, automated tools will tend to have lowestperformance on low frequency forms:

Statistical tools will suffer from lack of relevant training dataManually-crafted tools will probably lack the relevantresources

Problems in both directions (under- and overestimation ofhapax counts)Part of the more general “95% performance” problem




Conclusion


Underestimation of hapaxes

The Italian TreeTagger lemmatizer is lexicon-based;out-of-lexicon words (e.g., productively formed wordscontaining a prefix) are lemmatized as UNKNOWNNo prefixed word with dash (ri-cadere) is in lexiconWriters are more likely to use dash to mark transparentmorphological structure




Conclusion


Productivity of ri- with and without an extended lexicon

0 200000 600000 1000000 1400000

200

400

600

800

1000

ri− VGC with/without extended lexicon

N

V




Conclusion


Overestimation of hapaxes

“Noise” generates hapax legomenaThe Italian TreeTagger seems to think that dashedexpressions containing pronoun-like strings are pronounsDashed strings can be anything, including full sentencesThis creates a lot of pseudo-pronoun hapaxes: tu-tu,parapaponzi-ponzi-pò, altri-da-lui-simili-a-lui




Conclusion


Productivity of the pronoun class before and aftercleaning

0e+00 1e+06 2e+06 3e+06 4e+06

5010

015

020

025

030

035

0

pronouns VGC with/without cleaning

N (log)

V




Conclusion


P (and V) with/without correct post-processing

With:class V V1 N Pri- 1098 346 1,399,898 0.00025pronouns 72 0 4,313,123 0

Without:class V V1 N Pri- 318 8 1,268,244 0.000006pronouns 348 206 4,314,381 0.000048




Conclusion


Linguistic analysis issues

Which of the following forms should be counted as prefixedforms with re-? redo, remove, retain, remake (as a noun),resyllabificationIf we remove remove because it is lexicalized, aren’t weboosting up the productivity of re- in a circular way?Are there two in- prefixes? inanimate vs. inchoativeHow about re-? re-conquer the city vs. re-play the songIs deXize a different wf process from de-?




Conclusion


THESE ARE SERIOUS PROBLEMS!

Given the current state of NLP tools (especially forlanguages other than English)and the typical resources of morphologistslarge-scale, methodologically sound quantitativeproductivity studies are unfeasible




Conclusion


P and sample size

It should be obvious that as N increases, V also increases(for at-least-mildly-productive processes)Thus, V cannot be compared at different Ns




Conclusion


V and NEnglish re- and mis-

0 10000 20000 30000 40000 50000

050

100

150

200

250

VGCs of re− (frag.) and mis−

N

V




Conclusion


P and sample size

It should be obvious that as N increases, V also increases(for at-least-mildly-productive processes)Thus, V cannot be compared at different NsHowever, the growth rate is also systematically decreasingas N becomes largerAt the beginning, any word will be a hapax legomenon; assample increases, hapaxes will be increasingly lowerproportion of sampleA specific instance of the more general problem of“variable constants” (Tweedie and Baayen 1998) in lexicalstatistics (cf. type/token ratio)




Conclusion


Growth rate of re- at different sample sizes

0 50000 100000 150000 200000

100

150

200

250

300

re− VGC (tangents at N = 22.5K, 134.5K)

N

V

●

●




Conclusion


P as a function of N (re-)

0 50000 100000 150000 200000

1e−

045e

−04

5e−

035e

−02

P in function of N (re−)

N (log)

P




Conclusion


The solution: control NGaeta and Ricca (to appear)

Always compute P at comparable NGiven two wf processes with sample sizes Na and Nb, withNa > Nb, measure P at Nb for both processesDenominator of P = V1

N is fixed, so this amounts tocomparing V1, the number of hapax legomenaIf more than 2 processes are compared, do comparisonpairwise and use transitivity, to minimize data lossI.e., if 3 processes have sample sizes Na > Nb > Nc ,compare processes a and b at Nb, b and c at Nc and inferproductivity ranking of a and c on the basis of theirrelationship to b




Conclusion


Controlling N: P

class N = 6097 N = 35107 N = 223970re- 0.007 0.00098 0.000085en- 0.00075 0.00014 NAmis- 0.00082 NA NA




Conclusion


Controlling N: V1 (interpolated values!)

class N = 6097 N = 35107 N = 223970re- 43.7 34.4 19en- 4.5 5 NAmis- 5 NA NA




Conclusion


Problems

We are throwing away dataClumpiness and other non-randomness effects




Conclusion


Non-randomnessEmpirical and interpolated VGCs of BNC

0e+00 2e+07 4e+07 6e+07 8e+07 1e+08

0e+

002e

+05

4e+

056e

+05

8e+

05

Empirical and expected VGCs of BNC

N

V




Conclusion


Non-randomnessThe “real” re- VGC

0 50000 100000 150000 200000

050

100

150

200

250

300

Empirical re− VGC

N

V a

nd V

1




Conclusion


Non-randomness

At least two issues:Clumpiness (well above one third of the non hapaxes in the“too much” data-set of Lüdeling/Baroni/Evert occur morethan once in the same document)Effects of specialized language, genre, register (in ourversion of BNC, spoken texts are almost entirely at end ofcorpus)

For less frequent process, we take sample from wholecorpus, whereas for more frequent process we takesample from first Nsub tokens, probably resulting in moreclumpiness and less variety of genre and topics




Conclusion


The importance of randomization

Randomize the order of words in corpusThis can be done efficiently and soundly by using binomialinterpolation (Baayen 2001, ch. 2)Binomial interpolation produces expected values of V andV1 for arbitrary sample sizes (< N) that can be thought ofas the average of an infinite number of randomizationsMost plots shown on these slides are based on binomialinterpolation




Conclusion


The importance of randomization

Counting number of documents in which a word occurs,rather than overall occurrences, might be a cure forclumpiness (but increases data-sparseness problems, andcomplicates the assumptions about sampling)However, non-randomized VGC plot provides very valuableinformation, and should always be included in quantitativeproductivity studies




Conclusion


Non-randomness: a bigger problem

The whole corpus is probably a non-random sample of the“population” we are interested in (e.g., the population ofwords illustrating word formation with re-, or the populationof words known by an English speaker)Unfortunately, we cannot take a randomized sub-samplefrom the whole population like we can do when taking asub-sample from the whole corpus (that’s what a corpus issupposed to be in the first instance!)




Conclusion


Non-randomness and parametric (lexico-)statisticalmodels

Non-randomness problems are seriously affecting thequality of parametric statistical models (Evert and Baroni toappear, and ongoing work)This is a pity, since these models would allow us toextrapolate V and V1 to arbitrary values (including V andV1 in the whole population), and to test significance ofdifferences(Please stay tuned)




Conclusion

Parsing and productivityProductivity and semantic transparencyNeed

Outline

1 Introduction




5 Conclusion




Conclusion


Productivity: cause or effect?

Is productivity (as measured by P and related measures)a cause or an effect (an epiphenomenon)?Does P correspond to an “activation level” in the mind ofthe speaker, or should it be explained by other factors?If so, what kinds of factors?General agreement (often implicit) is that P is anepiphenomenonSee, e.g., Plag (1999), who uses quantitative productivityas exploratory tool, and looks for qualitative structuralexplanations (phonological, semantic, morphosyntactic) ofdifferent degree of productivity of similar affixes




Conclusion


Corpus-based “explanations” of P

Hay and Baayen (2002, 2004; see also Baayen, 2003):parsing and productivityMy ongoing work on productivity and semantictransparency (Baroni and Vegnaduzzo 2003, Baroni 2005)




Conclusion


Relative frequency and parsing

One important result of Hay (2003): in a number of tasks,affixed forms behave as morphologically complex iff theirbase is more frequent than the affixed form itself:

E.g., illegible is more frequent than legible, and it behavesas morphologically simpleilliberal is less frequent that liberal, and it behaves asmorphologically complex

Parsing explanation in a dual route race model – whenanalyzing a derived word

If base is more frequent than derived form, it is retrievedfaster, and base+affix analysis winsIf derived form is more frequent, it is retrieved as a wholebefore base+affix analysis is accessed

The higher the base-to-derived-form relative frequency is,the more likely it is that a word is treated as complex




Conclusion


Parsing and productivity

Hay and Baayen’s hypothesis:Affixes that appear in many words that are parsed ascomplex in language perception (i.e., appear in manyderived words that have lower frequency than their bases)will be more “active” in lexiconI.e., they will be more available for word formation, i.e.,more productivePredicts correlation between P and relative frequency

Hay and Baayen (2002) report high correlation between Pand relative frequency for 80 English derivation affixes




Conclusion


Problems

Could the correlation be due to the fact that both measuresheavily rely on the number of low(est) frequency forms thatcontain a certain affix?More importantly, both P and relative frequency seem tobe useful indices of parsability/productivity, effects, notcauses!If productivity is caused by low relative frequency of bases,what causes this low relative frequency? (Or vice versa?)Nature of variables as epiphenomenal indices seems to berecognized by Hay and Baayen (2004), which analyze aconstellation of densely inter-correlated measures relatedto parsability and productivity




Conclusion


Productivity and semantics

Observation: we have much clearer intuitions about whatproductive affixes mean than about what unproductiveaffixes meanCf. redo vs. enlargeHypothesis: an affix is productive as long as it haswell-defined meaning in the language (it is easier toacquire the relevant semantic generalization, and thus touse it to form new words)Prediction: semantic transparency of forms containing anaffix will be correlated with productivity of affixHere, direction of causation should be clear




Conclusion


Measuring semantic transparency

By hand, it is hard. . .but computational linguists have developed methods tomeasure semantic similarity among words (e.g., Manningand Schütze 1999)Degree of semantic transparency of complex form isdegree of semantic similarity between complex form andits base




Conclusion


Contextual approach to meaning

Cruse 1986 (p. 1):[T]he semantic properties of a lexical item arefully reflected in appropriate aspects of therelations it contracts with actual and potentialcontexts [...] [T]here are are good reasons for aprincipled limitation to linguistic contexts.

Two knowledge poor operationalizations:Semantically similar words occur in similar contextsSemantically similar words occur near each other




Conclusion


Contextual similarity

Cosine (correlation) of normalized vectors representingco-occurrence frequency with all words within a certainwindow:

cos(−→x ,−→y ) =−→x ·−→y =

n

∑i=1

xiyi

My parameters:Targets: affixed form/base pairsContexts: all content wordsWindow: 1 sentence




Conclusion


Co-occurrence

Measured by Mutual Information (MI):

MI(w1,w2) = logPr(w1,w2)

Pr(w1)Pr(w2)

My parameters:Targets: affixed form/base pairsWindow: 1 sentence




Conclusion


Wf processes explored

“Baayen prefixes”Mean productivity rank assigned by 4 Englishmorphologists:

un 1.500re 1.625mis 3.250de 3.625be 5.875en 5.875in 6.250




Conclusion


Extraction of prefixed forms

From BNCAll forms that begin with prefix string and with base that:

is at least 3 letters longoccurs in the corpus as independent word

E.g., beads is treated as prefixed; bead and benny are notMore “sophisticated” methods (that rely on furtherautomated processing) perform worse than this “brutal”approach




Conclusion


Averaging Cosine/MI across forms with same prefix

Cosine/MI computed for each prefixed form/base pair, butwe want single value of each measure per prefixFor each prefix class, compute average cosine/MI of 20pairs with highest cosine/MI valueRationale: presence of subset of prefixed words with highsemantic transparency is more significant than fact thatother forms in same class are opaqueE.g., re- has plenty of both transparent and opaque formsNB: hapax legomena are not playing a (crucial) role!




Conclusion


Results

Morphologist’s rank:un, remis, debe, en, in

P:un > re > de > mis, in > en > beCosine:un > re > in > de > mis > en > beMI:un > re > in > de > en > mis > be




Conclusion


Discussion

Good results, especially with cosine similarityHowever, small data-setMore nuance needed: polysemy of in-, de- vs. deXize




Conclusion


Current workOn Italian ri-

Inspect corpus contexts to determine distribution andscope of senses (iterative vs. restitutive)Measure co-occurrence in relational patterns determinedby mini-grammar (e.g., V ART? ADJ* N)




Conclusion


Current workOn Italian ri-

Targets:Single prefixed wordsClass of ri- words (compared to other prefixed/non-prefixedwords)Iterative vs. restitutive

Patterns of co-occurrence (both similarities anddifferences):

With bases (or prefixed forms with same bases)With other ri- formsWith base+againDirect co-occurrence with words that tap into the semanticsof ri-




Conclusion


An encouraging pilot study

Hypothesis: words containing more transparent/productiveprefixes will have higher semantic/distributional similarity toother words containing same prefixri- is more productive than de- in Italian (excludingdeXizzare pattern)Prediction: on average, ri- words will have more ri- wordsin their distributionally defined nearest neighbor set thande- words will have other de- words




Conclusion


Distributional data

1.9B token Web-crawled itWaC corpus (Baroni andUeyama 2006)Automated thesaurus function of Word Sketch Engine(Kilgarriff et al. 2004)Based on Lin’s (1998) distributional similarity measure




Conclusion


Lin’s algorithm

Collect collocates of each target word with other words insmall set of grammatically meaningful patterns (e.g., for Vcollects N collocates in patterns N ADJ* ADV* AUX* V,V ART* ADV* ADJ* N, etc.)For each pair of target words (with same POS), computescore based on number of shared collocates, weighted byMI (so that more unusual collocates will have more weight)Pick as neighbor set of a target word all other target wordswith similarity score above a certain threshold (I used WSEdefaults)




Conclusion


Neighbor sets

Neighbor sets built in this way for our test words rangefrom 31 to 59 membersE.g., neighbor set of ricomporre (“to recompose”) includericostituire (“to reconstitute”), scomporre (“to decompose”),assemblare (“to assemble”)




Conclusion


Experiment

10 prefixed words with ri- and de- (but not deXizzare)randomly chosen from corpus (conditions: min fq ≥ 500,not in top 500 forms with prefix)Number of forms with same prefix in neighbor sets:

min med mean maxri- 3 14 13.4 25de- 2 3 3.4 6

Percentage of forms with same prefix over total number ofneighbors:

min med mean maxri- 10% 31% 28% 44%de- 4% 8% 8% 13%




Conclusion


“Need” in the interpretation of productivity

Ongoing work with Anke Lüdeling and Stefan EvertCorpus counts are influenced by the need to express agiven thought/conceptWords are only formed as and when there is a need forthem [. . . ] (Bauer 2001, 143)The need to express something depends on extra-linguisticfactors (like the political situation, fashion, etc.)There is no baayenitis without Baayen!




Conclusion


Competition

If a wf process is the only way to express a need, P will toa large extent measure extra-linguistic need, not factorsrelating to linguistic productivityNeed can be productive for extra-linguistic reasons!The study of productivity is interesting only when studyingrelative close (interchangeable) competitors expressing thesame need, as need factor is kept constantHowever. . . does competition exist? (cf. Plag 1999: whereall have the rivals gone?)Probably it does (ongoing work on a set of Germancompound heads meaning “too much” with a diseaseconnotation)




Conclusion

Outline

1 Introduction




5 Conclusion




Conclusion

Conclusion

The work of Baayen and others on productivity offundamental historical importance:

Early corpus-based workTheoretical relevance of quantitative dataMethodological expertise in using corpus data to answerlinguistic questionsDevelopment of descriptive techniques (VGCs etc.) andindex (P) to explore productivity in data-set




Conclusion

Conclusion

Corpus-based productivity measures, qualitativeexplanations often not based on corpus evidenceProductivity measures have played and can play animportant role in choosing productive phenomena to focuson (Baayen and Lieber 1991, Plag 1999, Lieber 2004)Possible criterion also in preparation of L2teaching/lexicographic materialsMany other areas of application still to explore: e.g.,non-morphological productivity in studies of lexicalrichness, stylometry




Conclusion

Conclusion

Very little corpus-based work on explaining productivityCorpora used as word frequency lists, context not takeninto accountTypically, coarse level of analysis (e.g., prefix polysemyignored), probably in part due to data sparseness, in partto manual work demandsReasons to be optimistic:

Availability of very large corporaAutomated corpus-based grammatical/semantic analysismethods from NLPNew analytical tools to study context and meaning fromcorpus linguistics, e.g., Stefanowitsch and Gries’ (2005 andelsewhere) collustructional analysis




Conclusion

THE END


outline morphological productivity: corpus-based...

Documents