English Department
Measures of Productivity and Lexical Diversity: Suffixation (c.1150–1350)
The development of abstract noun derivations in different regions and text types (c.1150–1700)
Anne Gardner
01/10/11
Outline
1. Methodology: Statistical measures for productivity and lexical diversity
2. Linguistic Atlas of Early Middle English (LAEME)
3. Application of permutation testing
   3.1. Productivity of -HED
   3.2. Regional variation in the emergence of -HED
   3.3. Lexical diversity of -REDEN
4. Evaluation
1. Methodology: Overview of statistical measures
• ‘Traditional’ measures of productivity:
• Token frequency
• Type frequency
• [Hapax legomena]
• Type/token ratio
• More recent measure of lexical richness and productivity by Säily &
Suomela:
Permutation testing involving type accumulation curves
1. Methodology: Types and tokens
• Types: number of distinct derivatives
• Tokens: number of attestations
Type                   Tokens  Spellings attested in LAEME
falsehood              20      FALSEHED, FALSHEDE, FALSHED, UALSHEDE(S), FALSED(E)
boldhood               2       BOLDHEDE
kindred                16      KINREDEN, KINRADEN, KENREDE, KUNREDE(N), KUNREDNES, KUNRADE, CUNREDE(N), CUNREDES, CUNRADEN, CUNREADNES
foereden ‘hostility’   1       FAREDEN
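A minimal Python sketch of the distinction, using the token counts from the table above. The dictionary name `attestations` is chosen here for illustration only, and spelling variants are assumed to have been grouped under one type beforehand.

```python
# Token counts per type, copied from the table above.
# Assumption: spelling variants (e.g. FALSEHED, FALSHEDE) have
# already been grouped under a single type.
attestations = {
    "falsehood": 20,
    "boldhood": 2,
    "kindred": 16,
    "foereden": 1,
}

type_count = len(attestations)            # number of distinct derivatives
token_count = sum(attestations.values())  # number of attestations

print(type_count, token_count)
```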
1. Methodology: Token and type frequencies
Token frequency
• Normalised frequencies (per 10,000 words) enable comparisons across
subcorpora and highlight regional variation (e.g. progressive or
conservative tendencies)
• Test for statistical significance: e.g. log likelihood
Type frequency
• Productive word-formation processes produce new types
• Type accumulation indicative of past productivity?
• Figures cannot be normalised ⟶ comparisons between suffixes and
regions difficult
1. Methodology: Type/token ratio (TTR)
• Gauging the lexical diversity of a word formation process:
• lower TTR ⟶ lower lexical diversity ⟶ lower productivity?
• higher TTR ⟶ higher lexical diversity ⟶ higher productivity?
• Figures of different suffixes, or for different regions, cannot easily
be compared as TTR depends on token frequency and corpus size
Types Tokens TTR
Suffix A 1 2 0.5
Suffix B 20 50 0.4
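The two rows above can be reproduced with a short Python sketch (the function name `type_token_ratio` is an illustrative assumption). Suffix A scores the higher TTR although it has only a single type, which is why raw TTRs from samples of very different size are hard to compare.

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Return the number of types divided by the number of tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens

# Figures from the table above.
ttr_suffix_a = type_token_ratio(1, 2)
ttr_suffix_b = type_token_ratio(20, 50)

print(ttr_suffix_a, ttr_suffix_b)
```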
1. Methodology: Permutation testing (PT)
• Suomela (2007); Säily & Suomela (2009); Säily (2011)
• Investigation of type richness: likelihood of encountering unusually high
or low type counts (⟶ levels of statistical significance)
a) type count in relation to (sub)corpus size ⟶ type frequency
b) type count in relation to tokens in (sub)corpus ⟶ type/token ratio
• Software:
• computing of type accumulation curves on the basis of random
reorderings of the corpus
• output: probability tables
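The idea of reordering the corpus can be sketched in Python as follows. This is a simplified illustration, not the software by Suomela: the function names, the toy corpus and the choice to shuffle whole text samples are all assumptions made here, and the real tool outputs full probability tables rather than two proportions.

```python
import random

def types_at_size(samples, target_words):
    """Count distinct types once at least target_words words are read."""
    seen, words = set(), 0
    for word_count, types in samples:
        if words >= target_words:
            break
        words += word_count
        seen.update(types)
    return len(seen)

def permutation_proportions(samples, target_words, observed_types,
                            n_permutations=2000, seed=0):
    """Share of random reorderings whose type count at target_words is
    at most (p_low) or at least (p_high) the observed type count."""
    rng = random.Random(seed)
    shuffled = list(samples)  # copy, so the caller's order is untouched
    low = high = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        count = types_at_size(shuffled, target_words)
        if count <= observed_types:
            low += 1
        if count >= observed_types:
            high += 1
    return low / n_permutations, high / n_permutations

# Hypothetical toy corpus: (word count, types of one suffix in the sample).
toy_corpus = [
    (1000, {"falsehood", "boldhood"}),
    (1000, {"falsehood"}),
    (1000, set()),
    (1000, {"godhead"}),
    (1000, {"falsehood", "kindred"}),
    (1000, set()),
]

# Is a 2,000-word subcorpus with 0 types unusually poor in types?
p_low, p_high = permutation_proportions(toy_corpus, 2000, observed_types=0)
print(p_low, p_high)
```

With a real corpus, a small `p_low` would point to a significantly low type count for a subcorpus of that size, and a small `p_high` to a significantly high one.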
1. Methodology: Permutation testing (PT–Type)
a) PT–Type: Type count in relation to (sub)corpus size
          low type count (p <)                  high type count (p <)
Words     0.0001  0.001  0.01  0.05  0.1        0.1  0.05  0.01  0.001  0.0001
0         0       0      0     0     0          8    10    28    28     29
2,606     0       0      0     0     0          9    12    28    29     32
…
52,113    0       0      0     0     1          29   33    39    53     57
54,718    0       0      0     1     1          30   33    40    54     58
…
72,958    0       0      0     1     2          33   37    49    57     61
75,563    0       0      1     2     3          34   47    49    58     62
…
1. Methodology: Permutation testing (PT–TTR)
b) PT–TTR: Type count in relation to tokens in (sub)corpus
          low type count (p <)                  high type count (p <)
Tokens    0.0001  0.001  0.01  0.05  0.1        0.1  0.05  0.01  0.001  0.0001
0         0       0      0     0     0          10   22    28    28     28
1         0       0      0     0     0          12   23    28    29     29
…
31        0       0      0     1     3          34   37    41    50     52
32        0       0      0     2     3          34   37    42    51     53
…
40        0       0      0     2     6          37   40    48    54     57
41        0       0      0     2     6          37   40    48    55     57
…
1. Methodology: Permutation testing
Minimum requirements a subcorpus has to meet in order for its type count to
register as significantly high or low (starting at p < 0.05):
Example: -HED
⟶ thresholds could be too high for smaller subcorpora
PT–Type          low type count         high type count
subcorpus size   52,114–54,718 words    1–2,606 words
types            0                      > 12

PT–TTR           low type count         high type count
tokens           32                     41
types            1                      41
2. Linguistic Atlas of Early Middle English (LAEME)
• Version 2.1 (December 2008): 648,801 words in 167 corpus files
• Implemented subperiods:
Subperiods Words LAEME datings
I 1150–1190 53,785 C12b1, C12b2
II 1190–1230 107,478 C12b2–C13a1, C13a1, C13a; c.1200
III 1230–1270 183,620 C13a2, C13a2–b1, C13b1, C13; c.1250
IV 1270–1310 142,425 C13b, C13b2, C13b2–C14a1; c.1300
V 1310–1350 161,493 C14a, C14a1, C14a2
(C = century, a/b = first/second half, 1/2 = first/second quarter)
2. LAEME: Regional coverage
Region                Counties / Localisation                                                                                     Words
North (N)             Cumberland, Durham, Lancashire, Yorkshire                                                                   65,082
West Midlands (WML)   Cheshire, Gloucestershire, Herefordshire, Shropshire, Staffordshire, Warwickshire, Worcestershire           275,942
East Midlands (EML)   Cambridgeshire, Essex, Huntingdonshire, Leicestershire, Lincolnshire, Norfolk, Northamptonshire, Suffolk; London   141,100
South West (SW)       Berkshire, Devon, Dorset, Hampshire, Oxfordshire, Somerset, Wiltshire                                       82,931
South East (SE)       Kent, Surrey, Sussex                                                                                        38,698
Unlocalised           n/a                                                                                                         45,048
2. LAEME: Diachronic regional coverage
Subcorpora of 0–998 words (orange): disregarded
Subcorpora of 999–5,000 words (blue): included, but results could be unreliable
I II III IV V
N 0 37 585 372 64,088
WML 999 74,992 131,634 68,201 116
EML 51,980 26,616 2,594 24,543 35,367
SW 806 1,751 15,034 34,303 31,037
SE 0 4,049 727 3,223 30,699
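The inclusion rule can be written out as a small Python sketch. The word counts are those of the table above, while the names `coverage` and `classify` are illustrative assumptions.

```python
# Words per region and subperiod (I-V), copied from the table above.
coverage = {
    "N":   [0, 37, 585, 372, 64088],
    "WML": [999, 74992, 131634, 68201, 116],
    "EML": [51980, 26616, 2594, 24543, 35367],
    "SW":  [806, 1751, 15034, 34303, 31037],
    "SE":  [0, 4049, 727, 3223, 30699],
}

def classify(words: int) -> str:
    """Apply the size thresholds stated on the slide."""
    if words < 999:
        return "disregarded"
    if words <= 5000:
        return "unreliable"   # included, but results could be unreliable
    return "included"

status = {region: [classify(w) for w in counts]
          for region, counts in coverage.items()}

for region, flags in status.items():
    print(region, flags)
```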
3. Application of permutation testing: -HED
• Origin of present-day -HOOD
1. Old English -HAD, via [hɔd]
2. -HED: new form of uncertain origin emerging in Early Middle English,
eventually replaced by -HAD
• Modern remnants of variation: e.g. godhead vs. godhood
• -HED in LAEME:
• 83 types, 256 tokens
• spellings: <hed(e), ed(e), heedd, head, heid, hide>
                                I (1150–1190)   II (1190–1230)   III (1230–1270)   IV (1270–1310)   V (1310–1350)
tokens                          1               2                9                 56               188
normalised (per 10,000 words)   0.19            0.19             0.49              3.93             11.64
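The normalised figures follow from the token counts and the subperiod sizes given in section 2. A short Python sketch (variable and function names are assumptions) reproduces them:

```python
# Subperiod sizes in words (section 2) and -HED token counts (table above).
words = {"I": 53785, "II": 107478, "III": 183620, "IV": 142425, "V": 161493}
tokens = {"I": 1, "II": 2, "III": 9, "IV": 56, "V": 188}

def per_10k(token_count: int, word_count: int) -> float:
    """Normalised frequency per 10,000 words."""
    return token_count / word_count * 10_000

normalised = {period: round(per_10k(tokens[period], words[period]), 2)
              for period in words}

print(normalised)
```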
3.1. Productivity of -HED
• Strong increase in token frequency towards the mid-fourteenth century
• Subperiod V (1310–1350)
• high token frequency (V vs. IV: p < 0.0001)
• high type frequency (p < 0.001)
[Figure: chart by subperiod (I–V), vertical axis 0–12]
PT–Type V
types 69
‘typical’ type count 12–46
significance +
p value < 0.001
+ significantly high type count
⟶ high productivity in V
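The token-frequency comparison V vs. IV can be checked with a log-likelihood calculation. The sketch below assumes the two-corpus formula familiar from corpus linguistics (as in the log-likelihood calculator listed in the references) and the critical value 15.13 for p < 0.0001 at one degree of freedom; whether exactly this variant was used for the slide is an assumption.

```python
import math

def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """Two-corpus log-likelihood statistic for one item."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll

# -HED tokens and subperiod sizes: V (1310-1350) vs. IV (1270-1310).
ll_v_vs_iv = log_likelihood(188, 161493, 56, 142425)

CRITICAL_P_0001 = 15.13  # assumed critical value for p < 0.0001, 1 d.f.
print(ll_v_vs_iv, ll_v_vs_iv > CRITICAL_P_0001)
```

On these figures the statistic lies well above the assumed critical value, in line with the p < 0.0001 reported for V vs. IV.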
3.2. Regional variation in the emergence of -HED
• Earliest attestation in East Midlands (Peterborough Chronicle)
• Token frequency rises earlier in East Midlands than in West Midlands
• Data from smaller subcorpora reliable?
⟶ East Midlands more progressive than West Midlands?
I II III IV
EML normalised 0.19 0 3.86 4.07
(tokens) (1) (0) (1) (10)
WML normalised 0 0.27 0.38 4.25
(tokens) (0) (2) (5) (29)
3.2. Regional variation in the emergence of -HED
East Midlands
= no statistically significant deviation
⟶ Type frequency in East Midlands within typical range
PT–Type I II III IV
types 1 0 1 9
‘typical’ type count 1–29 0–23 0–9 0–22
significance = = = =
3.2. Regional variation in the emergence of -HED
West Midlands
– significantly low type count
⟶ Significant paucity of types in West Midlands between c.1190–1270
⟶ Limited productivity
PT–Type                I      II        III       IV
types                  0      1         3         12
‘typical’ type count   0–9    3–34      9–42      3–34
significance           =      –         –         =
p value                       < 0.05    < 0.05
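How the significance row is read off can be sketched in Python. The function below only compares an observed type count with the ‘typical’ range from the table (bounds treated as inclusive, which is an assumption); the range itself comes from the permutation software, and the labels `low`, `typical` and `high` stand in for –, = and +.

```python
def significance(types: int, typical_low: int, typical_high: int) -> str:
    """Flag a type count that falls outside the 'typical' range."""
    if types < typical_low:
        return "low"
    if types > typical_high:
        return "high"
    return "typical"

# West Midlands, subperiods I-IV: figures from the table above.
wml_types = [0, 1, 3, 12]
wml_typical = [(0, 9), (3, 34), (9, 42), (3, 34)]

wml_flags = [significance(count, low, high)
             for count, (low, high) in zip(wml_types, wml_typical)]

print(wml_flags)
```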
3.2. Regional variation in the emergence of -HED
⟶ East Midlands more progressive than West Midlands and other regions
(in the other regions -HED is first attested only in subperiods III and IV)
                          East Midlands                      West Midlands
Earliest attestation      c.1154 (Peterborough Chronicle)    c.1200 (Lambeth Homilies)
Rise in token frequency   from subperiod III (1230–1270)     from subperiod IV (1270–1310)
Type frequency            within typical range               significantly low in subperiods II and III (1190–1270)
Productivity              earlier                            lower / later
3.3. Lexical diversity of -REDEN
• 10 types, 117 tokens
• Bases: 9 denominal, 1 deadjectival
• Suffix eventually ceases to be productive; survives in kindred, hatred
⟶ Eventual decline preceded by decrease in lexical diversity towards
mid-fourteenth century
TTR (all bases) I II III IV V
types 5 4 5 5 6
tokens 16 14 23 21 42
TTR 0.31 0.29 0.22 0.24 0.14
3.3. Lexical diversity of -REDEN
⟶ Decrease in lexical diversity (cf. TTR) does not reach statistical
significance in permutation testing (PT–TTR)
Note: Permutation testing does not offer information on significance of
developments / variation falling within the ‘typical’ type count range.
Only cases of ‘extreme’ richness or paucity of types are registered.
PT–TTR I II III IV V
types 5 4 5 5 6
‘typical’ type count 2–6 2–6 3–7 2–7 5–9
significance = = = = =
TTR 0.31 0.29 0.22 0.24 0.14
4. Evaluation
Permutation testing helpful for studying productivity and lexical
diversity, accounting for diachronic developments and regional variation.
+ data relevant for interpretation of token frequencies
+ statistical significance for type frequencies and type/token ratios
– no information given on significance within ‘typical’ type count range
– threshold from which type counts begin to register as statistically
significant can be difficult to reach for smaller subcorpora
⟶ The new method is a useful complement to traditional measures.
References
Baayen, Harald. 2009. ‘Corpus linguistics in morphology: Morphological
productivity.’ In Corpus Linguistics: An International Handbook, vol. 2, ed.
Anke Lüdeling and Merja Kytö, 899–919. Berlin, New York: Mouton de
Gruyter.
Ciszek, Ewa. 2008. Word Derivation in Early Middle English. Frankfurt am
Main: Peter Lang.
Cowie, Claire, and Christiane Dalton-Puffer. 2002. ‘Diachronic word-
formation and studying changes in productivity over time: Theoretical and
methodological considerations.’ In A Changing World of Words: Studies in
English Historical Lexicography, Lexicology and Semantics, ed. Javier E.
Díaz Vera, 410–437. Amsterdam: Rodopi.
Dalton-Puffer, Christiane. 1996. The French Influence on Middle English
Morphology: A Corpus-Based Study of Derivation. Berlin, New York: Mouton
de Gruyter.
Dietz, Klaus. 2007. ‘Denominale Abstraktbildungen des Altenglischen: die
Wortbildung der Abstrakta auf -dōm, -hād, -lāc, -rǣden, -sceaft, -stæf
und -wist und ihrer Entsprechungen im Althochdeutschen und im
Altnordischen.’ In Beiträge zur Morphologie: Germanisch, Baltisch,
Ostseefinnisch, ed. Hans Fix, 97–172. Odense: North-Western European
Language Evolution.
LAEME = A Linguistic Atlas of Early Middle English, 1150–1325. Compiled
by Margaret Laing and Roger Lass. 2007 / December 2008 (Version 2.1).
University of Edinburgh. http://www.lel.ed.ac.uk/ihd/laeme1/laeme1.html
(repeated access).
Oxford English Dictionary Online. 2010. 3rd edition. http://www.oed.com
(repeated access).
Rayson, Paul. Log-likelihood calculator. http://ucrel.lancs.ac.uk/llwizard.html
(repeated access).
Säily, Tanja. 2011. ‘Variation in morphological productivity in the BNC:
Sociolinguistic and methodological considerations.’ Corpus Linguistics and
Linguistic Theory 7 (1): 119–141.
Säily, Tanja, and Jukka Suomela. 2009. ‘Comparing type counts: The case
of women, men and -ity in early English letters.’ In Corpus Linguistics:
Refinements and Reassessments, ed. Antoinette Renouf and Andrew
Kehoe, 87–109. Amsterdam: Rodopi.
Suomela, Jukka. 2007. Type and hapax accumulation curves.
http://www.cs.helsinki.fi/u/josuomel/types/ (repeated access).
Trips, Carola. 2009. Lexical Semantics and Diachronic Morphology: The
Development of -hood, -dom and -ship in the History of English. Tübingen:
Max Niemeyer.
Contact
Anne Gardner, M.A.
English Department
University of Zurich
Plattenstrasse 47
CH-8032 Zürich
+41 44 634 36 93