English Department

Measures of Productivity and Lexical Diversity: Suffixation (c.1150–1350)

The development of abstract noun derivations in different regions and text types (c.1150–1700)

Anne Gardner

01/10/11

Outline

1. Methodology: Statistical measures for productivity and lexical diversity
2. Linguistic Atlas of Early Middle English (LAEME)
3. Application of permutation testing
   3.1. Productivity of -HED
   3.2. Regional variation in the emergence of -HED
   3.3. Lexical diversity of -REDEN
4. Evaluation

1. Methodology: Overview of statistical measures

• ‘Traditional’ measures of productivity:
  • Token frequency
  • Type frequency
  • [Hapax legomena]
  • Type/token ratio
• More recent measure of lexical richness and productivity by Säily & Suomela: permutation testing involving type accumulation curves

1. Methodology: Types and tokens

• Types: number of distinct derivatives

• Tokens: number of attestations

Type                   Tokens   Spellings attested in LAEME
falsehood              20       FALSEHED, FALSHEDE, FALSHED, UALSHEDE(S), FALSED(E)
boldhood               2        BOLDHEDE
kindred                16       KINREDEN, KINRADEN, KENREDE, KUNREDE(N), KUNREDNES, KUNRADE,
                                CUNREDE(N), CUNREDES, CUNRADEN, CUNREADNES
foereden ‘hostility’   1        FAREDEN

1. Methodology: Token and type frequencies

Token frequency
• Normalised frequencies (per 10,000 words) enable comparisons across subcorpora and highlight regional variation (e.g. progressive or conservative tendencies)
• Test for statistical significance: e.g. log likelihood

Type frequency
• Productive word-formation processes produce new types
• Type accumulation indicative of past productivity?
• Figures cannot be normalised ⟶ comparisons between suffixes and regions difficult
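
A minimal sketch of the two token-frequency calculations named above: normalisation per 10,000 words and a two-sample log-likelihood statistic. The function names are illustrative, not taken from any tool used in the study. The log-likelihood follows the common two-corpus formulation, and 15.13 is the standard critical value for p < 0.0001 at one degree of freedom. The input figures are the -HED token counts and subcorpus sizes for subperiods IV and V reported in these slides.

```python
import math


def normalised_frequency(tokens, corpus_words, per=10_000):
    """Token frequency scaled to a fixed number of running words."""
    return tokens / corpus_words * per


def log_likelihood(tokens_a, words_a, tokens_b, words_b):
    """Two-sample log-likelihood (G2) comparing token frequencies."""
    total_tokens = tokens_a + tokens_b
    total_words = words_a + words_b
    g2 = 0.0
    for observed, words in ((tokens_a, words_a), (tokens_b, words_b)):
        # Expected count if both subcorpora shared one underlying rate.
        expected = words * total_tokens / total_words
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# -HED in LAEME: (tokens, words) for subperiods IV and V, as given in the slides.
hed_iv = (56, 142_425)
hed_v = (188, 161_493)

print(round(normalised_frequency(*hed_iv), 2))  # 3.93
print(round(normalised_frequency(*hed_v), 2))   # 11.64
print(log_likelihood(*hed_v, *hed_iv) > 15.13)  # True, i.e. p < 0.0001
```

Type counts have no counterpart to `normalised_frequency`: they grow non-linearly with corpus size, so dividing by the word count does not make them comparable.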

1. Methodology: Type/token ratio (TTR)

• Gauging the lexical diversity of a word formation process:
  • lower TTR ⟶ lower lexical diversity ⟶ lower productivity?
  • higher TTR ⟶ higher lexical diversity ⟶ higher productivity?
• Figures of different suffixes, or for different regions, cannot easily be compared as TTR depends on token frequency and corpus size

           Types   Tokens   TTR
Suffix A   1       2        0.5
Suffix B   20      50       0.4
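
The arithmetic behind the table, and the reason TTRs from samples of different sizes are hard to compare, can be sketched as follows. The function names and the short attestation sequence are hypothetical and serve only to illustrate the point. The sketch shows that the ratio tends to fall as tokens accumulate, because repeated types add tokens without adding types.

```python
def type_token_ratio(types, tokens):
    """Number of distinct derivatives divided by number of attestations."""
    if tokens <= 0:
        raise ValueError("TTR is undefined without tokens")
    return types / tokens


def cumulative_ttr(attestations):
    """TTR after each successive attestation in a sequence of derivatives."""
    seen = set()
    curve = []
    for position, derivative in enumerate(attestations, start=1):
        seen.add(derivative)
        curve.append(len(seen) / position)
    return curve


print(type_token_ratio(1, 2))    # 0.5  (Suffix A)
print(type_token_ratio(20, 50))  # 0.4  (Suffix B)

# Hypothetical order of attestations, for illustration only.
sample = ["falsehood", "boldhood", "falsehood", "falsehood", "godhead", "falsehood"]
print([round(x, 2) for x in cumulative_ttr(sample)])
# [1.0, 1.0, 0.67, 0.5, 0.6, 0.5]
```

Suffix A has the higher TTR although it is attested only twice, which is why the raw ratio alone says little about productivity.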

1. Methodology: Permutation testing (PT)

• Säily and/or Suomela (2007, 2009, 2011)
• Investigation of type richness: likelihood of encountering unusually high or low type counts (⟶ levels of statistical significance)
  a) type count in relation to (sub)corpus size ⟶ type frequency
  b) type count in relation to tokens in (sub)corpus ⟶ type/token ratio
• Software:
  • computing of type accumulation curves on the basis of random reorderings of the corpus
  • output: probability tables
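
The idea of estimating significance from random reorderings can be sketched in a few lines. This is a simplified Monte Carlo illustration, not the software by Suomela referred to above. It assumes that whole texts are shuffled as units, that each text is represented as a word count plus the derivatives it contains, and that the toy corpus and all names are hypothetical.

```python
import random


def types_at_size(texts, target_words):
    """Distinct types seen once `target_words` running words are reached."""
    seen, words = set(), 0
    for n_words, derivatives in texts:
        if words >= target_words:
            break
        words += n_words
        seen.update(derivatives)
    return len(seen)


def permutation_p_values(texts, target_words, observed_types,
                         n_perm=2000, seed=0):
    """Share of random reorderings with a type count at most / at least
    as large as the observed one at the given subcorpus size."""
    rng = random.Random(seed)
    order = list(texts)
    at_most = at_least = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        count = types_at_size(order, target_words)
        at_most += count <= observed_types
        at_least += count >= observed_types
    return at_most / n_perm, at_least / n_perm


# Hypothetical corpus: (words in text, derivatives attested in that text).
toy_corpus = [
    (1000, ["falsehed", "boldhede"]),
    (1500, []),
    (800, ["falsehed"]),
    (1200, ["godhed", "manhed", "falsehed"]),
    (900, []),
]

p_low, p_high = permutation_p_values(toy_corpus, target_words=2000,
                                     observed_types=0)
print(p_low, p_high)
```

A small `p_low` would mark the observed type count as unusually low for a subcorpus of that size, a small `p_high` as unusually high. Accumulating the suffix's tokens instead of running words would give the type/token variant in b).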

1. Methodology: Permutation testing (PT–Type)

a) PT–Type: Type count in relation to (sub)corpus size

         low type count                     high type count
Words    0.0001  0.001  0.01  0.05  0.1     0.1  0.05  0.01  0.001  0.0001
0        0       0      0     0     0       8    10    28    28     29
2,606    0       0      0     0     0       9    12    28    29     32
…        …       …      …     …     …       …    …     …     …      …
52,113   0       0      0     0     1       29   33    39    53     57
54,718   0       0      0     1     1       30   33    40    54     58
…        …       …      …     …     …       …    …     …     …      …
72,958   0       0      0     1     2       33   37    49    57     61
75,563   0       0      1     2     3       34   47    49    58     62
…        …       …      …     …     …       …    …     …     …      …

1. Methodology: Permutation testing (PT–TTR)

b) PT–TTR: Type count in relation to tokens in (sub)corpus

         low type count                     high type count
Tokens   0.0001  0.001  0.01  0.05  0.1     0.1  0.05  0.01  0.001  0.0001
0        0       0      0     0     0       10   22    28    28     28
1        0       0      0     0     0       12   23    28    29     29
…        …       …      …     …     …       …    …     …     …      …
31       0       0      0     1     3       34   37    41    50     52
32       0       0      0     2     3       34   37    42    51     53
…        …       …      …     …     …       …    …     …     …      …
40       0       0      0     2     6       37   40    48    54     57
41       0       0      0     2     6       37   40    48    55     57
…        …       …      …     …     …       …    …     …     …      …

1. Methodology: Permutation testing

Minimum requirements a subcorpus has to meet in order for its type count to register as significantly high or low (starting at p < 0.05):

Example: -HED

⟶ thresholds could be too high for smaller subcorpora

PT–Type          low type count         high type count
subcorpus size   52,114–54,718 words    1–2,606 words
types            0                      > 12

PT–TTR           low type count         high type count
tokens           32                     41
types            1                      41
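
How such thresholds follow from a row of a probability table can be sketched as a simple lookup. The reading assumed here is inferred from the figures on these slides rather than from the software's documentation: a type count below the bound in a low-side column, or above the bound in a high-side column, registers as significant at that column's level. The function and constant names are hypothetical. The two rows are the PT–Type rows for 2,606 and 54,718 words shown for -HED.

```python
# Significance levels, strictest first.
P_LEVELS = (0.0001, 0.001, 0.01, 0.05, 0.1)


def classify(type_count, low_bounds, high_bounds):
    """Return ("low" | "high", p) for the strictest level reached,
    or ("typical", None) if the count lies within all bounds."""
    for p, low, high in zip(P_LEVELS, low_bounds, high_bounds):
        if type_count < low:
            return "low", p
        if type_count > high:
            return "high", p
    return "typical", None


# PT–Type rows for -HED, reordered so that both tuples follow P_LEVELS.
LOW_2606, HIGH_2606 = (0, 0, 0, 0, 0), (32, 29, 28, 12, 9)
LOW_54718, HIGH_54718 = (0, 0, 0, 1, 1), (58, 54, 40, 33, 30)

print(classify(13, LOW_2606, HIGH_2606))   # ('high', 0.05)
print(classify(0, LOW_54718, HIGH_54718))  # ('low', 0.05)
print(classify(5, LOW_2606, HIGH_2606))    # ('typical', None)
```

Under this reading, a subcorpus of up to 2,606 words needs more than 12 types to count as significantly rich at p < 0.05, and only from 52,114 words onwards can a count of 0 types register as significantly low.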

2. Linguistic Atlas of Early Middle English (LAEME)

• Version 2.1 (December 2008): 648,801 words in 167 corpus files

• Implemented subperiods:

Subperiods        Words     LAEME datings
I    1150–1190    53,785    C12b1, C12b2
II   1190–1230    107,478   C12b2–C13a1, C13a1, C13a; c.1200
III  1230–1270    183,620   C13a2, C13a2–b1, C13b1, C13; c.1250
IV   1270–1310    142,425   C13b, C13b2, C13b2–C14a1; c.1300
V    1310–1350    161,493   C14a, C14a1, C14a2

(C = century, a/b = first/second half, 1/2 = first/second quarter)

2. LAEME: Regional coverage

Region                Counties / Localisation                          Words
North (N)             Cumberland, Durham, Lancashire, Yorkshire        65,082
West Midlands (WML)   Cheshire, Gloucestershire, Herefordshire,        275,942
                      Shropshire, Staffordshire, Warwickshire,
                      Worcestershire
East Midlands (EML)   Cambridgeshire, Essex, Huntingdonshire,          141,100
                      Leicestershire, Lincolnshire, Norfolk,
                      Northamptonshire, Suffolk; London
South West (SW)       Berkshire, Devon, Dorset, Hampshire,             82,931
                      Oxfordshire, Somerset, Wiltshire
South East (SE)       Kent, Surrey, Sussex                             38,698
Unlocalised           n/a                                              45,048

2. LAEME: Diachronic regional coverage

orange: 0–998 words (disregarded)
blue: 999–5,000 words (included, but results could be unreliable)

       I        II       III       IV       V
N      0        37       585       372      64,088
WML    999      74,992   131,634   68,201   116
EML    51,980   26,616   2,594     24,543   35,367
SW     806      1,751    15,034    34,303   31,037
SE     0        4,049    727       3,223    30,699

3. Application of permutation testing: -HED

• Origin of present-day -HOOD
  1. Old English -HAD, via [hɔd]
  2. -HED: new form of uncertain origin emerging in Early Middle English, eventually replaced by -HAD
• Modern remnants of variation: e.g. godhead vs. godhood
• -HED in LAEME:
  • 83 types, 256 tokens
  • spellings: <hed(e), ed(e), heedd, head, heid, hide>

             I             II            III           IV            V
             (1150–1190)   (1190–1230)   (1230–1270)   (1270–1310)   (1310–1350)
tokens       1             2             9             56            188
normalised   0.19          0.19          0.49          3.93          11.64

3.1. Productivity of -HED

• Strong increase in token frequency towards the mid-fourteenth century

• Subperiod V (1310–1350)

• high token frequency (V vs. IV: p < 0.0001)

• high type frequency (p < 0.001)

[Chart: -HED by subperiod; x-axis I–V, y-axis 0–12]

PT–Type                V
types                  69
‘typical’ type count   12–46
significance           +
p value                < 0.001

+ significantly high type count

⟶ high productivity in V

3.2. Regional variation in the emergence of -HED

• Earliest attestation in East Midlands (Peterborough Chronicle)

• Token frequency rises earlier in East Midlands than in West Midlands

• Data from smaller subcorpora reliable?

⟶ East Midlands more progressive than West Midlands?

                 I      II     III    IV
EML normalised   0.19   0      3.86   4.07
    (tokens)     (1)    (0)    (1)    (10)
WML normalised   0      0.27   0.38   4.25
    (tokens)     (0)    (2)    (5)    (29)


East Midlands

= no statistically significant deviation

⟶ Type frequency in East Midlands within typical range

PT–Type                I      II     III   IV
types                  1      0      1     9
‘typical’ type count   1–29   0–23   0–9   0–22
significance           =      =      =     =


West Midlands

– significantly low type count

⟶ Significant paucity of types in West Midlands between c.1190–1270

⟶ Limited productivity

PT–Type                I     II       III      IV
types                  0     1        3        12
‘typical’ type count   0–9   3–34     9–42     3–34
significance           =     –        –        =
p value                      < 0.05   < 0.05


⟶ East Midlands more progressive than West Midlands and other regions (where -HED first appears in subperiods III and IV)

                          East Midlands                East Midlands vs. West Midlands: West Midlands

3.3. Lexical diversity of -REDEN

• 10 types, 117 tokens

• Bases: 9 denominal, 1 deadjectival

• Suffix eventually ceases to be productive; survives in kindred, hatred

⟶ Eventual decline preceded by decrease in lexical diversity towards mid-fourteenth century

TTR (all bases)   I      II     III    IV     V
types             5      4      5      5      6
tokens            16     14     23     21     42
TTR               0.31   0.29   0.22   0.24   0.14

3.3. Lexical diversity of -REDEN

⟶ Decrease in lexical diversity (cp. TTR) does not register on a significant level in permutation testing (PT–TTR)

Note: Permutation testing does not offer information on significance of developments / variation falling within the ‘typical’ type count range. Only cases of ‘extreme’ richness or paucity of types are registered.

PT–TTR                 I      II     III    IV     V
types                  5      4      5      5      6
‘typical’ type count   2–6    2–6    3–7    2–7    5–9
significance           =      =      =      =      =
TTR                    0.31   0.29   0.22   0.24   0.14

4. Evaluation

Permutation testing is helpful for studying productivity and lexical diversity, accounting for diachronic developments and regional variation.

+ data relevant for interpretation of token frequencies
+ statistical significance for type frequencies and type/token ratios
– no information given on significance within ‘typical’ type count range
– threshold from which type counts begin to register as statistically significant can be difficult to reach for smaller subcorpora

⟶ The new method is a useful complement to traditional measures.

References

Baayen, Harald. 2009. ‘Corpus linguistics in morphology: Morphological productivity.’ In Corpus Linguistics: An International Handbook, vol. 2, ed. Anke Lüdeling and Merja Kytö, 899–919. Berlin, New York: Mouton de Gruyter.

Ciszek, Ewa. 2008. Word Derivation in Early Middle English. Frankfurt am Main: Peter Lang.

Cowie, Claire, and Christiane Dalton-Puffer. 2002. ‘Diachronic word-formation and studying changes in productivity over time: Theoretical and methodological considerations.’ In A Changing World of Words: Studies in English Historical Lexicography, Lexicology and Semantics, ed. Javier E. Díaz Vera, 410–437. Amsterdam: Rodopi.

Dalton-Puffer, Christiane. 1996. The French Influence on Middle English Morphology: A Corpus-Based Study of Derivation. Berlin, New York: Mouton de Gruyter.

Dietz, Klaus. 2007. ‘Denominale Abstraktbildungen des Altenglischen: die Wortbildung der Abstrakta auf -dōm, -hād, -lāc, -rǣden, -sceaft, -stæf und -wist und ihrer Entsprechungen im Althochdeutschen und im Altnordischen.’ In Beiträge zur Morphologie: Germanisch, Baltisch, Ostseefinnisch, ed. Hans Fix, 97–172. Odense: North-Western European Language Evolution.

LAEME = A Linguistic Atlas of Early Middle English, 1150–1325. Compiled by Margaret Laing and Roger Lass. 2007 / December 2008 (Version 2.1). University of Edinburgh. http://www.lel.ed.ac.uk/ihd/laeme1/laeme1.html (repeated access).

Oxford English Dictionary Online. 2010. 3rd edition. http://www.oed.com (repeated access).

Rayson, Paul. Log-likelihood calculator. http://ucrel.lancs.ac.uk/llwizard.html (repeated access).

Säily, Tanja. 2011. ‘Variation in morphological productivity in the BNC: Sociolinguistic and methodological considerations.’ Corpus Linguistics and Linguistic Theory 7 (1): 119–141.

Säily, Tanja, and Jukka Suomela. 2009. ‘Comparing type counts: The case of women, men and -ity in early English letters.’ In Corpus Linguistics: Refinements and Reassessments, ed. Antoinette Renouf and Andrew Kehoe, 87–109. Amsterdam: Rodopi.

Suomela, Jukka. 2007. Type and hapax accumulation curves. http://www.cs.helsinki.fi/u/josuomel/types/ (repeated access).

Trips, Carola. 2009. Lexical Semantics and Diachronic Morphology: The Development of -hood, -dom and -ship in the History of English. Tübingen: Max Niemeyer.

Contact

Anne Gardner, M.A.

English Department

University of Zurich

Plattenstrasse 47

CH-8032 Zürich

[email protected]

+41 44 634 36 93
