  • Variation in morphological productivity in the BNC: Sociolinguistic and methodological considerations

    Tanja Säily, University of Helsinki, 9 October 2009

    In collaboration with Dr. Jukka Suomela, Helsinki Institute for Information Technology HIIT

  • Introduction

    -ness and -ity
      Roughly synonymous suffixes
      Typically form abstract nouns from adjectives: productive → productiveness, productivity

    Sociolinguistics
      Do men and women use these suffixes differently in present-day English?

    Methodology
      Are hapax-based productivity measures valid?

    9 October 2009, Tanja Säily, Variation in morphological productivity in the BNC 2

  • Material

    British National Corpus (BNC)
      100 million words: ~90% written, ~10% spoken

    Demographically sampled spoken component (BNC-DS)
      4.2 million words from early 1990s
      Gender known for 88% of the data, social class for 62% (2.6 million words)

    Written component (BNC-W)
      88 million words, 1960s–1990s
      Gender known for 51% of the data (45 Mw)

  • Methods

    How to measure productivity?
      Count the number of different words (types)
      Count the number of words occurring only once (hapax legomena, or hapaxes)
        - Approximating ‘new’ words

    Comparing type counts from subcorpora
      Normalisation problematic, establishing statistical significance likewise
      Permutation testing: take samples in random order and see how types accumulate, 1M times
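
    The permutation test outlined above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function names, the data layout (one `(running_words, suffix_words)` pair per text or speaker) and the stopping rule are all assumptions made for the example. Random selections stop just short of the subcorpus size, which errs on the side of larger p-values when testing for a low type count.

```python
import random


def type_count(samples):
    """Number of distinct suffixed words (types) in a list of samples."""
    types = set()
    for _, words in samples:
        types.update(words)
    return len(types)


def permutation_p_value(samples, is_target, n_perm=10_000, seed=0):
    """Estimate how often randomly ordered samples, accumulated up to the size
    of the target subcorpus, yield at most as many types as that subcorpus.

    samples   -- list of (running_words, suffix_words) pairs (assumed layout)
    is_target -- list of booleans marking the subcorpus of interest
    """
    rng = random.Random(seed)
    target = [s for s, flag in zip(samples, is_target) if flag]
    target_size = sum(n for n, _ in target)
    observed = type_count(target)

    order = list(samples)
    at_most = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        size, types = 0, set()
        for n, words in order:
            if size + n > target_size:
                break  # never use more running words than the subcorpus has
            size += n
            types.update(words)
        if len(types) <= observed:
            at_most += 1
    return observed, at_most / n_perm


# Toy data, invented for illustration: six samples of 100 running words each.
toy_samples = [
    (100, ["ability"]), (100, ["ability"]),
    (100, ["reality", "density"]), (100, ["quality", "purity"]),
    (100, ["clarity", "rarity"]), (100, ["vanity", "sanity"]),
]
toy_flags = [True, True, False, False, False, False]
print(permutation_p_value(toy_samples, toy_flags, n_perm=2_000))
```

    A small estimated proportion suggests that the subcorpus has fewer types than random selections of comparable size would usually have. More permutations give a more stable estimate.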

  • [Figure: -ity types vs. running words, CEEC. x-axis: running words, 0 to 1,200,000; y-axis: -ity types, 0 to 200. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: m, f.]

  • Sociolinguistics: Related work

    Productivity of -ity significantly low in 17th-century letters written by women
      Corpus of Early English Correspondence (CEEC), Säily & Suomela (2009)
      -ity ‘learned’, etymologically foreign; women less well educated than men → less able to use -ity?

    Women favour pronouns over common nouns
      Rayson et al. 1997 (BNC-DS), Argamon et al. 2003 (BNC-W), Säily et al. forthcoming (CEEC)

  • Sociolinguistics: BNC-DS

    Productivity of both -ity and -ness significantly low in women’s speech
      Expected result
        - Women’s style more interactive
      -ity: difference just about significant
      -ness: gender difference tied to social class


  • [Figure: -ity types vs. running words, BNC-DS. x-axis: running words, 0 to 2,500,000; y-axis: -ity types, 0 to 70. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • [Figure: -ness types vs. running words, BNC-DS. x-axis: running words, 0 to 2,500,000; y-axis: -ness types, 0 to 70. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f C2+DE, m C2+DE.]

  • Sociolinguistics: BNC-W

    Productivity of -ity (but not -ness) significantly low in women’s writing
      Holds for both imaginative (BNC-Wimag) and informative (BNC-Winf) texts
      Result for -ity expected; negative result for -ness requires more research
      Semantics of -ness? ‘Embodied attribute/trait’ goes well with interactive writing style
        - Could also apply to 17th-century results


  • [Figure: -ity types vs. running words, BNC-Wimag. x-axis: running words, 0 to 15,000,000; y-axis: -ity types, 0 to 700. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • [Figure: -ness types vs. running words, BNC-Wimag. x-axis: running words, 0 to 15,000,000; y-axis: -ness types, 0 to 1,000. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • [Figure: -ity types vs. running words, BNC-Winf. x-axis: running words, 0 to 25,000,000; y-axis: -ity types, 0 to 1,500. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • [Figure: -ness types vs. running words, BNC-Winf. x-axis: running words, 0 to 25,000,000; y-axis: -ness types, 0 to 1,500. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • Methodology: Related work

    Baayen (e.g., 1993)
      Category-conditioned degree of productivity: P = n1/N
      Hapax-conditioned degree of productivity: P* = n1/h (or, within the same corpus, just n1)

    CEEC: hapax accumulation curves (Säily & Suomela 2009)
      Confidence intervals too wide
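
    Baayen's two measures listed above can be computed from plain frequency counts, with n1 the number of hapaxes carrying the suffix, N the number of tokens carrying the suffix, and h the number of hapaxes in the whole corpus. The sketch below is only illustrative: `productivity_measures` is a hypothetical helper, and matching words by their ending is a crude assumption that stands in for the manual filtering a real study needs (it would, for instance, count "city" as an -ity formation).

```python
from collections import Counter


def productivity_measures(tokens, suffix):
    """Return type count, n1, N, h, P = n1/N and P* = n1/h for one suffix.

    tokens -- list of word forms from a single corpus (assumed pre-tokenised)
    suffix -- word ending to match, e.g. "ity" (crude string match, see above)
    """
    freq = Counter(tokens)
    suffixed = {w: c for w, c in freq.items() if w.endswith(suffix)}

    n1 = sum(1 for c in suffixed.values() if c == 1)  # hapaxes with the suffix
    big_n = sum(suffixed.values())                    # tokens with the suffix
    h = sum(1 for c in freq.values() if c == 1)       # all hapaxes in the corpus

    return {
        "types": len(suffixed),
        "n1": n1,
        "N": big_n,
        "h": h,
        "P": n1 / big_n if big_n else 0.0,
        "P_star": n1 / h if h else 0.0,
    }


# Toy input, invented for illustration only.
toy_tokens = "the reality of the ability and the reality of density".split()
print(productivity_measures(toy_tokens, "ity"))
```

    Returning n1 alongside P* reflects the note above that, within one corpus, h is constant and n1 alone carries the comparison.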

  • [Figure: -ity hapaxes vs. running words, CEEC. x-axis: running words, 0 to 1,200,000; y-axis: -ity hapaxes, 0 to 70. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: m, f.]

  • Methodology: BNC study

    BNC-W: hapax accumulation curves
      More data → narrower confidence intervals
        - Results look similar to type accumulation curves but less significant
      However, the number of hapaxes does not grow linearly with either corpus size or the number of suffix tokens
        - Comparing P figures can be unreliable unless the sizes of the subcorpora / numbers of suffix tokens are of a similar magnitude
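
    The point about non-linear growth can be made concrete by tracking the number of hapaxes, and the resulting P, as suffix tokens accumulate. The sketch below uses invented, Zipf-like toy data rather than BNC counts; `hapax_curve` and `toy_suffix_tokens` are hypothetical names introduced for the example.

```python
import random
from collections import Counter


def hapax_curve(suffix_tokens, step):
    """Return (N, n1, P) after every `step` suffix tokens, where N is the
    number of tokens seen so far, n1 the number of hapaxes among them and
    P = n1 / N."""
    counts = Counter()
    hapaxes = 0
    curve = []
    for i, word in enumerate(suffix_tokens, start=1):
        counts[word] += 1
        if counts[word] == 1:
            hapaxes += 1  # first occurrence: a new hapax
        elif counts[word] == 2:
            hapaxes -= 1  # second occurrence: no longer a hapax
        if i % step == 0:
            curve.append((i, hapaxes, hapaxes / i))
    return curve


def toy_suffix_tokens(n_tokens, n_types=2000, seed=0):
    """Draw invented word IDs with Zipf-like weights (1/rank); not real data."""
    rng = random.Random(seed)
    ranks = range(1, n_types + 1)
    weights = [1 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)


for n, n1, p in hapax_curve(toy_suffix_tokens(20_000), step=5_000):
    print(f"N={n:>6}  hapaxes={n1:>4}  P={p:.4f}")
```

    On skewed data of this kind, P typically falls as N grows, so P values taken from a small and a large subcorpus are not directly comparable.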


  • [Figure: -ity hapaxes vs. running words, BNC-Winf. x-axis: running words, 0 to 25,000,000; y-axis: -ity hapaxes, 0 to 600. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • [Figure: -ity hapaxes vs. suffix tokens, BNC-Winf. x-axis: suffix tokens, 0 to 150,000; y-axis: -ity hapaxes, 0 to 600. Legend: p 0.0001, p 0.001, p 0.01, p 0.1. Plotted groups: f, m.]

  • Conclusion

    There can be sociolinguistic variation in morphological productivity

    There seem to be gendered speech styles and writing styles in English (possibly relatively stable over centuries)

    There is no perfect solution for measuring productivity as of yet


  • References

    Argamon, S., M. Koppel, J. Fine & A.R. Shimoni. 2003. Gender, genre, and writing style in formal written texts. Text 23(3): 321–346.

    Baayen, R.H. 1993. On frequency, transparency and productivity. Yearbook of Morphology 1992, ed. by G. Booij & J. van Marle. Dordrecht: Kluwer Academic Publishers, 181–208.

    BNC = The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/

    CEEC = Corpus of Early English Correspondence. 1998. Compiled by T. Nevalainen, H. Raumolin-Brunberg, J. Keränen, M. Nevala, A. Nurmi & M. Palander-Collin at the Department of English, University of Helsinki.


  • References (cont.)

    Rayson, P., G. Leech & M. Hodges. 1997. Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2(1): 133–152.

    Säily, T., T. Nevalainen & H. Siirtola. Forthcoming. Variation in noun and pronoun frequencies in a historical corpus.

    Säily, T. & J. Suomela. 2009. Comparing type counts: The case of women, men and -ity in early English letters. Corpus Linguistics: Refinements and Reassessments (Language and Computers: Studies in Practical Linguistics 69), ed. by A. Renouf & A. Kehoe. Amsterdam: Rodopi, 87–109.
