Post on 01-Apr-2015


Page 1: Simple Statistics for Corpus Linguistics Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk

Simple Statistics for Corpus Linguistics

Sean Wallis
Survey of English Usage
University College London

s.wallis@ucl.ac.uk

Page 2: Outline

• Numbers…

• A simple research question
– do women speak or write more than men in ICE-GB?
– p = proportion = probability

• Another research question
– what happens to speakers’ use of modal shall vs. will over time?
– the idea of inferential statistics
– plotting confidence intervals

• Concluding remarks

Page 4: Numbers...

• We are used to concepts like these being expressed as numbers:
– length (distance, height)
– area
– volume
– temperature
– wealth (income, assets)

• We are going to discuss another concept:
– probability
  • proportion, percentage
– a simple idea, at the heart of statistics

Page 11: Probability

• Based on another, even simpler, idea:
– probability p = x / n
– e.g. the probability that the speaker says will instead of shall

• where
– frequency x (often, f ) is
  • the number of times something actually happens
  • the number of hits in a search
  • e.g. cases of will
– baseline n is
  • the number of times something could happen
  • the number of hits in a more general search, or in several alternative patterns (‘alternate forms’)
  • e.g. total: will + shall

• Probability can range from 0 to 1
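The definition p = x / n can be sketched in a few lines of Python. This is only an illustration: the function name `proportion` is a hypothetical choice, and the counts are the LLC figures for shall and will that appear later in these slides.

```python
def proportion(x: int, n: int) -> float:
    """Return p = x / n, the observed probability of an event."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n

# Illustrative counts: the LLC figures given later in the slides.
cases_of_will = 78
cases_of_shall = 110
baseline = cases_of_will + cases_of_shall  # total: will + shall

p_will = proportion(cases_of_will, baseline)
print(f"p(will | {{will, shall}}) = {p_will:.3f}")
```

Because x can never exceed n, the result always falls in the range 0 to 1.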

Page 16: What can a corpus tell us?

• A corpus is a source of knowledge about language:
– corpus
– introspection/observation/elicitation
– controlled laboratory experiment
– computer simulation
How do these differ in what they might tell us?

• A corpus is a sample of language, varying by:
– source (e.g. speech vs. writing, age...)
– levels of annotation (e.g. parsing)
– size (number of words)
– sampling method (random sample?)
How does this affect the types of knowledge we might obtain?

Page 21: What can a parsed corpus tell us?

• Three kinds of evidence may be found in a parsed corpus:
– Frequency evidence of a particular known rule, structure or linguistic event (How often?)
– Factual evidence of new rules, etc. (How novel?)
– Interaction evidence of relationships between rules, structures and events (Does X affect Y?)

• Lexical searches may also be made more precise using the grammatical analysis

Page 22: A simple research question

• Let us consider the following question:

• Do women speak or write more words than men in the ICE-GB corpus?

• What do you think?

• How might we find out?

Page 26: Let’s get some data

• Open ICE-GB with ICECUP
– Text Fragment query for words:
  • “*+<{~PUNC,~PAUSE}>”
  • counts every word, excluding pauses and punctuation
– Variable query:
  • TEXT CATEGORY = spoken, written
– Variable query:
  • SPEAKER GENDER = f, m, <unknown>
– Combine these 3 queries:

          F        M        <unknown>  TOTAL
TOTAL     275,999  667,934  93,355     1,037,288
spoken    174,499  439,741  1,076      615,316
written   101,500  228,193  92,279     421,972
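As a rough sketch of how the proportion of words produced by women could be worked out from the table above, the snippet below divides the female count by the total of known-gender words in each row. The dictionary layout and the name `p_female` are illustrative assumptions, not part of ICECUP.

```python
# Word counts from the ICE-GB table above (rows: text category; columns: gender).
counts = {
    "TOTAL":   {"f": 275_999, "m": 667_934, "unknown": 93_355},
    "spoken":  {"f": 174_499, "m": 439_741, "unknown": 1_076},
    "written": {"f": 101_500, "m": 228_193, "unknown": 92_279},
}

def p_female(row: dict) -> float:
    """Proportion of words produced by women, out of words of known gender."""
    known = row["f"] + row["m"]  # <unknown> is excluded from the baseline
    return row["f"] / known

proportions = {category: p_female(row) for category, row in counts.items()}
for category, p in proportions.items():
    print(f"{category:8s} p(female) = {p:.3f}")
```

Excluding <unknown> from the baseline matters most for the written texts, where a large share of the words have no recorded author gender.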

Page 28: ICE-GB: gender / written-spoken

• Proportion of words in each category spoken/written by women and men
– The authors of some texts are unspecified
– Some written material may be jointly authored
– female/male ratio varies slightly

[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers or writers, for TOTAL, spoken and written]

p (female) = words spoken by women / total words (excluding <unknown>)

Page 30: p = Probability = Proportion

• We asked ourselves the following question:
– Do women speak or write more words than men in the ICE-GB corpus?
– To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)

• The proportion of words produced by women can also be thought of as a probability:
– What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?

Page 31: Another research question

• Let us consider the following question:

• What happens to modal shall vs. will over time in British English?
– Does shall increase or decrease?

• What do you think?

• How might we find out?

Page 33: Let’s get some data

• Open DCPSE with ICECUP
– FTF query for first person declarative shall:
  • repeat for will
– Corpus Map:
  • DATE
– Do the first set of queries and then drop into Corpus Map

Page 36: Modal shall vs. will over time

• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)

[Figure: p(shall | {shall, will}) plotted by year, 1955 to 1995, on a scale from 0.0 (shall = 0%) to 1.0 (shall = 100%) (Aarts et al. 2013)]

Is shall going up or down?

Page 39: Is shall going up or down?

• Whenever we look at change, we must ask ourselves two things:

What is the change relative to?
– Is our observation higher or lower than we might expect?
  • In this case we ask: does shall decrease relative to shall + will?

How confident are we in our results?
– Is the change big enough to be reproducible?

Pages 40-45: The idea of a confidence interval

• All observations are imprecise
– Randomness is a fact of life
– Our abilities are finite:
  • to measure accurately or
  • reliably classify into types

• We need to express caution in citing numbers

• Example (from Levin 2013):
– 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Really? Not 77.28, or 77.26?
– 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Sounds defensible. But how confident can we be in this number?
– 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning
  Finally we have a credible range of values - needs a footnote* to explain how it was calculated.

Page 49: The ‘sample’ and the ‘population’

• We said that the corpus was a sample

• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
– We asked questions about the sample
– The answers were statements of fact

• Now we are asking about “British English”
– We want to draw an inference
  • from the sample (in this case, DCPSE)
  • to the population (similarly-sampled BrE utterances)
– This inference is a best guess
– This process is called inferential statistics

Page 51: Basic inferential statistics

• Suppose we carry out an experiment
– We toss a coin 10 times and get 5 heads
– How confident are we in the results?
  • Suppose we repeat the experiment
  • Will we get the same result again?

• Let’s try…
– You should have one coin
– Toss it 10 times
– Write down how many heads you get
– Do you all get the same results?
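Without a room full of coins, the same exercise can be simulated. The sketch below is a hypothetical stand-in for the classroom experiment: the number of participants (26) and the random seed are arbitrary assumptions, and each simulated participant tosses a fair coin 10 times.

```python
import random
from collections import Counter

def count_heads(tosses: int, rng: random.Random) -> int:
    """Toss a fair coin `tosses` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))

rng = random.Random(1)  # fixed seed so the run is repeatable
experiments = 26        # hypothetical number of participants
results = [count_heads(10, rng) for _ in range(experiments)]

# Print a simple text histogram of how many experiments gave each score.
tally = Counter(results)
for heads in range(11):
    print(f"{heads:2d} heads: {'#' * tally[heads]}")
```

Running it with different seeds should show that scores cluster around 5 heads but do not all agree.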

Pages 52-58: The Binomial distribution

• Repeated sampling tends to form a Binomial distribution around the expected mean X

• We toss a coin 10 times, and get 5 heads

• Due to chance, some samples will have a higher or lower score

[Figure: histogram of frequency F against number of heads x (axis ticks 1, 3, 5, 7, 9), built up over N = 1, 4, 8, 12, 16, 20 and 26 repeated samples, centred on the expected mean X]

Page 59: The Binomial distribution

• It is helpful to express x as the probability of choosing a head, p, with expected mean P

• p = x / n
– n = max. number of possible heads (10)

• Probabilities are in the range 0 to 1 = percentages (0 to 100%)

[Figure: the same distribution, frequency F against p (axis ticks 0.1, 0.3, 0.5, 0.7, 0.9), centred on P]

Page 61: The Binomial distribution

• Take-home point:
– A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!

• Estimating the confidence you have in your results is essential
– We want to make predictions about future runs of the same experiment

[Figure: frequency F against p (axis ticks 0.1, 0.3, 0.5, 0.7, 0.9), showing an observation p and the expected mean P]

Page 62: Binomial → Normal

• The Binomial (discrete) distribution is close to the Normal (continuous) distribution

[Figure: Binomial histogram with a Normal curve overlaid, frequency F against x (axis ticks 0.1, 0.3, 0.5, 0.7, 0.9)]

Pages 63-64: The central limit theorem

• Any Normal distribution can be defined by only two variables and the Normal function z
– population mean P
– standard deviation S = √(P(1 – P) / n)
– With more data in the experiment, S will be smaller
– 95% of the curve is within ~2 standard deviations of the expected mean
  • the correct figure is 1.95996!
  • = the critical value of z for an error level of 0.05.

[Figure: Normal curve of frequency F against p (axis ticks 0.1, 0.3, 0.5, 0.7), centred on P, with 95% of the area within z . S on either side of P and 2.5% in each tail]
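The two quantities on this slide can be checked numerically. The following sketch uses Python's standard-library `statistics.NormalDist` to recover the critical value of z, and computes S as the square root of P(1 – P) / n; the function names are illustrative, and the coin example (P = 0.5, n = 10) is carried over from the earlier slides.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(error_level: float = 0.05) -> float:
    """Two-tailed critical value of z for the given error level."""
    return NormalDist().inv_cdf(1 - error_level / 2)

def standard_deviation(P: float, n: int) -> float:
    """Standard deviation of the Normal approximation to the Binomial."""
    return sqrt(P * (1 - P) / n)

z = critical_z(0.05)
S = standard_deviation(0.5, 10)  # fair coin, 10 tosses
lower, upper = 0.5 - z * S, 0.5 + z * S
print(f"z = {z:.5f}, S = {S:.4f}, 95% interval about P: ({lower:.3f}, {upper:.3f})")
```

Increasing n in `standard_deviation` shrinks S, which is the sense in which more data gives a narrower interval.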

Page 65: The single-sample z test...

• Is an observation p > z standard deviations from the expected (population) mean P?

• If yes, p is significantly different from P

[Figure: Normal curve about P with 2.5% tails beyond z . S on either side, and an observation p marked on the p axis (ticks 0.1, 0.3, 0.5, 0.7)]
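A minimal sketch of this test, assuming the Normal approximation described on the previous slides, might look like the following. The function name and the coin-toss figures are illustrative; with only 10 tosses the approximation is rough, so the outcome should be read as a demonstration of the logic rather than a precise test.

```python
from math import sqrt
from statistics import NormalDist

def single_sample_z_test(p: float, P: float, n: int, error_level: float = 0.05) -> bool:
    """Return True if p is significantly different from P."""
    S = sqrt(P * (1 - P) / n)                           # standard deviation about P
    z_crit = NormalDist().inv_cdf(1 - error_level / 2)  # about 1.95996 for 0.05
    return abs(p - P) > z_crit * S

# Fair coin (P = 0.5), 10 tosses per experiment.
seven_heads = single_sample_z_test(0.7, 0.5, 10)
nine_heads = single_sample_z_test(0.9, 0.5, 10)
print(seven_heads, nine_heads)
```

Here 7 heads out of 10 falls inside the interval about P, while 9 heads out of 10 falls outside it.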

Page 67: ...gives us a “confidence interval”

• P ± z . S is the confidence interval for P
– We want to plot the interval about p

[Figure: Normal curve about P with 95% of the area between the 2.5% tails, and the interval (w–, w+) drawn about the observation p]

Page 68: ...gives us a “confidence interval”

• The interval about p is called the Wilson score interval

• This interval reflects the Normal interval about P:
– If P is at the upper limit of p, p is at the lower limit of P
(Wallis, 2013)

[Figure: Normal curve about P with 2.5% tails, and the Wilson interval (w–, w+) about the observation p]
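The slides do not spell out the formula, so the sketch below uses the standard Wilson score interval (Wilson 1927) without continuity correction; treat it as an assumption about the exact variant intended. The function name is illustrative, and the example figures are the shall counts for LLC (110 out of 188) and ICE-GB (40 out of 98) from the next slide.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p: float, n: int, error_level: float = 0.05) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    z = NormalDist().inv_cdf(1 - error_level / 2)
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - spread, centre + spread

llc = wilson_interval(110 / 188, 188)
ice_gb = wilson_interval(40 / 98, 98)
print(f"LLC:    ({llc[0]:.3f}, {llc[1]:.3f})")
print(f"ICE-GB: ({ice_gb[0]:.3f}, {ice_gb[1]:.3f})")
```

Unlike P ± z . S, this interval is asymmetric about p and stays within the range 0 to 1, which is useful when p is close to 0 or 1.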

Page 70: Modal shall vs. will over time

• Simple test:
– Compare p for
  • all LLC texts in DCPSE (1956-77) with
  • all ICE-GB texts (early 1990s)
– We get the following data

         LLC   ICE-GB  total
shall    110   40      150
will     78    58      136
total    188   98      286

– We may plot the probability of shall being selected, with Wilson intervals

[Figure: p(shall | {shall, will}) for LLC and ICE-GB on a scale from 0.0 to 1.0, with Wilson intervals]

May be input in a 2 x 2 chi-square test - or you can check Wilson intervals
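To illustrate the first option, here is a small sketch of a 2 x 2 chi-square calculation on the table above. It computes the plain Pearson statistic with no continuity correction, which is an assumption since the slides do not say which variant is intended; the function name is hypothetical.

```python
def chi_square_2x2(table: list[list[int]]) -> float:
    """Pearson chi-square statistic for a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected cell frequency if row and column were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: shall, will; columns: LLC, ICE-GB (from the table above).
observed = [[110, 40], [78, 58]]
chi2 = chi_square_2x2(observed)
CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, error level 0.05
print(f"chi-square = {chi2:.2f}, significant: {chi2 > CRITICAL_VALUE}")
```

The statistic exceeds the critical value, so on this calculation the difference in the proportion of shall between LLC and ICE-GB is significant at the 0.05 level.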

Pages 71-74: Modal shall vs. will over time

• Plotting modal shall/will over time (DCPSE)

[Figure: p(shall | {shall, will}) by year, 1955 to 1995, on a scale from 0.0 to 1.0, with confidence intervals (Aarts et al. 2013)]

• Small amounts of data / year

• Confidence intervals identify the degree of certainty in our results

• Highly skewed p in some cases
– p = 0 or 1 (circled)

• We can now estimate an approximate downwards curve

Page 75: Recap

• Whenever we look at change, we must ask ourselves two things:

What is the change relative to?
– Is our observation higher or lower than we might expect?
  • In this case we ask: does shall decrease relative to shall + will?

How confident are we in our results?
– Is the change big enough to be reproducible?

Page 76: Conclusions

• An observation is not the actual value
– Repeating the experiment might get different results

• The basic idea of these methods is
– Predict range of future results if experiment was repeated
  • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)

• Based on the Binomial distribution
– Approximated by Normal distribution – many uses
  • Plotting confidence intervals
  • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
  • Use 2 x 2 chi-square tests or two independent sample z tests to compare two observed samples

Page 77: References

• Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
– Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2.
– Levin, M. 2013. The progressive in modern American English. Chapter 8.

• Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208.

• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.

• NOTE: Statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com