
Capturing linguistic interaction in a grammar
A method for empirically evaluating the grammar of a parsed corpus

Sean Wallis
Survey of English Usage
University College London
[email protected]

Capturing linguistic interaction...

• Parsed corpus linguistics
• Empirical evaluation of grammar
• Experiments
– Attributive AJPs
– Preverbal AVPs
– Embedded postmodifying clauses
• Conclusions
– Comparing grammars or corpora
– Potential applications

Parsed corpus linguistics

• Several million-word parsed corpora exist
• Each sentence analysed in the form of a tree
– different languages have been analysed
– limited amount of spontaneous speech data
• Commitment to a particular grammar required
– different schemes have been applied
– problems: computational completeness + manual consistency
• Tools support linguistic research in corpora

Parsed corpus linguistics

• An example tree from ICE-GB (spoken)

[Tree diagram: S1A-006 #23]

Parsed corpus linguistics

• Three kinds of evidence may be obtained from a parsed corpus:
– Frequency evidence of a particular known rule, structure or linguistic event
– Coverage evidence of new rules, etc.
– Interaction evidence of the relationship between rules, structures and events
• This evidence is necessarily framed within a particular grammatical scheme
– So… how might we evaluate this grammar?

Empirical evaluation of grammar

• Many theories, frameworks and grammars
– no agreed evaluation method exists
– linguistics is divided into competing camps
– status of parsed corpora ‘suspect’
• Possible method: retrievability of events
– circularity: you get out what you put in
– redundancy: ‘improvement’ by mere addition
– atomic: based on single events, not patterns
– specificity: based on particular phenomena
• New method: retrievability of event sequences

Experiment 1: attributive AJPs

• Adjectives before a noun in English
• Simple idea: plot the frequency of NPs with at least n = 0, 1, 2, 3… attributive AJPs


[Figures: raw frequency and log frequency of NPs with at least n = 0…6 attributive AJPs. NB: the log-frequency line is not straight.]
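A minimal sketch of this counting step (Python), assuming a hypothetical list ajp_counts holding the number of attributive AJPs in each NP; in practice these counts would come from FTF queries over the parsed corpus:

    import math
    from collections import Counter

    # Hypothetical toy data: number of attributive AJPs in each NP.
    # Real counts would be extracted from ICE-GB with FTF queries.
    ajp_counts = [0, 0, 1, 0, 2, 1, 0, 0, 3, 1]

    def at_least_n(counts, max_n):
        """F(n) = number of NPs containing at least n attributive AJPs."""
        dist = Counter(counts)
        return [sum(f for k, f in dist.items() if k >= n)
                for n in range(max_n + 1)]

    freq = at_least_n(ajp_counts, 6)
    log_freq = [math.log10(f) for f in freq if f > 0]
    # A straight log-frequency line would mean each further AJP is added
    # with constant probability; the observed line is not straight.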

Experiment 1: analysis of results

• If the log-frequency line is straight:
– exponential fall in frequency (constant probability)
– no interaction between decisions (cf. coin tossing)
• Sequential probability analysis
– calculate the probability of adding each AJP
– error bars (binomial)
– probability falls:
• second < first
• third < second
• fourth < second
– decisions interact



– fit to a power law: y = m·x^k
• find m and k

[Figure: probability of adding each successive AJP (n = 0…5), with binomial error bars; fitted trend y = 0.1931·x^−1.2793.]
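A sketch of the sequential analysis under the same assumptions (toy freq values stand in for the real at-least-n frequencies): p(x) = F(x+1)/F(x) is the probability of adding a further AJP, each estimate gets a Wilson score interval as its binomial error bar, and y = m·x^k is fitted by least squares in log-log space. The constants in the figure are the paper's, not this sketch's, and the fitting method here is an assumption.

    import math

    freq = [1000, 180, 40, 10, 3]   # hypothetical at-least-n frequencies
    z = 1.96                        # critical value for a 5% error level

    def wilson(p, n):
        """Wilson score interval for a proportion p observed in n trials."""
        centre = (p + z * z / (2 * n)) / (1 + z * z / n)
        spread = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
                  / (1 + z * z / n))
        return centre - spread, centre + spread

    # p(x): probability that a phrase with x AJPs receives another one.
    probs = [freq[x + 1] / freq[x] for x in range(len(freq) - 1)]
    bars = [wilson(p, freq[x]) for x, p in enumerate(probs)]

    # Fit y = m * x^k by linear regression on (log x, log y), for x >= 1.
    pts = [(math.log(x), math.log(p))
           for x, p in enumerate(probs) if x >= 1 and p > 0]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    k = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    m = math.exp((sy - k * sx) / n)

    # R-squared in log-log space; a low value (as in Experiment 4 below)
    # means the power law fits poorly.
    ss_tot = sum((y - sy / n) ** 2 for _, y in pts)
    ss_res = sum((y - (math.log(m) + k * x)) ** 2 for x, y in pts)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")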

Experiment 1: explanations?

• Feedback loop: for each successive AJP, it is more difficult to add a further AJP
– Explanation 1: semantic constraints
• we tend to say tall green ship
• we do not tend to say tall short ship or green tall ship
– Explanation 2: communicative economy
• once a speaker has said tall green ship, they tend afterwards to say only ship
– Further investigation required
• General principle:
– a significant change (usually a fall) in probability is evidence of an interaction along a grammatical axis

Experiments 2, 3: variations

• Restrict head: common and proper nouns
– Common nouns: similar results
– Proper nouns and adjectives are often treated as compounds (Northern England vs. lower Loire)
• Ignore grammar: adjective + noun strings (see the sketch below)
– Some misclassifications / miscounting (‘noise’)
• she was [beautiful, people] said; tall very [green ship]
– Similar results
• slightly weaker (third < second not significant at p = 0.01)
– Insufficient evidence for grammar
• null hypothesis: simple lexical adjacency
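A sketch of the grammar-free variant, assuming hypothetical (word, tag) pairs with 'ADJ'/'NOUN' tags: it counts the run of adjectives immediately before each noun, with no clause boundaries visible, which is exactly how noise like she was [beautiful, people] said creeps in.

    # Hypothetical POS-tagged tokens; a real run would use a tagged corpus.
    tagged = [("the", "DET"), ("tall", "ADJ"), ("very", "ADV"),
              ("green", "ADJ"), ("ship", "NOUN")]

    def adjective_runs(tokens):
        """For each noun, count the adjectives immediately preceding it."""
        counts, run = [], 0
        for word, tag in tokens:
            if tag == "ADJ":
                run += 1
            else:
                if tag == "NOUN":
                    counts.append(run)
                run = 0
        return counts

    # 'tall very green ship' is miscounted as [green ship] with one
    # adjective, as noted above. These counts then feed the same
    # at-least-n sequential analysis as Experiment 1.
    print(adjective_runs(tagged))   # -> [1]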

Experiment 4: preverbal AVPs

• Consider adverb phrases before a verb
– Results very different
• Probability does not fall significantly between the first and second AVP
• Probability does fall between the second and third AVP
– Possible constraints
• (weak) communicative
• not (strong) semantic
– Further investigation needed


– Not a power law: R² < 0.24

[Figure: probability of adding each successive preverbal AVP (n = 1…3), with binomial error bars.]

Experiment 5: embedded clauses

• Another way to specify nouns in English
– add a clause after the noun to explicate it
• the ship [that was tall and green]
• the ship [in the port]
– may be embedded
• the ship [in the port [with the ancient lighthouse]]
– or successively postmodified
• the ship [in the port][with a very old mast]
• Compare successive embedding and sequential postmodifying clauses
– Axis = embedding depth / sequence length


Experiment 5: method

• Extract examples with FTFs (cf. the sketch below)
– at least n levels of embedded postmodification

[FTF diagrams for n = 0, 1, 2, … levels]

– problems:
• multiple matching cases (use ICECUP IV to classify)
• overlapping cases (subtract the extra case)
• co-ordination of clauses or NPs (use alternative patterns)
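Outside ICECUP, the 'at least n levels' pattern can be approximated with a recursive walk over parse trees. A minimal sketch with a hypothetical Node type and assumed labels ('NP', 'CL'); it ignores the co-ordination and overlap problems listed above:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str                      # e.g. "NP", "CL" (assumed labels)
        children: list = field(default_factory=list)

    def embedding_depth(np_node):
        """Levels of embedded postmodification below an NP: an NP whose
        postmodifying clause contains an NP that is itself postmodified
        counts as depth 2, and so on."""
        depths = [0]
        for child in np_node.children:
            if child.label == "CL":     # postmodifying clause
                for np in child.children:
                    if np.label == "NP":
                        depths.append(1 + embedding_depth(np))
        return max(depths)

    # the ship [in the port [with the ancient lighthouse]] -> depth 2
    inner = Node("NP", [Node("CL", [Node("NP")])])
    outer = Node("NP", [Node("CL", [inner])])
    print(embedding_depth(outer))       # -> 2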

Experiment 5: analysis of results

• Probability of adding a further embedded clause falls with each level
– second < first
– sequential < embedding
• Embedding only:
– third < first
– insufficient data for third < second
• Conclusion:
– Interaction along both the embedding and sequential axes



• Fitting to f = m·x^k
– k < 0 means a fall (f = m / x^|k|)
– high |k| means a steep fall
• Conclusion:
– Both match a power law: R² > 0.99

[Figure: sequential vs. embedded probabilities with fitted trends; embedded y = 0.0539·x^−1.2206, sequential y = 0.0523·x^−1.6516.]
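To see what the exponents mean, the two fitted curves can be evaluated directly, using the constants as labelled in the figure; the larger |k| of the sequential fit gives the steeper fall, consistent with sequential < embedding at every level:

    # Fitted constants as labelled in the figure: y = m * x^k
    fits = {"embedded": (0.0539, -1.2206), "sequential": (0.0523, -1.6516)}

    for name, (m, k) in fits.items():
        curve = [round(m * x ** k, 4) for x in (1, 2, 3, 4)]
        print(name, curve)
    # embedded   [0.0539, 0.0231, 0.0141, 0.0099]
    # sequential [0.0523, 0.0166, 0.0085, 0.0053]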

Experiment 5: explanations?

• Lexical adjacency?
– No: 87% of 2-level cases have at least one VP, NP or clause between the upper and lower heads
• Misclassified cases of embedding?
– No: very few (5%) semantically ambiguous cases
• Language production constraints?
– Possibly; could also be communicative economy
• contrast spontaneous speech with other modes
• Positive ‘proof’ of recursive tree grammar
– Established from a parsed corpus
– cf. negative ‘proof’ (NLP parsing problems)

Conclusions

• A new method for evaluating interactions along grammatical axes
– General purpose, robust, structural
– More abstract than ‘linguistic choice’ experiments
– Depends on a concept of grammatical distance along an axis, based on the chosen grammar
• Method has philosophical implications
– Grammar viewed as a structure of linguistic choices
– Linguistics as an evaluable observational science
• Signature (trace) of language production decisions
– A unification of theoretical and corpus linguistics?

Comparing grammars or corpora

• Can we reliably retrieve known interaction patterns with different grammars?
– Do these patterns differ across corpora?
• Benefits over individual event retrieval
– non-circular: generalisation across local syntax
– not subject to redundancy: arbitrary terms make trends more difficult to retrieve
– not atomic: based on patterns of interaction
– general: patterns may have multiple explanations
• Supplements retrieval of events

Potential applications

• Corpus linguistics
– Optimising existing grammar
• e.g. co-ordination, compound nouns
• Theoretical linguistics
– Comparing different grammars, same language
– Comparing different languages or periods
• Psycholinguistics
– Search for evidence of language production constraints in spontaneous speech corpora
• speech and language therapy
• language acquisition and development

Links and further reading

• Survey of English Usage
– www.ucl.ac.uk/english-usage
• Corpora and grammar
– .../projects/ice-gb
• Full paper
– .../staff/sean/resources/analysing-grammatical-interaction.pdf
• Sequential analysis spreadsheet (Excel)
– .../staff/sean/resources/interaction-trends.xls