introduction to corpus linguisticscorpus.nytud.hu/people/varadi/btant129/download/w5... · corpus...

BTANT 129 w5 Introduction to corpus linguistics

Post on 17-Aug-2020

BTANT 129 w5

Introduction to corpus linguistics

Corpus

• The old school concept
  – A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse (The Oxford Companion to the English Language)
• The modern view
  – A collection of naturally occurring language text chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP)

Corpus vs. archive

• Text archive
  – Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
  – Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)

Short history

Brief mention of just a select few!

• Brown Corpus (Brown University)
  – 1 m words
  – 15 genres
  – 500 samples of 2000 words each
  – Area: US
  – Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
  – GB replica of Brown

Cobuild

• Major corpus initiative by Collins and Birmingham Univ. (John Sinclair)
• 1991: 20 m words
• -> Bank of English, currently 450 m words
• http://www.cobuild.collins.co.uk

British National Corpus

• 100 m words, careful selection
• 10 % spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• 40-50 000 word texts
• TEI compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/

International Corpus of English

• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 written, 200 spoken) of 2000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP

Corpus processing: tokenization

• Preprocessing
  – Tokenization: segmenting the text into
    • sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
    • words (multi-word units are a problem)
  – Normalization
    • restoring clitics, abbreviations ("can't", "I've")
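The preprocessing steps on this slide can be illustrated with a minimal Python sketch. It is not the method of any particular corpus project: the abbreviation list, the clitic table and the function names are all invented for illustration, and a real tokenizer would need far larger resources.

```python
import re

# Hypothetical abbreviation list: a full stop after these does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Hypothetical clitic table used for the normalization step.
CLITICS = {"can't": "can not", "i've": "I have", "don't": "do not"}


def split_sentences(text):
    """Split after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def normalize(token):
    """Restore a contracted form to its full form if it is in the table."""
    return CLITICS.get(token.lower(), token)


def tokenize(sentence):
    """Split a sentence into words and punctuation, expanding clitics.

    Note that this simple pattern separates the full stop from an
    abbreviation ("Dr" + "."), which a real tokenizer would avoid.
    """
    raw = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)
    tokens = []
    for item in raw:
        tokens.extend(normalize(item).split())
    return tokens


text = "Dr. Smith can't come. I've seen Mr. Jones!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The full stop after "Dr." and "Mr." is the mid-sentence delimiter problem: without the abbreviation list the text would be cut into four "sentences" instead of two.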

Corpus processing: tagging

• Tagging
  – labelling every word with its Part of Speech category
  – Problem: ambiguity
    • out of context, words can belong to different parts of speech or have different analyses within the same POS
      – set N vs. set V
      – Hungarian bánt: 'bánik' VBD or 'bánt' VBZ

Corpus processing: disambiguation

• Disambiguation
  – determining the correct analysis in context
• Two approaches (both need a manually corrected training corpus):
  – statistical
    • Hidden Markov model
    • calculating probability within a span of usually one or two words
    • rate of success can be around 98%
  – rule-based
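The statistical approach can be sketched as a toy bigram Hidden Markov model tagger with Viterbi decoding. This is only an illustration under stated assumptions: the four-sentence training corpus, the tag labels and the add-one smoothing are invented here, and nothing about it reflects how any real tagger achieves the accuracy quoted above.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged training corpus, invented for illustration.
TRAIN = [
    [("the", "DET"), ("set", "N"), ("is", "V"), ("complete", "ADJ")],
    [("they", "PRON"), ("set", "V"), ("the", "DET"), ("table", "N")],
    [("we", "PRON"), ("set", "V"), ("a", "DET"), ("date", "N")],
    [("a", "DET"), ("set", "N"), ("of", "P"), ("texts", "N")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words`."""
    tags = sorted(emit)
    vocab = len({w for c in emit.values() for w in c}) + 1  # +1 for unknown words
    # best[i][tag] = (score of best path ending in tag at position i, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], vocab), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], words[i], vocab), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train(TRAIN)
print(viterbi(["the", "set"], trans, emit))
print(viterbi(["they", "set"], trans, emit))
```

With this toy model the ambiguous word "set" is tagged N after a determiner and V after a pronoun, because each decision depends only on the previous tag, i.e. a span of one word.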

Syntactic annotation

• Difficult to do on such a scale
• Shallow parsing
• Treebank: collection of syntactically analyzed sentences
• Penn Treebank
  – http://www.cis.upenn.edu/~treebank/
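To make "syntactically analyzed sentence" concrete, here is a small Python sketch that reads one tree written in labelled bracket notation of the kind used by the Penn Treebank. The example sentence and the helper names are invented for illustration; this is not a reader for the actual treebank files.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of the tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def pos_tags(tree):
    """Return (word, tag) pairs taken from the nodes directly above words."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(pos_tags(child))
    return pairs


# Invented example sentence with Penn-style labels.
tree = parse_bracketed("(S (NP (PRP They)) (VP (VBD set) (NP (DT the) (NN table))))")
print(leaves(tree))
print(pos_tags(tree))
```

The same annotated sentence yields both the plain word sequence and the POS-tagged sequence, which is why a treebank can also serve as training data for the tagging step.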

Recent trends

• Word sense disambiguation (SENSEVAL)
  – http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
  – http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
  – making information on the web understandable for machines
  – a vision requiring a huge effort; not clear whether feasible at all

Representative sample?

• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
  – sampling speakers: demographic sampling
  – sampling their output: text type sample

The notion of representativeness

• Sample vs. population
• The sample should be proportional to the population for a given feature
  – example of demographic sampling: if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male
  -> our sample is representative of Budapest residents for gender
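The proportional-sampling arithmetic can be worked through in a few lines of Python. This is a sketch under assumptions: the 52% female share is simply the complement of the 48% figure on the slide, the sample sizes are arbitrary, and the function name is invented. Rounding uses the largest-remainder rule so that the quotas always add up to the sample size.

```python
def quotas(percentages, sample_size):
    """Allocate `sample_size` informants in proportion to `percentages`.

    `percentages` maps each group to an integer percentage; integer
    arithmetic avoids floating-point rounding surprises.
    """
    assert sum(percentages.values()) == 100
    base = {g: pct * sample_size // 100 for g, pct in percentages.items()}
    remainder = {g: pct * sample_size % 100 for g, pct in percentages.items()}
    leftover = sample_size - sum(base.values())
    # Hand the remaining places to the groups with the largest remainders.
    for group in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        base[group] += 1
    return base


# 48% male comes from the slide; 52% female is assumed as the complement.
census = {"male": 48, "female": 52}
print(quotas(census, 500))  # 240 male and 260 female informants
print(quotas(census, 33))   # small sample: the shares can only be approximated
```

With 500 informants the sample matches the census exactly. With 33 it cannot, which shows that proportionality for even one feature is only approximate in small samples.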

Trouble with representativeness

• What should be the units of sampling?
• Registers, text types, genres etc.
• But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal but impossible to implement

Approaches to Representativeness

• Douglas Biber:
  – Rejects the notion of proportional sampling
  – Sample should be as varied as possible
  – Representativeness measured in terms of the wide variety of text types included in the sample

The Web as a corpus?

• Pros:
  – immense database
  – dynamically growing
  – ideal 'quick and dirty' method
• Cons:
  – lots of rubbish, irrelevant data
  – difficult to extract hits
  – no language analysis
  – only string query, which is crude
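Why a bare string query is crude can be shown on a few invented sentences in Python. The sample text and function names are made up for illustration, and a web search engine does not necessarily match strings this way. The point is only that matching without linguistic analysis over-generates and cannot separate readings.

```python
import re

# Invented sample text standing in for unannotated web data.
TEXT = ("They set the table. The sunset upset him, "
        "so he reset the settings. A set of texts.")


def substring_hits(query, text):
    """Positions of every occurrence of `query` as a raw character string."""
    return [m.start() for m in re.finditer(re.escape(query), text.lower())]


def word_hits(query, text):
    """Occurrences of `query` as a whole word only."""
    return re.findall(r"\b" + re.escape(query) + r"\b", text.lower())


print(len(substring_hits("set", TEXT)))  # also counts sunset, upset, reset, settings
print(len(word_hits("set", TEXT)))       # whole-word matches only
```

Even the whole-word query returns both the verb in "They set the table" and the noun in "A set of texts". Without POS annotation there is no way to ask for one of them only.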

One quick example

• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites
