Introduction to Corpora@Stanford Florian Jaeger, tiflo@stanford.edu For the Methods class, December 3rd, 2003

Page 1

Introduction to Corpora@Stanford

Florian Jaeger,

tiflo@stanford.edu

For the Methods class,

December 3rd, 2003

Page 2

Some basic questions

Where are our corpora? Where is the software?
– Is there a list of all the stuff we have?
– How can I access the software?

Where do I start? What information is available where?
– Are there tutorials for the available software?

What kind of corpus work is supported at Stanford?
– Corpora are only for those computational folks … ;-)

And the most important question:

Page 3

Why bother at all …

Because we are often wrong with our (ad hoc) intuitions
– linguistic methodology is …
– well, let’s not go there.

While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.

To illustrate my point, a little case study …

Page 4

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002

Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.”
– 0.5 apples/apple
– 1.0 apples/apple
– 1.5 apples/apple
– zero apples/apple

Page 5

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002

Hagit Borer’s judgments:
– 0.5 apples/*apple
– 1.0 apples/*apple
– 1.5 apples/*apple
– zero apples/*apple

Page 6

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002

Google’s count:
– 0.5 apples (120)/*apple (179)
– 1.0 apples (42)/*apple (23,600)
– 1.5 apples (59)/*apple (362)
– zero apples (194)/*apple (124)

This also makes some of the problems clear, so let’s take pears.

Page 7

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002

Google’s count:
– 0.1 pears (32)/*pear (118)
– 0.5 pears (37)/*pear (50)
– 0.7 pears (9)/*pear (14)
– 1.0 pears (14)/*pear (24,000)
– 1 pears (14)/?pear (7,480)
– One pears (1,130)/?pear (3,060)
– 1.5 pears (28)/*pear (316)
– zero pears (3)/*pear (0)
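To make the counts above easier to compare, here is a minimal Python sketch that turns each pair of hit counts into the share of hits using the plural form. The numbers are copied from the list above; the names `GOOGLE_HITS`, `plural_share`, and `summarize` are made up for this illustration, and raw hit counts remain noisy evidence (product names, duplicates), so the shares are only a rough guide.

```python
# Hit counts copied from the pears list: (quantity, plural hits, singular hits).
GOOGLE_HITS = [
    ("0.1", 32, 118),
    ("0.5", 37, 50),
    ("0.7", 9, 14),
    ("1.0", 14, 24_000),
    ("1", 14, 7_480),
    ("One", 1_130, 3_060),
    ("1.5", 28, 316),
    ("zero", 3, 0),
]


def plural_share(plural, singular):
    """Fraction of all hits that use the plural form (NaN if there are no hits)."""
    total = plural + singular
    return plural / total if total else float("nan")


def summarize(rows):
    """Map each quantity expression to its plural share."""
    return {quantity: plural_share(p, s) for quantity, p, s in rows}


if __name__ == "__main__":
    for quantity, share in summarize(GOOGLE_HITS).items():
        print(f"{quantity:>4} pears: {share:.1%} plural")
```

Looking at shares rather than raw counts makes it easier to see that the singular is attested with non-integer quantities far more often than the starred judgments would suggest.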

Conclusion:
– It is amazing how many programs or computer products use fruit names.
– The original judgments seem questionable.

BUT: can we trust Google?

Page 8

Corpora@Stanford
– Home
– Introduction
– Rules & Copyrights
– Account Setup
– Available corpora
– Available software
– Classes & Projects
– Local Support: E-list & Corpus TA
– Top 10 Info sources on the net
– Grep Tutorial
– Tgrep Tutorial
– CQP Tutorial
– GSearch Tutorial
– Help for Corpus TAs
– Acknowledgments
– Site Map (to come)

In addition to the indicated structure, all pages offer links to external pages, including corpora, software, tutorials, demos, etc.

Page 9

Looking for a corpus

There are several sites on the web that can help you find out whether what you are looking for exists:
– Databases like David Lee’s site (see also our Top 10 list)
– The LDC database
– Our list of corpora (next page)

Email lists, see our site under ‘Support’
– Local: [email protected]
– Global: [email protected]

Page 10

Types of corpora

Different languages
Different media (speech, video, text)
Different levels of annotation
– No annotation
– Transcribed speech or video
– Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
– Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
– Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)

Page 11

Looking for a specific corpus

List of available corpora
– If the corpus is on AFS
– If the corpus is on the Corpus Computer
– If the corpus is on CD
– If the corpus is on the WWW
– If the corpus has special license conditions
– If we don’t have the corpus

Page 13

Tools & software

General

Where to start:
– Local online tutorials (see also external references and manuals)
– The corpus TA
– [email protected]

Little helpers

Page 14

A brief look at some tools

BNC Web
– Problem: Superiority “who the hell …”
– Problem: Distribution of “… is like …” – age dependent?
  General information
  Age (easy export to e.g. Excel)
  Crosstabs

TGrep2 and Tgrep
– Tutorial
– Examples:

    tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'

Page 15

Note: Tgrep is right-headed

The following pattern matches an S that has a child A and another child C, where the A has a child B:
– S < (A < B) < C

However, this pattern means that S has a child A and that A has children B and C:
– S < ((A < B) < C)

It is equivalent to this:
– S < (A < B < C)
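The grouping difference can be checked on two toy trees. The following Python sketch is not how Tgrep works internally; it hand-codes the two readings for the labels S, A, B, and C and only tests the root node, whereas Tgrep tries every node in a tree. The tuple representation and the helper names (`node`, `children_labeled`, `matches_flat`, `matches_nested`) are assumptions made for this illustration.

```python
def node(label, *children):
    """Build a toy tree node as (label, [children])."""
    return (label, list(children))


def children_labeled(tree, label):
    """Return the immediate children of `tree` that carry `label`."""
    return [child for child in tree[1] if child[0] == label]


def matches_flat(s):
    """S < (A < B) < C: S has a child A with a child B, and S itself has a child C."""
    if s[0] != "S":
        return False
    a_with_b = any(children_labeled(a, "B") for a in children_labeled(s, "A"))
    return a_with_b and bool(children_labeled(s, "C"))


def matches_nested(s):
    """S < (A < B < C): S has a child A, and that same A has children B and C."""
    if s[0] != "S":
        return False
    return any(
        bool(children_labeled(a, "B")) and bool(children_labeled(a, "C"))
        for a in children_labeled(s, "A")
    )


# C is a sister of A in the first tree and a child of A in the second.
tree_flat = node("S", node("A", node("B")), node("C"))
tree_nested = node("S", node("A", node("B"), node("C")))

if __name__ == "__main__":
    for name, tree in [("tree_flat", tree_flat), ("tree_nested", tree_nested)]:
        print(name, matches_flat(tree), matches_nested(tree))
```

Each pattern accepts exactly one of the two trees, which is the point of the note: every relation without extra parentheses attaches to the first node of its group.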

Page 16

Some more TGrep2 syntax

A < B     A is the parent of (immediately dominates) B.
A > B     A is the child of B.
A <N B    B is the Nth child of A (the first child is <1).
A >N B    A is the Nth child of B (the first child is >1).
A <, B    Synonymous with A <1 B.
A >, B    Synonymous with A >1 B.
A <-N B   B is the Nth-to-last child of A (the last child is <-1).
A >-N B   A is the Nth-to-last child of B (the last child is >-1).
A <- B    B is the last child of A (synonymous with A <-1 B).
A >- B    A is the last child of B (synonymous with A >-1 B).
A <` B    B is the last child of A (also synonymous with A <-1 B).
A >` B    A is the last child of B (also synonymous with A >-1 B).
A <: B    B is the only child of A.
A >: B    A is the only child of B.
A << B    A dominates B (A is an ancestor of B).
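For readers who find the operators easier to follow as code, here is a small Python sketch of what the child-position and dominance relations mean on a toy tree. It is an illustration of the definitions above under an assumed tuple representation, not TGrep2's implementation; `node`, `child_at`, `only_child`, `descendants`, and `dominates` are hypothetical helper names.

```python
def node(label, *children):
    """Build a toy tree node as (label, [children])."""
    return (label, list(children))


def child_at(a, n):
    """B in `A <N B` (n > 0, counting from 1) or `A <-N B` (n < 0); None if absent."""
    kids = a[1]
    if n == 0 or abs(n) > len(kids):
        return None
    return kids[n - 1] if n > 0 else kids[n]


def only_child(a):
    """B in `A <: B`: the single child of A, or None if A has zero or several."""
    kids = a[1]
    return kids[0] if len(kids) == 1 else None


def descendants(a):
    """Yield every node B with `A << B`, in left-to-right, top-down order."""
    for kid in a[1]:
        yield kid
        yield from descendants(kid)


def dominates(a, b_label):
    """True if A dominates some node labeled `b_label` (A << B)."""
    return any(d[0] == b_label for d in descendants(a))


# A toy double-object VP: (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN pear)))
vbd = node("VBD", node("gave"))
vp = node(
    "VP",
    vbd,
    node("NP", node("PRP", node("him"))),
    node("NP", node("DT", node("a")), node("NN", node("pear"))),
)

if __name__ == "__main__":
    print(child_at(vp, 1)[0])    # label of the first child, VP <1 ...
    print(child_at(vp, -1)[0])   # label of the last child, VP <-1 ...
    print(only_child(vbd)[0])    # VBD <: ...
    print(dominates(vp, "NN"))   # VP << NN
```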

Page 17

Some more TGrep2 syntax

A >> B    A is dominated by B (A is a descendant of B).
A <<, B   B is a left-most descendant of A.
A >>, B   A is a left-most descendant of B.
A <<` B   B is a right-most descendant of A.
A >>` B   A is a right-most descendant of B.
A <<: B   There is a single path of descent from A and B is on it.
A >>: B   There is a single path of descent from B and A is on it.
A . B     A immediately precedes B.
A , B     A immediately follows B.
A .. B    A precedes B.
A ,, B    A follows B.
A $ B     A is a sister of B (and A ≠ B).
A $. B    A is a sister of and immediately precedes B.
A $, B    A is a sister of and immediately follows B.
A $.. B   A is a sister of and precedes B.
A $,, B   A is a sister of and follows B.
A = B     The node matched by A is also matched by B.
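The sister relations are the ones used in the earlier example patterns such as 'VP < (NP $. NP)'. As a rough illustration of how they differ, the Python sketch below lists the pairs of children of one parent that stand in a given sister relation. It assumes that adjacency among sisters can be read off the order of the child list and uses an assumed tuple representation; the names `node`, `SISTER_RELATIONS`, and `sisters` are made up for this example and are not part of TGrep2.

```python
def node(label, *children):
    """Build a toy tree node as (label, [children])."""
    return (label, list(children))


# Each relation is a test on the positions i (of A) and j (of B) under one parent.
SISTER_RELATIONS = {
    "$": lambda i, j: True,         # A is a sister of B
    "$.": lambda i, j: j == i + 1,  # A immediately precedes its sister B
    "$,": lambda i, j: j == i - 1,  # A immediately follows its sister B
    "$..": lambda i, j: j > i,      # A precedes its sister B
    "$,,": lambda i, j: j < i,      # A follows its sister B
}


def sisters(parent, a_label, b_label, relation="$"):
    """Return (index of A, index of B) for all child pairs in the given relation."""
    test = SISTER_RELATIONS[relation]
    kids = parent[1]
    return [
        (i, j)
        for i, a in enumerate(kids)
        for j, b in enumerate(kids)
        if i != j  # a node is never its own sister (A ≠ B)
        and a[0] == a_label
        and b[0] == b_label
        and test(i, j)
    ]


# Toy VP with children VBD, NP, NP, PP-DTV (leaves omitted for brevity).
vp = node("VP", node("VBD"), node("NP"), node("NP"), node("PP-DTV"))

if __name__ == "__main__":
    for rel in SISTER_RELATIONS:
        print(f"NP {rel} NP:", sisters(vp, "NP", "NP", rel))
```

Note that plain `$` returns each NP pair twice, once per direction, while `$.` keeps only the pair in which the first NP directly precedes the second.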

Page 18

The alternative with windows

TigerSearch 2.1; screenshots:
– Grammar search
– Collocation search

Page 19

The end, my friends

Want to help?
– The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)

Tschuessi!