Introduction to Corpora@Stanford
Florian Jaeger,
For the Methods class,
December 3rd, 2003
Some basic questions
Where are our corpora? Where is the software?
– Is there a list of all the stuff we have?
– How can I access the software?
Where do I start? What information is available where?
– Are there tutorials for the available software?
What kind of corpus work is supported at Stanford?
– Corpora are only for those computational folks … ;-)
And the most important question:
Why bother at all …
Because we are often wrong with our (ad-hoc) intuitions
– linguistic methodology is …
– well, let’s not go there.
While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.
To illustrate my point, a little case study …
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.”
– 0.5 apples/apple
– 1.0 apples/apple
– 1.5 apples/apple
– zero apples/apple
Hagit Borer’s judgments:
– 0.5 apples/*apple
– 1.0 apples/*apple
– 1.5 apples/*apple
– zero apples/*apple
Google’s count:
– 0.5 apples (120)/*apple (179)
– 1.0 apples (42)/*apple (23,600)
– 1.5 apples (59)/*apple (362)
– zero apples (194)/*apple (124)
This also makes some of the problems clear, so let’s take pears.
Google’s count:
– 0.1 pears (32)/*pear (118)
– 0.5 pears (37)/*pear (50)
– 0.7 pears (9)/*pear (14)
– 1.0 pears (14)/*pear (24,000)
– 1 pears (14)/?pear (7,480)
– One pears (1,130)/?pear (3,060)
– 1.5 pears (28)/*pear (316)
– zero pears (3)/*pear (0)
Conclusion:
– It is amazing how many programs or computer products use fruit names.
– The original judgments seem questionable.
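One way to make the comparison concrete is to turn the raw hit counts into proportions. The sketch below is Python and not part of the original talk: it copies a few of the counts from the slides and computes the share of hits that use the singular noun. The function name singular_share and the selection of rows are illustrative assumptions, and the counts remain as noisy as the conclusion says they are.

```python
# Google hit counts copied from the slides: (plural hits, singular hits).
counts = {
    "0.5 apple(s)": (120, 179),
    "1.0 apple(s)": (42, 23600),
    "1.5 apple(s)": (59, 362),
    "zero apple(s)": (194, 124),
    "0.5 pear(s)": (37, 50),
    "1.5 pear(s)": (28, 316),
    "zero pear(s)": (3, 0),
}


def singular_share(plural, singular):
    """Fraction of all hits that use the singular noun (None if there are no hits)."""
    total = plural + singular
    return singular / total if total else None


for phrase, (plural, singular) in counts.items():
    share = singular_share(plural, singular)
    shown = "n/a" if share is None else f"{share:.2f}"
    print(f"{phrase:14s} plural={plural:>6} singular={singular:>6} singular share={shown}")
```

A share well above zero for a form marked with * is what makes the original judgments look questionable, although hits from product names and software inflate some rows.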
BUT: can we trust Google?
Corpora@Stanford (structure of the website)
– Home
– Introduction
– Rules & Copyrights
– Account Setup
– Available corpora
– Available software
– Classes & Projects
– Acknowledgments
– Site Map (to come)
– Help for Corpus TAs
– Tutorials: Grep, Tgrep, CQP, GSearch
– Top 10 info sources on the net
– Local support: e-list & Corpus TA
In addition to the indicated structure, all pages offer links to external pages, including corpora, software, tutorials, demos, etc.
Looking for a corpus
There are several sites on the web that can help you find out whether what you are looking for exists:
– Databases like David Lee’s site (see also our Top 10 list)
– The LDC database
– Our list of corpora (next page)
Email lists, see our site under ‘Support’
– Local: [email protected]
– Global: [email protected]
Types of corpora
Different languages
Different media (speech, video, text)
Different levels of annotation
– No annotation
– Transcribed speech or video
– Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
– Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
– Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)
Looking for a specific corpus
List of available corpora
– If the corpus is on AFS
– If the corpus is on the Corpus Computer
– If the corpus is on CD
– If the corpus is on the WWW
– If the corpus has special license conditions
– If we don’t have the corpus
Tools & software
General
Where to start:
– Local online tutorials (see also external references and manuals)
– The corpus TA
– [email protected]
Little helpers
A brief look at some tools
BNC Web
– Problem: Superiority “who the hell …”
– Problem: Distribution of “… is like …” – age dependent?
General information
Age (easy export to e.g. Excel)
Crosstabs
TGrep2 and Tgrep
– Tutorial
– Examples:
tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
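To see what the first example pattern, VP < (NP $. NP), is looking for, here is a minimal Python sketch that checks the same configuration on a hand-built toy tree. It is not how TGrep2 is implemented: the tree, the tuple encoding, and the helper names are assumptions made for illustration, and adjacency in the list of children stands in for "immediately precedes".

```python
# A toy parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", "Kim"),
        ("VP",
         ("VBD", "gave"),
         ("NP", "Sandy"),
         ("NP", ("DT", "a"), ("NN", "book"))))


def label(node):
    """Label of a node; a leaf word is its own label."""
    return node if isinstance(node, str) else node[0]


def children(node):
    """Child nodes; a leaf word has none."""
    return [] if isinstance(node, str) else list(node[1:])


def subtrees(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in children(node):
        yield from subtrees(child)


def double_object_vps(root):
    """VP nodes that have an NP child whose next sister is also an NP."""
    hits = []
    for node in subtrees(root):
        if label(node) != "VP":
            continue
        kids = children(node)
        if any(label(a) == "NP" and label(b) == "NP" for a, b in zip(kids, kids[1:])):
            hits.append(node)
    return hits


matches = double_object_vps(tree)
print(len(matches), "matching VP(s)")
```

Replacing the second NP test with PP-DTV would mimic the second example, which targets the prepositional dative instead of the double object construction.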
Note: Tgrep is right-headed
The following pattern matches an S that has a child A and another child C, where the A has a child B:
– S < (A < B) < C
However, this pattern means that S has a child A and that A has children B and C:
– S < ((A < B) < C)
It is equivalent to this:
– S < (A < B < C)
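The difference between the two groupings is easy to check by hand on two tiny trees. The following Python sketch is an illustration only, with made-up trees and helper names (has_child, pattern_flat, pattern_nested are assumptions, not TGrep2 functions); it hard-codes each pattern as a predicate rather than parsing TGrep2 syntax.

```python
# Trees as nested tuples: (label, child, child, ...).
c_under_s = ("S", ("A", ("B",)), ("C",))   # C is a child of S
c_under_a = ("S", ("A", ("B",), ("C",)))   # C is a child of A


def has_child(node, wanted, condition=lambda child: True):
    """True if node has a child labelled `wanted` that also satisfies `condition`."""
    return any(child[0] == wanted and condition(child) for child in node[1:])


def pattern_flat(node):
    """S < (A < B) < C : both A and C are children of S, and A has a child B."""
    return (node[0] == "S"
            and has_child(node, "A", lambda a: has_child(a, "B"))
            and has_child(node, "C"))


def pattern_nested(node):
    """S < (A < B < C) : A is a child of S, and that same A has children B and C."""
    return (node[0] == "S"
            and has_child(node, "A", lambda a: has_child(a, "B") and has_child(a, "C")))


for name, t in [("c_under_s", c_under_s), ("c_under_a", c_under_a)]:
    print(name, "flat:", pattern_flat(t), "nested:", pattern_nested(t))
```

Each tree satisfies exactly one of the two patterns, which is why the placement of the parentheses matters when a query chains several relations.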
Some more TGrep2 syntax
A < B      A is the parent of (immediately dominates) B.
A > B      A is the child of B.
A <N B     B is the Nth child of A (the first child is <1).
A >N B     A is the Nth child of B (the first child is >1).
A <, B     Synonymous with A <1 B.
A >, B     Synonymous with A >1 B.
A <-N B    B is the Nth-to-last child of A (the last child is <-1).
A >-N B    A is the Nth-to-last child of B (the last child is >-1).
A <- B     B is the last child of A (synonymous with A <-1 B).
A >- B     A is the last child of B (synonymous with A >-1 B).
A <` B     B is the last child of A (also synonymous with A <-1 B).
A >` B     A is the last child of B (also synonymous with A >-1 B).
A <: B     B is the only child of A.
A >: B     A is the only child of B.
A << B     A dominates B (A is an ancestor of B).
Some more TGrep2 syntax
A >> B     A is dominated by B (A is a descendant of B).
A <<, B    B is a left-most descendant of A.
A >>, B    A is a left-most descendant of B.
A <<` B    B is a right-most descendant of A.
A >>` B    A is a right-most descendant of B.
A <<: B    There is a single path of descent from A and B is on it.
A >>: B    There is a single path of descent from B and A is on it.
A . B      A immediately precedes B.
A , B      A immediately follows B.
A .. B     A precedes B.
A ,, B     A follows B.
A $ B      A is a sister of B (and A ≠ B).
A $. B     A is a sister of and immediately precedes B.
A $, B     A is a sister of and immediately follows B.
A $.. B    A is a sister of and precedes B.
A $,, B    A is a sister of and follows B.
A = B      The node matched by A is also matched by B.
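A few of these relations can be spelled out as small Python predicates over a toy tree, which may help when reading the table. This is a sketch under stated assumptions, not TGrep2 code: the tree encoding and all function names are invented for illustration, and only the relations <, <<, <N, <-N and <: are covered.

```python
# Toy tree as nested tuples: (label, child, ...); leaf words are plain strings.
tree = ("S", ("NP", ("NNP", "Kim")), ("VP", ("VBD", "left")))


def lab(node):
    """Label of a node; a leaf word is its own label."""
    return node if isinstance(node, str) else node[0]


def kids(node):
    """Child nodes; a leaf word has none."""
    return [] if isinstance(node, str) else list(node[1:])


def descendants(node):
    """Yield every node strictly below `node`."""
    for k in kids(node):
        yield k
        yield from descendants(k)


def parent_of(a, b_label):        # A < B
    return any(lab(k) == b_label for k in kids(a))


def dominates(a, b_label):        # A << B
    return any(lab(d) == b_label for d in descendants(a))


def nth_child(a, n, b_label):     # A <N B for n > 0, A <-N B for n < 0
    if n == 0:
        raise ValueError("child positions start at 1 (or -1 from the end)")
    ks = kids(a)
    idx = n - 1 if n > 0 else n
    return -len(ks) <= idx < len(ks) and lab(ks[idx]) == b_label


def only_child(a, b_label):       # A <: B
    ks = kids(a)
    return len(ks) == 1 and lab(ks[0]) == b_label


print(parent_of(tree, "VBD"), dominates(tree, "VBD"))
```

The printed pair contrasts immediate dominance with dominance: S is an ancestor of VBD but not its parent.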
The alternative with windows
TigerSearch 2.1; screen shots:
– Grammar search
– Collocation search
The end, my friends
Want to help?
– The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)
Bye!