guy aston, ylva berglund prytz, & lou burnard, exploring bnc-xml with xaira

Post on 13-Jan-2016

Guy Aston, Ylva Berglund Prytz, & Lou Burnard,

http://www.natcorp.oucs.ox.ac.uk

Exploring BNC-XML with Xaira

What is the BNC?

a snapshot of British English, taken at the end of the 20th century

100 million words in approx 4000 different text samples, both spoken (10%) and written (90%)

synchronic (1990-4), sampled, general purpose corpus

available under licence; latest edition is BNC-XML (13 Mar 2007)

Distinctive features of the BNC

non-opportunistic design
standardized markup system
structural annotation
word class annotation
contextual information

general availability

...in these respects, the BNC remains distinctive, twenty years on!

What's new in BNC-XML?

No systematic proofing, re-editing, or re-parsing...

Same as BNC World:
  texts (minus duplicates)
  POS tagging (but extended)

Additions
  simpler POS codes
  lemmata

Improvements
  Duplications, categorizations, segmentations...
  Coded descriptions

BNC-XML regroups texts using additional classification criteria

[Figure: breakdown of sentences and words by text category: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, OtherSpoken]

<wtext type="NONAC">
 <div level="1" n="1" type="leaflet">
  <head type="MAIN">
   <s n="1">
    <w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
    <w c5="DTQ" hw="what" pos="PRON">WHAT</w>
    <w c5="VBZ" hw="be" pos="VERB">IS</w>
    <w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
    <c c5="PUN">?</c>
   </s>
  </head>
  <p>
   <s n="2">
    <hi rend="bo">
     <w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
     <c c5="PUL">(</c>
     <w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>
     <w c5="AJ0" hw="immune" pos="ADJ">Immune</w>
     <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
     <w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w>
     <c c5="PUR">)</c>
    </hi>
    <w c5="VBZ" hw="be" pos="VERB">is</w>
    <w c5="AT0" hw="a" pos="ART">a</w>
    <w c5="NN1" hw="condition" pos="SUBST">condition</w>
    <w c5="VVN" hw="cause" pos="VERB">caused</w>
    <w c5="PRP" hw="by" pos="PREP">by</w>
    <w c5="AT0" hw="a" pos="ART">a</w>
    <w c5="NN1" hw="virus" pos="SUBST">virus</w>
    <w c5="VVN" hw="call" pos="VERB">called</w>
    <w c5="NP0" hw="hiv" pos="SUBST">HIV</w>
    <c c5="PUL">(</c>
    <w c5="AJ0-NN1" hw="human" pos="ADJ">Human</w>
    <w c5="NN1" hw="immuno" pos="SUBST">Immuno</w>
    <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
    <w c5="NN1" hw="virus" pos="SUBST">Virus</w>
    <c c5="PUR">)</c>
    <c c5="PUN">.</c>
   </s>
   …
  </p>
  …
 </div>
</wtext>

What is the markup for?

It makes it possible for you to
  distinguish aids=SUBST from aids=VERB
  distinguish occurrences in writing from ones in speech
  distinguish occurrences in headings from ones in paragraphs
  identify contextual units like sentences and paragraphs

FACTSHEET WHAT IS AIDS?AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
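The distinctions listed above can be tried out with any XML parser. The following is a minimal sketch in Python, assuming a shortened but well-formed copy of the fragment shown above; the function name `find_headword` is invented for this illustration and is not part of Xaira or the BNC distribution.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed copy of the BNC-XML fragment (assumption: trimmed
# for illustration; attribute names c5, hw, pos are those used in the sample).
SAMPLE = """<wtext type="NONAC"><div level="1" n="1" type="leaflet">
<head type="MAIN"><s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w> <w c5="DTQ" hw="what" pos="PRON">WHAT</w> <w c5="VBZ" hw="be" pos="VERB">IS</w> <w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head>
<p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS</w> <w c5="VBZ" hw="be" pos="VERB">is</w> <w c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s></p>
</div></wtext>"""


def find_headword(root, hw, pos):
    """Return (container, sentence number, word form) for each matching <w>.

    container is 'head' or 'p', i.e. whether the hit sits in a heading
    or in a paragraph.
    """
    hits = []
    for container in root.iter():
        if container.tag not in ("head", "p"):
            continue
        for s in container.iter("s"):
            for w in s.iter("w"):
                if w.get("hw") == hw and w.get("pos") == pos:
                    hits.append((container.tag, s.get("n"), (w.text or "").strip()))
    return hits


root = ET.fromstring(SAMPLE)
# Written texts are wrapped in <wtext>; anything else is treated as spoken here.
medium = "written" if root.tag == "wtext" else "spoken"
noun_hits = find_headword(root, "aids", "SUBST")
verb_hits = find_headword(root, "aids", "VERB")
print(medium, noun_hits, verb_hits)
```

In this fragment both hits for the headword aids are nouns, one in the heading and one in the paragraph, and there are no verb hits; plain text alone could not tell these cases apart.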

Has English moved on since the BNC?

types of text
  e-mail
  web pages / blogs
  SMS
  personal letters

topics
  globalization
  internet
  Elvis
  Word Perfect

how comparable is the Web?

Out of date?

The composition (and date) of any corpus affects inferences drawn from it

There aren't many alternatives
  Web-as-corpus: 85% of written texts aren't on the web - and spoken texts?
  Results from monitor corpora non-replicable
  Copyright permissions unrepeatable

Quantitative and qualitative comparative evaluations of BNC coverage are needed but “it's surprising how much is there”

What can you do with it?

The BNC is a problematizing resource...
  complements (and corrects) intuition
  increases learner autonomy
  critiques the myth of the native speaker

... for teacher and learner alike

XML makes it more accessible to non-specialist software (eg A0S in web browser)

You can use XAIRA to ...

find sample sentences
  cloze tests
check what the text book says
  grammar vs usage
(dis)confirm intuitions
find sample specialist texts
make serendipitous discoveries

Finding sample sentences

some phrases that take the gerund
  there's no point ....
  how / what about ...
generatable phrases
  [comparative] and [comparative]
sentence structures
  [s-initial interjection]

(Dis)confirming intuition

about choices
  have a problem + infinitive or gerund?
  do you make or take decisions?
about vocabulary
  which nouns collocate with hard?
about grammar
  I would be grateful if you [modal]?

Finding specialised texts

The BNC has an extraordinary range
  travel agent brochures, weather reports, formal invitations, advertising, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, leaflets, personal diaries...

The problem is finding it
  use WLD principle

For learners...

The same as teachers

Pointers to follow in the quest for idiomaticity
  collocations
  colligations
  semantic preferences
  semantic prosodies/pragmatic associations
  associations with particular genres/domains

Can learners use the BNC “autonomously”?

The ins and outs of autonomous use

Learners may need warning to...
  focus on patterns which recur, without necessarily trying to explain all the data
  avoid overgeneralisation

... and encouragement to
  be curious
  browse the context
  investigate exceptions

What are ins and outs?

(and are they the same as ups and downs)?

50 occurrences, sort left 2
colligation: (all) the ins and outs of
semantic preference: know/learn/understand/keep up with/get to grips with/get down to/forget; explain/teach/guide through/give/look at
semantic prosody: difficulty(?)
analysis: mainly spoken conversation, but numbers too small for reliable inference
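Sorting a concordance by its left context is what brings patterns such as "(all) the ins and outs of" to the surface. Xaira does this through its own interface; the sketch below only illustrates the idea in Python. The sentences are invented for the example (they are not BNC data), and the helper names `kwic` and `sort_left` are hypothetical.

```python
import re

# Invented example sentences, NOT taken from the BNC.
SENTENCES = [
    "You need to learn the ins and outs of the job.",
    "She knows all the ins and outs of the system.",
    "He explained the ins and outs of the contract.",
]


def kwic(sentences, phrase, width=3):
    """Return (left, node, right) word lists; contexts hold at most `width` words."""
    node = phrase.lower().split()
    lines = []
    for sent in sentences:
        tokens = re.findall(r"[A-Za-z']+", sent)
        lowered = [t.lower() for t in tokens]
        for i in range(len(tokens) - len(node) + 1):
            if lowered[i:i + len(node)] == node:
                left = tokens[max(0, i - width):i]
                right = tokens[i + len(node):i + len(node) + width]
                lines.append((left, tokens[i:i + len(node)], right))
    return lines


def sort_left(lines):
    """Sort by the words left of the node, nearest word first."""
    return sorted(lines, key=lambda line: [w.lower() for w in reversed(line[0])])


concordance = sort_left(kwic(SENTENCES, "ins and outs"))
for left, node, right in concordance:
    print(f"{' '.join(left):>25} | {' '.join(node)} | {' '.join(right)}")
```

Because every line here has "the" immediately before the node, the sort falls through to the second word on the left, which groups the preceding verbs and quantifiers together.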

Exploring idioms

make a point, the point is, point out
have a point, high point, point to
in point of fact, starting point, no point in
point of view, at X point, what's the point
to the point, see/get/grasp the point

Example: idioms with point

Exploring features of speech

PS6NR >: [laugh] he's not a millionaire yet.
PS6NM >: No so perhaps not, mm. Oh perhaps, perhaps he, perhaps he has the knowledge but has difficulty in er navigating his way to the betting shop to to do anything about it.
PS6NR >: [laugh]
PS6NM >: Anyway erm
PS6NR >: Right I've ... results see this is
PS6NM >: Mm.
PS6NR >: this is really what I'm [ ... ]
PS6NM >: Yeah.
PS6NR >: comparison of subjects within groups and between groups I thought that's
PS6NM >: Yeah, mm.
PS6NR >: like a typical [ ... ]

Examples: spoken discourse markers and back channels

Exploring productivity of affixes

How many adjectives can you think of ending in -ish? babyish, bearish, .... wankish, whorish, yobbish

How many nouns starting with anti-? How about verbs?
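Questions like these amount to counting distinct word types that share an affix and a word class. A minimal sketch in Python, assuming a list of (word form, C5 tag) pairs has already been extracted from the corpus; the list below is a hypothetical stand-in, not real BNC frequencies, and `types_with_affix` is an invented helper name.

```python
from collections import Counter

# Hypothetical (word form, C5 tag) pairs standing in for a corpus word list.
TOKENS = [
    ("babyish", "AJ0"), ("bearish", "AJ0"), ("yobbish", "AJ0"),
    ("bearish", "AJ0"), ("finish", "VVI"), ("dish", "NN1"),
    ("anti-hero", "NN1"), ("antibody", "NN1"), ("antique", "NN1"),
]


def types_with_affix(tokens, tag_prefix, suffix="", prefix=""):
    """Count tokens per word type whose tag starts with `tag_prefix`
    and whose form carries the given suffix and/or prefix."""
    return Counter(
        word.lower()
        for word, tag in tokens
        if tag.startswith(tag_prefix)
        and word.lower().endswith(suffix)
        and word.lower().startswith(prefix)
    )


ish_adjectives = types_with_affix(TOKENS, "AJ", suffix="ish")
anti_nouns = types_with_affix(TOKENS, "NN", prefix="anti")
print(sorted(ish_adjectives), sorted(anti_nouns))
```

Filtering on the word class tag removes finish and dish from the -ish list, but plain string matching still lets antique through as an apparent anti- noun, so the results need checking by eye.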

Creative writing

Paul Auster: City of Glass

It was the wrong number that started it, the telephone ringing three times in the dead of the night, and the voice on the other end asking for someone he was not.

Examples: story beginnings

Ian McEwan: Saturday

Everyone agrees, airliners look different these days, predatory and doomed.

Where can I get one?

BNC XML: http://www.natcorp.ox.ac.uk
  now available on DVD
  standalone single user licence or institutional licence
  discounted price till end June

XAIRA
  Delivered free with the BNC (and also available free from http://xaira.sf.net)
  Usable with any XML corpus
  Usable/ish on any platform
