Lou Burnard http://www.natcorp.ox.ac.uk BNC-XML: an introduction

Post on 15-Jan-2016

TRANSCRIPT

Page 1: Lou Burnard  BNC-XML: an introduction

Lou Burnard

http://www.natcorp.ox.ac.uk

BNC-XML:

an introduction

Page 2: Lou Burnard  BNC-XML: an introduction

What is the BNC?

a snapshot of British English, taken at the end of the 20th century

100 million words in approx 4000 different text samples, both spoken (10%) and written (90%)

synchronic (1990-4), sampled, general purpose corpus

available under licence; the latest edition is BNC-XML (13 March 2007)

Page 3: Lou Burnard  BNC-XML: an introduction

Production of the BNC

managed by an academic-industrial consortium with significant government funding

took three years (at least)

cost GBP 1.6 million (at least)

came about through an unusual coincidence of interests amongst:

• Lexicographical publishers
• Government (DTI)
• Science and Engineering Research Council

Target audience: lexicographers and NLP researchers, but not language teachers!

Page 4: Lou Burnard  BNC-XML: an introduction

Remember the Nineties?

WinWord or WP5? The choice is yours

On your desk ... a 386 with 50 Mb of disk space (just about enough to run Windows 3)

In your lab ... a VAX or a Sparc for serious work

On the WWW (maybe) ... Mosaic for X

Little text in digital format

Text encoding (under development)

• TEI
• SGML

Page 5: Lou Burnard  BNC-XML: an introduction

Corpus linguistics 90s-style

a world without the web!

corpus linguistics

• Traditionalists (ICAME)
• Expansionists (LDC, monitor corpora)

text encoding theory

language engineering and NLP

the JFIT mentality

Page 6: Lou Burnard  BNC-XML: an introduction

Project Goals

Stated

• a synchronic (1990-4) corpus of samples, both spoken and written, from the full range of British English language production
• of non-opportunistic design, for generic applicability
• with word class annotation and contextual information

Unstated

• better, more authoritative learner dictionaries
• a new template for European language resources
• a REALLY BIG corpus

Page 7: Lou Burnard  BNC-XML: an introduction

The BNC “sausage machine”

[Diagram: the BNC production pipeline]

Written (OUP/Chambers) and Spoken (Longman)

Initial CDIF conversion and validation (OUCS)

Word class annotation (UCREL)

Header generation and final validation (OUCS)

Stages: selection, clearance, and capture; enrichment and encoding; documentation, distribution, maintenance

Page 8: Lou Burnard  BNC-XML: an introduction

Distinctive features of the BNC

non-opportunistic design

standardized markup system

• structural annotation
• word class annotation
• contextual information

general availability

...in these respects, the BNC remains distinctive, twenty years on!

Page 9: Lou Burnard  BNC-XML: an introduction

Why BNC XML?

The BNC is still widely used ... but the technology has moved on

XML tools are everywhere ... so using the corpus is much easier

Conversion to XML was easy and (fairly) automatic ... but with more tractable markup some dusty corners needed sweeping out

Page 10: Lou Burnard  BNC-XML: an introduction

What's in the BNC?

Component                  Words
Books and periodicals      79,238,146
Other written               8,715,786
Spoken context-governed     6,175,896
Spoken demographic          4,233,955
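The proportions behind these figures can be checked with a few lines of Python. This is only a sketch: the pairing of each label with a number is an assumption based on the relative sizes of the components, and the variable names are made up for the example.

```python
# Word counts as given on the slide. Which number belongs to which label
# is an assumption inferred from the relative sizes of the components.
word_counts = {
    "Books and periodicals": 79_238_146,
    "Other written": 8_715_786,
    "Spoken context-governed": 6_175_896,
    "Spoken demographic": 4_233_955,
}

total = sum(word_counts.values())


def share(label: str) -> float:
    """Return the percentage of the whole corpus taken up by one component."""
    return 100 * word_counts[label] / total


# The two spoken components together form the spoken part of the corpus.
spoken_share = share("Spoken context-governed") + share("Spoken demographic")

for label, count in word_counts.items():
    print(f"{label:25s} {count:>11,d} {share(label):5.1f}%")
print(f"Spoken total: {spoken_share:.1f}%")
```

Under that assumption the spoken share comes out close to the "10%" quoted for the corpus design.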

Page 11: Lou Burnard  BNC-XML: an introduction

Needles and haystacks

The BNC has an extraordinary range: travel agent brochures, weather reports, formal invitations, advertising, publicity leaflets, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, business letters, personal diaries and correspondence ...

The problem is finding the specific texts you want

• Selection criteria
• Descriptive criteria
• Post-hoc categorization

(or use the WLD principle)

Page 12: Lou Burnard  BNC-XML: an introduction

BNC Design

Criteria for written texts (90%)

• Medium (books, newspapers, unpublished…)
• Domain (informative, entertaining…)

Criteria for transcribed speech events (10%)

Context governed half

• predefined list of speech situations

Demographically sampled half

• 200 volunteers, sampled for age, sex, region

These selection criteria make up a taxonomy, which is defined in the corpus header

Page 13: Lou Burnard  BNC-XML: an introduction

What topics?

Domain             Words
World Affairs      17,244,534
Imaginative        16,496,420
Social Science     14,025,537
Leisure            12,237,834
Commerce            7,341,163
Applied Science     7,174,152
Arts                6,574,857
Scientific          3,821,902
Belief              3,037,533

Page 14: Lou Burnard  BNC-XML: an introduction

Descriptive criteria

spoken texts

• speaker occupation, perceived accent, education level, personal relationship…
• speech domain, region, locale…

written texts

• author age, sex, type
• audience, circulation, status
• text-type classification

These criteria were used to maximize variation once selectional constraints had been applied

Page 15: Lou Burnard  BNC-XML: an introduction

Post-hoc text-type classification

[Chart: share of each text type, measured in sentences and in words. Categories: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, Other spoken]

Page 16: Lou Burnard  BNC-XML: an introduction

Annotation, encoding, markup

A means of making explicit, and thus processable:

structure

• texts, sections, paragraphs, turns, sentences, words...

metadata

• text-type, situational parameters, context

analysis

• morphology, syntactic function, translation

Adopting a single framework facilitates integration and sharing of fragmentary resources, thus enhancing research outcomes

also makes tool development much easier

Page 17: Lou Burnard  BNC-XML: an introduction

BNC structure

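[Diagram: the corpus is a single bnc element holding a corpus-level teiHeader and 4049 bncDoc elements. Each bncDoc holds its own teiHeader followed by either a wtext (written text) or an stext (spoken text); 908 of the texts are spoken.]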

Page 18: Lou Burnard  BNC-XML: an introduction

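BNC-XML structure

[Diagram: a wtext contains div elements, which contain p (paragraph) elements; an stext contains div elements, which contain u (utterance) elements. Both p and u contain s (sentence) elements, which in turn contain w (word) elements.]

Element counts shown on the slide: 6,026,284 s elements; 98,363,784 w elements; 784,484 and 1,599,692 for the remaining levels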
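Counts like these can be reproduced by walking the element tree. The sketch below uses Python's standard `xml.etree.ElementTree` on a made-up miniature written text (the sample content is invented for illustration and is not taken from the corpus); only the element names come from the slide.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up miniature written text, following the wtext > div > p > s > w
# nesting shown on the slide. The words themselves are invented.
SAMPLE = """
<wtext>
  <div level="1">
    <p>
      <s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="corpus" pos="SUBST">corpus</w><c c5="PUN">.</c></s>
      <s n="2"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VVZ" hw="grow" pos="VERB">grows</w><c c5="PUN">.</c></s>
    </p>
  </div>
</wtext>
"""


def count_elements(xml_text: str) -> Counter:
    """Count how often each element name occurs in an XML fragment."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


counts = count_elements(SAMPLE)
for name in ("div", "p", "s", "w", "c"):
    print(name, counts[name])
```

For the full corpus one would probably stream each file with `ET.iterparse` rather than load it whole, but the counting logic stays the same.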

Page 19: Lou Burnard  BNC-XML: an introduction

Word class annotation

CLAWS (Leech, Garside et al) approach

What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.

In BNC-XML, each word is explicitly marked and annotated with

• a root form or lemma
• an automatically assigned C5 word class code
• a simplified POS code

Page 20: Lou Burnard  BNC-XML: an introduction

Words and multiwords

English orthography can be misleading

... in spite of common sense... it wasn't me

In BNC XML, some “multiwords” are explicitly marked:

<mw c5="PRP">
  <w c5="PRP" pos="PREP" hw="in">in </w>
  <w c5="NN1" pos="SUBST" hw="spite">spite </w>
  <w c5="PRF" pos="PREP" hw="of">of </w>
</mw>

<w c5="PNP" pos="PRON" hw="it">it </w>
<w c5="VBD" pos="VERB" hw="be">was</w>
<w c5="XX0" pos="ADV" hw="not">n't </w>
<w c5="PNP" pos="PRON" hw="i">me </w>
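A small Python sketch shows what the `<mw>` wrapper buys you: the same sentence can be read either as orthographic words or with each multiword treated as one unit. The enclosing `<s>`, the tagging of "common sense", and the function name `tokens` are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The <mw> element is copied from the slide. The enclosing <s> and the
# tagging of "common sense" are assumptions made for this example.
FRAGMENT = """
<s>
  <mw c5="PRP">
    <w c5="PRP" pos="PREP" hw="in">in </w>
    <w c5="NN1" pos="SUBST" hw="spite">spite </w>
    <w c5="PRF" pos="PREP" hw="of">of </w>
  </mw>
  <w c5="AJ0" pos="ADJ" hw="common">common </w>
  <w c5="NN1" pos="SUBST" hw="sense">sense</w>
</s>
"""


def tokens(s_elem, merge_multiwords=True):
    """Return (text, C5 code) pairs for the direct content of an <s> element.

    With merge_multiwords=True an <mw> is reported as a single token carrying
    the code on the <mw> itself; otherwise its <w> children are listed.
    """
    result = []
    for child in s_elem:
        if child.tag == "mw" and merge_multiwords:
            text = "".join(w.text or "" for w in child.iter("w")).strip()
            result.append((text, child.get("c5")))
        elif child.tag == "mw":
            for w in child.iter("w"):
                result.append(((w.text or "").strip(), w.get("c5")))
        elif child.tag in ("w", "c"):
            result.append(((child.text or "").strip(), child.get("c5")))
    return result


sentence = ET.fromstring(FRAGMENT)
merged = tokens(sentence)
split = tokens(sentence, merge_multiwords=False)
print(merged)
print(split)
```

Which reading is appropriate depends on the question being asked: frequency counts of "spite" need the split view, whereas a count of prepositions arguably needs the merged one.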

Page 21: Lou Burnard  BNC-XML: an introduction

Structure of written texts

Most written texts are organized hierarchically into various kinds of division, shown by headings or other features:

<div level="1">
  <div level="2">...</div>
  <div level="2">...</div>
</div>

Some divisions are typed: e.g. chapter, section, story, subsection, column, front, part, recipe, leaflet...

all spoken texts are divided into “conversations”

Page 22: Lou Burnard  BNC-XML: an introduction

Features of written texts

Paragraph-like

• <p> marks paragraphs
• <head> marks headings or captions
• <list> marks lists
• <quote> marks quotes
• <lg> marks groups of verse lines

Paragraph-parts

• <hi> for typographic highlighting
• <corr> for corrected passages
• <gap> for deliberate omissions
• <pb/> for page breaks

Page 23: Lou Burnard  BNC-XML: an introduction

Speech in writing...

<sp>
  <speaker>
    <s n="20461">
      <w c5="NP0" hw="mr." pos="SUBST">Mr. </w>
      <w c5="NP0" hw="skinner" pos="SUBST">Skinner</w>
    </s>
  </speaker>
  ...
  <p>
    <s n="20468">
      <w c5="DT0" hw="that" pos="ADJ">That </w>
      <w c5="NN1" hw="millionaire" pos="SUBST">millionaire </w>
      <w c5="NN1" hw="mammy" pos="SUBST">mammy</w>
      <w c5="POS" hw="'s" pos="UNC">'s </w>
      <w c5="NN1" hw="boy" pos="SUBST">boy </w>
      <c c5="PUN">—</c>
    </s>
    <stage>
      <s n="20469">
        <w c5="NN1" hw="interruption" pos="SUBST">Interruption</w>
      </s>
    </stage>
  </p>
</sp>
<sp>
  <speaker>
    <s n="20470">
      <w c5="NP0" hw="mr." pos="SUBST">Mr. </w>
      <w c5="NP0" hw="speaker" pos="SUBST">Speaker</w>
    </s>
  </speaker>
  <p>
    <s n="20471">
      <w c5="NN1-VVB" hw="order" pos="SUBST">Order</w>
      <c c5="PUN">.</c>
    </s>
    <s n="20472">
      <w c5="DT0" hw="that" pos="ADJ">That </w>
      <w c5="VBZ" hw="be" pos="VERB">is </w>
      <w c5="XX0" hw="not" pos="ADV">not </w>
      <w c5="AV0" hw="wholly" pos="ADV">wholly </w>
      <w c5="AJ0" hw="unparliamentary" pos="ADJ">unparliamentary</w>
      <c c5="PUN">.</c>
    </s>
  </p>
</sp>
<!-- HHV -->
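Because speakers and speeches are marked explicitly, pulling out "who said what" needs very little code. The following sketch uses Python's standard `xml.etree.ElementTree` on an abridged copy of the second `<sp>` above; the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the second <sp> from the slide, kept on one line so that
# no extra whitespace is introduced between the tokens.
SAMPLE = (
    '<sp><speaker><s n="20470">'
    '<w c5="NP0" hw="mr." pos="SUBST">Mr. </w>'
    '<w c5="NP0" hw="speaker" pos="SUBST">Speaker</w>'
    '</s></speaker>'
    '<p><s n="20471">'
    '<w c5="NN1-VVB" hw="order" pos="SUBST">Order</w>'
    '<c c5="PUN">.</c>'
    '</s></p></sp>'
)


def text_of(elem):
    """Concatenate all character data below an element."""
    return "".join(elem.itertext()).strip()


def speeches(root):
    """Yield (speaker name, spoken text) for every <sp> in the tree."""
    for sp in root.iter("sp"):
        speaker = sp.find("speaker")
        name = text_of(speaker) if speaker is not None else None
        body = " ".join(text_of(p) for p in sp.findall("p"))
        yield name, body


result = list(speeches(ET.fromstring(SAMPLE)))
print(result)
```

Note that the tokens carry their own trailing spaces, so joining the character data is enough to rebuild the running text.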

Page 24: Lou Burnard  BNC-XML: an introduction

Structure of spoken texts

<u who="XXX"> marks a stretch of speech initiated by the speaker identified as XXX

<align with="XXX"/> marks a synchronization point

detailed information on speakers is given in the text header

other features of transcribed speech are also marked...

Page 25: Lou Burnard  BNC-XML: an introduction

Features of spoken texts

<shift> marks changes in voice quality

• e.g. whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance.

<vocal> marks non-verbal but vocalised sounds

• e.g. coughs, humming noises etc.

<event> marks non-verbal and non-vocal events

• e.g. passing lorries, animal noises, and other matters considered worthy of note.

<pause> marks significant pauses

• silence, within or between utterances, longer than was judged normal for the speaker or speakers.

<unclear> marks unclear passages

• whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.

Page 26: Lou Burnard  BNC-XML: an introduction

Event descriptions

• baby
• baby burped
• baby cries
• baby cry
• baby crying
• baby crying in background
• baby gurgling
• baby laughing
• baby noise
• baby noises
• baby screaming
• baby shouting
• baby shouting over the top
• baby shouts
• baby speaking
• baby squealing
• baby talk
• baby talking
• background chatter
• background chatter in pub
• background chatter in pub
• background chatting shuffling etcetera
• background conversation

Page 27: Lou Burnard  BNC-XML: an introduction

Vocal descriptions

<vocal desc="big breath"/>
<vocal desc="breathing out suddenly"/>
<vocal desc="drawing in breath"/>
<vocal desc="exhales"/>
<vocal desc="indrawn breath"/>
<vocal desc="inhales"/>
<vocal desc="intake of breath"/>
<vocal desc="sharp intake of breath"/>
<vocal desc="takes a deep breath"/>
<vocal desc="takes breath"/>

<vocal desc="breath"/>

<vocal desc="astonished snort"/>
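The descriptions are free text, so anyone counting vocal events has to normalize them first. Here is a rough sketch of one way to do that in Python; the keyword list and the category names are invented for the example, and a real study would need a more careful mapping.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A handful of the desc values listed on the slide, wrapped in a made-up <u>.
SAMPLE = """
<u who="XXX">
  <vocal desc="big breath"/>
  <vocal desc="exhales"/>
  <vocal desc="inhales"/>
  <vocal desc="sharp intake of breath"/>
  <vocal desc="takes breath"/>
  <vocal desc="astonished snort"/>
</u>
"""

# Hypothetical keyword list: any description containing one of these
# substrings is treated as a breathing noise.
BREATH_WORDS = ("breath", "inhale", "exhale")


def vocal_category(desc):
    """Map a free-text description onto a coarse category."""
    lowered = desc.lower()
    if any(keyword in lowered for keyword in BREATH_WORDS):
        return "breath"
    return "other"


root = ET.fromstring(SAMPLE)
categories = Counter(
    vocal_category(vocal.get("desc", "")) for vocal in root.iter("vocal")
)
print(categories)
```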

Page 28: Lou Burnard  BNC-XML: an introduction

Contextual information

each text has a TEI header

• identification and classification
• specific details (e.g. speakers)

all common data in the corpus header

• classification(s) in header are pointed to by individual texts

Page 29: Lou Burnard  BNC-XML: an introduction

Structure of the TEI Header

File Description <fileDesc>

• Title Statement
• Responsibility Statement/s
• Edition Statement
• Extent
• Publication Statement
• Identification numbers
• Source Description

Encoding Description

• Tagging Declaration

Profile Description

• Creation
• [Participant Description]
• Text Classification

Revision Description

Page 30: Lou Burnard  BNC-XML: an introduction

The title Statement

<title>How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure) </title>

<title>Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context </title>

<title>32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings. </title>

<title>[Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)</title>

<titleStmt>
  <title>The age of capital 1848-1875. Sample containing about 41650 words from a book (domain: world affairs) </title>
  <respStmt>
    <resp>Data capture and transcription</resp>
    <name>Oxford University Press </name>
  </respStmt>
</titleStmt>

Page 31: Lou Burnard  BNC-XML: an introduction

The edition statement

<editionStmt>
  <edition>BNC XML Edition, December 2006</edition>
</editionStmt>
<extent> 41650 tokens; 41573 w-units; 1436 s-units </extent>
<publicationStmt>
  <distributor>Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.</distributor>
  <availability> This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.</availability>
  <idno type="bnc">J0P</idno>
  <idno type="old"> AgeCap </idno>
</publicationStmt>
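The `<extent>` element packs three figures into one string, so a little parsing is needed before they can be used. A minimal Python sketch follows, assuming every header uses the same "number unit; number unit" layout as the example (the function name is made up):

```python
import re

# Content of the <extent> element shown on the slide.
EXTENT = " 41650 tokens; 41573 w-units; 1436 s-units "


def parse_extent(text):
    """Turn '41650 tokens; 41573 w-units; ...' into {'tokens': 41650, ...}."""
    pairs = re.findall(r"(\d+)\s+([\w-]+)", text)
    return {unit: int(number) for number, unit in pairs}


stats = parse_extent(EXTENT)
words_per_sentence = stats["w-units"] / stats["s-units"]
print(stats)
print(f"{words_per_sentence:.1f} w-units per s-unit")
```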

Page 32: Lou Burnard  BNC-XML: an introduction

The source description 1

<sourceDesc>
  <bibl>
    <title>The age of capital 1848-1875. </title>
    <author n="HobsbE1" domicile="England">Hobsbawm, E J</author>
    <imprint n="ABACUS1">
      <publisher>Abacus</publisher>
      <pubPlace>London</pubPlace>
      <date value="1977">1977</date>
    </imprint>
    <pp>203-316</pp>
  </bibl>
</sourceDesc>
</fileDesc>

Page 33: Lou Burnard  BNC-XML: an introduction

The source description 2

<sourceDesc>
  <recordingStmt>
    <recording xml:id="KE5RE000" n="035201" date="1992-02-20" time="11:50+" type="Walkman"/>
    <recording xml:id="KE5RE001" n="035202" date="1992-02-20" time="11:50+" type="Walkman"/>
    <recording xml:id="KE5RE002" n="035203" date="1992-02-23" time="17:05+" type="Walkman"/>
    <recording xml:id="KE5RE003" n="035204" date="1992-02-22" type="Walkman"/>
  </recordingStmt>
</sourceDesc>

Page 34: Lou Burnard  BNC-XML: an introduction

The encoding description

<encodingDesc>
  <tagsDecl>
    <namespace name="">
      <tagUsage gi="c" occurs="5750"/>
      <tagUsage gi="corr" occurs="1"/>
      <tagUsage gi="div" occurs="115"/>
      <tagUsage gi="gap" occurs="3"/>
      <tagUsage gi="head" occurs="156"/>
      <tagUsage gi="hi" occurs="147"/>
      <tagUsage gi="l" occurs="2"/>
      <tagUsage gi="lg" occurs="1"/>
      <tagUsage gi="mw" occurs="256"/>
      <tagUsage gi="p" occurs="680"/>
      <tagUsage gi="quote" occurs="3"/>
      <tagUsage gi="s" occurs="2415"/>
      <tagUsage gi="w" occurs="41799"/>
    </namespace>
  </tagsDecl>
</encodingDesc>
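Since the tagging declaration already records how often each element occurs, simple profile figures for a text can be read from the header without touching the text itself. A sketch in Python, using a subset of the `<tagUsage>` entries above (the function name is made up):

```python
import xml.etree.ElementTree as ET

# A subset of the <tagUsage> entries shown on the slide.
TAGS_DECL = """
<tagsDecl>
  <namespace name="">
    <tagUsage gi="c" occurs="5750"/>
    <tagUsage gi="mw" occurs="256"/>
    <tagUsage gi="p" occurs="680"/>
    <tagUsage gi="s" occurs="2415"/>
    <tagUsage gi="w" occurs="41799"/>
  </namespace>
</tagsDecl>
"""


def tag_usage(tags_decl_xml):
    """Return a mapping from element name (gi) to its recorded frequency."""
    root = ET.fromstring(tags_decl_xml)
    return {
        usage.get("gi"): int(usage.get("occurs"))
        for usage in root.iter("tagUsage")
    }


usage = tag_usage(TAGS_DECL)
sentences_per_paragraph = usage["s"] / usage["p"]
print(usage)
print(f"{sentences_per_paragraph:.2f} s elements per p element")
```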

Page 35: Lou Burnard  BNC-XML: an introduction

The profile description (written)

<profileDesc>
  <creation date="1962"> </creation>
  <textClass>
    <catRef targets="WRI ALLTIM1 ALLAVA2 ALLTYP3 WRIAAG4 WRIAD1 WRIASE1 WRIATY3 WRIAUD3 WRIDOM5 WRILEV2 WRIMED1 WRIPP5 WRISAM3 WRISTA2 WRITAS0"/>
    <classCode scheme="DLEE">W nonAc: humanities arts</classCode>
    <keywords scheme="COPAC">
      <term>History, Modern - 19th century</term>
      <term>Capitalism - History - 19th century</term>
      <term>World, 1848-1875</term>
    </keywords>
  </textClass>
</profileDesc>

Page 36: Lou Burnard  BNC-XML: an introduction

Classification codes

Codes used are predefined in the Corpus header

<taxonomy xml:id="WRIDOM">
  <desc>Written Domain</desc>
  <category xml:id="WRIDOM1">
    <catDesc>Imaginative</catDesc>
  </category>
  <category xml:id="WRIDOM2">
    <catDesc>Natural and pure sciences</catDesc>
  </category>
  <category xml:id="WRIDOM3">
    <catDesc>Applied sciences</catDesc>
  </category>
  ...
</taxonomy>
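Resolving a code from a text's `<catRef targets="...">` then amounts to a dictionary lookup against the corpus header. The sketch below uses Python's standard `xml.etree.ElementTree`; it covers only the three categories shown on the slide, and the target list is a shortened, partly hypothetical example, so codes from other taxonomies simply come back unresolved.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The part of the taxonomy shown on the slide (the rest is elided there).
TAXONOMY = """
<taxonomy xml:id="WRIDOM">
  <desc>Written Domain</desc>
  <category xml:id="WRIDOM1"><catDesc>Imaginative</catDesc></category>
  <category xml:id="WRIDOM2"><catDesc>Natural and pure sciences</catDesc></category>
  <category xml:id="WRIDOM3"><catDesc>Applied sciences</catDesc></category>
</taxonomy>
"""


def category_index(taxonomy_xml):
    """Map each category identifier to its human-readable description."""
    root = ET.fromstring(taxonomy_xml)
    return {
        category.get(XML_ID): category.findtext("catDesc").strip()
        for category in root.iter("category")
    }


index = category_index(TAXONOMY)

# Hypothetical, shortened targets value in the style of a catRef element.
targets = "WRI ALLTIM1 WRIDOM3 WRIMED1".split()
resolved = {code: index.get(code) for code in targets}
print(resolved)
```

In practice the index would be built once from all the taxonomies in the corpus header and reused for every text.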

Page 37: Lou Burnard  BNC-XML: an introduction

The profile description (spoken)

<profileDesc>
  <creation date="1992">1992-02-23 </creation>
  <particDesc n="108">
    <person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS">
      <age>20</age>
      <persName>Wayne</persName>
      <occupation>unemployed</occupation>
      <dialect>Central South-west England</dialect>
    </person>
    ....
  </particDesc>
  <settingDesc>
    <setting xml:id="KE5SE000" n="035201" who="PS000 PS0X2">
      <placeName>Hampshire: Andover </placeName>
      <locale> local shop </locale>
      <activity spont="H"> visiting friends</activity>
    </setting>
    ...
  </settingDesc>
</profileDesc>
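The participant description is what gives the `who` attribute on `<u>` its meaning. A sketch of the lookup in Python: the `<person>` record is the one from the slide, while the utterance and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <person> record from the slide, without the elided siblings.
HEADER = """
<particDesc n="108">
  <person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS">
    <age>20</age>
    <persName>Wayne</persName>
    <occupation>unemployed</occupation>
    <dialect>Central South-west England</dialect>
  </person>
</particDesc>
"""

# Hypothetical utterance, invented for this example.
UTTERANCE = (
    '<u who="PS0X2"><s n="1">'
    '<w c5="ITJ" hw="hello" pos="INTERJ">Hello</w>'
    '</s></u>'
)


def speaker_index(partic_desc_xml):
    """Map each speaker identifier to a few details from the header."""
    root = ET.fromstring(partic_desc_xml)
    index = {}
    for person in root.iter("person"):
        index[person.get(XML_ID)] = {
            "name": person.findtext("persName"),
            "age": person.findtext("age"),
            "sex": person.get("sex"),
            "occupation": person.findtext("occupation"),
        }
    return index


speakers = speaker_index(HEADER)
utterance = ET.fromstring(UTTERANCE)
speaker = speakers.get(utterance.get("who"))
print(speaker)
```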

Page 38: Lou Burnard  BNC-XML: an introduction

Has English moved on?

types of text

• e-mail
• web pages / blogs
• SMS
• personal letters

topics

• globalization
• internet
• Elvis
• Word Perfect

Page 39: Lou Burnard  BNC-XML: an introduction

Out of date?

The composition (and date) of any corpus affects inferences drawn from it

There aren't many alternatives

• Web-as-corpus
• sources of spoken texts?
• monitor corpora are non-replicable
• copyright permissions unrepeatable

Quantitative and qualitative comparative evaluations of BNC coverage are needed

but “it's surprising how much is there”

Page 40: Lou Burnard  BNC-XML: an introduction

Why is it still useful?

The BNC is a problematizing resource...

• complements (and corrects) intuition
• increases learner autonomy
• critiques the myth of the native speaker

... for teacher and learner alike

XML makes it more usable by non-specialist software

Its range and availability make it unique

Page 41: Lou Burnard  BNC-XML: an introduction

Where can I get one?

BNC XML: http://www.natcorp.ox.ac.uk

• now available on DVD
• standalone single user licence or institutional licence
• existing licensees should renew

XAIRA

• Delivered free with the BNC (and also available free from http://xaira.sf.net)
• Usable with any XML corpus
• Usable/ish on any platform