
Corpus design II

See G. Kennedy, Introduction to Corpus Linguistics, Ch. 2

C. F. Meyer, English Corpus Linguistics, Ch. 3

2/18

Issues in corpus design

• General purpose vs specialized
• Dynamic (monitor) vs static
• Representativeness and balance
• Size
• Collection, permission
• Text capture and markup
• Storage and access
• Organizations

3/18

Collecting samples of speech

• Aim to collect natural samples
• Cannot tape record surreptitiously
  – Early corpora were done in this way, with permission sought afterwards
  – Nowadays regarded as unethical, perhaps even illegal
• “Observer’s paradox”: the presence of the recorder affects behaviour
• Can be overcome (somewhat) by recording lots of material and sampling from the middle, as in the sketch below
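A toy sketch in Python (my own illustration, not from the slides; the durations are invented) of the “sample from the middle” idea: take a window centred on the midpoint of a long recording, so the opening portion, where the observer’s paradox is strongest, is discarded.

# Toy illustration of sampling from the middle of a long recording.

def middle_window(total_seconds: float, sample_seconds: float) -> tuple[float, float]:
    """Return (start, end) in seconds of a sample window centred in the recording."""
    if sample_seconds >= total_seconds:
        return (0.0, total_seconds)
    start = (total_seconds - sample_seconds) / 2
    return (start, start + sample_seconds)

# eg a 10-minute sample from a 2-hour recording
print(middle_window(2 * 60 * 60, 10 * 60))   # (3300.0, 3900.0)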

4/18

Collecting written samples

• Much easier to obtain, but beware the important issue of permission
  – Copyrighted material cannot be freely stored and distributed
  – “Fair use” law allows use of up to 2,000 words for private research
  – Corpus samples are often >2,000 words, and often distributed widely, sometimes for profit (or at least at a price to cover/recoup costs)
  – Copyright laws may differ between countries

5/18

Permission

• Obtaining copyright permission can be quite onerous
  – Time-consuming to wait for a reply to a request: do you go ahead and include the text (ie start work on annotation and mark-up), or wait?
  – Big risk, eg the English-Norwegian Parallel Corpus contains copyrighted material and can only be used by U Oslo researchers, on site!

6/18

Text capture

• Easiest if the text is already machine-readable, though there may still be some issues with mark-up
  – eg machine-readable text (MRT) obtained from publishers may have print formatting information embedded in it
  – Text captured from an online source may have HTML mark-up (see the sketch below)
• If the text exists in printed form, scanning is a possibility
  – OCR is generally very good quality, but the text must still be carefully checked
  – Issue of how to deal with printing effects such as hyphenation, headers and footers, footnotes
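A minimal sketch (my own, not part of the slides) of capturing text from an online source and discarding the HTML mark-up, using only Python’s standard-library html.parser:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the character data of an HTML page, ignoring tags, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0          # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def strip_markup(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(strip_markup("<p>Corpus <b>design</b> II</p><script>x=1;</script>"))
# -> "Corpus design II"

In practice, pages captured from the web also carry boilerplate (menus, headers, footers) that needs removing, which is a harder, corpus-specific decision.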

7/18

Text capture: re-keying

• If OCR is not suitable/available
  – eg hand-written texts, or the medium is not flat
• Re-keying is the only option
• Highly expensive, time-consuming and error-prone
• With manuscripts, there may be an issue of “keyboarder correction”
  – Example of a Learner English corpus of handwritten essays: important not to correct “errors”
  – A PhD student collected handwritten essays by (Arabic) learners of English for error analysis: the first task was to “type them in”

8/18

Handwritten text

(Referring to examples of handwriting shown on the slide:)
• Are these capital Ts?
• Is this crossed out?
• Is this a v or a t?
• Is this depend or depond?
• etc.
• What does this say?
• Compared to these?

9/18

Mark-up

• Issues like this can be overcome by mark-up
• Annotate the text to show explicitly where there is anything special (illustrated below)
  – Doubtful text
  – Incorrect text (mark-up can show what was probably meant)
  – Extraneous material
• This is also an important issue in computer storage of ancient manuscripts
• More detail later
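A rough illustration in Python (my own sketch; the element names <choice>, <sic>, <corr> and <unclear> follow TEI-style conventions, and the sentence itself is invented) of making an incorrect reading and a doubtful reading explicit in the mark-up rather than silently correcting them, using the “depend or depond” example above:

import xml.etree.ElementTree as ET

s = ET.Element("s")                      # one sentence of a learner essay
s.text = "it will "

choice = ET.SubElement(s, "choice")      # keep the learner's error, record the likely intent
ET.SubElement(choice, "sic").text = "depond"
ET.SubElement(choice, "corr").text = "depend"
choice.tail = " on the "

unclear = ET.SubElement(s, "unclear")    # the transcriber could not read this word with confidence
unclear.set("cert", "low")
unclear.text = "weather"
unclear.tail = "."

print(ET.tostring(s, encoding="unicode"))
# <s>it will <choice><sic>depond</sic><corr>depend</corr></choice> on the <unclear cert="low">weather</unclear>.</s>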

10/18

Speech corpora

• For speech, “corpus” usually means transcribed speech data

• Many issues surrounding transcription of speech

• Some of them similar to issues with handwriting

• Others particular to speech

11/18

Transcribing speech

• Not just a matter of typing in what was said, though this is of course a major element
  – And may not be straightforward
  – How much “correction” to do in transcription
  – eg of hesitations, false starts, and other speech phenomena
• Speech corpora usually encode information about paralinguistic and non-linguistic features
  – Speed of delivery, pauses
  – Loudness (whispering, shouting, singing)
  – Coughs and other non-speech sounds which may be meaningful (grunt, tutting, hesitation noises)
  – Even outside noises if relevant (eg passing siren, music, animals), as they might “contribute” to the discussion

12/18

Transcribing speech

• Some conventions have emerged, eg …
• Vocalized pauses: use phonetic symbols or conventional spelling
  – or uh, ah, erm, uhuh (!)
• How to transcribe contractions like gotta, gonna, sorta, …
  – Notice how some are completely conventional, eg can’t, won’t
• How (and whether) to transcribe partially uttered words and repetitions
• How to represent unintelligible speech (see the sketch below)
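A small sketch in Python (the particular pause spellings and the “(??)” mark for unintelligible stretches are my own assumptions, not an established standard) of normalising vocalized pauses to conventional spellings and flagging unintelligible speech explicitly during transcript clean-up:

import re

# map assorted spellings of vocalized pauses onto one conventional form each
PAUSE_FORMS = {
    "er": "er", "err": "er", "uh": "er",
    "erm": "erm", "um": "erm", "uhm": "erm",
    "uhuh": "uhuh", "mhm": "uhuh",
}

UNINTELLIGIBLE = re.compile(r"\(\?+\)")    # eg "(??)" in the raw transcript

def normalise(utterance: str) -> str:
    words = []
    for w in utterance.lower().split():
        if UNINTELLIGIBLE.fullmatch(w):
            words.append("<unclear/>")     # explicit mark-up, not a guess
        else:
            words.append(PAUSE_FORMS.get(w, w))
    return " ".join(words)

print(normalise("um I (??) gonna go err tomorrow"))
# -> "erm i <unclear/> gonna go er tomorrow"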

13/18

Storage

• Where will the data be kept, and who will have access?
  – If the corpus is for public distribution, will it be by license, or freely available?
  – If by license, distribute online (with password) or on CD?
• Nowadays, fortunately, size is not such an issue, though
  – Big corpora have to be distributed on multiple CDs
  – Downloading from a website can take hours
• Note that it is not only the corpus data that must be distributed:
  – Many corpora have associated software packages to facilitate exploration
  – For speech corpora, original recordings may be available

14/18

Access

• Efficient access to corpus data comes hand-in-hand with corpus structure

• No good having a structured corpus if that structure can’t be used to delimit searches

• Best if the corpus is cross-indexed on all searchable criteria, ie all the details that are encoded in headers (see the sketch below)
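A minimal sketch in Python (the file names and header fields are hypothetical) of cross-indexing corpus files on the details encoded in their headers, so that a search can be delimited by genre, year, mode, etc. without scanning every file:

from collections import defaultdict

# headers as they might be extracted from each corpus file
HEADERS = [
    {"file": "A01.txt", "genre": "press",        "year": 1991, "mode": "written"},
    {"file": "K12.txt", "genre": "fiction",      "year": 1989, "mode": "written"},
    {"file": "S05.txt", "genre": "conversation", "year": 1992, "mode": "spoken"},
]

# build one inverted index per searchable header field
indexes: dict[str, dict[object, set[str]]] = defaultdict(lambda: defaultdict(set))
for h in HEADERS:
    for field, value in h.items():
        if field != "file":
            indexes[field][value].add(h["file"])

def select(**criteria) -> set[str]:
    """Files satisfying all header criteria, eg select(mode='written', genre='press')."""
    sets = [indexes[f][v] for f, v in criteria.items()]
    return set.intersection(*sets) if sets else set()

print(sorted(select(mode="written")))            # ['A01.txt', 'K12.txt']
print(sorted(select(genre="press", year=1991)))  # ['A01.txt']

Real corpus tools typically build such indexes once, at compilation time, and store them alongside the texts.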

15/18

Organizations

• Several organizations, often based in universities, have their own corpus material, and are also very active in issues surrounding Corpus Linguistics

• “corpora” mailing list http://nora.hd.uib.no/corpora/
• ELRA – European Language Resources Association http://www.elra.info/
• LDC – Linguistic Data Consortium http://www.ldc.upenn.edu/
• TEI – Text Encoding Initiative http://www.tei-c.org/

16/18

ELRA (European Language Resources Association)

• Aims to make language resources for language engineering available, and to evaluate language engineering technologies
• Active in identification, distribution, collection, validation, standardisation, improvement
• Promotes the production of language resources
• Supports the infrastructure to perform evaluation campaigns
  – Mainly through ELDA (Evaluation and Language Resources Distribution Agency) http://www.elda.org/
• http://www.elra.info/

17/18

LDC (Linguistic Data Consortium)

• Based at U Penn
• Supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards
• http://www.ldc.upenn.edu/

18/18

TEI (Text Encoding Initiative)

• Collectively develops and maintains a standard for the representation of texts in digital form
• Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics
• http://www.tei-c.org/index.xml
