Representativeness, balance and sampling in corpus linguistics


Upload: elmo-doyle

Post on 02-Jan-2016



Introduction

A corpus is designed to represent a particular language or language variety. It is impossible to analyse every extant utterance or sentence of a given language.

Sampling is unavoidable.

How can you be sure that the sample you are studying is representative of the language or language variety under consideration?

One must consider

Balance

Sampling

to ensure representativeness

Representativeness in Corpus Linguistics

Biber (1993: 243): “Representativeness refers to the extent to which a sample includes the full range of variability in a population.”

Population: the language or language variety of which a corpus is a sample.

Two factors that determine the representativeness of a corpus

The range of genres included in a corpus (i.e. balance).

How text chunks for each genre are selected (i.e. sampling).

Criteria used to select texts for a corpus

External: situational; genre / register

Internal: linguistic / text types

Internal criteria are problematic and circular

Internal criteria, such as the distribution of words or grammatical features, cannot be the primary parameters for selecting corpus data.

A corpus compiled on the basis of internal criteria has been skewed by design: its linguistic characteristics are partly a product of the selection process.

Sinclair (1995):

The texts or parts of texts to be included in a corpus should be selected according to external criteria so that their linguistic characteristics are independent of the selection process.

Another aspect of representativeness: Change over time

A corpus can be viewed as a language model that is either

Static (applies to a sample corpus)

Dynamic (applies to a monitor corpus)

Static view of language

Helsinki Diachronic Corpus

Lancaster-Oslo/Bergen Corpus (LOB)

Freiburg-LOB Corpus (FLOB)

Sampling frame

LOB: British English, early 1960s

FLOB: British English, early 1990s

Representativeness of General and Specialized Corpora

General corpora (e.g., the BNC) serve as a basis for an overall description of a language or language variety. They should include samples from a broad range of genres.

Specialized corpora tend to be specific to a domain (e.g., medicine or law) or a genre (e.g., fiction, newspaper texts, academic prose).

Balance

The range of text categories included in the corpus determines how balanced the corpus is.

The acceptable balance of a corpus is determined by its intended uses.

A balanced corpus usually covers a wide range of text categories which are supposed to be representative of a language or language variety under consideration.

These text categories are typically sampled proportionally for inclusion in a corpus.
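Proportional sampling can be illustrated with a short sketch. It assumes we know how many sampling units each text category contributes to the sampling frame; the category names and counts below are hypothetical, and the largest-remainder rounding is one possible choice among several rather than a method prescribed by any particular corpus project.

```python
from math import floor


def proportional_allocation(population_sizes, total_samples):
    """Split total_samples across categories in proportion to their size.

    Uses the largest-remainder method so that the quotas add up
    exactly to total_samples.
    """
    total = sum(population_sizes.values())
    # Exact (possibly fractional) share of each category.
    exact = {cat: total_samples * n / total for cat, n in population_sizes.items()}
    # Start from the whole-number part of each share.
    quotas = {cat: floor(share) for cat, share in exact.items()}
    leftover = total_samples - sum(quotas.values())
    # Give the remaining samples to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - quotas[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        quotas[cat] += 1
    return quotas


# Hypothetical number of titles per category in a sampling frame (not real data).
frame_sizes = {"fiction": 1200, "press": 800, "academic": 500}
quotas = proportional_allocation(frame_sizes, 100)
print(quotas)
```

A category that makes up about half of the frame receives about half of the samples, which is what "sampled proportionally" amounts to in practice.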

Balance

There is no scientific measure of corpus balance.

A more typical approach to corpus balance is for corpus builders to adopt an existing corpus model when building their own corpus, assuming that balance will be achieved through the adopted model.

The BNC is generally accepted as being a balanced corpus. Corpora that follow the BNC model include:

American National Corpus

Korean National Corpus

Polish National Corpus

Russian Reference Corpus

Balance

Balance is a more important issue for a static sample corpus than for a dynamic monitor corpus.

Builders of monitor corpora consider the size of a corpus more important than its balance. They assume that a corpus will balance itself once it reaches a substantial size.

Domain                  %       Date          %       Medium             %
Imaginative             21.91   1960-74       2.26    Book               58.58
Arts                    8.08    1975-93       89.23   Periodical         31.08
Belief and thought      3.40    Unclassified  8.49    Misc. published    4.38
Commerce / finance      7.93                          Misc. unpublished  4.00
Leisure                 11.13                         To-be-spoken       1.52
Natural / pure science  4.18                          Unclassified       0.40
Applied science         8.21
Social science          14.80
World affairs           18.39
Unclassified            1.93

Composition of written BNC

Region        %       Interaction type  %       Context-governed           %
South         45.61   Monologue         18.64   Educational / informative  20.56
Midlands      23.33   Dialogue          74.87   Business                   21.47
North         25.43   Unclassified      6.48    Institutional              21.86
Unclassified  5.61                              Leisure                    23.71
                                                Unclassified               12.38

Composition of spoken BNC

Sampling

To obtain a representative sample from a population:

Define the sampling unit and the boundaries of the population.

For written texts, sampling units can be books, periodicals or newspapers.

The list of sampling units is referred to as the sampling frame.

LOB corpus

Target population: all written English texts published in the United Kingdom in 1961

Sampling frame: British National Bibliography Cumulated Subject Index 1960-1964, for books

Willing’s Press Guide 1961, for periodicals

Sample size

Full texts (i.e., whole texts)

Text chunks (ideal: sample text fragments, 2,000 running words)
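As a rough illustration of chunk-based sampling, the sketch below cuts one contiguous fragment of 2,000 running words from a longer text. The function name `sample_chunk` and the whitespace tokenisation are assumptions made for the example; a real corpus project would use its own tokeniser and would usually adjust the cut to a sentence or paragraph boundary.

```python
import random


def sample_chunk(text, chunk_size=2000, rng=None):
    """Return a contiguous chunk of chunk_size running words from text.

    Texts shorter than chunk_size are returned whole.
    """
    rng = rng or random.Random()
    words = text.split()  # crude whitespace tokenisation (an assumption)
    if len(words) <= chunk_size:
        return words
    # Pick a random starting point that leaves room for a full chunk.
    start = rng.randrange(len(words) - chunk_size + 1)
    return words[start:start + chunk_size]


# A made-up 5,000-word "text" whose words are simply numbered.
document = " ".join(f"w{i}" for i in range(5000))
chunk = sample_chunk(document, rng=random.Random(0))
print(len(chunk), chunk[0], chunk[-1])
```

Choosing the starting point at random avoids always sampling the opening of a text, which may differ linguistically from its middle or end.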

A population can be defined in terms of

Language production (demographically oriented)

Language reception (demographically oriented)

Language as a product (text category / genre of language)

Different sampling techniques

Simple random sampling: all sampling units within the sampling frame are numbered and the sample is chosen using a table of random numbers.

Stratified random sampling: the whole population is first divided into relatively homogeneous groups (strata), and each stratum is then sampled at random.

Demographic sampling (i.e., categorizing sampling units on the basis of speaker/writer age, sex and social class) is also a type of stratified sampling.
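The difference between the first two techniques can be sketched in a few lines. The sampling frame, the category labels and the quotas below are all hypothetical, and a pseudo-random number generator stands in for the table of random numbers.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, k, rng):
    """Draw k units from the whole sampling frame, ignoring categories."""
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, quotas, rng):
    """Group units into strata, then draw a random sample within each stratum."""
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name, k in quotas.items():
        sample.extend(rng.sample(strata[name], k))
    return sample


# Hypothetical frame of (title, category) pairs: 30 fiction and 60 press texts.
frame = [(f"text{i:03d}", "fiction" if i % 3 == 0 else "press") for i in range(90)]
rng = random.Random(42)

srs = simple_random_sample(frame, 12, rng)
strat = stratified_random_sample(
    frame, lambda unit: unit[1], {"fiction": 4, "press": 8}, rng
)
```

With simple random sampling the number of fiction texts in the sample varies from draw to draw and a small category may be missed altogether; the stratified version always returns exactly the stated quota per category. For demographic sampling, `stratum_of` would return a speaker attribute such as an age band, sex or social class instead of a text category.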

Stratified Random Sampling: Brown and LOB Corpora

The target population for each corpus was first grouped into 15 text categories such as news reportage, academic prose, different types of fiction.

Samples were then drawn from each text category.

Conclusion

In constructing a balanced, representative corpus, stratified random sampling is to be preferred.

For written texts, a text typology established on the basis of external criteria is relevant.

For spoken data, demographic sampling is appropriate. It must be complemented by context-governed sampling so that context-governed linguistic variation is also included in the resulting corpus.