
Corpus Linguistics

Developing a PolyU Language Bank

Sherman Lee [email protected]

PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan

Outline

Background
  Goals of corpus linguistics
  Types of corpora
  Applications of corpus analysis

As an illustration
  Exploring units of meaning
  Case study

Developing a PolyU Language Bank
  Aims and objectives of project
  Similar existing projects
  Procedures

The PolyU Language Bank
  Current status
  Sample corpora
  Sample search

Goals of corpus linguistics

Chomskyan linguistics
  ‘Langue’ (competence)
  Ideal speaker/hearer
  Language = innate mental faculty
  Intuitive evidence
  Universals
  Grammar

Corpus linguistics
  ‘Parole’ (performance)
  Complexity/variation
  Language = social phenomenon
  Empirical evidence
  Differences
  Meaning


Basic tools

Corpus: a systematic collection of speech or writing that is built according to explicit design criteria for a specific purpose

cf. EAGLES’ broad definition: “A corpus can potentially contain any text type, incl. word lists, dictionaries, etc.”

Concordancer: search engine (e.g. WordSmith; SARA)

Concordance: occurrences of search item, displayed in list with immediate context shown
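To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how WordSmith or SARA work internally; the function name `concordance`, the `width` parameter and the sample sentences are all made up for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return one KWIC line per hit: the keyword with `width` characters of context on each side."""
    lines = []
    # Whole-word, case-insensitive match (an assumption about what counts as a hit).
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any corpus.
sample = (
    "The neighbours were friendly and helpful. "
    "Two soldiers were killed by friendly fire. "
    "The shop sells environmentally friendly products."
)

kwic_lines = concordance(sample, "friendly", width=25)
for line in kwic_lines:
    print(line)
```

Aligning the search item in a central column is what lets recurring patterns to its left and right stand out when the list is scanned by eye.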

Types of corpora

Written vs Spoken
General vs Specialised
  e.g. ESP, Learner corpora
Monolingual vs Multilingual
  e.g. Parallel, Comparable
Synchronic vs Diachronic; Monitor
Annotated vs Unannotated

Written corpora

Brown
  Time of compilation: 1960s
  Compiled at: Brown University (US)
  Language variety: Written American English

LOB
  Time of compilation: 1970s
  Compiled at: Lancaster, Oslo, Bergen
  Language variety: Written British English

Both corpora
  Size: 1 million words (500 texts of 2000 words each)
  Design: Balanced corpora; 15 genres of text, incl. press reportage, editorials, reviews, religion, government documents, reports, biographies, scientific writing, fiction

Specialised corpora

CSPAE
  Time of compilation: 1990s
  Compiled at / by: Michael Barlow (Rice Univ)
  Language variety: Spoken professional American English
  Size: 2 million words (tagged)
  Design: Transcripts from professional settings (meetings, conferences…) by 400 speakers; academia (1 M words), politics (1 M words)

CHILDES
  Time of compilation: Since 1980s
  Compiled at / by: Project started at Carnegie Mellon Univ; contributors worldwide
  Language variety: 20 languages, incl. E. Asian, Germanic, Romance, Slavic…; mainly conversational data
  Size: c. 20 million words (growing)
  Design: “Child language data exchange system”, offering transcripts of monolingual and bilingual children’s language (language acquisition data)

Other examples of available corpora

First generation major corpora
  Brown Corpus (1960s)
    Compiled at: Brown Univ, US
    Language: Written American English
    Size: 1 million (tagged)
    Design: 15 genres of text: press reportage, religion, fiction…

Second generation mega corpora
  Bank of English (since 1991)
    Compiled at: COBUILD, Birmingham Univ
    Language: Written / spoken English
    Size: 450 million as of 2002 (tagged)
    Design: Monitor corpus; mostly written: newspapers, books; spoken: conversations, broadcasts, interviews...
  International Corpus of English [ICE-GB] (1990s)
    Compiled at: UCL, London
    Language: Written / spoken British English
    Size: 1 million (grammatically parsed)
    Design: One of 15 projects worldwide preparing different national / regional varieties of English; 200 written, 300 spoken texts, various genres

Specialised corpora
  Corpus of Spoken Professional American English [CSPAE] (1990s)
    Compiled at: Rice Univ, US
    Language: Spoken American English
    Size: 2 million (tagged)
    Design: Transcripts from professional settings (meetings, press conferences) by approximately 400 speakers, centred on activities tied to academia and politics

Learner corpora
  International Corpus of Learner English [ICLE] (since 1990s)
    Compiled at: Louvain Centre for English Corpus Linguistics, Belgium
    Language: English writing by learners from 19 mother-tongue backgrounds, incl. Chinese
    Size: Over 2 million
    Design: Essay writing by advanced learners of English as a foreign language

Non-English monolingual corpora
  HK Cantonese Adult Corpus [HKCAC] (2000)
    Compiled at: Dept of Speech & Hearing Sciences, HKU
    Language: HK Cantonese
    Size: 170,000 characters
    Design: Spontaneous speech recorded from phone-in radio programs and forums, by 69 speakers

Multilingual / Parallel corpora
  International Telecommunications Corpus [ITU / CRATER] (1995)
    Compiled at: CRATER project (Corpus Resources & Terminology Extraction), Lancaster Univ
    Language: French, English and Spanish
    Size: 1 million tokens in each language (tagged)
    Design: Trilingual parallel corpus from telecommunications domain; aligned at sentence level

Some applications of corpus analysis

Language teaching & learning
  Empirical teaching data – authentic examples of language use
  Reference source – answering learners’ questions or explaining learner errors:
    • “What’s the difference between ‘at last’ and ‘in the end’?”
    • “How is ‘hardly’ used?”
  Preparation of teaching materials – e.g. vocabulary lists, CLOZE tests
  CALL; concordancing and data-driven learning

Translation
  Using parallel texts to find suitable translation equivalents
  Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
  Exploring units of meaning in texts

Linguistics and language research
  Lexicography & lexical studies – e.g. relative word frequency
  Language variation – e.g. linguistic features across registers
  Grammar – corpora used as data to test hypotheses, syntactic theory
  Pragmatics & discourse – e.g. CA of discourse features in spoken (conversational) data

Exploring meaning, units of meaning

Focus on meaning because:
  People interested in the meanings of texts, in how language is actually used in discourse
  Meaning is a key problem for translation, language learning, information management…

What are basic units of meaning?
  Language teaching (TEFL): vocabulary often introduced in the form of new single words
  Words considered to be basic units of meaning

Is the word an ideal unit of meaning?
  “… If you dog a dog during the dog days of summer, you’ll be a dog tired dog catcher…”
  “… Can I sit down? My dogs are barking…”
  Most lexical errors made by language learners result from failure to deal with ambiguities of single words

‘Unambiguous Units of Meaning’

Notion of an ‘Unambiguous Unit of Meaning’ necessary for understanding meaning
UUoM = keyword and all words in the context that contribute to making the word unambiguous
Compounds, idioms, multi-word units, collocations, set phrases
Often determined by a syntactic pattern
  Adj + N
    • friendly fire, closing remarks
  V + N
    • invite proposals, draw conclusions
  Adv + A
    • politically correct, environmentally friendly
  N + of + N
    • cause of death, proof of identity, code of practice, duty of care
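One rough way to look for candidate multi-word units in raw text is to count which words recur next to a keyword. The Python sketch below does this with adjacent word pairs (bigrams). It is only an illustration: the helper names `bigram_counts` and `pairs_with` and the sample sentences are invented, and there is no part-of-speech tagging, so patterns such as Adj + N are not actually checked.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in lower-cased text (sentence boundaries are ignored)."""
    # Hyphenated forms such as "friendly-fire" are kept as single tokens.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(zip(words, words[1:]))

def pairs_with(counts, keyword):
    """Return the bigrams containing `keyword`, most frequent first, ties in alphabetical order."""
    hits = {pair: n for pair, n in counts.items() if keyword in pair}
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Invented sample text, not taken from any corpus.
sample = (
    "Friendly fire killed two soldiers. The report blamed friendly fire. "
    "The staff were friendly. Use environmentally friendly packaging."
)

counts = bigram_counts(sample)
top_pairs = pairs_with(counts, "friendly")
for pair, n in top_pairs:
    print(n, " ".join(pair))
```

On real corpus data a pair that keeps recurring, such as "friendly fire", is a candidate unit of meaning that can then be inspected in a concordance.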

Case study

Search for units of meaning in online dictionaries and corpora
  friendly fire
  environmentally friendly

Corpora from 1990s
  British National Corpus (BNC)
    • 100,000,000+ words
    • Written (90%)
        Extracts from regional/national newspapers, specialist periodicals, academic books, popular fiction, un/published letters, memos, school/university essays
    • Spoken (10%)
        Informal conversation, formal meetings (business, government), radio shows, phone-ins
  The Times (1995, Jan – March)
    • 10,220,367 words
    • Written: business, home news, readers’ letters, reviews

Corpora from 1960s – 1970s
  Brown corpus / LOB corpus
    • Each 1 million words
    • Written, balanced corpora of 15 genres of text

Search results

Corpus frequencies
                                      BNC      The Times   Brown    LOB
                                      [100M]   [10.2M]     [1 M]    [1 M]
“friendly”                            3952     363         61       55
“friendly fire”                       37       1           0        0
  [header] (no context)
    friendly fire                     3        0
  [text] (phrase)
    so-called friendly fire           3        0
    ‘friendly fire’                   18       0
    friendly fire                     10       1
  [text] (word)
    so-called friendly-fire           1        0
    ‘friendly-fire’                   1        0
    friendly-fire                     1        0
“environmentally”                     692      44          0        0
“environmentally friendly”            205      23          0        0
  (phrase) environmentally friendly   155      23
  (word) environmentally-friendly     50       0

Online dictionaries searched (both terms)
  Wordnet 2.0
  Dictionary.com
  Encarta World English
  Cambridge Advanced
  Merriam-Webster Online
  TigerNT Eng-Chi Online
  Lexiconer Online Eng-Chi
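Because the four corpora differ greatly in size, raw hit counts are easier to compare once they are normalised to hits per million words. The sketch below does this in Python with the counts from the search results. The function name `per_million` is made up for illustration, and the corpus sizes are the nominal figures from the slides (the BNC is given as 100,000,000+ words, so its rates are approximate).

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to occurrences per million words."""
    return hits / corpus_size * 1_000_000

# Nominal corpus sizes in words, as given in the slides.
corpus_sizes = {
    "BNC": 100_000_000,       # stated as 100,000,000+, so treat as approximate
    "The Times": 10_220_367,
    "Brown": 1_000_000,
    "LOB": 1_000_000,
}

# Raw hit counts from the search results.
raw_hits = {
    "friendly": {"BNC": 3952, "The Times": 363, "Brown": 61, "LOB": 55},
    "friendly fire": {"BNC": 37, "The Times": 1, "Brown": 0, "LOB": 0},
    "environmentally friendly": {"BNC": 205, "The Times": 23, "Brown": 0, "LOB": 0},
}

rates = {
    term: {name: per_million(n, corpus_sizes[name]) for name, n in by_corpus.items()}
    for term, by_corpus in raw_hits.items()
}

for term, by_corpus in rates.items():
    print(term, {name: round(rate, 2) for name, rate in by_corpus.items()})
```

On this basis "friendly" is present in all four corpora, at between roughly 35 and 61 hits per million words, whereas the two multi-word units have zero hits in Brown and LOB.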


What the results show

‘friendly fire’, ‘environmentally friendly’
  Represent fairly new concepts
  Occur in the newer corpora (1990s) as units of meaning
  Occur as entries in some of the online dictionaries only (not bilingual dictionaries)
New terminology and terms of common usage not always recorded in dictionaries and termbanks
One way of using corpora for learning and translation:
  Use corpus evidence to help students recognise units of meaning; introduce notion of units of meaning into language learning

Aims of PULB project

To design and build an archive of language corpora = ‘language bank’
  To be used by staff and students in the department
  For teaching, language learning and research purposes
To provide a user-friendly platform
  A WWW interface via which users can freely access the language bank
  With browse, search and concordance facilities

Ingredients of PULB

Sources: standard corpora, departmental collections
Medium: written texts, transcribed spoken data
Language types: native speaker, learner corpora
Languages: English, Chinese, Japanese, French, German
Genres: business, law, academia, media, social, literature
Target size: 30 million words (European) / characters (Asian)

Why a language bank? - “What’s in it for us”

Free and simple shared access to a collection of language corpora
  That you can utilise for your teaching
    • Authentic examples of language use at your fingertips
    • Empirical teaching data covering different specialisms (ESP, EAP)
  That you can utilise for your research
    • A ready-made collection of data waiting for you to work on
    • Saving on time and resources
Way of incorporating new methods and information technology into the department’s teaching and research activities
  Increase students’ awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
  Way of integrating theory with technology in the classroom
  Train students to be more computer-literate
  All of the above can
    • Motivate students to become active learners
    • Help students to more effectively learn the target language (cf goals of DDL)

Similar existing projects

W3 Corpora Project (Essex)
  http://clwww.essex.ac.uk/w3c/
  Access to corpora (Gutenberg texts, LOB, LOB-tagged)
  Web interface for performing searches
  Online tutorial and info on corpus linguistics

Web Concordancer (VLC, PolyU)
  http://vlc.polyu.edu.hk/concordance/
  Access to variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
  Web interface for performing searches

Directions for PULB

Build a language bank with features that parallel those of similar sites
  ~ VLC
    • Bring together corpora and texts of various types and genres, of different languages
  ~ Essex
    • Make available different facilities for different categories of users (cf. legal considerations)
    • Provide on-site tutorial, corpora-based info
Include extra features
  Allow searches in multiple texts / corpora simultaneously
  Some form of parallel concordancing

Target composition of PULB

[Diagram: PolyU Language Bank, organised by language (English, Chinese, Japanese, French, German) and by corpus type (General corpora, Specialised corpora, Learner corpora, Spoken corpora). Corpora named in the diagram: ICE, BNC, BROWN, Business English (PUBC), Legal English, Academic English, English Literature, Workplace English, HK spoken corpus, Conference speeches, Academic presentations, Legal Chinese, Business Chinese, Business Japanese, Japanese Literature, Student work, Social interactions, Teaching reflections, Business writing]

Procedures (i)

Collate, sort, categorise data from various sources
  • Commercially available data
  • Departmental collections, incl.
      PolyU Business Corpus (Li and Bilbow)
      Bilingual corpora (Xu)
      ESP / EAP corpora (Forey)
      Learner corpora (Sengupta)
      …

Procedures (ii)

For the departmental collections:

Decide how to present each collection
  E.g. Sub-categories, macro categories
Clean up texts
  E.g. Duplications of text samples
  E.g. Structural features (headings, typographic features)
  E.g. Personal information found in data
    • To protect anonymity or privacy of authors and speakers
Annotate texts
  Provide descriptive information about each corpus
    • Compiler, time of compilation, type of collection…
  Provide descriptive information about the texts
    • Number, size, genre of subtexts
    • Bibliographic info (written text)
    • Ethnographic info (spoken data)
  Provide structural information for texts if necessary
    • Mark texts for paragraph boundaries etc…
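Two of the clean-up steps, removing duplicated text samples and masking personal information, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used for PULB: the helper names, the `<EMAIL>` / `<PHONE>` placeholders and the sample texts are invented, and the patterns only cover e-mail addresses and 8-digit phone numbers. Names and other identifying details would need manual checking.

```python
import re

# Assumed patterns for personal information; real data would need a broader set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{4}[ -]?\d{4}\b")  # assumes 8-digit phone numbers

def mask_personal_info(text):
    """Replace e-mail addresses and phone numbers with neutral placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())

def drop_duplicates(samples):
    """Keep the first copy of each text sample, comparing normalised forms."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalise(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

# Invented sample texts; the second differs from the first only in spacing.
samples = [
    "Please contact me at jane.doe@example.com or 2345 6789.",
    "Please  contact me at jane.doe@example.com or 2345 6789.",
    "The meeting is adjourned.",
]

cleaned = [mask_personal_info(s) for s in drop_duplicates(samples)]
for text in cleaned:
    print(text)
```

Masking with a visible placeholder rather than deleting the item keeps the sentence structure intact for later searches.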


Procedures (iii)

Put corpora together on platform; set up search and support facilities:
  ‘PULB map’
  Browse facility
  Search and concordance facilities
  Tutorial / general information
Transplant PULB onto dept website for use by staff and students
Promote PULB among corpora community
  Data provider to data archives / distribution sites, e.g. OLAC; ICAME

The PolyU Language Bank
Current status

Range of corpora totalling 12M+ words
Individual corpus descriptions
Index of corpora
Simple to use built-in concordancer
Available at http://langbank.engl.polyu.edu.hk/

The PolyU Language Bank
Some of the currently available corpora

PolyU Business Corpus (Eng, Chi, Jap)
BNC Sampler Corpus (Spoken, Written)
Corpus of Multilingual Texts
Corpus of Nursing and Health Science Texts
Learner Corpus of Essays and Reports
HK Bilingual Corpus of Legal and Documentary Texts
...

How you can contribute

Talk to us about your ideas
  What would you like to see being incorporated into PULB?
    • In terms of corpora
    • In terms of search facilities and supplementary information
  Can you think of other ways in which PULB can be organised and structured?
  How likely are you to make use of PULB in your teaching and research?
  Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
  Do you know of similar projects being undertaken elsewhere that we can learn from?

Talk to us about your collections / corpora
  Do you have collections of language data from past research projects that are (could be) presented as a corpus (corpora)?
  Can we help you put your collections to good use?
  Can we work together to incorporate your collections into PULB?

Concluding remarks

Corpora represent a valuable but under-exploited resource for teaching and research

PULB aims to bring together various corpora under a single departmental archive, accessible via WWW

You can help us by contributing your ideas and/or your language collections

Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form

Thank you very much






PolyU Business Corpus

Compiled in 1999-2000 (Li & Bilbow)
Multilingual - comparable corpora:
  English (c. 1.3 M words)
  Chinese (c. 1.2 M words)
  Japanese (c. 1.1 M words)
Business texts from: newspapers, government reports, company reports and brochures…
Has been used for creating a bilingual English-Chinese business lexicon
  PolyU Business Lexicon

