Post on 05-Jan-2016

Using Corpora in Linguistics and Lexicography

Adam Kilgarriff

Lexical Computing Ltd

Universities of Leeds, Sussex, UK

IDS Mannheim 2010 Kilgarriff 2

Outline
• Precision and recall
• History of corpus lexicography
• The Sketch Engine
  – Demo
• Web corpora
• Corpus and dictionary

Find me all the fat cats

a request for information

High recall

• Lots of responses
• Maybe not all good

High precision

• Fewer hits
• Higher confidence

Information-seeking

             Recall    Precision
Computers    good      bad
People       bad       good
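
The trade-off between high recall and high precision can be made concrete with a small calculation. The following is a minimal sketch in Python; the function name `precision_recall` and the toy "fat cats" data are illustrative assumptions, not material from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items judged against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the responses are good?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the good items were found?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four cats that really are fat, and two search strategies.
fat_cats = {"tom", "felix", "garfield", "sylvester"}
broad_search = {"tom", "felix", "garfield", "sylvester", "rex", "tweety", "jerry", "spike"}
narrow_search = {"garfield", "tom"}

print(precision_recall(broad_search, fat_cats))   # lots of responses, not all good
print(precision_recall(narrow_search, fat_cats))  # fewer hits, higher confidence
```

In this toy case the broad search finds every fat cat but half of its responses are wrong, while the narrow search is always right but misses half of the cats.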

Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography

Age 1: Pre-computer

Oxford English Dictionary:
• 5 million index cards

Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography
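
A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side. The sketch below is only an illustration in Python of the general idea; the function name `kwic`, the `width` parameter and the sample sentence are assumptions made for the example.

```python
def kwic(tokens, node, width=4):
    """Return (left context, keyword, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical one-sentence "corpus".
text = "We must save the forests now so that we can save lives and save money later"
concordance = kwic(text.split(), "save")

for left, keyword, right in concordance:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>25}  {keyword}  {right}")
```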

Age 2: limitations

As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no

Age 3: Collocation statistics

Problem: too much data. How to summarise?

Solution: list of words occurring in neighbourhood of headword, with frequencies

Sorted by salience
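
As a rough illustration of a collocate list sorted by salience, the Python sketch below counts words in a window to the right of the headword and ranks them. The slides do not say which salience statistic was used; logDice is swapped in here purely as an assumption, and the function name, parameters and toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, window=5, min_hits=1):
    """List (word, co-occurrence frequency, score) for words right of node, best first."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count every token in the window to the right of the headword.
            pair_freq.update(tokens[i + 1:i + 1 + window])
    rows = []
    for word, f_xy in pair_freq.items():
        if f_xy >= min_hits:
            # Assumed salience score: logDice = 14 + log2(2 * f_xy / (f_x + f_y)).
            score = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[word]))
            rows.append((word, f_xy, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical toy corpus.
corpus = "save money save lives save money spend money".split()
ranked = collocates(corpus, "save", window=1)

for word, freq, score in ranked:
    print(f"{word:10} {freq:3d} {score:6.2f}")
```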

Age 3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type
  – mixes together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Collocation listing
For collocates of save (>5 hits), window 1-5 words to right of nodeword

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
  – subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
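
The step from a single mixed list to one list per grammatical relation can be sketched as follows in Python. This assumes a parser has already produced (relation, headword, collocate) triples; the triple format, the relation names and the function `word_sketch` are hypothetical, and raw frequency stands in for whatever salience statistic sorts each list.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group a headword's collocates into one ranked list per grammatical relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Raw frequency is a stand-in here for a salience statistic.
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}


# Hypothetical parser output.
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]
sketch = word_sketch(triples, "save")

for relation, collocate_list in sketch.items():
    print(relation, collocate_list)
```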

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Working practice

• Lexicographers mainly used sketches, not concordances
  – Missed less, more consistent
  – Faster

Euralex 2002

Can I have them for my language, please?

The Sketch Engine
• Input:
  – any corpus, any language
    • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with corpus query system
  – Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

Customers
• Dictionary publishers
  – Oxford University Press
  – Cambridge University Press
  – Collins
  – Macmillan
  – FrameNet Project (Berkeley, US)
  – National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
  – Teaching and research
  – Languages, linguistics, language technology
  – UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia, …
• Other
  – Language teaching, textbook writing
  – Information management, web search companies
  – Automatic translation

Web corpora

• Replaceable or replacable?
  – http://googlefight.com
  – http://looglefight.com

The web is
  – Very very large
  – Most languages
  – Most language types
  – Up-to-date
  – Free
  – Instant access

Web corpus types
• Large, general corpora
• Small, specialised corpora
  – Specially for translators

Basic steps
• Gather pages
  – CSE hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
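
The filter and de-duplicate steps can be illustrated with a very small Python sketch. It is an assumption-laden toy, not the pipeline used for the corpora described in this talk: it drops pages with fewer than `min_words` words (a hypothetical threshold) and removes only exact duplicates after whitespace and case normalisation, whereas real web-corpus pipelines typically also handle near-duplicates.

```python
import hashlib


def filter_and_dedupe(pages, min_words=5):
    """Keep pages with enough text, dropping exact duplicates after normalisation."""
    seen = set()
    kept = []
    for text in pages:
        normalised = " ".join(text.lower().split())
        # Filter: very short pages rarely contain usable running text.
        if len(normalised.split()) < min_words:
            continue
        # De-duplicate: compare fingerprints of the normalised text.
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


# Hypothetical downloaded pages.
pages = [
    "Click here",
    "The cat sat on the mat today",
    "the  cat sat on the MAT today",
    "A different page with enough words",
]
clean_pages = filter_and_dedupe(pages)
print(clean_pages)
```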

WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
  – Currently 30 languages
  – Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

Corpora

Arabic       174    Hindi         31    Russian      188
Chinese      456    Indonesian   102    Slovak       536
Czech        800    Irish         34    Slovene      738
Dutch        128    Italian     1910    Spanish      117
English     5508    Japanese     409    Swedish      114
French       126    Norwegian     95    Telugu         5
German      1627    Persian        6    Thai         108
Greek        149    Portuguese    66    Vietnamese   174
Estonian      11    Romanian      53    Welsh         63
Korean        77    Polish       156    Malay        230

How good are they?
• How to assess?
  – Hard question, open research topic
• Good coverage
  – Newspapers: news, politics bias
  – Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
  – First two are close

Evaluating word sketches
• 11 years
  – 1999-2010
• Feedback
  – Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Goal
• Collocations dictionary
  – Model: Oxford Collocations Dictionary
  – Publication-quality
• Ask a lexicographer
  – For 42 headwords
  – For 20 best collocates per headword
  – “Should we include this collocation in a published dictionary?”

Sample of headwords
Nouns, verbs, adjectives; random

High (Top 3000)
  N    space solution opinion mass corporation leader
  V    serve incorporate mix desire
  Adj  high detailed open academic

Mid (3000-9999)
  N    cattle repayment fundraising elder biologist sanitation
  V    grieve classify ascertain implant
  Adj  adjacent eldest prolific ill

Low (10,000-30,000)
  N    predicament adulterer bake bombshell candy shellfish
  V    slap outgrow plow traipse
  Adj  neoclassical votive adulterous expandable

Precision and recall
• We test precision
• Recall is harder
  – How do we find all the collocations that the system should have found?
• Current work
  – 200 collocates per headword
  – Selected from
    • All the corpora we have
    • Various parameter settings
  – Plus just-in-time evaluation for 'new' collocates

Four languages, three families

• Dutch
  – ANW, 102m-word lexicographic corpus
• English
  – UKWaC, 1.5b web corpus
• Japanese
  – JpWaC, 400m web corpus
• Slovene
  – FidaPlus, 620m lexicographic corpus

User evaluation

• Evaluate whole system
  – Will it help with my task?
    • E.g. preparing a collocations dictionary
• Contrast: developer evaluation
  – Can I make the system better?
    • Evaluate each module separately
    • Current work

Components

• Corpus
• NLP tools
  – Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Practicalities
• Interface
  – Good, Good-but
    • Merge to good
  – Maybe, Maybe-specialised, Bad
    • Merge to bad
• For each language
  – Two/three linguists/lexicographers
  – If they disagree
    • Don't use for computing performance
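
The scoring procedure described above (merge the five labels into good or bad, then leave out any item on which the judges disagree) might be sketched like this in Python. The dictionary layout, the lower-cased label strings and the example judgements are all hypothetical; only the merging rule and the exclusion of disagreements come from the slide.

```python
# Merge the five interface labels into two classes, as on the slide.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def precision_from_judgements(judgements):
    """judgements maps each collocation to the list of labels given by the judges."""
    good = agreed = 0
    for labels in judgements.values():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: not used for computing performance
        agreed += 1
        if merged == {"good"}:
            good += 1
    return good / agreed if agreed else 0.0


# Hypothetical judgements from two judges.
judgements = {
    "save money": ["Good", "Good-but"],
    "save face": ["Good", "Good"],
    "save your": ["Bad", "Maybe"],
    "save estimated": ["Good", "Bad"],  # disagreement, excluded
}
print(round(precision_from_judgements(judgements), 2))
```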

Results

Dutch       66%
English     71%
Japanese    87%
Slovene     71%

Two thirds of a collocations dictionary can be gathered automatically

Small specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
• Instant small web corpora
  – BootCaT: Baroni and Bernardini 2004
  – WebBootCaT demo

Cyborgs

A creature that is partly human and partly machine
– Macmillan English Dictionary

Cyborgs and the Information Society

The dictionary-making agent is part human (for precision), part computer (for recall).

Treat your computer with respect. You and it can do great things together.

Thank you

http://www.sketchengine.co.uk

Corpus and dictionary
• Established model:
  – Lexicographers use corpora, users use dictionaries
• But
  – Users like collocations, examples
  – But are not corpus linguists
• Explore the space between corpus and dictionary

Collocationality
• Which words are most ‘collocational’?
• Dictionary publishers
  – Where to put ‘collocation boxes’
• Language learners

Verb       Freq    MLE Prob x log = entropy
Take       2084    -0.469
Gain        131    -0.169
Offer       117    -0.157
See         110    -0.150
Enjoy        67    -0.104
…             …    …
Clarify       1    -0.0031
…             …    …
Total      3730    -3.909

Calculation of entropy for advantage (object relation)
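
The per-verb figures in the table are consistent with p × log2(p), where p is the maximum-likelihood probability (the verb's frequency divided by the total of 3730). The Python sketch below reproduces that arithmetic; the function names are illustrative assumptions, and since the slide lists only some of the verbs, the full total of -3.909 cannot be recomputed from the rows shown.

```python
import math


def entropy_term(freq, total):
    """One row of the table: MLE probability times its base-2 logarithm."""
    p = freq / total
    return p * math.log2(p)


def entropy(freqs):
    """Entropy in bits of a complete frequency list (the negated sum of the terms)."""
    total = sum(freqs)
    return -sum(entropy_term(f, total) for f in freqs if f > 0)


# Rows taken from the table for "advantage" (object relation); total is 3730.
for verb, freq in [("take", 2084), ("gain", 131), ("offer", 117), ("clarify", 1)]:
    print(f"{verb:8} {freq:5d} {entropy_term(freq, 3730):8.4f}")
```

A noun whose verbs are spread evenly gets a high entropy, while a noun dominated by one verb, as advantage is by take, gets a low one.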

place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),

What is there on the web?
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top 50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
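
The comparison procedure above can be sketched in Python as follows. This is a guess at one reasonable way to do it, not the method actually used: in particular, taking the ratio of relative frequencies (each count divided by its corpus total) is an assumption, as are the function name, its parameters and the tiny frequency lists.

```python
def compare_top_items(web_freq, bnc_freq, top_n=50000, k=50):
    """Return (web-only words, k highest-ratio words, k lowest-ratio words)."""
    web_top = sorted(web_freq, key=web_freq.get, reverse=True)[:top_n]
    bnc_top = set(sorted(bnc_freq, key=bnc_freq.get, reverse=True)[:top_n])
    # Words in the web top list that are missing from the BNC top list.
    web_only = [w for w in web_top if w not in bnc_top]
    shared = [w for w in web_top if w in bnc_top]
    web_total, bnc_total = sum(web_freq.values()), sum(bnc_freq.values())
    # Assumed measure: ratio of relative frequencies, web over BNC.
    ratio = {w: (web_freq[w] / web_total) / (bnc_freq[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return web_only, ranked[:k], ranked[-k:]


# Hypothetical miniature frequency lists.
web = {"the": 100, "url": 50, "she": 10, "spyware": 5}
bnc = {"the": 100, "she": 40, "council": 20, "url": 1}
web_only, web_high, web_low = compare_top_items(web, bnc, top_n=4, k=1)
print(web_only, web_high, web_low)
```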

Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low

• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself
