adam kilgarriff lexical computing ltd sketchengine.co.uk

A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project

Adam Kilgarriff
Lexical Computing Ltd

http://www.sketchengine.co.uk

English Profile

• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
– for each CEFR level, find characteristic lexis and grammar
– Main resource: CLC
• CEFR: Common European Framework of Reference
– A1, A2: Beginner
– B1, B2: Intermediate
– C1, C2: Advanced

NTNU Nov 2011, Kilgarriff

Cambridge Learner Corpus (CLC)

• Since 1993
• Leading resource
• CUP and Cambridge Assessment
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities


Sketch Engine

• Leading corpus tool
• Word sketches
– One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
– 175 corpora
– Since May including CHILDES: demo
– Since last year including CLC

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Error-coded corpus

• Challenge
– Intuitive to search for x
• anywhere
• only where it is part of an error
• only where it is part of a correction

where x can be a word, phrase, grammar pattern …

Requirement for CLC in Sketch Engine
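
The three search modes can be illustrated with a small sketch. The token structure and the zone labels ("plain", "error", "correction") are assumptions made for this example only; they are not the CLC's error-coding scheme or the Sketch Engine's internal format.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    zone: str  # hypothetical labels: "plain", "error" or "correction"


def search(tokens, x, where="anywhere"):
    """Return positions of word x, optionally restricted to one zone."""
    allowed = {
        "anywhere": {"plain", "error", "correction"},
        "error": {"error"},
        "correction": {"correction"},
    }[where]
    return [i for i, t in enumerate(tokens)
            if t.text.lower() == x.lower() and t.zone in allowed]


# Invented learner sentence: "I have went home", with "went" corrected to "gone"
sentence = [
    Token("I", "plain"), Token("have", "plain"),
    Token("went", "error"), Token("gone", "correction"),
    Token("home", "plain"),
]

print(search(sentence, "went"))                # anywhere
print(search(sentence, "went", "error"))       # only inside an error
print(search(sentence, "went", "correction"))  # only inside a correction
```

Here x is a single word; phrases and grammar patterns would need a richer query language than this sketch provides.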

Error-coded corpora in SkE

• demo

HOO / HOO+

• Helping Our Own
• HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
• Organisers define task and prepare ‘gold standard’
• Teams participate by running their software over test data
• Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)

• Probably
– English: learner data from CLC
– Other languages?
– Tasks
• Essay scoring
• Determiner, preposition errors
• ?
• http://www.clt.mq.edu.au/research/projects/hoo/

Highlights of English lexicography

http://webdante.com

The KELLY Project

• EU Lifelong Learning Project
• Word cards
– 9 languages
• Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
• Partners
• Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question

• How close to purely corpus-based can a pedagogic list be?

Method

• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed-lg-pairs)
• Words not in source list which occur in translations:
– Review source list

• http://kelly.sketchengine.co.uk

• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members)
• hospital library music sun theory
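
A minimal sketch of the two definitions above, assuming the translation data is available as a set of directed pairs of (language, word) items. The data layout and the toy entries are invented for illustration and are not taken from the KELLY database.

```python
from itertools import combinations


def symmetric_pairs(directed):
    """Unordered pairs {x, y} for which both <x,y> and <y,x> are present."""
    return {frozenset((x, y)) for (x, y) in directed
            if x != y and (y, x) in directed}


def is_clique(words, directed):
    """True if every pair of the given words is a symmetrical pair."""
    sym = symmetric_pairs(directed)
    return all(frozenset(pair) in sym for pair in combinations(words, 2))


# Toy data: each item is (language, word); each tuple is one directed translation
directed = {
    (("en", "sun"), ("sv", "sol")), (("sv", "sol"), ("en", "sun")),
    (("en", "sun"), ("it", "sole")), (("it", "sole"), ("en", "sun")),
    (("sv", "sol"), ("it", "sole")), (("it", "sole"), ("sv", "sol")),
    (("en", "music"), ("it", "musica")),  # one direction only, so not symmetrical
}

print(is_clique([("en", "sun"), ("sv", "sol"), ("it", "sole")], directed))
```

A 9-language clique is the same check applied to nine items, one per language.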

Web corpora

• Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com
• The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access

Web corpus types

• Large, general corpora
• Small, specialised corpora
– Specially for translators

Basic steps

• Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
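
The filter and de-duplicate steps can be sketched as follows, assuming the pages have already been gathered as plain-text strings. The minimum-length filter, the exact-match hashing and the whitespace tokenisation are simplifying assumptions for this example; real pipelines use more elaborate filters, near-duplicate detection and proper linguistic processing.

```python
import hashlib


def filter_pages(pages, min_words=5):
    """Drop pages too short to contain running text (e.g. navigation stubs)."""
    return [p for p in pages if len(p.split()) >= min_words]


def deduplicate(pages):
    """Keep the first copy of each page, comparing case- and space-normalised text."""
    seen, unique = set(), []
    for page in pages:
        normalised = " ".join(page.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def build_corpus(pages, min_words=5):
    """Filter, de-duplicate, then tokenise (whitespace split as a stand-in)."""
    kept = deduplicate(filter_pages(pages, min_words))
    return [page.split() for page in kept]


pages = [
    "The sun is a star at the centre",
    "the  sun is a star at the centre",  # duplicate after normalisation
    "Home | Login",                      # too short, filtered out
    "Corpora are collections of texts used in research",
]
corpus = build_corpus(pages)
print(len(corpus))
```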

WaC family corpora

• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
– Currently 42 languages
– Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

How good are they?

• How to assess?
– Hard question, open research topic
• Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
– First two are close

Evaluating word sketches

• 11 years – 1999-2011

• Feedback
– Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora


• Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
• Ask a lexicographer
– For 42 headwords
• For 20 best collocates per headword
– “should we include this collocation in a published dictionary?”


Sample of headwords

• Nouns verbs adjectives, random
• High (Top 3000)
– N space solution opinion mass corporation leader
– V serve incorporate mix desire
– Adj high detailed open academic
• Mid (3000-9999)
– N cattle repayment fundraising elder biologist sanitation
– V grieve classify ascertain implant
– Adj adjacent eldest prolific ill
• Low (10,000-30,000)
– N predicament adulterer bake bombshell candy shellfish
– V slap outgrow plow traipse
– Adj neoclassical votive adulterous expandable


Precision and recall

• A request for information
– Find me all the fat cats

High recall

• Lots of responses
• Maybe not all good

High precision

• Fewer hits
• Higher confidence


Precision and recall

• We test precision
• Recall is harder

How do we find all the collocations that the system should have found?
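
For reference, the two measures can be written out in a few lines. This is a generic sketch with invented collocates for the headword "coffee", not the evaluation code used in the study. It also shows why recall is the harder one: it needs the full set of items the system should have found.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: what a system proposed vs. what a lexicographer would list
proposed = {"strong", "black", "instant", "the"}
gold = {"strong", "black", "instant", "weak", "ground"}

precision, recall = precision_recall(proposed, gold)
print(precision, recall)
```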

Current work

• 200 collocates per headword

• Selected from

• All the corpora we have

• Various parameter settings

• Plus just-in-time evaluation for 'new' collocates


Four languages, three families

• Dutch
– ANW, 102m-word lexicographic corpus
• English
– UKWaC, 1.5b web corpus
• Japanese
– JpWaC, 400m web corpus
• Slovene
– FidaPlus, 620m lexicographic corpus


User evaluation

• Evaluate whole system
– Will it help with my task?
• Eg preparing a collocations dictionary
• Contrast: developer evaluation
– Can I make the system better?
• Evaluate each module separately
• Current work


Components

• Corpus
• NLP tools
– Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics


Practicalities

• Interface
– Good, Good-but
• Merge to good
– Maybe, Maybe-specialised, Bad
• Merge to bad
• For each language
– Two/three linguists/lexicographers
– If they disagree
• Don't use for computing performance
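
The scoring procedure above can be sketched as follows. The five labels come from the slide; the data layout, the function names and the example judgements are assumptions made for illustration.

```python
# Merge the five interface labels into two classes, as described above
MERGE = {
    "Good": "good", "Good-but": "good",
    "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad",
}


def precision_from_judgements(judgements):
    """judgements maps each collocation to the list of labels from its judges.
    Items on which the judges disagree (after merging) are left out."""
    agreed = []
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Invented judgements from two judges
judgements = {
    "strong coffee": ["Good", "Good-but"],
    "black coffee": ["Good", "Good"],
    "the coffee": ["Bad", "Maybe"],
    "coffee table": ["Good", "Maybe-specialised"],  # disagreement: excluded
    "instant coffee": ["Good-but", "Good"],
}
print(precision_from_judgements(judgements))
```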


Results

• Dutch 66%
• English 71%
• Japanese 87%
• Slovene 71%


Two thirds of a collocations dictionary can be gathered automatically

Thank you

http://www.sketchengine.co.uk

Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography

Age 1: Pre-computer

Oxford English Dictionary:
• 5 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography

Age 2: limitations

As corpora get bigger: too much data

• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics

• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
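
A minimal sketch of such a list. The slide does not say which salience statistic is meant; logDice is used here as one association measure that serves this purpose, and the window size and toy text are assumptions for the example.

```python
import math
from collections import Counter


def collocates(tokens, headword, window=3):
    """Words within `window` tokens of the headword, with co-occurrence
    frequency and a logDice score, sorted by score (highest first)."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_freq[tokens[j]] += 1
    rows = []
    for word, f_xy in pair_freq.items():
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y))
        score = 14 + math.log2(2 * f_xy / (word_freq[headword] + word_freq[word]))
        rows.append((word, f_xy, score))
    return sorted(rows, key=lambda r: r[2], reverse=True)


text = "strong coffee and weak tea and strong coffee and the tea".split()
for word, freq, score in collocates(text, "coffee", window=1):
    print(word, freq, round(score, 2))
```

In the toy text "strong" and "and" co-occur with "coffee" equally often, but "strong" ranks higher because "and" is also frequent elsewhere; that is the effect sorting by salience rather than raw frequency is after.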

Age-3 collocation statistics: limitations

Lists

• contain junk
• are unsorted for type – mix together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch

• Large well-balanced corpus
• Parse to find
– subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
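
The "one list per grammatical relation" idea can be sketched as below, assuming the parser's output is available as (relation, headword, collocate) triples. The triple format and the relation names are invented for the example, and raw frequency stands in for the salience statistic.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one list per grammatical relation,
    each sorted by frequency (a stand-in for the salience statistic)."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Invented parser output
triples = [
    ("object_of", "coffee", "drink"),
    ("object_of", "coffee", "drink"),
    ("object_of", "coffee", "pour"),
    ("modifier", "coffee", "strong"),
    ("modifier", "tea", "weak"),
]
sketch = word_sketch(triples, "coffee")
for relation, items in sketch.items():
    print(relation, items)
```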

Working practice

• Lexicographers mainly used sketches not concordances
– Missed less, more consistent
– Faster

Euralex 2002

• Can I have them for my language, please?

The Sketch Engine

• Input:
– any corpus, any language
• Lemmatised, part-of-speech tagged
– specification of grammatical relations
• Word sketches integrated with
• Corpus query system
– Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

Customers

• Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– National dictionary projects in
• Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, …
• Other
– Language teaching, textbook writing
– Information management, web search
• Demo
– http://sketchengine.co.uk
– Free trial

What is there on the web?

• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
• 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
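
The comparison can be sketched as follows, assuming each corpus is available as a word-to-frequency mapping. Normalising by corpus size before taking the ratio is an assumption of this sketch (the slide does not say how the ratio was computed), and the frequencies are invented.

```python
def top_items(freqs, n):
    """The n most frequent words, most frequent first."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def compare(web, bnc, n):
    """Return (words in the web top-n but not the BNC top-n,
    shared words ranked by relative-frequency ratio web:BNC, highest first)."""
    web_top, bnc_top = top_items(web, n), set(top_items(bnc, n))
    only_web = [w for w in web_top if w not in bnc_top]
    shared = [w for w in web_top if w in bnc_top]
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    return only_web, sorted(ratio, key=ratio.get, reverse=True)


# Invented frequencies
web = {"the": 1000, "url": 300, "forum": 200, "she": 50}
bnc = {"the": 1000, "she": 400, "council": 100, "url": 1}

only_web, ranked = compare(web, bnc, n=4)
print(only_web)      # web-only items
print(ranked[:1])    # highest ratio
print(ranked[-1:])   # lowest ratio
```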

Web-high (155 terms)

• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations

• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself
