Post on 12-Jan-2016
A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project
Adam Kilgarriff
Lexical Computing Ltd
http://www.sketchengine.co.uk
English Profile
• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
– for each CEFR level, find characteristic lexis and grammar
• CEFR: Common European Framework of Reference
– A1, A2: Beginner
– B1, B2: Intermediate
– C1, C2: Advanced
– Main resource: CLC
NTNU Nov 2011 Kilgarriff
Cambridge Learner Corpus (CLC)
• Since 1993
• Leading resource
• CUP and Cambridge Assessment
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities
Sketch Engine
• Leading corpus tool
• Word sketches
– One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
– 175 corpora
– Since May including CHILDES: demo
– Since last year including CLC
Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002
Error-coded corpus
• Challenge
– Intuitive to search for x
  • anywhere
  • only where it is part of an error
  • only where it is part of a correction
– where x can be a word, phrase, grammar pattern …
Requirement for CLC in Sketch Engine
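A minimal sketch of the three search scopes listed above. The token representation (a word paired with a layer label) and the function name `find` are assumptions made for illustration; they are not the CLC annotation format or the Sketch Engine query syntax.

```python
from typing import List, Tuple

# Hypothetical token representation: (word, layer), where layer is
# "plain", "error" (the learner's original) or "correction" (the annotator's fix).
Token = Tuple[str, str]


def find(tokens: List[Token], x: str, scope: str = "anywhere") -> List[int]:
    """Return the positions of word x, restricted by scope.

    scope is "anywhere", "error" or "correction".
    """
    hits = []
    for i, (word, layer) in enumerate(tokens):
        if word.lower() != x.lower():
            continue
        if scope == "anywhere" or layer == scope:
            hits.append(i)
    return hits


# Invented example: "interested on" corrected to "interested in".
sentence = [
    ("I", "plain"), ("am", "plain"), ("interested", "plain"),
    ("on", "error"), ("in", "correction"), ("music", "plain"),
    ("in", "plain"), ("Oslo", "plain"),
]
```

With this data, `find(sentence, "in")` returns both occurrences of "in", while `find(sentence, "in", "correction")` returns only the one supplied as a correction.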
HOO / HOO+
• Helping Our Own
• HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
  • Organisers define task and prepare ‘gold standard’
  • Teams participate by running their software over test data
  • Six teams (incl Tübingen), workshop end Sept
HOO+ (2012)
• Probably
– English: learner data from CLC
– Other languages?
– Tasks
  • Essay scoring
  • Determiner, preposition errors
  • ?
• http://www.clt.mq.edu.au/research/projects/hoo/
DANTE
Highlights of English lexicography
The KELLY Project
• EU Lifelong Learning Project
• Word cards
– 9 languages
  • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
• Partners
  • Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd
Interesting question
• How close to purely corpus-based can a pedagogic list be?
Method
• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations:
– Review source list
• http://kelly.sketchengine.co.uk
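The pair counts follow directly from the nine project languages: 36 unordered pairs, and 72 once each pair is taken in both translation directions. A short check, using only the standard library:

```python
from itertools import combinations, permutations

languages = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered pairs: {English, Swedish} counted once.
pairs = list(combinations(languages, 2))

# Directed pairs: English->Swedish and Swedish->English counted separately.
directed_pairs = list(permutations(languages, 2))

print(len(pairs), len(directed_pairs))  # 36 72
```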
• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members)
  • hospital library music sun theory
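A minimal sketch of the two definitions above, assuming the translations are available as a set of directed links between (language, word) items. The data layout, the function names and the sample links are invented for illustration and are not the project's actual data.

```python
from itertools import combinations

# Hypothetical directed translation links: (source item, target item),
# where each item is a (language, word) tuple.
links = {
    (("en", "sun"), ("sv", "sol")), (("sv", "sol"), ("en", "sun")),
    (("en", "sun"), ("it", "sole")), (("it", "sole"), ("en", "sun")),
    (("sv", "sol"), ("it", "sole")), (("it", "sole"), ("sv", "sol")),
    (("en", "library"), ("it", "biblioteca")),  # one direction only
}


def symmetric(x, y):
    """True if x translates to y and y translates back to x."""
    return (x, y) in links and (y, x) in links


def is_clique(items):
    """True if every pair of items in the group is symmetrical."""
    return all(symmetric(x, y) for x, y in combinations(items, 2))


sun = [("en", "sun"), ("sv", "sol"), ("it", "sole")]
library = [("en", "library"), ("it", "biblioteca")]
```

Here `is_clique(sun)` holds because all three pairs are linked in both directions, while `is_clique(library)` fails because the reverse link is missing. A 9-language clique is the same test applied to one item from each of the nine languages.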
Web corpora
• Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com
• The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access
Web corpus types
• Large, general corpora
• Small, specialised corpora
– Specially for translators
Basic steps
• Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
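The filter and de-duplicate steps above can be sketched as follows. This is a simplified illustration, not the pipeline used for the corpora described here: the word-count threshold and the use of SHA-1 over normalised text are assumptions, and it only catches exact duplicates after whitespace and case normalisation, whereas real web-corpus pipelines typically also detect near-duplicates.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lowercase, so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_pages(pages, min_words=5):
    """Drop pages too short to hold running text (threshold is an assumption)."""
    return [p for p in pages if len(p.split()) >= min_words]


def deduplicate(pages):
    """Keep the first copy of each page, comparing hashes of normalised text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


# Invented pages: a sentence, a whitespace variant of it, and navigation text.
pages = [
    "The web is very large and covers most languages.",
    "The  web is very large and covers   most languages.",
    "Click here",
]
corpus = deduplicate(filter_pages(pages))
```

Only the first page survives: the second is a duplicate after normalisation and the third is filtered out as too short.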
WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
– Currently 42 languages
– Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
How good are they?
• How to assess?
– Hard question, open research topic
• Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
– First two are close
Evaluating word sketches
• 11 years – 1999-2011
• Feedback
– Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora
Goal
• Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
• Ask a lexicographer
– For 42 headwords
  • For 20 best collocates per headword
– “should we include this collocation in a published dictionary?”
Sample of headwords
• Nouns verbs adjectives, random
• High (Top 3000)
– N space solution opinion mass corporation leader
– V serve incorporate mix desire
– Adj high detailed open academic
• Mid (3000-9999)
– N cattle repayment fundraising elder biologist sanitation
– V grieve classify ascertain implant
– Adj adjacent eldest prolific ill
• Low (10,000-30,000)
– N predicament adulterer bake bombshell candy shellfish
– V slap outgrow plow traipse
– Adj neoclassical votive adulterous expandable
Precision and recall
• a request for information
– Find me all the fat cats
High recall
• Lots of responses
• Maybe not all good
High precision
• Fewer hits
• Higher confidence
Precision and recall
• We test precision
• Recall is harder
– How do we find all the collocations that the system should have found?
• Current work
– 200 collocates per headword
– Selected from
  • All the corpora we have
  • Various parameter settings
– Plus just-in-time evaluation for 'new' collocates
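For reference, precision and recall as set computations. The collocate lists are invented; in practice the complete gold list needed for recall is exactly what is hard to obtain, as noted above.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned items that are relevant.
    Recall: share of relevant items that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: what the system proposed vs. a hypothetical full gold list.
system = ["fat cat", "black cat", "cat food", "the cat"]
gold = ["fat cat", "black cat", "cat food", "stray cat", "cat flap"]

p, r = precision_recall(system, gold)
print(p, r)  # 0.75 0.6
```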
Four languages, three families
• Dutch
– ANW, 102m-word lexicographic corpus
• English
– UKWaC, 1.5b web corpus
• Japanese
– JpWaC, 400m web corpus
• Slovene
– FidaPlus, 620m lexicographic corpus
User evaluation
• Evaluate whole system
– Will it help with my task?
  • Eg preparing a collocations dictionary
• Contrast: developer evaluation
– Can I make the system better?
  • Evaluate each module separately
  • Current work
Components
• Corpus
• NLP tools
– Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics
Practicalities
• Interface
– Good, Good-but
  • Merge to good
– Maybe, Maybe-specialised, Bad
  • Merge to bad
• For each language
– Two/three linguists/lexicographers
– If they disagree
  • Don't use for computing performance
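One way to read the scoring procedure above: merge the five interface labels into good/bad, drop any collocate on which the judges still disagree, and report the share of good among the rest. The function names and the sample judgements are assumptions for illustration.

```python
# Interface labels that merge to "good"; every other label merges to "bad".
GOOD_LABELS = {"good", "good-but"}


def merge(label):
    return "good" if label.lower() in GOOD_LABELS else "bad"


def precision(judgements):
    """judgements: one list per collocate, holding one label per judge.

    Collocates where the judges disagree after merging are not used.
    """
    agreed = []
    for labels in judgements:
        merged = {merge(label) for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Invented judgements from two judges.
sample = [
    ["Good", "Good-but"],           # agree: good
    ["Maybe", "Bad"],               # agree: bad
    ["Good", "Maybe-specialised"],  # disagree: excluded
    ["Good", "Good"],               # agree: good
]
score = precision(sample)
```

For this sample, three collocates are usable and two of them are good, so `score` is 2/3.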
Results
• Dutch 66%
• English 71%
• Japanese 87%
• Slovene 71%
Two thirds of a collocations dictionary can be gathered automatically
Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations
Four ages of corpus lexicography
Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards
Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography
Age 2: limitations
as corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no
Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
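A minimal sketch of such a list: count the words within a fixed window of the headword and sort them by a salience score. The slides do not name the statistic, so logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))), is assumed here as one common choice; the window size and the toy token list are also assumptions.

```python
import math
from collections import Counter


def collocates(tokens, headword, window=3):
    """Return (word, co-occurrence frequency, salience) rows, most salient first."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co[tokens[j]] += 1
    rows = []
    for word, f_xy in co.items():
        # logDice (assumed salience measure): 14 + log2(2*f_xy / (f_x + f_y))
        score = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        rows.append((word, f_xy, round(score, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy token list, invented for illustration.
tokens = ["strong", "tea", "the", "cup", "the", "strong", "tea", "the"]
rows = collocates(tokens, "tea", window=1)
```

Both "strong" and "the" occur twice next to "tea", but "the" is more frequent overall, so it scores lower and "strong" comes first. This is the point of sorting by salience rather than raw frequency.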
Age-3 collocation statistics: limitations
Lists contain
• junk
• unsorted for type – mixes together adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation
Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
– subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
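The step from a single collocate list to one list per grammatical relation can be sketched as a grouping over parsed triples. The triple format, the relation names and the data are assumptions for illustration, and raw frequency stands in for the salience statistic that would sort each list in practice.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates by grammatical relation.

    triples: (relation, headword, collocate) tuples, assumed to come
    from a parser or a sketch grammar applied to a tagged corpus.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # One list per relation, most frequent collocate first.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Invented triples.
triples = [
    ("object_of", "tea", "drink"), ("object_of", "tea", "drink"),
    ("object_of", "tea", "pour"), ("modifier", "tea", "strong"),
    ("modifier", "tea", "green"), ("modifier", "tea", "strong"),
    ("modifier", "coffee", "strong"),
]
sketch = word_sketch(triples, "tea")
```

The result keeps verbs that take "tea" as object apart from its modifiers, instead of mixing them in one list.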
Working practice
• Lexicographers mainly used sketches not concordances
– missed less, more consistent
– Faster
Euralex 2002
• Can I have them for my language please?
The Sketch Engine
• Input:
– any corpus, any language
  • Lemmatised, part-of-speech tagged
– specification of grammatical relations
• Word sketches integrated with
• Corpus query system
– Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ
Customers
• Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– National dictionary projects in
  • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, …
• Other
– Language teaching, textbook writing
– Information management, web search
• Demo
– http://sketchengine.co.uk
– Free trial
What is there on the web?
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
  • 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
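A minimal sketch of the comparison above: find words present in one list but not the other, then rank the shared words by the ratio of their relative frequencies. The function name and all frequencies are invented for illustration; they are not Web1T or BNC figures.

```python
def ratio_extremes(web, bnc, n=2):
    """web, bnc: dicts mapping word -> relative frequency (e.g. per million).

    Returns (words only in web, n highest web:bnc ratios, n lowest ratios).
    """
    shared = set(web) & set(bnc)
    ranked = sorted(shared, key=lambda w: web[w] / bnc[w], reverse=True)
    web_only = sorted(set(web) - set(bnc))
    return web_only, ranked[:n], ranked[-n:]


# Invented relative frequencies.
web = {"url": 50.0, "browser": 40.0, "she": 900.0,
       "the": 50000.0, "walked": 20.0}
bnc = {"browser": 0.5, "she": 3500.0, "the": 60000.0, "walked": 200.0}

web_only, web_high, web_low = ratio_extremes(web, bnc)
```

With the slide's settings, the two inputs would be the top 50,000 items of each list and `n` would be 50.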
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself