What computers can and cannot do for lexicography
or
Us precision, them recall
Adam Kilgarriff
Lexicography Masterclass Ltd
and
University of Brighton, UK
27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 2
Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs
Find me all the fat cats
a request for information
High recall
• Lots of responses
• Maybe not all good
High precision
• Fewer hits
• Higher confidence
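The trade-off on these two slides can be made concrete with the standard definitions: precision is the share of returned hits that are relevant, and recall is the share of relevant items that were returned. The following is a minimal sketch; the item names and both result sets are invented for illustration and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four genuinely fat cats exist.
relevant = {"cat1", "cat2", "cat3", "cat4"}

# A generous search: finds every fat cat, plus four thin ones.
high_recall = {"cat1", "cat2", "cat3", "cat4", "thin1", "thin2", "thin3", "thin4"}

# A cautious search: only two hits, both correct.
high_precision = {"cat1", "cat2"}

print(precision_recall(high_recall, relevant))     # (0.5, 1.0)
print(precision_recall(high_precision, relevant))  # (1.0, 0.5)
```

The generous search misses nothing but half its hits are wrong; the cautious search is always right but misses half the cats.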
Us precision, them recall
             Recall   Precision
Computers    good     bad
People       bad      good
Us precision, them recall
• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography
Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations
Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs
Four ages of corpus lexicography
Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards
Age 2: KWIC Concordances
• From 1980
• Computerised
• COBUILD project was innovator
• asian-kwic.html
• the coloured-pens method
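A KWIC (key word in context) concordance lists every occurrence of a word with a fixed amount of text on either side, aligned on the keyword. The following is a minimal sketch of the idea, not the COBUILD software; the function name `kwic`, the `width` parameter and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Yield (left, node, right) context strings for each hit of keyword."""
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, token, right


# Hypothetical one-sentence "corpus".
text = "We must save money now so that we can save lives later"
lines = list(kwic(text.split(), "save", width=3))

for left, node, right in lines:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Each printed line is one concordance line; the lexicographer reads down the keyword column and sorts or marks up the contexts by hand.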
Age 2: limitations
as corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no
Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
Collocation listing
For right collocates of save (>5 hits)

word      fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests   6        170      life       36       4875
$1.2      6        180      dollars    8        1668
lives     37       1697     costs      7        1719
enormous  6        301      thousands  6        1481
annually  7        447      face       9        2590
jobs      20       2001     estimated  6        2387
money     64       6776     your       7        3141
Collocation statistics
• Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
• How sorted?
  – most common collocates -- but for most nouns it's "the"
  – most salient collocates -- how to measure salience?
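The "window" option above can be sketched in a few lines: for each occurrence of the headword, count every word in the next few positions. This is an illustrative sketch only; `right_collocates`, its parameters and the toy token list are assumptions, and a left-hand window (-5 to -1) would slice the tokens before the headword instead.

```python
from collections import Counter


def right_collocates(tokens, headword, window=5, min_hits=1):
    """Count words occurring 1..window positions to the right of headword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == headword:
            counts.update(tokens[i + 1:i + 1 + window])
    # Keep only collocates seen at least min_hits times.
    return {word: n for word, n in counts.items() if n >= min_hits}


# Hypothetical toy corpus.
tokens = "save money and save lives and save money".split()

print(right_collocates(tokens, "save", window=2))
# {'money': 2, 'and': 2, 'lives': 1}
```

Sorting this dictionary by raw count puts function words such as "and" near the top, which is the motivation for a salience measure.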
Mutual Information
• Church and Hanks 1989
• How much more often does a word pair occur than one might expect by chance?
• “Chance” of x and y occurring together: p(x) * p(y)
• Probabilities approximated by frequencies: p(x) ≈ f(x)/N
Mutual Information
X       fr(eat)   fr(X)     fr(eat X)   MI*     rank
it      1000      400,000   404         1/1M    3
meat    1000      6000      136         23/1M   2
sushi   1000      100       5           50/1M   1

* numbers are log-proportional to MI
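The MI* column can be reproduced from the three frequency columns as f(eat X) / (f(eat) * f(X)). The sketch below does that, and also shows the usual log form of mutual information; the corpus size `n` passed to `mutual_information` is an assumption, since the slide does not give N. Because the log form differs from the ratio only by a constant factor inside the logarithm, both give the same ranking.

```python
import math


def mi_ratio(f_xy, f_x, f_y):
    """f(x,y) / (f(x) * f(y)): the quantity shown in the MI* column."""
    return f_xy / (f_x * f_y)


def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: log2 of observed over chance co-occurrence, for corpus size n."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))


F_EAT = 1000
# word: (fr X, fr eat X), copied from the table above.
rows = {"it": (400_000, 404), "meat": (6000, 136), "sushi": (100, 5)}

for word, (f_x, f_xy) in rows.items():
    per_million = mi_ratio(f_xy, F_EAT, f_x) * 1_000_000
    print(f"{word:6} {per_million:6.2f} per million")

ranked = sorted(rows, key=lambda w: mi_ratio(rows[w][1], F_EAT, rows[w][0]), reverse=True)
print(ranked)  # ['sushi', 'meat', 'it']
```

The rare pair "eat sushi" comes out on top even though it occurs only five times, which is the behaviour the next slide questions.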
Problem
• mathematical salience = lexicographic salience?
• no! higher-frequency items are lexicographically more salient
Solution
• multiply MI by raw frequency
Mutual Information

X       fr(eat)   fr(X)     fr(eat X)   MI      rank   MI x fr   new rank
it      1000      400,000   404         1/1M    3      400/M     3
meat    1000      6000      136         23/1M   2      3128/M    1
sushi   1000      100       5           50/1M   1      2500/M    2
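The adjusted score can be sketched as the MI figure multiplied by the pair frequency, that is f(eat X) * f(eat X) / (f(eat) * f(X)). The function name `salience` is an assumption made for illustration. Computed directly from the counts shown, "meat" comes out near 3083 per million (the slide's 3128/M uses the rounded 23/1M), "it" near 408, and "sushi" near 250 rather than the 2500/M shown, so the order of the lower two may differ from the slide. Either way "meat" moves to the top, which is the point being made.

```python
def salience(f_xy, f_x, f_y):
    """MI ratio multiplied by the raw frequency of the pair."""
    return f_xy * f_xy / (f_x * f_y)


F_EAT = 1000
# word: (fr X, fr eat X), copied from the table above.
rows = {"it": (400_000, 404), "meat": (6000, 136), "sushi": (100, 5)}

# Scores expressed per million, as on the slide.
scores = {
    word: salience(f_xy, F_EAT, f_x) * 1_000_000
    for word, (f_x, f_xy) in rows.items()
}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:6} {score:8.1f} per million")
```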
Age-3 collocation statistics: limitations
• Lists contain junk
• unsorted for type -- MI lists mix adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation
Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
  – subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
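The step from parser output to "one list for each grammatical relation" can be sketched as a grouping operation. The triple format `(relation, headword, collocate)`, the relation names and the sample data below are all hypothetical, and raw frequency stands in for the salience statistic that would be used to sort each list in practice.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top=3):
    """Group a headword's collocates by grammatical relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top) for rel, counts in by_relation.items()}


# Hypothetical parser output: (relation, headword, collocate).
triples = [
    ("object_of", "coffee", "drink"),
    ("object_of", "coffee", "drink"),
    ("object_of", "coffee", "pour"),
    ("modifier", "coffee", "black"),
    ("modifier", "coffee", "strong"),
    ("modifier", "coffee", "black"),
    ("object_of", "tea", "drink"),
]

sketch = word_sketch(triples, "coffee")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Because each relation has its own list, verbs that take the headword as object are no longer mixed with its modifiers.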
Can we do it?
• high-accuracy parsing is hard
• lots of NLP work, many parsing frameworks exist
• if any parser can handle large corpus, it's probably good enough -- sorting, statistics, make us error-tolerant
• Poor man’s parsing:
  – object (of active verb) = last noun in any sequence of nouns, adjectives, determiners, numbers and adverbs following the verb
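The object rule above can be sketched over a POS-tagged sentence. This is an illustration under stated assumptions, not the rules used in the project: the tag names (`VERB`, `NOUN`, `ADJ`, `DET`, `NUM`, `ADV`) are a simplified hypothetical tagset, a plain loop stands in for regular expressions over tags, and the check that the verb is active is omitted.

```python
# Tags allowed inside the sequence that follows the verb (simplified tagset).
NP_TAGS = {"NOUN", "ADJ", "DET", "NUM", "ADV"}


def find_objects(tagged):
    """Return (verb, object) pairs from a list of (word, tag) tuples."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        last_noun = None
        # Walk the run of allowed tags after the verb, remembering the last noun.
        for next_word, next_tag in tagged[i + 1:]:
            if next_tag not in NP_TAGS:
                break
            if next_tag == "NOUN":
                last_noun = next_word
        if last_noun is not None:
            pairs.append((word, last_noun))
    return pairs


# Hypothetical tagged sentence.
tagged = [
    ("She", "PRON"), ("drank", "VERB"), ("two", "NUM"), ("very", "ADV"),
    ("strong", "ADJ"), ("coffee", "NOUN"), ("cups", "NOUN"),
    ("quickly", "ADV"), (".", "PUNCT"),
]

print(find_objects(tagged))  # [('drank', 'cups')]
```

The rule will sometimes pick the wrong noun, but over a large corpus the sorting and statistics push such errors down the lists.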
The word sketch
• coffee_n.html
Macmillan Dictionary of English for Advanced Learners, 2002. Editor: Rundell. Work done 1999.
• Word sketches produced for 6000 most common nouns, verbs, adjectives of English
• using British National Corpus (100 M words, already POS-tagged)
• lemmatized using John Carroll's lemmatizer
• parsed using regular expressions over POS-tags
• HTML files with hyperlinked corpus examples
• lexicographers used them extensively, used instead of going direct to corpus
• positive feedback
Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs
Natural Language Processing
• The academic discipline which provides the tools
  – Also known as Computational Linguistics, Human Language Technology (HLT), Language Engineering
• Good at evaluation of its tools
• Good news for lexicography:
  – identify the best tools, apply them to our corpora
An Anglophone Apology
• Technology, tools, resources most often available for English
• This talk centres on English
• Other languages often present new problems
  – Finding word delimiters for Chinese is hard
  – Finding bunsetsu for Japanese is hard
• Fewer resources available, less work done
• Recommendation:
  – find the local experts for your language
Recap: Lexicography: finding facts about words
• collocations - sketches
• grammatical patterns - sketches
• idioms
• synonyms
• antonyms
• meanings
• translations
Idioms
• Extreme case of collocation/multi-word expressions
• Sequence of workshops on collocations, MWE
• Technical terms (of great interest to technologists, technical): TERMIGHT
Antonyms
• Essential semantic relation
• but Justeson and Katz 1995: distributional evidence for typical antonym pairs
  – rich men and poor men
  – the big ones and the small ones
  – black and white issues
• Perhaps antonyms are ‘really’ distributional
Thesauruses
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share same distributions
  – if corpus contains object(drink, wine), object(drink, beer)
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
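The point-scoring scheme on this slide can be sketched directly: two words earn one point for every (relation, head) context they share, and a word's nearest neighbours are the words with which it has the most points. The triple format and the sample data are hypothetical, and counting every shared context equally is a simplification; practical systems typically weight contexts rather than treating them all alike.

```python
from collections import defaultdict
from itertools import combinations


def similarity_points(triples):
    """Score one point per (relation, head) context shared by two words."""
    contexts = defaultdict(set)
    for relation, head, word in triples:
        contexts[word].add((relation, head))
    points = {}
    for a, b in combinations(sorted(contexts), 2):
        shared = len(contexts[a] & contexts[b])
        if shared:
            points[(a, b)] = shared
    return points


def nearest_neighbours(points, word):
    """Rank the other words by the points they share with word."""
    scored = []
    for (a, b), score in points.items():
        if word in (a, b):
            scored.append((b if a == word else a, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical parsed corpus: (relation, head, word).
triples = [
    ("object", "drink", "wine"),
    ("object", "drink", "beer"),
    ("object", "pour", "wine"),
    ("object", "pour", "beer"),
    ("object", "drink", "tea"),
    ("object", "read", "book"),
]

points = similarity_points(triples)
print(nearest_neighbours(points, "wine"))  # [('beer', 2), ('tea', 1)]
```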
Nearest neighbours
Translation
• Parallel corpora
  – Texts and their translations
• or Comparable corpora
  – Matched for source and target (genre and subject matter), not translations
• Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?
  – They are candidate translation pairs
• Very hard problem
• Lots of high quality research
The WASPbench
• with David Tugwell, supported by UK EPSRC, grant M54971
• A lexicographer's workbench
• runtime creation of word sketches
• integration with Word Sense Disambiguation technology
• output is "disambiguating dictionary" - analysis of word's meaning into senses, plus computer program for disambiguating contextualised instances of the word
• First release now available. http://wasps.itri.brighton.ac.uk/
• Sketches at http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/wordsketches.html
The Sketch Engine
• Input:
  – any corpus, any language
    – Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with Corpus query system
  – Supports complex searching, sorting etc
• First release early 2004
Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs
Cyborgs
• Robots: will they take over?
• Rod Brooks’s answer:
  – Wrong question: greatest advances are in what the human+computer ensemble can do
Cyborgs
• A creature that is partly human and partly machine
  – Macmillan English Dictionary
Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).
Treat your computer with respect. You and it can do great things together.
Lexicographers of the future?