Page 1: Thesauruses for Natural Language Processing

Thesauruses for Natural Language Processing

Adam Kilgarriff

Lexicography MasterClass and University of Brighton

Page 3: Thesauruses for Natural Language Processing

Outline

Definition
Uses for NLP
WASPS thesaurus
Web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

Page 4: Thesauruses for Natural Language Processing

What is a thesaurus?

a resource that groups words according to similarity

Page 5: Thesauruses for Natural Language Processing

Manual and automatic

Manual
– Roget, WordNets, many publishers

Automatic
– Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999)
– aka distributional
– two words are similar if they occur in the same contexts

Are they comparable?

Page 7: Thesauruses for Natural Language Processing

Thesauruses in NLP

sparse data

does x go with y?
– don’t know, they have never been seen together

New question: does x+friends go with y+friends?
– indirect evidence for x and y
– thesaurus tells us who friends are
– “backing off”

Page 8: Thesauruses for Natural Language Processing

Relevant in:

Parsing
– PP-attachment
– conjunction scope

Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

Page 13: Thesauruses for Natural Language Processing

Speech understanding

He’s as headstrong as an alleg***** in the upwaters of the Yangtze

allegory? in upwaters? No
alligator? in upwaters? No
allegory+friends in upwaters? No
alligator+friends in upwaters? Yes

Page 14: Thesauruses for Natural Language Processing

PP-attachment

investigate stromatolite with microscope/speckles
– microscope: verb attachment
– speckles: noun attachment

inspect jasper with spectrometer
– which?

Page 16: Thesauruses for Natural Language Processing

PP attachment (cont)

compare frequencies of
– <inspect, with, spectrometer>
– <jasper, with, spectrometer>

both zero? Try
– <inspect+friends, with, spectrometer+friends>
– <jasper+friends, with, spectrometer+friends>
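
A minimal Python sketch of the comparison above, under stated assumptions: the triple counts and the `FRIENDS` neighbour lists are invented toy data, and `triple_score` and `attach` are hypothetical helper names, not part of any published system. It tries the direct frequency first and backs off to word+friends only when that is zero.

```python
from collections import Counter

# Hypothetical counts for <head, preposition, noun> triples (toy data, not from a corpus).
TRIPLE_COUNTS = Counter({
    ("examine", "with", "microscope"): 5,
    ("inspect", "with", "instrument"): 2,
    ("granite", "with", "speckles"): 1,
})

# Hypothetical thesaurus neighbours ("friends") for each word.
FRIENDS = {
    "inspect": ["examine", "investigate"],
    "jasper": ["granite", "quartz"],
    "spectrometer": ["microscope", "instrument"],
}


def expand(word):
    """Return the word followed by its friends."""
    return [word] + FRIENDS.get(word, [])


def triple_score(head, prep, noun):
    """Direct frequency of <head, prep, noun>; if zero, the summed frequency over friends."""
    direct = TRIPLE_COUNTS[(head, prep, noun)]
    if direct:
        return direct
    return sum(TRIPLE_COUNTS[(h, prep, n)] for h in expand(head) for n in expand(noun))


def attach(verb, obj, prep, noun):
    """Return 'verb', 'noun' or 'undecided' for the attachment of the PP 'prep noun'."""
    verb_score = triple_score(verb, prep, noun)
    noun_score = triple_score(obj, prep, noun)
    if verb_score > noun_score:
        return "verb"
    if noun_score > verb_score:
        return "noun"
    return "undecided"


decision = attach("inspect", "jasper", "with", "spectrometer")
print(decision)
```

With this toy data the verb reading wins only through the friends. Ties, including the case where both scores are still zero, are reported as 'undecided' rather than guessed.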

Page 21: Thesauruses for Natural Language Processing

Conjunction scope

Compare
– old boots and shoes
– old boots and apples

Are the shoes old?
Are the apples old?

Hypothesis:
– wide scope only when words are similar

hard problem: thesaurus might help
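
One way the hypothesis could be tried out, as a rough Python sketch. The `NEIGHBOURS` lists are invented stand-ins for a real thesaurus, and `are_similar` and `modifier_scope` are hypothetical names. Note that the sketch turns the hypothesis into a decision rule (similar means wide scope), which is stronger than the slide's "only when".

```python
# Hypothetical nearest-neighbour lists, standing in for a real thesaurus.
NEIGHBOURS = {
    "boots": ["shoes", "trousers", "gloves"],
    "shoes": ["boots", "socks", "trousers"],
    "apples": ["pears", "oranges", "plums"],
}


def are_similar(a, b, neighbours=NEIGHBOURS):
    """Treat two words as similar if either appears in the other's neighbour list."""
    return b in neighbours.get(a, []) or a in neighbours.get(b, [])


def modifier_scope(first_noun, second_noun):
    """Guess whether a modifier on the first conjunct also covers the second."""
    return "wide" if are_similar(first_noun, second_noun) else "narrow"


print(modifier_scope("boots", "shoes"))
print(modifier_scope("boots", "apples"))
```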

Page 23: Thesauruses for Natural Language Processing

Bridging anaphor resolution

– Maria bought a large apple. The fruit was red and crisp.

fruit and apple co-refer
How to find co-referring terms?

Page 24: Thesauruses for Natural Language Processing

Text cohesion

words on same theme
– same segment

change in theme of words
– new segment

same theme: same thesaurus class
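
A crude word-by-word Python sketch of this idea, assuming an invented `THESAURUS_CLASS` mapping and a hypothetical `segment_by_class` helper. A real segmenter would more likely compare windows of words than single words; this only illustrates "change of class, new segment".

```python
# Hypothetical word-to-class mapping, standing in for thesaurus classes.
THESAURUS_CLASS = {
    "pike": "fish", "carp": "fish", "bream": "fish",
    "microscope": "instrument", "spectrometer": "instrument",
}


def segment_by_class(words, word_class=THESAURUS_CLASS):
    """Start a new segment whenever the thesaurus class of a word differs from the last one seen.

    Words with no known class stay in the current segment.
    """
    segments = [[]]
    current_class = None
    for word in words:
        cls = word_class.get(word)
        if cls is not None and current_class is not None and cls != current_class:
            segments.append([])  # theme changed: open a new segment
        if cls is not None:
            current_class = cls
        segments[-1].append(word)
    return [segment for segment in segments if segment]


print(segment_by_class(["pike", "and", "carp", "spectrometer", "or", "microscope"]))
```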

Page 25: Thesauruses for Natural Language Processing

Word Sense Disambiguation (WSD)

pike: fish or weapon
– We caught a pike this afternoon

probably no direct evidence for
– catch pike

probably is direct evidence for
– catch {pike,carp,bream,cod,haddock,…}
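
The class-level evidence can be sketched in Python as follows. The counts, the two class lists and the helper names `class_evidence` and `choose_class` are all assumptions made for illustration; in particular <catch, pike> is deliberately left unseen.

```python
from collections import Counter

# Hypothetical <verb, object> counts; <catch, pike> itself has never been seen.
VERB_OBJECT_COUNTS = Counter({
    ("catch", "carp"): 6,
    ("catch", "cod"): 4,
    ("carry", "spear"): 3,
    ("carry", "sword"): 2,
})

# Hypothetical thesaurus classes that contain "pike".
PIKE_CLASSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "sword", "lance"],
}


def class_evidence(verb, members, counts=VERB_OBJECT_COUNTS):
    """Sum the counts of <verb, member> over every member of a thesaurus class."""
    return sum(counts[(verb, member)] for member in members)


def choose_class(verb, classes=PIKE_CLASSES):
    """Return the class with the most evidence for the verb, or None if no class has any."""
    scores = {name: class_evidence(verb, members) for name, members in classes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(choose_class("catch"))
```

When two classes tie with a non-zero score, `max` simply returns the first one listed; a fuller version would need an explicit tie-breaking rule.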

Page 26: Thesauruses for Natural Language Processing

WordNet, Roget

widely used for all the above

Page 27: Thesauruses for Natural Language Processing

The WASPS thesaurus
– credit: David Tugwell
– EPSRC grant K8931

POS-tag, lemmatise and parse the BNC (100M words)

Find all grammatical relations
– <obj, climb, bank>
– <modifier, big, bank>
– <subject, bank, refuse>

70 million triples

Page 28: Thesauruses for Natural Language Processing

WASPS thesaurus (cont)

Similarity:
– <obj, drink, beer>
– <obj, drink, wine>

one point of similarity between beer and wine
count all points of similarity between all pairs of words
weight according to frequencies
– product of MI: Lin (1998)
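
A small Python sketch of one possible reading of the steps above; it is not claimed to be the WASPS implementation or Lin's exact measure. Assumptions: the triple counts are invented, the third slot of each triple is treated as the word being described, MI is a pointwise mutual information computed per relation, only positive MI values are kept, and each shared context contributes the product of the two words' MI values.

```python
import math
from collections import Counter

# Hypothetical counts for <relation, head, word> triples (toy data).
TRIPLES = Counter({
    ("obj", "drink", "beer"): 8,
    ("obj", "drink", "wine"): 6,
    ("obj", "brew", "beer"): 3,
    ("obj", "pour", "wine"): 2,
    ("obj", "climb", "bank"): 4,
    ("obj", "rob", "bank"): 5,
})


def mutual_information(rel, head, word, triples=TRIPLES):
    """Pointwise MI between head and word within one grammatical relation."""
    joint = triples[(rel, head, word)]
    if joint == 0:
        return 0.0
    rel_total = sum(c for (r, _, _), c in triples.items() if r == rel)
    head_total = sum(c for (r, h, _), c in triples.items() if r == rel and h == head)
    word_total = sum(c for (r, _, w), c in triples.items() if r == rel and w == word)
    return math.log2(joint * rel_total / (head_total * word_total))


def contexts(word, triples=TRIPLES):
    """Map each <relation, head> context of the word to its MI, keeping positive values only."""
    result = {}
    for (rel, head, w) in triples:
        if w == word:
            mi = mutual_information(rel, head, word, triples)
            if mi > 0:
                result[(rel, head)] = mi
    return result


def similarity(word_a, word_b, triples=TRIPLES):
    """Sum, over shared contexts, the product of the two words' MI values."""
    ctx_a = contexts(word_a, triples)
    ctx_b = contexts(word_b, triples)
    shared = ctx_a.keys() & ctx_b.keys()
    return sum(ctx_a[c] * ctx_b[c] for c in shared)


print(similarity("beer", "wine"), similarity("beer", "bank"))
```

Here beer and wine share the single context <obj, drink>, so they score above zero, while beer and bank share nothing and score zero. Sorting all other words by this score would give a nearest-neighbour list for a word.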

Page 29: Thesauruses for Natural Language Processing

Word Sketches

one-page summary of a word’s grammatical and collocational behaviour

demo: http://wasps.itri.bton.ac.uk

the Sketch Engine
– input any corpus
– generate word sketches and thesaurus
– just available now

Page 30: Thesauruses for Natural Language Processing

Nearest neighbours to zebra

Page 31: Thesauruses for Natural Language Processing

Nearest neighbours

zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

Page 33: Thesauruses for Natural Language Processing

exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy

pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

Page 34: Thesauruses for Natural Language Processing

VERBS

measure

determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust

boil

simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

Page 35: Thesauruses for Natural Language Processing

ADJECTIVES

hypnotic

haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky

pink

purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

Page 36: Thesauruses for Natural Language Processing

Nearest neighbours

crane: winch heron tractor truck swan

winch: crane mast rigging pump tractor

swan: heron crane gull tern curlew

heron: tern gull swan crane flamingo

Page 37: Thesauruses for Natural Language Processing

no clustering (tho’ could be done)
no hierarchy (tho’ could be done)
rhythm
all on the web: http://wasps.itri.bton.ac.uk
– registration required

Page 38: Thesauruses for Natural Language Processing

The web

an enormous linguist’s playground
– Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds) 29 (3)
• (coming soon)

Page 40: Thesauruses for Natural Language Processing

Google sets

http://labs.google.com/sets
Input: zebra giraffe buffalo
Output: kudu hyena impala leopard hippo waterbuck elephant cheetah eland

Page 42: Thesauruses for Natural Language Processing

Google sets

http://labs.google.com/sets
Input: harbin beijing nanking
Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou

Page 43: Thesauruses for Natural Language Processing

Tree structure

Roget
– all human knowledge as tree structure
– 1000 top categories
• subdivisions
– like this
» etc
» etc

Page 44: Thesauruses for Natural Language Processing

Directories and thesauruses

Yahoo, http://www.yahoo.com
Open directory project, http://dmoz.org
– all human activity as tree structure

plus corpus at every node
– gather corpus, identify domain vocabulary
• Gonzalo and colleagues, Madrid, CL Special Issue
• Agirre and colleagues, ‘topic signatures’

Page 47: Thesauruses for Natural Language Processing

Words and word senses

automatic thesauruses
– words

manual thesauruses
– simple hierarchy is appealing
– homonyms
– “aha! objects must be word senses”

Page 48: Thesauruses for Natural Language Processing

Problems

Theoretical
Practical

Page 49: Thesauruses for Natural Language Processing

Theoretical

Page 52: Thesauruses for Natural Language Processing

Wittgenstein

Don’t ask for the meaning, ask for the use

Page 53: Thesauruses for Natural Language Processing

Practical

Page 55: Thesauruses for Natural Language Processing

Problems

Practical
– a thesaurus is a tool
– if the tool organises word senses you must do WSD before you can use it
– WSD: state of the art, optimal conditions: 80%

“To use this tool, first replace one fifth of your input with junk”

Page 58: Thesauruses for Natural Language Processing

Avoid word senses

This word has three meanings/senses
This word has three kinds of use
– well founded
– empirical
– we can study it

Page 59: Thesauruses for Natural Language Processing

sorry, Roget

Page 61: Thesauruses for Natural Language Processing

sorry, AI

AI model for NLP:
– NLP turns text into meanings
– AI reasons over meanings
– word meanings are concepts in an ontology
– a Roget-like thesaurus is (to a good approximation) an ontology
– Guarino: “cleansing” WordNet

If a thesaurus groups words in their various uses (not meanings)
– not the sort of thing AI can reason over

Page 62: Thesauruses for Natural Language Processing

sorry, AI

“linguistic expressions prompt for meanings rather than express meanings”
– Fauconnier and Turner 2003

It would be nice if …
But …

Page 63: Thesauruses for Natural Language Processing

Evaluation

manual thesauruses
– not done

automatic thesauruses: attempts
– pseudo-disambiguation (Lee 1999)
– with ref to manual ones (Lin 1998)

Page 65: Thesauruses for Natural Language Processing

Task-based evaluation

Parsing
– PP-attachment
– conjunction scope

Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

Page 66: Thesauruses for Natural Language Processing

What is performance at the task
– with no thesaurus
– with Roget
– with WordNet
– with WASPS
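
A possible shape for such a comparison, sketched in Python. Everything here is an assumption for illustration: the function names, the tiny conjunction-scope test set with its gold labels, and the "toy" thesaurus, which merely stands in for Roget, WordNet or WASPS. `None` plays the role of "no thesaurus".

```python
def accuracy(decide, test_items):
    """Fraction of (input, gold) items for which decide(input) equals the gold answer."""
    if not test_items:
        return 0.0
    correct = sum(1 for item, gold in test_items if decide(item) == gold)
    return correct / len(test_items)


def compare_thesauruses(make_decider, thesauruses, test_items):
    """Run the same task once per thesaurus and return the accuracy for each."""
    return {
        name: accuracy(make_decider(thesaurus), test_items)
        for name, thesaurus in thesauruses.items()
    }


def make_scope_decider(thesaurus):
    """Hypothetical task: decide conjunction scope, with or without a thesaurus."""
    def decide(pair):
        first, second = pair
        if thesaurus is None:
            return "narrow"  # baseline: no evidence of similarity available
        return "wide" if second in thesaurus.get(first, []) else "narrow"
    return decide


# Invented test items and gold labels.
TEST_ITEMS = [(("boots", "shoes"), "wide"), (("boots", "apples"), "narrow")]
THESAURUSES = {
    "none": None,
    "toy": {"boots": ["shoes", "gloves"]},
}

results = compare_thesauruses(make_scope_decider, THESAURUSES, TEST_ITEMS)
print(results)
```

The point of the design is that the task and the test set stay fixed while only the thesaurus changes, so any difference in accuracy can be attributed to the thesaurus.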

Page 67: Thesauruses for Natural Language Processing

Plans

set up evaluation tasks
theseval
web-based thesaurus
– Open Directory Project hierarchies
campaign

Page 68: Thesauruses for Natural Language Processing

Cyborgs

Robots: will they take over?
Rod Brooks’s answer:
– Wrong question: greatest advances are in what the human+computer ensemble can do

Page 69: Thesauruses for Natural Language Processing

Cyborgs

A creature that is partly human and partly machine – Macmillan English Dictionary

Page 74: Thesauruses for Natural Language Processing

Cyborgs and the Information Society

The thesaurus-making agent is part human (for precision), part computer (for recall).

Page 75: Thesauruses for Natural Language Processing

Summary: Thesauruses for NLP

Definition
Uses for NLP
WASPS thesaurus
Web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

Page 76: Thesauruses for Natural Language Processing

Thesaurus-makers of the future?