Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis, Proc KB&KB’93 Workshop, 1993, pp257-266. http://www.cs.vassar.edu/faculty/ide/pubs.html As (mis-)interpreted by Peter Clark.
![Page 1: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/1.jpg)
Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?
Nancy Ide and Jean Veronis
Proc KB&KB’93 Workshop, 1993, pp257-266
http://www.cs.vassar.edu/faculty/ide/pubs.html
As (mis-)interpreted by Peter Clark
![Page 2: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/2.jpg)
The Postulates of MRD Work
• P1: MRDs contain information that is useful for NLP
e.g.:
![Page 3: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/3.jpg)
The Postulates of MRD Work
• P1: MRDs contain information that is useful for NLP
• P2: This info is relatively easy to extract from MRDs
e.g., extraction of hypernyms (generalizations):
Dipper isa Ladle isa Spoon isa Utensil
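A minimal sketch of the kind of hypernym extraction P2 has in mind, assuming the common "genus term" heuristic (take the first word after the leading article as the hypernym). The definitions below are hand-written toy glosses, not entries from any real MRD, and the names `TOY_DEFINITIONS`, `genus_term`, and `hypernym_chain` are illustrative only.

```python
import re

# Hypothetical, hand-written definitions (not taken from any real MRD).
TOY_DEFINITIONS = {
    "dipper": "a ladle used for dipping water",
    "ladle": "a spoon with a long handle and a deep bowl",
    "spoon": "a utensil with a shallow bowl on a handle",
}


def genus_term(definition):
    """Return the first word after a leading article, taken as the hypernym."""
    match = re.match(r"(?:an?|the)\s+([a-z]+)", definition.lower())
    return match.group(1) if match else None


def hypernym_chain(word, definitions):
    """Follow genus terms upward until no definition is left or a loop appears."""
    chain = [word]
    seen = {word}
    while chain[-1] in definitions:
        parent = genus_term(definitions[chain[-1]])
        if parent is None or parent in seen:
            break  # no usable genus term, or a circular definition
        chain.append(parent)
        seen.add(parent)
    return chain


print(" isa ".join(hypernym_chain("dipper", TOY_DEFINITIONS)))
# prints: dipper isa ladle isa spoon isa utensil
```

The heuristic only works because the toy glosses are tidy: a definition such as "a long-handled spoon" would yield "long" as the genus term, which is one small instance of the complaints that follow.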
![Page 4: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/4.jpg)
But…
• Not much to show for it so far (1993)
– handful of limited and imperfect taxonomies
– few studies on the quality of knowledge in MRDs
– few studies on extracting more complex info
![Page 5: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/5.jpg)
Complaints…
• P1: useful info in MRDs:
– C1a: 50%-70% of info in dictionaries is “garbled”
– C1b: sense definitions ≠ concept usage (“real concepts”)
– C1c: some types of knowledge simply not there
• P2: Info can be easily extracted
• Most successes have been for hypernyms only
– C2a: MRD formats are a nightmare to deal with
– C2b: A virtually open-ended set of ways of describing facts
– C2c: Bootstrapping: Need a KB to build a KB from an MRD
![Page 6: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/6.jpg)
C1a: MRD information is “garbled”
• Multiple people, multiple years effort
• Space restrictions, syntactic restrictions
• Particular problem 1:
– Attachment of terms too high (21%-34%)
• e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers”
• occurs fairly randomly
– Categories less clear at top levels
• “fork” and “spoon” are ok, but “implement” and “utensil” = ?
• Sometimes no word there to refer to a concept
– leads to circular definitions
![Page 7: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/7.jpg)
C1a: MRD information is “garbled”
![Page 8: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/8.jpg)
C1a: MRD information is “garbled”
• Particular problem 2:
– Categories less clear at top levels
• “fork” and “spoon” are ok, but “implement” and “utensil” = ?
• Leads to disjuncts, e.g., “implement or utensil”
• Sometimes no word there to refer to a concept
– leads to circular definitions
– leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)
![Page 9: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/9.jpg)
C1a: MRD information is “garbled”
• Particular problem 3:
– And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.
![Page 10: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/10.jpg)
C1b: sense definitions ≠ concept usage (“real concepts”)
• Ambiguity of word senses, e.g.,
– 87% of words in a sample fit > 1 word sense
• Word senses don’t reflect actual use
• Word sense distinctions differ between MRDs
– level of detail
– way lines are drawn between senses
– no definitive set of distinctions
![Page 11: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/11.jpg)
C1c: some types of knowledge simply not there
• no broad contextual or world knowledge, e.g.,
– no connection between “lawn” and “house”, or between “ash” and “tobacco”
– “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet]
• No mention that it’s a commercial business, e.g., for “the waitress collected the check.”
![Page 12: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/12.jpg)
C2a: MRD formats are a nightmare to deal with
• Ambiguities / inconsistencies in typesetter format
• Complex grammars for entries
• Conventions are inconsistent, e.g. bracketing for
– “Canopic jar, urn, or vase” vs.
– “Junggar Pendi, Dzungaria, or Zungaria”
• Need a lot of hand pre-processing
– not much general value to this
– is a vast task in itself
– not many processed dictionaries available
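The bracketing example can be made concrete with a small sketch. It assumes the intended contrast is that “Canopic” distributes over all three nouns, while the second headword lists three independent names; the function names are hypothetical and the splitting rule is deliberately naive.

```python
def split_alternatives(headword):
    """Split 'A, B, or C' into its comma-separated parts."""
    parts = headword.replace(", or ", ", ").split(",")
    return [part.strip() for part in parts]


def flat_reading(headword):
    """Treat every part as a complete, independent name."""
    return split_alternatives(headword)


def distributed_reading(headword):
    """Treat the leading words of the first part as a modifier shared by all parts."""
    parts = split_alternatives(headword)
    words = parts[0].split()
    if len(words) < 2:
        return parts  # nothing to distribute
    modifier = " ".join(words[:-1])
    return [parts[0]] + [f"{modifier} {part}" for part in parts[1:]]


for entry in ("Canopic jar, urn, or vase", "Junggar Pendi, Dzungaria, or Zungaria"):
    print(entry)
    print("  flat:       ", flat_reading(entry))
    print("  distributed:", distributed_reading(entry))
```

Both readings are syntactically available for both headwords; nothing in the surface format says which one the lexicographer meant, which is why this step tends to fall back on hand pre-processing.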
![Page 13: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/13.jpg)
C2b: A virtually open-ended set of ways of describing facts
But…
There is “virtually an open-ended set of phrases…”
![Page 14: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/14.jpg)
C2c: Bootstrapping: Need a KB to build a KB
Need knowledge to do NLP on MRDs!
– e.g. “carry by means of a handle” vs. “carry by means of a wagon”
• But undisambiguated hierarchy is unusable, e.g.,
– “saucepan” isa “pan” isa “leaf”
⇒ need to build your KB before you even start on the MRD
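To see why an undisambiguated hierarchy misleads, here is a small sketch that walks hypernym links twice: once keyed by bare words, once keyed by (word, sense number) pairs. The links and sense numbers are invented for illustration; only the saucepan/pan/leaf chain comes from the slide.

```python
# Hypothetical word-level links: all senses of "pan" are collapsed into one node.
WORD_HYPERNYMS = {
    "saucepan": {"pan"},
    "pan": {"container", "leaf"},  # two unrelated senses share one key
}

# The same links with (word, sense) keys; sense numbers are made up.
SENSE_HYPERNYMS = {
    ("saucepan", 1): {("pan", 1)},
    ("pan", 1): {("container", 1)},
    ("pan", 2): {("leaf", 1)},
}


def ancestors(node, links):
    """Collect every node reachable by following hypernym links upward."""
    found = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent in links.get(current, ()):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(sorted(ancestors("saucepan", WORD_HYPERNYMS)))
# prints: ['container', 'leaf', 'pan']
print(sorted(ancestors(("saucepan", 1), SENSE_HYPERNYMS)))
# prints: [('container', 1), ('pan', 1)]
```

The sense-keyed table gives the right answer, but writing it required already knowing which sense of “pan” a saucepan is, which is the bootstrapping problem in miniature.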
![Page 15: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/15.jpg)
Synthesis
• Underlying postulate of P1 and P2:
– P0: Large KBs cannot be built by hand
• Counterexamples:
– Cyc
– Dictionaries themselves!
• And besides…
– KBs are too hard to extract from MRDs
– MRDs don’t contain all the knowledge needed
• But: MRD contributions:
– understanding the structure of dictionaries
– convergence of NLP, lexicography, and electronic publishing interests
![Page 16: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/16.jpg)
Ways forward…
• Combining Knowledge Sources:
– One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5%
• Also should combine knowledge from corpora as a means of “filling out” KBs
• Prediction:
– KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus)
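The effect of combining dictionaries can be sketched as follows. The three dictionaries and their gaps are made-up toy data chosen so that each one is incomplete in a different place; the numbers are not an attempt to reproduce the 55%-70% and 5% figures above, and `problematic_rate` and `merge` are hypothetical helper names.

```python
WORDS = ["cup", "bowl", "pan", "ladle", "fork"]

# Hypothetical hypernym entries; None marks a problematic (missing) case.
DICTIONARIES = {
    "dict_a": {"cup": "container", "bowl": None, "pan": "vessel", "ladle": None, "fork": None},
    "dict_b": {"cup": None, "bowl": "container", "pan": None, "ladle": "spoon", "fork": None},
    "dict_c": {"cup": "container", "bowl": None, "pan": None, "ladle": None, "fork": "utensil"},
}


def problematic_rate(entries, words):
    """Fraction of words for which no hypernym is recorded."""
    missing = sum(1 for word in words if not entries.get(word))
    return missing / len(words)


def merge(dictionaries, words):
    """For each word, keep the first hypernym that any dictionary supplies."""
    merged = {}
    for word in words:
        candidates = [d[word] for d in dictionaries.values() if d.get(word)]
        merged[word] = candidates[0] if candidates else None
    return merged


for name, entries in DICTIONARIES.items():
    print(name, problematic_rate(entries, WORDS))
print("merged", problematic_rate(merge(DICTIONARIES, WORDS), WORDS))
```

Taking the first available hypernym is the simplest possible merge policy; a real combination would also have to reconcile disagreements between sources (here “container” vs. “vessel”-style differences), which this sketch ignores.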
![Page 17: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/17.jpg)
Ways forward…
• MRDs will become encoded more consistently
• Better analysis needed of the types of knowledge needed for NLP
– perhaps don’t need the kind of precision in a KB
• Exploitation of associational information
– Very useful for sense disambiguation (e.g., Harabagiu)
![Page 18: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/18.jpg)
Ways forward…
• Lexicographers increasingly interested in using lexical databases for their work
• Could create an NLP-like KB directly
– Create explicit semantic links between word entries
– Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)
![Page 19: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/19.jpg)
Ways forward…
![Page 20: Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?](https://reader036.vdocuments.us/reader036/viewer/2022062804/568149cc550346895db6fb76/html5/thumbnails/20.jpg)
Ways forward…
• Lexicographers increasingly interested in using lexical databases for their work
• Could create an NLP-like KB directly
– Create explicit semantic links between word entries
– Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)
– Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated)
– Ensure consistency of sense division
• e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) ⇒ could spot this inconsistency
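A lexical database could flag the cup/bowl/glass inconsistency mechanically. The sketch below assumes each entry records which kinds of sense it has and that the three words have already been grouped as siblings; the `SENSE_TYPES` table and the function name are illustrative, not part of any existing tool.

```python
# Hypothetical sense inventory for a group of sibling entries.
SENSE_TYPES = {
    "cup": {"literal", "metonymic"},
    "bowl": {"literal", "metonymic"},
    "glass": {"literal"},
}


def inconsistent_entries(sense_types):
    """Report, per word, the sense types that its siblings have but it lacks."""
    expected = set().union(*sense_types.values())
    return {
        word: expected - senses
        for word, senses in sense_types.items()
        if expected - senses
    }


print(inconsistent_entries(SENSE_TYPES))
# prints: {'glass': {'metonymic'}}
```

A flagged entry is only a prompt for the lexicographer: the missing sense may be a genuine gap, or the words may differ in usage for good reason.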