Castanet: Using WordNet to Build Facet Hierarchies
Emilia Stoica and Marti Hearst
School of Information, Berkeley
Motivation
Want to assign labels from multiple hierarchies
Motivation
Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot
Meat: chicken
Vegetables: pepper
Fruit: apricot
Flavor: gingerroot
Castanet
Carves out a structure from the hypernym (IS-A) relations within WordNet
Produces surprisingly good results for a wide range of subjects, e.g., arts, medicine, recipes, math, news, bibliographical records
WordNet Challenges
A word may have more than one sense
- Fine granularity of word sense distinctions
  e.g., newspaper (#1): daily publication on folded sheets
        newspaper (#3): physical object
- Ambiguity for the same sense
  e.g., tuna (#1): cactus
        tuna (#2): fish (food fish, bony fish)
WordNet Challenges (cont.)
The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes)
Sparse coverage of proper names and noun phrases (not addressed)
Algorithm Goals
- Build a set of facet hierarchies
- Balance depth and breadth
  - Avoid “skinny” paths
  - Don’t go too deep or too broad
- Choose understandable labels
- Disambiguate words
  - Currently a word can take on only one sense
Our Approach
[Diagram: Documents -> Select terms -> Build core tree (using WordNet) -> Augment core tree -> Compress tree -> Remove top-level categories -> Divide into facets]
1. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
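A minimal sketch of term selection, under two assumptions that the slide does not spell out: a term's "distribution" is taken to be the number of documents containing it, and "top 10%" is read as the top tenth of terms ranked by that number. The stopword list and the name `select_terms` are hypothetical.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "and", "of", "the", "with"}

def select_terms(docs, top_fraction=0.10):
    """Return {term: document frequency} for the best-distributed terms."""
    doc_freq = Counter()
    for text in docs:
        # Count each term at most once per document.
        terms = {w for w in text.lower().split() if w not in STOPWORDS}
        doc_freq.update(terms)
    # Rank by document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    # Assumption: always keep at least one term.
    keep = max(1, round(len(ranked) * top_fraction))
    return {t: doc_freq[t] for t in ranked[:keep]}
```

The returned document frequencies can be reused later, since adding a term to the core tree increases node counts by the number of documents containing it.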
2. Build Core Tree
- Get the hypernym path if the term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by the number of docs with the term
- Build a “backbone”
  - Create paths from unambiguous terms only
  - Bias the structure towards appropriate senses of words
Example hypernym paths:
sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet
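The counting step can be sketched as follows. This is an illustration only: `SENSE_PATHS` is a hand-written stand-in for WordNet lookups (the paths are abbreviated as on the slide), `core_paths` is a hypothetical name, and the domain-matching condition is left out. Nodes are keyed by their full path prefix so that equal labels in different places stay distinct.

```python
from collections import Counter

# Toy stand-in for WordNet: term -> one hypernym path per sense (root first).
TOP = ("entity", "substance, matter", "nutriment", "dessert", "frozen dessert")
SENSE_PATHS = {
    "sundae": [TOP + ("ice cream sundae",)],
    "sherbet": [TOP + ("sherbet, sorbet",)],
    # Two senses, so "date" is ambiguous and is skipped at this stage.
    "date": [("food", "edible fruit"), ("time period", "calendar day")],
}

def core_paths(doc_freq):
    """Return (paths of unambiguous terms, count per node on those paths)."""
    paths = {}
    counts = Counter()
    for term, n_docs in doc_freq.items():
        senses = SENSE_PATHS.get(term, [])
        if len(senses) != 1:
            continue  # unknown or ambiguous: not part of the backbone
        path = senses[0] + (term,)
        paths[term] = path
        # Every node on the path gains the term's document count.
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += n_docs
    return paths, counts
```

With document counts of 5 for sundae and 3 for sherbet, the shared nodes down to "frozen dessert" each accumulate 8.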
2. Build Core Tree (cont.)
Merge hypernym paths to build a tree
Before merging:
sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet
After merging:
entity
  substance, matter
    nutriment
      dessert
        frozen dessert
          ice cream sundae
            sundae
          sherbet, sorbet
            sherbet
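Merging hypernym paths into a tree amounts to inserting each path into a prefix tree. The sketch below uses nested dictionaries; the names `merge_paths` and `render` are illustrative, not taken from the Castanet implementation.

```python
SUNDAE = ("entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae")
SHERBET = ("entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet")

def merge_paths(paths):
    """Merge root-to-leaf paths into one nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the existing child if the prefix is already in the tree.
            node = node.setdefault(label, {})
    return tree

def render(tree, indent=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = []
    for label, children in tree.items():
        lines.append("  " * indent + label)
        lines.extend(render(children, indent + 1))
    return lines
```

Calling `print("\n".join(render(merge_paths([SUNDAE, SHERBET]))))` prints the merged tree, with the two paths sharing every node down to "frozen dessert".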
3. Augment Core Tree
Attach to Core tree the terms with more than one sense
Favor the more common path over other alternatives
3. Augment Core Tree (cont.)
Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
Date (p2): abstraction > measure, quantity > fundamental quantity > time period > calendar day (18) > date
Choose p1, since edible fruit (78) has more items assigned than calendar day (18)
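A small sketch of this choice, assuming the comparison is made on the count already stored at each candidate's would-be parent node in the core tree. The slide only says that the path with more items assigned is chosen, so the exact node being compared is an assumption, and `choose_path` is a hypothetical name. Only the tails of the two paths are shown.

```python
# Tails of the two candidate hypernym paths for "date", with the counts
# shown on the slide for their last nodes.
P1 = ("food", "edible fruit")
P2 = ("time period", "calendar day")
COUNTS = {P1: 78, P2: 18}

def choose_path(candidate_paths, counts):
    """Pick the candidate whose attachment point has the most items assigned."""
    # Paths missing from the core tree count as having no items.
    return max(candidate_paths, key=lambda path: counts.get(path, 0))
```

Here `choose_path([P1, P2], COUNTS)` returns `P1`, so "date" is attached under "edible fruit".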
Optional Step: Domains
To disambiguate, use domains
- WordNet has 212 domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000); it assigns a domain to every noun synset
Automatically scan the collection to see which domains apply
The user selects which of the suggested domains to use, or may add their own
Paths for terms that match the selected domains are added to the core tree
Using Domains
dip glosses:
Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnetic needle makes with the horizon
Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
Sense 1: solid => concave shape => depression
Sense 2: shape, form => space => angle
Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
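Domain-based disambiguation can be sketched as a filter over a term's senses. The domain labels for senses 1 and 2 below are made up for illustration (the slide only implies that sense 3 belongs to "food"), and `senses_in_domains` is a hypothetical helper.

```python
# Senses of "dip" from the slide; the "domain" values for senses 1 and 2
# are invented placeholders, not real WordNet domain assignments.
DIP_SENSES = [
    {"sense": 1, "gloss": "a depression in an otherwise level surface", "domain": "geography"},
    {"sense": 2, "gloss": "the angle that a magnetic needle makes with the horizon", "domain": "physics"},
    {"sense": 3, "gloss": "tasty mixture into which bite-size foods are dipped", "domain": "food"},
]

def senses_in_domains(senses, selected_domains):
    """Return the sense numbers whose domain is among the selected domains."""
    return [s["sense"] for s in senses if s["domain"] in selected_domains]
```

With the user-selected domain set `{"food"}`, only sense 3 survives, so its hypernym path is the one added to the core tree.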
4. Compress Tree
Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist
Before: dessert > frozen dessert > { ice cream sundae > sundae; sherbet, sorbet > sherbet; parfait }
After: dessert > frozen dessert > { sundae; parfait; sherbet }
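Rule 1 can be sketched as a bottom-up pass that splices out weak internal nodes and promotes their children. Several details are assumptions: the traversal order, the reading of maxdist as a single value passed in by the caller, and the document counts in the example tree. `Node`, `compress`, and `example_tree` are illustrative names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    dist: int = 0                      # documents assigned at or below this node
    children: list = field(default_factory=list)

def compress(node, k, max_dist):
    """Apply Rule 1 below `node`, which is treated as the root and always kept."""
    kept = []
    for child in node.children:
        compress(child, k, max_dist)   # compress the subtree first (bottom-up)
        is_parent = bool(child.children)
        weak = len(child.children) < k and child.dist <= 0.1 * max_dist
        if is_parent and weak:
            kept.extend(child.children)  # splice: promote the grandchildren
        else:
            kept.append(child)
    node.children = kept
    return node

def example_tree():
    """The dessert subtree from the slide, with made-up document counts."""
    return Node("dessert", 10, [
        Node("frozen dessert", 10, [
            Node("ice cream sundae", 5, [Node("sundae", 5)]),
            Node("sherbet, sorbet", 3, [Node("sherbet", 3)]),
            Node("parfait", 2),
        ]),
    ])
```

With `k=2` and `max_dist=100`, "ice cream sundae" and "sherbet, sorbet" each have a single child and a small distribution, so they are removed and "frozen dessert" ends up with sundae, sherbet, and parfait as direct children.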
4. Compress Tree (cont.)
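Rule 2: Eliminate a child whose name appears within the parent’s name
Before: dessert > frozen dessert > { sundae; parfait; sherbet }
After: dessert > { sundae; parfait; sherbet }
(“frozen dessert” repeats the name of its parent, “dessert”, so it is removed and its children move up)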
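One possible reading of Rule 2, sketched on a nested-dict tree: a child is dropped when its label shares a word with its parent's label, as "frozen dessert" does with "dessert" in the slide's example. The exact matching criterion is an assumption, and `words` and `drop_redundant_children` are hypothetical helpers.

```python
import re

def words(label):
    """Lowercased words of a node label, ignoring commas and other punctuation."""
    return set(re.findall(r"[a-z]+", label.lower()))

def drop_redundant_children(label, children):
    """Return the children of `label` with name-sharing children spliced out.

    `children` maps each child label to its own dict of children.
    """
    result = {}
    for child_label, grandchildren in children.items():
        cleaned = drop_redundant_children(child_label, grandchildren)
        if words(child_label) & words(label):
            result.update(cleaned)       # remove the child, keep its children
        else:
            result[child_label] = cleaned
    return result

TREE = {"dessert": {"frozen dessert": {"sundae": {}, "parfait": {}, "sherbet": {}}}}
COMPRESSED = {root: drop_redundant_children(root, kids) for root, kids in TREE.items()}
```

For the slide's example, `COMPRESSED` has sundae, parfait, and sherbet directly under "dessert". A shared-word test is deliberately loose; a stricter criterion may be preferable for labels that share only incidental words.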
5. Divide into Facets (Remove top levels)
Before: entity > substance, matter > food, nutriment > food stuff, food product > ingredient, fixings > flavorer > { sweetening > { sugar; syrup }; herb > { parsley; oregano } }
After: flavorer > { sweetening > { sugar; syrup }; herb > { parsley; oregano } }
Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no path is longer than threshold t, then done. Otherwise:
Rule 2: Undo the first step. Then eliminate top levels until the maximum length of any path in the resulting hierarchy is t.
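The two rules can be sketched over a list of root-to-leaf paths. Assumptions: path length is counted in nodes, Rule 2 removes the same number of top levels from every path (never removing a leaf), the set of "very general" labels is a hypothetical stand-in, and each resulting root starts one facet.

```python
GENERAL = {"entity", "abstraction"}  # hypothetical list of very general labels

def strip_general(path):
    """Rule 1: drop very general labels from the top of a path."""
    i = 0
    while i < len(path) - 1 and path[i] in GENERAL:
        i += 1
    return path[i:]

def divide_into_facets(paths, t):
    """Return {facet root: [trimmed paths]} for non-empty `paths`."""
    trimmed = [strip_general(p) for p in paths]
    if max(map(len, trimmed)) > t:
        # Rule 2: start again from the original paths and cut top levels
        # until the longest remaining path has exactly t nodes.
        cut = max(map(len, paths)) - t
        trimmed = [p[min(cut, len(p) - 1):] for p in paths]
    facets = {}
    for p in trimmed:
        facets.setdefault(p[0], []).append(p)
    return facets

TOP = ("entity", "substance, matter", "food, nutriment",
       "food stuff, food product", "ingredient, fixings", "flavorer")
PATHS = [TOP + ("sweetening", "sugar"), TOP + ("sweetening", "syrup"),
         TOP + ("herb", "parsley"), TOP + ("herb", "oregano")]
```

With `t=3`, `divide_into_facets(PATHS, 3)` yields a single facet rooted at "flavorer", with sweetening and herb as its subcategories.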
Example: Recipes (3500 docs)
Castanet Output (shown in Flamenco)
Castanet Evaluation
Castanet is a tool for information architects, so information architects performed the evaluation
We compared output on two collections:
- Recipes
- Biomedical journal titles
We compared to two state-of-the-art algorithms:
- LDA (Blei et al. ’04)
- Subsumption (Sanderson & Croft ’99)
Subsumption Output
LDA Output
Evaluation Method
Information architects assessed the category systems
For each of 2 systems’ output:
- Examined and commented on top-level
- Examined and commented on two sub-levels
Then commented on overall properties:
- Meaningful?
- Systematic?
- Likely to use in your work?
Evaluation (cont.)
Sample questions for top-level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?
Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?
General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use a list of the most frequent terms?
Evaluation Results
Results on the recipes collection for “Would you use this system in your work?”
Number answering “yes in some cases” or “yes, definitely”:
- Castanet: 29/34
- LDA: 0/18
- Subsumption: 6/16
- Baseline: 25/34
Average response to questions about quality (4 = “strongly agree”)
Evaluation Results
Average responses for top-level categories (4 = no changes, 1 = change many)
Average responses for 2 subcategories
Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve the algorithm for assigning documents to categories
Opportunities for Tagging
New opportunity: tagging, folksonomies (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations
Conclusions
Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
The method has been tested on various domains: medicine, recipes, math, news, arts, bibliographical records
Usability study shows:
- Castanet is preferred to other state-of-the-art solutions
- Information architects want to use the tool in their work
Learn More
Funding: This work was supported in part by NSF (IIS-9984741)
For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
See http://flamenco.berkeley.edu