Instance-based mapping between thesauri and folksonomies
Christian Wartena
Rogier Brussee
Telematica Instituut
Outline
• Interoperability of Keywords
• Wikipedia and del.icio.us
• Keyword similarity
• Experiment
• Conclusion
Interoperability of Keywords
• Documents (pictures, movies, …) are annotated with keywords for organization and retrieval.
• In different collections/communities different sets of keywords are used.
– The set of selectable keywords is often organized in and delimited by a thesaurus.
– The set of freely generated end-user keywords (“tags”) forms a folksonomy.
• Align keywords/tags by comparing usage.
• Tested on del.icio.us tags and Wikipedia categories.
del.icio.us and Wikipedia
• Del.icio.us
– Social bookmarking site
– Bookmarks in most cases can be interpreted as labels or tags for the bookmarked URL.
– Many Wikipedia articles are tagged by del.icio.us users
• Wikipedia
– Articles are labeled with one or more categories by the article authors.
– Categories are organized hierarchically.
– Categories are organized consciously, as in a thesaurus.
  • New categories are introduced after discussions between active Wikipedians.
Keyword alignment
• Problem
– Given a keyword k in a system A, what is the most similar keyword k’ in system B?
  • Given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)?
• Approach
– Interpret similarity as similarity of usage.
– Compute similarity of usage on a common subcollection.
• Evaluation
– Compare results to human judgment of similarity.
Keyword similarity
• Basic assumption: similarity is similarity of usage.
– If two keywords have similar usage they will give similar results in retrieval tasks.
• Two keywords have similar usage if they
– Have a similar distribution over documents
  • Divergence (relative entropy) of distributions
  • Cosine
– Often co-occur
  • Jaccard coefficient:
$$\mathrm{Jaccard}(t, t') = \frac{\#\,\text{docs tagged with } t \text{ and } t'}{\#\,\text{docs tagged with } t \text{ or } t'}$$
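The Jaccard coefficient can be sketched in a few lines of Python. This is only an illustration: the function name `jaccard` and the toy document ids are assumptions, not data from the experiments.

```python
def jaccard(docs_t, docs_u):
    """Jaccard coefficient of two keywords, each given as the set of
    ids of the documents it is assigned to."""
    union = docs_t | docs_u
    if not union:
        # Neither keyword is used anywhere: define the similarity as 0.
        return 0.0
    return len(docs_t & docs_u) / len(union)


# Hypothetical toy data: ids of the documents tagged with each keyword.
tagged = {
    "mission": {1, 2, 3, 4},
    "peacekeeping": {2, 3, 5},
    "priest": {6, 7},
}

print(jaccard(tagged["mission"], tagged["peacekeeping"]))  # 2 shared / 5 in total = 0.4
print(jaccard(tagged["mission"], tagged["priest"]))        # no shared documents = 0.0
```

Note that this measure only looks at whether a keyword is present on a document, not at how often it was assigned.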
New measure for keyword similarity
• Keywords have similar usage if they co-occur with similar frequency with all other keywords.
– We use the frequency with which a tag/keyword is assigned to a document.
– We include co-occurrence information with other terms.
  • Helps to cope with sparse data
• In other words:
– Terms are similar if they have similar co-occurrence patterns
• Similar to Tag Context Similarity of Cattuto et al.’s presentation tomorrow (Semantic Social Networks Session)
                  Mission  Peacekeeping  UN  Security Council  Priest  Missionary
Mission                10             4   8                 3       2           1
Peacekeeping            4             7   4                 5       0           0
UN                      8             4  14                 8       1           0
Security Council        3             5   8                 8       1           1
Priest                  2             0   1                 1       6           4
Missionary              1             0   0                 1       4           8
[Figure: chart with the six terms (Mission, Peacekeeping, UN, Security Council, Priest, Missionary) on the horizontal axis and a vertical axis from 0 to 12; series: Mission, Security Council]
Formalization: Distribution of co-occurring terms
• $$p_z(t) = \sum_d q(t|d)\, Q(d|z)$$
• where
– q(t|d) is the keyword distribution of d
– Q(d|z) is the document distribution of z
  • “The fraction of z’s that is found in d”
• Weighted average of the keyword distributions of documents
– The weight is the relevance of d for z given by the probability Q(d|z)
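A minimal sketch of this weighted average, assuming that q(t|d) and Q(d|z) are both estimated from raw assignment counts n(d,t), i.e. q(t|d) = n(d,t)/n(d) and Q(d|z) = n(d,z)/n(z). The function name, the data layout and the toy counts are hypothetical.

```python
from collections import defaultdict


def cooccurrence_distribution(counts, z):
    """Return p_z as a dict term -> probability.

    counts[d][t] is the number of times keyword t was assigned to
    document d (assumed layout, for illustration only).
    """
    n_z = sum(doc.get(z, 0) for doc in counts.values())
    if n_z == 0:
        raise ValueError(f"keyword {z!r} is not assigned to any document")

    p = defaultdict(float)
    for doc in counts.values():
        n_dz = doc.get(z, 0)
        if n_dz == 0:
            continue                    # Q(d|z) = 0, document contributes nothing
        weight = n_dz / n_z             # Q(d|z): fraction of z's found in d
        n_d = sum(doc.values())
        for t, n_dt in doc.items():
            p[t] += (n_dt / n_d) * weight   # q(t|d) * Q(d|z)
    return dict(p)


# Hypothetical toy collection with assignment frequencies.
counts = {
    "d1": {"un": 3, "peacekeeping": 1},
    "d2": {"un": 1, "mission": 1},
    "d3": {"priest": 2, "mission": 2},
}
p_un = cooccurrence_distribution(counts, "un")
print(p_un)
```

Because the weights Q(d|z) sum to 1 over documents and each q(·|d) sums to 1 over terms, the result is again a probability distribution over terms.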
Distance of keywords
• For each keyword there is a distribution over all (other) keywords.
• Similarity is expressed by divergence of these distributions
• Kullback-Leibler divergence:
$$D(p \,\|\, q) = \sum_t p(t) \log \frac{p(t)}{q(t)}$$
• Bits per keyword saved by compressing a subcollection with keyword distribution p using p instead of a general distribution q.
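As a rough illustration, the divergence can be computed over sparse distributions stored as dictionaries. The base-2 logarithm is chosen here to match the interpretation in bits; the function name and the toy distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in bits.

    p and q are dicts term -> probability. The result is infinite when
    p puts mass on a term to which q assigns probability zero.
    """
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue                    # 0 * log(0 / q) is taken as 0
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return math.inf
        total += p_t * math.log2(p_t / q_t)
    return total


# Hypothetical toy distributions.
p = {"un": 0.5, "mission": 0.5}
q = {"un": 0.25, "mission": 0.25, "priest": 0.5}

print(kl_divergence(p, q))  # 1.0
print(kl_divergence(q, p))  # inf, because p gives "priest" probability 0
```

The two calls show that the divergence is not symmetric and can be unbounded for distributions with different supports.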
Distance of keywords (cont’d)
• Jensen-Shannon divergence:
$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)$$
– Mean distribution:
$$m = \tfrac{1}{2}(p + q)$$
• Jensen-Shannon divergence is symmetric.
• Jensen-Shannon divergence is the square of a non-negative distance satisfying the triangle inequality.
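A small self-contained sketch of the Jensen-Shannon divergence, again over dictionaries and with base-2 logarithms (an assumption; the slides do not fix the base). The name `jsd` and the toy distributions are hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence of two distributions given as
    dicts term -> probability, in bits."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        # D(r||m); m[t] > 0 whenever r[t] > 0, so this is always finite.
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


# Hypothetical toy distributions with disjoint supports.
p = {"un": 0.5, "mission": 0.5}
q = {"priest": 1.0}

print(jsd(p, q))  # 1.0, the maximum value in bits
print(jsd(p, p))  # 0.0
```

Unlike the Kullback-Leibler divergence, the value stays finite even when the two distributions share no terms, which is convenient for sparse tag data.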
Alignment
• Consider a collection of documents annotated with different sets of keywords.
• Represent a keyword by a distribution over terms from both collections.
• For each term find the closest term from the other collection.
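The alignment step can be sketched as a nearest-neighbour search under the Jensen-Shannon divergence. The sketch assumes each term is already represented by a distribution over a shared vocabulary; the function `align`, the brute-force search and the toy distributions are illustrative assumptions, not the setup of the experiments.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (in bits) of two dict-based distributions."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


def align(dists_a, dists_b):
    """Map each term of collection A to (closest term of B, divergence)."""
    mapping = {}
    for a, p in dists_a.items():
        best = min(dists_b, key=lambda b: jsd(p, dists_b[b]))
        mapping[a] = (best, jsd(p, dists_b[best]))
    return mapping


# Hypothetical co-occurrence distributions over a shared vocabulary.
tags = {
    "un": {"un": 0.6, "peace": 0.3, "church": 0.1},
    "priest": {"church": 0.7, "mission": 0.3},
}
categories = {
    "United Nations": {"un": 0.5, "peace": 0.4, "church": 0.1},
    "Clergy": {"church": 0.8, "mission": 0.2},
}

for term, (closest, distance) in align(tags, categories).items():
    print(f"{term} -> {closest} (JSD = {distance:.3f})")
```

Calling `align(categories, tags)` gives the mapping in the other direction; the two directions need not be inverses of each other.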
Experiment I
• Mapping between Teleblik keywords and User Tags
• Educational videos.
• Professional keywords from public broadcasting archive.
• Keywords assigned in an experiment by high school students.
• Data
– 100 videos
– 12,414 tags
– 4,348 different tags
– 269 different keywords
Experiment II
• Mapping between del.icio.us tags and Wikipedia categories
• Del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.)
• Data
– 58,345 Wikipedia articles
– 500,618 tags and category annotations
– 42,425 different Wikipedia categories
– 49,603 different tags
• Mappings computed for tags occurring on at least 10 docs.
– Mappings for 2,355 tags
– Mappings for 1,827 categories
– Using co-occurrence data with all 49,603 tags/categories
Evaluation of mapping
• Manual evaluation
• Classification of a sample of mappings into:
– b: Broader term
– n: Narrower
– r: Related term
– u: Unrelated
– x: Source term is not a keyword (e.g. “to read”)
– q: Meaning unknown
Evaluation of aligning Wikipedia and del.icio.us
• Pairs with a small distance are evaluated better than pairs with large distance.
• Evaluation of mappings with smallest and largest distance
– a) Categories to tags
– b) Tags to categories
Distance vs. mapping quality
Effect of keyword frequency
• No correlation between keyword frequency and divergence with best mapping found.
Comparison with Jaccard coefficient
• Evaluation of mapping using two different distance measures.
• The classes broader, narrower and related are merged.
• Results for
– a) Categories to tags
– b) Tags to categories
Discussion of results
• Method works very well in test
– Good mapping results
– Distance is a good indication of quality
– Insensitive to frequency (up to a certain degree)
• Better than Jaccard, because it uses:
– co-occurrence with other tags (‘tag context’)
– frequency with which a tag is assigned to a document.
  • Frequency information is typical for user-generated tags.
  • We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies).
• Distance measure also works well for clustering tags.
Future work
• Evaluating relatedness using external sources (e.g. WordNet)
• Compare to other distance measures
• We used documents annotated completely according to two annotation schemes.
– How large does the overlap have to be to obtain decent results?
– We can create partial overlap of disjoint document sets by a partial identification of the keywords.
• Detect asymmetry in relations (broader vs. narrower term)
Conclusion
• Using co-occurrence patterns is a fruitful approach.
• Frequent terms from folksonomies behave similarly to carefully assigned keywords.
– Because the usage-based similarity measure yields good mappings.
– Folksonomy seems to work!