Instance-based mapping between thesauri and folksonomies

Christian Wartena

Rogier Brussee

Telematica Instituut

Outline

• Interoperability of Keywords

• Wikipedia and del.icio.us

• Keyword similarity

• Experiment

• Conclusion

Interoperability of Keywords

• Documents (pictures, movies, …) are annotated with keywords for organization and retrieval.

• In different collections/communities different sets of keywords are used.

– The set of selectable keywords is often organized in and delimited by a thesaurus.

– The set of freely generated end-user keywords (“tags”) forms a folksonomy.

• Align keywords/tags by comparing usage.

• Tested on del.icio.us tags and Wikipedia categories.

del.icio.us and Wikipedia

• Del.icio.us

– Social bookmarking site

– Bookmarks in most cases can be interpreted as labels or tags for the bookmarked URL.

– Many Wikipedia articles are tagged by del.icio.us users

• Wikipedia

– Articles are labeled with one or more categories by the article authors.

– Categories are organized hierarchically.

– Categories are organized consciously, as in a thesaurus: new categories are introduced after discussions between active Wikipedians.

Keyword alignment

• Problem

– Given a keyword k in a system A, what is the most similar keyword k’ in system B? For example: given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)?

• Approach

– Interpret similarity as similarity of usage.

– Compute similarity of usage on a common sub-collection.

• Evaluation

– Compare results to human judgment of similarity.

Keyword similarity

• Basic assumption: similarity is similarity of usage.

– If two keywords have similar usage they will give similar results in retrieval tasks.

• Two keywords have similar usage if they

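– Have a similar distribution over documents, measured by the divergence (relative entropy) of the distributions or by their cosine.

– Often co-occur, measured by the Jaccard coefficient:

$$\mathrm{Jaccard}(t, t') = \frac{\#\,\text{docs tagged with } t \text{ and } t'}{\#\,\text{docs tagged with } t \text{ or } t'}$$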
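
The Jaccard coefficient above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy data, the function name `jaccard`, and the convention of returning 0 for two unused keywords are all assumptions made here.

```python
def jaccard(docs_t, docs_u):
    """Jaccard coefficient of two collections of document ids."""
    docs_t, docs_u = set(docs_t), set(docs_u)
    union = docs_t | docs_u
    if not union:
        return 0.0  # assumed convention: two unused keywords are not similar
    return len(docs_t & docs_u) / len(union)


# Hypothetical toy data: keyword -> ids of the documents it annotates.
tagged = {
    "un": {1, 2, 3, 4},
    "peacekeeping": {3, 4, 5},
    "priest": {6, 7},
}

print(jaccard(tagged["un"], tagged["peacekeeping"]))  # 2 shared / 5 in total = 0.4
print(jaccard(tagged["un"], tagged["priest"]))        # no shared documents: 0.0
```

Note that this measure only looks at whether two keywords share documents; it ignores how often a keyword was assigned and which other keywords appear alongside it.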

New measure for keyword similarity

• Keywords have similar usage if they co-occur with similar frequency with all other keywords.

– We use the frequency with which a tag/keyword is assigned to a document.

– We include co-occurrence information with other terms, which helps to cope with sparse data.

• In other words:

– Terms are similar if they have similar co-occurrence patterns.

• Similar to Tag Context Similarity of Cattuto et al.’s presentation tomorrow (Semantic Social Networks Session)

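| | Mission | Peacekeeping | UN | Security Council | Priest | Missionary |
|---|---|---|---|---|---|---|
| Mission | 10 | 4 | 8 | 3 | 2 | 1 |
| Peacekeeping | 4 | 7 | 4 | 5 | 0 | 0 |
| UN | 8 | 4 | 14 | 8 | 1 | 0 |
| Security Council | 3 | 5 | 8 | 8 | 1 | 1 |
| Priest | 2 | 0 | 1 | 1 | 6 | 4 |
| Missionary | 1 | 0 | 0 | 1 | 4 | 8 |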

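[Figure: bar chart comparing the co-occurrence counts (vertical axis, 0 to 12) of “Mission” and “Security Council” with the terms Mission, Peacekeeping, UN, Security Council, Priest and Missionary.]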
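
To make "similar co-occurrence patterns" concrete, the sketch below turns each row of the example co-occurrence table into a probability distribution and compares rows with the Jensen-Shannon divergence that the later slides define. Normalizing raw rows is a simplification assumed here; the formalization in the talk weights documents individually. The helper names `normalize` and `jsd` are hypothetical.

```python
import math

TERMS = ["Mission", "Peacekeeping", "UN", "Security Council", "Priest", "Missionary"]

# Co-occurrence counts copied from the example table (rows follow TERMS order).
COUNTS = {
    "Mission":          [10, 4, 8, 3, 2, 1],
    "Peacekeeping":     [4, 7, 4, 5, 0, 0],
    "UN":               [8, 4, 14, 8, 1, 0],
    "Security Council": [3, 5, 8, 8, 1, 1],
    "Priest":           [2, 0, 1, 1, 6, 4],
    "Missionary":       [1, 0, 0, 1, 4, 8],
}


def normalize(row):
    """Turn a row of counts into a probability distribution."""
    total = sum(row)
    return [count / total for count in row]


def jsd(p, q):
    """Jensen-Shannon divergence (in bits) of two aligned probability lists."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl_to_mean(r):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(r, m) if a > 0.0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


dist = {term: normalize(row) for term, row in COUNTS.items()}

# Compare "Security Council" with a politically related and an unrelated term.
for other in ("UN", "Priest"):
    print(other, round(jsd(dist["Security Council"], dist[other]), 3))
```

On this table the divergence between "Security Council" and "UN" comes out smaller than the one between "Security Council" and "Priest", which matches the intuition behind the example.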

Formalization: Distribution of co-occurring terms

$$p_z(t) = \sum_d q(t|d)\, Q(d|z)$$

• where

– q(t|d) is the keyword distribution of d

– Q(d|z) is the document distribution of z (“the fraction of z’s that is found in d”)

• Weighted average of the keyword distributions of documents

– The weight is the relevance of d for z given by the probability Q(d|z)
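
A minimal sketch of this weighted average, assuming that q(t|d) and Q(d|z) are estimated as relative assignment frequencies (q(t|d) = n(d,t)/n(d) and Q(d|z) = n(d,z)/n(z)). The slides do not spell out the estimators, so these, the function name, and the toy counts are assumptions.

```python
from collections import defaultdict


def cooccurrence_distribution(counts, z):
    """Return p_z as a dict term -> probability.

    counts[d][t] is how often keyword t was assigned to document d.
    """
    n_z = sum(doc.get(z, 0) for doc in counts.values())
    if n_z == 0:
        raise ValueError(f"keyword {z!r} does not occur in the collection")

    p_z = defaultdict(float)
    for doc in counts.values():
        n_dz = doc.get(z, 0)
        if n_dz == 0:
            continue  # documents without z have weight Q(d|z) = 0
        weight = n_dz / n_z            # Q(d|z)
        n_d = sum(doc.values())
        for t, n_dt in doc.items():
            p_z[t] += (n_dt / n_d) * weight   # q(t|d) * Q(d|z)
    return dict(p_z)


# Hypothetical toy collection: document -> keyword -> assignment count.
counts = {
    "d1": {"un": 2, "peacekeeping": 1, "mission": 1},
    "d2": {"un": 1, "mission": 1},
    "d3": {"priest": 3, "mission": 1},
}

p_un = cooccurrence_distribution(counts, "un")
print({t: round(p, 3) for t, p in p_un.items()})
```

For the toy data, "un" occurs three times: twice in d1 and once in d2. Working the sum by hand gives p_un(un) = 1/2, p_un(mission) = 1/3 and p_un(peacekeeping) = 1/6, while "priest" gets no mass because it never shares a document with "un".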

Distance of keywords

• For each keyword there is a distribution over all (other) keywords.

• Similarity is expressed by divergence of these distributions

• Kullback-Leibler divergence:

$$D(p \,\|\, q) = \sum_t p(t) \log \frac{p(t)}{q(t)}$$

• Bits per keyword saved by compressing a subcollection with keyword distribution p using p instead of a general distribution q.
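
The divergence can be computed directly from two distributions stored as dictionaries. The sketch below uses base-2 logarithms so that the result is in bits, in line with the compression reading above; the function name and the toy distributions are assumptions.

```python
import math


def kl_divergence(p, q):
    """D(p||q) in bits; p and q are dicts term -> probability."""
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        q_t = q.get(t, 0.0)
        if q_t == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_t * math.log2(p_t / q_t)
    return total


# Hypothetical distributions over three keywords.
p = {"un": 0.5, "mission": 0.5}
q = {"un": 0.25, "mission": 0.25, "priest": 0.5}

print(kl_divergence(p, q))  # 1.0
print(kl_divergence(q, p))  # inf
```

The two calls show that the Kullback-Leibler divergence is not symmetric and can be infinite when the supports differ, which makes it awkward as a distance between sparse keyword distributions.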

Distance of keywords (cont’d)

• Jensen-Shannon divergence:

$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)$$

– Mean distribution:

$$m = \tfrac{1}{2}(p + q)$$

• Jensen-Shannon divergence is symmetric.

• Jensen-Shannon divergence is the square of a non-negative distance satisfying the triangle inequality.
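
A short sketch of the Jensen-Shannon divergence for distributions stored as dictionaries, again with base-2 logarithms (an assumption; with that choice the value lies between 0 and 1). The function name is hypothetical.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD(p||q) in bits; p and q are dicts term -> probability."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        # m[t] >= r[t] / 2 > 0 wherever r[t] > 0, so the ratio is always defined.
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0.0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


same = {"un": 0.5, "mission": 0.5}
print(jensen_shannon_divergence(same, same))                    # 0.0
print(jensen_shannon_divergence({"un": 1.0}, {"priest": 1.0}))  # 1.0

# The square root is the distance that satisfies the triangle inequality.
distance = math.sqrt(jensen_shannon_divergence({"un": 1.0}, {"priest": 1.0}))
```

Unlike the Kullback-Leibler divergence, this stays finite even when the two distributions have disjoint support, because each one is only ever compared with the mean.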

Alignment

• Consider a collection of documents annotated with different sets of keywords.

• Represent a keyword by a distribution over terms from both collections.

• For each term find the closest term from the other collection.
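
The three steps can be sketched as a nearest-neighbour search under the Jensen-Shannon divergence. Everything concrete here is an assumption for illustration: the function names, the toy distributions, and the tag and category names are invented and are not data from the experiments.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (in bits) of two dicts term -> probability."""
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl_to_mean(r):
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0.0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


def align(source, target):
    """Map each source keyword to (closest target keyword, divergence)."""
    mapping = {}
    for s, p_s in source.items():
        best = min(target, key=lambda t: jsd(p_s, target[t]))
        mapping[s] = (best, jsd(p_s, target[best]))
    return mapping


# Hypothetical co-occurrence distributions over terms from BOTH vocabularies.
tags = {
    "un": {"un": 0.5, "United Nations": 0.4, "Clergy": 0.1},
    "priest": {"priest": 0.6, "Clergy": 0.4},
}
categories = {
    "United Nations": {"un": 0.45, "United Nations": 0.5, "Clergy": 0.05},
    "Clergy": {"priest": 0.5, "Clergy": 0.5},
}

for tag, (category, divergence) in align(tags, categories).items():
    print(f"{tag} -> {category} (JSD = {divergence:.3f})")
```

Calling `align(categories, tags)` gives the mapping in the other direction. The divergence returned with each pair can serve as a confidence score, since a smaller value indicates more similar usage.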

Experiment I

• Mapping between Teleblik keywords and User Tags

• Educational videos.

• Professional keywords from public broadcasting archive.

• Keywords assigned in an experiment by high school students.

• Data

– 100 videos

– 12,414 tags

– 4,348 different tags

– 269 different keywords

Experiment II

• Mapping between del.icio.us tags and Wikipedia categories

• Del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.)

• Data

– 58,345 Wikipedia articles

– 500,618 tags and category annotations

– 42,425 different Wikipedia categories

– 49,603 different tags

• Mappings computed for tags occurring on at least 10 documents.

– Mappings for 2355 tags

– Mappings for 1827 categories

– Using co-occurrence data with all 49,603 tags/categories
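
The frequency threshold can be applied with a simple document-frequency count. This is a sketch under assumptions: the data layout (document id mapped to a list of assigned keywords) and the function name are hypothetical, and the example uses a threshold of 2 only to keep the toy data small.

```python
from collections import Counter


def frequent_keywords(annotations, min_docs=10):
    """Return the keywords that occur on at least `min_docs` documents."""
    doc_freq = Counter()
    for keywords in annotations.values():
        # Count each keyword once per document, however often it was assigned.
        doc_freq.update(set(keywords))
    return {k for k, n in doc_freq.items() if n >= min_docs}


# Hypothetical toy annotations: document id -> assigned keywords.
annotations = {
    "a1": ["un", "politics"],
    "a2": ["un"],
    "a3": ["priest", "un", "un"],
}

print(frequent_keywords(annotations, min_docs=2))  # {'un'}
```

Only the source terms are filtered this way; the co-occurrence distributions themselves can still be built over the full vocabulary.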

Evaluation of mapping

• Manual evaluation

• Classification of a sample of mappings into:

– b: Broader term

– n: Narrower term

– r: Related term

– u: Unrelated

– x: Source term is not a keyword (e.g. “to read”)

– q: Meaning unknown

Evaluation of aligning Wikipedia and del.icio.us

• Pairs with a small distance are rated better than pairs with a large distance.

• Evaluation of mappings with smallest and largest distance

– a) Categories to tags

– b) Tags to categories

Distance vs. mapping quality

Effect of keyword frequency

• No correlation was found between keyword frequency and the divergence to the best mapping.

Comparison with Jaccard-coefficient

• Evaluation of mapping using two different distance measures.

• Categories broader, narrower and related are merged.

• Results for

– a) Categories to tags

– b) Tags to categories

Discussion of results

• Method works very well in test

– Good mapping results

– Distance is a good indication of quality

– Insensitive to frequency (up to a certain degree)

• Better than Jaccard, because it uses:

– co-occurrence with other tags (‘tag context’)

– the frequency with which a tag is assigned to a document.

• Frequency information is typical for user-generated tags.

• We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies).

• Distance measure also works well for clustering tags.

Future work

• Evaluating relatedness using external sources (e.g. WordNet)

• Compare to other distance measures

• We used documents annotated completely according to two annotation schemes.

– How large does the overlap have to be to obtain decent results?

– We can create partial overlap of disjoint document sets by a partial identification of the keywords.

• Detect asymmetry in relations (broader vs. narrower term)

Conclusion

• Using co-occurrence patterns is a fruitful approach.

• Frequent terms from folksonomies behave similarly to carefully assigned keywords.

– Because the usage-based similarity measure yields good mappings.

– Folksonomy seems to work!
