Gergely Palla - Extracting tag hierarchies
DESCRIPTION
Talk given by Gergely Palla at the first annual meeting of the KnowEscape COST Action
TRANSCRIPT
Introduction
Benchmarks and testing
Results
Extracting tag hierarchies
Gergely Tibély, Péter Pollner, Tamás Vicsek and Gergely Palla
Statistical and Biological Physics Research Group, HAS (Eötvös University), Hungary
KnowEscape2013 Conference
G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies
Introduction: Tags and tagging, The goal, Tag hierarchy extraction methods
Tags and tagging: blogs, news portals
Tags and tagging: video and photo sharing
Tagging and folksonomies
In some cases the emerging set of free tags is called a
FOLKSONOMY:
- collaborative nature,
- tags are equal,
- no hierarchy.
→ The opposite of an ONTOLOGY.
Tagging systems: How can we search?
Tagging systems: Searching items
Tagging systems: Searching
Tagging systems: Searching tags
The goal
Extracting a tag hierarchy!
Motivation:
- Help searching: if the tags are organised into a hierarchy, broadening or narrowing the scope is simple.
- Give recommendations about yet unvisited objects.
Tag hierarchy extracting algorithms
- P. Heymann and H. Garcia-Molina, "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems", Technical Report, Stanford InfoLab (2006).
- P. Schmitz, "Inducing Ontology from Flickr Tags", Proceedings of the 15th International Conference on World Wide Web (WWW) (2006).
- C. Van Damme, M. Hepp and K. Siorpaes, "FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies", Social Networks 2, 57–70 (2007).
- A. Plangprasopchok and K. Lerman, "Constructing Folksonomies from User-specified Relations on Flickr", Proceedings of the World Wide Web conference (2009).
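Our methods: Basic outline
[Figure: three stages shown on five tags, TAG 1 to TAG 5. DEFINE LINKS: links are drawn between the tags. THRESHOLD: only part of the links is kept. DETERMINE DIRECTION: the remaining links are directed, giving a hierarchy drawn with TAG 1 on top, TAG 2 and TAG 3 below it, and TAG 4 and TAG 5 at the bottom.]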
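The three stages of the outline can be illustrated with a toy sketch. This is not either of the authors' algorithms: it assumes raw co-occurrence counts as link weights, a fixed count threshold, and the rule that the more frequent tag of a linked pair becomes the parent. The function name `outline_hierarchy` and the parameter `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def outline_hierarchy(objects, min_count=2):
    """Return directed (parent, child) links from an iterable of tag sets."""
    freq = Counter()   # number of objects carrying each tag
    cooc = Counter()   # number of objects carrying both tags of a pair
    for tags in objects:
        tags = sorted(set(tags))
        freq.update(tags)
        cooc.update(combinations(tags, 2))      # stage 1: define links

    links = set()
    for (a, b), n_ab in cooc.items():
        if n_ab < min_count:                    # stage 2: threshold
            continue
        # Stage 3: direction. Assumed rule: the more frequent tag is the
        # parent; ties are broken alphabetically.
        parent, child = sorted((a, b), key=lambda t: (-freq[t], t))
        links.add((parent, child))
    return links
```

The result is a set of directed links rather than a tree, since a tag may be linked to several more frequent tags under this simplified rule.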
Benchmarks and testing: Benchmarks, Quality measures
How to test the method?
Benchmarks:
Gene Ontology
- tag hierarchy: hierarchy of protein functions,
- tagged objects: proteins annotated by their known functions.
Synthetic benchmark:
- tag hierarchy: user defined,
- tagged objects: simulated tagging (random walk on the hierarchy).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).
How to measure the quality?
[Figure: an EXACT hierarchy and its RECONSTRUCTED counterpart, drawn side by side on the same tags (0A; 1A to 1C; 2A to 2E; 3A to 3H).]
Evaluation:
- fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc.
- Normalised Mutual Information: sensitive also to the position of the non-matching links.
  L. Danon et al., "Comparing community structure identification", J. Stat. Mech. P09008 (2005).
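The link-based measures can be sketched as follows. The slides do not define "acceptable", so this sketch assumes that a reconstructed link is acceptable when its parent is any ancestor of the child in the exact hierarchy, and it normalises both fractions by the number of reconstructed links. The names `ancestors` and `link_scores` and the `{child: set of parents}` representation are hypothetical.

```python
def ancestors(parents, tag):
    """All ancestors of `tag` in a hierarchy given as {child: set of parents}."""
    seen, stack = set(), list(parents.get(tag, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def link_scores(exact, reconstructed):
    """Return (matching fraction, acceptable fraction) of reconstructed links."""
    links = [(p, c) for c, ps in reconstructed.items() for p in ps]
    if not links:
        return 0.0, 0.0
    # Matching: the same direct parent-child link exists in the exact hierarchy.
    matching = sum(p in exact.get(c, ()) for p, c in links)
    # Acceptable (assumed definition): the parent is an ancestor of the child
    # in the exact hierarchy, directly or through intermediate tags.
    acceptable = sum(p in ancestors(exact, c) for p, c in links)
    return matching / len(links), acceptable / len(links)
```

Under this assumed definition every matching link is also acceptable, so the acceptable fraction is never smaller than the matching fraction.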
Normalised Mutual Information: Mathematical formulation
The probability for picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:
$$p_e(i) = \frac{|D_e(i)|}{N-1}, \qquad p_r(i) = \frac{|D_r(i)|}{N-1}.$$
The probability for picking a tag at random from the intersection of the two sets of descendants:
$$p_{e,r}(i) = \frac{|D_e(i) \cap D_r(i)|}{N-1}.$$
Based on this, the Normalised Mutual Information between the exact and reconstructed hierarchies:
$$I_{e,r} = -\frac{2\sum_{i=1}^{N} p_{e,r}(i) \ln\left(\frac{p_{e,r}(i)}{p_e(i)\,p_r(i)}\right)}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\frac{2\sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln\left(\frac{|D_e(i) \cap D_r(i)|\,(N-1)}{|D_e(i)| \cdot |D_r(i)|}\right)}{\sum_{i=1}^{N} |D_e(i)| \ln\left(\frac{|D_e(i)|}{N-1}\right) + \sum_{i=1}^{N} |D_r(i)| \ln\left(\frac{|D_r(i)|}{N-1}\right)}.$$
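A minimal sketch of this measure, assuming both hierarchies are given as `{tag: list of children}` dictionaries over the same tag set and that terms with an empty descendant set contribute zero (the usual 0 ln 0 = 0 convention). The function names are hypothetical.

```python
from math import log


def descendants(children, tag):
    """Set of all descendants of `tag` in a hierarchy {tag: list of children}."""
    seen, stack = set(), list(children.get(tag, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, ()))
    return seen


def hierarchy_nmi(children_e, children_r, tags):
    """Normalised mutual information between an exact and a reconstructed hierarchy."""
    n = len(tags)
    numerator = 0.0     # sum of p_er * ln(p_er / (p_e * p_r))
    denominator = 0.0   # sum of p_e ln p_e plus sum of p_r ln p_r
    for i in tags:
        d_e = descendants(children_e, i)
        d_r = descendants(children_r, i)
        p_e, p_r = len(d_e) / (n - 1), len(d_r) / (n - 1)
        p_er = len(d_e & d_r) / (n - 1)
        if p_er > 0:
            numerator += p_er * log(p_er / (p_e * p_r))
        if p_e > 0:
            denominator += p_e * log(p_e)
        if p_r > 0:
            denominator += p_r * log(p_r)
    if denominator == 0:
        # Happens, for example, when both hierarchies are stars around one root.
        raise ValueError("NMI is undefined: both entropy sums vanish")
    return -2 * numerator / denominator
```

For two identical hierarchies the value is 1; it drops towards 0 as the descendant sets of the two hierarchies stop overlapping.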
Normalised Mutual Information: Behaviour
[Figure: an example hierarchy before and after RANDOMIZATION, and a plot of I versus f, both axes running from 0 to 1, with curves labelled bottom up, random and top down.]
Results: Hierarchy of protein functions, Flickr and IMDb, Synthetic data
Studied systems
Tagged proteins:
- 5,913,610 proteins from GO, annotated by
- 4,181 molecular functions.
Tagged photos:
- 1,519,030 photos from Flickr, tagged by
- 25,441 free English words.
Tagged films:
- 336,223 films from IMDb, tagged by
- 6,358 English keywords.
Synthetic benchmark:
- 2,000,000 virtual objects,
- 1,023 tags.
Results: Hierarchy of protein functions
[Figure: two hierarchies of protein functions shown for comparison, with tags labelled 0A, 1A, 2A to 2D, 3A to 3T, 4A to 4S and 5A to 5C.]
Measure                  Method              Score
Matching links           algorithm A         21%
                         algorithm B         20%
                         P. H. & H. G.-M.    19%
                         P. Schmitz          18%
Acceptable links         algorithm A         66%
                         algorithm B         52%
                         P. H. & H. G.-M.    51%
                         P. Schmitz          65%
Normalised Mut. Info.    algorithm A         35%
                         algorithm B         30%
                         P. H. & H. G.-M.    30%
                         P. Schmitz          30%
Linearised Mut. Info.    algorithm A         78%
                         algorithm B         75%
                         P. H. & H. G.-M.    75%
                         P. Schmitz          75%
Results: Tag hierarchy from Flickr data
Results: Tag hierarchy from IMDb data
Results: Synthetic benchmark, tagging by random walks
Input:
- the tag hierarchy,
- the distribution of the number of tags on the objects,
- the tag frequency distribution.
The tagging:
- the first tag at random,
- the rest:
  - with probability p: a short random walk on the DAG starting from the first tag,
  - with probability 1 − p: at random.
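The tagging of a single virtual object can be sketched as below. Several details are assumptions not stated on the slide: "at random" is taken to mean drawing from the tag frequency distribution, the walk moves along the links of the DAG in either direction, and it has a fixed length `walk_len`. The function name `tag_object` and all parameter names are hypothetical.

```python
import random


def tag_object(neighbours, tag_weights, n_tags, p, walk_len=2, rng=random):
    """Return a list of n_tags tags for one virtual object.

    neighbours  -- {tag: list of tags linked to it in the hierarchy}
    tag_weights -- {tag: relative frequency used for random draws}
    n_tags      -- number of tags, drawn beforehand from the input distribution
    p           -- probability of choosing a tag by random walk
    """
    tags = list(tag_weights)
    weights = [tag_weights[t] for t in tags]

    # The first tag is drawn at random.
    first = rng.choices(tags, weights=weights, k=1)[0]
    chosen = [first]

    for _ in range(n_tags - 1):
        if rng.random() < p:
            # Short random walk on the hierarchy, starting from the first tag.
            node = first
            for _ in range(walk_len):
                nbrs = neighbours.get(node, [])
                if not nbrs:
                    break
                node = rng.choice(nbrs)
            chosen.append(node)
        else:
            # Otherwise draw the tag at random, ignoring the hierarchy.
            chosen.append(rng.choices(tags, weights=weights, k=1)[0])
    return chosen
```

The returned list may contain repeated tags; converting it to a set gives the distinct tags on the object. A larger p produces co-occurrences that reflect the hierarchy more closely.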
Results: Synthetic benchmark
[Figure: four panels, a) to d).]
Measure                  Method              Score
Matching links           algorithm A         31%
                         algorithm B         89%
                         P. H. & H. G.-M.    48%
                         P. Schmitz          1%
Acceptable links         algorithm A         35%
                         algorithm B         91%
                         P. H. & H. G.-M.    54%
                         P. Schmitz          2%
Normalised Mut. Info.    algorithm A         18%
                         algorithm B         83%
                         P. H. & H. G.-M.    29%
                         P. Schmitz          1%
Linearised Mut. Info.    algorithm A         66%
                         algorithm B         97%
                         P. H. & H. G.-M.    76%
                         P. Schmitz          5%
Summary
Tags are important in knowledge organisation.
Tag-hierarchy extraction is an interesting problem with a great potential for practical applications.
We have set up a framework for tag-hierarchy extraction:
- Benchmark systems for testing the tag-hierarchy extraction algorithm can be found and can be created.
- The mutual information provides a quality measure sensitive also to the position of the links in the hierarchy.
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Extracting tag hierarchies", accepted in PLoS ONE.
Tagging by random walks, Mutual information
Further results from Flickr
[Figure: part of the tag hierarchy extracted from Flickr, covering tags related to winter, snow and ice (e.g. snow, frost, freezing, blizzard, ski, snowboard), polar regions and wildlife (e.g. arctic, antarctica, svalbard, greenland, walrus, adelie penguin, krill) and Christmas (e.g. christmas tree, santa, caroling, nativity).]
Algorithm A: Details
Pre-process:
- for a tag $i$ keep neighbours only if $n_{ij} \geq 0.4 \cdot \max(n_{ij})$.
Link-weight:
- random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
- link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.
Threshold:
- keep only the strongest link on every tag.
Direction:
- we assume that tag frequencies are higher close to the root,
  → for any given tag $i$, the strongest link is coming from its parent.
Exception:
- when the given tag is also the proposed parent of its strongest neighbour,
  → in this case we choose the 2nd strongest neighbour as parent.
Local root:
- when the given tag is also the proposed parent for all of its strong neighbours.
Global assembly:
- the local root with the largest "entropy" becomes the global root,
- the rest of the local roots are linked in the order of their entropy.
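The link-weight and threshold steps of Algorithm A can be sketched as follows. Two points are assumptions, since the slides do not spell them out: n is taken to be the total number of tagged objects, and the standard deviation σ_R comes from a hypergeometric null model in which tags are placed on objects independently. The pre-processing, exception, local-root and global-assembly steps are not covered. The names `z_scores` and `strongest_neighbour` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def z_scores(objects):
    """Return {(tag_i, tag_j): z-score} for every co-occurring tag pair."""
    objects = [sorted(set(tags)) for tags in objects]
    n = len(objects)                      # assumed meaning of n: number of objects
    if n < 2:
        return {}
    n_tag, n_pair = Counter(), Counter()
    for tags in objects:
        n_tag.update(tags)
        n_pair.update(combinations(tags, 2))

    z = {}
    for (i, j), n_ij in n_pair.items():
        n_i, n_j = n_tag[i], n_tag[j]
        expected = n_i * n_j / n          # random estimate <n_ij>_R
        # Assumed null model: hypergeometric variance of the co-occurrence count.
        variance = n_i * n_j * (n - n_i) * (n - n_j) / (n * n * (n - 1))
        if variance > 0:
            z[(i, j)] = (n_ij - expected) / sqrt(variance)
    return z


def strongest_neighbour(z):
    """For every tag, return the neighbour reached by its strongest link."""
    best = {}
    for (i, j), score in z.items():
        for tag, other in ((i, j), (j, i)):
            if tag not in best or score > best[tag][1]:
                best[tag] = (other, score)
    return {tag: other for tag, (other, _) in best.items()}
```

In this sketch a pair whose tags are each other's strongest neighbour corresponds to the "Exception" case above, which would still have to be resolved before directing the links.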
Algorithm B: Details
Link-weight:
- random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
- link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.
Threshold:
- keep links with $z_{ij} \geq 10$.
Centrality:
- calculate the eigenvector centrality based on the remaining weighted adjacency matrix.
Build the hierarchy:
- start from the lowest centrality tags,
- choose the parent from the neighbours with higher centrality than the given tag,
- in case there are more candidates, choose the most related one (according to the descendants already under the given tag).
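The centrality and hierarchy-building steps can be sketched as below, starting from the already thresholded, symmetric link weights. Two simplifications are assumptions of this sketch: the centrality is obtained by power iteration on the adjacency matrix plus the identity (which has the same eigenvectors and avoids oscillation on bipartite graphs), and ties between candidate parents are broken by the largest link weight instead of the slide's "most related" rule based on descendants. The function names are hypothetical.

```python
from math import sqrt


def eigenvector_centrality(weights, n_iter=1000, tol=1e-12):
    """Power iteration on (A + I) for weights = {tag: {neighbour: weight}}."""
    tags = list(weights)
    if not tags:
        return {}
    c = {t: 1.0 for t in tags}
    for _ in range(n_iter):
        new = {t: c[t] + sum(w * c[u] for u, w in weights[t].items()) for t in tags}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {t: v / norm for t, v in new.items()}
        converged = max(abs(new[t] - c[t]) for t in tags) < tol
        c = new
        if converged:
            break
    return c


def choose_parents(weights, centrality):
    """Assign each tag a parent among its higher-centrality neighbours."""
    parents = {}
    for t in sorted(weights, key=centrality.get):      # lowest centrality first
        candidates = [u for u in weights[t] if centrality[u] > centrality[t]]
        # Simplified tie-break (assumption): pick the strongest link.
        parents[t] = max(candidates, key=lambda u: weights[t][u]) if candidates else None
    return parents
```

Tags left with `None` as parent have no neighbour of higher centrality and act as roots in this sketch.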
Mutual information and entropy
For discrete variables $x_i$ and $y_j$ with a joint probability distribution given by $p(x_i, y_j)$, the mutual information is defined as
$$I(x, y) \equiv \sum_i \sum_j p(x_i, y_j) \ln\left(\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}\right).$$
The entropies of the variables are usually formulated as
$$H(x) = -\sum_i p(x_i) \ln p(x_i), \qquad H(y) = -\sum_j p(y_j) \ln p(y_j).$$
Thus, the mutual information can also be given as
$$I(x, y) = H(x) + H(y) - H(x, y),$$
where $H(x, y)$ is the joint entropy. Based on this, the normalised mutual information in general is defined as
$$I_{\rm norm}(x, y) \equiv \frac{2 I(x, y)}{H(x) + H(y)}.$$
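These definitions translate directly into a few lines of code. The sketch below assumes the joint distribution is given as a dictionary `{(x, y): probability}` whose values sum to 1; the function name is hypothetical.

```python
from collections import defaultdict
from math import log


def normalised_mutual_information(joint):
    """Return 2 I(x, y) / (H(x) + H(y)) for joint = {(x, y): probability}."""
    # Marginal distributions p(x) and p(y).
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p

    # Mutual information; terms with zero probability contribute nothing.
    mutual = sum(
        p * log(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0
    )
    h_x = -sum(p * log(p) for p in p_x.values() if p > 0)
    h_y = -sum(p * log(p) for p in p_y.values() if p > 0)
    if h_x + h_y == 0:
        raise ValueError("normalisation undefined: both entropies are zero")
    return 2 * mutual / (h_x + h_y)
```

Perfectly dependent variables give 1 and independent variables give 0, which is what makes the measure convenient for comparing two structures on a common scale.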
Comparing DAGs with mutual information
Results: In numbers
Hierarchy (target): GO subset and user defined. Data (input): proteins and simulated tagging, respectively.

Measure                  Method              GO subset    User defined
Matching links           algorithm A         21%          31%
                         algorithm B         20%          89%
                         P. H. & H. G.-M.    19%          48%
                         P. Schmitz          18%          1%
Acceptable links         algorithm A         66%          35%
                         algorithm B         52%          91%
                         P. H. & H. G.-M.    51%          54%
                         P. Schmitz          65%          2%
Normalised Mut. Info.    algorithm A         35%          18%
                         algorithm B         30%          83%
                         P. H. & H. G.-M.    30%          29%
                         P. Schmitz          30%          1%
Linearised Mut. Info.    algorithm A         78%          66%
                         algorithm B         75%          97%
                         P. H. & H. G.-M.    75%          76%
                         P. Schmitz          75%          5%