Ghent University-IBBT at MediaEval 2012 Search and Hyperlinking: Semantic Similarity Using Named...
ELIS – Multimedia Lab
Tom De Nies, Pedro Debevere, Davy Van Deursen, Wesley De Neve, Erik Mannens and Rik Van de Walle
Ghent University – IBBT – Multimedia Lab
MediaEval: Search and Hyperlinking, 4-5 October, Pisa, Italy
ELIS – Multimedia Lab
MediaEval 2012: Brave New Task: Search and Hyperlinking – Tom De Nies (IBBT-MMLab)
05/10/2012
Our approach in a nutshell
1. Create enriched representation of videos and queries
2. Apply multiple similarity metrics
3. Merge results by late fusion
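The three steps above can be sketched as a minimal pipeline. All names and the placeholder logic below are hypothetical illustrations of the structure, not the authors' implementation:

```python
# Minimal sketch of the three-step pipeline; every function here is a
# hypothetical stub standing in for the real enrichment/metric components.

def enrich(text):
    """Step 1: enrich raw text (stubbed: token set instead of real NER)."""
    return {"text": text, "entities": set(text.split())}

def similarities(query, video):
    """Step 2: apply multiple similarity metrics (stubbed with set overlap)."""
    q, v = query["entities"], video["entities"]
    overlap = len(q & v) / len(q | v) if q | v else 0.0
    return [overlap]  # one score per metric

def late_fusion(scores):
    """Step 3: merge the per-metric scores (stubbed: simple average)."""
    return sum(scores) / len(scores)

query = enrich("obama speech")
video = enrich("speech by obama in washington")
score = late_fusion(similarities(query, video))
```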
Enriched Data Representation
Advantages:
• Comparable queries and videos
• Extra metadata containing disambiguated concepts
• Easy conversion from video to query object
  → possible to use the same approach for Search and Linking!

Disadvantages:
• Enrichment step when ingesting data can take a while
• Only English NER tools → automatic translation step needed for other languages
1. Create enriched representation of videos and queries
2. Apply multiple similarity metrics
3. Merge results by late fusion
Similarity metrics
1. “Bag of words” similarity
2. Named Entity-based similarity
3. Tag-based similarity
Bag of Words similarity
Pipeline: text → stop word removal → text without stop words → calculate term frequency (TF) and inverse document frequency (IDF) for each word → vector of TF-IDF weights

TF(t, D) = number of occurrences of term t in document D
IDF(t) = log( |D| / |{ d ∈ D : t ∈ d }| )
Similarity between two documents = cosine similarity of their TF-IDF vectors
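A minimal sketch of this metric in plain Python, following the definitions above (raw-count TF, IDF = log(|D| / df), cosine of the sparse vectors). Stop-word removal is omitted and the documents are toy examples:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector per document (dict term -> weight),
    using TF(t, D) = raw count and IDF(t) = log(|D| / df(t))."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["news", "election", "vote"],
        ["election", "vote", "results"],
        ["cooking", "pasta", "recipe"]]
vecs = tfidf_vectors(docs)
```

Documents sharing rare terms score higher than documents sharing nothing; terms occurring in every document get IDF = log(1) = 0 and drop out, which is the "common words get lower weight" effect described on the next slide.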
Pros:
• Both corpus and documents are taken into account
• Common words get a lower weight, exploiting unique features

Cons:
• Expensive training step (IDF initialization)
• No semantics → ambiguity
Named Entity-based Similarity

Named Entities are extracted from the content. Similar content will have similar entities!
Cosine similarity of vectors, as in Bag of Words, except:
• Words → Named Entities
• IDF → inverse support (IS)
TF-IS weights are computed analogously to TF-IDF weights, with IDF replaced by IS.
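This could be sketched as follows. Note that the slide does not give the inverse-support formula, so the IS definition below is an illustrative assumption, as are the entity names and support counts:

```python
import math
from collections import Counter

# ASSUMPTION: the exact inverse-support formula is not stated on the slide.
# This sketch uses IS(e) = 1 / log(2 + support(e)), where support(e) is,
# e.g., how often entity e occurs in a reference knowledge base, so frequent
# entities get lower weight, analogous to IDF.
def tf_is_vector(entities, support):
    """Sparse TF-IS vector (dict entity -> weight) for a list of entities."""
    tf = Counter(entities)
    return {e: c / math.log(2 + support.get(e, 0)) for e, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(u[e] * v.get(e, 0.0) for e in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

support = {"Barack_Obama": 10000, "Pisa": 500}   # hypothetical support counts
q = tf_is_vector(["Barack_Obama", "Pisa"], support)
d = tf_is_vector(["Barack_Obama", "Barack_Obama"], support)
```

Because the vectors are built from disambiguated entities rather than surface words, no corpus-wide indexing pass is needed: the support values come from the knowledge base, not from the document collection.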
Pros:
• Fewer entities than terms → fewer calculations than BoW
• Named Entities are unambiguous
• IDF → IS: no indexing of the corpus required

Cons:
• Lower precision / coarser granularity than BoW
Tag-based similarity

Tags are expanded with synonyms (from WordNet) for maximal recall, then compared using Jaccard similarity.
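A minimal sketch of this idea. In the actual system the synonyms come from WordNet; here a tiny hand-written synonym table stands in so the example is self-contained:

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {"film": {"movie"}, "movie": {"film"}, "car": {"automobile"}}

def expand(tags):
    """Expand a tag set with known synonyms, for higher recall."""
    expanded = set(tags)
    for t in tags:
        expanded |= SYNONYMS.get(t, set())
    return expanded

def jaccard(a, b):
    """Jaccard similarity of two expanded tag sets: |A ∩ B| / |A ∪ B|."""
    a, b = expand(a), expand(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Without the expansion step, tag sets like {"film"} and {"movie"} would score 0 despite describing the same content; with it they overlap, which is exactly the recall gain the slide describes.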
Pros:
• Uses user-generated metadata
• Synonyms for higher recall

Cons:
• Very coarse granularity / low precision
1. Create enriched representation of videos and queries
2. Apply multiple similarity metrics
3. Merge results by late fusion
Late Fusion
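The original slide shows the fusion scheme as a diagram that is not reproduced in this transcript. As a generic illustration, late fusion combines the per-metric scores of each candidate after the metrics have been computed independently; the weighted sum and the weights below are an assumption, not the authors' exact scheme:

```python
# Generic weighted-sum late fusion; the weights are hypothetical examples.
def fuse(metric_scores, weights):
    """Combine per-metric scores for each candidate into one ranking score."""
    return {
        cand: sum(w * s for w, s in zip(weights, scores))
        for cand, scores in metric_scores.items()
    }

scores = {"videoA": [0.9, 0.2, 0.5],   # [BoW, NE, Tag] similarity scores
          "videoB": [0.4, 0.8, 0.6]}
fused = fuse(scores, weights=[0.5, 0.3, 0.2])
ranking = sorted(fused, key=fused.get, reverse=True)
```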
Evaluation: Search

Run                    | MRR 60/30/10          | mGAP 60/30/10         | MASP 60/30/10
1 (LIMSI: BoW+NE)      | 0.188 / 0.15 / 0.117  | 0.120 / 0.089 / 0.033 | 0.066 / 0.066 / 0.061
2 (LIUM: BoW+NE)       | 0.254 / 0.187 / 0.054 | 0.140 / 0.069 / 0.033 | 0.046 / 0.046 / 0.028
3 (LIMSI: BoW+NE+Tags) | 0.165 / 0.128 / 0.094 | 0.099 / 0.069 / 0.017 | 0.061 / 0.061 / 0.057
4 (LIUM: BoW+NE+Tags)  | 0.221 / 0.154 / 0.038 | 0.115 / 0.053 / 0.017 | 0.040 / 0.041 / 0.023
Unexpected:
• LIUM > LIMSI, even though LIMSI had better language detection → due to the automatic translation step?
• NE + BoW > NE + BoW + Tags → tags give false positives a higher rank and find more results, so MRR decreases
Evaluation: Search
Run                    | Precision@60 | Recall@60
1 (LIMSI: BoW+NE)      | 0.056        | 0.40
2 (LIUM: BoW+NE)       | 0.061        | 0.467
3 (LIMSI: BoW+NE+Tags) | 0.054        | 0.433
4 (LIUM: BoW+NE+Tags)  | 0.059        | 0.50
Evaluation: Linking

Run                    | MAP (Ground Truth) | MAP (Search results)
LIMSI (BoW+NE)         | 0.157              | 0.014
LIUM (BoW+NE)          | 0.171              | 0.040
LIMSI (BoW+NE+Tags)    | 0.157              | 0.003
LIUM (BoW+NE+Tags)     | 0.171              | 0.037

Possible explanations:
• Thresholds were optimized for the Search task, not for Linking
• User-generated tags vs. extracted tags
… to be investigated!
Improvements / Future Work
• Better ranking criteria / late fusion
• Improve tag similarity
• Optimize parameters for Linking
Discussion
These research activities were funded by Ghent University, IBBT, the IWT Flanders, the FWO-Flanders, and the European Union, in the context of the IBBT project Smarter Media in Flanders (SMIF).