Combining text/image in WikipediaMM task 2009
Christophe Moulin, Cecile Barat, Cedric Lemaître, Mathias Gery, Christophe Ducottet, Christine Largeron
Laboratoire Hubert Curien, Saint-Etienne, France
October 1st 2009
Christophe Moulin et al. (LaHC) Combining text/image in WikipediaMM task 2009 October 1st 2009 1 / 16
Outline
1 Model overview
    Textual vector space model
    Visual vocabulary
    Combining text and image modalities
2 Experiments
3 Conclusion and future work
Model overview
A textual/visual model based on the bag of words approach
[Diagram: documents → indexing (bag of words approach) → combining, with weights α and (1 − α)]
Model overview Textual vector space model
Textual vocabulary creation
Main steps of the textual bag of words creation:
stop words filtering → Porter stemming → bag of words creation
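The three steps above can be sketched as follows. This is only an illustration: the stop list is a tiny made-up one, and `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, not the real algorithm. Neither the tokenizer nor the stop list used by the authors is given on the slides.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption: the real list is not given on the slides).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter stemming, not the real algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, then count term occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(toy_stem(t) for t in kept)

bag = bag_of_words("The bridges and the bridge in the painting")
```

Here "bridges" and "bridge" end up in the same bag entry, which is the purpose of the stemming step.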
Textual vector weighting
Salton-based tf.idf weighting [1]
bag of words → vector of tf.idf weights [2]
$w_{i,j} = tf_{i,j} \cdot idf_j$
$tf_{i,j}$: representativeness of term j in document i
$idf_j$: discrimination power of term j
[1]: Salton et al., A vector space model for automatic indexing, 1975
[2]: Robertson et al., Okapi at TREC-3, 1994
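A minimal sketch of the weighting, reading i as the document index and j as the term index. The exact tf and idf variants are not given on the slides, so this uses a length-normalised term frequency and idf = log(N / df) as assumptions; the function name `tfidf_vectors` is illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(bags):
    """Return one sparse tf.idf vector per bag of words, plus the idf table."""
    n_docs = len(bags)
    df = Counter()                      # number of documents containing each term
    for bag in bags:
        df.update(bag.keys())
    # Assumed idf variant: log(N / df); a term present in every document gets 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for bag in bags:
        length = sum(bag.values())
        # Assumed tf variant: raw count divided by document length.
        vectors.append({t: (c / length) * idf[t] for t, c in bag.items()})
    return vectors, idf

docs = [Counter({"bridge": 2, "paint": 1}), Counter({"paint": 1, "river": 1})]
vectors, idf = tfidf_vectors(docs)
```

With these two toy documents, "paint" occurs everywhere and so carries no discrimination power, while "bridge" keeps a positive weight in the first document.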
Exploiting the text around an image
Two sources of text: metadata + text extracted from the original Wikipedia articles
metadata of the Wikipedia image (used in ImageCLEFwiki)
original Wikipedia article (n characters around the image)
Model overview Visual vocabulary
Visual representation
Similar to the text representation, using a visual codebook [3]
Visual vocabulary creation: descriptors → visual vocabulary
Image representation: descriptors → projection → bag of visual words → vector of tf.idf weights
[3]: Jurie et al., Creating efficient codebooks for visual recognition, 2005
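The projection step can be illustrated as a nearest-neighbour assignment of each descriptor to a visual word. This sketch assumes the codebook is already available as a list of centre vectors and that Euclidean distance is used; how the codebook is built and which distance the authors use are not stated on the slide.

```python
import math
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the visual word (codebook centre) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bag_of_visual_words(descriptors, codebook):
    """Count how many descriptors of one image fall on each visual word."""
    return Counter(nearest_word(d, codebook) for d in descriptors)

# Toy 2-dimensional data; the real descriptors have 6 or 128 dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.2), (4.0, 6.0), (1.1, 0.8)]
bag = bag_of_visual_words(descriptors, codebook)
```

The resulting counts play the same role as term counts in the textual model, so the same tf.idf weighting can be applied to them.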
Visual features computation
Two different descriptors are used
meanstd (6 dimensions: 9350 visual words)
sift1 (128 dimensions: 9303 visual words)
sift2 (128 dimensions: 9630 visual words)
regular partitioning: 16 × 16 cells
interest regions based on the MSER detector
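As a rough sketch of a meanstd-style descriptor on a regular partition: the slide gives only the name and the dimensionality, so the code assumes the six values are the per-channel mean and standard deviation of RGB pixels. The helper names and the toy image are hypothetical.

```python
from statistics import mean, pstdev

def meanstd(pixels):
    """Assumed 6-d descriptor: mean of R, G, B, then standard deviation of R, G, B."""
    channels = list(zip(*pixels))       # three tuples: all R, all G, all B values
    return [mean(c) for c in channels] + [pstdev(c) for c in channels]

def grid_cells(image, rows, cols):
    """Yield the pixel list of each cell of a rows x cols regular partition."""
    height, width = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            yield [image[y][x]
                   for y in range(r * height // rows, (r + 1) * height // rows)
                   for x in range(c * width // cols, (c + 1) * width // cols)]

# 4 x 4 toy image: left half black, right half white.
image = [[(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)]
         for _ in range(4)]
descriptors = [meanstd(cell) for cell in grid_cells(image, 2, 2)]
```

Each cell yields one 6-dimensional descriptor; on a real image the partition would be 16 × 16 rather than 2 × 2.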
Model overview Combining text and image modalities
Score matching
Distance computed between query and document vectors
         query    document
score1   tf       tf.idf
score2   tf.idf   tf.idf
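The difference between the two scores can be sketched as below. The slide does not name the distance, so cosine similarity is used here as an assumption; the idf values and vectors are toy data and the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def query_vector(tf, idf, use_idf):
    """score1 weights the query with tf only; score2 weights it with tf.idf."""
    return {t: f * (idf.get(t, 0.0) if use_idf else 1.0) for t, f in tf.items()}

idf = {"bridge": 2.0, "paint": 0.5}                     # toy idf values
doc = {"bridge": 0.4 * idf["bridge"], "paint": 0.6 * idf["paint"]}  # tf.idf document
query_tf = {"bridge": 0.5, "paint": 0.5}

score1 = cosine(query_vector(query_tf, idf, use_idf=False), doc)
score2 = cosine(query_vector(query_tf, idf, use_idf=True), doc)
```

In this toy case the tf.idf query emphasises the rarer term "bridge", which the document also emphasises, so score2 comes out higher than score1.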
Linear combination of textual and visual scores
[Diagram: the bag of words scores of the two modalities are combined with weights α and (1 − α)]
α is fixed globally (the same value for every query), set on ImageCLEFwiki 2008
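A small sketch of the linear fusion. The slides show only the weights α and (1 − α); attaching α to the visual score is an assumption made here because the reported α values are small. The document identifiers and scores are toy data.

```python
def combine(text_score, visual_score, alpha):
    """Linear fusion; alpha is assumed to weight the visual score."""
    return alpha * visual_score + (1.0 - alpha) * text_score

def rank(text_scores, visual_scores, alpha):
    """Rank every document that has a score in at least one modality."""
    docs = set(text_scores) | set(visual_scores)
    fused = {d: combine(text_scores.get(d, 0.0), visual_scores.get(d, 0.0), alpha)
             for d in docs}
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

text_scores = {"img1": 0.30, "img2": 0.31}
visual_scores = {"img1": 0.90, "img2": 0.10, "img3": 0.80}

text_only = rank(text_scores, visual_scores, alpha=0.0)
fused = rank(text_scores, visual_scores, alpha=0.025)
```

Even with a small α, the visual score is enough to swap two documents whose textual scores are close.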
Experiments
Global results
rank  participant/score  text      image              map     num ret  num rel ret
1     deuceng            TXT       -                  0.2397  43052    1351
5     lahc/score2        100 char  meanstd (α=0.025)  0.2178  44993    1213
6     lahc/score2        50 char   meanstd (α=0.025)  0.2148  44993    1218
14    lahc/score2        metadata  sift2 (α=0.084)    0.1903  44993    1212
15    lahc/score2        100 char  -                  0.1890  38004    1205
16    lahc/score2        50 char   -                  0.1880  37041    1198
20    lahc/score2        metadata  meanstd (α=0.025)  0.1845  44993    1208
21    lahc/score2        metadata  sift1 (α=0.012)    0.1807  44995    1200
24    lahc/score2        metadata  meanstd (α=0.015)  0.1792  44993    1213
33    lahc/score2        metadata  -                  0.1667  35611    1192
44    lahc/score1        metadata  -                  0.1432  35611    1164
52    lahc/score2        metadata  sift2              0.0365  619      142
53    lahc/score2        metadata  meanstd            0.0338  574      76
54    lahc/score2        metadata  sift1              0.0321  637      120
57    sztaki             -         IMG                0.0068  44993    80
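For readers unfamiliar with the columns, the sketch below computes mean average precision and the number of relevant documents retrieved on toy rankings. It assumes "map" is the usual mean average precision and "num rel ret" the count of relevant documents retrieved; the slides do not define them, and the data here is invented for illustration.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the per-query average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "b", "c", "d"], {"a", "c"}),   # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"z", "w"}),        # AP = (1/3) / 2, "w" is never retrieved
]
map_score = mean_average_precision(runs)
num_rel_ret = sum(len(set(r) & rel) for r, rel in runs)
```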
Textual results
[Figure: one curve per run; horizontal axis 0 to 1, vertical axis 0 to 0.7]
score1 (map: 0.1432)
score2 (map: 0.1667)
score2 50 char (map: 0.1880)
score2 100 char (map: 0.1890)
Improvements provided by additional text (15%)
Textual+visual results
[Figure: one curve per run; horizontal axis 0 to 1, vertical axis 0 to 0.7]
score2 (map: 0.1667)
score2 sift1: α=0.012 (map: 0.1807)
score2 meanstd: α=0.025 (map: 0.1845)
score2 sift2: α=0.084 (map: 0.1903)
sift2 > meanstd > sift1
Best results
[Figure: one curve per run; horizontal axis 0 to 1, vertical axis 0 to 0.8]
score2 50 char (map: 0.1880)
score2 100 char (map: 0.1890)
score2 50 char + meanstd (map: 0.2148)
score2 100 char + meanstd (map: 0.2178)
Improvements provided by visual information (15%)
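The improvement percentages can be checked against the reported map values. This sketch assumes they are relative MAP gains over the corresponding baseline run, which the slides do not state explicitly.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# map values taken from the runs reported on the slides
text_gain = relative_gain(0.1667, 0.1890)        # metadata -> 100 char
visual_gain_100 = relative_gain(0.1890, 0.2178)  # 100 char -> 100 char + meanstd
visual_gain_50 = relative_gain(0.1880, 0.2148)   # 50 char -> 50 char + meanstd
```

Under that assumption the gains come out at roughly 13% for the additional text and 14% to 15% for the visual information, so the 15% figures appear to be rounded.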
Conclusion and future work
Conclusion
Improvement of our model from last year
It works:
Text around the image in the original Wikipedia articles (+15%)
Addition of visual features (MSER + SIFT): color/texture complementarity
Text-image combination (+15%)
Future work
Combination with more than one visual descriptor.
Other fusion methods.
Learn α for each query.