MediaEval 2016 - Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features
TRANSCRIPT
Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

Giorgos Kordopatis-Zilos¹, Adrian Popescu², Symeon Papadopoulos¹ and Yiannis Kompatsiaris¹
1 Information Technologies Institute (ITI), CERTH, Greece
2 CEA LIST, 91190 Gif-sur-Yvette, France
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
Summary

Tag-based location estimation (1 run)
• Built upon the scheme of our 2015 participation [1] (Kordopatis-Zilos et al., MediaEval 2015)
• Based on a refined probabilistic Language Model

Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images

Hybrid location estimation (3 runs)
• Combination of the textual and visual approaches using a set of rules

Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in the test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap
Tag-based location estimation

• Processing steps of the approach
– Offline: language model construction
– Online: location estimation
Pre-processing

• Tags and titles of the training set items are processed
• Apply:
– URL decoding
– lowercase transformation
– tokenization
• Remove:
– accents
– symbols
– punctuation
• Multi-word tags are split into their individual terms, which are also included in the item's term set
• Terms that are numeric or shorter than three characters are discarded
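The pre-processing steps above can be sketched as follows; this is a minimal illustration, not the authors' actual code, and the function name is ours:

```python
import re
import unicodedata
from urllib.parse import unquote

def preprocess(raw_tags):
    """Normalize raw tags/titles into an item's term set."""
    terms = set()
    for tag in raw_tags:
        text = unquote(tag).lower()                 # URL decoding + lowercase
        text = unicodedata.normalize("NFKD", text)  # decompose accented chars
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # tokenization; symbols and punctuation are dropped by the pattern
        tokens = re.findall(r"[a-z0-9]+", text)
        terms.update(tokens)  # multi-word tags contribute their individual terms
    # discard numeric terms and terms shorter than three characters
    return {t for t in terms if not t.isdigit() and len(t) >= 3}
```

For example, `preprocess(["Eiffel%20Tower", "café", "42", "NY"])` yields the term set `{"eiffel", "tower", "cafe"}`.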
Language Model (LM)

• LM generation scheme
– divide the earth's surface into rectangular cells with a side length of 0.01°
– calculate term-cell probabilities: $p(t \mid c) = N_u / N_t$

• LM-based estimation
– the Most Likely Cell (mlc), i.e. the cell with the highest probability, is used to produce the estimation:

$mlc_j = \arg\max_i \sum_{k=1}^{|T_j|} p(t_k \mid c_i) \cdot w(t_k)$

Inspired from [3] (Popescu, MediaEval 2013)
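As a sketch, the mlc selection can be implemented as a weighted sum of term-cell probabilities; the dictionary layout and function name below are assumptions for illustration:

```python
def most_likely_cell(item_terms, term_cell_prob, weight):
    """Score each cell by the weighted sum of its term-cell probabilities
    and return the highest-scoring cell (the mlc)."""
    scores = {}
    for t in item_terms:
        for cell, p in term_cell_prob.get(t, {}).items():
            scores[cell] = scores.get(cell, 0.0) + p * weight.get(t, 1.0)
    return max(scores, key=scores.get) if scores else None

# term -> {cell id: p(t|c)}, with purely illustrative values
probs = {"eiffel": {"paris": 0.9, "vegas": 0.1},
         "tower":  {"paris": 0.5, "london": 0.3}}
weights = {"eiffel": 2.0, "tower": 1.0}
```

With these toy values, the terms `["eiffel", "tower"]` give "paris" the score 0.9·2.0 + 0.5·1.0 = 2.3, the highest of the three cells.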
Feature Selection and Weighting

Feature Selection
• Calculate term locality using a grid of 0.01°×0.01°
• When a user uses a given term, he/she is assigned to the entire cell neighborhood instead of a unique cell as in [1]

$l(t) = N_t \cdot \dfrac{\sum_{c \in C} \sum_{u \in U_{t,c}} |\{u' \mid u' \in U_{t,c},\ u' \neq u\}|}{N_t^2}$

• Terms with a non-zero locality score form the term set $T$

Feature Weighting
• Locality weight function, a function based on the term's relative position in $T$
• Spatial Entropy weight function, a Gaussian function based on the term's spatial entropy
• Linear combination of the two weights
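The locality computation can be sketched as below, taking a mapping from cells to the sets of users who used the term; the data layout and function name are assumptions:

```python
def locality(users_per_cell, n_t):
    """Locality of a term: within each cell, count ordered pairs of
    distinct users who used the term, then normalize as in the slide's
    formula, l(t) = N_t * (pair count) / N_t^2."""
    pair_count = 0
    for users in users_per_cell.values():
        k = len(users)
        pair_count += k * (k - 1)  # ordered pairs (u, u') with u' != u
    return n_t * pair_count / (n_t ** 2) if n_t else 0.0
```

A term used by three co-located users scores high, while a term whose users are scattered one per cell contributes no pairs and scores zero, which is what drives the non-zero-locality selection above.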
Refinements

• Multiple Grids
– build an additional LM using a finer grid (cell side length of 0.001°)
– combine the MLCs of the individual language models

• Similarity search [5] (Van Laere et al., ICMR 2011)
– determine the $k_t$ most similar training images in the MLC
– their center-of-gravity is the final location estimation

From [2] (Kordopatis-Zilos et al., PAISI 2015)
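The center-of-gravity step amounts to averaging the coordinates of the $k_t$ most similar training images inside the mlc; a minimal sketch, which ignores longitude wrap-around (acceptable for points within a single small cell):

```python
def center_of_gravity(coords):
    """Average (lat, lon) of the selected training images."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return sum(lats) / len(lats), sum(lons) / len(lons)
```

For example, `center_of_gravity([(40.0, 22.0), (41.0, 23.0)])` returns `(40.5, 22.5)`.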
Visual-based location estimation

• Main objectives
– ensure that the visual features are generic and transferable
– provide a compact representation of the features

• Model building
– CNN features extracted by fine-tuning the VGG model [4]
– Training: ~5K Points Of Interest (POIs), over 7M Flickr images retrieved using queries with:
  · the POI name and a radius of 5km around its coordinates
  · the POI name and the associated city name
– Outputs of the fc7 layer (4096-d) compressed to 128-d using PCA, learned on a subset of 250,000 training images

• Similarity search based on the PCA-reduced CNN features
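The PCA compression of the 4096-d fc7 outputs to 128-d can be sketched with a plain SVD-based PCA; this is an illustration under assumed details (no whitening), not the authors' pipeline:

```python
import numpy as np

def learn_pca(train_features, dim=128):
    """Learn a PCA projection from a subset of training features."""
    mean = train_features.mean(axis=0)
    # principal directions come from the SVD of the centered data
    _, _, vt = np.linalg.svd(train_features - mean, full_matrices=False)
    return mean, vt[:dim]                  # shapes: (4096,), (dim, 4096)

def pca_reduce(features, mean, components):
    """Project features onto the learned principal components."""
    return (features - mean) @ components.T

# illustrative shapes: 4096-d fc7 features reduced to 128-d
rng = np.random.default_rng(0)
fc7 = rng.standard_normal((256, 4096))
mean, comps = learn_pca(fc7, dim=128)
reduced = pca_reduce(fc7, mean, comps)     # shape (256, 128)
```

Learning the projection on a 250K-image subset (as on the slide) rather than the full training set keeps the SVD tractable while still capturing the dominant directions.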
Visual-based location estimation

Location Estimation
• Geospatial clustering of the $k_v = 20$ visually most similar images
• The largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate

Visual Confidence
• The confidence metric for the visual estimation is based on the size of the largest cluster:

$conf_v(i) = \max\left(\dfrac{n(i) - n_t}{k_v - n_t},\ 0\right)$

$n(i)$: number of neighbors in the largest cluster of image $i$
$n_t$: configuration parameter controlling the "strictness" of the confidence score
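The confidence formula translates directly to code; the default value of $n_t$ below is an assumption for illustration, not the value used in the runs:

```python
def visual_confidence(n_i, k_v=20, n_t=5):
    """conf_v(i) = max((n(i) - n_t) / (k_v - n_t), 0): the share of the
    k_v neighbors in the largest cluster, rescaled so that clusters of
    size n_t or smaller yield zero confidence (n_t = 5 is illustrative)."""
    return max((n_i - n_t) / (k_v - n_t), 0.0)
```

A fully agreeing neighborhood (`n_i = 20`) gives confidence 1.0, while any cluster of at most `n_t` images gives 0.0; raising `n_t` makes the score stricter.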
Hybrid-based location estimation

• A set of rules determines the source of estimation between the text and visual approaches

• The visual estimation is chosen when:
→ no estimation could be produced by the text approach
→ the visual estimation falls inside the borders of the mlc
→ the comparison of the confidence scores $conf_v$ and $conf_t$ favours it [1]

• Otherwise the text estimation is selected
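The rule set can be sketched as below; the argument names and the exact ordering of the checks are assumptions, and the actual confidence comparison in [1] may be more involved than a plain `>`:

```python
def hybrid_estimate(text_est, visual_est, conf_t, conf_v, inside_mlc):
    """Choose the visual estimate when the text approach produced nothing,
    when the visual point falls inside the text mlc, or when its
    confidence wins; otherwise keep the text estimate."""
    if text_est is None:
        return visual_est
    if inside_mlc or conf_v > conf_t:
        return visual_est
    return text_est
```

For instance, with a text estimate present, a low visual confidence, and the visual point outside the mlc, the text estimate is kept.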
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
RUN-E: Visual-based location estimation + entire YFCC dataset
Images (results chart not transcribed)
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
Videos (results chart not transcribed)
References
[1] G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. SocialSensor at MediaEval Placing Task 2015. In MediaEval 2015 Placing Task, 2015.
[2] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics (PAISI), pages 21–40, 2015.
[3] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In MediaEval 2013 Placing Task, 2013.
[4] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[5] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In ICMR '11, pages 48:1–48:8, New York, NY, USA, 2011. ACM.
Thank you!
Data/Code:
– https://github.com/MKLab-ITI/multimedia-geotagging/
Get in touch:
– Giorgos Kordopatis-Zilos: [email protected]
– Symeon Papadopoulos: [email protected] / @sympap