Learning from the Uncertain: Leveraging Social Communities to Generate Reliable Training Data for Visual Concept Detection Tasks
i-KNOW 2015
Christian Hentschel, Harald Sack
Hasso Plattner Institute, University of Potsdam, Germany
Agenda
● Visual Concept Detection
○ Problem: Insufficient Training Data
○ Approach: Leveraging Social Photo Communities
● Relevant Image Retrieval
○ Improved Dataset
○ Community Language Model
○ Visual Re-ranking
● Results
● Conclusions & Outlook
Learning from the Uncertain, Harald Sack, 10-21-2015
Visual Concept Detection
● The ability to learn visual categories in order to automatically identify new, unseen images of these categories based on visual content alone
● Supervised machine learning task:
○ Positive images (that depict a concept)
○ Negative images (that don’t)
○ Classification/prediction:
■ Test whether an image depicts the concept (or not)
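As a minimal sketch of this supervised setup, the following trains a nearest-centroid classifier on positive and negative examples and then tests a new sample. The choice of classifier, the toy 2-D features, and the names `train_centroids` and `predict` are illustrative assumptions, not the method used in the talk (which relies on CNNs).

```python
from math import dist

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def train_centroids(positives, negatives):
    """'Training' here only stores one mean vector per class."""
    return centroid(positives), centroid(negatives)

def predict(model, features):
    """Return True if the features are closer to the positive centroid."""
    pos_c, neg_c = model
    return dist(features, pos_c) < dist(features, neg_c)

# Toy 2-D features (hypothetical); real systems use image descriptors.
positives = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
negatives = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
model = train_centroids(positives, negatives)
```

Calling `predict(model, some_features)` then answers the test question for an unseen image: does it depict the concept or not.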
Visual Concept Detection
● Approach: Convolutional Neural Networks (CNN)
○ outperformed all other approaches
○ variation of multi-layer perceptron
○ deep (i.e. many hidden layers)
○ many parameters to train → prone to overfitting
■ important: (very) large training datasets
Visual Concept Detection - Training Data
[Figure labels: Bag of Visual Words, Convolutional Neural Networks]
● Training Data for Convolutional Neural Networks
○ ImageNet
■ widely used
■ benchmarking initiative
■ > 14m photos
■ > 21k classes
■ goal: 40k categories, each covered by 10k photos evaluated as relevant by the majority of 10 human annotators
■ estimated annotation time: 63 years (2s per image)
Visual Concept Detection - Training Data
http://www.image-net.org/
● Training Data for Convolutional Neural Networks (cont.)
○ Flickr
■ > 8b photos (as of 12-12-2012)
■ user generated annotations
■ potentially unlimited visual concepts and photos per concept
■ problems:
● incomplete (not severe as there are so many)
○ missing annotations
● highly subjective!
● often not related to visual content!
○ e.g. describe viewpoint rather than object
Visual Concept Detection - Leveraging Social Photo Communities
● Task: identify images relevant for a given visual concept
○ scene/object should be clearly visible, no major occlusions
Visual Concept Detection - Leveraging Social Photo Communities
Relevant Image Retrieval - The Dataset
● MIRFLICKR-1M
○ 1 million Flickr images (selection based on interestingness score)
○ published under Creative Commons Attribution Licence
○ provides EXIF data and user tags only
● our s16a extension:
○ authoritative metadata: title, description, geo information
○ social metadata: user comments, notes, album and group memberships
○ available for > 90% of the original data
○ publicly available: www.s16a.org/mirflickr
● query extension
○ based on the language used by Flickr users, generate query terms similar/related to the visual concept query
○ example: → …
○ learn these annotation relationships based on metadata corpus
● language model: word2vec
○ make predictions about a word's meaning based on learned contextual appearances
○ unsupervised training of a neural network: given a corpus
■ predict a word given its context (Continuous Bag-of-Words)
■ predict context given a word (skip-gram)
○ skip-gram better suited to represent infrequent words
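To illustrate what "predict context given a word" means for training, the sketch below builds the (center, context) pairs a skip-gram model would learn from. It covers only pair generation, not the neural network itself, and leaves out word2vec's sampling details. The function name `skipgram_pairs` and the toy tag lists are hypothetical; the defaults mirror the window size (6) and minimum frequency (5) reported in the talk.

```python
from collections import Counter

def skipgram_pairs(documents, window=6, min_count=5):
    """Return (center, context) training pairs from tokenized documents.

    Words occurring fewer than min_count times in the corpus are dropped
    before the context windows are formed.
    """
    counts = Counter(word for doc in documents for word in doc)
    pairs = []
    for doc in documents:
        kept = [w for w in doc if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Hypothetical tag lists, one per photo.
tags = [["sunset", "beach", "sea"], ["sunset", "sky"]]
pairs = skipgram_pairs(tags, window=1, min_count=1)
```

With user tags as the corpus, each photo's tag list plays the role of a sentence, so tags that are often used together end up as training pairs.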
Relevant Image Retrieval - Community Language Model
https://code.google.com/p/word2vec/
● currently: model is based on user tags only
○ titles, descriptions, user comments use HTML encoded full-text strings
○ pre-processing: lemmatization, stop word removal (English only)
● skip-gram model
○ context window size: 6 (average # of tags per image: 12)
○ ignore words with total frequency f < 5
○ 300-dimensional feature vector
● compute k=20 most similar terms per query concept
○ extend initial visual concept query (e.g. ‘sunset’) by additional, related terms
○ rank images based on number of query terms found in metadata
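The query extension and ranking steps can be sketched as follows, assuming word vectors are already available as a plain dictionary. The names `extend_query` and `rank_images`, the 2-D toy vectors, and the image ids are hypothetical; similarity is taken to be cosine similarity, which is the usual choice for word2vec vectors.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extend_query(concept, vectors, k=20):
    """Return the concept plus its k most similar terms."""
    others = [t for t in vectors if t != concept]
    others.sort(key=lambda t: cosine(vectors[concept], vectors[t]), reverse=True)
    return [concept] + others[:k]

def rank_images(query_terms, image_tags):
    """Rank image ids by how many query terms occur in their tags."""
    terms = set(query_terms)
    scores = {img: len(terms & set(tags)) for img, tags in image_tags.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-D embeddings; the model in the talk uses 300 dimensions.
vectors = {
    "sunset": [1.0, 0.1],
    "dusk": [0.9, 0.2],
    "sky": [0.7, 0.5],
    "car": [0.0, 1.0],
}
image_tags = {
    "img1": ["car", "street"],
    "img2": ["sunset", "dusk", "sky"],
    "img3": ["dusk", "beach"],
}
query = extend_query("sunset", vectors, k=2)
ranking = rank_images(query, image_tags)
```

Note that `img3` is retrieved although it lacks the tag 'sunset' itself, which is the point of extending the query.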
Relevant Image Retrieval - Community language model (cont.)
● Advantages of corpus-specific language model:
○ retrieves synonyms:
■ airplane: [aircraft, aeroplane, plane]
○ retrieves related concepts:
■ beach: [sand, ocean, shore]
○ retrieves frequent instances:
■ flower: [dahlia, spiderwort, tulip], car: [ford, porsche]
○ multilingual:
■ dog: [chien, perro, hund]
○ captures sub- and superclasses:
■ boat: [fisherboat, sailboat, sailingships], tiger: [flickrbigcats]
○ no external knowledge base, dictionary or thesaurus required
Relevant Image Retrieval - Community language model (cont.)
● Pre-trained CNN as feature extractor
○ trained on ImageNet Large Scale Visual Recognition Challenge 2012 data
○ deep feature encoding
○ penultimate (fully connected) layer
○ 4,096-dimensional feature vector
○ extracted for all images of MIRFLICKR-1M collection
■ 3 hours on NVIDIA Tesla K20X GPU
○ publicly available: www.s16a.org/mirflickr
● Re-ranking: use the highest ranked image from the extended query as reference
○ assumption: high relevance for visual concept
○ compute cosine similarity to this image using deep feature representations
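A minimal sketch of this re-ranking step, assuming the deep features are already extracted and stored per image id. The name `rerank` and the 3-D toy features are hypothetical stand-ins for the 4,096-dimensional CNN features.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank(ranked_ids, features):
    """Re-order images by cosine similarity to the top-ranked image."""
    reference = features[ranked_ids[0]]
    return sorted(ranked_ids,
                  key=lambda img: cosine(features[img], reference),
                  reverse=True)

# Hypothetical 3-D features standing in for CNN feature vectors.
features = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.9, 0.1, 0.0],
}
reranked = rerank(["a", "b", "c"], features)
```

Because everything is compared against a single reference image, the result is only as good as that image: if the top-ranked photo is atypical for the concept, visually similar but irrelevant photos move up.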
Relevant Image Retrieval - Further Improvement by Visual Re-ranking
● baseline:
○ select photos based on presence of concept term in tagset
○ no query extension
○ large number of candidate images: randomly select n=200
● compare against
○ extended query using community language model
○ extended query + visual re-ranking
● evaluation using average precision:
○ R: no. of relevant photos, Rk: relevant images among the top k ranked instances, rel(k) = 1 if the photo at rank k is relevant, 0 otherwise
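Assuming the standard definition of average precision, which matches the symbols above, AP = (1/R) · Σ_k (R_k / k) · rel(k), with k running over the ranked photos. The sketch below computes it from a list of relevance flags; the function name `average_precision` and the example ranking are hypothetical.

```python
def average_precision(relevance, num_relevant=None):
    """Average precision of a ranked list.

    relevance: list of 0/1 flags, where relevance[k - 1] equals rel(k).
    num_relevant: R; defaults to the number of relevant items in the list.
    """
    R = sum(relevance) if num_relevant is None else num_relevant
    if R == 0:
        return 0.0
    hits = 0      # R_k: relevant items among the top k
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at rank k, counted only if rel(k) = 1
    return total / R

# Hypothetical ranking: relevant photos at ranks 1, 3 and 4.
ap = average_precision([1, 0, 1, 1, 0])
```

Averaging this value over all tested concepts gives the mean average precision (mAP).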
Results
● tested for 10 visual concepts
● manual assessment of relevance:
○ object/scene clearly visible
○ no major occlusions
● visual re-ranking of results obtained from the language model is superior
○ exceptions: ‘airplane’, ‘car’
Results (cont.)
● top ranked candidate images according to community language model
airplane: passenger, vliegtuig, aéroport, jetplane, traveller, airliner, lesavions, economysection, traveler, vacation, fuselage, motor, jetliner, transportation, airplane, legroom, transport, sky, passengerjet, jet, travel, flying, avion, airport, inflight, rudder, flugzeug, tail, bin, ptvs, aerial, cabin, passengerplane, aircraftpicture, schipholairport, flap, nederland, aviao, amsterdam, landinggear, cockpit, luggagebins, aircraft, plane, aviation, seat, economyclass, airship, aisle, netherlands, nosegear, holland, luggage, aircraftcabin, engine, aeroport, ptv, aeroplane, aeroplano, wing, paysbas
car: harney, ford, illinois, myoldpostcards, leeannharney, il, owner, coupe, chromeengine, backend, taillight, automobile, route66, custom, tail, motorvehicle, vintagecar, fomoco, international, fin, collectiblecar, 2012, custombuilt, september21232012, auto, dougthompson, motherroadfestival, 1950, 9212312, convertible, fordmotorcompany, classiccar, 2door, worldcars, car, ghostflames, vonliski, rearend, sidepipes, antiquecar, springfield, frankharney, deluxe, carsonconvertibletop, oldcar
Results (cont.)
● Problem:
○ visual re-ranking based on airport/rear light image fails to capture essential features of the visual concept:
■ [Figure: airplane, top-5 ranked images according to language model]
■ [Figure: airplane, top-5 ranked images after visual re-ranking]
● Automatically retrieve training images for visual concept detection:
○ extend MIRFLICKR-1M dataset by additional authoritative and social metadata
■ www.s16a.org/mirflickr
○ community language model improves over tag-based ranking
■ mAP: 56% → 77%
○ visual re-ranking improves in most cases
● Future Work:
○ ground truth relevance estimation based on single evaluator
○ include further annotation data
■ title, description, Flickr group information …
○ improve visual re-ranking:
■ outlier detection (e.g. single class SVM)
■ on top-n results (instead of top-1)
Conclusions & Outlook