IMLG – 22 March 2018 John Collomosse 2
Visual Search?
Visual search: querying visual repositories using visual (pictorial) queries.
80% of Internet traffic forecast to be visual data by end of 2018 [Cisco VNI, 2016] (on track for 84%)
iTrace – Visual Plagiarism Detection
A visual ‘TurnItIn’
Trialed at 4 HEIs
120k artworks in VADS.ac.uk + uploads
“Classic” Computer Vision
Matches visual features (“key points”) between images
SIFT features
(circa 2004)
“Modern” Computer Vision
Use deep neural networks trained on example data to extract digital signatures from whole images
Convolutional Neural Network (CNN)
Deep Learning Revolution
Pre-2012 state of the art:
Q: What is this? (Which of 1000 objects is this?)
A: Mug (~9% accuracy)

Deep Learning (~2016) state of the art:
Q: How many leftover donuts are there?
A: Three (~70% accuracy)
Deep Learning Revolution
Challenges for CNNs circa 2012:
- Data hungry. CNNs require a lot of training data.
- Processing power. CNNs require a lot of compute to train, so only simple CNNs were practical.
- Niche. CNNs were a niche research topic.
Then…
- ImageNet arrived (16m images, 1000 classes) [Deng et al. 2009]
- GPUs. General purpose GPU processing / CUDA. The algorithms for training a CNN are highly parallelisable.
- NIPS/ECCV 2012. Double-digit % gain on ImageNet accuracy announced using CNNs.
The vision community took notice!
Sketch based Visual Search
Sketch based Retrieval of….
1. Several Million (10^7) Colour Images
2. Images using Deeply Learned Descriptors
3. Sketching with Style: Search with Aesthetic Constraints
Why Sketch?
“Most of the next generation will probably never use Desktop products. People don’t understand how profound a shift this is. The reality is for these hundreds of millions of users, mobile will be their entire gateway to services.” – Wired, 2017
• Touch screen (gesture) is the primary interface on mobile (replacing text/keyboard)
• New discovery tools are needed to release value in visual content
• Sketch is an intuitive modality for describing desired visual attributes
But the problem…
“People don’t draw well!”
Sketching is visual communication
[1] Hu and Collomosse, "Performance Evaluation of Gradient Field HOG," Computer Vision and Image Understanding (CVIU), 2013.
Excerpt of Flickr15k [1]
Humans communicate efficiently, using vocabulary & context
Sketch for retrieval is a casual, throw-away act; users don't want to invest time drawing for a machine (… and some users are bad at sketching).
Demo
Android demo app available for phones/tablets at: https://play.google.com/store/apps/details?id=com.collomosse.sketcher
Diversion into Text Search
A common measure of text document similarity involves building a frequency histogram of the words in the document: a “bag of words”
[Figure: word-frequency histogram with bins such as "Martians", "eve", "the", "molten", "life", "light", "and", "by"]
This descriptor encodes the distribution of words in the document; a function of its content
Careful choice of the words (bins) is key! Location of words doesn’t matter!
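As a minimal illustration of the idea (using a hypothetical vocabulary drawn from the figure's example words), the histogram can be built like this:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Build a frequency histogram over a fixed vocabulary.

    Word order is discarded; only counts per bin matter.
    """
    words = document.lower().split()
    counts = Counter(w for w in words if w in vocabulary)
    # One bin per vocabulary word, in a fixed order.
    return [counts[w] for w in vocabulary]

vocab = ["martians", "life", "light", "the", "and", "by"]
doc = "Life and light by the Martians the molten eve"
hist = bag_of_words(doc, vocab)
print(dict(zip(vocab, hist)))
```

Note that "molten" and "eve" are simply dropped: only words in the chosen vocabulary contribute, which is why bin choice matters.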
Bag of Visual Words for Photo Retrieval
Q. Why does BoVW find a swan?
A. The frequency/distribution of local texture patches (SIFT) cut from the query matches that of patches cut from swan images in the database.
[Figure: query image alongside matching swan images retrieved from the database]
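Retrieval then reduces to comparing histograms. A minimal sketch, using hypothetical 5-bin visual-word histograms (counts of quantised SIFT patches) and cosine similarity as the distance:

```python
import math

def cosine_similarity(h1, h2):
    """Cosine similarity between two bag-of-(visual-)words histograms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

# Hypothetical visual-word histograms for a query and two database images.
query = [4, 0, 7, 1, 0]
swan = [5, 1, 6, 0, 0]
car = [0, 6, 0, 2, 7]

# The swan image's patch distribution is closer, so it ranks first.
print(cosine_similarity(query, swan) > cosine_similarity(query, car))  # True
```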
Bag of Visual Words for Sketch Retrieval
Q. Why do we see a swan?
A. The spatial relationships of strokes (edges) determine the object’s structure, from which we infer presence of a swan.
Synthesising “Texture”
[Figure: texture patches from photos vs. sketches]
Feature Extraction (sketch & photo)
Photographs passed through edge detection filter
Multi-scale patches cut at every ‘edge’ or ‘sketch’ pixel (Gradient information / HOG)
[Pipeline: indexing: database images → edge detection → feature extraction → feature encoding → index file.
Querying: query sketch → feature extraction → feature encoding → matching against the index → results]
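A schematic sketch of the indexing side, assuming a crude gradient-magnitude edge detector as a stand-in for the real filter, and single-scale fixed-size patches (the slide describes multi-scale HOG patches):

```python
import numpy as np

def edge_map(image, threshold=0.2):
    """Crude gradient-magnitude edge detector (stand-in for a real filter)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    return mag > threshold * mag.max()

def cut_patches(image, edges, size=8):
    """Cut a square patch centred on every edge pixel (multi-scale omitted)."""
    h = size // 2
    patches = []
    for y, x in zip(*np.nonzero(edges)):
        if h <= y < image.shape[0] - h and h <= x < image.shape[1] - h:
            patches.append(image[y - h:y + h, x - h:x + h])
    return patches

img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0            # a white square on black
patches = cut_patches(img, edge_map(img))
print(len(patches), patches[0].shape)
```

Each patch would then be described (e.g. by HOG) and quantised into a visual word for the index.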
Results: ImageNet Dataset
- 16m image dataset
- ~2s / query
- Platform: AMD 2.6GHz, single-core benchmark
- 43GB features
Sketch based Visual Search
Sketch based Retrieval of….
1. Several Million (10^7) Colour Images
2. Images using Deeply Learned Descriptors
3. Sketching with Style: Search with Aesthetic Constraints
Problems with an Edgemap Approach
Sketches are not edgemaps (Distortion, Level of Abstraction, etc.)
[Example sketches: "House" and "Crocodile" (TU-Berlin dataset, Eitz et al. 2012)]
Cross domain metric learning for SBIR
Can we learn a low dimensional metric embedding of edge and sketch space?
[Figure: sketch and image features before vs. after learning the embedding]
Sketch matching with CNN
Triplet loss in a triplet network:

L(a, p, n) = 1/2 · max(0, m + ||a − p||²₂ − ||a − n||²₂)

m: margin

[Figure: triplet network with three weight-sharing CNN branches for anchor (a), positive (p), negative (n); each branch: 5×5 conv (64 @ 71×71) → 3×3 conv (128 @ 31×31) → three 3×3 convs (256 @ 15×15) → FC 512 → FC 512 → FC 100]
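The loss itself is straightforward to write down. A minimal NumPy sketch (the embeddings here are toy 2-D vectors standing in for CNN outputs):

```python
import numpy as np

def triplet_loss(a, p, n, m=1.0):
    """L(a, p, n) = 1/2 * max(0, m + ||a - p||^2 - ||a - n||^2).

    a: anchor (sketch) embedding; p: positive (matching) embedding;
    n: negative (non-matching) embedding; m: margin.
    """
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return 0.5 * max(0.0, m + d_pos - d_neg)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([2.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # negative is well past the margin, so loss is 0
```

Gradients of this loss pull p towards a and push n away until n is at least margin m beyond p, which is exactly the geometry training enforces.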
What happens during training?
Learning joint embedding of edge and sketch space
[Figure: before training, anchor (a), positive (p) and negative (n) samples lie arbitrarily in the embedding; training pulls p within margin m of a and pushes n beyond it]

Triplet loss: L(a, p, n) = 1/2 · max(0, m + ||a − p||²₂ − ||a − n||²₂)
Datasets
Training:
• Sketch: TU-Berlin, 20k sketches @ 250 classes
• Image: Internet photo acquisition

Test:
• Flickr15k: 15k photos + 330 sketches @ 33 classes
Class diversity between training and test datasets (TU-Berlin → Flickr15k):
- "bridge" → "bridge", "Tower bridge", "Sydney bridge", "Oxford bridge"
- "duck", "swan" → "duck-swan"
- "sun", "moon" → "moon", "sunrise-sunset"
- "human-skeleton", "nose", "mermaid", "angel"
Training Methodology
Data Augmentation and Triplet Formation
• Images:
  - 25k photos: 100 photos/class
  - Edge extraction: gPb [Arbelaez, 2011]
  - Mean subtraction; random crop/rotation/scaling/flip
• Sketches:
  - 20k sketches: 20 training, 60 validation per class
  - Skeletonisation
  - Mean subtraction; random crop/rotation/scaling/flip
  - Random stroke removal
• Triplet formation: random selection of positive/negative samples
• Training: 10k epochs
[Figure: augmentation examples: crop, rotation, scaling, flip, stroke removal]
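A minimal sketch of the augmentation step, assuming images arrive as NumPy arrays (rotation, scaling and stroke removal omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=56):
    """Random crop + random horizontal flip + mean subtraction."""
    h, w = image.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    out = image[y:y + crop, x:x + crop].astype(float)
    if rng.random() < 0.5:
        out = out[:, ::-1]          # horizontal flip
    return out - out.mean()         # mean subtraction

img = rng.random((64, 64))          # stand-in for a 64x64 edge-map/sketch
aug = augment(img)
print(aug.shape, round(float(aug.mean()), 6))
```

Each epoch sees a different random variant of every training sample, which multiplies the effective size of a small sketch dataset.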
Representative Queries/Results
Sketch based Visual Search
Sketch based Retrieval of….
1. Several Million (10^7) Colour Images
2. Images using Deeply Learned Descriptors
3. Sketching with Style: Search with Aesthetic Constraints
Sketching with Style
1. User’s intermediate work in Photoshop (here, a graphite sketch)
2. Behance visually searched for inspiration in a specified style (here, watercolor)
Result searching 66.8m Behance images
Video Demo
Learning the Style Embedding
GoogleNet (Inception v3) with 128-D Bottleneck
Triplet design, fully Siamese (branches share weights)
Training Set (110k Behance)
Visualizing the Style Embedding (Behance 1m test set)
t-SNE perplexity 20
Putting it all together: Two-Stream Network Architecture
1. A Structure Network – that learns an embedding to visually match structure irrespective of style
2. A Style Network – that learns an embedding to visually match aesthetics irrespective of content (structure)
[Figure: two-stream architecture. Sketch and image each pass through a structure network (256-D structure embedding) and a style network (128-D style embedding); both embeddings feed a common search index]
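A minimal sketch of how the two embeddings might be combined at search time. Dimensions follow the slide (256-D structure, 128-D style); the random vectors are stand-ins for network outputs:

```python
import numpy as np

def index_entry(structure_vec, style_vec):
    """Concatenate structure and style embeddings into one searchable descriptor."""
    return np.concatenate([structure_vec, style_vec])

def search(query, index, top_k=3):
    """Brute-force nearest neighbours by Euclidean distance over the index."""
    dists = np.linalg.norm(index - query, axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(1)
# Hypothetical 256-D structure + 128-D style descriptors for 100 images.
index = np.stack([index_entry(rng.random(256), rng.random(128))
                  for _ in range(100)])
query = index[42]              # query with a known entry as a sanity check
print(search(query, index))    # its own row ranks first
```

In practice the query's structure part comes from the sketch branch and its style part from the user's chosen style, so the same index serves both constraints at once.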
Evaluating Sketch+Style Retrieval (top-1 result)
Fine Grain: Style Analogies (vector math in 128-D style space)
= (watercolor + graphite) = (watercolor – graphite)
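Analogies of this kind can be sketched as plain vector arithmetic in the style embedding. All vectors below are synthetic stand-ins for learned 128-D style descriptors, not real network outputs:

```python
import numpy as np

def analogy(query_style, subtract, add):
    """Shift a query's style vector, e.g. (query - graphite) + watercolor."""
    v = query_style - subtract + add
    return v / np.linalg.norm(v)   # re-normalise onto the unit sphere

rng = np.random.default_rng(2)
graphite = rng.standard_normal(128)     # stand-in "graphite" style direction
watercolor = rng.standard_normal(128)   # stand-in "watercolor" style direction
query = graphite + 0.1 * rng.standard_normal(128)   # a graphite-styled query

shifted = analogy(query, graphite, watercolor)
# The shifted vector now points along the watercolor direction.
cos = float(shifted @ watercolor / np.linalg.norm(watercolor))
print(round(cos, 3))
```

Searching the style index with the shifted vector returns results matching the query's structure but rendered in the target style.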
Fine Grain: Style Analogies (vector math in 128-D style space)
= comic = (comic – pen+ink)
Fine Grain: Style Analogies (vector math in 128-D style space)
= vectorart = (– vectorart)
Closing Thoughts
Scalability
- Scaling under sketch ambiguity is the challenge (not compute)
- Need to integrate modalities beyond shape. Which? How to fuse?
- How to determine user intent in prioritizing modalities?
Composition Breakdown
- All SBIR assumes a single dominant object, but real data isn’t like that
Deep Learning
- Deep learning outperforms classic approaches at perceptual tasks like search
- But networks must be trained with a lot of representative data (+annotation)
- Sketch data in particular is sparse (10^2 categories, 10^4 instances)