Unsupervised Computer Vision: The Current State of the Art
TRANSCRIPT
Stitch Fix, Styling Algorithms Research Talk
TJ Torres, Data Scientist, Stitch Fix
WHY DEEP LEARNING?
Before deep learning, much of computer vision focused on hand-designed feature descriptors and image statistics.
Examples: SURF, MSER, corner detection.
Image Credit: http://www.mathworks.com/products/computer-vision/features.html
WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Leaderboard:

Team name | Entry description | Classification error | Localization error
GoogLeNet | No localization. Top-5 val score is 6.66% error. | 0.06656 | 0.606257
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431
VGG | a combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501
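To make the feature-extractor claim concrete, here is a minimal sketch (not from the talk) that pulls features from a pretrained ConvNet using Keras, whose blog is cited later in these slides. VGG16, the input size, and the random stand-in image are illustrative assumptions.

# A minimal sketch: a pretrained CNN as a generic feature extractor.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Drop the classifier head and keep the convolutional feature hierarchy.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

image = np.random.rand(1, 224, 224, 3) * 255.0  # stand-in for a real image batch
features = model.predict(preprocess_input(image))
print(features.shape)  # (1, 512): one 512-dimensional feature vector per image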
WHY DEEP LEARNING?
Convolution gives a local, translation-invariant feature hierarchy.
Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
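As a concrete illustration (a sketch using scipy; the 3x3 edge kernel and image size are arbitrary choices, not from the talk):

import numpy as np
from scipy.signal import convolve2d

# Slide a hypothetical 3x3 vertical-edge filter over a grayscale image.
image = np.random.rand(28, 28)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Each output value depends only on a local 3x3 patch, and the same weights
# are reused at every position: local and translation invariant.
feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map.shape)  # (26, 26)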
WHY DEEP LEARNING?
Layers build a feature hierarchy: edges, then curves, then object parts ("top of 3 shapes"), feeding a softmax output for classification.
Image Credit: http://parse.ele.tue.nl/education/cluster2
WHY DEEP LEARNING?
Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
LEARN MORE
http://cs231n.github.io/convolutional-networks/
WHY UNSUPERVISED?
Unfortunately, very few image sets come with labels.
What are the best labels for fashion/style?
THE UNSUPERVISED M.O.
Try to learn an embedding space of the image data (generally includes a generative process).
1) Train an encoder and decoder to encode, then reconstruct, the image.
2) Generate an image from a random embedding and reinforce "good"-looking images.
DOWNSIDES
Higher-dimensional embeddings are non-interpretable.
Latent distributions may contain gaps: no sensible continuum.
OUTLINE
1. Variational Auto-encoders (VAE)
2. Generative Adversarial Networks (GAN)
3. The combination of the two (VAE/GAN)
4. Generative Moment Matching Networks (GMMN)
5. Adversarial Auto-encoders (AAE), briefly
stitchfix/fauxtograph
VARIATIONAL AUTO-ENCODERS
ENCODING
Pass the input through convolution layers to produce a latent representation.
[Diagram: input → convolution → latent]
VARIATIONAL STEP
Sample from a distribution whose parameters (mean \mu and variance \sigma^2) are output by the encoder:

q_\phi(z) = \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)} I)
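In code, this sampling step is the reparameterization trick. A minimal numpy sketch, where mu and log_var stand in for encoder outputs (the batch and latent sizes are arbitrary):

import numpy as np

# Stand-ins for encoder outputs: batch of 8 images, 32 latent dimensions.
mu = np.random.randn(8, 32)
log_var = np.random.randn(8, 32)

# Reparameterization trick: draw z ~ N(mu, sigma^2 I) as a deterministic
# function of (mu, sigma) plus unit Gaussian noise, so gradients can flow
# back through mu and sigma during training.
eps = np.random.randn(*mu.shape)
z = mu + np.exp(0.5 * log_var) * eps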
DECODING
Pass the sampled latent vector through deconvolution layers to produce the output reconstruction.
[Diagram: sampled latent → deconvolution → reconstruction]
CALCULATE LOSS

\mathcal{L}(x) = D_{KL}(q_\phi(z) \| \mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{\mathrm{out}})
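A numpy sketch of this loss, using the closed-form KL divergence between N(mu, sigma^2 I) and N(0, I); x, y_out, mu, and log_var are assumed to come from the network above:

import numpy as np

def vae_loss(x, y_out, mu, log_var):
    # Pixel-wise reconstruction term.
    mse = np.mean((x - y_out) ** 2)
    # Closed-form KL(N(mu, sigma^2 I) || N(0, I)), summed over latent
    # dimensions and averaged over the batch.
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return kl + mse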
UPDATE WEIGHTS
Update each weight by gradient descent, with the gradient computed via the chain rule (backpropagation):

W^{(l)*}_{ij} = W^{(l)}_{ij} - \alpha \frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}}

\frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}} = \left(\frac{\partial \mathcal{L}}{\partial x_{\mathrm{out}}}\right)\left(\frac{\partial x_{\mathrm{out}}}{\partial f^{(n-1)}}\right)\cdots\left(\frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}}\right)
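A toy numpy version of this update for a two-layer network (the tanh activation, MSE loss, and shapes are illustrative assumptions; only the chain-rule structure matters):

import numpy as np

# Toy two-layer network: x -> tanh(x W1) -> h W2 -> y_out.
x = np.random.randn(4, 8)
target = np.random.randn(4, 1)
W1 = np.random.randn(8, 16)
W2 = np.random.randn(16, 1)

h = np.tanh(x @ W1)                        # hidden features f^(1)
y_out = h @ W2                             # network output x_out
loss = np.mean((y_out - target) ** 2)

# Chain rule, walking back from the loss toward earlier layers.
dL_dy = 2 * (y_out - target) / x.shape[0]  # dL/dx_out
dL_dW2 = h.T @ dL_dy
dL_dh = dL_dy @ W2.T
dL_dW1 = x.T @ (dL_dh * (1 - h**2))        # through the tanh nonlinearity

# Gradient-descent weight update with learning rate alpha.
alpha = 0.01
W2 -= alpha * dL_dW2
W1 -= alpha * dL_dW1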
OUTPUT
source: @genekogan
Note the blurring of the hair. Because of the pixel-wise MSE loss, non-centered features are disproportionately penalized.
GENERATIVE ADVERSARIAL NETWORKS
GAN STRUCTURE
[Diagram: latent random vector → Generator → generated image; a generated or training image → Discriminator → yes/no]
TRAINING
Generator and Discriminator play a minimax game.
Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly IDing training vs. generated data.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

\mathcal{L}_D = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]

\mathcal{L}_G = \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))

http://arxiv.org/pdf/1406.2661v1.pdf
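A schematic PyTorch training step for these two losses (a sketch, not the talk's code: G, D, their optimizers, and a real-image batch x are assumed, with D outputting probabilities in (0, 1)):

import torch

def gan_step(G, D, opt_G, opt_D, x, z_dim=100):
    z = torch.randn(x.size(0), z_dim)

    # Discriminator: ascend log D(x) + log(1 - D(G(z))), i.e. descend its negative.
    opt_D.zero_grad()
    loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
    loss_D.backward()
    opt_D.step()

    # Generator: descend log(1 - D(G(z))), i.e. try to fool the Discriminator.
    opt_G.zero_grad()
    loss_G = torch.log(1 - D(G(z))).mean()
    loss_G.backward()
    opt_G.step()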
OUTPUT
http://arxiv.org/pdf/1511.06434v2.pdf
Unfortunately, GANs are only generative: there is no encoder mapping an image back into the latent space.
VAE+GAN
VAE+GAN STRUCTURE
[Diagram: Encoder → Generator → Discriminator. An original image O is encoded and reconstructed as G(E(O)); a latent sample S is decoded as G(S). The Discriminator scores O, G(S), and G(E(O)) as yes/no; the reconstruction is also compared against O with MSE.]
TRAINING
Train the Encoder, Generator, and Discriminator with separate optimizers.

Encoder: \mathcal{L}_E = D_{KL}(q_\phi(z) \| \mathcal{N}(0, I)) + \mathrm{MSE}(D_l(x), D_l(G(E(x))))
Generator: \mathcal{L}_G = \gamma \times \mathrm{MSE}(D_l(x), D_l(G(E(x)))) - \mathcal{L}_{GAN}
Discriminator: \mathcal{L}_D = \mathcal{L}_{GAN} = \| \log D(x) + \log(1 - D(G(E(x)))) + \log(1 - D(G(z))) \|_1
Loss terms: the VAE prior (KL divergence), the learned similarity metric (MSE on discriminator features D_l), and the GAN discriminator loss.
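A schematic PyTorch version of this three-optimizer update (a sketch under assumptions: E, G, D are modules, D_l returns an intermediate discriminator feature layer, and gamma weights the learned-similarity term):

import torch

def vaegan_step(E, G, D, D_l, opt_E, opt_G, opt_D, x, gamma=1e-3, z_dim=64):
    mu, log_var = E(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
    x_rec = G(z)                                   # G(E(x))
    x_gen = G(torch.randn(x.size(0), z_dim))       # G(z) with z from the prior

    # The three loss pieces named above.
    kl = -0.5 * torch.mean(1 + log_var - mu**2 - torch.exp(log_var))  # VAE prior
    sim = ((D_l(x) - D_l(x_rec)) ** 2).mean()      # learned similarity
    gan = (torch.log(D(x)) + torch.log(1 - D(x_rec))
           + torch.log(1 - D(x_gen))).mean()       # GAN discriminator loss

    # Separate optimizers: each sub-network gets gradients from its own loss only.
    g_E = torch.autograd.grad(kl + sim, list(E.parameters()), retain_graph=True)
    g_G = torch.autograd.grad(gamma * sim - gan, list(G.parameters()), retain_graph=True)
    g_D = torch.autograd.grad(gan, list(D.parameters()))
    for params, grads, opt in [(E.parameters(), g_E, opt_E),
                               (G.parameters(), g_G, opt_G),
                               (D.parameters(), g_D, opt_D)]:
        for p, g in zip(params, grads):
            p.grad = g
        opt.step()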
TAKEAWAY
http://arxiv.org/pdf/1512.09300v1.pdf
We are trying to get away from pixels to begin with, so why use pixel distance as the metric?
The learned similarity metric provides feature-level distance rather than pixel-level distance.
Result: the latent space of a GAN with the encoder of a VAE.
…BUT NOT THAT EASY TO TRAIN
GENERATIVE MOMENT MATCHING NETWORKS
DESCRIPTION
Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss.
Train the generative network to output a distribution whose moments match the dataset.

\mathcal{L}_{MMD^2} = \left\| \frac{1}{N} \sum_{i=0}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=0}^{M} \phi(y_j) \right\|^2

Expanding with the kernel trick:

\mathcal{L}_{MMD^2} = \frac{1}{N^2} \sum_{i=0}^{N} \sum_{i'=0}^{N} k(x_i, x_{i'}) - \frac{2}{MN} \sum_{i=0}^{N} \sum_{j=0}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=0}^{M} \sum_{j'=0}^{M} k(y_j, y_{j'})
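A numpy sketch of the kernelized form above, with a Gaussian RBF kernel as an assumed choice of k:

import numpy as np

def mmd2(x, y, bandwidth=1.0):
    # Gaussian RBF kernel: k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)).
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    # The three kernel-trick terms: data/data, cross, generated/generated.
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

# Example: data samples vs. (shifted) generated samples in 2-D.
x = np.random.randn(100, 2)
y = np.random.randn(100, 2) + 1.0
print(mmd2(x, y))  # noticeably > 0: the two distributions differ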
ADVERSARIAL AUTO-ENCODERS
DESCRIPTION
Want to create an auto-encoder whose "code space" has a distribution matching an arbitrary specified prior.
Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.
Train the encoder/decoder with reconstruction metrics.
Additionally: sample from the encoding space and train the encoder to produce samples indistinguishable from the specified prior.
[Diagram: an AE/reconstruction path plus a GAN/regularization path over the code space.]
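A schematic PyTorch sketch of those two phases (Enc, Dec, the code-space discriminator D, and the three optimizers are assumed; a unit Gaussian stands in for the arbitrary prior):

import torch

def aae_step(Enc, Dec, D, opt_ae, opt_D, opt_gen, x):
    # Phase 1, reconstruction: train encoder + decoder on pixel MSE.
    opt_ae.zero_grad()
    recon = ((Dec(Enc(x)) - x) ** 2).mean()
    recon.backward()
    opt_ae.step()

    # Phase 2a, regularization: D learns to tell prior samples from codes.
    opt_D.zero_grad()
    codes = Enc(x).detach()
    z_prior = torch.randn_like(codes)  # swap in any samplable prior here
    loss_D = -(torch.log(D(z_prior)) + torch.log(1 - D(codes))).mean()
    loss_D.backward()
    opt_D.step()

    # Phase 2b, regularization: the encoder tries to make its codes
    # indistinguishable from the prior.
    opt_gen.zero_grad()
    loss_gen = -torch.log(D(Enc(x))).mean()
    loss_gen.backward()
    opt_gen.step()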
SEMI-SUPERVISED
Regularize the encoding space; disentangle the encoding space.
[Examples: priors of 10 2D Gaussians and a Swiss roll.]
http://arxiv.org/pdf/1511.05644v1.pdf