Unsupervised Computer Vision: The Current State of the Art
TRANSCRIPT
Stitch Fix, Styling Algorithms Research Talk
TJ Torres, Data Scientist, Stitch Fix
WHY DEEP LEARNING?
Before deep learning, much of computer vision focused on hand-designed feature descriptors and image statistics.
Examples: SURF, MSER, corner detection.
Image Credit: http://www.mathworks.com/products/computer-vision/features.html
WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Leaderboard:

Team name | Entry description | Classification error | Localization error
GoogLeNet | No localization. Top-5 val score is 6.66% error. | 0.06656 | 0.606257
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431
VGG | a combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501
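To make the feature-extractor claim concrete, here is a minimal sketch (not from the talk) that pulls features from a pretrained ConvNet using Keras, whose blog is cited later in these slides. VGG16, the input size, and the random stand-in image are illustrative assumptions.

# A minimal sketch: a pretrained CNN as a generic feature extractor.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Drop the classifier head and keep the convolutional feature hierarchy.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

image = np.random.rand(1, 224, 224, 3) * 255.0  # stand-in for a real image batch
features = model.predict(preprocess_input(image))
print(features.shape)  # (1, 512): one 512-dimensional feature vector per image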
WHY DEEP LEARNING?
Convolution gives a local, translation-invariant feature hierarchy.
Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
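As a concrete illustration (a sketch using scipy; the 3x3 edge kernel and image size are arbitrary choices, not from the talk):

import numpy as np
from scipy.signal import convolve2d

# Slide a hypothetical 3x3 vertical-edge filter over a grayscale image.
image = np.random.rand(28, 28)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Each output value depends only on a local 3x3 patch, and the same weights
# are reused at every position: local and translation invariant.
feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map.shape)  # (26, 26)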
WHY DEEP LEARNING?
Layers build a feature hierarchy: edges, then curves, then object parts ("top of 3 shapes"), feeding a softmax output for classification.
Image Credit: http://parse.ele.tue.nl/education/cluster2
WHY DEEP LEARNING?
Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
LEARN MORE
http://cs231n.github.io/convolutional-networks/
WHY UNSUPERVISED?
Unfortunately, very few image sets come with labels.
What are the best labels for fashion/style?
THE UNSUPERVISED M.O.
Try to learn an embedding space of the image data (generally includes a generative process).
1) Train an encoder and decoder to encode, then reconstruct, the image.
2) Generate an image from a random embedding and reinforce "good"-looking images.
DOWNSIDES
Higher-dimensional embeddings are non-interpretable.
Latent distributions may contain gaps: no sensible continuum.
OUTLINE
1. Variational Auto-encoders (VAE)
2. Generative Adversarial Networks (GAN)
3. The combination of the two (VAE/GAN)
4. Generative Moment Matching Networks (GMMN)
5. Adversarial Auto-encoders (AAE), briefly
stitchfix/fauxtograph
VARIATIONAL AUTO-ENCODERS
ENCODING
Pass the input through convolution layers to produce a latent representation.
[Diagram: input → convolution → latent]
VARIATIONAL STEP
Sample from a distribution whose parameters (mean \mu and variance \sigma^2) are output by the encoder:

q_\phi(z) = \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)} I)
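In code, this sampling step is the reparameterization trick. A minimal numpy sketch, where mu and log_var stand in for encoder outputs (the batch and latent sizes are arbitrary):

import numpy as np

# Stand-ins for encoder outputs: batch of 8 images, 32 latent dimensions.
mu = np.random.randn(8, 32)
log_var = np.random.randn(8, 32)

# Reparameterization trick: draw z ~ N(mu, sigma^2 I) as a deterministic
# function of (mu, sigma) plus unit Gaussian noise, so gradients can flow
# back through mu and sigma during training.
eps = np.random.randn(*mu.shape)
z = mu + np.exp(0.5 * log_var) * eps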
DECODING
Pass the sampled latent vector through deconvolution layers to produce the output reconstruction.
[Diagram: sampled latent → deconvolution → reconstruction]
CALCULATE LOSS

\mathcal{L}(x) = D_{KL}(q_\phi(z) \| \mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{\mathrm{out}})
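A numpy sketch of this loss, using the closed-form KL divergence between N(mu, sigma^2 I) and N(0, I); x, y_out, mu, and log_var are assumed to come from the network above:

import numpy as np

def vae_loss(x, y_out, mu, log_var):
    # Pixel-wise reconstruction term.
    mse = np.mean((x - y_out) ** 2)
    # Closed-form KL(N(mu, sigma^2 I) || N(0, I)), summed over latent
    # dimensions and averaged over the batch.
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return kl + mse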
UPDATE WEIGHTS
Update each weight by gradient descent, with the gradient computed via the chain rule (backpropagation):

W^{(l)*}_{ij} = W^{(l)}_{ij} - \alpha \frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}}

\frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}} = \left(\frac{\partial \mathcal{L}}{\partial x_{\mathrm{out}}}\right)\left(\frac{\partial x_{\mathrm{out}}}{\partial f^{(n-1)}}\right)\cdots\left(\frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}}\right)
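A toy numpy version of this update for a two-layer network (the tanh activation, MSE loss, and shapes are illustrative assumptions; only the chain-rule structure matters):

import numpy as np

# Toy two-layer network: x -> tanh(x W1) -> h W2 -> y_out.
x = np.random.randn(4, 8)
target = np.random.randn(4, 1)
W1 = np.random.randn(8, 16)
W2 = np.random.randn(16, 1)

h = np.tanh(x @ W1)                        # hidden features f^(1)
y_out = h @ W2                             # network output x_out
loss = np.mean((y_out - target) ** 2)

# Chain rule, walking back from the loss toward earlier layers.
dL_dy = 2 * (y_out - target) / x.shape[0]  # dL/dx_out
dL_dW2 = h.T @ dL_dy
dL_dh = dL_dy @ W2.T
dL_dW1 = x.T @ (dL_dh * (1 - h**2))        # through the tanh nonlinearity

# Gradient-descent weight update with learning rate alpha.
alpha = 0.01
W2 -= alpha * dL_dW2
W1 -= alpha * dL_dW1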
OUTPUT
source: @genekogan
Note the blurring of the hair. Because of the pixel-wise MSE loss, non-centered features are disproportionately penalized.
GENERATIVE ADVERSARIAL NETWORKS
GAN STRUCTURE
[Diagram: latent random vector → Generator → generated image; a generated or training image → Discriminator → yes/no]
TRAINING
Generator and Discriminator play a minimax game.
Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly IDing training vs. generated data.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

\mathcal{L}_D = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]

\mathcal{L}_G = \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))

http://arxiv.org/pdf/1406.2661v1.pdf
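A schematic PyTorch training step for these two losses (a sketch, not the talk's code: G, D, their optimizers, and a real-image batch x are assumed, with D outputting probabilities in (0, 1)):

import torch

def gan_step(G, D, opt_G, opt_D, x, z_dim=100):
    z = torch.randn(x.size(0), z_dim)

    # Discriminator: ascend log D(x) + log(1 - D(G(z))), i.e. descend its negative.
    opt_D.zero_grad()
    loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
    loss_D.backward()
    opt_D.step()

    # Generator: descend log(1 - D(G(z))), i.e. try to fool the Discriminator.
    opt_G.zero_grad()
    loss_G = torch.log(1 - D(G(z))).mean()
    loss_G.backward()
    opt_G.step()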
OUTPUT
http://arxiv.org/pdf/1511.06434v2.pdf
Unfortunately, GANs are only generative: there is no encoder mapping an image back into the latent space.
VAE+GAN
VAE+GAN STRUCTURE
[Diagram: Encoder → Generator → Discriminator. An original image O is encoded and reconstructed as G(E(O)); a latent sample S is decoded as G(S). The Discriminator scores O, G(S), and G(E(O)) as yes/no; the reconstruction is also compared against O with MSE.]
TRAINING
Train the Encoder, Generator, and Discriminator with separate optimizers.

Encoder: \mathcal{L}_E = D_{KL}(q_\phi(z) \| \mathcal{N}(0, I)) + \mathrm{MSE}(D_l(x), D_l(G(E(x))))
Generator: \mathcal{L}_G = \gamma \times \mathrm{MSE}(D_l(x), D_l(G(E(x)))) - \mathcal{L}_{GAN}
Discriminator: \mathcal{L}_D = \mathcal{L}_{GAN} = \| \log D(x) + \log(1 - D(G(E(x)))) + \log(1 - D(G(z))) \|_1
Loss terms: the VAE prior (KL divergence), the learned similarity metric (MSE on discriminator features D_l), and the GAN discriminator loss.
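A schematic PyTorch version of this three-optimizer update (a sketch under assumptions: E, G, D are modules, D_l returns an intermediate discriminator feature layer, and gamma weights the learned-similarity term):

import torch

def vaegan_step(E, G, D, D_l, opt_E, opt_G, opt_D, x, gamma=1e-3, z_dim=64):
    mu, log_var = E(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
    x_rec = G(z)                                   # G(E(x))
    x_gen = G(torch.randn(x.size(0), z_dim))       # G(z) with z from the prior

    # The three loss pieces named above.
    kl = -0.5 * torch.mean(1 + log_var - mu**2 - torch.exp(log_var))  # VAE prior
    sim = ((D_l(x) - D_l(x_rec)) ** 2).mean()      # learned similarity
    gan = (torch.log(D(x)) + torch.log(1 - D(x_rec))
           + torch.log(1 - D(x_gen))).mean()       # GAN discriminator loss

    # Separate optimizers: each sub-network gets gradients from its own loss only.
    g_E = torch.autograd.grad(kl + sim, list(E.parameters()), retain_graph=True)
    g_G = torch.autograd.grad(gamma * sim - gan, list(G.parameters()), retain_graph=True)
    g_D = torch.autograd.grad(gan, list(D.parameters()))
    for params, grads, opt in [(E.parameters(), g_E, opt_E),
                               (G.parameters(), g_G, opt_G),
                               (D.parameters(), g_D, opt_D)]:
        for p, g in zip(params, grads):
            p.grad = g
        opt.step()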
TAKEAWAY
http://arxiv.org/pdf/1512.09300v1.pdf
We are trying to get away from pixels to begin with, so why use pixel distance as the metric?
The learned similarity metric provides feature-level distance rather than pixel-level distance.
Result: the latent space of a GAN with the encoder of a VAE.
…BUT NOT THAT EASY TO TRAIN
GENERATIVE MOMENT MATCHING NETWORKS
DESCRIPTION
Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss.
Train the generative network to output a distribution whose moments match the dataset.

\mathcal{L}_{MMD^2} = \left\| \frac{1}{N} \sum_{i=0}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=0}^{M} \phi(y_j) \right\|^2

Expanding with the kernel trick:

\mathcal{L}_{MMD^2} = \frac{1}{N^2} \sum_{i=0}^{N} \sum_{i'=0}^{N} k(x_i, x_{i'}) - \frac{2}{MN} \sum_{i=0}^{N} \sum_{j=0}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=0}^{M} \sum_{j'=0}^{M} k(y_j, y_{j'})
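A numpy sketch of the kernelized form above, with a Gaussian RBF kernel as an assumed choice of k:

import numpy as np

def mmd2(x, y, bandwidth=1.0):
    # Gaussian RBF kernel: k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)).
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    # The three kernel-trick terms: data/data, cross, generated/generated.
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

# Example: data samples vs. (shifted) generated samples in 2-D.
x = np.random.randn(100, 2)
y = np.random.randn(100, 2) + 1.0
print(mmd2(x, y))  # noticeably > 0: the two distributions differ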
ADVERSARIAL AUTO-ENCODERS
DESCRIPTION
Want to create an auto-encoder whose "code space" has a distribution matching an arbitrary specified prior.
Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.
Train the encoder/decoder with reconstruction metrics.
Additionally: sample from the encoding space and train the encoder to produce samples indistinguishable from the specified prior.
[Diagram: an AE/reconstruction path plus a GAN/regularization path over the code space.]
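A schematic PyTorch sketch of those two phases (Enc, Dec, the code-space discriminator D, and the three optimizers are assumed; a unit Gaussian stands in for the arbitrary prior):

import torch

def aae_step(Enc, Dec, D, opt_ae, opt_D, opt_gen, x):
    # Phase 1, reconstruction: train encoder + decoder on pixel MSE.
    opt_ae.zero_grad()
    recon = ((Dec(Enc(x)) - x) ** 2).mean()
    recon.backward()
    opt_ae.step()

    # Phase 2a, regularization: D learns to tell prior samples from codes.
    opt_D.zero_grad()
    codes = Enc(x).detach()
    z_prior = torch.randn_like(codes)  # swap in any samplable prior here
    loss_D = -(torch.log(D(z_prior)) + torch.log(1 - D(codes))).mean()
    loss_D.backward()
    opt_D.step()

    # Phase 2b, regularization: the encoder tries to make its codes
    # indistinguishable from the prior.
    opt_gen.zero_grad()
    loss_gen = -torch.log(D(Enc(x))).mean()
    loss_gen.backward()
    opt_gen.step()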
SEMI-SUPERVISED
Regularize the encoding space; disentangle the encoding space.
[Examples: priors of 10 2D Gaussians and a Swiss roll.]
http://arxiv.org/pdf/1511.05644v1.pdf