Unsupervised Computer Vision: The Current State of the Art
Stitch Fix, Styling Algorithms Research Talk
TJ Torres, Data Scientist, Stitch Fix



Page 1: Unsupervised Computer Vision: The Current State of the Art

Unsupervised Computer Vision

Stitch Fix, Styling Algorithms Research Talk

The Current State of the Art

TJ Torres, Data Scientist, Stitch Fix

Page 2: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Before deep learning, much of computer vision focused on hand-crafted feature descriptors and image statistics (e.g. SURF, MSER, corner detectors).

Image Credit: http://www.mathworks.com/products/computer-vision/features.html

Page 3: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Page 4: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

Leaderboard (classification error, localization error):

GoogLeNet: no localization; top-5 val score is 6.66% error. Classification error 0.06656, localization error 0.606257.
VGG: a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated. Classification error 0.07325, localization error 0.256167.
VGG: a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated. Classification error 0.07337, localization error 0.255431.
VGG: a combination of multiple ConvNets (by averaging). Classification error 0.07405, localization error 0.253231.
VGG: a combination of multiple ConvNets (fusion weights learnt on the validation set). Classification error 0.07407, localization error 0.253501.

Page 5: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Turns out NNs are great feature extractors.

Leaderboard (classification error, localization error):

GoogLeNet: no localization; top-5 val score is 6.66% error. Classification error 0.06656, localization error 0.606257.
VGG: a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated. Classification error 0.07325, localization error 0.256167.
VGG: a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated. Classification error 0.07337, localization error 0.255431.
VGG: a combination of multiple ConvNets (by averaging). Classification error 0.07405, localization error 0.253231.
VGG: a combination of multiple ConvNets (fusion weights learnt on the validation set). Classification error 0.07407, localization error 0.253501.

Convolution gives a local, translation-invariant feature hierarchy.
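A minimal sketch of this idea in PyTorch (an assumption; the talk itself gives no code, and the channel counts and layer choices are illustrative): stacked convolution and pooling layers build the local, translation-invariant feature hierarchy described above.

# Illustrative only: stacked convolutions + pooling as a feature extractor.
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features (edges)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling adds translation invariance
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level features (curves, textures)
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level features (object parts)
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # pool down to a 64-dim feature vector
)

x = torch.randn(1, 3, 64, 64)   # dummy RGB image batch
print(features(x).shape)        # torch.Size([1, 64, 1, 1])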

Page 6: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Page 7: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Feature hierarchy: edges, then curves, then tops of shapes.

Softmax output: classification.

Image Credit: http://parse.ele.tue.nl/education/cluster2

Page 8: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

Page 9: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

Page 10: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

Page 11: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

Page 12: Unsupervised Computer Vision: The Current State of the Art

WHY DEEP LEARNING?

Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

Page 13: Unsupervised Computer Vision: The Current State of the Art

LEARN MORE

http://cs231n.github.io/convolutional-networks/

Page 14: Unsupervised Computer Vision: The Current State of the Art

WHY UNSUPERVISED?

Unfortunately very few image sets come with labels.

Page 15: Unsupervised Computer Vision: The Current State of the Art

WHY UNSUPERVISED?

Unfortunately very few image sets come with labels.

What are the best labels for fashion/style?

Page 16: Unsupervised Computer Vision: The Current State of the Art

THE UNSUPERVISED MO

Try to learn an embedding space for the image data (this generally includes a generative process).

Page 17: Unsupervised Computer Vision: The Current State of the Art

THE UNSUPERVISED MO

Try to learn an embedding space for the image data (this generally includes a generative process).

1) Train an encoder and decoder to encode and then reconstruct an image.

Page 18: Unsupervised Computer Vision: The Current State of the Art

THE UNSUPERVISED MO

Try to learn an embedding space for the image data (this generally includes a generative process).

1) Train an encoder and decoder to encode and then reconstruct an image.

2) Generate an image from a random embedding and reinforce “good”-looking images.

Page 19: Unsupervised Computer Vision: The Current State of the Art

THE UNSUPERVISED MO

Try to learn an embedding space for the image data (this generally includes a generative process).

1) Train an encoder and decoder to encode and then reconstruct an image (a minimal sketch follows below).

2) Generate an image from a random embedding and reinforce “good”-looking images.

DOWNSIDES

Higher-dimensional embeddings are non-interpretable.

Latent distributions may contain gaps, leaving no sensible continuum.
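As referenced in step 1, here is a minimal sketch of the encode-then-reconstruct training loop in PyTorch (an assumption; the talk's own code is the Chainer-based stitchfix/fauxtograph, and the layer sizes and image shape here are illustrative).

# Illustrative only: train an encoder/decoder pair on reconstruction error, no labels.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(),
                        nn.Linear(64 * 64 * 3, 128), nn.ReLU(),
                        nn.Linear(128, 32))                    # 32-dim embedding
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 64 * 64 * 3))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(16, 3, 64, 64)               # dummy batch of images
recon = decoder(encoder(x)).view_as(x)      # encode, then reconstruct
loss = ((recon - x) ** 2).mean()            # pixel-wise reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()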

Page 20: Unsupervised Computer Vision: The Current State of the Art

OUTLINE

1. Variational Auto-encoders (VAE)

2. Generative Adversarial Networks (GAN)

3. The combination of the two (VAE/GAN)

4. Generative Moment Matching Networks (GMMN)

5. Adversarial Auto-encoders (AAE?)

Briefly

Page 21: Unsupervised Computer Vision: The Current State of the Art

OUTLINE

1. Variational Auto-encoders (VAE)

2. Generative Adversarial Networks (GAN)

3. The combination of the two (VAE/GAN)

4. Generative Moment Matching Networks (GMMN)

5. Adversarial Auto-encoders (AAE?)

Briefly

stitchfix/fauxtograph

Page 22: Unsupervised Computer Vision: The Current State of the Art

VARIATIONAL AUTO-ENCODERS

Page 23: Unsupervised Computer Vision: The Current State of the Art

ENCODING

input

Convolution

Page 24: Unsupervised Computer Vision: The Current State of the Art

ENCODING

input

Convolution

Page 25: Unsupervised Computer Vision: The Current State of the Art

ENCODING

latent

Convolution

Page 26: Unsupervised Computer Vision: The Current State of the Art

VARIATIONAL STEP

sample from distribution

$q_\phi(z) = \mathcal{N}\!\left(z;\, \mu^{(i)}, \sigma^{2(i)} I\right)$
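A minimal sketch of this sampling step in PyTorch (an assumption, not the talk's Chainer-based fauxtograph code): the reparameterization trick draws z from N(mu, sigma^2 I) while keeping the sample differentiable with respect to mu and log_var.

# Illustrative only: reparameterized sampling z = mu + sigma * eps, with eps ~ N(0, I).
import torch

def sample_latent(mu, log_var):
    std = torch.exp(0.5 * log_var)   # sigma
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + eps * std

mu = torch.zeros(4, 20)              # encoder outputs (dummy values here)
log_var = torch.zeros(4, 20)
z = sample_latent(mu, log_var)       # shape (4, 20)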

Page 27: Unsupervised Computer Vision: The Current State of the Art

VARIATIONAL STEP

sampled

Deconvolution

Page 28: Unsupervised Computer Vision: The Current State of the Art

DECODING

output

Deconvolution

Page 29: Unsupervised Computer Vision: The Current State of the Art

DECODING

reconstruction

Deconvolution

Page 30: Unsupervised Computer Vision: The Current State of the Art

CALCULATE LOSS

$\mathcal{L}(x) = D_{KL}\!\left(q_\phi(z)\,\|\,\mathcal{N}(0, I)\right) + \mathrm{MSE}(x, y_{\mathrm{out}})$
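A minimal sketch of this loss in PyTorch (an assumption): the KL term has a closed form when q_phi is Gaussian, and the reconstruction term is a pixel-wise MSE; mu and log_var are assumed to be the encoder's outputs.

# Illustrative only: KL(q_phi(z) || N(0, I)) + MSE(x, y_out).
import torch
import torch.nn.functional as F

def vae_loss(x, y_out, mu, log_var):
    # Closed-form Gaussian KL, summed over latent dims and averaged over the batch.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
    mse = F.mse_loss(y_out, x)       # pixel-wise reconstruction error
    return kl + mse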

Page 31: Unsupervised Computer Vision: The Current State of the Art

UPDATE WEIGHTS

$W^{(l)*}_{ij} = W^{(l)}_{ij}\left(1 - \alpha\,\frac{\partial \mathcal{L}}{\partial W_{ij}}\right)$

$\frac{\partial \mathcal{L}}{\partial W^{(l)}_{ij}} = \left(\frac{\partial \mathcal{L}}{\partial x_{\mathrm{out}}}\right)\left(\frac{\partial x_{\mathrm{out}}}{\partial f^{(n-1)}}\right)\cdots\left(\frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}}\right)$
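A minimal sketch of the update in PyTorch autograd (an assumption), written in the standard additive form W <- W - alpha * dL/dW rather than the slide's multiplicative rendering: backpropagation applies the chain rule above to produce dL/dW for every layer.

# Illustrative only: backprop (chain rule) followed by a gradient-descent update.
import torch

W = torch.randn(10, 5, requires_grad=True)
x = torch.randn(3, 10)
target = torch.randn(3, 5)

loss = ((x @ W - target) ** 2).mean()   # L
loss.backward()                         # chain rule fills W.grad with dL/dW

alpha = 0.01
with torch.no_grad():
    W -= alpha * W.grad                 # W <- W - alpha * dL/dW
    W.grad.zero_()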

Page 32: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

Because of the pixel-wise MSE loss, non-centered features are disproportionately penalized.

source: @genekogan

Page 33: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

Because of the pixel-wise MSE loss, non-centered features are disproportionately penalized. Note the blurring of the hair.

source: @genekogan

Page 34: Unsupervised Computer Vision: The Current State of the Art

GENERATIVE ADVERSARIAL NETWORKS

Page 35: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Latent Random Vector

Generator Discriminator

Page 36: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Generator, Discriminator

Filtered

Page 37: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Generator, Discriminator

Image

Page 38: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Generator, Discriminator

Gen/Train Image

Page 39: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Generator, Discriminator

Filtered

Page 40: Unsupervised Computer Vision: The Current State of the Art

GAN STRUCTURE

Generator, Discriminator

Yes/No

Page 41: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Generator, Discriminator

Generator and Discriminator play a minimax game.

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]$

Page 42: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Generator, Discriminator

Generator and Discriminator play a minimax game.

Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly identifying training vs. generated data.

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]$

Page 43: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Generator, Discriminator

Generator and Discriminator play a minimax game.

Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly identifying training vs. generated data.

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]$

$\mathcal{L}_D = \frac{1}{m}\sum_{i=1}^{m}\left[\log D\!\left(x^{(i)}\right) + \log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right)\right]$

$\mathcal{L}_G = \frac{1}{m}\sum_{i=1}^{m}\log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right)$

Page 44: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Generator, Discriminator

Generator and Discriminator play a minimax game.

Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly identifying training vs. generated data.

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]$

$\mathcal{L}_D = \frac{1}{m}\sum_{i=1}^{m}\left[\log D\!\left(x^{(i)}\right) + \log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right)\right]$

$\mathcal{L}_G = \frac{1}{m}\sum_{i=1}^{m}\log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right)$

http://arxiv.org/pdf/1406.2661v1.pdf
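A minimal sketch of one alternating training step in PyTorch (an assumption; G, D, the optimizers, and z_dim are placeholder names, and D is assumed to end in a sigmoid so it outputs probabilities), following the batch losses above.

# Illustrative only: one Discriminator update and one Generator update.
import torch

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=100):
    z = torch.randn(x_real.size(0), z_dim)

    # Discriminator: ascend log D(x) + log(1 - D(G(z))).
    d_loss = -(torch.log(D(x_real)) + torch.log(1 - D(G(z).detach()))).mean()
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator: descend log(1 - D(G(z))).
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()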

Page 45: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Page 46: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Page 47: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Page 48: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1511.06434v2.pdf

Unfortunately, this is only generative: there is no encoder to map an image back into the latent space.

Page 49: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN

Page 50: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN STRUCTURE

Encoder, Generator, Discriminator

O

Page 51: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN STRUCTURE

Encoder, Generator, Discriminator

O, S

Page 52: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN STRUCTURE

Encoder, Generator, Discriminator

O, S, O

G(S)

G(E(O))

Page 53: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN STRUCTURE

Encoder, Generator, Discriminator

O, S, O

G(S)

G(E(O))

Page 54: Unsupervised Computer Vision: The Current State of the Art

VAE+GAN STRUCTURE

Encoder, Generator, Discriminator

O, S, O

G(S)

G(E(O))

Yes/No

MSE

Page 55: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Encoder, Generator, Discriminator

Train the Encoder, Generator, and Discriminator with separate optimizers.

$\mathcal{L}_E = D_{KL}\!\left(q_\phi(z)\,\|\,\mathcal{N}(0, I)\right) + \mathrm{MSE}\!\left(D_l(x), D_l(G(E(x)))\right)$

$\mathcal{L}_G = \gamma \times \mathrm{MSE}\!\left(D_l(x), D_l(G(E(x)))\right) - \mathcal{L}_{GAN}$

$\mathcal{L}_D = \mathcal{L}_{GAN} = \left\|\log D(x) + \log\!\left(1 - D(G(E(x)))\right) + \log\!\left(1 - D(G(z))\right)\right\|_1$

Page 56: Unsupervised Computer Vision: The Current State of the Art

TRAINING

Encoder, Generator, Discriminator

Train the Encoder, Generator, and Discriminator with separate optimizers.

$\mathcal{L}_E = D_{KL}\!\left(q_\phi(z)\,\|\,\mathcal{N}(0, I)\right) + \mathrm{MSE}\!\left(D_l(x), D_l(G(E(x)))\right)$
(VAE prior + learned similarity)

$\mathcal{L}_G = \gamma \times \mathrm{MSE}\!\left(D_l(x), D_l(G(E(x)))\right) - \mathcal{L}_{GAN}$
(learned similarity and GAN terms)

$\mathcal{L}_D = \mathcal{L}_{GAN} = \left\|\log D(x) + \log\!\left(1 - D(G(E(x)))\right) + \log\!\left(1 - D(G(z))\right)\right\|_1$
(GAN discriminator loss)
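A minimal sketch of these three losses in PyTorch (an assumption; E, G, D, and the intermediate-feature function D_l are placeholder callables, E is assumed to return (mu, log_var), and gamma is illustrative), computed once per batch before stepping the three separate optimizers.

# Illustrative only: the VAE/GAN encoder, generator, and discriminator losses.
import torch
import torch.nn.functional as F

def vaegan_losses(E, G, D, D_l, x, z_prior, gamma=1e-3):
    mu, log_var = E(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
    x_rec = G(z)                                           # G(E(x))

    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
    sim = F.mse_loss(D_l(x), D_l(x_rec))                   # learned-similarity MSE

    l_gan = (torch.log(D(x)) + torch.log(1 - D(x_rec)) +
             torch.log(1 - D(G(z_prior)))).abs().mean()    # GAN discriminator loss

    l_e = kl + sim                  # VAE prior + learned similarity
    l_g = gamma * sim - l_gan       # learned similarity vs. GAN term
    l_d = l_gan
    return l_e, l_g, l_d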

Page 57: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1512.09300v1.pdf

Page 58: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1512.09300v1.pdf

Page 59: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1512.09300v1.pdf

Page 60: Unsupervised Computer Vision: The Current State of the Art

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

Page 61: Unsupervised Computer Vision: The Current State of the Art

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

Learned similarity metric provides feature-level distance rather than pixel-level.

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

Page 62: Unsupervised Computer Vision: The Current State of the Art

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

Learned similarity metric provides feature-level distance rather than pixel-level.

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

Latent space of a GAN with the encoder of a VAE

Page 63: Unsupervised Computer Vision: The Current State of the Art

TAKEAWAY

http://arxiv.org/pdf/1512.09300v1.pdf

Learned similarity metric provides feature-level distance rather than pixel-level.

We are trying to get away from pixels to begin with, so why use pixel distance as the metric?

Latent space of a GAN with the encoder of a VAE

…BUT NOT THAT EASY TO TRAIN

Page 64: Unsupervised Computer Vision: The Current State of the Art

GENERATIVE MOMENT MATCHING NETWORKS

Page 65: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss.

Train a generative network to output a distribution whose moments match those of the dataset.

$\mathcal{L}_{MMD^2} = \left\|\frac{1}{N}\sum_{i=0}^{N}\phi(x_i) - \frac{1}{M}\sum_{j=0}^{M}\phi(y_j)\right\|^2$

$\mathcal{L}_{MMD^2} = \frac{1}{N^2}\sum_{i=0}^{N}\sum_{i'=0}^{N} k(x_i, x_{i'}) - \frac{2}{MN}\sum_{i=0}^{N}\sum_{j=0}^{M} k(x_i, y_j) + \frac{1}{M^2}\sum_{j=0}^{M}\sum_{j'=0}^{M} k(y_j, y_{j'})$
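A minimal sketch of the kernelized squared-MMD loss above in PyTorch (an assumption), using a Gaussian kernel; the bandwidth sigma and the sample shapes are illustrative.

# Illustrative only: squared MMD between data samples x and generated samples y.
import torch

def mmd2(x, y, sigma=1.0):
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)                  # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))       # Gaussian kernel
    # The mean of each kernel matrix gives the 1/N^2, 2/MN, and 1/M^2 terms above.
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

x = torch.randn(128, 32)   # data samples (e.g. in a feature/code space)
y = torch.randn(128, 32)   # generated samples
loss = mmd2(x, y)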

Page 66: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

Page 67: Unsupervised Computer Vision: The Current State of the Art

OUTPUT

http://arxiv.org/pdf/1502.02761v1.pdf

Page 68: Unsupervised Computer Vision: The Current State of the Art

ADVERSARIAL AUTO-ENCODERS

Page 69: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

Want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.

Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.

Page 70: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

Want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.

Like a VAE, but instead of using the Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.

Train the encoder/decoder with reconstruction metrics.

Additionally: sample from the encoding space and train the encoder to produce samples indistinguishable from the specified prior.
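A minimal sketch of one training step in PyTorch (an assumption; enc, dec, disc, the optimizers, and sample_prior are placeholder names, and disc is assumed to output probabilities), covering the two phases described above.

# Illustrative only: reconstruction phase, then adversarial regularization of the code space.
import torch
import torch.nn.functional as F

def aae_step(enc, dec, disc, opt_ae, opt_d, opt_e, x, sample_prior):
    # 1) Reconstruction: train encoder + decoder on reconstruction error.
    rec_loss = F.mse_loss(dec(enc(x)), x)
    opt_ae.zero_grad()
    rec_loss.backward()
    opt_ae.step()

    # 2a) Discriminator: separate samples from the chosen prior from encoder codes.
    z_prior = sample_prior(x.size(0))      # e.g. N(0, I), a Gaussian mixture, a swiss roll
    d_loss = -(torch.log(disc(z_prior)) + torch.log(1 - disc(enc(x).detach()))).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2b) Encoder: fool the discriminator, pushing the code distribution toward the prior.
    e_loss = -torch.log(disc(enc(x))).mean()
    opt_e.zero_grad()
    e_loss.backward()
    opt_e.step()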

Page 71: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

Page 72: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

GAN / Regularization

Page 73: Unsupervised Computer Vision: The Current State of the Art

DESCRIPTION

GAN / Regularization

AE / Reconstruction

Page 74: Unsupervised Computer Vision: The Current State of the Art

SEMI-SUPERVISED

Regularize encoding space

Disentangle encoding space

Page 75: Unsupervised Computer Vision: The Current State of the Art

SEMI-SUPERVISED

10 2D Gaussians, Swiss roll

http://arxiv.org/pdf/1511.05644v1.pdf

Page 76: Unsupervised Computer Vision: The Current State of the Art

SEMI-SUPERVISED

http://arxiv.org/pdf/1511.05644v1.pdf

Page 77: Unsupervised Computer Vision: The Current State of the Art