
Conditional Adversarial Networks (or “mapping from A to B”)

CS448V — Computational Video Manipulation

May 22nd, 2019

Why? Cool! Trendy! (Google Scholar)

Hundreds of applications

and follow-up works…

CycleGAN

Pix2Pix


Enhancing Transitions

Single-Photo Facial Animation

Text-based Editing

Few-Shot Reenactment

Digital Humans

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)

Convolutional Neural Network

Components?

• 2D Convolution Layers (Conv2D)

• Subsampling Layers (MaxPool, …)

• Non-linearity Layers (ReLU, …)

• Normalization Layers (BatchNorm, …)

• Upsampling Layers (TransposedConv, …)

• …
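As a concrete illustration (a minimal PyTorch sketch; the layer sizes are assumptions for demonstration, not from the slides), each component maps directly onto a library module:

```python
import torch.nn as nn

# One module per component from the list above (sizes are illustrative).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)  # 2D convolution
pool = nn.MaxPool2d(kernel_size=2)                               # subsampling
relu = nn.ReLU()                                                 # non-linearity
norm = nn.BatchNorm2d(num_features=16)                           # normalization
up   = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)       # upsampling
```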

Convolution

[Figure: a 32x32x3 input image (width 32, height 32, depth 3) and a 5x5x3 filter]

Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”

Convolution

[Figure: the 5x5x3 filter placed on a 5x5x3 chunk of the 32x32x3 image]

Result: 1 number, the dot product between the filter and a small 5x5x3 chunk of the image, i.e., a 5x5x3 = 75-dimensional dot product plus a bias: $w^T x + b$

Convolution

[Figure: sliding the 5x5x3 filter over the 32x32x3 image yields a 28x28x1 activation map]

Convolve (slide) over all spatial locations. Why 28x28? With a 5x5 filter, stride 1, and no padding: 32 − 5 + 1 = 28.
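The same computation in a few lines (a PyTorch sketch; random weights stand in for a learned filter):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)  # a 32x32x3 image (PyTorch layout: N, C, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)  # one 5x5x3 filter + bias

activation_map = conv(image)       # slide the filter over all spatial locations
print(activation_map.shape)        # torch.Size([1, 1, 28, 28]), since 32 - 5 + 1 = 28
```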

Convolution

[Figure: 32x32x3 image → 28x28x1 activation map]

Invariant to? Translation: yes, since the same filter is applied at every spatial location (strictly speaking, convolution is translation-equivariant). Rotation: no. Scaling: no.

Convolution Layer

[Figure: a convolution layer applies a set of filters to the 32x32x3 image; stacking the resulting activation maps gives an activation tensor]

Convolutional Neural Network

[Figure: 32x32x3 image → Convolution + ReLU with e.g. 6 filters of size 5x5x3 → 28x28x6 tensor]

Convolutional Neural Network

[Figure: 32x32x3 image → Conv + ReLU (6 filters, 5x5x3) → 28x28x6 tensor → Conv + ReLU (10 filters, 5x5x6) → 24x24x10 tensor → …]
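Stacked, this chain reproduces the shapes above (a PyTorch sketch with untrained weights):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # 32x32x3 -> 28x28x6
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=5),  # 28x28x6 -> 24x24x10
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)
print(net(x).shape)  # torch.Size([1, 10, 24, 24])
```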

Convolutional Neural Networks

[LeNet-5, LeCun et al. 1998]

Feature Hierarchy

Learn the features from data instead of hand engineering them!

(If enough data is available)

U-Net

Skip connections: “Propagate low-level features directly, helps with details”
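A minimal sketch of the idea (toy sizes assumed, not the pix2pix generator): the encoder’s features skip past the bottleneck and are concatenated onto the decoder’s output.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 16, 4, stride=2, padding=1)          # 256 -> 128
        self.up = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)  # 128 -> 256
        self.out = nn.Conv2d(16 + 3, 3, 3, padding=1)                 # after the skip concat

    def forward(self, x):
        h = torch.relu(self.down(x))
        h = torch.relu(self.up(h))
        h = torch.cat([h, x], dim=1)  # skip connection: low-level features flow directly
        return self.out(h)

print(TinyUNet()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```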

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix


Generative Modeling

We want to learn $p(X)$ from data, such that we can “sample from it”!

Training Data: $\{\mathbf{x}_i\}_{i=1}^{N}$ → Density Function: $p(X)$ → New Samples: $\mathbf{x} \sim p(X)$

“more of the same!”

Generative 2D Face Modeling

The world needs more celebrities … or not … ?

Training Data: $\{\mathbf{x}_i\}_{i=1}^{N}$ → New Samples: $\mathbf{x} \sim p(X)$

3.5 Years of Progress on Faces

[Figure: generated face samples, up to 2018; see https://thispersondoesnotexist.com]

StyleGAN - Interpolation

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)


Image-to-Image Translation

[Figures: example input → output image pairs]

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Paired!

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Neural Network

G

“What should I do?”

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

“How should I do it?”

G

“What should I do?”
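Concretely, one supervised training step under this objective might look as follows (a sketch with a placeholder network and random data, and a hand-designed loss, foreshadowing the problem below):

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)     # placeholder generator, not the paper's network
opt = torch.optim.Adam(G.parameters(), lr=2e-4)
L = nn.MSELoss()                      # a hand-designed loss

x = torch.randn(8, 3, 64, 64)         # paired batch: inputs x ...
y = torch.randn(8, 3, 64, 64)         # ... and ground-truth outputs y

opt.zero_grad()
loss = L(G(x), y)                     # E[L(G(x), y)], estimated on the batch
loss.backward()
opt.step()
```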

Be careful what you wish for!

$L(\hat{\mathbf{y}}, \mathbf{y}) = \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2$

Regression to the mean: the $\ell_2$ loss is minimized by the average of all plausible outputs, so results come out blurry!

Automate Design of the Loss?

Deep learning got rid of handcrafted features. Can we also get rid of handcrafting the loss function?

A universal loss function?

Discriminator as a Loss Function

[Figure: a Discriminator (classifier) takes an image and answers “Real or Fake?”]

Conditional GAN

[Figure: Generator (G) maps Input $\mathbf{x}$ to an output image; a Discriminator (D) judges “Real or Fake?”]

G tries to synthesize fake images that fool D.

D tries to tell real from fake.

Conditional GAN (Discriminator)

[Figure: D scores the generated output as Fake (0.9), target “1”, and the ground truth $\mathbf{y}$ as Real (0.1), target “0”]

$\arg\max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

D tries to identify the fakes and the real images.

Conditional GAN (Generator)

[Figure: D scores G’s output; G wants that score pushed toward “0” (real)]

$\arg\min_G \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

G tries to synthesize fake images that fool D.

Conditional GAN

$\arg\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

G tries to synthesize fake images that fool the best D.

Conditional GAN

[Figure: from G’s perspective, the block labeled “Loss Function (D)” answers “Real or Fake?”]

G’s perspective: D is a loss function. Rather than being hand-designed, it is learned jointly!
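Put together, one alternating update might look like this (a sketch with placeholder networks, following the slides’ convention that D predicts “fake” as 1; the G step uses the common non-saturating variant):

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)                                          # placeholder generator
D = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1), nn.Sigmoid())   # placeholder discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x = torch.randn(4, 3, 64, 64)  # input
y = torch.randn(4, 3, 64, 64)  # ground truth

# D step: max E[log D(G(x)) + log(1 - D(y))]  (fake -> "1", real -> "0")
fake = G(x).detach()           # do not backprop into G here
pred_fake, pred_real = D(fake), D(y)
d_loss = bce(pred_fake, torch.ones_like(pred_fake)) + bce(pred_real, torch.zeros_like(pred_real))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# G step: fool D, i.e. push D(G(x)) toward "0" (real)
pred = D(G(x))
g_loss = bce(pred, torch.zeros_like(pred))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```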

Conditional Discriminator

D also sees the input $\mathbf{x}$:

$\arg\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(\mathbf{x}, G(\mathbf{x})) + \log(1 - D(\mathbf{x}, \mathbf{y})) \right]$

[Figure: both the generated output and the ground truth are paired with the input $\mathbf{x}$ before being passed to D]

Patch Discriminator

“Rather than penalizing if the output image looks fake, penalize if each overlapping patch in the output looks fake”

[Figures: results with a 1x1 pixel discriminator, a 70x70 patch discriminator, and a full-image discriminator]
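A sketch of the idea (layer sizes assumed; not the exact 70x70 pix2pix architecture): because the discriminator is fully convolutional, its output is a grid of scores, one per overlapping input patch, rather than a single number.

```python
import torch
import torch.nn as nn

patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1),   # input: x and y concatenated (3 + 3 channels)
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one "real or fake?" logit per patch
)

xy = torch.randn(1, 6, 256, 256)
print(patch_d(xy).shape)  # torch.Size([1, 1, 63, 63]): a score map, not a single number
```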

Conditional Discriminator

$L_{cGAN}(G, D) = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(\mathbf{x}, G(\mathbf{x})) + \log(1 - D(\mathbf{x}, \mathbf{y})) \right]$

Reconstruction Loss

$L_{\ell_1}(G) = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \| G(\mathbf{x}) - \mathbf{y} \|_1 \right]$

$G^* = \arg\min_G \max_D \; L_{cGAN}(G, D) + \lambda \, L_{\ell_1}(G)$, with $\lambda = 100$

“Stable training + fast convergence”
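The generator’s side of this combined objective, as a sketch (assumes a generator G and a conditional discriminator D returning raw logits, e.g. the patch discriminator above; $\lambda = 100$ as on the slide):

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, x, y, lam=100.0):
    fake = G(x)
    logits = D(torch.cat([x, fake], dim=1))  # conditional D sees (x, G(x))
    # Adversarial term: fool D (slide convention: D predicts "fake" = 1,
    # so G pushes D's score on its output toward 0).
    adv = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
    rec = F.l1_loss(fake, y)                 # L1 reconstruction keeps outputs near GT
    return adv + lam * rec
```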

Ablation Study

[Figures: ablation of the loss terms (L1 only, cGAN only, L1 + cGAN)]

Results on the Test Split

Results for Hand Drawings

[Figures: example results]

Limitations

1. Paired data is required

2. Temporally unstable if applied per-frame to a video sequence

3. Does not generalize to 3D transformations

CycleGAN

Cycle Consistency: with unpaired data, learn $G: A \to B$ together with $F: B \to A$ and require $F(G(\mathbf{x})) \approx \mathbf{x}$ (and $G(F(\mathbf{y})) \approx \mathbf{y}$), as sketched below.

Recycle-GAN
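A cycle-consistency loss sketch (G and F_ba are assumed placeholder generators for the two directions):

```python
import torch
import torch.nn.functional as F

def cycle_loss(G, F_ba, x_a, y_b):
    """Translate to the other domain and back; the round trip should be the identity."""
    return (F.l1_loss(F_ba(G(x_a)), x_a) +   # A -> B -> A
            F.l1_loss(G(F_ba(y_b)), y_b))    # B -> A -> B
```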

Limitation 2 (temporally unstable if applied per-frame to a video sequence):

Vid2Vid

Limitation 3 (does not generalize to 3D transformations):

DeepVoxels

Summary

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)

References

• CVPR GAN Tutorial

• https://sites.google.com/view/cvpr2018tutorialongans

• CS231n

• http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

• …