
Conditional Adversarial Networks (or “mapping from A to B”)

CS448V — Computational Video Manipulation

May 22nd, 2019

Why? Cool! Trendy! (Google Scholar)

Hundreds of applications

and follow-up works…

CycleGAN

Pix2Pix


Enhancing Transitions

Single-Photo Facial Animation

Text-based Editing

Few-Shot Reenactment

Digital Humans

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)

Convolutional Neural Network

Components?

• 2D Convolution Layers (Conv2D)

• Subsampling Layers (MaxPool, …)

• Non-linearity Layers (ReLU, …)

• Normalization Layers (BatchNorm, …)

• Upsampling Layers (TransposedConv, …)

• …
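As a concrete illustration (a minimal PyTorch sketch; the layer sizes are assumptions for demonstration, not from the slides), each component maps directly onto a library module:

```python
import torch.nn as nn

# One module per component from the list above (sizes are illustrative).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)  # 2D convolution
pool = nn.MaxPool2d(kernel_size=2)                               # subsampling
relu = nn.ReLU()                                                 # non-linearity
norm = nn.BatchNorm2d(num_features=16)                           # normalization
up   = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)       # upsampling
```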

Convolution

[Figure: a 32x32x3 input image (width 32, height 32, depth 3) and a 5x5x3 filter]

Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”

Convolution

[Figure: the 5x5x3 filter placed on a 5x5x3 chunk of the 32x32x3 image]

Result: 1 number, the dot product between the filter and a small 5x5x3 chunk of the image, i.e., a 5x5x3 = 75-dimensional dot product plus a bias: $w^T x + b$

Convolution

[Figure: sliding the 5x5x3 filter over the 32x32x3 image yields a 28x28x1 activation map]

Convolve (slide) over all spatial locations. Why 28x28? With a 5x5 filter, stride 1, and no padding: 32 − 5 + 1 = 28.
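The same computation in a few lines (a PyTorch sketch; random weights stand in for a learned filter):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)  # a 32x32x3 image (PyTorch layout: N, C, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)  # one 5x5x3 filter + bias

activation_map = conv(image)       # slide the filter over all spatial locations
print(activation_map.shape)        # torch.Size([1, 1, 28, 28]), since 32 - 5 + 1 = 28
```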

Convolution

[Figure: 32x32x3 image → 28x28x1 activation map]

Invariant to? Translation: yes, since the same filter is applied at every spatial location (strictly speaking, convolution is translation-equivariant). Rotation: no. Scaling: no.

Convolution Layer

[Figure: a convolution layer applies a set of filters to the 32x32x3 image; stacking the resulting activation maps gives an activation tensor]

Convolutional Neural Network

[Figure: 32x32x3 image → Convolution + ReLU with e.g. 6 filters of size 5x5x3 → 28x28x6 tensor]

Convolutional Neural Network

[Figure: 32x32x3 image → Conv + ReLU (6 filters, 5x5x3) → 28x28x6 tensor → Conv + ReLU (10 filters, 5x5x6) → 24x24x10 tensor → …]
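Stacked, this chain reproduces the shapes above (a PyTorch sketch with untrained weights):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # 32x32x3 -> 28x28x6
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=5),  # 28x28x6 -> 24x24x10
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)
print(net(x).shape)  # torch.Size([1, 10, 24, 24])
```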

Convolutional Neural Networks

[LeNet-5, LeCun et al. 1998]

Feature Hierarchy

Learn the features from data instead of hand engineering them!

(If enough data is available)

U-Net

Skip connections: “Propagate low-level features directly, helps with details”
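A minimal sketch of the idea (toy sizes assumed, not the pix2pix generator): the encoder’s features skip past the bottleneck and are concatenated onto the decoder’s output.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 16, 4, stride=2, padding=1)          # 256 -> 128
        self.up = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)  # 128 -> 256
        self.out = nn.Conv2d(16 + 3, 3, 3, padding=1)                 # after the skip concat

    def forward(self, x):
        h = torch.relu(self.down(x))
        h = torch.relu(self.up(h))
        h = torch.cat([h, x], dim=1)  # skip connection: low-level features flow directly
        return self.out(h)

print(TinyUNet()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```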

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix


Generative Modeling

We want to learn $p(X)$ from data, such that we can “sample from it”!

Training Data: $\{\mathbf{x}_i\}_{i=1}^{N}$ → Density Function: $p(X)$ → New Samples: $\mathbf{x} \sim p(X)$

“more of the same!”

Generative 2D Face Modeling

The world needs more celebrities … or not … ?

Training Data: $\{\mathbf{x}_i\}_{i=1}^{N}$ → New Samples: $\mathbf{x} \sim p(X)$

3.5 Years of Progress on Faces

[Figure: generated face samples, up to 2018; see https://thispersondoesnotexist.com]

StyleGAN - Interpolation

Overview

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)


Image-to-Image Translation

[Figures: example input → output image pairs]

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Paired!

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Loss Neural Network

G

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

Neural Network

G

“What should I do?”

Image-to-Image Translation

[Zhang et al., ECCV 2016]

argminG

𝔼𝐱,𝐲[L(G 𝐱 , 𝐲)]

“How should I do it?”

G

“What should I do?”
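Concretely, one supervised training step under this objective might look as follows (a sketch with a placeholder network and random data, and a hand-designed loss, foreshadowing the problem below):

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)     # placeholder generator, not the paper's network
opt = torch.optim.Adam(G.parameters(), lr=2e-4)
L = nn.MSELoss()                      # a hand-designed loss

x = torch.randn(8, 3, 64, 64)         # paired batch: inputs x ...
y = torch.randn(8, 3, 64, 64)         # ... and ground-truth outputs y

opt.zero_grad()
loss = L(G(x), y)                     # E[L(G(x), y)], estimated on the batch
loss.backward()
opt.step()
```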

Be careful what you wish for!

$L(\hat{\mathbf{y}}, \mathbf{y}) = \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2$

Regression to the mean: the $\ell_2$ loss is minimized by the average of all plausible outputs, so results come out blurry!

Automate Design of the Loss?

Deep learning got rid of handcrafted features. Can we also get rid of handcrafting the loss function?

A universal loss function?

Discriminator as a Loss Function

[Figure: a Discriminator (classifier) takes an image and answers “Real or Fake?”]

Conditional GAN

[Figure: Generator (G) maps Input $\mathbf{x}$ to an output image; a Discriminator (D) judges “Real or Fake?”]

G tries to synthesize fake images that fool D.

D tries to tell real from fake.

Conditional GAN (Discriminator)

[Figure: D scores the generated output as Fake (0.9), target “1”, and the ground truth $\mathbf{y}$ as Real (0.1), target “0”]

$\arg\max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

D tries to identify the fakes and the real images.

Conditional GAN (Generator)

[Figure: D scores G’s output; G wants that score pushed toward “0” (real)]

$\arg\min_G \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

G tries to synthesize fake images that fool D.

Conditional GAN

$\arg\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(G(\mathbf{x})) + \log(1 - D(\mathbf{y})) \right]$

G tries to synthesize fake images that fool the best D.

Conditional GAN

[Figure: from G’s perspective, the block labeled “Loss Function (D)” answers “Real or Fake?”]

G’s perspective: D is a loss function. Rather than being hand-designed, it is learned jointly!
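Put together, one alternating update might look like this (a sketch with placeholder networks, following the slides’ convention that D predicts “fake” as 1; the G step uses the common non-saturating variant):

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)                                          # placeholder generator
D = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1), nn.Sigmoid())   # placeholder discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x = torch.randn(4, 3, 64, 64)  # input
y = torch.randn(4, 3, 64, 64)  # ground truth

# D step: max E[log D(G(x)) + log(1 - D(y))]  (fake -> "1", real -> "0")
fake = G(x).detach()           # do not backprop into G here
pred_fake, pred_real = D(fake), D(y)
d_loss = bce(pred_fake, torch.ones_like(pred_fake)) + bce(pred_real, torch.zeros_like(pred_real))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# G step: fool D, i.e. push D(G(x)) toward "0" (real)
pred = D(G(x))
g_loss = bce(pred, torch.zeros_like(pred))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```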

Conditional Discriminator

D also sees the input $\mathbf{x}$:

$\arg\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(\mathbf{x}, G(\mathbf{x})) + \log(1 - D(\mathbf{x}, \mathbf{y})) \right]$

[Figure: both the generated output and the ground truth are paired with the input $\mathbf{x}$ before being passed to D]

Patch Discriminator

“Rather than penalizing if the output image looks fake, penalize if each overlapping patch in the output looks fake”

[Figures: results with a 1x1 pixel discriminator, a 70x70 patch discriminator, and a full-image discriminator]
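A sketch of the idea (layer sizes assumed; not the exact 70x70 pix2pix architecture): because the discriminator is fully convolutional, its output is a grid of scores, one per overlapping input patch, rather than a single number.

```python
import torch
import torch.nn as nn

patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1),   # input: x and y concatenated (3 + 3 channels)
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one "real or fake?" logit per patch
)

xy = torch.randn(1, 6, 256, 256)
print(patch_d(xy).shape)  # torch.Size([1, 1, 63, 63]): a score map, not a single number
```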

Conditional Discriminator

$L_{cGAN}(G, D) = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \log D(\mathbf{x}, G(\mathbf{x})) + \log(1 - D(\mathbf{x}, \mathbf{y})) \right]$

Reconstruction Loss

$L_{\ell_1}(G) = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \| G(\mathbf{x}) - \mathbf{y} \|_1 \right]$

$G^* = \arg\min_G \max_D \; L_{cGAN}(G, D) + \lambda \, L_{\ell_1}(G)$, with $\lambda = 100$

“Stable training + fast convergence”
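The generator’s side of this combined objective, as a sketch (assumes a generator G and a conditional discriminator D returning raw logits, e.g. the patch discriminator above; $\lambda = 100$ as on the slide):

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, x, y, lam=100.0):
    fake = G(x)
    logits = D(torch.cat([x, fake], dim=1))  # conditional D sees (x, G(x))
    # Adversarial term: fool D (slide convention: D predicts "fake" = 1,
    # so G pushes D's score on its output toward 0).
    adv = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
    rec = F.l1_loss(fake, y)                 # L1 reconstruction keeps outputs near GT
    return adv + lam * rec
```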

Ablation Study

[Figures: ablation of the loss terms (L1 only, cGAN only, L1 + cGAN)]

Results on the Test Split

Results for Hand Drawings

[Figures: example results]

Limitations

1. Paired data is required

2. Temporally unstable if applied per-frame to a video sequence

3. Does not generalize to 3D transformations

CycleGAN

Cycle Consistency: with unpaired data, learn $G: A \to B$ together with $F: B \to A$ and require $F(G(\mathbf{x})) \approx \mathbf{x}$ (and $G(F(\mathbf{y})) \approx \mathbf{y}$), as sketched below.

Recycle-GAN
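A cycle-consistency loss sketch (G and F_ba are assumed placeholder generators for the two directions):

```python
import torch
import torch.nn.functional as F

def cycle_loss(G, F_ba, x_a, y_b):
    """Translate to the other domain and back; the round trip should be the identity."""
    return (F.l1_loss(F_ba(G(x_a)), x_a) +   # A -> B -> A
            F.l1_loss(G(F_ba(y_b)), y_b))    # B -> A -> B
```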

Limitation 2 (temporally unstable if applied per-frame to a video sequence):

Vid2Vid

Limitation 3 (does not generalize to 3D transformations):

DeepVoxels

Summary

• Convolutional Neural Networks

• Generative Modeling

• Pix2Pix (“mapping from A to B”)

References

• CVPR GAN Tutorial

• https://sites.google.com/view/cvpr2018tutorialongans

• CS231n

• http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

• …