Nishant Gurnani - Amazon S3 · 2017-10-05

TRANSCRIPT
Nishant Gurnani
GAN Reading Group
April 14th, 2017
Why are these Papers Important?
- Recently a large number of GAN frameworks have been proposed: BGAN, LSGAN, DCGAN, DiscoGAN, ...
- Wasserstein GAN is yet another GAN training algorithm; however, it is backed by rigorous theory in addition to good performance
- WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN
- Empirical results show a correlation between discriminator loss and perceptual quality, thus providing a rough measure of training progress

Goal - Convince you that WGAN is the "best" GAN
Outline
Introduction
Different Distances
Standard GAN
Wasserstein GAN
Improved Wasserstein GAN
What does it mean to learn a probability distribution?
When learning generative models, we assume the data we have comes from some unknown distribution Pr.

We want to learn a distribution Pθ that approximates Pr, where θ are the parameters of the distribution.

There are two approaches for doing this:

1. Directly learn the probability density function Pθ and then optimize through maximum likelihood estimation
2. Learn a function that transforms an existing distribution Z into Pθ. Here, gθ is some differentiable function, Z is a common distribution (usually uniform or Gaussian), and Pθ = gθ(Z)
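The second approach can be sketched with a toy generator. The affine map `g_theta` below is a hypothetical stand-in for a neural network; the parameters (scale, shift) are made up for illustration:

```python
import numpy as np

# Hypothetical toy generator: an affine map standing in for a neural network.
# theta = (scale, shift) pushes Z = N(0, 1) forward to P_theta = N(shift, scale^2).
def g_theta(z, theta):
    scale, shift = theta
    return scale * z + shift

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)   # samples from the known distribution Z
x = g_theta(z, (2.0, 5.0))        # samples from P_theta = g_theta(Z)

# Sampling from P_theta required nothing but noise and a forward pass.
print(x.mean(), x.std())          # close to (5.0, 2.0)
```

Note that we never wrote down a density for Pθ; we only defined how to sample from it, which is exactly what approach 2 buys us.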
Maximum Likelihood approach is problematic
Recall that for continuous distributions P and Q the KL divergence is:

$$KL(P\,\|\,Q) = \int_x P(x)\log\frac{P(x)}{Q(x)}\,dx$$

and given a function Pθ, the MLE objective is

$$\max_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^{m}\log P_\theta(x^{(i)})$$

In the limit (as m → ∞), samples will appear according to the data distribution Pr, so

$$\lim_{m\to\infty}\ \max_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^{m}\log P_\theta(x^{(i)}) = \max_{\theta\in\mathbb{R}^d}\int_x P_r(x)\log P_\theta(x)\,dx$$
Maximum Likelihood approach is problematic
$$\begin{aligned}
\lim_{m\to\infty}\ \max_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^{m}\log P_\theta(x^{(i)}) &= \max_{\theta\in\mathbb{R}^d}\int_x P_r(x)\log P_\theta(x)\,dx\\
&= \min_{\theta\in\mathbb{R}^d} -\int_x P_r(x)\log P_\theta(x)\,dx\\
&= \min_{\theta\in\mathbb{R}^d}\int_x P_r(x)\log P_r(x)\,dx - \int_x P_r(x)\log P_\theta(x)\,dx\\
&= \min_{\theta\in\mathbb{R}^d} KL(P_r\,\|\,P_\theta)
\end{aligned}$$

(The third line adds $\int_x P_r(x)\log P_r(x)\,dx$, a constant in θ, which does not change the minimizer.)

- Note that if Pθ = 0 at an x where Pr > 0, the KL divergence goes to +∞ (bad for the MLE if Pθ has low-dimensional support)
- The typical remedy is to add a noise term to the model distribution to ensure the distribution is defined everywhere
- This unfortunately introduces some error, and empirically people have needed to add a lot of random noise to make models train
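The failure mode in the first bullet is easy to reproduce numerically. A minimal sketch with discrete distributions (the distributions below are made up for illustration):

```python
import numpy as np

# Two discrete distributions on {0, 1, 2, 3} with disjoint supports:
# P_theta puts zero mass everywhere P_r has mass.
P_r     = np.array([0.5, 0.5, 0.0, 0.0])
P_theta = np.array([0.0, 0.0, 0.5, 0.5])

def kl(p, q):
    # Convention: terms with p = 0 contribute 0; p > 0 with q = 0 gives +inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / q), 0.0)
    return terms.sum()

print(kl(P_r, P_theta))   # inf: P_theta = 0 where P_r > 0

# The "typical remedy": smooth P_theta with a little uniform noise.
eps = 1e-3
P_smooth = (1 - eps) * P_theta + eps / 4
print(kl(P_r, P_smooth))  # large but finite
```

Smoothing makes the objective finite, but the smoothed model is no longer the one we wanted, which is the error the slide refers to.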
Transforming an existing distribution
Shortcomings of the maximum likelihood approach motivate the second approach of learning a gθ (a generator) to transform a known distribution Z.

Advantages of this approach:

- Unlike densities, this approach can represent distributions confined to a low-dimensional manifold
- It's very easy to generate samples: given a trained gθ, simply sample random noise z ∼ Z and evaluate gθ(z)

VAEs and GANs are well-known examples of this approach.

VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms.

GANs offer much more flexibility, but their training is unstable.
Transforming an existing distribution
To train gθ (and by extension Pθ), we need a measure of distance between distributions, i.e. d(Pr, Pθ).

Distance Properties

- Distance d is weaker than distance d′ if every sequence that converges under d′ converges under d
- Given a distance d, we can treat d(Pr, Pθ) as a loss function
- We can minimize d(Pr, Pθ) with respect to θ as long as the mapping θ ↦ Pθ is continuous (true if gθ is a neural network)

How do we define d?
Distance definitions
Notation
- χ: a compact metric set (such as the space of images [0, 1]^d)
- Σ: the set of all Borel subsets of χ
- Prob(χ): the space of probability measures defined on χ

Total Variation (TV) distance

$$\delta(P_r, P_\theta) = \sup_{A\in\Sigma} |P_r(A) - P_\theta(A)|$$

Kullback-Leibler (KL) divergence

$$KL(P_r\,\|\,P_\theta) = \int_x P_r(x)\log\frac{P_r(x)}{P_\theta(x)}\,d\mu(x)$$

where both Pr and Pθ are assumed to be absolutely continuous with respect to the same measure µ defined on χ.
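For discrete distributions these definitions reduce to simple sums. A minimal numpy sketch (the example distributions are made up; for discrete measures the TV supremum equals half the L1 distance):

```python
import numpy as np

# Two fully supported discrete distributions on 4 points.
P_r     = np.array([0.4, 0.3, 0.2, 0.1])
P_theta = np.array([0.1, 0.2, 0.3, 0.4])

# Total variation: sup over events A of |P_r(A) - P_theta(A)|.
# The supremum is attained by A = {x : P_r(x) > P_theta(x)},
# which works out to half the L1 distance.
tv = 0.5 * np.abs(P_r - P_theta).sum()

# KL divergence: both distributions are strictly positive here, so it is finite.
kl = (P_r * np.log(P_r / P_theta)).sum()

print(tv, kl)
```

Note the asymmetry: kl computed with the arguments swapped gives a different number, whereas tv does not depend on the order.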
Distance Definitions
Jensen-Shannon (JS) Divergence

$$JS(P_r, P_\theta) = KL(P_r\,\|\,P_m) + KL(P_\theta\,\|\,P_m)$$

where Pm is the mixture (Pr + Pθ)/2

Earth-Mover (EM) distance or Wasserstein-1

$$W(P_r, P_\theta) = \inf_{\gamma\in\Pi(P_r,P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]$$

where Π(Pr, Pθ) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pθ.
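In one dimension the EM distance between two equally weighted empirical samples has a closed form: the optimal plan matches sorted samples, so W1 is the mean gap between order statistics. A small sketch, assuming Gaussian toy data:

```python
import numpy as np

# 1-D empirical distributions with equal sample counts and uniform weights.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5000)   # samples standing in for P_r
y = rng.normal(3.0, 1.0, 5000)   # samples standing in for P_theta: same shape, shifted

# Optimal 1-D transport pairs the i-th smallest x with the i-th smallest y.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))
print(w1)  # close to 3.0, the size of the shift between the two Gaussians
```

This matches the intuition of the next slides: shifting a pile of mass by distance 3 costs 3 units of effort per unit of mass.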
Understanding the EM distance
Main Idea
Probability distributions are defined by how much mass they put on each point.

Imagine we started with distribution Pr, and wanted to move mass around to change the distribution into Pθ.

Moving mass m by distance d costs m · d effort.

The earth-mover distance is the minimal effort we need to spend.

Transport Plan
Each γ ∈ Π is a transport plan, and to execute the plan, for all x, y, move γ(x, y) mass from x to y.
37 / 107
![Page 38: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks](https://reader033.vdocuments.us/reader033/viewer/2022043020/5f3c74500f0ff13e0a517839/html5/thumbnails/38.jpg)
Understanding the EM distance
Main IdeaProbability distributions are defined by how much mass they puton each point.
Imagine we started with distribution Pr , and wanted to move massaround to change the distribution into Pθ.
Moving mass m by distance d costs m · d effort.
The earth mover distance is the minimal effort we need to spend.
Transport Plan
Each γ ∈ Π is a transport plan and to execute the plan, for all x , ymove γ(x , y) mass from x to y .
38 / 107
![Page 39: Nishant Gurnani - Amazon S3 · 2017-10-05 · Nishant Gurnani GAN Reading Group April 14th, 2017 1/107. Why are these Papers Important? I Recently a large number of GAN frameworks](https://reader033.vdocuments.us/reader033/viewer/2022043020/5f3c74500f0ff13e0a517839/html5/thumbnails/39.jpg)
Understanding the EM distance
Main IdeaProbability distributions are defined by how much mass they puton each point.
Imagine we started with distribution Pr , and wanted to move massaround to change the distribution into Pθ.
Moving mass m by distance d costs m · d effort.
The earth mover distance is the minimal effort we need to spend.
Transport Plan
Each γ ∈ Π is a transport plan and to execute the plan, for all x , ymove γ(x , y) mass from x to y .
39 / 107
Understanding the EM distance
What properties does the plan need to satisfy to transform Pr into Pθ?

The amount of mass that leaves x is ∫_y γ(x, y) dy. This must equal Pr(x), the amount of mass originally at x.

The amount of mass that enters y is ∫_x γ(x, y) dx. This must equal Pθ(y), the amount of mass that ends up at y.

This is why the marginals of γ ∈ Π(Pr, Pθ) must be Pr and Pθ. For scoring, the effort spent is

∫_x ∫_y γ(x, y) ||x − y|| dy dx = E_(x,y)∼γ[||x − y||]

Taking the infimum of this over all valid γ gives the earth mover distance.
43 / 107
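The "minimal effort over all plans" definition can be checked directly on tiny discrete distributions. Below is a minimal sketch (not from the slides): for two equal-size point sets on the real line, each point carrying mass 1/n, an optimal transport plan reduces to a one-to-one matching, so for tiny n we can brute-force every matching.

```python
import itertools

def emd(xs, ys):
    """Earth mover distance between two equal-size point clouds on the
    real line, each point carrying mass 1/n. With equal uniform weights
    an optimal plan is a one-to-one matching, so brute-force all
    permutations (fine only for tiny n)."""
    n = len(xs)
    return min(
        sum(abs(x - y) for x, y in zip(xs, perm)) / n
        for perm in itertools.permutations(ys)
    )

# Move mass {0, 1, 2} -> {0, 1, 3}: only the point at 2 moves, by
# distance 1, carrying mass 1/3, so the total effort is 1/3.
print(emd([0, 1, 2], [0, 1, 3]))  # -> 0.3333...
```

The `min` over permutations plays the role of the infimum over transport plans γ.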
Learning Parallel Lines Example
Let Z ∼ U[0, 1] and let P0 be the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin. Now let gθ(z) = (θ, z), with θ a single real parameter.

We'd like our optimization algorithm to learn to move θ to 0. As θ → 0, the distance d(P0, Pθ) should decrease.
49 / 107
Learning Parallel Lines Example
For many common distance functions, this doesn't happen:

δ(P0, Pθ) = 1 if θ ≠ 0, 0 if θ = 0

JS(P0, Pθ) = log 2 if θ ≠ 0, 0 if θ = 0

KL(Pθ || P0) = KL(P0 || Pθ) = +∞ if θ ≠ 0, 0 if θ = 0

W(P0, Pθ) = |θ|

Only the Wasserstein distance varies smoothly with θ, and so only it gives a usable gradient signal.
53 / 107
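The W row of this table can be checked numerically. The sketch below (a toy check, not from the slides) represents P0 and Pθ by equal-weight samples that share the same z values, so the optimal plan pairs (0, z) with (θ, z) and the brute-forced matching cost comes out to exactly |θ|; JS, by contrast, would sit at log 2 for every θ ≠ 0 since the supports are disjoint.

```python
import itertools

def w_dist(ps, qs):
    # Wasserstein-1 between equal-weight point clouds in R^2,
    # brute-forced over one-to-one matchings (ok for tiny n).
    n = len(ps)
    return min(
        sum(((p[0] - q[0])**2 + (p[1] - q[1])**2) ** 0.5
            for p, q in zip(ps, perm)) / n
        for perm in itertools.permutations(qs)
    )

zs = [0.1, 0.5, 0.9]                    # stand-ins for Z ~ U[0, 1]
for theta in (1.0, 0.5, 0.1):
    p0 = [(0.0, z) for z in zs]         # samples from P_0
    pt = [(theta, z) for z in zs]       # samples from P_theta
    print(theta, w_dist(p0, pt))        # the distance equals |theta|
```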
Theoretical Justification
Theorem (1)

Let Pr be a fixed distribution over χ. Let Z be a random variable over another space Z. Let g : Z × R^d → χ be a function, denoted gθ(z) with z the first coordinate and θ the second. Let Pθ denote the distribution of gθ(Z). Then,

1. If g is continuous in θ, so is W(Pr, Pθ).

2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(Pr, Pθ) is continuous everywhere, and differentiable almost everywhere.

3. Statements 1-2 are false for the Jensen-Shannon divergence JS(Pr, Pθ) and all the KLs.
55 / 107
Theoretical Justification
Theorem (2)

Let P be a distribution on a compact space χ and (Pn)n∈N be a sequence of distributions on χ. Then, considering all limits as n → ∞,

1. The following statements are equivalent:
   - δ(Pn, P) → 0
   - JS(Pn, P) → 0

2. The following statements are equivalent:
   - W(Pn, P) → 0
   - Pn →_D P, where →_D denotes convergence in distribution for random variables

3. KL(Pn || P) → 0 or KL(P || Pn) → 0 imply the statements in (1).

4. The statements in (1) imply the statements in (2).
56 / 107
Introduction
Different Distances
Standard GAN
Wasserstein GAN
Improved Wasserstein GAN
57 / 107
Generative Adversarial Networks
Recall that the GAN training strategy is to define a game between two competing networks.

The generator network G maps a source of noise to the input space.

The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two.

The generator is trained to fool the discriminator.
58 / 107
Generative Adversarial Networks
Formally we can express the game between the generator G and the discriminator D with the minimax objective:

min_G max_D E_x∼Pr[log(D(x))] + E_x̃∼Pg[log(1 − D(x̃))]

where Pr is the data distribution and Pg is the model distribution implicitly defined by x̃ = G(z), z ∼ p(z).
62 / 107
Generative Adversarial Networks
Remarks

- If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x
- This is expensive and often leads to vanishing gradients as the discriminator saturates
- In practice, this requirement is relaxed, and the generator and discriminator are updated simultaneously
63 / 107
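The first remark can be verified numerically for discrete distributions: the optimal discriminator is D*(x) = Pr(x) / (Pr(x) + Pg(x)), and plugging it into the value function gives exactly 2 · JS(Pr, Pg) − log 4. A small sketch (the distributions p, q here are made up for illustration):

```python
from math import log

def js(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_D(p, q):
    # E_p[log D*] + E_q[log(1 - D*)] with D*(x) = p(x) / (p(x) + q(x))
    return sum(pi * log(pi / (pi + qi)) for pi, qi in zip(p, q) if pi > 0) \
         + sum(qi * log(qi / (pi + qi)) for pi, qi in zip(p, q) if qi > 0)

p, q = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]
print(value_at_optimal_D(p, q))   # equals 2 * js(p, q) - log(4)
```

So with an optimal discriminator, generator training is JS minimization, which Theorem 1 showed can be discontinuous in θ.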
Introduction
Different Distances
Standard GAN
Wasserstein GAN
Improved Wasserstein GAN
67 / 107
Kantorovich-Rubinstein Duality

Unfortunately, computing the Wasserstein distance exactly is intractable:

W(Pr, Pθ) = inf_γ∈Π(Pr,Pθ) E_(x,y)∼γ[||x − y||]

However, the Kantorovich-Rubinstein duality (Villani 2008) shows W is equivalent to

W(Pr, Pθ) = sup_||f||_L≤1 E_x∼Pr[f(x)] − E_x∼Pθ[f(x)]

where the supremum is taken over all 1-Lipschitz functions f.

Calculating this is still intractable, but now it's easier to approximate.
68 / 107
Wasserstein GAN Approximation
Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K · W(Pr, Pθ) instead.

Suppose we have a parametrized function family {fw}w∈W, where w are the weights and W is the set of all possible weights.

Furthermore suppose these functions are all K-Lipschitz for some K. Then we have

max_w∈W E_x∼Pr[fw(x)] − E_x∼Pθ[fw(x)] ≤ sup_||f||_L≤K E_x∼Pr[f(x)] − E_x∼Pθ[f(x)] = K · W(Pr, Pθ)

If {fw}w∈W contains the function attaining the supremum among K-Lipschitz functions, this gives the distance exactly.

In practice this won't be true!
71 / 107
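The inequality is easy to see with a deliberately weak family. The sketch below (a toy setup, not from the paper) takes f_w(x) = w · x with |w| ≤ 1 on 1-D data: every member is 1-Lipschitz, the max over the family reduces to the absolute difference of means, and it can badly undershoot the true distance when the family misses the optimal f.

```python
def family_max(xs, ys):
    # max over f_w(x) = w * x with |w| <= 1 of E[f_w(x)] - E[f_w(y)]:
    # attained at w = sign(mean difference), so it is |mean(x) - mean(y)|.
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def true_w1(xs, ys):
    # 1-D Wasserstein-1 between equal-weight samples: sort and pair up.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

print(family_max([0, 2], [1, 1]))  # 0.0 -- the linear family misses entirely
print(true_w1([0, 2], [1, 1]))     # 1.0 -- the actual distance
```

When the two distributions differ only by a shift, the linear family is tight; when shapes differ, it only gives a lower bound, which is why a rich critic network is used in place of this toy family.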
Wasserstein GAN Algorithm
Looping all this back to generative models, we'd like to train Pθ, the distribution of gθ(z), to match Pr.

Intuitively, given a fixed gθ, we can compute the optimal fw for the Wasserstein distance.

We can then backpropagate through W(Pr, gθ(Z)) to get the gradient for θ:

∇θ W(Pr, Pθ) = ∇θ (E_x∼Pr[fw(x)] − E_z∼Z[fw(gθ(z))]) = −E_z∼Z[∇θ fw(gθ(z))]
77 / 107
Wasserstein GAN Algorithm
The training process now has three steps:

1. For a fixed θ, compute an approximation of W(Pr, Pθ) by training fw to convergence
2. Once we have the optimal fw, compute the θ gradient −E_z∼Z[∇θ fw(gθ(z))] by sampling several z ∼ Z
3. Update θ, and repeat the process

Important Detail

The entire derivation only works when the function family {fw}w∈W is K-Lipschitz. To guarantee this, the authors use weight clipping: the weights w are constrained to lie within [−c, c] by clipping w after every update to w.
80 / 107
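The three steps plus clipping fit in a dozen lines for a deliberately tiny setup. This is a sketch under strong simplifying assumptions (none of it is from the paper's experiments): 1-D data, a linear critic f_w(x) = w · x, which clipping to [−c, c] makes c-Lipschitz, and a generator gθ(z) = z + θ that only has to learn a mean.

```python
import random

random.seed(0)
mu = 3.0                    # data distribution P_r = N(mu, 1)
theta = 0.0                 # generator g_theta(z) = z + theta, z ~ N(0, 1)
w, c, lr = 0.0, 0.5, 0.05   # critic f_w(x) = w * x, weights clipped to [-c, c]

def batch(n, loc):
    return [random.gauss(loc, 1.0) for _ in range(n)]

for _ in range(500):
    # 1. approximate W(Pr, Ptheta): ascent steps on E[f_w(x)] - E[f_w(g(z))]
    for _ in range(5):
        grad_w = sum(batch(64, mu)) / 64 - sum(batch(64, theta)) / 64
        w = max(-c, min(c, w + lr * grad_w))    # weight clipping step
    # 2.-3. generator gradient is -E_z[grad_theta f_w(g_theta(z))] = -w,
    # so gradient descent moves theta by +lr * w
    theta += lr * w

print(theta)   # oscillates around, and settles near, mu
```

Note there is no need to carefully balance the two players: the critic is simply trained hard at every step, which is exactly what the original GAN cannot afford.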
Wasserstein GAN Algorithm
86 / 107
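The three steps above can be sketched in a toy 1-D setting. This is a minimal illustrative numpy loop, not the paper's implementation: the linear critic fw(x) = w·x, the shift generator gθ(z) = θ + z, and all hyperparameters (c, n_critic, learning rates) are assumptions chosen so every gradient has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.0                 # generator parameter; true data mean is 0
w = 0.0                     # critic weight, f_w(x) = w * x
c = 0.01                    # clipping threshold
lr_critic, lr_gen, n_critic, m = 0.1, 1.0, 5, 64

for step in range(200):
    # Step 1: approximate W(Pr, Ptheta) by training f_w (a few inner steps)
    for _ in range(n_critic):
        x_real = rng.normal(0.0, 1.0, m)           # x ~ Pr
        x_fake = theta + rng.normal(0.0, 1.0, m)   # g_theta(z), z ~ N(0, 1)
        # Critic ascends E[f_w(x)] - E[f_w(g_theta(z))]; its w-gradient:
        grad_w = x_real.mean() - x_fake.mean()
        w += lr_critic * grad_w
        w = float(np.clip(w, -c, c))               # weight clipping keeps f_w Lipschitz
    # Step 2: generator gradient -E_z[grad_theta f_w(g_theta(z))] = -w here
    theta_grad = -w
    # Step 3: update theta and repeat
    theta -= lr_gen * theta_grad
```

With f_w linear, the generator's gradient is just −w, so θ drifts toward the data mean 0 in steps of at most lr_gen·c per update, which makes the role of the clipping threshold easy to see.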
Comparison with Standard GANs
I In GANs, the discriminator maximizes

1/m ∑_{i=1}^{m} log D(x^(i)) + 1/m ∑_{i=1}^{m} log(1 − D(gθ(z^(i))))

where we constrain D(x) to always be a probability p ∈ (0, 1)
I In WGANs, nothing requires fw to output a probability, and hence it is referred to as a critic instead of a discriminator

I Although GANs are formulated as a min-max problem, in practice we never train D to convergence

I Consequently, we're updating G against an objective that only loosely approximates the JS divergence

I In contrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train fw to convergence before each generator update, to get as accurate an estimate of W(Pr, Pθ) as possible
87 / 107
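The contrast above can be made concrete on a minibatch. This numpy sketch uses a hypothetical 1-D score function (the parameters `a`, `b` are illustrative): the GAN discriminator squashes scores through a sigmoid so D(x) ∈ (0, 1), while the WGAN critic uses the raw, unbounded scores directly.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 64
x_real = rng.normal(0.0, 1.0, m)   # x^(i) ~ Pr
x_fake = rng.normal(3.0, 1.0, m)   # g_theta(z^(i))

# Hypothetical shared score function; a, b are illustrative parameters
a, b = 1.0, -1.5
score = lambda x: a * x + b

# GAN discriminator: squash scores to probabilities in (0, 1)
def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

D_real = sigmoid(score(x_real))
D_fake = sigmoid(score(x_fake))
gan_objective = np.mean(np.log(D_real)) + np.mean(np.log(1.0 - D_fake))

# WGAN critic: raw scores, no probability constraint
critic_objective = np.mean(score(x_real)) - np.mean(score(x_fake))
```

The GAN objective saturates as D_real → 1 or D_fake → 0, whereas the critic objective is a plain difference of means with no such ceiling.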
Introduction
Different Distances
Standard GAN
Wasserstein GAN
Improved Wasserstein GAN
93 / 107
Properties of optimal WGAN critic
An open question is how to effectively enforce the Lipschitz constraint on the critic.

Previously (Arjovsky et al. 2017) we've seen that you can clip the weights of the critic to lie within the compact set [−c, c].

To understand why weight clipping is problematic, we first need to understand the properties of the optimal WGAN critic.
94 / 107
Properties of optimal WGAN critic
If the optimal critic under the Kantorovich-Rubinstein dual D∗ is differentiable, and x is a point from our generator distribution Pθ, then there is a point y from the true distribution Pr such that, at every point xt = (1 − t)x + ty on the straight line between x and y, the gradient of D∗ points directly from xt toward y.

In other words, ∇D∗(xt) = (y − xt) / ‖y − xt‖.

This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under Pr and Pθ
98 / 107
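To see why the unit-norm claim follows, take norms in the expression for ∇D∗(xt):

```latex
\left\| \nabla D^{*}(x_t) \right\|
  = \left\| \frac{y - x_t}{\|y - x_t\|} \right\|
  = \frac{\|y - x_t\|}{\|y - x_t\|}
  = 1 .
```

Since samples from Pθ and Pr are the endpoints of such lines (t = 0 and t = 1), the optimal critic's gradient has norm exactly 1 almost everywhere under both distributions.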
Gradient penalty
We consider an alternative method to enforce the Lipschitz constraint on the training objective.

A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.

This implies we should directly constrain the gradient norm of our critic function with respect to its input.

Enforcing a soft version of this, we get:

L = E_{x̃∼Pθ}[fw(x̃)] − E_{x∼Pr}[fw(x)] + λ E_{x̂∼P_x̂}[(‖∇_x̂ fw(x̂)‖₂ − 1)²]

where x̂ is sampled uniformly along straight lines between pairs of points drawn from Pr and Pθ
102 / 107
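The gradient penalty can be sketched in numpy without autodiff by choosing a critic whose input gradient has a closed form. Everything here is a hypothetical toy: the quadratic critic fw(x) = w1·x + w2·x² (so fw′(x) = w1 + 2·w2·x), the parameters, and the penalty weight λ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 64, 10.0

x_real = rng.normal(0.0, 1.0, m)   # samples from Pr
x_fake = rng.normal(3.0, 1.0, m)   # samples from Ptheta

# Hypothetical 1-D critic with a closed-form input gradient
w1, w2 = 0.5, 0.25
f = lambda x: w1 * x + w2 * x**2
f_prime = lambda x: w1 + 2.0 * w2 * x

# x_hat: uniform samples on straight lines between real and fake points
eps = rng.uniform(0.0, 1.0, m)
x_hat = eps * x_real + (1.0 - eps) * x_fake

# Critic loss: Wasserstein term plus two-sided gradient penalty,
# pushing the input-gradient norm at x_hat toward 1
wasserstein = np.mean(f(x_fake)) - np.mean(f(x_real))
penalty = lam * np.mean((np.abs(f_prime(x_hat)) - 1.0) ** 2)
critic_loss = wasserstein + penalty
```

Note the penalty is two-sided (it pushes the gradient norm toward 1, not merely below 1), matching the unit-norm property of the optimal critic discussed earlier.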
WGAN with gradient penalty
107 / 107