Emanuele Sansone
Generative Adversarial Networks (GANs)
29 March 2017. Website: emsansone.github.io
Goals
Stimulate discussion:
1. Why not use this model?
2. What don't you like about this model?
3. Why not improve it?
4. ...
Sketch of GANs
Density estimation in high-dimensional data (manifold assumption)
Ian Goodfellow, NIPS 2016 tutorial on GANs
Yoshua Bengio, MLSS 2015 Austin TX
[Diagram: latent noise z ~ p_z is mapped by the generator g_θ(·) to samples with density p_g; the discriminator D_φ(·) sees both generated samples and training data (density p_x) and outputs a score in [0, 1].]
“Generative Adversarial Networks”, NIPS 2014, Ian Goodfellow

GENERATIVE
1. Fast sample generation - a single pass through a deterministic function (no Markov chains)
2. Implicit definition of density families

ADVERSARIAL
Game between two adversaries (no likelihood maximization) - SEE LATER

NETWORKS
Exploitation of deep neural architectures (high capacity and efficient training)
“Learning Deep Architectures for AI”, Yoshua Bengio, 2009
Application - Image synthesis
“Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, arXiv 2015, Alec Radford, Luke Metz, Soumith Chintala
Application - Video synthesis
“Generating Videos with Scene Dynamics”, NIPS 2016, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
[Figure: hallucinated videos for four scene categories - Beach, Golf, Train, Baby.]
[Figure: conditional generation of videos - input/output pairs.]
Application - Representation learning
“InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets”, NIPS 2016, Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
List of recent papers on Generative Adversarial Networks:

Papers about GANs, 2014-2016
- Generative Adversarial Networks, NIPS 2014
- Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks (UNSUPERVISED), NIPS 2015
- DRAW: A Recurrent Neural Network for Image Generation (UNSUPERVISED), ICML 2015
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv 2015
- InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, NIPS 2016
- Towards Principled Unsupervised Learning, ICLR 2016 (workshop)
- Improved Techniques for Training GANs, NIPS 2016
- Unsupervised and Semi-Supervised Learning with Categorical Generative Adversarial Networks, ICLR 2016
- Generating Videos with Scene Dynamics, NIPS 2016
- NIPS 2016 Tutorial: Generative Adversarial Networks (paper + video), NIPS 2016
Papers about GANs accepted to ICLR 2017
- Towards Principled Methods for Training Generative Adversarial Networks (FIRST THEORETICAL PAPER)
- Adversarially Learned Inference (SoA in SEMI-SUPERVISED LEARNING)
- Learning to Generate Samples from Noise Through Infusion Training
- Improving Generative Adversarial Networks with Denoising Feature Matching
- LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation
- Mode Regularized Generative Adversarial Networks
- Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy
- Calibrating Energy-Based Generative Adversarial Networks
- Unrolled Generative Adversarial Networks
- Generative Multi-Adversarial Networks
- Energy-Based Generative Adversarial Networks

Probably a hot topic?
Problem formulation

$$\min_\theta \max_\phi \Big\{ \mathbb{E}_{x \sim p_x}\big[\log(D_\phi(x))\big] + \mathbb{E}_{z \sim p_z}\big[\log(1 - D_\phi(g_\theta(z)))\big] \Big\}$$

The first expectation is taken over the training data, the second over the generated data.
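To make the objective concrete, here is a minimal sketch of one alternating update in PyTorch (all network sizes, learning rates, and the toy data batch are illustrative assumptions, not from the slides): the discriminator takes a gradient-ascent step on the value function, the generator a descent step on its θ-dependent part.

```python
import torch

# Hypothetical generator g_theta and discriminator D_phi; any differentiable
# modules with matching shapes would do.
G = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 2))
D = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1), torch.nn.Sigmoid())
opt_D = torch.optim.SGD(D.parameters(), lr=1e-2)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-2)

x_real = torch.randn(64, 2)   # stand-in for a batch of training data (p_x)
z = torch.randn(64, 8)        # latent noise z ~ p_z

# Discriminator step: ascend V = E_x[log D(x)] + E_z[log(1 - D(g(z)))]
loss_D = -(torch.log(D(x_real)).mean() + torch.log(1 - D(G(z).detach())).mean())
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step: descend the theta-dependent part of V
loss_G = torch.log(1 - D(G(torch.randn(64, 8)))).mean()
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```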
Optimal Discriminator (Proof sketch)

Rewriting the second expectation in terms of samples x ~ p_g:

$$\min_\theta \max_\phi \Big\{ \mathbb{E}_{x \sim p_x}\big[\log(D_\phi(x))\big] + \mathbb{E}_{x \sim p_g}\big[\log(1 - D_\phi(x))\big] \Big\}$$

$$\int p_x(x) \log(D_\phi(x))\,dx + \int p_g(x) \log(1 - D_\phi(x))\,dx = \int \Big\{ p_x(x) \log(D_\phi(x)) + p_g(x) \log(1 - D_\phi(x)) \Big\}\,dx$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $a \log(y) + b \log(1 - y)$ satisfies

$$\frac{\partial\, [a \log(y) + b \log(1 - y)]}{\partial y} = 0 \iff y^* = \frac{a}{a + b}$$

Applying this pointwise with $a = p_x(x)$ and $b = p_g(x)$:

$$D^*_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)}$$
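A quick numerical sanity check of the pointwise maximizer (a grid search, not part of the slides; a and b are arbitrary density values at one point):

```python
import numpy as np

a, b = 0.7, 0.3                           # p_x(x) and p_g(x) at some fixed x
y = np.linspace(1e-6, 1 - 1e-6, 100_000)
f = a * np.log(y) + b * np.log(1 - y)
print(y[np.argmax(f)])                    # ~0.7, the grid maximizer
print(a / (a + b))                        # 0.7, the claimed optimum y* = a/(a+b)
```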
Optimal Generator (Proof sketch)

Substituting $D^*_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)}$ into the objective:

$$\int p_x(x) \log\bigg(\frac{p_x(x)}{p_x(x) + p_g(x)}\bigg)dx + \int p_g(x) \log\bigg(1 - \frac{p_x(x)}{p_x(x) + p_g(x)}\bigg)dx$$

$$= \int p_x(x) \log\bigg(\frac{p_x(x)}{p_x(x) + p_g(x)}\bigg)dx + \int p_g(x) \log\bigg(\frac{p_g(x)}{p_x(x) + p_g(x)}\bigg)dx$$

$$= \int p_x(x) \log\bigg(\frac{p_x(x)}{\frac{p_x(x) + p_g(x)}{2}}\bigg)dx + \int p_g(x) \log\bigg(\frac{p_g(x)}{\frac{p_x(x) + p_g(x)}{2}}\bigg)dx - 2\log(2)$$

$$= 2\,\mathrm{JSD}(p_x, p_g) - 2\log(2)$$

The minimum value is $-2\log(2)$, attained when $\mathrm{JSD}(p_x, p_g) = 0 \iff p_g(x) = p_x(x)\ \forall x$.
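The value at the optimum is easy to verify numerically on discrete distributions (a toy check, not from the slides): 2·JSD(p_x, p_g) − 2·log 2 bottoms out at −2·log 2 exactly when p_g = p_x.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence, with the convention 0 * log(0/q) = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_x = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])
print(2 * jsd(p_x, p_x) - 2 * np.log(2))  # -1.386... = -2 log 2, the minimum
print(2 * jsd(p_x, p_g) - 2 * np.log(2))  # strictly larger for p_g != p_x
```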
In practice
[Diagram: the GAN architecture at convergence, where p_g matches p_x and the discriminator is maximally confused:]

$$D_\phi(x) = \frac{p_x(x)}{p_x(x) + p_g(x)} = \frac{1}{2}$$
In practice
“Generative Adversarial Networks”, NIPS 2014, Ian Goodfellow
DON'T WORRY, THERE IS A DEMO
Why is it better than traditional generative approaches?

Traditional generative approaches

$$\ln p(\mathrm{DATA}) = \ln \int p(\mathrm{DATA}, \theta)\,d\theta = \ln \int q(\theta)\, \frac{p(\mathrm{DATA}, \theta)}{q(\theta)}\,d\theta \geq \int q(\theta) \ln \frac{p(\mathrm{DATA}, \theta)}{q(\theta)}\,d\theta \doteq -\mathrm{KL}\big(q(\theta), p(\theta, \mathrm{DATA})\big)$$

The maximum of this lower bound is achieved for $q(\theta) = p(\theta \,|\, \mathrm{DATA})$.

Here $q(\theta)$ can be regarded as $p_g$, and $p(\theta \,|\, \mathrm{DATA})$ can be regarded as $p_x$. Maximizing the bound therefore amounts to solving

$$\min \mathrm{KL}(p_g, p_x)$$
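The bound is easy to check numerically with a discrete latent variable (an illustrative toy, not from the slides): the ELBO never exceeds ln p(DATA) and touches it exactly at q(θ) = p(θ|DATA).

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random(5)
joint /= joint.sum() * 2          # p(DATA, theta) over 5 discrete theta values
evidence = joint.sum()            # p(DATA) = 0.5 by construction
posterior = joint / evidence      # p(theta | DATA)

def elbo(q):
    """Lower bound: sum_theta q(theta) * ln(p(DATA, theta) / q(theta))."""
    return float(np.sum(q * np.log(joint / q)))

q = rng.random(5); q /= q.sum()   # an arbitrary variational distribution
print(np.log(evidence))           # ln p(DATA)
print(elbo(q))                    # strictly smaller for q != posterior
print(elbo(posterior))            # equals ln p(DATA): the bound is tight
```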
Traditional generative approaches vs. GANs

Traditional approaches solve min KL(p_g, p_x); GANs solve min JSD(p_g, p_x).

Case                                  | KL(p_g, p_x)                         | JSD(p_g, p_x)
p_x > 0, p_g -> 0 (mode dropping)     | 0 (low cost for mode dropping)       | log(2)
p_g > 0, p_x -> 0 (fake data)         | infinity (high cost for fake data)   | log(2)
Minimum                               | 0                                    | 0
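The asymmetry in the table can be seen on a two-mode toy example (illustrative, not from the slides): KL(p_g, p_x) stays finite under mode dropping but blows up on fake mass, while JSD stays bounded in both cases.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

eps = 1e-12
p_x = np.array([0.5, 0.5])          # data spreads mass over two modes
p_g = np.array([1.0 - eps, eps])    # generator drops the second mode
print(kl(p_g, p_x), jsd(p_g, p_x))  # KL ~ log 2 (finite), JSD bounded

p_x = np.array([1.0 - eps, eps])    # data has (almost) no mass on mode 2
p_g = np.array([0.5, 0.5])          # generator puts "fake" mass there
print(kl(p_g, p_x), jsd(p_g, p_x))  # KL ~ 13, -> infinity as eps -> 0; JSD still bounded
```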
Is there any issue?

Problem of vanishing gradients
“Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017, Martin Arjovsky, Leon Bottou

Theorem 2.4: if the discriminator is close to optimality, namely

$$D_\phi(x) \simeq \frac{p_x(x)}{p_x(x) + p_g(x)}$$

or, in other words,

$$D_\phi(x) \simeq 1 \ \ \forall x \in \mathrm{supp}(p_x), \qquad D_\phi(x) \simeq 0 \ \ \forall x \in \mathrm{supp}(p_g)$$

(which is a very common case, cf. Theorems 2.1-2.2), and the Jacobian of the generator is bounded by some scalar $M$, then

$$\Big\| \nabla_\theta \mathbb{E}_{z \sim p_z}\big[\log(1 - D_\phi(g_\theta(z)))\big] \Big\|_2 \leq \frac{M\epsilon}{1 - \epsilon}$$

[Figure: a near-optimal $D_\phi(x)$ is $\simeq 1$ on $\mathrm{supp}(p_x)$ and $\simeq 0$ on $\mathrm{supp}(p_g)$, hence flat almost everywhere.]

This happens when the discriminator is "too strong", for example when it is updated more frequently than the generator.
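The effect is easy to reproduce (a sketch under toy assumptions, not code from the paper): keep updating the discriminator on well-separated real and generated supports while the generator stays fixed, and watch the generator's gradient through log(1 − D_φ(g_θ(z))) collapse.

```python
import torch

torch.manual_seed(0)
# Toy 1-D setup (sizes are illustrative): data concentrated near +2, generator
# initialized near -2, so the two supports barely overlap, as in the theorem.
G = torch.nn.Linear(1, 1)
torch.nn.init.constant_(G.bias, -2.0)
D = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1), torch.nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-2)

def gen_grad_norm():
    """Norm of the generator gradient through log(1 - D(g(z)))."""
    G.zero_grad()
    torch.log(1 - D(G(torch.randn(512, 1)))).mean().backward()
    return torch.cat([p.grad.flatten() for p in G.parameters()]).norm().item()

for step in range(601):
    # Keep strengthening the discriminator while the generator stays fixed.
    x = 0.1 * torch.randn(256, 1) + 2.0
    z = torch.randn(256, 1)
    loss_D = -(torch.log(D(x)).mean() + torch.log(1 - D(G(z).detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    if step % 200 == 0:
        print(step, gen_grad_norm())  # shrinks toward 0 as D approaches optimality
```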
Problem of vanishing gradients
Proof:

$$\Big\|\nabla_\theta \mathbb{E}_{z \sim p_z}\big[\log(1 - D_\phi(g_\theta(z)))\big]\Big\|_2^2 = \Big\|\mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(1 - D_\phi(g_\theta(z)))\big]\Big\|_2^2 = \Big\|\mathbb{E}_{z \sim p_z}\Big[\frac{-\nabla_\theta D_\phi(g_\theta(z))}{1 - D_\phi(g_\theta(z))}\Big]\Big\|_2^2$$

$$\leq \mathbb{E}_{z \sim p_z}\Big[\Big\|\frac{\nabla_\theta D_\phi(g_\theta(z))}{1 - D_\phi(g_\theta(z))}\Big\|_2^2\Big] = \mathbb{E}_{z \sim p_z}\Big[\frac{\|\nabla_\theta D_\phi(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big] \qquad \text{(Jensen's inequality)}$$

$$= \mathbb{E}_{z \sim p_z}\Big[\frac{\|\nabla_x D_\phi(g_\theta(z))\, J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big] \leq \mathbb{E}_{z \sim p_z}\Big[\frac{\|\nabla_x D_\phi(g_\theta(z))\|_2^2\, \|J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big] \qquad \text{(Cauchy-Schwarz inequality)}$$

Since the discriminator is close to optimality, namely

$$\|D_\phi - D^*_\phi\| \doteq \sup_{x \in \mathcal{X}} \Big\{ |D_\phi - D^*_\phi| + \|\nabla_x D_\phi - \nabla_x D^*_\phi\|_2 \Big\} < \epsilon$$

(values and gradients are both similar), and since

$$\|\nabla_x D_\phi - \nabla_x D^*_\phi\|_2^2 \geq \|\nabla_x D_\phi\|_2^2 - \|\nabla_x D^*_\phi\|_2^2$$

we obtain $\|\nabla_x D_\phi\|_2^2 < \|\nabla_x D^*_\phi\|_2^2 + \epsilon^2$. (Cont.)
Problem of vanishing gradients
Proof (cont.): From $|D_\phi - D^*_\phi| < \epsilon$ it follows that $|1 - D^*_\phi - (1 - D_\phi)| < \epsilon$, and since $|1 - D^*_\phi| - |1 - D_\phi| \leq |1 - D^*_\phi - (1 - D_\phi)|$, we get $|1 - D_\phi| > |1 - D^*_\phi| - \epsilon$. Hence

$$\mathbb{E}_{z \sim p_z}\Big[\frac{\|\nabla_x D_\phi(g_\theta(z))\|_2^2\, \|J(g_\theta(z))\|_2^2}{|1 - D_\phi(g_\theta(z))|^2}\Big] < \mathbb{E}_{z \sim p_z}\Big[\frac{\big(\|\nabla_x D^*_\phi(g_\theta(z))\|_2^2 + \epsilon^2\big)\, \|J(g_\theta(z))\|_2^2}{\big(|1 - D^*_\phi(g_\theta(z))| - \epsilon\big)^2}\Big]$$

At optimality, $\nabla_x D^*_\phi(x) = 0$ and $D^*_\phi(x) = 0$ for all $x \in \mathrm{supp}(p_g) \setminus L$. Therefore

$$\Big\|\nabla_\theta \mathbb{E}_{z \sim p_z}\big[\log(1 - D_\phi(g_\theta(z)))\big]\Big\|_2^2 < \mathbb{E}_{z \sim p_z}\Big[\frac{\epsilon^2 \|J(g_\theta(z))\|_2^2}{(1 - \epsilon)^2}\Big] \leq \frac{M^2 \epsilon^2}{(1 - \epsilon)^2} \qquad \text{QED}$$
How do you solve it?

Recent Solution: add zero-mean noise to the inputs of the discriminator, i.e. feed it $g_\theta(z) + \epsilon$ and $x + \epsilon$.
“Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017, Martin Arjovsky, Leon Bottou
Theorem 3.2: with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$,

$$D^*_\phi(x) = \frac{p_{x+\epsilon}(x)}{p_{x+\epsilon}(x) + p_{g+\epsilon}(x)}$$

and

$$\mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big] = \mathbb{E}_{z \sim p_z}\Big[a(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_x(y)\,dy - b(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_g(y)\,dy\Big]$$

where $b(z) = a(z)\, \dfrac{p_{x+\epsilon}}{p_{g+\epsilon}}$.
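In code, the fix is a one-line change to what the discriminator sees (a sketch; sigma is a hypothetical hyperparameter here, and annealing it toward zero over training is common practice):

```python
import torch

sigma = 0.1  # hypothetical noise scale

def noisy(t: torch.Tensor) -> torch.Tensor:
    """Add the zero-mean Gaussian noise eps ~ N(0, sigma^2 I) of Theorem 3.2."""
    return t + sigma * torch.randn_like(t)

# Inside the usual training step (D, G, x_real, z as in the earlier sketch),
# the discriminator now sees x + eps and g(z) + eps:
# loss_D = -(torch.log(D(noisy(x_real))).mean()
#            + torch.log(1 - D(noisy(G(z)).detach())).mean())
# loss_G = torch.log(1 - D(noisy(G(z)))).mean()
```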
Proof: Define

$$a(z) \doteq \frac{1}{2\sigma^2}\, \frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}, \qquad b(z) \doteq \frac{1}{2\sigma^2}\, \frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\, \frac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}$$

Then

$$\mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big] = \mathbb{E}_{z \sim p_z}\Big[\nabla_\theta \log \frac{p_{g+\epsilon}(g_\theta(z))}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\Big]$$

$$= \mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(p_{g+\epsilon}(g_\theta(z))) - \nabla_\theta \log(p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z)))\big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[\frac{\nabla_\theta\, p_{g+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))} - \frac{\nabla_\theta\, p_{x+\epsilon}(g_\theta(z)) + \nabla_\theta\, p_{g+\epsilon}(g_\theta(z))}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\Big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[\frac{\nabla_\theta \{-p_{x+\epsilon}(g_\theta(z))\}}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))} - \frac{1}{p_{x+\epsilon}(g_\theta(z)) + p_{g+\epsilon}(g_\theta(z))}\, \frac{p_{x+\epsilon}(g_\theta(z))}{p_{g+\epsilon}(g_\theta(z))}\, \nabla_\theta \{-p_{g+\epsilon}(g_\theta(z))\}\Big]$$

$$= \mathbb{E}_{z \sim p_z}\big[2\sigma^2 a(z)\, \nabla_\theta \{-p_{x+\epsilon}(g_\theta(z))\} - 2\sigma^2 b(z)\, \nabla_\theta \{-p_{g+\epsilon}(g_\theta(z))\}\big] \qquad \text{(Cont.)}$$

Note: $\nabla_\theta$ acts only through the point $g_\theta(z)$ at which the densities are evaluated; the optimal-discriminator formula itself is assumed fixed (and therefore not dependent on $\theta$). Strictly speaking, this should be derived.

Recall that adding two independent random variables produces a random variable whose density is the convolution of the two original densities, so $p_{x+\epsilon}(u) = \int p_\epsilon(u - y)\, p_x(y)\, dy$ with the Gaussian density $p_\epsilon(u) = \frac{1}{Z} e^{-\frac{\|u\|^2}{2\sigma^2}}$.

Proof (cont.):

$$\mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big] = \mathbb{E}_{z \sim p_z}\Big[-2\sigma^2 a(z)\, \nabla_\theta \int p_\epsilon(g_\theta(z) - y)\, p_x(y)\,dy + 2\sigma^2 b(z)\, \nabla_\theta \int p_\epsilon(g_\theta(z) - y)\, p_g(y)\,dy\Big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[-2\sigma^2 a(z)\, \nabla_\theta \int \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, p_x(y)\,dy + 2\sigma^2 b(z)\, \nabla_\theta \int \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, p_g(y)\,dy\Big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[-2\sigma^2 a(z) \int \nabla_\theta\, \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, p_x(y)\,dy + 2\sigma^2 b(z) \int \nabla_\theta\, \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, p_g(y)\,dy\Big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[a(z) \int \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_x(y)\,dy - b(z) \int \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2\sigma^2}}\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_g(y)\,dy\Big]$$

$$= \mathbb{E}_{z \sim p_z}\Big[a(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_x(y)\,dy - b(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_g(y)\,dy\Big] \qquad \text{QED}$$
Why does it work?

Interpretation

$$\mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log(1 - D^*_\phi(g_\theta(z)))\big] = \mathbb{E}_{z \sim p_z}\Big[a(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_x(y)\,dy - b(z) \int p_\epsilon(g_\theta(z) - y)\, \nabla_\theta \|g_\theta(z) - y\|^2\, p_g(y)\,dy\Big]$$

First term (weighted by $p_x$). Assumption: $\sigma \gg 1$, so $p_\epsilon \simeq k$ (roughly constant). For every data point $y \sim p_x$, following the negative gradient $-\nabla_\theta \|g_\theta(z) - y\|^2$ pulls the generated point $g_\theta(z)$ toward $y$. The overall effect is to move generated points close to the data manifold (ATTRACTION TO HIGH DENSITY).
[Figure: a generated point $g_\theta(z)$ with arrows $-\nabla_\theta \|g_\theta(z) - y\|^2$ pointing toward data points $y \sim p_x$.]

If instead $\sigma \to 0$, then $p_\epsilon(g_\theta(z) - y) \simeq 0$ and the term disappears: VANISHING GRADIENTS again.

Second term (weighted by $p_g$). Under the same assumption $\sigma \gg 1$, $p_\epsilon \simeq k$, the sign is flipped, so the update follows $+\nabla_\theta \|g_\theta(z) - y\|^2$ and pushes $g_\theta(z)$ away from other generated points $y \sim p_g$. The overall effect is to stretch the generated manifold (STRETCHING HIGH-DENSITY REGIONS).
[Figure: a generated point $g_\theta(z)$ with arrows $\nabla_\theta \|g_\theta(z) - y\|^2$ pointing away from regions where $p_g$ is high.]
Interpretation

Recall $b(z) = a(z)\, \dfrac{p_{x+\epsilon}}{p_{g+\epsilon}}$, so the balance between the two terms depends on where the generated sample lies:

$a(z) \gg b(z)$ (i.e. $p_{g+\epsilon} \gg p_{x+\epsilon}$, the sample sits far from the data): the first term dominates and $g_\theta(z)$ is pulled toward $p_x$ (ATTRACTION).
[Figure: well-separated $p_g$ and $p_x$; samples $g_\theta(z)$ move from $p_g$ toward $p_x$.]

$a(z) \sim b(z)$ (i.e. $p_{g+\epsilon} \sim p_{x+\epsilon}$, the sample already sits near the data): the two terms balance and the net effect is spreading the samples out (STRETCHING).
[Figure: overlapping $p_g$ and $p_x$; samples $g_\theta(z)$ spread out along $p_x$.]
DEMO

[Diagram: toy demo setup - noise z ~ p_z feeds a generator with parameters θ1, θ2; a discriminator with parameters φ1, φ2 computes σ(Σ(·)) and outputs a value in [0, 1], comparing generated samples (p_g) against training data (p_x).]

Normal: GAMES = 1500, DISCRIMINATOR_STEPS = 1, GENERATOR_STEPS = 1
Strong discriminator: GAMES = 100, DISCRIMINATOR_STEPS = 600, GENERATOR_STEPS = 1
Strong generator: GAMES = 10, DISCRIMINATOR_STEPS = 1, GENERATOR_STEPS = 600
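A minimal sketch of the demo's schedule (hypothetical code; only the GAMES / DISCRIMINATOR_STEPS / GENERATOR_STEPS settings come from the slide, everything else is an assumption):

```python
import torch

# Tiny toy models, loosely matching the two-parameter nets in the diagram.
G = torch.nn.Linear(1, 1)                                            # theta1, theta2
D = torch.nn.Sequential(torch.nn.Linear(1, 1), torch.nn.Sigmoid())   # phi1, phi2, sigma(sum)
opt_D = torch.optim.SGD(D.parameters(), lr=0.05)
opt_G = torch.optim.SGD(G.parameters(), lr=0.05)

def play(games, d_steps, g_steps, batch=128):
    for _ in range(games):
        for _ in range(d_steps):  # discriminator ascends the value function
            x = 0.5 * torch.randn(batch, 1) + 3.0   # assumed toy data distribution p_x
            z = torch.randn(batch, 1)
            loss = -(torch.log(D(x)).mean() + torch.log(1 - D(G(z).detach())).mean())
            opt_D.zero_grad(); loss.backward(); opt_D.step()
        for _ in range(g_steps):  # generator descends it
            loss = torch.log(1 - D(G(torch.randn(batch, 1)))).mean()
            opt_G.zero_grad(); loss.backward(); opt_G.step()

play(games=1500, d_steps=1, g_steps=1)      # Normal
# play(games=100, d_steps=600, g_steps=1)   # Strong discriminator: vanishing gradients
# play(games=10, d_steps=1, g_steps=600)    # Strong generator: chases the current D
```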
Open issues

1. Other solutions to the vanishing gradient problem?
2. Problem when the dimension of the support of the input distribution is lower than the dimension of the support of the data distribution:

$$\dim(\mathrm{supp}(p_z)) \geq \dim(\mathrm{supp}(p_g)), \qquad \dim(\mathrm{supp}(p_x)) > \dim(\mathrm{supp}(p_g))$$

[Diagram: the generator g_θ(·) maps p_z to p_g; before training p_g is far from p_x, and even after training it can only cover a lower-dimensional manifold than p_x.]