generative adversarial networkstat.snu.ac.kr/mcp/lecture10_gan.pdf · 2019-11-11 · generative...
Generative Adversarial Network
Seoul National University Deep Learning September-December, 2019 1 / 38
Generative adversarial network (Goodfellow, 2014)
Since the publication of GAN by Goodfellow et al. (2014), many applications have been reported.
Since the publication of GAN by Goodfellow et al. (2014), many variants of GAN have been published.
Generative adversarial network
The setup is similar to that of the VAE: attempt to generate x given a latent variable z of smaller dimension.
Generative adversarial networks are based on a game-theoretic scenario in which the generator network must compete against an adversary.
The generator network produces samples x = g(z; θ_g) that attempt to fool the classifier into believing they are real. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator through P(y = 1 | x) = D(x).
Generative adversarial network: Setup
Let y = 1 if the data is real and y = 0 if the data is fake. We assume that there is a lower-dimensional representation z of x.
To generate data, one needs to know p(x | y = 1).
If P(y = 1) = P(y = 0) = 0.5, we have
\[ P(y = 1 \mid x) = \frac{p(x \mid y = 1)}{p(x \mid y = 1) + p(x \mid y = 0)} \]
In GAN, we specify P(y = 1 | x) and p(x | y = 0), and estimate p(x | y = 1) by minimizing a distance between p(x | y = 0) and p(x | y = 1).
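As a small numerical illustration of the posterior formula above, the sketch below evaluates P(y = 1 | x) for two one-dimensional Gaussian densities standing in for p(x | y = 1) and p(x | y = 0). The densities, their means, and the function names are assumptions made for this example only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_real(x, p_real, p_fake):
    """P(y = 1 | x) when P(y = 1) = P(y = 0) = 0.5."""
    a, b = p_real(x), p_fake(x)
    return a / (a + b)

# Hypothetical densities: real data ~ N(2, 1), generated data ~ N(0, 1).
p_real = lambda x: normal_pdf(x, 2.0, 1.0)
p_fake = lambda x: normal_pdf(x, 0.0, 1.0)

for x in (-1.0, 1.0, 3.0):
    print(x, round(posterior_real(x, p_real, p_fake), 4))
```

At the midpoint x = 1 the two densities coincide, so the posterior is 0.5; it moves toward 1 where the real density dominates and toward 0 where the generated density dominates.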
Generative adversarial network
The log-likelihood based on y | x ∼ Ber(D(x)) is
\[ \sum_{i=1}^{n} \Bigl[ y_i \log D(x_i; \theta_d) + (1 - y_i) \log\bigl(1 - D(x_i; \theta_d)\bigr) \Bigr] \]
However, the x_i are not observed for y_i = 0, so we replace x_i with g(z_i; θ_g). Since z is not observed for y = 0 either, the marginal likelihood is
\[ L(\theta_d, \theta_g) = \prod_{i=1}^{n} D(x_i; \theta_d)^{y_i} \int p(z_i) \bigl(1 - D(g(z_i; \theta_g); \theta_d)\bigr)^{1 - y_i} \, dz_i \]
Generative adversarial network: Discriminator and Generator
Consider the following quantity
\[ v(\theta_g, \theta_d) = \mathbb{E}_{x \sim p_{data}} \log D(x; \theta_d) + \mathbb{E}_{x \sim p_{model}} \log\bigl[1 - D(g(z; \theta_g); \theta_d)\bigr], \]
where p_data ≡ p(x | y = 1) and p_model ≡ p(x | y = 0). Note that this is not the expected likelihood in the usual sense.
Optimization
Discriminator: Maximize v(θ_g, θ_d) over θ_d given θ_g.
Generator: Minimize max_{θ_d} v(θ_g, θ_d) over θ_g.
Alternate Discriminator and Generator steps.
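To make the quantity v(θ_g, θ_d) concrete, the following sketch estimates it by Monte Carlo for a toy one-dimensional problem. The logistic discriminator, the shift generator g(z) = a + z, the data distribution N(3, 1), and all parameter values are assumptions chosen for illustration; they are not taken from the lecture.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def discriminator(x, w, c):
    """Hypothetical logistic discriminator D(x; theta_d) with theta_d = (w, c)."""
    return sigmoid(w * x + c)

def generator(z, a):
    """Hypothetical generator g(z; theta_g) = a + z with theta_g = a."""
    return a + z

def value_estimate(real_xs, zs, a, w, c):
    """Monte Carlo estimate of v = E log D(x) + E log(1 - D(g(z)))."""
    real_term = sum(math.log(discriminator(x, w, c)) for x in real_xs) / len(real_xs)
    fake_term = sum(
        math.log(1.0 - discriminator(generator(z, a), w, c)) for z in zs
    ) / len(zs)
    return real_term + fake_term

rng = random.Random(0)
real_xs = [rng.gauss(3.0, 1.0) for _ in range(2000)]  # stand-in for p_data
zs = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # samples from p(z)

v_blind = value_estimate(real_xs, zs, a=0.0, w=0.0, c=0.0)    # D(x) = 0.5 everywhere
v_better = value_estimate(real_xs, zs, a=0.0, w=2.0, c=-3.0)  # boundary at x = 1.5
print(v_blind, v_better)
```

A discriminator that outputs 0.5 everywhere gives v = −2 log 2; a discriminator that separates the two samples gives a larger value, which is what the maximization over θ_d rewards.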
Generative adversarial network
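Discriminator: Maximizing v(θ_g, θ_d) over θ_d estimates the discriminator. When the function space of D(x) is not restricted, the argmax of v over D is
\[ D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}. \]
Generator: Plugging P(y = 1 | x) = D*(x) = p(x | y = 1) / (p(x | y = 1) + p(x | y = 0)) into v(θ_g, θ_d), we minimize
\[
\begin{aligned}
v(\theta_g, \theta_d) &= \int p(x \mid y = 1) \log \frac{p(x \mid y = 1)}{p(x \mid y = 1) + p(x \mid y = 0)} \, dx \\
&\quad + \int p(x \mid y = 0) \log \frac{p(x \mid y = 0)}{p(x \mid y = 1) + p(x \mid y = 0)} \, dx \\
&= KL(p_{data} \| p^{**}) + KL(p_{model} \| p^{**}) + \text{const}
\end{aligned}
\]
where p** = (p_data + p_model)/2 and the constant equals −2 log 2.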
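The identity between the value function at the optimal discriminator and the two KL terms can be checked numerically. The sketch below does this for two small discrete distributions, which are hypothetical stand-ins for p(x | y = 1) and p(x | y = 0); in this discrete setting the constant works out to −2 log 2.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions on the same finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical discrete stand-ins for p(x | y = 1) and p(x | y = 0).
p_data = [0.1, 0.4, 0.5]
p_model = [0.3, 0.3, 0.4]
p_mix = [(a + b) / 2.0 for a, b in zip(p_data, p_model)]

# Value function with the optimal discriminator D*(x) = p_data / (p_data + p_model).
v_star = sum(
    a * math.log(a / (a + b)) + b * math.log(b / (a + b))
    for a, b in zip(p_data, p_model)
)

rhs = kl(p_data, p_mix) + kl(p_model, p_mix) - 2.0 * math.log(2.0)
print(v_star, rhs)
```

The two printed numbers agree up to floating-point error, and both are bounded below by −2 log 2, the value attained when p_model equals p_data.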
Generative adversarial network
source: https://poloclub.github.io/ganlab/
AE, VAE and GAN
                            AE                               VAE                        GAN
estimation of               transformation                   conditional distribution   distribution through transformation
specifying transformation   x = g(z; θ_g), z = f(x; θ_f)     none                       x = g(z; θ_g)
specifying distributions    none                             p(z), p(x|z), q(z|x)       p(z), and thus indirectly p(x)
objective function          ‖x − g(f(x; θ_f); θ_g)‖²         KL divergence              Jensen-Shannon divergence
GAN algorithm
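The alternating discriminator and generator updates can be sketched on a toy problem. The code below is a minimal illustration under strong assumptions, not the algorithm as printed on the slide: data are N(3, 1), the discriminator is logistic, D(x) = sigmoid(w x + c), the generator is a pure shift, g(z) = a + z, gradients are written out by hand, and the generator uses the non-saturating objective (maximizing log D(g(z))). All names and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_toy_gan(steps=100, batch=64, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data.

    Assumed toy models (not from the lecture):
      discriminator D(x) = sigmoid(w * x + c), generator g(z) = a + z.
    """
    rng = random.Random(seed)
    a, w, c = 0.0, 0.0, 0.0
    for _ in range(steps):
        real = [rng.gauss(3.0, 1.0) for _ in range(batch)]      # x ~ p_data
        fake = [a + rng.gauss(0.0, 1.0) for _ in range(batch)]  # x = g(z)

        # Discriminator step: gradient ascent on
        # mean log D(real) + mean log(1 - D(fake)).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (sum((1.0 - d) * x for d, x in zip(d_real, real))
                  - sum(d * x for d, x in zip(d_fake, fake))) / batch
        grad_c = (sum(1.0 - d for d in d_real) - sum(d_fake)) / batch
        w += lr * grad_w
        c += lr * grad_c

        # Generator step: gradient ascent on mean log D(g(z));
        # d/da log D(a + z) = (1 - D(a + z)) * w.
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        grad_a = sum((1.0 - sigmoid(w * (a + z) + c)) * w for z in zs) / batch
        a += lr * grad_a
    return a, w, c

a, w, c = train_toy_gan()
print("generator shift:", a, "discriminator:", (w, c))
```

With these settings the generator shift moves away from 0 toward the data mean; with such a simple discriminator the parameters may oscillate around the equilibrium rather than settle, which is one face of the training instability discussed later.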
Implementation of GAN
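In practice, D(x) is restricted to a neural network. We first maximize v(θ_g, θ_d) over θ_d to obtain θ_d*. Then,
\[ \theta_g^* = \arg\min_{\theta_g} \mathbb{E}_{x \sim p_{model}} \log\bigl[1 - D(g(z; \theta_g); \theta_d^*)\bigr] \]
The i-th contribution of E_{x∼p_model} log[1 − D(g(z; θ_g); θ_d*)] is stochastically evaluated by
\[ \frac{1}{M} \sum_{m=1}^{M} \log\bigl[1 - D(g(z_i^{(m)}; \theta_g); \theta_d^*)\bigr] \]
where z_i^{(m)} is generated from p(z).
In practice, minimizing E_{x∼p_model} log[1 − D(g(z; θ_g); θ_d*)] does not work well, because its gradient becomes very small when the discriminator confidently rejects the generated samples. Instead, one aims to maximize E_{x∼p_model} log[D(g(z; θ_g); θ_d*)] over θ_g.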
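The difference between the two generator objectives shows up in their gradients with respect to the discriminator logit. The sketch below assumes D = sigmoid(s), where s is the logit at a generated sample; the function names are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_saturating(s):
    """d/ds of log(1 - sigmoid(s)): the generator minimizes this term."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds of log(sigmoid(s)): the generator maximizes this term."""
    return 1.0 - sigmoid(s)

# s is the discriminator logit at a generated sample, so D(g(z)) = sigmoid(s).
for s in (-6.0, -3.0, 0.0):
    print(s, round(grad_saturating(s), 4), round(grad_non_saturating(s), 4))
```

When the discriminator is confident that a sample is fake (s very negative), the first gradient is close to 0 while the second stays close to 1, so the alternative objective keeps providing a learning signal.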
GAN: Comments
In a nutshell, GAN finds p(x | y = 0), i.e. p_model, by minimizing the KL divergence between p_model and (p_data + p_model)/2. Therefore the objective function is KL(p_model ‖ p**).
The GAN algorithm is an example of how to minimize KL(p_model ‖ p**), namely by finding P(y = 1 | x), indexed by θ_d, through the 'discriminator'.
Game-theoretic arguments may be oversold since they are not essential for estimating the density. The role of the discriminator is to determine the loss function for the generator, KL(p_data ‖ p**) + KL(p_model ‖ p**), which is the Jensen-Shannon divergence up to a constant factor.
The Jensen-Shannon divergence has an advantage over the KL divergence in that it remains finite when the supports of the two distributions do not overlap.
Detailed Architecture of GAN: DCGAN (Radford et al. 2016, ICLR)
Radford et al. 2016, ICLR (DCGAN)
Architectures designed for classification need modification before they can be used in GANs.
Some tips are given, such as replacing max pooling with strided convolutional layers, using batch normalization, and using ReLU in the generator with tanh for the output layer.
Issues of unstable training remained.
Many GAN’s
Many ways of improving GANs have been proposed (Salimans et al., 2016).
Many variants of GAN have been proposed.
CycleGAN (Zhu et al., 2017): Domain transfer (input = horse, output = zebra)
Text to image (Reed et al., 2016, ICML)
Pix2pix (Isola et al., 2017, CVPR)
WGAN (Arjovsky et al., 2017) is one of the most popular variants; it uses the Wasserstein distance to optimize the generating distribution.
Wasserstein GAN: Distance
If the real data distribution P_r of X admits a density and P_θ is the distribution of g(Z; θ) with a parametrized density, then, asymptotically, likelihood inference amounts to minimizing the Kullback-Leibler divergence KL(P_r ‖ P_θ).
When the distributions are supported on low-dimensional manifolds, the KL divergence is not defined (or is infinite).
WGAN minimizes the Wasserstein distance between P_r and P_θ.
Distances between two distributions
Total variation distance:
\[ \delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)| \]
where Σ denotes the set of all Borel subsets of a compact metric space X.
Kullback-Leibler divergence: KL(Pr‖Pθ)
Jensen-Shannon divergence:
\[ JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \| P_m) + \tfrac{1}{2} KL(P_\theta \| P_m) \]
where P_m = (P_r + P_θ)/2.
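For discrete distributions these three quantities are easy to compute directly. The sketch below is an illustration with hypothetical probability vectors; the Jensen-Shannon function uses the convention with weight 1/2 on each KL term, under which the divergence is bounded by log 2.

```python
import math

def total_variation(p, q):
    """sup_A |P(A) - Q(A)| for discrete distributions equals half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL(p || q); returns infinity when p puts mass where q has none."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0.0:
            if b == 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total

def js(p, q):
    """Jensen-Shannon divergence with weight 1/2 on each KL term."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]  # support disjoint from that of p
print(total_variation(p, q), kl(p, q), js(p, q), math.log(2.0))
```

For distributions with disjoint supports, the total variation distance is 1, the KL divergence is infinite, and the Jensen-Shannon divergence equals log 2, regardless of how far apart the supports are.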
Earth-mover distance
Earth-mover distance or Wasserstein-1 distance:
\[ W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma} [\|x - y\|] \]
where Π(P_r, P_θ) denotes the set of all joint distributions γ(x, y) whose marginal distributions are P_r and P_θ, respectively.
source: Cuturi and Solomon, 2017, NeurIPS tutorial, A Primer on optimal transport
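In one dimension the infimum over couplings has a simple solution for two empirical samples of equal size: the optimal plan matches the sorted points. The sketch below relies on that one-dimensional fact; the function name and the sample values are hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """W1 between two empirical distributions with the same number of points.

    In one dimension the optimal coupling matches sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs_sorted, ys_sorted)) / len(xs)

xs = [0.0, 1.0, 2.0, 5.0]
ys = [x + 3.0 for x in xs]   # the same sample shifted by 3
print(wasserstein1_1d(xs, ys))
```

Shifting a sample by a constant moves every point by that constant, so the distance equals the size of the shift.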
Optimal transport
If we imagine the distributions as different heaps of a certain amount of earth, then the EMD is the minimal total amount of work it takes to transform one heap into the other. Work is defined as the amount of earth in a chunk times the distance it was moved.
Calculating the EMD is in itself an optimization problem: there are infinitely many ways to move the earth around, and we need to find the optimal one. We call the transport plan that we are trying to find γ(x, y). It simply states how we distribute the amount of earth from one place x over the domain of y, or vice versa.
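The heap picture can be turned into a few lines of code for histograms on equally spaced bins. The sketch below uses the one-dimensional fact that the minimal work equals the accumulated difference of the two cumulative sums; the function name and the example heaps are hypothetical.

```python
def emd_histogram(p, q, bin_width=1.0):
    """Earth-mover distance between two histograms on equally spaced bins.

    Sweeping left to right, `carry` is the net amount of earth that must
    cross into the next bin; each crossing costs one bin width.
    """
    if abs(sum(p) - sum(q)) > 1e-12:
        raise ValueError("both heaps must contain the same amount of earth")
    work, carry = 0.0, 0.0
    for a, b in zip(p, q):
        carry += a - b
        work += abs(carry) * bin_width
    return work

print(emd_histogram([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # all earth moves two bins
print(emd_histogram([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))
```

In the second call the cheapest plan carries half a unit of earth across two bin boundaries, which again shows that the work depends on how far the earth travels and not only on how much of it differs.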
Wasserstein distance
source: STAT36-708 Lecture note from Larry Wasserman
Distances such as total variation may ignore the underlying geometry of the space. For three densities p1, p2, p3 (for example, with pairwise disjoint supports), we have
\[ \int |p_1 - p_2| \, dx = \int |p_1 - p_3| \, dx = \int |p_2 - p_3| \, dx \]
and similarly for the other distances. But our intuition tells us that p1 and p2 are close together, which is captured by the Wasserstein distance.
Distance: Example
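Let Z ∼ U[0, 1], the uniform distribution on the unit interval. Let P_0 be the distribution of (0, Z) ∈ R², with 0 on the x-axis and the random variable Z on the y-axis, that is, uniform on a vertical line segment passing through the origin. Let P_θ be the distribution of (θ, Z). Then
\[ W(P_0, P_\theta) = |\theta| \]
\[ JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]
\[ KL(P_0 \| P_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]
\[ \delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases} \]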
When θ_t → 0, the sequence (P_{θ_t})_{t∈N} converges to P_0 under the EM distance, but does not converge under the JS, KL, or TV divergences.
The KL, JS, and TV distances are not sensible loss functions when learning distributions supported on low-dimensional manifolds. However, the EM distance effectively captures the difference in this setup.
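The parallel-lines example can be mimicked with a discretized version in which Z takes a few grid values. The sketch below is an illustration only: the grid, the helper names, and the use of the weight-1/2 Jensen-Shannon convention are assumptions. The transport cost reported is that of the coupling that moves each point (0, z) horizontally to (θ, z), which is optimal here because every coupling must move all mass a horizontal distance of at least |θ|.

```python
import math

def line_distribution(theta, n=5):
    """Discrete stand-in for the law of (theta, Z): mass 1/n on n grid points."""
    return {(theta, k / (n - 1)): 1.0 / n for k in range(n)}

def distances(p, q):
    """Return (TV, KL(p || q), JS) for discrete distributions stored as dicts."""
    support = set(p) | set(q)
    tv = 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)
    kl = sum(
        (math.inf if q.get(s, 0.0) == 0.0 else a * math.log(a / q[s]))
        for s, a in p.items()
    )
    js = 0.0
    for s in support:
        a, b = p.get(s, 0.0), q.get(s, 0.0)
        m = (a + b) / 2.0
        js += 0.5 * sum(t * math.log(t / m) for t in (a, b) if t > 0.0)
    return tv, kl, js

def matched_transport_cost(theta):
    """Cost of moving each point (0, z) horizontally to (theta, z)."""
    return abs(theta)

for theta in (1.0, 0.1, 0.0):
    tv, kl, js = distances(line_distribution(0.0), line_distribution(theta))
    print(theta, matched_transport_cost(theta), tv, kl, js)
```

Only the transport cost shrinks as θ approaches 0; the other three quantities stay at their maximal values until θ is exactly 0, so they give no usable gradient with respect to θ.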
Dual form of Wasserstein distance: Kantorovich-Rubinstein duality
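\[
\begin{aligned}
W(P_r, P_\theta) &= \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma} [\|x - y\|] \\
&= \inf_{\gamma} \sup_{f} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \|x - y\| + \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] - (f(x) - f(y)) \bigr]
\end{aligned}
\]
since
\[ \sup_{f} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] - (f(x) - f(y)) \bigr] = \begin{cases} 0 & \text{if } \gamma \in \Pi \\ +\infty & \text{otherwise} \end{cases} \]
Using Sion's minimax theorem, it can be shown that strong duality holds.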
Kantorovich-Rubinstein duality-continued
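Due to strong duality,
\[
\begin{aligned}
&\inf_{\gamma} \sup_{f} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \|x - y\| + \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] - (f(x) - f(y)) \bigr] \\
&= \sup_{f} \inf_{\gamma} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \|x - y\| + \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] - (f(x) - f(y)) \bigr] \\
&= \sup_{f} \Bigl[ \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] + \underbrace{\inf_{\gamma} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \|x - y\| - (f(x) - f(y)) \bigr]}_{= \, 0 \text{ if } |f(x_1) - f(x_2)| \le |x_1 - x_2| \text{ for all } x_1, x_2; \; -\infty \text{ otherwise}} \Bigr] \\
&= \sup_{f : \, |f(x_1) - f(x_2)| \le |x_1 - x_2|} \Bigl[ \mathbb{E}_{s \sim P_r}[f(s)] - \mathbb{E}_{t \sim P_\theta}[f(t)] \Bigr]
\end{aligned}
\]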
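The dual form says that every 1-Lipschitz function yields a lower bound on the Wasserstein distance, with equality for the best such function. The sketch below checks this on two small one-dimensional samples; the samples and the three candidate functions are hypothetical, and the primal value uses the sorted-matching solution that is valid in one dimension.

```python
def mean(values):
    return sum(values) / len(values)

def primal_w1(xs, ys):
    """W1 for equally sized 1-D samples: match sorted points."""
    return mean([abs(x - y) for x, y in zip(sorted(xs), sorted(ys))])

def dual_value(f, xs, ys):
    """E_{s ~ P_r} f(s) - E_{t ~ P_theta} f(t) for a candidate function f."""
    return mean([f(x) for x in xs]) - mean([f(y) for y in ys])

xs = [2.0, 3.0, 4.5, 6.0]        # sample standing in for P_r
ys = [x - 2.0 for x in xs]       # sample standing in for P_theta

# Three hypothetical 1-Lipschitz candidates.
candidates = {
    "identity": lambda t: t,
    "half slope": lambda t: 0.5 * t,
    "negated": lambda t: -t,
}
w1 = primal_w1(xs, ys)
for name, f in candidates.items():
    print(name, dual_value(f, xs, ys), "<=", w1)
```

Here the identity function attains the primal value, while the other candidates give smaller values, as the duality predicts.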
Wasserstein GAN (Arjovsky et al., 2017)
Kantorovich-Rubinstein duality:
\[ W(P_r, P_\theta) = \sup_{f : \, |f(x_1) - f(x_2)| \le |x_1 - x_2|} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \]
where the supremum is over all the 1-Lipschitz functions f: C → R.
To approximate W(P_r, P_θ), consider a parametric family f_w indexed by w and
\[ \max_{w \in \Omega} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))] \]
WGAN trains a neural network parameterized by weights w lying in a compact space Ω and then backpropagates through E_{z∼p(z)}[∇_θ f_{w*}(g_θ(z))], where w* is the argmax.
Wasserstein GAN
Note that the compactness of Ω implies that all the functions f_w are K-Lipschitz for some K.
In order to keep the parameters w in a compact space, clamp the weights to, for example, Ω = [−0.01, 0.01]^l after each gradient update.
Weight clipping is a poor but practical way to enforce a Lipschitz constraint. A large clipping parameter may cause slow convergence, and a small one may lead to vanishing gradients.
Gulrajani et al. (2017): Add a penalty term, (‖∇_x D(x)‖_2 − K)², to enforce Lipschitz continuity.
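Both ways of constraining the critic can be illustrated in a few lines. The sketch below assumes a hypothetical critic f_w(x) = tanh(w · x), chosen only because its input gradient has a closed form; the function names are also hypothetical. In Gulrajani et al. (2017) the penalty is evaluated at random points interpolated between real and generated samples, whereas here the evaluation points are simply passed in.

```python
import math

def clip_weights(weights, c=0.01):
    """Clamp every weight to [-c, c], as done after each critic update."""
    return [max(-c, min(c, w)) for w in weights]

def critic(x, weights):
    """Hypothetical critic f_w(x) = tanh(w . x) used only for illustration."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))

def critic_grad_norm(x, weights):
    """Euclidean norm of grad_x f_w(x) = (1 - tanh^2(w . x)) * w."""
    scale = 1.0 - critic(x, weights) ** 2
    return abs(scale) * math.sqrt(sum(w * w for w in weights))

def gradient_penalty(points, weights, k=1.0):
    """Average of (||grad_x f_w(x)||_2 - k)^2 over the given points."""
    return sum((critic_grad_norm(x, weights) - k) ** 2 for x in points) / len(points)

weights = [0.5, -2.0, 0.003]
print(clip_weights(weights))  # every entry now lies in [-0.01, 0.01]
print(gradient_penalty([[0.0, 0.0, 0.0]], [0.6, 0.8, 0.0]))
```

Clipping acts on the parameters and bounds the Lipschitz constant only indirectly, while the penalty acts on the input gradient itself and vanishes when its norm equals K at the evaluated points.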
Wasserstein GAN algorithm
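A toy version of WGAN training with weight clipping is sketched below. It is an illustration under strong assumptions rather than the algorithm shown on the slide: data are N(3, 1), the critic is linear, f_w(x) = w x, the generator is a shift, g(z) = a + z, plain gradient steps are used, and all names and hyperparameters (including the clipping value 1.0) are hypothetical.

```python
import random

def train_toy_wgan(steps=500, n_critic=5, batch=64, lr_critic=0.1,
                   lr_gen=0.05, clip=1.0, seed=0):
    """Toy WGAN with weight clipping on 1-D data.

    Assumed toy models (not from the lecture):
      critic f_w(x) = w * x with w clipped to [-clip, clip],
      generator g(z) = a + z with z ~ N(0, 1).
    """
    rng = random.Random(seed)
    a, w = 0.0, 0.0
    for _ in range(steps):
        for _ in range(n_critic):
            real = [rng.gauss(3.0, 1.0) for _ in range(batch)]
            fake = [a + rng.gauss(0.0, 1.0) for _ in range(batch)]
            # Ascend E[f_w(x)] - E[f_w(g(z))]; its gradient in w is the mean gap.
            grad_w = sum(real) / batch - sum(fake) / batch
            w = max(-clip, min(clip, w + lr_critic * grad_w))
        # Generator descends -E[f_w(g(z))]; d/da of w * (a + z) is w.
        a += lr_gen * w
    return a, w

a, w = train_toy_wgan()
print("generator shift:", a, "critic weight:", w)
```

The critic is updated several times per generator step, and its weight changes sign once the generator overshoots the data mean, so the shift ends up near 3 with small oscillations.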
Improved Wasserstein GAN algorithm (Gulrajani et al., 2017)
Family of distance functions for distributions: Integral Probability Metrics
Given a set F of functions from C to R, define
\[ d_F(P, Q) = \sup_{f \in F} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)], \]
called an Integral Probability Metric (IPM).
If F is the set of 1-Lipschitz functions, d_F(P, Q) is the Wasserstein distance.
If F is the set of all measurable functions bounded between −1 and 1, d_F(P, Q) is the total variation distance (up to a factor of 2 under the sup-over-sets definition).
If F = {f ∈ H : ‖f‖_H ≤ 1} for some reproducing kernel Hilbert space H, d_F(P, Q) is the maximum mean discrepancy (MMD).
WGAN vs. GAN
Wasserstein Autoencoder (WAE)
Tolstikhin et al. (2018) proposed the Wasserstein Auto-Encoder (WAE), a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution.
The regularizer encourages the encoded training distribution to match the prior.
Compared with WGAN, WAE uses the primal definition of the Wasserstein distance.
AE, VAE, WGAN and WAE
          AE              VAE          GAN/WGAN        WAE
Encoder   Deterministic   Stochastic   none            Stochastic
Decoder   Deterministic   Stochastic   Deterministic   Stochastic
P(z)      no              yes          yes             yes
Wasserstein Autoencoder (WAE)
Let the deterministic decoder P_G(X | Z) map Z to X = G(Z) for a given G: Z → X. Q(Z | X) is a conditional distribution of Z given X. Let P_Z be the prior and Q_Z(Z) = E_{X∼P_X}[Q(Z | X)]. Then
\[ \inf_{\Gamma \in \mathcal{P}(X \sim P_X, \, Y \sim P_G)} \mathbb{E}_{(X, Y) \sim \Gamma}[c(X, Y)] = \inf_{Q : \, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z \mid X)}[c(X, G(Z))] \]
Objective function for WAE:
\[ D_{WAE}(P_X, P_G) = \inf_{Q(Z \mid X) \in \mathcal{Q}} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z \mid X)}[c(X, G(Z))] + \lambda D_Z(Q_Z, P_Z), \]
where \mathcal{Q} is any nonparametric set of probabilistic encoders, D_Z is an arbitrary divergence between Q_Z and P_Z, and λ > 0 is a hyperparameter.
Wasserstein Autoencoder (WAE): Choices of penalty term
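GAN-based D_Z: Jensen-Shannon divergence between Q_Z and P_Z, D_JS(Q_Z, P_Z). Adversarial training: estimate γ by maximizing
\[ \frac{\lambda}{n} \sum_{i=1}^{n} \bigl[ \log D_\gamma(z_i) + \log(1 - D_\gamma(\tilde{z}_i)) \bigr], \]
where z_i is drawn from the prior P_Z and \tilde{z}_i from the encoder Q(Z | x_i).
MMD-based D_Z: For a positive-definite reproducing kernel k: Z × Z → R, the maximum mean discrepancy (MMD) is defined as
\[ MMD_k(P_Z, Q_Z) = \Bigl\| \int_Z k(z, \cdot) \, dP_Z(z) - \int_Z k(z, \cdot) \, dQ_Z(z) \Bigr\|_{H_k} \]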
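The squared MMD can be estimated from samples by replacing the integrals with averages of kernel evaluations. The sketch below uses a Gaussian RBF kernel and the biased (V-statistic) estimator on one-dimensional codes; the kernel choice, the bandwidth, and the sample values are assumptions for illustration and need not match the choices made in the WAE paper.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of MMD_k^2 between two 1-D samples."""
    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

prior_codes = [-1.0, -0.3, 0.2, 0.9]     # stand-in for samples from P_Z
encoded_codes = [1.5, 2.1, 2.8, 3.4]     # stand-in for samples from Q_Z
print(mmd_squared(prior_codes, prior_codes))    # identical samples
print(mmd_squared(prior_codes, encoded_codes))  # shifted samples
```

The estimate is zero for identical samples and positive when the encoded codes are shifted away from the prior samples, which is the behavior the penalty term relies on.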
WAE algorithms
WGAN, WAE-GAN and WAE-MMD
                    WGAN             WAE-GAN           WAE-MMD
p(x|z; θ)           x = g(z; θ_g)    x = g(z; θ_g)     x = g(z; θ_g)
p(z)                normal           normal            normal
q(z|x)              none             q(z|x)            q(z|x)
P(y = 1|x)          none             P(y = 1|x)        none
w-distance          dual             primal            primal
critic for dual     f_w              -                 -
primal constraint   -                JS(p(z)‖q(z))     ‖∫k(z,·)dP(z) − ∫k(z,·)dQ(z)‖_H