Gated-GAN: Adversarial Gated Networks forMulti-Collection Style Transfer

Xinyuan Chen, Chang Xu, Xiaokang Yang, Senior Member, IEEE, Li Song, and Dacheng Tao, Fellow, IEEE

Abstract—Style transfer describes the rendering of an image's semantic content as different artistic styles. Recently, generative adversarial networks (GANs) have emerged as an effective approach in style transfer by adversarially training the generator to synthesize convincing counterfeits. However, traditional GAN suffers from the mode collapse issue, resulting in unstable training and making style transfer quality difficult to guarantee. In addition, the GAN generator is only compatible with one style, so a series of GANs must be trained to provide users with choices to transfer more than one kind of style. In this paper, we focus on tackling these challenges and limitations to improve style transfer. We propose adversarial gated networks (Gated-GAN) to transfer multiple styles in a single model. The generative networks have three modules: an encoder, a gated transformer, and a decoder. Different styles can be achieved by passing input images through different branches of the gated transformer. To stabilize training, the encoder and decoder are combined as an auto-encoder to reconstruct the input images. The discriminative networks are used to distinguish whether the input image is a stylized or genuine image. An auxiliary classifier is used to recognize the style categories of transferred images, thereby helping the generative networks generate images in multiple styles. In addition, Gated-GAN makes it possible to explore a new style by investigating styles learned from artists or genres. Our extensive experiments demonstrate the stability and effectiveness of the proposed model for multi-style transfer.

Index Terms—Multi-Style Transfer, Adversarial Generative Networks.

I. INTRODUCTION

STYLE transfer refers to redrawing an image by imitating another artistic style. Specifically, given a reference style, one can make the input image look like it has been redrawn with a different stroke, perceptual representation, or color scheme, or that it has been retouched using a different artistic interpretation. Manually transferring the image style by a professional artist usually takes considerable time. However, style transfer is a valuable technique with many practical applications, for example quickly creating cartoon scenes from landscapes or city photographs and providing amateur artists with guidelines for painting. Therefore, optimizing style transfer is a valuable pursuit.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2018.2869695, IEEE Transactions on Image Processing.

© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

X. Chen is with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China, and also with the Centre for Artificial Intelligence and the Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]).

C. Xu and D. Tao are with the UBTech Sydney Artificial Intelligence Centre and the School of Information Technologies, the Faculty of Engineering and Information Technologies, The University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia (e-mail: [email protected]; [email protected]).

X. Yang and S. Li are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; song [email protected]).

Style transfer, as an extension of texture transfer, has a rich history. Texture transfer aims to render an object with the texture extracted from a different object [1], [2], [3], [4]. In the early days, texture transfer used low-level visual features of target images, while the latest style transfer approaches are based on semantic features derived from pre-trained convolutional neural networks (CNNs). Gatys et al. [5] introduced the neural style transfer algorithm to separate natural image content and style to produce new images by combining the content of an arbitrary photograph with the styles of numerous well-known works of art. A number of variants emerged to improve the speed, flexibility, and quality of style transfer. Johnson et al. [6] and Ulyanov et al. [7] accelerated style transfer by using feedforward networks, while Chen et al. [8], Li et al. [9] and Odena et al. [10] achieved multi-style transfer by extracting each style from a single image. Ulyanov et al. [11] and Luan et al. [12] enhanced the quality of style transfer by investigating instance normalization in feedforward networks [7].

CNN-based style transfer methods can now produce high-quality imitative images. However, these methods focus on transferring the original image to the style provided by another style image (typically a painting). In contrast, collection style transfer aims to stylize a photograph by mimicking an artist's or genre's style. In practice, when a user takes a picture of a beautiful landscape, he might hope to re-render it on canvas such that it appears to have been painted by an artist, e.g., Monet, or in the style of a famous animation, e.g., Your Name. Given an in-depth understanding of an artist's collection of paintings, it is possible to imagine how the artist might render the scene.

With this in mind, generative adversarial networks (GANs) [13] can be applied to learn the distribution of an artist's paintings. GANs are a framework in which two neural networks compete with each other: a generative network and a discriminative network. The generative and discriminative networks are simultaneously optimized in a two-player game, where the discriminative networks aim to determine whether or not the input is painted by the artist, while the generative networks learn to generate images to fool the discriminative networks.


However, the GAN training procedure is unstable. In particular, without paired training samples, the original GANs cannot guarantee that the output imitations contain the same semantic information as that of the input images. CycleGAN [14], DiscoGAN [15], and DualGAN [16] proposed cycle-consistent adversarial networks to address the unpaired image-to-image translation problem. They simultaneously trained two pairs of generative networks and discriminative networks, one to produce imitative paintings and the other to transform the imitation back to the original photograph and pursue cycle consistency.

Considering the wide application of style transfer on mobile devices, space saving is an important algorithm design consideration. CycleGAN [14], DiscoGAN [15], and DualGAN [16] can only transfer one style per network. In this work, we propose a gated transformer module to achieve multi-collection style transfer in a single network. Moreover, the cycle-consistency loss adopted by previous methods requires an additional network that converts the stylized image back into the original one; as the number of transferred styles increases, training with the cycle-consistency loss becomes increasingly complicated. Furthermore, style transfer is actually a one-sided translation problem, which does not require style images to be transformed back into content images. In our method, we adopt an encoder-decoder subnetwork and an auto-encoder reconstruction loss to guarantee that the outputs retain semantic information consistent with the content images. With the auto-encoder reconstruction loss, our algorithm achieves a one-sided mapping, which needs fewer parameters and can be easily generalized to multiple styles.

The proposed adversarial gated networks (Gated-GAN) realize the transfer of multiple artist or genre styles in a single network (see Figure 1). Different from the conventional encoder-decoder architectures in [6], [17], [14], we additionally consider a gated-transformer network between the encoder and decoder consisting of multiple gates, each corresponding to one style. The gate controls which transformer branch is connected to the model, so users can switch gates to choose between different styles. If the gated transformer is skipped, the encoder and decoder are trained as an auto-encoder to preserve semantic consistency between input images and their reconstructions. At the same time, the mode collapse issue is avoided and the training procedure is stabilized. The gated transformer also facilitates generating new styles through weighted connections between the transformer branches. Our discriminative network architecture has two components: the first distinguishes synthesized images from genuine images, and the other identifies the specific styles of these images. Experiments demonstrate that our adversarial gated networks successfully achieve multi-collection style transfer with a quality that is better than or at least comparable to existing methods.

The remainder of this paper is organized as follows. In Section 2, we summarize related work. The proposed method is detailed in Section 3. The results of experiments using the proposed method and comparisons with existing methods are reported in Section 4. We conclude in Section 5.

II. RELATED WORK

In this section, we introduce related style transfer works. We classify style transfer methods into four categories: texture synthesis-based methods, optimization-based methods, feedforward network-based methods, and adversarial network-based methods.

A. Traditional Texture Transfer Method

Style transfer is an extension of texture transfer, the goal of the latter being to render an object with a texture taken from a different object [1], [2], [3], [4]. Most previous texture transfer algorithms rely on texture synthesis methods and low-level image features to preserve target image structure. Texture synthesis is the process of algorithmically constructing an unlimited number of images from a texture sample. The generated images are perceived by humans to be of the same texture but not exactly like the original images. A large range of powerful parametric and non-parametric algorithms exist to synthesize photo-realistic natural texture [18], [19], [20]. Based on texture synthesis, [21] and [22] used segmentation and patch matching to preserve information content. However, the texture transfer methods use only low-level target image features to inform texture transfer and take a long time to migrate a style from one image to another.

B. Optimization-based Methods

The success of deep CNNs for image classification [23], [24] prompted many scientists and engineers to visualize features from a CNN [25]. DeepDream [24] was initially invented to help visualize what a deep neural network sees when given an image. Later, the algorithm became a technique to generate artworks in new psychedelic and abstract forms. Based on image representations derived from pre-trained CNNs, Gatys et al. [5] introduced a neural style transfer algorithm to separate and recombine image content and style. This approach has since been improved in various follow-up papers. Li et al. [26] studied patch-based style transfer by combining generative Markov random field (MRF) models and the pre-trained CNNs. Selim et al. [27] extended this idea to head portrait painting transfer by imposing novel spatial constraints to avoid facial deformations. Luan et al. [12] studied photorealistic style transfer by assuming the input-to-output transformation was locally affine in color space. Optimization-based methods can produce high quality results but they are computationally expensive, since each optimization step requires a forward and backward pass through the pre-trained network.

C. Feedforward Network-based Methods

Feedforward network-based methods accelerate the optimization procedure: a generative model is first trained iteratively, and the styled image is then produced through a single forward pass. Johnson et al. [6] and Ulyanov et al. [7] trained a feedforward network to quickly produce similar outputs. Based on [7], Ulyanov et al. [11] then proposed to maximize quality and diversity by replacing the batch normalization module with instance normalization.


Fig. 1. Gated-GAN for multi-collection style transfer. The images are produced from a single model with a shared encoder and decoder. Styles are controlled by switching between gated-transformer branches. From left to right: original images, transferred images in Monet's style, Van Gogh's style, Cezanne's style, and Ukiyo-e style.

After that, several works explored multi-style transfer in a single network. Dumoulin et al. [28] proposed conditional instance normalization, which specialized the scaling and shifting parameters after normalization to each specific texture and allowed the style transfer network to learn multiple styles. Huang et al. [29] introduced an adaptive instance normalization (AdaIN) layer that adjusted the mean and variance of the content input to match those of the style input. [8] introduced StyleBank, which is composed of multiple convolutional filter banks integrated in an auto-encoder, with each filter bank an explicit representation of one style. [9] took a noise vector and a selection unit as input to generate diverse image styles. Although adopting different methods to achieve multi-style transfer, they all explicitly extract style representations from style images based on the Gram matrix [5]. Gram-matrix-based methods can perform collection style transfer if they use several images as the style: although these methods are designed to transfer the style of a single image, they can also transfer the style of several images by averaging the Gram matrix statistics of pre-trained deep features. On the other hand, our method learns to output samples from the distribution of the style of a collection. [30] achieved universal style transfer by applying the style characteristics of a style image to content images in a style-agnostic manner. By whitening and coloring transformation, the feature covariance of content images can exhibit the same style statistical characteristics as the style images. In contrast, we are interested in the multi-collection style transfer problem. A single image can hardly represent the style of an artist comprehensively, and thus we study multi-collection style transfer to abstract the style of an artist from a collection of images.

D. Adversarial Network-based Methods

GANs [13] represent a generative method using two networks, one as a discriminator and the other as a generator, to iteratively improve the model by a minimax game. Chuan et al. [31] proposed Markovian GANs for texture synthesis and style transfer, addressing the efficiency issue inherent in MRF-CNN-based style transfer [26]. Spatial GAN (SGAN) [32] successfully achieved data-driven texture synthesis based on GANs. PSGAN [33] improved Spatial GAN to learn periodical textures by extending the structure of the input noise distribution.

By adopting adversarial loss, many works have generated realistic images for conditional image generation, e.g., frame prediction [34], image super-resolution [35] and image-to-image translation [36]. However, these approaches often require paired images as input, which are expensive and hard to obtain in practice. Several studies have been conducted investigating domain transfer in the absence of paired images. [15], [16], [14] independently reported the similar idea of a cycle-consistent loss to transform the image from the source domain to the target domain and then back to the original image. Taigman et al. [37] proposed the Domain Transfer Network, which employed a compound loss function, including an adversarial loss and a constancy loss, to transfer a sample in one domain to an analog sample in another domain.

In contrast, some works have generated different image types from noise in a single generative network.


One strategy was to supply both the generator and discriminator with class labels to produce class-conditional samples [38]. Another was to modify the discriminator to contain an auxiliary decoder network that outputs the class label for the training data [39], [40] or a subset of the latent variables from which the samples were generated [41]. AC-GAN [10] added an auxiliary multi-class category loss to supervise the discriminator, which was used to generate multiple object types. Our work differs in that it focuses on migrating different styles to content images.

III. PROPOSED ALGORITHM

We first consider the collection style transfer problem. We have two sets of unpaired training samples: a set of input images {x_i}_{i=1}^N ∈ X and a target set of collections for an artist or genre {y_i}_{i=1}^M ∈ Y. We aim to train a generative network that generates images G(x) in the style of a target artist or genre, and simultaneously we train a discriminative network D to distinguish the transferred images G(x) from the real style images y. The generative network implicitly learns the target style from the adversarial loss, aiming to fool the discriminator. The whole framework has three modules: an encoder, a gated transformer, and a decoder. The encoder consists of a series of convolutional layers that transform the input image into a feature space, Enc(x). After the encoder, a series of residual networks [42] form the transformer T(·). The input of the residual layers in the gated function T is the feature map from the last layer of the encoder module, Enc(x), and the output of the gated function is the activation T(Enc(x)). Then, a series of fractionally-strided convolutional layers decode the transformed features into the output image G(x) = Dec(T(Enc(x))). To stabilize training, we introduce an auto-encoder reconstruction loss. We introduce the gated transformer module to integrate multiple styles within a single generative network. The network architecture is shown in Figure 2, and the overall architecture is called the adversarial gated network (Gated-GAN).

A. Adversarial Network for Style Transfer

To learn a style from the target domain Y, we apply the adversarial loss [13], which simultaneously trains G and D in a two-player minimax game with loss function L(G, D). The generator G tries to generate an image G(x) that looks similar in style to the target domain Y, while the discriminator D aims to distinguish between them. Specifically, we train D to maximize the probability of assigning the correct label to the target image y and the transferred image G(x), meanwhile training G to minimize the probability of the discriminator assigning the correct label to the transferred image G(x). The original generative adversarial value function is expressed as follows:

min_G max_D V(G, D) = E_{y∈Y}[log D(y)] + E_{x∈X}[log(1 − D(G(x)))].  (1)

We employ the least squares loss (LSGAN) as explored in [43], which provides a smooth and non-saturating gradient in the discriminator D. The adversarial loss L_GAN(G, D) becomes:

L_GAN(G, D) = E_{y∈Y}[(D(y) − 1)^2] + E_{x∈X}[D(G(x))^2].  (2)
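To make Equation 2 concrete, the least-squares objectives can be written as a few lines of PyTorch-style code. This is a minimal sketch rather than the authors' released implementation; real_score and fake_score are placeholder names for the discriminator outputs D(y) and D(G(x)).

```python
import torch

def lsgan_d_loss(real_score: torch.Tensor, fake_score: torch.Tensor) -> torch.Tensor:
    """LSGAN discriminator loss (Eq. 2): push D(y) toward 1 and D(G(x)) toward 0."""
    return ((real_score - 1.0) ** 2).mean() + (fake_score ** 2).mean()

def lsgan_g_loss(fake_score: torch.Tensor) -> torch.Tensor:
    """LSGAN generator loss: push D(G(x)) toward 1 so stylized images look real."""
    return ((fake_score - 1.0) ** 2).mean()
```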

B. Auto-encoder Reconstruction Loss for Training Stabilization

The original GAN framework is known to be unstable, as it must train two neural networks with competing goals. [14] pointed out that one reason for instability is that there exist non-unique solutions when the generator learns the mapping function. Due to unpaired training samples, the same set of input images can be mapped to any random permutation of images in the target domain. To reduce the space of possible mapping functions, we introduce the auto-encoder reconstruction loss. In our model, the auto-encoder is obtained by directly connecting the encoder and decoder modules. That is, the network is encouraged to produce output Dec(Enc(x)) identical to the input image x after learning the representation (encoding Enc(x)) for the input data. We define the L1 loss between the reconstructed output and the input as the auto-encoder reconstruction loss:

L_R = E_{x∈X}[||Dec(Enc(x)) − x||_1].  (3)

Mode collapse is a common problem in the vanilla GAN [44], where all input images might be mapped to the same output image and the optimization fails to make progress. In collection style transfer, if the networks trained with adversarial loss alone have sufficient capacity, content images would be mapped to an arbitrary output as long as it matches the target style. The proposed encoder-decoder subnetwork aims to reconstruct input images, so that the structure of the output is expected to be consistent with the input image, which guarantees diversity of the outputs along with different inputs.

C. Adversarial Gated Network for Multi-Collection Style Transfer

1) Gated Generative Network: In multi-collection style transfer, we have a set of input images {x_i}_{i=1}^N ∈ X and collections of paintings Y = {Y_1, Y_2, ..., Y_K}, where K denotes the number of collections. Each collection contains M_c images {y_i}_{i=1}^{M_c} ∈ Y_c, where c indicates the index of the collection. The proposed gated generative network aims to output images G(x, c) for an assigned style c. Specifically, the gated transformer (red blocks in Figure 2) transforms the input from the encoded space into different styles by switching the trigger to different branches:

G(x, c) = Dec(T(Enc(x), c)).  (4)

In each branch, we employ a residual network as the transfer module. The encoder and decoder are shared by the different styles, so the network only has to save the extra transformer module parameters for each style.
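The gated generator of Equation 4 can be sketched in PyTorch as below. The encoder and decoder are collapsed to single convolutions purely for brevity (the actual configuration is given in Table I); the point illustrated is that all styles share one encoder and decoder, the style index selects a residual branch, and skipping the transformer yields the auto-encoder reconstruction path used for stabilization. The class and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class GatedGenerator(nn.Module):
    def __init__(self, num_styles: int, ch: int = 128):
        super().__init__()
        # Simplified encoder/decoder; see Table I for the configuration used in the paper.
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 7, padding=3),
                                     nn.InstanceNorm2d(ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([ResidualBlock(ch) for _ in range(num_styles)])
        self.decoder = nn.Sequential(nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh())

    def forward(self, x: torch.Tensor, style: int = None) -> torch.Tensor:
        h = self.encoder(x)
        if style is None:            # auto-encoder path: skip the gated transformer (Eq. 3)
            return self.decoder(h)
        return self.decoder(self.branches[style](h))   # Eq. 4: Dec(T(Enc(x), c))

# Usage example: G = GatedGenerator(num_styles=4); y = G(torch.randn(1, 3, 128, 128), style=2)
```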


Fig. 2. Architecture of the proposed adversarial gated networks: a generative network and a discriminative network. The generative network consists of three modules: an encoder, a gated transformer, and a decoder. Images are generated in different styles through branches in the gated transformer module. The discriminative network uses an adversarial loss to distinguish between stylized and real images. An auxiliary classifier supervises the discriminative network to classify the style categories.

2) Auxiliary Classifier for Multiple Styles: If we only use the adversarial loss, the model tends to confuse and mix multiple styles together. Therefore, we need supervision to separate the style categories. One solution is to adopt LabelGAN [40], which generalized the binary discriminator to the multi-class case with associated class labels c ∈ {1, ..., K}, where the (K + 1)-th label corresponds to the generated samples. The objective functions are defined as:

L_G^{lab} = E_{x∈X}[H([1, 0], [D_r(G(x)), D_{K+1}(G(x))])],  (5)

L_D^{lab} = E_{(y,c)∈Y}[H(v(c), D(y))] + E_{x∈X}[H(v(K + 1), D(G(x)))],  (6)

where D_i(x) denotes the probability of the sample x having the i-th style, D(x) = [D_1(x), D_2(x), ..., D_{K+1}(x)], and v(c) = [v_1(c), ..., v_{K+1}(c)] with v_i(c) = 1 if i = c and v_i(c) = 0 if i ≠ c. H is the cross-entropy, defined as H(p, q) = −Σ_i p_i log q_i.

In LabelGAN, the generator gets its gradients from the K specific real-class logits in the discriminator and tends to refine each sample towards being one of the classes. However, LabelGAN actually suffers from the overlaid-gradient problem [45]: all real-class logits are encouraged at the same time. Though it tends to make each sample be one of these classes during training, the gradient of each sample is a weighted average over multiple label predictors.

In our method, an auxiliary classifier (denoted C) is added to leverage the side information directly:

L_G^{Gated} = λ_CLS E_{x∈X}[H(u(c), C(G(x, c)))] + L_GAN,  (7)

where u(·) is a vectorizing operator similar to v(·) but defined over K classes, and C(G(x, c)) is the probability distribution over the K real classes given by the auxiliary classifier. L_GAN indicates the adversarial loss (Equation 2) that encourages the generation of realistic images. In the first term of Equation 7, we optimize the cross-entropy so that each sample has a high confidence of being one of the classes, which overcomes the overlaid-gradient problem. The loss can be written in the form of a log-likelihood:

min_C L_CLS(C) = −E_{(y,c)∈Y}[log C(Style = c | y)].  (8)

The classifier C is encouraged to maximize the log-likelihood of the correct class given real images. Meanwhile, the generator aims to generate images that can be correctly recognized by the classifier:

min_G L_CLS(G) = −E_{x∈X}[log C(Style = c | G(x, c))].  (9)

In practice, the classifier shares low-level convolutional layers with the discriminator, but they have exclusive fully connected layers to output the conditional distributions.
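In code, Equations 8 and 9 reduce to a standard cross-entropy on the auxiliary classifier's K-way logits. A hedged sketch follows; the logit tensors are assumed to come from C applied to real paintings and to stylized outputs, respectively, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def classifier_loss_C(real_logits: torch.Tensor, real_style: torch.Tensor) -> torch.Tensor:
    """Eq. 8: train the auxiliary classifier C to recognize the style of real paintings."""
    return F.cross_entropy(real_logits, real_style)

def classifier_loss_G(fake_logits: torch.Tensor, target_style: torch.Tensor) -> torch.Tensor:
    """Eq. 9: train G so that its stylized output is classified as the requested style c."""
    return F.cross_entropy(fake_logits, target_style)
```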

IV. IMPLEMENTATION

1) Network Configuration: Our generative network architecture contains two stride-2 convolutions (encoder), one gated residual block (gated transformer), five residual blocks, and two fractionally-strided convolutions with stride 1/2 (decoder). Instance normalization [40] is used after the convolutional layers. Details are provided in Table I.

For the discriminators and classifiers, we adapt the Markovian PatchGAN architecture [31], [36], [14], [16]. Instead of operating over the full images, the discriminators and classifiers distinguish overlapping patches sampled from the real and generated images. By doing so, the discriminators and classifiers focus on local high-frequency features like texture and style and ignore the global image structure. The patch size is set to 70×70. In addition, PatchGAN has fewer parameters and can be applied to any size of input.
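A sketch of a discriminator in this spirit is shown below: a fully convolutional PatchGAN-style trunk shared by a real/fake head and a K-way style head. The channel widths loosely follow common PatchGAN implementations, and the pooled convolutional classifier head is a simplification of the "exclusive fully connected layers" mentioned above, so treat the layer details as assumptions rather than the paper's exact specification.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Shared PatchGAN trunk with a real/fake head and a K-way style-classification head."""

    def __init__(self, num_styles: int, in_ch: int = 3):
        super().__init__()

        def block(c_in, c_out, norm=True):
            layers = [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        # Each spatial location of the output scores one overlapping image patch.
        self.trunk = nn.Sequential(*block(in_ch, 64, norm=False), *block(64, 128),
                                   *block(128, 256), *block(256, 512))
        self.adv_head = nn.Conv2d(512, 1, 4, padding=1)           # patch map for the LSGAN loss
        self.cls_head = nn.Conv2d(512, num_styles, 4, padding=1)  # per-patch style logits

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        adv = self.adv_head(h)                      # used in Equation 2
        cls = self.cls_head(h).mean(dim=(2, 3))     # pooled logits for Equations 8 and 9
        return adv, cls
```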

2) Training Strategy: To smooth the generated image G(x, c), we make use of the total variation loss [6], [46], [47], denoted by L_TV:

L_TV = Σ_{i,j} [(G(x)_{i,j+1} − G(x)_{i,j})^2 + (G(x)_{i+1,j} − G(x)_{i,j})^2]^{1/2},  (10)

where i ∈ (0, ..., H − 1), j ∈ (0, ..., W − 1), and G(x) is the generated image whose dimension is H × W. The full objective of the generator is minimizing the loss function:

L(G) = L_GAN + λ_CLS L_CLS + λ_TV L_TV,  (11)

where λ_CLS and λ_TV are parameters that control the relative importance of their corresponding loss functions. Alternatively, we train an auto-encoder by minimizing the weighted reconstruction loss in Equation 3, λ_R L_R. The discriminator maximizes the prediction of real images and generated images, L(D) = L_GAN, while the classifier in Equation 8 maximizes the prediction of collections from different artists or genres.
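The total variation term of Equation 10 and the weighted combination of Equation 11 can be sketched as follows; loss_gan and loss_cls are assumed to be computed elsewhere (e.g., with the LSGAN and cross-entropy snippets above), and the small eps is added only for numerical stability of the square root.

```python
import torch

def tv_loss(img: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Isotropic total variation of an image batch (N, C, H, W), as in Eq. 10."""
    dh = img[:, :, 1:, :-1] - img[:, :, :-1, :-1]   # vertical differences
    dw = img[:, :, :-1, 1:] - img[:, :, :-1, :-1]   # horizontal differences
    return torch.sqrt(dh ** 2 + dw ** 2 + eps).sum()

# Eq. 11, with the weights reported below (lambda_cls = 1, lambda_tv = 1e-6):
# loss_G = loss_gan + 1.0 * loss_cls + 1e-6 * tv_loss(fake)
```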

Page 6: Gated-GAN: Adversarial Gated Networks for Multi-Collection ...land St, Darlington, NSW 2008, Australia (e-mail: c.xu@sydney.edu.au; dacheng.tao@sydney.edu.au). X. Yang and S. Li are

6

TABLE I
GENERATIVE NETWORK OF GATED-GAN

Module              Operation                Kernel size   Stride   Feature maps   Normalization            Nonlinearity
Encoder             Convolution              7             1        32             Instance Normalization   ReLU
                    Convolution              3             2        64             Instance Normalization   ReLU
                    Convolution              3             2        128            Instance Normalization   ReLU
Gated-transformer   Residual block           -             -        128            Instance Normalization   ReLU
Decoder             Residual block (×5)      -             -        128            Instance Normalization   ReLU
                    Fractional-convolution   3             1/2      64             Instance Normalization   ReLU
                    Fractional-convolution   3             1/2      32             Instance Normalization   ReLU
                    Convolution              7             1        3              -                        tanh
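Table I maps almost line-for-line onto a layer list. The sketch below mirrors that configuration for the encoder and the tail of the decoder; the padding choices and the use of ConvTranspose2d for the "fractional convolutions" are assumptions where the table leaves details implicit, and the five 128-channel decoder residual blocks are omitted for brevity.

```python
import torch.nn as nn

def conv_in_relu(c_in: int, c_out: int, k: int, s: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2),
                         nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))

# Encoder (Table I): 7x7/1 -> 32, 3x3/2 -> 64, 3x3/2 -> 128 channels.
encoder = nn.Sequential(conv_in_relu(3, 32, 7, 1),
                        conv_in_relu(32, 64, 3, 2),
                        conv_in_relu(64, 128, 3, 2))

# Decoder tail (Table I): two fractional-strided convolutions and a 7x7 output conv with tanh.
decoder_tail = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
    nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
    nn.InstanceNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 3, 7, padding=3), nn.Tanh())
```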

For all experiments, we set λ_CLS = 1, λ_R = 10, and λ_TV = 10^-6. The networks are trained with a learning rate of 0.0002, using the Adam solver [48] with a batch size of 1.

The input image size is 128 × 128. The training samples are first scaled to 143 × 143, and then randomly flipped and cropped to 128 × 128. We train our model with an input size of 128 × 128 for two reasons. First, randomly cropping the raw input augments the training set. Second, a relatively small image size decreases the computational cost and thus speeds up training. In the test phase, we process images at their original resolution for a clearer exhibition in the paper.

To stabilize training, we update the discriminative networks using a history of transferred images rather than the ones produced by the latest generative network [49]. Specifically, we maintain an image buffer that stores 50 previously generated images. At each iteration of discriminator training, we compute the discriminator loss function by sampling images from the buffer. The training process is shown in Algorithm 1, where θ_Enc denotes the parameters of the encoder module and θ_Dec denotes the parameters of the decoder module. In practice, K_g and K_d are set to 1.

Algorithm 1 Adversarial training of the gated network G.
Require: Training samples {x_i}_{i=1}^N ∈ X; style images with category labels {y_i, c_i} ∈ Y; number of discriminator updates per step K_d; number of generator updates per step K_g.
Ensure: Gated generative network G = Dec(T(Enc(·), ·)).
 1: for number of training iterations do
 2:   for K_d steps do
 3:     Sample a minibatch of style images (y_i, c_i) and training images x_i.
 4:     Generate stylized images G(x_i, c_i) as in Equation 4.
 5:     Update the discriminator D and classifier C: Δθ_D ← ∇_{θ_D} L_GAN, Δθ_C ← ∇_{θ_C} L_CLS.
 6:   end for
 7:   for K_g steps do
 8:     Sample training images x_i.
 9:     Update the generator G: Δθ_G ← ∇_{θ_G}(L_GAN + λ_CLS L_CLS + λ_TV L_TV).
10:   end for
11:   Update the encoder and decoder modules: Δθ_Enc, Δθ_Dec ← ∇_{θ_Enc, θ_Dec}(λ_R L_R).
12: end for
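The 50-image history buffer described above follows the replay strategy of [49]. A minimal sketch is given below; the 50/50 keep-or-swap policy is an assumption borrowed from common implementations of that strategy rather than something specified in the text.

```python
import random
import torch

class ImageHistoryBuffer:
    """Stores previously generated images and mixes them into discriminator updates."""

    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self.images = []

    def query(self, image: torch.Tensor) -> torch.Tensor:
        if len(self.images) < self.capacity:        # buffer not full: store and return as-is
            self.images.append(image.detach().clone())
            return image
        if random.random() < 0.5:                   # otherwise, sometimes swap in an old image
            idx = random.randrange(self.capacity)
            old, self.images[idx] = self.images[idx], image.detach().clone()
            return old
        return image
```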

V. EXPERIMENTS

In this section, we evaluate the effectiveness, stability, and functionality of the proposed model. We first introduce a quantitative assessment of image quality. Then, we set up a texture synthesis experiment and visualize the filters in the gated transformer branches. Lastly, we train the model for multiple style transfer and compare results with state-of-the-art algorithms.

A. Assessment of Image Quality

We use the FID score [50] to quantitatively evaluate the quality of the results. The FID score measures the distance between the generated distribution and the real distribution. To this end, the generated samples are first embedded into a feature space given by a specific layer of Inception Net. Then, treating the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians is then used to quantify the quality of the samples, i.e.,

FID(x, g) = ||μ_x − μ_g||_2^2 + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^{1/2}),  (12)

where (μ_x, Σ_x) and (μ_g, Σ_g) are the mean and covariance of sample embeddings from the real data distribution and the generative model distribution, respectively. In our experiments, we use paintings of artists as samples of the real distribution and stylized images as samples of the generated distribution. That is to say, we compute the FID between generated images and authentic paintings.
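Once the Inception embeddings of real paintings and stylized images have been reduced to their means and covariances, Equation 12 can be evaluated directly. A sketch using numpy and scipy follows; mu_x, sigma_x, mu_g, sigma_g are assumed to be precomputed from the two embedding sets.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_x, sigma_x, mu_g, sigma_g) -> float:
    """FID between two Gaussians fitted to Inception embeddings (Eq. 12)."""
    diff = mu_x - mu_g
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):                               # drop tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```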

B. Texture Synthesis

To explicitly understand the gated-transformer module in the proposed Gated-GAN, we design an experiment to explore what the gated transformer learns. We use Gated-GAN to synthesize textures and visualize the gated-transformer filters. For each style, the training set is a textured image.


Fig. 3. Four cases of texture synthesis using Gated-GAN. For each case, the first column shows examples of texture, and the other three are synthesized results given different samples of Gaussian noise as inputs.

Fig. 4. Visualization of learned features in the gated transformer of the generative networks. In each case, the left shows synthesized images and the right shows the corresponding features.

The training samples are first scaled to 143 × 143, and then randomly flipped and cropped to 128 × 128. The generative network input is Gaussian noise. After adversarial training, the generative network outputs realistic textured images (see Figure 3).

To explore the style representations learned by the gated transformer, we visualize the transformer filters in Figure 4. The features are decoded from 3 × 3 × 128 tensors, where only one of the 128 channels is activated by Gaussian noise. They pass through different gated-transformer filters but the same decoder. Since the decoder output contains three channels (RGB), we can observe the color of the output decoded from the learned feature. This reveals that the transformer module learns style representations, e.g., color, stroke, etc. Another interpretation is that the transformer module learns the bases or elements of styles: generated images can be viewed as linear combinations of these bases, with coefficients learned from the encoder module.

C. Style Transfer

In this subsection, we present our results for generating multiple styles of artists or genres using a single network. Then, we compare our results with state-of-the-art image style transfer and collection style transfer algorithms. The model is trained to generate images in the styles of Monet, Van Gogh, Cezanne, and Ukiyo-e, with datasets from [14] containing 1073, 400, 526, and 563 paintings, respectively.

1) Multi-Collection Style Transfer: Collection style transfer mimics the style of artists or genres with respect to their features, e.g., stroke, impasto, perspective frame usage, etc. Figure 5 shows the results of collection style transfer using our method. Original images are presented on the left, and the generated images are on the right. For comparison, Monet's paintings depicting similar scenes are shown in the middle. It can be seen that the styles of the generated images and their corresponding paintings are similar. Although the themes and colors of the two generated images are different, they still appear similar to Monet's authentic pieces. Our method can clearly mimic the style of the artist for different scenes. Figure 6 shows the results of applying the trained network to evaluation images for the styles of Monet, Van Gogh, Cezanne, and Ukiyo-e.

Fig. 5. Collection style transfer on Photo → Monet. From left to right: input photos, Monet's paintings picked from a similar landscape theme, and our stylized images. The photo is transferred adaptively based on different themes.

2) Comparison with Image Style Transfer: The image style transfer algorithm [5] focuses on producing images that combine the content of an arbitrary photograph and the style of one or many well-known artworks. This is achieved by minimizing the mean-squared distance between the entries of the Gram matrix of the style image and the Gram matrix of the image to be generated. We note some recent works on multi-style transfer [28], [8], [9], but these are all based on neural style transfer [5]. Thus, we compare our results with [5].

Fig. 6. A four-style transfer network is trained to capture the styles of Monet, Van Gogh, Cezanne, and Ukiyo-e. From left to right: input, Monet, Van Gogh, Cezanne, Ukiyo-e.

For each content image, we use two representative artworks as the reference style images. To generate images in the style of the entire collection, the target style representation is computed by the average Gram matrix of the target domain. To compare this with our method, we use the collections of artists' artworks or a genre and compute the average style as the target.

Figure 7 shows the differences between the methods. We can see that Gatys et al. [5] requires manually picking target style images that closely match the desired output. If the entire collection is used as the target, the transferred style is the average style of the collection. In contrast, our algorithm outputs diverse and reasonable images, each of which can be viewed as a sample from the distribution of the artist's style.

3) Comparison with Universal Style Transfer: [30] aims to apply the style characteristics of a style image to content images in a style-agnostic manner. By whitening and coloring transformation, the feature covariance of content images can exhibit the same style statistical characteristics as the style images without requiring any style-specific training.

We compare images generated by the proposed algorithm and those of [30]. The results are shown in Figure 8. Given a picture with bushes and flowers (see Figure 8 (a)), our method outputs how Monet might have recorded this scenery (see Figure 8 (d)), in which the style of painting bushes and flowers is similar to Monet's painting "Flowers at Vetheuil". What if the content image is a cityscape? Our method outputs images with foggy strokes (see Figure 8 (h)), since Monet produced many cityscapes with fog in London (e.g., "Charing Cross Bridge"). On the other hand, [30] transfers images by following a particular style image. Taking "Flowers at Vetheuil" as the style image, Figure 8 (g) produced by [30] well inherits the style of Monet's "Flowers at Vetheuil", with green and red spots. However, Monet might not paint a cityscape with green and red spots the way he paints flowers.

TABLE II
QUANTITATIVE EVALUATION ON COLLECTION STYLE TRANSFER IN TERMS OF FID TO MEASURE THE PERFORMANCE. LOWER SCORE INDICATES BETTER QUALITY.

Style       Content Images   CycleGAN [14]   Ours
Monet       86.50            64.14           55.13
Cezanne     186.73           106.96          107.27
Van Gogh    173.01           107.03          109.59
Ukiyo-e     195.25           103.36          115.96
MEAN        160.37           95.37           96.99

In summary, our task focuses on what the artists or genres might paint given content images, while the task of [30] is to apply the style characteristics of a particular style image to any content image. Both [30] and our method output interesting results, and could be used in different scenarios.

4) Comparison with Collection Style Transfer: CycleGAN [14] previously showed impressive results on collection style transfer, so in this section we compare our results with CycleGAN. The generative network of the baseline CycleGAN is composed of three stride-2 convolutional layers, 6 residual blocks, two fractional-convolutional layers and one last convolutional layer, which shares the same structure with our method in our experiment. Figure 9 demonstrates multi-collection style transfer by our method, which shows that the proposed model produces comparable results to CycleGAN.

Quantitative results are shown in Table II. Although the images generated by the proposed algorithm exhibit similar quality to those of CycleGAN, it is instructive to note that our four styles are produced by a single network. In the second column of Table II, we compute the score of the content images corresponding to the stylized images. We find that the stylized images achieve better scores than the original content images, demonstrating that the stylized images are more similar to the artists' authentic works, which is consistent with our intuitive expectation.

Finally, we compare model size with CycleGAN [14]. The generative network is composed of several convolutional layers and residual blocks with the same architecture as Gated-GAN when the transformer module number is set to one; the parameters of the two models are then the same. Given another N styles, CycleGAN must train another N models, since a whole generative network must be included for a new style. For Gated-GAN, the transformation operator is encoded in the gated transformer, which only has one residual block. A new style will thus only require a new transformer part in the generative network. As a result, the proposed method saves storage space as the style number increases. In Figure 10, we compare the numbers of parameters with those of CycleGAN. Both models are trained for 128 × 128 training images.

5) Comparison with Conditional GAN: The conditional GAN [10], [38] is a widely used model for class-conditional image generation. When a conditional GAN is applied to multi-style transfer, a stylized image G(c, x) is generated from a content image x and a style class label c. We compare against the conditional GAN in our experiments.


Fig. 7. Comparison of our method with image style transfer [5] on photo → Monet and photo → Ukiyo-e. From left to right: input photos, Gatys et al.'s results using different target style images (style I and style II), Gatys et al.'s results using the entire collection of the artist or genre, and our results for collection style transfer.

Fig. 8. Comparison of our method with universal style transfer [30] on photo → Monet. From left to right: input images, results of [30] with the style image "Charing Cross Bridge" by Monet, results of [30] with the style image "Flowers at Vetheuil" by Monet, and our results of Monet's collection style transfer.

The class label is represented by a one-hot vector with k bits, where each bit represents a style type. k noise vectors of the same dimension as the content image are randomly sampled from a uniform distribution. The input of the generative network is obtained by concatenating the content image with the outer product of these noise vectors and the class label.

As we can see in Figure 11 (b), the conditional GAN fails to output meaningful results. This is because, in collection style transfer, the conditional GAN lacks paired input-output examples. To stabilize the training of the conditional GAN, we adopt the cycle-consistency loss [14]. From the results of the conditional GAN with cycle-consistency loss in Figure 11 (c), we can see that the results for different styles tend to be very similar, with only the colors changing at first sight. In contrast, our results (see Figure 11 (d)) are more diverse across styles in terms of strokes and textures.

TABLE III
QUANTITATIVE EVALUATION OF PARAMETER λCLS = {0, 0.1, 1, 5, 10} IN TERMS OF FID SCORE.

Style       λCLS = 0   λCLS = 0.1   λCLS = 1   λCLS = 5   λCLS = 10
Monet       204.82     63.35        55.13      62.66      61.48
Cezanne     234.02     136.35       107.27     127.77     143.39
Van Gogh    217.10     112.61       109.59     126.56     138.66
Ukiyo-e     206.67     138.13       115.96     132.72     140.53
MEAN        215.65     112.61       96.99      112.42     121.02

D. Analysis of Loss Function

1) Influence of Parameters in Loss Function: In our model, we proposed an auxiliary classifier loss and an auto-encoder reconstruction loss, which are balanced by the parameters λ_CLS and λ_R, respectively. To explore the influence of these parameters, we run experiments with λ_CLS = {0, 0.1, 1, 5, 10} and λ_R = {1, 5, 10, 20}.

Figure 12 and Table III present the qualitative and quantitative comparisons of the influence of parameter λ_CLS. We can see that the classifier loss provides supervision of the styles. Without the classifier loss (λ_CLS = 0), our model only transfers into one style. If we set λ_CLS too large (λ_CLS = 10), the model produces images with some artifacts. The underlying reason is that a larger λ_CLS suppresses the function of the discriminative network, so the output becomes less realistic. As a result, we set λ_CLS = 1 in our model.

Figure 13 and Table IV present the qualitative and quantitative comparisons of the influence of parameter λ_R.


Fig. 9. Comparison with CycleGAN [14]. From left to right: original images, stylized images in Monet's style, Van Gogh's style, Cezanne's style, and Ukiyo-e style. In each case, the first row shows the results produced by CycleGAN, and the second row shows our results.

Fig. 10. Model size. We compare the number of parameters of our model and CycleGAN [14]. The x-axis indicates the style number and the y-axis indicates the model size.

We can see that if we set λ_R too small (λ_R = 1), the outputs tend to be blurry and meaningless. This is because λ_R improves the stability of the training procedure. When λ_R = {5, 10, 20}, the visual qualities are similar, while the FID score shows that λ_R = 10 achieves a slightly better quantitative performance. This demonstrates that our method is robust and easily reproduces satisfying results. Since λ_R = 10 achieves the best quality, we set λ_R = 10 in our model.

2) Analysis of Auto-encoder Reconstruction Loss: We next justify our choice of the L1-norm. Beyond the L1-norm, the L2-norm can also be used in Equation 3. In Table V, we find that there is no significant difference between the results of the L1 and L2 losses. In CycleGAN [14], the L1-norm is used in the cycle-consistency reconstruction loss.

Fig. 11. Comparison of our method with the conditional GAN and its variant. From left to right: (a) input, (b) conditional GAN, (c) conditional GAN with cycle-consistency loss, and (d) ours. Each row shows a different style; from top to bottom: Monet, Ukiyo-e, Cezanne.

Fig. 12. Qualitative comparison of the influence of parameter λ_CLS. The first column shows the input images; the remaining columns show results with λ_CLS = 0, 0.1, 1 (ours), and 10. Each row shows a different style; from top to bottom: Monet, Cezanne, Van Gogh.


Fig. 13. Qualitative comparison of the influence of parameter λ_R. The first column shows the input images; the remaining columns show results with λ_R = 1, 5, 10 (ours), and 20. Each row shows a different style; from top to bottom: Monet, Cezanne, Ukiyo-e.

TABLE IV
QUANTITATIVE EVALUATION OF PARAMETER λR = {1, 5, 10, 20} IN TERMS OF FID SCORE.

Style       λR = 1    λR = 5    λR = 10   λR = 20
Monet       180.30    121.07    55.13     115.09
Cezanne     165.27    148.67    107.27    140.84
Van Gogh    148.43    139.87    109.59    134.13
Ukiyo-e     166.69    134.26    115.96    138.54
MEAN        165.17    135.97    96.99     132.15

As CycleGAN is an important comparison algorithm in our paper, we adopt the L1-norm in our auto-encoder reconstruction loss as well.

Lastly, we analyze the influence of the auto-encoder reconstruction loss in stabilizing the adversarial training procedure. We train a comparative model by ignoring the auto-encoder reconstruction loss in Equation 3. In Figure 14, the model without Equation 3 generates images with random textures that tend to be less diverse after training for several iterations. In contrast, the full proposed model generates satisfying results. Without the auto-encoder reconstruction loss, the network only aims to generate images that fool the discriminative network, which often leads to the well-known problem of mode collapse [44]. Our encoder-decoder subnetwork is encouraged to reconstruct the input images, so the semantic structure of the input is aligned with that of the output, which directly encourages diversity of the outputs along with different inputs. As a result, the full proposed model outputs satisfying results.

E. Analysis of Network Architecture

We explore the influence of the neural network structure. We set up variants of our model in Table VI; the variants have different configurations of the gated-transformer module.

TABLE V
QUANTITATIVE EVALUATION ON L1-NORM AND L2-NORM IN TERMS OF FID SCORE.

Style       CycleGAN   Ours (L1-norm)   Ours (L2-norm)
Monet       64.14      55.13            56.09
Cezanne     106.96     107.27           101.54
Van Gogh    107.03     109.59           109.33
Ukiyo-e     103.36     115.96           112.39
MEAN        95.37      96.99            94.84

TABLE VI
EXPERIMENT SETUP OF NETWORK ARCHITECTURE ANALYSIS

Module              Expt 1                       Expt 2                       Expt 3
Encoder             3 × Convolution              3 × Convolution              3 × Convolution
Gated-transformer   1 × Residual block           1 × Convolution              2 × Residual block
Decoder             5 × Residual block           5 × Residual block           5 × Residual block
                    2 × Fractional-convolution   2 × Fractional-convolution   2 × Fractional-convolution
                    1 × Convolution              1 × Convolution              1 × Convolution

TABLE VII
QUANTITATIVE EVALUATION ON DIFFERENT NETWORK STRUCTURES IN TERMS OF FID.

Style       Variant 1 (Ours)   Variant 2   Variant 3
Monet       55.13              67.16       53.08
Cezanne     107.27             128.62      110.13
Van Gogh    109.59             199.71      109.07
Ukiyo-e     115.96             195.87      100.32
MEAN        96.99              147.84      93.15

The quantitative results in Table VII reveal that the performance of variant 2 declines compared to that of variant 1. From the qualitative results in Figure 15, we observe that the model of variant 2 cannot maintain the content structure (see Figure 15 (c)). The underlying reason is that the residual block has a branch that skips the convolutional layers and directly connects the encoder and decoder modules. Since the encoder-decoder subnetwork learns the content information of the input from the reconstruction loss, residual blocks with skip connections shuttle the encoded information to the decoder module, which helps our model output results aligned with the structure of the input images.

To analyze the influence of the depth of the gated-transformer module, we set up variant 3, whose gated transformer consists of 2 residual blocks. In Table VII, the model of variant 3 achieves a slightly better quantitative evaluation than variant 1. The reason is that as the number of residual blocks increases, the expressive capacity of the network increases as well, so the model can capture more details for each style. However, the improvement of variant 3 is limited and the qualitative results in Figure 15 are similar, which suggests that one residual block in the gated transformer is sufficient for multi-style transfer. As a result, we adopt variant 1 as our architecture.
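To make the comparison concrete, the following is a minimal PyTorch-style sketch of a generator matching variant 1 in Table VI (a 3-convolution encoder, one residual block per style branch in the gated transformer, and a decoder of residual blocks followed by fractionally strided convolutions). The channel counts, kernel sizes, and class names are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        # The identity branch lets encoded content bypass the convolutions,
        # which is what preserves input structure in variants 1 and 3.
        return x + self.body(x)

class GatedGenerator(nn.Module):
    def __init__(self, num_styles, ch=256):
        super().__init__()
        # Encoder: 3 convolutional layers (two of them stride-2 downsampling).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, ch, 3, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True))
        # Gated transformer: one residual block per style (variant 1).
        self.branches = nn.ModuleList([ResidualBlock(ch) for _ in range(num_styles)])
        # Decoder: residual blocks, fractionally strided convolutions, output convolution.
        self.decoder = nn.Sequential(
            *[ResidualBlock(ch) for _ in range(5)],
            nn.ConvTranspose2d(ch, 128, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

    def forward(self, x, style_id):
        h = self.encoder(x)
        h = self.branches[style_id](h)   # the gate selects one style branch
        return self.decoder(h)
```

In this sketch, variant 2 would replace the ResidualBlock in `self.branches` with a plain convolution, removing the identity path and, as Table VII shows, degrading FID.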

F. Incremental Training

By sharing the same encoding/decoding subnetworks, our model can accommodate new styles. For a new style, the model adds the style by learning a new branch in the gated transformer while holding the encoding/decoding subnetworks fixed. We first jointly train the encoder-decoder subnetwork and the gated transformer on three style collections (Cezanne, Ukiyo-e, and Van Gogh) with the strategy described in Algorithm 1. After that, for the new style (Monet), we train a new branch of residual blocks in the gated transformer.
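A hedged sketch of this incremental step, assuming the hypothetical GatedGenerator above: the shared encoder and decoder are frozen and only the newly appended branch is optimized (the discriminator and classifier updates are omitted here for brevity).

```python
# Incremental-training sketch: add a new-style branch (e.g., Monet) to a model
# already trained on Cezanne, Ukiyo-e, and Van Gogh. Names follow the
# hypothetical GatedGenerator sketch above, not the released code.
import torch

def add_style_branch(gen, lr=2e-4):
    gen.branches.append(ResidualBlock())          # new gated-transformer branch
    for p in gen.encoder.parameters():            # freeze the shared encoder
        p.requires_grad = False
    for p in gen.decoder.parameters():            # freeze the shared decoder
        p.requires_grad = False
    new_branch = gen.branches[-1]
    # Only the new branch's parameters are passed to the optimizer.
    return torch.optim.Adam(new_branch.parameters(), lr=lr, betas=(0.5, 0.999))
```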

Figure 16 shows several results for the new style obtained by incremental training. It achieves stylized results comparable to those of CycleGAN, which trains the whole network for that style. We also evaluate the quantitative performance of the new style in terms of FID score. The incrementally trained style obtains a score of 57.27; compared with the 55.13 of our jointly trained Gated-GAN and the 64.14 of the baseline CycleGAN, incremental training achieves a competitive result.
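For reference, the FID scores quoted throughout compare the Gaussian statistics of Inception features of real and generated images. The sketch below computes the score from precomputed feature matrices; it is the standard formulation [50] and an assumption on our part rather than the exact evaluation script used in these experiments.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two sets of Inception features
    (arrays of shape [N, D]). Lower is better."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):           # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```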


Fig. 14. Comparison with a variant of our method across training iterations for mapping images to Cezanne's style. From left to right: original inputs; results with the full objective after 100k, 300k, and 500k iterations; and results without the auto-encoder reconstruction loss after 10k, 100k, and 300k iterations.

Fig. 15. Qualitative comparison of the influence of different network structures. From left to right: (a) inputs, (b) variant 1 (ours), (c) variant 2, and (d) variant 3. The first row shows results of photo → Cezanne, and the second row shows results of photo → Van Gogh.


G. Linear Interpolation of Styles

Since our proposed model achieves multi-collection style transfer by switching the gate c to different branches T(Enc(x), c), we can blend multiple styles by adjusting the gate weights to create a new style or to generate transitions between the styles of different artists or genres:

G(x, c_1, c_2) = Dec(α · T(Enc(x), c_1) + (1 − α) · T(Enc(x), c_2))    (13)

where c_1 and c_2 indicate the gates corresponding to different style branches, and α indicates the weight for the convex combination of styles. In Figure 17, we show an example of interpolation from Monet to Van Gogh with the trained model as we vary α from 0 to 1. The convex combination produces a smooth transition from one style to the other.
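Concretely, the interpolation of Equation 13 amounts to a convex blend of the two branch activations before decoding. A minimal sketch, reusing the hypothetical GatedGenerator defined earlier:

```python
import torch

@torch.no_grad()
def interpolate_styles(gen, x, c1, c2, alpha):
    """Blend two style branches as in Eq. 13:
    G(x, c1, c2) = Dec(alpha * T(Enc(x), c1) + (1 - alpha) * T(Enc(x), c2))."""
    h = gen.encoder(x)
    blended = alpha * gen.branches[c1](h) + (1.0 - alpha) * gen.branches[c2](h)
    return gen.decoder(blended)

# Sweeping alpha from 0 to 1 moves the output from the pure c2 style to the
# pure c1 style, producing the smooth transition shown in Figure 17.
```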

Fig. 16. Comparison of incremental training. From left to right: original inputs, results of CycleGAN [14], results of our method with all styles trained simultaneously, and results of our method with incremental training.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we study multi-collection style transfer in a single network using adversarial training. To integrate styles into a single network, we design a gated network that routes inputs through different network branches for different styles. To learn multiple styles simultaneously, a discriminator and an auxiliary classifier distinguish authentic artworks and their styles. To stabilize GAN training, we introduce the auto-encoder reconstruction loss. Furthermore, the gated transformer module provides the opportunity to explore new styles by assigning different weights to the gates. Experiments demonstrate the stability, functionality, and effectiveness of our model, which produces satisfactory results compared with state-of-the-art algorithms in which one network merely outputs images in one style. In the future, we will apply our model to other conditional image generation tasks (e.g., object transfiguration, season transfer, photo enhancement) and explore generating diversified style transfer results.


Fig. 17. Style interpolation. The leftmost image is generated in Monet's style, and the rightmost image is generated in Van Gogh's style. Images in the middle are convex combinations of the two styles.

REFERENCES

[1] A. A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1033–1038.

[2] H. Lee, S. Seo, S. Ryoo, and K. Yoon, "Directional texture transfer," in Proceedings of the 8th International Symposium on Non-Photorealistic Animation and Rendering. ACM, 2010, pp. 43–48.

[3] N. Ashikhmin, "Fast texture transfer," IEEE Computer Graphics and Applications, vol. 23, no. 4, pp. 38–43, 2003.

[4] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, "Image analogies," in Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 2001, pp. 327–340.

[5] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.

[6] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.

[7] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky, "Texture networks: Feed-forward synthesis of textures and stylized images," in ICML, 2016, pp. 1349–1357.

[8] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, "Stylebank: An explicit representation for neural image style transfer," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[9] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, "Diversified texture synthesis with feed-forward networks," arXiv preprint arXiv:1703.01664, 2017.

[10] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier gans," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 2642–2651.

[11] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[12] F. Luan, S. Paris, E. Shechtman, and K. Bala, "Deep photo style transfer," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[15] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.

[16] Z. Yi, H. Zhang, P. Tan, and M. Gong, "Dualgan: Unsupervised dual learning for image-to-image translation," arXiv preprint arXiv:1704.02510, 2017.

[17] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

[18] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 2001, pp. 341–346.

[19] M. E. Gheche, J. F. Aujol, Y. Berthoumieu, and C. A. Deledalle, "Texture reconstruction guided by a high-resolution patch," IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 549–560, Feb 2017.

[20] A. Akl, C. Yaacoub, M. Donias, J. P. D. Costa, and C. Germain, "Texture synthesis using the structure tensor," IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4082–4095, Nov 2015.

[21] M. Elad and P. Milanfar, "Style transfer via texture synthesis," IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2338–2351, 2017.

[22] O. Frigo, N. Sabater, J. Delon, and P. Hellier, "Split and match: Example-based adaptive patch sampling for unsupervised style transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 553–561.

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[25] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.

[26] C. Li and M. Wand, "Combining markov random fields and convolutional neural networks for image synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2479–2486.

[27] A. Selim, M. Elgharib, and L. Doyle, "Painting style transfer for head portraits using convolutional neural networks," ACM Transactions on Graphics (ToG), vol. 35, no. 4, p. 129, 2016.

[28] V. Dumoulin, J. Shlens, and M. Kudlur, "A learned representation for artistic style," April 2017.

[29] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," April 2017.

[30] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, "Universal style transfer via feature transforms," in Advances in Neural Information Processing Systems, 2017, pp. 385–395.

[31] C. Li and M. Wand, "Precomputed real-time texture synthesis with markovian generative adversarial networks," in European Conference on Computer Vision. Springer, 2016, pp. 702–716.

[32] N. Jetchev, U. Bergmann, and R. Vollgraf, "Texture synthesis with spatial generative adversarial networks," arXiv preprint arXiv:1611.08207, 2016.

[33] U. Bergmann, N. Jetchev, and R. Vollgraf, "Learning texture manifolds with the periodic spatial gan," in Thirty-fourth International Conference on Machine Learning (ICML), 2017.

[34] W. Lotter, G. Kreiman, and D. Cox, "Unsupervised learning of visual structure using predictive generative networks," arXiv preprint arXiv:1511.06380, 2015.

[35] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint arXiv:1609.04802, 2016.

[36] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[37] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," April 2017.


[38] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.

[39] A. Odena, "Semi-supervised learning with generative adversarial networks," arXiv preprint arXiv:1606.01583, 2016.

[40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training gans," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[41] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "Infogan: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.

[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[43] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[44] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," arXiv preprint arXiv:1701.04862, 2017.

[45] Z. Zhou, H. Cai, S. Rong, Y. Song, K. Ren, W. Zhang, J. Wang, and Y. Yu, "Activation maximization generative adversarial nets," in International Conference on Learning Representations, 2018.

[46] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.

[47] H. A. Aly and E. Dubois, "Image up-sampling using total-variation regularization with a new observation model," IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1647–1659, 2005.

[48] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[49] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[50] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "Gans trained by a two time-scale update rule converge to a local nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6629–6640.