
Spatial Fusion GAN for Image Synthesis

Anonymous CVPR submission

Paper ID 194

Abstract

Recent advances in generative adversarial networks (GANs) have shown great potential in realistic image synthesis, whereas most existing works address synthesis realism in either the appearance space or the geometry space but few in both. This paper presents an innovative Spatial Fusion GAN (SF-GAN) that combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both the geometry and appearance spaces. The geometry synthesizer learns contextual geometries of background images and transforms and places foreground objects into the background images naturally. The appearance synthesizer adjusts the color, brightness and styles of the foreground objects and embeds them into background images harmoniously, where a guided filter is introduced for detail preservation. The two synthesizers are inter-connected as mutual references and can be trained end-to-end without supervision. The SF-GAN has been evaluated on two tasks: (1) realistic scene text image synthesis for training better recognition models; (2) glass and hat wearing for realistically matching glasses and hats with real portraits. Qualitative and quantitative comparisons with the state-of-the-art demonstrate the superiority of the proposed SF-GAN.

1. Introduction

With the advances of deep neural networks (DNNs), image synthesis has been attracting increasing attention as a means of generating novel images and creating annotated images for training DNN models, where the latter has great potential to replace traditional manual annotation, which is usually costly, time-consuming and unscalable. The fast development of generative adversarial networks (GANs) [8] in recent years opens a new door to automated image synthesis, as GANs are capable of generating realistic images by adversarially training a generator and a discriminator. Three typical approaches have been explored for GAN-based image synthesis, namely, direct image generation [26, 32, 1], image-to-image translation [47, 15, 21, 13] and image composition [20, 2].

Figure 1. The proposed SF-GAN is capable of synthesizing realistic images concurrently in the geometry and appearance spaces. Rows 1 and 2 show a few synthesized scene text images and row 3 shows a few hat-wearing and glass-wearing images, where the foreground texts, glasses and hats as highlighted by red boxes are composed with the background scene and face images harmoniously.

On the other hand, most existing GANs were designed to achieve synthesis realism in either the geometry space or the appearance space but few in both. Consequently, most GAN-synthesized images contribute little (and are often even harmful) when they are used to train deep network models. In particular, direct image generation still faces difficulties in generating high-resolution images due to the limited network capacity. GAN-based image composition is capable of producing high-resolution images [20, 2] by placing foreground objects into background images, but most GAN-based image composition techniques focus on geometric realism only (e.g., object alignment with the contextual background) and often produce various artifacts due to appearance conflicts between the foreground objects and the background images. As a comparison, GAN-based image-to-image translation aims for appearance realism by learning the style of images of the target domain, whereas geometric realism is largely ignored.

We propose an innovative Spatial Fusion GAN (SF-GAN) that achieves synthesis realism in both the geometry and appearance spaces concurrently, a very challenging task in image synthesis due to a wide spectrum of conflicts between the foreground objects and background images with respect to relative scaling, spatial alignment, appearance


style, etc. The SF-GAN addresses these challenges by designing a geometry synthesizer and an appearance synthesizer. The geometry synthesizer learns the local geometry of background images, with which the foreground objects can be transformed and placed into the background images naturally. A discriminator is employed to train a spatial transformation network, aiming to produce transformed images that can mislead the discriminator. The appearance synthesizer learns to adjust the color, brightness and styles of the foreground objects for proper matching with the background images with minimum conflicts. A guided filter is introduced to compensate for the detail loss that occurs in most appearance-transfer GANs. The geometry synthesizer and appearance synthesizer are inter-connected as mutual references and can be trained end-to-end with little supervision.

The contributions of this work are threefold. First, it designs an innovative SF-GAN, an end-to-end trainable network that concurrently achieves synthesis realism in both the geometry and appearance spaces. To the best of our knowledge, this is the first GAN that can achieve synthesis realism in the geometry and appearance spaces concurrently. Second, it designs a fusion network that introduces guided filters for detail preservation and appearance realism, whereas most image-to-image translation GANs tend to lose details while performing appearance transfer. Third, it investigates and demonstrates the effectiveness of GAN-synthesized images in training deep recognition models, a very important issue that has been largely neglected in most existing GANs (except a few GANs for domain adaptation [13, 15, 21, 47]).

2. Related Work

2.1. Image Synthesis

Realistic image synthesis has been studied for years, from the synthesis of single objects [28, 29, 36] to the generation of full-scene images [7, 33]. Among different image synthesis approaches, image composition has been explored extensively; it synthesizes new images by placing foreground objects into existing background images. The target is to achieve composition realism by controlling the object size, orientation, and blending between foreground objects and background images. For example, [9, 16, 43] investigate the synthesis of scene text images for training better scene text detection and recognition models. They achieve synthesis realism by controlling a series of parameters such as text locations within the background image, geometric transformations of the foreground texts, blending between the foreground text and background image, etc. Other image composition systems have also been reported for DNN training [6], composition harmonization [25, 37], image inpainting [46], etc.

Optimal image blending is critical for good appearance consistency between the foreground object and background image as well as for minimal visual artifacts within the synthesized images. One straightforward way is to apply dense image matching at the pixel level so that only the corresponding pixels are copied and pasted, but this approach does not work well when the foreground object and background image have very different appearance. An alternative way is to make the transition as smooth as possible so that artifacts can be hidden or removed within the composed images, e.g., alpha blending [38], but this approach tends to blur fine details in the foreground object and background images. In addition, gradient-based techniques such as Poisson blending [30] can edit the image gradient and adjust the inconsistency in color and illumination to achieve seamless blending.

Most existing image synthesis techniques aim for geometric realism through hand-crafted transformations that involve complicated parameters and are prone to various unnatural alignments. Appearance realism is handled by different blending techniques in which features are manually selected and remain susceptible to artifacts. Our proposed technique instead adopts a GAN structure that learns geometry and appearance features from real images with little supervision, minimizing various inconsistencies and artifacts effectively.

2.2. GAN

GANs [8] have achieved great success in generating realistic new images from either existing images or random noise. The main idea is continuing adversarial learning between a generator and a discriminator, where the generator tries to generate more realistic images while the discriminator aims to distinguish the newly generated images from real images. Starting from generating MNIST handwritten digits, the quality of GAN-synthesized images has been improved greatly by the Laplacian pyramid of adversarial networks [5]. This was followed by various efforts that employ a DNN architecture [32], stack a pair of generators [44], learn more interpretable latent representations [4], adopt an alternative training method [1], etc.

Most existing GANs work towards synthesis realism in the appearance space. For example, CycleGAN [47] uses cycle-consistent adversarial networks for realistic image-to-image translation, as do other related GANs [15, 35]. LR-GAN [42] generates new images by applying additional spatial transformation networks (STNs) to factorize shape variations. GP-GAN [41] composes high-resolution images by using Poisson blending [30]. A few GANs have been reported in recent years for geometric realism, e.g., [20] presents a spatial transformer GAN (ST-GAN) that embeds STNs in the generator for geometric realism, and [2] designs a Compositional GAN that employs a self-consistent composition-decomposition network.

Most existing GANs synthesize images in either the geometry space (e.g., ST-GAN) or the appearance space (e.g., CycleGAN) but few in both spaces.


In addition, the GAN-synthesized images are usually not suitable for training deep network models due to the lack of annotations or synthesis realism. Our proposed SF-GAN can achieve both appearance and geometry realism by synthesizing images in the appearance and geometry spaces concurrently. Its synthesized images can be directly used to train more powerful deep network models due to their high realism.

2.3. Guided Filter

Guided filters [11, 12] use one image as guidance for filtering another image and have shown superior performance in detail-preserving filtering. The filtering output is a linear transform of the guidance image that takes its structures into account, where the guidance image can be the input image itself or a different image. Guided filtering has been used in various computer vision tasks, e.g., [19] uses it for weighted averaging and image fusion, [45] uses a rolling guidance for fully-controlled detail smoothing in an iterative manner, [40] uses a fast guided filter for efficient image super-resolution, [23] uses guided filters for high-quality depth map restoration, [22] uses guided filtering for tolerance to heavy noise and structure inconsistency, and [10] formulates guided filtering as a nonconvex optimization problem and proposes solutions via majorize-minimization [14].

Most GANs for image-to-image translation can synthesize high-resolution images, but the appearance transfer often suppresses image details such as edges and texture. How to keep the details of the original image while learning the appearance of the target remains an active research area. The proposed SF-GAN introduces guided filters into a cycle network, which is capable of achieving appearance transfer and detail preservation concurrently.

3. The Proposed Method

The proposed SF-GAN consists of a geometry synthesizer and an appearance synthesizer, and the whole network is end-to-end trainable as illustrated in Fig. 2. The detailed network structure and training strategy are introduced in the following subsections.

3.1. Geometry Synthesizer

The geometry synthesizer has a local GAN structure as highlighted by the blue lines and boxes on the left of Fig. 2. It consists of a spatial transformation network (STN), a composition module and a discriminator. The STN consists of an estimation network as shown in Table 1 and a transformation matrix with N parameters that control the geometric transformation of the foreground object.

The foreground object and background image are concatenated to form the input of the STN, where the estimation network predicts a transformation matrix to transform the foreground object.

Table 1. The structure of the geometry estimation network within the STN in Fig. 2.

Layers   Out Size   Configurations
Block1   16 x 50    3 x 3 conv, 32, 2 x 2 pool
Block2   8 x 25     3 x 3 conv, 64, 2 x 2 pool
Block3   4 x 13     3 x 3 conv, 128, 2 x 2 pool
FC1      512        -
FC2      N          -
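As a concrete illustration of Table 1, the following PyTorch sketch builds the estimation network under our own assumptions: a 32 x 100 input (which yields the listed 16 x 50, 8 x 25 and 4 x 13 feature maps) and a configurable number of input channels for the concatenated foreground and background; the class name is ours, not the paper's.

import torch
import torch.nn as nn

class GeometryEstimator(nn.Module):
    """Conv blocks and FC layers following Table 1; predicts N transformation parameters."""
    def __init__(self, in_channels, n_params):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, ceil_mode=True))          # ceil_mode keeps 25 -> 13
        self.features = nn.Sequential(
            block(in_channels, 32), block(32, 64), block(64, 128))
        self.fc1 = nn.Linear(128 * 4 * 13, 512)            # FC1 in Table 1
        self.fc2 = nn.Linear(512, n_params)                # FC2 outputs the N parameters

    def forward(self, x):                                  # x: concatenated foreground/background
        h = self.features(x).flatten(1)
        return self.fc2(torch.relu(self.fc1(h)))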

The transformation can be affine, a homography, or a thin plate spline [3] (we use a thin plate spline for the scene text synthesis task and a homography for the portrait wearing task). Each pixel in the transformed image is computed by applying a sampling kernel centered at a particular location in the original image. With the pixels in the original and transformed images denoted by P^s = (p^s_1, p^s_2, ..., p^s_N) and P^t = (p^t_1, p^t_2, ..., p^t_N), we use a transformation matrix H to perform the pixel-wise transformation as follows:

(x^t_i, y^t_i, 1)^T = H (x^s_i, y^s_i, 1)^T    (1)

where p^s_i = (x^s_i, y^s_i) and p^t_i = (x^t_i, y^t_i) denote the coordinates of the i-th pixel within the original and transformed image, respectively.
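For illustration, a minimal NumPy sketch of Eq. (1) follows, assuming H is a 3 x 3 affine or homography matrix and the pixel coordinates are stacked in homogeneous form; the function name is ours.

import numpy as np

def warp_points(H, pts):
    """pts: (N, 2) array of (x, y) source coordinates; returns warped (x, y) coordinates."""
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])   # rows become (x, y, 1)
    mapped = homog @ H.T                                   # Eq. (1) applied to every pixel
    return mapped[:, :2] / mapped[:, 2:3]                  # divide by w (a no-op for affine H)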

The transformed foreground object can thus be placed into the background image to form an initially composed image (Composed Image in Fig. 2). The discriminator D2 in Fig. 2 learns to distinguish whether the composed image is realistic with respect to a set of Real Images. On the other hand, our study shows that real images are not good references for training the geometry synthesizer. The reason is that real images are realistic in both the geometry and appearance spaces, while the geometry synthesizer can only achieve realism in the geometry space. The difference in the appearance space between the synthesized images and real images will mislead the training of the geometry synthesizer. For optimal training of the geometry synthesizer, the reference images should be realistic in the geometry space only and concurrently have similar appearance (e.g., colors and styles) to the initially composed images. Such reference images are difficult to create manually. In the SF-GAN, we elegantly use images from the appearance synthesizer (Adapted Real in Fig. 2) as the reference to train the geometry synthesizer; more details about the appearance synthesizer are discussed in the following subsection.

3.2. Appearance Synthesizer

The appearance synthesizer is designed in a cycle structure as highlighted by the orange lines and boxes on the right of Fig. 2. It aims to fuse the foreground object and background image to achieve synthesis realism in the appearance space.


Figure 2. The structure of the proposed SF-GAN: The geometry synthesizer is highlighted by the blue lines and boxes on the left and the appearance synthesizer is highlighted by the orange lines and boxes on the right. STN denotes the spatial transformation network, F denotes guided filters, and G1, G2, D1 and D2 denote the generators and discriminators. For clarity, the cycle loss and identity loss are not included.

Image-to-image translation GANs also strive for realistic appearance, but they usually lose visual details while performing the appearance transfer. Within the proposed SF-GAN, guided filters are introduced that help to preserve visual details effectively while working towards synthesis realism in the appearance space.

3.2.1 Cycle Structure

The proposed SF-GAN adopts a cycle structure for mapping between two domains, namely, the composed image domain and the real image domain. Two generators G1 and G2 are designed to achieve image-to-image translation in two reverse directions, G1 from the Composed Image to the Final Synthesis and G2 from Real Images to Adapted Real, as illustrated in Fig. 2. Two discriminators D1 and D2 are designed to discriminate real images and translated images.

In particular, D1 strives to distinguish the adapted composed images (i.e., the Composed Image after domain adaptation by G1) from Real Images, forcing G1 to learn to map from the Composed Image to Final Synthesis images that are realistic in the appearance space. G2 learns to map from Real Images to Adapted Real images, images that ideally are realistic in the geometry space only but have similar appearance to the Composed Image. As discussed in the previous subsection, the Adapted Real from G2 is used as the reference for training the geometry synthesizer, so that it can better focus on synthesizing images with realistic geometry (as the interfering appearance difference has been suppressed in the Adapted Real).

Image appearance transfer usually comes with detail loss. We address this issue from two perspectives. The first is an adaptive combination of the cycle loss and the identity loss. Specifically, we adopt a weighted combination strategy that assigns a higher weight to the cycle loss for regions of interest and a higher weight to the identity loss for the remaining regions, as sketched below. Take scene text image synthesis as an example: by assigning a larger cycle-loss weight and a smaller identity-loss weight to text regions, we ensure a multi-mode mapping of the text style while keeping the background similar to the original image. The second is the introduction of guided filters into the cycle structure for detail preservation, as described in the next subsection.
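A hedged PyTorch sketch of this region-weighted combination is given below; the weight values and the use of a foreground text mask to split regions of interest from the rest are our assumptions rather than the paper's exact settings.

import torch

def weighted_cycle_identity(y, cyc, idt, fg_mask, w_hi=1.0, w_lo=0.1):
    """y: Composed Image, cyc: G2(G1(y)), idt: G1(y), fg_mask in [0, 1]; weights are placeholders."""
    w_cyc = w_hi * fg_mask + w_lo * (1 - fg_mask)   # stress the cycle loss on text regions
    w_idt = w_lo * fg_mask + w_hi * (1 - fg_mask)   # stress the identity loss on the background
    loss_cyc = (w_cyc * (cyc - y).abs()).mean()
    loss_idt = (w_idt * (idt - y).abs()).mean()
    return loss_cyc, loss_idt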

3.2.2 Guided Filter

The guided filter was designed to perform edge-preserving image smoothing. It influences the filtering by using the structures in a guidance image. As the appearance transfer in most image-to-image translation GANs tends to lose image details, we introduce guided filters (F in Fig. 2) into the SF-GAN to preserve details within the translated images. The target is to perform appearance transfer on the foreground object (within the Composed Image) only while keeping the background image with minimum changes.

We introduce guided filters into the proposed SF-GAN and formulate the detail-preserving appearance transfer as a joint up-sampling problem as illustrated in Fig. 3. In particular, the translated image from the output of G1 (image details lost) is the input image I to be filtered, and the initial Composed Image (image details unchanged) shown in Fig. 2 acts as the guidance image R that provides edge and texture details. The detail-preserving image T (corresponding to the Synthesized Image in Fig. 2) can thus be derived by minimizing the reconstruction error between T and R, subject to a linear model:

T_i = a_k I_i + b_k,  for all i in ω_k    (2)

where i is the index of a pixel and ω_k is a local square window centered at pixel k.


Figure 3. Detailed structure of the guided filter F: Given an image to be filtered (Composed Image in Fig. 2), a translated image with smoothed details (the output of G1 in Fig. 2, where details are lost around the background face and foreground hat areas) and the mask of the foreground hat (provided), F produces a new image with full details (Synthesized Image, the output of F at the bottom of Fig. 2). It can be seen that the guided filter preserves details of both the background image (e.g., the face area) and the foreground hat (e.g., the image areas highlighted by the red box).

To determine the coefficients a_k and b_k of the linear model, we seek a solution that minimizes the difference between T and the filter input R, which can be derived by minimizing the following cost function in the local window:

E(a_k, b_k) = Σ_{i∈ω_k} ((a_k I_i + b_k − R_i)² + ε a_k²)    (3)

where ε is a regularization parameter that prevents a_k from being too large. It can be solved via linear regression:

a_k = ( (1/|ω|) Σ_{i∈ω_k} I_i R_i − μ_k R̄_k ) / (σ_k² + ε)    (4)

b_k = R̄_k − a_k μ_k    (5)

where μ_k and σ_k² are the mean and variance of I in ω_k, |ω| is the number of pixels in ω_k, and R̄_k = (1/|ω|) Σ_{i∈ω_k} R_i is the mean of R in ω_k.

By applying the linear model to all windows ω_k in the image and computing (a_k, b_k), the filter output can be derived by averaging all possible values of T_i:

T_i = (1/|ω|) Σ_{k: i∈ω_k} (a_k I_i + b_k) = ā_i I_i + b̄_i    (6)

where ā_i = (1/|ω|) Σ_{k∈ω_i} a_k and b̄_i = (1/|ω|) Σ_{k∈ω_i} b_k. We integrate the guided filter into the cycle-structure network to implement an end-to-end trainable system.
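As a single-channel illustration of Eqs. (3)-(6), the sketch below computes the local statistics with box (mean) filters; the window radius, the value of ε and the NumPy/SciPy realization are our choices, and the foreground mask shown in Fig. 3 is omitted for brevity.

import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, R, radius=8, eps=1e-2):
    """I: translated image from G1 (details smoothed), R: Composed Image carrying the details."""
    mean = lambda x: uniform_filter(x, size=2 * radius + 1)   # box mean over the window omega_k
    mu_I, mu_R = mean(I), mean(R)
    var_I = mean(I * I) - mu_I ** 2          # sigma_k^2 in Eq. (4)
    cov_IR = mean(I * R) - mu_I * mu_R       # numerator of Eq. (4)
    a = cov_IR / (var_I + eps)               # Eq. (4)
    b = mu_R - a * mu_I                      # Eq. (5)
    return mean(a) * I + mean(b)             # Eq. (6): T_i = a_bar_i * I_i + b_bar_i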

3.3. Adversarial Training

The proposed SF-GAN is designed to achieve synthesis realism in both the geometry and appearance spaces. The SF-GAN training therefore has two adversarial objectives: one is to learn the real geometry and the other is to learn the real appearance. The geometry synthesizer and appearance synthesizer are actually two local GANs that are inter-connected and need coordination during training. For presentation clarity, we denote the Foreground Object and Background Image in Fig. 2 as x, the Composed Image as y and the Real Image as z, which belong to domains X, Y and Z, respectively.

For the geometry synthesizer, the STN can actually be viewed as a generator G0 which predicts the transformation parameters for x. After the transformation of the Foreground Object and the composition, the Composed Image becomes the input of the discriminator D2, and the training reference z′ comes from G2(z) of the appearance synthesizer. For the geometry synthesizer, we adopt the Wasserstein GAN [1] objective for training, which can be denoted by:

min_{G0} max_{D2}  E_{x∼X}[D2(G0(x))] − E_{z′∼Z′}[D2(z′)]    (7)

where Z′ denotes the domain of z′. Since G0 aims to minimize this objective against an adversary D2 that tries to maximize it, the loss functions of D2 and G0 can be defined by:

L_{D2} = E_{x∼X}[D2(G0(x))] − E_{z′∼Z′}[D2(z′)]    (8)

L_{G0} = −E_{x∼X}[D2(G0(x))]    (9)
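A minimal PyTorch sketch of Eqs. (8)-(9) follows, assuming d2 is the Wasserstein critic, composed is the Composed Image obtained from G0(x) and the composition step, and adapted_real is the reference z′ = G2(z); weight clipping or a gradient penalty for the critic is omitted here.

import torch

def geometry_losses(d2, composed, adapted_real):
    # Critic loss, Eq. (8): both inputs are detached so only D2 receives gradients here.
    loss_d2 = d2(composed.detach()).mean() - d2(adapted_real.detach()).mean()
    # STN (G0) loss, Eq. (9): gradients flow back through the composed image into G0.
    loss_g0 = -d2(composed).mean()
    return loss_d2, loss_g0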

The appearance synthesizer adopts a cycle structure that consists of two mappings, G1: Y → Z and G2: Z → Y. It has two adversarial discriminators D1 and D2. D2 is shared between the geometry and appearance synthesizers, and within the appearance synthesizer it aims to distinguish y from G2(z). The learning objectives thus consist of an adversarial loss for the mapping between domains and a cycle-consistency loss for preventing mode collapse. For the adversarial loss, the objective of the mapping G1: Y → Z (and similarly for the reverse mapping G2: Z → Y) can be defined by:

L_{D1} = E_{y∼Y}[D1(G1(y))] − E_{z∼Z}[D1(z)]    (10)

L_{G1} = −E_{y∼Y}[D1(G1(y))]    (11)

As the adversarial losses cannot guarantee that the learned function maps an individual input y to a desired output z, we introduce cycle consistency, aiming to ensure that the image translation cycle brings y back to the original image, i.e., y → G1(y) → G2(G1(y)) ≈ y. This can be achieved by a cycle-consistency loss:

L_{G1cyc} = E_{y∼p(y)}[‖G2(G1(y)) − y‖]    (12)


L_{G2cyc} = E_{z∼p(z)}[‖G1(G2(z)) − z‖]    (13)

We also introduce the identity loss to ensure that the translated image preserves features of the original image:

L_{G1idt} = E_{y∼Y}[‖G1(y) − y‖]    (14)

L_{G2idt} = E_{z∼Z}[‖G2(z) − z‖]    (15)

For each training step, the model updates the geometry synthesizer and the appearance synthesizer separately. In particular, L_{D2} and L_{G0} are optimized alternately while updating the geometry synthesizer. While updating the appearance synthesizer, all weights of the geometry synthesizer are frozen. In the mapping G1: Y → Z, L_{D1} and L_{G1} + λ1 L_{G1cyc} + λ2 L_{G1idt} are optimized alternately, where λ1 and λ2 control the relative importance of the cycle-consistency loss and the identity loss, respectively. In the mapping G2: Z → Y, L_{D2} and L_{G2} + λ1 L_{G2cyc} + λ2 L_{G2idt} are optimized alternately.

It should be noted that this sequential updating is necessary for end-to-end training of the proposed SF-GAN. If the geometry loss were discarded, the geometry synthesizer would have to be updated according to the loss function of the appearance synthesizer. In that case, the appearance synthesizer would generate blurry foreground objects regardless of the geometry synthesizer, which is similar to GANs for direct image generation. As discussed before, direct image generation cannot provide accurate annotation information, and the directly generated images also have low quality and are not suitable for training deep network models.
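The alternating schedule above can be sketched as follows; the networks g0 (STN), g1, g2, d1 and d2, the compose() helper, the optimizers and the λ values are placeholders rather than the paper's exact implementation, and the D1/D2 critic updates for the appearance mappings are abbreviated.

def sfgan_train_step(nets, opts, x_fg_bg, z_real, lam1=10.0, lam2=5.0):
    g0, g1, g2, d1, d2 = nets["g0"], nets["g1"], nets["g2"], nets["d1"], nets["d2"]

    # 1) Geometry synthesizer: optimize L_D2 and L_G0 alternately (Eqs. 8-9).
    y = compose(g0(x_fg_bg), x_fg_bg)               # Composed Image
    z_adapted = g2(z_real).detach()                 # Adapted Real reference z'
    loss_d2 = d2(y.detach()).mean() - d2(z_adapted).mean()
    step(opts["d2"], loss_d2)
    loss_g0 = -d2(compose(g0(x_fg_bg), x_fg_bg)).mean()
    step(opts["g0"], loss_g0)

    # 2) Appearance synthesizer: the geometry weights are frozen during this update.
    for p in g0.parameters():
        p.requires_grad_(False)
    y = compose(g0(x_fg_bg), x_fg_bg)
    loss_g1 = (-d1(g1(y)).mean()                          # adversarial term, Eq. (11)
               + lam1 * (g2(g1(y)) - y).abs().mean()      # cycle loss, Eq. (12)
               + lam2 * (g1(y) - y).abs().mean())         # identity loss, Eq. (14)
    step(opts["g1"], loss_g1)
    for p in g0.parameters():
        p.requires_grad_(True)

def step(opt, loss):
    opt.zero_grad()
    loss.backward()
    opt.step()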

4. Experiments

4.1. Datasets

ICDAR2013 [18] was used in the Robust Reading Competition of the International Conference on Document Analysis and Recognition (ICDAR) 2013. It contains 848 word images for network training and 1095 for testing.

ICDAR2015 [17] was used in the Robust Reading Competition of ICDAR 2015. It contains incidental scene text images that were captured without preparation. 2077 text image patches are cropped from this dataset, and a large number of the cropped scene texts suffer from perspective and curvature distortions.

IIIT5K [27] has 2000 training images and 3000 test images that are cropped from scene text and born-digital images. Each word in this dataset has a 50-word lexicon and a 1000-word lexicon, where each lexicon consists of a ground-truth word and a set of randomly picked words.

SVT [39] is collected from Google Street View images that were used for scene text detection research. 647 word images are cropped from 249 street view images, and most of the cropped texts are nearly horizontal.

SVTP [31] has 639 word images that are cropped from the SVT images. Most images in this dataset suffer from perspective distortion and were purposely selected for the evaluation of scene text recognition under perspective views.

CUTE [34] has 288 word images, most of which are curved. All words are cropped from the CUTE dataset, which contains 80 scene text images originally collected for scene text detection research.

CelebA [24] is a face image dataset that consists of more than 200k celebrity images with 40 attribute annotations. This dataset is characterized by its large quantity, large face pose variations, complicated background clutter and rich annotations, and it is widely used for face attribute prediction.

4.2. Scene Text Synthesis

Data Preparation: The SF-GAN needs a set of Real Images to act as references as illustrated in Fig. 2. We create the Real Images by cropping text image patches from the training images of ICDAR2013 [18], ICDAR2015 [17] and SVT [39] using the provided annotation boxes. While cropping the text image patches, we extend the annotation box (by an extra 1/4 of the width and height of the annotation box) to include a certain amount of neighboring background. The purpose is to include local geometric structures so that the geometry synthesizer can learn the transformation matrix for transforming the foreground text into correct alignment with the background image.

Besides the Real Images, the SF-GAN also needs a set of Background Images as shown in Fig. 2. For scene text image synthesis, we collect the background images by smoothing out the text pixels of the cropped Real Images. In particular, we first apply k-means clustering to the cropped Real Images to obtain the text pixels and thus the text mask. Inpainting is then applied over the text pixels, which produces a background image without text as shown in the second row of Fig. 4. Further, the Foreground Object (text for scene text synthesis) is computer-generated by using a 90k-word lexicon. The created Background Images, Foreground Texts and Real Images are fed to the network to train the SF-GAN.
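A hedged OpenCV sketch of this background preparation step is shown below; the number of clusters, the rule that treats the smaller cluster as text strokes and the inpainting radius are assumptions rather than the paper's exact settings.

import cv2
import numpy as np

def remove_text(patch_bgr, k=2, inpaint_radius=5):
    """Cluster pixel colors with k-means, treat the smaller cluster as text, and inpaint it away."""
    pixels = patch_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    labels = labels.reshape(patch_bgr.shape[:2])
    text_label = np.argmin(np.bincount(labels.ravel()))    # assume text strokes cover fewer pixels
    text_mask = np.uint8(labels == text_label) * 255
    background = cv2.inpaint(patch_bgr, text_mask, inpaint_radius, cv2.INPAINT_TELEA)
    return background, text_mask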

Note that the scene text images synthesized by the trained SF-GAN have a larger background, as the original background images are larger than the annotation boxes. Texts need to be cropped out with tighter boxes (to exclude the extra background) for optimal training of scene text recognition models. With the text maps, denoted by Transformed Object in Fig. 2, scene text patches can be cropped out accurately by detecting a minimal external rectangle.

Results Analysis: We use 1 million SF-GAN-synthesized scene text images to train scene text recognition models and use the model recognition performance to evaluate the usefulness of the synthesized images. In addition, the SF-GAN is benchmarked against a number of state-of-the-art synthesis techniques.


Table 2. Scene text recognition accuracy over the datasets ICDAR2013, ICDAR2015, SVT, IIIT5K, SVTP and CUTE, where 1 million synthesized text images are used for each of the compared methods listed.

Methods          ICDAR2013  ICDAR2015  SVT   IIIT5K  SVTP  CUTE  AVERAGE
Jaderberg [16]   58.1       35.5       67.0  57.2    48.9  35.3  50.3
Gupta [9]        62.2       38.2       48.8  59.1    38.9  36.3  47.3
Zhan [43]        62.5       37.7       63.5  59.5    46.7  36.9  51.1
ST-GAN [20]      57.2       35.3       63.8  57.3    43.2  34.1  48.5
SF-GAN(BS)       55.9       34.9       64.0  55.4    42.8  33.7  47.8
SF-GAN(GS)       57.3       35.6       66.5  57.7    43.9  36.1  49.5
SF-GAN(AS)       58.1       36.4       66.7  58.5    45.3  35.7  50.1
SF-GAN           61.8       39.0       69.3  63.0    48.6  40.6  53.7

Figure 4. Illustration of scene text image synthesis by different GANs: Rows 1-2 are foreground texts and background images as labelled. Rows 3-4 show the images synthesized by ST-GAN and CycleGAN, respectively. Row 5 shows images synthesized by SF-GAN(GS), the output of the geometry synthesizer in SF-GAN (Composed Image in Fig. 2). The last row shows images synthesized by the proposed SF-GAN.

For the comparison, we randomly select 1 million synthesized scene text images from [16] and randomly crop 1 million scene text images from [9] and [43]. Beyond that, we also synthesize 1 million scene text images with random text appearance by using ST-GAN [20].

For ablation analysis, we evaluate SF-GAN(GS), which denotes the output of the geometry synthesizer (Composed Image in Fig. 2), and SF-GAN(AS), which denotes the output of the appearance synthesizer with random geometric alignments. A baseline SF-GAN(BS) is also trained in which texts are placed with random alignment and appearance. The three SF-GAN variants each synthesize 1 million images for the scene text recognition tests. The recognition tests are performed over four regular scene text datasets, ICDAR2013 [18], ICDAR2015 [17], SVT [39] and IIIT5K [27], and two irregular datasets, SVTP [31] and CUTE [34], as described in Datasets. Besides the scene text recognition, we also perform user studies with Amazon Mechanical Turk (AMT), where users are recruited to tell whether SF-GAN-synthesized images are real or synthesized.

Tables 2 and 3 show the scene text recognition and AMT user study results. As Table 2 shows, SF-GAN achieves the highest recognition accuracy on most of the 6 datasets and an up to 3% improvement in average recognition accuracy (across the 6 datasets), demonstrating the superior usefulness of its synthesized images when used for training scene text recognition models. The ablation study shows that the proposed geometry synthesizer and appearance synthesizer both help to synthesize more realistic and useful images for recognition model training. In addition, they are complementary, and their combination achieves a 6% improvement in average recognition accuracy over the baseline SF-GAN(BS). The AMT results in the second column of Table 3 also show that the SF-GAN-synthesized scene text images are much more realistic than those of state-of-the-art synthesis techniques. Note that the synthesized images from [16] are gray-scale and are not included in the AMT user study.

Fig. 4 shows a few images synthesized by the proposed SF-GAN and a few state-of-the-art GANs. As Fig. 4 shows, ST-GAN can achieve geometric alignment but the appearance is clearly unrealistic within the synthesized images. CycleGAN can adapt the appearance of the foreground texts to a certain degree, but it ignores the real geometry.


Table 3. AMT user study to evaluate the realism of synthesized images. Percentages represent how often the images in each category were classified as real by Turkers.

Methods      Text  Glass  Hat
Gupta [9]    38.0  -      -
Zhan [43]    41.5  -      -
ST-GAN [20]  31.6  41.7   42.6
Real         74.1  78.6   78.2
SF-GAN       57.7  62.0   67.3

This leads to not only unrealistic geometry but also degraded appearance, as the discriminator can easily distinguish generated images from real images according to the geometry difference. SF-GAN(GS) gives the output of the geometry synthesizer, i.e., the Composed Image in Fig. 2, which produces better alignment due to good references from the appearance synthesizer. In addition, it can synthesize curved texts due to the use of a thin plate spline transformation [3]. The fully implemented SF-GAN can further learn text appearance from real images and synthesize highly realistic scene text images. Besides, we can see that the proposed SF-GAN can learn from neighboring texts within the background images and adapt the appearance of the foreground texts accordingly.

4.3. Portrait Wearing

Data preparation: We use the CelebA dataset [24] and follow the provided training/test split for the portrait wearing experiments. The training set is divided into two groups by using the annotations 'glass' and 'hat', respectively. For the glass case, the group of people with glasses serves as the real data to match against in our adversarial setting, and the other group without glasses serves as the background. For the foreground glasses, we crop 15 pairs of front-parallel glasses and reuse them to randomly compose with the background images. According to our experiments, 15 pairs of glasses as foreground objects are sufficient to train a robust model. The hat case has a similar setting, except that we use 30 cropped hats as the foreground objects.

Results Analysis: Fig. 5 shows a few SF-GAN-synthesized images and comparisons with ST-GAN-synthesized images. As Fig. 5 shows, ST-GAN achieves realism in the geometry space by aligning the glasses and hats with the background face images. On the other hand, the synthesized images are unrealistic in the appearance space, with clear artifacts in color, contrast and brightness. As a comparison, the SF-GAN-synthesized images are much more realistic in both the geometry and appearance spaces. In particular, the foreground glasses and hats within the SF-GAN-synthesized images have harmonious brightness, contrast and blending with the background face images.


Figure 5. Illustration of portrait wearing by different GANs: Columns 1-2 show foreground hats and glasses and background face images, respectively. Columns 3-4 show images synthesized by ST-GAN [20] and our proposed SF-GAN, respectively.

Additionally, the proposed SF-GAN achieves better geometric alignment than ST-GAN, which focuses on geometric alignment only. We conjecture that the better geometric alignment is largely due to the reference from the appearance synthesizer. The AMT results in the last two columns of Table 3 also show the superior synthesis performance of our proposed SF-GAN.

5. Conclusions

This paper presents the SF-GAN, an end-to-end trainable network that synthesizes realistic images given foreground objects and background images. The SF-GAN combines a geometry synthesizer and an appearance synthesizer and is capable of achieving synthesis realism in both the geometry and appearance spaces concurrently. Two use cases are studied. The first study, on scene text image synthesis, shows that the proposed SF-GAN is capable of synthesizing useful images for training better scene text recognition models. The second study, on hat wearing and glass matching, shows that the SF-GAN is widely applicable and can be easily extended to other tasks. We will continue to study SF-GAN for full-image synthesis for training better detection models.


References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In PMLR, 2017.
[2] S. Azadi, D. Pathak, S. Ebrahimi, and T. Darrell. Compositional GAN: Learning conditional image composition. CoRR, abs/1807.07560, 2018.
[3] F. L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6), 1989.
[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[5] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[6] D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV, 2017.
[7] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NIPS, pages 2672-2680, 2014.
[9] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
[10] B. Ham, M. Cho, and J. Ponce. Robust guided image filtering using nonconvex potentials. TPAMI, 2018.
[11] K. He and J. Sun. Fast guided filter. arXiv:1505.00996, 2015.
[12] K. He, J. Sun, and X. Tang. Guided image filtering. TPAMI, 2013.
[13] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[14] D. R. Hunter and K. Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 2000.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. NIPS Deep Learning Workshop, 2014.
[17] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156-1160, 2015.
[18] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, et al. ICDAR 2013 robust reading competition. In ICDAR, pages 1484-1493, 2013.
[19] S. Li, X. Kang, and J. Hu. Image fusion with guided filtering. TIP, 22(7), 2013.
[20] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.
[21] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
[22] W. Liu, X. Chen, C. Shen, J. Yu, Q. Wu, and J. Yang. Robust guided image filtering. arXiv:1703.09379, 2017.
[23] W. Liu, Y. Gu, C. Shen, X. Chen, Q. Wu, and J. Yang. Data driven robust image guided depth map restoration. TIP, 2017.
[24] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[25] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep painterly harmonization. arXiv:1804.03189, 2018.
[26] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
[27] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[28] Y. Movshovitz-Attias, T. Kanade, and Y. Sheikh. How useful is photo-realistic rendering for visual learning? In ECCV, 2016.
[29] D. Park and D. Ramanan. Articulated pose estimation with tiny synthetic videos. In CVPR, 2015.
[30] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3), 2003.
[31] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013.
[32] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[33] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
[34] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027-8048, 2014.
[35] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[36] H. Su, C. R. Qi, Y. Li, and L. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015.
[37] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Deep image harmonization. In CVPR, 2017.
[38] M. Uyttendaele, A. Eden, and R. Szeliski. Eliminating ghosting and exposure artifacts in image mosaics. In CVPR, 2001.
[39] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
[40] H. Wu, S. Zheng, J. Zhang, and K. Huang. Fast end-to-end trainable guided filter. In CVPR, 2017.
[41] H. Wu, S. Zheng, J. Zhang, and K. Huang. GP-GAN: Towards realistic high-resolution image blending. arXiv:1703.07195, 2017.


[42] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.
[43] F. Zhan, S. Lu, and C. Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In ECCV, 2018.
[44] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[45] Q. Zhang, X. Shen, L. Xu, and J. Jia. Rolling guidance filter. In ECCV, 2014.
[46] Y. Zhao, B. Price, S. Cohen, and D. Gurari. Guided image inpainting: Replacing an image region by pulling content from another image. arXiv:1803.08435, 2018.
[47] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
