
Deep Image Spatial Transformation for Person Image Generation

Yurui Ren 1,2, Xiaoming Yu 1,2, Junming Chen 1,2, Thomas H. Li 3,1, Ge Li 1,2

1 School of Electronic and Computer Engineering, Peking University   2 Peng Cheng Laboratory   3 Advanced Institute of Information Technology, Peking University

{yrren,xiaomingyu,junming.chen}@pku.edu.cn [email protected] [email protected]

Abstract

Pose-guided person image generation aims to transform a source person image to a target pose. This task requires spatial manipulation of the source data. However, Convolutional Neural Networks are limited by their lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results on video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.

1. Introduction

Image spatial transformation describes the spatial deformation between sources and targets. Such deformation can be caused by object motions or viewpoint changes. Many conditional image generation tasks can be seen as spatial transformation tasks. For example, pose-guided person image generation [18, 24, 26, 38] transforms a person image from a source pose to a target pose while retaining the appearance details. This task can be tackled by reasonably reassembling the input data in the spatial domain.

However, Convolutional Neural Networks are spatially invariant to the input data, since they compute outputs in a position-independent manner. This property benefits tasks that require reasoning about images, such as classification [13, 27], segmentation [4, 7], and detection [25].

Figure 1. The visualization of data spatial transformation. For each image pair, the left image is the generated result of our model, while the right image is the input source image. Our model spatially transforms the information from sources to targets at the feature level. The heat maps indicate the attention coefficients of local regions.

However, this property also limits the networks: they lack the ability to spatially rearrange the input data. Spatial Transformer Networks (STN) [10] address this problem by introducing a Spatial Transformer module into standard neural networks. This module regresses global transformation parameters and warps the input features using an affine transformation. However, since it assumes a global affine transformation between sources and targets, this method cannot deal with the deformations of non-rigid objects.

Attention mechanisms [29, 33] allow networks to make use of non-local information, giving them the ability to build long-term correlations. They have proved effective in many tasks such as natural language processing [29], image recognition [31, 9], and image generation [33]. However, for spatial transformation tasks in which the target images are deformation results of the source images, each output position has a clear one-to-one relationship with a source position. Therefore, the attention coefficient matrix between the source and target should be a sparse matrix instead of a dense matrix.

Flow-based operations force the attention coefficient matrix to be sparse by sampling a very local source patch for each output position. These methods predict 2-D coordinate vectors specifying which positions in the sources should be sampled to generate the targets. However, in order to stabilize training, most traditional flow-based methods [37, 3] warp the input data at the pixel level.


This prevents the networks from generating new content. Meanwhile, large motions are difficult to capture due to the requirement of generating full-resolution flow fields [20]. Warping the inputs at the feature level can solve these problems. However, the networks easily become stuck in bad local minima [21, 32] for two reasons. (1) As the input features and flow fields both change during training, in the early stages the features cannot obtain reasonable gradients without correct flow fields, and the network cannot extract similarities to generate correct flow fields without reasonable features. (2) The poor gradients provided by the commonly used bilinear sampling method further lead to instability in training. See Section A for more discussion.

To deal with these problems, in this paper we combine flow-based operations with attention mechanisms. We propose a novel global-flow local-attention framework that forces each output location to be related only to a local feature patch of the sources. The architecture of our model can be found in Figure 2. Specifically, our network can be divided into two parts: a Flow Field Estimator and a Target Image Generator. The Flow Field Estimator is responsible for extracting the global correlations and generating flow fields. The Target Image Generator synthesizes the final results using local attention. To avoid the poor gradient propagation of bilinear sampling, we propose a content-aware sampling method to calculate the local attention coefficients.

We compare our model with several state-of-the-art methods. The results of both subjective and objective experiments show the superior performance of our model. We also conduct comprehensive ablation studies to verify our hypotheses. Besides, we apply our model to other tasks requiring spatial transformation, including view synthesis and video animation. The results show the versatility of our module. The main contributions of our paper can be summarized as:

• A global-flow local-attention framework is proposed for pose-guided person image generation tasks. Experiments demonstrate the effectiveness of the proposed method.

• The carefully-designed framework and content-aware sampling operation ensure that our model is able to warp and reasonably reassemble the input data at the feature level. This not only enables the model to generate new content, but also reduces the difficulty of the flow field estimation task.

• Additional experiments on view synthesis and video animation show that our model can be flexibly applied to different spatial transformation tasks and can use different structure guidance.

2. Related Work

Pose-guided Person Image Generation. An early attempt [18] on the pose-guided person image generation task proposes a two-stage network that first generates a coarse image with the target pose and then refines the results in an adversarial way. Esser et al. [2] try to disentangle the appearance and pose of person images. Their model enables both conditional image generation and transformation. However, they use U-Net based skip connections, which may lead to feature misalignments. Siarohin et al. [24] solve this problem by introducing deformable skip connections to spatially transform the textures. Their method decomposes the overall deformation into a set of local affine transformations (e.g. arms and legs). Although it works well for person image generation, the requirement of pre-defined transformation components limits its application. Zhu et al. [38] propose a more flexible method that uses a progressive attention module to transform the source data. However, useful information may be lost during multiple transfers, which can result in blurry details. Han et al. [6] use a flow-based method to transform the source information. However, they warp the sources at the pixel level, which means that further refinement networks are required to fill the holes of occluded content. Liu et al. [16] and Li et al. [14] warp the inputs at the feature level, but both need additional 3D human models to calculate the flow fields between sources and targets, which limits the applicability of these models. Our model does not require any supplementary information and obtains the flow fields in an unsupervised manner.

Image Spatial Transformation. Many methods have been proposed to enable spatial transformation capability in Convolutional Neural Networks. Jaderberg et al. [10] introduce a differentiable Spatial Transformer module that estimates global transformation parameters and warps the features with an affine transformation. Several variants have been proposed to improve the performance. Zhang et al. add controlling points for free-form deformation [34]. The model proposed in [15] sends the transformation parameters instead of the transformed features to the network to avoid sampling errors. Jiang et al. [11] demonstrate the poor gradient propagation of the commonly used bilinear sampling and propose a linearized multi-sampling method for spatial transformation.

Flow-based methods are more flexible than affine transformation methods and can deal with complex deformations. Appearance flow [37] predicts flow fields and generates the targets by warping the sources. However, it warps image pixels instead of features, which prevents the model from generating new content. Besides, it requires the model to predict flow fields with the same resolution as the result images, which makes it difficult to capture large motions [39, 20]. Vid2vid [30] deals with these problems by first predicting the ground-truth flow fields using FlowNet [3] and then training its flow estimator in a supervised manner.


Figure 2. Overview of our model. The Flow Field Estimator is used to generate global flow fields. The Target Image Generator yields results by spatially transforming the source features using local attention. Dotted lines indicate that our local attention module can be used at different scales.

They also use a generator for occluded content generation. Warping the sources at the feature level can avoid these problems. In order to stabilize the training, some papers propose to obtain the flow fields using additional assumptions or supplementary information. Paper [23] assumes that keypoints are located on object parts that are locally rigid and generates dense flow fields from sparse keypoints. Papers [16, 14] use 3D human models and visibility maps to calculate the flow fields between sources and targets. Paper [21] proposes a sampling correctness loss to constrain the flow fields and achieves good results.

3. Our Approach

For the pose-guided person image generation task, target images are the deformation results of source images, which means that each position of the targets is only related to a local region of the sources. Therefore, we design a global-flow local-attention framework to force the attention coefficient matrix to be a sparse matrix. Our network architecture is shown in Figure 2. It consists of two modules: a Flow Field Estimator F and a Target Image Generator G. The Flow Field Estimator is responsible for estimating the motions between sources and targets. It generates global flow fields w and occlusion masks m for the local attention blocks. With w and m, the Target Image Generator extracts the corresponding local feature block pair centered at each location and calculates the output feature using local attention. The following describes each module in detail. Please note that, to simplify the notation, we describe the network with a single local attention block. As shown in Figure 2, our model can be extended to use multiple attention blocks at different scales.

3.1. Flow Field Estimator

Let ps and pt denote the structure guidance of the source image xs and the target image xt, respectively. The Flow Field Estimator F predicts the motions between xs and xt in an unsupervised manner. It takes xs, ps and pt as inputs and generates flow fields w and occlusion masks m:

$$w, m = F(x_s, p_s, p_t) \tag{1}$$

where w contains the coordinate offsets between sources and targets. The occlusion mask m, with continuous values between 0 and 1, indicates whether the information of a target position exists in the sources. w and m share all weights of F other than their output layers.
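
For concreteness, a minimal sketch of the interface implied by Eq. (1) is given below. It is not the released implementation: the backbone is a placeholder, and only the two separate output heads reflect the shared-weight design described above.

```python
import torch
import torch.nn as nn

class FlowFieldEstimator(nn.Module):
    """Sketch of F: predicts flow fields w and occlusion masks m (Eq. 1).

    The backbone is a stand-in; the released model uses an auto-encoder built
    from residual blocks (see Section D). The key point illustrated here is
    that w and m share all weights except their own output heads.
    """
    def __init__(self, in_channels, base_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(                                # assumed placeholder
            nn.Conv2d(in_channels, base_channels, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv2d(base_channels, base_channels, 3, padding=1),
            nn.LeakyReLU(0.1),
        )
        self.flow_head = nn.Conv2d(base_channels, 2, 3, padding=1)   # coordinate offsets w
        self.mask_head = nn.Conv2d(base_channels, 1, 3, padding=1)   # occlusion mask m

    def forward(self, x_s, p_s, p_t):
        h = self.backbone(torch.cat((x_s, p_s, p_t), dim=1))
        w = self.flow_head(h)
        m = torch.sigmoid(self.mask_head(h))                         # values in [0, 1]
        return w, m
```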

Different from some previous flow-based methods [37, 3, 30], our model warps features instead of pixels, which enables the model to generate new content and reduces the difficulty of the flow field estimation. However, networks easily become stuck in bad local minima when warping at the feature level [32, 21]. Therefore, we use the sampling correctness loss proposed by [21] to constrain w. It calculates the similarity between the warped source and the ground-truth target at the VGG feature level. Let vs and vt denote the features generated by a specific layer of VGG19, and let vs,w = w(vs) denote the result of warping vs with the flow field w.


Figure 3. Overview of our Local Attention. We first extract the feature patch pair from the source and target according to the flow fields. Then the context-aware sampling kernel is calculated by the kernel prediction net. Finally, we sample the source feature and obtain the warped result located at l.

The sampling correctness loss calculates the relative cosine similarity between vs,w and vt:

$$\mathcal{L}_c = \frac{1}{N}\sum_{l\in\Omega}\exp\left(-\frac{\mu(v_{s,w}^{l},\, v_t^{l})}{\mu_{max}^{l}}\right) \tag{2}$$

where µ(∗) denotes the cosine similarity. The coordinate set Ω contains all N positions in the feature maps, and $v_{s,w}^{l}$ denotes the feature of vs,w located at the coordinate l = (x, y). The normalization term $\mu_{max}^{l}$ is calculated as

$$\mu_{max}^{l} = \max_{l'\in\Omega}\,\mu(v_s^{l'},\, v_t^{l}) \tag{3}$$

It is used to avoid the bias brought by occlusion.
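
As an illustration, the following is a minimal PyTorch sketch of how Eq. (2)-(3) could be computed. It is not the authors' released code; the (B, 2, H, W) layout of the flow fields and the use of grid_sample for warping are assumptions.

```python
import torch
import torch.nn.functional as F

def sampling_correctness_loss(v_s, v_t, flow):
    """Relative cosine similarity between warped source and target VGG features.

    v_s, v_t: source / target VGG feature maps of shape (B, C, H, W).
    flow:     coordinate offsets w of shape (B, 2, H, W) (assumed (dx, dy) layout).
    """
    b, c, h, w = v_s.shape

    # Warp the source features with the flow (grid_sample expects coords in [-1, 1]).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(v_s.device)          # (2, H, W)
    coords = base.unsqueeze(0) + flow                                   # (B, 2, H, W)
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)        # (B, H, W, 2)
    v_sw = F.grid_sample(v_s, grid, align_corners=True)                 # v_{s,w}

    # Numerator of Eq. (2): cosine similarity at every location l.
    mu = F.cosine_similarity(v_sw, v_t, dim=1)                          # (B, H, W)

    # Normalization term of Eq. (3): best similarity over all source positions.
    v_s_flat = F.normalize(v_s.flatten(2), dim=1)                       # (B, C, N)
    v_t_flat = F.normalize(v_t.flatten(2), dim=1)                       # (B, C, N)
    mu_max = torch.bmm(v_t_flat.transpose(1, 2), v_s_flat).max(dim=2).values
    mu_max = mu_max.view(b, h, w)

    # Eq. (2): average of exp(-mu / mu_max) over the N positions.
    return torch.exp(-mu / (mu_max + 1e-8)).mean()
```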

The sampling correctness loss constrains the flow fields to sample semantically similar regions. However, since the deformations of image neighborhoods are highly correlated, it would be beneficial to exploit this relationship. Therefore, we further add a regularization term to our flow fields. This regularization term penalizes local regions where the transformation is not an affine transformation. Let ct be the 2D coordinate matrix of the target feature map. The corresponding source coordinate matrix can be written as cs = ct + w. We use Nn(ct, l) to denote the local n × n patch of ct centered at the location l. Our regularization assumes that the transformation between Nn(ct, l) and Nn(cs, l) is an affine transformation:

$$T_l = A_l S_l = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} S_l \tag{4}$$

where $T_l = \begin{bmatrix} x_1 & x_2 & \dots & x_{n\times n} \\ y_1 & y_2 & \dots & y_{n\times n} \end{bmatrix}$ with each coordinate $(x_i, y_i) \in \mathcal{N}_n(c_t, l)$, and $S_l = \begin{bmatrix} x_1 & x_2 & \dots & x_{n\times n} \\ y_1 & y_2 & \dots & y_{n\times n} \\ 1 & 1 & \dots & 1 \end{bmatrix}$ with each coordinate $(x_i, y_i) \in \mathcal{N}_n(c_s, l)$. The estimated affine transformation parameters $A_l$ can be solved using least-squares estimation as

$$A_l = (S_l^{H} S_l)^{-1} S_l^{H} T_l \tag{5}$$

Our regularization is calculated as the $\ell_2$ distance of the error:

$$\mathcal{L}_r = \sum_{l\in\Omega} \left\| T_l - A_l S_l \right\|_2^2 \tag{6}$$

3.2. Target Image Generator

With the flow fields w and occlusion masks m, our Target Image Generator G is responsible for synthesizing the results using local attention. As shown in Figure 2, G takes xs, pt, w and m as inputs and generates the result image $\hat{x}_t$:

$$\hat{x}_t = G(x_s, p_t, w, m) \tag{7}$$

Two branches are designed in G for different purposes. One is used to generate the feature of the target images ft using the structure information pt, while the other is used to extract the feature fs from the source images xs. As shown in Figure 3, our local attention block transforms vivid textures from the source features to the target features.

                       DeepFashion                  Market-1501                          Number of
             FID      LPIPS    JND        FID      LPIPS    Mask-LPIPS   JND            Parameters
Def-GAN      18.457   0.2330   9.12%      25.364   0.2994   0.1496       23.33%         82.08M
VU-Net       23.667   0.2637   2.96%      20.144   0.3211   0.1747       24.48%         139.36M
Pose-Attn    20.739   0.2533   6.11%      22.657   0.3196   0.1590       16.56%         41.36M
Intr-Flow    16.314   0.2131   12.61%     27.163   0.2888   0.1403       30.85%         49.58M
Ours         10.573   0.2341   24.80%     19.751   0.2817   0.1482       27.81%         14.04M

Table 1. The evaluation results compared with several state-of-the-art methods including Def-GAN [24], VU-Net [2], Pose-Attn [38], and Intr-Flow [14] over the DeepFashion [17] and Market-1501 [36] datasets. FID [8] and LPIPS [35] are objective metrics. JND is obtained by human subjective studies. It represents the probability that the generated images are mistaken for real images.

We first extract local patches Nn(ft, l) and Nn(fs, l + wl) from ft and fs, respectively. The patch Nn(fs, l + wl) is extracted using bilinear sampling, as the coordinates may not be integers. Then, a kernel prediction network M is used to predict the local n × n kernel kl as

$$k_l = M\left(\mathcal{N}_n(f_s, l + w_l),\, \mathcal{N}_n(f_t, l)\right) \tag{8}$$

The softmax function is used as the non-linear activation function of the output layer of M. This operation forces the sum of kl to be 1, which stabilizes the gradient backward pass. Finally, the flowed feature located at coordinate l = (x, y) is calculated using content-aware attention over the extracted source feature patch Nn(fs, l + wl):

$$f_{attn}^{l} = P\left(k_l \otimes \mathcal{N}_n(f_s, l + w_l)\right) \tag{9}$$

where ⊗ denotes element-wise multiplication over the spatial domain and P represents the global average pooling operation. The warped feature map fattn is obtained by repeating the previous steps for each location l.
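
To make the operation concrete, here is a sketch of how the content-aware sampling of Eqs. (8)-(9) could be computed for all locations in parallel: target patches are gathered with unfold, the flowed source patches with grid_sample, and the kernel prediction net M is realized as a 1x1 convolution over the concatenated patch pair, followed by a softmax over the n x n positions. The tensor layouts and the exact form of kernel_net are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def local_attention_warp(f_s, f_t, flow, kernel_net, n=3):
    """Warp source features with content-aware local attention (Eqs. 8-9).

    f_s, f_t:   (B, C, H, W) source / target feature maps.
    flow:       (B, 2, H, W) coordinate offsets w (assumed (dx, dy) layout).
    kernel_net: predicts n*n kernel logits from the concatenated patch pair,
                e.g. nn.Conv2d(2 * C * n * n, n * n, kernel_size=1)  (assumed).
    """
    b, c, h, w = f_s.shape
    pad = n // 2

    # Target patches N_n(f_t, l) for every location l.
    patches_t = F.unfold(f_t, n, padding=pad).view(b, c * n * n, h, w)

    # Source patches N_n(f_s, l + w_l): one bilinear sampling grid per kernel offset.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(f_s.device)          # (2, H, W)
    centre = base.unsqueeze(0) + flow                                   # (B, 2, H, W)
    dy, dx = torch.meshgrid(torch.arange(n) - pad, torch.arange(n) - pad, indexing="ij")
    offs = torch.stack((dx, dy), dim=0).float().to(f_s.device)          # (2, n, n)
    coords = centre.view(b, 2, 1, h, w) + offs.view(1, 2, n * n, 1, 1)  # (B, 2, n*n, H, W)
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)        # (B, n*n, H, W, 2)
    patches_s = torch.stack(
        [F.grid_sample(f_s, grid[:, k], align_corners=True) for k in range(n * n)],
        dim=2)                                                          # (B, C, n*n, H, W)

    # Kernel prediction (Eq. 8): softmax-normalized weights over the n*n positions.
    inp = torch.cat((patches_s.reshape(b, c * n * n, h, w), patches_t), dim=1)
    k = F.softmax(kernel_net(inp), dim=1)                               # (B, n*n, H, W)

    # Content-aware sampling (Eq. 9): kernel-weighted aggregation of the source patch.
    return (patches_s * k.unsqueeze(1)).sum(dim=2)                      # (B, C, H, W)
```

The occlusion mask m described next would then be applied outside this function to blend the warped feature with the target-branch feature.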

However, not all contents of the target images can be found in the source images because of occlusions or movements. In order to enable our Target Image Generator G to generate new content, the occlusion mask m, with continuous values between 0 and 1, is used to select features between fattn and ft:

$$f_{out} = (1 - m) * f_t + m * f_{attn} \tag{10}$$

We train the network using a joint loss consisting of a reconstruction ℓ1 loss, an adversarial loss, a perceptual loss, and a style loss. The reconstruction ℓ1 loss is written as

$$\mathcal{L}_{\ell_1} = \left\| \hat{x}_t - x_t \right\|_1 \tag{11}$$

The generative adversarial framework [5] is employed to mimic the distribution of the ground-truth xt. The adversarial loss is written as

$$\mathcal{L}_{adv} = \mathbb{E}\left[\log\left(1 - D(G(x_s, p_t, w, m))\right)\right] + \mathbb{E}\left[\log D(x_t)\right] \tag{12}$$

where D is the discriminator for the Target Image Generator G. We also use the perceptual loss and style loss introduced by [12]. The perceptual loss calculates the ℓ1 distance between activation maps of a pre-trained network. It can be written as

$$\mathcal{L}_{perc} = \sum_i \left\| \phi_i(\hat{x}_t) - \phi_i(x_t) \right\|_1 \tag{13}$$

where φi is the activation map of the i-th layer of a pre-trained network. The style loss calculates the statistical error between the activation maps as

$$\mathcal{L}_{style} = \sum_j \left\| G_j^{\phi}(\hat{x}_t) - G_j^{\phi}(x_t) \right\|_1 \tag{14}$$

where $G_j^{\phi}$ is the Gram matrix constructed from the activation maps φj. We train our model using the overall loss

$$\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_r \mathcal{L}_r + \lambda_{\ell_1} \mathcal{L}_{\ell_1} + \lambda_a \mathcal{L}_{adv} + \lambda_p \mathcal{L}_{perc} + \lambda_s \mathcal{L}_{style} \tag{15}$$
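
For reference, the loss terms could be combined as below. The λ values are those reported in Section D; the helpers that produce the activation maps (e.g. a pre-trained VGG) and the precomputed adversarial term are assumptions of this sketch.

```python
import torch

def perceptual_and_style(phi_hat, phi_gt):
    """L_perc (Eq. 13) and L_style (Eq. 14) from matched lists of activation maps
    of a pre-trained network (the feature extractor itself is an assumed helper)."""
    perc, style = 0.0, 0.0
    for a, b in zip(phi_hat, phi_gt):
        perc = perc + torch.abs(a - b).mean()
        n, c, h, w = a.shape
        # Gram matrices for the style term (the normalization is a sketch detail).
        ga = (a.flatten(2) @ a.flatten(2).transpose(1, 2)) / (c * h * w)
        gb = (b.flatten(2) @ b.flatten(2).transpose(1, 2)) / (c * h * w)
        style = style + torch.abs(ga - gb).mean()
    return perc, style

def total_loss(l_c, l_r, l_l1, l_adv, l_perc, l_style):
    """Weighted sum of Eq. (15), using the weights reported in Section D."""
    lam = dict(c=5.0, r=0.0025, l1=5.0, a=2.0, p=0.5, s=500.0)
    return (lam["c"] * l_c + lam["r"] * l_r + lam["l1"] * l_l1
            + lam["a"] * l_adv + lam["p"] * l_perc + lam["s"] * l_style)
```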

4. Experiments

4.1. Implementation Details

Datasets. Two datasets are used in our experiments: the person re-identification dataset Market-1501 [36] and the DeepFashion In-shop Clothes Retrieval Benchmark [17]. Market-1501 contains 32,668 low-resolution images (128 × 64). The images vary in viewpoint, background, illumination, etc. The DeepFashion dataset contains 52,712 high-quality model images with clean backgrounds. We split the datasets with the same method as [38]. The person identities of the training and testing sets do not overlap.

Metrics. We use the Learned Perceptual Image Patch Similarity (LPIPS) proposed by [35] to calculate the reconstruction error. LPIPS computes the distance between the generated images and reference images in the perceptual domain and indicates the perceptual difference between the inputs. Meanwhile, the Fréchet Inception Distance (FID) [8] is employed to measure the realism of the generated images. It calculates the Wasserstein-2 distance between the distributions of the generated images and ground-truth images; smaller distribution errors yield better scores. Besides, we perform a Just Noticeable Difference (JND) test to evaluate the subjective quality. Volunteers are asked to choose the more realistic image from a pair of ground-truth and generated images. We report the fooling rate as the evaluation result.


Figure 4. The qualitative comparisons with several state-of-the-art models including Def-GAN [24], VU-Net [2], Pose-Attn [38], and Intr-Flow [14]. The left part shows the results on the Fashion dataset. The right part shows the results on the Market-1501 dataset.

Network Implementation and Training Details. Auto-encoder structures are employed to design our F and G. The residual block is used as the basic component of these models. The kernel prediction network M is designed as a fully connected net. We implement this network with convolutional layers, which allows kl to be obtained for different locations l in parallel. We train our model using 256 × 256 images for the Fashion dataset. Two local attention blocks are used for feature maps with resolutions of 32 × 32 and 64 × 64; the extracted local patch sizes are 3 and 5, respectively. For Market-1501, we use 128 × 64 images with a single local attention block at the feature maps with resolution 32 × 16, and the extracted patch size is 3. We train our model in stages: the Flow Field Estimator is first trained to generate flow fields, and then the whole model is trained in an end-to-end manner. We adopt the ADAM optimizer with a learning rate of 10−4. The batch size is set to 8 for all experiments. More details can be found in Section D.
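
The resolution-dependent choices above can be summarized as a small configuration; the dictionary below simply restates them and is not an interface of the released code.

```python
# Settings restated from the text above (names are illustrative only).
TRAINING = {"optimizer": "adam", "lr": 1e-4, "batch_size": 8}

CONFIGS = {
    "deepfashion": {
        "image_size": (256, 256),
        "attention_resolutions": [(32, 32), (64, 64)],   # two local attention blocks
        "patch_sizes": [3, 5],
    },
    "market1501": {
        "image_size": (128, 64),
        "attention_resolutions": [(32, 16)],              # single local attention block
        "patch_sizes": [3],
    },
}
```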

4.2. Comparisons

We compare our method with several state-of-the-art methods including Def-GAN [24], VU-Net [2], Pose-Attn [38] and Intr-Flow [14]. The quantitative evaluation results are shown in Table 1. For the Market-1501 dataset, we follow the previous work [18] and calculate the mask-LPIPS to alleviate the influence of the background. It can be seen that our model achieves competitive results on both datasets, which means that our model can generate realistic results with fewer perceptual reconstruction errors.

As the objective metrics may not be sensitive to some artifacts, their results may mismatch with actual subjective perception. Therefore, we implement a just noticeable difference test on Amazon Mechanical Turk (MTurk). This experiment requires volunteers to choose the more realistic image from image pairs of real and generated images. The test is performed over 800 images for each model and dataset. Each image is compared 5 times by different volunteers. The evaluation results are shown in Table 1. It can be seen that our model achieves the best result on the challenging Fashion dataset and competitive results on the Market-1501 dataset.

Typical results of the different methods are provided in Figure 4. For the Fashion dataset, VU-Net and Pose-Attn struggle to generate complex textures since these models lack efficient spatial transformation blocks. Def-GAN defines local affine transformation components (e.g. arms and legs). This model can generate correct textures.

              Flow-Based   Content-aware Sampling   FID      LPIPS
Baseline      N            -                        16.008   0.2473
Global-Attn   N            -                        18.616   0.2575
Bi-Sample     Y            N                        12.143   0.2406
Full Model    Y            Y                        10.573   0.2341

Table 2. The evaluation results of the ablation study.

However, the pre-defined affine transformations are not sufficient to represent complex spatial variance, which limits the performance of the model. The flow-based model Intr-Flow is able to generate vivid textures for front-pose images. However, it may fail to generate realistic results for side-pose images due to the requirement of generating full-resolution flow fields. Meanwhile, this model needs 3D human models to generate the ground-truth flow fields for training. Our model regresses flow fields in an unsupervised manner and can generate realistic images. Not only the global pattern but also details such as the lace of clothes and shoelaces are reconstructed. For the Market-1501 dataset, our model generates correct poses with vivid backgrounds. Artifacts can be found in the results of the competitors, such as the sharp edges in Pose-Attn and the halo effects in Def-GAN.

The numbers of model parameters are also provided in Table 1 to evaluate the computational complexity. Thanks to our efficient attention blocks, our model does not require a large number of convolution layers. Thus, we achieve high performance with less than half of the parameters of the competitors.

4.3. Ablation Study

In this subsection, we train several ablation models to verify our assumptions and evaluate the contribution of each component.

Baseline. Our baseline model is an auto-encoder convolutional network. We do not use any attention blocks in this model. The images xs, pt, ps are directly concatenated as the model inputs.

Global Attention Model (Global-Attn). The Global-Attn model is designed to compare the global attention block with our local attention block. We use a network architecture similar to our Target Image Generator G for this model. The local attention blocks are replaced by global attention blocks, where the attention coefficients are calculated from the similarities between the source features fs and target features ft.

Bilinear Sampling Model (Bi-Sample). The Bi-Sample model is designed to evaluate the contribution of our content-aware sampling method described in Section 3.2. Both the Flow Field Estimator and the Target Image Generator are employed in this model. However, we use bilinear sampling as the sampling method in our local attention blocks.

Full Model (Ours). We use both the flow-based local-attention blocks and the content-aware sampling method in this model.


Figure 5. Qualitative results of the ablation study.


Figure 6. The visualization results of different attention modules. The red rectangles indicate the target locations. The heat maps show the attention coefficients. Blue represents low weights.


The evaluation results of the ablation study are shown in Table 2. Compared with the Baseline model, the performance of the Global-Attn model is degraded, which means that an unreasonable attention block cannot efficiently transform the information. Improvements are obtained by using flow-based methods such as the Bi-Sample model and our Full model, which force the attention coefficient matrix to be a sparse matrix. However, the Bi-Sample model uses a pre-defined sampling method with a limited sampling receptive field, which may lead to unstable training. Our Full model uses a content-aware sampling operation with an adjustable receptive field, which brings a further performance gain.

A subjective comparison of these ablation models can be found in Figure 5. It can be seen that the Baseline and Global-Attn models can generate correct structures. However, the textures of the source images cannot be well maintained, because these models generate images by first extracting global features and then propagating the information to specific locations, which leads to the loss of details. The flow-based methods spatially transform the features and are able to reconstruct vivid details. However, the Bi-Sample model uses the pre-defined bilinear sampling method and cannot find the exact sampling locations, which leads to artifacts in the final results.

We further provide the visualization of the attention maps in Figure 6. It can be seen that the Global-Attn model struggles to exclude irrelevant information. Therefore, the extracted features are hard to use to generate specific textures. The Bi-Sample model assigns a local patch for each generated location. However, incorrect features are often flowed due to the limited sampling receptive field. Our Full model, using the content-aware sampling method, can flexibly change the sampling weights and avoid these artifacts.

5. Application to Other Tasks

In this section, we demonstrate the versatility of our global-flow local-attention module. Since our model does not require any additional information other than images and structure guidance, it can be flexibly applied to tasks requiring spatial transformation. Two example tasks are shown as follows.

View Synthesis. View synthesis requires generating novel views of objects or scenes based on arbitrary input views. Since the appearances of different views are highly correlated, the existing information can be reassembled to generate the targets. We directly train our model on the ShapeNet dataset [1] and generate novel target views from a single input view. The results can be found in Figure 7. We provide the results of appearance flow as a comparison. It can be seen that appearance flow struggles to generate occluded content, as it warps image pixels instead of features. Our model generates reasonable results.

Image Animation. Given an input image and a guidance video sequence depicting the structure movements, the image animation task requires generating a video containing the specified movements. This task can be solved by spatially moving the appearance of the sources.

Figure 7. Qualitative results of the view synthesis task. We show the results of our model and the appearance flow [37] model. Click on the image to start the animation in a browser.

Figure 8. Qualitative results of the image animation task. Our model generates the result video using a reference image and edge guidance. Click on the image to play the video clip in a browser.

We train our model with the real videos in the FaceForensics dataset [22]. This dataset contains 1000 videos of news briefings from different reporters. The face regions are cropped for this task. We use edge maps as the structure guidance. For each frame, the input source frame and the previously generated n frames are used as the references. The flow fields are calculated for each reference. We also employ an additional discriminator to capture the distributions of the residuals between video frames. The results can be found in Figure 8. It can be seen that our model generates realistic result videos with vivid movements. More applications can be found in Section C.

6. Conclusion

In this paper, we propose a global-flow local-attention module for person image generation. Our model can spatially transform useful information from sources to targets. In order to train the model to reasonably transform and reassemble the inputs at the feature level in an unsupervised manner, we analyze the specific reasons causing instability in training and propose a framework to solve these problems. Our model first extracts global correlations between sources and targets to obtain flow fields. Then, local attention is used to sample source features in a content-aware manner. Experiments show that our model can generate target images with correct poses while maintaining details. In addition, the ablation study shows that our improvements help the network find more reasonable sampling positions. Finally, we show that our model can be easily extended to address other spatial deformation tasks such as view synthesis and video animation.

References

[1] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[2] Patrick Esser, Ekaterina Sutter, and Bjorn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857-8866, 2018.

[3] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Hausser, Caner Hazırbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.

[4] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[6] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R. Scott. Clothflow: A flow-based model for clothed person generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 10471-10480, 2019.

[7] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.

[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626-6637, 2017.

[9] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. arXiv preprint arXiv:1904.11491, 2019.

[10] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017-2025, 2015.

[11] Wei Jiang, Weiwei Sun, Andrea Tagliasacchi, Eduard Trulls, and Kwang Moo Yi. Linearized multi-sampling for differentiable image transformation. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694-711. Springer, 2016.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[14] Yining Li, Chen Huang, and Chen Change Loy. Dense intrinsic appearance flow for human pose transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[15] Chen-Hsuan Lin and Simon Lucey. Inverse compositional spatial transformer networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2568-2576, 2017.

[16] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 5904-5913, 2019.

[17] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096-1104, 2016.

[18] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, pages 406-416, 2017.

[19] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[20] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4161-4170, 2017.

[21] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H. Li, Shan Liu, and Ge Li. Structureflow: Image inpainting via structure-aware appearance flow. In IEEE International Conference on Computer Vision (ICCV), 2019.

[22] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.

[23] Aliaksandr Siarohin, Stephane Lathuiliere, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2377-2386, 2019.

[24] Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3408-3416, 2018.

[25] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568-576, 2014.

[26] Sijie Song, Wei Zhang, Jiaying Liu, and Tao Mei. Unsupervised person image generation with semantic parsing transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2357-2366, 2019.

[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.

[28] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[30] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.

[31] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794-7803, 2018.

[32] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505-5514, 2018.

[33] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[34] Haoyang Zhang and Xuming He. Deep free-form deformation network for object-mask registration. In Proceedings of the IEEE International Conference on Computer Vision, pages 4251-4259, 2017.

[35] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.

[36] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116-1124, 2015.

[37] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286-301. Springer, 2016.

[38] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2347-2356, 2019.

[39] Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the International Conference on Computer Vision (ICCV), October 2017.

A. Warping at the Feature Level

Networks easily become stuck in bad local minima when warping the inputs at the feature level using flow-based methods with the traditional warping operation. Two main problems are pointed out in our paper. In this section, we further explain these reasons.

Figure A.9 shows the warping process of the traditional bilinear sampling. For each location l = (x, y) in the output features f_out, a sampling position is assigned by the offsets in the flow fields w_l = (∆x, ∆y). Then, the output feature $f_{out}^{x,y}$ is obtained by sampling the local regions of the input features f_in:

$$\begin{aligned} f_{out}^{x,y} = \; & (\lceil\Delta y\rceil - \Delta y)(\lceil\Delta x\rceil - \Delta x)\, f_{in}^{x',y'} + (\Delta y - \lfloor\Delta y\rfloor)(\Delta x - \lfloor\Delta x\rfloor)\, f_{in}^{x'+1,y'+1} \\ & + (\Delta y - \lfloor\Delta y\rfloor)(\lceil\Delta x\rceil - \Delta x)\, f_{in}^{x',y'+1} + (\lceil\Delta y\rceil - \Delta y)(\Delta x - \lfloor\Delta x\rfloor)\, f_{in}^{x'+1,y'} \end{aligned} \tag{A.16}$$

where ⌈·⌉ and ⌊·⌋ represent rounding up and rounding down, respectively, and the location (x′, y′) = (x + ⌊∆x⌋, y + ⌊∆y⌋). Then, we can obtain the backward gradients. The gradient with respect to the flow fields is

$$\frac{\partial f_{out}^{x,y}}{\partial \Delta x} = (\lceil\Delta y\rceil - \Delta y)\left(f_{in}^{x'+1,y'} - f_{in}^{x',y'}\right) + (\Delta y - \lfloor\Delta y\rfloor)\left(f_{in}^{x'+1,y'+1} - f_{in}^{x',y'+1}\right) \tag{A.17}$$

The gradient $\partial f_{out}^{x,y} / \partial \Delta y$ can be obtained in a similar way. The gradients of the input features can be written as

$$\frac{\partial f_{out}^{x,y}}{\partial f_{in}^{x',y'}} = (\lceil\Delta y\rceil - \Delta y)(\lceil\Delta x\rceil - \Delta x) \tag{A.18}$$

The other terms can be obtained in a similar way.

Let us first explain the first problem. Different from image pixels, the image features change during the training process. The flow fields need reasonable input features to obtain correct gradients. As shown in Equation A.17, their gradients are calculated from the difference between adjacent features. If the input features are meaningless, the network cannot obtain correct flow fields. Meanwhile, according to Equation A.18, the gradients of the input features are calculated from the offsets; they cannot obtain reasonable gradients without correct flow fields. In the worst case, where the model only uses the warped results as output and does not use any skip connections, the warp operation stops the gradients from back-propagating.

Another problem is caused by the limited receptive field of bilinear sampling. Even if we obtain meaningful input features f_in by pre-training, we still cannot obtain stable gradient propagation. According to Equation A.17, the gradient of the flow fields is calculated from the difference between adjacent features. However, since adjacent features are extracted from adjacent image patches, they often have strong correlations (i.e. $f_{in}^{x',y'} \approx f_{in}^{x',y'+1}$). Therefore, the gradients may be small at most positions. Meanwhile, large motions are hard to capture.

Figure A.9. Bilinear sampling. For each point of the right feature map (target feature map), the flow field assigns a sampling location in the left feature map (source feature map). Then, the sampled feature is calculated from its nearest neighbors by bilinear interpolation.
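
A toy numerical illustration of the argument above (not part of the paper's code): for a feature patch whose neighbouring values are strongly correlated, the bilinear-sampling gradients of Eq. (A.17) with respect to the flow offsets are nearly zero, even though the offsets themselves are far from correct.

```python
import torch

def bilinear_sample(f_in, x, y, dx, dy):
    """Single-location bilinear sampling following Eq. (A.16); f_in has shape (H, W)."""
    sx = int(x + torch.floor(dx))          # x'
    sy = int(y + torch.floor(dy))          # y'
    ax = dx - torch.floor(dx)              # fractional part of dx
    ay = dy - torch.floor(dy)              # fractional part of dy
    return ((1 - ay) * (1 - ax) * f_in[sy, sx]
            + ay * ax * f_in[sy + 1, sx + 1]
            + ay * (1 - ax) * f_in[sy + 1, sx]
            + (1 - ay) * ax * f_in[sy, sx + 1])

# Adjacent features extracted from adjacent image patches are nearly identical here,
# so the gradients w.r.t. the offsets (Eq. A.17) are tiny and training stalls.
f_in = torch.tensor([[1.00, 1.01, 1.02],
                     [1.00, 1.01, 1.02],
                     [1.00, 1.01, 1.02]])
dx = torch.tensor(0.4, requires_grad=True)
dy = torch.tensor(0.3, requires_grad=True)
out = bilinear_sample(f_in, x=1, y=1, dx=dx, dy=dy)
out.backward()
print(dx.grad, dy.grad)   # roughly 0.01 and 0.0: almost no learning signal
```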

B. Analysis of the Regularization Loss

In our paper, we mention that the deformations of image neighborhoods are highly correlated and propose a regularization loss to exploit this relationship. In this section, we further discuss this regularization loss. Our loss is based on an assumption: although the deformations of whole images are complex, the deformations of local regions such as arms and clothes are always simple, and these deformations can be modeled by affine transformations. Based on this assumption, our regularization loss penalizes local regions where the transformation is not an affine transformation. Figure B.10 gives an example. Our regularization loss first extracts the local 2D coordinate matrices N(ct, l) and N(cs, l). Then we estimate the affine transformation parameters Al. Finally, the error is calculated as the regularization loss. Our loss assigns large errors to local regions that do not conform to the affine transformation assumption, thereby forcing the network to change the sampling regions.

We provide an ablation study to show the effect of the regularization loss. A model is trained without using the regularization term. The results are shown in Figure B.11. It can be seen that with our regularization loss the flow fields are smoother and unnecessary jumps are avoided. We use the obtained flow fields to warp the source images directly to show the sampling correctness. It can be seen that without the regularization loss, incorrect sampling regions are assigned due to the sharp jumps of the flow fields. Relatively good results are obtained by using the affine assumption prior.


Figure B.10. Our regularization loss assigns large errors to local regions where the transformation is not an affine transformation, thereby forcing the network to change the sampling locations.


Figure B.11. The flow fields of models trained with and without the regularization loss.

C. Additional Results

C.1. Additional Comparisons with Existing Works

We provide additional comparison results in this section. The qualitative results are shown in Figure C.12.


Figure C.12. The qualitative comparisons with several state-of-the-art models including Def-GAN [24], VU-Net [2], Pose-Attn [38], and Intr-Flow [14] over the DeepFashion [17] and Market-1501 [36] datasets.

C.2. Additional Results of View Synthesis

We provide additional results of the view synthesis task in Figure C.13 and Figure C.14.

Figure C.13. Qualitative results of the view synthesis task. For each group, we show the results of Appearance Flow [37], the results of our model, and ground-truth images, respectively. The top left image is the input source image. The other images are the generated results and ground-truth images. Click on the rightmost image to start the animation in a browser.

C.3. Additional Results of Image Animation

We provide additional results of the image animation task in Figure C.15.

Figure C.14. Qualitative results of the view synthesis task. For each group, we show the results of Appearance Flow [37], the results of our model, and ground-truth images, respectively. The top left image is the input source image. The other images are the generated results and ground-truth images. Click on the rightmost image to start the animation in a browser.

Figure C.15. Qualitative results of the image animation task. For each row, the leftmost image is the source image. The others are generated images. Click on the rightmost image to start the animation in a browser.

D. Implementation Details

The auto-encoder structure is employed to design our network. We use the residual blocks shown in Figure D.16 to build our model. Each convolutional layer is followed by instance normalization [28]. We use Leaky-ReLU as the activation function in our model. Spectral normalization [19] is employed in the discriminator to alleviate the notorious training instability of generative adversarial networks. The architecture of our model is shown in Figure D.17. Note that since the images of the Market-1501 dataset are low-resolution (128 × 64), we only use one local attention block, at the feature maps with resolution 32 × 16. We design the kernel prediction net M in the local attention block as a fully connected network. The extracted local patches Nn(fs, l + wl) and Nn(ft, l) are concatenated as the input, and the output of the network is kl. Since it needs to predict attention kernels kl for all locations l in the feature maps, we use a convolutional layer to implement this network, which takes advantage of the parallel computing power of GPUs.

We train our model in stages. The Flow Field Estimator is first trained to generate flow fields. Then we train the whole model in an end-to-end manner. We adopt the ADAM optimizer. The learning rate of the generator is set to 10−4, and the discriminator is trained with a learning rate one-tenth that of the generator. The batch size is set to 8 for all experiments. The loss weights are set to λc = 5, λr = 0.0025, λℓ1 = 5, λa = 2, λp = 0.5, and λs = 500.


Figure D.16. The components used in our networks: (a) ConvDown, (b) ResBlock, and (c) ResBlockUp.


Figure D.17. The architecture of our network: (a) Target Image Generator and (b) Flow Field Estimator.