
International Journal of Computer Vision
https://doi.org/10.1007/s11263-019-01181-5

Fine-Grained Multi-human Parsing

Jian Zhao1,2 · Jianshu Li1 · Hengzhu Liu2 · Shuicheng Yan1,3 · Jiashi Feng1

Received: 27 July 2018 / Accepted: 26 April 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, such as group behavior analysis, person re-identification, e-commerce, media editing, video surveillance, autonomous driving and virtual reality, etc. To perform well, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we first present a new large-scale database “Multi-human Parsing (MHP v2.0)” for algorithm development and evaluation to advance the research on understanding humans in crowded scenes. MHP v2.0 contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels and 16 dense pose key point labels, involving 2–26 persons per image captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, including MHP v1.0, PASCAL-Person-Part and Buffy. NAN serves as a strong baseline to shed light on generic instance-level semantic part prediction and drive the future research on multi-human parsing. With the above innovations and contributions, we have organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-human Parsing and Pose Estimation Challenge. These contributions together significantly benefit the community. Code and pre-trained models are available at https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.

Keywords Multi-human parsing · Benchmark dataset · Nested adversarial learning · Generative Adversarial Networks

Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.

✉ Jian Zhao
[email protected]
https://zhaoj9014.github.io/

Jianshu Li
[email protected]

Hengzhu Liu
[email protected]

Shuicheng Yan
[email protected]

Jiashi Feng
[email protected]

1 National University of Singapore, Singapore, Singapore

2 National University of Defense Technology, Changsha, China

3 Qihoo 360 AI Institute, Beijing, China

1 Introduction

One of the primary goals of intelligent human-computer interaction is understanding the humans in visual scenes. It involves several perceptual tasks including detection, i.e. localizing different persons at a coarse, bounding box level (Fig. 1a), instance segmentation, i.e. labeling each pixel of each person uniquely (Fig. 1b), and human parsing, i.e. decomposing persons into their semantic categories (Fig. 1c). Recently, deep learning based methods have achieved remarkable success in these perceptual tasks thanks to the availability of plentiful annotated images for training and evaluation purposes (Dollar et al. 2012; Everingham et al. 2015; Lin et al. 2014; Gong et al. 2017).

Though exciting, current progress is still far from the ultimate goal of visually understanding humans.


Fig. 1 Illustration of motivation. Existing efforts on human-centric analysis have been devoted to a detection (localizing different persons at a coarse, bounding box level), b instance segmentation (labeling each pixel of each person uniquely) or c human parsing (decomposing persons into their semantic categories). We focus on d multi-human parsing (parsing body parts and fashion items at the instance level), which aligns better with many real-world applications. We introduce a new large-scale, richly-annotated Multi-human Parsing (MHP v2.0) dataset consisting of images with various viewpoints, poses, occlusion, human interactions and background. We further propose a novel deep Nested Adversarial Network (NAN) model for solving the challenging multi-human parsing problem effectively and efficiently. Best viewed in color (Color figure online)

As Fig. 1 shows, previous efforts on understanding humans in visual scenes either only consider coarse information (Dollar et al. 2012; Everingham et al. 2015) or are agnostic to different instances (Lin et al. 2014; Gong et al. 2017). In the real-world scenarios, it is more likely that there simultaneously exist multiple persons, with various human interactions, poses and occlusion. Thus, it is more practically demanded to parse human body parts and fashion items at the instance level, which is recently defined as the multi-human parsing task (Li et al. 2017c) (Figs. 1d, 2). The differences between perceptual tasks are visually compared in Fig. 3. Multi-human parsing enables more detailed understanding of humans in crowded scenes and aligns better with many real-world applications, such as group behavior analysis (Gan et al. 2016), person re-identification (Zhao et al. 2013), e-commerce (Turban et al. 2002), media editing (Xu et al. 2016), video surveillance (Collins et al. 2000), autonomous driving (Cordts et al. 2016) and virtual reality (Lin et al. 2016). However, as explained above, the existing benchmark datasets (Dollar et al. 2012; Everingham et al. 2015; Lin et al. 2014; Gong et al. 2017) are not suitable for such a new task. Even though Li et al. (2017c) proposed a preliminary Multi-human Parsing (MHP v1.0) dataset, it only contains 4980 images annotated with 18 semantic labels. In this work, we propose a new large-scale benchmark “Multi-human Parsing (MHP v2.0)”, aiming to push the frontiers of multi-human parsing research towards holistically understanding humans in crowded scenes. The data in MHP v2.0 cover wide variability and complexity w.r.t. viewpoints, poses, occlusion, human interactions and background. It in total includes 25,403 human images with 58 semantic category pixel-wise annotations and 16 dense pose key point labels.

We further propose a novel deep Nested Adversarial Network (NAN) model for solving the challenging multi-human parsing problem. Unlike most existing methods (Li et al. 2017a, c; Jiang and Grauman 2016) which rely on separate stages of instance localization, human parsing and result refinement, the proposed NAN parses semantic categories and differentiates different person instances simultaneously in an effective and time-efficient manner. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. Each sub-task is simpler than the original multi-human parsing task, and is more easily addressed by the corresponding sub-net. Unlike many multi-task learning applications, in our method the sub-nets depend on each other, forming a causal nest by dynamically boosting each other through an adversarial strategy (see Figs. 10, 11), which is hence called a “nested adversarial learning” structure. Such a structure enables effortless gradient BackPropagation (BP) in NAN such that it can be trained in a holistic, end-to-end way, which is favorable to both accuracy and speed. We conduct qualitative and quantitative experiments on the MHP v2.0 dataset proposed in this work, as well as the MHP v1.0 (Li et al. 2017c), PASCAL-Person-Part (Chen et al. 2014) and Buffy (Vineet et al. 2011) benchmark datasets. The results demonstrate the superiority of NAN on multi-human parsing over the state-of-the-arts.

A preliminary version of this work was published in ACM Multimedia (ACM MM) 2018 and awarded the Best Student Paper (Zhao et al. 2018). We extend it in numerous ways: (1) We augment the content to cover sufficient details and make the paper more self-contained; we also improve and polish the writing of these sections for a clearer and more comprehensive presentation. (2) We further tune our whole pipeline and improve its performance and generalizability for multi-human parsing, instance-agnostic parsing and instance-aware clustering. (3) We add abundant details on the annotation tool, evaluation metrics, parameter settings, network architecture and training procedure. The dataset, annotation tools, and source codes for NAN and the evaluation metrics have been made publicly available.1

(4) We add experiments to reveal the underlying challenges of the MHP v2.0 dataset and how NAN works, including more qualitative and quantitative analyses on the MHP v2.0, MHP v1.0, PASCAL-Person-Part and Buffy datasets, a failure case study, etc.

1 https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.


Fig. 2 Example of multi-human parsing. Given an image containing several persons, the multi-human parsing technique separates different persons and identifies each semantic category clearly. Best viewed in color (Color figure online)

Fig. 3 Illustration of comparisons between perceptual tasks. Person detection only locates bounding boxes of the persons; instance segmentation refines the bounding boxes to foreground masks, but it can only provide the silhouette information without human part details; human parsing is agnostic of human instances and disadvantaged for human-centric analysis. Multi-human parsing provides parsing and localization at the same time. With parsing, fine-grained semantic information for every pixel can be achieved. With localization, the body parts and fashion items from different persons can be well separated. Best viewed in color (Color figure online)

Table 1 Statistics for publicly available human parsing datasets

| Datasets | Instance aware? | # Total | # Training | # Validation | # Testing | # Category |
|---|---|---|---|---|---|---|
| Buffy (Vineet et al. 2011) | ✓ | 748 | 452 | – | 296 | 13 |
| Fashionista (Yamaguchi et al. 2012) | ✗ | 685 | 456 | – | 229 | 56 |
| PASCAL-Person-Part (Chen et al. 2014) | ✗ | 3533 | 1716 | – | 1817 | 7 |
| ATR (Liang et al. 2015b) | ✗ | 17,700 | 16,000 | 700 | 1000 | 18 |
| LIP (Gong et al. 2017) | ✗ | 50,462 | 30,462 | 10,000 | 10,000 | 20 |
| MHP v1.0 (Li et al. 2017c) | ✓ | 4980 | 3000 | 1000 | 980 | 19 |
| MHP v2.0 | ✓ | 25,403 | 15,403 | 5000 | 5000 | 59 |


Our contributions are summarized as follows.

– We propose a new large-scale benchmark and evaluation server to advance understanding of humans in crowded scenes, which contains 25,403 images with 58 fine-grained semantic category pixel-wise annotations and 16 dense pose key point labels.


– We propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing, which serves as a strong baseline to inspire more future research efforts on this task.

– Comprehensive evaluations on the MHP v2.0 dataset proposed in this work, as well as the MHP v1.0, PASCAL-Person-Part and Buffy benchmark datasets verify the superiority of NAN on understanding humans in crowded scenes over the state-of-the-arts.

With the above innovations and contributions, we have organized the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018)2 and the Fine-Grained Multi-human Parsing Challenge.3 These contributions together significantly benefit the community.

2 Related Work

Human Parsing Datasets The statistics of popular publicly available datasets for human parsing are summarized in Table 1. The Buffy dataset (Vineet et al. 2011) was released in 2011 for human parsing and instance segmentation. It contains only 748 images annotated with 13 semantic categories. The Fashionista dataset (Yamaguchi et al. 2012) was released in 2012 for human parsing, containing limited images annotated with 56 fashion categories. The PASCAL-Person-Part dataset (Chen et al. 2014) was initially annotated by Chen et al. (2014) from the PASCAL-VOC-2010 dataset (Everingham et al. 2010). Chen et al. (2016) extended it for human parsing with 7 coarse body part labels. The ATR dataset (Liang et al. 2015b) was released in 2015 for human parsing with a large number of images annotated with 18 semantic categories. The LIP (Gong et al. 2017) dataset further extended ATR (Liang et al. 2015b) by cropping person instances from Microsoft COCO (Lin et al. 2014). It is a large-scale human parsing dataset with densely pixel-wise annotations of 20 semantic categories. But it has two limitations. (1) Despite the large data size, it contains limited semantic category annotations, which restricts the fine-grained understanding of humans in visual scenes. (2) In LIP, only a small proportion of images contain multiple persons with interactions. Such an instance-agnostic setting severely deviates from reality. Even in the MHP v1.0 dataset proposed by Li et al. (2017c) for multi-human parsing, only 4980 images are included and annotated with 18 semantic labels.

2 https://vuhcs.github.io/#portfolio.
3 http://lv-mhp.github.io/.

Comparatively, our MHP v2.0 dataset contains 25,403 elaborately annotated images with 58 fine-grained semantic part labels and 16 dense pose key point labels. It is the largest and most comprehensive multi-human parsing dataset to date, to our best knowledge. Visual comparisons between LIP, MHP v1.0 and our MHP v2.0 are provided in Fig. 4.

Human Parsing Approaches Recently, many research efforts have been devoted to human parsing (Liang et al. 2015a; Gong et al. 2017; Liu et al. 2017; Zhao et al. 2017; He et al. 2017; De Brabandere et al. 2017; Li et al. 2017a, c; Jiang and Grauman 2016) due to its wide range of potential applications. For example, Liang et al. (2015a) proposed a proposal-free network for instance segmentation by directly predicting the instance numbers of different categories and the pixel-level information. Gong et al. (2017) proposed a self-supervised structure-sensitive learning approach, which imposes human pose structures on parsing results without resorting to extra supervision. Liu et al. (2017) proposed a single frame video parsing method which integrates frame parsing, optical flow estimation and temporal fusion into a unified network. Zhao et al. (2017) proposed a self-supervised neural aggregation network, which learns to aggregate the multi-scale features and incorporates a self-supervised joint loss to ensure the consistency between parsing and pose. He et al. (2017) proposed the Mask R-CNN, which is extended from Faster R-CNN (Ren et al. 2015) by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. De Brabandere et al. (2017) proposed to tackle instance segmentation with a discriminative loss function, operating at the pixel level, which encourages a convolutional network to produce a representation of the image that can be easily clustered into instances with a simple post-processing step. However, these methods either only consider coarse semantic information or are agnostic to different instances. To enable more detailed human-centric analysis, Li et al. (2017c) initially proposed the multi-human parsing task, which aligns better with the realistic scenarios. They also proposed a novel MH-Parser model as a reference method which generates parsing maps and instance masks simultaneously in a bottom-up fashion. Jiang and Grauman (2016) proposed a new approach to segment human instances and label their body parts using region assembly. Li et al. (2017a) proposed a framework with a human detector and a category-level segmentation module to segment the parts of objects at the instance level. These methods involve multiple separate stages for instance localization, human parsing and result refinement. In comparison, the proposed NAN produces accurate multi-human parsing results through a single forward-pass in a time-efficient manner without tedious pre- or post-processing.


Fig. 4 Annotation examples in our “Multi-human Parsing (MHP v2.0)” dataset and existing datasets. a Examples in LIP. LIP is restricted to an instance-agnostic setting and has limited semantic category annotations. b Examples in MHP v1.0. MHP v1.0 has lower scalability, variability and complexity, and only contains coarse labels. c Examples in our MHP v2.0. MHP v2.0 contains fine-grained semantic category labels with various viewpoints, poses, occlusion, interactions and background, matching better with reality. Best viewed in color (Color figure online)

3 Multi-human Parsing Benchmark

In this section, we introduce the “Multi-human Parsing” (MHP v2.0), a new large-scale dataset focusing on semantic understanding of humans in crowded scenes with several appealing properties. (1) It contains 25,403 elaborately annotated images with 58 fine-grained labels on body parts and fashion items plus one background label, as well as 16 dense pose key point labels, which is larger and more comprehensive than previous similar attempts (Vineet et al. 2011; Li et al. 2017c). (2) The images within MHP v2.0 are collected from real-world scenarios, involving humans with various viewpoints, poses, occlusion, interactions and resolution. (3) The background of images in MHP v2.0 is more complex and diverse than in previous datasets. Some examples are shown in Figs. 4, 6, 7 and 8. The MHP v2.0 dataset is expected to provide a new benchmark suitable for both multi-human parsing and pose estimation, together with a standard evaluation server where the test set will be kept secret to avoid overfitting.

3.1 Image Collection and Annotation

We manually specify some underlying relationships (e.g., family, couple, team, etc.) and possible scenes (e.g., sports, conferences, banquets, etc.) to ensure the diversity of returned results.

Fig. 5 Annotation tool for multi-human parsing. Best viewed in color (Color figure online)

Based on any one of these specifications, corresponding multi-human images are located by performing Internet searches over Creative Commons licensed imagery. For each identified image, the contained human number and the corresponding URL are stored in a spreadsheet. Automated scraping software is used to download the multi-human imagery and store all relevant information in a relational database.


Fig. 6 Illustration of the semantic category definitions and annotation examples of MHP v2.0. Best viewed in color (Color figure online)

Fig. 7 Annotation examples for multi-human pose estimation in MHP v2.0. Best viewed in color (Color figure online)

Moreover, a pool of images containing clearly visible persons with interactions and rich fashion items is also constructed from the existing human-centric datasets (Zhang et al. 2015, 2018; Chu et al. 2015; Sapp and Taskar 2013; Klare et al. 2015)4 to augment and complement the Internet scraping results.

After curating the imagery, manual annotation is conducted by professional data annotators, which includes two distinct tasks.

4 PASCAL-VOC-2012 (Everingham et al. 2015) and Microsoft COCO (Lin et al. 2014) are not included due to the limited percentage of crowd-scene images with fine details of persons.

The first task is manually counting the number of foreground persons and duplicating each image into several copies according to the count number. Each duplicated image is marked with the image ID, the contained person number and a self-index. The second task is assigning the fine-grained pixel-wise label to each semantic category for each person instance. We implement an annotation tool and generate multi-scale superpixels of images based on Arbelaez et al. (2011) to speed up the annotation. See Fig. 5 for an example. Each multi-human image contains at least two instances. The annotation for each instance is done in a left-to-right order, corresponding to the duplicated image with the self-index from beginning to end.


Fig. 8 Training, validation and testing examples in MHP v2.0. Best viewed in color (Color figure online)


Fig. 9 Dataset statistics. Best viewed in color (Color figure online)

For each instance, 58 semantic categories are defined and annotated, including cap/hat, helmet, face, hair, left-arm, right-arm, left-hand, right-hand, protector, bikini/bra, jacket/windbreaker/hoodie, t-shirt, polo-shirt, sweater, singlet, torso-skin, pants, shorts/swim-shorts, skirt, stockings, socks, left-boot, right-boot, left-shoe, right-shoe, left-highheel, right-highheel, left-sandal, right-sandal, left-leg, right-leg, left-foot, right-foot, coat, dress, robe, jumpsuits, other-full-body-clothes, headwear, backpack, ball, bats, belt, bottle, carrybag, cases, sunglasses, eyewear, gloves, scarf, umbrella, wallet/purse, watch, wristband, tie, other-accessaries, other-upper-body-clothes and other-lower-body-clothes, as illustrated in Fig. 6. Each instance has a complete set of annotations whenever the corresponding category appears in the current image. When annotating one instance, the others are regarded as background, as shown in Fig. 5. Thus, the resulting annotation set for each image consists of N instance-level parsing masks, where N is the number of persons in the image. Moreover, 2D human poses with 16 dense key points (right-shoulder, right-elbow, right-wrist, left-shoulder, left-elbow, left-wrist, right-hip, right-knee, right-ankle, left-hip, left-knee, left-ankle, head, neck, spine and pelvis) are annotated.

Each key point has a flag indicating whether it is visible (0), occluded (1) or out-of-image (2). Head and instance bounding boxes are also provided to facilitate multi-human pose estimation research. Some annotation examples regarding pose are visualized in Fig. 7.
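To make the annotation layout concrete, the following sketch shows one plausible way to represent a single pose record implied by the description above; the field names, placeholder values and file organization are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical layout of one pose annotation record; names and values are illustrative only.
KEYPOINT_NAMES = [
    "right-shoulder", "right-elbow", "right-wrist",
    "left-shoulder", "left-elbow", "left-wrist",
    "right-hip", "right-knee", "right-ankle",
    "left-hip", "left-knee", "left-ankle",
    "head", "neck", "spine", "pelvis",
]
VISIBLE, OCCLUDED, OUT_OF_IMAGE = 0, 1, 2  # per-keypoint visibility flags

pose_record = {
    "image_id": "00001",
    "instance_index": 1,                        # left-to-right order within the image
    "keypoints": {name: (0.0, 0.0, VISIBLE)     # (x, y, flag); dummy coordinates here
                  for name in KEYPOINT_NAMES},
    "head_bbox": (0, 0, 0, 0),                  # (x1, y1, x2, y2), dummy values
    "instance_bbox": (0, 0, 0, 0),
}
```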

After annotation, manual inspection is performed on all images and corresponding annotations to verify their correctness. In cases where annotations are erroneous, the information is manually rectified by 5 well-informed analysts. The whole work took around three months to accomplish by 25 professional data annotators.

3.2 Dataset Splits and Statistics

In total, there are 25,403 images in the MHP v2.0 dataset. Each image contains 2–26 person instances, with 3 on average. The resolution of the images ranges from 85 × 100 to 4511 × 6919, with 644 × 718 on average. We split the images into training, validation and testing sets. Following random selection, we arrive at a unique split consisting of 15,403 training and 5000 validation images with publicly available annotations, as well as 5000 testing images with annotations withheld for benchmarking purposes.


Some training, validation and testing examples in MHP v2.0 are shown in Fig. 8.

The statistics w.r.t. data distribution over the 59 semantic categories, the average semantic category number per image and the average instance number per image in the MHP v2.0 dataset are given in Fig. 9a–c, respectively. In general, face, arms and legs are the most remarkable parts of a human body. However, understanding humans in crowded scenes needs to analyze fine-grained details of each person of interest, including different body parts, clothes and accessories. We therefore define 11 body parts, and 47 clothes and accessories. Among these 11 body parts, we divide arms, hands, legs and feet into left and right sides for more precise analysis, which also increases the difficulty of the task. We define hair, face and torso-skin as the remaining three body parts, which can be used as auxiliary guidance for more comprehensive instance-level analysis. As for clothing categories, we have common clothes like coat, jacket/windbreaker/hoodie, sweater, singlet, pants, shorts/swim-shorts and shoes, confusing categories such as t-shirt versus polo-shirt, stockings versus socks, skirt versus dress and robe, and boots versus sandals and highheels, and infrequent categories such as cap/hat, helmet, protector, bikini/bra, jumpsuits, gloves and scarf. Furthermore, accessories like sunglasses, belt, tie, watch and bags are also taken into account, which are common but hard to predict, especially the small-scale ones.

To summarize, the pre-defined semantic categories and poses of MHP v2.0 include most body parts and joint key points, clothes and accessories of different styles for men, women and children in all seasons. The images in MHP v2.0 contain diverse instance numbers, viewpoints, poses, occlusion, interactions and background complexities. The dataset thus aligns better with real-world scenarios and serves as a more realistic benchmark for human-centric analysis, which would push the research frontiers of fine-grained crowded scene understanding regarding humans.

4 Deep Nested Adversarial Network

As shown in Figs. 10 and 11, the proposed deep Nested Adversarial Network (NAN) model consists of three GAN-like sub-nets that jointly perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering end-to-end. NAN produces accurate multi-human parsing results through a single forward-pass in a time-efficient manner without tedious pre- or post-processing. We now present each component in detail.

4.1 Semantic Saliency Prediction

Large modality and interaction variations are the main challenge to multi-human parsing and also the key obstacle to learning a well-performing human-centric analysis model.

Fig. 10 Flowchart of Deep Nested Adversarial Networks (NAN). NAN first separates foreground persons from the cluttered background, then parses the semantic categories. Finally, it clusters the semantic categories into different instances to obtain the final multi-human parsing output. The whole pipeline is end-to-end. Best viewed in color (Color figure online)

To address this problem, we propose to decompose the original task into three granularities and adaptively impose a prior on the specific process, each with the aid of a GAN-based sub-net. This reduces the training complexity and leads to better empirical performance with limited data, as discussed in detail in Sect. 4.4.

The first sub-net estimates semantic saliency maps to locate the most noticeable and eye-attracting human regions in images, which serves as a basic prior to facilitate further processing on humans, as illustrated in Fig. 10 top and Fig. 11 left. We formulate semantic saliency prediction as a binary pixel-wise labeling problem to segment out foreground versus background. Inspired by the recent success of Fully Convolutional Networks (FCNs) (Long et al. 2015) based methods on image-to-image applications (Li et al. 2017b; He et al. 2017), we leverage an FCN backbone (FCN-8s) as the generator $G_{1_{\theta_1}}: \mathbb{R}^{H \times W \times C} \mapsto \mathbb{R}^{H \times W \times C'}$ of NAN for semantic saliency prediction, where $\theta_1$ denotes the network parameters, and $H$, $W$, $C$ and $C'$ denote the image height, width, channel number and semantic category (i.e., foreground plus background) number, respectively.

Formally, let the input RGB image be denoted by $x$ and the semantic saliency map be denoted by $x'$. Then

$$x' := G_{1_{\theta_1}}(x). \qquad (1)$$

The key requirements for $G_1$ are that the semantic saliency map $x'$ should present indistinguishable properties compared with a real one (i.e., ground truth) in appearance while preserving the intrinsic contextually remarkable information.


Fig. 11 Deep Nested Adversarial Networks (NAN) for multi-human parsing. NAN consists of three GAN-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. Each sub-task is simpler than the original multi-human parsing task, and is more easily addressed by the corresponding sub-net. The sub-nets depend on each other, forming a causal nest by dynamically boosting each other via an adversarial strategy. Such a structure enables effortless gradient BackPropagation (BP) of NAN such that it can be trained in a holistic, end-to-end way. NAN produces accurate multi-human parsing results through a single forward-pass in a time-efficient manner without tedious pre- or post-processing. Best viewed in color (Color figure online)

To this end, we propose to learn $\theta_1$ by minimizing a combination of two losses:

$$\mathcal{L}_{G_{1_{\theta_1}}} = -\lambda_1 \mathcal{L}_{adv_1} + \lambda_2 \mathcal{L}_{ss}, \qquad (2)$$

where $\mathcal{L}_{adv_1}$ is the adversarial loss for refining realism and alleviating artifacts, $\mathcal{L}_{ss}$ is the semantic saliency loss for pixel-wise image labeling, and $\lambda$ denotes the weighting parameters among different losses.

$\mathcal{L}_{ss}$ is a pixel-wise cross-entropy loss calculated based on the binary pixel-wise annotations to learn $\theta_1$:

$$\mathcal{L}_{ss} = \mathcal{L}_{ss}(X'(\theta_1) \mid X), \qquad (3)$$

where $X$ and $X'$ are set representations of $x$ and $x'$, respectively, which applies equally to similar terms.

$\mathcal{L}_{adv_1}$ is proposed to narrow the gap between the distributions of generated and real results. To facilitate this process, we leverage a Convolutional Neural Network (CNN) backbone as the discriminator $D_{1_{\phi_1}}: \mathbb{R}^{H \times W \times C'} \mapsto \mathbb{R}^1$, which is very simple to avoid typical GAN tricks. We alternately optimize $G_{1_{\theta_1}}$ and $D_{1_{\phi_1}}$ to learn $\theta_1$ and $\phi_1$:

$$\mathcal{L}_{adv_1} = \arg\min_{\theta_1} \max_{\phi_1} \mathcal{L}_{adv_1}(K(\phi_1) \mid X'(\theta_1), X'_{GT}), \qquad (4)$$

where K denotes the binary real versus fake indicator.
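As a concrete illustration of the alternating optimization in Eqs. (2)–(4), the following minimal TensorFlow 2 sketch implements one training step for the first sub-net. The `generator` and `discriminator` models, their exact architectures and the data pipeline are assumptions; only the loss structure and the λ1/λ2 values reported in Sect. 5.1.1 come from the paper.

```python
# Minimal sketch (not the released code) of one alternating G1/D1 update.
# `generator` (FCN-8s-like) and `discriminator` (sigmoid-output CNN) are assumed
# to be tf.keras Models built elsewhere.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.5)
d_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.5)
lambda_adv1, lambda_ss = 0.01, 1.0  # lambda_1 and lambda_2 from Sect. 5.1.1

@tf.function
def train_step(image, saliency_gt_onehot, saliency_gt_labels):
    # Discriminator step: real (ground-truth) saliency maps vs. generated ones.
    with tf.GradientTape() as tape:
        fake = tf.nn.softmax(generator(image, training=True))
        d_real = discriminator(saliency_gt_onehot, training=True)
        d_fake = discriminator(fake, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    # Generator step: pixel-wise saliency loss plus the adversarial term of Eq. (2).
    with tf.GradientTape() as tape:
        logits = generator(image, training=True)
        l_ss = ce(saliency_gt_labels, logits)
        d_fake = discriminator(tf.nn.softmax(logits), training=True)
        l_adv = bce(tf.ones_like(d_fake), d_fake)   # try to fool the discriminator
        g_loss = lambda_ss * l_ss + lambda_adv1 * l_adv
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return g_loss, d_loss
```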

4.2 Instance-Agnostic Parsing

The second sub-net concatenates the information from the original RGB image with the semantic saliency prior as input and estimates a fine-grained instance-agnostic parsing map, which further serves as stronger semantic guidance from the global perspective to facilitate instance-aware clustering, as illustrated in Fig. 10 middle and Fig. 11 middle. We formulate instance-agnostic parsing as a multi-class dense classification problem to mask semantically consistent regions of body parts and fashion items. Inspired by the leading performance of the skip-net on recognition tasks (Wu et al. 2016; He et al. 2016), we modify a skip-net (WS-ResNet, Wu et al. 2016) into an FCN-based architecture as the generator $G_{2_{\theta_2}}: \mathbb{R}^{H \times W \times (C+C')} \mapsto \mathbb{R}^{H/8 \times W/8 \times C''}$ of NAN to learn a highly non-linear transformation for instance-agnostic parsing, where $\theta_2$ denotes the network parameters for the generator and $C''$ denotes the semantic category number. The prediction is downsampled by 8 for an accuracy versus speed trade-off. Contextual information from global and local regions compensates each other and naturally benefits human parsing. The hierarchical features within a skip-net are multi-scale in nature due to the increasing receptive field sizes, and are combined together via skip connections. Such a combined representation comprehensively maintains the contextual information, which is crucial for generating smooth and accurate parsing results.

Formally, let the instance-agnostic parsing map be denoted by $x''$. Then

$$x'' := G_{2_{\theta_2}}(x \cup x'). \qquad (5)$$

Similar to the first sub-net, we propose to learn $\theta_2$ by minimizing:

$$\mathcal{L}_{G_{2_{\theta_2}}} = -\lambda_3 \mathcal{L}_{adv_2} + \lambda_4 \mathcal{L}_{gp}, \qquad (6)$$

where $\mathcal{L}_{gp}$ is the global parsing loss for semantic part labelling.


$\mathcal{L}_{gp}$ is a standard pixel-wise cross-entropy loss calculated based on the multi-class pixel-wise annotations to learn $\theta_2$. $\theta_1$ is also slightly fine-tuned due to the hinged gradient back-propagation route within the nested structure:

$$\mathcal{L}_{gp} = \mathcal{L}_{gp}(X''(\theta_2, \theta_1) \mid X \cup X'(\theta_1)). \qquad (7)$$

$\mathcal{L}_{adv_2}$ is proposed to ensure the correctness and realism of the current phase and also the previous one for information flow consistency. To facilitate this process, we leverage the same CNN backbone as $D_{1_{\phi_1}}$ for the discriminator $D_{2_{\phi_2}}: \mathbb{R}^{H \times W \times (C'+C'')} \mapsto \mathbb{R}^1$, which is however learned separately. We alternately optimize $G_{2_{\theta_2}}$ and $D_{2_{\phi_2}}$ to learn $\theta_2$, $\phi_2$ and slightly5 fine-tune $\theta_1$:

$$\mathcal{L}_{adv_2} = \arg\min_{\theta_1,\theta_2} \max_{\phi_2} \mathcal{L}_{adv_2}(K(\phi_2) \mid X'(\theta_1) \cup X''(\theta_2,\theta_1),\; X'_{GT} \cup X''_{GT}). \qquad (8)$$

4.3 Instance-Aware Clustering

The third sub-net concatenates the information from the original RGB image with the semantic saliency and instance-agnostic parsing priors as input and estimates an instance-aware clustering map by associating each semantic parsing mask with one of the person instances in the scene, as illustrated in Fig. 10 bottom and Fig. 11 right. Inspired by the observation that a human glances at an image and instantly knows how many and where the objects are in the image, we formulate instance-aware clustering by inferring the instance number and the pixel-wise instance locations in parallel, and do not perform time-consuming region proposal generation. We modify the same backbone architecture as $G_{2_{\theta_2}}$ to incorporate two sibling branches as the generator $G_{3_{\theta_3}}: \mathbb{R}^{H/8 \times W/8 \times (C+C'+C'')} \mapsto \mathbb{R}^{H/8 \times W/8 \times M} \cup \mathbb{R}^1$ of NAN for location-sensitive learning, where $\theta_3$ denotes the network parameters for the generator and $M$ denotes the pre-defined instance location coordinate number. As multi-scale features integrating both global and local contextual information are crucial for increasing location prediction accuracy, we further augment the pixel-wise instance location prediction branch with a Multi-scale Fusion Unit (MSFU) to fuse shallow-, middle- and deep-level features, while using the feature maps downsampled by 8 concatenated with the feature maps from the first branch for instance number regression.
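A rough sketch of how such an MSFU could be assembled is given below, using the kernel and channel sizes later reported in Sect. 5.1.1 (three 3 × 3 convolutions with 128, 128 and 4 channels for scale adaption, a 1 × 1 × 4 aggregation convolution and a 1 × 1 × 4 location-regression convolution). This is one possible reading of the text, assuming the three inputs have already been resized to the common H/8 × W/8 grid; it is not the released implementation.

```python
# Illustrative Keras sketch of a Multi-scale Fusion Unit (MSFU); an assumption
# based on the textual description, not the authors' code.
from tensorflow.keras import layers

def msfu(shallow, middle, deep):
    """Fuse shallow-, middle- and deep-level feature maps (same spatial size)
    into a 4-channel pixel-wise instance location prediction."""
    s = layers.Conv2D(128, 3, padding="same", activation="relu")(shallow)  # scale adaption
    m = layers.Conv2D(128, 3, padding="same", activation="relu")(middle)
    d = layers.Conv2D(4, 3, padding="same", activation="relu")(deep)
    fused = layers.Concatenate()([s, m, d])
    fused = layers.Conv2D(4, 1, padding="same", activation="relu")(fused)   # aggregation
    return layers.Conv2D(4, 1, padding="same")(fused)                       # location regression
```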

Formally, let the pixel-wise instance location map be denoted by $\tilde{x}$ and the instance number be denoted by $n$. Then

$$\tilde{x} \cup n := G_{3_{\theta_3}}(x \cup x' \cup x''). \qquad (9)$$

5 The trainable parameters of each stage (each sub-net) are mainly learned through the losses of the corresponding stage. However, they can still be adjusted to some degree by the losses of their subsequent stages due to the nested structure during gradient BP.

We propose to learn $\theta_3$ by minimizing

$$\mathcal{L}_{G_{3_{\theta_3}}} = -\lambda_5 \mathcal{L}_{adv_3} + \lambda_6 \mathcal{L}_{pil} + \lambda_7 \mathcal{L}_{in}, \qquad (10)$$

where $\mathcal{L}_{pil}$ is the pixel-wise instance location loss for pixel-wise instance location regression and $\mathcal{L}_{in}$ is the instance number loss for instance number regression.

$\mathcal{L}_{pil}$ is a standard smooth-$\ell_1$ loss (Girshick 2015) calculated based on the foreground pixel-wise instance location annotations to learn $\theta_3$. Since a person instance can be identified by the top-left corner $(x_l, y_l)$ and bottom-right corner $(x_r, y_r)$ of its surrounding bounding box, for each pixel belonging to the person instance, the pixel-wise instance location vector is defined as $[x_l/w, y_l/h, x_r/w, y_r/h]$, where $w$ and $h$ are the width and height of the person instance for normalization, respectively. $\mathcal{L}_{in}$ is a standard $\ell_2$ loss calculated based on the instance number annotations to learn $\theta_3$. $\theta_2$ and $\theta_1$ are also slightly fine-tuned due to the chained schema within the nest:

$$\begin{cases} \mathcal{L}_{pil} = \mathcal{L}_{pil}(\tilde{X}(\theta_3, \theta_2, \theta_1) \mid X \cup X'(\theta_1) \cup X''(\theta_2, \theta_1)), \\ \mathcal{L}_{in} = \mathcal{L}_{in}(N(\theta_3, \theta_2, \theta_1) \mid X \cup X'(\theta_1) \cup X''(\theta_2, \theta_1)). \end{cases} \qquad (11)$$
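To make the regression target concrete, the short NumPy sketch below builds the per-pixel location map described above from a set of instance masks; the function name and the mask representation are illustrative assumptions.

```python
# Sketch of building the pixel-wise instance location targets used by Eq. (11).
import numpy as np

def instance_location_targets(instance_masks, height, width):
    """instance_masks: iterable of boolean (height, width) arrays, one per person.
    Returns a (height, width, 4) map holding [xl/w, yl/h, xr/w, yr/h] for every
    foreground pixel, normalized by the instance's own width and height."""
    targets = np.zeros((height, width, 4), dtype=np.float32)
    for mask in instance_masks:
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue
        xl, xr = xs.min(), xs.max()      # bounding-box corners of this instance
        yl, yr = ys.min(), ys.max()
        w = max(xr - xl, 1)              # instance width for normalization
        h = max(yr - yl, 1)              # instance height for normalization
        targets[mask] = [xl / w, yl / h, xr / w, yr / h]
    return targets
```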

Given the above information, instance-aware clustering maps can be effortlessly generated with little computational overhead, which are denoted by $\hat{X} \in \mathbb{R}^{M'}$. Similar to $\mathcal{L}_{adv_2}$, $\mathcal{L}_{adv_3}$ is proposed to ensure the correctness and realism of all phases for the information flow consistency. To facilitate this process, we leverage the same CNN backbone as $D_{2_{\phi_2}}$ for the discriminator $D_{3_{\phi_3}}: \mathbb{R}^{H \times W \times (C'+C''+M')} \mapsto \mathbb{R}^1$, which is learned separately. We alternately optimize $G_{3_{\theta_3}}$ and $D_{3_{\phi_3}}$ to learn $\theta_3$, $\phi_3$ and slightly fine-tune $\theta_2$ and $\theta_1$:

$$\mathcal{L}_{adv_3} = \arg\min_{\theta_1,\theta_2,\theta_3} \max_{\phi_3} \mathcal{L}_{adv_3}(K(\phi_3) \mid X'(\theta_1) \cup X''(\theta_2,\theta_1) \cup \hat{X}(\theta_3,\theta_2,\theta_1),\; X'_{GT} \cup X''_{GT} \cup \hat{X}_{GT}). \qquad (12)$$

4.4 Training and Inference

The goal of NAN is to use sets of real targets to learn three GAN-like sub-nets that mutually boost and jointly accomplish multi-human parsing. Each separate loss serves as a deep supervision within the nested structure, which benefits the network convergence. The overall objective function for NAN is

$$\mathcal{L}_{NAN} = -\sum_{\substack{i \in \{1,3,5\} \\ j \in \{1,2,3\}}} \lambda_i \mathcal{L}_{adv_j} + \lambda_2 \mathcal{L}_{ss} + \lambda_4 \mathcal{L}_{gp} + \lambda_6 \mathcal{L}_{pil} + \lambda_7 \mathcal{L}_{in}. \qquad (13)$$
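As a compact reference, the following snippet composes the overall objective of Eq. (13) from the individual loss terms; the dictionary keys are illustrative, and the λ values are those reported in Sect. 5.1.1.

```python
# Illustrative composition of the overall NAN objective in Eq. (13).
LAMBDAS = {"adv1": 0.01, "ss": 1.0, "adv2": 0.01, "gp": 0.01,
           "adv3": 1.0, "pil": 10.0, "in": 1.0}   # lambda_1 ... lambda_7

def nan_objective(losses):
    """losses: dict of scalar loss tensors keyed as in LAMBDAS.
    Adversarial terms enter with a negative sign, as in Eq. (13)."""
    adv = sum(LAMBDAS[k] * losses[k] for k in ("adv1", "adv2", "adv3"))
    task = sum(LAMBDAS[k] * losses[k] for k in ("ss", "gp", "pil", "in"))
    return -adv + task
```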


Discussion on Motivation of Using Adversarial Learning Since the key requirements for $G_1$, $G_2$ and $G_3$ are that the predictions should present indistinguishable properties compared with the real ones (i.e., ground truths) in appearance while preserving the intrinsic contextually remarkable information, we use adversarial losses to serve as auxiliary regularization terms for refining realism and alleviating artifacts. Unlike many multi-task learning applications, in our method the sub-nets depend on each other, forming a causal nest by dynamically boosting each other through an adversarial strategy, which is hence called a “nested adversarial learning” structure. The nested adversarial learning strategy ensures the correctness and realism of all phases for information flow consistency. Besides, each adversarial loss within the nested structure serves as auxiliary deep supervision to enable effortless gradient BP in NAN such that it can be trained in a holistic, end-to-end way, which is favorable to both accuracy and speed.

We have conducted a thorough component analysis in Sect. 5.2.1 to investigate 17 variants in terms of different architectures and loss function combinations of NAN and see their respective roles in multi-human parsing, including verification of the superiority of incorporating adversarial learning into the specific processes and of the nested adversarial learning strategy (see Table 2).

Discussion on Training Complexity Since the large modality and interaction variations are the main challenges to multi-human parsing, to address this problem, we propose to decompose the multi-human parsing task into three granularities (i.e., semantic saliency prediction, instance-agnostic parsing, instance-aware clustering) and sequentially gain stronger priors for multi-human parsing from each specific process. For example, the semantic saliency prediction results locate the most noticeable and eye-attracting human regions in images, which serves as a basic prior to facilitate further processing on humans; the instance-agnostic results further serve as stronger semantic guidance from the global perspective to facilitate accurate instance-aware clustering.

Unlike most existing methods that perform instance localization, human parsing and result refinement separately, the proposed NAN model adopts three GAN-like sub-nets to jointly perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering end-to-end. Each adversarial loss serves as an auxiliary regularization term to refine the realism and alleviate artifacts. Unlike many multi-task learning applications, in our method the sub-nets depend on each other, forming a causal nest by dynamically boosting each other through an adversarial strategy, which is hence called a “nested adversarial learning” structure. Such a structure enables efficient gradient BP in NAN such that it can be trained in a holistic, end-to-end way, which is favorable to both accuracy and speed.

NAN produces accurate multi-human parsing results through a single forward-pass in a time-efficient manner without tedious pre- or post-processing. This reduces the training complexity and leads to better empirical performance with limited data, as can be verified in Sect. 5.2.1 and Table 3 (NAN is more than 10 times faster than previous state-of-the-arts).

Discussion on Efficiency As indicated by Eq. (13) and the relevant discussion, NAN is clearly end-to-end trainable and can be optimized with the proposed nested adversarial learning strategy and the BP algorithm. During inference, we simply feed the input image $x$ into NAN and get the instance-agnostic parsing representation/map $x''$ from $G_{2_{\theta_2}}$, and the pixel-wise instance location representation/map $\tilde{x}$ and instance number $n$ from $G_{3_{\theta_3}}$. Then we employ spectral clustering (Ng et al. 2002) to obtain the instance-aware clustering representation/map $\hat{x}$. As verified in Sect. 5.2.2, NAN can rapidly process one 512 × 512 image in about 1 s with acceptable resource consumption, which is appealing for real applications. This compares much favorably to previous state-of-the-arts, e.g., MH-Parser (Li et al. 2017c) (14.94 s/image). MH-Parser relies on separate and complex post-processing while NAN produces multi-human parsing results in one shot. Thus, the proposed NAN is highly effective and efficient for understanding humans in crowded scenes.
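For illustration, the sketch below shows one way the inference-time grouping step could look: foreground pixels are clustered into person instances from the predicted per-pixel bounding-box coordinates and the predicted instance number. The paper only states that spectral clustering (Ng et al. 2002) is applied to the $G_3$ outputs, so the exact feature construction and the scikit-learn usage here are assumptions.

```python
# Hedged sketch of instance-aware clustering at inference time.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_instances(parsing_map, location_map, n_instances):
    """parsing_map: (H, W) semantic labels (0 = background);
    location_map: (H, W, 4) predicted [xl/w, yl/h, xr/w, yr/h] per pixel;
    n_instances: rounded output of the instance-counting branch."""
    fg = parsing_map > 0                       # cluster foreground pixels only
    feats = location_map[fg]                   # (N, 4) per-pixel location features
    labels = SpectralClustering(n_clusters=int(n_instances),
                                affinity="nearest_neighbors",
                                assign_labels="kmeans").fit_predict(feats)
    instance_map = np.zeros(parsing_map.shape, dtype=np.int32)
    instance_map[fg] = labels + 1              # instance ids start at 1
    return instance_map
```

In practice the foreground pixels would likely be sub-sampled before clustering to keep the affinity computation tractable.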

5 Experiments

We evaluate NAN qualitatively and quantitatively under various settings and granularities for understanding humans in crowded scenes. In particular, we evaluate multi-human parsing performance on the MHP v2.0 dataset proposed in this work, as well as on the MHP v1.0 (Li et al. 2017c) and PASCAL-Person-Part (Chen et al. 2014) benchmark datasets. We also evaluate instance-agnostic parsing and instance-aware clustering results on the Buffy (Vineet et al. 2011) benchmark dataset, which are byproducts of NAN.

5.1 Experimental Settings

5.1.1 Implementation Details

Throughout the experiments, the sizes of the RGB image $x$, the semantic saliency prediction $x'$, the inputs to the discriminator $D_{1_{\phi_1}}$ and the inputs to the generator $G_{2_{\theta_2}}$ are fixed as 512 × 512; the sizes of the instance-agnostic parsing prediction $x''$, the instance-aware clustering prediction $\hat{x}$, the inputs to the discriminator $D_{2_{\phi_2}}$, the inputs to the generator $G_{3_{\theta_3}}$, the inputs to the discriminator $D_{3_{\phi_3}}$ and the instance location map $\tilde{x}$ are fixed as 64 × 64; the channel number of the pixel-wise instance location map is fixed as 4, incorporating the two corner points of the associated bounding box.


The constraint factors $\lambda_i$, $i \in \{1, 2, 3, 4, 5, 6, 7\}$ are exhaustively searched as 0.01, 1.00, 0.01, 0.01, 1.00, 10.00 and 1.00, respectively. The generator $G_{1_{\theta_1}}$ is initialized with FCN-8s (Long et al. 2015) by replacing the last layer with a new convolutional layer with kernel size 1 × 1 × 2; it is pretrained on PASCAL-VOC-2011 (Everingham et al. 2011) and fine-tuned on the target dataset. The generator $G_{2_{\theta_2}}$ is initialized with WS-ResNet (Wu et al. 2016) by eliminating the spatial pooling layers, increasing the strides of the first convolutional layers up to 2 in $B_i$, $i \in \{2, 3, 4\}$, eliminating the top-most global pooling layer and the linear classifier, and adding two new convolutional layers with kernel sizes 3 × 3 × 512 and 1 × 1 × $C''$; it is pretrained on ImageNet (Russakovsky et al. 2015) and PASCAL-VOC-2012 (Everingham et al. 2015), and fine-tuned on the target dataset. The generator $G_{3_{\theta_3}}$ is initialized with the same backbone architecture and pre-trained weights as $G_{2_{\theta_2}}$ (which are learned separately), by further augmenting it with two sibling branches for pixel-wise instance location map prediction and instance number prediction. The first branch utilizes an MSFU (three convolutional layers with kernel sizes 3 × 3 × i, i ∈ {128, 128, 4} for specific scale adaption) ended with a convolutional layer with kernel size 1 × 1 × 4 for multi-scale feature aggregation and a final convolutional layer with kernel size 1 × 1 × 4 for location regression; the second branch utilizes the feature maps downsampled by 8 concatenated with the feature maps from the first branch, ended with a global pooling layer, a hidden 512-way fully-connected layer and a final 1-way fully-connected layer for instance number regression. The three discriminators $D_{i_{\phi_i}}$, $i \in \{1, 2, 3\}$ (which are learned separately) are all initialized with VGG-16 (Simonyan and Zisserman 2014) by adding a new convolutional layer at the very beginning with kernel size 1 × 1 × 3 for input adaption, and replacing the last layer with a new 1-way fully-connected layer activated by sigmoid; they are pre-trained on ImageNet (Russakovsky et al. 2015) and fine-tuned on the target dataset. The newly added layers are randomly initialized by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. We employ an off-the-shelf clustering method (Ng et al. 2002) to obtain the instance-aware clustering map $\hat{x}$. The dropout ratio is empirically fixed as 0.7. The weight decay and batch size are fixed as 5 × 10−3 and 4, respectively. We use an initial learning rate of 1 × 10−6 for pre-trained layers and 1 × 10−4 for newly added layers in all our experiments; we decrease the learning rate to 1/10 of the previous one after 20 epochs and train the networks for roughly 60 epochs one after the other. The proposed network is implemented on the publicly available TensorFlow (Abadi et al. 2016) platform and trained using Adam (β1 = 0.5) on four NVIDIA GeForce GTX TITAN X GPUs with 12G memory. The same training setting is utilized for all our compared network variants. We evaluate the testing time by averaging the running time over images on the target set on an NVIDIA GeForce GTX TITAN X GPU and an Intel Core i7-4930K CPU. Our NAN can rapidly process one 512 × 512 image in about 1 second, which compares much favorably to other state-of-the-art approaches, as the current state-of-the-art methods (Li et al. 2017a, c; Jiang and Grauman 2016) rely on region proposal preprocessing and complex processing steps.
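As a rough illustration of the discriminator construction just described (a VGG-16 backbone preceded by a 1 × 1 input-adaption convolution and ended by a single sigmoid unit), a possible Keras reading is sketched below; the global-average-pooling head and other details are simplifying assumptions, not the released code.

```python
# Illustrative Keras sketch of one NAN discriminator; an interpretation of the
# textual description above, not the authors' implementation.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_discriminator(input_channels, name):
    inp = layers.Input(shape=(512, 512, input_channels))
    x = layers.Conv2D(3, 1, padding="same")(inp)     # 1x1x3 input-adaption layer
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=(512, 512, 3))
    x = backbone(x)
    x = layers.GlobalAveragePooling2D()(x)           # simplification of the VGG head
    out = layers.Dense(1, activation="sigmoid")(x)   # real-vs-fake score
    return Model(inp, out, name=name)
```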

Discussion on Choice of Trade-Off Hyper-parameter λ $\mathcal{L}_{adv_i}$, $i \in \{1, 2, 3\}$, serves as a regularization term that penalizes higher-order inconsistencies between the prediction and the ground truth for refining realism and alleviating artifacts. In particular, $\mathcal{L}_{adv_3}$ is proposed to ensure the correctness and realism of all phases for the information flow consistency, while $\mathcal{L}_{adv_2}$ serves as a high-order regularization term refining the realism of the results of both the 1st and 2nd phases, and $\mathcal{L}_{adv_1}$ is only in charge of the 1st phase. In this process, $\mathcal{L}_{adv_3}$ takes more responsibility within the nested adversarial learning to achieve good multi-human parsing results. $\mathcal{L}_{ss}$ and $\mathcal{L}_{gp}$ serve as deep supervisions within the nested structure, benefitting network convergence. Instance number regression via $\mathcal{L}_{in}$ is easier to learn than pixel-wise instance location regression via $\mathcal{L}_{pil}$. We choose the trade-off hyper-parameters λ based on the above intuition to balance the effects of the different losses, and this choice works well experimentally.

5.1.2 Evaluation Metrics

Following Li et al. (2017c), we use the Average Precision based on part ($AP^p$) and the Percentage of Correctly parsed semantic Parts (PCP) metrics for multi-human parsing evaluation. Different from the Average Precision based on region ($AP^r$) used in instance segmentation (Liang et al. 2015a; Hariharan et al. 2014), $AP^p$ uses part-level pixel Intersection over Union (IoU) of different semantic part categories within a person instance to determine if one instance is a true positive. We prefer $AP^p$ over $AP^r$ as we focus on human-centric analysis and we aim to investigate how well a person instance as a whole is parsed. Additionally, we also report the $AP^p_{vol}$, which is the mean of the $AP^p$ at IoU thresholds ranging from 0.1 to 0.9, in increments of 0.1. As $AP^p$ averages the IoU of all semantic part categories, it fails to reflect how many semantic parts are correctly parsed. We further incorporate the PCP, originally used in human pose estimation (Ferrari et al. 2008; Chen et al. 2014), to evaluate the parsing quality within person instances. For each true-positive person instance, we find all the semantic categories (excluding background) with pixel IoU larger than a threshold, which are regarded as correctly parsed. The PCP of one person instance is the ratio between the correctly parsed semantic category number and the total semantic category number of that person.


Table 2 Component analysis on the MHP v2.0 validation set

| Setting | Method | AP^p_0.5 (%) | AP^p_vol (%) | PCP_0.5 (%) |
|---|---|---|---|---|
| Baseline | Mask R-CNN (He et al. 2017) | 14.50 | 33.51 | 25.12 |
| | MH-Parser (Li et al. 2017c) | 18.05 | 35.87 | 26.91 |
| Network structure | w/o G1 | 22.67 | 38.11 | 31.95 |
| | G2 w/o concatenated input | 21.88 | 36.79 | 29.02 |
| | G3 w/o concatenated input | 22.36 | 35.92 | 25.48 |
| | w/o D1 | 23.81 | 33.95 | 27.59 |
| | w/o D2 | 19.02 | 29.66 | 22.89 |
| | D2 w/o concatenated input | 21.55 | 31.94 | 24.90 |
| | w/o D3 | 20.62 | 32.83 | 26.22 |
| | D3 w/o concatenated input | 21.80 | 34.54 | 27.30 |
| | w/o MSFU | 18.76 | 26.62 | 24.94 |
| Ours | NAN | 24.83 | 42.77 | 34.37 |
| Ours | NAN + CRF | 27.92 | 44.95 | 36.63 |
| Upperbound | X′_GT | 26.17 | 43.59 | 38.11 |
| | X″_GT | 28.98 | 48.55 | 38.03 |
| | N_GT | 28.39 | 47.76 | 39.25 |
| | X̃_GT | 30.18 | 51.44 | 41.18 |

Missed person instances are assigned 0 PCP. The overall PCP is the average PCP over all person instances. Note that PCP is also a human-centric evaluation metric.
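A short sketch of the per-instance PCP just described is given below; the label-map representation and function name are illustrative assumptions.

```python
# Hedged sketch of the per-instance PCP computation described above.
import numpy as np

def instance_pcp(pred, gt, thresh=0.5):
    """pred, gt: (H, W) semantic label maps for one matched person instance,
    with 0 assumed to be the background label."""
    categories = [c for c in np.unique(gt) if c != 0]   # GT categories of this person
    if not categories:
        return 0.0
    correct = 0
    for c in categories:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0 and inter / union > thresh:        # part counted as correctly parsed
            correct += 1
    return correct / len(categories)

# The overall PCP averages instance_pcp over all person instances, with missed
# (undetected) instances contributing 0.
```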

5.2 Evaluations on MHP v2.0 Benchmark

The MHP v2.0 dataset proposed in this paper is the largest and most comprehensive multi-human parsing benchmark to date, which extends MHP v1.0 (Li et al. 2017c) to push the frontiers of understanding humans in crowded scenes. It contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels. Annotation examples are given in Figs. 4c, 6 and 8. The data are randomly organized into 3 splits, consisting of 15,403 training and 5000 validation images with publicly available annotations, as well as 5000 testing images with annotations withheld for benchmarking purposes. Evaluation systems report the $AP^p$ and PCP over the validation and testing sets.

5.2.1 Component Analysis

We first investigate the respective roles of different architectures and loss function combinations of NAN in multi-human parsing. We compare 17 variants from four aspects, i.e., different baselines [Mask R-CNN (He et al. 2017)6 and MH-Parser (Li et al. 2017c)], different network structures

6 As existing instance segmentation methods only offer silhouettes of different person instances, for comparison, we combine them with our instance-agnostic parsing prediction to generate the final multi-human parsing results.

(w/o G1, G2 w/o concatenated input (RGB only), G3 w/o concatenated input (RGB only), w/o D1, w/o D2, D2 w/o concatenated input, w/o D3, D3 w/o concatenated input, w/o MSFU), our proposed NAN, NAN + Conditional Random Field (CRF)7 (Lafferty et al. 2001), and upperbounds (X′_GT: use the ground truth semantic saliency maps instead of the G1 prediction while keeping other settings the same; X″_GT: use the ground truth instance-agnostic parsing maps instead of the G2 prediction while keeping other settings the same; N_GT: use the ground truth instance number instead of the G3 prediction while keeping other settings the same; X̃_GT: use the ground truth pixel-wise instance location maps instead of the G3 prediction while keeping other settings the same).

The performance comparison in terms of $AP^p$@IoU=0.5, $AP^p_{vol}$ and PCP@IoU=0.5 on the MHP v2.0 validation set is reported in Table 2. By comparing the results from the 1st versus 3rd panels, we observe that our proposed NAN consistently outperforms the baselines Mask R-CNN and MH-Parser by a large margin, i.e., 10.33% and 6.78% in terms of $AP^p$, 9.26% and 6.90% in terms of $AP^p_{vol}$, and 9.25% and 7.46% in terms of PCP. Mask R-CNN suffers difficulties in differentiating entangled humans. MH-Parser involves multiple stages for instance localization, human parsing and result refinement with high complexity, yielding sub-optimal results, whereas NAN parses semantic categories, differentiates different person instances and refines results simultaneously through deep nested adversarial learning in an effective yet time-efficient manner.

7 We adopt CRF as a post-processing step to refine the instance-agnostic parsing map by associating each pixel in the image with one of the semantic categories.
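The CRF refinement in footnote 7 can be realized, for instance, with a standard fully connected CRF; the sketch below uses the third-party pydensecrf package, and the pairwise parameters shown are illustrative defaults rather than the settings used in our experiments.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_parsing_with_crf(rgb_image, class_probs, num_iters=10):
    """Refine an instance-agnostic parsing map with a dense CRF.

    rgb_image:   H x W x 3 uint8 image.
    class_probs: C x H x W softmax probabilities over semantic categories.
    Returns an H x W map of refined category labels.
    """
    num_classes, height, width = class_probs.shape
    crf = dcrf.DenseCRF2D(width, height, num_classes)
    crf.setUnaryEnergy(unary_from_softmax(class_probs))
    # Smoothness term: nearby pixels prefer the same label.
    crf.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance term: pixels with similar color prefer the same label.
    crf.addPairwiseBilateral(sxy=80, srgb=13,
                             rgbim=np.ascontiguousarray(rgb_image),
                             compat=10)
    q = crf.inference(num_iters)
    return np.argmax(np.array(q).reshape(num_classes, height, width), axis=0)
```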


By comparing the results from the 2nd versus 3rd panels, we observe that NAN consistently outperforms the 9 variants in terms of network structure. In particular, w/o G1 refers to truncating the semantic saliency prediction sub-net from NAN, leading to 2.16%, 4.66% and 2.42% performance drops in terms of all metrics. This verifies the necessity of semantic saliency prediction, which locates the most noticeable human regions in images and serves as a basic prior to facilitate further human-centric processing. The superiority of incorporating adaptive prior information into specific processes can be verified by comparing Gi, i ∈ {2, 3} w/o concatenated input with NAN, i.e., differences of 2.95%, 5.98% and 5.35%, and 2.47%, 6.85% and 8.89% in terms of all metrics. The superiority of incorporating adversarial learning into specific processes can be verified by comparing w/o Di, i ∈ {1, 2, 3} with NAN, i.e., decreases of 1.02%, 8.82% and 6.78%; 5.81%, 13.11% and 11.48%; and 4.21%, 9.94% and 8.15% in terms of all metrics. The nested adversarial learning strategy ensures the correctness and realism of all phases for information flow consistency. Its superiority is verified by comparing Di, i ∈ {2, 3} w/o concatenated input with NAN, i.e., declines of 3.28%, 10.83% and 9.47%, and 3.03%, 8.23% and 7.07% in terms of all metrics. MSFU dynamically fuses multi-scale features to enhance instance-aware clustering accuracy. Its superiority is verified by comparing w/o MSFU with NAN, i.e., drops of 6.07%, 16.15% and 9.43% in terms of all metrics. Finally, we evaluate the limitations of our current algorithm. By comparing X'_GT with NAN, only 1.34%, 0.82% and 3.74% improvements in terms of all metrics are obtained, showing that the errors from semantic saliency prediction are already small and have little effect on the final results. The significant gaps between the 27.92%, 44.95% and 36.63% of NAN + CRF, the 28.98%, 48.55% and 38.03% of X''_GT, and the 24.83%, 42.77% and 34.37% of NAN show that a better instance-agnostic parsing network architecture can definitely help improve the performance of multi-human parsing under our NAN framework. By comparing N_GT and X̃_GT with NAN, improvements of 3.56%, 4.99% and 4.88%, and 5.35%, 8.67% and 6.81% in terms of all metrics are obtained, indicating that accurate instance-aware clustering results are critical for superior multi-human parsing.

5.2.2 Quantitative Comparison

The performance comparison of the proposed NAN with two state-of-the-art methods in terms of AP^p@IoU=0.5, AP^p_vol and PCP@IoU=0.5 on the MHP v2.0 testing set is reported in Table 3. Following Li et al. (2017c), we conduct experiments under three settings: All reports the evaluation over the whole testing set; Inter20% reports the evaluation over the sub-set containing the images with the top 20% interaction intensity;^8 and Inter10% reports the evaluation over the sub-set containing the images with the top 10% interaction intensity.

Table 3  Multi-human parsing quantitative comparison on the MHP v2.0 testing set (entries are AP^p_0.5 / AP^p_vol / PCP_0.5, all in %)

Method                        All                     Inter 20%               Inter 10%               Speed (s/img)
Mask R-CNN (He et al. 2017)   14.90 / 33.88 / 25.11   4.77 / 24.28 / 12.75    2.23 / 20.73 / 8.38     –
MH-Parser (Li et al. 2017c)   17.99 / 36.08 / 26.98   13.38 / 34.25 / 22.31   13.25 / 34.29 / 21.28   14.94
NAN                           25.14 / 41.78 / 32.25   18.61 / 38.90 / 27.93   14.88 / 37.49 / 24.61   1.08


Fig. 12 Multi-human parsing qualitative comparison on the MHP v2.0 dataset. Best viewed in zoomed PDF

Our NAN is significantly superior to other state-of-the-art methods under setting 1. In particular, NAN improves over the 2nd-best by 7.15%, 5.70% and 5.27% in terms of all metrics. For the more challenging scenarios with intensive interactions (settings 2 and 3), NAN also consistently achieves the best performance. In particular, for Inter20% and Inter10%, NAN improves over the 2nd-best by 5.23%, 4.65% and 5.62%, and 1.63%, 3.20% and 3.33% in terms of all metrics. This verifies its effectiveness for multi-human parsing and understanding humans in crowded scenes. Moreover, NAN can process one 512×512 image in about 1 second with acceptable resource consumption, which is appealing for real applications.^9 This compares favorably to MH-Parser (Li et al. 2017c) (14.94 s/img), which relies on separate and complex post-processing, including CRF (Lafferty et al. 2001).

5.2.3 Qualitative Comparison

Figure 12 visualizes some results by the proposed NAN and two state-of-the-art methods, together with the corresponding ground truths, on the MHP v2.0 dataset. Note that Mask R-CNN (He et al. 2017) only offers silhouettes of different person instances; thus we only compare our instance-aware clustering results with it, and compare our holistic results with MH-Parser (Li et al. 2017c).

8 For each testing image, we calculate the pair-wise instance bounding box IoU and use the mean value as the interaction intensity of that image.
9 Since Mask R-CNN only offers silhouettes of different person instances, we did not compare its speed for multi-human parsing.
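For reference, the interaction intensity described in footnote 8 amounts to the following computation; the box format (x1, y1, x2, y2) and the function names are our own assumptions for illustration.

```python
import itertools
import numpy as np

def box_iou(a, b):
    """IoU of two bounding boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def interaction_intensity(instance_boxes):
    """Mean pairwise bounding-box IoU over all person instances in one image."""
    pair_ious = [box_iou(a, b)
                 for a, b in itertools.combinations(instance_boxes, 2)]
    return float(np.mean(pair_ious)) if pair_ious else 0.0
```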

Fig. 13 Multi-human parsing qualitative comparison between NAN and MH-Parser on the MHP v2.0 dataset (complementary to Fig. 12). Best viewed in zoomed PDF

It can be observed that the proposed NAN performs well in multi-human parsing across a wide range of viewpoints, poses, occlusions, interactions and background complexity. The instance-agnostic parsing and instance-aware clustering predictions of NAN present high consistency with the corresponding ground truths, thanks to the novel network structure and effective training strategy. In contrast, Mask R-CNN has difficulty differentiating entangled humans, while MH-Parser struggles to generate fine-grained parsing results and clearly segmented instance masks. This further validates the effectiveness of NAN. The superiority of NAN over MH-Parser and the failures of MH-Parser are highlighted in Fig. 13 for clearer comparison. We also show some failure cases of NAN in Fig. 14. As can be seen, humans in crowded scenes with heavy occlusion, extreme poses and intensive interactions are difficult to identify and segment. Some small-scale semantic categories within person instances are also difficult to parse. This confirms that MHP v2.0 aligns with reality and deserves more future attention and research effort.


Fig. 14 Failure cases of multi-human parsing results by our NAN on the MHP v2.0 dataset. Best viewed in zoomed PDF

Table 4  Multi-human parsing quantitative comparison on the MHP v1.0 testing set

Method                           AP^p_0.5 (%)   AP^p_vol (%)   PCP_0.5 (%)
DL (De Brabandere et al. 2017)   47.76          47.73          49.21
MH-Parser (Li et al. 2017c)      50.10          48.96          50.70
Mask R-CNN (He et al. 2017)      52.68          49.81          51.87
NAN                              57.09          56.76          59.91

5.3 Evaluations on MHP v1.0 Benchmark

The MHP v1.0^10 dataset is the first multi-human parsing benchmark, originally proposed by Li et al. (2017c), which contains 4980 images annotated with 18 semantic labels. Annotation examples are visualized in Fig. 4b. The data are randomly organized into 3 splits, consisting of 3000 training, 1000 validation and 1000 testing images with publicly available annotations. Evaluation systems report the AP^p and PCP over the testing set. Refer to Li et al. (2017c) for more details.

The performance comparison of the proposed NAN with three state-of-the-art methods in terms of AP^p@IoU=0.5, AP^p_vol and PCP@IoU=0.5 on the MHP v1.0 testing set (Li et al. 2017c) is reported in Table 4. With the nested adversarial learning of semantic saliency prediction, instance-agnostic parsing and instance-aware clustering, our method outperforms the 2nd-best by 4.41% for AP^p_0.5, 6.95% for AP^p_vol and 8.04% for PCP_0.5. Visual comparison of multi-human parsing results by NAN and three state-of-the-art methods is provided in Fig. 15, which further validates the advantages of our NAN over existing solutions.

10 The dataset is available at http://lv-mhp.github.io/.

Fig. 15 Multi-human parsing qualitative comparison on the MHP v1.0 dataset. Best viewed in zoomed PDF

Table 5  Multi-human parsing quantitative comparison on the PASCAL-Person-Part testing set

Method                  AP^r_0.5 (%)   AP^r_0.6 (%)   AP^r_0.7 (%)   AP^r_vol (%)
MNC (Dai et al. 2016)   38.80          28.10          19.30          36.70
Li et al. (2017a)       40.60          30.40          19.10          38.40
NAN                     59.70          51.40          38.00          52.20

5.4 Evaluations on PASCAL-Person-Part Benchmark

The PASCAL-Person-Part^11 dataset (Chen et al. 2014) is an extended set with additional annotations for PASCAL-VOC-2010 (Everingham et al. 2010). It goes beyond the original PASCAL object detection task by providing pixel-wise labels for six human body parts, i.e., head, torso, upper-/lower-arms, and upper-/lower-legs. The rest of each image is considered background. There are 3535 images in the PASCAL-Person-Part dataset (Chen et al. 2014), which is split into a training set containing 1717 images and a testing set containing 1818 images. For fair comparison, we report the AP^r over the testing set for multi-human parsing. Refer to Chen et al. (2016) and Xia et al. (2016) for more details.

The performance comparison of the proposed NAN with two state-of-the-art methods in terms of AP^r@IoU=k (k = 0.5, 0.6, 0.7) and AP^r_vol on the PASCAL-Person-Part testing set (Chen et al. 2014) is reported in Table 5. Our method dramatically surpasses the 2nd-best by 18.90% for AP^r_0.7 and 13.80% for AP^r_vol. Qualitative multi-human parsing results by NAN are visualized in Fig. 16, which possess a high concordance with

11 The dataset is available at http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html.


Fig. 16 Multi-human parsing qualitative comparison on the PASCAL-Person-Part dataset. Best viewed in zoomed PDF

Table 6  Instance segmentation quantitative comparison on the Buffy dataset, episodes 4, 5 and 6

Method                        F (%)   B (%)   Ave. (%)
Vineet et al. (2011)          –       –       62.40
Jiang and Grauman (2016)      68.22   69.66   68.94
MH-Parser (Li et al. 2017c)   71.11   71.94   71.53
NAN                           77.24   79.92   78.58

corresponding ground truths. This again verifies the effectiveness of our method for human-centric analysis.

5.5 Evaluations on Buffy Benchmark

The Buffy^12 dataset (Vineet et al. 2011) was released in 2011 for human parsing and instance segmentation; it contains 748 images annotated with 12 semantic labels. The data are randomly organized into 2 splits, consisting of 452 training and 296 testing images with publicly available annotations. For fair comparison, we report the Forward (F) and Backward (B) scores (Jiang and Grauman 2016) over episodes 4, 5 and 6 for instance segmentation evaluation. Refer to Vineet et al. (2011) and Jiang and Grauman (2016) for more details.

The performance comparison of the proposed NAN with three state-of-the-art methods in terms of F and B scores on the Buffy dataset (Vineet et al. 2011), episodes 4, 5 and 6, is reported in Table 6. Our NAN consistently achieves the best performance on all metrics. In particular, NAN significantly improves over the 2nd-best by 6.13% for the F score and 7.98% for the B score, with an average boost of 7.05%. Qualitative instance-agnostic parsing and instance-aware clustering results by NAN are visualized in Fig. 17, which clearly show the promising potential of our method for fine-grained understanding of humans in crowded scenes.

12 The dataset is available at https://www.inf.ethz.ch/personal/ladickyl/Buffy.zip.

Fig. 17 Qualitative instance-agnostic parsing (upper panel) and instance-aware clustering (lower panel) results by NAN on the Buffy dataset. Best viewed in zoomed PDF

6 Conclusions

In this work, we presented "Multi-human Parsing (MHP v2.0)", a new large-scale and carefully designed benchmark to spark progress in understanding humans in crowded scenes. MHP v2.0 contains 25,403 images, which are richly labeled with 59 semantic categories (58 fine-grained part labels plus background) and 16 dense pose key points. We also proposed a novel deep Nested Adversarial Network (NAN) model to address the challenging multi-human parsing problem and performed detailed evaluations comparing NAN with current state-of-the-art methods on MHP v2.0 and several other datasets. We envision that the proposed MHP v2.0 dataset and the NAN method will drive human parsing and pose estimation research towards real-world applications with the simultaneous presence of multiple persons and complex interactions among them. In the future, we will continue our efforts to construct a more comprehensive multi-human parsing and pose estimation benchmark with more images and more detailed semantic category and pose key point annotations. Moreover, we will further improve the current multi-human parsing framework, making it more effective and efficient by improving accuracy and reducing running time, and more robust in in-the-wild scenarios by reasoning over top-down and bottom-up methods, multi-task learning, etc. These directions target intelligent human-computer interaction (i.e., visually understanding the location, identity and behavior of humans in an image, together with fine-grained semantic analyses of body parts and fashion items) and smart crowded-scene understanding centered on humans.

Acknowledgements The work of Jian Zhao was partially supported by China Scholarship Council (CSC) Grant 201503170248. The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning.


Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. T-PAMI, 33(5), 898–916.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR (pp. 1971–1978).
Chen, L.-C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In CVPR (pp. 3640–3649).
Chu, X., Ouyang, W., Yang, W., & Wang, X. (2015). Multi-task recurrent neural network for immediacy prediction. In ICCV (pp. 3352–3360).
Collins, R. T., Lipton, A. J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al. (2000). A system for video surveillance and monitoring. VSAM final report (pp. 1–68).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In CVPR (pp. 3213–3223).
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR (pp. 3150–3158).
De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. T-PAMI, 34(4), 743–761.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2011). The PASCAL visual object classes challenge 2011 (VOC2011) results. Retrieved May 25, 2011 from http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In CVPR (pp. 1–8).
Gan, C., Lin, M., Yang, Y., de Melo, G., & Hauptmann, A. G. (2016). Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In AAAI (p. 3487).
Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083.
Gong, K., Liang, X., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446.
Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In ECCV (pp. 297–312).

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV (pp. 2980–2988).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Jiang, H., & Grauman, K. (2016). Detangling people: Individuating multiple close people and their body parts via region assembly. arXiv preprint arXiv:1604.03880.
Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., & Jain, A. K. (2015). Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR (pp. 1931–1939).
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Li, Q., Arnab, A., & Torr, P. H. (2017a). Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612.
Li, G., Xie, Y., Lin, L., & Yu, Y. (2017b). Instance-level salient object segmentation. In CVPR (pp. 247–256).
Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017c). Multi-human parsing in the wild. arXiv preprint arXiv:1705.07206.
Liang, X., Wei, Y., Shen, X., Yang, J., Lin, L., & Yan, S. (2015a). Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636.
Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015b). Human parsing with contextualized convolutional neural network. In ICCV (pp. 1386–1394).
Lin, J., Guo, X., Shao, J., Jiang, C., Zhu, Y., & Zhu, S.-C. (2016). A virtual reality platform for dynamic human-scene interaction. In SIGGRAPH (p. 11).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV (pp. 740–755).
Liu, S., Wang, C., Qian, R., Yu, H., Bao, R., & Sun, Y. (2017). Surveillance video parsing with single frame supervision. In CVPRW (pp. 1–9).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In NIPS (pp. 849–856).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sapp, B., & Taskar, B. (2013). MODEC: Multimodal decomposable models for human pose estimation. In CVPR (pp. 3674–3681).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Turban, E., King, D., Lee, J., & Viehland, D. (2002). Electronic commerce: A managerial perspective 2002 (Vol. 13, no. (975285), p. 4). Englewood Cliffs: Prentice Hall.
Vineet, V., Warrell, J., Ladicky, L., & Torr, P. H. (2011). Human instance segmentation from video using detector-based conditional random fields. In BMVC (Vol. 2, pp. 12–15).
Wu, Z., Shen, C., & Van Den Hengel, A. (2016). Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080.
Xia, F., Wang, P., Chen, L.-C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (pp. 648–663).
Xu, N., Price, B., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. In CVPR (pp. 373–381).
Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In CVPR (pp. 3570–3577).
Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision, 126(5), 550–569.
Zhang, N., Paluri, M., Taigman, Y., Fergus, R., & Bourdev, L. (2015). Beyond frontal faces: Improving person recognition using multiple cues. In CVPR (pp. 4804–4813).
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACM Multimedia (pp. 792–800). ACM.


Zhao, J., Li, J., Nie, X., Zhao, F., Chen, Y., Wang, Z., Feng, J., & Yan, S. (2017). Self-supervised neural aggregation networks for human parsing. In CVPRW (pp. 7–15).
Zhao, R., Ouyang, W., & Wang, X. (2013). Unsupervised salience learning for person re-identification. In CVPR (pp. 3586–3593).

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
