
Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee

Model Architectures

• Visual features: Fast R-CNN (VGGNet/ResNet)
• Text features: character-level CNN
  - Replicate a phrase until it reaches the CNN input length (256 characters)
  - Non-saturating activations: Leaky ReLU
  - Can be replaced by other text embeddings
• The detection score of image region r is given by a linear classifier dynamically generated from the text phrase t:
      f(x, r, t; Θ) = w(t)^T φ_rgn(x, r; Θ_rgn) + b(t),
  where the classifier weights and bias are w(t) = A_w^T φ_txt(t; Θ_txt) and b(t) = a_b^T φ_txt(t; Θ_txt).
• An extra regularization term on the dynamic classifier parameters is important for SGD stability:
      Γ_dynamic = ||w(t)||_2^2 + |b(t)|^2,
  added alongside the network weight decay.
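A minimal NumPy sketch of this text-conditioned scoring function and the dynamic regularizer follows. The feature dimensions, variable names, and random initialization are illustrative assumptions, not values from the paper; phi_rgn and phi_txt stand in for the actual pathway outputs.

```python
# Sketch of the dynamically generated linear classifier and its regularizer.
import numpy as np

d_txt, d_rgn = 512, 4096                           # assumed feature dimensions
rng = np.random.default_rng(0)

A_w = rng.normal(scale=0.01, size=(d_txt, d_rgn))  # Theta_dis: weight-generating matrix
a_b = rng.normal(scale=0.01, size=d_txt)           # Theta_dis: bias-generating vector

def detection_score(phi_rgn, phi_txt):
    """f(x, r, t) = w(t)^T phi_rgn(x, r) + b(t)."""
    w_t = A_w.T @ phi_txt                          # dynamically generated classifier weights
    b_t = a_b @ phi_txt                            # dynamically generated bias
    return float(w_t @ phi_rgn + b_t)

def dynamic_regularizer(phi_txt):
    """Gamma_dynamic = ||w(t)||_2^2 + |b(t)|^2, added on top of the weight decay."""
    w_t = A_w.T @ phi_txt
    b_t = a_b @ phi_txt
    return float(w_t @ w_t + b_t ** 2)

score = detection_score(rng.normal(size=d_rgn), rng.normal(size=d_txt))
```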

Discriminative Bimodal Networks (DBNet)

• Typical previous works
  - Caption generation models
  - Output in a huge language space
  - Mostly use only positive training samples (matched text and regions)
• Our DBNet
  - A fully discriminative model
  - Learning in a binary space is easier
  - Able to use all possible negative samples across the training set

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee

University of Michigan, Ann Arbor, MI, USA
{yutingzh, yuanluya, guoyijie, zhiyuan, huangian, honglak}@umich.edu

Abstract

Associating image regions with text queries has recently been explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection. Code is available at: http://ytzhang.net/projects/dbnet.

1. Introduction

Object localization and detection in computer vision are traditionally limited to a small number of predefined categories (e.g., car, dog, and person), and category-specific image region classifiers [7, 11, 14] serve as object detectors. However, in the real world, the visual entities of interest are much more diverse, including groups of objects (involved in certain relationships), object parts, and objects with particular attributes and/or in particular context. For scalable annotation, these entities need to be labeled in a more flexible way, such as using text phrases.

Deep learning has been demonstrated as a unified learning framework for both text and image representations. Significant progress has been made in many related tasks, such as image captioning [55, 56, 25, 37, 5, 9, 23, 18, 38], visual question answering [3, 36, 57, 41, 2], text-based fine-grained image classification [44], natural-language object retrieval [21, 38], and text-to-image generation [45].

A few pioneering works [21, 38] use recurrent neural language models [15, 39, 50] and deep image representations [31, 49] for localizing the object referred to by a text phrase given a single image (i.e., the "object referring" task [26]). Global spatial context, such as "a man on the left (of the image)", has been commonly used to pick out the particular object. In contrast, Johnson et al. [23] take descriptions without global context¹ as queries for localizing more general visual entities on the Visual Genome dataset [30].

[Figure 1: Comparison between (a) an image captioning model, which scores a region by the likelihood of an RNN generating the query phrase (e.g., "white dog with black spots") conditioned on the region, and (b) our discriminatively trained CNN, whose text network scores positive and negative image regions against positive and negative phrases, i.e., predicts p(l = 1 | x, r, t) versus p(l = 0 | x, r, t).]

All the above existing work performs localization by maximizing the likelihood of generating the query text given image regions using an image captioning model (Figure 1a), whose output probability density needs to be modeled on the virtually infinite space of natural language. Since it is hard to train a classifier on such a huge structured output space, current captioning models are constrained to be trained in generative [21, 23] or partially discriminative [38] ways. However, as discriminative tasks, localization and detection usually favor models that are trained with a more discriminative objective to better utilize negative samples.

In this paper, we propose a new deep architecture for natural-language-based visual entity localization, which we call a discriminative bimodal network (DBNet). Our architecture uses a binary output space to allow extensive discriminative training, where any negative training sample can be potentially utilized. The key idea is to take the text query as a condition rather than an output and to let the

¹Only a very small portion of text phrases on the Visual Genome refer to the global context.

Benchmarking Protocol

• Based on the Visual Genome dataset
  - Bounding boxes with text phrase descriptions
  - Additional spell checking and auto-correction
• Localization: find a known-to-be-existing entity
• Detection: localize all matched entities, if any exist
  - Should include negative images (with no matched entity)
  - Impractical to test all queries on each image
• The first benchmark protocol for visual entity detection with language queries: 3 difficulty levels with an increasing number of randomly chosen negative images per query.
  - Level 0: no negative images
  - Level 1: the same number as the positive images
  - Level 2: 5 times the number of positive images, or 20 (whichever is larger), for each test phrase
• Detection metric: average precision (AP)
  - mean AP (mAP): averages APs over all phrases; each phrase has its own decision threshold.
  - global AP (gAP): a single AP over all phrases; all phrases share the same decision threshold (requires scores to be calibrated across phrases; suited to zero-shot settings).
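To make the mAP/gAP distinction concrete, here is an illustrative Python sketch (not the official evaluation code): mAP ranks detections within each phrase separately, while gAP pools all detections of all phrases into a single ranking, so poorly calibrated scores hurt gAP much more.

```python
# Illustrative AP aggregation for the two detection metrics.
import numpy as np

def average_precision(scores, labels):
    """AP over a ranked list; labels are 1 for correct detections, 0 otherwise."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels, dtype=float)[order]
    if labels.sum() == 0:
        return 0.0
    cum_tp = np.cumsum(labels)
    precision = cum_tp / (np.arange(len(labels)) + 1)
    return float((precision * labels).sum() / labels.sum())

def mean_ap(per_phrase):
    """per_phrase: {phrase: (scores, labels)}; one AP per phrase, then averaged."""
    return float(np.mean([average_precision(s, l) for s, l in per_phrase.values()]))

def global_ap(per_phrase):
    """A single ranking shared by all phrases, so scores must be comparable."""
    scores = np.concatenate([np.asarray(s) for s, _ in per_phrase.values()])
    labels = np.concatenate([np.asarray(l) for _, l in per_phrase.values()])
    return average_precision(scores, labels)
```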

Localization and Detection with Natural Language Queries

• Traditional object detection: person, cat, dog, car, motorbike, airplane, bed, television, …
• Natural language query: a group of bikers, a building with lots of windows, …


Experiments

We release our implementation in both Caffe+MATLAB (original results) and TensorFlow.

• DBNet training is memory consuming. Our code performs sub-batch partitioning for the subnetworks to fit the GPU memory limit while keeping a large overall batch size for effective training (see the sketch after this list).
• DBNet outperforms captioning models. The dynamic bias term improves performance.
• DBNet shows higher per-phrase performance in terms of mAP.
• DBNet's scores are better "calibrated" across different phrases in terms of gAP.
• DBNet shows more robustness to negative images and rare text phrases.
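The sub-batch idea can be sketched in a framework-agnostic way as below. This is not the released Caffe/TensorFlow code; `forward_loss` and `apply_update` are hypothetical callbacks standing in for the framework-specific forward/backward pass and parameter update.

```python
# Hedged sketch of sub-batch partitioning: the overall batch of (region, phrase)
# pairs stays large for the loss, but each forward pass handles only a slice
# that fits in GPU memory; gradients are accumulated before a single update.
def train_step(pairs, max_pairs_per_subbatch, forward_loss, apply_update):
    total_loss, grads = 0.0, None
    for start in range(0, len(pairs), max_pairs_per_subbatch):
        sub = pairs[start:start + max_pairs_per_subbatch]
        loss, g = forward_loss(sub)                    # loss and gradients for this slice
        total_loss += loss * len(sub)
        grads = g if grads is None else [a + b for a, b in zip(grads, g)]
    # one parameter update for the whole (large) batch
    apply_update([g / len(pairs) for g in grads], total_loss / len(pairs))
```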


the classifiers for a small number of predefined categories in traditional object detection, our model is dynamically adaptable to different text phrases.

3.1. Model framework

Let x be an image, r be the coordinates of a region, and t be a text phrase. The verification model f(x, t, r; Θ) ∈ ℝ outputs the confidence of r's being matched with t. Suppose that l ∈ {1, 0} is the binary label indicating whether (t, r) is a positive or negative region-text pair on x. Our verification model learns to fit the probability of r and t being compatible (a positive pair), i.e., p(l = 1 | x, r, t). See Section B in the supplementary materials for a formalized comparison with conditional captioning models.

To this end, we develop a bimodal deep neural network for our model. In particular, f(x, t, r; Θ) is composed of two single-modality pathways followed by a discriminative pathway. The image pathway φ_rgn(x, r; Θ_rgn) extracts the d_rgn-dimensional visual representation of the image region r on x. The language pathway φ_txt(t; Θ_txt) extracts the d_txt-dimensional textual representation of the phrase t. The discriminative pathway with parameters Θ_dis dynamically generates a classifier for the visual representation according to the textual representation, and predicts whether r and t are matched on x. The full model is specified by Θ = (Θ_txt, Θ_rgn, Θ_dis).

3.2. Visual and linguistic pathways

RoI-pooling image network. We suppose the regions of interest are given by an existing region proposal method (e.g., EdgeBox [62], RPN [46]). We calculate visual representations for all image regions in one pass using the Fast R-CNN RoI-pooling pipeline. State-of-the-art image classification networks, including the 16-layer VGGNet [49] and ResNet-101 [17], are used as backbone architectures.

Character-level textual network. For an English text phrase t, we encode each of its characters into a 74-dimensional one-hot vector, where the alphabet is composed of 74 printable characters including punctuation and the space. Thus, t is encoded as a 74-channel sequence by stacking all character encodings. We use a character-level deep CNN [60] to obtain the high-level textual representation of t. In particular, our network has 6 convolutional layers interleaved with 3 max-pooling layers and followed by 2 fully connected layers (see Section A in the supplementary materials for more details). It takes a sequence of a fixed length as the input and produces textual representations of a fixed dimension. The input length is set to be long enough (here, 256 characters) to cover possible text phrases.² To avoid empty trailing characters in the input, we replicate the text phrase until reaching the input length limit.

We empirically found that the very sparse input can easily lead to over-sparse intermediate activations, which can create a large portion of "dead" ReLUs and finally result in a degenerate solution. To avoid this problem, we adopt the Leaky ReLU (LReLU) [35] to keep all hidden units active in the character-level CNN.

²The Visual Genome dataset has more than 2.8M unique phrases, whose median length in characters is 29. Fewer than 500 phrases have more than 100 characters.
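A minimal sketch of the character-level input encoding described above is given below. The exact 74-character alphabet is not reproduced here; a printable-ASCII slice of the same size is assumed for illustration, and the function name is ours.

```python
# Sketch of the character-level encoding: replicate the phrase to 256 characters,
# then one-hot encode each character over a 74-symbol alphabet.
import numpy as np
import string

# Assumed stand-in for the paper's 74 printable characters (letters, digits,
# space, and some punctuation); the paper's exact alphabet may differ.
ALPHABET = (string.ascii_letters + string.digits + " " + string.punctuation)[:74]
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
INPUT_LEN = 256

def encode_phrase(phrase):
    """Return a (74, 256) one-hot array for the replicated phrase."""
    repeated = (phrase * (INPUT_LEN // max(len(phrase), 1) + 1))[:INPUT_LEN]
    one_hot = np.zeros((len(ALPHABET), INPUT_LEN), dtype=np.float32)
    for pos, ch in enumerate(repeated):
        if ch in CHAR_INDEX:            # characters outside the alphabet stay all-zero
            one_hot[CHAR_INDEX[ch], pos] = 1.0
    return one_hot

x = encode_phrase("white dog with black spots")   # shape (74, 256)
```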

Other text embedding methods [29, 24, 27] can also be used in the DBNet framework. We use the character-level CNN because of its simplicity and flexibility. Compared to word-based models, it uses lower-dimensional input vectors and has no constraint on the word vocabulary size. Compared to RNNs, it easily allows deeper architectures.

3.3. Discriminative pathway

The discriminative pathway first forms a linear classifier using the textual representation of the phrase t. Its linear combination weights and bias are

    w(t) = A_w^T φ_txt(t; Θ_txt),    (1)
    b(t) = a_b^T φ_txt(t; Θ_txt),    (2)

where A_w ∈ ℝ^{d_txt × d_rgn}, a_b ∈ ℝ^{d_txt}, and Θ_dis = (A_w, a_b). This classifier is applied to the visual representation of the image region r on x, obtaining the verification confidence predicted by our model:

    f(x, r, t; Θ) = w(t)^T φ_rgn(x, r; Θ_rgn) + b(t).    (3)

Compared to the basic form of the bilinear function φ_txt(t; Θ_txt)^T A_w φ_rgn(x, r; Θ_rgn), our discriminative pathway includes an additional linear term as the text-dependent bias for the visual representation classifier.

As a natural way of modeling the cross-modality correlation, multiplication is also a source of instability during training. To improve the training stability, we introduce a regularization term Γ_dynamic = ||w(t)||_2^2 + |b(t)|^2 for the dynamic classifier, besides the network weight decay Γ_decay for Θ.

4. Model learning

In DBNet, we drive the training of the proposed two-pathway bimodal CNN with a binary classification objective. We pair image regions and text phrases as training samples. We define the ground truth binary label for each training region-text pair (Section 4.1), and propose a weighted training loss function (Section 4.2).

Training samples. Given M training images x_1, x_2, ..., x_M, let G_i = {(r_ij, t_ij)}_{j=1}^{N_i} be the set of ground truth annotations for x_i, where N_i is the number of annotations, r_ij is the coordinate of the j-th region, and t_ij is the text phrase corresponding to r_ij. When one region is paired with multiple phrases, we take each pair as a separate entry in G_i.

We denote the set of all regions considered on x_i by R_i, which includes both the annotated regions ∪_{j=1}^{N_i} {r_ij} and regions given by proposal methods [54, 62, 46]. We write T_i = ∪_{j=1}^{N_i} {t_ij} for the set of annotated text phrases on x_i, and T = ∪_{i=1}^{M} T_i for all training text phrases.

4.1. Ground truth labels

Labeling criterion. We assign each possible training region-text pair a ground truth label for binary classification.


Qualitative localization examples (Green: ground truth; Red: DenseCap; Yellow: DBNet).

Method         | Accuracy/% @IoU 0.3 | @IoU 0.5 | @IoU 0.7 | Median IoU
DenseCap       | 25.7 | 10.1 |  2.4 | 0.092
SCRC           | 27.8 | 11.0 |  2.5 | 0.115
DBNet w/o bias | 36.3 | 22.4 |  9.4 | 0.124
DBNet          | 38.3 | 23.7 |  9.9 | 0.152
DBNet (ResNet) | 42.3 | 26.4 | 11.2 | 0.205

a. Localization performance

c. Quantitative comparison

Data, Code & Model: DBNet.link

Model Training: Labels, Objectives, and Optimization

• Any region-text pair (r, t) can potentially be used in training.
• r can be a ground-truth (GT) or proposed (EdgeBox) region; t can be an annotation on the same image as r or come from the rest of the training set.
• Spatial-overlap-based training labels:
  - 1: r has large overlap with a ground truth region of t.
  - Uncertain: (r, t) is not positive, and r has moderate overlap with a ground truth region of t (excluded from training). Text-similarity-based uncertainty augmentation: if t' is similar to t, then t' should also have an uncertain label with r.
  - 0: otherwise (including all other text phrases).
• Given a region, the non-uncertain phrases are categorized into: positive phrases (pos), negative phrases from the annotations on the same image (neg), and negative phrases from the rest of the images (rest).
• The training loss is normalized separately for the three categories:
      L_i = λ_pos L_i^pos + λ_neg L_i^neg + λ_rest L_i^rest,
  where we set λ_pos = λ_neg + λ_rest, and the sampling strategy in SGD determines λ_neg (larger) and λ_rest (smaller).
• Optimization: (initialization) initialize the visual pathway with a pretrained Faster R-CNN; (phase 1) train the text pathway from scratch with the visual pathway parameters fixed; (phase 2) jointly fine-tune the two pathways; (phase 3) decrease the learning rate. A staged schedule is sketched below.
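The staged schedule can be sketched as follows. This is a PyTorch-style illustration only (the released implementations are Caffe+MATLAB and TensorFlow); the stand-in modules and learning rates are assumptions, not the paper's settings.

```python
# Hedged sketch of the three-phase optimization schedule with small stand-in
# modules; the real pathways are the Fast R-CNN visual network and the
# character-level text CNN.
import torch
import torch.nn as nn

visual_net = nn.Linear(4096, 4096)    # stand-in for the pretrained visual pathway
text_net = nn.Linear(256 * 74, 512)   # stand-in for the text pathway (trained from scratch)
dis_head = nn.Linear(512, 4097)       # stand-in for the weight/bias generator

def make_optimizer(modules, lr):
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    return torch.optim.SGD(params, lr=lr, momentum=0.9)

# Phase 1: freeze the (Faster R-CNN initialized) visual pathway,
# train the text pathway and discriminative head from scratch.
for p in visual_net.parameters():
    p.requires_grad_(False)
opt = make_optimizer([text_net, dis_head], lr=1e-3)   # illustrative learning rate

# Phase 2: unfreeze and jointly fine-tune both pathways.
for p in visual_net.parameters():
    p.requires_grad_(True)
opt = make_optimizer([visual_net, text_net, dis_head], lr=1e-3)

# Phase 3: continue joint training with a decreased learning rate.
opt = make_optimizer([visual_net, text_net, dis_head], lr=1e-4)
```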


[Figure 2 example: for an anchor image region around a duck, ground-truth phrases and their IoU with the anchor include 0.88 "duck is standing", 0.86 "brown duck with orange beak", 0.48 "torso of duck", 0.32 "male duck", 0.09 "duck is getting in the water", 0.07 "torso of a duck", 0.00 "waterfall into a fountain", and 0.00 "yellow flowers in the plant". Legend: anchor region, any image region used for training; IoU ≥ η_pos, GT regions of positive text phrases; η_neg < IoU < η_pos, GT regions of ambiguous text phrases; IoU ≤ η_neg, GT regions of negative text phrases. A potential negative phrase is marked as ambiguous if it is already in the ambiguous phrase set.]

Figure 2: Ground truth labels for region-text pairs (given an arbitrary image region). Phrases are categorized into positive, ambiguous, and negative sets based on the given region's overlap with ground truth boxes (measured by IoU and displayed as the numbers in front of the text phrases). Ambiguous phrases augmented by text similarity are not shown here (see the video in the supplementary materials for an illustration). For visual clarity, η_neg = 0.3 and η_pos = 0.7 here, which are different from the rest of the paper.

For a region r on the image x_i and a text phrase t ∈ T_i, we take the largest overlap between r and t's ground truth regions as evidence to determine (r, t)'s label. Let IoU(·, ·) denote the intersection over union. The largest overlap is defined as

    ν_i(r, t) = max_{r' ∈ R_i} {IoU(r', r) : (r', t) ∈ G_i}.    (4)

In object detection with a limited number of categories (i.e., T_i consists of category labels), ν_i(r, t) is usually reliable enough for assigning binary training labels, given the (almost) complete ground truth annotations for all categories.

In contrast, text phrase annotations are inevitably incomplete in the training set. One image region can have an intractable number of valid textual descriptions, including different points of focus and paraphrases of the same description, so annotating all of them is infeasible. Consequently, ν_i(r, t) cannot always reflect the consistency between an image region and a text phrase. To obtain reliable training labels, we define positive labels in a conservative manner; we then combine text similarity with spatial IoU to establish the ambiguous text phrase set that reflects potential "false negative" labels. We provide detailed definitions below.

Positive phrases. For a region r on x_i, its positive text phrases (i.e., phrases assigned positive labels) constitute the set

    P_i(r) = {t ∈ T_i : ν_i(r, t) ≥ η_pos},    (5)

where η_pos is a high enough IoU threshold (= 0.9) to determine positive labels. Some positive phrases may be missing due to incomplete annotations. However, we do not try to recover them (e.g., using text similarity), as doing so may introduce "false positive" training labels.

Ambiguous phrases. Still for the region r, we collect the text phrases whose ground truth regions have moderate (neither too large nor too small) overlap with r into a set

    U_i(r) = {t ∈ T_i : η_neg < ν_i(r, t) < η_pos},    (6)

where η_neg is the IoU lower bound (= 0.1). When r's largest IoU with the ground truths of a phrase t lies in (η_neg, η_pos), it is uncertain whether t is positive or negative. In other words, t is ambiguous with respect to the region r.

Note that U_i(r) only contains phrases from T_i. To cover all possible ambiguous phrases from the full set T, we use a text similarity measurement sim(·, ·) to augment U_i(r) into the finalized ambiguous phrase set

    A_i(r) = {t ∈ T : ∃ t' ∈ U_i(r), sim(t, t') > τ} \ P_i(r),    (7)

where we use the METEOR [4] similarity for sim(·, ·) and set the text similarity threshold τ = 0.3.³

Labels for region-text pairs. For any image region r on x_i and any phrase t ∈ T, the ground truth label of (r, t) is

    y_i(r, t) = 1,             if t ∈ P_i(r),
                ⟨uncertain⟩,   if t ∈ A_i(r),
                0,             otherwise,    (8)

where the pairs of a region and its ambiguous text phrases are assigned the "uncertain" label to avoid false negative labels. Figure 2 illustrates the region-text labels for an arbitrary training image region.
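The labeling rules of Eqs. (4)-(8) can be sketched in a few lines of Python. The thresholds follow the paper (η_pos = 0.9, η_neg = 0.1, τ = 0.3), but the function names, the box format, and the `sim` argument (a stand-in for METEOR similarity) are our own illustrative choices.

```python
# Sketch of the region-text labeling in Section 4.1.
ETA_POS, ETA_NEG, TAU = 0.9, 0.1, 0.3

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    def area(bb):
        return max(0.0, bb[2] - bb[0]) * max(0.0, bb[3] - bb[1])
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (area(a) + area(b) - inter + 1e-12)

def label_region(r, annotations, all_phrases, sim):
    """annotations: list of (gt_box, phrase) on this image; all_phrases: the set T.
    Returns {phrase: 1 | 0 | 'uncertain'} for every phrase in all_phrases."""
    # nu_i(r, t): largest IoU between r and any ground-truth box of phrase t
    nu = {}
    for gt_box, t in annotations:
        nu[t] = max(nu.get(t, 0.0), iou(r, gt_box))
    positives = {t for t, v in nu.items() if v >= ETA_POS}
    moderate = {t for t, v in nu.items() if ETA_NEG < v < ETA_POS}
    # augment the ambiguous set with phrases similar to any moderately
    # overlapping phrase (Eq. 7), then remove the positives
    ambiguous = ({t for t in all_phrases if any(sim(t, u) > TAU for u in moderate)}
                 | moderate) - positives
    return {t: 1 if t in positives else ('uncertain' if t in ambiguous else 0)
            for t in all_phrases}
```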

4.2. Weighted training loss

Effective training sets. On the image x_i, the effective set of training region-text pairs is

    S_i = {(r, t) ∈ R_i × T : y_i(r, t) ≠ ⟨uncertain⟩},    (9)

where, as previously defined, R_i consists of annotated and proposed regions, and T consists of all phrases from the training set. We exclude samples with uncertain labels.

We partition S_i into three subsets according to the value of y_i(r, t) and the origin of the phrase t: S_i^pos for y_i(r, t) = 1, S_i^neg for y_i(r, t) = 0 ∧ t ∈ T_i, and S_i^rest for all negative region-text pairs containing phrases from the rest of the training set (i.e., not from x_i).

Per-image training loss. Let f_i(r, t) = f(x_i, r, t; Θ) ∈ ℝ for notational convenience, and let ℓ(·, ·) be a binary classification loss, in particular, the cross-entropy loss of logistic regression. We define the training loss on x_i as the summation of three parts:

    L_i = λ_pos L_i^pos + λ_neg L_i^neg + λ_rest L_i^rest,    (10)

    L_i^pos = (1 / |S_i^pos|) Σ_{(r,t) ∈ S_i^pos} ℓ(f_i(r, t), 1),    (11)

    L_i^neg = (1 / |S_i^neg|) Σ_{(r,t) ∈ S_i^neg} ℓ(f_i(r, t), 0),    (12)

    L_i^rest = [Σ_{(r,t) ∈ S_i^rest} freq(t) · ℓ(f_i(r, t), 0)] / [Σ_{(r,t) ∈ S_i^rest} freq(t)],    (13)

³If the METEOR similarity of two phrases is greater than 0.3, they are usually very similar. In Visual Genome, ~0.25% of all possible pairs formed by the text phrases that occur ≥ 20 times pass this threshold.
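A small sketch of the per-image loss in Eqs. (10)-(13) is given below, assuming the scores f_i(r, t) have already been computed. The logistic cross-entropy matches the paper's choice of ℓ; the default λ values are placeholders that merely satisfy the poster's constraint λ_pos = λ_neg + λ_rest with λ_neg > λ_rest, not the paper's actual settings.

```python
# Sketch of the weighted, per-category-normalized training loss on one image.
import math

def bce(score, label):
    """Cross-entropy loss of logistic regression on a raw score."""
    score = max(min(score, 30.0), -30.0)       # clamp to avoid overflow in exp()
    p = 1.0 / (1.0 + math.exp(-score))
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def per_image_loss(pos_scores, neg_scores, rest_scores_and_freqs,
                   lam_pos=1.0, lam_neg=0.7, lam_rest=0.3):
    """pos_scores / neg_scores: scores of S_i^pos and S_i^neg pairs;
    rest_scores_and_freqs: list of (score, freq(t)) for S_i^rest pairs."""
    L_pos = sum(bce(s, 1) for s in pos_scores) / max(len(pos_scores), 1)
    L_neg = sum(bce(s, 0) for s in neg_scores) / max(len(neg_scores), 1)
    total_freq = sum(f for _, f in rest_scores_and_freqs)
    L_rest = (sum(f * bce(s, 0) for s, f in rest_scores_and_freqs)
              / max(total_freq, 1e-12))
    return lam_pos * L_pos + lam_neg * L_neg + lam_rest * L_rest
```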


[Architecture diagram: the image region passes through Fast R-CNN to produce the image feature; the text phrase passes through the character-level CNN and an FC layer to produce the text feature, which generates the classifier; applying the classifier to the image feature gives the detection score.]

References
[DenseCap] J. Johnson, A. Karpathy, and L. Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning". In CVPR, 2016.
[SCRC] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. "Natural language object retrieval". In CVPR, 2016.


Scenario                      | # unique phrases
Training set                  | 2,113,688
Test set                      | 180,363
Test set (unseen in training) | 119,315
Per image (level 0)           | ~43
Per image (level 1)           | ~92
Per image (level 2)           | ~775

AP / %         | Level 0 mAP | Level 0 gAP | Level 1 mAP | Level 1 gAP | Level 2 mAP | Level 2 gAP
DenseCap       | 36.2 |  1.8 | 22.9 |  1.0 |  4.1 | 0.1
SCRC           | 38.5 |  2.2 | 37.5 |  1.7 | 29.3 | 0.7
DBNet          | 48.1 | 23.1 | 45.5 | 21.0 | 26.7 | 8.0
DBNet (ResNet) | 51.1 | 24.2 | 48.3 | 22.2 | 29.7 | 9.0

b.1. Detection APs for [email protected]

AP / %         | Level 0 mAP | Level 0 gAP | Level 1 mAP | Level 1 gAP | Level 2 mAP | Level 2 gAP
DenseCap       | 15.7 |  0.5 | 10.0 |  0.3 |  1.7 | 0.0
SCRC           | 16.5 |  0.5 | 16.3 |  0.4 | 12.8 | 0.2
DBNet          | 30.0 | 10.8 | 28.8 |  9.9 | 17.7 | 3.9
DBNet (ResNet) | 32.6 | 11.5 | 31.2 | 10.7 | 19.8 | 4.3

b.2. Detection APs for [email protected]

AP / %         | Level 0 mAP | Level 0 gAP | Level 1 mAP | Level 1 gAP | Level 2 mAP | Level 2 gAP
DenseCap       |  3.4 | 0.0 |  2.1 | 0.0 | 0.3 | 0.0
SCRC           |  3.4 | 0.0 |  3.4 | 0.0 | 2.7 | 0.0
DBNet          | 11.6 | 2.1 | 11.4 | 2.0 | 7.6 | 0.9
DBNet (ResNet) | 12.9 | 2.2 | 12.6 | 2.1 | 8.5 | 0.9

b.3. Detection APs for [email protected]

Figure 5: Qualitative detection results of DBNet with ResNet-101. We show detection results of six different text phrases on each image (e.g., "blanket covering mother and son", "bus with the route number 21", "a brown basket full of fruit", "a snowboarder with a red jacket", "a framed picture on the wall", "a doorway with an arched entryway"). For each image, the colors of the bounding boxes correspond to the colors of the text tags on the right. The semi-transparent boxes with dashed boundaries are ground truth regions, and the boxes with solid boundaries are detection results.

Prune ambiguous phrases | Phrases from other images | Finetune visual pathway | Recall@0.3 | Recall@0.5 | Recall@0.7 | Median IoU | Mean IoU | mAP@0.3 | mAP@0.5 | mAP@0.7 | gAP@0.3 | gAP@0.5 | gAP@0.7
No  | No  | No  | 30.6 | 17.5 | 7.8 | 0.066 | 0.211 | 35.5 | 22.0 |  8.6 |  8.3 | 3.1 | 0.4
Yes | No  | No  | 34.5 | 21.2 | 9.0 | 0.113 | 0.237 | 39.0 | 24.6 |  9.7 | 15.5 | 7.4 | 1.6
Yes | Yes | No  | 34.7 | 21.1 | 8.8 | 0.119 | 0.238 | 41.3 | 25.6 | 10.0 | 17.2 | 7.9 | 1.6
Yes | Yes | Yes | 38.3 | 23.7 | 9.9 | 0.152 | 0.258 | 45.5 | 28.8 | 11.4 | 21.0 | 9.9 | 2.0

(Recall/% and Median/Mean IoU are localization results; mAP/% and gAP/% are Level-1 detection results at the given IoU thresholds.)

Table 4 / e. Ablation study of DBNet's major components. The visual pathway is based on the 16-layer VGGNet.

[3] …, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In CVPR, 2015.
[4] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[5] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):677-691, April 2017.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[12] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[13] R. Girshick. Fast R-CNN. In ICCV, 2015.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142-158, Jan 2016.
[15] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.


G.2. Random detection results with phrase-dependent thresholds

In Figure 11, we used phrase-dependent decision thresholds to determine how many regions were detected on an image. We set the threshold so that the detection precision at the IoU threshold of 0.5 equals 0.5, when applicable. DBNet outperformed DenseCap and SCRC significantly. DenseCap and SCRC produced many false alarms or missed detections. Note that DBNet could usually achieve the 0.5 precision at a reasonable recall level, whereas DenseCap and SCRC might either fail to achieve the 0.5 precision at all or give a low recall.
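The phrase-dependent threshold selection described above can be sketched as below; this is an illustrative reconstruction, not the evaluation code, and it assumes the per-phrase detections have already been labeled correct or incorrect under the IoU-0.5 matching.

```python
# Pick, for one phrase, a score threshold at which detection precision
# reaches the target (0.5), if such a threshold exists.
import numpy as np

def threshold_for_precision(scores, correct, target_precision=0.5):
    order = np.argsort(-np.asarray(scores))
    scores = np.asarray(scores)[order]
    correct = np.asarray(correct, dtype=float)[order]
    precision = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    # keep the largest number of top-scoring detections whose precision
    # still meets the target
    ok = np.where(precision >= target_precision)[0]
    if len(ok) == 0:
        return None        # the model never reaches the target precision
    return float(scores[ok[-1]])
```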

Figure 11: Qualitative detection results of DBNet, DenseCap, and SCRC using phrase-dependent detection thresholds. Detection results of four different text phrases are shown for each image (e.g., "a man jumping a skateboard", "a man wearing a red shirt", "black shirt of tennis player", "apple on the counter"). The colors of the bounding boxes correspond to the colors of the text phrases on the left. The semi-transparent boxes with dashed boundaries are ground truth regions, and the boxes with solid boundaries are detection results of the three models.

F.1. More qualitative comparison with DenseCap

Figure 6: Qualitative comparison between DBNet and DenseCap on the localization task. Examples are randomly sampled. Green boxes: ground truth; Red boxes: DenseCap; Yellow boxes: DBNet. The numbers are IoU with the ground truth boxes.


F.2. More qualitative comparison with SCRC

Figure 7: Qualitative comparison between DBNet and SCRC on the localization task. Examples are randomly sampled. Green boxes: ground truth; Red boxes: SCRC; Yellow boxes: DBNet. The numbers are IoU with the ground truth boxes.