ARTICLE IN PRESS
Pattern Recognition Letters 000 (2018) 1–8
Contents lists available at ScienceDirect
Pattern Recognition Letters
journal homepage: www.elsevier.com/locate/patrec
Deep contour and symmetry scored object proposal
Wei Ke a,b, Jie Chen b, Qixiang Ye a,∗
a School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, 101408, China
b Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, 90570, Finland
Article info
Article history:
Available online xxx
MSC:
41A05
41A10
65D05
65D17
Keywords:
Object proposal
Super-pixel grouping
FCN
Proposal scoring
Abstract
Object proposal has been successfully applied in recent supervised and weakly supervised visual object detection tasks to improve computational efficiency. Classical grouping-based object proposal approaches can produce region proposals with high localization accuracy, but incorporate significant redundancy due to the lack of object confidence for evaluating the proposals. In this paper, we propose leveraging essential properties of images, i.e., contour and symmetry, to score the redundant region proposals. Specifically, contour and symmetry are extracted by a Simultaneous Contour and Symmetry Detection Network (SCSDN) and used to score bounding boxes within a Bayesian framework, which guarantees that the scoring procedure is adaptive to general objects. A subset of high-scored proposals preserves the recall rate while significantly decreasing the redundancy. Experimental results show that the proposed approach improves the baseline by increasing the recall rate from 0.87 to 0.89 on the PASCAL VOC 2007 dataset. It also outperforms the state-of-the-art on AUC and uses much fewer object proposals to achieve a comparable recall rate.
© 2018 Published by Elsevier B.V.
1. Introduction

Object localization is the first important step for visual object detection. Conventional detection approaches [10,11,15], which use a sliding-window strategy to localize objects, tend to generate up to millions of candidate windows. The classification of such a large number of windows in the subsequent steps is computationally expensive, particularly when complex features and/or classification methods are used. Recently, an alternative way, i.e., object proposal, has been investigated to improve the efficiency of object localization. Object proposal tends to produce much fewer (up to two orders of magnitude fewer) windows than the sliding-window strategy, which no doubt significantly improves the computational efficiency [18]. Besides, object proposal is necessary for weakly supervised object detection [36,39] and instance-based semantic segmentation [9]. As these tasks provide only image-level annotations without object bounding boxes, it is a good choice to discover object candidates in an unsupervised way. A higher recall rate with fewer object proposals can increase the robustness of models and reduce the object-level search space.

Object proposal approaches in the literature can be coarsely categorized into two: grouping-based and objectness-based ones. In the first category, bounding boxes are produced via hierarchically merging super-pixels. With various super-pixel/region grouping strategies, the generated proposals contribute to high recall rate and localization accuracy, but they are redundant and lack object confidences. In the second category, a multi-scale sliding window strategy is used to produce object bounding boxes. A delicate objectness measurement is then designed to measure how likely a bounding box is to contain an interesting object. Nevertheless, such objectness approaches, without a precise segmentation procedure, report lower recall rate and localization accuracy than the grouping-based ones. Considering that missed objects cannot be recovered in the subsequent stages, objectness-based approaches are not competent for tasks where recall rate is the first concern.

In this paper, we propose a new object proposal approach that fully utilizes the advantages of both super-pixel grouping and objectness strategies. Our approach first produces redundant bounding boxes using super-pixel grouping approaches to achieve a high recall rate. It then scores each bounding box using multiple objectness properties, i.e., deep contour and symmetry, and chooses a subset of high-scored proposals as the solution. Our motivation is that a true object region should have one or more distinct properties contrasting with its surroundings, i.e., some objects could have clear contour, while others have distinct symmetry parts, or

∗ Corresponding author at: School of Electronic, Electrical and Communication Engineering of University of Chinese Academy of Sciences, Room 457, Xueyuan II BLDG, Yanqi Lake, Huairou, Beijing, 101400, China.
E-mail addresses: [email protected] (J. Chen), [email protected] (Q. Ye).
https://doi.org/10.1016/j.patrec.2018.01.004
0167-8655/© 2018 Published by Elsevier B.V.
Please cite this article as: W. Ke et al., Deep contour and symmetry scored object proposal, Pattern Recognition Letters (2018), https://doi.org/10.1016/j.patrec.2018.01.004
both. Such properties are beneficial to effectively reduce the redundant regions produced by the super-pixel grouping approaches.

The contributions of our approach are summarized as follows:

• We propose the Simultaneous Contour and Symmetry Detection Network (SCSDN), which can uniformly extract image properties including contour and symmetry.
• We propose Similarity Adaptive Search (SAS) to generate object bounding boxes, which improves Selective Search by adaptively calculating the similarity between super-pixel subsets.
• We propose using the Bayesian framework to score redundant bounding boxes and choose a subset of high-scored proposals as the solution. Experiments demonstrate that we can use significantly fewer high-scored object proposals to achieve recall rates comparable with the baseline Selective Search.

The remaining parts of this paper are organized as follows: related works are reviewed in Section 2. The proposed approach is detailed in Section 3. Experimental results are given in Section 4, and we finally conclude in Section 5.

2. Related works

Object proposal approaches are coarsely categorized into grouping-based and objectness-based ones. Grouping-based approaches usually root in image segmentation and region grouping strategies, in a bottom-up manner. Objectness-based approaches, in contrast, adopt a sliding window strategy to generate object candidates, in a top-down manner.

The extensively investigated image segmentation algorithms, e.g., Constrained Parametric Min-Cuts (CPMC) [5], can be directly used to generate object proposals. Considering that segmented regions are insufficient to cover objects with a high recall rate, Uijlings et al. propose a hierarchical strategy to merge color-homogeneous regions and generate object proposals. They leverage multiple low-level features and multiple merging functions, named Selective Search, to generate redundant bounding boxes so that as many objects as possible are covered [35]. Manen et al. further improve the merging strategy of Selective Search by using learned weights as the measurement to merge super-pixels [27]. Different from Selective Search, which uses single-scale segmentation, MCG [3] explores multi-scale hierarchical segmentation regions for merging, achieving a higher recall rate at the cost of computational efficiency. By taking advantages of both CPMC and Selective Search, Rantalankila et al. propose using a grouping process with a large pool of features and generate segmentations using a CPMC-like process [31]. Xiao et al. propose a complexity-adaptive distance metric for super-pixel merging, which improves region grouping at different levels of complexity [37]. Chen et al. focus on the object proposal localization bias and propose multi-thresholding straddling expansion (MTSE) to reduce localization bias using super-pixel tightness [7].

Although grouping-based approaches produce region proposals with a high recall rate, they tend to produce many redundant proposals. Furthermore, the underlying image segmentation procedure is usually time-consuming. Recently, more and more attempts have been made to generate object proposals based on the object confidence of sliding windows, which avoids the image segmentation procedure, increasing the computational efficiency and decreasing the proposal number.

Objectness [1], scoring how likely a detection window contains an object, is the pioneer in exploring sliding-window-based object proposal. The score is estimated based on a combination of multiple cues including saliency, color contrast, edge density, location and size statistics. Feng et al. score sliding windows with saliency cues, and Randomized-Seeds scores them with super-pixel straddling [17]. Cheng et al. propose Binarized Normed Gradients (BING) using sliding windows for object proposal, based on an efficient weak classifier trained using binarized gradients. With a delicate design of binary computations, a low computational cost of BING can be guaranteed, which, as reported, reaches 300 FPS on a PC platform [8]. EdgeBoxes is conducted in a sliding window manner at multiple scales and multiple aspect ratios [41]. The scores of objects are estimated by the number of complete contours detected with the structured forest method.

To combine the advantages of grouping-based and objectness-based approaches and reduce the number of object proposals, Krähenbühl and Koltun train a regressor model to compute the object confidence with the object ground-truth and binary segmentation masks. However, it works in a supervised way and is limited by the trained regressor, as most object datasets are without segmentation masks [22,23].

The power of Convolutional Neural Networks (CNN) has been explored to compute objectness recently. In [21] the shallow CNN layers are fed into fast decision forests to produce robust object proposals. In [24], a deep score is learned by a CNN and is used for updating the confidence of object proposals of EdgeBoxes. Region Proposal Network (RPN) [32] utilizes deep learning features to score the sliding window. Benefiting from the powerful representation of convolutional features, the RPN needs only hundreds of bounding boxes to achieve a recall rate similar to other objectness-based approaches that usually output thousands of proposals. Pinheiro et al. train a CNN model to output segmentation masks as well as object confidence [28]. They also refine the segmentation results by a bottom-up/top-down CNN architecture [29]. Nevertheless, these methods are all data-driven and require precisely annotated training samples to achieve high accuracy, which limits their applications in many tasks, e.g., semantic segmentation and weakly supervised object detection, where no precise object annotation is available.

3. Methodology

The proposed approach, as shown in Fig. 1, first produces redundant bounding boxes by grouping super-pixels hierarchically and extracts general object properties, i.e., deep contour and symmetry, with a uniform fully convolutional neural network. With the extracted object properties, Bayesian scoring is then proposed to score each bounding box. A subset of high-scored proposals is selected to guarantee a high recall rate while significantly decreasing the proposal number.

3.1. Deep contour and symmetry extraction

Contour is a pixel-based low-level feature representation with good generalization ability. Usually, a trained contour model can be directly used on other images [41]; that is to say, contour is used in an unsupervised manner. The symmetry factor has a similar property. Instead of using traditional contour and symmetry approaches, we use a Convolutional Neural Network (CNN) to extract these low-level features. The recently developed Fully Convolutional Networks (FCN) [26] make it possible to output a pixel-based response map. Holistically-Nested Edge Detection (HED) [38] integrates FCN with deep supervision on the side-outputs of a trunk network for contour detection. Considering that symmetry has a similar property to contour, we use a similar architecture to HED to extract the symmetry map. In order to reduce the computation time and the network size, we use only one trunk network for both contour and symmetry detection, as shown in Fig. 2. Using the 16-layer VGG [6] as the trunk network, a contour buffer convolutional layer (CBuf) and a symmetry buffer convolutional layer (SBuf) are added to each stage of VGG to form the contour branch and the symmetry branch. The uniform network is named the Simultaneous Contour and Symmetry Detection Network (SCSDN). As the contour
Fig. 1. Flowchart of the proposed approach. For an input image, a uniform deep network, i.e., the Simultaneous Contour and Symmetry Detection Network (SCSDN), is designed to extract deep contour and symmetry maps. Meanwhile, hierarchical super-pixel grouping is used to generate redundant bounding boxes, each of which is scored with Bayesian scoring using deep contour and symmetry. High-scored bounding boxes are output as object proposals (green boxes). As we use the pre-trained SCSDN to extract contour and symmetry, our approach still works in an unsupervised manner. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2. The convolutional neural network architecture of the Simultaneous Contour and Symmetry Detection Network (SCSDN), which outputs contour and symmetry maps simultaneously. We add buffer convolutional layers (CBuf and SBuf) on the trunk network to form the contour branch network and the symmetry branch network, and the two branches are trained in order.
and symmetry detection are performed in an end-to-end manner with full convolution, it takes about one hundred milliseconds to process one image.

In SCSDN, both the contour branch and the symmetry branch affect the parameters of the trunk network, which leads to some instability during the multi-task learning. Compared to HED, the buffer convolutional layers are added to prevent the loss of each branch from being back-propagated directly to the trunk network. With buffer layers, the two branches of SCSDN can be learned in order.

In the training phase, supervision is taken on both the side-outputs and the fused-output, i.e., the weighted side-outputs. The loss functions for the contour branch and the symmetry branch are

L_c = L_side(W_t, W_c, Φ_c) + L_fused(W_t, W_c, Φ_c, h_c), (1)

L_s = L_side(W_t, W_s, Φ_s) + L_fused(W_t, W_s, Φ_s, h_s), (2)

where W_t, W_c, and W_s are the parameters of the trunk network and of the buffer layers for the contour and symmetry branches; Φ_c and Φ_s are the classifiers of the side-outputs; h_c and h_s are the fusing weights of the side-outputs; L_side and L_fused are the losses of the side-outputs and fused-output, respectively. In the testing phase, the contour and symmetry maps are extracted by applying sigmoid processing to the fused-outputs.
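The structure of Eqs. (1)–(2) is a sum of per-side-output losses plus a loss on the fused (weighted) output. The sketch below is our own illustration of that structure for one branch in plain NumPy, with hypothetical map sizes and pixel-wise binary cross-entropy standing in for L_side and L_fused; it is not the authors' training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(prob, label, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over the map."""
    prob = np.clip(prob, eps, 1 - eps)
    return -np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob))

def branch_loss(side_logits, fuse_weights, label):
    """L = L_side + L_fused as in Eqs. (1)-(2): the sum of losses over
    the side-outputs, plus the loss of the fused-output, which is a
    weighted combination of the side-outputs."""
    l_side = sum(bce(sigmoid(s), label) for s in side_logits)
    fused = sigmoid(sum(h * s for h, s in zip(fuse_weights, side_logits)))
    l_fused = bce(fused, label)
    return l_side + l_fused

# toy example: three side-outputs over an 8x8 response map
rng = np.random.default_rng(0)
sides = [rng.normal(size=(8, 8)) for _ in range(3)]
label = (rng.random((8, 8)) > 0.5).astype(float)
loss = branch_loss(sides, [1 / 3, 1 / 3, 1 / 3], label)
assert loss > 0
```

In the paper's setting the same trunk activations would feed both branches, so the two branch losses share W_t; training the branches in order (as the buffer layers allow) amounts to optimizing one such loss at a time.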
Fig. 3. Illustration of region grouping. (a) Each super-pixel group is bounded with a solid-line box. The super-pixel groups are {c}, {a1, a2}, {b1, b2}, and {b1, b2, b3}. (b) The new super-pixel subsets after grouping. (For interpretation of the references to color in the text, the reader is referred to the web version of this article.)
Fig. 4. Illustration of the contour map and symmetry map. The green bounding box is more likely to contain an object than the blue one, as the green one contains more complete contours and symmetry parts. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Discussion: In the experiments of [38], one HED model is effective and efficient for edge detection compared with traditional contour detection methods. If we used one HED model for contour and another for symmetry, it would take twice the parameters and computational time of one HED model. In SCSDN, the parameters and computational time are similar to those of a single HED, thanks to parameter sharing in the trunk network. It keeps the advantages in detection performance while saving computational resources.

3.2. Redundant bounding box generation

Given an image, we follow the idea of Selective Search [35] to generate redundant bounding boxes. In this paradigm, the image is partitioned into hundreds of super-pixels, which are then grouped hierarchically with different similarity metrics on adjacent super-pixel pairs. The initial super-pixel regions are generated by the fast segmentation method [16]. The similarities of all adjacent super-pixel pairs are then calculated, and the two most similar regions are grouped together. By iteratively grouping, Selective Search ends when all the regions are merged together. The minimal bounding box containing each merged super-pixel pair is output, as shown in Fig. 3(a). The similarity for the region pair (r_i, r_j) is measured as:

d(r_i, r_j) = a_1 · d_c(r_i, r_j) + a_2 · d_t(r_i, r_j) + a_3 · d_s(r_i, r_j) + a_4 · d_f(r_i, r_j), (3)

where d_c, d_t, d_s, and d_f are four base similarity measures, indicating preferential grouping of similar color, similar texture, small size, and region pairs that fit into each other, respectively; a_i ∈ {0, 1} is an indicator for using or disabling each base similarity measure, i.e., the measure is counted when a_i = 1 and ignored otherwise. The final redundant region pool is constituted with different combinations of a_i.

3.2.1. Similarity Adaptive Search

To further improve the recall rate, we propose to use more powerful similarity measurements to update the classical Selective Search method. In its hierarchical grouping procedure, super-pixel subsets are generated, shown as regions A, B, and C in Fig. 3(b). The color and texture similarity of a subset pair in Selective Search is calculated as

D_mean(R_m, R_n) = a_1 · d_c(R_m, R_n) + a_2 · d_t(R_m, R_n), (4)

where R_m and R_n are two super-pixel subsets that are treated as adjacent super-pixels. That is to say, merging A with B or merging A with C is determined only by the mean color and/or texture similarity of the whole regions in Fig. 3(b).

Usually, super-pixel subsets are complex, such that the mean color and texture of two subsets are significantly different while the connected super-pixels in the subsets are similar. Taking Fig. 3 and color similarity as an example, merging A with B or merging A with C is determined not only by d_c(A, B) but also by d_c(a1, b2) and d_c(a2, c). Considering the connected super-pixels, the low and high complexity similarities in [37] are respectively defined as:

D_L(R_m, R_n) = min{a_1 · d_c(r_i, r_j) + a_2 · d_t(r_i, r_j) | r_i ∈ R_m, r_j ∈ R_n}, (5)

D_H(R_m, R_n) = max{a_1 · d_c(r_i, r_j) + a_2 · d_t(r_i, r_j) | r_i ∈ R_m, r_j ∈ R_n}. (6)

In our Similarity Adaptive Search (SAS), the distance between two super-pixel subsets is calculated as

D(R_m, R_n) = b_1 · D_mean(R_m, R_n) + b_2 · (ρ_{m,n} · D_L(R_m, R_n) + (1 − ρ_{m,n}) · D_H(R_m, R_n)) + b_3 · D_s(R_m, R_n) + b_4 · D_f(R_m, R_n), (7)

where b_i ∈ {0, 1} denotes whether the corresponding similarity measure is used; ρ_{m,n} indicates the complexity level of the super-pixel subsets R_m and R_n; D_s and D_f are defined as in Eq. (3). When R_m and R_n are individual super-pixels, Eq. (7) is equivalent to Eq. (3).

With the defined similarity measures, we use the hierarchical merging procedure to obtain redundant bounding boxes. We also follow [7] in using tightness to rectify the bounding boxes so that their locations are more accurate.

Discussion: Compared with Selective Search [35], the proposed SAS considers not only the similarity of single super-pixels but also the similarity of subsets of super-pixels. Selective Search is a special case of SAS when the complexity similarity is not considered, i.e., b_2 = 0. SAS keeps the diversity of super-pixel merging while increasing its adaptivity.
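As a concrete illustration of Eq. (7) (our own sketch, not the authors' implementation), the complexity-adaptive term can be computed from per-super-pixel feature vectors; here a hypothetical histogram-intersection color measure stands in for d_c, the texture/size/fill terms are omitted, and rho is treated as a given complexity weight:

```python
import numpy as np

def d_c(fi, fj):
    """Hypothetical color similarity: histogram intersection of
    normalized color histograms."""
    return float(np.minimum(fi, fj).sum())

def sas_distance(Rm, Rn, rho, b=(1, 1)):
    """Sketch of Eq. (7) restricted to its color terms:
    b1 * D_mean + b2 * (rho * D_L + (1 - rho) * D_H),
    where Rm and Rn are lists of per-super-pixel color histograms."""
    d_mean = d_c(np.mean(Rm, axis=0), np.mean(Rn, axis=0))  # Eq. (4)
    pair = [d_c(fi, fj) for fi in Rm for fj in Rn]
    d_low, d_high = min(pair), max(pair)                    # Eqs. (5)-(6)
    return b[0] * d_mean + b[1] * (rho * d_low + (1 - rho) * d_high)

# two subsets of two super-pixels each, with normalized 4-bin histograms
Rm = [np.array([.7, .1, .1, .1]), np.array([.1, .7, .1, .1])]
Rn = [np.array([.6, .2, .1, .1]), np.array([.1, .1, .7, .1])]
print(round(sas_distance(Rm, Rn, rho=0.5), 3))
```

Setting b = (1, 0) in this sketch recovers the mean-only comparison of Eq. (4), which mirrors how Selective Search becomes a special case of SAS when b_2 = 0.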
3.3. Bayesian scoring

3.3.1. Contour score

Following the hypothesis that the number of contours contained in a bounding box is indicative of the likelihood of the box containing an object [41], we use the number of complete contours as the objectness measurement, where a contour is an orientation-consistent edge group.

The contour response map produced by the SCSDN is shown in Fig. 4(b). Every point is treated as an edge point. Suppose the set of edge groups in an image is S = {s_i}; the set of edge groups in a
bounding box b is S_b ⊂ S, and T = {t_ij} is the set of pixels on s_i. The score based on complete contours is computed as

w_e = Σ_i w_b(s_i) · m_i / (2(b_w + b_h)^κ), (8)

where m_i is the magnitude of s_i; b_w and b_h are the width and height of the bounding box. The perimeter of the bounding box is used for normalization, and κ > 1 is a parameter to offset the bias of larger windows having more edges on average; we set κ = 2 following EdgeBoxes [41]. The score w_b(s_i) of every edge group s_i is computed by

w_b(s_i) = 1, if s_i is inside b;
w_b(s_i) = 1 − max_P Π_{j=1}^{|P|−1} a(t_j, t_{j+1}), if s_i overlaps b;
w_b(s_i) = 0, if s_i is outside b, (9)

where a(t_j, t_{j+1}) is the affinity of the orientations of t_j and t_{j+1}, and P is an ordered path along s_i with length |P|, running from t_1 ∈ b to the end of s_i.

On the pre-computed contour map, the confidence score for each region is calculated with Eq. (8). The larger the score is, the more complete contours exist in the box, which accounts for a higher confidence that the box contains an object.
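To make Eq. (8) concrete, the sketch below (our own illustration with hypothetical data, not the released code) scores a box from a list of edge groups, assuming each group's weight w_b from Eq. (9) and magnitude m_i have already been computed:

```python
def contour_score(groups, box_w, box_h, kappa=2.0):
    """Eq. (8): sum of w_b(s_i) * m_i over edge groups, normalized by
    the box perimeter raised to kappa, which offsets the bias of larger
    boxes containing more edges on average."""
    num = sum(w_b * m for w_b, m in groups)
    return num / (2.0 * (box_w + box_h) ** kappa)

# three hypothetical edge groups: (w_b from Eq. (9), magnitude m_i)
groups = [(1.0, 40.0),   # fully inside the box
          (0.3, 25.0),   # straddles the box boundary
          (0.0, 60.0)]   # outside the box: contributes nothing
score_small = contour_score(groups, 10, 10)
score_large = contour_score(groups, 40, 40)
assert score_small > score_large  # larger boxes are penalized
```

With the same evidence, the smaller box scores higher, which is exactly the normalization effect the κ exponent is meant to provide.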
3.3.2. Symmetry score

Symmetry is an important object property, which has been explored in visual tasks including object detection [4] and object recognition [40]. If a bounding box has a considerable number of symmetry axes, it is more likely to contain an object. With the symmetry map, the objectness score w_s based on symmetry is calculated in the same way as Eq. (8). A symmetry map is shown in Fig. 4(c).

3.3.3. Bayesian scoring

Let B = {b_1, b_2, ..., b_N} denote a set of bounding boxes, and H = {(w_e^j, w_s^j)}_{j=1}^N denote the objectness scores corresponding to contour and symmetry, respectively. In the Bayesian framework, the score y = f(H, B) is drawn from a probabilistic model:

p(y | D_N) ∝ p(D_N | y), (10)

where D_N = {(H_j, b_j)}_{j=1}^N. w_e and w_s are assumed conditionally independent, and we have

p(D_N | y) = Π_{i=1}^2 P(D_N^i | y), (11)

where D_N^1 = {(w_e^j, b_j)}_{j=1}^N and D_N^2 = {(w_s^j, b_j)}_{j=1}^N. A sigmoid function is applied to w_e and w_s to transform them into probabilities measuring how likely the bounding box is to contain an object, namely, P((w_e, b) | y) = sigmoid(w_e) and P((w_s, b) | y) = sigmoid(w_s). As contour and symmetry are two independent low-level factors, we assume that they make equal contributions to the Bayesian score. Given a new bounding box b and h = (w_e, w_s), the probability of the box being an object is calculated with

p(y | (b, h)) ∝ P((w_e, b) | y) · P((w_s, b) | y). (12)
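Under the equal-contribution assumption, Eq. (12) reduces to a product of sigmoids of the two objectness scores. A minimal sketch (with hypothetical score values, not taken from the paper):

```python
import math

def bayesian_score(w_e, w_s):
    """Eq. (12): unnormalized object probability as the product of the
    sigmoid-transformed contour and symmetry objectness scores."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return sigmoid(w_e) * sigmoid(w_s)

# a box with strong contour and symmetry evidence outranks a weak one
strong = bayesian_score(2.0, 1.5)
weak = bayesian_score(-0.5, 0.2)
assert strong > weak
```

Since the score is only used to rank and select a subset of proposals, the missing normalization constant in Eq. (12) is irrelevant.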
4. Experimental results

4.1. Metrics

4.1.1. Dataset

Following [1,19,33,41], we evaluate the proposed approach on the PASCAL VOC 2007 dataset [14]. The dataset consists of training (4501 images), validation (2510 images), and test (4510 images) sets. We compare our approach with the baseline on the validation set and with the state-of-the-art on the test set. We also verify the performance of the proposed approach on the challenging MS COCO validation dataset, which contains 40,504 images with 80 object categories [25].

4.1.2. Evaluation procedures

We follow the same evaluation procedure as [19], using recall rate, proposal number, and proposal-object overlap (Intersection over Union, IoU).

• Recall rate: With a higher recall rate, the subsequent classifier is more likely to achieve high detection accuracy. Once an object is lost in the object proposal stage, the classifier can no longer detect it.
• Proposal number: A smaller proposal number guarantees the efficiency of the subsequent classifier.
• IoU: A larger IoU means more accurate localization, so that the subsequent feature extraction approaches can extract more effective features.

Better approaches are recognized by smaller proposal numbers and larger IoU while keeping a high recall rate. There are three commonly used experimental setups: recall rate vs. window number at a given IoU, recall rate vs. IoU at a given proposal number, and the minimum proposal number at a given recall rate and IoU.
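For reference, the IoU used throughout is the standard intersection-over-union of two boxes; a minimal helper (our illustration, not part of the paper's toolchain):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # intersection area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)       # union in denominator

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 ≈ 0.333
```

A proposal counts as a hit at, e.g., IoU = 0.7 when this value against some ground-truth box reaches 0.7, which is the criterion behind the recall curves reported below.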
4.1.3. Contour and symmetry detector

We use the BSDS dataset [2] to train the contour branch and the SYMMAX dataset [34] to train the symmetry branch; they are not fine-tuned any further on the datasets used for evaluating object proposal performance.

The hyper-parameters of SCSDN include: mini-batch size (1), learning rate (1e−6 for the contour branch and 1e−8 for the symmetry branch), momentum (0.9), weight decay (0.002), and the numbers of training iterations (10,000, 8000, 6000, 4000, 2000, 1000, and 1000 for the contour branch and symmetry branch in order).

With this setting, the proposed SCSDN can output contour and symmetry response maps at about 10 fps on a single GPU, with better performance than the Structured Edge detector [12] in EdgeBoxes.

4.2. Comparison with baseline

We evaluate the efficiency of the combination of grouping and objectness on the PASCAL VOC validation set. Fig. 5 shows the comparison between our approach and the baseline, i.e., Selective Search [35]. With the regions generated by Similarity Adaptive Search (SAS) in Section 3.2, the contour and symmetry scores are incorporated in order, shown as F+C and F+C+S in Fig. 5.

Recall rate vs. the number of proposals is shown in Fig. 5(a) with IoU = 0.7. It can be seen that our approach improves the recall rate by more than 10% when using 100 or 1000 proposals. The recall rate is improved to 0.89, while Selective Search achieves 0.85. In addition, our approach needs only 601 detection proposals to reach a 75% recall rate, while Selective Search needs 1777.

Recall rate vs. IoU is shown in Fig. 5(b) when using 1000 detection proposals. It can be seen that our approach is better than the baseline when IoU lies between 0.5 and 0.80. When IoU is larger than 0.80, our approach reports a lower recall rate than Selective Search. The reason lies in that our approach further employs a Non-Maximum Suppression (NMS) procedure. Considering that IoU = 0.5 and IoU = 0.7 are the two typical settings in supervised/weakly-supervised object detection tasks, one can conclude from Fig. 5(b) that our approach outperforms the baseline on recall vs. IoU for obtaining object candidates.
Fig. 5. Comparison with the baseline. SS is Selective Search [35], SAS is our proposed Similarity Adaptive Search, F+C denotes re-ranked results with the contour score, and F+C+S with the contour and symmetry scores, respectively.
Fig. 6. Comparisons with the state-of-the-art using recall rate versus number of proposals.
Table 1
Required proposal numbers at 25%, 50%, and 75% recall rates.
Method AUC N@25% N@50% N@75% Recall
BING [8] 0.20 302 – – 0.28
Rantalankila [31] 0.25 146 520 – 0.70
Objectness [1] 0.27 28 – – 0.38
RandomizedPrims [27] 0.35 42 358 3204 0.79
Rahtu [30] 0.36 29 310 – 0.70
Rigor [20] 0.38 25 367 1961 0.81
Selective Search [35] 0.39 29 210 1416 0.87
CPMC [5] 0.41 17 112 – 0.65
MTSE [7] 0.41 18 175 1112 0.89
CA [37] 0.42 27 167 1418 0.88
Endres [13] 0.44 7 112 – 0.66
MCG [3] 0.46 10 86 1562 0.82
EdgeBoxes [41] 0.47 12 96 655 0.88
Our approach (C) 0.48 12 91 535 0.89
Our approach (C + S) 0.49 10 71 476 0.89
4.3. Comparisons with state-of-the-art

We compare the proposed approach with recent unsupervised approaches including Rahtu [30], Objectness [1], CPMC [5], Selective Search [35], RandomizedPrims [27], Rantalankila [31], MCG [3], BING [8], EdgeBoxes [41], Endres [13], Rigor [20], MTSE [7], and CA [37]. The results of the compared approaches are provided by [19], and curves are generated using the Structured Edge Detection Toolbox V3.0 [41].

Recall rate versus number of proposals is illustrated in Fig. 6, where we compare recent approaches using IoU thresholds of 0.5, 0.6, and 0.7. The red curves show the recall performance of our approach. It can be seen in Fig. 6 that the maximum recall rate of our approach slightly outperforms the state-of-the-art, in particular when IoU = 0.7.

Recall rate vs. IoU is shown in Fig. 7. The number of proposals is set to 100, 500, and 1000, respectively. Varying IoU from 0.5 to 0.7, Endres, CPMC, and MCG perform slightly better than our approach with 100 proposals, but our approach achieves the highest recall rate given 500 and 1000 proposals.

In Table 1, we compare the numbers of object proposals required by each approach at 25%, 50%, and 75% recall rates with IoU 0.7. It can be seen that our approach keeps the highest recall rate of 0.89 compared to the other methods. The AUC (Area Under the Curve) is increased to 0.48 when the contour is used, and to 0.49 when contour and symmetry are used. There is a trade-off between the recall rate and the number of object proposals, and a recall rate of about 75% is a good choice. To achieve a 75% recall rate, our approach needs 476 detection proposals, the best among all compared approaches, showing that it can effectively reduce the redundancy. Fig. 8 shows examples of how our approach increases the localization accuracy and the recall rate.

4.4. Evaluation on MS COCO

In order to verify the performance of our proposed approach, we evaluate it on MS COCO [25], as shown in Fig. 9. We compare only with the typical Selective Search, based on super-pixel merging, and EdgeBoxes, based on objectness. As the images and objects in COCO
Fig. 7. Comparisons with state-of-the-art using recall rate versus IoU.
Fig. 8. Localization accuracy and recall rate comparison between the proposed approach (first row) and Selective Search (second row). The first three columns illustrate
that the proposed approach achieves higher IoU, i.e., overlap between the object proposals (dashed line boxes) and the ground-truth (solid line boxes). The last two
columns show that the proposals by Selective Search miss three true positives (red boxes), while the proposals by our proposed approach contain all true
positives. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Comparison results on MS COCO.
Please cite this article as: W. Ke et al., Deep contour and symmetry scored object proposal, Pattern Recognition Letters (2018),
https://doi.org/10.1016/j.patrec.2018.01.004
are more challenging than those in PASCAL VOC, the recall rate is much lower.
However, our approach still achieves a performance gain, improving the recall
rate from 57% to 63%.
5. Conclusion
Object proposal methods can reduce the number of object candidate
windows from millions to thousands. Our motivation is that a proper
integration of general object properties, i.e., color, contour, and sym-
metry, contributes to fewer but better object proposals. To fully inte-
grate the advantages of grouping approaches with those of objectness-
based approaches, we propose a deep contour-symmetry scoring
strategy in a Bayesian framework to further improve the performance
of object proposal. The contour and symmetry properties are extracted
with a Simultaneous Contour and Symmetry Detection Network
(SCSDN). Experiments demonstrate that a subset of high-scored
proposals can preserve the recall rate while significantly decreasing
the number of object proposals. Our approach can easily be extended
to other proposal generation approaches to reduce their proposal
numbers.
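The final selection step described above, keeping only a small high-scored subset of proposals, can be sketched as follows. The interface is hypothetical; in the proposed approach the scores would come from the contour-symmetry Bayesian scoring, but any per-proposal score fits this pattern:

```python
import numpy as np

def top_scored_subset(boxes, scores, k):
    """Keep the k highest-scored proposals.

    `boxes` is a list of proposal boxes and `scores` the matching
    per-proposal confidences (hypothetical interface)."""
    order = np.argsort(np.asarray(scores))[::-1][:k]
    return [boxes[i] for i in order]

# Example: keep the 2 best of 3 proposals.
kept = top_scored_subset(
    [(0, 0, 1, 1), (2, 2, 3, 3), (4, 4, 5, 5)],
    [0.2, 0.9, 0.5], k=2)
```

Applied after any generator (e.g., Selective Search), this is what trades a large redundant proposal set for a compact one at a chosen recall level.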
Acknowledgments
This work was supported in part by the National Natural Science Foun-
dation of China under Grant 61671427, and the Youth Innovation Pro-
motion Association, Chinese Academy of Sciences, under Grant
Z161100001616005.
References
[1] B. Alexe, T. Deselaers, V. Ferrari, Measuring the objectness of image windows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2189–2202, doi:10.1109/TPAMI.2012.28.
[2] P. Arbelaez, M. Maire, C.C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 898–916.
[3] P.A. Arbeláez, J. Pont-Tuset, J.T. Barron, F. Marqués, J. Malik, Multiscale combinatorial grouping, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335, doi:10.1109/CVPR.2014.49.
[4] X. Bai, X. Wang, L.J. Latecki, W. Liu, Z. Tu, Active skeleton for non-rigid object detection, in: Proceedings of International Conference on Computer Vision, 2009, pp. 575–582, doi:10.1109/ICCV.2009.5459188.
[5] J. Carreira, C. Sminchisescu, CPMC: automatic object segmentation using constrained parametric min-cuts, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1312–1328, doi:10.1109/TPAMI.2011.231.
[6] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, in: Proceedings of the British Machine Vision Conference, 2014.
[7] X. Chen, H. Ma, X. Wang, Z. Zhao, Improving object proposals with multi-thresholding straddling expansion, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2587–2595, doi:10.1109/CVPR.2015.7298874.
[8] M. Cheng, Z. Zhang, W. Lin, P.H.S. Torr, BING: binarized normed gradients for objectness estimation at 300fps, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293, doi:10.1109/CVPR.2014.414.
[9] J. Dai, K. He, J. Sun, Instance-aware semantic segmentation via multi-task network cascades, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
[10] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893, doi:10.1109/CVPR.2005.177.
[11] P. Dollár, R. Appel, S.J. Belongie, P. Perona, Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1532–1545, doi:10.1109/TPAMI.2014.2300479.
[12] P. Dollár, C.L. Zitnick, Structured forests for fast edge detection, in: Proceedings of International Conference on Computer Vision, 2013, pp. 1841–1848, doi:10.1109/ICCV.2013.231.
[13] I. Endres, D. Hoiem, Category-independent object proposals with diverse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 222–234, doi:10.1109/TPAMI.2013.122.
[14] M. Everingham, S.M.A. Eslami, L.J.V. Gool, C.K.I. Williams, J.M. Winn, A. Zisserman, The PASCAL visual object classes challenge: a retrospective, Int. J. Comput. Vis. 111 (1) (2015) 98–136, doi:10.1007/s11263-014-0733-5.
[15] P.F. Felzenszwalb, R.B. Girshick, D.A. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645, doi:10.1109/TPAMI.2009.167.
[16] P.F. Felzenszwalb, D.P. Huttenlocher, Efficient graph-based image segmentation, Int. J. Comput. Vis. 59 (2) (2004) 167–181, doi:10.1023/B:VISI.0000022288.19776.77.
[17] J. Feng, Y. Wei, L. Tao, C. Zhang, J. Sun, Salient object detection by composition, in: Proceedings of International Conference on Computer Vision, 2011, pp. 1028–1035, doi:10.1109/ICCV.2011.6126348.
[18] R.B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587, doi:10.1109/CVPR.2014.81.
[19] J.H. Hosang, R. Benenson, B. Schiele, How good are detection proposals, really? in: Proceedings of the British Machine Vision Conference, 2014.
[20] A. Humayun, F. Li, J.M. Rehg, RIGOR: reusing inference in graph cuts for generating object regions, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 336–343, doi:10.1109/CVPR.2014.50.
[21] N. Karianakis, T.J. Fuchs, S. Soatto, Boosting convolutional features for robust object proposals, CoRR abs/1503.06350 (2015).
[22] P. Krähenbühl, V. Koltun, Geodesic object proposals, in: Proceedings of European Conference on Computer Vision, 2014, pp. 725–739.
[23] P. Krähenbühl, V. Koltun, Learning to propose objects, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1574–1582.
[24] W. Kuo, B. Hariharan, J. Malik, DeepBox: learning objectness with convolutional networks, in: Proceedings of International Conference on Computer Vision, 2015, pp. 2479–2487, doi:10.1109/ICCV.2015.285.
[25] T. Lin, M. Maire, S.J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context, in: Proceedings of European Conference on Computer Vision, 2014, pp. 740–755.
[26] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440, doi:10.1109/CVPR.2015.7298965.
[27] S. Manen, M. Guillaumin, L. Van Gool, Prime object proposals with randomized Prim's algorithm, in: Proceedings of the International Conference on Computer Vision, 2013, pp. 2536–2543, doi:10.1109/ICCV.2013.315.
[28] P.H.O. Pinheiro, R. Collobert, P. Dollár, Learning to segment object candidates, in: Advances in Neural Information Processing Systems, 2015, pp. 1990–1998.
[29] P.O. Pinheiro, T. Lin, R. Collobert, P. Dollár, Learning to refine object segments, in: Proceedings of European Conference on Computer Vision, 2016, pp. 75–91.
[30] E. Rahtu, J. Kannala, M.B. Blaschko, Learning a category independent object detection cascade, in: Proceedings of International Conference on Computer Vision, 2011, pp. 1052–1059, doi:10.1109/ICCV.2011.6126351.
[31] P. Rantalankila, J. Kannala, E. Rahtu, Generating object segmentation proposals using global and local search, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2417–2424, doi:10.1109/CVPR.2014.310.
[32] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149, doi:10.1109/TPAMI.2016.2577031.
[33] K.E.A. van de Sande, J.R.R. Uijlings, T. Gevers, A.W.M. Smeulders, Segmentation as selective search for object recognition, in: Proceedings of the International Conference on Computer Vision, 2011, pp. 1879–1886, doi:10.1109/ICCV.2011.6126456.
[34] S. Tsogkas, I. Kokkinos, Learning-based symmetry detection in natural images, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 41–54, doi:10.1007/978-3-642-33786-4_4.
[35] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171, doi:10.1007/s11263-013-0620-5.
[36] F. Wan, P. Wei, Z. Han, K. Fu, Q. Ye, Weakly supervised object detection with correlation and part suppression, in: Proceedings of IEEE International Conference on Image Processing, 2016, pp. 3638–3642, doi:10.1109/ICIP.2016.7533038.
[37] Y. Xiao, C. Lu, E. Tsougenis, Y. Lu, C. Tang, Complexity-adaptive distance metric for object proposals generation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 778–786, doi:10.1109/CVPR.2015.7298678.
[38] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of International Conference on Computer Vision, 2015, pp. 1395–1403, doi:10.1109/ICCV.2015.164.
[39] Q. Ye, T. Zhang, W. Ke, Q. Qiu, J. Chen, G. Sapiro, B. Zhang, Self-learning scene-specific pedestrian detectors using a progressive latent model, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 509–518.
[40] Z. Zhang, W. Shen, C. Yao, X. Bai, Symmetry-based text line detection in natural scenes, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2558–2567, doi:10.1109/CVPR.2015.7298871.
[41] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 391–405, doi:10.1007/978-3-319-10602-1_26.