
CornerNet-Lite: Efficient Keypoint-Based Object Detection

Hei Law ([email protected])

Yun Tang ([email protected])

Olga Russakovsky ([email protected])

Jia Deng ([email protected])

Department of Computer Science, Princeton University, Princeton, NJ, USA

Abstract

Keypoint-based methods are a relatively new paradigm in object detection, eliminating the need for anchor boxes and offering a simplified detection framework. Keypoint-based CornerNet achieves state-of-the-art accuracy among single-stage detectors. However, this accuracy comes at high processing cost. In this work, we tackle the problem of efficient keypoint-based object detection and introduce CornerNet-Lite. CornerNet-Lite is a combination of two efficient variants of CornerNet: CornerNet-Saccade, which uses an attention mechanism to eliminate the need for exhaustively processing all pixels of the image, and CornerNet-Squeeze, which introduces a new compact backbone architecture. Together these two variants address the two critical use cases in efficient object detection: improving efficiency without sacrificing accuracy, and improving accuracy at real-time efficiency. CornerNet-Saccade is suitable for offline processing, improving the efficiency of CornerNet by 6.0x and the AP by 1.0% on COCO. CornerNet-Squeeze is suitable for real-time detection, improving both the efficiency and accuracy of the popular real-time detector YOLOv3 (34.4% AP at 30ms for CornerNet-Squeeze compared to 33.0% AP at 39ms for YOLOv3 on COCO). Together these contributions for the first time reveal the potential of keypoint-based detection to be useful for applications requiring processing efficiency.

1 Introduction

Keypoint-based object detection [27, 55, 58] is a class of methods that generate object bounding boxes by detecting and grouping their keypoints. CornerNet [27], the state-of-the-art among them, detects and groups the top-left and bottom-right corners of bounding boxes; it uses a stacked hourglass network [39] to predict the heatmaps of the corners and then uses associative embeddings [40] to group them. CornerNet allows a simplified design that eliminates the need for anchor boxes [48], and has achieved state-of-the-art accuracy on COCO [31] among single-stage detectors.

© 2020. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

arXiv:1904.08900v2 [cs.CV] 16 Sep 2020


[Figure 1 plot: COCO AP vs. inference time (ms) for CornerNet-Squeeze (ours), CornerNet-Saccade (ours), CornerNet [Law & Deng '18], and YOLOv3 [Redmon & Farhadi '18], marking the state-of-the-art accuracy of keypoint-based object detection and the real-time inference regime.]

Figure 1: We introduce CornerNet-Saccade and CornerNet-Squeeze (collectively CornerNet-Lite), two efficient object detectors based on CornerNet [27], a state-of-the-art keypoint-based object detector. CornerNet-Saccade speeds up the original CornerNet by 6.0x with a 1% increase in AP. CornerNet-Squeeze is faster and more accurate than YOLOv3 [46], the state-of-the-art real-time detector. All detectors are tested on the same machine with a 1080Ti GPU and an Intel Core i7-7700k CPU.

However, a major drawback of CornerNet is its inference speed. It achieves an average precision (AP) of 42.2% on COCO at an inference cost of 1.1s per image¹, which is too slow for video applications that require real-time or interactive rates. Although one can easily speed up inference by reducing the number of pixels processed (e.g. by reducing the number of scales of processing or the image resolution), this can cause a large accuracy drop. For example, single-scale processing combined with reducing the input resolution can speed up the inference of CornerNet to 42ms per image, comparable to the 39ms of the popular fast detector YOLOv3 [46], but would decrease the AP to 25.6%, which is much lower than YOLOv3's 33.0%. This makes CornerNet less competitive with alternatives in terms of the accuracy-efficiency tradeoff.

In this paper we seek to improve the inference efficiency of CornerNet. The efficiency of any object detector can be improved along two orthogonal directions: reducing the number of pixels processed and reducing the amount of processing per pixel. We explore both directions and introduce two efficient variants of CornerNet: CornerNet-Saccade and CornerNet-Squeeze, which we refer to collectively as CornerNet-Lite.

CornerNet-Saccade speeds up inference by reducing the number of pixels to process. It uses an attention mechanism similar to saccades in human vision [1, 60]. It starts with a downsized full image and generates an attention map, which is then zoomed in on and processed further by the model. This differs from the original CornerNet in that it is applied fully convolutionally across multiple scales. By selecting a subset of crops to examine in high resolution, CornerNet-Saccade improves speed while improving the accuracy. Experiments on COCO show that CornerNet-Saccade achieves an AP of 43.2% at 190ms per image, a 1% increase in AP and a 6.0x speed-up over the original CornerNet.

CornerNet-Squeeze speeds up inference by reducing the amount of processing per pixel. It incorporates ideas from SqueezeNet [20] and MobileNets [16], and introduces a new, compact hourglass backbone that makes extensive use of 1×1 convolutions, bottleneck layers, and depth-wise separable convolutions. With the new hourglass backbone, CornerNet-Squeeze achieves an AP of 34.4% on COCO at 30ms, simultaneously more accurate and faster than YOLOv3 (33.0% at 39ms).

¹All detectors in this paper are tested on a machine with a 1080Ti GPU and an Intel Core i7-7700k CPU.



A natural question is whether CornerNet-Squeeze can be combined with saccades to improve its efficiency even further. Somewhat surprisingly, our experiments give a negative answer: CornerNet-Squeeze-Saccade turns out slower and less accurate than CornerNet-Squeeze. This is because for saccades to help, the network needs to be able to generate sufficiently accurate attention maps, but the ultra-compact architecture of CornerNet-Squeeze does not have this extra capacity. In addition, the original CornerNet is applied at multiple scales, which provides ample room for saccades to cut down on the number of pixels to process. In contrast, CornerNet-Squeeze is already applied at a single scale due to the ultra-tight inference budget, which provides much less room for saccades to save.

Significance and novelty: Collectively, these two variants of CornerNet-Lite make the keypoint-based approach competitive, covering two popular use cases: CornerNet-Saccade for offline processing, improving efficiency without sacrificing accuracy, and CornerNet-Squeeze for real-time processing, improving accuracy without sacrificing efficiency.

Both variants of CornerNet-Lite are technically novel. CornerNet-Saccade is the first to integrate saccades with keypoint-based object detection. Its key difference from prior work lies in how each crop (of pixels or feature maps) is processed. Prior work that employs saccade-like mechanisms either detects a single object per crop (e.g. Faster R-CNN [48]) or produces multiple detections per crop with a two-stage network involving additional sub-crops (e.g. AutoFocus [38]). In contrast, CornerNet-Saccade produces multiple detections per crop with a single-stage network.

CornerNet-Squeeze is the first to integrate SqueezeNet with the stacked hourglass architecture and to apply such a combination to object detection. Prior works that employ the hourglass architecture have excelled at achieving competitive accuracy, but it was unclear whether and how the hourglass architecture can be competitive in terms of efficiency. Our design and results show that this is possible for the first time, particularly in the context of object detection.

Contributions: Our contributions are three-fold: (1) we propose CornerNet-Saccade and CornerNet-Squeeze, two novel approaches to improving the efficiency of keypoint-based object detection; (2) on COCO, we improve the efficiency of state-of-the-art keypoint-based detection by 6-fold and the AP from 42.2% to 43.2%; (3) on COCO, we improve both the accuracy and efficiency of state-of-the-art real-time object detection (to 34.4% at 30ms from 33.0% at 39ms of YOLOv3).

2 Related Work

Saccades in Object Detection. Saccades in human vision refer to a sequence of rapid eye movements to fixate different image regions. In the context of object detection algorithms, we use the term broadly to mean selectively cropping and processing image regions (sequentially or in parallel, pixels or features) during inference.

There has been a long history of using saccades [12, 42, 63] in object detection to speed up inference. For example, a special case of saccades is a cascade that repeatedly selects a subset of regions for further processing, as exemplified by the Viola-Jones face detector [56]. The idea of saccades has taken diverse forms in various approaches, but can be roughly categorized by how each crop is processed, in particular, what kind of output is produced after processing each crop.


[Figure 2 diagram: downsizing → attention maps → generating object locations (x, y) → ranking & picking top k locations → cropping regions → detecting objects → merging detections.]

Figure 2: Overview of CornerNet-Saccade. We predict a set of possible object locations from the attention maps and bounding boxes generated on a downsized full image. We zoom into each location and crop a small region around that location. Then we detect objects in the top k regions and merge the detections by NMS.


Saccades in R-CNN [11], Fast R-CNN [10], and Faster R-CNN [48] take the form of crops representing potential objects. After processing, each crop is either rejected or converted to a single labeled box through classification and regression. Cascade R-CNN [3] extends Faster R-CNN by using a cascade of classifiers and regressors to iteratively reject or refine each proposal. The saccades in all these R-CNN variants are thus single-type and single-object, in that there is a single type of processing of crops, and each crop produces at most a single object.

AutoFocus [38], which builds upon SNIPER [54] that improved R-CNN training, adds a branch to Faster R-CNN to predict the regions that are likely to contain small objects. Then it applies Faster R-CNN again to each of those regions by cropping the original pixels. In AutoFocus, there are two kinds of cropping, one that can produce multiple objects (by calling Faster R-CNN as a subroutine), and the other that can produce at most a single object (cropping done within Faster R-CNN). The saccades in AutoFocus are thus multi-type and mixed, in the sense that two different types of processing are interleaved.

In contrast, saccades in CornerNet-Saccade are single-type and multi-object, in that there is a single type of crop processing and each crop can produce multiple objects at once without additional subcrops. This means that the number of crops processed by CornerNet-Saccade can be much smaller than the number of objects, whereas for R-CNN variants and AutoFocus the number of crops must be no smaller than the number of objects.

Efficient Object Detectors. Other than accuracy [4, 5, 8, 15, 19, 22, 32, 43, 51, 53, 59, 61, 67], many recent works have improved upon the efficiency of detectors since the introduction of R-CNN [11], which applies a ConvNet [25] to 2000 RoIs. Repeatedly applying a ConvNet to the RoIs introduces many redundant computations. SPP [13] and Fast R-CNN [10] address this by applying a ConvNet fully convolutionally on the image and extracting features directly from the feature maps for each RoI. Faster R-CNN [48] further improves efficiency by replacing the low-level vision algorithm with a region proposal network. R-FCN [7] replaces the expensive fully connected sub-detection network with a fully convolutional network, and Light-Head R-CNN [29] reduces the cost in R-FCN by applying separable convolution to reduce the number of channels in the feature maps before RoI pooling. On the other hand, one-stage detectors [9, 21, 24, 33, 35, 45, 47, 50, 58, 62, 65, 66] remove the region pooling step of two-stage detectors.



Efficient Network Architectures. The efficiency of ConvNets is important to many mobile and embedded applications. Much attention [17, 26, 28, 36, 44, 49, 64] has been given to the design of efficient network architectures. SqueezeNet [20] proposes a fire module to reduce the number of parameters in AlexNet [25] by 50x while achieving similar performance. MobileNets [16] are a class of lightweight networks that use depth-wise separable convolutions [6], and propose strategies to achieve a good trade-off between accuracy and latency. PeleeNet [57], in contrast, demonstrates the effectiveness of standard convolutions by introducing an efficient variant of DenseNet [18] consisting of two-way dense layers and a stem block. Other networks were designed specifically for real-time detection. YOLOv2 [45] incorporates ideas from NIN [30] to design a new variant of VGG [52]. YOLOv3 [46] further improves DarkNet-19 by making the network deeper and introducing skip connections. RFBNet [34] proposes a new module which mimics the receptive field of human vision systems to efficiently gather information across different scales.

3 CornerNet-Saccade

CornerNet-Saccade detects objects within small regions around possible object locations in an image. It uses the downsized full image to predict attention maps and coarse bounding boxes; both suggest possible object locations. CornerNet-Saccade then detects objects by examining the regions centered at the locations in high resolution. An overview of the pipeline is shown in Fig. 2.

3.1 Estimating Object Locations

The first step in CornerNet-Saccade is to obtain possible object locations in an image. We use downsized full images to predict attention maps, which indicate both the locations and the coarse scales of the objects at the locations. Given an image, we downsize it to two scales by resizing the longer side of the image to 255 and 192 pixels. The image of size 192 is padded with zeros to the size of 255 so that they can be processed in parallel. There are two reasons for using images at such low resolutions. First, this step should not be a bottleneck in the inference time. Second, the network should easily leverage the context information in the image to predict the attention maps.

For a downsized image, CornerNet-Saccade predicts 3 attention maps, one for small objects, one for medium objects and one for large objects. An object is considered small if the longer side of its bounding box is less than 32 pixels, medium if it is between 32 and 96 pixels, and large if it is greater than 96 pixels². Predicting locations separately for different object sizes gives us finer control over how much CornerNet-Saccade should zoom in at each location. We can zoom in more at small object locations and less at medium object locations.
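As a concrete illustration, this size assignment is a simple threshold on the longer side of the box (a minimal sketch; the function and category names are ours, not from the released code):

```python
def size_category(box_w, box_h):
    """Assign an object to one of the 3 attention maps by the longer side
    of its bounding box, measured w.r.t. the network input (footnote 2)."""
    longer = max(box_w, box_h)
    if longer < 32:
        return "small"
    if longer <= 96:
        return "medium"
    return "large"
```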

We predict the attention maps by using feature maps at different scales. The feature maps are obtained from the backbone network in CornerNet-Saccade, which is an hourglass network [39]. The feature maps from the upsampling layers in the hourglass are used to predict the attention maps. The feature maps at finer scales are used for smaller objects and the ones at coarser scales are for larger objects. We predict the attention maps by applying a 3×3 Conv-ReLU module followed by a 1×1 Conv-Sigmoid module to each feature map. During testing, we only process locations where scores are above a threshold t, and we set t = 0.3 in our experiments. During training, we set the center of each bounding box on the corresponding attention map to be positive and the rest to negatives. Then we apply the focal loss with α = 2.

²The sizes are w.r.t. the input to the network.
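The per-scale attention predictor described above can be sketched in PyTorch as follows (the intermediate channel width and anything beyond the 3×3 Conv-ReLU and 1×1 Conv-Sigmoid modules are our assumptions):

```python
import torch.nn as nn

class AttentionHead(nn.Module):
    """Predicts one attention map from one upsampling feature map of the
    hourglass: a 3x3 Conv-ReLU module, then a 1x1 Conv-Sigmoid module."""
    def __init__(self, in_channels, mid_channels=256):  # mid width assumed
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat):
        # Per-pixel probability of an object center at this scale.
        return self.head(feat)
```

At test time, only locations whose scores exceed t = 0.3 are processed further.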


[Figure 3: two panels, removing boxes at the boundary (left) and suppressing redundant locations (right).]

Figure 3: Left: Some objects may not be fully covered by a region. The detector may still generate bounding boxes (red dashed line) for those objects. We remove the bounding boxes which touch the boundaries to avoid such bounding boxes. Right: When the objects are close to each other, we may generate regions that highly overlap with each other. Processing either one of them is likely to detect objects in all highly overlapping regions. We suppress redundant regions to improve efficiency.


3.2 Detecting Objects

CornerNet-Saccade uses the locations obtained from the downsized image to determine where to process. If we directly crop the regions from the downsized image, some objects may become too small to detect accurately. Hence, we should examine the regions at higher resolution based on the scale information obtained in the first step.

Based on the locations obtained from the attention maps, we can determine different zoom-in scales for different object sizes: ss for small objects, sm for medium objects and sl for large objects. In general, ss > sm > sl because we should zoom in more for smaller objects, so we set ss = 4, sm = 2 and sl = 1. At each possible location (x, y), we enlarge the downsized image by si, where i ∈ {s, m, l} depending on the coarse object scale. Then we apply CornerNet-Saccade to a 255×255 window centered at the location.
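A minimal sketch of this zoom-and-crop step (the helper is ours; padding for windows that fall outside the enlarged image is omitted for brevity):

```python
import torch.nn.functional as F

ZOOM = {"small": 4, "medium": 2, "large": 1}  # s_s = 4, s_m = 2, s_l = 1

def zoom_and_crop(downsized, x, y, category, win=255):
    """Enlarge the downsized image (a BxCxHxW tensor) by the per-size zoom
    factor, then take a win x win window centered at the scaled location."""
    s = ZOOM[category]
    img = F.interpolate(downsized, scale_factor=s, mode="bilinear",
                        align_corners=False)
    cx, cy = int(x * s), int(y * s)
    x0 = max(0, min(cx - win // 2, img.shape[-1] - win))
    y0 = max(0, min(cy - win // 2, img.shape[-2] - win))
    return img[..., y0:y0 + win, x0:x0 + win]
```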

CornerNet-Saccade uses the same network for attention maps and bounding boxes. Hence, when CornerNet-Saccade processes the downsized image to predict the attention maps, it may already predict boxes for some larger objects in the image. But the boxes may not be accurate. We also zoom into those objects to obtain better boxes.

There are some important implementation details to make processing efficient. First, we process the regions in batches to better utilize the GPU. Second, we keep the original image in GPU memory, and perform resizing and cropping on the GPU to reduce the overhead of transferring image data between the CPU and GPU. To further utilize the processing power of the GPU, the cropped images are processed in parallel.

After detecting objects at possible object locations, we merge the bounding boxes and remove redundant ones by applying Soft-NMS [2]. When we crop the regions, the regions may include parts of the objects at the crop boundaries as shown in Fig. 3 (left). The detector may generate bounding boxes for those objects, which may not be removed by Soft-NMS as they may have low overlaps with the bounding boxes of the full objects. Hence, we remove the bounding boxes which touch the crop boundary. During training, we apply the same training losses in CornerNet to train the network to predict corner heatmaps, embeddings and offsets.
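The boundary filter itself is straightforward; a sketch (our helper; the tolerance eps and the box layout are assumptions):

```python
def drop_boundary_boxes(boxes, crop_w, crop_h, eps=1.0):
    """Discard detections whose box touches the crop border, since these are
    likely truncated objects that Soft-NMS would fail to suppress.
    Each box is (x1, y1, x2, y2, score) in crop coordinates."""
    return [b for b in boxes
            if b[0] > eps and b[1] > eps
            and b[2] < crop_w - eps and b[3] < crop_h - eps]
```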


When objects are close to each other, we may generate regions that highly overlap with each other as shown in Fig. 3 (right). It is undesirable to process both regions as processing either one of them is likely to detect objects in the other. We adopt a procedure similar to NMS to remove redundant locations. First, we rank the object locations, prioritizing locations from bounding boxes over locations from the attention maps. We then keep the best object location and remove the locations that are close to the best location. We repeat the procedure until no object locations are left.
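This greedy suppression can be sketched as follows (the Euclidean distance criterion and the data layout are our assumptions; the paper does not specify the exact closeness measure):

```python
def suppress_locations(locations, dist_thresh):
    """NMS-like suppression of redundant crop locations. `locations` is a
    list of (x, y, priority); locations derived from bounding boxes carry a
    higher priority than those derived from the attention maps."""
    pending = sorted(locations, key=lambda loc: loc[2], reverse=True)
    kept = []
    for x, y, p in pending:  # greedily keep the best remaining location
        if all((x - kx) ** 2 + (y - ky) ** 2 > dist_thresh ** 2
               for kx, ky, _ in kept):
            kept.append((x, y, p))
    return kept
```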

3.3 Backbone Network

We design a new hourglass backbone network that works better in CornerNet-Saccade. The new hourglass network consists of 3 hourglass modules and has a depth of 54 layers, while Hourglass-104 in CornerNet consists of 2 hourglass modules and has a depth of 104. We refer to the new backbone as Hourglass-54.

Each hourglass module in Hourglass-54 has fewer parameters and is shallower than the one in Hourglass-104. Following the downsizing strategy in Hourglass-104, we downsize the feature by stride 2. We apply one residual module [14] after each downsampling layer and in each skip connection. Each hourglass module downsizes the input features 3 times and increases the number of channels along the way (384, 384, 512). There is one residual module with 512 channels in the middle of the module, and one residual module after each upsampling layer. We also downsize the image twice before the hourglass modules.
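A recursive sketch of one such hourglass module under the channel schedule above (the residual block internals, normalization placement, entry width of 256, and upsampling mode are our assumptions, not the released architecture):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Plain residual module (sketch)."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_c), nn.ReLU(inplace=True),
            nn.Conv2d(out_c, out_c, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_c))
        self.proj = (nn.Identity() if in_c == out_c and stride == 1 else
                     nn.Sequential(
                         nn.Conv2d(in_c, out_c, 1, stride=stride, bias=False),
                         nn.BatchNorm2d(out_c)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

class HourglassModule(nn.Module):
    """One Hourglass-54 module: downsizes 3 times with channels
    (384, 384, 512), one residual per down/up layer and skip connection,
    and a 512-channel residual in the middle."""
    def __init__(self, channels=(256, 384, 384, 512)):
        super().__init__()
        cur, nxt = channels[0], channels[1]
        self.skip = Residual(cur, cur)
        self.down = Residual(cur, nxt, stride=2)  # stride-2 downsizing
        self.inner = (HourglassModule(channels[1:]) if len(channels) > 2
                      else Residual(nxt, nxt))    # 512-channel middle module
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                Residual(nxt, cur))

    def forward(self, x):
        return self.skip(x) + self.up(self.inner(self.down(x)))
```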

3.4 Training Details

We use Adam [23] to optimize both the losses for the attention maps and object detection, and use the same training hyperparameters found in CornerNet. The input size to the network is 255×255, which is also the input resolution during inference. We train the network with a batch size of 48 on four 1080Ti GPUs. In order to avoid over-fitting, we adopt the data augmentation techniques used in CornerNet. When we randomly crop a region around an object, the object is either placed randomly or at the center with some random offset. This ensures that training and testing are consistent, as the network detects objects within the crops centered at object locations.

4 CornerNet-Squeeze

In contrast to CornerNet-Saccade, CornerNet-Squeeze focuses on reducing the amount of processing per pixel. In CornerNet, most of the computational resources are spent on Hourglass-104. Although Hourglass-104 achieves good performance, it is expensive in terms of number of parameters and inference time. To reduce the complexity of Hourglass-104, we use ideas from SqueezeNet [20] and MobileNets [16] to design a lightweight hourglass architecture.

4.1 Ideas from SqueezeNet and MobileNets

SqueezeNet proposes three strategies to reduce network complexity: (1) replacing 3×3 kernels with 1×1 kernels; (2) decreasing input channels to 3×3 kernels; (3) downsampling late. The building block of SqueezeNet, the fire module, encapsulates the first two ideas. The fire module first reduces the number of input channels with a squeeze layer consisting of 1×1 filters. Then, it feeds the result through an expand layer consisting of a mixture of 1×1 and 3×3 filters.



Based on the insights provided by SqueezeNet, we use the fire module in CornerNet-Squeeze instead of the residual block. Furthermore, inspired by the success of MobileNets, we replace the 3×3 standard convolution in the second layer with a 3×3 depth-wise separable convolution, which further improves inference time. Tab. 1 shows a detailed comparison between the residual block in CornerNet and the new fire module in CornerNet-Squeeze.
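Following Table 1, the new fire module can be sketched in PyTorch as below (normalization layers and the exact activation placement are our assumptions):

```python
import torch
import torch.nn as nn

class SqueezeFire(nn.Module):
    """New fire module (sketch, per Table 1): a 1x1 squeeze layer reduces
    the input from k to k'/2 channels; the expand layer runs a 1x1 conv and
    a 3x3 depth-wise conv in parallel and concatenates them to k' channels."""
    def __init__(self, k, k_out):
        super().__init__()
        sq = k_out // 2
        self.squeeze = nn.Conv2d(k, sq, 1)
        self.expand_1x1 = nn.Conv2d(sq, sq, 1)
        self.expand_dw = nn.Conv2d(sq, sq, 3, padding=1, groups=sq)  # depth-wise
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.squeeze(x)
        return self.relu(torch.cat([self.expand_1x1(y), self.expand_dw(y)], 1))
```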

We do not explore the third idea in SqueezeNet. Since the hourglass network has a symmetrical structure, delayed downsampling results in higher resolution feature maps during the upsampling. Performing convolution on high resolution feature maps is computationally expensive, which would prevent us from achieving real-time detection.

Residual block in CornerNet:
  h×w×k    →  3×3 Conv, ReLU              →  h×w×k′
  h×w×k′   →  3×3 Conv, ReLU              →  h×w×k′

Fire module in CornerNet-Squeeze:
  h×w×k    →  1×1 Conv                    →  h×w×k′/2
  h×w×k′/2 →  1×1 Conv + 3×3 Dwise, ReLU  →  h×w×k′

Table 1: Comparison between the residual block and the new fire module (Input → Operator → Output).

Other than replacing the residual blocks, we also make a few more modifications. We reduce the maximum feature map resolution of the hourglass modules by adding one more downsampling layer before the hourglass modules, and remove one downsampling layer in each hourglass module. CornerNet-Squeeze correspondingly downsizes the image three times before the hourglass modules, whereas CornerNet downsizes the image twice. We replace the 3×3 filters with 1×1 filters in the prediction modules of CornerNet. Finally, we replace the nearest neighbor upsampling in the hourglass network with transpose convolution with a 4×4 kernel.
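For reference, a 4×4 transpose convolution that exactly doubles the spatial resolution, standing in for the nearest-neighbor upsampling (the channel count and padding are our assumptions; with stride 2 and padding 1, the output is exactly 2H × 2W):

```python
import torch.nn as nn

# out = (in - 1) * stride - 2 * padding + kernel = 2 * in for these settings.
upsample = nn.ConvTranspose2d(in_channels=256, out_channels=256,
                              kernel_size=4, stride=2, padding=1)
```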

4.2 Training Details

We use the same training losses and hyperparameters of CornerNet to train CornerNet-Squeeze. The only change is the batch size. Downsizing the image one more time prior to the hourglass modules reduces the memory usage by 4 times under the same image resolution in CornerNet-Squeeze. We train the network with a batch size of 55 on four 1080Ti GPUs (13 images on the master GPU and 14 images per GPU for the rest of the GPUs).

5 Experiments

Implementation Details. CornerNet-Lite is implemented in PyTorch [41]. We use COCO [31] to evaluate CornerNet-Lite and compare it with other detectors. To measure the inference time, for each detector, we start the timer as soon as it finishes reading the image and stop the timer as soon as it obtains the final bounding boxes. To provide fair comparisons between different detectors, we measure the inference times on the same machine with a 1080Ti GPU and an Intel Core i7-7700k CPU.
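A sketch of this timing protocol (the detector callable is hypothetical; torch.cuda.synchronize ensures queued GPU work is counted in the measurement):

```python
import time
import torch

def timed_inference(detector, image):
    """Start timing after the image is read; stop once the final boxes are
    produced. Returns (boxes, elapsed milliseconds)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    boxes = detector(image)  # hypothetical: preprocess, forward, decode, NMS
    torch.cuda.synchronize()
    return boxes, (time.perf_counter() - start) * 1000.0
```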


Detector             GPU                Quantity   Total Mem
CornerNet            Titan X (PASCAL)   10         120GB
CornerNet-Saccade    1080Ti             4          44GB

Table 2: CornerNet-Saccade saves more than 60% GPU memory and requires only 4 GPUs to train, while it outperforms CornerNet.

Training Efficiency of CornerNet-Saccade. CornerNet-Saccade not only improves the efficiency in testing but also in training. Tab. 2 shows that we are able to train CornerNet-Saccade on only four 1080Ti GPUs with a total of 44GB GPU memory, while CornerNet requires ten Titan X (PASCAL) GPUs with a total of 120GB GPU memory. We reduce the memory usage by more than 60%. Neither CornerNet nor CornerNet-Saccade uses mixed precision training [37].

Performance Analysis of Hourglass-54 in CornerNet-Saccade. We introduce a new hourglass, Hourglass-54, in CornerNet-Saccade, and perform two experiments to better understand the performance contribution of Hourglass-54. First, we train CornerNet-Saccade with Hourglass-104 instead of Hourglass-54. Second, we train CornerNet with Hourglass-54 instead of Hourglass-104. For the second experiment, due to limited resources, we train both networks with a batch size of 15 on four 1080Ti GPUs and follow the training details in [27].

                               AP     APs    APm    APl    APatt
CornerNet-Saccade w/ HG-54     42.6   25.5   44.3   58.4   42.7
 + gt attention                50.3   32.3   53.4   65.3   -
CornerNet-Saccade w/ HG-104    41.4   23.8   43.5   57.1   40.1
 + gt attention                48.9   32.4   51.8   62.6   -
CornerNet w/ HG-54             37.2   18.4   40.3   49.0   -
CornerNet w/ HG-104            38.2   18.6   40.9   50.1   -

Table 3: Hourglass-54 produces better results when combined with saccade on the COCO validation set.

Tab. 3 shows that CornerNet-Saccade with Hourglass-54 (42.6% AP) is more accurate than with Hourglass-104 (41.4%). To investigate the difference in performance, we evaluate the quality of both the attention maps and the bounding boxes. First, predicting the attention maps can be seen as a binary classification problem, where the object locations are positives and the rest are negatives. We measure the quality of the attention maps by average precision, denoted as APatt. Hourglass-54 achieves an APatt of 42.7%, while Hourglass-104 achieves 40.1%, suggesting that Hourglass-54 is better at predicting attention maps.
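One way to realize this metric (a sketch; the paper does not spell out its exact evaluation protocol, so the per-pixel formulation below is an assumption):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def attention_ap(pred_maps, gt_maps):
    """Score attention prediction as per-pixel binary classification.
    pred_maps: predicted probabilities; gt_maps: 1 at object centers, else 0."""
    scores = np.asarray(pred_maps).reshape(-1)
    labels = np.asarray(gt_maps).reshape(-1)
    return average_precision_score(labels, scores)
```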

Second, to study the quality of bounding boxes from each network, we replace the predicted attention maps with the ground-truth attention maps, and also train CornerNet with Hourglass-54. With the ground-truth attention maps, CornerNet-Saccade with Hourglass-54 achieves an AP of 50.3% while CornerNet-Saccade with Hourglass-104 achieves an AP of 48.9%. CornerNet with Hourglass-54 achieves an AP of 37.2%, while Hourglass-104 achieves 38.2%. The results suggest that Hourglass-54 produces better bounding boxes when combined with saccade.

CornerNet-Squeeze Ablation Study. We study each major change in CornerNet-Squeeze to understand its contribution to the inference time and AP, and compare the squeezed hourglass network to different networks in Tab. 4. To conserve GPUs, each model is only trained for 250k iterations, following the details in Sec. 4.2. Results in Tab. 4 show that our squeezed hourglass network achieves better performance and efficiency than existing networks.


                                                          Time    AP
CornerNet                                                 211ms   31.4*
 + w/o flipped image                                      111ms   29.7*
 + one extra downsampling before HG modules               41ms    33.0
 + replace residual blocks with new fire modules          31ms    29.8
 + replace 3×3 with 1×1 conv in prediction layers         28ms    29.8
 + upsample using transpose conv (CornerNet-Squeeze)      30ms    30.3
replace new fire modules with original fire modules [20]  41ms    31.0
replace new fire modules with MobileNets [16] modules     35ms    26.8
replace squeezed hourglass with SqueezeNet [20]           24ms    12.5
replace squeezed hourglass with MobileNets [16]           26ms    22.3
replace squeezed hourglass with ShuffleNet [64]           31ms    19.7

Table 4: Ablation study on CornerNet-Squeeze. *Note that CornerNet is trained with a smaller batch size.

                             Time   AP     APs    APm    APl
CornerNet-Squeeze-Saccade    61ms   32.7   17.3   32.6   47.1
 + gt attention              -      38.0   24.4   39.3   50.2
CornerNet-Squeeze            30ms   34.4   14.8   36.9   49.5

Table 5: Saccade only helps if the attention maps are sufficiently accurate.

CornerNet-Squeeze-Saccade. We try combining CornerNet-Squeeze with saccades to further improve the efficiency. The results in Tab. 5 suggest that saccades can only help if the attention maps are sufficiently accurate. Due to its ultra-compact architecture, CornerNet-Squeeze-Saccade does not have enough capacity to detect objects and predict attention maps simultaneously. Furthermore, CornerNet-Squeeze only operates on single-scale images, which provides much less room for CornerNet-Squeeze-Saccade to save. CornerNet-Squeeze-Saccade may thus process more pixels than CornerNet-Squeeze, slowing down the inference.

                      Time     AP     APs    APm    APl
YOLOv3                39ms     33.0   18.3   35.4   41.9
CornerNet-Squeeze     30ms     34.4   13.7   36.5   47.4
CornerNet (single)    211ms    40.6   19.1   42.8   54.3
CornerNet (multi)     1147ms   42.2   20.7   44.8   56.6
CornerNet-Saccade     190ms    43.2   24.4   44.6   57.3

Table 6: CornerNet-Lite versus CornerNet and YOLOv3 on the COCO test set.

Comparisons with YOLO and CornerNet. We also compare CornerNet-Lite with CornerNet and YOLOv3 on the COCO test set in Tab. 6. CornerNet-Squeeze is faster and more accurate than YOLOv3. CornerNet-Saccade is more accurate than multi-scale CornerNet and 6 times faster.

6 Conclusions

We propose CornerNet-Lite, which is a combination of two efficient variants of CornerNet: CornerNet-Saccade and CornerNet-Squeeze. Together these contributions for the first time reveal the potential of keypoint-based detection to be useful for applications requiring processing efficiency.

Acknowledgements: This work is partially supported by a DARPA grant FA8750-18-2-0019 and an IARPA grant D17PC00343.


References

[1] A Terry Bahill, Michael R Clark, and Lawrence Stark. The main sequence, a tool for studying human eye movements. Mathematical Biosciences, 24(3-4):191–204, 1975.

[2] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.

[3] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[4] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision, pages 354–370. Springer, 2016.

[5] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, and Thomas Huang. Revisiting rcnn: On awakening the classification power of faster rcnn. In Proceedings of the European Conference on Computer Vision (ECCV), pages 453–468, 2018.

[6] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

[7] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.

[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.

[9] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.

[10] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.

[11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[12] Giovanni Gualdi, Andrea Prati, and Rita Cucchiara. Multistage particle windows for fast and accurate object detection. IEEE transactions on pattern analysis and machine intelligence, 34(8):1589–1604, 2011.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.


[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.

[16] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.

[18] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

[19] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7310–7311, 2017.

[20] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[21] Jisoo Jeong, Hyojin Park, and Nojun Kwak. Enhancement of ssd by concatenating feature maps for object detection. arXiv preprint arXiv:1705.09587, 2017.

[22] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–799, 2018.

[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. Ron: Reverse connection with objectness prior networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5936–5944, 2017.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[26] Kevin Alexander Laube and Andreas Zell. Shufflenasnets: Efficient cnn models through modified efficient neural architecture search. arXiv preprint arXiv:1812.02975, 2018.


[27] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.

[28] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[29] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017.

[30] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

[32] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

[34] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018.

[35] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[36] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.

[37] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

[38] Mahyar Najibi, Bharat Singh, and Larry S Davis. Autofocus: Efficient multi-scale inference. arXiv preprint arXiv:1812.01600, 2018.

[39] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[40] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2277–2287, 2017.


[41] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[42] Marco Pedersoli, Jordi Gonzàlez, Andrew D Bagdanov, and Juan J Villanueva. Recursive coarse-to-fine localization for fast object detection. In European Conference on Computer Vision, pages 280–293. Springer, 2010.

[43] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6181–6189, 2018.

[44] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[45] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.

[46] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[47] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[48] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[49] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[50] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, pages 1919–1927, 2017.

[51] Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, and Abhinav Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.

[52] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[53] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection – snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.

[54] Bharat Singh, Mahyar Najibi, and Larry S Davis. Sniper: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9333–9343, 2018.


[55] Lachlan Tychsen-Smith and Lars Petersson. Denet: Scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 428–436, 2017.

[56] Paul Viola, Michael Jones, et al. Rapid object detection using a boosted cascade of simple features. CVPR (1), 1:511–518, 2001.

[57] Robert J Wang, Xiang Li, and Charles X Ling. Pelee: A real-time object detection system on mobile devices. In Advances in Neural Information Processing Systems, pages 1963–1972, 2018.

[58] Xinggang Wang, Kaibing Chen, Zilong Huang, Cong Yao, and Wenyu Liu. Point linking network for object detection. arXiv preprint arXiv:1706.03646, 2017.

[59] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, and Rama Chellappa. Deep regionlets for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 798–814, 2018.

[60] Alfred L Yarbus. Eye movements and vision. Springer, 2013.

[61] Yao Zhai, Jingjing Fu, Yan Lu, and Houqiang Li. Feature selective networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4139–4147, 2018.

[62] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203–4212, 2018.

[63] Wei Zhang, Gregory Zelinsky, and Dimitris Samaras. Real-time accurate object detection using multiple resolutions. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.

[64] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

[65] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L Yuille. Single-shot object detection with enriched semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5813–5821, 2018.

[66] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. arXiv preprint arXiv:1811.04533, 2018.

[67] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168, 2018.