reppoints: point set representation for object detection · reppoints: point set representation for...
Post on 26-Mar-2020
9 Views
Preview:
TRANSCRIPT
-
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for feature extraction in deepnetworks, because of its regular shape and the ease of sub-dividing a rectangular window into a matrix of pooled cells.
Though bounding boxes facilitate computation, they pro-vide only a coarse localization of objects that does not con-form to an object’s shape and pose. Features extracted fromthe regular cells of a bounding box may thus be heavilyinfluenced by background content or uninformative fore-ground areas that contain little semantic information. Thismay result in lower feature quality that degrades classifica-tion performance in object detection.
In this paper, we propose a new representation, calledRepPoints, that provides more fine-grained localization and
1
arX
iv:1
904.
1149
0v1
[cs
.CV
] 2
5 A
pr 2
019
RepPoints: Point Set Representation for Object Detection
Ze Yang1†∗ Shaohui Liu2,3†∗ Han Hu3 Liwei Wang1 Stephen Lin31Peking University 2Tsinghua University
3Microsoft Research Asiayangze@pku.edu.cn, b1ueber2y@gmail.com, wanglw@cis.pku.edu.cn
{hanhu, stevelin}@microsoft.com
Abstract
Modern object detectors rely heavily on rectangularbounding boxes, such as anchors, proposals and the fi-nal predictions, to represent objects at various recognitionstages. The bounding box is convenient to use but providesonly a coarse localization of objects and leads to a cor-respondingly coarse extraction of object features. In thispaper, we present RepPoints (representative points), a newfiner representation of objects as a set of sample points use-ful for both localization and recognition. Given groundtruth localization and recognition targets for training, Rep-Points learn to automatically arrange themselves in a man-ner that bounds the spatial extent of an object and indicatessemantically significant local areas. They furthermore donot require the use of anchors to sample a space of boundingboxes. We show that an anchor-free object detector basedon RepPoints, implemented without multi-scale training andtesting, can be as effective as state-of-the-art anchor-baseddetection methods, with 42.8 AP and 65.0 AP50 on theCOCO test-dev detection benchmark.
1. Introduction
Object detection aims to localize objects in an image andprovide their class labels. As one of the most fundamentaltasks in computer vision, it serves as a key component formany vision applications, including instance segmentation[34], human pose analysis [43], and visual reasoning [47].The significance of the object detection problem togetherwith the rapid development of deep neural networks has ledto substantial progress in recent years [7, 12, 11, 38, 15, 3].
In the object detection pipeline, bounding boxes, whichencompass rectangular areas of an image, serve as the ba-sic element for processing. They describe target locationsof objects throughout the stages of an object detector, from
∗Equal contribution. †The work is done when Ze Yang and ShaohuiLiu are interns at Microsoft Research Asia.
recognitionsupervision
convert
feature extraction+ classification
localizationsupervision
Figure 1. RepPoints are a new representation for object detectionthat consists of a set of points which indicate the spatial extent ofan object and semantically significant local areas. This representa-tion is learned via weak localization supervision from rectangularground-truth boxes and implicit recognition feedback. Based onthe richer RepPoints representation, we develop an anchor-free ob-ject detector that yields improved performance compared to usingbounding boxes.
anchors and proposals to final predictions. Based on thesebounding boxes, features are extracted and used for pur-poses such as object classification and location refinement.The prevalence of the bounding box representation canpartly be attributed to common metrics for object detectionperformance, which account for the overlap between esti-mated and ground truth bounding boxes of objects. Anotherreason lies in its convenience for
top related