object detection (d2l4 2017 upc deep learning for computer vision)

[course site]

Object DetectionDay 2 Lecture 4

#DLUPC

Amaia [email protected]

PhD CandidateUniversitat Politècnica de Catalunya

https://telecombcn-dl.github.io/2017-dlcv/

https://telecombcn-dl.github.io/2017-dlcv/

mailto:[email protected]



Object Detection

CAT, DOG, DUCK

The task of assigning a label and a bounding box to all objects in the image

2

Object Detection: Datasets

3

20 categories6k training images

6k validation images10k test images

200 categories456k training images

60k validation + test images

80 categories200k training images60k val + test images

http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html

http://mscoco.org/

http://www.image-net.org/

Object Detection as Classification

Classes = [cat, dog, duck]

Cat ? NO

Dog ? NO

Duck? NO

4


Cat ? NO

Dog ? NO

Duck? NO

5



Cat ? YES

Dog ? NO

Duck? NO

6



Cat ? NO

Dog ? NO

Duck? NO

7


Problem: Too many positions & scales to test

Solution: If your classifier is fast enough, go for it8


Object Detection with ConvNets?

Convnets are computationally demanding. We can’t test all positions & scales !

Solution: Look at a tiny subset of positions. Choose them wisely :)9

Region Proposals

● Find “blobby” image regions that are likely to contain objects● “Class-agnostic” object detector● Look for “blob-like” regions

Slide Credit: CS231n 10

http://cs231n.github.io/

Region Proposals

Selective Search (SS) Multiscale Combinatorial Grouping (MCG)

[SS] Uijlings et al. Selective search for object recognition. IJCV 2013

[MCG] Arbeláez, Pont-Tuset et al. Multiscale combinatorial grouping. CVPR 2014 11

https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013/

https://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/papers/apbmm_cvpr2014.pdf

Object Detection with Convnets: R-CNN

Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014

12

http://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html

R-CNN


1. Train network on proposals

2. Post-hoc training of SVMs & Box regressors on fc7 features

13


R-CNN

14

We expect: We get:

R-CNN


1. Train network on proposals

2. Post-hoc training of SVMs & Box regressors on fc7 features

3. Non Maximum Suppression + score threshold

15


R-CNN


16


R-CNN: Problems

1. Slow at test-time: need to run full forward pass of CNN for each region proposal

2. SVMs and regressors are post-hoc: CNN features not updated in response to SVMs and regressors

3. Complex multistage training pipeline



Fast R-CNN

Girshick Fast R-CNN. ICCV 2015

Solution: Share computation of convolutional layers between region proposals for an image

R-CNN Problem #1: Slow at test-time: need to run full forward pass of CNN for each region proposal

18

http://www.cv-foundation.org/openaccess/content_iccv_2015/html/Girshick_Fast_R-CNN_ICCV_2015_paper.html

Fast R-CNN: Sharing features

Hi-res input image:3 x 800 x 600

with region proposal

Convolution and Pooling

Hi-res conv features:C x H x W

with region proposal

Fully-connected layers

Max-pool within each grid cell

RoI conv features:C x h x w

for region proposal

Fully-connected layers expect low-res conv features:

C x h x w

Slide Credit: CS231n 19Girshick Fast R-CNN. ICCV 2015



Fast R-CNN

Solution: Train it all at together E2E

R-CNN Problem #2&3: SVMs and regressors are post-hoc. Complex training.

20Girshick Fast R-CNN. ICCV 2015


Fast R-CNN

Slide Credit: CS231n

R-CNN Fast R-CNN

Training Time: 84 hours 9.5 hours

(Speedup) 1x 8.8x

Test time per image 47 seconds 0.32 seconds

(Speedup) 1x 146x

mAP (VOC 2007) 66.0 66.9

Using VGG-16 CNN on Pascal VOC 2007 dataset

Faster!

FASTER!

Better!

21


Fast R-CNN: Problem

Slide Credit: CS231n

R-CNN Fast R-CNN

Test time per image 47 seconds 0.32 seconds

(Speedup) 1x 146x

Test time per imagewith Selective Search 50 seconds 2 seconds

(Speedup) 1x 25x

Test-time speeds don’t include region proposals

22


Faster R-CNN

Con

v la

yers Region Proposal Network

FC6

Class probabilitiesFC7

FC8

RPN Proposals

RoI Pooling

Conv5_3

RPN Proposals

23Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015

Learn proposals end-to-end sharing parameters with the classification network

http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks

Faster R-CNN

Con

v la


FC6


FC8

RPN Proposals

RoI Pooling

Conv5_3

RPN Proposals

Fast R-CNN


Learn proposals end-to-end sharing parameters with the classification network


Region Proposal Network

Objectness scores(object/no object)

Bounding Box Regression

In practice, k = 9 (3 different scales and 3 aspect ratios)



Faster R-CNN: Training

Con

v la


FC6


FC8

RPN Proposals

RoI Pooling

Conv5_3

RPN Proposals


RoI Pooling is not differentiable w.r.t box coordinates. Solutions:● Alternate training● Ignore gradient of classification branch w.r.t proposal coordinates● Make pooling function differentiable (spoiler D3L6)


Faster R-CNN

Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015

R-CNN Fast R-CNN Faster R-CNN

Test time per image(with proposals)

50 seconds 2 seconds 0.2 seconds

(Speedup) 1x 25x 250x

mAP (VOC 2007) 66.0 66.9 66.9




Faster R-CNN

28

● Faster R-CNN is the basis of the winners of COCO and ILSVRC 2015&2016 object detection competitions.

He et al. Deep residual learning for image recognition. CVPR 2016

http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

YOLO: You Only Look Once

29Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Proposal-free object detection pipeline

http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf


Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 30



31

Each cell predicts:

- For each bounding box:- 4 coordinates (x, y, w, h)- 1 confidence value

- Some number of class probabilities

For Pascal VOC:

- 7x7 grid- 2 bounding boxes / cell- 20 classes

7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs



Dog

Bicycle Car

Dining Table

Predict class probability for each cell




+ NMS+ Score threshold


SSD: Single Shot MultiBox Detector

Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016 34

Same idea as YOLO, + several predictors at different stages in the network

http://arxiv.org/abs/1512.02325

YOLOv2

35Redmon & Farhadi. YOLO900: Better, Faster, Stronger. CVPR 2017

https://arxiv.org/abs/1612.08242

YOLOv2

36

Results on Pascal VOC 2007

YOLOv2

37

Results on COCO test-dev 2015

Summary

38

Proposal-based methods● R-CNN● Fast R-CNN● Faster R-CNN● SPPnet ● R-FCNProposal-free methods● YOLO, YOLOv2● SSD






Resources

39

● Official implementations:○ Faster R-CNN [caffe]○ Yolov2 [darknet]○ SSD [caffe]○ R-FCN [caffe][MxNet]

● Unofficial ports to other frameworks are likely to exist… eg type “yolo tensorflow” in your browser and pick the one you like best.

● Or… use the newly released Object detection API by Google: SSD, R-FCN & Faster R-CNN (code & pretrained models in tensorflow)

Object detection tutorials (project ideas maybe?):● Toy object detection (squares, circles, etc.) (keras)● Object detection (pets dataset) (tensorflow)

https://github.com/rbgirshick/py-faster-rcnn

https://pjreddie.com/darknet/yolo/

https://github.com/weiliu89/caffe/tree/ssd

https://github.com/daijifeng001/R-FCN

https://github.com/msracver/Deformable-ConvNets

https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html



https://medium.com/towards-data-science/object-detection-with-neural-networks-a4e2c46b4491

https://medium.com/towards-data-science/object-detection-with-neural-networks-a4e2c46b4491

https://cloud.google.com/blog/big-data/2017/06/training-an-object-detector-using-cloud-machine-learning-engine

https://cloud.google.com/blog/big-data/2017/06/training-an-object-detector-using-cloud-machine-learning-engine

Questions?

YOLO: Training

41Slide credit: YOLO Presentation @ CVPR 2016

For training, each ground truth bounding box is matched into the right cell

https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/edit#slide=id.g15092aa245_0_310

YOLO: Training


For training, each ground truth bounding box is matched into the right cell


YOLO: Training


Optimize class prediction in that cell:dog: 1, cat: 0, bike: 0, ...


YOLO: Training


Predicted boxes for this cell


YOLO: Training


Find the best one wrt ground truth bounding box, optimize it (i.e. adjust its coordinates to be closer to the ground truth’s coordinates)


YOLO: Training


Increase matched box’s confidence, decrease non-matched boxes confidence


YOLO: Training


For cells with no ground truth detections, confidences of all predicted boxes are decreased


YOLO: Training


For cells with no ground truth detections:● Confidences of all predicted

boxes are decreased ● Class probabilities are not

adjusted


YOLO: Training, formally


Bounding box coordinate regression

Bounding boxscore prediction

Classscore prediction

= 1 if box j and cell i are matched together, 0 otherwise

= 1 if box j and cell i are NOT matched together

= 1 if cell i has an object present


object detection (d2l4 2017 upc deep learning for computer vision)

Data & Analytics