object detection, deep learning, and r-cnns · object detection, deep learning, and r-cnns partly...

58
Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook

Upload: others

Post on 24-Jun-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Object detection, deep learning, and R-CNNs

Partly from Ross Girshick

Microsoft Research

Now at Facebook

Page 2: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Pedestrians

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

AP ~77%More sophisticated methods: AP ~90%

(a) average gradient image over training examples(b) each “pixel” shows max positive SVM weight in the block centered on that pixel(c) same as (b) for negative SVM weights(d) test image(e) its R-HOG descriptor(f) R-HOG descriptor weighted by positive SVM weights(g) R-HOG descriptor weighted by negative SVM weights

Page 3: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Overview of HOG Method

1. Compute gradients in the region to be described

2. Put them in bins according to orientation

3. Group the cells into large blocks

4. Normalize each block

5. Train classifiers to decide if these are parts of a human

Page 4: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Details

• Gradients[-1 0 1] and [-1 0 1]T were good enough filters.

• Cell HistogramsEach pixel within the cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. (9 channels worked)

• BlocksGroup the cells together into larger blocks, either R-HOGblocks (rectangular) or C-HOG blocks (circular).

Page 5: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

More Details

• Block Normalization

They tried 4 different kinds of normalization.Let be the block to be normalized and e be a small constant.

Page 6: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Example: Dalal-Triggs pedestrian detector

1. Extract fixed-sized (64x128 pixel) window at each position and scale

2. Compute HOG (histogram of gradient) features within each window

3. Score the window with a linear SVM classifier

4. Perform non-maxima suppression to remove overlapping detections with lower scores

Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Page 7: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Page 8: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

uncentered

centered

cubic-corrected

diagonal

Sobel

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Outperforms

Page 9: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

• Histogram of gradient orientations

• Votes weighted by magnitude

• Bilinear interpolation between cells

Orientation: 9 bins (for unsigned angles)

Histograms in 8x8 pixel cells

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Page 10: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Normalize with respect to surrounding cells

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Page 11: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

X=

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

# features = 15 x 7 x 9 x 4 = 3780

# cells

# orientations

# normalizations by neighboring cells

Page 12: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Training set

Page 13: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

pos w neg w

Page 14: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

pedestrian

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Page 15: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Detection examples

Page 16: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum
Page 17: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Deformable Parts Model

• Takes the idea a little further

• Instead of one rigid HOG model, we have multiple HOG models in a spatial arrangement

• One root part to find first and multiple other parts in a tree structure.

Page 18: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

The Idea

Articulated parts model• Object is configuration of parts

• Each part is detectable

Images from Felzenszwalb

Page 19: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Deformable objects

Images from Caltech-256

Slide Credit: Duan Tran

Page 20: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Deformable objects

Images from D. Ramanan’s datasetSlide Credit: Duan Tran

Page 21: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

How to model spatial relations?• Tree-shaped model

Page 22: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum
Page 23: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Hybrid template/parts model

Detections

Template Visualization

Felzenszwalb et al. 2008

Page 24: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Pictorial Structures Model

Appearance likelihood Geometry likelihood

Page 25: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Results for person matching

27

Page 26: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Results for person matching

28

Page 27: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

BMVC 2009

Page 28: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

2012 State-of-the-art Detector:Deformable Parts Model (DPM)

30Felzenszwalb et al., 2008, 2010, 2011, 2012

Lifetime

Achievement

1. Strong low-level features based on HOG2. Efficient matching algorithms for deformable part-based

models (pictorial structures)3. Discriminative learning with latent variables (latent SVM)

Page 29: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Why did gradient-based models work?

Average gradient image

Page 30: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Generic categories

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?PASCAL Visual Object Categories (VOC) dataset

Page 31: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Generic categoriesWhy doesn’t this work (as well)?

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?PASCAL Visual Object Categories (VOC) dataset

Page 32: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

PASCAL VOC detection history

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

Page 33: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Part-based models & multiple features (MKL)

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

Page 34: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Kitchen-sink approaches

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

increasing complexity & plateau

Page 35: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Region-based Convolutional Networks (R-CNNs)

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

53%

62%

R-CNN v1

R-CNN v2

[R-CNN. Girshick et al. CVPR 2014]

Page 36: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

~1 year

~5 years

Region-based Convolutional Networks (R-CNNs)

[R-CNN. Girshick et al. CVPR 2014]

Page 37: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Standard Neural Networks

𝒙 = 𝑥1, … , 𝑥784𝑇 𝑧𝑗 = 𝑔(𝒘𝑗

𝑇𝒙) 𝑔 𝑡 =1

1 + 𝑒−𝑡

“Fully connected”

Page 38: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

From NNs to Convolutional NNs

• Local connectivity

• Shared (“tied”) weights

• Multiple feature maps

• Pooling

Page 39: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Just-in-Time Information• What is a convolution?

• In signal processing, a correlation is an operation that multiplies a small mask times a small piece of the image. These are examples of such masks.

• The strict definition of convolution flips the mask.

• But in computer vision, we call everything convolution.

Page 40: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Local connectivity

• Each green unit is only connected to (3)neighboring blue units

compare

Page 41: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Shared (“tied”) weights

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

𝑤1

𝑤2

𝑤3

Page 42: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

Page 43: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

Page 44: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

Page 45: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

Page 46: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window𝑤1

𝑤2

𝑤3

Page 47: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Multiple feature maps

• All orange units compute the same functionbut with a different input windows

• Orange and green units compute different functions

𝑤1

𝑤2

𝑤3

𝑤′1𝑤′2𝑤′3

Feature map 1(array of greenunits)

Feature map 2(array of orangeunits)

Page 48: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Convolutional NNs

• Pooling (max, average)

1

4

0

3

4

3

• Pooling area: 2 units

• Pooling stride: 2 units

• Subsamples feature maps

Page 49: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Image

Pooling

Convolution

2D input

Page 50: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

The key to SVMs

• It’s all about the features

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

SVM weights(+) (-)

HOG features

Page 51: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Core idea of “deep learning”

• Input: the “raw” signal (image, waveform, …)

• Features: hierarchy of features is learned from the raw input

Page 52: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

• If SVMs killed neural nets, how did they come back (in computer vision)?

Page 53: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

What’s new since the 1980s?

• More layers• LeNet-3 and LeNet-5 had 3 and 5 learnable layers

• Current models have 8 – 20+

• “ReLU” non-linearities (Rectified Linear Unit)• 𝑔 𝑥 = max 0, 𝑥

• Gradient doesn’t vanish

• “Dropout” regularization

• Fast GPU implementations

• More data

𝑥

𝑔(𝑥)

Page 54: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

What else? Object Proposals• Sliding window based object detection

• Object proposals• Fast execution

• High recall with low # of candidate boxes

ImageFeature

ExtractionClassificaiton

Iterate over window size, aspect ratio, and location

ImageFeature

ExtractionClassificaiton

Object Proposal

Page 55: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

The number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object.

Page 56: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Ross’s Own System: Region CNNs

Page 57: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Competitive Results

Page 58: Object detection, deep learning, and R-CNNs · Object detection, deep learning, and R-CNNs Partly from Ross Girshick Microsoft Research Now at Facebook. ... Slides by Pete Barnum

Top Regions for Six Object Classes