
Perceptual and Sensory Augmented Computing

Object Detection in Crowded Scenes

Bastian Leibe
Computer Vision Laboratory
ETH Zurich

CMP Seminar, Prague, 05.10.2006

Joint work with: Nico Cornelis, Kurt Cornelis, Luc Van Gool, Edgar Seemann, Krystian Mikolajczyk, Bernt Schiele

CVPR’05, BMVC’06, CVPR’06 Video Proceedings, DAGM’06


Motivation

• Object Detection in Crowded Scenes
  – Recognize and localize objects of a learned category
  – Learn object variability
      – Changes in appearance, scale, and articulation
  – Compensate for clutter, overlap, and occlusion


Motivation (2)

• Urban scene analysis from a moving vehicle
  – Detect objects in the image
  – Localize them in 3D
  – Build up a metric scene model

• Applications e.g. in driver assistance systems


Challenges

Brightly-lit areas
Motion blur
Lens flaring
Dark shadows

+ Intra-category variability, multiple viewpoints, partial occlusion, ...


Outline

• Object detection approach
  – Cognitive Loop between recognition and segmentation
  – Hypothesis Generation → Segmentation → Verification

• Recent extensions
  – Evaluation of different local features
  – Multi-cue integration

• Application for urban scene analysis
  – Cognitive Loop between recognition and 3D reconstruction
  – Real-time scene geometry estimation
  – Object detection, 3D localization, and temporal integration
  – Feedback into geometry estimation


Implicit Shape Model - Representation

• Learn appearance codebook
  – Extract patches at interest points
  – Agglomerative clustering ⇒ codebook

• Learn spatial distributions
  – Match codebook to training images
  – Record matching positions on object

[Figure: training images (+ reference segmentation) → appearance codebook → spatial occurrence distributions over (x, y, s)]
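
The two training steps on this slide can be illustrated with a small sketch. It is not the authors' implementation: it assumes patches are already given as short feature vectors, uses a simple centroid-distance agglomerative clustering with a stopping threshold, and assumes training objects are normalized to a reference scale of 1. All function names, thresholds, and the toy data are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def agglomerative_codebook(features, max_dist):
    """Repeatedly merge the two closest clusters (by centroid distance)
    until no pair is closer than max_dist; return the cluster centers."""
    clusters = [[f] for f in features]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_dist:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def learn_occurrences(codebook, training_patches, match_dist):
    """Record, for every codebook entry, where matching patches sat on the object.
    Each training patch is (feature, (px, py), patch_scale, (cx, cy)),
    with (cx, cy) the object center in the training image."""
    occurrences = {k: [] for k in range(len(codebook))}
    for feature, (px, py), scale, (cx, cy) in training_patches:
        for k, entry in enumerate(codebook):
            if dist(feature, entry) <= match_dist:
                # Offset to the object center in units of the patch scale,
                # plus object scale relative to patch scale: an (x, y, s) entry.
                occurrences[k].append(((cx - px) / scale, (cy - py) / scale, 1.0 / scale))
    return occurrences

# Toy data (hypothetical): two tight groups of 2-D "patch descriptors".
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
codebook = agglomerative_codebook(features, max_dist=1.0)
patches = [((0.0, 0.0), (10, 10), 2.0, (20, 30))]
occurrences = learn_occurrences(codebook, patches, match_dist=0.5)
print(len(codebook), occurrences)
```

A patch may match several codebook entries; in that case its position is recorded for each of them.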




Implicit Shape Model - Recognition

[Figure: recognition pipeline: Interest Points → Matched Codebook Entries → Probabilistic Voting → 3D Voting Space (continuous; x, y, s) → Backprojection of Maxima → Backprojected Hypotheses → Segmentation (p(figure) probabilities)]

[Leibe & Schiele,04]
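
The voting stage of this pipeline can be sketched roughly as follows. The sketch assumes each codebook entry stores occurrences as (dx, dy, s_rel) offsets in units of the patch scale, weights votes uniformly over matched entries and over their occurrences, and replaces the continuous voting space with coarse bins for brevity. The names, bin sizes, and toy data are hypothetical, and the p(figure) segmentation step is not covered.

```python
from collections import defaultdict

def cast_votes(matches, occurrences):
    """matches: list of (px, py, scale, [codebook ids]), one per interest point.
    occurrences: codebook id -> list of (dx, dy, s_rel).
    Returns votes (x, y, s, weight, point_index) for object center and scale."""
    votes = []
    for idx, (px, py, scale, entry_ids) in enumerate(matches):
        for k in entry_ids:
            occs = occurrences.get(k, [])
            for dx, dy, s_rel in occs:
                # Uniform weight over matched entries and their occurrences.
                weight = 1.0 / (len(entry_ids) * len(occs))
                votes.append((px + dx * scale, py + dy * scale, scale * s_rel, weight, idx))
    return votes

def find_maxima(votes, bin_size=(10.0, 10.0, 0.25), min_score=1.0):
    """Binned stand-in for maxima search in the continuous (x, y, s) space.
    Returns (score, center, supporting point indices), best first."""
    bins = defaultdict(list)
    for vote in votes:
        x, y, s = vote[:3]
        key = (int(x // bin_size[0]), int(y // bin_size[1]), int(s // bin_size[2]))
        bins[key].append(vote)
    hypotheses = []
    for members in bins.values():
        score = sum(v[3] for v in members)
        if score < min_score:
            continue
        center = tuple(sum(v[i] * v[3] for v in members) / score for i in range(3))
        support = {v[4] for v in members}  # backprojection: who voted for this maximum
        hypotheses.append((score, center, support))
    hypotheses.sort(key=lambda h: h[0], reverse=True)
    return hypotheses

# Toy data (hypothetical): two points agree on one center, a third is ambiguous.
occurrences = {0: [(5.0, 10.0, 0.5)], 1: [(-5.0, 10.0, 0.5)]}
matches = [(10.0, 10.0, 2.0, [0]), (30.0, 10.0, 2.0, [1]), (100.0, 100.0, 2.0, [0, 1])]
hypotheses = find_maxima(cast_votes(matches, occurrences))
print(hypotheses)
```

The support set returned with each maximum is what a backprojection step would start from: it identifies the interest points that contributed to the hypothesis.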


Example Results: Motorbikes


Segmentation-Based Verification

• Usual procedure: Pruning of overlapping hypotheses.
  ⇒ Counterproductive here!

• Look at segmentations instead
  ⇒ Support comes from different areas!


Formalization in MDL Framework

• Savings of a hypothesis

• with
  – S_area: #pixels N in segmentation
  – S_model: model cost, proportional to expected area
  – S_error: estimate of error, according to

• Final form of equation


Formalization in MDL Framework (2)

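• Savings of combined hypothesis

$$S_{h_1 \cup h_2} = S_{h_1} + S_{h_2} - S_{area}(h_1 \cap h_2) + S_{error}(h_1 \cap h_2)$$

• Goal: Find combination that best explains the image
  – Quadratic Boolean Optimization problem

$$\max_m S(m) = \max_m m^T Q\, m = \max_m m^T \begin{bmatrix} S_{h_1} & \cdots & \tfrac{1}{2} S_{h_1 \cap h_N} \\ \vdots & \ddots & \vdots \\ \tfrac{1}{2} S_{h_N \cap h_1} & \cdots & S_{h_N} \end{bmatrix} m$$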

[Leonardis et al,95]
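
To make the optimization concrete, here is a minimal sketch that evaluates m^T Q m for a Boolean selection and picks hypotheses greedily. The greedy strategy is only one simple heuristic for this kind of problem and is not claimed to be the solver used in the talk. The matrix values are made up, with diagonal entries as individual savings and off-diagonal entries as the halved interaction terms.

```python
def total_savings(Q, selected):
    """m^T Q m for the indicator vector m of the selected hypotheses."""
    return sum(Q[i][j] for i in selected for j in selected)

def greedy_select(Q):
    """Add the hypothesis with the largest gain in m^T Q m until no gain is positive."""
    selected = []
    remaining = set(range(len(Q)))
    while remaining:
        base = total_savings(Q, selected)
        gains = {i: total_savings(Q, selected + [i]) - base for i in sorted(remaining)}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical values: h0 and h1 overlap strongly, h2 is independent of both.
Q = [
    [10.0, -4.5, 0.0],
    [-4.5, 8.0, 0.0],
    [0.0, 0.0, 5.0],
]
chosen = greedy_select(Q)
print(chosen, total_savings(Q, chosen))
```

With these numbers, selecting h0 and h1 together yields 10 + 8 - 9 = 9, less than h0 alone, so the overlapping hypothesis h1 is rejected while the independent h2 is kept.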


Qualitative Results

[Leibe, Seemann, Schiele, CVPR’05]


Estimated Articulations

Walking direction

Detection scale

Body pose

Problem cases


Detections at EER

Single-frame recognition – no temporal continuity used!

[Leibe, Seemann, Schiele, CVPR’05]




Which Features to Use?

Region Detector: DoG, Harris-Laplace, Hessian-Laplace, (Salient Regions), (MSER)

Region Descriptor: Patches, SIFT, GLOH, PCA-SIFT, Shape Context

Cue Codebook (for each combination)

[Leibe, Mikolajczyk, Schiele, BMVC’06]


Local Feature Evaluation

• Results on TUD Motorbike set (115 imgs, 125 objects)
  – Best descriptors: SIFT + Shape Context
  – Best detectors: Hes-Lap + DoG
  – Performance significantly improved compared to DoG+Patches



Results on Other Data Sets

• Further improvements by combining several local cues
  – Improved recognition performance
  – Increased recall at low false-positive rates

⇒ Reaching a level where the system can be used for applications.




Cognitive Loop with 3D Geometry

• Connect recognition and reconstruction

• Reconstruction pathway delivers scene geometry
  ⇒ Greatly improves recognition performance

• Recognition detects objects that disturb reconstruction
  ⇒ More accurate geometry estimate


Outline

• Hardware setup

• Reconstruction pathway
  – Real-time Structure-from-Motion
  – Real-time dense reconstruction

• Recognition pathway
  – Local-feature based object detection
  – Incorporation of scene geometry
  – Temporal integration in world coordinate frame
  – Feedback to reconstruction

• Results and Conclusion


Hardware Setup

• Stereo camera rig mounted on top of the vehicle
• Calibrated w.r.t. wheel base points
• Video streams captured at 25 fps, 360×288 resolution


Real-Time Structure-from-Motion

• Basis: very fast feature matching
  – Simple features
  – Optimized for urban environment
  – Only computed on green channel of a single camera

• Rest: standard SfM pipeline

[Cornelis et al., CVPR’06]


Real-Time Dense Reconstruction

• Dense reconstruction on rectified images
  – Ruled surface assumption to speed up dense reconstruction
  – Correlation measure: Sum of per-pixel SSDs along vertical lines
  – Line-sweep algorithm with ordering constraints (DP)
  – Fast computation on GPU

• Errors introduced by pixels not belonging to facades!

[Cornelis et al., CVPR’06]
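
As a rough CPU-side illustration of the idea on this slide (not the GPU implementation referenced above), the sketch below assigns one disparity per image column, scores it by the sum of squared differences along that vertical line, and selects the disparities by dynamic programming under an ordering constraint. It assumes a rectified pair where column x in the left image matches column x - d in the right image. The images are tiny hypothetical grey-value arrays.

```python
def column_cost(left, right, x, d):
    """Sum of per-pixel squared differences along the vertical line x for disparity d."""
    xr = x - d
    if xr < 0:
        return float("inf")  # match would fall outside the right image
    return sum((row_l[x] - row_r[xr]) ** 2 for row_l, row_r in zip(left, right))

def sweep_disparities(left, right, max_disp):
    """Choose one disparity per column by dynamic programming over the columns."""
    width = len(left[0])
    disps = range(max_disp + 1)
    cost = [[column_cost(left, right, x, d) for d in disps] for x in range(width)]
    best = [cost[0][:]]   # best[x][d]: cheapest path ending in disparity d at column x
    back = []             # back[x - 1][d]: predecessor disparity on that path
    for x in range(1, width):
        row, ptr = [], []
        for d in disps:
            # Ordering constraint: the matched column x - d must not move backwards,
            # which means the previous disparity has to be at least d - 1.
            c, dp = min((best[-1][dp], dp) for dp in disps if dp >= d - 1)
            row.append(c + cost[x][d])
            ptr.append(dp)
        best.append(row)
        back.append(ptr)
    d = min(disps, key=lambda k: best[-1][k])
    path = [d]
    for ptr in reversed(back):
        d = ptr[d]
        path.append(d)
    return path[::-1]

# Hypothetical 2x6 images: the left image is the right image shifted by one column.
right = [[1, 5, 2, 8, 3, 7],
         [4, 0, 6, 1, 9, 2]]
left = [[9, 1, 5, 2, 8, 3],
        [7, 4, 0, 6, 1, 9]]
disparities = sweep_disparities(left, right, max_disp=2)
print(disparities)
```

Column 0 has no valid match at disparity 1, so the path starts at 0 and then follows the true shift of one column.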





Real-Time Dense Reconstruction (2)

• Merge dense reconstructions using known camera poses.
• “Voted polygon carving” on 2D projection
• Surfaces registered on world map using GPS



Textured 3D Model

• Run-times
  – SfM + Bundle adjustment: 26-30 fps on CPU
  – Dense reconstruction: 26 fps on GPU

[Figure: original 3D reconstruction]



Information Flow into Recognition

• For each frame, 3D reconstruction delivers
  – External camera calibration
  – Ground plane estimate

⇒ Used for improving recognition of the next frame.


Appearance-Based Car Detection

• Bank of 5 single-view ISM detectors
• Each based on 3 local cues
  – Harris-Laplace, Hessian-Laplace, and DoG interest regions
  – Local Shape Context descriptors
• Semi-profile detectors additionally mirrored
• Not real-time yet…



2D/3D Interactions

• Likelihood of 3D hypothesis H given image I and 2D detections h:

• 2D recognition score
  – Expressed in terms of per-pixel p(figure) probabilities

[Equation term highlighted: recognition score (2D)]


2D/3D Interactions


• 3D prior
  – Distance prior (uniform range)
  – Size prior (Gaussian)
  ⇒ Significantly reduced search space

[Figure: search corridor in the (x, y, s) voting space]
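
A small sketch of how such a prior could be evaluated for a single detection is given below. It rests on assumptions the slide does not state: a pinhole camera looking parallel to a flat ground plane, a known horizon row, and made-up values for focal length, camera height, distance range, and expected object height.

```python
import math

def ground_plane_distance(y_bottom, horizon_row, focal_px, cam_height_m):
    """Distance to an object whose lowest image row is y_bottom, assuming it
    stands on a flat ground plane and the camera looks parallel to that plane."""
    dy = y_bottom - horizon_row
    if dy <= 0:
        return None  # at or above the horizon: cannot rest on the ground plane
    return focal_px * cam_height_m / dy

def prior_3d(y_bottom, height_px, *, horizon_row, focal_px, cam_height_m,
             z_range=(3.0, 50.0), size_mean_m=1.5, size_sigma_m=0.2):
    """Uniform distance prior multiplied by a Gaussian prior on real-world height."""
    z = ground_plane_distance(y_bottom, horizon_row, focal_px, cam_height_m)
    if z is None or not (z_range[0] <= z <= z_range[1]):
        return 0.0
    height_m = height_px * z / focal_px  # back-project the image height to metres
    p_dist = 1.0 / (z_range[1] - z_range[0])
    p_size = (math.exp(-0.5 * ((height_m - size_mean_m) / size_sigma_m) ** 2)
              / (size_sigma_m * math.sqrt(2.0 * math.pi)))
    return p_dist * p_size

# Hypothetical camera: 400 px focal length, 2 m above ground, horizon at row 144.
cam = dict(horizon_row=144, focal_px=400.0, cam_height_m=2.0)
plausible = prior_3d(184, 30, **cam)       # 20 m away, 1.5 m tall
too_large = prior_3d(184, 90, **cam)       # 20 m away, 4.5 m tall
above_horizon = prior_3d(140, 30, **cam)   # bottom edge above the horizon
print(plausible, too_large, above_horizon)
```

For a fixed image position, only a narrow band of scales receives a non-negligible prior, which is one way to read the "search corridor" in the voting space.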


2D/3D Interactions


• 2D/3D transfer
  – Two image-plane detections are consistent if they correspond to the same 3D object
  ⇒ Multi-viewpoint integration
  ⇒ Multi-camera integration

[Equation term highlighted: 2D/3D transfer]


Detections Using Ground Plane Constraints

Left camera, 1175 frames


Quantitative Results

• Detection performance on first 600 frames
  – All cars annotated that were >50% visible
  – Ground plane constraint significantly improves precision
  – Performance: 0.2 fp/image at 50% recall


Temporal Integration

• Temporal integration in world coordinate frame
  – Using external camera calibration from SfM.
  – Each detection transfers to a 3D observation H.
  – Find superset of 3D hypotheses.
  – Estimate orientation using cluster shape & detected viewpoints.
  – Select set of 3D hypotheses that best explain the observations.
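
The first steps of this list can be sketched as follows: move per-frame detections into a common world frame using the camera pose, then group nearby observations into candidate hypotheses. The pose format (tx, tz, yaw), the rotation convention, the simple running-mean clustering, and all numbers are assumptions made for illustration. Orientation estimation and the final hypothesis selection are not covered.

```python
import math

def to_world(obs_cam, pose):
    """obs_cam: (x, z) ground-plane position in the camera frame.
    pose: (tx, tz, yaw), camera position and heading in the world frame."""
    x, z = obs_cam
    tx, tz, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (tx + c * x + s * z, tz - s * x + c * z)

def cluster_observations(points, radius):
    """Assign each point to the first cluster whose mean lies within radius,
    otherwise start a new cluster."""
    clusters = []
    for p in points:
        for members in clusters:
            mx = sum(q[0] for q in members) / len(members)
            mz = sum(q[1] for q in members) / len(members)
            if math.hypot(p[0] - mx, p[1] - mz) <= radius:
                members.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Hypothetical sequence: the camera drives 1 m forward per frame past a parked car.
frames = [
    ((0.0, 0.0, 0.0), [(2.0, 20.0)]),
    ((0.0, 1.0, 0.0), [(2.1, 19.0), (-5.0, 10.0)]),  # second entry: spurious detection
    ((0.0, 2.0, 0.0), [(1.9, 18.0)]),
]
world_points = [to_world(obs, pose) for pose, detections in frames for obs in detections]
clusters = cluster_observations(world_points, radius=1.0)
candidates = [c for c in clusters if len(c) >= 2]  # keep repeatedly observed objects
print(clusters, len(candidates))
```

Observations of the same parked car line up in world coordinates even though their image positions change from frame to frame, while the isolated spurious detection forms its own single-member cluster.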


Hypothesis Selection for 3D Detections

• Quadratic Boolean Optimization Problem (from MDL)

• Individual scores (diagonal terms)

• Interaction costs (off-diagonal terms)

[Equation annotations: temporal decay; likelihood of membership to hypothesis; penalty for physical overlap]

[Leonardis et al,95]


Result of Temporal Integration


Online 3D Car Location Estimates


3D Estimates After Convergence


Feedback into 3D Reconstruction

• Feedback of detections & segmentation maps
  – Used to discard features on cars for SfM
  – Used to mask out cars in dense reconstruction

⇒ More accurate 3D estimates in the next frame.


Another Application: 3D City Modeling

Original 3D Reconstruction

Enhancing your driving experience…


Conclusion

• Object detection in crowded scenes
  – Cognitive Loop between recognition and segmentation
  – Initial hypotheses → top-down segmentation → verification

• Application for cognitive scene analysis
  – Combining recognition, reconstruction, and temporal integration

• Cognitive Loop between 2D and 3D processing
  – Reconstruction delivers camera calibration, ground plane
  – 3D context tremendously improves recognition performance
  – Car detection, segmentation makes 3D estimation more accurate

• System applied to challenging real-world task
  – Real-time 3D reconstruction (26-30 fps)
  – Accurate object detection & 3D pose estimation results


Thank you very much for your attention!
