object detection in crowded scenescmp.felk.cvut.cz/cmp/events/_colloquia/colloquium... · 2...
TRANSCRIPT
-
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Object Detection in Crowded Scenes
Bastian LeibeComputer Vision LaboratoryETH Zurich
CMP Seminar, Prague, 05.10.2006
joint work with:Nico Cornelis,Kurt Cornelis, Luc Van Gool,
Edgar Seemann,Krystian Mikolajczyk,Bernt Schiele
CVPR’05, BMVC’06CVPR’06 Video Proceedings
DAGM’06
-
2
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Motivation
• Object Detection in Crowded ScenesRecognize and localize objects of a learned categoryLearn object variability
– Changes in appearance, scale, and articulationCompensate for clutter, overlap, and occlusion
-
3
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Motivation (2)
• Urban scene analysis from a moving vehicleDetect objects in the imageLocalize them in 3DBuild up a metric scene model
• Applications e.g. in driver assistance systems
-
4
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Challenges
Brightly-lit areasMotion blur
Lense flaring
+ Intra-category variability, multiple viewpoints, partial occlusion, ...Dark shadows
-
5
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Outline
• Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification
• Recent extensionsEvaluation of different local featuresMulti-cue integration
• Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation
-
6
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Implicit Shape Model - Representation
• Learn appearance codebookExtract patches at interest pointsAgglomerative clustering ⇒ codebook
• Learn spatial distributionsMatch codebook to training imagesRecord matching positions on object
training images(+reference segmentation)
Appearance codebook…
………
…
Spatial occurrence distributionsx
y
s
x
y
sx
y
s
x
y
s
-
7
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Implicit Shape Model - Recognition
BackprojectedHypotheses
Interest Points Matched Codebook Entries
Probabilistic Voting
3D Voting Space(continuous)
x
y
s
Backprojectionof Maxima
[Leibe & Schiele,04]
-
8
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Implicit Shape Model - Recognition
BackprojectedHypotheses
Interest Points Matched Codebook Entries
Probabilistic Voting
Segmentation3D Voting Space
(continuous)
x
y
s
Backprojectionof Maxima
p(figure)Probabilities
[Leibe & Schiele,04]
-
9
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Example Results: Motorbikes
-
10
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Usual procedure: Pruning of overlapping hypotheses.⇒ Counterproductive here!
• Look at segmentations instead⇒ Support comes from different areas!
Segmentation-Based Verification
-
11
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Formalization in MDL Framework
• Savings of a hypothesis
• withSarea : #pixels N in segmentationSmodel: model cost, proportional to expected areaSerror : estimate of error, according to
• Final form of equation
-
12
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Formalization in MDL Framework (2)
• Savings of combined hypothesis
• Goal: Find combination that best explains the imageQuadratic Boolean Optimization problem
mSS
SSmQmmmS
N
N
hhh
hhhT
m
T
m⎥⎥⎥
⎦
⎤
⎢⎢⎢
⎣
⎡
==
∩
∩
L
MOM
L)
21
11
21
21
max max)(
( ) ( )21212121 hhShhSSSS errorareahhhh ∩+∩−+=∪
[Leonardis et al,95]
-
13
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Qualitative Results
[Leibe, Seemann, Schiele, CVPR’05]
-
14
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Estimated Articulations
Walking direction
Detection scale
Body pose
Problem cases
-
15
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Detections at EER
Single-frame recognition – no temporal continuity used!
[Leibe, Seemann, Schiele, CVPR’05]
-
16
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Outline
• Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification
• Recent extensionsEvaluation of different local featuresMulti-cue integration
• Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation
-
17
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Which Features to Use?
…
Region DescriptorPatchesSIFTGLOHPCA-SIFTShape Context
Region DetectorDoGHarris LaplaceHessian Laplace(Salient Regions)(MSER)
Cue Codebook(for each combination)
[Leibe, Mikolajczyk, Schiele, BMVC’06]
-
18
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Local Feature Evaluation
• Results on TUD Motorbike set (115 imgs, 125 objects)Best descriptors: SIFT + Shape ContextBest detectors: Hes-Lap + DoGPerformance significantly improved compared to DoG+Patches
[Leibe, Mikolajczyk, Schiele, BMVC’06]
-
19
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Further improvements by combining several local cuesImproved recognition performanceIncreased recall at low false-positive rates
⇒ Reaching a level where the system can be used for applications.
Results on Other Data Sets
-
20
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Outline
• Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification
• Recent extensionsEvaluation of different local featuresMulti-cue integration
• Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation
-
21
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Cognitive Loop with 3D Geometry
• Connect recognition and reconstruction• Reconstruction pathway delivers scene geometry
⇒ Greatly improves recognition performance
• Recognition detects objects that disturb reconstruction⇒ More accurate geometry estimate
-
22
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Outline
• Hardware setup
• Reconstruction pathwayReal-time Structure-from-MotionReal-time dense reconstruction
• Recognition pathwayLocal-feature based object detectionIncorporation of scene geometryTemporal integration in world coordinate frameFeedback to reconstruction
• Results and Conclusion
-
23
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Hardware Setup
• Stereo camera rig mounted on top of the vehicle• Calibrated w.r.t. wheel base points• Video streams captured at 25 fps, 360×288 resolution
-
24
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Real-Time Structure-from-Motion
• Basis: very fast feature matchingSimple featuresOptimized for urban environmentOnly computed on green channel of a single camera
• Rest: standard SfM pipeline[Cornelis et al., CVPR’06]
-
25
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Dense reconstruction on rectified imagesRuled surface assumption to speed-up dense reconstructionCorrelation measure: Sum of per-pixel SSDs along vertical linesLine-sweep algorithm with ordering constraints (DP)Fast computation on GPU
• Errors introduced by pixels not belonging to facades!
Real-Time Dense Reconstruction
[Cornelis et al., CVPR’06]
-
26
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Merge dense reconstructions using known camera poses.• “Voted polygon carving” on 2D projection
Real-Time Dense Reconstruction (2)
[Cornelis et al., CVPR’06]
-
27
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Merge dense reconstructions using known camera poses.• “Voted polygon carving” on 2D projection• Surfaces registered on world map using GPS
Real-Time Dense Reconstruction (2)
[Cornelis et al., CVPR’06]
-
28
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
• Run-timesSfM + Bundle adjustment: 26-30 fps on CPUDense reconstruction: 26 fps on GPU
Textured 3D Model
Original 3D Reconstruction
[Cornelis et al., CVPR’06]
-
29
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Information Flow into Recognition
• For each frame, 3D reconstruction deliversExternal camera calibrationGround plane estimate
⇒ Used for improving recognition of the next frame.
-
30
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Appearance-Based Car Detection
• Bank of 5 single-view ISM detectors• Each based on 3 local cues
Harris-Laplace, Hessian-Laplace, and DoG interest regions Local Shape Context descriptors
• Semi-profile detectors additionally mirrored• Not real-time yet…
[Leibe, Mikolajczyk, Schiele, BMVC’06]
-
31
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
2D/3D Interactions
• Likelihood of 3D hypothesis H given image I and 2D detections h:
• 2D recognition scoreExpressed in terms of per-pixel p(figure) probabilities
recognitionscore (2D)
-
32
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
2D/3D Interactions
• Likelihood of 3D hypothesis H given image I and 2D detections h:
• 3D priorDistance prior (uniform range)Size prior (Gaussian)
⇒ Significantly reduced search space
3D prior
x
s
y Search corridor
-
33
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
2D/3D Interactions
• Likelihood of 3D hypothesis H given image I and 2D detections h:
• 2D/3D transferTwo image-plane detections are consistent if they correspond to the same 3D object
⇒ Multi-viewpoint integration⇒ Multi-camera integration
2D/3Dtransfer
-
34
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Detections Using Ground Plane Constraints
left camera1175 frames
-
35
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Quantitative Results
• Detection performance on first 600 framesAll cars annotated that were >50% visibleGround plane constraint significantly improves precisionPerformance: 0.2 fp/image at 50% recall
-
36
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Temporal Integration
• Temporal integration in world coordinate frameUsing external camera calibration from SfM.Each detection transfers to a 3D observation H.Find superset of 3D hypotheses .Estimate orientation using cluster shape & detected viewpoints.Select set of 3D hypotheses that best explain the observations.
-
37
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Hypothesis Selection for 3D Detections
• Quadratic Boolean Optimization Problem (from MDL)
• Individual scores (diagonal terms)
• Interaction costs (off-diagonal terms)
temporaldecay
likelihood ofmembership to
hypothesis
penalty forphysicaloverlap
[Leonardis et al,95]
-
38
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Result of Temporal Integration
-
39
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Online 3D Car Location Estimates
-
40
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
3D Estimates After Convergence
-
41
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Feedback into 3D Reconstruction
• Feedback of detections & segmentation mapsUsed to discard features on cars for SfMUsed to mask out cars in dense reconstruction
⇒ More accurate 3D estimates in the next frame.
-
42
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Another Application: 3D City Modeling
Original 3D Reconstruction
Enhancing your driving experience…
-
43
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Conclusion
• Object detection in crowded scenesCognitive Loop between recognition and segmentationInitial hypotheses → top-down segmentation → verification
• Application for cognitive scene analysisCombining recognition, reconstruction, and temporal integration
• Cognitive Loop between 2D and 3D processingReconstruction delivers camera calibration, ground plane3D context tremendously improves recognition performanceCar detection, segmentation makes 3D estimation more accurate
• System applied to challenging real-world taskReal-time 3D reconstruction (26-30 fps)Accurate object detection & 3D pose estimation results
-
44
Perc
eptu
al a
nd S
enso
ry A
ugm
ente
d Co
mpu
ting
Bastian Leibe, ETH Zurich
Obj
ect D
etec
tion
in C
row
ded
Scen
es
Thank you very much for your attention!
Object Detection in Crowded ScenesMotivationMotivation (2)ChallengesOutlineImplicit Shape Model - RepresentationImplicit Shape Model - RecognitionImplicit Shape Model - RecognitionExample Results: MotorbikesSegmentation-Based VerificationFormalization in MDL FrameworkFormalization in MDL Framework (2)Qualitative ResultsEstimated ArticulationsDetections at EEROutlineWhich Features to Use?Local Feature EvaluationResults on Other Data SetsOutlineCognitive Loop with 3D GeometryOutlineHardware SetupReal-Time Structure-from-MotionReal-Time Dense ReconstructionReal-Time Dense Reconstruction (2)Real-Time Dense Reconstruction (2)Textured 3D ModelInformation Flow into RecognitionAppearance-Based Car Detection2D/3D Interactions2D/3D Interactions2D/3D InteractionsDetections Using Ground Plane ConstraintsQuantitative ResultsTemporal IntegrationHypothesis Selection for 3D DetectionsResult of Temporal IntegrationOnline 3D Car Location Estimates3D Estimates After ConvergenceFeedback into 3D ReconstructionAnother Application: 3D City ModelingConclusion