object detection in crowded scenescmp.felk.cvut.cz/cmp/events/_colloquia/colloquium... · 2...

44
Perceptual and Sensory Augmented Computing Object Detection in Crowded Scenes Object Detection in Crowded Scenes Bastian Leibe Computer Vision Laboratory ETH Zurich CMP Seminar, Prague, 05.10.2006 joint work with: Nico Cornelis, Kurt Cornelis, Luc Van Gool, Edgar Seemann, Krystian Mikolajczyk, Bernt Schiele CVPR’05, BMVC’06 CVPR’06 Video Proceedings DAGM’06

Upload: others

Post on 21-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

  • Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Object Detection in Crowded Scenes

    Bastian LeibeComputer Vision LaboratoryETH Zurich

    CMP Seminar, Prague, 05.10.2006

    joint work with:Nico Cornelis,Kurt Cornelis, Luc Van Gool,

    Edgar Seemann,Krystian Mikolajczyk,Bernt Schiele

    CVPR’05, BMVC’06CVPR’06 Video Proceedings

    DAGM’06

  • 2

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Motivation

    • Object Detection in Crowded ScenesRecognize and localize objects of a learned categoryLearn object variability

    – Changes in appearance, scale, and articulationCompensate for clutter, overlap, and occlusion

  • 3

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Motivation (2)

    • Urban scene analysis from a moving vehicleDetect objects in the imageLocalize them in 3DBuild up a metric scene model

    • Applications e.g. in driver assistance systems

  • 4

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Challenges

    Brightly-lit areasMotion blur

    Lense flaring

    + Intra-category variability, multiple viewpoints, partial occlusion, ...Dark shadows

  • 5

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Outline

    • Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification

    • Recent extensionsEvaluation of different local featuresMulti-cue integration

    • Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation

  • 6

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Implicit Shape Model - Representation

    • Learn appearance codebookExtract patches at interest pointsAgglomerative clustering ⇒ codebook

    • Learn spatial distributionsMatch codebook to training imagesRecord matching positions on object

    training images(+reference segmentation)

    Appearance codebook…

    ………

    Spatial occurrence distributionsx

    y

    s

    x

    y

    sx

    y

    s

    x

    y

    s

  • 7

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Implicit Shape Model - Recognition

    BackprojectedHypotheses

    Interest Points Matched Codebook Entries

    Probabilistic Voting

    3D Voting Space(continuous)

    x

    y

    s

    Backprojectionof Maxima

    [Leibe & Schiele,04]

  • 8

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Implicit Shape Model - Recognition

    BackprojectedHypotheses

    Interest Points Matched Codebook Entries

    Probabilistic Voting

    Segmentation3D Voting Space

    (continuous)

    x

    y

    s

    Backprojectionof Maxima

    p(figure)Probabilities

    [Leibe & Schiele,04]

  • 9

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Example Results: Motorbikes

  • 10

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Usual procedure: Pruning of overlapping hypotheses.⇒ Counterproductive here!

    • Look at segmentations instead⇒ Support comes from different areas!

    Segmentation-Based Verification

  • 11

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Formalization in MDL Framework

    • Savings of a hypothesis

    • withSarea : #pixels N in segmentationSmodel: model cost, proportional to expected areaSerror : estimate of error, according to

    • Final form of equation

  • 12

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Formalization in MDL Framework (2)

    • Savings of combined hypothesis

    • Goal: Find combination that best explains the imageQuadratic Boolean Optimization problem

    mSS

    SSmQmmmS

    N

    N

    hhh

    hhhT

    m

    T

    m⎥⎥⎥

    ⎢⎢⎢

    ==

    L

    MOM

    L)

    21

    11

    21

    21

    max max)(

    ( ) ( )21212121 hhShhSSSS errorareahhhh ∩+∩−+=∪

    [Leonardis et al,95]

  • 13

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Qualitative Results

    [Leibe, Seemann, Schiele, CVPR’05]

  • 14

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Estimated Articulations

    Walking direction

    Detection scale

    Body pose

    Problem cases

  • 15

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Detections at EER

    Single-frame recognition – no temporal continuity used!

    [Leibe, Seemann, Schiele, CVPR’05]

  • 16

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Outline

    • Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification

    • Recent extensionsEvaluation of different local featuresMulti-cue integration

    • Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation

  • 17

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Which Features to Use?

    Region DescriptorPatchesSIFTGLOHPCA-SIFTShape Context

    Region DetectorDoGHarris LaplaceHessian Laplace(Salient Regions)(MSER)

    Cue Codebook(for each combination)

    [Leibe, Mikolajczyk, Schiele, BMVC’06]

  • 18

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Local Feature Evaluation

    • Results on TUD Motorbike set (115 imgs, 125 objects)Best descriptors: SIFT + Shape ContextBest detectors: Hes-Lap + DoGPerformance significantly improved compared to DoG+Patches

    [Leibe, Mikolajczyk, Schiele, BMVC’06]

  • 19

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Further improvements by combining several local cuesImproved recognition performanceIncreased recall at low false-positive rates

    ⇒ Reaching a level where the system can be used for applications.

    Results on Other Data Sets

  • 20

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Outline

    • Object detection approachCognitive Loop between recognition and segmentationHypothesis Generation → Segmentation → Verification

    • Recent extensionsEvaluation of different local featuresMulti-cue integration

    • Application for urban scene analysisCognitive Loop between recognition and 3D reconstructionReal-time scene geometry estimationObject detection, 3D localization, and temporal integrationFeedback into geometry estimation

  • 21

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Cognitive Loop with 3D Geometry

    • Connect recognition and reconstruction• Reconstruction pathway delivers scene geometry

    ⇒ Greatly improves recognition performance

    • Recognition detects objects that disturb reconstruction⇒ More accurate geometry estimate

  • 22

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Outline

    • Hardware setup

    • Reconstruction pathwayReal-time Structure-from-MotionReal-time dense reconstruction

    • Recognition pathwayLocal-feature based object detectionIncorporation of scene geometryTemporal integration in world coordinate frameFeedback to reconstruction

    • Results and Conclusion

  • 23

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Hardware Setup

    • Stereo camera rig mounted on top of the vehicle• Calibrated w.r.t. wheel base points• Video streams captured at 25 fps, 360×288 resolution

  • 24

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Real-Time Structure-from-Motion

    • Basis: very fast feature matchingSimple featuresOptimized for urban environmentOnly computed on green channel of a single camera

    • Rest: standard SfM pipeline[Cornelis et al., CVPR’06]

  • 25

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Dense reconstruction on rectified imagesRuled surface assumption to speed-up dense reconstructionCorrelation measure: Sum of per-pixel SSDs along vertical linesLine-sweep algorithm with ordering constraints (DP)Fast computation on GPU

    • Errors introduced by pixels not belonging to facades!

    Real-Time Dense Reconstruction

    [Cornelis et al., CVPR’06]

  • 26

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Merge dense reconstructions using known camera poses.• “Voted polygon carving” on 2D projection

    Real-Time Dense Reconstruction (2)

    [Cornelis et al., CVPR’06]

  • 27

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Merge dense reconstructions using known camera poses.• “Voted polygon carving” on 2D projection• Surfaces registered on world map using GPS

    Real-Time Dense Reconstruction (2)

    [Cornelis et al., CVPR’06]

  • 28

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    • Run-timesSfM + Bundle adjustment: 26-30 fps on CPUDense reconstruction: 26 fps on GPU

    Textured 3D Model

    Original 3D Reconstruction

    [Cornelis et al., CVPR’06]

  • 29

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Information Flow into Recognition

    • For each frame, 3D reconstruction deliversExternal camera calibrationGround plane estimate

    ⇒ Used for improving recognition of the next frame.

  • 30

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Appearance-Based Car Detection

    • Bank of 5 single-view ISM detectors• Each based on 3 local cues

    Harris-Laplace, Hessian-Laplace, and DoG interest regions Local Shape Context descriptors

    • Semi-profile detectors additionally mirrored• Not real-time yet…

    [Leibe, Mikolajczyk, Schiele, BMVC’06]

  • 31

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    2D/3D Interactions

    • Likelihood of 3D hypothesis H given image I and 2D detections h:

    • 2D recognition scoreExpressed in terms of per-pixel p(figure) probabilities

    recognitionscore (2D)

  • 32

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    2D/3D Interactions

    • Likelihood of 3D hypothesis H given image I and 2D detections h:

    • 3D priorDistance prior (uniform range)Size prior (Gaussian)

    ⇒ Significantly reduced search space

    3D prior

    x

    s

    y Search corridor

  • 33

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    2D/3D Interactions

    • Likelihood of 3D hypothesis H given image I and 2D detections h:

    • 2D/3D transferTwo image-plane detections are consistent if they correspond to the same 3D object

    ⇒ Multi-viewpoint integration⇒ Multi-camera integration

    2D/3Dtransfer

  • 34

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Detections Using Ground Plane Constraints

    left camera1175 frames

  • 35

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Quantitative Results

    • Detection performance on first 600 framesAll cars annotated that were >50% visibleGround plane constraint significantly improves precisionPerformance: 0.2 fp/image at 50% recall

  • 36

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Temporal Integration

    • Temporal integration in world coordinate frameUsing external camera calibration from SfM.Each detection transfers to a 3D observation H.Find superset of 3D hypotheses .Estimate orientation using cluster shape & detected viewpoints.Select set of 3D hypotheses that best explain the observations.

  • 37

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Hypothesis Selection for 3D Detections

    • Quadratic Boolean Optimization Problem (from MDL)

    • Individual scores (diagonal terms)

    • Interaction costs (off-diagonal terms)

    temporaldecay

    likelihood ofmembership to

    hypothesis

    penalty forphysicaloverlap

    [Leonardis et al,95]

  • 38

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Result of Temporal Integration

  • 39

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Online 3D Car Location Estimates

  • 40

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    3D Estimates After Convergence

  • 41

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Feedback into 3D Reconstruction

    • Feedback of detections & segmentation mapsUsed to discard features on cars for SfMUsed to mask out cars in dense reconstruction

    ⇒ More accurate 3D estimates in the next frame.

  • 42

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Another Application: 3D City Modeling

    Original 3D Reconstruction

    Enhancing your driving experience…

  • 43

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Conclusion

    • Object detection in crowded scenesCognitive Loop between recognition and segmentationInitial hypotheses → top-down segmentation → verification

    • Application for cognitive scene analysisCombining recognition, reconstruction, and temporal integration

    • Cognitive Loop between 2D and 3D processingReconstruction delivers camera calibration, ground plane3D context tremendously improves recognition performanceCar detection, segmentation makes 3D estimation more accurate

    • System applied to challenging real-world taskReal-time 3D reconstruction (26-30 fps)Accurate object detection & 3D pose estimation results

  • 44

    Perc

    eptu

    al a

    nd S

    enso

    ry A

    ugm

    ente

    d Co

    mpu

    ting

    Bastian Leibe, ETH Zurich

    Obj

    ect D

    etec

    tion

    in C

    row

    ded

    Scen

    es

    Thank you very much for your attention!

    Object Detection in Crowded ScenesMotivationMotivation (2)ChallengesOutlineImplicit Shape Model - RepresentationImplicit Shape Model - RecognitionImplicit Shape Model - RecognitionExample Results: MotorbikesSegmentation-Based VerificationFormalization in MDL FrameworkFormalization in MDL Framework (2)Qualitative ResultsEstimated ArticulationsDetections at EEROutlineWhich Features to Use?Local Feature EvaluationResults on Other Data SetsOutlineCognitive Loop with 3D GeometryOutlineHardware SetupReal-Time Structure-from-MotionReal-Time Dense ReconstructionReal-Time Dense Reconstruction (2)Real-Time Dense Reconstruction (2)Textured 3D ModelInformation Flow into RecognitionAppearance-Based Car Detection2D/3D Interactions2D/3D Interactions2D/3D InteractionsDetections Using Ground Plane ConstraintsQuantitative ResultsTemporal IntegrationHypothesis Selection for 3D DetectionsResult of Temporal IntegrationOnline 3D Car Location Estimates3D Estimates After ConvergenceFeedback into 3D ReconstructionAnother Application: 3D City ModelingConclusion