Depth Estimation via Scene Classification

Vladimir Nedović

28-05-2008

[email protected]@science.uva.nl

with: Arnold Smeulders & Jan-Mark Geusebroek (UvA), André Redert (Philips Research)

seems chaotic, but there is structure - same as in natural image statistics

viewpoint constraints understood, influence on film art

‘modal’ scene configurations - structures orthogonal to each other

Order in Pollock's Chaos

Jackson Pollock, Blue Poles: Number 11, 1952

R.P. Taylor, A.P. Micolich and D. Jonas, Fractal Analysis Of Pollock's Drip Paintings, Nature, vol. 399, p.422 (1999)

Sandro Botticelli, Annunciation, 1489-90

Post-perspective (Quattrocento, after 1430)

Pre-perspective (Gothic art, before 1430)

Simone Martini (1285-1344)

W. Richards, A. Jepson and J. Feldman, Priors, Preferences and Categorical Percepts, in Perception as Bayesian Inference, pp. 80-111, 1996.

Know any tilted buildings?

Outline

Introduction

Related work

Our approach

Preliminary classification

Conclusions

Introduction

The context: fully automatic 2D to 3D conversion of video data for 3DTV

GOAL: in a fast manner, obtain an approximate, but visually pleasing 3D model from a single image

We know about stereo, structure from motion, etc., but can we also derive depth from a single image? Humans can, right?

Can we exploit some constraints? Is the data really chaotic? What about perceptual limitations of viewers?

Related work

Related work (1): Torralba & Oliva showed that depth can be derived from structure, itself derived from natural image statistics (IEEE PAMI 2001)

Related work (2): Hoiem (Carnegie Mellon Univ.) obtained 3D orientation of scene surfaces using machine learning (ICCV 2005); improved object detection (CVPR 2006 best paper) and accounted for occlusions to derive relative ordering of elements (ICCV 2007)

Related work (3): Saxena (Stanford Univ.): 3D mesh from ML on low-level features (no classes)

BUT: outdoor images only + assumes sky & ground are always present, i.e. accounts for less than half of all possibilities

Our approach

Our approach: depth estimation via geometric scene classification, i.e. holistic, not pixel-based

Separate a visual scene into its two constituent elements: consider objects separately from the stage on which they act

Determine the 3D stage model first

Stage ≈ first approximation of global depth; reduces subsequent (finer) depth processing tasks; can guide other processes, e.g. object localization & recognition

V. Nedović et al., ICCV 2007

[Figure labels: object, stage]

Our approach - stage models -

For the stage, a rough depth model is sufficient

Exploit geometric structure of images, which reduces the number of possible configurations

Only a few configurations are prominent => the first step in depth estimation can be stage classification

regularities arise from:

natural image statistics -> texture gradients

viewpoint constraints -> perspective

modal configurations & film rules -> orthogonality

Our approach - stage hierarchy -

Structure of the visual world leads to only 15 geometric scene types

Influence of structure identical indoors & outdoors => such distinction unnecessary

Three-level hierarchy

perform classification in steps: first determine the geometric neighbourhood, then proceed further
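Classification in steps can be sketched as a descent through a nested dictionary. This is only an illustration: the function `classify_in_steps`, the scoring callback `score` and the toy scores are hypothetical, and the grouping of stages shown here is inferred from the matching dataset shares in the results tables of this talk, so it is an assumption. A third level (sub-stages) could be added by nesting one level deeper.

```python
def classify_in_steps(features, tree, score):
    """Descend a nested dict, choosing the best-scoring child at every level."""
    path = []
    node = tree
    while node:                                   # an empty dict marks a leaf
        best = max(node, key=lambda name: score(features, path, name))
        path.append(best)
        node = node[best]
    return path


# Partial hierarchy: group -> stages (grouping is an assumption, see above).
TREE = {
    "tilted bkg": {"gnd+diagBkg": {}, "diagBkg": {}},
    "box": {"box": {}, "1 side-wall": {}},
    "person+bkg": {"tab+pers+bkg": {}, "pers+bkg": {}},
}

# Toy stand-in for real classifiers: look the score up in a dict.
toy_scores = {"box": 0.6, "tilted bkg": 0.4, "1 side-wall": 0.7, "pers+bkg": 0.95}


def toy_score(features, path, name):
    return features.get(name, 0.0)


result = classify_in_steps(toy_scores, TREE, toy_score)
print(result)  # ['box', '1 side-wall']
```

Note that "pers+bkg" has the highest toy score but is never considered, because the first step already committed to the "box" neighbourhood. That is the intended trade-off: fewer candidates per step, at the price of unrecoverable errors in the first step.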

Our approach - three-level hierarchy -

geometry at bottom so constrained that pre-defined crude depth maps are already possible, i.e. no parameter estimation needed!

2-3 sub-stages per stage, accounting for variability in parameters
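To illustrate what a pre-defined crude depth map could look like, here is a minimal sketch. The template shapes, the horizon position, the depth convention (0.0 = near, 1.0 = far) and all function names are assumptions made for this example; the talk does not specify them. Only the stage names "sky+gnd" and "no depth" come from the results table.

```python
def ground_plane(height, width, horizon=0.5):
    """Sky rows are far (1.0); ground runs from far at the horizon to near (0.0) at the bottom."""
    top = int(height * horizon)          # index of the first ground row
    span = max(height - 1 - top, 1)      # avoid division by zero for tiny maps
    rows = []
    for r in range(height):
        depth = 1.0 if r < top else (height - 1 - r) / span
        rows.append([depth] * width)
    return rows


def flat(height, width, value=0.5):
    """Constant depth everywhere, e.g. for a scene without usable depth structure."""
    return [[value] * width for _ in range(height)]


# Hypothetical lookup: stage label -> template generator.
TEMPLATES = {"sky+gnd": ground_plane, "no depth": flat}


def depth_for_stage(stage, height, width):
    """Return the fixed depth template for a classified stage; nothing is estimated."""
    return TEMPLATES[stage](height, width)


for row in depth_for_stage("sky+gnd", 6, 4):
    print(row)
```

Once the stage label is known, the depth map is a lookup rather than an estimation problem, which is the point of the slide.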

Preliminary classification (1)

Proof of concept with a single feature type: natural image statistics-based Weibull features (i.e. texture gradients)

TRECVID dataset of TV news used for evaluation

A.F. Smeaton et al. “Evaluation campaigns and TRECVid”, 8th ACM Int’l Workshop on Multimedia Info. Retrieval, 2006.

Features extracted based on a 4x4 region grid over the image

two features per region => 64 features in total
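The feature extraction above can be sketched roughly as follows. Several things here are assumptions, not the method from the talk: the standard two-parameter Weibull fitted by maximum likelihood, simple pixel differences as gradients, and all function names. A 4x4 grid gives 16 regions, so 2 values per region would give 32 features; this sketch assumes the two Weibull parameters (scale, shape) are computed for two derivative directions, which yields the stated total of 64.

```python
import math
import random


def fit_weibull(samples, iters=60):
    """Maximum-likelihood fit of a two-parameter Weibull; returns (scale, shape)."""
    xs = [x for x in samples if x > 0]
    if len(xs) < 2 or max(xs) == min(xs):
        return 0.0, 0.0                       # degenerate region: nothing to fit
    top = max(xs)
    xs = [x / top for x in xs]                # rescale to (0, 1] so x**k cannot overflow
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k):                             # likelihood equation for the shape k
        pw = [x ** k for x in xs]
        return sum(p * l for p, l in zip(pw, logs)) / sum(pw) - 1.0 / k - mean_log

    lo, hi = 0.01, 100.0                      # score increases with k: bisect for its root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(x ** shape for x in xs) / len(xs)) ** (1.0 / shape)
    return scale, shape


def region_features(img, r0, r1, c0, c1):
    """Weibull (scale, shape) of |d/dx| and |d/dy| inside one grid cell: 4 values."""
    gx = [abs(img[r][c + 1] - img[r][c]) for r in range(r0, r1) for c in range(c0, c1 - 1)]
    gy = [abs(img[r + 1][c] - img[r][c]) for r in range(r0, r1 - 1) for c in range(c0, c1)]
    return [*fit_weibull(gx), *fit_weibull(gy)]


def grid_features(img, rows=4, cols=4):
    """Concatenate per-cell features over a rows x cols grid."""
    h, w = len(img), len(img[0])
    feats = []
    for i in range(rows):
        for j in range(cols):
            feats += region_features(img, i * h // rows, (i + 1) * h // rows,
                                     j * w // cols, (j + 1) * w // cols)
    return feats


random.seed(0)
image = [[random.random() for _ in range(32)] for _ in range(32)]  # synthetic stand-in image
features = grid_features(image)
print(len(features))  # 64
```

The resulting vector is what a classifier would consume; a real pipeline would replace the random image with grey-level video frames and probably use proper derivative filters.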

Preliminary classification (2)

Support Vector Machines (SVM) classifier based on a 1 vs. 1 multi-class approach
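The 1 vs. 1 scheme trains one binary classifier per pair of classes and lets them vote; with the 12 stage classes of the results table that means 12 x 11 / 2 = 66 pairwise classifiers. The sketch below shows only the voting logic. The binary learner is a nearest-centroid stand-in, explicitly not an SVM, and all names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def train_one_vs_one(samples, labels, fit_binary):
    """Train one binary classifier per unordered pair of classes."""
    classes = sorted(set(labels))
    models = {}
    for a, b in combinations(classes, 2):
        pair = [(x, y) for x, y in zip(samples, labels) if y in (a, b)]
        models[(a, b)] = fit_binary(pair, a, b)
    return models


def predict_one_vs_one(models, x):
    """Each pairwise model casts one vote; the most-voted class wins."""
    votes = Counter(decide(x) for decide in models.values())
    return votes.most_common(1)[0][0]


def fit_centroid(pair, a, b):
    """Stand-in binary learner (NOT an SVM): assign to the nearer class centroid."""
    def centroid(cls):
        pts = [x for x, y in pair if y == cls]
        return [sum(col) / len(pts) for col in zip(*pts)]

    def dist2(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    ca, cb = centroid(a), centroid(b)
    return lambda x: a if dist2(x, ca) <= dist2(x, cb) else b


samples = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_one(samples, labels, fit_centroid)
print(len(models))                            # 3 pairwise models for 3 classes
print(predict_one_vs_one(models, [5, 5.2]))   # B
```

Swapping `fit_centroid` for a function that trains a binary SVM would give the scheme named on the slide; the training and voting code would stay the same.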

    class name     % in dataset  % correct
 1  sky+bkg+gnd        6.3%        16.7%
 2  gnd+bkg            7.1%         8.2%
 3  sky+gnd            8.7%        60.7%
 4  gnd+bkg            7.4%        44.7%
 5  gnd+diagBkg       10.8%        26.9%
 6  diagBkg            6.4%        14.3%
 7  box                5.5%         8.1%
 8  1 side-wall        9.0%        13.6%
 9  corner            10.8%        34.3%
10  tab+pers+bkg       7.4%        48.0%
11  pers+bkg          13.1%        42.5%
12  no depth           7.4%        22.4%

AVG: 28.4%

individual stages (results of symmetrical variants combined)
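As a quick check of the per-stage numbers, the sketch below recomputes the average and compares it with two simple baselines. The values are copied from the table; reading AVG as the unweighted mean over the 12 classes is an assumption, although it does reproduce the 28.4% shown. The baselines (uniform guessing, always predicting the largest class) are added here for comparison and are not from the talk.

```python
# (class number, share of dataset in %, % correct), copied from the table
stages = [
    (1, 6.3, 16.7), (2, 7.1, 8.2), (3, 8.7, 60.7), (4, 7.4, 44.7),
    (5, 10.8, 26.9), (6, 6.4, 14.3), (7, 5.5, 8.1), (8, 9.0, 13.6),
    (9, 10.8, 34.3), (10, 7.4, 48.0), (11, 13.1, 42.5), (12, 7.4, 22.4),
]

mean_acc = sum(acc for _, _, acc in stages) / len(stages)   # unweighted mean over classes
chance = 100.0 / len(stages)                                # uniform random guessing
majority = max(share for _, share, _ in stages)             # always predict the largest class

print(f"mean per-class accuracy: {mean_acc:.1f}%")   # 28.4%
print(f"uniform chance:          {chance:.1f}%")     # 8.3%
print(f"majority-class baseline: {majority:.1f}%")   # 13.1%
```

The per-class mean is about three times the uniform chance level, while several individual stages (2, 7) stay at or below chance.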

     group name         % in dataset  % correct
I    straight/no bkg.      29.5%        69.5%
II   tilted bkg.           17.2%        35.2%
III  box                   14.5%        19.6%
IV   corner                10.8%        13.2%
V    person+bkg            20.5%        63.1%

AVG: 40.1%

stage groups

     group name         % correct (AVG)
I    straight/no bkg.      59.5%
II   tilted bkg.           27.0%
III  box                   41.1%
V    person+bkg            72.7%

two-step classification, average within group (assuming super-stage is known)

Conclusions (1)

We need a fast & approximate solution: do only what is necessary, viewers may not perceive it anyway; generalize where possible, to reduce the problem at every step

Separate a scene into a stage and the objects

Determine the stage 3D model first: a rough model is sufficient; plus, structure greatly reduces the number of possible configurations; and the stage will help us to locate and process objects


Therefore, we can use scene classification as the first step in depth estimation

Due to structure, we can create simple models that fit TV data: 15 stages are sufficient; no need to distinguish between indoor & outdoor


Our approach: three-step classification; geometry at the bottom is constrained enough that we can already assign pre-defined depth maps; no parameter estimation necessary

Proof of concept demonstrated with a single feature type: performance much better than chance, but enhancements needed (more features etc.)

Questions?
