holistic scene understanding
DESCRIPTION
Holistic Scene Understanding. Virginia Tech ECE6504, 2013/02/26, Stanislaw Antol. What Does It Mean? Computer vision parts extensively developed; less work done on their integration. Potential benefit of different components compensating/helping other components.
![Page 1: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/1.jpg)
Holistic Scene Understanding
Virginia Tech
ECE6504
2013/02/26
Stanislaw Antol
![Page 2: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/2.jpg)
What Does It Mean?
• Computer vision parts extensively developed; less work done on their integration
• Potential benefit of different components compensating/helping other components
![Page 3: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/3.jpg)
Outline
• Gaussian Mixture Models
• Conditional Random Fields
• Paper 1 Overview
• Paper 2 Overview
• My Experiment
![Page 4: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/4.jpg)
Gaussian Mixture
$$P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$$
Where P(X | C_j) is the PDF of class j, evaluated at X, P(C_j) is the prior probability for class j, and P(X) is the overall PDF, evaluated at X.
$$P(X \mid C_j) = \sum_{k=1}^{N_c} w_k G_k$$
Where w_k is the weight of the k-th Gaussian G_k and the weights sum to one. One such PDF model is produced for each class.
$$G_k = \frac{1}{(2\pi)^{n/2}\,|V_k|^{1/2}} \exp\left[-\tfrac{1}{2}(X - M_k)^T V_k^{-1} (X - M_k)\right]$$
Where M_k is the mean of the Gaussian and V_k is the covariance matrix of the Gaussian.
Slide credit: Kuei-Hsien
![Page 5: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/5.jpg)
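Composition of Gaussian Mixture
[Figure: Class 1 modelled as a mixture of five weighted Gaussians, (G1, w1) to (G5, w5)]
$$P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$$
$$P(X \mid C_j) = \sum_{k=1}^{N_c} w_k G_k$$
$$G_k = p(X \mid G_k) = \frac{1}{(2\pi)^{d/2}\,|V_k|^{1/2}} \exp\left[-\tfrac{1}{2}(X - \mu_k)^T V_k^{-1} (X - \mu_k)\right]$$
Variables: μ_k, V_k, w_k
We use the EM (expectation-maximization) algorithm to estimate these variables. One can use k-means to initialize.
Slide credit: Kuei-Hsien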
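The EM fit with k-means initialization can be sketched for the 1-D case as follows. This is a minimal illustration, not the code behind the slides: the function names (`fit_gmm`, `kmeans_init`, `mixture_pdf`), the iteration counts, and the variance floor are all assumptions, and a scalar variance stands in for the covariance matrix.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kmeans_init(data, k, iters=10, seed=0):
    """Crude 1-D k-means, used only to pick starting means."""
    rng = random.Random(seed)
    means = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - means[j]) ** 2)
            clusters[nearest].append(x)
        # Keep the old mean if a cluster went empty.
        means = [sum(c) / len(c) if c else means[j] for j, c in enumerate(clusters)]
    return means


def fit_gmm(data, k, iters=50, seed=0):
    """Estimate weights w_k, means mu_k and variances V_k with EM."""
    means = kmeans_init(data, k, seed=seed)
    overall_mean = sum(data) / len(data)
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / len(data), 1e-6)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each sample.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint) or 1e-300
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


def mixture_pdf(x, weights, means, variances):
    """P(X | C_j) = sum_k w_k G_k for one class model."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
```

Fitting one such model per class and combining `mixture_pdf` with the class priors gives the posterior P(C_j | X) from Bayes' rule.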
![Page 6: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/6.jpg)
Background on CRFs
Figure from: “An Introduction to Conditional Random Fields” by C. Sutton and A. McCallum
![Page 7: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/7.jpg)
Background on CRFs
Figure from: “An Introduction to Conditional Random Fields” by C. Sutton and A. McCallum
![Page 8: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/8.jpg)
Background on CRFs
Equations from: “An Introduction to Conditional Random Fields” by C. Sutton and A. McCallum
![Page 9: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/9.jpg)
Paper 1
• “TextonBoost: Joint Appearance, Shape, and Context Modeling for Multi-class Object Recognition and Segmentation”
– J. Shotton, J. Winn, C. Rother, and A. Criminisi
![Page 10: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/10.jpg)
Introduction
• Simultaneous recognition and segmentation
• Explain every pixel (dense features)
• Appearance + shape + context
• Class generalities + image specifics
Contributions
• New low-level features
• New texture-based discriminative model
• Efficiency and scalability
Example Results
Slide credit: J. Shotton
![Page 11: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/11.jpg)
Image Databases
• MSRC 21-Class Object Recognition Database
– 591 hand-labelled images (45% train, 10% validation, 45% test)
• Corel ( 7-class ) and Sowerby ( 7-class ) [He et al. CVPR 04]
Slide credit: J. Shotton
![Page 12: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/12.jpg)
Sparse vs Dense Features
• Successes using sparse features, e.g. [Sivic et al. ICCV 2005], [Fergus et al. ICCV 2005], [Leibe et al. CVPR 2005]
• But…
– do not explain whole image
– cannot cope well with all object classes
• We use dense features
– ‘shape filters’
– local texture-based image descriptions
• Cope with
– textured and untextured objects, occlusions, whilst retaining high efficiency
[Figure: problem images for sparse features?]
Slide credit: J. Shotton
![Page 13: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/13.jpg)
Textons
• Shape filters use texton maps [Varma & Zisserman IJCV 05], [Leung & Malik IJCV 01]
• Compact and efficient characterisation of local texture
[Figure: Input image → Filter Bank → Clustering → Texton map (colours correspond to texton indices)]
Slide credit: J. Shotton
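The filter bank → clustering → texton map pipeline can be sketched in a few lines. This is only an illustration under loud assumptions: a three-response stand-in (intensity plus two gradients) replaces the real filter bank, the image is a plain grayscale list of lists, and k-means uses a farthest-point start chosen here for determinism. None of the function names come from the paper.

```python
def filter_responses(image):
    """Per-pixel feature vector: (intensity, horizontal gradient, vertical gradient).

    A stand-in for a real filter bank; borders are handled by clamping.
    """
    h, w = len(image), len(image[0])
    feats = []
    for y in range(h):
        row = []
        for x in range(w):
            left, right = image[y][max(x - 1, 0)], image[y][min(x + 1, w - 1)]
            up, down = image[max(y - 1, 0)][x], image[min(y + 1, h - 1)][x]
            row.append((float(image[y][x]), (right - left) / 2.0, (down - up) / 2.0))
        feats.append(row)
    return feats


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: sq_dist(p, centres[j]))].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
    return centres


def texton_map(image, k):
    """Assign every pixel the index of its nearest cluster centre (its texton)."""
    feats = filter_responses(image)
    centres = kmeans([f for row in feats for f in row], k)
    return [
        [min(range(k), key=lambda j: sq_dist(f, centres[j])) for f in row]
        for row in feats
    ]
```

In practice the cluster centres would be learned once over training images and reused, rather than re-clustered per image as done here.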
![Page 14: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/14.jpg)
Shape Filters
• Pair: ( rectangle r, texton t )
• Feature responses v(i, r, t)
• Large bounding boxes enable long range interactions (up to 200 pixels)
• Integral images
[Figure: example responses v(i1, r, t) = a, v(i2, r, t) = 0, v(i3, r, t) = a/2; panels: appearance, context]
Slide credit: J. Shotton
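A feature response v(i, r, t) can be evaluated in constant time from one integral image per texton. The sketch below assumes v counts the pixels of texton t inside rectangle r placed relative to pixel i, which matches the a, 0, a/2 example above; the rectangle encoding and the function names are illustrative choices.

```python
def integral_image(mask):
    """Summed-area table with an extra zero row and column."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = mask[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def texton_integrals(texton_map, num_textons):
    """One integral image per texton channel."""
    return [
        integral_image([[1 if t == k else 0 for t in row] for row in texton_map])
        for k in range(num_textons)
    ]


def response(integrals, i, r, t):
    """v(i, r, t): count of texton t inside rectangle r offset to pixel i.

    i = (y, x); r = (top, left, bottom, right) relative to i, with bottom and
    right exclusive. The rectangle is clipped to the image.
    """
    ii = integrals[t]
    h, w = len(ii) - 1, len(ii[0]) - 1
    y0 = min(max(i[0] + r[0], 0), h)
    x0 = min(max(i[1] + r[1], 0), w)
    y1 = min(max(i[0] + r[2], 0), h)
    x1 = min(max(i[1] + r[3], 0), w)
    # Four table lookups give the rectangle sum regardless of its size.
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

Because the cost is four lookups per response, large rectangles (the long-range context) are no more expensive than small ones.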
![Page 15: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/15.jpg)
Shape as Texton Layout
[Figure: texton map (textons t0 to t4) and ground truth; shape filters (r1, t1) and (r2, t2); feature response images v(i, r1, t1) and v(i, r2, t2)]
Slide credit: J. Shotton
![Page 16: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/16.jpg)
Shape as Texton Layout
[Figure: texton map (textons t0 to t4) and ground truth; shape filters (r1, t1) and (r2, t2); summed response images v(i, r1, t1) + v(i, r2, t2)]
Slide credit: J. Shotton
![Page 17: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/17.jpg)
Joint Boosting for Feature Selection
[Figure: test image; inferred segmentation (colour = most likely label) and confidence (white = low confidence, black = high confidence) after 30, 1000, and 2000 rounds]
Using Joint Boost: [Torralba et al. CVPR 2004]
• Boosted classifier provides bulk segmentation/recognition only
• Edge accurate segmentation will be provided by CRF model
Slide credit: J. Shotton
![Page 18: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/18.jpg)
Accurate Segmentation?
• Boosted classifier alone
– effectively recognises objects
– but not sufficient for pixel-perfect segmentation
• Conditional Random Field (CRF)
– jointly classifies all pixels whilst respecting image edges
[Figure: boosted classifier vs. boosted classifier + CRF]
Slide credit: J. Shotton
![Page 19: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/19.jpg)
Conditional Random Field Model
Log conditional probability of class labels c given image x and learned parameters
Slide credit: J. Shotton
![Page 20: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/20.jpg)
Conditional Random Field Model
shape-texture potentials (jointly across all pixels)
• Shape-texture potentials
– broad intra-class appearance distribution
– log boosted classifier
– parameters learned offline
Slide credit: J. Shotton
![Page 21: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/21.jpg)
Conditional Random Field Model
colour potentials (intra-class appearance variations)
• Colour potentials
– compact appearance distribution
– Gaussian mixture model
– parameters learned at test time
Slide credit: J. Shotton
![Page 22: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/22.jpg)
Conditional Random Field Model
Capture prior on absolute image location
location potentials
[Figure: location priors for tree, sky, road]
Slide credit: J. Shotton
![Page 23: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/23.jpg)
Conditional Random Field Model
edge potentials (sum over neighbouring pixels)
• Potts model
– encourages neighbouring pixels to have same label
• Contrast sensitivity
– encourages segmentation to follow image edges
[Figure: image and edge map]
Slide credit: J. Shotton
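A contrast-sensitive Potts term can be sketched as below. The functional form (a penalty of θ1·exp(−β·‖x_i − x_j‖²) + θ2 when labels differ, with β set from the mean squared neighbour difference) is a common choice and is an assumption here, as are the grayscale image, the placeholder `theta` values, and the function names. See the paper for the exact potentials.

```python
import math


def neighbour_pairs(h, w):
    """4-connected neighbour pairs ((y, x), (y2, x2)) of an h-by-w grid."""
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                yield (y, x), (y, x + 1)
            if y + 1 < h:
                yield (y, x), (y + 1, x)


def contrast_beta(image):
    """beta = 1 / (2 * mean squared intensity difference between neighbours)."""
    h, w = len(image), len(image[0])
    diffs = [(image[a[0]][a[1]] - image[b[0]][b[1]]) ** 2 for a, b in neighbour_pairs(h, w)]
    mean = sum(diffs) / len(diffs)
    return 1.0 / (2.0 * mean) if mean > 0 else 0.0


def edge_potential(ci, cj, xi, xj, beta, theta=(2.0, 1.0)):
    """Log-potential for one neighbour pair.

    Zero when the labels agree (Potts); otherwise a negative penalty that
    shrinks where the intensity contrast is high, i.e. across an image edge.
    theta holds illustrative placeholder weights, not learned values.
    """
    if ci == cj:
        return 0.0
    return -(theta[0] * math.exp(-beta * (xi - xj) ** 2) + theta[1])
```

A label change is therefore cheap across a strong image edge and expensive inside a flat region, which is what pulls segment boundaries onto edges.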
![Page 24: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/24.jpg)
Conditional Random Field Model
partition function (normalises distribution)
• For details of potentials and learning, see paper
Slide credit: J. Shotton
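Putting the four potentials and the partition function together, the model on these slides has roughly the following form. The equation itself did not survive in the transcript, so this is a reconstruction; the symbol names (ψ for shape-texture, π for colour, λ for location, φ for edge, g_ij for the edge feature) follow my reading of the TextonBoost paper and should be checked against it.

```latex
\log P(\mathbf{c} \mid \mathbf{x}, \boldsymbol{\theta}) =
  \sum_i \Big[ \psi_i(c_i, \mathbf{x}; \theta_\psi)
             + \pi(c_i, x_i; \theta_\pi)
             + \lambda(c_i, i; \theta_\lambda) \Big]
  + \sum_{(i,j) \in \mathcal{E}} \phi\big(c_i, c_j, g_{ij}(\mathbf{x}); \theta_\phi\big)
  - \log Z(\boldsymbol{\theta}, \mathbf{x})
```

The first sum runs over pixels i, the second over neighbouring pixel pairs, and Z depends on the image and parameters but not on the labelling c.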
![Page 25: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/25.jpg)
CRF Inference
• Find most probable labelling
– maximizing the sum of the shape-texture, colour, location, and edge potentials
Slide credit: J. Shotton
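To make "find the most probable labelling" concrete, here is a small sketch that swaps in iterated conditional modes (ICM), a simple greedy technique chosen for brevity; it is not the inference method of the paper. The per-pixel unary scores stand in for the summed shape-texture, colour, and location terms, and `pairwise_weight` is a hypothetical callback giving the penalty for a label disagreement between two neighbours.

```python
def icm(unary, pairwise_weight, sweeps=10):
    """Iterated conditional modes on a 4-connected grid.

    unary[y][x][c]        : log-score of label c at pixel (y, x)
    pairwise_weight(a, b) : penalty (>= 0) paid when neighbours a and b disagree
    Returns a label map that is a local maximum of the total score.
    """
    h, w = len(unary), len(unary[0])
    num_labels = len(unary[0][0])
    # Start from the best label under the unary terms alone.
    labels = [
        [max(range(num_labels), key=lambda c: unary[y][x][c]) for x in range(w)]
        for y in range(h)
    ]
    for _ in range(sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                def score(c):
                    s = unary[y][x][c]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != c:
                            s -= pairwise_weight((y, x), (ny, nx))
                    return s

                best = max(range(num_labels), key=score)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break  # no pixel wants to change: local optimum reached
    return labels
```

The partition function is constant in c, so it can be ignored when maximizing. ICM only reaches a local optimum; stronger methods generally give better labellings on the same energy.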
![Page 26: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/26.jpg)
Learning
Slide credit: Daniel Munoz
![Page 27: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/27.jpg)
Results on 21-Class Database
[Figure: example segmentation results with class labels, e.g. building]
Slide credit: J. Shotton
![Page 28: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/28.jpg)
Segmentation Accuracy
• Overall pixel-wise accuracy is 72.2%
– ~15 times better than chance
• Confusion matrix:
Slide credit: J. Shotton
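The "~15 times better than chance" figure can be checked with a line of arithmetic, assuming "chance" means a uniform random guess over the 21 classes (the slide does not define it).

```python
num_classes = 21
chance = 1.0 / num_classes   # accuracy of a uniform random guess (assumed meaning of "chance")
reported = 0.722             # overall pixel-wise accuracy from the slide
ratio = reported / chance
print(f"chance = {chance:.1%}, ratio = {ratio:.1f}x")  # chance = 4.8%, ratio = 15.2x
```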
![Page 29: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/29.jpg)
Some Failures
Slide credit: J. Shotton
![Page 30: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/30.jpg)
Effect of Model Components
Pixel-wise segmentation accuracies:
• Shape-texture potentials only: 69.6%
• + edge potentials: 70.3%
• + colour potentials: 72.0%
• + location potentials: 72.2%
[Figure: segmentations for shape-texture; + edge; + colour & location]
Slide credit: J. Shotton
![Page 31: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/31.jpg)
Comparison with [He et al. CVPR 04]
• Our example results:

| | Accuracy (Sowerby) | Accuracy (Corel) | Speed, Train - Test (Sowerby) | Speed, Train - Test (Corel) |
|---|---|---|---|---|
| Our CRF model | 88.6% | 74.6% | 20 mins - 1.1 secs | 30 mins - 2.5 secs |
| He et al. mCRF | 89.5% | 80.0% | 1 day - 30 secs | 1 day - 30 secs |
| Shape-texture potentials only | 85.6% | 68.4% | | |
| He et al. unary classifier only | 82.4% | 66.9% | | |
Slide credit: J. Shotton
![Page 32: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/32.jpg)
Paper 2
• “Describing the Scene as a Whole: Joint Object Detection, Scene Classification, and Semantic Segmentation”
– Jian Yao, Sanja Fidler, and Raquel Urtasun
![Page 33: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/33.jpg)
Motivation
• Holistic scene understanding:
– Object detection
– Semantic segmentation
– Scene classification
• Extends idea behind TextonBoost
– Adds scene classification, object-scene compatibility, and more
![Page 34: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/34.jpg)
Main idea
• Create a holistic CRF
– General framework to easily allow additions
– Utilize other work as components of CRF
– Define the CRF not on pixels, but on segments and other higher-level variables
![Page 35: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/35.jpg)
Holistic CRF (HCRF) Model
![Page 36: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/36.jpg)
HCRF Precursors
• Use their own scene classifier (a one-vs-all SVM on SIFT, colorSIFT, RGB histograms, and color moment invariants) to produce scene predictions
• Use [5] for object detection (over-detection), b_l
• Use [5] to help create object masks, μ_s
• Use [20] at two different K_0 watershed threshold values to generate segments and super-segments, x_i and y_j, respectively
![Page 37: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/37.jpg)
HCRF
• Connection of potentials and their HCRF
![Page 38: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/38.jpg)
Segmentation Potentials
TextonBoost averaging
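Read literally, "TextonBoost averaging" suggests that a segment's unary score is the mean of the per-pixel TextonBoost class scores inside that segment. The sketch below shows that reading only; the data layout and the function name are assumptions, and the paper should be consulted for the exact potential.

```python
def segment_unaries(pixel_scores, segment_ids):
    """Average per-pixel class scores over each segment.

    pixel_scores[y][x] : list of class scores for pixel (y, x)
    segment_ids[y][x]  : integer segment id of pixel (y, x)
    Returns {segment id: list of mean class scores}.
    """
    sums, counts = {}, {}
    for score_row, id_row in zip(pixel_scores, segment_ids):
        for scores, seg in zip(score_row, id_row):
            acc = sums.setdefault(seg, [0.0] * len(scores))
            for c, s in enumerate(scores):
                acc[c] += s
            counts[seg] = counts.get(seg, 0) + 1
    return {seg: [v / counts[seg] for v in acc] for seg, acc in sums.items()}
```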
![Page 39: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/39.jpg)
Object Reasoning Potentials
![Page 40: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/40.jpg)
Class Presence Potentials
Chow-Liu algorithm
Is class k in image?
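The Chow-Liu algorithm builds a tree over variables by taking a maximum-weight spanning tree under pairwise mutual information. A minimal sketch for binary "is class k in the image?" variables follows; the sample layout (one tuple of 0/1 flags per training image) and the function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def mutual_information(samples, a, b):
    """Empirical mutual information (in nats) between binary columns a and b."""
    n = len(samples)
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = sum(1 for s in samples if s[a] == va and s[b] == vb) / n
            p_a = sum(1 for s in samples if s[a] == va) / n
            p_b = sum(1 for s in samples if s[b] == vb) / n
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def chow_liu_tree(samples):
    """Edges of a maximum-weight spanning tree under pairwise mutual information."""
    num_vars = len(samples[0])
    edges = sorted(
        ((mutual_information(samples, a, b), a, b)
         for a, b in combinations(range(num_vars), 2)),
        reverse=True,
    )
    parent = list(range(num_vars))  # union-find forest for Kruskal's algorithm

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # adding this edge does not create a cycle
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

For example, if classes 0 and 1 always co-occur in the training images and class 2 is independent of both, the edge (0, 1) is picked first and class 2 is attached by a zero-information edge.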
![Page 41: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/41.jpg)
Scene Potentials
Their classification technique
![Page 42: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/42.jpg)
Experimental Results
![Page 43: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/43.jpg)
Experimental Results
![Page 44: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/44.jpg)
Experimental Results
![Page 45: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/45.jpg)
Experimental Results
![Page 46: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/46.jpg)
My (TextonBoost) Experiment
• Despite statement, HCRF code not available
• TextonBoost only partially available
– Only code prior to CRF released
– Expects a very rigid format/structure for images
  • PASCAL VOC2007 wouldn’t run, even with changes
  • MSRCv2 was able to run (actually what they used)
– No results processing, just segmented images
![Page 47: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/47.jpg)
My Experiment
• Run code on the (same) MSRCv2 dataset
– Default parameters, except boosting rounds
  • Wanted to look at effects up until 1000 rounds; computed up to 900
  • Limited time; only got output for values up to 300
• Evaluate relationship between boosting rounds and segmentation accuracy
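Since the released code produced only segmented images, the accuracy has to be computed separately. A minimal sketch of that results processing is below, assuming label maps are already decoded into integer class ids and that unlabelled ("void") ground-truth pixels are skipped; the function names and the `void_label` convention are illustrative, not part of the TextonBoost release.

```python
def pixel_accuracy(predicted, ground_truth, void_label=None):
    """Fraction of labelled pixels whose predicted class matches the ground truth."""
    correct = total = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if g == void_label:
                continue  # unlabelled pixels do not count either way
            total += 1
            correct += (p == g)
    return correct / total if total else 0.0


def accuracy_by_rounds(predictions_by_rounds, ground_truth, void_label=None):
    """Map each boosting-round count to its overall pixel accuracy."""
    return {
        rounds: pixel_accuracy(pred, ground_truth, void_label)
        for rounds, pred in sorted(predictions_by_rounds.items())
    }
```

This handles one image; for a dataset-level number the correct and total counts would be pooled over all test images before dividing.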
![Page 48: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/48.jpg)
Experimental Advice
• Remember to compile in Release mode
– Classification seems to be ~3 times faster
– Training took 26 hours, maybe less if in Release
• Take advantage of multi-core CPU, if possible
– Single-threaded program not utilizing much RAM, so started running two classifications together
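Running several single-threaded classifications side by side can be scripted rather than launched by hand. The sketch below is one way to do it; the job commands are placeholders (they only print a message), since the actual TextonBoost command line is not given in these slides.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_parallel=2):
    """Run each command as its own process, at most max_parallel at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    # Threads only wait on the child processes, so the work itself runs in parallel.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Placeholder jobs; swap in the real classifier command lines.
    jobs = [[sys.executable, "-c", f"print('classifying split {n}')"] for n in range(2)]
    print(run_all(jobs))
```

Setting `max_parallel` to the number of free cores keeps the machine busy without oversubscribing it, provided each job fits in memory.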
![Page 49: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/49.jpg)
Experimental Results
![Page 50: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/50.jpg)
Experimental Results
![Page 51: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/51.jpg)
Experimental Results
![Page 52: Holistic Scene Understanding](https://reader033.vdocuments.us/reader033/viewer/2022051518/5681622d550346895dd259f2/html5/thumbnails/52.jpg)
Thank you for your time.
Any more questions?