MIT 6.870 Grounding Object Recognition and Scene Understanding: Lecture 1
Wednesdays 1-4pm, Room 13-1143. Instructor: Antonio Torralba. Email: [email protected]
6.870 Grounding object recognition and scene understanding
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Some slides are borrowed from other classes (see links on the course web site). Let me know if I forget to give credit to the right people.
http://groups.csail.mit.edu/vision/courses/6.869/
Grading
• Class participation: 20%
• Paper presentations: 40%
• Course project: 40%
Course project
• Topics: the project can derive from one of the papers studied or from your own research.
• Work individually or in pairs.
• Results described in a 4-page CVPR-style paper
• Short presentation at the end of the semester
Paper presentations (40%)
Email me at the end of class to schedule the following week; we will first decide together how to structure the week.
• Presenter:
– Present the key ideas, background material, and technical details.
– Show me the slides two days before the class.
– Test the basic ideas of the paper(s), using code available online or by writing toy code.
– Create toy test problems that reveal something about the algorithm.
– Offer constructive criticism.
Readings
Lecture 1: Class goals and a short introduction
6.870 Grounding object recognition and scene understanding
What is vision?
• What does it mean, to see? “to know what is where by looking”.
• How to discover from images what is present in the world, where things are, what actions are taking place.
from Marr, 1982
The importance of images
$100 million
“Dora Maar au Chat” Pablo Picasso, 1941
Some images are more important than others
Why is vision hard?
The structure of ambient light
The Plenoptic Function
The intensity P can be parameterized as:
P(θ, φ, t, λ, X, Y, Z)
“The complete set of all convergence points constitutes the permanent possibilities of vision.” (Gibson)
Adelson & Bergen, 91
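As a toy illustration (my own sketch, not from the lecture), the plenoptic function is just a scalar field over viewing position, direction, time, and wavelength. Here the "world" is a single static white point source at the origin, so t and λ have no effect:

```python
import math

def plenoptic(theta, phi, t, lam, X, Y, Z):
    """Toy plenoptic function P(theta, phi, t, lam, X, Y, Z):
    intensity seen from viewpoint (X, Y, Z), looking along the
    direction (theta, phi), at time t and wavelength lam.
    The scene is a single static white point source at the origin,
    so t and lam are unused in this sketch."""
    r = math.sqrt(X * X + Y * Y + Z * Z)
    if r == 0.0:
        return 0.0
    # Unit vector of the viewing ray (spherical -> Cartesian)
    dx = math.sin(theta) * math.cos(phi)
    dy = math.sin(theta) * math.sin(phi)
    dz = math.cos(theta)
    # Alignment between the viewing ray and the direction of the source
    align = -(X * dx + Y * dy + Z * dz) / r
    # Bright only when looking almost straight at the source;
    # intensity falls off with the square of the distance
    return max(0.0, align) ** 50 / (r * r)
```

From (0, 0, 5), looking straight down at the origin (θ = π) yields a nonzero intensity; looking away (θ = 0) yields zero, which is the point: every (viewpoint, direction) pair defines one sample of the function.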
Why is vision hard?
Measuring light vs. measuring scene properties
We perceive two squares, one on top of the other.
Measuring light vs. measuring scene properties
by Roger Shepard (”Turning the Tables”)
Depth processing is automatic, and we cannot shut it down…
Measuring light vs. measuring scene properties
(c) 2006 Walt Anthony
Assumptions can be wrong
Ames room
By Aude Oliva
Why is vision hard?
Some things have strong variations in appearance
Some things know that you have eyes
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
A short history of vision
The early optimism
The crisis of the 80’s
Yes, object recognition is hard… (or at least it seems so for now…)
Object recognition: is it really so hard?
Challenges 1: view point variation
Michelangelo 1475-1564
Challenges 2: illumination
slide credit: S. Ullman
Challenges 3: occlusion
Magritte, 1957
Challenges 4: scale
Challenges 5: deformation
Xu, Beihong 1943
Challenges 6: background clutter
Klimt, 1913
Challenges 7: intra-class variation
Challenges
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
Discover the camouflaged object
Discover the camouflaged object
Any guesses?
So, let’s make the problem simpler: Block world
Nice framework to develop fancy math, but too far from reality… Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
Binford and generalized cylinders
Recognition by components
Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
Recognition by components
The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.
The “contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory.”
1) We know that this object is nothing we have seen before.
2) We can split this object into parts that everybody will agree on.
3) We can see how it resembles something familiar: “a hot dog cart”.
“The naive realism that emerges in descriptions of nonsense objects may be reflecting the workings of a representational system by which objects are identified.”
A do-it-yourself example
Stages of processing
“Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties.”
Nonaccidental properties
Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties.
Nonaccidental properties (Witkin & Tenenbaum, 1983): properties that are rarely produced by accidental alignments of viewpoint and object features, and consequently are generally unaffected by slight variations in viewpoint.
Examples:
• Collinearity
• Smoothness
• Symmetry
• Parallelism
• Cotermination
From generalized cylinders to GEONS
“From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated.”
Geons represent a restricted form of generalized cylinders.
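The count N = 36 follows directly from the combinatorics in the quote above. A minimal check (my own sketch; the attribute labels are paraphrased from Biederman's taxonomy, not quoted from the lecture):

```python
from itertools import product

# Four attributes of generalized cylinders, each varying over two or
# three nonaccidentally distinguishable levels (labels paraphrased
# from Biederman, 1987):
attributes = {
    "cross-section edge":     ["straight", "curved"],                        # 2
    "cross-section symmetry": ["rot + refl", "refl only", "asymmetric"],     # 3
    "size along axis":        ["constant", "expanding", "expand-contract"],  # 3
    "axis":                   ["straight", "curved"],                        # 2
}

# Every combination of attribute levels defines one geon
geons = list(product(*attributes.values()))
print(len(geons))  # 2 * 3 * 3 * 2 = 36
```

Each tuple in `geons` is one qualitative shape class, e.g. a brick is (straight edges, symmetric cross-section, constant size, straight axis), while a curved horn changes the axis and size attributes.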
Objects and their geons
Scenes and geons
Mezzanotte & Biederman
The importance of spatial arrangement
Parts and structure approaches
With a different perspective, these models focused more on the geometry than on defining the constituent elements:
• Fischler & Elschlager 1973
• Yuille ’91
• Brunelli & Poggio ’93
• Lades, v.d. Malsburg et al. ’93
• Cootes, Lanitis, Taylor et al. ’95
• Amit & Geman ’95, ’99
• Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05
• Felzenszwalb & Huttenlocher ’00, ’04
• Crandall & Huttenlocher ’05, ’06
• Leibe & Schiele ’03, ’04
• Many papers since 2000
Figure from [Fischler & Elschlager 73]
But, despite promising initial results, things did not work out so well (lack of data, limited processing power, and no reliable methods for low-level and mid-level vision).
Instead, a different way of thinking about object detection started making progress: learning-based approaches and classifiers, which ignored low- and mid-level vision.
Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.
Renewed optimism
Neocognitron Fukushima (1980). Hierarchical multilayered neural network
S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.
Neocognitron
Learning is done greedily for each layer
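The S-cell/C-cell pair can be illustrated with a toy sketch (my own, not Fukushima's actual network; real S-cell weights are learned and the nonlinearities differ): S-cells correlate a feature template over the image, and C-cells take the max over small neighborhoods of S-cells tuned to the same feature, which buys tolerance to position.

```python
import numpy as np

def s_layer(image, kernel):
    """S-cells: feature extraction. Correlate a small template over
    every valid position, then half-wave rectify the responses."""
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def c_layer(s_map, pool=2):
    """C-cells: each cell responds if any S-cell tuned to the same
    feature fires within its small neighborhood (max pooling),
    giving tolerance to positional errors."""
    H, W = s_map.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(0, H - H % pool, pool):
        for j in range(0, W - W % pool, pool):
            out[i // pool, j // pool] = s_map[i:i + pool, j:j + pool].max()
    return out
```

For example, a vertical-edge kernel `[[-1, 1], [-1, 1]]` fires only along the edge in the S-map, and the C-map still fires if the edge shifts by a pixel.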
Convolutional Neural Network
The output neurons share all the intermediate levels
Le Cun et al, 98
Face detection and the success of learning based approaches
• The representation and matching of pictorial structures - Fischler, Elschlager (1973)
• Face recognition using eigenfaces - M. Turk and A. Pentland (1991)
• Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
• Graded Learning for Object Detection - Fleuret, Geman (1999)
• Robust Real-time Object Detection - Viola, Jones (2001)
• Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
• …
Faces everywhere
http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html
FERET dataset, DARPA, 1996
The face age
Rapid Object Detection Using a Boosted Cascade of Simple Features
Paul Viola, Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA
Most of this work was done at Compaq CRL before the authors moved to MERL.
Manuscript available on the web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf
Haar-like filters and cascades Viola and Jones, ICCV 2001
The average intensity in a block is computed with four sums, independently of the block size.
Also Fleuret and Geman, 2001
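The four-sum trick is the integral image (summed-area table). A minimal sketch of the idea (my own illustration, with hypothetical function names, not the authors' code):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[:y, :x].
    Padded with a leading row and column of zeros so the corner
    lookups below need no special cases at the borders."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def block_sum(ii, y, x, h, w):
    """Sum of the h x w block with top-left corner (y, x), computed
    with exactly four table lookups, regardless of block size."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

A two-rectangle Haar-like feature is then just `block_sum(left half) - block_sum(right half)`, which is why the cascade can evaluate thousands of features per window in real time.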
Face detection
Families of recognition algorithms
• Bag of words models: Csurka, Dance, Fan, Willamowski, and Bray, 2004; Sivic, Russell, Freeman, Zisserman, ICCV 2005
• Constellation models: Fischler and Elschlager, 1973; Burl, Leung, and Perona, 1995; Weber, Welling, and Perona, 2000; Fergus, Perona, & Zisserman, CVPR 2003
• Rigid template models: Sirovich and Kirby, 1987; Turk, Pentland, 1991; Dalal & Triggs, 2006
• Voting models: Viola and Jones, ICCV 2001; Heisele, Poggio, et al., NIPS 2001; Schneiderman, Kanade, 2004; Vidal-Naquet, Ullman, 2003
• Shape matching: Berg, Berg, Malik, 2005
• Deformable models: Cootes, Edwards, Taylor, 2001
Scene understanding
• Torralba, Sinha (2001)
• Fink & Perona (2003)
• Carbonetto, de Freitas & Barnard (2004)
• Torralba, Murphy, Freeman (2004)
• Kumar, Hebert (2005)
• Sudderth, Torralba, Willsky, Freeman (2005)
• Hoiem, Efros, Hebert (2005)
• Rabinovich et al. (2007)
• Heitz and Koller (2008)
• Desai, Ramanan, and Fowlkes (2009)
• Choi, Lim, Torralba, Willsky (2010)
NSF Frontiers in computer vision workshop, 2011
MobilEye
Demo: Google Goggles
The labeling crisis
[Image with hand labels: DUCK, GRASS, PERSON, TREE, LAKE, BENCH, PATH, SKY, SIGN]
So what does object recognition involve?
Slide by Fei-Fei, Fergus, Torralba
Verification: is that a lamp?
Detection: are there people?
Identification: is that Potala Palace?
Object categorization
[Image labels: mountain, building, tree, banner, vendor, people, street lamp]
Scene and context categorization
• outdoor • city • …
Is this space large or small? How far are the buildings in the back?
Activity
What is this person doing? What are these two doing?
What are we tuned to?
The visual system is tuned to process structures typically found in the world.
The visual system seems to be tuned to a set of images:
Demo inspired from D. Field
Remember these images
Did you see this image?
Test 2
Remember these images
Did you see this image?
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, and semi-supervised learning; it can look for supervision.
Robot vision
• Many (but poor) input modalities
• Active, but it does not go far
Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data
Data
Kinect
CSE 576, Spring 2008: Stereo matching
Active stereo with structured light
Project “structured” light patterns onto the object
• simplifies the correspondence problem
[Figures: a two-camera stereo rig (camera 1, camera 2) vs. a camera-plus-projector rig]
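Why projected patterns simplify correspondence can be sketched in a few lines (my own toy illustration; real systems use Gray codes or color stripes, and `depth_from_correspondence` is a hypothetical helper assuming a rectified camera-projector pair):

```python
import numpy as np

def decode_stripes(captures):
    """Decode binary stripe patterns. `captures` is a list of images,
    one per projected bit-plane (1 where the pixel was lit, MSB first).
    Each camera pixel directly recovers the projector column that
    illuminated it, so no search for correspondences is needed."""
    code = np.zeros(captures[0].shape, dtype=int)
    for bit_plane in captures:
        code = (code << 1) | bit_plane.astype(int)
    return code

def depth_from_correspondence(cam_x, proj_x, baseline, focal):
    """Rectified triangulation: depth = focal * baseline / disparity,
    where the disparity is between the camera pixel column and the
    decoded projector column."""
    disparity = cam_x - proj_x
    return focal * baseline / disparity
```

With a 2-bit pattern, the bit-planes `[0,0,1,1]` and `[0,1,0,1]` decode to projector columns `[0,1,2,3]`; the decoded column plays the role the matched pixel would play in passive stereo.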
Li Zhang’s one-shot stereo
Slide credit: Rick Szeliski
Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy, June 19-21, 2002, pp. 24-36.
Willow Garage
http://www.willowgarage.com/pages/pr2/overview
Class goals
• Vision and language
• Vision and robotics
• Vision and others
To provide the right vision tools for non-experts in vision
Thinking about the tasks to find new representations
The strategies our visual system uses are tuned to our visual world