MIT 6.870 Grounding Object Recognition and Scene Understanding: Lecture 1
Wednesdays 1-4pm, Room 13-1143. Instructor: Antonio Torralba. Email: [email protected]
6.870 Grounding object recognition and scene understanding
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Some slides are borrowed from other classes (see links on the course web site). Let me know if I forget to give credit to the right people.
http://groups.csail.mit.edu/vision/courses/6.869/
Grading
• Class participation: 20%
• Paper presentations: 40%
• Course project: 40%
Course project
• Topics: the project can derive from one of the papers studied or from your own research.
• Work individually or in pairs.
• Results described in a 4-page CVPR-style paper
• Short presentation at the end of the semester
Paper presentations (40%)
Email me at the end of class to schedule the following week; we will first decide together how to structure the week.
• Presenter:
– Present the key ideas, background material, and technical details.
– Show me the slides two days before the class.
– Test the basic ideas of the paper(s), using code available online or by writing toy code.
– Create toy test problems that reveal something about the algorithm.
– Offer constructive criticism.
Readings
Lecture 1: Class goals and a short introduction
6.870 Grounding object recognition and scene understanding
What is vision?
• What does it mean, to see? “to know what is where by looking”.
• How to discover from images what is present in the world, where things are, what actions are taking place.
from Marr, 1982
The importance of images
$100 million
“Dora Maar au Chat” Pablo Picasso, 1941
Some images are more important than others
Why is vision hard?
The structure of ambient light
The Plenoptic Function
The intensity P can be parameterized as:
P(θ, φ, t, λ, X, Y, Z)
“The complete set of all convergence points constitutes the permanent possibilities of vision.” (Gibson)
Adelson & Bergen, 91
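As a toy illustration (my own sketch, not from the lecture), the plenoptic function is just a scalar field over viewing position, direction, time, and wavelength. Here the "world" is a single static white point source at the origin, so t and λ have no effect:

```python
import math

def plenoptic(theta, phi, t, lam, X, Y, Z):
    """Toy plenoptic function P(theta, phi, t, lam, X, Y, Z):
    intensity seen from viewpoint (X, Y, Z), looking along the
    direction (theta, phi), at time t and wavelength lam.
    The scene is a single static white point source at the origin,
    so t and lam are unused in this sketch."""
    r = math.sqrt(X * X + Y * Y + Z * Z)
    if r == 0.0:
        return 0.0
    # Unit vector of the viewing ray (spherical -> Cartesian)
    dx = math.sin(theta) * math.cos(phi)
    dy = math.sin(theta) * math.sin(phi)
    dz = math.cos(theta)
    # Alignment between the viewing ray and the direction of the source
    align = -(X * dx + Y * dy + Z * dz) / r
    # Bright only when looking almost straight at the source;
    # intensity falls off with the square of the distance
    return max(0.0, align) ** 50 / (r * r)
```

From (0, 0, 5), looking straight down at the origin (θ = π) yields a nonzero intensity; looking away (θ = 0) yields zero, which is the point: every (viewpoint, direction) pair defines one sample of the function.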
Why is vision hard?
Measuring light vs. measuring scene properties
We perceive two squares, one on top of the other.
Measuring light vs. measuring scene properties
by Roger Shepard (”Turning the Tables”)
Depth processing is automatic, and we cannot shut it down…
Measuring light vs. measuring scene properties
(c) 2006 Walt Anthony
Assumptions can be wrong
Ames room
By Aude Oliva
Why is vision hard?
Some things have strong variations in appearance
Some things know that you have eyes
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
A short history of vision
The early optimism
The crisis of the 80’s
Yes, object recognition is hard… (or at least it seems so for now…)
Object recognition: is it really so hard?
Challenges 1: view point variation
Michelangelo 1475-1564
Challenges 2: illumination
slide credit: S. Ullman
Challenges 3: occlusion
Magritte, 1957
Challenges 4: scale
Challenges 5: deformation
Xu, Beihong 1943
Challenges 6: background clutter
Klimt, 1913
Challenges 7: intra-class variation
Challenges
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
Discover the camouflaged object
Discover the camouflaged object
Any guesses?
So, let’s make the problem simpler: Block world
Nice framework to develop fancy math, but too far from reality… Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
Binford and generalized cylinders
Recognition by components
Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
Recognition by components
The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.
The “contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory.”
1) We know that this object is nothing we have seen before.
2) We can split this object into parts that everybody will agree on.
3) We can see how it resembles something familiar: “a hot dog cart”.
“The naive realism that emerges in descriptions of nonsense objects may be reflecting the workings of a representational system by which objects are identified.”
A do-it-yourself example
Stages of processing
“Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties.”
Nonaccidental properties
Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties.
Nonaccidental properties (Witkin & Tenenbaum, 1983): properties that are rarely produced by accidental alignments of viewpoint and object features, and consequently are generally unaffected by slight variations in viewpoint.
Examples:
• Collinearity
• Smoothness
• Symmetry
• Parallelism
• Cotermination
From generalized cylinders to GEONS
“From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated.”
Geons represent a restricted form of generalized cylinders.
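The count N = 36 follows directly from the combinatorics in the quote above. A minimal check (my own sketch; the attribute labels are paraphrased from Biederman's taxonomy, not quoted from the lecture):

```python
from itertools import product

# Four attributes of generalized cylinders, each varying over two or
# three nonaccidentally distinguishable levels (labels paraphrased
# from Biederman, 1987):
attributes = {
    "cross-section edge":     ["straight", "curved"],                        # 2
    "cross-section symmetry": ["rot + refl", "refl only", "asymmetric"],     # 3
    "size along axis":        ["constant", "expanding", "expand-contract"],  # 3
    "axis":                   ["straight", "curved"],                        # 2
}

# Every combination of attribute levels defines one geon
geons = list(product(*attributes.values()))
print(len(geons))  # 2 * 3 * 3 * 2 = 36
```

Each tuple in `geons` is one qualitative shape class, e.g. a brick is (straight edges, symmetric cross-section, constant size, straight axis), while a curved horn changes the axis and size attributes.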
Objects and their geons
Scenes and geons
Mezzanotte & Biederman
The importance of spatial arrangement
Parts and structure approaches
With a different perspective, these models focused more on the geometry than on defining the constituent elements:
• Fischler & Elschlager 1973
• Yuille ’91
• Brunelli & Poggio ’93
• Lades, v.d. Malsburg et al. ’93
• Cootes, Lanitis, Taylor et al. ’95
• Amit & Geman ’95, ’99
• Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05
• Felzenszwalb & Huttenlocher ’00, ’04
• Crandall & Huttenlocher ’05, ’06
• Leibe & Schiele ’03, ’04
• Many papers since 2000
Figure from [Fischler & Elschlager 73]
But, despite promising initial results, things did not work out so well (lack of data, limited processing power, and no reliable methods for low-level and mid-level vision).
Instead, a different way of thinking about object detection started making progress: learning-based approaches and classifiers, which ignored low- and mid-level vision.
Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.
Renewed optimism
Neocognitron Fukushima (1980). Hierarchical multilayered neural network
S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.
Neocognitron
Learning is done greedily for each layer
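The S-cell/C-cell pair can be illustrated with a toy sketch (my own, not Fukushima's actual network; real S-cell weights are learned and the nonlinearities differ): S-cells correlate a feature template over the image, and C-cells take the max over small neighborhoods of S-cells tuned to the same feature, which buys tolerance to position.

```python
import numpy as np

def s_layer(image, kernel):
    """S-cells: feature extraction. Correlate a small template over
    every valid position, then half-wave rectify the responses."""
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def c_layer(s_map, pool=2):
    """C-cells: each cell responds if any S-cell tuned to the same
    feature fires within its small neighborhood (max pooling),
    giving tolerance to positional errors."""
    H, W = s_map.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(0, H - H % pool, pool):
        for j in range(0, W - W % pool, pool):
            out[i // pool, j // pool] = s_map[i:i + pool, j:j + pool].max()
    return out
```

For example, a vertical-edge kernel `[[-1, 1], [-1, 1]]` fires only along the edge in the S-map, and the C-map still fires if the edge shifts by a pixel.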
Convolutional Neural Network
The output neurons share all the intermediate levels
Le Cun et al, 98
Face detection and the success of learning based approaches
• The representation and matching of pictorial structures - Fischler, Elschlager (1973)
• Face recognition using eigenfaces - M. Turk and A. Pentland (1991)
• Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
• Graded Learning for Object Detection - Fleuret, Geman (1999)
• Robust Real-time Object Detection - Viola, Jones (2001)
• Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
• …
Faces everywhere
http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html
FERET dataset, DARPA, 1996
The face age
Rapid Object Detection Using a Boosted Cascade of Simple Features
Paul Viola, Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA
Most of this work was done at Compaq CRL before the authors moved to MERL.
Manuscript available on the web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf
Haar-like filters and cascades Viola and Jones, ICCV 2001
The average intensity in a block is computed with four sums, independently of the block size.
Also Fleuret and Geman, 2001
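The four-sum trick is the integral image (summed-area table). A minimal sketch of the idea (my own illustration, with hypothetical function names, not the authors' code):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[:y, :x].
    Padded with a leading row and column of zeros so the corner
    lookups below need no special cases at the borders."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def block_sum(ii, y, x, h, w):
    """Sum of the h x w block with top-left corner (y, x), computed
    with exactly four table lookups, regardless of block size."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

A two-rectangle Haar-like feature is then just `block_sum(left half) - block_sum(right half)`, which is why the cascade can evaluate thousands of features per window in real time.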
Face detection
Families of recognition algorithms
• Bag of words models: Csurka, Dance, Fan, Willamowski, and Bray, 2004; Sivic, Russell, Freeman, Zisserman, ICCV 2005
• Constellation models: Fischler and Elschlager, 1973; Burl, Leung, and Perona, 1995; Weber, Welling, and Perona, 2000; Fergus, Perona, & Zisserman, CVPR 2003
• Rigid template models: Sirovich and Kirby, 1987; Turk, Pentland, 1991; Dalal & Triggs, 2006
• Voting models: Viola and Jones, ICCV 2001; Heisele, Poggio, et al., NIPS 2001; Schneiderman, Kanade, 2004; Vidal-Naquet, Ullman, 2003
• Shape matching: Berg, Berg, Malik, 2005
• Deformable models: Cootes, Edwards, Taylor, 2001
Scene understanding
• Torralba, Sinha (2001)
• Fink & Perona (2003)
• Carbonetto, de Freitas & Barnard (2004)
• Torralba, Murphy, Freeman (2004)
• Kumar, Hebert (2005)
• Sudderth, Torralba, Willsky, Freeman (2005)
• Hoiem, Efros, Hebert (2005)
• Rabinovich et al. (2007)
• Heitz and Koller (2008)
• Desai, Ramanan, and Fowlkes (2009)
• Choi, Lim, Torralba, Willsky (2010)
NSF Frontiers in computer vision workshop, 2011
MobilEye
Demo: Google Goggles
The labeling crisis
[Image with hand labels: DUCK, GRASS, PERSON, TREE, LAKE, BENCH, PATH, SKY, SIGN]
So what does object recognition involve?
Slide by Fei-Fei, Fergus, Torralba
Verification: is that a lamp?
Detection: are there people?
Identification: is that Potala Palace?
Object categorization
[Image labels: mountain, building, tree, banner, vendor, people, street lamp]
Scene and context categorization
• outdoor • city • …
Is this space large or small? How far are the buildings in the back?
Activity
What is this person doing? What are these two doing?
What are we tuned to?
The visual system is tuned to process structures typically found in the world.
The visual system seems to be tuned to a set of images:
Demo inspired from D. Field
Remember these images
Did you see this image?
Test 2
Remember these images
Did you see this image?
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, and semi-supervised learning; it can look for supervision.
Robot vision
• Many (but poor) input modalities
• Active, but it does not go far
Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data
Data
Kinect
CSE 576, Spring 2008: Stereo matching
Active stereo with structured light
Project “structured” light patterns onto the object
• simplifies the correspondence problem
[Figures: a two-camera stereo rig (camera 1, camera 2) vs. a camera-plus-projector rig]
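Why projected patterns simplify correspondence can be sketched in a few lines (my own toy illustration; real systems use Gray codes or color stripes, and `depth_from_correspondence` is a hypothetical helper assuming a rectified camera-projector pair):

```python
import numpy as np

def decode_stripes(captures):
    """Decode binary stripe patterns. `captures` is a list of images,
    one per projected bit-plane (1 where the pixel was lit, MSB first).
    Each camera pixel directly recovers the projector column that
    illuminated it, so no search for correspondences is needed."""
    code = np.zeros(captures[0].shape, dtype=int)
    for bit_plane in captures:
        code = (code << 1) | bit_plane.astype(int)
    return code

def depth_from_correspondence(cam_x, proj_x, baseline, focal):
    """Rectified triangulation: depth = focal * baseline / disparity,
    where the disparity is between the camera pixel column and the
    decoded projector column."""
    disparity = cam_x - proj_x
    return focal * baseline / disparity
```

With a 2-bit pattern, the bit-planes `[0,0,1,1]` and `[0,1,0,1]` decode to projector columns `[0,1,2,3]`; the decoded column plays the role the matched pixel would play in passive stereo.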
Li Zhang’s one-shot stereo
Slide credit: Rick Szeliski
Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy, June 19-21, 2002, pp. 24-36.
Willow Garage
http://www.willowgarage.com/pages/pr2/overview
Class goals
• Vision and language
• Vision and robotics
• Vision and others
To provide the right vision tools for non-experts in vision
Thinking about the tasks to find new representations
The strategies our visual system uses are tuned to our visual world