MIT 6.870 Grounding Object Recognition and Scene Understanding: Lecture 1


Wednesdays 1-4pm, Room 13-1143
Instructor: Antonio Torralba
Email: torralba@csail.mit.edu

6.870 Grounding object recognition and scene understanding

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

Some slides are borrowed from other classes (see links on the course web site). Let me know if I forget to give credit to the right people.

http://groups.csail.mit.edu/vision/courses/6.869/

Grading

•  Class participation: 20%

•  Paper presentations: 40%

•  Course project: 40%

Course project

•  Topics for projects: the project can derive from one of the papers studied or from your own research.

•  Work individually or in pairs.

•  Results described in a 4-page CVPR-style paper.

•  Short presentation at the end of the semester.

Paper presentations (40%)

Email me at the end of class to schedule the following week; we will decide together how to structure it.

•  Presenter:
–  Present the key ideas, background material, and technical details.
–  Show me the slides two days before the class.
–  Test the basic ideas of the paper(s), using code available online or writing toy code.
–  Create toy test problems that reveal something about the algorithm.
–  Offer constructive criticism.

Readings

Lecture 1: Class goals and a short introduction

6.870 Grounding object recognition and scene understanding

What is vision?

•  What does it mean, to see? “To know what is where by looking.”

•  How to discover from images what is present in the world, where things are, and what actions are taking place.

from Marr, 1982

The importance of images

$100 million

“Dora Maar au Chat”, Pablo Picasso, 1941

Some images are more important than others

Why is vision hard?

The structure of ambient light

The Plenoptic Function

The intensity P can be parameterized as:

P(θ, φ, t, λ, X, Y, Z)

“The complete set of all convergence points constitutes the permanent possibilities of vision.” (Gibson)

Adelson & Bergen, 1991
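To make the parameterization concrete, here is a toy sketch: a conventional photograph is a 2D slice of the plenoptic function, obtained by fixing the viewpoint (X, Y, Z), time t, and wavelength λ, and sampling only over viewing direction (θ, φ). The light field below is invented purely for illustration.

```python
import numpy as np

def plenoptic(theta, phi, t, lam, X, Y, Z):
    # Hypothetical smooth light field, just to make the 7 parameters concrete:
    # intensity as a function of direction, time, wavelength, and viewpoint.
    return 0.5 + 0.5 * np.sin(3 * theta + 2 * phi + 0.1 * t + lam + X + Y + Z)

# A photograph: fix viewpoint, time, and wavelength; vary direction only.
thetas = np.linspace(0, np.pi, 64)
phis = np.linspace(0, 2 * np.pi, 128)
T, P = np.meshgrid(thetas, phis, indexing="ij")
image = plenoptic(T, P, t=0.0, lam=0.5, X=0.0, Y=0.0, Z=1.5)

print(image.shape)  # (64, 128)
```

Moving the camera, waiting, or changing the color filter each selects a different slice of the same seven-dimensional function.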

Why is vision hard?

Measuring light vs. measuring scene properties

We perceive two squares, one on top of the other.

Measuring light vs. measuring scene properties

by Roger Shepard (“Turning the Tables”)

Depth processing is automatic, and we cannot shut it down…

Measuring light vs. measuring scene properties

(c) 2006 Walt Anthony

Assumptions can be wrong

Ames room

By Aude Oliva

Why is vision hard?

Some things have strong variations in appearance

Some things know that you have eyes

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422.

A short history of vision

The early optimism

The crisis of the ’80s

Yes, object recognition is hard… (or at least it seems so for now…)

Object recognition: is it really so hard?

Challenges 1: view point variation

Michelangelo 1475-1564

Challenges 2: illumination

slide credit: S. Ullman

Challenges 3: occlusion

Magritte, 1957

Challenges 4: scale

Challenges 5: deformation

Xu, Beihong 1943

Challenges 6: background clutter

Klimt, 1913

Challenges 7: intra-class variation

Challenges

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422.

Discover the camouflaged object

Any guesses?

So, let’s make the problem simpler: the block world

A nice framework for developing fancy math, but too far from reality…

Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006

Binford and generalized cylinders

Recognition by components

Irving Biederman. Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.

The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.

The “contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory.”

1)  We know that this object is nothing we have seen before.

2)  We can split this object into parts that everybody will agree on.

3)  We can see how it resembles something familiar: “a hot dog cart”.

“The naive realism that emerges in descriptions of nonsense objects may be reflecting the workings of a representational system by which objects are identified.”

A do-it-yourself example

Stages of processing

“Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties.”

Nonaccidental properties

Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties.

Nonaccidental properties (Witkin & Tenenbaum, 1983): properties that can rarely be produced by accidental alignments of viewpoint and object features, and that are consequently generally unaffected by slight variations in viewpoint.


Examples:

•  Collinearity

•  Smoothness

•  Symmetry

•  Parallelism

•  Cotermination

From generalized cylinders to geons

“From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated.”

Geons represent a restricted form of generalized cylinders.

Objects and their geons

Scenes and geons

Mezzanotte & Biederman

The importance of spatial arrangement

Parts and structure approaches

With a different perspective, these models focused more on the geometry than on defining the constituent elements:

•  Fischler & Elschlager 1973
•  Yuille ’91
•  Brunelli & Poggio ’93
•  Lades, v.d. Malsburg et al. ’93
•  Cootes, Lanitis, Taylor et al. ’95
•  Amit & Geman ’95, ’99
•  Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05
•  Felzenszwalb & Huttenlocher ’00, ’04
•  Crandall & Huttenlocher ’05, ’06
•  Leibe & Schiele ’03, ’04
•  Many papers since 2000

Figure from [Fischler & Elschlager 73]

But, despite promising initial results, things did not work out so well (lack of data, lack of processing power, and lack of reliable methods for low-level and mid-level vision).

Instead, a different way of thinking about object detection started making progress: learning-based approaches and classifiers, which ignored low- and mid-level vision.

Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.

Renewed optimism

Neocognitron

Fukushima (1980): a hierarchical multilayered neural network.

S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.

C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.
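A minimal numpy sketch of this S/C arrangement (the function names, edge template, and thresholding choices are mine, and only loosely stand in for Fukushima’s actual cell equations): the S-layer correlates a fixed template with the input and thresholds, and the C-layer max-pools over nearby S-cells, which is what gives tolerance to small positional shifts.

```python
import numpy as np

def s_layer(img, kernel, theta=0.5):
    # S-cells: feature extraction. Each S-cell correlates a fixed template
    # (kernel) with an input patch and fires only above a threshold theta.
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out - theta, 0.0)

def c_layer(fmap, pool=2):
    # C-cells: each C-cell pools a neighborhood of S-cells that extract the
    # same feature at slightly different positions, and fires if at least one
    # of them fires (max pooling).
    H, W = fmap.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * pool:(i + 1) * pool,
                             j * pool:(j + 1) * pool].max()
    return out

# A vertical edge at column 3 vs. column 4: the S-layer responses differ,
# but the C-layer response is identical -- positional tolerance.
edge = np.array([[-1.0, 1.0]])
img_a = np.zeros((8, 8)); img_a[:, 3] = 1.0
img_b = np.zeros((8, 8)); img_b[:, 4] = 1.0
same = np.allclose(c_layer(s_layer(img_a, edge)), c_layer(s_layer(img_b, edge)))
print(same)  # True
```

Stacking alternating S and C layers yields progressively larger features with progressively larger positional tolerance, which is the core idea the convolutional networks below inherit.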

Neocognitron  

Learning is done greedily, layer by layer.

Convolutional Neural Network

The output neurons share all the intermediate layers.

LeCun et al., 1998

Face detection and the success of learning-based approaches

•  The representation and matching of pictorial structures Fischler, Elschlager (1973). •  Face recognition using eigenfaces M. Turk and A. Pentland (1991). •  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995) •  Graded Learning for Object Detection - Fleuret, Geman (1999) •  Robust Real-time Object Detection - Viola, Jones (2001) •  Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001) • ….


Faces everywhere

http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html

FERET dataset, DARPA, 1996

The face age


Rapid Object Detection Using a Boosted Cascade of Simple Features

Paul Viola and Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA

Most of this work was done at Compaq CRL before the authors moved to MERL.

Manuscript available on the web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf

Haar-like filters and cascades

Viola and Jones, ICCV 2001

The average intensity in a block is computed with four sums, independently of the block size.

See also Fleuret and Geman, 2001.
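The four-sum property comes from the integral image: precompute cumulative sums once, and any box sum afterwards takes exactly four lookups, regardless of box size. A small sketch (the helper names are mine; Viola and Jones then build Haar-like features as differences of adjacent box sums):

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; padded with a zero row and column so
    # that box sums need no boundary special-casing.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1, x0:x1] with exactly four lookups,
    # independent of the block size.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))  # 30.0, equal to img[1:3, 1:3].sum()
```

A two-rectangle Haar-like feature is then just `box_sum(left) - box_sum(right)`: eight lookups per feature, which is what makes evaluating thousands of features per window, in a cascade, fast enough for real time.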

Face detection


Families of recognition algorithms

•  Rigid template models: Sirovich and Kirby, 1987; Turk and Pentland, 1991; Dalal & Triggs, 2006; Viola and Jones, ICCV 2001; Heisele, Poggio, et al., NIPS 2001; Schneiderman and Kanade, 2004; Vidal-Naquet and Ullman, 2003
•  Constellation models: Fischler and Elschlager, 1973; Burl, Leung, and Perona, 1995; Weber, Welling, and Perona, 2000; Fergus, Perona, and Zisserman, CVPR 2003
•  Bag of words models: Csurka, Dance, Fan, Willamowski, and Bray, 2004; Sivic, Russell, Freeman, and Zisserman, ICCV 2005
•  Voting models
•  Shape matching: Berg, Berg, and Malik, 2005
•  Deformable models: Cootes, Edwards, and Taylor, 2001

Scene understanding

Carbonetto, de Freitas & Barnard (2004)

Kumar and Hebert (2005)

Torralba, Murphy and Freeman (2004)

Fink & Perona (2003)

Sudderth, Torralba, Willsky and Freeman (2005)

Hoiem, Efros and Hebert (2005)

Torralba and Sinha (2001)

Rabinovich et al. (2007)

Heitz and Koller (2008)

Desai, Ramanan, and Fowlkes (2009)

Choi, Lim, Torralba and Willsky (2010)

NSF Frontiers in computer vision workshop, 2011

MobilEye

Demo: Google Goggles

The labeling crisis

[Labeled image: DUCK, GRASS, PERSON, TREE, LAKE, BENCH, PATH, SKY, SIGN]

So what does object recognition involve?

Slide by Fei-Fei, Fergus, Torralba

Verification: is that a lamp?

Slide by Fei-Fei, Fergus, Torralba

Detection: are there people?

Slide by Fei-Fei, Fergus, Torralba

Identification: is that Potala Palace?

Slide by Fei-Fei, Fergus, Torralba

Object categorization

[Image labels: mountain, building, tree, banner, vendor, people, street lamp]

Slide by Fei-Fei, Fergus, Torralba

Scene and context categorization

•  outdoor
•  city
•  …

Slide by Fei-Fei, Fergus, Torralba

Is this space large or small? How far are the buildings in the back?

Slide by Fei-Fei, Fergus, Torralba

Activity

What is this person doing? What are these two doing?

Slide by Fei-Fei, Fergus, Torralba

What are we tuned to?

The visual system is tuned to process structures typically found in the world.

The visual system seems to be tuned to a set of images:

Demo inspired by D. Field

Remember these images

Did you see this image?

Test 2

Remember these images

Did you see this image?

Human vision
•  Many input modalities
•  Active
•  Supervised, unsupervised, and semi-supervised learning. It can look for supervision.

Robot vision
•  Many poor input modalities
•  Active, but it does not go far

Internet vision
•  Many input modalities
•  It can reach everywhere
•  Tons of data

Data

Kinect

CSE 576, Spring 2008: Stereo matching

Active stereo with structured light

Project “structured” light patterns onto the object
•  simplifies the correspondence problem

[Diagram: camera 1, camera 2, and projector geometry]
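A toy sketch of why projected patterns simplify correspondence (this uses plain binary stripe codes, not the color stripes and multi-pass dynamic programming of Zhang et al.; all names are mine): projecting log2(W) binary stripe patterns lets each camera pixel decode directly which projector column illuminates it, so correspondence needs no search at all.

```python
import numpy as np

def stripe_patterns(width, n_bits):
    # Pattern k is bit k of the projector column index. After all n_bits
    # patterns are projected, the sequence of bright/dark observations at a
    # camera pixel spells out the illuminating projector column in binary.
    cols = np.arange(width)
    return [((cols >> k) & 1).astype(float) for k in range(n_bits)]

def decode_column(observed_bits):
    # Recover the projector column from the observed bit sequence.
    col = 0
    for k, b in enumerate(observed_bits):
        col |= int(b) << k
    return col

patterns = stripe_patterns(width=8, n_bits=3)
# A camera pixel illuminated by projector column 5 observes these bits:
bits = [p[5] for p in patterns]
print(decode_column(bits))  # 5
```

With the camera-projector correspondence known per pixel, depth follows from ordinary triangulation; one-shot variants like Zhang’s encode the code in a single color pattern instead of a temporal sequence.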

Li Zhang’s one-shot stereo

Slide credit: Rick Szeliski

Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy, June 19-21, 2002, pp. 24-36.



Willow Garage

http://www.willowgarage.com/pages/pr2/overview

Class goals

•  Vision and language

•  Vision and robotics

•  Vision and others

To provide the right vision tools for non-experts in vision.

Thinking about the tasks to find new representations.

The strategies our visual system uses are tuned to our visual world
