web.mit.edu/9.s915/www/classes/class01_all.pdf
TRANSCRIPT
!"#$%&'#(&#$)(*$+"#$,(-'(##.'(-$/0$
1(+#22'-#(&#
Shimon Ullman + Tomaso Poggio(with Owen Ardron Lewis)
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139 USA
9.S915 in Fall 2011
Friday, September 9, 2011
Overview
Context for this course: the Intelligence Initiative
The problem of intelligence is one of the great problems in science, probably the greatest.
Research on intelligence by neuroscience and computer science:
• a great intellectual mission
• will help cure mental diseases and develop more intelligent artifacts
• will improve the mechanisms for collective decisions
These advances will be critical to our society's future prosperity, education, health, and security.
The problem of intelligence: how it arises in the brain and how to replicate it in machines
The Marketplace for Intelligence: a glimpse into the marketplace for the science and the engineering of intelligence
The new golden age: trying (again)
to understand and replicate Intelligence
• The first and last attempt in history was ~ fifty years ago -- at the beginning of Artificial Intelligence (Dartmouth workshop, 1956)
• It is time to try again because
– we have seen in the last decade and we are going to see in the next 5 years several intelligent apps from the first wave of AI+ML research: search + vision + speech + language + Go...
– ...but these are not intelligent machines and this is not enough to understand intelligence...
– ...we need to integrate computation with neuroscience and cognition for the next jump in understanding
• Thus, the next wave of research on Integrated Intelligence...
Syllabus
Since the introduction of learning techniques in AI+vision, there have been significant (and not well known) advances in a few domains:
• Games (Checkers, Chess, Go, Poker, soccer???...)
• Search (Google,...)
• Speech recognition (Nuance,...)
• Natural Language/Knowledge retrieval (Watson and Jeopardy,..)
• Vision (Kinect and see later)
Real successes in machine learning (and AI)
Integration is fundamental
• Cognition, brain science, computation, biology
• Perception, action, language, reasoning, social
• Integration of innate structures with learning
Integration is early
Language and perception
Direction of gaze
Perception and planning
Action recognition
Rules of the game:
• active participation in class
• survey at http://www.surveymonkey.com/s/FVK6NGD about
• whiteboard presentations in I2 camps? or
• Wikipedia articles? or ....
Slides on the Web site http://web.mit.edu/9.s915/www/
Staff mailing list is [email protected]
Student list will be [email protected]; please fill the form!
Send email to us if you want to be added to the mailing list.
Class: http://web.mit.edu/9.s915/www/
!"#$%&'#(&#$)(*$+"#$,(-'(##.'(-$/0$
1(+#22'-#(&#
Shimon Ullman + Tomaso Poggio(with Owen Ardron Lewis)
Department of Brain & Cognitive SciencesMassachusetts Institute of Technology
Cambridge, MA 02139 USA
9.S915 in Fall 2011
Friday, September 9, 2011
Class 1: Learning
1. Statistical Learning Theory: an informal introduction
2. The framework of statistical learning
3. Conditions, principles and algorithms for learning (necessary and sufficient conditions, sparsity, smoothness, etc.)
4. Learning and representation in the Brain
• Beyond binary classification: taxonomy and structures
• Applications: image and video classification
LEARNING THEORY + ALGORITHMS
COMPUTATIONAL NEUROSCIENCE: models + experiments
How visual cortex works
Theorems on foundations of learning
Predictive algorithms
Sung & Poggio 1995, also Kanade& Baluja....
!"#$%&'()'"*&%&+(,-.-)/(012(3'"*4("+5
Friday, September 9, 2011
Face detection is now available in digital cameras (commercial systems)
!"#$%&'()'"*&%&+(,-.-)/(012(3'"*4("+5
Friday, September 9, 2011
LEARNING THEORY+
ALGORITHMS
COMPUTATIONAL NEUROSCIENCE:
models+experimentsHow visual cortex works
Theorems on foundations of learning
Predictive algorithms
Papageorgiou&Poggio, 1997, 2000 also Kanade&Scheiderman
!"#$%&'()'"*&%&+(,-.-)/(067(3'"*4("+5
Friday, September 9, 2011
LEARNING THEORY+
ALGORITHMS
COMPUTATIONAL NEUROSCIENCE:
models+experimentsHow visual cortex works
Theorems on foundations of learning
Predictive algorithms
Papageorgiou&Poggio, 1997, 2000 also Kanade&Scheiderman
!"#$%&'()'"*&%&+(,-.-)/(067(3'"*4("+5
Friday, September 9, 2011
LEARNING THEORY+
ALGORITHMS
COMPUTATIONAL NEUROSCIENCE:
models+experimentsHow visual cortex works
Theorems on foundations of learning
Predictive algorithms
Papageorgiou&Poggio, 1997, 2000 also Kanade&Scheiderman
!"#$%&'()'"*&%&+(,-.-)/(067(3'"*4("+5
Friday, September 9, 2011
Friday, September 9, 2011
Friday, September 9, 2011
Pedestrian and car detection are also “solved” (commercial systems, MobilEye)
!"#$%&'()'"*&%&+(
Friday, September 9, 2011
Friday, September 9, 2011
LEARNING THEORY+
ALGORITHMS
COMPUTATIONAL NEUROSCIENCE:
models+experimentsHow visual cortex works
Theorems on foundations of learning
Predictive algorithms
Pedestrian and car detection are also “solved” (commercial systems, MobilEye)
!"#$%&'()'"*&%&+(
Friday, September 9, 2011
Friday, September 9, 2011
LEARNING THEORY+
ALGORITHMS
COMPUTATIONAL NEUROSCIENCE:
models+experimentsHow visual cortex works
Theorems on foundations of learning
Predictive algorithms
Pedestrian and car detection are also “solved” (commercial systems)
!"#$%&'()'"*&%&+(
Friday, September 9, 2011
Friday, September 9, 2011
http://www.volvocars.com/us/all-cars/volvo-s60/pages/5-things.aspx?p=5
!"#"$%&'(##"''"'&)$&*+&,$-&./0{(%"4&5)')1$
Friday, September 9, 2011
Since the introduction of learning/statistics in the ‘90s, computer vision has made significant (and not well known) advances in a few problem areas:
• Face identification under controlled conditions is “solved” (commercial systems) at ~ human level performance
• Face detection is available in cheap digital cameras (commercial systems)
• Pedestrian and car detection are also “solved” (commercial systems)
!",6&'(##"''"'&)$&*+0{(%"4&5)')1$
Friday, September 9, 2011
INPUT → f → OUTPUT
Given a set of ℓ examples (data) (x_1, y_1), ..., (x_ℓ, y_ℓ):
Question: find a function f such that f(x) is a good predictor of y for a future input x (fitting the data is not enough!)
Statistical Learning Theory: supervised learning
[Figure: data from f, an approximation of f, and the function f itself, plotted as y vs x]
Generalization: estimating the value of the function where there are no data (good generalization means predicting the function well; what matters is that the empirical or validation error be a good proxy for the prediction error)
Statistical Learning Theory: prediction, not curve fitting
!"#"$%"$&#'()*#+,$,-(./*0+124#+"(09(:#$,%"+*#:(:#"/(,0"(;3%"(%"#"$%"$&%
<=#'$#,"7(=#4,$>7(!:#'*7(?*50+*@@@A
Friday, September 9, 2011
Mukherjee, Niyogi, Poggio, Rifkin, Nature, March 2004
An algorithm is stable if the removal of any one training sample from any large set of samples results almost always in a small change in the learned function.
For ERM the following theorem holds
ERM on H generalizes if and only if the hypothesis space H is uGC and if and only if ERM on H is CVloo stable
!"#$%&'"$()"*)+,&'('-&.)/0&1$2$3)450"167810%2-'92,6:);--&<)1&="1:)(,&>2.2,6
Friday, September 9, 2011
It is possible to show that under quite general assumptions
generalization and well-posedness are equivalent, eg one implies the other.
Stability implies generalization:
a stable solution is predictive and (for ERM) also viceversa.
ERM ERM
H is u GC
ERM
CVloo
stability
For generalsymmetric algorithms
GeneralizationGeneralization+ consistency
Eloo+EEloo stability
ERM
Mukherjee, S., P. Niyogi, T. Poggio and R. Rifkin. Learning Theory: Stability is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization, Advances in Computational Mathematics, 25, 161-193, 2006; see also Nature, 2005
Conditions for generalization in learning theory
have deep, almost philosophical, implications:
they can be regarded as equivalent conditions that guarantee a
theory to be predictive (that is scientific)
‣ theory must be chosen from a small set
‣ theory should not change much with new data...most of the time
!"#"$%"$&#'()*#+,$,-(./*0+12"/*0+*:%(*B"*,6$,-(903,6#"$0,%(09(
'*#+,$,-("/*0+1(
Friday, September 9, 2011
Equation includes splines, Radial Basis Functions and SVMs (depending on choice of V).
implies
For a review, see Poggio and Smale, 2003; see also Schoelkopf and Smola, 2002; Bousquet, O., S. Boucheron and G. Lugosi; Cucker and Smale; Zhou and Smale...
!"#"$%"$&#'()*#+,$,-(./*0+12C*+,*'(D#&/$,*%(*-(E*-3'#+$F#"$0,($,(
ECG!(
Friday, September 9, 2011
The regularization functional has a Bayesian interpretation: the data term is a model of the noise and the stabilizer is a prior on the hypothesis space of functions f. That is, Bayes rule leads to the same regularized minimization.
Statistical Learning Theory: classical algorithms: Regularization
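A minimal sketch of that Bayesian reading, assuming a Gaussian noise model with variance σ² and a Gaussian (RKHS-norm) prior on f; the exact constants are illustrative, not from the slides:

```latex
% MAP estimation with a Gaussian likelihood (data term = noise model)
% and an RKHS-norm prior (the stabilizer) recovers Tikhonov regularization.
P(f \mid S_n) \propto P(S_n \mid f)\, P(f), \quad
P(S_n \mid f) \propto \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - f(x_i))^2\Big), \quad
P(f) \propto \exp\!\big(-\|f\|_H^2\big).
% Taking -log and minimizing over f gives the MAP estimate:
f_{\mathrm{MAP}} = \arg\min_{f \in H}\; \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_H^2,
\qquad \lambda \propto \frac{\sigma^2}{n},
% i.e. the Tikhonov regularization functional discussed later in this lecture.
```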
The regularization functional implies a solution of the form f(x) = Σ_i c_i K(x, x_i) (see the representer theorem below).
Classical learning algorithms: Kernel Machines (e.g. Regularization in RKHS)
Remark (for later use): kernel machines correspond to shallow networks (the inputs x_1 ... x_ℓ feed a single layer of kernel units whose weighted sum gives f).
Statistical Learning Theory: Regularization
Two connected and overlapping strands in learning theory:
‣ Bayes, hierarchical models, graphical models…
‣ Statistical learning theory, regularization (closer to classical math: functional analysis + probability theory + empirical process theory…)
Statistical Learning Theory: note
FOUNDATIONS OF COMPUTATIONAL AND STATISTICAL LEARNING
Tomaso Poggio and Lorenzo Rosasco
tp, [email protected]
September 9, 2011
COMPUTATIONAL LEARNING
STATISTICAL LEARNING THEORY
Learning is viewed as an inference problem from possibly small samples of high dimensional, noisy data.
LEARNING TASKS AND MODELS
Supervised, Semisupervised, Unsupervised, Online, Transductive, Active, Variable Selection, Reinforcement .....
One can consider the data to be created in a deterministic, stochastic, or even adversarial way.
WHERE TO START?
STATISTICAL AND SUPERVISED LEARNING
Statistical models are essential to deal with noise, sampling and other sources of uncertainty. Supervised learning is by far the most understood class of problems.
REGULARIZATION
Regularization provides a fundamental framework to solve learning problems and design learning algorithms. We present a set of tools and techniques which are at the core of a multitude of different ideas and developments, beyond supervised learning.
PLAN
Part I: Basic Concepts and Notation
Part II: Foundational Results
Part III: Algorithms
PROBLEM AT A GLANCE
Given a training set of input-output pairs
Sn = (x1, y1), . . . , (xn, yn)
find f_S such that f_S(x) ∼ y.
E.g. the x's are vectors and the y's are discrete labels in classification and real values in regression.
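A tiny, self-contained illustration of this setup (not from the slides; the data and the 1-nearest-neighbor rule are just placeholders): a training set S_n of input-output pairs and an estimator f_S with f_S(x) ≈ y on new inputs.

```python
import numpy as np

# Hypothetical toy data: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))                # inputs x_i (vectors)
y_train = np.sign(X_train[:, 0])                  # outputs y_i (labels in {-1, +1})

def f_S(x, X=X_train, y=y_train):
    """A simple learned predictor: 1-nearest-neighbor on the training set S_n."""
    i = np.argmin(np.linalg.norm(X - x, axis=1))  # index of the closest training input
    return y[i]

x_new = np.array([0.3, -1.2])
print(f_S(x_new))                                 # prediction for a future input x
```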
LEARNING IS INFERENCE
For the above problem to make sense we need to assume input and output to be related!
STATISTICAL AND SUPERVISED LEARNING
Each input-output pair is a sample from a fixed but unknown distribution p(x, y). Under general conditions we can write
p(x, y) = p(y|x) p(x).
NOISE ...
[Figure: the conditional distribution p(y|x) over Y for a given x]
The same x can generate different y (according to p(y|x)):
the underlying process is deterministic, but there is noise in the measurement of y;
the underlying process is not deterministic;
the underlying process is deterministic, but only incomplete information is available.
...AND SAMPLING
[Figure: samples x drawn from the marginal distribution p(x)]
EVEN IN A NOISE FREE CASE WE HAVE TO DEAL WITH SAMPLING
The marginal distribution p(x) might model:
errors in the location of the input points;
discretization error for a given grid;
presence or absence of certain input instances.
LEARNING, GENERALIZATION AND OVERFITTING
PREDICTIVITY OR GENERALIZATION
Given the data, the goal is to learn how to make decisions/predictions about future data, i.e. data not belonging to the training set.
The problem is: avoid overfitting!
OVER-FITTING ILLUSTRATED
Among many possible solutions, how can we choose one that correctly applies to previously unseen data?
LOSS FUNCTIONS
As we look for a deterministic estimator in a stochastic environment, we have to expect to incur errors.
LOSS FUNCTION
A loss function V on R × Y determines the price V(f(x), y) we pay when predicting f(x) and the true output is y.
LOSS FUNCTIONS FOR REGRESSION
The most common is the square loss or L2 loss: V(f(x), y) = (f(x) − y)²
Absolute value or L1 loss: V(f(x), y) = |f(x) − y|
Vapnik's ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)₊
LOSS FUNCTIONS FOR (BINARY) CLASSIFICATION
The most intuitive one, the 0-1 loss: V(f(x), y) = θ(−y f(x))  (θ is the step function)
The more tractable hinge loss: V(f(x), y) = (1 − y f(x))₊
And again the square loss or L2 loss: V(f(x), y) = (1 − y f(x))²
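The regression and classification losses above, written out as code (a sketch; f_x stands for the prediction f(x)):

```python
import numpy as np

# Regression losses
def square_loss(f_x, y):          return (f_x - y) ** 2                          # L2
def absolute_loss(f_x, y):        return np.abs(f_x - y)                         # L1
def eps_insensitive(f_x, y, eps): return np.maximum(np.abs(f_x - y) - eps, 0.0)  # Vapnik

# Binary classification losses (labels y in {-1, +1})
def zero_one_loss(f_x, y):        return (y * f_x <= 0).astype(float)            # theta(-y f(x))
def hinge_loss(f_x, y):           return np.maximum(1.0 - y * f_x, 0.0)
def square_loss_cls(f_x, y):      return (1.0 - y * f_x) ** 2

print(hinge_loss(np.array([0.3, -2.0]), np.array([1, 1])))   # -> [0.7 3.0]
```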
EXPECTED RISK
A good function should incur only a few errors. We need a way to quantify this idea.
EXPECTED RISK
The quantity
I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy
is called the expected error (or expected risk) and measures the loss averaged over the unknown distribution.
A good function should have small expected risk.
TARGET FUNCTION
The expected risk is usually defined on some large space F, possibly dependent on p(x, y). The best possible error is
inf_{f∈F} I[f].
The infimum is often achieved at a minimizer f* that we call the target function.
LEARNING ALGORITHMS AND GENERALIZATION
A learning algorithm can be seen as a map
Sn → fn
from the training set to a set of candidate functions.
Ideally we would like I[f_n] ∼ I[f*].
REMINDER
CONVERGENCE IN PROBABILITY
Let {X_n} be a sequence of bounded random variables. Then lim_{n→∞} X_n = X in probability if
for all ε > 0, lim_{n→∞} P{|X_n − X| ≥ ε} = 0.
CONVERGENCE IN EXPECTATION
Let {X_n} be a sequence of bounded random variables. Then lim_{n→∞} X_n = X in expectation if
lim_{n→∞} E(|X_n − X|) = 0.
CONSISTENCY AND UNIVERSAL CONSISTENCY
A basic requirement is for the algorithm to get better as we get moredata...
CONSISTENCY
We say that an algorithm is consistent if
for all ε > 0, lim_{n→∞} P{I[f_n] − I[f*] ≥ ε} = 0.
UNIVERSAL CONSISTENCY
We say that an algorithm is universally consistent if the above holds for every probability distribution p.
SAMPLE COMPLEXITY AND LEARNING RATES
The above requirements are asymptotic.
ERROR RATES
A more practical question is: how fast does the error decay? This can be expressed as
P{I[f_n] − I[f*] ≤ ε(n, δ)} ≥ 1 − δ.
SAMPLE COMPLEXITY
Or equivalently: how many points do we need to achieve an error ε with a prescribed confidence δ? This can be expressed as
P{I[f_n] − I[f*] ≤ ε} ≥ 1 − δ,  for n = n(ε, δ).
EMPIRICAL RISK AND GENERALIZATION
How do we design learning algorithms? A first idea is to look at the data...
EMPIRICAL RISK
In practice we can try to replace the expected risk by the empirical risk
I_n[f] = (1/n) Σ_{i=1}^{n} V(f(x_i), y_i).
GENERALIZATION ERROR
The effectiveness of such an approximation is captured by the generalization error:
P{|I[f_n] − I_n[f_n]| ≤ ε} ≥ 1 − δ,  for n = n(ε, δ).
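A direct transcription of the empirical risk I_n[f], together with a held-out estimate used as a proxy for the expected risk (the toy data and the degree-3 polynomial fit are assumptions for illustration):

```python
import numpy as np

def empirical_risk(f, X, y, V):
    """I_n[f] = (1/n) * sum_i V(f(x_i), y_i)."""
    preds = np.array([f(x) for x in X])
    return np.mean(V(preds, y))

square = lambda p, t: (p - t) ** 2

# Toy 1-D regression: fit on a small sample, compare training risk to a held-out proxy.
rng = np.random.default_rng(1)
X_tr, X_te = rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 1000)
y_tr = np.sin(3 * X_tr) + 0.1 * rng.normal(size=30)
y_te = np.sin(3 * X_te)

f_n = np.poly1d(np.polyfit(X_tr, y_tr, deg=3))   # a simple learned f_n
print(empirical_risk(f_n, X_tr, y_tr, square))   # I_n[f_n], the empirical risk
print(empirical_risk(f_n, X_te, y_te, square))   # held-out proxy for I[f_n]
```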
SOME (THEORETICAL AND PRACTICAL) QUESTIONS
How do we go from data to an actual algorithm?
Is minimizing error on the data a good idea?
Are there fundamental limitations in what we can and cannot learn?
PLAN
Part I: Basic Concepts and Notation
Part II: Foundational Results
Part III: Algorithms
NO FREE LUNCH THEOREM
UNIVERSAL CONSISTENCY
Can we learn any problem consistently? Or, equivalently, do universally consistent algorithms exist?
YES! Nearest neighbors, histogram rules, SVMs with (so-called) universal kernels...
NO FREE LUNCH THEOREM
Given a number of points (and a confidence), can we always achieve a prescribed error?
NO! (Devroye et al.)
The last statement can be interpreted as follows: inference from finite samples can be effectively performed if and only if the problem satisfies some a priori condition.
HYPOTHESES SPACE
In statistical learning a first prior assumption amounts to choosing asuitable space of hypotheses H.
Examples: linear functions, polynomials, RBFs, Sobolev spaces...
A learning algorithm A is then a map from the data space to H,
A(S_n) = f_n ∈ H
EMPIRICAL RISK MINIMIZATION
How do we choose H? How do we design A?
ERM
A prototype algorithm in statistical learning theory is Empirical Risk Minimization:
min_{f∈H} I_n[f].
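A minimal ERM sketch (an assumption for illustration: H taken to be polynomials of a fixed degree, fit by least squares): the chosen f minimizes the empirical risk over H, and increasing the capacity of H shows the overfitting discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 15)
y = np.sin(3 * X) + 0.2 * rng.normal(size=15)
X_test = np.linspace(-1, 1, 500)
y_test = np.sin(3 * X_test)

for degree in (1, 3, 12):                       # H = polynomials of the given degree
    f_n = np.poly1d(np.polyfit(X, y, degree))   # ERM: least squares minimizes I_n[f] over H
    train_err = np.mean((f_n(X) - y) ** 2)
    test_err = np.mean((f_n(X_test) - y_test) ** 2)
    print(f"degree {degree:2d}: empirical risk {train_err:.3f}, test risk {test_err:.3f}")
```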
KEY THEOREM(S) ILLUSTRATED
KEY THEOREM(S)
UNIFORM GLIVENKO-CANTELLI CLASSES
We say that H is a uniform Glivenko-Cantelli (uGC) class, if for all p,
for all ε > 0, lim_{n→∞} P{ sup_{f∈H} |I[f] − I_n[f]| > ε } = 0.
A necessary and sufficient condition for consistency of ERM is that H is uGC.
See: [Vapnik and Cervonenkis (71), Alon et al (97), Dudley, Gine, and Zinn (91)].
In turn, the uGC property is equivalent to requiring H to have finite capacity: the Vγ dimension in general and the VC dimension in classification.
STABILITY
z = (x, y),   S = z_1, ..., z_n,   S^i = z_1, ..., z_{i−1}, z_{i+1}, ..., z_n
CV STABILITY
A learning algorithm A is CVloo stable if for each n there exist β_CV^(n) and δ_CV^(n) such that, for all p,
P{ |V(f_{S^i}, z_i) − V(f_S, z_i)| ≤ β_CV^(n) } ≥ 1 − δ_CV^(n),
with β_CV^(n) and δ_CV^(n) going to zero for n → ∞.
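A hedged numerical sketch of the CVloo quantity: for a given algorithm A (here regularized linear least squares on synthetic data, an assumption), compare the loss at z_i of the functions trained with and without z_i; the maximum over i is an empirical stand-in for β_CV^(n).

```python
import numpy as np

def rls_fit(X, y, lam=0.1):
    """Regularized linear least squares; stands in for a generic algorithm A."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(d), X.T @ y)
    return lambda x: x @ w

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

f_S = rls_fit(X, y)
betas = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    f_Si = rls_fit(X[keep], y[keep])            # trained on S^i, i.e. with z_i removed
    betas.append(abs((f_Si(X[i]) - y[i]) ** 2 - (f_S(X[i]) - y[i]) ** 2))
print(max(betas))                               # empirical stand-in for beta_CV^(n)
```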
ERM AND ILL-POSEDNESS
Ill-posed problems often arise if one tries to infer general laws from few data:
the hypothesis space is too large;
there are not enough data.
In general ERM leads to ill-posed solutions because
the solution may be too complex;
it may be not unique;
it may change radically when leaving one sample out.
REMINDER
ILL POSED PROBLEM
A mathematical problem is well posed in the sense of Hadamard if
the solution exists;
the solution is unique;
the solution depends continuously on the data.
If a problem is not well posed it is called ill posed.
Regularization is the classical way to restore well posedness (andensure generalization).
BAYES INTERPRETATION
PLAN
Part I: Basic Concepts and Notation
Part II: Foundational Results
Part III: Algorithms
HYPOTHESES SPACE
We are going to look at hypotheses spaces which are reproducing kernel Hilbert spaces (RKHS).
RKHS are Hilbert spaces of point-wise defined functions. They can be defined via a reproducing kernel, which is a symmetric positive definite function. Functions in the space are (the completion of) linear combinations
f(x) = Σ_{i=1}^{p} K(x, x_i) c_i.
The norm in the space is a natural measure of complexity:
‖f‖²_H = Σ_{j,i=1}^{p} K(x_j, x_i) c_i c_j.
EXAMPLES OF PD KERNELS
Very common examples of symmetric pd kernels are
• Linear kernel: K(x, x′) = x · x′
• Gaussian kernel: K(x, x′) = exp(−‖x − x′‖² / σ²),  σ > 0
• Polynomial kernel: K(x, x′) = (x · x′ + 1)^d,  d ∈ N
For specific applications, designing an effective kernel is a challenging problem.
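The three kernels above, plus the RKHS expansion f(x) = Σ_i c_i K(x, x_i) and its norm from the previous slide, written out (a sketch with arbitrary points and coefficients):

```python
import numpy as np

def linear_kernel(x, xp):              return x @ xp
def gaussian_kernel(x, xp, sigma=1.0): return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)
def poly_kernel(x, xp, d=2):           return (x @ xp + 1.0) ** d

def kernel_matrix(K, A, B):
    return np.array([[K(a, b) for b in B] for a in A])

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 2))            # the points x_1 ... x_p of the expansion
c = rng.normal(size=5)                 # the coefficients c_i

def f(x, K=gaussian_kernel):
    """f(x) = sum_i c_i K(x, x_i): an element of the RKHS defined by K."""
    return sum(ci * K(x, xi) for ci, xi in zip(c, X))

G = kernel_matrix(gaussian_kernel, X, X)
norm_sq = c @ G @ c                    # ||f||_H^2 = sum_{i,j} c_i c_j K(x_i, x_j)
print(f(np.zeros(2)), norm_sq)
```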
KERNEL AND FEATURES
Oftentimes kernels are defined through a dictionary of features
D = {φ_j, j = 1, ..., p | φ_j : X → R for all j}
by setting
K(x, x′) = Σ_{j=1}^{p} φ_j(x) φ_j(x′).
IVANOV REGULARIZATION
We can regularize by explicitly restricting the hypotheses space H, for example to a ball of radius A.
IVANOV REGULARIZATION
min_{f∈H} (1/n) Σ_{i=1}^{n} V(f(x_i), y_i)   subject to   ‖f‖²_H ≤ A.
The above algorithm corresponds to a constrained optimization problem.
TIKHONOV REGULARIZATION
Regularization can also be done implicitly via penalization.
TIKHONOV REGULARIZATION
argmin_{f∈H} (1/n) Σ_{i=1}^{n} V(f(x_i), y_i) + λ ‖f‖²_H.
The above algorithm can be seen as the Lagrangian formulation of a constrained optimization problem.
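A sketch of the Tikhonov objective evaluated for a candidate kernel expansion f = Σ_j c_j K(·, x_j) (the representer-theorem form stated on the next slide); the kernel matrix and coefficients here are illustrative assumptions:

```python
import numpy as np

def tikhonov_objective(c, K, y, lam, V=lambda p, t: (p - t) ** 2):
    """(1/n) sum_i V(f(x_i), y_i) + lam * ||f||_H^2,  for f = sum_j c_j K(., x_j)."""
    f_at_data = K @ c                  # f(x_i) = sum_j K(x_i, x_j) c_j
    return np.mean(V(f_at_data, y)) + lam * (c @ K @ c)

# Hypothetical tiny example with a Gaussian kernel matrix K on six points.
X = np.linspace(0.0, 1.0, 6)
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1)
y = np.sin(2 * np.pi * X)
print(tikhonov_objective(np.zeros(6), K, y, lam=0.1))   # value of the objective at c = 0
```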
THE REPRESENTER THEOREM
AN IMPORTANT RESULT
The minimizer over the RKHS H, f_S, of the regularized empirical functional
I_S[f] + λ ‖f‖²_H
can be represented by the expression
f_n(x) = Σ_{i=1}^{n} c_i K(x_i, x),
for some (c_1, ..., c_n) ∈ R^n.
Hence, minimizing over the (possibly infinite dimensional) Hilbert space boils down to minimizing over R^n.
SVM AND RLS
The way the coefficients c = (c_1, ..., c_n) are computed depends on the choice of loss function.
RLS: Let y = (y_1, ..., y_n) and K_{i,j} = K(x_i, x_j); then c = (K + λnI)^{−1} y.
SVM: Let α_i = y_i c_i and Q_{i,j} = y_i K(x_i, x_j) y_j.
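The RLS formula above in a few lines of code (a sketch; the SVM coefficients require a quadratic-programming solver and are omitted):

```python
import numpy as np

def rls_train(K, y, lam):
    """Square loss + Tikhonov regularization: c = (K + lam * n * I)^{-1} y."""
    n = len(y)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def rls_predict(c, K_new_train):
    """f(x) = sum_i c_i K(x, x_i), evaluated for each new point."""
    return K_new_train @ c

# Toy 1-D regression with a Gaussian kernel.
gauss = lambda A, B: np.exp(-(A[:, None] - B[None, :]) ** 2 / 0.2)
X = np.linspace(-1, 1, 25)
y = np.sin(3 * X) + 0.1 * np.random.default_rng(5).normal(size=25)
c = rls_train(gauss(X, X), y, lam=1e-3)
print(rls_predict(c, gauss(np.array([0.0, 0.5]), X)))   # predictions at two new inputs
```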
REGULARIZATION APPROACH
More generally we can consider
I_n(f) + λ R(f),
where R(f) is a regularizing functional and λ is a regularization parameter trading off between the two terms.
Sparsity-based methods, manifold learning, multiclass, ...
SUMMARY
Statistical learning as a framework to deal with uncertainty in the data.
No free lunch and the key theorem: no prior, no learning.
Regularization as a fundamental tool for enforcing a prior and ensuring stability and generalization.
KERNEL AND DATA REPRESENTATION
In the above reasoning the kernel and the hypotheses space define a representation/parameterization of the problem and hence play a special role.
Where do they come from?
There are a few off-the-shelf choices (Gaussian, polynomial, etc.); often they are the product of problem-specific engineering.
Are there principles, applicable in a wide range of situations, to design effective data representations?
Learning: Math, Engineering (now)
More in general, since the introduction of supervised learning techniques, AI vision has made significant (and not well known) advances in a few domains:
• Vision (see above)
• Graphics and morphing (see above)
• Natural Language/Knowledge retrieval (Watson and Jeopardy)
• Speech recognition (Nuance)
• Games (Go, chess, ...)
However, the full problem of vision (and intelligence!) is still not solved. It is likely the problems are not solvable with “classical” learning techniques. A reasonable bet is on novel neuroscience/cognitive-inspired techniques:
• Recognition at human level of thousands of objects
• Scene understanding: the Turing test for Vision
The computational magic of the ventral stream
Tomaso Poggio, McGovern Institute, I2, CBCL, BCS, CSAIL, MIT
Very preliminary, provocative, possibly completely wrong, ...
Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
David Marr
Foreword by Shimon Ullman; Afterword by Tomaso Poggio
David Marr's posthumously published Vision (1982) influenced a generation of brain and cognitive scientists, inspiring many to enter the field. In Vision, Marr describes a general framework for understanding visual perception and touches on broader questions about how the brain and its functions can be studied and understood. Researchers from a range of brain and cognitive sciences have long valued Marr's creativity, intellectual power, and ability to integrate insights and data from neuroscience, psychology, and computation. This MIT Press edition makes Marr's influential work available to a new generation of students and scientists.
In Marr's framework, the process of vision constructs a set of representations, starting from a description of the input image and culminating with a description of three-dimensional objects in the surrounding environment. A central theme, and one that has had far-reaching influence in both neuroscience and cognitive science, is the notion of different levels of analysis—in Marr's framework, the computational level, the algorithmic level, and the hardware implementation level.
Now, thirty years later, the main problems that occupied Marr remain fundamental open problems in the study of perception. Vision provides inspiration for the continuing...
Vision: What is Where
Desimone & Ungerleider 1989
ventral stream
8$'(@'&<*">(4<*'"9
Movshon et al.
The ventral stream hierarchy: V1, V2, V4, IT
A gradual increase in the receptive field size, in the “complexity” of the preferred stimulus, and in “invariance” to position and scale changes
Kobatake & Tanaka, 1994
*Modified from (Gross, 1998)
[software available online with CNS (for GPUs)]
Riesenhuber & Poggio 1999, 2000; Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005; Serre Oliva Poggio 2007
((D'#5+&%=5&(%&(<$'(A'&<*">(E<*'"9B(
#>"44%#">/(F4<"&G"*GH(95G'>(I5?(%99'G%"<'(*'#5+&%=5&J
Friday, September 9, 2011
[software available online] Riesenhuber & Poggio 1999, 2000; Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005; Serre Oliva Poggio 2007
• It is in the family of “Hubel-Wiesel” models (Hubel & Wiesel, 1959: qual. Fukushima, 1980: quant; Oram & Perrett, 1993: qual; Wallis & Rolls, 1997; Riesenhuber & Poggio, 1999; Thorpe, 2002; Ullman et al., 2002; Mel, 1997; Wersing and Koerner, 2003; LeCun et al 1998: not-bio; Amit & Mascaro, 2003: not-bio; Hinton, LeCun, Bengio not-bio; Deco & Rolls 2006…)
• Hardwired invariance to translation and scale
• As a biological model of object recognition in the ventral stream – from V1 to PFC -- it is perhaps the most quantitatively faithful to known neuroscience data
((D'#5+&%=5&(%&(A%4;">(-5*<'KB(LM#>"44%#">(95G'>H/(
4'>'#=@'("&G(%&@"*%"&<
Friday, September 9, 2011
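Not code from the model itself, but a minimal sketch of the Hubel-Wiesel style S/C alternation these bullets describe: an S-layer that measures similarity to stored templates at every position and a C-layer that max-pools over position, giving hardwired tolerance to translation (template and image sizes are arbitrary):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def s_layer(image, templates):
    """S units: similarity (here a plain dot product) to each template at every position."""
    p = templates.shape[-1]
    patches = sliding_window_view(image, (p, p))           # all p x p patches of the image
    return np.array([np.tensordot(patches, t, axes=([-2, -1], [0, 1]))
                     for t in templates])                  # (n_templates, H-p+1, W-p+1)

def c_layer(s_maps):
    """C units: max over position -- the response becomes tolerant to translation."""
    return s_maps.max(axis=(1, 2))

rng = np.random.default_rng(6)
templates = rng.normal(size=(4, 5, 5))       # hypothetical stored/hardwired templates
image = rng.normal(size=(32, 32))
print(c_layer(s_layer(image, templates)))    # a small, translation-tolerant signature
```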
V1:
Simple and complex cells tuning (Schiller et al 1976; Hubel & Wiesel 1965; Devalois et al 1982)
MAX-like operation in a subset of complex cells (Lampl et al 2004)
V2:
Subunits and their tuning (Anzai, Peng, Van Essen 2007)
V4:
Tuning for two-bar stimuli (Reynolds Chelazzi & Desimone 1999)
MAX-like operation (Gawne et al 2002)
Two-spot interaction (Freiwald et al 2005)
Tuning for boundary conformation (Pasupathy & Connor 2001, Cadieu, Kouh, Connor et al., 2007)
Tuning for Cartesian and non-Cartesian gratings (Gallant et al 1996)
IT:
Tuning and invariance properties (Logothetis et al 1995, paperclip objects)
Differential role of IT and PFC in categorization (Freedman et al 2001, 2002, 2003)
Read-out results (Hung Kreiman Poggio & DiCarlo 2005)
Pseudo-average effect in IT (Zoccolan Cox & DiCarlo 2005; Zoccolan Kouh Poggio & DiCarlo 2007)
Human:
Rapid categorization (Serre Oliva Poggio 2007)
Face processing (fMRI + psychophysics) (Riesenhuber et al 2004; Jiang et al 2006)
!"#$%$&'"&%()*##+,-$.%$+)/-+#(01"0)&-20"03#23)."3')-$)4$#+"&3))2#5$%()+%3%
((!5G'>(FC5*N4HB
%<("##5;&<4(?5*(O%<4(5?(:$34%5>5+3(P(:43#$5:$34%#4
Friday, September 9, 2011
Models of the ventral stream in cortexperform well compared to
engineered computer vision systems (in 2006)on several databases
Bileschi, Wolf, Serre, Poggio, 2007
Model “works”: it performs well at the computational level
Models of cortex lead to better systems for action recognition in videos: automatic phenotyping of mice
Performance: proposed system 77%; human agreement 72%; commercial system 61%; chance 12%
Jhuang, Garrote, Yu, Khilnani, Poggio, Mutch, Steele, Serre, Nature Communications, 2010
Model “works”: it performs well at the computational level
((D'#5+&%=5&(%&(A%4;">(-5*<'KB(
#59:;<"=5&("&G(9"<$'9"=#">(<$'5*3
For 10years+...
I did not manage to understand how model works....
we need a theory -- not only a model!
Friday, September 9, 2011
Why do hierarchical architectures work?
• We have “factored out” viewpoint, scale, translation: is it now still “difficult” for a classifier to categorize dogs vs horses?
Leibo, 2011
)?5&,)2(),50)<&2$)%2@-#.,6)"*)">A0-,)10-"3$2'"$B
C0"<0,12-),1&$(*"1<&'"$()"1)2$,1&-.&(()9&12&>2.2,6B
Friday, September 9, 2011
D10),1&$(*"1<&'"$(),50)<&2$)%2@-#.,6*"1)>2"."32-&.)">A0-,)10-"3$2'"$B
Leibo, 2011 Friday, September 9, 2011
Theory: 6 main ideas, 4 key theorems
1. The computational goal of the ventral stream is to learn transformations (affine in R^2) during development and be invariant to them (McCulloch+Pitts 1945, Hoffman 1966, Jack Gallant, 1993,...)
2. A memory-based module, similar to simple-complex cells, can learn from a video of an image transforming (eg translating) to provide a signature to any single image of any object that is invariant under the transformation (Invariance Lemma)
3. Since the group Aff(2,R) can be factorized as a semidirect product of translations and GL(2,R), the storage requirements of the memory-based module can be reduced by orders of magnitude (Factorization Theorem). I argue that this is the evolutionary reason for the hierarchical architecture of the ventral stream.
4. Evolution selects translations by using small apertures in the first layer (the Stratification Theorem). The size of receptive fields in the sequence of ventral stream areas automatically selects specific transformations (translations, scaling, shear and rotations).
5. If a Hebb-like rule is active at synapses, then the storage requirements can be reduced further: the tuning of cells in each layer -- during development -- will mimic the spectrum (SVD) of the stored templates (Linking Theorem).
6. At the top layers, invariances are learned for “places” and for class-specific transformations such as changes of expression of a face, rotation in depth of a face, or different poses of a body ===> patches for places and class-specific patches (faces, bodies...)
6 main ideas...
Caveat: I am speaking about 2 separate stages:
1. learning (during development)
2. “run time” processing of images
Part I: Memory-based invariance module
E"'9&'"$7)%2(-12<2$&'"$)$",)10-"$(,1#-'"$7F"5$("$G/2$%0$(,1&#((
a few measurements are needed for selectivity -- not so important which ones -- but much fewer than pixels...
Friday, September 9, 2011
Example: a highly efficient signature for OCR
Hierarchical features for selectivity or invariance?
V1 provides very discriminative signatures....
Abstraction: learning by a memory-based invariance module
A templatebook (frames of a template transforming over time)
Templatebooks
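A hedged sketch of the memory-based module behind the Invariance Lemma: each templatebook stores the frames of one template under a transformation (here 1-D circular translation, an assumption); the signature of an image is, per templatebook, the maximum dot product over stored frames, and it does not change when the image itself is translated:

```python
import numpy as np

def make_templatebook(template, n):
    """One stored frame per circular shift of the template (1-D toy transformation)."""
    return np.stack([np.roll(template, s) for s in range(n)])

def signature(image, templatebooks):
    """For each templatebook: max over stored frames of <frame, image>."""
    return np.array([max(frame @ image for frame in tb) for tb in templatebooks])

rng = np.random.default_rng(7)
n = 32
templatebooks = [make_templatebook(rng.normal(size=n), n) for _ in range(5)]

image = rng.normal(size=n)
sig = signature(image, templatebooks)
sig_shifted = signature(np.roll(image, 4), templatebooks)   # the same image, translated
print(np.allclose(sig, sig_shifted))                        # True: the signature is invariant
```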
The invariance lemma
Part II: Why evolution uses hierarchies and how
This is about evolution (mostly before development)
Representation of the affine group
Factorization lemma (why hierarchies)
Stratification conjecture (why size of receptive fields increase)
Stratification Conjecture:For aperture size increasing from zero (relative to resolution of the filtered “image”) the first transformation which can be estimated above the noise of measurements is translation.
Layered architecture: looking at neural images in the layer below, through small apertures corresponding to receptive fields, on a 2D lattice
Layered architectures
Part III: Templatebooks and development of cells' tuning
Mostly development stage
Memory-based invariance module
+2<#.&'"$()HF2<)E#,-5I7+JK)H"*),0<8.&,0>""LI)*"1)H6I),1&$(.&'"$()*1"<)
$&,#1&.)2<&30()
Rust et al. 2005 CarandiniFriday, September 9, 2011
+2<#.&'"$()HF2<)E#,-5I7+JK)H"*),0<8.&,0>""LI)*"1)HMI),1&$(.&'"$()*1"<
$&,#1&.)2<&30()
Friday, September 9, 2011
+2<#.&'"$()HF2<)E#,-5I7+JK)H"*),0<8.&,0>""LI)*"1)H6I),1&$(.&'"$()
$"2(0
Friday, September 9, 2011
+2<#.&'"$()HF2<)E#,-5I7+JK)H"*),0<8.&,0>""LI)*"1)1",&'"$(
Friday, September 9, 2011
D)-&1,""$)"*),50)+7N)-0..
Friday, September 9, 2011
These are predicted features orthogonal to trajectories of Lie group:
for translation (x, y), expansion, rotation
V1
V2/V4?
This is from Hoffman, 1966, cited by Jack Gallant
Towards V2 and V4???
Why do we care about the spectrum of the templatebook, e.g. of the sequence of frames portraying transformations seen by S:C cells during development?
This means that at run time the max is likely to choose the “correct” simple unit (corresponding to the stored template) and provide the same value -- independently of the transformation T. Across complex units, this says that with high probability the signature is invariant to T from layer to layer (for uniform T).
Theory: Linking conjecture
Choose a learning rule * for simple cells synaptic plasticity (eg receptive field) such as Oja’s update rule
or Hebb’s rule with normalized weights
Theorem: the update rules above generate receptive fields reflecting the spectral properties of the templatebook.
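The slides only name the rule; here is a minimal sketch of Oja's update applied to the frames of a (synthetic) templatebook, so that the learned weight vector aligns with the top principal component, i.e. the leading term of the SVD the Linking Theorem refers to:

```python
import numpy as np

def oja(frames, eta=0.01, epochs=200, seed=0):
    """Oja's rule: w <- w + eta * y * (x - y * w), y = w.x (keeps ||w|| close to 1)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=frames.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in frames:
            y = w @ x
            w += eta * y * (x - y * w)
    return w

# Hypothetical templatebook: frames of a template drifting along one fixed direction.
rng = np.random.default_rng(8)
direction = rng.normal(size=20)
direction /= np.linalg.norm(direction)
frames = np.outer(np.linspace(-1, 1, 100), direction) + 0.05 * rng.normal(size=(100, 20))

w = oja(frames)
print(abs(w @ direction))    # close to 1: w aligns with the frames' principal component
```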
Top layer/patch example
((D'#5+&%Q%&+("(?"#'(?*59("(G%R'*'&<(@%'C:5%&<
R'#8P+5(#*$5('+6<$+5(#*$+/$0522P0)&#$+#4K2)+#6$0/.$*'S#.#(+$O'#8$)(-2#6
R'#8K/'(+$+/2#.)(+$5('+6$L&/4K2#G$5('+6N
9-*'(%01'!&-!%!&(%0+6-(7%2-0!7%:!,'!*'%(0'$!6(-7!';%7#*'+!-6!%!1*%++
Joel+Jim+tp, 2010
Friday, September 9, 2011
(()'"*&%&+(#>"44(4:'#%S#(<*"&4?5*9"=5&4B(@%'C:5%&<(
%&@"*%"&#'4(?5*(?"#'4
Friday, September 9, 2011
Some of the many implications
‣ Consistent with HMAX, extends it?
‣ Not inconsistent with Simon Thorpe's TTFS mechanism, nor inconsistent with hierarchies and compositionality (Yuille, Geman...)
‣ “Explains” the Heinrich/Wallis effect and more re Li/DiCarlo single-cell plasticity
‣ Deprive a subject of transformations during development (stroboscopic environment)... clear predictions!
‣ Image priors, shape features, natural image priors are “irrelevant”
Collaborators in recent work:
J. Mutch, J. Leibo, L. Isik, S. Ullman, S. Smale, L. Rosasco, H. Jhuang, C. Tan, N. Edelman, E. Meyers, R. Desimone, Lewis, S. Voinea
Also: M. Riesenhuber, T. Serre, S. Chikkerur, A. Wibisono, J. Bouvrie, M. Kouh, J. DiCarlo, E. Miller, C. Cadieu, A. Oliva, C. Koch, A. Caponnetto, D. Walther, U. Knoblich, T. Masquelier, S. Bileschi, L. Wolf, E. Connor, D. Ferster, I. Lampl, G. Kreiman, N. Logothetis
Jim Mutch