cognitive vision - after the hype
DESCRIPTION
Lecture from the BMVA summer school 2014.

TRANSCRIPT
Cognitive Vision – After the hype
Nicolas [email protected]
Centre for Vision, Speech and Signal Processing University of Surrey
What is vision?
Example: detection/recognition
PASCAL Visual Object Classes Challenge 2007
● Given examples from N classes, we want to detect and recognise new instances of one class in images
Detection/recognition
Some Limitations
● Domain adaptation
● Performance depends on the number of classes
● Complexity grows with the number of classes
● Hard to extend.
Example: Tracking
● A target is identified in a video, we want the system to follow its location and pose over time.
● Template based
– Template drift problem
– Template update strategies
● … We're pretty good at it now.
Videos from the ALIEN tracker,
Z. Kalal, K. Mikolajczyk, and J. Matas. "Tracking-Learning-Detection." IEEE TPAMI 2011.
F. Pernici. "FaceHugger: The ALIEN Tracker Applied to Faces." ECCV 2012.
Robot Vision?
● Navigation (path planning, obstacle avoidance, SLAM)
● Grasping, manipulation, tool use.
● Planning (not strictly vision, but connected)
● Human-robot interaction?
● Mostly a strong need for precise 3D estimates of the world and objects' shapes.
NAO robot (Aldebaran robotics)
Robot Vision: Grasping ?
● Grasping remains a challenging task.
● Five-finger hands are complex to control.
● Choosing (stable) points of contact for fingers – depends on texture, object's 3D shape and weight...
● Precise 3D shape and 6D pose estimation, motion planning, obstacle detection...
● Hard to estimate from vision...
R. Detry, C. H. Ek, M. Madry, J. Piater and D. Kragic, Generalizing Grasps Across Partly Similar Objects. IEEE ICRA 2012.
Robot Vision: Affordances ??
● James J. Gibson The Theory of Affordances (1977)
● Latent “action possibilities” connected to objects.
● Affordance generalisation across object classes...
● Neural evidence: mirror neurons (Rizzolatti, G., Craighero, L. The mirror-neuron system. Annual Review of Neuroscience 27, 169–192, 2004)
Robot Vision: Tool use ???
● Using tools for solving tasks is still a challenge – especially learning to!
● Primates (and even some birds) can do it (Wolfgang Köhler, The Mentality of Apes, 1925).
Tool use (cont'd)
Face detection/recognition...
So... what is vision?
● Loosely defined concept
● Pretty much, vision is what we experience on a daily basis
● A rich, vivid and complete representation of the world...
● … except most of it is made up...
The truth about human vision
● Human eye:
– high resolution only in a small, central area called the fovea (cones).
– colour only in the fovea (cones).
– very coarse elsewhere.
– low-light and motion sensitivity in the periphery (rods).
– we're virtually blind to static areas.
– Ah... and we have a significant blind spot in our field of view.
– … never noticed all that?
Human vision: the dualist illusion
● Our intuition is similar to Descartes' vision
● "The Cartesian theatre"
● We now know (from neuroscience) that this is not the case.
● There is no clear delineation in the brain between perception and cognition.

[Diagram: vision module → cognition/consciousness → action module; from Descartes' "Meditations"]
Vision in the brain
Figure 25-12 from E.R. Kandel, J.H. Schwartz and T.M. Jessel, Eds. Principles of Neural Science, 4th Edition.
Cognitive Vision
● The ideal vision of vision as a separate module feeding information to cognition does not work.
● So, where do we put the bar?
[Diagram: low-level signal processing ↔ cognitive vision ↔ high-level cognition/consciousness, with feedback connections]
Today's roadmap
● A (non-)definition of cognitive vision and its flavours
– The cognitivist/symbolic AI approach and its problems
● The frame problem
● The symbol grounding problem
– The emergent view
● Aside: neural networks
– The embodiment question
● How to get there? Some insights from representation learning and deep architectures.
– Autoencoders
– Convolutional networks
What is Cognitive Vision?
● H.H. Nagel (2003):
– improving computer vision algorithms by adding numerous consistency-check mechanisms, at a logical level.
● David Vernon (2008, first draft 2004):
– "... attempt to achieve more robust, resilient and adaptable computer vision systems by endowing them with cognitive capabilities"
– "... able to adapt to unforeseen changes in the visual environment"
– "... in essence, a combination of computer vision and cognition"
● Multiple approaches to Cog-V:
– Symbolic AI
– Emergent view
– Embodied AI
[Diagram: where does cognitive vision sit? Symbolic AI (dualist) vs. emergent vs. embodied approaches]
H.H. Nagel. Reflections on cognitive vision systems. In proc. of ICVS 2003.
D. Vernon. Cognitive Vision: The case for an embodied perception. Image and Vision Computing 26 (2008).
Example of a Cognitive Architecture: the KnowRob system
KnowRob -- A Knowledge Processing Infrastructure for Cognition-enabled Robots. Part 1: The KnowRob System (Moritz Tenorth, Michael Beetz), IJRR 2013.
Symbolic AI
● Cognition involves operations over symbolic representations.
● "Perception" is the process of abstracting symbolic representations from sensory signals.
● Mostly, the symbolic representation is the product of human design and choice.
● → problem when we go away from the domain of human experience (i.e., the "semantic gap")

[Diagram: sensory signals → interpretation → symbolic representation → logical reasoning]
The symbol grounding problem
● Searle's "Chinese room argument" (1980):
– The symbols do not have the same semantics attached to them as for the designer...
● Harnad (1990):
– Cognition is more than symbol manipulation
– → In other words, the system should learn its own symbols, grounded in its own experiences...
● Barsalou (1999):
– Cognition is inherently perceptual
– (and therefore, perception is inherently cognitive)
The frame problem in AI – part I (Daniel C. Dennett)
● Once upon a time, there was a robot, called R1...
"Cognitive Wheels: The Frame Problem of AI," in C. Hookway, ed., Minds, Machines and Evolution, Cambridge University Press 1984, 129-151.
The frame problem in AI – part II (Daniel C. Dennett)
● A new robot was built to recognise and handle side-effects: R1D1
– "Pulling the wagon does not change the wall colour"
– "Pull the wagon?"
– "Pulling the wagon does not discharge the batteries"
– ...
The frame problem in AI – part III (Daniel C. Dennett)
● The designers built a third robot to assess the relevance of implications: Say hello to R2D1.
● In sum, any action requires a large, a priori unknown, amount of world knowledge
● Hard to predict for the system designer
● Hard for the system to deduce symbolically
● → need for common-sense associations
● For vision: it is hard to predetermine a priori the features and detectors that will be required.
Issues with Symbolic AI
● Symbolic AI is an efficient architecture
● It has successfully solved some hard problems
● ...but it faces some complex limitations due to the separation between symbolic and sub-symbolic components.
[Diagram: computer vision provides detectors that extract symbols from low-level signal processing; AI performs symbolic reasoning towards high-level cognition/consciousness]
Emergent Cognition
● The system develops its own epistemology (set of symbols & associations) from interacting with its environment.
● Enactive view (Maturana, H., Varela, F. The Tree of Knowledge – The Biological Roots of Human Understanding. New Science Library, Boston & London, 1987):
– autonomous system
– can affect the environment
– is affected by the environment (embodied)
– self-organised and self-generated.
● Central nervous system – prediction & adaptation
Fig from Vernon, von Hofsten & Fadiga “A Roadmap for Cognitive Development in Humanoid Robots”. Springer, 2010.
Emergent Cognition: Shared Epistemology
● Pb: different experience → different symbols!
● A shared epistemology comes from communication between agents (my and your concept of "red" are shared, even if you're colour blind)
● Note: communication between artificial systems can be a lot faster!
Artificial Neural Networks
● An "artificial neuron" is in effect:
– a linear transformation
– followed by a non-linear squashing function s

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and a bias input +1, weighted by w1, w2, w3 and b, feeding node n with output $a_n$]
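As a concrete illustration, a single neuron can be sketched in a few lines of Python (a minimal sketch, not code from the lecture; the sigmoid plays the role of the squashing function s):

```python
import math

def neuron(x, w, b):
    # a_n = s(sum_i w_i * x_i + b), here with a sigmoid squashing function s
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With all weights and the bias at zero, the weighted sum is 0 and the sigmoid returns 0.5.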
Non-linearities
● Smooth squashing functions
● continuous and differentiable
● sigmoid → [0,1]: $s(x) = \frac{1}{1+e^{-x}}$, with derivative $s'(x) = (1-s(x))\,s(x)$
● tanh → [-1,+1]: $s(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
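These two squashing functions and the convenient sigmoid derivative can be written directly from the formulas above (a small sketch of my own, not from the slides):

```python
import math

def sigmoid(x):
    # s(x) = 1 / (1 + e^(-x)) squashes any input into [0, 1]
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # the convenient derivative: s'(x) = (1 - s(x)) * s(x)
    s = sigmoid(x)
    return (1.0 - s) * s

# tanh squashes into [-1, +1]; math.tanh implements (e^x - e^-x) / (e^x + e^-x)
```

The closed-form derivative is what makes back-propagation cheap: s'(x) reuses the value s(x) computed in the forward pass.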
Artificial Neural Network (aka Multilayer Perceptron)

[Diagram: a network with input layer #1 ($N^1 = 3$ inputs x1, x2, x3 plus bias +1), "hidden" layer #2 ($N^2 = 2$ nodes h1, h2 plus bias), and output layer #3 ($N^3 = 1$ node r1)]

parameters: $\theta = (W^1, b^1, W^2, b^2)$

$f_\theta(x) = s\left(\sum_{j \in [1,N^2]} W^2_{j1}\, s\left(\sum_{i \in [1,N^1]} W^1_{ij}\, x_i + b^1_j\right) + b^2_1\right)$

Generic node activation:

$z^{l+1}_j = \sum_{i \in [1,N^l]} W^l_{ij}\, a^l_i + b^l_j, \qquad a^l_i = s(z^l_i)$
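The forward pass of such a network is just the generic node activation applied layer by layer. A minimal Python sketch (names are my own; the sigmoid stands in for s):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(a, W, b):
    # z_j = sum_i W[i][j] * a[i] + b[j], then a'_j = s(z_j)
    return [sigmoid(sum(W[i][j] * a[i] for i in range(len(a))) + b[j])
            for j in range(len(b))]

def mlp(x, params):
    # params is a list of (W, b) pairs, one per layer, applied in turn
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a
```

A 3-2-1 network like the one on the slide is `params = [(W1, b1), (W2, b2)]` with W1 of size 3×2 and W2 of size 2×1.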
Learning by back-propagation
For a given datapoint $(x, y)$ with label $y$, we have an error for the network:

$E = \frac{1}{2}\,\|a^3 - y\|^2$

TOP LAYER ERROR:

$\delta^L_j = \frac{\partial E}{\partial a^L_j}\, s'(z^L_j) \quad \left(\Leftrightarrow\; \delta^L_j = (a^L_j - y_j)\, s'(z^L_j)\right)$

OTHER LAYERS ERROR:

$\delta^l_j = \left(\sum_i W^l_{ji}\, \delta^{l+1}_i\right) s'(z^l_j)$

[Diagram: the same 3-2-1 network, with the errors $\delta^3_1$ and $\delta^2_1$ propagating backwards from the output $a^3_1$]
Learning by back-propagation
[Diagram: the same 3-2-1 network, annotated with $\delta^3_1$, $\delta^2_1$ and $a^3_1$]

Finally, we get the error derivative for all network parameters:

$\frac{\partial E}{\partial W^l_{ij}} = a^l_i\, \delta^{l+1}_j, \qquad \frac{\partial E}{\partial b^l_j} = \delta^{l+1}_j$

→ Update parameters with gradient descent.
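The delta recursion and the two gradient formulas can be checked in a short Python sketch (my own illustration, assuming a sigmoid activation so that $s'(z) = s(z)(1-s(z))$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    # returns all activations a^l (including the input) and pre-activations z^l
    a, acts, zs = x, [x], []
    for W, b in layers:
        z = [sum(W[i][j] * a[i] for i in range(len(a))) + b[j]
             for j in range(len(b))]
        a = [sigmoid(zj) for zj in z]
        zs.append(z)
        acts.append(a)
    return acts, zs

def backprop(x, y, layers):
    # gradients of E = 1/2 * ||a^L - y||^2 w.r.t. every weight and bias
    acts, zs = forward(x, layers)
    out = acts[-1]
    # top layer: delta_j = (a^L_j - y_j) * s'(z^L_j)
    delta = [(out[j] - y[j]) * out[j] * (1.0 - out[j]) for j in range(len(y))]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        W, _ = layers[l]
        a_prev = acts[l]
        # dE/dW^l_ij = a^l_i * delta^{l+1}_j ; dE/db^l_j = delta^{l+1}_j
        dW = [[a_prev[i] * delta[j] for j in range(len(delta))]
              for i in range(len(a_prev))]
        grads.append((dW, delta[:]))
        if l > 0:
            # delta^l_j = (sum_i W^l_ji * delta^{l+1}_i) * s'(z^l_j)
            delta = [sum(W[j][i] * delta[i] for i in range(len(delta)))
                     * acts[l][j] * (1.0 - acts[l][j])
                     for j in range(len(a_prev))]
    return list(reversed(grads))
```

A useful sanity check is to compare each analytic gradient against a finite-difference estimate of the same partial derivative; the two should agree to several decimal places.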
Embodiment
● Idea: concepts can only be learnt for and by a body
– → being affected by the environment
– actions and perception are learnt jointly.
– good perception is what allows successful actions.
● Example of reaching with a neural network (Jamone, L., Natale, L., Metta, G., Nori, F., Sandini, G. "Autonomous Online Learning of Reaching Behavior in a Humanoid Robot." International Journal of Humanoid Robotics 9(3), 2012.)
Do we need embodiment?
● If you buy the emergent thesis, it is required:
– joint development of perception & action
– symbol grounding in experience
– → emergent epistemology
● What type of embodiment?
– strong: a physical body (or even an organic body!)
– weak: a system coupled with its environment
● it can affect its environment, and
● it is affected by it
Phylogeny vs. Ontogeny
● Phylogeny: the system's design (e.g. features like SIFT or lines). High in the cognitivist approach, more limited in the emergent paradigm.
● Ontogeny: the system's development during its lifetime, drawn from experiences with its environment.
● Challenges for artificial systems:
– hard to learn high-level, abstract symbols autonomously.
– hard to generalise across experiences
– → how to learn abstract representations from experience?
Representation Learning
● Simple example: PCA
● Aim: identify dimensions that vary jointly
● Components are the axes of largest variation
● Linear transformation: $y = W^T x + \mu$
● Orthogonal basis
● Applied to natural images, generates filters similar to early cortical cells (V1)

P.J.B. Hancock, R.J. Baddeley and L.S. Smith (1992). The principal components of natural images. Network: Computation in Neural Systems 3(1).
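To make the "axes of largest variation" idea concrete, here is a toy sketch (my own, not Hancock et al.'s method) that recovers the leading principal component by power iteration on the covariance matrix:

```python
import math

def first_principal_component(data, iters=200):
    # data: a list of d-dimensional points (lists of floats)
    n, d = len(data), len(data[0])
    mu = [sum(x[i] for x in data) / n for i in range(d)]
    centred = [[x[i] - mu[i] for i in range(d)] for x in data]
    # covariance matrix C = (1/n) * X^T X on the centred data
    C = [[sum(x[i] * x[j] for x in centred) / n for j in range(d)]
         for i in range(d)]
    w = [1.0] * d
    for _ in range(iters):
        # power iteration: repeatedly apply C and renormalise;
        # w converges to the axis of largest variation
        w = [sum(C[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mu, w
```

For data spread along the diagonal of the plane, the recovered axis is (up to sign) $(1,1)/\sqrt{2}$. (Power iteration assumes the starting vector is not orthogonal to the leading eigenvector; a full PCA would use an eigendecomposition instead.)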
Arguments for deep hierarchies
● Feature sharing at intermediate levels → sub-linear coding and computation requirements (Fidler, Boben & Leonardis. Evaluating multi-class learning strategies in a generative hierarchical framework for object detection. NIPS 2009.)
● compact coding (Bengio, Courville & Vincent. Representation Learning: A Review and New Perspectives. IEEE PAMI 35(8), 2013.)
● → the human visual system is estimated to have 5-10 levels (Krüger et al. "Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?" 2013)
● → NN, CART, SVM → 2 layers

Figure from Fidler, Boben & Leonardis 2009.
Arguments for Deep Hierarchies
● Problem with linear representations:
– A combination of any number of linear representations is also a linear representation...

$y = W_1^T x + \mu_1$
$z = W_2^T y + \mu_2$
$\Leftrightarrow\; z = W_2^T W_1^T x + W_2^T \mu_1 + \mu_2$
$\Leftrightarrow\; z = W_3^T x + \mu_3$

[Diagram: x mapped to y by $(W_1, \mu_1)$, y mapped to z by $(W_2, \mu_2)$; equivalently, x mapped directly to z by $(W_3, \mu_3)$]
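The collapse can be verified numerically in one dimension (a toy check of my own; scalars stand in for the matrices $W$):

```python
def affine(W, mu):
    # 1-D affine map y = W * x + mu (a scalar stand-in for y = W^T x + mu)
    return lambda x: W * x + mu

f1 = affine(2.0, 1.0)    # y = 2x + 1
f2 = affine(3.0, -1.0)   # z = 3y - 1
# composing them: z = 3 * (2x + 1) - 1 = 6x + 2, i.e. W3 = 3 * 2, mu3 = 3 * 1 - 1
f3 = affine(6.0, 2.0)
```

Whatever x we try, f2(f1(x)) equals f3(x): stacking linear layers without a non-linearity between them buys no extra expressive power, which is why deep hierarchies need squashing functions.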
Data driven hierarchies: Autoencoders
● Idea: learn jointly a pair of mappings $y = \phi(x)$ and $z = \psi(y)$
● that minimises information loss:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D$

● often using a neural network formulation:

$\phi(x) = s(W x + b), \qquad \psi(y) = W' y + b'$

[Diagram: inputs x1..x4 encoded by $\phi$ into hidden units y1..y3, decoded back by $\psi$]
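A deliberately tiny illustration of the objective (my own construction, not from the lecture): a 1-D linear autoencoder with scalar encoder $\phi(x) = wx$ and decoder $\psi(y) = w_2 y$, trained by gradient descent on the reconstruction loss:

```python
def train_autoencoder(data, steps=500, lr=0.01):
    # encoder phi(x) = w * x, decoder psi(y) = w2 * y (1-D, linear, no bias);
    # gradient descent on the reconstruction loss sum_x (psi(phi(x)) - x)^2
    w, w2 = 0.5, 0.5
    for _ in range(steps):
        for x in data:
            r = w2 * w * x - x           # reconstruction residual
            gw = 2.0 * r * w2 * x        # d(r^2)/dw
            gw2 = 2.0 * r * w * x        # d(r^2)/dw2
            w -= lr * gw
            w2 -= lr * gw2
    return w, w2
```

After training, the product w2·w is close to 1: the pair has learnt a lossless code for this data, which is exactly "minimising information loss". Real autoencoders use the non-linear, multi-dimensional form above.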
Remember: ANNs
● An "artificial neuron" is in effect a linear transformation followed by a non-linear squashing function s:

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and a bias input +1, weighted by w1, w2, w3 and b, feeding node n with output $a_n$]
Data driven hierarchies: Sparse Autoencoders
● Trivial solution when dim(Y) ≥ dim(X)!
● But overcomplete bases can be beneficial (Olshausen & Field 1996)
● Solution → sparse coding:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D + g(\phi(x))$

[Diagram: x encoded by $\phi$ into a sparse code y, decoded back by $\psi$]

Olshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.
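The sparse objective is easy to write down explicitly. In this sketch (my own; the slide leaves g unspecified) g is taken to be an L1 penalty on the code, one common choice for encouraging sparsity:

```python
def sparse_objective(x, phi, psi, lam=0.1):
    # reconstruction error plus a sparsity penalty g on the code;
    # here g is an L1 penalty, lam * sum_j |phi(x)_j|
    y = phi(x)
    x_hat = psi(y)
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return reconstruction + lam * sum(abs(yj) for yj in y)
```

Codes with many zero entries pay little penalty, so among encoders with equal reconstruction quality the optimisation prefers the sparse one; this is what rules out the trivial identity solution when the basis is overcomplete.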
Stacked Auto-encoders
● you can stack multiple layers of AE
● trained layer-wise
● Note that the structure is the same as an ANN.
● Can be fine-tuned with backpropagation

[Diagram: x encoded to h by $(\phi_1, \psi_1)$, h encoded to y by $(\phi_2, \psi_2)$]
Limitations of ANN
● Problem with ANNs: they don't work well with more than 2 layers
● pb. with backprop: the gradient probably gets too diluted.
● Problem for emergent cognition: we want to learn higher levels of abstraction!
● More recently, several alternatives have been developed (Deep Learning): Restricted Boltzmann Machines (RBM), stacked autoencoders, convolutional nets.
Today's hot topic: Convolutional Neural Nets
● CNNs are neural nets (of course)
● sparse connectivity
● shared weights → convolutional
● receptive fields span all input dimensions
● typically alternating layers of convolution and max-pooling

Fig from http://deeplearning.net/tutorial/lenet.html
CNNs (cont'd)
● ex: LeNet (LeCun et al., 1998)
● Alternating convolution & subsampling (i.e., max-pooling) layers
● Top layer is a typical ANN.
● Train using backprop & stochastic gradient descent
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.
Figure from http://deeplearning.net/tutorial/lenet.html
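The two building blocks that alternate in LeNet-style nets can be sketched in 1-D (my own minimal illustration; real CNNs work on 2-D images with many filter channels):

```python
def conv1d(x, w, b):
    # a "valid" 1-D convolution: every output position reuses the same
    # weights w and bias b (the shared weights that make the net convolutional)
    k = len(w)
    return [sum(w[i] * x[j + i] for i in range(k)) + b
            for j in range(len(x) - k + 1)]

def max_pool(x, size=2):
    # non-overlapping max-pooling: keep the strongest response per window,
    # which subsamples the feature map and adds some translation tolerance
    return [max(x[j:j + size]) for j in range(0, len(x) - size + 1, size)]
```

A layer pair is then `max_pool(conv1d(signal, weights, bias))`; stacking several such pairs and topping them with a small fully connected ANN gives the LeNet structure.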
CNNs (cont'd)
● Pb: training deep networks is difficult with backprop (slow, requires LOTS of data)
● CNNs do better (because sparse), but it is still a pb.
● → Unsupervised pre-training of the network
– using, e.g., sparse autoencoders, layer-wise.
– refine the weights with supervised backprop afterwards.
● Top results on MNIST, ILSVRC, PASCAL VOC.
Other Part-based Hierarchies
● Deep belief networks (DBN): Restricted Boltzmann Machines (Hinton, Osindero, and Teh. "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation 18, 2006.)
● Slow Feature Analysis (SFA) (Franzius, Wilbert and Wiskott. "Invariant object recognition and pose estimation with slow feature analysis." Neural Computation, 2011.)
● Compositional hierarchies (Fidler, Boben & Leonardis. "Evaluating multi-class learning strategies in a generative hierarchical framework for object detection." NIPS, 2009.)
● → Good review: Bengio, Courville, and Vincent. "Representation Learning: A Review and New Perspectives." IEEE PAMI 35(8), 2013.

Fidler, S., M. Boben, and A. Leonardis. 2009a. "Learning hierarchical compositional representations of object structure." Pp. 196-215 in Object Categorization: Computer and Human Vision Perspectives, edited by Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, and Michael J. Tarr. New York: Cambridge University Press.
Summary and conclusions
● There is no delineation between cognition and vision.
● Reasoning on hand-crafted symbols may be inadequate (semantic gap) or brittle.
● Learning abstraction is hard, but possible using deep hierarchies.
● Unsupervised pre-training for deep hierarchies is critical → tells us something about cognition.