cognitive vision - after the hype
DESCRIPTION
Lecture from the BMVA summer school 2014.

TRANSCRIPT
Cognitive Vision – After the hype
Nicolas [email protected]
Centre for Vision, Speech and Signal Processing University of Surrey
What is vision?
Example: detection/recognition
PASCAL Visual Object Classes Challenge 2007
● Given examples from N classes, we want to detect and recognise new instances of one class in images
Detection/recognition
Some Limitations
● Domain adaptation
● Performance depends on the number of classes
● Complexity grows with the number of classes
● Hard to extend.
Example: Tracking
● A target is identified in a video, we want the system to follow its location and pose over time.
● Template based
– Template drift problem
– Template update strategies
● … We're pretty good at it now.
Videos from the ALIEN tracker,
Z. Kalal, K. Mikolajczyk, and J. Matas. "Tracking-Learning-Detection." IEEE TPAMI 2011.
F. Pernici. "FaceHugger: The ALIEN Tracker Applied to Faces." ECCV 2012.
Robot Vision?
● Navigation (path planning, obstacle avoidance, SLAM)
● Grasping, manipulation, tool use.
● Planning (not strictly vision, but connected)
● Human-robot interaction?
● Mostly a strong need for precise 3D estimates of the world and objects' shapes.
NAO robot (Aldebaran robotics)
Robot Vision: Grasping ?
● Grasping remains a challenging task.
● Five-finger hands are complex to control.
● Choosing (stable) points of contact for fingers – depends on texture, object's 3D shape and weight...
● Precise 3D shape and 6D pose estimation, motion planning, obstacle detection...
● Hard to estimate from vision...
R. Detry, C. H. Ek, M. Madry, J. Piater and D. Kragic, Generalizing Grasps Across Partly Similar Objects. IEEE ICRA 2012.
Robot Vision: Affordances ??
● James J. Gibson The Theory of Affordances (1977)
● Latent “action possibilities” connected to objects.
● Affordance generalisation across object classes...
● Neural evidence: mirror neurons (Rizzolatti, G., Craighero, L. The mirror-neuron system. Annual Review of Neuroscience 27, 169–192, 2004)
Robot Vision: Tool use ???
● Using tools for solving tasks is still a challenge – especially learning to!
● Primates (and even some birds) can do it (Wolfgang Köhler, The Mentality of Apes, 1925).
Tool use (cont'd)
Face detection/recognition...
So... what is vision?
● Loosely defined concept
● Pretty much, vision is what we experience on a daily basis
● A rich, vivid and complete representation of the world...
● … except most of it is made up...
The truth about human vision
● Human eye:
– high resolution only in a small, central area called the fovea (cones).
– colour only in the fovea (cones).
– very coarse elsewhere.
– low-light and motion sensitivity in the periphery (rods).
– we're virtually blind to static areas.
– Ah... and we have a significant blind spot in our field of view.
– … never noticed all that?
Human vision: the dualist illusion
● Our intuition is similar to Descartes' vision
● "The Cartesian theatre"
● We now know (from neuroscience) that this is not the case.
● There is no clear delineation in the brain between perception and cognition.

[Diagram: vision module → cognition/consciousness → action module; from Descartes' "Meditations"]
Vision in the brain
Figure 25-12 from E.R. Kandel, J.H. Schwartz and T.M. Jessel, Eds. Principles of Neural Science, 4th Edition.
Cognitive Vision
● The ideal vision of vision as a separate module feeding information to cognition does not work.
● So, where do we put the bar?
[Diagram: low-level signal processing ↔ cognitive vision ↔ high-level cognition/consciousness, with feedback connections]
Today's roadmap
● A (non-)definition of cognitive vision and its flavours
– The cognitivist/symbolic AI approach and its problems
● The frame problem
● The symbol grounding problem
– The emergent view
● Aside: neural networks
– The embodiment question
● How to get there? Some insights from representation learning and deep architectures.
– Autoencoders
– Convolutional networks
What is Cognitive Vision?
● H.H. Nagel (2003):
– improving computer vision algorithms by adding numerous consistency-check mechanisms, at a logical level.
● David Vernon (2008, first draft 2004):
– "... attempt to achieve more robust, resilient and adaptable computer vision systems by endowing them with cognitive capabilities"
– "... able to adapt to unforeseen changes in the visual environment"
– "... in essence, a combination of computer vision and cognition"
● Multiple approaches to Cog-V:
– Symbolic AI
– Emergent view
– Embodied AI
[Diagram: where does cognitive vision sit? Symbolic AI (dualist) vs. emergent vs. embodied approaches]
H.H. Nagel. Reflections on cognitive vision systems. In proc. of ICVS 2003.
D. Vernon. Cognitive Vision: The case for an embodied perception. Image and Vision Computing 26 (2008).
Example of a Cognitive Architecture: the KnowRob system
KnowRob -- A Knowledge Processing Infrastructure for Cognition-enabled Robots. Part 1: The KnowRob System (Moritz Tenorth, Michael Beetz), IJRR 2013.
Symbolic AI
● Cognition involves operations over symbolic representations.
● "Perception" is the process of abstracting symbolic representations from sensory signals.
● Mostly, the symbolic representation is the product of human design and choice.
● → problem when we go away from the domain of human experience (i.e., the "semantic gap")

[Diagram: sensory signals → interpretation → symbolic representation → logical reasoning]
The symbol grounding problem
● Searle's "Chinese room argument" (1980):
– The symbols do not have the same semantics attached to them as for the designer...
● Harnad (1990):
– Cognition is more than symbol manipulation
– → In other words, the system should learn its own symbols, grounded in its own experiences...
● Barsalou (1999):
– Cognition is inherently perceptual
– (and therefore, perception is inherently cognitive)
The frame problem in AI – part I (Daniel C. Dennett)
● Once upon a time, there was a robot, called R1...
"Cognitive Wheels: The Frame Problem of AI," in C. Hookway, ed., Minds, Machines and Evolution, Cambridge University Press 1984, 129-151.
The frame problem in AI – part II (Daniel C. Dennett)
● A new robot was built to recognise and handle side-effects: R1D1
– "Pulling the wagon does not change the wall colour"
– "Pull the wagon?"
– "Pulling the wagon does not discharge the batteries"
– ...
The frame problem in AI – part III (Daniel C. Dennett)
● The designers built a third robot to assess the relevance of implications: Say hello to R2D1.
● In sum, any action requires a large, a priori unknown, amount of world knowledge
● Hard to predict for the system designer
● Hard for the system to deduce symbolically
● → need for common-sense associations
● For vision: it is hard to predetermine a priori the features and detectors that will be required.
Issues with Symbolic AI
● Symbolic AI is an efficient architecture
● It has successfully solved some hard problems
● ...but it faces some complex limitations due to the separation between symbolic and sub-symbolic components.
[Diagram: computer vision provides detectors that extract symbols from low-level signal processing; AI performs symbolic reasoning towards high-level cognition/consciousness]
Emergent Cognition
● The system develops its own epistemology (set of symbols & associations) from interacting with its environment.
● Enactive view (Maturana, H., Varela, F. The Tree of Knowledge – The Biological Roots of Human Understanding. New Science Library, Boston & London, 1987):
– autonomous system
– can affect the environment
– is affected by the environment (embodied)
– self-organised and self-generated.
● Central nervous system – prediction & adaptation
Fig from Vernon, von Hofsten & Fadiga “A Roadmap for Cognitive Development in Humanoid Robots”. Springer, 2010.
Emergent Cognition: Shared Epistemology
● Pb: different experience → different symbols!
● A shared epistemology comes from communication between agents (my and your concept of "red" are shared, even if you're colour blind)
● Note: communication between artificial systems can be a lot faster!
Artificial Neural Networks
● An "artificial neuron" is in effect:
– a linear transformation
– followed by a non-linear squashing function s

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and a bias input +1, weighted by w1, w2, w3 and b, feeding node n with output $a_n$]
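As a concrete illustration, a single neuron can be sketched in a few lines of Python (a minimal sketch, not code from the lecture; the sigmoid plays the role of the squashing function s):

```python
import math

def neuron(x, w, b):
    # a_n = s(sum_i w_i * x_i + b), here with a sigmoid squashing function s
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With all weights and the bias at zero, the weighted sum is 0 and the sigmoid returns 0.5.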
Non-linearities
● Smooth squashing functions
● continuous and differentiable
● sigmoid → [0,1]: $s(x) = \frac{1}{1+e^{-x}}$, with derivative $s'(x) = (1-s(x))\,s(x)$
● tanh → [-1,+1]: $s(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
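These two squashing functions and the convenient sigmoid derivative can be written directly from the formulas above (a small sketch of my own, not from the slides):

```python
import math

def sigmoid(x):
    # s(x) = 1 / (1 + e^(-x)) squashes any input into [0, 1]
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # the convenient derivative: s'(x) = (1 - s(x)) * s(x)
    s = sigmoid(x)
    return (1.0 - s) * s

# tanh squashes into [-1, +1]; math.tanh implements (e^x - e^-x) / (e^x + e^-x)
```

The closed-form derivative is what makes back-propagation cheap: s'(x) reuses the value s(x) computed in the forward pass.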
Artificial Neural Network (aka Multilayer Perceptron)

[Diagram: a network with input layer #1 ($N^1 = 3$ inputs x1, x2, x3 plus bias +1), "hidden" layer #2 ($N^2 = 2$ nodes h1, h2 plus bias), and output layer #3 ($N^3 = 1$ node r1)]

parameters: $\theta = (W^1, b^1, W^2, b^2)$

$f_\theta(x) = s\left(\sum_{j \in [1,N^2]} W^2_{j1}\, s\left(\sum_{i \in [1,N^1]} W^1_{ij}\, x_i + b^1_j\right) + b^2_1\right)$

Generic node activation:

$z^{l+1}_j = \sum_{i \in [1,N^l]} W^l_{ij}\, a^l_i + b^l_j, \qquad a^l_i = s(z^l_i)$
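The forward pass of such a network is just the generic node activation applied layer by layer. A minimal Python sketch (names are my own; the sigmoid stands in for s):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(a, W, b):
    # z_j = sum_i W[i][j] * a[i] + b[j], then a'_j = s(z_j)
    return [sigmoid(sum(W[i][j] * a[i] for i in range(len(a))) + b[j])
            for j in range(len(b))]

def mlp(x, params):
    # params is a list of (W, b) pairs, one per layer, applied in turn
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a
```

A 3-2-1 network like the one on the slide is `params = [(W1, b1), (W2, b2)]` with W1 of size 3×2 and W2 of size 2×1.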
Learning by back-propagation
For a given datapoint $(x, y)$ with label $y$, we have an error for the network:

$E = \frac{1}{2}\,\|a^3 - y\|^2$

TOP LAYER ERROR:

$\delta^L_j = \frac{\partial E}{\partial a^L_j}\, s'(z^L_j) \quad \left(\Leftrightarrow\; \delta^L_j = (a^L_j - y_j)\, s'(z^L_j)\right)$

OTHER LAYERS ERROR:

$\delta^l_j = \left(\sum_i W^l_{ji}\, \delta^{l+1}_i\right) s'(z^l_j)$

[Diagram: the same 3-2-1 network, with the errors $\delta^3_1$ and $\delta^2_1$ propagating backwards from the output $a^3_1$]
Learning by back-propagation
[Diagram: the same 3-2-1 network, annotated with $\delta^3_1$, $\delta^2_1$ and $a^3_1$]

Finally, we get the error derivative for all network parameters:

$\frac{\partial E}{\partial W^l_{ij}} = a^l_i\, \delta^{l+1}_j, \qquad \frac{\partial E}{\partial b^l_j} = \delta^{l+1}_j$

→ Update parameters with gradient descent.
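The delta recursion and the two gradient formulas can be checked in a short Python sketch (my own illustration, assuming a sigmoid activation so that $s'(z) = s(z)(1-s(z))$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    # returns all activations a^l (including the input) and pre-activations z^l
    a, acts, zs = x, [x], []
    for W, b in layers:
        z = [sum(W[i][j] * a[i] for i in range(len(a))) + b[j]
             for j in range(len(b))]
        a = [sigmoid(zj) for zj in z]
        zs.append(z)
        acts.append(a)
    return acts, zs

def backprop(x, y, layers):
    # gradients of E = 1/2 * ||a^L - y||^2 w.r.t. every weight and bias
    acts, zs = forward(x, layers)
    out = acts[-1]
    # top layer: delta_j = (a^L_j - y_j) * s'(z^L_j)
    delta = [(out[j] - y[j]) * out[j] * (1.0 - out[j]) for j in range(len(y))]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        W, _ = layers[l]
        a_prev = acts[l]
        # dE/dW^l_ij = a^l_i * delta^{l+1}_j ; dE/db^l_j = delta^{l+1}_j
        dW = [[a_prev[i] * delta[j] for j in range(len(delta))]
              for i in range(len(a_prev))]
        grads.append((dW, delta[:]))
        if l > 0:
            # delta^l_j = (sum_i W^l_ji * delta^{l+1}_i) * s'(z^l_j)
            delta = [sum(W[j][i] * delta[i] for i in range(len(delta)))
                     * acts[l][j] * (1.0 - acts[l][j])
                     for j in range(len(a_prev))]
    return list(reversed(grads))
```

A useful sanity check is to compare each analytic gradient against a finite-difference estimate of the same partial derivative; the two should agree to several decimal places.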
Embodiment
● Idea: concepts can only be learnt for and by a body
– → being affected by the environment
– actions and perception are learnt jointly.
– good perception is what allows successful actions.
● Example of reaching with a neural network (Jamone, L., Natale, L., Metta, G., Nori, F., Sandini, G. "Autonomous Online Learning of Reaching Behavior in a Humanoid Robot." International Journal of Humanoid Robotics 9(3), 2012.)
Do we need embodiment?
● If you buy the emergent thesis, it is required:
– joint development of perception & action
– symbol grounding in experience
– → emergent epistemology
● What type of embodiment?
– strong: a physical body (or even an organic body!)
– weak: a system coupled with its environment
● it can affect its environment, and
● it is affected by it
Phylogeny vs. Ontogeny
● Phylogeny: the system's design (e.g. features like SIFT or lines). High in the cognitivist approach, more limited in the emergent paradigm.
● Ontogeny: the system's development during its lifetime, drawn from experiences with its environment.
● Challenges for artificial systems:
– hard to learn high-level, abstract symbols autonomously.
– hard to generalise across experiences
– → how to learn abstract representations from experience?
Representation Learning
● Simple example: PCA
● Aim: identify dimensions that vary jointly
● Components are the axes of largest variation
● Linear transformation: $y = W^T x + \mu$
● Orthogonal basis
● Applied to natural images, generates filters similar to early cortical cells (V1)

P.J.B. Hancock, R.J. Baddeley and L.S. Smith (1992). The principal components of natural images. Network: Computation in Neural Systems 3(1).
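To make the "axes of largest variation" idea concrete, here is a toy sketch (my own, not Hancock et al.'s method) that recovers the leading principal component by power iteration on the covariance matrix:

```python
import math

def first_principal_component(data, iters=200):
    # data: a list of d-dimensional points (lists of floats)
    n, d = len(data), len(data[0])
    mu = [sum(x[i] for x in data) / n for i in range(d)]
    centred = [[x[i] - mu[i] for i in range(d)] for x in data]
    # covariance matrix C = (1/n) * X^T X on the centred data
    C = [[sum(x[i] * x[j] for x in centred) / n for j in range(d)]
         for i in range(d)]
    w = [1.0] * d
    for _ in range(iters):
        # power iteration: repeatedly apply C and renormalise;
        # w converges to the axis of largest variation
        w = [sum(C[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mu, w
```

For data spread along the diagonal of the plane, the recovered axis is (up to sign) $(1,1)/\sqrt{2}$. (Power iteration assumes the starting vector is not orthogonal to the leading eigenvector; a full PCA would use an eigendecomposition instead.)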
Arguments for deep hierarchies
● Feature sharing at intermediate levels → sub-linear coding and computation requirements (Fidler, Boben & Leonardis. Evaluating multi-class learning strategies in a generative hierarchical framework for object detection. NIPS 2009.)
● compact coding (Bengio, Courville & Vincent. Representation Learning: A Review and New Perspectives. IEEE PAMI 35(8), 2013.)
● → the human visual system is estimated to have 5-10 levels (Krüger et al. "Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?" 2013)
● → NN, CART, SVM → 2 layers

Figure from Fidler, Boben & Leonardis 2009.
Arguments for Deep Hierarchies
● Problem with linear representations:
– A combination of any number of linear representations is also a linear representation...

$y = W_1^T x + \mu_1$
$z = W_2^T y + \mu_2$
$\Leftrightarrow\; z = W_2^T W_1^T x + W_2^T \mu_1 + \mu_2$
$\Leftrightarrow\; z = W_3^T x + \mu_3$

[Diagram: x mapped to y by $(W_1, \mu_1)$, y mapped to z by $(W_2, \mu_2)$; equivalently, x mapped directly to z by $(W_3, \mu_3)$]
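The collapse can be verified numerically in one dimension (a toy check of my own; scalars stand in for the matrices $W$):

```python
def affine(W, mu):
    # 1-D affine map y = W * x + mu (a scalar stand-in for y = W^T x + mu)
    return lambda x: W * x + mu

f1 = affine(2.0, 1.0)    # y = 2x + 1
f2 = affine(3.0, -1.0)   # z = 3y - 1
# composing them: z = 3 * (2x + 1) - 1 = 6x + 2, i.e. W3 = 3 * 2, mu3 = 3 * 1 - 1
f3 = affine(6.0, 2.0)
```

Whatever x we try, f2(f1(x)) equals f3(x): stacking linear layers without a non-linearity between them buys no extra expressive power, which is why deep hierarchies need squashing functions.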
Data driven hierarchies: Autoencoders
● Idea: learn jointly a pair of mappings $y = \phi(x)$ and $z = \psi(y)$
● that minimises information loss:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D$

● often using a neural network formulation:

$\phi(x) = s(W x + b), \qquad \psi(y) = W' y + b'$

[Diagram: inputs x1..x4 encoded by $\phi$ into hidden units y1..y3, decoded back by $\psi$]
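A deliberately tiny illustration of the objective (my own construction, not from the lecture): a 1-D linear autoencoder with scalar encoder $\phi(x) = wx$ and decoder $\psi(y) = w_2 y$, trained by gradient descent on the reconstruction loss:

```python
def train_autoencoder(data, steps=500, lr=0.01):
    # encoder phi(x) = w * x, decoder psi(y) = w2 * y (1-D, linear, no bias);
    # gradient descent on the reconstruction loss sum_x (psi(phi(x)) - x)^2
    w, w2 = 0.5, 0.5
    for _ in range(steps):
        for x in data:
            r = w2 * w * x - x           # reconstruction residual
            gw = 2.0 * r * w2 * x        # d(r^2)/dw
            gw2 = 2.0 * r * w * x        # d(r^2)/dw2
            w -= lr * gw
            w2 -= lr * gw2
    return w, w2
```

After training, the product w2·w is close to 1: the pair has learnt a lossless code for this data, which is exactly "minimising information loss". Real autoencoders use the non-linear, multi-dimensional form above.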
Remember: ANNs
● An "artificial neuron" is in effect a linear transformation followed by a non-linear squashing function s:

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and a bias input +1, weighted by w1, w2, w3 and b, feeding node n with output $a_n$]
Data driven hierarchies: Sparse Autoencoders
● Trivial solution when dim(Y) ≥ dim(X)!
● But overcomplete bases can be beneficial (Olshausen & Field 1996)
● Solution → sparse coding:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D + g(\phi(x))$

[Diagram: x encoded by $\phi$ into a sparse code y, decoded back by $\psi$]

Olshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.
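The sparse objective is easy to write down explicitly. In this sketch (my own; the slide leaves g unspecified) g is taken to be an L1 penalty on the code, one common choice for encouraging sparsity:

```python
def sparse_objective(x, phi, psi, lam=0.1):
    # reconstruction error plus a sparsity penalty g on the code;
    # here g is an L1 penalty, lam * sum_j |phi(x)_j|
    y = phi(x)
    x_hat = psi(y)
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return reconstruction + lam * sum(abs(yj) for yj in y)
```

Codes with many zero entries pay little penalty, so among encoders with equal reconstruction quality the optimisation prefers the sparse one; this is what rules out the trivial identity solution when the basis is overcomplete.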
Stacked Auto-encoders
● you can stack multiple layers of AE
● trained layer-wise
● Note that the structure is the same as an ANN.
● Can be fine-tuned with backpropagation

[Diagram: x encoded to h by $(\phi_1, \psi_1)$, h encoded to y by $(\phi_2, \psi_2)$]
Limitations of ANN
● Problem with ANNs: they don't work well with more than 2 layers
● pb. with backprop: the gradient probably gets too diluted.
● Problem for emergent cognition: we want to learn higher levels of abstraction!
● More recently, several alternatives have been developed (Deep Learning): Restricted Boltzmann Machines (RBM), stacked autoencoders, convolutional nets.
Today's hot topic: Convolutional Neural Nets
● CNNs are neural nets (of course)
● sparse connectivity
● shared weights → convolutional
● receptive fields span all input dimensions
● typically alternating layers of convolution and max-pooling

Fig from http://deeplearning.net/tutorial/lenet.html
CNNs (cont'd)
● ex: LeNet (LeCun et al., 1998)
● Alternating convolution & subsampling (i.e., max-pooling) layers
● Top layer is a typical ANN.
● Train using backprop & stochastic gradient descent
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.
Figure from http://deeplearning.net/tutorial/lenet.html
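The two building blocks that alternate in LeNet-style nets can be sketched in 1-D (my own minimal illustration; real CNNs work on 2-D images with many filter channels):

```python
def conv1d(x, w, b):
    # a "valid" 1-D convolution: every output position reuses the same
    # weights w and bias b (the shared weights that make the net convolutional)
    k = len(w)
    return [sum(w[i] * x[j + i] for i in range(k)) + b
            for j in range(len(x) - k + 1)]

def max_pool(x, size=2):
    # non-overlapping max-pooling: keep the strongest response per window,
    # which subsamples the feature map and adds some translation tolerance
    return [max(x[j:j + size]) for j in range(0, len(x) - size + 1, size)]
```

A layer pair is then `max_pool(conv1d(signal, weights, bias))`; stacking several such pairs and topping them with a small fully connected ANN gives the LeNet structure.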
CNNs (cont'd)
● Pb: training deep networks is difficult with backprop (slow, requires LOTS of data)
● CNNs do better (because sparse), but it is still a pb.
● → Unsupervised pre-training of the network
– using, e.g., sparse autoencoders, layer-wise.
– refine the weights with supervised backprop afterwards.
● Top results on MNIST, ILSVRC, PASCAL VOC.
Other Part-based Hierarchies
● Deep belief networks (DBN): Restricted Boltzmann Machines (Hinton, Osindero, and Teh. "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation 18, 2006.)
● Slow Feature Analysis (SFA) (Franzius, Wilbert and Wiskott. "Invariant object recognition and pose estimation with slow feature analysis." Neural Computation, 2011.)
● Compositional hierarchies (Fidler, Boben & Leonardis. "Evaluating multi-class learning strategies in a generative hierarchical framework for object detection." NIPS, 2009.)
● → Good review: Bengio, Courville, and Vincent. "Representation Learning: A Review and New Perspectives." IEEE PAMI 35(8), 2013.

Fidler, S., M. Boben, and A. Leonardis. 2009a. "Learning hierarchical compositional representations of object structure." Pp. 196-215 in Object Categorization: Computer and Human Vision Perspectives, edited by Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, and Michael J. Tarr. New York: Cambridge University Press.
Summary and conclusions
● There is no delineation between cognition and vision.
● Reasoning on hand-crafted symbols may be inadequate (semantic gap) or brittle.
● Learning abstraction is hard, but possible using deep hierarchies.
● Unsupervised pre-training for deep hierarchies is critical → tells us something about cognition.