Human Object Recognition: Part 3 of the Biomimetic Trilogy (Bruce Draper)
Post on 20-Dec-2015
Review: A Divided Vision System
The human vision system has three major components:
1. The early vision system
   - Retinogeniculate pathway: Retina → LGNd → V1 (→ V2 → V3), and channels
   - Retinotectal pathways:
     - Retina → S.C. → Pulvinar Nucleus → V1 (→ V2 → V3)
     - Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
     - Retina → S.C. → LGNd (interlaminar) → V1 (→ V2 → V3)
2. The dorsal ("where") pathway
3. The ventral ("what") pathway
D. Milner & M. Goodale, The Visual Brain in Action, p. 22
The Early Vision System
- Retinotopically mapped
  - Small receptive fields in LGNd, V1
  - Receptive fields grow with processing depth: bigger in V2, bigger still in V3…
- Spatially organized into feature maps
  - Edge maps (Gabor filters, quadrature pairs)
  - Color maps
  - Disparity maps
  - Motion maps (in MT, if not before)
- Afferent & efferent connections
- Measurable neural correlates of spatial attention
An Early Vision Hypothesis
Logic: why compute any feature across the entire image when it would be cheaper to compute it later across only the attention window? Because you need the feature to select the attention window.
Neural evidence: Neural correlates of spatial attention (e.g. anticipatory firing, enhanced firing) are measurable in V1 and even LGNd.
Psychological evidence: ventral and dorsal streams appear to process the same attention windows, suggesting that attention is selected prior to the ventral/dorsal split.
Caveat: Some dorsal vision tasks (e.g. ego-motion estimation) benefit from a broad field of view, and may be non-attentional
The primary role of the early vision system is spatial attention
The Dorsal/Ventral Split
Color codes:
- Red: early vision
- Orange/Yellow: dorsal; leads to somatosensory and motor cortex
- Blue/Green: ventral; leads more to memories, frontal cortex; more developed in humans than monkeys
A Dorsal Vision Hypothesis
Anatomical evidence:
1. Strongly connected to motion and stereo processing in V1
2. Dorsal areas (e.g. LIP, 7a) inactive under anaesthesia
3. Neurons conjointly tuned for perception and action
4. Saccade-responsive neurons and gaze-responsive neurons
Behavioral evidence:
1. Monkeys with dorsal lesions recognize objects but can't grab them
2. Blindsight (see next slide)
Milner & Goodale: the dorsal vision system supports immediate actions, and not cognition or memory
Blindsight
Patients with severe damage to V1 are “cortically blind”
- Report no sensation of vision
- MRI confirms no activity in V1
- Saccadic eye movements continue
Nonetheless, they can point at targets:
- Much better than random (see chart)
- Once they relax & let it happen
Why?
- Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
- MRI confirms some dorsal vision activity
So?
- Confirms that dorsal vision has no contact with cognition
A Ventral Vision Hypothesis
Anatomical evidence:
1. Visual pathways connect early vision to areas associated with memory (e.g. the right inferior frontal lobe (RIFL))
2. MRI centers of activity in the ventral stream during (a) expert object recognition and (b) landmark recognition
Behavioral evidence:
1. Ventral lesions in monkeys prevent object recognition
2. Lesions in the fusiform gyrus in humans lead to prosopagnosia
3. Stimulation of RIFL during surgery creates mental images

Milner & Goodale: the ventral pathway supports vision for cognition, including (categorical & sub-categorical) object recognition and landmark-based navigation
Repetition Suppression
What happens when the same stimulus is presented repeatedly to the vision system?
- In fMRI studies, the total response of a voxel drops with each presentation
- In single-cell recording studies, neural responses become extreme:
  - Most cells stop firing at all
  - A few cells start responding at their maximal firing rate
- This can be observed in the ventral stream, but not in the early vision system
- This can be observed at both short and long time scales
  - Short-time-scale repetition suppression is interrupted by novel targets
This may seem like a tangent, but it's not…
Decomposing the Ventral Stream
The ventral stream has 4 major parts, as revealed by MRI:
1. The early vision system
   - Both the ventral & dorsal streams start here
   - Selects spatial attention windows (our hypothesis)
2. The lateral occipital cortex (LOC)
   - Large area, diffusely active in MRI studies
   - Includes (at least) V4 & V8
   - Kosslyn hypothesizes feature extraction
3. The inferotemporal cortex
   - Large area, diffusely active in MRI studies
   - Sharp focus of activity in the fusiform gyrus during expert recognition
   - Sharp focus of activity in the parahippocampal gyrus during landmark recognition
4. The right inferior frontal cortex
   - Associated with visual memories
   - Efferently stimulates V1 when active
   - Strongly lateralized
Area V8 (Lateral Occipital Cortex)
Short-term repetition studies suggest V8 computes edge-based features:
- Equal amounts of suppression for image/image, image/edge, edge/image, or edge/edge pairs
Psychological studies suggest that recognition is sensitive to the disruption of "non-accidental" features:
1. Colinearity
2. Parallelism (translational symmetry)
3. Reflection (anti-symmetry)
4. Co-termination (end-points near)
5. Constant curvature
Diffuse response suggests population coding
An LOC Hypothesis
Evidence:
- Diffuse responses are consistent with population codes
- Fits psychological models of LOC as feature extraction
- Explains repetition suppression effects in V8
- Explains non-classical receptive field responses in V1 (assuming efferent feedback to early vision)

Area V8 detects non-accidental edge relations through parameter-space voting schemes (e.g. Hough spaces). Other LOC areas use voting schemes to summarize other features, e.g. color histograms in area V4/V7. Together, LOC areas create a high-dimensional but distributed feature representation.
Inferotemporal Cortex (IT)
- Diffusely active in fMRI during all types of object recognition
- Last visual processing stage before memories
- Distributed responses to objects (Tsunoda et al.):
[Figure: test stimuli and hot spots, versus control, shown at different levels of statistical significance]
Inferotemporal Cortex (continued)
- Hot spots overlap, and aren't contiguous (population code)
- Some stimuli yield greater total responses; responses overlap
- Always some response; minimal effect of stimulus intensity
- Figure A is a control: hot spots from 3 different objects
- Figure B: red spots respond to the whole cat; a subset of spots (blue) respond to just the head; a subset of that responds to a silhouette of the head (yellow)
  - Implication: part-based features
- Figure C: blue spot responds to the whole object, but not to a simplification; some red spots respond only to the simplified version
  - Implication: a more complex scenario, in which some feature responses are turned off by the whole object (competition?)
An IT Hypothesis
- Repetition suppression effects are strongest in IT
- Single-cell recording studies show that IT cells respond to multiple features (e.g. color + shape)
- Simpler organizations (e.g. part/subpart hierarchies, "view maps") are not supported by single-cell recording data

Repetition suppression in inferotemporal cortex implements unsupervised feature-space segmentation, thus categorizing attention windows.
Expert Object Recognition
Expert object recognition applies when:
- The viewer is very familiar with the target object
- The illumination and viewpoint are familiar
- The target is recognized at both a categorical & sub-categorical level
- Example: human faces
  - Sub-categories: expression, age, gender
Expert recognition properties include:
- Fine sub-categorical discrimination, increased recognition speed
- Equal response times for category/sub-category
- Inability to dissociate categorical & sub-categorical recognition
- Trainable: everyone is expert at recognizing faces and chairs; dog-show judges are expert at dogs; subjects can be trained to be expert with Greebles
Expert Object Recognition (II)
Anatomically, expert object recognition is distinguished by:
1. (fMRI) Activation of early vision, LOC & IT
   - All forms of recognition do this
2. (fMRI) Sharp centers of activation in the fusiform gyrus (in IT) and the right inferior frontal lobe
3. (ERP) The N170 signal (170 ms post-stimulus)
An Expert Recognition Hypothesis
Evidence:
1. Expert recognition is illumination & viewpoint dependent
2. It activates RIFL, which creates mental images & can activate the image buffers in V1

Expert object recognition is appearance-based, matching the current stimulus to previous memories. When a category becomes familiar, the fusiform gyrus is recruited to build a manifold representation of the samples. Sub-categorical properties are encoded in the manifold dimensions.
An End-to-end computational model
(1) Bottom-up spatial selective attention:
- Multi-scale maps for intensity, colors, edges (V1)
- Difference-of-Gaussian (on-center/off-surround) filtering to find impulses
- Select peaks in x, y, scale as attention windows
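The step-1 pipeline can be sketched in a few lines of numpy. This is a minimal single-channel, single-scale illustration (the full model uses multi-scale intensity, color, and edge maps); the function names are hypothetical, and greedy peak suppression stands in for inhibition of return.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur using only numpy (reflect padding)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.pad(img, radius, mode='reflect')
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)
    return out[radius:-radius, radius:-radius]

def dog_saliency(img, sigma_center=1.0, sigma_surround=4.0):
    """On-center/off-surround response: difference of two Gaussian blurs."""
    return gaussian_blur(img, sigma_center) - gaussian_blur(img, sigma_surround)

def select_attention_windows(saliency, n_windows=3, window=15):
    """Greedy peak selection: take the strongest peak, suppress its
    neighborhood (a crude inhibition of return), repeat."""
    s = saliency.copy()
    half = window // 2
    centers = []
    for _ in range(n_windows):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        centers.append((y, x))
        s[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1] = -np.inf
    return centers
```

A bright impulse in an otherwise uniform image produces a saliency peak at its location, which becomes the first attention window.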
Step 1 Issues
Issues with step #1:
- More information channels:
  - Motion (Trent Williams found this is hard)
  - Disparity
- Inhibition of return
- Top-down control:
  - Integration of predictions (predictive attention)
  - Split attention?
Note: attention windows do not correspond to objects. They are just interesting parts of the image (but repeatability is key).
Step 2: Feature Extraction
(2) Attention windows are converted into fixed-length sparse feature vectors by parameter-space voting techniques:
- V8 is modeled with multiple non-accidental features:
  - Hough space for colinearity
  - Hough space of axes of reflection for anti-symmetry and co-termination
- V4 is modeled as a color histogram
- Simplest feature: low-resolution pixels
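The two simplest feature channels, a V4-style color histogram and low-resolution pixels, can be sketched directly. A minimal numpy illustration with hypothetical helper names, assuming an H×W×3 RGB window with values in [0, 1]:

```python
import numpy as np

def color_histogram(window, bins=4):
    """V4 analogue: coarse joint RGB histogram of the attention window,
    normalized so window size does not matter."""
    idx = np.clip((window * bins).astype(int), 0, bins - 1)
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins**3).astype(float)
    return hist / hist.sum()

def lowres_pixels(window, size=8):
    """Simplest feature: the window reduced to a fixed size x size
    grayscale grid by sampling."""
    gray = window.mean(axis=2)
    ys = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(ys, xs)].ravel()

def feature_vector(window):
    """Concatenate per-channel features into one fixed-length vector,
    the input to the step-3 segmentation."""
    return np.concatenate([color_histogram(window), lowres_pixels(window)])
```

Because every channel has a fixed length, windows of different sizes still yield comparable feature vectors.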
Step 2 Examples
[Figure: source attention window shown in image space and in Hough space]
- Collinearity: edges vote in Hough space for positions and orientations of lines
- Reflection (symmetry & vertices): pairs of edges vote for the axes of reflection that map one onto the other (if any)
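The collinearity vote can be sketched as follows. This is a simplified illustration, not the model's actual binning: because each edge element carries an orientation, it casts a single vote for the line through it, so collinear elements accumulate in one (rho, theta) bin.

```python
import numpy as np

def hough_line_votes(edge_points, orientations, n_rho=32, n_theta=16, diag=64.0):
    """Accumulate votes in (rho, theta) line space.
    edge_points: list of (y, x); orientations: edge direction in radians.
    rho = x*cos(theta) + y*sin(theta), with theta the line-normal angle."""
    votes = np.zeros((n_rho, n_theta))
    for (y, x), ang in zip(edge_points, orientations):
        theta = (ang + np.pi / 2) % np.pi            # normal to the edge direction
        rho = x * np.cos(theta) + y * np.sin(theta)
        ti = int(theta / np.pi * n_theta) % n_theta
        ri = int((rho + diag) / (2 * diag) * n_rho)  # rho assumed in [-diag, diag]
        if 0 <= ri < n_rho:
            votes[ri, ti] += 1
    return votes
```

The height of the tallest peak measures how many edge elements lie on a common line, i.e. the strength of the collinearity feature.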
Step 2 Issues
Missing features:
- Constant curvature (V8)
- Apparent-color-corrected histograms (V4)
- Disparity features
Huge parameter space:
- How to evaluate features without supervision?
Step 3: Feature Space Segmentation
(3) IT is modeled as O(1) unsupervised segmentation:
- The features extracted in step #2 are concatenated to form a single, high-dimensional representation
- A 1-level neural net is trained to segment the samples:
  - If a neuron responds < 0.5 to a sample, give it a training signal of 0 for that sample
  - If a neuron responds > 0.5, give a training signal of 1.0
  - Note that every neuron is trained independently; there is no communication among them
- The response of IT to a sample is the vector of binarized neural responses
  - Each pattern of responses is a region in feature space
Step 3 issues
Stability:
- If neurons keep adapting, then region codes change
- Linear neurons imply non-local interactions; radial basis neurons should perform better
Evaluation: what makes one categorization better than another?
- No supervised training data
- Number and size of categories vary
- Gabe Salazar is cutting his teeth on this one…
Top-down predictions:
- Can we predict a category, and use it to influence steps 1 & 2?
Steps 4 & 5 (unimplemented)
(4) Create a sub-space manifold to describe samples in crowded regions:
- PCA subspaces are a first approximation
- Locally linear embedding manifolds are better
- Sub-categories should correspond to manifold dimensions
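The PCA first approximation from step 4 can be sketched with numpy's SVD. The function names are hypothetical, and in the full model a locally linear embedding would replace this linear fit:

```python
import numpy as np

def pca_subspace(samples, k=2):
    """Fit a k-dimensional PCA subspace to the samples that landed in one
    crowded feature-space region. Returns the region mean and top-k axes."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # SVD of the centered data: rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def manifold_coords(x, mean, axes):
    """Sub-categorical description of a sample: its coordinates along
    the manifold dimensions."""
    return axes @ (x - mean)
```

For samples that really do lie on a k-dimensional subspace, the manifold coordinates reconstruct each sample exactly from the mean and axes.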
(5) Associative memory:
- Associate attention windows with:
  - Other attention windows (to generate predictions)
  - Other modalities (e.g. language)
Adele Howe and I have a joint interest in this last point.