Human Object Recognition: Part 3 of the Biomimetic Trilogy (Bruce Draper)
Post on 20-Dec-2015
Review: A Divided Vision System
The human vision system has three major components:
1. The early vision system
   - Retinogeniculate pathway: Retina → LGNd → V1 (→ V2 → V3), and channels
   - Retinotectal pathways:
     - Retina → S.C. → Pulvinar Nucleus → V1 (→ V2 → V3)
     - Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
     - Retina → S.C. → LGNd (interlaminar) → V1 (→ V2 → V3)
2. The dorsal ("where") pathway
3. The ventral ("what") pathway
D. Milner & M. Goodale, The Visual Brain in Action, p. 22
The Early Vision System
- Retinotopically mapped
  - Small receptive fields in LGNd, V1
  - Receptive fields grow with processing depth: bigger in V2, bigger still in V3…
- Spatially organized into feature maps
  - Edge maps (Gabor filters, quadrature pairs)
  - Color maps
  - Disparity maps
  - Motion maps (in MT, if not before)
- Afferent & efferent connections
- Measurable neural correlates of spatial attention
An Early Vision Hypothesis
Logic: why compute any feature across the entire image when it would be cheaper to compute it later across only the attention window? Because you need the feature to select the attention window.
Neural evidence: Neural correlates of spatial attention (e.g. anticipatory firing, enhanced firing) are measurable in V1 and even LGNd.
Psychological evidence: ventral and dorsal streams appear to process the same attention windows, suggesting that attention is selected prior to the ventral/dorsal split.
Caveat: Some dorsal vision tasks (e.g. ego-motion estimation) benefit from a broad field of view, and may be non-attentional
The primary role of the early vision system is spatial attention
The Dorsal/Ventral Split
Color codes:
- Red: early vision
- Orange/Yellow: dorsal; leads to somatosensory and motor cortex
- Blue/Green: ventral; leads more to memories, frontal cortex; more developed in humans than monkeys
A Dorsal Vision Hypothesis
Anatomical evidence:
1. Strongly connected to motion and stereo processing in V1
2. Dorsal areas (e.g. LIP, 7a) inactive under anaesthesia
3. Neurons conjointly tuned for perception and action
4. Saccade-responsive neurons and gaze-responsive neurons
Behavioral evidence:
1. Monkeys with dorsal lesions recognize objects but can't grab them
2. Blindsight (see next slide)
Milner & Goodale: the dorsal vision system supports immediate actions, and not cognition or memory
Blindsight
Patients with severe damage to V1 are “cortically blind”
- Report no sensation of vision
- MRI confirms no activity in V1
- Saccadic eye movements continue
Nonetheless, they can point at targets:
- Much better than random (see chart)
- Once they relax & let it happen
Why?
- Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
- MRI confirms some dorsal vision activity
So?
- Confirms that dorsal vision has no contact with cognition
A Ventral Vision Hypothesis
Anatomical evidence:
1. Visual pathways connect early vision to areas associated with memory (e.g. the right inferior frontal lobe (RIFL))
2. MRI centers of activity in the ventral stream during (a) expert object recognition and (b) landmark recognition
Behavioral evidence:
1. Ventral lesions in monkeys prevent object recognition
2. Lesions in the fusiform gyrus in humans lead to prosopagnosia
3. Stimulation of RIFL during surgery creates mental images

Milner & Goodale: the ventral pathway supports vision for cognition, including (categorical & sub-categorical) object recognition and landmark-based navigation
Repetition Suppression
What happens when the same stimulus is presented repeatedly to the vision system?
- In fMRI studies, the total response of a voxel drops with each presentation
- In single-cell recording studies, neural responses become extreme:
  - Most cells stop firing at all
  - A few cells start responding at their maximal firing rate
- This can be observed in the ventral stream, but not in the early vision system
- This can be observed at both short and long time scales
  - Short-time-scale repetition suppression is interrupted by novel targets
This may seem like a tangent, but it's not…
Decomposing the Ventral Stream
The ventral stream has 4 major parts, as revealed by MRI:
1. The early vision system
   - Both the ventral & dorsal streams start here
   - Selects spatial attention windows (our hypothesis)
2. The lateral occipital cortex (LOC)
   - Large area, diffusely active in MRI studies
   - Includes (at least) V4 & V8
   - Kosslyn hypothesizes feature extraction
3. The inferotemporal cortex
   - Large area, diffusely active in MRI studies
   - Sharp focus of activity in the fusiform gyrus during expert recognition
   - Sharp focus of activity in the parahippocampal gyrus during landmark recognition
4. The right inferior frontal cortex
   - Associated with visual memories
   - Efferently stimulates V1 when active
   - Strongly lateralized
Area V8 (Lateral Occipital Cortex)
Short-term repetition studies suggest V8 computes edge-based features:
- Equal amounts of suppression for image/image, image/edge, edge/image, or edge/edge pairs
Psychological studies suggest that recognition is sensitive to the disruption of "non-accidental" features:
1. Colinearity
2. Parallelism (translational symmetry)
3. Reflection (anti-symmetry)
4. Co-termination (end-points near)
5. Constant curvature
Diffuse response suggests population coding
An LOC Hypothesis
Evidence:
- Diffuse responses are consistent with population codes
- Fits psychological models of LOC as feature extraction
- Explains repetition suppression effects in V8
- Explains non-classical receptive field responses in V1 (assuming efferent feedback to early vision)

Area V8 detects non-accidental edge relations through parameter-space voting schemes (e.g. Hough spaces). Other LOC areas use voting schemes to summarize other features, e.g. color histograms in area V4/V7. Together, LOC areas create a high-dimensional but distributed feature representation.
Inferotemporal Cortex (IT)
- Diffusely active in fMRI during all types of object recognition
- Last visual processing stage before memories
- Distributed responses to objects (Tsunoda et al.):
[Figure: test stimuli and hot spots, versus control, shown at different levels of statistical significance]
Inferotemporal Cortex (continued)
- Hot spots overlap, and aren't contiguous (population code)
- Some stimuli yield greater total responses; responses overlap
- Always some response; minimal effect of stimulus intensity
- Figure A is a control: hot spots from 3 different objects
- Figure B: red spots respond to the whole cat; a subset of spots (blue) respond to just the head; a subset of that responds to a silhouette of the head (yellow)
  - Implication: part-based features
- Figure C: blue spot responds to the whole object, but not to a simplification; some red spots respond only to the simplified version
  - Implication: a more complex scenario, in which some feature responses are turned off by the whole object (competition?)
An IT Hypothesis
- Repetition suppression effects are strongest in IT
- Single-cell recording studies show that IT cells respond to multiple features (e.g. color + shape)
- Simpler organizations (e.g. part/subpart hierarchies, "view maps") are not supported by single-cell recording data

Repetition suppression in inferotemporal cortex implements unsupervised feature-space segmentation, thus categorizing attention windows.
Expert Object Recognition
Expert object recognition applies when:
- The viewer is very familiar with the target object
- The illumination and viewpoint are familiar
- The target is recognized at both a categorical & sub-categorical level
- Example: human faces
  - Sub-categories: expression, age, gender
Expert recognition properties include:
- Fine sub-categorical discrimination, increased recognition speed
- Equal response times for category/sub-category
- Inability to dissociate categorical & sub-categorical recognition
- Trainable: everyone is expert at recognizing faces and chairs; dog-show judges are expert at dogs; subjects can be trained to be expert with Greebles
Expert Object Recognition (II)
Anatomically, expert object recognition is distinguished by:
1. (fMRI) Activation of early vision, LOC & IT
   - All forms of recognition do this
2. (fMRI) Sharp centers of activation in the fusiform gyrus (in IT) and the right inferior frontal lobe
3. (ERP) The N170 signal (170 ms post-stimulus)
An Expert Recognition Hypothesis
Evidence:
1. Expert recognition is illumination & viewpoint dependent
2. It activates RIFL, which creates mental images & can activate the image buffers in V1

Expert object recognition is appearance-based, matching the current stimulus to previous memories. When a category becomes familiar, the fusiform gyrus is recruited to build a manifold representation of the samples. Sub-categorical properties are encoded in the manifold dimensions.
An End-to-end computational model
(1) Bottom-up spatial selective attention:
- Multi-scale maps for intensity, colors, edges (V1)
- Difference-of-Gaussian (on-center/off-surround) filtering to find impulses
- Select peaks in x, y, scale as attention windows
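The step-1 pipeline can be sketched in a few lines of numpy. This is a minimal single-channel, single-scale illustration (the full model uses multi-scale intensity, color, and edge maps); the function names are hypothetical, and greedy peak suppression stands in for inhibition of return.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur using only numpy (reflect padding)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.pad(img, radius, mode='reflect')
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)
    return out[radius:-radius, radius:-radius]

def dog_saliency(img, sigma_center=1.0, sigma_surround=4.0):
    """On-center/off-surround response: difference of two Gaussian blurs."""
    return gaussian_blur(img, sigma_center) - gaussian_blur(img, sigma_surround)

def select_attention_windows(saliency, n_windows=3, window=15):
    """Greedy peak selection: take the strongest peak, suppress its
    neighborhood (a crude inhibition of return), repeat."""
    s = saliency.copy()
    half = window // 2
    centers = []
    for _ in range(n_windows):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        centers.append((y, x))
        s[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1] = -np.inf
    return centers
```

A bright impulse in an otherwise uniform image produces a saliency peak at its location, which becomes the first attention window.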
Step 1 Issues
Issues with step #1:
- More information channels:
  - Motion (Trent Williams found this is hard)
  - Disparity
- Inhibition of return
- Top-down control:
  - Integration of predictions (predictive attention)
  - Split attention?
Note: attention windows do not correspond to objects. They are just interesting parts of the image (but repeatability is key).
Step 2: Feature Extraction
(2) Attention windows are converted into fixed-length sparse feature vectors by parameter-space voting techniques:
- V8 is modeled with multiple non-accidental features:
  - Hough space for colinearity
  - Hough space of axes of reflection for anti-symmetry and co-termination
- V4 is modeled as a color histogram
- Simplest feature: low-resolution pixels
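The two simplest feature channels, a V4-style color histogram and low-resolution pixels, can be sketched directly. A minimal numpy illustration with hypothetical helper names, assuming an H×W×3 RGB window with values in [0, 1]:

```python
import numpy as np

def color_histogram(window, bins=4):
    """V4 analogue: coarse joint RGB histogram of the attention window,
    normalized so window size does not matter."""
    idx = np.clip((window * bins).astype(int), 0, bins - 1)
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins**3).astype(float)
    return hist / hist.sum()

def lowres_pixels(window, size=8):
    """Simplest feature: the window reduced to a fixed size x size
    grayscale grid by sampling."""
    gray = window.mean(axis=2)
    ys = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(ys, xs)].ravel()

def feature_vector(window):
    """Concatenate per-channel features into one fixed-length vector,
    the input to the step-3 segmentation."""
    return np.concatenate([color_histogram(window), lowres_pixels(window)])
```

Because every channel has a fixed length, windows of different sizes still yield comparable feature vectors.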
Step 2 Examples
[Figure: source attention window shown in image space and in Hough space]
- Collinearity: edges vote in Hough space for positions and orientations of lines
- Reflection (symmetry & vertices): pairs of edges vote for the axes of reflection that map one onto the other (if any)
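The collinearity vote can be sketched as follows. This is a simplified illustration, not the model's actual binning: because each edge element carries an orientation, it casts a single vote for the line through it, so collinear elements accumulate in one (rho, theta) bin.

```python
import numpy as np

def hough_line_votes(edge_points, orientations, n_rho=32, n_theta=16, diag=64.0):
    """Accumulate votes in (rho, theta) line space.
    edge_points: list of (y, x); orientations: edge direction in radians.
    rho = x*cos(theta) + y*sin(theta), with theta the line-normal angle."""
    votes = np.zeros((n_rho, n_theta))
    for (y, x), ang in zip(edge_points, orientations):
        theta = (ang + np.pi / 2) % np.pi            # normal to the edge direction
        rho = x * np.cos(theta) + y * np.sin(theta)
        ti = int(theta / np.pi * n_theta) % n_theta
        ri = int((rho + diag) / (2 * diag) * n_rho)  # rho assumed in [-diag, diag]
        if 0 <= ri < n_rho:
            votes[ri, ti] += 1
    return votes
```

The height of the tallest peak measures how many edge elements lie on a common line, i.e. the strength of the collinearity feature.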
Step 2 Issues
Missing features:
- Constant curvature (V8)
- Apparent-color-corrected histograms (V4)
- Disparity features
Huge parameter space:
- How to evaluate features without supervision?
Step 3: Feature Space Segmentation
(3) IT is modeled as O(1) unsupervised segmentation:
- The features extracted in step #2 are concatenated to form a single, high-dimensional representation
- A 1-level neural net is trained to segment the samples:
  - If a neuron responds < 0.5 to a sample, give it a training signal of 0 for that sample
  - If a neuron responds > 0.5, give a training signal of 1.0
  - Note that every neuron is trained independently; there is no communication among them
- The response of IT to a sample is the vector of binarized neural responses
  - Each pattern of responses is a region in feature space
Step 3 issues
Stability:
- If neurons keep adapting, then region codes change
- Linear neurons imply non-local interactions; radial basis neurons should perform better
Evaluation: what makes one categorization better than another?
- No supervised training data
- Number and size of categories vary
- Gabe Salazar is cutting his teeth on this one…
Top-down predictions:
- Can we predict a category, and use it to influence steps 1 & 2?
Steps 4 & 5 (unimplemented)
(4) Create a sub-space manifold to describe samples in crowded regions:
- PCA subspaces are a first approximation
- Locally linear embedding manifolds are better
- Sub-categories should correspond to manifold dimensions
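The PCA first approximation from step 4 can be sketched with numpy's SVD. The function names are hypothetical, and in the full model a locally linear embedding would replace this linear fit:

```python
import numpy as np

def pca_subspace(samples, k=2):
    """Fit a k-dimensional PCA subspace to the samples that landed in one
    crowded feature-space region. Returns the region mean and top-k axes."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # SVD of the centered data: rows of vt are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def manifold_coords(x, mean, axes):
    """Sub-categorical description of a sample: its coordinates along
    the manifold dimensions."""
    return axes @ (x - mean)
```

For samples that really do lie on a k-dimensional subspace, the manifold coordinates reconstruct each sample exactly from the mean and axes.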
(5) Associative memory:
- Associate attention windows with:
  - Other attention windows (to generate predictions)
  - Other modalities (e.g. language)
Adele Howe and I have a joint interest in this last point.