Feed-forward deep neural networks diverge from primates on core visual object recognition behavior

Rishi Rajalingham*, Elias B. Issa1*, Kailyn Schmidt, Kohitij Kar, Pouya Bashivan, and James J. DiCarlo
Dept. of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT
1Department of Neuroscience, Zuckerman Mind Brain Behavior Institute, Columbia University
[Figure: Consistency to Human Pool for the object-level metrics (O1, O2). Bars: V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, NYU. Reference lines: Human Pool, Monkey Pool (n=5 subjects), Heldout Human Pool (n=4 subjects); shaded Primate Zone. Example behavioral patterns shown for Inception-v3.]
[Figure: Label probability (0 to 1) for one example image, compared across Human, Monkey, and HCNN (Inception-v3); Human Pool and Monkey Pool (n=5) shown for reference.]
O2: Performance per object for each distracter (confusion matrix)
[Figure: Consistency to Human Pool (I1c) for V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, and NYU, with the Primate Zone, Monkey Pool (n=5 subjects), and Heldout Human Pool (n=4 subjects) marked.]
[Figure: Scatter plots of HCNN performance (Δd') and Monkey performance (Δd') versus Human performance (Δd'), n=240 images.]
I1c: Performance per image across distracters
Introduction

Humans can rapidly and accurately recognize objects in spite of variation in viewing parameters (e.g. position, pose, and size) and background conditions.

To understand this invariant object recognition ability, one can construct and test models that aim to accurately capture human behavior, including making the same patterns of errors over all images. Here, we applied this straightforward visual "Turing test" to the leading feed-forward computational models of human vision (hierarchical convolutional neural networks, HCNNs) and to a leading animal model (rhesus macaques).

Our results reveal a failure of current HCNNs to fully replicate the image-level behavioral patterns of primates.

Methods
Behavioral paradigm: We tested object recognition performance on a two-alternative forced choice (2AFC) match-to-sample task using brief (100ms) presentations of naturalistic synthetic images of basic-level objects.
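In a 2AFC match-to-sample design, each trial pairs a sample object with one distracter, and the subject's choice either matches the sample (a hit for that pair) or not. A minimal sketch of tallying per-pair hit rates from raw trial records (the trial tuples here are hypothetical illustrations, not the actual dataset):

```python
import collections

# Hypothetical trial records: (sample_object, distracter_object, chosen_object)
trials = [
    ("dog", "camel", "dog"),
    ("dog", "camel", "camel"),
    ("camel", "dog", "camel"),
    ("camel", "dog", "camel"),
]

# Tally 2AFC outcomes per (sample, distracter) pair: a trial counts as
# a hit when the subject chose the sample object over the distracter.
hits = collections.Counter()
counts = collections.Counter()
for sample, distracter, choice in trials:
    counts[(sample, distracter)] += 1
    if choice == sample:
        hits[(sample, distracter)] += 1

hit_rate = {pair: hits[pair] / counts[pair] for pair in counts}
```

Aggregating such pair-wise rates over images or objects is what feeds the behavioral metrics below.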
[Apparatus schematic: Android tablet, juice reward tube, Arduino controller]
High-throughput psychophysics: Over a million behavioral trials across hundreds of humans and five monkeys were aggregated to characterize each species on each image. To collect such large behavioral datasets, we used Amazon Mechanical Turk for humans and a novel home-cage touchscreen system for monkeys.
Stimuli: To enforce invariant object recognition behavior, we generated several thousand naturalistic synthetic images (2400 images, 24 objects, 100 images/object), each with one foreground object (by rendering a 3D model of each object with random viewing parameters) on a random natural image background.
[Task schematic: Fixation (100ms) → Test Image (100ms) → Choice Screen (1000ms); example HARD and EASY images shown]
[Data collection platforms: Amazon Mechanical Turk (humans); "MonkeyTurk" home-cage touchscreen (monkeys)]
Objects (24): Tank, Bird, Shorts, Elephant, House, Hanger, Pen, Hammer, Zebra, Fork, Guitar, Bear, Watch, Dog, Spider, Truck, Wrench, Rhino, Leg, Camel, Table, Calculator, Gun, Knife
O1: Performance per object across distracters
Results
[Figure: Consistency to Human Pool (I2c) for V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, and NYU, with the Primate Zone and Human Pool / Monkey Pool (n=5) references; example I2c pattern shown for Inception-v3.]
I2c: Performance per image for each distracter (confusion matrix)
Conclusions
Behavioral metrics: We characterized the core object recognition behavior of humans, macaques and HCNNs over a large set of images and distracter objects using several behavioral metrics. Each behavioral metric computes a pattern of unbiased behavioral performances, using a sensitivity index (d'), at the resolution of objects (O1, O2) and images (I1, I2). To isolate image-driven behavioral variance that is object independent, we normalized image-level behavioral metrics by subtracting the mean performance values over all images of the same object (I1c, I2c).
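The two computations above can be sketched directly: the sensitivity index d' is the difference of z-scored hit and false-alarm rates, and the I1c/I2c normalization subtracts each object's mean image-level performance. A minimal sketch (function names and the clipping bounds are illustrative assumptions, not the exact pipeline):

```python
import numpy as np
from scipy.stats import norm

def dprime(hit_rate, fa_rate, clip=5.0):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate)."""
    # Clip rates away from 0 and 1 to avoid infinite z-scores.
    hr = np.clip(hit_rate, 0.01, 0.99)
    fa = np.clip(fa_rate, 0.01, 0.99)
    return float(np.clip(norm.ppf(hr) - norm.ppf(fa), -clip, clip))

def normalize_image_metric(i1, object_of_image):
    """I1c: subtract the mean d' over all images of the same object,
    leaving only object-independent, image-driven variance."""
    i1 = np.asarray(i1, dtype=float)
    obj = np.asarray(object_of_image)
    i1c = i1.copy()
    for o in np.unique(obj):
        mask = obj == o
        i1c[mask] -= i1[mask].mean()
    return i1c
```

The same subtraction applied to per-image, per-distracter matrices yields I2c.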
Each object can produce myriad images under variations in viewing parameters, with some images being more challenging than others.
Consistent with previous work, macaque monkeys and all tested HCNN models are highly consistent with humans with respect to object-level metrics (O1, O2), while a lower level representation (V1) is less consistent.
[Figure: Normalized image-level performance (Δd') for the Human Pool and Monkey Pool versus image attributes (eccentricity, size, pose difference, mean luminance, BG spatial frequency, segmentation index); regression weights for each attribute.]
All tested HCNN models fail to accurately capture the image-level behavioral patterns of primates.
In contrast to the corresponding object-level metric (O1), all tested HCNNs diverge from primates with respect to the one-versus-all image-level behavioral pattern (I1c).
Zooming in on one example image of a pen, humans and monkeys make similar confusions (choices in object labels), while the HCNN model differs significantly.

Example image with corresponding behavioral patterns

(Left, top row) For each behavioral metric, we defined a "primate zone" as the range of consistency values delimited by extrapolated estimates of the consistency to the human pool of a large (i.e. infinitely many subjects) pool of rhesus macaque monkeys and of humans, respectively.

Thus, the primate zone defines a range of consistency values that correspond to models that accurately capture the behavior of the human pool, at least as well as a monkey pool.

(Left) Consistency at the image level could not be easily rescued by pooling together several "model subjects" (top panels) or by simple modifications (bottom panels), such as primate-like retinal input sampling or additional model training on synthetic images generated in the same manner as our test set.

(Right) Additionally, the gap in consistency could not be attributed to simple image attributes (e.g. viewpoint meta-parameters, low-level image characteristics).
[Figure: Consistency to Human Pool versus # subjects in pool, for the Monkey Pool and an HCNN pool, for the object-level metric (O2) and the image-level metric (I2c); Consistency to Human Pool versus model performance (d') for Inception-v3, early HCNN layers, a retina-sampled HCNN, and a fine-tuned HCNN, with the Primate Zone marked.]
The difference in image-level behavioral pattern between HCNNs and primates is not strongly dependent on image attributes.

• In recent years, specific HCNN models, optimized to match human-level object recognition performance, have proven to be a dramatic advance in the state of the art of artificial vision systems.
• HCNN models match primates in their core object recognition behavior with respect to object-level difficulty and object-level confusion patterns.
• HCNN models fail to match primate behavior in image-level difficulty and fail to match image-level confusion patterns.
• Attempts to rescue the image-level gap in consistency, including adding retinal sampling on the model front-end, adding various decoders on the model back-end, and supplementing model training with similar images, did not reduce the gap between models and primates.
• Furthermore, no particular property of the image (luminance, spatial frequency) or of the object in the image (size, pose) was strongly correlated with model failure to capture primate image-level performance.

Taken together, these results suggest that high-resolution behavior could serve as a strong constraint for discovering models that more precisely capture the neural mechanisms underlying primate object recognition.
To measure the behavioral consistency between two systems with respect to a given behavioral metric, we computed a noise-adjusted correlation, which accounts for the reliability (i.e. internal consistency) of behavioral measurements.
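A common form of noise-adjusted correlation divides the raw correlation between two behavioral patterns by the geometric mean of each system's reliability, where reliability can be estimated from Spearman-Brown-corrected split-half correlations. A minimal sketch under those assumptions (the exact estimator used on the poster may differ):

```python
import numpy as np

def split_half_reliability(half_a, half_b):
    """Internal consistency: correlation between metric estimates from
    two disjoint halves of the trials, Spearman-Brown corrected to the
    reliability of the full dataset."""
    r = np.corrcoef(half_a, half_b)[0, 1]
    return 2.0 * r / (1.0 + r)

def noise_adjusted_correlation(x, y, rel_x, rel_y):
    """Consistency between two systems' behavioral patterns (e.g. I1c
    vectors), normalized by the geometric mean of their reliabilities
    so that a perfectly consistent but noisy system can still reach 1."""
    r = np.corrcoef(x, y)[0, 1]
    return r / np.sqrt(rel_x * rel_y)
```

With this normalization, a model whose raw correlation to the human pool equals the measurement ceiling scores a consistency of 1.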
In contrast to the corresponding object-level metric (O2), all tested HCNNs diverge from primates with respect to the one-versus-other image-level behavioral pattern (I2c).