1
Probabilistic Models of Cortical Computation
Rajesh P. N. Rao
Dept. of Computer Sci. and Engineering & Neurobio. and Behavior Program
University of Washington, Seattle, WA
Lab website: http://neural.cs.washington.edu
November, 2004
Funding: Sloan Foundation, Packard Foundation, ONR, and NSF
2
Why Consider Probabilistic Models?Computational Reasons
Sensory measurements are typically ambiguous, e.g., the projection from 3D to 2D in vision
Biological sensors and processing elements are noisy
Animal’s knowledge of the world is usually incomplete
There appears to be a need to represent, learn, and reason about probabilities
3
Example 1: Ambiguity of Stimuli
Is it an oval-shaped or a circular object?
Retinal Image
Eye
Eye
4
Bayesian Model: The Likelihood Function
(From Geisler & Kersten, 2002)
Retinal Image I
Likelihood = P(I | Slant, Aspect ratio)
5
Bayesian Model: The Posterior
(From Geisler & Kersten, 2002)
Posterior = Likelihood × Prior × k
(k = normalization constant)
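This multiply-and-normalize rule can be sketched in a few lines of Python. The slant grid and the likelihood/prior values below are made up for illustration, not taken from Geisler & Kersten (2002):

```python
import numpy as np

# Hypothetical discrete grid of surface slants (degrees) with made-up values.
slants = np.array([0.0, 20.0, 40.0, 60.0])
likelihood = np.array([0.10, 0.25, 0.50, 0.15])   # P(I | slant), illustrative
prior      = np.array([0.40, 0.30, 0.20, 0.10])   # P(slant), illustrative

unnormalized = likelihood * prior                 # Likelihood x Prior
posterior = unnormalized / unnormalized.sum()     # divide by k = P(I)

map_slant = slants[np.argmax(posterior)]          # MAP estimate
```

Note how a strong prior for shallow slants can pull the posterior away from the likelihood peak; here the likelihood still wins.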
6
What is this image depicting?
Example 2: Noise and Incomplete Knowledge
7
Bayesian Model
Likelihood: P(I | θ)
Prior probability: P(θ)
Posterior probability: P(θ | I) = P(I | θ) P(θ) / P(I)
Input Image
dog … street
??? (Bayesian decision)
sample
Okinawa beach
8
Bayesian Model with “Top-Down” Bias
Likelihood: P(I | θ)
Prior probability: P(θ)
Posterior probability: P(θ | I) = P(I | θ) P(θ) / P(I)
Input Image
dog
“Dog” (Bayesian decision)
sample
dog … street … Okinawa beach
9
Psychophysical Evidence for Bayesian Perception
Motion from cast shadows (Kersten et al., 1996)
Surface perception based on texture (Knill, 1998)
Inferring 3D shape from 2D images (Mamassian et al., 2002)
Color perception (Bloj et al., 1999)
Cue combination for depth perception (Jacobs, 2002)
Motion illusions (Weiss et al., 2002)
Motor Control (Körding and Wolpert, 2004)
10
Other Results: Contextual Modulation in V1
(Zipser et al., 1996 )
11
Attentional Modulation in V2 and V4
(Reynolds et al., 1999)
12
Decision Neurons in Areas LIP and FEF
t (ms)
(Roitman and Shadlen, 2002)
13
Rev. Thomas Bayes (1702-1761)
Can a network of neurons perform Bayesian inference?
• How is prior knowledge about the world (prior probabilities and likelihoods) stored in a network?
• How are posterior probabilities of states computed?
14
Generative Models for Bayesian Inference
Fundamental Idea: Inputs received by an organism are caused by external “states” of the world (hidden “causes”)
Goal: Estimate the probability of these causes (or states or “interpretations”) based on the inputs received thus far
15
Example: Linear Generative Models
16
Linear Generative Model
Spatial Generative Model: I(t) = U r(t) + n(t)
r(t) = representation vector, n = zero-mean Gaussian white noise with covariance Σ
Temporal Dynamics for Time-Varying Processes: r(t) = V r(t-1) + m(t-1)
V = transition matrix, m = zero-mean Gaussian white noise with covariance Σ_m
Goal: Find the optimal representation vector r(t) given inputs I(t), I(t-1), …, I(1).
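The generative direction of this model is easy to make concrete: sample a state trajectory r(t) and the images I(t) it generates. A minimal sketch with toy dimensions and noise levels (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): an 8-pixel "image" generated from a 3-dim code.
n_pix, n_rep = 8, 3
U = rng.standard_normal((n_pix, n_rep))      # generative (basis) matrix
V = 0.9 * np.eye(n_rep)                      # state transition matrix
sigma_n, sigma_m = 0.1, 0.05                 # isotropic noise std devs

T = 5
r = np.zeros((T, n_rep))
I = np.zeros((T, n_pix))
r[0] = rng.standard_normal(n_rep)
for t in range(T):
    if t > 0:
        # Temporal dynamics: r(t) = V r(t-1) + m(t-1)
        r[t] = V @ r[t - 1] + sigma_m * rng.standard_normal(n_rep)
    # Spatial model: I(t) = U r(t) + n(t)
    I[t] = U @ r[t] + sigma_n * rng.standard_normal(n_pix)
```

Inference (the goal stated above) runs this model in reverse: recover r(t) from the noisy I(t), which is what the Kalman filter on the next slides does.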
17
Optimization Functions
Find optimal r(t) by Minimizing Prediction Errors for all t:

E = (I - Ur)^T (I - Ur) + (r - r̄)^T (r - r̄)

r̄ = mean of r before measurement of I

Generalize to a Weighted Least Squares Function:

E = (I - Ur)^T Σ^{-1} (I - Ur) + (r - r̄)^T M^{-1} (r - r̄)

M = covariance of r before measurement of I
18
Minimizing E = Maximizing Posterior Probability
Minimizing E is equivalent to maximizing log P(r | I), which is equivalent to maximizing the posterior probability P(r | I):

log P(r | I) = log P(I | r) + log P(r) + k
             = -(I - Ur)^T Σ^{-1} (I - Ur) - (r - r̄)^T M^{-1} (r - r̄) + k
             = -E + k
19
Optimal Estimation and Kalman Filtering
Setting dE/dr = 0 and solving for the optimal r yields the Kalman Filter:
K(t) = “Kalman gain” matrix = N(t) U^T Σ^{-1}
N(t) = covariance of r after measurement of I(t) = (U^T Σ^{-1} U + M(t)^{-1})^{-1}
M(t) = V N(t-1) V^T + Σ_m

r̄(t) = V r̂(t-1)                          (prediction)
r̂(t) = r̄(t) + K(t) (I(t) - U r̄(t))       (correction)
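These update equations translate directly into code. A minimal Python sketch with toy dimensions, isotropic noise covariances, and identity dynamics so the filter tracks a static hidden state (all sizes and values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: I(t) = U r(t) + n,  r(t) = V r(t-1) + m
n_pix, n_rep = 4, 2
U = rng.standard_normal((n_pix, n_rep))
V = np.eye(n_rep)                 # identity dynamics: (nearly) static state
Sigma   = 0.1  * np.eye(n_pix)    # measurement noise covariance
Sigma_m = 0.01 * np.eye(n_rep)    # process noise covariance

def kalman_step(r_hat, N_prev, I_t):
    """One predict/correct cycle of the filter above."""
    r_bar = V @ r_hat                          # r̄(t) = V r̂(t-1)
    M = V @ N_prev @ V.T + Sigma_m             # M(t) = V N(t-1) Vᵀ + Σ_m
    Si = np.linalg.inv(Sigma)
    N = np.linalg.inv(U.T @ Si @ U + np.linalg.inv(M))  # posterior covariance
    K = N @ U.T @ Si                           # K(t) = N(t) Uᵀ Σ⁻¹
    r_new = r_bar + K @ (I_t - U @ r_bar)      # correct with prediction error
    return r_new, N

# Recover a fixed hidden state from a stream of noisy "images".
r_true = np.array([1.0, -0.5])
r_hat, N = np.zeros(n_rep), np.eye(n_rep)
for _ in range(50):
    I_t = U @ r_true + 0.1 * rng.standard_normal(n_pix)
    r_hat, N = kalman_step(r_hat, N, I_t)
```

After enough measurements the estimate r̂ settles near the true state, with N quantifying the remaining uncertainty.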
20
A Simplified Kalman Filter
If Σ is diagonal and equal to σ²·1, then K(t) = (N(t)/σ²) U^T = G(t) U^T
Kalman filter equation is of the form:
New Estimate = Prediction + Gain x Prediction Error
UT = Feedforward Matrix
U = Feedback Matrix
V = Recurrent Matrix (Lateral Connections)
r̄(t) = V r̂(t-1)                              (Prediction)
r̂(t) = r̄(t) + G(t) U^T (I(t) - U r̄(t))
21
Neural Implementation via Predictive Coding
(Rao & Ballard, 1997,1999; Rao, 1999)
Predictive Coding Model:
Feedback = Prediction
Feedforward = Prediction Error
22
Clues from Cortical Anatomy
HigherArea
LowerArea
23
Hierarchical Organization of the Visual Cortex
Lower
Higher
24
Hierarchical Generative Model (Rao & Ballard, 1999)
Original Generative Model: I = U r + n
Hierarchical Generalization: r = U_h r_h + n_h
r_h = representation at a higher level
With Temporal Dynamics: r(t) = V r(t-1) + U_h r_h(t-1) + m(t-1)
Can derive Kalman filter equations for each level; yields a Hierarchical Model for Predictive Coding
[Diagram: I is generated from r, which in turn is generated from r_h]
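The two-level inference can be sketched without the full Kalman machinery by doing gradient descent on the stacked prediction errors (a static, unit-noise-variance simplification of the model above; all sizes, weights, and the learning rate are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: 16-pixel image -> 6-dim level-1 code r -> 3-dim r_h.
U  = rng.standard_normal((16, 6)) * 0.5   # level-1 generative weights
Uh = rng.standard_normal((6, 3)) * 0.5    # level-2 generative weights
I  = rng.standard_normal(16)              # input "image"

r, rh = np.zeros(6), np.zeros(3)
lr = 0.05
for _ in range(500):
    e0 = I - U @ r        # prediction error at the input level
    e1 = r - Uh @ rh      # error between level-1 code and top-down prediction
    # Gradient descent on E = |e0|^2 + |e1|^2 (unit noise variances assumed):
    # r is pushed by the bottom-up error and pulled by the top-down prediction.
    r  += lr * (U.T @ e0 - e1)
    rh += lr * (Uh.T @ e1)

# Inference should reduce the total prediction error relative to r = rh = 0.
E_final = np.sum((I - U @ r) ** 2) + np.sum((r - Uh @ rh) ** 2)
E_zero  = np.sum(I ** 2)
```

The structure of the update mirrors the predictive coding claim: each level receives an error from below (U^T e0) and a prediction from above (Uh rh, entering through e1).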
25
Hierarchical Predictive Coding Model
[Diagram: hierarchical predictive coding network; input I is predicted as U r at the lower level, and r is predicted from above as U_h r_h]
(Rao & Ballard, 1997, 1999)
26
The Predictive Coding Hypothesis
Feedback connections from higher areas convey predictions of expected activity in lower areas
Feedforward connections convey the errors between actual and predicted responses
Model Prediction
Since feedforward connections to higher areas originate from layer 2+3, responses of layer 2+3 neurons should be
interpretable as prediction errors
27
Results from the Classic Studies of Hubel and Wiesel (1960s)
28
“Endstopping” in Cortical Neurons
29
Contextual Modulation in Visual Cortex
(Zipser et al., 1996 )
30
Example Network for Predictive Coding
31
Natural Images used for Training
32
Synaptic Weights after Learning
33
Endstopping as a Predictive Error Signal
34
Comparison with Layer 2+3 Cortical Neuron
35
Why Does Endstopping Occur in the Model?
Orientation-Dependent Correlations in Natural Images
36
Other Contextual Effects in the Model
37
Support for Predictive Coding from an Imaging Study
(Murray et al., 2002)
38
Predictive Coding in the Retina
From:Nicholls et al., 1992
Response of a retinal ganglion cell can be interpreted as the difference (error) between center pixel values and their prediction based on surrounding pixels (Srinivasan et al., 1982)
Receptive Fields:
On-center off-surround (+ center, − surround)
Off-center on-surround (− center, + surround)
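The center-minus-surround idea can be shown in a toy 1D "retina": each output is the center pixel minus a prediction formed from its neighbors. The equal-weight surround used here is an illustrative simplification, not the fitted weights of Srinivasan et al. (1982):

```python
import numpy as np

def ganglion_response(pixels):
    """Prediction-error code: center pixel minus the mean of its neighbors."""
    pixels = np.asarray(pixels, dtype=float)
    center = pixels[1:-1]
    surround_prediction = 0.5 * (pixels[:-2] + pixels[2:])
    return center - surround_prediction     # only the unpredicted part is sent

# A uniform patch is perfectly predicted -> zero response everywhere;
# an edge is unpredictable -> response concentrated at the discontinuity.
flat = ganglion_response([5, 5, 5, 5, 5])
edge = ganglion_response([0, 0, 0, 10, 10])
```

This is the efficiency argument in miniature: redundant (predictable) structure is removed before transmission, so only surprises cost spikes.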
39
Predictive Coding in the LGN
Temporal Receptive Field of LGN X-cell
From:Dan et al., 1996
LGN cell responses
Response of LGN cell can be interpreted as the difference (error) between current pixel values and their prediction based on past pixel values
40
Summary for Part I
Computational and experimental studies point to the need for probabilistic models of brain function
Probabilistic models typically rely on generative models of sensory (and motor) processes
We examined a simple linear generative model and its hierarchical generalization: Bayesian inference via Kalman filtering; neural implementation allows Hierarchical Predictive Coding
Feedback connections convey predictions; feedforward connections convey errors in prediction
Hierarchical predictive coding explains endstopping and other contextual surround effects based on natural image statistics
41
Break
Questions to Ponder over:
1. Can we go beyond linear generative models and Gaussian distributions?
2. Can a neural population encode an entire probability distribution rather than simply the mean or mode?
42
Generative Models II: Graphical Models
Graphical models depict the generative process as a graph:
Nodes denote random variables (states)
Edges denote dependencies
Example: If states are continuous, linear generative model: I = Ur + n
P(I | r) = N(I; Ur, Σ)
r
I
Earthquake Burglar
Radio Alarm
43
Continuous versus Discrete States
[Plots: a unimodal distribution (e.g., Normal N(x; μ, σ)), a multimodal distribution, and its discrete approximation over discrete states 1, …, i, …, M]
44
The Belief Propagation Algorithm
If states are discrete, probabilities of random variables can be calculated through “belief propagation” (Pearl, 1988):
Each node j sends a “message” (probability density) to every neighbor i
The message to neighbor i depends on the messages received from all other neighbors

m_{ji}(x_i) = Σ_{x_j} ψ(x_i, x_j) φ_j(x_j) Π_{k ∈ N(j)\i} m_{kj}(x_j)
Earthquake Burglar
Radio Alarm
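The message equation can be exercised on a toy tree. Rather than the Earthquake/Burglar network, this sketch uses a three-node chain of binary variables with made-up potentials, and checks the belief-propagation marginal against brute-force enumeration:

```python
import numpy as np

# Chain x1 - x2 - x3 of binary variables.
# psi(a, b) is the (symmetric) pairwise compatibility on both edges;
# phi_k is the local evidence at node k. All numbers are illustrative.
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])
phi1 = np.array([0.7, 0.3])
phi2 = np.array([0.5, 0.5])
phi3 = np.array([0.4, 0.6])

# m_{j->i}(x_i) = sum_{x_j} psi(x_i, x_j) phi_j(x_j) * (incoming messages)
m12 = psi @ phi1               # x1 has no other neighbors
m23 = psi @ (phi2 * m12)       # x2 folds in the message from x1
belief3 = phi3 * m23
belief3 /= belief3.sum()       # marginal P(x3)

# Brute-force marginal for comparison.
p = np.zeros(2)
for a in range(2):
    for b in range(2):
        for c in range(2):
            p[c] += phi1[a] * phi2[b] * phi3[c] * psi[a, b] * psi[b, c]
p /= p.sum()
```

On trees like this, belief propagation is exact; the two marginals agree.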
45
An Example: Hidden Markov Models (HMMs)
A Simple but Powerful Graphical Model for Temporal Data: The observed world can be in one of M states θ1, θ2, …, θM
The state θ_t at time step t depends only on the previous state θ_{t-1} and is given by the transition probabilities:
P(θ_t = i | θ_{t-1} = j)   (written P(θ_t^i | θ_{t-1}^j) for convenience)
The input I_t at time t is given by P(I_t | θ_t = j)
Graphical Model for an HMM:
[Diagram: states θ_{t-2} → θ_{t-1} → θ_t (top row), inputs I_{t-2}, I_{t-1}, I_t (bottom row)]
46
Inference in HMMs
P(θ_t^i, I_t | I_{t-1}, …, I_1) = P(I_t | θ_t^i) P(θ_t^i | I_{t-1}, …, I_1)
                                = P(I_t | θ_t^i) Σ_j P(θ_t^i | θ_{t-1}^j) P(θ_{t-1}^j, I_{t-1} | I_{t-2}, …, I_1)

Likelihood of i at time t × Prediction for i at time t
[Diagram: states θ_{t-2} → θ_{t-1} → θ_t, inputs I_{t-2}, I_{t-1}, I_t]
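The forward recursion above runs in a few lines: predict with the transition matrix, weight by the likelihood, normalize. The transition matrix, likelihoods, and observation sequence are toy values chosen for illustration:

```python
import numpy as np

A = np.array([[0.8, 0.2],        # A[i, j] = P(theta_t = i | theta_{t-1} = j)
              [0.2, 0.8]])

def likelihood(obs):
    """P(I_t | theta_t = i) for a binary observation (toy values)."""
    return np.array([0.9, 0.2]) if obs == 1 else np.array([0.1, 0.8])

observations = [1, 1, 0, 1]
posterior = np.full(2, 0.5)          # uniform initial belief
for obs in observations:
    prediction = A @ posterior                # sum_j P(i | j) P(j | past)
    joint = likelihood(obs) * prediction      # likelihood x prediction
    posterior = joint / joint.sum()           # P(theta_t = i | I_1 .. I_t)
```

State 0 explains most of the observations here, so the filtered posterior ends up favoring it.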
47
Equivalence to Belief Propagation for HMMs
Equivalent to on-line (“forward”) belief propagation through time:

m_{ji}(x_i) = Σ_{x_j} ψ(x_i, x_j) φ_j(x_j) Π_{k ∈ N(j)\i} m_{kj}(x_j)

[Diagram: states θ_{t-2} → θ_{t-1} → θ_t, inputs I_{t-2}, I_{t-1}, I_t]

m^i_{t,t+1} = P(I_t | θ_t^i) Σ_j P(θ_t^i | θ_{t-1}^j) m^j_{t-1,t}
48
Can a network of neurons perform this computation?
m^i_{t,t+1} = P(I_t | θ_t^i) Σ_j P(θ_t^i | θ_{t-1}^j) m^j_{t-1,t}
49
Recurrent Network Model
Leaky Integrator Equation for Output Firing Rate v:

dv/dt = -v + W I + R v
(output decay + feedforward input W I + recurrent feedback R v)

W = feedforward synaptic weights, R = recurrent synaptic weights, I = input
50
Discrete Implementation
v_i(t+1) = v_i(t) + (-v_i(t) + w_i I_t + Σ_j R_ij v_j(t))

i.e.   v_i(t+1) = w_i I_t + Σ_j r_ij v_j(t)
       (new activity = input + prior activity)
51
Can this equation implement Belief Propagation for HMMs?
Recurrent network:    v_i(t+1) = w_i I_t + Σ_j r_ij v_j(t)
Belief propagation:   m^i_{t,t+1} = P(I_t | θ_t^i) Σ_j P(θ_t^i | θ_{t-1}^j) m^j_{t-1,t}
52
Consider Belief Propagation in Log Domain
Equation for a recurrent network:   v_i(t+1) = w_i I_t + Σ_j r_ij v_j(t)

Belief propagation in the log domain:
log m^i_{t,t+1} = log P(I_t | θ_t^i) + log Σ_j P(θ_t^i | θ_{t-1}^j) m^j_{t-1,t}
53
Bayesian Inference in a Recurrent Network
Network can perform Bayesian inference using:

w_i I_t = log P(I_t | θ_t^i)                                    (log likelihood)
Σ_j r_ij v_j(t) ≈ log Σ_j P(θ_t^i | θ_{t-1}^j) m^j_{t-1,t}      (log prior/prediction)

so that   v_i(t+1) = w_i I_t + Σ_j r_ij v_j(t) = log m^i_{t,t+1}
(log posterior = log likelihood + log prior + normalization)
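A small numerical check of this correspondence. The sketch stores v_i = log m_i and computes the recurrent term log Σ_j P(i | j) exp(v_j) exactly; the network in the slides approximates this nonlinearity with linear recurrent weights Σ_j r_ij v_j. Transition and likelihood values are toy numbers:

```python
import numpy as np

A = np.array([[0.8, 0.2],        # P(theta_t = i | theta_{t-1} = j)
              [0.2, 0.8]])

def log_domain_step(v, log_lik):
    """One update of a 'network' whose activities are log posteriors."""
    recurrent = np.log(A @ np.exp(v))       # log prediction from past belief
    v_new = log_lik + recurrent             # feedforward + recurrent terms
    return v_new - np.log(np.sum(np.exp(v_new)))   # normalization

v = np.log(np.full(2, 0.5))                 # start from a uniform belief
log_liks = [np.log([0.9, 0.2]),             # evidence for state 0
            np.log([0.1, 0.8])]             # then evidence for state 1
for ll in log_liks:
    v = log_domain_step(v, ll)

posterior = np.exp(v)   # should match the standard forward algorithm
```

Exponentiating the activities recovers exactly the posterior the forward algorithm would compute on the same observations, which is the equivalence the slide asserts.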
54
Example 1: Orientation Discrimination Task
Feedforward weights w_i (= F(θ_i)): a set of 36 oriented filters spanning orientations θ_i = 0°, 5°, 10°, …, 175°
Transition probabilities P(θ_t = i | θ_{t-1} = j) = 1 if i = j, 0 otherwise
Input images = oriented edge plus additive Gaussian noise
t = 1 t = 2 t = 3 t = 4 t = 5 t = 6
…
55
Demo: Orientation Discrimination
Input Image Sequence
Log likelihood computed from Feedforward Weights
Posterior computed by the Network over time
Orientation Estimation: Pick the preferred orientation of the neuron with maximum response (Maximum a Posteriori (MAP) Estimation)
[Plot: Response vs. Neurons]
56
Example 2: Motion Detection Task
• The Task: Guess the direction of motion of the coherently moving dots (UP/DOWN or LEFT/RIGHT)
Coherence of the dots controls task difficulty; this task is widely used to study decision making in humans and monkeys (e.g., Shadlen and Newsome, 2001)
Example Stimuli: 5% coherence, 50% coherence
57
Network for Motion Detection
Let θ_ij encode (stimulus location i, motion direction j)
We can create a network for detection of 1D motion direction by selecting appropriate transition probabilities P(θ_ij | θ_kl)
[Diagram: rightward-selective chain P(θ_iR | θ_kR), leftward-selective chain P(θ_kL | θ_jL); input image feeds feedforward filters F(θ_i)]
58
Feedforward Weights
Spatial Location
F(θ_1) F(θ_2) … F(θ_15)
59
Recurrent Weights
Transition Probabilities (θ_{t-1} → θ_t)  →  Recurrent Weights

Recurrent weights r_ij chosen such that:  Σ_j r_ij log m_j ≈ log Σ_j P(x_i | x_j) m_j

[Matrix plot: recurrent weights from neuron j to neuron i, with rightward- and leftward-selective blocks]
60
Network Output for Moving Inputs
[Plots: network output for Rightward and Leftward moving inputs; rows show right-selective and left-selective neurons; panels show log likelihoods (log P(I_t | θ_t^i) + b), log posterior, and posterior]
61
Solving the Random Dots Task
Neurons in the network compute log posterior probabilities:
Random dots task: Need to decide whether majority of dots are moving Left or Right
Compute posterior probability of L and R by summing over all locations xi (marginalize over xi)
log P(x_i, L | I_t, …, I_1)   and   log P(x_i, R | I_t, …, I_1)

P(L | I_t, …, I_1) = Σ_i P(x_i, L | I_t, …, I_1)
P(R | I_t, …, I_1) = Σ_i P(x_i, R | I_t, …, I_1)

[Decision neurons L and R sum over the location-selective neurons]
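The marginalization step is a single sum once the joint posteriors are in hand. The joint log posteriors below are made-up numbers for five locations, purely for illustration:

```python
import numpy as np

# Hypothetical joint log posteriors log P(x_i, d | I_1..t) for 5 locations
# and two directions d in {L, R}; the values are invented for illustration.
log_joint_L = np.log([0.02, 0.05, 0.04, 0.03, 0.06])
log_joint_R = np.log([0.10, 0.20, 0.15, 0.25, 0.10])

# Decision "neurons" marginalize over location: P(d | I) = sum_i P(x_i, d | I)
P_L = np.exp(log_joint_L).sum()
P_R = np.exp(log_joint_R).sum()
decision = "R" if P_R > P_L else "L"
```

Note the exponentiation before summing: since the neurons carry log probabilities, a literal sum of their activities would compute a product of probabilities, not the marginal.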
62
Probabilistic Motion Detection in a Model Network
Demo 1: Activities in a model network for noisy motion. Activities represent posterior probabilities of leftward/rightward motion.
Demo 2: Activities of model “decision” neurons. Decision neurons sum up log posterior probabilities over time. Solid line = Leftward motion, Dotted line = Rightward motion.
Demo 3: Effect of making the stimulus more noisy. Longer decision times for noisier stimuli.
63
Reaction Time depends on Coherency
Rate of evidence accumulation depends on stimulus coherency
Reaction Time (decision-making time)
Shorter reaction times for more coherent stimuli
40% coherency 60% coherency 80% coherency
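The coherence/reaction-time relationship can be reproduced with a toy accumulate-to-threshold model: each time step adds an evidence sample whose mean grows with coherence, and the decision fires at a fixed bound. The drift scaling, noise level, and threshold below are illustrative parameters, not fitted to the model network or to data:

```python
import numpy as np

rng = np.random.default_rng(3)

def reaction_time(coherence, threshold=3.0, noise=1.0, max_steps=10_000):
    """Steps until accumulated evidence crosses +/- threshold (toy model)."""
    evidence = 0.0
    for t in range(1, max_steps + 1):
        # Mean evidence per step scales with coherence; noise is additive.
        evidence += coherence * 0.1 + noise * rng.standard_normal()
        if abs(evidence) >= threshold:
            return t
    return max_steps

# Mean reaction times for three coherence levels ("40%, 60%, 80%").
rts = {c: np.mean([reaction_time(c) for _ in range(200)])
       for c in (4.0, 6.0, 8.0)}
```

Higher coherence means a steeper average climb toward the bound, hence shorter mean reaction times, matching the trend in the plots.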
64
Two Brain Areas involved in Visual Decision Making
65
“Decision Neurons” in cortical area LIP
Monkey deciding direction of motion in random dots task
Plot shows average response in LIP to stimuli with different noise levels
Model neuron responses resemble LIP activities
Slower rise to threshold for noisier stimuli
t (ms)
(Roitman and Shadlen, 2002)
66
“Decision” Neurons in Frontal Cortex
Monkey making an eye movement to an “odd-ball” target among a field of distractors
Monkey’s reaction time distribution can be predicted from threshold crossings!
Data from (Schall & Thompson, 1999)
67
Distribution of Reaction Times in the Model
[Histograms: Frequency vs. Reaction Times (number of time steps); 60% Coherence (range 0–200) and 90% Coherence (range 0–80)]
68
What if we increase the prior for Leftward motion?
Higher prior for L
(Based on www.physiol.cam.ac.uk/staff/carpente/recinormal.htm)
69
Model Prediction: Increasing Prior for Left Motion
[Histograms: Frequency vs. Reaction Times (number of time steps); Left/Right equally probable (range 0–200) vs. Left more probable than Right (range 0–100); 60% coherence in both]

Distribution shifts: shorter reaction times for Left trials
70
What if speed is more important than accuracy?
Lower threshold for making faster decisions
(Based on www.physiol.cam.ac.uk/staff/carpente/recinormal.htm)
71
Model Prediction: Imposing an “Urgency” Constraint
[Histograms: Frequency vs. Reaction Times (number of time steps); Decision Threshold = T (T = 0.03, range 0–200) vs. Decision Threshold = T/2 (T = 0.015, range 0–100)]

Distribution shifts: shorter reaction times
72
What about Spikes?
Recall the leaky integrator equation:

dv_i/dt = -v_i(t) + w_i I_t + Σ_j R_ij v_j(t)

Assume v_i is linearly related to the membrane potential V_m^i of neuron i as follows:

V_m^i = k v_i + T

For the standard integrate-and-fire model with additive noise, one can show (Plesser & Gerstner, 2000; Gerstner, 2000):

P(spike(t+1) | V_m^i(t+1)) ≈ e^{(V_m^i(t+1) - T)/k} = e^{v_i(t+1)} = m^i_{t,t+1} = Posterior probability of θ^i
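The decoding step is worth seeing numerically: if V_m = k·v + T with v = log m, then exp((V_m − T)/k) recovers the posterior itself, and spiking with that probability amounts to sampling from the posterior. The constants k and T and the posterior value below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
k, T = 2.0, -1.0                      # arbitrary scaling constants

posterior = 0.25                      # m: posterior prob. of the neuron's state
v = np.log(posterior)                 # network activity = log posterior
V = k * v + T                         # membrane potential encoding it

p_spike = np.exp((V - T) / k)         # invert the encoding: equals posterior
spikes = rng.random(100_000) < p_spike
rate = spikes.mean()                  # empirical spike probability
```

The long-run spike rate converges to the encoded posterior, so downstream neurons can read out probabilities simply by counting spikes.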
73
Example
Membrane Potential (log posterior):
V_m^i(t+1) ∝ log P(θ^i, I(t+1) | I(t), …, I(1))

Sampled Spikes:
P(spike(t+1) | V_m^i(t+1)) ≈ e^{(V_m^i(t+1) - T)/k}

Postsynaptic Membrane Potential (decoded log posterior): recipient neuron with alpha synapse
74
What about Top-Down Information?
Hypothesis: Top-down priors influence lower-level probability estimates
75
Probabilistic Graphical Model → Hierarchical Network
• Top-down feedback conveys prior probability for spatial locations
• Posterior probability at lower level computed from prior & image
Hierarchical Belief Propagation in Cortical Networks
(Rao, NIPS, 2004)
76
Attention can restore V4 responses in the presence of distractors (Reynolds et al., 1999)
Reference stimulus only
Reference and probe (No Attention)
Reference and probe (with Attention)
Example: Modeling Spatial Attention in V4
77
Attentional Restoration of Responses in the Model
Reference only Ref. and probe Ref. and probe with attention
(Rao, NIPS, 2004)
78
Related Work on Probabilistic Models
Linear Generative Model: Sparse Coding Models (Olshausen & Field, 1996; 1997); ICA (Bell & Sejnowski, 1997)
Hierarchical Model: MacKay, 1956; Mumford, 1992; Kawato et al., 1993; Dayan et al., 1995; Lee & Mumford, 2003; Friston, 2003; Hawkins, 2004
Encoding Uncertainty and Belief Propagation with Neurons: Anderson & Van Essen, 1994; Zemel et al., 1998; Pouget et al., 2000; Deneve, NIPS, 2004; Yu & Dayan, NIPS, 2004; Zemel et al., NIPS, 2004
79
Summary and Conclusions (“Posterior” for this lecture)
There is growing evidence that the brain utilizes probabilistic principles such as Bayesian inference
This lecture explored two neural models for Bayesian inference:
Predictive Coding: Feedback connections convey predictions while feedforward connections carry errors
Belief Propagation: The membrane potential encodes log posterior probability via belief propagation; spiking probability equals the posterior probability of the state encoded by the neuron
Some broad predictions of the models:
Cortical architecture implements a graphical model of the sensory (and motor) environment
Cortical networks perform hierarchical Bayesian inference
Corticocortical feedback conveys predictions or prior probabilities
80
(http://employees.csbsju.edu/tcreed/pb/pdoganim.html)
Open Problems:
Synaptic Plasticity: Role of STDP and short-term plasticity in Bayesian models
Neural Implementation of Sensorimotor Bayesian models
Incorporating rewards (Pavlovian conditioning, etc.) …
Future Directions