
Introduction: Statistical and Machine Learning Based Approaches to Neurobiology

Shin Ishii

Nara Institute of Science and Technology

1. Mathematical Fundamentals: Maximum Likelihood and Bayesian Inferences

Coin tossing

• Tossing a skewed coin
• How often does the head appear for this coin?
  – Probability of obtaining one head in five tosses:

Head Tail Tail Tail Tail

One head comes up in five tosses (note: each toss is independent).

Parameter q: the probability that a head appears in an individual toss.

Likelihood

Likelihood function

• Likelihood: an evaluation of the observed data, viewed as a function of the parameter q:

P(Head, Tail, Tail, Tail, Tail | q) = q (1 − q)^4

What is the most likely parameter q? How do we determine it?

It seems natural to set q according to the observed frequency of heads (q = 1/5).

Really?

Which parameter explains the observed data better? Compare the likelihoods of two candidate values:

Likelihood of parameter q1: P(data | q1) = q1 (1 − q1)^4

Likelihood of parameter q2: P(data | q2) = q2 (1 − q2)^4
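As a quick numerical illustration, the short Python sketch below evaluates this likelihood on a grid and compares two candidate values of q (the candidates 0.2 and 0.5 are chosen here purely for illustration; the slide's original candidate values are not recoverable):

```python
import numpy as np

# Likelihood of the observed sequence H,T,T,T,T as a function of the
# head probability q (the parameter defined above).
def likelihood(q):
    return q * (1 - q) ** 4

for q in (0.2, 0.5):  # illustrative candidate parameters
    print(f"q = {q}: likelihood = {likelihood(q):.5f}")

# A dense grid shows where the likelihood peaks.
grid = np.linspace(0.0, 1.0, 1001)
print("likelihood is maximal near q =", grid[np.argmax(likelihood(grid))])
```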

Kullback-Leibler (KL) divergence

• A measure of the difference between two probability distributions P(x) and Q(x):

KL(P ‖ Q) = Σ_x P(x) log [ P(x) / Q(x) ]

It lets us measure the difference between the two distributions by an objective, numerical value.

Note: the KL divergence is not a metric (it is not symmetric in P and Q).

Minimize KL divergence

• Random events are drawn from the real (true) distribution:

true distribution P(x)  →  data set {x_1, …, x_N}

Using the observed data, we want to estimate the true distribution with a trial distribution Q(x; q), by minimizing the KL divergence between them.

The smaller the KL divergence KL(P ‖ Q), the better the estimate.

Minimize KL divergence

• KL divergence between the two distributions:

KL(P ‖ Q) = Σ_x P(x) log P(x) − Σ_x P(x) log Q(x; q)

The first term is a constant, independent of the parameter q. To minimize the KL divergence, we only have to maximize the second term with respect to the parameter q.

Likelihood and KL divergence

• The second term is approximated by the sample mean over the data set {x_1, …, x_N}:

Σ_x P(x) log Q(x; q) ≈ (1/N) Σ_n log Q(x_n; q),

which is the log likelihood (up to the factor 1/N).

They are therefore the same:
• minimizing the KL divergence
• maximizing the likelihood
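A minimal numerical check of this equivalence, assuming a Bernoulli (coin-toss) model; the true parameter 0.3 and the sample size 200 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# True Bernoulli parameter and a finite sample drawn from it.
q_true = 0.3
data = rng.random(200) < q_true            # heads (True) and tails (False)

grid = np.linspace(0.01, 0.99, 99)         # candidate model parameters q

# Sample-mean approximation of the second KL term = average log likelihood.
p_hat = data.mean()
loglik = p_hat * np.log(grid) + (1 - p_hat) * np.log(1 - grid)

# Exact KL divergence between the true Bernoulli and each candidate model.
kl = (q_true * np.log(q_true / grid)
      + (1 - q_true) * np.log((1 - q_true) / (1 - grid)))

print("log likelihood is maximal at q =", grid[np.argmax(loglik)])
print("KL divergence is minimal at  q =", grid[np.argmin(kl)])
```

The two answers agree up to sampling noise, and the maximizer of the sampled log likelihood approaches the exact KL minimizer as the data set grows.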

Maximum Likelihood (ML) estimation

• Maximum likelihood (ML) estimate:

q_ML = argmax_q P(data | q)

• What is the most likely parameter in the coin-tossing example?

Head Tail Tail Tail Tail

Maximization condition:

d/dq [ q (1 − q)^4 ] = (1 − q)^3 (1 − 5q) = 0  ⇒  q_ML = 1/5

Same as the intuition (the observed frequency of heads).

Property of ML estimate

• As the number of observations increases, the squared error of the estimate decreases (on the order of 1/N).
• The ML estimate is asymptotically unbiased (R. Fisher, 1890-1962).

If an infinite number of observations could be obtained, the ML estimate would converge to the real parameter.

That is infeasible. What happens when only a limited number of observations have been obtained from the real environment?

Problem with ML estimation

• Is it really a skewed coin?

Head Tail Tail Tail Tail

It may just have happened that four consecutive tails came up. It may be harmful to fix the parameter at q = 1/5.

Five more tosses...

Head Tail Head Head Head

See... the four consecutive tails occurred by chance. The ML estimate overfits to the first observations. How can we avoid this overfitting?

Consider an extreme case: if the data consist of one head in a single toss, the ML estimate gives q = 1 (100%). Not reasonable.

Bayesian approach

• Bayes theorem:

Posterior ∝ Likelihood × Prior
P(q | data) ∝ P(data | q) P(q)

a posteriori information = information obtained from data + a priori information

Prior distribution: we have no information about the possibly skewed coin, so we assume that the parameter q is distributed broadly around 0.5.

Bayesian approach

• Bayes theorem (continued): Posterior ∝ Likelihood × Prior

Likelihood function: the observed data are one head and four tails. Hmm... it may be a skewed coin, but it is better to consider the other possibilities as well.

Bayesian approach

• Bayes theorem (continued): Posterior ∝ Likelihood × Prior

Posterior distribution: the parameter q is now distributed mainly over intermediate values; some variance (uncertainty) remains.
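A minimal sketch of this update in pure NumPy, assuming a Beta(3, 3) prior centred on 0.5 (the exact prior used in the slides is not recoverable); the Beta prior combines with the coin-toss likelihood to give a Beta posterior:

```python
import numpy as np

# Assumed prior: Beta(3, 3), a broad distribution centred on q = 0.5.
a0, b0 = 3, 3

# Observed data: one head and four tails, as in the slides.
heads, tails = 1, 4

q = np.linspace(0, 1, 2001)
prior = q ** (a0 - 1) * (1 - q) ** (b0 - 1)      # unnormalized Beta(3, 3) prior
likelihood = q ** heads * (1 - q) ** tails       # coin-toss likelihood
posterior = prior * likelihood                   # Bayes theorem (unnormalized)
posterior /= posterior.sum() * (q[1] - q[0])     # normalize on the grid

print("posterior mean:", np.sum(q * posterior) * (q[1] - q[0]))  # ~0.36
print("posterior mode:", q[np.argmax(posterior)])                # ~0.33
```

Both point summaries are pulled away from the ML estimate 0.2 toward the prior centre 0.5, and the full posterior keeps the remaining uncertainty explicit.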

Property of Bayesian inference

• Bayesian view: probability represents the uncertainty of random events (a subjective value).

Frequentist (R. Fisher, 1890-1962): "That can’t be! A prior distribution introduces a subjective distortion into the estimation. Estimation must be objective with respect to the obtained data."

Bayesian (T. Bayes, 1702-1761): "No problem. The uncertainty of random events (subjective probability) depends on the amount of information obtained from the data and on the prior knowledge of the events."

Application of Bayesian approaches

• Data obtained from the real world are often:
  – sparse
  – high dimensional
  – affected by unobservable (hidden) variables

Bayesian methods are well suited to such data, for example:
• user support systems (Bayesian networks)
• bioinformatics

2. Bayesian Approaches to Reconstruction of Neural Codes

A neural decoding problem

How does the brain work?

Sensory information is represented in sequences of spikes. When the same stimulus is repeatedly presented, the spike occurrences vary between trials.

An indirect approach is to reconstruct the stimuli from the observed spike trains.

Bayesian application to a neural code

Spike train (observation) over time; stimulus (prior knowledge)?

Possible algorithms for stimulus reconstruction (estimation) from:
• the spike train only (maximum likelihood estimation)
• the spike train & the stimulus statistics (Bayes estimation)

Note: we focus on whether spike trains include stimulus information, NOT on whether the algorithm is the one actually implemented in the brain.

‘Observation’ depends on ‘Prior’

Stimulus → Neural system (black box) → Spike train → Estimation algorithm → Estimated stimulus
(Bialek et al., Science, 1991)

‘Observation’ depends on ‘Prior’

Stimulus distribution P(s) → Neural system (black box) P(x | s) → Estimation algorithm P(s | x) → Estimated stimulus distribution

s: stimulus, x: observed spike train

Simple example of signal estimation

Observation = Signal + Noise:  x = s + η

• particular value of the observation: x
• incoming signal: s
• noise: η (zero mean)

Estimated stimulus: s_est = f(x)

Simple example of signal estimation

If the signal s is supposed to be chosen from a Gaussian,

P(s) = 1/√(2π σ_s²) · exp( −s² / (2 σ_s²) )

If the probability of observing a particular x with signal s depends only on the noise η, and the noise is also supposed to be Gaussian,

P(x | s) = 1/√(2π σ²) · exp( −(x − s)² / (2 σ²) )

So the posterior is

P(s | x) ∝ P(x | s) P(s) ∝ exp( −(x − s)² / (2 σ²) ) · exp( −s² / (2 σ_s²) )

Simple example of signal estimation

Bayes theorem:  P(s | x) ∝ P(x | s) P(s)   (Posterior ∝ Likelihood × Prior, given the observation)

Maximum likelihood estimation: maximize P(x | s)

∂P(x | s)/∂s = 0  ⇒  s_est = x

Bayes estimation: maximize P(s | x)

∂P(s | x)/∂s = 0  ⇒  s_est = K x,  with  K = SNR / (1 + SNR),  SNR = σ_s² / σ²

When SNR ≫ 1, K → 1 and the Bayes estimate approaches the ML estimate.
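A minimal simulation of this example, with assumed values σ_s = 2 and σ = 1, showing that the shrinkage factor K trades a small bias for a lower mean squared error:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma_s, sigma_n = 2.0, 1.0               # assumed signal and noise std
snr = sigma_s**2 / sigma_n**2
K = snr / (1 + snr)                       # shrinkage factor from the derivation above

s = rng.normal(0, sigma_s, 10_000)        # signals drawn from the Gaussian prior
x = s + rng.normal(0, sigma_n, s.shape)   # noisy observations

s_ml = x                                  # ML estimate: maximize P(x | s)
s_bayes = K * x                           # Bayes (MAP) estimate: maximize P(s | x)

print("K =", K)
print("ML    mean squared error:", np.mean((s_ml - s) ** 2))
print("Bayes mean squared error:", np.mean((s_bayes - s) ** 2))
```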

Signal estimation of a fly (Bialek et al., Science, 1991)

Calliphora erythrocephala, movement-sensitive neuron (H1)

Gaussian visual stimulus (a function of time)

• Visually guided flight: behavioral time scale ~30 ms
• H1 firing rate: 100-200 spikes/s
• Behavioral decisions are therefore based on only a few spikes.

Signal estimation of a fly (Bialek et al., Science, 1991)

Stimulus s(t) → Encoder → Observation: spike times {t_i}

Bayes theorem:  P(s | {t_i}) ∝ P({t_i} | s) P(s)   (Posterior ∝ Likelihood × Prior)

The estimated stimulus maximizes P(s | {t_i}). However, P(s | {t_i}) cannot be measured directly.

Kernel reconstruction and least squares

The posterior-mean estimate

s_est = ∫ ds P(s | {t_i}) s

still cannot be calculated, because P(s | {t_i}) cannot be defined directly.

The next step is an alternative calculation: approximate the estimated stimulus by a sum of kernels placed at the spike times,

s_est(t) = Σ_i F(t − t_i),

choosing the kernel F(τ) which minimizes the squared error

∫ dt | s_est(t) − s(t) |²
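A rough sketch of this kernel fit on synthetic data (the toy stimulus, the crude rate-based encoder, and the purely causal kernel window are all assumptions for illustration; they are not the actual H1 recordings or encoding model):

```python
import numpy as np

rng = np.random.default_rng(2)

dt, T, L = 0.003, 30.0, 100                 # bin width (s), duration (s), kernel length (bins)
n = int(T / dt)

# Toy stimulus: smoothed Gaussian noise standing in for the true s(t).
s = np.convolve(rng.normal(size=n), np.ones(20) / 20, mode="same")

# Crude encoder: spikes are more likely when the stimulus is large.
rate = 40.0 * np.clip(s, 0, None)           # spikes/s
spikes = (rng.random(n) < rate * dt).astype(float)

# Design matrix: column j is the spike train delayed by j bins, so that
# (X @ F)[k] approximates sum_i F(t_k - t_i) over the spike times t_i.
X = np.column_stack([np.roll(spikes, j) for j in range(L)])

# Least-squares choice of the kernel F minimizing |s_est - s|^2.
F, *_ = np.linalg.lstsq(X, s, rcond=None)
s_est = X @ F

print("correlation between s(t) and s_est(t):", np.corrcoef(s, s_est)[0, 1])
```

In an offline reconstruction the kernel can also be allowed to extend to times before each spike; the causal window above is only a simplification.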

Signal estimation of a fly (Bialek et al., Science, 1991)

Figure: the stimulus, the estimated stimulus, and the reconstruction kernel F(τ).

Case of mammals

Rat hippocampal CA1 cells: O’Keefe’s place cells (Lever et al., Nature, 2002)

Each place cell shows high activity when the rat is located at a specific position.

It is known that hippocampal CA1 cells represent the animal’s position in a familiar field.

Case of mammals (Lever et al., Nature, 2002)

Each place cell shows high activity when the rat is located at a specific position.

Question: can one estimate the rat’s position in the field from the firing patterns of rat hippocampal place cells?

Incremental Bayes estimation (Brown et al., J. Neurosci., 1998)

Sequential Bayes estimation from a spike train

A place cell’s spike train is observed at times t_1, …, t_{k−1}, t_k; s(t_k) denotes the rat’s position at time t_k.

Bayes theorem:  Posterior ∝ Likelihood × Prior

P( s(t_k) | spikes in [t_1, t_k] ) ∝ P( spikes at t_k | s(t_k), t_{k−1} ) · P( s(t_k) | spikes in [t_1, t_{k−1}] )

The rat’s position can be estimated by integrating the recent place cell activities (the likelihood) with the position estimate derived from the history of activities (the prior).

Incremental Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)

The spike trains of the place cells are observed up to the current time t_k.

Observation (likelihood):  P( spikes at t_k | s(t_k), t_{k−1} )

Prior (one-step prediction of the rat’s position):

P( s(t_k) | spikes in [t_1, t_{k−1}] ) = ∫ P( s(t_k) | s(t_{k−1}) ) · P( s(t_{k−1}) | spikes in [t_1, t_{k−1}] ) ds(t_{k−1})

The observation probability P( spikes at t_k | s(t_k), t_{k−1} ) is a function of the firing rates of the cells, which depend on the rat’s position and on the theta rhythm.

Firing rate of a place cell depends on

1. a position component (receptive field)
2. a theta phase component

An inhomogeneous Poisson process is assumed for the spike train (Brown et al., J. Neurosci., 1998).

Position component (asymmetric Gaussian receptive field centred at μ with width matrix W; α sets the peak rate):

λ_x( t | x(t) ) = exp( α − ½ (x(t) − μ)ᵀ W⁻¹ (x(t) − μ) )

Theta phase component (cosine modulation with preferred phase θ_0 and modulation depth β):

λ_θ( t | θ(t) ) = exp( β cos(θ(t) − θ_0) )

Instantaneous firing rate:

λ( t | x(t), θ(t) ) = λ_x( t | x(t) ) · λ_θ( t | θ(t) )

The parameters were determined by maximum likelihood.

In short, the firing rate of a place cell depends on
1. its preferred position (receptive field)
2. its preferred phase in the theta rhythm

Position estimation from spike trains (Brown et al., J. Neurosci., 1998)

Assumption: the path of the rat may be approximated as a zero-mean two-dimensional Gaussian random walk, whose variance parameters σ_x1² and σ_x2² were also estimated by ML.

Posterior (as above):

P( s(t_k) | spikes in [t_1, t_k] ) ∝ P( spikes at t_k | s(t_k), t_{k−1} ) · P( s(t_k) | spikes in [t_1, t_{k−1}] )

Finally, the estimation procedure is as follows (a toy sketch is given after this list):

1. Encoding stage: estimate the receptive-field and theta-phase parameters (μ, W, α, β, θ_0) and the random-walk variances σ_x1², σ_x2² by maximum likelihood.
2. Decoding stage: estimate the rat’s position by the incremental Bayes method at each spike event, under the Gaussian random walk assumption.
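A much-simplified, one-dimensional sketch of the decoding stage (grid-based recursive Bayes with Gaussian place fields and Poisson spiking; the cell count, field width, and step sizes are invented for illustration, and the original paper instead uses a Gaussian, EKF-style approximation in two dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)

dt = 0.01                                   # time step (s), assumed
grid = np.linspace(0, 1, 200)               # candidate positions on a 1-D track
centers = np.linspace(0, 1, 15)             # place-field centres (assumed)
width, peak = 0.08, 20.0                    # field width and peak rate (assumed)

def rates(pos):
    """Poisson firing rates of all cells at position(s) pos."""
    return peak * np.exp(-0.5 * ((pos[..., None] - centers) / width) ** 2)

# Simulate a random-walk trajectory and the resulting spike counts.
steps = 500
x = np.clip(0.5 + np.cumsum(rng.normal(0, 0.01, steps)), 0.0, 1.0)
spikes = rng.poisson(rates(x) * dt)         # shape (steps, n_cells)

# Gaussian random-walk transition kernel on the grid (columns sum to 1).
trans = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.02) ** 2)
trans /= trans.sum(axis=0, keepdims=True)

lam = rates(grid) * dt                      # expected counts at every grid position
posterior = np.ones_like(grid) / grid.size  # flat initial prior
estimates = []
for k in range(steps):
    prior = trans @ posterior               # prediction (Chapman-Kolmogorov) step
    loglik = (spikes[k] * np.log(lam) - lam).sum(axis=1)   # Poisson log likelihood
    posterior = prior * np.exp(loglik - loglik.max())      # Bayes update step
    posterior /= posterior.sum()
    estimates.append(grid[np.argmax(posterior)])

print("mean absolute decoding error:", np.mean(np.abs(np.array(estimates) - x)))
```

Each loop iteration performs exactly the two operations written above: the one-step prediction of the prior, followed by the Bayes update with the new spike observations.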

Bayes estimation from spike trains (Brown et al., J. Neurosci., 1998)

Figure: the real rat trajectory compared with the EKF-style estimate and its variance; spike events mark the update times.

The posterior distribution is computed at discrete, irregular time steps: whenever a spike occurs as a new observation.

Position estimation from spike trains (1) (Brown et al., J. Neurosci., 1998)

Figure: rat position versus the estimates. In the Bayes method, Posterior = Prior × Likelihood, where the likelihood compares the observed firing pattern with the model activity; the maximum correlation method matches the observed firing pattern to the model activity directly.

Compared methods: Bayes estimation, maximum likelihood, maximum correlation.

Position estimation from spike trains (2) (Brown et al., J. Neurosci., 1998)

Figure: as above, comparing Bayes estimation, maximum likelihood, and maximum correlation.

The ML and maximum correlation methods ignore the history of neural activities, but the incremental Bayes method incorporates it as a prior.

3. Information Theoretic Analysis of Spike Trains

Information transmission in neural systems

Environmental stimulus X → Encoder → Spike train (neural response) Y → Decoder

• How does a spike train code information about the corresponding stimuli?
• How efficient is the information transmission?
• Which kind of coding is optimal?

Information transmission: Generalized view

Shannon’s communication system (Shannon, 1948):

Information source → Message Z → Transmitter → Signal X (observable) → Channel (+ noise source) → Received signal Y (observable) → Receiver → Message Z~ → Destination

• Z: symbol
• X: encoded symbol (stimuli)
• Y: transmitted symbol (response)
• Z~: decoded symbol

Neural coding is a stochastic process

Stimulus X → Observed spike trains Y_1 | X, Y_2 | X, Y_3 | X, …

Neuronal responses to a given stimulus are not deterministic but stochastic, and the stimulus corresponding to a given response is likewise only probabilistically determined.

Shannon’s Information

• The smallest unit of information is the “bit”
  – 1 bit = the amount of information needed to choose between two equally likely outcomes (e.g., tossing a fair coin)

• Properties:
  1. Information for independent events is additive over the constituent events
  2. If we already know the outcome, there is no information

Shannon’s Information

Independent events:

P(X_1, X_2, …, X_N) = P(X_1) P(X_2) ⋯ P(X_N)

Implies (Property 1):

I( P(X_1, X_2, …, X_N) ) = I( P(X_1) ) + I( P(X_2) ) + ⋯ + I( P(X_N) )

Certain events:

P(X_1) = 1

Implies (Property 2):

I( P(X_1) ) = 0

Both properties are satisfied by

I( P(X) ) = −log_2 P(X)

E.g. Tossing a coin

• Tossing a fair coin:

P(X = Head) = 0.5,  P(X = Tail) = 0.5

I(X = Head) = −log_2 0.5 = 1 bit,  I(X = Tail) = 1 bit

E.g. Tossing a coin

• Tossing a horribly skewed coin…

P(X = Head) = 0.99,  P(X = Tail) = 0.01

I(X = Head) = −log_2 0.99 ≈ 0.0145 bits,  I(X = Tail) = −log_2 0.01 ≈ 6.64 bits

Observing an ordinary event carries little information, but observing a rare event is highly informative.

E.g. Tossing 5 coins

• Case 1: five fair coins, P(X = Head) = 0.5

Head Tail Tail Tail Tail

I(X = H,T,T,T,T) = 1 × 5 = 5 bits

• Case 2: five skewed coins, P(X = Head) = 0.2

I(X = H,T,T,T,T) = −log_2 0.2 − 4 log_2 0.8 ≈ 3.6 bits
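These two self-information values are easy to verify numerically:

```python
import numpy as np

def information_bits(p):
    """Shannon self-information, I = -log2(p), of an outcome with probability p."""
    return -np.log2(p)

# Case 1: five fair coins, outcome H,T,T,T,T.
print(information_bits(0.5) * 5)                          # 5.0 bits

# Case 2: five skewed coins with P(Head) = 0.2, same outcome.
print(information_bits(0.2) + 4 * information_bits(0.8))  # ~3.6 bits
```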

Entropy

• Entropy is the expectation of the information over all possible observations:

H = E_P[ −log_2 P(X) ]

On average, how much information do we get from an observation drawn from the distribution?

Entropy can be defined for discrete and continuous variables:

discrete:    H = −Σ_X P(X) log_2 P(X)

continuous:  H = −∫ P(X) log_2 P(X) dX

Some properties of entropy

• Entropy is a scalar property of a probability distribution
• Entropy is maximal if P(X) is constant (least certainty about the event)
• Entropy is minimal if P(X) is a delta function
• Entropy is non-negative (for discrete distributions)
• The higher the entropy, the more you learn (on average) by observing values of the random variable
• The higher the entropy, the less you can predict the values of the random variable

E.g. Tossing a coin

P(X = Head) = p,  P(X = Tail) = 1 − p

H_P = −p log_2 p − (1 − p) log_2 (1 − p)

Entropy reaches its maximum (1 bit, at p = 0.5) when each event occurs with equal probability, i.e., when the outcome is most random.
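A quick numerical check of the binary entropy and its maximum:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    p = np.asarray(p, dtype=float)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

ps = np.linspace(0.01, 0.99, 99)
H = binary_entropy(ps)
print("entropy is maximal at p =", ps[np.argmax(H)])  # 0.5
print("maximum entropy:", H.max(), "bit")             # 1.0 bit
```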

E.g. Entropy of a Gaussian distribution

For x ~ N(μ, σ²),

H = −∫ dx P(x) log_2 P(x) = ½ log_2 (2π e σ²)  bits

The entropy depends only on the standard deviation, i.e., it reflects the variability of the information source (and is independent of the mean).
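The closed-form expression can be checked against a direct numerical integration (σ = 1.5 is an arbitrary choice):

```python
import numpy as np

sigma = 1.5                                              # assumed standard deviation
analytic = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Numerical check: -integral of p(x) log2 p(x) dx, as a Riemann sum on a fine grid.
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
numeric = -(p * np.log2(p)).sum() * (x[1] - x[0])

print(analytic, numeric)   # both ~2.63 bits, independent of the mean
```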

What distribution maximizes the entropy of a random variable?

discrete, X ∈ {1, …, M}:

H = −Σ_X P(X) log_2 P(X)  →  maximized by the uniform distribution, P(X) = 1/M

continuous, with fixed mean E[X] = μ and variance V[X] = σ²:

H = −∫ P(X) log_2 P(X) dX  →  maximized by the Gaussian, P(X) = 1/√(2π σ²) · exp( −(X − μ)² / (2σ²) )

Entropy of Spike Trains

• A spike train can be transformed into a binary word by discretizing time into small bins (MacKay and McCulloch, 1952):

spike train → binary word, e.g. 0 1 1 0 0 1 0 …

Δt: time resolution (bin width)
T: duration of the time window

• Computing the entropy of the possible spike trains tells us how informative such spike trains can be: how many different binary words can occur over the whole set of bins?

Entropy of spike trains (Brillouin, 1962)

N = T/Δt: total number of bins
p = rΔt: firing probability per bin (r: mean spike rate)
N_1 = pN: number of 1’s (spikes)
N_0 = (1 − p)N: number of 0’s

All possible words:

N_total = N! / (N_1! N_0!)

Entropy, using the Stirling approximation ln x! ≈ x (ln x − 1):

H = log_2 N_total
  = (1/ln 2) ( ln N! − ln N_1! − ln N_0! )
  ≈ −(T/Δt) (1/ln 2) [ rΔt ln(rΔt) + (1 − rΔt) ln(1 − rΔt) ]

(assuming T is much larger than Δt)

The entropy is linear in the length of the time window T.

Figure: entropy per unit time as a function of the time resolution Δt (ms), for r ≈ 50 spikes/s.

Entropy rate: information in bits per second

• If the chance of a spike in a bin is small (low rate or fine time resolution), i.e. p = rΔt ≪ 1, then −(1 − rΔt) ln(1 − rΔt) ≈ rΔt and the entropy rate can be approximated as

H_total / T ≈ r log_2( e / (rΔt) )  bits/s

Entropy rate of a temporal (timing) code, e.g. for r ≈ 50 spikes/s and Δt = 1 ms:

≈ 5.76 bits per spike (≈ 288 bits/s)
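A short check of this approximation (the 1-ms bin width is inferred from the quoted 5.76 bits/spike; with it the code reproduces the numbers above):

```python
import numpy as np

def entropy_rate(r, dt):
    """Exact entropy rate (bits/s) of a binary spike train with rate r and bin size dt."""
    p = r * dt                                   # firing probability per bin
    h_per_bin = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return h_per_bin / dt

r, dt = 50.0, 0.001                              # 50 spikes/s, 1-ms bins
exact = entropy_rate(r, dt)
approx = r * np.log2(np.e / (r * dt))            # small-p approximation

print(exact, "bits/s (exact)")                   # ~286 bits/s
print(approx, "bits/s (approximation)")          # ~288 bits/s
print(approx / r, "bits per spike")              # ~5.76
```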

Entropy of spike count distribution (rate code)

H(spike count) = −Σ_n p(n) log_2 p(n)

p(n): probability of observing n spikes in the time window T

What should we choose for p(n)? We know only two constraints on p(n):

1. the probability distribution must be normalized: Σ_n p(n) = 1
2. the average spike count in T should be ⟨n⟩ = rT

We cannot determine p(n) uniquely, but we can obtain the p(n) which maximizes the entropy.

Entropy of spike count distribution

• The entropy of the spike count is maximized by an exponential (geometric) distribution:

p(n) ∝ exp(−λn)

The entropy is then

H = log_2(1 + rT) + rT log_2( 1 + 1/(rT) )

Conditional entropy and mutual information

• The entropy H_r represents the uncertainty about the response in the absence of any other information.
• The conditional entropy H_{r|s} represents the remaining uncertainty about the response for a fixed stimulus s.
• I(r, s) is the mutual information between s and r, representing the reduction of uncertainty in r achieved by measuring s:

I(r, s) = H_r − H_{r|s} = H_s − H_{s|r} = I(s, r) ≥ 0

• If r and s are statistically independent, then I(r, s) = 0.

Conditional entropy:

H_{r|s} = E[ −log_2 p(r | s) ]

discrete:    H_{r|s} = −Σ_{s,r} p(s) p(r | s) log_2 p(r | s)

continuous:  H_{r|s} = −∫ ds dr p(s) p(r | s) log_2 p(r | s)

Reproducibility and variability in neural spike trains (van Steveninck et al., Science, 1997)

Calliphora erythrocephala, movement-sensitive neuron (H1)

• Dynamic stimuli (natural condition): a random walk with diffusion constant ~14 degrees²/s
  → ordered firing patterns (high reproducibility), low spike-count variance

• Static stimuli (artificial condition)
  → irregular firing patterns (low reproducibility, Poisson-like patterns), high spike-count variance (Poisson-like mean-variance relationship as a function of the mean count)

Does more precise spike timing convey more information about the input stimuli?

Quantifying information transfer (van Steveninck et al., Science, 1997)

Figure: raster of ~100 responses to repeated presentations of the dynamic stimulus.

1. At each time t, divide the spike trains into 10 contiguous 3-ms bins (a 30-ms window) and construct the local word frequency P(W | t).
2. Stepping in 3-ms bins, words are sampled over 900 trials of 10 s (~3 × 10^6 words in total, giving a distribution over about 1500 distinct words).

The overall word distribution is P(W) = ⟨ P(W | t) ⟩_t.

Quantifying information transfer (van Steveninck et al., Science, 1997)

Total entropy of the spike-train words:

H_total = −Σ_W P(W) log_2 P(W)  bits

Noise entropy (conditional entropy of the neuronal response given the stimulus):

H_noise = ⟨ −Σ_W P(W | t) log_2 P(W | t) ⟩_t  bits

Transmitted information (mutual information between W and t):

I = H_total − H_noise  bits
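A minimal sketch of this "direct method" on surrogate data (the word statistics below are randomly generated stand-ins, not the recorded H1 responses):

```python
import numpy as np

rng = np.random.default_rng(4)

def entropy_bits(p):
    """Entropy, in bits, of a discrete probability vector p."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Surrogate data: binary words of 10 bins over many trials and time points
# (a stand-in for the 3-ms-bin, 30-ms-window words of the experiment).
n_trials, n_times, n_bins = 900, 300, 10
prob = rng.random((n_times, n_bins)) * 0.4                  # time-locked firing probabilities
words = (rng.random((n_trials, n_times, n_bins)) < prob).astype(int)
codes = (words * (2 ** np.arange(n_bins))).sum(axis=-1)     # each word -> integer code

# P(W): pool words over all times; P(W | t): word distribution at each time.
p_total = np.bincount(codes.ravel(), minlength=2**n_bins) / codes.size
H_total = entropy_bits(p_total)
H_noise = np.mean([
    entropy_bits(np.bincount(codes[:, t], minlength=2**n_bins) / n_trials)
    for t in range(n_times)
])

print("H_total =", H_total, "bits")
print("H_noise =", H_noise, "bits")
print("I       =", H_total - H_noise, "bits")
```

Note that finite sampling (here 900 trials per time point against 2^10 possible words) biases such entropy estimates; this is a general practical issue with word-based estimators.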

H1’s responses to dynamic stimuli:

H_total = 5.05 bits
H_noise = 2.62 bits
I       = 2.43 bits

Comparison with simulated spike trains

H1’s responses to dynamic stimuli:
H_total = 5.05 bits,  H_noise = 2.62 bits,  I = 2.43 bits

Simulated responses:
H_total = 5.17 bits,  H_noise = 4.22 bits,  I = 0.95 bits

The spike trains were simulated by a modulated Poisson process that
1. has the correct dynamics of the firing rate of the responses to dynamic stimuli,
2. but follows the mean-variance relation of static stimuli (mean = variance).

The real H1 responses carry more than twice as much information. Models that accurately account for the H1 neural response to static stimuli can therefore significantly underestimate the signal transfer under more natural conditions.

Summary

• Statistical inference
  – Maximum likelihood inference
  – Bayesian inference
  – Bayesian approach to neural decoding problems

• Information theory
  – Information amount and entropy
  – Information theoretic approach to a neural encoding system