
Page 1

Efficient adaptive experimental design

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience

Columbia University

http://www.stat.columbia.edu/~liam

[email protected]

March 12, 2009

Page 2

Avoiding the curse of insufficient data

1: Estimate some functional f(p) instead of full joint distribution p(r, s)

— information-theoretic functionals

2: Improved nonparametric estimators

— minimax theory for discrete distributions under KL loss

3: Select stimuli more efficiently

— optimal experimental design

(4: Parametric approaches)

Page 3

Setup

Assume:

• parametric model pθ(r|~x) on responses r given inputs ~x

• prior distribution p(θ) on finite-dimensional model space

Goal: estimate θ from experimental data

Usual approach: draw stimuli i.i.d. from fixed p(~x)

Adaptive approach: choose p(~x) on each trial to maximize E_~x I(θ; r|~x) (e.g. “staircase” methods).

Page 4

Note: Optimizing p(~x) =⇒ optimizing ~x

E_~x I(θ; r|~x) = H(θ) − E_~x H(θ|r, ~x).

Best p(~x) places all mass on points ~x that minimize H(θ|r, ~x).

So our problem really reduces to arg max_~x I(θ; r|~x).
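To make this concrete, here is a minimal sketch of greedy infomax stimulus selection on a grid, for a hypothetical 1-d problem with a binary response and a logistic model; the grids, the logistic link, and all names below are illustrative assumptions, not the construction developed later in the slides.

```python
import numpy as np

# Greedy infomax stimulus selection on a grid, for a hypothetical 1-d problem:
# binary response r with p(r = 1 | x, theta) = logistic(x - theta).

theta_grid = np.linspace(-5.0, 5.0, 201)                 # discretized parameter space
x_grid = np.linspace(-5.0, 5.0, 101)                     # candidate stimuli
posterior = np.ones_like(theta_grid) / theta_grid.size   # current posterior p(theta)

def mutual_information(x, posterior):
    """I(theta; r | x) on the grid, as the expected KL from old to new posterior."""
    p1 = 1.0 / (1.0 + np.exp(-(x - theta_grid)))          # p(r = 1 | x, theta)
    info = 0.0
    for p_r_given_theta in (p1, 1.0 - p1):                # r = 1, then r = 0
        p_r = np.sum(posterior * p_r_given_theta)         # marginal p(r | x)
        post_r = posterior * p_r_given_theta / p_r        # posterior after observing r
        info += p_r * np.sum(post_r * np.log(post_r / posterior))
    return info

# Next stimulus: argmax_x I(theta; r | x) under the current posterior.
x_next = x_grid[np.argmax([mutual_information(x, posterior) for x in x_grid])]
```

In a closed loop one would present x_next, observe r, replace `posterior` by the corresponding `post_r`, and repeat.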

Page 5

Snapshot: one-dimensional simulation

[Figure: snapshot at trial 100 — p(y = 1 | x, θ0) as a function of x, the information I(y; θ | x) as a function of x, and the posterior p(θ) for the optimized vs. i.i.d. designs.]

Page 6

Asymptotic result

Under regularity conditions, a posterior CLT holds (Paninski, 2005):

p_N(√N(θ − θ0)) → N(µ_N, σ²);   µ_N ∼ N(0, σ²)

• (σ²_iid)⁻¹ = E_x[I_x(θ0)]

• (σ²_info)⁻¹ = arg max_{C ∈ co(I_x(θ0))} log |C|

=⇒ σ²_iid > σ²_info unless I_x(θ0) is constant in x

co(I_x(θ0)) = convex closure (over x) of the Fisher information matrices I_x(θ0). (log |C| is strictly concave, so the maximum is unique.)
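As a quick sanity check on these two rates, consider the scalar-θ case, where the Fisher "matrices" are numbers, co(I_x(θ0)) is an interval, and log|C| is maximized at its largest element. The two stimuli and their Fisher informations below are made-up numbers, purely for illustration.

```python
import numpy as np

# Scalar-theta illustration with two hypothetical stimuli.
I_x = np.array([1.0, 4.0])            # Fisher information of stimulus 1 and stimulus 2
p_x = np.array([0.5, 0.5])            # fixed i.i.d. sampling distribution

var_iid = 1.0 / np.dot(p_x, I_x)      # (sigma^2_iid)^{-1} = E_x I_x(theta0)   -> 1/2.5
var_info = 1.0 / I_x.max()            # infomax concentrates on the best stimulus -> 1/4
print(var_iid / var_info)             # 1.6: the i.i.d. design is asymptotically worse
```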

Page 7

Illustration of theorem

[Figure: illustration of the theorem — snapshots of the posterior over θ, the posterior mean E(p), the posterior spread σ(p) (log-log axes), and the posterior probability of the true parameter P(θ0), plotted against trial number.]

Page 8

Technical details

Stronger regularity conditions than usual are required, to prevent “obsessive” sampling and ensure consistency.

Significant complication: exponential decay of the posteriors p_N off of neighborhoods of θ0 does not necessarily hold.

Page 9

Psychometric example

• stimuli x one-dimensional: intensity

• responses r binary: detect/no detect

p(r = 1|x, θ) = f((x − θ)/a)

• scale parameter a (assumed known)

• want to learn threshold parameter θ as quickly as possible

[Figure: the psychometric function p(1 | x, θ), rising from 0 to 1 as a function of θ.]

Page 10

Psychometric example: results

• the variance-minimizing and info-theoretic methods are asymptotically the same

• there is just one function f* for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt

I_x(θ) = (f′_{a,θ})² / ( f_{a,θ}(1 − f_{a,θ}) )

• f* solves f′_{a,θ} = c √( f_{a,θ}(1 − f_{a,θ}) ), which gives

f*(t) = (sin(ct) + 1) / 2

• σ²_iid / σ²_opt ∼ 1/a for small a
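A short numerical check of the claim about f*, with the derivative taken in t = (x − θ)/a (which only rescales the information by the constant 1/a²): for f*(t) = (sin(ct) + 1)/2 the information is constant, which is exactly the condition under which σ_iid = σ_opt. The value of c and the grid are arbitrary illustrative choices.

```python
import numpy as np

# Check: for f*(t) = (sin(ct) + 1)/2, the quantity (f*')^2 / (f*(1 - f*)) is constant,
# so I_x(theta0) does not vary with x and the i.i.d. and optimized designs coincide.
c = 0.7
t = np.linspace(-2.0, 2.0, 9)               # keep |ct| < pi/2 so f* stays in (0, 1)
f = (np.sin(c * t) + 1.0) / 2.0
fprime = c * np.cos(c * t) / 2.0
info = fprime**2 / (f * (1.0 - f))
print(np.allclose(info, c**2))              # True: constant information c^2
```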

Page 11

Open directions

In the smooth loglikelihood case, we get a √N convergence rate (albeit with a smaller asymptotic variance than the standard i.i.d. design).

In the discontinuous loglikelihood case, we can have exponential convergence (e.g., the 20 questions game).

Question: more generally, when does infomax lead to a faster-than-√N convergence rate?
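A toy illustration of the discontinuous, "20 questions" regime: with a noiseless step response the infomax query is the posterior median, so the feasible interval halves on every trial and the error decays exponentially rather than like 1/√N. The step model and numbers below are illustrative only.

```python
# Noiseless step response r = 1{x >= theta}; infomax query = posterior median.
theta0, lo, hi = 0.3141, 0.0, 1.0
for n in range(30):
    x = 0.5 * (lo + hi)          # infomax stimulus under a uniform posterior on [lo, hi]
    if x >= theta0:              # response r = 1 tells us theta0 <= x
        hi = x
    else:                        # response r = 0 tells us theta0 > x
        lo = x
print(hi - lo)                   # 2^{-30}: exponential, not sqrt(N), convergence
```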

Page 12

Part 2: Computing the optimal stimulus

OK, now how do we actually do this in the neural case?

• Computing I(θ; r|~x) requires an integration over θ

— in general, exponentially hard in dim(θ)

• Maximizing I(θ; r|~x) in ~x is doubly hard

— in general, exponentially hard in dim(~x)

Doing all this in real time (~10 ms – 1 sec) is a major challenge!

Joint work w/ J. Lewi (Lewi et al., 2007; Lewi et al., 2008; Lewi et al., 2009)

Page 13

Three key steps

1. Choose a tractable, flexible model of neural encoding

2. Choose a tractable, accurate approximation of the posterior p(~θ|{~xi, ri}i≤N)

3. Use these approximations and some perturbation theory to reduce the optimization problem to a simple 1-d linesearch

Page 14

Step 1: focus on GLM case

r_i ∼ Poiss(λ_i);   λ_i | ~x_i, ~θ = f( ~k · ~x_i + Σ_j a_j r_{i−j} ).

More generally, log p(r_i | θ, ~x_i) = k(r) f(θ · ~x_i) + s(r) + g(θ · ~x_i)

Goal: learn ~θ = {~k,~a} in as few trials as possible.

Page 15

GLM likelihood

r_i ∼ Poiss(λ_i),   λ_i | ~x_i, ~θ = f( ~k · ~x_i + Σ_j a_j r_{i−j} )

log p(r_i | ~x_i, ~θ) = −f( ~k · ~x_i + Σ_j a_j r_{i−j} ) + r_i log f( ~k · ~x_i + Σ_j a_j r_{i−j} )

Two key points:

• Likelihood is “rank-1” — only depends on ~θ along ~z = (~x, ~r).

• f convex and log-concave =⇒ log-likelihood concave in ~θ
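As a concrete instance, here is a minimal single-trial log-likelihood for this Poisson GLM with f(.) = exp(.) (the case used later for exact Gaussian expectations); the function and argument names are illustrative, not part of the slides.

```python
import numpy as np

# Single-trial Poisson-GLM log-likelihood with f(.) = exp(.).
def glm_loglik(theta, x, r_hist, r):
    """log p(r | x, theta) for r ~ Poiss(f(k.x + sum_j a_j r_{-j})), theta = [k, a]."""
    z = np.concatenate([x, r_hist])   # the likelihood is rank-1: it sees theta only
    u = np.dot(theta, z)              # through the single projection u = theta . z
    return -np.exp(u) + r * u         # -f(u) + r log f(u), up to the -log(r!) constant
# Concavity in theta: -exp(u) + r*u is concave in u, and u is linear in theta.
```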

Page 16

Step 2: representing the posterior

Idea: Laplace approximation

p(~θ|{~xi, ri}i≤N) ≈ N (µN , CN)

Justification:

• posterior CLT

• likelihood is log-concave, so posterior is also log-concave:

log p(~θ|{~xi, ri}i≤N) ∼ log p(~θ|{~xi, ri}i≤N−1) + log p(rN |xN , ~θ)

Page 17

Efficient updating

Updating µ_N: a 1-d search

Updating C_N: rank-one update, C_N = (C_{N−1}^{−1} + b ~z ~z^t)^{−1} — use the Woodbury lemma

Total time to update the posterior: O(d²)
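A sketch of this O(d²) update, under the Laplace/Gaussian summary (µ_{N−1}, C_{N−1}) of the previous posterior and the exp-nonlinearity GLM log-likelihood sketched above. The 1-d search exploits the rank-1 structure (the new MAP moves from µ_{N−1} only along C_{N−1}~z); the covariance step is the Woodbury formula. Names and the bounded search interval are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def update_posterior(mu_prev, C_prev, z, r):
    Cz = C_prev @ z                    # O(d^2)
    zCz = z @ Cz

    # 1-d search for the new MAP: the new log-likelihood depends on theta only through
    # theta.z, so the MAP lies on mu_prev + s * C_prev z; optimize the scalar s.
    def neg_logpost(s):
        u = mu_prev @ z + s * zCz
        return 0.5 * s**2 * zCz + np.exp(u) - r * u
    s_opt = minimize_scalar(neg_logpost, bounds=(-20.0, 20.0), method="bounded").x
    mu_new = mu_prev + s_opt * Cz

    # Rank-one covariance update via the Woodbury lemma:
    # (C_prev^{-1} + b z z^T)^{-1} = C_prev - b (C_prev z)(C_prev z)^T / (1 + b z^T C_prev z)
    b = np.exp(mu_new @ z)             # curvature of -log p(r | x, theta) at the new MAP
    C_new = C_prev - np.outer(Cz, Cz) * (b / (1.0 + b * zCz))
    return mu_new, C_new               # everything above is O(d^2)
```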

Page 18

Step 3: Efficient stimulus optimization

Laplace approximation =⇒ I(θ; r|~x) ∼ E_{r|~x} log( |C_{N−1}| / |C_N| )

— this is nonlinear and difficult, but we can simplify using perturbation theory: log |I + A| ≈ trace(A).

Now we can take averages over p(r|~x) = ∫ p(r|θ, ~x) p_N(θ) dθ: a standard Fisher information calculation, given the Poisson assumption on r.

Further assuming f(.) = exp(.) allows us to compute the expectation exactly, using the m.g.f. of a Gaussian.

...finally, we want to maximize F(~x) = g(µ_N · ~x) h(~x^t C_N ~x).

Page 19

Computing the optimal ~x

max_~x g(µ_N · ~x) h(~x^t C_N ~x) increases with ||~x||₂: constraining ||~x||₂ reduces the problem to a nonlinear eigenvalue problem.

A Lagrange multiplier approach (Berkes and Wiskott, 2006) reduces the problem to a 1-d linesearch once the eigendecomposition is computed — much easier than a full d-dimensional optimization!

A rank-one update of the eigendecomposition may be performed in O(d²) time (Gu and Eisenstat, 1994).

=⇒ Computing the optimal stimulus takes O(d²) time.
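This eigendecomposition-plus-Lagrange machinery is what achieves O(d²). Purely to show the shape of the problem — this is not the slides' algorithm — here is a much cruder stand-in: projected gradient ascent on the sphere ||~x||₂ = c for an objective of the stated form F(~x) = g(µ_N·~x) h(~x^t C_N ~x), with the hypothetical choices g = exp and h = identity.

```python
import numpy as np

# Crude stand-in for the stimulus optimization: projected gradient ascent on the
# sphere ||x|| = c for F(x) = exp(mu.x) * (x^T C x).
def optimize_stimulus(mu, C, c=1.0, steps=500, lr=0.01):
    rng = np.random.default_rng(0)
    x = rng.standard_normal(mu.size)
    x *= c / np.linalg.norm(x)
    for _ in range(steps):
        Cx = C @ x
        g = np.exp(mu @ x)
        grad = g * (x @ Cx) * mu + 2.0 * g * Cx   # gradient of exp(mu.x) * x^T C x
        x = x + lr * grad
        x *= c / np.linalg.norm(x)                # project back onto the sphere
    return x
```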

Page 20

Side note: linear-Gaussian case is easy

Linear Gaussian case:

r_i = θ · ~x_i + ε_i,   ε_i ∼ N(0, σ²)

• The previous approximations are exact; instead of a nonlinear eigenvalue problem, we have a standard eigenvalue problem. There is no dependence on µ_N, just on C_N.

• The Fisher information does not depend on the observed r_i, so the optimal sequence {~x_1, ~x_2, . . .} can be precomputed: the observed responses do not change the optimal strategy.
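In this case the information gain of a norm-constrained stimulus is an increasing function of ~x^t C_N ~x alone, so the sketch is essentially one line: take the leading eigenvector of the current posterior covariance, scaled to the norm bound c (the bound itself is an assumed input).

```python
import numpy as np

# Linear-Gaussian case: the optimal stimulus is the top eigenvector of C_N, scaled to c.
def optimal_stimulus_linear_gaussian(C_N, c=1.0):
    eigvals, eigvecs = np.linalg.eigh(C_N)     # eigh: C_N is symmetric
    return c * eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
```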

Page 21

Near real-time adaptive design

[Figure: computation time in seconds (log scale) versus stimulus dimensionality, broken down into total time, diagonalization, posterior update, and the 1-d line search.]

Page 22

Gabor example

— infomax approach is an order of magnitude more efficient.

Page 23

Handling nonstationary parameters

Various sources of nonsystematic nonstationarity:

• Eye position drift

• Changes in arousal / attentive state

• Changes in health / excitability of preparation

Solution: allow diffusion in an extended Kalman filter:

~θ_{N+1} = ~θ_N + ε;   ε ∼ N(0, Q)
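The corresponding "predict" step is trivial to add to the posterior update sketched earlier: the mean is unchanged and the covariance inflates by Q before the next trial's measurement update. A minimal sketch, with the diffusion covariance Q assumed given.

```python
def diffuse(mu, C, Q):
    # Predict step: theta_{N+1} = theta_N + eps, eps ~ N(0, Q), so the posterior mean
    # is unchanged and the covariance inflates by Q before the next measurement update.
    return mu, C + Q
```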

Page 24

Nonstationary example

[Figure: images of θ_i (i = 1–100) versus trial number for the true, drifting parameters and for the estimates under the info. max. design, the info. max. design without diffusion, and a random design.]

Page 25

Asymptotic efficiency

We made a bunch of approximations; do we still achieve the correct asymptotic rate?

Recall:

• (σ²_iid)⁻¹ = E_x[I_x(θ0)]

• (σ²_info)⁻¹ = arg max_{C ∈ co(I_x(θ0))} log |C|

Page 26

Asymptotic efficiency: finite stimulus set

If |X| < ∞, computing the infomax rate is just a finite-dimensional (numerical) convex optimization over p(x).
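A minimal numerical sketch of that optimization: maximize log|Σ_x p(x) I_x(θ0)| over the probability simplex, a D-optimal design problem. Here it is solved with a classical multiplicative update rather than a general-purpose convex solver; the list of per-stimulus Fisher matrices and the iteration count are assumed inputs, and any convex optimizer would do.

```python
import numpy as np

# Maximize log|sum_x p(x) I_x(theta0)| over the simplex via the multiplicative update
# p(x) <- p(x) * trace(M(p)^{-1} I_x) / d, where M(p) is the averaged Fisher matrix.
def infomax_design(fisher, iters=500):
    m, d = len(fisher), fisher[0].shape[0]
    p = np.ones(m) / m                                   # start from the uniform design
    for _ in range(iters):
        M = sum(pi * F for pi, F in zip(p, fisher))      # averaged Fisher information
        Minv = np.linalg.inv(M)
        p *= np.array([np.trace(Minv @ F) for F in fisher]) / d
        p /= p.sum()                                     # stay on the simplex
    rate = np.linalg.slogdet(sum(pi * F for pi, F in zip(p, fisher)))[1]
    return p, rate                                       # optimal p(x) and its log|C|
```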

Page 27

Asymptotic efficiency: bounded norm case

If X = {~x : ||~x||₂ < c < ∞}, optimizing over p(x) is now infinite-dimensional, but symmetry arguments reduce this to a two-dimensional problem (Lewi et al., 2009).

— σ²_iid / σ²_opt ∼ dim(~x): infomax is most efficient in high-dimensional settings.

Page 28

Conclusions

• Three key assumptions/approximations enable real-time (O(d²)) infomax stimulus design:

— generalized linear model

— Laplace approximation

— first-order approximation of the log-determinant

• Able to deal with adaptation through spike-history terms and with nonstationarity through the Kalman formulation

• Directions: application to real data; optimizing over a sequence of stimuli {~x_t, ~x_{t+1}, . . . , ~x_{t+b}} instead of just the next stimulus ~x_t.

Page 29

References

Berkes, P. and Wiskott, L. (2006). On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 18:1868–1895.

Gu, M. and Eisenstat, S. (1994). A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15(4):1266–1276.

Lewi, J., Butera, R., and Paninski, L. (2007). Efficient active learning with generalized linear models. AISTATS.

Lewi, J., Butera, R., and Paninski, L. (2008). Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds. NIPS.

Lewi, J., Butera, R., and Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21:619–687.

Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural Computation, 17:1480–1507.