Source: http://www.stat.columbia.edu/~yiannis/class/HOS/LP2.pdf
Efficient adaptive experimental design
Liam Paninski
Department of Statistics and Center for Theoretical Neuroscience
Columbia University
http://www.stat.columbia.edu/~liam
March 12, 2009
Avoiding the curse of insufficient data
1: Estimate some functional f(p) instead of full joint
distribution p(r, s)
— information-theoretic functionals
2: Improved nonparametric estimators
— minimax theory for discrete distributions under KL loss
3: Select stimuli more efficiently
— optimal experimental design
(4: Parametric approaches)
Setup
Assume:
• parametric model pθ(r|~x) on responses r given inputs ~x
• prior distribution p(θ) on finite-dimensional model space
Goal: estimate θ from experimental data
Usual approach: draw stimuli i.i.d. from fixed p(~x)
Adaptive approach: choose p(~x) on each trial to maximize
E~xI(θ; r|~x) (e.g. “staircase” methods).
Note: Optimizing p(~x) =⇒ optimizing ~x
E~xI(θ; r|~x) = H(θ) − E~xH(θ|r, ~x).
Best p(~x) places all mass on points ~x that minimize H(θ|r, ~x).
So our problem really reduces to arg max~x I(θ; r|~x)
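To make the reduction concrete, here is a minimal per-trial sketch (the logistic model and all grids are illustrative, not from the slides): discretize θ, and for each candidate x evaluate I(θ; r|x) = H(r|x) − EθH(r|x, θ) for a binary response r.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def entropy_bernoulli(p):
    # binary entropy in nats, safe at p = 0 or 1
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def infomax_stimulus(theta_grid, prior, x_candidates):
    """Return the x maximizing I(theta; r | x) for a binary-response model.

    I(theta; r | x) = H(r | x) - E_theta[H(r | x, theta)],
    with the hypothetical model p(r = 1 | x, theta) = sigmoid(x - theta).
    """
    best_x, best_info = None, -np.inf
    for x in x_candidates:
        p_r_given_theta = sigmoid(x - theta_grid)   # p(r=1 | x, theta)
        p_r = np.sum(prior * p_r_given_theta)       # marginal p(r=1 | x)
        info = entropy_bernoulli(p_r) - np.sum(prior * entropy_bernoulli(p_r_given_theta))
        if info > best_info:
            best_x, best_info = x, info
    return best_x, best_info

theta_grid = np.linspace(0.0, 10.0, 101)
prior = np.full(theta_grid.size, 1.0 / theta_grid.size)   # uniform prior
best_x, best_info = infomax_stimulus(theta_grid, prior, np.linspace(0.0, 10.0, 101))
```

With a symmetric prior the most informative stimulus sits at the center of the prior mass, as expected.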
Snapshot: one-dimensional simulation
[Figure: psychometric curve p(y = 1 | x, θ0) vs. x; information I(y; θ | x) vs. x (×10⁻³ scale); posterior p(θ) at trial 100, optimized vs. i.i.d. sampling.]
Asymptotic result
Under regularity conditions, a posterior CLT holds
(Paninski, 2005):
pN(√N(θ − θ0)) → N(µN, σ²); µN ∼ N(0, σ²)

• (σ²_iid)⁻¹ = Ex[Ix(θ0)]

• (σ²_info)⁻¹ = arg max over C ∈ co(Ix(θ0)) of log |C|

⇒ σ²_iid > σ²_info unless Ix(θ0) is constant in x

co(Ix(θ0)) = convex closure (over x) of the Fisher information matrices Ix(θ0). (log |C| is strictly concave, so the maximum is unique.)
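A scalar sanity check of the theorem, under an assumed information profile: in one dimension co(Ix(θ0)) is just the interval [minx Ix, maxx Ix], so maximizing log |C| picks the largest Fisher information, while i.i.d. sampling gets the average.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)
I_x = 1.0 / (1.0 + xs**2)        # assumed Fisher-information profile I_x(theta0)

prec_iid = I_x.mean()            # i.i.d. sampling: precision is the average info
prec_info = I_x.max()            # infomax: best element of co(I_x) in 1-d

sigma2_iid = 1.0 / prec_iid
sigma2_info = 1.0 / prec_info
```

Since I_x is not constant in x here, σ²_iid is strictly larger than σ²_info.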
Illustration of theorem
[Figure: simulation traces over trials 10–100: posterior mean E(p), posterior spread σ(p) (log scale), and posterior mass P(θ0) at the true parameter, as a function of trial number.]
Technical details
Stronger regularity conditions than usual to prevent “obsessive”
sampling and ensure consistency.
Significant complication: exponential decay of posteriors pN off
of neighborhoods of θ0 does not necessarily hold.
Psychometric example
• stimuli x one-dimensional: intensity
• responses r binary: detect/no detect
p(r = 1|x, θ) = f((x − θ)/a)
• scale parameter a (assumed known)
• want to learn threshold parameter θ as quickly as possible
[Figure: sigmoidal psychometric function p(1 | x, θ), rising from 0 to 1 as x passes the threshold θ.]
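The setup above is easy to simulate end to end. A minimal sketch with hypothetical numbers (logistic f, a = 1, true θ0 = 6.3): grid posterior over θ, Bayesian update after each binary response, and each stimulus placed at the current posterior mean, a cheap stand-in for the infomax point when f is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.0, 10.0, 201)
a = 1.0                                  # known scale parameter
theta0 = 6.3                             # true threshold (hypothetical)

def f(u):                                # logistic psychometric function
    return 1.0 / (1.0 + np.exp(-u))

def bayes_update(post, x, r):
    like = f((x - theta_grid) / a)       # p(r = 1 | x, theta) on the grid
    post = post * (like if r else 1.0 - like)
    return post / post.sum()

post = np.ones_like(theta_grid) / theta_grid.size     # uniform prior
for _ in range(100):
    x = np.sum(post * theta_grid)        # stimulus at posterior mean (infomax proxy)
    r = rng.random() < f((x - theta0) / a)            # simulated observer
    post = bayes_update(post, x, r)

mean = np.sum(post * theta_grid)
sd = np.sqrt(np.sum(post * (theta_grid - mean) ** 2))
```

After 100 adaptive trials the posterior has concentrated tightly around the true threshold.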
Psychometric example: results
• variance-minimizing and info-theoretic methods
asymptotically same
• there is a unique function f* for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt

Ix(θ) = (f′a,θ)² / [fa,θ(1 − fa,θ)]

• f* solves f′a,θ = c √(fa,θ(1 − fa,θ)), giving

f*(t) = (sin(ct) + 1)/2

• σ²_iid/σ²_opt ∼ 1/a for small a
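The f* claim can be checked numerically: plugging f*(t) = (sin(ct) + 1)/2 into the Bernoulli Fisher information (f′)²/[f(1 − f)] gives the constant c², so stimulus placement is irrelevant and i.i.d. sampling matches infomax. A quick verification (the value of c is arbitrary):

```python
import numpy as np

c = 0.7
t = np.linspace(-1.0, 1.0, 401)          # region where f* stays strictly in (0, 1)
f = (np.sin(c * t) + 1.0) / 2.0
fprime = c * np.cos(c * t) / 2.0

# Fisher information of a single Bernoulli trial with success probability f(t):
I = fprime**2 / (f * (1.0 - f))
```

Algebraically, f(1 − f) = cos²(ct)/4, which exactly cancels the cos²(ct) in (f′)², leaving I ≡ c².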
Open directions
In the smooth loglikelihood case we still get a √N convergence rate (albeit with a smaller asymptotic variance than the standard i.i.d. rate).

In the discontinuous loglikelihood case we can have exponential convergence (e.g., the 20-questions game).

Question: more generally, when does infomax lead to a faster-than-√N convergence rate?
Part 2: Computing the optimal stimulus
OK, now how do we actually do this in neural case?
• Computing I(θ; r|~x) requires an integration over θ
— in general, exponentially hard in dim(θ)
• Maximizing I(θ; r|~x) in ~x is doubly hard
— in general, exponentially hard in dim(~x)
Doing all this in real time (∼ 10 ms - 1 sec) is a major challenge!
Joint work w/ J. Lewi (Lewi et al., 2007; Lewi et al., 2008; Lewi et al., 2009)
Three key steps
1. Choose a tractable, flexible model of neural encoding
2. Choose a tractable, accurate approximation of the posterior
p(~θ|{~xi, ri}i≤N)
3. Use approximations and some perturbation theory to reduce
optimization problem to a simple 1-d linesearch
Step 1: focus on GLM case
ri ∼ Poiss(λi); λi|~xi, ~θ = f(~k · ~xi + Σj aj ri−j).
More generally, log p(ri|θ, ~xi) = k(r)f(θ · ~xi) + s(r) + g(θ · ~xi)
Goal: learn ~θ = {~k,~a} in as few trials as possible.
GLM likelihood
ri ∼ Poiss(λi); λi|~xi, ~θ = f(~k · ~xi + Σj aj ri−j)

log p(ri|~xi, ~θ) = −f(~k · ~xi + Σj aj ri−j) + ri log f(~k · ~xi + Σj aj ri−j) + const.
Two key points:
• Likelihood is “rank-1” — only depends on ~θ along ~z = (~x, ~r).
• f convex and log-concave =⇒ log-likelihood concave in ~θ
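Both points are cheap to verify for the canonical f = exp case. A sketch on simulated data (no spike-history terms, for brevity): the per-trial log-likelihood reduces to −exp(u) + r·u with u = ~θ · ~xi, and concavity in ~θ shows up as nonpositive second differences of the log-likelihood along any line.

```python
import numpy as np

def glm_loglik(theta, X, r):
    """Poisson GLM log-likelihood with f = exp (dropping the log r! constant)."""
    u = X @ theta                        # depends on theta only through theta . x_i
    return np.sum(-np.exp(u) + r * u)

rng = np.random.default_rng(1)
X = 0.5 * rng.standard_normal((50, 3))   # hypothetical stimuli
theta_true = np.array([0.5, -0.3, 0.2])
r = rng.poisson(np.exp(X @ theta_true))  # simulated spike counts

# Concavity along a random line: g(s) = loglik(theta + s d) should be concave,
# i.e. its second differences are nonpositive.
theta, d = rng.standard_normal(3), rng.standard_normal(3)
s = np.linspace(-1.0, 1.0, 9)
g = np.array([glm_loglik(theta + si * d, X, r) for si in s])
second_diff = g[:-2] - 2.0 * g[1:-1] + g[2:]
```

Analytically, g″(s) = −Σi (di·xi)² exp(·) ≤ 0, which is what the second differences reflect.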
Step 2: representing the posterior
Idea: Laplace approximation
p(~θ|{~xi, ri}i≤N) ≈ N (µN , CN)
Justification:
• posterior CLT
• likelihood is log-concave, so posterior is also log-concave:
log p(~θ|{~xi, ri}i≤N) ∼ log p(~θ|{~xi, ri}i≤N−1) + log p(rN |xN , ~θ)
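A minimal Laplace-approximation sketch for this setting (assumed Gaussian prior and exp nonlinearity; not the authors' code): Newton's method finds the posterior mode, which is safe because the log-posterior is concave, and CN is the inverse negative Hessian at the mode.

```python
import numpy as np

def laplace_posterior(X, r, prior_var=4.0, iters=20):
    """Gaussian (Laplace) approximation N(mu, C) to the GLM posterior.

    Assumed model: r_i ~ Poisson(exp(x_i . theta)), theta ~ N(0, prior_var I).
    """
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):                    # Newton ascent on the log-posterior
        lam = np.exp(X @ theta)
        grad = X.T @ (r - lam) - theta / prior_var
        H = -X.T @ (lam[:, None] * X) - np.eye(d) / prior_var   # negative definite
        theta = theta - np.linalg.solve(H, grad)
    lam = np.exp(X @ theta)                   # Hessian at the converged mode
    H = -X.T @ (lam[:, None] * X) - np.eye(d) / prior_var
    return theta, np.linalg.inv(-H)

rng = np.random.default_rng(2)
X = 0.5 * rng.standard_normal((200, 3))
theta_true = np.array([0.8, -0.5, 0.3])
r = rng.poisson(np.exp(X @ theta_true))
mu, C = laplace_posterior(X, r)
```

With 200 simulated trials the mode lands close to the generating parameters and C is a valid (symmetric positive definite) covariance.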
Efficient updating
Updating µN: one-d search

Updating CN: rank-one update, CN = (C⁻¹N−1 + b ~z ~zᵗ)⁻¹ — use Woodbury lemma

Total time for update of posterior: O(d²)
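The covariance update is the Sherman–Morrison special case of the Woodbury lemma. A sketch with a hypothetical b and ~z, checked against the direct O(d³) inverse:

```python
import numpy as np

def rank_one_update(C, z, b):
    """Compute C_N = (C_{N-1}^{-1} + b z z^T)^{-1} in O(d^2) via Sherman-Morrison."""
    Cz = C @ z
    return C - np.outer(Cz, Cz) * (b / (1.0 + b * (z @ Cz)))

rng = np.random.default_rng(3)
d = 5
A = rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)                  # a valid covariance matrix
z = rng.standard_normal(d)
b = 0.7

fast = rank_one_update(C, z, b)          # no matrix inverse needed
slow = np.linalg.inv(np.linalg.inv(C) + b * np.outer(z, z))   # O(d^3) reference
```

Only matrix-vector products and one outer product are needed, which is what keeps the whole posterior update at O(d²).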
Step 3: Efficient stimulus optimization
Laplace approximation =⇒ I(θ; r|~x) ∼ E_{r|~x} log(|CN−1| / |CN|)

— this is nonlinear and difficult, but we can simplify using perturbation theory: log |I + A| ≈ trace(A).

Now we can take averages over p(r|~x) = ∫ p(r|θ, ~x) pN(θ) dθ: a standard Fisher information calculation given the Poisson assumption on r.
Further assuming f(.) = exp(.) allows us to compute
expectation exactly, using m.g.f. of Gaussian.
...finally, we want to maximize F(~x) = g(µN · ~x) h(~xᵗ CN ~x).
Computing the optimal ~x
max~x g(µN · ~x) h(~xᵗCN~x) increases with ||~x||₂: constraining ||~x||₂ reduces the problem to a nonlinear eigenvalue problem.
Lagrange multiplier approach (Berkes and Wiskott, 2006)
reduces problem to 1-d linesearch, once eigendecomposition is
computed — much easier than full d-dimensional optimization!
Rank-one update of eigendecomposition may be performed in
O(d2) time (Gu and Eisenstat, 1994).
=⇒ Computing optimal stimulus takes O(d2) time.
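In two dimensions the constraint set is a circle, so the search really is one-dimensional and we can maximize F(~x) by direct linesearch in angle (toy µ and C, with g = exp and h the identity, all assumptions of this sketch; in general d the Lagrange-multiplier/eigendecomposition route above is what makes the search 1-d):

```python
import numpy as np

mu = np.array([1.0, 0.2])                 # hypothetical posterior mean
C = np.array([[2.0, 0.3],
              [0.3, 0.5]])                # hypothetical posterior covariance

def F_of(x):
    # objective of the form g(mu . x) h(x^T C x), with g = exp, h = identity
    return np.exp(x @ mu) * (x @ C @ x)

# Parametrize the unit circle ||x|| = 1 and scan the angle: a literal 1-d search.
phi = np.linspace(0.0, 2.0 * np.pi, 2000)
cand = np.stack([np.cos(phi), np.sin(phi)], axis=1)
vals = np.exp(cand @ mu) * np.einsum('ij,jk,ik->i', cand, C, cand)
x_opt = cand[np.argmax(vals)]
```

The optimum trades off alignment with µN (which boosts the expected rate) against alignment with the top eigenvector of CN (the most uncertain direction).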
Side note: linear-Gaussian case is easy
Linear Gaussian case:
ri = θ · ~xi + εi, εi ∼ N(0, σ²)
• Previous approximations are exact; instead of nonlinear
eigenvalue problem, we have standard eigenvalue problem.
No dependence on µN , just CN .
• Fisher information does not depend on observed ri, so
optimal sequence {~x1, ~x2, . . .} can be precomputed, since
observed ri do not change optimal strategy.
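This is worth seeing concretely. For unit-norm stimuli the information gain per trial is ½ log(1 + ~xᵗC~x/σ²), so the greedy optimal stimulus is the top eigenvector of the current posterior covariance, and the loop below (toy prior covariance) never consults a response:

```python
import numpy as np

sigma2 = 1.0
C = np.diag([4.0, 2.0, 1.0])              # hypothetical prior covariance

stimuli = []
for _ in range(3):
    w, V = np.linalg.eigh(C)
    x = V[:, np.argmax(w)]                # probe the most uncertain direction
    stimuli.append(x)
    # posterior covariance update: responses r never enter
    C = np.linalg.inv(np.linalg.inv(C) + np.outer(x, x) / sigma2)
```

With this prior the precomputed sequence probes the coordinate axes in order of decreasing prior uncertainty.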
Near real-time adaptive design
[Figure: computation time (seconds, log scale from 10⁻³ to 10⁻¹) vs. stimulus dimensionality (up to ~600), broken down into total time, diagonalization, posterior update, and 1-d linesearch.]
Gabor example
— infomax approach is an order of magnitude more efficient.
Handling nonstationary parameters
Various sources of nonsystematic nonstationarity:
• Eye position drift
• Changes in arousal / attentive state
• Changes in health / excitability of preparation
Solution: allow diffusion in extended Kalman filter:
~θN+1 = ~θN + ε; ε ∼ N(0, Q)
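The diffusion term's effect shows up already in a tiny sketch (hypothetical Q and a single repeatedly probed direction, Gaussian approximations throughout): without Q the posterior variance of the probed coordinate collapses toward zero, while with Q it settles at a positive steady state, so the filter keeps tracking drift.

```python
import numpy as np

Q = 0.01 * np.eye(2)                       # assumed diffusion covariance
x = np.array([1.0, 0.0])                   # repeatedly probed stimulus direction

C_diff, C_static = np.eye(2), np.eye(2)
for _ in range(1000):
    C_diff = C_diff + Q                    # predict step: theta_{N+1} = theta_N + eps
    obs = np.outer(x, x)                   # unit-Fisher-information measurement
    C_diff = np.linalg.inv(np.linalg.inv(C_diff) + obs)
    C_static = np.linalg.inv(np.linalg.inv(C_static) + obs)   # no diffusion
```

The static posterior freezes (variance ≈ 1/N), while the diffusing filter keeps a floor of uncertainty on the probed coordinate and a growing one on the unprobed coordinate.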
Nonstationary example
[Figure: components θi tracked over 800 trials; panels: true θ, info. max., info. max. without diffusion, and random stimuli.]
Asymptotic efficiency
We made a bunch of approximations; do we still achieve correct
asymptotic rate?
Recall:
• (σ²_iid)⁻¹ = Ex[Ix(θ0)]

• (σ²_info)⁻¹ = arg max over C ∈ co(Ix(θ0)) of log |C|
Asymptotic efficiency: finite stimulus set
If |X | < ∞, computing infomax rate is just a finite-dimensional
(numerical) convex optimization over p(x).
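With rank-one Fisher informations Ix = ~x~xᵗ (an assumption of this sketch), this is classical D-optimal design, and the standard multiplicative update p(x) ← p(x) · ~xᵗM(p)⁻¹~x / d, with M(p) = Σx p(x) Ix, solves the convex problem:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]])   # candidate stimuli
n, d = X.shape
p = np.full(n, 1.0 / n)                   # start from the uniform design

for _ in range(2000):
    M = (X * p[:, None]).T @ X            # M(p) = sum_x p(x) x x^T
    w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(M), X)   # x^T M(p)^{-1} x
    p = p * w / d                         # multiplicative D-optimal update

M = (X * p[:, None]).T @ X
w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(M), X)
logdet_opt = np.linalg.slogdet(M)[1]
logdet_unif = np.linalg.slogdet(X.T @ X / n)[1]
```

At the optimum the equivalence theorem says max_x ~xᵗM⁻¹~x = d, which doubles as a convergence check.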
![Page 27: Liam Paninski - Columbia Universitystat.columbia.edu/~yiannis/class/HOS/LP2.pdf · 1: Estimate some functional f(p) instead of full joint distribution p(r,s) — information-theoretic](https://reader033.vdocuments.us/reader033/viewer/2022050504/5f95e34238698b4000615c21/html5/thumbnails/27.jpg)
Asymptotic efficiency: bounded norm case
If X = {~x : ||~x||2 < c < ∞}, optimizing over p(x) is now
infinite-dimensional, but symmetry arguments reduce this to a
two-dimensional problem (Lewi et al., 2009).
— σ²_iid/σ²_opt ∼ dim(~x): infomax is most efficient in high-d cases
Conclusions
• Three key assumptions/approximations enable real-time
(O(d2)) infomax stimulus design:
— generalized linear model
— Laplace approximation
— first-order approximation of log-determinant
• Able to deal with adaptation through spike history terms
and nonstationarity through Kalman formulation
• Directions: application to real data; optimizing over
sequence of stimuli {~xt, ~xt+1, . . . ~xt+b} instead of just next
stimulus ~xt.
References
Berkes, P. and Wiskott, L. (2006). On the analysis and interpretation of inhomogeneous
quadratic forms as receptive fields. Neural Computation, 18:1868–1895.
Gu, M. and Eisenstat, S. (1994). A stable and efficient algorithm for the rank-one
modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl.,
15(4):1266–1276.
Lewi, J., Butera, R., and Paninski, L. (2007). Efficient active learning with generalized
linear models. AISTATS 2007.
Lewi, J., Butera, R., and Paninski, L. (2008). Designing neurophysiology experiments to
optimally constrain receptive field models along parametric submanifolds. NIPS.
Lewi, J., Butera, R., and Paninski, L. (2009). Sequential optimal design of
neurophysiology experiments. Neural Computation, 21:619–687.
Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design.
Neural Computation, 17:1480–1507.