Information theoretic models of auditory coding
TRANSCRIPT
Michael S. Lewicki†
Joint work with Evan Smith‡
Departments of Computer Science† and Psychology‡
Center for the Neural Basis of Cognition, Carnegie Mellon University
We gratefully acknowledge the following support: NSF CAREER Award 0238351, NIH Training Grant MH19983
Information theoretic models of auditory coding
from Warren, 1999
The cochlea and inner ear
A simple model of auditory coding
[Figure: example auditory nerve revcor filter impulse responses; time axes in ms]
deBoer and deJongh, 1978 Carney and Yin, 1988
Auditory nerve filters can be estimated using reverse correlation (spike-triggered averaging)
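Reverse correlation can be sketched in a few lines: average the stimulus segments that immediately precede each spike, then time-reverse that average to estimate the neuron's linear filter. A minimal numpy sketch, with illustrative names and window length (not from the talk):

```python
import numpy as np

def revcor_filter(stimulus, spike_times, window=64):
    """Estimate a linear filter by spike-triggered averaging.

    stimulus    : 1-D array holding the sound waveform
    spike_times : sample indices at which spikes occurred
    window      : number of samples preceding each spike to average
    """
    segments = [stimulus[t - window:t] for t in spike_times if t >= window]
    # The average stimulus preceding a spike, time-reversed, is the
    # revcor estimate of the neuron's linear filter.
    return np.mean(segments, axis=0)[::-1]
```

On a stimulus where every spike is preceded by the same short pattern, the function recovers that pattern reversed.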
sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code
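The model stages above (filterbank → static non-linearity → stochastic spiking) can be sketched as a toy pipeline. This is a heavily simplified illustration: the half-wave rectification, the rate scaling, and the Poisson spiking here are placeholder choices, not the fitted components of a real auditory nerve model:

```python
import numpy as np

def auditory_model(sound, filters, rate_scale=50.0, rng=None):
    """Toy version of the standard model: filterbank -> static
    non-linearity -> stochastic (Poisson) spiking."""
    rng = rng or np.random.default_rng(0)
    spikes = []
    for f in filters:
        drive = np.convolve(sound, f, mode="same")      # linear filterbank
        rate = np.maximum(drive, 0.0)                   # static non-linearity
        spikes.append(rng.poisson(rate_scale * rate))   # stochastic spiking
    # population spike code: one spike-count row per filter channel
    return np.array(spikes)
```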
Models are data driven

data: revcor filter
model: gammatone

The "gammatone" function:

g(t) = a t^(n−1) e^(−bt) cos(2πft + φ)

Fitting the gammatone model to the revcor data leaves only a small residual error.
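The gammatone can be written directly as a function of time; the parameter values in this sketch are illustrative defaults, not fits to any revcor data:

```python
import numpy as np

def gammatone(t, a=1.0, n=4, b=600.0, f=1000.0, phi=0.0):
    """Gammatone impulse response:
    g(t) = a * t^(n-1) * exp(-b*t) * cos(2*pi*f*t + phi).
    a: amplitude, n: filter order, b: bandwidth parameter (1/s),
    f: center frequency (Hz), phi: phase."""
    return a * t**(n - 1) * np.exp(-b * t) * np.cos(2 * np.pi * f * t + phi)

t = np.arange(0, 0.025, 1.0 / 16000)   # 25 ms sampled at 16 kHz
g = gammatone(t)                        # onset ramp, then decaying ringing
```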
• The aim of both theories and models is to explain data
- Models are data driven.
- Theories are driven by principles.
- Theories require a definition of an ideal.
A theoretical approach
Theoretical questions:
• Why gammatones?
• Why spikes?
• How is sound coded by
the spike population?
How do we develop a theory?
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding
Spike coding
Coding efficiency
Explaining physiological data
Current directions
Efficient coding theory
• Barlow, 1961; Attneave, 1954
- main goal of sensory coding is to code signals efficiently
- sensory codes are adapted to the sensory environment
- each code “feature” should have minimal redundancy
- each feature should describe independent information
• caveats:
- applies to behaviorally relevant information
- not all redundancy is bad, e.g. when compensating for noise
Efficient coding of natural sounds
Lewicki (2002) Nat. Neurosci. 5:356-363
Goal:
Predict optimal transformation of sound waveform from statistics of the natural acoustic environment
• use simplest model: linear
• derive optimal code for sound in analysis window
• use ICA to learn optimal linear transform
x_{1:N}: the sound samples in the analysis window
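The learning step can be sketched with a minimal FastICA (tanh non-linearity, symmetric decorrelation) applied to data vectors such as windows x_{1:N}. This is a generic textbook implementation for illustration, not the exact algorithm used in Lewicki (2002):

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal FastICA. X: (n_channels, n_samples). Returns an
    unmixing matrix W such that W @ X recovers independent components
    (up to permutation, sign, and scale)."""
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    V = E @ np.diag(d ** -0.5) @ E.T          # whitening matrix
    Z = V @ X
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[0], X.shape[0]))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # fixed-point update: E[z g(w'z)] - E[g'(w'z)] w
        W = G @ Z.T / Z.shape[1] - np.diag((1.0 - G**2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt                             # symmetric decorrelation
    return W @ V
```

Applied to a mixture of sparse (e.g. Laplacian) sources, the recovered components correlate strongly with the true sources.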
What sounds to use?
• What are auditory systems adapted for doing?
- localization ⇒ environmental sounds
- communication ⇒ vocalizations
- general sound recognition ⇒ variety of sounds
• Learn codes for a variety of sound ensembles:
- non-harmonic environmental sounds (e.g. footsteps, stream sounds, etc.)
- animal vocalizations (rainforest mammals, e.g. chirps, screeches, cries, etc.)
- speech (from 100 male & female speakers from the TIMIT corpus)
Learning optimal linear transforms for natural sounds
environmental sounds vocalizations
speech
The optimal code depends on the class of sounds being encoded:
- a wavelet-like transform is best for environmental sounds
- a Fourier-like transform is best for vocalizations
- an intermediate transform is best for speech or general natural sounds
[Figure: Q_10dB vs. characteristic frequency (kHz), 0.2-20 kHz; data from Evans, 1975 and Rhode and Smith, 1985; theory predictions: + vocalizations, o speech/combined, x environmental sounds]
Theory vs. Data

Efficient coding explains auditory nerve population data

Filter sharpness: Q_10dB = f_c / w_10dB

[Figure: Q_10dB vs. center frequency (kHz), 0.2-20 kHz — theory predictions compared with data from Evans, 1975 and Rhode and Smith, 1985]
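The Q_10dB sharpness measure can be computed directly from a filter's magnitude response: find the center frequency at the peak, then divide by the bandwidth measured 10 dB below the peak. The Gaussian-shaped response in the usage example is purely illustrative:

```python
import numpy as np

def q10db(freqs, magnitude_db):
    """Filter sharpness Q_10dB = f_c / w_10dB: center frequency divided
    by the bandwidth 10 dB below the peak response.  Assumes a single,
    well-sampled peak."""
    peak = magnitude_db.max()
    fc = freqs[magnitude_db.argmax()]
    above = freqs[magnitude_db >= peak - 10.0]   # region within 10 dB of peak
    return fc / (above[-1] - above[0])
```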
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding
Coding efficiency
Explaining physiological data
Current directions
Limitations of the linear model
• it’s linear
• code is optimal only within a block, not for whole signal
• offers no explanation of phase-locking and spikes
• representation depends on the relative alignment of the signal and block
A continuous filterbank does not form an efficient code
Goal:
find a representation that is both time-relative and efficient
Efficient signal representation using time-shiftable kernels (spikes)
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)
• Each spike encodes the precise time and magnitude of a particular kernel
• Spike population forms a non-redundant signal representation
• Two important theoretical abstractions for “spikes”
- not binary
- not probabilistic
• Can convert to a population of stochastic, binary spikes
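The generative equation translates directly into code: each spike places a scaled, shifted copy of its kernel into the signal. A minimal synthesis sketch (spike triples and kernel shapes are illustrative):

```python
import numpy as np

def synthesize(spikes, kernels, n_samples):
    """Reconstruct x(t) = sum_m sum_i s_{m,i} * phi_m(t - tau_{m,i}).
    spikes  : list of (m, tau, s) triples -- kernel index, time (sample
              index), and amplitude
    kernels : list of 1-D kernel arrays phi_m"""
    x = np.zeros(n_samples)
    for m, tau, s in spikes:
        k = kernels[m]
        end = min(tau + len(k), n_samples)   # clip kernels at signal edge
        x[tau:end] += s * k[:end - tau]
    return x
```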
Figure 3: Smith, NC ms 2956
Smith and Lewicki (2005) Neural Comp. 17:19-45
The spikegram
[Figure: spikegram of a speech signal — Input, Reconstruction, and Residual panels; Kernel CF (Hz), 100-5000, vs. time (ms); with a comparison spectrogram, Frequency (Hz), over 0-800 ms]
Comparing a spike code to a spectrogram
[Figure: (a) waveform, (b) spikegram (Kernel CF, Hz, 100-5000), and (c) spectrogram (Frequency, Hz, 1000-5000) of the same 800 ms speech signal]

How do we compute the spikes?
Encoding signals with spikes
• There are many possible algorithms, with varying degrees of biological plausibility
• Here, we use a variation of Matching Pursuit (Mallat and Zhang, 1993)
- yields near optimal spike representation, but is not biologically plausible
- assume there exists a biol. plausible algorithm that achieves the same end
Spike Coding with Matching Pursuit

1. convolve signal with kernels
2. find largest peak over convolution set
3. fit signal with kernel
4. subtract kernel from signal, record spike, and adjust convolutions
5. repeat . . .
6. halt when desired fidelity is reached
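The steps above can be sketched as a greedy loop. This simplified version recomputes the full correlations each pass rather than incrementally adjusting them, and assumes unit-norm kernels, so it illustrates the logic rather than an efficient or exact implementation:

```python
import numpy as np

def matching_pursuit(x, kernels, snr_db=20.0, max_spikes=10000):
    """Greedy spike coding: repeatedly find the best-matching
    (kernel, shift), record a spike, and subtract the fitted kernel.
    Kernels are assumed unit-norm."""
    residual = x.copy()
    # residual power allowed at the desired fidelity (SNR in dB)
    target = np.sum(x**2) / 10 ** (snr_db / 10.0)
    spikes = []
    while np.sum(residual**2) > target and len(spikes) < max_spikes:
        best = None
        for m, k in enumerate(kernels):
            c = np.correlate(residual, k, mode="valid")  # projections at all shifts
            tau = int(np.argmax(np.abs(c)))
            if best is None or abs(c[tau]) > abs(best[2]):
                best = (m, tau, c[tau])
        m, tau, s = best
        residual[tau:tau + len(kernels[m])] -= s * kernels[m]  # subtract fit
        spikes.append((m, tau, s))                             # record spike
    return spikes, residual
```

On a signal built from two non-overlapping kernel instances, the loop recovers both spikes exactly and leaves a zero residual.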
"can" encoded at increasing fidelity (Input / Reconstruction / Residual; Kernel CF (Hz), 100-5000, vs. time, 0-200 ms):
- 5 dB SNR: 36 spikes, 145 sp/sec
- 10 dB SNR: 93 spikes, 379 sp/sec
- 20 dB SNR: 391 spikes, 1700 sp/sec
- 40 dB SNR: 1285 spikes, 5238 sp/sec
Varying the number of gammatone kernels

[Figure 6, Smith and Lewicki: The number of kernel functions affects both the spectral resolution and the temporal sparseness of the spike codes. The input signal (top) was encoded using matching pursuit with 8, 16, 32, or 64 kernel functions (A-D, respectively). The total number of spikes in each is (A) 12011, (B) 1167, (C) 497, and (D) 479.]

- 8 kernels, 12011 spikes
- 16 kernels, 1167 spikes
- 32 kernels, 497 spikes
- 64 kernels, 479 spikes
Coding efficiency in terms of spikes
[Figure: SNR (dB) vs. cost (spikes/sec), 100-10000: Optimized Matching Pursuit, Optimized Filter-Threshold, Matching Pursuit, and Filter-Threshold; shown for 8 and 256 filters]
Efficient auditory coding with optimized kernel shapes
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)
Adapt algorithm of Olshausen (2002)
Smith and Lewicki (2006) Nature 439:978-982
What are the optimal kernel shapes?
Optimizing the probabilistic model
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) + ε(t),   ε(t) ~ N(0, σ_ε)

p(x|φ) = ∫ p(x|φ, s, τ) p(s) p(τ) ds dτ ≈ p(x|φ, ŝ, τ̂) p(ŝ) p(τ̂)

Learning (Olshausen, 2002):

∂/∂φ_m log p(x|φ) ≈ ∂/∂φ_m [ log p(x|φ, ŝ, τ̂) + log p(ŝ) p(τ̂) ]
  = −(1/2σ_ε) ∂/∂φ_m [ x − Σ_{m=1}^{M} Σ_{i=1}^{n_m} ŝ_i^m φ_m(t − τ̂_i^m) ]²
  = (1/σ_ε) [ x − x̂ ] Σ_i ŝ_i^m
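One learning step correlates the residual with each spike's amplitude at the spike's position. The sketch below is a simplified reading of the Olshausen (2002)-style rule: the inferred spikes are held fixed during the step, and the unit-norm renormalization is a common practical addition, not necessarily the talk's exact procedure:

```python
import numpy as np

def gradient_step(x, kernels, spikes, lr=0.5):
    """One gradient step on the kernel shapes.  `spikes` holds
    (m, tau, s) triples: kernel index, time (sample), amplitude."""
    # reconstruction x_hat from the current spikes and kernels
    xhat = np.zeros_like(x)
    for m, tau, s in spikes:
        xhat[tau:tau + len(kernels[m])] += s * kernels[m]
    residual = x - xhat
    new_kernels = [k.copy() for k in kernels]
    for m, tau, s in spikes:
        # gradient: residual at the spike's position, scaled by amplitude
        seg = residual[tau:tau + len(kernels[m])]
        new_kernels[m][:len(seg)] += lr * s * seg
    # renormalize so amplitudes stay identifiable
    return [k / np.linalg.norm(k) for k in new_kernels]
```

Iterating this step on a signal generated by a known kernel pulls a randomly initialized kernel toward the true shape.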
Also adapt kernel lengths
Kernel functions are initialized to random vectors
Kernel functions optimized for coding speech
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency
Explaining physiological data
Current directions
Quantifying coding efficiency
1. fit signal
2. quantize time and amplitude values
3. prune zero values
4. measure coding efficiency using the entropy of quantized values
5. reconstruct signal using quantized values
6. measure fidelity using signal-to-noise ratio (SNR) of residual error
• identical procedure for other codes (e.g. Fourier and wavelet)
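Steps 2-4 can be sketched for the spike amplitudes (the same procedure applies to the quantized time values); the bin count is an illustrative choice, and a non-degenerate amplitude range is assumed:

```python
import numpy as np

def code_cost_bits(amplitudes, n_bins=64):
    """Estimate coding cost from the entropy of quantized spike
    amplitudes.  Returns total bits = (bits per spike) * (n spikes)."""
    edges = np.linspace(amplitudes.min(), amplitudes.max(), n_bins + 1)
    q = np.digitize(amplitudes, edges[1:-1])        # quantize into bins
    counts = np.bincount(q, minlength=n_bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))               # bits per spike
    return entropy * len(amplitudes)                # total bits
```

For example, 16 amplitudes split evenly between two distinct values cost 1 bit per spike, i.e. 16 bits in total.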
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)

[Figure: spike code of a test signal — original, reconstruction, and residual error; Kernel CF (Hz), 100-5000, vs. time, 0-200 ms]
Coding efficiency curves
[Figure: SNR (dB), 0-40, vs. rate (Kbps), 0-90: adapted spike code, gammatone spike code, wavelet block code, and Fourier block code]
4x more efficient
+14 dB
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
Current directions
Using efficient coding theory to make theoretical predictions
Natural Sound Environment
optimal kernels: properties, coding efficiency
physiological data: auditory nerve filter shapes, population trends
evolution
Michael S. Lewicki · Carnegie Mellon · Bad Zwischenahn · Aug 21, 2004
A simple model of auditory coding

[Figure: example revcor filters; time axes in ms; deBoer and deJongh, 1978; Carney and Yin, 1988]

auditory revcor filters: gammatones

sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code
More theoretical questions:
• Why gammatones?
• Why spikes?
• How is sound coded by
the spike population?
How do we develop a theory?
Comparing a spike code to a spectrogram

[Figure: (a) waveform, (b) spikegram (Kernel CF, Hz, 100-5000), and (c) spectrogram (Frequency, Hz, 1000-5000), over 0-800 ms]
[Figure: Bandwidth (kHz) vs. Center Frequency (kHz), log-log over 0.1-5 kHz: speech prediction overlaid on auditory nerve filters]
theory
We only compare to the data after optimizing; we do not fit the data.

The prediction depends on the sound ensemble.
vocalizations · environmental sounds
transient · ambient
Natural sounds
fox
squirrel
walking on leaves
rustling leaves
cracking branches
stream by waterfall
Learned kernels share features of auditory nerve filters
Optimized kernels (scale bar = 1 msec)
Auditory nerve filters (from Carney, McDuffy, and Shekhter, 1999)
Learned kernels closely match individual auditory nerve filters
For each kernel, find the closest matching auditory nerve filter in Laurel Carney's database of ~100 filters.
Learned kernels overlaid on selected auditory nerve filters
For almost all learned kernels there is a closely matching auditory nerve filter.
Spike kernels for the natural sound mix match revcor filters
Optimal kernels for environmental sounds are very short
Spike kernels for vocalizations are much longer and symmetric
Comparing learned kernels to the auditory nerve population

[Figures: Bandwidth (kHz) vs. Center Frequency (kHz), log-log over 0.1-5 kHz; legend: Revcor, Environ, Vocal, Natural/Speech]

Population distribution of kernels for natural sounds
Population distribution of kernels for environmental sounds
Population distribution of kernels for animal vocalizations
Kernel distributions for different sound ensembles
Population distribution of kernels for speech (TIMIT)
Speech matches composition of natural sounds
ambient environmental sounds
animal vocalizations
transient environmental sounds
Best mix for predicting auditory coding: 1.0 : 0.8 : 1.2
speech
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
● theory explains gammatone revcor shapes ● also explains population trends
Current directions
Coding of a speech consonant
[Figure: (a) Input, Reconstruction, and Residual spikegrams, Kernel CF (Hz), 100-5000; (b) detail at 38-40 ms, 3400-4800 Hz; (c) spectrogram, Frequency (Hz), 10-60 ms]
How is this achieving an efficient, time-relative code?
[Figure: (a) waveform and (b) spikegram, Kernel CF (Hz), 100-5000, over 0-100 ms]
Time-relative coding of glottal pulses
Learning higher-order structure
[Figure: (a) spikegram, Kernel CF (Hz) vs. time (msec), 0-120 ms; (b) autocorrelation, correlogram, and spike-triggered periodicity vs. lag (−20 to 0 msec); (c) spike interval alignment]
Low-frequency kernels do not precisely match the signal period

[Figures: original, reconstruction (at 3, 4, 5, 6, and 7 dB SNR), and residual; Filter CF (Hz), 100-5000, vs. time, 300-330 ms]
Hierarchical spike models (M S Lewicki)

Figure 2: Spikegram of the 't' in 'vietnamese' at 30 dB SNR.

Figure 3: Spikegram of the onset in 'can' at fidelities of 20, 30, and 40 dB SNR. As the fidelity increases, the high-frequency structure of the consonant becomes more prominent.

Figure 4: Spikegram of the vowel in 'can' at fidelities of 5, 10, and 20 dB SNR.

Figure 5: Spikegram of the /a/ vowel in 'vietnamese' at fidelities of 15, 20, and 25 dB SNR.
Learning general higher-order acoustic structure
Non-stationary statistical regularity in acoustic features
[Figure: (a-c) coefficient values (−8 to 4) over samples 1-256 in three signal regions R1, R2, R3, showing non-stationary variance structure]
Karklin and Lewicki (2005) Neural Comp. 17:397-423
Generalizing the standard ICA model

Standard ICA prior:

P(s) = Π_i P(s_i),   P(s_i) ∝ exp( −| s_i / λ_i |^{q_i} )

Generalization — let higher-order variables v set the scales via log λ_i = [Bv]_i:

P(u_i | λ_i) ∝ exp( −| u_i / λ_i |^{q_i} )

−log p(u | B, v) ∝ Σ_i | u_i / exp([Bv]_i) |^{q_i}
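The generalized density can be evaluated directly: the higher-order variables v set each component's scale through the basis B. In this sketch q is a scalar for simplicity, though the model allows a per-component q_i:

```python
import numpy as np

def neg_log_density(u, B, v, q=1.0):
    """-log p(u | B, v), up to an additive constant:
    sum_i | u_i / exp([Bv]_i) |^q.
    The higher-order variables v set the scales lambda_i = exp([Bv]_i)."""
    lam = np.exp(B @ v)
    return np.sum(np.abs(u / lam) ** q)
```

With B the identity and v = 0, every scale is 1 and the cost reduces to the standard Laplacian-style sum of |u_i|; a nonzero v rescales individual components.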
Independent density components
[Figure: density component B_j and the conditional densities P(u | v_j = 2.0), P(u | v_j = 0.0), P(u | v_j = −2.0)]
Density components of speech

Figure 7 (Karklin and Lewicki): A subset of density components of speech. The weights in a column of B are plotted as shaded patches in one of the nine panels. Each patch is placed according to the temporal and frequency distribution of the associated linear basis function and shaded according to the value of the weight: white indicates positive weights, black negative weights, and gray weights close to zero. The axes represent time (0 to 16 msec, horizontal) and frequency (0 to 8 kHz, vertical). The density components form a distributed representation of the frequency of the signal and the location of energy within the sample window. Density components coding for multiple frequencies might capture harmonic regularities in the speech signal.
Higher-level representations show invariant properties
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
● theory explains gammatone revcor shapes ● also explains population trends
Current directions ● hierarchical models, higher-order structure
Q: (I. J. Good)
I cannot help wondering whether it is not largely a
prejudice to analyse signals in terms of frequency.
A: (D. Gabor)
There are in fact two good reasons for this
preference:
1. In communication problems we deal usually with
the infinite or semi-infinite time-axis, and
2. the problems are usually homogenous in time.
Once one or the other of these conditions is dropped,
it may be well worth while to carry out the analysis
in terms of other functions.
Symposium on Information Theory, Imperial College London, 1950