
Michael S. Lewicki†

Joint work with Evan Smith‡

Departments of Computer Science† and Psychology‡

Center for the Neural Basis of Cognition
Carnegie Mellon University

We gratefully acknowledge the following support:
NSF CAREER Award 0238351
NIH Training Grant MH19983

Information theoretic models of auditory coding

from Warren, 1999

The cochlea and inner ear

A simple model of auditory coding

[Figure: example auditory nerve revcor filters, plotted against time (ms)]

deBoer and deJongh, 1978; Carney and Yin, 1988

Auditory nerve filters can be estimated using reverse correlation (spike-triggered averaging)

sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code

Models are data driven

data: revcor filter

g(t) = a t^(n−1) e^(−bt) cos(2πf t + φ)

model: gammatone

“gammatone” function
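As a quick illustration, the gammatone above can be generated directly (a minimal numpy sketch; the parameter values here are arbitrary placeholders, not fitted revcor values):

```python
import numpy as np

def gammatone(t, a=1.0, n=4, b=600.0, f=1000.0, phi=0.0):
    """g(t) = a * t^(n-1) * exp(-b t) * cos(2 pi f t + phi).

    t in seconds; n is the filter order, b sets the envelope decay
    (bandwidth), f the center frequency in Hz, phi the phase."""
    t = np.asarray(t, dtype=float)
    return a * t ** (n - 1) * np.exp(-b * t) * np.cos(2 * np.pi * f * t + phi)

# 10 ms of a 1 kHz gammatone at a 16 kHz sampling rate
fs = 16000
t = np.arange(160) / fs
g = gammatone(t)
```

The gamma-function envelope t^(n−1) e^(−bt) starts at zero, rises, and decays, which is what gives the fitted revcor filters their characteristic onset ramp.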

model fit

residual error

• The aim of both theories and models is to explain data

- Models are data driven.

- Theories are driven by principles.

- Theories require a definition of an ideal.

A theoretical approach

Theoretical questions:

• Why gammatones?

• Why spikes?

• How is sound coded by

the spike population?

How do we develop a theory?

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding

Spike coding

Coding efficiency

Explaining physiological data

Current directions

Efficient coding theory

• Barlow, 1961; Attneave, 1954

- main goal of sensory coding is to code signals efficiently

- sensory codes are adapted to the sensory environment

- each code “feature” should have minimal redundancy

- each feature should describe independent information

• caveats:

- applies to behaviorally relevant information

- not all redundancy is bad, e.g. when compensating for noise

Efficient coding of natural sounds

Lewicki (2002) Nat. Neurosci. 5:356-363

Goal:

Predict optimal transformation of sound waveform from statistics of the natural acoustic environment

• use simplest model: linear

• derive optimal code for sound in analysis window

• use ICA to learn optimal linear transform

x(1:N): the sound waveform within the analysis window
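The learning step can be sketched in numpy alone (illustrative only: smoothed noise stands in for the natural-sound ensemble, and a bare-bones symmetric FastICA replaces whatever ICA variant was actually used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sound ensemble: a long waveform cut into analysis windows.
fs, win, k = 8000, 64, 16
signal = np.convolve(rng.standard_normal(fs * 2), np.ones(8) / 8, mode="same")
X = np.array([signal[i:i + win] for i in range(0, len(signal) - win, win // 2)])
X = X - X.mean(axis=0)

# Whiten via SVD, keeping k components.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scale = S[:k] / np.sqrt(len(X))
Z = (X @ Vt[:k].T) / scale                    # whitened windows, shape (N, k)

# Minimal symmetric FastICA (tanh nonlinearity): find the rotation that
# makes the per-window coefficients maximally independent.
W = np.linalg.qr(rng.standard_normal((k, k)))[0]
for _ in range(200):
    G = np.tanh(Z @ W)
    W_new = (Z.T @ G) / len(Z) - W * (1 - G ** 2).mean(axis=0)
    u, _, vt = np.linalg.svd(W_new)           # symmetric decorrelation
    W = u @ vt

codes = Z @ W                                 # independent coefficients
filters = (W.T * scale) @ Vt[:k]              # rows: learned basis functions
```

The rows of `filters` play the role of the learned linear transform; trained on real environmental sounds, vocalizations, or speech rather than noise, they are what the following slides compare across ensembles.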

What sounds to use?

• What are auditory systems adapted for doing?

- localization ⇒ environmental sounds

- communication ⇒ vocalizations

- general sound recognition ⇒ variety of sounds

• Learn codes for a variety of sound ensembles:

- non-harmonic environmental sounds (e.g. footsteps, stream sounds, etc.)

- animal vocalizations (rainforest mammals, e.g. chirps, screeches, cries, etc.)

- speech (from 100 male & female speakers from the TIMIT corpus)

Learning optimal linear transforms for natural sounds

environmental sounds vocalizations

speech

The optimal code depends on the class of sounds being encoded:

- a wavelet-like transform is best for environmental sounds

- a Fourier-like transform is best for vocalizations

- an intermediate transform is best for speech or general natural sounds

[Figure: auditory nerve filter sharpness Q10dB vs. characteristic frequency (kHz), 0.2-20]

Evans, 1975; Rhode and Smith, 1985

[Figure: predicted Q10dB vs. center frequency (kHz), 0.2-20; + vocalizations, o speech/combined, x environmental sounds]

Theory

[Figure: predicted Q10dB vs. center frequency (kHz), 0.2-20]

Theory

Efficient coding explains auditory nerve population data

Filter sharpness: Q10dB = fc / w10dB
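The sharpness measure is directly computable from a filter's magnitude response; a sketch with a toy tuning curve (parabolic in dB, an assumption for illustration):

```python
import numpy as np

def q10db(freqs, mag_db):
    """Q10dB = fc / w10dB: characteristic frequency divided by the
    bandwidth measured 10 dB below the response peak (assumes a
    single-peaked tuning curve on a fine frequency grid)."""
    peak = int(np.argmax(mag_db))
    above = np.where(mag_db >= mag_db[peak] - 10.0)[0]
    w10 = freqs[above[-1]] - freqs[above[0]]
    return freqs[peak] / w10

# Toy tuning curve peaking at 2 kHz, on a 1 Hz grid
freqs = np.linspace(500.0, 3500.0, 3001)
mag_db = -((freqs - 2000.0) / 200.0) ** 2
q = q10db(freqs, mag_db)
```

A sharper (narrower) filter at the same characteristic frequency yields a larger Q10dB, which is why the population scatter plots above use it as the sharpness axis.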

[Figure: measured Q10dB vs. characteristic frequency (kHz), 0.2-20]

Evans, 1975; Rhode and Smith, 1985

Data

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding ● codes signals accurately and efficiently

● adapted to natural sensory environment

Spike coding

Coding efficiency

Explaining physiological data

Current directions

Limitations of the linear model

• it’s linear

• code is optimal only within a block, not for whole signal

• offers no explanation of phase-locking and spikes

• representation depends on the relative alignment of the signal and block

A continuous filterbank does not form an efficient code

! "

#

#

#

Goal:

find a representation that is both time-relative and efficient

Efficient signal representation using time-shiftable kernels (spikes)

x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)

• Each spike encodes the precise time and magnitude of a particular kernel

• Spike population forms a non-redundant signal representation

• Two important theoretical abstractions for “spikes”

- not binary (each spike carries a real-valued amplitude)

- not probabilistic (spike times are deterministic)

• Can convert to a population of stochastic, binary spikes
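Read generatively, the spike-code equation above says the signal is a sum of amplitude-scaled, time-shifted kernels; a minimal numpy sketch (toy Hanning windows stand in for the gammatone kernels):

```python
import numpy as np

def reconstruct(spikes, kernels, length):
    """Rebuild x(t) from a spike code: each spike (m, tau, s) adds
    amplitude s of kernel phi_m starting at sample tau."""
    x = np.zeros(length)
    for m, tau, s in spikes:
        k = kernels[m]
        end = min(tau + len(k), length)   # clip kernels at the signal end
        x[tau:end] += s * k[:end - tau]
    return x

# Two toy kernels and three spikes (kernel index, time, amplitude)
kernels = [np.hanning(32), np.hanning(16)]
spikes = [(0, 10, 1.0), (1, 40, -0.5), (0, 70, 0.8)]
x = reconstruct(spikes, kernels, 128)
```

Each spike is three numbers (which kernel, when, how strongly), so the code is sparse wherever the signal is quiet.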

(Figure 3: Smith, Neural Computation ms 2956)

Smith and Lewicki (2005) Neural Comp. 17:19-45

The spikegram

[Figure: spikegram of a speech signal: input, reconstruction, and residual traces, with spikes plotted by kernel CF (Hz, 100-5000) against time (0-25 ms), and a spectrogram (frequency 0-5000 Hz, 0-800 ms) for comparison]

Comparing a spike code to a spectrogram

[Figure: (a) waveform, (b) spikegram by kernel CF (Hz, 100-5000), (c) spectrogram (frequency 0-5000 Hz) over 0-800 ms]

How do we compute the spikes?

Encoding signals with spikes

• There are many possible algorithms, with varying degrees of biological plausibility

• Here, we use a variation of Matching Pursuit (Mallat and Zhang, 1993)

- yields near optimal spike representation, but is not biologically plausible

- assume there exists a biol. plausible algorithm that achieves the same end

Spike Coding with Matching Pursuit

1. convolve signal with kernels

2. find largest peak over convolution set

3. fit signal with kernel

4. subtract kernel from signal, record spike, and adjust convolutions

5. repeat . . .

6. halt when desired fidelity is reached
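The loop above can be sketched in a few lines (a simplified sketch, not the authors' implementation: it naively recomputes all correlations each pass instead of the incremental adjustment of step 4):

```python
import numpy as np

def matching_pursuit(x, kernels, snr_db=20.0, max_spikes=1000):
    """Greedy spike coding. Repeatedly: (1) correlate the residual with every
    kernel, (2) take the largest-magnitude peak over all kernels and shifts,
    (3-4) record it as a spike and subtract the fitted kernel, (5) repeat,
    (6) halt once the residual reaches the target fidelity."""
    kernels = [k / np.linalg.norm(k) for k in kernels]   # unit-norm kernels
    residual = x.astype(float).copy()
    target = np.sum(x ** 2) / 10 ** (snr_db / 10.0)      # residual power at target SNR
    spikes = []
    for _ in range(max_spikes):
        corrs = [np.correlate(residual, k, mode="valid") for k in kernels]
        m = int(np.argmax([np.max(np.abs(c)) for c in corrs]))
        tau = int(np.argmax(np.abs(corrs[m])))
        s = corrs[m][tau]                                # projection coefficient
        residual[tau:tau + len(kernels[m])] -= s * kernels[m]
        spikes.append((m, tau, s))
        if np.sum(residual ** 2) <= target:
            break
    return spikes, residual

# Encode a signal that is exactly one placed kernel: one spike suffices.
k0 = np.hanning(16)
x = np.zeros(64)
x[5:21] += 2.0 * k0 / np.linalg.norm(k0)
spikes, res = matching_pursuit(x, [k0], snr_db=40.0)
```

Because each kernel is unit-norm, the correlation peak is exactly the least-squares amplitude for that kernel and shift, so subtraction removes the fitted component in one step.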

“can” 5 dB SNR, 36 spikes, 145 sp/sec

[Spikegram: input, reconstruction, and residual traces; kernel CF (Hz) 100-5000 vs. time, 0-200 ms]

“can” 10 dB SNR, 93 spikes, 379 sp/sec

[Spikegram: input, reconstruction, and residual traces; kernel CF (Hz) 100-5000 vs. time, 0-200 ms]

“can” 20 dB SNR, 391 spikes, 1700 sp/sec

[Spikegram: input, reconstruction, and residual traces; kernel CF (Hz) 100-5000 vs. time, 0-200 ms]

“can” 40 dB SNR, 1285 spikes, 5238 sp/sec

[Spikegram: input, reconstruction, and residual traces; kernel CF (Hz) 100-5000 vs. time, 0-200 ms]

Varying the number of gammatone kernels

Figure 6 (Smith and Lewicki, "Coding time-relative structure with spikes"): The number of kernel functions affects both the spectral resolution and the temporal sparseness of the spike codes. The input signal (top) was encoded using matching pursuit with 8, 16, 32, or 64 kernel functions (A-D, respectively). The total number of spikes in each is (A) 12011, (B) 1167, (C) 497, and (D) 479.

8 kernels, 12011 spikes

16 kernels, 1167 spikes

32 kernels, 497 spikes

64 kernels, 479 spikes

Coding efficiency in terms of spikes

[Figure: SNR (dB) vs. cost (spikes/sec), 100 to 10000; curves for optimized matching pursuit, optimized filter-threshold, plain matching pursuit, and filter-threshold, with 8 to 256 filters]

Efficient auditory coding with optimized kernel shapes

x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)

(Figure 3: Smith, Neural Computation ms 2956)

Adapt algorithm of Olshausen (2002)

Smith and Lewicki (2006) Nature 439:978-982

What are the optimal kernel shapes?

Optimizing the probabilistic model

x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) + ε(t),   ε(t) ~ N(0, σ_ε)

p(x | φ) = ∫ p(x | φ, s, τ) p(s) p(τ) ds dτ ≈ p(x | φ, s, τ) p(s) p(τ)

Learning (Olshausen, 2002):

∂/∂φ_m log p(x | φ) ≈ ∂/∂φ_m [ log p(x | φ, s, τ) + log p(s) p(τ) ]

= −(1 / 2σ_ε) ∂/∂φ_m [ x − Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) ]²

= (1 / σ_ε) [ x − x̂ ] Σ_i s_i^m

Also adapt kernel lengths

Kernel functions are initialized to random vectors

Kernel functions optimized for coding speech
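The learning rule just derived has a simple reading: nudge each kernel toward the residual at its spike times, weighted by spike amplitude. A toy sketch (hypothetical simplification: spike times and amplitudes are given, and unit-norm renormalization stands in for adapting kernel lengths):

```python
import numpy as np

def kernel_gradient_step(x, kernels, spikes, lr=0.01):
    """One gradient-ascent step on the kernel shapes: the gradient of the
    log-likelihood w.r.t. kernel phi_m is the residual at that kernel's
    spike times, weighted by the spike amplitudes s."""
    xhat = np.zeros_like(x)
    for m, tau, s in spikes:                       # current reconstruction
        xhat[tau:tau + len(kernels[m])] += s * kernels[m]
    r = x - xhat                                   # residual [x - xhat]
    new = [k.copy() for k in kernels]
    for m, tau, s in spikes:                       # gradient: s * residual segment
        seg = r[tau:tau + len(new[m])]
        new[m][:len(seg)] += lr * s * seg
    return [k / np.linalg.norm(k) for k in new]    # keep kernels unit norm

# A random kernel moves toward the true shape underlying the signal.
rng = np.random.default_rng(0)
k0 = rng.standard_normal(16); k0 /= np.linalg.norm(k0)
true = np.hanning(16); true /= np.linalg.norm(true)
x = np.zeros(64); x[5:21] += 2.0 * true
new = kernel_gradient_step(x, [k0.copy()], [(0, 5, 2.0)], lr=0.1)
```

Alternating this step with matching-pursuit encoding is the overall scheme: encode with the current kernels, then update the kernels against the residual.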

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding ● codes signals accurately and efficiently

● adapted to natural sensory environment

Spike coding ● non-linear, efficient for time-varying signals

● idealization of binary action potentials

Coding efficiency

Explaining physiological data

Current directions

Quantifying coding efficiency

1. fit signal

2. quantize time and amplitude values

3. prune zero values

4. measure coding efficiency using the entropy of quantized values

5. reconstruct signal using quantized values

6. measure fidelity using signal-to-noise ratio (SNR) of residual error

• identical procedure for other codes (e.g. Fourier and wavelet)
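Steps 2-6 can be sketched for the amplitude values alone (illustrative: Laplacian-distributed stand-in amplitudes, a plain scalar quantizer, and empirical entropy as the code cost):

```python
import numpy as np

def entropy_bits(values):
    """Empirical entropy (bits per value) of a discrete sequence (step 4)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def snr_db(x, xhat):
    """Fidelity of a reconstruction as signal-to-noise ratio in dB (step 6)."""
    return float(10 * np.log10(np.sum(x ** 2) / np.sum((x - xhat) ** 2)))

rng = np.random.default_rng(0)
amps = rng.laplace(scale=1.0, size=500)   # stand-in amplitudes (step 1 assumed done)
step = 0.25
q = np.round(amps / step) * step          # step 2: quantize amplitudes
fidelity = snr_db(amps, q)                # steps 5-6: fidelity of quantized values
cost = entropy_bits(q[q != 0.0])          # steps 3-4: prune zeros, measure entropy
```

Sweeping the quantizer step trades cost against fidelity, which is exactly what the rate-SNR curves on the following slides plot for each code.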

x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)

[Figure: original, reconstruction, residual error, and the spike code; kernel CF (Hz) 100-5000 vs. time, 0-200 ms]

Coding efficiency curves

[Figure: coding efficiency curves, SNR (dB, 0-40) vs. rate (kbps, 0-90), for four codes: spike code (adapted kernels), spike code (gammatone), block code (wavelet), block code (Fourier)]

4x more efficient

+14 dB

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding ● codes signals accurately and efficiently

● adapted to natural sensory environment

Spike coding ● non-linear, efficient for time-varying signals

● idealization of binary action potentials

Coding efficiency ● much more efficient than linear block codes

Explaining physiological data

Current directions

Using efficient coding theory to make theoretical predictions

Natural Sound Environment

optimal kernels:• properties• coding efficiency

physiological data:• auditory nerve filter shapes• population trends

evolution

?

Michael S. Lewicki · Carnegie Mellon · Bad Zwischenahn · Aug 21, 2004

A simple model of auditory coding

[Figure: example auditory nerve revcor filters, plotted against time (ms)]

deBoer and deJongh, 1978; Carney and Yin, 1988

auditory revcor filters: gammatones

sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code

More theoretical questions:

• Why gammatones?

• Why spikes?

• How is sound coded by

the spike population?

How do we develop a theory?


Comparing a spike code to a spectrogram

How do we compute the spikes?

[Figure: (a) waveform, (b) spikegram by kernel CF (Hz, 100-5000), (c) spectrogram (frequency 0-5000 Hz) over 0-800 ms]

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5 on both axes: speech prediction overlaid on measured auditory nerve filters]

theory

We only compare to the data after optimizing; we do not fit the data.

Prediction depends on sound ensemble.

vocalizations · environmental sounds (transient · ambient)

Natural sounds

fox

squirrel

walking on leaves

rustling leaves

cracking branches

stream by waterfall

Learned kernels share features of auditory nerve filters

Optimized kernels (scale bar = 1 msec)

Auditory nerve filters (from Carney, McDuffy, and Shekhter, 1999)

Learned kernels closely match individual auditory nerve filters

for each kernel, find the closest matching auditory nerve filter in Laurel Carney's database of ~100 filters.

Learned kernels overlaid on selected auditory nerve filters

For almost all learned kernels there is a closely matching auditory nerve filter.

Spike kernels for a natural sound mix match revcor filters

Optimal kernels for environmental sounds are very short

Spike kernels for vocalizations are much longer and symmetric

Comparing learned kernels to auditory nerve population

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters with kernels learned from environmental, vocalization, and natural sound ensembles]

Population distribution of kernels for natural sounds

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters and natural-sound kernels highlighted]

Population distribution of kernels for environmental sounds

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters and environmental-sound kernels highlighted]

Population distribution of kernels for animal vocalizations

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters and vocalization kernels highlighted]

Kernel distributions for different sound ensembles

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters with environmental, vocalization, and natural-sound kernels]


Population distribution of kernels for speech (TIMIT)

[Figure: bandwidth (kHz) vs. center frequency (kHz), 0.1-5: revcor filters with kernels learned from environmental sounds, vocalizations, and speech]

Speech matches composition of natural sounds

ambient environmental sounds

animal vocalizations

transient environmental sounds

Best mix for predicting auditory coding: 1.0 : 0.8 : 1.2

speech

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding ● codes signals accurately and efficiently

● adapted to natural sensory environment

Spike coding ● non-linear, efficient for time-varying signals

● idealization of binary action potentials

Coding efficiency ● much more efficient than linear block codes

Explaining physiological data

● theory explains gammatone revcor shapes ● also explains population trends

Current directions

Coding of a speech consonant

[Figure: (a) input, reconstruction, and residual with spikes by kernel CF (Hz, 100-5000); (b) detail at 38-40 ms, 3400-4800 Hz; (c) spectrogram, 1000-5000 Hz, 10-60 ms]

How is this achieving an efficient, time-relative code?

[Figure: (a) waveform and (b) spikegram, kernel CF (Hz) 100-5000 vs. time, 0-100 ms]

Time-relative coding of glottal pulses

Learning higher-order structure

[Figure: (a) spikegram, kernel CF (Hz) vs. time, 0-120 ms; (b) autocorrelation, correlogram, and spike-triggered periodicity vs. lag, -20 to 0 ms; (c) spike interval alignment]

Low-frequency kernels do not precisely match the signal period

[Figure: original, reconstruction, and residual at reconstruction fidelities of 3, 4, 5, 6, and 7 dB SNR; filter CF (Hz) 100-5000 vs. time, 300-330 ms]

Hierarchical spike models, M S Lewicki

[Figure 2: spikegram of the 't' in 'vietnamese' at 30 dB SNR; filter CF (Hz) 100-5000, time 180-230 ms]

[Figure 3: spikegram of the onset in 'can' at fidelities of 20, 30, and 40 dB SNR; as the fidelity increases, the high-frequency structure of the consonant becomes more prominent. Filter CF (Hz) 100-5000, time 25-65 ms]

[Figure 4: spikegram of the vowel in 'can' at fidelities of 5, 10, and 20 dB SNR; filter CF (Hz) 100-5000, time 105-145 ms]

[Figure 5: spikegram of the /a/ vowel in 'vietnamese' at fidelities of 15, 20, and 25 dB SNR; filter CF (Hz) 100-5000, time 300-330 ms]

Learning general higher-order acoustic structure

Non-stationary statistical regularity in acoustic features

[Figure: (a-c) non-stationary variance of acoustic coefficients y1 across signal regions R1, R2, R3 (samples 1-256)]

Karklin and Lewicki (2005) Neural Comp. 17:397-423

Generalizing the standard ICA model

P(s) = Π_i P(s_i)

P(s_i) ∝ exp( −| s_i / λ_i |^{q_i} )

[Figure: the generalized Gaussian density P(s_i)]

Generalizing the standard ICA model

P(s) = Π_i P(s_i),   P(s_i) ∝ exp( −| s_i / λ_i |^{q_i} )

The scales are themselves modeled: P(u_i | λ_i) ∝ exp( −| u_i / λ_i |^{q_i} ),   with log λ_i = [Bv]_i

− log p(u | B, v) ∝ Σ_i | u_i / exp([Bv]_i) |^{q_i}

Independent density components

[Figure: density components B_j and the conditional densities P(u | v_j = 2.0), P(u | v_j = 0.0), P(u | v_j = −2.0)]

− log p(u | B, v) ∝ Σ_i | u_i / exp([Bv]_i) |^{q_i}
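The model's cost function is easy to evaluate; a toy numpy sketch (the dimensions and values below are arbitrary placeholders, not learned from speech):

```python
import numpy as np

def neg_log_p(u, B, v, q=1.0):
    """Unnormalized -log p(u|B,v) = sum_i |u_i / exp([Bv]_i)|^q: a
    generalized-Gaussian prior whose per-coefficient scale
    lambda_i = exp([Bv]_i) is log-linear in the higher-order variables v."""
    lam = np.exp(B @ v)
    return float(np.sum(np.abs(u / lam) ** q))

# Toy dimensions: 8 coefficients u, 3 density components (columns of B).
rng = np.random.default_rng(0)
B = 0.1 * rng.standard_normal((8, 3))
v = np.array([1.0, -0.5, 0.0])
u = rng.standard_normal(8)
cost = neg_log_p(u, B, v)
```

With B = 0 this collapses to the standard ICA prior with fixed scales; the higher-order variables v let the model track the non-stationary variance patterns shown earlier.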

Density components of speech

Figure 7 (Karklin and Lewicki): A subset of density components of speech. The weights in a column of B are plotted as shaded patches in one of the nine panels. Each patch is placed according to the temporal and frequency distribution of the associated linear basis function and shaded according to the value of the weight, with white indicating positive weights, black negative weights, and gray weights that are close to zero. The axes represent time, 0 to 16 msec, horizontally, and frequency, 0 to 8 kHz, vertically. The density components form a distributed representation of the frequency of the signal and the location of energy within the sample window. Density components coding for multiple frequencies might capture harmonic regularities in the speech signal (see the text for details).

Higher-level representations show invariant properties

a

b

c

Theoretical models

● explains data from principles

● requires idealization and abstraction

Efficient coding ● codes signals accurately and efficiently

● adapted to natural sensory environment

Spike coding ● non-linear, efficient for time-varying signals

● idealization of binary action potentials

Coding efficiency ● much more efficient than linear block codes

Explaining physiological data

● theory explains gammatone revcor shapes ● also explains population trends

Current directions ● hierarchical models, higher-order structure

Q: (I. J. Good)

I cannot help wondering whether it is not largely a

prejudice to analyse signals in terms of frequency.

A: (D. Gabor)

There are in fact two good reasons for this

preference:

1. In communication problems we deal usually with

the infinite or semi-infinite time-axis, and

2. the problems are usually homogenous in time.

Once one or the other of these conditions is dropped,

it may be well worth while to carry out the analysis

in terms of other functions.

Symposium on Information Theory, Imperial College London, 1950
