Information theoretic models of auditory coding
TRANSCRIPT
Michael S. Lewicki†
Joint work with Evan Smith‡
Departments of Computer Science† and Psychology‡
Center for the Neural Basis of Cognition, Carnegie Mellon University
We gratefully acknowledge the following support: NSF CAREER Award 0238351, NIH Training Grant MH19983
Information theoretic models of auditory coding
from Warren, 1999
The cochlea and inner ear
A simple model of auditory coding
[Figure: example auditory nerve revcor filter impulse responses; time axes in ms]
deBoer and deJongh, 1978 Carney and Yin, 1988
Auditory nerve filters can be estimated using reverse correlation (spike-triggered averaging)
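Reverse correlation can be sketched in a few lines: average the stimulus segments that immediately precede each spike, then time-reverse that average to estimate the neuron's linear filter. A minimal numpy sketch, with illustrative names and window length (not from the talk):

```python
import numpy as np

def revcor_filter(stimulus, spike_times, window=64):
    """Estimate a linear filter by spike-triggered averaging.

    stimulus    : 1-D array holding the sound waveform
    spike_times : sample indices at which spikes occurred
    window      : number of samples preceding each spike to average
    """
    segments = [stimulus[t - window:t] for t in spike_times if t >= window]
    # The average stimulus preceding a spike, time-reversed, is the
    # revcor estimate of the neuron's linear filter.
    return np.mean(segments, axis=0)[::-1]
```

On a stimulus where every spike is preceded by the same short pattern, the function recovers that pattern reversed.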
sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code
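The model stages above (filterbank → static non-linearity → stochastic spiking) can be sketched as a toy pipeline. This is a heavily simplified illustration: the half-wave rectification, the rate scaling, and the Poisson spiking here are placeholder choices, not the fitted components of a real auditory nerve model:

```python
import numpy as np

def auditory_model(sound, filters, rate_scale=50.0, rng=None):
    """Toy version of the standard model: filterbank -> static
    non-linearity -> stochastic (Poisson) spiking."""
    rng = rng or np.random.default_rng(0)
    spikes = []
    for f in filters:
        drive = np.convolve(sound, f, mode="same")      # linear filterbank
        rate = np.maximum(drive, 0.0)                   # static non-linearity
        spikes.append(rng.poisson(rate_scale * rate))   # stochastic spiking
    # population spike code: one spike-count row per filter channel
    return np.array(spikes)
```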
Models are data driven

data: revcor filter
model: gammatone

The "gammatone" function:

g(t) = a t^(n−1) e^(−bt) cos(2πft + φ)

Fitting the gammatone model to the revcor data leaves only a small residual error.
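The gammatone can be written directly as a function of time; the parameter values in this sketch are illustrative defaults, not fits to any revcor data:

```python
import numpy as np

def gammatone(t, a=1.0, n=4, b=600.0, f=1000.0, phi=0.0):
    """Gammatone impulse response:
    g(t) = a * t^(n-1) * exp(-b*t) * cos(2*pi*f*t + phi).
    a: amplitude, n: filter order, b: bandwidth parameter (1/s),
    f: center frequency (Hz), phi: phase."""
    return a * t**(n - 1) * np.exp(-b * t) * np.cos(2 * np.pi * f * t + phi)

t = np.arange(0, 0.025, 1.0 / 16000)   # 25 ms sampled at 16 kHz
g = gammatone(t)                        # onset ramp, then decaying ringing
```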
• The aim of both theories and models is to explain data
- Models are data driven.
- Theories are driven by principles.
- Theories require a definition of an ideal.
A theoretical approach
Theoretical questions:
• Why gammatones?
• Why spikes?
• How is sound coded by
the spike population?
How do we develop a theory?
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding
Spike coding
Coding efficiency
Explaining physiological data
Current directions
Efficient coding theory
• Barlow, 1961; Attneave, 1954
- main goal of sensory coding is to code signals efficiently
- sensory codes are adapted to the sensory environment
- each code “feature” should have minimal redundancy
- each feature should describe independent information
• caveats:
- applies to behaviorally relevant information
- not all redundancy is bad, e.g. when compensating for noise
Efficient coding of natural sounds
Lewicki (2002) Nat. Neurosci. 5:356-363
Goal:
Predict optimal transformation of sound waveform from statistics of the natural acoustic environment
• use simplest model: linear
• derive optimal code for sound in analysis window
• use ICA to learn optimal linear transform
x_{1:N}: the sound samples in the analysis window
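The learning step can be sketched with a minimal FastICA (tanh non-linearity, symmetric decorrelation) applied to data vectors such as windows x_{1:N}. This is a generic textbook implementation for illustration, not the exact algorithm used in Lewicki (2002):

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal FastICA. X: (n_channels, n_samples). Returns an
    unmixing matrix W such that W @ X recovers independent components
    (up to permutation, sign, and scale)."""
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    V = E @ np.diag(d ** -0.5) @ E.T          # whitening matrix
    Z = V @ X
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[0], X.shape[0]))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # fixed-point update: E[z g(w'z)] - E[g'(w'z)] w
        W = G @ Z.T / Z.shape[1] - np.diag((1.0 - G**2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt                             # symmetric decorrelation
    return W @ V
```

Applied to a mixture of sparse (e.g. Laplacian) sources, the recovered components correlate strongly with the true sources.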
What sounds to use?
• What are auditory systems adapted for doing?
- localization ⇒ environmental sounds
- communication ⇒ vocalizations
- general sound recognition ⇒ variety of sounds
• Learn codes for a variety of sound ensembles:
- non-harmonic environmental sounds (e.g. footsteps, stream sounds, etc.)
- animal vocalizations (rainforest mammals, e.g. chirps, screeches, cries, etc.)
- speech (from 100 male & female speakers from the TIMIT corpus)
Learning optimal linear transforms for natural sounds
environmental sounds vocalizations
speech
The optimal code depends on the class of sounds being encoded:
- a wavelet-like transform is best for environmental sounds
- a Fourier-like transform is best for vocalizations
- an intermediate transform is best for speech or general natural sounds
[Figure: Q_10dB vs. characteristic frequency (kHz), 0.2-20 kHz; data from Evans, 1975 and Rhode and Smith, 1985; theory predictions: + vocalizations, o speech/combined, x environmental sounds]
Theory vs. Data

Efficient coding explains auditory nerve population data

Filter sharpness: Q_10dB = f_c / w_10dB

[Figure: Q_10dB vs. center frequency (kHz), 0.2-20 kHz — theory predictions compared with data from Evans, 1975 and Rhode and Smith, 1985]
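The Q_10dB sharpness measure can be computed directly from a filter's magnitude response: find the center frequency at the peak, then divide by the bandwidth measured 10 dB below the peak. The Gaussian-shaped response in the usage example is purely illustrative:

```python
import numpy as np

def q10db(freqs, magnitude_db):
    """Filter sharpness Q_10dB = f_c / w_10dB: center frequency divided
    by the bandwidth 10 dB below the peak response.  Assumes a single,
    well-sampled peak."""
    peak = magnitude_db.max()
    fc = freqs[magnitude_db.argmax()]
    above = freqs[magnitude_db >= peak - 10.0]   # region within 10 dB of peak
    return fc / (above[-1] - above[0])
```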
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding
Coding efficiency
Explaining physiological data
Current directions
Limitations of the linear model
• it’s linear
• code is optimal only within a block, not for whole signal
• offers no explanation of phase-locking and spikes
• representation depends on the relative alignment of the signal and block
A continuous filterbank does not form an efficient code
Goal:
find a representation that is both time-relative and efficient
Efficient signal representation using time-shiftable kernels (spikes)
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)
• Each spike encodes the precise time and magnitude of a particular kernel
• Spike population forms a non-redundant signal representation
• Two important theoretical abstractions for “spikes”
- not binary
- not probabilistic
• Can convert to a population of stochastic, binary spikes
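The generative equation translates directly into code: each spike places a scaled, shifted copy of its kernel into the signal. A minimal synthesis sketch (spike triples and kernel shapes are illustrative):

```python
import numpy as np

def synthesize(spikes, kernels, n_samples):
    """Reconstruct x(t) = sum_m sum_i s_{m,i} * phi_m(t - tau_{m,i}).
    spikes  : list of (m, tau, s) triples -- kernel index, time (sample
              index), and amplitude
    kernels : list of 1-D kernel arrays phi_m"""
    x = np.zeros(n_samples)
    for m, tau, s in spikes:
        k = kernels[m]
        end = min(tau + len(k), n_samples)   # clip kernels at signal edge
        x[tau:end] += s * k[:end - tau]
    return x
```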
Figure 3: Smith, NC ms 2956
Smith and Lewicki (2005) Neural Comp. 17:19-45
The spikegram
[Figure: spikegram of a speech signal — Input, Reconstruction, and Residual panels; Kernel CF (Hz), 100-5000, vs. time (ms); with a comparison spectrogram, Frequency (Hz), over 0-800 ms]
Comparing a spike code to a spectrogram
[Figure: (a) waveform, (b) spikegram (Kernel CF, Hz, 100-5000), and (c) spectrogram (Frequency, Hz, 1000-5000) of the same 800 ms speech signal]

How do we compute the spikes?
Encoding signals with spikes
• There are many possible algorithms, with varying degrees of biological plausibility
• Here, we use a variation of Matching Pursuit (Mallat and Zhang, 1993)
- yields near optimal spike representation, but is not biologically plausible
- assume there exists a biol. plausible algorithm that achieves the same end
Spike Coding with Matching Pursuit

1. convolve signal with kernels
2. find largest peak over convolution set
3. fit signal with kernel
4. subtract kernel from signal, record spike, and adjust convolutions
5. repeat . . .
6. halt when desired fidelity is reached
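The steps above can be sketched as a greedy loop. This simplified version recomputes the full correlations each pass rather than incrementally adjusting them, and assumes unit-norm kernels, so it illustrates the logic rather than an efficient or exact implementation:

```python
import numpy as np

def matching_pursuit(x, kernels, snr_db=20.0, max_spikes=10000):
    """Greedy spike coding: repeatedly find the best-matching
    (kernel, shift), record a spike, and subtract the fitted kernel.
    Kernels are assumed unit-norm."""
    residual = x.copy()
    # residual power allowed at the desired fidelity (SNR in dB)
    target = np.sum(x**2) / 10 ** (snr_db / 10.0)
    spikes = []
    while np.sum(residual**2) > target and len(spikes) < max_spikes:
        best = None
        for m, k in enumerate(kernels):
            c = np.correlate(residual, k, mode="valid")  # projections at all shifts
            tau = int(np.argmax(np.abs(c)))
            if best is None or abs(c[tau]) > abs(best[2]):
                best = (m, tau, c[tau])
        m, tau, s = best
        residual[tau:tau + len(kernels[m])] -= s * kernels[m]  # subtract fit
        spikes.append((m, tau, s))                             # record spike
    return spikes, residual
```

On a signal built from two non-overlapping kernel instances, the loop recovers both spikes exactly and leaves a zero residual.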
"can" encoded at increasing fidelity (Input / Reconstruction / Residual; Kernel CF (Hz), 100-5000, vs. time, 0-200 ms):
- 5 dB SNR: 36 spikes, 145 sp/sec
- 10 dB SNR: 93 spikes, 379 sp/sec
- 20 dB SNR: 391 spikes, 1700 sp/sec
- 40 dB SNR: 1285 spikes, 5238 sp/sec
Varying the number of gammatone kernels

[Figure 6, Smith and Lewicki: The number of kernel functions affects both the spectral resolution and the temporal sparseness of the spike codes. The input signal (top) was encoded using matching pursuit with 8, 16, 32, or 64 kernel functions (A-D, respectively). The total number of spikes in each is (A) 12011, (B) 1167, (C) 497, and (D) 479.]

- 8 kernels, 12011 spikes
- 16 kernels, 1167 spikes
- 32 kernels, 497 spikes
- 64 kernels, 479 spikes
Coding efficiency in terms of spikes
[Figure: SNR (dB) vs. cost (spikes/sec), 100-10000: Optimized Matching Pursuit, Optimized Filter-Threshold, Matching Pursuit, and Filter-Threshold; shown for 8 and 256 filters]
Efficient auditory coding with optimized kernel shapes
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)
Adapt algorithm of Olshausen (2002)
Smith and Lewicki (2006) Nature 439:978-982
What are the optimal kernel shapes?
Optimizing the probabilistic model
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) + ε(t),   ε(t) ~ N(0, σ_ε)

p(x|φ) = ∫ p(x|φ, s, τ) p(s) p(τ) ds dτ ≈ p(x|φ, ŝ, τ̂) p(ŝ) p(τ̂)

Learning (Olshausen, 2002):

∂/∂φ_m log p(x|φ) ≈ ∂/∂φ_m [ log p(x|φ, ŝ, τ̂) + log p(ŝ) p(τ̂) ]
  = −(1/2σ_ε) ∂/∂φ_m [ x − Σ_{m=1}^{M} Σ_{i=1}^{n_m} ŝ_i^m φ_m(t − τ̂_i^m) ]²
  = (1/σ_ε) [ x − x̂ ] Σ_i ŝ_i^m
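One learning step correlates the residual with each spike's amplitude at the spike's position. The sketch below is a simplified reading of the Olshausen (2002)-style rule: the inferred spikes are held fixed during the step, and the unit-norm renormalization is a common practical addition, not necessarily the talk's exact procedure:

```python
import numpy as np

def gradient_step(x, kernels, spikes, lr=0.5):
    """One gradient step on the kernel shapes.  `spikes` holds
    (m, tau, s) triples: kernel index, time (sample), amplitude."""
    # reconstruction x_hat from the current spikes and kernels
    xhat = np.zeros_like(x)
    for m, tau, s in spikes:
        xhat[tau:tau + len(kernels[m])] += s * kernels[m]
    residual = x - xhat
    new_kernels = [k.copy() for k in kernels]
    for m, tau, s in spikes:
        # gradient: residual at the spike's position, scaled by amplitude
        seg = residual[tau:tau + len(kernels[m])]
        new_kernels[m][:len(seg)] += lr * s * seg
    # renormalize so amplitudes stay identifiable
    return [k / np.linalg.norm(k) for k in new_kernels]
```

Iterating this step on a signal generated by a known kernel pulls a randomly initialized kernel toward the true shape.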
Also adapt kernel lengths
Kernel functions are initialized to random vectors
Kernel functions optimized for coding speech
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency
Explaining physiological data
Current directions
Quantifying coding efficiency
1. fit signal
2. quantize time and amplitude values
3. prune zero values
4. measure coding efficiency using the entropy of quantized values
5. reconstruct signal using quantized values
6. measure fidelity using signal-to-noise ratio (SNR) of residual error
• identical procedure for other codes (e.g. Fourier and wavelet)
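Steps 2-4 can be sketched for the spike amplitudes (the same procedure applies to the quantized time values); the bin count is an illustrative choice, and a non-degenerate amplitude range is assumed:

```python
import numpy as np

def code_cost_bits(amplitudes, n_bins=64):
    """Estimate coding cost from the entropy of quantized spike
    amplitudes.  Returns total bits = (bits per spike) * (n spikes)."""
    edges = np.linspace(amplitudes.min(), amplitudes.max(), n_bins + 1)
    q = np.digitize(amplitudes, edges[1:-1])        # quantize into bins
    counts = np.bincount(q, minlength=n_bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))               # bits per spike
    return entropy * len(amplitudes)                # total bits
```

For example, 16 amplitudes split evenly between two distinct values cost 1 bit per spike, i.e. 16 bits in total.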
x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_{m,i} φ_m(t − τ_{m,i}) + ε(t)

[Figure: spike code of a test signal — original, reconstruction, and residual error; Kernel CF (Hz), 100-5000, vs. time, 0-200 ms]
Coding efficiency curves
[Figure: SNR (dB), 0-40, vs. rate (Kbps), 0-90: adapted spike code, gammatone spike code, wavelet block code, and Fourier block code]
4x more efficient
+14 dB
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
Current directions
Using efficient coding theory to make theoretical predictions
Natural Sound Environment
optimal kernels: properties, coding efficiency
physiological data: auditory nerve filter shapes, population trends
evolution
Michael S. Lewicki · Carnegie Mellon · Bad Zwischenahn · Aug 21, 2004
A simple model of auditory coding

[Figure: example revcor filters; time axes in ms; deBoer and deJongh, 1978; Carney and Yin, 1988]

auditory revcor filters: gammatones

sound waveform → filterbank → static non-linearities → stochastic spiking → population spike code
More theoretical questions:
• Why gammatones?
• Why spikes?
• How is sound coded by
the spike population?
How do we develop a theory?
Comparing a spike code to a spectrogram

[Figure: (a) waveform, (b) spikegram (Kernel CF, Hz, 100-5000), and (c) spectrogram (Frequency, Hz, 1000-5000), over 0-800 ms]
[Figure: Bandwidth (kHz) vs. Center Frequency (kHz), log-log over 0.1-5 kHz: speech prediction overlaid on auditory nerve filters]
theory
We only compare to the data after optimizing; we do not fit the data.

The prediction depends on the sound ensemble.
vocalizations · environmental sounds
transient · ambient
Natural sounds
fox
squirrel
walking on leaves
rustling leaves
cracking branches
stream by waterfall
Learned kernels share features of auditory nerve filters
Optimized kernels (scale bar = 1 msec)
Auditory nerve filters (from Carney, McDuffy, and Shekhter, 1999)
Learned kernels closely match individual auditory nerve filters
For each kernel, find the closest matching auditory nerve filter in Laurel Carney's database of ~100 filters.
Learned kernels overlaid on selected auditory nerve filters
For almost all learned kernels there is a closely matching auditory nerve filter.
Spike kernels for the natural sound mix match revcor filters
Optimal kernels for environmental sounds are very short
Spike kernels for vocalizations are much longer and symmetric
Comparing learned kernels to the auditory nerve population

[Figures: Bandwidth (kHz) vs. Center Frequency (kHz), log-log over 0.1-5 kHz; legend: Revcor, Environ, Vocal, Natural/Speech]

Population distribution of kernels for natural sounds
Population distribution of kernels for environmental sounds
Population distribution of kernels for animal vocalizations
Kernel distributions for different sound ensembles
Population distribution of kernels for speech (TIMIT)
Speech matches composition of natural sounds
ambient environmental sounds
animal vocalizations
transient environmental sounds
Best mix for predicting auditory coding: 1.0 : 0.8 : 1.2
speech
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
● theory explains gammatone revcor shapes ● also explains population trends
Current directions
Coding of a speech consonant
[Figure: (a) Input, Reconstruction, and Residual spikegrams, Kernel CF (Hz), 100-5000; (b) detail at 38-40 ms, 3400-4800 Hz; (c) spectrogram, Frequency (Hz), 10-60 ms]
How is this achieving an efficient, time-relative code?
[Figure: (a) waveform and (b) spikegram, Kernel CF (Hz), 100-5000, over 0-100 ms]
Time-relative coding of glottal pulses
Learning higher-order structure
[Figure: (a) spikegram, Kernel CF (Hz) vs. time (msec), 0-120 ms; (b) autocorrelation, correlogram, and spike-triggered periodicity vs. lag (−20 to 0 msec); (c) spike interval alignment]
Low-frequency kernels do not precisely match the signal period

[Figures: original, reconstruction (at 3, 4, 5, 6, and 7 dB SNR), and residual; Filter CF (Hz), 100-5000, vs. time, 300-330 ms]
Hierarchical spike models (M S Lewicki)

Figure 2: Spikegram of the 't' in 'vietnamese' at 30 dB SNR.

Figure 3: Spikegram of the onset in 'can' at fidelities of 20, 30, and 40 dB SNR. As the fidelity increases, the high-frequency structure of the consonant becomes more prominent.

Figure 4: Spikegram of the vowel in 'can' at fidelities of 5, 10, and 20 dB SNR.

Figure 5: Spikegram of the /a/ vowel in 'vietnamese' at fidelities of 15, 20, and 25 dB SNR.
Learning general higher-order acoustic structure
Non-stationary statistical regularity in acoustic features
[Figure: (a-c) coefficient values (−8 to 4) over samples 1-256 in three signal regions R1, R2, R3, showing non-stationary variance structure]
Karklin and Lewicki (2005) Neural Comp. 17:397-423
Generalizing the standard ICA model

Standard ICA prior:

P(s) = Π_i P(s_i),   P(s_i) ∝ exp( −| s_i / λ_i |^{q_i} )

Generalization — let higher-order variables v set the scales via log λ_i = [Bv]_i:

P(u_i | λ_i) ∝ exp( −| u_i / λ_i |^{q_i} )

−log p(u | B, v) ∝ Σ_i | u_i / exp([Bv]_i) |^{q_i}
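The generalized density can be evaluated directly: the higher-order variables v set each component's scale through the basis B. In this sketch q is a scalar for simplicity, though the model allows a per-component q_i:

```python
import numpy as np

def neg_log_density(u, B, v, q=1.0):
    """-log p(u | B, v), up to an additive constant:
    sum_i | u_i / exp([Bv]_i) |^q.
    The higher-order variables v set the scales lambda_i = exp([Bv]_i)."""
    lam = np.exp(B @ v)
    return np.sum(np.abs(u / lam) ** q)
```

With B the identity and v = 0, every scale is 1 and the cost reduces to the standard Laplacian-style sum of |u_i|; a nonzero v rescales individual components.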
Independent density components
[Figure: density component B_j and the conditional densities P(u | v_j = 2.0), P(u | v_j = 0.0), P(u | v_j = −2.0)]
Density components of speech

Figure 7 (Karklin and Lewicki): A subset of density components of speech. The weights in a column of B are plotted as shaded patches in one of the nine panels. Each patch is placed according to the temporal and frequency distribution of the associated linear basis function and shaded according to the value of the weight: white indicates positive weights, black negative weights, and gray weights close to zero. The axes represent time (0 to 16 msec, horizontal) and frequency (0 to 8 kHz, vertical). The density components form a distributed representation of the frequency of the signal and the location of energy within the sample window. Density components coding for multiple frequencies might capture harmonic regularities in the speech signal.
Higher-level representations show invariant properties
Theoretical models
● explains data from principles
● requires idealization and abstraction
Efficient coding ● codes signals accurately and efficiently
● adapted to natural sensory environment
Spike coding ● non-linear, efficient for time-varying signals
● idealization of binary action potentials
Coding efficiency ● much more efficient than linear block codes
Explaining physiological data
● theory explains gammatone revcor shapes ● also explains population trends
Current directions ● hierarchical models, higher-order structure
Q: (I. J. Good)
I cannot help wondering whether it is not largely a
prejudice to analyse signals in terms of frequency.
A: (D. Gabor)
There are in fact two good reasons for this
preference:
1. In communication problems we deal usually with
the infinite or semi-infinite time-axis, and
2. the problems are usually homogenous in time.
Once one or the other of these conditions is dropped,
it may be well worth while to carry out the analysis
in terms of other functions.
Symposium on Information Theory, Imperial College London, 1950