A glimpsing model of speech perception
Martin Cooke & Sarah Simpson
Speech and Hearing Research
Department of Computer Science
University of Sheffield
http://www.dcs.shef.ac.uk/~martin
Motivation: The nonstationarity ‘paradox’
speech technology performance falls with the
nonstationarity of the noise background …
Simpson & Cooke (2003)
Aurora eval
Miller (1947)
… while listeners appear to prefer a nonstationary background (8-12 dB SRT gain)
Possible factors
In a 1-speaker background, listeners can …
• … employ organisational cues from the background source to help segregate foreground
• … employ schemas for both foreground and background
• … benefit from better glimpses of the speech target
but: multi-speaker backgrounds have certain advantages …
• … less chance of informational masking
• … easier enhancement algorithm
Glimpsing opportunities
% of time-frequency regions with a locally-positive SNR
Spectro-temporal glimpse densities
Glimpsing
Precursors
• Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech
• Related to ‘multiple looks’ model of Viemeister & Wakefield (1991) which demonstrated ‘intelligent’ temporal integration of tone bursts
• Assmann & Summerfield (in press) suggest ‘glimpsing & tracking’ as a way of understanding how listeners cope with adverse conditions
• Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s
• de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification
• Close relation to missing data processing (Cooke et al, 1994)
Informal definition
a glimpse is some time-frequency region which contains a reasonably undistorted ‘view’ of local signal properties
Types of glimpses
Comodulated
Eg Miller & Licklider (1950)
Spectral
Eg Warren et al (1995)
General uncomodulated
Eg Howard-Jones & Rosen (1993), Buss et al (2003)
Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Found intelligibility of 60% when 98% of signal was missing
Glimpsing in natural conditions: the dominance effect
Although audio signals combine additively, the occlusion metaphor is more appropriate due to log-like compression in the auditory system
Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
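The dominance effect can be checked numerically. The following is a minimal sketch, not taken from the talk: it assumes that the energies of the two sources add in each time-frequency cell (uncorrelated sources) and uses log-normally distributed toy energies with a 15 dB spread as a crude stand-in for real speech spectrograms. The 3 dB "ambiguity" margin and all function names are illustrative choices.

```python
import numpy as np

def log_max_error_db(energy_a, energy_b):
    """Error (dB) made by treating the mixture as max(a, b) in the log domain."""
    mix_db = 10 * np.log10(energy_a + energy_b)
    max_db = 10 * np.log10(np.maximum(energy_a, energy_b))
    return mix_db - max_db  # bounded by 10*log10(2), about 3 dB

def ambiguous_fraction(energy_a, energy_b, margin_db=3.0):
    """Fraction of cells where neither source dominates by margin_db."""
    diff_db = 10 * np.log10(energy_a / energy_b)
    return float(np.mean(np.abs(diff_db) < margin_db))

rng = np.random.default_rng(0)
# Toy "spectrograms": 40 channels x 500 frames, levels in dB (assumed spread)
a_db = rng.normal(0.0, 15.0, size=(40, 500))
b_db = rng.normal(0.0, 15.0, size=(40, 500))
a, b = 10 ** (a_db / 10), 10 ** (b_db / 10)

err = log_max_error_db(a, b)
frac = ambiguous_fraction(a, b)
```

Because the mixture energy is never more than twice the larger of the two, the log-domain error of the "occlusion" approximation cannot exceed about 3 dB, and with a wide dynamic range only a minority of cells fall inside the ambiguity margin.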
Issues for a glimpsing model
What constitutes a useful glimpse?
Is sufficient information contained in glimpses?
How do listeners detect glimpses?
How can they be integrated?
Glimpse detection
Glimpse integration
Glimpsing study
• Aims
– Determine if glimpses contain sufficient information
– Explore definition of useful glimpse
• Comparison between listeners and model using natural VCV stimuli
• Subset of Shannon et al (1999) corpus
V = /a/
C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }
• Background source
– reversed multispeaker babbler for N=1, 8
– Allows variation in glimpsing opportunities
– 3 SNRs (TMRs): 0, -6 and -12 dB
• 12 listeners heard 160 tokens in each condition
– 2 repeats × 16 VCVs × 5 male speakers
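Mixing a target with a reversed N-speaker background at a given TMR might look like the sketch below. It assumes the TMR is defined on whole-token RMS levels, which the slides do not state, and uses Gaussian noise as a stand-in for the talker recordings. The names `mix_at_tmr` and `reversed_babble` are hypothetical.

```python
import numpy as np

def mix_at_tmr(target, masker, tmr_db):
    """Scale masker so the target-to-masker RMS ratio equals tmr_db, then add."""
    target = np.asarray(target, dtype=float)
    masker = np.asarray(masker, dtype=float)
    rms_t = np.sqrt(np.mean(target ** 2))
    rms_m = np.sqrt(np.mean(masker ** 2))
    gain = rms_t / (rms_m * 10 ** (tmr_db / 20))
    scaled = gain * masker
    return target + scaled, scaled

def reversed_babble(talkers):
    """Sum N equal-length talker signals and time-reverse the result."""
    return np.sum(np.asarray(talkers, dtype=float), axis=0)[::-1]

rng = np.random.default_rng(1)
target = rng.standard_normal(16000)          # stand-in for one VCV token
talkers = rng.standard_normal((8, 16000))    # stand-ins for N = 8 talkers
masker = reversed_babble(talkers)

# One mixture per TMR used in the study
mixtures = {tmr: mix_at_tmr(target, masker, tmr) for tmr in (0, -6, -12)}
```

Returning the scaled masker alongside the mixture keeps the premixed signals available, which any "ideal" local-SNR computation needs.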
Glimpsing model
• CDHMM employing missing data techniques
• 16 whole-word HMMs
– 8 states
– 4 component Gaussian mixture per state
• Input representation
– 10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
– NB: only simultaneous masking is modelled
• Training
– 8 repetitions of each VCV by 5 male speakers per model
• Testing
– As for listeners viz. 2 repetitions of each VCV by 5 male speakers
– Performance in clean: > 99%
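One common missing data technique is marginalisation: channels that are not glimpsed are integrated out of the state likelihood. The sketch below assumes diagonal-covariance Gaussian mixtures and plain marginalisation. The slides specify neither, so treat it as an illustration of the idea rather than the model actually used. The parameters are random placeholders sized to match the 4-component, 40-channel setup.

```python
import numpy as np

def gmm_marginal_loglik(x, present, weights, means, variances):
    """Log-likelihood of the present channels of x under a diagonal GMM.

    With diagonal covariances, integrating out the missing channels
    simply drops them from each component's product.
    """
    x = np.asarray(x, dtype=float)
    present = np.asarray(present, dtype=bool)
    xp = x[present]
    mu = means[:, present]
    var = variances[:, present]
    log_comp = -0.5 * np.sum(np.log(2 * np.pi * var) + (xp - mu) ** 2 / var, axis=1)
    log_joint = np.log(weights) + log_comp
    m = np.max(log_joint)                       # log-sum-exp for stability
    return float(m + np.log(np.sum(np.exp(log_joint - m))))

rng = np.random.default_rng(2)
n_mix, n_chan = 4, 40                           # as on the slide
weights = np.full(n_mix, 1.0 / n_mix)
means = rng.normal(size=(n_mix, n_chan))        # placeholder parameters
variances = rng.uniform(0.5, 2.0, size=(n_mix, n_chan))

frame = rng.normal(size=n_chan)
mask = rng.random(n_chan) > 0.5                 # True = glimpsed channel
ll_clean = gmm_marginal_loglik(frame, mask, weights, means, variances)

# Corrupting only the non-glimpsed channels leaves the score unchanged
corrupted = np.where(mask, frame, frame + 100.0)
ll_corrupted = gmm_marginal_loglik(corrupted, mask, weights, means, variances)
```

The point of the last two lines is that whatever the masker does in the non-glimpsed channels has no effect on the score, which is what lets the recogniser work from glimpses alone.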
Ideal glimpses
• All time-frequency regions whose local SNR exceeds a threshold
• Optimum threshold = 0 dB
• For this task, there is more than sufficient information in the glimpsed regions
• Listeners perform suboptimally with respect to this glimpse definition
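The ideal glimpse definition above translates into a simple mask over the time-frequency plane. This sketch assumes the target and masker excitation patterns (in dB) are available separately before mixing, which is what makes the glimpses "ideal". The random levels are placeholders, and the function names are illustrative.

```python
import numpy as np

def ideal_glimpse_mask(target_db, masker_db, threshold_db=0.0):
    """True where the local SNR (dB) exceeds threshold_db."""
    return (np.asarray(target_db) - np.asarray(masker_db)) > threshold_db

def glimpse_percentage(mask):
    """Percentage of time-frequency cells marked as glimpsed."""
    return 100.0 * float(np.mean(mask))

rng = np.random.default_rng(3)
# Placeholder excitation patterns: 40 channels x 100 frames of 10 ms
target_db = rng.normal(60.0, 12.0, size=(40, 100))
masker_db = rng.normal(60.0, 12.0, size=(40, 100))

mask_0 = ideal_glimpse_mask(target_db, masker_db, 0.0)     # optimum threshold
mask_m6 = ideal_glimpse_mask(target_db, masker_db, -6.0)   # more permissive
```

Lowering the threshold can only add cells to the mask, so a threshold sweep trades more glimpsed area against more masker energy inside each glimpse.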
Model performance I: ideal glimpses
[Figure legend: N = 1, N = 8]
Model performance: variation in detection threshold
Q: Can varying the local SNR threshold for glimpse detection produce a better match?
• No choice of local SNR threshold provides a good fit to listeners
• Closest fit shown (-6 dB)
[Figure legend: N = 1, N = 8]
Analysis
• Unreasonable to expect listeners to detect individual glimpses in a sea of noise unless glimpse region is large enough
Model performance: useable glimpses
• Definition: glimpsed region must occupy at least N ERBs and T ms
• Search over 1-15 ERBs, 10-100 ms, at various detection thresholds
• Best match at
– 6.3 ERBs (9 channels)
– 40 ms
– 0 dB local SNR threshold
[Figure legend: N = 1, N = 8]
• Howard-Jones & Rosen (1993) suggested 2-4 bands limit for uncomodulated glimpsing
• Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands
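The useable-glimpse criterion can be sketched as a filter on a glimpse mask. The slides do not say how "occupy at least N ERBs and T ms" was measured, so this sketch makes an assumption: a glimpse is a 4-connected region of the mask, and it is kept if its bounding extent reaches the minimum number of channels and frames. The defaults correspond to the best match (9 channels, 40 ms = 4 frames of 10 ms).

```python
import numpy as np
from collections import deque

def useable_glimpses(mask, min_channels=9, min_frames=4):
    """Keep 4-connected regions whose extent is at least
    min_channels (frequency) by min_frames (time)."""
    mask = np.asarray(mask, dtype=bool)
    n_chan, n_frames = mask.shape
    seen = np.zeros_like(mask)
    keep = np.zeros_like(mask)
    for c0 in range(n_chan):
        for t0 in range(n_frames):
            if not mask[c0, t0] or seen[c0, t0]:
                continue
            # Breadth-first flood fill of one connected region
            region = []
            queue = deque([(c0, t0)])
            seen[c0, t0] = True
            while queue:
                c, t = queue.popleft()
                region.append((c, t))
                for dc, dt in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    cc, tt = c + dc, t + dt
                    if (0 <= cc < n_chan and 0 <= tt < n_frames
                            and mask[cc, tt] and not seen[cc, tt]):
                        seen[cc, tt] = True
                        queue.append((cc, tt))
            chans = [c for c, _ in region]
            frames = [t for _, t in region]
            if (max(chans) - min(chans) + 1 >= min_channels
                    and max(frames) - min(frames) + 1 >= min_frames):
                for c, t in region:
                    keep[c, t] = True
    return keep

mask = np.zeros((40, 20), dtype=bool)
mask[5:15, 2:8] = True      # 10 channels x 6 frames: large enough
mask[30:33, 10:12] = True   # 3 channels x 2 frames: too small
useable = useable_glimpses(mask)
```

A stricter reading, in which the region must contain a full N-by-T rectangle, would discard more glimpses. Which reading matches the study is not recoverable from the slides.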
Consonant identification
identification of individual consonants (% correct)
            b   p   d   t   g   k   l   r   m   n   s   sh  ch  v   f   z   all
listeners   45  61  64  83  76  70  50  72  68  77  90  91  79  54  63  92  71
model       85  68  73  83  85  85  58  67  60  78  35  73  87  60  78  53  71
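The per-consonant scores can be compared directly. This short sketch uses only the percentages listed above and ranks consonants by the size of the model-listener gap. The variable names are illustrative.

```python
consonants = "b p d t g k l r m n s sh ch v f z".split()
listeners = [45, 61, 64, 83, 76, 70, 50, 72, 68, 77, 90, 91, 79, 54, 63, 92]
model = [85, 68, 73, 83, 85, 85, 58, 67, 60, 78, 35, 73, 87, 60, 78, 53]

# Positive difference: model scores higher than listeners
diff = {c: m - l for c, l, m in zip(consonants, listeners, model)}

# Three consonants with the largest absolute mismatch
worst = sorted(diff, key=lambda c: abs(diff[c]), reverse=True)[:3]
# worst == ['s', 'b', 'z']
```

The model outperforms listeners on b (+40 points) and falls well short of them on s (-55) and z (-39). The next largest gap, for sh, is 18 points.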
• Reasonable matches overall apart from b, s & z
• However, little token-by-token agreement between common listener errors and model errors.
• Why?
Factors
Audibility of target
Organisational cues in target
Organisational cues in background
‘Confusability’
Existence of schemas for target
Existence of schemas for background
Informational masking
Energetic masking
Successful identification
Measuring energetic masking
Approach: resynthesise glimpses alone
• Filter, time-reverse, refilter to remove phase distortion
• Select regions based on local SNR mask
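The filter, time-reverse, refilter step is the standard way to cancel the phase shift of a causal filter. The slides do not say which filters were used, so the sketch below substitutes a first-order low-pass purely to show the principle; `one_pole_lowpass` and `forward_backward` are hypothetical names.

```python
import numpy as np

def one_pole_lowpass(x, alpha=0.2):
    """Causal first-order IIR low-pass: y[n] = alpha*x[n] + (1-alpha)*y[n-1]."""
    y = np.empty(len(x), dtype=float)
    state = 0.0
    for n, sample in enumerate(x):
        state = alpha * sample + (1.0 - alpha) * state
        y[n] = state
    return y

def forward_backward(x, filt):
    """Filter, time-reverse, refilter, then reverse back: phase shifts cancel."""
    once = filt(np.asarray(x, dtype=float))
    twice = filt(once[::-1])
    return twice[::-1]

impulse = np.zeros(401)
impulse[200] = 1.0
single = one_pole_lowpass(impulse)                     # smeared to the right only
double = forward_backward(impulse, one_pole_lowpass)   # symmetric about sample 200
```

A single pass delays and skews the impulse, whereas the two-pass response is symmetric about the original sample, so resynthesised glimpses stay time-aligned across channels. The price is that the magnitude response is applied twice.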
Results
• Little difference for 1-speaker background, suggesting relatively low contribution of info masking in this case (due to reversed masker?)
• Larger difference for 8-speaker case possibly due to ‘unrealistic’ glimpses
[Figure: correct identification (20-100) against target-to-masker ratio (-12, -6, 0 dB); curves: glimpses alone, speech+noise; N = 1, N = 8]
[Figure: correct identification (20-100) against target-to-masker ratio (-12, -6, 0 dB)]
Comparison with ideal model
Results
• Ideal model performs well in excess of listeners when supplied with precisely the same information
Possible reasons:
• Distortions
• Glimpses do not occur in isolation: possibility that a noise background will help
• Lack of nonsimultaneous masking model will inflate model performance
[Figure legend: Ideal (model), Ideal? (listeners)]
The glimpse decoder
• Attempt at a unifying statistical theory for primitive and model-driven processes in CASA
• Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use
– Key advantage: no longer need to rely on clean acoustics!
• Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation
• Barker, J, Cooke, M.P. & Ellis, D.P.W. “Decoding speech in the presence of other sources”, accepted for Speech Communication
Summary & outlook
• Proposed a glimpsing model of speech identification in noise
• Demonstrated sufficiency of information in target glimpses, at least for VCV task
• Preliminary definition of useful glimpse gives good overall model-listener match
• Introduced 2 procedures for measuring the amount of energetic masking (i) via ASR (ii) via glimpse resynthesis
• Need nonsimultaneous masking model
• Need to isolate effects due to schemas
• Repeat using non-reversed speech to introduce more informational masking
• Need to quantify effect of distortion in glimpse resynthesis
• …
[Figure: key words correct (%) against noise level relative to speech (-Inf, -40, -20, -10, 0 dB); curves: CF 633 Hz, CF 4200 Hz]
Cooke & Cunningham (in prep) Spectral induction with single speech-bands.
Masking noise can be beneficial
fullband
Warren et al (1995) demonstrated spectral induction effect with 2 narrow bands of speech with intervening noise
Speech modulated noise
• As in Brungart (2001)
• Model results and glimpse distributions indicate increase in energetic masking for this type of masker
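A speech-modulated noise masker can be approximated by imposing the broadband envelope of a speech signal on noise. The sketch below is a simplified illustration and is not claimed to reproduce Brungart's (2001) procedure: it uses white noise without any spectral shaping, a rectify-and-smooth envelope, and an assumed 160-sample window (10 ms at 16 kHz). The gated noise stands in for a speech recording.

```python
import numpy as np

def smooth_envelope(signal, win):
    """Broadband envelope: full-wave rectify, then moving-average smooth."""
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")

def speech_modulated_noise(speech, win=160, seed=0):
    """White noise carrying the envelope of speech, matched to it in RMS."""
    speech = np.asarray(speech, dtype=float)
    rng = np.random.default_rng(seed)
    smn = smooth_envelope(speech, win) * rng.standard_normal(len(speech))
    smn *= np.sqrt(np.mean(speech ** 2) / np.mean(smn ** 2))
    return smn

rng = np.random.default_rng(4)
gate = np.repeat([0.0, 1.0, 0.0, 1.0], 4000)   # silence / burst / silence / burst
speech = gate * rng.standard_normal(16000)     # stand-in for a speech signal
smn = speech_modulated_noise(speech)
```

The result keeps the temporal dips of the original signal but fills the active intervals with noise at all frequencies, which is consistent with fewer spectro-temporal glimpses than a natural speech masker offers.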
Speech modulated noise
Natural speech
natural, 1 spkr
natural, 8 spkr
SMN, 1 spkr
SMN, 8 spkr
Speech modulated noise
• Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with natural speech masker
• Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work needed to quantify relative contribution of
– Release from IM
– Absence of background models/cues
[Figure legend: N = 1, N = 8; SMN (model), NAT (model), SMN (listeners), NAT (listeners)]