A glimpsing model of speech perception

Martin Cooke & Sarah Simpson

Speech and Hearing Research

Department of Computer Science

University of Sheffield

http://www.dcs.shef.ac.uk/~martin

Motivation: The nonstationarity ‘paradox’

speech technology performance falls with the nonstationarity of the noise background …

Simpson & Cooke (2003)

Aurora eval

Miller (1947)

… while listeners appear to prefer a nonstationary background (8-12 dB SRT gain)

Possible factors

In a 1-speaker background, listeners can …

• … employ organisational cues from the background source to help segregate foreground

• … employ schemas for both foreground and background

• … benefit from better glimpses of the speech target

but: multi-speaker backgrounds have certain advantages …

• … less chance of informational masking

• … easier enhancement algorithm

Glimpsing opportunities

% of time-frequency regions with a locally-positive SNR

Spectro-temporal glimpse densities
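As a rough illustration of the measure above, the sketch below counts the percentage of time-frequency cells whose local SNR exceeds a threshold (0 dB by default). It assumes that target and masker energies are available separately on the same time-frequency grid, stored as nested lists of linear power values; the function names and the toy data are illustrative and do not come from the slides.

```python
import math

def local_snr_db(target_power, masker_power):
    """Local SNR in dB for one time-frequency cell (linear power inputs)."""
    if target_power == 0.0:
        return -math.inf  # a silent target cell is never a glimpse
    if masker_power == 0.0:
        return math.inf
    return 10.0 * math.log10(target_power / masker_power)

def glimpse_percentage(target, masker, threshold_db=0.0):
    """Percentage of cells whose local SNR exceeds threshold_db.

    target, masker: equally sized lists of frames, each a list of
    per-channel power values (an assumed representation).
    """
    total = 0
    glimpsed = 0
    for t_frame, m_frame in zip(target, masker):
        for t_pow, m_pow in zip(t_frame, m_frame):
            total += 1
            if local_snr_db(t_pow, m_pow) > threshold_db:
                glimpsed += 1
    return 100.0 * glimpsed / total if total else 0.0

# Toy example: 2 frames x 3 channels of made-up power values.
target = [[4.0, 1.0, 0.5], [2.0, 2.0, 8.0]]
masker = [[1.0, 1.0, 2.0], [4.0, 1.0, 1.0]]
pct = glimpse_percentage(target, masker)
print(f"{pct:.0f}% of cells have a locally positive SNR")
```

In the toy grid, three of the six cells have a local SNR above 0 dB; the cell where target and masker are equal (exactly 0 dB) is not counted.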

Glimpsing

Precursors

• Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech

• Related to ‘multiple looks’ model of Viemeister & Wakefield (1991) which demonstrated ‘intelligent’ temporal integration of tone bursts

• Assmann & Summerfield (in press) suggest ‘glimpsing & tracking’ as a way of understanding how listeners cope with adverse conditions

• Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s

• de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification

• Close relation to missing data processing (Cooke et al, 1994)

Informal definition

a glimpse is some time-frequency region which contains a reasonably undistorted ‘view’ of local signal properties

Types of glimpses

Comodulated

Eg Miller & Licklider (1950)

Spectral

Eg Warren et al (1995)

General uncomodulated

Eg Howard-Jones & Rosen (1993), Buss et al (2003)

Evidence from distorted speech

e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Intelligibility was 60% when 98% of the signal was missing.

Glimpsing in natural conditions: the dominance effect

Although audio signals combine additively, the occlusion metaphor is more appropriate because of log-like compression in the auditory system

Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
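A small worked example may help show why log-like compression makes one source dominate. The sketch assumes that the two sources are uncorrelated, so that their powers add within a time-frequency region, and computes how far the mixture level lies above the stronger source. The function name is illustrative.

```python
import math

def mixture_excess_db(level_a_db, level_b_db):
    """How far (in dB) the summed power lies above the stronger source.

    Assumes uncorrelated sources, so powers add within the region.
    """
    diff = abs(level_a_db - level_b_db)
    return 10.0 * math.log10(1.0 + 10.0 ** (-diff / 10.0))

for diff in (0, 3, 6, 10, 20):
    excess = mixture_excess_db(diff, 0)
    print(f"{diff:2d} dB apart: mixture is {excess:.2f} dB above the stronger source")
```

Under this assumption the mixture is about 3 dB above either source when the two are equal, and about 0.4 dB above the stronger source when they differ by 10 dB, so on a log scale the stronger source largely determines the local level.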

Issues for a glimpsing model

What constitutes a useful glimpse?

Is sufficient information contained in glimpses?

How do listeners detect glimpses?

How can they be integrated?

Glimpse detection

Glimpse integration

Glimpsing study

Aims

– Determine if glimpses contain sufficient information

– Explore definition of useful glimpse

• Comparison between listeners and model using natural VCV stimuli

• Subset of Shannon et al (1999) corpus

– V = /a/

– C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }

• Background source

– reversed multispeaker babble for N=1, 8

– Allows variation in glimpsing opportunities

– 3 SNRs (TMRs): 0, -6 and -12 dB

• 12 listeners heard 160 tokens in each condition

– 2 repeats × 16 VCVs × 5 male speakers

Identification results

[Figure legend: 8-speaker; 1-speaker]

Glimpsing model

• CDHMM employing missing data techniques

• 16 whole-word HMMs

– 8 states

– 4 component Gaussian mixture per state

• Input representation

– 10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)

– NB: only simultaneous masking is modelled

• Training

– 8 repetitions of each VCV by 5 male speakers per model

• Testing

– As for listeners viz. 2 repetitions of each VCV by 5 male speakers

– Performance in clean: > 99%
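The slides do not say which missing data technique the recogniser uses. One common option, assumed here purely for illustration, is marginalisation: with diagonal-covariance Gaussians, a state likelihood is evaluated over the glimpsed (reliable) channels only, and the remaining channels are ignored. The sketch below shows this for a single frame and a single state; all names and numbers are hypothetical, and the toy mixture has 2 components and 3 channels rather than the 4 components and 40 channels described above.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_state_loglik(frame, mask, mixture):
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    using only the channels flagged as reliable in mask.

    mixture: list of (weight, means, variances) tuples.
    """
    comp_logs = []
    for weight, means, variances in mixture:
        ll = math.log(weight)
        for x, reliable, m, v in zip(frame, mask, means, variances):
            if reliable:  # unreliable channels are marginalised out
                ll += log_gauss(x, m, v)
        comp_logs.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    peak = max(comp_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in comp_logs))

# Toy state with made-up parameters.
mixture = [
    (0.6, [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]),
    (0.4, [3.0, 3.0, 3.0], [2.0, 2.0, 2.0]),
]
frame = [0.5, 9.9, 2.5]
mask = [True, False, True]  # the middle channel is treated as masked
score = marginal_state_loglik(frame, mask, mixture)
```

Because the masked channel never enters the sum, its value has no effect on the score, which is the property that lets a recogniser of this kind operate on glimpses alone.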

Ideal glimpses

• All time-frequency regions whose local SNR exceeds a threshold

• Optimum threshold = 0 dB

• For this task, there is more than sufficient information in the glimpsed regions

• Listeners perform suboptimally with respect to this glimpse definition

Model performance I: ideal glimpses

[Figure panels: N = 1 and N = 8 background speakers]

Model performance: variation in detection threshold

Q: Can varying the local SNR threshold for glimpse detection produce a better match?

• No choice of local SNR threshold provides a good fit to listeners

• Closest fit shown (-6 dB)

[Figure panels: N = 1 and N = 8 background speakers]

Analysis

• Unreasonable to expect listeners to detect individual glimpses in a sea of noise unless glimpse region is large enough

Model performance: useable glimpses

• Definition: glimpsed region must occupy at least N ERBs and T ms

• Search over 1-15 ERBs, 10-100 ms, at various detection thresholds

• Best match at

– 6.3 ERBs (9 channels)

– 40 ms

– 0 dB local SNR threshold

[Figure panels: N = 1 and N = 8 background speakers]

• Howard-Jones & Rosen (1993) suggested 2-4 bands limit for uncomodulated glimpsing

• Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands
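The useable-glimpse definition above can be sketched as a filter on a binary glimpse mask. The slides do not spell out how "occupy at least N ERBs and T ms" is measured, so this sketch makes two assumptions: a glimpse region is a 4-connected set of cells above the detection threshold, and it is kept if it spans at least `min_channels` distinct channels and `min_frames` distinct frames. The defaults follow the best match reported above (9 channels; 40 ms, i.e. 4 frames of 10 ms). Function and variable names are illustrative.

```python
from collections import deque

def useable_glimpses(mask, min_channels=9, min_frames=4):
    """Keep only connected glimpse regions that are large enough.

    mask: list of frames, each a list of booleans (one per channel).
    """
    n_frames = len(mask)
    n_chans = len(mask[0]) if mask else 0
    seen = [[False] * n_chans for _ in range(n_frames)]
    out = [[False] * n_chans for _ in range(n_frames)]
    for t0 in range(n_frames):
        for c0 in range(n_chans):
            if not mask[t0][c0] or seen[t0][c0]:
                continue
            # Breadth-first search collects one 4-connected region.
            region = []
            queue = deque([(t0, c0)])
            seen[t0][c0] = True
            while queue:
                t, c = queue.popleft()
                region.append((t, c))
                for dt, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    tn, cn = t + dt, c + dc
                    inside = 0 <= tn < n_frames and 0 <= cn < n_chans
                    if inside and mask[tn][cn] and not seen[tn][cn]:
                        seen[tn][cn] = True
                        queue.append((tn, cn))
            frames = {t for t, _ in region}
            chans = {c for _, c in region}
            if len(frames) >= min_frames and len(chans) >= min_channels:
                for t, c in region:
                    out[t][c] = True
    return out

# Toy mask: a 4-frame x 9-channel block plus one isolated cell.
demo = [[False] * 12 for _ in range(6)]
for t in range(4):
    for c in range(9):
        demo[t][c] = True
demo[5][11] = True
kept = useable_glimpses(demo)
```

In the toy mask the large block survives and the isolated cell is discarded. Other readings of the definition, such as a minimum area or a minimum rectangular extent, would need a different size test.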

Consonant identification

Identification of individual consonants (%)

consonant   b   p   d   t   g   k   l   r   m   n   s   sh  ch  v   f   z   all
listeners   45  61  64  83  76  70  50  72  68  77  90  91  79  54  63  92  71
model       85  68  73  83  85  85  58  67  60  78  35  73  87  60  78  53  71

• Reasonable matches overall apart from b, s & z

• However, little token-by-token agreement between common listener errors and model errors.

• Why?
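To make the per-consonant comparison concrete, the sketch below takes the listener and model scores reported above and ranks the consonants by the size of the model-listener gap. Only the variable names are new.

```python
consonants = ["b", "p", "d", "t", "g", "k", "l", "r",
              "m", "n", "s", "sh", "ch", "v", "f", "z"]
listeners = [45, 61, 64, 83, 76, 70, 50, 72, 68, 77, 90, 91, 79, 54, 63, 92]
model = [85, 68, 73, 83, 85, 85, 58, 67, 60, 78, 35, 73, 87, 60, 78, 53]

# Positive gap: the model scores higher than listeners.
gaps = {c: m - l for c, l, m in zip(consonants, listeners, model)}

# The three consonants with the largest absolute gap.
worst = sorted(gaps, key=lambda c: abs(gaps[c]), reverse=True)[:3]
for c in worst:
    print(f"{c}: model minus listeners = {gaps[c]:+d} percentage points")
```

The three largest gaps are for s, b and z, with the model well below listeners on s and z and well above them on b.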

Factors

Audibility of target

Organisational cues in target

Organisational cues in background

‘Confusability’

Existence of schemas for target

Existence of schemas for background

Informational masking

Energetic masking

Successful identification

Measuring energetic masking

Approach: resynthesise glimpses alone

• Filter, time-reverse, refilter to remove phase distortion

• Select regions based on local SNR mask
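The filter, time-reverse, refilter step can be illustrated with a toy example. The slides presumably apply it with the auditory filterbank; the sketch below swaps in a simple one-pole low-pass filter (an assumption made only to keep the example short) to show that passing a signal through the same filter forwards and then backwards cancels the filter's phase shift.

```python
def one_pole_lowpass(signal, a=0.5):
    """Causal one-pole low-pass: y[n] = (1 - a) * x[n] + a * y[n - 1]."""
    out = []
    prev = 0.0
    for x in signal:
        prev = (1.0 - a) * x + a * prev
        out.append(prev)
    return out

def zero_phase(signal, a=0.5):
    """Filter, time-reverse, refilter, then restore the time order."""
    forward = one_pole_lowpass(signal, a)
    backward = one_pole_lowpass(forward[::-1], a)
    return backward[::-1]

# An impulse in the middle of a short silent signal.
impulse = [0.0] * 41
impulse[20] = 1.0
single_pass = one_pole_lowpass(impulse)  # response smeared to later samples only
double_pass = zero_phase(impulse)        # response symmetric about sample 20
```

A single pass delays energy relative to the impulse, whereas the forward-backward response is symmetric about it. That symmetry is what zero phase distortion means here. The magnitude response is applied twice, which any resynthesis would need to take into account.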

Results

• Little difference for 1-speaker background, suggesting relatively low contribution of informational masking in this case (due to reversed masker?)

• Larger difference for 8-speaker case possibly due to ‘unrealistic’ glimpses

[Figure: correct identification against target-to-masker ratio (-12, -6, 0 dB) for ‘glimpses alone’ and ‘speech+noise’, N = 1 and N = 8 background speakers]

[Figure: correct identification against target-to-masker ratio (-12, -6, 0 dB)]

Comparison with ideal model

Results

• Ideal model performs well in excess of listeners when supplied with precisely the same information

Possible reasons:

• Distortions

• Glimpses do not occur in isolation: possibility that a noise background will help

• Lack of nonsimultaneous masking model will inflate model performance

Ideal (model)

Ideal? (listeners)

The glimpse decoder

• Attempt at a unifying statistical theory for primitive and model-driven processes in CASA

• Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use

– Key advantage: no longer need to rely on clean acoustics!

• Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation

• Barker, J, Cooke, M.P. & Ellis, D.P.W. “Decoding speech in the presence of other sources”, accepted for Speech Communication

Summary & outlook

• Proposed a glimpsing model of speech identification in noise

• Demonstrated sufficiency of information in target glimpses, at least for VCV task

• Preliminary definition of useful glimpse gives good overall model-listener match

• Introduced 2 procedures for measuring the amount of energetic masking (i) via ASR (ii) via glimpse resynthesis

• Need nonsimultaneous masking model

• Need to isolate effects due to schemas

• Repeat using non-reversed speech to introduce more informational masking

• Need to quantify effect of distortion in glimpse resynthesis

• …

Masking noise can be beneficial

[Figure: keywords correct (%) against noise level relative to speech (dB, from -Inf to 0); legend: CF 633 Hz, CF 4200 Hz; label: fullband]

Cooke & Cunningham (in prep) Spectral induction with single speech-bands.

Warren et al (1995) demonstrated spectral induction effect with 2 narrow bands of speech with intervening noise

Speech modulated noise


• As in Brungart (2001)

• Model results and glimpse distributions indicate increase in energetic masking for this type of masker


Natural speech

[Figure legend: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr]

• Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with natural speech masker

• Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work needed to quantify relative contribution of

– Release from IM

– Absence of background models/cues

[Figure legend: SMN (model); NAT (model); SMN (listeners); NAT (listeners); panels: N = 1 and N = 8 background speakers]
