Computational Perception 15-485/785 Auditory Scene Analysis


Page 1: (source file: lewicki/cp-s08/auditory-scene-analysis.pdf)

Computational Perception 15-485/785

Auditory Scene Analysis

Page 2:

A framework for auditory scene analysis

• Auditory scene analysis involves low- and high-level cues

• Low-level acoustic cues often result in spontaneous grouping

• High-level cues can be under attentional control

CP08: Auditory Scene Analysis 1 / Michael S. Lewicki, CMU

Page 3:

Cues for auditory grouping

• temporal separation

• spectral separation

• harmonicity

• timbre

• temporal onsets and offsets

• temporal/spectral modulations

• spatial separation


Page 4:

Frequency and tempo cues for stream segregation

Bregman demo 1.


Page 5:

Perceptual streams can be masked

Bregman demo 2. “Auditory camouflage”


Page 6:

Stream segregation and rhythmic information

Bregman demo 3. The perception of rhythmic information depends on the segmentation of the auditory streams.


Page 7:

Streaming in African xylophone music

Bregman demo 7.


Page 8:

Effect of pitch range

Bregman demo 8.


Page 9:

Effect of timbre

Bregman demo 9.


Page 10:

Stream segregation based on spectral peak position

Bregman demo 10.


Page 11:

Effect of connectedness

Bregman demo 12.


Page 12:

Grouping based on common fundamental frequency

Bregman demo 18.

• Does the set of frequencies come from the same source?

• Those with a common fundamental are grouped together.


Page 13:

Fusion based on common frequency modulation

Bregman demo 19.


Page 14:

Fusion based on common frequency modulation

Bregman demo 20.


Page 15:

Onset rate affects segregation

Bregman demo 21.


Page 16:

Rhythmic masking release

Bregman demo 22.

• This is an example of temporal grouping with overlapping frequency ranges.

• Frequencies outside the “critical band” of a target influence the ability to hear it.


Page 17:

Micro modulation in voice perception

Bregman demo 24.

• Normal speech vowels contain small, random fluctuations

• the pure tone plus harmonic has weak vowel quality

• addition of micro-modulation enhances the perception of the sound as a singing vowel


Page 18:

Blind source separation

Page 19:

Modeling the cocktail party problem

Suppose we have two speakers (sources), s1(t) and s2(t), and two microphones (mixtures), x1(t) and x2(t):

(Diagram: sources s1(t) and s2(t) recorded by microphones as mixtures x1(t) and x2(t))

The general problem is called blind source separation. We only observe the mixtures and are “blind” to the sources.

How do we model this in general?



Page 23:

A general formulation

Suppose we have M sources

s1(t), . . . , sM(t)

and M mixtures

x1(t), . . . , xM(t)

This can be represented diagrammatically as:

(Diagram: sources s1(t), …, sM(t) pass through the mixing matrix A to produce mixtures x1(t), …, xM(t))

What’s the simplest mathematical model?

Assume linear, instantaneous mixing:

x(t) = As(t)

A is called the mixing matrix.

What does this model ignore?

• room acoustics, reverberation, echoes

• filtering, noise

• there might be more than two sounds

• sounds might not come from a single source

• sound sources could change location
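As a concrete illustration, the linear instantaneous model can be sketched in a few lines of NumPy; the sources and mixing matrix below are made up for illustration:

```python
import numpy as np

# Minimal sketch of linear, instantaneous mixing, x(t) = A s(t).
rng = np.random.default_rng(0)

# Two super-Gaussian (Laplacian) sources, 1000 samples each.
s = rng.laplace(size=(2, 1000))

# Mixing matrix A: each microphone hears a weighted sum of both sources.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

x = A @ s                      # the observed mixtures

# With A known, the sources are recovered exactly by inverting the mixing.
s_hat = np.linalg.inv(A) @ x
print(np.allclose(s_hat, s))   # True
```

The whole difficulty of blind separation is that A is not known; only x is observed.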


Page 24:

The distribution of sample points

(Figure: amplitude histograms of s1 with β = 2 and s2 with β = 4, and a scatter plot of the mixtures)

• The histograms show the amplitude distribution of each source.

• The scatter plot shows the 2D distribution of x1(t) vs x2(t).

• The parameter β characterizes the sharpness of the data using the distribution

  P(s) ∝ e^(−|s|^(2/(1+β)) / 2)

• β = 0 corresponds to a Gaussian distribution
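The exponential power density above can be written directly; this is a minimal sketch (unnormalized, since only the shape matters here):

```python
import numpy as np

# Unnormalized exponential power density from the slide:
#   P(s) ∝ exp(-|s|^(2/(1+beta)) / 2)
# beta = 0 gives a Gaussian; beta > 0 gives sharper (super-Gaussian) peaks.
def expower(s, beta):
    return np.exp(-np.abs(s) ** (2.0 / (1.0 + beta)) / 2.0)

s = np.linspace(-4, 4, 9)

# beta = 0 reduces exactly to the (unnormalized) Gaussian exp(-s^2 / 2).
gauss = np.exp(-s ** 2 / 2.0)
print(np.allclose(expower(s, 0.0), gauss))  # True

# Larger beta leaves more mass in the tails relative to a Gaussian.
print(expower(np.array([0.0, 3.0]), 2.0))
```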


Page 25:

The distribution of sample points

(Figure: amplitude histograms of s1 with β = 1 and s2 with β = 2, and a scatter plot of the mixtures)

• Sources can have a wide range of amplitude distributions.

• Different directions correspond to different mixing matrices.

• A mixture of non-Gaussian sources creates a distribution whose axes can be uniquely determined (up to a sign change).


Page 26:

The distribution of sample points

(Figure: amplitude histograms of s1 with β = −0.8 and s2 with β = −0.4, and a scatter plot of the mixtures)

The axes of sub-Gaussian sources, i.e. β < 0 or negative kurtosis, can still be determined.


Page 27:

A Gaussian distribution of sample points

(Figure: amplitude histograms of two Gaussian sources, both β = 0, and a scatter plot of the mixtures)

The principal axes of two Gaussian sources are ambiguous.

• Why?

• Because the product of two Gaussians is still a Gaussian, so there are an infinite number of directions that fit the 2D distribution.
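The ambiguity can be checked with a two-line covariance calculation; the rotation angle below is arbitrary:

```python
import numpy as np

# Why Gaussian sources are ambiguous: an isotropic Gaussian's covariance is
# unchanged by any rotation, so no axis direction is preferred.
theta = 0.7                      # an arbitrary rotation angle (illustrative)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

cov = np.eye(2)                  # covariance of two unit-variance Gaussian sources
rotated_cov = R @ cov @ R.T      # covariance after mixing by a rotation

# The rotated covariance is still the identity: the data look the same from
# every direction, so the mixing rotation cannot be identified.
print(np.allclose(rotated_cov, cov))  # True
```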


Page 28:

Inferring the (un)mixing matrix

How do we determine the axes from just the data?


Page 29:

Modeling non-Gaussian distributions

Learning objective: model the statistical density of the sources, i.e. maximize P(x|A) over A.

Probability of the pattern ensemble is:

P(x1, x2, ..., xN | A) = ∏_k P(x_k | A)

To obtain P(x|A), marginalize over s:

P(x|A) = ∫ ds P(x|A, s) P(s) = P(s) / |det A|

Learning rule (ICA):

ΔA ∝ A Aᵀ (∂/∂A) log P(x|A) = −A (z sᵀ − I),

where z = (log P(s))′. Use P(s_i) ∼ ExPwr(s_i | μ, σ, β_i).

This is identical to the procedure that was used to learn efficient codes, i.e. independent component analysis.
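A minimal sketch of the natural-gradient ICA update, written in the equivalent unmixing-matrix form with tanh standing in for (log P(s))′ of a super-Gaussian source; the data, step size, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two super-Gaussian (Laplacian) sources, linearly mixed.
N = 5000
s = rng.laplace(size=(2, N))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# Natural-gradient ICA on the unmixing matrix W (the counterpart of the
# slide's update on A):  dW ∝ (I - E[phi(u) u^T]) W,  u = W x,
# with phi(u) = tanh(u) as a generic super-Gaussian score function.
W = np.eye(2)
lr = 0.05
for _ in range(500):
    u = W @ x
    grad = np.eye(2) - (np.tanh(u) @ u.T) / N
    W += lr * grad @ W

# Each recovered component should match one true source up to scale/sign.
u = W @ x
corr = np.abs(np.corrcoef(np.vstack([u, s]))[:2, 2:])
print(corr.max(axis=1))   # each entry near 1
```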


Page 30:

Michael S. Lewicki ◇ Carnegie Mellon, CP08: Auditory Stream Analysis 1

Separating mixtures of real sources

(Waveform panels: Sources 1–4 and Mixtures 1–4)


Page 39:

Separating mixtures of synthetic sources: 4 sources, 4 mixtures

(Figure: histograms of the eight synthetic sources, with β = −0.75, −0.17, −0.14, 0.22, 0.50, 1.02, 2.00, 2.94)

source:         1      2     3     4      5      6      7      8
avg. SNR (dB):  40.48  4.84  4.71  17.29  35.03  43.07  45.85  44.10
std. dev.:      1.35   0.39  0.42  2.08   1.49   0.50   1.75   2.00

• Experiment: recover synthetic sources from a random mixing matrix. Repeat 5 times.

• SNR reflects the accuracy of the inferred mixing matrix.

• The near-Gaussian sources cannot be accurately recovered.
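One plausible way to compute such an SNR (the slides do not give the exact convention) fits an optimal scalar gain first, so ICA's scale and sign ambiguity does not penalize the score:

```python
import numpy as np

# Score a recovered source against the true one: fit the best scalar gain,
# then measure the residual power in dB.
def snr_db(s_true, s_est):
    g = np.dot(s_est, s_true) / np.dot(s_est, s_est)   # optimal gain
    err = s_true - g * s_est
    return 10.0 * np.log10(np.dot(s_true, s_true) / np.dot(err, err))

t = np.linspace(0, 1, 1000, endpoint=False)
s = np.sin(2 * np.pi * 5 * t)
noise = 0.01 * np.cos(2 * np.pi * 50 * t)   # small orthogonal error term

print(round(snr_db(s, s + noise), 1))       # about 40 dB for a 1% error
# Scaling the estimate changes nothing:
print(np.isclose(snr_db(s, s + noise), snr_db(s, 3.0 * (s + noise))))  # True
```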


Page 40:

Computational auditory scene analysis

How do we incorporate these ideas in a computational algorithm?

Blind source separation (ICA) only solves a special case:

• non-Gaussian sources

• linear, stationary mixtures

• equal number of sources and mixtures

• need at least two mixtures

What about approaches that use auditory grouping cues?

Page 41:

Page 42:

Cooke (1993)

“Modeling Auditory Processing and Organisation”

• uses gammatone filter bank to model auditory periphery

• auditory “objects” are represented by “synchrony strands”

• synchrony strands are formed according to auditory grouping principles

Want an auditory representation that

• describes the auditory scene in terms of time-frequency objects

• characterizes onsets, offsets, and movement of spectral components

• allows for further parameters to be calculated
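The gammatone front end can be sketched directly from its impulse response; the order and bandwidth below are typical values, not Cooke's exact parameters:

```python
import numpy as np

# Minimal gammatone impulse response, the front-end filter the model uses:
#   g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)
fs = 16000
fc = 1000.0            # center frequency (Hz)
b = 125.0              # bandwidth parameter (Hz), a typical choice
n = 4                  # filter order, a typical choice
t = np.arange(int(0.064 * fs)) / fs
g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
g /= np.max(np.abs(g))

# The filter's frequency response peaks near its center frequency.
spectrum = np.abs(np.fft.rfft(g))
peak_hz = np.argmax(spectrum) * fs / len(g)
print(peak_hz)   # close to 1000 Hz
```

A full filter bank repeats this with center frequencies spaced along the cochlear (ERB) scale.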

Page 43:

Overview of stages in synchrony strand formation

Page 44:

Auditory periphery model

Page 45:

Hair cell adaptation

Hair cell model is based on adaptive non-linearities of cochlear responses

Page 46:

Next stage: extraction of dominant frequency

Page 47:

Dominant frequency estimation

a) waveform for the vowel /ae/

b) auditory filter output for 1kHz center frequency

c) instantaneous frequency

from Hilbert transform of gammatone filter

d) an alternative method of estimating instantaneous frequency

e) smoothed instantaneous frequency

median filtering over 10ms, plus linear smoothing

also allows for data reduction: samples are collapsed over 1ms window

Want a pure frequency representation in order to apply grouping rules.
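Stage (c), instantaneous frequency from the Hilbert transform, can be sketched with an FFT-based analytic signal; a pure tone stands in here for the gammatone filter output:

```python
import numpy as np

# Instantaneous frequency from the analytic signal (Hilbert transform).
fs = 16000
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 1000 * t)

# Analytic signal via FFT: zero the negative frequencies, double positives.
Y = np.fft.fft(y)
N = len(y)
Y[N // 2 + 1:] = 0.0
Y[1:N // 2] *= 2.0
analytic = np.fft.ifft(Y)

# Instantaneous frequency = derivative of unwrapped phase / (2*pi).
phase = np.unwrap(np.angle(analytic))
inst_freq = np.diff(phase) * fs / (2 * np.pi)

# Away from the edges the estimate sits on the true frequency.
print(np.median(inst_freq[100:-100]))   # ~1000 Hz
```

Cooke's stage (e) then median-filters and smooths this estimate over time.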

Page 48:

Next stage: grouping frequency channels

Page 49:

Calculation of place groups

• What frequencies belong to the same sound?

• Goal of this stage is to locate and characterize intervals along filterbank with synchronous activity.

• (a) Notice that center frequency is not distributed evenly due to the median filtering in the frequency estimation stage.

• (b) The instantaneous frequency varies around the estimated frequency.

• Idea is to group all channels centered around the “dominant frequency”, i.e. grouping by spectral location.

• (c) S(f) is a smoothed frequency derivative estimate. Channels at the minimum are grouped together.

Page 50:

Plot of place groups for a speech waveform

Page 51:

Waveform resynthesized from place groups

Reconstruction quality is good

• best is indistinguishable from original

• worst is still clearly intelligible

• best for harmonic sounds

Resynthesize waveform by summing sines in place group
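Summing sines along (frequency, amplitude) tracks can be sketched with phase accumulation; the tracks below are invented for illustration:

```python
import numpy as np

# Resynthesis in the spirit of the slide: each place group contributes a
# sinusoid that follows its (frequency, amplitude) track; summing the
# sinusoids reconstructs the waveform.
fs = 16000

def synth(freq_track, amp_track):
    # Phase accumulation lets the frequency vary smoothly over time.
    phase = 2 * np.pi * np.cumsum(freq_track) / fs
    return amp_track * np.sin(phase)

n = fs // 4                              # 250 ms
f1 = np.full(n, 440.0)                   # a steady strand at 440 Hz
f2 = np.linspace(880.0, 900.0, n)        # a slowly gliding strand
y = synth(f1, np.full(n, 1.0)) + synth(f2, np.full(n, 0.5))

# The dominant FFT component of the result is the 440 Hz strand.
spectrum = np.abs(np.fft.rfft(y))
peak_hz = np.argmax(spectrum) * fs / n
print(peak_hz)   # ~440 Hz
```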

Page 52:

More synchrony strand displays: female speech

Page 53:

Synchrony strand display of male speech

Page 54:

Synchrony strand display of noise burst sequence

Page 55:

Synchrony strand display of new wave music

Page 56:

Modeling auditory scene exploration

• The synchrony strand display has done local grouping of frequencies, but this alone does not provide a way to group sounds from the same source.

• The next part of the model develops ways to construct higher-order groups.

Page 57:

A hierarchical view of auditory scene analysis

Top level: the stable interpretation which results when all organization has been discovered and all competitions resolved.

Intermediate level: results from combining groups with similar derived properties such as pitch and contour.

First level: results from the independent application of organizing principles such as harmonicity, common onset, etc.

But it’s not as simple as assigning strands to different sounds

Page 58:

Duplex perception (Lieberman, 1982)

• Stimulus: a synthetic syllable with 3 formants

• Syllable minus the third formant transition is played to one ear

• Missing formant transition is played to the other ear

What do subjects hear?

• They hear the syllable as if all formants had been presented to one ear.

• But, they also hear an isolated chirp

This suggests that components are shared across groups

Page 59:

Cooke’s framework for auditory scene exploration

1. Seed selection (thick line)

2. Choose ‘simultaneous set’ i.e. those strands which overlap in time with the seed

3. Apply grouping constraint, e.g. suppose the black strands share a common rate of amplitude modulation with the seed

4. Sequential phase: choose strand to extend the temporal basis of the group

5. Back to simultaneous phase: consider for similarity strands that overlap with new seed

6. Seeds may be selected from any in the group which extended in time

Page 60:

Harmonic constraint propagation for a voiced utterance

Page 61:

Implementation of auditory grouping principles

Cooke’s algorithm implements the following grouping principles:

• harmonicity

• common amplitude modulation

• common frequency movement

Also need “subsumption” to form higher level groups.

• Idea is to remove groups that are contained within a larger group

• Groups expand until at some point they are assigned to a source

Also note that the algorithm involves many heuristics which are not covered here.

Page 62:

Harmonic grouping

Harmonic groups discovered by various synchrony strands

Scoring function used in assessing how well strands fit sieve slots
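A toy version of the harmonic sieve illustrates the scoring idea; the tolerance and scoring rule are illustrative, not Cooke's exact fit function:

```python
import numpy as np

# Toy harmonic sieve: score a candidate fundamental by how many strand
# frequencies fall near one of its harmonic "slots" (5% tolerance assumed).
def sieve_score(strand_freqs, f0, tol=0.05):
    score = 0
    for f in strand_freqs:
        harmonic = round(f / f0)
        if harmonic >= 1 and abs(f - harmonic * f0) <= tol * f0:
            score += 1
    return score

# Strands like the 4-formant syllable example: most lie on harmonics of
# 110 Hz, one group on harmonics of 174 Hz.
strands = [110.0, 220.0, 330.0, 440.0, 550.0, 174.0, 348.0]

print(sieve_score(strands, 110.0))   # 5
print(sieve_score(strands, 174.0))   # 2
```

Competing fundamentals thus partition the strands into two harmonic groups, as on the slide for page 64.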

Page 63:

Harmonic grouping of frequency modulated sounds

Page 64:

Harmonic grouping for synthetic 4-formant syllable

1st, 3rd, and 4th formants are synthesized on a fundamental of 110 Hz

2nd formant is synthesized on a fundamental of 174 Hz

Algorithm finds two groups

Page 65:

Grouping by common amplitude modulation

Page 66:

Grouping by common frequency movement

Page 67:

Subsumption: forming higher level groups

Remove groups that are contained within a larger group

Page 68:

Grouping natural speech

Page 69:

Test sounds and noise sources

Page 70:

Synchrony strand representations of test sounds

Page 71:

The mixture correspondence problem

Page 72:

Performance: Utterance characterization

Page 73:

Performance: crossover from intrusive source

How much of the intrusive source is incorrectly grouped?

• Least intrusion when the intrusive source is noise bursts (n2)

• Worst is when the intrusive source is music (n4)

Page 74:

Grouping of speech from a mixture (cocktail party)

Page 75:

Grouping speech in presence of laboratory noise