
Page 1: Speech Separation for Recognition and Enhancement
(dpwe/talks/DARPA-2011-10-27.pdf, Dan Ellis, 2011-10-27, 18 slides)

1. Speech in the Wild
2. Separation by Space
3. Separation by Pitch
4. Separation by Model

Speech Separation for Recognition and Enhancement

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY
and International Computer Science Institute, Berkeley, CA
[email protected] · http://labrosa.ee.columbia.edu/

Page 2:

1. Speech in the Wild

• The world is cluttered
  sound is transparent
  mixtures are inevitable

• Useful information is structured by ‘sources’
  specific definition of a ‘source’: intentional independence

Page 3:

Speech in the Wild: Examples

• Multi-party discussions

• Ambient recordings

• Applications: communications, robots, lifelogging/archives

Page 4:

Recognizing Speech in the Wild

• Current ASR relies on low-D representations
  e.g. 13-dimensional MFCC features every 10 ms

• We need separation!

[Figure: ICSI Meeting Room excerpt, spectrogram (freq/kHz vs. time/s, level/dB)
and its MFCC-based resynthesis, 0-10 s, 0-4 kHz, -60 to 0 dB]

  very successful for clean speech!
  inadequate for mixtures
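The 13-dimensional-MFCC front end mentioned above can be sketched directly. This is a generic, minimal MFCC pipeline in numpy (frame sizes, filterbank count, and the random test signal are common defaults and stand-ins, not the talk's exact configuration):

```python
import numpy as np

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160, n_mels=26, n_fft=512):
    """Minimal MFCC: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    # Frame the signal: 25 ms windows every 10 ms (400/160 samples at 16 kHz)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # (n_frames, n_fft//2+1)

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)              # (n_frames, n_mels)

    # DCT-II decorrelates; keep only the first n_mfcc coefficients (the low-D part)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T                                 # (n_frames, n_mfcc)

sig = np.random.randn(16000)   # 1 s of noise stands in for speech
feats = mfcc(sig)              # 13 coefficients per 10 ms frame
```

Keeping only 13 coefficients is exactly the dimensionality reduction that works well for clean speech but discards the detail needed to pull apart a mixture.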

Page 5:

2. Speech Separation

• How can we separate speech information?

  Noisy Speech → Analyze → Select/Enhance → Cleaned Speech (features) → Application
  • Analyze: spatial cues, pitch, speech probabilities, ...
  • Select/Enhance: T-F masking, Wiener filtering, reconstruction
  • Application: recognition, listening, ...
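The Analyze → Select/Enhance → Reconstruct chain can be made concrete with an oracle Wiener-style time-frequency mask. This sketch assumes scipy and invents a tone-plus-noise "mixture" purely for illustration; real systems must estimate the mask from cues such as space, pitch, or models:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)      # a 220 Hz tone stands in for "speech"
noise = 0.5 * rng.standard_normal(fs)
mix = speech + noise

# Analyze: STFT of the mixture and (oracle) of the components
_, _, X = stft(mix, fs, nperseg=256)
_, _, S = stft(speech, fs, nperseg=256)
_, _, N = stft(noise, fs, nperseg=256)

# Select/Enhance: Wiener-style T-F mask |S|^2 / (|S|^2 + |N|^2)
mask = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)

# Reconstruction: inverse STFT of the masked mixture
_, cleaned = istft(mask * X, fs, nperseg=256)
n = min(len(speech), len(cleaned))

def snr_db(ref, est):
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

improvement = snr_db(speech[:n], cleaned[:n]) - snr_db(speech[:n], mix[:n])
```

With the oracle mask, the masked mixture recovers the target several dB better than the raw mix; the rest of the talk is about where the mask comes from when there is no oracle.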

Page 6:

Separation by Spatial Info

• Given multiple microphones, sound carries spatial information about the source
• E.g. model the interaural spectrum of each source
  as stationary level and time differences
• e.g. at 75°, in reverb:

[Figure: IPD, IPD residual, and ILD for a source at 75° in reverb]
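The interaural cues named on this slide are easy to compute from a two-channel STFT. Here the two-mic scene (an 8-sample delay and a 6 dB level offset) is a made-up stand-in for a real recording:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(1)
src = rng.standard_normal(fs)

# Hypothetical two-mic scene: the right channel is delayed by 8 samples (0.5 ms)
# and attenuated by 6 dB, crudely modeling a source off to the left
left = src
right = np.roll(src, 8) * 10 ** (-6 / 20)

_, _, XL = stft(left, fs, nperseg=512)
_, _, XR = stft(right, fs, nperseg=512)

eps = 1e-12
ipd = np.angle(XL * np.conj(XR))                              # interaural phase difference per T-F cell
ild = 20 * np.log10((np.abs(XL) + eps) / (np.abs(XR) + eps))  # interaural level difference in dB

# Per-cell ILDs cluster around the simulated 6 dB offset
ild_est = np.median(ild)
```

In reverb the per-cell cues scatter around these stationary values, which is why the next slide models them probabilistically rather than reading them off directly.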

Page 7:

Model-Based EM Source Separation and Localization (MESSL)

Mandel et al. ’10

• EM alternation: assign spectrogram points to sources (E-step),
  then re-estimate source parameters (M-step)
• Can model more sources than sensors
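MESSL's alternation can be illustrated with a stripped-down analogue: a two-component Gaussian EM over synthetic per-cell delay observations, where the E-step plays the role of assigning spectrogram points to sources and the M-step re-estimates the source parameters. The data and model here are toys, not MESSL's actual interaural likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-T-F-cell delay estimates (ms) from two sources, a toy stand-in
# for MESSL's interaural-phase observations
obs = np.concatenate([rng.normal(-0.3, 0.1, 500), rng.normal(0.4, 0.1, 500)])

# Two-source Gaussian EM, deliberately mis-initialized
mu = np.array([-1.0, 1.0])
sigma = np.array([0.5, 0.5])
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: soft mask = posterior probability of each source per cell
    lik = pi * np.exp(-0.5 * ((obs[:, None] - mu) / sigma) ** 2) / sigma
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate each source's delay mean, spread, and prior
    nk = post.sum(axis=0)
    mu = (post * obs[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((post * (obs[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(obs)

mu_sorted = np.sort(mu)   # converges near the true delays -0.3 and 0.4
```

The posteriors `post` are exactly the soft T-F masks: because each cell gets its own assignment rather than a hard spatial beam, nothing limits the number of sources to the number of sensors.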

Page 8:

MESSL Results

• Modeling uncertainty improves results
  tradeoff between constraints & noisiness

[Figure: ground-truth vs. algorithmic T-F masks with output SDRs from -2.72 to
12.35 dB; percent correct vs. target-to-masker ratio (dB), left panel: Human,
DP-Oracle, Oracle, AllRev, Mixes; right panel: Human, Sawada, Mouba, MESSL-G,
MESSL, DUET, Mixes]

• Helps with recognition (digits accuracy)

Page 9:

3. Separation by Pitch

• Voiced syllables have near-periodic “pitch”
  perceptually salient
  lost in MFCCs
• Can we track pitch & use it for separation?
  ... and other speech tasks?

Monaural Informational Masking: Level Differences

  level difference can help in speech segregation
  listen for the quieter talker!

Brungart et al.’01

Page 10:

Noise-Robust Pitch Tracking

• Important for voice detection & separation
• Based on channel selection (Wu, Wang & Brown ’03)
  pitch from summary autocorrelation over “good” bands
  a trained classifier decides which channels to include (BS Lee & Ellis ’12)
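The summary-autocorrelation idea can be sketched end to end. This toy version uses an FFT-masked 4-band "filterbank" and a fixed periodicity threshold in place of the trained channel-selection classifier; all signals and parameters are illustrative:

```python
import numpy as np

fs, f0 = 8000, 200.0
t = np.arange(2000) / fs
rng = np.random.default_rng(3)
# Harmonic "voiced" signal (8 harmonics of 200 Hz) buried in white noise
sig = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 9))
sig = sig + 2.0 * rng.standard_normal(len(t))

# Crude 4-band "filterbank" via FFT masking (stands in for a gammatone front end)
spec = np.fft.rfft(sig)
edges = np.linspace(0, len(spec), 5).astype(int)
channels = []
for lo, hi in zip(edges[:-1], edges[1:]):
    band = np.zeros_like(spec)
    band[lo:hi] = spec[lo:hi]
    channels.append(np.fft.irfft(band, len(sig)))

# Per-channel normalized autocorrelation over candidate periods (100-400 Hz)
lags = np.arange(fs // 400, fs // 100)
def nacf(x):
    x = x - x.mean()
    return np.array([np.dot(x[:-l], x[l:]) for l in lags]) / np.dot(x, x)

acs = np.array([nacf(c) for c in channels])

# "Good" channel selection: keep channels with a strong periodicity peak
# (a fixed threshold stands in for the trained classifier on the slide)
good = acs.max(axis=1) > 0.3
summary = acs[good].sum(axis=0)
f0_est = fs / lags[np.argmax(summary)]   # pitch from the summary autocorrelation
```

The low bands, which still carry the harmonics, pass the selection; the noise-dominated high bands are excluded, so the summary peak lands on the true 200 Hz period despite the noise.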

Page 11:

Noise-Robust Pitch Tracking (cont’d)

• Channel-based classifiers
  learn domain channel/noise characteristics
  then separate, or derive features for recognition
• Only works for pitched sounds
  need a broader description of the speech source...

[Figure: spectrogram (freq/kHz), selected channels, and summary autocorrelation
(period/ms) over 0-4 s for example 08_rbf_pinknoise5dB; CS-GT]

Page 12:

4. Separation by Models

Varga & Moore ’90
Hershey et al. ’10

• If ASR is finding best-fit parameters, argmax_W P(W | X) ...
• Recognize mixtures with a Factorial HMM
  model + state sequence for each voice/source
  exploit sequence constraints, speaker differences
  separation relies on a detailed speaker model

[Figure: state lattices for model 1 and model 2 jointly decoded against the observations over time]
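Factorial decoding can be demonstrated on a toy model: two independent 2-state chains whose emissions add, decoded by Viterbi over the 4-state product space. The per-chain means differ, which is what lets the decoder exploit "speaker differences". All numbers are invented; real systems use per-speaker HMMs over log-spectra:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two independent 2-state chains; the observation is the sum of the two
# chains' emission means plus noise
means1 = np.array([0.0, 3.0])   # "speaker 1" state means
means2 = np.array([0.0, 5.0])   # "speaker 2" state means: different speakers!
logtrans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))

true1 = [0, 0, 1, 1, 1, 0]
true2 = [1, 1, 1, 0, 0, 0]
obs = np.array([means1[a] + means2[b] for a, b in zip(true1, true2)])
obs = obs + 0.1 * rng.standard_normal(len(obs))

# Viterbi over the product state space (s1, s2): 2 x 2 = 4 joint states
states = [(a, b) for a in range(2) for b in range(2)]

def emit_logp(x, a, b):
    # Gaussian observation model on the summed means
    return -0.5 * ((x - (means1[a] + means2[b])) / 0.1) ** 2

delta = np.array([emit_logp(obs[0], a, b) for a, b in states])
back = []
for x in obs[1:]:
    scores = np.empty((4, 4))
    for i, (a, b) in enumerate(states):       # previous joint state
        for j, (c, d) in enumerate(states):   # next joint state
            scores[i, j] = delta[i] + logtrans[a, c] + logtrans[b, d]
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + np.array([emit_logp(x, c, d) for c, d in states])

# Backtrace the best joint path, then split it into the two chains
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
dec1 = [states[s][0] for s in path]   # recovered state sequence, chain 1
dec2 = [states[s][1] for s in path]   # recovered state sequence, chain 2
```

If the two chains had identical means the summed observations would be ambiguous; the per-speaker differences plus the sticky transitions are what make the joint decode identifiable, and the cost of the product state space is why this approach needs compact speaker models.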

Page 13:

Eigenvoices

Kuhn et al. ’98, ’00
Weiss & Ellis ’10

• Idea: find a speaker model parameter space
  generalize without losing detail?
• Eigenvoice model:

    µ = µ̄ + U w + B h

  where µ is the adapted model mean, µ̄ the mean voice, U the eigenvoice bases
  weighted by w, and B the channel bases weighted by h;
  speaker models live in an 89,600-dimensional space.
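The adaptation equation can be exercised in a few lines. With orthonormal bases U, and the channel term B h ignored for simplicity, the eigenvoice weights w are a simple projection; dimensions here are toy-sized, not the talk's 89,600:

```python
import numpy as np

rng = np.random.default_rng(5)
D, K = 100, 3   # toy model-mean dimension and number of eigenvoice bases

mu_bar = rng.standard_normal(D)                     # mean voice, mu-bar
U, _ = np.linalg.qr(rng.standard_normal((D, K)))    # orthonormal eigenvoice bases

# A "new speaker" whose model mean lies in the eigenvoice subspace
w_true = np.array([1.5, -0.7, 0.3])
mu_speaker = mu_bar + U @ w_true

# Adaptation: with orthonormal bases, the eigenvoice weights are a projection
w_hat = U.T @ (mu_speaker - mu_bar)
mu_adapted = mu_bar + U @ w_hat
```

The appeal is that adapting to a new speaker means estimating only the K weights (plus channel weights), not the full high-dimensional mean, so a detailed model can be fit from very little data.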

[Figure: mean voice and eigenvoice dimensions 1-3 shown as frequency (kHz) vs.
phone class (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw
ay ah ao ow uw ax) templates]

Page 14:

Eigenvoice Speech Separation


Page 15:

Eigenvoice Speech Separation

• Eigenvoices applied to the Speech Separation task
  speaker-adapted (SA) performance falls midway between
  speaker-dependent (SD) and speaker-independent (SI)

[Figure: separation results for SI, SA, and SD models, and the unprocessed Mix]

Page 16:

Spatial + Model Separation

• MESSL + eigenvoice “priors”

Weiss, Mandel & Ellis ’11

Page 17:

Summary

• Speech in the Wild
  ... a real, challenging problem
  ... applications in communications, lifelogs, ...
• Speech Separation
  ... by generic properties (location, pitch)
  ... via statistical models
• Recognition and Enhancement
  ... separate-then-X, or an integrated solution?

Page 18:

References

• John Hershey, Steve Rennie, Peder Olsen, Trausti Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech & Language 24(1): 45-66, 2010.
• Jon Barker, Martin Cooke, Dan Ellis, “Decoding speech in the presence of other sources,” Speech Communication 45(1): 5-25, 2005.
• R. Kuhn, J. Junqua, P. Nguyen, N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE Trans. Speech & Audio Processing 8(6): 695-707, Nov. 2000.
• Byung-Suk Lee & Dan Ellis, “Noise-robust pitch tracking by trained channel selection,” submitted to ICASSP, 2012.
• Michael Mandel, Ron Weiss, Dan Ellis, “Model-based expectation-maximization source separation and localization,” IEEE Trans. Audio, Speech, & Language Processing 18(2): 382-394, Feb. 2010.
• A. Varga and R. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP-90, 845-848, 1990.
• Ron Weiss & Dan Ellis, “Speech separation using speaker-adapted eigenvoice speech models,” Computer Speech & Language 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel, Dan Ellis, “Combining localization cues and source model constraints for binaural source separation,” Speech Communication 53(5): 606-621, May 2011.
• Mingyang Wu, DeLiang Wang, Guy Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Trans. Speech & Audio Processing 11(3): 229-241, May 2003.