Improving Monaural Speaker Identification by Double-Talk Detection

R. Saeidi (1), P. Mowlaee (2), T. Kinnunen (1), Z.-H. Tan (2), M. G. Christensen (3), S. H. Jensen (2), and P. Fränti (1)

(1) Speech and Image Processing Unit (SIPU), School of Computing, University of Eastern Finland

(2) Dept. of Electronic Systems, (3) Dept. of Architecture, Design and Media Technology, Aalborg University, Denmark

Interspeech 2010, Japan, September 2010

Presenter: Zheng-Hua Tan


Presentation Outline

1 Problem Definition and Background

2 Proposed System

3 Double-Talk Detection

4 Speaker Identification

5 Performance Evaluation


Monaural Speaker Identification: Problem Definition

Fundamentals
Recognize BOTH speakers present in a MIXED audio file.
The novelty of this work is the inclusion of a double-talk detector (DTD) as a pre-processor for a previously proposed speaker identification back-end.

R. Saeidi, P. Mowlaee, T. Kinnunen, Z.-H. Tan, M. G. Christensen, S. H. Jensen, and P. Fränti, "Signal-to-signal ratio independent speaker identification for co-channel speech signals," IEEE 20th International Conference on Pattern Recognition (ICPR 2010), pp. 4565-4568, Istanbul, Turkey, August 2010.




Monaural Speaker Identification: Motivation

SOME studies recognize BOTH speakers, but they need at least TWO microphones [1].
FEW studies recognize BOTH speakers when only ONE microphone is available [2].

Build a stand-alone speaker identification system as a computationally less intensive alternative to the super-human Iroquois system [2].
Bring frame-level single-talk/double-talk information into the system to improve monaural speaker identification.

[1] Y. E. Kim, J. M. Walsh, and T. M. Doll, "Comparison of a joint iterative method for multiple speaker identification with sequential blind source separation and speaker identification," in Odyssey 2008: The Speaker and Language Recognition Workshop, Jan. 2008.
[2] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Elsevier Computer Speech and Language, vol. 24, no. 1, pp. 45-66, Jan. 2010.




System structure

Figure: The block diagram of the proposed system.




Double Talk Detection

Assume that we have K candidate models, denoted M_k (here M_0, M_1, and M_2), for describing the monaural speech signal.
We adopt a maximum a posteriori (MAP) criterion for a multiple-hypothesis test to determine the double-talk and single-talk regions in segments of a mixed signal: given the mixed signal, select the model with the maximum posterior probability.
We apply different policies in speaker identification for mixed and single-talker frames.
M_0: neither speaker is active.
M_1: one of the speakers is active.
M_2: both speakers are active.
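
The MAP decision above can be sketched as follows. This is a minimal illustration, not the authors' detector: the function name map_select, the per-segment log-likelihood values, and the priors are all assumed for the example, and the slides do not say how the likelihood under each model is computed.

```python
import math

def map_select(log_likelihoods, priors):
    """Return the index k that maximizes log p(x | M_k) + log P(M_k)."""
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-likelihoods of one segment under M_0, M_1, and M_2.
segment_ll = [-120.0, -95.0, -97.5]
# Hypothetical model priors (assumed, not taken from the slides).
priors = [0.2, 0.4, 0.4]

best = map_select(segment_ll, priors)
labels = ["no speech", "single-talk", "double-talk"]
print(f"Selected M_{best}: {labels[best]}")
```

Working in the log domain keeps the comparison numerically stable, and the evidence term p(x) can be dropped because it is the same for every candidate model.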




Double Talk Detection Performance

Figure: Double-talk detection results for a mixture of a male and a female speaker mixed at 3 dB SSR. Three panels (mixed signal, speaker 1, speaker 2) show each signal over time (sec) together with the ground truth and the boundaries estimated by the DTD. (Labels are -1 for no speech, 0 for mixed signal, 1 for speaker 1, and 2 for speaker 2.)



Speaker Identification: Frame Level Likelihood (FLL)

The main idea is to use GMMs trained in the mixed-speech domain.
We use the T_0 frames of the input feature stream that are recognized as mixed speech.
We compute the FLL as:
$$ s_{igt} = \log p(x_t \mid \lambda_{ig}) - \log p(x_t \mid \lambda_{UBM}) \qquad (1) $$
After finding the most probable speaker for each frame, we count the number of winning frames per speaker and normalize the counts.

Figure: $\lambda_{ig}$ is the model for the i-th speaker at SSR level g.
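
The frame-counting procedure of Equation (1) can be sketched as below. Several things here are assumptions made for illustration only: a single one-dimensional Gaussian stands in for each GMM, the winning speaker of a frame is taken as the speaker of the best-scoring (speaker, SSR level) pair, and the counts are normalized by the number of frames. The names log_gauss and fll_scores and all numeric values are hypothetical.

```python
import math
from collections import Counter

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian (toy stand-in for a GMM)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fll_scores(frames, speaker_models, ubm):
    """Count winning frames per speaker and normalize by the frame count.

    speaker_models maps (speaker, ssr_level) -> (mean, var); ubm is (mean, var).
    """
    if not frames:
        raise ValueError("need at least one mixed-speech frame")
    wins = Counter()
    for x in frames:
        ubm_ll = log_gauss(x, *ubm)
        # s_igt = log p(x_t | model_ig) - log p(x_t | UBM) for every (i, g).
        s = {key: log_gauss(x, *params) - ubm_ll
             for key, params in speaker_models.items()}
        best_speaker, _best_level = max(s, key=s.get)
        wins[best_speaker] += 1
    speakers = {speaker for speaker, _ in speaker_models}
    return {speaker: wins[speaker] / len(frames) for speaker in speakers}

# Hypothetical models: two speakers, each at two SSR levels (in dB).
models = {("A", -3): (0.0, 1.0), ("A", 3): (0.5, 1.0),
          ("B", -3): (3.0, 1.0), ("B", 3): (3.5, 1.0)}
ubm = (1.5, 4.0)
frames = [0.1, 0.4, 3.2, 0.0]  # toy 1-D "features" of mixed-speech frames

scores = fll_scores(frames, models, ubm)
print(scores)
```

Because the UBM term is the same for every model within a frame, it does not change which model wins that frame; it matters only if the raw s_igt values are compared or accumulated across frames.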



Speaker Identification: Kullback-Leibler Divergence (KLD)

We use the T_0 frames of the input feature stream that are recognized as mixed speech.

We compute the KLD as:
$$ \mathrm{KLD}_{ig} = \frac{1}{2} \sum_{m=1}^{M} w_m \, (\mu_{me} - \mu_{mig})^{T} \, \Sigma_m^{-1} \, (\mu_{me} - \mu_{mig}) \qquad (2) $$

The KLD scores are averaged over the SSR levels g and then normalized.
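
A small numeric sketch of Equation (2) follows. It assumes diagonal covariance matrices and assumes that the two models being compared share the same weights and covariances, so only the mean vectors differ; the slides do not state either point. The function name kld_means and all values are hypothetical.

```python
def kld_means(weights, variances, means_e, means_ig):
    """0.5 * sum_m w_m (mu_me - mu_mig)^T Sigma_m^{-1} (mu_me - mu_mig).

    Each Sigma_m is assumed diagonal and is given as a list of variances.
    """
    total = 0.0
    for w, var, mu_e, mu_ig in zip(weights, variances, means_e, means_ig):
        total += w * sum((a - b) ** 2 / v for a, b, v in zip(mu_e, mu_ig, var))
    return 0.5 * total

# Hypothetical 2-component, 2-dimensional example.
weights = [0.5, 0.5]
variances = [[1.0, 1.0], [2.0, 2.0]]
means_e = [[0.0, 0.0], [1.0, 1.0]]    # means of the model for the test signal
means_ig = [[1.0, 0.0], [1.0, 3.0]]   # means of speaker i's model at SSR level g

kld = kld_means(weights, variances, means_e, means_ig)
print(kld)
```

With a diagonal covariance the quadratic form reduces to a variance-weighted sum of squared mean differences, which is why no matrix inversion appears in the sketch.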




Speaker Identification: Score Fusion

For the T_0 frames of the input feature stream that are recognized as mixed speech, we form the score per speaker as:
score = 0.5 KLD + 0.5 FLL

The T_1 (T_2) frames of the input that are recognized as belonging to speaker 1 (2) are passed to the KLD module to find the best match.
If idx is the speaker identified from the single-talk frames, we add a bonus to its decision score:
score[idx] = score[idx] + T_1/T (or T_2/T)
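
The fusion and bonus rules can be sketched as below. The sketch assumes that the KLD and FLL inputs are already normalized so that a higher value means a better match, and that T is the total number of frames in the signal; the slides define neither. The function name fuse_scores and all numbers are hypothetical.

```python
def fuse_scores(kld, fll, single_talk, total_frames):
    """Equal-weight fusion of two per-speaker scores plus a single-talk bonus.

    kld, fll: {speaker: normalized score} computed on the mixed frames.
    single_talk: list of (idx, n_frames) pairs, one per single-talk speaker.
    """
    score = {spk: 0.5 * kld[spk] + 0.5 * fll[spk] for spk in kld}
    for idx, n_frames in single_talk:
        # Bonus T_1/T or T_2/T for the speaker found in single-talk frames.
        score[idx] += n_frames / total_frames
    return score

# Hypothetical normalized scores for three enrolled speakers.
kld = {"A": 0.6, "B": 0.3, "C": 0.1}
fll = {"A": 0.5, "B": 0.25, "C": 0.25}
# Hypothetical outcome: 20 of 100 frames are single-talk, matched to "B".
score = fuse_scores(kld, fll, [("B", 20)], total_frames=100)

top2 = sorted(score, key=score.get, reverse=True)[:2]
print(top2, score)
```

The bonus grows with the share of single-talk frames, so a speaker heard alone for longer gains more on top of the mixed-frame evidence.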

Evaluation Corpus

Grid corpus

Number of sentences per talker: 1000
Number of speakers: 34 (18 male and 16 female)
Corpus size: 34,000 sentences
Number of distinct sentences: 2048
File duration: typically 1-2 sec

Figure: The Speech Separation Challenge.




Speaker Identification: Results

Table: Speaker identification performance (% error), where both speakers are correctly found in the top-3 list. Yes/No indicates whether the proposed DTD method is included. ST, SG, and DG denote the same-talker, same-gender, and different-gender scenarios. For the ST scenario both systems give no error.

                SG              DG            Average
DTD          No    Yes      No     Yes      No    Yes
SSR
-9 dB       7.26   6.70    17.50  13.03    8.00   5.32
-6 dB       3.35   3.35     6.00   5.00    3.00   2.29
-3 dB       0.56   0.56     2.50   2.00    1.00   0.61
 0 dB       1.68   1.68     1.00   2.00    0.83   0.61
 3 dB       2.79   2.23     6.50   5.00    3.00   1.89
 6 dB       6.15   5.59     9.50  10.50    5.00   4.37

Average     3.64   3.35     7.17   6.17    3.47   2.57
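
As a reading aid, the relative error reduction implied by the Average row can be computed directly from the table. The snippet only re-does arithmetic on the numbers shown above; the function name relative_reduction is hypothetical and no new results are introduced.

```python
def relative_reduction(without_dtd, with_dtd):
    """Relative error reduction in percent when the DTD is added."""
    return 100.0 * (without_dtd - with_dtd) / without_dtd

# (error without DTD, error with DTD) from the Average row of the table.
average_row = {"SG": (3.64, 3.35), "DG": (7.17, 6.17), "Average": (3.47, 2.57)}

reductions = {name: relative_reduction(*pair) for name, pair in average_row.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```

The per-SSR rows can be passed to the same function; note that for DG at 0 dB and 6 dB the error is higher with the DTD, so the result is negative there.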




Conclusion

Successful ideas from speaker verification are applied to monaural speaker identification.
Mixed speech at different SSRs is used to train the speaker GMMs.
Speaker models are created by MAP adaptation rather than conventional ML training.
Double-talk detection is introduced to enhance the performance of the speaker identification system.
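
For readers unfamiliar with MAP adaptation, the sketch below shows one common form from the speaker verification literature: mean-only relevance-MAP adaptation of a background GMM. Whether the authors adapt only the means, and which relevance factor they use, is not stated in the slides; the 1-D features, the function name map_adapt_means, and the value relevance=16.0 are assumptions for illustration.

```python
import math

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only relevance-MAP adaptation of a 1-D GMM (illustrative)."""
    n_comp = len(weights)
    n = [0.0] * n_comp   # soft frame counts per component
    s = [0.0] * n_comp   # posterior-weighted sums of the data
    for x in frames:
        logp = [math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - mu) ** 2 / v)
                for w, mu, v in zip(weights, means, variances)]
        peak = max(logp)
        post = [math.exp(lp - peak) for lp in logp]  # shifted for stability
        norm = sum(post)
        for m in range(n_comp):
            gamma = post[m] / norm
            n[m] += gamma
            s[m] += gamma * x
    adapted = []
    for m in range(n_comp):
        alpha = n[m] / (n[m] + relevance)            # data-dependent weight
        data_mean = s[m] / n[m] if n[m] > 0 else means[m]
        adapted.append(alpha * data_mean + (1 - alpha) * means[m])
    return adapted

# Hypothetical background model and 16 frames that all lie near component 0.
adapted = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
print(adapted)
```

Components that see little data stay close to the background model, which is the usual argument for preferring MAP adaptation to ML training when enrollment data is limited.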

MATLAB code will be made available on my webpage: cs.joensuu.fi/pages/saeidi
Contact: [email protected]

