


Single Microphone Blind Audio Source Separation Using EM-Kalman Filter and Short+Long Term AR Modeling

Siouar Bensaid, Antony Schutz, and Dirk T.M. Slock*

EURECOM
2229 route des Crêtes, B.P. 193, 06904 Sophia Antipolis Cedex, FRANCE
{siouar.bensaid,antony.schutz,dirk.slock}@eurecom.fr
http://www.eurecom.fr

Abstract. Blind Source Separation (BSS) arises in a variety of fields in speech processing, such as speech enhancement, speaker diarization and identification. Generally, methods for BSS consider several observations of the same recording. Single microphone analysis is the worst underdetermined case, but it is also the most realistic one. In this article, the autoregressive structure (short term prediction) and the periodic signature (long term prediction) of the voiced speech signal are modeled, and a linear state space model with unknown parameters is derived. The Expectation Maximization (EM) algorithm is used to estimate these unknown parameters and thereby help source separation.

Key words: blind audio source separation, EM, Kalman, speech processing, autoregressive.

1 Introduction

Blind Source Separation is an important issue in audio processing. It helps solve the "cocktail party problem", where each speaker needs to be retrieved independently. Several works exploit the temporal structure of the speech signal to help separation. In the literature, three categories can be listed. The first exploits only the short term correlation in the speech signal and models it with a short term Auto-Regressive (AR) process [2]. A second category models the quasi-periodicity of speech by introducing the fundamental frequency (or pitch) into the analysis [3, 4]. Finally, a few works combine the two aspects [5]. This article belongs to the last category. In [5], the problem is presented as an over-determined instantaneous model where the aim is to estimate jointly the long term (LT) and short term (ST) AR coefficients, as well as the demixing matrix, in order to retrieve the speakers in a deflation scheme. A gradient descent algorithm is used to minimize the mean square of the total estimation error (short term and long term), and thus learn the parameters recursively. Our case is more difficult, since only a single sensor is used. Therefore, the proposed model

* EURECOM's research is partially supported by its industrial partners: BMW, Cisco Systems, France Télécom, Hitachi Europe, SFR, Sharp, ST Microelectronics, Swisscom, Thales. This research has also been partially supported by Project SELIA.



of speech propagation is rather simplified (the observation is the instantaneous sum of the sources). Nevertheless, this model is still relevant in several scenarios. Using some mathematical manipulation, a state space model with unknown parameters is derived. Since the involved signals are Gaussian, Kalman filtering can be used in the EM algorithm (Expectation step) to estimate the state. This paper is organized as follows: the state space model is introduced in Section 2. A brief review of the EM-Kalman algorithm is presented in Section 3, and the estimators' expressions are then computed. Numerical results are provided in Section 4, and conclusions are drawn in Section 5.

2 State Space Model Formulation

We consider the problem of estimating N_s mixed Gaussian sources. We use a voice production model [6] that can be described by filtering an excitation signal with a long term prediction filter followed by a short term filter, and which is mathematically formulated as

y_t = \sum_{k=1}^{N_s} s_{k,t} + n_t,
s_{k,t} = \sum_{n=1}^{p_k} a_{k,n} s_{k,t-n} + \tilde{s}_{k,t},
\tilde{s}_{k,t} = b_k \tilde{s}_{k,t-T_k} + e_{k,t}    (1)

where
– y_t is the scalar observation,
– s_{k,t} is the kth source at time t, an AR process of order p_k,
– a_{k,n} is the nth short term coefficient of the kth source,
– \tilde{s}_{k,t} is the short term prediction error of the kth source,
– b_k is the long term prediction coefficient of the kth source,
– T_k is the period of the kth source, not necessarily an integer,
– {e_{k,t}}_{k=1..N_s} are the independent Gaussian distributed innovation sequences with variances ρ_k,
– {n_t} is a white Gaussian process with variance σ_n^2, independent of the innovations {e_{k,t}}_{k=1..N_s}.
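The generative model (1) can be illustrated with a minimal sketch: white noise is shaped by the long term (pitch) recursion and then by the short term AR recursion. The helper name `synthesize_source` and all parameter values below are hypothetical choices for illustration, not taken from the paper; an integer pitch is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_source(a, b, T, rho, n_samples):
    """Generate one voiced source following model (1):
    a   : short-term AR coefficients [a_1, ..., a_p]
    b   : long-term prediction coefficient
    T   : pitch period in samples (integer here, for simplicity)
    rho : innovation variance
    """
    p = len(a)
    e = rng.normal(0.0, np.sqrt(rho), n_samples)   # innovation e_{k,t}
    s_tilde = np.zeros(n_samples)                  # LT residual \tilde{s}_{k,t}
    s = np.zeros(n_samples)                        # source s_{k,t}
    for t in range(n_samples):
        # long-term recursion: \tilde{s}_t = b \tilde{s}_{t-T} + e_t
        s_tilde[t] = (b * s_tilde[t - T] if t >= T else 0.0) + e[t]
        # short-term recursion: s_t = sum_n a_n s_{t-n} + \tilde{s}_t
        ar = sum(a[n] * s[t - n - 1] for n in range(min(p, t)))
        s[t] = ar + s_tilde[t]
    return s

# Single-microphone mixture of two sources plus observation noise, as in the model
s1 = synthesize_source([1.3, -0.7], 0.9, 80, 1.0, 2000)
s2 = synthesize_source([0.9, -0.4], 0.8, 55, 1.0, 2000)
y = s1 + s2 + rng.normal(0.0, 0.1, 2000)
```

Such synthetic mixtures are a convenient sanity check before moving to real speech, since the true parameters are then known.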

This model describes the speech signal more faithfully, especially its voiced part (the most energetic part of speech), because it uses the short term auto-regressive (AR) model to describe the correlation between signal samples jointly with the long term AR model to depict the harmonic structure of speech, rather than being restricted to just one of the two [2, 3]. Let x_{k,t} be the vector of length (N + p_k + 2) defined as x_{k,t} = [s_k(t) \; s_k(t-1) \cdots s_k(t-p_k-1) \mid \tilde{s}_k(t) \; \tilde{s}_k(t-1) \cdots \tilde{s}_k(t-\lfloor T_k \rfloor) \cdots \tilde{s}_k(t-N+1)]^T. This vector can be written in terms of x_{k,t-1} as follows

x_{k,t} = F_k x_{k,t-1} + g_k e_{k,t}    (2)



where g_k is the (N + p_k + 2)-length vector defined as g_k = [1 \; 0 \cdots 0 \mid 1 \; 0 \cdots 0]^T, whose second non-null component is at position (p_k + 3). The (N + p_k + 2) × (N + p_k + 2) matrix F_k has the following structure

F_k = \begin{bmatrix} F_{11,k} & F_{12,k} \\ O & F_{22,k} \end{bmatrix}

where the (p_k + 2) × (p_k + 2) matrix F_{11,k}, the (p_k + 2) × N matrix F_{12,k} and the N × N matrix F_{22,k} are given by

F_{11,k} = \begin{bmatrix} a_{k,1} \; a_{k,2} \cdots a_{k,p_k} \; 0 & 0 \\ I_{(p_k+1)} & 0 \end{bmatrix}

F_{12,k} = \begin{bmatrix} 0 \cdots 0 \; (1-\alpha_k) b_k \; \alpha_k b_k \; 0 \cdots 0 \\ 0_{(p_k+1) \times N} \end{bmatrix}

F_{22,k} = \begin{bmatrix} 0 \cdots 0 \; (1-\alpha_k) b_k \; \alpha_k b_k \; 0 \cdots 0 \\ I_{(N-1)} & 0 \end{bmatrix}

In the matrices F_{12,k} and F_{22,k}, the variable α_k, given by α_k = 1 - \lfloor T_k \rfloor / T_k, is present to handle the case where the pitches are not integers. It is noteworthy that the size N of the matrix F_{22,k} should be chosen carefully: N should be larger than the maximum pitch T_k in order to capture the long-term aspect. It can be noticed that the coefficients (1-α_k) b_k and α_k b_k sit in the \lfloor T_k \rfloor-th and \lceil T_k \rceil-th columns of F_{22,k} and F_{12,k}, respectively. Since N_s sources are present, we introduce the vector x_t consisting of the concatenation of the vectors {x_{k,t}}_{k=1:N_s}, i.e. x_t = [x_{1,t}^T \; x_{2,t}^T \cdots x_{N_s,t}^T]^T, which results in the time update equation (3). Moreover, by reformulating the expression of {y_t}, we obtain the observation equation (4). The resulting state space model is

x_t = F x_{t-1} + G e_t    (3)
y_t = h^T x_t + n_t    (4)

where
– e_t = [e_{1,t} \; e_{2,t} \cdots e_{N_s,t}]^T is the N_s × 1 column vector resulting from the concatenation of the N_s innovations. Its covariance matrix is the N_s × N_s diagonal matrix Q = diag(ρ_1, ⋯, ρ_{N_s}).
– F is the \sum_{k=1}^{N_s}(p_k + N + 2) × \sum_{k=1}^{N_s}(p_k + N + 2) block diagonal matrix given by F = blockdiag(F_1, ⋯, F_{N_s}).
– G is the \sum_{k=1}^{N_s}(p_k + N + 2) × N_s matrix given by G = blockdiag(g_1, ⋯, g_{N_s}).
– h is the \sum_{k=1}^{N_s}(p_k + N + 2) × 1 column vector given by h = [h_1^T ⋯ h_{N_s}^T]^T, where h_i = [1 \; 0 ⋯ 0]^T is of length (N + p_i + 2).
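The construction of F_k, g_k and the stacked model above can be sketched in a few lines of NumPy. This is an illustrative assembly under the structure described in the text (companion-form F_{11,k}, two fractional-pitch taps at columns \lfloor T_k \rfloor and \lceil T_k \rceil of the LT block, shift register I_{N-1}); the function names `build_source_matrices` and `build_full_model` are our own, not from the paper.

```python
import numpy as np

def build_source_matrices(a, b, T, N):
    """Assemble F_k and g_k of Eq. (2) for one source (dimension N + p + 2).
    a: short-term AR coefficients, b: long-term coefficient,
    T: (possibly fractional) pitch, N: length of the LT part (N > max T)."""
    p = len(a)
    d = N + p + 2
    Tf = int(np.floor(T))
    alpha = 1.0 - Tf / T                      # fractional-pitch weight
    w_lo, w_hi = (1.0 - alpha) * b, alpha * b

    F = np.zeros((d, d))
    # F11,k: short-term companion block of size (p+2) x (p+2)
    F[0, :p] = a
    F[1:p + 2, 0:p + 1] = np.eye(p + 1)
    # First rows of F12,k / F22,k carry the two LT taps at columns
    # floor(T) and ceil(T) of the LT block (1-indexed in the paper).
    c = p + 2 + Tf - 1                        # 0-indexed column of the floor(T) tap
    F[0, c], F[0, c + 1] = w_lo, w_hi
    F[p + 2, c], F[p + 2, c + 1] = w_lo, w_hi
    # Shift register of the LT block: I_{N-1} under the first row.
    F[p + 3:d, p + 2:d - 1] = np.eye(N - 1)

    g = np.zeros(d)
    g[0] = 1.0        # innovation enters s_{k,t} ...
    g[p + 2] = 1.0    # ... and \tilde{s}_{k,t} (position p_k + 3)
    return F, g

def build_full_model(sources, N):
    """Block-diagonal F, G and stacked h of Eqs. (3)-(4) for Ns sources,
    where `sources` is a list of (a, b, T) triples."""
    blocks = [build_source_matrices(a, b, T, N) for a, b, T in sources]
    d = sum(Fk.shape[0] for Fk, _ in blocks)
    F = np.zeros((d, d)); G = np.zeros((d, len(blocks))); h = np.zeros(d)
    i = 0
    for k, (Fk, gk) in enumerate(blocks):
        n = Fk.shape[0]
        F[i:i + n, i:i + n] = Fk
        G[i:i + n, k] = gk
        h[i] = 1.0    # h_k = [1 0 ... 0]^T picks out s_{k,t}
        i += n
    return F, G, h
```

Note that the two LT taps always sum to b_k, since (1-α_k) b_k + α_k b_k = b_k, which gives a quick consistency check on the assembled matrix.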

Clearly, the linear dynamic system derived above depends on unknown parameters, gathered in the variable θ = { {a_{k,n}}_{k∈{1,…,N_s}, n∈{1,…,p_k}}, {b_k}_{k∈{1,…,N_s}}, {ρ_k}_{k∈{1,…,N_s}}, σ_n^2 }. Hence, a joint estimation of the sources (the state) and θ is required. In the literature [7, 10, 11], the EM-Kalman algorithm provides an efficient approach for iteratively estimating the parameters, and its convergence to the Maximum Likelihood solution is proved [9]. In the next section, the application of this algorithm to our case is developed.

3 EM-Kalman Filter

The EM-Kalman algorithm iteratively estimates the parameters and the sources by alternating two steps, the E-step and the M-step [9]. In the M-step, an estimate of the parameters θ is computed. In our problem, there are two types of parameters: the parameters of the time update equation (3), which consist of the short term and long term coefficients and the innovation powers of all N_s sources, and one parameter of the observation equation (4), the observation noise power. From the state space model presented in the first part, and for each source k, the relation between the innovation process at time t−1 and the LT+ST coefficients can be written as

e_{k,t-1} = v_k^T x_{k,t-1}    (5)

where v_k = [1 \; -a_{k,1} \cdots -a_{k,p_k} \; -(1-α_k) b_k \; -α_k b_k]^T is a (p_k + 3) × 1 column vector and x_{k,t-1} = [s_k(t-1, θ) \cdots s_k(t-p_k-1, θ) \; \tilde{s}_k(t-\lfloor T_k \rfloor - 1, θ) \; \tilde{s}_k(t-\lfloor T_k \rfloor - 2, θ)]^T is called the partial state, deduced from the full state x_t with the help of a selection matrix S_k. This lag of one time sample between the full and partial states is justified later. After multiplying both sides of (5) by x_{k,t-1}^T, applying the operator E{·|y_{1:t}} and performing a matrix inversion, the following relation between the vector of coefficients and the innovation power is deduced

v_k = ρ_k R_{k,t-1}^{-1} [1 \; 0 \cdots 0]^T    (6)

where the covariance matrix R_{k,t-1} is defined as E{x_{k,t-1} x_{k,t-1}^T | y_{1:t}}. It is important to notice that the estimation of R_{k,t-1} uses observations up to time t, which amounts to a fixed-lag smoothing treatment with lag 1. As mentioned previously, the relation between the partial state at time t−1 and the full state at time t is x_{k,t-1} = S_k x_t. This key relation is used in the computation of the partial state covariance matrix

R_{k,t-1} = S_k E{x_t x_t^T | y_{1:t}} S_k^T    (7)
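Equation (6) can be read off directly: since the first entry of v_k is 1, the innovation power is the reciprocal of the (1,1) element of R_{k,t-1}^{-1}. A small sketch of this M-step update for one source (the helper name `m_step_source` is ours, not the paper's):

```python
import numpy as np

def m_step_source(R_k):
    """Given the partial-state covariance R_{k,t-1} of Eq. (7),
    recover the coefficient vector v_k and innovation power rho_k via Eq. (6)."""
    Rinv = np.linalg.inv(R_k)
    rho_k = 1.0 / Rinv[0, 0]          # rho_k = ((R^{-1})_{(1,1)})^{-1}
    e1 = np.zeros(R_k.shape[0]); e1[0] = 1.0
    v_k = rho_k * (Rinv @ e1)         # Eq. (6); v_k[0] == 1 by construction
    return v_k, rho_k
```

The scaling by ρ_k guarantees the leading 1 in v_k, matching the definition of v_k above; the remaining entries then yield the ST and LT coefficient estimates.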



Notice here the transition from fixed-lag smoothing with the partial state to simple filtering with the full state. This fact justifies selecting the partial state at time t−1 from the full state at time t. This selection is possible thanks to the augmented form of the matrix F_k, or more precisely of F_{11,k}. The innovation power is then simply deduced as the inverse of the first component of the matrix R_{k,t-1}^{-1}. The estimation of the observation noise power σ_n^2 is achieved by maximizing the log-likelihood function log P(y_t | x_t, σ_n^2) with respect to σ_n^2. The optimal value can easily be shown to equal

σ_n^{2,(t)} = y_t^2 - 2 y_t h^T \hat{x}_{t|t} + h^T (\hat{x}_{t|t} \hat{x}_{t|t}^T + P_{t|t}) h    (8)

where the index (t) in σ_n^{2,(t)} denotes the iteration number. The computation of the partial covariance matrix R_{k,t-1} is achieved in the E-step. This matrix depends on the quantity E{x_t x_t^T | y_{1:t}}, which is given by

E{x_t x_t^T | y_{1:t}} = \hat{x}_{t|t} \hat{x}_{t|t}^T + P_{t|t}    (9)

where the quantities \hat{x}_{t|t} and P_{t|t} are respectively the estimated full state and the full estimation error covariance computed using the Kalman filtering equations. The adaptive algorithm is presented as Algorithm 1. The algorithm needs an accurate initialization, which will be discussed afterward. In the algorithm, \hat{s}_{k,t} is the estimate of source k at time t.

Algorithm 1: Adaptive EM-Kalman Algorithm

– E-Step. Estimation of the sources covariance:
  K_t = P_{t|t-1} h (h^T P_{t|t-1} h + σ_n^2)^{-1}
  \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - h^T \hat{x}_{t|t-1})
  P_{t|t} = P_{t|t-1} - K_t h^T P_{t|t-1}
  \hat{x}_{t+1|t} = F \hat{x}_{t|t}
  P_{t+1|t} = F P_{t|t} F^T + G Q G^T

– M-Step. Estimation of the AR parameters using linear prediction, for k = 1, …, N_s (λ is a forgetting factor):
  \hat{s}_{k,t} = (\hat{x}_{k,t|t})_{(1)}
  R_{k,t-1} = λ R_{k,t-2} + (1-λ) S_k (\hat{x}_{t|t} \hat{x}_{t|t}^T + P_{t|t}) S_k^T
  ρ_k^{(t)} = ((R_{k,t-1}^{-1})_{(1,1)})^{-1}
  v_k^{(t)} = ρ_k R_{k,t-1}^{-1} [1 \; 0 \cdots 0]^T
  σ_n^{2,(t)} = y_t^2 - 2 y_t h^T \hat{x}_{t|t} + h^T (\hat{x}_{t|t} \hat{x}_{t|t}^T + P_{t|t}) h
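One iteration of the algorithm above can be sketched in NumPy. This is a minimal sketch of the E-step (Kalman measurement and time updates) followed by the σ_n^2 M-step of Eq. (8); the per-source exponentially-weighted R_{k,t-1} smoothing and the v_k update are omitted here for brevity, and the function name `em_kalman_step` is ours.

```python
import numpy as np

def em_kalman_step(y_t, x_pred, P_pred, F, G, h, Q, sigma2):
    """One Kalman filter update (E-step) plus the sigma_n^2 update of Eq. (8).
    x_pred, P_pred: predicted state \hat{x}_{t|t-1} and covariance P_{t|t-1}."""
    # E-step: measurement update
    S = h @ P_pred @ h + sigma2                 # innovation variance (scalar)
    K = P_pred @ h / S                          # Kalman gain K_t
    x_filt = x_pred + K * (y_t - h @ x_pred)    # \hat{x}_{t|t}
    P_filt = P_pred - np.outer(K, h @ P_pred)   # P_{t|t}
    # E-step: time update
    x_next = F @ x_filt                         # \hat{x}_{t+1|t}
    P_next = F @ P_filt @ F.T + G @ Q @ G.T     # P_{t+1|t}
    # M-step: observation-noise power, Eq. (8)
    sigma2_new = (y_t**2 - 2 * y_t * (h @ x_filt)
                  + h @ (np.outer(x_filt, x_filt) + P_filt) @ h)
    return x_filt, P_filt, x_next, P_next, sigma2_new
```

Since Eq. (8) can be rewritten as (y_t - h^T \hat{x}_{t|t})^2 + h^T P_{t|t} h, the returned noise-power update is always non-negative, which is a useful invariant to assert while debugging.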



The estimation of the pitches {T_k}_{k=1:N_s} is done alongside this algorithm using a multipitch estimation algorithm [12].

4 Simulations

In this section, we present some results of our source separation algorithm. We assume that the maximum number of sources is known. We limit our analysis to the case of a mixture of two simultaneous sources corrupted by white noise. In the most energetic parts of the mixture, the time-varying SNR is about 20 dB, as shown in Fig. 1. We work with real speech data to which we artificially add the observation noise. The mixture consists of two voiced speech signals and is 10 s long. The parameters are initialized randomly, except for the periods, for which we use a multipitch algorithm [12] running in parallel with our main algorithm. The periods estimated by the multipitch algorithm are fed to the main algorithm every 64 ms. We perform two experiments. The first one is the filtering case, in which all the parameters are initialized with values close to the true ones. The results are close to perfect and are shown in Fig. 2. In the second experiment, the parameters are initialized randomly and estimated adaptively in the M-step. The results are shown in Fig. 3. The separation does not look very good but, when listening to the estimated sources, we find that they are under-estimated, leading to a mixture of the original sources in which the interferences are reduced. This is due in part to the fact that at a given moment we do not know the number of sources, so even when only one source is present, the algorithm seeks to estimate two sources. During the separation process, the estimated correlations are still polluted by the other source, but the desired source is enhanced. The results can be listened to on the first author's personal page [1].

Fig. 1. Mixture and SNR evolution.



Fig. 2. Source separation with fixed and known parameters.

Fig. 3. Source separation with adaptive estimation of the parameters.



5 Conclusion

In this paper we use the adaptive EM-Kalman algorithm for the blind audio source separation problem. The model takes into account the different aspects of speech production, and the sources are jointly estimated. The traditional smoothing step is included in the algorithm and is not an additional step. Simulations show the potential of the algorithm on real data. Yet, this performance depends heavily on the quality of the multipitch estimation: an error in tracking the pitches can degrade performance drastically. This work would be more complete if another process aiming to estimate the number of active sources ran in parallel.

References

1. http://www.eurecom.fr/~bensaid/ICA10
2. A. Cichocki and R. Thawonmas, "On-line algorithm for blind signal extraction of arbitrarily distributed, but temporally correlated sources using second order statistics," Neural Process. Lett., vol. 12, no. 1, pp. 91–98, 2000.
3. A. K. Barros and A. Cichocki, "Extraction of specific signals with temporal structure," Neural Comput., vol. 13, no. 9, pp. 1995–2003, 2001.
4. F. Tordini and F. Piazza, "A semi-blind approach to the separation of real world speech mixtures," in Proc. 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 2, 2002, pp. 1293–1298.
5. D. Smith, J. Lukasiak, and I. Burnett, "Blind speech separation using a joint model of speech production," IEEE Signal Processing Letters, vol. 12, no. 11, pp. 784–787, Nov. 2005.
6. W. C. Chu, Speech Coding Algorithms: Foundation and Evolution of Standardized Coders. John Wiley and Sons, New York, 2003.
7. M. Feder and E. Weinstein, "Parameter estimation of superimposed signals using the EM algorithm," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 477–489, Apr. 1988.
8. S. Gannot, D. Burshtein, and E. Weinstein, "Iterative-batch and sequential algorithms for single microphone speech enhancement," in Proc. ICASSP '98, pp. 1215–1218, IEEE, 1998.
9. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
10. W. Gao, T. S., and J. Lehnert, "Diversity combining for DS/SS systems with time-varying, correlated fading branches," IEEE Transactions on Communications, vol. 51, no. 2, pp. 284–295, Feb. 2003.
11. C. Couvreur and Y. Bresler, "Decomposition of a mixture of Gaussian AR processes," in Proc. ICASSP-95, vol. 3, pp. 1605–1608, 1995.
12. M. Christensen and A. Jakobsson, Multi-Pitch Estimation. Morgan & Claypool, 2009.