Blind Multichannel System Identification

with Applications in Speech Signal Processing

R. M. Nickel

Department of Electrical Engineering, 121 EE West,

The Pennsylvania State University, University Park, PA 16802

Email: rmn10@psu.edu

Abstract

We present a new approach for blind multichannel system identification. The approach relies on the existence of so-called exclusive activity periods (EAPs) in the source signals. EAPs are time intervals during which only one source is active and all other sources are inactive (i.e. zero). The existence of EAPs is not guaranteed for arbitrary signal classes. EAPs occur very frequently, however, in recordings of conversational speech. The methods proposed in this paper show how EAPs can be exploited to improve the performance of blind multichannel system identification in speech processing applications. For modestly complex tasks the proposed method achieves an improvement of over 10 dB in signal-to-interference ratio over conventional techniques.

1 Introduction

The general goal in blind multichannel system identification is to estimate the transfer function of a multichannel system based solely on its output signals and without any specific knowledge about the input signals to the system. A practical example is given by the following scenario: we record an acoustic scene with an array of microphones. If we want to isolate the voice of a single speaker from a mixture of signals from different sources, then it is implicitly necessary to estimate the transmission properties (or equivalently the inverse transmission properties) of all channels between all microphones and all sources.

Historically, the first successful solutions to such problems were of instantaneous mixture type [1], i.e. cases in which the transfer function was merely a matrix of constants. A variety of tools, known as independent component analysis (ICA) methods (Comon [2]), have been developed for these cases. The general criterion behind ICA methods is to maximize a measure of independence between the assumed input signals to the system.

Solutions to the more general convolutive mixture case are significantly more complicated. Most solutions can be classified as either time domain approaches (see [3] and the references therein) or frequency domain approaches (see [4]). More specifically, we have to distinguish between the domain in which we model the mixing process (mixing domain) and the domain in which we model the statistics of the source signals (source domain). The choice of either time or frequency for each of these domains has significant advantages and disadvantages (see [4]).

In this paper we propose an alternative approach that does not explicitly rely on an independence assumption between sources. Instead, we assume the existence of exclusive activity periods (EAPs). EAPs are time intervals during which only one source is active and all other sources are silent. The existence of EAPs is not guaranteed for arbitrary signal classes, but EAPs occur very frequently in recordings of conversational speech.

2 Methods

A block diagram that depicts the considered scenario is shown in figure 1. We assume that we have $M$ unknown source signals $x_i[n]$ for $i = 1 \ldots M$. The transmission path between source $i$ and receiver $j$ is described by the transfer function of a linear time-invariant system with impulse response $g_{ij}[n]$. The resulting $M$ observation signals $y_j[n]$ are generated according to:

$$y_j[n] = \sum_{i=1}^{M} \sum_{k=0}^{\infty} g_{ij}[k] \, x_i[n-k] \quad \text{for } j = 1 \ldots M. \tag{1}$$

Equation (1) can also be expressed in the z-domain as

$$\mathbf{Y}(z) = \mathbf{G}(z) \, \mathbf{X}(z), \tag{2}$$
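The convolutive mixing model of equation (1) can be illustrated with a minimal sketch. It assumes finite-length (FIR) impulse responses and sources that are zero before n = 0; the function names `convolve_causal` and `mix` and the nested-list layout `filters[i][j]` are hypothetical choices made here for illustration, not part of the paper.

```python
def convolve_causal(g, x):
    """Return y[n] = sum_k g[k] * x[n - k], taking x[n] = 0 for n < 0."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        for k, gk in enumerate(g):
            if n - k >= 0:
                y[n] += gk * x[n - k]
    return y


def mix(sources, filters):
    """Mix M sources into M observations as in equation (1).

    sources[i] is the sample list of source i.
    filters[i][j] is the (assumed FIR) impulse response g_ij from
    source i to receiver j.
    """
    num_sources = len(sources)
    num_samples = len(sources[0])
    observations = []
    for j in range(num_sources):
        yj = [0.0] * num_samples
        for i in range(num_sources):
            contribution = convolve_causal(filters[i][j], sources[i])
            yj = [a + b for a, b in zip(yj, contribution)]
        observations.append(yj)
    return observations
```

For example, `mix([[1, 0, 0, 0], [0, 0, 1, 0]], [[[1.0, 0.5], [0.2]], [[0.3], [1.0, -0.5]]])` mixes two impulse-like sources through four short filters.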

in which $\mathbf{X}(z)$ is the z-transform of the multichannel signal $\mathbf{x}[n] = [\, x_1[n] \;\; x_2[n] \;\; \ldots \;\; x_M[n] \,]^T$, $\mathbf{Y}(z)$ is the z-transform of the multichannel signal $\mathbf{y}[n] = [\, y_1[n] \;\; y_2[n] \;\; \ldots \;\; y_M[n] \,]^T$, and $\mathbf{G}(z)$ is a matrix with the z-transforms of the impulse responses $g_{ij}[n]$ for all $i$ and $j$.

Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC05)

0-7695-2504-0/05 $20.00 © 2005 IEEE

The goal in blind system identification is to find a matrix of transfer functions $\mathbf{H}(z)$ such that $\hat{\mathbf{X}}(z)$ given by

$$\hat{\mathbf{X}}(z) = \mathbf{H}(z) \, \mathbf{Y}(z) \tag{3}$$

is (in some appropriate metric) as close to $\mathbf{X}(z)$ as possible. Unfortunately, since we cannot measure the true source signals $\mathbf{x}[n]$, it is generally not possible to explicitly minimize the deviation. Instead, the most commonly used criterion to find $\mathbf{H}(z)$ is to choose it such that the components of the resulting source reconstruction $\hat{\mathbf{x}}[n]$ are statistically as independent as possible.

In this paper we propose an alternative method that minimizes an estimate of the deviation between $\hat{\mathbf{x}}[n]$ and $\mathbf{x}[n]$. The approach can be divided into three steps:

1. Find a set of time intervals $[n_1, n_2]$ during which only one source is active and all other sources are silent. We refer to such time intervals as exclusive activity periods (EAPs).

2. Find a set of transfer functions that deconvolve the sources during EAPs.

3. Construct $\mathbf{H}(z)$ by combining the results from different EAPs from different sources.

One of the caveats of the proposed approach is that it depends on the existence of exclusive activity periods. The existence of EAPs is not guaranteed for arbitrary signal classes. EAPs do occur, however, very frequently in recordings of conversational speech, which makes the proposed method particularly interesting for the solution of the blind speech separation problem.

2.1 EAP Estimation for Conversational Speech Signals

An exclusive activity period is a time interval during which only one source $x_i[n]$ is active and all other sources $x_k[n]$ are silent, i.e. $x_k[n] = 0$ for $k \neq i$. The estimation of exclusive activity periods for speech sources $x_i[n]$ can be based on the (almost) periodic nature of vocalic sounds. If only a single person is speaking then all observations $y_j[n]$ exhibit time intervals with a periodic structure. When multiple persons are speaking the periodicity is generally destroyed [5].

A robust short-time periodicity measure was proposed by Medan et al. [6]. They consider the similarity between two adjacent observation segments of length $k$:

$$\mathbf{s}_{j1}[n,k] = [\, y_j[n-k] \;\; \ldots \;\; y_j[n-2] \;\; y_j[n-1] \,]^T \tag{4}$$

$$\mathbf{s}_{j2}[n,k] = [\, y_j[n] \;\; y_j[n+1] \;\; \ldots \;\; y_j[n+k-1] \,]^T. \tag{5}$$

[Figure 1. A block diagram of the mixing scenario described by equation (1). The $M$ unknown signal sources are labelled with $x_i[n]$. The $M$ observations are labelled with $y_j[n]$. Diagram title: General Model Scenario; blocks: Sources ($x_1[n], x_2[n], \ldots, x_M[n]$), Mixture Model ($g_{ii}[n], g_{ij}[n], g_{ji}[n], g_{jj}[n]$), Observations ($y_1[n], y_2[n], \ldots, y_M[n]$).]

A correlation measure $\mathrm{NCOR}_j[n,k]$ is defined through a normalized inner product of the vectors $\mathbf{s}_{j1}[n,k]$ and $\mathbf{s}_{j2}[n,k]$:

$$\mathrm{NCOR}_j[n,k] = \frac{\mathbf{s}_{j1}[n,k]^T \, \mathbf{s}_{j2}[n,k]}{\| \mathbf{s}_{j1}[n,k] \| \; \| \mathbf{s}_{j2}[n,k] \|}. \tag{6}$$

The normalization ensures that the correlation measure is bounded in magnitude by one, i.e. $-1 \le \mathrm{NCOR}_j[n,k] \le 1$, with values between zero and one for positively correlated segments. The correlation measure is equal to one at the true period $p$ of a perfectly periodic signal. Less than perfectly periodic signals yield correlation values less than one. As a consequence, we can define a short-time periodicity measure as:

$$\mathrm{STPM}_j[n] = \max_{p_{\min} \le k \le p_{\max}} \{\, \mathrm{NCOR}_j[n,k] \,\}. \tag{7}$$

The search range for the maximum should be bounded by the typical pitch range of human speech (50 Hz ... 500 Hz). For observation signals that are sampled with sampling frequency $F_s$ we have:

$$p_{\min} = F_s / 500\,\mathrm{Hz} \quad \text{and} \quad p_{\max} = F_s / 50\,\mathrm{Hz}. \tag{8}$$
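Equations (4) to (8) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of Medan et al.: the function names `ncor` and `stpm`, the integer rounding of the lag bounds, and the handling of lags that would run past the signal boundaries are assumptions made here.

```python
import math


def ncor(y, n, k):
    """Normalized correlation of the two adjacent length-k segments around n."""
    s1 = y[n - k:n]        # y[n-k] ... y[n-1], as in equation (4)
    s2 = y[n:n + k]        # y[n]   ... y[n+k-1], as in equation (5)
    num = sum(a * b for a, b in zip(s1, s2))
    den = math.sqrt(sum(a * a for a in s1)) * math.sqrt(sum(b * b for b in s2))
    return num / den if den > 0 else 0.0   # assumption: silent segments score 0


def stpm(y, n, fs):
    """Short-time periodicity measure: maximum of ncor over the pitch lag range."""
    p_min = int(fs // 500)   # shortest pitch period in samples (500 Hz)
    p_max = int(fs // 50)    # longest pitch period in samples (50 Hz)
    values = [
        ncor(y, n, k)
        for k in range(p_min, p_max + 1)
        if n - k >= 0 and n + k <= len(y)   # skip lags that leave the signal
    ]
    return max(values) if values else 0.0
```

For a sinusoid with a period of 40 samples at an 8 kHz sampling rate, `stpm` evaluated in the middle of the signal is (numerically) one, while `ncor` at a lag of half a period is close to minus one.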

A second feature that correlates well with EAPs is the so-called short-time zero crossing rate [7]:

$$\mathrm{STZC}_j[n] = \sum_{m=n-L}^{n+L} \big| \, \mathrm{sign}(y_j[m]) - \mathrm{sign}(y_j[m-1]) \, \big|. \tag{9}$$

In our notation $\mathrm{sign}(x)$ is equal to $+1$ for $x \ge 0$ and $-1$ for $x < 0$. The zero crossing rate counts the sign changes between consecutive samples within the range $(n-L-1) \ldots (n+L)$; each sign change contributes a value of 2 to the sum. The range length is usually chosen as $L = F_s \cdot 10\,\mathrm{msec}$. Typically, the zero crossing rate is low for EAP sections and high otherwise. For the STZC measure to work properly it is important that possible quantization offsets in the recorded speech signal are removed prior to processing.

A normalized short-time zero crossing measure $\mathrm{NZCM}_j[n]$ is constructed with the maximum $ZC_j^{\max}$ and the minimum $ZC_j^{\min}$ of $\mathrm{STZC}_j[n]$ over all $n$:

$$\mathrm{NZCM}_j[n] = \frac{\mathrm{STZC}_j[n] - ZC_j^{\max}}{ZC_j^{\max} - ZC_j^{\min}}. \tag{10}$$
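The two zero crossing measures can be sketched as follows. The sketch assumes that equation (10) subtracts the maximum in the numerator, so that the normalized values lie between -1 and 0; the function names, the treatment of samples outside the signal, and the all-zero output for a constant STZC contour are assumptions made here for illustration.

```python
def sign(x):
    """+1 for x >= 0 and -1 for x < 0, as in the text."""
    return 1 if x >= 0 else -1


def stzc(y, n, half_width):
    """Short-time zero crossing measure of equation (9) around sample n."""
    total = 0
    for m in range(n - half_width, n + half_width + 1):
        if m - 1 < 0 or m >= len(y):
            continue  # assumption: terms outside the signal are skipped
        total += abs(sign(y[m]) - sign(y[m - 1]))
    return total


def nzcm(stzc_values):
    """Normalize an STZC contour with its maximum and minimum (equation (10))."""
    zc_max = max(stzc_values)
    zc_min = min(stzc_values)
    if zc_max == zc_min:
        return [0.0] * len(stzc_values)  # assumption: constant contour maps to 0
    return [(v - zc_max) / (zc_max - zc_min) for v in stzc_values]
```

With this normalization the largest zero crossing count maps to 0 and the smallest to -1.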

The identification of EAP candidates is done as follows:

1. Find all times $n$ for which $\mathrm{STPM}_j[n] \ge 0.7$ for all observations $j = 1 \ldots M$.

2. Expand the sections found in this way in forward and backward direction until $\mathrm{STPM}_j[n] < 0.6$.

3. Remove times for which $\mathrm{NZCM}_j[n] > -0.5$ (with the normalization of equation (10) we have $-1 \le \mathrm{NZCM}_j[n] \le 0$).

4. Retain only the intersection of all EAP candidates across all channels $j = 1 \ldots M$.

An example for $\mathrm{STPM}_j[n]$, $\mathrm{NZCM}_j[n]$, and the resulting set of EAP candidate sections is shown in figure 3.
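The four selection steps can be sketched as a hysteresis thresholding over per-channel feature contours. This is one possible reading of the procedure, not the author's code: the names `expand` and `eap_candidates` are hypothetical, the expansion is done per channel, and the zero crossing limit is a parameter whose default of -0.5 assumes that NZCM ranges from -1 to 0; pass a different `zc_limit` if another normalization is used.

```python
def expand(seed, values, low):
    """Grow seed regions forward and backward while values stay >= low."""
    mask = list(seed)
    for n in range(1, len(mask)):                 # forward pass
        if mask[n - 1] and values[n] >= low:
            mask[n] = True
    for n in range(len(mask) - 2, -1, -1):        # backward pass
        if mask[n + 1] and values[n] >= low:
            mask[n] = True
    return mask


def eap_candidates(stpm, nzcm, high=0.7, low=0.6, zc_limit=-0.5):
    """Return a boolean EAP candidate mask over time.

    stpm[j][n] and nzcm[j][n] are the feature contours of channel j.
    """
    num_samples = len(stpm[0])
    # Step 1: seeds where every channel is strongly periodic.
    seed = [all(ch[n] >= high for ch in stpm) for n in range(num_samples)]
    per_channel = []
    for s, z in zip(stpm, nzcm):
        grown = expand(seed, s, low)                                   # step 2
        kept = [g and zn <= zc_limit for g, zn in zip(grown, z)]       # step 3
        per_channel.append(kept)
    # Step 4: intersection across channels.
    return [all(column) for column in zip(*per_channel)]
```

The lower threshold in `expand` provides hysteresis: a section that starts with a strong periodicity value is allowed to extend into neighbouring samples that are only moderately periodic.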

2.2 Blind EAP Deconvolution

In this section we discuss the subproblem of blind system identification under an exclusive activity assumption, i.e. we assume that we have identified a time interval $[n_1, n_2]$ during which only source $x_i[n]$ is active and all other sources are silent, i.e. $x_k[n] = 0$ for $k \neq i$. Under the EAP assumption we may attempt to reconstruct source $x_i[n]$ from each observation $y_j[n]$ via an appropriately chosen inverse filter $h_{ij}[n]$:

$$\hat{x}_i^j[n] = \sum_{k=0}^{P} h_{ij}[k] \, y_j[n-k]. \tag{11}$$

Ideally $x_i[n] = \hat{x}_i^j[n]$ for all $j = 1 \ldots M$. Practically, however, we have $\hat{x}_i^k[n] \neq \hat{x}_i^j[n]$ for $k \neq j$ due to noise, imperfect estimation of the EAPs, improper choice of $P$, non-minimum phase properties of $g_{ij}[n]$, and so forth.

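An estimate $E_i$ of the reconstruction error can be defined with

$$E_i = \sum_{j=1}^{M} \sum_{n} \big| \, \hat{x}_i^j[n] - \bar{x}_i[n] \, \big|^2 \tag{12}$$

$$\text{and} \quad \bar{x}_i[n] = \frac{1}{M} \sum_{j=1}^{M} \hat{x}_i^j[n]. \tag{13}$$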
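The per-channel reconstruction of equation (11) and the error estimate of equations (12) and (13) can be sketched as follows. The function names are hypothetical, samples before the start of the observation are taken as zero, and the sum over n simply runs over all samples that are passed in (in the paper it runs over one EAP).

```python
def inverse_filter(h, y):
    """Apply an FIR inverse filter: x_hat[n] = sum_k h[k] * y[n - k]."""
    return [
        sum(h[k] * y[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(y))
    ]


def error_estimate(reconstructions):
    """Return (E, mean) for a list of per-channel reconstructions.

    mean[n] is the average over channels, as in equation (13); E is the
    summed squared deviation from that average, as in equation (12).
    """
    num_channels = len(reconstructions)
    num_samples = len(reconstructions[0])
    mean = [
        sum(r[n] for r in reconstructions) / num_channels
        for n in range(num_samples)
    ]
    error = sum(
        (r[n] - mean[n]) ** 2
        for r in reconstructions
        for n in range(num_samples)
    )
    return error, mean
```

If all channels agree on the reconstruction, the error estimate is zero; the estimate grows with the disagreement between channels and never requires access to the true source.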

It is readily seen from equation (12) that a perfect reconstruction with $x_i[n] = \hat{x}_i^j[n]$ yields a minimum error estimate of $E_i = 0$. Unfortunately, $x_i[n] = \hat{x}_i^j[n]$ may not be the only solution that satisfies $E_i = 0$ (e.g. if the $g_{ij}[n]$ are linearly dependent). One may hope, however, that (if the $g_{ij}[n]$ are sufficiently different) a global minimization of $E_i$ will lead to good estimates for the $h_{ij}[n]$ for $j = 1 \ldots M$.

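The computation of the global minimum of $E_i$ is aided by the following notation. We define:

$$\hat{\mathbf{x}}_i^j = \big[\, \hat{x}_i^j[n_1+P] \;\; \ldots \;\; \hat{x}_i^j[n_2-1] \;\; \hat{x}_i^j[n_2] \,\big]^T,$$

$$\mathbf{Y}_j = \begin{bmatrix} y_j[n_1] & y_j[n_1+1] & \cdots & y_j[n_1+P] \\ y_j[n_1+1] & y_j[n_1+2] & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ y_j[n_2-P] & \cdots & \cdots & y_j[n_2] \end{bmatrix},$$

$$\text{and} \quad \mathbf{h}_i^j = \big[\, h_{ij}[P] \;\; \ldots \;\; h_{ij}[1] \;\; h_{ij}[0] \,\big]^T. \tag{14}$$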

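Equation (11) can be rewritten as $\hat{\mathbf{x}}_i^j = \mathbf{Y}_j \mathbf{h}_i^j$ and:

$$\bar{\mathbf{x}}_i = \frac{1}{M} \sum_{j=1}^{M} \hat{\mathbf{x}}_i^j = \frac{1}{M} \sum_{j=1}^{M} \mathbf{Y}_j \mathbf{h}_i^j. \tag{15}$$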
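The matrix form of the inverse filtering can be checked with a small sketch. It builds the data matrix of equation (14) row by row and multiplies it with the time-reversed filter vector; the helper names `data_matrix`, `stacked_filter` and `matvec` are hypothetical, and zero-based Python indices stand in for the sample indices n1 and n2.

```python
def data_matrix(y, n1, n2, order):
    """Rows [y[r], y[r+1], ..., y[r+order]] for r = n1 ... n2 - order."""
    return [
        [y[r + c] for c in range(order + 1)]
        for r in range(n1, n2 - order + 1)
    ]


def stacked_filter(h):
    """Filter vector [h[P], ..., h[1], h[0]] as used in equation (14)."""
    return list(reversed(h))


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Usage: the product reproduces the direct convolution for n = n1+P ... n2.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
h = [1.0, -1.0, 0.5]                      # order P = 2
x_hat = matvec(data_matrix(y, 0, 5, 2), stacked_filter(h))
```

Each row of the matrix holds the P+1 observation samples that contribute to one output sample, which is why the filter coefficients appear in reversed order.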

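The error estimate (12) becomes:

$$\begin{aligned} E_i &= \sum_{j=1}^{M} \Big\| \, \mathbf{Y}_j \mathbf{h}_i^j - \frac{1}{M} \sum_{k=1}^{M} \mathbf{Y}_k \mathbf{h}_i^k \, \Big\|^2 \\ &= \sum_{j=1}^{M} [\, \mathbf{Y}_j \mathbf{h}_i^j \,]^T \, \mathbf{Y}_j \mathbf{h}_i^j - \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{M} [\, \mathbf{Y}_k \mathbf{h}_i^k \,]^T \, \mathbf{Y}_j \mathbf{h}_i^j. \end{aligned} \tag{16}$$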

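We define the matrices $\mathbf{R}_{jk} = [\, \mathbf{Y}_j \,]^T \, \mathbf{Y}_k$ (so that the block in row $j$ and column $k$ below pairs $\mathbf{h}_i^j$ with $\mathbf{h}_i^k$, in agreement with equation (16)) and

$$\mathbf{R}^F = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} & \cdots & \mathbf{R}_{1M} \\ \mathbf{R}_{21} & \mathbf{R}_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{R}_{M1} & \cdots & \cdots & \mathbf{R}_{MM} \end{bmatrix}, \tag{17}$$

$$\mathbf{R}^D = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{R}_{MM} \end{bmatrix}, \tag{18}$$

$$\text{and} \quad \mathbf{H}_i = \big[\, [\, \mathbf{h}_i^1 \,]^T \;\; [\, \mathbf{h}_i^2 \,]^T \;\; \ldots \;\; [\, \mathbf{h}_i^M \,]^T \,\big]^T. \tag{19}$$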

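Using equations (14) to (19) we can compactly write the error estimate as

$$E_i = \mathbf{H}_i^T \, \Big[\, \mathbf{R}^D - \frac{1}{M} \mathbf{R}^F \,\Big] \, \mathbf{H}_i. \tag{20}$$

In order to avoid the trivial minimization of equation (20) (with $\mathbf{H}_i = \mathbf{0}$) we constrain the solution to

$$\| \bar{\mathbf{x}}_i \|^2 = \frac{1}{M^2} \, \mathbf{H}_i^T \, \mathbf{R}^F \, \mathbf{H}_i = 1. \tag{21}$$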

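We have thus reformulated the problem into that of finding the vector $\mathbf{H}_i$ that minimizes $E_i$ subject to equation (21).

It is readily shown with Lagrange multipliers that the solution to the above problem is provided by one of the generalized eigenvectors $\mathbf{v}_m$ of the matrices $\mathbf{R}^D$ and $\mathbf{R}^F$:

$$\lambda_m \, \mathbf{R}^D \mathbf{v}_m = \mathbf{R}^F \mathbf{v}_m. \tag{22}$$

We assume that the eigenvalues $\lambda_m$ are sorted in decreasing order $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \lambda_4 \ldots$ and that the eigenvectors are normalized such that constraint (21) holds, i.e. $\frac{1}{M^2} \mathbf{v}_m^T \mathbf{R}^F \mathbf{v}_m = 1$ for all $m$. Choosing $\mathbf{v}_m$ as the solution leads to an error estimate of $E_i = M \, (M / \lambda_m - 1)$. The optimal solution, i.e. the one that minimizes $E_i$, is thus given by the eigenvector with the largest eigenvalue:

$$\mathbf{H}_i = \mathbf{v}_1 \quad \text{and} \quad E_i = M \, \Big( \frac{M}{\lambda_1} - 1 \Big).$$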
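The Lagrange multiplier step can be spelled out as a short worked derivation. The symbols $\lambda$ for the generalized eigenvalue and $\mu$ for the Lagrange multiplier, as well as the shorthand $\mathbf{H}$ for $\mathbf{H}_i$, are notational assumptions made here; the derivation only uses equations (20) and (21) and the symmetry of $\mathbf{R}^D$ and $\mathbf{R}^F$.

```latex
\begin{align*}
\mathcal{L}(\mathbf{H}, \mu)
  &= \mathbf{H}^T \Big[ \mathbf{R}^D - \tfrac{1}{M} \mathbf{R}^F \Big] \mathbf{H}
     - \mu \Big( \tfrac{1}{M^2} \mathbf{H}^T \mathbf{R}^F \mathbf{H} - 1 \Big) \\
\nabla_{\mathbf{H}} \mathcal{L} = \mathbf{0}
  \;&\Rightarrow\;
  \mathbf{R}^D \mathbf{H}
  = \Big( \tfrac{1}{M} + \tfrac{\mu}{M^2} \Big) \mathbf{R}^F \mathbf{H}
  \;\Rightarrow\;
  \lambda \, \mathbf{R}^D \mathbf{H} = \mathbf{R}^F \mathbf{H},
  \quad \lambda = \Big( \tfrac{1}{M} + \tfrac{\mu}{M^2} \Big)^{-1} \\
E_i
  &= \mathbf{H}^T \mathbf{R}^D \mathbf{H} - \tfrac{1}{M} \mathbf{H}^T \mathbf{R}^F \mathbf{H}
   = \tfrac{1}{\lambda} \mathbf{H}^T \mathbf{R}^F \mathbf{H} - \tfrac{1}{M} \mathbf{H}^T \mathbf{R}^F \mathbf{H}
   = \tfrac{M^2}{\lambda} - M
   = M \Big( \tfrac{M}{\lambda} - 1 \Big)
\end{align*}
```

The last line uses the constraint $\mathbf{H}^T \mathbf{R}^F \mathbf{H} = M^2$. Since $E_i$ decreases as $\lambda$ grows, the eigenvector that belongs to the largest generalized eigenvalue gives the smallest error estimate.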

2.3 Blind System Identification

As a result of the methods described in sections 2.1 and 2.2 we obtain an inverse filter estimate $\mathbf{H}_i$ (with its associated eigenvalue $\lambda_1$) for each separately identified EAP section. In a first step we discard all EAP sections (and their $\mathbf{H}_i$) for which a quality measure, the base-10 logarithm of a ratio formed from $M$ and $\lambda_1$, was greater than a certain EAP acceptance threshold (EAT, see section 4). In a second step we use a simple minimum Euclidean distance hierarchical clustering method [8] to associate each vector $\mathbf{H}_i$ with one of the $M$ sources. All vectors associated with the same source $k$ are averaged¹ (arithmetic mean) into an average eigenvector $\bar{\mathbf{H}}_k$ for each source $k = 1 \ldots M$. By extracting the corresponding subvectors $\bar{\mathbf{h}}_i^j$ in analogy to equation (19):

$$\bar{\mathbf{H}}_i = \big[\, [\, \bar{\mathbf{h}}_i^1 \,]^T \;\; [\, \bar{\mathbf{h}}_i^2 \,]^T \;\; \ldots \;\; [\, \bar{\mathbf{h}}_i^M \,]^T \,\big]^T, \tag{23}$$

we obtain a complete set of inverse filter vectors $\bar{\mathbf{h}}_i^j$:

$$\bar{\mathbf{h}}_i^j = \big[\, \bar{h}_{ij}[P] \;\; \ldots \;\; \bar{h}_{ij}[1] \;\; \bar{h}_{ij}[0] \,\big]^T. \tag{24}$$

An estimate for the mixing matrix $\mathbf{G}(z)$ from equation (2) is obtained from:

$$\big[\, \hat{\mathbf{G}}(z) \,\big]_{ij} = \frac{1}{\sum_{k=0}^{P} \bar{h}_{ij}[k] \, z^{-k}}, \tag{25}$$

where the notation $[\mathbf{G}]_{ij}$ refers to the element of matrix $\mathbf{G}$ in row $i$ and column $j$. An estimate for the demixing matrix $\mathbf{H}(z)$ from equation (3) can be obtained by numerically inverting $\hat{\mathbf{G}}(z)$ via Gaussian elimination. Unfortunately, the inversion process may introduce unstable poles into the transfer functions of $\mathbf{H}(z)$. The production of stable filters can be enforced by mirroring poles that fall outside of the unit circle back into the inside of the unit circle. The mirroring process distorts the correct phase response, but leaves the magnitude response of individual channels intact.
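The effect of pole mirroring can be illustrated with a small sketch. It assumes the common reflection rule in which a pole p outside the unit circle is replaced by 1/conj(p); the paper does not spell out its exact rule, so this choice, the function names, and the all-pole test system are assumptions made here. Under this rule the magnitude response on the unit circle is preserved up to a constant gain factor equal to the product of the magnitudes of the mirrored poles.

```python
import cmath


def mirror_unstable_poles(poles):
    """Reflect every pole with |p| > 1 to 1 / conj(p); keep the others."""
    return [1 / p.conjugate() if abs(p) > 1 else p for p in poles]


def all_pole_magnitude(poles, omega):
    """Magnitude of 1 / prod_p (1 - p z^-1) at z = exp(j * omega)."""
    z = cmath.exp(1j * omega)
    denominator = 1.0
    for p in poles:
        denominator *= (1 - p / z)
    return 1 / abs(denominator)


# Usage: one unstable real pole at 2 and one stable pole at 0.5j.
poles = [2 + 0j, 0.5j]
stable_poles = mirror_unstable_poles(poles)
gain_ratios = [
    all_pole_magnitude(stable_poles, w) / all_pole_magnitude(poles, w)
    for w in (0.3, 1.0, 2.5)
]
```

In this example the ratio of the two magnitude responses is the same at every frequency (equal to 2, the magnitude of the mirrored pole), so only a gain correction is needed to keep the magnitude response unchanged, while the phase response differs.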

3 Experiments

Experiments were conducted to verify the performance of the proposed method. As speech data we used the SI-subset of the TIMIT database from the Linguistic Data Consortium². The chosen subset consists of recordings from 630 subjects, each uttering 3 phonetically diverse sentences³. The sentences were recorded with a sampling frequency of 16 kHz. The signals were low-pass filtered and down-sampled to 8 kHz prior to processing. All 3 sentences from the same speaker were concatenated and then truncated to 4 seconds. As a result we obtained a total of 630 different source signals $x[n]$, each 4 seconds long.

We ran experiments with different source numbers ($M$ = 2, 3, and 4) and different filter lengths ($P+1$ = 5, 7, and 10). For each $M$ the available data was randomly split⁴ into $\lfloor 630/M \rfloor$ groups of $M$ source signals $x_i[n]$. To simulate conversational speech the signals were partially faded out to obtain a relative time-overlap between signals of roughly 30% (see figure 2). The $M$ source signals of each group were mixed with order-$P$ random minimum phase filters $g_{ij}[n]$ according to equation (1). The resulting observations $y_j[n]$ for $j = 1 \ldots M$ were then used to estimate the inverse filter matrix $\mathbf{H}(z)$ according to section 2. The reconstructed source signal estimates $\hat{x}_i[n]$ for $i = 1 \ldots M$ were computed according to equation (3) via $\hat{\mathbf{X}}(z) = \mathbf{H}(z) \, \mathbf{Y}(z)$.

¹ The weight for each vector was chosen proportional to the number of samples contained in the associated EAP section. Longer EAP sections thus had more weight than shorter ones.

² The data is available at .

³ None of the sentences are repeated more than once.

⁴ Every source signal was only used once.

[Figure 2. An example of two source signals $x_1[n]$ and $x_2[n]$ from the TIMIT database. The signals were aligned to have a 30% overlap in time. Plot title: Source Signals; panels: Channel #1 and Channel #2; horizontal axis: Time [sec], 0 to 4.]

[Figure 3. The resulting mixed signals $y_1[n]$ and $y_2[n]$ from the example in figure 2. The upper dashed line in each axis indicates the resulting $\mathrm{STPM}_j[n]$ contour (equation (7)) and the lower dashed line indicates the resulting $\mathrm{NZCM}_j[n]$ contour (equation (10)). The gray regions indicate the EAP sections that were estimated from the mixed signals. Plot title: Mixed Signals and Features; panels: Channel #1 and Channel #2; horizontal axis: Time [sec], 0 to 4.]

[Figure 4. The resulting demixed signals $\hat{x}_1[n]$ and $\hat{x}_2[n]$ from the example in figure 2 after application of the proposed EAP method. Both signals are very close to the original source signals $x_1[n]$ and $x_2[n]$ depicted in figure 2. Plot title: Demixed Signals EAP Method; panels: Channel #1 and Channel #2; horizontal axis: Time [sec], 0 to 4.]

[Figure 5. The demixing result of a commonly used conventional method of blind source separation after L. Parra and C. Spence [9] (see section 4). Plot title: Demixed Signals Parra/Spence Method; panels: Channel #1 and Channel #2; horizontal axis: Time [sec], 0 to 4.]

The quality of the estimated model was evaluated with the Signal-to-Interference Ratio (SIR in [dB]) between the reconstructed signal $\hat{x}_i[n]$ and the original signal $x_i[n]$:

$$\mathrm{SIR}_i = \max_p \left\{ 10 \log_{10} \frac{\sum_n | \, x_i[n] \, |^2}{\sum_n | \, x_i[n] - \hat{x}_i[n-p] \, |^2} \right\}. \tag{26}$$

The evaluation of the SIR was performed under careful consideration of possible numbering permutations between the original signals and the reconstructions.
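The SIR measure of equation (26) can be sketched as a search over integer delays. The function name `sir_db`, the bounded search range `max_shift`, and the zero padding of the shifted reconstruction are assumptions made here; the paper does not state the range of p. Permutations between sources are not handled in this sketch.

```python
import math


def sir_db(x, x_hat, max_shift):
    """Best SIR in dB over delays p in [-max_shift, max_shift]."""
    best = -math.inf
    energy = sum(v * v for v in x)
    for p in range(-max_shift, max_shift + 1):
        residual = 0.0
        for n in range(len(x)):
            m = n - p
            shifted = x_hat[m] if 0 <= m < len(x_hat) else 0.0  # zero padding
            residual += (x[n] - shifted) ** 2
        if residual == 0.0:
            return math.inf            # perfect reconstruction at this delay
        best = max(best, 10 * math.log10(energy / residual))
    return best
```

A reconstruction that equals the source scaled by 0.9 yields 20 dB at zero delay, and a pure delay of the source is reported as a perfect reconstruction once the search range covers that delay.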

Figures 2 to 4 show an example for an experiment with two sources. The gray regions in the figures indicate the EAP sections that were estimated from the mixed signals $y_1[n]$ and $y_2[n]$ from figure 3. Figure 5 shows the result for a commonly used conventional method of blind source separation after L. Parra and C. Spence [9] (see section 4).

4 Results

The results of the experiments are summarized in tables I and II. Table I lists the average SIR values (AvSIR) that were obtained by averaging the $\mathrm{SIR}_i$ after equation (26) over all channels and all experiments with the same source number $M$ and filter order $P$. The third column reports the average SIR values for the proposed EAP method. The fourth column reports the average SIR values that resulted from an application of the popular blind source separation method proposed by L. Parra and C. Spence [9]. The results for the Parra/Spence method were computed with software written by S. Harmeling (MATLAB function convbss.m, endorsed by L. Parra and C. Spence). The last column reports the average SIR between the observations $y_j[n]$ and the sources $x_i[n]$ as a reference.

Table II provides supplemental information for each experiment. Column three of table II lists the average SIR results that are obtained when the proposed method is applied to the true EAP locations (and not the estimated EAP locations). The fourth column of table II reports the number of instances (in %) in which the numerical inversion of matrix $\hat{\mathbf{G}}(z)$ after equation (25) led to unstable poles that had to be mapped back into the unit circle. Column five reports the chosen value for the EAP acceptance threshold (EAT) as described in section 2.3.

It is clearly visible from table I that the proposed method achieves significant improvements over the Parra/Spence method for low complexity tasks with smaller source numbers $M$ and smaller filter orders $P$. In the best case scenario, for two sources and with a filter length of 5 taps, we obtain an 11 dB improvement in average signal-to-interference ratio. Unfortunately, the advantage vanishes with larger complexity tasks. The decline is partially due to the increasing number of pole location changes as listed in column four of table II.

Table I

Source    Filter    AvSIR [dB]    AvSIR [dB]      AvSIR [dB]
Number    Length    EAP           Parra/          Mixed
M         P+1       Method        Spence          Signals

2         5         16.06         4.96            4.47
2         7         10.45         5.11            4.50
2         10         7.37         5.35            4.76
3         5         11.24         5.17            3.77
3         7          8.41         5.16            3.88
3         10         6.22         5.40            4.02
4         5          6.67         5.17            3.10
4         7          2.72         5.10            3.26

Table I. Average signal-to-interference ratios for various source numbers M, filter lengths P+1, and algorithms.
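As a small arithmetic check, the per-configuration gain of the EAP method over the Parra/Spence method can be computed directly from the values listed in Table I. The numbers below are copied from the table as printed; the names `TABLE_I` and `improvements` are hypothetical.

```python
# Rows: (M, P+1, AvSIR EAP, AvSIR Parra/Spence, AvSIR mixed signals), in dB.
TABLE_I = [
    (2, 5, 16.06, 4.96, 4.47),
    (2, 7, 10.45, 5.11, 4.50),
    (2, 10, 7.37, 5.35, 4.76),
    (3, 5, 11.24, 5.17, 3.77),
    (3, 7, 8.41, 5.16, 3.88),
    (3, 10, 6.22, 5.40, 4.02),
    (4, 5, 6.67, 5.17, 3.10),
    (4, 7, 2.72, 5.10, 3.26),
]


def improvements(rows):
    """Map (M, P+1) to the AvSIR difference EAP minus Parra/Spence in dB."""
    return {
        (m, taps): round(eap - parra_spence, 2)
        for m, taps, eap, parra_spence, _mixed in rows
    }


gains = improvements(TABLE_I)
```

The entry for two sources and 5 taps is the roughly 11 dB gain quoted in the text, and the entry for four sources and 7 taps is negative, which reflects the loss of the advantage at higher complexity.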

Table II

Source    Filter    AvSIR [dB]    Pole
Number    Length    True          Mirror    EAT
M         P+1       EAP           %

2         5         47.29         15.56     4.0
2         7         26.54         30.79     4.0
2         10         9.29         55.56     4.0
3         5         30.53         59.52     3.0
3         7         19.15         77.14     3.0
3         10         9.68         95.24     3.0
4         5         18.11         92.36     3.0
4         7          9.09         99.36     3.0

Table II. Supplemental statistics about the experiments with various source numbers M and filter lengths P+1.

A very promising result for future developments is contained in column three of table II. If the result of the EAP estimation is replaced with the true location of the EAPs in the given mixture signals $y_j[n]$ for $j = 1 \ldots M$, then the average SIR is dramatically improved over the Parra/Spence method even for higher complexity cases (e.g. 18.11 dB versus 5.17 dB for four sources and 5 taps). It is thus expected that the method will produce significantly better results if equipped with a more robust EAP detection strategy.

5 Conclusions

We have presented a new approach for blind multichannel system identification. The approach relies on the detection of exclusive activity periods (EAPs) in the source signals. The presence of EAPs is exploited for a reliable blind estimation of source signals and channel properties (between sources and observations). The advantage of the proposed method was demonstrated experimentally within the framework of convolutive blind separation of speech signals.

The goal of the paper was to provide a proof of concept for EAP based blind identification methods. Some of the methods presented in this paper, especially the section on EAP detection, are still suboptimal in their current form and deserve to be studied in greater detail. Despite its suboptimality, however, the proposed method still improves upon existing strategies (especially for lower complexity tasks).

A caveat of the proposed method is that (currently) we have not imposed a constraint that forces the optimal unmixing matrix $\mathbf{H}(z)$ to be representative of a stable system. Instead, we employed a simple pole-mirroring strategy that, by itself, is responsible for a substantial part of the performance loss at higher complexity tasks (see table II).

References

[1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley-Interscience, 2001.

[2] P. Comon, Independent component analysis: A new concept, Signal Processing, vol. 36, pp. 287-314, 1994.

[3] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, Wiley, Chichester, U.K., 2002.

[4] N. Mitianoudis and M. E. Davies, Audio source separation: solutions and problems, International Journal of Adaptive Control and Signal Processing, vol. 18, no. 3, pp. 299-314, Apr. 2004.

[5] J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.

[6] Y. Medan, E. Yair, and D. Chazan, Super resolution pitch determination of speech signals, IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, January 1991.

[7] Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, Wiley-Interscience, 2004.

[8] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, Menlo Park, CA, 1973.

[9] L. Parra and C. Spence, Convolutive blind separation of non-stationary sources, IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 320-327, May 2000.

