head orientation compensation with video-informed … · 2017. 8. 25. · head orientation...

IWAENC 2016

Soumitro Chakrabarty, Deepth Pilakeezhu, Emanuël Habets

HEAD ORIENTATION COMPENSATION WITH VIDEO-INFORMED SINGLE CHANNEL

SPEECH ENHANCEMENT

© AudioLabs, 2016

Soumitro Chakrabarty Head Orientation Compensation with Video-Informed Single Channel Speech Enhancement

2

Head Orientation

§  Three degrees of freedom for head movement

§  Propagation of sound waves in the horizontal plane

§  Orientation of interest: YAW

© AudioLabs, 2016


3

Motivation

§  Similar problems can be witnessed in hands-free communication systems, speech based human machine interfaces etc.

© AudioLabs, 2016


4

Motivation

§  Human speakers radiate sound primarily to the front [1] §  Also, the radiation pattern is frequency dependent [1]

[1] H. K. Dunn and D. W. Farnsworth, “Exploration of pressure field around the human head during speech,” Journal Acoust. Soc. of America, vol. 10, no. 1, pp. 83–83, 1938.

© AudioLabs, 2016


5

Sound Radiation Pattern Frequency Dependency

0.5 1 1.5 2X [m]

0.5

1

1.5

2

Y [m

]

0

5

10

15

20

0.5 1 1.5 2X [m]

0.5

1

1.5

2Y

[m]

0

5

10

15

20

(a) 100 Hz (b) 3 kHz

§  Generated using spherical microphone impulse response generator (SMIRgen) [2]

[2] D. P. Jarrett, E. A. P. Habets, M. R. P. Thomas, and P. A. Naylor, “Rigid sphere room impulse response simulation: Algorithm and applications,” Journal Acoustical Society of America, vol. 132, pp. 1462, 2012.

© AudioLabs, 2016


6

Speaker-Microphone Setup

§  Aim: Compensate for the reduction in sound energy due to the relative orientation of the speaker with respect to the microphone, while attenuating the noise.

Speaker Microphone

0◦

90◦

180◦

270◦

0�

© AudioLabs, 2016


7

Problem Formulation

§  Microphone signal (STFT Domain) Y (n, k) = H(✓, k)S(n, k) + V (n, k)

0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)0◦

θ

270◦

90◦

180◦

© AudioLabs, 2016


8

Problem Formulation


0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)0◦

θ

270◦

90◦

180◦

Orientation-dependent ATF

Head orientation

© AudioLabs, 2016


9

Problem Formulation


0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)0◦

θ

270◦

90◦

180◦

Source signal

Head orientation

© AudioLabs, 2016


10

Problem Formulation


0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)0◦

θ

270◦

90◦

180◦

Noise

Head orientation

© AudioLabs, 2016


11

Problem Formulation

§  Microphone signal (STFT Domain)

§  Can be formulated as

Y (n, k) = H(✓, k)S(n, k) + V (n, k)

0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)

Y (n, k) = A(✓, k)X(n, k) + V (n, k)

0◦

θ

270◦

90◦

180◦

© AudioLabs, 2016


12

Problem Formulation


§  In terms of attenuation

with and

Y (n, k) = H(✓, k)S(n, k) + V (n, k)

0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)

Y (n, k) = A(✓, k)X(n, k) + V (n, k)

A(✓, k) =H(✓, k)

H(0, k)X(n, k) = H(0, k)S(n, k)

0◦

θ

270◦

90◦

180◦

Attenuation factor

© AudioLabs, 2016


13

Problem Formulation


§  In terms of attenuation

with and

Y (n, k) = H(✓, k)S(n, k) + V (n, k)

0◦

θ

270◦

90◦

180◦

Y (n, k)S(n, k)

Y (n, k) = A(✓, k)X(n, k) + V (n, k)

A(✓, k) =H(✓, k)

H(0, k)X(n, k) = H(0, k)S(n, k)

0◦

θ

270◦

90◦

180◦

Attenuation factor Desired Signal

© AudioLabs, 2016


14

Orientation Compensation Filter Derivation

§  Assuming all signal components to be independent

�Y (n, k) = |A(✓, k)|2�X(n, k) + �V (n, k)

© AudioLabs, 2016


15



�Y (n, k) = |A(✓, k)|2�X(n, k) + �V (n, k)

Microphone signal PSD

Desired signal PSD

Noise PSD

© AudioLabs, 2016


16



§  Estimate of desired signal

�Y (n, k) = |A(✓, k)|2�X(n, k) + �V (n, k)

X(n, k) = W (✓, k)Y (n, k)

© AudioLabs, 2016


17



§  Estimate of desired signal

§  Using MMSE criterion

§  Solution:

�Y (n, k) = |A(✓, k)|2�X(n, k) + �V (n, k)

X(n, k) = W (✓, k)Y (n, k)

W (✓, k) = arg minW

E{|WY (n, k)�X(n, k)|2}

W (✓, k) =|A(✓, k)|�X

|A(✓, k)|2�X + �V

© AudioLabs, 2016


18

Orientation Compensation Filter Orientation Dependent Gain §  Defining orientation dependent gain

G(✓, k) = |A(✓, k)|�1

© AudioLabs, 2016


19


§  Solution: G(✓, k) = |A(✓, k)|�1

W (✓, k) = G(✓, k) · �X

�X +G2(✓, k)�V

© AudioLabs, 2016


20


§  Solution: G(✓, k) = |A(✓, k)|�1

W (✓, k) = G(✓, k) · �X

�X +G2(✓, k)�V

Y (n, k)

G(θ, k)

X(n, k)

W (θ, k)

ReductionNoise

© AudioLabs, 2016


21


§  Solution: G(✓, k) = |A(✓, k)|�1

W (✓, k) = G(✓, k) · �X

�X +G2(✓, k)�V

Need to be estimated

Y (n, k)

G(θ, k)

X(n, k)

W (θ, k)

ReductionNoise

© AudioLabs, 2016


22

Head Orientation Estimation Video-based Method

§  Camera is co-located with the microphone

Speaker Microphone

0◦

90◦

180◦

270◦

Camera

0�

© AudioLabs, 2016


23


§  Camera is co-located with the microphone §  Proprietary software from Fraunhofer IIS, SHORETM, is used to obtain a

single orientation estimate at each time frame

Speaker Microphone

0◦

90◦

180◦

270◦

Camera

0�

© AudioLabs, 2016


24


§  Video-based orientation estimation §  Proprietary software from Fraunhofer IIS, SHORETM, is used to obtain a

single orientation estimate at each time frame n

§  Current limitation: We do not obtain estimates in the range of [90�, 270�]

Speaker Microphone

0◦

90◦

180◦

270◦

Camera

Speaker Microphone

0◦

90◦

180◦

270◦

Camera

Speaker Microphone

0◦

90◦

180◦

270◦

Camera

In this work, we assume the orientation to be known

© AudioLabs, 2016


25

Orientation Dependent Gain §  Use SMIRgen as a mouth simulator to compute a gain table §  Head is modeled as a rigid sphere

§  Mouth is an omnidirectional point source placed on the sphere

§  Orientation dependent gain is selected from the pre-computed gain table, at each time frame, based on the current estimate of the orientation

© AudioLabs, 2016


26

Gain Table Computation

0◦

90◦

180◦

270◦I discrete points

(1)  Sample the orientation range at points

I

© AudioLabs, 2016


27


0◦270◦

90◦

180◦

θi

(1)  Sample the orientation range at points (2)  Compute the ATF at each

✓i

I

© AudioLabs, 2016


28

Gain Table Computation (1)  Sample the orientation range at points (2)  Compute the ATF at each (3)  Compute the corresponding gain at each point, for each bin, as

0◦270◦

90◦

180◦

θi

✓i

I

G(✓i, k) =

��H(✓i, k)

H(0, k)

��

�1

G(✓, k) = |A(✓, k)|�1follows from

© AudioLabs, 2016


29


G(✓i, k) =

��H(✓i, k)

H(0, k)

��

�1

G(✓, k) = |A(✓, k)|�1follows from

(1)  Sample the orientation range at points (2)  Compute the ATF at each (3)  Compute the corresponding gain at each point, for each bin, as

(4)  The gain table is a matrix of size .

✓i

I

I ⇥K

© AudioLabs, 2016


30

Gain Table

Orientation (◦)0 60 120 180 240 300

Frequ

ency

[kHZ]

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

9

10

Figure: Gain table computed with SMIRgen for an anechoic environment

0◦

90◦

180◦

270◦

© AudioLabs, 2016


31

Gain Table

Orientation (◦)0 60 120 180 240 300

Frequ

ency

[kHZ]

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

9

10

Figure: Gain table computed with SMIRgen for an anechoic environment

§  Can be computed for reverberant environments using SMIRgen §  Can be computed using measured ATFs

© AudioLabs, 2016


32

System Overview

Y (n, k) X(n, k)

Orientation-dependentgain computation

G(θ, k)

Noise and Signal

φV , φX

power estimation

Orientationcompensation

filter

Orientationestimation

θOrientation (◦)

0 60 120 180 240 300

Frequ

ency

[kHZ]

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

9

10

© AudioLabs, 2016


33

Experimental Results Measured RIRs §  Measurement setup

§  Room size: §  Source-to-microphone distance: 1 m §  s

4.55m⇥ 4.45m⇥ 2.55m

T60 = 0.17

Speaker Microphone

0◦

90◦

180◦

270◦

KEMAR Dummy Head

© AudioLabs, 2016


34

Experimental Results Measured RIRs §  Measurement setup

§  Room size: §  Source-to-microphone distance: 1 m §  s

§  Stationary white noise with iSNR = 20 dB §  STFT parameters: 16 kHz sampling rate, frame length of 1024 samples

with 50% overlap §  Noise PSD: estimated from silent frames §  Desired signal PSD: Decision directed approach [3]

4.55m⇥ 4.45m⇥ 2.55m

T60 = 0.17

[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, 1985.

© AudioLabs, 2016


35

Experimental Results Measured RIRs

§  Results presented for three different gain table computations with resolution of 30 degrees, plus for only noise reduction

0◦

90◦

180◦


© AudioLabs, 2016


36


§  Results presented for three different gain table computations with resolution of 30 degrees, plus for only noise reduction §  NR: Only noise reduction, no application of orientation related gain

0◦

90◦

180◦


© AudioLabs, 2016


37


§  Results presented for three different gain table computations with resolution of 30 degrees, plus for only noise reduction §  NR: Only noise reduction, no application of orientation related gain §  AG: Assuming anechoic environment (using SMIRgen)

0◦

90◦

180◦


© AudioLabs, 2016


38


§  Results presented for three different gain table computations with resolution of 30 degrees, plus for only noise reduction §  NR: Only noise reduction, no application of orientation related gain §  AG: Assuming anechoic environment (using SMIRgen) §  RGSA: Spatially averaged reverberant gain (using SMIRgen)

0◦

90◦

180◦


© AudioLabs, 2016


39


§  Results presented for three different gain table computations with resolution of 30 degrees, plus for only noise reduction §  NR: Only noise reduction, no application of orientation related gain §  AG: Assuming anechoic environment (using SMIRgen) §  RGSA: Spatially averaged reverberant gain (using SMIRgen) §  RGMES: Using measured ATFs

0◦

90◦

180◦


© AudioLabs, 2016


40


(a) PESQ Improvement (b) meanLSD

0 60 120 180 240 300 3600.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Orientation (◦)

∆PESQ

0 60 120 180 240 300 3600

1

2

3

4

5

6

Orientation (◦)mLSD

[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

NR: Noise Reduction

0◦

90◦

180◦

270◦

© AudioLabs, 2016


41


0 60 120 180 240 300 3600.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Orientation (◦)

∆PESQ

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]


0◦

90◦

180◦

270◦

AG: Anechoic Gain (SMIRgen)

© AudioLabs, 2016


42


0 60 120 180 240 300 3600.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Orientation (◦)

∆PESQ

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]


0◦

90◦

180◦

270◦

RGSA: Spatially Averaged Reverberatn Gain (SMIRgen)

© AudioLabs, 2016


43


0 60 120 180 240 300 3600.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Orientation (◦)

∆PESQ

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]

0 60 120 180 240 300 3600

1

2

3

4

5

6


[dB]


0◦

90◦

180◦

270◦

RGMES: Measured ATFs

© AudioLabs, 2016


44

Audio Examples Measured RIRs

0◦

90◦

180◦

270◦

0◦180◦

θ = 150◦

270◦

90◦

0◦

90◦

180◦

270◦RGMES gain

0◦

90◦

180◦

270◦Noise Reduction Only

© AudioLabs, 2016


45

Conclusions and Outlook §  A single channel speech enhancement framework, that incorporates

head orientation information was presented

§  Experimental results provided motivation for further exploring the significance head orientation information for speech enhancement

§  Current research focuses on developing methods to learn the attenuation characteristics due to the orientation of the speaker to perform the compensation

§  Future work would involve relaxing the constraints of the current system, and develop a method more suitable for a practical setting

head orientation compensation with video-informed … · 2017. 8. 25. · head orientation...

Documents