Multi-user Source Localization using Audio-Visual Information of KINECT


Page 1

Page 2

Multi-user Source Localization using

Audio-Visual Information of KINECT

Hong-Goo Kang

DSP Lab, Dept. of EE, Yonsei University

Page 3

Contents

Overview: Research Overview

Group 1: Multiple User Localization

Group 2: Active Speaker Detection and Beamforming

Group 3: Speaker Identification

Discussion

Page 4

Research Overview (1)

Multiple user source tracking and localization

Microphone array processing: beamforming

Active speaker detection & real-time speaker identification

Page 5

Research Overview (2)

Speech signal processing in a multi-user environment

Requires user-dependent processing based on each user's location/direction

Performance degrades in harsh acoustic environments (noise, reverberation, interference, etc.), which in turn degrades beamforming

Solutions/Approaches

Introduce video signal processing (depth image) for head/face detection and tracking

Improve microphone-array-based speech enhancement (beamforming) using the head location information

Apply the enhanced signal to speech signal processing applications, i.e. speaker recognition/identification

Page 6

Proposed Framework

KINECT inputs

Video: depth image, robust to the environment

Audio: voice information (speech signal)

Target applications: games, virtual conference, ASR

Page 7

Research Groups

Group 1: Multi-user Localization

Video-based head detection and tracking

Coordinate translation (2D -> 3D)

Group 2: Beamforming

Active speaker detection

Beamforming

Group 3: Speaker Identification

Feature extraction

User identification

Page 8

Contents

Overview: Research Overview

Group 1: Multiple User Localization

Group 2: Active Speaker Detection and Beamforming

Group 3: Speaker Identification

Discussion

Page 9

Multiple User Localization

Objective: detect heads in the Kinect depth image and produce candidate source locations

Kinect depth image -> head detection -> candidate location

Page 10

Depth Image based Head Tracking

Overview of head detection and tracking

Head detection (2D chamfer matching): depth data -> background removal -> edge detection -> distance transform -> distance measure -> template matching -> position information

Head tracking: set initial window -> centroid calculation -> coordinate information

Localization: coordinate translation

Page 11

Head Detection (1)

Background removal: uses the player information delivered with the Kinect depth stream

Pixels with player index = 0 are background and are removed

[Result image: depth frame after background removal]
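A minimal sketch of the background-removal step, assuming the Kinect v1 packed 16-bit depth format in which the player index occupies the low 3 bits of each sample; the function name and frame layout are illustrative, not from the slides.

```python
import numpy as np

def remove_background(packed_depth):
    """Mask out background pixels using the player information, as on the
    slide: player index = 0 means background.

    Assumes the packed format (player index in the low 3 bits, depth in
    the upper 13 bits); if the SDK already delivers separate depth and
    player-index frames, only the masking step is needed.
    """
    packed_depth = packed_depth.astype(np.uint16)
    player_index = packed_depth & 0x7          # low 3 bits: player index 0..7
    depth = packed_depth >> 3                  # upper 13 bits: depth value
    foreground = player_index != 0             # slide: index 0 -> background
    return np.where(foreground, depth, 0)

# Example on a dummy 240x320 frame
frame = np.zeros((240, 320), dtype=np.uint16)
foreground_depth = remove_background(frame)
```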

Page 12

Head Detection (2)

Chamfer matching

Shape detection algorithm

Morphological edge detection

Uses dilation and erosion images

Dilation : grow image region

Erosion : shrink image region

[Figure: original depth image and the resulting morphological edge image]
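The morphological edge detector described above can be sketched with OpenCV as the difference between the dilated and eroded images; the kernel size and the final thresholding are illustrative choices, not specified on the slide.

```python
import cv2
import numpy as np

def morphological_edges(depth_img, ksize=3):
    """Edge image = dilation (grown region) minus erosion (shrunk region),
    as described on the slide. depth_img is assumed to be an 8-bit,
    background-removed depth frame."""
    kernel = np.ones((ksize, ksize), np.uint8)
    dilated = cv2.dilate(depth_img, kernel)    # grow the image region
    eroded = cv2.erode(depth_img, kernel)      # shrink the image region
    edges = cv2.subtract(dilated, eroded)      # only the thin boundary remains
    _, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY)
    return binary
```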

Page 13

Head Detection (3)

Distance Transform

Converts binary image into distance image

Results in distance from the closest edge pixel

[Figure: edge image and the corresponding distance image]

Page 14

Head Detection (4)

Distance measure

Between edge and distance image

Edge image : template head image

Distance image : depth image

RMS chamfer distance

d_{\text{chamfer}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^{2}}

where d_i is the value of the distance image at the i-th edge pixel of the template and n is the number of template edge pixels

[Figure: black dots mark edge pixels; the numbers give each pixel's distance from the nearest edge]
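A sketch of this distance measure, assuming the scene edge map has already been distance-transformed (e.g. with cv2.distanceTransform) and the template is a binary head edge image; the placement API is illustrative.

```python
import cv2
import numpy as np

def build_distance_image(edge_map):
    """Distance transform of the scene: each pixel gets the distance to the
    nearest edge pixel (edge_map is binary, edges = 255)."""
    return cv2.distanceTransform(cv2.bitwise_not(edge_map), cv2.DIST_L2, 3)

def rms_chamfer_distance(distance_img, template_edges, top_left):
    """RMS chamfer distance sqrt((1/n) * sum_i d_i^2), where d_i is the
    distance-image value under the i-th template edge pixel when the
    template is placed at top_left = (row, col)."""
    r, c = top_left
    h, w = template_edges.shape
    window = distance_img[r:r + h, c:c + w]
    d = window[template_edges > 0]                 # distances at template edge pixels
    return np.sqrt(np.mean(d.astype(np.float64) ** 2))
```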

Page 15

Head Detection (5)

Template matching

Refines the candidate head locations obtained from chamfer matching

Uses nine head template images

Produces the final head location


Page 16

Head Tracking

Initial window

Sets the coordinates from head detection

Real-time head tracking

Shifts the window center to the centroid of the head region

Adjusts the window size

[Head tracking: set initial window -> centroid calculation]
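A minimal sketch of the centroid-based tracker, assuming foreground pixels are the non-zero depth values inside the window; the fixed iteration count and the absence of the size-adaptation rule are simplifications.

```python
import numpy as np

def track_head(depth_frame, window, iters=5):
    """Shift the tracking window so its centre follows the centroid of the
    foreground depth pixels it contains.
    window = (row, col, height, width) of the current head window."""
    r, c, h, w = window
    for _ in range(iters):
        patch = depth_frame[r:r + h, c:c + w]
        ys, xs = np.nonzero(patch)            # foreground pixels in the window
        if len(ys) == 0:                      # target lost: keep the last window
            break
        r = int(r + ys.mean() - h / 2)        # move the window centre to the centroid
        c = int(c + xs.mean() - w / 2)
    return (r, c, h, w)
```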

Page 17

Coordinate Translation

Sound source localization

Converts the center pixel of the detected head into a location relative to the KINECT

Coordinate translation equations: linear expressions derived from trigonometry

(\tilde{x}, \tilde{y}, \tilde{z}) = f(x, y, D)

\tilde{x} = \frac{13\,D\,(x - 160)}{4000}, \quad \tilde{y} = \frac{13\,D\,(y - 120)}{4000}, \quad \tilde{z} = D

(x, y): pixel indices, D: distance, (\tilde{x}, \tilde{y}, \tilde{z}): relative source location in meters
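A direct transcription of the translation equations for the 320x240 depth image; the scale factor 13/4000 and the image centre (160, 120) come from the slide, while the axis sign conventions and the unit of D are assumptions.

```python
def pixel_to_world(x, y, D):
    """Map the head-centre pixel (x, y) and its measured distance D to a
    source location relative to the KINECT."""
    scale = 13.0 / 4000.0
    x_rel = scale * D * (x - 160)   # horizontal offset from the optical axis
    y_rel = scale * D * (y - 120)   # vertical offset from the optical axis
    z_rel = D                       # distance along the optical axis
    return x_rel, y_rel, z_rel

# e.g. head centre at pixel (200, 100), 2.0 m away
print(pixel_to_world(200, 100, 2.0))
```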

Page 18

Head Detection and Tracking

Multi-user tracking demo

Page 19

Contents

Overview: Research Overview

Group 1: Multiple User Localization

Group 2: Active Speaker Detection and Beamforming

Group 3: Speaker Identification

Discussion

Page 20

Beamforming

Beamforming

Performs spatial filtering

Enhances the target speech by suppressing noise and interference

Beamforming process

[Figure: microphone signals x_1(k), ..., x_N(k) are weighted and summed; source location information steers the spatial filter against interference and diffused noise]

y(k) = \mathbf{w}^{H}\mathbf{x}(k)

x: input signal vector, w: weight vector, y: output signal
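A minimal frequency-domain delay-and-sum sketch of the spatial filter y(k) = w^H x(k); the project itself uses a GSC (see the simulation slide), so this only illustrates how the location information turns into steering weights.

```python
import numpy as np

def steering_weights(mic_positions, source_pos, freqs, c=343.0):
    """Delay-and-sum weights pointing the array at source_pos.
    mic_positions: (M, 3) in metres, source_pos: (3,), freqs: (F,) in Hz.
    Returns w with shape (F, M)."""
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c                 # relative propagation delays
    return np.exp(-2j * np.pi * np.outer(freqs, delays)) / len(dists)

def beamform(X, w):
    """Apply y(f) = w(f)^H x(f) in every frequency bin and frame.
    X: (F, M, K) multichannel STFT, w: (F, M); returns (F, K)."""
    return np.einsum('fm,fmk->fk', np.conj(w), X)
```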

Page 21

Sound Source Localization Simulation

Conventional DoA estimation

Male speech, 16kHz

32ms Hanning window, 16ms overlap

Diffused white noise, 20dB SNR

Limitations of conventional approach

Susceptible to acoustical effects

Noise, Reverberation, Interference

Errors in localization cause performance degradation in the beamforming stage

High computational complexity

Need to estimate source location for every frame

Page 22

Proposed Method - Overview

Audio + Video -> Sound Source Localization -> Beamforming

Page 23

Beamforming for Multiple Candidates

Q: With multiple candidate speakers, how do we find the active one?

A: Simultaneously form multiple beams, one per candidate

The multiple-active-speaker scenario must also be considered

[Figure: three candidate locations q1, q2, q3; the speaking user is marked as the active speaker]

Page 24

Active Speaker Detection (1)

Candidate speaker detection

Apply SRP-PHAT algorithm to all the candidates’ locations

Find the location q that has the maximum sound level

\hat{q} = \arg\max_{q_i} P(q_i), \quad i = 1, 2, \dots, 6

P(q) = \int \Big| \sum_{n} \frac{X_n(\omega)}{|X_n(\omega)|}\, e^{j\omega\tau_n(q)} \Big|^{2} d\omega

[Figure: candidate speakers q1, q2, q3 with steered-power values 250, 1000, 700; the candidate with the maximum power is selected]
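A sketch of candidate scoring with SRP-PHAT restricted to the video-supplied candidate locations; the delay table and frame layout are illustrative assumptions.

```python
import numpy as np

def pick_active_candidate(X, freqs, taus):
    """Return the index of the candidate with maximum SRP-PHAT steered power.
    X:     (M, F) microphone spectra of the current frame.
    freqs: (F,) analysis frequencies in Hz.
    taus:  (Q, M) propagation delay from candidate q to microphone m,
           computed from the depth-based candidate locations."""
    phat = X / (np.abs(X) + 1e-12)                        # PHAT whitening
    powers = []
    for tau in taus:
        steer = np.exp(2j * np.pi * np.outer(tau, freqs)) # (M, F) phase alignment
        beam = np.sum(phat * steer, axis=0)               # steered, summed spectrum
        powers.append(np.sum(np.abs(beam) ** 2))          # integrate |.|^2 over frequency
    return int(np.argmax(powers))
```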

Page 25

Active Speaker Detection (2)

A/V integrated active user localizer

Uses average magnitude difference function (AMDF)

Find the lag k that best compensates the delay difference τ1 − τ2

Compare it with the actual τ1 and τ2 obtained from the video signal

A speaker is declared active if the direction difference is within a pre-defined threshold

\mathrm{AMDF}(k) = \frac{1}{N}\sum_{n=0}^{N-1} \big| x_1(n) - x_4(n - k) \big|

\hat{k}_{\mathrm{AMDF}} = \arg\min_{k} \mathrm{AMDF}(k)

[Figure: target user and interference in front of the array; the target's delays τ1 and τ2 are observed at microphones x1(n) and x4(n)]
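A sketch of the A/V consistency check: the AMDF lag between two microphones is compared against the delay predicted from the video location. The microphone pair (x1, x4) follows the slide; the lag range and tolerance are illustrative assumptions.

```python
import numpy as np

def amdf_delay(x1, x4, max_lag):
    """Lag (in samples) that minimizes the average magnitude difference
    between x1(n) and x4(n - k)."""
    N = len(x1)
    lags = list(range(-max_lag, max_lag + 1))
    amdf = [np.mean(np.abs(x1[max_lag:N - max_lag] -
                           x4[max_lag - k:N - max_lag - k])) for k in lags]
    return lags[int(np.argmin(amdf))]

def is_active(x1, x4, video_delay, max_lag=40, tol=3):
    """Candidate is active when the audio-derived and video-derived delays
    agree within tol samples."""
    return abs(amdf_delay(x1, x4, max_lag) - video_delay) <= tol
```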

Page 26

Beamforming - Simulation

Simulation setup

Uses Generalized Sidelobe Canceller (GSC)

Interference: white noise

Results

Conventional vs proposed localization

[Figure: target speech at 45°, interference, and diffused noise around the array]

Page 27

Performance Evaluation

Localization error

Spectral distortion and SNR Improvement

Page 28

Demonstration

Beamforming

Noise reduction and target speech preservation

Active speaker detection

Page 29

Contents

Overview: Research Overview

Group 1: Multiple User Localization

Group 2: Active Speaker Detection and Beamforming

Group 3: Speaker Identification

Discussion

Page 30

Application

Online-meeting

Speaker identification

Speech recognition

Gesture recognition

Page 31

Introduction

The active speaker's ID is displayed during speech-active regions

Page 32

Problems

Implementation issues

Real-time online system

Frame-based decisions instead of utterance-based decisions

Complexity

Cannot use a high number of Gaussian mixture components

Exploits the fact that only a few mixtures of a GMM contribute significantly to the likelihood of a given speech feature vector

Multi-speaker identification

Cannot use the conventional pruning algorithm

Page 33

Feature Extraction (1)

Mel-Frequency Cepstral Coefficients (MFCC), following the ETSI front-end configuration

39-dimensional MFCC feature: 13 static coefficients including the 0th, with delta and delta-delta

25 ms Hamming window / 10 ms shift

24-band mel filterbank, excluding the first band (below ~64 Hz)

Log-mel spectral features

[Figure: static, delta, and delta-delta log-mel spectra plotted over frames]

Pipeline: input speech -> STFT -> power spectrum -> mel filterbank -> logarithm -> discrete cosine transform -> liftering -> cepstral mean normalization -> dynamic feature calculation -> 39-dimensional MFCC
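A sketch of this 39-dimensional front-end (13 static coefficients including c0, plus deltas and delta-deltas) using librosa as a stand-in for the ETSI implementation; window and hop lengths follow the 25 ms / 10 ms figures at 16 kHz, and the batch CMN shown here is replaced by the on-line version on the next slide.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """39-dimensional MFCC features: 13 static coefficients (including c0)
    with delta and delta-delta, per-utterance mean-normalized."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=24,
                                n_fft=512, win_length=400, hop_length=160,
                                window='hamming')
    delta = librosa.feature.delta(mfcc)            # first-order dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order dynamics
    feats = np.vstack([mfcc, delta, delta2])       # (39, n_frames)
    return feats - feats.mean(axis=1, keepdims=True)  # batch CMN
```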

Page 34

Feature Extraction (2)

Real-time Cepstral Mean Normalization (CMN)

Standard CMN cannot be applied directly in real time: the mean vector can only be computed after the complete input signal has been received

An approximated on-line technique is used instead

\bar{c}_n = \alpha\, c_n + (1 - \alpha)\, \bar{c}_{n-1}

\tilde{c}_n = c_n - \bar{c}_n

c_n: cepstral vector at frame n, \bar{c}_n: running cepstral mean, \alpha: update factor (\alpha = 0.003 below)

[Figure: 1st MFCC coefficient and its delta over frames, without CMN, with batch CMN, and with real-time CMN (α = 0.003)]
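A sketch of the approximated on-line CMN, following the recursion reconstructed above with α = 0.003 taken from the figure legend; initialising the running mean with the first frame is an assumption.

```python
import numpy as np

class OnlineCMN:
    """Recursive cepstral mean normalization:
    mean_n = alpha * c_n + (1 - alpha) * mean_{n-1}, output c_n - mean_n."""
    def __init__(self, alpha=0.003):
        self.alpha = alpha
        self.mean = None

    def __call__(self, c):
        if self.mean is None:
            self.mean = c.copy()                   # bootstrap with the first frame
        self.mean = self.alpha * c + (1 - self.alpha) * self.mean
        return c - self.mean

# Usage: normalize a stream of MFCC frames one at a time
cmn = OnlineCMN()
normalized = [cmn(frame) for frame in np.random.randn(100, 13)]
```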

Page 35

Training – GMM-UBM

Training: speech DB -> feature extraction -> GMM modeling -> UBM -> MAP adaptation with pre-recorded data for each enrolled speaker (Spk1, Spk2, Spk3) -> speaker models (Model1, Model2, Model3)

Identification: unknown utterance -> VAD -> MFCC -> feature buffer -> obtain the top-N mixtures from the UBM -> log-likelihood of the top-N mixtures under each speaker model -> obtain the maximum score -> identified speaker ID
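A sketch of the top-N trick mentioned on the Problems slide: for each frame the UBM's best N mixtures are found, and only those components are evaluated in each speaker model. The dictionary-of-arrays model layout and diagonal covariances are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_component_probs(means, covs, weights, frame):
    """Per-component weighted log densities of one feature frame under a
    diagonal-covariance GMM."""
    return np.array([
        np.log(w) + multivariate_normal.logpdf(frame, mean=m, cov=np.diag(c))
        for m, c, w in zip(means, covs, weights)])

def identify_speaker(ubm, speakers, frames, N=5):
    """GMM-UBM identification with top-N mixture scoring.
    `ubm` and each entry of `speakers` are dicts with 'means', 'covs',
    'weights'; speaker models are assumed MAP-adapted from the UBM so that
    mixture indices correspond."""
    scores = {name: 0.0 for name in speakers}
    for frame in frames:
        ubm_lp = log_component_probs(ubm['means'], ubm['covs'], ubm['weights'], frame)
        top = np.argsort(ubm_lp)[-N:]                      # indices of the top-N mixtures
        for name, spk in speakers.items():
            lp = log_component_probs(spk['means'][top], spk['covs'][top],
                                     spk['weights'][top], frame)
            scores[name] += np.logaddexp.reduce(lp)        # frame log-likelihood
    return max(scores, key=scores.get)                     # identified speaker ID
```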

Page 36

Final Demonstration

Page 37

Contents

Overview: Research Overview

Group 1: Multiple User Localization

Group 2: Active Speaker Detection and Beamforming

Group 3: Speaker Identification

Discussion

Page 38

Discussion

A/V localization and speaker identification: who is where?

Utilizes the depth video stream

Is robust to environmental effects

Can be useful for multi-user applications

User friendly

Possible future work

System-level optimization

Audio and video signal processing

Applications: speech recognition, A/V communications

Page 39

Thank you!