dominant speaker identification for multipoint ... · the conference is administrated through an...

Dominant Speaker

Identification

for Multipoint

Videoconferencing

Under the supervision of

Prof. Israel Cohen

Ilana Volfin

Outline

Videoconferencing – an introduction

Discussed realization

Proposed method

Speech activity score evaluation

— Single observation

— Sequence of observations

SNR estimation

Experimental Results

Outline

Videoconferencing

Used for :

Remote education

Medical consulting

Business meetings

Personal communications

Introduction

Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results

Videoconferencing

History: [1]

Introduction


AT&T ‘Picturephone’

Not commercialized

1982 1964 1986

Compression Labs. System cost: 250,000$

Line cost per hour : 1,000$

PictureTel System cost: 80,000$

Line cost per hour : 100$

1991

System cost: 20,000$

Line cost per hour : 30$

Today

[1] Sprey, “Videoconferencing as a Communication Tool”,TRANSACTIONS ON PROFESSIONAL COMMUNICATION, IEEE, 1997

http://www.skype.com/intl/en-us/get-skype/

What is a Multipoint Conference?

3 or more participants

Each at his own location

Single microphone and video camera

The information is administered

through a central control unit

Introduction


0 5 10 15 20 25-0.5

0

0.5

0 5 10 15 20 25-0.1

0

0.1

0 5 10 15 20 25-0.5

0

0.5

[sec]

Ch 1

Ch 2

Ch 3

Central control unit - MCU

Multipoint Control Unit

The conference is administrated through an MCU:

MCU

Participant 1

Participant 2

Participant N

Introduction





Participant 1

Participant 2

Participant N

Introduction


decoding

mixing

encoding

Heavy processing!

processing




Participant 1

Participant 2

Participant N

Introduction


Reduce the amount of information to improve conference quality

decoding

mixing

encoding

processing

Speaker selection

Select M most active participants

Discard all remaining information

Guidelines [2]

The selection process should not introduce audio artifacts

Transparent to the participants

Resistance to noise

Lack of discrimination


Introduction

[2] P. J. Smith, “Voice conferencing over IP networks" , Master's thesis, Department of Electrical and Computer Engineering, McGill

University, Montreal, Canada, 2002.

Speaker selection - Challenges

Equipment

Low quality sensors lower SNR

Speakers produce crosstalk

Surrounding

General noises

Transient noises

Personal characteristics

Loud/Quiet participants


Introduction

Classic Voice Activity Detection (VAD)

Binary classification problem

Classify each signal frame into either speech or non-

speech → ‘High resolution’ decision

— Representation

Use a speech specific representation, to maximize

discrimination ability

— Classification method

Thresholding

Machine learning techniques


Introduction

VAD and Speaker Selection

Speaker Selection is a generalized task of VAD

For each signal frame in each channel :

— Determine whether it contains prolonged speech

activity or not

‘Lower resolution’ decision is needed

— Non-speech frames may belong to a speech burst too


Introduction

Speaker Selection Methods

Yu et al. Computers and communications ISCC 1998

“Linear PCM signal processing unit in multi-point video conferencing system”

— The channel with the biggest power is determined dominant

— To prevent frequent change of current speaker the decision is

made every 1 or 2 seconds

Chang, Lucent Technologies, 2001

“Multimedia conference call participant identification system and method “

— Speaking participants are identified as those who pass a volume

threshold

Kwak, Verizon Laboratories, 2002

“Speaker identifier for multi-party conference”

— Dominant speaker determined based on signal amplitude


Introduction

Speaker Selection Methods

Smith , MSc thesis, McGill University, 2002

“Voice conferencing over IP networks"

— Participants are ranked in the order of becoming active speakers

— A participant can move up in rankings if the smoothed power of

his signal is above a barge-in threshold

Xu et al. ICME IEEE, 2006

“Pass: Peer-aware silence suppression for internet voice conferences”

— Very advanced VAD for ranking

— Barge-in mechanism


Introduction

Summary of existing methods


Introduction

Method Provides

solution

for:

Stationary

Noise

Frequent

Switching

Transient

Noise

Yu 1998

(Power) - - -

Assume that

non-dominant

channels are

completely silent Chang 2001

(Volume

Threshold)

- - -

Kwak 2002 (Amplitude)

- - -

Smith 2002 (Barge In)

√ √ - Assume that

change in signal

power originates

from change in

speaker Xu 2006

(Barge In & advanced VAD)

√ √ -

Transient noise occurrences are not addressed


Dominant speaker identification (Speaker selection with M=1)

Only one participant appears on each screen

Most video information is discarded

Further traffic reduction by discarding some

audio

Conversation is focused on the dominant

speaker

Participant 1

Participant 2

Participant N

Dominant speaker

Previous Dominant speaker

Introduction ⋅ Discussed Realization⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results

Discussed Realization


N channels

Speech Burst – a speech event composed of three

sequential phases: initiation, steady state and termination.

Speaker Switch – the point where a change in

dominant speaker occurs

Objective:

Follow speech bursts

Detect speaker switches

Discussed Realization

Introduction ⋅ Discussed Realization⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results

5 10 15 20

-0.2

0

0.2

5 10 15 20

-0.05

0

0.05

5 10 15 20

-0.2

0

0.2

[sec]

Dominant speaker identification relying

on speech activity in time intervals of

different length

Proposed Method

Currently observed frame

Time interval of medium length

Long time interval


𝑡 now

Dominant speaker identification relying on speech

activity in time intervals of different length

Two Stages:

Local processing

— Individual processing on each channel

— Speech activity score evaluation for the immediate,

medium and long time-intervals

Global decision

— Speech activity score comparison across channels

Proposed Method


Local processing

Channel 𝑖, time frame 𝑙:

Proposed Method

Speech

detection in

sub-bands

Immediate

time Speech

activity

evaluation

Medium

interval Speech

activity

evaluation

Long interval Speech

activity

evaluation

Φ𝑖𝑚𝑚𝑒𝑑

𝑖 (𝑙)

Φ𝑚𝑒𝑑𝑖𝑢𝑚

𝑖 (𝑙)

Φ𝑙𝑜𝑛𝑔

𝑖 (𝑙)

Speech activity Evaluation Score Evaluation


Speech activity evaluation on time

intervals of three characteristic

lengths

Three sequential steps

The input into each step consists of

smaller sub-units, of length

characteristic to the previous step

Activity is determined by the

number of active sub-units

Activity in a sub-unit is determined

by thresholding

Local processing


Proposed Method

𝑙

𝑘1

𝑘1+𝑁1

• Time-frequency representation

• 𝑙 – time index

• 𝑘1, 𝑘1+𝑁1 – frequency range of voiced speech


Local processing



Proposed Method

> 𝑡ℎ1

Binary vector

Σ

𝑎1 𝑙

𝑎1 𝑙 𝑎1 𝑙 − 1 ⋅⋅⋅ 𝑎1 𝑙 − 𝑁2 + 1


Local processing



Proposed Method

𝑎1 𝑙

𝑎1 𝑙 − 1 𝑎1 𝑙

𝑎1 𝑙 − 𝑁2 + 1

𝛼𝑙 =


Local processing



Proposed Method

𝑎1 𝑙

𝛼𝑙 =

> 𝑡ℎ2

Binary vector

Σ

𝑎2 𝑙


Local processing



Long time interval

Proposed Method

𝑎1 𝑙

𝛼𝑙 =

> 𝑡ℎ2

Binary vector

Σ

𝑎2 𝑙

𝑎2 𝑙 𝑎2 𝑙 − 𝑁2 + 1 𝑎2 𝑙 − 2𝑁2 + 1 𝑎2 𝑙 − 𝑁3 − 1 𝑁2 + 1 ⋅⋅⋅

𝑁2 𝑁2 𝑁2 ⋅⋅⋅


Local processing



Long time interval

Proposed Method

𝑎1 𝑙

𝑎2 𝑙

𝑎2 𝑙 − 𝑁2 + 1 𝑎2 𝑙

𝑎2 𝑙 − 𝑁3𝑁2 + 1

𝛽𝑙 =


Local processing



Long time interval

Proposed Method

𝑎1 𝑙

𝑎2 𝑙

𝛽𝑙 =

> 𝑡ℎ3 Σ

𝑎3 𝑙


Binary vector

Local processing



Long time interval

Proposed Method

𝑎1 𝑙

𝑎2 𝑙

𝑎3 𝑙


Active sub-units & Transients



Long time interval

Why Thresholding and counting ?

Smoothing

Equalization

Noise spikes suppression

Proposed Method

𝑎1 𝑙

𝑎2 𝑙

𝑎3 𝑙


# of active

sub-bands

# of active

frames

# of active

blocks

Active sub-units & Transients



Long time interval

Discriminating isolated transients from fluent audio

activity:

Proposed Method

𝑎1 𝑙

𝑎2 𝑙

𝑎3 𝑙


# of active

sub-bands

# of active

frames

# of active

blocks

Binary vector

Immediate Binary vector

Medium

Binary vector

Long

Global Decision

Time frame 𝑙:

Proposed Method

a11(𝑙), a2

1 𝑙 , a31(𝑙)

a12(𝑙), a2

2 𝑙 , a32(𝑙)

a1𝑁(𝑙), a2

𝑁 𝑙 , a3𝑁(𝑙)

Dominant

Speaker

Identification

Dominant

speaker in frame 𝒍


Global Decision

Time frame 𝑙:

Proposed Method

Φ𝑖𝑚𝑚𝑒𝑑1 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚

1 𝑙 , Φ𝑙𝑜𝑛𝑔1 (𝑙)

Φ𝑖𝑚𝑚𝑒𝑑2 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚

2 𝑙 , Φ𝑙𝑜𝑛𝑔2 (𝑙)

Φ𝑖𝑚𝑚𝑒𝑑𝑁 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚

𝑁 𝑙 , Φ𝑙𝑜𝑛𝑔𝑁 (𝑙)

Dominant

Speaker

Identification

Dominant

speaker in frame 𝒍


Dominant speaker identification

Activated once in a decision-interval

Objective:

Detect speaker switch events

Method:

Compare speech activity scores across channels

Each channel is evaluated in respect to:

— the dominant channel

— minimal activity level

Proposed Method

Current dominant speaker remains unless

activity on one of the other channels

justifies a speaker switch


Score Evaluation

Single

Score computed on a

single observation

Each observation is

analyzed independently

More responsive to changes

Sequential

Score computed on a

sequence of observations

Exploits natural temporal

dependence in speech

More Smooth

Score Evaluation


Modeling the number of active sub-units

𝒂

Two Hypotheses: 𝐻1 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑝𝑟𝑒𝑠𝑒𝑛𝑡𝐻0 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑎𝑏𝑠𝑒𝑛𝑡

𝑯𝟏

𝑯𝟎

Single

Sequential

OR

Score Evaluation

Likelihood Models

𝑁 – number of sub-units

𝑎 – number of active sub-units

𝑝 𝑎|𝐻1 = Bin 𝑁, 𝑝 =𝑁𝑎

𝑝𝑎 1 − 𝑝 𝑁−𝑎

𝑝 𝑎|𝐻0 = 𝐸𝑥𝑝 𝜆 = 𝜆𝑒−𝜆𝑎


Score based on a single observation

Λ =𝑝 𝑎|𝐻1

𝑝 𝑎|𝐻0 𝚽 = 𝑙𝑜𝑔𝜦

speech

activity score

Φ = log𝑁𝑎

+ 𝑎 log(𝑝) + 𝑁 − 𝑎 log 1 − 𝑝 − log 𝜆 + 𝜆𝑎

Likelihood Ratio log-Likelihood Ratio

Score Evaluation - Single


Score based on a sequence of observations

The score is computed on a sequence of

observations

𝑿𝑙 = 𝑎 𝑙 − 𝑀 + 1 ,… , 𝑎 𝑙

Observations are not independent

A speech frame is more likely to be followed by

another speech frame

𝑝 𝑞𝑛 = 𝐻1 𝑞𝑛−1 = 𝐻1 > 𝑝 𝑞𝑛 = 𝐻1 [4]

First order Markovian dependency in the transitions

between consecutive frames

Score Evaluation - Sequential


[4] J. Sohn et al., “A statistical model-based voice activity detection”, Signal Processing Letters, IEEE, 1999

HMM of two states:

𝜁𝑙 = 𝐻1,(𝑙) 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑝𝑟𝑒𝑠𝑒𝑛𝑡 𝑖𝑛 𝑓𝑟𝑎𝑚𝑒 𝑙

𝐻0,(𝑙) 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑎𝑏𝑠𝑒𝑛𝑡 𝑖𝑛 𝑓𝑟𝑎𝑚𝑒 𝑙

State dynamics: 𝑏𝑖𝑗 = 𝑝 𝜁𝑙 = j 𝜁𝑙−1 = 𝑖 , 𝑖, 𝑗𝜖{0,1}

— 𝑏00 + 𝑏01 = 1, 𝑏10 + 𝑏11 = 1

Steady state probabilities

𝑝𝐻0 =𝑏10

𝑏10+𝑏01, 𝑝𝐻1 =

𝑏01

𝑏10+𝑏01

Current observation only depends on current

state: 𝑝 𝑋𝑙 𝜁𝑙 , 𝑿𝑙 = 𝑝 𝑋𝑙 𝜁𝑙




Denote: 𝛼𝑙 1 ≜ 𝑝 𝑿𝑙 , 𝐻1,(𝑙) and 𝛼𝑙 0 ≜ 𝑝 𝑿𝑙 , 𝐻0, 𝑙

Recursive formula (forward procedure) for 𝛼𝑙 𝑗 , 𝑗 ∈ 0,1

Λ𝑙𝑠𝑒𝑞

=𝑝 𝑿𝑙|𝐻1, 𝑙

𝑝 𝑿𝑙|𝐻0, 𝑙

Likelihood Ratio

Λ𝑙𝑠𝑒𝑞

=𝛼𝑙 1

𝛼𝑙 0⋅𝑝𝐻0

𝑝𝐻1

𝑝 𝑿𝑙 , 𝐻1, 𝑙 𝑝𝐻0

𝑝 𝑿𝑙 , 𝐻0, 𝑙 𝑝𝐻1

𝚽𝒍𝒔𝒆𝒒

= 𝐥𝐨𝐠𝜶𝒍 𝟏

𝜶𝒍 𝟎+ 𝐥𝐨𝐠

𝑝𝐻0

𝑝𝐻1



log-Likelihood Ratio

𝛼𝑞 𝑗 =

𝑝 𝑋𝑞 = 𝑎(𝑞)|𝐻𝑗,(𝑞) 𝛼𝑞−1 0 𝑏𝑗0 + 𝛼𝑞−1 1 𝑏𝑗1

𝑝 𝑋𝑞 = 𝑎 𝑞 |𝐻𝑗,(𝑞) 𝑝 𝐻𝑗

, 𝑙 − 𝑀 + 1 < 𝑞 < 𝑙 , 𝑞 = 𝑙 − 𝑀 + 1


Score Evaluation - Summary

Single Vs. Sequential


Denote: Γ 𝑛 =𝛼𝑛 1

𝛼𝑛(0) , Φ𝑠𝑒𝑞 = log Γ 𝑛 + log

𝑝𝐻0

𝑝𝐻1

Γ 𝑛 =𝛼𝑛 1

𝛼𝑛(0)=

𝑝 𝑎𝑛 𝐻1

𝑝 𝑎𝑛|𝐻0⋅𝛼𝑛−1 0 𝑏10 + 𝛼𝑛−1 1 𝑏11𝛼𝑛−1 0 𝑏00 + 𝛼𝑛−1 1 𝑏01

= Λ𝑛 ⋅𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01

Φ𝑠𝑒𝑞 = Φ𝑠𝑖𝑛𝑔𝑙𝑒 + log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01

+ log𝑝𝐻0

𝑝𝐻1





+ log𝑝𝐻0

𝑝𝐻1

In the presence of speech

Γ 𝑛 − 1 ≫ 0 ⇒ log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01

→ log𝑏11𝑏10

> 0

⇒ Φ𝑠𝑒𝑞> Φ𝑠𝑖𝑛𝑔𝑙𝑒





+ log𝑝𝐻0

𝑝𝐻1

In absence of speech

Γ 𝑛 − 1 ≪ 1 ⇒ log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01

→ log𝑏10𝑏00

< 0

⇒ Φ𝑠𝑒𝑞< Φ𝑠𝑖𝑛𝑔𝑙𝑒

Sequential processing improves separability

A-priori SNR Estimation

Let: 𝑦 𝑛 = 𝑥 𝑛 + 𝑑 𝑛𝑆𝑇𝐹𝑇

𝑌𝑙 = 𝑋𝑙 + 𝐷𝑙

𝜆𝑙 = 𝐸 𝑋𝑙2 − Spectral variance of speech

𝑋𝑙 is a complex vector 𝑋𝑙 = 𝐴𝑙𝑒𝑗𝜙𝑙

𝜆𝑙 = 𝐸 𝑋𝑙2 = 𝐸 𝐴𝑙

2

𝜆𝐷𝑙= 𝐸 𝐷𝑙

2 − Spectral variance of noise

SNR estimation

A-priori SNR : 𝜉𝑙 =𝜆𝑙

𝜆𝐷𝑙

A-priori SNR : 𝜉 𝑙|𝑙 =𝜆 𝑙|𝑙

𝜆𝐷𝑙


The estimator is derived in two steps: [5]

1. Propagation Step

Assuming we have all information up to frame 𝑙 − 1

Obtain one frame ahead conditional variance 𝜆 𝑙|𝑙−1

2. Update Step

Update the estimator using information from current

frame, 𝑙

Obtain 𝜆 𝑙|𝑙


SNR estimation


[5] I. Cohen, “Relaxed statistical model for speech enhancement and a priori snr estimation“, Speech and Audio Processing, IEEE

Transactions on, 2005.

The estimator is derived in two steps :

1. Propagation Step

Assuming we have all information up to frame 𝑙 − 1

The conditional variance of speech is assumed to

propagate as a GARCH(1,1) model:

𝜆𝑙|𝑙−1 = 𝜆𝑚𝑖𝑛 + 𝜇 𝑋𝑙−12 + 𝛿 𝜆𝑙−1|𝑙−2 + 𝜆𝑚𝑖𝑛

𝜆𝑚𝑖𝑛 > 0, 𝜇 ≥ 0, 𝛿 ≥ 0, 𝜇 + 𝛿 < 1


SNR estimation

𝜆 𝑙|𝑙−1 = 𝐸 𝜆𝑙|𝑙−1|𝐴 𝑙−1, 𝜆 𝑙−1|𝑙−2= 𝜆𝑚𝑖𝑛 + 𝜇𝐴 𝑙−1

2 + 𝛿 𝜆 𝑙−1|𝑙−2 + 𝜆𝑚𝑖𝑛


A-priori SNR estimation

2. Update Step:

Let: 𝑦 𝑛 = 𝑥 𝑛 + 𝑑 𝑛𝑆𝑇𝐹𝑇

𝑌𝑙 = 𝑋𝑙 + 𝐷𝑙

Spectral enhancement 𝑋 𝑙 = 𝐺𝑌𝑙

𝐺 – spectral gain function

𝐺 is chosen as a minimizer of a certain distortion

measure

SNR estimation


A-priori SNR estimation

2. Update Step:

We chose 𝐺𝑆𝑃, the minimizer of the power distortion

measure

𝑑𝑆𝑃 = 𝐴𝑙2 − 𝐴 𝑙

2 2

𝛾𝑙 =𝑌𝑙

2

𝜆𝐷𝑙

- a posteriori SNR

𝜆 𝑙|𝑙−1 - one frame ahead conditional variance (speech)

The resulting spectral gain function is :

𝐺𝑆𝑃 𝜆 𝑙|𝑙−1, 𝛾𝑙 =𝜆 𝑙|𝑙−1

𝜆𝐷𝑙+𝜆 𝑙|𝑙−1

1𝛾𝑙+

𝜆 𝑙|𝑙−1


SNR estimation


Experimental results

Three Experiments:

1. Identification of the dominant speaker

2. Robustness to transient audio occurrences

3. Results on a real multipoint conference




The proposed method was compared to:

The speaker with the highest VAD score in the

decision-interval is selected

1. RAMIREZ [6]

2. SOHN [4]

3. GARCH [7]

4. The speaker with the highest SNR is selected

5. The speaker with the highest signal POWER is

selected



[6] Ramirez et al. “Statistical voice activity detection using a multiple observation likelihood ratio test”, Signal Processing Letters, IEEE,2005

[7] S. Mousazadeh and I. Cohen, “Ar-garch in presence of noise: Parameter estimation and its application to voice activity detection“,

Audio, Speech, and Language Processing, IEEE Transactions on, 2011

Performance Evaluation

Quantitative Evaluation:

False Speaker Switches #

Mid Sentence Clipping (MC) – percent of

undetected mid section of a speech burst

MC


true detected


Experiment #1

Signals:

One or more concatenated TiMit sentences of

same speaker (with added white noise)

The non silent part of the sentence is expected

to be detected as a continuous speech burst



Experiment #1

Proposed method Sohn VAD based method

Decision interval = 0.1 sec



Experiment #1

False Speaker Switches Mid Sentence Clipping


0 0.1 0.2 0.3 0.4 0.5 0.6 0.70

0.5

1

1.5

2

2.5

3Mid Sentence Clipping

Decision interval [sec]

%

Single

POWER

SNR

RAMIREZ

SOHN

GARCH

Sequential

0 0.1 0.2 0.3 0.4 0.5 0.6 0.70

10

20

30

40

50

60

70

80

90


Fals

e c

am

era

sw

itches

Single

POWER

SNR

RAMIREZ

SOHN

GARCH

Sequential


Experiment #2

Signals:

One or more concatenated TiMit sentences of

same speaker (with added white noise)

A sound of sneezing was added to channel 2

A sound of door knocks was added to channel 3



Experiment #2

ch 1

ch 2

0 5 10 15 20 25

[sec]

ch 3

sneezing

knocks



Experiment #2



0 0.1 0.2 0.3 0.4 0.5 0.6 0.70

0.5

1

1.5

2

2.5

3

3.5


%

Single

POWER

SNR

RAMIREZ

SOHN

GARCH

Sequential

0 0.1 0.2 0.3 0.4 0.5 0.6 0.70

10

20

30

40

50

60

70

80

90


Fals

e c

am

era

sw

itches

Single

POWER

SNR

RAMIREZ

SOHN

GARCH

Sequential


Experiment #2



Single Sequential



Experiment #2

Proposed Method GARCH VAD based method




Red – hand labeled

Black – Algorithm’s result

Experiment #2

Proposed Method POWER based method






Experiment #3

Signals:

Real Multi-channel conversation

5 channels

Channels 2 & 4 – clean speech

Channel 1 – mostly noise

Channels 3 and 5 - crosstalk



Experiment #3

Proposed Method POWER based method






Summary

Novel method for dominant speaker

identification

Two approaches to speech activity score

evaluation

Experimental framework

Less false speaker switches

Robustness to transient audio occurrences

Summary

Future Research

Non causal processing

Temporally variable thresholds

Preprocessing

Speech enhancement

Echo detection (or cancellation)

Future Research

Thank You

Thank You

dominant speaker identification for multipoint ... · the conference is administrated through an...

Documents