dominant speaker identification for multipoint ... · the conference is administrated through an...
TRANSCRIPT
![Page 1: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/1.jpg)
Dominant Speaker
Identification
for Multipoint
Videoconferencing
Under the supervision of
Prof. Israel Cohen
Ilana Volfin
![Page 2: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/2.jpg)
Outline
Videoconferencing – an introduction
Discussed realization
Proposed method
Speech activity score evaluation
— Single observation
— Sequence of observations
SNR estimation
Experimental Results
Outline
![Page 3: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/3.jpg)
Videoconferencing
Used for :
Remote education
Medical consulting
Business meetings
Personal communications
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 4: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/4.jpg)
Videoconferencing
History: [1]
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
AT&T ‘Picturephone’
Not commercialized
1982 1964 1986
Compression Labs. System cost: 250,000$
Line cost per hour : 1,000$
PictureTel System cost: 80,000$
Line cost per hour : 100$
1991
System cost: 20,000$
Line cost per hour : 30$
Today
[1] Sprey, “Videoconferencing as a Communication Tool”,TRANSACTIONS ON PROFESSIONAL COMMUNICATION, IEEE, 1997
![Page 5: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/5.jpg)
What is a Multipoint Conference?
3 or more participants
Each at his own location
Single microphone and video camera
The information is administered
through a central control unit
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
0 5 10 15 20 25-0.5
0
0.5
0 5 10 15 20 25-0.1
0
0.1
0 5 10 15 20 25-0.5
0
0.5
[sec]
Ch 1
Ch 2
Ch 3
![Page 6: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/6.jpg)
Central control unit - MCU
Multipoint Control Unit
The conference is administrated through an MCU:
MCU
Participant 1
Participant 2
Participant N
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 7: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/7.jpg)
Central control unit - MCU
Multipoint Control Unit
The conference is administrated through an MCU:
Participant 1
Participant 2
Participant N
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
decoding
mixing
encoding
Heavy processing!
processing
![Page 8: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/8.jpg)
Central control unit - MCU
Multipoint Control Unit
The conference is administrated through an MCU:
Participant 1
Participant 2
Participant N
Introduction
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Reduce the amount of information to improve conference quality
decoding
mixing
encoding
processing
![Page 9: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/9.jpg)
Speaker selection
Select M most active participants
Discard all remaining information
Guidelines [2]
The selection process should not introduce audio artifacts
Transparent to the participants
Resistance to noise
Lack of discrimination
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
[2] P. J. Smith, “Voice conferencing over IP networks" , Master's thesis, Department of Electrical and Computer Engineering, McGill
University, Montreal, Canada, 2002.
![Page 10: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/10.jpg)
Speaker selection - Challenges
Equipment
Low quality sensors lower SNR
Speakers produce crosstalk
Surrounding
General noises
Transient noises
Personal characteristics
Loud/Quiet participants
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
![Page 11: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/11.jpg)
Classic Voice Activity Detection (VAD)
Binary classification problem
Classify each signal frame into either speech or non-
speech → ‘High resolution’ decision
— Representation
Use a speech specific representation, to maximize
discrimination ability
— Classification method
Thresholding
Machine learning techniques
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
![Page 12: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/12.jpg)
VAD and Speaker Selection
Speaker Selection is a generalized task of VAD
For each signal frame in each channel :
— Determine whether it contains prolonged speech
activity or not
‘Lower resolution’ decision is needed
— Non-speech frames may belong to a speech burst too
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
![Page 13: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/13.jpg)
Speaker Selection Methods
Yu et al. Computers and communications ISCC 1998
“Linear PCM signal processing unit in multi-point video conferencing system”
— The channel with the biggest power is determined dominant
— To prevent frequent change of current speaker the decision is
made every 1 or 2 seconds
Chang, Lucent Technologies, 2001
“Multimedia conference call participant identification system and method “
— Speaking participants are identified as those who pass a volume
threshold
Kwak, Verizon Laboratories, 2002
“Speaker identifier for multi-party conference”
— Dominant speaker determined based on signal amplitude
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
![Page 14: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/14.jpg)
Speaker Selection Methods
Smith , MSc thesis, McGill University, 2002
“Voice conferencing over IP networks"
— Participants are ranked in the order of becoming active speakers
— A participant can move up in rankings if the smoothed power of
his signal is above a barge-in threshold
Xu et al. ICME IEEE, 2006
“Pass: Peer-aware silence suppression for internet voice conferences”
— Very advanced VAD for ranking
— Barge-in mechanism
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
![Page 15: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/15.jpg)
Summary of existing methods
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Introduction
Method Provides
solution
for:
Stationary
Noise
Frequent
Switching
Transient
Noise
Yu 1998
(Power) - - -
Assume that
non-dominant
channels are
completely silent Chang 2001
(Volume
Threshold)
- - -
Kwak 2002 (Amplitude)
- - -
Smith 2002 (Barge In)
√ √ - Assume that
change in signal
power originates
from change in
speaker Xu 2006
(Barge In & advanced VAD)
√ √ -
Transient noise occurrences are not addressed
![Page 16: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/16.jpg)
Discussed realization
Dominant speaker identification (Speaker selection with M=1)
Only one participant appears on each screen
Most video information is discarded
Further traffic reduction by discarding some
audio
Conversation is focused on the dominant
speaker
Participant 1
Participant 2
Participant N
Dominant speaker
Previous Dominant speaker
Introduction ⋅ Discussed Realization⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Discussed Realization
![Page 17: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/17.jpg)
Discussed realization
N channels
Speech Burst – a speech event composed of three
sequential phases: initiation, steady state and termination.
Speaker Switch – the point where a change in
dominant speaker occurs
Objective:
Follow speech bursts
Detect speaker switches
Discussed Realization
Introduction ⋅ Discussed Realization⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
5 10 15 20
-0.2
0
0.2
5 10 15 20
-0.05
0
0.05
5 10 15 20
-0.2
0
0.2
[sec]
![Page 18: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/18.jpg)
Dominant speaker identification relying
on speech activity in time intervals of
different length
Proposed Method
Currently observed frame
Time interval of medium length
Long time interval
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
𝑡 now
![Page 19: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/19.jpg)
Dominant speaker identification relying on speech
activity in time intervals of different length
Two Stages:
Local processing
— Individual processing on each channel
— Speech activity score evaluation for the immediate,
medium and long time-intervals
Global decision
— Speech activity score comparison across channels
Proposed Method
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 20: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/20.jpg)
Local processing
Channel 𝑖, time frame 𝑙:
Proposed Method
Speech
detection in
sub-bands
Immediate
time Speech
activity
evaluation
Medium
interval Speech
activity
evaluation
Long interval Speech
activity
evaluation
Φ𝑖𝑚𝑚𝑒𝑑
𝑖 (𝑙)
Φ𝑚𝑒𝑑𝑖𝑢𝑚
𝑖 (𝑙)
Φ𝑙𝑜𝑛𝑔
𝑖 (𝑙)
Speech activity Evaluation Score Evaluation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Speech activity evaluation on time
intervals of three characteristic
lengths
Three sequential steps
The input into each step consists of
smaller sub-units, of length
characteristic to the previous step
Activity is determined by the
number of active sub-units
Activity in a sub-unit is determined
by thresholding
![Page 21: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/21.jpg)
Local processing
Currently observed frame
Proposed Method
𝑙
𝑘1
𝑘1+𝑁1
• Time-frequency representation
• 𝑙 – time index
• 𝑘1, 𝑘1+𝑁1 – frequency range of voiced speech
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 22: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/22.jpg)
Local processing
Currently observed frame
Time interval of medium length
Proposed Method
> 𝑡ℎ1
Binary vector
Σ
𝑎1 𝑙
𝑎1 𝑙 𝑎1 𝑙 − 1 ⋅⋅⋅ 𝑎1 𝑙 − 𝑁2 + 1
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 23: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/23.jpg)
Local processing
Currently observed frame
Time interval of medium length
Proposed Method
𝑎1 𝑙
𝑎1 𝑙 − 1 𝑎1 𝑙
𝑎1 𝑙 − 𝑁2 + 1
𝛼𝑙 =
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 24: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/24.jpg)
Local processing
Currently observed frame
Time interval of medium length
Proposed Method
𝑎1 𝑙
𝛼𝑙 =
> 𝑡ℎ2
Binary vector
Σ
𝑎2 𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 25: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/25.jpg)
Local processing
Currently observed frame
Time interval of medium length
Long time interval
Proposed Method
𝑎1 𝑙
𝛼𝑙 =
> 𝑡ℎ2
Binary vector
Σ
𝑎2 𝑙
𝑎2 𝑙 𝑎2 𝑙 − 𝑁2 + 1 𝑎2 𝑙 − 2𝑁2 + 1 𝑎2 𝑙 − 𝑁3 − 1 𝑁2 + 1 ⋅⋅⋅
𝑁2 𝑁2 𝑁2 ⋅⋅⋅
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 26: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/26.jpg)
Local processing
Currently observed frame
Time interval of medium length
Long time interval
Proposed Method
𝑎1 𝑙
𝑎2 𝑙
𝑎2 𝑙 − 𝑁2 + 1 𝑎2 𝑙
𝑎2 𝑙 − 𝑁3𝑁2 + 1
𝛽𝑙 =
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 27: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/27.jpg)
Local processing
Currently observed frame
Time interval of medium length
Long time interval
Proposed Method
𝑎1 𝑙
𝑎2 𝑙
𝛽𝑙 =
> 𝑡ℎ3 Σ
𝑎3 𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Binary vector
![Page 28: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/28.jpg)
Local processing
Currently observed frame
Time interval of medium length
Long time interval
Proposed Method
𝑎1 𝑙
𝑎2 𝑙
𝑎3 𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 29: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/29.jpg)
Active sub-units & Transients
Currently observed frame
Time interval of medium length
Long time interval
Why Thresholding and counting ?
Smoothing
Equalization
Noise spikes suppression
Proposed Method
𝑎1 𝑙
𝑎2 𝑙
𝑎3 𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
# of active
sub-bands
# of active
frames
# of active
blocks
![Page 30: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/30.jpg)
Active sub-units & Transients
Currently observed frame
Time interval of medium length
Long time interval
Discriminating isolated transients from fluent audio
activity:
Proposed Method
𝑎1 𝑙
𝑎2 𝑙
𝑎3 𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
# of active
sub-bands
# of active
frames
# of active
blocks
Binary vector
Immediate Binary vector
Medium
Binary vector
Long
![Page 31: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/31.jpg)
Global Decision
Time frame 𝑙:
Proposed Method
a11(𝑙), a2
1 𝑙 , a31(𝑙)
a12(𝑙), a2
2 𝑙 , a32(𝑙)
a1𝑁(𝑙), a2
𝑁 𝑙 , a3𝑁(𝑙)
Dominant
Speaker
Identification
Dominant
speaker in frame 𝒍
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 32: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/32.jpg)
Global Decision
Time frame 𝑙:
Proposed Method
Φ𝑖𝑚𝑚𝑒𝑑1 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚
1 𝑙 , Φ𝑙𝑜𝑛𝑔1 (𝑙)
Φ𝑖𝑚𝑚𝑒𝑑2 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚
2 𝑙 , Φ𝑙𝑜𝑛𝑔2 (𝑙)
Φ𝑖𝑚𝑚𝑒𝑑𝑁 (𝑙), Φ𝑚𝑒𝑑𝑖𝑢𝑚
𝑁 𝑙 , Φ𝑙𝑜𝑛𝑔𝑁 (𝑙)
Dominant
Speaker
Identification
Dominant
speaker in frame 𝒍
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 33: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/33.jpg)
Dominant speaker identification
Activated once in a decision-interval
Objective:
Detect speaker switch events
Method:
Compare speech activity scores across channels
Each channel is evaluated in respect to:
— the dominant channel
— minimal activity level
Proposed Method
Current dominant speaker remains unless
activity on one of the other channels
justifies a speaker switch
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 34: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/34.jpg)
Score Evaluation
Single
Score computed on a
single observation
Each observation is
analyzed independently
More responsive to changes
Sequential
Score computed on a
sequence of observations
Exploits natural temporal
dependence in speech
More Smooth
Score Evaluation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 35: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/35.jpg)
Modeling the number of active sub-units
𝒂
Two Hypotheses: 𝐻1 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑝𝑟𝑒𝑠𝑒𝑛𝑡𝐻0 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑎𝑏𝑠𝑒𝑛𝑡
𝑯𝟏
𝑯𝟎
Single
Sequential
OR
Score Evaluation
Likelihood Models
𝑁 – number of sub-units
𝑎 – number of active sub-units
𝑝 𝑎|𝐻1 = Bin 𝑁, 𝑝 =𝑁𝑎
𝑝𝑎 1 − 𝑝 𝑁−𝑎
𝑝 𝑎|𝐻0 = 𝐸𝑥𝑝 𝜆 = 𝜆𝑒−𝜆𝑎
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 36: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/36.jpg)
Score based on a single observation
Λ =𝑝 𝑎|𝐻1
𝑝 𝑎|𝐻0 𝚽 = 𝑙𝑜𝑔𝜦
speech
activity score
Φ = log𝑁𝑎
+ 𝑎 log(𝑝) + 𝑁 − 𝑎 log 1 − 𝑝 − log 𝜆 + 𝜆𝑎
Likelihood Ratio log-Likelihood Ratio
Score Evaluation - Single
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 37: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/37.jpg)
Score based on a sequence of observations
The score is computed on a sequence of
observations
𝑿𝑙 = 𝑎 𝑙 − 𝑀 + 1 ,… , 𝑎 𝑙
Observations are not independent
A speech frame is more likely to be followed by
another speech frame
𝑝 𝑞𝑛 = 𝐻1 𝑞𝑛−1 = 𝐻1 > 𝑝 𝑞𝑛 = 𝐻1 [4]
First order Markovian dependency in the transitions
between consecutive frames
Score Evaluation - Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
[4] J. Sohn et al., “A statistical model-based voice activity detection”, Signal Processing Letters, IEEE, 1999
![Page 38: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/38.jpg)
HMM of two states:
𝜁𝑙 = 𝐻1,(𝑙) 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑝𝑟𝑒𝑠𝑒𝑛𝑡 𝑖𝑛 𝑓𝑟𝑎𝑚𝑒 𝑙
𝐻0,(𝑙) 𝑠𝑝𝑒𝑒𝑐ℎ 𝑖𝑠 𝑎𝑏𝑠𝑒𝑛𝑡 𝑖𝑛 𝑓𝑟𝑎𝑚𝑒 𝑙
State dynamics: 𝑏𝑖𝑗 = 𝑝 𝜁𝑙 = j 𝜁𝑙−1 = 𝑖 , 𝑖, 𝑗𝜖{0,1}
— 𝑏00 + 𝑏01 = 1, 𝑏10 + 𝑏11 = 1
Steady state probabilities
𝑝𝐻0 =𝑏10
𝑏10+𝑏01, 𝑝𝐻1 =
𝑏01
𝑏10+𝑏01
Current observation only depends on current
state: 𝑝 𝑋𝑙 𝜁𝑙 , 𝑿𝑙 = 𝑝 𝑋𝑙 𝜁𝑙
Score based on a sequence of observations
Score Evaluation - Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 39: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/39.jpg)
Denote: 𝛼𝑙 1 ≜ 𝑝 𝑿𝑙 , 𝐻1,(𝑙) and 𝛼𝑙 0 ≜ 𝑝 𝑿𝑙 , 𝐻0, 𝑙
Recursive formula (forward procedure) for 𝛼𝑙 𝑗 , 𝑗 ∈ 0,1
Λ𝑙𝑠𝑒𝑞
=𝑝 𝑿𝑙|𝐻1, 𝑙
𝑝 𝑿𝑙|𝐻0, 𝑙
Likelihood Ratio
Λ𝑙𝑠𝑒𝑞
=𝛼𝑙 1
𝛼𝑙 0⋅𝑝𝐻0
𝑝𝐻1
𝑝 𝑿𝑙 , 𝐻1, 𝑙 𝑝𝐻0
𝑝 𝑿𝑙 , 𝐻0, 𝑙 𝑝𝐻1
𝚽𝒍𝒔𝒆𝒒
= 𝐥𝐨𝐠𝜶𝒍 𝟏
𝜶𝒍 𝟎+ 𝐥𝐨𝐠
𝑝𝐻0
𝑝𝐻1
Score Evaluation - Sequential
Score based on a sequence of observations
log-Likelihood Ratio
𝛼𝑞 𝑗 =
𝑝 𝑋𝑞 = 𝑎(𝑞)|𝐻𝑗,(𝑞) 𝛼𝑞−1 0 𝑏𝑗0 + 𝛼𝑞−1 1 𝑏𝑗1
𝑝 𝑋𝑞 = 𝑎 𝑞 |𝐻𝑗,(𝑞) 𝑝 𝐻𝑗
, 𝑙 − 𝑀 + 1 < 𝑞 < 𝑙 , 𝑞 = 𝑙 − 𝑀 + 1
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 40: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/40.jpg)
Score Evaluation - Summary
Single Vs. Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Denote: Γ 𝑛 =𝛼𝑛 1
𝛼𝑛(0) , Φ𝑠𝑒𝑞 = log Γ 𝑛 + log
𝑝𝐻0
𝑝𝐻1
Γ 𝑛 =𝛼𝑛 1
𝛼𝑛(0)=
𝑝 𝑎𝑛 𝐻1
𝑝 𝑎𝑛|𝐻0⋅𝛼𝑛−1 0 𝑏10 + 𝛼𝑛−1 1 𝑏11𝛼𝑛−1 0 𝑏00 + 𝛼𝑛−1 1 𝑏01
= Λ𝑛 ⋅𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
Φ𝑠𝑒𝑞 = Φ𝑠𝑖𝑛𝑔𝑙𝑒 + log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
+ log𝑝𝐻0
𝑝𝐻1
![Page 41: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/41.jpg)
Score Evaluation - Summary
Single Vs. Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Φ𝑠𝑒𝑞 = Φ𝑠𝑖𝑛𝑔𝑙𝑒 + log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
+ log𝑝𝐻0
𝑝𝐻1
In the presence of speech
Γ 𝑛 − 1 ≫ 0 ⇒ log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
→ log𝑏11𝑏10
> 0
⇒ Φ𝑠𝑒𝑞> Φ𝑠𝑖𝑛𝑔𝑙𝑒
![Page 42: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/42.jpg)
Score Evaluation - Summary
Single Vs. Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Φ𝑠𝑒𝑞 = Φ𝑠𝑖𝑛𝑔𝑙𝑒 + log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
+ log𝑝𝐻0
𝑝𝐻1
In absence of speech
Γ 𝑛 − 1 ≪ 1 ⇒ log𝑏10 + Γ 𝑛 − 1 𝑏11𝑏00 + Γ 𝑛 − 1 𝑏01
→ log𝑏10𝑏00
< 0
⇒ Φ𝑠𝑒𝑞< Φ𝑠𝑖𝑛𝑔𝑙𝑒
Sequential processing improves separability
![Page 43: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/43.jpg)
A-priori SNR Estimation
Let: 𝑦 𝑛 = 𝑥 𝑛 + 𝑑 𝑛𝑆𝑇𝐹𝑇
𝑌𝑙 = 𝑋𝑙 + 𝐷𝑙
𝜆𝑙 = 𝐸 𝑋𝑙2 − Spectral variance of speech
𝑋𝑙 is a complex vector 𝑋𝑙 = 𝐴𝑙𝑒𝑗𝜙𝑙
𝜆𝑙 = 𝐸 𝑋𝑙2 = 𝐸 𝐴𝑙
2
𝜆𝐷𝑙= 𝐸 𝐷𝑙
2 − Spectral variance of noise
SNR estimation
A-priori SNR : 𝜉𝑙 =𝜆𝑙
𝜆𝐷𝑙
A-priori SNR : 𝜉 𝑙|𝑙 =𝜆 𝑙|𝑙
𝜆𝐷𝑙
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 44: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/44.jpg)
The estimator is derived in two steps: [5]
1. Propagation Step
Assuming we have all information up to frame 𝑙 − 1
Obtain one frame ahead conditional variance 𝜆 𝑙|𝑙−1
2. Update Step
Update the estimator using information from current
frame, 𝑙
Obtain 𝜆 𝑙|𝑙
A-priori SNR Estimation
SNR estimation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
[5] I. Cohen, “Relaxed statistical model for speech enhancement and a priori snr estimation“, Speech and Audio Processing, IEEE
Transactions on, 2005.
![Page 45: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/45.jpg)
The estimator is derived in two steps :
1. Propagation Step
Assuming we have all information up to frame 𝑙 − 1
The conditional variance of speech is assumed to
propagate as a GARCH(1,1) model:
𝜆𝑙|𝑙−1 = 𝜆𝑚𝑖𝑛 + 𝜇 𝑋𝑙−12 + 𝛿 𝜆𝑙−1|𝑙−2 + 𝜆𝑚𝑖𝑛
𝜆𝑚𝑖𝑛 > 0, 𝜇 ≥ 0, 𝛿 ≥ 0, 𝜇 + 𝛿 < 1
A-priori SNR Estimation
SNR estimation
𝜆 𝑙|𝑙−1 = 𝐸 𝜆𝑙|𝑙−1|𝐴 𝑙−1, 𝜆 𝑙−1|𝑙−2= 𝜆𝑚𝑖𝑛 + 𝜇𝐴 𝑙−1
2 + 𝛿 𝜆 𝑙−1|𝑙−2 + 𝜆𝑚𝑖𝑛
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 46: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/46.jpg)
The estimator is derived in two steps :
1. Propagation Step
2. Update Step
Update the estimator using information from current
frame, 𝑙
Obtain 𝜆 𝑙|𝑙
A-priori SNR Estimation
SNR estimation
𝜆 𝑙|𝑙−1 = 𝐸 𝜆𝑙|𝑙−1|𝐴 𝑙−1, 𝜆 𝑙−1|𝑙−2= 𝜆𝑚𝑖𝑛 + 𝜇𝐴 𝑙−1
2 + 𝛿 𝜆 𝑙−1|𝑙−2 + 𝜆𝑚𝑖𝑛
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 47: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/47.jpg)
A-priori SNR estimation
2. Update Step:
Let: 𝑦 𝑛 = 𝑥 𝑛 + 𝑑 𝑛𝑆𝑇𝐹𝑇
𝑌𝑙 = 𝑋𝑙 + 𝐷𝑙
Spectral enhancement 𝑋 𝑙 = 𝐺𝑌𝑙
𝐺 – spectral gain function
𝐺 is chosen as a minimizer of a certain distortion
measure
SNR estimation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 48: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/48.jpg)
A-priori SNR estimation
2. Update Step:
We chose 𝐺𝑆𝑃, the minimizer of the power distortion
measure
𝑑𝑆𝑃 = 𝐴𝑙2 − 𝐴 𝑙
2 2
𝛾𝑙 =𝑌𝑙
2
𝜆𝐷𝑙
- a posteriori SNR
𝜆 𝑙|𝑙−1 - one frame ahead conditional variance (speech)
The resulting spectral gain function is :
𝐺𝑆𝑃 𝜆 𝑙|𝑙−1, 𝛾𝑙 =𝜆 𝑙|𝑙−1
𝜆𝐷𝑙+𝜆 𝑙|𝑙−1
1𝛾𝑙+
𝜆 𝑙|𝑙−1
𝜆𝐷𝑙+𝜆 𝑙|𝑙−1
SNR estimation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 49: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/49.jpg)
2. Update Step
𝑋 𝑙 = 𝐺𝑆𝑃 𝜆 𝑙|𝑙−1, 𝛾𝑙 𝑌𝑙 𝜆 𝑙|𝑙 = 𝐸 𝑋 𝑙2
= 𝐺𝑆𝑃 𝜆 𝑙|𝑙−1, 𝛾𝑙2𝑌𝑙
2
The a-priori SNR estimator:
𝜉 𝑙|𝑙 =𝜆 𝑙|𝑙
𝜆𝐷𝑙
=𝜆 𝑙|𝑙−1
𝜆𝐷𝑙+ 𝜆 𝑙|𝑙−1
1+𝜆 𝑙|𝑙−1𝛾𝑙
𝜆𝐷𝑙+𝜆 𝑙|𝑙−1
𝜆 𝑙|𝑙 = 𝐺𝑆𝑃 𝜆 𝑙|𝑙−1, 𝛾𝑙2𝑌𝑙
2
Speech detection in sub-bands
SNR estimation
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 50: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/50.jpg)
Experimental results
Three Experiments:
1. Identification of the dominant speaker
2. Robustness to transient audio occurrences
3. Results on a real multipoint conference
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 51: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/51.jpg)
Experimental results
The proposed method was compared to:
The speaker with the highest VAD score in the
decision-interval is selected
1. RAMIREZ [6]
2. SOHN [4]
3. GARCH [7]
4. The speaker with the highest SNR is selected
5. The speaker with the highest signal POWER is
selected
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
[6] Ramirez et al. “Statistical voice activity detection using a multiple observation likelihood ratio test”, Signal Processing Letters, IEEE,2005
[7] S. Mousazadeh and I. Cohen, “Ar-garch in presence of noise: Parameter estimation and its application to voice activity detection“,
Audio, Speech, and Language Processing, IEEE Transactions on, 2011
![Page 52: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/52.jpg)
Performance Evaluation
Quantitative Evaluation:
False Speaker Switches #
Mid Sentence Clipping (MC) – percent of
undetected mid section of a speech burst
MC
Experimental results
true detected
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 53: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/53.jpg)
Experiment #1
Signals:
One or more concatenated TiMit sentences of
same speaker (with added white noise)
The non silent part of the sentence is expected
to be detected as a continuous speech burst
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 54: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/54.jpg)
Experiment #1
Proposed method Sohn VAD based method
Decision interval = 0.1 sec
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 55: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/55.jpg)
Experiment #1
False Speaker Switches Mid Sentence Clipping
Experimental results
0 0.1 0.2 0.3 0.4 0.5 0.6 0.70
0.5
1
1.5
2
2.5
3Mid Sentence Clipping
Decision interval [sec]
%
Single
POWER
SNR
RAMIREZ
SOHN
GARCH
Sequential
0 0.1 0.2 0.3 0.4 0.5 0.6 0.70
10
20
30
40
50
60
70
80
90
Decision interval [sec]
Fals
e c
am
era
sw
itches
Single
POWER
SNR
RAMIREZ
SOHN
GARCH
Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 56: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/56.jpg)
Experiment #2
Signals:
One or more concatenated TiMit sentences of
same speaker (with added white noise)
A sound of sneezing was added to channel 2
A sound of door knocks was added to channel 3
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 57: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/57.jpg)
Experiment #2
ch 1
ch 2
0 5 10 15 20 25
[sec]
ch 3
sneezing
knocks
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 58: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/58.jpg)
Experiment #2
False Speaker Switches Mid Sentence Clipping
Experimental results
0 0.1 0.2 0.3 0.4 0.5 0.6 0.70
0.5
1
1.5
2
2.5
3
3.5
Decision interval [sec]
%
Single
POWER
SNR
RAMIREZ
SOHN
GARCH
Sequential
0 0.1 0.2 0.3 0.4 0.5 0.6 0.70
10
20
30
40
50
60
70
80
90
Decision interval [sec]
Fals
e c
am
era
sw
itches
Single
POWER
SNR
RAMIREZ
SOHN
GARCH
Sequential
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 59: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/59.jpg)
Experiment #2
False Speaker Switches Mid Sentence Clipping
Decision interval = 0.05 sec
Single Sequential
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 60: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/60.jpg)
Experiment #2
Proposed Method GARCH VAD based method
Decision interval = 0.3 sec
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Red – hand labeled
Black – Algorithm’s result
![Page 61: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/61.jpg)
Experiment #2
Proposed Method POWER based method
Decision interval = 0.3 sec
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Red – hand labeled
Black – Algorithm’s result
![Page 62: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/62.jpg)
Experiment #3
Signals:
Real Multi-channel conversation
5 channels
Channels 2 & 4 – clean speech
Channel 1 – mostly noise
Channels 3 and 5 - crosstalk
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
![Page 63: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/63.jpg)
Experiment #3
Proposed Method POWER based method
Decision interval = 0.4 sec
Experimental results
Introduction ⋅ Discussed Realization ⋅ Proposed Method ⋅ Score Evaluation ⋅ SNR Estimation ⋅ Experimental Results
Red – hand labeled
Black – Algorithm’s result
![Page 64: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/64.jpg)
Summary
Novel method for dominant speaker
identification
Two approaches to speech activity score
evaluation
Experimental framework
Less false speaker switches
Robustness to transient audio occurrences
Summary
![Page 65: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/65.jpg)
Future Research
Non causal processing
Temporally variable thresholds
Preprocessing
Speech enhancement
Echo detection (or cancellation)
Future Research
![Page 66: Dominant Speaker Identification for Multipoint ... · The conference is administrated through an MCU: Participant 1 Participant 2 Participant N Introduction Introduction ⋅ Discussed](https://reader033.vdocuments.us/reader033/viewer/2022050206/5f599529f7c2696e7666da46/html5/thumbnails/66.jpg)
Thank You
Thank You