
Page 1

K. Boakye: Qualifying Exam Presentation

Speech Detection, Classification, and Processing for Improved Automatic Speech Recognition in Multiparty Meetings

Kofi A. Boakye
Advisor: Nelson Morgan
January 17th, 2007

Page 2

At-a-glance

Goal: Reduce errors caused by crosstalk and overlapped speech to improve speech recognition in meetings

Trying to do automatic speech recognition (ASR) in meetings:
• Personal mics pick up other speakers (crosstalk)
• Distant mics pick up multiple speakers at the same time (overlapped speech)

Page 3

Outline of talk

• Introduction
• Speech activity detection for nearfield microphones
• Overlap speech detection for farfield microphones
• Overlap speech processing for farfield microphones
• Preliminary experiments

Page 4

The meeting domain

• Multiparty meetings are a rich content source for spoken language technology
  – Rich transcription
  – Indexing and summarization
  – Machine translation
  – High-level language and behavioral analysis using dialog act annotation
• Good automatic speech recognition (ASR) is important

Page 5

Meeting ASR set-up

• For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Individual Headset Microphone
– Head-mounted mic positioned close to speaker
– Best-quality signal for speaker

Page 6

Meeting ASR set-up

• For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Lapel Microphone
– Individual mic placed on participant's clothing
– More susceptible to interfering speech

Page 7

Meeting ASR set-up

• For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Tabletop Microphone
– Omni-directional pressure-zone mic
– Placed between participants on a table or other flat surface
– Number and placement vary

Page 8

Meeting ASR set-up

• For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Linear Microphone Array
– Collection of omni-directional mics with a fixed linear topology
– Composition can range from 4 to 64 mics
– Enables beamforming for high-SNR signals

Page 9

Meeting ASR set-up

• For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Circular Microphone Array
– Combines the central location of the tabletop mic and the fixed topology of the linear array
– Consists of 4 to 8 omni-directional mics
– Enables source localization and speaker tracking

Page 10

ASR in multiparty meetings

• Nearfield recognition is generally performed by decoding each audio channel separately

[Diagram: each channel runs through its own S/NS Detection → Feature Extraction → Prob. Estimation → Decoding → Words pipeline]

Page 11

ASR in multiparty meetings

• Nearfield recognition is generally performed by decoding each audio channel separately

[Diagram: same per-channel pipelines, with one decoding stage expanded: Speech → Feature Extraction → Probability Estimate → Decode (using Pronunciation Models and Grammar Model) → Words]

Page 12

ASR in multiparty meetings

• Farfield recognition is done in one of two ways:

1) Signal combination

[Diagram: the channels feed a Signal Combination stage, whose single output runs through S/NS Detection → Feature Extraction → Prob. Estimation → Decoding → Words]

Page 13

ASR in multiparty meetings

• Farfield recognition is done in one of two ways:

2) Hypothesis combination

[Diagram: each channel runs through its own S/NS Detection → Feature Extraction → Prob. Estimation → Decoding pipeline, and a Hypothesis Combination stage merges the outputs into Words]

Page 14

Performance metrics

• Word error rate (WER)
  – Token-based ASR performance metric

  \mathrm{WER} = \frac{\#\,\text{insertions} + \#\,\text{deletions} + \#\,\text{substitutions}}{\#\,\text{tokens}}

• Diarization error rate (DER)
  – Time-based diarization performance metric

  \mathrm{DER} = \frac{T_{\text{missed}} + T_{\text{FA}} + T_{\text{spkr-err}}}{T_{\text{spkr}}}
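The WER above is computed from a Levenshtein alignment over word tokens. A minimal sketch of that computation, assuming whitespace-tokenized strings (the function name and toy inputs are illustrative; standard scoring tools implement the same recursion):

```python
import numpy as np

def wer(ref, hyp):
    """WER = (#insertions + #deletions + #substitutions) / #reference tokens,
    computed by Levenshtein alignment of the two word sequences."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)          # cost of deleting all ref words
    d[0, :] = np.arange(len(h) + 1)          # cost of inserting all hyp words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(r)

print(wer("the cat sat", "the cat sat down"))   # one insertion / 3 tokens ~ 0.33
```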

Page 15

Crosstalk and overlapped speech

• ASR in meetings presents specific challenges owing to the domain
• Multiple individuals speaking at various times leads to two phenomena in particular
  – Crosstalk
    • Associated with close-talking microphones
    • This non-local speech produces primarily insertion errors
    • Morgan et al. '03: WER differed 75% relative between segmented and unsegmented waveforms, due largely to crosstalk
  – Overlapped (co-channel) speech
    • Most pronounced (and severe) in the distant-microphone condition
    • Also produces errors for the recognizer
    • Shriberg et al. '01: 12% absolute WER difference between overlapped and non-overlapped speech segments in the nearfield case

Page 16

Scope of project

• Speech activity detection (SAD) for nearfield mics (addresses crosstalk)
  – Investigate features for SAD using an HMM segmenter
    • Metrics: word error rate (WER) and diarization error rate (DER)
    • Baseline features: standard cepstral features for an ASR system
    • Features will mainly be cross-channel in nature
• Overlap detection for farfield mics (addresses overlapped speech)
  – Investigate features for overlap detection using an HMM segmenter
    • Metric: diarization error rate
    • Baseline features: standard cepstral features
    • Features will mainly be single-channel and pitch-related
• Overlap speech processing for farfield mics
  – Determine if speech separation methods can reduce WER
    • Harmonic enhancement and suppression (HES)
    • Adaptive decorrelation filtering (ADF)

Page 17

Part I: Speech Activity Detection for Nearfield Microphones

Page 18

Related work

• Amount of work specific to multi-speaker SAD is rather small
• Wrigley et al. '03 and '05
  – Performed a systematic analysis of features for classifying multi-channel audio
  – Key result: from among 20 features examined, the best performing for each class was one derived from cross-channel correlation
• Pfau et al. '01
  – Thresholding cross-channel correlations as a post-processing step for HMM-based SAD yielded a 12% relative frame error rate reduction
• Laskowski et al. '04
  – Cross-channel correlation thresholding produced ASR WER improvements of 6% absolute over energy thresholding

Page 19

Candidate features

• Cepstral features
  – Consist of 12th-order Mel frequency cepstral coefficients, log-energy, and their first- and second-order time derivatives
  – Common to a number of speech-related fields
  – Log-energy is a fundamental component of most SAD systems
  – MFCCs could distinguish local speech from phenomena with similar energy levels (breaths, coughs, etc.)

Page 20

Candidate features

• Cross-channel correlation
  – Clear first choice for a cross-channel feature
  – Wrigley et al.: normalized cross-channel correlation was the most effective feature for crosstalk detection
  – Normalization seeks to compensate for channel gain differences and is done based on the frame-level energy of:
    • the target channel
    • the non-target channel
    • the square root of target and non-target (spherical normalization)

  C_{ij}(t) = \max_{\tau} \sum_{k=0}^{P-1} x_i(t-k)\, x_j(t-k-\tau)\, w(k)
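A minimal NumPy sketch of this feature with spherical normalization; the Hamming window and lag search range are illustrative assumptions, not values given on the slide:

```python
import numpy as np

def norm_cross_corr(x_i, x_j, t, P=400, max_lag=200):
    """C_ij(t) = max_tau sum_{k=0}^{P-1} x_i(t-k) x_j(t-k-tau) w(k),
    normalized by the sqrt of the two frame energies (spherical).

    Assumes t >= P - 1 + max_lag so all sample indices are valid."""
    w = np.hamming(P)                                    # analysis window w(k)
    seg_i = x_i[t - P + 1 : t + 1][::-1] * w             # x_i(t-k), k = 0..P-1
    best = -np.inf
    for tau in range(max_lag + 1):
        seg_j = x_j[t - tau - P + 1 : t - tau + 1][::-1]  # x_j(t-k-tau)
        best = max(best, float(np.dot(seg_i, seg_j)))
    e_i = np.sum(x_i[t - P + 1 : t + 1] ** 2)            # target frame energy
    e_j = np.sum(x_j[t - P + 1 : t + 1] ** 2)            # non-target frame energy
    return best / np.sqrt(e_i * e_j + 1e-12)             # spherical normalization
```

Normalizing by the target or non-target energy alone corresponds to replacing the denominator with e_i or e_j.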

Page 21

Candidate features

• Log-energy differences (LEDs)
  – Just as energy is a good feature for single-channel SAD, relative energy between channels should work well for our scenario
  – Represents the ratio of short-time energy between channels
  – Much less utilized than cross-channel correlation, though can be more robust

  D_{ij}(t) = E_i(t) - E_j(t)

• Normalized log-energy differences (NLEDs)
  – Compensate for channel gain differences

  E_{\mathrm{norm},i}(t) = E_i(t) - E_{\min,i}
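A short sketch combining the two formulas above; the frame sizes are illustrative (25 ms / 10 ms at 16 kHz), and applying the per-channel minimum floor before differencing is my reading of how the normalization is used:

```python
import numpy as np

def log_energy(x, frame_len=400, hop=160):
    """Frame-level log-energy E(t) of one channel."""
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def normalized_led(x_i, x_j):
    """NLED between target channel i and non-target channel j:
    E_norm,i(t) = E_i(t) - E_min,i, then D_ij(t) on the normalized energies."""
    e_i, e_j = log_energy(x_i), log_energy(x_j)
    n = min(len(e_i), len(e_j))
    return (e_i[:n] - e_i.min()) - (e_j[:n] - e_j.min())
```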

Page 22

Candidate features

• Time delay of arrival (TDOA) estimates
  – Performed well as features for farfield speaker diarization
    • Ellis and Liu '04 and Pardo et al. '06
  – Seem particularly well suited to distinguishing local speech from crosstalk
  – Proposed estimation method: generalized cross-correlation with phase transform (GCC-PHAT)

  Standard cross-correlation:  R_{ij}(\tau) = \int X_i(f)\, X_j^*(f)\, e^{j 2\pi f \tau}\, df

  GCC-PHAT:  R_{ij}^{\mathrm{PHAT}}(\tau) = \int \frac{X_i(f)\, X_j^*(f)}{\lvert X_i(f)\, X_j^*(f) \rvert}\, e^{j 2\pi f \tau}\, df
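A minimal GCC-PHAT delay estimator in NumPy; the maximum-delay bound is an illustrative assumption:

```python
import numpy as np

def gcc_phat_delay(x_i, x_j, fs, max_delay_ms=5.0):
    """Estimate the TDOA between two channels with GCC-PHAT.

    The cross-spectrum is whitened by its magnitude so only phase (i.e. delay)
    information remains; the strongest peak over the allowed lags is picked.
    Positive output means channel i receives the signal later than channel j."""
    n = 2 * max(len(x_i), len(x_j))          # zero-pad to avoid circular wrap-around
    X_i, X_j = np.fft.rfft(x_i, n), np.fft.rfft(x_j, n)
    cross = X_i * np.conj(X_j)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)    # PHAT weighting
    max_lag = int(fs * max_delay_ms / 1000)
    r = np.concatenate((r[-max_lag:], r[: max_lag + 1]))    # lags -max_lag..+max_lag
    return (np.argmax(np.abs(r)) - max_lag) / fs            # delay in seconds
```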

Page 23

Feature generation and combination

• One issue with cross-channel features: a variable number of channels
  – Varies between 3 and 12 for some corpora
• Proposed solution: use order statistics (max and min), as sketched below
• Considered feature combination as well
  – Simple concatenation
  – Combination with dimensionality reduction
    • Principal component analysis (PCA)
    • Linear discriminant analysis (LDA)
    • Multilayer perceptron (MLP)
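A minimal sketch of the order-statistics reduction; the array layout is an assumption for illustration:

```python
import numpy as np

def order_statistic_features(pairwise_feats):
    """Collapse a variable number of cross-channel feature streams to a fixed size.

    pairwise_feats : shape (n_other_channels, n_frames), holding one
                     cross-channel feature (e.g. NLED) against every other mic.
    Returns shape (n_frames, 2): per-frame max and min over the other channels,
    so downstream models see the same dimensionality for 3 or 12 mics."""
    return np.stack([pairwise_feats.max(axis=0),
                     pairwise_feats.min(axis=0)], axis=1)
```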

Page 24

Work plan for part I

• Compare performance of HMM segmentation using the proposed features
  – Metrics: WER and DER
    • DER typically correlates with WER
    • DER can be computed quickly
  – Data: NIST Rich Transcription (RT) Meeting Recognition evaluations
    • 10-12 min. excerpts of meeting recordings from different sites
  – Baseline measure: standard cepstral features
  – Feature performance measured in isolation and with baseline features
  – Try to determine the best combination of features and the combination technique that obtains it
  – A significant amount of this work has been done

Page 25

Part II: Overlap Detection for Farfield Microphones

Page 26

Related work

• "Usable" speech for speaker recognition
  – Lewis and Ramachandran '01
    • Compared MFCCs, LPCCs, and a proposed pitch prediction feature (PPF) for speaker count labeling in both closed- and open-set scenarios
  – Shao and Wang '03
    • Used multi-pitch tracking to identify usable speech for a closed-set speaker recognition task
  – Yantorno et al.
    • Proposed spectral autocorrelation peak-valley ratio (SAPVR), adjacent pitch period comparison (APPC), and kurtosis

Page 27

Candidate features

• Cepstral features
  – As a representation of the speech spectral envelope, should provide information on whether multiple speakers are active
  – Zissman et al. '90
    • A Gaussian classifier with cepstral features reported 80% classification accuracy between target-only, jammer-only, and target-plus-jammer speech

Page 28

Candidate features

• Cross-channel correlation
  – Recall Wrigley et al.: correlation was the best feature for nearfield audio classification
  – Unclear if this extends to the farfield in the overlap case
    • For nearfield, overlapped speech tends to have low cross-channel correlation
    • For farfield, the large asymmetry in speaker-to-microphone distances is not typically present → low correlation may not occur

Page 29

Candidate features

• Pitch estimation features
  – Explore how pitch detectors behave in the presence of overlapped speech
  – Methods can be applied at the subband level
    • May be appropriate here since harmonic energy from different speakers may be concentrated in different bands
  – Issue regarding unvoiced regions
    • Include a feature that indicates voicing
      – Energy, zero-crossing rate, spectral tilt
  – Candidate estimators: zero-crossing distance, auto-correlation function, average magnitude difference function
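Two of the listed estimators are easy to sketch for a single frame. These are generic full-band textbook versions, not the subband variants the slide alludes to, and the 60-400 Hz search range is an assumption:

```python
import numpy as np

def acf_pitch(frame, fs, f_lo=60.0, f_hi=400.0):
    """Pitch estimate from the short-time autocorrelation function (ACF):
    the lag of the strongest ACF peak in the allowed range.
    The frame must be longer than fs/f_lo samples."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    return fs / (lo + np.argmax(acf[lo:hi]))

def amdf_pitch(frame, fs, f_lo=60.0, f_hi=400.0):
    """Pitch estimate from the average magnitude difference function (AMDF):
    the lag minimizing mean |x(n) - x(n - tau)| over the search range."""
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    amdf = [np.mean(np.abs(frame[tau:] - frame[:-tau])) for tau in range(lo, hi)]
    return fs / (lo + int(np.argmin(amdf)))
```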

Page 30

Candidate features

• Spectral autocorrelation peak-valley ratio (SAPVR)

  \mathrm{SAPVR} = 20 \log_{10} \frac{R(p_1)}{R(q_1)}

  where R(p_1) is the local maximum of the non-zero-lag spectral autocorrelation, and R(q_1) is the next local maximum not harmonically related to it, or the local minimum between p_1 and p_2
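A rough sketch of the idea behind SAPVR: voiced single-speaker frames have a strongly periodic magnitude spectrum, so the spectral autocorrelation shows a dominant non-zero-lag peak, and overlap flattens it. The peak/valley picking here is deliberately simplistic and illustrative, not the published definition:

```python
import numpy as np

def sapvr(frame, n_fft=512):
    """Toy spectral autocorrelation peak-valley ratio in dB."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    spec = spec - spec.mean()                       # remove DC before correlating
    r = np.correlate(spec, spec, mode="full")[len(spec) - 1 :]
    peaks = [k for k in range(2, len(r) - 1) if r[k - 1] < r[k] > r[k + 1]]
    if not peaks:
        return 0.0                                  # no periodic structure found
    p1 = max(peaks, key=lambda k: r[k])             # R(p1): dominant non-zero-lag peak
    valley = r[p1 + 1 : min(2 * p1, len(r))].min()  # local minimum before ~2*p1
    return 20 * np.log10(abs(r[p1]) / (abs(valley) + 1e-12))
```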

Page 31

Candidate features

• Kurtosis
  – For a zero-mean RV x, kurtosis is defined as:

  \kappa_x = \frac{E\{x^4\}}{\left( E\{x^2\} \right)^2} - 3

  – Measures the "Gaussianity" of a RV
  – Speech signals, which are modeled as Laplacian or Gamma distributed, tend to be super-Gaussian
  – Summing such signals produces a signal with reduced kurtosis (Leblanc and DeLeon '98; Krishnamachari et al. '00)
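A short sketch of the formula plus a toy check of the reduced-kurtosis effect the slide cites (the synthetic Laplacian sources are illustrative):

```python
import numpy as np

def excess_kurtosis(x):
    """kappa_x = E{x^4} / (E{x^2})^2 - 3 for a (re-centered) frame.
    Positive for super-Gaussian signals, near zero for Gaussian ones."""
    x = x - x.mean()                     # enforce the zero-mean assumption
    m2 = np.mean(x ** 2)
    return np.mean(x ** 4) / (m2 ** 2 + 1e-24) - 3.0

# An iid Laplacian source has excess kurtosis ~3; the sum of two independent
# Laplacians has ~1.5, which is the cue exploited for overlap detection.
rng = np.random.default_rng(0)
s1, s2 = rng.laplace(size=100_000), rng.laplace(size=100_000)
print(excess_kurtosis(s1), excess_kurtosis(s1 + s2))   # ~3.0 vs ~1.5
```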

Page 32

Feature generation and combination

• As with the nearfield condition, the number of channels varies with the meeting
• Aside from cross-channel correlation, features can be generated from a single channel
• Explore two methods:
  1) Select a single "best" channel based on SNR estimates
  2) Combine the audio signals using delay-and-sum beamforming (sketched below) to produce a single channel
     – May adversely affect pitch-derived features
• Examine the same combination approaches as before
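A minimal delay-and-sum sketch, assuming per-channel delays have already been estimated (e.g. with GCC-PHAT against a reference mic):

```python
import numpy as np

def delay_and_sum(channels, delays_s, fs):
    """Delay-and-sum beamforming: time-align each distant mic and average.

    channels : list of 1-D sample arrays, one per mic.
    delays_s : per-channel delays in seconds; a larger delay means that mic
               hears the source later, so its signal is advanced more."""
    shifts = np.array([int(round(d * fs)) for d in delays_s])
    shifts -= shifts.min()                     # make all shifts non-negative
    n = min(len(c) - s for c, s in zip(channels, shifts))
    aligned = [c[s : s + n] for c, s in zip(channels, shifts)]
    # The common source adds coherently; uncorrelated noise averages down.
    return np.mean(aligned, axis=0)
```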

Page 33

Work plan for part II

• Compare performance of HMM segmentation using the proposed features
  – Metric: DER
  – Data: NIST Rich Transcription (RT) Meeting Recognition evaluations
  – Baseline measure will be standard cepstral features
  – Feature performance measured in isolation and in conjunction with baseline features
  – Try to determine the overall best combination of features and the combination technique that obtains it

Page 34

Part III: Overlap Speech Processing for Farfield Microphones

Page 35

Related work

• Blind source separation (BSS)

  Given X = [x_0 \dots x_N]^T and S = [s_0 \dots s_M]^T, M \le N, related by X = AS, we seek to find \hat{S} = W^T X

  If we assume the s_i are independent and at most one is Gaussian distributed, the solving method becomes one of independent component analysis (ICA)

• Real-world audio signals have convolutive mixing
  – Reformulating the problem in the Z-transform domain leads to similar solutions
  – Most techniques are iterative, based on infomax criteria
  – Lee and Bell '97: BSS yielded improved recognition results for digit recognition in a real room environment
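A toy demo of the instantaneous X = AS model above, using scikit-learn's FastICA as the ICA solver. The synthetic super-Gaussian sources and mixing matrix are illustrative; real meeting audio is convolutive, as the slide notes:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = np.stack([rng.laplace(size=20_000),            # super-Gaussian source 1
              rng.exponential(size=20_000) - 1])   # super-Gaussian source 2
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                         # unknown mixing matrix
X = A @ S                                          # observed mixtures, X = AS

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X.T).T                   # recovered up to scale/permutation
```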

Page 36

Related work

• Blind source separation (BSS)
  – Another set of approaches is based on minimizing cross-channel correlation: adaptive decorrelation filtering
    • Weinstein et al. '93 & '96
    • Yen et al. '96-'99
  – Demonstrated improved recognition performance on simulated mixtures with coupling estimated from a real room

Page 37

Related work

• Single-channel separation
  – Techniques based on computational auditory scene analysis (CASA) try to separate by partitioning the audio spectrogram
  – Partitioning relies on certain types of structure in the signal and uses cues such as pitch, continuity, and common onset and offset

Page 38

Related work

• Single-channel separation
  – Bach and Jordan '05
    • Used spectral clustering to create speech stream partitions
  – Morgan et al. '97
    • Used a simpler though related method exploiting harmonic structure
    • Results on keyword spotting suggest the approach may be useful in an ASR context

Page 39

Harmonic enhancement and suppression

• Single-channel speech separation method
• Utilizes the harmonic structure of voiced speech to separate
  – A speaker's harmonics are identified using pitch estimation, and a signal is generated by enhancing them
  – Alternatively, the time-frequency bins of the short-time Fourier transform in the neighborhood of the harmonics are selected and the others zeroed, followed by signal reconstruction
  – For an additional speaker, the first speaker's harmonics are suppressed and/or the other speaker's harmonics enhanced, if the pitch can be determined
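A sketch of the STFT-mask variant, assuming a single fixed pitch f0 for the excerpt (a real system would track pitch per frame and per speaker); the window length and harmonic bandwidth are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_mask_separate(y, fs, f0, enhance=True, bw_hz=40.0):
    """Keep (enhance=True) or zero (enhance=False) STFT bins near the
    harmonics f0, 2*f0, ... and reconstruct the time signal."""
    f, _, Y = stft(y, fs=fs, nperseg=512)
    harmonics = np.arange(f0, fs / 2, f0)
    near = np.min(np.abs(f[:, None] - harmonics[None, :]), axis=1) < bw_hz
    mask = near if enhance else ~near               # enhancement vs suppression
    _, y_sep = istft(Y * mask[:, None], fs=fs, nperseg=512)
    return y_sep
```

Suppression of the first speaker's harmonics (enhance=False) leaves a signal dominated by the second speaker, per the last bullet above.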

Page 40

Adaptive decorrelation filtering

• Multi-channel speech separation method
• Separates signals by adaptively determining the filters governing the coupling between channels
• Look at the two-source, two-channel case:

  Y_1(f) = H_{11}(f)\, S_1(f) + H_{12}(f)\, S_2(f)
  Y_2(f) = H_{21}(f)\, S_1(f) + H_{22}(f)\, S_2(f)

Page 41

Adaptive decorrelation filtering

• Multi-channel speech separation method
• Separates signals by adaptively determining the filters governing the coupling between channels
• Rewriting the two-source, two-channel case:

  Y_1(f) = X_1(f) + A(f)\, X_2(f)
  Y_2(f) = X_2(f) + B(f)\, X_1(f)

  where X_i(f) = H_{ii}(f)\, S_i(f),\; i = 1, 2, and

  A(f) = \frac{H_{12}(f)}{H_{22}(f)}, \quad B(f) = \frac{H_{21}(f)}{H_{11}(f)}

Page 42

Adaptive decorrelation filtering

• Now process the signals with the separation system:

  V_1(f) = Y_1(f) - \hat{A}(f)\, Y_2(f)
  V_2(f) = Y_2(f) - \hat{B}(f)\, Y_1(f)

  C(f) = 1 - A(f)\, B(f)

• When \hat{A}(f) = A(f), \hat{B}(f) = B(f), and C(f) is invertible, the signals x_1(t) and x_2(t) can be perfectly restored
• Since V_i(f) = C(f)\, X_i(f), when C(f) is not invertible, linearly distorted versions of the signals can be obtained
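A toy frequency-domain sketch of the decorrelation idea: per-bin coupling estimates are nudged until the outputs' cross-correlation vanishes. The gradient-style update rule, step size, and STFT framing are illustrative assumptions of mine, not the published ADF algorithm of Weinstein or Yen:

```python
import numpy as np
from scipy.signal import stft, istft

def adf_separate(y1, y2, fs, n_iter=50, mu=0.05, nperseg=512):
    """Estimate A_hat(f), B_hat(f) by driving E[V1(f) V2*(f)] toward zero,
    then output V1 = Y1 - A_hat*Y2 and V2 = Y2 - B_hat*Y1."""
    _, _, Y1 = stft(y1, fs=fs, nperseg=nperseg)
    _, _, Y2 = stft(y2, fs=fs, nperseg=nperseg)
    A = np.zeros(Y1.shape[0], dtype=complex)      # A_hat(f), one tap per bin
    B = np.zeros(Y1.shape[0], dtype=complex)      # B_hat(f)
    for _ in range(n_iter):
        V1 = Y1 - A[:, None] * Y2
        V2 = Y2 - B[:, None] * Y1
        cross = np.mean(V1 * np.conj(V2), axis=1)           # residual coupling
        p1 = np.mean(np.abs(V1) ** 2, axis=1) + 1e-12
        p2 = np.mean(np.abs(V2) ** 2, axis=1) + 1e-12
        A += mu * cross / p2                      # decorrelate V1 from V2
        B += mu * np.conj(cross) / p1             # and V2 from V1
    _, v1 = istft(Y1 - A[:, None] * Y2, fs=fs, nperseg=nperseg)
    _, v2 = istft(Y2 - B[:, None] * Y1, fs=fs, nperseg=nperseg)
    return v1, v2
```

As on the slide, even with perfect estimates the outputs are C(f)-filtered versions of the sources, so some linear distortion remains.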

Page 43

Work plan for part III

• Employ speech separation algorithms on overlap segments to try to improve ASR performance
  – Metric: WER
  – Focus on WER in overlap regions
  – Data: NIST RT evaluations (same as in parts I and II)
• Subsequent analyses if improvements are obtained:
  – Compare processing the entire segment versus just the overlap region
  – Process overlap regions as determined by the overlap detector from part II
  – Analyze patterns of improvement, or conversely, which error types persist

Page 44

Preliminary Experiments

Page 45

Preliminary experiments

• Experiments pertain to part I
• Performed using Augmented Multiparty Interaction (AMI) development-set meetings from the NIST RT-05S evaluation
  – Scenario-based meetings, each involving 4 participants wearing headset (head-mounted) or lapel mics
• Segmenter
  – Derived from an HMM-based speech recognition system
  – Two classes, "speech" and "nonspeech", each represented with a three-state phone model
  – Training data: first 10 minutes from 35 AMI meetings
  – Test data: 12-minute excerpts from four additional AMI meetings

Page 46

Expt. 1: Single feature performance (diarization)

[Bar chart: DER and WER (%) per feature; values below]

  Feature     DER (%)   WER (%)
  baseline     21.09     37.8
  NMXC         28.18     36.0
  LEDs         19.16     40.8
  NLEDs        16.74     38.8
  reference     n/a      32.0

• LEDs and NLEDs outperform the baseline cepstral features
• NMXC features do more poorly
  – Higher FA rate
• NLEDs give lower DER than LEDs
  – Indicates effectiveness of the normalization procedure

Page 47

Expt. 1: Single feature performance (recognition)

[Same bar chart as the previous slide]

• NMXC features outperform the baseline
• LEDs and NLEDs do not
• NLEDs give lower WER than LEDs
• Cross-channel features reduce the insertion rate (between 39% and 46% relative)
• 4% difference between the best feature (NMXC) and the reference

Page 48

Expt. 2: Initial feature combination

[Bar chart: DER and WER (%) per feature combination; values below]

  Feature set            DER (%)   WER (%)
  baseline                21.09     37.8
  base + NMXC             11.65     34.0
  base + LEDs             11.28     34.7
  base + NLEDs            11.79     33.5
  base + NMXC + LEDs      17.50     38.1
  base + NMXC + NLEDs     11.41     34.6
  reference                n/a      32.0

• Combination with the baseline yields similar performance across features
  – Exception: base + NMXC + LEDs
• Improved performance comes from reduced FA/insertions
• 3-way combos degrade performance
  – May be due to correlation between features
• 2% difference between the best combo and the reference

Page 49

Summary

• Goal: Reduce errors caused by crosstalk and overlapped speech to improve speech recognition in meetings
• Crosstalk
  – Use an HMM-based segmenter to identify local speech regions
    • Investigate features to effectively do this

[Diagram: "Improve this…" points at the S/NS Detection stage, "…to improve this" at the downstream Feature Extraction → Prob. Estimation → Decoding → Words pipeline]

Page 50

Summary

• Goal: Reduce errors caused by crosstalk and overlapped speech to improve speech recognition in meetings
• Overlapped speech
  – Use an HMM-based segmenter to identify overlapped regions
    • Investigate features to effectively do this
  – Process overlap regions to improve recognition performance
    • Explore two methods (HES and ADF) to see if they can do this

[Diagram: "Add this…" points at new Overlap Detection and Overlap Processing stages ahead of the Feature Extraction → Prob. Estimation → Decoding → Words pipeline, "…to improve this" at the pipeline output]

Page 51

Summary

• Goal: Reduce errors caused by crosstalk and overlapped speech to improve speech recognition in meetings
• Experiments
  – Some begun, many to be done

Page 52

Fin