HIWIRE Progress Report
Trento, January 2007
Presenter: Prof. Alex Potamianos, Technical University of Crete
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
Stream Weights: Motivation
• ASR performance degrades at low SNR → combine several sources of information
• The information sources are not equally reliable across environments and noise conditions
• Mismatch between training and test conditions
• Unsupervised stream-weight computation for multistream classifiers is an open problem
Problem Definition
• Compute “optimal” exponent weights for each stream s_i
• Optimality in the sense of minimizing the total classification error
Total Error Computation
• Two-class problem with classes w1, w2 and feature vector x
• Feature pdfs p(x | w1), p(x | w2)
• Assume the estimation/modeling error is a normally distributed variable z_i
Optimal Stream Weights (1)
• Minimize the total error variance σ² with respect to the stream weights
• Two interesting cases:
  – Equal error rate in the single-stream classifiers: p(x1 | w1) = p(x2 | w1) in the decision region
  – Equal estimation error variance in each stream: σ²_{s1} = σ²_{s2}
Optimal Stream Weights (2)
• Equal error rate in the single-stream classifiers
• Equal estimation error variance in each stream
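The slides do not reproduce the fusion rule itself. As a hedged sketch, multistream classifiers of this kind typically combine per-stream log-likelihoods with exponent weights, so with audio weight λ and video weight 1 − λ the decision rule can be written as follows (function names are illustrative, not from the slides):

```python
def fused_score(log_p_audio, log_p_video, lam):
    """Exponent-weighted (log-linear) fusion of two stream log-likelihoods.

    lam is the audio stream weight; the video stream gets 1 - lam.
    """
    return lam * log_p_audio + (1.0 - lam) * log_p_video

def classify(stream_scores, lam):
    """stream_scores: {class: (log_p_audio, log_p_video)} -> best-scoring class."""
    return max(stream_scores, key=lambda c: fused_score(*stream_scores[c], lam))
```

With lam = 1.0 the decision reduces to audio-only classification, with lam = 0.0 to video-only, which is why the weight must adapt to the noise conditions of each stream.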
Antimodels, Inter- and Intra-Class Distances
• The multi-class problem is recast as multiple two-class classification problems
• If p(x | w) follows a Gaussian distribution N(μ, σ²), the Bayes error is a function of D = |μ1 − μ2| / σ
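Concretely, for two equiprobable classes with common variance σ², the Bayes error equals Φ(−D/2), where Φ is the standard normal CDF. A minimal sketch under those assumptions (the function name is illustrative):

```python
import math

def bayes_error(mu1, mu2, sigma):
    """Bayes error of two equiprobable 1-D Gaussians N(mu1, sigma^2), N(mu2, sigma^2).

    With D = |mu1 - mu2| / sigma, the error is Phi(-D / 2),
    computed here via the complementary error function.
    """
    d = abs(mu1 - mu2) / sigma
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))
```

The error is 0.5 for fully overlapping classes (D = 0) and decreases monotonically as the class separation D grows.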
Experimental Results (1)
• Test case: audio-visual continuous digit recognition task
• Differences from the ideal two-class case:
  – Multi-class problem
  – Recognition instead of classification
• Multiple experiments:
  – Clean video stream
  – Noise-corrupted audio streams at various SNRs
Experimental Results (2)
• Subset of the CUAVE database used:
  – 36 speakers (30 training, 6 testing)
  – 5 sequences of 10 connected digits per speaker
  – Training set: 1500 digits (30×5×10)
  – Test set: 300 digits (6×5×10)
• Features:
  – Audio: 39 features (MFCC_D_A)
  – Visual: 39 features (ROIDCT_D_A, odd columns)
• Multi-stream HMM models:
  – 8-state, left-to-right, whole-digit HMMs
  – Single Gaussian mixture component
  – The AV-HMM uses separate audio and video feature streams
Weights’ distribution
Results (classification)
Inter-/Intra-Class Distances and Recognition
• In each stream, a total inter-/intra-class distance is computed
Results (recognition)
Conclusions
• We proposed a stream-weight computation method for a multi-class classification task, based on theoretical results obtained for the two-class classification problem and making use of an antimodel technique
• We use only the test utterance and the information contained in the trained models
• The results are of interest for the unsupervised estimation of stream weights in multi-stream classification and recognition problems
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
Dynamical System Segment Model
• Segment models directly model the time evolution of the speech parameters
• Based on a linear dynamical system
• The system parameters should guarantee identifiability, controllability, observability, and stability
• Only simple matrix topologies have been studied up to now
x_{k+1} = F x_k + w_k
y_k = H x_k + v_k
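The state-space equations above can be made concrete with a small data-generation sketch (all names and the pure-Python matrix handling are illustrative; the slides contain no code):

```python
import random

def simulate_lds(F, H, q, r, x0, n):
    """Simulate the linear dynamical system
       x_{k+1} = F x_k + w_k,   y_k = H x_k + v_k
    with F, H given as lists of lists and w, v zero-mean Gaussian
    noises with standard deviations q and r. Returns y_1 .. y_n.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    x = list(x0)
    ys = []
    for _ in range(n):
        # observation equation: y_k = H x_k + v_k
        ys.append([h + random.gauss(0.0, r) for h in matvec(H, x)])
        # state equation: x_{k+1} = F x_k + w_k
        x = [f + random.gauss(0.0, q) for f in matvec(F, x)]
    return ys
```

Setting q = r = 0 yields the noise-free trajectory, which is a convenient sanity check for the matrix topology chosen for F and H.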
Linear dynamical system with state control:
• The parameters F, B, H have canonical forms (Ljung, “System Identification”)
Generalized forms of the parameter structures:
x_{k+1} = F x_k + B u_k + w_k
y_k = H x_k + v_k
where F, B, and H have sparse canonical (0/1) structures: ones on selected off-diagonal positions of F, free parameters confined to a few rows, and B and H acting as selection matrices.
Parameter Estimation
• Use the EM algorithm to estimate the parameters F, B, P, R
  – We propose a new element-wise parameter estimation algorithm
• For the forward-backward recursions, use the Kalman smoother recursions
The element-wise re-estimation formulas update each free entry f_ij of F as a ratio involving cofactors of the state-noise covariance P and the smoothed second-order state statistics (sums of x_k x_k^T over the sequence) produced by the Kalman smoother.
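The smoothed statistics feeding the E-step come from the Kalman smoother recursions mentioned above. A minimal, hedged sketch of the forward filter plus the Rauch-Tung-Striebel backward pass (a generic textbook implementation, not the authors' code):

```python
import numpy as np

def kalman_smoother(y, F, H, Q, R, x0, P0):
    """Kalman filter + RTS smoother for
         x_{k+1} = F x_k + w_k,  w_k ~ N(0, Q)
         y_k     = H x_k + v_k,  v_k ~ N(0, R)
    Returns the smoothed state means (E-step statistics for EM).
    """
    n, d = len(y), F.shape[0]
    xf = np.zeros((n, d)); Pf = np.zeros((n, d, d))   # filtered
    xp = np.zeros((n, d)); Pp = np.zeros((n, d, d))   # one-step predicted
    x, P = x0, P0
    for k in range(n):
        xp[k], Pp[k] = x, P                            # prior at time k
        S = H @ P @ H.T + R                            # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
        x = x + K @ (y[k] - H @ x)                     # measurement update
        P = P - K @ H @ P
        xf[k], Pf[k] = x, P
        x, P = F @ x, F @ P @ F.T + Q                  # time update
    xs = xf.copy()
    for k in range(n - 2, -1, -1):                     # RTS backward pass
        J = Pf[k] @ F.T @ np.linalg.inv(Pp[k + 1])     # smoother gain
        xs[k] = xf[k] + J @ (xs[k + 1] - xp[k + 1])
    return xs
```

In the full EM algorithm the smoother would also return the smoothed covariances and cross-covariances, which enter the element-wise updates of F; only the means are shown here to keep the sketch short.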
Experiments with Artificial Data
• Experiment description:
  – Select random system parameters (using the canonical matrix topology)
  – Generate artificial data from the system
  – Estimate the parameters from the artificial data
• Criteria for evaluating the system:
  – The log-likelihood of the observations increases per EM iteration
  – The parameter estimation error decreases per EM iteration
• Setup: no state control; F is 3×3; observation vector size 3×1; 3 rows with free parameters; 1000 samples
Model Training on Speech Data
• Aurora 2 database, 77 training sentences
• Word models with different numbers of states, based on the phonetic transcription
• State alignments produced using HTK

Segment models (states per word):
  2: oh
  4: two, eight
  6: one, three, four, five, six, nine, zero
  8: seven
Speech Segment Modeling
• Classification process:
  – Keep the true word boundaries fixed (digit-level alignments produced by an HMM)
  – Apply a suboptimal search-and-pruning algorithm: keep the 11 most probable word histories for each word in the sentence
  – Classification is based on maximizing the likelihood
• Test set:
  – Aurora 2, test A, subway sentences; 1000 test sentences
  – Different noise levels (clean; SNR 20, 15, 10, 5 dB)
  – Front end extracts 14-dimensional static features (HTK standard front end), with 2 feature configurations:
    • 12 cepstral coefficients + C0 + energy
    • + first- and second-order derivatives (δ, δδ)
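The pruning step, keeping the 11 most probable word histories, can be sketched as a simple history beam over fixed word boundaries (the names and the score layout are assumptions; the slides do not show the actual search):

```python
import heapq

def nbest_decode(segment_scores, vocab, beam=11):
    """Suboptimal search with word-history pruning (hedged sketch).

    segment_scores: list over word positions; segment_scores[t][w] is the
    log-likelihood of word w on segment t (word boundaries assumed fixed).
    Only the `beam` most probable word histories are kept per position.
    """
    hyps = [(0.0, [])]                      # (log-score, word history)
    for scores in segment_scores:
        extended = [(s + scores[w], hist + [w]) for s, hist in hyps for w in vocab]
        hyps = heapq.nlargest(beam, extended, key=lambda h: h[0])
    return hyps[0][1]                       # most likely word sequence
```

With fixed boundaries and context-independent scores the beam is exact; the pruning only matters once histories influence later segment scores, as in the segment-model search described above.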
Classification Results
• Comparison of segment-model and HTK HMM classification (% accuracy)
• Same front-end configuration, same alignments; both models trained on clean data

AURORA Subway   HMM (HTK)             Segment Models
                MFCC, E    +δ, δδ     MFCC, E    +δ, δδ
Clean           97.19%     97.57%     97.53%     97.61%
SNR 20          90.91%     95.71%     93.23%     95.12%
SNR 15          80.09%     91.76%     87.91%     91.13%
SNR 10          57.68%     81.93%     76.29%     82.69%
SNR 5           36.01%     64.24%     54.87%     63.56%
Conclusions and Future Work
• Without derivatives, segment models significantly outperform HMMs, particularly under highly noisy conditions
• When derivatives are used, the two models perform similarly
• Future work:
  – Use formants and other articulatory features to initialize the state vectors
  – Examine different dimensions of the state vector
  – Extend to a non-linear dynamical system: use an extended Kalman filter and derive the EM re-estimation formulas for the non-linear case
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
MAP versus Bayes Optimal
• MAP adaptation techniques derive from Bayes optimal classification
  – Assumption: the posterior is peaked around the most probable model
  – It is not optimal
• Bayes optimal adaptation is based on a weighted average over the posteriors
  – Better performance with less adaptation data
  – But: computationally expensive; analytical solutions are hard to find; approximations must be considered
Bayes Optimal Adaptation
• Bayes optimal classification is based on:
  p(x_t | s_t, X_a) = ∫ p(x_t | θ, s_t) p(θ | X_a, s_t) dθ
• Assuming θ denotes a Gaussian component, this becomes a sum over a subset Θ of N Gaussians:
  p(x_t | s_t, X_a) ≈ Σ_{θ_n ∈ Θ} p(x_t | θ_n, s_t) p(θ_n | X_a, s_t)
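The Gaussian-sum approximation above can be evaluated numerically; a minimal sketch for 1-D components (names are illustrative, and the posteriors are taken as given):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_optimal_likelihood(x, components, posteriors):
    """Approximate p(x | s, X_a) by a posterior-weighted sum over the
    subset of N Gaussians: components is a list of (mu, var) pairs and
    posteriors holds the weights p(theta_n | X_a, s)."""
    return sum(w * gaussian(x, mu, var)
               for (mu, var), w in zip(components, posteriors))
```

When the posterior collapses onto a single component, the sum reduces to the MAP case described on the previous slide; spreading the weight over several components is what distinguishes the Bayes optimal approach.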
Our Approach
• To obtain the N Gaussians of Θ:
  – Step 1: Cluster the Gaussian mixtures associated with context-dependent models sharing a common central phone
  – Step 2: From the extended Gaussian mixture, choose the N least distant Gaussians from each Gaussian component of the SI Gaussian mixture
• Bayes optimal classification then becomes:
  p(x_t | s_t, X_a) ≈ Σ_{i=1..M} c_i(s_t) N(x_t; m_i, S_i) p_i(X_a | s_t)
[Figure: Gaussian size versus number of mixture components (1 … M) for Mixture 1 and Mixture 2]
• For example, based on the entropy-based distance between the Gaussians, the least distant Gaussians (shown in gray) are clustered together
• The clustering can be performed on an element or sub-vector basis, thus increasing the degrees of freedom
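The slides name an entropy-based distance without giving its formula. As a stand-in, a symmetric KL divergence between 1-D Gaussians can illustrate the "N least distant Gaussians" selection of Step 2 (all names are assumptions):

```python
import math

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence KL( N(mu0, var0) || N(mu1, var1) ) for 1-D Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def sym_kl(g0, g1):
    """Symmetrized KL distance between two (mu, var) Gaussians."""
    return kl_gauss(*g0, *g1) + kl_gauss(*g1, *g0)

def n_closest(reference, candidates, n):
    """Pick the n candidate Gaussians least distant from the reference."""
    return sorted(candidates, key=lambda g: sym_kl(reference, g))[:n]
```

For equal variances the symmetric KL reduces to the squared mean difference over the variance, so this selection agrees with the D = |μ1 − μ2|/σ separation measure used earlier in the presentation.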
Adaptation Configuration
• Baseline trained on the WSJ database
• Adaptation data:
  – Spoke3 WSJ task, non-native speakers
  – 5 male and 5 female speakers
  – 20 adaptation sentences per speaker
  – 40 test sentences per speaker
• Experiments are performed for different numbers of associated mixtures (associations)
Adaptation Results (% WER)

Speaker                Bayes’ Adaptation           Baseline
                       5 Assoc.    6 Assoc.
Male (4n0)             51.52%      47.65%          59.28%
Male (4n3)             43.27%      41.98%          51.72%
Male (4n5)             33.13%      31.48%          36.30%
Male (4n9)             34.48%      33.43%          28.96%
Male (4na)             26.66%      26.22%          28.72%
Total male %WER        37.87%      36.15%          40.99%
Female (4n1)           74.96%      74.47%          81.01%
Female (4n4)           58.18%      58.18%          60.12%
Female (4n8)           34.16%      35.99%          30.85%
Female (4nb)           40.31%      39.38%          39.06%
Female (4nc)           40.23%      41.68%          42.97%
Total female %WER      49.56%      49.94%          50.80%
Total Results and Conclusions

                Bayes’ Adaptation           Baseline
                5 Assoc.    6 Assoc.
Total %WER      43.71%      43.04%          45.89%

• Small improvements are obtained compared to the baseline
• The number of associations significantly influences adaptation performance
• The optimal number of associations depends on the baseline models and the adaptation data → choose the associations dynamically