HIWIRE Progress Report
Trento, January 2007
Presenter: Prof. Alex Potamianos, Technical University of Crete
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
Stream Weights: Motivation
• ASR performance degrades at low SNR → combine several sources of information
• The information sources are not equally reliable across environments and noise conditions
• Mismatch between training and test conditions
• Unsupervised stream-weight computation for multistream classifiers is an open problem
Problem Definition
• Compute “optimal” exponent weights for each stream s_i
• Optimality in the sense of minimizing the total classification error
Total Error Computation
• Two-class problem with classes w1, w2 and feature vector x
• Feature pdfs p(x | w1), p(x | w2)
• Assume the estimation/modeling error is a normally distributed variable z_i
Optimal Stream Weights (1)
• Minimize the total error variance σ² with respect to the stream weights
• Two interesting cases:
  – Equal error rate in the single-stream classifiers: p(x1 | w1) = p(x2 | w1) in the decision region
  – Equal estimation error variance in each stream: σ²_{s1} = σ²_{s2}
Optimal Stream Weights (2)
• Equal error rate in the single-stream classifiers
• Equal estimation error variance in each stream
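The slides do not reproduce the fusion rule itself. As a hedged sketch, multistream classifiers of this kind typically combine per-stream log-likelihoods with exponent weights, so with audio weight λ and video weight 1 − λ the decision rule can be written as follows (function names are illustrative, not from the slides):

```python
def fused_score(log_p_audio, log_p_video, lam):
    """Exponent-weighted (log-linear) fusion of two stream log-likelihoods.

    lam is the audio stream weight; the video stream gets 1 - lam.
    """
    return lam * log_p_audio + (1.0 - lam) * log_p_video

def classify(stream_scores, lam):
    """stream_scores: {class: (log_p_audio, log_p_video)} -> best-scoring class."""
    return max(stream_scores, key=lambda c: fused_score(*stream_scores[c], lam))
```

With lam = 1.0 the decision reduces to audio-only classification, with lam = 0.0 to video-only, which is why the weight must adapt to the noise conditions of each stream.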
Antimodels, Inter- and Intra-Class Distances
• The multi-class problem is recast as multiple two-class classification problems
• If p(x | w) follows a Gaussian distribution N(μ, σ²), the Bayes error is a function of D = |μ1 − μ2| / σ
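Concretely, for two equiprobable classes with common variance σ², the Bayes error equals Φ(−D/2), where Φ is the standard normal CDF. A minimal sketch under those assumptions (the function name is illustrative):

```python
import math

def bayes_error(mu1, mu2, sigma):
    """Bayes error of two equiprobable 1-D Gaussians N(mu1, sigma^2), N(mu2, sigma^2).

    With D = |mu1 - mu2| / sigma, the error is Phi(-D / 2),
    computed here via the complementary error function.
    """
    d = abs(mu1 - mu2) / sigma
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))
```

The error is 0.5 for fully overlapping classes (D = 0) and decreases monotonically as the class separation D grows.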
Experimental Results (1)
• Test case: audio-visual continuous digit recognition task
• Differences from the ideal two-class case:
  – Multi-class problem
  – Recognition instead of classification
• Multiple experiments:
  – Clean video stream
  – Noise-corrupted audio streams at various SNRs
Experimental Results (2)
• Subset of the CUAVE database used:
  – 36 speakers (30 training, 6 testing)
  – 5 sequences of 10 connected digits per speaker
  – Training set: 1500 digits (30×5×10)
  – Test set: 300 digits (6×5×10)
• Features:
  – Audio: 39 features (MFCC_D_A)
  – Visual: 39 features (ROIDCT_D_A, odd columns)
• Multi-stream HMM models:
  – 8-state, left-to-right, whole-digit HMMs
  – Single Gaussian mixture component
  – The AV-HMM uses separate audio and video feature streams
Weights’ distribution
Results (classification)
Inter-/Intra-Class Distances and Recognition
• In each stream, a total inter-/intra-class distance is computed
Results (recognition)
Conclusions
• We proposed a stream-weight computation method for a multi-class classification task, based on theoretical results obtained for the two-class classification problem and making use of an antimodel technique
• We use only the test utterance and the information contained in the trained models
• The results are of interest for the unsupervised estimation of stream weights in multi-stream classification and recognition problems
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
Dynamical System Segment Model
• Segment models directly model the time evolution of the speech parameters
• Based on a linear dynamical system
• The system parameters should guarantee identifiability, controllability, observability, and stability
• Only simple matrix topologies have been studied up to now
x_{k+1} = F x_k + w_k
y_k = H x_k + v_k
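The state-space equations above can be made concrete with a small data-generation sketch (all names and the pure-Python matrix handling are illustrative; the slides contain no code):

```python
import random

def simulate_lds(F, H, q, r, x0, n):
    """Simulate the linear dynamical system
       x_{k+1} = F x_k + w_k,   y_k = H x_k + v_k
    with F, H given as lists of lists and w, v zero-mean Gaussian
    noises with standard deviations q and r. Returns y_1 .. y_n.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    x = list(x0)
    ys = []
    for _ in range(n):
        # observation equation: y_k = H x_k + v_k
        ys.append([h + random.gauss(0.0, r) for h in matvec(H, x)])
        # state equation: x_{k+1} = F x_k + w_k
        x = [f + random.gauss(0.0, q) for f in matvec(F, x)]
    return ys
```

Setting q = r = 0 yields the noise-free trajectory, which is a convenient sanity check for the matrix topology chosen for F and H.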
Linear dynamical system with state control:
• The parameters F, B, H have canonical forms (Ljung, “System Identification”)
Generalized forms of the parameter structures:
x_{k+1} = F x_k + B u_k + w_k
y_k = H x_k + v_k
where F, B, and H have sparse canonical (0/1) structures: ones on selected off-diagonal positions of F, free parameters confined to a few rows, and B and H acting as selection matrices.
Parameter Estimation
• Use the EM algorithm to estimate the parameters F, B, P, R
  – We propose a new element-wise parameter estimation algorithm
• For the forward-backward recursions, use the Kalman smoother recursions
The element-wise re-estimation formulas update each free entry f_ij of F as a ratio involving cofactors of the state-noise covariance P and the smoothed second-order state statistics (sums of x_k x_k^T over the sequence) produced by the Kalman smoother.
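The smoothed statistics feeding the E-step come from the Kalman smoother recursions mentioned above. A minimal, hedged sketch of the forward filter plus the Rauch-Tung-Striebel backward pass (a generic textbook implementation, not the authors' code):

```python
import numpy as np

def kalman_smoother(y, F, H, Q, R, x0, P0):
    """Kalman filter + RTS smoother for
         x_{k+1} = F x_k + w_k,  w_k ~ N(0, Q)
         y_k     = H x_k + v_k,  v_k ~ N(0, R)
    Returns the smoothed state means (E-step statistics for EM).
    """
    n, d = len(y), F.shape[0]
    xf = np.zeros((n, d)); Pf = np.zeros((n, d, d))   # filtered
    xp = np.zeros((n, d)); Pp = np.zeros((n, d, d))   # one-step predicted
    x, P = x0, P0
    for k in range(n):
        xp[k], Pp[k] = x, P                            # prior at time k
        S = H @ P @ H.T + R                            # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
        x = x + K @ (y[k] - H @ x)                     # measurement update
        P = P - K @ H @ P
        xf[k], Pf[k] = x, P
        x, P = F @ x, F @ P @ F.T + Q                  # time update
    xs = xf.copy()
    for k in range(n - 2, -1, -1):                     # RTS backward pass
        J = Pf[k] @ F.T @ np.linalg.inv(Pp[k + 1])     # smoother gain
        xs[k] = xf[k] + J @ (xs[k + 1] - xp[k + 1])
    return xs
```

In the full EM algorithm the smoother would also return the smoothed covariances and cross-covariances, which enter the element-wise updates of F; only the means are shown here to keep the sketch short.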
Experiments with Artificial Data
• Experiment description:
  – Select random system parameters (using the canonical matrix topology)
  – Generate artificial data from the system
  – Estimate the parameters from the artificial data
• Criteria for evaluating the system:
  – The log-likelihood of the observations increases per EM iteration
  – The parameter estimation error decreases per EM iteration
• Setup: no state control; F is 3×3; observation vector size 3×1; 3 rows with free parameters; 1000 samples
Model Training on Speech Data
• Aurora 2 database, 77 training sentences
• Word models with different numbers of states, based on the phonetic transcription
• State alignments produced using HTK

Segment models (states per word):
  2: oh
  4: two, eight
  6: one, three, four, five, six, nine, zero
  8: seven
Speech Segment Modeling
• Classification process:
  – Keep the true word boundaries fixed (digit-level alignments produced by an HMM)
  – Apply a suboptimal search-and-pruning algorithm: keep the 11 most probable word histories for each word in the sentence
  – Classification is based on maximizing the likelihood
• Test set:
  – Aurora 2, test A, subway sentences; 1000 test sentences
  – Different noise levels (clean; SNR 20, 15, 10, 5 dB)
  – Front end extracts 14-dimensional static features (HTK standard front end), with 2 feature configurations:
    • 12 cepstral coefficients + C0 + energy
    • + first- and second-order derivatives (δ, δδ)
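The pruning step, keeping the 11 most probable word histories, can be sketched as a simple history beam over fixed word boundaries (the names and the score layout are assumptions; the slides do not show the actual search):

```python
import heapq

def nbest_decode(segment_scores, vocab, beam=11):
    """Suboptimal search with word-history pruning (hedged sketch).

    segment_scores: list over word positions; segment_scores[t][w] is the
    log-likelihood of word w on segment t (word boundaries assumed fixed).
    Only the `beam` most probable word histories are kept per position.
    """
    hyps = [(0.0, [])]                      # (log-score, word history)
    for scores in segment_scores:
        extended = [(s + scores[w], hist + [w]) for s, hist in hyps for w in vocab]
        hyps = heapq.nlargest(beam, extended, key=lambda h: h[0])
    return hyps[0][1]                       # most likely word sequence
```

With fixed boundaries and context-independent scores the beam is exact; the pruning only matters once histories influence later segment scores, as in the segment-model search described above.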
Classification Results
• Comparison of segment-model and HTK HMM classification (% accuracy)
• Same front-end configuration, same alignments; both models trained on clean data

AURORA Subway   HMM (HTK)             Segment Models
                MFCC, E    +δ, δδ     MFCC, E    +δ, δδ
Clean           97.19%     97.57%     97.53%     97.61%
SNR 20          90.91%     95.71%     93.23%     95.12%
SNR 15          80.09%     91.76%     87.91%     91.13%
SNR 10          57.68%     81.93%     76.29%     82.69%
SNR 5           36.01%     64.24%     54.87%     63.56%
Conclusions and Future Work
• Without derivatives, segment models significantly outperform HMMs, particularly under highly noisy conditions
• When derivatives are used, the two models perform similarly
• Future work:
  – Use formants and other articulatory features to initialize the state vectors
  – Examine different dimensions of the state vector
  – Extend to a non-linear dynamical system: use an extended Kalman filter and derive the EM re-estimation formulas for the non-linear case
Outline
• Long-Term Research
  – Audio-Visual Processing (WP1)
  – Segment Models (WP1)
  – Bayes’ Optimal Adaptation (WP2)
• Research for the Platforms
  – New Features and Fusion
• Integration on Year 2 Platforms
  – Mobile Platform
  – Fixed Platform
MAP versus Bayes Optimal
• MAP adaptation techniques derive from Bayes optimal classification
  – Assumption: the posterior is peaked around the most probable model
  – It is not optimal
• Bayes optimal adaptation is based on a weighted average over the posteriors
  – Better performance with less adaptation data
  – But: computationally expensive; analytical solutions are hard to find; approximations must be considered
Bayes Optimal Adaptation
• Bayes optimal classification is based on:
  p(x_t | s_t, X_a) = ∫ p(x_t | θ, s_t) p(θ | X_a, s_t) dθ
• Assuming θ denotes a Gaussian component, this becomes a sum over a subset Θ of N Gaussians:
  p(x_t | s_t, X_a) ≈ Σ_{θ_n ∈ Θ} p(x_t | θ_n, s_t) p(θ_n | X_a, s_t)
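The Gaussian-sum approximation above can be evaluated numerically; a minimal sketch for 1-D components (names are illustrative, and the posteriors are taken as given):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_optimal_likelihood(x, components, posteriors):
    """Approximate p(x | s, X_a) by a posterior-weighted sum over the
    subset of N Gaussians: components is a list of (mu, var) pairs and
    posteriors holds the weights p(theta_n | X_a, s)."""
    return sum(w * gaussian(x, mu, var)
               for (mu, var), w in zip(components, posteriors))
```

When the posterior collapses onto a single component, the sum reduces to the MAP case described on the previous slide; spreading the weight over several components is what distinguishes the Bayes optimal approach.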
Our Approach
• To obtain the N Gaussians of Θ:
  – Step 1: Cluster the Gaussian mixtures associated with context-dependent models sharing a common central phone
  – Step 2: From the extended Gaussian mixture, choose the N least distant Gaussians from each Gaussian component of the SI Gaussian mixture
• Bayes optimal classification then becomes:
  p(x_t | s_t, X_a) ≈ Σ_{i=1..M} c_i(s_t) N(x_t; m_i, S_i) p_i(X_a | s_t)
[Figure: Gaussian size versus number of mixture components (1 … M) for Mixture 1 and Mixture 2]
• For example, based on the entropy-based distance between the Gaussians, the least distant Gaussians (shown in gray) are clustered together
• The clustering can be performed on an element or sub-vector basis, thus increasing the degrees of freedom
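The slides name an entropy-based distance without giving its formula. As a stand-in, a symmetric KL divergence between 1-D Gaussians can illustrate the "N least distant Gaussians" selection of Step 2 (all names are assumptions):

```python
import math

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence KL( N(mu0, var0) || N(mu1, var1) ) for 1-D Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def sym_kl(g0, g1):
    """Symmetrized KL distance between two (mu, var) Gaussians."""
    return kl_gauss(*g0, *g1) + kl_gauss(*g1, *g0)

def n_closest(reference, candidates, n):
    """Pick the n candidate Gaussians least distant from the reference."""
    return sorted(candidates, key=lambda g: sym_kl(reference, g))[:n]
```

For equal variances the symmetric KL reduces to the squared mean difference over the variance, so this selection agrees with the D = |μ1 − μ2|/σ separation measure used earlier in the presentation.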
Adaptation Configuration
• Baseline trained on the WSJ database
• Adaptation data:
  – Spoke3 WSJ task, non-native speakers
  – 5 male and 5 female speakers
  – 20 adaptation sentences per speaker
  – 40 test sentences per speaker
• Experiments are performed for different numbers of associated mixtures (associations)
Adaptation Results (% WER)

Speaker                Bayes’ Adaptation           Baseline
                       5 Assoc.    6 Assoc.
Male (4n0)             51.52%      47.65%          59.28%
Male (4n3)             43.27%      41.98%          51.72%
Male (4n5)             33.13%      31.48%          36.30%
Male (4n9)             34.48%      33.43%          28.96%
Male (4na)             26.66%      26.22%          28.72%
Total male %WER        37.87%      36.15%          40.99%
Female (4n1)           74.96%      74.47%          81.01%
Female (4n4)           58.18%      58.18%          60.12%
Female (4n8)           34.16%      35.99%          30.85%
Female (4nb)           40.31%      39.38%          39.06%
Female (4nc)           40.23%      41.68%          42.97%
Total female %WER      49.56%      49.94%          50.80%
Total Results and Conclusions

                Bayes’ Adaptation           Baseline
                5 Assoc.    6 Assoc.
Total %WER      43.71%      43.04%          45.89%

• Small improvements are obtained compared to the baseline
• The number of associations significantly influences adaptation performance
• The optimal number of associations depends on the baseline models and the adaptation data → choose the associations dynamically