
HIWIRE Progress Report

Technical University of Crete

Speech Processing and Dialog Systems Group

Presenters: Alex Potamianos (WP1), Vassilis Diakoloukas (WP2)


Outline

Work package 1
• Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
• Audio-Visual ASR: Baseline
• Feature extraction and combination
• Segment models for ASR
• Blind Source Separation for multi-microphone ASR

Work package 2
• Adaptation
• Data collection

Baseline

Baseline performance completed
• Aurora 2 on HTK
• Aurora 3 on HTK
• Aurora 4 on HTK

Lattices for Aurora 4

Baseline performance ongoing
• WSJ1 (Decipher)
• DMHMMs (Decipher)

Aurora 2 Database

Based on TIdigits downsampled to 8 kHz

Noise artificially added at several SNRs

3 sets of noises
• A: subway, babble, car, exhibition hall
• B: restaurant, street, airport, train station
• C: subway, street (with different frequency characteristics)

Two training conditions
• Training on clean data
• Multi-condition training on noisy data
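As a rough illustration of what "noise artificially added at several SNRs" means, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target value before adding it. This is a minimal sketch under simplifying assumptions: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, power is measured over the whole signal, and no channel filtering or speech-level estimation is modelled, so it is not the procedure used to build the Aurora 2 distribution.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to the requested SNR in dB.

    Assumption: both inputs are equal-length lists of samples and power is
    the plain mean square over the whole signal.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain that brings the noise power down (or up) to p_speech / 10^(snr/10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of `noisy` relative to the clean `speech` it was built from."""
    p_speech = sum(s * s for s in speech)
    p_residual = sum((y - s) ** 2 for s, y in zip(speech, noisy))
    return 10.0 * math.log10(p_speech / p_residual)
```

Calling `mix_at_snr(speech, noise, snr)` once per target level (20, 15, 10, 5, 0 and -5 dB in the result tables) would give one noisy copy of an utterance per condition.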

Aurora 2 Database

8440 training sentences

1001 test sentences per test set

Three front-end configurations
• HTK default
• WI007 (Aurora 2 distribution)
• WI008 (thanks to Prof. Segura)

Aurora 2: Clean training

HTK default Front-End

[Figure: Accuracy vs SNR (Clean Training). X axis: SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB). Y axis: accuracy, 0 to 100. Series: Test Set A, Test Set B, Test Set C, Overall.]

Aurora 2: Multi-Condition training

HTK default Front-End

[Figure: Accuracy vs SNR (Multi-Condition Training). Y axis: accuracy, 0 to 100. Series: Test Set A, Test Set B, Test Set C, Overall.]

Aurora 2: Clean vs Multi-Condition Training

[Figure: Overall Aurora 2 Accuracy vs SNR. Y axis: accuracy, 0 to 100. Series: Multi-Condition Training, Clean Training.]

Aurora 2 Front End Comparison: Clean Training

[Figure: Accuracy vs SNR (Clean Training). Y axis: accuracy, 0 to 100. Series: WI008 FE, WI007 FE, HTK FE.]

Front End Comparison: Multi-Condition Training

[Figure: Accuracy vs SNR (Multi-Condition Training). Y axis: accuracy, 0 to 100. Series: WI008 FE, WI007 FE, HTK FE.]

Aurora 3 Database

5 languages
• Finnish
• German
• Italian
• Spanish
• Danish

3 noise conditions
• quiet
• low noise (low)
• high noise (high)

2 recording modes
• close-talking microphone (ch0)
• hands-free microphone (ch1)

Aurora 3 Database

3 experimental setups
• Well-Matched (WM)
  • 70% of all utterances in “quiet, low, high” conditions were used for training
  • remaining 30% were used for testing
• Medium Mismatched (MM)
  • 100% of hands-free recordings from “quiet” and “low” for training
  • 100% of hands-free recordings from “high” for testing
• High Mismatched (HM)
  • 70% of close-talking recordings from all noise conditions for training
  • 30% of hands-free recordings from “low” and “high” for testing

Baseline Aurora 3 performance

[Figure: AURORA 3 Performance (Spanish + Italian). X axis: SPAN_WM, SPAN_MM, SPAN_HM, ITAL_WM, ITAL_MM, ITAL_HM. Y axis: accuracy, 0 to 100. Series: WI007, WI008.]

Baseline Aurora 3 performance

[Figure: AURORA 3 Performance. Y axis: accuracy, 0 to 100. Series: WI007, WI008.]

Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007-TUC       90,53   72,5    30,35   86,88   73,72   42,23   90,58   79,06   74,24
WI007-UGR       92,74   80,51   40,53   92,94   80,31   51,55   91,2    81,04   73,17
TRAIN(#sent.)   1778    561     889     3392    1607    1696    2032    997     1007
TEST(#sent.)    770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007-TUC       79,62   49,29   33,15   93,64   82,02   39,84
WI007-UGR       87,28   67,32   39,37   93,64   82,02   39,84
TRAIN(#sent.)   3440    1254    1720    2951    1245    1720
TEST(#sent.)    1474    204     658     1309    405     626


[Figure: AURORA 3 baseline accuracy, TUC-UGR comparison. Y axis: accuracy, 0 to 100. Series: WI007-TUC, WI007-UGR.]


[Figure: AURORA 3 results. Y axis: 0 to 120. Series: WI008-TUC, WI008-UGR.]

Aurora 4 Database

Based on the WSJ phase 0 collection

5000-word vocabulary

7138 training utterances (ARPA evaluation)

2 recording microphones

6 different noises artificially added
• Car, Babble, Restaurant, Street, Airport, Train Station

Aurora 4 Training Data Sets

3 training conditions (Clean, Multi-condition, Noisy)

Clean training
• 7138 utterances (as in the ARPA evaluation)

Multi-condition training
• 7138 utterances
  • 3569 utterances (Sennheiser)
    • 893 (no noise added)
    • 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
  • 3569 utterances (2nd mic)
    • 893 (no noise added)
    • 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)

Aurora 4 Test Sets

14 test sets, in 2 sizes: small (166 utterances) and large (330 utterances)

• SET 1: 330 utterances (Sennheiser microphone)
• SET 2: 330 utterances (Sennheiser mic; Noise 1 added at SNRs between 5 and 15 dB)
• SET 3: 330 utterances (Sennheiser mic; Noise 2 added at SNRs between 5 and 15 dB)
• …
• SET 7: 330 utterances (Sennheiser mic; Noise 6 added at SNRs between 5 and 15 dB)
• SET 8: 330 utterances (2nd microphone)
• SET 9: 330 utterances (2nd mic; Noise 1 added at SNRs between 5 and 15 dB)
• SET 10: 330 utterances (2nd mic; Noise 2 added at SNRs between 5 and 15 dB)
• …
• SET 14: 330 utterances (2nd mic; Noise 6 added at SNRs between 5 and 15 dB)

Lattices

Obtained from the SONIC recognizer
• Real-time decoding for the WSJ 5k task
• State-of-the-art performance (8% WER)

Lattices obtained from clean models

Three lattice sizes: small, medium, large

Fixed branching factor for each lattice size (small = 2.5, medium = 4, large = 5.5)

Speed-up factor compared to HTK decoding: x100, x50, x10 (small, medium, and large lattices respectively)

Baseline Aurora 4 with Lattices

[Figure: AURORA 4 (Small Lattice). X axis: test sets 1 to 14. Y axis: accuracy, 0 to 100. Series: Clean, Multi, Noisy.]


[Figure: AURORA 4 (Medium Lattice). X axis: test sets 1 to 14. Y axis: accuracy, 0 to 100. Series: Clean, Multi, Noisy.]

Baseline Aurora 4 (Comparing Lattices)

[Figure: Clean training comparison, No vs Small vs Medium Lattices. X axis: test sets 1 to 14 and average. Y axis: accuracy, 0 to 100. Series: No Lattices, Small Lattices, Medium Lattices.]

Aurora 4 Baseline: Conclusions on Lattices

Lattices speed up recognition
• Medium-size lattice is ~60 times faster
• Small-size lattice is ~108 times faster

Problem: lattices improve performance on the noisy test sets

Be careful when using lattices in mismatched conditions (clean training, noisy data)!

Solution:
• two sets of lattices: matched and mismatched

Audio-Visual ASR: Database

Subset of the CUAVE database used:
• 36 speakers (30 training, 6 testing)
• 5 sequences of 10 connected digits per speaker
• Training set: 1500 digits (30x5x10)
• Test set: 300 digits (6x5x10)

The CUAVE database also contains more complex data sets: speaker moving around, speaker showing profile, continuous digits, two speakers (to be used in future evaluations)

CUAVE Database Speakers

Audio-Visual ASR: Feature Extraction

Lip region of interest (ROI) tracking
• A fixed-size ROI is detected using template matching
• The ROI minimizes the RGB Euclidean distance to a given ROI template
• The ROI template is selected from the 1st frame of each speaker
• Continuity constraint: search within a 20x20 pixel window around the previous frame's ROI (does not work for rapid speaker movements)
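The tracking steps above can be sketched as an exhaustive template search restricted to a window around the previous ROI position. This is only an illustrative sketch: the function names are hypothetical, frames are assumed to be nested lists of `(r, g, b)` tuples, and the 20x20 window is interpreted here as plus or minus 10 pixels around the previous top-left corner, which the slides do not specify.

```python
def rgb_distance(frame, template, top, left):
    """Squared RGB Euclidean distance between the template and a frame patch."""
    total = 0
    for r, row in enumerate(template):
        for c, (tr, tg, tb) in enumerate(row):
            fr, fg, fb = frame[top + r][left + c]
            total += (fr - tr) ** 2 + (fg - tg) ** 2 + (fb - tb) ** 2
    return total


def track_roi(frame, template, prev_top, prev_left, window=20):
    """Return (top, left) of the best-matching patch near the previous ROI.

    Minimizing the squared distance gives the same position as minimizing
    the Euclidean distance itself.
    """
    height, width = len(frame), len(frame[0])
    t_h, t_w = len(template), len(template[0])
    half = window // 2
    best = None
    for top in range(max(0, prev_top - half),
                     min(height - t_h, prev_top + half) + 1):
        for left in range(max(0, prev_left - half),
                          min(width - t_w, prev_left + half) + 1):
            d = rgb_distance(frame, template, top, left)
            if best is None or d < best[0]:
                best = (d, top, left)
    return best[1], best[2]
```

Because the search never leaves the window, a lip region that moves further than the window between frames cannot be found, which is consistent with the stated failure on rapid speaker movements.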

Audio-Visual ASR: Feature Extraction

Features extracted from ROI
• ROI is transformed to grayscale
• ROI is decimated to a 16x16 pixel region
• 2D separable DCT is applied to the 16x16 pixel region
• Upper-left 6x6 region is kept (excluding the first coefficient)
• The 35-dimensional feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps
• First and second derivatives in time are computed using a 6-frame window (feature size 105)
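A minimal sketch of the static part of this front end is given below: a separable 2D DCT on a 16x16 grayscale block, keeping the upper-left 6x6 coefficients without the first one, which yields 36 - 1 = 35 values. The orthonormal DCT-II scaling and the row-major ordering of the kept coefficients are assumptions (the slides do not state either), and the function names are hypothetical. Appending first and second time derivatives would then give 3 x 35 = 105 features.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence (scaling is an assumption)."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


def roi_features(gray_roi, keep=6):
    """Upper-left keep x keep DCT coefficients, minus the first one."""
    coeffs = dct_2d(gray_roi)
    feats = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    return feats[1:]  # drop the first (DC) coefficient
```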

Sanity check: unsupervised k-means clustering of ROI results in …

Experiments

Recognition experiment:

• Open-loop digit grammar (50 digits per utterance, no endpointing)

Classification experiment:

• Single-digit grammar (endpointed digits based on provided segmentation)

Models

Features:
• Audio: 39 features (MFCC_D_A)
• Visual: 105 features (ROIDCT_D_A)
• Audio-Visual: 39+35 features (MFCC_D_A+ROIDCT)

HMM models
• 8-state, left-to-right, whole-digit HMMs with no state skipping
• Single Gaussian mixture
• Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
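One common way to combine separate streams with weights, sketched below, is to raise each stream's state likelihood to its weight, i.e. to add weighted log-likelihoods. The sketch assumes a single diagonal-covariance Gaussian per stream and state; the diagonal covariance, the dictionary layout of `state`, and the function names are illustrative assumptions rather than details taken from the slides.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def multistream_log_likelihood(audio, video, state, w_audio=1.0, w_video=1.0):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    With w_audio = w_video = 1 this is the equal-weight (1,1) setting.
    """
    log_a = log_gaussian_diag(audio, state["audio_mean"], state["audio_var"])
    log_v = log_gaussian_diag(video, state["video_mean"], state["video_var"])
    return w_audio * log_a + w_video * log_v
```

Lowering `w_video` relative to `w_audio` reduces the influence of the weaker visual stream on the state score.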

Results (Word Accuracy)

Data
• Training: 1500 digits (30 speakers)
• Testing: 300 digits (6 speakers)

                 Audio   Visual   AudioVisual
Recognition      98%     26%      78%
Classification   99%     46%      85%

Future Work

Multi-mixture models

Front-end (NTUA)
• Tracking algorithms
• Feature extraction

Feature Combination
• Feature integration
• Feature weighting




Feature extraction and combination

Noise Robust Features (NTUA) – m12

AM-FM Features (NTUA) – m12

Feature combination – m12

Supra-segmental features (see also segment models) – m18




Segment Models

Baseline system

Supra-segmental features

• Phone Transition modeling – m12

• Prosody modeling – m18

• Stress modeling – m18

Parametric modeling of feature trajectories

Dynamical system modeling

Combine with HMMs




Blind Source Separation (Mokios, Sidiropoulos)

Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data

Collecting spatial covariance matrix estimates which are sufficiently separated in time:

$$R(t_k) = A\,D(t_k)\,A^{H} + \sigma^{2} I, \qquad k = 1, \ldots, K \tag{2}$$

Assumptions
• uncorrelated speaker signals and noise
• D(t) is a diagonal matrix of speaker powers for measurement period t
• σ² denotes noise power (estimated from silence intervals)
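To make the covariance model concrete, the sketch below builds R = A D A^H + σ² I for one measurement period from a mixing matrix, a vector of speaker powers and a noise power. It is only an illustration of the model: the function name, the list-of-lists matrix layout and the example values are hypothetical, and the PARAFAC decomposition itself is not shown.

```python
def covariance(mixing, powers, noise_power):
    """Return R = A diag(powers) A^H + noise_power * I.

    `mixing` is an M x S complex matrix A (M microphones, S speakers),
    `powers` holds the S diagonal entries of D for one measurement period.
    """
    m, s = len(mixing), len(mixing[0])
    cov = [[0j] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            cov[i][j] = sum(mixing[i][k] * powers[k] * mixing[j][k].conjugate()
                            for k in range(s))
        cov[i][i] += noise_power  # white noise only adds to the diagonal
    return cov


# Hypothetical 2-microphone, 2-speaker example with two measurement periods.
A = [[1 + 0j, 1j], [1 + 0j, -1j]]
R_stack = [covariance(A, d, 0.5) for d in ([2.0, 3.0], [4.0, 1.0])]
```

Stacking the K matrices, as in `R_stack`, gives the three-way array that a PARAFAC decomposition would operate on; the speaker powers must change between periods for the stack to carry separating information.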




Acoustic Model Adaptation

Adaptation method:
• Bayes’ optimal classification

Acoustic models:
• Discrete Mixture HMMs

Bayes optimal classification

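Classifier decision for a test data vector x_test:

$$c(x_{test}) = \arg\max_{c_j} \; p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j)$$

Choose the class that results in the highest value of the predictive density, with θ denoting the model parameters:

$$p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j) = \int p(x_{test} \mid \theta)\, p(\theta \mid x_1^j, x_2^j, \ldots, x_N^j)\, d\theta$$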

Bayes optimal versus MAP

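Assumption: the posterior is sufficiently peaked around the most probable point

MAP approximation:

$$p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j) \approx p(x_{test} \mid \theta_{MAP})$$

θ_MAP is the set of parameters that maximizes the posterior:

$$\theta_{MAP} = \arg\max_{\theta} \; p(\theta \mid x_1, x_2, \ldots, x_N) = \arg\max_{\theta} \; \{\, p(x_1, x_2, \ldots, x_N \mid \theta)\, p(\theta) \,\}$$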
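The difference between the two rules can be seen in a toy case where the integral has a closed form. The sketch below assumes a scalar Gaussian class model with known variance `sigma2` and a Gaussian prior on the unknown mean (mean `mu0`, variance `tau2`); these modelling choices and all names are illustrative assumptions and are not the DMHMM setting of the slides. The Bayes predictive density is then Gaussian with variance `sigma2` plus the posterior variance of the mean, while the MAP plug-in keeps variance `sigma2`.

```python
import math


def posterior(data, sigma2, mu0, tau2):
    """Posterior mean and variance of the class mean given its training data."""
    n = len(data)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + sum(data) / sigma2)
    return post_mean, post_var


def log_normal(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def bayes_predictive_logp(x, data, sigma2, mu0, tau2):
    """Integrates over the mean: variance grows by the posterior variance."""
    mean, var = posterior(data, sigma2, mu0, tau2)
    return log_normal(x, mean, sigma2 + var)


def map_plugin_logp(x, data, sigma2, mu0, tau2):
    """Plugs in the single most probable mean and ignores its uncertainty."""
    mean, _ = posterior(data, sigma2, mu0, tau2)
    return log_normal(x, mean, sigma2)


def classify(x, class_data, logp, **prior):
    """Pick the class whose score for x is highest."""
    return max(class_data, key=lambda c: logp(x, class_data[c], **prior))


# Toy data: class "a" has one training sample, class "b" has twenty.
prior = dict(sigma2=1.0, mu0=1.5, tau2=100.0)
toy = {"a": [0.0], "b": [3.0] * 20}
bayes_choice = classify(1.55, toy, bayes_predictive_logp, **prior)
map_choice = classify(1.55, toy, map_plugin_logp, **prior)
```

With this toy data the two rules disagree on the test point 1.55: the predictive rule widens the density of the class seen only once and so assigns the point to "a", while the MAP plug-in assigns it to "b". As the amount of data per class grows, the posterior variance shrinks and the two rules coincide, which is the "sufficiently peaked" assumption.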

Why Bayes optimal classification

Optimal classification criterion

The predictions of all the parameter hypotheses are combined

Better discrimination

Less training data

Faster asymptotic convergence to the ML estimate

Why Bayes optimal classification

However:

• Computationally more expensive

• Difficult to find analytical solutions

• ....hence some approximations should still be considered

Discrete-Mixture HMMs (Digalakis et al. 2000)

It is based on sub-vector quantization

Introduces a new form of observation distributions
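As a hedged illustration of such an observation distribution, the sketch below quantizes each sub-vector with its own codebook and scores the resulting codeword indices with a mixture of discrete distributions, taking a product over sub-vectors inside each mixture component. The exact form used in the cited work may differ; the function names, the nearest-neighbour quantizer and the table layout are assumptions made for this example.

```python
def nearest_codeword(subvector, codebook):
    """Index of the codebook centroid closest to the sub-vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(subvector, codebook[i])))


def quantize(x, partition, codebooks):
    """Split x into sub-vectors (lists of indices) and quantize each one."""
    return [nearest_codeword([x[i] for i in idx], cb)
            for idx, cb in zip(partition, codebooks)]


def dm_observation_prob(codes, weights, tables):
    """Mixture of discrete distributions over sub-vector codewords.

    tables[m][l][code] is the probability of `code` for sub-vector l under
    mixture component m; sub-vectors are treated as independent within a
    component, and the mixture ties them together.
    """
    total = 0.0
    for weight, component in zip(weights, tables):
        p = weight
        for l, code in enumerate(codes):
            p *= component[l][code]
        total += p
    return total
```

In this form the adaptable quantities are the partition, the codebook sizes and centroids, the mixture weights and the per-codeword probabilities.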

DMHMM benefits (Digalakis et al. 2000)

Speech Recognition performance driven quantization scheme

Quantization of the acoustic space in sufficient detail

Mixtures capture the correlation between sub-vectors

Well-matched in client-server applications

Comparable performance to continuous HMMs

Faster decoding speeds

DMHMM parameters that could be adapted

Partitioning into sub-vectors
• How many sub-vectors
• Which MFCCs to form each sub-vector

Bit-allocation
• Optimize bit-allocation based on adaptation data

Discrete Mixture Weights

Centroids of codebooks

Centroid observation probabilities




TUC Non-Native Recordings

10 speakers (6 male, 4 female)

Fluency in English:
• 4 excellent
• 5 good to very good
• 1 satisfactory

Speaker pronunciation:
• 1 from Cyprus
• 3 from Northern Greece
• 1 from Ionian Islands
• 2 from Athens area
• 1 from Crete
• 1 from Central Greece

EXTRA SLIDES

Prior Work Overview

MLST

Constr. Est. Adapt.

MAP (Bayes) Adapt.

Genones

Segment Models

VTLN

Combinations

Robust Features

HIWIRE Work Proposal

Adaptation: Bayes optimal classification

Audio-Visual ASR: Baseline experiments

Microphone Arrays: Speech/Noise Separation

Feature Selection: AM-FM Features

Acoustic Modeling: Segment Models

Aurora 2 Performance with HTK FE (Clean Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   98,83   98,97   98,81   99,14   98,94   98,83   98,97   98,81   99,14   98,94   99,02   98,97   99      98,95
20 dB   96,96   89,96   96,84   96,2    94,99   89,19   95,77   90,07   94,38   92,35   94,47   95,19   94,83   93,9
15 dB   92,91   73,43   89,53   91,85   86,93   74,39   88,27   76,89   83,62   80,79   87,63   89,69   88,66   84,82
10 dB   78,72   49,06   66,24   75,1    67,28   52,72   66,75   53,15   59,61   58,06   75,19   75,27   75,23   65,18
5 dB    53,39   27,03   32,8    43,51   39,18   29,57   38,15   30,69   29,71   32,03   52,84   48,85   50,85   38,65
0 dB    27,3    11,73   13,27   15,98   17,07   11,7    18,68   15,84   12,25   14,62   26,01   21,64   23,83   17,44
-5 dB   12,62   4,96    8,35    7,65    8,4     5,04    10,07   8,08    8,49    7,92    12,1    10,7    11,4    8,81
Avg.    65,82   50,73   57,98   61,35   58,97   51,63   59,52   53,36   55,31   54,96   63,89   62,9    63,4    58,25
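A quick arithmetic check on the Avg. row of this table: for the Subway column it matches the plain mean over all seven conditions (Clean through -5 dB), not the mean over 20 dB to 0 dB only, which the Avg. rows of the other three Aurora 2 tables in these slides appear to use. The sketch below reproduces both numbers; the helper names are hypothetical and the cell values are copied from the Subway column.

```python
def parse_cell(cell):
    """Convert a decimal-comma cell such as '27,3' to a float."""
    return float(cell.replace(",", "."))


def mean_accuracy(cells):
    """Mean of a list of decimal-comma cells, rounded to two decimals."""
    values = [parse_cell(c) for c in cells]
    return round(sum(values) / len(values), 2)


# Subway column, HTK FE, clean training: Clean, 20, 15, 10, 5, 0, -5 dB.
subway = ["98,83", "96,96", "92,91", "78,72", "53,39", "27,3", "12,62"]

all_conditions = mean_accuracy(subway)    # Clean through -5 dB
snr_20_to_0 = mean_accuracy(subway[1:6])  # 20 dB through 0 dB only
```

Here `all_conditions` reproduces the tabulated 65,82, while `snr_20_to_0` comes out at 69,86, so averages from this table are not directly comparable with the Avg. rows of the other tables.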

Aurora 2 Performance with HTK FE (Multi-Condition Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   98,59   98,52   98,48   98,55   98,54   98,59   98,52   98,48   98,55   98,54   98,65   98,52   98,59   98,55
20 dB   97,64   97,61   97,85   96,98   97,52   96,56   97,46   97,17   96,64   96,96   97,05   96,43   96,74   97,14
15 dB   96,75   96,8    97,64   96,58   96,94   94,72   95,92   95,62   95,25   95,38   95,46   95,5    95,48   96,02
10 dB   94,38   95,22   95,65   93,12   94,59   90,97   94,2    92,78   92,35   92,58   92,35   91,9    92,13   93,29
5 dB    88,42   87,67   86,17   86,95   87,3    81,85   85,34   84,91   82,91   83,75   81,46   81,86   81,66   84,75
0 dB    65,67   61,03   50,82   61,8    59,83   56,83   60,22   64,36   54,21   58,91   45,16   54,05   49,61   57,42
-5 dB   26,01   26,18   19,15   22,49   23,46   22,6    26,3    27,65   18,88   23,86   18,61   25,54   22,08   23,34
Avg.    88,57   87,67   85,63   87,09   87,24   84,19   86,63   86,97   84,27   85,52   82,3    83,95   83,12   85,72

Aurora 2 Performance with WI008 FE (Clean Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   99,08   99,03   99,05   99,23   99,1    99,08   99,03   99,05   99,23   99,1    99,02   99,03   99,03   99,08
20 dB   97,88   98,25   98,36   97,81   98,08   98,07   97,64   98,42   98,43   98,14   97,36   97,67   97,52   97,99
15 dB   96,38   96,74   97,52   96,7    96,84   95,33   96,58   97,05   96,76   96,43   95,3    95,74   95,52   96,41
10 dB   92,26   91,99   95,29   92,59   93,03   89,87   92,74   93,26   93,86   92,43   90,33   90,75   90,54   92,29
5 dB    83,88   80,68   86,01   84,05   83,66   76,05   83,25   83,54   84,2    81,76   78,88   78,48   78,68   81,9
0 dB    61,93   51,12   66,06   63,5    60,65   50,26   59,7    60,24   62,23   58,11   52,59   52,12   52,36   57,98
-5 dB   31,07   18,95   29,82   33,2    28,26   18,39   29,23   27,32   29,56   26,13   25,15   26,12   25,64   26,88
Avg.    86,47   83,76   88,65   86,93   86,45   81,92   85,98   86,5    87,1    85,37   82,89   82,95   82,92   85,31

Aurora 2 Performance with WI008 FE (Multi-Condition Training)

        A                                       B                                       C
        Subway  Babble  Car     Exhibit Avg.    Restr   Street  Airport Station Avg.    Sub.M.  Str.M.  Avg.    Overall
Clean   99,02   98,82   98,99   99,14   98,99   99,02   98,82   98,99   99,14   98,99   98,99   98,85   98,92   98,98
20 dB   98,62   98,58   98,54   98,24   98,5    98,1    98,13   98,63   98,8    98,42   98,07   97,94   98,01   98,37
15 dB   97,54   97,91   98,42   97,56   97,86   96,93   97,85   98,03   97,69   97,63   97,54   97,73   97,64   97,72
10 dB   95,33   96,07   97,38   95,34   96,03   94,84   95,59   95,91   96,05   95,6    95,58   95,31   95,45   95,74
5 dB    91,43   90,21   90,93   90,1    90,67   87,14   90,39   91,44   90,16   89,78   88,92   87,52   88,22   89,82
0 dB    75,28   68,71   80,7    76      75,17   65,55   73,85   75,78   74,08   72,32   66,99   65,63   66,31   72,26
-5 dB   39,85   30,05   40,41   44,99   38,83   28,52   38,88   40,95   41,75   37,53   30,43   30,59   30,51   36,64
Avg.    91,64   90,3    93,19   91,45   91,65   88,51   91,16   91,96   91,36   90,75   89,42   88,83   89,13   90,78

Aurora 3 HTK Settings

Spanish
• Parametrize.csh
  • Set Options = "-F RAW -fs 8 -q -noc0 -swap"
• Config_tr
  • TARGETKIND = MFCC_E_D_A
  • DELTAWINDOW = 3
  • ACCWINDOW = 2
  • ENORMALISE = F
  • HNET:TRACE = 2
  • NATURALREADORDER = T
  • NATURALWRITEORDER = T

Aurora 3 HTK Settings

Italian
• Sdc_it.conf
  • $FE_OPTIONS = "-q -F RAW -fs 8"
• Config
  • TARGETKIND = MFCC_D_A_E
  • HNET:TRACE = 2
  • ACCWINDOW = 2
  • DELTAWINDOW = 3
  • ENORMALISE = F
  • NATURALREADORDER = T
  • NATURALWRITEORDER = T

Baseline Aurora 3 Performance



                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007           90,53   72,5    30,35   86,88   73,72   42,23   90,58   79,06   74,24
WI008           95,62   76,68   86,11   93,47   85,41   81,02   94,49   88,73   89,55
TRAIN(#sent.)   1778    561     889     3392    1607    1696    2032    997     1007
TEST(#sent.)    770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007           79,62   49,29   33,15   93,64   82,02   39,84
WI008           84,99   65,68   63,91   96,58   88,53   88,22
TRAIN(#sent.)   3440    1254    1720    2951    1245    1720
TEST(#sent.)    1474    204     658     1309    405     626








                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI008-TUC       95,62   76,68   86,11   93,47   85,41   81,02   94,49   88,73   89,55
WI008-UGR       96,09   80,92   86,61   96,64   93,92   91,55   95,11   90,84   91,25
TRAIN(#sent.)   1778    561     889     3392    1607    1696    2032    997     1007
TEST(#sent.)    770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI008-TUC       84,99   65,68   63,91   96,58   88,53   88,22
WI008-UGR       93,37   81,49   79,59   96,71   92,53   89
TRAIN(#sent.)   3440    1254    1720    2951    1245    1720
TEST(#sent.)    1474    204     658     1309    405     626


Small Lattice Size

Test set 1      2      3      4      5      6      7      8      9      10     11     12     13     14     Avg.
Clean    88,36  85,67  74,36  73,44  66,41  74,59  63,87  86,56  80,77  68,43  64,75  55,31  70,98  59,7   72,37
Multi    86,81  86,85  85,78  85,34  85,56  85,89  84,42  86,85  86,52  83,98  82,5   81,33  84,64  81,84  84,88
Noisy    87,81  86,96  85,71  83,61  83,09  85,6   81,8   87     85,97  81,58  80,48  76,51  82,65  77,48  83,3
Average  87,66  86,49  81,95  80,8   78,35  82,03  76,7   86,8   84,42  78     75,91  71,05  79,42  73,01  80,19


Medium Lattice Size

Test set 1      2      3      4      5      6      7      8      9      10     11     12     13     14     Avg.
Clean    87,92  84,71  72,89  72,78  65,12  73,78  62,91  85,67  79,45  66,08  63,68  53,86  69,07  58,31  71,16
Multi    85,97  85,52  84,79  83,83  83,9   84,24  83,24  86,19  84,97  82,65  81,18  80,63  82,84  80,29  83,59
Noisy    87,33  85,78  84,42  82,28  81,58  84,16  80,88  86,7   85,45  81,14  78,67  74,22  82,21  76,65  82,25
Average  87,07  85,34  80,7   79,63  78,87  80,73  75,68  86,19  83,29  76,62  74,51  69,57  78,04  71,75  79,14


