HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos (WP1) Vassilis Diakoloukas (WP2)

Post on 19-Dec-2015


Page 1: HIWIRE Progress Report Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos (WP1) Vassilis Diakoloukas (WP2)

HIWIRE Progress Report

Technical University of Crete
Speech Processing and Dialog Systems Group

Presenters: Alex Potamianos (WP1), Vassilis Diakoloukas (WP2)


Outline

Work package 1
• Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
• Audio-Visual ASR: Baseline
• Feature extraction and combination
• Segment models for ASR
• Blind Source Separation for multi-microphone ASR

Work package 2
• Adaptation
• Data collection


Baseline

Baseline performance completed
• Aurora 2 on HTK
• Aurora 3 on HTK
• Aurora 4 on HTK

Lattices for Aurora 4

Baseline performance ongoing
• WSJ1 (Decipher)
• DMHMMs (Decipher)


Aurora 2 Database

Based on TIdigits, downsampled to 8 kHz
Noise artificially added at several SNRs
3 sets of noises
• A: subway, babble, car, exhibition hall
• B: restaurant, street, airport, train station
• C: subway, street (with different frequency characteristics)

Two training conditions
• Training on clean data
• Multi-condition training on noisy data
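
As a rough illustration of what "noise artificially added at several SNRs" means, the sketch below scales a noise signal so that the mixture reaches a requested SNR. It is a minimal sketch, not the Aurora 2 tooling: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, signals are plain Python lists, and any filtering or level-measurement steps of the actual database preparation are not modelled.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Return (speech + gain * noise, gain) so that the mix has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain

def measured_snr_db(speech, noise, gain):
    """SNR in dB of speech against the scaled noise (same length as speech)."""
    p_speech = sum(s * s for s in speech)
    p_noise = sum((gain * n) ** 2 for n in noise[:len(speech)])
    return 10 * math.log10(p_speech / p_noise)

if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.1 * i) for i in range(1000)]   # stand-in for a speech signal
    noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a noise recording
    for snr in (20, 15, 10, 5, 0, -5):
        _, g = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noise, g), 6))
```

The power here is measured over the whole signal; measuring it over speech-active regions only would give a different gain.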


Aurora 2 Database

8440 training sentences
1001 test sentences per test set
Three front-end configurations
• HTK default
• WI007 (Aurora 2 distribution)
• WI008 (thanks to Prof. Segura)


Aurora 2: Clean training

HTK default front-end

[Figure: Accuracy (0 to 100) vs. SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB), clean training. Series: Test Set A, Test Set B, Test Set C, Overall.]


Aurora 2: Multi-Condition training

HTK default front-end

[Figure: Accuracy (0 to 100) vs. SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB), multi-condition training. Series: Test Set A, Test Set B, Test Set C, Overall.]


Aurora 2: Clean vs Multi-Condition Training

[Figure: Overall Aurora 2 accuracy (0 to 100) vs. SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB). Series: Multi-Condition Training, Clean Training.]


Aurora 2 Front End Comparison: Clean Training

[Figure: Accuracy (0 to 100) vs. SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB), clean training. Series: WI008 FE, WI007 FE, HTK FE.]


Front End Comparison: Multi-Condition Training

[Figure: Accuracy (0 to 100) vs. SNR (Clean, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB), multi-condition training. Series: WI008 FE, WI007 FE, HTK FE.]


Aurora 3 Database

5 languages
• Finnish
• German
• Italian
• Spanish
• Danish

3 noise conditions
• quiet
• low noise (low)
• high noise (high)

2 recording modes
• close-talking microphone (ch0)
• hands-free microphone (ch1)


Aurora 3 Database

3 experimental setups
• Well-Matched (WM)
  • 70% of all utterances in the “quiet”, “low” and “high” conditions were used for training
  • the remaining 30% were used for testing
• Medium Mismatched (MM)
  • 100% of hands-free recordings from “quiet” and “low” for training
  • 100% of hands-free recordings from “high” for testing
• High Mismatched (HM)
  • 70% of close-talking recordings from all noise conditions for training
  • 30% of hands-free recordings from “low” and “high” for testing


Baseline Aurora 3 performance

[Figure: Aurora 3 performance (Spanish + Italian), accuracy (0 to 100) for SPAN_WM, SPAN_MM, SPAN_HM, ITAL_WM, ITAL_MM, ITAL_HM. Series: WI007, WI008.]


Baseline Aurora 3 performance

[Figure: Aurora 3 performance, accuracy (0 to 100). Series: WI007, WI008.]


Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007-TUC       90,53   72,5    30,35   86,88   73,72   42,23   90,58   79,06   74,24
WI007-UGR       92,74   80,51   40,53   92,94   80,31   51,55   91,2    81,04   73,17
TRAIN (#sent.)  1778    561     889     3392    1607    1696    2032    997     1007
TEST (#sent.)   770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007-TUC       79,62   49,29   33,15   93,64   82,02   39,84
WI007-UGR       87,28   67,32   39,37   93,64   82,02   39,84
TRAIN (#sent.)  3440    1254    1720    2951    1245    1720
TEST (#sent.)   1474    204     658     1309    405     626


Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)

[Figure: Aurora 3 baseline accuracy (0 to 100), TUC - UGR comparison. Series: WI007-TUC, WI007-UGR.]


Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)

[Figure: Aurora 3 results, accuracy (axis 0 to 120). Series: WI008-TUC, WI008-UGR.]


Aurora 4 Database

Based on the WSJ phase 0 collection
5000-word vocabulary
7138 training utterances (as in the ARPA evaluation)
2 recording microphones
6 different noises artificially added
• Car, Babble, Restaurant, Street, Airport, Train Station


Aurora 4 Training Data Sets

3 training conditions
• (Clean – MultiCondition – Noisy)

Clean training
• 7138 utterances (as in the ARPA evaluation)

Multicondition training
• 7138 utterances
  • 3569 utterances (Sennheiser)
    • 893 (no noise added)
    • 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)
  • 3569 utterances (2nd mic)
    • 893 (no noise added)
    • 2676 (1 out of 6 noises added at SNRs between 10 and 20 dB)


Aurora 4 Test Sets

14 test sets
2 sizes: small (166 utts) and large (330 utts)

• SET 1: 330 utt. (Sennheiser microphone)
• SET 2: 330 utt. (Sennheiser mic; Noise 1 added at SNRs between 5 and 15 dB)
• SET 3: 330 utt. (Sennheiser mic; Noise 2 added at SNRs between 5 and 15 dB)
• …
• SET 7: 330 utt. (Sennheiser mic; Noise 6 added at SNRs between 5 and 15 dB)
• SET 8: 330 utt. (2nd microphone)
• SET 9: 330 utt. (2nd mic; Noise 1 added at SNRs between 5 and 15 dB)
• SET 10: 330 utt. (2nd mic; Noise 2 added at SNRs between 5 and 15 dB)
• …
• SET 14: 330 utt. (2nd mic; Noise 6 added at SNRs between 5 and 15 dB)


Lattices

Obtained from the SONIC recognizer
• Real-time decoding for the WSJ 5k task
• State-of-the-art performance (8% WERR)

Lattices obtained from clean models

Three lattice sizes: small, medium, large

Fixed branching factor for each lattice size (small=2.5, medium=4, large=5.5)

Speed-up factor compared to HTK decoding: x100, x50, x10


Baseline Aurora 4 with Lattices

[Figure: Aurora 4 (small lattice), accuracy (0 to 100) per test set (1 to 14). Series: Clean, Multi, Noisy.]


Baseline Aurora 4 with Lattices

[Figure: Aurora 4 (medium lattice), accuracy (0 to 100) per test set (1 to 14). Series: Clean, Multi, Noisy.]


Baseline Aurora 4 (Comparing Lattices)

[Figure: Clean training comparison, no vs. small vs. medium lattices. Accuracy (0 to 100) per test set (1 to 14, plus Avg.). Series: No Lattices, Small Lattices, Medium Lattices.]


Aurora 4 Baseline: Conclusions on Lattices

Lattices speed up recognition
• Medium-size lattice is ~60 times faster
• Small-size lattice is ~108 times faster

Problem: improved performance on the noisy test sets

Be careful when using lattices in mismatched conditions (clean training, noisy data)!

Solution:
• two sets of lattices: matched, mismatched


Audio-Visual ASR: Database

Subset of the CUAVE database used:
• 36 speakers (30 training, 6 testing)
• 5 sequences of 10 connected digits per speaker
• Training set: 1500 digits (30x5x10)
• Test set: 300 digits (6x5x10)

The CUAVE database also contains more complex data sets: speaker moving around, speaker showing profile, continuous digits, two speakers (to be used in future evaluations)


CUAVE Database Speakers


Audio-Visual ASR: Feature Extraction

Lip region of interest (ROI) tracking
• A fixed-size ROI is detected using template matching
• The ROI minimizes the RGB Euclidean distance to a given ROI template
• The ROI template is selected from the first frame of each speaker
• Continuity constraint: search within a 20x20 pixel window around the previous frame's ROI (does not work for rapid speaker movements)
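
The tracking steps above can be sketched as an exhaustive template search. This is an illustrative sketch only, not the group's implementation: `rgb_distance_sq` and `track_roi` are hypothetical names, frames are nested lists of `(r, g, b)` tuples, and reading the "20x20 pixel window" as offsets of up to 10 pixels in each direction around the previous ROI position is an assumption.

```python
def rgb_distance_sq(frame, template, top, left):
    """Squared RGB Euclidean distance between the template and the frame patch at (top, left)."""
    total = 0
    for dy, row in enumerate(template):
        for dx, (tr, tg, tb) in enumerate(row):
            fr, fg, fb = frame[top + dy][left + dx]
            total += (fr - tr) ** 2 + (fg - tg) ** 2 + (fb - tb) ** 2
    return total

def track_roi(frame, template, prev_top, prev_left, search=20):
    """Return the (top, left) ROI position that best matches the template.

    Only positions within search/2 pixels of the previous ROI position are
    tried (assumed reading of the continuity constraint). Minimizing the
    squared distance selects the same position as minimizing the distance.
    """
    h, w = len(template), len(template[0])
    max_top, max_left = len(frame) - h, len(frame[0]) - w
    half = search // 2
    best = None
    for top in range(max(0, prev_top - half), min(max_top, prev_top + half) + 1):
        for left in range(max(0, prev_left - half), min(max_left, prev_left + half) + 1):
            d = rgb_distance_sq(frame, template, top, left)
            if best is None or d < best[0]:
                best = (d, top, left)
    return best[1], best[2]
```

Because the search never leaves the window, a speaker who moves further than the window between frames is lost, which matches the limitation noted on the slide.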


Audio-Visual ASR: Feature Extraction

Features extracted from the ROI
• ROI is transformed to grayscale
• ROI is decimated to a 16x16 pixel region
• 2D separable DCT is applied to the 16x16 pixel region
• Upper-left 6x6 region is kept (excluding the first coefficient)
• The 35-dimensional feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps
• First and second derivatives in time are computed using a 6-frame window (feature size 105)

Sanity check: unsupervised k-means clustering of ROI results in …
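
The DCT and resampling steps listed on this slide can be sketched as follows. This is a minimal sketch under assumptions: all function names are hypothetical, the DCT is the orthonormal DCT-II, the input is an already grayscaled and decimated 16x16 block, and linear interpolation is assumed for the 29.97 fps to 100 fps resampling (the slide does not say which interpolation was used). The derivative computation is not shown.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT: 1D DCT along every row, then along every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def roi_features(gray16, keep=6):
    """Keep the upper-left keep x keep coefficients and drop the first one (35 values for keep=6)."""
    coeffs = dct_2d(gray16)
    feats = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    return feats[1:]

def resample_linear(frames, src_fps=29.97, dst_fps=100.0):
    """Linearly interpolate a list of feature vectors (at least two) to a new frame rate."""
    duration = (len(frames) - 1) / src_fps
    n_out = int(duration * dst_fps) + 1
    out = []
    for j in range(n_out):
        pos = j * src_fps / dst_fps          # position in source-frame units
        i = min(int(pos), len(frames) - 2)   # left neighbour, clamped at the end
        frac = pos - i
        out.append([(1 - frac) * a + frac * b for a, b in zip(frames[i], frames[i + 1])])
    return out
```

For a constant 16x16 block all energy lands in the first coefficient, so the 35 retained features are zero; this is one way to see why the first coefficient mostly reflects overall brightness.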


Experiments

Recognition experiment:
• Open-loop digit grammar (50 digits per utterance, no endpointing)

Classification experiment:
• Single-digit grammar (endpointed digits based on the provided segmentation)


Models

Features:
• Audio: 39 features (MFCC_D_A)
• Visual: 105 features (ROIDCT_D_A)
• Audio-Visual: 39+35 features (MFCC_D_A+ROIDCT)

HMM models
• 8-state, left-to-right, whole-digit HMMs with no state skipping
• Single Gaussian mixture
• Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
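
One common way to combine separate streams, and the one assumed in this sketch, is to raise each stream's output density to its stream weight, i.e. to add weighted log-likelihoods. The code below is illustrative only: the function names are hypothetical, each stream uses a single diagonal-covariance Gaussian, and the toy dimensions stand in for the 39 audio and 35 visual features.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def multistream_loglik(streams, weights):
    """Weighted sum of per-stream log-likelihoods.

    streams: list of (x, mean, var) tuples, one per stream
    weights: one exponent per stream; (1, 1) gives both streams equal weight
    """
    return sum(w * diag_gauss_logpdf(x, m, v) for w, (x, m, v) in zip(weights, streams))

if __name__ == "__main__":
    audio = ([0.2, -0.1], [0.0, 0.0], [1.0, 1.0])  # toy 2-dimensional audio stream
    video = ([0.5], [0.0], [2.0])                  # toy 1-dimensional visual stream
    print(multistream_loglik([audio, video], (1.0, 1.0)))
    print(multistream_loglik([audio, video], (1.0, 0.3)))  # down-weighted visual stream
```

With weights (1,1) this is the same as treating the streams as independent; unequal weights are what the "feature weighting" item under future work would tune.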


Results (Word Accuracy)

Data
• Training: 1500 digits (30 speakers)
• Testing: 300 digits (6 speakers)

                Audio   Visual  Audio-Visual
Recognition     98%     26%     78%
Classification  99%     46%     85%


Future Work

Multi-mixture models

Front-end (NTUA)
• Tracking algorithms
• Feature extraction

Feature combination
• Feature integration
• Feature weighting


Feature extraction and combination

Noise Robust Features (NTUA) – m12

AM-FM Features (NTUA) – m12

Feature combination – m12

Supra-segmental features (see also segment models) – m18


Segment Models

Baseline system

Supra-segmental features

• Phone Transition modeling – m12

• Prosody modeling – m18

• Stress modeling – m18

Parametric modeling of feature trajectories

Dynamical system modeling

Combine with HMMs


Blind Source Separation (Mokios, Sidiropoulos)

Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data

Collecting spatial covariance matrix estimates which are sufficiently separated in time:

$$R(t_k) = A\,D(t_k)\,A^H + \sigma^2 I, \qquad k = 1, \ldots, K \qquad (2)$$

Assumptions
• uncorrelated speaker signals and noise
• D(t) is a diagonal matrix of speaker powers for measurement period t
• σ² denotes the noise power (estimated from silence intervals)


Acoustic Model Adaptation

Adaptation method:
• Bayes' optimal classification

Acoustic models:
• Discrete-Mixture HMMs


Bayes optimal classification

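Classifier decision for a test data vector x_test:

$$c(x_{test}) = \arg\max_{c_j} \; p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j)$$

Choose the class that results in the highest value:

$$p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j) = \int p(x_{test} \mid \theta)\, p(\theta \mid x_1^j, x_2^j, \ldots, x_N^j)\, d\theta$$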


Bayes optimal versus MAP

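Assumption: the posterior is sufficiently peaked around the most probable point

MAP approximation:

$$p(x_{test} \mid x_1^j, x_2^j, \ldots, x_N^j) \approx p(x_{test} \mid \theta_{MAP})$$

θ_MAP is the set of parameters that maximizes:

$$\theta_{MAP} = \arg\max_{\theta} \; p(\theta \mid x_1, x_2, \ldots, x_N) = \arg\max_{\theta} \; \{\, p(x_1, x_2, \ldots, x_N \mid \theta)\, p(\theta) \,\}$$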
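
To make the difference between the Bayes predictive density and the MAP plug-in concrete, here is a toy sketch for a case with a closed-form answer: a one-dimensional Gaussian with known variance and a Gaussian prior on its mean. This model is an assumption chosen for illustration, not the one used in the project, and all names (`posterior_mean_var`, `bayes_predictive`, `map_plugin`, `classify`) are hypothetical. In this model the posterior over the mean is Gaussian, so θ_MAP equals the posterior mean, and integrating over the posterior only widens the predictive variance.

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_mean_var(data, sigma2, mu0, tau0_2):
    """Posterior N(mu_n, tau_n2) over the mean, given known variance sigma2 and prior N(mu0, tau0_2)."""
    n = len(data)
    tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)
    mu_n = tau_n2 * (mu0 / tau0_2 + sum(data) / sigma2)
    return mu_n, tau_n2

def bayes_predictive(x_test, data, sigma2, mu0, tau0_2):
    """Integral of p(x_test | theta) p(theta | data) d theta, in closed form."""
    mu_n, tau_n2 = posterior_mean_var(data, sigma2, mu0, tau0_2)
    return gauss_pdf(x_test, mu_n, sigma2 + tau_n2)

def map_plugin(x_test, data, sigma2, mu0, tau0_2):
    """p(x_test | theta_MAP): posterior uncertainty about the mean is ignored."""
    mu_n, _ = posterior_mean_var(data, sigma2, mu0, tau0_2)
    return gauss_pdf(x_test, mu_n, sigma2)

def classify(x_test, class_data, score, sigma2=1.0, mu0=0.0, tau0_2=10.0):
    """Pick the class whose training data gives x_test the highest score."""
    return max(class_data, key=lambda c: score(x_test, class_data[c], sigma2, mu0, tau0_2))
```

With only a few samples per class the two scores differ noticeably away from the mean; as the amount of data grows the posterior variance shrinks and the two densities coincide, which is the "sufficiently peaked posterior" assumption behind the MAP approximation.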


Why Bayes optimal classification

Optimal classification criterion
The predictions of all the parameter hypotheses are combined
Better discrimination
Less training data
Faster asymptotic convergence to the ML estimate


Why Bayes optimal classification

However:

• Computationally more expensive

• Difficult to find analytical solutions

• Hence some approximations should still be considered


Discrete-Mixture HMMs (Digalakis et al. 2000)

It is based on sub-vector quantization

Introduces a new form of observation distributions
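
The slide does not spell out the observation distribution. The sketch below assumes the commonly described discrete-mixture form: the feature vector is split into sub-vectors, each sub-vector is vector-quantized with its own codebook, and the state output probability is a weighted sum over mixture components of the product of per-sub-vector discrete probabilities. All names, the data layout and the nearest-centroid quantizer are illustrative assumptions.

```python
def quantize(subvec, codebook):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(subvec, codebook[k])))

def dmhmm_obs_prob(x, partition, codebooks, mix_weights, disc_probs):
    """Assumed form: b(x) = sum_m w_m * prod_l P_m,l(code_l(x)).

    partition:   list of index lists, one per sub-vector
    codebooks:   codebooks[l] is the list of centroids for sub-vector l
    mix_weights: one weight per mixture component
    disc_probs:  disc_probs[m][l][k] = P(codeword k | component m, sub-vector l)
    """
    codes = [quantize([x[i] for i in idx], codebooks[l])
             for l, idx in enumerate(partition)]
    total = 0.0
    for w, probs in zip(mix_weights, disc_probs):
        p = w
        for l, k in enumerate(codes):
            p *= probs[l][k]
        total += p
    return total
```

Within one component the sub-vectors are treated as independent; summing over several components is what lets the model capture correlation between sub-vectors.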


DMHMM benefits (Digalakis et al. 2000)

Speech Recognition performance driven quantization scheme

Quantization of the acoustic space in sufficient detail

Mixtures capture the correlation between sub-vectors

Well-matched in client-server applications

Comparable performance to continuous HMMs

Faster decoding speeds


DMHMM parameters that could be adapted

Partitioning into sub-vectors
• How many sub-vectors
• Which MFCCs form each sub-vector

Bit allocation
• Optimize bit allocation based on adaptation data

Discrete Mixture Weights

Centroids of codebooks

Centroid observation probabilities


TUC Non-Native Recordings

10 speakers (6 male, 4 female)

Fluency in English:
• 4 excellent
• 5 good to very good
• 1 satisfactory

Speaker pronunciation:
• 1 from Cyprus
• 3 from Northern Greece
• 1 from Ionian Islands
• 2 from the Athens area
• 1 from Crete
• 1 from Central Greece


EXTRA SLIDES


Prior Work Overview

MLST
Constr. Est. Adapt.
MAP (Bayes) Adapt.
Genones
Segment Models
VTLN
Combinations
Robust Features


HIWIRE Work Proposal

• Adaptation: Bayes optimal classification
• Audio-Visual ASR: baseline experiments
• Microphone arrays: speech/noise separation
• Feature selection: AM-FM features
• Acoustic modeling: segment models


Aurora 2 Performance with HTK FE (Clean Training)

        Set A                                        Set B                                        Set C
        Subway   Babble   Car      Exhibit  Avg.     Restr    Street   Airport  Station  Avg.     Sub.M.   Str.M.   Avg.     Overall
Clean   98,83    98,97    98,81    99,14    98,94    98,83    98,97    98,81    99,14    98,94    99,02    98,97    99       98,95
20 dB   96,96    89,96    96,84    96,2     94,99    89,19    95,77    90,07    94,38    92,35    94,47    95,19    94,83    93,9
15 dB   92,91    73,43    89,53    91,85    86,93    74,39    88,27    76,89    83,62    80,79    87,63    89,69    88,66    84,82
10 dB   78,72    49,06    66,24    75,1     67,28    52,72    66,75    53,15    59,61    58,06    75,19    75,27    75,23    65,18
5 dB    53,39    27,03    32,8     43,51    39,18    29,57    38,15    30,69    29,71    32,03    52,84    48,85    50,85    38,65
0 dB    27,3     11,73    13,27    15,98    17,07    11,7     18,68    15,84    12,25    14,62    26,01    21,64    23,83    17,44
-5 dB   12,62    4,96     8,35     7,65     8,4      5,04     10,07    8,08     8,49     7,92     12,1     10,7     11,4     8,81
Avg.    65,82    50,73    57,98    61,35    58,97    51,63    59,52    53,36    55,31    54,96    63,89    62,9     63,4     58,25
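
The slides do not say which rows the Avg. row is taken over. The short check below, with the Subway columns transcribed using decimal points, shows that in this table (HTK FE, clean training) the Avg. value matches the mean over all seven rows, whereas the Subway Avg. in the HTK FE multi-condition table matches the mean over the 20 dB to 0 dB rows only. The helper name `average` is hypothetical.

```python
ROWS = ["Clean", "20 dB", "15 dB", "10 dB", "5 dB", "0 dB", "-5 dB"]

# Subway column, HTK FE, clean training
SUBWAY_CLEAN_TRAINING = [98.83, 96.96, 92.91, 78.72, 53.39, 27.30, 12.62]
# Subway column, HTK FE, multi-condition training
SUBWAY_MULTI_TRAINING = [98.59, 97.64, 96.75, 94.38, 88.42, 65.67, 26.01]

def average(values, labels, keep):
    """Mean of the values whose row label is in keep, rounded to two decimals."""
    picked = [v for v, lab in zip(values, labels) if lab in keep]
    return round(sum(picked) / len(picked), 2)

ALL_ROWS = set(ROWS)
SNR_20_TO_0 = {"20 dB", "15 dB", "10 dB", "5 dB", "0 dB"}

if __name__ == "__main__":
    print(average(SUBWAY_CLEAN_TRAINING, ROWS, ALL_ROWS))     # table lists 65,82
    print(average(SUBWAY_MULTI_TRAINING, ROWS, SNR_20_TO_0))  # table lists 88,57
```

Because the two tables average over different row sets, their Avg. rows should not be compared directly without recomputing one of them.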


Aurora 2 Performance with HTK FE (Multi-Condition Training)

        Set A                                        Set B                                        Set C
        Subway   Babble   Car      Exhibit  Avg.     Restr    Street   Airport  Station  Avg.     Sub.M.   Str.M.   Avg.     Overall
Clean   98,59    98,52    98,48    98,55    98,54    98,59    98,52    98,48    98,55    98,54    98,65    98,52    98,59    98,55
20 dB   97,64    97,61    97,85    96,98    97,52    96,56    97,46    97,17    96,64    96,96    97,05    96,43    96,74    97,14
15 dB   96,75    96,8     97,64    96,58    96,94    94,72    95,92    95,62    95,25    95,38    95,46    95,5     95,48    96,02
10 dB   94,38    95,22    95,65    93,12    94,59    90,97    94,2     92,78    92,35    92,58    92,35    91,9     92,13    93,29
5 dB    88,42    87,67    86,17    86,95    87,3     81,85    85,34    84,91    82,91    83,75    81,46    81,86    81,66    84,75
0 dB    65,67    61,03    50,82    61,8     59,83    56,83    60,22    64,36    54,21    58,91    45,16    54,05    49,61    57,42
-5 dB   26,01    26,18    19,15    22,49    23,46    22,6     26,3     27,65    18,88    23,86    18,61    25,54    22,08    23,34
Avg.    88,57    87,67    85,63    87,09    87,24    84,19    86,63    86,97    84,27    85,52    82,3     83,95    83,12    85,72


Aurora 2 Performance with WI008 FE (Clean Training)

        Set A                                        Set B                                        Set C
        Subway   Babble   Car      Exhibit  Avg.     Restr    Street   Airport  Station  Avg.     Sub.M.   Str.M.   Avg.     Overall
Clean   99,08    99,03    99,05    99,23    99,1     99,08    99,03    99,05    99,23    99,1     99,02    99,03    99,03    99,08
20 dB   97,88    98,25    98,36    97,81    98,08    98,07    97,64    98,42    98,43    98,14    97,36    97,67    97,52    97,99
15 dB   96,38    96,74    97,52    96,7     96,84    95,33    96,58    97,05    96,76    96,43    95,3     95,74    95,52    96,41
10 dB   92,26    91,99    95,29    92,59    93,03    89,87    92,74    93,26    93,86    92,43    90,33    90,75    90,54    92,29
5 dB    83,88    80,68    86,01    84,05    83,66    76,05    83,25    83,54    84,2     81,76    78,88    78,48    78,68    81,9
0 dB    61,93    51,12    66,06    63,5     60,65    50,26    59,7     60,24    62,23    58,11    52,59    52,12    52,36    57,98
-5 dB   31,07    18,95    29,82    33,2     28,26    18,39    29,23    27,32    29,56    26,13    25,15    26,12    25,64    26,88
Avg.    86,47    83,76    88,65    86,93    86,45    81,92    85,98    86,5     87,1     85,37    82,89    82,95    82,92    85,31


Aurora 2 Performance with WI008 FE (Multi-Condition Training)

        Set A                                        Set B                                        Set C
        Subway   Babble   Car      Exhibit  Avg.     Restr    Street   Airport  Station  Avg.     Sub.M.   Str.M.   Avg.     Overall
Clean   99,02    98,82    98,99    99,14    98,99    99,02    98,82    98,99    99,14    98,99    98,99    98,85    98,92    98,98
20 dB   98,62    98,58    98,54    98,24    98,5     98,1     98,13    98,63    98,8     98,42    98,07    97,94    98,01    98,37
15 dB   97,54    97,91    98,42    97,56    97,86    96,93    97,85    98,03    97,69    97,63    97,54    97,73    97,64    97,72
10 dB   95,33    96,07    97,38    95,34    96,03    94,84    95,59    95,91    96,05    95,6     95,58    95,31    95,45    95,74
5 dB    91,43    90,21    90,93    90,1     90,67    87,14    90,39    91,44    90,16    89,78    88,92    87,52    88,22    89,82
0 dB    75,28    68,71    80,7     76       75,17    65,55    73,85    75,78    74,08    72,32    66,99    65,63    66,31    72,26
-5 dB   39,85    30,05    40,41    44,99    38,83    28,52    38,88    40,95    41,75    37,53    30,43    30,59    30,51    36,64
Avg.    91,64    90,3     93,19    91,45    91,65    88,51    91,16    91,96    91,36    90,75    89,42    88,83    89,13    90,78


Aurora 3 HTK Settings

Spanish
• Parametrize.csh
  • Set Options = "-F RAW -fs 8 -q -noc0 -swap"
• Config_tr
  • TARGETKIND = MFCC_E_D_A
  • DELTAWINDOW = 3
  • ACCWINDOW = 2
  • ENORMALISE = F
  • HNET:TRACE = 2
  • NATURALREADORDER = T
  • NATURALWRITEORDER = T


Aurora 3 HTK Settings

Italian
• Sdc_it.conf
  • $FE_OPTIONS = "-q -F RAW -fs 8"
• Config
  • TARGETKIND = MFCC_D_A_E
  • HNET:TRACE = 2
  • ACCWINDOW = 2
  • DELTAWINDOW = 3
  • ENORMALISE = F
  • NATURALREADORDER = T
  • NATURALWRITEORDER = T


Baseline Aurora 3 Performance

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007           90,53   72,5    30,35   86,88   73,72   42,23   90,58   79,06   74,24
WI008           95,62   76,68   86,11   93,47   85,41   81,02   94,49   88,73   89,55
TRAIN (#sent.)  1778    561     889     3392    1607    1696    2032    997     1007
TEST (#sent.)   770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI007           79,62   49,29   33,15   93,64   82,02   39,84
WI008           84,99   65,68   63,91   96,58   88,53   88,22
TRAIN (#sent.)  3440    1254    1720    2951    1245    1720
TEST (#sent.)   1474    204     658     1309    405     626
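
One way to read this table is as a relative reduction in error rate when moving from WI007 to WI008. The sketch below treats 100 minus the listed accuracy as the word error rate, which is an assumption about how the figures were scored; the function name is hypothetical and the Finnish values are transcribed from the table with decimal points.

```python
def relative_wer_reduction(acc_base, acc_new):
    """Relative error-rate reduction in percent, taking WER = 100 - accuracy (assumption)."""
    wer_base = 100.0 - acc_base
    wer_new = 100.0 - acc_new
    return 100.0 * (wer_base - wer_new) / wer_base

FINNISH = {  # setup: (WI007 accuracy, WI008 accuracy)
    "WM": (90.53, 95.62),
    "MM": (72.50, 76.68),
    "HM": (30.35, 86.11),
}

if __name__ == "__main__":
    for setup, (wi007, wi008) in FINNISH.items():
        print(setup, round(relative_wer_reduction(wi007, wi008), 2))
```

On these numbers the largest relative gain for Finnish is in the high-mismatch (HM) setup, where the WI007 accuracy is lowest.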


Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)

                FINNISH                 SPANISH                 GERMAN
FRONT-END       WM      MM      HM      WM      MM      HM      WM      MM      HM
WI008-TUC       95,62   76,68   86,11   93,47   85,41   81,02   94,49   88,73   89,55
WI008-UGR       96,09   80,92   86,61   96,64   93,92   91,55   95,11   90,84   91,25
TRAIN (#sent.)  1778    561     889     3392    1607    1696    2032    997     1007
TEST (#sent.)   770     146     283     1522    850     631     867     241     394

                DANISH                  ITALIAN
FRONT-END       WM      MM      HM      WM      MM      HM
WI008-TUC       84,99   65,68   63,91   96,58   88,53   88,22
WI008-UGR       93,37   81,49   79,59   96,71   92,53   89
TRAIN (#sent.)  3440    1254    1720    2951    1245    1720
TEST (#sent.)   1474    204     658     1309    405     626


Baseline Aurora 4 with Lattices

Small Lattice Size

Test set  1       2       3       4       5       6       7
Clean     88,36   85,67   74,36   73,44   66,41   74,59   63,87
Multi     86,81   86,85   85,78   85,34   85,56   85,89   84,42
Noisy     87,81   86,96   85,71   83,61   83,09   85,6    81,8
Average   87,66   86,49   81,95   80,8    78,35   82,03   76,7

Test set  8       9       10      11      12      13      14      Avg.
Clean     86,56   80,77   68,43   64,75   55,31   70,98   59,7    72,37
Multi     86,85   86,52   83,98   82,5    81,33   84,64   81,84   84,88
Noisy     87      85,97   81,58   80,48   76,51   82,65   77,48   83,3
Average   86,8    84,42   78      75,91   71,05   79,42   73,01   80,19


Baseline Aurora 4 with Lattices

Medium Lattice Size

Test set  1       2       3       4       5       6       7
Clean     87,92   84,71   72,89   72,78   65,12   73,78   62,91
Multi     85,97   85,52   84,79   83,83   83,9    84,24   83,24
Noisy     87,33   85,78   84,42   82,28   81,58   84,16   80,88
Average   87,07   85,34   80,7    79,63   78,87   80,73   75,68

Test set  8       9       10      11      12      13      14      Avg.
Clean     85,67   79,45   66,08   63,68   53,86   69,07   58,31   71,16
Multi     86,19   84,97   82,65   81,18   80,63   82,84   80,29   83,59
Noisy     86,7    85,45   81,14   78,67   74,22   82,21   76,65   82,25
Average   86,19   83,29   76,62   74,51   69,57   78,04   71,75   79,14
