Voicing Features
1
Voicing Features
Horacio Franco, Martin Graciarena, Andreas Stolcke, Dimitra Vergyri, Jing Zheng
STAR Lab, SRI International
2
Phonetically Motivated Features
• Problem:
– Cepstral coefficients fail to capture many discriminative cues.
– Front-end optimized for traditional Mel cepstral features.
– Front-end parameters are a compromise solution for all phones.
3
Phonetically Motivated Features
• Proposal:
– Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends.
– Optimize each specific front-end to improve discrimination.
– Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding.
– General framework for multiple phonetic features. First approach: voicing features.
4
Voicing Features
• Voicing feature algorithms:

1. Normalized peak autocorrelation (PA). For time frame X:

   Rxx(i) = E[X(t) X(t+i)];   PA = max_i { Rxx(i) } / Rxx(0)

   with the max computed in the pitch region 80 Hz to 450 Hz.

2. Entropy of high-order cepstrum (EC) and linear spectrum (ES).

   LSPEC = |DFT(X)|^2;   CEPS = IDFT(log(LSPEC))

   If H is the entropy of Y, with

   P(Y(f)) = |Y(f)|^2 / Σ_f |Y(f)|^2;   H(Y) = −Σ_f P(Y(f)) log(P(Y(f)))

   then EC = H(CEPS) and ES = H(LSPEC), with the entropy computed in the pitch region 80 Hz to 450 Hz.
5
Voicing Features
3. Correlation with template and DP alignment [Arcienega, ICSLP'02].

   The Discrete Logarithmic Fourier Transform (DLFT) of speech signal x(n) over the frequency band [f1, f2]:

   Y(f_i) = Σ_{n=1..N} x(n) w(n) e^{−j ω_i n},   ω_i = 2π f_i,
   f_i = e^{ln(f1) + i·dlnf},   dlnf = (ln(f2) − ln(f1)) / N

   If IT is an impulse train, the template is T = |DLFT(IT)|^2 and the signal DLFT is Y = |DLFT(X)|^2. The correlation for frame j with the template is

   Ryt(i, j) = E[Y(f, j) T(f + i)]

   and the DP-optimal correlation is

   CT_DP = max_i { Ryt(i, j) / Ryt(0, j) }

   with the max computed in the pitch region 80 Hz to 450 Hz.
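A rough sketch of the DLFT on a log-spaced frequency grid, with a simple normalized correlation against a template. The grid size, windowing, and normalization are assumptions, and the DP alignment across frames is omitted:

```python
import numpy as np

def dlft(x, fs, f1=80.0, f2=450.0, nbins=64):
    """Evaluate a windowed Fourier transform on a log-spaced frequency
    grid in [f1, f2]; on this axis a pitch change becomes a translation."""
    n = np.arange(len(x))
    freqs = np.exp(np.linspace(np.log(f1), np.log(f2), nbins))
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / fs)
    return freqs, basis @ (x * np.hamming(len(x)))

def template_correlation(y_frame, template):
    """Normalized cross-correlation of |Y|^2 with the template |T|^2 over
    log-frequency shifts (cf. Ryt(i, j) / Ryt(0, j) above)."""
    y = y_frame - y_frame.mean()
    t = template - template.mean()
    num = np.correlate(y, t, mode="full")
    return num / (np.linalg.norm(y) * np.linalg.norm(t) + 1e-12)
```

For a 200 Hz tone, |Y|^2 peaks at the grid bin nearest 200 Hz; correlating a frame's power spectrum against the impulse-train template then scores how harmonic the frame is, independent of the exact pitch.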
6
Voicing Features
• Preliminary exploration of voicing features:
– Best feature combination: Peak Autocorrelation + Entropy of Cepstrum.
– Complementary behavior of the autocorrelation and entropy features for high and low pitch:
Low pitch: time periods are well separated, therefore the correlation is well defined.
High pitch: harmonics are well separated and the cepstrum is well defined.
7
Voicing Features
• Graph of voicing features. [Figure: voicing feature traces over the phone sequence w er k ay n d ax f s: aw th ax v dh ey ax r]
8
Voicing Features
• Integration of Voicing Features:
1 – Juxtaposing Voicing Features:
• Juxtapose the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD).
• Voicing feature front-end: use the same MFCC frame rate and optimize the temporal window duration.
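The juxtaposition itself is per-frame concatenation; the one subtlety noted above is that the voicing front-end keeps the same 10 ms frame step (so frames align one-to-one) even when its analysis window is longer. A minimal sketch, where the array shapes are assumptions:

```python
import numpy as np

def juxtapose(mfcc_ddd, voicing):
    """Append the voicing features to each MFCC+D+DD frame.
    mfcc_ddd: (n_frames, 39); voicing: (n_frames, 2) -> (n_frames, 41)."""
    assert len(mfcc_ddd) == len(voicing), "front-ends must share a frame rate"
    return np.hstack([mfcc_ddd, voicing])
```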
9
Voicing Features
• Trained on a small Switchboard database (64 hours); tested on dev2001. WER for both sexes.
• Features: MFCC+D+DD, 25.6 ms frame every 10 ms.
• VTL and speaker mean and variance normalization. Genone acoustic model: non-cross-word, MLE-trained, gender-dependent. Bigram LM.

Window Length Optimization         WER
Baseline                           41.4%
Baseline + 2 voicing (25.6 ms)     41.2%
Baseline + 2 voicing (75 ms)       40.7%
Baseline + 2 voicing (87.5 ms)     40.5%
Baseline + 2 voicing (100 ms)      40.4%
Baseline + 2 voicing (112.5 ms)    41.2%
10
Voicing Features
2 – Voiced/Unvoiced Posterior Features:
• Use the posterior voicing probability as a feature, computed from a 2-state HMM. Juxtaposed feature dimension is 40.
• Similar setup as before; males-only results.
• Soft V/UV transitions may not be captured, because the posterior feature behaves like a binary feature.

Recognition Systems              WER
Baseline                         39.2%
Baseline + voicing posterior     39.7%
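One way to realize such a posterior is forward-backward smoothing over a two-state (voiced/unvoiced) HMM. The sketch below assumes per-frame log-likelihoods for the two states are available, e.g. from models of the voicing features; that interface is an assumption, not SRI's exact recipe:

```python
import numpy as np

def vuv_posterior(loglik, log_trans, log_init):
    """Posterior P(voiced | all frames) from a 2-state HMM via
    forward-backward in log space.
    loglik: (n_frames, 2) per-state log-likelihoods;
    log_trans: (2, 2) log transition matrix; log_init: (2,) log priors."""
    n = len(loglik)
    fwd = np.zeros((n, 2))
    bwd = np.zeros((n, 2))
    fwd[0] = log_init + loglik[0]
    for t in range(1, n):  # forward pass
        fwd[t] = loglik[t] + np.logaddexp(fwd[t - 1, 0] + log_trans[0],
                                          fwd[t - 1, 1] + log_trans[1])
    for t in range(n - 2, -1, -1):  # backward pass (bwd[-1] stays 0)
        bwd[t] = np.logaddexp(
            log_trans[:, 0] + loglik[t + 1, 0] + bwd[t + 1, 0],
            log_trans[:, 1] + loglik[t + 1, 1] + bwd[t + 1, 1])
    post = fwd + bwd
    post -= np.logaddexp(post[:, 0], post[:, 1])[:, None]  # normalize
    return np.exp(post[:, 1])  # probability of the "voiced" state
```

With sticky transitions the smoothed posterior saturates toward 0 or 1, which is consistent with the observation above that it behaves like a nearly binary feature.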
11
Voicing Features
3 – Window of Voicing Features + HLDA:
• Juxtapose MFCC features and a window of voicing features around the current frame.
• Apply dimensionality reduction with HLDA; the final feature had 39 dimensions.
• Same setup as before, MFCC+D+DD+3rd diffs. Both sexes.
• Baseline with HLDA is 1.5% abs. better; voicing improves a further 1% abs.

Recognition Systems                       WER
Baseline + HLDA                           39.9%
Baseline + 1 frame, 2 voicing + HLDA      39.5%
Baseline + 5 frames, 2 voicing + HLDA     38.9%
Baseline + 9 frames, 2 voicing + HLDA     39.5%
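Stacking a window of voicing features before the HLDA projection can be sketched as below. The HLDA transform itself is estimated from class statistics and is not shown; the edge padding and the shape comments are assumptions:

```python
import numpy as np

def stack_window(voicing, k):
    """Concatenate voicing features of frames t-k .. t+k for each frame t,
    padding the edges by repetition.
    voicing: (n_frames, d) -> (n_frames, (2k+1)*d)."""
    n = len(voicing)
    padded = np.pad(voicing, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * k + 1)])

# Example: a 9-frame window (k=4) of 2 voicing features gives 18 dims;
# appended to 39 MFCC dims -> 57, which an HLDA matrix A of shape (39, 57)
# would map back to 39:  final = stacked_57 @ A.T
```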
12
Voicing Features
4 – Delta of Voicing Features + HLDA:
• Use delta and delta-delta features of the voicing features instead of a window. Apply HLDA to the juxtaposed feature.
• Same setup as before, MFCC+D+DD+3rd diffs. Males only.
• A possible reason for the lack of gain is that variability in the voicing features produces noisy deltas.
• The HLDA weighting of the "window of voicing features" is similar to an average.
----------------------------------------------------------------------------------
The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.

Recognition Systems                          WER
Baseline + HLDA                              37.5%
Baseline + voicing + delta voicing + HLDA    37.6%
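For reference, the usual regression formula for delta features; delta-delta is obtained by applying the same operation to the deltas. The window half-width k is an assumption:

```python
import numpy as np

def deltas(feat, k=2):
    """Regression-based deltas over a +/-k frame window:
    d[t] = sum_{i=1..k} i * (feat[t+i] - feat[t-i]) / (2 * sum_{i=1..k} i^2),
    with edges padded by repetition. feat: (n_frames, d)."""
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, k + 1))
    n = len(feat)
    d = np.zeros_like(feat, dtype=float)
    for i in range(1, k + 1):
        d += i * (padded[k + i:k + i + n] - padded[k - i:k - i + n])
    return d / denom
```

On a linearly increasing feature the interior deltas recover the slope exactly, while on jittery voicing features the same formula amplifies frame-to-frame noise, which is consistent with the "noisy deltas" explanation above.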
13
Voicing Features
• Voicing Features in the SRI CTS Eval. Sept. '03 System:
• Adaptation of MMIE cross-word models with/without voicing features.
• Used the best configuration of voicing features.
• Trained on full SWBD + CTRANS data; tested on EVAL'02.
• Feature: MFCC+D+DD+3rd diffs. + HLDA.
• Adaptation: 9 transforms, full-matrix MLLR.
• Adaptation hypothesis from: MLE non-cross-word model, PLP front end with voicing features.

Recognition Systems           WER
Baseline EVAL                 25.6%
Baseline EVAL + voicing       25.1%
14
Voicing Features
• Hypothesis Examples:

REF:          OH REALLY WHAT WHAT KIND OF PAPER
HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER
HYP VOICING:  OH REALLY WHAT WHAT KIND OF PAPER

REF:          YOU KNOW HE S JUST SO UNHAPPY
HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY
HYP VOICING:  YOU KNOW HE S JUST SO I WANT HAPPY
15
Voicing Features
• Error analysis:
– In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% had a small WER increase.
– A more detailed study of speaker-dependent performance is still needed.
• Implementation:
– Implemented a voicing feature engine in the DECIPHER system.
– Fast computation: one FFT and two IFFTs per frame for both voicing features.
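The one-FFT/two-IFFT structure follows from the Wiener-Khinchin theorem: a single FFT gives the power spectrum, one inverse FFT of it gives the autocorrelation (for PA), and one inverse FFT of its log gives the real cepstrum (for EC). A sketch, where the zero-padding (which makes the autocorrelation linear rather than circular) is an implementation assumption:

```python
import numpy as np

def voicing_pair(frame):
    """Both voicing ingredients from one FFT and two IFFTs."""
    power = np.abs(np.fft.fft(frame, n=2 * len(frame))) ** 2  # one FFT
    rxx = np.fft.ifft(power).real                    # IFFT #1: autocorrelation
    ceps = np.fft.ifft(np.log(power + 1e-12)).real   # IFFT #2: real cepstrum
    return rxx, ceps
```

The PA and entropy features are then read off these two arrays, so the per-frame cost stays at three transforms regardless of how many lags or bins are examined.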
16
Voicing Features
• Conclusions:
– Explored how to represent and integrate the voicing features for best performance.
– Achieved a 1% abs. (~2% rel.) gain in the first pass (using the small training set), and a >0.5% abs. (2% rel.) gain (using the full training set) in higher rescoring passes of the DECIPHER LVCSR system.
• Future work:
– Further explore feature combination/selection.
– Develop more reliable voicing features; the current features do not always reflect actual voicing activity.
– Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).