Voicing Features
1
Voicing Features
Horacio Franco, Martin Graciarena, Andreas Stolcke, Dimitra Vergyri, Jing Zheng
STAR Lab, SRI International
2
Phonetically Motivated Features
• Problem:
– Cepstral coefficients fail to capture many discriminative cues.
– Front-end optimized for traditional Mel cepstral features.
– Front-end parameters are a compromise solution for all phones.
3
Phonetically Motivated Features
• Proposal:
– Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends.
– Optimize each specific front-end to improve discrimination.
– Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding.
– General framework for multiple phonetic features. First approach: voicing features.
4
Voicing Features
• Voicing feature algorithms:

1. Normalized peak autocorrelation (PA). For time frame X:

   Rxx(i) = E[X(t) X(t+i)];   PA = max_i { Rxx(i) } / Rxx(0)

   with the max computed in the pitch region 80 Hz to 450 Hz.

2. Entropy of high-order cepstrum (EC) and linear spectrum (ES).

   LSPEC = |DFT(X)|^2;   CEPS = IDFT(log(LSPEC))

   If H is the entropy of Y, with

   P(Y(f)) = |Y(f)|^2 / Σ_f |Y(f)|^2;   H(Y) = −Σ_f P(Y(f)) log(P(Y(f)))

   then EC = H(CEPS) and ES = H(LSPEC), with the entropy computed in the pitch region 80 Hz to 450 Hz.
5
Voicing Features
3. Correlation with template and DP alignment [Arcienega, ICSLP'02].

   The Discrete Logarithmic Fourier Transform (DLFT) of speech signal x(n) over the frequency band [f1, f2]:

   Y(f_i) = Σ_{n=1..N} x(n) w(n) e^{−j ω_i n},   ω_i = 2π f_i,
   f_i = e^{ln(f1) + i·dlnf},   dlnf = (ln(f2) − ln(f1)) / N

   If IT is an impulse train, the template is T = |DLFT(IT)|^2 and the signal DLFT is Y = |DLFT(X)|^2. The correlation for frame j with the template is

   Ryt(i, j) = E[Y(f, j) T(f + i)]

   and the DP-optimal correlation is

   CT_DP = max_i { Ryt(i, j) / Ryt(0, j) }

   with the max computed in the pitch region 80 Hz to 450 Hz.
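A rough sketch of the DLFT on a log-spaced frequency grid, with a simple normalized correlation against a template. The grid size, windowing, and normalization are assumptions, and the DP alignment across frames is omitted:

```python
import numpy as np

def dlft(x, fs, f1=80.0, f2=450.0, nbins=64):
    """Evaluate a windowed Fourier transform on a log-spaced frequency
    grid in [f1, f2]; on this axis a pitch change becomes a translation."""
    n = np.arange(len(x))
    freqs = np.exp(np.linspace(np.log(f1), np.log(f2), nbins))
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / fs)
    return freqs, basis @ (x * np.hamming(len(x)))

def template_correlation(y_frame, template):
    """Normalized cross-correlation of |Y|^2 with the template |T|^2 over
    log-frequency shifts (cf. Ryt(i, j) / Ryt(0, j) above)."""
    y = y_frame - y_frame.mean()
    t = template - template.mean()
    num = np.correlate(y, t, mode="full")
    return num / (np.linalg.norm(y) * np.linalg.norm(t) + 1e-12)
```

For a 200 Hz tone, |Y|^2 peaks at the grid bin nearest 200 Hz; correlating a frame's power spectrum against the impulse-train template then scores how harmonic the frame is, independent of the exact pitch.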
6
Voicing Features
• Preliminary exploration of voicing features:
– Best feature combination: Peak Autocorrelation + Entropy of Cepstrum.
– Complementary behavior of the autocorrelation and entropy features for high and low pitch:
Low pitch: time periods are well separated, therefore the correlation is well defined.
High pitch: harmonics are well separated and the cepstrum is well defined.
7
Voicing Features
• Graph of voicing features. [Figure: voicing feature traces over the phone sequence w er k ay n d ax f s: aw th ax v dh ey ax r]
8
Voicing Features
• Integration of Voicing Features:
1 – Juxtaposing Voicing Features:
• Juxtapose the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD).
• Voicing feature front-end: use the same MFCC frame rate and optimize the temporal window duration.
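The juxtaposition itself is per-frame concatenation; the one subtlety noted above is that the voicing front-end keeps the same 10 ms frame step (so frames align one-to-one) even when its analysis window is longer. A minimal sketch, where the array shapes are assumptions:

```python
import numpy as np

def juxtapose(mfcc_ddd, voicing):
    """Append the voicing features to each MFCC+D+DD frame.
    mfcc_ddd: (n_frames, 39); voicing: (n_frames, 2) -> (n_frames, 41)."""
    assert len(mfcc_ddd) == len(voicing), "front-ends must share a frame rate"
    return np.hstack([mfcc_ddd, voicing])
```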
9
Voicing Features
• Trained on a small Switchboard database (64 hours); tested on dev2001. WER for both sexes.
• Features: MFCC+D+DD, 25.6 ms frame every 10 ms.
• VTL and speaker mean and variance normalization. Genone acoustic model: non-cross-word, MLE-trained, gender-dependent. Bigram LM.

Window Length Optimization         WER
Baseline                           41.4%
Baseline + 2 voicing (25.6 ms)     41.2%
Baseline + 2 voicing (75 ms)       40.7%
Baseline + 2 voicing (87.5 ms)     40.5%
Baseline + 2 voicing (100 ms)      40.4%
Baseline + 2 voicing (112.5 ms)    41.2%
10
Voicing Features
2 – Voiced/Unvoiced Posterior Features:
• Use the posterior voicing probability as a feature, computed from a 2-state HMM. Juxtaposed feature dimension is 40.
• Similar setup as before; males-only results.
• Soft V/UV transitions may not be captured, because the posterior feature behaves like a binary feature.

Recognition Systems              WER
Baseline                         39.2%
Baseline + voicing posterior     39.7%
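One way to realize such a posterior is forward-backward smoothing over a two-state (voiced/unvoiced) HMM. The sketch below assumes per-frame log-likelihoods for the two states are available, e.g. from models of the voicing features; that interface is an assumption, not SRI's exact recipe:

```python
import numpy as np

def vuv_posterior(loglik, log_trans, log_init):
    """Posterior P(voiced | all frames) from a 2-state HMM via
    forward-backward in log space.
    loglik: (n_frames, 2) per-state log-likelihoods;
    log_trans: (2, 2) log transition matrix; log_init: (2,) log priors."""
    n = len(loglik)
    fwd = np.zeros((n, 2))
    bwd = np.zeros((n, 2))
    fwd[0] = log_init + loglik[0]
    for t in range(1, n):  # forward pass
        fwd[t] = loglik[t] + np.logaddexp(fwd[t - 1, 0] + log_trans[0],
                                          fwd[t - 1, 1] + log_trans[1])
    for t in range(n - 2, -1, -1):  # backward pass (bwd[-1] stays 0)
        bwd[t] = np.logaddexp(
            log_trans[:, 0] + loglik[t + 1, 0] + bwd[t + 1, 0],
            log_trans[:, 1] + loglik[t + 1, 1] + bwd[t + 1, 1])
    post = fwd + bwd
    post -= np.logaddexp(post[:, 0], post[:, 1])[:, None]  # normalize
    return np.exp(post[:, 1])  # probability of the "voiced" state
```

With sticky transitions the smoothed posterior saturates toward 0 or 1, which is consistent with the observation above that it behaves like a nearly binary feature.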
11
Voicing Features
3 – Window of Voicing Features + HLDA:
• Juxtapose MFCC features and a window of voicing features around the current frame.
• Apply dimensionality reduction with HLDA; the final feature had 39 dimensions.
• Same setup as before, MFCC+D+DD+3rd diffs. Both sexes.
• Baseline with HLDA is 1.5% abs. better; voicing improves a further 1% abs.

Recognition Systems                       WER
Baseline + HLDA                           39.9%
Baseline + 1 frame, 2 voicing + HLDA      39.5%
Baseline + 5 frames, 2 voicing + HLDA     38.9%
Baseline + 9 frames, 2 voicing + HLDA     39.5%
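Stacking a window of voicing features before the HLDA projection can be sketched as below. The HLDA transform itself is estimated from class statistics and is not shown; the edge padding and the shape comments are assumptions:

```python
import numpy as np

def stack_window(voicing, k):
    """Concatenate voicing features of frames t-k .. t+k for each frame t,
    padding the edges by repetition.
    voicing: (n_frames, d) -> (n_frames, (2k+1)*d)."""
    n = len(voicing)
    padded = np.pad(voicing, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * k + 1)])

# Example: a 9-frame window (k=4) of 2 voicing features gives 18 dims;
# appended to 39 MFCC dims -> 57, which an HLDA matrix A of shape (39, 57)
# would map back to 39:  final = stacked_57 @ A.T
```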
12
Voicing Features
4 – Delta of Voicing Features + HLDA:
• Use delta and delta-delta features of the voicing features instead of a window. Apply HLDA to the juxtaposed feature.
• Same setup as before, MFCC+D+DD+3rd diffs. Males only.
• A possible reason for the lack of gain is that variability in the voicing features produces noisy deltas.
• The HLDA weighting of the "window of voicing features" is similar to an average.
----------------------------------------------------------------------------------
The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.

Recognition Systems                          WER
Baseline + HLDA                              37.5%
Baseline + voicing + delta voicing + HLDA    37.6%
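For reference, the usual regression formula for delta features; delta-delta is obtained by applying the same operation to the deltas. The window half-width k is an assumption:

```python
import numpy as np

def deltas(feat, k=2):
    """Regression-based deltas over a +/-k frame window:
    d[t] = sum_{i=1..k} i * (feat[t+i] - feat[t-i]) / (2 * sum_{i=1..k} i^2),
    with edges padded by repetition. feat: (n_frames, d)."""
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, k + 1))
    n = len(feat)
    d = np.zeros_like(feat, dtype=float)
    for i in range(1, k + 1):
        d += i * (padded[k + i:k + i + n] - padded[k - i:k - i + n])
    return d / denom
```

On a linearly increasing feature the interior deltas recover the slope exactly, while on jittery voicing features the same formula amplifies frame-to-frame noise, which is consistent with the "noisy deltas" explanation above.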
13
Voicing Features
• Voicing Features in the SRI CTS Eval. Sept. '03 System:
• Adaptation of MMIE cross-word models with/without voicing features.
• Used the best configuration of voicing features.
• Trained on full SWBD + CTRANS data; tested on EVAL'02.
• Feature: MFCC+D+DD+3rd diffs. + HLDA.
• Adaptation: 9 transforms, full-matrix MLLR.
• Adaptation hypothesis from: MLE non-cross-word model, PLP front end with voicing features.

Recognition Systems           WER
Baseline EVAL                 25.6%
Baseline EVAL + voicing       25.1%
14
Voicing Features
• Hypothesis Examples:

REF:          OH REALLY WHAT WHAT KIND OF PAPER
HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER
HYP VOICING:  OH REALLY WHAT WHAT KIND OF PAPER

REF:          YOU KNOW HE S JUST SO UNHAPPY
HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY
HYP VOICING:  YOU KNOW HE S JUST SO I WANT HAPPY
15
Voicing Features
• Error analysis:
– In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% had a small WER increase.
– A more detailed study of speaker-dependent performance is still needed.
• Implementation:
– Implemented a voicing feature engine in the DECIPHER system.
– Fast computation: one FFT and two IFFTs per frame for both voicing features.
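The one-FFT/two-IFFT structure follows from the Wiener-Khinchin theorem: a single FFT gives the power spectrum, one inverse FFT of it gives the autocorrelation (for PA), and one inverse FFT of its log gives the real cepstrum (for EC). A sketch, where the zero-padding (which makes the autocorrelation linear rather than circular) is an implementation assumption:

```python
import numpy as np

def voicing_pair(frame):
    """Both voicing ingredients from one FFT and two IFFTs."""
    power = np.abs(np.fft.fft(frame, n=2 * len(frame))) ** 2  # one FFT
    rxx = np.fft.ifft(power).real                    # IFFT #1: autocorrelation
    ceps = np.fft.ifft(np.log(power + 1e-12)).real   # IFFT #2: real cepstrum
    return rxx, ceps
```

The PA and entropy features are then read off these two arrays, so the per-frame cost stays at three transforms regardless of how many lags or bins are examined.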
16
Voicing Features
• Conclusions:
– Explored how to represent and integrate the voicing features for best performance.
– Achieved a 1% abs. (~2% rel.) gain in the first pass (using the small training set), and a >0.5% abs. (2% rel.) gain (using the full training set) in higher rescoring passes of the DECIPHER LVCSR system.
• Future work:
– Further explore feature combination/selection.
– Develop more reliable voicing features; the current features do not always reflect actual voicing activity.
– Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).