![Page 1: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/1.jpg)
By the Novel Approaches team, with site leaders:
Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer
“Pushing the Envelope”
A six-month report
![Page 2: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/2.jpg)
Overview
Nelson Morgan, ICSI
![Page 3: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/3.jpg)
The Current Cast of Characters
• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington
• UW: M. Ostendorf, Ö. Çetin
• OGI: H. Hermansky, S. Sivadas, P. Jain
• Columbia: D. Ellis, M. Athineos
• SRI: K. Sönmez
• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi
![Page 4: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/4.jpg)
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front-ends across time/freq
• Modify statistical models to accommodate new front-ends
• Design optimal combination schemes for multiple models
![Page 5: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/5.jpg)
Task 1: Pushing the Envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
[Diagram, OLD: a 10 ms frame along the time axis → estimate of sound identity]
• Solution: Probabilities from multiple time-frequency patches
[Diagram, PROPOSED: patches of up to 1 s along the time axis → ith estimate, kth estimate, nth estimate → information fusion → estimate of sound identity]
![Page 6: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/6.jpg)
Task 2: Beyond Frames…
• Problem: Features & models interact; new features may require different models
[Diagram, OLD: short-term features → conventional HMM]
• Solution: Advanced features require advanced models, free of fixed-frame-rate paradigm
[Diagram, PROPOSED: advanced features → multi-rate, dynamic-scale classifier]
![Page 7: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/7.jpg)
Today’s presentation
• Infrastructure: training, testing, software
• Initial Experiments: pilot studies
• Directions: where we’re headed
![Page 8: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/8.jpg)
Infrastructure
Kemal Sönmez, SRI (SRI/UW/ICSI effort)
![Page 9: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/9.jpg)
Initial Experimental Paradigm
• Focus on a small task to facilitate exploratory work (later move to CTS)
• Choose a task where LM is fixed & plays a minor role (to focus on acoustics)
• Use mismatched train/test data:
To avoid tuning to the task
To facilitate later move to CTS
• Task: OGI Numbers; Train: swbd + macrophone
![Page 10: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/10.jpg)
Hub5 “Short” Training Set
• Composition (total ~60 hours):

| Corpus | Male (hours) | Female (hours) |
|---|---|---|
| callhome | 2.8 | 13.8 |
| switchboard* | 5.9 | 4.3 |
| credit-card | 6.7 | 7.1 |
| macrophone | 12.4 | 5.8 |

\* subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations
• WER 2-4% higher vs. full 250+ hour training
![Page 11: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/11.jpg)
Reduced UW Training Set
• A reduced training set to shorten experiment turn-around time
• Choose training utterances with per-frame likelihood scores close to the training set average
• One quarter of the original training set
• Statistics (gender, data set constituencies) are similar to those of the full training set
• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5)
Data set constituencies:

| Training set | macrophone | callhome | credit-card | other switchboard | male/female |
|---|---|---|---|---|---|
| “short” | 32% | 32% | 12% | 24% | 45/55% |
| Reduced (UW) | 38% | 28% | 12% | 22% | 48/52% |
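
The selection rule for the reduced set (keep the utterances whose per-frame likelihood is closest to the training-set average, about one quarter of the data) can be sketched as follows. This is only an illustration: the function name `select_reduced_set`, the toy scores, and the use of an unweighted mean over utterances are assumptions, not details given on the slide.

```python
import numpy as np

def select_reduced_set(avg_loglik, fraction=0.25):
    """Return indices of the utterances closest to the mean per-frame score.

    avg_loglik: one per-frame log-likelihood score per utterance (hypothetical input).
    fraction:   share of the utterances to keep (the slide uses one quarter).
    """
    avg_loglik = np.asarray(avg_loglik, dtype=float)
    # Distance of each utterance score from the (unweighted) set average.
    distance = np.abs(avg_loglik - avg_loglik.mean())
    n_keep = max(1, int(round(fraction * len(avg_loglik))))
    # Keep the n_keep smallest distances, reported in original order.
    return np.sort(np.argsort(distance, kind="stable")[:n_keep])

# Toy per-utterance scores, invented for the example.
scores = [-70.0, -60.0, -50.0, -61.0, -59.0, -40.0, -80.0, -60.5]
selected = select_reduced_set(scores)
print(selected)
```

A frame-weighted average, or a check that gender and corpus proportions stay balanced after selection, would be natural refinements; the slide does not say which variant was used.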
![Page 12: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/12.jpg)
Development Test Sets
• A “Core-Subset” of OGI’s Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
• “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
• Vocabulary Size: 32 words (digits + eleven, twelve… twenty… hundred… thousand, etc.)

| Data Set Name | Total Utterances | Total Words | Duration (hours) |
|---|---|---|---|
| Numbers95-CS Cross Validation | 357 | 1353 | ~0.2 |
| Numbers95-CS Development | 1206 | 4673 | ~0.6 |
| Numbers95-CS Test | 1227 | 4757 | ~0.6 |
![Page 13: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/13.jpg)
Statistical Modeling Tools
• HTK (Hidden Markov Model Toolkit) for establishing an HMM baseline, debugging
• GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams
Allows direct dependencies across streams
Not limited by single-rate, single-stream paradigm
Rapid model specification/training/testing
• SRI Decipher system for providing lattices to rescore (later in CTS expts)
• Neural network tools from ICSI for posterior probability estimation, other statistical software from IDIAP
![Page 14: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/14.jpg)
Baseline SRI Recognizer for the numbers task
• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling
• Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers]
• Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers]
• Progressive search with lattice recognition and N-best rescoring [To be used in later work]
• Bigram LM
![Page 15: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/15.jpg)
Initial Experiments
Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW
![Page 16: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/16.jpg)
Goals of Initial Experiments
• Establish performance baselines
HMM + standard features (MFCC, PLP)
HMM + current best from ICSI/OGI
• Develop infrastructure for new models
GMTK for multi-stream & multi-rate features
Novel features based on large timespans
Novel features based on temporal fine structure
• Provide fodder for future error analysis
![Page 17: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/17.jpg)
ICSI Baseline experiments
• PLP based - SRI system
• “Tandem” PLP-based ANN + SRI system
• Initial combination approach
![Page 18: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/18.jpg)
Development Baseline: Gender Independent PLP System
Error rates on the Numbers95-CS Test Set:

| Training Set | Word Error Rate | Sentence Error Rate |
|---|---|---|
| Full “Short” Hub5 (85k utterances, ~64.9 hrs) | 3.4% | 10.2% |
| UW Reduced Hub5 (20k utterances, ~18.8 hrs) | 3.8% | 11.4% |
![Page 19: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/19.jpg)
Phonetically Trained Neural Net
• Multi-Layer Perceptron (input, hidden, and output layer)
• Trained using the error-backpropagation technique; outputs interpreted as posterior probabilities of target classes
• Training Targets: 47 mono-phone targets from forced alignment using SRI Eval 2002 system
• Training Utterances: UW Reduced Hub5 Set
• Training Features: PLP12+e+d+dd, mean & variance normalized on a per-conversation-side basis
• MLP Topology:
9 Frame Context Window (4 frames in past + current frame + 4 frames in future)
351 Input Units, 1500 Hidden Units, and 47 Output Units
Total Number of Parameters: ~600k
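
The topology figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that "PLP12+e+d+dd" means 12 PLP coefficients plus energy, each with delta and double-delta (39 values per frame), and that the network has one bias per hidden and output unit; the slide gives only the layer sizes and the rounded total.

```python
# Per-frame feature size: (12 PLP + energy) x (static, delta, double-delta).
per_frame = (12 + 1) * 3            # 39, assumed breakdown of PLP12+e+d+dd
context = 4 + 1 + 4                 # 9-frame context window from the slide
n_in = per_frame * context          # input units
n_hid, n_out = 1500, 47             # hidden and output units from the slide

weights = n_in * n_hid + n_hid * n_out
biases = n_hid + n_out              # assumption: one bias per non-input unit
total = weights + biases
print(n_in, weights, total)
```

Under these assumptions the input size comes out at 351 and the parameter count just under 600k, in line with the numbers on the slide.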
![Page 20: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/20.jpg)
Baseline ICSI Tandem
• Outputs of Neural Net before final softmax non-linearity used as inputs to PCA
• PCA without dimensionality reduction
• 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS test set
![Page 21: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/21.jpg)
Baseline ICSI Tandem+PLP
• PLP stream concatenated with neural net posteriors stream
• PCA reduces dimensionality of posteriors stream to 16 (keeping 95% of overall variance)
• 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS test set
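
A minimal sketch of the Tandem+PLP feature construction described above: fit a PCA on the network outputs, keep enough components to retain 95% of the variance, and append the result to the PLP stream. The function names and the random stand-in data are hypothetical, and the synthetic data will not reproduce the 16 dimensions reported on the slide.

```python
import numpy as np

def pca_fit(x, var_kept=0.95):
    """Return the mean and the leading eigenvectors that retain var_kept of the variance."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigval)[::-1]              # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    frac = np.cumsum(eigval) / eigval.sum()
    k = min(len(eigval), int(np.searchsorted(frac, var_kept)) + 1)
    return mean, eigvec[:, :k]

def tandem_plus_plp(net_out, plp, mean, basis):
    """Project network outputs onto the PCA basis and append them to the PLP features."""
    reduced = (net_out - mean) @ basis
    return np.hstack([plp, reduced])

# Stand-in data: 500 frames, 47 network outputs with 5 dominant directions, 39 PLP values.
rng = np.random.default_rng(0)
net_out = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 47))
net_out += 0.01 * rng.standard_normal((500, 47))
plp = rng.standard_normal((500, 39))

mean, basis = pca_fit(net_out, var_kept=0.95)
features = tandem_plus_plp(net_out, plp, mean, basis)
print(basis.shape, features.shape)
```

In a real system the PCA would be estimated on training data only and then applied unchanged to the test data.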
![Page 22: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/22.jpg)
Word and String Error Rates on Numbers95-CS Test Set
![Page 23: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/23.jpg)
OGI Experiments: New Features in EARS
• Develop on home-grown ASR system (phoneme-based HTK)
• Pass the most promising to ICSI for running in SRI LVCSR system
• So far new features match the performance of the baseline PLP features but do not exceed it
advantage seen in combination with the baseline
![Page 24: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/24.jpg)
Looking to the human auditory system for design inspiration
• Psychophysics
Components within certain frequency range (several critical bands) interact [e.g. frequency masking]
Components within certain time span (a few hundreds of ms) interact [e.g. temporal masking]
• Physiology
2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]
![Page 25: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/25.jpg)
TRAP-based HMM-NN hybrid ASR
[Diagram: mean & variance normalized, Hamming-windowed critical-band trajectory (101-point input) → Multilayer Perceptrons (MLPs) → posterior probabilities of phonemes → search for the best match]
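
The input to each TRAP network can be sketched as below: a 101-point trajectory of one critical band's energy, mean and variance normalized and then Hamming windowed. The function name `trap_input` is hypothetical, and normalizing over the 101-point segment itself (rather than over a longer stretch of speech) is an assumption the slide does not settle.

```python
import numpy as np

def trap_input(band_energy, center, length=101):
    """Cut a temporal pattern of `length` frames centred on `center` from one band."""
    half = length // 2
    if center - half < 0 or center + half >= len(band_energy):
        raise ValueError("not enough context around the centre frame")
    seg = np.asarray(band_energy[center - half:center + half + 1], dtype=float)
    # Mean and variance normalization (here: over the segment, an assumption).
    seg = (seg - seg.mean()) / (seg.std() + 1e-8)
    # Hamming window de-emphasizes frames far from the centre.
    return seg * np.hamming(length)

# Stand-in log-energy trajectory for one critical band (random, for illustration).
rng = np.random.default_rng(0)
band = rng.standard_normal(300)
trap = trap_input(band, center=150)
print(trap.shape)
```

One such vector per critical band would be fed to that band's MLP.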
![Page 26: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/26.jpg)
Feature estimation from linearly transformed temporal patterns
[Diagram labels: transform, MLP (two parallel branches), “? ? ?”, TANDEM, HMM ASR]
![Page 27: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/27.jpg)
Preliminary TANDEM/TRAP results (OGI-HTK)
WER% on OGI numbers, training on UW reduced training set, monophone models

| System | WER % |
|---|---|
| BASELINE | 4.5 |
| TANDEM | 4.1 |
| TANDEM with TRAP | 3.9 |
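
To put the absolute numbers above in perspective, the relative WER reduction over the baseline can be computed directly from the three figures on the slide. Nothing here says whether the differences are statistically significant; the sketch only restates the slide's numbers.

```python
# WER figures copied from the slide.
wer = {"BASELINE": 4.5, "TANDEM": 4.1, "TANDEM with TRAP": 3.9}

base = wer["BASELINE"]
# Relative reduction in percent: 100 * (baseline - system) / baseline.
relative = {
    name: 100.0 * (base - value) / base
    for name, value in wer.items()
    if name != "BASELINE"
}
for name, value in relative.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
```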
![Page 28: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/28.jpg)
Features from more than one critical-band temporal trajectory
Studying KLT-derived basis functions, we observe:
[Diagram labels: average, frequency derivative, “+”, cosine transform]
![Page 29: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/29.jpg)
UW Baseline Experiments
• Constructed an HTK-based HMM system that is competitive with the SRI system
• Replicated the HMM system in GMTK
• Move on to models which integrate information from multiple sources in a principled manner:
Multiple feature streams (multi-stream models)
Different time scales (multi-rate models)
• Focus on statistical models, not on feature extraction
![Page 30: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/30.jpg)
HTK HMM Baseline
• An HTK-based standard HMM system:
3-state triphones with decision-tree clustering
Mixture of diagonal Gaussians as state output distributions
No adaptation, fixed LM
• Dimensions explored:
Front-end: PLP vs. MFCC, VTLN
Gender-dependent vs. gender-independent modeling
• Conclusions:
No significant performance differences
Decided on PLPs, no VTLN, gender-independent models for simplicity
![Page 31: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/31.jpg)
HMM Baselines (cont.)
• Replicated HTK baseline with equivalent results in GMTK

| Tool | WER % (dev) | WER % (test) |
|---|---|---|
| HTK | 3.7 | 3.2 |
| GMTK | 3.7 | 3.0 |

• To reduce experiment turn-around time, wanted to reduce the training set
• For HMMs and Numbers95, three quarters of the training data can be safely ignored:

| Training set | WER % (dev) | WER % (test) |
|---|---|---|
| Full “short” | 3.7 | 3.2 |
| 1/4th (“reduced”) | 3.4 | 3.4 |
![Page 32: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/32.jpg)
Multi-stream Models
• Information fusion from multiple streams of features
• Partially asynchronous state sequences
[Diagram, STATE TOPOLOGY: states of stream X against states of stream Y]
[Diagram, GRAPHICAL MODEL: state seq. of stream X and state seq. of stream Y over feature stream X and feature stream Y]

| Model | WER % (dev) | WER % (test) |
|---|---|---|
| HMM (PLP) | 3.9 | 4.2 |
| multi-stream (PLP+MFCC) | | |
![Page 33: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/33.jpg)
Temporal envelope features (Columbia)
• Temporal fine structure is lost (deliberately) in STFT features:
[Figure: mpgr1-sx419, 0.65 to 0.9 s on a time / sec axis: spectrogram (0 to 8000 Hz, dB scale) and waveform, with 10 ms windows marked]
• Need a compact, parametric description...
![Page 34: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/34.jpg)
Frequency-Domain Linear Prediction (FDLP)
• Extend LPC with LP model of spectrum
TD-LP: $y[n] = \sum_i a_i\, y[n-i]$ → DFT → FD-LP: $Y[k] = \sum_i b_i\, Y[k-i]$
• ‘Poles’ represent temporal peaks:
[Figure: mpgr1-sx419: TDLPC env (60 poles / 300 ms), 0.65 to 0.9 s]
• Features ~ pole bandwidth, ‘frequency’
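
The idea of running linear prediction on a frequency-domain sequence can be sketched as follows. Several choices here are assumptions rather than details from the slide: a DCT-II stands in for the "DFT" step, the predictor is fitted with the autocorrelation method, and the test signal is a synthetic noise burst. The all-pole fit then traces the temporal envelope, so its peak should fall near the burst.

```python
import numpy as np

def dct_ii(x):
    """Plain O(N^2) DCT-II, used as the real-valued frequency-domain sequence."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def lp_coeffs(seq, order):
    """Autocorrelation-method predictor b with seq[k] ~ sum_i b[i-1] * seq[k-i]."""
    full = np.correlate(seq, seq, mode="full")
    r = full[len(seq) - 1:len(seq) + order].copy()   # lags 0..order
    r[0] *= 1.0 + 1e-9                               # tiny regularization
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:order + 1])   # Toeplitz normal equations

def fdlp_envelope(x, order=8):
    """Temporal envelope estimate (one value per sample) and the model poles."""
    n = len(x)
    b = lp_coeffs(dct_ii(x), order)
    # Sample m corresponds to angle pi * (m + 0.5) / n in the DCT-II basis.
    theta = np.pi * (np.arange(n) + 0.5) / n
    lags = np.arange(1, order + 1)
    a_theta = 1.0 - np.exp(-1j * np.outer(theta, lags)) @ b
    poles = np.roots(np.concatenate(([1.0], -b)))    # pole angle ~ time position
    return 1.0 / np.abs(a_theta) ** 2, poles

# Synthetic burst centred on sample 120 of a 400-sample segment.
rng = np.random.default_rng(0)
t = np.arange(400)
x = np.exp(-0.5 * ((t - 120) / 10.0) ** 2) * rng.standard_normal(400)
env, poles = fdlp_envelope(x, order=8)
peak = int(np.argmax(env))
print(peak)
```

In this picture each pole's angle marks where a temporal peak sits and its magnitude how sharp the peak is, which is the sense in which pole bandwidth and 'frequency' could serve as features.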
![Page 35: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/35.jpg)
Preliminary FDLP Results
• Distribution of pole magnitudes for different phone classes (in 4 bands):
[Figure: histograms for /ah/ and /p/ in four panels (0-500 Hz band, 500-1000 Hz band, 1-2 kHz band, 2-4 kHz band); x-axis: -log(1 - |pole|)]
• NN Classifier Frame Accuracies:

| Features | Frame accuracy |
|---|---|
| plp12N | 57.0% |
| plp12N+FDLP4 | 58.2% |
![Page 36: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/36.jpg)
Directions
Dan Ellis, Columbia (SRI/UW/Columbia work)
Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)
![Page 37: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/37.jpg)
Multi-rate Models (UW)
[Diagram: coarse state chain over long-term features and fine state chain over short-term features, linked by cross-scale dependencies (example)]
• Integrate acoustic information from different time scales
• Account for dependencies across scales
• Better robustness against time- and/or frequency-localized interferences
• Reduced redundancy gives better confidence estimates
![Page 38: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/38.jpg)
SRI Directions
• Task 1: Signal-adaptive weighting of time-frequency patches
Basis-entropy based representation
Matching pursuit search for optimal weighting of patches
Optimality based on minimum entropy criterion
• Task 2: Graphical models of patch combinations
Tiling-driven dependency modeling
GM combines across patch selections
Optimality based on information in representation
![Page 39: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/39.jpg)
Data-derived phonetic features (Columbia)
• Find a set of independent attributes to account for phonetic (lexical) distinctions
phones replaced by feature streams
• Will require new pronunciation models
asynchronous feature transitions (no phones)
mapping from phonetics (for unseen words)
Joint work with Eric Fosler-Lussier
![Page 40: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/40.jpg)
ICA for feature bases
• PCA finds decorrelated bases; ICA finds independent bases
• Lexically-sufficient ICA basis set?
[Figure: test/dr1/faks0/sa2, basis vectors plotted against frequency / Bark, with activations over time / labels: d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl]
![Page 41: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/41.jpg)
OGI Directions: Targets in sub-bands
• Initially context-independent and band-specific phonemes
• Gradually shifted to band-specific 6 broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)
• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)
![Page 42: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/42.jpg)
More than one temporal pattern?
[Diagram labels: mean & variance normalized, Hamming-windowed critical-band trajectory (101 dim); KLT1 … KLTn; MLP; MLP]
![Page 43: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/43.jpg)
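Pre-processing by 2-D operators with subsequent TRAP-TANDEM
Operators on the time-frequency plane (axes: frequency, time):
differentiate f, average t: $\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
differentiate t, average f: $\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$
diff upwards, av downwards: $\begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{bmatrix}$
diff downwards, av upwards: $\begin{bmatrix} -2 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 2 \end{bmatrix}$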
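
A rough sketch of how 3x3 operators of this kind could be applied to a time-frequency representation before the TRAP-TANDEM stage. The pairing of labels with matrices follows the order in which they appear on the slide, and the layout (rows = frequency, columns = time), the valid-mode correlation, and the toy input are all assumptions for illustration.

```python
import numpy as np

# The four operators, keyed by the labels used on the slide (pairing assumed).
OPERATORS = {
    "differentiate f, average t": np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]),
    "differentiate t, average f": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
    "diff upwards, av downwards": np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]),
    "diff downwards, av upwards": np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]),
}

def apply_operator(spec, kernel):
    """Valid-mode 2-D correlation of a (frequency x time) array with a 3x3 kernel."""
    rows, cols = spec.shape
    out = np.zeros((rows - 2, cols - 2))
    for u in range(3):
        for v in range(3):
            out += kernel[u, v] * spec[u:u + rows - 2, v:v + cols - 2]
    return out

# Toy input: 6 bands x 10 frames whose value grows linearly with time only.
spec = np.tile(np.arange(10.0), (6, 1))
resp_t = apply_operator(spec, OPERATORS["differentiate t, average f"])
resp_f = apply_operator(spec, OPERATORS["differentiate f, average t"])
print(resp_t[0, 0], resp_f[0, 0])
```

On this ramp the time-differentiating operator responds with a constant while the frequency-differentiating one stays at zero, which is the behaviour the labels suggest.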
![Page 44: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/44.jpg)
IDIAP Directions: Phase AutoCorrelation Features
Traditional features: autocorrelation based. Very sensitive to additive noise, other variations.
Phase AutoCorrelation (PAC): if $R[k],\ k = 0, 1, \ldots, N-1$, represents autocorrelation coeffs derived from a frame of length $N$, the PACs are:
$$P[k] = \cos^{-1}\!\left(\frac{R[k]}{R[0]}\right), \qquad R[0] = \text{frame energy}.$$
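
A small sketch of the PAC computation defined above. The slide does not say how the autocorrelation is computed; a circular autocorrelation via the FFT is assumed here, and the function name and test frame are made up for the example. Because only the ratio R[k]/R[0] enters, the result does not depend on the overall gain of the frame.

```python
import numpy as np

def phase_autocorrelation(frame):
    """P[k] = arccos(R[k] / R[0]) for the circular autocorrelation R of one frame."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.fft(frame)
    # Circular autocorrelation via the power spectrum; r[0] is the frame energy.
    r = np.fft.ifft(np.abs(spectrum) ** 2).real
    # Clip guards against ratios just outside [-1, 1] from rounding.
    return np.arccos(np.clip(r / r[0], -1.0, 1.0))

# Toy frame: two sinusoids over 64 samples (invented for the example).
n = np.arange(64)
frame = np.sin(2 * np.pi * 4 * n / 64) + 0.1 * np.cos(2 * np.pi * 9 * n / 64)
pac = phase_autocorrelation(frame)
print(pac[:4])
```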
![Page 45: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/45.jpg)
Entropy Based Multi-Stream Combination
• Combination of evidence from more than one expert to improve performance
• Entropy as a measure of confidence
• Experts having low entropy are more reliable as compared to experts having high entropy
• Inverse entropy weighting criterion
• Relationship between entropy of the resulting (recombined) classifier and recognition rate
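
One way to read the inverse entropy weighting criterion is sketched below: for each frame, every expert's posterior vector is weighted in proportion to the inverse of its entropy, and the weighted posteriors are summed. The exact weighting formula, the function names, and the two toy experts are assumptions; the slide names the criterion without spelling it out.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def inverse_entropy_combine(streams):
    """Combine posteriors of shape (n_streams, n_frames, n_classes).

    Each stream gets a per-frame weight proportional to 1 / entropy,
    normalized so the weights of all streams sum to one in every frame.
    """
    h = entropy(streams)                          # (n_streams, n_frames)
    w = 1.0 / (h + 1e-6)
    w = w / w.sum(axis=0, keepdims=True)
    combined = np.sum(w[..., None] * streams, axis=0)
    return combined, w

# Two experts, one frame, three classes: one confident, one uninformative.
streams = np.array([
    [[0.90, 0.05, 0.05]],
    [[1 / 3, 1 / 3, 1 / 3]],
])
combined, weights = inverse_entropy_combine(streams)
print(weights[:, 0], combined[0])
```

The confident expert receives the larger weight, so the combined posterior stays close to its estimate.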
![Page 46: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/46.jpg)
ICSI Directions: Posterior Combination Framework
• Combination of Several Discriminative Probability Streams
![Page 47: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/47.jpg)
Improvement of the Combo Infrastructure
• Improve basic features:
Add prosodic features: voicing level, energy continuity,
Improve PLP by further removing the pitch difference among speakers.
• Tandem
Different targets, different training features. E.g.: word boundary.
• Improve TRAP (OGI)
• Combination
Entropy based, accuracy based stream weighting or stream selection.
![Page 48: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/48.jpg)
New types of tandem features
[Diagram: input feature → NN processing → target posterior (possible word/syllable boundary)]
Input feature:
• Traditional or improved PLP
• Spectral continuity
• Voicing, voicing continuity
• Formant continuity feature
• …more
Target posterior:
• Phonemes
• Word/syllable boundary
• Broad phoneme classes
• Manner/place/articulation… etc
![Page 49: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/49.jpg)
Data Driven Subword Unit Generation (IDIAP/ICSI)
• Motivation: Phoneme-based units may not be optimal for ASR.
• Approach (based on speaker segmentation method):
[Flowchart: Initial segmentation: large number of clusters → Is thresholdless BIC-like merging criterion met? → Yes: merge, re-segment, and re-estimate, then test again; No: stop]
![Page 50: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/50.jpg)
Summary
• Staff and tools in place to proceed with core experiments
• Pilot experiments provided coherent substrate for cooperation between 6 sites
• Future directions for individual sites are all over the map, which is what we want
• Possible exploration of collaborations w/MS in this meeting
![Page 51: By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis, Columbia Kemal Sönmez, SRI Mari Ostendorf, UW Hervé](https://reader038.vdocuments.us/reader038/viewer/2022110323/56649d795503460f94a5c783/html5/thumbnails/51.jpg)