Automatic speech recognition using an echo state network

Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida, Gainesville, FL, USA
May 10, 2006

TRANSCRIPT

Page 1: Automatic speech recognition using an echo state network

Automatic speech recognition using an echo state network

Mark D. Skowronski

Computational Neuro-Engineering Lab

Electrical and Computer Engineering

University of Florida, Gainesville, FL, USA

May 10, 2006

Page 2: Automatic speech recognition using an echo state network

CNEL Seminar History

• Ratio spectrum, Oct. 2000
• HFCC, Sept. 2002
• Bats, Dec. 2004
• Electrohysterography, Aug. 2005
• Echo state network, May 2006

Page 3: Automatic speech recognition using an echo state network

Overview

• ASR motivations

• Intro to echo state network

• Multiple readout filters

• ASR experiments

• Conclusions

Page 4: Automatic speech recognition using an echo state network

ASR Motivations

• Speech is the most natural form of communication among humans.

• Human-machine interaction lags behind, relying on tactile interfaces.

• Bottleneck in machine understanding is signal-to-symbol translation.

• Human speech is a “tough” signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception

How to handle the “non”-ness of speech?

Page 5: Automatic speech recognition using an echo state network

ASR State of the Art

• Feature extraction: HFCC
  – Bio-inspired frequency analysis
  – Tailored for statistical models
• Acoustic pattern rec: HMM
  – Piecewise-stationary stochastic model
  – Efficient training/testing algorithms
  – …but several simplistic assumptions
• Language models
  – Use knowledge of language, grammar
  – HMM implementations
  – Machine language understanding still elusive (spam blockers)

Page 6: Automatic speech recognition using an echo state network

Hidden Markov Model

Premier stochastic model of non-stationary time series used for decision making.

Assumptions:

1) Speech is piecewise-stationary process.

2) Features are independent.

3) State duration is exponential.

4) State transition probability is a function of the previous and next states only.

Can we devise a better pattern recognition model?

Page 7: Automatic speech recognition using an echo state network

Echo State Network

• Partially trained recurrent neural network (Herbert Jaeger, 2001)

• Unique characteristics:
  – Recurrent “reservoir” of processing elements, interconnected with random, untrained weights.
  – Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.

Page 8: Automatic speech recognition using an echo state network

ESN Diagram & Equations

x(n) = f(W x(n−1) + W_in u(n))
y(n) = W_out x(n)
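In code, the state update and readout might be sketched as follows (a minimal numpy sketch, assuming a tanh activation; names follow the slide's notation):

```python
import numpy as np

def esn_step(x, u, W, W_in):
    """State update x(n) = f(W x(n-1) + W_in u(n)), with f = tanh."""
    return np.tanh(W @ x + W_in @ u)

def esn_output(x, W_out):
    """Linear readout y(n) = W_out x(n)."""
    return W_out @ x
```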

Page 9: Automatic speech recognition using an echo state network

ESN Matrices

• W_in: untrained, M × M_in matrix
  – Zero-mean, unit-variance, normally distributed
  – Scaled by r_in
• W: untrained, M × M matrix
  – Zero-mean, unit-variance, normally distributed
  – Scaled such that spectral radius r < 1
• W_out: trained by linear regression, M_out × M matrix
  – Closed-form, stable, unique regression solution
  – O(M²) complexity per data point
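A minimal numpy sketch of this construction (the values of r and r_in here are taken from the Mackey-Glass example later in the talk; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, M_in = 60, 1
r, r_in = 0.9, 0.3  # example scalings from the Mackey-Glass slide

# W_in: untrained, zero-mean unit-variance Gaussian entries, scaled by r_in
W_in = r_in * rng.standard_normal((M, M_in))

# W: untrained Gaussian entries, rescaled so the spectral radius equals r
W = rng.standard_normal((M, M))
W *= r / np.max(np.abs(np.linalg.eigvals(W)))
```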

Page 10: Automatic speech recognition using an echo state network

Echo States Conditions

• The network has echo states if x(n) is uniquely determined by the left-infinite input sequence …, u(n−1), u(n).

• x(n) is an “echo” of all previous inputs.

• If f is the tanh activation function:
  – σ_max(W) = ||W|| < 1 guarantees echo states.
  – r = |λ_max(W)| > 1 guarantees no echo states.

Page 11: Automatic speech recognition using an echo state network

ESN Training

• Minimize mean-squared error between y(n) and the desired signal d(n).

Wiener solution:

W_out^T = (Σ_n x(n) x(n)^T)^(−1) (Σ_n x(n) d(n)^T) = R^(−1) p
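A sketch of this closed-form training in numpy, with reservoir states collected as columns of X and desired signals as columns of D; the small ridge term is an added numerical-stability assumption, not part of the slide:

```python
import numpy as np

def train_readout(X, D, ridge=1e-8):
    """Wiener solution for the readout weights.
    X: M x N matrix of reservoir states, D: M_out x N desired signals.
    Returns W_out of shape M_out x M."""
    R = X @ X.T  # autocorrelation: sum of x(n) x(n)^T
    p = X @ D.T  # cross-correlation: sum of x(n) d(n)^T
    # ridge regularization added for numerical stability (an assumption)
    return np.linalg.solve(R + ridge * np.eye(X.shape[0]), p).T
```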

Page 12: Automatic speech recognition using an echo state network

ESN Example: Mackey-Glass

• M = 60 PEs, r = 0.9, r_in = 0.3
• u(n): Mackey-Glass series, 10,000 samples
• d(n) = u(n+1)

Prediction gain, var(u)/var(e):
• Input: 16.3 dB
• Wiener: 45.1 dB
• ESN: 62.6 dB
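Prediction gain as defined here is simply the input-to-error variance ratio in dB, which might be computed as:

```python
import numpy as np

def prediction_gain_db(u, e):
    """Prediction gain 10 log10(var(u)/var(e)) in dB."""
    return 10.0 * np.log10(np.var(u) / np.var(e))
```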

Page 13: Automatic speech recognition using an echo state network

Multiple Readout Filters

• W_out projects the reservoir space to the output space.
• Question: how to divide the reservoir space and use multiple readout filters?
• Answer: a competitive network of filters.
• Question: how to train/test a competitive network of K filters?
• Answer: mimic the HMM.

y_k(n) = W_out,k x(n),  k ∈ [1, K]
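A minimal sketch of the winner-take-all competition among the K readout filters (names are illustrative): the winner is the filter with the smallest squared error against the desired signal.

```python
import numpy as np

def winner(x, d, W_outs):
    """Return the index of the readout filter whose output y_k = W_k x
    has the smallest squared error against the desired signal d."""
    errors = [np.sum((W_k @ x - d) ** 2) for W_k in W_outs]
    return int(np.argmin(errors))
```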

Page 14: Automatic speech recognition using an echo state network

HMM vs. ESN Classifier

                    HMM                             ESN Classifier
Output              Likelihood                      MSE
Architecture        States, left-to-right           States, left-to-right
Minimum element     Gaussian kernel                 Readout filter
Elements combined   GMM                             Winner-take-all
Transitions         State transition matrix         Binary switching matrix
Training            Segmental K-means (Baum-Welch)  Segmental K-means
Discriminatory      No                              Maybe, depends on desired signal

Page 15: Automatic speech recognition using an echo state network

Segmental K-means: Init

For each input x_i(n) and desired d_i(n) for sequence i:
• Divide x, d into equal-sized chunks X_η, D_η (one per state).
• For each n, select k(n) ∈ [1, K] uniformly at random.

After initialization with all sequences:

A_k = Σ_{n: k(n)=k} x_i(n) x_i(n)^T
B_k = Σ_{n: k(n)=k} x_i(n) d_i(n)^T
W_out,k^T = A_k^(−1) B_k,  k ∈ [1, K]

Page 16: Automatic speech recognition using an echo state network

Segmental K-means: Training

• For each utterance:
  – Produce the MSE for each readout filter.
  – Find the Viterbi path through the MSE matrix.
  – Use features from each state to update the auto- and cross-correlation matrices.
• After all utterances: Wiener solution.
• Guaranteed to converge to a local minimum in MSE over the training set.
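The Viterbi step above can be sketched as a left-to-right dynamic program over the per-state MSE matrix. The exact path constraints used in the talk are not shown, so this is an assumed minimal version (start in the first state, end in the last, stay or advance by one state per frame):

```python
import numpy as np

def viterbi_path(mse):
    """mse: S x N matrix of per-state, per-frame errors.
    Returns the minimum-cost left-to-right state assignment."""
    S, N = mse.shape
    cost = np.full((S, N), np.inf)
    back = np.zeros((S, N), dtype=int)
    cost[0, 0] = mse[0, 0]
    for n in range(1, N):
        for s in range(S):
            stay = cost[s, n - 1]
            move = cost[s - 1, n - 1] if s > 0 else np.inf
            back[s, n] = s if stay <= move else s - 1
            cost[s, n] = min(stay, move) + mse[s, n]
    # backtrack from the final state at the last frame
    path = [S - 1]
    for n in range(N - 1, 0, -1):
        path.append(back[path[-1], n])
    return path[::-1]
```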

Page 17: Automatic speech recognition using an echo state network

ASR Example 1

• Isolated English digits “zero” to “nine” from the TI46 corpus: 8 male and 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
• ESN: M = 60 PEs, r = 2.0, r_in = 0.1, 10 word models, various numbers of states and filters per state.
• Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α = 0.95), CMS, Δ+ΔΔ (±4 frames).
• Pre-processing: zero-mean and whitening transform.
• M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
• Two to six training epochs for all models.
• Desired: next frame of 39-dimension features.
• Test: corrupted by additive noise from “real” sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
• Baseline: HMM with identical input features.
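The zero-mean and whitening pre-processing step might be sketched as standard PCA whitening (an assumption about the specific transform used; names are illustrative):

```python
import numpy as np

def whiten(F):
    """F: N x D feature matrix. Returns zero-mean, whitened features
    whose sample covariance is the identity."""
    F = F - F.mean(axis=0)
    cov = np.cov(F, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)       # eigendecomposition of covariance
    return F @ vecs / np.sqrt(vals)        # rotate, then scale each axis
```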

Page 18: Automatic speech recognition using an echo state network

ASR Results, noise free

Number of classification errors out of 518 (smaller is better), ESN (HMM):

          K=1      K=2      K=3     K=4     K=5     K=10
Nst=1     7 (171)  6 (136)  3 (65)  2 (33)  3 (4)   2 (2)
Nst=2     1 (83)   1 (46)   0 (4)   1 (3)   2 (2)   1 (0)
Nst=3     0 (126)  1 (4)    0 (2)   0 (2)   0 (1)   2 (0)
Nst=5     1 (11)   1 (2)    0 (0)   0 (0)   1 (0)   0 (0)
Nst=10    1 (2)    1 (0)    1 (0)   1 (0)   0 (0)   0 (0)
Nst=15    0 (1)    0 (0)    0 (0)   0 (0)   0 (0)   1 (0)
Nst=20    0        0        0       0       0       1

Page 19: Automatic speech recognition using an echo state network

ASR Results, noisy

Average accuracy (%) over all noise sources, 0-20 dB SNR (larger is better), ESN (HMM):

          K=1          K=2          K=3          K=4          K=5          K=10
Nst=1     70.9 (22.4)  70.0 (29.7)  74.6 (45.6)  74.3 (46.0)  74.3 (36.2)  75.8 (50.9)
Nst=2     76.3 (41.5)  77.6 (47.6)  78.3 (50.1)  77.7 (53.8)  77.1 (50.2)  75.8 (64.5)
Nst=3     78.8 (29.2)  79.2 (44.6)  79.3 (51.7)  79.2 (58.6)  79.1 (58.6)  78.8 (55.6)
Nst=5     81.4 (51.6)  81.1 (56.4)  81.6 (59.7)  81.9 (59.2)  81.3 (59.2)  81.3 (53.5)
Nst=10    84.6 (57.2)  84.4 (61.1)  84.4 (58.7)  83.6 (55.7)  83.5 (56.2)  81.0 (52.2)
Nst=15    85.4 (64.0)  85.1 (62.0)  85.0 (59.2)  83.8 (56.4)  82.8 (52.9)  78.4 (52.2)
Nst=20    85.8         85.6         84.0         83.5         82.5         72.3

Page 20: Automatic speech recognition using an echo state network

ASR Results, noisy

Single mixture per state (K=1): ESN classifier

Page 21: Automatic speech recognition using an echo state network

ASR Results, noisy

Single mixture per state (K=1): HMM baseline

Page 22: Automatic speech recognition using an echo state network

ASR Example 2

• Same experiment setup as Example 1.
• ESN: M = 600 PEs, 10 states, 1 filter per state, r_in = 0.1, various r.
• Desired: one-of-many encoding of class, ±1, tanh output activation function AFTER the linear readout filter.
• Test: corrupted by additive speech-shaped noise.
• Baseline: HMM with identical input features.

Page 23: Automatic speech recognition using an echo state network

ASR Results, noisy

Page 24: Automatic speech recognition using an echo state network

Discussion

• What gives the ESN classifier its noise-robust characteristics?
• Theory: the ESN reservoir provides context of the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
• Theory: the non-linearity and high dimensionality of the network increase the linear separability of classes in reservoir space.

Page 25: Automatic speech recognition using an echo state network

Future Work

• Replace winner-take-all with mixture-of-experts.

• Replace segmental K-means with Baum-Welch-type training algorithm.

• “Grow” network during training.

• Consider nonlinear activation functions (e.g., tanh, softmax) AFTER linear readout filter.

Page 26: Automatic speech recognition using an echo state network

Conclusions

• ESN classifier using inspiration from the HMM:
  – Multiple readout filters per state, multiple states.
  – Trained as a competitive network of filters.
  – Segmental K-means guaranteed to converge to a local minimum of total MSE over the training set.

• ESN classifier is noise robust compared to the HMM:
  – Average over all sources, 0-20 dB SNR: +21 percentage points.
  – Average over all sources: +9 dB SNR.