complex-valued linear layers for deep neural network-based ... · 4/8/2016 · complex-valued...
TRANSCRIPT
Complex-valued Linear Layers for Deep NeuralNetwork-based Acoustic Models for Speech
Recognition
Izhak Shafran
Joint work with: Ehsan Variani, Arun Narayanan, and Tara Sainath
Google Inc
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
Automatic Speech Recognition
Automatic Speech Recognition: Language ModelsScores likelihood of utterances in a given language
P(”call home”) > P(”call roam”)
P(w1w2) = P(w1|begin)P(w2|w1)P(end |w1w2)
P(”call home”) = P(”call”|begin) . . .
Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information
Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information
Automatic Speech Recognition: Acoustic ModelsScores the conditional likelihood that the hypothesized words were spoken
Acoustic Models Illustration: Vowel classification task
Consider the (mag) FFT of a frame of (vowel) speech
Notice energy peaks (formants)Turns out, their positions are different for different vowels
Acoustic Models Illustration: Vowel classification taskValues of (F1,F2) for vowels in a training set
Acoustic Models Illustration: Vowel classification taskScore the conditional likelihood that a hypothesized vowel was observed
P(X |æ) > P(X |ε)
Automatic Speech Recognition: Decoder
return ”call home” if
P(X |”call home”)P(”call home”) > P(X |”call roam”)P(”call roam”)
Mathematical justification,
W = argmaxW
P(W |X )
= argmaxW
P(X |W )P(W )
P(X )
= argmaxW
P(X |W )P(W )
(Deep) Neural NetworksIllustration with a simple feed forward network
A Node in a Neural Network
Neural NetworksIllustration of Forward Propagation
Neural Networks: LearningIllustration of Back Propagation Algorithm
home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
Multistyle Training (MTR) improves robustness
Deep Neural Networks: Why are they so hot now?
I Universal approximatorSuperposition Theorem of Kolmogorov (1957)A Computable Kolmogorov S. T. (Brattka, 2000)Neural Network with Unbounded Activation Functions is aUniversal Approximator (Sonoda & Murata, 2015)
I Back propagation algorithm(Linnainmaa, 1970; Rummelhart et al, 1986)
I Efficient BackProp (LeCun 1998),Fast Learning Algorithms for Deep Belief Net (Hinton, 2006)
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
Motivation
In most ASR systems, features extraction:
I uses 3 decades old hand-crafted rules
I mimics human perception
I may not be optimal for speech recognition
Since neural networks are universal approximators,why not let the network learn the optimal features too !?
Recall: Feature Extraction
I Most problematic stage is Mel-filterbank
I Loss of information
I Log-mel features: useless for dereverberation or beamforming,precludes integration into neural networks
Problem
Can we feed the speech frames directly as input to the neuralnetworks and learn the features jointly with acoustic models?
One Approach: Learnable Time Domain Filters (Hoshen et al, 2015)
I Skip transforming the signal into frequency domain
I Replace mel-filters with a convolutional neural network layer
I Learn the parameters of the layer (filter) from the data
I Implementation inspired by gammatone features
Gammatone Features
All the above operations can be approximated in the neural network
Learnable Time Domain Filters (Hoshen et al, 2015)
Learned Time Domain Filters
Learnable Time Domain Filters (Hoshen et al, 2015)
But:
I convolution layer has several parameters to tune
I computationally expensive
Convolution
Convolutional Layer
I Convolution layers use only the ”valid” outputs
I Trade-off: Longer filters =⇒ fewer valid outputs
Time Domain Filters: Filter Length Dependencies
I For 64ms frame, there are still more settings to sweep!
I Typically, max pooling works better than average pooling
I Info loss, cannot cascade filters in the signal processing sense
Outline
BackgroundAutomatic Speech RecognitionDeep Neural Networks
Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters
Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations
Summary & Discussion
Convolution in Time ≡ Multiplication in Frequency Domain
Multiplication in Frequency Doman: Complex-Valued
X[k] H[k] Y[k]
I X[k] is complex-valued
I H[k] is complex-valued
I Need a linear layer with complex-valued neurons!
Complex-valued Neuron
Let X := Xr + jXi , W := Wr + jWi , B := Br + jBi
Z = g(WX − B)
Zr = g(WrXr −WiXi − Br )
Zi = g(WiXr + WrXi − Bi )
I Easily implemented in a neural net w/ shared weights
I How is this different from using g([WrWi ][XrXi ]T − B)?
Decision Boundaries are Orthogonal!
Easy to see from the norm to the boundary (Tohru Nitta, 2004)
Zr = g(WrXr −WiXi − Br )
Zi = g(WiXr + WrXi − Bi )
Qr =
[∂Zr
∂Xr
∂Zr
∂Xi
]=
[g ′(. . .) g ′(. . .)
] [ Wr
−Wi
]Qi =
[g ′(. . .) g ′(. . .)
] [ Wi
Wr
]QT
r Qi = 0 =⇒ Qr ⊥ Qi
Complex FFT Input & Complex-valued Linear Layers: The Appeal
I Orthogonal boundaries =⇒ higher modeling capacitye.g., XOR with one neuron!
I Gradient descent is faster
I Does not have trade-offs of convolutional layers
I L1 regularization is meaningful, unlike in time domain
I No max-pooling, allows cascading of filterse.g., Dereverberation + Beamforming + Feature extraction
I Possibility to learn hybrid filters that optimize both signalprocessing cost function and WER
Complex-valued Linear Layers: Single Filter
Complex-valued Linear Layers: Set of Filters
A model is a lie that helps you see the truth.Howard Skipper
Summation in the Frequency Domain
LemmaSummation in the frequency domain ≡ weighted average poolingin the time domain.If X is the Fourier transform of the 2N point signal x , then
N∑k=0
Xk =2N−1∑n=0
αnxn
where,
αn =
N + 1 n = 0
coth(j πi2N
)mod (n, 2) = 1
1 mod (n, 2) = 0, n 6= 0
Complex Linear Projection
Proposition
The projection in the frequency domain is equivalent toconvolution followed by a weighted average pooling:
N∑j=0
WijXj ←→2N−1∑j=0
αj (wi ∗ x) [j ]
Complex Linear Projection (CLP)
ASR Experimental Setup
I Training set: 2000 hrs, 3M anonymized voice search queriesI SNR between 0-20dB (≈12 dB)I Reverb T60 between 400ms-900ms (≈ 600ms)I 8 channel linear mic with spacing of 2cmI Noise and target speaker location constant per utterance
I Acoustic modelI A cascade of 3 LSTMs followed by a DNN layerI 13K HMM statesI Learning with asynchronized stochastic gradient descent
I Test set: Voice search w/ matching conditions
I All results reported here are after CE training
Experimental Settings & WER Results
I 25 ms, 32 ms, 64 ms
I 128 filters
System 25 ms 32 ms 64 ms
Log-mel 23.4 22.8 21.8Time-domain conv. 23.7 23.4 22.5CLP 23.2 22.8 22.0
I CLP is at least as good as the alternatives!
Learned CLP Filters: Before and After Applying L1 Regularization
Learned Filters: Histogram of Learned Filter Coefficients
Time-domain filter coefficients (in time & in freq)
CLF filter coefficients (w/o & w/ L1)
CLP filters are sparse!
Complex Linear Projection: Two Channels
Experimental Settings & WER Results
I Two channels from microphones, 14cm apart
I 25 ms, 32 ms, 64 ms
I 256 filters
System 25 ms 32 ms 64 ms
Log-mel 21.8 21.3 20.7Time-domain conv. 21.5 21.2 21.2CLP 21.5 20.9 20.5
I Again, CLP is at least as good as the other alternatives!
Multichannel CLP: Frame Size vs Projection Size
I Frame size: 32ms, 64ms
I Projection size: 128, 256, 512
I Without L1 regularization and stop bands
128 256 512
32ms 21.7 21.4 21.764ms 21.1 21.1 21.5
I Higher projections size =⇒ larger input to LSTM, slowerStep time is about 1.5 × slower going from 256 to 512
I Frame size has no impact for 32ms and 64ms
I Step time is about 2 × slower going from 1 to 8 channels
Learned CLP Filters: Two Channel
Learned CLP Filters: Time-delay between Real and Imaginary
Filter number vs time delay
Histogram of Time-delay between Real and Imaginary
Time delay vs frequency of occurrence
Language and Task Dependency
I English high-pitch voice search (en-us-hp)
I Taiwanese voice search (cmn-hant-tw)
System en-us-hp cmn-hant-tw
Log-mel 16.6 17.2Time-domain conv. 16.3 16.8CLP 16.4 16.6
I CLP as good or better than the alternatives!
Learned Filters: Sorted by Center Frequency
Computational Efficiency
I Time-domain filters ≈ O(Nkp + k2p))k strides, 2N window size, p filters
I CLP ≈ O(pN)
Model Time Filters CLP
num-params 45K 66Knum-add-mult 14.51 M 263.17 K
I Computation cost is a magnitude lower!
Summary & Future Directions
I Introduced complex-valued linear layers for acoustic modeling
I Complex-valued linear projection (CLP) as one instance of it
I Showed a simple multi-channel extension for CLP
I Reported experimental comparisons with log-mel andtime-domain filters
I Demonstrated task-dependent advantage
I Discussed properties of the CLP model
I Future directions:More efficient multi-channel projections, hybrid systems withWER and beamforming-type cost functions, adding multiplecomplex layers with appropriate complex-valued activationfunctions, integrating de-reverberation, . . .