audio features & parametric representations machine...
Post on 17-Mar-2018
226 Views
Preview:
TRANSCRIPT
1
API2011
Audio Features &
Machine Learning
E.M. Bakker
API2011
Features for Speech Recognition
and Audio Indexing
Parametric Representations
– Short Time Energy
– Zero Crossing Rates
– Level Crossing Rates
– Short Time Spectral Envelope
Spectral Analysis
– Filter Design
– Filter Bank Spectral Analysis Model
– Linear Predictive Coding (LPC)
API2011
Methods
Vector Quantization
– Finite code book of spectral shapes
– The code book codes for ‗typical‘ spectral shape
– Method for all spectral representations (e.g. Filter
Banks, LPC, ZCR, etc. …)
Support Vector Machines
Markov Models
Hidden Markov Models
Neural Networks Etc.
API2011
Pattern Recognition
Reference
Patterns
Parameter
Measurements
Decision
Rules
Pattern
Comparison
Speech
Audio, …
Recognized
Speech, Audio, …
Test Pattern
Query Pattern
2
API2011
Pattern Recognition
Reference
Vocabulary
Features
Feature Detector1
Hypothesis
Tester
Feature Combiner
and
Decision Logic
Speech
Audio, …
Recognized
Speech, Audio, …
Feature Detectorn
API2011
Vector Quantization
Data represented as feature vectors.
VQ Training set to determine a set of code
words that constitute a code book.
Code words are centroids using a similarity or
distance measure d.
Code words together with d divide the space into
a Voronoi regions.
A query vector falls into a Voronoi region and will
be represented by the respective codeword.
API2011
Vector Quantization
Distance measures d(x,y):
Euclidean distance
Taxi cab distance
Hamming distance
etc.
API2011
Vector Quantization
Let a training set of L vectors be given.
Assume a codebook of M code words is wanted.
Initialize:
choose M arbitrary vectors of the L vectors of the training set.
This is the initial code book.
Nearest Neighbor Search:
for each training vector v, find the code word w in the current code book that is closest and assign v to the corresponding cell of w.
Centroid Update:
For each cell with code word w determine the centroid c of the training vectors that are assigned to the cell of w.
Update the code word w with the new vector c.
Iteration:
repeat the steps Nearest Neighbor Search and Centroid Update until the average distance between the new and previous code word falls below a preset threshold.
3
API2011
Vector Classification
For an M-vector code book CB with codes
CB = {yi | 1 ≤ i ≤ M} ,
the index m* of the best codebook entry for a given vector v is:
m* = arg min d(v, yi)
1 ≤ i ≤ M
API2011
VQ for Classification
A code book CBk = {yki | 1 ≤ i ≤ M}, can be used to
define a class Ck.
Example Audio Classification:
Classes ‗crowd‘, ‗car‘, ‗silence‘, ‗scream‘, ‗explosion‘, etc.
Determine by using VQ code books CBk for each of the respective classes Ck.
VQ is very often used as a baseline method for classification problems.
API2011
Support Vector Machine (SVM)
In this method so called support vectors define
decision boundaries for classification and
regression.
An example where
a straight line
separates the two
Classes: a linear
classifier
Images from: www.statsoft.com.API2011
Support Vector Machine (SVM)
In general classification is not that simple.
SVM is a method that can handle the more complex cases
where the decision boundary requires a curve.
SVM uses a set of mapping
functions (kernels) to map
the feature space into
a transformed space so
that hyperplanes can be
used for the classification.
4
API2011
Support Vector Machine (SVM)
SVM uses a set of mapping functions (kernels) to
map the feature space into a transformed space
so that hyperplanes can be used for the
classification.
API2011
Support Vector Machine (SVM)
Training of an SVM is an iterative process:
– optimize the mapping function while minimizing an
error function
– The error function should capture the penalties for
misclassified, i.e., non separable data points.
API2011
Support Vector Machine (SVM)
SVM uses kernels that define the mapping function used in the method. Kernels can be:– Linear
– Polynomial
– RBF
– Sigmoid
– Etc.
– RBF (radial basis function) is the most popular kernel, again with different possible base functions.
– The final choice depends on characteristics of the classification task.
API2011
From: T.Joachims, SIGIR 2003 Tutorial.
5
API2011
From: T.Joachims, SIGIR 2003 Tutorial.API2011
Sequences: Sound and DNA
DNA: helix-shaped molecule
whose constituents are two
parallel strands of nucleotides
DNA is usually represented by
sequences of these four
nucleotides
This assumes only one strand
is considered; the second
strand is always derivable
from the first by pairing A‘s
with T‘s and C‘s with G‘s and
vice-versa
Nucleotides (bases)
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
API2011
Biological Information:
From Genes to Proteins
GeneDNA
RNA
Transcription
Translation
Protein Protein folding
genomics
molecular
biology
structural
biology
biophysics
API2011
DNA / amino acid
sequence 3D structure protein functions
DNA (gene) →→→ pre-RNA →→→ RNA →→→ Protein
RNA-polymerase Spliceosome Ribosome
CGCCAGCTGGACGGGCACACC
ATGAGGCTGCTGACCCTCCTG
GGCCTTCTG…
TDQAAFDTNIVTLTRFVMEQG
RKARGTGEMTQLLNSLCTAVK
AISTAVRKAGIAHLYGIAGST
NVTGDQVKKLDVLSNDLVINV
LKSSFATCVLVTEEDKNAIIV
EPEKRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNSTDEP
SEKDALQPGRNLVAAGYALYG
SATML
From Amino Acids to Proteins
Functions
6
API2011
Motivation for Markov Models
There are many cases in which we would like to represent the statistical regularities of some class of sequences
– genes
– proteins in a given family
– Sequences of audio features
Markov models are well suited to this type of task
API2011
A Markov Chain Model
Transition probabilities
– Pr(xi=a|xi-1=g)=0.16
– Pr(xi=c|xi-1=g)=0.34
– Pr(xi=g|xi-1=g)=0.38
– Pr(xi=t|xi-1=g)=0.12
1)|Pr( 1 gxx ii
API2011
Definition of Markov Chain Model
A Markov chain[1] model is defined by
– a set of states
some states emit symbols
other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
the transitions emanating from a given state define a distribution
over the possible next states
[1] Марков А. А., Распространение закона больших чисел на величины, зависящие друг
от друга. — Известия физико-математического общества при Казанском
университете. — 2-я серия. — Том 15. (1906) — С. 135—156
API2011
Markov Chain Models: Properties
Given some sequence x of length L
How probable the sequence is given our model?
For any probabilistic model of sequences, we can write this
probability as
Key property of a (1st order) Markov chain:
The probability of each xi depends only on the value of xi-1
)Pr()...,...,|Pr(),...,|Pr(
),...,,Pr()Pr(
112111
11
xxxxxxx
xxxx
LLLL
LL
L
i
ii
LLLL
xxx
xxxxxxxx
2
11
112211
)|Pr()Pr(
)Pr()|Pr()...|Pr()|Pr()Pr(
7
API2011
The Probability of a Sequence for a
Markov Chain Model
Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
API2011
Example Applications
Isolated Word Recognition
A spectral vector is modeled by a state in a Markov chain
An utterance is represented by a sequence of states
Algorithmic Music Composition
States are note or pitch values
Probabilities for transitions to other notes are given (can be 1st order
or second order etc.)
API2011
Markov Chains for Discrimination
Suppose we want to distinguish CpG islands from other
sequence regions
Given sequences from CpG islands, and sequences
from other regions, we can construct
– a model to represent CpG islands
– a null model to represent the other regions
Using the models we can now score a test sequence by:
)|Pr(
)|Pr(log)(
nullModelx
CpGModelxxscore
API2011
Markov Chains for Discrimination
Why can we use
According to Bayes’ rule:
If we are not taking into account prior probabilities (Pr(CpG) and
Pr(null)) of the two classes, then from Bayes’ rule it is clear that
we just need to compare Pr(x|CpG) and Pr(x|null) as is done in
our scoring function score().
)Pr(
)Pr()|Pr()|Pr(
x
CpGCpGxxCpG
)Pr(
)Pr()|Pr()|Pr(
x
nullnullxxnull
)|Pr(
)|Pr(log)(
nullModelx
CpGModelxxscore
As the real score should be:
Assume x is observed what is the
probability that x was modeled by
CpG compared to the probability
that x was modeled by the null
model, i.e., Pr(CpG | x) / Pr(null | x)
8
API2011
Higher Order Markov Chains
The Markov property specifies that the probability of a state
depends only on the probability of the previous state
But we can build more ―memory‖ into our states by using a higher
order Markov model
In an n-th order Markov model
The probability of the current state depends on the previous n states.
),...,|Pr(),...,,|Pr( 1121 niiiiii xxxxxxx
API2011
Selecting the Order of a Markov Chain
Model
But the number of parameters we need to estimate
grows exponentially with the order
– for modeling DNA we need parameters for an n-th
order model
The higher the order, the less reliable we can expect
our parameter estimates to be
– estimating the parameters of a 2nd order Markov chain from the
complete genome of E. Coli (5.44 x 106 bases) , we‘d see
each word ~ 85.000 times on average (divide by 43)
– estimating the parameters of a 9th order chain, we‘d see each
word ~ 5 times on average (divide by 410 ~ 106)
)4( 1nO
API2011
Higher Order Markov Chains
An n-th order Markov chain over some alphabet A is equivalent to
a first order Markov chain over the alphabet of n-tuples: An
Example: A 2nd order Markov model for DNA can be treated as a
1st order Markov model over alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
API2011
A Fifth Order Markov Chain
Pr(gctaca) = Pr(gctac) . Pr(a | gctac)
9
API2011
Hidden Markov Model: A Simple HMM
Given observed sequence AGGCT, which state emits every item?
Model 1 Model 2
API2011
Tutorial on HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models
and Selected Applications in Speech
Recognition, Proceeding of the IEEE, Vol. 77,
No. 22, February 1989.
API2011
HMM for Hidden Coin Tossing
HT
T
T T
T
H
T
……… H H T T H T H H T T H
API2011
Hidden State
We‘ll distinguish between the observed parts of a
problem and the hidden parts
In the Markov models we‘ve considered previously, it is
clear which state accounts for each part of the observed
sequence
In the model above, there are multiple states that could
account for each part of the observed sequence
– this is the hidden part of the problem
10
API2011
Learning and Prediction Tasks(in general, i.e., applies on both MM as HMM)
Learning– Given: a model, a set of training sequences
– Do: find model parameters that explain the training sequences with
relatively high probability (goal is to find a model that generalizes well to
sequences we haven‘t seen before)
Classification– Given: a set of models representing different sequence classes, and
given a test sequence
– Do: determine which model/class best explains the sequence
Segmentation– Given: a model representing different sequence classes, and given a
test sequence
– Do: segment the sequence into subsequences, predicting the class of
each subsequence
API2011
Algorithms for Learning & Prediction
Learning– correct path known for each training sequence -> simple maximum
likelihood or Bayesian estimation
– correct path not known -> Forward-Backward algorithm + ML or Bayesian
estimation
Classification– simple Markov model -> calculate probability of sequence along single
path for each model
– hidden Markov model -> Forward algorithm to calculate probability of
sequence along all paths for each model
Segmentation
– hidden Markov model -> Viterbi algorithm to find most probable path for
sequence
API2011
The Parameters of an HMM
Transition Probabilities
– Probability of transition from state k to state l
Emission Probabilities
– Probability of emitting character b in state k
Note: HMM’s can also be formulated using an emission probability
associated with a transition from state k to state l.
)|Pr( 1 kla iikl
)|Pr()( kbxbe iik
API2011
An HMM Example
Emission probabilities∑ pi = 1
Transition probabilities∑ pi = 1
11
API2011
HMM Isolated Word Recognizer
API2011
Three Important Questions(See also L.R. Rabiner (1989))
How likely is a given sequence?
– The Forward algorithm
What is the most probable ―path‖ for generating
a given sequence?
– The Viterbi algorithm
How can we learn the HMM parameters given a
set of sequences?
– The Forward-Backward (Baum-Welch) algorithm
API2011
How Likely is a Given Sequence?
The probability that a given path is taken and
the sequence is generated:L
i
iNL iiiaxeaxx
1
001 11)()...,...Pr(
6.3.8.4.2.4.5.
)(
)()(
),Pr(
35313
111101
aCea
AeaAea
AAC
API2011
How Likely is a Given Sequence?
But we need the probability over all paths:
The number of paths can be exponential in the
length of the sequence...
The Forward algorithm enables us to compute
this efficiently
12
API2011
The Forward Algorithm
Define to be the probability of being in
state k having observed the first i characters of
sequence x of length L
To compute , the probability of being in
the end state having observed all of sequence
x
Can be defined recursively
Compute using dynamic programming
)(if k
)(LfN
API2011
The Forward Algorithm
fk(i) equal to the probability of being in state k having
observed the first i characters of sequence x
Initialization
– f0(0) = 1 for start state; fi(0) = 0 for other state
Recursion
– For emitting state (i = 1, … L)
– For silent state
Termination
k
klkl aifif )()(
k
klkll aifieif )1()()(
k
kNkNL aLfLfxxx )()()...Pr()Pr( 1
API2011
Forward Algorithm Example
Given the sequence x=TAGA
API2011
Forward Algorithm Example
Initialization
– f0(0)=1, f1(0)=0…f5(0)=0
Computing other values
– f1(1)=e1(T)*(f0(0)a01+f1(0)a11)
=0.3*(1*0.5+0*0.2)=0.15
– f2(1)=0.4*(1*0.5+0*0.8)
– f1(2)=e1(A)*(f0(1)a01+f1(1)a11)
=0.4*(0*0.5+0.15*0.2)
…
– Pr(TAGA)= f5(4)=f3(4)a35+f4(4)a45
13
API2011
Three Important Questions
How likely is a given sequence?
What is the most probable ―path‖ for generating
a given sequence?
How can we learn the HMM parameters given a
set of sequences?
API2011
Finding the Most Probable Path: The Viterbi Algorithm
Define vk(i) to be the probability of the most probable
path accounting for the first i characters of x and
ending in state k
We want to compute vN(L), the probability of the most
probable path accounting for all of the sequence and
ending in the end state
Can be defined recursively
Again we can use use Dynamic Programming to
compute vN(L) and find the most probable path
efficiently
API2011
Finding the Most Probable Path: The Viterbi Algorithm
Define vk(i) to be the probability of the most probable
path π accounting for the first i characters of x and
ending in state k
The Viterbi Algorithm:
1. Initialization (i = 0)
v0(0) = 1, vk(0) = 0 for k>0
2. Recursion (i = 1,…,L)
vl(i) = el(xi) .maxk(vk(i-1).akl)
ptri(l) = argmaxk(vk(i-1).akl)
3. Termination:
P(x,π*) = maxk(vk(L).ak0)
π*L = argmaxk(vk(L).ak0)
API2011
Three Important Questions
How likely is a given sequence?
What is the most probable ―path‖ for
generating a given sequence?
How can we learn the HMM parameters
given a set of sequences?
14
API2011
Learning Without Hidden State
Learning is simple if we know the correct path for each
sequence in our training set
estimate parameters by counting the number of times
each parameter is used across the training set
API2011
Learning With Hidden State
If we don‘t know the correct path for each sequence
in our training set, consider all possible paths for the
sequence
Estimate parameters through a procedure that
counts the expected number of times each
parameter is used across the training set
API2011
Learning Parameters: The Baum-
Welch Algorithm
Also known as the Forward-Backward algorithm
An Expectation Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic
models in problems that involve hidden states
In this context, the hidden state is the path that
best explains each training sequence
API2011
Learning Parameters: The Baum-
Welch Algorithm
Algorithm sketch:
– initialize parameters of model
– iterate until convergence
calculate the expected number of times
each transition or emission is used
adjust the parameters to maximize the
likelihood of these expected values
15
API2011
Computational Complexity of HMM Algorithms
Given an HMM with S states and a sequence of length
L, the complexity of the Forward, Backward and Viterbi
algorithms is
– This assumes that the states are densely interconnected
Given M sequences of length L, the complexity of
Baum Welch on each iteration is
)( 2LSO
)( 2LMSO
API2011
Markov Models Summary
We considered models that vary in terms of
order, hidden state
Three DP-based algorithms for HMMs: Forward,
Backward and Viterbi
We discussed three key tasks: learning,
classification and segmentation
The algorithms used for each task depend on
whether there is hidden state (correct path
known) in the problem or not
API2011
SummaryMarkov chains and hidden Markov models are
probabilistic models in which the probability of a
state depends only on that of the previous state
– Given a sequence of symbols, x, the forward
algorithm finds the probability of obtaining x in the
model
– The Viterbi algorithm finds the most probable path
(corresponding to x) through the model
– The Baum-Welch learns or adjusts the model
parameters (transition and emission probabilities) to
best explain a set of training sequences.
API2011
Spectral Analysis Models
Pattern Recognition Approach
1. Parameter Measurement => Pattern
2. Pattern Comparison
3. Decision Making
Parameter Measurements
– Bank of Filters Model
– Linear Predictive Coding Model
16
API2011
Band Pass Filter
Audio Signal
s(n)
Bandpass Filter
F()
Result Audio Signal
F(s(n))
Note that the bandpass filter can be
defined as:
• a convolution with a filter response
function in the time domain,
• a multiplication with a filter response
function in the frequency domain
API2011
Bank of Filters Analysis Model
API2011
Bank of Filters Analysis Model
Speech Signal: s(n), n=0,1,…– Digital with Fs the sampling frequency of s(n)
Bank of q Band Pass Filters: BPF1, …,BPFq
– Spanning a frequency range of, e.g., 100-3000Hz or 100-16kHz
– BPFi(s(n)) = xn(ejωi), where ωi = 2πfi/Fs is equal to the
normalized frequency fi, where i=1, …, q.
– xn(ejωi) is the short time spectral representation of s(n)
at time n, as seen through the BPFi with centre frequency ωi, where i=1, …, q.
Note: Each BPF independently processes s to produce the spectral representation x
API2011
MFCCs
Mel-Scale
Filter Bank
MFCC‘s
first 12 most
Significant
coefficients
Log()
Speech
Audio, …Preemphasis Windowing
Fast Fourier
Transform
Direct Cosine
Transform
MFCCs are calculated using the formula:
N
k
iki NkXC1
)/)5.0(cos(
Where
• Ci is the cepstral coefficient (i from {1, …, 12}
• P the order (12 in our case)
• K the number of discrete Fourier
transform magnitude coefficients
• Xk the kth order log-energy output
from the Mel-Scale filterbank.
• N is the number of filters
17
API2011
Short Time Fourier Transform
• s(m) signal
• w(n-m) a fixed low pass window
API2011
Short Time Fourier Transform
Long Hamming Window: 500 samples (=50msec)
Voiced Speech
API2011
Short Time Fourier Transform
Short Hamming Window: 50 samples (=5msec)
Voiced Speech
API2011
Short Time Fourier Transform
Long Hamming Window: 500 samples (=50msec)
Unvoiced Speech
top related