Audio Features & Machine Learning

E.M. Bakker


TRANSCRIPT

Page 1: Audio Features & Machine Learning

Audio Features & Machine Learning

E.M. Bakker

Page 2: Audio Features & Machine Learning

Features for Speech Recognition and Audio Indexing

Parametric Representations
– Short Time Energy
– Zero Crossing Rates (see the sketch below)
– Level Crossing Rates
– Short Time Spectral Envelope

Spectral Analysis
– Filter Design
– Filter Bank Spectral Analysis Model
– Linear Predictive Coding (LPC)
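The first two parametric representations listed above, short-time energy and zero-crossing rate, are straightforward to compute frame by frame. A minimal NumPy sketch (not from the slides; the frame length, hop size and test signal are placeholder choices):

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Short-time energy: sum of squared samples in the frame.
        energies.append(np.sum(frame ** 2))
        # Zero-crossing rate: fraction of adjacent samples whose sign differs.
        signs = np.sign(frame)
        zcrs.append(np.mean(signs[1:] != signs[:-1]))
    return np.array(energies), np.array(zcrs)

# Placeholder test signal: 1 s of a 100 Hz tone in noise, sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.05 * np.random.randn(fs)
energy, zcr = short_time_features(x)
print(energy[:3], zcr[:3])
```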

Page 3: Audio Features & Machine Learning

Methods

Vector Quantization
– Finite code book of spectral shapes
– The code book codes for ‘typical’ spectral shapes
– A method for all spectral representations (e.g. Filter Banks, LPC, ZCR, etc.)

Ensemble Interval Histogram (EIH) Model
– Auditory-based spectral analysis model
– More robust to noise and reverberation
– Expected to be an inherently better representation of the relevant spectral information because it models the human cochlea mechanics

Page 4: Audio Features & Machine Learning

Pattern Recognition

Block diagram: Speech/Audio input → Parameter Measurements → Test (Query) Pattern → Pattern Comparison against Reference Patterns → Decision Rules → Recognized Speech/Audio

Page 5: Audio Features & Machine Learning

Pattern Recognition

Block diagram: Speech/Audio input → Feature Detector 1 … Feature Detector n → Feature Combiner and Decision Logic → Hypothesis Tester (using Reference Vocabulary/Features) → Recognized Speech/Audio

Page 6: Audio Features & Machine Learning

Spectral Analysis Models

Pattern Recognition Approach
1. Parameter Measurement => Pattern
2. Pattern Comparison
3. Decision Making

Parameter Measurements
– Bank of Filters Model
– Linear Predictive Coding Model

Page 7: Audio Features & Machine Learning

Band Pass Filter

Diagram: audio signal s(n) → bandpass filter F(·) → resulting audio signal F(s(n))

Note that the bandpass filter can be defined as:

• a convolution with a filter response function in the time domain,

• a multiplication with a filter response function in the frequency domain
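A hedged illustration of these two equivalent definitions (a sketch, not the slides' own code): filter a signal once by convolving with a bandpass impulse response in the time domain, and once by multiplying with the filter's frequency response in the frequency domain. The FIR design via scipy.signal.firwin and all parameter values are assumptions.

```python
import numpy as np
from scipy.signal import firwin

fs = 8000                                   # assumed sampling rate
s = np.random.randn(2 * fs)                 # placeholder test signal s(n)

# FIR bandpass impulse response, pass band 300-3000 Hz.
h = firwin(numtaps=101, cutoff=[300, 3000], pass_zero=False, fs=fs)

# 1) Time domain: convolution with the filter response function.
y_time = np.convolve(s, h, mode="full")[:len(s)]

# 2) Frequency domain: multiplication with the filter's frequency response.
N = len(s) + len(h) - 1                     # zero-pad to avoid circular wrap-around
Y = np.fft.rfft(s, N) * np.fft.rfft(h, N)
y_freq = np.fft.irfft(Y, N)[:len(s)]

print(np.allclose(y_time, y_freq))          # True: both views give the same result
```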

Page 8: Audio Features & Machine Learning

Bank of Filters Analysis Model

Page 9: Audio Features & Machine Learning

Bank of Filters Analysis Model

Speech signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)

Bank of q band pass filters: BPF1, …, BPFq
– Spanning a frequency range of, e.g., 100-3000 Hz or 100-16 kHz
– BPFi(s(n)) = x_n(e^jωi), where ωi = 2πfi/Fs is the normalized frequency fi, for i = 1, …, q
– x_n(e^jωi) is the short-time spectral representation of s(n) at time n, as seen through BPFi with centre frequency ωi, for i = 1, …, q

Note: Each BPF independently processes s to produce the spectral representation x

Page 10: Audio Features & Machine Learning

Bank of Filters Front End Processor

Page 11: Audio Features & Machine Learning

Typical Speech Wave Forms

Page 12: Audio Features & Machine Learning

MFCCs

Pipeline: Speech/Audio → Preemphasis → Windowing → Fast Fourier Transform → Mel-Scale Filter Bank → Log(·) → Discrete Cosine Transform → MFCCs (first 12 most significant coefficients)

MFCCs are calculated using the formula:

C_i = ∑_{k=1}^{N} X_k · cos( i·(k − 0.5)·π / N ),   i = 1, …, P

where
• C_i is the i-th cepstral coefficient
• P is the order (12 in our case)
• K is the number of discrete Fourier transform magnitude coefficients
• X_k is the k-th log-energy output from the Mel-scale filter bank
• N is the number of filters
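A minimal sketch of this final cosine-transform step, assuming the log Mel filter-bank outputs X_k for one frame are already available; the number of filters and the placeholder energies are assumptions, not values from the slides.

```python
import numpy as np

def mfcc_from_log_energies(X, P=12):
    """Apply C_i = sum_k X_k * cos(i*(k-0.5)*pi/N) for i = 1..P."""
    N = len(X)                       # number of Mel filters
    k = np.arange(1, N + 1)          # filter index k = 1..N
    C = np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / N))
                  for i in range(1, P + 1)])
    return C                         # the first P cepstral coefficients

# Placeholder log filter-bank energies for one frame (e.g. N = 26 filters).
X = np.log(np.random.rand(26) + 1e-6)
print(mfcc_from_log_energies(X))     # 12 MFCCs for this frame
```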

Page 13: Audio Features & Machine Learning

Linear Predictive Coding Model

Page 14: Audio Features & Machine Learning

Filter Response Functions

Page 15: Audio Features & Machine Learning

Some Examples of Ideal Band Filters

Page 16: Audio Features & Machine Learning

Perceptually Based Critical Band Scale

Page 17: Audio Features & Machine Learning

Short Time Fourier Transform

X_n(e^jω) = ∑_m s(m) · w(n − m) · e^{−jωm}

• s(m): the signal
• w(n − m): a fixed low-pass window
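A small sketch of this short-time analysis (the window length, hop size and test signal are assumed, not from the slides): slide a Hamming window over the signal and take an FFT of each windowed segment, which is what the spectrogram slides that follow visualize.

```python
import numpy as np

def stft(s, win_len=500, hop=100):
    """Short-time Fourier transform magnitudes with a Hamming window."""
    w = np.hamming(win_len)                     # fixed low-pass (tapering) window w
    frames = []
    for n in range(0, len(s) - win_len + 1, hop):
        segment = s[n:n + win_len] * w          # s(m) * w(n - m) for this frame
        frames.append(np.abs(np.fft.rfft(segment)))
    return np.array(frames)                     # shape: (num_frames, win_len//2 + 1)

# A 500-sample window at Fs = 10 kHz corresponds to the 50 msec analysis below.
fs = 10000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
print(stft(x).shape)
```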

Page 18: Audio Features & Machine Learning

Short Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)

Voiced Speech

Page 19: Audio Features & Machine Learning

Short Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)

Voiced Speech

Page 20: Audio Features & Machine Learning

Short Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)

Unvoiced Speech

Page 21: Audio Features & Machine Learning

Short Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)

Unvoiced Speech

Page 22: Audio Features & Machine Learning

Short Time Fourier Transform
Linear Filter Interpretation

Page 23: Audio Features & Machine Learning

Linear Predictive Coding (LPC) Model

Speech signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)

Spectral analysis on blocks of speech with an all-pole modeling constraint, LPC analysis of order p:
– s(n) is blocked into frames [n, m]
– Again consider x_n(e^jω), the short-time spectral representation of s(n) at time n (where ω = 2πf/Fs is the normalized frequency f)
– Now the spectral representation x_n(e^jω) is constrained to be of the form σ/A(e^jω), where A(e^jω) is the p-th order polynomial with z-transform A(z) = 1 + a1·z^−1 + a2·z^−2 + … + ap·z^−p
– The output of the LPC parametric conversion on block [n, m] is the vector [a1, …, ap]
– It specifies parametrically the spectrum of an all-pole model that best matches the signal spectrum over the period of time in which the frame of speech samples was accumulated (p-th order polynomial approximation of the signal)
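A hedged sketch of extracting the LPC vector [a1, …, ap] for one frame with the autocorrelation method, solving the normal equations directly in NumPy. The frame, order p and windowing are placeholder choices; the sign convention matches A(z) = 1 + a1·z^−1 + … + ap·z^−p above.

```python
import numpy as np

def lpc_coefficients(frame, p=10):
    """Autocorrelation-method LPC: returns [a_1, ..., a_p]."""
    frame = frame * np.hamming(len(frame))            # taper the analysis frame
    # Autocorrelation values r(0), ..., r(p).
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    # Normal equations of the all-pole fit: R a = -r(1..p), R Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, -r[1:])

# Placeholder 30 ms frame at 8 kHz: a decaying resonance plus a little noise.
fs = 8000
n = np.arange(240)
frame = np.sin(2 * np.pi * 700 * n / fs) * np.exp(-n / 200) + 0.01 * np.random.randn(len(n))
print(lpc_coefficients(frame, p=10))
```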

Page 24: Audio Features & Machine Learning

Vector Quantization

• Data are represented as feature vectors.
• A VQ training set is used to determine a set of code words that constitute a code book.
• Code words are centroids under a similarity or distance measure d.
• The code words together with d divide the space into Voronoi regions.
• A query vector falls into a Voronoi region and will be represented by the respective code word.

Page 25: Audio Features & Machine Learning

Vector Quantization

Distance measures d(x,y):
– Euclidean distance
– Taxi cab distance
– Hamming distance
– etc.

Page 26: Audio Features & Machine Learning

Vector Quantization: Clustering the Training Vectors

1. Initialize: choose M arbitrary vectors of the L vectors of the training set. This is the initial code book.
2. Nearest-neighbor search: for each training vector, find the code word in the current code book that is closest and assign that vector to the corresponding cell.
3. Centroid update: update the code word in each cell using the centroid of the training vectors that are assigned to that cell.
4. Iteration: repeat steps 2-3 until the average distance falls below a preset threshold (a minimal sketch of this loop follows below).
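A minimal sketch of this clustering loop (essentially k-means with Euclidean distance); the training data, M and the stopping threshold are placeholder choices, not values from the slides.

```python
import numpy as np

def train_codebook(vectors, M=8, threshold=1e-4, max_iter=100):
    """Train a VQ code book of M code words from L training vectors."""
    rng = np.random.default_rng(0)
    # 1. Initialize: choose M arbitrary training vectors as the initial code book.
    codebook = vectors[rng.choice(len(vectors), size=M, replace=False)].copy()
    prev_dist = np.inf
    for _ in range(max_iter):
        # 2. Nearest-neighbor search: assign each vector to its closest code word.
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        assignment = np.argmin(d, axis=1)
        avg_dist = d[np.arange(len(vectors)), assignment].mean()
        # 3. Centroid update: each code word becomes the centroid of its cell.
        for m in range(M):
            cell = vectors[assignment == m]
            if len(cell) > 0:
                codebook[m] = cell.mean(axis=0)
        # 4. Iterate until the average distance stops improving enough.
        if prev_dist - avg_dist < threshold:
            break
        prev_dist = avg_dist
    return codebook

# Placeholder training data: 500 two-dimensional feature vectors.
training = np.random.default_rng(1).normal(size=(500, 2))
print(train_codebook(training, M=8).shape)      # (8, 2)
```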

Page 27: Audio Features & Machine Learning

Vector Classification

For an M-vector code book CB with codes CB = {y_i | 1 ≤ i ≤ M}, the index m* of the best codebook entry for a given vector v is:

m* = arg min_{1 ≤ i ≤ M} d(v, y_i)

Page 28: Audio Features & Machine Learning

VQ for Classification

A code book CB_k = {y_ki | 1 ≤ i ≤ M} can be used to define a class C_k.

Example audio classification:
– Classes ‘crowd’, ‘car’, ‘silence’, ‘scream’, ‘explosion’, etc.
– Determine the class by using the VQ code books CB_k for each of the classes.
– VQ is very often used as a baseline method for classification problems.
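One common way to use the per-class code books described here is to assign a query to the class whose code book gives the smallest average distortion over the query's feature vectors. A hedged sketch; the class names, dimensions and code books are placeholders.

```python
import numpy as np

def classify(features, codebooks):
    """Assign a sequence of feature vectors to the class with minimal average distortion."""
    scores = {}
    for label, cb in codebooks.items():
        # Distance of every feature vector to every code word of this class.
        d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
        scores[label] = d.min(axis=1).mean()     # average distortion for this code book
    return min(scores, key=scores.get)

# Placeholder code books for two classes (e.g. trained as on the previous slide).
rng = np.random.default_rng(2)
codebooks = {"crowd": rng.normal(0.0, 1.0, size=(8, 12)),
             "silence": rng.normal(3.0, 1.0, size=(8, 12))}
query = rng.normal(3.0, 1.0, size=(40, 12))      # 40 feature vectors (e.g. MFCC frames)
print(classify(query, codebooks))                # expected: "silence"
```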

Page 29: Audio Features & Machine Learning

Sound, DNA: Sequences!

DNA: a helix-shaped molecule whose constituents are two parallel strands of nucleotides.

DNA is usually represented by sequences of these four nucleotides.

This assumes only one strand is considered; the second strand is always derivable from the first by pairing A’s with T’s and C’s with G’s and vice versa.

Nucleotides (bases):
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)

Page 30: Audio Features & Machine Learning

Biological Information: From Genes to Proteins

Diagram: Gene (DNA) → [transcription] → RNA → [translation] → Protein → [protein folding] → folded protein; the corresponding fields are genomics, molecular biology, structural biology and biophysics.

Page 31: Audio Features & Machine Learning

DNA / amino acid sequence → 3D structure → protein functions

DNA (gene) → pre-RNA → RNA → Protein (via RNA-polymerase, the spliceosome and the ribosome, respectively)

CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG…

TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATML

From Amino Acids to Protein Functions

Page 32: Audio Features & Machine Learning

Motivation for Markov Models

There are many cases in which we would like to represent the statistical regularities of some class of sequences:
– genes
– proteins in a given family
– sequences of audio features

Markov models are well suited to this type of task.

Page 33: Audio Features & Machine Learning

A Markov Chain Model

Transition probabilities:
– Pr(x_i = a | x_{i-1} = g) = 0.16
– Pr(x_i = c | x_{i-1} = g) = 0.34
– Pr(x_i = g | x_{i-1} = g) = 0.38
– Pr(x_i = t | x_{i-1} = g) = 0.12

The transition probabilities out of a state sum to one: ∑_x Pr(x_i = x | x_{i-1} = g) = 1

Page 34: Audio Features & Machine Learning

Definition of Markov Chain Model

A Markov chain [1] model is defined by:
– a set of states
  • some states emit symbols
  • other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
  • the transitions emanating from a given state define a distribution over the possible next states

[1] A. A. Markov, “Extension of the law of large numbers to quantities depending on each other,” Izvestiya of the Physico-Mathematical Society at Kazan University, 2nd series, Vol. 15 (1906), pp. 135-156.

Page 35: Audio Features & Machine Learning

Markov Chain Models: Properties

Given some sequence x of length L, we can ask how probable the sequence is given our model. For any probabilistic model of sequences, we can write this probability as:

Pr(x) = Pr(x_L, x_{L-1}, …, x_1)
      = Pr(x_L | x_{L-1}, …, x_1) · Pr(x_{L-1} | x_{L-2}, …, x_1) · … · Pr(x_1)

The key property of a (1st order) Markov chain is that the probability of each x_i depends only on the value of x_{i-1}, so:

Pr(x) = Pr(x_L | x_{L-1}) · Pr(x_{L-1} | x_{L-2}) · … · Pr(x_2 | x_1) · Pr(x_1)
      = Pr(x_1) · ∏_{i=2}^{L} Pr(x_i | x_{i-1})

Page 36: Audio Features & Machine Learning

The Probability of a Sequence for a Markov Chain Model

Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
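A small sketch of this computation for a first-order chain over {a, c, g, t}. The 'g' row reuses the transition probabilities from the earlier slide; the other rows and the initial distribution Pr(x_1) are placeholder values.

```python
# Transition probabilities Pr(x_i | x_{i-1}); the 'g' row matches the earlier slide,
# the other rows and the initial distribution Pr(x_1) are placeholder values.
trans = {
    "a": {"a": 0.30, "c": 0.20, "g": 0.30, "t": 0.20},
    "c": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},
    "t": {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20},
}
init = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

def sequence_probability(seq):
    """Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1}) for a first-order Markov chain."""
    p = init[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= trans[prev][cur]
    return p

# Pr(cggt) = Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g)
print(sequence_probability("cggt"))   # 0.25 * 0.25 * 0.38 * 0.12
```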

Page 37: Audio Features & Machine Learning

Example Application

CpG islands
– CG dinucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G
– but the regions upstream of genes are richer in CG dinucleotides than elsewhere: CpG islands
– useful evidence for finding genes

Application: predict CpG islands with Markov chains
– one Markov chain to represent CpG islands
– another Markov chain to represent the rest of the genome

Page 38: Audio Features & Machine Learning

Markov Chains for Discrimination

Suppose we want to distinguish CpG islands from other sequence regions. Given sequences from CpG islands, and sequences from other regions, we can construct:
– a model to represent CpG islands
– a null model to represent the other regions

We can then score a test sequence by:

score(x) = log( Pr(x | CpG model) / Pr(x | null model) )

Page 39: Audio Features & Machine Learning

Markov Chains for Discrimination

Why can we use this score? According to Bayes’ rule:

Pr(CpG | x) = Pr(x | CpG) · Pr(CpG) / Pr(x)
Pr(null | x) = Pr(x | null) · Pr(null) / Pr(x)

If we are not taking into account the prior probabilities Pr(CpG) and Pr(null) of the two classes, then from Bayes’ rule it is clear that we just need to compare Pr(x | CpG) and Pr(x | null), as is done in our scoring function:

score(x) = log( Pr(x | CpG model) / Pr(x | null model) )
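A hedged sketch of this log-odds score using two first-order Markov chains; both models below are placeholders (in practice they would be estimated from CpG-island and background training sequences).

```python
from math import log

def log_prob(seq, init, trans):
    """log Pr(seq | model) for a first-order Markov chain."""
    logp = log(init[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += log(trans[prev][cur])
    return logp

def score(seq, cpg_model, null_model):
    """score(x) = log( Pr(x | CpG model) / Pr(x | null model) )."""
    return log_prob(seq, *cpg_model) - log_prob(seq, *null_model)

# Placeholder models: the CpG model makes transitions into c and g more likely.
uniform = {b: 0.25 for b in "acgt"}
cpg_trans = {b: {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15} for b in "acgt"}
null_trans = {b: {"a": 0.30, "c": 0.20, "g": 0.20, "t": 0.30} for b in "acgt"}
print(score("cgcgcg", (uniform, cpg_trans), (uniform, null_trans)))  # > 0: CpG-like
```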

Page 40: Audio Features & Machine Learning

Higher Order Markov Chains

The Markov property specifies that the probability of a state depends only on the probability of the previous state.

But we can build more “memory” into our states by using a higher order Markov model.

In an n-th order Markov model, the probability of the current state depends on the previous n states:

Pr(x_i | x_{i-1}, x_{i-2}, …, x_1) = Pr(x_i | x_{i-1}, …, x_{i-n})

Page 41: Audio Features & Machine Learning

Selecting the Order of a Markov Chain Model

But the number of parameters we need to estimate grows exponentially with the order:
– for modeling DNA we need O(4^(n+1)) parameters for an n-th order model

The higher the order, the less reliable we can expect our parameter estimates to be (checked in the short calculation below):
– estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli (5.44 × 10^6 bases), we’d see each word ~85,000 times on average (divide by 4^3)
– estimating the parameters of a 9th order chain, we’d see each word ~5 times on average (divide by 4^10 ≈ 10^6)
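The two averages quoted above can be verified with a quick calculation (4^(n+1) counts the (n+1)-mers whose frequencies the parameters are estimated from):

```python
genome = 5.44e6                # E. coli genome length in bases

# 2nd order chain: parameters come from 3-mer counts; 4**3 = 64 distinct words.
print(genome / 4**3)           # ~85,000 occurrences of each word on average

# 9th order chain: 10-mer counts; 4**10 is roughly 10**6 distinct words.
print(genome / 4**10)          # ~5 occurrences of each word on average
```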

Page 42: Audio Features & Machine Learning

Higher Order Markov Chains

An n-th order Markov chain over some alphabet A is equivalent to a first order Markov chain over the alphabet A^n of n-tuples.

Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT

Page 43: Audio Features & Machine Learning

A Fifth Order Markov Chain

Pr(gctaca)=Pr(gctac)Pr(a|gctac)

Page 44: Audio Features & Machine Learning

Hidden Markov Model: A Simple HMM

Given the observed sequence AGGCT, which state emits each item?

Model 1 Model 2

Page 45: Audio Features & Machine Learning

Tutorial on HMM

L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

Page 46: Audio Features & Machine Learning

HMM for Hidden Coin Tossing

Figure: an HMM for hidden coin tossing; only the resulting sequence of heads and tails is observed, e.g. … H H T T H T H H T T H …

Page 47: Audio Features & Machine Learning

Hidden State

We’ll distinguish between the observed parts of a problem and the hidden parts.

In the Markov models we’ve considered previously, it is clear which state accounts for each part of the observed sequence.

In the model above, there are multiple states that could account for each part of the observed sequence
– this is the hidden part of the problem.

Page 48: Audio Features & Machine Learning

Learning and Prediction Tasks
(in general, i.e., these apply to both MMs and HMMs)

Learning
– Given: a model, a set of training sequences
– Do: find model parameters that explain the training sequences with relatively high probability (the goal is to find a model that generalizes well to sequences we haven’t seen before)

Classification
– Given: a set of models representing different sequence classes, and a test sequence
– Do: determine which model/class best explains the sequence

Segmentation
– Given: a model representing different sequence classes, and a test sequence
– Do: segment the sequence into subsequences, predicting the class of each subsequence

Page 49: Audio Features & Machine Learning

Algorithms for Learning & Prediction

Learning
– correct path known for each training sequence -> simple maximum likelihood or Bayesian estimation
– correct path not known -> Forward-Backward algorithm + ML or Bayesian estimation

Classification
– simple Markov model -> calculate the probability of the sequence along the single path for each model
– hidden Markov model -> Forward algorithm to calculate the probability of the sequence along all paths for each model

Segmentation
– hidden Markov model -> Viterbi algorithm to find the most probable path for the sequence

Page 50: Audio Features & Machine Learning

The Parameters of an HMM

Transition probabilities
– a_kl = Pr(π_i = l | π_{i-1} = k): the probability of a transition from state k to state l

Emission probabilities
– e_k(b) = Pr(x_i = b | π_i = k): the probability of emitting character b in state k

Note: HMMs can also be formulated using an emission probability associated with a transition from state k to state l.

Page 51: Audio Features & Machine Learning

An HMM Example

Figure: an example HMM in which, for each state, the emission probabilities sum to one (∑ p_i = 1) and the outgoing transition probabilities sum to one (∑ p_i = 1).

Page 52: Audio Features & Machine Learning

Three Important Questions
(See also L.R. Rabiner (1989))

How likely is a given sequence?
– The Forward algorithm

What is the most probable “path” for generating a given sequence?
– The Viterbi algorithm

How can we learn the HMM parameters given a set of sequences?
– The Forward-Backward (Baum-Welch) algorithm

Page 53: Audio Features & Machine Learning

How Likely is a Given Sequence?

The probability that a given path π is taken and the sequence is generated:

Pr(x_1, …, x_L, π) = a_{0,π_1} · ∏_{i=1}^{L} e_{π_i}(x_i) · a_{π_i,π_{i+1}}

For the example model:

Pr(AAC, π) = a_01 · e_1(A) · a_11 · e_1(A) · a_13 · e_3(C) · a_35
           = 0.5 · 0.4 · 0.2 · 0.4 · 0.8 · 0.3 · 0.6

Page 54: Audio Features & Machine Learning

How Likely is a Given Sequence?

The probability over all paths is Pr(x) = ∑_π Pr(x, π),
but the number of paths can be exponential in the length of the sequence...
The Forward algorithm enables us to compute this efficiently.

Page 55: Audio Features & Machine Learning

The Forward Algorithm

Define f_k(i) to be the probability of being in state k having observed the first i characters of sequence x of length L.

To compute Pr(x), we want f_N(L): the probability of being in the end state having observed all of sequence x.

f_k(i) can be defined recursively and computed using dynamic programming.

Page 56: Audio Features & Machine Learning

The Forward Algorithm

f_k(i) is the probability of being in state k having observed the first i characters of sequence x.

Initialization
– f_0(0) = 1 for the start state; f_k(0) = 0 for every other state

Recursion
– for an emitting state (i = 1, …, L):  f_l(i) = e_l(x_i) · ∑_k f_k(i−1) · a_kl
– for a silent state:  f_l(i) = ∑_k f_k(i) · a_kl

Termination
Pr(x) = Pr(x_1 … x_L) = f_N(L) = ∑_k f_k(L) · a_kN

Page 57: Audio Features & Machine Learning

Forward Algorithm Example

Given the sequence x=TAGA

Page 58: Audio Features & Machine Learning

Forward Algorithm Example

Initialization
– f_0(0) = 1, f_1(0) = 0, …, f_5(0) = 0

Computing other values
– f_1(1) = e_1(T) · (f_0(0)·a_01 + f_1(0)·a_11) = 0.3 · (1·0.5 + 0·0.2) = 0.15
– f_2(1) = 0.4 · (1·0.5 + 0·0.8)
– f_1(2) = e_1(A) · (f_0(1)·a_01 + f_1(1)·a_11) = 0.4 · (0·0.5 + 0.15·0.2)
– …
– Pr(TAGA) = f_5(4) = f_3(4)·a_35 + f_4(4)·a_45
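A hedged sketch of the Forward recursion in code, using a silent begin state 0 and a silent end state N; the toy transition and emission tables are placeholders chosen only to mirror the structure of this example, not its exact model.

```python
import numpy as np

def forward(obs, a, e):
    """Forward algorithm: Pr(obs) summed over all state paths.

    a[k, l] : transition probability from state k to state l
              (state 0 = silent begin, state a.shape[0]-1 = silent end)
    e[k][b] : probability that emitting state k emits symbol b
    """
    N = a.shape[0] - 1                 # index of the end state
    L = len(obs)
    f = np.zeros((L + 1, N + 1))
    f[0, 0] = 1.0                      # initialization: f_0(0) = 1, other states 0
    for i in range(1, L + 1):          # recursion over observed characters
        for l in range(1, N):          # emitting states 1 .. N-1
            f[i, l] = e[l][obs[i - 1]] * np.sum(f[i - 1, :] * a[:, l])
    return np.sum(f[L, :] * a[:, N])   # termination: Pr(x) = sum_k f_k(L) * a_kN

# Toy model with two emitting states (1, 2) over the alphabet {A, C, G, T}.
a = np.array([[0.0, 0.5, 0.5, 0.0],   # begin -> 1, 2
              [0.0, 0.2, 0.3, 0.5],   # 1 -> 1, 2, end
              [0.0, 0.4, 0.1, 0.5],   # 2 -> 1, 2, end
              [0.0, 0.0, 0.0, 0.0]])  # end state (no outgoing transitions)
e = {1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
     2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
print(forward("TAGA", a, e))
```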

Page 59: Audio Features & Machine Learning

Three Important Questions

How likely is a given sequence?

What is the most probable “path” for generating a given sequence?

How can we learn the HMM parameters given a set of sequences?

Page 60: Audio Features & Machine Learning

Finding the Most Probable Path: The Viterbi Algorithm

Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.

We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.

v_k(i) can be defined recursively, and again we can use dynamic programming to compute v_N(L) and find the most probable path efficiently.

Page 61: Audio Features & Machine Learning

Finding the Most Probable Path: The Viterbi Algorithm

Define v_k(i) to be the probability of the most probable path π accounting for the first i characters of x and ending in state k.

The Viterbi algorithm:
1. Initialization (i = 0):
   v_0(0) = 1,  v_k(0) = 0 for k > 0
2. Recursion (i = 1, …, L):
   v_l(i) = e_l(x_i) · max_k ( v_k(i−1) · a_kl )
   ptr_i(l) = argmax_k ( v_k(i−1) · a_kl )
3. Termination:
   P(x, π*) = max_k ( v_k(L) · a_k0 )
   π*_L = argmax_k ( v_k(L) · a_k0 )
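A hedged sketch of the same recursion in code, sharing the model layout of the Forward sketch above (silent begin state 0, emitting states 1 … N−1, and a separate end state indexed N rather than 0 as on the slide); the model values are placeholders.

```python
import numpy as np

def viterbi(obs, a, e):
    """Viterbi algorithm: most probable state path for obs and its probability."""
    N = a.shape[0] - 1                         # index of the (silent) end state
    L = len(obs)
    v = np.zeros((L + 1, N + 1))
    ptr = np.zeros((L + 1, N + 1), dtype=int)
    v[0, 0] = 1.0                              # initialization: v_0(0) = 1
    for i in range(1, L + 1):                  # recursion
        for l in range(1, N):
            cand = v[i - 1, :] * a[:, l]
            ptr[i, l] = np.argmax(cand)        # ptr_i(l) = argmax_k v_k(i-1) * a_kl
            v[i, l] = e[l][obs[i - 1]] * cand[ptr[i, l]]
    # Termination: best final state and P(x, pi*).
    final = v[L, :] * a[:, N]
    best_prob, state = np.max(final), int(np.argmax(final))
    path = [state]
    for i in range(L, 1, -1):                  # traceback through the pointers
        state = ptr[i, state]
        path.append(state)
    return best_prob, path[::-1]

# Reuse the toy transition/emission tables from the Forward sketch.
a = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.2, 0.3, 0.5],
              [0.0, 0.4, 0.1, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
e = {1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
     2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
print(viterbi("TAGA", a, e))   # (probability of the best path, state sequence)
```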

Page 62: Audio Features & Machine Learning

Three Important Questions

How likely is a given sequence?

What is the most probable “path” for generating a given sequence?

How can we learn the HMM parameters given a set of sequences?

Page 63: Audio Features & Machine Learning

Learning Without Hidden State

Learning is simple if we know the correct path for each sequence in our training set:
– estimate parameters by counting the number of times each parameter is used across the training set

Page 64: Audio Features & Machine Learning

Learning With Hidden State

If we don’t know the correct path for each sequence in our training set, consider all possible paths for the sequence.

Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set.

Page 65: Audio Features & Machine Learning

Learning Parameters: The Baum-Welch Algorithm

Also known as the Forward-Backward algorithm.

An Expectation Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic models in problems that involve hidden states

In this context, the hidden state is the path that best explains each training sequence.

Page 66: Audio Features & Machine Learning

Learning Parameters: The Baum-Welch Algorithm

Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence:
  • calculate the expected number of times each transition or emission is used
  • adjust the parameters to maximize the likelihood of these expected values

Page 67: Audio Features & Machine Learning

Computational Complexity of HMM Algorithms

Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is O(S²·L).
– This assumes that the states are densely interconnected.

Given M sequences of length L, the complexity of Baum-Welch on each iteration is O(M·S²·L).

Page 68: Audio Features & Machine Learning

Markov Models Summary

We considered models that vary in terms of order and hidden state.

Three DP-based algorithms for HMMs: Forward, Backward and Viterbi.

We discussed three key tasks: learning, classification and segmentation.

The algorithms used for each task depend on whether there is hidden state in the problem (i.e., whether the correct path is known) or not.

Page 69: Audio Features & Machine Learning

Summary

Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state.
– Given a sequence of symbols x, the forward algorithm finds the probability of obtaining x in the model.
– The Viterbi algorithm finds the most probable path (corresponding to x) through the model.
– The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences.