algorithms for nlptbergkir/11711fa17/fa17 11-711... · 2017. 9. 14. · §peaks = voicing: .46 to...

Acoustic Models Taylor Berg-Kirkpatrick – CMU Slides: Dan Klein – UC Berkeley Algorithms for NLP

Post on 03-Mar-2021


Acoustic Models
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Algorithms for NLP

Speech Signals

Speech in a Slide
§ Frequency gives pitch; amplitude gives volume
§ Frequencies at each time slice processed into observation vectors
[Figure: amplitude waveform of "speech lab", segmented as s, p, ee, ch, l, a, b, above the observation vector sequence ... x12 x13 x12 x14 x14 ...]

Articulation

Articulatory System
[Figure: sagittal section of the vocal tract (Techmer 1880), labeled: nasal cavity, oral cavity, pharynx, vocal folds (in the larynx), trachea, lungs]
Text from Ohala, Sept 2001, from Sharon Rose slide

Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants

Place

Places of Articulation
[Figure labels: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
Figure thanks to Jennifer Venditti

Labial Place
[Figure labels: bilabial, labiodental]
Bilabial: p, b, m
Labiodental: f, v
Figure thanks to Jennifer Venditti

Coronal Place
[Figure labels: dental, alveolar, post-alveolar/palatal]
Dental: th/dh
Alveolar: t/d/s/z/l/n
Post-alveolar: sh/zh/y
Figure thanks to Jennifer Venditti

Dorsal Place
[Figure labels: velar, uvular, pharyngeal]
Velar: k/g/ng
Figure thanks to Jennifer Venditti

Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants

Manner

Manner of Articulation
§ In addition to varying by place, sounds vary by manner
§ Stop: complete closure of articulators, no air escapes via mouth
  § Oral stop: palate is raised (p, t, k, b, d, g)
  § Nasal stop: oral closure, but palate is lowered (m, n, ng)
§ Fricatives: substantial closure, turbulent (f, v, s, z)
§ Approximants: slight closure, sonorant (l, r, w)
§ Vowels: no closure, sonorant (i, e, a)

Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants

Vowels

Vowel Space

Acoustics

“She just had a baby”
§ What can we learn from a wave file?
  § No gaps between words (!)
  § Vowels are voiced, long, loud
  § Length in time = length in space in the waveform picture
  § Voicing: regular peaks in amplitude
  § When stops are closed: no peaks, silence
  § Peaks = voicing: .46 to .58 (vowel [iy]), from second .65 to .74 (vowel [ax]), and so on
  § Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
  § Fricatives like [sh]: intense irregular pattern; see .33 to .46

Time-Domain Information
[Figure: waveforms of "bad", "pad", "spat", "pat"]
Example from Ladefoged

Simple Periodic Waves of Sound
[Figure: sine wave, amplitude from −0.99 to 0.99, time from 0 to 0.02 s]
• Y axis: Amplitude = amount of air pressure at that point in time
  • Zero is normal air pressure, negative is rarefaction
• X axis: Time
  • Frequency = number of cycles per second
  • 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz

Complex Waves: 100 Hz + 1000 Hz
[Figure: summed wave, amplitude from −0.9654 to 0.99, time from 0 to 0.05 s]

Spectrum
[Figure: coefficient vs. frequency in Hz, with components at 100 and 1000]
Frequency components (100 and 1000 Hz) on x-axis
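A minimal sketch of how the spectrum on this slide could be computed, assuming a 4000 Hz sampling rate (the slides do not state one) and using a naive discrete Fourier transform from the standard library rather than an optimized FFT. The names `dft_magnitudes`, `SAMPLE_RATE`, and `peak_freqs` are illustrative only.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(N^2) DFT; returns normalized magnitudes for bins 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)
    return mags


SAMPLE_RATE = 4000   # Hz; assumed, must exceed twice the highest component
NUM_SAMPLES = 200    # 0.05 s of signal, matching the slide's time axis

# The complex wave from the slide: a 100 Hz sine plus a 1000 Hz sine.
wave = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
        + math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(NUM_SAMPLES)]

mags = dft_magnitudes(wave)
bin_hz = SAMPLE_RATE / NUM_SAMPLES   # width of one frequency bin: 20 Hz

# Pick the two strongest bins and convert their indices to Hz.
top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = sorted(k * bin_hz for k in top_bins)
print(peak_freqs)
```

The two strongest bins land on the two component frequencies, which is what the spikes in the spectrum plot show.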

Part of [ae] waveform from “had”
§ Note complex wave repeating nine times in figure
§ Plus smaller waves which repeat 4 times for every large pattern
§ Large wave has frequency of 250 Hz (9 times in .036 seconds)
§ Small wave roughly 4 times this, or roughly 1000 Hz
§ Two little tiny waves on top of peak of 1000 Hz waves
[Figure: amplitude vs. time]

Spectrum of Actual Speech
[Figure: coefficient (0 to 40) vs. frequency (0 to 5000 Hz)]

Spectrograms
[Figure: a time slice of the amplitude waveform goes through an FFT to give a spectrum, coefficient (0 to 40) vs. frequency (0 to 5000 Hz)]

Spectrograms
[Figure: one spectrum per time slice, coefficient (0 to 40) vs. frequency (0 to 5000 Hz), lined up along the time axis above the amplitude waveform]

Spectrograms
[Figure: spectrogram (frequency vs. time) above the waveform (amplitude vs. time)]
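The spectrogram slides build one spectrum per time slice and stack the results along the time axis. A rough sketch of that idea follows, assuming a 4000 Hz sampling rate, non-overlapping 50 ms slices, and a naive DFT in place of an FFT. None of these settings come from the slides, and `spectrogram` and `dominant` are hypothetical names.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one time slice."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len, hop):
    """Return one magnitude spectrum (a column of the picture) per slice."""
    return [dft_magnitudes(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


SAMPLE_RATE = 4000  # Hz, assumed

# Toy signal: 0.1 s of a 100 Hz tone followed by 0.1 s of a 1000 Hz tone.
signal = ([math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(400)]
          + [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(400)])

columns = spectrogram(signal, frame_len=200, hop=200)
bin_hz = SAMPLE_RATE / 200

# The strongest frequency in each column changes when the tone changes.
dominant = [max(range(len(col)), key=lambda k: col[k]) * bin_hz
            for col in columns]
print(dominant)
```

Plotting each column as a vertical strip, with darkness proportional to magnitude, gives the frequency vs. time picture shown in the figures.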

Types of Graphs
[Figure: three views of the same signal: spectrogram (frequency vs. time), waveform (amplitude vs. time), and spectrum (coefficient, 0 to 40, vs. frequency, 0 to 5000 Hz)]

Back to Spectra
§ Spectrum represents these freq components
§ Computed by Fourier transform, an algorithm which separates out each frequency component of a wave
§ x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
§ Peaks at 930 Hz, 1860 Hz, and 3020 Hz

Source / Filter

Why these Peaks?
§ Articulation process:
  § The vocal cord vibrations create harmonics
  § The mouth is an amplifier
  § Depending on shape of mouth, some harmonics are amplified more than others

Vowel [i] at increasing pitches
[Figure: spectra at pitches F#2, A2, C3, F#3, A3, C4 (middle C), A4]
Figures from Ratree Wayland

Resonances of the Vocal Tract
§ The human vocal tract as an open tube:
[Figure: tube of length 17.5 cm with a closed end and an open end]
§ Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.
§ Constraint: pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
Figure from W. Barry

From Sundberg

Computing the 3 Formants of Schwa
§ Let the length of the tube be L
  § $F_1 = c/\lambda_1 = c/(4L) = 35{,}000/(4 \times 17.5) = 500$ Hz
  § $F_2 = c/\lambda_2 = c/(\tfrac{4}{3}L) = 3c/(4L) = 3 \times 35{,}000/(4 \times 17.5) = 1500$ Hz
  § $F_3 = c/\lambda_3 = c/(\tfrac{4}{5}L) = 5c/(4L) = 5 \times 35{,}000/(4 \times 17.5) = 2500$ Hz
§ So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz
§ These vowel resonances are called formants
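The three formant computations follow one pattern: a tube closed at one end resonates at odd multiples of c/(4L). A small sketch of that pattern, using the slide's values of 35,000 for c (read here as cm/s) and 17.5 cm for L; the function name `tube_formants` is made up for illustration.

```python
def tube_formants(length_cm, speed_cm_per_s=35_000, count=3):
    """Resonances of a tube closed at one end: F_n = (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * speed_cm_per_s / (4 * length_cm)
            for n in range(1, count + 1)]


schwa = tube_formants(17.5)
print(schwa)  # [500.0, 1500.0, 2500.0]
```

Under the same simple model, a shorter tube raises every formant proportionally, which is one reason formant values differ across speakers.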

From Mark Liberman

Seeing Formants: the Spectrogram

Vowel Space

Seeing Formants: the Spectrogram

American English Vowel Space
[Figure: vowel chart with axes FRONT to BACK and HIGH to LOW; vowels shown: iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux]
Figures from Jennifer Venditti, H. T. Bunnell

Spectrograms

How to Read Spectrograms
§ [bab]: closure of lips lowers all formants, so rapid increase in all formants at beginning of “bab”
§ [dad]: first formant increases, but F2 and F3 fall slightly
§ [gag]: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
From Ladefoged “A Course in Phonetics”

“She came back and started again”
1. lots of high-freq energy
3. closure for k
4. burst of aspiration for k
5. ey vowel; faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for “k”
From Ladefoged “A Course in Phonetics”

Dialect Issues
§ Speech varies from dialect to dialect (examples are American vs. British English)
  § Syntactic (“I could” vs. “I could do”)
  § Lexical (“elevator” vs. “lift”)
  § Phonological
  § Phonetic
§ Mismatch between training and testing dialects can cause a large increase in error rate
[Figure: American vs. British examples for “all” and “old”]

Speech Recognition

The Noisy Channel Model
Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions
Language model: Distributions over sequences of words (sentences)

Speech Model
[Figure: words w1 w2 (language model) above sound types s1 ... s7, which emit acoustic observations a1 ... a7 (acoustic model)]

Acoustic Model
[Figure: sound types s1 ... s7 emitting acoustic observations a1 ... a7 (acoustic model)]

Frame Extraction
§ A frame (25 ms wide) extracted every 10 ms
[Figure: overlapping 25 ms windows shifted by 10 ms, giving frames a1, a2, a3]
Figure: Simon Arnfield

Preview of feature extraction for each frame:
1) DFT (Spectrum)
2) Log (Calibrate?)
3) another DFT (!!??)
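The framing step can be sketched as a sliding window over the sample array. The 25 ms width and 10 ms shift come from the slide; the 16 kHz sampling rate and the name `extract_frames` are assumptions for illustration.

```python
def extract_frames(samples, sample_rate, width_ms=25, shift_ms=10):
    """Cut a sample sequence into overlapping fixed-width frames."""
    width = sample_rate * width_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000   # samples between frame starts
    return [samples[start:start + width]
            for start in range(0, len(samples) - width + 1, shift)]


SAMPLE_RATE = 16_000                  # Hz, assumed
samples = list(range(SAMPLE_RATE))    # stand-in for 1 s of audio
frames = extract_frames(samples, SAMPLE_RATE)

print(len(frames), len(frames[0]))    # number of frames, samples per frame
print(frames[1][0])                   # second frame starts 10 ms (160 samples) in
```

Because the shift is smaller than the width, consecutive frames share 15 ms of audio. Partial frames at the end of the signal are dropped in this sketch; padding them is another option.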

Feature Extraction

Digitizing Speech
Figure: Bryan Pellom

Source / Filter
§ Articulation process:
  § The vocal cord vibrations create harmonics
  § The mouth is an amplifier
  § Depending on shape of mouth, some harmonics are amplified more than others

Problem with Raw Spectrum
Figures from Ratree Wayland

Deconvolution / Liftering
[Figure sequence: the spectrum s is shown as a combination of two components e and f; taking logs of each turns the combination into a sum]
s = e · f
log(s) = log(e) + log(f)
IDFT(log(s))
Graphs from Dan Ellis
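A small sketch of why IDFT(log(s)) helps, under stated assumptions: the log spectrum is taken to be the sum of a slowly varying part (standing in for log(f), presumably the filter) and a rapidly varying ripple (standing in for log(e), presumably the source harmonics). Both are synthetic cosines here, and the cutoff of 4 is arbitrary. Keeping only the low-index coefficients of the inverse DFT, a step usually called liftering, recovers the slow part.

```python
import cmath
import math


def idft(values):
    """Inverse DFT of a sequence."""
    n = len(values)
    return [sum(values[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]


def dft(values):
    """Forward DFT of a sequence."""
    n = len(values)
    return [sum(values[q] * cmath.exp(-2j * math.pi * k * q / n)
                for q in range(n))
            for k in range(n)]


N = 64
log_f = [math.cos(2 * math.pi * k / N) for k in range(N)]            # slow envelope
log_e = [0.5 * math.cos(2 * math.pi * 8 * k / N) for k in range(N)]  # fast ripple
log_s = [a + b for a, b in zip(log_f, log_e)]                        # log(s) = log(e) + log(f)

cepstrum = idft(log_s)

# Lifter: keep only the lowest-index coefficients (and their mirror images).
CUTOFF = 4
liftered = [c if (q < CUTOFF or q > N - CUTOFF) else 0
            for q, c in enumerate(cepstrum)]

envelope = [v.real for v in dft(liftered)]
max_error = max(abs(a - b) for a, b in zip(envelope, log_f))
print(max_error)
```

The sum in the log domain is what makes the separation possible: the two parts occupy different index ranges after the inverse transform, so zeroing one range removes one part.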

Mel Freq. Cepstral Coefficients
§ Do FFT to get spectral information
  § Like the spectrogram we saw earlier
§ Apply Mel scaling (New)
  § Models human ear; more sensitivity in lower freqs
  § Approx linear below 1 kHz, log above, equal samples above and below 1 kHz
§ Take Log
§ Do discrete cosine transform
[Graph: Wikipedia]
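The slide does not give a formula for the Mel scale. One widely used approximation, assumed here and not taken from the slides, is mel = 2595 · log10(1 + f/700). The sketch below uses it to place filter centers evenly on the Mel axis, which makes them crowd together at low frequencies. The helper names are illustrative.

```python
import math


def hz_to_mel(hz):
    """One common Mel approximation (an assumption, not from the slides)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, count):
    """Centers spaced evenly in Mel, returned in Hz."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (count + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(count)]


centers = mel_center_frequencies(0.0, 8000.0, 10)
gaps = [b - a for a, b in zip(centers, centers[1:])]

print([round(c) for c in centers])
print(round(hz_to_mel(1000.0)))  # close to 1000 by construction of the scale
```

The gaps between neighboring centers grow with frequency, matching the slide's point that the representation is more sensitive at lower frequencies.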

Final Feature Vector
§ 39 (real) features per 10 ms frame:
  § 12 MFCC features
  § 12 delta MFCC features
  § 12 delta-delta MFCC features
  § 1 (log) frame energy
  § 1 delta (log) frame energy
  § 1 delta-delta (log) frame energy
§ So each frame is represented by a 39D vector
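A sketch of how the 39 dimensions add up, assuming the simplest possible delta: half the difference between the next and previous frame, with the edges clamped. Real front ends often use a longer regression window, so treat `deltas` as illustrative rather than as the method used in the course. The 13 base features stand for 12 MFCCs plus log energy.

```python
def deltas(frames):
    """Two-point difference over time, clamping at the first and last frame."""
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        result.append([(n - p) / 2.0 for p, n in zip(prev_frame, next_frame)])
    return result


def add_dynamics(base):
    """Append delta and delta-delta features to every frame."""
    first = deltas(base)
    second = deltas(first)
    return [b + d1 + d2 for b, d1, d2 in zip(base, first, second)]


# Toy input: 5 frames of 13 base features that grow linearly over time.
base = [[float(t * (i + 1)) for i in range(13)] for t in range(5)]
features = add_dynamics(base)

print(len(features[0]))   # 13 base + 13 delta + 13 delta-delta
print(features[2][13])    # delta of the first base feature at the middle frame
print(features[2][26])    # delta-delta of the same feature
```

For a feature that grows linearly, the delta is constant in the interior frames and the delta-delta is zero there, which is the behavior expected of first and second differences.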

Emission Model

HMMs for Continuous Observations
§ Before: discrete set of observations
§ Now: feature vectors are real-valued
§ Solution 1: discretization
§ Solution 2: continuous emissions
  § Gaussians
  § Multivariate Gaussians
  § Mixtures of multivariate Gaussians
§ A state is progressively
  § Context independent subphone (~3 per phone)
  § Context dependent phone (triphones)
  § State tying of CD phone

Vector Quantization
§ Idea: discretization
  § Map MFCC vectors onto discrete symbols
  § Compute probabilities just by counting
§ This is called vector quantization or VQ
§ Not used for ASR any more
§ But: useful to consider as a starting point
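A toy sketch of the two VQ steps, mapping vectors to discrete symbols and then counting. The two-entry codebook and the four vectors are invented for illustration. In practice the codebook would be learned, for example by clustering, which this sketch does not show.

```python
from collections import Counter


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2
                                 for v, c in zip(vector, codebook[i])))


def emission_table(vectors, codebook):
    """Estimate P(code) for one state by counting quantized vectors."""
    counts = Counter(nearest_code(v, codebook) for v in vectors)
    total = sum(counts.values())
    return {code: counts[code] / total for code in range(len(codebook))}


codebook = [[0.0, 0.0], [1.0, 1.0]]                   # hypothetical codebook
vectors = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [0.2, 0.1]]

symbols = [nearest_code(v, codebook) for v in vectors]
probs = emission_table(vectors, codebook)
print(symbols)
print(probs)
```

Once every frame is a symbol, the emission distribution of each HMM state is a plain table, as in the discrete case.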

Gaussian Emissions
§ VQ is insufficient for top-quality ASR
  § Hard to cover high-dimensional space with codebook
  § Moves ambiguity from the model to the preprocessing
§ Instead: assume the possible values of the observation vectors are normally distributed.
  § Represent the observation likelihood function as a Gaussian?
From bartus.org/akustyk

Gaussians for Acoustic Modeling
§ P(x):
[Figure: bell curve of P(x) against x; P(x) is highest at the mean and low far from the mean]
A Gaussian is parameterized by a mean and a variance:
$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Multivariate Gaussians
§ Instead of a single mean μ and variance σ²:
  § Vector of means μ and covariance matrix Σ
§ Usually assume diagonal covariance (!)
  § This isn’t very true for FFT features, but is less bad for MFCC features
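With a diagonal covariance, the multivariate density factors into a product of one-dimensional Gaussians, so the log density is a sum over dimensions. A minimal sketch, with `diag_gaussian_logpdf` as an illustrative name:

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance.

    x, mean, and var are equal-length sequences; var holds the diagonal
    of the covariance matrix (one variance per dimension).
    """
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


mean = [0.0, 0.0]
var = [1.0, 4.0]

at_mean = diag_gaussian_logpdf([0.0, 0.0], mean, var)
far_away = diag_gaussian_logpdf([3.0, 3.0], mean, var)
print(at_mean, far_away)   # the density is highest at the mean
```

Working in log space avoids underflow when the per-dimension densities are multiplied across 39 dimensions and many frames.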

Gaussians: Size of Σ
§ μ = [0 0], Σ = I;  μ = [0 0], Σ = 0.6I;  μ = [0 0], Σ = 2I
§ As Σ becomes larger, Gaussian becomes more spread out; as Σ becomes smaller, Gaussian more compressed
Text and figures from Andrew Ng

Gaussians: Shape of Σ
§ As we increase the off-diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng

But we’re not there yet
§ Single Gaussians may do a bad job of modeling a complex distribution in any dimension
§ Even worse for diagonal covariances
§ Solution: mixtures of Gaussians
From openlearn.open.ac.uk

Mixtures of Gaussians
§ Mixtures of Gaussians:
From robots.ox.ac.uk, http://www.itee.uq.edu.au/~comp4702

GMMs
§ Summary: each state has an emission distribution P(x|s) (likelihood function) parameterized by:
  § M mixture weights
  § M mean vectors of dimensionality D
  § Either M covariance matrices of D×D or M D×1 diagonal variance vectors
§ Like soft vector quantization after all
  § Think of the mixture means as being learned codebook entries
  § Think of the Gaussian densities as a learned codebook distance function
  § Think of the mixture of Gaussians like a multinomial over codes
  § (Even more true given shared Gaussian inventories, cf. next week)
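A sketch of evaluating P(x|s) for a state whose emission model is a mixture with M weights, M mean vectors, and M diagonal variance vectors. The log-sum-exp step is a standard numerical precaution that the slides do not mention, and the parameter values are invented.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_logpdf(x, weights, means, variances):
    """log sum_m w_m N(x; mean_m, var_m), computed stably with log-sum-exp."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-component mixture over 2-D features.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near_first = gmm_logpdf([0.1, -0.2], weights, means, variances)
near_second = gmm_logpdf([4.1, 3.9], weights, means, variances)
between = gmm_logpdf([2.0, 2.0], weights, means, variances)
print(near_first, near_second, between)
```

In the "soft VQ" reading from the slide, the means play the role of codebook entries and the weights play the role of the multinomial over codes.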

State Model

State Transition Diagrams
§ Bayes Net: HMM as a Graphical Model
[Figure: chain of states w, w, w, each emitting an observation x]
§ State Transition Diagram: Markov Model as a Weighted FSA
[Figure: weighted FSA over the words "the", "cat", "dog", "chased", "has"]

ASR Lexicon
Figure: J&M

Lexical State Structure
Figure: J&M

Adding an LM
Figure from Huang et al., page 618

State Space
§ State space must include
  § Current word (|V| on order of 20K+)
  § Index within current word (|L| on order of 5)
  § E.g. (lec[t]ure) (though not in orthography!)
§ Acoustic probabilities only depend on phone type
  § E.g. P(x|lec[t]ure) = P(x|t)
§ From a state sequence, can read a word sequence

State Refinement

Phones Aren’t Homogeneous

Need to Use Subphones
Figure: J&M

A Word with Subphones
Figure: J&M

Modeling phonetic context
[Figure labels: w iy, r iy, m iy, n iy]

“Need” with triphone models
Figure: J&M

Lots of Triphones
§ Possible triphones: 50 × 50 × 50 = 125,000
§ How many triphone types actually occur?
§ 20K word WSJ Task (from Bryan Pellom)
  § Word internal models: need 14,300 triphones
  § Cross word models: need 54,400 triphones
§ Need to generalize models, tie triphones

State Tying / Clustering
§ [Young, Odell, Woodland 1994]
§ How do we decide which triphones to cluster together?
§ Use phonetic features (or ‘broad phonetic classes’)
  § Stop
  § Nasal
  § Fricative
  § Sibilant
  § Vowel
  § Lateral
Figure: J&M

State Space
§ State space now includes
  § Current word: |W| is order 20K
  § Index in current word: |L| is order 5
  § Subphone position: 3
  § E.g. (lec[t-mid]ure)
§ Acoustic model depends on clustered phone context
  § But this doesn’t grow the state space
§ But, adding the LM context for trigram+ does
  § (after the, lec[t-mid]ure)
  § This is a real problem for decoding

Decoding

Inference Tasks
Most likely word sequence: d - ae - d
Most likely state sequence: d1-d6-d6-d4-ae5-ae2-ae3-ae0-d2-d2-d3-d7-d5

Viterbi Decoding
Figure: Enrique Benimeli
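Since the Viterbi figures did not survive extraction, here is a generic log-space sketch of the algorithm on a toy two-state HMM. The states, probabilities, and the function name `viterbi` are invented for illustration; an ASR decoder runs the same recurrence over a far larger state space.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best state path, its log probability) for a discrete HMM."""
    # best[t][s]: score of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final state.
    final = max(states, key=lambda s: best[-1][s])
    path = [final]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], best[-1][final]


states = ["A", "B"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {p: {s: math.log(0.5) for s in states} for p in states}
log_emit = {"A": {"a": math.log(0.9), "b": math.log(0.1)},
            "B": {"a": math.log(0.1), "b": math.log(0.9)}}

path, score = viterbi(["a", "a", "b"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

Each time step keeps only the best score per state, which is what makes this dynamic programming rather than enumeration of all paths.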

Emission Caching
§ Problem: scoring all the P(x|s) values is too slow
§ Idea: many states share tied emission models, so cache them
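One way the caching idea might look in code, assuming each decoder state maps to a tied emission model id and that scores are cached per (frame, model) pair. `EmissionCache`, `toy_score`, and the state names are hypothetical.

```python
class EmissionCache:
    """Cache log P(x|s) per (frame index, tied model), not per state."""

    def __init__(self, score_fn, state_to_model):
        self.score_fn = score_fn              # expensive scorer: (model, frame) -> float
        self.state_to_model = state_to_model  # many states may share one model
        self.cache = {}
        self.evaluations = 0

    def score(self, frame_index, frame, state):
        model = self.state_to_model[state]
        key = (frame_index, model)
        if key not in self.cache:
            self.cache[key] = self.score_fn(model, frame)
            self.evaluations += 1
        return self.cache[key]


def toy_score(model, frame):
    """Stand-in for a GMM evaluation."""
    return -sum((x - model) ** 2 for x in frame)


# Two word-internal states that were tied to the same emission model (id 1).
state_to_model = {"lec[t]ure": 1, "na[t]ure": 1, "l[e]cture": 2}
cache = EmissionCache(toy_score, state_to_model)

frame = [1.0, 1.5]
first = cache.score(0, frame, "lec[t]ure")
second = cache.score(0, frame, "na[t]ure")   # served from the cache
third = cache.score(0, frame, "l[e]cture")   # different model, new evaluation
print(cache.evaluations)
```

The saving grows with the amount of tying: the number of evaluations per frame is bounded by the number of distinct tied models rather than the number of states.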

Prefix Trie Encodings
§ Problem: many partial-word states are indistinguishable
§ Solution: encode word production as a prefix trie (with pushed weights)
§ A specific instance of minimizing weighted FSAs [Mohri, 94]
[Figure: the word paths n-i-d, n-i-t, n-o-t (weights 0.04, 0.02, 0.01) merged into a prefix trie whose arcs carry the weights 0.04, 0.25, 0.5, 1, 1, 1]
Figure: Aubert, 02
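The arc weights in the figure (0.04, 0.25, 0.5, 1, 1, 1) are consistent with one particular reading of "pushed weights": each trie node stores the probability of its best completion, and each arc carries the ratio of child to parent. The sketch below rests on that assumption, together with the assumption that the three words carry weights 0.04, 0.02, and 0.01 in the order listed.

```python
def build_pushed_trie(lexicon):
    """Map each trie node (a phone prefix) to the weight on its incoming arc."""
    # best[prefix]: probability of the most likely word starting with prefix
    best = {}
    for phones, prob in lexicon.items():
        for end in range(1, len(phones) + 1):
            prefix = phones[:end]
            best[prefix] = max(best.get(prefix, 0.0), prob)

    arcs = {}
    for prefix, value in best.items():
        parent = prefix[:-1]
        parent_best = best[parent] if parent else 1.0   # the root carries weight 1
        arcs[prefix] = value / parent_best
    return arcs


def path_weight(arcs, phones):
    """Multiply arc weights along a word's path; recovers the word weight."""
    weight = 1.0
    for end in range(1, len(phones) + 1):
        weight *= arcs[phones[:end]]
    return weight


lexicon = {("n", "i", "d"): 0.04, ("n", "i", "t"): 0.02, ("n", "o", "t"): 0.01}
arcs = build_pushed_trie(lexicon)

for prefix in sorted(arcs):
    print("-".join(prefix), arcs[prefix])
```

Pushing weight toward the root lets the decoder apply as much of the word score as possible as early as possible, so pruning decisions inside a word are better informed.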

Beam Search
§ Problem: trellis is too big to compute v(s) vectors
§ Idea: most states are terrible, keep v(s) only for top states at each time
[Figure: hypotheses "the b.", "the m.", "and then.", "at then." are expanded to "the ba.", "the be.", "the bi.", "the ma.", "the me.", "the mi.", "then a.", "then e.", "then i.", then pruned to "the ba.", "the be.", "the ma.", "then a."]
§ Important: still dynamic programming; collapse equiv states
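A sketch of one pruning step, assuming hypotheses are (state, log score) pairs and that equivalent states are those with equal state keys. The beam size and the scores are invented; the state strings echo the figure.

```python
def prune(hypotheses, beam_size):
    """Collapse equivalent states (keeping the best score), then keep the top few."""
    merged = {}
    for state, score in hypotheses:
        if state not in merged or score > merged[state]:
            merged[state] = score
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:beam_size]


# "and then" and "at then" are treated here as reaching the same state
# "then a" once the older history no longer matters to the model.
expanded = [("the ba", -2.0), ("the be", -2.5), ("the bi", -7.0),
            ("the ma", -3.0), ("the me", -8.0), ("the mi", -9.0),
            ("then a", -4.0), ("then a", -3.5), ("then e", -8.5)]

beam = prune(expanded, beam_size=4)
print(beam)
```

Merging before ranking is the "still dynamic programming" part: two paths that reach the same state compete with each other first, and only the winner takes up a slot in the beam.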

LM Factoring
§ Problem: higher-order n-grams explode the state space
§ (One) Solution:
  § Factor state space into (word index, lm history)
  § Score unigram prefix costs while inside a word
  § Subtract unigram cost and add trigram cost once word is complete
[Figure: the prefix trie with arc weights 0.04, 0.25, 0.5, 1, 1, 1, entered after the LM history "the"]

LM Reweighting
§ Noisy channel suggests
§ In practice, want to boost LM
§ Also, good to have a “word bonus” to offset LM costs
§ These are both consequences of broken independence assumptions in the model
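The formulas on this slide did not survive extraction. One common way to express the idea, assumed here rather than recovered from the slide, is to score a hypothesis as log P(x|w) + α · log P(w) + β · |w|, where α is an LM weight and β a per-word bonus. With α = 1 and β = 0 this reduces to the plain noisy channel score. The parameter names and the numbers below are illustrative.

```python
def hypothesis_score(acoustic_logprob, lm_logprob, num_words,
                     lm_weight=1.0, word_bonus=0.0):
    """Combined log score with an LM weight and a per-word bonus."""
    return acoustic_logprob + lm_weight * lm_logprob + word_bonus * num_words


# Two hypothetical transcriptions of the same audio: (log P(x|w), log P(w), |w|).
short = (-10.0, -4.0, 2)
long_ = (-9.0, -6.0, 4)

plain_short = hypothesis_score(*short)
plain_long = hypothesis_score(*long_)

boosted_short = hypothesis_score(*short, lm_weight=2.0, word_bonus=1.5)
boosted_long = hypothesis_score(*long_, lm_weight=2.0, word_bonus=1.5)

print(plain_short, plain_long)
print(boosted_short, boosted_long)
```

In this toy case the plain score prefers the shorter hypothesis, while the reweighted score with a word bonus prefers the longer one. That illustrates how the bonus offsets the extra LM cost paid by hypotheses with more words.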

Training

Training Mixture Models
§ Input: wav files with unaligned transcriptions
§ Forced alignment
  § Computing the “Viterbi path” over the training data (where the transcription is known) is called “forced alignment”
  § We know which word string to assign to each observation sequence.
  § We just don’t know the state sequence.
  § So we constrain the path to go through the correct words (by using a special example-specific language model)
  § And otherwise run the Viterbi algorithm
§ Result: aligned state sequence
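A sketch of forced alignment in its simplest form, assuming the transcription has already been expanded into one linear sequence of states and that each frame either stays in the current state or advances to the next one. The score matrix is invented, and `forced_align` is a hypothetical name. A real system would also apply transition probabilities, which are omitted here.

```python
def forced_align(frame_scores):
    """Best monotone alignment of frames to a known state sequence.

    frame_scores[t][j] is the log-likelihood of frame t under the j-th
    state of the transcription. The path starts in state 0, ends in the
    last state, and moves forward by 0 or 1 states per frame.
    """
    num_frames, num_states = len(frame_scores), len(frame_scores[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    best[0][0] = frame_scores[0][0]

    for t in range(1, num_frames):
        for j in range(num_states):
            stay = best[t - 1][j]
            advance = best[t - 1][j - 1] if j > 0 else neg_inf
            if advance > stay:
                best[t][j], back[t][j] = advance, j - 1
            else:
                best[t][j], back[t][j] = stay, j
            best[t][j] += frame_scores[t][j]

    # Trace back from the last state at the last frame.
    path = [num_states - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Five frames scored against the three states of "d ae d".
frame_scores = [[0.0, -5.0, -5.0],
                [0.0, -5.0, -5.0],
                [-5.0, 0.0, -5.0],
                [-5.0, 0.0, -5.0],
                [-5.0, -5.0, 0.0]]

alignment = forced_align(frame_scores)
print(alignment)   # state index assigned to each frame
```

The constraint to pass through the known states in order plays the role of the example-specific language model mentioned on the slide.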

State Tying
§ Creating CD phones:
  § Start with monophone, do EM training
  § Clone Gaussians into triphones
  § Build decision tree and cluster Gaussians
  § Clone and train mixtures (GMMs)
§ General idea:
  § Introduce complexity gradually
  § Interleave constraint with flexibility

Standard subphone/mixture HMM
[Figure labels: temporal structure, Gaussian mixtures]
Model           Error rate
HMM Baseline    25.1%

An Induced Model
[Figure labels: standard model, single Gaussians, fully connected]
[Petrov, Pauls, and Klein, 07]

Hierarchical Split Training with EM
[Figure: error rates shown along the split hierarchy: 32.1%, 28.7%, 25.6%, 23.9%]
Model            Error rate
HMM Baseline     25.1%
5 Split rounds   21.4%

Refinement of the /ih/-phone

HMM states per phone
[Figure: bar chart, y-axis 0 to 35, one bar per phone in this order: ae, ao, ay, eh, er, ey, ih, f, r, s, sil, aa, ah, ix, iy, z, cl, k, sh, n, vcl, ow, l, m, t, v, uw, aw, ax, ch, w, th, el, dh, uh, p, en, oy, hh, jh, ng, y, b, d, dx, g, zh, epi]