11-711 Algorithms for NLP – Fall 2017, Lecture 5
Speech Signals
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley

Algorithms for NLP
Maximum Entropy Models
Improving on N-Grams?
§ N-grams don't combine multiple sources of evidence well

P(construction | After the demolition was completed, the)

§ Here:
  § "the" gives a syntactic constraint
  § "demolition" gives a semantic constraint
  § Unlikely the interaction between these two has been densely observed in this specific n-gram
§ We'd like a model that can be more statistically efficient
Some Definitions

INPUTS:          close the ____
CANDIDATE SET:   {door, table, …}
CANDIDATES:      door, table
TRUE OUTPUTS:    door
FEATURE VECTORS: "close" in x ∧ y = "door"
                 x-1 = "the" ∧ y = "door"
                 x-1 = "the" ∧ y = "table"
                 y occurs in x
More Features, Less Interaction

x = closing the ____,  y = doors

§ N-Grams:  x-1 = "the" ∧ y = "doors"
§ Skips:    x-2 = "closing" ∧ y = "doors"
§ Lemmas:   x-2 = "close" ∧ y = "door"
§ Caching:  y occurs in x
Data: Feature Impact

Features            Train Perplexity   Test Perplexity
3-gram indicators   241                350
1–3 grams           126                172
1–3 grams + skips   101                164
Exponential Form

§ Weights w, features f(x, y)
§ Linear score:             w⊤f(x, y)
§ Unnormalized probability: exp(w⊤f(x, y))
§ Probability:              P(y | x, w) = exp(w⊤f(x, y)) / Σ_y' exp(w⊤f(x, y'))
Likelihood Objective

§ Model form:

$$P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$

§ Log-likelihood of training data:

$$L(w) = \log \prod_i P(y_i^* \mid x_i, w)
      = \sum_i \log \frac{\exp(w^\top f(x_i, y_i^*))}{\sum_{y'} \exp(w^\top f(x_i, y'))}
      = \sum_i \left( w^\top f(x_i, y_i^*) - \log \sum_{y'} \exp(w^\top f(x_i, y')) \right)$$
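The model form and log-likelihood above can be sketched directly in code. The vocabulary, features, and weights below are illustrative, not from the lecture; they mimic the "close the ____" example.

```python
import math

CANDIDATES = ["door", "table"]

def feats(x, y):
    """Indicator features in the style of the slides, e.g. x-1="the" AND y."""
    fs = [("prev=" + x[-1], y)]
    if "close" in x:
        fs.append(("close_in_x", y))
    return fs

def score(w, x, y):
    """Linear score w.f(x, y); missing weights default to zero."""
    return sum(w.get(f, 0.0) for f in feats(x, y))

def log_prob(w, x, y):
    """log P(y | x, w) = w.f(x, y) - log sum_y' exp(w.f(x, y'))."""
    log_z = math.log(sum(math.exp(score(w, x, c)) for c in CANDIDATES))
    return score(w, x, y) - log_z

def log_likelihood(w, data):
    """Training log-likelihood: sum_i log P(y*_i | x_i, w)."""
    return sum(log_prob(w, x, y) for x, y in data)

data = [(("close", "the"), "door")]
w = {("prev=the", "door"): 1.0, ("close_in_x", "door"): 1.0}
```

With all-zero weights each candidate gets probability 1/2; weights that fire on the true output raise the training log-likelihood.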
Training

History of Training
§ 1990s: Specialized methods (e.g. iterative scaling)
§ 2000s: General-purpose methods (e.g. conjugate gradient)
§ 2010s: Online methods (e.g. stochastic gradient)
What Does LL Look Like?
§ Example
  § Data: x x x y
  § Two outcomes, x and y
  § One indicator for each
  § Likelihood
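Writing the likelihood out for this toy dataset (a worked sketch; $w_x, w_y$ denote the two indicator weights):

```latex
P(\mathtt{x}) = \frac{e^{w_x}}{e^{w_x} + e^{w_y}}, \qquad
L(w) = 3\log P(\mathtt{x}) + \log P(\mathtt{y})
     = 3w_x + w_y - 4\log\!\left(e^{w_x} + e^{w_y}\right)
```

Setting $\partial L / \partial w_x = 3 - 4P(\mathtt{x}) = 0$ gives $P(\mathtt{x}) = 3/4$, i.e. $w_x - w_y = \log 3$. Only the difference of the weights matters, so the maximum is attained along a ridge rather than at a single point.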
Convex Optimization
§ The maxent objective is an unconstrained convex problem
§ One optimal value*, gradients point the way
Gradients

$$\frac{\partial L(w)}{\partial w} = \sum_i f(x_i, y_i^*) - \sum_i \mathbb{E}_{P(y \mid x_i; w)}[f(x_i, y)]$$

§ First term: count of features under target labels
§ Second term: expected count of features under the model-predicted label distribution
Gradient Ascent
§ The maxent objective is an unconstrained optimization problem
§ Gradient Ascent
  § Basic idea: move uphill from current guess
  § Gradient ascent/descent follows the gradient incrementally
  § At a local optimum, the derivative vector is zero
  § Will converge if step sizes are small enough, but not efficient
  § All we need is to be able to evaluate the function and its derivative
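A minimal sketch of batch gradient ascent on the "x x x y" example, where the gradient is observed feature counts minus expected counts; the step size and iteration count are illustrative.

```python
import math

OUTCOMES = ["x", "y"]
DATA = ["x", "x", "x", "y"]  # the "x x x y" example

def probs(w):
    """Softmax over the two outcomes; one indicator weight per outcome."""
    z = sum(math.exp(w[o]) for o in OUTCOMES)
    return {o: math.exp(w[o]) / z for o in OUTCOMES}

def gradient(w):
    """Observed counts minus expected counts under the model."""
    p = probs(w)
    return {o: DATA.count(o) - len(DATA) * p[o] for o in OUTCOMES}

w = {"x": 0.0, "y": 0.0}
for _ in range(500):           # move uphill from the current guess
    g = gradient(w)
    for o in OUTCOMES:
        w[o] += 0.1 * g[o]     # small fixed step size
```

The fitted model should put probability about 3/4 on "x", matching the worked likelihood example.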
(Quasi-)Newton Methods
§ 2nd-order methods: repeatedly create a quadratic approximation and solve it
§ E.g. L-BFGS, which tracks derivatives to approximate the (inverse) Hessian
Regularization

Regularization Methods
§ Early stopping
§ L2: L(w) − ‖w‖₂²
§ L1: L(w) − ‖w‖₁
Regularization Effects
§ Early stopping: don't do this
§ L2: weights stay small but non-zero
§ L1: many weights driven to zero
  § Good for sparsity
  § Usually bad for accuracy in NLP
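A sketch of how the L2 penalty modifies the objective and its gradient; λ here is a hypothetical regularization strength not specified on the slides.

```python
def l2_objective(log_likelihood, w, lam):
    """Regularized objective: L(w) - lam * ||w||_2^2."""
    return log_likelihood - lam * sum(v * v for v in w.values())

def l2_gradient(gradient, w, lam):
    """Each gradient component picks up an extra -2*lam*w term."""
    return {k: g - 2.0 * lam * w.get(k, 0.0) for k, g in gradient.items()}
```

The penalty pulls every weight toward zero at a rate proportional to its current magnitude, which is why L2 weights stay small but rarely become exactly zero.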
Scaling

Why Is Scaling Hard?
§ Big normalization terms
§ Lots of data points
Hierarchical Prediction
§ Hierarchical prediction / softmax [Mikolov et al., 2013]
§ Noise-Contrastive Estimation [Mnih, 2013]
§ Self-Normalization [Devlin, 2014]
Image: ayende.com
Stochastic Gradient
§ View the gradient as an average over data points
§ Stochastic gradient: take a step each example (or mini-batch)
§ Substantial improvements exist, e.g. AdaGrad (Duchi, 2011)
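A sketch of per-example stochastic steps with the AdaGrad scaling the slide mentions; the learning rate, toy data, and function names are illustrative.

```python
import math
import random

def sgd_adagrad(grad_fn, examples, dim, lr=0.5, epochs=50, seed=0):
    """Take a gradient step per example; AdaGrad scales each coordinate's
    step by the root of its accumulated squared gradients (Duchi, 2011)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    cache = [1e-8] * dim
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for ex in data:
            g = grad_fn(w, ex)                 # gradient on this example only
            for j in range(dim):
                cache[j] += g[j] ** 2
                w[j] += lr * g[j] / math.sqrt(cache[j])
    return w

def toy_grad(w, ex):
    """Per-example maxent gradient: observed indicator minus model probability."""
    z = math.exp(w[0]) + math.exp(w[1])
    p = [math.exp(w[0]) / z, math.exp(w[1]) / z]
    return [(1.0 if j == ex else 0.0) - p[j] for j in range(2)]

w = sgd_adagrad(toy_grad, [0, 0, 0, 1], dim=2)  # the "x x x y" data again
```

Frequently-updated coordinates accumulate a large cache and take smaller steps, while rare features keep taking comparatively large ones.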
Log-linear Parameterization

§ Model form:

$$P(y \mid x; w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$

§ Learn by following the gradient of the training LL:

$$\frac{\partial L(w)}{\partial w} = \sum_i f(x_i, y_i^*) - \sum_i \left( \mathbb{E}_{P(y \mid x_i; w)}[f(x_i, y)] \right)$$
Mixed Interpolation
§ But can't we just interpolate:
  § P(w | most recent words)
  § P(w | skip contexts)
  § P(w | caching)
  § …
§ Yes, and people do (well, did)
§ But additive combination tends to flatten distributions, not zero out candidates
Neural LMs

Neural LMs
Image: (Bengio et al., 2003)
Neural vs Maxent

§ Maxent LM:

$$P(y \mid x; w) \propto \exp(w^\top f(x, y))$$

§ Simple Neural LM:

$$P(y \mid x; A, B, C) \propto \exp\left(A_y^\top \, \sigma(B^\top C_x)\right)$$

σ nonlinear, e.g. tanh
Neural LM Example

x-2 = closing,  x-1 = the

Context embeddings: C_closing = [1.2, 7.4, …],  C_the = [–3.3, 1.1, …]

Hidden layer: h = σ(B⊤C_x)  (e.g. [0.1, –0.99, …])

Output embeddings A_man, A_door, A_doors, … (e.g. scores 2.3, 1.5, …)

$$P(y \mid x) \propto e^{A_y^\top h}$$
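A toy forward pass in the shape of this model; every matrix and word below is made up for illustration (real models learn these parameters over large vocabularies).

```python
import math

# Context word embeddings C, one short vector per word -- illustrative values
C = {"closing": [1.2, 7.4], "the": [-3.3, 1.1]}
# Output embeddings A, one per candidate next word -- illustrative values
A = {"man": [0.5, -0.2], "door": [1.0, 0.3], "doors": [1.1, 0.4]}
# Hidden layer B maps the concatenated context (4 dims) to 2 hidden units
B = [[0.1, -0.2, 0.3, 0.05], [0.2, 0.1, -0.1, 0.4]]

def forward(context):
    """h = tanh(B . C_x);  P(y|x) proportional to exp(A_y . h)."""
    cx = [v for word in context for v in C[word]]       # concatenate embeddings
    h = [math.tanh(sum(b * v for b, v in zip(row, cx))) for row in B]
    scores = {y: math.exp(sum(a * u for a, u in zip(vec, h))) for y, vec in A.items()}
    z = sum(scores.values())                            # softmax normalization
    return {y: s / z for y, s in scores.items()}

p = forward(["closing", "the"])
```

Unlike the maxent LM's hand-built indicator conjunctions, the hidden layer lets the model learn interactions between context words.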
Decision Trees/Forests
§ Decision trees?
  § Good for non-linear decision problems
  § Random forests can improve further [Xu and Jelinek, 2004]
  § Paths to leaves basically learn conjunctions
  § General contrast between DTs and linear models

Figure: tree splitting on questions like "Prev word?" and "…last verb?"
Speech Signals

Speech in a Slide

§ Frequency gives pitch; amplitude gives volume
§ Frequencies at each time slice processed into observation vectors

Figure: waveform of "s p ee ch l a b" (amplitude vs. time), processed into a sequence of observation vectors … x12 x13 x12 x14 x14 …
Articulation

Articulatory System

Figure: sagittal section of the vocal tract (Techmer 1880), labeling the nasal cavity, oral cavity, pharynx, vocal folds (in the larynx), trachea, and lungs.
Text from Ohala, Sept 2001, from Sharon Rose slide
Space of Phonemes
§ Standard International Phonetic Alphabet (IPA) chart of consonants
Place

Places of Articulation

Figure: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, and laryngeal/glottal places of articulation.
Figure thanks to Jennifer Venditti
Labial Place
§ Bilabial: p, b, m
§ Labiodental: f, v
Figure thanks to Jennifer Venditti
Coronal Place
§ Dental: th/dh
§ Alveolar: t/d/s/z/l/n
§ Post-alveolar/palatal: sh/zh/y
Figure thanks to Jennifer Venditti
Dorsal Place
§ Velar: k/g/ng
Figure (velar, uvular, pharyngeal regions) thanks to Jennifer Venditti
Space of Phonemes
§ Standard International Phonetic Alphabet (IPA) chart of consonants
Manner

Manner of Articulation
§ In addition to varying by place, sounds vary by manner
§ Stop: complete closure of articulators, no air escapes via mouth
  § Oral stop: palate is raised (p, t, k, b, d, g)
  § Nasal stop: oral closure, but palate is lowered (m, n, ng)
§ Fricatives: substantial closure, turbulent: (f, v, s, z)
§ Approximants: slight closure, sonorant: (l, r, w)
§ Vowels: no closure, sonorant: (i, e, a)
Space of Phonemes
§ Standard International Phonetic Alphabet (IPA) chart of consonants

Vowels

Vowel Space
Acoustics

"She just had a baby"
§ What can we learn from a wave file?
§ No gaps between words (!)
§ Vowels are voiced, long, loud
§ Length in time = length in space in the waveform picture
§ Voicing: regular peaks in amplitude
§ When stops are closed: no peaks, silence
§ Peaks = voicing: .46 to .58 (vowel [iy]), from .65 to .74 (vowel [ax]), and so on
§ Silence of stop closure (1.06 to 1.08 for the first [b], or 1.26 to 1.28 for the second [b])
§ Fricatives like [sh]: intense irregular pattern; see .33 to .46
Time-Domain Information

Figure: waveforms of "bad", "pad", "spat", "pat"
Example from Ladefoged
Simple Periodic Waves of Sound

Figure: sine wave, amplitude from –0.99 to 0.99 over 0.02 s

• Y axis: Amplitude = amount of air pressure at that point in time
  • Zero is normal air pressure, negative is rarefaction
• X axis: Time
• Frequency = number of cycles per second
• 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz
Complex Waves: 100 Hz + 1000 Hz

Figure: combined waveform, amplitude from –0.9654 to 0.99 over 0.05 s

Spectrum

Figure: amplitude vs. frequency, with the frequency components (100 and 1000 Hz) on the x-axis
Part of [ae] waveform from "had"
§ Note complex wave repeating nine times in figure
§ Plus smaller waves which repeat 4 times for every large pattern
§ Large wave has frequency of 250 Hz (9 times in .036 seconds)
§ Small wave roughly 4 times this, or roughly 1000 Hz
§ Two little tiny waves on top of peak of 1000 Hz waves
Spectrum of an Actual Soundwave

Figure: magnitude (0–40 dB) vs. frequency (0–5000 Hz)
Back to Spectra
§ Spectrum represents these frequency components
§ Computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
§ x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
§ Peaks at 930 Hz, 1860 Hz, and 3020 Hz
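The Fourier-transform step can be sketched on the earlier 100 Hz + 1000 Hz example; the sample rate, duration, and amplitudes below are illustrative, and a naive DFT stands in for the fast algorithms used in practice.

```python
import cmath
import math

fs = 4000                                  # sample rate (Hz), illustrative
n = 1000                                   # 0.25 s of samples
# Complex wave from the earlier slide: 100 Hz + 1000 Hz components
wave = [math.sin(2 * math.pi * 100 * i / fs)
        + 0.5 * math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]

def dft_magnitudes(x):
    """Naive discrete Fourier transform: magnitude of each frequency bin."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

mags = dft_magnitudes(wave)
freqs = [k * fs / n for k in range(len(mags))]   # bin k corresponds to k*fs/n Hz
# The two largest magnitudes sit at the component frequencies
top2 = sorted(freqs[k] for k in sorted(range(len(mags)), key=mags.__getitem__)[-2:])
```

The magnitude spectrum separates the mixed waveform back into its 100 Hz and 1000 Hz components, exactly the decomposition the slide describes.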
Source/Channel

Why These Peaks?
§ Articulation process:
  § The vocal cord vibrations create harmonics
  § The mouth is an amplifier
  § Depending on the shape of the mouth, some harmonics are amplified more than others
Figures from Ratree Wayland
Vowel [i] at Increasing Pitches

Figure: spectra of the vowel [i] sung at F#2, A2, C3, F#3, A3, C4 (middle C), and A4
Resonances of the Vocal Tract
§ The human vocal tract as an open tube:
  § Closed end (at the glottis), open end (at the lips), length 17.5 cm
§ Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube
§ Constraint: pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
Figure from W. Barry
From Sundberg
Computing the 3 Formants of Schwa
§ Let the length of the tube be L
§ F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
§ F2 = c/λ2 = c/((4/3)L) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
§ F3 = c/λ3 = c/((4/5)L) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz
§ So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz
§ These vowel resonances are called formants
From Mark Liberman
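The quarter-wave resonances above generalize to F_k = (2k − 1)·c/(4L); a quick check of the slide's arithmetic (function name is ours):

```python
def formants(length_cm, c=35000.0, n=3):
    """Resonances of a tube closed at one end: F_k = (2k - 1) * c / (4L)."""
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, n + 1)]

# 17.5 cm vocal tract -> neutral-vowel (schwa) formants
f1, f2, f3 = formants(17.5)
```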
Seeing Formants: the Spectrogram

Vowel Space

Spectrograms
How to Read Spectrograms
§ [bab]: closure of lips lowers all formants: so rapid increase in all formants at beginning of "bab"
§ [dad]: first formant increases, but F2 and F3 slightly fall
§ [gag]: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
From Ladefoged, "A Course in Phonetics"
"She came back and started again"
1. lots of high-freq energy
3. closure for k
4. burst of aspiration for k
5. ey vowel; faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for "k"
From Ladefoged, "A Course in Phonetics"
Deriving Schwa
§ Reminder of basic facts about sound waves
§ f = c/λ
§ c = speed of sound (approx. 35,000 cm/sec)
§ A sound with λ = 10 meters: f = 35 Hz (35,000/1,000)
§ A sound with λ = 2 centimeters: f = 17,500 Hz (35,000/2)
American English Vowel Space

Figure: vowel chart arranged FRONT to BACK and HIGH to LOW, with iy, ih, ix, ux, uw, uh, eh, ah, ax, ao, ae, aa
Figures from Jennifer Venditti, H. T. Bunnell
Dialect Issues
§ Speech varies from dialect to dialect (examples are American vs. British English)
  § Syntactic ("I could" vs. "I could do")
  § Lexical ("elevator" vs. "lift")
  § Phonological
  § Phonetic
§ Mismatch between training and testing dialects can cause a large increase in error rate

Examples: American vs. British "all", "old"