Sequence Models I
TRANSCRIPT
Wei Xu (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)
Administrivia
‣ Project 1 is out, due on Sep 20 (next Monday!).
‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8
This Lecture
‣ Sequence modeling
‣ HMMs for POS tagging
‣ Viterbi, forward-backward
‣ HMM parameter estimation
Linguistic Structures
‣ Language is tree-structured
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
‣ Understanding syntax fundamentally requires trees; the sentences have the same shallow analysis
I ate the spaghetti with chopsticks    I ate the spaghetti with meatballs
PRP VBD DT NN IN NNS                   PRP VBD DT NN IN NNS
Linguistic Structures
‣ Language is sequentially structured: interpreted in an online way
Tanenhaus et al. (1995)
POS Tagging
Ghana's ambassador should have set up the big meeting in DC yesterday.
‣ What tags are out there?
NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .
POS Tagging
Slide credit: Dan Klein
POS Tagging
Slide credit: Yoav Artzi
POS Tagging
Fed raises interest rates 0.5 percent
[Candidate tags per word: Fed: VBD VBN NNP | raises: VBZ NNS | interest: VB VBP NN | rates: VBZ NNS | 0.5: CD | percent: NN]
I'm 0.5% interested in the Fed's raises!
I hereby increase interest rates 0.5%
Fed raises interest rates 0.5 percent
[Candidate tags per word: Fed: VBD VBN NNP | raises: VBZ NNS | interest: VB VBP NN | rates: VBZ NNS | 0.5: CD | percent: NN]
‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.
What is this good for?
‣ Text-to-speech: record, lead
‣ Preprocessing step for syntactic parsers
‣ Domain-independent disambiguation for other tasks
‣ (Very) shallow information extraction
Sequence Models
‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)
‣ POS tagging: x is a sequence of words, y is a sequence of tags
‣ Today: generative models P(x, y); discriminative models next time
Hidden Markov Models
‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)
‣ Model the sequence of y as a Markov process
[Diagram: chain y1 → y2 → y3]
‣ Markov property: future is conditionally independent of the past given the present: P(y3 | y1, y2) = P(y3 | y2)
‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before
‣ Lots of mathematical theory about how Markov chains behave
Hidden Markov Models
[Diagram: tag chain y1 → y2 → … → yn, each state yi emitting a word xi]
‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)
y: NNP VBZ … NN
x: Fed raises … percent
Hidden Markov Models
[Diagram: tag chain y1 → y2 → … → yn, each state yi emitting a word xi]
‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)
P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi-1) ∏_{i=1..n} P(xi | yi)
(initial distribution, transition probabilities, emission probabilities)
‣ P(x|y) is a distribution over all words in the vocabulary, not a distribution over features (but could be!)
‣ Multinomials: tag × tag transitions, tag × word emissions
‣ Observation (x) depends only on current state (y)
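To make this factorization concrete, here is a minimal Python sketch (not from the lecture; the probability tables are invented toy values for the running example) that scores one tag/word sequence:

    import math

    # Toy tables, made up for illustration; a real tagger estimates these from data.
    init = {"NNP": 0.4, "VBZ": 0.1, "NN": 0.5}                 # P(y1)
    trans = {("NNP", "VBZ"): 0.6, ("VBZ", "NN"): 0.7}          # P(yi | yi-1)
    emit = {("NNP", "Fed"): 0.02, ("VBZ", "raises"): 0.03,
            ("NN", "interest"): 0.03}                          # P(xi | yi)

    def log_joint(tags, words):
        """log P(y, x) = log P(y1) + sum of log P(yi | yi-1) + sum of log P(xi | yi)."""
        score = math.log(init[tags[0]])
        for prev, cur in zip(tags, tags[1:]):
            score += math.log(trans[(prev, cur)])
        for tag, word in zip(tags, words):
            score += math.log(emit[(tag, word)])
        return score

    print(log_joint(["NNP", "VBZ", "NN"], ["Fed", "raises", "interest"]))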
Transitions in POS Tagging
‣ Dynamics model: P(y1) ∏_{i=2..n} P(yi | yi-1)
Fed raises interest rates 0.5 percent .
[Candidate tags per word: Fed: VBD VBN NNP | raises: VBZ NNS | interest: VB VBP NN | rates: VBZ NNS | 0.5: CD | percent: NN]
‣ P(y1 = NNP): likely because start of sentence
‣ P(y2 = VBZ | y1 = NNP): likely because verb often follows noun
‣ P(y3 = NN | y2 = VBZ): direct object follows verb, other verb rarely follows past tense verb (main verbs can follow modals though!)
NNP - proper noun, singular; VBZ - verb, 3rd ps. sing. present; NN - noun, singular or mass
Estimating Transitions
‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data
Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .
‣ P(tag | NN) = (0.5 ., 0.5 NNS)
‣ How to smooth?
‣ One method: smooth with the unigram distribution over tags
P(tag | tag-1) = (1 − λ) P̂(tag | tag-1) + λ P̂(tag)
P̂ = empirical distribution (read off from data)
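As a sketch of that recipe (our own code, assuming the training data is a list of tag sequences; lam plays the role of the interpolation weight λ above):

    from collections import Counter

    def estimate_transitions(tag_sequences, lam=0.1):
        """P(tag | prev) = (1 - lam) * MLE bigram estimate + lam * unigram tag distribution."""
        unigrams, bigrams, context = Counter(), Counter(), Counter()
        for tags in tag_sequences:
            unigrams.update(tags)
            bigrams.update(zip(tags, tags[1:]))
            context.update(tags[:-1])  # counts of positions that have a successor
        total = sum(unigrams.values())
        def p(tag, prev):
            mle = bigrams[(prev, tag)] / context[prev] if context[prev] else 0.0
            return (1 - lam) * mle + lam * unigrams[tag] / total
        return p

    # The one-sentence "corpus" from the slide:
    p = estimate_transitions([["NNP", "VBZ", "NN", "NNS", "CD", "NN", "."]])
    print(p("NNS", "NN"), p(".", "NN"))  # each close to 0.5, plus a little smoothing mass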
Emissions in POS Tagging
‣ Emissions P(x|y) capture the distribution of words occurring with a given tag
‣ P(word | NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)
‣ When you compute the posterior for a given word's tags, the distribution favors tags that are more likely to generate that word
‣ How should we smooth this?
Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .
Estimating Emissions
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ P(word | NN) = (0.5 interest, 0.5 percent); hard to smooth!
‣ Fancy techniques from language modeling, e.g. look at type fertility (P(tag | word) is flatter for some kinds of words than for others)
‣ Alternative: use Bayes' rule: P(word | tag) = P(tag | word) P(word) / P(tag)
‣ Can interpolate with a distribution looking at word shape, P(word shape | tag) (e.g., P(capitalized word of length >= 8 | tag))
‣ P(word | tag) can be a log-linear model; we'll see this in a few lectures
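A tiny sketch of the Bayes-rule estimate with the word-shape interpolation on top (all numbers and the mixing weight mu are hypothetical, not from the lecture):

    def smoothed_emission(p_tag_given_word, p_word, p_tag, p_shape_given_tag, mu=0.9):
        """Interpolate the Bayes-rule estimate of P(word | tag) with a word-shape distribution."""
        bayes = p_tag_given_word * p_word / p_tag  # P(word | tag) via Bayes' rule
        return mu * bayes + (1 - mu) * p_shape_given_tag

    # Hypothetical numbers for "interest" with tag NN:
    print(smoothed_emission(p_tag_given_word=0.7, p_word=1e-4, p_tag=0.15,
                            p_shape_given_tag=1e-5))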
Inference in HMMs
‣ Input x = (x1, ..., xn), Output y = (y1, ..., yn)
[Diagram: tag chain y1 → y2 → … → yn, each state yi emitting a word xi]
P(y, x) = P(y1) ∏_{i=2..n} P(yi | yi-1) ∏_{i=1..n} P(xi | yi)
‣ Inference problem: argmax_y P(y | x) = argmax_y P(y, x) / P(x)
‣ Exponentially many possible y here!
‣ Solution: dynamic programming (possible because of the Markov structure!)
‣ Many neural sequence models depend on the entire previous tag sequence and need to use approximations like beam search
Viterbi Algorithm
slide credit: Vivek Srikumar
‣ best (partial) score for a sequence ending in state s
slide credit: Dan Klein
‣ "Think about" all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
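The worked examples on these slides are figures; as a stand-in, here is a compact sketch of the Viterbi recurrence (our own code; log_init, log_trans, and log_emit are dictionaries of log probabilities in the format of the earlier sketches):

    def viterbi(words, states, log_init, log_trans, log_emit):
        """Return the highest-scoring tag sequence under log P(y, x)."""
        # best[t][s] = best log score of any tag sequence for words[:t+1] ending in state s
        best = [{s: log_init[s] + log_emit[(s, words[0])] for s in states}]
        back = []
        for word in words[1:]:
            scores, ptrs = {}, {}
            for s in states:
                # Consider all immediate prior states; everything earlier is already
                # summarized in best[-1] thanks to the Markov property.
                prev = max(states, key=lambda sp: best[-1][sp] + log_trans[(sp, s)])
                scores[s] = best[-1][prev] + log_trans[(prev, s)] + log_emit[(s, word)]
                ptrs[s] = prev
            best.append(scores)
            back.append(ptrs)
        # Trace backpointers from the best final state.
        tags = [max(states, key=lambda s: best[-1][s])]
        for ptrs in reversed(back):
            tags.append(ptrs[tags[-1]])
        return list(reversed(tags))

Each time step does a max over prior states per state, so the whole pass is O(n·|S|²) instead of enumerating the exponentially many sequences.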
Forward-Backward Algorithm
‣ In addition to finding the best path, we may want to compute marginal probabilities of paths:
P(yi = s | x) = ∑_{y1, …, yi-1, yi+1, …, yn} P(y | x)
‣ What did Viterbi compute? P(ymax | x) = max_{y1, …, yn} P(y | x)
‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward
Forward-Backward Algorithm
slide credit: Dan Klein
P(y3 = 2 | x) = (sum of all paths through state 2 at time 3) / (sum of all paths)
‣ Easiest and most flexible to do one pass to compute the forward probabilities α and one to compute the backward probabilities β
Forward-Backward Algorithm
‣ Initial: α1(s) = P(s) P(x1 | s)
‣ Recurrence: αt(st) = ∑_{st-1} αt-1(st-1) P(st | st-1) P(xt | st)
‣ Same as Viterbi but summing instead of maxing!
‣ These quantities get very small! Store everything as log probabilities
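Since these α values underflow fast, a common fix (a sketch, not lecture code) is to run the same recurrence in log space with a stable log-sum-exp:

    import math

    def logsumexp(xs):
        """Numerically stable log(sum(exp(x) for x in xs))."""
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def forward(words, states, log_init, log_trans, log_emit):
        """alpha[t][s] = log P(x1..xt, yt = s), i.e. the recurrence above in log space."""
        alpha = [{s: log_init[s] + log_emit[(s, words[0])] for s in states}]
        for word in words[1:]:
            alpha.append({s: logsumexp([alpha[-1][sp] + log_trans[(sp, s)]
                                        for sp in states]) + log_emit[(s, word)]
                          for s in states})
        return alpha  # logsumexp over the last column gives log P(x)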
Forward-Backward Algorithm
‣ Initial: βn(s) = 1
‣ Recurrence: βt(st) = ∑_{st+1} βt+1(st+1) P(st+1 | st) P(xt+1 | st+1)
‣ Big difference: count the emission for the next time step (not the current one)
Forward-Backward Algorithm
‣ Forward: α1(s) = P(s) P(x1 | s); αt(st) = ∑_{st-1} αt-1(st-1) P(st | st-1) P(xt | st)
‣ Backward: βn(s) = 1; βt(st) = ∑_{st+1} βt+1(st+1) P(st+1 | st) P(xt+1 | st+1)
P(s3 = 2 | x) = α3(2) β3(2) / ∑_i α3(i) β3(i)
‣ What is the denominator here? P(x)
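Combining the two passes gives the marginals. This sketch continues the previous block, reusing its math import and the logsumexp and forward helpers (all names are ours, not the lecture's):

    def backward(words, states, log_trans, log_emit):
        """beta[t][s] = log P(x_{t+1}..x_n | yt = s); note the emission is for step t+1."""
        beta = [{s: 0.0 for s in states}]  # log 1 at the final time step
        for word in reversed(words[1:]):
            beta.insert(0, {s: logsumexp([log_trans[(s, sn)] + log_emit[(sn, word)]
                                          + beta[0][sn] for sn in states])
                            for s in states})
        return beta

    def marginals(words, states, log_init, log_trans, log_emit):
        """P(yt = s | x) = alpha_t(s) * beta_t(s) / sum_i alpha_t(i) * beta_t(i)."""
        alpha = forward(words, states, log_init, log_trans, log_emit)
        beta = backward(words, states, log_trans, log_emit)
        result = []
        for a, b in zip(alpha, beta):
            log_z = logsumexp([a[s] + b[s] for s in states])  # log P(x), same at every t
            result.append({s: math.exp(a[s] + b[s] - log_z) for s in states})
        return result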
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
Slide credit: Dan Klein
Trigram Taggers
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), …
‣ P((VBZ, NN) | (NNP, VBZ)): more context! Noun-verb-noun, S-V-O
‣ Tradeoff between model capacity and data size; trigrams are a "sweet spot" for POS tagging
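One way to read the pair-state notation above: a trigram tagger is a bigram HMM whose states are tag pairs, so the earlier algorithms apply unchanged. A small sketch (our own illustration, not lecture code) of that state space:

    from itertools import product

    tags = ["NNP", "VBZ", "NN", "NNS", "CD", "."]

    # Trigram states are (previous tag, current tag) pairs; a transition between
    # pair states encodes P(yt | yt-1, yt-2) while the chain stays Markov.
    states = [("<S>", t) for t in tags] + list(product(tags, repeat=2))

    def compatible(prev_state, next_state):
        """A transition (a, b) -> (b', c) is only allowed when b == b'."""
        return prev_state[1] == next_state[0]

    print(len(states))  # the state space grows from |T| to O(|T|^2)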
HMM POS Tagging
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on unknown words
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks
‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+ on unks
Slide credit: Dan Klein
https://arxiv.org/pdf/cs/0003055.pdf
Errors
official knowledge        made up the story        recently sold shares
JJ/NN NN                  VBD RP/IN DT NN          RB VBD/VBN NNS
Slide credit: Dan Klein / Toutanova + Manning (2000) (NN NN: tax cut, art gallery, …)
Remaining Errors
‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%
‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)
‣ Difficult linguistics: 20%
They set up absurd situations, detached from reality: VBD/VBP? (past or present?)
a $10 million fourth-quarter charge against discontinued operations: adjective or verbal participle? JJ/VBN?
Manning 2011, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"
Other Languages
Petrov et al. 2012
Other Languages
‣ Universal POS tagset (~12 tags); cross-lingual model works as well as a tuned CRF using external resources
Gillick et al. 2016
Byte-to-Span
Zero-shot Cross-lingual Transfer Learning
‣ Models are trained on annotated English data, then directly applied to Arabic texts for POS tagging.
Lan, Chen, Xu, Ritter 2020
Next Up
‣ CRFs: feature-based discriminative models
‣ Named entity recognition