Lecture 5: Hidden Variables
Random Variables in Decoding
• inputs (X)
• outputs (Y)
• parameters (w)
Random Variables in Supervised Learning
• inputs (X)
• outputs (Y)
• parameters (w)
Hidden Variables are Different
• We use the term "hidden variable" (or "latent variable") to refer to something we never see.
– Not even in training.
– Sometimes we believe they are real.
– Sometimes we believe they only approximate reality.
Random Variables in Decoding
• inputs (X)
• outputs (Y)
• parameters (w)
• latent (Z)
Random Variables in Supervised Learning
• inputs (X)
• outputs (Y)
• parameters (w)
• latent (Z)
Latent Variables and Inference
• Both learning and decoding can still be understood as inference problems.
• Usually "mixed":
– some variables are getting maximized
– some variables are getting summed
Word Alignments
• Since IBM Model 1, word alignments have been the prototypical hidden variable.
• Ultimately, in translation, we do not care what they are.
• Current approach: learn the word alignments unsupervised, then fix them to their most likely values.
– Then construct models for translation.
• Alignment on its own: unsupervised problem.
• MT on its own: supervised problem.
• MT + alignment: supervised problem with latent variables.
Alignments in Text-to-Text Problems
• Wang et al. (2007): "Jeopardy" model for answer ranking in QA.
– Align questions to answers.
– Similar model for paraphrase detection (Das and Smith, 2009).
Latent Annotations in Parsing
• Treebank categories (N, NN, NP, etc.) are too coarse-grained.
– Lexicalization (Collins, Eisner)
– Johnson's (1998) parent annotation
– Klein and Manning (2003) parser
• Treat the true, fine-grained category as hidden, and infer it from data.
– Matsuzaki, Petrov, Dreyer, many others.
Richer Formalisms
• Cohn et al. (2009): tree substitution grammar.
– Derived tree is observed (output variable).
– Derivation tree (segmentation into elementary trees) is hidden.
• Zettlemoyer and Collins (2005 and later): infer CCG syntax from first-order logical expressions and sentences.
• Liang et al. (2011): infer semantic representation from text and database.
Topic Models
• Infer topics (or topic blends) in documents.
• Latent Dirichlet allocation (Blei et al., 2003) is a great example.
– Sometimes augmented with an output variable (Blei and McAuliffe, 2007): "supervised" LDA.
– Many extensions!
Unsupervised NLP
• Clustering (Brown et al., 1992, many more)
• POS tagging (Merialdo, 1994, many more)
• Parsing (Pereira and Schabes, Klein and Manning, …)
• Segmentation (word: Goldwater; discourse: Eisenstein)
• Morphology
• Lexical semantics
• Syntax-semantics correspondences
• Sentiment analysis
• Coreference resolution
• Word, phrase, and tree alignment
Supervised or Unsupervised?
• Depends on the task, not the model.
– I say "unsupervised" when the output variables are hidden at training time.
Random Variables in Unsupervised Learning
• inputs (X)
• outputs (Y)
• parameters (w)
• latent (Z)
Probabilistic View
• The usual starting point for hidden variables is maximum likelihood.
– "Input" and "output" do not matter; only observed/latent.
Random Variables in Probabilistic Learning
• visible (V)
• latent (L)
• parameters (w)
Empirical Risk View
• Log-loss
– Equates to maximum marginal likelihood (or MAP if R(w) is a negated log-prior).
– Unlike the loss functions in lecture 4, this is not convex!
– EM seeks to solve this problem (but it's not the only way).
– Regularization decisions are orthogonal.

$\mathrm{loss}(v; h_w) = -\log p_w(v) = -\log \sum_{\ell} p_w(v, \ell)$

$\min_{w \in \mathbb{R}^d} \frac{1}{N} \sum_{i} \mathrm{loss}(v_i; h_w) + R(w)$
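One way to see the non-convexity (assuming a log-linear parameterization, which the slide leaves open): if $p_w(v, \ell) = \exp\{w \cdot f(v, \ell)\} / Z(w)$, then

$-\mathrm{loss}(v; h_w) = \log \sum_{\ell} \exp\{w \cdot f(v, \ell)\} - \log Z(w),$

a difference of two convex log-sum-exp functions, hence generally neither convex nor concave once $\ell$ is marginalized out.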
Optimizing the Marginal Log-Loss
• EM as inference
• EM as optimization
• Direct optimization
Generic EM Algorithm
• Input: w^(0) and observations v_1, v_2, …, v_N
• Output: learned w
• t = 0
• Repeat until w^(t) ≈ w^(t−1):
– E step: $\forall i, \forall \ell: \; q_i^{(t)}(\ell) \leftarrow p_{w^{(t)}}(\ell \mid v_i)$
– M step: $w^{(t+1)} \leftarrow \arg\max_{w} \sum_{i} \sum_{\ell} q_i^{(t)}(\ell) \log p_w(v_i, \ell)$
– ++t
• Return w^(t)
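To make the loop concrete, here is a minimal runnable sketch (not from the lecture) instantiating the generic algorithm for a mixture of two biased coins; all names are hypothetical:

```python
import numpy as np

# Toy instance of the generic EM loop: a mixture of two biased coins.
# Each observation v_i is the number of heads in `flips` tosses of a
# coin drawn from the mixture; which coin was used is the latent l.
rng = np.random.default_rng(0)
flips = 10
true_bias = np.array([0.2, 0.8])
z = rng.integers(0, 2, size=200)          # hidden coin choices
v = rng.binomial(flips, true_bias[z])     # observed head counts

pi = np.array([0.5, 0.5])                 # mixing weights
theta = np.array([0.3, 0.6])              # per-coin head probabilities

def log_joint(v, pi, theta):
    """log p_w(v_i, l) for every example i and latent value l
    (the binomial coefficient is omitted; it cancels in the posterior)."""
    ll = (v[:, None] * np.log(theta[None, :])
          + (flips - v)[:, None] * np.log1p(-theta[None, :]))
    return np.log(pi[None, :]) + ll

for t in range(50):
    # E step: q_i(l) <- p_w(l | v_i), the posterior over the latent coin.
    lj = log_joint(v, pi, theta)
    q = np.exp(lj - lj.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M step: argmax_w of sum_i sum_l q_i(l) log p_w(v_i, l), which for
    # multinomials is relative-frequency estimation with soft counts.
    pi = q.mean(axis=0)
    theta = (q * v[:, None]).sum(axis=0) / (q.sum(axis=0) * flips)

print(pi, theta)  # approaches the true mixture, up to label swap
```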
MAP Learning as a Graphical Model
[Figure: graphical model with parameter node w, latent node L, visible node V, and regularizer R; factors exp(−R(w)) = p(w), p_w(L), and p_w(V | L).]
• Combined inference (max over w, sum over L) is very hard.
– If w were fixed, getting the posterior over L wouldn't be so bad.
– If L were fixed, maximizing over w wouldn't be so bad.
MAP Learning as a Graphical Model
[Figure: the same model, annotated: the E step concerns the latent L (posterior inference with w fixed); the M step concerns the parameters w (maximization with the q's fixed).]
Baum-Welch (EM for HMMs) as an Example
• E step: forward-backward algorithm (on each example).
– This is exact marginal inference by variable elimination.
– The structure of the graphical model lets us do this by dynamic programming.
– The marginals are probabilities of transition and emission events at each position.
• M step: MLE based on soft event counts.
– Relative frequency estimation accomplishes MLE for multinomials.
[Figure: HMM with visible words (V), latent states (L), and parameters (w).]
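That M step is simple enough to show directly. A minimal sketch (not from the lecture; it assumes the E step has already aggregated expected transition and emission counts, and the array names are hypothetical):

```python
import numpy as np

def hmm_m_step(expected_trans, expected_emit):
    """M step for an HMM: relative-frequency estimation applied to the
    soft counts that forward-backward produced in the E step.

    expected_trans: (K, K) expected counts of transitions y -> y'
    expected_emit:  (K, V) expected counts of state y emitting word x
    """
    transition = expected_trans / expected_trans.sum(axis=1, keepdims=True)
    emission = expected_emit / expected_emit.sum(axis=1, keepdims=True)
    return transition, emission
```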
Baum-Welch as a Graphical Model
[Figure: the HMM drawn as a graphical model: latent states Y1, Y2, Y3, …, Yn; visible words X1, X2, X3, …, Xn; shared "transit" and "emit" parameter nodes; regularizer R. Legend: visible (V), latent (L), parameters (w).]
Active Trail!
[Figure: the same HMM; with the Y's hidden, there is an active trail connecting the "emit" and "transit" parameters through the latent states.]
No Active Trail in All-Visible Case
[Figure: the same HMM with both the X's and Y's visible (parameters w); observing the Y's blocks the trails between the parameters.]
Why Latent Variables Make Learning Hard
• New intuition: parameters that were not interdependent in the fully-visible case are now interdependent.
• It all goes back to active trails.
"Viterbi" Learning is "Okay"!
[Figure: the same w, L, V, R graphical model as above.]
• Approximate joint MAP inference over w and L (most probable explanation inference).
• Loss function: $\mathrm{loss}(v; h_w) = -\max_{\ell} \log p_w(v, \ell)$
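A minimal numpy sketch (not from the lecture) of the one place where Viterbi (hard) EM differs from standard EM; the M step is unchanged:

```python
import numpy as np

def soft_e_step(log_joint):
    """Standard E step: q_i(l) proportional to p_w(v_i, l)."""
    q = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    return q / q.sum(axis=1, keepdims=True)

def hard_e_step(log_joint):
    """'Viterbi' E step: put all mass on the most probable explanation."""
    q = np.zeros_like(log_joint)
    q[np.arange(log_joint.shape[0]), log_joint.argmax(axis=1)] = 1.0
    return q

# Two examples, three latent values.
log_joint = np.log(np.array([[0.1, 0.6, 0.3],
                             [0.2, 0.2, 0.6]]))
print(soft_e_step(log_joint))  # full posteriors
print(hard_e_step(log_joint))  # one-hot rows
```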
Conditional Models
• EM is usually closely associated with fully generative approaches.
• You can do the same things with log-linear models and with conditional models.
– Locally normalized models give flexibility without requiring global inference (Berg-Kirkpatrick et al., 2010).
– Hidden-variable CRFs (Quattoni et al., 2007) are very powerful.
Learning Conditional Hidden Variable Models
[Figure: left, a conditional hidden-variable model with parameters w, latent L, output Vout, input Vin, and regularizer R; the distribution over Vin is not modeled. Right, a standard conditional model (e.g., CRF): the same, but with no latent L.]
Optimization for Hidden Variables
• We've described hidden-variable learning as an inference problem.
• It is more practical, of course, to think about this as optimization.
• EM can be understood from an optimization framework as well.
EM and Likelihood
• The connection between the goal above and the EM procedure is not immediately clear.

$\Phi(w) = \sum_{i} \log \sum_{\ell} p_w(v_i, \ell)$
Optimization View of EM
• A function of w and the collection of q_i (call it F):

$F(q, w) = \sum_i \Big[ -\sum_{\ell} q_i(\ell) \log q_i(\ell) + \sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i) + \log p_w(v_i) \Big]$

• Claim: EM performs coordinate ascent on this function.
Optimization View of EM
• The third term is our actual goal, Φ. It only depends on w (not the q_i).

$\sum_i \Big[ -\sum_{\ell} q_i(\ell) \log q_i(\ell) + \sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i) + \underbrace{\log p_w(v_i)}_{\text{sums to } \Phi(w)} \Big]$
Optimization View of EM
• The latter two terms together are precisely what we maximize on the M step, given the current q_i.
– This is a concave problem and we solve it exactly.

$\sum_i \Big[ -\sum_{\ell} q_i(\ell) \log q_i(\ell) + \underbrace{\sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i) + \log p_w(v_i)}_{= \, \sum_{\ell} q_i(\ell) \log p_w(v_i, \ell)} \Big]$
Optimization View of EM
• Concern: is the M step improving term 2 at the expense of Φ?
– No.

$\sum_i \Big[ -\sum_{\ell} q_i(\ell) \log q_i(\ell) + \underbrace{\sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i) + \log p_w(v_i)}_{= \, \sum_{\ell} q_i(\ell) \log p_w(v_i, \ell)} \Big]$
The M Step
• The second part is also not getting any worse from iteration to iteration:

$\Phi(w) = \sum_i \sum_{\ell} q_i^{(t)}(\ell) \log p_w(v_i, \ell) - \sum_i \sum_{\ell} q_i^{(t)}(\ell) \log p_w(\ell \mid v_i)$

$-\sum_i \sum_{\ell} q_i^{(t)}(\ell) \log p_{w^{(t+1)}}(\ell \mid v_i) + \sum_i \sum_{\ell} q_i^{(t)}(\ell) \log p_{w^{(t)}}(\ell \mid v_i)$
$= -\sum_i \sum_{\ell} q_i^{(t)}(\ell) \log p_{w^{(t+1)}}(\ell \mid v_i) + \sum_i \sum_{\ell} q_i^{(t)}(\ell) \log q_i^{(t)}(\ell)$
$= \sum_i D\big(q_i^{(t)}(\cdot) \,\big\|\, p_{w^{(t+1)}}(\cdot \mid v_i)\big) \ge 0$

(The last line is nonnegative because KL divergence always is, by Gibbs' inequality.)
The M Step
• Each M step, once q_i is fixed, maximizes a bound on the log-likelihood Φ.
– For fixed q_i, this is a concave problem we can solve in closed form in many cases.
• What about the E step?
Optimization View of EM
• E step considers the first two terms.
• Sets each q_i to be equal to the posterior under the current model.

$\sum_i \Big[ \underbrace{-\sum_{\ell} q_i(\ell) \log q_i(\ell) + \sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i)}_{= \, -D(q_i(\cdot) \,\|\, p_w(\cdot \mid v_i))} + \log p_w(v_i) \Big]$
Coordinate Ascent
• E step fixes w and solves for the q_i.
• M step fixes all q_i and solves for w.

$F(q, w) = \sum_i \Big[ \underbrace{-\sum_{\ell} q_i(\ell) \log q_i(\ell) + \sum_{\ell} q_i(\ell) \log p_w(\ell \mid v_i)}_{= \, -D(q_i(\cdot) \,\|\, p_w(\cdot \mid v_i))} + \underbrace{\log p_w(v_i)}_{\text{sums to } \Phi(w)} \Big]$
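Putting the two half-steps together gives the guarantee cited in the summary (the argument is implicit in the slides; spelled out here): since the KL term is nonnegative, $F(q, w) = \Phi(w) - \sum_i D(q_i(\cdot) \,\|\, p_w(\cdot \mid v_i)) \le \Phi(w)$, with equality exactly when each $q_i$ is the posterior. Therefore

$\Phi(w^{(t)}) = F(q^{(t)}, w^{(t)}) \le F(q^{(t)}, w^{(t+1)}) \le \Phi(w^{(t+1)}),$

where the first inequality is the M step and the second uses $F \le \Phi$.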
Things People Forget About EM
• Multiple random starts (or non-random starts); select using likelihood on development data.
• Variants may help avoid local optima…
Variants of EM
• "Online" variants, where we do an E step on one or a mini-batch of examples, are still coordinate ascent (Neal and Hinton, 1998).
• Deterministic annealing: flatten out the q_i, making the function closer to concave (one formulation is sketched after this list).
• Stochastic variant: use randomized approximate inference for the E step.
• "Generalized" EM: improve w but don't bother optimizing completely.
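For the deterministic-annealing variant above, one common formulation (e.g., Ueda and Nakano's deterministic annealing EM; the exact form is an assumption here, since the slide gives no formula) tempers the E step:

$q_i(\ell) \propto p_w(\ell \mid v_i)^{\beta},$

with a small inverse temperature $\beta$ at first (near-uniform, "flattened" $q_i$), raised toward $\beta = 1$ over the course of learning.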
Direct Optimization
• An alternative to EM: apply stochastic gradient ascent or quasi-Newton methods directly to Φ.
• Typically done for MN-like models with features, e.g., latent-variable CRFs.
– Gradient is a difference of feature expectations.
– Requires marginal inference.
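To make the gradient claim explicit for a latent-variable CRF with $p_w(v_{\mathrm{out}} \mid v_{\mathrm{in}}) = \sum_{\ell} \exp\{w \cdot f(v_{\mathrm{in}}, \ell, v_{\mathrm{out}})\} / Z_w(v_{\mathrm{in}})$ (a standard log-linear form; the slide does not fix the parameterization):

$\nabla_w \log p_w(v_{\mathrm{out}} \mid v_{\mathrm{in}}) = \mathbb{E}_{p_w(\ell \mid v_{\mathrm{in}}, v_{\mathrm{out}})}\big[f(v_{\mathrm{in}}, \ell, v_{\mathrm{out}})\big] - \mathbb{E}_{p_w(\ell, v' \mid v_{\mathrm{in}})}\big[f(v_{\mathrm{in}}, \ell, v')\big]$

Both expectations are marginal-inference problems: the first clamps the input and the output, the second clamps only the input.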
Summary
• EM: many ways to understand it.
– The guarantee: each round will improve the likelihood.
– That's about as much as we can say.
• Sometimes it works.
– Smart initializers
– Lots of bias inherent in the model structure/assumptions