Algorithms for NLP (CMU 11-711, Fall 2017), Lecture 16: cs.cmu.edu/~tbergkir/11711fa17/
TRANSCRIPT
Learning Latent Annotations

EM algorithm:
§ Brackets are known
§ Base categories are known
§ Only induce subcategories

Just like Forward-Backward for HMMs.

[Figure: latent-annotation tree with nodes X1–X7 over "He was right.", with the Forward and Backward passes marked]
[Figure: Number of Phrasal Subcategories — learned subcategory counts per phrasal category (NP, VP, PP, ADVP, S, ADJP, SBAR, QP, WHNP, PRN, NX, SINV, PRT, WHPP, SQ, CONJP, FRAG, NAC, UCP, WHADVP, INTJ, SBARQ, RRC, WHADJP, X, ROOT, LST), y-axis 0–40]
[Figure: Number of Lexical Subcategories — learned subcategory counts per part-of-speech tag (NNP, JJ, NNS, NN, VBN, RB, VBG, VB, VBD, CD, IN, VBZ, VBP, DT, NNPS, CC, JJR, JJS, :, PRP, PRP$, MD, RBR, WP, POS, PDT, WRB, -LRB-, ., EX, WP$, WDT, -RRB-, '', FW, RBS, TO, $, UH, ,, ``, SYM, RP, LS, #), y-axis 0–70]
Learned Splits

§ Proper Nouns (NNP):
  NNP-14: Oct. Nov. Sept.
  NNP-12: John Robert James
  NNP-2:  J. E. L.
  NNP-1:  Bush Noriega Peters
  NNP-15: New San Wall
  NNP-3:  York Francisco Street

§ Personal pronouns (PRP):
  PRP-0: It He I
  PRP-1: it he they
  PRP-2: it them him
Learned Splits

§ Relative adverbs (RBR):
  RBR-0: further lower higher
  RBR-1: more less More
  RBR-2: earlier Earlier later

§ Cardinal Numbers (CD):
  CD-7:  one two Three
  CD-4:  1989 1990 1988
  CD-11: million billion trillion
  CD-0:  1 50 100
  CD-3:  1 30 31
  CD-9:  78 58 34
Final Results (Accuracy)

                                             ≤ 40 words F1   all F1
ENG   Charniak & Johnson '05 (generative)    90.1            89.6
      Split / Merge                          90.6            90.1
GER   Dubey '05                              76.3            -
      Split / Merge                          80.8            80.1
CHN   Chiang et al. '02                      80.0            76.6
      Split / Merge                          86.3            83.4

Still higher numbers from reranking / self-training methods.
Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … …
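As a rough illustration of the coarse-to-fine idea, the sketch below prunes split symbols in a chart cell whenever their coarse projection has low posterior probability under the coarser grammar. The posterior values, the split-symbol names like "NP2", the strip-the-digit projection rule, and the threshold are all made up for illustration, not the actual implementation.

```python
# A minimal sketch of coarse-to-fine pruning; a real parser would compute these
# posteriors from inside-outside scores under the coarse grammar.
COARSE_POSTERIOR = {                       # P(coarse symbol over span | sentence)
    (0, 2): {"NP": 0.92, "VP": 0.003, "QP": 1e-6},
    (2, 5): {"VP": 0.88, "NP": 0.01,  "QP": 1e-7},
}
THRESHOLD = 1e-4

def project(split_symbol):
    """Map a split symbol (e.g. 'NP2') to its coarse parent (e.g. 'NP')."""
    return split_symbol.rstrip("0123456789")

def allowed(span, split_symbol):
    """Keep a split symbol in a cell only if its coarse projection survived pruning."""
    return COARSE_POSTERIOR.get(span, {}).get(project(split_symbol), 0.0) >= THRESHOLD

print(allowed((0, 2), "NP2"))   # True: NP is likely over words 0..2
print(allowed((0, 2), "QP1"))   # False: QP was pruned at the coarse level
```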
Dependency Parsing

§ Lexicalized parsers can be seen as producing dependency trees
§ Each local binary tree corresponds to an attachment in the dependency graph

[Example: dependency tree for "the lawyer questioned the witness", headed by "questioned", with "lawyer" and "witness" as its dependents and each "the" attached to its noun]
Dependency Parsing

§ Pure dependency parsing is only cubic [Eisner 99]
§ Some work on non-projective dependencies
  § Common in, e.g., Czech parsing
  § Can do with MST algorithms [McDonald and Pereira 05]

[Diagram: Eisner-style span combination — a span Y[h] over (i, h, k) combines with Z[h'] over (k, h', j) to build X[h] over (i, j); only the head positions h, h' need to be tracked, not full lexicalized categories]
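The cubic algorithm referenced above is Eisner's span-based dynamic program. Below is a minimal sketch of first-order projective decoding under assumed arc scores (a dense `score[head, modifier]` matrix with an artificial root at index 0); it is illustrative, not the lecture's reference implementation.

```python
import numpy as np

def eisner(score):
    """O(n^3) projective dependency parsing (a sketch in the spirit of [Eisner 99]).
    score[h, m]: score of arc head h -> modifier m; index 0 is the artificial root.
    Returns heads[m] for each position (heads[0] stays -1)."""
    n = score.shape[0]
    NEG = -np.inf
    # C = complete spans, I = incomplete spans; last axis: 0 = head on the right, 1 = head on the left
    C = np.full((n, n, 2), NEG); I = np.full((n, n, 2), NEG)
    Cb = np.zeros((n, n, 2), dtype=int); Ib = np.zeros((n, n, 2), dtype=int)
    for i in range(n):
        C[i, i, 0] = C[i, i, 1] = 0.0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # incomplete spans: add an arc between i and j over two complete halves
            vals = C[i, i:j, 1] + C[i + 1:j + 1, j, 0]
            r = int(np.argmax(vals))
            I[i, j, 1] = vals[r] + score[i, j]      # arc i -> j
            I[i, j, 0] = vals[r] + score[j, i]      # arc j -> i
            Ib[i, j, 1] = Ib[i, j, 0] = i + r
            # complete spans headed at i (left head) or at j (right head)
            vals = I[i, i + 1:j + 1, 1] + C[i + 1:j + 1, j, 1]
            r = int(np.argmax(vals)); C[i, j, 1] = vals[r]; Cb[i, j, 1] = i + 1 + r
            vals = C[i, i:j, 0] + I[i:j, j, 0]
            r = int(np.argmax(vals)); C[i, j, 0] = vals[r]; Cb[i, j, 0] = i + r
    heads = [-1] * n
    def backtrack(i, j, d, complete):
        if i == j:
            return
        if complete:
            r = Cb[i, j, d]
            if d == 1:
                backtrack(i, r, 1, False); backtrack(r, j, 1, True)
            else:
                backtrack(i, r, 0, True); backtrack(r, j, 0, False)
        else:
            if d == 1:
                heads[j] = i                         # record arc i -> j
            else:
                heads[i] = j                         # record arc j -> i
            r = Ib[i, j, d]
            backtrack(i, r, 1, True); backtrack(r + 1, j, 0, True)
    backtrack(0, n - 1, 1, True)
    return heads

# toy example: 3 words plus root; arcs root->1, 1->2, 1->3 are the best tree
scores = np.full((4, 4), -1.0)
scores[0, 1] = 2.0; scores[1, 2] = 2.0; scores[1, 3] = 2.0
print(eisner(scores))   # [-1, 0, 1, 1]
```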
Shift-Reduce Parsers

§ Another way to derive a tree:

§ Parsing
  § No useful dynamic programming search
  § Can still use beam search [Ratnaparkhi 97]
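To make the shift-reduce derivation concrete, here is a minimal arc-standard sketch: it executes a given action sequence and collects dependency arcs. In a real parser the next action is chosen by a scored classifier, greedily or with beam search as noted above; the action sequence and sentence here are illustrative.

```python
def arc_standard(words, actions):
    """Apply arc-standard actions ('SHIFT', 'LEFT', 'RIGHT') and return (head, dependent) arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for a in actions:
        if a == 'SHIFT':
            stack.append(buffer.pop(0))       # move the next word onto the stack
        elif a == 'LEFT':
            dep = stack.pop(-2)               # second-from-top becomes a dependent of the top
            arcs.append((stack[-1], dep))
        elif a == 'RIGHT':
            dep = stack.pop()                 # top becomes a dependent of the second-from-top
            arcs.append((stack[-1], dep))
    return arcs

# "the lawyer questioned the witness"
words = ['the', 'lawyer', 'questioned', 'the', 'witness']
acts = ['SHIFT', 'SHIFT', 'LEFT', 'SHIFT', 'LEFT', 'SHIFT', 'SHIFT', 'LEFT', 'RIGHT']
print(arc_standard(words, acts))   # [(1, 0), (2, 1), (4, 3), (2, 4)]
```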
Parse Reranking

§ Assume the number of parses is very small
§ We can represent each parse T as a feature vector φ(T)
  § Typically, all local rules are features
  § Also non-local features, like how right-branching the overall tree is
  § [Charniak and Johnson 05] gives a rich set of features
Classification

§ Automatically make a decision about inputs
  § Example: document → category
  § Example: image of digit → digit
  § Example: image of object → object type
  § Example: query + web pages → best match
  § Example: symptoms → diagnosis
  § …

§ Three main ideas
  § Representation as feature vectors
  § Scoring by linear functions (or not, actually)
  § Learning by optimization
Some Definitions

INPUTS            close the ____
CANDIDATE SET     {door, table, …}
CANDIDATES        table
TRUE OUTPUTS      door
FEATURE VECTORS   "close" in x ∧ y = "door"
                  x_-1 = "the" ∧ y = "door"
                  x_-1 = "the" ∧ y = "table"
                  y occurs in x
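A small sketch of what a feature vector for the "close the ____" example might look like: each (input, candidate) pair fires indicator features such as '"close" in x ∧ y = "door"'. The exact feature names are illustrative.

```python
def features(x_words, y):
    """x_words: the context words; y: a candidate output word. Returns the active indicator features."""
    feats = {f"y={y}": 1.0}                                            # candidate identity
    feats.update({f"'{w}' in x AND y={y}": 1.0 for w in x_words})      # e.g. "close" in x ∧ y="door"
    feats[f"x_-1='{x_words[-1]}' AND y={y}"] = 1.0                     # e.g. x_-1="the" ∧ y="door"
    return feats

print(features(["close", "the"], "door"))
print(features(["close", "the"], "table"))
```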
Block Feature Vectors

§ Sometimes, we think of the input as having features, which are multiplied by outputs to form the candidates

[Example: input "… win the election …" with input features such as "win" and "election", copied into one block per candidate output]
Non-Block Feature Vectors

§ Sometimes the features of candidates cannot be decomposed in this regular way
  § Example: a parse tree's features may be the productions present in the tree
§ Different candidates will thus often share features
§ We'll return to the non-block case later

[Example: two candidate parse trees over the same sentence, both with S → NP VP but with different NP and VP expansions (NP → N, VP → V N versus NP → N N, VP → V), sharing some productions]
Linear Models: Scoring

§ In a linear model, each feature gets a weight w
§ We score hypotheses by multiplying features and weights:

    score(x, y; w) = w^\top f(x, y)

[Example: the candidates for "… win the election …" scored by summing the weights of their active features]
Linear Models: Decision Rule

§ The linear decision rule:

    \text{prediction}(x; w) = \arg\max_y \; w^\top f(x, y)

§ We've said nothing about where the weights come from

[Example: each candidate output for "… win the election …" is scored and the argmax is predicted]
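Putting the two slides together, here is a minimal sketch of scoring candidates with w · f(x, y) and predicting the argmax, in the style of the feature sketch above. The weight values are made up for illustration.

```python
def features(x_words, y):
    """Same indicator features as the earlier sketch (repeated so this block runs alone)."""
    feats = {f"y={y}": 1.0, f"x_-1='{x_words[-1]}' AND y={y}": 1.0}
    feats.update({f"'{w}' in x AND y={y}": 1.0 for w in x_words})
    return feats

def score(w, feats):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def predict(w, x_words, candidates):
    """The linear decision rule: return the candidate with the highest score."""
    return max(candidates, key=lambda y: score(w, features(x_words, y)))

w = {"'close' in x AND y=door": 1.5, "x_-1='the' AND y=table": 0.3}   # made-up weights
print(predict(w, ["close", "the"], ["door", "table"]))                # -> 'door'
```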
Binary Classification

§ Important special case: binary classification
  § Classes are y = +1 / -1
§ Decision boundary is a hyperplane

[Diagram: spam classifier with weights BIAS: -3, free: 4, money: 2; plotting counts of "free" against counts of "money", the hyperplane separates +1 = SPAM from -1 = HAM]
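A tiny worked version of the spam example on this slide, using the weights BIAS = -3, free = +4, money = +2 with word counts as features; the email texts are made up.

```python
w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

def score(text):
    """Linear score w · f(x): word counts plus an always-on bias feature."""
    words = text.lower().split()
    f = {"BIAS": 1.0, "free": words.count("free"), "money": words.count("money")}
    return sum(w[k] * f[k] for k in w)

print(score("free money free"))            # -3 + 2*4 + 1*2 = 7.0  -> +1 (SPAM)
print(score("meeting notes attached"))     # -3.0                  -> -1 (HAM)
```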
Multiclass Decision Rule

§ If more than two classes:
  § Highest score wins
  § Boundaries are more complex
  § Harder to visualize
Learning Classifier Weights

§ Two broad approaches to learning weights

§ Generative: work with a probabilistic model of the data; weights are (log) local conditional probabilities
  § Advantages: learning weights is easy, smoothing is well-understood, backed by understanding of modeling

§ Discriminative: set weights based on some error-related criterion
  § Advantages: error-driven; often the weights which are good for classification aren't the ones which best describe the data

§ We'll mainly talk about the latter for now
How to pick weights?

§ Goal: choose the "best" vector w given training data
  § For now, we mean "best for classification"

§ The ideal: the weights which have the greatest test set accuracy / F1 / whatever
  § But, we don't have the test set
  § Must compute weights from the training set

§ Maybe we want weights which give the best training set accuracy?
  § Hard discontinuous optimization problem
  § May not (does not) generalize to the test set
  § Easy to overfit

Though, min-error training for MT does exactly this.
Minimize Training Error?

§ A loss function declares how costly each mistake is
  § E.g. 0 loss for the correct label, 1 loss for a wrong label
  § Can weight mistakes differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)

§ We could, in principle, minimize training loss:

    \min_w \; \sum_i \ell_i\big(\arg\max_y w^\top f_i(y)\big)

§ This is a hard, discontinuous optimization problem
Linear Models: Perceptron

§ The perceptron algorithm
  § Iteratively processes the training set, reacting to training errors
  § Can be thought of as trying to drive down training error

§ The (online) perceptron algorithm:
  § Start with zero weights w
  § Visit training instances one by one
    § Try to classify
  § If correct, no change!
  § If wrong: adjust weights
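A minimal sketch of the binary online perceptron described above, assuming dense numpy feature vectors and labels in {+1, -1}; the toy data is made up.

```python
import numpy as np

def perceptron(data, dim, epochs=10):
    """Online perceptron: data is a list of (feature vector, label) pairs with labels in {+1, -1}."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:                       # visit training instances one by one
            y_hat = 1 if w @ x >= 0 else -1     # try to classify with current weights
            if y_hat != y:                      # if correct, no change; if wrong:
                w += y * x                      # adjust weights toward the true label
    return w

# toy AND-like data (third feature is an always-on bias): +1 only when both inputs are on
data = [(np.array([1.0, 1.0, 1.0]), +1), (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([0.0, 1.0, 1.0]), -1), (np.array([0.0, 0.0, 1.0]), -1)]
print(perceptron(data, dim=3))
```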
Objective Functions

§ What do we want from our weights?
  § Depends!
  § So far: minimize (training) errors:

    \min_w \; \sum_i \ell_i\big(\arg\max_y w^\top f_i(y)\big)

  § This is the "zero-one loss"
    § Discontinuous, minimizing is NP-complete

§ Maximum entropy and SVMs have other objectives related to zero-one loss
Classification Margin (Binary)

§ Distance of x_i to the separator is its margin, m_i
§ Examples closest to the hyperplane are support vectors
§ Margin γ of the separator is the minimum m

[Diagram: points on both sides of a separating hyperplane, with per-example margins m and the separator margin γ]
Classification Margin

§ For each example x_i and possible mistaken candidate y, we avoid that mistake by a margin m_i(y) (with zero-one loss)

§ Margin γ of the entire separator is the minimum m

§ It is also the largest γ for which the following constraints hold:

    \forall i, \forall y \neq y_i^*: \quad w^\top f_i(y_i^*) \geq w^\top f_i(y) + \gamma

§ Separable SVMs: find the max-margin w
  § Can stick this into Matlab and (slowly) get an SVM
  § Won't work (well) if non-separable
Maximum Margin
Why Max Margin?

§ Why do this? Various arguments:
  § Solution depends only on the boundary cases, or support vectors (but remember how this diagram is broken!)
  § Solution robust to movement of support vectors
  § Sparse solutions (features not in support vectors get zero weight)
  § Generalization bound arguments
  § Works well in practice for many problems

[Diagram: separating hyperplane with the support vectors highlighted]
Max Margin / Small Norm

§ Reformulation: find the smallest w which separates the data

§ γ scales linearly in w, so if ||w|| isn't constrained, we can take any separating w and scale up our margin

§ Instead of fixing the scale of w, we can fix γ = 1

Remember this condition?
Soft Margin Classification

§ What if the training set is not linearly separable?
§ Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples, resulting in a soft margin classifier

[Diagram: points inside the margin or misclassified, each with its slack ξ_i]
Maximum Margin

§ Non-separable SVMs
  § Add slack to the constraints
  § Make the objective pay (linearly) for slack:

    \min_{w, \xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i

  § C is called the capacity of the SVM – the smoothing knob

§ Learning:
  § Can still stick this into Matlab if you want
  § Constrained optimization is hard; better methods!
  § We'll come back to this later

Note: other choices of how to penalize slacks exist!
Linear Models: Maximum Entropy

§ Maximum entropy (logistic regression)
  § Use the scores as probabilities:

    P(y \mid x; w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}      (exponentiate to make positive, then normalize)

  § Maximize the (log) conditional likelihood of the training data
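A minimal sketch of the "make positive, normalize" step: exponentiate the linear scores and normalize to get a distribution over candidates.

```python
import numpy as np

def softmax_probs(scores):
    """Turn candidate scores w·f(x, y) into probabilities: exponentiate, then normalize."""
    exp = np.exp(scores - np.max(scores))   # subtracting the max is only for numerical stability
    return exp / exp.sum()

print(softmax_probs(np.array([2.0, 0.5, -1.0])))   # all positive, sums to 1
```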
Maximum Entropy II

§ Motivation for maximum entropy:
  § Connection to the maximum entropy principle (sort of)
  § Might want to do a good job of being uncertain on noisy cases…
  § … in practice, though, posteriors are pretty peaked

§ Regularization (smoothing)
Log-Loss

§ If we view maxent as a minimization problem:

    \min_w \; \sum_i -\log P(y_i^* \mid x_i; w)

§ This minimizes the "log loss" on each example

§ One view: log loss is an upper bound on zero-one loss
Hinge Loss

§ Consider the per-instance objective:

    \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*)

§ This is called the "hinge loss"
  § Unlike maxent / log loss, you stop gaining objective once the true label wins by enough
§ You can start from here and derive the SVM objective
§ Can solve directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al 07)

Plot really only right in the binary case
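A small numeric illustration of the difference noted above (and, as the slide says, only exactly right in the binary case): once the margin m = y (w·x) exceeds 1, the hinge loss is flat at zero, while the log loss keeps shrinking but never reaches zero.

```python
import numpy as np

# Compare hinge loss max(0, 1 - m) with log loss log(1 + exp(-m)) as the margin m grows.
for m in [-1.0, 0.0, 0.5, 1.0, 2.0, 5.0]:
    hinge = max(0.0, 1.0 - m)
    log_loss = np.log(1.0 + np.exp(-m))
    print(f"margin {m:4.1f}  hinge {hinge:.3f}  log {log_loss:.3f}")
```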
Max vs "Soft-Max" Margin

§ SVMs:

    \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*)
    (you can make this zero)

§ Maxent:

    \log \sum_y \exp\big( w^\top f_i(y) \big) - w^\top f_i(y_i^*)
    (… but not this one)

§ Very similar! Both try to make the true score better than a function of the other scores
  § The SVM tries to beat the augmented runner-up
  § The Maxent classifier tries to beat the "soft-max"
Bilingual Word Alignment

x: What is the anticipated cost of collecting fees under the new proposal ?
y: En vertu de les nouvelle propositions , quel est le coût prévu de perception de les droits ?
   (English gloss: "Under the new proposals, what is the anticipated cost of collecting the fees?")

[Alignment grid: each English word of x plotted against the French sentence y, with the aligned word pairs marked]
Structured Models

Assumption: the score is a sum of local "part" scores
  § Parts = nodes, edges, productions

[Diagram: the space of feasible outputs (a combinatorial structure)]
Bilingual word alignment

§ Features on alignment edges:
  § association
  § position
  § orthography

[Alignment grid: English position j ("What is the anticipated cost of collecting fees under the new proposal ?") against French position k ("En vertu de les nouvelle propositions , quel est le coût prévu de perception de le droits ?")]
Efficient Decoding

§ Common case: you have a black box which computes

    \arg\max_y \; w^\top f(x, y)

  at least approximately, and you want to learn w

§ The easiest option is the structured perceptron [Collins 01]
  § Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*, …)
  § Prediction is structured, the learning update is not
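A minimal sketch of the structured perceptron: `decode(x, w)` stands in for the black-box (possibly approximate) argmax, and `features(x, y)` for the problem-specific feature extractor. Both are assumptions standing in for the combinatorial search and feature extraction, not the lecture's code.

```python
import numpy as np

def structured_perceptron(data, features, decode, dim, epochs=5):
    """data: list of (x, y_true); features(x, y) -> numpy vector; decode(x, w) -> best structure."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = decode(x, w)                              # structured prediction (DP, matching, ...)
            if y_hat != y_true:                               # the learning update itself is not structured:
                w += features(x, y_true) - features(x, y_hat) # simple additive perceptron update
    return w
```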
Structured Margin (Primal)

Remember our primal margin objective?

    \min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( \max_y \big[ w^\top f_i(y) + \ell_i(y) \big] - w^\top f_i(y_i^*) \Big)

Still applies with a structured output space!
Structured Margin (Primal)

    \min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( w^\top f_i(\bar{y}) + \ell_i(\bar{y}) - w^\top f_i(y_i^*) \Big)

Just need an efficient loss-augmented decode:

    \bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)

    \nabla_w = w + C \sum_i \big( f_i(\bar{y}) - f_i(y_i^*) \big)

Still use general subgradient descent methods! (Adagrad)
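A minimal sketch of the subgradient update above: `loss_augmented_decode(x, y_true, w)` (an assumption, like `features`) returns argmax_y [w·f(x, y) + ℓ(y, y_true)], and the gradient is w + C Σ_i (f_i(ȳ) - f_i(y*_i)). A per-coordinate step size would turn the plain step into Adagrad.

```python
import numpy as np

def ssvm_subgradient(data, features, loss_augmented_decode, dim, C=1.0, lr=0.1, epochs=5):
    """Primal subgradient descent for the structured SVM objective sketched above."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = w.copy()                                       # gradient of (1/2)||w||^2
        for x, y_true in data:
            y_hat = loss_augmented_decode(x, y_true, w)       # most violating structure
            grad += C * (features(x, y_hat) - features(x, y_true))
        w -= lr * grad                                        # plain subgradient step
    return w
```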
Structured Margin (Dual)

§ Remember the constrained version of the primal:
§ The dual has a variable for every constraint here

Full Margin: OCR

§ We want: the true transcription "brace" to outscore every other string
§ Equivalently: w^\top f(x, \text{"brace"}) \geq w^\top f(x, y) for every other string y ("aaaaa", "aaaab", …, "zzzzz") … a lot of constraints!

[Example: a handwriting image whose true output "brace" competes with every other length-5 string]
Parsing example

§ We want: the true parse of "It was red" to outscore every other parse
§ Equivalently: w^\top f(x, y^*) \geq w^\top f(x, y) for every alternative tree y over "It was red" … a lot of constraints!

[Example: the gold tree for "It was red" (S expanding into A B with further structure C D) compared against many alternative trees such as ones using D F or E F / G H]
Alignment example

§ We want: the true alignment of "What is the" / "Quel est le" to outscore every other alignment
§ Equivalently: w^\top f(x, y^*) \geq w^\top f(x, y) for every other alignment y of the two three-word sentences … a lot of constraints!

[Example: 3×3 alignment grids between "What is the" (positions 1 2 3) and "Quel est le" (positions 1 2 3): the correct alignment compared against all the alternatives]
Cutting Plane (Dual)

§ A constraint induction method [Joachims et al 09]

§ Exploits the fact that the number of constraints you actually need per instance is typically very small

§ Requires (loss-augmented) primal decode only

§ Repeat:
  § Find the most violated constraint for an instance:

    \bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)

  § Add this constraint and re-solve the (non-structured) QP (e.g. with SMO or another QP solver)
Cutting Plane (Dual)

§ Some issues:
  § Can easily spend too much time solving QPs
  § Doesn't exploit shared constraint structure
  § In practice, works pretty well; fast like perceptron/MIRA, more stable, no averaging
Likelihood, Structured

§ Structure is needed to compute:
  § The log-normalizer
  § Expected feature counts
    § E.g. if a feature is an indicator of DT-NN then we need to compute posterior marginals P(DT-NN | sentence) for each position and sum

§ Also works with latent variables (more later)
Option 0: Reranking

x = "The screen was a sea of red."

  Input → Baseline Parser → N-Best List (e.g. n = 100) → Non-Structured Classification → Output

[e.g. Charniak and Johnson 05]
Reranking

§ Advantages:
  § Directly reduces to the non-structured case
  § No locality restriction on features

§ Disadvantages:
  § Stuck with the errors of the baseline parser
  § Baseline system must produce n-best lists
  § But, feedback is possible [McClosky, Charniak, Johnson 2006]
M3Ns

§ Another option: express all constraints in a packed form
  § Maximum margin Markov networks [Taskar et al 03]
  § Integrates the solution structure deeply into the problem structure

§ Steps
  § Express inference over constraints as an LP
  § Use duality to transform the minimax formulation into min-min
  § Constraints factor in the dual along the same structure as the primal; alphas essentially act as a dual "distribution"
  § Various optimization possibilities in the dual
Non-Linear Separators

§ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable

  Φ: y → φ(y)
Why Kernels?

§ Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  § Yes, in principle, just compute them
  § No need to modify any algorithms
  § But, the number of features can get large (or infinite)
  § Some kernels are not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]

§ Kernels let us compute with these features implicitly
  § Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product
  § Of course, there's the cost of using the pure dual algorithms…
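To illustrate the "implicit dot product" point: for the quadratic kernel, K(x, z) = (x·z + 1)² equals the dot product of explicitly expanded feature vectors (a constant, all scaled single features, all squares, and all scaled pairs), computed without ever building them. The example vectors below are made up.

```python
import numpy as np
from itertools import combinations

def quad_kernel(x, z):
    """Implicit dot product in the expanded quadratic feature space."""
    return (x @ z + 1.0) ** 2

def quad_features(x):
    """Explicit expansion whose dot product matches quad_kernel: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j."""
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2.0) * xi for xi in x]
    feats += [xi * xi for xi in x]
    feats += [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.array(feats)

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(quad_kernel(x, z), quad_features(x) @ quad_features(z))   # both 30.25
```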