Language Modeling III
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley

Algorithms for NLP
Announcements
§ Office hours on website
  § But no OH for Taylor until next week.
Efficient Hashing
§ Closed address hashing
  § Resolve collisions with chains
  § Easier to understand but bigger
§ Open address hashing (sketched after this list)
  § Resolve collisions with probe sequences
  § Smaller but easy to mess up
§ Direct-address hashing
  § No collision resolution
  § Just eject previous entries
  § Not suitable for core LM storage
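To make the open-addressing variant concrete, here is a minimal sketch of a long-keyed count table with linear probing. Class, constant, and method names are illustrative, not from the lecture; real LM storage (e.g. in Pauls and Klein's work) also handles resizing and load factors.

```java
// Minimal open-address hash map from packed n-gram keys (longs) to counts,
// resolving collisions with a linear probe sequence. A sentinel key marks
// empty slots, so -1 itself cannot be stored as a key in this sketch.
public class OpenAddressCountMap {
    private static final long EMPTY = -1L;
    private final long[] keys;
    private final long[] values;

    public OpenAddressCountMap(int capacity) {
        keys = new long[capacity];
        values = new long[capacity];
        java.util.Arrays.fill(keys, EMPTY);
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;          // spread the bits
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % keys.length;
    }

    public void increment(long key, long delta) {
        int i = slot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;               // probe: try the next slot
        }
        keys[i] = key;
        values[i] += delta;
    }

    public long get(long key) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return 0;                                    // unseen n-gram
    }
}
```

The space win over chaining is exactly what the bullets claim: no chain pointers, just two flat arrays; the "easy to mess up" part is the probe loop and the sentinel handling.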
Integer Encodings

  n-gram: the cat laughed    count: 233
  word ids: 7 1 15
Bit Packing

Got 3 numbers under 2^20 to store? Fits in a primitive 64-bit long:

  20 bits    20 bits    20 bits
  7          1          15
  0…00111    0…00001    0…01111
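In code, the packing on this slide is a pair of shift/mask expressions. The choice of putting the first word in the high bits is an assumption; the slide fixes only the three 20-bit fields.

```java
// Pack three word ids (each < 2^20) into one 64-bit long, as on the slide:
// ids 7, 1, 15 for "the cat laughed". The top 4 bits stay unused.
long w1 = 7, w2 = 1, w3 = 15;
long packed = (w1 << 40) | (w2 << 20) | w3;

// Unpack with shifts and a 20-bit mask.
long mask = (1L << 20) - 1;
long u1 = (packed >>> 40) & mask;   // 7
long u2 = (packed >>> 20) & mask;   // 1
long u3 = packed & mask;            // 15
```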
Integer Encodings

  n-gram: the cat laughed    count: 233
  n-gram encoding: 15176595
Rank Values

c(the) = 23,135,851,162 < 2^35
→ 35 bits to represent integers between 0 and 2^35

  n-gram encoding: 15176595 (60 bits)    count: 233 (35 bits)
Rank Values

# unique counts = 770,000 < 2^20
→ 20 bits to represent ranks of all counts

  n-gram encoding: 15176595 (60 bits)    rank: 3 (20 bits)

  rank | freq
  -----+------
     0 |    1
     1 |    2
     2 |   51
     3 |  233
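A sketch of the rank indirection this slide describes: the table above becomes a sorted array of distinct counts, so each n-gram stores only its 20-bit rank. Entries beyond the slide's four example values are omitted here.

```java
// Count-rank indirection: rank -> count is an array read, count -> rank a
// binary search over the ~770k distinct count values.
class CountRanks {
    static long[] uniqueCounts = {1, 2, 51, 233 /* ... ~770k entries */};

    static long countOf(int rank) {
        return uniqueCounts[rank];          // e.g. rank 3 -> count 233
    }

    static int rankOf(long count) {
        int r = java.util.Arrays.binarySearch(uniqueCounts, count);
        if (r < 0) throw new IllegalArgumentException("unseen count");
        return r;
    }
}
```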
So Far
§ Word indexer
§ Rank lookup
§ Count DBs: trigram, bigram, unigram
§ N-gram encoding scheme:
    unigram:  f(id) = id
    bigram:   f(id1, id2) = ?
    trigram:  f(id1, id2, id3) = ?
Hashing vs Sorting

Context Tries

Tries

Context Encodings
[Many details from Pauls and Klein, 2011]

Context Encodings

Compression

Idea: Differential Compression
Variable Length Encodings  [Elias, 75]

Encoding "9":  000 1001
§ Length in unary: 000 (three zeros signal that a 4-bit number follows)
§ Number in binary: 1001
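A sketch of the Elias gamma encoder for this scheme, returning the code as a bit string for clarity; a real implementation would write into a packed bit stream.

```java
// Elias gamma code [Elias, 75]: for n >= 1, write (bitlength - 1) zeros
// in unary, then n in binary. eliasGamma(9) returns "0001001".
static String eliasGamma(long n) {
    if (n < 1) throw new IllegalArgumentException("gamma needs n >= 1");
    String bits = Long.toBinaryString(n);            // "1001" for 9
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bits.length() - 1; i++) sb.append('0');
    return sb.append(bits).toString();               // unary length + binary
}
```

Small numbers get short codes, which is what makes this pay off after differential compression has made most stored values small.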
Speed-Ups

Context Encodings

Naïve N-Gram Lookup

Rolling Queries
Idea: Fast Caching
§ LM can be more than 10x faster w/ direct-address caching
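A minimal sketch of such a direct-address cache sitting in front of the full LM lookup: the key is the packed n-gram, collisions simply eject the previous entry, and the stored key is verified so a hit is always correct. Names and the functional-interface plumbing are illustrative.

```java
// Direct-address cache: hash to a fixed slot, overwrite on collision,
// never probe. Wrong answers are impossible because the key is checked;
// a collision only costs us the cached entry.
public class LmCache {
    private final long[] keys;
    private final float[] probs;

    public LmCache(int size) {
        keys = new long[size];
        probs = new float[size];
        java.util.Arrays.fill(keys, -1L);
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % keys.length;
    }

    public float logProb(long packedNgram, java.util.function.LongToDoubleFunction lm) {
        int i = slot(packedNgram);
        if (keys[i] == packedNgram) return probs[i];   // hit: skip the full lookup
        float p = (float) lm.applyAsDouble(packedNgram);
        keys[i] = packedNgram;                         // eject whatever was here
        probs[i] = p;
        return p;
    }
}
```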
Approximate LMs
§ Simplest option: hash-and-hope (sketched after this list)
  § Array of size K ~ N
  § (Optional) store hash of keys
  § Store values in direct-address
  § Collisions: store the max
  § What kind of errors can there be?
§ More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
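A sketch of hash-and-hope under those bullets. It also answers the error question: storing the max makes errors one-sided, so a lookup can return a too-large count for a colliding or unseen n-gram, but never a too-small one for a stored n-gram.

```java
// "Hash-and-hope": direct-address array of size K ~ N, indexed by hash,
// keeping the max on collision. Same bit-spreading slot() as above.
class HashAndHope {
    static long[] counts = new long[1 << 24];   // K slots, K ~ N

    static int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % counts.length;
    }

    static void add(long packedNgram, long count) {
        int i = slot(packedNgram);
        counts[i] = Math.max(counts[i], count); // collisions: store the max
    }

    static long lookup(long packedNgram) {
        return counts[slot(packedNgram)];       // may belong to a colliding key
    }
}
```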
Maximum Entropy Models
Improving on N-Grams?
§ N-grams don't combine multiple sources of evidence well:

  P(construction | After the demolition was completed, the)

§ Here:
  § "the" gives a syntactic constraint
  § "demolition" gives a semantic constraint
  § Unlikely that the interaction between these two has been densely observed in this specific n-gram
§ We'd like a model that can be more statistically efficient
Some Definitions

  INPUTS:          close the ____
  CANDIDATE SET:   {door, table, …}
  CANDIDATES:      table, door
  TRUE OUTPUTS:    door
  FEATURE VECTORS:
    y occurs in x
    "close" in x ∧ y = "door"
    x-1 = "the" ∧ y = "door"
    x-1 = "the" ∧ y = "table"
More Features, Less Interaction

  x = closing the ____,  y = doors

§ N-Grams:   x-1 = "the" ∧ y = "doors"
§ Skips:     x-2 = "closing" ∧ y = "doors"
§ Lemmas:    x-2 = "close" ∧ y = "door"
§ Caching:   y occurs in x
Data: Feature Impact

  Features           | Train Perplexity | Test Perplexity
  -------------------+------------------+----------------
  3-gram indicators  | 241              | 350
  1-3 grams          | 126              | 172
  1-3 grams + skips  | 101              | 164
Exponential Form
§ Weights w, features f(x, y)
§ Linear score: w^T f(x, y)
§ Unnormalized probability: exp(w^T f(x, y))
§ Probability:

  $$P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$
Likelihood Objective
§ Model form:

  $$P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$

§ Log-likelihood of training data:

  $$L(w) = \log \prod_i P(y_i^* \mid x_i, w) = \sum_i \log \frac{\exp(w^\top f(x_i, y_i^*))}{\sum_{y'} \exp(w^\top f(x_i, y'))}$$

  $$= \sum_i \left( w^\top f(x_i, y_i^*) - \log \sum_{y'} \exp(w^\top f(x_i, y')) \right)$$
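As a concrete reference, here is a minimal sketch of this objective and its gradient for one example, assuming binary features stored as index lists. The `feats`/`gold` data layout and the method name `exampleLL` are illustrative, not from the lecture.

```java
// Log-likelihood of one example under the maxent model. The gradient
// contribution is f(x, y*) minus the expected feature vector: the same
// observed-minus-expected counts described on the Gradients slide below.
static double exampleLL(double[] w, int[][] feats, int gold, double[] grad) {
    double[] scores = new double[feats.length];
    double max = Double.NEGATIVE_INFINITY;
    for (int y = 0; y < feats.length; y++) {
        for (int j : feats[y]) scores[y] += w[j];    // linear score w . f(x, y)
        max = Math.max(max, scores[y]);
    }
    double z = 0;
    for (double s : scores) z += Math.exp(s - max);  // stable log-sum-exp
    double logZ = max + Math.log(z);
    for (int y = 0; y < feats.length; y++) {
        double p = Math.exp(scores[y] - logZ);       // P(y | x, w)
        for (int j : feats[y]) grad[j] -= p;         // minus expected counts
    }
    for (int j : feats[gold]) grad[j] += 1.0;        // plus observed counts
    return scores[gold] - logZ;                      // log P(y* | x, w)
}
```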
Training
History of Training
§ 1990s: Specialized methods (e.g. iterative scaling)
§ 2000s: General-purpose methods (e.g. conjugate gradient)
§ 2010s: Online methods (e.g. stochastic gradient)
What Does LL Look Like?
§ Example
  § Data: x x x y (three observations of x, one of y)
  § Two outcomes, x and y
  § One indicator feature for each
  § Likelihood
Convex Optimization
§ The maxent objective is an unconstrained convex problem
§ One optimal value*, gradients point the way
Gradients

  $$\frac{\partial L(w)}{\partial w} = \sum_i \left( f(x_i, y_i^*) - \sum_{y'} P(y' \mid x_i, w) \, f(x_i, y') \right)$$

§ First term: count of features under target labels
§ Second term: expected count of features under the model's predicted label distribution
Gradient Ascent
§ The maxent objective is an unconstrained optimization problem
§ Gradient ascent (a minimal loop is sketched after this list)
  § Basic idea: move uphill from current guess
  § Gradient ascent/descent follows the gradient incrementally
  § At a local optimum, the derivative vector is zero
  § Will converge if step sizes are small enough, but not efficient
  § All we need is to be able to evaluate the function and its derivative
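A minimal batch gradient-ascent loop under the same assumptions as before: an illustrative `Example` holder with `feats` and `gold` fields, the `exampleLL` method from the earlier sketch, and a made-up step size `eta`.

```java
// Batch gradient ascent: accumulate the full gradient, then step uphill.
// Converges for small enough eta, but needs many passes over the data.
double eta = 0.1;                                // illustrative step size
double[] w = new double[numFeatures];
for (int iter = 0; iter < 100; iter++) {
    double[] grad = new double[numFeatures];
    double ll = 0;
    for (Example ex : data) {
        ll += exampleLL(w, ex.feats, ex.gold, grad);  // observed - expected
    }
    for (int j = 0; j < w.length; j++) {
        w[j] += eta * grad[j];                   // move uphill
    }
}
```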
(Quasi-)Newton Methods
§ 2nd-order methods: repeatedly create a quadratic approximation and solve it
§ E.g. LBFGS, which tracks derivatives to approximate the (inverse) Hessian
Regularization
Regularization Methods
§ Early stopping
§ L2: L(w) − ‖w‖₂²
§ L1: L(w) − ‖w‖₁
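In code terms, the L2 penalty is a one-line change to the objective and gradient from the earlier sketch. The strength `lambda` is a tuning constant I am adding for illustration; the slide writes the penalty unscaled.

```java
// Subtract lambda * ||w||_2^2 from the log-likelihood; the matching
// gradient term shrinks every weight toward zero on each step.
for (int j = 0; j < w.length; j++) {
    ll -= lambda * w[j] * w[j];
    grad[j] -= 2 * lambda * w[j];
}
```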
Regularization Effects
§ Early stopping: don't do this
§ L2: weights stay small but non-zero
§ L1: many weights driven to zero
  § Good for sparsity
  § Usually bad for accuracy in NLP
Scaling
Why is Scaling Hard?
§ Big normalization terms
§ Lots of data points
Hierarchical Prediction
§ Hierarchical prediction / softmax [Mikolov et al., 2013]
§ Noise-Contrastive Estimation [Mnih, 2013]
§ Self-Normalization [Devlin, 2014]
Stochastic Gradient
§ View the gradient as an average over data points
§ Stochastic gradient: take a step after each example (or mini-batch)
§ Substantial improvements exist, e.g. AdaGrad (Duchi, 11), sketched below
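A sketch of the per-example update with AdaGrad scaling (Duchi, 11), reusing the illustrative `Example`/`exampleLL` pieces from the earlier sketches.

```java
// Stochastic gradient with AdaGrad: one step per example, each coordinate's
// step shrunk by the root of its accumulated squared gradients.
double[] sumSq = new double[w.length];
for (Example ex : data) {
    double[] g = new double[w.length];
    exampleLL(w, ex.feats, ex.gold, g);   // gradient of one example
    for (int j = 0; j < w.length; j++) {
        if (g[j] == 0) continue;          // sparse features: most entries are zero
        sumSq[j] += g[j] * g[j];
        w[j] += eta * g[j] / Math.sqrt(sumSq[j]);
    }
}
```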
Other Methods
Neural Net LMs
[Figure: Bengio et al., 03]
Neural vs Maxent
§ Maxent LM: exp(w^T f(x, y))
§ Neural Net LM: exp(B φ(A f(x)))
  § φ nonlinear, e.g. tanh
Neural Net LMs

[Figure: feedforward neural LM. Context words x-2 = "closing", x-1 = "the" are looked up as embedding vectors v_closing, v_the; output scores are produced for candidate words "man", "door", "doors", …]

  $$h = \phi\left( \sum_i A_i v_i \right)$$

  $$P(y \mid w, x) \propto e^{Bh}$$
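A minimal forward pass matching the slide's two equations, with dense arrays standing in for the embedding and output parameters. The matrix layout, dimensions, and method name are assumptions.

```java
// h = phi(sum_i A_i v_i), P(y | w, x) proportional to exp(B h), phi = tanh.
// Shapes: A[i] is hidden x embed, contextVecs[i] is the embedding of
// context word i, B is vocab x hidden.
static double[] forward(double[][][] A, double[][] B, double[][] contextVecs) {
    int hidden = A[0].length;
    double[] h = new double[hidden];
    for (int i = 0; i < contextVecs.length; i++)     // sum_i A_i v_i
        for (int r = 0; r < hidden; r++)
            for (int c = 0; c < contextVecs[i].length; c++)
                h[r] += A[i][r][c] * contextVecs[i][c];
    for (int r = 0; r < hidden; r++)
        h[r] = Math.tanh(h[r]);                      // the nonlinearity phi
    double[] p = new double[B.length];               // one score per word y
    double z = 0;
    for (int y = 0; y < B.length; y++) {
        double s = 0;
        for (int r = 0; r < hidden; r++) s += B[y][r] * h[r];
        p[y] = Math.exp(s);                          // unnormalized exp(B h)
        z += p[y];
    }
    for (int y = 0; y < p.length; y++) p[y] /= z;    // normalize to P(y | w, x)
    return p;
}
```

Note the normalization loop at the end is exactly the big sum the Scaling slides worry about: it runs over the whole vocabulary.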
Mixed Interpolation
§ But can't we just interpolate:
  § P(w | most recent words)
  § P(w | skip contexts)
  § P(w | caching)
  § …
§ Yes, and people do (well, did)
§ But additive combination tends to flatten distributions, not zero out candidates
Decision Trees/Forests
§ Decision trees?
  § Good for non-linear decision problems
  § Random forests can improve further [Xu and Jelinek, 2004]
  § Paths to leaves basically learn conjunctions
  § General contrast between DTs and linear models

[Figure: decision tree splitting on "Prev word?" then "… last verb?"]
Data: regularization impact (train/test perplexity by L2 strength):

  Regularizer | Train Perplexity | Test Perplexity
  ------------+------------------+----------------
  L2 (0.01)   |  17              | 355
  L2 (0.1)    |  27              | 172
  L2 (0.5)    |  60              | 156
  L2 (10)     | 296              | 265
Maximum Entropy LMs
§ Want a model over completions y given a context x:

  P(y | x) = P(close the door | close the)

§ Want to characterize the important aspects of y = (v, x) using a feature function f
§ f might include (a sketch of such a function follows this list):
  § Indicator of v (unigram)
  § Indicator of v, previous word (bigram)
  § Indicator of whether v occurs in x (cache)
  § Indicator of v and each non-adjacent previous word
  § …
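A sketch of such a feature function f, emitting string-named indicator features for a candidate completion v in context x. The naming scheme is illustrative; a real system would intern these strings to integer feature indices.

```java
// Emit the indicator features listed above for candidate v in context x
// (x holds the previous words, oldest first).
static java.util.List<String> features(String v, String[] x) {
    java.util.List<String> f = new java.util.ArrayList<>();
    f.add("unigram=" + v);                           // indicator of v
    f.add("bigram=" + x[x.length - 1] + "_" + v);    // v with previous word
    for (String word : x) {
        if (word.equals(v)) { f.add("cache=" + v); break; }  // v occurs in x
    }
    for (int k = 0; k < x.length - 1; k++) {         // non-adjacent previous words
        f.add("skip" + (x.length - k) + "=" + x[k] + "_" + v);
    }
    return f;
}
```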