Language Modeling III
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley

Algorithms for NLP
Announcements
§ Office hours on website
  § But no OH for Taylor until next week.
Efficient Hashing
§ Closed address hashing
  § Resolve collisions with chains
  § Easier to understand but bigger
§ Open address hashing (sketched after this list)
  § Resolve collisions with probe sequences
  § Smaller but easy to mess up
§ Direct-address hashing
  § No collision resolution
  § Just eject previous entries
  § Not suitable for core LM storage
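To make the open-addressing variant concrete, here is a minimal sketch of a long-keyed count table with linear probing. Class, constant, and method names are illustrative, not from the lecture; real LM storage (e.g. in Pauls and Klein's work) also handles resizing and load factors.

```java
// Minimal open-address hash map from packed n-gram keys (longs) to counts,
// resolving collisions with a linear probe sequence. A sentinel key marks
// empty slots, so -1 itself cannot be stored as a key in this sketch.
public class OpenAddressCountMap {
    private static final long EMPTY = -1L;
    private final long[] keys;
    private final long[] values;

    public OpenAddressCountMap(int capacity) {
        keys = new long[capacity];
        values = new long[capacity];
        java.util.Arrays.fill(keys, EMPTY);
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;          // spread the bits
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % keys.length;
    }

    public void increment(long key, long delta) {
        int i = slot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;               // probe: try the next slot
        }
        keys[i] = key;
        values[i] += delta;
    }

    public long get(long key) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return 0;                                    // unseen n-gram
    }
}
```

The space win over chaining is exactly what the bullets claim: no chain pointers, just two flat arrays; the "easy to mess up" part is the probe loop and the sentinel handling.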
Integer Encodings

  n-gram: the cat laughed    count: 233
  word ids: 7 1 15
Bit Packing

Got 3 numbers under 2^20 to store? Fits in a primitive 64-bit long:

  20 bits    20 bits    20 bits
  7          1          15
  0…00111    0…00001    0…01111
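In code, the packing on this slide is a pair of shift/mask expressions. The choice of putting the first word in the high bits is an assumption; the slide fixes only the three 20-bit fields.

```java
// Pack three word ids (each < 2^20) into one 64-bit long, as on the slide:
// ids 7, 1, 15 for "the cat laughed". The top 4 bits stay unused.
long w1 = 7, w2 = 1, w3 = 15;
long packed = (w1 << 40) | (w2 << 20) | w3;

// Unpack with shifts and a 20-bit mask.
long mask = (1L << 20) - 1;
long u1 = (packed >>> 40) & mask;   // 7
long u2 = (packed >>> 20) & mask;   // 1
long u3 = packed & mask;            // 15
```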
Integer Encodings

  n-gram: the cat laughed    count: 233
  n-gram encoding: 15176595
Rank Values

c(the) = 23,135,851,162 < 2^35
→ 35 bits to represent integers between 0 and 2^35

  n-gram encoding: 15176595 (60 bits)    count: 233 (35 bits)
Rank Values

# unique counts = 770,000 < 2^20
→ 20 bits to represent ranks of all counts

  n-gram encoding: 15176595 (60 bits)    rank: 3 (20 bits)

  rank | freq
  -----+------
     0 |    1
     1 |    2
     2 |   51
     3 |  233
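A sketch of the rank indirection this slide describes: the table above becomes a sorted array of distinct counts, so each n-gram stores only its 20-bit rank. Entries beyond the slide's four example values are omitted here.

```java
// Count-rank indirection: rank -> count is an array read, count -> rank a
// binary search over the ~770k distinct count values.
class CountRanks {
    static long[] uniqueCounts = {1, 2, 51, 233 /* ... ~770k entries */};

    static long countOf(int rank) {
        return uniqueCounts[rank];          // e.g. rank 3 -> count 233
    }

    static int rankOf(long count) {
        int r = java.util.Arrays.binarySearch(uniqueCounts, count);
        if (r < 0) throw new IllegalArgumentException("unseen count");
        return r;
    }
}
```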
So Far
§ Word indexer
§ Rank lookup
§ Count DBs: trigram, bigram, unigram
§ N-gram encoding scheme:
    unigram:  f(id) = id
    bigram:   f(id1, id2) = ?
    trigram:  f(id1, id2, id3) = ?
Hashing vs Sorting

Context Tries

Tries

Context Encodings
[Many details from Pauls and Klein, 2011]

Context Encodings

Compression

Idea: Differential Compression
Variable Length Encodings  [Elias, 75]

Encoding "9":  000 1001
§ Length in unary: 000 (three zeros signal that a 4-bit number follows)
§ Number in binary: 1001
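A sketch of the Elias gamma encoder for this scheme, returning the code as a bit string for clarity; a real implementation would write into a packed bit stream.

```java
// Elias gamma code [Elias, 75]: for n >= 1, write (bitlength - 1) zeros
// in unary, then n in binary. eliasGamma(9) returns "0001001".
static String eliasGamma(long n) {
    if (n < 1) throw new IllegalArgumentException("gamma needs n >= 1");
    String bits = Long.toBinaryString(n);            // "1001" for 9
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bits.length() - 1; i++) sb.append('0');
    return sb.append(bits).toString();               // unary length + binary
}
```

Small numbers get short codes, which is what makes this pay off after differential compression has made most stored values small.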
Speed-Ups

Context Encodings

Naïve N-Gram Lookup

Rolling Queries
Idea: Fast Caching
§ LM can be more than 10x faster w/ direct-address caching
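A minimal sketch of such a direct-address cache sitting in front of the full LM lookup: the key is the packed n-gram, collisions simply eject the previous entry, and the stored key is verified so a hit is always correct. Names and the functional-interface plumbing are illustrative.

```java
// Direct-address cache: hash to a fixed slot, overwrite on collision,
// never probe. Wrong answers are impossible because the key is checked;
// a collision only costs us the cached entry.
public class LmCache {
    private final long[] keys;
    private final float[] probs;

    public LmCache(int size) {
        keys = new long[size];
        probs = new float[size];
        java.util.Arrays.fill(keys, -1L);
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % keys.length;
    }

    public float logProb(long packedNgram, java.util.function.LongToDoubleFunction lm) {
        int i = slot(packedNgram);
        if (keys[i] == packedNgram) return probs[i];   // hit: skip the full lookup
        float p = (float) lm.applyAsDouble(packedNgram);
        keys[i] = packedNgram;                         // eject whatever was here
        probs[i] = p;
        return p;
    }
}
```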
Approximate LMs
§ Simplest option: hash-and-hope (sketched after this list)
  § Array of size K ~ N
  § (Optional) store hash of keys
  § Store values in direct-address
  § Collisions: store the max
  § What kind of errors can there be?
§ More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
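A sketch of hash-and-hope under those bullets. It also answers the error question: storing the max makes errors one-sided, so a lookup can return a too-large count for a colliding or unseen n-gram, but never a too-small one for a stored n-gram.

```java
// "Hash-and-hope": direct-address array of size K ~ N, indexed by hash,
// keeping the max on collision. Same bit-spreading slot() as above.
class HashAndHope {
    static long[] counts = new long[1 << 24];   // K slots, K ~ N

    static int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % counts.length;
    }

    static void add(long packedNgram, long count) {
        int i = slot(packedNgram);
        counts[i] = Math.max(counts[i], count); // collisions: store the max
    }

    static long lookup(long packedNgram) {
        return counts[slot(packedNgram)];       // may belong to a colliding key
    }
}
```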
Maximum Entropy Models
Improving on N-Grams?
§ N-grams don't combine multiple sources of evidence well:

  P(construction | After the demolition was completed, the)

§ Here:
  § "the" gives a syntactic constraint
  § "demolition" gives a semantic constraint
  § Unlikely that the interaction between these two has been densely observed in this specific n-gram
§ We'd like a model that can be more statistically efficient
Some Definitions

  INPUTS:          close the ____
  CANDIDATE SET:   {door, table, …}
  CANDIDATES:      table, door
  TRUE OUTPUTS:    door
  FEATURE VECTORS:
    y occurs in x
    "close" in x ∧ y = "door"
    x-1 = "the" ∧ y = "door"
    x-1 = "the" ∧ y = "table"
More Features, Less Interaction

  x = closing the ____,  y = doors

§ N-Grams:   x-1 = "the" ∧ y = "doors"
§ Skips:     x-2 = "closing" ∧ y = "doors"
§ Lemmas:    x-2 = "close" ∧ y = "door"
§ Caching:   y occurs in x
Data: Feature Impact

  Features           | Train Perplexity | Test Perplexity
  -------------------+------------------+----------------
  3-gram indicators  | 241              | 350
  1-3 grams          | 126              | 172
  1-3 grams + skips  | 101              | 164
Exponential Form
§ Weights w, features f(x, y)
§ Linear score: w^T f(x, y)
§ Unnormalized probability: exp(w^T f(x, y))
§ Probability:

  $$P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$
Likelihood Objective
§ Model form:

  $$P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$$

§ Log-likelihood of training data:

  $$L(w) = \log \prod_i P(y_i^* \mid x_i, w) = \sum_i \log \frac{\exp(w^\top f(x_i, y_i^*))}{\sum_{y'} \exp(w^\top f(x_i, y'))}$$

  $$= \sum_i \left( w^\top f(x_i, y_i^*) - \log \sum_{y'} \exp(w^\top f(x_i, y')) \right)$$
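As a concrete reference, here is a minimal sketch of this objective and its gradient for one example, assuming binary features stored as index lists. The `feats`/`gold` data layout and the method name `exampleLL` are illustrative, not from the lecture.

```java
// Log-likelihood of one example under the maxent model. The gradient
// contribution is f(x, y*) minus the expected feature vector: the same
// observed-minus-expected counts described on the Gradients slide below.
static double exampleLL(double[] w, int[][] feats, int gold, double[] grad) {
    double[] scores = new double[feats.length];
    double max = Double.NEGATIVE_INFINITY;
    for (int y = 0; y < feats.length; y++) {
        for (int j : feats[y]) scores[y] += w[j];    // linear score w . f(x, y)
        max = Math.max(max, scores[y]);
    }
    double z = 0;
    for (double s : scores) z += Math.exp(s - max);  // stable log-sum-exp
    double logZ = max + Math.log(z);
    for (int y = 0; y < feats.length; y++) {
        double p = Math.exp(scores[y] - logZ);       // P(y | x, w)
        for (int j : feats[y]) grad[j] -= p;         // minus expected counts
    }
    for (int j : feats[gold]) grad[j] += 1.0;        // plus observed counts
    return scores[gold] - logZ;                      // log P(y* | x, w)
}
```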
Training
History of Training
§ 1990s: Specialized methods (e.g. iterative scaling)
§ 2000s: General-purpose methods (e.g. conjugate gradient)
§ 2010s: Online methods (e.g. stochastic gradient)
What Does LL Look Like?
§ Example
  § Data: x x x y (three observations of x, one of y)
  § Two outcomes, x and y
  § One indicator feature for each
  § Likelihood
Convex Optimization
§ The maxent objective is an unconstrained convex problem
§ One optimal value*, gradients point the way
Gradients

  $$\frac{\partial L(w)}{\partial w} = \sum_i \left( f(x_i, y_i^*) - \sum_{y'} P(y' \mid x_i, w) \, f(x_i, y') \right)$$

§ First term: count of features under target labels
§ Second term: expected count of features under the model's predicted label distribution
Gradient Ascent
§ The maxent objective is an unconstrained optimization problem
§ Gradient ascent (a minimal loop is sketched after this list)
  § Basic idea: move uphill from current guess
  § Gradient ascent/descent follows the gradient incrementally
  § At a local optimum, the derivative vector is zero
  § Will converge if step sizes are small enough, but not efficient
  § All we need is to be able to evaluate the function and its derivative
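A minimal batch gradient-ascent loop under the same assumptions as before: an illustrative `Example` holder with `feats` and `gold` fields, the `exampleLL` method from the earlier sketch, and a made-up step size `eta`.

```java
// Batch gradient ascent: accumulate the full gradient, then step uphill.
// Converges for small enough eta, but needs many passes over the data.
double eta = 0.1;                                // illustrative step size
double[] w = new double[numFeatures];
for (int iter = 0; iter < 100; iter++) {
    double[] grad = new double[numFeatures];
    double ll = 0;
    for (Example ex : data) {
        ll += exampleLL(w, ex.feats, ex.gold, grad);  // observed - expected
    }
    for (int j = 0; j < w.length; j++) {
        w[j] += eta * grad[j];                   // move uphill
    }
}
```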
(Quasi-)Newton Methods
§ 2nd-order methods: repeatedly create a quadratic approximation and solve it
§ E.g. LBFGS, which tracks derivatives to approximate the (inverse) Hessian
Regularization
Regularization Methods
§ Early stopping
§ L2: L(w) − ‖w‖₂²
§ L1: L(w) − ‖w‖₁
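In code terms, the L2 penalty is a one-line change to the objective and gradient from the earlier sketch. The strength `lambda` is a tuning constant I am adding for illustration; the slide writes the penalty unscaled.

```java
// Subtract lambda * ||w||_2^2 from the log-likelihood; the matching
// gradient term shrinks every weight toward zero on each step.
for (int j = 0; j < w.length; j++) {
    ll -= lambda * w[j] * w[j];
    grad[j] -= 2 * lambda * w[j];
}
```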
Regularization Effects
§ Early stopping: don't do this
§ L2: weights stay small but non-zero
§ L1: many weights driven to zero
  § Good for sparsity
  § Usually bad for accuracy in NLP
Scaling
Why is Scaling Hard?
§ Big normalization terms
§ Lots of data points
Hierarchical Prediction
§ Hierarchical prediction / softmax [Mikolov et al., 2013]
§ Noise-Contrastive Estimation [Mnih, 2013]
§ Self-Normalization [Devlin, 2014]
Stochastic Gradient
§ View the gradient as an average over data points
§ Stochastic gradient: take a step after each example (or mini-batch)
§ Substantial improvements exist, e.g. AdaGrad (Duchi, 11), sketched below
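A sketch of the per-example update with AdaGrad scaling (Duchi, 11), reusing the illustrative `Example`/`exampleLL` pieces from the earlier sketches.

```java
// Stochastic gradient with AdaGrad: one step per example, each coordinate's
// step shrunk by the root of its accumulated squared gradients.
double[] sumSq = new double[w.length];
for (Example ex : data) {
    double[] g = new double[w.length];
    exampleLL(w, ex.feats, ex.gold, g);   // gradient of one example
    for (int j = 0; j < w.length; j++) {
        if (g[j] == 0) continue;          // sparse features: most entries are zero
        sumSq[j] += g[j] * g[j];
        w[j] += eta * g[j] / Math.sqrt(sumSq[j]);
    }
}
```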
Other Methods
Neural Net LMs
[Figure: Bengio et al., 03]
Neural vs Maxent
§ Maxent LM: exp(w^T f(x, y))
§ Neural Net LM: exp(B φ(A f(x)))
  § φ nonlinear, e.g. tanh
Neural Net LMs

[Figure: feedforward neural LM. Context words x-2 = "closing", x-1 = "the" are looked up as embedding vectors v_closing, v_the; output scores are produced for candidate words "man", "door", "doors", …]

  $$h = \phi\left( \sum_i A_i v_i \right)$$

  $$P(y \mid w, x) \propto e^{Bh}$$
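A minimal forward pass matching the slide's two equations, with dense arrays standing in for the embedding and output parameters. The matrix layout, dimensions, and method name are assumptions.

```java
// h = phi(sum_i A_i v_i), P(y | w, x) proportional to exp(B h), phi = tanh.
// Shapes: A[i] is hidden x embed, contextVecs[i] is the embedding of
// context word i, B is vocab x hidden.
static double[] forward(double[][][] A, double[][] B, double[][] contextVecs) {
    int hidden = A[0].length;
    double[] h = new double[hidden];
    for (int i = 0; i < contextVecs.length; i++)     // sum_i A_i v_i
        for (int r = 0; r < hidden; r++)
            for (int c = 0; c < contextVecs[i].length; c++)
                h[r] += A[i][r][c] * contextVecs[i][c];
    for (int r = 0; r < hidden; r++)
        h[r] = Math.tanh(h[r]);                      // the nonlinearity phi
    double[] p = new double[B.length];               // one score per word y
    double z = 0;
    for (int y = 0; y < B.length; y++) {
        double s = 0;
        for (int r = 0; r < hidden; r++) s += B[y][r] * h[r];
        p[y] = Math.exp(s);                          // unnormalized exp(B h)
        z += p[y];
    }
    for (int y = 0; y < p.length; y++) p[y] /= z;    // normalize to P(y | w, x)
    return p;
}
```

Note the normalization loop at the end is exactly the big sum the Scaling slides worry about: it runs over the whole vocabulary.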
Mixed Interpolation
§ But can't we just interpolate:
  § P(w | most recent words)
  § P(w | skip contexts)
  § P(w | caching)
  § …
§ Yes, and people do (well, did)
§ But additive combination tends to flatten distributions, not zero out candidates
Decision Trees/Forests
§ Decision trees?
  § Good for non-linear decision problems
  § Random forests can improve further [Xu and Jelinek, 2004]
  § Paths to leaves basically learn conjunctions
  § General contrast between DTs and linear models

[Figure: decision tree splitting on "Prev word?" then "… last verb?"]
Data: regularization impact (train/test perplexity by L2 strength):

  Regularizer | Train Perplexity | Test Perplexity
  ------------+------------------+----------------
  L2 (0.01)   |  17              | 355
  L2 (0.1)    |  27              | 172
  L2 (0.5)    |  60              | 156
  L2 (10)     | 296              | 265
Maximum Entropy LMs
§ Want a model over completions y given a context x:

  P(y | x) = P(close the door | close the)

§ Want to characterize the important aspects of y = (v, x) using a feature function f
§ f might include (a sketch of such a function follows this list):
  § Indicator of v (unigram)
  § Indicator of v, previous word (bigram)
  § Indicator of whether v occurs in x (cache)
  § Indicator of v and each non-adjacent previous word
  § …
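A sketch of such a feature function f, emitting string-named indicator features for a candidate completion v in context x. The naming scheme is illustrative; a real system would intern these strings to integer feature indices.

```java
// Emit the indicator features listed above for candidate v in context x
// (x holds the previous words, oldest first).
static java.util.List<String> features(String v, String[] x) {
    java.util.List<String> f = new java.util.ArrayList<>();
    f.add("unigram=" + v);                           // indicator of v
    f.add("bigram=" + x[x.length - 1] + "_" + v);    // v with previous word
    for (String word : x) {
        if (word.equals(v)) { f.add("cache=" + v); break; }  // v occurs in x
    }
    for (int k = 0; k < x.length - 1; k++) {         // non-adjacent previous words
        f.add("skip" + (x.length - k) + "=" + x[k] + "_" + v);
    }
    return f;
}
```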