Language Modeling III. Taylor Berg-Kirkpatrick (CMU). Slides: Dan Klein (UC Berkeley). Algorithms for NLP.


Page 1: Algorithms for NLP, 11-711 FA17 (tbergkir/11711fa17), 2017-09-07

Language Modeling III

Taylor Berg-Kirkpatrick (CMU)
Slides: Dan Klein (UC Berkeley)

Algorithms for NLP

Page 2:

Announcements

§ Office hours on website
  § But no OH for Taylor until next week

Page 3:

Efficient Hashing

§ Closed address hashing
  § Resolve collisions with chains
  § Easier to understand but bigger

§ Open address hashing
  § Resolve collisions with probe sequences
  § Smaller but easy to mess up

§ Direct-address hashing
  § No collision resolution
  § Just eject previous entries
  § Not suitable for core LM storage
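Since the slides show no code, here is a minimal Python sketch of the open-addressing scheme from the second bullet, using linear probing (table size, names, and the no-resize assumption are all illustrative, not the course's implementation):

```python
# Minimal open-address hash map with linear probing: collisions are
# resolved by scanning forward to the next free slot.
# Sketch only: fixed size, no resizing, assumes the table never fills.
SIZE = 8
keys = [None] * SIZE
vals = [None] * SIZE

def put(k, v):
    i = hash(k) % SIZE
    while keys[i] is not None and keys[i] != k:
        i = (i + 1) % SIZE            # probe sequence: try the next slot
    keys[i], vals[i] = k, v

def get(k):
    i = hash(k) % SIZE
    while keys[i] is not None:
        if keys[i] == k:
            return vals[i]
        i = (i + 1) % SIZE            # keep probing past other keys
    return None                        # hit an empty slot: not present

put('the cat', 233)
put('the dog', 51)
assert get('the cat') == 233 and get('the dog') == 51
```

The "easy to mess up" bullet is visible even here: deletion, resizing, and full-table termination all need extra care that chained hashing avoids.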

Page 4:

Integer Encodings

n-gram             count
the cat laughed    233

word ids: 7 1 15

Page 5:

Bit Packing

Got 3 numbers under 2^20 to store? Fits in a primitive 64-bit long.

20 bits each:  7 = 0…00111   1 = 0…00001   15 = 0…01111
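The packing above can be sketched in Python (Python ints are unbounded, so we simply check the packed value fits in 64 bits; the helper names are invented):

```python
# Pack three word ids (each < 2**20) into one 64-bit integer,
# as on the slide: trigram ids (7, 1, 15) -> one long.
BITS = 20
MASK = (1 << BITS) - 1

def pack(w1, w2, w3):
    assert max(w1, w2, w3) <= MASK, "word ids must fit in 20 bits"
    return (w1 << (2 * BITS)) | (w2 << BITS) | w3

def unpack(code):
    return (code >> (2 * BITS)) & MASK, (code >> BITS) & MASK, code & MASK

code = pack(7, 1, 15)        # "the cat laughed" as ids 7, 1, 15
assert unpack(code) == (7, 1, 15)
assert code < 2 ** 64        # fits in a primitive 64-bit long
```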

Page 6:

Integer Encodings

n-gram             count
the cat laughed    233

n-gram encoding: 15176595

Page 7:

Rank Values

c(the) = 23135851162 < 2^35

35 bits to represent integers between 0 and 2^35

n-gram encoding: 15176595 (60 bits)    count: 233 (35 bits)

Page 8:

Rank Values

# unique counts = 770000 < 2^20

20 bits to represent ranks of all counts

n-gram encoding: 15176595 (60 bits)    rank: 3 (20 bits)

rank   freq
0      1
1      2
2      51
3      233
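The rank trick above can be sketched in a few lines of Python (the toy counts are invented; the point is that storing ranks into a table of unique counts needs far fewer bits than storing raw counts):

```python
# Store each n-gram's count as a rank into a table of unique counts:
# ~770k distinct count values fit in 20 bits even when raw counts need 35.
counts = [1, 233, 2, 51, 233, 1]           # toy per-n-gram counts

freq_table = sorted(set(counts))            # rank -> frequency
rank_of = {c: r for r, c in enumerate(freq_table)}

ranks = [rank_of[c] for c in counts]        # what we actually store
assert all(freq_table[r] == c for r, c in zip(ranks, counts))
assert max(ranks) < 2 ** 20                 # ranks fit in 20 bits
```

With these toy counts, `freq_table` is exactly the slide's rank/freq table: rank 3 maps back to frequency 233.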

Page 9:

So Far

§ Word indexer
§ Rank lookup
§ Count DB
(trigram / bigram / unigram)

N-gram encoding scheme:
  unigram: f(id) = id
  bigram: f(id1, id2) = ?
  trigram: f(id1, id2, id3) = ?

Page 10:

Hashing vs Sorting

Page 11:

Context Tries

Page 12:

Tries

Page 13:

Context Encodings

[Many details from Pauls and Klein, 2011]

Page 14:

Context Encodings

Page 15:

Compression

Page 16:

Idea: Differential Compression

Page 17:

Variable Length Encodings

Encoding "9":
  length in unary: 000
  number in binary: 1001
  → 000 1001

[Elias, 75]
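The slide's scheme is the Elias gamma code: a sketch in Python (function names invented), which reproduces the "9 → 000 1001" example:

```python
# Elias gamma code: write floor(log2(n)) zeros, then n in binary.
# The leading zeros announce the code length, so codes are
# self-delimiting and small numbers get short codes.
def gamma_encode(n):
    assert n >= 1                   # gamma codes positive integers only
    b = bin(n)[2:]                  # binary form, always starts with '1'
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    zeros = 0
    while bits[zeros] == '0':       # count the unary length prefix
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2)

assert gamma_encode(9) == '0001001'     # the slide's example: 000 1001
assert gamma_decode('0001001') == 9
```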

Page 18:

Speed-Ups

Page 19:

Context Encodings

Page 20:

Naïve N-Gram Lookup

Page 21:

Rolling Queries

Page 22:

Idea: Fast Caching

LM can be more than 10x faster with direct-address caching
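A direct-address cache like the slide describes can be sketched as follows (a toy illustration, not the course implementation: index by hash, eject the previous entry on collision, never probe):

```python
# Direct-address cache in front of a (slow) LM probability lookup:
# index by hash, eject on collision, no collision resolution at all.
CACHE_SIZE = 1 << 16
cache_keys = [None] * CACHE_SIZE
cache_vals = [0.0] * CACHE_SIZE

def cached_score(ngram, slow_score):
    i = hash(ngram) % CACHE_SIZE
    if cache_keys[i] == ngram:           # hit: skip the LM lookup entirely
        return cache_vals[i]
    v = slow_score(ngram)                # miss: compute, eject old entry
    cache_keys[i], cache_vals[i] = ngram, v
    return v

calls = []
def slow(ng):                            # stand-in for the real LM lookup
    calls.append(ng)
    return -1.5

cached_score(('the', 'cat'), slow)
cached_score(('the', 'cat'), slow)       # second query served from cache
assert len(calls) == 1
```

Decoding queries the same n-grams over and over, which is why a cache this crude can still yield the 10x speedup the slide mentions.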

Page 23:

Approximate LMs

§ Simplest option: hash-and-hope
  § Array of size K ~ N
  § (optional) store hash of keys
  § Store values in direct-address
  § Collisions: store the max
  § What kind of errors can there be?

§ More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
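Hash-and-hope can be sketched in a few lines (toy sizes and names; the max-on-collision rule follows the bullets). One answer to the slide's question about errors: they are one-sided, since a lookup can only return a count at least as large as the truth:

```python
# "Hash-and-hope": fixed array of size K, no collision resolution.
# On collision keep the max count, so stored values never undercount.
K = 1000
table = [0] * K

def insert(ngram, count):
    i = hash(ngram) % K
    table[i] = max(table[i], count)      # collisions: store the max

def lookup(ngram):
    return table[hash(ngram) % K]        # may overestimate, never under

insert(('the', 'cat'), 233)
assert lookup(('the', 'cat')) >= 233
```

Optionally storing a (second) hash of each key in the slot lets most foreign-key collisions be detected, shrinking the error rate further at a small memory cost.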

Page 24:

Maximum Entropy Models

Page 25:

Improving on N-Grams?

§ N-grams don't combine multiple sources of evidence well

  P(construction | After the demolition was completed, the)

§ Here:
  § "the" gives syntactic constraint
  § "demolition" gives semantic constraint
  § Unlikely the interaction between these two has been densely observed in this specific n-gram

§ We'd like a model that can be more statistically efficient

Page 26:

Some Definitions

INPUTS:           close the ____
CANDIDATE SET:    {door, table, …}
CANDIDATES:       door, table
TRUE OUTPUTS:     door
FEATURE VECTORS:  y occurs in x
                  "close" in x ∧ y = "door"
                  x-1 = "the" ∧ y = "door"
                  x-1 = "the" ∧ y = "table"

Page 27:

More Features, Less Interaction

x = closing the ____,  y = doors

§ N-Grams:  x-1 = "the" ∧ y = "doors"
§ Skips:    x-2 = "closing" ∧ y = "doors"
§ Lemmas:   x-2 = "close" ∧ y = "door"
§ Caching:  y occurs in x
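Indicator features like these are easy to render as code. A hedged sketch (the string names and the `features` helper are invented for illustration; `&` stands in for ∧):

```python
# Indicator features for a candidate completion y of context x,
# mirroring the slide: an n-gram feature, a skip feature, and a
# cache feature firing when y already occurred in the context.
def features(x, y):
    feats = {f'x-1="{x[-1]}" & y="{y}"',       # bigram indicator
             f'x-2="{x[-2]}" & y="{y}"'}       # skip indicator
    if y in x:
        feats.add('y occurs in x')             # cache indicator
    return feats

f = features(('closing', 'the'), 'doors')
assert 'x-1="the" & y="doors"' in f
assert 'x-2="closing" & y="doors"' in f
```

A real implementation would also add the lemma features (e.g. firing "close"/"door" features for "closing"/"doors"), which needs a lemmatizer and is omitted here.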

Page 28:

Data: Feature Impact

Features             Train Perplexity   Test Perplexity
3-gram indicators    241                350
1-3 grams            126                172
1-3 grams + skips    101                164

Page 29:

Exponential Form

§ Weights w, features f(x, y)
§ Linear score: w^\top f(x, y)
§ Unnormalized probability: \exp(w^\top f(x, y))
§ Probability: P(y \mid x, w) = \exp(w^\top f(x, y)) / \sum_{y'} \exp(w^\top f(x, y'))

Page 30:

Likelihood Objective

§ Model form:

  P(y \mid x, w) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}

§ Log-likelihood of training data:

  L(w) = \log \prod_i P(y^*_i \mid x_i, w)
       = \sum_i \log \frac{\exp(w^\top f(x_i, y^*_i))}{\sum_{y'} \exp(w^\top f(x_i, y'))}
       = \sum_i \left( w^\top f(x_i, y^*_i) - \log \sum_{y'} \exp(w^\top f(x_i, y')) \right)
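The model form and log-likelihood can be sketched directly in Python (the feature representation as a set of indicator names, the toy feature function, and all helper names are assumptions for illustration; the max-shift is the standard log-sum-exp trick for numerical stability):

```python
import math

# Maxent model: P(y|x,w) = exp(w.f(x,y)) / sum_y' exp(w.f(x,y')),
# with features represented as sets of indicator-feature names.
def score(w, feats):
    return sum(w.get(k, 0.0) for k in feats)

def prob(w, x, y, candidates, f):
    scores = {c: score(w, f(x, c)) for c in candidates}
    m = max(scores.values())                        # log-sum-exp shift
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

def log_likelihood(w, data, candidates, f):
    return sum(math.log(prob(w, x, y, candidates, f)) for x, y in data)

f = lambda x, y: {f'{x}&{y}'}                       # toy feature function
# With zero weights every candidate scores 0, so P is uniform:
assert abs(prob({}, 'close the', 'door', ['door', 'table'], f) - 0.5) < 1e-9
```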

Page 31:

Training

Page 32:

History of Training

§ 1990s: Specialized methods (e.g. iterative scaling)

§ 2000s: General-purpose methods (e.g. conjugate gradient)

§ 2010s: Online methods (e.g. stochastic gradient)

Page 33:

What Does LL Look Like?

§ Example
  § Data: x x x y
  § Two outcomes, x and y
  § One indicator for each
  § Likelihood

Page 34:

Convex Optimization

§ The maxent objective is an unconstrained convex problem

§ One optimal value*, gradients point the way

Page 35:

Gradients

§ Gradient of the log-likelihood:

  \nabla L(w) = \sum_i f(x_i, y^*_i) - \sum_i \sum_y P(y \mid x_i, w) \, f(x_i, y)

§ First term: count of features under target labels

§ Second term: expected count of features under model-predicted label distribution
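The observed-minus-expected structure of the gradient can be sketched as follows (feature sets of indicator names and all helper names are illustrative assumptions, as elsewhere):

```python
import math

# Gradient of the maxent log-likelihood for one example:
# f(x, y*) minus the model-expected feature vector E_{P(y|x,w)}[f(x, y)].
def gradient(w, x, y_true, candidates, f):
    scores = {c: sum(w.get(k, 0.0) for k in f(x, c)) for c in candidates}
    m = max(scores.values())                       # log-sum-exp shift
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    grad = {}
    for k in f(x, y_true):                         # observed feature counts
        grad[k] = grad.get(k, 0.0) + 1.0
    for c in candidates:                           # minus expected counts
        p = exps[c] / z
        for k in f(x, c):
            grad[k] = grad.get(k, 0.0) - p
    return grad

f = lambda x, y: {f'y={y}'}                        # toy feature function
g = gradient({}, 'close the', 'door', ['door', 'table'], f)
assert abs(g['y=door'] - 0.5) < 1e-9               # observed 1, expected 0.5
assert abs(g['y=table'] + 0.5) < 1e-9              # observed 0, expected 0.5
```

At the optimum the two terms cancel: the model's expected feature counts match the empirical ones, which is exactly the maximum-entropy moment-matching condition.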

Page 36:

Gradient Ascent

§ The maxent objective is an unconstrained optimization problem

§ Gradient Ascent
  § Basic idea: move uphill from current guess
  § Gradient ascent/descent follows the gradient incrementally
  § At local optimum, derivative vector is zero
  § Will converge if step sizes are small enough, but not efficient
  § All we need is to be able to evaluate the function and its derivative
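The last bullet (only function and derivative evaluations are needed) is easy to demonstrate; a minimal sketch on a one-dimensional concave toy objective, with a numerical derivative standing in for an analytic gradient:

```python
# Plain gradient ascent: move uphill with a small fixed step.
# All we need is to evaluate fn and its derivative (numerical here).
def ascend(fn, w0, lr=0.1, steps=2000, eps=1e-6):
    w = w0
    for _ in range(steps):
        g = (fn(w + eps) - fn(w - eps)) / (2 * eps)   # derivative estimate
        w += lr * g                                   # step uphill
    return w

# Concave toy objective with its maximum at w = 3.
w_star = ascend(lambda w: -(w - 3.0) ** 2, w0=0.0)
assert abs(w_star - 3.0) < 1e-3
```

The fixed step size is what makes this inefficient in practice; the (quasi-)Newton methods on the next slide adapt the step using curvature.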

Page 37:

(Quasi-)Newton Methods

§ 2nd-order methods: repeatedly create a quadratic approximation and solve it

§ E.g. L-BFGS, which tracks derivatives to approximate the (inverse) Hessian

Page 38:

Regularization

Page 39:

Regularization Methods

§ Early stopping

§ L2: L(w) - \|w\|_2^2

§ L1: L(w) - \|w\|_1

Page 40:

Regularization Effects

§ Early stopping: don't do this

§ L2: weights stay small but non-zero

§ L1: many weights driven to zero
  § Good for sparsity
  § Usually bad for accuracy for NLP

Page 41:

Scaling

Page 42:

Why is Scaling Hard?

§ Big normalization terms

§ Lots of data points

Page 43:

Hierarchical Prediction

§ Hierarchical prediction / softmax [Mikolov et al., 2013]

§ Noise-Contrastive Estimation [Mnih, 2013]

§ Self-Normalization [Devlin, 2014]

Image: ayende.com

Page 44:

Stochastic Gradient

§ View the gradient as an average over data points

§ Stochastic gradient: take a step each example (or mini-batch)

§ Substantial improvements exist, e.g. AdaGrad (Duchi, 11)
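Per-example updates can be sketched on a toy problem (least-squares line fitting rather than a maxent LM, purely to keep the example self-contained; the data, learning rate, and step count are invented):

```python
import random

# Stochastic gradient: view the full gradient as an average over
# examples and take a small step per randomly sampled example
# instead of summing over the whole data set each iteration.
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in range(10)]    # fit y = a*x + b

a, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    x, y = random.choice(data)                    # one example at a time
    err = (a * x + b) - y
    a -= lr * err * x                             # per-example gradient step
    b -= lr * err

assert abs(a - 2.0) < 0.2 and abs(b - 1.0) < 0.5
```

Each step is cheap and noisy; averaged over many steps the iterates drift toward the same optimum the full-batch gradient would reach, which is why this scales to huge data sets.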

Page 45:

Other Methods

Page 46:

Neural Net LMs

Image: (Bengio et al., 03)

Page 47:

Neural vs Maxent

§ Maxent LM

§ Neural Net LM:

  \exp(B \, \phi(A f(x)))

  \phi nonlinear, e.g. tanh

Page 48:

Neural Net LMs

Context: x-1 = the, x-2 = closing
Embeddings: v_closing = [1.2, 7.4, …], v_the = [-3.3, 1.1, …]
Output scores over {man, door, doors, …}: -7.2, 2.3, 1.5, 8.9, …

h = \phi\left( \sum_i A_i v_i \right)

P(y \mid w, x) \propto e^{Bh}
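A toy forward pass matching this picture can be sketched as follows. All sizes and weights are invented, and the slide's per-position matrices A_i are collapsed into one shared A (A applied to the summed embeddings equals the slide's sum when every A_i is the same matrix):

```python
import math

# Tiny feedforward neural LM forward pass: embed the context words,
# h = tanh(A * sum of embeddings), vocabulary scores = B h, softmax.
V = {'closing': [1.2, 7.4], 'the': [-3.3, 1.1]}    # toy word embeddings
A = [[0.5, -0.2], [0.1, 0.3]]                       # context projection
B = {'man': [0.2, -0.1], 'door': [0.4, 0.5], 'doors': [0.3, 0.6]}

def forward(context):
    s = [sum(dim) for dim in zip(*(V[w] for w in context))]  # sum embeddings
    h = [math.tanh(sum(a * x for a, x in zip(row, s))) for row in A]
    scores = {y: sum(b * x for b, x in zip(row, h)) for y, row in B.items()}
    z = sum(math.exp(t) for t in scores.values())            # softmax
    return {y: math.exp(t) / z for y, t in scores.items()}

p = forward(['closing', 'the'])
assert abs(sum(p.values()) - 1.0) < 1e-9            # a proper distribution
```

Structurally this is the maxent LM of the previous slides with one nonlinear hidden layer inserted between the (now dense, learned) features and the output weights.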

Page 49:

Mixed Interpolation

§ But can't we just interpolate:
  § P(w | most recent words)
  § P(w | skip contexts)
  § P(w | caching)
  § …

§ Yes, and people do (well, did)

§ But additive combination tends to flatten distributions, not zero out candidates

Page 50:

Decision Trees / Forests

§ Decision trees?
  § Good for non-linear decision problems
  § Random forests can improve further [Xu and Jelinek, 2004]
  § Paths to leaves basically learn conjunctions
  § General contrast between DTs and linear models

Figure: decision tree with node tests "Prev word?", "… last verb?"

Page 51:
Page 52:

§ L2(0.01)  17/355
§ L2(0.1)   27/172
§ L2(0.5)   60/156
§ L2(10)    296/265

Page 53:

Maximum Entropy LMs

§ Want a model over completions y given a context x:

  P(y | x), e.g. P(door | close the ____)

§ Want to characterize the important aspects of y = (v, x) using a feature function f

§ f might include
  § Indicator of v (unigram)
  § Indicator of v, previous word (bigram)
  § Indicator whether v occurs in x (cache)
  § Indicator of v and each non-adjacent previous word
  § …