Scaling up LDA - 2

TRANSCRIPT

Page 1: Scaling up  LDA - 2

Fast sampling for LDA

William Cohen

Page 2: Scaling up  LDA - 2

MORE LDA SPEEDUPS
FIRST - RECAP LDA DETAILS

Page 3: Scaling up  LDA - 2
Page 4: Scaling up  LDA - 2

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

From: Parameter estimation for text analysis - Gregor Heinrich

P(z = t | z_-, w, d) ∝ (β + n_w|t) / (βV + n_·|t) · (α_t + n_t|d)

(β + n_w|t) / (βV + n_·|t): prob this word/term is assigned to topic t
(α_t + n_t|d): prob this doc contains topic t
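To make the two factors concrete, here is a minimal sketch of the unnormalized conditional in Python (the count-array names n_wt, n_td, n_t are illustrative, not from the slides):

    def conditional(w, d, n_wt, n_td, n_t, alpha, beta, V, K):
        """Unnormalized P(z = t | everything else) for one token of word w in doc d,
        with the token's own current assignment already subtracted from the counts."""
        return [(beta + n_wt[w][t]) / (beta * V + n_t[t]) * (alpha + n_td[d][t])
                for t in range(K)]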

Page 5: Scaling up  LDA - 2

More detail

Page 6: Scaling up  LDA - 2
Page 7: Scaling up  LDA - 2

(Figure: the unnormalized probabilities of z=1, z=2, z=3 drawn as adjacent segments of unit height; a random point along the total length selects the topic.)

Page 8: Scaling up  LDA - 2

SPEEDUP 1 - SPARSITY

Page 9: Scaling up  LDA - 2

KDD 2008

Page 10: Scaling up  LDA - 2

(Figure: the same unit-height segments for z=1, z=2, z=3, with a random draw along the total length.)

Page 11: Scaling up  LDA - 2
Page 12: Scaling up  LDA - 2

Running total of P(z=k|…), i.e. P(z<=k)
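A minimal sketch of sampling via that running total, assuming p holds the unnormalized P(z=k|…) values:

    import random

    def sample_from(p):
        """Draw k with probability p[k] / sum(p) by scanning the running total."""
        r = random.random() * sum(p)   # r is uniform on [0, Z)
        acc = 0.0
        for k, pk in enumerate(p):
            acc += pk                  # running total, proportional to P(z <= k)
            if r < acc:
                return k
        return len(p) - 1              # guard against floating-point round-off

Each draw costs a full O(K) scan, which is exactly what the next slides attack.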

Page 13: Scaling up  LDA - 2

Discussion….

• Where do you spend your time?
  – sampling the z’s
  – each sampling step involves a loop over all topics
  – this seems wasteful
• Even with many topics, words are often only assigned to a few different topics
  – low-frequency words appear < K times … and there are lots and lots of them!
  – even frequent words are not in every topic

Page 14: Scaling up  LDA - 2

Discussion….

• What’s the solution?
  – Idea: come up with approximations to Z at each stage; then you might be able to stop early… computationally like a sparser vector
  – Want Z_i >= Z

Page 15: Scaling up  LDA - 2

Tricks
• How do you compute and maintain the bound?
  – see the paper
• What order do you go in?
  – want to pick large P(k)’s first
  – … so we want large P(k|d) and P(k|w)
  – … so we maintain the k’s in sorted order
  – the counts only change a little bit after each flip, so a bubble-sort pass will fix up the almost-sorted array (see the sketch below)
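A minimal sketch of that fix-up pass (the flat array of sorted counts is illustrative; in practice you would keep (count, topic) pairs):

    def fix_almost_sorted(a, i):
        """Restore descending order after only a[i] changed by +/- 1."""
        while i > 0 and a[i] > a[i - 1]:           # bubble up if the count grew
            a[i], a[i - 1] = a[i - 1], a[i]
            i -= 1
        while i < len(a) - 1 and a[i] < a[i + 1]:  # bubble down if it shrank
            a[i], a[i + 1] = a[i + 1], a[i]
            i += 1
        return i                                   # new position of the changed entry

Since a count changes by one per flip, the fix-up is usually a single swap rather than a full re-sort.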

Page 16: Scaling up  LDA - 2

Results

Page 17: Scaling up  LDA - 2

SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY

Page 18: Scaling up  LDA - 2

KDD 2009

Page 19: Scaling up  LDA - 2

z = s + r + q, where

  s = Σ_t α_t β / (βV + n_·|t)                 (smoothing-only mass)
  r = Σ_t n_t|d β / (βV + n_·|t)               (document-topic mass)
  q = Σ_t (α_t + n_t|d) n_w|t / (βV + n_·|t)   (topic-word mass)

t = topic (k), w = word, d = doc

Page 20: Scaling up  LDA - 2

z = s + r + q

• If U < s: look up U on a line segment with tick-marks at α_1β/(βV + n_·|1), α_2β/(βV + n_·|2), …
• If s < U < s + r: look up U on the line segment for r
  – only need to check t such that n_t|d > 0

t = topic (k), w = word, d = doc

Page 21: Scaling up  LDA - 2

z = s + r + q

• If U < s: look up U on a line segment with tick-marks at α_1β/(βV + n_·|1), α_2β/(βV + n_·|2), …
• If s < U < s + r: look up U on the line segment for r
• If s + r < U: look up U on the line segment for q
  – only need to check t such that n_w|t > 0

Page 22: Scaling up  LDA - 2

z = s + r + q

• For q: only need to check t such that n_w|t > 0
• For r: only need to check t such that n_t|d > 0
• For s: only need to check occasionally (< 10% of the time)

A sketch of the resulting bucketed sampler follows.
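A sketch of the bucketed sampler under these assumptions (dict-based sparse counts and the names n_wt, n_td, n_t are illustrative; a real implementation caches the bucket masses instead of recomputing them):

    import random

    def sparse_lda_sample(w, d, n_wt, n_td, n_t, alpha, beta, V):
        """One SparseLDA-style draw. n_wt[w] and n_td[d] are sparse dicts
        {topic: count}; n_t[t] is the total word count of topic t."""
        K = len(n_t)
        s = sum(alpha[t] * beta / (beta * V + n_t[t]) for t in range(K))
        r = sum(n * beta / (beta * V + n_t[t]) for t, n in n_td[d].items())
        q = sum((alpha[t] + n_td[d].get(t, 0)) * n / (beta * V + n_t[t])
                for t, n in n_wt[w].items())
        U = random.random() * (s + r + q)
        if U < s:                                # rare: walk all K topics
            for t in range(K):
                U -= alpha[t] * beta / (beta * V + n_t[t])
                if U <= 0:
                    return t
        elif U < s + r:                          # only topics with n_t|d > 0
            U -= s
            for t, n in n_td[d].items():
                U -= n * beta / (beta * V + n_t[t])
                if U <= 0:
                    return t
        else:                                    # only topics with n_w|t > 0
            U -= s + r
            for t, n in n_wt[w].items():
                U -= (alpha[t] + n_td[d].get(t, 0)) * n / (beta * V + n_t[t])
                if U <= 0:
                    return t
        return K - 1                             # floating-point fallback

Because most of the mass is usually in q, most draws touch only the topics where the current word actually appears.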

Page 23: Scaling up  LDA - 2

z = s + r + q

Need to store n_w|t for each word, topic pair …???

Only need to store n_t|d for the current d.

Only need to store (and maintain) the total words per topic, and the α’s, β, V.

Trick: count up n_t|d for d when you start working on d, and update it incrementally.

Page 24: Scaling up  LDA - 2

z = s + r + q

Need to store n_w|t for each word, topic pair …???

1. Precompute, for each t, the coefficient (α_t + n_t|d) / (βV + n_·|t) that multiplies n_w|t in q. Most (>90%) of the time and space is here…

2. Quickly find the t’s such that n_w|t is large for w.

Page 25: Scaling up  LDA - 2

Need to store n_w|t for each word, topic pair …???

1. Precompute, for each t, the coefficient that multiplies n_w|t in q. Most (>90%) of the time and space is here…

2. Quickly find the t’s such that n_w|t is large for w:
• map w to an int array
  – no larger than the frequency of w
  – no larger than #topics
• encode (t, n) as a bit vector
  – n in the high-order bits
  – t in the low-order bits
• keep the ints sorted in descending order (a packing sketch follows)
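A minimal sketch of the encoding, assuming topics fit in 10 low-order bits (the bit width is illustrative):

    def pack(t, n, topic_bits=10):
        """Encode (topic, count): count in the high-order bits, topic in the low."""
        return (n << topic_bits) | t

    def unpack(x, topic_bits=10):
        """Recover (topic, count) from a packed int."""
        return x & ((1 << topic_bits) - 1), x >> topic_bits

    # counts for one word w, e.g. {topic: n_w|t}
    row = sorted((pack(t, n) for t, n in {3: 7, 1: 2, 9: 41}.items()), reverse=True)
    # row decodes to topics 9, 3, 1 in order of decreasing n_w|t

Because the count sits in the high-order bits, plain integer comparison orders the array by n_w|t, so a descending sort puts the heavy topics first.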

Page 26: Scaling up  LDA - 2
Page 27: Scaling up  LDA - 2

Outline

• LDA/Gibbs algorithm details
• How to speed it up by parallelizing
• How to speed it up by faster sampling
  – Why sampling is key
  – Some sampling ideas for LDA
    • The Mimno/McCallum decomposition (SparseLDA)
    • Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Page 28: Scaling up  LDA - 2

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Basic problem: how can we sample from a biased coin quickly?

If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and use a binary tree over the running totals; the naive scan is O(K) per draw, the tree lookup is O(log2 K).

(Figure: the tree locating r in an interval such as (23/40, 7/10].)

A bisect-based sketch follows.
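A proof-of-concept in Python, where bisect over the precomputed running totals plays the role of the binary tree:

    import bisect
    import itertools
    import random

    def build_cdf(p):
        """O(K) preprocessing: running totals of the (unnormalized) probabilities."""
        return list(itertools.accumulate(p))

    def tree_sample(cdf):
        """O(log K) per draw: binary-search the running totals."""
        r = random.random() * cdf[-1]
        return bisect.bisect_right(cdf, r)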

Page 29: Scaling up  LDA - 2

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Another idea…

Simulate the dart with two drawn values: x ← int(u1 · K) and y ← u2 · p_max; keep throwing until you hit a stripe.
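A minimal sketch of that rejection sampler:

    import random

    def dart_sample(p):
        """Return k with probability p[k] / sum(p) by throwing darts."""
        K, p_max = len(p), max(p)
        while True:
            x = int(random.random() * K)   # u1: which column the dart lands in
            y = random.random() * p_max    # u2: how high the dart lands
            if y < p[x]:                   # inside the stripe: accept
                return x

The expected number of throws is K · p_max / sum(p), so this is fast only when no single stripe dominates.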

Page 30: Scaling up  LDA - 2

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit.

You can always do this using only two colors in each column of the final alias table, and then the dart never misses! A construction sketch follows.

Mathematically speaking: any discrete distribution over K outcomes can be rewritten as a uniform mixture of K two-point distributions.
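A sketch of the standard construction (Vose’s method, as described at the URL above):

    import random

    def build_alias(p):
        """Vose's alias method: O(K) preprocessing for O(1) draws."""
        K, total = len(p), sum(p)
        scaled = [x * K / total for x in p]          # average column height is 1.0
        prob, alias = [0.0] * K, [0] * K
        small = [k for k, x in enumerate(scaled) if x < 1.0]
        large = [k for k, x in enumerate(scaled) if x >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l         # top up column s with l's color
            scaled[l] -= 1.0 - scaled[s]             # l donated that much mass
            (small if scaled[l] < 1.0 else large).append(l)
        for k in small + large:                      # leftovers are exactly full
            prob[k] = 1.0
        return prob, alias

    def alias_sample(prob, alias):
        """Throw one dart: pick a column uniformly, then one of its two colors."""
        k = int(random.random() * len(prob))
        return k if random.random() < prob[k] else alias[k]

Preprocessing is O(K), but each draw is O(1), which pays off when you sample many times from the same (or a slowly changing) distribution.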

Page 31: Scaling up  LDA - 2

KDD 2014

Key ideas
• use a variant of the Mimno/McCallum decomposition
• use alias tables to sample from the dense parts
• since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

Page 32: Scaling up  LDA - 2

KDD 2014

• q is the stale, easy-to-draw-from distribution
• p is the updated distribution
• computing the ratios p(i)/q(i) is cheap
• usually the ratio is close to one
  – else the dart missed (see the acceptance-step sketch below)
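A minimal sketch of the acceptance step, as an independence-style Metropolis-Hastings move (function names are illustrative):

    import random

    def mh_step(z_cur, p, q_draw, q_prob):
        """One MH step: propose from the stale q, accept with ratio
        p(z_new) q(z_cur) / (p(z_cur) q(z_new)); p and q may each be unnormalized."""
        z_new = q_draw()                 # e.g., an O(1) alias-table draw
        a = (p[z_new] * q_prob(z_cur)) / (p[z_cur] * q_prob(z_new))
        return z_new if random.random() < a else z_cur

Because q is usually close to p, the ratio is near one and almost every proposal is accepted, so the chain mixes nearly as well as exact Gibbs at a fraction of the cost.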

Page 33: Scaling up  LDA - 2

KDD 2014

Page 34: Scaling up  LDA - 2

SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION

Page 35: Scaling up  LDA - 2
Page 36: Scaling up  LDA - 2

What if you try to parallelize?

Split the document/term matrix randomly and distribute it to p processors … then run “Approximate Distributed LDA”.

This is a common subtask in parallel versions of: LDA, SGD, …

Page 37: Scaling up  LDA - 2

Introduction

• Common pattern:
  – do some learning in parallel
  – aggregate local changes from each processor to shared parameters
  – distribute the new shared parameters back to each processor
  – and repeat…
• AllReduce is implemented in MPI, and recently in VW code (John Langford) in a Hadoop-compatible scheme; a minimal sketch follows

(Figure: a MAP phase followed by an ALLREDUCE phase.)
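A minimal sketch of one such round using mpi4py (the parameter vector and update rule are illustrative stand-ins, not VW’s code):

    # run with, e.g.: mpirun -n 4 python allreduce_round.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # MAP: each worker computes a local change to the shared parameters
    local_delta = np.random.rand(10)                # stand-in for real learning

    # ALLREDUCE: sum everyone's deltas and give the result to every worker
    global_delta = np.empty_like(local_delta)
    comm.Allreduce(local_delta, global_delta, op=MPI.SUM)

    # every worker applies the same aggregated update, then the loop repeats
    shared_params = global_delta / comm.Get_size()  # e.g., average the updates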

Pages 38-43: Scaling up LDA - 2 (image-only slides)
Page 44: Scaling up  LDA - 2

Gory details of VW Hadoop-AllReduce

• Spanning-tree server:
  – a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes (“fake” mappers):
  – input for each worker is locally cached
  – workers all connect to the spanning-tree server
  – workers all execute the same code, which might contain AllReduce calls
  – workers synchronize whenever they reach an all-reduce

Page 45: Scaling up  LDA - 2

Hadoop AllReduce

(Figure annotation: don’t wait for duplicate jobs.)

Page 46: Scaling up  LDA - 2

Second-order method - like Newton’s method

Page 47: Scaling up  LDA - 2

2^24 features, ~100 non-zeros/example, 2.3B examples.

An example is user/page/ad and conjunctions of these, positive if there was a click-through on the ad.

Page 48: Scaling up  LDA - 2

50M examples; explicitly constructed kernel, 11.7M features; 3,300 non-zeros/example.

Old method: SVM, 3 days (reporting the time to get to a fixed test error).

Page 49: Scaling up  LDA - 2