Scaling up LDA - 2
Fast sampling for LDA
William Cohen
MORE LDA SPEEDUPS
FIRST - RECAP LDA DETAILS
Called “collapsed Gibbs sampling” since you’ve marginalized away some variables
From: Parameter estimation for text analysis - Gregor Heinrich
prob this word/term is assigned to topic k
prob this doc contains topic k
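The update above can be sketched in a few lines. This is an illustrative sketch, not the slides' code: the names `n_wk` (word-topic counts), `n_k` (topic totals), and `n_kd` (doc-topic counts) are assumed layouts.

```python
def gibbs_probs(w, d, n_wk, n_k, n_kd, alpha, beta, V):
    """Unnormalized P(z=k | ...) for word w in doc d: the word factor
    (n_wk[w][k] + beta) / (n_k[k] + V*beta) times the doc factor
    (n_kd[d][k] + alpha)."""
    K = len(n_k)
    return [(n_wk[w][k] + beta) / (n_k[k] + V * beta) * (n_kd[d][k] + alpha)
            for k in range(K)]
```

In a real collapsed sampler the current word's own assignment is subtracted from the counts before computing these probabilities.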
More detail
[Figure: sampling z with a uniform random draw against a unit-height stack of line segments, one per outcome z=1, z=2, z=3, …]
SPEEDUP 1 - SPARSITY
KDD 2008
[Figure repeated: unit-height stack of topic segments with a uniform random draw]
Running total of P(z=k|…) or P(z<=k)
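The running-total picture is the standard inverse-CDF draw; a minimal sketch (function name is illustrative):

```python
import random

def sample_linear(probs, u=None):
    """Draw z by scanning the running total of P(z=k): O(K) per sample."""
    Z = sum(probs)                      # normalizer
    u = (random.random() if u is None else u) * Z
    total = 0.0
    for k, p in enumerate(probs):
        total += p                      # running total P(z <= k)
        if u <= total:
            return k
    return len(probs) - 1               # guard against float round-off
```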
Discussion….
• Where do you spend your time?
– sampling the z's
– each sampling step involves a loop over all topics
– this seems wasteful
• Even with many topics, words are often only assigned to a few different topics
– low-frequency words appear < K times … and there are lots and lots of them!
– even frequent words are not in every topic
Discussion….
• What's the solution? Idea: come up with approximations Zi to the normalizer Z at each stage - then you might be able to stop early…
• Computationally this behaves like a sparser vector. We want Zi >= Z.
Tricks
• How do you compute and maintain the bound?
– see the paper
• What order do you go in?
– want to pick large P(k)'s first
– … so we want large P(k|d) and P(k|w)
– … so we maintain the k's in sorted order
• which only change a little bit after each flip, so a bubble-sort will fix up the almost-sorted array
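The bubble-sort fix-up can be sketched as bubbling just the changed entry (names are illustrative): `ks` holds topic ids kept sorted by `vals` in descending order, and only one entry moved.

```python
def fix_position(ks, vals, i):
    """ks[i]'s value just changed; bubble it left or right so ks stays
    sorted by vals (descending). Cost is O(distance moved), not O(K)."""
    while i > 0 and vals[ks[i]] > vals[ks[i - 1]]:
        ks[i], ks[i - 1] = ks[i - 1], ks[i]
        i -= 1
    while i < len(ks) - 1 and vals[ks[i]] < vals[ks[i + 1]]:
        ks[i], ks[i + 1] = ks[i + 1], ks[i]
        i += 1
    return ks
```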
Results
SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY
KDD 09
z = s + r + q
t = topic (k), w = word, d = doc
z = s + r + q
• If U < s:
– look up U on the line segment with tick-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s < U < s+r:
– look up U on the line segment for r; only need to check t such that n_t|d > 0
t = topic (k), w = word, d = doc
z = s + r + q
• If U < s:
– look up U on the line segment with tick-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s < U < s+r:
– look up U on the line segment for r
• If s+r < U:
– look up U on the line segment for q; only need to check t such that n_w|t > 0
z = s + r + q
• q: only need to check t such that n_w|t > 0
• r: only need to check t such that n_t|d > 0
• s: only needs to be checked occasionally (< 10% of the time)
z = s + r + q
Need to store n_w|t for each (word, topic) pair …???
Only need to store n_t|d for the current d
Only need to store (and maintain) total words per topic, plus the α's, β, and V
Trick: count up n_t|d for d when you start working on d, and update it incrementally
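A sketch of the three masses, using hypothetical sparse count maps (`n_kd[d]` and `n_wk[w]` as dicts from topic to count); the identity s + r + q = total mass follows from expanding (α + n_t|d)(β + n_w|t):

```python
def sparse_masses(w, d, n_wk, n_k, n_kd, alpha, beta, V):
    """Split of the total sampling mass into z = s + r + q:
    s: smoothing-only mass, over all topics (changes slowly),
    r: document mass, only over topics with n_t|d > 0,
    q: word mass, only over topics with n_w|t > 0."""
    s = sum(alpha * beta / (beta * V + n_k[k]) for k in range(len(n_k)))
    r = sum(n * beta / (beta * V + n_k[k]) for k, n in n_kd[d].items())
    q = sum((alpha + n_kd[d].get(k, 0)) * c / (beta * V + n_k[k])
            for k, c in n_wk[w].items())
    return s, r, q
```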
z = s + r + q
Need to store n_w|t for each (word, topic) pair …???
1. Precompute, for each t, …
Most (>90%) of the time and space is here…
2. Quickly find the t's such that n_w|t is large for w
Need to store n_w|t for each (word, topic) pair …???
1. Precompute, for each t, …
Most (>90%) of the time and space is here…
2. Quickly find the t's such that n_w|t is large for w
• map w to an int array
– no larger than the frequency of w
– no larger than #topics
• encode (t,n) as a bit vector
– n in the high-order bits
– t in the low-order bits
• keep the ints sorted in descending order
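The (t,n) encoding can be sketched as plain bit-packing; `topic_bits=10` is an assumption (it supports up to 1024 topics), not a value from the slides.

```python
def encode(t, n, topic_bits=10):
    """Pack (topic, count) into one int, count in the high-order bits and
    topic in the low-order bits, so plain integer order sorts by count."""
    return (n << topic_bits) | t

def decode(x, topic_bits=10):
    """Unpack an encoded int back to (topic, count)."""
    return x & ((1 << topic_bits) - 1), x >> topic_bits
```

Sorting the encoded ints in descending order then visits the largest n_w|t first, which is exactly the traversal step 2 needs.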
Outline
• LDA/Gibbs algorithm details
• How to speed it up by parallelizing
• How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
• The Mimno/McCallum decomposition (SparseLDA)
• Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
Basic problem: how can we sample from a biased coin quickly?
If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and look it up in a binary tree over the running totals, e.g. r in (23/40, 7/10]. This turns the O(K) linear scan into an O(log2 K) lookup.
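The binary-tree lookup can be sketched with a binary search over the running totals (the function names are illustrative):

```python
import bisect
from itertools import accumulate

def build_cdf(probs):
    """Running totals P(z <= k), built once per distribution."""
    return list(accumulate(probs))

def sample_bisect(cdf, u):
    """Binary search of r = u * Z in the running totals:
    O(log2 K) per draw instead of an O(K) linear scan."""
    return bisect.bisect_left(cdf, u * cdf[-1])
```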
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
Another idea…
Simulate the dart with two drawn values: x = int(u1*K) picks a column and y = u2*pmax picks a height; keep throwing until you hit a stripe.
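This rejection-sampling dart can be sketched as follows (a minimal sketch; the two uniforms u1, u2 are assumed independent):

```python
import random

def dart_sample(probs, rng=random.random):
    """Throw darts at a K-column box of height pmax; accept when the
    dart lands under the stripe for its column (rejection sampling)."""
    K = len(probs)
    pmax = max(probs)
    while True:
        u1, u2 = rng(), rng()   # two independent uniforms
        x = int(u1 * K)         # which column the dart hits
        y = u2 * pmax           # how high the dart lands
        if y < probs[x]:        # under the stripe: accept
            return x
```

The expected number of throws is K*pmax / Z, so this is fast only when no stripe towers over the others.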
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit.
You can always do this using only two colors in each column of the final alias table and the dart never misses!
mathematically speaking…
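One standard way to realize this cut-and-paste construction is Vose's variant of the alias method (a sketch under that assumption, not the exact code from the linked page):

```python
def build_alias(probs):
    """Vose's alias method: after scaling to average height 1, each column
    keeps at most two 'colors' (itself plus one alias) and darts never miss."""
    K = len(probs)
    scaled = [p * K / sum(probs) for p in probs]
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l       # paste part of l into s's column
        scaled[l] -= 1.0 - scaled[s]           # l donated that much mass
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                    # leftovers fill their column
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, u1, u2):
    """O(1) per draw: pick a column, then one of its two colors."""
    i = int(u1 * len(prob))
    return i if u2 < prob[i] else alias[i]
```

Construction is O(K), and every subsequent draw costs two uniforms and one comparison.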
KDD 2014
Key ideas
• use a variant of the Mimno/McCallum decomposition
• use alias tables to sample from the dense parts
• since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs
• q is a stale, easy-to-draw-from distribution
• p is the updated distribution
• computing the ratios p(i)/q(i) is cheap
• usually the ratio is close to one - else the dart missed
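A single Metropolis-Hastings step against the stale proposal can be sketched as (names are illustrative; `draw_q` would be the alias-table draw):

```python
import random

def mh_step(z_old, p, q, draw_q, rng=random.random):
    """One Metropolis-Hastings step: propose z_new from the stale
    distribution q, accept with min(1, p[new]*q[old] / (p[old]*q[new]))."""
    z_new = draw_q()
    accept = min(1.0, p[z_new] * q[z_old] / (p[z_old] * q[z_new]))
    return z_new if rng() < accept else z_old
```

When q has not drifted far from p the acceptance ratio is close to one, so staleness costs almost nothing.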
SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION
What if you try to parallelize?
Split the document/term matrix randomly and distribute it to p processors … then run "Approximate Distributed LDA"
Common subtask in parallel versions of: LDA, SGD, ….
Introduction
• Common pattern:
– do some learning in parallel
– aggregate local changes from each processor to shared parameters
– distribute the new shared parameters back to each processor
– and repeat….
• AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme
MAP
ALLREDUCE
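The aggregate-then-redistribute cycle above can be sketched as a toy, in-process AllReduce (a hypothetical helper, not the MPI or VW implementation):

```python
def allreduce_sum(local_params):
    """Toy AllReduce: sum each worker's local parameters elementwise,
    then give every worker its own copy of the aggregated result."""
    total = [sum(col) for col in zip(*local_params)]
    return [list(total) for _ in local_params]
```

In a real system the reduction and broadcast run over a tree of nodes, so no single node touches all p parameter vectors.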
Gory details of VW Hadoop-AllReduce
• Spanning-tree server:
– a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes ("fake" mappers):
– input for each worker is locally cached
– workers all connect to the spanning-tree server
– workers all execute the same code, which might contain AllReduce calls:
• workers synchronize whenever they reach an all-reduce
![Page 45: Scaling up LDA - 2](https://reader035.vdocuments.us/reader035/viewer/2022070404/56813ba9550346895da4db17/html5/thumbnails/45.jpg)
Hadoop AllReduce
don’t wait for duplicate jobs
![Page 46: Scaling up LDA - 2](https://reader035.vdocuments.us/reader035/viewer/2022070404/56813ba9550346895da4db17/html5/thumbnails/46.jpg)
Second-order method - like Newton’s method
![Page 47: Scaling up LDA - 2](https://reader035.vdocuments.us/reader035/viewer/2022070404/56813ba9550346895da4db17/html5/thumbnails/47.jpg)
2^24 features
~100 non-zeros/example
2.3B examples
An example is a user/page/ad triple and conjunctions of these, positive if there was a click-through on the ad
![Page 48: Scaling up LDA - 2](https://reader035.vdocuments.us/reader035/viewer/2022070404/56813ba9550346895da4db17/html5/thumbnails/48.jpg)
50M examples
explicitly constructed kernel: 11.7M features
3,300 nonzeros/example
old method: SVM, 3 days (reporting time to get to a fixed test error)
![Page 49: Scaling up LDA - 2](https://reader035.vdocuments.us/reader035/viewer/2022070404/56813ba9550346895da4db17/html5/thumbnails/49.jpg)