Scaling up LDA (Monday's lecture). What if you try and parallelize? Split document/term matrix randomly and distribute to p processors, then run "Approximate Distributed LDA".

DESCRIPTION

AllReduce

TRANSCRIPT

Scaling up LDA (Monday's lecture)
What if you try and parallelize? Split the document/term matrix randomly and distribute it to p processors, then run "Approximate Distributed LDA". A common subtask in parallel versions of LDA, SGD, and similar algorithms: AllReduce.

Introduction
Common pattern:
- do some learning in parallel
- aggregate local changes from each processor to shared parameters
- distribute the new shared parameters back to each processor
- and repeat.
In MapReduce this aggregate-and-redistribute step is "some sort of copy"; with AllReduce it becomes a map followed by an all-reduce. AllReduce is implemented in MPI and, more recently, in the VW code (John Langford) in a Hadoop-compatible scheme.

Gory details of VW Hadoop-AllReduce
- Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server.
- Worker nodes ("fake mappers"): input for each worker is locally cached; workers all connect to the spanning-tree server.
- Workers all execute the same code, which might contain AllReduce calls; workers synchronize whenever they reach an all-reduce.

Hadoop AllReduce
- Don't wait for duplicate jobs.
- Second-order method, like Newton's method.
- Click-through prediction: 2^24 features, ~100 non-zeros per example, 2.3B examples; an example is a user/page/ad (and conjunctions of these), labeled positive if there was a click-through on the ad.
- A second task: 50M examples with an explicitly constructed kernel, 11.7M features, ~3,300 non-zeros per example; the old method (an SVM) took 3 days.
- Reported times are the time to reach a fixed test error.

On-line LDA
Pilfered from NIPS 2010: "Online Learning for LDA", Matthew Hoffman, Francis Bach & Blei.

Monday's lecture recap
Compute expectations over the z's any way you want.

Technical Details
- Variational distribution: q(z_d) over the document's whole topic-assignment vector, not a factored q(z_{d,i})!
- Approximate it using Gibbs: after sampling for a while, estimate the expectations from the samples.
- Evaluate using time and topic coherence, where D(w) = # docs containing word w.

Summary of LDA speedup tricks
- Gibbs sampler: O(N*K*T), and K grows with N. You need to keep the corpus (and the z's) in memory.
- You can parallelize: each processor only needs a slice of the corpus, but you need to synchronize K multinomials over the vocabulary. AllReduce would help?
- You can sparsify the sampling and the topic counts: Mimno's trick greatly reduces memory.
- You can do the computation on-line: you only need to keep K multinomials and one document's worth of corpus and z's in memory.
- You can combine some of these methods: online sparsified LDA; parallel online sparsified LDA?
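
A minimal sketch of the "learn in parallel / aggregate / redistribute / repeat" pattern from the Introduction above, simulated in a single Python process. The number of workers, the toy least-squares objective, and the learning rate are illustrative assumptions, not anything from the slides; the point is only where the aggregate-and-redistribute step sits in the loop.

```python
# Sketch: the learn -> allreduce -> repeat pattern, simulated with numpy.
import numpy as np

rng = np.random.default_rng(0)
p = 4                                  # number of (simulated) processors
d = 10                                 # feature dimension
w = np.zeros(d)                        # shared parameters (one copy per worker)

# Split a toy data matrix randomly across the p workers (assumed data).
X = rng.normal(size=(400, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=400)
shards = np.array_split(rng.permutation(400), p)

def local_gradient(w, idx):
    """Each worker computes a gradient on its locally cached shard."""
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

for step in range(100):
    # 1. Do some learning in parallel (here: one gradient per worker).
    local = [local_gradient(w, idx) for idx in shards]
    # 2. AllReduce: aggregate the local changes into the shared parameters ...
    g = np.sum(local, axis=0) / p
    # 3. ... and every worker receives the same aggregate, so all copies
    #    of w stay synchronized before the next round.
    w -= 0.1 * g
```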
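The spanning-tree AllReduce in "Gory details of VW Hadoop-AllReduce" can be pictured as a reduce up a tree of compute nodes followed by a broadcast back down. The sketch below simulates that with an implicit binary tree over node ids; the real VW/Hadoop scheme has workers connect over sockets to a separate spanning-tree server, which is not modeled here.

```python
# Sketch: sum-AllReduce over an implicit binary spanning tree of nodes.
import numpy as np

def tree_allreduce(values):
    """Sum a list of per-node vectors; every node gets the same total."""
    n = len(values)
    partial = [v.copy() for v in values]

    # Reduce phase: leaves push partial sums toward the root (node 0).
    for node in reversed(range(n)):
        for child in (2 * node + 1, 2 * node + 2):
            if child < n:
                partial[node] += partial[child]

    # Broadcast phase: the root's total flows back down the same tree.
    total = partial[0]
    return [total.copy() for _ in range(n)]

# Every "worker" runs the same code and blocks at the all-reduce call;
# afterwards each holds the identical aggregated vector.
locals_ = [np.array([i, 10.0 * i]) for i in range(5)]
print(tree_allreduce(locals_))   # five copies of [10., 100.]
```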
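A simplified sketch of the online LDA update from Hoffman, Bach & Blei (NIPS 2010) cited above, with the per-document E-step approximated by a short Gibbs run in the spirit of "compute expectations over the z's any way you want." The hyperparameters, the assumed corpus size D_total, and the crude point-estimate E-step are illustrative assumptions, not the authors' implementation.

```python
# Sketch: online variational update of the topic-word parameters lambda,
# processing one document at a time.
import numpy as np

rng = np.random.default_rng(0)
K, V, D_total = 5, 50, 10_000        # topics, vocab size, assumed corpus size
alpha, eta = 0.1, 0.01               # Dirichlet priors (assumed values)
tau0, kappa = 1.0, 0.7               # step-size schedule rho_t = (tau0 + t)^-kappa
lam = rng.gamma(100.0, 0.01, size=(K, V))   # topic-word variational parameters

def e_step_gibbs(doc, lam, sweeps=20, burn=10):
    """Estimate expected topic-word counts for one doc by Gibbs-sampling z."""
    phi = lam / lam.sum(axis=1, keepdims=True)   # crude point estimate of topics
    z = rng.integers(K, size=len(doc))
    nk = np.bincount(z, minlength=K).astype(float)
    counts = np.zeros((K, V))
    for s in range(sweeps):
        for i, w in enumerate(doc):
            nk[z[i]] -= 1
            p = (nk + alpha) * phi[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
            nk[z[i]] += 1
        if s >= burn:                            # average post-burn-in samples
            for i, w in enumerate(doc):
                counts[z[i], w] += 1
    return counts / (sweeps - burn)

for t in range(100):                             # stream of (fake) documents
    doc = rng.integers(V, size=40)               # word ids of one document
    ss = e_step_gibbs(doc, lam)                  # expected counts E[n_kw]
    lam_hat = eta + D_total * ss                 # lambda if the whole corpus looked like doc
    rho = (tau0 + t) ** (-kappa)                 # decaying step size
    lam = (1 - rho) * lam + rho * lam_hat        # online blend of old and new
```

Only lam (K multinomials over the vocabulary) and one document's statistics are held at a time, which is the memory point made in the summary above.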
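"Technical Details" mentions evaluating with time and topic coherence, with D(w) = # docs containing word w. Below is a small sketch of a document co-occurrence coherence score in that style; the exact pairwise log-ratio form follows Mimno et al.'s coherence metric, which I am assuming is the intended measure.

```python
# Sketch: topic coherence from document co-occurrence counts.
from itertools import combinations
from math import log

def coherence(top_words, docs):
    """top_words: a topic's top-M word ids, highest-ranked first.
    docs: list of sets of word ids (one set per document)."""
    # D(w): number of documents containing word w.
    D = {w: sum(1 for d in docs if w in d) for w in top_words}
    # D(w1, w2): number of documents containing both words.
    D2 = {(w1, w2): sum(1 for d in docs if w1 in d and w2 in d)
          for w1, w2 in combinations(top_words, 2)}
    # Sum log((D(w_m, w_l) + 1) / D(w_l)) over pairs, w_l ranked above w_m.
    return sum(log((D2[(w1, w2)] + 1) / D[w1])
               for w1, w2 in combinations(top_words, 2) if D[w1] > 0)

docs = [{0, 1, 2}, {0, 1, 3}, {2, 3, 4}, {0, 4}]
print(coherence([0, 1, 2], docs))
```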