r user-group-2011-09
DESCRIPTION
Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.
TRANSCRIPT
9/22/2011 © MapR Confidential 1
RHadoop
and MapR
The bad old days (i.e. now)
• Hadoop is a silo
• HDFS isn’t a normal file system
• Hadoop doesn’t really like C++
• R is limited
• One machine, one memory space
• Isn’t there any way we can just get along?
The white knight
• MapR changes things
• Lots of new stuff like snapshots, NFS
• All you need to know, you already know
• NFS provides cluster wide file access
• Everything works the way you expect
• Performance high enough to use as a message bus
Example: out-of-core SVD
• SVD provides compressed matrix form
• Based on sum of rank-1 matrices
A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + \epsilon
[Figure: a matrix pictured as an approximate sum of rank-1 outer products plus a remainder]
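The rank-1 decomposition on this slide can be checked numerically. A small sketch in NumPy (the talk's own code is R; this example and its variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Full SVD: A = U diag(s) V'
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Summing the rank-1 terms sigma_i * u_i v_i' recovers A exactly.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_sum))  # True
```

Truncating the sum after a few terms gives the compressed form the slide alludes to, with the dropped terms playing the role of \epsilon.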
More on SVD
• SVD provides a very nice basis
Ax = A \sum_i \alpha_i v_i = \left[ \sum_j \sigma_j u_j v_j' \right] \sum_i \alpha_i v_i = \sum_i \alpha_i \sigma_i u_i
• And a nifty approximation property
Ax = \sigma_1 \alpha_1 u_1 + \sigma_2 \alpha_2 u_2 + \sum_{i>2} \sigma_i \alpha_i u_i
\left\| e \right\|^2 \le \sum_{i>2} \sigma_i^2
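The approximation property is exact in the Frobenius norm: the squared error of a rank-k truncation equals the sum of the squared discarded singular values. A small NumPy check (illustrative sketch, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # rank-k truncation
err2 = np.linalg.norm(A - A_k, 'fro') ** 2  # squared approximation error
print(np.isclose(err2, np.sum(s[k:] ** 2)))  # True
```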
Also known as …
• Latent Semantic Indexing
• PCA
• Eigenvectors
An application, approximate translation
• Translation distributes over concatenation
• But counting turns concatenation into addition
• This means that translation is linear!
T(s1 | s2 ) =T(s1) | T(s2 )
k(s1 | s2 ) = k(s1) + k(s2 )
k(T(s1 | s2 )) = k(T(s1)) + k(T(s2 ))
ish
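The "counting turns concatenation into addition" step can be demonstrated with a toy feature map. Here character counts stand in for whatever features the talk has in mind (a Python sketch; `k` mirrors the slide's notation and is otherwise an assumption):

```python
from collections import Counter

def k(s):
    # Hypothetical feature map: per-character counts of a string.
    return Counter(s)

s1, s2 = "hello ", "world"
# Counting distributes over concatenation: k(s1 | s2) = k(s1) + k(s2).
print(k(s1 + s2) == k(s1) + k(s2))  # True
```

This is what makes translation (approximately) linear in count space, as the previous equations claim.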
Traditional computation
• Products with A are dominated by the largest singular values and their corresponding vectors
• Subtracting these dominant singular values allows the next ones to appear
• Lanczos method, generally Krylov sub-space
A \left( A'A \right)^n = U S^{2n+1} V'
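The power-iteration identity above can be verified directly: multiplying by (A'A)^n keeps the singular vectors and raises the singular values to the 2n+1 power, which is why the dominant ones take over. A NumPy check (illustrative, not the talk's code):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

n = 2
M = A @ np.linalg.matrix_power(A.T @ A, n)  # A (A'A)^n
# Same singular vectors as A, singular values raised to 2n+1.
print(np.allclose(M, (U * s ** (2 * n + 1)) @ Vt))  # True
```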
But …
The gotcha
• Iteration in Hadoop is death
• Huge process invocation costs
• Lose all memory residency of data
• Total lost cause
Randomness to the rescue
• To save the day, run all iterations at the same time
Y = A W
Q R = Y
B = Q' A
U S V' = B
\left( Q U \right) S V' \approx A
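The five lines on this slide translate almost one-for-one into code. A NumPy sketch (the talk's implementation is the R function on the next slide; `rsvd`, the shapes, and the oversampling value here are assumptions for illustration):

```python
import numpy as np

def rsvd(A, k, p=5, rng=None):
    # Stochastic projection SVD: all "iterations" collapse into
    # one multiply by a random matrix W.
    rng = rng or np.random.default_rng(0)
    m, n = A.shape
    W = rng.standard_normal((n, k + p))  # random projection, oversampled by p
    Y = A @ W                            # Y = A W
    Q, _ = np.linalg.qr(Y)               # Q R = Y
    B = Q.T @ A                          # B = Q' A  (small: (k+p) x n)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)  # U S V' = B
    return Q @ U[:, :k], s[:k], Vt[:k]   # (Q U) S V' ~ A

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 40))  # rank 8
U, s, Vt = rsvd(A, k=8, rng=rng)
print(np.allclose((U * s) @ Vt, A))  # True: exact on a rank-8 matrix
```

On a matrix whose rank does not exceed k, the random projection captures the whole range with probability 1, so the reconstruction is exact up to floating point.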
In R
lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  svd = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
       d = svd$d[1:k],
       v = qr.Q(b.qr) %*% svd$v[, 1:k])
}
Not good enough yet
• Limited to memory size
• After memory limits, feature extraction dominates
Hybrid architecture
[Diagram: input and side-data are joined, then feature extraction and down-sampling run as map-reduce; the result is handed to a sequential SVD via NFS.]
Hybrid architecture
[Diagram: the same pipeline, with the sequential SVD additionally feeding visualization in R, again via NFS.]
Randomness to the rescue
• To save the day again, use blocks
Y_i = A_i W
R'R = Y'Y = \sum_i Y_i' Y_i
B = \sum_i \left( A_i W R^{-1} \right)' A_i
L L' = B B'
U S V' = L
\left( A W R^{-1} U \right) S \left( V' L^{-1} B \right) \approx A
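The point of the block-wise form is that each line is a sum over row blocks A_i, which map-reduce can compute in one pass without ever materializing Q or Y. A NumPy sketch of the algebra (illustrative only; the block count, shapes, and W are assumptions, and the talk's target is map-reduce, not NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((60, 30))
k, p = 6, 4
W = rng.standard_normal((30, k + p))

# Pass 1 over row blocks A_i: accumulate Y'Y = sum_i Y_i'Y_i, then R'R = Y'Y.
blocks = np.array_split(A, 3, axis=0)
YtY = sum((Ai @ W).T @ (Ai @ W) for Ai in blocks)
R = np.linalg.cholesky(YtY).T            # upper-triangular R with R'R = Y'Y

# Pass 2: B = sum_i (A_i W R^-1)' A_i  (equals Q'A without forming Q)
Rinv = np.linalg.inv(R)
B = sum((Ai @ W @ Rinv).T @ Ai for Ai in blocks)

L = np.linalg.cholesky(B @ B.T)          # L L' = B B'
U, s, Vt = np.linalg.svd(L)              # U S V' = L (small dense SVD)
left = (A @ W @ Rinv) @ U                # A W R^-1 U
right = Vt @ np.linalg.inv(L) @ B        # V' L^-1 B

# Algebraic check: the assembled factors equal Q B = Q Q'A,
# the projection of A onto the sampled subspace.
Q = A @ W @ Rinv
print(np.allclose((left * s) @ right, Q @ (Q.T @ A)))  # True
```

Only the small Gram matrices Y'Y, B, and BB' ever need to be collected centrally; everything touching the big matrix A streams block by block.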
Hybrid architecture
[Diagram: feature extraction and down-sampling and a block-wise parallel SVD both run as map-reduce, with results reaching R for visualization via NFS.]
Conclusions
• Inter-operability allows massive scalability
• Prototyping in R is not wasted
• Map-reduce iteration is not needed for SVD
• Feasible scale is ~10^9 non-zeros or more