r user-group-2011-09
DESCRIPTION
Talk given on September 21 to the Bay Area R User Group. The talk walks a stochastic projection SVD algorithm through the steps from an initial implementation in R to a proposed implementation using map-reduce that integrates cleanly with R via NFS export of the distributed file system. Not surprisingly, this algorithm is essentially the same as the one used by Mahout.
TRANSCRIPT
9/22/2011 © MapR Confidential 1
RHadoop
and MapR
The bad old days (i.e. now)
• Hadoop is a silo
• HDFS isn’t a normal file system
• Hadoop doesn’t really like C++
• R is limited
• One machine, one memory space
• Isn’t there any way we can just get along?
The white knight
• MapR changes things
• Lots of new stuff like snapshots, NFS
• All you need to know, you already know
• NFS provides cluster wide file access
• Everything works the way you expect
• Performance high enough to use as a message bus
Example: out-of-core SVD
• SVD provides compressed matrix form
• Based on sum of rank-1 matrices
A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + \epsilon
[Figure: a matrix pictured as an approximate sum of rank-1 outer products plus a remainder]
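The rank-1 decomposition on this slide can be checked numerically. A small sketch in NumPy (the talk's own code is R; this example and its variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Full SVD: A = U diag(s) V'
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Summing the rank-1 terms sigma_i * u_i v_i' recovers A exactly.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_sum))  # True
```

Truncating the sum after a few terms gives the compressed form the slide alludes to, with the dropped terms playing the role of \epsilon.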
More on SVD
• SVD provides a very nice basis
Ax = A \sum_i \alpha_i v_i = \left[ \sum_j \sigma_j u_j v_j' \right] \sum_i \alpha_i v_i = \sum_i \alpha_i \sigma_i u_i
• And a nifty approximation property
Ax = \sigma_1 \alpha_1 u_1 + \sigma_2 \alpha_2 u_2 + \sum_{i>2} \sigma_i \alpha_i u_i
\left\| e \right\|^2 \le \sum_{i>2} \sigma_i^2
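The approximation property is exact in the Frobenius norm: the squared error of a rank-k truncation equals the sum of the squared discarded singular values. A small NumPy check (illustrative sketch, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # rank-k truncation
err2 = np.linalg.norm(A - A_k, 'fro') ** 2  # squared approximation error
print(np.isclose(err2, np.sum(s[k:] ** 2)))  # True
```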
Also known as …
• Latent Semantic Indexing
• PCA
• Eigenvectors
An application, approximate translation
• Translation distributes over concatenation
• But counting turns concatenation into addition
• This means that translation is linear!
T(s1 | s2 ) =T(s1) | T(s2 )
k(s1 | s2 ) = k(s1) + k(s2 )
k(T(s1 | s2 )) = k(T(s1)) + k(T(s2 ))
ish
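The "counting turns concatenation into addition" step can be demonstrated with a toy feature map. Here character counts stand in for whatever features the talk has in mind (a Python sketch; `k` mirrors the slide's notation and is otherwise an assumption):

```python
from collections import Counter

def k(s):
    # Hypothetical feature map: per-character counts of a string.
    return Counter(s)

s1, s2 = "hello ", "world"
# Counting distributes over concatenation: k(s1 | s2) = k(s1) + k(s2).
print(k(s1 + s2) == k(s1) + k(s2))  # True
```

This is what makes translation (approximately) linear in count space, as the previous equations claim.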
Traditional computation
• Products with A are dominated by the largest singular values and their corresponding vectors
• Subtracting these dominant singular values allows the next ones to appear
• Lanczos method, generally Krylov sub-space
A \left( A'A \right)^n = U S^{2n+1} V'
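The power-iteration identity above can be verified directly: multiplying by (A'A)^n keeps the singular vectors and raises the singular values to the 2n+1 power, which is why the dominant ones take over. A NumPy check (illustrative, not the talk's code):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

n = 2
M = A @ np.linalg.matrix_power(A.T @ A, n)  # A (A'A)^n
# Same singular vectors as A, singular values raised to 2n+1.
print(np.allclose(M, (U * s ** (2 * n + 1)) @ Vt))  # True
```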
But …
The gotcha
• Iteration in Hadoop is death
• Huge process invocation costs
• Lose all memory residency of data
• Total lost cause
Randomness to the rescue
• To save the day, run all iterations at the same time
Y = A W
Q R = Y
B = Q' A
U S V' = B
\left( Q U \right) S V' \approx A
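The five lines on this slide translate almost one-for-one into code. A NumPy sketch (the talk's implementation is the R function on the next slide; `rsvd`, the shapes, and the oversampling value here are assumptions for illustration):

```python
import numpy as np

def rsvd(A, k, p=5, rng=None):
    # Stochastic projection SVD: all "iterations" collapse into
    # one multiply by a random matrix W.
    rng = rng or np.random.default_rng(0)
    m, n = A.shape
    W = rng.standard_normal((n, k + p))  # random projection, oversampled by p
    Y = A @ W                            # Y = A W
    Q, _ = np.linalg.qr(Y)               # Q R = Y
    B = Q.T @ A                          # B = Q' A  (small: (k+p) x n)
    U, s, Vt = np.linalg.svd(B, full_matrices=False)  # U S V' = B
    return Q @ U[:, :k], s[:k], Vt[:k]   # (Q U) S V' ~ A

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 40))  # rank 8
U, s, Vt = rsvd(A, k=8, rng=rng)
print(np.allclose((U * s) @ Vt, A))  # True: exact on a rank-8 matrix
```

On a matrix whose rank does not exceed k, the random projection captures the whole range with probability 1, so the reconstruction is exact up to floating point.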
In R
lsa = function(a, k, p) {
  n = dim(a)[1]
  m = dim(a)[2]
  y = a %*% matrix(rnorm(m * (k + p)), nrow = m)
  y.qr = qr(y)
  b = t(qr.Q(y.qr)) %*% a
  b.qr = qr(t(b))
  svd = svd(t(qr.R(b.qr)))
  list(u = qr.Q(y.qr) %*% svd$u[, 1:k],
       d = svd$d[1:k],
       v = qr.Q(b.qr) %*% svd$v[, 1:k])
}
Not good enough yet
• Limited to memory size
• After memory limits, feature extraction dominates
Hybrid architecture
[Diagram: input and side-data are joined, then feature extraction and down-sampling run as map-reduce; the result is handed to a sequential SVD via NFS.]
Hybrid architecture
[Diagram: the same pipeline, with the sequential SVD additionally feeding visualization in R, again via NFS.]
Randomness to the rescue
• To save the day again, use blocks
Y_i = A_i W
R'R = Y'Y = \sum_i Y_i' Y_i
B = \sum_i \left( A_i W R^{-1} \right)' A_i
L L' = B B'
U S V' = L
\left( A W R^{-1} U \right) S \left( V' L^{-1} B \right) \approx A
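The point of the block-wise form is that each line is a sum over row blocks A_i, which map-reduce can compute in one pass without ever materializing Q or Y. A NumPy sketch of the algebra (illustrative only; the block count, shapes, and W are assumptions, and the talk's target is map-reduce, not NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((60, 30))
k, p = 6, 4
W = rng.standard_normal((30, k + p))

# Pass 1 over row blocks A_i: accumulate Y'Y = sum_i Y_i'Y_i, then R'R = Y'Y.
blocks = np.array_split(A, 3, axis=0)
YtY = sum((Ai @ W).T @ (Ai @ W) for Ai in blocks)
R = np.linalg.cholesky(YtY).T            # upper-triangular R with R'R = Y'Y

# Pass 2: B = sum_i (A_i W R^-1)' A_i  (equals Q'A without forming Q)
Rinv = np.linalg.inv(R)
B = sum((Ai @ W @ Rinv).T @ Ai for Ai in blocks)

L = np.linalg.cholesky(B @ B.T)          # L L' = B B'
U, s, Vt = np.linalg.svd(L)              # U S V' = L (small dense SVD)
left = (A @ W @ Rinv) @ U                # A W R^-1 U
right = Vt @ np.linalg.inv(L) @ B        # V' L^-1 B

# Algebraic check: the assembled factors equal Q B = Q Q'A,
# the projection of A onto the sampled subspace.
Q = A @ W @ Rinv
print(np.allclose((left * s) @ right, Q @ (Q.T @ A)))  # True
```

Only the small Gram matrices Y'Y, B, and BB' ever need to be collected centrally; everything touching the big matrix A streams block by block.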
Hybrid architecture
[Diagram: feature extraction and down-sampling and a block-wise parallel SVD both run as map-reduce, with results reaching R for visualization via NFS.]
Conclusions
• Inter-operability allows massive scalability
• Prototyping in R is not wasted
• Map-reduce iteration is not needed for SVD
• Feasible scale is ~10^9 non-zeros or more