What you can do with a tall-and-skinny QR factorization in Hadoop: principal components and large regressions
DESCRIPTION
Some techniques that work with the tall-and-skinny QR factorization of a matrix.
TRANSCRIPT
What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components
Slides bit.ly/16LS8Vk Code github.com/dgleich/mrtsqr
DAVID F. GLEICH, ASSISTANT PROFESSOR, COMPUTER SCIENCE, PURDUE UNIVERSITY
@dgleich [email protected]
bit.ly/16LS8Vk David Gleich · Purdue 1
Why you should stay … you like advanced machine learning techniques; you want to understand how to compute the singular values and vectors of a huge matrix (that's tall and skinny); you want to learn about large-scale regression and principal components from a matrix perspective.
What I’m going to assume you know
MapReduce Python Some simple matrix manipulation
[Figure: a tall-and-skinny matrix A, from the tinyimages collection]
Tall-and-Skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000).
Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, big-data SVD/PCA.
If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
Tall-and-skinny matrices are common in BigData
[Figure: the matrix A split into row blocks A1, A2, A3, A4]
A : m x n, m ≫ n. Key is an arbitrary row-id. Value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.
PCA of 80,000,000 images
[Figure: A is 80,000,000 images by 1000 pixels; the first 16 columns of V shown as images]
Constantine & Gleich, MapReduce 2011.
[Figure 5: the 16 most important principal component basis functions (by rows) and the fraction of variance explained by the top 100 principal components (bottom left) and by all principal components (bottom right).]
4. CONCLUSION
In this manuscript, we have illustrated the ability of MapReduce architectures to solve massive least-squares problems through a tall and skinny QR factorization. We choose to implement these algorithms in a simple Hadoop streaming framework to provide prototype implementations so that others can easily adapt the algorithms to their particular problem. These codes are all available online.¹ We envision that the TSQR paradigm will find a place in block-analogues of the various iterative methods in the Mahout project. These methods are based on block analogues of the Lanczos process, which replace vector normalization steps with QR factorizations. Because the TSQR routine solves linear regression problems, it can also serve as the least-squares subroutine for an iteratively reweighted least-squares algorithm for fitting general linear models.
A key motivation for our MapReduce TSQR implementation comes from a residual minimizing model reduction method [5] for approximating the output of a parameterized differential equation model. Methods for constructing reduced order models typically involve a collection of solutions (dubbed snapshots [16]), each computed at its respective input parameters. Storing and managing the terascale data from these solutions is itself challenging, and the hard disk storage of MapReduce is a natural fit.
¹See http://www.github.com/dgleich/mrtsqr.
5. REFERENCES
[1] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou. QR factorization of tall and skinny matrices in a grid computing environment. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–11, April 2010.
[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, Penn., 1996.
[3] K. Bosteels. Fuzzy techniques in the usage and construction of comparison measures for music objects, 2009.
[4] J. Choi, J. Demmel, I. S. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. PARA, pages 95–106, 1995.
[5] P. G. Constantine and Q. Wang. Residual minimizing model reduction for parameterized nonlinear dynamical systems, arXiv:1012.0351, 2010.
[6] B. Dagnon and B. Hindman. TSQR on EC2 using the Nexus substrate. http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/writeup_dagnon.pdf, 2010. Class project writeup for CS267, University of California Berkeley.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), pages 137–150, 2004.
[8] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. arXiv, 0806.2159, 2008.
[9] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[10] J. G. F. Francis. The QR transformation: a unitary analogue to the LR transformation – part 1. The Computer Journal, 4:265–271, 1961.
[11] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
[12] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, October 1996.
[13] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Rev., 20:740–777, 1978.
[14] J. Langou. Computing the R of the QR factorization of tall and skinny matrix using mpi_reduce. arXiv, math.NA:1002.4250, 2010.
[15] T. E. Oliphant. Guide to NumPy. Provo, UT, March 2006.
[16] L. Sirovich. Turbulence and the dynamics of coherent structures. Part 1: Coherent structures. Quarterly of Applied Mathematics, 45(3):561–571, 1987.
[17] A. Stathopoulos and K. Wu. A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput., 23:2165–2182, June 2001.
[18] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958–1970, November 2008.
[19] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[20] Various. Hadoop version 0.21. http://hadoop.apache.org, 2010.
[21] F. Wang. Implement linear regression. https://issues.apache.org/jira/browse/MAHOUT-529. Mahout-529 JIRA, accessed on February 10, 2011.
[22] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, 1998.
[23] B. White. hadoopy. http://bwhite.github.com/hadoopy.
Acknowledgments. We are exceedingly grateful to Mark Hoemmen for many discussions about the TSQR factorization. We would also like to thank James Demmel for suggesting examining the reference streaming time. Finally, we are happy to acknowledge the fellow MapReduce “computers” at Sandia for general Hadoop help: Craig Ulmer, Todd Plantenga, Justin Basilico, Art Munson, and Tamara G. Kolda.
Regression with 80,000,000 images
The goal was to approximate how much red there was in a picture from the values of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.
Table 3: Results when varying block size. The best performance results are bolded. See §3.3 for details.

Cols.  Blks.  Maps   Iter. 1 Secs.  Iter. 2 Secs.
50     2      8000   424            21
—      3      —      399            19
—      5      —      408            19
—      10     —      401            19
—      20     —      396            20
—      50     —      406            18
—      100    —      380            19
—      200    —      395            19
100    2      7000   410            21
—      3      —      384            21
—      5      —      390            22
—      10     —      372            22
—      20     —      374            22
1000   2      6000   493            199
—      3      —      432            169
—      5      —      422            154
—      10     —      430            202
—      20     —      434            202
3.4 Split size
There are three factors that control the TSQR tree on Hadoop: the number of mappers, the number of reducers, and the number of iterations. In this section, we investigate the trade-off between decreasing the number of mappers, which is done by increasing the minimum split size in HDFS, and using additional iterations. Using additional iterations provides the opportunity to exploit parallelism via a reduction tree. Table 4 shows the total computation time for our C++ code when used with various split sizes and one or two iterations. The block size used was the best performing case from the previous experiment. Each row states the number of columns, the number of iterations used (for two iterations, we used 250 reducers in the first iteration), the split size, and the total computation time. With a split size of 512 MB, each mapper consumes an entire input file of the matrix (recall that the matrices are constructed by 1000 reducers, and hence 1000 files). The two iteration test used 250 reducers in the first iteration and 1 reducer in the second iteration. The one iteration test used 1 reducer in the first iteration, which is required to get the correct final answer. (In the 1000 column test, using a smaller split size of 64 or 256 MB generated too much data from the mappers for a single reducer to handle efficiently.)
The results are different between 50 columns and 1000 columns. With 50 columns, a one iteration approach is faster, and increasing the split size dramatically reduces the computation time. This results from two intertwined behaviors: first, using a larger split size sends less data to the final reducer, making it run faster; and second, using a larger split size reduces the overhead of Hadoop launching additional map tasks. With 1000 columns, the two iteration approach is faster. This happens because each R matrix output by the mappers is 400 times larger than with the 50 column experiment. Consequently, the single reducer takes much longer in the one iteration case. Using an additional iteration allows us to handle this reduction with more parallelism.
Table 4: Results when varying split size. See §3.4.

Cols.  Iters.  Split (MB)  Maps   Secs.
50     1       64          8000   388
—      —       256         2000   184
—      —       512         1000   149
—      2       64          8000   425
—      —       256         2000   220
—      —       512         1000   191
1000   1       512         1000   666
—      2       64          6000   590
—      —       256         2000   432
—      —       512         1000   337
3.5 Tinyimages: regression and PCA
Our final experiment shows this algorithm applied to a real world dataset. The tinyimages collection is a set of almost 80,000,000 images. Each image is 32-by-32 pixels. The image collection is stored in a single file, where each 3072 byte segment consists of the red, green, and blue values for each of the 1024 pixels in the image. We wrote a custom Hadoop InputFormat to read this file directly and transmit the data to our Hadoop streaming programs as a set of bytes. We used the Dumbo python framework for these experiments. In the following two experiments, we translated all the color pixels into shades of gray. Consequently, this dataset represents a 79,302,017-by-1024 matrix.
We first solved a regression problem by trying to predict the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_{i,j} is the gray value of the jth pixel in image i, then we wanted to find min_s Σ_i (r_i − Σ_j G_{i,j} s_j)². There is no particular importance to this regression problem; we use it merely as a demonstration. The coefficients s_j are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of G_{i,j}'s from the regression and let u_i be the mean of the ith row in G. The principal components of the images are given by the right singular vectors of the matrix G − ue^T, where u is the vector of all the mean values and e is the 1024-by-1 vector of ones. That is, let G − ue^T = UΣV^T be the SVD; then the principal components are the columns of V. We compute V by first doing a TSQR of G − ue^T, and then computing an SVD of the final R, which is a small 1024-by-1024 matrix. The principal components are plotted as images in Figure 5. These images show a reasonable basis for images and are reminiscent of the basis in a discrete cosine transform.
[Figure: A, 80,000,000 images by 1000 pixels]
Let’s talk about QR!
QR Factorization and the Gram Schmidt process
Consider a set of vectors v1 to vn. Set u1 to be v1. Create a new vector u2 by removing any “component” of u1 from v2. Create a new vector u3 by removing any “component” of u1 and u2 from v3. …
“Gram-Schmidt process”, from Wikipedia
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
[v1 v2 v3 …] = [u1 u2 u3 …] ×
  ⎡ a1 b1 c1 … ⎤
  ⎢ 0  b2 c2 … ⎥
  ⎢ 0  0  c3 … ⎥
  ⎣ ⋮  ⋮  ⋮  ⋱ ⎦
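The removal-of-components procedure described above can be sketched in a few lines of NumPy (a minimal illustration, not code from the talk; the talk's actual programs use numpy.linalg.qr instead):

```python
import numpy

def gram_schmidt(V):
    """Classical Gram-Schmidt on the columns of V (illustration only)."""
    Q = numpy.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float)
        for i in range(j):
            # remove the component of v_j along the already-built u_i
            u = u - (Q[:, i] @ V[:, j]) * Q[:, i]
        Q[:, j] = u / numpy.linalg.norm(u)
    return Q

V = numpy.random.randn(6, 3)
Q = gram_schmidt(V)
assert numpy.allclose(Q.T @ Q, numpy.eye(3))   # columns at right angles
assert numpy.allclose(Q @ (Q.T @ V), V)        # same span as V
```

The triangular coefficients a1, b1, b2, … above are exactly the inner products this loop computes along the way.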
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
V = UR. All vectors in U are at right angles, i.e. they are decoupled. For this problem, that is what is usually written by others as A = QR.
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
All vectors in U are at right angles, i.e. they are decoupled
[Figure: A = QR, drawn as block matrices]
PCA of 80,000,000 images
[Figure: A (80,000,000 images by 1000 pixels) → zero-mean rows → X → TSQR (MapReduce) → R → SVD (post-processing) → V. The first 16 columns of V are shown as images, along with the top 100 singular values (principal components).]
Constantine & Gleich, MapReduce 2010.
Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
The rest of the talk: Full TSQR code in hadoopy
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
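The correctness claim behind this code, that QR-ing stacked per-block R factors reproduces the R of the full matrix, can be checked locally with plain NumPy (a standalone sanity check, not part of the Hadoop job):

```python
import numpy

A = numpy.random.randn(1000, 5)
blocks = numpy.split(A, 4)                      # four "map tasks"
Rs = [numpy.linalg.qr(b, 'r') for b in blocks]  # map: local R factors
R = numpy.linalg.qr(numpy.vstack(Rs), 'r')      # reduce: QR of stacked Rs
R_direct = numpy.linalg.qr(A, 'r')

def fix_signs(M):
    # QR is unique only up to the sign of each row of R
    return numpy.sign(numpy.diag(M))[:, None] * M

assert numpy.allclose(fix_signs(R), fix_signs(R_direct))
```

This invariance is what lets the mapper call compress() repeatedly without losing information.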
[Figure: the TSQR reduction tree. Mapper 1 (Serial TSQR) reads A1–A4, factors them block by block (A2 qr→ Q2 R2, A3 qr→ Q3 R3, A4 qr→ Q4 R4) and emits R4. Mapper 2 (Serial TSQR) does the same for A5–A8 and emits R8. Reducer 1 (Serial TSQR) factors the stacked R4, R8 (qr→ Q) and emits the final R.]
Algorithm: Data = rows of a matrix. Map = QR factorization of rows. Reduce = QR factorization of rows.
Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)
Too many maps cause too much data to one reducer!
Each image is 5k. Each HDFS block has 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix. One reducer gets a 6.25M-by-1000 matrix (50GB).
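The back-of-the-envelope arithmetic for that overload, using the figures from the slide, is easy to reproduce:

```python
# 6,250 map tasks each emit a 1000-by-1000 R factor of 8-byte doubles,
# and every one of them lands on the single reducer.
maps = 6250
rows_to_reducer = maps * 1000                  # stacked R factors
bytes_to_reducer = rows_to_reducer * 1000 * 8  # 1000 doubles per row
assert rows_to_reducer == 6_250_000            # a 6.25M-by-1000 matrix
assert bytes_to_reducer == 50 * 10**9          # 50 GB
```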
[Figure: the two-iteration TSQR pipeline. Iteration 1: Mappers 1-1 through 1-4 (Serial TSQR) factor the row blocks of A and emit R1–R4; after a shuffle, Reducers 1-1 through 1-3 (Serial TSQR) combine them and emit R2,1, R2,2, R2,3. Iteration 2: an identity map followed by a single Reducer 2-1 (Serial TSQR) produces the final R.]
Hadoop streaming isn't always slow! Synthetic data test on 100,000,000-by-500 matrix (~500GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+ATLAS matrices; custom C++ TypedBytes reader/writer with ATLAS matrices.

          Iter 1 (secs.)  Iter 2 (secs.)  Overall (secs.)
Dumbo     960             217             1177
Hadoopy   612             118             730
C++       350             37              387
Java      436             66              502
Use multiple iterations for problems with many columns. Increasing split size improves performance (accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with 64-MB split size overloaded the single reducer.)

Cols.  Iters.  Split (MB)  Maps   Secs.
50     1       64          8000   388
—      —       256         2000   184
—      —       512         1000   149
—      2       64          8000   425
—      —       256         2000   220
—      —       512         1000   191
1000   1       512         1000   666
—      2       64          6000   590
—      —       256         2000   432
—      —       512         1000   337
More about how to compute a regression
min ‖Ax − b‖² = min Σ_i ( Σ_j A_{ij} x_j − b_i )²
[Figure: Mapper 1 (Serial TSQR) processes the row blocks A1–A4 of A together with the matching pieces of b; each local factorization (e.g. A2 qr→ Q2, R2) also updates the right-hand side, b2 = Q2^T b1.]
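The mapper's trick of carrying b along and replacing it with Q^T b at each compression can be checked locally (a pure-NumPy sketch, no Hadoop, with hypothetical sizes):

```python
import numpy

A = numpy.random.randn(300, 4)
b = numpy.random.randn(300)
Rs, ys = [], []
for Ai, bi in zip(numpy.split(A, 3), numpy.split(b, 3)):
    Qi, Ri = numpy.linalg.qr(Ai)
    Rs.append(Ri)
    ys.append(Qi.T @ bi)                      # b_new = Q^T b_old
Q2, R = numpy.linalg.qr(numpy.vstack(Rs))     # combine the R factors
y = Q2.T @ numpy.concatenate(ys)              # and the transformed b pieces
x = numpy.linalg.solve(R, y)                  # tiny final system
x_ref, *_ = numpy.linalg.lstsq(A, b, rcond=None)
assert numpy.allclose(x, x_ref)
```

The final (R, Q^T b) pair is all the reducer needs; the big matrix never has to be revisited.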
TSQR code in hadoopy for regressions
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        […]

    def compress(self):
        Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
        self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

    def collect(self, key, valuerhs):
        self.data.append(valuerhs[0])
        self.rhs.append(valuerhs[1])
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for i, row in enumerate(self.data):
            key = random.randint(0, 2000000000)
            yield key, (row, self.rhs[i])

    def mapper(self, key, value):
        self.collect(key, unpack(value))

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, unpack(value))

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
More about how to compute a regression
QR for regression:
min ‖Ax − b‖²
  = min ‖QRx − b‖²
  = min ‖Q^T QRx − Q^T b‖²
  = min ‖Rx − Q^T b‖²
Orthogonal or “right angle” matrices don't change vector magnitude. This is a tiny linear system!

def compute_x(output):
    R, y = load_from_hdfs(output)
    x = numpy.linalg.solve(R, y)
    write_output(x, output + '-x')
We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine
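The reason this works is that a matrix and its R factor share right singular vectors, so the SVD of the tiny R yields the principal components. A small-scale sketch (sizes shrunk for illustration; the real job uses TSQR instead of a single local QR):

```python
import numpy

G = numpy.random.randn(500, 8)
X = G - G.mean(axis=1, keepdims=True)   # zero-mean rows, as in the pipeline
R = numpy.linalg.qr(X, 'r')             # what TSQR would hand back
_, s_R, Vt_R = numpy.linalg.svd(R)
_, s_X, Vt_X = numpy.linalg.svd(X, full_matrices=False)
assert numpy.allclose(s_R, s_X)         # identical singular values
```

The columns of V (rows of Vt_R) are the principal components; only the small R ever needs to be factored on one machine.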
Getting the matrix Q is tricky!
What about the matrix Q?
We want Q to be numerically orthogonal. A condition number measures problem sensitivity. Prior methods all failed without any warning.
[Figure: norm(Q^T Q − I) plotted against condition numbers from 10^5 to 10^20. Prior work: AR^{-1} and AR^{-1} + iterative refinement (Constantine & Gleich, MapReduce 2011). Direct TSQR (Benson, Gleich, Demmel, submitted) stays numerically orthogonal.]
Taking care of business by keeping track of Q
[Figure: Direct TSQR. Mapper 1 factors A1–A4 into Q1 R1, …, Q4 R4, writing the Q output and R output to separate files. Task 2 collects R1–R4 on one node and computes the pieces Q11, Q21, Q31, Q41 of the inner factorization. Mapper 3 distributes these pieces and multiplies them against Q1–Q4 to form the true Q.]
1. Output local Q and R in separate files
2. Collect R on one node, compute Qs for each piece
3. Distribute the pieces of Q*1 and form the true Q
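The three steps above can be played out on one machine with plain NumPy (a sketch under shrunk sizes, standing in for the distributed version):

```python
import numpy

A = numpy.random.randn(400, 5)
blocks = numpy.split(A, 4)
# Step 1: local QRs; the Q_i and R_i would go to separate output files.
QRs = [numpy.linalg.qr(b) for b in blocks]
# Step 2: collect the R_i on one node and factor the stack.
Q2, R = numpy.linalg.qr(numpy.vstack([Ri for _, Ri in QRs]))
Q2_pieces = numpy.split(Q2, 4)
# Step 3: distribute the pieces and form the true Q block by block.
Q = numpy.vstack([Qi @ piece for (Qi, _), piece in zip(QRs, Q2_pieces)])
assert numpy.allclose(Q @ R, A)               # a valid factorization
assert numpy.allclose(Q.T @ Q, numpy.eye(5))  # numerically orthogonal
```

Each block of the true Q is just a local Q_i times a small square piece, which is why step 3 parallelizes cleanly.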
Code available from github.com/arbenson/mrtsqr … it isn’t too bad.
Future work … more columns!
With ~3000 columns, one 64MB chunk is a local QR computation. We could “iterate in blocks of 3000 columns” to continue; that is maybe “efficient” up to 10,000 columns. We need different ideas for 100,000 columns (randomized methods?).
Questions? www.cs.purdue.edu/~dgleich @dgleich [email protected]