What you can do with a tall-and-skinny QR factorization in Hadoop: principal components and large regressions
DESCRIPTION
Some techniques that work with the tall-and-skinny QR factorization of a matrix.
TRANSCRIPT
What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components
Slides bit.ly/16LS8Vk Code github.com/dgleich/mrtsqr
DAVID F. GLEICH, ASSISTANT PROFESSOR, COMPUTER SCIENCE, PURDUE UNIVERSITY
@dgleich [email protected]
bit.ly/16LS8Vk David Gleich · Purdue 1
Why you should stay … you like advanced machine learning techniques; you want to understand how to compute the singular values and vectors of a huge matrix (that's tall and skinny); you want to learn about large-scale regression and principal components from a matrix perspective.
What I’m going to assume you know
MapReduce Python Some simple matrix manipulation
[Figure: a tall-and-skinny matrix A, from the tinyimages collection]
Tall-and-Skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000).
Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, big-data SVD/PCA.
If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
Tall-and-skinny matrices are common in BigData
[Figure: the matrix A split into row blocks A1, A2, A3, A4]
A : m x n, m ≫ n. Key is an arbitrary row-id. Value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.
PCA of 80,000,000 images
[Figure: A is 80,000,000 images by 1000 pixels; the first 16 columns of V shown as images]
Constantine & Gleich, MapReduce 2011.
[Figure 5: the 16 most important principal component basis functions (by rows) and the fraction of variance explained by the top 100 principal components (bottom left) and by all principal components (bottom right).]
4. CONCLUSION
In this manuscript, we have illustrated the ability of MapReduce architectures to solve massive least-squares problems through a tall and skinny QR factorization. We choose to implement these algorithms in a simple Hadoop streaming framework to provide prototype implementations so that others can easily adapt the algorithms to their particular problem. These codes are all available online.¹ We envision that the TSQR paradigm will find a place in block-analogues of the various iterative methods in the Mahout project. These methods are based on block analogues of the Lanczos process, which replace vector normalization steps with QR factorizations. Because the TSQR routine solves linear regression problems, it can also serve as the least-squares subroutine for an iteratively reweighted least-squares algorithm for fitting general linear models.
A key motivation for our MapReduce TSQR implementation comes from a residual minimizing model reduction method [5] for approximating the output of a parameterized differential equation model. Methods for constructing reduced order models typically involve a collection of solutions (dubbed snapshots [16]), each computed at its respective input parameters. Storing and managing the terascale data from these solutions is itself challenging, and the hard disk storage of MapReduce is a natural fit.
¹See http://www.github.com/dgleich/mrtsqr.
5. REFERENCES
[1] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou. QR factorization of tall and skinny matrices in a grid computing environment. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–11, April 2010.
[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, Penn., 1996.
[3] K. Bosteels. Fuzzy techniques in the usage and construction of comparison measures for music objects, 2009.
[4] J. Choi, J. Demmel, I. S. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. PARA, pages 95–106, 1995.
[5] P. G. Constantine and Q. Wang. Residual minimizing model reduction for parameterized nonlinear dynamical systems, arXiv:1012.0351, 2010.
[6] B. Dagnon and B. Hindman. TSQR on EC2 using the Nexus substrate. http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/writeup_dagnon.pdf, 2010. Class project writeup for CS267, University of California Berkeley.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), pages 137–150, 2004.
[8] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. arXiv, 0806.2159, 2008.
[9] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
[10] J. G. F. Francis. The QR transformation: a unitary analogue to the LR transformation – part 1. The Computer Journal, 4:265–271, 1961.
[11] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
[12] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, October 1996.
[13] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Rev., 20:740–777, 1978.
[14] J. Langou. Computing the R of the QR factorization of tall and skinny matrix using mpi_reduce. arXiv, math.NA:1002.4250, 2010.
[15] T. E. Oliphant. Guide to NumPy. Provo, UT, March 2006.
[16] L. Sirovich. Turbulence and the dynamics of coherent structures. Part 1: Coherent structures. Quarterly of Applied Mathematics, 45(3):561–571, 1987.
[17] A. Stathopoulos and K. Wu. A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput., 23:2165–2182, June 2001.
[18] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958–1970, November 2008.
[19] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[20] Various. Hadoop version 0.21. http://hadoop.apache.org, 2010.
[21] F. Wang. Implement linear regression. https://issues.apache.org/jira/browse/MAHOUT-529. Mahout-529 JIRA, accessed on February 10, 2011.
[22] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, 1998.
[23] B. White. hadoopy. http://bwhite.github.com/hadoopy.
Acknowledgments. We are exceedingly grateful to Mark Hoemmen for many discussions about the TSQR factorization. We would also like to thank James Demmel for suggesting examining the reference streaming time. Finally, we are happy to acknowledge the fellow MapReduce “computers” at Sandia for general Hadoop help: Craig Ulmer, Todd Plantenga, Justin Basilico, Art Munson, and Tamara G. Kolda.
Regression with 80,000,000 images
The goal was to approximate how much red there was in a picture from the values of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.
Table 3: Results when varying block size. The best performance results are bolded. See §3.3 for details.

Cols.  Blks.  Maps   Iter. 1 Secs.  Iter. 2 Secs.
50     2      8000   424            21
—      3      —      399            19
—      5      —      408            19
—      10     —      401            19
—      20     —      396            20
—      50     —      406            18
—      100    —      380            19
—      200    —      395            19
100    2      7000   410            21
—      3      —      384            21
—      5      —      390            22
—      10     —      372            22
—      20     —      374            22
1000   2      6000   493            199
—      3      —      432            169
—      5      —      422            154
—      10     —      430            202
—      20     —      434            202
3.4 Split size
There are three factors that control the TSQR tree on Hadoop: the number of mappers, the number of reducers, and the number of iterations. In this section, we investigate the trade-off between decreasing the number of mappers, which is done by increasing the minimum split size in HDFS, and using additional iterations. Using additional iterations provides the opportunity to exploit parallelism via a reduction tree. Table 4 shows the total computation time for our C++ code when used with various split sizes and one or two iterations. The block size used was the best performing case from the previous experiment. Each row states the number of columns, the number of iterations used (for two iterations, we used 250 reducers in the first iteration), the split size, and the total computation time. With a split size of 512 MB, each mapper consumes an entire input file of the matrix (recall that the matrices are constructed by 1000 reducers, and hence 1000 files). The two iteration test used 250 reducers in the first iteration and 1 reducer in the second iteration. The one iteration test used 1 reducer in the first iteration, which is required to get the correct final answer. (In the 1000 column test, using a smaller split size of 64 or 256 MB generated too much data from the mappers for a single reducer to handle efficiently.)
The results are different between 50 columns and 1000 columns. With 50 columns, a one iteration approach is faster, and increasing the split size dramatically reduces the computation time. This results from two intertwined behaviors: first, using a larger split size sends less data to the final reducer, making it run faster; and second, using a larger split size reduces the overhead of Hadoop launching additional map tasks. With 1000 columns, the two iteration approach is faster. This happens because each R matrix output by the mappers is 400 times larger than with the 50 column experiment. Consequently, the single reducer takes much longer in the one iteration case. Using an additional iteration allows us to handle this reduction with more parallelism.
Table 4: Results when varying split size. See §3.4.

Cols.  Iters.  Split (MB)  Maps   Secs.
50     1       64          8000   388
—      —       256         2000   184
—      —       512         1000   149
—      2       64          8000   425
—      —       256         2000   220
—      —       512         1000   191
1000   1       512         1000   666
—      2       64          6000   590
—      —       256         2000   432
—      —       512         1000   337
3.5 Tinyimages: regression and PCA
Our final experiment shows this algorithm applied to a real world dataset. The tinyimages collection is a set of almost 80,000,000 images. Each image is 32-by-32 pixels. The image collection is stored in a single file, where each 3072 byte segment consists of the red, green, and blue values for each of the 1024 pixels in the image. We wrote a custom Hadoop InputFormat to read this file directly and transmit the data to our Hadoop streaming programs as a set of bytes. We used the Dumbo python framework for these experiments. In the following two experiments, we translated all the color pixels into shades of gray. Consequently, this dataset represents a 79,302,017-by-1024 matrix.
We first solved a regression problem by trying to predict the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_{i,j} is the gray value of the jth pixel in image i, then we wanted to find min_s Σ_i (r_i − Σ_j G_{i,j} s_j)². There is no particular importance to this regression problem; we use it merely as a demonstration. The coefficients s_j are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of G_{i,j}'s from the regression and let u_i be the mean of the ith row in G. The principal components of the images are given by the right singular vectors of the matrix G − ue^T, where u is the vector of all the mean values and e is the 1024-by-1 vector of ones. That is, let G − ue^T = UΣV^T be the SVD; then the principal components are the columns of V. We compute V by first doing a TSQR of G − ue^T, and then computing an SVD of the final R, which is a small 1024-by-1024 matrix. The principal components are plotted as images in Figure 5. These images show a reasonable basis for images and are reminiscent of the basis in a discrete cosine transform.
[Figure: A, 80,000,000 images by 1000 pixels]
Let’s talk about QR!
QR Factorization and the Gram Schmidt process
Consider a set of vectors v1 to vn. Set u1 to be v1. Create a new vector u2 by removing any “component” of u1 from v2. Create a new vector u3 by removing any “component” of u1 and u2 from v3. …
“Gram-Schmidt process”, from Wikipedia
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
[v1 v2 v3 …] = [u1 u2 u3 …] ×
  ⎡ a1 b1 c1 … ⎤
  ⎢ 0  b2 c2 … ⎥
  ⎢ 0  0  c3 … ⎥
  ⎣ ⋮  ⋮  ⋮  ⋱ ⎦
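The removal-of-components procedure described above can be sketched in a few lines of NumPy (a minimal illustration, not code from the talk; the talk's actual programs use numpy.linalg.qr instead):

```python
import numpy

def gram_schmidt(V):
    """Classical Gram-Schmidt on the columns of V (illustration only)."""
    Q = numpy.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float)
        for i in range(j):
            # remove the component of v_j along the already-built u_i
            u = u - (Q[:, i] @ V[:, j]) * Q[:, i]
        Q[:, j] = u / numpy.linalg.norm(u)
    return Q

V = numpy.random.randn(6, 3)
Q = gram_schmidt(V)
assert numpy.allclose(Q.T @ Q, numpy.eye(3))   # columns at right angles
assert numpy.allclose(Q @ (Q.T @ V), V)        # same span as V
```

The triangular coefficients a1, b1, b2, … above are exactly the inner products this loop computes along the way.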
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
V = UR. All vectors in U are at right angles, i.e. they are decoupled. For this problem, that is what is usually written by others as A = QR.
QR Factorization and the Gram Schmidt process
v1 = a1u1
v2 = b1u1 + b2u2
v3 = c1u1 + c2u2 + c3u3
All vectors in U are at right angles, i.e. they are decoupled
[Figure: A = QR, drawn as block matrices]
PCA of 80,000,000 images
[Figure: A (80,000,000 images by 1000 pixels) → zero-mean rows → X → TSQR (MapReduce) → R → SVD (post-processing) → V. The first 16 columns of V are shown as images, along with the top 100 singular values (principal components).]
Constantine & Gleich, MapReduce 2010.
Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
The rest of the talk: Full TSQR code in hadoopy
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
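The correctness claim behind this code, that QR-ing stacked per-block R factors reproduces the R of the full matrix, can be checked locally with plain NumPy (a standalone sanity check, not part of the Hadoop job):

```python
import numpy

A = numpy.random.randn(1000, 5)
blocks = numpy.split(A, 4)                      # four "map tasks"
Rs = [numpy.linalg.qr(b, 'r') for b in blocks]  # map: local R factors
R = numpy.linalg.qr(numpy.vstack(Rs), 'r')      # reduce: QR of stacked Rs
R_direct = numpy.linalg.qr(A, 'r')

def fix_signs(M):
    # QR is unique only up to the sign of each row of R
    return numpy.sign(numpy.diag(M))[:, None] * M

assert numpy.allclose(fix_signs(R), fix_signs(R_direct))
```

This invariance is what lets the mapper call compress() repeatedly without losing information.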
[Figure: the TSQR reduction tree. Mapper 1 (Serial TSQR) reads A1–A4, factors them block by block (A2 qr→ Q2 R2, A3 qr→ Q3 R3, A4 qr→ Q4 R4) and emits R4. Mapper 2 (Serial TSQR) does the same for A5–A8 and emits R8. Reducer 1 (Serial TSQR) factors the stacked R4, R8 (qr→ Q) and emits the final R.]
Algorithm: Data = rows of a matrix. Map = QR factorization of rows. Reduce = QR factorization of rows.
Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)
Too many maps cause too much data to one reducer!
Each image is 5k. Each HDFS block has 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix. One reducer gets a 6.25M-by-1000 matrix (50GB).
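The back-of-the-envelope arithmetic for that overload, using the figures from the slide, is easy to reproduce:

```python
# 6,250 map tasks each emit a 1000-by-1000 R factor of 8-byte doubles,
# and every one of them lands on the single reducer.
maps = 6250
rows_to_reducer = maps * 1000                  # stacked R factors
bytes_to_reducer = rows_to_reducer * 1000 * 8  # 1000 doubles per row
assert rows_to_reducer == 6_250_000            # a 6.25M-by-1000 matrix
assert bytes_to_reducer == 50 * 10**9          # 50 GB
```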
[Figure: the two-iteration TSQR pipeline. Iteration 1: Mappers 1-1 through 1-4 (Serial TSQR) factor the row blocks of A and emit R1–R4; after a shuffle, Reducers 1-1 through 1-3 (Serial TSQR) combine them and emit R2,1, R2,2, R2,3. Iteration 2: an identity map followed by a single Reducer 2-1 (Serial TSQR) produces the final R.]
Hadoop streaming isn't always slow! Synthetic data test on 100,000,000-by-500 matrix (~500GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+ATLAS matrices; custom C++ TypedBytes reader/writer with ATLAS matrices.

          Iter 1 (secs.)  Iter 2 (secs.)  Overall (secs.)
Dumbo     960             217             1177
Hadoopy   612             118             730
C++       350             37              387
Java      436             66              502
Use multiple iterations for problems with many columns. Increasing split size improves performance (accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with 64-MB split size overloaded the single reducer.)

Cols.  Iters.  Split (MB)  Maps   Secs.
50     1       64          8000   388
—      —       256         2000   184
—      —       512         1000   149
—      2       64          8000   425
—      —       256         2000   220
—      —       512         1000   191
1000   1       512         1000   666
—      2       64          6000   590
—      —       256         2000   432
—      —       512         1000   337
More about how to compute a regression
min ‖Ax − b‖² = min Σ_i ( Σ_j A_{ij} x_j − b_i )²
[Figure: Mapper 1 (Serial TSQR) processes the row blocks A1–A4 of A together with the matching pieces of b; each local factorization (e.g. A2 qr→ Q2, R2) also updates the right-hand side, b2 = Q2^T b1.]
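The mapper's trick of carrying b along and replacing it with Q^T b at each compression can be checked locally (a pure-NumPy sketch, no Hadoop, with hypothetical sizes):

```python
import numpy

A = numpy.random.randn(300, 4)
b = numpy.random.randn(300)
Rs, ys = [], []
for Ai, bi in zip(numpy.split(A, 3), numpy.split(b, 3)):
    Qi, Ri = numpy.linalg.qr(Ai)
    Rs.append(Ri)
    ys.append(Qi.T @ bi)                      # b_new = Q^T b_old
Q2, R = numpy.linalg.qr(numpy.vstack(Rs))     # combine the R factors
y = Q2.T @ numpy.concatenate(ys)              # and the transformed b pieces
x = numpy.linalg.solve(R, y)                  # tiny final system
x_ref, *_ = numpy.linalg.lstsq(A, b, rcond=None)
assert numpy.allclose(x, x_ref)
```

The final (R, Q^T b) pair is all the reducer needs; the big matrix never has to be revisited.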
TSQR code in hadoopy for regressions
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        […]

    def compress(self):
        Q, R = numpy.linalg.qr(numpy.array(self.data), 'full')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
        self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

    def collect(self, key, valuerhs):
        self.data.append(valuerhs[0])
        self.rhs.append(valuerhs[1])
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for i, row in enumerate(self.data):
            key = random.randint(0, 2000000000)
            yield key, (row, self.rhs[i])

    def mapper(self, key, value):
        self.collect(key, unpack(value))

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, unpack(value))

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
More about how to compute a regression
QR for regression:
min ‖Ax − b‖²
  = min ‖QRx − b‖²
  = min ‖Q^T QRx − Q^T b‖²
  = min ‖Rx − Q^T b‖²
Orthogonal or “right angle” matrices don't change vector magnitude. This is a tiny linear system!

def compute_x(output):
    R, y = load_from_hdfs(output)
    x = numpy.linalg.solve(R, y)
    write_output(x, output + '-x')
We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine
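The reason this works is that a matrix and its R factor share right singular vectors, so the SVD of the tiny R yields the principal components. A small-scale sketch (sizes shrunk for illustration; the real job uses TSQR instead of a single local QR):

```python
import numpy

G = numpy.random.randn(500, 8)
X = G - G.mean(axis=1, keepdims=True)   # zero-mean rows, as in the pipeline
R = numpy.linalg.qr(X, 'r')             # what TSQR would hand back
_, s_R, Vt_R = numpy.linalg.svd(R)
_, s_X, Vt_X = numpy.linalg.svd(X, full_matrices=False)
assert numpy.allclose(s_R, s_X)         # identical singular values
```

The columns of V (rows of Vt_R) are the principal components; only the small R ever needs to be factored on one machine.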
Getting the matrix Q is tricky!
What about the matrix Q?
We want Q to be numerically orthogonal. A condition number measures problem sensitivity. Prior methods all failed without any warning.
[Figure: norm(Q^T Q − I) plotted against condition numbers from 10^5 to 10^20. Prior work: AR^{-1} and AR^{-1} + iterative refinement (Constantine & Gleich, MapReduce 2011). Direct TSQR (Benson, Gleich, Demmel, submitted) stays numerically orthogonal.]
Taking care of business by keeping track of Q
[Figure: Direct TSQR. Mapper 1 factors A1–A4 into Q1 R1, …, Q4 R4, writing the Q output and R output to separate files. Task 2 collects R1–R4 on one node and computes the pieces Q11, Q21, Q31, Q41 of the inner factorization. Mapper 3 distributes these pieces and multiplies them against Q1–Q4 to form the true Q.]
1. Output local Q and R in separate files
2. Collect R on one node, compute Qs for each piece
3. Distribute the pieces of Q*1 and form the true Q
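The three steps above can be played out on one machine with plain NumPy (a sketch under shrunk sizes, standing in for the distributed version):

```python
import numpy

A = numpy.random.randn(400, 5)
blocks = numpy.split(A, 4)
# Step 1: local QRs; the Q_i and R_i would go to separate output files.
QRs = [numpy.linalg.qr(b) for b in blocks]
# Step 2: collect the R_i on one node and factor the stack.
Q2, R = numpy.linalg.qr(numpy.vstack([Ri for _, Ri in QRs]))
Q2_pieces = numpy.split(Q2, 4)
# Step 3: distribute the pieces and form the true Q block by block.
Q = numpy.vstack([Qi @ piece for (Qi, _), piece in zip(QRs, Q2_pieces)])
assert numpy.allclose(Q @ R, A)               # a valid factorization
assert numpy.allclose(Q.T @ Q, numpy.eye(5))  # numerically orthogonal
```

Each block of the true Q is just a local Q_i times a small square piece, which is why step 3 parallelizes cleanly.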
Code available from github.com/arbenson/mrtsqr … it isn’t too bad.
Future work … more columns!
With ~3000 columns, one 64MB chunk is a local QR computation. We could “iterate in blocks of 3000 columns” to continue; that is maybe “efficient” up to 10,000 columns. We need different ideas for 100,000 columns (randomized methods?).
Questions? www.cs.purdue.edu/~dgleich @dgleich [email protected]