MapReduce for scientific simulation analysis



DESCRIPTION

A tutorial I gave at Stanford on how to use MapReduce style computations for simulation data.

TRANSCRIPT

Page 1: MapReduce for scientific simulation analysis

A hands-on introduction to scientific data analysis with Hadoop: a matrix computations perspective

DAVID F. GLEICH, PURDUE UNIVERSITY ICME MAPREDUCE WORKSHOP @ STANFORD


Page 2: MapReduce for scientific simulation analysis

Who is this for?

workshop project groups
those curious about “MapReduce” and “Hadoop”
those who think about problems as matrices


Page 3: MapReduce for scientific simulation analysis

What should you get out of it?

1. Understand some problems that MapReduce solves effectively.
2. Learn techniques to solve them using Hadoop and dumbo.
3. Learn some Hadoop words.


Page 4: MapReduce for scientific simulation analysis

What you won’t learn …

the latest and greatest in MapReduce algorithms
how to improve the performance of your Hadoop job
how to write wordcount in Hadoop


Page 5: MapReduce for scientific simulation analysis

Slides will be online soon. Code samples and short tutorials at github.com/dgleich/mrmatrix


Page 6: MapReduce for scientific simulation analysis

1. HPC vs. Data (redux)
2. MapReduce vs. Hadoop
3. Dive into Hadoop with Hadoop streaming
4. Sparse matrix methods with Hadoop


Page 7: MapReduce for scientific simulation analysis

High performance computing vs. Data intensive computing


Page 8: MapReduce for scientific simulation analysis

Supercomputer: 224k cores, 10 PB drive (45 GB/core), 1.7 Pflops, 7 MW, custom interconnect, $104 M

Data cluster: 80k cores, 50 PB drive (625 GB/core), ? Pflops, ? MW, GB ethernet, $?? M


Page 9: MapReduce for scientific simulation analysis

icme-hadoop1
12 nodes; 4-core i7 per node, 24 GB RAM/node, 1 GB ethernet
12 TB disk/node (3000 GB/core), 50 TB usable space (3x redundancy)


Page 10: MapReduce for scientific simulation analysis

MapReduce is designed to solve a different set of problems


Page 11: MapReduce for scientific simulation analysis

[Diagram: supercomputer → data computing cluster → engineer]

Each multi-day HPC simulation generates gigabytes of data.

A data cluster can hold hundreds or thousands of old simulations …

… enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.


Page 12: MapReduce for scientific simulation analysis

MapReduce and!Hadoop overview


Page 13: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs


Page 14: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

Map function f must be side-effect free.
Reduce function g must be side-effect free.


Page 15: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

All map functions can be done in parallel.
All reduce functions (for key k) can be done in parallel.

Page 16: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

Shuffle: group all pairs with key k together (sorting suffices)
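The model is small enough to simulate in a few lines of plain Python. This sketch captures the semantics above (map, shuffle-by-sort, reduce), not how Hadoop actually executes it; it is handy for testing mappers and reducers locally:

from itertools import groupby
from operator import itemgetter

def mapreduce(records, mapper, reducer):
    """Run MapReduce semantics over an in-memory list of (key, value) pairs."""
    # Map: apply f to every input pair; each call may yield many pairs
    mapped = [out for k, v in records for out in mapper(k, v)]
    # Shuffle: group all pairs with key k together (sorting suffices)
    mapped.sort(key=itemgetter(0))
    # Reduce: apply g to all values with key k, for each k
    result = []
    for k, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reducer(k, (v for _, v in group)))
    return result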


Page 17: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

[Diagram: three simulation runs (Run 1, Run 2, Run 3), each with mesh snapshots at T=1, T=2, T=3]


Page 18: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

[Diagram: Run 1, Run 2, Run 3 (snapshots T=1, T=2, T=3) feeding three map tasks and two reduce tasks]
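In dumbo-style Python, the variance job could look like the sketch below. The input layout (one line per mesh point per snapshot, formatted as "run timestep point value") is an assumption for illustration, not necessarily the demo's actual format:

def mapper(key, value):
    # value = "run timestep point value"; the mesh point id becomes the key
    run, t, point, val = value.split()
    yield point, float(val)

def reducer(point, values):
    # one-pass (Welford) update of the variance across all runs and timesteps
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else 0.0
    yield point, var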


Page 19: MapReduce for scientific simulation analysis

MapReduce vs. Hadoop

MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation).

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twister, Google MapReduce, Spark, …


Page 20: MapReduce for scientific simulation analysis

Why so many limitations?


Page 21: MapReduce for scientific simulation analysis

Data scalability

The idea: bring the computations to the data. MR can schedule map functions without moving data.

[Diagram: data blocks 1–5 stored across nodes; map tasks M run on the nodes holding their blocks, then a shuffle feeds the reduce tasks R]


Page 22: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!

[Diagram: Run 1, Run 2, Run 3 (snapshots T=1, T=2, T=3) feeding map tasks and reduce tasks]


Page 23: MapReduce for scientific simulation analysis

After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.

heartbreak on node rs252


Page 24: MapReduce for scientific simulation analysis

Fault tolerant

Redundant input helps make maps data-local.
Just one type of communication: the shuffle.

[Diagram: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk]


Page 25: MapReduce for scientific simulation analysis

Performance results (simulated faults)

We can still run with P(fault) = […] with only ~[…] performance penalty. However, with P(fault) small, we still see a performance hit.

Fault injection

[Plot: time to completion (sec; 100–200) vs. 1/Prob(failure), the mean number of successes per failure (10–1000); curves with and without faults, for 200M-by-200 and 800M-by-10 matrices]

With 1/5 tasks failing, the job only takes twice as long.


Page 26: MapReduce for scientific simulation analysis

Diving into Hadoop (with python)


Page 27: MapReduce for scientific simulation analysis

Tools I like

hadoop streaming
dumbo
mrjob
hadoopy
C++


Page 28: MapReduce for scientific simulation analysis

Tools I don’t use but other people seem to like …

Pig
Java
HBase
Eclipse
Cassandra


Page 29: MapReduce for scientific simulation analysis

hadoop streaming

the map function is a program: (key,value) pairs are sent via stdin; output (key,value) pairs go to stdout
the reduce function is a program: (key,value) pairs are sent via stdin; keys are grouped; output (key,value) pairs go to stdout
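Concretely, a streaming row-sum job is just two small programs; this is a sketch (file names and record format are illustrative). The mapper reads raw matrix rows from stdin and prints tab-separated pairs; the reducer sees those pairs on stdin, already grouped by key.

# mapper.py -- one matrix row per input line; emit (line number, row sum)
import sys
for i, line in enumerate(sys.stdin):
    # note: with several map tasks, i restarts in each task; fine for a demo
    vals = [float(v) for v in line.split()]
    sys.stdout.write('%d\t%f\n' % (i, sum(vals)))

# reducer.py -- input arrives as "key<TAB>value" lines, grouped by key;
# an identity reduce just echoes each pair back out
import sys
for line in sys.stdin:
    key, sep, value = line.rstrip('\n').partition('\t')
    sys.stdout.write('%s\t%s\n' % (key, value))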


Page 30: MapReduce for scientific simulation analysis

dumbo: a wrapper around hadoop streaming for map and reduce functions in python

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text.
    key = <byte that the line starts in the file>
    value = <line of text>
    """
    valarray = [float(v) for v in value.split()]
    yield key, sum(valarray)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, dumbo.lib.identityreducer)


Page 31: MapReduce for scientific simulation analysis

How can Hadoop streaming possibly be fast?

Hadoop streaming frameworks on a synthetic data test: a 100,000,000-by-500 matrix (~500 GB), computing the R in a QR factorization.

Framework   Iter 1 QR (secs.)   Iter 1 total (secs.)   Iter 2 total (secs.)   Overall total (secs.)
Dumbo       67725               960                    217                    1177
Hadoopy     70909               612                    118                    730
C++         15809               350                    37                     387
Java        —                   436                    66                     502

Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+Atlas
Custom C++ TypedBytes reader/writer with Atlas
New non-streaming Java implementation too
All timing results from the Hadoop job tracker
(Results from the MapReduce 2011 workshop talk, David Gleich, Sandia.)

C++ in streaming beats a native Java implementation.


Page 32: MapReduce for scientific simulation analysis

Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!
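Step 1 could be as simple as the sketch below (an illustrative generator, not necessarily the one in the git repo): write a random tall-and-skinny matrix as text, one whitespace-separated row per line. Step 2 is then a single "hadoop fs -put" of the file into HDFS.

# gen_matrix.py -- write a small random tall-and-skinny matrix as text (sketch)
import random, sys

m, n = 1000, 10  # rows and columns; tiny so the demo runs fast
for _ in xrange(m):
    row = [random.gauss(0, 1) for _ in xrange(n)]
    sys.stdout.write(' '.join('%f' % v for v in row) + '\n')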


Page 33: MapReduce for scientific simulation analysis

How does Hadoop know key = byte offset in the file, value = line of text?

InputFormat: map a file on HDFS to (key,value) pairs
TextInputFormat: map a text file to (<byte offset>, <line>) pairs
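As a concrete picture, a two-line text file turns into the following records (offsets count bytes from the start of the file):

# file contents (8 bytes per line,   records TextInputFormat yields:
# counting the trailing newline)
# "1.0 2.0\n"   ->  (0, "1.0 2.0")
# "3.5 4.5\n"   ->  (8, "3.5 4.5")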


Page 34: MapReduce for scientific simulation analysis

The Hadoop Distributed File System (HDFS) and a big text file

HDFS stores files in 64 MB chunks
Each chunk is a FileSplit
FileSplits are stored in parallel
An InputFormat converts FileSplits into a sequence of key-value records
FileSplits can cross record borders (a small bit of communication)


Page 35: MapReduce for scientific simulation analysis

Tall-and-skinny matrix storage in MapReduce

A: m x n, m ≫ n
Key is an arbitrary row-id
Value is the 1 x n array for a row
Each submatrix Ai is an InputSplit (the input to a map task).

[Diagram: A partitioned into row blocks A1, A2, A3, A4]


Page 36: MapReduce for scientific simulation analysis

hadoop: output row-sum for all local rows

MPI: parallel load for my-batch-of-rows; compute row-sum; parallel save


Page 37: MapReduce for scientific simulation analysis

Isn’t reading and writing text files rather inefficient?


Page 38: MapReduce for scientific simulation analysis

Sequence Files and OutputFormat

SequenceFile: an internal Hadoop file format to store (key, value) pairs efficiently. Used between map and reduce steps.

OutputFormat: map (key, value) pairs to output on disk

TextOutputFormat: map (key,value) pairs to key\tvalue strings


Page 39: MapReduce for scientific simulation analysis

typedbytes

A simple binary serialization scheme:
[<1-byte-type-flag> <binary-value>]*
Roughly equivalent to JSON.
(Optionally) used to communicate to and from Hadoop streaming.


Page 40: MapReduce for scientific simulation analysis

typedbytes example

def _read(self):
    # read the 1-byte type flag, then dispatch to the handler for that type
    t = unpack_type(self.file.read(1))[0]
    self.t = t
    return self.handler_table[t](self)

def read_vector(self):
    # a vector is a 4-byte count followed by that many typedbytes values
    r = self._read
    count = unpack_int(self.file.read(4))[0]
    return tuple(r() for i in xrange(count))
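Writing typedbytes is just as mechanical. A minimal sketch for doubles and vectors, assuming the standard type flags (6 = double, 8 = vector) and big-endian packing:

import struct

def write_double(f, x):
    # flag byte 6 (double), then an 8-byte big-endian IEEE value
    f.write(struct.pack('>b', 6))
    f.write(struct.pack('>d', x))

def write_vector(f, values):
    # flag byte 8 (vector), a 4-byte count, then each element in turn
    f.write(struct.pack('>b', 8))
    f.write(struct.pack('>i', len(values)))
    for v in values:
        write_double(f, v)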


Page 41: MapReduce for scientific simulation analysis

Demo 2 Column sums


Page 42: MapReduce for scientific simulation analysis

Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer)


Page 43: MapReduce for scientific simulation analysis

Isn’t this just moving the data to the computation?

Yes. It seems much worse than MPI.

MPI: parallel load for my-batch-of-rows; update the sum of each column; parallel reduce partial column sums; parallel save


Page 44: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Combine: apply g to local values with key k
Shuffle: group all pairs with key k together
Reduce: apply a function g to all values with key k
Output: a list of (key, value) pairs


Page 45: MapReduce for scientific simulation analysis

Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer, combiner=reducer)


Page 46: MapReduce for scientific simulation analysis

How many mappers and reducers?

The number of maps is the number of InputSplits. You choose how many reducers. Each reducer outputs to a separate file.


Page 47: MapReduce for scientific simulation analysis

Demo 3 Column sums with multiple reducers


Page 48: MapReduce for scientific simulation analysis

Which reducer does my key go to?

Partitioner: map a given key to a reducer
HashPartitioner: randomly distribute keys
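A rough model of what a hash partitioner does (a sketch of the idea, not Hadoop's exact hash function):

def partition(key, num_reducers):
    # equal keys always land on the same reducer;
    # hashing spreads distinct keys roughly evenly
    return hash(key) % num_reducers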


Page 49: MapReduce for scientific simulation analysis

Sparse matrix methods


Page 50: MapReduce for scientific simulation analysis

Storing a matrix by rows

Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6

[Figure: a weighted, directed graph on six nodes and its weighted adjacency matrix
0 16 13  0  0  0
0  0 10 12  0  0
0  4  0  0 14  0
0  0  9  0  0 20
0  0  0  7  0  4
0  0  0  0  0  0
whose rows match the row storage above]


Page 51: MapReduce for scientific simulation analysis

Storing a matrix by rows in a text-file

Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6

[Figure: the same graph and adjacency matrix as the previous slide, stored as a text file]


Page 52: MapReduce for scientific simulation analysis

Sparse matrix-vector product

The matrix
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6

The vector
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89


[Ax]_i = Σ_j A_ij x_j

To make this work, we need to get the value of the vector to the same function as the column of the matrix


Page 53: MapReduce for scientific simulation analysis

Sparse matrix-vector product

The matrix
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6

The vector
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89


[Ax]_i = Σ_j A_ij x_j

We need to “join” the matrix and vector based on the column


Page 54: MapReduce for scientific simulation analysis

Sparse matrix-vector product takes two MR tasks

Task 1: form Aij xj for each nonzero (two types of records!)
Map: if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce: find vecval among the grouped values (one of these values is not like the others); for each (col,(row,val)), emit (row, val*vecval)

Task 2: regroup data by rows, compute sums
Map: identity
Reduce: given (row, [(Aij xj), …]), emit (row, sum(Aij xj))
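A dumbo-style sketch of both tasks. The record tagging here (vector entries as ('x', value), matrix rows as lists of (col, val) pairs) is an illustrative choice, not the demo's exact format:

def mapper1(key, value):
    # key = row id; value is either ('x', xj) for a vector entry
    # or a list of (col, val) nonzeros for that matrix row
    if isinstance(value, tuple) and value[0] == 'x':
        yield key, ('x', value[1])       # send xj to the reducer for column j
    else:
        for col, val in value:
            yield col, (key, val)        # send A_ij to the reducer for column j

def reducer1(col, values):
    # buffer matrix entries until the vector value for this column shows up
    # (assumes every column has a vector entry)
    vecval, pending = None, []
    for v in values:
        if v[0] == 'x':
            vecval = v[1]
        else:
            pending.append(v)
    for row, val in pending:
        yield row, val * vecval          # one term A_ij * x_j, keyed by row

def mapper2(key, value):
    yield key, value                     # identity

def reducer2(row, values):
    yield row, sum(values)               # [Ax]_row

The buffering inside reducer1 is exactly the problem the next slide raises.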

Page 55: MapReduce for scientific simulation analysis

What about a “dense” row?

Map: if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce: find vecval among the grouped values (one of these values is not like the others); for each (col,(row,val)), emit (row, val*vecval)

This forms Aij xj for each nonzero, but how do we find vecval without looking through (and buffering) all the input?

Page 56: MapReduce for scientific simulation analysis

Sparse matrix-vector product takes two MR tasks

Map: if vector, emit ((row,-1), vecval); if matrix, for each non-zero (row,col,val), emit ((col,0), (row,val))
Reduce: find vecval in the input keys; for each (col,(row,val)), emit (row, val*vecval)

Use a custom partitioner to make sure that (row,*) all get mapped to the same reducer, and that we always see (row,-1) before (row,0); a sketch follows. As before, the first task forms Aij xj for each nonzero and the second regroups the data by rows and computes the sums.
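A sketch of the trick, assuming keys are (id, flag) tuples as above. In real Hadoop streaming this is expressed through partitioner options rather than Python, so treat this as the idea, not the exact mechanism:

def partition(key, num_reducers):
    # partition on the id alone, so (row,-1) and (row,0)
    # records all reach the same reducer ...
    id_, flag = key
    return hash(id_) % num_reducers

# ... while the shuffle's sort on the full (id, flag) key delivers
# the vector record (row,-1) before any matrix record (row,0).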

Page 57: MapReduce for scientific simulation analysis

Demo 4 Sparse matrix vector products


Page 58: MapReduce for scientific simulation analysis


Page 59: MapReduce for scientific simulation analysis

Matrix factorizations


Page 60: MapReduce for scientific simulation analysis

[Diagram: TSQR dataflow. Mapper 1 (serial TSQR): blocks A1, A2 are stacked and factored (qr → Q2, R2), R2 stacked with A3 gives R3, and so on through A4, emitting R4. Mapper 2 (serial TSQR): blocks A5–A8 emit R8 the same way. Reducer 1 (serial TSQR): R4 and R8 are stacked and factored, emitting the final R.]

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows


Page 61: MapReduce for scientific simulation analysis

Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # replace the buffered rows by the R factor of their QR factorization
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)


Page 62: MapReduce for scientific simulation analysis

Related resources

Apache Mahout: machine learning for Hadoop … lots of matrices there …
Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf


Page 63: MapReduce for scientific simulation analysis

Way too much stuff!

I hope to keep expanding this tutorial over the week… Keep checking the git repo.
