MapReduce for scientific simulation analysis



DESCRIPTION

A tutorial I gave at Stanford on how to use MapReduce style computations for simulation data.

TRANSCRIPT

Page 1: MapReduce for scientific simulation analysis

A hands-on introduction to scientific data analysis with Hadoop: a matrix computations perspective

DAVID F. GLEICH, PURDUE UNIVERSITY ICME MAPREDUCE WORKSHOP @ STANFORD


Page 2: MapReduce for scientific simulation analysis

Who is this for?

workshop project groups
those curious about “MapReduce” and “Hadoop”
those who think about problems as matrices


Page 3: MapReduce for scientific simulation analysis

What should you get out of it?

1. Understand some problems that MapReduce solves effectively.
2. Learn techniques to solve them using Hadoop and dumbo.
3. Learn some Hadoop words.


Page 4: MapReduce for scientific simulation analysis

What you won’t learn …

the latest and greatest in MapReduce algorithms
how to improve the performance of your Hadoop job
how to write wordcount in Hadoop


Page 5: MapReduce for scientific simulation analysis

Slides will be online soon. Code samples and short tutorials at github.com/dgleich/mrmatrix


Page 6: MapReduce for scientific simulation analysis

1. HPC vs. Data (redux)
2. MapReduce vs. Hadoop
3. Dive into Hadoop with Hadoop streaming
4. Sparse matrix methods with Hadoop


Page 7: MapReduce for scientific simulation analysis

High performance computing vs. Data intensive computing


Page 8: MapReduce for scientific simulation analysis

Supercomputer: 224k cores, 10 PB drive (45 GB/core), 1.7 Pflops, 7 MW, custom interconnect, $104 M

Data cluster: 80k cores, 50 PB drive (625 GB/core), ? Pflops, ? MW, GB ethernet, $?? M


Page 9: MapReduce for scientific simulation analysis

icme-hadoop1
12 nodes; 4-core i7 per node, 24 GB RAM/node, 1 GB ethernet
12 TB disk/node (3000 GB/core), 50 TB usable space (3x redundancy)


Page 10: MapReduce for scientific simulation analysis

MapReduce is designed to solve a different set of problems


Page 11: MapReduce for scientific simulation analysis

[Diagram: supercomputer → data computing cluster → engineer]

Each multi-day HPC simulation generates gigabytes of data.

A data cluster can hold hundreds or thousands of old simulations …

… enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.


Page 12: MapReduce for scientific simulation analysis

MapReduce and!Hadoop overview


Page 13: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs


Page 14: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

Map function f must be side-effect free.
Reduce function g must be side-effect free.


Page 15: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

All map functions can be done in parallel.
All reduce functions (for key k) can be done in parallel.

Page 16: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs

Shuffle: group all pairs with key k together (sorting suffices)
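The model is small enough to simulate in a few lines of plain Python. This sketch captures the semantics above (map, shuffle-by-sort, reduce), not how Hadoop actually executes it; it is handy for testing mappers and reducers locally:

from itertools import groupby
from operator import itemgetter

def mapreduce(records, mapper, reducer):
    """Run MapReduce semantics over an in-memory list of (key, value) pairs."""
    # Map: apply f to every input pair; each call may yield many pairs
    mapped = [out for k, v in records for out in mapper(k, v)]
    # Shuffle: group all pairs with key k together (sorting suffices)
    mapped.sort(key=itemgetter(0))
    # Reduce: apply g to all values with key k, for each k
    result = []
    for k, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reducer(k, (v for _, v in group)))
    return result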


Page 17: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

[Diagram: three simulation runs (Run 1, Run 2, Run 3), each with mesh snapshots at T=1, T=2, T=3]


Page 18: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

[Diagram: Run 1, Run 2, Run 3 (snapshots T=1, T=2, T=3) feeding three map tasks and two reduce tasks]
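In dumbo-style Python, the variance job could look like the sketch below. The input layout (one line per mesh point per snapshot, formatted as "run timestep point value") is an assumption for illustration, not necessarily the demo's actual format:

def mapper(key, value):
    # value = "run timestep point value"; the mesh point id becomes the key
    run, t, point, val = value.split()
    yield point, float(val)

def reducer(point, values):
    # one-pass (Welford) update of the variance across all runs and timesteps
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else 0.0
    yield point, var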


Page 19: MapReduce for scientific simulation analysis

MapReduce vs. Hadoop

MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation).

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twister, Google MapReduce, Spark, …


Page 20: MapReduce for scientific simulation analysis

Why so many limitations?


Page 21: MapReduce for scientific simulation analysis

Data scalability

The idea: bring the computations to the data. MR can schedule map functions without moving data.

[Diagram: data blocks 1–5 stored across nodes; map tasks M run on the nodes holding their blocks, then a shuffle feeds the reduce tasks R]


Page 22: MapReduce for scientific simulation analysis

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!

[Diagram: Run 1, Run 2, Run 3 (snapshots T=1, T=2, T=3) feeding map tasks and reduce tasks]


Page 23: MapReduce for scientific simulation analysis

After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.

heartbreak on node rs252


Page 24: MapReduce for scientific simulation analysis

Fault tolerant

Redundant input helps make maps data-local.
Just one type of communication: the shuffle.

[Diagram: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk]


Page 25: MapReduce for scientific simulation analysis

Performance results (simulated faults)

We can still run with P(fault) = […] with only ~[…] performance penalty. However, with P(fault) small, we still see a performance hit.

Fault injection

[Plot: time to completion (sec; 100–200) vs. 1/Prob(failure), the mean number of successes per failure (10–1000); curves with and without faults, for 200M-by-200 and 800M-by-10 matrices]

With 1/5 tasks failing, the job only takes twice as long.


Page 26: MapReduce for scientific simulation analysis

Diving into Hadoop (with python)


Page 27: MapReduce for scientific simulation analysis

Tools I like

hadoop streaming
dumbo
mrjob
hadoopy
C++


Page 28: MapReduce for scientific simulation analysis

Tools I don’t use but other people seem to like …

Pig
Java
HBase
Eclipse
Cassandra


Page 29: MapReduce for scientific simulation analysis

hadoop streaming

the map function is a program: (key,value) pairs are sent via stdin; output (key,value) pairs go to stdout
the reduce function is a program: (key,value) pairs are sent via stdin; keys are grouped; output (key,value) pairs go to stdout
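Concretely, a streaming row-sum job is just two small programs; this is a sketch (file names and record format are illustrative). The mapper reads raw matrix rows from stdin and prints tab-separated pairs; the reducer sees those pairs on stdin, already grouped by key.

# mapper.py -- one matrix row per input line; emit (line number, row sum)
import sys
for i, line in enumerate(sys.stdin):
    # note: with several map tasks, i restarts in each task; fine for a demo
    vals = [float(v) for v in line.split()]
    sys.stdout.write('%d\t%f\n' % (i, sum(vals)))

# reducer.py -- input arrives as "key<TAB>value" lines, grouped by key;
# an identity reduce just echoes each pair back out
import sys
for line in sys.stdin:
    key, sep, value = line.rstrip('\n').partition('\t')
    sys.stdout.write('%s\t%s\n' % (key, value))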


Page 30: MapReduce for scientific simulation analysis

dumbo: a wrapper around hadoop streaming for map and reduce functions in python

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text.
    key = <byte that the line starts in the file>
    value = <line of text>
    """
    valarray = [float(v) for v in value.split()]
    yield key, sum(valarray)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, dumbo.lib.identityreducer)


Page 31: MapReduce for scientific simulation analysis

How can Hadoop streaming possibly be fast?

Hadoop streaming frameworks on a synthetic data test: a 100,000,000-by-500 matrix (~500 GB), computing the R in a QR factorization.

Framework   Iter 1 QR (secs.)   Iter 1 total (secs.)   Iter 2 total (secs.)   Overall total (secs.)
Dumbo       67725               960                    217                    1177
Hadoopy     70909               612                    118                    730
C++         15809               350                    37                     387
Java        —                   436                    66                     502

Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+Atlas
Custom C++ TypedBytes reader/writer with Atlas
New non-streaming Java implementation too
All timing results from the Hadoop job tracker
(Results from the MapReduce 2011 workshop talk, David Gleich, Sandia.)

C++ in streaming beats a native Java implementation.


Page 32: MapReduce for scientific simulation analysis

Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!
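Step 1 could be as simple as the sketch below (an illustrative generator, not necessarily the one in the git repo): write a random tall-and-skinny matrix as text, one whitespace-separated row per line. Step 2 is then a single "hadoop fs -put" of the file into HDFS.

# gen_matrix.py -- write a small random tall-and-skinny matrix as text (sketch)
import random, sys

m, n = 1000, 10  # rows and columns; tiny so the demo runs fast
for _ in xrange(m):
    row = [random.gauss(0, 1) for _ in xrange(n)]
    sys.stdout.write(' '.join('%f' % v for v in row) + '\n')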


Page 33: MapReduce for scientific simulation analysis

How does Hadoop know key = byte offset in the file, value = line of text?

InputFormat: map a file on HDFS to (key,value) pairs
TextInputFormat: map a text file to (<byte offset>, <line>) pairs
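As a concrete picture, a two-line text file turns into the following records (offsets count bytes from the start of the file):

# file contents (8 bytes per line,   records TextInputFormat yields:
# counting the trailing newline)
# "1.0 2.0\n"   ->  (0, "1.0 2.0")
# "3.5 4.5\n"   ->  (8, "3.5 4.5")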


Page 34: MapReduce for scientific simulation analysis

The Hadoop Distributed File System (HDFS) and a big text file

HDFS stores files in 64 MB chunks
Each chunk is a FileSplit
FileSplits are stored in parallel
An InputFormat converts FileSplits into a sequence of key-value records
FileSplits can cross record borders (a small bit of communication)


Page 35: MapReduce for scientific simulation analysis

Tall-and-skinny matrix storage in MapReduce

A: m x n, m ≫ n
Key is an arbitrary row-id
Value is the 1 x n array for a row
Each submatrix Ai is an InputSplit (the input to a map task).

[Diagram: A partitioned into row blocks A1, A2, A3, A4]


Page 36: MapReduce for scientific simulation analysis

hadoop: output row-sum for all local rows

MPI: parallel load for my-batch-of-rows; compute row-sum; parallel save


Page 37: MapReduce for scientific simulation analysis

Isn’t reading and writing text files rather inefficient?


Page 38: MapReduce for scientific simulation analysis

Sequence Files and OutputFormat

SequenceFile: an internal Hadoop file format to store (key, value) pairs efficiently. Used between map and reduce steps.

OutputFormat: map (key, value) pairs to output on disk

TextOutputFormat: map (key,value) pairs to key\tvalue strings


Page 39: MapReduce for scientific simulation analysis

typedbytes

A simple binary serialization scheme:
[<1-byte-type-flag> <binary-value>]*
Roughly equivalent to JSON.
(Optionally) used to communicate to and from Hadoop streaming.


Page 40: MapReduce for scientific simulation analysis

typedbytes example

def _read(self):
    # read the 1-byte type flag, then dispatch to the handler for that type
    t = unpack_type(self.file.read(1))[0]
    self.t = t
    return self.handler_table[t](self)

def read_vector(self):
    # a vector is a 4-byte count followed by that many typedbytes values
    r = self._read
    count = unpack_int(self.file.read(4))[0]
    return tuple(r() for i in xrange(count))
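Writing typedbytes is just as mechanical. A minimal sketch for doubles and vectors, assuming the standard type flags (6 = double, 8 = vector) and big-endian packing:

import struct

def write_double(f, x):
    # flag byte 6 (double), then an 8-byte big-endian IEEE value
    f.write(struct.pack('>b', 6))
    f.write(struct.pack('>d', x))

def write_vector(f, values):
    # flag byte 8 (vector), a 4-byte count, then each element in turn
    f.write(struct.pack('>b', 8))
    f.write(struct.pack('>i', len(values)))
    for v in values:
        write_double(f, v)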


Page 41: MapReduce for scientific simulation analysis

Demo 2 Column sums


Page 42: MapReduce for scientific simulation analysis

Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer)


Page 43: MapReduce for scientific simulation analysis

Isn’t this just moving the data to the computation?

Yes. It seems much worse than MPI.

MPI: parallel load for my-batch-of-rows; update the sum of each column; parallel reduce partial column sums; parallel save


Page 44: MapReduce for scientific simulation analysis

The MapReduce programming model

Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Combine: apply g to local values with key k
Shuffle: group all pairs with key k together
Reduce: apply a function g to all values with key k
Output: a list of (key, value) pairs


Page 45: MapReduce for scientific simulation analysis

Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer, combiner=reducer)


Page 46: MapReduce for scientific simulation analysis

How many mappers and reducers?

The number of maps is the number of InputSplits. You choose how many reducers. Each reducer outputs to a separate file.


Page 47: MapReduce for scientific simulation analysis

Demo 3 Column sums with multiple reducers


Page 48: MapReduce for scientific simulation analysis

Which reducer does my key go to?

Partitioner: map a given key to a reducer
HashPartitioner: randomly distribute keys
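A rough model of what a hash partitioner does (a sketch of the idea, not Hadoop's exact hash function):

def partition(key, num_reducers):
    # equal keys always land on the same reducer;
    # hashing spreads distinct keys roughly evenly
    return hash(key) % num_reducers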


Page 49: MapReduce for scientific simulation analysis

Sparse matrix methods


Page 50: MapReduce for scientific simulation analysis

Storing a matrix by rows

Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6

[Figure: a weighted, directed graph on six nodes and its weighted adjacency matrix
0 16 13  0  0  0
0  0 10 12  0  0
0  4  0  0 14  0
0  0  9  0  0 20
0  0  0  7  0  4
0  0  0  0  0  0
whose rows match the row storage above]


Page 51: MapReduce for scientific simulation analysis

Storing a matrix by rows in a text-file

Row 1 (2,16.) (3,13.)
Row 2 (3,10.) (4,12.)
Row 3 (2,4.) (5,14.)
Row 4 (3,9.) (6,20.)
Row 5 (4,7.) (6,4.)
Row 6

[Figure: the same graph and adjacency matrix as the previous slide, stored as a text file]


Page 52: MapReduce for scientific simulation analysis

Sparse matrix-vector product

The matrix
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6

The vector
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89


[Ax]_i = Σ_j A_ij x_j

To make this work, we need to get the value of the vector to the same function as the column of the matrix


Page 53: MapReduce for scientific simulation analysis

Sparse matrix-vector product

The matrix
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6

The vector
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89


[Ax]_i = Σ_j A_ij x_j

We need to “join” the matrix and vector based on the column


Page 54: MapReduce for scientific simulation analysis

Sparse matrix-vector product takes two MR tasks

Task 1: form Aij xj for each nonzero (two types of records!)
Map: if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce: find vecval among the grouped values (one of these values is not like the others); for each (col,(row,val)), emit (row, val*vecval)

Task 2: regroup data by rows, compute sums
Map: identity
Reduce: given (row, [(Aij xj), …]), emit (row, sum(Aij xj))
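A dumbo-style sketch of both tasks. The record tagging here (vector entries as ('x', value), matrix rows as lists of (col, val) pairs) is an illustrative choice, not the demo's exact format:

def mapper1(key, value):
    # key = row id; value is either ('x', xj) for a vector entry
    # or a list of (col, val) nonzeros for that matrix row
    if isinstance(value, tuple) and value[0] == 'x':
        yield key, ('x', value[1])       # send xj to the reducer for column j
    else:
        for col, val in value:
            yield col, (key, val)        # send A_ij to the reducer for column j

def reducer1(col, values):
    # buffer matrix entries until the vector value for this column shows up
    # (assumes every column has a vector entry)
    vecval, pending = None, []
    for v in values:
        if v[0] == 'x':
            vecval = v[1]
        else:
            pending.append(v)
    for row, val in pending:
        yield row, val * vecval          # one term A_ij * x_j, keyed by row

def mapper2(key, value):
    yield key, value                     # identity

def reducer2(row, values):
    yield row, sum(values)               # [Ax]_row

The buffering inside reducer1 is exactly the problem the next slide raises.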

Page 55: MapReduce for scientific simulation analysis

What about a “dense” row?

Map: if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col, (row,val))
Reduce: find vecval among the grouped values (one of these values is not like the others); for each (col,(row,val)), emit (row, val*vecval)

This forms Aij xj for each nonzero, but how do we find vecval without looking through (and buffering) all the input?

Page 56: MapReduce for scientific simulation analysis

Sparse matrix-vector product takes two MR tasks

Map: if vector, emit ((row,-1), vecval); if matrix, for each non-zero (row,col,val), emit ((col,0), (row,val))
Reduce: find vecval in the input keys; for each (col,(row,val)), emit (row, val*vecval)

Use a custom partitioner to make sure that (row,*) all get mapped to the same reducer, and that we always see (row,-1) before (row,0); a sketch follows. As before, the first task forms Aij xj for each nonzero and the second regroups the data by rows and computes the sums.
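A sketch of the trick, assuming keys are (id, flag) tuples as above. In real Hadoop streaming this is expressed through partitioner options rather than Python, so treat this as the idea, not the exact mechanism:

def partition(key, num_reducers):
    # partition on the id alone, so (row,-1) and (row,0)
    # records all reach the same reducer ...
    id_, flag = key
    return hash(id_) % num_reducers

# ... while the shuffle's sort on the full (id, flag) key delivers
# the vector record (row,-1) before any matrix record (row,0).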

Page 57: MapReduce for scientific simulation analysis

Demo 4 Sparse matrix vector products


Page 58: MapReduce for scientific simulation analysis


Page 59: MapReduce for scientific simulation analysis

Matrix factorizations


Page 60: MapReduce for scientific simulation analysis

[Diagram: TSQR dataflow. Mapper 1 (serial TSQR): blocks A1, A2 are stacked and factored (qr → Q2, R2), R2 stacked with A3 gives R3, and so on through A4, emitting R4. Mapper 2 (serial TSQR): blocks A5–A8 emit R8 the same way. Reducer 1 (serial TSQR): R4 and R8 are stacked and factored, emitting the final R.]

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows


Page 61: MapReduce for scientific simulation analysis

Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # replace the buffered rows by the R factor of their QR factorization
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)


Page 62: MapReduce for scientific simulation analysis

Related resources

Apache Mahout: machine learning for Hadoop … lots of matrices there …
Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf


Page 63: MapReduce for scientific simulation analysis

Way too much stuff!

I hope to keep expanding this tutorial over the week… Keep checking the git repo.
