mapreduce for scientific simulation analysis
DESCRIPTION
A tutorial I gave at Stanford on how to use MapReduce-style computations for simulation data.
TRANSCRIPT
A hands-on introduction to scientific data analysis with Hadoop: a matrix computations perspective
DAVID F. GLEICH, PURDUE UNIVERSITY ICME MAPREDUCE WORKSHOP @ STANFORD
MRWorkshop David Gleich · Purdue 1
Who is this for?
workshop project groups
those curious about “MapReduce” and “Hadoop”
those who think about problems as matrices
What should you get out of it?
1. understand some problems that MapReduce solves effectively
2. learn techniques to solve them using Hadoop and dumbo
3. learn some Hadoop words
What you won’t learn …
the latest and greatest in MapReduce algorithms
how to improve the performance of your Hadoop job
how to write wordcount in Hadoop
Slides will be online soon. Code samples and short tutorials at github.com/dgleich/mrmatrix
1. HPC vs. Data (redux)
2. MapReduce vs. Hadoop
3. Dive into Hadoop with Hadoop streaming
4. Sparse matrix methods with Hadoop
High performance computing vs. Data intensive computing
Supercomputer: 224k cores, 10 PB drive (45 GB/core), 1.7 Pflops, 7 MW, custom interconnect, $104 M
Data cluster: 80k cores, 50 PB drive (625 GB/core), ? Pflops, ? MW, GB ethernet, $?? M
icme-hadoop1: 12 nodes; 4-core i7 processor, 24 GB/node, 1 GB ethernet; 12 TB/node (3,000 GB/core), 50 TB usable space (3x redundancy)
MapReduce is designed to solve a different set of problems
Supercomputer Data computing cluster Engineer
Each multi-day HPC simulation generates gigabytes of data.
A data cluster can hold hundreds or thousands of old simulations …
… enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.
MapReduce and Hadoop overview
The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
Map function f must be side-effect free.
Reduce function g must be side-effect free.
The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
All map functions can be done in parallel.
All reduce functions (for key k) can be done in parallel.
The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Shuffle: group all pairs with key k together (sorting suffices)
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
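The stages above can be sketched as a toy, single-process model in plain Python (the names `map_reduce`, `f`, and `g` are illustrative, not Hadoop API):

```python
def map_reduce(records, f, g):
    """Toy single-process model of Map, Shuffle, and Reduce."""
    # Map: apply f to every (key, value) pair; f yields new pairs
    mapped = [kv for key, value in records for kv in f(key, value)]
    # Shuffle: group all pairs with key k together (sorting suffices)
    groups = {}
    for key, value in sorted(mapped):
        groups.setdefault(key, []).append(value)
    # Reduce: apply g to all values with key k (for all k)
    return [kv for key, values in sorted(groups.items())
            for kv in g(key, values)]

# Example: sum values per key
def f(key, value):
    yield key, value          # identity map

def g(key, values):
    yield key, sum(values)    # aggregate per key

pairs = [('a', 1), ('b', 2), ('a', 3)]
# map_reduce(pairs, f, g) -> [('a', 4), ('b', 2)]
```

A real framework runs the map calls and the per-key reduce calls on different machines; the toy model only shows the dataflow.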
Mesh point variance in MapReduce
Three simulation runs (Run 1, Run 2, Run 3), each with snapshots at T=1, T=2, T=3.
Mesh point variance in MapReduce
1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
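A sketch of those three steps as map and reduce functions (the record layout "run timestep meshpoint temperature" is an assumption for illustration; the reducer computes the usual unbiased sample variance):

```python
def mapper(key, value):
    # value: "run timestep meshpoint temperature"
    run, t, point, temp = value.split()
    # same mesh point and timestep across runs -> same key
    yield (point, t), float(temp)

def reducer(key, values):
    # all samples for one mesh point at one timestep, across runs
    vals = list(values)
    n = len(vals)
    mean = sum(vals) / n
    # unbiased sample variance
    yield key, sum((v - mean) ** 2 for v in vals) / (n - 1)
```

The shuffle between the two functions is what brings all runs' values for one mesh point together.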
MapReduce vs. Hadoop
MapReduce: a computation model with Map (a local data transform), Shuffle (a grouping function), and Reduce (an aggregation).
Hadoop: an implementation of MapReduce using the HDFS parallel file system.
Others: Phoenix++, Twister, Google MapReduce, Spark, …
Why so many limitations?
Data scalability
The idea: bring the computations to the data. MR can schedule map functions without moving data.
[Diagram: five input chunks spread across the cluster; map tasks run on the nodes holding chunks 1-5, then the shuffle moves map output to the reducers.]
Mesh point variance in MapReduce
1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
Bring the computations to the data!
After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.
heartbreak on node rs252
Fault tolerance
Redundant input helps make maps data-local. There is just one type of communication: the shuffle.
Input stored in triplicate. Map output persisted to disk before the shuffle. Reduce input/output on disk.
Performance results (simulated faults)
We can still run with P(fault) = … with only ~… performance penalty. However, with P(fault) small, we still see a performance hit.
Fault injection
[Plot: time to completion (sec) vs. 1/Prob(failure), the mean number of successes per failure, over 10-1000; curves for No faults (200M by 200), Faults (200M by 200), No faults (800M by 10), Faults (800M by 10).]
With 1/5 tasks failing, the job only takes twice as long.
Diving into Hadoop (with python)
Tools I like
hadoop streaming, dumbo, mrjob, hadoopy, C++
Tools I don’t use but other people seem to like …
pig, java, hbase, Eclipse, Cassandra
hadoop streaming
the map function is a program: (key, value) pairs are sent via stdin; output (key, value) pairs go to stdout
the reduce function is a program: (key, value) pairs are sent via stdin; keys are grouped; output (key, value) pairs go to stdout
dumbo: a wrapper around Hadoop streaming for writing map and reduce functions in Python
#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text.
    key = <byte offset where the line starts in the file>
    value = <line of text>
    """
    valarray = [float(v) for v in value.split()]
    yield key, sum(valarray)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, dumbo.lib.identityreducer)
How can Hadoop streaming possibly be fast?

Hadoop streaming frameworks | Iter 1 QR (secs.) | Iter 1 Total (secs.) | Iter 2 Total (secs.) | Overall Total (secs.)
Dumbo   | 67.725 |  960 | 217 | 1177
Hadoopy | 70.909 |  612 | 118 |  730
C++     | 15.809 |  350 |  37 |  387
Java    | n/a    |  436 |  66 |  502

Synthetic data test: a 100,000,000-by-500 matrix (~500 GB); computing the R in a QR factorization. Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+Atlas. Custom C++ TypedBytes reader/writer with Atlas. A new non-streaming Java implementation too. All timing results are from the Hadoop job tracker.
C++ in streaming beats a native Java implementation.
David Gleich (Sandia), MapReduce 2011
Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!
How does Hadoop know key = <byte offset in file>, value = <line of text>?
InputFormat: map a file on HDFS to (key, value) pairs
TextInputFormat: map a text file to (<byte offset>, <line>) pairs
The Hadoop Distributed File System (HDFS) and a big text file
HDFS stores files in 64 MB chunks. Each chunk is a FileSplit. FileSplits are stored in parallel. An InputFormat converts FileSplits into a sequence of key-value records. FileSplits can cross record borders (a small bit of communication).
Tall-and-skinny matrix storage in MapReduce
A: m x n, m ≫ n
The key is an arbitrary row-id. The value is the 1 x n array for a row. Each submatrix Ai is an InputSplit (the input to a map task).
hadoop: output the row-sum for all local rows
MPI: parallel load for my-batch-of-rows; compute the row-sum; parallel save
Isn’t reading and writing text files rather inefficient?
Sequence Files and OutputFormat
SequenceFile: an internal Hadoop file format to store (key, value) pairs efficiently. Used between map and reduce steps.
OutputFormat: map (key, value) pairs to output on disk
TextOutputFormat: map (key, value) pairs to key\tvalue strings
typedbytes
A simple binary serialization scheme: [<1-byte-type-flag> <binary-value>]*. Roughly equivalent to JSON. (Optionally) used to communicate to and from Hadoop streaming.
typedbytes example
def _read(self):
    t = unpack_type(self.file.read(1))[0]
    self.t = t
    return self.handler_table[t](self)

def read_vector(self):
    r = self._read
    count = unpack_int(self.file.read(4))[0]
    return tuple(r() for i in xrange(count))
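To make the reader above concrete, here is a self-contained round-trip sketch covering just two type flags (6 = double, 8 = vector, per the typedbytes convention used with Hadoop streaming; verify the flag values against the spec before relying on them):

```python
import struct
from io import BytesIO

def write_double(out, x):
    out.write(b'\x06' + struct.pack('>d', x))           # flag 6: big-endian double

def write_vector(out, vals):
    out.write(b'\x08' + struct.pack('>i', len(vals)))   # flag 8: vector + count
    for v in vals:
        write_double(out, v)

def read(inp):
    t = inp.read(1)[0]
    if t == 6:
        return struct.unpack('>d', inp.read(8))[0]
    if t == 8:
        n = struct.unpack('>i', inp.read(4))[0]
        return tuple(read(inp) for _ in range(n))
    raise ValueError('unhandled typedbytes flag %d' % t)

buf = BytesIO()
write_vector(buf, [1.5, -2.0])
buf.seek(0)
# read(buf) -> (1.5, -2.0)
```

The one-byte flag before every value is what lets a reader dispatch through a handler table, exactly as in the `_read` snippet above.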
Demo 2 Column sums
Column sums in dumbo
#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer)
Isn’t this just moving the data to the computation?
Yes. It seems much worse than MPI.
MPI: parallel load for my-batch-of-rows; update the sum of each column; parallel reduce the partial column sums; parallel save
The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Combine: apply g to local values with key k
Shuffle: group all pairs with key k together
Reduce: apply a function g to all values with key k
Output: a list of (key, value) pairs
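A toy model of why the combiner helps for column sums: each mapper pre-aggregates locally, so the shuffle carries one partial sum per column per mapper instead of one record per matrix entry (names are illustrative, not dumbo API):

```python
def column_sums(partitions):
    # Map + Combine: each partition pre-aggregates its own column sums
    partials = []
    for rows in partitions:
        local = {}
        for row in rows:
            for col, val in enumerate(row):
                local[col] = local.get(col, 0.0) + val
        partials.append(local)      # shuffled records: one per (mapper, column)
    # Shuffle + Reduce: merge the partial sums per column
    totals = {}
    for local in partials:
        for col, s in local.items():
            totals[col] = totals.get(col, 0.0) + s
    return totals

# column_sums([[[1, 2], [3, 4]], [[5, 6]]]) -> {0: 9.0, 1: 12.0}
```

Because summation is associative, combining locally and then reducing gives the same answer with far less shuffle traffic.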
Column sums in dumbo
#!/usr/bin/env dumbo
def mapper(key, value):
    """Each record is a line of text."""
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer, combiner=reducer)
How many mappers and reducers?
The number of maps is the number of InputSplits. You choose how many reducers. Each reducer outputs to a separate file.
Demo 3 Column sums with multiple reducers
Which reducer does my key go to?
Partitioner: map a given key to a reducer
HashPartitioner: distribute keys across reducers by hashing
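The default routing can be sketched in one line. With integer keys, Python's `hash` is the identity, so the example below is deterministic (Hadoop's HashPartitioner hashes the key's bytes, so the exact assignment differs):

```python
def hash_partition(key, num_reducers):
    # same key always routes to the same reducer
    return hash(key) % num_reducers

# every copy of a key lands in the same partition
keys = [7, 12, 7, 3, 12]
parts = [hash_partition(k, 2) for k in keys]
# parts -> [1, 0, 1, 1, 0]
```

The only property MapReduce needs is that the assignment is a pure function of the key, so all values for one key meet at one reducer.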
Sparse matrix methods
Storing a matrix by rows
Row 1: (2,16.) (3,13.)
Row 2: (3,10.) (4,12.)
Row 3: (2,4.) (5,14.)
Row 4: (3,9.) (6,20.)
Row 5: (4,7.) (6,4.)
Row 6: (empty)
Storing a matrix by rows in a text-file
Row 1: (2,16.) (3,13.)
Row 2: (3,10.) (4,12.)
Row 3: (2,4.) (5,14.)
Row 4: (3,9.) (6,20.)
Row 5: (4,7.) (6,4.)
Row 6: (empty)
Sparse matrix-vector product
The matrix:
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6
The vector:
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89
[Ax]_i = sum_j A_ij x_j
To make this work, we need to get the value of the vector to the same function as the column of the matrix.
Sparse matrix-vector product
The matrix:
1 (2,16.) (3,13.)
2 (3,10.) (4,12.)
3 (2,4.) (5,14.)
4 (3,9.) (6,20.)
5 (4,7.) (6,4.)
6
The vector:
1 2.1
2 -1.3
3 0.5
4 0.6
5 -1.2
6 0.89
[Ax]_i = sum_j A_ij x_j
We need to “join” the matrix and vector based on the column.
Sparse matrix-vector product takes two MR tasks.
Task 1 (form Aij xj for each nonzero). Two types of records!
Map: if vector, emit (row, vecval); if matrix, for each non-zero (row, col, val), emit (col, (row, val))
Reduce: find vecval among the inputs (one of these values is not like the others); for each (col, (row, val)), emit (row, val*vecval)
Task 2 (regroup data by rows, compute sums).
Map: identity
Reduce: (row, [(Aij xj), …]) -> emit (row, sum(Aij xj))
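The two passes can be simulated locally with the example matrix and vector from the earlier slide (dicts stand in for the shuffle; this is a sketch of the dataflow, not Hadoop code):

```python
def spmv(matrix_entries, vector):
    """matrix_entries: iterable of (row, col, val); vector: {col: x_j}."""
    # Task 1 map: matrix record -> keyed by column
    by_col = {}
    for row, col, val in matrix_entries:
        by_col.setdefault(col, []).append((row, val))
    # Task 1 reduce: pair each column's entries with x_j, emit (row, val*x_j)
    products = []
    for col, xj in vector.items():
        for row, val in by_col.get(col, []):
            products.append((row, val * xj))
    # Task 2 reduce: regroup by row and sum
    y = {}
    for row, p in products:
        y[row] = y.get(row, 0.0) + p
    return y

A = [(1, 2, 16.), (1, 3, 13.), (2, 3, 10.), (2, 4, 12.),
     (3, 2, 4.), (3, 5, 14.), (4, 3, 9.), (4, 6, 20.),
     (5, 4, 7.), (5, 6, 4.)]
x = {1: 2.1, 2: -1.3, 3: 0.5, 4: 0.6, 5: -1.2, 6: 0.89}
# spmv(A, x)[1] is 16*(-1.3) + 13*0.5, approximately -14.3
```

Note that the products for one output row are formed at many different reducers in pass 1, which is why the second grouping pass is needed.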
What about a “dense” row?
Map: if vector, emit (row, vecval); if matrix, for each non-zero (row, col, val), emit (col, (row, val))
Reduce: find vecval among the inputs (one of these values is not like the others); for each (col, (row, val)), emit (row, val*vecval)
How do we find vecval without looking through (and buffering) all the input? (This step forms Aij xj for each nonzero.)
Sparse matrix-vector product takes two MR tasks. Two types of records!
Map: if vector, emit ((row, -1), vecval); if matrix, for each non-zero (row, col, val), emit ((col, 0), (row, val))
Reduce: find vecval in the input keys; for each (col, (row, val)), emit (row, val*vecval)
Use a custom partitioner to make sure that (row, *) all get mapped to the same reducer, and that we always see (row, -1) before (row, 0).
Task 1 forms Aij xj for each nonzero; task 2 regroups the data by rows and computes sums.
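A sketch of the composite-key trick: partition on the first key component only, sort on the full key, and then the reducer can stream with O(1) memory because (col, -1) always precedes (col, 0). The names are illustrative; real Hadoop does this with a custom Partitioner plus the sort comparator:

```python
def partition(key, num_reducers):
    col, flag = key
    return col % num_reducers     # ignore the flag: (col,-1) and (col,0) co-locate

def streaming_reduce(sorted_pairs):
    # sorted() puts ((col, -1), xj) before every ((col, 0), (row, val))
    xj = None
    for (col, flag), value in sorted_pairs:
        if flag == -1:
            xj = value            # remember the vector entry: O(1) memory
        else:
            row, val = value
            yield row, val * xj   # matrix entry: emit the product

pairs = [((3, 0), (1, 9.0)), ((3, -1), 0.5), ((3, 0), (4, 2.0))]
out = list(streaming_reduce(sorted(pairs)))
# out -> [(1, 4.5), (4, 1.0)]
```

This is the standard "secondary sort" pattern: the partitioner controls which reducer sees a key, while the sort order controls the sequence within the reducer.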
Demo 4: sparse matrix-vector products
Matrix factorizations
[Diagram: Mapper 1 runs serial TSQR on blocks A1-A4, compressing each QR step to an R factor and emitting R4; Mapper 2 does the same on A5-A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]
Algorithm: the data are the rows of a matrix.
Map: QR factorization of rows
Reduce: QR factorization of rows
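The tree above, collapsed into one local sketch with NumPy (assumes NumPy is available; R from QR is unique only up to the signs of its rows, so the check compares absolute values):

```python
import numpy

def tsqr_R(blocks):
    # Map: local QR of each row block, keep only the small R factor
    Rs = [numpy.linalg.qr(A, mode='r') for A in blocks]
    # Reduce: stack the R factors and take one more QR
    return numpy.linalg.qr(numpy.vstack(Rs), mode='r')

rng = numpy.random.default_rng(0)
A = rng.standard_normal((1000, 4))        # tall and skinny
R_tree = tsqr_R(numpy.array_split(A, 8))  # 8 "mappers"
R_flat = numpy.linalg.qr(A, mode='r')
# abs(R_tree) and abs(R_flat) agree to machine precision
```

This works because stacking the R factors preserves the Gram matrix of A, which is exactly why TSQR needs only the small n x n factors to move through the shuffle.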
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
Related resources
Apache Mahout: machine learning for Hadoop. Lots of matrices there.
Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf
Way too much stuff!
I hope to keep expanding this tutorial over the week… Keep checking the git repo.