Effortless and Efficient Distributed Data-partitioning in Linear Algebra

Carlos de Blas Cartón
Dpto. Informática, University of Valladolid, Spain
Email: [email protected]

Arturo Gonzalez-Escribano
Dpto. Informática, University of Valladolid, Spain
Email: [email protected]

Diego R. Llanos
Dpto. Informática, University of Valladolid, Spain
Email: [email protected]
Abstract—This paper introduces a new technique to exploit compositions of different data-layout techniques with Hitmap, a library for hierarchical tiling and automatic mapping of arrays. We show how Hitmap is used to implement block-cyclic layouts for a parallel LU decomposition algorithm. The paper compares the well-known ScaLAPACK implementation of LU, as well as other carefully optimized MPI versions, with a Hitmap implementation. The comparison is made in terms of both performance and code length. Our results show that the Hitmap version outperforms the ScaLAPACK implementation and is almost as efficient as our best manual MPI implementation. The insertion of this composition technique in the automatic data-layouts of Hitmap allows the programmer to develop parallel programs with both a significant reduction of the development effort and a negligible loss of efficiency.
Index Terms—Automatic data partition; layouts; distributed systems.
I. INTRODUCTION
Data distribution and layout is a key part of any parallel algorithm in distributed environments. To improve performance and scalability, many parallel algorithms need to compose different data-partition and layout techniques at different levels. A typical example is the block-cyclic partition, useful in many linear algebra problems, such as the LU decomposition. Block-cyclic can be seen as a composition of two basic layouts: a blocking layout (dividing data into same-sized blocks) and a cyclic layout (distributing blocks cyclically to processes).
LU decomposition is a matrix factorization that transforms a matrix into the product of a lower triangular matrix and an upper triangular matrix. Applications of this decomposition include solving linear equations, matrix inverse calculation, and determinant computation.

While many languages offer a predefined set of one-level, data-parallel partition techniques (e.g. HPF [1], OpenMP [2], UPC [3]), it is not straightforward to use them to create a multiple-level distributed layout that adapts automatically to different grain levels. On the other hand, message-passing interfaces for distributed environments (such as MPI [4]) make it possible to develop programs carefully tuned for a given platform, but at the cost of a high development effort [5], [6]. The reason is that these interfaces only provide basic, simple tools to calculate the data distribution and communication details.

In this paper we present a technique to introduce compositions of different layout techniques at different levels in Hitmap, a library for hierarchical tiling and mapping of arrays.
The Hitmap library is a basic tool in the back-end of the Trasgo compilation system [7]. It is designed as a supporting library for parallel compilation techniques, offering a collection of functionalities for management, mapping and communication of tiles. Hitmap has an extensibility philosophy similar to that of the Fortress framework [10], and it can be used to build higher levels of abstraction. Other tiling-array libraries, offering abstract data types for distributed tiles (such as HTA [8]), could be reprogrammed and extended using Hitmap. The tile decomposition in Hitmap is based on index-domain partitions, in a similar way as proposed in the Chapel language [9].
In Hitmap, data-layout and load-balancing techniques are independent modules that belong to a plug-in system. The techniques are invoked from the code and applied at run-time when needed, using internal information about the target system topology to distribute the data. The programmer does not need to reason in terms of the number of physical processors. Instead, they use highly abstract communication patterns for the distributed tiles at any grain level. Thus, coding and debugging operations with entire data structures are easy. We show how the use of layout compositions at different levels, together with hierarchical Hitmap tiles, can also automatically benefit the sequential part of the code, improving data locality and exploiting the memory hierarchy for better performance. This composition technique makes it possible to use Hitmap to program highly efficient distributed computations with low development cost.
We use the LU decomposition as a case study, focusing on the ScaLAPACK implementation as reference. ScaLAPACK is a software library of high-performance linear algebra routines for distributed-memory, message-passing clusters, built on top of a message-passing interface (such as MPI or PVM). ScaLAPACK is a parallel version of the LAPACK project, which in turn is the successor of the LINPACK benchmark to measure supercomputer performance. Currently, there exists another parallel evolution of LINPACK, named HPL (High Performance LINPACK). HPL is a benchmark to measure performance in modern supercomputers, and the 500 fastest systems are published on the top500.org website. Thus, the algorithms and implementations used in ScaLAPACK are a key reference to measure performance in supercomputing environments. Our experimental results show that the Hitmap implementation of LU offers better performance than the ScaLAPACK version, and almost the same performance as a carefully-tuned, manual MPI implementation, with a significantly lower development effort.
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
l_{21} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \cdots & 1
\end{pmatrix}
\times
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0 & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & u_{nn}
\end{pmatrix}
Figure 1: LU decomposition.
II. A CASE STUDY: THE LU ALGORITHM
A. LU decomposition
LU decomposition [11], [12] writes a matrix A as a product
of two matrices: a lower triangular matrix L with 1’s in the
diagonal and an upper triangular matrix U. See Fig. 1.
LU decomposition is used to solve systems of linear equa-
tions or calculate determinants. The advantage of solving
systems with this method is that, once the matrix is factorized,
it can be used to compute solutions for different right-hand
sides.
As shown in Fig. 2, there are three different forms of designing the LU matrix factorization: right-looking, left-looking, and Crout LU. There exist iterative and recursive versions of the associated algorithms. In this paper we focus on the iterative, right-looking version, since it provides better load balancing, as we will see in Sect. IV.
Figure 2 shows the state associated with a particular iteration of the corresponding algorithms. Striped portions refer to elements that will get their final value at the end of the iteration. Grey areas refer to elements read in left-looking and Crout, and to elements updated in right-looking. These elements will be modified in subsequent iterations.
Algorithm 1 shows the right-looking LU decomposition. As can be seen, the algorithm may be implemented with three loops. It is worthwhile to note that the division in line 4 may cause precision errors if the divisor is close to 0 when using finite precision. Thus, some implementations perform row-swapping to avoid small divisors. In each iteration the algorithm searches for the greatest a_ik with i > k and swaps the i-th and k-th rows in the matrix and vector. After solving the system, it is necessary to undo the row swapping in the solution vector.
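As an illustration only, a minimal C sketch of such a pivoting step follows (the function and array names are our own, not taken from any of the paper's codes):

#include <math.h>

/* Search the largest |a_ik| with i >= k in column k of a row-major
 * n x n matrix, then swap rows k and p in the matrix, in the
 * right-hand side vector b, and in the permutation record perm. */
void partial_pivot( double *a, double *b, int *perm, int n, int k ) {
    int p = k;
    for ( int i = k + 1; i < n; i++ )
        if ( fabs( a[i * n + k] ) > fabs( a[p * n + k] ) ) p = i;
    if ( p != k ) {
        for ( int j = 0; j < n; j++ ) {   /* swap rows k and p of A */
            double t = a[k * n + j];
            a[k * n + j] = a[p * n + j];
            a[p * n + j] = t;
        }
        double tb = b[k]; b[k] = b[p]; b[p] = tb;          /* swap rhs */
        int tp = perm[k]; perm[k] = perm[p]; perm[p] = tp; /* record swap */
    }
}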
Figure 2: Different versions of LU decomposition (left-looking, right-looking, and Crout LU). Activity at a middle iteration.
Algorithm 1 LU right-looking decomposition.
Require: A, an n × n matrix
Ensure: L and U, n × n matrices overwriting A, storing neither the zeros of both matrices nor the ones in the diagonal of L
 1: for k = 0 to n − 1 do
 2:   {elements of matrix L}
 3:   for i = k + 1 to n − 1 do
 4:     a_ik ← a_ik / a_kk
 5:   end for
 6:   {update rest of elements}
 7:   for j = k + 1 to n − 1 do
 8:     for i = k + 1 to n − 1 do
 9:       a_ij ← a_ij − a_kj · a_ik
10:     end for
11:   end for
12: end for
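The following minimal sequential C sketch mirrors Algorithm 1 (no pivoting; the function name and the row-major array layout are our assumptions):

/* In-place right-looking LU: on return, the strict lower triangle of a
 * holds L (its unit diagonal is implicit) and the upper triangle holds U.
 * a is a row-major n x n matrix. */
void lu_right_looking( double *a, int n ) {
    for ( int k = 0; k < n; k++ ) {
        /* elements of matrix L (line 4 of Algorithm 1) */
        for ( int i = k + 1; i < n; i++ )
            a[i * n + k] /= a[k * n + k];
        /* update rest of elements (line 9 of Algorithm 1) */
        for ( int j = k + 1; j < n; j++ )
            for ( int i = k + 1; i < n; i++ )
                a[i * n + j] -= a[k * n + j] * a[i * n + k];
    }
}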
B. Solution of a system of linear equations
After the decomposition, it is possible to solve the system Ax = b ≡ LUx = b in two steps. Let Ux = y. First we solve Ly = b using forward substitution, and then we solve Ux = y using backward substitution.
Definition. A system $Ax = b$ with $A$ lower triangular is solved using forward substitution as follows:
$$x_0 = \frac{b_0}{a_{00}}, \qquad x_i = \frac{b_i - \sum_{j=0}^{i-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = 1, \ldots, n-1.$$

Definition. A system $Ax = b$ with $A$ upper triangular is solved using backward substitution as follows:
$$x_{n-1} = \frac{b_{n-1}}{a_{n-1,n-1}}, \qquad x_i = \frac{b_i - \sum_{j=i+1}^{n-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = n-2, \ldots, 0.$$
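A compact C sketch of both substitutions, applied to the packed factors produced by Algorithm 1 (names are our own; since L has an implicit unit diagonal, the forward phase needs no division):

/* Solve L y = b and then U x = y, where lu packs both factors of an
 * n x n row-major matrix as produced by Algorithm 1. */
void lu_solve( const double *lu, const double *b, double *x, int n ) {
    for ( int i = 0; i < n; i++ ) {          /* forward substitution */
        double s = b[i];
        for ( int j = 0; j < i; j++ ) s -= lu[i * n + j] * x[j];
        x[i] = s;                            /* unit diagonal of L */
    }
    for ( int i = n - 1; i >= 0; i-- ) {     /* backward substitution */
        double s = x[i];
        for ( int j = i + 1; j < n; j++ ) s -= lu[i * n + j] * x[j];
        x[i] = s / lu[i * n + i];
    }
}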
The cost of the decomposition is O(n^3) and the cost of the two substitutions is O(n^2). Summing up, the cost of the whole algorithm is O(n^3) and the cost of the solution is not significant.
Figure 3: Examples of one-dimensional data distributions: (a) block distribution, (b) cyclic distribution, (c) block-cyclic distribution.
Thus, LU decomposition is useful when solving several right-hand sides. The solving algorithm can be implemented browsing the matrix by rows or by columns, with the same asymptotic cost.
III. BLOCK-CYCLIC LAYOUT
Parallel implementations of algorithms need to distribute the data across different processes. The way of distributing the data is known as the layout. Each layout provides good balance for a different family of algorithms, so it is very important to choose the most suitable one. For our case study, the most appropriate is the block-cyclic layout. This layout can be seen as the composition of the following layouts. Let N be the number of elements and P the number of processes.
1) Block layout (Fig. 3a). Each process receives a block of ⌈N/P⌉ contiguous elements.
2) Cyclic layout (Fig. 3b). The data is distributed cyclically among the processes, so element i is associated to process i mod P.
The block-cyclic layout [13] is built in two stages: blocking and cyclic distribution (see Fig. 3c). First, a blocking is made, but dividing data into portions of the same size S, not necessarily equal to ⌈N/P⌉. Then, the blocks are distributed cyclically to processes. Thus, element i is associated to process ⌊i/S⌋ mod P. These three layouts can be easily extended to process arrangements in n dimensions.
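These element-to-process mappings can be written down directly; the following C helpers are an illustrative sketch (the names are ours):

/* Owner process of element i under each one-dimensional layout,
 * for N elements and P processes. */
int owner_block( int i, int N, int P ) {
    int B = ( N + P - 1 ) / P;        /* ceil(N/P) elements per block */
    return i / B;
}
int owner_cyclic( int i, int P ) {
    return i % P;
}
/* Block-cyclic with block size S: blocks are dealt out cyclically. */
int owner_block_cyclic( int i, int S, int P ) {
    return ( i / S ) % P;
}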
IV. RIGHT-LOOKING LU IN DISTRIBUTED ENVIRONMENTS
Among the three LU decomposition algorithms described
above, the right-looking LU is the one that most benefits from
load-balancing techniques when implemented in distributed
environments. For this algorithm, block-cyclic layout is the
most appropriate load-balancing technique. The cyclic distri-
bution of the blocks ensures that all processes have roughly
the same workload, because the trailing matrix to be updated
is distributed among processes. In addition, distributing blocks
instead of single elements improves data locality [14], [15].
Left-looking and Crout versions are not so useful in distributed environments, since it is much harder to find a suitable data distribution.
Figure 4: Steps in a blocked LU parallel algorithm.
In the left-looking LU algorithm, since only one column per iteration is updated, only the column of processes which owns that column performs computations, whereas the rest only communicate. Something similar happens with Crout.
Besides, right-looking implies less communication than the other versions, because only a column and a row are needed per iteration to update the trailing matrix, whereas in the other versions, almost the whole matrix is locally needed.
The right-looking LU algorithm is usually computed in process arrangements of 2 dimensions, also called a grid of processes. In Alg. 2, we can see an outline of a block-cyclic, right-looking, parallel LU decomposition algorithm. Functions minedim1(i) and minedim2(i) in that algorithm return TRUE if block index i is assigned to the local process in the first or second grid dimension, respectively. The algorithm iterates by blocks, and each iteration is divided into four steps, as shown in Fig. 4.
The main steps of the factorization algorithm are shown in Fig. 4. In the figure, dark grey portions refer to elements updated and light grey ones to elements read in each step.
1) Factorize the diagonal block (Fig. 4.1).
2) Update blocks in the same column of the diagonal block
in matrix L, using the diagonal block (Fig. 4.2).
3) Update blocks in the same row of the diagonal block in
matrix U, using the diagonal block (Fig. 4.3).
4) Update trailing matrix using blocks previously updated
in steps 2 and 3 (Fig. 4.4).
Steps 2 to 4 may imply block communication among
processes. Since all updated blocks will be needed by other
processes of the same row or column, most communications
are in multicast mode.
Figure 5 shows how the processes are arranged into an R × C grid. In a given iteration, each process in the grid performs exactly one of the following four tasks (see Alg. 2).
Algorithm 2 LU parallel algorithm. Input: n × n matrix A and n × 1 vector b. Output: n × n matrices L and U overwriting A and stored together, ruling out the zeros and the ones of the diagonal of L, and n × 1 vector x overwriting b. The matrices are partitioned in S × S blocks and the processes form an R × C grid.
 1: for k = 0 to ⌈N/S⌉ − 1 do {Iterating by blocks}
 2:   if minedim1(k) ∧ minedim2(k) then
 3:     {TASK TYPE 1: Process with the diagonal block}
 4:     Factorize diagonal block A[k,k]
 5:     Update blocks A[i,k] with i > k with block A[k,k]
 6:     Update blocks A[k,j] with j > k with block A[k,k]
 7:     Update blocks of submatrix
 8:   else if minedim1(k) then
 9:     {TASK TYPE 2: Process in the same column of the grid as the diagonal block process}
10:     Update blocks A[i,k] with i > k with block A[k,k]
11:     Update blocks of submatrix
12:   else if minedim2(k) then
13:     {TASK TYPE 3: Process in the same row of the grid as the diagonal block process}
14:     Update blocks A[k,j] with j > k with block A[k,k]
15:     Update blocks of submatrix
16:   else
17:     {TASK TYPE 4: Process neither in the same row nor in the same column of the grid as the diagonal block process}
18:     Update blocks of submatrix
19:   end if
20: end for
1) There is one process that owns the diagonal block. This process executes the four steps and communicates in steps 2, 3, and 4 (process P[1,2] in Fig. 5).
2) There are R − 1 processes that own blocks in the same column as the diagonal block. These processes execute steps 2 and 4. In step 4 they communicate the blocks updated in step 2, receiving updated blocks that belong to the same row as the diagonal block (processes P[0,2], P[2,2], and P[3,2] in Fig. 5).
3) There are C − 1 processes that own blocks in the same row as the diagonal block. These processes execute steps 3 and 4. In step 4 they communicate the blocks updated in step 3, receiving updated blocks that belong to the same column as the diagonal block (processes P[1,0], P[1,1], P[1,3], and P[1,4] in Fig. 5).
4) The remaining (R ∗ C −R− C + 1) processes execute
step 4, receiving the blocks updated in steps 2 and 3
from one process on the same row and one process on
the same column.
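For illustration, the task type of the local process at block iteration k can be derived from its grid coordinates. The sketch below assumes a block-cyclic layout in which block row k is owned by grid row k mod R (the names are ours, not the Hitmap API):

typedef enum { TASK_DIAGONAL, TASK_SAME_COLUMN, TASK_SAME_ROW, TASK_REMAINING } TaskType;

/* Classify the local process (myRow, myCol) in an R x C grid for the
 * iteration whose diagonal block is block (k, k). */
TaskType classify( int k, int myRow, int myCol, int R, int C ) {
    int diagRow = k % R;              /* grid row owning block row k */
    int diagCol = k % C;              /* grid column owning block column k */
    if ( myRow == diagRow && myCol == diagCol ) return TASK_DIAGONAL;
    if ( myCol == diagCol ) return TASK_SAME_COLUMN;
    if ( myRow == diagRow ) return TASK_SAME_ROW;
    return TASK_REMAINING;
}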
A. An example
In Fig. 6a we can see an example of an n × n matrix, divided in 7 × 7 blocks and distributed to a 2 × 2 grid of processes. In this example, the computation is at the start of the second iteration (blocks in column 0 and row 0 are already updated). Figure 6b shows the communications needed at this point. The diagonal block is B[1,1] and belongs to process P[1,1]. Process P[1,1] factorizes this block and sends it to processes P[1,0] and P[0,1].
Figure 5: Example of a 4 × 5 grid of processes where process P[1,2] owns the diagonal block.
Processes P[1,1] and P[1,0] update blocks in the same row
as the diagonal block in matrix U (blocks B[1,2], B[1,3], . . . ,
B[1,6]) and processes P[1,1] and P[0,1] update blocks in the
same column as the diagonal block in matrix L (blocks B[2,1],
B[3,1], . . . , B[6,1]).
Then, all processes update blocks of the submatrix with block indexes B[2..6,2..6]. To do it, processes P[1,0] and P[0,1] need blocks from P[1,1], and process P[0,0] needs blocks from P[0,1] and P[1,0]. Remember the update expression a_ij = a_ij − a_kj · a_ik in line 9 of Alg. 1. A process only needs blocks of other processes in the same row or column in the grid. Thus, communications are limited to rows or columns.
The right-hand side vector has only one dimension and is mapped to a row or column of processes. Solution of the system using the LU factorization is straightforward and needs no further explanation.
(a) Distribution of blocks among processes P[0,0], P[0,1], P[1,0], and P[1,1]. (b) Communications needed in the example at iteration 2.
Figure 6: Example of a 7 × 7 blocks matrix distributed in a
two-dimensional block-cyclic layout.
V. TWO IMPLEMENTATION APPROACHES
In this section we describe two different approaches to implementing the LU decomposition in distributed systems. The first one uses a message-passing library directly. The well-known ScaLAPACK implementation is a good representative of this approach. Instead of managing data partition and MPI communications directly, a different approach is to use a library with a higher level of abstraction. We present an LU implementation using Hitmap, a library to efficiently map and communicate hierarchical tile arrays. Hitmap is used in the back-ends of the Trasgo compilation system [7]. After reviewing both approaches, we compare them in terms of performance and development effort.
A. ScaLAPACK
ScaLAPACK [16], [17] is a parallel library of high-performance linear algebra routines for message-passing systems. These routines are written in Fortran using the BLACS communications library. BLACS is defined as an interface, and has different implementations depending on the communications library used, such as PVM, MPI, MPL, and NX. ScaLAPACK is an evolution of the previous sequential library, LAPACK, designed for shared-memory and vector supercomputers. Their goals are efficiency, scalability, reliability, portability, flexibility, and ease of use.
One of the routines from ScaLAPACK, called pdgesv, computes the solution of a real system of linear equations Ax = b. This routine first computes a decomposition of an n × n matrix A into a product of two triangular matrices L · U, where L is a lower triangular matrix with ones in the diagonal and U is an upper triangular one. This decomposition then makes the later resolution of the system Ax = b easier, where x is the n × 1 vector solution of the system.
ScaLAPACK implements the non-recursive, right-looking version of the LU algorithm, since it is the best option for block-cyclic layouts, as explained before.
The matrices are distributed using a two-dimensional block-cyclic layout, and the local part of the matrices after distribution is stored contiguously in memory.
B. LU with Hitmap library
Hitmap is a library to support hierarchical tiling, mapping, and manipulation [7]. It is designed to be used with an
SPMD parallel programming language, allowing the creation,
manipulation, distribution and efficient communication of tiles
and tile hierarchies. The focus of Hitmap is to provide an
abstract layer for tiling in parallel computations. Thus, it may
be used directly, or to build higher-level constructions for tiling
in parallel languages.
Although Hitmap is designed with an object-oriented ap-
proach, it is implemented in C language. It defines several
object-classes implemented with C structures and functions.
A Signature, represented by a HitSig object, is a tuple of three numbers representing a selection of array indexes in a one-dimensional domain. The three numbers represent the starting index, the ending index, and a stride to specify a non-contiguous but regular subset of indexes. A Shape (HitShape object) is a set of n signatures representing a selection of indexes in a multidimensional domain. Finally, a Tile, represented by a HitTile object, is an array whose indexes are defined by a shape. Tiles may be declared directly, or as a selection of another tile, whose indexes are defined by a new shape. Thus, they may be created hierarchically or recursively.
Tiles have two coordinate systems to access their elements: (1) tile coordinates and (2) array coordinates. Tile coordinates always start at index 0 and have no stride. Array coordinates associate elements with their original domain indexes, as defined by the shape of the root tile. In Hitmap a tile may be defined with or without allocated memory. This makes it possible to declare and partition arrays without memory, finally allocating only the parts mapped to a given processor.
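As a purely illustrative sketch of these concepts (the struct and function names below are ours, not the actual Hitmap declarations), a signature maps tile coordinates to array coordinates:

/* A signature (begin, end, stride) selects begin, begin+stride, ... <= end. */
typedef struct { int begin, end, stride; } Sig;

/* Number of indexes selected by the signature (its tile-coordinate range). */
int sig_card( Sig s ) { return ( s.end - s.begin ) / s.stride + 1; }

/* Translate a tile coordinate (starting at 0, no stride) into the
 * corresponding array coordinate of the original domain. */
int sig_index( Sig s, int tileCoord ) { return s.begin + tileCoord * s.stride; }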
Hitmap also provides a HitTopology object-class to encapsu-
late simple functions to create virtual topologies from hidden
physical topology information, directly obtained by the library
from the lower implementation layer. HitLayout objects store
the results of the partition or distribution of a shape domain
over a topology of virtual processors. They are created by calling one of the layout plug-in modules included in the library.
 1 HitTopology topo = hit_topology( plug_topArray2DComplete );
 2 int numberBlocks = ceil_int( matrixSize / blockSize );
 3 HitTile_HitTile matrixTile, matrixSelection;
 4 hit_tileDomainHier( & matrixTile, sizeof( HitTile ), 2, numberBlocks, numberBlocks );
 5 HitLayout matrixLayout = hit_layout( plug_layCyclic,
 6                                      topo,
 7                                      hit_tileShape( matrixTile ) );
 8 hit_tileSelect( & matrixSelection, & matrixTile, hit_layShape( matrixLayout ) );
 9 hit_tileAlloc( & matrixSelection );
10 HitTile_double block;
11 hit_tileDomainAlloc( & block, sizeof( double ), 2, blockSize, blockSize );
12 hit_tileFill( & matrixSelection, & block );
13 hit_tileFree( block );
14 hit_comAllowDims( matrixLayout );
Figure 7: Hitmap code for data partitioning.
There are some predefined layout modules, but the plug-in system makes it easy to create new ones. The resulting layout objects have information about the local part of the input domain, neighbor relationships, and methods to get information about any other virtual process part. The layout modules encapsulate mapping decisions for load balancing and neighborhood information for either data or task parallelism. The information in the layout objects may be exploited in several abstract communication functionalities which build HitCom objects. These objects may be composed in reusable patterns represented by HitPattern objects. The library is built on top of the MPI communication library, internally exploiting several MPI techniques for efficient buffering, marshalling, and communication.
Block-cyclic layout in the Hitmap library
In this section we show how to use the Hitmap tools to implement a block-cyclic layout. We will use a two-level tile hierarchy and a cyclic Hitmap layout to distribute the upper-level blocks among processors.
Hitmap uses HitTile objects to represent multidimensional arrays, or array subselections. HitTile is an abstract type which is specialized with the base type of the desired array elements. Thus, HitTile_double is the type for tiles of double elements. In this work, we include a new derived type in the library: HitTile_HitTile. The base elements of tiles of this type are also tiles, of any derived type. Thus, heterogeneous hierarchies of tile types are easily created using HitTile_HitTile objects on the branches, and other specific derived types on the leaves of the structure. In Fig. 9 we show the internal structure of a two-level hierarchy of tiles. A HitTile_HitTile variable is internally a C structure that records the information needed to access the elements of the array and manipulate the whole tile. One of its fields is a pointer to the data; in this case, it is a pointer to an array of HitTile elements. We have added other information fields to the structure to record details about the hierarchical structure at a given level. Other Hitmap functionalities, related to the communication and input/output of tiles, have been reprogrammed to support marshalling/demarshalling of whole hierarchical structures.

The rest of this section shows how the HitTile_HitTile type is used to define and lay out aggregations of subselection tiles (e.g. array blocks), making it easy to apply compositions of layouts at different levels.
Data distribution
In Fig. 7 we show how the Hitmap tile creation and layout functions can be used in a parallel environment to automatically generate the structure represented in Fig. 9, with a HitTile_double element for each block assigned to the local parallel processor. Line 1 creates a virtual 2D topology of processors using a Hitmap plug-in. Topology plug-ins automatically obtain information from the physical topology. The number of blocks is calculated in line 2, using the matrix size and the block size parameters.

Line 3 declares the HitTile variables needed. In line 4, we define the domain of a tile representing the global array of blocks. This declaration does not allocate memory for the blocks, or for the upper-level array of tiles. Instead, it creates the HitTile template with the array sizes. It is used in line 5 to call a cyclic layout plug-in that distributes the tile elements among processors. The returned HitLayout object contains information about the elements assigned to any processor, specifically the local one. This data is used in line 8 to select the appropriate local elements of the original tile. The memory for these selected local blocks is allocated in line 9.
The following four lines define and allocate a new tile with the sizes of a block, using it as a pattern to fill up the local selection of the matrix of blocks. The function hit_tileFill makes clones of its second parameter for each data element of the first tile parameter. The last line activates the possibility of declaring 1D collective communications in the virtual topology associated with the layout.
Factorization
Figure 8 shows an excerpt of the matrix factorization code. Inside the main loop, we show the code for Task Type 2, which updates (a) the blocks of the L matrix in the same column as the diagonal block, and (b) the blocks of the trailing matrix.
Before the task code, we can see the declaration of Hitmap
communication objects used in that task during the given
iteration. Hitmap provides functions to declare HitCom objects
HitCom commRowSend, commRowRecv, commColSend, commColRecv, commBlockSend, commBlockRecv;
for( i = 0; i < numberBlocks; i++ ) {

    // 1. COMMUNICATION OBJECTS DECLARATION
    commBlockSend = ...;
    commBlockRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim0,
                        hit_shape( 1, hit_sigIndex( currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commRowSend = ...;
    commRowRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim1,
                        hit_shape( 1, hit_tileSigDimTail( matrixSelection, 1, currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commColSend = hit_comBroadcastDimSelect( matrixLayout, 1, hit_topDimRank( topo, 1 ),
                        & matrixSelection,
                        hit_shape( 2, hit_tileSigDimTail( matrixSelection, 0, currentBlock[0] ),
                                   hit_sigIndex( currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commColRecv = ...;

    // 2. COMPUTATION
    // 2.1. TASK TYPE 1: UPDATE DIAGONAL BLOCK, U ROW, L COLUMN, AND TRAILING MATRIX
    if ( hit_topAmIDim( topo, 0, diagProc[0] )
         && hit_topAmIDim( topo, 1, diagProc[1] ) ) {
        ...
    }

    // 2.2. TASK TYPE 2: UPDATE L COLUMN, AND TRAILING MATRIX
    else if ( hit_topAmIDim( topo, 1, diagProc[1] ) ) {
        hit_comDo( & commBlockRecv );
        for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
            updateMatrixL( matrixBlockAt( bufferDim1, 1, currentBlock[1] ),
                           matrixBlockAt( matrixSelection, j, currentBlock[1] ) );
        }
        hit_comDo( & commRowRecv );
        hit_comDo( & commColSend );
        for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
            for( k = currentBlock[1]; k < hit_tileDimCard( matrixSelection, 1 ); k++ ) {
                updateTrailingMatrix( matrixBlockAt( matrixSelection, j, currentBlock[1] ),
                                      matrixBlockAt( bufferDim1, 1, k ),
                                      matrixBlockAt( matrixSelection, j, k ) );
            }
        }
    }

    // 2.3. TASK TYPE 3: UPDATE U ROW, AND TRAILING MATRIX
    else if ( hit_topAmIDim( topo, 0, diagProc[0] ) ) {
        ...
    }
    // 2.4. TASK TYPE 4: UPDATE ONLY TRAILING MATRIX
    else {
        ...
    }
    hit_comFree( commRowSend ); hit_comFree( commRowRecv ); hit_comFree( commColSend );
    hit_comFree( commColRecv ); hit_comFree( commBlockSend ); hit_comFree( commBlockRecv );
}
Figure 8: Hitmap factorization code.
containing information about the subtile (set of blocks) to communicate, and the sender/receiver processor in virtual topology coordinates. Communications needed in this algorithm are always broadcasts of blocks along a given axis of the 2D virtual topology associated with the layout object. In the task code, the communications associated with a HitCom object are invoked, any number of times, with the hit_comDo primitive.
VI. EXPERIMENTAL RESULTS
In this section we evaluate the approaches described in the
previous section. The evaluation is carried out in terms of
performance and ease of programming.
ScaLAPACK's LU version is originally written in Fortran, while Hitmap is a C library. To ensure a fair comparison, we translated ScaLAPACK's LU version to C, using the MPI library for communications. Since the C language uses row-major storage instead of Fortran's column-major storage, we have inverted loops and indexes for matrix accesses, and also adapted the solving part of the algorithm. Moreover, the row-swapping operation is not fairly comparable in terms of performance with the C versions, due to the different storage order. Therefore, we have eliminated the row-swapping stage from both the LU ScaLAPACK routine and our C programs.
Figure 9: Two-level support for block-cyclic layouts in the Hitmap library.
This translation process also helped us to estimate the effort of developing a hand-made, right-looking, message-passing LU.
The hierarchical nature of a two-level Hitmap tile exploits memory access locality in a natural way. To isolate this effect, we developed two modified versions of the manual C code with different optimizations. These changes represent a good trade-off between development effort and performance. Both modified versions are described below.
Blocks browsing: The ScaLAPACK LU main loop iterates on a block-by-block basis, but the inner loops browse the remaining part of the matrix by columns. Thus, several block columns are updated before proceeding to the next column. More data locality can be achieved in the sequential parts of the code by browsing matrices by blocks, updating a whole block before proceeding to the following one, as sketched below.
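A sketch of this browsing change for the trailing-matrix update of iteration k, on a row-major n × n array (the name and loop-tiling structure are our assumptions, with block size S):

/* Update the trailing matrix block by block: a whole S x S block is
 * finished before moving on, improving reuse of the cached pivot-row
 * and pivot-column segments. */
void update_trailing_by_blocks( double *a, int n, int S, int k ) {
    for ( int ib = k + 1; ib < n; ib += S )           /* block rows */
        for ( int jb = k + 1; jb < n; jb += S )       /* block columns */
            for ( int i = ib; i < ib + S && i < n; i++ )
                for ( int j = jb; j < jb + S && j < n; j++ )
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
}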
Blocks storing: ScaLAPACK stores the local parts of the global matrix contiguously in memory: an n × m matrix is declared as a 2-dimensional n × m array in which all the column elements are contiguous. Storing each block contiguously and using a matrix of pointers to locate the blocks improves both locality and performance.
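A minimal allocation sketch of this storage scheme (names are ours; error checking omitted): each S × S block is contiguous, and a matrix of pointers locates the blocks.

#include <stdlib.h>

/* Allocate an nb x nb matrix of pointers, each pointing to a contiguous
 * S x S block. Element (i, j) of block (bi, bj) is read as
 * blocks[bi * nb + bj][i * S + j]. */
double **alloc_blocked_matrix( int nb, int S ) {
    double **blocks = malloc( (size_t)nb * nb * sizeof( double * ) );
    for ( int b = 0; b < nb * nb; b++ )
        blocks[b] = malloc( (size_t)S * S * sizeof( double ) );
    return blocks;
}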
Experiment design and environment
The codes have been run on a shared-memory machine called Geopar. Geopar is an Intel S7000FC4URE server, equipped with four quad-core Intel Xeon MPE7310 processors at 1.6 GHz and 32 GB of RAM. Geopar runs OpenSolaris 2008.05, with the Sun Studio 12 compiler suite. We use up to 15 cores simultaneously, to avoid interference from the operating system running on the last core.
We present results with input data matrices of 15 000 × 15 000 elements to achieve good scalability. The optimal block size is platform-dependent. In Geopar, we have found that the best block size is 64 × 64 elements.

                     Par. Code Lines   Total Lines   % Par./Total
  MPI Ref. Version         151             646          23.37%
  Hitmap Version           117             586          19.97%
  Reduction               22.52%           9.29%

Table I: Comparison of code lines devoted to parallelism and total code lines between Hitmap and the MPI reference implementation.
Results
Figure 10 (left) shows execution times in seconds for the different implementations. Figure 10 (right) shows the corresponding speed-ups obtained. The plots show that our direct translation of the ScaLAPACK code to C presents a performance similar to the original Fortran code only when executed sequentially. When executing in parallel, our results show that the optimization effort applied to the communication stages in the ScaLAPACK routines cannot be matched by a naive translation to C. However, the data locality optimizations introduced in the blocks-browsing and blocks-storing versions clearly outperform the original ScaLAPACK routines. The speed-up plots also show that they scale properly. The buffering of data needed during communications also benefits from the blocks-storing scheme.
Regarding the Hitmap version, it achieves similar perfor-
mance figures as the carefully-optimized C+MPI implementa-
tions with data locality improvements. The reason is that, as
we stated before, the Hitmap development techniques exploit
locality in a natural way. The flexible and general Hitmap tile
management only introduces a small overhead in the access
to tile elements.
Finally, in Table I we show a comparison of the Hitmap version with the best optimized C version in terms of lines of code. The comparison takes into account not only the total number of lines of code, but also the lines devoted to parallelism, including data partition, communication initializations, and communication calls. As the table shows, the use of the Hitmap library leads to a 9.29% reduction in the total number of code lines. This reduction mainly affects lines directly related to parallelism (22.52%). The reason is that Hitmap greatly simplifies the programmer's effort for data distribution and communication, compared with the equivalent code needed when using MPI routines directly. Moreover, the use of Hitmap tile-management primitives eliminates some more lines in the sequential treatment of matrices and blocks.
VII. CONCLUSIONS
In this paper we present a technique that allows Hitmap, a library for hierarchical array tiling and mapping, to support layout compositions at different levels. The technique takes advantage of a new derived type that makes it possible to represent heterogeneous tile hierarchies. This technique improves the applicability of Hitmap to new problem classes and programming paradigms, such as multilevel adaptive PDE solvers or recursive domain partitions, and to other scenarios, such as heterogeneous platforms.
Figure 10: Execution times (left) and speed-ups (right) for a 15 000 × 15 000 matrix with 64 × 64 blocks, comparing ScaLAPACK, the naive C translation, the C blocks-browsing and blocks-storing versions, and Hitmap.
The improved features of Hitmap allow the programmer to work at a higher abstraction level than using message-passing programming directly, thus significantly reducing coding and debugging effort.
We show how this new type makes it possible to create data-layout compositions. We use the block-cyclic layout in the well-known parallel LU decomposition as a case study. We compare the reference implementation of LU in ScaLAPACK with different, carefully-optimized MPI versions, as well as a Hitmap implementation. Our experimental results show that the highly-efficient techniques used in Hitmap for tile management, memory access, data locality, buffering, marshalling, and communication allow it to achieve better performance than the ScaLAPACK implementation. The performance obtained is almost the same as that of a manual, carefully-optimized MPI version, with a fraction of its associated development cost.
ACKNOWLEDGEMENTS
This research is partly supported by the Ministerio de
Educación y Ciencia, Spain (TIN2007-62302), Ministerio
de Industria, Spain (FIT-350101-2007-27, FIT-350101-2006-
46, TSI-020302-2008-89, CENIT MARTA, CENIT OASIS,
CENIT OCEAN LEADER), Junta de Castilla y León,
Spain (VA094A08), and also by the Dutch government
STW/PROGRESS project DES.6397. Part of this work was
carried out under the HPC-EUROPA project (RII3-CT-2003-
506079), with the support of the European Community -
Research Infrastructure Action under the FP6 “Structuring the
European Research Area” Programme. The authors wish to
thank the members of the Trasgo Group for their support.
REFERENCES
[1] S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD distributed-memory machines," Commun. ACM, vol. 35, no. 8, pp. 66–80, 1992.
[2] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 1 ed., 2001. ISBN 1-55860-671-8.
[3] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and language specification," Tech. Rep. CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
[4] W. Gropp, E. Lusk, and A. Skjellum, Using MPI - 2nd Edition: Portable Parallel Programming with the Message Passing Interface. The MIT Press, 2 ed., Nov. 1999.
[5] S. Gorlatch, "Send-Recv considered harmful? Myths and truths about parallel programming," in PaCT'2001 (V. Malyshkin, ed.), vol. 2127 of LNCS, pp. 243–257, Springer-Verlag, 2001.
[6] K. Gatlin, "Trials and tribulations of debugging concurrency," ACM Queue, vol. 2, pp. 67–73, Oct. 2004.
[7] A. Gonzalez-Escribano and D. Llanos, "Trasgo: A nested parallel programming system," Journal of Supercomputing, doi:10.1007/s11227-009-0367-5, 2009.
[8] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. Fraguela, M. Garzarán, D. Padua, and C. von Praun, "Programming for parallelism and locality with hierarchical tiled arrays," in PPoPP'06, pp. 48–57, ACM Press, March 2006.
[9] B. Chamberlain, D. Callahan, and H. Zima, "Parallel programmability and the Chapel language," International Journal of High Performance Computing Applications, vol. 21, pp. 291–312, Aug. 2007.
[10] G. L. Steele Jr., "Parallel programming and code selection in Fortress," in PPoPP'06, pp. 1–1, ACM Press, March 2006.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, Oct. 1996.
[12] M. T. Heath, Scientific Computing. McGraw-Hill Science/Engineering/Math, 2 ed., 2001.
[13] J. Choi and J. J. Dongarra, "Scalable linear algebra software libraries for distributed memory concurrent computers," in Proceedings of the 5th IEEE Workshop on Future Trends of Distributed Computing Systems, p. 170, IEEE Computer Society, 1995.
[14] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley, "Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines," Sci. Program., vol. 5, no. 3, pp. 173–184, 1996.
[15] R. Reddy, A. Lastovetsky, and P. Alonso, "Parallel solvers for dense linear systems for heterogeneous computational clusters," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–8, IEEE Computer Society, 2009.
[16] J. Dongarra and A. Petitet, "ScaLAPACK tutorial," in Proceedings of the Second International Workshop on Applied Parallel Computing, Computations in Physics, Chemistry and Engineering Science, pp. 166–176, Springer-Verlag, 1996.
[17] L. S. Blackford, A. Cleary, J. Choi, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. SIAM, 1997.