[IEEE 2011 International Conference on Parallel Processing (ICPP) - Taipei, Taiwan (2011.09.13-2011.09.16)]
Optimizing SpMV for Diagonal Sparse Matrices on GPU
Xiangzheng Sun∗†‡, Yunquan Zhang∗†, Ting Wang∗, Xianyi Zhang∗, Liang Yuan∗†‡ and Li Rao∗†‡
∗ Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
† State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡ Graduate University of Chinese Academy of Sciences, Beijing, China
Email: [email protected]
Abstract—Sparse Matrix-Vector multiplication (SpMV) is an important computational kernel in scientific applications. Its performance highly depends on the nonzero distribution of sparse matrices. In this paper, we propose a new storage format for diagonal sparse matrices, defined as Compressed Row Segment with Diagonal-pattern (CRSD). In CRSD, we design diagonal patterns to represent the diagonal distribution. As Graphics Processing Units (GPUs) have tremendous computation power and OpenCL makes them more suitable for scientific computing, we implement SpMV for the CRSD format on GPUs using OpenCL. Since OpenCL kernels are compiled at runtime, we design a code generator that produces the codelets for all diagonal patterns after storing matrices in CRSD format. Specifically, the generated codelets already contain the index information of nonzeros, which reduces the memory pressure during the SpMV operation. Furthermore, the code generator also exploits the memory architecture and thread scheduling of GPUs to improve performance.
In the evaluation, we select four storage formats from prior state-of-the-art implementations (Bell and Garland, 2009) on GPU. Experimental results demonstrate that the speedups reach up to 1.52 and 1.94 in comparison with the optimal implementation of the four formats for double and single precision respectively. We also evaluate on a two-socket quad-core Intel Xeon system, where the speedups reach up to 11.93 and 12.79 in comparison with the CSR format under 8 threads for double and single precision respectively.
I. INTRODUCTION
The Sparse Matrix-Vector multiplication (SpMV) is one
of the most important computational kernels in sparse linear
algebra. The performance highly depends on the nonzero distribution,
which determines the memory access pattern and varies
significantly among different applications. Different sparse
matrix storage formats are proposed, such as Compressed
Sparse Row (CSR), Block CSR (BCSR), Diagonal (DIA),
ELLPACK/ITPACK (ELL) and coordinate (COO) [1][2]. For
example, nonzeros primarily distribute as dense blocks in
matrices produced by the Finite Element Method (FEM), so
the BCSR format is applied to store each dense block as a unit
[3]. It is difficult to develop a universal optimal solution for
all kinds of nonzero distributions.
In this paper, we study the optimization for diagonal sparse
matrices, in which the nonzeros mainly distribute along di-
agonals. Diagonal sparse matrices are common: the Finite
Difference Method (FDM), Finite Volume Method (FVM) and
FEM are the three main methods for the numerical solution
of Partial Differential Equations (PDEs) nowadays, and once
the FDM or FVM is used, the coefficient matrix of the
discretized PDEs is usually a diagonal sparse matrix.
The DIA format [1] is designed to store the diagonal sparse
matrix. All nonzeros on the same diagonal share the same
index. However, a large number of zeros must be filled to
maintain the diagonal structure when there are many scatter
points or the diagonals are broken by long zero sections.
We define such a long zero section as an idle section. This may
reduce the performance, since the filled zeros consume extra
computation and memory resources.
To address this problem, we propose a novel storage format
CRSD. In order to represent the diagonal distribution, we
design the diagonal pattern, which divides diagonals into dif-
ferent groups. Then we can efficiently deal with idle sections
by modifying the diagonal patterns. Furthermore, the matrix
is split into row segments. In each row segment, nonzeros on
the diagonals of the same group are viewed as the unit of
storage and operation. We store those nonzeros contiguously
and organize the operation on them together. Simultaneously,
the scatter points are also detected in each row segment.
As the Graphics Processing Units (GPUs) have tremen-
dous computation power[1][4], they are widely used in High
Performance Computing (HPC). Furthermore, a new standard,
OpenCL [5], makes it easier to program GPUs and exploit
their computation resources for scientific computing. In this
paper, we implement the SpMV for CRSD format on the GPUs
using OpenCL.
After storing one matrix into CRSD format, we generate
the SpMV kernel according to diagonal patterns. Due to
the fact that OpenCL kernels are compiled at runtime,
we design the code generator to produce the codelet for
the corresponding diagonal pattern. Specifically, the generated
codelets already contain the index information of nonzeros.
This reduces the memory pressure during the SpMV operation.
Moreover, the code generator also exploits the memory
architecture and thread scheduling of the GPUs to improve the
performance. In the SpMV implementation based on CRSD on
GPUs, the elements of the source vector, which are accessed
2011 International Conference on Parallel Processing
0190-3918/11 $26.00 © 2011 IEEE
DOI 10.1109/ICPP.2011.53
by all work-items in one work-group, are loaded into local
memory (corresponding to shared memory in CUDA). Meanwhile,
all work-items in one work-group take the same execution path
to avoid thread divergence[6].
We evaluate the performance improvements of CRSD on
23 matrices on a two-socket quad-core Intel Xeon X5550
system with a Tesla C2050 GPU. We select four storage formats
(DIA, ELL, CSR and HYB) from the prior state-of-the-art
implementations (Bell and Garland, 2009) [1] on GPU for
comparison. Experimental results demonstrate that the storage
format, which leads to the optimal performance, varies among
different matrices. In comparison with the optimal implementation,
the speedups of CRSD reach up to 1.52 and 1.94 for
the double and single precision respectively. We also evaluate
the performance of Intel MKL, with version 10.2.6.038, on
CPU for comparison. The speedups of CRSD reach up to
11.93 (6.63 on average) and 12.79 (7.18 on average) in
comparison with CSR format under 8 threads for double and
single precisions respectively.
The rest of this paper is organized as follows: section II
describes the diagonal pattern and CRSD storage format;
section III presents how to produce the SpMV kernel for
CRSD; in section IV, the experiment results are provided and
analyzed; the related work is given in section V. Finally, the
conclusions are summarized in section VI.
II. CRSD STORAGE FORMAT
A. Motivation
In a diagonal sparse matrix, the nonzeros mainly distribute on
a small number of diagonals [1][2]. The offset of each diagonal
from the main diagonal is used to identify it: the offsets of
diagonals above the main diagonal are positive, while those of
diagonals below it are negative. In the DIA storage format, the
offsets are used as the indices, and all nonzeros on the same
diagonal share the same index, which reduces the size of the
indices.
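As an illustration of this layout, here is a minimal host-side sketch of DIA storage and its SpMV in Python (our own toy model, not the paper's GPU code; `to_dia` and `dia_spmv` are hypothetical names):

```python
import numpy as np

def to_dia(A):
    """Store a dense square matrix in DIA form: one column of `data` per
    nonzero diagonal, identified by its offset from the main diagonal
    (positive above, negative below). Positions where a diagonal falls
    outside the matrix are padded with zeros."""
    n = A.shape[0]
    offsets = sorted({j - i for i, j in zip(*np.nonzero(A))})
    data = np.zeros((n, len(offsets)))
    for k, off in enumerate(offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                data[i, k] = A[i, j]
    return offsets, data

def dia_spmv(offsets, data, x):
    """y = A @ x computed from the DIA representation."""
    n = len(x)
    y = np.zeros(n)
    for k, off in enumerate(offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += data[i, k] * x[j]
    return y
```

For a tridiagonal matrix this stores only three columns; a stored diagonal that is mostly zero, however, still occupies a full column, which is exactly the padding cost discussed next.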
However, we find that in many matrices there are many scatter
points, or the diagonals are broken by long zero sections. If
we use the DIA format to store those matrices, a large number of
zeros must be filled to maintain the diagonal structure. This
consumes computation resources and memory bandwidth
for processing the filled zeros during the SpMV operation.
We present a real diagonal sparse matrix from our research
in Fig. 1. In the diagonals with offset ±200, long zero sections
are marked by red dotted lines. Besides, some scatter points
are circled in the detailed picture.
It is unwise to store the matrix in DIA format, since a large
number of zeros would have to be filled. However, a large
proportion of the nonzeros distribute on the diagonals, and only
a small number of zeros need to be filled for the idle sections,
marked by red rectangles.
We design the diagonal pattern to exploit the partial diagonal
structure. A new storage format CRSD is proposed based on
diagonal pattern. In the CRSD format, the scatter points can be
detected. Simultaneously, the number of filled zeros for idle
Fig. 1. A real diagonal sparse matrix in our research
Fig. 2. An example of diagonal sparse matrix
sections can be controlled according to the properties of the
input matrix.
B. Diagonal Pattern
For any two diagonals in the matrix, if the absolute difference
of their offsets is 1, they are adjacent. We can group a
sequence of diagonals by the following steps: if two diagonals
are adjacent, put them into an adjacent (AD) group; after
removing the diagonals within the adjacent groups, the original
diagonal sequence is broken up into pieces. We assign the
diagonals of each piece to a nonadjacent (NAD) group. The
diagonal pattern is defined as the way that the AD group(s)
and the NAD group(s) are organized.
Each group is represented by its group type (AD or
NAD) and the number of diagonals in it:

group = (group type, number of diagonals).

According to this definition, a diagonal pattern is represented
as follows:

diagonal-pattern = {group1, group2, . . . , groupm}.

If the whole matrix contains several diagonal patterns, then

matrix = {dia-pattern1, dia-pattern2, . . . , dia-patternn}.
For example, apart from the nonzero v55, there are two diagonal
patterns in the matrix shown in Fig. 2. The matrix is represented
as follows:
matrix = {{(NAD,1),(AD,2),(NAD,2)},{(AD,2), (NAD,1)}}.
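The grouping rule above can be sketched in Python (an illustrative host-side helper of our own, assuming the offsets are given in sorted order):

```python
def diagonal_pattern(offsets):
    """Group a sorted list of diagonal offsets into a diagonal pattern.
    Maximal runs of consecutive offsets (difference 1) form adjacent
    (AD) groups; each piece left between AD groups becomes one
    nonadjacent (NAD) group."""
    pattern = []
    nad = 0                      # diagonals accumulated in the current NAD piece
    i = 0
    while i < len(offsets):
        j = i                    # find the maximal run of consecutive offsets
        while j + 1 < len(offsets) and offsets[j + 1] == offsets[j] + 1:
            j += 1
        if j - i + 1 >= 2:       # a run of length >= 2 is an AD group
            if nad:
                pattern.append(("NAD", nad))
                nad = 0
            pattern.append(("AD", j - i + 1))
        else:                    # a lone diagonal extends the current NAD piece
            nad += 1
        i = j + 1
    if nad:
        pattern.append(("NAD", nad))
    return pattern
```

For the first pattern of Fig. 2 (offsets 0, 2, 3, 5, 7) this yields {(NAD,1),(AD,2),(NAD,2)}, matching the grouping above.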
Fig. 3. Idle section process using diagonal pattern
There are two diagonal patterns in the matrix. The first
diagonal pattern covers the beginning two rows and the second
one covers the remaining four rows. In the first diagonal
pattern, we divide the diagonals into three groups. The diagonal
with offset 2 and the one with offset 3 are adjacent, so we put
them into an adjacent group (AD, 2). After processing these two
diagonals, no two of the remaining diagonals are adjacent.
Then the diagonal sequence is broken up into two pieces. The
main diagonal in the first piece is assigned to a nonadjacent
group (NAD, 1), while diagonals with offset 5 and 7 are
assigned to another nonadjacent group (NAD, 2). Finally, we
get the first diagonal pattern {(NAD,1),(AD,2),(NAD,2)}.

C. Scatter Point Detection and Idle Section Process
We have grouped the diagonals using diagonal pattern.
Furthermore, the matrix is split into row segments. The number
of rows in each row segment is defined as the row segment size
and represented by the token mrows. In this way, the whole
matrix is split in two dimensions, as the dotted lines show in
Fig. 2. In each row segment, nonzeros on the diagonals of the
same group are the storage unit of CRSD.

Additionally, if only one nonzero is on a diagonal within
one row segment, that nonzero is viewed as a scatter point, such
as v55 in Fig. 2. If a nonzero is identified as a scatter point, it
is unnecessary to fill zeros into the diagonal where the scatter
point locates.

With the diagonal pattern, we can process idle sections: if there
are few zeros in the idle section, we can fill the zeros
to maintain the diagonal structure and the diagonal pattern
remains unchanged; otherwise, if a large number of zeros
are needed, we believe that the diagonal is broken. Then the
diagonal pattern should be modified. The choice depends on the
properties of the matrices in a given application. For instance, in
Fig. 2 a zero is filled at the v43 position, while the main diagonal is broken. When
the matrix given in Fig. 1 is described by diagonal patterns,
the result is shown in Fig. 3: the diagonals with offset ±200 are
broken instead of being filled with a large number of zeros.
D. The Storage Format

In the CRSD storage format, the scatter points and the nonzeros
on diagonals are stored separately.
In order not to change the order of floating-point operations,
the whole row where a scatter point locates is stored together.
When we extract those rows, they form a sub-matrix, which
is stored in ELL [7] format (detailed in section IV).
The only difference is that the row number of each row in the
original matrix should be stored in array scatter rowno. For
the ELL format, array scatter colval and scatter val record
column value and value of each nonzero. The number of
nonzeros in each row is num scatter width.
Apart from the scatter points, the whole matrix is represented by
diagonal patterns. All nonzeros in the same diagonal pattern
share the same index: the diagonal pattern, the start row
number of the diagonal pattern, the number of row segments,
and the column indices of diagonals. The column index of
each diagonal is needed for nonadjacent group, while only the
column index of first diagonal in adjacent group needs to be
recorded. The diagonal pattern is stored in array matrix and the
remaining index values are stored in array crsd dia index.

The nonzero values in each storage unit are stored contiguously
in array crsd dia val. Additionally, the nonzeros located
on one diagonal are stored contiguously. For instance, since
v20, v31, v21 and v32 locate in one storage unit, they should
be stored contiguously. Meanwhile, v20 and v31 locate on one
diagonal and should be stored together.
The number of diagonal patterns and rows that contain
the scatter points are assigned to num dia patterns and
num scatter rows respectively.
An example of CRSD storage format is shown in Fig. 4 for
the matrix in Fig. 2, when row segment size is 2. The diagonal
patterns, discussed in section II-B, are stored in array matrix.
The first diagonal pattern starts from R0 and covers 1
row segment. Because the diagonals with offsets 2 and 3
are assigned to one adjacent group, only the column value C2
is needed. As for the scatter points, only the last row
contains the scatter point. There are 4 (num scatter width)
nonzeros in this row.
III. SPMV IMPLEMENTATION FOR CRSD ON GPUS
GPUs [1][4] are no longer limited to graphics-related problems.
With a large number of scalar processors, GPUs have
tremendous computation power and are well suited for
massively data-parallel processing. OpenCL [5] is an open, royalty-
free standard for general-purpose parallel programming on
heterogeneous platforms. Using OpenCL, it becomes
easier to exploit GPUs' computation resources for scientific
computing. In this section, we implement SpMV for CRSD
on GPUs using OpenCL.
A. GPUs and OpenCL Overview
In the OpenCL platform model, an OpenCL device is most
easily defined as a collection of Compute Units (CUs). Each
CU can be further divided into one or more Processing
Elements (PEs). The PEs execute the computation commands
submitted from an OpenCL application. All PEs within a CU
execute a single stream of instructions as SIMD or as SPMD.
num scatter rows = 1; num dia patterns = 2; num scatter width = 4
matrix = {{(NAD,1),(AD,2),(NAD,2)}, {(AD,2),(NAD,1)}}
crsd dia index = {R0, 1, C0, C2, C5, C7 | R2, 2, C0, C4}
crsd dia val = {{(v00,v11),(v02,v13,v03,v14),(v05,v16,v07,v18)}, {(v20,v31,v21,v32),(v23,v24)}, {(v42,v53,0,v54),(v45,v56)}}
scatter rowno = {R5}; scatter index = {C3, C4, C5, C6}; scatter val = {v53, v54, v55, v56}
Fig. 4. The CRSD storage format for matrix shown in Fig. 2 when mrows=2
The programs executing on OpenCL devices are defined as
kernels. An index space is defined after submitting a kernel to
an OpenCL device. An instance of the kernel executes for each
point in the index space. This instance is defined as a work-item.
Work-items are further grouped into work-groups. Each
work-group is assigned to a CU. Furthermore, work-items
within a work-group can be synchronized using barriers or
memory fences. The block of work-items that are executed
together is defined as a wavefront. And the number of work-
items in one wavefront is called the wavefront size. If work-items
within a wavefront diverge, for example at a branch, all execution
paths are executed serially. This phenomenon is called thread
divergence [6]. Obviously, it reduces the performance.
In the OpenCL memory model, all memory accessible to
work-items is generalized into four distinct memory types: global
memory, constant memory, local memory and private memory.
All work-items can read/write the global memory. However,
accessing the global memory suffers a long latency. The local
memory is attached to each CU and is only shared by the work-
items within one work-group. Each PE has its own private
memory, and the private memory can only be accessed by a
single work-item.
On GPUs, both local memory and private memory have
lower latency and higher bandwidth than global memory.
In CUDA, there are corresponding terms; we list the related
terminology in Table I.

TABLE I
THE TERMINOLOGIES IN OPENCL AND CUDA

OpenCL       CUDA
CU           Streaming Multiprocessor (SM)
PE           Streaming Processor (SP)
work-group   thread block
work-item    thread
wavefront    warp
B. SpMV Implementation on GPUs
Parallelizing SpMV for the CRSD format on GPUs is
straightforward. One work-item processes one row of the
matrix. Meanwhile, the matrix is split into row segments in
the CRSD format. The work-items processing the rows within one
row segment are assigned to one work-group. It is wise
to make mrows a multiple of the wavefront size.
Once a matrix has been stored in CRSD storage format, we
can infer the following information (shown in Table II) for the
pth diagonal pattern from array matrix and crsd dia index.
TABLE II
THE INFORMATION INFERRED FROM THE pth DIAGONAL PATTERN

Description                                   Token
The number of row segments                    NRSp
The number of nonzeros in one row segment     NNzRSp
The start row number                          SRp
The number of diagonals                       NDiasp
The column index of the dth diagonal          Colvp,d
According to this information, we can generate the SpMV
kernels. First, we obtain the unique work-group ID group_id
as well as the local work-item ID local_id within the work-group
from the OpenCL API. Then we can identify which
diagonal pattern work-group group_id processes. Because
one work-group deals with one row segment, work-group
group_id processes the pth diagonal pattern only if it satisfies
the following condition:

Σ_{i=0}^{p−1} NRS_i ≤ group_id < Σ_{i=0}^{p} NRS_i.
Next, we generate the SpMV operations for the pth diagonal
pattern. When work-item local_id processes the nonzero
on the dth diagonal, the location of the nonzero is

Σ_{i=0}^{p−1} (NRS_i × NNzRS_i) + (group_id − Σ_{i=0}^{p−1} NRS_i) × NNzRS_p + d × mrows + local_id.

The corresponding element of the source vector x is at index

Colv_{p,d} + (group_id − Σ_{i=0}^{p−1} NRS_i) × mrows + local_id.

After accumulating the products of the NDias_p nonzeros, the
result is stored at index

SR_p + (group_id − Σ_{i=0}^{p−1} NRS_i) × mrows + local_id

of the destination vector y. Note that Σ_{i=0}^{p−1}(·) = 0 when p = 0.
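These index expressions can be checked with a small host-side model (our own sketch, not the generated kernel; for simplicity, Colv here holds the column index of every diagonal, whereas CRSD records only the first column of each adjacent group):

```python
def locate(p, d, group_id, local_id, NRS, NNzRS, SR, Colv, mrows):
    """Index arithmetic of the generated CRSD kernel: returns the offsets
    into crsd_dia_val, the source vector x and the destination vector y
    for work-item local_id of work-group group_id, processing the dth
    diagonal of the pth diagonal pattern."""
    prefix_rs = sum(NRS[:p])                         # row segments before pattern p
    seg = group_id - prefix_rs                       # segment index within pattern p
    base = sum(NRS[i] * NNzRS[i] for i in range(p))  # nonzeros before pattern p
    val_idx = base + seg * NNzRS[p] + d * mrows + local_id
    x_idx = Colv[p][d] + seg * mrows + local_id
    y_idx = SR[p] + seg * mrows + local_id
    return val_idx, x_idx, y_idx
```

With the Fig. 4 data (NRS = [1, 2], NNzRS = [10, 6], SR = [0, 2], mrows = 2), work-group 2 and work-item 0 on the first diagonal of pattern 1 get value index 16 (v42), x index 2 and y index 4, i.e. the nonzero at row 4, column 2.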
As a result, the generated SpMV codelet already contains
the index information. Thus it is unnecessary to access matrix
and crsd dia index during the SpMV operation; only the
nonzero values need to be transferred to the kernel.
When we deal with adjacent groups, the elements of source
vector x can be reused. For this reason, we can load the
elements into local memory. This reduces the number of
global memory accesses. Since access to global memory
suffers a longer latency than access to local memory, the
performance improves significantly when the nonzeros
in adjacent groups occupy a large proportion. The
size of the local memory is determined by the maximum
number of diagonals among all the adjacent groups, which
can also be obtained when the matrix is stored in CRSD. This
is detailed in Fig. 5. For the adjacent group, the nonzero Vy
and Vw operate on the same element, with index Ct+1, of
vector x. During the SpMV operation, we load those reused
elements into the local memory.
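The required local memory size can be seen from a small sketch (our own illustration): within one row segment, an adjacent group of n_dias diagonals whose first diagonal intersects the segment at column first_col touches one contiguous window of x.

```python
def x_footprint(first_col, n_dias, mrows):
    """Collect the x indices read by one row segment for an adjacent
    group: work-item local_id on diagonal d reads x[first_col + d +
    local_id], so the union over all d and local_id is one contiguous
    window of mrows + n_dias - 1 elements."""
    needed = {first_col + d + local_id
              for d in range(n_dias)
              for local_id in range(mrows)}
    return min(needed), max(needed), len(needed)
```

Because the window is contiguous and its size (mrows + n_dias − 1) is known once the matrix is stored in CRSD, it can be staged into local memory once and reused by every work-item of the work-group.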
On the other hand, because the nonzeros of one storage
unit are stored contiguously, simultaneously executed work-items
access a contiguous part of the array crsd dia val. The
distance between two successively accessed nonzeros for one
work-item is mrows, and mrows is a multiple of the
wavefront size. Therefore, the accesses to global memory are
coalesced, which is the most efficient situation. In Fig. 5, work-items
0 and 1 first access the contiguous Vx and Vw, while
work-item 0 accesses Vy at a distance of mrows next time.
Fig. 6 shows an example of the SpMV implementation
for the matrix in Fig. 2. The information inferred from
CRSD is shown in Table III. All the SpMV operations for an
entire diagonal pattern are organized together in one case
of the switch, which applies the loop unrolling optimization
technique [20]. Because one work-group processes one row
segment, all the work-items take the same execution path,
which avoids thread divergence.
We also generate the SpMV kernel for the ELL format. The
generated kernel also applies the loop unrolling optimization
technique [20], because we already know the number of
nonzeros per row (num scatter width). The final kernel for
the whole matrix is composed of two parts: one processes
the nonzeros in diagonal patterns, the other processes the
scatter points. Because the rows which contain
the scatter points may also belong to diagonal patterns, the
kernel processing diagonal patterns is executed first.
TABLE III
THE INFORMATION INFERRED FROM CRSD FOR THE MATRIX IN FIG. 2

Token     p = 0   p = 1
NRSp      1       2
NNzRSp    10      6
SRp       0       2
NDiasp    5       3
IV. EVALUATION
In this section, we present the performance improvement of
CRSD. The platform information is provided in Table IV. We
select 23 matrices (given in Table V). Matrices numbered
1 to 17 mainly come from structural problems, 2D/3D
problems and quantum chemistry problems [8]. The last six
matrices are from an astrophysics application [9]. For those
matrices, a significant proportion of nonzeros distribute on
diagonals.
We compare the performance of CRSD with prior
state-of-the-art GPU implementations (Bell and Garland, 2009) [1]
for both double and single precision. In the compar-
ison, we select the DIA, CSR and ELL formats. All the kernels
are implemented in CUDA. When they evaluate the
DIA format, the diagonal sparse matrices represent common
stencil operations on regular 1-, 2- and 3-dimensional grids:
the nonzeros distribute on several diagonals, for example on
27 diagonals. In our evaluation, matrices kim1 and kim2 have a
similar nonzero distribution, with nonzeros mainly on
25 diagonals.
The ELL format is more general than DIA. When the
maximum number of nonzeros per row is K, ELL stores
K elements for each row; all rows are padded with zeros to
length K. As a result, the ELL format stores a dense n×K array
with an additional array holding the column index of each nonzero.
Obviously ELL’s efficiency degrades, if the number of
nonzeros in each row varies. Then we can select a threshold
K ′. If the number of nonzeros exceeds K ′, only K ′ nonzeros
are stored in ELL format. And the remaining nonzeros are
stored separately with row and column number for each
nonzero. This format is the HYB format.
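A minimal sketch of this ELL/COO split in Python (our own illustration with an explicitly chosen K′; the real HYB split-point heuristic of [1] is not modeled here):

```python
import numpy as np

def to_hyb(A, K):
    """Split a dense matrix into HYB form: the first K nonzeros of each
    row go into K-wide, zero-padded ELL arrays; any overflow is kept
    as (row, col, val) COO triples."""
    n = A.shape[0]
    ell_col = np.zeros((n, K), dtype=int)
    ell_val = np.zeros((n, K))
    coo = []
    for i in range(n):
        cols = np.nonzero(A[i])[0]
        for k, j in enumerate(cols):
            if k < K:
                ell_col[i, k] = j
                ell_val[i, k] = A[i, j]
            else:
                coo.append((i, j, A[i, j]))
    return ell_col, ell_val, coo
```

With K equal to the maximum row length the COO part is empty and the layout degenerates to plain ELL; a smaller K trades ELL padding for COO entries.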
Besides, we also consider the CSR and DIA formats as
implemented in Intel MKL, version 10.2.6.038, on the CPU for
comparison. Because the implementation based on DIA is not
parallelized in Intel MKL, we only choose the parallel
implementation of the CSR format for comparison.
TABLE IV
THE PLATFORM INFORMATION

CPU                        Intel Xeon X5550, 2.67GHz
MEM                        8GB
Sockets                    2
Compiler                   GCC 4.4.3
GPU                        Tesla C2050
Number of CUDA cores       448
Frequency of CUDA cores    1.15GHz
Total device memory        3GB
TABLE V
THE MATRICES INFORMATION

Matrix #   matrix name    Dimensions          nonzeros
1          crystk03       24696×24696         887937
2          crystk02       13965×13965         491274
3          s3dkt3m2       90449×90449         1921955
4          s3dkq4m2       90449×90449         2455670
5          ecology1       1000000×1000000     2998000
6          ecology2       999999×999999       2997995
7          wang3          26064×26064         177168
8          wang4          26068×26068         177196
9          kim1           38415×38415         933195
10         kim2           456976×456976       11330020
11         af 1 k101      503625×503625       9027150
12         af 2 k101      503625×503625       9027150
13         af 3 k101      503625×503625       9027150
14         Lin            256000×256000       1011200
15         nemeth21       9506×9506           591626
16         nemeth22       9506×9506           684169
17         nemeth23       9506×9506           758158
18         s80 80 50      320000×320000       2532800
19         s100 100 62    620000×620000       4917600
20         s110 110 68    822800×822800       6531140
21         us80 80 50     320000×320000       2532800
22         us100 100 62   620000×620000       4917600
23         us110 110 68   822800×822800       6531140
A. Performance Comparison on GPU
The performance comparison results for double and single
precision are shown in Fig. 7 and Fig. 8.
Fig. 5. Local memory usage for processing adjacent groups
Fig. 6. The SpMV implementation fragment for matrix in Fig. 2 when mrows=2
For double precision, the speedup of the CRSD format
reaches up to 11.13 and 9.42 for s3dkt3m2 and s3dkq4m2
respectively in comparison with the DIA format. The main reason
is that the nonzeros distribute on 655 diagonals, but the number of
nonzeros per row is only 41 for s3dkt3m2. A large number of
zeros would have to be filled to maintain the diagonal structure, which
consumes a lot of computation resources. Conversely, the CRSD
format describes the two matrices using 24 diagonal patterns
to reduce the number of filled zeros. Notably, the size of the DIA
format exceeds the device memory for af 1 k101, af 2 k101
and af 3 k101, so there are no DIA performance results
for these three matrices.
The ELL format is suitable for the five matrices discussed
above. For s3dkt3m2 and s3dkq4m2, the speedup of ELL reaches
10.13 and 7.98 in comparison with the DIA format. However,
CRSD is still faster, with speedups of 1.18 and 1.10 respectively
over the ELL format, while the speedup is only 1.025
for af 1 k101, af 2 k101 and af 3 k101.
As discussed before, CRSD loads the reused elements
of vector x into local memory. If the proportion of adjacent
groups is relatively small, the performance is reduced by
the extra barrier operations for the local memory. This situation
occurs in matrices wang3 and wang4, where CRSD performs
poorly: the ELL format outperforms the CRSD format by
a factor of 1.22 and 1.23 respectively for wang3 and wang4,
while the DIA format still performs very poorly, as for s3dkt3m2
and s3dkq4m2.
In summary, the maximum speedup of the CRSD format
reaches 11.13 in comparison with DIA and 1.52 in comparison
with ELL for double precision. The average speedup of
CRSD is 2.05 and 1.24 over the DIA and ELL formats respectively.
Over all matrices, CRSD outperforms CSR by a factor of
9.01 at maximum and 4.57 on average. The performance
speedups of the CRSD format for double precision are shown
in Fig. 9.
For single precision, the performance comparisons are
similar to those for double precision. Because single precision
requires less storage than double precision, the DIA format
for af 1 k101, af 2 k101 and af 3 k101 now fits on the
GPU. For these three matrices, the speedup of CRSD compared
to the DIA format is 1.31.
The performance speedups of the CRSD format for single
precision are given in Fig. 10. The maximum speedups
reach 11.24 and 1.94 respectively in
comparison with the DIA and ELL formats, and the average
speedups are 1.92 and 1.50 respectively. Over all matrices,
CRSD outperforms CSR by a factor of 9.14 at maximum
and 4.59 on average.
For the HYB format, we use the default method to determine
the ELL/COO split ratio. Matrices 1 to 14 end up
entirely in the ELL part. For the remaining matrices, a small
percentage (about 0.2%–2.1%) of nonzeros is stored in COO
format. For those matrices, the speedups of CRSD reach 2.67
(2.12 on average) and 3.68 (2.87 on average) for double and
single precision respectively.
B. Performance Comparison on CPU
The performance comparisons for double and single preci-
sion on CPU are shown in Fig. 11 and Fig. 12 respectively.
As discussed before, the DIA format is not suitable for
s3dkt3m2, s3dkq4m2, af 1 k101, af 2 k101 and af 3 k101.
Therefore, the performance based on the DIA format is very
poor: in comparison with the DIA format, the speedups of CRSD
reach up to 199.63 and 202.23 for double and single precision
respectively on these five matrices.

For the other matrices, the speedups of CRSD over DIA reach up
to 15.27 (12.34 on average) and 13.25 (9.87 on average) for
double and single precision respectively.
In the comparison with CSR format with different threads,
the maximum and average speedups of CRSD are shown in
Table VI.
TABLE VI
PERFORMANCE SPEEDUP OF CRSD COMPARED WITH CSR

precision             serial   parallel, thr=8
double    maximum     25.06    11.93
          average     14.76    6.63
single    maximum     39.81    12.79
          average     24.25    7.18
V. RELATED WORK
Sparse matrix vector multiplication is an important com-
putational kernel. A large amount of effort has been devoted
to its optimization. Im and Yelick et al. propose
register blocking, cache blocking and reordering techniques.
Register blocking [3][10][11] is based on the BCSR format. BCSR
is suitable for matrices, in which nonzeros primarily distribute
in dense blocks. This property is universal for the matrices
produced by FEM. BCSR format is applied to store the dense
block as a unit. Vuduc et al. estimate the performance bounds
for the register blocking and propose a new approach to choose
the register block size [12]. However, excessive zeros have to
be filled to maintain the block format in BCSR, which wastes
the computation and memory resources. Because the
number of filled zeros depends on the block size, and
different block sizes lead to different performance,
OSKI [10] analyzes the input matrix to select the proper
block size at runtime.
To reduce the number of filled zeros, Vuduc et al. [13][14]
exploit a variable block structure rather than an identical block
size. They decompose the original matrix into a sum of
sub-matrices, each of which uses a uniform block size and is
stored in the UBCSR format. Belgin et al. explore the distribution
patterns of nonzeros within dense blocks and propose PBR to
store matrices without zero filling [15]. Cache blocking [3]
is used to increase temporal locality by reordering memory
accesses. Nishtala et al. present a new performance model, which
takes TLB misses into account, and a criterion to determine when
to apply cache blocking [16]. Samuel Williams [17] summarizes
these optimization methods on emerging multi-core platforms.
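The idea behind cache blocking can be sketched briefly; this is a generic illustration of the technique, not the implementation from [3] or [16], and the function name and block size are illustrative.

```python
# Sketch of cache blocking for SpMV: the columns are processed block by
# block, so the touched segment of x is reused across all rows while it
# is still resident in cache.

def spmv_cache_blocked(n, rows, x, cb=1024):
    """rows[i] is a list of (col, val) pairs for row i of an n x n matrix."""
    y = [0.0] * n
    for c0 in range(0, n, cb):            # iterate over column blocks
        c1 = c0 + cb
        for i in range(n):                # reuse x[c0:c1] for every row
            y[i] += sum(v * x[c] for c, v in rows[i] if c0 <= c < c1)
    return y

rows = [[(0, 1.0), (2, 2.0)], [(1, 3.0)], [(0, 4.0)]]
print(spmv_cache_blocked(3, rows, [1.0, 1.0, 1.0], cb=2))
# -> [3.0, 3.0, 4.0]
```

A production kernel would pre-partition the nonzeros per block instead of filtering on the fly; the sketch only shows the access-order change that improves temporal locality on x.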
To mitigate memory bandwidth pressure, Willcock and
Lumsdaine propose DCSR and RPCSR to reduce the size of the
index structure [18]; both compress the index data by storing the
delta between the column indices of two contiguous nonzeros.
In RPCSR, the SpMV implementation for the compressed result
is produced dynamically by a code generator. Kourtis et al. [19]
categorize the storage sizes for the delta
[Figure: Performance (GFLOPS) vs. Matrix # (1-23) for DIA, ELL, CSR, HYB and CRSD]
Fig. 7. Performance Comparison for Double Precision on GPU
[Figure: Performance (GFLOPS) vs. Matrix # (1-23) for DIA, ELL, CSR, HYB and CRSD]
Fig. 8. Performance Comparison for Single Precision on GPU
[Figure: Speedup vs. Matrix # (1-23) for CRSD/DIA, CRSD/ELL, CRSD/CSR and CRSD/HYB]
Fig. 9. Performance Speedup for Double Precision on GPU
values into three kinds (1 byte, 2 bytes and 4 bytes), and all
contiguous nonzeros whose deltas fall into the same size category
are stored together to reduce the index data. Furthermore,
Kourtis et al. also introduce CSR-VI to compress the nonzero
values when most of them are identical.
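The delta-encoding idea shared by these schemes can be sketched as follows; this is a simplified illustration in the spirit of the cited work, not the exact DCSR/RPCSR or [19] encodings, and the function name is illustrative.

```python
# Simplified sketch of index compression by delta-encoding column indices.
# Each delta is classified into the smallest of three storage sizes
# (1, 2 or 4 bytes), mirroring the three-way categorization above.

def delta_sizes(cols):
    """Return (deltas, bytes_per_delta) for one row's column indices."""
    deltas, prev = [], 0
    for c in cols:
        deltas.append(c - prev)
        prev = c

    def width(d):
        if d < 1 << 8:
            return 1
        if d < 1 << 16:
            return 2
        return 4

    return deltas, [width(d) for d in deltas]

cols = [0, 3, 7, 300, 70000]        # column indices of one sparse row
deltas, widths = delta_sizes(cols)
print(deltas)   # [0, 3, 4, 293, 69700]
print(widths)   # [1, 1, 1, 2, 4]
```

Most deltas fit in a single byte, whereas plain CSR would spend 4 bytes on every column index; grouping contiguous nonzeros of equal delta width is what lets the index stream shrink.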
Bell and Garland [1] expose substantial fine-grained parallelism
for different storage formats on throughput-oriented processors
using CUDA programming. Zhang et al. [6] employ thread-data
remapping at runtime to eliminate thread divergence in GPU
computing. Murthy et al. [20] identify optimal unroll factors in
GPGPU programs and reduce the number of candidate unroll
factors according to the characteristics of the program.
[Figure: Speedup vs. Matrix # (1-23) for CRSD/DIA, CRSD/ELL, CRSD/CSR and CRSD/HYB]
Fig. 10. Performance Speedup for Single Precision on GPU
[Figure: Speedup vs. Matrix # (1-23) for CRSD/CSR (CPU, 1 thread), CRSD/CSR (CPU, 8 threads) and CRSD/DIA (CPU, 1 thread)]
Fig. 11. Performance Speedup for Double Precision on CPU
[Figure: Speedup vs. Matrix # (1-23) for CRSD/CSR (CPU, 1 thread), CRSD/CSR (CPU, 8 threads) and CRSD/DIA (CPU, 1 thread)]
Fig. 12. Performance Speedup for Single Precision on CPU
VI. CONCLUSION
In this paper, we propose CRSD for diagonal sparse matrices.
We design diagonal patterns to describe the diagonal
distribution, which makes CRSD more suitable than DIA. Since
OpenCL kernels are compiled at runtime, we generate the SpMV
OpenCL kernel according to the index information of the CRSD
format. The generated kernel already contains the index
information, applies the loop-unrolling optimization, and reuses
local memory for adjacent groups. According to the definition of
the CRSD format and the parallel strategy on GPUs, all
work-items in one work-group access global memory in a
coalesced way and take the
same execution path to avoid thread divergence.
Our experimental results demonstrate that CRSD on the GPU
processes diagonal sparse matrices more efficiently than the
other formats. However, we also find that for some matrices the
improvement of CRSD on the GPU over the parallel CSR
implementation on the CPU is not very pronounced. The
advantage shrinks further if the source vector x and the
destination vector y must be transferred between GPU and CPU
for each SpMV operation. In this situation, we plan to divide the
task between GPU and CPU in a hybrid programming model.
Since we use OpenCL, we will perform more evaluations on
different platforms, such as Cell and AMD devices. Furthermore,
we will study more types of nonzero distribution in the future.
ACKNOWLEDGMENT
We would like to thank the National Astronomical Observatories,
Chinese Academy of Sciences, as well as the Computer Network
Information Center, Chinese Academy of Sciences, for providing
the GPU cluster.
This work is supported by the National 863 Plan of China
(No. 2006AA01A125, No. 2009AA01A129, No. 2009AA01A134),
the China HGJ Significant Project (No. 2009ZX01036-001-002),
the Knowledge Innovation Program of the Chinese Academy of
Sciences (No. KGCX1-YW-13), and the Ministry of Finance
(No. ZDYZ2008-2).
REFERENCES
[1] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Supercomputing, 2009.
[2] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.
[3] E. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, 2000.
[4] Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205-213, 2008.
[5] OpenCL, http://www.khronos.org/opencl/.
[6] E. Z. Zhang, Y. Jiang, Z. Guo, and X. Shen. Streamlining GPU applications on the fly. In ICS, 2010.
[7] R. Grimes, D. Kincaid, and D. Young. ITPACK 2.0 User's Guide. Technical Report CNA-150, Center for Numerical Analysis, University of Texas, Aug. 1979.
[8] R. Boisvert, R. Pozo, K. Remington, B. Miller, and R. Lipman. NIST Matrix Market, http://math.nist.gov/MatrixMarket/index.html.
[9] K. H. Chan, L. Li, and X. Liao. Modelling the core convection using finite element and finite difference methods. Physics of the Earth and Planetary Interiors, 157(2):124-138, 2006.
[10] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series.
[11] E. Im and K. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. Lecture Notes in Computer Science, 2073:127-136, 2001.
[12] R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Supercomputing, Baltimore, MD, 2002.
[13] R. W. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, Computer Science Division, U.C. Berkeley, 2003.
[14] R. W. Vuduc and H. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 807-816. Springer, 2005.
[15] M. Belgin, G. Back, and C. J. Ribbens. Pattern-based sparse matrix representation for memory-efficient SMVM kernels. In International Conference on Supercomputing, NY, USA, 2009.
[16] R. Nishtala, R. Vuduc, J. W. Demmel, and K. A. Yelick. When cache blocking sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication, and Computing, 2007.
[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, November 10-16, 2007, Reno, Nevada.
[18] J. Willcock and A. Lumsdaine. Accelerating sparse matrix computations via data compression. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 307-316, New York, NY, USA, 2006. ACM Press.
[19] K. Kourtis, G. Goumas, and N. Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In Proceedings of the 5th Conference on Computing Frontiers, May 5-7, 2008, Ischia, Italy.
[20] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal loop unrolling for GPGPU programs. In 24th IEEE International Parallel and Distributed Processing Symposium, Atlanta, Georgia, USA, 2010.