
Page 1: [IEEE 2011 International Conference on Parallel Processing (ICPP) - Taipei, Taiwan (2011.09.13-2011.09.16)] 2011 International Conference on Parallel Processing - Optimizing SpMV for

Optimizing SpMV for Diagonal Sparse Matrices on GPU

Xiangzheng Sun∗†‡, Yunquan Zhang∗†, Ting Wang∗, Xianyi Zhang∗, Liang Yuan∗†‡ and Li Rao∗†‡
∗ Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
† State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
‡ Graduate University of Chinese Academy of Sciences, Beijing, China
Email: [email protected]

Abstract—Sparse Matrix-Vector multiplication (SpMV) is an important computational kernel in scientific applications. Its performance highly depends on the nonzero distribution of sparse matrices. In this paper, we propose a new storage format for diagonal sparse matrices, defined as Compressed Row Segment with Diagonal-pattern (CRSD). In CRSD, we design diagonal patterns to represent the diagonal distribution. As Graphics Processing Units (GPUs) have tremendous computation power and OpenCL makes them more suitable for scientific computing, we implement SpMV for the CRSD format on GPUs using OpenCL. Since OpenCL kernels are compiled at runtime, we design a code generator to produce the codelets for all diagonal patterns after storing matrices in CRSD format. Specifically, the generated codelets already contain the index information of the nonzeros, which reduces the memory pressure during the SpMV operation. Furthermore, the code generator also exploits properties of the memory architecture and thread scheduling on GPUs to improve performance.

In the evaluation, we select four storage formats from prior state-of-the-art implementations (Bell and Garland, 2009) on GPU. Experimental results demonstrate that the speedups reach up to 1.52 and 1.94 in comparison with the optimal implementation of the four formats for double and single precision respectively. We also evaluate on a two-socket quad-core Intel Xeon system. The speedups reach up to 11.93 and 12.79 in comparison with the CSR format under 8 threads for double and single precision respectively.

I. INTRODUCTION

Sparse Matrix-Vector multiplication (SpMV) is one of the most important computational kernels in sparse linear algebra. Its performance highly depends on the nonzero distribution, which determines the memory access pattern and varies significantly among different applications. Different sparse matrix storage formats have been proposed, such as Compressed Sparse Row (CSR), Block CSR (BCSR), Diagonal (DIA), ELLPACK/ITPACK (ELL) and Coordinate (COO) [1][2]. For example, nonzeros primarily distribute as dense blocks in matrices produced by the Finite Element Method (FEM), so the BCSR format is applied to store each dense block as a unit [3]. It is difficult to develop a universally optimal solution for all kinds of nonzero distributions.

In this paper, we study the optimization for diagonal sparse matrices, in which the nonzeros mainly distribute along diagonals. Diagonal sparse matrices are common: the Finite Difference Method (FDM), Finite Volume Method (FVM) and FEM are the three main methods for the numerical solution of Partial Differential Equations (PDEs), and when the FDM or FVM is used, the coefficient matrix of the discretized PDEs is usually a diagonal sparse matrix.

The DIA format [1] is designed to store diagonal sparse matrices. All nonzeros on the same diagonal share the same index. However, a large number of zeros must be filled in to maintain the diagonal structure when there are many scatter points or the diagonals are broken by long zero sections. We define a long zero section as an idle section. The filled zeros consume extra computation and memory resources, which may reduce performance.
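The fill overhead of DIA can be made concrete with a small sketch. The code below is an illustration, not the paper's implementation; the helper names are ours. It stores a matrix in DIA form and counts the zeros that must be filled, showing how a single scatter point drags in a whole extra diagonal of padding.

```python
# Illustrative sketch of the DIA format (not the paper's code): every stored
# diagonal keeps one entry per row, so scatter points force explicit zero fill.

def dia_from_dense(A):
    """Store a dense row-major matrix A (list of lists) in DIA format."""
    n = len(A)
    # Offsets of all diagonals holding at least one nonzero.
    offsets = sorted({j - i for i in range(n) for j in range(n) if A[i][j] != 0})
    # data[k][i] holds the element of diagonal offsets[k] in row i (0 if filled).
    data = [[A[i][i + off] if 0 <= i + off < n else 0 for i in range(n)]
            for off in offsets]
    return offsets, data

def dia_spmv(offsets, data, x):
    """y = A @ x using the DIA representation."""
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, data):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y

# A tridiagonal matrix with one scatter point at (0, 3): that single nonzero
# adds a whole diagonal consisting mostly of filled zeros.
A = [[2.0, 1.0, 0.0, 5.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
offsets, data = dia_from_dense(A)
fill = sum(v == 0 for diag in data for v in diag)  # zeros stored only for padding
```

Here 16 nonzero slots plus 5 filled zeros are stored for 11 actual nonzeros; the waste grows with each additional scatter point.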

To address this problem, we propose a novel storage format, CRSD. In order to represent the diagonal distribution, we design the diagonal pattern, which divides diagonals into different groups. We can then efficiently deal with idle sections by modifying the diagonal patterns. Furthermore, the matrix is split into row segments. In each row segment, nonzeros on the diagonals of the same group are viewed as the unit of storage and operation. We store those nonzeros contiguously and organize the operations on them together. Simultaneously, the scatter points are also detected in each row segment.

As Graphics Processing Units (GPUs) have tremendous computation power [1][4], they are widely used in High Performance Computing (HPC). Furthermore, a new standard, OpenCL [5], makes it easier to program GPUs and exploit their computation resources for scientific computing. In this paper, we implement SpMV for the CRSD format on GPUs using OpenCL.

After storing a matrix in CRSD format, we generate the SpMV kernel according to its diagonal patterns. Because OpenCL kernels are compiled at runtime, we design a code generator to produce the codelet for each diagonal pattern. Specifically, the generated codelets already contain the index information of the nonzeros, which reduces the memory pressure during the SpMV operation. Moreover, the code generator also exploits properties of the memory architecture and thread scheduling on GPUs to improve performance. In the SpMV implementation based on CRSD on GPUs, the elements of the source vector that are accessed by all work-items in one work-group are loaded into local memory (corresponding to shared memory in CUDA). Meanwhile, all work-items in one work-group take the same execution path to avoid thread divergence [6].

2011 International Conference on Parallel Processing
0190-3918/11 $26.00 © 2011 IEEE
DOI 10.1109/ICPP.2011.53

We evaluate the performance improvements of CRSD on 23 matrices on a two-socket quad-core Intel Xeon X5550 system with a Tesla C2050. We select four storage formats (DIA, ELL, CSR and HYB) from the prior state-of-the-art implementations (Bell and Garland, 2009) [1] on GPU for comparison. Experimental results demonstrate that the storage format that leads to the optimal performance varies among different matrices. In comparison with the optimal implementation, the speedups of CRSD reach up to 1.52 and 1.94 for double and single precision respectively. We also evaluate the performance of Intel MKL, version 10.2.6.038, on the CPU for comparison. The speedups of CRSD reach up to 11.93 (6.63 on average) and 12.79 (7.18 on average) in comparison with the CSR format under 8 threads for double and single precision respectively.

The rest of this paper is organized as follows: section II describes the diagonal pattern and the CRSD storage format; section III presents how to produce the SpMV kernel for CRSD; in section IV, the experimental results are provided and analyzed; the related work is given in section V; finally, the conclusion is summarized in section VI.

II. CRSD STORAGE FORMAT

A. Motivation

In a diagonal sparse matrix, the nonzeros mainly distribute on a small number of diagonals [1][2]. The offset of each diagonal from the main diagonal is used to identify the diagonals: offsets of diagonals above the main diagonal are positive, while those below the main diagonal are negative. In the DIA storage format, the offsets are used as the indices, and all nonzeros on the same diagonal share the same index. This reduces the size of the indices.

However, we find that in many matrices there are many scatter points, or the diagonals are broken by long zero sections. If we use the DIA format to store those matrices, a large number of zeros must be filled in to maintain the diagonal structure, which consumes computation resources and memory bandwidth for processing the filled zeros during the SpMV operation.

We present a real diagonal sparse matrix from our research in Fig. 1. In the diagonals with offset ±200, long zero sections are marked by red dotted lines. Besides, some scatter points are circled in the detailed picture.

It is unwise to store this matrix in DIA format, since a large number of zeros would have to be filled in. However, a large proportion of the nonzeros distribute on the diagonals, and only a small number of zeros need to be filled for the idle sections, marked by red rectangles.

We design the diagonal pattern to exploit this partial diagonal structure, and propose a new storage format, CRSD, based on it. In CRSD format, the scatter points can be detected. Simultaneously, the number of zeros filled for an idle section can be controlled according to the properties of the input matrix.

Fig. 1. A real diagonal sparse matrix in our research

Fig. 2. An example of diagonal sparse matrix

B. Diagonal Pattern

For any two diagonals in the matrix, if the absolute value of the difference of their offsets is 1, they are adjacent. We can group a sequence of diagonals by the following steps: if two diagonals are adjacent, put them into an adjacent (AD) group; after removing the diagonals within the adjacent groups, the original diagonal sequence is broken up into pieces. We assign the diagonals of each piece to a nonadjacent (NAD) group. The diagonal pattern is defined as the way that the AD group(s) and the NAD group(s) are organized.

Each group is represented by its group type (AD or NAD) and the number of diagonals in it:

group = (group type, number of diagonals).

According to this definition, a diagonal pattern is represented as:

diagonal-pattern = {group1, group2, . . . , groupm}.

If the whole matrix contains several diagonal patterns, then

matrix = {dia-pattern1, dia-pattern2, . . . , dia-patternn}.

For example, there are two diagonal patterns in the matrix shown in Fig. 2, excluding the nonzero v55. The matrix is represented as follows:

matrix = {{(NAD,1),(AD,2),(NAD,2)}, {(AD,2),(NAD,1)}}.
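Our reading of this grouping procedure can be sketched in Python. This is an illustrative reconstruction, not the paper's code; the function name and the run-based formulation are ours.

```python
def diagonal_pattern(offsets):
    """Group sorted diagonal offsets into AD/NAD groups.
    Maximal runs of consecutive offsets (length >= 2) form AD groups;
    the pieces left between them form NAD groups."""
    pattern = []
    i = 0
    nad = 0  # size of the NAD group currently being collected
    while i < len(offsets):
        # Find the length of the run of consecutive offsets starting at i.
        j = i
        while j + 1 < len(offsets) and offsets[j + 1] == offsets[j] + 1:
            j += 1
        run = j - i + 1
        if run >= 2:
            if nad:                        # close the NAD piece before the run
                pattern.append(("NAD", nad))
                nad = 0
            pattern.append(("AD", run))
        else:
            nad += 1                       # lone diagonal joins the NAD piece
        i = j + 1
    if nad:
        pattern.append(("NAD", nad))
    return pattern
```

For the first diagonal pattern of Fig. 2, the offsets {0, 2, 3, 5, 7} yield {(NAD,1),(AD,2),(NAD,2)}, matching the example above.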


Fig. 3. Idle section process using diagonal pattern

There are two diagonal patterns in the matrix. The first diagonal pattern covers the first two rows and the second one covers the remaining four rows. In the first diagonal pattern, we divide the diagonals into three groups. The diagonal with offset 2 and the one with offset 3 are adjacent, so we put them in an adjacent group (AD, 2). After processing these two diagonals, no two of the remaining diagonals are adjacent, so the diagonal sequence is broken up into two pieces. The main diagonal in the first piece is assigned to a nonadjacent group (NAD, 1), while the diagonals with offsets 5 and 7 are assigned to another nonadjacent group (NAD, 2). Finally, we get the first diagonal pattern {(NAD,1),(AD,2),(NAD,2)}.

C. Scatter Point Detection and Idle Section Process

We have grouped the diagonals using diagonal patterns. Furthermore, the matrix is split into row segments. The number of rows in each row segment is defined as the row segment size and represented by the token mrows. In this way, the whole matrix is split in two dimensions, as the dotted lines show in Fig. 2. In each row segment, the nonzeros on the diagonals of the same group form the storage unit of CRSD.

Additionally, if only one nonzero lies on a diagonal within one row segment, that nonzero is viewed as a scatter point, such as v55 in Fig. 2. If a nonzero is identified as a scatter point, it is unnecessary to fill zeros into the diagonal where the scatter point is located.

With diagonal patterns, we can process idle sections as follows: if only a few zeros lie in the idle section, we fill in the zeros to maintain the diagonal structure and the diagonal pattern remains unchanged; otherwise, if a large number of zeros would be needed, we consider the diagonal broken and modify the diagonal pattern. The choice depends on the properties of the matrices in a given application. For instance, a zero is filled at position v43, while the main diagonal is broken. When the matrix given in Fig. 1 is described with diagonal patterns, the result is shown in Fig. 3: the diagonals with offset ±200 are broken instead of filling a large number of zeros.
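A literal reading of the per-row-segment scatter rule can be sketched as follows. This is our illustrative interpretation, not the paper's code; the function name and dense-matrix input are ours, and a diagonal truncated by the matrix edge is also flagged when only one of its entries falls in a segment.

```python
def detect_scatter_points(A, mrows):
    """Within each row segment of mrows rows, a diagonal holding exactly one
    nonzero is treated as a scatter point rather than a stored diagonal."""
    n = len(A)
    scatter = []
    for start in range(0, n, mrows):
        rows = range(start, min(start + mrows, n))
        per_diag = {}  # diagonal offset -> nonzero positions in this segment
        for i in rows:
            for j in range(n):
                if A[i][j] != 0:
                    per_diag.setdefault(j - i, []).append((i, j))
        for off, pos in per_diag.items():
            if len(pos) == 1:              # lone nonzero on this diagonal
                scatter.extend(pos)
    return sorted(scatter)

# 4x4 identity plus a lone entry at (1, 3): with mrows = 2 only the lone
# entry is a scatter point, since the main diagonal has 2 nonzeros per segment.
A = [[1, 0, 0, 0],
     [0, 1, 0, 5],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
```

Note that shrinking mrows makes more nonzeros look like scatter points, which is why the row segment size is a tuning parameter.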

D. The Storage Format

In the CRSD storage format, the scatter points and the nonzeros on diagonals are stored separately.

In order not to change the order of floating-point operations, the whole row where a scatter point is located is stored together. Abstracted from the matrix, those rows form a sub-matrix, which is stored in ELL [7] format (detailed in section IV). The only difference is that the row number of each row in the original matrix must be stored in the array scatter_rowno. For the ELL format, the arrays scatter_colval and scatter_val record the column index and value of each nonzero. The number of nonzeros in each row is num_scatter_width.

Except for the scatter points, the whole matrix is represented by diagonal patterns. All nonzeros in the same diagonal pattern share the same index information: the diagonal pattern, the start row number of the diagonal pattern, the number of row segments, and the column indices of the diagonals. The column index of each diagonal is needed for a nonadjacent group, while only the column index of the first diagonal in an adjacent group needs to be recorded. The diagonal patterns are stored in the array matrix and the remaining index values are stored in the array crsd_dia_index.

The nonzero values in each storage unit are stored contiguously in the array crsd_dia_val. Additionally, the nonzeros located on one diagonal are stored contiguously. For instance, since v20, v31, v21 and v32 are located in one storage unit, they are stored contiguously; meanwhile, v20 and v31 are located on one diagonal and are stored together.

The number of diagonal patterns and the number of rows that contain scatter points are assigned to num_dia_patterns and num_scatter_rows respectively.

An example of the CRSD storage format is shown in Fig. 4 for the matrix in Fig. 2, when the row segment size is 2. The diagonal patterns, discussed in section II-B, are stored in the array matrix. The first diagonal pattern starts from R0 and covers 1 row segment. Because the diagonals with offsets 2 and 3 are assigned to one adjacent group, only the column value C2 is needed. When we store the scatter points, only the last row contains a scatter point; there are 4 (num_scatter_width) nonzeros in this row.

III. SPMV IMPLEMENTATION FOR CRSD ON GPUS

GPUs [1][4] are no longer limited to solving graphics-related problems. With a large number of scalar processors, GPUs have tremendous computation power and are very suitable for massively data-parallel processing. OpenCL [5] is an open, royalty-free standard for general-purpose parallel programming on heterogeneous platforms. Using OpenCL, it becomes easier to exploit GPUs' computation resources for scientific computing. In this section, we implement SpMV for CRSD on GPUs using OpenCL.

A. GPUs and OpenCL Overview

In the OpenCL platform model, an OpenCL device is most easily defined as a collection of Compute Units (CUs). Each CU can be further divided into one or more Processing Elements (PEs). The PEs execute the computation commands submitted from an OpenCL application. All PEs within a CU execute a single stream of instructions as SIMD or as SPMD.


num_scatter_rows = 1; num_dia_patterns = 2; num_scatter_width = 4;
matrix = {{(NAD,1),(AD,2),(NAD,2)}, {(AD,2),(NAD,1)}}
crsd_dia_index = {R0, 1, C0, C2, C5, C7, | R2, 2, C0, C4}
crsd_dia_val = {{(v00,v11),(v02,v13,v03,v14),(v05,v16,v07,v18)}, {(v20,v31,v21,v32),(v23,v24)}, {(v42,v53,0,v54),(v45,v56)}}
scatter_rowno = {R5}; scatter_index = {C3, C4, C5, C6}; scatter_val = {v53, v54, v55, v56}

Fig. 4. The CRSD storage format for the matrix shown in Fig. 2 when mrows=2

The programs executing on OpenCL devices are defined as kernels. An index space is defined after submitting a kernel to an OpenCL device, and an instance of the kernel executes for each point in the index space. This instance is defined as a work-item. Work-items are further grouped into work-groups, and each work-group is assigned to a CU. Furthermore, work-items within a work-group can be synchronized using barriers or memory fences. The block of work-items that are executed together is defined as a wavefront, and the number of work-items in one wavefront is called the wavefront size. If work-items within a wavefront diverge, for example at a branch, all execution paths are executed serially. This phenomenon is called thread divergence [6] and obviously reduces performance.

In the OpenCL memory model, all memory accessible to work-items is generalized into four distinct memory types: global memory, constant memory, local memory and private memory. All work-items can read/write the global memory; however, accessing global memory suffers a long latency. The local memory is attached to each CU and is shared only by the work-items within one work-group. Each PE has its own private memory, which can only be accessed by a single work-item. On GPUs, both local memory and private memory have low latency and higher bandwidth.

CUDA has corresponding terminology; we list the related terms in Table I.

TABLE I
THE TERMINOLOGIES IN OPENCL AND CUDA

OpenCL     | CUDA
CU         | Streaming Multiprocessor (SM)
PE         | Streaming Processor (SP)
work-group | thread block
work-item  | thread
wavefront  | warp

B. SpMV Implementation on GPUs

Parallelizing SpMV for the CRSD format on GPUs is straightforward: one work-item processes one row of the matrix. Meanwhile, the matrix is split into row segments in CRSD, and the work-items processing the rows within one row segment are assigned to one work-group. It is wise to make mrows a multiple of the wavefront size.

Once a matrix has been stored in CRSD storage format, we can infer the following information (shown in Table II) for the pth diagonal pattern from the arrays matrix and crsd_dia_index.

TABLE II
THE INFORMATION INFERRED FROM THE pth DIAGONAL PATTERN

Description                               | Token
The number of row segments                | NRS_p
The number of nonzeros in one row segment | NNzRS_p
The start row number                      | SR_p
The number of diagonals                   | NDias_p
The column index of the dth diagonal      | Colv_{p,d}

According to this information, we can generate the SpMV kernels. First, we get the unique work-group ID group_id as well as the local work-item ID local_id within the work-group from the OpenCL API. Then we can identify which diagonal pattern work-group group_id processes. Because one work-group deals with one row segment, work-group group_id processes the pth diagonal pattern only if it satisfies the following condition:

Σ_{i=0}^{p−1} NRS_i ≤ group_id < Σ_{i=0}^{p} NRS_i.

Next, we generate the SpMV operations for the pth diagonal pattern. When work-item local_id processes the nonzero on the dth diagonal, the location of the nonzero is

Σ_{i=0}^{p−1} (NRS_i × NNzRS_i) + (group_id − Σ_{i=0}^{p−1} NRS_i) × NNzRS_p + d × mrows + local_id.

The corresponding element of the source vector x is at

Colv_{p,d} + (group_id − Σ_{i=0}^{p−1} NRS_i) × mrows + local_id.

After accumulating the products of the NDias_p nonzeros, the result should be stored at

SR_p + (group_id − Σ_{i=0}^{p−1} NRS_i) × mrows + local_id

in the destination vector y. Note that Σ_{i=0}^{p−1} Expression = 0 when p = 0.
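These formulas can be checked concretely against the Fig. 2/Fig. 4 example (mrows = 2, Table III values). The sketch below is ours, not the paper's code; the function names are illustrative and the flattened crsd_dia_val list is transcribed from Fig. 4.

```python
# Index arithmetic for the Fig. 2 / Fig. 4 example with mrows = 2.
NRS   = [1, 2]    # row segments per pattern (Table III)
NNzRS = [10, 6]   # nonzeros per row segment
SR    = [0, 2]    # start row of each pattern
mrows = 2

# crsd_dia_val from Fig. 4, flattened in storage order.
vals = ["v00", "v11", "v02", "v13", "v03", "v14", "v05", "v16", "v07", "v18",
        "v20", "v31", "v21", "v32", "v23", "v24",
        "v42", "v53", "0", "v54", "v45", "v56"]

def pattern_of(group_id):
    # sum(NRS[:p]) <= group_id < sum(NRS[:p+1])
    for p in range(len(NRS)):
        if sum(NRS[:p]) <= group_id < sum(NRS[:p + 1]):
            return p
    raise ValueError("group_id out of range")

def nonzero_location(p, group_id, d, local_id):
    base = sum(NRS[i] * NNzRS[i] for i in range(p))  # values before pattern p
    seg = group_id - sum(NRS[:p])   # row-segment index within pattern p
    return base + seg * NNzRS[p] + d * mrows + local_id

def x_index(p, group_id, colv_pd, local_id):
    return colv_pd + (group_id - sum(NRS[:p])) * mrows + local_id

def y_index(p, group_id, local_id):
    return SR[p] + (group_id - sum(NRS[:p])) * mrows + local_id

# Work-group 2 is the second row segment of pattern 1 (rows 4-5).
p = pattern_of(2)
loc = nonzero_location(p, 2, 0, 1)   # d = 0: first diagonal, starting at C0
```

Here work-item 1 of work-group 2 lands on v53 (row 5, column 3), multiplying x[3] and accumulating into y[5], consistent with Fig. 4.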

As a result, the generated SpMV codelet already contains the index information. Thus it is unnecessary to access matrix and crsd_dia_index during the SpMV operation; only the nonzero values need to be transferred to the kernel.

When we deal with adjacent groups, the elements of the source vector x can be reused. For this reason, we load those elements into local memory, which reduces the number of global memory accesses. Since access to global memory suffers a longer latency than access to local memory, performance improves significantly when the nonzeros in adjacent groups occupy a large proportion of the matrix. The size of the local memory buffer is determined by the maximum number of diagonals among all the adjacent groups, which can also be obtained when the matrix is stored in CRSD. This is detailed in Fig. 5: for the adjacent group, the nonzeros Vy and Vw operate on the same element of vector x, with index Ct+1. During the SpMV operation, we load those reused elements into local memory.

On the other hand, because the nonzeros of one storage unit are stored contiguously, simultaneously executed work-items access a contiguous part of the array crsd_dia_val. The distance between two successively accessed nonzeros for one work-item is mrows, and mrows is a multiple of the wavefront size. Therefore, the accesses to global memory are coalesced, which is the most efficient situation. In Fig. 5, work-items 0 and 1 first access the contiguous Vx and Vw, and work-item 0 then accesses Vy at a distance of mrows.

Fig. 6 shows an example of the SpMV implementation for the matrix in Fig. 2. The information inferred from CRSD is shown in Table III. All the SpMV operations for an entire diagonal pattern are organized together in one case of the switch, which applies the loop unrolling optimization technique [20]. Because one work-group processes one row segment, all the work-items take the same execution path, which avoids thread divergence.

We also generate the SpMV kernel for the ELL format. The generated kernel also applies loop unrolling [20], since we already know the number of nonzeros per row (num_scatter_width). The final kernel for the whole matrix is composed of two parts: one processes the nonzeros in diagonal patterns, and the other processes the scatter points. Because the rows that contain scatter points may also belong to diagonal patterns, the kernel processing the diagonal patterns is executed first.
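The flavor of such a code generator can be sketched in Python: it emits OpenCL C source with every index constant folded in and one fully unrolled case per diagonal pattern. This is our illustrative reconstruction, not the paper's generator; the kernel signature, the pattern selector, and the Colv argument (column index of every diagonal per pattern) are assumptions.

```python
def generate_crsd_kernel(NRS, NNzRS, SR, Colv, mrows):
    """Emit OpenCL C source for the diagonal-pattern part of the SpMV kernel.
    All index constants are baked in, so the kernel never reads the arrays
    matrix or crsd_dia_index at run time."""
    lines = ["__kernel void spmv_dia(__global const double *val,",
             "                       __global const double *x,",
             "                       __global double *y) {",
             "  int gid = get_group_id(0);",
             "  int lid = get_local_id(0);",
             "  int p = 0;"]
    # Selector: pattern p owns work-groups [sum(NRS[:p]), sum(NRS[:p+1])).
    bound = 0
    for p in range(1, len(NRS)):
        bound += NRS[p - 1]
        lines.append(f"  if (gid >= {bound}) p = {p};")
    lines.append("  switch (p) {")
    for p in range(len(NRS)):
        first = sum(NRS[:p])                             # first work-group id
        base = sum(NRS[i] * NNzRS[i] for i in range(p))  # values before pattern p
        lines += [f"  case {p}: {{",
                  f"    int seg = gid - {first};",
                  "    double acc = 0.0;"]
        for d, col in enumerate(Colv[p]):                # unrolled over diagonals
            lines.append(f"    acc += val[{base} + seg*{NNzRS[p]} + "
                         f"{d * mrows} + lid] * x[{col} + seg*{mrows} + lid];")
        lines += [f"    y[{SR[p]} + seg*{mrows} + lid] = acc;",
                  "    break; }"]
    lines += ["  }", "}"]
    return "\n".join(lines)

# Constants for the Fig. 2 example (mrows = 2); the Colv values are our
# assumption for illustration.
src = generate_crsd_kernel(NRS=[1, 2], NNzRS=[10, 6], SR=[0, 2],
                           Colv=[[0, 2, 3, 5, 7], [0, 1, 4]], mrows=2)
```

Because the emitted source contains only literal constants and unrolled multiply-adds, the runtime OpenCL compiler can schedule it without any index-array traffic.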

TABLE III
THE INFORMATION INFERRED FROM CRSD FOR THE MATRIX IN FIG. 2

Token   | p = 0 | p = 1
NRS_p   | 1     | 2
NNzRS_p | 10    | 6
SR_p    | 0     | 2
NDias_p | 5     | 3

IV. EVALUATION

In this section, we present the performance improvement of CRSD. The platform information is provided in Table IV. We select 23 matrices (given in Table V). Matrices numbered 1 to 17 mainly come from structural problems, 2D/3D problems and quantum chemistry problems [8]. The last six matrices are from an astrophysics application [9]. For those matrices, a significant proportion of the nonzeros distribute on diagonals.

We compare the performance of CRSD on GPU with the prior state-of-the-art implementations (Bell and Garland, 2009) [1] for both double and single precision on GPU. In the comparison, we select the DIA, CSR and ELL formats; all their kernels are implemented in CUDA. When Bell and Garland evaluate the DIA format, their diagonal sparse matrices represent common stencil operations on regular 1-, 2- and 3-dimensional grids, with nonzeros distributed on several diagonals, for example 27 diagonals. In our evaluation, the matrices kim1 and kim2 have a similar nonzero distribution: their nonzeros mainly distribute on 25 diagonals.

The ELL format is more general than the DIA format. When the maximum number of nonzeros per row is K, ELL stores K elements for each row; all rows are padded with zeros to length K. As a result, the ELL format stores a dense matrix plus an additional column index for each nonzero.

Obviously, ELL's efficiency degrades if the number of nonzeros per row varies. In that case we can select a threshold K′: if the number of nonzeros in a row exceeds K′, only K′ nonzeros are stored in ELL format, and the remaining nonzeros are stored separately with a row and column number for each nonzero. This is the HYB format.
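The ELL padding and the HYB split can be sketched as follows. This is our illustration, not Bell and Garland's code; using −1 as the padding column is a common convention, not something specified here.

```python
def ell_hyb(A, K_prime=None):
    """Pack rows of a dense matrix A into ELL with width K (the maximum
    nonzeros per row); with a threshold K', overflow nonzeros spill into a
    COO list, giving the HYB format."""
    n = len(A)
    # Per-row lists of (column, value) for the nonzeros.
    rows = [[(j, A[i][j]) for j in range(len(A[i])) if A[i][j] != 0]
            for i in range(n)]
    K = max(len(r) for r in rows)
    width = K if K_prime is None else min(K, K_prime)
    # ELL part: pad every row to `width` (column -1, value 0.0).
    ell_col = [[r[k][0] if k < len(r) else -1 for k in range(width)]
               for r in rows]
    ell_val = [[r[k][1] if k < len(r) else 0.0 for k in range(width)]
               for r in rows]
    # COO part: overflow nonzeros stored as (row, column, value).
    coo = [(i, j, v) for i, r in enumerate(rows) for (j, v) in r[width:]]
    return ell_col, ell_val, coo

A = [[1.0, 0.0, 0.0],
     [1.0, 2.0, 3.0],
     [0.0, 0.0, 4.0]]
ell_col, ell_val, coo = ell_hyb(A)               # pure ELL: width 3, no COO
ell_col2, ell_val2, coo2 = ell_hyb(A, K_prime=2) # HYB: width 2 plus COO spill
```

With the threshold, the one long row no longer inflates the padding of every other row; its extra nonzero moves to the COO part instead.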

Besides, we also consider the CSR and DIA formats implemented in Intel MKL, version 10.2.6.038, on the CPU. Because the DIA implementation is not parallelized in Intel MKL, we only choose the parallel implementation of the CSR format for comparison.

TABLE IV
THE PLATFORM INFORMATION

CPU                     | Intel Xeon X5550, 2.67GHz
MEM                     | 8GB
Sockets                 | 2
Compiler                | GCC 4.4.3
GPU                     | Tesla C2050
Number of CUDA cores    | 448
Frequency of CUDA cores | 1.15GHz
Total device memory     | 3GB

TABLE V
THE MATRICES INFORMATION

#  | Matrix name  | Dimensions      | Nonzeros
1  | crystk03     | 24696×24696     | 887937
2  | crystk02     | 13965×13965     | 491274
3  | s3dkt3m2     | 90449×90449     | 1921955
4  | s3dkq4m2     | 90449×90449     | 2455670
5  | ecology1     | 1000000×1000000 | 2998000
6  | ecology2     | 999999×999999   | 2997995
7  | wang3        | 26064×26064     | 177168
8  | wang4        | 26068×26068     | 177196
9  | kim1         | 38415×38415     | 933195
10 | kim2         | 456976×456976   | 11330020
11 | af_1_k101    | 503625×503625   | 9027150
12 | af_2_k101    | 503625×503625   | 9027150
13 | af_3_k101    | 503625×503625   | 9027150
14 | Lin          | 256000×256000   | 1011200
15 | nemeth21     | 9506×9506       | 591626
16 | nemeth22     | 9506×9506       | 684169
17 | nemeth23     | 9506×9506       | 758158
18 | s80_80_50    | 320000×320000   | 2532800
19 | s100_100_62  | 620000×620000   | 4917600
20 | s110_110_68  | 822800×822800   | 6531140
21 | us80_80_50   | 320000×320000   | 2532800
22 | us100_100_62 | 620000×620000   | 4917600
23 | us110_110_68 | 822800×822800   | 6531140

A. Performance Comparison on GPU

The performance comparison results for double and single precision are shown in Fig. 7 and Fig. 8.


Fig. 5. Local memory usage for processing adjacent groups

Fig. 6. The SpMV implementation fragment for matrix in Fig. 2 when mrows=2

For double precision, the speedup of the CRSD format reaches up to 11.13 and 9.42 over the DIA format for s3dkt3m2 and s3dkq4m2, respectively. The main reason is that, for s3dkt3m2, the nonzeros are distributed over 655 diagonals while the number of nonzeros per row is only 41. A large number of zeros must be filled in to maintain the diagonal structure, which consumes considerable computation resources. Conversely, the CRSD format describes the two matrices using 24 diagonal patterns, reducing the number of filled zeros. Notably, the size of the DIA format exceeds the device memory for af_1_k101, af_2_k101, and af_3_k101, so there are no DIA performance results for these three matrices.
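To make the fill overhead concrete, the following small sketch (our illustration, not code from the paper; the helper name dia_fill_ratio is hypothetical) counts the entries DIA must store for a given nonzero pattern:

```python
# Illustrative sketch: DIA stores every occupied diagonal padded to the full
# matrix dimension n, so the stored-entry count is
# n * (number of occupied diagonals), regardless of how sparse each diagonal is.

def dia_fill_ratio(coords, n):
    """coords: iterable of (row, col) nonzero positions of an n x n matrix.
    Returns (number of occupied diagonals, stored entries / nnz)."""
    coords = set(coords)
    diags = {c - r for r, c in coords}      # each diagonal keyed by offset
    stored = n * len(diags)                 # every diagonal padded to length n
    return len(diags), stored / len(coords)

# A 5x5 tridiagonal matrix: 3 diagonals, 13 nonzeros, 15 stored entries.
coords = ([(i, i) for i in range(5)]
          + [(i, i + 1) for i in range(4)]
          + [(i + 1, i) for i in range(4)])
ndiags, ratio = dia_fill_ratio(coords, 5)
```

For a matrix like s3dkt3m2, 655 occupied diagonals against about 41 nonzeros per row imply a fill ratio of roughly 16x, consistent with the poor DIA results reported above.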

The ELL format suits the five matrices discussed above. For s3dkt3m2 and s3dkq4m2, ELL achieves speedups of up to 10.13 and 7.98 over the DIA format. Even so, CRSD still outperforms ELL by factors of 1.18 and 1.10, respectively, although the speedup is only 1.025 for af_1_k101, af_2_k101, and af_3_k101.
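For reference, here is a minimal sequential sketch of ELL storage and its SpMV loop (our Python illustration; the comparison kernels actually benchmarked are the GPU implementations from [1]):

```python
# ELL pads every row to the maximum row length K; short rows store a sentinel
# column index and a zero value, so all rows take the same number of
# iterations -- the property that avoids thread divergence on a GPU.

def ell_from_rows(rows, pad_col=0):
    """rows: list of per-row lists of (col, val). Returns (cols, vals, K)."""
    K = max(len(r) for r in rows)
    cols = [[c for c, _ in r] + [pad_col] * (K - len(r)) for r in rows]
    vals = [[v for _, v in r] + [0.0] * (K - len(r)) for r in rows]
    return cols, vals, K

def ell_spmv(cols, vals, x):
    # One logical thread per row; every row executes exactly K products.
    return [sum(v * x[c] for c, v in zip(crow, vrow))
            for crow, vrow in zip(cols, vals)]

cols, vals, K = ell_from_rows([[(0, 2.0), (1, 1.0)], [(1, 3.0)]])
y = ell_spmv(cols, vals, [1.0, 1.0])  # [3.0, 3.0]
```

The padded zeros contribute nothing to the result but do consume bandwidth and arithmetic, which is why ELL degrades when row lengths vary widely.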

As discussed earlier, CRSD loads the reused elements of the vector x into local memory. When the proportion of adjacent groups is relatively small, the extra barrier operations required for local memory reduce performance. This occurs for the matrices wang3 and wang4, on which CRSD performs poorly: the ELL format outperforms CRSD by factors of 1.22 and 1.23, respectively. The DIA format still performs very poorly on these matrices, as it does for s3dkt3m2 and s3dkq4m2.

In summary, for double precision the maximum speedup of the CRSD format reaches 11.13 over DIA and 1.52 over ELL; the average speedups are 2.05 and 1.24, respectively. Over all matrices, CRSD outperforms CSR by a factor of up to 9.01, and by 4.57 on average. The speedups of the CRSD format for double precision are shown in Fig. 9.

For single precision, the performance comparison is similar to that for double precision. Because single-precision values require less storage than double-precision ones, the DIA format even fits on the GPU for af_1_k101, af_2_k101, and af_3_k101; for these three matrices, the speedup of CRSD over DIA is 1.31.

The speedups of the CRSD format for single precision are given in Fig. 10. The maximum speedup reaches 11.24 over DIA and 1.94 over ELL, with average speedups of 1.92 and 1.50, respectively. Over all matrices, CRSD outperforms CSR by a factor of up to 9.14, and by 4.59 on average.

For the HYB format, we use the default method to determine the ELL/COO split ratio. Matrices 1 to 14 are stored entirely in ELL format; for the remaining matrices, a small percentage (about 0.2%–2.1%) of the nonzeros is stored in COO format. For those matrices, the speedups of CRSD reach 2.67 (2.12 on average) and 3.68 (2.87 on average) for double and single precision, respectively.
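The ELL/COO split can be sketched as follows (our simplified illustration; the default heuristic in [1] derives the ELL width K from the row-length distribution rather than taking it as a parameter):

```python
def hyb_split(rows, K):
    """rows: list of per-row lists of (col, val). Keep at most K entries per
    row in the ELL part and spill the rest to COO triples (row, col, val)."""
    ell, coo = [], []
    nnz = sum(len(r) for r in rows)
    for i, r in enumerate(rows):
        ell.append(r[:K])                       # padded to width K later
        coo.extend((i, c, v) for c, v in r[K:])
    return ell, coo, len(coo) / nnz             # fraction stored as COO

# Two short rows and one long row: with K = 2, three nonzeros spill to COO.
rows = [[(0, 1.0), (1, 1.0)],
        [(2, 1.0), (3, 1.0)],
        [(j, 1.0) for j in range(5)]]
ell, coo, ratio = hyb_split(rows, 2)  # ratio = 3/9
```

The small COO fractions reported above (0.2%–2.1%) mean most of the work still runs through the regular ELL kernel.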

B. Performance Comparison on CPU

The performance comparisons for double and single precision on the CPU are shown in Fig. 11 and Fig. 12, respectively.

As discussed before, the DIA format is not suitable for s3dkt3m2, s3dkq4m2, af_1_k101, af_2_k101, and af_3_k101, so its performance on these matrices is very poor. For these five matrices, the speedups of CRSD over DIA reach up to 199.63 and 202.23 for double and single precision, respectively.

Excluding those five matrices, the speedups of CRSD reach up to 15.27 (12.34 on average) and 13.25 (9.87 on average) for double and single precision, respectively.

In comparison with the CSR format at different thread counts, the maximum and average speedups of CRSD are shown in Table VI.

TABLE VI
PERFORMANCE SPEEDUP OF CRSD COMPARED WITH CSR

  Precision          Serial   Parallel, thr=8
  Double, maximum    25.06    11.93
  Double, average    14.76     6.63
  Single, maximum    39.81    12.79
  Single, average    24.25     7.18

V. RELATED WORK

Sparse matrix-vector multiplication is an important computational kernel, and a large body of work has been devoted to its optimization. Im and Yelick propose register blocking, cache blocking, and reordering techniques. Register blocking [3][10][11] is based on the BCSR format, which suits matrices whose nonzeros are primarily distributed in dense blocks, a property common to matrices produced by FEM; BCSR stores each dense block as a unit. Vuduc et al. estimate performance bounds for register blocking and propose a new approach to choosing the register block size [12]. However, excessive zeros must be filled in to maintain the block structure in BCSR, which wastes computation and memory resources. Because the number of filled zeros depends on the block size, and different block sizes lead to different performance, OSKI [10] analyzes the input matrix at runtime to select a proper block size.
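The block-size trade-off can be quantified with a small sketch (ours, not OSKI's code; bcsr_fill_ratio is a hypothetical helper): every r x c block that contains at least one nonzero is stored densely, so the fill ratio follows directly from the occupied blocks.

```python
def bcsr_fill_ratio(coords, r, c):
    """coords: (row, col) nonzero positions. Ratio of stored entries
    (all cells of every occupied r x c block) to true nonzeros."""
    coords = set(coords)
    blocks = {(i // r, j // c) for i, j in coords}  # occupied blocks
    return len(blocks) * r * c / len(coords)

# A dense 2x2 corner block plus one stray nonzero: 2x2 blocking stores
# 8 entries for 5 nonzeros, while 1x1 blocking (plain CSR) stores exactly 5.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
fill_2x2 = bcsr_fill_ratio(coords, 2, 2)  # 1.6
fill_1x1 = bcsr_fill_ratio(coords, 1, 1)  # 1.0
```

A tuner like OSKI weighs this fill against the per-block indexing savings and register reuse when choosing r and c.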

To reduce the number of filled zeros, Vuduc et al. [13][14] exploit a variable block structure rather than a single block size: they decompose the original matrix into a sum of sub-matrices, each of which uses its own uniform block size and is stored in the UBCSR format. Belgin et al. explore the distribution patterns of nonzeros within dense blocks and propose PBR to store matrices without zero filling [15]. Cache blocking [3] increases temporal locality by reordering memory accesses. Nishtala et al. present a new performance model that takes TLB misses into account, together with a criterion for deciding when to apply cache blocking [16]. Williams et al. [17] survey these optimization methods on emerging multi-core platforms.
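The cache-blocking idea can be sketched in a few lines (our illustration, not the cited implementations): restrict each pass to a block of columns so the corresponding slice of x stays in cache across many rows.

```python
def spmv_cache_blocked(rows, x, n, B):
    """rows: list of per-row lists of (col, val); n: number of columns;
    B: column-block width. Accumulates y one column block at a time."""
    y = [0.0] * len(rows)
    for c0 in range(0, n, B):           # one column block per outer pass
        for i, r in enumerate(rows):    # x[c0:c0+B] stays hot in cache
            y[i] += sum(v * x[c] for c, v in r if c0 <= c < c0 + B)
    return y

rows = [[(0, 1.0), (3, 2.0)], [(1, 4.0)]]
y = spmv_cache_blocked(rows, [1.0, 2.0, 3.0, 4.0], 4, 2)  # [9.0, 8.0]
```

A real implementation pre-partitions the nonzeros by column block instead of filtering on the fly; the sketch only shows the traversal order that improves locality.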

To mitigate memory bandwidth pressure, Willcock and Lumsdaine propose DCSR and RPCSR to reduce the size of the index structure [18]; both compress the index data by storing the delta between the column indices of two contiguous nonzeros. In RPCSR, the SpMV implementation for the compressed result is produced dynamically by a code generator. Kourtis et al. [19] categorize the storage size of each delta


Fig. 7. Performance Comparison for Double Precision on GPU (GFLOPS, matrices 1–23; formats: DIA, ELL, CSR, HYB, CRSD)

Fig. 8. Performance Comparison for Single Precision on GPU (GFLOPS, matrices 1–23; formats: DIA, ELL, CSR, HYB, CRSD)

Fig. 9. Performance Speedup for Double Precision on GPU (speedup of CRSD over DIA, ELL, CSR, and HYB, matrices 1–23)

value into three classes (1, 2, and 4 bytes), and all contiguous nonzeros whose deltas fall into the same class are stored together to reduce the index data. Kourtis et al. also introduce CSR-VI to compress the nonzero values when most of them are identical.
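The delta classification step can be sketched as follows (our simplified illustration of the idea in [19]; the actual format also groups runs of deltas and handles row boundaries):

```python
def classify_deltas(cols):
    """cols: ascending column indices of one row's nonzeros. Returns the
    deltas between consecutive indices and the smallest unit (in bytes)
    that can hold each delta."""
    deltas = [cols[0]] + [b - a for a, b in zip(cols, cols[1:])]
    width = lambda d: 1 if d < 256 else (2 if d < 65536 else 4)
    return deltas, [width(d) for d in deltas]

deltas, widths = classify_deltas([3, 5, 300, 70000])
# deltas = [3, 2, 295, 69700]; widths = [1, 1, 2, 4]
```

Since most deltas in banded or clustered matrices are small, the majority of indices shrink to a single byte.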

Bell and Garland [1] expose substantial fine-grained parallelism for different storage formats on throughput-oriented processors using CUDA. Zhang et al. [6] apply thread-data remapping at runtime to eliminate thread divergence in GPU computing. Murthy et al. [20] identify optimal unroll factors for GPGPU programs and reduce the space of candidate unroll factors according to program characteristics.


Fig. 10. Performance Speedup for Single Precision on GPU (speedup of CRSD over DIA, ELL, CSR, and HYB, matrices 1–23)

Fig. 11. Performance Speedup for Double Precision on CPU (speedup of CRSD over CSR with 1 and 8 threads and over DIA with 1 thread, matrices 1–23)

Fig. 12. Performance Speedup for Single Precision on CPU (speedup of CRSD over CSR with 1 and 8 threads and over DIA with 1 thread, matrices 1–23)

VI. CONCLUSION

In this paper, we propose CRSD for diagonal sparse matrices. We design diagonal patterns to describe the diagonal distribution, which makes CRSD more suitable for such matrices than DIA. Because OpenCL kernels are compiled at runtime, we generate the SpMV OpenCL kernel from the index information of the CRSD format. The generated kernel already contains the index information, applies loop unrolling, and reuses local memory for adjacent groups. By the definition of the CRSD format and the parallel strategy on GPUs, all work-items in one work group access global memory in a coalesced way and take the


same execution path to avoid thread divergence.

Our experimental results demonstrate that CRSD on the GPU is more efficient than the other formats for processing diagonal sparse matrices. However, we also find that for some matrices the improvement of CRSD on the GPU over the parallel CSR implementation on the CPU is not very pronounced. The advantage shrinks further if the source vector x and the destination vector y must be transferred between GPU and CPU for each SpMV operation. In this situation, we plan to divide the work between the GPU and the CPU in a hybrid implementation. Since we use OpenCL, we will also evaluate on other platforms, such as Cell and AMD devices. Furthermore, we will study more types of nonzero distribution in the future.

ACKNOWLEDGMENT

We would like to thank the National Astronomical Observatories, Chinese Academy of Sciences, as well as the Computer Network Information Center, Chinese Academy of Sciences, for providing the GPU cluster.

This work is supported by the National 863 Plan of China (No. 2006AA01A125, No. 2009AA01A129, No. 2009AA01A134), the China HGJ Significant Project (No. 2009ZX01036-001-002), the Knowledge Innovation Program of the Chinese Academy of Sciences (No. KGCX1-YW-13), and the Ministry of Finance (No. ZDYZ2008-2).

REFERENCES

[1] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Supercomputing, 2009.

[2] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.

[3] E. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, 2000.

[4] Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205–213, 2008.

[5] OpenCL, http://www.khronos.org/opencl/.

[6] E. Z. Zhang, Y. Jiang, Z. Guo, and X. Shen. Streamlining GPU applications on the fly. In ICS, 2010.

[7] R. Grimes, D. Kincaid, and D. Young. ITPACK 2.0 User's Guide. Technical Report CNA-150, Center for Numerical Analysis, University of Texas, Aug. 1979.

[8] R. Boisvert, R. Pozo, K. Remington, B. Miller, and R. Lipman. NIST Matrix Market, http://math.nist.gov/MatrixMarket/index.html.

[9] K. H. Chan, L. Li, and X. Liao. Modelling the core convection using finite element and finite difference methods. Physics of the Earth and Planetary Interiors, 157(2):124–138, 2006.

[10] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, 2005.

[11] E. Im and K. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. Lecture Notes in Computer Science, 2073:127–136, 2001.

[12] R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Supercomputing, Baltimore, MD, 2002.

[13] R. W. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, Computer Science Division, U.C. Berkeley, 2003.

[14] R. W. Vuduc and H. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 807–816. Springer, 2005.

[15] M. Belgin, G. Back, and C. J. Ribbens. Pattern-based sparse matrix representation for memory-efficient SMVM kernels. In International Conference on Supercomputing, 2009.

[16] R. Nishtala, R. Vuduc, J. W. Demmel, and K. A. Yelick. When cache blocking sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication, and Computing, 2007.

[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Reno, Nevada, November 2007.

[18] J. Willcock and A. Lumsdaine. Accelerating sparse matrix computations via data compression. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 307–316, New York, NY, USA, 2006. ACM Press.

[19] K. Kourtis, G. Goumas, and N. Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In Proceedings of the 5th Conference on Computing Frontiers, Ischia, Italy, May 2008.

[20] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal loop unrolling for GPGPU programs. In 24th IEEE International Parallel and Distributed Processing Symposium, Atlanta, Georgia, USA, 2010.
