UNIVERSITY OF CALIFORNIA, LOS ANGELES
Sparse Matrix-Vector Multiplier
Wendi Liu
(SID: 104358212)
3/5/2015
Table of Contents
1. Introduction
2. Accumulator
3. Coping with QDR memory
4. SpMV implementation
5. Mapping on FPGA
6. Simulation Result
7. Conclusion
1. Introduction
Sparse matrix-vector (SpMV) multiplication computes y = A * x, where x is a known vector and A is a known sparse matrix, i.e. one in which most elements are zero. SpMV is the kernel of many scientific applications, and it is a popular target for FPGA implementation because it is difficult to accelerate.
The sparse matrix we use in our project is compressed in the Compressed Sparse Row (CSR) format. The CSR format stores a matrix in three arrays: val, col, and ptr. val and col contain the value and column index of each non-zero element, arranged row by row from top to bottom and, within each row, left to right. The ptr array stores the index within val and col at which each row begins, terminated with a value equal to the total length of val and col.
Figure 1. A typical sparse matrix and its CSR format
A simplified segment of code for computing the result from a CSR format matrix is:
for (int i = 0; i < Nrows; i++) {
    sum = 0;
    for (int j = ptr[i]; j < ptr[i + 1]; j++)
        sum += val[j] * vec[col[j]];
    res[i] = sum;
}
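A self-contained software version of this CSR kernel (the function and variable names here are illustrative) can serve as a reference model for checking the hardware:

```cpp
#include <vector>

// Reference CSR sparse matrix-vector multiply: res = A * vec, with A
// described by the val/col/ptr arrays from the text.
std::vector<float> spmv(const std::vector<int>& ptr,
                        const std::vector<int>& col,
                        const std::vector<float>& val,
                        const std::vector<float>& vec) {
    int nrows = (int)ptr.size() - 1;          // ptr has one extra entry
    std::vector<float> res(nrows, 0.0f);
    for (int i = 0; i < nrows; i++) {
        float sum = 0.0f;
        // the nonzeros of row i live at indices ptr[i] .. ptr[i+1]-1
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * vec[col[j]];
        res[i] = sum;
    }
    return res;
}
```

For instance, the 2x2 matrix [[1, 2], [0, 3]] has val = {1, 2, 3}, col = {0, 1, 1}, ptr = {0, 2, 3}; multiplying by vec = {4, 5} yields res = {14, 15}.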
The overall architecture we use in our project is shown in figure 2. We use col to select the corresponding element of the vector out of BRAM and multiply that element with the corresponding val. The products coming from the same row of the matrix are fed into an accumulator to compute the row sum.
Figure 2. Overall SpMV architecture
We used several floating-point (fp) adders and an fp multiplier in the design, all generated automatically by the Xilinx Core Generator.
2. Accumulator
The accumulator architecture I used is shown in figure 3. It is essentially an adder tree with a series of flip-flops in front that convert the sequential input into parallel inputs. The number of flip-flops in front of the adder tree is equal to the latency of the fp adder minus one. The key point is that the feedback adder at the bottom receives a partial sum from the adder tree every n cycles (where n is the latency of the fp adder).
Figure 3. Accumulator architecture
Figure 4. Demonstration of the accumulator calculation
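The calculation demonstrated in Figure 4 can be modeled in software: the serial input is grouped into blocks of n values (the fp adder latency), each block is reduced by the adder tree, and the feedback adder folds in one block sum every n cycles. A minimal sketch, with an illustrative function name:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Software model of the accumulator: tree-reduce each block of n
// serial inputs, then fold the block sums through the feedback adder.
float accumulate(const std::vector<float>& in, int n) {
    float total = 0.0f;                        // feedback adder state
    for (std::size_t i = 0; i < in.size(); i += n) {
        float block = 0.0f;                    // adder tree reduction
        std::size_t end = std::min(in.size(), i + (std::size_t)n);
        for (std::size_t j = i; j < end; j++)
            block += in[j];
        total += block;                        // one feedback add per n cycles
    }
    return total;
}
```

The grouping changes only the order of the additions, so the result matches a plain serial sum (up to floating-point rounding).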
Since we are given an fp adder with parameterized latency, the architecture of the accumulator must be parameterized as well. To do that, I wrote a Verilog code generator in C++ which takes the latency of the fp adder as input and generates the Verilog code for an accumulator with the corresponding architecture. In general, each layer except the first is composed of fp adders and/or delay buffers. If the total number of fp adders plus buffers in a layer is i, then the number of fp adders in the next layer is floor(i / 2); if i is odd, the next layer also gets one delay to cope with the dangling extra wire, otherwise none. Each delay is parameterized to have the same latency as the fp adder. The pseudocode of the accumulator generator looks like this:
n = latency of fp adder

NumOfLayers() = ceiling(log2(n + 1)) + 1

// Layer 1: a chain of n - 1 flip-flops turns the serial input into
// parallel wires feeding the adder tree
NumOfFlipflops() = n - 1
initialize wire array L1[NumOfFlipflops() + 1]
for i from 0 to NumOfFlipflops() - 1
    Flipflop ff_i(.D(L1[i]), .Q(L1[i + 1]))

W(1) = n + 1                                   // wires leaving layer 1
W(i) = NumOfAdders(i) + NumOfDelays(i)         // wires leaving layer i >= 2
NumOfAdders(i) = floor(W(i - 1) / 2)           // # of adders at layer i
NumOfDelays(i) = W(i - 1) mod 2                // # of delays at layer i

initialize 2D wire array L[i][j], element [i][j] being the jth wire at layer i
for i from 2 to NumOfLayers()
    for j from 1 to NumOfAdders(i)
        fp_adder adder_i_j(.in1(L[i - 1][2j - 1]), .in2(L[i - 1][2j]), .out(L[i][j]))
    if (NumOfDelays(i) == 1)                   // odd wire count: delay the dangling wire
        Delay delay_i(.D(L[i - 1][W(i - 1)]), .Q(L[i][NumOfAdders(i) + 1]))
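The layer-size recurrence above can be checked with a short piece of C++ in the spirit of the generator. This is an illustrative sketch of the bookkeeping only, not the author's actual generator; all names are hypothetical:

```cpp
#include <cmath>
#include <vector>

// Number of fp adders and delays per adder-tree layer (layers 2..top)
// for an fp adder of latency n, following the recurrence in the text:
// layer 2 sees n + 1 wires; each later layer roughly halves the count.
struct Layer { int adders; int delays; };

std::vector<Layer> layerSizes(int n) {
    int numLayers = (int)std::ceil(std::log2(n + 1)) + 1;
    std::vector<Layer> layers;
    int wires = n + 1;                       // wires leaving layer 1
    for (int i = 2; i <= numLayers; i++) {
        Layer l{wires / 2, wires % 2};       // floor(w/2) adders, w mod 2 delays
        layers.push_back(l);
        wires = l.adders + l.delays;         // wires leaving this layer
    }
    return layers;
}
```

For n = 4, layer 2 gets 2 adders and 1 delay, layer 3 gets 1 adder and 1 delay, and layer 4 gets the final single adder, so the tree always narrows to one wire.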
The overall latency of the accumulator is

latency = (ceiling(log2(n)) + 1) * n
This latency is measured from the end of the input stream; for example, with n = 14 it is (4 + 1) * 14 = 70 cycles. The accumulator also consumes an input that marks the end of a series of inputs, and it raises an active-high data-valid signal when the accumulation is complete. The detailed interface of the accumulator is described below:
Port name   Format  I/O  Description
clk         Ufix_1  in   clock
reset       Ufix_1  in   active low
in          Single  in   serial input
EndOfLine   Ufix_1  in   active high, indicating end of a line
out         Single  out  the output of the accumulator
valid       Ufix_1  out  active high, indicating the output data is valid
Table 1. Interface of the accumulator
3. Coping with QDR memory
There are two qual data rate memories (QDR) on the FPGA platform ROACH. Basically, we
need to import the matrix into QDR, fetch the data from QDR and process them in our SpMV
and then store the result back into the QDR. All the interconnection between each module will
look like figure 5.
Figure 5. Interconnect between each module
Note that two values are stored at each address in QDR, so I store each pair of col and val at one address in QDR1. QDR2 is split into three parts that store ptr, vec, and the result, respectively.
QDR1:
0x0000  col[0]  val[0]
0x0001  col[1]  val[1]
...     ...
0xffff  col[n]  val[n]
Figure 6. QDR memory mapping
4. SpMV implementation
After we have successfully stored the data in QDR, the next step is to let the SpMV start computing. As shown in the dataflow below, we first fetch the next ptr from QDR2 to find where the current row ends. Then we read col/val pairs from QDR1 consecutively until we reach the end of the row, meanwhile reading the vector element referenced by each col. Note that reading vec from QDR2 has a latency of 10 cycles, so we put a delay after val to compensate for the vec fetching latency. Once we have vec and val, we multiply them and send the product to the accumulator described above. Note that Stage 2 to Stage 4 are pipelined stages. It is also worth noting that the fp multiplier in stage 4 and the accumulator in stage 5 actually run at half the overall clock frequency, because it takes two clock cycles to read a col/val pair.
QDR2 (cf. Figure 6):
0x0000  ptr[0]
...     ...
0x4000  vec[0]
...     ...
0x8000  res[0]
...     ...
Figure 7. Dataflow of the SpMV
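The val delay that compensates for the vec fetch latency can be sketched as a simple shift register. The 10-cycle depth is the figure given in the text; the class name is illustrative:

```cpp
#include <deque>

// Fixed-depth delay line: the value pushed in at cycle t comes out at
// cycle t + depth, aligning val with the vec word returning from QDR2.
template <typename T>
class DelayLine {
    std::deque<T> q;
public:
    explicit DelayLine(int depth) : q(depth, T{}) {}
    T step(T in) {                 // advance one clock cycle
        q.push_back(in);
        T out = q.front();
        q.pop_front();
        return out;
    }
};
```

A DelayLine of depth 10 in front of the multiplier keeps each val paired with the vec element its col selected.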
The interface of this SpMV is listed and described below:
Port name format I/O detail
clk Ufix_1 in clock
reset Ufix_1 in active low
reset2 Ufix_1 in active low; should be asserted with a delay relative to 'reset' during initialization
length Ufix_32 in indicating the length of the col/val arrays
rd_en1 Ufix_1 out read enable for QDR1
wr_en1 Ufix_1 out write enable for QDR1
rd_en2 Ufix_1 out read enable for QDR2
wr_en2 Ufix_1 out write enable for QDR2
addr1 Ufix_32 out address for QDR1
addr2 Ufix_32 out address for QDR2
start Ufix_1 in indicating data has finished loading and ready for calculation
finish Ufix_1 out indicating the calculation has completed
data1_in unsigned_36 in read port for QDR1
data1_out unsigned_36 out write port for QDR1
data2_in unsigned_36 in read port for QDR2
data2_out unsigned_36 out write port for QDR2
data_valid1 Ufix_1 in indicating the data from QDR1 is ready
data_valid2 Ufix_1 in indicating the data from QDR2 is ready
Table 2. Interface of the SpMV
Note that the high 4 bits of each QDR data port are ECC bits and thus can be neither read nor written.
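Packing a col/val pair accordingly might look like the following sketch. It assumes each QDR1 address maps to two 36-bit words whose high 4 bits are the ECC bits mentioned above, leaving 32 usable data bits per word; the names and layout are illustrative, not the actual hardware interface:

```cpp
#include <cstdint>
#include <cstring>

// One QDR1 address holds a (col, val) pair as two 36-bit words; only
// the low 32 bits of each word are usable, the top 4 being ECC.
struct QdrPair { uint64_t w0, w1; };

QdrPair packColVal(uint32_t col, float val) {
    uint32_t bits;
    std::memcpy(&bits, &val, sizeof bits);    // raw IEEE-754 single bits
    return { col, bits };                     // both fit in the low 32 bits
}

void unpackColVal(const QdrPair& p, uint32_t& col, float& val) {
    col = (uint32_t)(p.w0 & 0xFFFFFFFFu);     // mask off the ECC bits
    uint32_t bits = (uint32_t)(p.w1 & 0xFFFFFFFFu);
    std::memcpy(&val, &bits, sizeof val);
}
```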
The FSM I use for this SpMV is shown below. In practice I merged stages 2, 3, and 4 together:
IDLE: wait for data loading
READ_PTR: read the next ptr
READ_VAL: read a col/val pair, read the vec element referenced by col, send vec and val to the fp multiplier
RETRIEVE: send the product from the fp multiplier to the accumulator, retrieve the sum from the accumulator, and write the result into QDR
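A software model of this FSM (control flow only; the memory latencies and the merged stages are ignored, and all names are illustrative) is:

```cpp
#include <cstddef>
#include <vector>

// Software model of the SpMV FSM: IDLE / READ_PTR / READ_VAL /
// RETRIEVE, stepping through a CSR matrix one row at a time.
enum class State { IDLE, READ_PTR, READ_VAL, RETRIEVE };

std::vector<float> runFsm(const std::vector<int>& ptr,
                          const std::vector<int>& col,
                          const std::vector<float>& val,
                          const std::vector<float>& vec) {
    std::vector<float> res;
    State s = State::IDLE;
    std::size_t row = 0, j = 0;
    int rowEnd = 0;
    float sum = 0.0f;
    bool start = true;                       // data already loaded
    while (true) {
        switch (s) {
        case State::IDLE:                    // wait for data loading
            if (start) s = State::READ_PTR;
            break;
        case State::READ_PTR:                // read next ptr
            if (row + 1 >= ptr.size()) return res;
            rowEnd = ptr[row + 1];
            sum = 0.0f;
            s = State::READ_VAL;
            break;
        case State::READ_VAL:                // multiply-accumulate the row
            if (j < (std::size_t)rowEnd) {
                sum += val[j] * vec[col[j]];
                j++;
            } else {
                s = State::RETRIEVE;
            }
            break;
        case State::RETRIEVE:                // write result, go to next row
            res.push_back(sum);
            row++;
            s = State::READ_PTR;
            break;
        }
    }
}
```

Running it on the same small example as before reproduces the reference result, which is a quick way to sanity-check the state transitions.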
The five pipeline stages of Figure 7 are: Stage 1, read ptr from QDR2; Stage 2, read col and val from QDR1; Stage 3, read vec from QDR2 and delay val; Stage 4, multiply vec and val; Stage 5, send the product to the accumulator.
Figure 8. Finite state machine of the SpMV
5. Mapping on FPGA
We use Simulink to simulate our design on the FPGA. The FPGA system is called Reconfigurable Open Architecture Computing Hardware (ROACH). ROACH is a Virtex-5-based upgrade to current CASPER hardware that merges aspects of the IBOB and BEE2 platforms into a single board. To encapsulate our Verilog code into blocks in Simulink, we import the Verilog into the 'black box' provided by Simulink, compile the design, program it onto the FPGA, and then test it.
The Simulink model is shown below. Essentially there is only one SpMV black box and two QDR blocks at the top level; the accumulator and the other building blocks are encapsulated inside the SpMV block.
Figure 9. Simulink Model
6. Simulation Result
The table below shows the resource usage information:
Resource Used Available Percentage
BUFG 5 32 15%
DCM_ADV 2 12 16%
DSP48E 9 640 1%
IDELAYCTRL 2 22 9%
ILOGIC 83 800 10%
External IOB 184 640 28%
IODELAY 36 800 4%
OLOGIC 111 800 13%
Slice Register 2679 58880 4%
Slice LUT 4056 58880 6%
Slice LUT-Flip Flop Pairs 4797 58880 8%
Table 3. Resource usage report
The system runs at 100MHz, and below are the latencies measured for several typical matrices. The overall time measures from the very beginning of the simulation to its end, while the computation time covers only the calculation itself and excludes the time for loading the matrix data.
Matrix           Rows       Columns    Nonzeros    Nonzeros/Row  Sparsity    Latency (computation)  Latency (overall)
Dense            2,000      2,000      4,000,000   2000.00       100.00000%  81.00ms                161.00ms
Protein          36,417     36,417     4,344,765   119.31        0.32761%    108.74ms               195.64ms
FEM/Spheres      83,334     83,334     6,010,480   72.13         0.08655%    170.21ms               290.42ms
FEM/Cantilever   62,451     62,451     4,007,383   64.17         0.10275%    117.62ms               197.77ms
Wind Tunnel      217,918    217,918    11,524,432  52.88         0.02427%    361.24ms               591.73ms
FEM/Harbor       46,835     46,835     2,374,001   50.69         0.10823%    75.58ms                123.06ms
QCD              49,152     49,152     1,916,928   39.00         0.07935%    68.33ms                106.67ms
FEM/Ship         140,874    140,874    3,568,176   25.33         0.01798%    155.89ms               227.25ms
Economics        206,500    206,500    1,273,389   6.17          0.00299%    149.37ms               174.84ms
Epidemiology     525,825    525,825    2,100,225   3.99          0.00076%    357.50ms               399.50ms
FEM/Accelerator  121,192    121,192    2,624,331   21.65         0.01787%    125.63ms               178.12ms
Circuit          170,998    170,998    958,936     5.61          0.00328%    121.78ms               140.96ms
Webbase          1,000,005  1,000,005  3,105,536   3.11          0.00031%    662.11ms               724.22ms
LP               4,284      1,092,610  11,279,748  2632.99       0.24098%    228.16ms               453.76ms
Table 4. Performance measurement on different test cases
Ideally, the maximum floating-point throughput is 100 MFLOPs if we count the fp multiplier as one floating-point operation and the accumulator as a whole as another, with both running at 50MHz (two operations per cycle at 50MHz). For our simulation statistics, the throughput is calculated as:
Throughput = (2 * Nonzeros) / (Computation time)
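For example, the formula can be applied to the Dense test case from Table 4 (4,000,000 nonzeros, 81.00ms computation time); the function name below is illustrative:

```cpp
// Throughput in MFLOPs: two fp operations per nonzero (one multiply,
// one accumulate) divided by the computation time in seconds.
double throughputMflops(double nonzeros, double computationTimeSec) {
    return 2.0 * nonzeros / computationTimeSec / 1e6;
}
// Dense test case: throughputMflops(4000000, 0.081) is about 98.8 MFLOPs.
```

This lands close to the 100 MFLOPs theoretical maximum, as expected for a fully dense matrix.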
Thus we get a histogram of the throughput across the different test cases:
Figure 10. Throughput (MFLOPs) for the different test cases
As illustrated above, throughput correlates with sparsity (the nonzero fraction). In fact, there is approximately a logarithmic relation between throughput and sparsity:
Figure 11. Logarithmic relation between throughput and sparsity
7. Conclusion
We have designed a sparse matrix-vector multiplier built on the given floating-point adder and multiplier, and successfully mapped it onto the ROACH FPGA system. For matrices with sparsity larger than 0.1%, it maintains a throughput of no less than half the theoretical maximum. However, it still performs poorly on the sparsest matrices. Optimizations for the case where an entire row is empty could be added to the design in future work.