UNIVERSITY OF CALIFORNIA, LOS ANGELES
Sparse Matrix-Vector Multiplier
Wendi Liu
(SID: 104358212)
3/5/2015
Table of Contents
1. Introduction
2. Accumulator
3. Coping with QDR memory
4. SpMV implementation
5. Mapping on FPGA
6. Simulation Result
7. Conclusion
1. Introduction
Sparse matrix-vector (SpMV) multiplication computes y = A * x, where x is a known vector and A is a known sparse matrix, i.e. one in which most elements are zero. SpMV is the kernel of many scientific applications, and it is a popular target for FPGA implementation because it is difficult to accelerate.
The sparse matrix we use in our project is compressed in the Compressed Sparse Row (CSR) format. The CSR format stores a matrix in three arrays: val, col, and ptr. val and col contain the value and column index of each non-zero element, arranged row by row from top to bottom and, within each row, left to right. The ptr array stores the index within val and col at which each row begins, terminated with a value equal to the total length of val and col.
Figure 1. A typical sparse matrix and its CSR format
A simplified segment of code for computing the result from a CSR format matrix is:
for (int i = 0; i < Nrows; i++) {
    sum = 0;
    for (int j = ptr[i]; j < ptr[i + 1]; j++)
        sum += val[j] * vec[col[j]];
    res[i] = sum;
}
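A self-contained software version of this CSR kernel (the function and variable names here are illustrative) can serve as a reference model for checking the hardware:

```cpp
#include <vector>

// Reference CSR sparse matrix-vector multiply: res = A * vec, with A
// described by the val/col/ptr arrays from the text.
std::vector<float> spmv(const std::vector<int>& ptr,
                        const std::vector<int>& col,
                        const std::vector<float>& val,
                        const std::vector<float>& vec) {
    int nrows = (int)ptr.size() - 1;          // ptr has one extra entry
    std::vector<float> res(nrows, 0.0f);
    for (int i = 0; i < nrows; i++) {
        float sum = 0.0f;
        // the nonzeros of row i live at indices ptr[i] .. ptr[i+1]-1
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * vec[col[j]];
        res[i] = sum;
    }
    return res;
}
```

For instance, the 2x2 matrix [[1, 2], [0, 3]] has val = {1, 2, 3}, col = {0, 1, 1}, ptr = {0, 2, 3}; multiplying by vec = {4, 5} yields res = {14, 15}.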
The overall architecture we use in our project is shown in figure 2. We use col to select the corresponding element of the vector out of BRAM and multiply that element with the corresponding val. The products coming from the same row of the matrix are fed into an accumulator to compute the row sum.
Figure 2. Overall SpMV architecture
We used several floating-point (fp) adders and an fp multiplier in the design, all generated automatically by the Xilinx Core Generator.
2. Accumulator
The accumulator architecture I used is shown in figure 3. It is essentially an adder tree with a series of flip-flops in front that convert the sequential input into parallel inputs. The number of flip-flops in front of the adder tree is equal to the latency of the fp adder minus one. The key point is that the feedback adder at the bottom receives a partial sum from the adder tree every n cycles (where n is the latency of the fp adder).
Figure 3. Accumulator architecture
Figure 4. Demonstration of the accumulator calculation
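The calculation demonstrated in Figure 4 can be modeled in software: the serial input is grouped into blocks of n values (the fp adder latency), each block is reduced by the adder tree, and the feedback adder folds in one block sum every n cycles. A minimal sketch, with an illustrative function name:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Software model of the accumulator: tree-reduce each block of n
// serial inputs, then fold the block sums through the feedback adder.
float accumulate(const std::vector<float>& in, int n) {
    float total = 0.0f;                        // feedback adder state
    for (std::size_t i = 0; i < in.size(); i += n) {
        float block = 0.0f;                    // adder tree reduction
        std::size_t end = std::min(in.size(), i + (std::size_t)n);
        for (std::size_t j = i; j < end; j++)
            block += in[j];
        total += block;                        // one feedback add per n cycles
    }
    return total;
}
```

The grouping changes only the order of the additions, so the result matches a plain serial sum (up to floating-point rounding).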
Since we are given an fp adder with parameterized latency, the architecture of the accumulator must be parameterized as well. To do that, I wrote a Verilog code generator in C++ which takes the latency of the fp adder as input and generates the Verilog code for an accumulator with the corresponding architecture. In general, each layer except the first is composed of fp adders and/or delay buffers. If the total number of fp adders plus buffers in a layer is i, then the number of fp adders in the next layer is floor(i / 2); if i is odd, the next layer also gets one delay to cope with the dangling extra wire, otherwise none. Each delay is parameterized to have the same latency as the fp adder. The pseudocode of the accumulator generator looks like this:
n = latency of fp adder

NumOfLayers() = ceiling(log2(n + 1)) + 1

// Layer 1: a chain of n - 1 flip-flops turns the serial input into
// parallel wires feeding the adder tree
NumOfFlipflops() = n - 1
initialize wire array L1[NumOfFlipflops() + 1]
for i from 0 to NumOfFlipflops() - 1
    Flipflop ff_i(.D(L1[i]), .Q(L1[i + 1]))

W(1) = n + 1                                   // wires leaving layer 1
W(i) = NumOfAdders(i) + NumOfDelays(i)         // wires leaving layer i >= 2
NumOfAdders(i) = floor(W(i - 1) / 2)           // # of adders at layer i
NumOfDelays(i) = W(i - 1) mod 2                // # of delays at layer i

initialize 2D wire array L[i][j], element [i][j] being the jth wire at layer i
for i from 2 to NumOfLayers()
    for j from 1 to NumOfAdders(i)
        fp_adder adder_i_j(.in1(L[i - 1][2j - 1]), .in2(L[i - 1][2j]), .out(L[i][j]))
    if (NumOfDelays(i) == 1)                   // odd wire count: delay the dangling wire
        Delay delay_i(.D(L[i - 1][W(i - 1)]), .Q(L[i][NumOfAdders(i) + 1]))
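The layer-size recurrence above can be checked with a short piece of C++ in the spirit of the generator. This is an illustrative sketch of the bookkeeping only, not the author's actual generator; all names are hypothetical:

```cpp
#include <cmath>
#include <vector>

// Number of fp adders and delays per adder-tree layer (layers 2..top)
// for an fp adder of latency n, following the recurrence in the text:
// layer 2 sees n + 1 wires; each later layer roughly halves the count.
struct Layer { int adders; int delays; };

std::vector<Layer> layerSizes(int n) {
    int numLayers = (int)std::ceil(std::log2(n + 1)) + 1;
    std::vector<Layer> layers;
    int wires = n + 1;                       // wires leaving layer 1
    for (int i = 2; i <= numLayers; i++) {
        Layer l{wires / 2, wires % 2};       // floor(w/2) adders, w mod 2 delays
        layers.push_back(l);
        wires = l.adders + l.delays;         // wires leaving this layer
    }
    return layers;
}
```

For n = 4, layer 2 gets 2 adders and 1 delay, layer 3 gets 1 adder and 1 delay, and layer 4 gets the final single adder, so the tree always narrows to one wire.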
The overall latency of the accumulator is

latency = (ceiling(log2(n)) + 1) * n
This latency is measured from the end of the input stream; for example, with n = 14 it is (4 + 1) * 14 = 70 cycles. The accumulator also consumes an input that marks the end of a series of inputs, and it raises an active-high data-valid signal when the accumulation is complete. The detailed interface of the accumulator is described below:
Port name   Format  I/O  Description
clk         Ufix_1  in   clock
reset       Ufix_1  in   active low
in          Single  in   serial input
EndOfLine   Ufix_1  in   active high, indicating end of a line
out         Single  out  the output of the accumulator
valid       Ufix_1  out  active high, indicating the output data is valid
Table 1. Interface of the accumulator
3. Coping with QDR memory
There are two qual data rate memories (QDR) on the FPGA platform ROACH. Basically, we
need to import the matrix into QDR, fetch the data from QDR and process them in our SpMV
and then store the result back into the QDR. All the interconnection between each module will
look like figure 5.
Figure 5. Interconnect between each module
Note that two values are stored at each address in QDR, so I store each pair of col and val at one address in QDR1. QDR2 is split into three parts that store ptr, vec, and the result, respectively.
QDR1:
0x0000  col[0]  val[0]
0x0001  col[1]  val[1]
...     ...
0xffff  col[n]  val[n]
Figure 6. QDR memory mapping
4. SpMV implementation
After we have successfully stored the data in QDR, the next step is to let the SpMV start computing. As shown in the dataflow below, we first fetch the next ptr from QDR2 to find where the current row ends. Then we read col/val pairs from QDR1 consecutively until we reach the end of the row, meanwhile reading the vector element referenced by each col. Note that reading vec from QDR2 has a latency of 10 cycles, so we put a delay after val to compensate for the vec fetching latency. Once we have vec and val, we multiply them and send the product to the accumulator described above. Note that Stage 2 to Stage 4 are pipelined stages. It is also worth noting that the fp multiplier in stage 4 and the accumulator in stage 5 actually run at half the overall clock frequency, because it takes two clock cycles to read a col/val pair.
QDR2 (cf. Figure 6):
0x0000  ptr[0]
...     ...
0x4000  vec[0]
...     ...
0x8000  res[0]
...     ...
Figure 7. Dataflow of the SpMV
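The val delay that compensates for the vec fetch latency can be sketched as a simple shift register. The 10-cycle depth is the figure given in the text; the class name is illustrative:

```cpp
#include <deque>

// Fixed-depth delay line: the value pushed in at cycle t comes out at
// cycle t + depth, aligning val with the vec word returning from QDR2.
template <typename T>
class DelayLine {
    std::deque<T> q;
public:
    explicit DelayLine(int depth) : q(depth, T{}) {}
    T step(T in) {                 // advance one clock cycle
        q.push_back(in);
        T out = q.front();
        q.pop_front();
        return out;
    }
};
```

A DelayLine of depth 10 in front of the multiplier keeps each val paired with the vec element its col selected.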
The interface of this SpMV is listed and described below:
Port name format I/O detail
clk Ufix_1 in clock
reset Ufix_1 in active low
reset2 Ufix_1 in active low; should be asserted with a delay relative to 'reset' during initialization
length Ufix_32 in indicating the length of the col/val arrays
rd_en1 Ufix_1 out read enable for QDR1
wr_en1 Ufix_1 out write enable for QDR1
rd_en2 Ufix_1 out read enable for QDR2
wr_en2 Ufix_1 out write enable for QDR2
addr1 Ufix_32 out address for QDR1
addr2 Ufix_32 out address for QDR2
start Ufix_1 in indicating data has finished loading and ready for calculation
finish Ufix_1 out indicating the calculation has completed
data1_in unsigned_36 in read port for QDR1
data1_out unsigned_36 out write port for QDR1
data2_in unsigned_36 in read port for QDR2
data2_out unsigned_36 out write port for QDR2
data_valid1 Ufix_1 in indicating the data from QDR1 is ready
data_valid2 Ufix_1 in indicating the data from QDR2 is ready
Table 2. Interface of the SpMV
Note that the high 4 bits of each QDR data port are ECC bits and thus can be neither read nor written.
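Packing a col/val pair accordingly might look like the following sketch. It assumes each QDR1 address maps to two 36-bit words whose high 4 bits are the ECC bits mentioned above, leaving 32 usable data bits per word; the names and layout are illustrative, not the actual hardware interface:

```cpp
#include <cstdint>
#include <cstring>

// One QDR1 address holds a (col, val) pair as two 36-bit words; only
// the low 32 bits of each word are usable, the top 4 being ECC.
struct QdrPair { uint64_t w0, w1; };

QdrPair packColVal(uint32_t col, float val) {
    uint32_t bits;
    std::memcpy(&bits, &val, sizeof bits);    // raw IEEE-754 single bits
    return { col, bits };                     // both fit in the low 32 bits
}

void unpackColVal(const QdrPair& p, uint32_t& col, float& val) {
    col = (uint32_t)(p.w0 & 0xFFFFFFFFu);     // mask off the ECC bits
    uint32_t bits = (uint32_t)(p.w1 & 0xFFFFFFFFu);
    std::memcpy(&val, &bits, sizeof val);
}
```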
The FSM I use for this SpMV is shown below. In practice I merged stages 2, 3, and 4 together:
IDLE: wait for data loading
READ_PTR: read the next ptr
READ_VAL: read a col/val pair, read the vec element referenced by col, send vec and val to the fp multiplier
RETRIEVE: send the product from the fp multiplier to the accumulator, retrieve the sum from the accumulator, and write the result into QDR
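A software model of this FSM (control flow only; the memory latencies and the merged stages are ignored, and all names are illustrative) is:

```cpp
#include <cstddef>
#include <vector>

// Software model of the SpMV FSM: IDLE / READ_PTR / READ_VAL /
// RETRIEVE, stepping through a CSR matrix one row at a time.
enum class State { IDLE, READ_PTR, READ_VAL, RETRIEVE };

std::vector<float> runFsm(const std::vector<int>& ptr,
                          const std::vector<int>& col,
                          const std::vector<float>& val,
                          const std::vector<float>& vec) {
    std::vector<float> res;
    State s = State::IDLE;
    std::size_t row = 0, j = 0;
    int rowEnd = 0;
    float sum = 0.0f;
    bool start = true;                       // data already loaded
    while (true) {
        switch (s) {
        case State::IDLE:                    // wait for data loading
            if (start) s = State::READ_PTR;
            break;
        case State::READ_PTR:                // read next ptr
            if (row + 1 >= ptr.size()) return res;
            rowEnd = ptr[row + 1];
            sum = 0.0f;
            s = State::READ_VAL;
            break;
        case State::READ_VAL:                // multiply-accumulate the row
            if (j < (std::size_t)rowEnd) {
                sum += val[j] * vec[col[j]];
                j++;
            } else {
                s = State::RETRIEVE;
            }
            break;
        case State::RETRIEVE:                // write result, go to next row
            res.push_back(sum);
            row++;
            s = State::READ_PTR;
            break;
        }
    }
}
```

Running it on the same small example as before reproduces the reference result, which is a quick way to sanity-check the state transitions.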
The five pipeline stages of Figure 7 are: Stage 1, read ptr from QDR2; Stage 2, read col and val from QDR1; Stage 3, read vec from QDR2 and delay val; Stage 4, multiply vec and val; Stage 5, send the product to the accumulator.
Figure 8. Finite state machine of the SpMV
5. Mapping on FPGA
We use Simulink to simulate our design on the FPGA. The FPGA system is called Reconfigurable Open Architecture Computing Hardware (ROACH). ROACH is a Virtex-5-based upgrade to current CASPER hardware that merges aspects of the IBOB and BEE2 platforms into a single board. To encapsulate our Verilog code into blocks in Simulink, we import the Verilog into the 'black box' provided by Simulink, compile the design, program it onto the FPGA, and then test it.
The Simulink model is shown below. Essentially there is only one SpMV black box and two QDR blocks at the top level; the accumulator and the other building blocks are encapsulated inside the SpMV block.
Figure 9. Simulink Model
6. Simulation Result
The table below shows the resource usage information:
Resource Used Available Percentage
BUFG 5 32 15%
DCM_ADV 2 12 16%
DSP48E 9 640 1%
IDELAYCTRL 2 22 9%
ILOGIC 83 800 10%
External IOB 184 640 28%
IODELAY 36 800 4%
OLOGIC 111 800 13%
Slice Register 2679 58880 4%
Slice LUT 4056 58880 6%
Slice LUT-Flip Flop Pairs 4797 58880 8%
Table 3. Resource usage report
The system runs at 100MHz, and below are the latencies measured for several typical matrices. The overall time measures from the very beginning of the simulation to its end, while the computation time covers only the calculation itself and excludes the time for loading the matrix data.
Matrix           Rows       Columns    Nonzeros    Nonzeros/Row  Sparsity    Latency (computation)  Latency (overall)
Dense            2,000      2,000      4,000,000   2000.00       100.00000%  81.00ms                161.00ms
Protein          36,417     36,417     4,344,765   119.31        0.32761%    108.74ms               195.64ms
FEM/Spheres      83,334     83,334     6,010,480   72.13         0.08655%    170.21ms               290.42ms
FEM/Cantilever   62,451     62,451     4,007,383   64.17         0.10275%    117.62ms               197.77ms
Wind Tunnel      217,918    217,918    11,524,432  52.88         0.02427%    361.24ms               591.73ms
FEM/Harbor       46,835     46,835     2,374,001   50.69         0.10823%    75.58ms                123.06ms
QCD              49,152     49,152     1,916,928   39.00         0.07935%    68.33ms                106.67ms
FEM/Ship         140,874    140,874    3,568,176   25.33         0.01798%    155.89ms               227.25ms
Economics        206,500    206,500    1,273,389   6.17          0.00299%    149.37ms               174.84ms
Epidemiology     525,825    525,825    2,100,225   3.99          0.00076%    357.50ms               399.50ms
FEM/Accelerator  121,192    121,192    2,624,331   21.65         0.01787%    125.63ms               178.12ms
Circuit          170,998    170,998    958,936     5.61          0.00328%    121.78ms               140.96ms
Webbase          1,000,005  1,000,005  3,105,536   3.11          0.00031%    662.11ms               724.22ms
LP               4,284      1,092,610  11,279,748  2632.99       0.24098%    228.16ms               453.76ms
Table 4. Performance measurement on different test cases
Ideally, the maximum floating-point throughput is 100 MFLOPs if we count the fp multiplier as one floating-point operation and the accumulator as a whole as another, with both running at 50MHz (two operations per cycle at 50MHz). For our simulation statistics, the throughput is calculated as:
Throughput = (2 * Nonzeros) / (Computation time)
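For example, the formula can be applied to the Dense test case from Table 4 (4,000,000 nonzeros, 81.00ms computation time); the function name below is illustrative:

```cpp
// Throughput in MFLOPs: two fp operations per nonzero (one multiply,
// one accumulate) divided by the computation time in seconds.
double throughputMflops(double nonzeros, double computationTimeSec) {
    return 2.0 * nonzeros / computationTimeSec / 1e6;
}
// Dense test case: throughputMflops(4000000, 0.081) is about 98.8 MFLOPs.
```

This lands close to the 100 MFLOPs theoretical maximum, as expected for a fully dense matrix.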
Thus we get a histogram of the throughput across the different test cases:
Figure 10. Throughput (MFLOPs) for the different test cases
As illustrated above, throughput correlates with sparsity (the nonzero fraction). In fact, there is approximately a logarithmic relation between throughput and sparsity:
Figure 11. Logarithmic relation between throughput and sparsity
7. Conclusion
We have designed a sparse matrix-vector multiplier built on the given floating-point adder and multiplier, and successfully mapped it onto the ROACH FPGA system. For matrices with sparsity larger than 0.1%, it maintains a throughput of no less than half the theoretical maximum. However, it still performs poorly on the sparsest matrices. Optimizations for the case where an entire row is empty could be added to the design in future work.