Post on 19-Dec-2015
Matrix Multiplication on SOPC
Project instructor: Ina Rivkin
Students: Shai Amara Shuki Gulzari
Project duration: one semester
Project Goals:
Implementing a matrix multiplication IP. The IP multiplies an N x M matrix A by an M x L matrix B and produces an N x L result matrix.
Integrating the IP into a system on a programmable chip (SOPC).
Specification of Matrix IP
The matrix sizes N, M and L can each vary from 1 to 127.
The elements of the two multiplied matrices can have values that range from -2^15 up to +2^15 - 1.
The elements of the result matrix are of type integer (32 bits) and can have values from -2^31 up to +2^31 - 1.
Specification of Matrix IP (continued)
Each matrix is stored in a separate address range of the IP.
An address range is limited to a maximum of 64 KB, which is sufficient for 16K (2^14) integers, so the largest square matrix that fits is 128 x 128.
* Our maximum is 127 x 127.
Implementation: General Hardware Scheme
[Block diagram. Components: Processor, Matrix Multiplication IP, PLB/OPB bridge, UART. Buses: PLB, OPB.]
IP's inner address ranges
The IP has 3 address ranges: AR0 (matrix A), AR1 (matrix B) and AR2 (result matrix).
The CPU is only allowed to write to AR0 and AR1, and only to read from AR2.
The IP's FSM is only allowed to read from AR0 and AR1, and only to write to AR2.
General Implementation Idea: Block Diagram
[Block diagram of the Matrix Multiplication unit. Blocks: Memory (Matrix A, Matrix B, the result matrix), Logic, FSM, Mult Accum, register R0 (start signal and sizes of matrices), register R1 (finish bit). Signals: clock, address, data, write enable, data out.]
Implementation: a Simple Example
First, the processor writes the two matrices into the IP's first and second address ranges.

ADDRESS RANGE 0:
Address:  0x0   0x1   0x2   0x3   ...
Data:       3    -1     2     5   ...

ADDRESS RANGE 1:
Address:  0x0   0x1   ...
Data:       4     6   ...
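
As a minimal sketch of this first step, the C function below copies a matrix in row-major order into an address range, one element per address. The name `load_matrix`, the 32-bit word width and the use of a plain pointer for the address range are assumptions made for illustration; on real hardware the pointer would be the memory-mapped base of AR0 or AR1, which the slides do not give.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a rows x cols matrix into an address range, row by row.
 * Assumption: each element occupies one 32-bit word of the range,
 * so element (r, c) lands at word index r * cols + c. */
void load_matrix(volatile int32_t *range, const int16_t *src,
                 size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            /* The 16-bit input value is sign-extended to 32 bits. */
            range[r * cols + c] = src[r * cols + c];
        }
    }
}
```

With the example data, loading {3, -1, 2, 5} as a 2x2 matrix places 3 at index 0x0 and 5 at index 0x3.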
A Simple Example (continued)
Secondly, it writes the matrix sizes (N, M, L) and the start bit to the IP's inner register in the following format:

Bits 31-22    Bit 21   Bits 20-14   Bits 13-7   Bits 6-0
Don't care    Start    N            M           L

* The IP's FSM reads N, M and L as unsigned numbers, so the maximum size for each of them is 2^7 - 1 = 127.
A Simple Example (continued)
In our example the sizes could be 2x2 and 2x1, or 4x1 and 1x2. Let's take the case of 2x2 and 2x1. The inner register will be written with:

Start   N         M         L
1       0000010   0000010   0000001
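
The register format can be sketched in C as a small packing helper. The field positions (L in bits 0-6, M in bits 7-13, N in bits 14-20, start in bit 21) come from the slide; the function and macro names are hypothetical, and leaving the don't-care bits at zero is an assumption.

```c
#include <stdint.h>

/* Field positions taken from the register format on the slide. */
#define L_SHIFT      0u    /* bits 0..6   */
#define M_SHIFT      7u    /* bits 7..13  */
#define N_SHIFT      14u   /* bits 14..20 */
#define START_SHIFT  21u   /* bit 21      */
#define SIZE_MASK    0x7Fu /* 7-bit unsigned field, maximum 127 */

/* Build the control word; bits 22..31 are don't-care and left at 0. */
uint32_t pack_control(uint32_t n, uint32_t m, uint32_t l, uint32_t start)
{
    return ((start & 1u) << START_SHIFT) |
           ((n & SIZE_MASK) << N_SHIFT) |
           ((m & SIZE_MASK) << M_SHIFT) |
           ((l & SIZE_MASK) << L_SHIFT);
}
```

For the 2x2 and 2x1 case, `pack_control(2, 2, 1, 1)` yields 0x208101.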
IP's FSM
[State diagram. Labels recovered from the figure:
State actions:
  Idle
  Finish bit <= 1
  n, m, l <= sizes
  row <= 0
  i <= 0; col <= 0
  Address_A <= i + row*m; Address_B <= i*l + col; Sel_A <= 1; Sel_B <= 1
  i <= i + 1; Sel_A <= 0; Sel_B <= 0
  WE <= 1; Data_out <= data_in; Add_out <= row*l + col
  col <= col + 1
  i <= 0; WE <= 0
  row <= row + 1; WE <= 0
Transition conditions:
  start = '0'; start = '1'; i < m-1; i = m-1; col < l-1; col = l-1; row < n-1; row = n-1]
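
The addressing used by the FSM can be illustrated with a behavioral software model. This is only a sketch: it reproduces the address formulas from the state diagram (Address_A = i + row*m, Address_B = i*l + col, Add_out = row*l + col) but is not cycle accurate and does not model the Sel or WE signals. The function name, the plain arrays standing in for AR0, AR1 and AR2, and the assumption that each accumulated sum fits in 32 bits are all illustrative.

```c
#include <stdint.h>

/* Behavioral model of the multiply-accumulate loop, not cycle accurate.
 * ar0, ar1 and ar2 stand in for the three address ranges. */
void matmul_model(const int16_t *ar0, const int16_t *ar1, int32_t *ar2,
                  unsigned n, unsigned m, unsigned l)
{
    for (unsigned row = 0; row < n; row++) {
        for (unsigned col = 0; col < l; col++) {
            int32_t acc = 0; /* accumulator cleared per result element */
            for (unsigned i = 0; i < m; i++) {
                unsigned address_a = i + row * m; /* Address_A <= i + row*m */
                unsigned address_b = i * l + col; /* Address_B <= i*l + col */
                acc += (int32_t)ar0[address_a] * (int32_t)ar1[address_b];
            }
            ar2[row * l + col] = acc;             /* Add_out <= row*l + col */
        }
    }
}
```

Run on the example data (AR0 = 3, -1, 2, 5 and AR1 = 4, 6 with n = 2, m = 2, l = 1), the model writes 6 and 38.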
EXAMPLE (continued)
In our example the FSM will do the following:
[Animated figure: the FSM reads 3, -1, 2, 5 from AR0 (addresses 0x0 to 0x3) and 4, 6 from AR1 (addresses 0x0 and 0x1), passes them through the Xilinx multiplier and the accumulator, and writes the results to AR2 (addresses 0x0 and 0x1):
AR2[0x0] = 3*4 + (-1)*6 = 6
AR2[0x1] = 2*4 + 5*6 = 38]
Implementation (continued)
The result matrix is saved in the IP's third address range. The IP informs the processor that the task is complete by asserting a finish bit, which the CPU polls.
After the CPU reads finish bit = 1, it can read the result matrix from the IP.
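
A polling loop of this kind might look like the sketch below. The slides do not give the register address or the position of the finish bit, so the pointer argument, the `FINISH_BIT` mask (bit 0) and the `max_polls` timeout are assumptions for illustration only.

```c
#include <stdint.h>

#define FINISH_BIT 0x1u /* assumed position of the finish bit */

/* Spin until the finish bit is set or max_polls reads have been made.
 * Returns the number of reads that saw the bit still clear, so a
 * return value equal to max_polls means the wait timed out. */
uint32_t wait_for_finish(const volatile uint32_t *status_reg,
                         uint32_t max_polls)
{
    uint32_t polls = 0;
    while (polls < max_polls && (*status_reg & FINISH_BIT) == 0u) {
        polls++;
    }
    return polls;
}
```

Returning the poll count is a design choice that also lets the caller see how long the CPU spent waiting.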
The Verification Process
For sizes of up to 16 x 16, we validated the IP by allocating memory for matrices A and B and filling them with random values.
The validation was simply a comparison between matrix C (result) and matrix D (expected).
When dealing with larger sizes we ran into a problem allocating large memories in software.
So we did not allocate memory and instead used A[i][j] = i + j; B[i][j] = i - j; and compared the output with the known result.
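
One way to obtain the known result for this pattern without storing any matrix is sketched below. The slides do not say how the expected values were actually computed, so both functions are assumptions: the first sums the products term by term, the second uses the equivalent closed form.

```c
#include <stdint.h>

/* Expected C[i][j] for A[i][k] = i + k and B[k][j] = k - j,
 * computed term by term so that no matrix has to be stored. */
int32_t expected_entry(int32_t i, int32_t j, int32_t m)
{
    int32_t sum = 0;
    for (int32_t k = 0; k < m; k++) {
        sum += (i + k) * (k - j);
    }
    return sum;
}

/* Same value in closed form:
 * sum over k of (i + k)(k - j) = (i - j) * S1 - i * j * m + S2,
 * with S1 = 0 + 1 + ... + (m-1) and S2 = 0^2 + 1^2 + ... + (m-1)^2. */
int32_t expected_entry_closed(int32_t i, int32_t j, int32_t m)
{
    int32_t s1 = m * (m - 1) / 2;
    int32_t s2 = (m - 1) * m * (2 * m - 1) / 6;
    return (i - j) * s1 - i * j * m + s2;
}
```

For sizes up to 127 every intermediate value stays well inside the 32-bit range.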
Performance analysis
The number of clock cycles the state machine takes:
{ [ (3*M + 2) x L ] + 2 } x N + 3 = 3*M*L*N + 2*L*N + 2*N + 3 = O(N*M*L).
Total: O(N*M*L) clock cycles.
Since we found it difficult to measure the number of clock cycles the calculation takes in software, we conducted a comparison in software that gives a good indication of our hardware's performance.
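
The cycle-count formula can be evaluated directly for any supported size. The helper below is a straightforward transcription of the expression above; the function name is hypothetical.

```c
#include <stdint.h>

/* Clock cycles of the state machine, per the formula on the slide:
 * { [ (3*M + 2) * L ] + 2 } * N + 3 */
uint32_t fsm_cycles(uint32_t n, uint32_t m, uint32_t l)
{
    return (((3u * m + 2u) * l) + 2u) * n + 3u;
}
```

For the 2x2 and 2x1 example this gives 23 cycles, and for the largest supported case, N = M = L = 127, it gives 6,177,664 cycles.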
Performance analysis (continued)
In software the calculation is:
    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                C[i][j] += A[i][k] * B[k][j];
In this implementation the CPU executes the loop body N*M*L times (loop iterations, not clock cycles)!
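
A self-contained version of this loop that also counts how often the innermost body runs is sketched below. The `MAX_DIM` bound, the function name and the explicit clearing of C are assumptions added to make the fragment compile on its own; the slides show only the loop nest.

```c
#include <stdint.h>

#define MAX_DIM 16 /* illustrative bound, matching the 16 x 16 test sizes */

/* Reference software multiplication. Returns how many times the
 * innermost loop body ran, which is N*M*L. */
uint32_t matmul_sw(int32_t c[MAX_DIM][MAX_DIM],
                   int16_t a[MAX_DIM][MAX_DIM],
                   int16_t b[MAX_DIM][MAX_DIM],
                   int n, int m, int l)
{
    uint32_t visits = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < l; j++) {
            c[i][j] = 0; /* clear before accumulating */
            for (int k = 0; k < m; k++) {
                c[i][j] += (int32_t)a[i][k] * b[k][j];
                visits++;
            }
        }
    }
    return visits;
}
```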
Performance analysis (continued)
In order to compare it with our IP's performance, we counted the number of times we "visit" the body of the while() loop in which we wait for the finish signal.
The following graph shows a comparison between the number of CPU operations for square matrices of sizes 2x2 through 15x15.
Performance analysis: Comparison results
[Graph: "HW and SW comparison". X axis: size of square matrix (0 to 20). Y axis: number of CPU operations (0 to 4000). Series: HW and SW.]
Conclusion: our IP provides an excellent solution for applications that require many multiplications of large matrices!
Improvement suggestions
For better performance, additional multipliers can be added to the design so that more numbers are multiplied in each cycle, shortening the calculation time.
Using an interrupt instead of polling would also save valuable CPU time.