

Matrix Multiplication on SOPC

Project instructor: Ina Rivkin

Students: Shai Amara Shuki Gulzari

Project duration: one semester

Project Goals:

Implement a matrix multiplication IP. The IP multiplies an N x M matrix A by an M x L matrix B and produces an N x L result matrix.

Integrate the IP into a system on a programmable chip (SOPC).

Specification of Matrix IP

The matrix sizes N, M and L can each vary from 1 to 127.

The elements of the two multiplied matrices can have values that range from -2^15 up to +2^15 - 1.

The elements of the result matrix are integers (32 bits) and can have values from -2^31 up to +2^31 - 1.
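These ranges are worth checking with a little arithmetic. The largest product of two operands is (-2^15) x (-2^15) = 2^30, so a sum of M such products can leave the 32-bit result range as soon as M >= 2. The slides do not say how the IP treats that case, so the sketch below only computes the worst-case bound. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value a single product can reach: (-2^15) * (-2^15) = 2^30. */
#define MAX_PRODUCT (INT64_C(1) << 30)

/* Worst-case value of one result element for an inner dimension m (1..127),
 * computed in 64 bits so that the bound itself cannot overflow. */
int64_t worst_case_sum(int m)
{
    assert(m >= 1 && m <= 127);
    return (int64_t)m * MAX_PRODUCT;
}

/* True when every possible result element fits in a signed 32-bit integer. */
bool always_fits_int32(int m)
{
    return worst_case_sum(m) <= INT32_MAX;
}
```

With these definitions `always_fits_int32(1)` holds and `always_fits_int32(2)` does not, so inputs near the extremes of the 16-bit range need care.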

Specification of Matrix IP – continued

Each matrix is stored in a separate address range of the IP.

An address range is limited to a maximum of 64 KB, which is sufficient for 16K (2^14) integers, so the largest square matrix that fits is 128 x 128.

* Our maximum is 127 x 127.
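The capacity figure follows from the word size, assuming each matrix element occupies one 32-bit (4-byte) word:

```latex
\frac{64\ \mathrm{KB}}{4\ \mathrm{B}} = \frac{2^{16}}{2^{2}} = 2^{14} = 16384 = 128^{2}
```

A 128 x 128 matrix would therefore fill an address range exactly, while a 127 x 127 matrix needs 127^2 = 16129 words and fits with room to spare.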

Implementation: General Hardware Scheme

[Block diagram: Processor, Matrix Multiplication IP, PLB/OPB bridge and UART, connected through the PLB and OPB buses.]

IP's inner address ranges

The IP has 3 address ranges: AR0 (matrix A), AR1 (matrix B) and AR2 (result matrix).

The CPU is only allowed to write to AR0 and AR1 and only to read from AR2.

The IP's FSM is only allowed to read from AR0 and AR1 and only to write to AR2.

General Implementation Idea: Block diagram

[Block diagram of the matrix multiplication unit. Memory: Matrix A, Matrix B and the result matrix, with clock, address, data and write-enable inputs and a data output. Logic: the FSM and a multiply-accumulate unit. Registers: R0 (start signal and sizes of matrices) and R1 (finish bit).]

Actual Implementation Block diagram

Implementation – a Simple Example

First, the processor writes the two matrices into the IP's 1st and 2nd address ranges.

ADDRESS RANGE 0:

Address  0x0  0x1  0x2  0x3  ...
Data     3    -1   2    5    ...

ADDRESS RANGE 1:

Address  0x0  0x1  ...
Data     4    6    ...

A Simple Example – continued

Secondly, it writes the matrix sizes (N, M, L) and the start bit to the IP's inner register in the following format:

Bits   31-22       21     20-14  13-7  6-0
Field  Don't care  Start  N      M     L

* The IP's FSM reads N, M and L as unsigned numbers, so the maximum size for each of them is 2^7 - 1 = 127.

A Simple Example – continued

In our example the sizes could be 2x2 and 2x1, or 4x1 and 1x2. Let's take the case of 2x2 and 2x1. The inner register will be written with:

Field  Start  N        M        L
Value  1      0000010  0000010  0000001
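The register write can be sketched in C as a small packing helper. This is only an illustration: it assumes L sits in bits 0 to 6, M in bits 7 to 13, N in bits 14 to 20 and the start bit in bit 21, with bit 0 being the least significant bit. The function name and macros are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions, assuming bit 0 is the least significant bit. */
#define L_SHIFT      0u
#define M_SHIFT      7u
#define N_SHIFT      14u
#define START_SHIFT  21u
#define SIZE_MASK    0x7Fu   /* each size field is 7 bits wide */

/* Build the 32-bit control word; bits 31..22 are "don't care" and left at 0. */
uint32_t pack_control(unsigned n, unsigned m, unsigned l, unsigned start)
{
    assert(n >= 1 && n <= 127);
    assert(m >= 1 && m <= 127);
    assert(l >= 1 && l <= 127);
    assert(start <= 1);

    return ((uint32_t)start << START_SHIFT) |
           ((uint32_t)(n & SIZE_MASK) << N_SHIFT) |
           ((uint32_t)(m & SIZE_MASK) << M_SHIFT) |
           ((uint32_t)(l & SIZE_MASK) << L_SHIFT);
}
```

For the 2x2 by 2x1 case, `pack_control(2, 2, 1, 1)` yields 0x00208101, which is 1 0000010 0000010 0000001 in binary.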

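IP's FSM

[State diagram. The steps below are reconstructed from its labels.]

1. Idle: Finish bit <= 1. Stay in Idle while start = '0'.
2. When start = '1': n, m, l <= sizes; row <= 0.
3. i <= 0; col <= 0.
4. Address_A <= i + row*m; Address_B <= i*l + col; Sel_A <= 1; Sel_B <= 1.
5. i <= i + 1; Sel_A <= 0; Sel_B <= 0. While i < m-1, go back to step 4.
6. When i = m-1: WE <= 1; Data_out <= data_in; Add_out <= row*l + col.
7. If col < l-1: col <= col + 1; i <= 0; WE <= 0; go back to step 4.
8. If col = l-1 and row < n-1: row <= row + 1; WE <= 0; go back to step 3.
9. If col = l-1 and row = n-1: return to Idle.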
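The addressing used by the FSM (Address_A <= i + row*m, Address_B <= i*l + col, Add_out <= row*l + col) can be checked with a small software model. This is a sketch of the loop nest only, not of the hardware timing or control signals. The function name is hypothetical, and sums are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the FSM's loop nest. ar0, ar1 and ar2 stand in for the
 * three address ranges, each holding a row-major matrix. */
void fsm_model(const int32_t *ar0, const int32_t *ar1, int32_t *ar2,
               unsigned n, unsigned m, unsigned l)
{
    assert(n >= 1 && m >= 1 && l >= 1);

    for (unsigned row = 0; row < n; row++) {
        for (unsigned col = 0; col < l; col++) {
            int32_t acc = 0;                      /* accumulator, cleared per element */
            for (unsigned i = 0; i < m; i++) {
                unsigned address_a = i + row * m; /* Address_A <= i + row*m */
                unsigned address_b = i * l + col; /* Address_B <= i*l + col */
                acc += ar0[address_a] * ar1[address_b];
            }
            ar2[row * l + col] = acc;             /* Add_out <= row*l + col */
        }
    }
}
```

Feeding it AR0 = {3, -1, 2, 5}, AR1 = {4, 6} and n = 2, m = 2, l = 1 leaves {6, 38} in AR2.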

EXAMPLE – Continue

In our example the FSM will do the following:

AR0 (addresses 0x0 to 0x3): 3, -1, 2, 5
AR1 (addresses 0x0 to 0x1): 4, 6

Operand from AR0  Operand from AR1  Xilinx multiplier  Accumulator
3                 4                 12                 12
-1                6                 -6                 6   (written to AR2 at 0x0)
2                 4                 8                  8
5                 6                 30                 38  (written to AR2 at 0x1)

AR2 (addresses 0x0 to 0x1): 6, 38

EXAMPLE – Continue

And indeed:

$$\begin{pmatrix} 3 & -1 \\ 2 & 5 \end{pmatrix} \times \begin{pmatrix} 4 \\ 6 \end{pmatrix} = \begin{pmatrix} 6 \\ 38 \end{pmatrix}$$

Implementation – continue

The result matrix is saved in the IP's third address range. The IP informs the processor that the task is complete by asserting the finish bit, which the CPU polls.

After the CPU reads finish bit = 1, it can read the result matrix from the IP.
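Seen from the CPU, the whole sequence (write A and B, write the sizes and start bit, poll the finish bit, read the result) could look roughly like the sketch below. The slides give no base addresses, register offsets or finish-bit position, so the `matmul_ip` structure, the `matmul_run` function and the choice of bit 0 for the finish bit are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the IP as seen by the CPU. */
typedef struct {
    volatile int32_t        *ar0;     /* matrix A: CPU writes only   */
    volatile int32_t        *ar1;     /* matrix B: CPU writes only   */
    volatile const int32_t  *ar2;     /* result:   CPU reads only    */
    volatile uint32_t       *control; /* sizes and start bit         */
    volatile const uint32_t *status;  /* finish bit, assumed in bit 0 */
} matmul_ip;

/* Runs one multiplication and returns how many times the finish bit was
 * polled before it read as 1. */
unsigned long matmul_run(const matmul_ip *ip,
                         const int32_t *a, size_t a_len,
                         const int32_t *b, size_t b_len,
                         uint32_t control_word,
                         int32_t *c, size_t c_len)
{
    assert(ip && a && b && c);
    unsigned long polls = 0;

    for (size_t k = 0; k < a_len; k++) ip->ar0[k] = a[k]; /* load matrix A */
    for (size_t k = 0; k < b_len; k++) ip->ar1[k] = b[k]; /* load matrix B */

    *ip->control = control_word;     /* sizes + start bit start the FSM */

    while ((*ip->status & 1u) == 0)  /* busy-wait on the finish bit */
        polls++;

    for (size_t k = 0; k < c_len; k++) c[k] = ip->ar2[k]; /* fetch the result */
    return polls;
}
```

On real hardware the five pointers would be set to the IP's memory-mapped addresses. Taking them as parameters also lets the routine be exercised against ordinary arrays on a host machine.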

The Verification Process

For sizes of up to 16 x 16, validation was done by allocating memory and filling matrices A and B with random values.

The validation was simply a comparison between matrices C (result) and D (expected).

When dealing with larger sizes we encountered a problem allocating large memories (in software).

So we did not allocate memory and instead used A[i][j] = i + j; B[i][j] = i - j; and compared the output to the known result.
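With A[i][j] = i + j and B[i][j] = i - j, each expected element can be computed on the fly, so no matrix has to be held in memory. The slides do not show the actual test code. The sketch below is one way to obtain the known result, both as a direct sum and in closed form, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expected value of C[i][j] when A[i][k] = i + k and B[k][j] = k - j,
 * summed over the inner dimension m without storing either matrix. */
int32_t expected_element(int i, int j, int m)
{
    assert(i >= 0 && j >= 0 && m >= 1);
    int32_t sum = 0;
    for (int k = 0; k < m; k++)
        sum += (int32_t)(i + k) * (k - j);
    return sum;
}

/* Same value in closed form:
 *   sum_k (i + k)(k - j) = (i - j)*S1 + S2 - i*j*m,
 * with S1 = m(m-1)/2 and S2 = (m-1)m(2m-1)/6. */
int32_t expected_element_closed(int i, int j, int m)
{
    int32_t s1 = m * (m - 1) / 2;
    int32_t s2 = (m - 1) * m * (2 * m - 1) / 6;
    return (i - j) * s1 + s2 - i * j * m;
}
```

The checker can then compare each word read back from the result address range against `expected_element(i, j, M)` as it goes.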

Performance analysis

The number of clock cycles taken by the state machine:

$$\{[(3M + 2) \cdot L] + 2\} \cdot N + 3 = 3MLN + 2LN + 2N + 3 = O(N \cdot M \cdot L)$$

Total: O(N*M*L) clock cycles.

Since we found it difficult to measure the number of clock cycles the calculation takes in software, we carried out a software-side comparison that gives a good indication of our hardware's performance.
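The cycle count is easy to tabulate for concrete sizes. The helper below evaluates the formula as stated in the slide, in both its bracketed and expanded forms. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles taken by the FSM, using the formula from the slide:
 *   { [ (3*M + 2) * L ] + 2 } * N + 3 */
uint32_t fsm_cycles(uint32_t n, uint32_t m, uint32_t l)
{
    assert(n >= 1 && n <= 127);
    assert(m >= 1 && m <= 127);
    assert(l >= 1 && l <= 127);
    return (((3u * m + 2u) * l) + 2u) * n + 3u;
}

/* The same count after expanding the brackets. */
uint32_t fsm_cycles_expanded(uint32_t n, uint32_t m, uint32_t l)
{
    return 3u * m * l * n + 2u * l * n + 2u * n + 3u;
}
```

For the 2x2 by 2x1 example (N = 2, M = 2, L = 1) both forms give 23 clock cycles.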

Performance analysis – continue

In software the calculation is:

    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                C[i][j] += A[i][k] * B[k][j];

In this implementation the CPU executes the loop body N*M*L times (loop iterations, not clock cycles)!

Performance analysis – continue

To compare this with our IP's performance, we counted the number of times the CPU goes around the while() loop in which it waits for the finish signal.

The following graph compares the number of CPU operations for square matrices of sizes 2x2 to 15x15.

Performance analysis – Comparison results

[Graph "HW and SW comparison": number of CPU operations (y axis, 0 to 4000) against size of square matrix (x axis, 0 to 20), with one series for HW and one for SW.]

Conclusion: Our IP provides an excellent solution for applications that require many multiplications of large matrices!

Improvement suggestions

For better performance, additional multipliers can be added to the design, so that more numbers are multiplied in each cycle and the calculation time is shortened.

Using an interrupt instead of polling would also save valuable CPU time.

Thank you!
