Implementing a NoMC on the Gidel Platform
End-semester presentation

Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
Winter 2009

Instructor: Evgeny Fiksman
Students: Meir Cohen, Daniel Marcovitch
Table of Contents
Project goals
Previous router
Our routers
Software design
Obstacles
Testing
Time tables
Project goals
• Implementing a parallel processing system which contains several NoCs, each chip containing several sub-networks of processors.
• Converting the existing router to support the Altera platform.
• Expanding the router to enable communication between similar sub-networks.
• Implementing a processor network which supports communication with the PC, enabling:
  o Use of the PC's CPU as part of the processing network.
  o Simple I/O between the PC and the rest of the processing network.
Top-level structure of the expanded network
Each white square represents a single FPGA on the Gidel board.
FPGA-to-FPGA and FPGA-to-PC routes go via designated routers (GW).
The design and protocols of the GWs are the same as those of the internal routers.
Router from previous project
[Figure: Cross Bar, low level. Four Port FSM units (Port #1 to Port #4), each with an FSL interface (Fsl_S_Data, Fsl_S_Read, Fsl_S_Control, Fsl_S_HasData, Fsl_M_Data, Fsl_M_Write, Fsl_M_Control, Fsl_M_Full), are connected through Control Bus I, Control Bus II and a 32-bit Data Bus to the Permission Unit and the Timer & Enable Unit. Signals shown: Clk, Rst, Req, Dest, Premit, COMM, Bcast, BcastReq, BcastPriority.]
• Two main units: Permission Unit and Port FSM
• Time-limited round-robin arbiter
• Port-to-port communication and broadcasting
• Smart connectivity: router to router (R-R) and router to core (R-Core)
• Modular design
Permission process
• Round-robin arbiter: service order follows a loop counter.
• Check that DEST is not busy.
• Permit for a 'time slot'.
• If a port is not requesting, service the next requesting port.
• The BUSY ports and the LAST writing ports are saved.
• Check messages for COMM and direct them to the relevant port according to the table.
• Broadcast priority enables only one broadcast at a time.
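The permission process above can be modeled in a few lines. The following is a minimal behavioral sketch in Python, not the router's hardware description: the class name `RoundRobinArbiter`, the 0-based port numbering and the one-grant-per-call simplification are assumptions made for illustration, and broadcast priority is left out.

```python
class RoundRobinArbiter:
    """Behavioral model of a round-robin permission unit (illustrative only)."""

    def __init__(self, num_ports=4):
        self.num_ports = num_ports
        self.next_port = 0                 # loop counter: first port to examine
        self.busy = [False] * num_ports    # destination ports currently in use

    def grant(self, requests):
        """Return the (source, dest) pair granted a time slot, or None.

        `requests` maps a requesting source port to its destination port.
        Ports are scanned in loop-counter order; a port that is not
        requesting, or whose destination is busy, is skipped.
        """
        for offset in range(self.num_ports):
            src = (self.next_port + offset) % self.num_ports
            dest = requests.get(src)
            if dest is None or self.busy[dest]:
                continue
            self.busy[dest] = True
            # Resume the scan after the granted port on the next call.
            self.next_port = (src + 1) % self.num_ports
            return src, dest
        return None

    def release(self, dest):
        """Mark a destination free again, e.g. when its time slot is over."""
        self.busy[dest] = False
```

In this model a call to `release` stands in for the Timer & Enable Unit signalling that a time slot has ended.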
[Figure: Permission Unit. A controller with BUSY PORTS and LAST WRITING PORT registers, a COMMs table (COMM to CommDst), MUX 4x2 and MUX 4x1 blocks, the Timer & Enable Unit (Nxt, TimeOver) and the BcastPriority Unit. Inputs Req1 to Req4 and Bcast1 to Bcast4 come from the Port FSMs over Control Bus; the output is Premit.]
Our changes for the router
Changes:
• Fifth port
• Routing table
• Broadcast table
New router types:
• Local router (LR)
• Fabric router (FR)
• Primary/secondary interchip router (P/S-ICR)
• PC router (PCR)
Fifth port
[Figure: Cross Bar, low level, with a 5th port module added alongside the four existing Port FSM units.]
Just adding another port module to the ring…
Routing
Address format: PC | C C | F F | L L (C = chip, F = fabric, L = local). The local field is the rank; the remaining fields form the comm.
Local router:
• Same comm: routing by rank.
• Other comms: to the 5th port.
Other routers:
• Routing by comm only.
Result: smaller routing tables.
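The two routing rules can be sketched as follows. This is an illustrative Python model under stated assumptions: the 2-bit rank width (read off the "L L" field), the 0-based index of the fifth port, and the function names are all hypothetical, and the table contents are made up for the example.

```python
RANK_BITS = 2     # assumed width of the local (rank) field, "L L"
FIFTH_PORT = 4    # assumed 0-based index of the fifth port


def split_address(address):
    """Split an address into (comm, rank); comm is everything above the rank bits."""
    rank = address & ((1 << RANK_BITS) - 1)
    comm = address >> RANK_BITS
    return comm, rank


def local_router_port(address, own_comm):
    """Local router: same comm is routed by rank, any other comm goes to the fifth port."""
    comm, rank = split_address(address)
    return rank if comm == own_comm else FIFTH_PORT


def other_router_port(address, comm_table):
    """Other router types: look up the output port by comm only."""
    comm, _rank = split_address(address)
    return comm_table[comm]
```

A local router therefore only has to distinguish the ranks inside its own comm, and the other routers need one table entry per comm rather than one per processor.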
Routing (continued)
[Figure: Permission Unit block diagram, including the COMMs table (COMM to CommDst).]
Components that do not exist yet are to be added.
Broadcast table
[Figure: Cross Bar, low level, showing the broadcast table next to the Port FSM units.]
• Broadcasts are sent only along spanning-tree branches.
• The table tags branch ports with a '1' value, for example: 0 1 1 0 1
• The table is connected to the "Port FSM" unit of each port.
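As a small illustration of how such a table could be used, the sketch below selects the output ports for a broadcast from a branch mask. It is a hypothetical Python model: mapping list index to port number and skipping the port the broadcast arrived on are assumptions of this sketch, not details given in the slides.

```python
def broadcast_ports(branch_mask, incoming_port):
    """Return the ports a broadcast is forwarded to.

    `branch_mask[i]` is 1 if port i lies on a spanning-tree branch.
    Excluding the incoming port is an assumption of this sketch.
    """
    return [port for port, on_tree in enumerate(branch_mask)
            if on_tree and port != incoming_port]


# With the example mask 0 1 1 0 1, a broadcast arriving on port 1
# is forwarded to ports 2 and 4.
print(broadcast_ports([0, 1, 1, 0, 1], incoming_port=1))
```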
Software design
Software layers:
• Application layer: MPI functions interface.
• Network layer: hardware-independent implementation of these functions.
• Data layer: relies on command bit fields.
• Physical layer: designed for the FSL bus.
Changes:
• Adjust to conform with the Altera interface.
• Use DMA transfers.
• Add asynchronous functions.
• Adjust for the new comm size.
Message Passing Flow
[Figure: Source Buffer, Network, Auxiliary Receive Buffer (Constant) and Destination Buffer, linked by DMA transfers. Sending-list entries hold Destination, Tag, Buffer address and Size; receiving-list entries hold Source, Tag, Buffer address and Size.]
Sending
• MPI_Isend only adds a send request to the sending list.
• DMA sends the data asynchronously.
Receiving
• MPI_Irecv only adds a receive request to the receiving list.
• DMA receives the data asynchronously.
• Data is transferred into the buffer in the background.
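The request-list idea can be illustrated with a toy model. The Python sketch below is not the project's MPI implementation: `Endpoint`, `Request` and `background_transfer` are hypothetical names, the DMA engine is replaced by an ordinary function call, and the auxiliary receive buffer is skipped for brevity. It only shows that posting a request moves no data, and that the copy happens later.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Request:
    peer: int            # destination for sends, source for receives
    tag: int
    buffer: bytearray
    size: int
    done: bool = False


class Endpoint:
    """Toy model of one processor's sending and receiving lists."""

    def __init__(self):
        self.send_list = deque()
        self.recv_list = deque()

    def isend(self, dest, tag, buffer, size):
        """Queue a send request and return immediately; no data moves here."""
        req = Request(dest, tag, buffer, size)
        self.send_list.append(req)
        return req

    def irecv(self, source, tag, buffer, size):
        """Queue a receive request and return immediately."""
        req = Request(source, tag, buffer, size)
        self.recv_list.append(req)
        return req


def background_transfer(sender, sender_rank, receiver, receiver_rank):
    """Stand-in for the DMA engine: move one queued message if a match exists."""
    for send in list(sender.send_list):
        if send.peer != receiver_rank:
            continue
        for recv in list(receiver.recv_list):
            if recv.peer == sender_rank and recv.tag == send.tag:
                n = min(send.size, recv.size)
                recv.buffer[:n] = send.buffer[:n]
                sender.send_list.remove(send)
                receiver.recv_list.remove(recv)
                send.done = recv.done = True
                return True
    return False
```

The `done` flag plays the role that a completion test would play for the caller: the application keeps computing and checks it when the data is actually needed.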
Obstacle 1 - Memory bottleneck
• Each Nios uses ~13Kb of on-chip memory.
• The FPGA has only ~70Kb of on-chip memory.
• Only 5 processors fit.
Solutions:
o Off-chip memory: slow.
o Reducing the program footprint.
o Using a bigger FPGA for the whole network.
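The processor count follows from integer division of the two memory figures. The short Python check below uses the approximate values from the slide and, as a simplifying assumption, ignores any on-chip memory needed by the routers themselves; the function name is hypothetical.

```python
def max_processors(fpga_memory, per_processor_memory):
    """Number of whole processors whose memory fits on chip (same units for both)."""
    return fpga_memory // per_processor_memory


# ~70 available, ~13 per Nios: five fit, with about 5 left over.
print(max_processors(70, 13))   # prints 5
```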
Obstacle 2 - Cache coherency
[Figure: a DMA buffer in memory laid over four cache lines, with Memory and Cache shown.]
• A cache flush is necessary but not enough: cache lines that are not aligned with the buffer can still become incoherent.
Solutions:
o Not using the cache: the asynchronous system is then not effective.
o Disabling the cache in the buffer area: the cache cannot be used after a DMA transfer.
o Aligning DMA buffers to cache lines (using memalign).
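The alignment problem comes down to address arithmetic. The Python sketch below shows which cache lines a buffer overlaps and when a line is shared with unrelated data. The 32-byte line size and all function names are assumptions for illustration; the real line size depends on the Nios II cache configuration. Padding the buffer size up to a whole number of lines is likewise an assumption of this sketch rather than something stated in the slides.

```python
CACHE_LINE = 32  # assumed cache line size in bytes


def lines_touched(start, size, line=CACHE_LINE):
    """Return the cache line indices that the byte range [start, start + size) overlaps."""
    first = start // line
    last = (start + size - 1) // line
    return list(range(first, last + 1))


def shares_lines(start, size, line=CACHE_LINE):
    """True if the first or last cache line also holds bytes outside the buffer."""
    return start % line != 0 or (start + size) % line != 0


def align_up(value, line=CACHE_LINE):
    """Round value up to the next multiple of the cache line size."""
    return (value + line - 1) // line * line
```

For example, a 64-byte buffer at address 40 overlaps lines 1, 2 and 3 and shares the first and last with neighbouring data, while the same buffer moved to `align_up(40)`, which is 64, covers exactly lines 2 and 3.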
Local router testing
[Figure: test setup with the local router, four Nios II processors each with a PIO, four Simple FIFO* blocks, and the PC running the testing program.]
* PIO to FIFO connector
• The PIOs output debug information, data sent/received and results.
• The test program prints the PIO data on screen.
• In simulation, the PIOs can be read directly from the waveform.
Application
Multiple matrix multiplication.
[Figure: matrices with entries a(0,0) ... a(m,n) feeding four MUL blocks, which output product matrices.]
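One way to read "multiple matrix multiplication" is a product of several matrices computed by pairing neighbours, so that the products within one level can run on different processors. The Python sketch below illustrates that reading only. The pairing scheme and the function names are assumptions, and the code runs sequentially on one machine.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def multiply_chain(matrices):
    """Multiply a non-empty list of matrices by pairing neighbours level by level.

    The products within one level are independent of each other, so each
    could be computed by a different processor.
    """
    level = list(matrices)
    while len(level) > 1:
        paired = [matmul(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd matrix out is carried to the next level
            paired.append(level[-1])
        level = paired
    return level[0]
```

Because only adjacent matrices are paired and their order is kept, the result equals the ordinary left-to-right product.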
Questions