Switch
EECS 252 – Spring 2006
RAMP Blue Project
Jue Sun and Gary Voronel
Electrical Engineering and Computer Sciences
University of California, Berkeley
May 1, 2006
5/1/2006 CS252-s06, Project Presentation
Outline
• Goal of switch
• Implementation
• Performance
• Future implementation
• Current state of project
• Project experience
One Piece of the Puzzle
• Main goal of RAMP Blue is to build a large scale system
• To do useful work, processors must be able to communicate
• Therefore, we need an interconnection network
Implementation Goals
1. Support communication between all processors in system
2. Flexible hardware allowing parameterization of global system constants, especially number of Microblaze cores per FPGA
3. Minimal resource utilization
4. High throughput
5. Low latency
6. Simple, homogeneous hardware
7. Simple software interface
Hardware Design Constraints
• RAMP Blue will be implemented on the BEE2
• 4 user FPGAs per BEE2 board
• 2 LVCMOS links for FPGA-to-FPGA communication
– Relatively low latency (2 or 3 cycles)
– Throughput: more than 64 bits
• 16 MGT links per board (4 per FPGA) for board-to-board communication
– Relatively high latency (20 or more cycles)
– Throughput: 32 bits or 64 bits
• To achieve the lowest latency possible, we limit packet routes to at most 1 MGT link
• 16 Microblaze cores per FPGA (64 per board)
– Depending on resource utilization, the number of cores per FPGA may need to be reduced
Physical Topology
• Topology is fixed and homogeneous throughout the system
– Each FPGA is directly connected to 2 other FPGAs on the same board and to 4 other boards
– Number of cores per FPGA is the same on every FPGA
• Each board has a direct connection to every other board in the system (maximum of 17 boards)
– BOARD n connects to BOARD 16 through MGT n
– With 16 cores per FPGA, 17 boards support 1088 processors!
Board Level Connectivity
[Figure: board-level connectivity, showing BOARD 0 (FPGA 0–3), BOARD 7 (FPGA 28–31), BOARD 10 (FPGA 40–43), and BOARD 16 (FPGA 64–67), each with MGT links 0–15.]
FPGA Level Connectivity
For clarity, configuration shown is with 4 Microblaze cores per FPGA
[Figure: BOARD 0 with FPGA 0–3, links MGT0–MGT15 around the board edge, and Microblaze cores 0–15 (4 per FPGA).]
Switch Fabric Specifications
• Crossbar switch with maximal connectivity
– Every Microblaze can access every other Microblaze on the same FPGA directly
– Every Microblaze can access both LVCMOS links
– Every Microblaze can access all FPGA-local MGT links
• Buffering on inputs and outputs
– Store-and-forward buffers for Microblazes to decrease complexity and simplify software interface
– Cut-through buffers for LVCMOS links
– MGT links are wrapped by XAUI cores that already have internal buffers
Microblaze Level Connectivity
For clarity, configuration shown is with 4 Microblaze cores per FPGA
[Figure: block diagram of one FPGA. Labeled blocks: MICROBLAZE 0–3, BUFFER UNITs, SWITCH, LVCMOS LEFT and LVCMOS RIGHT with LVCMOS BUFFERs, and XAUI0–XAUI3, each with a XAUI CTRL block.]
Switch Overall
[Figure: switch block diagram. A Scheduler sits between four Input Buffers (signals: Send request, Destination, DataDone, Allow Start; Data, DataValid, DataDone) and four Output Buffers (signals: free; Data, DataValid, DataDone).]
Scheduler
• If two ports want to send to the same port at the same time, the requesting port with the lowest number is allowed to send first
• Other control logic, not shown here, implements the protocol between the switch and the buffers
[Figure: scheduler block diagram. Four inputs (Port in 1–4), each with a Dest Port signal, feed the Control Logic. Four first-come-first-serve request schedulers each take Req port1–Req port4, register them (Req 1–4 Registered), and output a Port to service. Later animation steps label the result as Port 2, then Port 3.]
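The arbitration rule on this slide can be sketched in software. This is a minimal model and not the project's hardware: it assumes requests are served in arrival order (as the "first come first serve" label suggests) and that requests registered in the same cycle are ordered by requesting port number, lowest first. The names `Request` and `arbitrate` are hypothetical.

```python
from typing import List, NamedTuple, Optional

class Request(NamedTuple):
    arrival_cycle: int  # cycle in which the request was registered
    src_port: int       # number of the requesting input port
    dest_port: int      # output port the packet wants to reach

def arbitrate(requests: List[Request], dest_port: int) -> Optional[Request]:
    """Pick the next request to service for one output port."""
    candidates = [r for r in requests if r.dest_port == dest_port]
    if not candidates:
        return None
    # Earliest arrival wins; the lowest source port breaks same-cycle ties.
    return min(candidates, key=lambda r: (r.arrival_cycle, r.src_port))

pending = [Request(5, 3, 1), Request(5, 2, 1), Request(6, 1, 1), Request(5, 4, 2)]
winner = arbitrate(pending, dest_port=1)
print(winner)
```

In this example ports 2 and 3 both request output port 1 in cycle 5, so port 2 is serviced first; port 1 asked later and waits even though its number is lower.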
Source Routing
• Fixed topology allows for straightforward source routing implementation
• Destination routing would be more robust, but would require significantly more resources and greater complexity
• Packet header is extremely simple: just a concatenated sequence of hops
• Minimal hardware required to determine next hop and adjust the header at every hop (zero LUTs used – can’t get better than that!)
– The next hop is encoded in the lowest bits of the header
– To adjust the header, the hardware must simply shift out the lowest bits
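The per-hop header handling described above amounts to a mask and a shift. The sketch below models it in software, assuming the 5-bit hop width used elsewhere in these slides; `next_hop` is a hypothetical name. In hardware the same operation is plain wiring, which is consistent with the zero-LUT claim.

```python
HOP_BITS = 5                     # hop width (parameterizable in the hardware)
HOP_MASK = (1 << HOP_BITS) - 1

def next_hop(header: int):
    """Return (hop code in the lowest bits, header with that hop shifted out)."""
    return header & HOP_MASK, header >> HOP_BITS

# Hypothetical header holding hop codes 3, then 7, then 1 (first hop lowest).
header = (1 << 10) | (7 << 5) | 3
hop, remaining = next_hop(header)
print(hop, bin(remaining))
```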
Source Routing – Hop Encoding
• Need 5 bits to represent each hop
– Must be able to encode 16 cores per FPGA + 4 MGT links + 2 LVCMOS links = 22 total encodings (+ 1 for a FIN code)
– If 8 or fewer cores per FPGA are used, then each hop can be represented using only 4 bits (hardware supports parameterization of the hop encoding width)
• Maximum of 6 hops based on physical topology
– Constrained MGT links to 1 hop per route
– Therefore, worst case route is: LVCMOS → LVCMOS → MGT → LVCMOS → LVCMOS → MB
• Hop encoding allows header to fit into 1 word
– 6 hops x 5 bits/hop = 30 bits
Header word layout, most significant bits first (field widths in bits):
00 (2) | HOP5 (5) | HOP4 (5) | HOP3 (5) | HOP2 (5) | HOP1 (5) | HOP0 (5)
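The arithmetic behind the hop width and the one-word header can be checked with a small sketch. `hop_bits` and `pack_header` are hypothetical helper names, and the sketch assumes that an unused hop slot (all zeros) acts as the FIN code, which the slides do not state explicitly.

```python
def hop_bits(cores_per_fpga: int, mgt_links: int = 4, lvcmos_links: int = 2) -> int:
    """Bits needed to encode every hop target plus one FIN code."""
    encodings = cores_per_fpga + mgt_links + lvcmos_links + 1
    return (encodings - 1).bit_length()

def pack_header(hops, bits: int = 5, max_hops: int = 6) -> int:
    """Pack hop codes into one word, with the first hop in the lowest bits."""
    if len(hops) > max_hops:
        raise ValueError("route exceeds the maximum hop count")
    header = 0
    for i, code in enumerate(hops):
        if not 0 <= code < (1 << bits):
            raise ValueError("hop code does not fit in the hop width")
        header |= code << (i * bits)
    return header  # unused slots stay 0, treated here as FIN (assumption)

print(hop_bits(16), hop_bits(8))             # widths for 16 and 8 cores per FPGA
print(pack_header([31] * 6).bit_length())    # bits used by a full 6-hop header
```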
Source Routing – Global Naming
• Processors are globally named
– Necessary to reach the goal of a simple software interface
– If there are 16 cores per FPGA with 4 FPGAs per board and 17 total boards, then the processors are numbered 0 – 1087
• Naming scheme scales down with fewer cores
– Necessary to support parameterization of global system constants (especially number of cores per FPGA)
– If there are 4 cores per FPGA with 4 FPGAs per board and 17 total boards, then the processors are numbered 0 – 271
• Invalid processor number triggers error at the software level
– Again, supports simple software interface
– Ensures that only packets with valid headers enter the network
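A global processor number can be split into its board, FPGA, and core with integer division. The sketch below assumes processors are numbered contiguously (core within FPGA, FPGA within board), which matches the numbering in the figures but is not spelled out in the slides; `locate` is a hypothetical name.

```python
FPGAS_PER_BOARD = 4
MAX_BOARDS = 17

def locate(proc: int, cores_per_fpga: int = 16, boards: int = MAX_BOARDS):
    """Split a global processor number into (board, global FPGA, local core)."""
    total = cores_per_fpga * FPGAS_PER_BOARD * boards
    if not 0 <= proc < total:
        # Mirrors the software-level error for invalid processor numbers.
        raise ValueError(f"invalid processor number {proc} (valid: 0-{total - 1})")
    fpga, core = divmod(proc, cores_per_fpga)
    board = fpga // FPGAS_PER_BOARD
    return board, fpga, core

print(locate(1087))                      # last processor with 16 cores per FPGA
print(locate(10, cores_per_fpga=4))      # a processor on BOARD 0
print(locate(24, cores_per_fpga=4))      # a processor on BOARD 1
```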
Source Routing Example
• For simplicity, let’s assume there are 4 cores per FPGA
• Let’s send from processor #10 to processor #24 (representative of worst case path)
[Figure, repeated on every step of the example: BOARD 0 (FPGA 0–3, processors 0–15) and BOARD 1 (FPGA 4–7, processors 16–31), each board with links MGT0–MGT15.]
Initial header: 00 MB0 LEFT LEFT MGT1 LEFT LEFT
• Destination core is on a different board, so the packet must first be routed from the source FPGA (FPGA 2) to the FPGA that is connected to the destination board (which is FPGA 0)
• This requires 2 hops over the LEFT LVCMOS link
Header after the first LEFT hop: 00 FIN MB0 LEFT LEFT MGT1 LEFT
Header after the second LEFT hop: 00 FIN FIN MB0 LEFT LEFT MGT1
• Once at the proper FPGA, the packet can be sent across the MGT link to an FPGA on the destination board
Header after the MGT1 hop: 00 FIN FIN FIN MB0 LEFT LEFT
• Then, the packet must be routed to the destination FPGA, which requires 2 more LVCMOS hops
Header after the third LEFT hop: 00 FIN FIN FIN FIN MB0 LEFT
Header after the fourth LEFT hop: 00 FIN FIN FIN FIN FIN MB0
• Finally, the packet must be forwarded to the destination Microblaze core
Header after the MB0 hop: 00 FIN FIN FIN FIN FIN FIN
• Each arrowhead represents a hop: it takes 5 hops to reach the destination FPGA
• One more hop is required to send the packet to the destination Microblaze core, totalling 6 hops in the worst case
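The header contents shown during this example can be reproduced with a short simulation. The numeric hop codes below are hypothetical, since the slides only give symbolic names; FIN is assumed to be 0, so that shifting a hop out fills the top slot with FIN.

```python
BITS = 5
MASK = (1 << BITS) - 1
# Hypothetical hop codes; only the symbolic names come from the slides.
CODES = {"FIN": 0, "LEFT": 1, "MGT1": 2, "MB0": 3}
NAMES = {code: name for name, code in CODES.items()}

def pack(route):
    """Pack a route into one header word, first hop in the lowest bits."""
    header = 0
    for i, hop in enumerate(route):
        header |= CODES[hop] << (i * BITS)
    return header

def show(header, slots=6):
    """List the hop fields from HOP5 down to HOP0, as drawn on the slides."""
    fields = [(header >> (i * BITS)) & MASK for i in range(slots)]
    return " ".join(NAMES[f] for f in reversed(fields))

def walk(header):
    """Return the header contents before the first hop and after every hop."""
    states = [show(header)]
    while header & MASK != CODES["FIN"]:
        header >>= BITS          # consume the hop in the lowest bits
        states.append(show(header))
    return states

states = walk(pack(["LEFT", "LEFT", "MGT1", "LEFT", "LEFT", "MB0"]))
for state in states:
    print("00", state)
```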
Source Routing – 17th Board
• To support the 17th board, each board communicates with the 17th board through the MGT link that matches its own board number
[Figure: board-level connectivity diagram showing BOARD 0, BOARD 7, BOARD 10, and BOARD 16, each with MGT links 0–15.]
• For example, for BOARD 0 to send to BOARD 16, it sends over MGT 0
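The link selection for board-to-board hops can be sketched as a small function. Only the rule for the 17th board (BOARD n uses MGT n to reach BOARD 16) comes from the slides; the rule for all other destinations is an assumption, chosen because it matches the earlier example where BOARD 0 reaches BOARD 1 over MGT1. `mgt_link` is a hypothetical name.

```python
LAST_BOARD = 16  # the 17th board

def mgt_link(src_board: int, dst_board: int) -> int:
    """MGT link number that src_board uses to reach dst_board."""
    for board in (src_board, dst_board):
        if not 0 <= board <= LAST_BOARD:
            raise ValueError(f"invalid board number {board}")
    if src_board == dst_board:
        raise ValueError("source and destination are on the same board")
    # From the slides: board n reaches board 16 through MGT n.
    if dst_board == LAST_BOARD:
        return src_board
    # Assumption: otherwise the link number equals the destination board number.
    return dst_board

print(mgt_link(0, 16), mgt_link(7, 16), mgt_link(0, 1))
```

Under these two rules every board uses each of its 16 MGT links for exactly one other board, which is why 17 boards can be fully connected.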
Microblaze Interface
• Store and forward
• Connecting to FSL bus for now
• Essentially double buffered
• MB FSL reading speed is extremely slow compared to the switch delay time: at the fastest compilation with the most efficient code, it takes 48 cycles to write one value to the FSL bus!
• Example: send from MB to LVCMOS, loop back over the LVCMOS link, and then back to MB
LVCMOS interface
• 2 cycles of latency
• Two buses connect 2 FPGAs and can be used for anything
• Control bus and data bus are wired over LVCMOS, except that the data_full (or free) signal goes high 2 cycles before the buffer is really full
XAUI Interface
• Much simplified because XAUI has an internal buffer
• Essentially just some control signals
• Interface has recently changed, so this is still in progress
Software Interface
• Simple interface to send and receive data
• int send(int src, int dest, byte *buf, int len)
– Copies len bytes of buf into local outgoing Buffer Unit
– Constructs source route from src MB core to dest MB core
– Blocks until all data copied
– Returns number of bytes sent or -1 on error
• Receive is called by interrupt
• int recv(byte *buf, int len)
– Copies len bytes into buf from local incoming Buffer Unit
– Blocks until all data received
– Returns number of bytes received or -1 on error
Simplifications
• Fixed packet length simplifies control hardware
• A packet fits completely into every buffer in the system, so the entire packet can be transferred from hop to hop
• Once data transmission starts from the MB buffer, it is not interrupted until it reaches the MB input buffer
• Store-and-forward implementation of MB buffers
Performance (still need to clean this up)
• Latency1 ≈ 48 × packet length cycles to write into the FSL bus
• Latency2 ≈ 2 × packet length cycles to wait for the MB buffer to be full
• Latency3 ≈ 2 cycles in switch transmission
• Latency4 ≈ 48 × packet length cycles to read from the FSL bus
• Bandwidth = 32 bits/cycle or 64 bits/cycle (the current FSL does not support 64 bits)
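The four latency terms can be combined into a rough end-to-end estimate. This sketch assumes the terms simply add, that all of them are in cycles, and that packet length is counted in words; none of these is stated on the slide. `one_way_latency` is a hypothetical name.

```python
FSL_CYCLES_PER_WORD = 48     # software cost per FSL access, from the slides
BUFFER_CYCLES_PER_WORD = 2   # wait for the MB buffer to fill
SWITCH_CYCLES = 2            # switch transmission

def one_way_latency(packet_words: int) -> int:
    """Approximate cycles for one packet, MB to MB through a single switch."""
    write_fsl = FSL_CYCLES_PER_WORD * packet_words        # Latency1
    fill_buffer = BUFFER_CYCLES_PER_WORD * packet_words   # Latency2
    read_fsl = FSL_CYCLES_PER_WORD * packet_words         # Latency4
    return write_fsl + fill_buffer + SWITCH_CYCLES + read_fsl

print(one_way_latency(16))   # estimate for a 16-word packet
```

Under these assumptions the two software FSL copies dominate: for a 16-word packet they account for 1536 of the estimated cycles, while the switch itself contributes 2.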
Utilization on BEE2
Resource            With Switch (16x32 FIFO)     Without Switch
BSCANs              1 out of 1 (100%)            1 out of 1 (100%)
BUFGMUXs            7 out of 16 (43%)            6 out of 16 (37%)
DCMs                3 out of 8 (37%)             3 out of 8 (37%)
External DIFFMs     1 out of 496 (1%)            1 out of 496 (1%)
LOCed DIFFMs        1 out of 1 (100%)            1 out of 1 (100%)
External DIFFSs     1 out of 496 (1%)            1 out of 496 (1%)
LOCed DIFFSs        1 out of 1 (100%)            1 out of 1 (100%)
External IOBs       371 out of 996 (37%)         303 out of 996 (30%)
LOCed IOBs          371 out of 371 (100%)        303 out of 303 (100%)
MULT18X18s          14 out of 328 (4%)           14 out of 328 (4%)
RAMB16s             35 out of 328 (10%)          27 out of 328 (8%)
SLICEs              8136 out of 33088 (24%)      6901 out of 33088 (20%)
Note: Measured with switch that connects 8 ports: 2 MB, 2 LVCMOS links, but no XAUI. All buffers are 32 bits wide and 16 words deep.
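The cost of the switch can be read off the utilization figures by subtracting the two columns. The sketch below does this for the rows that differ; it assumes the whole difference is attributable to the switch and its buffers.

```python
# (with switch, without switch, available on the FPGA), copied from the table.
usage = {
    "SLICEs": (8136, 6901, 33088),
    "RAMB16s": (35, 27, 328),
    "External IOBs": (371, 303, 996),
    "BUFGMUXs": (7, 6, 16),
}

# Extra resources used when the switch is present, in absolute units.
overhead = {name: w - wo for name, (w, wo, _) in usage.items()}

# The same overhead as a percentage of what the FPGA offers.
share = {
    name: round(100 * (w - wo) / total, 1)
    for name, (w, wo, total) in usage.items()
}

for name in usage:
    print(f"{name}: +{overhead[name]} ({share[name]}% of device)")
```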
Future implementation
• Switch topology change
• Allow variable packet length, using control in FSL
• DMA
• 4 MBs share a DMA
“Associated Switch”
[Figure: “Associated Switch” block diagram. Labels: Scheduler; four Input Buffers, each with control and data lines; four Output Ports, each with control and data lines; four buffers; eight scheduler blocks.]
Clustered Organization
• Microblaze cores organized into clusters
– Since there are 4 DIMMs on the BEE2, split into 4 clusters
• NIC will coordinate transfer of data for all MBs in cluster
– Faster transfer for MBs in the same cluster because it's DMA
– Faster overall transfer because data copying is done in hardware
• Only 4 bits per hop now, but extra hop needed
[Figure: four clusters. Cluster0: MB0–MB3 with NIC0. Cluster1: MB4–MB7 with NIC1. Cluster2: MB8–MB11 with NIC2. Cluster3: MB12–MB15 with NIC3.]
What's Working NOW!!
• Switch @ 100MHz
• Source route generation
• Store and forward buffer for MB
• TCL script and (partial) global parameterization
• Homogeneous hardware
• LVCMOS interface
• Single MB with switch booted on XUP
• Double MB with switch booted on BEE2
Almost Done / To Do
• Cut Through MB Buffer
– Bottleneck of copying data from software limits performance gains from cut through version
• Need to test XAUI / MGT link
• Interrupt controller
• Complete parameterization
Trouble Spots
• Tools
• Interfaces
• Putting multiple MB on FPGA
• Lack of infrastructure during early stages