SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems
Abelardo Jara-Berrocal, Ann Gordon-Ross
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Department of Electrical and Computer Engineering
University of Florida
2 of 16
Introduction – Parallel Computation
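[Figure: three-step design flow.
1. System formulation: a high-performance application.
2. Application decomposition: a task graph of seven numbered tasks (Source, FIR, FFT, Matrix, IFFT, Angle, Sink); edges indicate communication volume (edge labels: 4000, 15000, 15000, 82500, 40000, 4000, 15000).
3. Task allocation / system placement: tasks are placed on the modules uProc, MEM, DSP1, ASIC, and DSP2 (allocation labels as extracted: 1,7; Data; 2,6; 4; 3,5).]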
• To leverage parallel computation speedups, a system can be decomposed into smaller tasks
• Parallel communication
• How do designers provide efficient module communication?
• Problem: speedup can be limited by inefficient communication!
Profile 1: DSP: 0.5 ms, uProc: 2.2 ms
Profile 2: ASIC: 0.5 ms, DSP: 2.5 ms
3 of 16
Communication Architectures
[Figure: a) Bus and b) Network-on-Chip (NoC), each connecting uProc, MEM, DSP1, ASIC, and DSP2; in b) the modules attach to NoC nodes.]
Bus
• Advantages
  – Very well known
  – Smaller hardware overhead
  – SoC standards: CoreConnect®, AMBA®, Wishbone
• Disadvantages
  – Does not scale well as the number of modules increases
  – High power consumption due to long wires
  – Cross-talk issues
Network-on-Chip (NoC)
• Advantages
  – Scalable
  – Very high bandwidth
  – Wires are broken into smaller segments
  – Multiple and simultaneous parallel communications
• Disadvantages
  – Significant area overhead, exacerbated by store-and-forward routers
  – Interfaces between modules and nodes are not standard: specific signals and handshaking protocols for each design
4 of 16
General NoC architecture
• NoC interface
• NoC link
• NoC node
  – Routers (packet switching)
  – Switches (circuit switching)
• NoC topology
  – Varies across designs
  – Commonly 2D mesh or torus [1]
[Figure: general NoC connecting MEM, uProc, DSP1, ASIC, DSP2, and an I/O slave through NoC interfaces, links, and nodes.]
[1] Salminen et al., "Survey of Network-on-Chip Proposals," White Paper, OCP-IP, March 2008.
5 of 16
Motivation
• Relevant NoC metrics:
  – Throughput
  – Latency
  – Area
  – Power
• 2D mesh NoC:
  – High throughput
  – Low latency
  – High communication parallelism
• Due to these advantages, some commercial 2D NoCs for ASICs have appeared:
  – Arteris®
• How about NoC implementations in FPGAs?
  – FPGAs are increasingly used in digital designs: they are reconfigurable and cost less than ASICs
  – NoC area overhead becomes a problem: a 3x3 2D mesh NoC consumed 28.72% of a Xilinx V2P30 [2] (for a maximum throughput of 9.5 Gbps for the complete 3x3 2D NoC)
  – The problem is exacerbated on low-capacity, low-cost FPGA devices
[Figure: 3x3 2D mesh NoC with nodes N1 to N9, each node attached to a module; Arteris NoC.]
[2] B. Sethuraman, P. Bhattacharya, J. Khan, and R. Vemuri, "LiPaR: A Light-Weight Parallel Router for FPGA-Based Networks-on-Chip," ACM Great Lakes Symposium on VLSI, 2005, pp. 452-457.
6 of 16
SCORES - Contributions
• SCORES = Scalable Communication Architecture for Reconfigurable Embedded Systems
• Main contributions:
  – High throughput / bandwidth: circuit switching scheme
  – Low area overhead: linear topology
  – Multiple clock domains
  – Scalability: VHDL model with numerous architectural parameters, which allows customization for the communication needs of different SoCs
• Implemented in a Xilinx VLX25 FPGA
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3, each attached to SCORES through an interface; the modules run on clk1, clk2, and clk3 and SCORES runs on scores-clk (different clock domains).]
7 of 16
SCORES – Top Level Design
• SCORES main components:
  – Switches: communication nodes inside SCORES
  – Interfaces: communication between modules and SCORES
  – Channels: communication links between switches and other switches or interfaces
• Modules access interfaces through local input ports and local output ports
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3 (clk1, clk2, clk3), each connected through an interface to a switch inside SCORES (clk).]
8 of 16
SCORES – Parametric Architecture
[Figure: Module 1 to Module 4, each attached through an interface to a switch inside SCORES.]
• kl: number of left switch channels
• kr: number of right switch channels
• ko: number of local output ports from the interface
• ki: number of local input ports to the interface
• Additional parameters:
  – N: number of modules
  – W: width of a channel in bits
• Parameters enable SCORES to conform to custom communication requirements
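The parameter set above can be pictured as a single configuration record. The following is a minimal Python sketch, not the authors' VHDL generics: the class name `ScoresParams`, the field names, the default values, and the validation rule are all illustrative assumptions. The example values read the results slides' "2 left and right switch channels" as two of each, and take N = 4 from the four modules drawn in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoresParams:
    """Hypothetical container for the SCORES parameters named on the slide."""
    n: int        # N: number of modules
    w: int        # W: width of a channel in bits
    kl: int = 1   # kl: number of left switch channels
    kr: int = 1   # kr: number of right switch channels
    ki: int = 1   # ki: number of local input ports to the interface
    ko: int = 1   # ko: number of local output ports from the interface

    def __post_init__(self):
        # Assumption: every parameter counts a hardware resource,
        # so each one must be at least 1.
        for name in ("n", "w", "kl", "kr", "ki", "ko"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


# Assumed example: 32-bit channels, 2 left and 2 right switch channels,
# 1 local input and 1 local output port, 4 modules.
params = ScoresParams(n=4, w=32, kl=2, kr=2, ki=1, ko=1)
print(params)
```

Freezing the record mirrors the fact that such parameters would be fixed when the architecture is generated, not changed at run time.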
9 of 16
SCORES – Terminology
• Producer: module which transmits data
• Consumer: module which receives data
• Streaming Data Channel (SDC):
  – Dedicated path between a producer and a consumer
  – Dynamically created and destroyed inside SCORES
  – Bidirectional path: data flows from producer to consumer, and control synchronization signals flow from consumer to producer
[Figure: Module 1 to Module 4 with their interfaces; a Streaming Data Channel (SDC) links a producer to a consumer.]
10 of 16
SCORES – Communication Phases
• Three communication phases
• Phase I: Channel establishment
  – Producer requests a path to the consumer
  – Path is iteratively created inside the switches between the producer and the consumer
  – If a switch has no available channels, it sends a DENY signal to the producer; the producer can drop or maintain the request
  – If successful, the Streaming Data Channel (SDC) is created between the producer and the consumer
[Figure: Module 1 to Module 4 with their interfaces; a Streaming Data Channel (SDC) links a producer to a consumer.]
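Channel establishment (Phase I) can be sketched as a hop-by-hop reservation along a row of switches. This is a toy Python model under stated assumptions, not the SCORES FSM logic: the class `LinearScores`, its per-switch free-channel counters, and the choice to undo a partial path on DENY (which corresponds to the producer dropping its request) are all hypothetical.

```python
class LinearScores:
    """Toy model of SDC establishment over a row of switches (illustrative only)."""

    def __init__(self, n_modules, kr=1, kl=1):
        # free_right[i]: unused channels from switch i toward switch i + 1
        # free_left[i]:  unused channels from switch i toward switch i - 1
        self.free_right = [kr] * n_modules
        self.free_left = [kl] * n_modules

    def request(self, producer, consumer):
        """Reserve one channel per hop; return 'GRANT' or 'DENY'."""
        if producer < consumer:
            pool, hops = self.free_right, range(producer, consumer)
        else:
            pool, hops = self.free_left, range(producer, consumer, -1)
        reserved = []
        for switch in hops:
            if pool[switch] == 0:
                # No free channel at this switch: undo the partial path
                # (assumed "drop" behaviour) and report DENY.
                for s in reserved:
                    pool[s] += 1
                return "DENY"
            pool[switch] -= 1
            reserved.append(switch)
        return "GRANT"


net = LinearScores(n_modules=4, kr=1, kl=1)
print(net.request(0, 3))  # rightward path through switches 0, 1, 2
print(net.request(1, 2))  # needs a right channel of switch 1, already in use
print(net.request(3, 2))  # leftward path uses a left channel of switch 3
```

With one right channel per switch, the second request is denied because the first SDC already occupies switch 1's only right channel, while the leftward request still succeeds. Raising `kr` in this model lets more SDCs coexist.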
11 of 16
SCORES – Communication Phases
• Phase II: Streaming transmission
  – Pipelined operation
  – If the consumer buffer is full, the consumer asserts "Full" to inform the producer to pause transmission
  – Interfaces are built around asynchronous FIFOs, which eases crossing different clock domains
• Phase III: Channel release
  – Producer deasserts its request
  – Path between the producer and the consumer is iteratively destroyed
[Figure: Module 1 to Module 4 with their interfaces; a Streaming Data Channel (SDC) passes through registers between the producer and the consumer.]
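The "Full" back-pressure of Phase II can be illustrated with a small cycle-by-cycle simulation. This Python sketch makes several simplifying assumptions that are not in the slides: a single shared clock (the asynchronous FIFOs and clock-domain crossing are not modelled), a hypothetical FIFO depth, and a consumer that drains one word every `consumer_period` cycles.

```python
from collections import deque


def stream(words, fifo_depth=4, consumer_period=2):
    """Toy model: the producer stalls while the consumer FIFO is full."""
    fifo = deque()          # consumer-side buffer
    pending = deque(words)  # words the producer still has to send
    received = []
    stalls = 0
    cycle = 0
    while pending or fifo:
        # "Full" flag as seen by the producer at the start of the cycle.
        full = len(fifo) >= fifo_depth
        # Consumer drains one word every `consumer_period` cycles.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # Producer sends one word per cycle unless "Full" is asserted.
        if pending:
            if full:
                stalls += 1
            else:
                fifo.append(pending.popleft())
        cycle += 1
    return received, stalls


received, stalls = stream(range(10))
print(received, stalls)
```

In this model every word arrives in order and none is lost; a slow consumer only shows up as stall cycles on the producer side.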
12 of 16
SCORES – Simultaneous Data Transfers
[Figure: Switch 1 to Switch 4, each with input registers and MUXes and attached to an interface; a free channel is marked.]
• A set of FSM controllers runs at each switch
• This allows SCORES to establish and operate multiple SDCs in parallel
13 of 16
Results – Clock Frequency
[Figure: three plots of frequency (MHz) versus (a) number of right switch channels (Kr), with 1 left switch channel; (b) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (c) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel.]
• The maximum achieved SCORES clock frequency determines the maximum SCORES throughput
• A customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port operates at 254 MHz (throughput = 8.0 Gbps, post place-and-route timing report)
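The link between clock frequency and throughput can be checked with a one-line calculation. The Python sketch below assumes that a channel moves one W-bit word per clock cycle, which the slides do not state explicitly; the function name is hypothetical. Under that assumption, the quoted configuration gives roughly 8.1 Gbps, in line with the approximately 8.0 Gbps reported above.

```python
def peak_throughput_gbps(width_bits, freq_mhz):
    """Peak channel throughput, assuming one word of `width_bits` per clock cycle."""
    return width_bits * freq_mhz * 1e6 / 1e9


# 32-bit channel at the reported 254 MHz.
print(round(peak_throughput_gbps(32, 254), 2))
```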
14 of 16
Results - Area
[Figure: three plots of area (slices) versus (a) number of right switch channels (Kr), with 1 left switch channel; (b) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (c) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel.]
• A customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port consumes 315 slices (1.41% of a Virtex 4 VLX25)
15 of 16
Conclusions
• We developed SCORES (Scalable Communication Architecture for Reconfigurable Embedded Systems), a highly parametric communication architecture
• SCORES contributions:
  – Low area overhead (315 slices for a 32-bit switch with multiple ports)
  – Modules can run at different and independent clock frequencies
  – Highly parametric design, which enables architecture optimization
• Future work:
  – Optimization of switch FSM controllers
  – Development of algorithms for module placement inside SCORES
  – Tools for automatic determination of SCORES parameter values
16 of 16
Questions