Physical Implementation of the DSPIN Network-on-Chip in the
FAUST Architecture
Ivan Miro-Panades (1,2,3), Fabien Clermidy (3), Pascal Vivet (3), Alain Greiner (1)
(1) The University of Pierre et Marie Curie, Paris, France
(2) STMicroelectronics, Crolles, France
(3) CEA-Leti, MINATEC, Grenoble, France
Ivan MIRO PANADES – NOCS 2008
Outline
Motivation
FAUST architecture
Migration of DSPIN into FAUST
DSPIN implementation
Networks comparison
Motivation
Physically implement the DSPIN NoC in the FAUST application platform
- DSPIN is a NoC developed jointly by LIP6 and ST
- FAUST is a stream-oriented application platform for 4G telecom applications, based on ANOC and developed by CEA-Leti
Compare the performance of ANOC and DSPIN on a real application with real traffic
FAUST architecture
[Figure: FAUST block diagram. Blocks shown: OFDM MOD., ALAM. MOD., CDMA MOD., MAPP., BIT INTER., TURBO CODER, CONV. CODER, FRAME SYNC., OFDM DEM., CDMA DEM., ROTOR, EQUAL., CHAN. EST., DE-MAPP., DE-INTER., CONV. DEC., ETHERNET, ARM946 with AHB, RAM, RAM EXT., RAM CTRL, RAC, DART, NoC Perf., two EXP blocks (SPort, APort), Clk & Reset CTRL and JTAG. External interfaces: NOC1 IF (84 pads), NOC2 IF (83 pads), RAM IF (58 pads), ETHERNET IF (17 pads). Legend: RX units, TX units, AHB units, async node, async/sync IF.]
23 computation units
Asynchronous NoC (ANOC)
20 ANOC routers
GALS design
24 independent clocks
Ethernet port
Internal/External RAM
CPU ARM946ES
Cache 4KB-I 4KB-D
Hardware OFDM modulation/demodulation
ANOC architecture
Asynchronous NoC
- Asynchronous send/accept handshake protocol
- QDI 4-phase/4-rail asynchronous logic
- QoS with two Virtual Channels (Best Effort, Guaranteed Service)
NoC Routers
- 5-port router
- Source routing
- Wormhole packet switching
- 32-bit payload
FIFO-based GALS interfaces
Mapped onto libraries
- CMOS ST 130nm
- TIMA TAL130 library
[Figure: ANOC routers (crossbars) linked by the asynchronous handshake protocol, with an IP attached through a NIC and a GALS interface.]
FAUST floor-plan with ANOC
4.5 M gates
276 chip pins
Chip area 80 mm²
ST CMOS 130nm
166 MHz (worst-case)
NoC implementation:
- Uses ANOC router hard macro
- Performs buffering and routing of ANOC links
- No spaghetti routing at top level!
[Figure: FAUST floor-plan with ANOC. Units shown: DART, RAC, ARM946, RAM1, RAM2, Ext. RAM Ctrl., OFDM mod., Rotor, Frame sync., OFDM demod., CDMA Dem., Channel Est., Ethernet, Deinter., Demapp., Conv. Dec., Equal., Mapp., CDMA Mod., NP1, NP2, Ala., Bit. Inter., Turbo Dec., Conv. Codec., two Exp. blocks, CLK, Reset.]
[Figure: ANOC router (0.211 mm²) with North, South, East, West and Unit ports.]
DSPIN architecture
[Figure: 3×3 mesh of clusters, from (Y-1,X-1) to (Y+1,X+1), centered on Cluster (Y,X). Each cluster has a local port, input FIFOs ("in") and its own clock (CK0 to CK8).]
Packet based
Distributed router architecture
Suited to GALS approach
Mesochronous links between routers
Synthesizable with standard cells
Neither asynchronous nor custom cells
Metastability resolved by "bi-synchronous FIFO"
More details in:
"A Low Cost Network-on-Chip with Guaranteed Service Well Suited to the GALS Approach", NanoNet'06
NoC architectures comparison
Feature | ANOC | DSPIN
Router arity | 5-port router | 5-port router
Topology | Irregular 2D mesh | Regular 2D mesh
Routing technique | Source routing (9 hops) | Address-based (8 bits), X-First algorithm
Switching technique | Wormhole | Wormhole
Flit size (payload) | 34 bits (32 bits) | Generic: 34 bits (32 bits)
Flow control bits on the flit | Begin of packet (BOP), End of packet (EOP) | Begin of packet (BOP), End of packet (EOP)
Virtual channels | Best effort, Guaranteed service | Best effort, Guaranteed service
Programming model | Message passing | Shared memory (2 routers per cluster), Message passing (1 router per cluster)
Clocking scheme | Fully asynchronous (QDI) with GALS interfaces | Multi-synchronous with mesochronous interfaces
Flow control protocol | Send/accept asynchronous handshake | FIFO protocol (Write and WriteOk)
Clock tree | None | One per router
Physical implementation | Hard macro | Soft macro distributed on five modules
Long wires | Inter-router wires | Intra-cluster wires
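The X-First algorithm used by DSPIN can be illustrated with a minimal sketch: a packet first travels along X until the column matches, then along Y. The port names, the (y, x) tuple layout and the convention that Y grows towards the south are assumptions made for illustration, not details taken from the DSPIN design.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (y, x) cluster coordinates, as in "Cluster (Y,X)"


def x_first_route(current: Coord, dest: Coord) -> str:
    """Return the output port for one routing step (X is resolved before Y)."""
    cy, cx = current
    dy, dx = dest
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "SOUTH"  # assumption: Y increases towards the south
    if dy < cy:
        return "NORTH"
    return "LOCAL"


def walk(src: Coord, dest: Coord) -> List[str]:
    """List the output ports taken from src until the packet is delivered."""
    y, x = src
    ports = []
    while True:
        port = x_first_route((y, x), dest)
        ports.append(port)
        if port == "LOCAL":
            return ports
        if port == "EAST":
            x += 1
        elif port == "WEST":
            x -= 1
        elif port == "SOUTH":
            y += 1
        else:
            y -= 1
```

Because every router takes the same decision from the destination address alone, no per-hop path has to be carried in the packet, which is why 8 address bits suffice.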
Packet format
[Figure: first flit and following flits of the ANOC and DSPIN packets. ANOC first flit (34 bits): 2 control bits and an 18-bit path field (H0 H1 H2 ... H8, 2 bits per hop). DSPIN first flit (34 bits, generic): 2 control bits and an 8-bit address (Y and X, 4 bits each).]
Similar packet format and control bits
ANOC uses source routing (18 bits), allowing 9 hops
DSPIN uses address-based routing (8 bits)
Packet conversion module required:
- Design of Protocol_conversion module
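The two header styles can be contrasted with a small sketch. The field widths come from the slide (9 hops of 2 bits for ANOC, 4 + 4 bits for DSPIN); the bit ordering, the hop codes and the idea that each router shifts the path field are assumptions made for illustration.

```python
from typing import List, Tuple

HOP_BITS = 2   # 18-bit path field / 9 hops
MAX_HOPS = 9


def pack_anoc_path(hops: List[int]) -> int:
    """Pack up to 9 two-bit hop codes into a path field (H0 in the low bits, assumed)."""
    if len(hops) > MAX_HOPS:
        raise ValueError("ANOC source routing allows at most 9 hops")
    field = 0
    for i, hop in enumerate(hops):
        if not 0 <= hop < 4:
            raise ValueError("each hop code must fit in 2 bits")
        field |= hop << (HOP_BITS * i)
    return field


def next_hop(field: int) -> Tuple[int, int]:
    """Return (hop code for this router, path field left for the next router)."""
    return field & 0b11, field >> HOP_BITS


def pack_dspin_address(y: int, x: int) -> int:
    """Pack the (Y, X) destination into 8 bits, Y in the high nibble (assumed)."""
    if not (0 <= y < 16 and 0 <= x < 16):
        raise ValueError("Y and X must each fit in 4 bits")
    return (y << 4) | x


def unpack_dspin_address(addr: int) -> Tuple[int, int]:
    """Recover (y, x) from an 8-bit DSPIN address."""
    return (addr >> 4) & 0xF, addr & 0xF
```

With 4 bits per coordinate, the address-based header can name any cluster of a mesh up to 16×16, whereas the source-routed header bounds the path length rather than the mesh size.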
FAUST integration
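[Figure: ANOC IP template. The IP and its NIC (CLK_IP) use synchronous SEND/ACCEPT towards the GALS interface, which uses asynchronous SEND/ACCEPT towards the ANOC router; links between routers are also asynchronous SEND/ACCEPT.]
[Figure: DSPIN IP template. The IP and its NIC (CLK_IP) use synchronous SEND/ACCEPT towards the Protocol_conversion module (with LUT), which uses synchronous READ/WRITE towards the DSPIN router (CLK_NoC); links between routers are mesochronous READ/WRITE.]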
Protocol_conversion module:
Translates the routing algorithm using a LUT
Adapts the flow control signals:
- ANOC: Send/Accept
- DSPIN: FIFO protocol
Implements two bi-synchronous FIFOs
ANOC GALS interface:
Implements 4 FIFOs, synchronous↔asynchronous
FAUST IPs and NICs are not modified
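As a rough behavioural sketch of what such a conversion stage does in one direction: the header of the first flit is looked up in a table, and a bounded FIFO turns Send/Accept into a write-when-not-full protocol. The class name, the LUT keying (ANOC path field to DSPIN (y, x)), the FIFO depth and the treatment of the first flit as a bare path field are all hypothetical simplifications; clock-domain crossing is not modelled.

```python
from collections import deque
from typing import Dict, Optional, Tuple


class ProtocolConversionSketch:
    """Hypothetical one-direction model: ANOC-style send in, DSPIN-style read out."""

    def __init__(self, lut: Dict[int, Tuple[int, int]], depth: int = 4) -> None:
        self.lut = lut          # hypothetical: ANOC path field -> DSPIN (y, x)
        self.depth = depth      # hypothetical FIFO depth
        self.fifo: deque = deque()

    def write_ok(self) -> bool:
        """FIFO-protocol view: a write is allowed while the FIFO is not full."""
        return len(self.fifo) < self.depth

    def send(self, flit: int, is_first: bool) -> bool:
        """Send/Accept view: the return value plays the role of Accept."""
        if not self.write_ok():
            return False
        if is_first:
            y, x = self.lut[flit]       # translate source route to address
            flit = (y << 4) | x         # assumed 8-bit (Y, X) layout
        self.fifo.append(flit)
        return True

    def read(self) -> Optional[int]:
        """Router side: pop the oldest flit, or None when the FIFO is empty."""
        return self.fifo.popleft() if self.fifo else None
```

Keeping the translation inside this stage is what allows the IPs and NICs to stay unmodified: they still see Send/Accept and still emit source-routed headers.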
DSPIN implementation
Synthesis
- Hierarchical synthesis
- Low Power CMOS ST 130nm technology
- Standard cells
- Neither asynchronous nor custom cells
Floorplanning
Place
Optimize placement
Timing constraints file:
- Multi-cycle path (mesochronous interfaces)
- False path (asynchronous interfaces)
Clock-tree
- GALS compatible
- Clock gating
- Implemented in 4 steps
Optimize
Route
DSPIN clock-tree
Mesochronous links
- GALS compatible
- Bi-synchronous FIFO [NOCS 2007]
- Max skew 50% clock period
- Timing constraints: set_multi_cycle_path
[Figure: mesochronous link between the clock domains Clk and Clk'.]
Clock-tree implementation
1. Add buffers/inverters
2. Build bottom clock-tree (5% skew)
3. Characterize bottom clock-tree
4. Build top clock-tree (30% skew)
[Figure: Clk_NoC distribution to Router (0,0), Router (0,1) and Router (0,2). 1st step: buffers/inverters. 2nd and 3rd steps: bottom tree, 5% skew within the router. 4th step: top tree, 180° phase shift and 30% skew between routers.]
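The skew figures are given as fractions of the clock period, so their absolute values depend on the NoC frequency. The sketch below converts them to nanoseconds; the function and dictionary names are hypothetical, and the percentages are the ones quoted on the slides (5% bottom tree, 30% top tree, 50% maximum for the mesochronous links).

```python
from typing import Dict

# Skew targets as fractions of the clock period, taken from the slides.
SKEW_FRACTIONS: Dict[str, float] = {
    "bottom_tree": 0.05,   # within a router
    "top_tree": 0.30,      # between routers
    "link_max": 0.50,      # maximum tolerated by the mesochronous links
}


def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def skew_budget_ns(freq_mhz: float) -> Dict[str, float]:
    """Absolute skew values in nanoseconds at the given clock frequency."""
    period = clock_period_ns(freq_mhz)
    return {name: fraction * period for name, fraction in SKEW_FRACTIONS.items()}
```

For example, at 250 MHz the period is 4 ns, so the top-tree target is 1.2 ns and the link limit is 2 ns.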
FAUST floor-plan with DSPIN
Distributed router implementation
Soft macro approach
Higher floor-plan flexibility
NoC adapts to the SoC
Long wires are routed in a tree manner
Different router configurations are possible
DSPIN router placement density: 60-70 % (reserved area)
[Figure: FAUST floor-plan with DSPIN. The N, S, E, W and L modules of each router are distributed around the units: RAC, OFDM mod., CDMA Mod., NP1, Ala., Bit. Inter., Turbo Dec., Conv. Codec., CLK, ARM946, RAM1, RAM2, Ext. RAM Ctrl., Rotor, Channel Est., Ethernet, Conv. Dec., Equal., Frame sync., OFDM demod., CDMA Dem., Deinter., Demapp., DART.]
FAUST floor-plan
[Figure: floor-plans of FAUST with ANOC and FAUST with DSPIN.]
NoC Area
CMOS 130nm | ANOC | DSPIN
Router | 0.211 mm² | 0.161 mm²
GALS interface | 0.070 mm² | 0.024 mm²
Clock tree | 0.000 mm² | 0.0016 mm²
Total | 0.281 mm² | 0.187 mm²
ANOC is implemented as a hard macro
DSPIN is implemented as a soft macro
DSPIN is 33% smaller than ANOC
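The totals and the relative saving can be re-derived from the per-component areas. This is only an arithmetic check on the slide's numbers; the dictionary and function names are hypothetical.

```python
# Per-node areas in mm² (CMOS 130nm), copied from the table.
AREA_MM2 = {
    "ANOC": {"router": 0.211, "gals_interface": 0.070, "clock_tree": 0.000},
    "DSPIN": {"router": 0.161, "gals_interface": 0.024, "clock_tree": 0.0016},
}


def total_area(noc: str) -> float:
    """Sum of router, GALS interface and clock-tree area for one NoC."""
    return sum(AREA_MM2[noc].values())


def relative_saving() -> float:
    """Fraction by which the DSPIN total is smaller than the ANOC total."""
    return 1.0 - total_area("DSPIN") / total_area("ANOC")
```

The sums give 0.281 mm² for ANOC and about 0.187 mm² for DSPIN, a saving of roughly one third, in line with the 33% quoted.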
NoC Throughput
Conditions | ANOC | DSPIN
Throughput in worst-case conditions | ~ 160 Mflit/s | ≤ 289 Mflit/s
Throughput in nominal conditions | ~ 220 Mflit/s | ≤ 408 Mflit/s
DSPIN throughput is deterministic with respect to the clock frequency (one flit per clock cycle)
Long wire latency penalty on throughput:
- DSPIN: the critical path crosses the long wires once
- ANOC: the critical path crosses the long wires 4 times (4-phase protocol)
• ANOC link pipelining is feasible
In a commercial circuit, DSPIN would be clocked close to the worst-case frequency (289 MHz) to improve fabrication yield
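Since DSPIN moves one flit per clock cycle, its flit rate equals its clock frequency, and the payload bandwidth follows from the 32-bit payload. The sketch below works this out and compares the quoted figures; the function names are hypothetical, and the ANOC values are the approximate ones from the table.

```python
PAYLOAD_BITS = 32  # payload bits per flit, from the architecture comparison


def dspin_flit_rate_mflits(clock_mhz: float) -> float:
    """One flit per clock cycle: the rate in Mflit/s equals the clock in MHz."""
    return clock_mhz


def payload_bandwidth_gbps(mflits_per_s: float) -> float:
    """Payload bandwidth in Gbit/s for a given flit rate in Mflit/s."""
    return mflits_per_s * PAYLOAD_BITS / 1000.0


def gain(dspin_mflits: float, anoc_mflits: float) -> float:
    """Relative throughput advantage of DSPIN over ANOC."""
    return dspin_mflits / anoc_mflits - 1.0
```

At 289 MHz this gives about 9.25 Gbit/s of payload per link. Comparing like with like, the gain is around 80% in both worst-case and nominal conditions; the 31% figure quoted in the conclusion appears consistent with DSPIN at worst-case (289 Mflit/s) against ANOC at nominal conditions (220 Mflit/s).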
Packet latency (1)
Compute the packet latency through many routers
DSPIN has deterministic packet latency with respect to clock frequency
DSPIN has lower First and Last packet latencies
ANOC intermediate latency is lower than DSPIN's, because DSPIN resynchronizes the data on each router
Latency | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz)
Intermediate router latency | 6.80 ns | 16.66 ns | 6.80 ns | 10.00 ns
First + Last latency | 60.00 ns | 56.66 ns | 47.00 ns | 34.00 ns
Packet latency (2)
Path | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz)
Latency for 5 hops path | 80.00 ns | 106.66 ns | 68.00 ns | 64.00 ns
Latency for 9 hops path | 106.66 ns | 173.30 ns | 96.00 ns | 104.00 ns
Packet latencies are similar for clock frequencies >250 MHz
Intermediate packet latency matters, but the application should be mapped to optimize data locality (communicate with neighboring IPs rather than with faraway IPs)
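The path latencies are close to what a simple additive model gives: the First + Last latency plus one intermediate-router latency for each remaining hop. The sketch below assumes that a path of N hops contains N − 2 intermediate routers; this is inferred from the numbers rather than stated on the slides, and the names are hypothetical.

```python
# Per-router latencies in ns, copied from the "Packet latency (1)" figures.
LATENCY_NS = {
    ("ANOC", 150): {"intermediate": 6.80, "first_last": 60.00},
    ("DSPIN", 150): {"intermediate": 16.66, "first_last": 56.66},
    ("ANOC", 250): {"intermediate": 6.80, "first_last": 47.00},
    ("DSPIN", 250): {"intermediate": 10.00, "first_last": 34.00},
}


def path_latency_ns(noc: str, freq_mhz: int, hops: int) -> float:
    """Estimated latency: first + last routers, plus (hops - 2) intermediate ones."""
    if hops < 2:
        raise ValueError("the model needs at least a first and a last router")
    entry = LATENCY_NS[(noc, freq_mhz)]
    return entry["first_last"] + (hops - 2) * entry["intermediate"]
```

Under this assumption the DSPIN figures are reproduced to within rounding (for example 34 + 3 × 10 = 64 ns for 5 hops at 250 MHz), while the ANOC estimates land within about 1.5 ns of the quoted values.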
Power consumption
Component | ANOC | DSPIN (F = 150 MHz) | DSPIN (F = 250 MHz)
Router | 2.07 mW | 2.89 mW | 4.85 mW
GALS interface | 1.62 mW | 0.56 mW | 0.81 mW
Clock tree | 0.00 mW | 2.44 mW | 4.73 mW
Total | 3.69 mW | 5.89 mW | 10.39 mW
Power extraction after P&R
Functional packet traffic (OFDM demodulation)
Power consumption is mostly dominated by the FIFO data registers
The DSPIN clock-gating reduced the power consumption by 67%
The DSPIN clock-tree consumes about as much power as the router itself
- DSPIN clock-gating needs to be improved
- The GALS clock-tree consumes only 2.5% of the total clock-tree power
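The totals and the DSPIN/ANOC ratio can be checked from the per-component figures. The sketch takes the column with a 0.00 mW clock tree to be ANOC, since ANOC has no clock tree, and the two remaining columns to be DSPIN at 150 MHz and 250 MHz; the names are hypothetical.

```python
# Power in mW, copied from the table (column assignment as explained above).
POWER_MW = {
    "ANOC": {"router": 2.07, "gals_interface": 1.62, "clock_tree": 0.00},
    "DSPIN_150": {"router": 2.89, "gals_interface": 0.56, "clock_tree": 2.44},
    "DSPIN_250": {"router": 4.85, "gals_interface": 0.81, "clock_tree": 4.73},
}


def total_power(config: str) -> float:
    """Sum of router, GALS interface and clock-tree power for one column."""
    return sum(POWER_MW[config].values())


def ratio_to_anoc(config: str) -> float:
    """How many times more power a DSPIN configuration draws than ANOC."""
    return total_power(config) / total_power("ANOC")


def clock_tree_share(config: str) -> float:
    """Fraction of the total power spent in the clock tree."""
    return POWER_MW[config]["clock_tree"] / total_power(config)
```

The ratios come out at roughly 1.6 at 150 MHz and 2.8 at 250 MHz, and the clock tree accounts for over 40% of the DSPIN total in both cases, which is why improving clock-gating is singled out.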
Conclusion
Physical implementation of the DSPIN Network-on-Chip on the FAUST platform
- Comparison between ANOC and DSPIN at architecture level
- Adaptation of the DSPIN architecture to manage stream-oriented communications
- Implementation up to layout of the DSPIN network on the FAUST platform
=> DSPIN mesochronous links fully implemented with standard tools
Comparison between DSPIN and ANOC NoCs (STMicroelectronics 130nm)
- DSPIN area is 33% smaller than ANOC's
- Maximum sustained throughput of DSPIN is 31% higher than ANOC's
- ANOC has lower packet latency
- DSPIN power consumption is 1.5 to 3 times higher than ANOC's
ANOC is a good candidate for low-latency and low-power applications, while DSPIN is better suited to low-area and high-performance applications
See you at my PhD defense on May 20th