SARC Proprietary and Confidential - 2006-05 1
Processor-to-Memory-Blocks NoC
with Pre-Configured (but run-time reconfigurable)
Low-Latency Routes
G. Mihelogiannakis, M. Katevenis, D. Pnevmatikatos
FORTH-ICS, Crete, Greece
SARC – Preliminary Draft of May 2006
Traditional Multiprocessor View
[Figure: processors (P), each with a memory (M) and a network interface (NI), attached to a common network.]
Local (cache) memory(ies) seen as monolithic blocks, each
Proposed View for Chip Multiprocessors
• Simple processors
• Lots of memory
  – to compensate for limited chip I/O throughput
• Large memories need to be built out of multiple smaller blocks
  – in order to bound word line & bit line capacitance within each block
[Figure: chip floorplan with many small memory blocks (M) and a few processors (P) distributed among them.]
Opportunities for (Re-) Configurability
Uniform allocation of memory blocks to processors
Non-uniform allocation of memory blocks to processors
[Figure: two floorplans of memory blocks (M) and processors (P), one for each of the two allocations named above.]
Challenge: make reconfigurable alloc. almost as fast as fixed
Long on-chip Wires already contain Active Elements
• Periodic buffers, due to quadratic nature of RC wire delay
• Approximate worst-case numbers for a 130-nm technology
  – as currently available to European Universities
• as synthesized, placed-&-routed, optimized
  – Synopsys DC V-2004.06-SP2, SOC-Encounter 3.3, Cadence NC Verilog
[Figure: buffered 4 mm wire drawn as two 2 mm segments; delay annotations: 150 ps (four times) and 600 ps.]
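The quadratic growth of RC wire delay, and why periodic buffers help, can be sketched numerically. This is a minimal illustration and not the authors' model: it uses the textbook distributed-RC approximation (delay ≈ 0.38 · R · C), ignores the buffer's own drive resistance, and the constants `R_PER_MM`, `C_PER_MM`, and `BUFFER_DELAY_PS` are assumed values chosen for illustration, not the 130-nm figures quoted on the slide.

```python
# Illustrative only: the three constants below are assumptions,
# not measurements taken from the slides.
R_PER_MM = 400.0        # wire resistance, ohms per mm (assumed)
C_PER_MM = 0.2e-12      # wire capacitance, farads per mm (assumed)
BUFFER_DELAY_PS = 50.0  # intrinsic delay of one buffer, ps (assumed)


def unbuffered_delay_ps(length_mm: float) -> float:
    """50% delay of a distributed RC line: 0.38 * R_total * C_total, in ps."""
    r_total = R_PER_MM * length_mm
    c_total = C_PER_MM * length_mm
    return 0.38 * r_total * c_total * 1e12


def buffered_delay_ps(length_mm: float, segments: int) -> float:
    """Delay when the wire is cut into equal segments, each driven by a buffer."""
    per_segment = unbuffered_delay_ps(length_mm / segments) + BUFFER_DELAY_PS
    return segments * per_segment


if __name__ == "__main__":
    for length in (2.0, 4.0, 8.0):
        print(length, unbuffered_delay_ps(length), buffered_delay_ps(length, 2))
```

Doubling the length quadruples the unbuffered delay, while the buffered delay grows roughly linearly with the number of segments. Since the buffers are needed anyway, they are natural places to insert configurability.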
Turn these into Low-Latency Configurability Elements
2-to-1 multiplexor made of (semi-custom) and-or-buffer gates
– can we do better with (custom) transmission gates?
[Figure: 4 mm, 32-bit-wide wire with 2-to-1 multiplexors as its active elements; multiplexor select inputs are marked "Stable before data arrive!"; delay annotations: 150 ps, 350 ps, 600 ps, 800 ps.]
Pre-Configuration is critical for Low Latency
Control logic plus fan-out to 32 mux bits add considerable delay
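[Figure: 32-bit multiplexor datapath shown twice, once with select inputs marked "Stable before data arrive" and once with select driven by 2-stage combinational logic fed by 5-bit inputs; delay annotations: 150 ps, 350 ps, 600 ps, 800 ps, 900 ps, 1350 ps.]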
Configure “Preferred” Paths before Data Arrival
[Figure: switch element with buffer registers and control. Paths shown:]
– 1st preferred path (cut-through)
– 2nd preferred path (cut-through)
– Infrequent path (upon contention, through buffers)
• Preconfigure (speculatively set) control for “preferred” path
• Alternate paths still work, at increased latency
• Configuration can change at run-time, quite fast
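The behaviour of one pre-configured output can be sketched as follows. This is a hypothetical model, not the authors' hardware: the class name, the two latency constants, and the rule that a preferred-path word must queue behind already-buffered words are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

CUT_THROUGH_DELAY = 1  # assumed latency units on the preferred path
BUFFERED_DELAY = 4     # assumed latency units through the buffer registers


@dataclass
class PreconfiguredOutput:
    """One output port whose multiplexor select is set before data arrive."""
    preferred_input: int
    buffers: deque = field(default_factory=deque)

    def reconfigure(self, new_input: int) -> None:
        """Run-time change of the preferred path: only the select changes."""
        self.preferred_input = new_input

    def accept(self, input_port: int, word: int) -> int:
        """Take one word and return the latency it sees at this output."""
        if input_port == self.preferred_input and not self.buffers:
            return CUT_THROUGH_DELAY  # select already stable: cut-through
        # Alternate path, or contention with words already waiting.
        self.buffers.append((input_port, word))
        return BUFFERED_DELAY

    def drain(self):
        """Forward the oldest buffered word, or None if nothing is waiting."""
        return self.buffers.popleft() if self.buffers else None
```

A word arriving on the preferred input of an idle output passes at the cut-through latency; any other word still gets through, at the buffered latency. Calling `reconfigure` models the run-time change of the preferred path.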
Prior Art: Low Latency NoC Routers
• Optimize routing decision, crossbar arbitration, VC allocation for one-clock-cycle operation
  – Mullins, West, Moore: “Low-Latency Virtual-Channel Routers for On-Chip Networks”, ISCA 2004
  – Kim, Park, Theocharides, Vijaykrishnan, Das: “A Low Latency Router Supporting Adaptivity for On-Chip Interconnects”, DAC 2005
[Figure: time-space diagram, distance (in hops) versus time (in clock cycles), showing header (Hdr) and data words at successive hops.]
Contribution: Decouple Data Rate from Configuration
• Configure “preferred” paths at whatever convenient rate
• When header/address/data arrive, forward along preferred path and, in parallel, check header
  – if destination was not along preferred path, recover at longer latency
[Figure: time-space diagram, distance (in hops) versus time (in clock cycles), showing configuration (Cnf), header/data (H/dat), and data words at successive hops.]
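The difference between deciding the route at every hop and forwarding speculatively along a pre-configured path can be sketched with a simple latency count. The function names and all parameters (`decide`, `traverse`, `recover`) are hypothetical and in arbitrary time units; the slides give no such per-hop figures.

```python
def per_hop_routed_latency(hops: int, decide: int, traverse: int) -> int:
    """Each hop first makes its routing decision, then forwards the word."""
    return hops * (decide + traverse)


def preconfigured_latency(route, preferred, traverse: int, recover: int) -> int:
    """Forward along the preferred path while checking the header in parallel.

    route[i]     : output the packet actually needs at hop i
    preferred[i] : output pre-configured at hop i
    A mismatch costs `recover` extra time units at that hop.
    """
    if len(route) != len(preferred):
        raise ValueError("route and preferred must cover the same hops")
    total = 0
    for needed, configured in zip(route, preferred):
        total += traverse
        if needed != configured:
            total += recover  # header check detected a wrong speculation
    return total
```

With these assumed costs, a packet that follows the preferred path at every hop pays only the traversal time, and each hop where the speculation was wrong adds the recovery penalty. The configuration step never appears in the per-word cost, which is the decoupling the slide describes.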
Conclusion
• Coarse-grain reconfigurability
  – at the level of memory block, compute processor, compute engine, or (simple) control processor (FSM)
• Configure “preferred routes” in the chip, along which information flows at very low latency
• Other routes still available, but at longer latency
• Preferred routes easily reconfigurable, at run-time