SARC Proprietary and Confidential - 2006-05 1 Processor-to-Memory-Blocks NoC with Pre-Configured (but run-time reconfigurable) Low-Latency Routes G. Mihelogiannakis, M. Katevenis, D. Pnevmatikatos FORTH-ICS, Crete, Greece SARC Preliminary Draft of May 2006


Processor-to-Memory-Blocks NoC

with Pre-Configured (but run-time reconfigurable)

Low-Latency Routes

G. Mihelogiannakis, M. Katevenis, D. Pnevmatikatos

FORTH-ICS, Crete, Greece

SARC – Preliminary Draft of May 2006


Traditional Multiprocessor View

[Figure: several nodes attached to a common NETWORK, each drawn as blocks labeled P, M, and NI]

Local (cache) memory(ies) seen as monolithic blocks, each


Proposed View for Chip Multiprocessors

• Simple processors
• Lots of memory
– to compensate for limited chip I/O throughput
• Large memories need to be built out of multiple smaller blocks
– in order to bound word line & bit line capacitance within each block

[Figure: diagram of a chip with a few processors (P) placed among many small memory blocks (M)]


Opportunities for (Re-) Configurability

[Figure, two panels, each showing processors (P) among memory blocks (M). Left panel: uniform allocation of memory blocks to processors. Right panel: non-uniform allocation of memory blocks to processors.]

Challenge: make reconfigurable allocation almost as fast as fixed allocation


Long on-chip Wires already contain Active Elements

• Periodic buffers, due to quadratic nature of RC wire delay
• Approximate worst-case numbers for a 130-nm technology
– as currently available to European Universities
• as synthesized, placed-&-routed, optimized
– Synopsys DC V-2004.06-SP2, SOC-Encounter 3.3, Cadence NC Verilog

[Figure: buffered on-chip wire; length labels 2 mm + 2 mm = 4 mm; delay labels 150 ps (×4) and 600 ps]
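
The first bullet on this slide can be made concrete with a small sketch. It assumes a simple model that is not taken from the slides: the delay of an unbuffered wire segment grows with the square of its length, and each inserted buffer adds a fixed delay. The constants `RC_PS_PER_MM2` and `BUFFER_PS` are hypothetical and chosen only for illustration; they are not the 130-nm measurements quoted above.

```python
# Illustrative model only: both constants are hypothetical, not measured values.
RC_PS_PER_MM2 = 50.0   # assumed wire delay coefficient, ps per mm^2
BUFFER_PS = 50.0       # assumed delay of one repeater buffer, ps


def wire_delay_ps(length_mm: float, segments: int) -> float:
    """Delay of a wire split into equal segments, each driven by one buffer."""
    if segments < 1:
        raise ValueError("need at least one segment")
    seg_len = length_mm / segments
    # Each segment pays one buffer delay plus a quadratic RC term.
    return segments * (BUFFER_PS + RC_PS_PER_MM2 * seg_len ** 2)


def best_segment_count(length_mm: float, max_segments: int = 16) -> int:
    """Segment count that minimizes total delay under this model."""
    return min(range(1, max_segments + 1),
               key=lambda k: wire_delay_ps(length_mm, k))


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, wire_delay_ps(4.0, k))
```

Under these assumed constants, splitting a 4 mm wire into four segments lowers the modeled delay from 850 ps to 400 ps, and adding buffers beyond that costs more in buffer delay than it saves in wire delay.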


Turn these into Low-Latency Configurability Elements

2-to-1 multiplexor made of (semi-custom) and-or-buffer gates
– can we do better with (custom) transmission gates?

[Figure: 4 mm wire with 32-bit 2-to-1 multiplexors in the path; delay labels 150 ps, 350 ps, 800 ps, and 600 ps; the multiplexor select is annotated "Stable before data arrive!"]


Pre-Configuration is critical for Low Latency

Control logic plus fan-out to 32 mux bits add considerable delay

[Figure: 32-bit multiplexor stages with 2-stage combinational control logic; signal-line labels 32 and 5; delay labels 150 ps, 350 ps, 900 ps, 800 ps, 600 ps, and 1350 ps; annotation "Stable before data arrive"]
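
As a rough sketch, the delay labels on this slide and the previous two can be compared directly. The mapping used here is an assumption, since the figures did not survive in this transcript: 600 ps is read as the plain buffered 4 mm wire, 800 ps as the same wire with a pre-configured multiplexor in the path, and 1350 ps as the case where the control logic must resolve before the data can pass.

```python
# Delay labels taken from the slides (ps); their mapping to cases is assumed.
DELAYS_PS = {
    "buffered wire only": 600,
    "pre-configured mux": 800,
    "mux set on data arrival": 1350,
}


def overhead_vs_wire(case: str) -> float:
    """Extra latency of a case, as a fraction of the plain buffered wire."""
    base = DELAYS_PS["buffered wire only"]
    return (DELAYS_PS[case] - base) / base


if __name__ == "__main__":
    for name, ps in DELAYS_PS.items():
        print(f"{name}: {ps} ps, +{overhead_vs_wire(name):.0%} over wire")
```

Read this way, pre-configuration costs about a third more than the bare wire, while setting the multiplexor only when data arrive more than doubles the latency.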


Configure “Preferred” Paths before Data Arrival

[Figure: switch node with Buffer Registers and Control; 1st preferred path (cut-through); 2nd preferred path (cut-through); infrequent path (upon contention, through buffers)]

• Preconfigure (speculatively set) control for “preferred” path
• Alternate paths still work, at increased latency
• Configuration can change at run-time, quite fast
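
The three points above can be sketched as a toy switch model. Everything named here is hypothetical (`SwitchNode`, the port names, and the latency constants `CUT_THROUGH` and `BUFFERED`), and contention between inputs is not modeled; the sketch only shows a speculatively set preferred output, a slower fallback, and run-time reconfiguration.

```python
from dataclasses import dataclass, field

CUT_THROUGH = 1   # assumed latency units along the preferred path
BUFFERED = 3      # assumed latency units along the fallback path


@dataclass
class SwitchNode:
    """Toy switch: one preferred output per input, set ahead of the data."""
    preferred: dict = field(default_factory=dict)  # input port -> output port

    def configure(self, in_port: str, out_port: str) -> None:
        # Run-time reconfiguration: overwrite the speculative setting.
        self.preferred[in_port] = out_port

    def forward(self, in_port: str, dest_port: str) -> tuple:
        """Return (output port, latency) for a packet headed to dest_port."""
        if self.preferred.get(in_port) == dest_port:
            return dest_port, CUT_THROUGH   # speculation was right
        return dest_port, BUFFERED          # still delivered, but slower


if __name__ == "__main__":
    node = SwitchNode()
    node.configure("west", "east")
    print(node.forward("west", "east"))    # preferred path
    print(node.forward("west", "north"))   # fallback path
    node.configure("west", "north")
    print(node.forward("west", "north"))   # preferred after reconfiguration
```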


Prior Art: Low Latency NoC Routers

• Optimize routing decision, crossbar arbitration, VC allocation for one-clock-cycle operation
– Mullins, West, Moore: “Low-Latency Virtual-Channel Routers for On-Chip Networks”, ISCA 2004
– Kim, Park, Theocharides, Vijaykrishnan, Das: “A Low Latency Router Supporting Adaptivity for On-Chip Interconnects”, DAC 2005

[Figure: timing diagram of header (Hdr) and data words moving through the network; vertical axis: distance, in hops; horizontal axis: time, in clock cycles]


Contribution: Decouple Data Rate from Configuration

• Configure “preferred” paths at whatever convenient rate
• When header/address/data arrive, forward along preferred path and, in parallel, check header
– if destination was not along preferred path, recover at longer latency

[Figure: timing diagram of configuration (Cnf) followed by header/data (H/dat) and data words; vertical axis: distance, in hops; horizontal axis: time, in clock cycles]
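
A minimal end-to-end sketch of this idea follows, assuming a packet that crosses several switches in sequence. The per-hop costs `HOP_FAST` and `HOP_RECOVER`, the node names, and the function `route_latency` are all hypothetical; the slides give no cycle counts for the recovery case.

```python
HOP_FAST = 1      # assumed cycles per hop along a preferred path
HOP_RECOVER = 3   # assumed cycles per hop when the speculation was wrong


def route_latency(route, preferred_next):
    """Total cycles for a packet following `route` (a list of node names).

    preferred_next maps a node to the neighbour its preferred path leads to;
    it is set up beforehand, independently of when the data arrive.
    """
    cycles = 0
    for here, nxt in zip(route, route[1:]):
        # Data is forwarded speculatively; the header check, done in
        # parallel, decides whether this hop was fast or needs recovery.
        cycles += HOP_FAST if preferred_next.get(here) == nxt else HOP_RECOVER
    return cycles


if __name__ == "__main__":
    route = ["P0", "S1", "S2", "M5"]
    config = {"P0": "S1", "S1": "S2", "S2": "M5"}
    print(route_latency(route, config))   # every hop on a preferred path
    config["S2"] = "M6"                   # run-time reconfiguration
    print(route_latency(route, config))   # last hop now has to recover
```

In this model the configuration table is written at any convenient time before the packet is sent, which is the decoupling the slide describes: only a mismatch between the table and the packet's actual destination adds latency.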


Conclusion

• Coarse-grain reconfigurability
– at the level of memory block, compute processor, compute engine, or (simple) control processor (FSM)

• Configure “preferred routes” in the chip, along which information flows at very low latency

• Other routes still available, but at longer latency

• Preferred routes easily reconfigurable, at run-time