february 12, 1999 architecture and circuits: 1 interconnect-oriented architecture and circuits...

Architecture and Circuits: 1 February 12, 1999

Interconnect-Oriented Architecture and Circuits

William J. Dally
Computer Systems Laboratory
Stanford University
February 12, 1998


On-chip wires

[Figure: a minimum-width wire in a 0.35 µm process, marked at 0.0 mm, 2.5 mm, 5.0 mm, 7.5 mm, and 10.0 mm]


On-chip wires are getting slower

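[Figure: a wire of length y drawn at feature sizes x1 and x2]

$x_2 = s\,x_1$ (0.5x)

$R_2 = R_1 / s^2$ (4x)

$C_2 = C_1$ (1x)

$t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)

$t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x)

$v = 0.5\,(t_g R C)^{-1/2}$ (m/s)

$v_2 = v_1 s^{1/2}$ (0.7x)

$v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate)

$v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)

[Figure: a repeated wire of three segments, each with wire delay $t_w = RCy^2$ and gate delay $t_g$]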
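The scaling relations above can be checked numerically. The following is a minimal sketch, not part of the original slides. It assumes s = 0.5 (as the 0.5x factor implies), that R and C are per unit length, that the wire length y is held fixed, and that gate delay scales as t_g2 = s·t_g1 (implied by the s^3 term). The function name and dictionary keys are hypothetical.

```python
def wire_scaling(s):
    """Return the scaling factors from the slide for a linear shrink factor s.

    Assumptions (not stated verbatim on the slide): R and C are per unit
    length, the wire length y is held fixed, and gate delay scales with s.
    """
    r = 1.0 / s**2   # R2/R1: resistance per unit length
    c = 1.0          # C2/C1: capacitance per unit length
    tg = s           # tg2/tg1: gate delay (assumed to scale with s)
    tw = r * c       # tw2/tw1 at fixed y, since tw = R*C*y^2
    return {
        "x": s,
        "R": r,
        "C": c,
        "tw": tw,
        "tw_over_tg": tw / tg,
        "v": (tg * r * c) ** -0.5,      # from v = 0.5*(tg*R*C)^(-1/2)
        "v_tg": (tg / (r * c)) ** 0.5,  # from v*tg = 0.5*(tg/(R*C))^(1/2)
    }


factors = wire_scaling(0.5)
for name, value in factors.items():
    print(f"{name}: {value:.2f}x")
```

With s = 0.5 the sketch reproduces the factors listed on the slide: the constant 0.5 in the velocity formulas cancels in every ratio, so only the exponents of s matter.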


Technology scaling makes communication the scarce resource

                   1998                             2008
Process            0.35 µm                          0.10 µm
DRAM               64Mb                             4Gb
Processors         16 64b FP Proc, 400MHz           1K 64b FP Proc, 2.5GHz
Chip               18mm, 12,000 tracks              32mm, 90,000 tracks
Cross-chip wire    1 clock, repeaters every 3mm     20 clocks, repeaters every 0.4mm
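A few derived quantities make the 1998 versus 2008 comparison easier to read. The sketch below is an illustration under stated assumptions: it treats 18mm and 32mm as chip edge lengths, the track counts as wiring tracks across one edge, and the clock counts as the latency to cross the chip. The dictionary layout and function name are hypothetical.

```python
# Figures taken from the slide; their interpretation as edge length,
# tracks per edge, and cross-chip latency is an assumption.
chips = {
    1998: {"edge_mm": 18.0, "tracks": 12_000, "clocks_to_cross": 1, "repeater_mm": 3.0},
    2008: {"edge_mm": 32.0, "tracks": 90_000, "clocks_to_cross": 20, "repeater_mm": 0.4},
}


def derived(chip):
    """Compute per-chip quantities implied by the slide's numbers."""
    return {
        # Number of repeater-spaced segments along one cross-chip wire.
        "repeaters_per_crossing": chip["edge_mm"] / chip["repeater_mm"],
        # Distance a signal covers in one clock cycle.
        "mm_per_clock": chip["edge_mm"] / chip["clocks_to_cross"],
        # Average spacing between tracks across the edge, in micrometres.
        "track_pitch_um": 1000.0 * chip["edge_mm"] / chip["tracks"],
    }


results = {year: derived(chip) for year, chip in chips.items()}
for year, values in results.items():
    print(year, {k: round(v, 2) for k, v in values.items()})
```

Under these assumptions the distance reachable in one clock falls from the full chip edge to under 2mm, which is the sense in which communication becomes the scarce resource.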


Architecture Must Evolve to Fit the Landscape

[Figure: chip landscape, 20 clocks and 90,000 tracks across]

• Local, parallel operations: high bandwidth, low latency, and low power

• Global operations: low bandwidth, high latency, and high power


Architecture Today Depends on Fast Global Communication

[Figure: a global register file (Regs) and a global instruction unit (I-Unit)]

• All instructions issued from single global instruction unit

• All data passes through global register file

• This won’t work when global accesses cost 20 clocks of latency


Tomorrow’s Architectures Must Exploit Locality and Expose Communication

• Multiple elements (clusters) with
  – local instruction dispatch
  – local register files
  – co-located with arithmetic elements

• Explicit communication between elements through a switch or network

• Fast synchronization between instruction units

[Figure: four clusters, each with local registers (Regs) and an instruction unit (IU), connected by a switch]
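The clustered organization described above can be sketched as a toy model. This is an illustration only, not the architecture from the talk: the `Cluster` and `Switch` classes, register names, and the transfer counter are all hypothetical. It shows local operations touching only a cluster's own register file, while any value crossing between clusters must go through an explicit send.

```python
class Cluster:
    """One element: a local register file co-located with its arithmetic unit."""

    def __init__(self, name):
        self.name = name
        self.regs = {}

    def add(self, dst, a, b):
        # Local operation: reads and writes only this cluster's registers.
        self.regs[dst] = self.regs[a] + self.regs[b]


class Switch:
    """Explicit inter-cluster communication; counts every transfer."""

    def __init__(self):
        self.transfers = 0

    def send(self, src, src_reg, dst, dst_reg):
        # Copy one register value from the source cluster to the destination.
        dst.regs[dst_reg] = src.regs[src_reg]
        self.transfers += 1


c0, c1 = Cluster("c0"), Cluster("c1")
switch = Switch()
c0.regs.update(r0=2, r1=3)
c1.regs.update(r0=10)

c0.add("r2", "r0", "r1")         # local to c0: 2 + 3
switch.send(c0, "r2", c1, "r1")  # explicit communication through the switch
c1.add("r2", "r0", "r1")         # local to c1: 10 + 5

print(c1.regs["r2"], switch.transfers)
```

Because communication is a visible operation rather than a side effect of a shared register file, a compiler or programmer can count transfers and schedule around their latency.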


Multi-ALU Processor Chip


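Crafted-Cell Design

[Figure: three layouts labeled, in extracted order, Standard-Cell, Full-Custom, and Crafted-Cell; cell counts 80, 7, and 17 different cells; relative values 1x, 1.64x, and 5.25x]

[Chart: performance and area of the IRRDP and ADDSUB designs for Full-Custom, Crafted-Cell, and Standard-Cell; plotted values, in extracted order, 2.23x, 2.7x, 1.11x, 1.17x, 1.0x, 1.0x]

Results courtesy of Andrew Chang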


Interconnect: repeaters with switching

• Need repeaters every 1mm or less

• Easy to insert switching
  – zero-cost reconfiguration

• Can’t afford decision time
  – static routing
    • fixed or regular pattern
  – source routing
    • on-demand
    • requires arbitration and fanout

• Queuing and flow-control

• Pipelining control

[Figure: repeaters spaced 1mm apart, with an arbiter (Arb) and a lookup table (LUT) at the switching point]
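Source routing, listed above as one way to avoid decision time at each switch, can be sketched as follows. This is a minimal illustration under assumptions: the 2x2 topology, the node names, the port names, and the `route` function are all hypothetical, and arbitration and flow control are left out. The sender computes the whole path, and each switch only looks up the next output port in the header.

```python
def route(path_header, start, topology):
    """Follow a source-routed header hop by hop.

    topology maps (switch, output_port) -> next switch. The route is fixed
    by the sender, so each switch does a lookup rather than a decision.
    """
    node = start
    hops = [node]
    for port in path_header:
        node = topology[(node, port)]
        hops.append(node)
    return node, hops


# Hypothetical 2x2 grid: A and B on the top row, C and D on the bottom row.
topology = {
    ("A", "E"): "B", ("B", "W"): "A",
    ("C", "E"): "D", ("D", "W"): "C",
    ("A", "S"): "C", ("C", "N"): "A",
    ("B", "S"): "D", ("D", "N"): "B",
}

destination, hops = route(["E", "S"], "A", topology)
print(destination, hops)
```

A static route would instead fix the port at each switch ahead of time, which removes the header entirely at the cost of on-demand flexibility.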

Flit Interleave vs Virtual Channels (flow control through control layer, 6-flit message)

[Plot: horizontal axis 0 to 1, vertical axis 0 to 200, one curve per configuration]

• Wormhole: 40bufs, 80bufs, 160bufs
• Virtual Channels: 2vcs x 4bufs = 40bufs, 4vcs x 4bufs = 80bufs, 8vcs x 4bufs = 160bufs, 8vcs x 8bufs = 320bufs
• Interleave: 40bufs, 80bufs, 160bufs


Bandwidth Hierarchy

• Provide lots of bandwidth where it’s inexpensive
  – short wires between ALUs

• Moderate bandwidth with intermediate cost
  – local RAM associated with each ALU cluster

• Low bandwidth where it’s expensive
  – global RAM with long wires

• Very low bandwidth off chip

[Figure: global on-chip RAM, four local RAMs, and four ALU clusters, with an off-chip connection; wire lengths: global 30mm, medium 4mm, local 1mm]


Bandwidth Hierarchy

• A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy

• Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy

[Figure: global on-chip RAM connected to four local RAMs, each paired with an ALU cluster]
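One way to see why a streaming model exposes locality is to count accesses to the expensive level of the hierarchy. The sketch below is an illustration, not a method from the talk: `CountingMemory` stands in for global RAM, the three kernels are arbitrary, and all names are hypothetical. It compares running each kernel as a separate pass over global memory against streaming each element through all kernels while intermediates stay local.

```python
class CountingMemory:
    """Stand-in for global RAM that counts every read and write."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self, i):
        self.reads += 1
        return self.data[i]

    def write(self, i, value):
        self.writes += 1
        self.data[i] = value


def run_separate_passes(mem, kernels):
    # Each kernel reads its input from global memory and writes its output back.
    for kernel in kernels:
        for i in range(len(mem.data)):
            mem.write(i, kernel(mem.read(i)))


def run_streamed(mem, kernels):
    # Each element is read once, passed through every kernel locally,
    # and written once.
    for i in range(len(mem.data)):
        value = mem.read(i)
        for kernel in kernels:
            value = kernel(value)  # intermediate never leaves local storage
        mem.write(i, value)


kernels = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
separate = CountingMemory(range(8))
streamed = CountingMemory(range(8))
run_separate_passes(separate, kernels)
run_streamed(streamed, kernels)

print(separate.reads + separate.writes, streamed.reads + streamed.writes)
```

Both versions produce the same data, but the streamed version's global traffic does not grow with the number of kernels, which is the property that lets an application fit the low-bandwidth end of the hierarchy.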


Architecture Research Issues

• Processor architecture
  – configuration of ALUs
    • clustered vs distributed
  – method for controlling ALUs
    • distributed control, VLIW, SIMD
  – communication aware instruction sets
    • how to hide details while exposing communication

• Memory architecture
  – methods for exploiting 2D spatial locality
  – communication aware cache organizations

• Communication architecture
  – on-chip interconnection networks
  – the use of repeaters with switching
  – the use of hierarchy and selective ‘fat’ wires


Circuit Challenges of Slow Interconnect

• The clock cycle is dominated by wire delay
  – novel circuits to improve effective signal velocity

• Power is largely used to drive wires
  – low-swing on-chip signaling methods
  – reject rather than overpower noise

• It’s difficult to distribute a global clock
  – locally synchronous design methods
  – fast synchronizers
    • no wait for metastable decay
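The power argument for low-swing signaling can be made concrete with a first-order estimate. The sketch below assumes the standard model in which charging a wire of capacitance C through a swing V_swing from a supply V_supply draws energy C·V_swing·V_supply. The capacitance and supply voltage are hypothetical values chosen for illustration; only the 1V swing is taken from the slides.

```python
def energy_per_transition(c_farads, v_swing, v_supply):
    """Energy drawn from the supply to charge the wire once (first-order model)."""
    return c_farads * v_swing * v_supply


C_WIRE = 2e-12  # hypothetical wire capacitance: 2 pF
V_DD = 2.5      # hypothetical supply voltage in volts
V_LOW = 1.0     # 1V swing at the source, as on the signaling slide

full_swing = energy_per_transition(C_WIRE, V_DD, V_DD)
low_swing_from_vdd = energy_per_transition(C_WIRE, V_LOW, V_DD)
low_swing_from_low_supply = energy_per_transition(C_WIRE, V_LOW, V_LOW)

ratio_from_vdd = full_swing / low_swing_from_vdd
ratio_from_low_supply = full_swing / low_swing_from_low_supply
print(f"{ratio_from_vdd:.2f}x, {ratio_from_low_supply:.2f}x")
```

Under these assumptions the saving is linear in the swing when the driver is fed from the full supply and quadratic when it is fed from a dedicated low-voltage supply.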


Overdrive gives 3x improvement in RC wire latency


Low-Swing Overdrive Signaling

[Figure: signaling circuit with signals in, data, reference, pc, en, and clk; device sizes x, x, x/2, x/2]

[Plots: three waveform panels over a horizontal axis from 0 to 4: “1V Swing at Source”, “300mV Swing at Receiver”, and “Recovered Signal”]


Conclusion: Exploit, Don’t Fight, the Technology

• Interconnect is rapidly dominating the delay, power, and area of ICs

• Traditional architectures rely on global communication
  – they are ill-suited for an interconnect-dominated technology

• Emerging architectures expose communication and exploit locality
  – distributed register files and instruction dispatch
  – bandwidth hierarchy

• Novel circuits can mitigate effects of slow wires
  – overdrive, low-swing signaling, locally synchronous design
