
On-Chip Communication Design and Latency-Insensitive Protocols

Luca P. Carloni

EECS Department, University of California at Berkeley

Enabling Systems on Silicon

Data for High-Performance Microprocessors & ASIC Product Generations

Source: Executive Summary of the 2001 International Technology Roadmap for Semiconductors

Deep Sub-Micron (DSM) Technologies

Technology [nm] (year) | # transistors [M] | on-chip local clock [MHz] | power [x10 mW]
130 (‘01) | 276 | 1684 | 6100
115 (‘02) | 348 | 2317 | 7500
100 (‘03) | 439 | 3088 | 8100
90 (‘04) | 553 | 3990 | 8500
80 (‘05) | 697 | 5173 | 9200
70 (‘06) | 878 | 5631 | 9800
65 (‘07) | 1106 | 6739 | 10400
45 (‘10) | 2212 | 11511 | 12000
32 (‘13) | 4424 | 19348 | 13800
22 (‘16) | 8848 | 28751 | 15800

The Productivity Gap Problem

[Figure: peak design staff (log scale) across generations 1 to 8, including a prediction]

G. Spirakis (Intel), GSRC Retreat, Stanford, Feb 12, 1999

“Design staff has doubled each generation!”

• While chip makers could increase the number of logic transistors per chip by close to 60% each year, the productivity of design automation tools has been growing at a rate of only 21% per year [Electronic Business, June ’99]

The On-Chip Interconnect Latency Roadblock

• RC delay of an average metal line with constant length is worsening with each process generation despite:
– increases in metal layers and in wire aspect ratio
– one-time technological improvements such as copper metallization and low-κ dielectric insulators

• The intrinsic interconnect delay of a 1-mm interconnect for a 35-nm technology will be longer than the MOSFET switching delay by two orders of magnitude [Davis et al., IEEE Proc. ‘01]

• Shift from function-centric to communication-centric design
– Instead of being limited by the number of transistors that can be integrated on a single die (computation bound), designs will be limited by the amount of state and logic reachable within the required number of clock cycles (communication bound)

DSM: Percentage of Reachable Die

[Figure: percentage of reachable die (0 to 100) vs. process technology (250, 180, 130, 100, 80, 60 nm), with curves for 1, 2, 4, 8, and 16 clock cycles]

• “For a 60-nanometer process a signal can reach only 5% of the die’s length in a clock cycle” [D. Matzke, ‘97]

• Cause: Combination of high frequencies and slower wires

The Future of Wires [Horowitz et al., IEEE Proc., ’01]

• Local (scaled-length) wires
– span a fixed number of gates, scale well together with logic

• Global (fixed-length) wires
– span a fixed fraction of a die, do not scale

[Figure: local and global wires under scaling]

Outline - The Impact of Wire Latency in SOC Design

• Interconnect latency
– is increasingly affecting microprocessor design
    • the amount of state reachable in a clock cycle, not the number of transistors, is going to limit ILP growth [Burger et al., ISCA ’00]
    • drive stages in Intel Pentium 4 NETBURST
    • clustered functional units and partitioned register file in Alpha 21264
– is hard to estimate because it is affected by many phenomena
    • process variations, cross-talk, power-supply drop variations
– breaks the synchronous assumption
    • that lies at the basis of design automation tool flows

• Towards distributed design
– wire pipelining is destined to become pervasive in SOC design
    • trades off fixing a design exception against increasing wire latency
– rise of on-chip multiprocessor architectures and on-chip networks
– new design methodologies for component reuse/plug-and-play that are robust to interconnect latency variance

Traditional Design Flow & the Timing Closure Problem

• founded on the synchronous design methodology

– longest combinational path (critical path) dictates the maximum operating frequency

– operating frequency is often a design constraint

– design exception: a path with delay larger than clock period

• many costly iterations between synthesis and layout because

– steps are performed independently

– accurate estimations of global wire latencies are impractical

– statistical delay models badly estimate post-layout wire load capacitance

[Figure: traditional design flow [Kapadia et al., DAC ’99]. Boxes: RTL and constraints w/ statistical wire load models; logic synthesis; constraints met?; floorplanning / coarse placement; detailed placement / placement merge; constraints met?; re-optimization (buffering, sizing, fanout opt., critical path opt.); routing / layout merge; constraints met?; in-place optimization (buffering, sizing); final layout]

Converging to Final Layout by Fixing Design Exceptions

• Re-placing, re-routing, re-designing
– do not alleviate the timing-closure problem

• Combining logic synthesis and physical design
– difficult because logic synthesis is inherently unstable
    • small variations in the HDL RTL specification may lead to major variations in the output netlist and, consequently, in the final layout

• Wire buffering
– efficient, but carries precise limitations
    • there is a limit to the number of buffers that designers can insert on a wire and still reduce delay

Wire Buffering and Wire Pipelining

• Wire Delay
– grows quadratically with wire length

• Wire Buffering
– if optimal, makes wire delay grow linearly with its length
– reduces the increase of the wire delay vs. gate delay ratio in future process technologies
    • from 2000X to 40X for global wires
    • from 10X to 3X for local wires

• Wire Pipelining
– is necessary to meet the specified clock period
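The quadratic-versus-linear claim above can be illustrated with a small sketch. It assumes a simple Elmore-style model (delay of an unbuffered distributed RC wire taken as 0.5·r·c·L²) and a fixed, hypothetical per-buffer delay `t_buf`; the function names and parameter values are illustrative and do not come from the slides.

```python
import math

def unbuffered_delay(r, c, length):
    """Elmore-style delay of a distributed RC wire: grows with length**2."""
    return 0.5 * r * c * length ** 2

def buffered_delay(r, c, length, t_buf, segments):
    """Delay when the wire is cut into equal segments, each driven by one buffer."""
    per_segment = unbuffered_delay(r, c, length / segments) + t_buf
    return segments * per_segment

def best_segment_count(r, c, length, t_buf):
    """Integer segment count that minimizes buffered_delay.

    The continuous optimum is length * sqrt(0.5*r*c / t_buf); the delay is
    convex in the segment count, so checking floor and ceil is enough.
    """
    ideal = length * math.sqrt(0.5 * r * c / t_buf)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda n: buffered_delay(r, c, length, t_buf, n))

# Hypothetical unit-less parameters chosen so the arithmetic is easy to follow.
R, C, T_BUF = 2.0, 1.0, 1.0
for length in (4, 8):
    n = best_segment_count(R, C, length, T_BUF)
    print(length, unbuffered_delay(R, C, length), n, buffered_delay(R, C, length, T_BUF, n))
```

Under this model, doubling the length quadruples the unbuffered delay but only doubles the optimally buffered one, and inserting more buffers than the optimum makes the delay worse again.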

Buffering and Chip-Edge Long Global Wire Latency [Saraswat et al., ’02]

[Figure: wire latency (clock cycles, 0 to 16) vs. technology (180, 150, 120, 100, 70, 50, 35 nm) for three cases: unbuffered, optimally buffered, and buffered w/ DP=25%]

• Despite optimal buffering, wire latency increases ~8X in a 35-nm technology due to increases in: clock frequency (~3X), chip length (~1.45X), delay per unit length (~1.85X); the product 3 × 1.45 × 1.85 is approximately 8

Retiming of Global Interconnects [Cong et al., DAC ’03]

• By using Sequential Timing Analysis theory, combine retiming and placement in the physical design phase to optimize problematic interconnects (those that must be crossed in one cycle)

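[Figure: circuit with nodes a, b, c, d, before and after retiming]

• retiming does not reduce the critical path delay (D=4)

[Figure: a second arrangement of nodes a, b, c, d, before and after retiming]

• retiming reduces the critical path delay from D=5 to D=3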

Beyond Retiming: Wire Pipelining by Flip-Flop Insertion

• A theoretical lower bound limits retiming
– max delay-to-register cycle ratio [Papaefthymiou, ‘91]

• Wire pipelining
– trades off fixing a wire exception against increasing the wire latency by one or more cycles
– will become pervasive in DSM designs, where most global wires will be heavily pipelined anyway

• Combining wire buffering and wire pipelining
– buffer/FF feasible region planning at the architectural floorplanning/wireplanning stage [Koh et al., DATE ’02]
– performance-driven concurrent buffer/FF insertion for latency-constrained interconnects [Cocchini, ICCAD ’02]
– both approaches assume that latency constraints are predefined by micro-architecture designers
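The lower bound mentioned above, the maximum delay-to-register ratio over all cycles, can be sketched for a small circuit graph. This is a minimal brute-force illustration, assuming integer node delays and edges annotated with register counts; the graph, the names, and the enumeration strategy are all hypothetical and only suitable for tiny examples.

```python
from fractions import Fraction

def max_delay_to_register_ratio(delay, edges):
    """Largest (sum of node delays) / (number of registers) over all simple cycles.

    delay: dict mapping node -> integer combinational delay
    edges: list of (src, dst, registers_on_edge)
    """
    nodes = sorted(delay)
    index = {n: i for i, n in enumerate(nodes)}
    succ = {n: [] for n in nodes}
    for u, v, regs in edges:
        succ[u].append((v, regs))
    best = None

    def dfs(start, node, path_delay, path_regs, visited):
        nonlocal best
        for nxt, regs in succ[node]:
            if nxt == start:
                total_regs = path_regs + regs
                if total_regs == 0:
                    raise ValueError("combinational cycle without registers")
                ratio = Fraction(path_delay, total_regs)
                if best is None or ratio > best:
                    best = ratio
            elif index[nxt] > index[start] and nxt not in visited:
                # Only visit nodes ordered after the start so that each
                # cycle is enumerated exactly once.
                dfs(start, nxt, path_delay + delay[nxt], path_regs + regs, visited | {nxt})

    for s in nodes:
        dfs(s, s, delay[s], 0, {s})
    return best

# Hypothetical circuit: two feedback loops sharing nodes a and b.
delay = {"a": 1, "b": 2, "c": 1, "d": 1}
edges = [("a", "b", 0), ("b", "c", 1), ("c", "a", 1), ("b", "d", 0), ("d", "a", 1)]
print(max_delay_to_register_ratio(delay, edges))
```

In this example the loop a, b, d has total delay 4 and a single register, so no retiming can bring the clock period below 4; only adding registers (pipelining) changes the ratio.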

Stateless Repeaters vs. Stateful Repeaters

• Both buffers and flip-flops are wire repeaters
– regenerate the signals traveling on long wires

• Stateful repeaters
– storage elements, which carry a state
    • flip-flops, latches, registers, relay stations…
    • generally, the state must be initialized

• Inserting stateful repeaters has an impact on the surrounding control logic
– if the interface logic of two communicating modules assumed a certain latency, then costly rework is necessary to account for additional pipeline stages
– formal methods are necessary to enable automatic insertion of stateful repeaters

Breaking the Synchronous Assumption

• Presence of multi-cycle interconnect paths
– long global wires are pipelined
    • “in high-end microprocessors the clock frequency is determined primarily by the time needed to complete local computation loops, not by the time needed for global communication” [2001 ITRS]
– Compaq Alpha 21264, Intel Pentium 4
    • “there are plenty of places in the Pentium 4 where the wires were pipelined” [Sprangle & Carmean, ISCA ’02]

• High variance of on-chip clock rate
– more than 2 orders of magnitude

• Today’s high-end microprocessors and tomorrow’s SOCs are distributed systems
– design methodologies for component reuse/plug-and-play that are robust to interconnect latency variance

Microprocessors as Distributed Systems

[Figure: Core Clock and Bus Clock (0 to 2500) for Intel processors 386, 486-33, 486-DX2, 486-DX4, P-100, P-150, P-200, PII-300, PII-400, PIII-600, PIII-700, PIII-800, PIII-1G, P4-1.5G, P4-2.0G; source: Intel Corp.]

• Bus Clock vs. Processor Core Clock
– Bus frequency is not keeping up with the processor core

• Increasing interconnect latency will penalize current memory-oriented microprocessor architectures
– based on the assumption of low-latency communication with centralized structures such as caches, register files, and rename/reorder tables

• Studies employing cache delay analysis tools predict that in a 35-nm design running at 10 GHz, accessing even a 4KB L-1 cache will require 3 clock cycles [Burger, Keckler et al., ISCA ’00]

The Evolution of Scalar Operand Networks [Agarwal et al. ’02]

• more functional units ⇒ more live values at greater distances ⇒ more physical registers, register file ports, bypass paths… ⇒ more difficult to build larger, high-frequency scalar operand networks based on centralized register files

[Figure: pipelined processor with bypassing links and multiple ALUs]

• additional MUXs, pipeline registers, and bypass paths make the scalar operand network look more like a “traditional network”

The Impact of Wire Latency on Microprocessor Design

• Alpha 21264
– marks the point when wire delays can no longer be ignored at the micro-architectural level
– integer unit is partitioned into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters

• Pentium 4
– two pipeline stages (“Drive Stages”) solely for the traversal of long wires between remote components

• The architecture becomes a distributed system whose components must be designed while accounting for communication delays

Scalability Challenges

• Delay scalability
– maintain high clock frequencies as the design scales

• Logic (and wire) pipelining
– turning propagation delay into pipeline latency is the only option to build larger structures and still maintain high frequencies

• Bandwidth scalability
– a design scales without inordinately increasing the relative percentage of interconnect resources

• Communication efficiency
– replace broadcasting mechanisms (like buses in superscalars) with unicast routes and point-to-point protocols

• Deadlock & starvation
– less centralized structures, more produce/consume mismatches

The RAW Microprocessor [Agarwal et al., ’02]

• RAW scalar operand network addresses
– the delay scalability challenge through tiling
    • a signal can travel across the logic of a tile in one clock cycle
    • modulo building a good clock tree, the frequency does not decrease as more tiles are added
– the bandwidth scalability challenge by replacing buses with a point-to-point mesh interconnect
    • the p2p static network is programmed to route operands only to those tiles that need them

[Figure: RAW chip. Labels: 16 programmable tiles; 122M transistors; 25 GB/s I/O bandwidth; 2MB SRAM; 43 GB/s memory bandwidth; PE with 8-stage MIPS, 4-stage FPU, 32KB cache; 96KB SRAM; static & dynamic routers; 256 wires]

• Scaling properties
– the design has no centralized resources, no global buses, and no structures that get larger as the tile or pin count increases
– the longest wire, the design complexity, and the verification complexity are all independent of transistor count

Architectural Solution: Tiling

[Figure: spectrum of tiled architectures. FPGA: millions of gates; PIM: 256 PE; Fine-grain CMP: 64 in-order cores; Coarse-grain CMP: 16 out-of-order cores; TRIPS: 4 ultra-large cores]

• Raw processor
– scalable ISA to provide a parallel software interface to on-chip physical resources that become first-class architectural entities
    • gates → tiles
    • wire delays → network hops
    • pins → I/O ports
– dynamic mapping of sequential program to small number of ALUs
– dynamic stalls for non-fast paths and mispredicted code
– speculative cache-miss handling (prefetching)
– wire delay is exposed to the software programmer
    • to go from corner to corner of the processor takes 6 network hops, which corresponds to approximately 6 cycles of wire delay

• Classification of on-chip architectures based on the granularity of parallel processing elements and memories [Burger, Keckler et al., ISCA ’03]
– Polymorphous TRIPS can be configured to adapt to various kinds of parallelism
    • data-level, instruction-level, thread-level

Parallelisms between Chip Multi-Processor Design and SOC Design

• Mapping applications on a tiled architecture
– designing application-specific hardware circuits

• Compiler code optimizations
– similar to CAD-flow optimizations for SOC design

• Impact of interconnect latency
– balance communication and computation latencies in SOC design
– architecture (re-)configuration and application mapping must account for exposed communication latency

• Design of on-chip network
– map power/performance trade-off to target applications

Packet-Routing On-Chip Networks [Dally et al. ‘01]

• Regular tiled structure
– with on-chip network using a 2-dimensional folded torus topology
    • cyclically connected as 0,2,3,1

• Packet routing
– between any pair of tiles (not just neighbors)

• On-chip network
– implicit sharing of interconnect resources
– regular topology enables interconnect optimization (e.g. aggressive signaling circuits)
– higher bandwidth and multiple concurrent communications with respect to buses

[Figure: 4×4 array of tiles labeled 00 to 33]

• Key features
– on-chip communication bandwidth is not an issue because of the many wiring tracks
– unlike modern designs with their all-or-nothing locality, latency varies continuously with distance
– latency is controlled by placing data near their point of use, not just at their point of use
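The folded-torus ordering quoted on this slide (0, 2, 3, 1) can be compared with a plain ring in a few lines. This is a small sketch under the assumption that the numbers denote physical tile positions along one dimension and that wire length is measured in tile pitches; the helper names are hypothetical.

```python
def ring_links(order):
    """Links of a ring that visits physical positions in the given cyclic order."""
    return [(order[i], order[(i + 1) % len(order)]) for i in range(len(order))]

def longest_link(order):
    """Longest wire, in tile pitches, among the links of the ring."""
    return max(abs(a - b) for a, b in ring_links(order))

plain = [0, 1, 2, 3]    # unfolded ring: needs a wrap-around wire from 3 back to 0
folded = [0, 2, 3, 1]   # folded ordering quoted on the slide

print(longest_link(plain), longest_link(folded))
```

With the plain ordering the wrap-around link spans three tile pitches, while the folded ordering keeps every link within two, which is the usual motivation for folding a torus.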

Latency Insensitive Design [ICCAD ’99]

• Chip assembled using synchronous intellectual property (IP) cores exchanging data by means of point-to-point communication channels

• Channels use a simple communication protocol which allows arbitrary latency

• Interface Logic Blocks (the shells) encapsulate and “protect” the IP cores (the pearls)

• Assume-Guarantee Reasoning to formally verify IP cores and communication protocols in separate steps

• Recycling, a new design transformation
– an arbitrary number of sequential elements (relay stations) can be distributed on a long wire between any pair of shells to pipeline it and drive it at a higher frequency

The IP Encapsulation Approach

[Figure: pearls P1 to P7 (synchronous IP cores), each encapsulated in a shell (interface logic block) and connected by channels (short wires and long wires)]

Channel Segmentation

[Figure: pearls P1 to P7 (synchronous IP cores) in shells (interface logic blocks), connected by channels (short wires and long wires); the long channels are segmented by relay stations (RS)]

LIP: Key Points

• Relax time constraints during early phases of the design when correct measures of the delay paths among the SOC modules are not available

• ASSUMPTION: The functionality of the design is based on the sequencing of the signals and not on their exact timing
– sufficient to require IPs be stallable (via clock-gating)

• Design specification relies on synchronous assumption, design implementation can be synchronous, asynchronous, GALS

The Theory of Latency Insensitive Protocols

[Figure: synthesis turns a STRICT SYSTEM (processes P1, P2) into a PATIENT SYSTEM (processes P1’, P2’)]

S = . . a . . b . . c . . d . .

S’ = . . . a . b . . . . c . . . d .

The Tagged-Signal Model [Lee & Sangiovanni ‘96]

• Event : a member of V x T
– V : set of values, T : set of tags

• Signal : a set of events
– s = { (v1, t1), (v2, t2), … , (vk, tk) }

• Process : a subset P of the set of N-tuples of signals

• Behavior : a tuple of signals b = (s1, s2, …, sN) which satisfies a process P

• System : a composition of processes P1, … , PM
– (i.e. the intersection of their behaviors)

The Tagged-Signal Model [Lee & Sangiovanni ‘96]

• Synchronous Events have the same tag
– signals s1, s2 are synchronous if each event in s1 is synchronous with an event in s2 and vice versa

• Synchronous System: every signal in the system is synchronous with every other signal

• Timed System: the set T of tags (timestamps) is a totally ordered set
– the ordering among the timestamps of a signal s induces a natural order on the set of events of s
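As a small illustration of these definitions, a signal can be represented as a Python set of `(value, tag)` pairs. The representation and the function names below are assumptions made for this sketch, not notation from the tagged-signal model itself.

```python
def tags(signal):
    """Set of tags carried by the events of a signal."""
    return {tag for _value, tag in signal}

def synchronous(s1, s2):
    """Each event of s1 shares its tag with an event of s2, and vice versa."""
    return tags(s1) == tags(s2)

def is_synchronous_system(signals):
    """Every signal is synchronous with every other signal."""
    return all(synchronous(signals[0], s) for s in signals[1:])

s1 = {("a", 0), ("b", 1)}
s2 = {("x", 0), ("y", 1)}
s3 = {("x", 0), ("y", 2)}   # second event carries a different tag

print(synchronous(s1, s2), synchronous(s1, s3))
```

Because "synchronous" only compares tags, two signals with entirely different values can still be synchronous, as `s1` and `s2` are here.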

Latency Insensitive Systems [Carloni et al. CAV ‘99]

• Latency Insensitive System : a synchronous timed system with T = ℤ and V = Σ ∪ {τ}
– Σ is the set of informative symbols exchanged among the system processes
– τ ∉ Σ is the stalling symbol representing the absence of an informative symbol

• Finite Horizon : for each signal s there is a greatest timestamp T(s) which corresponds to the last informative event
– assume an infinite sequence of τ after T(s)

Informative Events and Stalling Events

s = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ τ τ τ τ ...

(timestamps 1, 2, 3, ..., 13, 14; T(s) = 14 is the timestamp of the last informative event, i5)

• Filtering Operator:

Fi(s) = i1 i2 i1 i2 i3 i1 i2 i1 i5

Fi[3, 13](s) = i1 i2 i3 i1 i2 i1

• Sequence Length:

|Fi(s)| = 9

|Fi[3, 13](s)| = 6

Strict Signals and Stalled Signals

• Strict Signal: all informative events precede all stalling events

s1 = i1 i2 i1 i2 i3 i1 i2 i1 i5 τ τ τ τ τ τ ....

• Stalled Signal: a signal which is not strict

s2 = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ ...

Latency Equivalence

• Ordinal of an informative event ek of a signal s:

ord(ek) = |Fi[0, k](s)| - 1

• s1 ≡ s2 iff Fi(s1) = Fi(s2)
– corresponding informative events have the same ordinals

• Example:

s1 = i1 i2 i1 i2 i3 i1 i2 i1 i5 τ τ τ τ τ τ ....

s2 = i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ ...
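The filtering operator, the ordinal, and latency equivalence can be tried out on the example signals. The sketch below assumes signals are finite Python lists (a prefix of the infinite sequence), timestamps start at 1 as in the example, and τ is written as the string "τ"; the function names are hypothetical.

```python
TAU = "τ"

def informative(signal, lo=None, hi=None):
    """Filtering operator Fi: keep informative events, optionally only those
    with timestamps in [lo, hi] (1-based, inclusive)."""
    lo = 1 if lo is None else max(lo, 1)
    hi = len(signal) if hi is None else hi
    return [v for t, v in enumerate(signal, start=1) if lo <= t <= hi and v != TAU]

def ordinal(signal, k):
    """ord(ek) = |Fi[0, k](s)| - 1 for the informative event at timestamp k."""
    return len(informative(signal, 0, k)) - 1

def is_strict(signal):
    """True if no informative event follows a stalling event."""
    seen_tau = False
    for v in signal:
        if v == TAU:
            seen_tau = True
        elif seen_tau:
            return False
    return True

def latency_equivalent(s1, s2):
    """s1 ≡ s2 iff Fi(s1) = Fi(s2)."""
    return informative(s1) == informative(s2)

s1 = "i1 i2 i1 i2 i3 i1 i2 i1 i5 τ τ τ τ τ τ".split()
s2 = "i1 i2 τ i1 i2 i3 τ i1 i2 τ τ τ i1 i5 τ τ".split()

print(len(informative(s2)), len(informative(s2, 3, 13)), latency_equivalent(s1, s2))
```

The third informative event sits at timestamp 3 in `s1` and at timestamp 4 in `s2`, yet `ordinal` returns the same value for both, which is what "corresponding informative events have the same ordinals" expresses.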

Stalling

• A stall move postpones an informative event of a signal sj of a behavior b by 1 timestamp

b = (s1, s2, s3)

s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...

b’ = stall(b, e3(s2))

s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...
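The stall move in this example can be reproduced with a short sketch. It assumes, following the example, that stalling the event at timestamp k inserts a τ there and shifts that event and all later events of the same signal by one timestamp; signals are finite lists, the signal index `j` is 0-based, and `stall` is a hypothetical helper name.

```python
TAU = "τ"

def stall(behavior, j, k):
    """Return a copy of the behavior in which signal j gets a τ inserted at
    timestamp k (1-based), postponing the event there and all later ones."""
    stalled = [list(s) for s in behavior]
    stalled[j].insert(k - 1, TAU)
    return stalled

b = [
    "i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ".split(),   # s1
    "i1 i3 i6 i1 τ i2 τ τ i1 i5 τ τ".split(),   # s2
    "i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ".split(),   # s3
]

b_prime = stall(b, j=1, k=3)   # stall(b, e3(s2))
print(" ".join(b_prime[1]))
```

Only `s2` changes; `s1` and `s3` are left as they were, so any further delays on them come from the procrastination effects discussed next, not from the stall move itself.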

Ordering The Events

e1 ≤lo e2 iff ord(e1) < ord(e2), or ord(e1) = ord(e2) and s(e1) ≤c s(e2)

• To avoid cyclic behaviors by processing events with the same ordinal, we assume a well-founded order among signals

– in real-life design it corresponds to Mealy’s input-output relations

• The ordering among events is motivated by causality relations

– Past events do not depend on future events
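The event ordering can be sketched as a tuple comparison. This assumes a lexicographic reading of ≤lo (ordinal first, then the well-founded signal order ≤c as a tie-break), that an event is represented as an `(ordinal, signal_name)` pair, and that ≤c is given by a hypothetical `signal_rank` table; none of these representations come from the slides.

```python
def precedes(e1, e2, signal_rank):
    """e1 ≤lo e2: compare ordinals first, then the rank of the owning signals."""
    (ord1, sig1), (ord2, sig2) = e1, e2
    return (ord1, signal_rank[sig1]) <= (ord2, signal_rank[sig2])

# Hypothetical signal order: inputs before outputs, as in a Mealy-style relation.
rank = {"in": 0, "out": 1}

print(precedes((2, "in"), (2, "out"), rank))   # same ordinal, input first
print(precedes((3, "in"), (2, "out"), rank))   # larger ordinal comes later
```

Python compares tuples lexicographically, so the tie-break on the signal order is only consulted when the two ordinals are equal.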

Procrastination Effects

• After a stall move on a signal sj, causality relations may induce procrastination effects on other signals sk of behavior b

b’ = stall(b, e3(s2))

s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 i4 τ i2 τ τ i1 i5 τ τ ...

b’’ ∈ PE[stall(b, e3(s2))]

s1 = i1 i2 i6 i2 τ i2 τ τ i1 i5 τ τ ...
s2 = i1 i3 τ i6 i1 τ i2 τ τ i1 i5 τ τ ...
s3 = i1 i2 i5 τ i4 τ i2 τ τ i1 i5 τ τ ...

Patient Processes

P is patient iff ∀b = (s1, .. , sN) ∈ P, ∀j ∈ [1,N], ∀ek ∈ E(sj): PE[stall(b, ek(sj))] ∩ P ≠ ∅

• A patient process can take stall moves on any signal of its behaviors by reacting with the appropriate procrastination effects

• Examples:
– if an input event is stalled, then some output events may be delayed
– if a downlink process requests to delay an output event (backpressure), then future input events may be delayed

Compositionality

• For patient processes the notion of latency equivalence is compositional

• Th.1: P1 and P2 patient ⇒ P1 ∩ P2 patient

• Th.2: for all patient P1, Q1, P2, Q2: P1 ≡ Q1 and P2 ≡ Q2 ⇒ (P1 ∩ P2) ≡ (Q1 ∩ Q2)

• Th.3: for all strict P1, P2 and patient Q1, Q2: P1 ≡ Q1 and P2 ≡ Q2 ⇒ (P1 ∩ P2) ≡ (Q1 ∩ Q2)

• Major Result: if all processes in a strict system are replaced by corresponding patient processes, then the resulting system is latency equivalent to the original one

Strict Process and Channels

[Figure: processes Pa, Pb, Pc, Pd, Pe, Pf, Pg connected by channels, including C(1,2), C(3,5), and C(4,6)]

• A channel is a connection process C(j,k) constraining two signals to be identical

b = (s1, .. , sN) ∈ C(j,k) iff sj = sk

• A channel is NOT a patient process
• Communication is based on the synchronous hypothesis
• Unfortunately, the final system implementation may require more than one clock cycle to “travel” a channel

Relay Stations

[Figure: buffer B[1,1,1](i,j) between signals si and sj]

si = i1 τ i2 τ i3 τ i4 τ τ τ i5 τ i6 τ i7 τ i8 τ i9 τ τ ....
sj = τ i1 τ i2 τ i3 τ i4 τ τ τ i5 τ i6 τ i7 τ i8 τ i9 τ τ ....

• B[2,1,1](i,j) is called a relay station and is the minimum-capacity patient buffer able to “transfer” one informative unit per timestamp, thus allowing, in the best case, a maximum throughput equal to 1

[Figure: relay station B[2,1,1](i,j) between signals si and sj]

si = i1 i2 i3 τ τ i4 i5 i6 τ τ τ i7 τ i8 i9 τ τ ....
sj = τ i1 i2 i3 τ τ i4 τ τ τ i5 i6 i7 τ i8 i9 τ τ ....
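A relay station can be approximated by a small behavioral model: a two-slot buffer with one cycle of forward latency and a registered stop signal. This is a sketch under stated assumptions, not the circuit from the slides: `None` stands for τ, a datum presented while `stop_out` is asserted is treated as not consumed (the sender keeps presenting it), and the `run` driver and all names are hypothetical.

```python
TAU = None  # stalling (void) symbol

class RelayStation:
    def __init__(self):
        self.main = TAU   # datum currently driven on the output
        self.aux = TAU    # extra slot used while the stop propagates upstream

    def step(self, data_in, stop_in):
        """Advance one clock cycle; return (data_out, stop_out) for this cycle."""
        stop_out = self.aux is not TAU      # registered: depends on state only
        data_out = self.main
        if not stop_in:                     # downstream consumed the output
            self.main, self.aux = self.aux, TAU
        if data_in is not TAU and not stop_out:
            if self.main is TAU:
                self.main = data_in
            else:
                self.aux = data_in          # absorb the datum already in flight
        return data_out, stop_out

def run(items, stop_pattern):
    """Push items through one relay station under a downstream stop pattern."""
    rs, sent, received = RelayStation(), 0, []
    for stop_in in stop_pattern:
        data_in = items[sent] if sent < len(items) else TAU
        data_out, stop_out = rs.step(data_in, stop_in)
        if data_in is not TAU and not stop_out:
            sent += 1                       # the station accepted the datum
        if data_out is not TAU and not stop_in:
            received.append(data_out)       # the consumer accepted the datum
    return received

print(run([1, 2, 3, 4], [False, True, True, False, False, False, False, False]))
```

Without back-pressure the model forwards one item per cycle with a latency of one; when the consumer stalls, the second slot holds the item that was already sent, so nothing is lost or reordered.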

Implementation of a Relay Station

[Figure: relay station datapath with MainReg, AuxReg, StopReg, mux, demux, and control logic; ports dataIn, dataOut, stopIn, stopOut, voidIn, voidOut]

• Stop signals implement the back-pressure mechanism of the protocol
• Void signals detect invalid data (stalling, or τ, events) due to unexpected latencies

Encapsulation of a Stallable Process

• Th.5: Shell and Core are latency equivalent

[Figure: a Shell Process wrapping a Stallable Core Process, with ERS, Stalling Signal Generator, and Equalizer blocks, shown among pearls P1 to P7]

Implementation of a Shell

[Figure: shell around a stallable core module, with input queues Queue 1, Queue 2, Queue 3 and control logic; ports dataIn1, dataIn2, dataIn3, dataOut, stopIn, voidOut, stopOut1, stopOut2, stopOut3, voidIn1, voidIn2, voidIn3]

• min queue length = 2

• min queue latency = 0
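The role of the shell can be sketched behaviorally: it queues each input, fires the core only when every queue holds a datum and the output is not stopped, and otherwise stalls the core and emits τ. The model below is a simplified assumption-laden illustration: the core is a pure Python function, `None` stands for τ, and the class and parameter names are hypothetical.

```python
from collections import deque

TAU = None  # stalling (void) symbol

class Shell:
    def __init__(self, core, n_inputs, queue_len=2):
        self.core = core
        self.queues = [deque() for _ in range(n_inputs)]
        self.queue_len = queue_len

    def step(self, inputs, stop_in=False):
        """Advance one clock cycle; return (data_out, stop_outs)."""
        # Stop signals depend only on the queue state at the start of the cycle.
        stop_outs = [len(q) >= self.queue_len for q in self.queues]
        for q, value, stopped in zip(self.queues, inputs, stop_outs):
            if value is not TAU and not stopped:
                q.append(value)
        # Stall the core if the output is stopped or some input is missing.
        if stop_in or any(len(q) == 0 for q in self.queues):
            return TAU, stop_outs
        args = [q.popleft() for q in self.queues]
        return self.core(*args), stop_outs

shell = Shell(core=lambda a, b: a + b, n_inputs=2)
a_stream = [1, TAU, 2, 3]
b_stream = [10, 20, TAU, 30]
outputs = [shell.step([a, b])[0] for a, b in zip(a_stream, b_stream)]
print(outputs)
```

Filtering the τ events out of `outputs` gives the same sequence the adder would produce with perfectly aligned inputs, which is the latency-equivalence property the shell is meant to preserve.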

The IP Encapsulation Approach - reprise

[Figure: pearls P1 to P7 (strict processes), each encapsulated in a shell (patient process) and connected by channels (short wires and long wires)]

Channel Segmentation - reprise

[Figure: pearls P1 to P7 (strict processes) in shells (patient processes), connected by channels (short wires and long wires); the long channels are segmented by relay stations (RS)]

Remarks

• RTL design, layout and routing of individual blocks do not need to be changed to reflect any necessary changes in wire latencies during the chip-level layout

• Obviously the final result is satisfactory only to the extent that a sufficient throughput can be maintained after increasing channel latencies

Correct-by-Construction Design

Design and validate the chip as a collection of synchronous modules

Encapsulate each module within a shell to make it latency-insensitive

Apply traditional logic synthesis and place & route

Insert relay stations to meet clock cycle

Case Study: PDLX Microprocessor

• Complete latency insensitive design of PDLX, an out-of-order microprocessor with speculative execution
– ISA: same as Hennessy & Patterson’s DLX
– PDLX architecture: based on an extended version of Tomasulo’s Algorithm, which combines traditional dynamic scheduling with hardware-based speculative execution

PDLX Architecture

[Figure: PDLX block diagram. Units: Fetch Unit, I-Cache, MMU (i), Branch Processing Unit, Decode Unit, Dispatch Unit, two Load Units, two ALUs, FP Unit, two Store Units, D-Cache, MMU (d), Completion Unit (reorder buffer), System Register Unit; blocks labeled "res" and "RS" are attached to the units]

PDLX: IP Encapsulation

[Figure: the same PDLX units (Fetch Unit, I-Cache, MMU (i), Branch Processing Unit, Decode Unit, Dispatch Unit, Load Units, ALUs, FP Unit, Store Units, D-Cache, MMU (d), System Register Unit), with their "res" and "RS" blocks, after IP encapsulation]

PDLX: Experimental Framework

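[Figure: experimental framework. An architecture spec and a channel latency spec define a particular PDLX implementation. A C program (binary search, permutations) is compiled by the DLX compiler into DLX assembly code, which runs both on the DLX simulator and on the PDLX implementation. Each run produces a log of memory accesses, and the two logs are compared by a latency equivalence test]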

PDLX: Performance Analysis

[Figure: throughput (0 to 0.8) of PDLX 1, PDLX 2, and PDLX 3 for configurations L.0.0, L.2.0, L.4.0, L.0.1, L.2.1, L.4.1, L.0.2, L.2.2, L.4.2]

[Figure: effective throughput (0 to 1.4) of PDLX 1, PDLX 2, and PDLX 3 for configurations L.0.0 through L.5.0, L.0.1 through L.5.1, and L.0.2 through L.5.2]

Moving Around the Latency

[Figure: graph G with source S, vertices V1 to V15, sink T, two relay stations (RS), and critical cycle C6]

critical cycle = C6

thp(G) = 10 / (10 + 2) = 0.83

38% performance gain

relay stations can be pushed around without the need of redesigning any component
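The throughput figure on this slide can be reproduced with a two-line calculation. The sketch assumes that "10 / (10 + 2)" reads as 10 processes and 2 relay stations on the critical cycle, and that the throughput of the whole graph is set by its slowest cycle; both readings, and the function names, are assumptions rather than statements from the slides.

```python
from fractions import Fraction

def cycle_throughput(processes, relay_stations):
    """Throughput of one cycle: m / (m + n) for m processes and n relay stations."""
    return Fraction(processes, processes + relay_stations)

def system_throughput(cycles):
    """Throughput of the graph, limited by its slowest (critical) cycle."""
    return min(cycle_throughput(m, n) for m, n in cycles)

# Critical cycle from the slide, plus a hypothetical cycle with no relay stations.
print(float(cycle_throughput(10, 2)))
print(system_throughput([(10, 2), (4, 0)]))
```

Moving a relay station from a short cycle to a longer one raises that cycle's ratio, which is how relocating latency can improve throughput without touching the components themselves.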

LIP Advantages

• By orthogonalizing communication and computation, latency-insensitive design addresses
– the timing closure problem
    • enables wire pipelining among sequential elements regardless of feedback loops, thanks to automatic synthesis of interface logic
– the productivity gap problem
    • complex systems can be built by assembling pre-designed and pre-verified blocks regardless of inter-communication latency

• Formal methods to build systems robust with respect to arbitrary latency variations
– on-chip communication/computation latencies can be rebalanced up to late phases of the design process by moving latency around across computational elements
– protocol to power down components (via stalling events)
