ECE260B – CSE241A Winter 2005 Clocking

ECE 260B – CSE 241A Clocking 1 http://vlsicad.ucsd.edu ECE260B – CSE241A Winter 2005 Clocking Website: http://vlsicad.ucsd.edu/courses/ece260b-w05 Slides courtesy of Prof. Andrew B. Kahng


Post on 31-Jan-2016

Page 1: ECE260B – CSE241A Winter 2005 Clocking

ECE 260B – CSE 241A Clocking 1 http://vlsicad.ucsd.edu

ECE260B – CSE241A

Winter 2005

Clocking

Website: http://vlsicad.ucsd.edu/courses/ece260b-w05

Slides courtesy of Prof. Andrew B. Kahng

Page 2: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Construction

Embedding

Page 3: ECE260B – CSE241A Winter 2005 Clocking

Why Clocks?

Clocks provide the means to synchronize
- By allowing events to happen at known timing boundaries, we can sequence these events

Greatly simplifies building of state machines

No need to worry about variable delay through combinational logic (CL)

All signals delayed until clock edge (clock imposes the worst case delay)

[Figure: registers separated by combinational logic, shown as a dataflow pipeline and as an FSM. Courtesy K. Yang, UCLA]

Page 4: ECE260B – CSE241A Winter 2005 Clocking

Clock Distribution Network

General goal of clock distribution
- Deliver clock to all memory elements with acceptable skew
- Deliver clock edges with acceptable sharpness

Clocking network design is one of the greatest challenges in the design of a large chip
- Consumes up to 1/3 of chip power
- Accurate signal delay
- Signal integrity
- Subject to uncertainty / variation of different processes / operating conditions

Page 5: ECE260B – CSE241A Winter 2005 Clocking

Clock Design Components

Oscillator

Dividers

Buffers
- Strong drivers
- Reduce delay
- Signal integrity / slew rate

Interconnects
- Balanced trees, meshes, etc.
- Shielding (e.g., for crosstalk reduction)
- Non-tree links / feedback loops

Page 6: ECE260B – CSE241A Winter 2005 Clocking

Clock Distribution Objective

Minimum / bounded skew
- Performance / hold time requirements

Guaranteed slew rate / signal integrity

Small insertion delay

Robustness under process / operating condition variation

Minimum cell / routing area

Minimum power consumption

Page 7: ECE260B – CSE241A Winter 2005 Clocking

Clock Distribution Robustness

Subject to:

Radically different loading (flip-flop density)
- Across the die
- ECO (Engineering Change Order)

Interconnect coupling
- Signal integrity
- Delay variation

Process variation
- From lot to lot
- Across the die
- Buffers
- Metal width

Supply voltage variation across the die
- Both static IR drop and dynamic voltage drop

Temperature

Page 8: ECE260B – CSE241A Winter 2005 Clocking

Issues in Clock Distribution Network Design

Skew
- Process, voltage, and temperature
- Data dependence
- Noise coupling
- Load balancing

Power, CV²f (consumes up to 1/3 of total chip power)
- Clock gating

Flexibility / tunability

Compactness: fit into existing layout / design

Facilitate ECO
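The CV²f term can be made concrete with a small calculation. The sketch below is only illustrative: the capacitance, supply voltage, frequency, and the fraction of load that is gated are hypothetical numbers, not values from the slides.

```python
def dynamic_power(c_farads, vdd_volts, freq_hz):
    """Switching power of a net that toggles every cycle: P = C * V^2 * f (watts)."""
    return c_farads * vdd_volts ** 2 * freq_hz

# Hypothetical numbers, chosen only for illustration.
clock_cap = 2e-9   # 2 nF of total clock wire and clock pin capacitance
vdd = 1.2          # supply voltage in volts
freq = 1e9         # 1 GHz clock

full = dynamic_power(clock_cap, vdd, freq)
# Gating the clock to 40% of the load removes that share of the switched capacitance.
gated = dynamic_power(clock_cap * 0.6, vdd, freq)
print(f"ungated: {full:.2f} W, with 40% of the load gated: {gated:.2f} W")
```

Because power is linear in switched capacitance, gating saves power in direct proportion to the capacitance that stops toggling, minus whatever the enabling logic itself costs.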

Page 9: ECE260B – CSE241A Winter 2005 Clocking

Skew: Clock Delay Varies With Position

Page 10: ECE260B – CSE241A Winter 2005 Clocking

Clock Skew Causes

Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths

Process variation – process spread across die yielding different Leff, Tox, etc. values

Temperature gradients – changes MOSFET performance across die

IR voltage drop in power supply – changes MOSFET performance across die

Note: Delay from clock generator to fan-out points (clock latency) is not important by itself

BUT: increased latency leads to larger skew for the same amount of relative variation

Sylvester / Shepard, 2001
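A quick way to see why latency matters: if two nominally balanced clock paths each vary by the same relative amount, the worst-case difference between them scales with the latency itself. The sketch below assumes a simple model (one path fast, the other slow, by the same percentage); the latencies and the 5% figure are hypothetical.

```python
def worst_case_skew(latency_ps, relative_variation):
    """Skew between two nominally equal paths when one is slow and one is fast."""
    slow = latency_ps * (1 + relative_variation)
    fast = latency_ps * (1 - relative_variation)
    return slow - fast

# Same 5% relative variation, three different insertion delays.
for latency in (500, 1000, 2000):
    print(latency, "ps latency ->", round(worst_case_skew(latency, 0.05), 1), "ps skew")
```

Under this model, doubling the insertion delay doubles the skew, which is one reason small insertion delay appears among the distribution objectives.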

Page 11: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Construction

Embedding

Page 12: ECE260B – CSE241A Winter 2005 Clocking

Clock Distribution Structures

RC-Tree
- Less capacitance
- More accuracy
- Flexible wiring

Grids
- Reliable
- Less data dependency
- Tunable (late in design)

Shown here for final stage drivers driving F/F loads

Page 13: ECE260B – CSE241A Winter 2005 Clocking

Grids

Gridded clock distribution common on earlier DEC Alpha microprocessors

Advantages:
- Skew determined by grid density, not too sensitive to load position
- Clock signals available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values

Disadvantages:
- Huge amount of wiring and power
- To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage

[Figure: pre-drivers feeding a global grid. Sylvester / Shepard, 2001]

Page 14: ECE260B – CSE241A Winter 2005 Clocking

H-Tree

H-tree (Bakoglu)
- One large central driver, recursive structure to match wirelengths
- Halve wire width at branching points to reduce reflections

Disadvantages
- Slew degradation along long RC paths
- Unrealistically large central driver
  - Clock drivers can create large temperature gradients (e.g., Alpha 21064, ~30 °C)
- Non-uniform load distribution
- Inherently non-scalable (wire R growth)
- Partial solution: intermediate buffers at branching points

courtesy of P. Zarkesh-Ha

Sylvester / Shepard, 2001
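The "recursive structure to match wirelengths" can be sketched in a few lines. This is a geometric illustration only: the function name, the starting half-length of 4 units, and the four levels are assumptions, and no electrical effects (wire width, buffers, loads) are modeled.

```python
def h_tree(cx, cy, half, levels, length=0.0, horizontal=True):
    """Return [(x, y, wirelength_from_root)] for the leaves of an H-tree.

    Each level branches in two directions from (cx, cy) by `half`,
    alternating horizontal and vertical; the branch length halves
    after every horizontal-plus-vertical pair of levels.
    """
    if levels == 0:
        return [(cx, cy, length)]
    leaves = []
    for sign in (-1, 1):
        nx = cx + sign * half if horizontal else cx
        ny = cy if horizontal else cy + sign * half
        next_half = half if horizontal else half / 2
        leaves += h_tree(nx, ny, next_half, levels - 1,
                         length + half, not horizontal)
    return leaves

sinks = h_tree(0.0, 0.0, 4.0, levels=4)
print(len(sinks), "leaves, wirelengths:", sorted({s[2] for s in sinks}))
```

Every leaf ends up at the same wirelength from the root, which is why the unbuffered H-tree is nominally zero-skew when all leaf loads are identical.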

Page 15: ECE260B – CSE241A Winter 2005 Clocking

Buffered H-tree

Advantages
- Ideally zero-skew
- Can be low power (depending on skew requirements)
- Low area (silicon and wiring)
- CAD tool friendly (regular)

Disadvantages
- Sensitive to process variations
  - Devices: want same size buffers at each level of tree
  - Wires: want similar segment lengths on each layer in each source-sink path
- Local clocking loads inherently non-uniform

Sylvester / Shepard, 2001

Page 16: ECE260B – CSE241A Winter 2005 Clocking

Tree Balancing

Some techniques:

a) Introduce dummy loads

b) Snaking of wirelength to match delays

Con: Routing area often more valuable than Silicon

Sylvester / Shepard, 2001

Page 17: ECE260B – CSE241A Winter 2005 Clocking

Examples From Processor Chips

H-Tree, Asymmetric RC-Tree (IBM)

Grids: DEC [Alphas]

Serpentines: Intel x86 [Young ISSCC97]

Page 18: ECE260B – CSE241A Winter 2005 Clocking

Example Skews From Processor Chips

DEC-Alpha 21064 clock spines

DEC-Alpha 21064 RC delays

DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)

DEC-Alpha 21164 RC local delays

Page 19: ECE260B – CSE241A Winter 2005 Clocking

ReShape Clocks Example (High-End ASIC)

Balanced, shielded H-tree for pre-clock distribution

Mesh for block level distribution

output mesh

All routes 5-6u M6/5, shielded with 1u grounds

~10 buffers per node, e.g., ganged BUFx20’s

Output mesh must hit every sub-block

Page 20: ECE260B – CSE241A Winter 2005 Clocking

Block Level Mesh (.18u)

Max 600u stride

1u m5 ribs every 20 - 30 u (4 to 6 rows)

Shielded input and output m6 shorting straps

Clumps of 1-6 clock buffers, surrounded by capacitor pads

Pre-clock connects to input shorting straps

Page 21: ECE260B – CSE241A Winter 2005 Clocking

Problems with Meshes

Burn more power at low frequencies

Blocks more routing resources (solution: integrated power distribution with ribs can provide shielding for ‘free’)

Difficult for ‘spare’ clock domains that will not tolerate regioning

Post placement (and routing) tuning required

No ‘beneficial skew’ possible

Clock gating only easy at root

Fighting tools to do analysis:
- Clumped buffers are a problem in Static Timing Analysis (STA) tools
- Large shorted meshes are a problem for STA tools
- What does Elmore delay calculation look like for a non-tree?
- Need full extraction and SPICE-like simulation to determine skew

Page 22: ECE260B – CSE241A Winter 2005 Clocking

Benefits of Meshes

Deterministic since shielded all the way down to rib distribution

No ECO placement required: all buffers preplaced before block placement

Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew

ECO placements of FFs later do not require rebalancing of tree

“Idealized” clocking environment for “concurrent dance” of RTL design and timing convergence

Page 23: ECE260B – CSE241A Winter 2005 Clocking

Hybrid Structure

Balanced tree on the top

Mesh in the middle
- Minimize skew

Steiner minimum tree at the bottom
- Minimize cost
- Facilitate ECO

Page 24: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Construction

Embedding

Page 25: ECE260B – CSE241A Winter 2005 Clocking

Process Variation

Intra-die and inter-die variations
- Intra-die variation is increasingly significant since 0.13um technology

Systematic and random variations
- Systematic variation is due to equipment, process, etc.
  - Global lens aberration in lithography causes systematic variation
  - Pattern-dependent optical proximity, chemical mechanical polish (CMP)
- Random variation is due to inherent variation

Spatial correlation across a chip

Fast vs. slow corners

Page 26: ECE260B – CSE241A Winter 2005 Clocking

Process Variation

Metal wires
- Width variation can be estimated by LUT(width, spacing)
- Thickness variation: CMP, local density
- Thickness variation also depends on wire width and spacing
- Could be up to 30-40% in 90nm process

Transistors
- Channel length variation (delay ~ L^1.5)
- Thin gate oxide tox variation
- Vth variation
- Up to 30% variation in terms of driving capability
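To get a feel for the channel-length sensitivity, the rule of thumb that delay grows as L to the power 1.5 can be evaluated directly. The sketch below takes that exponent from the slide and treats everything else (the function name, the sample variations) as illustrative.

```python
def delay_change(l_variation, exponent=1.5):
    """Relative delay change for a relative channel-length change, delay ~ L**exponent."""
    return (1 + l_variation) ** exponent - 1

for dl in (-0.10, 0.05, 0.10):
    print(f"L {dl:+.0%} -> delay {delay_change(dl):+.1%}")
```

Under this model a 10% longer channel costs roughly 15% in delay, so two clock buffers at opposite length extremes can differ noticeably even when they are drawn identically.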

Page 27: ECE260B – CSE241A Winter 2005 Clocking

Process Variations – SPICE model

Process variations are reflected into a statistical SPICE model

Usually only a few parameters have a statistical distribution (e.g., {L, W, TOX, VTn, VTp}) and the others are set to a nominal value

The nominal SPICE model is obtained by setting the statistical parameters to their nominal value

Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB

Page 28: ECE260B – CSE241A Winter 2005 Clocking

Global Variations (Inter-die)

[Figure: process variations mapped to performance variations; critical path delay of a 16-bit adder]

All devices have the same set of model parameter values

Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB

Page 29: ECE260B – CSE241A Winter 2005 Clocking

Local Variations (Intra-die)

Each device instance has a slightly different set of model parameter values (aka device mismatch)

The performance of some analog circuits strongly depends on the degree of matching of device properties

Digital circuits are in general more immune to mismatch, but clock distribution network is sensitive (clock skew)

Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB

Page 30: ECE260B – CSE241A Winter 2005 Clocking

Statistical Design

Need to account for process variations during design phase

Statistical design
- Nominal design
- Yield optimization
- Design centering

Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB

Page 31: ECE260B – CSE241A Winter 2005 Clocking

Statistical Design

Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB

Page 32: ECE260B – CSE241A Winter 2005 Clocking

Process Variation Tolerance Enhancement

Rule of thumb: balanced tree
- Identical buffers at identical heights
- Drive identical subtree loads

Can we do better than this?

Process variation tolerant clock design
- Bounded-skew DME
- Topology construction
  - With process variation tolerance in objective
- Useful skew scheduling
  - To the center of permissible ranges

Page 33: ECE260B – CSE241A Winter 2005 Clocking

Signal Integrity

Crosstalk
- Capacitive, inductive

Supply voltage drop
- IR, L dI/dt, LC resonance

Temperature
- Increased resistance with higher temperature

Substrate coupling
- Parasitic resistance, capacitance in the substrate layer

Page 34: ECE260B – CSE241A Winter 2005 Clocking

Crosstalk

Due to the coupling capacitance between interconnections, a signal switching on a net (aggressor) may affect the voltage waveform on a neighboring net (victim)

Noise Propagation

Increased Delay

Page 35: ECE260B – CSE241A Winter 2005 Clocking

Circuit Model for Crosstalk

Page 36: ECE260B – CSE241A Winter 2005 Clocking

Crosstalk Simulation

Page 37: ECE260B – CSE241A Winter 2005 Clocking

Design for Crosstalk

It can be both capacitive and inductive
- Capacitive is dominant at current switching speeds

To reduce it:
- Use of shielding layer (inter-layer)
- Use of shielding wire (intra-layer)

[Figure: shielding cross-section; labels: GND, VDD, GND, Substrate]

Page 38: ECE260B – CSE241A Winter 2005 Clocking

Clock Gating

Reduce power consumption by temporarily shutting down part of the circuit

Additional cost of enabling circuits

[Figure: gated-clock circuit; labels: CLK, CLK1, CLK2, enabling logic, two FFs, combinational logic]

Page 39: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Construction

Embedding

Page 40: ECE260B – CSE241A Winter 2005 Clocking

Skew = Local Constraint

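[Figure: two FFs connected by combinational paths; D: longest path, d: shortest path]

Permissible range of skew:

$-d + t_{hold} < \text{Skew} < T_{period} - D - t_{setup}$

- Skew below the lower bound: race condition
- Skew inside the range: safe
- Skew above the upper bound: cycle time violation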

Timing is correct as long as the clock signals of sequentially adjacent FFs arrive within a permissible skew range

W. Dai, UC Santa Cruz
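The permissible range can be written as a small helper. This is a sketch under stated assumptions: the function names and the numeric values are hypothetical, and only the two bounds (−d + thold and Tperiod − D − tsetup) come from the slide.

```python
def permissible_range(d, D, t_hold, t_setup, t_period):
    """Bounds on skew between two sequentially adjacent FFs.

    d, D : shortest and longest combinational path delay between them.
    """
    return (-d + t_hold, t_period - D - t_setup)

def classify(skew, lo, hi):
    """Name the failure mode, if any, for a given skew value."""
    if skew <= lo:
        return "race condition"
    if skew >= hi:
        return "cycle time violation"
    return "safe"

# Hypothetical delays in ns.
lo, hi = permissible_range(d=2.0, D=5.0, t_hold=0.2, t_setup=0.3, t_period=6.0)
for skew in (-2.0, 0.0, 1.0):
    print(f"skew {skew:+.1f} ns: {classify(skew, lo, hi)}")
```

Note that the lower bound does not depend on the clock period, so a race condition cannot be fixed by slowing the clock down.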

Page 41: ECE260B – CSE241A Winter 2005 Clocking

“Useful Skew” Design Robustness

[Figure: three FFs in series with path delays of 2 ns and 6 ns; T = 6 ns]

“0 0 0”: at verge of violation

“2 0 2”: more safety margin

Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge

W. Dai, UC Santa Cruz
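The two schedules can be compared numerically. The sketch below reads the example as three FFs in a chain with path delays of 2 ns and 6 ns and T = 6 ns, and it assumes zero setup and hold times and equal minimum and maximum delay on each path; those simplifications, and the function name, are assumptions made for illustration.

```python
def margins(x, paths, period, setup=0.0, hold=0.0):
    """Smallest setup slack and hold slack over all paths.

    x     : clock arrival time at each FF
    paths : (i, j, min_delay, max_delay) for a path from FF i to FF j
    """
    setup_slack = min(x[j] + period - (x[i] + setup + dmax)
                      for i, j, dmin, dmax in paths)
    hold_slack = min(x[i] + dmin - (x[j] + hold)
                     for i, j, dmin, dmax in paths)
    return setup_slack, hold_slack

paths = [(0, 1, 2.0, 2.0), (1, 2, 6.0, 6.0)]
zero_skew = margins([0.0, 0.0, 0.0], paths, period=6.0)
useful_skew = margins([2.0, 0.0, 2.0], paths, period=6.0)
print("0 0 0 -> (setup, hold) slack:", zero_skew)
print("2 0 2 -> (setup, hold) slack:", useful_skew)
```

With zero skew the 6 ns path has no setup slack at all, while the "2 0 2" schedule leaves at least 2 ns on every constraint.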

Page 42: ECE260B – CSE241A Winter 2005 Clocking

Constraints on Skews

FFi receives clock signal delayed by $x_i \ge \text{MIN\_DEL}$

$0 < \alpha \le 1 \le \beta$: if nominal clock delay is $x_i$, then actual clock delay $x$ must fall within the interval $\alpha x_i \le x \le \beta x_i$

For an FF to operate correctly when the clock edge arrives at time x, the correct input data must be present and stable during the time interval (x – SETUP, x + HOLD)

For $1 \le i, j \le L$ (L = number of FFs), we compute lower and upper bounds MIN(i,j) and MAX(i,j) for the time that is required for a signal edge to propagate from FFi to FFj

Avoid double-clocking (race condition): $x_i + \text{MIN}(i,j) \ge x_j + \text{HOLD}$

Avoid zero-clocking: $x_i + \text{SETUP} + \text{MAX}(i,j) \le x_j + P$, where P = clock period

Page 43: ECE260B – CSE241A Winter 2005 Clocking

Optimal Useful Skews by Linear Programming

LP_SPEED (clock period reduction):

minimize $P$ subject to

$x_i - x_j \ge \text{HOLD} - \text{MIN}(i,j)$

$x_j - x_i + P \ge \text{SETUP} + \text{MAX}(i,j)$

$x_i \ge \text{MIN\_DEL}$

LP_SAFETY (robustness):

maximize $M$ subject to

$x_i - x_j - M \ge \text{HOLD} - \text{MIN}(i,j)$

$x_j - x_i - M \ge \text{SETUP} + \text{MAX}(i,j) - P$

$x_i \ge \text{MIN\_DEL}$

Notes

- J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. Computers 39(7) (1990), pp. 945-951.

- T. G. Szymanski, “Computing Optimal Clock Schedules”, Proc. DAC, June 1992, pp. 399-404.

- Useful Skew optimization is similar to Retiming optimization

- Peak current reductions are a side benefit
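For small cases, LP_SPEED can be explored without an LP solver. The sketch below swaps in a different but equivalent technique: for a fixed P every constraint is a difference constraint between two clock delays, so feasibility can be checked with Bellman-Ford, and the minimum P found by binary search. It assumes MIN_DEL = 0 and ignores clock-delay tolerances; the function names and the example delays are hypothetical.

```python
def feasible(n, paths, period, setup, hold):
    """Return a clock schedule for this period, or None if none exists.

    A constraint x_v - x_u <= w is an edge u -> v of weight w; the system
    is satisfiable exactly when the graph has no negative cycle.
    """
    edges = []
    for i, j, dmin, dmax in paths:
        edges.append((i, j, dmin - hold))            # hold:  x_j - x_i <= MIN - HOLD
        edges.append((j, i, period - setup - dmax))  # setup: x_i - x_j <= P - SETUP - MAX
    dist = [0.0] * n  # as if a virtual source reached every node at cost 0
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            shift = min(dist)
            return [d - shift for d in dist]         # shift so every x_i >= 0
    return None                                      # still relaxing: negative cycle

def min_period(n, paths, setup=0.0, hold=0.0, tol=1e-6):
    """Binary search for the smallest feasible clock period."""
    lo, hi = 0.0, max(p[3] for p in paths) + setup
    if feasible(n, paths, hi, setup, hold) is None:
        raise ValueError("infeasible at the upper bound; widen the search range")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(n, paths, mid, setup, hold) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible(n, paths, hi, setup, hold)

# Hypothetical 3-FF chain: (from, to, MIN, MAX) in ns.
paths = [(0, 1, 1.0, 2.0), (1, 2, 3.0, 6.0)]
period, schedule = min_period(3, paths)
print(f"minimum period ~ {period:.3f} ns, schedule {schedule}")
```

In this example zero skew needs a 6 ns period (the longest path), while a skewed schedule gets close to 3 ns, the largest gap between MAX and MIN on any single path.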

Page 44: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Design

Embedding
- For zero skew (ZST-DME)
- For bounded skew (BST-DME)

Page 45: ECE260B – CSE241A Winter 2005 Clocking

Zero-Skew Tree (ZST) Problem

Zero Skew Clock Routing Problem (S,G): Given a set S of sink locations and a connection topology G, construct a ZST T(S) with topology G and having minimum cost.

Skew = maximum value of |td(s0,si) – td(s0,sj)| over all sink pairs si, sj in S.

td(s0, si) = signal delay from source s0 to sink si

Connection topology G = rooted binary tree with nodes of S as leaves
- Edge ea in G is the edge from a to its parent
- |ea| is the (assigned) length of edge ea

Cost = total edge length

Page 46: ECE260B – CSE241A Winter 2005 Clocking

Zero-Skew Example (555 sinks, 40 obstacles)

Page 47: ECE260B – CSE241A Winter 2005 Clocking

A Zero-Skew Routing Algorithm

Finds a ZST under linear delay model with minimum cost over all ZSTs with topology G and sink set S

Terms
- Manhattan arc: line segment with slope +1 or –1
- Tilted Rectangular Region (TRR): collection of points within a fixed distance of a Manhattan arc
  - Core = Manhattan arc
  - Radius = distance
- Merging segment = locus of feasible locations for a node v in the topology, consistent with minimum wirelength
  - If v is a sink, then ms(v) = {v}
  - If v is an internal node, then ms(v) is the set of all points within distance |ea| of ms(a), and within distance |eb| of ms(b)
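One common way to work with TRRs is to rotate coordinates by 45 degrees: with u = x + y and v = x − y, Manhattan distance becomes the larger of |Δu| and |Δv|, so a TRR turns into an axis-aligned rectangle and intersecting two of them is a few comparisons. The sketch below uses that trick; the function names and the two sink positions are illustrative assumptions.

```python
def to_uv(x, y):
    """Rotate by 45 degrees (scaled): Manhattan distance becomes max(|du|, |dv|)."""
    return x + y, x - y

def to_xy(u, v):
    return (u + v) / 2, (u - v) / 2

def trr(core_p, core_q, radius):
    """TRR whose core is the Manhattan arc from core_p to core_q, as (u_lo, u_hi, v_lo, v_hi)."""
    (u1, v1), (u2, v2) = to_uv(*core_p), to_uv(*core_q)
    return (min(u1, u2) - radius, max(u1, u2) + radius,
            min(v1, v2) - radius, max(v1, v2) + radius)

def intersect(a, b):
    """Intersection of two TRRs in (u, v) form, or None if they do not meet."""
    u_lo, u_hi = max(a[0], b[0]), min(a[1], b[1])
    v_lo, v_hi = max(a[2], b[2]), min(a[3], b[3])
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)

# Two sinks 6 units apart (Manhattan); with equal sink delays each edge gets length 3.
sink_a, sink_b = (0, 0), (4, 2)
ms = intersect(trr(sink_a, sink_a, 3), trr(sink_b, sink_b, 3))
end1, end2 = to_xy(ms[0], ms[2]), to_xy(ms[1], ms[3])
print("merging segment from", end1, "to", end2)
```

Here the intersection degenerates to a segment of slope −1, every point of which is exactly 3 units from both sinks, so it is a valid merging segment for their parent.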

Page 48: ECE260B – CSE241A Winter 2005 Clocking

Phase 1: Tree of Merging Segments

Goal: Construct a tree of merging segments corresponding to topology G
- Merging segment of a node depends on merging segments of its children, hence bottom-up construction
- Let a, b be children of v. We want placements of v that allow TSa and TSb to be merged with minimum added wire while preserving zero skew
- Merging cost = |ea| + |eb|

Fact: The intersection of two TRRs is also a TRR and can be found in constant time

Constant time per new merging segment, hence linear time (in size of S) to construct the entire tree
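Under the linear delay model (delay equals path length), the edge lengths for one merge follow from two equations: the two edges should span the distance between the child merging segments, and the delays on both sides must match. The sketch below works that out, including the case where one subtree is so much slower that the other edge has to be lengthened beyond the distance; the function name and the sample values are assumptions.

```python
def merge_lengths(delay_a, delay_b, dist):
    """Edge lengths (|ea|, |eb|) that give zero skew at the merge point.

    delay_a, delay_b : delay from ms(a), ms(b) down to their sinks
    dist             : Manhattan distance between ms(a) and ms(b)
    """
    ea = (dist + delay_b - delay_a) / 2
    if ea < 0:          # subtree a is too slow: merge on ms(a), lengthen eb
        return 0.0, float(delay_a - delay_b)
    if ea > dist:       # subtree b is too slow: merge on ms(b), lengthen ea
        return float(delay_b - delay_a), 0.0
    return ea, dist - ea

print(merge_lengths(0, 0, 6))    # balanced children
print(merge_lengths(3, 1, 6))    # a is slower, so it gets the shorter edge
print(merge_lengths(10, 1, 6))   # a is much slower, eb exceeds the distance
```

In the first two cases the merging cost |ea| + |eb| equals the distance between the children; in the third it exceeds the distance, which corresponds to the wire snaking mentioned under tree balancing.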

Page 49: ECE260B – CSE241A Winter 2005 Clocking

Phase 2: Find Node Placements

Goal: Find exact locations (“embeddings”) pl(v) of internal nodes v in the ZST topology

If v is the root node, then any point on ms(v) can be chosen as pl(v)

If v is an internal node other than the root, and p is the parent of v, then v can be embedded at any point in ms(v) that is at distance |ev| or less from pl(p)
- Detail: create square TRR trrp with radius |ev| and core equal to pl(p); placement of v can be any point in ms(v) ∩ trrp

Each instruction is executed at most once for each node in G, and TRR intersection is O(1) time, so Find_Exact_Placements is O(n) and DME is O(n)

Page 50: ECE260B – CSE241A Winter 2005 Clocking

Outline

Problem Statement

Clock Distribution Structures

Robustness / Signal Integrity Control

Clock Design:

Skew Scheduling

Topology Design

Embedding
- For zero skew (ZST-DME)
- For bounded skew (BST-DME)

Page 51: ECE260B – CSE241A Winter 2005 Clocking

Non-Zero Skew Bounds

[Figure: a topology with source s0, internal nodes v, a, b and sinks s1 to s4, embedded on a grid for different skew values]

Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?

Page 52: ECE260B – CSE241A Winter 2005 Clocking

BST-DME Bottom-Up Phase

[Figure: topology (source s0, internal nodes v, a, b, sinks s1 to s4) and the merging regions mr(a), mr(b), mr(v) for skew bound B = 4]

Bottom-Up: build tree of merging regions corresponding to given topology

Page 53: ECE260B – CSE241A Winter 2005 Clocking

BST-DME Top-Down Phase

[Figure: the same topology with internal nodes a, b, v embedded at chosen locations, skew bound B = 4]

Page 54: ECE260B – CSE241A Winter 2005 Clocking

Good Luck for the Mid-Term!