Keynote SSS‘08
Distributed Algorithms and VLSI
Ulrich Schmid, Vienna University of Technology
Keynote SSS'08 U. Schmid 2
Content
Short introduction to Very Large Scale Integration (VLSI): A photo gallery …
– Great perspectives
– But …
VLSI Circuits ↔ Distributed Algorithms
– DAs and VLSI: Do's and Don'ts
Do's – an Example: DARTS Fault-tolerant Clocks
– Starting point: A simple distributed algorithm
– How to implement it in VLSI?
– Proofs
– [Under the rug: Metastability …]
Short introduction to VLSI: A photo gallery …
VLSI Circuits
Major Ingredients
Transistors (nMOS): [Figure: cross-section of an nMOS transistor – polysilicon gate over an SiO2 insulator, n-type source and drain diffusions in a p substrate, channel of length L and width W]
Interconnect (wires): form & connect gates (e.g., an inverter)
Miniaturization: Moore's Law
Intel 4004 (1971): 2,250 transistors; 12 mm² / 10 µm; 0.74 MHz, 1 W
Intel P4 (2001): 42,000,000 transistors; 217 mm² / 0.180 µm = 180 nm; 2 GHz, 50 W
Multicore Processors
IBM POWER4 (dual-core)
IBM Cell (8-core)
Tilera TILE64
Today: < 45 nm
Systems-on-Chip (SoC)
Assemble whole SoC from suitable components
Market for "IP cores", from different vendors
Sync/async interfaces
Nvidia Tegra
Great perspectives for VLSI circuits.
But …
Manufacturing Limitations
VLSI Lab, Politecnico di Torino
Optical Proximity Correction, Intel Corp.
Defects (Electromigration)
P. Gutman, IBM T.J.Watson Research Center
M. Ohring, Reliability and Failure of Electronic Materials and Devices, 1998; ASM Corp. Shanghai
Whiskers, Hillock, Void
Defects (Gate Oxide BD)
K.-L. Pey, C.-H. Tung, Physical characterization of breakdown in metal-oxide-semiconductor transistors
Breakdown−induced thermochemical reactions in (a) poly−Si gate and (b) p−Si substrate of n−channel MOSFETs.
Semitracks, Inc.
ESD-induced gate oxide breakdown (www.siliconfareast.com)
Power Dissipation Problems
A. Choudhary, UMass; small transistor dissipating 5 mW in an SOI wafer (University of Bolton)
→ Reduce supply voltage!
Radiation-induced Soft Errors
SLAC National Accelerator Lab, Stanford
[Figure: SET and SEU mechanisms; relative soft error rate vs. altitude, 0–10 km, spanning three orders of magnitude (1 down to 10⁻³); Powell, 1959]
Soft error rates dominate in VLSI!
Slow Signal Propagation
Transistors switch faster, BUT:
– Wires get thinner
– Less transistor driving strength
– RC signal propagation along wires dominates circuit speed
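Why wires dominate can be made concrete with the classic Elmore delay model. The sketch below is an illustration under assumed unit parameters (not from the talk): a wire is modeled as a ladder of N identical RC segments, and its total delay grows roughly with the square of its length.

```python
# Illustrative sketch: Elmore delay of a wire modeled as a chain of N
# identical RC segments. Delay grows roughly quadratically with wire
# length, which is why thin (high-resistance) wires, not transistors,
# dominate deep-submicron circuit speed.

def elmore_delay(r_seg, c_seg, n_segments):
    """Elmore delay of an RC ladder: sum over nodes of
    (total upstream resistance) * (node capacitance)."""
    return sum(r_seg * i * c_seg for i in range(1, n_segments + 1))

# Doubling the wire length (number of segments) roughly quadruples the delay:
d1 = elmore_delay(r_seg=1.0, c_seg=1.0, n_segments=100)
d2 = elmore_delay(r_seg=1.0, c_seg=1.0, n_segments=200)
print(d2 / d1)  # ≈ 3.98
```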
Clock Distribution Problem
Circuit & physical design of the POWER4 microprocessor, IBM J. Res. Dev.
Cell processor
Synchronous design paradigm: a common clock signal CLK, distributed with propagation delay tPD,CLK, drives flip-flops FF1, FF2, …, FFk, …, FFm; the combinational logic (gates) between the flip-flops must settle within the data path delays tdly,DATA,1m, tdly,DATA,2m, …, tdly,DATA,km.
→ Synchronous abstraction increasingly difficult to maintain!
Hence, deep submicron VLSI circuits …
… are in fact FT Distributed Systems
Spatial distribution
Message-passing communication
Massive concurrency
Asynchrony
Failures
Security issues (IP cores!)
Worthwhile undertaking: explore the applicability of DA results & approaches to VLSI circuits …
Applying DA Research in VLSI ?
2008 Dagstuhl-Seminar Distributed Algorithms in VLSI Chips (B. Charron-Bost, J. Ebergen, S. Dolev, U. Schmid, http://www.dagstuhl.de/08371)
[Great place for such undertakings …]
DA and VLSI – Don'ts
Apply standard DAs in the VLSI context – too heavy-weight in terms of computation & communication
Apply standard replication-based FT (for coping with "classic" VLSI faults) – too heavy-weight in terms of power & area penalties
BUT …
DA and VLSI – Do's (I)
Apply "light-weight" DAs for decentralized handling of [nowadays centralized] functions, e.g. in large multicores
– Memory access scheduling (Moscibroda & Mutlu, PODC'08)
– Apply self-stabilizing algorithms for handling transient failures (S. Dolev & Haviv, IEEE ToC, 2006)
– Fault-tolerant clock generation in SoCs (Függer, Schmid, Fuchs, Kempf, EDCC'06)
Apply replication-based FT to cope with malicious failures in VLSI
– IP core security threats in SoCs
– Inconsistently propagated errors in high-dependability applications
Tilera TILE64
DA and VLSI – Do's (II)
Apply VLSI results & approaches in DA research
– Error-correcting codes and asynchronous consensus (Friedman, Mostefaoui, Rajsbaum & Raynal, IEEE ToC, 2007)
– Corruption-resilient codes (S. Dolev & Tzachar, DISC'08)
Extend DA approaches, to contribute to a (still lacking!) "Theory of Dependable VLSI Circuits"
– Early example: Arbiter problem (Lamport, ~1980)
– Handle massive concurrency (continuously computing gates!)
– Handle computation and communication resource restrictions
– Handle "non-closed" specifications
– Define suitable failure models
Do’s – an Example: DARTS Fault-tolerant Clocks
DARTS – Distributed Algorithms for Robust Tick Synchronization
Joint work with A. Steininger, M. Függer, G. Fuchs [and many others]
http://ti.tuwien.ac.at/ecs/research/projects/darts
Clocking in SoCs (I)
Classic synchronous paradigm
Concept: common notion of time for entire chip
Method: single quartz oscillator; global, phase-accurate clock tree
Disadvantages:
- Cumbersome clock tree design
- High power consumption
- Clock is single point of failure!
[Figure: SoC with DSP, WLAN, Video, GPRS, GPS blocks under one clock tree]
Clocking in SoCs (II)
Alternative: DARTS clocks
Concept: multiple synchronized tick generators
Method: distributed FT tick generation algorithms (TG algs), interacting via a dedicated clock network (TG net)
Advantages:
- No quartz oscillator(s)
- No critical clock tree
- Clock is no single point of failure!
- Reasonable synchrony
[Figure: the same SoC blocks (DSP, WLAN, Video, GPRS, GPS), each with its own tick generator]
Reasonable Synchrony ?
Phase synchronization
Clock synchronization
- max precision, - min/max frequency
Tick synchronization
Starting point: A Distributed Algorithm
On booting do:
  send tick(0) to all; C := 0; /* C is last tick number sent */
Continuously do:
  If received tick(C) from all n processes:
    send tick(C+1) to all; C := C+1;
Failure-free case (f = 0): simple barrier synchronization.
Replacing "all n" by "n − f different processes" gives the (modified) Srikanth & Toueg algorithm. Failure case f > 0?
A Distributed Algorithm (I)
On booting do:
  send tick(0) to all; C := 0; /* C is last tick number sent */
Continuously do:
  If received tick(X) from f + 1 different processes and X > C:
    send tick(C+1), …, tick(X) to all [once]; C := X;
  If received tick(C) from n − f different processes:
    send tick(C+1) to all [once]; C := C+1;
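The two rules above can be exercised in a minimal simulation. This sketch makes simplifying assumptions not in the slides: lock-step delivery rounds, a single silent faulty process, and cumulative ticks (having sent tick(X) means all lower ticks were sent), so "received tick(X) from f+1 processes" becomes "f+1 processes have reached X".

```python
# Minimal sketch of the tick generation rules above, n = 4, f = 1,
# with one faulty process that stays silent. Assumes lock-step
# message delivery (one round per step) for simplicity.

n, f = 4, 1
C = [0] * n            # last tick number sent by each process
byzantine = {3}        # process 3 sends nothing

def step():
    """One delivery round: everyone sees the ticks sent so far."""
    seen = [C[q] for q in range(n) if q not in byzantine]
    for p in range(n):
        if p in byzantine:
            continue
        # f+1 rule: f+1 processes already reached some X > C[p] -> catch up.
        X = sorted(seen, reverse=True)[f]   # the (f+1)-largest tick seen
        if X > C[p]:
            C[p] = X
        # n-f rule: n-f processes reached C[p] -> advance by one.
        if sum(1 for x in seen if x >= C[p]) >= n - f:
            C[p] += 1

for _ in range(10):
    step()

# Correct processes keep ticking despite the faulty one, and stay
# within one tick of each other:
correct = [C[p] for p in range(n) if p not in byzantine]
print(correct, max(correct) - min(correct) <= 1)
```

With fully synchronous delivery the correct processes advance in lock step; the f+1 rule only fires when delays let some processes run ahead.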
A Distributed Algorithm (III)
For n ≥ 3f + 1 and up to f Byzantine failures, with end-to-end delays ∈ [d, d+ε]:
Suppose process p sends tick(C+1) at time t. Then process q also sends tick(C+1) by time t + d + 2ε.
⇒ Clock ticks occur approximately synchronously
[Figure: timing argument – p sent at t because n − f ≥ 2f + 1 processes, hence at least f + 1 correct ones, sent tick(C); any q' sends by t + ε (≤ ε), and any q sends tick(C+1) by t + d + 2ε (≤ d + ε later)]
How to implement this DA in VLSI ?
Mind: We don’t have any clock available for a synchronous implementation …
Asynchronous Basic Circuits
[Figure: gate symbols with inputs a, b and output y; the Muller C-gate's internal feedback loop (delay tloop) and output stage (delay tprop)]
Muller C-gate truth table:
  a b | y
  0 0 | 0
  0 1 | yold
  1 0 | yold
  1 1 | 1
AND, OR, …; Muller C-Gate:
- Continuously computes y = y(a,b) [with delay tprop]
- AND gate for signal transitions (⇒ barrier synchronization)
- Note: Inevitably involves feedback loop [tloop]
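The C-gate's behavior can be sketched in a few lines; the class below is an illustrative functional model (delays tprop and tloop are abstracted away): the output follows the inputs only when they agree and otherwise holds its previous value, which is exactly what makes it an AND for signal transitions.

```python
# Functional sketch of the Muller C-element truth table above:
# output follows the inputs when they agree, otherwise holds y_old
# (the feedback loop of the slide). Delays are abstracted away.

class MullerC:
    def __init__(self, y=0):
        self.y = y          # stored output (the feedback loop)

    def eval(self, a, b):
        if a == b:          # inputs agree -> output follows them
            self.y = a
        return self.y       # otherwise keep y_old

c = MullerC()
outs = [c.eval(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(outs)  # [0, 0, 1, 1, 0]
```

The output only rises once both inputs have risen, and only falls once both have fallen: barrier synchronization on transitions.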
Asynchronous Communication
Convey alternating up/down signal transitions only ⇒ FIFO "zero-bit message" channels [with delay]
k-bit data transmission costly: additional circuitry + performance penalty (serial data transmission) or additional wires (parallel data transmission)
[Figure: sender and receiver connected by k-bit signal wires]
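The "zero-bit message" idea can be sketched as follows; this is an illustrative model (class and method names are hypothetical): each message is just one up- or down-transition on a single wire, propagated FIFO, and the receiver simply counts edges.

```python
# Sketch of a zero-bit message channel: only alternating up/down
# transitions travel on the wire (FIFO, with delay); the receiver
# counts arriving edges instead of decoding k-bit tick numbers.

class TransitionChannel:
    def __init__(self):
        self.level = 0       # current sender-side wire level
        self.fifo = []       # in-flight transitions (FIFO delay)
        self.received = 0    # ticks counted at the receiver

    def send_tick(self):
        self.level ^= 1      # one tick = one transition, zero data bits
        self.fifo.append(self.level)

    def deliver_one(self):
        self.fifo.pop(0)     # a transition arrives ...
        self.received += 1   # ... and the receiver counts it

ch = TransitionChannel()
for _ in range(3):
    ch.send_tick()
ch.deliver_one(); ch.deliver_one()
print(ch.received, len(ch.fifo))  # 2 1
```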
Major Challenges
If received tick(X) from f + 1 processes and X > C:
  send tick(C+1), …, tick(X) to all [once]; C := X
If received tick(C) from n − f processes:
  send tick(C+1) to all [once]; C := C+1
Challenges and how they are met:
– k-bit message, k unbounded → to be replaced by zero-bit messages; k kept at receiver
– Atomicity of actions → to be ensured by architecture + path delay constraints
– Threshold comparison → build suitable threshold circuits
k-bit → Zero-bit Messages
[Figure: Counter Module – Remote Pipe and Local Pipe of Muller C-elements fed by Rremote,in and LocalClk, Diff-Gate, and Pipe Compare Signal Generation producing GEQe, GRe, GEQo, GRo; six TG algs 1…6 connected by the TG net]
TG net feeds every clock signal to every TG alg (bus of width n)
At every TG alg, n − 1 Counter Modules [one per remote TG alg] maintain tick numbers
Anonymous ticks ⇒ rules only distinguish
– r_rem > r_loc (f + 1, GR rule)
– r_rem ≥ r_loc (n − f, GEQ rule)
Asynchronous up/down-counter
Move tick number maintenance from sender to receiver
Asynchron. Up/Down Counter
[Figure: the Counter Module again – Remote Pipe, Local Pipe, Diff-Gate, and Pipe Compare Signal Generation with status outputs GEQe, GRe, GEQo, GRo]
Ingredients:
– Two elastic pipelines (= FIFO buffers for signal transitions) count remote and local clock ticks
– Common transitions removed by Diff-Gate
– GR and GEQ status signals derived from last stages
Metastability-free by construction [well, almost …]
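Functionally, a Counter Module tracks the difference between remote and local tick counts. The sketch below is an illustrative abstraction (class and signal names are hypothetical): an integer stands in for the two elastic pipelines, and the Diff-Gate's removal of common transitions is what keeps this difference, and hence the pipelines, bounded.

```python
# Sketch of one Counter Module: an integer tracks
# (remote ticks seen) - (local ticks seen) after Diff-Gate
# cancellation; GR and GEQ are the status signals consumed
# by the f+1 and n-f rules, respectively.

class CounterModule:
    def __init__(self):
        self.diff = 0            # r_rem - r_loc

    def remote_tick(self):       # transition on the remote clock input
        self.diff += 1

    def local_tick(self):        # transition on the local clock input
        self.diff -= 1

    @property
    def GR(self):                # r_rem >  r_loc  (feeds the f+1 rule)
        return self.diff > 0

    @property
    def GEQ(self):               # r_rem >= r_loc  (feeds the n-f rule)
        return self.diff >= 0

cm = CounterModule()
cm.remote_tick(); cm.remote_tick(); cm.local_tick()
print(cm.GR, cm.GEQ, cm.diff)  # True True 1
```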
Atomicity of Actions
The gates making up the f + 1 and the n − f rule compute continuously and concurrently, hence
– may both produce tick(k), for the same k
– this must be circumvented by all means ["once"]
How to ensure this atomicity?
– Use separate circuitry for generating up-transitions (odd k) and down-transitions (even k) → tick(k−1) and tick(k) never mixed up
– Ensure that the ratio of the maximum and minimum delay along certain paths is bounded (cp. Θ-Model [WLS05], ABC Model [RS08]) → tick(k−2) and tick(k) never mixed up
Threshold Modules
[Figure: GR and GEQ outputs of the n − 1 Counter Modules feeding the threshold gates]
GR and GEQ status signals of the n − 1 Counter Modules fed into f + 1 and n − f threshold gates
Back-transition from status signals to transition signalling for generating tick(k)
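The threshold logic can be sketched functionally as follows. This is an illustrative model, not the circuit: `threshold_fire` is a hypothetical name, and the assumption that the local clock itself counts toward the n−f rule (so n−f−1 remote GEQ signals suffice) is mine.

```python
# Sketch of the Threshold Module: combine the GR / GEQ status signals
# of the n-1 Counter Modules and fire a new local tick when either
# rule of the algorithm holds.

def threshold_fire(gr_signals, geq_signals, n, f):
    """True iff the f+1 (GR) or n-f (GEQ) threshold is met.
    Each list holds n-1 booleans, one per remote Counter Module;
    the local clock is assumed to count toward the n-f rule."""
    if sum(gr_signals) >= f + 1:
        return True                  # f+1 rule: catch up
    if sum(geq_signals) >= n - f - 1:
        return True                  # n-f rule: advance by one
    return False

# n = 4, f = 1: two remote modules being ahead already force a tick.
print(threshold_fire([True, True, False], [True, True, False], n=4, f=1))
```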
Proofs
Proofs & Implementations (SW)
[Diagram: specification ←proof← model (alg + sys) ←abstraction← implementation (SW)]
– Specification: tick sync, n TG algs, f Byz. → max precision, min/max frequency (tick-synced FT clocks)
– Model: distributed state machine, Byzantine failures (the tick generation algorithm above)
– Implementation (SW): executable machine code, real system (e.g., TTP implementation)
Proof goals:
– Prove that the model meets the specification
– Minimize "proof gap" between model and implementation
Proofs & Implementations (HW)
[Diagram: the same layering for hardware – specification ←proof← model (alg + sys) ←abstraction← implementation, now partitioned into SW and HW; partitioning & constraints and HW capabilities link model and implementation]
Hierarchical Proof
Specification of low-level building blocks:
– Up/down ticks correctly simulate tick(k)
– Synchronization properties
– Bounded precision & frequency
– Bounded space (pipeline)
[Proof structure: tick-up/down –interlocking proof→ tick(k), tick(k+1), … –(P), (U), (S)→ Precision & Frequency, Bounded space]
Interlocking Proof – "[once]"
On booting:
  send tick(0) to all; C := 0;
If got tick(X) from f + 1 procs and X > C:
  send tick(C+1), …, tick(X) to all [once]; C := X;
If got tick(C) from n − f processes:
  send tick(C+1) to all [once]; C := C+1;
[Figure: tick-up/down traces around ticks k−2, k, k+1, showing how the interlocking proof rules out a second, spurious tick(k)]
Higher-Level Properties
(P) Progress. If all correct nodes send tick(k) by time t, then every correct node sends at least tick(k+1) by t + T+.
(U) Unforgeability. If no correct node sends tick(k) by time t, then no correct node sends tick(k+1) by t + T-first.
(S) Simultaneity. If some correct node sends tick(k) by time t, then every correct process sends at least tick(k) by t + T-first.
and, on top of those:
– Precision & Frequency
– Bounded pipeline size
Prove elementary synchronization properties
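As a flavor of what such properties mean operationally, the sketch below checks a simplified, time-bound-free version of Unforgeability on a recorded trace of send events (the function name and trace format are hypothetical, and the T-first bound is deliberately ignored): no tick(k+1) may ever be sent before some correct node sent tick(k).

```python
# Sketch: checking a simplified Unforgeability (U) on a trace of
# (time, node, k) send events of correct nodes -- tick(k+1) must not
# appear before tick(k) has been sent by someone.

def unforgeable(trace):
    first = {}                       # earliest send time of each tick k
    for t, node, k in trace:
        first[k] = min(first.get(k, t), t)
    return all(k - 1 in first and first[k - 1] <= first[k]
               for k in first if k > 0)

good = [(0.0, "p", 0), (0.1, "q", 0), (1.0, "p", 1)]
bad = [(0.0, "p", 0), (0.5, "q", 2)]   # tick(2) without any tick(1)
print(unforgeable(good), unforgeable(bad))  # True False
```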
Complete Suite of Proofs
[EDCC’06]
[Figure: complete implementation of node p – pipelines 1 … 3f+1, each with a Remote/External Pipe, a Local Pipe, a Diff-Gate, and Pipe Compare Signal Generators built from Muller C-elements; their GEQe, GRe, GEQo, GRo status signals feed threshold logic (gates with thresholds 2f+1 and f+1) that generates clk_out, which is fed back via clk_in and req_ext/req_int, ack_ext/ack_int handshake signals]
Complete Implementation
Implementation of the model only needs to
– implement the low-level building blocks as specified
– ensure the additional delay ratio bounds for the interlocking proof (place & route constraints)
[DFT'06]
DARTS - Lessons Learned
Fault-tolerant distributed algorithms are indeed applicable in the VLSI context, but need "down-sizing"
Distributed computing models with bounded delay ratio (Θ-Model, ABC model) are well-suited for the VLSI context (technology migration, re-use of models, etc.)
A sole transition-logic approach is not sufficient for fault-tolerance ⇒ need a model that integrates event and state representation
Time-free models suffer from a large "proof gap" ⇒ need a model incorporating (continuous) time
Failures raise new metastability concerns ⇒ MS needs further investigation
Under the rug: Metastability …
[Stolen from Dagstuhl presentation of A. Steininger …]
Metastability
[Figure: transfer characteristics of two cross-coupled inverters Inv 1 and Inv 2, plotted as ui,2 = uo,1 against ui,1 = uo,2; the curves intersect in two stable points – stable (HI), stable (LO) – and one metastable point]
Bistable element (memory cell) with positive feedback
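The metastable point can be illustrated numerically. The sketch below uses an assumed tanh-shaped inverter characteristic (a modeling choice, not from the talk) and iterates the cross-coupled pair as the map u → inv(inv(u)): starting exactly on the metastable point, the state never resolves; starting slightly off, it diverges exponentially toward a stable rail.

```python
# Illustrative sketch (assumed inverter model): two cross-coupled
# inverters iterated as u -> inv(inv(u)). The midpoint is an unstable
# fixed point -- exactly on it the state stays metastable forever,
# slightly off it resolves to a stable rail.
import math

def inv(u, gain=4.0):
    """Smooth inverter transfer characteristic, output in [0, 1]."""
    return 0.5 * (1.0 - math.tanh(gain * (u - 0.5)))

def settle(u, steps=40):
    for _ in range(steps):
        u = inv(inv(u))
    return u

print(settle(0.5))    # 0.5 -- stuck at the metastable point
print(settle(0.501))  # well above 0.9 -- resolved toward the HI rail
```

No finite number of iterations helps for inputs arbitrarily close to 0.5, which is the circuit-level core of the arbiter problem.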
Revisit Muller C-Element
[Figure: Muller C-gate with inputs a, b and output y; waveforms of a, x, y under (i) a pure delay at gate and interconnect – normal operation – and (ii) a limited output slope – oscillation and creeping]
Error Containment
[Figure: three nodes p, q, r, each with a TG and a ThM, exchanging counter signals count pq, count pr, count qp, count qr, count rp, count rq across an error-containment wall]
According to our proofs the wall holds – but we ignored metastability!
The Counter Module
[Figure: the Counter Module within the three-node picture – Remote/Local Pipe of Muller C-elements, Diff-Gate, Pipe Compare Signal Generation]
Purely combinational logic: won't hurt – BUT won't help
Muller C-Element: a metastable input may pass through!
The Threshold Module
[Figure: the Threshold Module within the three-node picture]
Threshold Module: purely combinational logic ⇒ will not create a metastability problem
BUT: will propagate metastability while being near the threshold
NO masking, NO protection
Metastability Containment ?
[Figure: nodes p, q, r once more – where can metastability be contained?]
The End … © 2007, WDR
Some References
[Bau05] R. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability, 5(3):305--316, Sept. 2005.
[BJ83] J. C. Barros and B. W. Johnson. Equivalence of the arbiter, the synchronizer, the latch, and the inertial delay. IEEE Trans. Comput., 32(7):603--614, 1983.
[BZMLCLD02] R. Bhamidipati, A. Zaidi, S. Makineni, K. Low, R. Chen, K.-Y. Liu, and J. Dalgrehn. Challenges and methodologies for implementing high-performance network processors. Intel Technology Journal, 6(3):83--92, Aug. 2002.
[BY07] A. Bink and R. York. ARM996HS, the first licensable, clockless 32-bit processor core. IEEE Micro, 25(2):58--68, Feb. 2007.
[Bor05] S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro, 25(6):10--16, Nov. 2005.
[Cha84] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, Oct. 1984.
[Con03] C. Constantinescu. Trends and challenges in VLSI circuit reliability. IEEE Micro, 23(4):14--19, July 2003.
[DH06a] S. Dolev and Y. Haviv. Self-stabilizing microprocessors, analyzing and overcoming soft-errors. IEEE Transactions on Computers, 55(4):385--399, Apr. 2006.
[Dol00] S. Dolev. Self-Stabilization. MIT Press, 2000.
[DR98] C. Dyer and D. Rodgers. Effects on spacecraft & aircraft electronics. In Proceedings ESA Workshop on Space Weather, ESA WPP-155, pages 17--27, Noordwijk, The Netherlands, Nov. 1998. ESA.
[DT08] S. Dolev and N. Tzachar. Brief announcement: Corruption resilient fountain codes. In DISC, pages 502--503, 2008.
[FFSK06:DFT] M. Ferringer, G. Fuchs, A. Steininger, and G. Kempf. VLSI implementation of a fault-tolerant distributed clock generation. In IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT 2006), pages 563--571, Oct. 2006.
[FMRR07] R. Friedman, A. Mostefaoui, S. Rajsbaum, and M. Raynal. Asynchronous agreement and its relation with error-correcting codes. IEEE Trans. Comput., 56(7):865--875, 2007.
[Fri01] E. G. Friedman. Clock distribution networks in synchronous digital integrated circuits. Proceedings of the IEEE, 89(5):665--692, May 2001.
[FSFK06] M. Fuegger, U. Schmid, G. Fuchs, and G. Kempf. Fault-tolerant distributed clock generation in VLSI systems-on-chip. In Proceedings of the Sixth European Dependable Computing Conference (EDCC-6), pages 87--96. IEEE Computer Society Press, Oct. 2006.
[ITRS05] International technology roadmap for semiconductors, 2005.
[KHP04] T. Karnik, P. Hazucha, and J. Patel. Characterization of soft errors caused by single-event upsets in CMOS processes. IEEE Transactions on Dependable and Secure Computing, 1(2):128--143, April--June 2004.
[KK98] I. Koren and Z. Koren. Defect tolerance in VLSI circuits: Techniques and yield analysis. Proceedings of the IEEE, 86(9):1819--1838, Sept. 1998.
[Lam84] L. Lamport. Buridan's principle. SRI Technical Report, 1984.
[Lam03] L. Lamport. Arbitration-free synchronization. Distributed Computing, 16(2/3):219--237, Sept. 2003.
[LP76] L. Lamport and R. Palais. On the glitch phenomenon. SRI Technical Report, 1976.
[LS03] G. Le Lann and U. Schmid. How to implement a timer-free perfect failure detector in partially synchronous systems. Technical Report 183/1-127, Department of Automation, Technische Universität Wien, Jan. 2003.
[Mar81] L. Marino. General theory of metastable operation. IEEE Transactions on Computers, C-30(2):107--115, Feb. 1981.
[MA01] M. S. Maza and M. L. Aranda. Analysis of clock distribution networks in the presence of crosstalk and ground bounce. In Proceedings International IEEE Conference on Electronics, Circuits, and Systems (ICECS), pages 773--776, 2001.
[Nic05] M. Nicolaidis. Design for soft error mitigation. IEEE Transactions on Device and Materials Reliability, 5(3):405--418, Sept. 2005.
[Nor96] E. Normand. Single-event effects in avionics. IEEE Transactions on Nuclear Science, 43(2):461--474, Apr. 1996.
[PB93] M. Peercy and P. Banerjee. Fault tolerant VLSI systems. Proceedings of the IEEE, 81(5):745--758, May 1993.
[Res01] P. J. Restle et al. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792--799, May 2001.
[RDS90] L. M. Reyneri, D. Del Corso, and B. Sacco. Oscillatory metastability in homogeneous and inhomogeneous flip-flops. IEEE Journal of Solid-State Circuits, SC-25(1):254--264, Feb. 1990.
[RS08] P. Robinson and U. Schmid. The Asynchronous Bounded-Cycle model. In Proceedings SSS'08, 2008.
[SE02] I. E. Sutherland and J. Ebergen. Computers without clocks. Scientific American, 287(2):62--69, Aug. 2002.
[Sut89] I. E. Sutherland. Micropipelines. Communications of the ACM (Turing Award lecture), 32(6):720--738, June 1989.
[WLS05] J. Widder, G. Le Lann, and U. Schmid. Failure detection with booting in partially synchronous systems. In Proceedings of the 5th European Dependable Computing Conference (EDCC-5), volume 3463 of LNCS, pages 20--37, Budapest, Hungary, Apr. 2005. Springer Verlag.
[WS05] J. Widder and U. Schmid. Achieving synchrony without clocks. Research Report 49/2005, Technische Universität Wien, Institut für Technische Informatik, 2005.