Designing Robust Single-Chip Massively Parallel Tera-Device Processors:

Michael Nicolaidis, TIMA

Power consumption

Reliability

December 2nd, 2011. Hai YU, ARIS Group, TIMA Laboratory

Motivation: Increasing design complexity

[Figure: technology trends and their consequences. Trends: high device and power densities; high clock frequencies; increasing leakages; shrinking feature geometries; low voltage (low power). Consequences: defects (opens, shorts, etc.); process parameter variations; reduced supply voltage; increased circuit delay; delay faults; soft errors; FF and latch faults.]

Robust design must address PVT variations and defects, soft errors, circuit aging, and parameter variations, while delivering yield, reliability, and low power consumption.

Major Challenges

Ultimate-CMOS and beyond-CMOS Technologies

20 nm is around the corner.

Mask optimization lithography with PDSOI and bulk silicon, and subsequently FinFET and ETSOI, are on the roadmap down to 11 nm.

Things are less precise for the subsequent nodes, but silicon nanowires and fully depleted SOI are listed as the most pertinent choices for 8 nm and 5 nm.

3 nm is expected to move to some kind of carbon devices (nanotubes or otherwise): beyond CMOS.

Computational Opportunities in Ultimate-CMOS and beyond-CMOS

Integrating several trillions of devices in a single chip.

Massively parallel architectures comprising several thousands of processors in a single die (e.g. in 2D mesh topology)

These architectures offer unprecedented computing power and will have a profound impact on all computer application domains (embedded systems, telecommunication networks, internet infrastructure and utilization, cloud computing, etc.), and ultimately on science, technology and society as a whole.

But Several Show Stoppers

But aggressive technology scaling exacerbates: process, voltage and temperature (PVT) variations; sensitivity to electromagnetic interference (EMI) and to radiation; circuit aging and wearout; defect levels; power dissipation and thermal constraints.

Production of yield-efficient and reliable chips in ultimate-CMOS and beyond-CMOS may become impossible due to: excessive failure rates; unpredictable and heterogeneous timing behaviour of identical cores; circuit degradation over time; die complexity; power dissipation.

The powerful android pursues its mission even after severe damage. But painfully ...

Terminator chips will continue their mission and remain powerful even when heavily damaged.

TERMINATOR: TERA-DEVICE BIO-MIMETIC NANOTECHNOLOGY ROBUST Chips

Single-chip massively parallel tera-device processors: thousands of computing nodes in a 2D mesh.

Show-stoppers: Yield and reliability (defects, PVT variations, accelerated aging, EMI and soft errors) + power.

Most nodes: temporary faults (T).

Large percentage of memories (HM) and interconnects (HI): hard faults.

High percentage of processors and routers (HL): hard faults.

Aging-induced degradation (D) frequently produces new temporary faults in most nodes.

Aging-induced new hard faults (HM, HI, HL) occur with high MTBF (e.g. every few days).

Terminator chips: still deliver unprecedented computing power and pursue their mission flawlessly

Extreme Defectivity

Holistic approach acting on all system levels:

Introduce innovative techniques.

Choose the most adequate technique at each level.

Architect their cooperation to optimize the outcome: high yield, high reliability, reduced power, increased performance.

Conventional Approaches Inadequate

Software redundancy (computation replication): does not work for permanent faults; very high performance penalty; very high power penalty.

Massive redundancy (DMR, TMR): does not work for multiple failures; adverse reduction of hardware resources; very high power penalty.

Cells Framework

Integrates existing and new low-cost Circuit & Array level Approaches

Circuit Level

Self-test and self-repair for memories and interconnects

Low-cost circuit-level concurrent error detection in logic

Self-regulation of circuit parameters

Memory ECC

Array Level

Differential Voltage Frequency Scaling

Fault-tolerant, variability aware and power-aware task scheduling and allocation algorithms.

Coherent check-pointing and error recovery at array-level.

Check-pointing-free error recovery at array-level.

Fault-tolerant, congestion and deadlock-free routing algorithms

Optimal Cooperation for reduced cost and increased efficiency

Cells Framework

Integrates existing and new low-cost Circuit, Core, Node & Array level Approaches

Circuit Level

Self-test and self-repair for memories

Self-test and self-repair for interconnects

Low-cost circuit-level concurrent error detection in logic

Memory ECC

Self-regulation of circuit parameters

Spare-based Self-repair for Memories

Various efficient solutions exist (Tanabe et al. 1992; Kim et al. 1998; Sawada et al. 1999; Benso et al. 2000; Schober et al. 2001; Zorian 2002; Nicolaidis et al. 2004; Li et al. 2005; Lu et al. 2006; Huang et al. 2007).

Low cost (a few %) due to memory regularity.

Architecture combining BIST, ECC and spare based repair is integrated

Most efficient for high defect densities (VTS 2004)

Repairs fabrication and aging induced faults

Significant reduction of failed parts, as memories account for the majority of SoC failures

Spare-based Self-repair for Interconnects

Various efficient solutions exist (Loi et al. 2008; Hsieh et al. 2010; Kang et al. 2010; Pasca et al. 2010; Nicolaidis et al. 2010).

Low cost is possible due to the regularity of interconnects.

The Cells framework uses: spare-based self-repair for 2D interconnects; spare- and/or serialization-based self-repair for 3D interconnects.

Significant reduction of hard faults in Interconnects.

GRAAL Architecture for Concurrent Error Detection in Logic

Latch-based design + double sampling.

Low area, power and performance penalties. 32-bit icyflex processor (ETS 2011): 17.2%, 8.4% and 2.35%, respectively.

Detection of timing faults: 100% of CP delays.

Covers variability, EMI, aging, and soft errors.

Other advantages: can be used for aggressive power reduction + self-regulation.

[Figure: node i, with comparators (comp) producing the error signals err1 and err2.]

GRAAL: detects 100% of delays, with less area and power.

Interconnections: GRAAL versus Double Sampling with FFs

Double sampling with FFs: detects 50% extra delays.

GRAAL: detection of larger delay faults (100% instead of 50% of link delay) at lower area and power.

Circuit-level Concurrent Error Detection

Logic and interconnects (usually high cost): new scheme (GRAAL).

Low area, power, and performance penalty; covers delay faults (100% of CP), SEUs, and SETs.

Logic: GRAAL, ~17% area, 7% power, 2.35% performance.

Interconnects: area ~2%, power and speed penalties insignificant.

Node internal interconnects.

Array synchronous interconnects (a clock domain per node & router). Easy domain change using the routers' FIFOs, plus simple ECC.

No GRAAL if GALS.

Error recovery. Processor nodes: instruction replay + lower clock frequency. Links & routers: message resend + lower clock frequency.
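The recovery policy for processor nodes (instruction replay plus a lower clock frequency) can be pictured as a retry loop. The following is only a minimal sketch: the `execute` hook, the frequency list, and the number of replays per frequency step are hypothetical placeholders and are not taken from the GRAAL work.

```python
from typing import Callable, Sequence, Tuple


def run_with_replay(
    execute: Callable[[float], bool],
    frequencies_mhz: Sequence[float],
    replays_per_step: int = 1,
) -> Tuple[bool, float, int]:
    """Replay an instruction, lowering the clock after repeated detected errors.

    `execute(f)` is a hypothetical hook: it runs the instruction at clock
    frequency f and returns True when the error detector raised no error.
    `frequencies_mhz` lists the allowed frequencies from fastest to slowest.
    Returns (success, last frequency tried, total number of attempts).
    """
    attempts = 0
    last = frequencies_mhz[0]
    for f in frequencies_mhz:
        last = f
        # One first try plus `replays_per_step` replays at this frequency.
        for _ in range(1 + replays_per_step):
            attempts += 1
            if execute(f):
                return True, f, attempts
    return False, last, attempts


# Toy timing fault (assumption): the path only meets timing at or below 600 MHz.
def flaky(f_mhz: float) -> bool:
    return f_mhz <= 600.0


result = run_with_replay(flaky, [1000.0, 800.0, 600.0, 400.0])
print(result)
```

A transient fault would typically disappear on the first replay at the same frequency, while a timing fault caused by degradation only disappears once the frequency has been lowered enough, which is why the sketch tries both.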

Elimination of most faults in logic and interconnects

ECC for Memory Field Failures

Low-cost ECC (e.g. SEC-DED) + interleaving.

Errors uncorrectable by ECC: Array level recovery.

Same as for faults detected by memory and logic self-tests (discussed later)
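To make the SEC-DED plus interleaving idea concrete, here is a small sketch using an extended Hamming (8,4) code. The word size, the bit layout, and the function names are illustrative assumptions; real memory ECC normally protects much wider words. A double error within one codeword is reported as uncorrectable, which is the case that would be handed over to array-level recovery.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    c = [0] * 8  # c[0]: overall parity, c[1..7]: Hamming positions 1..7
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2  # makes the parity of the whole word even
    return c


def decode(word: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when the error is uncorrectable."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    parity_error = sum(c) % 2 == 1
    if parity_error:
        # Single error: the syndrome is its position (0 = overall parity bit).
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but non-zero syndrome: double error detected.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status


def interleave(words: List[List[int]]) -> List[int]:
    """Place bit i of every codeword next to each other in the physical row."""
    return [w[i] for i in range(len(words[0])) for w in words]


def deinterleave(bits: List[int], n_words: int) -> List[List[int]]:
    return [bits[k::n_words] for k in range(n_words)]


# Two codewords share one physical row; an upset flips two adjacent cells.
stored = interleave([encode(0b1011), encode(0b0110)])
stored[6] ^= 1
stored[7] ^= 1
recovered = [decode(w) for w in deinterleave(stored, 2)]

# Without interleaving, two flips can land in the same codeword.
bad = encode(0b1011)
bad[1] ^= 1
bad[2] ^= 1
double = decode(bad)
```

With interleaving, the two adjacent flipped cells belong to different codewords, so each word sees a single error and both are corrected.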

Self-regulation at node level (application context + circuit degradations)

Use the error detection rate (Ef) provided by GRAAL: determine the operating points (clock frequency/Vdd) for preselected Ef values. The tables are adapted dynamically to track degradation.

The operating frequency for each application task is provided by the OS (task deadline). The node chooses its Vdd from the tables as a function of the priorities (power dissipation/reliability: Ef = 0, Ef = 10^-4, ...) determined at the application level.

Depending on the power dissipation priority, a very low Vdd can be used when drastic power reduction is needed.

[Tables of operating points, one per error-rate threshold Efth (e.g. Efth = 10^-4): columns F0, F1, F2; rows Vdd0, Vdd1, ...; entries give the error rate Ef (e.g. 10^-5).]
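The table-driven choice of Vdd can be sketched as follows. This is an illustrative sketch only: the measured error rates, voltages, and frequencies are invented values, and `build_table` and `choose_vdd` are hypothetical names, not part of the presented design. For each error-rate threshold, the table keeps the lowest Vdd whose observed Ef stays within the threshold at a given frequency.

```python
from typing import Dict, Optional, Tuple

# (Vdd in volts, frequency in MHz) -> observed error detection rate Ef
Measurements = Dict[Tuple[float, float], float]


def build_table(measured: Measurements, ef_threshold: float) -> Dict[float, float]:
    """For each frequency, keep the lowest Vdd whose Ef is within the threshold."""
    table: Dict[float, float] = {}
    for (vdd, freq), ef in measured.items():
        if ef <= ef_threshold and (freq not in table or vdd < table[freq]):
            table[freq] = vdd
    return table


def choose_vdd(
    tables: Dict[float, Dict[float, float]], freq: float, ef_threshold: float
) -> Optional[float]:
    """Vdd for the frequency requested by the OS, or None if no point qualifies."""
    return tables[ef_threshold].get(freq)


# Invented measurements (assumption), standing in for error rates
# reported by the node's error detection logic.
measured: Measurements = {
    (0.7, 200): 1e-5, (0.7, 400): 1e-3,
    (0.9, 200): 0.0,  (0.9, 400): 1e-4,
    (1.1, 200): 0.0,  (1.1, 400): 0.0,
}
tables = {th: build_table(measured, th) for th in (0.0, 1e-4)}

strict = choose_vdd(tables, 400, 0.0)     # reliability has priority
relaxed = choose_vdd(tables, 400, 1e-4)   # power has priority

# Degradation: the same point now shows more errors, so the table is rebuilt.
measured[(0.9, 400)] = 1e-3
tables[1e-4] = build_table(measured, 1e-4)
after_aging = choose_vdd(tables, 400, 1e-4)
```

Rebuilding the table after new observations is one simple way to obtain the "dynamically adapted" behaviour; an incremental update per entry would work equally well.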

Fault tolerant massively parallel chips

Zajac, Collet, & Napieralski (2008) use self-test and fault-tolerant routing to tolerate faults in massively parallel chips, but:

Timing faults cannot be tolerated: nodes are rejected as faulty (large waste of resources).

Many timing faults will escape self-test and will affect reliability.

Transient faults (SEUs, SETs) are not detected and will affect reliability.

Static routing tables are used: they cannot cope with new faults occurring during operation, and they do not scale to large arrays, as routing activity is unpredictable and static routing will result in congestion.

New approaches needed: we developed distributed non-deterministic approaches taking local opportunistic decisions.

Other issues: Recovery from Hard Faults in Logic, Memories and Interconnects + ECC Uncorrectable Errors

Problems and remedies:

Recovery will reproduce the same errors (permanent faults): FT scheduling and allocation.

Errors can be propagated in the system and become unrecoverable: Cooperation between self-test; FT scheduling; and recovery algorithm.

Recovery should not affect the coherence of the system: Coordinated check point algorithm.

Check-pointing requires regularly saving internal states in reliable external memory, which congests the I/Os: hierarchical recovery supported by the task allocation algorithm.

Cells Framework

Integrates existing and new low-cost Circuit & Array level Approaches

Array Level

Fault-tolerant, variability-, aging-, and power-aware task scheduling and allocation algorithms.

Coherent check-pointing and error rollback recovery.

Check-pointing-free error recovery.

Fault-tolerant, congestion-free, and deadlock-free routing algorithms.

Differential Voltage Frequency Scaling

Holistic fault-tolerant routing, power-aware task scheduling and allocation, rollback recovery, and circuit parameter regulation

Fault tolerant Routing Algorithms

Routing tables: store in each node a fault-free path to every other node.

This avoids failed nodes and routers, and also deadlocks. It provides optimal routing for each individual message. But conflicts between messages are frequent, as the route of each message is fixed once and for all: crippling congestion in complex arrays.

No routing tables, to avoid congestion, but deadlock becomes an issue: first adaptive congestion-free, deadlock-free routing algorithm that ensures "0 lost messages" even under high failure rates (2010 IEEE NCA).

Distributed algorithm: local opportunistic decisions (take another path when a node/router/link is faulty or congested). x1000 nodes! Tolerates multiple faulty nodes/routers/links; avoids congestion; copes with new failures.

What about deadlocks (unplanned routing may go indefinitely through the same loop)? 2 Virtual Networks + Turn Prohibition + Virtual Source + Echo Mode.
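The flavour of routing by local opportunistic decisions can be illustrated with a toy model of a 2D mesh. This sketch is not the published algorithm: it does not model virtual networks, turn prohibition, virtual source, or echo mode. As a stand-in for those mechanisms it lets the message carry the set of nodes already visited and back up from dead ends, which is an assumption made purely to keep the example loop-free.

```python
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int]


def neighbors(node: Node, width: int, height: int) -> List[Node]:
    """Mesh neighbours of a node (east, west, north, south) inside the array."""
    x, y = node
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < width and 0 <= b < height]


def route(
    src: Node, dst: Node, width: int, height: int, unavailable: Set[Node]
) -> Optional[List[Node]]:
    """Hop-by-hop routing where each hop only uses local information.

    `unavailable` holds nodes that are faulty or congested. At every hop the
    message moves to the usable neighbour closest to the destination; if no
    usable neighbour is left, it backs up one hop and tries another way.
    """
    path = [src]
    visited = {src}
    while path:
        current = path[-1]
        if current == dst:
            return path
        options = [
            n for n in neighbors(current, width, height)
            if n not in unavailable and n not in visited
        ]
        if not options:
            path.pop()  # dead end: return to the previous node
            continue
        nxt = min(options, key=lambda n: abs(n[0] - dst[0]) + abs(n[1] - dst[1]))
        visited.add(nxt)
        path.append(nxt)
    return None  # destination unreachable from the source


detour = route((0, 0), (3, 0), width=4, height=4, unavailable={(1, 0)})
blocked = route((0, 0), (2, 2), width=3, height=3, unavailable={(1, 2), (2, 1)})
```

Because no route is precomputed, the same call copes with a fault set that changes between messages, which is the property the slide contrasts with static routing tables.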

Fault tolerant Routing Algorithms

Weak point: average latency increases with the interconnect size and the failure rate.

Introduction of the Explicit Path Routing Mode (DATE 2011, Best IP Award): drastically limits the traffic increase under high failure rates, drastically improving the average latency.

Initiator:

- broadcast CK_REQ

- when CK_TAKEN received from all tasks: validate global checkpoint

Non-initiator (blocking or not):

- on CK_REQ receipt: broadcast CK_START

- when CK_START received from all tasks: take local checkpoint

- send CK_TAKEN to initiator
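The message exchange above can be simulated with a single FIFO queue. The sketch below makes several assumptions that the slide does not state: "all tasks" is read as "all other tasks", the initiator also broadcasts CK_START and takes its own local checkpoint, and the difference between blocking and non-blocking operation is not modelled. The function name and the integer task states are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Set, Tuple

Message = Tuple[str, int, int]  # (kind, sender, receiver)


def coordinated_checkpoint(
    states: Dict[int, int], initiator: int
) -> Tuple[bool, Dict[int, int]]:
    """Simulate one coordinated checkpoint round; returns (validated, checkpoints)."""
    tasks = sorted(states)
    queue: Deque[Message] = deque()
    started_from: Dict[int, Set[int]] = {t: set() for t in tasks}
    sent_start: Set[int] = set()
    local_ckpt: Dict[int, int] = {}
    taken_from: Set[int] = set()

    def broadcast(kind: str, sender: int) -> None:
        for t in tasks:
            if t != sender:
                queue.append((kind, sender, t))

    def maybe_checkpoint(t: int) -> None:
        # Checkpoint once CK_START has arrived from every other task.
        others = set(tasks) - {t}
        if t not in local_ckpt and t in sent_start and started_from[t] == others:
            local_ckpt[t] = states[t]
            if t != initiator:
                queue.append(("CK_TAKEN", t, initiator))

    broadcast("CK_REQ", initiator)
    broadcast("CK_START", initiator)  # assumption: the initiator takes part too
    sent_start.add(initiator)
    maybe_checkpoint(initiator)

    while queue:
        kind, sender, receiver = queue.popleft()
        if kind == "CK_REQ" and receiver not in sent_start:
            sent_start.add(receiver)
            broadcast("CK_START", receiver)
        elif kind == "CK_START":
            started_from[receiver].add(sender)
        elif kind == "CK_TAKEN":
            taken_from.add(sender)
        maybe_checkpoint(receiver)

    validated = initiator in local_ckpt and taken_from == set(tasks) - {initiator}
    return validated, local_ckpt


validated, checkpoints = coordinated_checkpoint({0: 10, 1: 20, 2: 30}, initiator=0)
```

Since every task checkpoints only after hearing CK_START from all the others, the saved states belong to the same round, which is what makes the later rollback simple.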

[Figure: messages in the array during check-pointing, showing blocking synchronization messages, non-blocking synchronization messages, and application messages.]

Strengths:

Coordinated check-pointing: simple rollback.

Performance optimization through partitioning.

Intelligent broadcast: reduction of the number of broadcasts + reduction of the size of check-points (memory occupation + communication cost in large networks).

Coordinated Check-pointing (IOLTS08, ISCAS08, DELTA08, IOLTS10, WDSN, MICRO10)

Rollback may destroy coherence in a parallel system

Issue: may provoke congestions of I/O ports

Integrated FT routing, FT scheduling and allocation, and Error Recovery

FT scheduling and allocation: intractable in complex arrays.

Check-pointing: needs to regularly save the internal states of the system in reliable (thus external) memory, causing congestion of I/O ports.

New adaptive algorithm (NCA 2011): fuses FT routing, FT scheduling and allocation, and error recovery.

FT routing and FT scheduling and allocation: distributed algorithm based on local opportunistic decisions.

Error recovery: hierarchical task organization (groups of clusters; task clusters; tasks; subtasks). Maintains the hierarchy during execution (parent-children trees). Goes back to the parent node in case of failure.

Validated on arrays comprising 1000s of nodes with up to 20% faulty routers/nodes, using a streaming application.

No saving of internal states in external memory (no check-pointing). Recovery from any multiple faults (goes back to the first fault-free node of the tree).
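The rule "go back to the first fault-free node of the tree" can be sketched as a walk up the parent pointers of the task hierarchy. The data layout (`parent` and `host` dictionaries), the task names, and the node numbers below are hypothetical and only serve the illustration.

```python
from typing import Dict, Optional, Set


def recovery_point(
    task: str,
    parent: Dict[str, Optional[str]],
    host: Dict[str, int],
    faulty_nodes: Set[int],
) -> Optional[str]:
    """Return the closest ancestor of `task` that runs on a fault-free node.

    `parent` maps each task to its parent in the hierarchy (None for the
    root) and `host` maps each task to the node executing it.
    """
    current = parent.get(task)
    while current is not None:
        if host[current] not in faulty_nodes:
            return current
        current = parent.get(current)
    return None  # every ancestor is on a faulty node


# Hypothetical hierarchy: group of clusters -> task cluster -> task -> subtask.
parent = {"group": None, "cluster": "group", "task": "cluster", "subtask": "task"}
host = {"group": 0, "cluster": 5, "task": 9, "subtask": 12}

single = recovery_point("subtask", parent, host, faulty_nodes={12})
multiple = recovery_point("subtask", parent, host, faulty_nodes={12, 9, 5})
```

In this reading, the ancestor found re-issues the lost part of the work to other nodes, so nothing has to be restored from external memory.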

Power aware FT task scheduling and allocation (process variations, aging, and application context)

Three-step algorithm developed:

1: Local static scheduling: determines the number of execution cycles for each group of TCs (rough power and time estimation).

2: Global static scheduling (genetic algorithm): determines the optimal operating points.

3: On-line scheduling: for the active groups of tasks, selects the points that minimize total energy while respecting deadlines.

Issue: steps 2 and 3 non-scalable for large arrays (1000s nodes)

Apply the Global Static scheduling on each group of TCs in the application

We obtain a set of optimal operating points (curve) for each group of TCs.

Map and schedule a group of TCs globally to the system.

An operating point for a possible scheduling and an active pair for the target group of TCs.

From the set of operating points of the target group of TCs, remove the non-optimal ones (in red colour): for the same execution time, the lower energy is better.
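Removing the non-optimal operating points and then picking a point for a deadline can be sketched as below. The sketch applies full Pareto dominance (a point is dropped if another point is at least as fast and uses less energy), which is slightly stronger than the "same execution time" rule on the slide, and it handles a single group of TCs rather than minimizing the total over all active groups. The numeric points are invented.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (execution time, energy)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only points not dominated in both execution time and energy."""
    front: List[Point] = []
    best_energy = float("inf")
    # Sorted by time, then energy: a point survives only if it uses less
    # energy than every point that is at least as fast.
    for time, energy in sorted(set(points)):
        if energy < best_energy:
            front.append((time, energy))
            best_energy = energy
    return front


def pick_for_deadline(front: List[Point], deadline: float) -> Optional[Point]:
    """Lowest-energy point that still meets the deadline, if any."""
    feasible = [p for p in front if p[0] <= deadline]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# Invented operating points for one group of TCs.
points = [(10, 50), (10, 40), (12, 45), (15, 30), (20, 30), (25, 20)]
front = pareto_front(points)
chosen = pick_for_deadline(front, deadline=18)
```

Filtering once, offline, keeps the on-line step cheap: at run time only the short list of surviving points has to be scanned.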

Integrated FT Routing, FT and Power Aware Scheduling and Allocation, Circuit Regulation, and Error Recovery

Distributed routing and scheduling/allocation: as in the previous algorithm.

Hierarchical error recovery: as in the previous algorithm.

Circuit heterogeneity handled hierarchically (IOLTS 2011): groups of task clusters are mapped to regions (by the highest leader); task clusters are mapped to sub-regions by the leader of each region; tasks are mapped to nodes by the leader of each sub-region, ...

Encapsulation in each task: operating frequency (deadlines); power/reliability priorities (e.g. Ef < 10^-5).

Each node determines its clock frequency and Vdd accordingly.

Evaluated for 1000s of nodes with up to 20% faulty routers/nodes, using streaming applications.

Cells Framework: Biologically Inspired Robustness

Ongoing extension of the algorithm:

During the design phase: use fault simulation to create the invalid classes of instructions (i.e. those affected by the different faults).

During compilation: determine, for each task, which of the invalid instruction classes are compatible with it.

After fabrication and periodically in the field: execute the diagnosis algorithm in each node to determine the invalid instruction classes for this node.

At run time: the leader of each sub-region maps tasks to nodes whose invalid instruction classes are compatible with these tasks.
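The run-time step can be sketched as a set-inclusion test: a node is usable for a task when all of the node's invalid instruction classes are ones the task tolerates. Everything concrete here is an assumption for illustration: the class names, the one-task-per-node greedy policy, and the preference for the most damaged compatible node (to keep healthier nodes free for demanding tasks). A greedy pass like this does not guarantee the best possible assignment.

```python
from typing import Dict, Optional, Set


def map_tasks(
    tasks: Dict[str, Set[str]], node_invalid: Dict[int, Set[str]]
) -> Dict[str, Optional[int]]:
    """Greedily map each task to a free node with compatible invalid classes.

    `tasks` maps a task name to the invalid instruction classes it tolerates
    (typically classes it never uses); `node_invalid` maps a node id to the
    invalid classes found by diagnosis on that node.
    """
    free = set(node_invalid)
    mapping: Dict[str, Optional[int]] = {}
    # Most constrained tasks first: those tolerating the fewest classes.
    for name in sorted(tasks, key=lambda t: (len(tasks[t]), t)):
        candidates = [n for n in free if node_invalid[n] <= tasks[name]]
        if not candidates:
            mapping[name] = None  # no compatible node left in this sub-region
            continue
        # Prefer the most damaged compatible node; ties go to the lowest id.
        chosen = max(candidates, key=lambda n: (len(node_invalid[n]), -n))
        free.remove(chosen)
        mapping[name] = chosen
    return mapping


# Hypothetical diagnosis results and task requirements.
node_invalid = {0: set(), 1: {"mul"}, 2: {"div", "fp"}, 3: {"mul", "div"}}
tasks = {
    "fft": set(),                    # needs every instruction class
    "filter": {"div", "fp"},         # never uses div or fp
    "copy": {"mul", "div", "fp"},    # only moves data
}
mapping = map_tasks(tasks, node_invalid)
```

In this example the only fully healthy node is reserved for the task that needs every instruction class, while partially faulty nodes still do useful work.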

Ongoing work at TIMA: development of a holistic platform covering all levels of system hierarchy.

Bringing various innovations: including its overall architecture, its particular components, and the way they cooperate to optimize the outcome.

Ambitious goal to design the Terminator chips: transform into reliable systems future massively parallel chips, promising unprecedented computing power but affected by extreme failure rates.

Living organisms: cells surrounding injured parts substitute for dead cells to regenerate damaged structures.
Cells Framework: functioning nodes replace unrecoverable faulty nodes.

Living organisms: cells can recover from damage (e.g. DNA repair).
Cells Framework: nodes recover from faults (self-repair, CED, instruction replay, self-test, rollback).

Living organisms: plasticity of the nervous and circulatory systems; after injury, information/blood flows through alternative paths (e.g. anastomosis).
Cells Framework: dynamic routing algorithms create alternative paths in response to failures and congestion.

Living organisms: energy management; blood flow (transporting oxygen and nutrients) to downstream organs is regulated according to their activity.
Cells Framework: energy dissipation is regulated by adapting node voltage and clock frequency to the computation needs.

Notable similarities between Cells and Living organisms

Living organisms: regulate their physiological parameters according to changing conditions and to their needs (e.g. the regulation of insulin levels in response to sugar levels).
Cells Framework: cooperation between array-level and circuit-level mechanisms adapts circuit parameters to meet application goals, while adapting power to circuit degradation.

Living organisms: cells achieve self-healing and self-regulation in a distributed, non-deterministic, local and opportunistic manner.
Cells Framework: new distributed, non-deterministic routing, task allocation and scheduling algorithms, making local decisions in an opportunistic manner.

Living organisms: these processes take place without conscious intervention.
Cells Framework: these processes are transparent to the application software.

Similarities emerged spontaneously as we tried to optimize the ways we can ensure robustness in very complex chips.

This trend should be reinforced as post CMOS technologies will increase the complexity of future chips.
