Designing Robust Single-Chip Massively Parallel Tera-Device Processors:
Michael Nicolaidis, TIMA
Yield
Power consumption
Reliability
December 2nd, 2011, Hai YU, ARIS Group, TIMA Laboratory
Motivation
Increasing design complexity
High device and power densities
High clock frequencies
Increasing leakages
Shrinking feature geometries
Low voltage (low power)
High clock frequencies
Defects (opens, shorts, etc.)
Process parameter variations
Reduced supply voltage
Increased circuit delay
Delay faults
Soft errors
FF, latch faults
Reliability
Yield
Low power consumption
PVT variations & Defects
Soft errors
Circuit aging
Parameter variations
Robust Design: Yield, Reliability, Low Power
Major Challenges
Ultimate-CMOS and beyond-CMOS Technologies
20 nm is around.
Mask optimization lithography + PDSOI and bulk silicon, and subsequently FinFET and ETSOI, are in the roadmap down to 11 nm.
Things are less precise for the subsequent nodes, but silicon nanowires and fully depleted SOI are listed as the most pertinent choices for 8 nm and 5 nm.
3 nm is expected to move to some kind of carbon devices (nanotubes or otherwise): beyond CMOS.
Computational Opportunities in Ultimate-CMOS and beyond-CMOS
Integrating several trillions of devices in a single chip.
Massively parallel architectures comprising several thousands of processors in a single die (e.g. in 2D mesh topology)
Offer unprecedented computing power and have a profound impact on all computer application domains:
- embedded systems,
- telecommunication networks,
- internet infrastructure and utilization,
- cloud computing, ...
- and ultimately on science, technology and society as a whole.
But Several Show Stoppers
But aggressive technology scaling exacerbates:
- Process, voltage and temperature (PVT) variations;
- Sensitivity to electromagnetic interferences (EMI) and to radiation;
- Circuit aging and wearout;
- Defect levels;
- Power dissipation and thermal constraints.
Production of yield-efficient and reliable chips in ultimate-CMOS and beyond-CMOS may become impossible due to:
- Excessive failure rates;
- Unpredictable and heterogeneous timing behaviour of identical cores;
- Circuit degradation over time;
- Die complexity;
- Power dissipation.
The powerful android pursues its mission even after severe damage. But painfully ...
Terminator chips will continue their mission and remain powerful even when heavily damaged.
TERMINATOR: TERA-DEVICE BIO-MIMETIC NANOTECHNOLOGY ROBUST Chips
Single-chip massively parallel tera-device processors: thousands of computing nodes in a 2D mesh.
Show-stoppers: Yield and reliability (defects, PVT variations, accelerated aging, EMI and soft errors) + power.
[Figure: 2D array of nodes, each marked with its fault type: T, HL, HM, HI or D]
Most nodes: temporary faults (T).
Large percentage of memories (HM) and interconnects (HI): hard faults.
High percentage of processors and routers (HL): hard faults.
Aging-induced degradation (D) frequently produces new temporary faults in most nodes.
Aging-induced new hard faults (HM, HI, HL) occur frequently (e.g. an MTBF of a few days).
Terminator chips: still deliver unprecedented computing power and pursue their mission flawlessly
Extreme Defectivity
Holistic approach acting on all system levels:
- Introduce innovative techniques;
- Choose the most adequate technique at each level;
- Architect their cooperation to optimize the outcome: high yield, high reliability, reduced power, increased performance.
Software Redundancy (computation replication):
- Not working for permanent faults;
- Very high performance penalty;
- Very high power penalty.
Massive Redundancy (DMR, TMR):
- Not working for multiple failures;
- Adverse reduction of hardware resources;
- Very high power penalty.
Conventional Approaches Inadequate
Cells Framework
Integrates existing and new low-cost Circuit & Array level Approaches
Circuit Level
Self-test and self-repair for memories and interconnects
Low-cost circuit-level concurrent error detection in logic
Self-regulation of circuit parameters
Memory ECC
Array Level
Differential Voltage Frequency Scaling
Fault-tolerant, variability aware and power-aware task scheduling and allocation algorithms.
Coherent check-pointing and error recovery at array-level.
Check-pointing-free error recovery at array-level.
Fault-tolerant, congestion and deadlock-free routing algorithms
Optimal Cooperation for reduced cost and increased efficiency
Cells Framework
Integrates existing and new low-cost Circuit, Core, Node & Array level Approaches
Circuit Level
Self-test and self-repair for memories
Self-test and self-repair for interconnects
Low-cost circuit-level concurrent error detection in logic
Memory ECC
Self-regulation of circuit parameters
Various efficient solutions exist: Tanabe et al 1992; Kim et al 1998; Sawada et al 1999; Benso et al 2000; Schober et al 2001; Zorian 2002; Nicolaidis et al 2004; Li et al 2005; Lu et al 2006; Huang et al 2007; ...
Low cost (a few %) due to memory regularity.
Spare-based self-repair for Memories
An architecture combining BIST, ECC and spare-based repair is integrated.
Most efficient for high defect densities (VTS 2004).
Repairs fabrication- and aging-induced faults.
Significant reduction of failed parts, as memories account for the majority of SoC failures.
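As a rough illustration of spare-based repair, the sketch below remaps rows reported faulty by a self-test onto spare rows. The class `RepairableMemory`, its method names, and the row-level granularity are assumptions made for this example; they are not taken from the cited architectures.

```python
class RepairableMemory:
    """Toy memory with row-level spares (hypothetical model, not the cited design)."""

    def __init__(self, rows, spares):
        self.rows = rows
        self.free_spares = list(range(rows, rows + spares))  # physical spare rows
        self.remap = {}                                      # faulty row -> spare row
        self.cells = [0] * (rows + spares)

    def repair(self, faulty_rows):
        """Assign one spare per faulty row; return False if spares run out."""
        for row in faulty_rows:
            if row in self.remap:
                continue
            if not self.free_spares:
                return False
            self.remap[row] = self.free_spares.pop(0)
        return True

    def _physical(self, row):
        # Accesses to a repaired row are redirected to its spare.
        return self.remap.get(row, row)

    def write(self, row, value):
        self.cells[self._physical(row)] = value

    def read(self, row):
        return self.cells[self._physical(row)]


mem = RepairableMemory(rows=8, spares=2)
repaired = mem.repair([3, 5])       # rows reported faulty by a self-test
mem.write(3, 42)                    # transparently lands in a spare row
unrepairable = not mem.repair([6])  # no spare left for a third fault
```

When the spares are exhausted, as in the last line, the fault can no longer be hidden at circuit level and has to be handled by a higher-level mechanism.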
Various efficient solutions exist: Loi et al 2008; Hsieh et al 2010; Kang et al 2010; Pasca et al 2010; Nicolaidis et al 2010.
Low cost is possible due to interconnect regularity.
Spare-based self-repair for Interconnects
Cells framework uses:
- Spare-based self-repair for 2D interconnects.
- Spare- and/or serialization-based self-repair for 3D interconnects.
Repairs fabrication- and aging-induced faults.
Significant reduction of hard faults in interconnects.
GRAAL Architecture for Concurrent Error Detection in Logic
Latch-based design + double sampling
Low area, power and performance penalties. 32-bit icyflex processor (ETS 2011): 17.2%, 8.4%, 2.35% respectively
Detection of timing faults: 100% of CP delays
Variability, EMI, aging, soft errors
Other advantages: can be used for aggressive power reduction + self-regulation
[Figure: GRAAL scheme for Node i: combinational blocks (CC1, CC2, CC3), latches, and comparators (comp) producing err1 and err2]
GRAAL: detects 100% delays, less area and power
Interconnections: GRAAL versus Double sampling with FFs
[Figure: interconnections of Node i with master/slave latches (M, S) clocked by Ck and Ckb, and comparators (comp) raising err1]
Double sampling with FF: detects 50% extra delays
GRAAL: detection of larger delay faults (100% instead of 50% of link delay) at lower area and power
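The difference between the two detection ranges can be pictured with a very simple timing model. This is only a sketch: the function `detected`, the normalised clock period, and the reading of the 50% and 100% figures as the width of the detection window after the capture edge are assumptions of this example, not a description of the GRAAL circuit.

```python
def detected(arrival, period, window):
    """Return True if a late transition is caught by the second sample.

    arrival: time at which the signal settles, measured from the launch edge
    period:  clock period (nominal capture time)
    window:  extra time after the capture edge covered by the second sample
    """
    if arrival <= period:
        return False            # no timing fault: the signal settled in time
    return arrival <= period + window


PERIOD = 1.0
FF_WINDOW = 0.5 * PERIOD      # assumed window for FF-based double sampling (50%)
GRAAL_WINDOW = 1.0 * PERIOD   # assumed window for GRAAL (100%)

late = 1.8 * PERIOD           # a transition arriving with 80% extra delay
ff_catches = detected(late, PERIOD, FF_WINDOW)
graal_catches = detected(late, PERIOD, GRAAL_WINDOW)
```

Under these assumptions, the 80% extra delay falls outside the narrower window and inside the wider one.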
Logic and interconnects (usually high cost): New scheme (GRAAL).
Low area, power, and performance penalty (delay faults 100% CP), SEUs, SETs.
Circuit-level Concurrent Error Detection
Logic: GRAAL: ~17% area, 7% power, 2.35% performance
Interconnects: area ~2%, power & speed insignificant
Node internal interconnects
Array synchronous interconnects (a clock domain per node & router). Easy domain change using the routers' FIFOs + simple ECC.
No GRAAL if GALS
Error recovery:
- Processor nodes: instruction replay + lower clock frequency.
- Links & routers: message resent + lower clock frequency.
Elimination of most faults in logic and interconnects
ECC for Memory Field Failures
Low-cost ECC (e.g. SEC-DED) + interleaving.
Errors uncorrectable by ECC: Array level recovery.
Same as for faults detected by memory and logic self-tests (discussed later)
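To make the SEC-DED behaviour concrete, here is a minimal sketch using an extended Hamming (8,4) code: one data nibble, three Hamming parity bits and one overall parity bit. The word size and the function names `encode` and `decode` are choices made for this example; the slides do not specify which SEC-DED code is used.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions 1..7 form a Hamming(7,4) word; parity bits sit at 1, 2 and 4.
    w = [0] * 8
    w[3], w[5], w[6], w[7] = d
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    w[0] = sum(w[1:]) % 2          # overall parity enables double-error detection
    return w


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos        # XOR of the positions holding a 1
    overall = sum(w) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        w[syndrome] ^= 1           # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "uncorrectable", None   # double error: detected, not corrected
    nibble = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word)
single[6] ^= 1                     # one upset: corrected
double = list(single)
double[2] ^= 1                     # two upsets in the same word: only detected
```

The `uncorrectable` outcome corresponds to the case that is handed over to array-level recovery. Interleaving helps keep that case rare, since physically adjacent cells then belong to different code words and a multi-cell upset appears as single-bit errors in several words.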
Self-regulation at node level (application context + circuit degradations)
Use the error detection rate (Ef) provided by GRAAL:
- Determine the operating points (Ck frequency/Vdd) for preselected Efs.
- Dynamically adapted tables (degradation).
Operating frequency for each application task provided by the OS (task deadline): the node chooses its Vdd from the tables as a function of priorities (power dissipation/reliability: Ef = 0, Ef = 10^-4, ...) determined at the application level.
According to the power dissipation priority, a very low Vdd can be used when drastic power reduction is needed.
[Tables: operating points for Efth = 10^-4 and for Efth = 10^-5; columns F0, F1, F2; rows Vdd0, Vdd1; cell contents (Ef relative to the threshold) not recoverable from the transcript]
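The table lookup described above can be sketched as follows. All numbers in `OPERATING_POINTS` are invented placeholders, and the names `choose_vdd` and `adapt`, as well as the fixed voltage step used to adapt a table entry, are assumptions of this example rather than the actual regulation scheme.

```python
import copy

# Hypothetical tables: for each error-rate threshold Efth, the Vdd (volts)
# to use at each clock frequency (MHz). Values are placeholders.
OPERATING_POINTS = {
    1e-4: {200: 0.60, 400: 0.70, 600: 0.85},
    1e-5: {200: 0.65, 400: 0.80, 600: 0.95},
}


def choose_vdd(tables, frequency, ef_threshold):
    """Pick the Vdd listed for the task frequency under the requested Ef threshold."""
    return tables[ef_threshold][frequency]


def adapt(tables, frequency, ef_threshold, observed_ef, step=0.05):
    """Raise the stored Vdd when the measured error rate exceeds the threshold
    (a simple stand-in for adapting the tables to circuit degradation)."""
    if observed_ef > ef_threshold:
        new_vdd = round(tables[ef_threshold][frequency] + step, 2)
        tables[ef_threshold][frequency] = new_vdd
    return tables[ef_threshold][frequency]


tables = copy.deepcopy(OPERATING_POINTS)      # per-node copy of the tables
vdd_before = choose_vdd(tables, 400, 1e-4)    # frequency imposed by the task deadline
adapt(tables, 400, 1e-4, observed_ef=1e-3)    # detector reports too many errors
vdd_after = choose_vdd(tables, 400, 1e-4)
```

Keeping one table per threshold lets the application trade reliability for power simply by naming a different Ef priority.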
Fault tolerant massively parallel chips
Zajac, Collet, & Napieralski (2008) use self-test and fault-tolerant routing to tolerate faults in massively parallel chips, but:
- Timing faults cannot be tolerated: nodes are rejected as faulty (large waste of resources).
- Many timing faults will escape self-test and will affect reliability.
- Transient faults (SEUs, SETs) are not detected and will affect reliability.
- Static routing tables are used: they cannot cope with new faults occurring during operation, and they do not scale to large arrays, as routing activity is unpredictable and static routing will result in congestion.
New approaches needed: we developed distributed non-deterministic approaches taking local opportunistic decisions.
Other issues: Recovery from Hard Faults in Logic, Memories and Interconnects + ECC Uncorrectable Errors
Problems and remedies:
Recovery will reproduce the same errors (permanent faults): FT scheduling and allocation.
Errors can be propagated in the system and become unrecoverable: Cooperation between self-test; FT scheduling; and recovery algorithm.
Recovery should not affect the coherence of the system: Coordinated check point algorithm.
Check-pointing requires saving regularly internal states in external reliable memory. Congestion of IOs: Hierarchical recovery supported by the task allocation algorithm.
Cells Framework
Integrates existing and new low-cost Circuit & Array level Approaches
Array Level
Fault-tolerant, variability-, aging-, and power-aware task scheduling and allocation algorithms.
Coherent check-pointing and error rollback recovery.
Check-pointing-free error recovery.
Fault-tolerant, congestion-free, and deadlock-free routing algorithms.
Differential Voltage Frequency Scaling
Holistic fault-tolerant routing, power-aware task scheduling and allocation, rollback recovery, and circuit parameter regulation
Fault tolerant Routing Algorithms
Routing tables: store in each node a fault-free path to every other node.
- Avoids failed nodes and routers and also deadlocks.
- Provides optimal routing for each individual message.
- But: frequent conflicts between messages, as the routing for each message is fixed once and for all: crippling congestion in complex arrays.
No routing tables, to avoid congestion, but deadlock becomes an issue: first adaptive congestion-free, deadlock-free routing algorithm that ensures "0 lost message" even under high failure rates (2010 IEEE NCA).
Distributed algorithm: local opportunistic decisions (take another path when a node/router/link is faulty or congested). x1000 nodes!
- Tolerates multiple faulty nodes/routers/links;
- Avoids congestion;
- Copes with new failures.
What about deadlocks (unplanned routing may go indefinitely through the same loop)? 2 Virtual Networks + Turn Prohibition + Virtual Source + Echo Mode.
Weak point: increase of average latency with the increase of the interconnect size and the failure rate.
Introduction of the Explicit Path Routing Mode (DATE 2011, Best IP Award): drastically limits the traffic increase in case of high failure rates, drastically improving the average latency.
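The idea of routing by local decisions, without routing tables, can be illustrated with a greedy hop-by-hop router on a 2D mesh. This is a deliberately simplified sketch and not the published algorithm: it does not model virtual networks, turn prohibition, virtual source, echo mode or congestion, and the `visited` set is only a crude way to keep this toy from looping. The function name `route` and the mesh coordinates are assumptions of the example.

```python
def route(src, dst, size, faulty, max_hops=None):
    """Greedy hop-by-hop routing on a size x size mesh, avoiding faulty nodes.

    Each hop is chosen using only the state of the current node's neighbours.
    Returns the path as a list of (x, y) nodes, or None if no route is found.
    """
    max_hops = max_hops or 4 * size * size
    path, visited = [src], {src}
    node = src
    while node != dst and len(path) <= max_hops:
        x, y = node
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        candidates = [
            n for n in neighbours
            if 0 <= n[0] < size and 0 <= n[1] < size
            and n not in faulty and n not in visited
        ]
        if not candidates:
            return None   # dead end; a real router would need another mechanism here
        # Prefer the healthy neighbour closest to the destination (Manhattan distance).
        node = min(candidates, key=lambda n: abs(n[0] - dst[0]) + abs(n[1] - dst[1]))
        path.append(node)
        visited.add(node)
    return path if node == dst else None


faulty_nodes = {(1, 0), (1, 1)}
path = route((0, 0), (3, 0), size=4, faulty=faulty_nodes)
```

Because no path is stored in advance, a node that fails during operation is simply skipped the next time a message reaches one of its neighbours.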
[Figure: initiator I and tasks TA, TB, TC, TD exchanging check-pointing messages]
Initiator:
- broadcast CK_REQ
- when CK_TAKEN received from all tasks: validate global checkpoint
Non-initiator (blocking or not):
- on CK_REQ receipt: broadcast CK_START
- when CK_START received from all tasks: take local checkpoint
- send CK_TAKEN to initiator
Messages in the array during check-pointing: blocking synchronization messages, non-blocking synchronization messages, application messages.
Strengths:
- Coordinated ChP: simple rollback
- Performance optimization through partitioning
- Intelligent broadcast: reduction of the number of broadcasts + reduction of the size of check-points (memory occupation + communication cost in large networks)
Coordinated Check-pointing (IOLTS08, ISCAS08, DELTA08, IOLTS10, WDSN, MICRO10)
Rollback may destroy coherence in a parallel system
Issue: may provoke congestions of I/O ports
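The CK_REQ / CK_START / CK_TAKEN exchange can be traced with a small sequential simulation. It is only a sketch under simplifying assumptions: messages are delivered instantly and in order, the initiator is assumed to announce CK_START like the other tasks, task states are placeholder strings, and the broadcast reduction mentioned among the strengths is not modelled. The function name `coordinated_checkpoint` is hypothetical.

```python
def coordinated_checkpoint(initiator, tasks):
    """Simulate one checkpoint round; returns (message log, saved states, validated)."""
    log = []
    state = {t: f"state-of-{t}" for t in tasks}   # placeholder task states
    saved = {}

    # 1. The initiator broadcasts CK_REQ to every other task.
    for t in tasks:
        if t != initiator:
            log.append((initiator, t, "CK_REQ"))

    # 2. On CK_REQ receipt, every task broadcasts CK_START.
    #    (Assumption: the initiator announces CK_START as well.)
    started = {t: set() for t in tasks}
    for sender in tasks:
        for receiver in tasks:
            if sender != receiver:
                log.append((sender, receiver, "CK_START"))
                started[receiver].add(sender)

    # 3. A task holding CK_START from all others takes its local checkpoint
    #    and reports CK_TAKEN to the initiator.
    taken = set()
    for t in tasks:
        if started[t] == set(tasks) - {t}:
            saved[t] = state[t]
            if t != initiator:
                log.append((t, initiator, "CK_TAKEN"))
            taken.add(t)

    # 4. The initiator validates the global checkpoint once all tasks reported.
    validated = taken == set(tasks)
    return log, saved, validated


tasks = ["I", "TA", "TB", "TC", "TD"]
log, saved, validated = coordinated_checkpoint("I", tasks)
```

The log makes the cost visible: the CK_START phase grows quadratically with the number of tasks, which is why partitioning and a reduced number of broadcasts matter in large arrays.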
Integrated FT routing, FT scheduling and allocation and Error Recovery
FT scheduling and allocation: intractable in complex arrays.
Check-pointing: needs to regularly save the internal states of the system in reliable (thus external) memory: congestion of I/O ports.
New adaptive algorithm (NCA 2011): fuses FT routing; FT scheduling and allocation; and error recovery.
FT routing and FT scheduling and allocation: distributed algorithm based on local opportunistic decisions.
Error recovery:
- Hierarchical task organization (groups of clusters; task clusters; tasks; subtasks).
- Maintains the hierarchy during execution (parent-children trees).
- Goes back to the parent node in case of failure.
Validated on arrays comprising 1000s of nodes with up to 20% faulty routers/nodes, using a streaming application.
No saving of internal states in external memory (no check-pointing). Recovery for any multiple faults (goes back to the first fault-free node of the tree).
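The recovery rule "go back to the first fault-free node of the tree" amounts to a walk up the parent-children tree. The sketch below assumes the hierarchy is available as two dictionaries; the names `recovery_point`, `parent` and `placement`, and the node labels, are hypothetical.

```python
def recovery_point(task, parent, placement, faulty_nodes):
    """Walk up the parent-children tree to the first ancestor on a fault-free node.

    parent:     dict mapping each task to its parent task (the root maps to None)
    placement:  dict mapping each task to the node that runs it
    Returns the ancestor that restarts the work, or None if none is left.
    """
    current = parent[task]
    while current is not None and placement[current] in faulty_nodes:
        current = parent[current]
    return current


parent = {"group": None, "cluster": "group", "task": "cluster", "subtask": "task"}
placement = {"group": "n0", "cluster": "n1", "task": "n2", "subtask": "n3"}

# The subtask's node fails, and so does the node of its direct parent.
restart_from = recovery_point("subtask", parent, placement, faulty_nodes={"n2", "n3"})
```

Since the ancestor still knows what it delegated, it can hand the work out again, which is what makes recovery possible without saved internal states in this sketch.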
Power aware FT task scheduling and allocation (process variations, aging, and application context)
Three-step algorithm developed:
1: Local static scheduling: determines the number of execution cycles for each group of TCs (rough power and time estimation).
2: Global static scheduling (genetic algorithm): determines the optimal operating points.
3: On-line scheduling: for the active groups of tasks, selects the points that minimize total energy while respecting deadlines.
Issue: steps 2 and 3 non-scalable for large arrays (1000s nodes)
Apply the global static scheduling to each group of TCs in the application.
We obtain a set of optimal operating points (curve) for each group of TCs.
Map and schedule a group of TCs globally to the system.
An operating point for a possible scheduling and an active pair for the target group of TCs.
From the set of operating points of the target group of TCs, remove the non-optimal ones (in red colour): for the same execution time, the lower energy is better.
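Removing the non-optimal operating points is a Pareto filter over (execution time, energy) pairs, and the on-line step then picks the cheapest remaining point that meets the deadline. The sketch below shows both; the sample points and the function names `pareto_front` and `pick_point` are made up for illustration, and the genetic algorithm that generates the candidate points is not modelled.

```python
def pareto_front(points):
    """Keep operating points not dominated in (execution time, energy)."""
    front = []
    best_energy = float("inf")
    for time, energy in sorted(points):      # ascending time, then energy
        if energy < best_energy:             # strictly better than every faster point
            front.append((time, energy))
            best_energy = energy
    return front


def pick_point(front, deadline):
    """Lowest-energy point that still meets the deadline, or None."""
    feasible = [p for p in front if p[0] <= deadline]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# (execution time, energy) candidates for one group of TCs; values are invented.
candidates = [(10, 9.0), (10, 7.5), (12, 8.0), (15, 5.0), (20, 5.5), (20, 4.0)]
front = pareto_front(candidates)
chosen = pick_point(front, deadline=16)
```

Filtering once per group keeps the on-line step cheap: it only scans a short curve instead of every candidate schedule.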
Integrated FT Routing, FT and Power Aware Scheduling and Allocation, Circuit Regulation, and Error Recovery
Distributed routing and scheduling/allocation: as in the previous algorithm.
Hierarchical error recovery: as in the previous algorithm.
Circuit heterogeneity handled hierarchically (IOLTS 2011):
- Groups of task clusters mapped to regions (by the highest leader),
- Task clusters mapped to sub-regions by the leader of each region,
- Tasks mapped to nodes by the leader of each sub-region, ...
Encapsulation in each task: operating frequency (deadlines); power/reliability priorities (e.g. Ef < 10^-5).
Each node determines its clock frequency and Vdd accordingly.
Evaluated for 1000s of nodes with up to 20% faulty routers/nodes, using streaming applications.
Cells Framework: Biologically Inspired Robustness
Ongoing extension of the algorithm:
- During the design phase: use fault simulation to create the invalid classes of instructions (i.e. those affected by the different faults).
- During compilation: determine, for each task, which of the invalid instruction classes are compatible with it.
- After fabrication and periodically in the field: execute the diagnosis algorithm in each node to determine the invalid instruction classes for this node.
- At run time: the leader of each sub-region maps tasks to nodes whose invalid instruction classes are compatible with these tasks.
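The run-time mapping step can be pictured with plain set operations. The sketch assumes that a task is compatible with an invalid instruction class when it uses no instruction from that class; this reading, the class names, and the function `compatible_nodes` are assumptions of the example.

```python
def compatible_nodes(task_classes, node_invalid):
    """Nodes whose invalid instruction classes are all unused by the task.

    task_classes: set of instruction classes the task executes
    node_invalid: dict mapping each node to its set of invalid instruction classes
    """
    return [node for node, invalid in node_invalid.items()
            if not (invalid & task_classes)]


# Result of the per-node diagnosis (hypothetical data).
node_invalid = {
    "n0": set(),                 # fully functional node
    "n1": {"fp_div"},            # faulty floating-point divider
    "n2": {"mul", "shift"},      # faulty multiplier and shifter
}

# Instruction classes used by each task, as determined at compile time.
task_uses = {
    "fft": {"mul", "add", "fp_div"},
    "copy": {"load", "store"},
}

fft_nodes = compatible_nodes(task_uses["fft"], node_invalid)
copy_nodes = compatible_nodes(task_uses["copy"], node_invalid)
```

A partially faulty node thus stays useful for every task that never exercises its defective instructions, instead of being discarded as a whole.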
Cells Framework
Ongoing work at TIMA: development of a holistic platform covering all levels of system hierarchy.
Bringing various innovations: including its overall architecture, its particular components, and the way they cooperate to optimize the outcome.
Ambitious goal to design the Terminator chips: transform into reliable systems future massively parallel chips, promising unprecedented computing power but affected by extreme failure rates.
Cells Framework
Notable similarities between Cells and Living organisms
Living organisms: Cells surrounding injured parts substitute dead cells to regenerate damaged structures.
Cells framework: Functioning nodes replace unrecoverable faulty nodes.
Living organisms: Cells can recover from damage, e.g. DNA repair.
Cells framework: Nodes recover from faults (self-repair, CED, instr. replay, self-test, rollback).
Living organisms: Plasticity of the nervous and circulatory systems: after injury, information/blood flows through alternative paths (e.g. anastomosis).
Cells framework: Dynamic routing algorithms create alternative paths in response to failures and congestion.
Living organisms: Energy management: blood flow (transporting oxygen and nutrients) to downstream organs is regulated according to their activity.
Cells framework: Energy dissipation is regulated by adapting node voltage and clock frequency to the computation needs.
Living organisms: Regulate their physiological parameters according to changing conditions and to their needs (e.g. the regulation of insulin levels in response to sugar levels).
Cells framework: Cooperation between array-level and circuit-level mechanisms adapts circuit parameters to meet application goals, while adapting power to circuit degradation.
Living organisms: Cells in organisms achieve self-healing and self-regulation in a distributed, non-deterministic, local and opportunistic manner.
Cells framework: New distributed, non-deterministic routing, task allocation and scheduling algorithms, making local decisions in an opportunistic manner.
Living organisms: These processes are done without conscious intervention.
Cells framework: These processes are transparent to the application software.
Similarities emerged spontaneously as we tried to optimize the ways we can ensure robustness in very complex chips.
This trend should be reinforced as post CMOS technologies will increase the complexity of future chips.