
-1-

UCSD VLSI CAD Laboratory and UIUC PASSAT Group

Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules

Andrew B. Kahng†, Seokhyeong Kang†, Rakesh Kumar‡ and John Sartori‡

†VLSI CAD Laboratory, UCSD
‡PASSAT Group, UIUC

DAC, June 17, 2010

-2-

Outline

Background and Motivation
– Voltage scaling and error-tolerant design
– Error-tolerant design vs. recovery-driven design

Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation

Experimental Framework and Results
– Design methodology
– Results and analysis

Conclusions and Ongoing Work

-3-

Reducing Power with Voltage Scaling

Power is a first-order design constraint
– Moore’s law implies power density of processors continues to escalate

Voltage scaling reduces power but eventually causes massive timing violations

[Figure: power vs. voltage (lower voltage); annotation: "Timing errors begin to occur"]

Error-resilience allows deeper voltage scaling

-4-

Error-Tolerance Mechanisms

Traditional IC design
• No errors allowed
• Overclocking and voltage overscaling not enabled

Error-Tolerant design
• Error correction architecture allows timing errors
• Overclocking and voltage overscaling enabled

Hardware error-tolerance
– Errors are detected and corrected during runtime
– Razor (MICRO 2003)

Application-level error-tolerance*
– Errors are allowed to propagate to software, resulting in reduced performance or output quality

[Figure: energy per instruction, error rate, and reduction in computation speed vs. voltage scaling (lower voltage); the energy minimum is marked at ~0.04% error rate and ~0.2% speed reduction]

*Hegde et al. “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999

-5-

Our Work: From Error-Tolerance to Recovery-Driven

Error-Tolerant design
• Design still optimized for correct operation
• Design methodology based on STA, workload-agnostic

Recovery-Driven design
• Designed “from ground up” for specific target error rate
• Design methodology exploits functional information

-6-

Recovery-Driven Design

1. Minimize error rate to extend range of voltage scaling
2. Reduce design power with cell downsizing or Vt swap

[Figure: error rate (traditional vs. optimized) and power (traditional vs. optimized) plotted against lower voltage, with a target error rate line. Step 1 is labeled OptimizePaths and step 2 ReducePower. The original operating point (Vmin, Pmin) and the new operating point (Vmin, Pmin) are marked.]

How to minimize power in recovery-driven design?

-7-

Outline

Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design

-8-

Related Works: Design-Level Optimizations for Error-Tolerant Processors

BlueShift*
– Increase frequency up to a target error rate
– Speed up error paths with timing overrides and FBB

Slack Optimizer**
– Create a ‘gradual slope’ slack distribution to achieve a gracefully increasing error rate
– Estimate error rate using switching activity from SAIF

[Figure: number of paths vs. timing slack, contrasting a ‘wall’ of slack with a ‘gradual slope’ slack; zero slack at nominal voltage and zero slack after voltage scaling are marked; rarely exercised paths and frequently exercised paths are labeled]

*Greskamp et al. “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009

**Kahng et al. “Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010

-9-

Recovery-Driven Design Methodology

Problem: minimize processor power (leakage + dynamic) for a target error rate

Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
• Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
• Error rate estimation using a simulation dump file (VCD)

-10-

Slack Redistribution

Redistribute slack from paths that rarely toggle to paths that frequently toggle

[Figure: four panels (a) to (d) of # paths vs. timing slack, with zero slack after scaling voltage marked and path groups P+ and P- labeled. Step labels: voltage scaling, upsize cells, downsize cells, downsize cells, iterate voltage scaling. Heuristic labels: OptimizePaths, ReducePower.]

-11-

Slack Redistribution Flow

Toggle Information: simulation dump file is loaded
Path Optimization: minimize error rate to extend range of voltage scaling
Power Reduction: downsize cells to obtain additional power savings
Error Rate Estimation: estimate with toggle info and STA results

[Flowchart boxes: Netlist and VCD inputs; Analyze activity; Timing Analysis; OptimizePaths; ReducePower; Compute Error Rate (ER); decision ER > ERtarget (YES / NO); Reduce Voltage; ECO P&R]
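The flow above can be sketched as a loop. This is a minimal illustration, not the authors' implementation: it assumes the voltage is lowered one step at a time until the estimated error rate exceeds the target, and that the last voltage meeting the target is kept. The function `recovery_driven_flow` and its three callables are hypothetical names introduced here.

```python
def recovery_driven_flow(v_nominal, v_step, er_target,
                         optimize_paths, reduce_power, error_rate):
    """Return the lowest voltage whose estimated error rate meets er_target.

    The three callables are hypothetical hooks that take the supply voltage.
    """
    v = v_nominal
    best_v = None
    while v > 0:
        optimize_paths(v)   # reshape slack toward frequently toggling paths
        reduce_power(v)     # downsize cells where the error rate allows it
        if error_rate(v) > er_target:
            break           # target exceeded: keep the previous voltage
        best_v = v
        v = round(v - v_step, 6)
    return best_v


def toy_error_rate(v):
    # Toy stand-in: no errors down to 0.90 V, then a linear increase.
    return max(0.0, (0.90 - v) * 0.1)


visited = []  # voltages at which the heuristics were invoked
v_min = recovery_driven_flow(1.00, 0.05, 0.00125,
                             optimize_paths=visited.append,
                             reduce_power=lambda v: None,
                             error_rate=toy_error_rate)
print(v_min, visited)
```

In a real flow the hooks would call timing analysis and the error rate estimator. Here they are stubs so that the loop structure can be run on its own.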

-12-

Heuristic Details – OptimizePaths

Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate

Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance $c_i$ in p
3. If the path slack is not improved, cell change is restored
4. Repeat 2. ~ 3. for all cell instances in path p
5. Repeat 2. ~ 4. for all critical paths

OptimizePaths → ReducePower → Voltage Scaling
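The procedure can be sketched on a toy timing model. The delay model (intrinsic delay inversely proportional to cell size, plus a load term from the next cell), the size list `SIZES`, and all names below are assumptions made for illustration. In the actual flow the slack would come from static timing analysis.

```python
SIZES = [1, 2, 4]  # hypothetical drive strengths available for every cell


def path_slack(path, size, period):
    """Toy slack: period minus the sum of cell delays along the path."""
    total = 0.0
    for i, cell in enumerate(path):
        load = size[path[i + 1]] if i + 1 < len(path) else 1
        total += 1.0 / size[cell] + 0.2 * load
    return period - total


def optimize_paths(paths, toggle_rate, size, period):
    """Upsize cells on critical paths, visiting paths by decreasing toggle rate."""
    critical = [n for n in paths if path_slack(paths[n], size, period) < 0]
    for name in sorted(critical, key=lambda n: toggle_rate[n], reverse=True):
        for cell in paths[name]:
            idx = SIZES.index(size[cell])
            if idx + 1 == len(SIZES):
                continue  # already at the largest size
            before = path_slack(paths[name], size, period)
            old = size[cell]
            size[cell] = SIZES[idx + 1]
            if path_slack(paths[name], size, period) <= before:
                size[cell] = old  # slack not improved: restore the cell
    return size


paths = {"p1": ["a", "b"], "p2": ["c"]}
toggle_rate = {"p1": 0.30, "p2": 0.05}
size = {"a": 1, "b": 1, "c": 1}
optimize_paths(paths, toggle_rate, size, period=2.0)
print(size, path_slack(paths["p1"], size, 2.0))
```

Only `p1` is critical in this example, so its cells are resized while `c` is left alone.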

-13-

Heuristic Details – ReducePower

Main idea: downsize cells on non-critical paths in order of decreasing sensitivity

$Sensitivity(c) = (power_c - power_{c'}) / (slack_c - slack_{c'})$

Procedure
1. Pick a cell c with maximum sensitivity
2. Downsize cell c with logically equivalent cell
3. Incremental timing analysis and check error rate
4. If error rate is increased, cell change is restored
5. Repeat 1. ~ 4.

OptimizePaths → ReducePower → Voltage Scaling
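A minimal sketch of this greedy loop follows, assuming that c' denotes the next smaller logically equivalent cell and that each cell carries a precomputed list of (power, slack) options. The option tables, the toy error rate function, and the `rejected` set used to guarantee termination are assumptions made for illustration.

```python
def sensitivity(options, idx):
    """Power saved per unit of slack given up when moving to the next smaller cell."""
    (p, s), (p2, s2) = options[idx], options[idx + 1]
    return (p - p2) / (s - s2) if s > s2 else float("inf")


def reduce_power(options, choice, error_rate):
    """Greedily downsize cells; undo any swap that increases the error rate."""
    rejected = set()
    while True:
        candidates = [c for c in options
                      if c not in rejected and choice[c] + 1 < len(options[c])]
        if not candidates:
            return choice
        cell = max(candidates, key=lambda c: sensitivity(options[c], choice[c]))
        before = error_rate(choice)
        choice[cell] += 1              # downsize
        if error_rate(choice) > before:
            choice[cell] -= 1          # error rate increased: restore
            rejected.add(cell)


# Per cell: (power, slack) from the current size down to the smallest size.
options = {
    "u1": [(4.0, 0.5), (2.0, 0.3), (1.0, -0.1)],
    "u2": [(4.0, 0.2), (3.5, -0.2)],
}
toggle = {"u1": 0.01, "u2": 0.20}


def toy_error_rate(choice):
    # Toy stand-in: sum the toggle rates of cells whose slack went negative.
    return sum(toggle[c] for c in choice if options[c][choice[c]][1] < 0)


choice = reduce_power(options, {"u1": 0, "u2": 0}, toy_error_rate)
total_power = sum(options[c][choice[c]][0] for c in choice)
print(choice, total_power)
```

In this example `u1` is downsized once. Every further swap would push a slack negative and raise the toy error rate, so it is restored.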

-14-

Path Extraction for Error Rate Estimation

Instead of simulation, we use toggle information from the value change dump (VCD) file

VCD file: list of toggled nets in each cycle time, with [value, net] entries under each [time]
#0 0a 0b 1x 1y
#1 1a 0x 0y
#2 …

[Figure: netlist with nets a, b, x, and y; waveform of clock, a, b, and y over cycles #0 to #4]

Extracted paths
a-x-y (@ cycle 1, 3)
b-y (@ cycle 2, 4)
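The extraction can be sketched as follows. The sketch reads only the simplified `#time` / `[value][net]` body shown on the slide, not a full VCD file with its header and identifier codes. It assumes a path counts as toggled in a cycle when every net on it changes value in that cycle. The dump entries from `#2` onward are a hypothetical continuation chosen to match the extracted paths listed above.

```python
def toggled_nets_per_cycle(vcd_body):
    """Map each cycle to the set of nets whose value changed in that cycle."""
    last = {}      # most recent value seen for each net
    toggles = {}   # cycle -> set of toggled nets
    cycle = None
    for token in vcd_body.split():
        if token.startswith("#"):
            cycle = int(token[1:])
            toggles[cycle] = set()
        else:
            value, net = token[0], token[1:]
            # The first value seen for a net is an initial value, not a toggle.
            if net in last and last[net] != value:
                toggles[cycle].add(net)
            last[net] = value
    return toggles


def cycles_where_path_toggles(path_nets, toggles):
    """Cycles in which every net on the path toggled."""
    needed = set(path_nets)
    return [c for c, nets in sorted(toggles.items()) if needed <= nets]


dump = "#0 0a 0b 1x 1y #1 1a 0x 0y #2 1b 1y #3 0a 1x 0y #4 0b 1y"
toggles = toggled_nets_per_cycle(dump)
print(cycles_where_path_toggles(["a", "x", "y"], toggles))
print(cycles_where_path_toggles(["b", "y"], toggles))
```

A single pass over the dump is enough, which is consistent with the slide's point that no further simulation is required once the VCD file exists.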

-15-

Toggle and Error Rate Calculation

20X faster than actual simulation and accurate

Toggle rate: $TR(p) = |\chi_{toggle}(p)| / X_{tot}$

Error rate: $ER = |\bigcup_{p \in P_n} \chi_{toggle}(p)| / X_{tot}$

p: path
$\chi_{toggle}(p)$: set of cycles in which p has toggled
$X_{tot}$: total number of cycles

*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
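The two quantities can be computed directly from per-path toggle-cycle sets. This is a small sketch under one stated assumption: the error rate counts the cycles in which at least one path with negative slack toggles. The path names and cycle sets are made-up example data.

```python
def toggle_rate(toggle_cycles, total_cycles):
    """Fraction of cycles in which the path toggled."""
    return len(toggle_cycles) / total_cycles


def error_rate(toggle_cycles_by_path, negative_slack_paths, total_cycles):
    """Fraction of cycles in which any negative-slack path toggled."""
    erroneous = set()
    for path in negative_slack_paths:
        erroneous |= toggle_cycles_by_path[path]
    return len(erroneous) / total_cycles


chi = {"a-x-y": {1, 3}, "b-y": {2, 4}, "c-y": {3}}
tr = toggle_rate(chi["a-x-y"], total_cycles=8)
er = error_rate(chi, negative_slack_paths=["a-x-y", "c-y"], total_cycles=8)
print(tr, er)
```

Taking the union of the cycle sets rather than summing per-path toggle rates means a cycle in which two failing paths toggle is counted as one erroneous cycle.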

-16-

Evaluation of Heuristic Design Choices

Path ordering
– toggle rate * slack
– toggle rate

Optimization radius
– path only
– fan-in/out network

Starting netlist
– loosely constrained
– tightly constrained

Voltage step size
– 0.01V and 0.05V

-17-

Outline

Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design

-18-

Design Methodology

[Flowchart boxes: Initial design (OpenSPARC T1); Benchmark generation (Simics); Input vector; Functional simulation (NC Verilog); Simulation result (.vcd); Library characterization (SignalStorm); Synopsys Liberty (.lib); Design information (.v .spef); PrimeTime; Tcl Socket I/F; Power Optimizer; List of swaps; ECO P&R (SOCEncounter); Final design]

System-level simulation using Simics with real benchmarks

Gate-level simulation to get signal toggle information (NC Verilog)

Prepare Synopsys Liberty file using Cadence SignalStorm

Implement in C++ and use Tcl socket to communicate with PrimeTime

Perform ECO P&R with cell swap list

-19-

Power Analysis for Real Workloads

[Flowchart stages: system-level simulation (Simics + Transplant); functional simulation (VCS or NCVerilog); design implementation (DC, SOCE); memory modeling (MEMGEN, CACTI); power analysis (PrimeTime-PX). Data labels: RTL design (OpenSPARC); benchmark binary (bzip, twolf ...); input pattern; VCD; netlist, SPEF; Liberty (.lib)]

System-level simulation runs real benchmark binaries, and the input patterns are captured

Estimate power of memory: MEMGEN, CACTI

Analyze leakage and dynamic power using PT-PX

-20-

Testbed

Target design: sub-modules of OpenSPARC T1

Benchmark: ammp, bzip2, equake, twolf, sort. Fast-forward, capture vectors

Implementation: TSMC 65GP technology with standard SP&R

Alternative design techniques:
– SP&R with loose constraints and tight constraints
– Slack Optimizer (make a “gradual slope”) [ASPDAC2010]

-21-

Power Consumption of Each Design Technique

Power savings compared to traditional SP&R design

25% power savings @ 0.125% error rate (average)

Area overhead and power savings (from loose SP&R)

                               Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                  25.9%        3.7%              7.7%
Power savings @ 0.125% error   12%          14%               25%

[Figure: LSU_STB_CTL; axis label: Error rate (%)]

-22-

Power Consumption for HW-Based Error Tolerance

Razor architecture was assumed for error detection and correction: account for Razor overhead (area, power) and power cost of error correction

21% additional power savings

[Figure: LSU_STB_CTL; voltage labels 0.84V and 0.76V]

-23-


We propose recovery-driven design which minimizes power for a target timing error rate
– Optimize designs with functional information and iterative voltage scaling
– We also develop a fast and accurate technique for post-layout activity and error rate estimation

We demonstrate significant power benefits: up to 25% power savings compared to traditional P&R at an error rate of 0.125%

Ongoing work
– Recovery-driven design for different error resilience mechanisms, different sources of variation
– Design / architecture co-exploration

-24-

Thank you

-25-

BACKUP

-26-

Related Work: BlueShift

BlueShift*: maximize frequency for a given error rate

BlueShift speedup
– Paths with the highest frequency of timing errors
– FBB (forward body-biasing) & Timing override

Limitation
– Repetitive gate-level simulation: impractical
– Design overhead of FBB

[Flowchart boxes: Gate-level simulation; Compute error rate; decision ER < Target (YES / NO); Speed up paths; Finish]

*Greskamp et al. “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009

-27-

Exploiting Error Resilience for Multi-core Design

Design of heterogeneously reliable multi-core processor

• Power-optimized for different mixes of workloads

• Power-optimized for different reliability targets

Individual cores are customized for a specific workload class

-28-

Lifetime Energy Minimization

Maximizing energy efficiency of DVFS-based designs
– Inefficiency is due to a design optimized for a single power / performance point
– Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)

• Replication-based methodology: area overhead vs. power tradeoffs

• Co-optimization methodology: optimize design with two operating constraints – (freq_hi, V_hi) and (freq_lo, V_lo)

• The two methodologies can be applied alternatively, per sub-module

-29-

Sensitivity-Based Optimization Platform

Post-layout stage cell swap
– Cell sizing + ECO
– Multi-Vt swap
– Multi-Lgate swap

Swap cell and check STA with PrimeTime socket interface

Cell swap according to the sensitivity S
– For leakage optimization, S = Δleakage x slack
– For timing closure, S = Δslack / (slack – WNS)

MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets

[Figure label: Lgate biasing]

-30-

Limitations of Traditional CAD Flow

In modern digital design, the vast majority of paths have near-critical slack: the ‘wall of slack’ distribution

Scaling beyond a critical operating point causes massive errors and power benefits can be limited*

Error rate = (# cycles which have timing error) / (# total cycles)

[Figure: number of paths vs. timing slack, showing the ‘wall of slack’ near zero slack; error rate vs. lower voltage (higher frequency) with the operating point marked. Error rate: 0.0% at 1.00V, 1.0% at 0.95V, 20.0% at 0.90V]

*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
