-1-
UCSD VLSI CAD Laboratory and UIUC PASSAT Group
Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules
Andrew B. Kahng†, Seokhyeong Kang†, Rakesh Kumar‡ and John Sartori‡
†VLSI CAD Laboratory, UCSD
‡PASSAT Group, UIUC
DAC, June 17, 2010
-2-
Outline
Background and Motivation
– Voltage scaling and error-tolerant design
– Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation
Experimental Framework and Results
– Design methodology
– Results and analysis
Conclusions and Ongoing Work
-3-
Reducing Power with Voltage Scaling
Power is a first-order design constraint
– Moore’s law implies power density of processors continues to escalate
Voltage scaling reduces power but eventually causes massive timing violations
[Figure: power vs. voltage (lower voltage); annotation: timing errors begin to occur]
Error-resilience allows deeper voltage scaling
-4-
Error-Tolerance Mechanisms
Traditional IC design
• No errors allowed
• Overclocking and voltage overscaling not enabled
Error-Tolerant design
• Error correction architecture allows timing errors
• Overclocking and voltage overscaling enabled
Hardware error-tolerance
– Errors are detected and corrected during runtime
– Razor (MICRO 2003)
Application-level error-tolerance*
– Errors are allowed to propagate to software, resulting in reduced performance or output quality
[Figure: energy per instruction, error rate, and reduction in computation speed vs. voltage scaling (lower voltage); the energy minimum is marked at ~0.04% error rate and ~0.2% speed reduction]
*Hegde et al. “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999
-5-
Our Work: From Error-Tolerance to Recovery-Driven
Error-Tolerant design
• Design still optimized for correct operation
• Design methodology based on STA, workload-agnostic
Recovery-Driven design
• Designed “from ground up” for specific target error rate
• Design methodology exploits functional information
-6-
Recovery-Driven Design
1. Minimize error rate to extend range of voltage scaling
2. Reduce design power with cell downsizing or Vt swap
[Figure: error rate (traditional, optimized) and power (traditional, optimized) vs. lower voltage, with the target error rate marked. Step 1 is labeled OptimizePaths and step 2 ReducePower. The operating point (Vmin, Pmin) moves to a new operating point (Vmin, Pmin).]
How to minimize power in recovery-driven design?
-7-
Outline
Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation
Experimental Framework and Results
– Design methodology
– Results and analysis
Conclusions and Ongoing Work
-8-
Related Works: Design-Level Optimizations for Error-Tolerant Processors
BlueShift*
– Increase frequency up to a target error rate
– Speed up error paths with timing overrides and FBB
Slack Optimizer**
– Make gradual slope slack to achieve gracefully increasing error rate
– Estimate error rate using switching activity from SAIF
[Figure: number of paths vs. timing slack, contrasting a ‘wall’ of slack with a ‘gradual slope’ of slack; zero slack at nominal voltage and zero slack after voltage scaling are marked; regions are labeled frequently exercised paths and rarely exercised paths]
*Greskamp et al. “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009
**Kahng et al. “Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010
-9-
Recovery-Driven Design Methodology
Problem: minimize processor power (leakage + dynamic) for a target error rate
Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
• Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
• Error rate estimation using a simulation dump file (VCD)
-10-
Slack Redistribution
Redistribute slack from paths that rarely toggle to paths that frequently toggle
[Figure: four panels (a) to (d) of # paths vs. timing slack, with zero slack after scaling voltage marked. Labels: P+, P-, voltage scaling, upsize cells, downsize cells, iterate voltage scaling, OptimizePaths, ReducePower]
-11-
Slack Redistribution Flow
Toggle Information: simulation dump file is loaded
Path Optimization: minimize error rate to extend range of voltage scaling
Power Reduction: downsize cells to obtain additional power savings
Error Rate Estimation: estimate with toggle info and STA results
[Flowchart: inputs Netlist and VCD; blocks Analyze activity, Timing Analysis, OptimizePaths, ReducePower, Compute Error Rate (ER), Reduce Voltage, ECO P&R; decision ER > ERtarget (YES / NO)]
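The iterative voltage scaling in this flow can be sketched roughly as below. This is one plausible reading of the flowchart, not the authors' implementation: `scale_voltage`, its `optimize` and `error_rate_at` callbacks, the `v_floor` limit, and the 2% target are all hypothetical. The error-rate table reuses the 0.0% / 1.0% / 20.0% figures that this deck quotes for a design with a wall of slack.

```python
def scale_voltage(v_start, v_step, er_target, optimize, error_rate_at, v_floor=0.5):
    """Lower the voltage step by step while the error rate stays within target.

    optimize(v) and error_rate_at(v) are hypothetical callbacks standing in
    for OptimizePaths/ReducePower and for the error rate estimator.
    """
    v = v_start
    while v - v_step >= v_floor:
        trial = round(v - v_step, 6)      # avoid float drift in the step
        optimize(trial)                   # reshape slack at the trial voltage
        if error_rate_at(trial) > er_target:
            break                         # trial voltage violates the target
        v = trial                         # accept the lower voltage
    return v


# Error rates quoted in this deck for a design with a 'wall of slack'.
WALL_OF_SLACK = {1.0: 0.0, 0.95: 0.01, 0.9: 0.20}

v_min = scale_voltage(
    v_start=1.0,
    v_step=0.05,
    er_target=0.02,                       # hypothetical 2% target
    optimize=lambda v: None,              # no-op: this toy does no optimization
    error_rate_at=lambda v: WALL_OF_SLACK[v],
)
print(v_min)
```

With no optimization, the loop stops one step above the voltage at which the error rate jumps past the target. The point of the heuristics is to push that stopping voltage lower.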
-12-
Heuristic Details – OptimizePaths
Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate
Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance c_i in p
3. If the path slack is not improved, cell change is restored
4. Repeat 2. ~ 3. for all cell instances in path p
5. Repeat 2. ~ 4. for all critical paths
OptimizePaths → ReducePower → Voltage Scaling
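The procedure above can be sketched as follows. This is a minimal illustration rather than the authors' tool. It assumes that "critical" means negative slack, that resizing means one upsizing step, and it uses a toy delay model (`toy_slack`) in which upsizing a cell speeds it up but loads its driver. All names and constants are hypothetical.

```python
def optimize_paths(paths, toggle_rate, sizes, path_slack, max_size=4):
    """Upsize cells on critical paths, visiting paths by decreasing toggle rate."""
    critical = [p for p in paths if path_slack(p, sizes) < 0]
    for p in sorted(critical, key=lambda q: toggle_rate[q], reverse=True):
        for cell in paths[p]:
            if sizes[cell] >= max_size:
                continue
            before = path_slack(p, sizes)
            sizes[cell] += 1                      # try one upsizing step
            if path_slack(p, sizes) <= before:    # slack not improved
                sizes[cell] -= 1                  # restore the cell
    return sizes


# Toy timing model (assumption): delay = 1 + load / size, where the load of a
# cell is the size of the next cell on the path, or OUT_LOAD for the last one.
CLOCK = 7.0
OUT_LOAD = 4.0
PATHS = {"p_hot": ["u1", "u2", "u3"], "p_cold": ["u4", "u5"]}


def toy_slack(p, sizes):
    cells = PATHS[p]
    delay = 0.0
    for i, c in enumerate(cells):
        load = sizes[cells[i + 1]] if i + 1 < len(cells) else OUT_LOAD
        delay += 1.0 + load / sizes[c]
    return CLOCK - delay


start = {c: 1 for cells in PATHS.values() for c in cells}
final_sizes = optimize_paths(PATHS, {"p_hot": 0.5, "p_cold": 0.01}, start, toy_slack)
print(final_sizes, toy_slack("p_hot", final_sizes))
```

In this toy run, upsizing `u2` is rejected because the extra load on `u1` cancels the gain, which is the restore case in step 3.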
-13-
Heuristic Details – ReducePower
Main idea: downsize cells on non-critical paths in order of decreasing sensitivity
$\mathrm{Sensitivity}(c) = (power_c - power_{c'}) / (slack_c - slack_{c'})$, where c' denotes cell c after downsizing
Procedure
1. Pick a cell c with maximum sensitivity
2. Downsize cell c with logically equivalent cell
3. Incremental timing analysis and check error rate
4. If error rate is increased, cell change is restored
5. Repeat 1. ~ 4.
OptimizePaths → ReducePower → Voltage Scaling
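A rough sketch of this loop is given below, under stated assumptions. The `power`, `slack`, and `error_rate` callbacks are hypothetical stand-ins for library data, incremental timing analysis, and the error rate estimator. The toy model (delay = base / size, power = size) is invented for illustration. Cells whose downsizing was rejected are not retried, which is a simplification added here so the loop terminates.

```python
def reduce_power(cells, sizes, power, slack, error_rate):
    """Greedily downsize cells by decreasing sensitivity, undoing harmful swaps."""
    baseline = error_rate(sizes)
    rejected = set()

    def sensitivity(c):
        smaller = dict(sizes)
        smaller[c] -= 1
        d_power = power(c, sizes[c]) - power(c, smaller[c])
        d_slack = slack(c, sizes) - slack(c, smaller)
        return d_power / d_slack if d_slack > 0 else float("inf")

    while True:
        candidates = [c for c in cells if sizes[c] > 1 and c not in rejected]
        if not candidates:
            return sizes
        c = max(candidates, key=sensitivity)
        sizes[c] -= 1                        # downsize to the next smaller cell
        if error_rate(sizes) > baseline:     # error rate increased
            sizes[c] += 1                    # restore the cell
            rejected.add(c)


# Toy model (assumption): cell delay = BASE / size, cell power = size.
CLOCK = 10.0
BASE = {"u1": 8.0, "u2": 6.0, "u3": 6.0}
PATHS = {"hot": ["u1"], "cold": ["u2", "u3"]}
TOGGLES = {"hot": {1, 2, 3}, "cold": {4}}    # cycles in which each path toggles
TOTAL_CYCLES = 4


def path_slack(p, sizes):
    return CLOCK - sum(BASE[c] / sizes[c] for c in PATHS[p])


def cell_slack(c, sizes):
    return min(path_slack(p, sizes) for p, cs in PATHS.items() if c in cs)


def cell_power(c, size):
    return float(size)


def toy_error_rate(sizes):
    failing = set()
    for p in PATHS:
        if path_slack(p, sizes) < 0:
            failing |= TOGGLES[p]
    return len(failing) / TOTAL_CYCLES


final = reduce_power(list(BASE), {"u1": 2, "u2": 2, "u3": 2},
                     cell_power, cell_slack, toy_error_rate)
print(final)
```

Here `u3` keeps its size because downsizing it after `u2` would push the `cold` path into negative slack and raise the error rate.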
-14-
Path Extraction for Error Rate Estimation
Instead of simulation, we use toggle information from value change dump (VCD) file
[Figure: netlist with nets a, b, x, y and its waveform (clock, a, b, y) over cycles #0 to #4]
VCD file: list of toggled nets in each cycle time, with entries [time] and [value, net]
#0
0a 0b 1x 1y
#1
1a 0x 0y
#2
…
Extracted paths
a-x-y (@ cycle 1, 3)
b-y (@ cycle 2, 4)
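As a minimal sketch of this extraction, the code below parses the simplified dump format shown on the slide (a `#time` marker followed by value-plus-net tokens) and reports the cycles in which every net of a path toggled. It makes several assumptions: time 0 is treated as initialization, a path counts as toggled only when all of its nets change in the same cycle, real VCD identifier codes and vector values are ignored, and the entries for cycles 2 to 4 in `SAMPLE` are made up to match the extracted paths listed on the slide.

```python
def toggled_nets_per_cycle(vcd_body):
    """Map each timestamp to the set of nets that changed value at that time."""
    toggles = {}
    time = None
    for token in vcd_body.split():
        if token.startswith("#"):
            time = int(token[1:])
            toggles.setdefault(time, set())
        elif time is not None:
            toggles[time].add(token[1:])     # token is value + net, e.g. "1a"
    return toggles


def path_toggle_cycles(toggles, paths):
    """For each path 'n1-n2-...', list cycles in which all of its nets toggled."""
    result = {}
    for path in paths:
        nets = path.split("-")
        result[path] = {t for t, changed in toggles.items()
                        if t > 0 and all(n in changed for n in nets)}
    return result


# Cycles 0 and 1 follow the slide; cycles 2 to 4 are hypothetical.
SAMPLE = """
#0 0a 0b 1x 1y
#1 1a 0x 0y
#2 1b 1y
#3 0a 1x 0y
#4 0b 1y
"""

extracted = path_toggle_cycles(toggled_nets_per_cycle(SAMPLE), ["a-x-y", "b-y"])
print(extracted)
```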
-15-
Toggle and Error Rate Calculation
20X faster than actual simulation and accurate
Toggle rate: $TR(p) = |\chi_{toggle}(p)| / X_{tot}$
Error rate: $ER = |\bigcup_{p \in P_n} \chi_{toggle}(p)| / X_{tot}$
p: path
$\chi_{toggle}(p)$: set of cycles in which p has toggled
$X_{tot}$: total number of cycles
$P_n$: set of paths over which the union is taken (not defined on the slide; presumably the negative-slack paths)
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
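The two ratios can be computed directly from per-path cycle sets, as in the sketch below. It assumes the error rate counts each cycle once even if several failing paths toggle in it (a set union), and that the failing paths are supplied by the caller, for example those with negative slack. The path `a-x` and the cycle count of 4 are hypothetical.

```python
def toggle_rate(toggle_cycles, total_cycles):
    """Fraction of cycles in which a path toggles."""
    return len(toggle_cycles) / total_cycles


def error_rate(path_toggles, failing_paths, total_cycles):
    """Fraction of cycles in which at least one failing path toggles."""
    failing_cycles = set()
    for p in failing_paths:
        failing_cycles |= path_toggles[p]    # union, so shared cycles count once
    return len(failing_cycles) / total_cycles


PATH_TOGGLES = {"a-x-y": {1, 3}, "b-y": {2, 4}, "a-x": {1, 3}}
TOTAL_CYCLES = 4

print(toggle_rate(PATH_TOGGLES["a-x-y"], TOTAL_CYCLES))
print(error_rate(PATH_TOGGLES, ["a-x-y", "a-x"], TOTAL_CYCLES))
print(error_rate(PATH_TOGGLES, ["a-x-y", "b-y"], TOTAL_CYCLES))
```

Because `a-x-y` and `a-x` toggle in the same cycles, marking both as failing gives the same error rate as marking either one alone.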
-16-
Evaluation of Heuristic Design Choices
Path ordering
– toggle rate * slack
– toggle rate
Optimization radius
– path only
– fan-in/out network
Starting netlist
– loosely constrained
– tightly constrained
Voltage step size
– 0.01V and 0.05V
-17-
Outline
Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation
Experimental Framework and Results
– Design methodology
– Results and analysis
Conclusions and Ongoing Work
-18-
Design Methodology
[Flow diagram. Blocks: Initial design (OpenSPARC T1), Benchmark generation (Simics), Functional simulation (NC Verilog), Library characterization (SignalStorm), PrimeTime, Power Optimizer, ECO P&R (SOCEncounter), Final design. Data and interfaces: input vector, simulation result (.vcd), design information (.v, .spef), Synopsys Liberty (.lib), Tcl socket I/F, list of swaps]
System level simulation using Simics with real benchmarks
Gate level simulation to get signal toggle information (NC Verilog)
Prepare Synopsys Liberty file using Cadence SignalStorm
Implement in C++ and use Tcl socket to communicate with PrimeTime
Perform ECO P&R with cell swap list
-19-
Power Analysis for Real Workloads
[Flow diagram. Stages: system-level simulation (Simics + Transplant), functional simulation (VCS or NCVerilog), design implementation (DC, SOCE), memory modeling (MEMGEN, CACTI), power analysis (PrimeTime-PX). Data: RTL design (OpenSPARC), benchmark binary (bzip, twolf ...), input pattern, VCD, netlist, SPEF, Liberty (.lib)]
System-level simulation is run with real benchmark binaries, and input patterns are captured
Estimate power of memory – MEMGEN, CACTI
Analyze leakage and dynamic power using PT-PX
-20-
Testbed
Target design: sub-modules of OpenSPARC T1
Benchmark: ammp, bzip2, equake, twolf, sort. Fast-forward, capture vectors
Implementation: TSMC 65GP technology with standard SP&R
Alternative design techniques:
– SP&R with loose constraints and tight constraints
– Slack Optimizer (make a “gradual slope”) [ASPDAC2010]
-21-
Power Consumption of Each Design Technique
Power savings compared to traditional SP&R design
25% power savings @ 0.125% error rate (average)
[Figure: plot for module LSU_STB_CTL; x-axis: error rate (%)]
Area overhead and power savings (from loose SP&R)
                               Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                  25.9%        3.7%              7.7%
Power savings @ 0.125% error   12%          14%               25%
-22-
Power Consumption for HW-Based Error Tolerance
Razor architecture was assumed for error detection and correction – account for Razor overhead (area, power) and power cost of error correction
[Figure: plot for module LSU_STB_CTL; annotations: 0.84V, 0.76V]
21% additional power savings
-23-
Conclusions and Ongoing Work
We propose recovery-driven design which minimizes power for a target timing error rate
– Optimize designs with functional information and iterative voltage scaling
– We also develop a fast and accurate technique for post-layout activity and error rate estimation
We demonstrate significant power benefits – up to 25% power savings compared to traditional P&R at an error rate of 0.125%
Ongoing work
– Recovery-driven design for different error resilience mechanisms, different sources of variation
– Design / architecture co-exploration
-24-
Thank you
-25-
BACKUP
-26-
Related Work: BlueShift
BlueShift*: maximize frequency for a given error rate
BlueShift speedup
– Paths with the highest frequency of timing errors
– FBB (forward body-biasing) & Timing override
Limitation
– Repetitive gate level simulation – impractical
– Design overhead of FBB
[Flowchart: blocks Gate-level simulation, Compute error rate, Speed up paths, Finish; decision ER < Target (YES / NO)]
*Greskamp et al. “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009
-27-
Exploiting Error Resilience for Multi-core Design
Design of heterogeneously reliable multi-core processor
• Power-optimized for different mixes of workloads
• Power-optimized for different reliability target
Individual cores are customized for a specific workload class
-28-
Lifetime Energy Minimization
Maximizing energy efficiency of DVFS-based designs
– Inefficiency is due to a design optimized for a single power / performance point
– Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)
• Replication-based methodology: area overhead vs. power tradeoffs
• Co-optimization methodology: optimize design with two operating constraints – (freq_hi, V_hi) and (freq_lo, V_lo)
• The two methodologies can be applied alternatively in each sub-module
-29-
Sensitivity-Based Optimization Platform
Post-layout stage cell swap
– Cell sizing + ECO
– Multi-Vt swap
– Multi-Lgate swap
Swap cell and check STA with PrimeTime socket interface
Cell swap according to the sensitivity S
– For leakage optimization, S = Δleakage x slack
– For timing closure, S = Δslack / (slack – WNS)
MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets
[Figure label: Lgate biasing]
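The two sensitivity functions can be written down directly, as in this small sketch. The function names, the sample swap data, and the `eps` guard are assumptions added here. The guard keeps the timing-closure ratio finite for a cell sitting exactly at the worst negative slack (WNS), a case the slide does not address.

```python
def leakage_sensitivity(delta_leakage, slack):
    """S = delta_leakage x slack: prefer big leakage savings on slack-rich cells."""
    return delta_leakage * slack


def timing_sensitivity(delta_slack, slack, wns, eps=1e-9):
    """S = delta_slack / (slack - WNS); eps (assumption) avoids division by zero."""
    return delta_slack / (slack - wns + eps)


# Hypothetical candidate swaps: (cell, leakage saving in nW, slack in ns).
SWAPS = [("u1", 5.0, 0.2), ("u2", 3.0, 0.5), ("u3", 4.0, 0.1)]

ranked = sorted(SWAPS, key=lambda s: leakage_sensitivity(s[1], s[2]), reverse=True)
order = [name for name, _, _ in ranked]
print(order)
```

For timing closure, a cell at the WNS gets a very large sensitivity under this guard, so it is visited before cells with more slack.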
-30-
Limitations of Traditional CAD Flow
In modern digital design, vast majority of paths have near-critical slack – wall of slack distribution
Scaling beyond a critical operating point causes massive errors and power benefits can be limited*
[Figure: number of paths vs. timing slack, showing a ‘wall of slack’ near zero slack; error rate vs. lower voltage (higher frequency), with the operating point marked]
Error rate = (# cycles which have timing error) / (# total cycles)
0.0 % at 1.00V
1.0 % at 0.95V
20.0 % at 0.90V
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.