Bypass Aware Instruction Scheduling for Register File Power Reduction
Sanghyun Park [2], Aviral Shrivastava [1], Nikil Dutt [1], Alex Nicolau [1], Yunheung Paek [2], Eugene Earlie [3]
[1] CECS, ICS, UC Irvine, CA, USA
[2] SEE, SNU, Seoul, Korea
[3] SCL, Intel, Hudson, MA, USA
2 Copyright © 2006 UCI ACES Laboratory http://www.cecs.uci.edu/~aces

Processor Power
  Power is now a primary architectural concern
    E.g., processor power consumption doubles with each Pentium generation
  High power consumption
    Increases packaging/cooling cost
    Limits achievable performance
  Especially important for handheld embedded devices
    Battery life
    Weight
  Figure: cost of removing heat from a microprocessor ("Managing the Impact of Increasing…", Gunther, Binns et al., Intel Technology Journal)
  Figure: increasing power consumption (Intel website, http://www.intel.com)

Power Density
  Power density = power per unit area
  Silicon is not a good conductor of heat
    Areas with high power density become hot
    Higher temperature increases leakage
  Positive feedback loop, possibly leading to thermal runaway
  Important to distribute power over the die
    Must "attack" hot-spots [Fred Pollack, Intel Corp, MICRO 32 keynote]
    Heat Stroke: have to stop if any part of the die exceeds the critical temperature
  Research beginning to address power density
    Temperature-aware floorplanning
      Surround high power density components with low power density components
    Migrate tasks across cores
      Distribute heat-intensive tasks across the die
    Many other efforts…

Register File Power
  Register file (RF) is a significant source of power dissipation
    Motorola M.CORE: approx. 16% of processor power
    RF may consume up to 25% of processor power
  High register file power density
    Small size, causes hotspots, e.g., Alpha 21264, Intel Pentium
  Trend: increasing RF power due to
    Microarchitectural enhancements to improve IPC
    Compiler techniques to improve IPC
    Large register files (esp. VLIW processors)

Heat Stroke from RF accesses
  Example
    Label1: add $1, $2, $3
            br Label1
  Repeated access to the register file at a high rate
  Creates repeated hot spots at the register file
  Heat-up time is short (1.2 ms), cooling time is long (12 ms)
  Degrades CPU utilization to 10%
  Slide from "Heat Stroke: Power-Density-Based Denial of Service in SMT", Jahangir Hasan et al., ISHPC 2005
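The 10% utilization figure is roughly what a simple duty-cycle estimate gives. The sketch below assumes (this is not stated on the slide) that the core does useful work only during the heat-up interval and is stalled for the whole cooling interval; the function name is hypothetical.

```python
def duty_cycle(heat_up_ms: float, cool_down_ms: float) -> float:
    """Fraction of time the core runs, assuming it is fully stalled while cooling."""
    return heat_up_ms / (heat_up_ms + cool_down_ms)


# Numbers taken from the slide: 1.2 ms to heat up, 12 ms to cool down.
utilization = duty_cycle(1.2, 12.0)
print(f"estimated utilization: {utilization:.1%}")
```

Under that assumption the estimate is 1.2 / 13.2, about 9%, consistent with the "10%" quoted above.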

Outline
  Previous work in reducing RF power
    On-Demand RF Read
  Instruction scheduling technique for RF power reduction
  Experiments
  Summary

Reducing RF Power: Related Work
  Evaluation/estimation of RF power and RF power density [ISLPED 98], [TCAD 01], [DATE 02]
  Three ways to reduce RF power
    1. Reduce energy per access to RF
    2. Reduce # registers in RF
    3. Reduce # accesses to RF
  1. Reduce energy per access to RF
    Register File Design Considerations…, Farkas, Jouppi, Chow, WRL Research Report, 1995
    The Energy Complexity of Register Files, Zyuban, Kogge, ISLPED 1998
    Energy Efficient Register Access, Tseng, Asanovic, SBCCI 2000

Reducing RF Power: Related Work
  2. Reduce # registers in RF
    Instruction scheduling to minimize # overlapping live ranges
      Power-Aware Modulo Scheduling, Yun, Kim, ISLPED 2001
      Lifetime-Sensitive Modulo Scheduling, Huff, PLDI 1993
      Stage Scheduling…, Eichenberger, Davidson, MICRO 1995
  3. Reduce # accesses to RF
    Hierarchical register file
      Reducing the Complexity of RF…, Balasubramonian et al., MICRO 2001
    Most lifetimes are short: temporarily hold register value in a buffer
      Reducing Register File Power…, Hu, Martonosi, Workshop on Complexity-Effective Design 2000
      Reducing Register Ports using…, Kim, Mudge, ICS 2003

"On-Demand" RF Read
  Existing processors anticipatorily read the RF
    e.g., Pentium 4, Alpha 21264
  SpecInt95 running on MIPS II
    36% of operands come from bypasses
  8-issue SimpleScalar running SpecInt2K
    50-70% of operands come from bypasses
  Read from RF only if necessary (Tseng & Asanovic, SBCCI 2000)
    First find out if the value is present in the bypasses
    If not, then read the value from the RF
    We'll call this "On-Demand RF Read"
  When applied to the Intel XScale model
    58% energy reduction
    < 3% performance loss
  This paper: further reduction in RF power by instruction scheduling
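The on-demand read rule (check the bypasses first, touch the RF only on a miss) can be sketched in a few lines. This is a software illustration only, not the hardware design from the cited work; the function name, the dictionaries standing in for bypass latches and the register file, and the register values are all hypothetical.

```python
def read_operand(reg, bypass_values, register_file, counters):
    """Return the value of `reg`, accessing the RF only when no bypass holds it."""
    if reg in bypass_values:
        counters["bypass_reads"] += 1      # value forwarded, RF port stays idle
        return bypass_values[reg]
    counters["rf_reads"] += 1              # bypass miss: pay for an RF access
    return register_file[reg]


# Illustrative state while SUB R4 R5 R1 sits in the operand-read stage:
# R1 was just produced and is on a bypass, the RF copy of R1 is still stale.
register_file = {"R1": 0, "R5": 20}
bypass_values = {"R1": 14}
counters = {"bypass_reads": 0, "rf_reads": 0}

operands = [read_operand(r, bypass_values, register_file, counters)
            for r in ("R5", "R1")]
print(operands, counters)
```

With an anticipatory read, both operands would have cost an RF access; here only R5 does.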

Processor Model
  Figure: partially bypassed processor (pipeline stages F, D, OR, X1, X2, WB; register file RF)
  Pipeline bypasses
    Improve performance
  Full bypassing
    Best performance, but high power and wiring complexity
  Partial bypassing
    Keep only some bypasses
    Popular in embedded processors, e.g., Intel XScale

Operation Execution Model
  On-Demand RF Read
    Read source operands, bypass result, write back
  Figure: Add R1 R2 R3 moving through the pipeline (stages F, D, OR, X1, X2, WB; register file RF)
    OR: read R2, R3 from RF and bypasses
    X1: bypass R1 to second port of OR
    X2: do nothing
    WB: write back R1 to RF

How can scheduling help?
  Schedule 1
    Add R1 R2 R3
    ADD R10 R11 R12
    SUB R4 R5 R1
    SUB CANNOT use bypass to read R1
  Schedule 2
    Add R1 R2 R3
    SUB R4 R5 R1
    ADD R10 R11 R12
    SUB CAN use bypass to read R1
  Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF
  Instruction scheduling can reduce RF usage!
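The effect of the two orderings can be checked with a toy RF-read counter. This is a sketch under simplifying assumptions that are not from the slides: one instruction issues per cycle with no stalls, and an operand comes from a bypass only when its producer is the immediately preceding instruction (`bypass_distances = {1}`). `Instr` and `rf_reads` are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def rf_reads(schedule, bypass_distances=frozenset({1})):
    """Count source operands that must be read from the RF in this order."""
    last_writer = {}                      # register -> position of latest producer
    reads = 0
    for pos, ins in enumerate(schedule):
        for src in ins.srcs:
            producer = last_writer.get(src)
            # No producer in the block, or producer too far away: RF read.
            if producer is None or pos - producer not in bypass_distances:
                reads += 1
        last_writer[ins.dest] = pos
    return reads


add1 = Instr("add", "R1", ("R2", "R3"))
add2 = Instr("add", "R10", ("R11", "R12"))
sub = Instr("sub", "R4", ("R5", "R1"))

separated = rf_reads([add1, add2, sub])   # SUB two slots after its producer
adjacent = rf_reads([add1, sub, add2])    # SUB right after its producer
print(separated, adjacent)
```

In this model the second ordering saves exactly one RF read, the read of R1 by SUB.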

Bypass-sensitive RF Power-Aware Scheduling
  Schedule instructions so that
    Dependent instructions transfer operands using bypasses
    RF usage is reduced
  Compiler needs to know
    When does an instruction bypass its result?
    Which operands can read the result?
    When is the result written into the register file?
  Figure: the two orderings of Add R1 R2 R3, ADD R10 R11 R12, SUB R4 R5 R1 on the pipeline (stages F, D, OR, X1, X2, WB; register file RF)
  Compiler needs a detailed processor-operation model

Operation Table (OT)
  Model all the resources and registers used by an operation in each cycle of its execution
  Can determine which operands are available for each source operand
  Use OTs for scheduling to reduce the usage of RF
  Operation Table for ADD R1 R2 R3
    1. F
    2. D
    3. OR
         ReadOperands
           R2: C1 RF
           R3: C2 RF, C4 X1
         DestOperands
           R1: RF
    4. X1
         WriteOperands
           R1: C4 OR
    5. X2
    6. WB
         WriteOperands
           R1: C3 RF
  Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF and connections C1, C2, C3, C4
  Operation Tables for Scheduling in Partially Bypassed Processors, Shrivastava, Earlie, Dutt, Nicolau, CODES+ISSS 2004
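One way to hold the table above in a program is a list of per-cycle records. This is only a sketch of a possible encoding, not the data structure used in the cited paper; the class names, field names, and helper functions are hypothetical, while the stage names, registers, and connections C1 to C4 are copied from the table.

```python
from dataclasses import dataclass, field


@dataclass
class OperandPath:
    connection: str   # e.g. "C1"
    endpoint: str     # where the value comes from or goes to, e.g. "RF", "X1"


@dataclass
class OTCycle:
    stage: str
    read_operands: dict = field(default_factory=dict)    # reg -> [OperandPath]
    dest_operands: dict = field(default_factory=dict)    # reg -> final location
    write_operands: dict = field(default_factory=dict)   # reg -> [OperandPath]


# Operation Table for ADD R1 R2 R3, one entry per cycle.
ADD_R1_R2_R3 = [
    OTCycle("F"),
    OTCycle("D"),
    OTCycle("OR",
            read_operands={"R2": [OperandPath("C1", "RF")],
                           "R3": [OperandPath("C2", "RF"),
                                  OperandPath("C4", "X1")]},
            dest_operands={"R1": "RF"}),
    OTCycle("X1", write_operands={"R1": [OperandPath("C4", "OR")]}),
    OTCycle("X2"),
    OTCycle("WB", write_operands={"R1": [OperandPath("C3", "RF")]}),
]


def bypass_sources(table, reg):
    """Endpoints other than the RF from which source `reg` can be read."""
    return [p.endpoint for cycle in table
            for p in cycle.read_operands.get(reg, [])
            if p.endpoint != "RF"]


def rf_write_cycle(table, reg):
    """1-based cycle in which `reg` is written into the RF, or None."""
    for number, cycle in enumerate(table, start=1):
        if any(p.endpoint == "RF" for p in cycle.write_operands.get(reg, [])):
            return number
    return None


print(bypass_sources(ADD_R1_R2_R3, "R3"), rf_write_cycle(ADD_R1_R2_R3, "R1"))
```

The two helpers answer two of the questions a scheduler has to ask: which source operands have a bypass path at all, and when the result finally reaches the RF.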

OT-based RF Power-Aware Scheduling
  Operation Tables (OTs) provide a mechanism
    To accurately estimate the number of operands read from RF
  Exploit OTs for scheduling to reduce RF usage
    Various scheduling strategies can be employed
    Choose scheduling heuristic with the least RF usage
  We evaluated 3 basic-block (BB) scheduling techniques
    1. RFPEX: Exhaustive
    2. RFPN: Greedy, O(n)
    3. RFPN2: Greedy with one level of backtracking, O(n^2)

Experimental Setup
  Intel XScale
    7-stage, partially bypassed
    On-Demand RF Read architecture
  RF power model
    = # register file accesses
  MiBench benchmarks
  Scheduler
    Operation Table-based RF power-aware scheduling
    Within basic block
    Tried 3 strategies
  RF power results
    Compare with On-Demand RF Read architecture as baseline
  Figure: tool flow (Application, GCC -O3, Assembly, OT-based Scheduler, GCC linker, Executable, Cycle-Accurate Simulator; outputs: Runtime, RF Reads)
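Since RF power is modeled as the number of RF accesses, a reported reduction is a relative drop in access count against the On-Demand RF Read baseline. A minimal sketch of that arithmetic follows; the function name is hypothetical and the access counts are made-up illustrative values, not measurements from the experiments.

```python
def rf_power_reduction(baseline_rf_accesses: int, scheduled_rf_accesses: int) -> float:
    """Relative reduction in RF accesses (the RF power proxy) versus the baseline."""
    if baseline_rf_accesses <= 0:
        raise ValueError("baseline access count must be positive")
    return 1.0 - scheduled_rf_accesses / baseline_rf_accesses


# Illustrative numbers only: 1000 accesses before rescheduling, 880 after.
reduction = rf_power_reduction(1000, 880)
print(f"RF power reduction: {reduction:.0%}")
```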

1. RFPEX Scheduling
  Exhaustive
    Try all legal permutations of instructions
    O(n!) complexity
      n = # instructions in BB
  Compilation time
    Hours
    Could not schedule susan, rijndael (2 days)
  RF power reduction
    Average 12%
  Performance improvement
    Average 1.4%
  Figure: RF Power Reduction per benchmark (y-axis 0% to 30%); annotation: 26% reduction
  Figure: Performance Improvement per benchmark (y-axis -4% to 8%); annotation: 7% improvement
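The exhaustive strategy can be illustrated on a tiny basic block. This sketch is not the authors' implementation: it replaces the Operation Table model with a toy cost function that assumes an operand is bypassed only when its producer is the immediately preceding instruction, and every name in it (`Instr`, `rf_reads`, `depends`, `schedule_exhaustive`) is hypothetical.

```python
from itertools import permutations
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def rf_reads(schedule, bypass_distances=frozenset({1})):
    """Toy cost: source operands not served by a bypass count as RF reads."""
    last_writer, reads = {}, 0
    for pos, ins in enumerate(schedule):
        for src in ins.srcs:
            producer = last_writer.get(src)
            if producer is None or pos - producer not in bypass_distances:
                reads += 1
        last_writer[ins.dest] = pos
    return reads


def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW hazard)."""
    return (earlier.dest in later.srcs
            or later.dest in earlier.srcs
            or earlier.dest == later.dest)


def is_legal(block, order):
    rank = {idx: pos for pos, idx in enumerate(order)}
    return all(rank[i] < rank[j]
               for i in range(len(block))
               for j in range(i + 1, len(block))
               if depends(block[i], block[j]))


def schedule_exhaustive(block):
    """Try every legal permutation, keep the one with the fewest RF reads."""
    best_cost, best_order = None, None
    for order in permutations(range(len(block))):   # n! candidates
        if not is_legal(block, order):
            continue
        candidate = [block[i] for i in order]
        cost = rf_reads(candidate)
        if best_cost is None or cost < best_cost:
            best_cost, best_order = cost, candidate
    return best_cost, best_order


block = [Instr("add", "R1", ("R2", "R3")),
         Instr("add", "R10", ("R11", "R12")),
         Instr("sub", "R4", ("R5", "R1"))]
best_cost, best_order = schedule_exhaustive(block)
print(best_cost, [ins.op + " " + ins.dest for ins in best_order])
```

The factorial loop is what makes compilation take hours on real basic blocks, as noted above.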

2. RFPN Scheduling
  Greedy O(n) scheduling
    Pick instructions one by one
    Pick the instruction which gets the most operands from bypasses
    O(n) complexity
      n = # instructions in BB
  Compilation time
    Seconds
  RF power reduction
    Average 6%
  Performance improvement
    Average: -3.5%
  Figure: RF Power Reduction per benchmark (y-axis 0% to 15%)
  Figure: Performance Improvement per benchmark (y-axis -20% to 4%)
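The greedy rule (pick, one at a time, the ready instruction that gets the most operands from bypasses) can be sketched as follows. The sketch assumes a toy bypass model in which only the immediately preceding instruction can forward its result, breaks ties by original program order, and makes no attempt to match the complexity stated above; all names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW hazard)."""
    return (earlier.dest in later.srcs
            or later.dest in earlier.srcs
            or earlier.dest == later.dest)


def ready(block, remaining):
    """Unscheduled instructions whose earlier dependences are all scheduled."""
    return [i for i in remaining
            if not any(j < i and depends(block[j], block[i]) for j in remaining)]


def bypass_hits(partial, ins, bypass_distances=frozenset({1})):
    """Sources of `ins` served by a bypass if it is appended to `partial`."""
    hits = 0
    for src in ins.srcs:
        for distance, prev in enumerate(reversed(partial), start=1):
            if prev.dest == src:              # most recent producer of src
                if distance in bypass_distances:
                    hits += 1
                break
    return hits


def schedule_greedy(block):
    remaining, partial = list(range(len(block))), []
    while remaining:
        # Most bypassed operands first; ties go to the earlier instruction.
        pick = max(ready(block, remaining),
                   key=lambda i: (bypass_hits(partial, block[i]), -i))
        partial.append(block[pick])
        remaining.remove(pick)
    return partial


block = [Instr("add", "R1", ("R2", "R3")),
         Instr("add", "R10", ("R11", "R12")),
         Instr("sub", "R4", ("R5", "R1"))]
greedy_order = [ins.dest for ins in schedule_greedy(block)]
print(greedy_order)
```

On this block the greedy pass pulls SUB up next to the ADD that produces R1.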

3. RFPN2 Scheduling
  RFPN2: greedy with one level of backtracking
    O(n^2) complexity
      n = # instructions in BB
  Compilation time
    Minutes
  RF power reduction
    Average 10%
  Performance improvement
    Average: -2%
  RFPN2 works well!
  Figure: RF Power Reduction per benchmark (y-axis 0% to 25%; x-axis: bitcount, blowfish, decode, sha, susan corners, rijndael, encode, qsort, Average); annotation: average 10% reduction
  Figure: Performance Improvement per benchmark (y-axis -20% to 5%)
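The slides do not spell out how the single level of backtracking works. The sketch below shows one plausible reading, stated here as an assumption: each ready candidate is scored by its own bypassed operands plus the best score achievable by the pick that would follow it, i.e. a one-step lookahead. The bypass model (only the immediately preceding instruction forwards its result) and all names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW hazard)."""
    return (earlier.dest in later.srcs
            or later.dest in earlier.srcs
            or earlier.dest == later.dest)


def ready(block, remaining):
    return [i for i in remaining
            if not any(j < i and depends(block[j], block[i]) for j in remaining)]


def bypass_hits(partial, ins, bypass_distances=frozenset({1})):
    """Sources of `ins` served by a bypass if it is appended to `partial`."""
    hits = 0
    for src in ins.srcs:
        for distance, prev in enumerate(reversed(partial), start=1):
            if prev.dest == src:
                if distance in bypass_distances:
                    hits += 1
                break
    return hits


def schedule_lookahead(block):
    remaining, partial = list(range(len(block))), []
    while remaining:
        def score(i):
            # Hits for candidate i now, plus the best hits of the next pick.
            now = bypass_hits(partial, block[i])
            rest = [k for k in remaining if k != i]
            after = [bypass_hits(partial + [block[i]], block[k])
                     for k in ready(block, rest)]
            return now + max(after, default=0)

        pick = max(ready(block, remaining), key=lambda i: (score(i), -i))
        partial.append(block[pick])
        remaining.remove(pick)
    return partial


block = [Instr("add", "R1", ("R2", "R3")),
         Instr("add", "R10", ("R11", "R12")),
         Instr("sub", "R4", ("R5", "R1"))]
lookahead_order = [ins.dest for ins in schedule_lookahead(block)]
print(lookahead_order)
```

Scoring every ready candidate against every possible follow-up is what adds the extra factor of n over a plain greedy pass. On a block this small the result is the same as the plain greedy choice.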

Summary
  Register file is one of the main hotspots in processors
  Very important to reduce RF power
    Repeated accesses cause "Heat Stroke"
    Up to 90% performance degradation
  On-Demand RF Read is an effective technique
    58% RF power reduction
  Scope for further RF power reduction via instruction scheduling
  Contribution: instruction scheduling technique for further RF power reduction
    Up to 26%, average 12% RF power reduction
    2% performance degradation
    Over and above On-Demand RF Read architecture as baseline
    RFPN2 is an effective heuristic for RF power reduction
  Future work
    Beyond basic block scheduling