Day 4: Symbiotic Optimization
Kim Hazelwood
ACACES Summer School
July 2009
ACACES 2009 – Process Virtualization
Modern Computing Challenges
Performance
Power & temperature
Reliability
Parallelism (multicore)
Heterogeneity
Limited resources (embedded computing)
Typical Approaches
Optimize using SW or HW techniques in isolation
Performance
• SW: Compile-time optimizations, OS scheduling
• HW: Architectural improvements, VLSI technology
Reliability: Code/data duplication (HW or SW)
Power & Temperature
• HW control mechanisms
• Profile + recompile cycle
Modern Design Constraints
Compilers – “Compile once, run anywhere”
• Cannot ship “MS Office for 3Q08 batch of Core2 3GHz, > 8GB RAM, BrandX power supply, located in high altitudes…”
OSes – Scheduling algorithms don’t scale to modern architectures
Microarchitecture – Limited window of application knowledge
• Past must predict the future
Circuits – Guaranteed correctness, reliability
• Must design for the worst case
Tortola: Symbiotic Optimization
Enable HW/SW Communication
What small changes can we make to HW/OS to enable collaborative solutions?
[Figure: SW Applications run on a Binary Modifier, which sits above the OS/HW. Labels on the interface between the Binary Modifier and the OS/HW: runtime traits, scheduling hints, hardware feedback, OS load.]
The Power of Virtualization
• No longer restricted to a fixed ISA
• Reduce hardware complexity
  – No more backwards compatibility warts
  – Fix bugs after shipment
  – Reduce time to market for new architectures
[Figure: SW Applications run on a Binary Modifier above the HW. Initially: x86 on x86. Eventually: SWI on HWI.]
Tortola Applications
Combine global program information with run-time feedback
• System-specific power usage
• Application-specific heat anomalies
• Workload/input specific performance optimization
Two case studies
• The di/dt problem – solved using hardware feedback and dynamic optimization
• Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and OS hints from the binary modifier
The di/dt Problem
Voltage stability is important for reliability, performance
Low-power techniques have a negative side effect: current variation
ITRS cites noise management as a Grand Challenge for 5-10 year time frame
Dips (undershoots) in supply voltage – can cause incorrect values to be calculated or stored
Spikes (overshoots) in supply voltage – can cause reliability problems
di/dt Solutions
Software: Compiler optimizations
MicroArch: Sensor/actuator mechanisms
Circuit-Level: Decoupling capacitors, more Vdd/Gnd pins on package
Co-Designed MicroArch & SW: Binary modifier
Sensor-Actuator Mechanisms
On-chip voltage sensors detect abnormally high/low voltage levels
On-chip actuator then attempts to quickly raise/lower the processor’s current draw
• Phantom firing
  – increases current (at the expense of power)
• Resource throttling
  – reduces current (at the expense of performance)
Detecting Imminent Emergencies
[Figure: Operating Voltage Range with levels marked at 1.05V, 1.03V, 1V, 0.97V, and 0.95V. Labels: Hard Emergency, Soft Emergency, Control Threshold.]
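The sensor/actuator idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the control (soft emergency) thresholds sit at 0.97V and 1.03V and the hard emergency limits at 0.95V and 1.05V around a nominal 1V, and the function names `classify` and `actuator_response` are hypothetical.

```python
# Assumed mapping of the voltage levels to roles (volts); not confirmed by the slides.
SOFT_LOW, SOFT_HIGH = 0.97, 1.03   # assumed control thresholds (soft emergency)
HARD_LOW, HARD_HIGH = 0.95, 1.05   # assumed edges of the operating range (hard emergency)


def classify(voltage):
    """Classify a sensed supply voltage as 'ok', 'soft', or 'hard'."""
    if voltage < HARD_LOW or voltage > HARD_HIGH:
        return "hard"
    if voltage < SOFT_LOW or voltage > SOFT_HIGH:
        return "soft"
    return "ok"


def actuator_response(voltage):
    """Pick an actuator action once a control threshold is crossed."""
    if voltage > SOFT_HIGH:
        return "phantom_fire"   # overshoot: raise the current draw
    if voltage < SOFT_LOW:
        return "throttle"       # undershoot: lower the current draw
    return "none"
```

For example, `actuator_response(0.96)` selects throttling, trading performance for a smaller current draw, while `actuator_response(1.04)` selects phantom firing, trading power for a larger one.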
Targeting Mid-Frequency Di/dt
• Problematic: wide current spike
• Worst case: pulse at the resonant frequency
[Figure: two panels plotting Supply Voltage (V) and Processor Current (A) against Time (Cycles). One panel is annotated "20 cycles" and marks the Minimum Voltage; the other is annotated "60 cycles" and marks the Minimum Voltage and Maximum Voltage. From: Joseph et al., HPCA 2003.]
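Why a pulse train at the resonant frequency is the worst case can be illustrated by measuring how much of a current trace's energy falls at the resonant period. The sketch below is not a power-delivery model: the square-wave current, the 600-cycle window, and the 60-cycle resonant period are all assumptions chosen for illustration.

```python
import cmath


def square_wave(period, cycles):
    """Current trace: high for the first half of each period, low for the rest."""
    return [1.0 if (t % period) < period // 2 else 0.0 for t in range(cycles)]


def component_at(trace, resonant_period):
    """Magnitude of the trace's spectral component at the resonant period."""
    total = sum(x * cmath.exp(-2j * cmath.pi * t / resonant_period)
                for t, x in enumerate(trace))
    return abs(total) / len(trace)


RESONANT_PERIOD = 60  # cycles; hypothetical value for this example

on_resonance = component_at(square_wave(60, 600), RESONANT_PERIOD)
off_resonance = component_at(square_wave(20, 600), RESONANT_PERIOD)
```

Under these assumptions the pulse train whose period matches the resonant period has a large component there, while the faster pulse train contributes essentially nothing at that frequency.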
A di/dt Stressmark
BEGIN_LOOP:
    …
    ldt    $f1, ($4)
    divt   $f1, $f2, $f3
    divt   $f3, $f2, $f3
    stt    $f3, 8($4)
    ldq    $7, 8($4)
    cmovne $31, $7, $3
    stq    $3, ($4)
    stq    $3, ($4)
    stq    $3, ($4)
    …
    stq    $3, ($4)
    …
    JUMP BEGIN_LOOP
[Annotations on the code: Sequential, Low Power (upper portion of the loop body); Parallel, High Power (lower portion).]
But… the actuator engages on every loop iteration, degrading performance
Why not correct the problem in the code?
Why use Dynamic Binary Modification?
Modify the instruction stream at run time
Much easier to react to an emergency than to predict one
Emergencies are processor dependent, but software should not be!
Enables run-time guarantees
Proposed Solution
Leverage our additional software layer to supplement existing solutions
Microarchitecture provides feedback to our software-based virtual layer
[Figure: on the SW side, the Binary Modifier (virtual layer, VL) turns the Executable into an Altered Executable; on the HW side, the Microprocessor carries Sensor+Actuator extensions.]
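The feedback path from the microarchitecture to the virtual layer might look roughly like the following. This is a hypothetical sketch: the slides do not specify the interface, so the `VirtualLayer` class, the code address passed with each emergency, and the trigger count of three are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualLayer:
    """Hypothetical binary-modifier side of the hardware feedback channel."""
    reoptimize_after: int = 3                      # assumed trigger count
    counts: dict = field(default_factory=dict)     # code address -> emergencies seen
    reoptimized: set = field(default_factory=set)  # addresses already rewritten

    def on_emergency(self, code_address):
        """Called when the hardware reports a voltage emergency.

        Returns True when this report triggers a rewrite of the code region.
        """
        if code_address in self.reoptimized:
            return False
        self.counts[code_address] = self.counts.get(code_address, 0) + 1
        if self.counts[code_address] >= self.reoptimize_after:
            self.reoptimized.add(code_address)     # stand-in for modifying the code
            return True
        return False
```

The design choice here is to react only to repeated emergencies at the same address, so that a one-off event does not pay the cost of regenerating code.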
Required Investigations
Characterizing emergencies
• How often do we see di/dt emergency loops?
• Are emergencies usually code-based?
Communication between the microarchitecture and the virtual layer
• What information should be passed to virtual layer during an emergency?
Fixing di/dt via binary modification
• Will existing techniques help?
• New algorithms?
Last-Executed Branch
Data suggests modifying a few code sequences will eliminate many voltage emergencies
[Figure: Emergencies per benchmark on a log scale from 1 to 1,000,000, showing Distinct and Total counts for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, wupwise, swim, art, mgrid, applu, mesa, galgel, equake, facerec, sixtrack, ammp, and apsi.]
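The distinct-versus-total comparison can be reproduced from a log of emergencies tagged with the last-executed branch. The sketch below assumes such a log is available as a list of branch addresses; the function names and the sample trace are made up for illustration and are not measured data.

```python
from collections import Counter


def emergency_stats(last_branch_pcs):
    """Return (distinct, total) emergency counts for a log of branch addresses."""
    counts = Counter(last_branch_pcs)
    return len(counts), sum(counts.values())


def coverage_of_top(last_branch_pcs, k):
    """Fraction of all emergencies attributed to the k most frequent branches."""
    counts = Counter(last_branch_pcs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Made-up trace: one hot loop branch plus two rare ones.
trace = [0x400] * 8 + [0x500, 0x600]
distinct, total = emergency_stats(trace)
```

When the total is far larger than the distinct count, as in this toy trace, fixing the few hottest code sequences covers most emergencies.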
Possible Compiler Optimizations
Our goal is to
• Smooth out current profile, or
• Knock pulses off of the resonant frequency
Some existing options
• Software pipelining, code motion, instruction padding
[Figure: the Binary Modifier applies optimizations to the Executable, producing an Altered Executable that runs on a Microprocessor with Sensor+Actuator extensions.]
Loop Unrolling & SW Pipelining
[Figure: current profiles over time. Problematic loop: body AABB repeated for iterations 1, 2, and 3. Software pipelining overlaps the AABB iterations and smooths the profile. Loop unrolling produces the unrolled loop AAAABBBB and disrupts the resonance pulse.]
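The effect of the two transformations on the current profile can be sketched with a toy model. The per-phase current values are assumptions, and software pipelining is modeled crudely as every slot carrying an equal mix of A-work and B-work from neighboring iterations.

```python
def current_profile(schedule, draw):
    """Map a schedule string such as 'AABB' to a per-slot current draw."""
    return [draw[op] for op in schedule]


def swing(profile):
    """Peak-to-peak variation of the current profile."""
    return max(profile) - min(profile)


def pulse_period(profile):
    """Smallest shift at which the profile repeats itself."""
    n = len(profile)
    for p in range(1, n):
        if all(profile[t] == profile[t + p] for t in range(n - p)):
            return p
    return n


DRAW = {"A": 1.0, "B": 3.0}  # assumed current per phase, arbitrary units

original = current_profile("AABB" * 6, DRAW)
unrolled = current_profile("AAAABBBB" * 3, DRAW)
# Crude model of software pipelining: each slot mixes A and B work equally.
pipelined = [(DRAW["A"] + DRAW["B"]) / 2] * len(original)
```

In this model the pipelined loop has no swing at all, while the unrolled loop keeps the same swing but doubles the pulse period, which moves it away from a resonance that matched the original period.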
Unrolling the Di/dt Stressmark
[Figure: supply voltage (0.97V to 1.02V) before and after loop unrolling. Before Loop Unrolling: phases labeled H and L. After Loop Unrolling: phases labeled H1, H2, L1, and L2.]
Two Case Studies
The di/dt problem – solved using hardware feedback and dynamic optimization
Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and hints from the binary modifier
Heterogeneous Multicores
Process Heterogeneity
• Process variation
Design Heterogeneity
• Specialized processor cores
[Figure: Today’s Multicore Designs compared with Future Multicore Designs]
The Challenge: Scheduling
Many OSes assume identical core resources
• Bad assumption, even today (hyperthreading)
The OS may not have enough information to make the best scheduling decisions
• depending on the type of heterogeneity
Ideal process-to-core mappings change dynamically
Should this be a task for the OS alone?
Our Approach
Claim: OSes could benefit from scheduling hints
Solution: Combine historical performance data with application phase information
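[Figure: SW Applications run on a Binary Modifier, which sits above the OS/HW. Labels on the interface between the Binary Modifier and the OS/HW: phase information, scheduling hints, performance counter info, OS load.]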
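One way to combine historical performance data with phase information is a table keyed by program phase. The sketch below is an assumption about how such hints could be derived, not the design from the lecture: the `HintTable` class, the phase identifiers, and the core names are hypothetical.

```python
from collections import defaultdict


class HintTable:
    """Hypothetical per-phase IPC history kept by the binary modifier."""

    def __init__(self):
        # phase_id -> core_type -> list of observed IPC samples
        self.history = defaultdict(lambda: defaultdict(list))

    def record(self, phase_id, core_type, ipc):
        """Store one IPC sample read from the performance counters."""
        self.history[phase_id][core_type].append(ipc)

    def preferred_core(self, phase_id):
        """Core type with the best mean IPC for this phase, or None if unseen."""
        if phase_id not in self.history:
            return None
        per_core = self.history[phase_id]
        return max(per_core, key=lambda c: sum(per_core[c]) / len(per_core[c]))


table = HintTable()
table.record("phase_a", "core0", 0.8)
table.record("phase_a", "core1", 1.6)
table.record("phase_a", "core1", 1.4)
hint = table.preferred_core("phase_a")
```

When the application re-enters a known phase, the binary modifier could pass `hint` to the OS as a scheduling hint; the OS remains free to weigh it against system load.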
Initial Investigation
Hardware configuration
• In-order core
• Out-of-order core
Software configuration
• Two applications executing (SPEC all combinations)
Migration indicators
• IPC (from HW)
• Phase (from SW)
1M cycle granularity
[Figure: App1 and App2 mapped onto an out-of-order core and an in-order core]
Results
1. Calculated ideal (omniscient) scheduling
2. Calculated random scheduling
3. Explored two migration heuristics
• IPC-threshold scheduling
• IPC-delta scheduling
Metric: Distance from ideal
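The slides name the two heuristics and the metric without defining them, so the following is a sketch under assumptions: IPC-threshold is taken to mean "swap when the application on the out-of-order core falls below a fixed IPC", IPC-delta is taken to mean "swap when IPC changes by at least a given amount between timeslices", and distance from ideal is taken as the relative shortfall against the omniscient schedule. All function names are hypothetical.

```python
def ipc_threshold_swap(ipc_on_ooo_core, threshold):
    """Assumed rule: swap when the app on the out-of-order core underuses it."""
    return ipc_on_ooo_core < threshold


def ipc_delta_swap(previous_ipc, current_ipc, delta):
    """Assumed rule: swap when IPC changed by at least `delta` since the last slice."""
    return abs(current_ipc - previous_ipc) >= delta


def distance_from_ideal(achieved, ideal):
    """Assumed metric: relative shortfall against the omniscient schedule."""
    if ideal <= 0:
        raise ValueError("ideal throughput must be positive")
    return (ideal - achieved) / ideal
```

For instance, a schedule that retires 90 units of work where the omniscient schedule retires 100 would score a distance of 10% under this definition.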
Random Scheduling
[Figure: Distance from Ideal (0% to 60%) versus Probability of Swapping at Each 1M Timeslice (10% to 100%)]
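The random baseline can be sketched as a coin flip at every timeslice. This is a minimal illustration, assuming two applications on two cores so that the mapping is either the initial one or its swap; the function name and the seeded generator are choices made for this example.

```python
import random


def random_schedule(num_slices, swap_probability, seed=0):
    """Return the mapping used in each timeslice: 0 = initial, 1 = swapped."""
    rng = random.Random(seed)          # seeded so runs are repeatable
    mapping, current = [], 0
    for _ in range(num_slices):
        if rng.random() < swap_probability:
            current ^= 1               # swap the two applications
        mapping.append(current)
    return mapping
```

A swap probability of 0 never migrates, and a probability of 1 migrates at every timeslice; the values in between correspond to the sweep along the horizontal axis of the plot.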
IPC Threshold Scheduling
[Figure: Percent Distance from Ideal (0 to 30) for thresholds of 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, and 2.0 IPC. Series: 0, 0.5, 1.0, 1.5, 1.9, and 2.0 Backoff.]
IPC Delta Scheduling
[Figure: Percent Distance from Ideal (0 to 30) versus Migration Trigger - IPC Change, for values of 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2]
Ongoing Investigations
Varying heterogeneity
• cache sizes
• floating-point availability
Determining migration indicators
• cache misses
Combinations
Other software feedback
Symbiotic Optimization
Cross-layer approaches can be powerful
Minor changes enable communication channels
Hardware feedback and binary modification can help solve the di/dt problem
Hardware feedback and program phases can guide OS scheduling decisions
The Tortola design can also target power reduction, temperature reduction, reliability, etc.
http://www.tortolaproject.com/
Course Summary
Now you know …
Process VMs versus system VMs
Research issues in building process VMs
How to use process VMs for a variety of other research projects
Why cross-layer approaches can be powerful and how to use process VMs to get there
Thanks and enjoy the rest of your summer!