Dynamic Binary Optimization
Presenter Kim Jin Chul
Contents
1. Overview of Applying Optimization on VMs
2. Dynamic Program Behavior
3. Profiling
4. Optimizing Translation Blocks
Classical Optimizations
IA-32 code:
    addl %edx, 4(%eax)
    movl 4(%eax), %edx
Translation from IA-32 to PowerPC code:
    addi r16, r4, 4     ; add 4 to %eax
    lwzx r17, r2, r16   ; load operand from memory
    add r7, r17, r7     ; perform add of %edx
    addi r16, r4, 4     ; add 4 to %eax
    stwx r7, r2, r16    ; store %edx value into memory
After applying common subexpression elimination (the repeated addi is removed):
    addi r16, r4, 4     ; add 4 to %eax
    lwzx r17, r2, r16   ; load operand from memory
    add r7, r17, r7     ; perform add of %edx
    stwx r7, r2, r16    ; store %edx value into memory
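The removal of the repeated addi can be sketched as a small local pass over one basic block. This is a minimal Python sketch, not the translator's actual algorithm: the tuple encoding (op, dest, src1, src2) and the name eliminate_common_subexpressions are assumptions made for illustration, and only register arithmetic (addi, add) is treated as removable.

```python
def eliminate_common_subexpressions(block):
    """Local CSE over one basic block of (op, dest, src1, src2) tuples."""
    pure_ops = {"addi", "add"}   # assumption: loads and stores are never removed
    available = {}               # (op, src1, src2) -> register holding that value
    result = []
    for op, dest, src1, src2 in block:
        key = (op, src1, src2)
        if op in pure_ops and available.get(key) == dest:
            continue             # dest already holds this value: drop the recomputation
        if op != "stwx":         # stwx writes memory, not its first register operand
            # dest is overwritten: forget expressions that live in it or read it
            available = {k: r for k, r in available.items()
                         if r != dest and dest not in k[1:]}
            if op in pure_ops and dest not in (src1, src2):
                available[key] = dest
        result.append((op, dest, src1, src2))
    return result


block = [
    ("addi", "r16", "r4", 4),
    ("lwzx", "r17", "r2", "r16"),
    ("add", "r7", "r17", "r7"),
    ("addi", "r16", "r4", 4),
    ("stwx", "r7", "r2", "r16"),
]
optimized = eliminate_common_subexpressions(block)
for instruction in optimized:
    print(instruction)
```

The second addi is dropped because neither r4 nor r16 is written between the two occurrences, so r16 still holds r4 + 4 when the store needs it.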
Optimization Based on Profiling
Original code:
    Basic Block A
        ...
        R3 ← ...
        R7 ← ...
        R1 ← R2 + R3
        Br L1 if R3 == 0
    Basic Block B
        ...
        R6 ← R1 + R6        (use of R1)
        ...
    Basic Block C
        L1: R1 ← 0          (def of R1)
        ...
After moving R1 ← R2 + R3 out of block A:
    Basic Block A
        ...
        R3 ← ...
        R7 ← ...
        Br L1 if R3 == 0
    Compensation code (on the path from A to B)
        R1 ← R2 + R3
    Basic Block B
        ...
        R6 ← R1 + R6
        ...
    Basic Block C
        L1: R1 ← 0
        ...
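A quick way to see why the compensation code keeps the program correct is to run both versions on the same inputs. The sketch below is illustrative only: the function names original and optimized are made up, blocks B and C are assumed to be two separate successors of A, and only the final values of R1 and R6 are compared.

```python
def original(r1, r2, r3, r6):
    r1 = r2 + r3          # block A computes R1 on every path
    if r3 == 0:
        r1 = 0            # block C overwrites R1 without using it
    else:
        r6 = r1 + r6      # block B uses R1
    return r1, r6


def optimized(r1, r2, r3, r6):
    if r3 == 0:
        r1 = 0            # block C: R1 from block A was never needed here
    else:
        r1 = r2 + r3      # compensation code on the path to block B
        r6 = r1 + r6      # block B
    return r1, r6


# Compare both versions over a small grid of hypothetical register values.
mismatches = [
    (r2, r3)
    for r2 in range(-2, 3)
    for r3 in range(-2, 3)
    if original(99, r2, r3, 7) != optimized(99, r2, r3, 7)
]
print(mismatches)
```

The empty mismatch list reflects that the addition is only observable on the path through block B, which is exactly where the compensation code recomputes it.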
Optimization Based on Profiling
Original code:
    Basic Block A
        ...
        R3 ← ...
        R7 ← ...
        R1 ← R2 + R3
        Br L1 if R3 == 0
    Basic Block B
        ...
        R6 ← R1 + R6
        ...
    Basic Block C
        L1: R1 ← 0
        ...
After forming a superblock from blocks A and C:
    Superblock
        ...
        R3 ← ...
        R7 ← ...
        Br L2 if R3 != 0
        R1 ← 0
        ...
    Compensation code
        R1 ← R2 + R3
    Basic Block B
        L2: ...
        R6 ← R1 + R6
        ...
A staged optimization system
(Figure: components of the staged system: binary memory image, interpreter, translator, optimizer, emulation manager, basic block cache, code cache, profile data.)
Stages: Interpret → Basic translation → Optimized blocks → Highly optimized blocks
    Startup: fast → very slow
    Steady state: slow → fast
    Profiling: simple → extensive
Dynamic Program Behavior
Dynamic control flow is highly predictable
    .
    .
    R3 ← 100
    loop: R1 ← mem(R2)
    Br found if R1 == -1
    R2 ← R2 + 4
    R3 ← R3 - 1
    Br loop if R3 != 0
    .
    .
    found: ...
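To make the predictability claim concrete, the loop above can be simulated while recording the outcome of each of its two branches. This is a minimal sketch under stated assumptions: the function name run_search_loop is made up, and memory is modeled as a hypothetical list of 100 words that does not contain -1.

```python
def run_search_loop(mem, limit=100):
    """Simulate the search loop and record each branch outcome (True = taken)."""
    outcomes = {"Br found": [], "Br loop": []}
    r2, r3 = 0, limit
    while True:
        r1 = mem[r2 // 4]                 # R1 <- mem(R2), 4-byte words
        found = (r1 == -1)
        outcomes["Br found"].append(found)
        if found:
            break
        r2 += 4
        r3 -= 1
        again = (r3 != 0)
        outcomes["Br loop"].append(again)
        if not again:
            break
    return outcomes


outcomes = run_search_loop([0] * 100)     # hypothetical data without the value -1
for name, history in outcomes.items():
    print(name, "taken:", sum(history), "not taken:", len(history) - sum(history))
```

With this input the forward branch is never taken and the backward branch is taken on all but its last execution, so predicting "same as last time" is wrong only once.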
Dynamic Program Behavior
Distribution of taken conditional branches
(Figure: histogram. x-axis: percent taken, in bins 0-10%, 10-20%, ..., 80-90%, >90%; y-axis: fraction of static conditional branches, 0% to 50%.)
Predominantly not taken: 28%
Predominantly taken: 42%
Dynamic Program Behavior
Consistency of conditional branches
The high percentage consists of backward branches
(Figure: bar chart. x-axis: SPEC benchmark (176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, 189.lucas); y-axis: dynamic branches decided same as previous time, 0% to 100%.)
Dynamic Program Behavior
The predictability of indirect jumps
Some jump destination addresses seldom change
(Figure: histogram. x-axis: number of different destinations, 1 to >9; y-axis: percent of indirect jumps, 0% to 25%.)
Dynamic Program Behavior
The predictability of data values
(Figure: bar chart. x-axis: instruction type (All, Add/Sub, Load, Logic, Shift, Set); y-axis: fraction with constant value, 0 to 0.7; two series: Static and Dynamic.)
Static: static instructions that always compute the same value
Dynamic: dynamic executions of those static instructions
Profiling
Profiling is the process of collecting instruction and data statistics for an executing program
Optimization in the staged system is based on profiling
(Figure: the staged system: binary memory image, interpreter, translator, optimizer, emulation manager, basic block cache, code cache, profile data.)
The Role of Profiling
Traditional profiling
(Figure: HLL Program → Compiler Frontend → control flow graph with nodes A to F → Compiler Backend → Instrumented Code. Instrumented Code and Test Data → Program Execution → Program Statistics. Program Statistics → Optimizing Compiler → Optimized Binary.)
The Role of Profiling
On-the-fly profiling in a dynamic optimizing VM
(Figure: Program Binary and Program Data → Interpreter → Partial Program Statistics → Translator/Optimizer; the part of the control flow graph seen so far contains nodes A, B, D, E.)
Types of Profiles
Several types of profile data
    How frequently are different code regions being executed?
        This can be used to decide the level of optimization
    How predictable is the control flow?
        This may be used as the basis for gathering and rearranging basic blocks
        Rearranged basic blocks get a chance to be merged into a superblock
Types of Profiles
(Figure: the same control flow graph, nodes A to F, shown twice: once with an execution count on each node, once with a count on each edge.)
A basic block profile
An edge profile
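An edge profile contains at least as much information as a basic block profile, because each block's count is the sum of the counts on its incoming edges. The sketch below shows that derivation; the graph shape, the edge counts, and the name block_profile_from_edges are hypothetical and are not the values from the figure.

```python
from collections import defaultdict


def block_profile_from_edges(edge_counts, entry, entry_count):
    """Derive per-block execution counts from per-edge counts."""
    counts = defaultdict(int)
    counts[entry] += entry_count          # the entry block has no incoming edge
    for (src, dst), n in edge_counts.items():
        counts[dst] += n                  # every edge traversal executes its target
    return dict(counts)


# Hypothetical edge profile for a six-block graph entered 50 times at A.
edge_counts = {
    ("A", "B"): 12, ("A", "C"): 38,
    ("B", "D"): 12, ("C", "D"): 38,
    ("D", "E"): 40, ("D", "F"): 10,
    ("E", "F"): 40,
}
block_counts = block_profile_from_edges(edge_counts, entry="A", entry_count=50)
print(block_counts)
```

The reverse derivation is not possible in general: the block counts alone do not say how D's 50 executions split between E and F.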
Collecting Profiles
Instrumentation-based profiling
    Targets specific program-related events and counts all instances of the events being profiled
    Software-based vs. hardware-based: Speed? Support? Flexibility?
Sampling-based profiling
    The program runs in its unmodified form; at intervals it is interrupted and an event is captured
Instrumentation vs. sampling
    Overhead per collected data point: Instrumentation < Sampling
    Sampling causes traps!
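The difference between the two collection methods can be illustrated on a recorded sequence of executed blocks. This is a simplified sketch: sampling is modeled as looking at every interval-th event instead of a real timer interrupt or trap, and the trace contents and the names instrument and sample are hypothetical.

```python
from collections import Counter


def instrument(trace):
    """Instrumentation: count every instance of the profiled event."""
    return Counter(trace)


def sample(trace, interval):
    """Sampling: capture only the event in progress at every interval-th step."""
    return Counter(trace[interval - 1::interval])


# Hypothetical execution: 900 executions of a loop block, then 100 of an exit block.
trace = ["loop"] * 900 + ["exit"] * 100
exact = instrument(trace)
approx = sample(trace, interval=100)
print(exact)
print(approx)
```

The sampled profile holds only ten data points, yet it preserves the 9:1 ratio between the two blocks, which is often all an optimizer needs.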
Profiling During Interpretation
Instruction function list
    .
    .
    branch_conditional(inst) {
        BO = extract(inst, 25, 5);
        BI = extract(inst, 20, 5);
        displacement = extract(inst, 15, 14) * 4;
        .
        .
        // code to compute whether branch should be taken
        .
        .
        profile_addr = lookup(PC);    // find this branch's profile table entry
        if (branch_taken) {
            profile_cnt(profile_addr, taken);
            PC = PC + displacement;
        } else {
            profile_cnt(profile_addr, nottaken);
            PC = PC + 4;
        }
    }
(Figure: the branch PC is hashed to select a profile table entry; each entry holds the fields PC, Taken count, and Not-taken count.)
PowerPC Branch Conditional Interpreter Routine
Profile Table for Collecting an Edge Profile During Interpretation
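The profile table used by the interpreter routine can be sketched as a hash table keyed by the branch PC. This is a minimal illustration, not the presenter's implementation: the class name EdgeProfileTable and the method counts are assumptions, a Python dict stands in for the hash function, and profile_cnt here takes the PC directly instead of a table address.

```python
class EdgeProfileTable:
    """Per-branch taken / not-taken counters, keyed by branch PC."""

    def __init__(self):
        self.entries = {}                 # branch PC -> [taken count, not-taken count]

    def profile_cnt(self, branch_pc, taken):
        entry = self.entries.setdefault(branch_pc, [0, 0])
        entry[0 if taken else 1] += 1

    def counts(self, branch_pc):
        return tuple(self.entries.get(branch_pc, (0, 0)))


table = EdgeProfileTable()
# Hypothetical outcomes for a loop-closing branch at PC 0x1000.
for taken in [True, True, True, False]:
    table.profile_cnt(0x1000, taken)
print(table.counts(0x1000))
```

Because a conditional branch has exactly two outgoing edges, the pair of counters per branch is an edge profile for that block.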
Profiling Translated Code
Translated basic block
    Fall-through stub
        increment edge counter (i)
        if (counter (i) > trigger) then invoke optimizer
        else branch to fall-through basic block
    Branch target stub
        increment edge counter (j)
        if (counter (j) > trigger) then invoke optimizer
        else branch to target basic block
Edge Profiling Code Inserted into Stubs of a Binary Translated Basic Block
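The stub logic can be modeled in a few lines. This is a sketch under stated assumptions: the class name Stub, the TRIGGER value of 3, and the returned action strings are all hypothetical, and a real stub would be a handful of machine instructions emitted by the translator.

```python
TRIGGER = 3   # hypothetical threshold; the real value is a tuning parameter


class Stub:
    """Exit stub of a translated block: counts its edge and may invoke the optimizer."""

    def __init__(self, target):
        self.target = target
        self.counter = 0

    def execute(self):
        self.counter += 1                           # increment edge counter
        if self.counter > TRIGGER:
            return ("invoke optimizer", self.target)
        return ("branch", self.target)              # continue at the successor block


fall_through = Stub(target="block_B")
actions = [fall_through.execute() for _ in range(4)]
print(actions)
```

The first three executions simply branch to the successor; the fourth exceeds the trigger and hands control to the optimizer, which is how a hot edge gets promoted to the next emulation stage.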
Emulation Stages
Profiling Overhead
Profiling during interpretation incurs 10-20% overhead
Profiling overhead can be reduced
    Reduce the number of instrumentation points by selecting a smaller set of key points
Optimizing Translation Blocks
Two-part strategy for optimizing
    Using dominant control flow to enhance memory locality
    Making translation blocks larger
        Traces, superblocks, tree groups
The two parts of the strategy are actually relatively independent
Improving Locality
Two kinds of memory locality
    Spatial locality
        An access to a memory location is soon followed by an access to an adjacent memory location
    Temporal locality
        A memory location that is accessed is accessed again in the near future
Improving Locality
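Example code sequence
    A    Br cond1 == true
    B    Br cond2 == false
    C    Br uncond
    D    Br cond3 == true
    E    Br uncond
    F
    G    Br cond4 == true
(Figure: control flow graph of blocks A to G annotated with edge profile counts; the counts shown are 97, 30, 1, 1, 70, 29, 1, 3, 68, 68, 29, 2.)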
Improving Locality
Rearrange the blocks in memory
    A    Br cond1 == false
    D    Br cond3 == true
    E
    G    Br cond4 == true
         Br uncond
    B    Br cond2 == false
    C    Br uncond
    F    Br uncond
(Figure: the control flow graph of blocks A to G with its edge profile counts.)
Improving Locality
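(Figure: procedure inlining. In the original code, the block sequences A, B and K, L each end with "Call proc xyz"; procedure xyz consists of blocks X, Y, Z and ends with Return. In the inlined versions, blocks of xyz are copied in place of the calls.)
Procedure Inlining
Positive & negative effects?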
Traces
(Figure: the control flow graph of blocks A to G with its edge profile counts, divided into Trace 1, Trace 2, and Trace 3.)
Trace
    A contiguous sequence of basic blocks
    May have both side entrances and side exits
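One common way to pick traces from an edge profile is to grow each trace greedily along the most frequently followed edge. The sketch below shows that idea only; it is not necessarily the method used in the slides, it omits thresholds and backward growth that practical trace selectors use, and the function name form_traces and all counts are hypothetical.

```python
def form_traces(edge_counts, block_counts):
    """Greedy trace selection: seed at the hottest unvisited block, follow the hottest edge."""
    successors = {}
    for (src, dst), n in edge_counts.items():
        successors.setdefault(src, []).append((n, dst))
    visited, traces = set(), []
    for seed in sorted(block_counts, key=block_counts.get, reverse=True):
        if seed in visited:
            continue
        trace, block = [], seed
        while block is not None and block not in visited:
            trace.append(block)
            visited.add(block)
            candidates = [(n, d) for n, d in successors.get(block, []) if d not in visited]
            block = max(candidates)[1] if candidates else None
        traces.append(trace)
    return traces


# Hypothetical profile for a seven-block graph.
edge_counts = {
    ("A", "B"): 30, ("A", "D"): 70, ("B", "C"): 29, ("B", "G"): 1,
    ("C", "G"): 29, ("D", "E"): 68, ("D", "F"): 2, ("E", "G"): 68, ("F", "G"): 2,
}
block_counts = {"A": 100, "B": 30, "C": 29, "D": 70, "E": 68, "F": 2, "G": 100}
traces = form_traces(edge_counts, block_counts)
print(traces)
```

With these counts the dominant path becomes the first trace and the rarely executed blocks end up in their own short traces.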
Relations between Superblocks and Traces
(Figure: diagram with regions labeled Superblocks and Traces.)
Superblocks
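(Figure: the control flow graph of blocks A to G with its edge profile counts, and the same graph redrawn with block G replicated so that each path reaches its own copy of G.)
Superblocks
    Regions of code with only one entry and one or more exit points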
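The single-entry requirement can be checked mechanically: a trace is a superblock only if no edge enters it anywhere except at its first block. The sketch below lists such side entrances for a given trace; the function name side_entrances and the edge list are hypothetical.

```python
def side_entrances(trace, edges):
    """Return edges that enter the trace at a block other than its head."""
    found = []
    for i in range(1, len(trace)):
        block, previous = trace[i], trace[i - 1]
        for src, dst in edges:
            if dst == block and src != previous:   # entry from outside the trace's own flow
                found.append((src, dst))
    return found


# Hypothetical control flow graph for a seven-block example.
edges = [
    ("A", "B"), ("A", "D"), ("B", "C"), ("B", "G"), ("C", "G"),
    ("D", "E"), ("D", "F"), ("E", "G"), ("F", "G"),
]
print(side_entrances(["A", "D", "E", "G"], edges))
print(side_entrances(["B", "C"], edges))
```

In this example the trace A, D, E, G has three side entrances, all into G, so it is not yet a superblock; giving each entering path its own copy of G removes them.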
Superblocks
(Figure: the rearranged code sequence shown twice. The first copy is the layout A, D, E, G, B, C, F with its branches. The second copy adds two more copies of G, each ending in Br cond4 == true.)
Tree Groups
(Figure: the control flow graph of blocks A to G drawn as a tree, with block G replicated.)
Tree groups
    Regions of code with only one entry and one or more exit points (Figure 4.7)
SPEC benchmarks
Integer SPEC benchmarks
    176.gcc – GNU Compiler
    181.mcf – Combinatorial Optimization
    197.parser – Word Processor
    252.eon – Computer Visualization
    256.bzip2 – Compression
Floating-Point SPEC benchmarks
    171.swim – Shallow Water Modeling
    173.applu – Parabolic
    187.facerec – Image Processing
    189.lucas – Number Theory