Code Size Efficiency in Global Scheduling for ILP Processors
TINKER Research Group
Department of Electrical & Computer Engineering
North Carolina State University
Huiyang Zhou, Tom Conte
2
Outline
• Introduction
• Quantitative measure of code size efficiency
• Best code size efficiency for a given code size limit
• Optimal code size efficiency for a program
• Summary
• Future work
3
Introduction
• Instruction-level parallelism (ILP) vs. static code size
  – Region-enlarging optimizations usually enhance ILP
    • Cyclic scheduling: loop unrolling, loop peeling, etc.
    • Acyclic scheduling: tail duplication, recovery code, etc.
• I-cache and ITLB performance vs. static code size
  – Larger code usually means a larger I-cache footprint
• Trade-off between the conflicting effects of code size increase
  – Especially in acyclic global scheduling
4
Background of Treegion Scheduling
• Treegion scheduling
  – An acyclic scheduling technique
  – Two phases
    • Treegion formation
    • Treegion-based instruction scheduling: Tree Traversal Scheduling (TTS) (HPCA-4, LCPC’01)
• Treegion
  – Basic scheduling unit
  – A single-entry / multiple-exit nonlinear region whose CFG forms a tree (i.e., no merge points and no back-edges within a treegion)
[Figure: a CFG of basic blocks BB1 to BB6 partitioned into two treegions, Tree1 and Tree2]
5
Background of Treegion Scheduling
• Treegion examples
  – Natural treegion: a treegion formed without tail duplication (i.e., natural treegion formation causes no code size increase)
[Figure: the CFG of BB1 to BB6 shown as natural treegions (Tree1, Tree2), and the same CFG after tail duplication, which adds the copies BB4’, BB5’, and BB6’ (Tree1’)]
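As a rough illustration of natural treegion formation, the sketch below partitions an acyclic CFG by starting a new treegion at the entry block and at every merge point. The function name `natural_treegions`, the dictionary encoding of the CFG, and the edges chosen for BB1 to BB6 are assumptions made for this example; the slides do not give the formation procedure, and back-edges are not handled here.

```python
from collections import defaultdict

def natural_treegions(cfg, entry):
    """Partition an acyclic CFG into treegions without duplicating any block.

    cfg maps each block name to the list of its successors.
    Returns {root block: [blocks of that treegion in DFS preorder]}.
    """
    preds = defaultdict(list)
    for block, succs in cfg.items():
        for s in succs:
            preds[s].append(block)

    # A block roots a treegion if it is the entry or a merge point.
    roots = [b for b in cfg if b == entry or len(preds[b]) > 1]

    trees = {}
    for root in roots:
        members, stack = [], [root]
        while stack:
            block = stack.pop()
            members.append(block)
            # Grow the tree only through successors with a single predecessor.
            for s in reversed(cfg[block]):
                if len(preds[s]) == 1 and s != entry:
                    stack.append(s)
        trees[root] = members
    return trees

# Assumed edges for the six-block example (not stated in the slides).
cfg = {
    "BB1": ["BB2", "BB3"],
    "BB2": ["BB4"],
    "BB3": ["BB4"],
    "BB4": ["BB5", "BB6"],
    "BB5": [],
    "BB6": [],
}
trees = natural_treegions(cfg, "BB1")
print(trees)
```

With these assumed edges, BB4 is a merge point, so it roots a second treegion. Tail duplication would instead copy BB4 and its subtree into the first tree, which is where the code size increase comes from.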
6
Code Size Effects in Treegion Scheduling
• Tail duplication increases code size
• General operation combining reduces code size
[Figure: before combining, one block computes R1=R3+R4 and the duplicated block BB4’ computes R7=R3+R4 and R9=R7*4; after combining, only R1=R3+R4 remains and BB4’ computes R9=R1*4]
7
Quantitative Measure of Code Size Efficiency
• ILP vs. static code size
[Figure: 129.compress, static IPC (y-axis, 1.9 to 2.9) vs. relative size (x-axis, 0.95 to 1.95), where relative size = scheduled code size / original code size; series: All_Possible, Havanki's heuristic, BB, Natural_tree]
Havanki’s heuristic: a treegion formation heuristic proposed earlier [HPCA-4].
Code Size Efficiency for Any Code Size Related Optimizations
• Use the ratio of IPC change to code size change as an indication of code size efficiency.
  – Average code size efficiency
$$\text{Efficiency}_{\text{average}} = \frac{\text{IPC}_{\text{candidate}} - \text{IPC}_{\text{natural\_treegion}}}{\text{code\_size}_{\text{candidate}} - \text{code\_size}_{\text{natural\_treegion}}}$$
  – Instantaneous code size efficiency
$$\text{Efficiency}_{\text{inst}} = \frac{\text{IPC}_{\text{after\_individual\_application}} - \text{IPC}_{\text{before\_individual\_application}}}{\text{code\_size}_{\text{after\_individual\_application}} - \text{code\_size}_{\text{before\_individual\_application}}}$$
9
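Average and Instantaneous Code Size Efficiency
[Figure: static IPC vs. code size curve through the points A0, A1, A2, A3, A4]
$$\text{Eff}_{\text{average}} = \frac{\text{IPC}_{A4} - \text{IPC}_{A0}}{\text{Code\_Size}_{A4} - \text{Code\_Size}_{A0}}$$
$$\text{Eff}_{\text{inst}}(A2) = \frac{\text{IPC}_{A3} - \text{IPC}_{A2}}{\text{Code\_Size}_{A3} - \text{Code\_Size}_{A2}}$$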
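The two ratios can be checked numerically with a small sketch. The sample points standing in for A0 to A4 are hypothetical values chosen for illustration, not measurements from the slides, and the function names are assumptions.

```python
def average_efficiency(points):
    """IPC gain over code size growth between the first and last point."""
    (size_first, ipc_first), (size_last, ipc_last) = points[0], points[-1]
    return (ipc_last - ipc_first) / (size_last - size_first)

def instantaneous_efficiency(points, i):
    """IPC gain over code size growth for the single step from point i to i + 1."""
    (size_a, ipc_a), (size_b, ipc_b) = points[i], points[i + 1]
    return (ipc_b - ipc_a) / (size_b - size_a)

# Hypothetical (relative code size, static IPC) samples standing in for A0..A4.
curve = [(1.00, 2.00), (1.05, 2.20), (1.10, 2.30), (1.30, 2.40), (1.60, 2.42)]

eff_avg = average_efficiency(curve)              # uses A0 and A4
eff_inst_a2 = instantaneous_efficiency(curve, 2)  # uses A2 and A3
print(eff_avg, eff_inst_a2)
```

On a curve with diminishing returns, the instantaneous efficiency of early steps is higher than the average and that of late steps is lower, which is what makes it useful for ranking individual optimizations.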
10
Estimate Static IPC Before Scheduling
• Use the expected execution time to calculate the static IPC. For a multi-path region:
$$\text{Exe\_Time}_{\text{Expected}} = \sum_{path_i} \max\left(\text{data\_dependence\_bound}_{path_i},\ \text{resource\_bound}_{path_i}\right) \times \text{Freq}_{path_i}$$
• IPC changes can then be calculated as the execution time saved by the optimization.
Example:
[Figure: treegions tree1 and tree2, and the treegion Tree1’]
[Equation: Eff_inst(A) in terms of Exe_time(tree1), Exe_time(tree2), Exe_time(Tree1’), Size(tree2), and a constant C]
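The expected-execution-time estimate for a multi-path region can be sketched as follows. The `Path` record, its field names, and the numbers are assumptions for illustration; how the two bounds are obtained for each path is not covered by the slides and is taken as given here.

```python
from typing import NamedTuple

class Path(NamedTuple):
    dep_bound: float   # cycles implied by the longest data-dependence chain
    res_bound: float   # cycles implied by the available issue resources
    freq: float        # expected execution frequency of the path

def expected_exe_time(paths):
    """Sum over paths of max(dependence bound, resource bound) * frequency."""
    return sum(max(p.dep_bound, p.res_bound) * p.freq for p in paths)

# Hypothetical two-path region before and after an optimization.
before = [Path(dep_bound=4, res_bound=3, freq=0.7),
          Path(dep_bound=6, res_bound=8, freq=0.3)]
after = [Path(dep_bound=3, res_bound=3, freq=0.7),
         Path(dep_bound=6, res_bound=7, freq=0.3)]

time_saved = expected_exe_time(before) - expected_exe_time(after)
print(expected_exe_time(before), expected_exe_time(after), time_saved)
```

Because the estimate only needs per-path bounds and profile frequencies, the time saved by a candidate can be evaluated before any scheduling is actually performed.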
11
Optimal Code Size Efficiency For A Given Code Size Limit
[Figure: static IPC vs. code size, marking the natural treegion point and the size limit]
With the code size fixed, try to maximize the static IPC, i.e., maximize the average code size efficiency.
12
Optimal Tail Duplication Under Code Size Constraint
1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
2. Find the candidate with the best code size efficiency.
3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates that are affected by the tail duplication process.
4. Repeat steps 2-3 until the code size limit is reached.
[Figure: IPC vs. relative code size, with the code size limit marked]
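The greedy loop in steps 1 to 4 might look like the sketch below. The `Candidate` record, the `update` callback, and the example numbers are hypothetical; skipping a candidate that no longer fits under the limit (rather than stopping at once) is an assumption, since the slides do not say which is done.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    time_saved: float  # expected execution time saved by duplicating
    size_cost: int     # operations added by duplicating

    @property
    def efficiency(self):
        # Instantaneous code size efficiency of this candidate.
        return self.time_saved / self.size_cost

def duplicate_under_limit(candidates, current_size, size_limit, update=None):
    """Greedily apply the most efficient candidates that fit under size_limit."""
    remaining = list(candidates)
    chosen = []
    while remaining:
        best = max(remaining, key=lambda c: c.efficiency)   # step 2
        remaining.remove(best)
        if current_size + best.size_cost > size_limit:
            continue  # assumption: skip candidates that exceed the limit
        chosen.append(best.name)                            # step 3
        current_size += best.size_cost
        if update is not None:
            # Hypothetical hook: recompute efficiencies of affected candidates.
            remaining = update(best, remaining)
    return chosen, current_size

# Hypothetical candidates; the program starts at 100 operations, limit 108.
cands = [Candidate("A", 10.0, 5), Candidate("B", 6.0, 2), Candidate("C", 4.0, 8)]
chosen, final_size = duplicate_under_limit(cands, current_size=100, size_limit=108)
print(chosen, final_size)
```

In this example B (efficiency 3.0) is applied first, then A (2.0), and C is skipped because it would push the size past the limit.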
13
Processor Model
Specification
• Execution: dispatch/issue/retire bandwidth: 8; universal function units: 8; operation latency: ALU, ST, BR: 1 cycle; LD, floating-point (FP) add/subtract: 2 cycles
• I-cache: compressed (zero-nop); two banks, each 16KB and 2-way; line size: 16 operations of 4 bytes each; miss latency: 12 cycles
• D-cache: size/associativity/replacement: 64KB/4-way/LRU; line size: 32 bytes; miss penalty: 14 cycles
• Branch predictor: G-share style multiway branch prediction [20]; branch prediction table: 2^14 entries; branch target buffer: 2^14 entries/8-way/LRU; branch misprediction penalty: 10 cycles
14
Results: ILP vs. Code Size
[Figure: 129.compress, static IPC (y-axis, 2 to 3) vs. relative code size (x-axis, 0.8 to 2.6); series: best ILP for a given code size; points labeled 0%, 2%, 5%, 30%, 80%]
15
Results: ILP vs. Code Size (cont.)
[Figure: 147.vortex, static IPC (y-axis, 2.5 to 3.5) vs. relative code size (x-axis, 0.8 to 2.2); series: best ILP for a given code size; points labeled 0%, 2%, 5%, 30%, 80%]
Reason: only a very small part of the program is frequently executed.
16
Optimal Code Size Efficiency
• Definition: the point where the ‘diminishing returns’ start
• Finding the optimal code size efficiency
[Figure: IPC vs. relative code size curve, with points A and A’ and line l]
17
Finding the Optimal Code Size Efficiency
[Figure: d(IPC)/d(Code_Size_relative) plotted against relative code size, with the levels K, K1, and K2 marked and a point labeled "A or A’"]
• K is the slope of line l
• A threshold on the first derivative of the IPC vs. code size curve is simply a threshold on the instantaneous code size efficiency!
Finding the Optimal Code Size Efficiency (cont.)
• Meaning of K1 and K2
[Figure: IPC vs. relative code size curve with points A, B, and C and lines l1 and l2]
• K1 and K2 are the slopes of the lines l1 and l2.
• The range (K1 – K2) determines the robustness of the threshold scheme.
• Point B: threshold set to K1
• Point C: threshold set to K2
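To make the robustness argument concrete, the sketch below walks a sampled IPC vs. relative code size curve and stops at the first step whose slope falls below a threshold. The curve values and the function name `knee_index` are hypothetical, and the curve is assumed to show diminishing returns (slopes that only decrease).

```python
def knee_index(curve, k):
    """Index of the last point reached while each step's slope stays >= k."""
    i = 0
    while i + 1 < len(curve):
        (size_a, ipc_a), (size_b, ipc_b) = curve[i], curve[i + 1]
        slope = (ipc_b - ipc_a) / (size_b - size_a)
        if slope < k:
            break  # diminishing returns start here
        i += 1
    return i

# Hypothetical (relative code size, IPC) samples; step slopes are roughly
# 2.0, 1.0, 0.25, and 0.025.
curve = [(1.0, 2.0), (1.1, 2.2), (1.2, 2.3), (1.4, 2.35), (1.8, 2.36)]

knee_high = knee_index(curve, 0.577)
knee_low = knee_index(curve, 0.268)
print(knee_high, knee_low)
```

Both thresholds stop at the same point in this example because the slope drops sharply at the knee. Any threshold between the slopes just before and just after the knee gives the same answer, which is the sense in which a wide range between K1 and K2 makes the scheme robust.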
19
Algorithm for Finding the Optimal Code Size Efficiency
1. Set the threshold k anywhere between tan(π/6) ≈ 0.577 and tan(π/12) ≈ 0.268.
2. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
3. If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate the candidate and update the efficiencies of the affected candidates; repeat until no such candidate remains.
When the expected execution time is used, the threshold scheme becomes (derivation details in ref [21]):
$$\left|\frac{d(\text{Exe\_time})}{d\,\text{Size}_{\text{absolute}}}\right| > k \cdot \frac{\text{Exe\_time}}{\text{IPC}_{\text{static}} \cdot \text{IC}_{\text{static}}}$$
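A minimal sketch of the threshold loop follows. The candidate names, their efficiencies, and the `update` hook are hypothetical, and taking the highest-efficiency candidate first is an assumption: the steps above only require that a duplicated candidate be above the threshold.

```python
import math

K_HIGH = math.tan(math.pi / 6)    # about 0.577
K_LOW = math.tan(math.pi / 12)    # about 0.268

def duplicate_above_threshold(efficiencies, k, update=None):
    """Duplicate candidates while any has instantaneous efficiency above k.

    efficiencies maps candidate name -> instantaneous code size efficiency.
    """
    remaining = dict(efficiencies)
    chosen = []
    while True:
        above = [name for name, eff in remaining.items() if eff > k]
        if not above:
            break
        name = max(above, key=remaining.get)  # assumption: best candidate first
        chosen.append(name)
        del remaining[name]
        if update is not None:
            # Hypothetical hook: recompute efficiencies of affected candidates.
            remaining = update(name, remaining)
    return chosen

# Hypothetical candidates and their instantaneous efficiencies.
effs = {"A": 0.9, "B": 0.4, "C": 0.1}
picked_high = duplicate_above_threshold(effs, K_HIGH)
picked_low = duplicate_above_threshold(effs, K_LOW)
print(picked_high, picked_low)
```

Unlike the code-size-limited variant, this loop needs no size budget: it stops on its own once every remaining candidate is below the threshold.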
20
Results for Optimal Code Size Efficiency
• Varying the threshold from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately.
• Use m88ksim as an example
[Figure: 124.m88ksim, static IPC (y-axis, 2 to 2.3) vs. relative code size (x-axis, 0.8 to 1.6); series: best IPC for given code sizes, threshold is 0.577, threshold is 0.268; points labeled 0%, 2%, 5%, 10%, 20%]
21
I-Cache Impacts of the Code Size Increase
[Figure: miss rates for a 32K 2-bank 2-way I-cache (y-axis: miss rate, 0.00% to 4.00%); series: I-cache miss rate of Td_Opt, I-cache miss rate of natural tree]
Code size impacts and locality impacts (ref [3])
22
I-Cache Impacts of the Code Size Increase (cont.)
[Figure: I-cache accesses ratio (Td_opt over natural treegion); y-axis 60% to 100%]
Denser schedule of optimal efficiency results
23
I-Cache Impacts of the Code Size Increase (cont.)
[Figure: the ratio of I-cache miss penalties of Td_opt over natural treegion; y-axis 0% to 250%]
The combined impact
24
Processor Performance
[Figure: ideal and realistic performance for different treegion formations; y-axis: IPC (0 to 4); series: Real_IPC and Static_IPC for Td_Opt, Havanki's heuristic, and Natural tree]
On average, a significant speedup in dynamic IPC (17% over natural treegions) at the cost of a 2% code size increase.
25
Conclusions
• Quantitative measure of code size efficiency: the ratio of IPC change to code size increase
• Best code size efficiency for a given code size limit
  – Results
    • Significant but varying impact on IPC
• Optimal efficiency: a simple yet robust threshold scheme to find the ‘knee’ of the curve
  – Results
    • Improved I-cache performance (4%)
    • Significant speedup (17%)
    • Moderate static code size increase (2%)
• Future work
  – Combine with other optimizations, e.g., loop unrolling.
26
Contact Information
Huiyang Zhou [email protected]
Tom Conte [email protected]
TINKER Research Group
North Carolina State University
www.tinker.ncsu.edu