
Code Size Efficiency in Global Scheduling for ILP Processors

TINKER Research Group
Department of Electrical & Computer Engineering
North Carolina State University

Huiyang Zhou, Tom Conte

Outline

• Introduction
• Quantitative measure of code size efficiency
• Best code size efficiency for a given code size limit
• Optimal code size efficiency for a program
• Summary
• Future work

Introduction

• Instruction level parallelism (ILP) vs. static code size
  – Region enlarging optimizations usually enhance ILP
    • Cyclic scheduling: loop unrolling, loop peeling, etc.
    • Acyclic scheduling: tail duplication, recovery code, etc.
• I-cache and ITLB performance vs. static code size
  – Larger code usually means larger I-Cache footprint
• Trade off of the conflicting effects of code size increase
  – Especially in acyclic global scheduling

Background of Treegion Scheduling

• Treegion scheduling
  – An acyclic scheduling technique
  – Two phases
    • Treegion formation
    • Treegion-based instruction scheduling: Tree Traversal Scheduling (TTS) (HPCA-4, LCPC’01)
• Treegion
  – Basic scheduling unit
  – A single-entry / multiple-exit nonlinear region with CFG forming a tree (i.e., no merge points and back-edges in a treegion)

[Figure: example CFG of basic blocks BB1 to BB6, partitioned into two treegions, Tree1 and Tree2]

Background of Treegion Scheduling

• Treegion examples

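[Figure: CFG of basic blocks BB1 to BB6 forming two natural treegions, Tree1 and Tree2]

Natural treegion: treegions formed without tail duplication (i.e., no code size increase during natural treegion formation)

[Figure: the same CFG after tail duplication, with duplicated blocks BB4’, BB5’, and BB6’, forming the enlarged treegion Tree 1’]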

Code Size Effects in Treegion Scheduling

• Tail duplication increases code size
• General operation combining reduces code size

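[Figure: tail duplication example with basic blocks BB1 to BB6 and duplicated blocks BB4’ and BB5’. Before combining, one block computes R1=R3+R4 and BB5’ computes R7=R3+R4; R9=R7*4. After general operation combining, a single R1=R3+R4 remains, the redundant R7=R3+R4 is removed, and BB5’ computes R9=R1*4.]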
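The combining step on this slide can be illustrated with a minimal sketch. It is not the TTS implementation: it only removes an operation whose opcode and sources match an earlier operation on the same root-to-leaf path, and it assumes each register is written at most once on that path. The function name `combine_operations` and the tuple layout `(dest, opcode, src1, src2)` are hypothetical.

```python
def combine_operations(path_ops):
    """Drop redundant operations along one tree path and rename their uses.

    path_ops: list of (dest, opcode, src1, src2) tuples in program order.
    Assumption (not from the slides): every register is written at most
    once on the path, so a computed value is never overwritten.
    """
    available = {}  # (opcode, src1, src2) -> register that already holds the value
    rename = {}     # destination of a removed op -> surviving register
    kept = []
    for dest, opcode, src1, src2 in path_ops:
        # Redirect sources that referred to a removed operation.
        src1 = rename.get(src1, src1)
        src2 = rename.get(src2, src2)
        key = (opcode, src1, src2)
        if key in available:
            # Same computation already done earlier on this path: combine.
            rename[dest] = available[key]
            continue
        available[key] = dest
        kept.append((dest, opcode, src1, src2))
    return kept


# The example from the slide: R7=R3+R4 duplicates R1=R3+R4.
path = [
    ("R1", "+", "R3", "R4"),
    ("R7", "+", "R3", "R4"),
    ("R9", "*", "R7", 4),
]
for op in combine_operations(path):
    print(op)
```

On the slide's example the redundant add disappears and the multiply reads R1 instead of R7, which is the code size reduction the bullet refers to.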

Quantitative Measure of Code Size Efficiency

• ILP vs. static code size

[Figure: 129.compress, static IPC vs. relative size (= scheduled code size / original code size); series: All_Possible, Havanki's heuristic, BB, Natural_tree]

Havanki’s heuristic: A treegion formation heuristic proposed before [HPCA-4].

Code Size Efficiency for Any Code Size Related Optimizations

• Use the ratio of IPC changes over code size changes as an indication of code size efficiency.
  – Average code size efficiency

$$ Efficiency_{average} = \frac{IPC_{candidate} - IPC_{natural\_treegion}}{code\_size_{candidate} - code\_size_{natural\_treegion}} $$

  – Instantaneous code size efficiency

$$ Efficiency_{inst} = \frac{IPC_{after\_individual\_application} - IPC_{before\_individual\_application}}{code\_size_{after\_individual\_application} - code\_size_{before\_individual\_application}} $$
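The two ratios can be written down directly. The following is a minimal sketch; the `SchedulePoint` container and the function names are hypothetical, and the numbers in the usage lines are made up for illustration rather than taken from the results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulePoint:
    """One point on the IPC vs. code size curve (hypothetical container)."""
    ipc: float        # static IPC of the scheduled program
    code_size: float  # static code size, absolute or relative


def efficiency(before: SchedulePoint, after: SchedulePoint) -> float:
    """Ratio of the IPC change over the code size change between two points."""
    delta_size = after.code_size - before.code_size
    if delta_size == 0:
        raise ValueError("code size did not change; efficiency is undefined")
    return (after.ipc - before.ipc) / delta_size


def average_efficiency(natural: SchedulePoint, candidate: SchedulePoint) -> float:
    """Efficiency of a candidate measured against the natural-treegion baseline."""
    return efficiency(natural, candidate)


def instantaneous_efficiency(before: SchedulePoint, after: SchedulePoint) -> float:
    """Efficiency of one individual application of an optimization."""
    return efficiency(before, after)


# Illustrative (made-up) numbers, with code size in operations.
natural = SchedulePoint(ipc=2.0, code_size=1000)
candidate = SchedulePoint(ipc=2.5, code_size=1100)
print(average_efficiency(natural, candidate))
```

Both measures share one formula; they differ only in which two points are compared (baseline vs. candidate, or just before vs. just after a single application).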

Average and Instantaneous Code Size Efficiency

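[Figure: static IPC vs. code size curve through the points A0, A1, A2, A3, and A4]

$$ Eff_{average} = \frac{IPC_{A4} - IPC_{A0}}{Code\_Size_{A4} - Code\_Size_{A0}} $$

$$ Eff_{inst}(A2) = \frac{IPC_{A3} - IPC_{A2}}{Code\_Size_{A3} - Code\_Size_{A2}} $$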

Estimate Static IPC Before Scheduling

• Use the expected execution time to calculate the static IPC. For a multi-path region:

$$ Exe\_Time_{Expected} = \sum_{path_i} Max(data\_dependence\_bound_{path_i},\ resource\_bound_{path_i}) \times Freq_{path_i} $$

• Now, IPC changes can be calculated as execution time saved by the optimization.

Example:

[Figure: tree1 and tree2 are merged by tail duplication into Tree1’]

$$ Eff_{inst}(A) \propto \frac{Exe\_time(tree1) + Exe\_time(tree2) - Exe\_time(Tree1')}{Size(tree2)} \quad \text{(up to the constant factor } C\text{)} $$
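The expected execution time estimate for a multi-path region can be sketched as follows. The `PathEstimate` container and both function names are hypothetical. The `resource_bound` helper is an assumption, not something the slides define: it takes the resource bound to be the operation count divided by the issue width, rounded up.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PathEstimate:
    """Per-path estimates for one root-to-exit path of a region (hypothetical)."""
    dependence_bound: float  # cycles implied by the data dependence height
    resource_bound: float    # cycles implied by the available function units
    freq: float              # execution frequency of the path, from profiling


def resource_bound(num_ops: int, issue_width: int = 8) -> int:
    """Assumed resource bound: operations divided by issue width, rounded up."""
    return math.ceil(num_ops / issue_width)


def expected_exe_time(paths: Iterable[PathEstimate]) -> float:
    """Sum over paths of max(dependence bound, resource bound) times frequency."""
    return sum(max(p.dependence_bound, p.resource_bound) * p.freq for p in paths)


# Illustrative (made-up) region with two paths.
paths = [
    PathEstimate(dependence_bound=5, resource_bound=resource_bound(24), freq=0.7),
    PathEstimate(dependence_bound=4, resource_bound=resource_bound(48), freq=0.3),
]
print(expected_exe_time(paths))  # about 5.3 cycles
```

The first path is limited by its dependence height and the second by resources, so each contributes the larger of its two bounds, weighted by how often it runs.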

Optimal Code Size Efficiency For A Given Code Size Limit

[Figure: static IPC vs. code size, marking the natural treegion point and the size limit]

For a fixed code size, maximize the static IPC, i.e., maximize the average code size efficiency.

Optimal Tail Duplication Under Code Size Constraint

1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.

2. Find the one with the best code size efficiency.

3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates that are affected by the tail duplication process.

4. Repeat steps 2-3 until the code size limit is reached.
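The four steps above can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' implementation: `Candidate`, `greedy_tail_duplication`, and the `on_duplicate` callback (which stands in for "update the efficiencies of affected candidates") are hypothetical. The sketch also assumes the loop stops at the first best candidate that no longer fits, which the slide does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """A tail duplication candidate (hypothetical representation)."""
    name: str
    ipc_gain: float  # estimated static IPC increase if duplicated
    size_cost: int   # code size increase in operations, assumed > 0

    @property
    def efficiency(self) -> float:
        """Instantaneous code size efficiency of this candidate."""
        return self.ipc_gain / self.size_cost


def greedy_tail_duplication(
    candidates: List[Candidate],
    size_budget: int,
    on_duplicate: Optional[Callable[[Candidate, Dict[str, Candidate]], None]] = None,
) -> Tuple[List[str], int]:
    """Pick candidates by best efficiency until the size budget is reached."""
    remaining = {c.name: c for c in candidates}  # step 1: all candidates scored
    chosen: List[str] = []
    used = 0
    while remaining:
        # Step 2: find the candidate with the best efficiency.
        best = max(remaining.values(), key=lambda c: c.efficiency)
        # Assumption: stop as soon as the best candidate exceeds the limit.
        if used + best.size_cost > size_budget:
            break
        # Step 3: perform the duplication and refresh affected candidates.
        del remaining[best.name]
        chosen.append(best.name)
        used += best.size_cost
        if on_duplicate is not None:
            on_duplicate(best, remaining)
    return chosen, used


# Illustrative (made-up) candidates and a budget of 16 operations.
cands = [Candidate("A", 0.3, 10), Candidate("B", 0.1, 20), Candidate("C", 0.2, 5)]
print(greedy_tail_duplication(cands, size_budget=16))
```

Re-evaluating the best candidate on every iteration matters because one duplication can change the gain or cost of neighboring candidates; the callback is where such updates would go.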

[Figure: IPC vs. relative code size, with the code size limit marked]

Processor Model

Specification

Execution: Dispatch/Issue/Retire bandwidth: 8; Universal function units: 8; Operation latency: ALU, ST, BR: 1 cycle; LD, floating-point (FP) add/subtract: 2 cycles.

I-cache: Compressed (zero-nop), two banks, each bank 2-way 16KB. Line size: 16 operations, 4 bytes per operation. Miss latency: 12 cycles.

D-cache: Size/Associativity/Replacement: 64KB/4-way/LRU; Line size: 32 bytes; Miss penalty: 14 cycles.

Branch Predictor: G-share style multiway branch prediction [20]. Branch prediction table: 2^14 entries; Branch target buffer: 2^14 entries/8-way/LRU. Branch misprediction penalty: 10 cycles.

Results: ILP vs. Code Size

[Figure: 129.compress, static IPC vs. relative code size; curve: best ILP for a given code size; points labeled 0%, 2%, 5%, 30%, and 80%]

Results: ILP vs. Code Size (cont.)

[Figure: 147.vortex, static IPC vs. relative code size; curve: best ILP for a given code size; points labeled 0%, 2%, 5%, 30%, and 80%]

Reason: only a very small part of the program is frequently executed.

Optimal Code Size Efficiency

• Definition: the point where the ‘diminishing returns’ start

• Finding the optimal code size efficiency

[Figure: IPC vs. relative code size, with points A and A’ and line l]

Finding the Optimal Code Size Efficiency

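[Figure: first derivative d(IPC)/d(Code_Size_relative) vs. relative code size, with the levels K, K1, and K2 marked; the threshold K selects point A or A’]

• Use a threshold on the first derivative of the IPC vs. code size curve, which is simply a threshold on the instantaneous code size efficiency.
• K is the slope of line l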

Finding the Optimal Code Size Efficiency (cont.)

• Meaning of K1 and K2

[Figure: IPC vs. relative code size, with points A, B, and C and lines l1 and l2]

• K1 and K2 are the slopes of the lines l1 and l2.

• The range (K1 – K2) determines the robustness of the threshold scheme.

• Point B: threshold set to K1
• Point C: threshold set to K2

Algorithm for Finding the Optimal Code Size Efficiency

1. Set the threshold k anywhere between tan(π/6) and tan(π/12).

2. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.

3. If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate the candidate and update the efficiency of the affected candidates. Repeat until no candidate remains above the threshold.
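The threshold scheme can be sketched as a loop. This is an illustration under assumptions rather than the authors' implementation: `Candidate`, `threshold_tail_duplication`, and the `on_duplicate` callback are hypothetical, the size cost is expressed as a fraction of the original code size so that the efficiency is a slope on the IPC vs. relative code size curve, and the most efficient candidate is duplicated first (the slide does not fix an order).

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    """A tail duplication candidate (hypothetical representation)."""
    name: str
    ipc_gain: float       # estimated static IPC increase if duplicated
    rel_size_cost: float  # size increase as a fraction of original code size, > 0

    @property
    def efficiency(self) -> float:
        """Slope on the IPC vs. relative code size curve."""
        return self.ipc_gain / self.rel_size_cost


def threshold_tail_duplication(
    candidates: List[Candidate],
    k: float,
    on_duplicate: Optional[Callable[[Candidate, Dict[str, Candidate]], None]] = None,
) -> List[str]:
    """Duplicate candidates while any has efficiency above the threshold k."""
    remaining = {c.name: c for c in candidates}
    chosen: List[str] = []
    while True:
        above = [c for c in remaining.values() if c.efficiency > k]
        if not above:
            break
        # Assumption: take the most efficient candidate first.
        best = max(above, key=lambda c: c.efficiency)
        del remaining[best.name]
        chosen.append(best.name)
        if on_duplicate is not None:
            on_duplicate(best, remaining)  # refresh affected candidates
    return chosen


# Illustrative (made-up) candidates; the two thresholds bound the suggested range.
cands = [Candidate("A", 0.05, 0.02), Candidate("B", 0.01, 0.05), Candidate("C", 0.03, 0.06)]
print(round(math.tan(math.pi / 6), 3), threshold_tail_duplication(cands, math.tan(math.pi / 6)))
print(round(math.tan(math.pi / 12), 3), threshold_tail_duplication(cands, math.tan(math.pi / 12)))
```

Unlike the size-constrained variant, no budget is needed here: the loop ends on its own once every remaining candidate falls below the slope k.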

When the expected execution time is used, the threshold scheme becomes (derivation details in ref [21]):

$$ -\frac{d(Exe\_time)}{d\,Size_{absolute}} \ge k \cdot \frac{Exe\_time}{IPC_{static} \cdot IC_{static}} $$
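One way to see how a threshold on the slope of the IPC vs. relative code size curve turns into a condition on execution time is sketched below. The assumptions are not stated on the slide: static IPC is taken as N / Exe_time for a fixed, profile-weighted operation count N, and relative size as Size_absolute / IC_static, with IC_static the original static operation count. The full derivation is the one in ref [21]; this is only a plausibility check.

```latex
\begin{align*}
IPC_{static} &= \frac{N}{Exe\_time},
  \qquad Size_{relative} = \frac{Size_{absolute}}{IC_{static}} \\
\frac{d\,IPC_{static}}{d\,Size_{relative}}
  &= -\frac{N \cdot IC_{static}}{Exe\_time^{2}} \cdot \frac{d(Exe\_time)}{d\,Size_{absolute}}
   = -\frac{IPC_{static} \cdot IC_{static}}{Exe\_time} \cdot \frac{d(Exe\_time)}{d\,Size_{absolute}} \\
\frac{d\,IPC_{static}}{d\,Size_{relative}} \ge k
  &\iff -\frac{d(Exe\_time)}{d\,Size_{absolute}} \ge k \cdot \frac{Exe\_time}{IPC_{static} \cdot IC_{static}}
\end{align*}
```

Under these assumptions, the execution time saved per added operation has to exceed k scaled by a per-program constant, so the same k can be reused across programs of different size and speed.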

Results for Optimal Code Size Efficiency

• As the threshold varies from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately.

• Use m88ksim as an example

[Figure: 124.m88ksim, static IPC vs. relative code size; series: best IPC for given code sizes, threshold is 0.577, threshold is 0.268; points labeled 0%, 2%, 5%, 10%, and 20%]

I-Cache Impacts of the Code Size Increase

[Figure: miss rates for a 32K 2-bank 2-way I-cache; y-axis: miss rate, 0.00% to 4.00%; series: I-cache miss rate of Td_Opt, I-cache miss rate of natural tree]

Code size impacts and locality impacts (ref [3])

I-Cache Impacts of the Code Size Increase (cont.)

[Figure: I-cache accesses ratio (Td_opt over natural treegion); y-axis: 60% to 100%]

Denser schedule of optimal efficiency results

I-Cache Impacts of the Code Size Increase (cont.)

[Figure: the ratio of I-cache miss penalties of Td_opt over natural treegion; y-axis: 0% to 250%]

The combined impact

Processor Performance

[Figure: ideal and realistic performance for different treegion formations; y-axis: IPC, 0 to 4; series: Real_IPC (Td_Opt), Real_IPC (Havanki's heuristic), Real_IPC (Natural tree), Static_IPC (Td_Opt), Static_IPC (Havanki's heuristic), Static_IPC (Natural tree)]

On average, a significant speedup in dynamic IPC (17% over natural treegion) at the cost of a 2% code size increase.

Conclusions

• Quantitative measure of the code size efficiency: the ratio of IPC changes over code size increase
• Best code size efficiency for a given code size limit
  – Results
    • Significant but varying impact on IPC
• Optimal efficiency: simple yet robust threshold scheme to find the ‘knee’ of the curve
  – Results
    • Improved I-cache performance (4%)
    • Significant speedup (17%)
    • Moderate static code size increase (2%)
• Future Work
  – Combine with other optimizations, e.g., loop unrolling.

Contact Information

Huiyang Zhou [email protected]

Tom Conte [email protected]

TINKER Research Group
North Carolina State University
www.tinker.ncsu.edu
