More Code Optimization

Page 1: More Code Optimization

1

More Code Optimization

Page 2: More Code Optimization

2

Outline

• More Code Optimization techniques
• Optimization Limiting Factors
• Memory Performance
• Suggested reading
  – 5.8 ~ 5.12

Page 3: More Code Optimization

Review

void combine4(vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t x = IDENT;

    for (i = 0; i < length; i++)
        x = x OP data[i];
    *dest = x;
}
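The combine functions assume the CS:APP-style vector abstraction; the slides do not show its definition, so the following is a minimal sketch of plausible supporting declarations (data_t, IDENT, OP, vec_ptr, vec_length, and get_vec_start are the names the slides use; the struct field names and the choice of long/+ here are illustrative):

typedef long data_t;          /* element type; IDENT/OP pick the operation */
#define IDENT 0               /* identity element: 0 for +, 1 for * */
#define OP +                  /* combining operation */

typedef struct {
    long len;                 /* number of elements */
    data_t *data;             /* pointer to element storage */
} vec_rec, *vec_ptr;

/* return vector length */
long vec_length(vec_ptr v) { return v->len; }

/* return pointer to the start of the element array */
data_t *get_vec_start(vec_ptr v) { return v->data; }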

[Data-flow diagram of the combine4 inner loop: each iteration issues load, mul, add, cmp, and jg operations on registers %rax, %rbp, %rdx, %xmm0; the chain of multiply/add operations through %xmm0 over data[0], data[1], ..., data[n-1] forms the critical path.]

Page 4: More Code Optimization

4

Review

Nehalem (Core i7)

Instruction                 Latency    Cycles/Issue
Integer Add                    1          0.33
Integer Multiply               3          1
Integer/Long Divide         11--21       5--13
Single/Double FP Add           3          1
Single/Double FP Multiply     4/5         1
Single/Double FP Divide     10--23       6--19

                    Integer        Floating Point
Function            +      *       +      F*     D*
combine1           12     12      12      12     13
combine4            2      3       3       4      5

Latency             1      3       3       4      5
Throughput          1      1       1       1      1
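For reference, CPE (cycles per element) models the running time as approximately linear in the number of elements n; the figures in these tables are the slopes of such fits:

    T(n) ≈ CPE · n + overhead

For example, at a CPE of 2, combining n = 1000 elements costs roughly 2000 cycles plus the fixed overhead.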

Page 5: More Code Optimization

5

Loop Unrolling

void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int x = IDENT;

    /* combine 3 elements at a time */
    for (i = 0; i < limit; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];

    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

Page 6: More Code Optimization

6

– Loads can pipeline, since they have no dependencies on each other

– Only one set of loop control operations

Translation

load  (%rax,%rdx.0,4)     t.1a
addq  t.1a, %rcx.0        %rcx.1a
load  4(%rax,%rdx.0,4)    t.1b
addq  t.1b, %rcx.1a       %rcx.1b
load  8(%rax,%rdx.0,4)    t.1c
addq  t.1c, %rcx.1b       %rcx.1c
addq  $3, %rdx.0          %rdx.1
cmpq  %rdx.1, %rbp        cc.1
jg-taken cc.1

Page 7: More Code Optimization

7

Graphical Representation

addq  (%rax,%rdx,4), %xmm0
addq  4(%rax,%rdx,4), %xmm0
addq  8(%rax,%rdx,4), %xmm0
addq  $3, %rdx
cmpq  %rdx, %rbp
jg    loop

[Data-flow graph: each memory add decomposes into a load (producing t.a, t.b, t.c) followed by an add into the accumulator; the loop control is add, cmp, jg on registers %rax, %rbp, %rdx, %xmm0.]

Page 8: More Code Optimization

8

Graphical Representation

[Abstracted data-flow graph of one unrolled iteration: three load/add pairs for data[i], data[i+1], and data[i+2] feed a single accumulator chain through %xmm0, alongside the %rdx update and the cmp/jg loop control.]

Page 9: More Code Optimization

9

Graphical Representation

[Data-flow graph across iterations: the adds for data[0], data[1], data[2], data[3], data[4], data[5], ... still form one serial chain through the accumulator, while the loads can overlap.]

Page 10: More Code Optimization

10

Loop Unrolling

• Improves performance
  – Reduces the number of loop overhead operations
    (e.g. loop indexing and conditional branching)
  – Transforms the code to reduce the number of operations in the critical paths

• CPEs for both integer ops improve

• But CPEs for both floating-point ops do not

Page 11: More Code Optimization

11

Effect of Unrolling

[Data-flow graphs of one unrolled iteration: for FP multiply, the three muls for data[i], data[i+1], data[i+2] still form a serial chain through the accumulator; for integer add, the corresponding three adds form the chain.]

• Critical path
  – Latency of integer add is 1
  – Latency of FP multiply is 4

Page 12: More Code Optimization

12

Effect of Unrolling

• Only helps integer sum for our examples
  – Other cases constrained by functional unit latencies

• Effect is nonlinear with degree of unrolling
  – Many subtle effects determine exact scheduling of operations

Page 13: More Code Optimization

13

Enhance Parallelism

• Multiple Accumulators
  – Accumulate in two different sums
    • Can be performed simultaneously
  – Combine at end

Page 14: More Code Optimization

14

Multiple Accumulators

void combine6(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v), limit = length - 1;
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;

    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }

    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}
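The (U*5, P*5) entries in the later summary table refer to the same idea with more accumulators. As an illustration only (not code from the slides), here is a sketch of the pattern with four accumulators, under the same vec_ptr/OPER/IDENT conventions; combine6_x4 is a hypothetical name:

void combine6_x4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v), limit = length - 3;
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;

    /* four independent accumulators: four OPER chains can proceed in parallel */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
        x2 = x2 OPER data[i+2];
        x3 = x3 OPER data[i+3];
    }

    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = (x0 OPER x1) OPER (x2 OPER x3);
}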

Page 15: More Code Optimization

15

Translation

load  (%rax,%rdx.0,4)     t.1a
mulq  t.1a, %xmm0.0       %xmm0.1
load  4(%rax,%rdx.0,4)    t.1b
mulq  t.1b, %xmm1.0       %xmm1.1
addq  $2, %rdx.0          %rdx.1
cmpq  %rdx.1, %rbp        cc.1
jg-taken cc.1

• The two multiplies within the loop no longer have a data dependency

• Allows them to pipeline

Page 16: More Code Optimization

16

Graphical Representation

mulss (%rax,%rdx,4), %xmm0
mulss 4(%rax,%rdx,4), %xmm1
addq  $2, %rdx
cmpq  %rdx, %rbp
jg    loop

[Data-flow graph: the two mulss instructions each decompose into a load (t.a, t.b) and a mul, one accumulating into %xmm0 and one into %xmm1; the loop control is add, cmp, jg on %rax, %rbp, %rdx.]

Page 17: More Code Optimization

17

Graphical Representation

[Abstracted data-flow graph of one combine6 iteration: two independent load/mul chains, one for data[i] into %xmm0 and one for data[i+1] into %xmm1, plus the %rdx add and the cmp/jg loop control.]

Page 18: More Code Optimization

18

Graphical Representation

[Data-flow graph across iterations: the multiplies for data[0], data[2], ..., data[n-2] accumulate in %xmm0 while those for data[1], data[3], ..., data[n-1] accumulate in %xmm1, giving two parallel critical paths of half the original length.]

Page 19: More Code Optimization

19

Enhance Parallelism

• Multiple Accumulators
  – Accumulate in two different sums
    • Can be performed simultaneously
  – Combine at end

• Re-association Transformation
  – Exploits property that integer addition & multiplication are associative & commutative
  – FP addition & multiplication are not associative, but the transformation is usually acceptable (see the small demonstration below)

Page 20: More Code Optimization

20

Re-association Transformation

void combine7(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v), limit = length - 1;
    int *data = get_vec_start(v);
    int x = IDENT;

    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x OPER (data[i] OPER data[i+1]);
    }

    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

Page 21: More Code Optimization

21

Translation

load  (%rax,%rdx.0,4)     %xmm0.0
load  4(%rax,%rdx.0,4)    t.1
mulq  t.1, %xmm0.0        %xmm0.1
mulq  %xmm0.1, %xmm1.0    %xmm1.1
addq  $2, %rdx.0          %rdx.1
cmpq  %rdx.1, %rbp        cc.1
jg-taken cc.1

• The multiply of data[i] and data[i+1] no longer depends on the accumulated value from the previous iteration

• Allows it to overlap (pipeline) with the accumulating multiply

Page 22: More Code Optimization

22

Graphical Representation

movss (%rax,%rdx,4), %xmm0
mulss 4(%rax,%rdx,4), %xmm0
mulss %xmm0, %xmm1
addq  $2, %rdx
cmpq  %rdx, %rbp
jg    loop

[Data-flow graph: the load and the first mul compute data[i] OPER data[i+1] into %xmm0 without touching the accumulator; only the second mul, into %xmm1, lies on the loop-carried chain, alongside the add, cmp, jg loop control.]

Page 23: More Code Optimization

23

Graphical Representation

[Abstracted data-flow graph of one combine7 iteration: the load/mul pair combining data[i] and data[i+1] feeds a single mul into the accumulator %xmm1, alongside the %rdx add and the cmp/jg loop control.]

Page 24: More Code Optimization

24

Graphical Representation

[Data-flow graph across iterations: only one mul per pair of elements (data[0]/data[1], data[2]/data[3], ...) lies on the loop-carried critical path through %xmm1; the element-pair multiplies are off the critical path and can overlap.]

Page 25: More Code Optimization

25

Summary of Results

                      Integer        Floating Point
Function              +      *       +      F*     D*
combine1             12     12      12      12     13
combine6 (U*2, P*2)  1.5    1.5     1.5     2      2.5
combine6 (U*5, P*5)   1      1       1       1      1

Latency               1      3       3       4      5
Throughput            1      1       1       1      1

• Optimization Results for Combining
  – Achieve a CPE close to 1.0 for all combinations
  – Performance improvement of over 10X

Machine-Independent Optimizations
  • Eliminating loop inefficiencies
  • Reducing procedure calls
  • Eliminating unneeded memory references

Machine-Dependent Optimizations
  • Loop Unrolling
  • Multiple Accumulators
  • Reassociation

Page 26: More Code Optimization

26

Optimization Limiting Factors

• Register spilling
  – Only 6 registers available (IA32)
  – Using the stack (memory) as storage

                    Degree of Unrolling
Machine        1      2      3      4      5      6
IA32         2.12   1.76   1.45   1.39   1.90   1.99
x86-64       2.00   1.50   1.00   1.00   1.01   1.00

Page 27: More Code Optimization

27

Example

IA32 code. Unroll x5, accumulate x5.
data_t = int, OP = *, i in %edx, data in %eax, limit at %ebp-20

.L291:                                     loop:
  imull (%eax,%edx,4), %ecx                  x0 = x0 * data[i]
  movl  -16(%ebp), %ebx                      Get x1
  imull 4(%eax,%edx,4), %ebx                 x1 = x1 * data[i+1]
  movl  %ebx, -16(%ebp)                      Store x1
  imull 8(%eax,%edx,4), %edi                 x2 = x2 * data[i+2]
  imull 12(%eax,%edx,4), %esi                x3 = x3 * data[i+3]
  movl  -28(%ebp), %ebx                      Get x4
  imull 16(%eax,%edx,4), %ebx                x4 = x4 * data[i+4]
  movl  %ebx, -28(%ebp)                      Store x4
  addl  $5, %edx                             i += 5
  cmpl  %edx, -20(%ebp)                      Compare limit:i
  jg    .L291                                If >, goto loop

Page 28: More Code Optimization

28

Optimization Limiting Factors

• Register Spilling
  – Only 6 registers available (IA32)
  – Using the stack (memory) as storage

• Branch Prediction
  – Speculative Execution
  – Misprediction Penalties

Page 29: More Code Optimization

29

Branch Prediction

• Challenges
  – Instruction Control Unit must work well ahead of Execution Unit
  – To generate enough operations to keep EU busy

• Speculative Execution
  – Guess which way branch will go
  – Begin executing instructions at predicted position
    • But don't actually modify register or memory data

Page 30: More Code Optimization

30

Example: Loop

Branch Taken (loop back to 80488b1):

  80488b1:  movl (%ecx,%edx,4),%eax
  80488b4:  addl %eax,(%edi)
  80488b6:  incl %edx
  80488b7:  cmpl %esi,%edx
  80488b9:  jl   80488b1

Branch Not-Taken (fall through):

  80488b1:  movl (%ecx,%edx,4),%eax
  80488b4:  addl %eax,(%edi)
  80488b6:  incl %edx
  80488b7:  cmpl %esi,%edx
  80488b9:  jl   80488b1
  80488be:  movl (%ecx,%edx,4),%eax

Page 31: More Code Optimization

31

Branch Prediction Through Loop

[Figure: the loop body (movl, addl, incl, cmpl, jl at 80488b1--80488b9) repeated for i = 98, 99, 100, 101, assuming vector length = 100. The i = 98 and i = 99 iterations are executed; their loop-closing branch is predicted taken, which is correct (OK) until the last valid iteration, where the prediction is wrong (oops). The i = 100 and i = 101 iterations are therefore only fetched, and one of their loads would even read an invalid location.]

Page 32: More Code Optimization

32

Branch Misprediction Invalidation

[Figure: same loop, i = 98 through 101, assuming vector length = 100. When the mispredicted branch resolves, the speculatively fetched instructions for i = 100 and i = 101 are invalidated; the correctly predicted iterations remain.]

Page 33: More Code Optimization

33

Branch Misprediction Invalidation

[Figure: same loop, assuming vector length = 100. The branch closing the i = 98 iteration is predicted taken (OK); the branch closing the i = 99 iteration is definitely not taken, so fetching continues at 80488be, the first instruction after the loop.]

Page 34: More Code Optimization

34

Branch Misprediction Recovery

• Performance Cost
  – A misprediction on Core i7 wastes ~44 clock cycles
  – Don't be overly concerned about predictable branches
  – E.g. loop-closing branches would typically be predicted as taken, and only incur a misprediction penalty the last time around
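As a rough worked example (the 100-iteration count is only illustrative): a loop that runs 100 iterations and mispredicts only its final, loop-closing branch pays about 44 / 100 ≈ 0.44 cycles per iteration, which is usually negligible next to the work in the loop body.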

Page 35: More Code Optimization

35

Write Suitable Code

void minmax1(int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            int t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

void minmax2(int a[], int b[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int min = a[i] < b[i] ? a[i] : b[i];
        int max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}

Function      random     predictable
minmax1        14.5        3.0--4.0
minmax2         5.0        5.0
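The "random" vs. "predictable" columns refer to the pattern of the a[i] > b[i] outcomes. Below is a minimal sketch of how such a comparison could be run; the array size N, the repetition count REPS, the helper time_one, and the use of clock() are illustrative choices, not from the slides:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* minmax1 and minmax2 are the functions shown above */
void minmax1(int a[], int b[], int n);
void minmax2(int a[], int b[], int n);

#define N 10000
#define REPS 2000

/* Time REPS calls of f. The inputs are restored from pristine copies
   before every call, since minmax1/minmax2 modify them (a sorted pair
   would otherwise make the branch predictable after the first call).
   The memcpy cost is the same for both functions. */
static double time_one(void (*f)(int *, int *, int),
                       const int *a0, const int *b0)
{
    static int a[N], b[N];
    clock_t start = clock();
    for (int r = 0; r < REPS; r++) {
        memcpy(a, a0, sizeof a);
        memcpy(b, b0, sizeof b);
        f(a, b, N);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    static int a0[N], b0[N];

    /* "random" case: whether a0[i] > b0[i] is unpredictable */
    for (int i = 0; i < N; i++) { a0[i] = rand(); b0[i] = rand(); }
    printf("random:      minmax1 %.3fs   minmax2 %.3fs\n",
           time_one(minmax1, a0, b0), time_one(minmax2, a0, b0));

    /* "predictable" case: a0[i] <= b0[i] always, so the branch in
       minmax1 is never taken */
    for (int i = 0; i < N; i++) { a0[i] = i; b0[i] = i + 1; }
    printf("predictable: minmax1 %.3fs   minmax2 %.3fs\n",
           time_one(minmax1, a0, b0), time_one(minmax2, a0, b0));

    return 0;
}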

Page 36: More Code Optimization

[Block diagram of the out-of-order processor: the Instruction Control unit (fetch control, instruction decode, instruction cache) sends decoded operations to the Execution unit's functional units (integer/branch, general integer, FP add, FP mult/div, load, store); the load and store units exchange addresses and data with the data cache, branch outcomes feed back as "prediction OK?", and the retirement unit commits operation results to the register file.]

Page 37: More Code Optimization

37

Load Performance

• The load unit can only initiate one load operation every clock cycle (Issue = 1.0)

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls)
{
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

len in %eax, ls in %rdi
.L11:
  addl  $1, %eax
  movq  (%rdi), %rdi
  testq %rdi, %rdi
  jne   .L11

Function     CPE
list_len     4.0

load latency 4.0
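The CPE of 4.0 matches the 4-cycle load latency because each iteration's load of ls->next cannot start until the previous load completes. A minimal standalone sketch of building a list and calling list_len (the list length and construction are illustrative):

#include <stdio.h>
#include <stdlib.h>

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls)
{
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;   /* each load depends on the previous load's result */
    }
    return len;
}

int main(void)
{
    /* build a list of 1000 nodes by prepending */
    list_ptr head = NULL;
    for (int i = 0; i < 1000; i++) {
        list_ptr node = malloc(sizeof *node);
        node->data = i;
        node->next = head;
        head = node;
    }

    printf("length = %d\n", list_len(head));   /* prints: length = 1000 */
    return 0;
}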

Page 38: More Code Optimization

38

Store Performance

• The store unit can only initiate one store operation every clock cycle (Issue = 1.0)

void array_clear_4(int *dest, int n)
{
    int i;
    int limit = n - 3;
    for (i = 0; i < limit; i += 4) {
        dest[i] = 0;
        dest[i+1] = 0;
        dest[i+2] = 0;
        dest[i+3] = 0;
    }
    for ( ; i < n; i++)
        dest[i] = 0;
}

Function        CPE
array_clear_4   1.0

Page 39: More Code Optimization

39

Store Performance

void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

Example A: write_read(&a[0], &a[1], 3)

            a[0]  a[1]   cnt   val
initial     -10    17     3     0
iter 1      -10     0     2    -9
iter 2      -10    -9     1    -9
iter 3      -10    -9     0    -9

Example B: write_read(&a[0], &a[0], 3)

            a[0]  a[1]   cnt   val
initial     -10    17     3     0
iter 1        0    17     2     1
iter 2        1    17     1     2
iter 3        2    17     0     3

Function     CPE
Example A    2.0
Example B    6.0
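A minimal sketch that reproduces the two traces above (the main driver is added here only for illustration; the printed final states follow directly from the tables):

#include <stdio.h>

void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;          /* store */
        val = (*src) + 1;     /* load, possibly of the location just stored */
    }
}

int main(void)
{
    int a[2];

    /* Example A: store to a[1], load from a[0] -- independent locations */
    a[0] = -10; a[1] = 17;
    write_read(&a[0], &a[1], 3);
    printf("Example A: a = {%d, %d}\n", a[0], a[1]);   /* {-10, -9} */

    /* Example B: store and load both use a[0] -- each load must wait
       for the preceding store */
    a[0] = -10; a[1] = 17;
    write_read(&a[0], &a[0], 3);
    printf("Example B: a = {%d, %d}\n", a[0], a[1]);   /* {2, 17} */

    return 0;
}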

Page 40: More Code Optimization

40

Load and Store Units

[Diagram: the store unit keeps pending stores (address + data) in a store buffer; the load unit compares its load address against the buffered store addresses and, on a match, can take the data from the store buffer, otherwise it reads the data cache. Both units exchange addresses and data with the data cache.]

Page 41: More Code Optimization

41

Graphical Representation

/* inner loop */
while (cnt--) {
    *dest = val;
    val = (*src) + 1;
}

movl  %eax, (%ecx)      # *dest = val   (s_addr, s_data)
movl  (%ebx), %eax      # load *src
addl  $1, %eax          # val = *src + 1
subl  $1, %edx          # cnt--
jne   loop

[Data-flow graph on registers %eax, %ebx, %ecx, %edx: the store splits into s_addr and s_data operations, followed by the load, the add, the sub for the counter, and the jne.]

Page 42: More Code Optimization

42

Graphical Representation

[Refined data-flow graph on %eax, %edx: the store is split into s_addr and s_data, the load may have to wait for the store (via the store buffer), and the add produces the new val; the sub and the branch handle the loop counter. Three dependences, labeled 1, 2, 3 in the figure, connect the store, the load, and the add.]

Page 43: More Code Optimization

Graphical Representation

[Critical-path comparison for the two calls: in Example A the load and store use different addresses, so only the counter (sub) chain is carried across iterations, giving CPE 2.0; in Example B each load must wait for the preceding store to the same address, so the store-data -> load -> add chain is carried across iterations, giving CPE 6.0.]

Function     CPE
Example A    2.0
Example B    6.0

Page 44: More Code Optimization

Getting High Performance

• Good compiler and flags (see the example command below)

• Don't do anything stupid
  – Watch out for hidden algorithmic inefficiencies
  – Write compiler-friendly code
    • Watch out for optimization blockers: procedure calls & memory references
  – Look carefully at innermost loops

• Tune code for machine
  – Exploit instruction-level parallelism
  – Avoid unpredictable branches
  – Make code cache friendly
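As one concrete illustration of "good compiler and flags" (the exact flag choice and the file name combine.c are assumptions, not from the slides; -O2/-O3, -march=native, and -funroll-loops are standard gcc options):

# typical optimizing build with gcc
gcc -O2 -march=native combine.c -o combine

# more aggressive: also let the compiler unroll loops itself
gcc -O3 -funroll-loops -march=native combine.c -o combine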