
Code Optimization

1

Outline

• Machine-Independent Optimization
– Code motion
– Memory optimization

• Suggested reading

– 5.1 ~ 5.6

2

Motivation

• Constant factors matter too!

– easily see 10:1 performance range depending on how code is written

– must optimize at multiple levels

• algorithm, data representations, procedures, and loops

3

Motivation

• Must understand the system to optimize performance

– how programs are compiled and executed

– how to measure program performance and identify bottlenecks

– how to improve performance without destroying code modularity and generality

4

Generally Useful Optimizations

• Optimizations that you or the compiler should do regardless of processor / compiler

• Code Motion
– Reduce the frequency with which a computation is performed
• If it will always produce the same result
• Especially moving code out of a loop

Original:

void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

After code motion (n*i is computed once, outside the loop):

    long j;
    long ni = n*i;
    for (j = 0; j < n; j++)
        a[ni+j] = b[j];
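A minimal sketch of the hoisting shown on this slide, written as two complete functions so the versions can be compared side by side. The names set_row_naive and set_row_hoisted are hypothetical labels, not names from the slides, and the assert preconditions are an added assumption.

```c
#include <assert.h>

/* Original form: the product n*i appears inside the loop body. */
void set_row_naive(double *a, const double *b, long i, long n)
{
    assert(n >= 0 && i >= 0);
    for (long j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* After code motion: the loop-invariant product n*i is computed once. */
void set_row_hoisted(double *a, const double *b, long i, long n)
{
    assert(n >= 0 && i >= 0);
    long ni = n * i;              /* same value on every iteration */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```

Both functions copy b into row i of an n-column matrix stored in a; only the placement of the multiplication differs.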

5

Compiler-Generated Code Motion

Source:

void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

Equivalent C for the code the compiler generates:

    long j;
    long ni = n*i;
    double *rowp = a+ni;
    for (j = 0; j < n; j++)
        *rowp++ = b[j];

Generated assembly:

set_row:
    testq   %rcx, %rcx              # Test n
    jle     .L4                     # If 0, goto done
    movq    %rcx, %rax              # rax = n
    imulq   %rdx, %rax              # rax *= i
    leaq    (%rdi,%rax,8), %rdx     # rowp = A + n*i*8
    movl    $0, %r8d                # j = 0
.L3:                                # loop:
    movq    (%rsi,%r8,8), %rax      # t = b[j]
    movq    %rax, (%rdx)            # *rowp = t
    addq    $1, %r8                 # j++
    addq    $8, %rdx                # rowp++
    cmpq    %r8, %rcx               # Compare n:j
    jg      .L3                     # If >, goto loop
.L4:                                # done:
    rep ; ret

Where are the FP operations?

6

Reduction in Strength

– Replace costly operation with simpler one
– Shift, add instead of multiply or divide
    16*x --> x << 4
• Utility is machine dependent
• Depends on cost of multiply or divide instruction
– On Intel Nehalem, integer multiply requires 3 CPU cycles
– Recognize sequence of products

Original:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After strength reduction (the multiply n*i becomes a running sum):

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}
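The two loop nests above can be wrapped into complete functions for comparison. This is a sketch only: the function names fill_rows_mul, fill_rows_add, and times16_shift are hypothetical, and the shift version assumes an unsigned operand so that the shift is well defined for every input.

```c
#include <assert.h>

/* Index computed with a multiply: n*i appears in the loop body. */
void fill_rows_mul(double *a, const double *b, long n)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* Strength-reduced: the running offset ni replaces the multiply with an add. */
void fill_rows_add(double *a, const double *b, long n)
{
    assert(n >= 0);
    long ni = 0;
    for (long i = 0; i < n; i++) {
        for (long j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;                  /* ni == n*i at the top of each outer iteration */
    }
}

/* 16*x written as a shift (assumes unsigned x). */
unsigned long times16_shift(unsigned long x)
{
    return x << 4;
}
```

Whether the rewritten form is actually faster depends on the machine, as the slide notes; a compiler will often perform both transformations itself.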

7

Share Common Subexpressions

– Reuse portions of expressions
– Compilers often not very sophisticated in exploiting arithmetic properties

Original (3 multiplications: i*n, (i–1)*n, (i+1)*n):

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j ];
down  = val[(i+1)*n + j ];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;

    leaq    1(%rsi), %rax       # i+1
    leaq    -1(%rsi), %r8       # i-1
    imulq   %rcx, %rsi          # i*n
    imulq   %rcx, %rax          # (i+1)*n
    imulq   %rcx, %r8           # (i-1)*n
    addq    %rdx, %rsi          # i*n+j
    addq    %rdx, %rax          # (i+1)*n+j
    addq    %rdx, %r8           # (i-1)*n+j

With the shared subexpression (1 multiplication: i*n):

long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

    imulq   %rcx, %rsi          # i*n
    addq    %rdx, %rsi          # i*n+j
    movq    %rsi, %rax          # i*n+j
    subq    %rcx, %rax          # i*n+j-n
    leaq    (%rsi,%rcx), %rcx   # i*n+j+n
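A small sketch that packages the two fragments as functions, so it is easy to confirm they compute the same sum. The function names and the assert on the indices are assumptions added for illustration; the caller is also assumed to guarantee that rows i-1 and i+1 exist.

```c
#include <assert.h>

/* Each neighbor index is computed from scratch. */
double sum_neighbors_naive(const double *val, long i, long j, long n)
{
    assert(i >= 1 && j >= 1 && j + 1 < n);
    double up    = val[(i-1)*n + j];
    double down  = val[(i+1)*n + j];
    double left  = val[i*n + j-1];
    double right = val[i*n + j+1];
    return up + down + left + right;
}

/* i*n + j is computed once and reused for all four neighbors. */
double sum_neighbors_shared(const double *val, long i, long j, long n)
{
    assert(i >= 1 && j >= 1 && j + 1 < n);
    long inj = i*n + j;
    double up    = val[inj - n];
    double down  = val[inj + n];
    double left  = val[inj - 1];
    double right = val[inj + 1];
    return up + down + left + right;
}
```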

8

Optimization Blocker #1: Procedure Calls

• Procedure to Convert String to Lower Case

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

9

Lower Case Conversion Performance

– Time quadruples when the string length doubles
– Quadratic performance

10

Convert Loop To Goto Form

– strlen executed every iteration

void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
 loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
 done:
}

11

Calling Strlen

• Strlen performance
– Only way to determine length of string is to scan its entire length, looking for null character.

• Overall performance, string of length N
– N calls to strlen
– Each call requires time proportional to N
– Overall O(N^2) performance

/* My version of strlen */
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

12

Improving Performance

– Move call to strlen outside of loop
– Since result does not change from one iteration to another
– Form of code motion

void lower2(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
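One way to see the difference without timing anything is to count how often the length scan runs. The sketch below is an assumption-laden illustration: counted_strlen, strlen_calls, lower1_counted, and lower2_counted are hypothetical names, and the counter is instrumentation added here, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumentation: number of times the length scan has run. */
long strlen_calls = 0;

/* Same scan as strlen, plus a call counter. */
size_t counted_strlen(const char *s)
{
    size_t length = 0;
    strlen_calls++;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

/* Loop test rescans the string on every iteration. */
void lower1_counted(char *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Length computed once before the loop. */
void lower2_counted(char *s)
{
    assert(s != NULL);
    size_t len = counted_strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

For a string of length N, the first version runs the scan N+1 times (once per loop test) and the second runs it once. Since each scan itself takes time proportional to N, this is where the quadratic versus linear behavior comes from.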

13

Lower Case Conversion Performance

– Time doubles when the string length doubles
– Linear performance of lower2

[Figure: running time of lower and lower2 as the string length grows]

14

Optimization Blocker: Procedure Calls

• Why couldn’t the compiler move strlen out of the inner loop?
– Procedure may have side effects
• Alters global state each time called
– Function may not return same value for given arguments
• Depends on other parts of global state
• Procedure lower could interact with strlen

• Warning:
– Compiler treats procedure call as a black box
– Weak optimizations near them

• Remedies:
– Use of inline functions
• GCC does this with -O2
• See web aside ASM:OPT
– Do your own code motion

int lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    lencnt += length;
    return length;
}

15

Memory Matters

– Code updates b[i] on every iteration
– Why couldn’t the compiler optimize this away?

/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

# sum_rows1 inner loop
.L53:
    addsd   (%rcx), %xmm0           # FP add
    addq    $8, %rcx
    decq    %rax
    movsd   %xmm0, (%rsi,%r8,8)     # FP store
    jne     .L53

16

Memory Aliasing

– Code updates b[i] on every iteration
– Must consider possibility that these updates will affect program behavior

/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

double A[9] = { 0, 1, 2, 4, 8, 16, 32, 64, 128};

double *B = A+3;

sum_rows1(A, B, 3);

Value of B:

init:  [4, 8, 16]

i = 0: [3, 8, 16]

i = 1: [3, 22, 16]

i = 2: [3, 22, 224]

17

Removing Aliasing

– No need to store intermediate results

void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

# sum_rows2 inner loop
.L66:
    addsd   (%rcx), %xmm0   # FP Add
    addq    $8, %rcx
    decq    %rax
    jne     .L66
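The overlapping-array example can be replayed against both versions to see why the compiler may not silently turn sum_rows1 into sum_rows2. The sketch below restates the two functions from the slides; the assert preconditions are an added assumption.

```c
#include <assert.h>

/* Accumulates directly in b[i]: every partial sum is stored to memory. */
void sum_rows1(double *a, double *b, long n)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        b[i] = 0;
        for (long j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Accumulates in a local variable: b[i] is written once per row. */
void sum_rows2(double *a, double *b, long n)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        double val = 0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

With A = {0, 1, 2, 4, 8, 16, 32, 64, 128} and b pointing at A+3, sum_rows1 leaves [3, 22, 224] in B, because zeroing b[1] also zeroes a[4] before row 1 is summed. On a fresh copy of the same data, sum_rows2 leaves [3, 27, 224]. The two functions agree only when a and b do not overlap, and the compiler cannot assume that.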

18

Optimization Blocker: Memory Aliasing

• Aliasing
– Two different memory references specify a single location
– Easy to have happen in C
• Since allowed to do address arithmetic
• Direct access to storage structures
– Get in habit of introducing local variables
• Accumulating within loops
• Your way of telling compiler not to check for aliasing

19

Optimizing Compilers

• Provide efficient mapping of program to machine
– register allocation
– code selection and ordering (scheduling)
– dead code elimination
– eliminating minor inefficiencies

• Don’t (usually) improve asymptotic efficiency
– up to programmer to select best overall algorithm
– big-O savings are (often) more important than constant factors
• but constant factors also matter

• Have difficulty overcoming “optimization blockers”
– potential memory aliasing
– potential procedure side-effects

20

Time Scales

• Absolute Time

– Typically use nanoseconds

• 10^-9 seconds

– Time scale of computer instructions

21

Time Scales

• Clock Cycles
– Most computers controlled by high frequency clock signal
– Typical Range
• 100 MHz
– 10^8 cycles per second
– Clock period = 10ns
• 2 GHz
– 2 X 10^9 cycles per second
– Clock period = 0.5ns
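The two clock periods above follow from period = 1 / frequency. A minimal sketch of that arithmetic, with hypothetical helper names clock_period_ns and cycles_to_ns:

```c
#include <assert.h>

/* Clock period in nanoseconds for a clock frequency given in hertz. */
double clock_period_ns(double freq_hz)
{
    assert(freq_hz > 0.0);
    return 1e9 / freq_hz;
}

/* Elapsed time in nanoseconds for a cycle count at the given frequency. */
double cycles_to_ns(double cycles, double freq_hz)
{
    return cycles * clock_period_ns(freq_hz);
}
```

For example, clock_period_ns(100e6) gives 10 ns and clock_period_ns(2e9) gives 0.5 ns, matching the slide.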

22

CPE

void vsum1(int n)
{
    int i;

    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

23

CPE

/* Sum vector of n elements (n must be even) */
void vsum2(int n)
{
    int i;

    for (i = 0; i < n; i+=2) {
        /* Compute two elements per iteration */
        c[i] = a[i] + b[i];
        c[i+1] = a[i+1] + b[i+1];
    }
}

24

Cycles Per Element

• Convenient way to express performance of a program that operates on vectors or lists

• Length = n

• T = CPE*n + Overhead
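Given measured cycle counts for several vector lengths, CPE is the slope of the best-fit line and Overhead is its intercept. The sketch below uses an ordinary least-squares fit; the function name fit_cpe and its interface are assumptions made for illustration, not something defined in the slides.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Least-squares fit of cycles = cpe * n + overhead.
 * n[k] is the vector length of measurement k, cycles[k] its cycle count.
 */
void fit_cpe(const double *n, const double *cycles, size_t count,
             double *cpe, double *overhead)
{
    assert(count >= 2);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t k = 0; k < count; k++) {
        sx  += n[k];
        sy  += cycles[k];
        sxx += n[k] * n[k];
        sxy += n[k] * cycles[k];
    }
    double denom = (double)count * sxx - sx * sx;
    assert(denom != 0.0);         /* needs at least two distinct lengths */
    *cpe = ((double)count * sxy - sx * sy) / denom;
    *overhead = (sy - *cpe * sx) / (double)count;
}
```

As a usage example with synthetic data (not measurements): lengths 50, 100, 150, 200 with cycle counts 280, 480, 680, 880 lie exactly on the line T = 4.0*n + 80, so the fit reports a CPE of 4.0 and an overhead of 80 cycles.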

25

Cycles Per Element

[Figure: cycles (0 to 1000) versus number of elements (0 to 200) for vsum1 (slope = 4.0) and vsum2 (slope = 3.5)]

26

Benchmark Example: Vector ADT

typedef int data_t ;

typedef struct {
    int len ;
    data_t *data ;
} vec_rec, *vec_ptr ;

[Figure: vector header with fields length and data; data points to an array of elements indexed 0, 1, 2, ..., len–1]

• Data Types
– Use different declarations for data_t
– int
– float
– double

27

Procedures

• vec_ptr new_vec(int len)

– Create vector of specified length

• data_t *get_vec_start(vec_ptr v)

– Return pointer to start of vector data

28

Procedures

• int get_vec_element(vec_ptr v, int index, data_t *dest)

– Retrieve vector element, store at *dest

– Return 0 if out of bounds, 1 if successful

• Similar to array implementations in Pascal, Java

– E.g., always do bounds checking

29


vec_ptr new_vec(int len)

{

/* allocate header structure */

vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec)) ;

if ( !result )

return NULL ;

result->len = len ;

30


    /* allocate array */
    if ( len > 0 ) {
        data_t *data = (data_t *)calloc(len, sizeof(data_t)) ;
        if ( !data ) {
            free( (void *)result ) ;
            return NULL ; /* couldn't allocate storage */
        }
        result->data = data ;
    } else
        result->data = NULL ;

    return result ;
}

31

Vector ADT

/*
 * Retrieve vector element and store at dest.
 * Return 0 (out of bounds) or 1 (successful)
 */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if ( index < 0 || index >= v->len)
        return 0 ;
    *dest = v->data[index] ;
    return 1;
}

32


/* Return length of vector */
int vec_length(vec_ptr v)
{
    return v->len ;
}

/* Return pointer to start of vector data */
data_t *get_vec_start(vec_ptr v)
{
    return v->data ;
}

33

Optimization Example

#ifdef ADD

#define IDENT 0

#define OPER +

#else

#define IDENT 1

#define OPER *

#endif

34


void combine1(vec_ptr v, data_t *dest)

{

int i;

*dest = IDENT;

for (i = 0; i < vec_length(v); i++) {

data_t val;

get_vec_element(v, i, &val);

*dest = *dest OPER val;

}

}

35


• Procedure

– Compute sum (product) of all elements of vector

– Store result at destination location

36

Time Scales

void combine1(vec_ptr v, int *dest)

{

int i;

*dest = 0;

for (i = 0; i < vec_length(v); i++) {

int val;

get_vec_element(v, i, &val);

*dest += val;

}

}

37

Time Scales

• Procedure
– Compute sum of all elements of integer vector
– Store result at destination location
– Vector data structure and operations defined via abstract data type

• Intel Core i7 Performance: CPE
– 29.02 (Compiled -g)
– 12.00 (Compiled -O1)

38

Understanding Loop

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
 loop:
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
 done:
    ;
}

1 iteration

39

Inefficiency

• Procedure vec_length called every iteration

• Even though result always the same

40

Code Motion

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

41

Code Motion

• Optimization
– Move call to vec_length out of inner loop
• Value does not change from one iteration to next
• Code motion
– CPE: 8.03 (Compiled -O1)
• vec_length requires only constant time, but significant overhead

42


void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

43


• Optimization
– Avoid procedure call to retrieve each vector element
• Get pointer to start of array before loop
• Within loop just do pointer reference
• Not as clean in terms of data abstraction
– CPE: 6.01 (Compiled -O1)
• Procedure calls are expensive!
• Bounds checking is expensive

44

Eliminate Unneeded Memory References

void combine4(vec_ptr v, int *dest)

{

int i;

int length = vec_length(v);

int *data = get_vec_start(v);

int sum = 0;

for (i = 0; i < length; i++)

sum += data[i];

*dest = sum;

}
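To check that the optimized routine still computes the same result as the starting point, the vector ADT and the first and last versions can be put into one file. This is a sketch under stated assumptions: it fixes data_t to int and the operation to addition, and free_vec is a hypothetical helper that does not appear in the slides.

```c
#include <assert.h>
#include <stdlib.h>

typedef int data_t;

typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Create a zero-filled vector of the given length; NULL on failure. */
vec_ptr new_vec(int len)
{
    vec_ptr result = malloc(sizeof(vec_rec));
    if (!result)
        return NULL;
    result->len = len;
    result->data = NULL;
    if (len > 0) {
        result->data = calloc((size_t)len, sizeof(data_t));
        if (!result->data) {
            free(result);
            return NULL;
        }
    }
    return result;
}

/* Hypothetical helper (not in the slides): release a vector. */
void free_vec(vec_ptr v)
{
    if (v) {
        free(v->data);
        free(v);
    }
}

int vec_length(vec_ptr v) { return v->len; }

data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked read: returns 0 if out of bounds, 1 if successful. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Starting point: calls in the loop test and body, result kept in memory. */
void combine1(vec_ptr v, data_t *dest)
{
    *dest = 0;
    for (int i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest + val;
    }
}

/* Hoisted length, direct data access, local accumulator. */
void combine4(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

Filling a five-element vector with 1 through 5 and calling both routines should leave 15 in each destination. Note that combine4 gives up the bounds check that get_vec_element performs.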

45

Eliminate Unneeded Memory References

• Optimization

– Don’t need to store in destination until end

– Local variable sum held in register

– Avoids 1 memory read, 1 memory write per iteration

– CPE: 2.00 (Compiled –O1)

• Memory references are expensive!

46

Detecting Unneeded Memory References

Combine3:

.L18:
    movl    (%ecx,%edx,4),%eax
    addl    %eax,(%edi)
    incl    %edx
    cmpl    %esi,%edx
    jl      .L18

Combine4:

.L24:
    addl    (%eax,%edx,4),%ecx
    incl    %edx
    cmpl    %esi,%edx
    jl      .L24

47

Detecting Unneeded Memory References

• Performance

– Combine3

• 5 instructions in 6 clock cycles

• addl must read and write memory

– Combine4

• 4 instructions in 2 clock cycles

48

Next

• Understanding Modern Processor
– Super-scalar
– Out-of-order execution

• Suggested reading

– 5.7

49
