Code Optimization
Outline
• Machine-Independent Optimization
– Code motion
– Memory optimization
• Suggested reading
– 5.1 ~ 5.6
Motivation
• Constant factors matter too!
– can easily see a 10:1 performance range depending on how code is written
– must optimize at multiple levels
• algorithm, data representations, procedures, and loops
Motivation
• Must understand system to optimize performance
– how programs are compiled and executed
– how to measure program performance and identify bottlenecks
– how to improve performance without destroying code modularity and generality
Generally Useful Optimizations
• Optimizations that you or the compiler should do regardless of processor / compiler
• Code Motion
– Reduce frequency with which computation is performed
• If it will always produce the same result
• Especially moving code out of a loop
void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}
/* After code motion: n*i is computed once, outside the loop */
    long j;
    long ni = n*i;
    for (j = 0; j < n; j++)
        a[ni+j] = b[j];
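A minimal sketch of the code-motion idea, assuming the hoisted variant is wrapped in its own function (the name set_row_hoisted is hypothetical and does not appear on the slides). Both functions should fill row i of a with the contents of b.

```c
/* Original: recomputes n*i on every iteration. */
void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* Hypothetical name: same loop with the invariant n*i hoisted by hand. */
void set_row_hoisted(double *a, double *b, long i, long n)
{
    long j;
    long ni = n*i;              /* loop-invariant, computed once */
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```

Calling either function with the same arguments leaves the same values in a; only the number of multiplications differs.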
Compiler-Generated Code Motion
void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}
/* C equivalent of the compiler-generated code */
    long j;
    long ni = n*i;
    double *rowp = a+ni;
    for (j = 0; j < n; j++)
        *rowp++ = b[j];
set_row:
    testq   %rcx, %rcx              # Test n
    jle     .L4                     # If n <= 0, goto done
    movq    %rcx, %rax              # rax = n
    imulq   %rdx, %rax              # rax *= i
    leaq    (%rdi,%rax,8), %rdx     # rowp = a + n*i*8
    movl    $0, %r8d                # j = 0
.L3:                                # loop:
    movq    (%rsi,%r8,8), %rax      # t = b[j]
    movq    %rax, (%rdx)            # *rowp = t
    addq    $1, %r8                 # j++
    addq    $8, %rdx                # rowp++
    cmpq    %r8, %rcx               # Compare n:j
    jg      .L3                     # If >, goto loop
.L4:                                # done:
    rep ; ret
Where are the FP operations?
Reduction in Strength
– Replace costly operation with simpler one
– Shift, add instead of multiply or divide
16*x --> x << 4
• Utility is machine dependent
• Depends on cost of multiply or divide instruction
– On Intel Nehalem, integer multiply requires 3 CPU cycles
– Recognize sequence of products
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
/* After strength reduction: the products n*i become a running sum */
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}
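The two loop nests above can be checked against each other with a small sketch. The function names fill_mul, fill_add and times16, and the choice of long elements, are assumptions made for illustration.

```c
/* One multiplication (n*i) per inner-loop iteration. */
void fill_mul(long *a, const long *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* Strength-reduced: the products 0, n, 2n, ... become repeated adds. */
void fill_add(long *a, const long *b, long n)
{
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;                /* next row offset without a multiply */
    }
}

/* 16*x written as a shift; x is unsigned so the shift is well defined. */
unsigned long times16(unsigned long x)
{
    return x << 4;
}
```

Whether the shift or the running sum is actually faster depends on the machine, as the slide notes; a compiler will often make this substitution on its own.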
Share Common Subexpressions
– Reuse portions of expressions
– Compilers often not very sophisticated in exploiting arithmetic properties
/* Sum neighbors of i,j */
up = val[(i-1)*n + j ];
down = val[(i+1)*n + j ];
left = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;
3 multiplications: i*n, (i–1)*n, (i+1)*n
    leaq    1(%rsi), %rax       # i+1
    leaq    -1(%rsi), %r8       # i-1
    imulq   %rcx, %rsi          # i*n
    imulq   %rcx, %rax          # (i+1)*n
    imulq   %rcx, %r8           # (i-1)*n
    addq    %rdx, %rsi          # i*n+j
    addq    %rdx, %rax          # (i+1)*n+j
    addq    %rdx, %r8           # (i-1)*n+j
/* Sharing the common subexpression i*n + j */
long inj = i*n + j;
up = val[inj - n];
down = val[inj + n];
left = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;
1 multiplication: i*n
    imulq   %rcx, %rsi          # i*n
    addq    %rdx, %rsi          # i*n+j
    movq    %rsi, %rax          # i*n+j
    subq    %rcx, %rax          # i*n+j-n
    leaq    (%rsi,%rcx), %rcx   # i*n+j+n
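A self-contained sketch of the two neighbor sums, assuming a grid of long values stored row-major with n columns. The function names are hypothetical; the bodies follow the fragments above.

```c
/* Three multiplications: (i-1)*n, (i+1)*n and i*n. */
long sum_neighbors(const long *val, long i, long j, long n)
{
    long up    = val[(i-1)*n + j];
    long down  = val[(i+1)*n + j];
    long left  = val[i*n + j-1];
    long right = val[i*n + j+1];
    return up + down + left + right;
}

/* One multiplication: i*n + j is computed once and shared. */
long sum_neighbors_shared(const long *val, long i, long j, long n)
{
    long inj   = i*n + j;
    long up    = val[inj - n];
    long down  = val[inj + n];
    long left  = val[inj - 1];
    long right = val[inj + 1];
    return up + down + left + right;
}
```

Both versions assume (i, j) is an interior cell, so that all four neighbors exist.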
Optimization Blocker #1: Procedure Calls
• Procedure to Convert String to Lower Case
void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Lower Case Conversion Performance
– Time quadruples when string length doubles
– Quadratic performance
Convert Loop To Goto Form
– strlen executed every iteration
void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
 loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
 done:
    return;
}
Calling Strlen
• Strlen performance
– Only way to determine length of string is to scan its entire length, looking for null character
• Overall performance, string of length N
– N calls to strlen
– Each call requires time proportional to N
– Overall O(N^2) performance
/* My version of strlen */
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
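The quadratic bound follows from a short count. As a rough sketch, assume each strlen call on a string of length N costs about c·N for some constant c (the symbol c is introduced here only for illustration):

```latex
T_{\text{lower}}(N) \;\approx\; \sum_{k=1}^{N} c\,N \;=\; c\,N^{2} \;=\; O(N^{2}),
\qquad
\frac{T_{\text{lower}}(2N)}{T_{\text{lower}}(N)} \;\approx\; \frac{c\,(2N)^{2}}{c\,N^{2}} \;=\; 4
```

This matches the measured behavior: doubling the string length roughly quadruples the running time.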
Improving Performance
– Move call to strlen outside of loop
– Since result does not change from one iteration to another
– Form of code motion
void lower2(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Lower Case Conversion Performance
– Time doubles when string length doubles
– Linear performance of lower2
[Plot: performance curves for lower and lower2]
Optimization Blocker: Procedure Calls
• Why couldn’t compiler move strlen out of inner loop?
– Procedure may have side effects
• Alters global state each time called
– Function may not return same value for given arguments
• Depends on other parts of global state
• Procedure lower could interact with strlen
• Warning:
– Compiler treats procedure call as a black box
– Weak optimizations near them
• Remedies:
– Use of inline functions
• GCC does this with -O2
• See web aside ASM:OPT
– Do your own code motion
int lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    lencnt += length;
    return length;
}
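The side-effect argument can be made concrete with a sketch. It assumes the counting version is renamed counting_strlen (a hypothetical name, chosen to avoid clashing with the library strlen) and that lencnt is widened to long; lower and lower2 otherwise follow the slides.

```c
#include <stddef.h>

long lencnt = 0;                /* global state updated by every call */

/* Hypothetical name: a strlen replacement with a visible side effect. */
size_t counting_strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    lencnt += length;
    return length;
}

/* Length recomputed in every loop test. */
void lower(char *s)
{
    size_t i;
    for (i = 0; i < counting_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Length computed once, before the loop. */
void lower2(char *s)
{
    size_t i;
    size_t len = counting_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Both functions produce the same string, but they leave lencnt at different values. A compiler that hoisted the call automatically would change that observable state, which is why it has to be conservative.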
Memory Matters
– Code updates b[i] on every iteration
– Why couldn’t compiler optimize this away?
/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
# sum_rows1 inner loop
.L53:
    addsd   (%rcx), %xmm0           # FP add
    addq    $8, %rcx
    decq    %rax
    movsd   %xmm0, (%rsi,%r8,8)     # FP store
    jne     .L53
Memory Aliasing
– Code updates b[i] on every iteration
– Must consider possibility that these updates will affect program behavior
/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
double A[9] = { 0, 1, 2, 4, 8, 16, 32, 64, 128};
double *B = A+3;
sum_rows1(A, B, 3);
Value of B:
init:  [4, 8, 16]
i = 0: [3, 8, 16]
i = 1: [3, 22, 16]
i = 2: [3, 22, 224]
Removing Aliasing
– No need to store intermediate results
/* Sum rows of n X n matrix a and store in vector b */
void sum_rows2(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
# sum_rows2 inner loop
.L66:
    addsd   (%rcx), %xmm0           # FP add
    addq    $8, %rcx
    decq    %rax
    jne     .L66
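A sketch that runs both versions on the overlapping arrays from the aliasing example. The function bodies follow the slides; running them side by side on two copies of the same data is an illustration added here.

```c
/* Accumulates directly into b[i]: every add is also a store to memory. */
void sum_rows1(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Accumulates in a local variable: b[i] is written once per row. */
void sum_rows2(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

With A = {0, 1, 2, 4, 8, 16, 32, 64, 128} and b = A+3, sum_rows1 leaves [3, 22, 224] in b while sum_rows2 leaves [3, 27, 224]. Because the two functions differ when a and b overlap, the compiler cannot turn the first into the second on its own.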
Optimization Blocker: Memory Aliasing
• Aliasing
– Two different memory references specify single location
– Easy to have happen in C
• Since allowed to do address arithmetic
• Direct access to storage structures
– Get in habit of introducing local variables
• Accumulating within loops
• Your way of telling compiler not to check for aliasing
Optimizing Compilers
• Provide efficient mapping of program to machine
– register allocation
– code selection and ordering (scheduling)
– dead code elimination
– eliminating minor inefficiencies
• Don’t (usually) improve asymptotic efficiency
– up to programmer to select best overall algorithm
– big-O savings are (often) more important than constant factors
• but constant factors also matter
• Have difficulty overcoming “optimization blockers”
– potential memory aliasing
– potential procedure side-effects
Time Scales
• Absolute Time
– Typically use nanoseconds
• 10^-9 seconds
– Time scale of computer instructions
Time Scales
• Clock Cycles
– Most computers controlled by high frequency clock signal
– Typical Range
• 100 MHz
– 10^8 cycles per second
– Clock period = 10ns
• 2 GHz
– 2 X 10^9 cycles per second
– Clock period = 0.5ns
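The relation between clock frequency, clock period and elapsed time can be sketched as two small helpers. The function names are hypothetical; the arithmetic is simply period = 1 / frequency, expressed in nanoseconds.

```c
/* Clock period in nanoseconds for a clock frequency given in Hz. */
double clock_period_ns(double freq_hz)
{
    return 1e9 / freq_hz;
}

/* Elapsed time in nanoseconds for a given number of clock cycles. */
double cycles_to_ns(double cycles, double freq_hz)
{
    return cycles * clock_period_ns(freq_hz);
}
```

For the two frequencies on the slide, clock_period_ns(100e6) gives 10 ns and clock_period_ns(2e9) gives 0.5 ns.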
CPE
void vsum1(int n)
{
    int i;

    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
CPE
/* Sum vector of n elements (n must be even) */
void vsum2(int n)
{
    int i;

    for (i = 0; i < n; i+=2) {
        /* Compute two elements per iteration */
        c[i] = a[i] + b[i];
        c[i+1] = a[i+1] + b[i+1];
    }
}
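A compilable sketch of vsum1 and vsum2 together. The slides leave the arrays a, b and c undeclared, so the global int arrays and the size MAXN below are assumptions made to keep the example self-contained.

```c
#define MAXN 8                  /* hypothetical array size for this sketch */

int a[MAXN], b[MAXN], c[MAXN];  /* assumed global vectors */

/* One element per iteration. */
void vsum1(int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Two elements per iteration; n must be even. */
void vsum2(int n)
{
    int i;
    for (i = 0; i < n; i += 2) {
        c[i]   = a[i]   + b[i];
        c[i+1] = a[i+1] + b[i+1];
    }
}
```

For even n both functions write the same values into c; vsum2 simply executes half as many loop tests and increments.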
Cycles Per Element
• Convenient way to express performance of program that operates on vectors or lists
• Length = n
• T = CPE*n + Overhead
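One way to estimate CPE from measurements is a least-squares line fit of cycle counts against n, so that the slope is the CPE and the intercept is the overhead. The sketch below assumes the measurements are already available as arrays; the names cpe_fit and fit_cpe are hypothetical, and this is one possible fitting method rather than the one used to produce the course numbers.

```c
#include <stddef.h>

/* Result of fitting cycles = overhead + cpe * n. */
typedef struct {
    double cpe;
    double overhead;
} cpe_fit;

/* Ordinary least-squares fit over count (n, cycles) measurements.
   Requires at least two distinct values of n. */
cpe_fit fit_cpe(const double *n, const double *cycles, size_t count)
{
    double sn = 0, sc = 0, snn = 0, snc = 0;
    for (size_t k = 0; k < count; k++) {
        sn  += n[k];
        sc  += cycles[k];
        snn += n[k] * n[k];
        snc += n[k] * cycles[k];
    }
    cpe_fit r;
    r.cpe = (count * snc - sn * sc) / (count * snn - sn * sn);
    r.overhead = (sc - r.cpe * sn) / count;
    return r;
}
```

Fed with points that lie exactly on a line such as T = 80 + 4.0n (the overhead of 80 is an assumed value for illustration), the fit recovers a slope of 4.0 and an intercept of 80.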
Cycles Per Element
[Plot: cycles (0 to 1000) versus number of elements (0 to 200); vsum1 slope = 4.0, vsum2 slope = 3.5]
Benchmark Example: Vector ADT
typedef int data_t ;
typedef struct {
    int len ;
    data_t *data ;
} vec_rec, *vec_ptr ;
[Diagram: vector header with fields length and data; data points to an array of elements indexed 0, 1, 2, ..., len–1]
• Data Types
– Use different declarations for data_t
– int
– float
– double
Procedures
• vec_ptr new_vec(int len)
– Create vector of specified length
• data_t *get_vec_start(vec_ptr v)
– Return pointer to start of vector data
Procedures
• int get_vec_element(vec_ptr v, int index, data_t *dest)
– Retrieve vector element, store at *dest
– Return 0 if out of bounds, 1 if successful
• Similar to array implementations in Pascal, Java
– E.g., always do bounds checking
Benchmark Example: Vector ADT
#include <stdlib.h>

vec_ptr new_vec(int len)
{
    /* allocate header structure */
    vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec)) ;
    if ( !result )
        return NULL ;
    result->len = len ;
    /* allocate array */
    if ( len > 0 ) {
        data_t *data = (data_t *) calloc(len, sizeof(data_t)) ;
        if ( !data ) {
            free( (void *)result ) ;
            return NULL ;   /* couldn't allocate storage */
        }
        result->data = data ;
    } else
        result->data = NULL ;
    return result ;
}
Vector ADT
/*
 * Retrieve vector element and store at dest.
 * Return 0 (out of bounds) or 1 (successful)
 */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if ( index < 0 || index >= v->len )
        return 0 ;
    *dest = v->data[index] ;
    return 1 ;
}
Benchmark Example: Vector ADT
/* Return length of vector */
int vec_length(vec_ptr v)
{
    return v->len ;
}
/* Return pointer to start of vector data */
data_t *get_vec_start(vec_ptr v)
{
    return v->data ;
}
Optimization Example
#ifdef ADD
#define IDENT 0
#define OPER +
#else
#define IDENT 1
#define OPER *
#endif
Optimization Example
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
Optimization Example
• Procedure
– Compute sum (product) of all elements of vector
– Store result at destination location
Optimization Example
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Optimization Example
• Procedure
– Compute sum of all elements of integer vector
– Store result at destination location
– Vector data structure and operations defined via abstract data type
• Intel Core i7 Performance: CPE
– 29.02 (Compiled -g)
– 12.00 (Compiled -O1)
Understanding Loop
void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
 loop:
    /* one iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))
        goto loop;
 done:
    return;
}
Inefficiency
• Procedure vec_length called every iteration
• Even though result always the same
Code Motion
void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Code Motion
• Optimization
– Move call to vec_length out of inner loop
• Value does not change from one iteration to next
• Code motion
– CPE: 8.03 (Compiled -O1)
• vec_length requires only constant time, but has significant overhead
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}
Reduction in Strength
• Optimization
– Avoid procedure call to retrieve each vector element
• Get pointer to start of array before loop
• Within loop just do pointer reference
• Not as clean in terms of data abstraction
– CPE: 6.01 (Compiled -O1)
• Procedure calls are expensive!
• Bounds checking is expensive
Eliminate Unneeded Memory References
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Eliminate Unneeded Memory References
• Optimization
– Don’t need to store in destination until end
– Local variable sum held in register
– Avoids 1 memory read, 1 memory write per iteration
– CPE: 2.00 (Compiled -O1)
• Memory references are expensive!
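The four versions can be collected into one compilable unit to check that each transformation preserves the result. This is a sketch assuming data_t is int and the sum variant (IDENT 0, OPER +); the compact new_vec below is a shortened rewrite of the slides' version, not a copy of it.

```c
#include <stdlib.h>

typedef int data_t;
typedef struct { int len; data_t *data; } vec_rec, *vec_ptr;

/* Shortened allocator: header first, then a zeroed array if len > 0. */
vec_ptr new_vec(int len)
{
    vec_ptr result = malloc(sizeof(vec_rec));
    if (!result)
        return NULL;
    result->len = len;
    result->data = len > 0 ? calloc(len, sizeof(data_t)) : NULL;
    if (len > 0 && !result->data) {
        free(result);
        return NULL;
    }
    return result;
}

int vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Baseline: length and bounds check on every iteration. */
void combine1(vec_ptr v, int *dest)
{
    *dest = 0;
    for (int i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

/* Code motion: vec_length called once. */
void combine2(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    *dest = 0;
    for (int i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

/* Direct array access: no per-element procedure call. */
void combine3(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (int i = 0; i < length; i++)
        *dest += data[i];
}

/* Local accumulator: *dest written once, after the loop. */
void combine4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

All four functions return the same sum as long as dest does not point into the vector's own data; with such aliasing, combine3 and combine4 could differ, which is why the compiler does not make the last step automatically.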
Detecting Unneeded Memory References
Combine3
.L18:
    movl    (%ecx,%edx,4),%eax
    addl    %eax,(%edi)
    incl    %edx
    cmpl    %esi,%edx
    jl      .L18
Combine4
.L24:
    addl    (%eax,%edx,4),%ecx
    incl    %edx
    cmpl    %esi,%edx
    jl      .L24
Detecting Unneeded Memory References
• Performance
– Combine3
• 5 instructions in 6 clock cycles
• addl must read and write memory
– Combine4
• 4 instructions in 2 clock cycles
Next
• Understanding Modern Processor
– Super-scalar
– Out-of-order execution
• Suggested reading
– 5.7