friday, september 15, 2006 the three most important factors in selling optimization are location,...

Friday, September 15, 2006

The three most important factors in selling optimization

are location, location, location.

- Realtor’s creed

Optimizations to source code

There are several optimizations that compilers and users can make to source code which generally result in fewer assembly code instructions.

x=3;

if (x<3) my_function();

Dead code elimination

x=3;

if (x<3) my_function();

x=4+3;

y=x-2;

Constant folding and propagation

Multiple constants are folded together and evaluated at compile time.

x=4+3;

y=x-2;

Common sub-expression elimination

x = y + (a*d+c)

z = u + (a*d+c)

C/C++ Register data type

If a variable is to be used many times, then they should be kept in registers and not loaded from the memory.

Hint to compilers

C/C++ Register data type

If a variable is to be used many times, then they should be kept in registers and no loaded from the memory.

Hint to compilers asm (problem?)

Strength reductionsReplace expensive operations with cheaper

ones.Replace integer multiplication and

division by constants with shift operations Most processors have integer shift functional

units. Multiplication of a number by 9?


ones.Replace 32-bit integer division by 64-bit

floating point division


ones.• Replace floating point multiplication by

small constants with floating point additions


ones.• Replace floating point divisions by one

division and multiplications• Division is one of the most expensive

operations, while that of multiplication is negligible in comparison.

• a=y/x• b=z/x

Strength reductions

• Replace floating point divisions by one division and multiplications

• Especially inside loops

a(i) = b(i) / c(i)

x(i) = y(i) / z(i)

Strength reductions

• Replace floating point divisions by one division and multiplications

• Especially inside loops.

a(i) = b(i)/c(i)

x(i) = y(i)/z(i)

temp = 1.0/(c(i)*z(i))a(i) = b(i) * z(i) * tempx(i) = y(i) * c(i) * temp


ones.• Replace power functions by floating

point multiplications • Power calculations can take 50 times longer

than multiplication

Fused multiply add (fma)Many processors have compound floating

point multiply add instructions that are more efficient than individual multiply, add instructions

fma: (a*b +c)

fnma: -(a*b) +c

Fused multiply add (fma)a, b, c are complex numbers c=c+a*b

(cr, ci) + (ar, ai) * (br, bi) =

(cr, ci)+((ar*br – ai*bi), (ar*bi + ai*br))

Fused multiply add (fma)(cr, ci) + (ar, ai) * (br, bi) =

(cr, ci)+((ar*br – ai*bi), (ar*bi + ai*br))

Multiply f1=ar*brMultiply f2=ar*bifnma f3=-ai*bi + f1fma f4=ai*br + f2Add f5=cr+f3Add f6=ci+f4

Fused multiply add (fma)Alter the order of instructions to:

(cr, ci) + (ar, ai) * (br, bi) =

((cr+ar*br) – (ai*bi), (ci+ar*bi) + (ai*br))

Fused multiply add (fma)Alter the order of instructions to:

(cr, ci) + (ar, ai) * (br, bi) =

((cr+ar*br) – (ai*bi), (ci+ar*bi) + (ai*br))

fma f1=ar*br + cr

fma f2=ar*bi + ci

fnma f3=-ai*bi + f1

fma f4=ai*br + f2

Loop optimizations

Account of most of the runtime for computational programs

Source code modifications can lead to significant improvements in runtime

Magnification factor

Single loop optimizationfor (i=0; i<n; i+=2)

a[i] = i*k + m;

Multiple of induction variable added to a constant.

Replace multiplication with addition.

Single loop optimizationfor (i=0; i<n; i+=2)

a[i] = i*k + m;Multiple of induction variable added to a constant.Replace multiplication with addition.

counter=m;

for (i=0; i<n; i+=2){

a[i] = counter;

counter = counter + k + k;

}

Condition statementsBranches in code reduce performance especially

on pipelined systems.

do I = 1,Nif (A > 0) then

x[I] = x[I]+1else

x[I] = 0.0endif

enddo

Condition statements

Branches are a significant overhead on pipelined systems.

Branch mis-prediction penalty is high. Pipelined get stalled In-flight instructions may have to be

discardedTry to move if statements out of loop body

Optimization has to be applied on case-by-case basis

Test promotion in loops

do I = 1,N

if (A > 0) then

x[I] = x[I]+1

else

x[I] = 0.0

endif

enddo

if (A > 0) then

do I = 1,N

x[I] = x[I]+1

enddo

else

do I = 1,N

x[I] = 0.0

enddo

endif

Most compilers don’t do this optimization.

Condition statementsdo i=1,len2 do j=1,len2 if(j < i) then a2d(j,i) =

a2d(j,i) + b2d(j,i)*con1

else a2d(j,i) = 1.0 endif enddoenddo

Condition statementsdo i=1,len2 do j=1,len2 if(j < i) then a2d(j,i) =

a2d(j,i) + b2d(j,i)*con1

else a2d(j,i) = 1.0 endif enddoenddo

do i=1,len2 do j=1,i-1 a2d(j,i) = a2d(j,i) + b2d(j,i)*con1

enddo do j=i,len2 a2d(j,i) = 1.0 enddoenddo

Loop PeelingBoundary Conditions

do I = 1,N

if (I == 1) then

x[I] = 0

elseif (I ==N )then

x[I} = N

else

x[I] = x[I]+y[I]

enddo

Loop Peeling

do I = 1,N

if (I == 1) then

x[I] = 0

elseif (I ==N )then

x[I} = N

else

x[I] = x[I]+y[I]

enddo

x[1]=0

do I = 2,N-1

x[I] = x[I]+y[I]

enddo

x[N]=N

Most compilers don’t do this optimization.

Loop Peeling

do i=1,n

y(i,n) = (1.0-x(i,1))*y(1,n)+x(i,1)*y(n,n)

enddo

Loop Peeling

do i=1,n

y(i,n) = (1.0-x(i,1))*y(1,n)+x(i,1)*y(n,n)

enddo

t2 = y(n,n)

y(1,n) = (1.0-x(1,1))*y(1,n)+x(1,1)*t2

t1 = y(1,n)

do i=2,n-1

y(i,n) = (1.0-x(i,1))*t1 + x(i,1)*t2

enddo

y(n,n) = (1.0-x(n,1))*t1 + x(n,1)*t2

Complex for compilers to perform.

Loop Unrolling

do i=1,n

A(i)=B(i)

enddo

Loop index incremented and checked at the beginning of each iteration

Branches interfere with pipelining

Register blocking: Temporary variables

that are used repeatedly.

Loop Unrolling

do i=1,n

A(i)=B(i)

enddo

do i=1,n,4

A(i)=B(i)

A(i+1)=B(i+1)

A(i+2)=B(i+2)

A(i+3)=B(i+3)

enddo•Unrolled by 4.•Some compilers allow users to specify unrolling depth.•Avoid excessive unrolling: Register pressure / spills can hurt performance•Pipelining to hide instruction latencies•Reduces overhead of index increment and conditional check

Assumption n is divisible by 4

Loop Unrollingdo j=1,m

do i=1,n

do k=1,p

c(i,j)= c(i,j) +a(k,i)*b(k,j)

enddo

enddo

enddo

Loop Unrollingdo j=1,m do i=1,n do k=1,p

c(i,j)= c(i,j) +a(k,i)*b(k,j)

enddo enddoenddo

do j=1,m,2 do i=1,n,2

t1=c(i,j)t2=c(i+1,j)t3=c(i,j+1)t4=c(i+1,j+1)

do k=1,pt1=t1+a(k,i)*b(k,j)t2=t2+a(k,i+1)*b(k,j)t3=t3+a(k,i)*b(k,j+1)t4=t4+a(k,i+1)*b(k,j+1)

enddoc(i,j) = t1c(i+1,j)=t2c(i,j+1)=t3c(i+1,j+1)=t4

enddoenddo

Loop Interchange

To improve spatial locality.Align access pattern with the order in

which data is storage in memory.2-D arrays in Fortran are stored column-

wise.2-D arrays in C are stored row-wise.

Loop Fusion

Beneficial in loop-intensive programs.Decreases index calculation overhead.Can also help in instruction level

parallelism.Beneficial if same data structures are

used in different loops.

Loop Fusion

for (i=0;i<nodes;i++) {

a[i] = a[i]*small;

c[i] = (a[i] + b[i])*relaxn;

}

for (i=1;i<nodes-1;i++)

{

d[i]=c[i]-a[i];

}

Loop Fusionfor (i=0;i<nodes;i++) { a[i] = a[i]*small; c[i] = (a[i] +

b[i])*relaxn;}for (i=1;i<nodes-1;i++){ d[i]=c[i]-a[i];}

a[0] = a[0]*small; c[0]= (a[0]+b[0])*relaxn; a[nodes-1] = a[nodes-

1]*small; c[nodes-1] = c[nodes-

1]*relaxn; for (i=1;i<nodes-1;i++) { a[i] = a[i]*small; c[i] = (a[i] +

b[i])*relaxn; d[i] = c[i] - a[i]; }

friday, september 15, 2006 the three most important factors in selling optimization are location,...

Documents