friday, september 15, 2006 the three most important factors in selling optimization are location,...
Post on 20-Dec-2015
218 views
TRANSCRIPT
Friday, September 15, 2006
The three most important factors in selling optimization
are location, location, location.
- Realtor’s creed
Optimizations to source code
There are several optimizations that compilers and users can make to source code which generally result in fewer assembly code instructions.
Constant folding and propagation
Multiple constants are folded together and evaluated at compile time.
x=4+3;
y=x-2;
C/C++ Register data type
If a variable is to be used many times, then they should be kept in registers and not loaded from the memory.
Hint to compilers
C/C++ Register data type
If a variable is to be used many times, then they should be kept in registers and no loaded from the memory.
Hint to compilers asm (problem?)
Strength reductionsReplace expensive operations with cheaper
ones.Replace integer multiplication and
division by constants with shift operations Most processors have integer shift functional
units. Multiplication of a number by 9?
Strength reductionsReplace expensive operations with cheaper
ones.Replace 32-bit integer division by 64-bit
floating point division
Strength reductionsReplace expensive operations with cheaper
ones.• Replace floating point multiplication by
small constants with floating point additions
Strength reductionsReplace expensive operations with cheaper
ones.• Replace floating point divisions by one
division and multiplications• Division is one of the most expensive
operations, while that of multiplication is negligible in comparison.
• a=y/x• b=z/x
Strength reductions
• Replace floating point divisions by one division and multiplications
• Especially inside loops
a(i) = b(i) / c(i)
x(i) = y(i) / z(i)
Strength reductions
• Replace floating point divisions by one division and multiplications
• Especially inside loops.
a(i) = b(i)/c(i)
x(i) = y(i)/z(i)
temp = 1.0/(c(i)*z(i))a(i) = b(i) * z(i) * tempx(i) = y(i) * c(i) * temp
Strength reductionsReplace expensive operations with cheaper
ones.• Replace power functions by floating
point multiplications • Power calculations can take 50 times longer
than multiplication
Fused multiply add (fma)Many processors have compound floating
point multiply add instructions that are more efficient than individual multiply, add instructions
fma: (a*b +c)
fnma: -(a*b) +c
Fused multiply add (fma)a, b, c are complex numbers c=c+a*b
(cr, ci) + (ar, ai) * (br, bi) =
(cr, ci)+((ar*br – ai*bi), (ar*bi + ai*br))
Fused multiply add (fma)(cr, ci) + (ar, ai) * (br, bi) =
(cr, ci)+((ar*br – ai*bi), (ar*bi + ai*br))
Multiply f1=ar*brMultiply f2=ar*bifnma f3=-ai*bi + f1fma f4=ai*br + f2Add f5=cr+f3Add f6=ci+f4
Fused multiply add (fma)Alter the order of instructions to:
(cr, ci) + (ar, ai) * (br, bi) =
((cr+ar*br) – (ai*bi), (ci+ar*bi) + (ai*br))
Fused multiply add (fma)Alter the order of instructions to:
(cr, ci) + (ar, ai) * (br, bi) =
((cr+ar*br) – (ai*bi), (ci+ar*bi) + (ai*br))
fma f1=ar*br + cr
fma f2=ar*bi + ci
fnma f3=-ai*bi + f1
fma f4=ai*br + f2
Loop optimizations
Account of most of the runtime for computational programs
Source code modifications can lead to significant improvements in runtime
Magnification factor
Single loop optimizationfor (i=0; i<n; i+=2)
a[i] = i*k + m;
Multiple of induction variable added to a constant.
Replace multiplication with addition.
Single loop optimizationfor (i=0; i<n; i+=2)
a[i] = i*k + m;Multiple of induction variable added to a constant.Replace multiplication with addition.
counter=m;
for (i=0; i<n; i+=2){
a[i] = counter;
counter = counter + k + k;
}
Condition statementsBranches in code reduce performance especially
on pipelined systems.
do I = 1,Nif (A > 0) then
x[I] = x[I]+1else
x[I] = 0.0endif
enddo
Condition statements
Branches are a significant overhead on pipelined systems.
Branch mis-prediction penalty is high. Pipelined get stalled In-flight instructions may have to be
discardedTry to move if statements out of loop body
Optimization has to be applied on case-by-case basis
Test promotion in loops
do I = 1,N
if (A > 0) then
x[I] = x[I]+1
else
x[I] = 0.0
endif
enddo
if (A > 0) then
do I = 1,N
x[I] = x[I]+1
enddo
else
do I = 1,N
x[I] = 0.0
enddo
endif
Most compilers don’t do this optimization.
Condition statementsdo i=1,len2 do j=1,len2 if(j < i) then a2d(j,i) =
a2d(j,i) + b2d(j,i)*con1
else a2d(j,i) = 1.0 endif enddoenddo
Condition statementsdo i=1,len2 do j=1,len2 if(j < i) then a2d(j,i) =
a2d(j,i) + b2d(j,i)*con1
else a2d(j,i) = 1.0 endif enddoenddo
do i=1,len2 do j=1,i-1 a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
enddo do j=i,len2 a2d(j,i) = 1.0 enddoenddo
Loop PeelingBoundary Conditions
do I = 1,N
if (I == 1) then
x[I] = 0
elseif (I ==N )then
x[I} = N
else
x[I] = x[I]+y[I]
enddo
Loop Peeling
do I = 1,N
if (I == 1) then
x[I] = 0
elseif (I ==N )then
x[I} = N
else
x[I] = x[I]+y[I]
enddo
x[1]=0
do I = 2,N-1
x[I] = x[I]+y[I]
enddo
x[N]=N
Most compilers don’t do this optimization.
Loop Peeling
do i=1,n
y(i,n) = (1.0-x(i,1))*y(1,n)+x(i,1)*y(n,n)
enddo
t2 = y(n,n)
y(1,n) = (1.0-x(1,1))*y(1,n)+x(1,1)*t2
t1 = y(1,n)
do i=2,n-1
y(i,n) = (1.0-x(i,1))*t1 + x(i,1)*t2
enddo
y(n,n) = (1.0-x(n,1))*t1 + x(n,1)*t2
Complex for compilers to perform.
Loop Unrolling
do i=1,n
A(i)=B(i)
enddo
Loop index incremented and checked at the beginning of each iteration
Branches interfere with pipelining
Register blocking: Temporary variables
that are used repeatedly.
Loop Unrolling
do i=1,n
A(i)=B(i)
enddo
do i=1,n,4
A(i)=B(i)
A(i+1)=B(i+1)
A(i+2)=B(i+2)
A(i+3)=B(i+3)
enddo•Unrolled by 4.•Some compilers allow users to specify unrolling depth.•Avoid excessive unrolling: Register pressure / spills can hurt performance•Pipelining to hide instruction latencies•Reduces overhead of index increment and conditional check
Assumption n is divisible by 4
Loop Unrollingdo j=1,m do i=1,n do k=1,p
c(i,j)= c(i,j) +a(k,i)*b(k,j)
enddo enddoenddo
do j=1,m,2 do i=1,n,2
t1=c(i,j)t2=c(i+1,j)t3=c(i,j+1)t4=c(i+1,j+1)
do k=1,pt1=t1+a(k,i)*b(k,j)t2=t2+a(k,i+1)*b(k,j)t3=t3+a(k,i)*b(k,j+1)t4=t4+a(k,i+1)*b(k,j+1)
enddoc(i,j) = t1c(i+1,j)=t2c(i,j+1)=t3c(i+1,j+1)=t4
enddoenddo
Loop Interchange
To improve spatial locality.Align access pattern with the order in
which data is storage in memory.2-D arrays in Fortran are stored column-
wise.2-D arrays in C are stored row-wise.
Loop Fusion
Beneficial in loop-intensive programs.Decreases index calculation overhead.Can also help in instruction level
parallelism.Beneficial if same data structures are
used in different loops.
Loop Fusion
for (i=0;i<nodes;i++) {
a[i] = a[i]*small;
c[i] = (a[i] + b[i])*relaxn;
}
for (i=1;i<nodes-1;i++)
{
d[i]=c[i]-a[i];
}
Loop Fusionfor (i=0;i<nodes;i++) { a[i] = a[i]*small; c[i] = (a[i] +
b[i])*relaxn;}for (i=1;i<nodes-1;i++){ d[i]=c[i]-a[i];}
a[0] = a[0]*small; c[0]= (a[0]+b[0])*relaxn; a[nodes-1] = a[nodes-
1]*small; c[nodes-1] = c[nodes-
1]*relaxn; for (i=1;i<nodes-1;i++) { a[i] = a[i]*small; c[i] = (a[i] +
b[i])*relaxn; d[i] = c[i] - a[i]; }