
Page 1:

Fundamentals of High Performance Programming

[email protected] www.netlib.org/atlas

R. Clint Whaley, ATLAS group, Innovative Computing Lab, University of Tennessee

Page 2:

Outline of Talk

• Intro to examples
• Optimization considerations
• Overview of memory and computational optimization
• Memory optimization in detail
  – Memory hierarchy
  – Cache basics
  – Memory optimization examples and analysis
• Computational optimization in detail
  – FPU basics
  – Computational optimization techniques, with examples
• Optimized examples
• Escape before audience recovers

Page 3:

Basic Analysis Definitions

• Number of memory references
• Number of FLOPs (floating point operations)
• Number of input data words
  – Reuse is possible when the number of memory references is greater than the number of input data words (e.g., GEMM makes 4N³ references to only 3N² words, so each word can be reused O(N) times)

Page 4:

Example Operations

DDOT: dot product of vectors X & Y
2N FLOPs, 2N memory references, 2N data words

for (dot=0.0, i=0; i < N; i++)
   dot += x[i] * y[i];

GEMM: matrix multiply C += A*B (N x N matrices)
2N³ FLOPs, 4N³ memory references, 3N² data words

for (j=0; j < N; j++)
   for (i=0; i < N; i++)
      for (k=0; k < N; k++)
         C[i+j*ldc] += A[i+k*lda] * B[k+j*ldb];

Page 5:

Optimization Considerations

• Remember, there are two ways to optimize:
  – Change the algorithm
  – Performance tuning
• 90/10 rule: use profiling to find the kernels where the time is actually spent
• Ask if an identified kernel can be recast as an already-optimized kernel such as GEMM
• Remember: "optimized for performance" and "portable/maintainable" are antonyms
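For instance, a kernel recognized as a matrix multiply can simply delegate to an optimized BLAS GEMM rather than being hand-tuned itself. A minimal sketch using the standard CBLAS interface (the header name and link flags vary by BLAS implementation, and my_kernel is a hypothetical wrapper):

#include <cblas.h>   /* CBLAS interface; provided by ATLAS among others */

/* C += A*B on column-major N x N matrices, delegated to optimized GEMM */
void my_kernel(int N, const double *A, const double *B, double *C)
{
   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
               N, N, N, 1.0, A, N, B, N, 1.0, C, N);
}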

Page 6:

Hand-tuned Optimization Facts

• Purely a priori optimization is a pipe dream
  – Hardware component interaction is difficult enough to predict
  – Compiler, OS, and ISA get in the way of the hardware
• You will rewrite the kernel for every new architecture
  – Use assembler sparingly
  – Try to find collaborators who also use the kernel
• The more you hand-optimize, the less the compiler can do for you

Page 7:

Two Types of Optimization

• Memory/cache optimization
  – Theoretical peak: (bus width) * (bus speed)
    • PII:    (32 bits)  * (66 MHz)  =  264 MB/s =  33 MW/s
    • Athlon: (64 bits)  * (200 MHz) = 1600 MB/s = 200 MW/s
    • Power3: (128 bits) * (100 MHz) = 1600 MB/s = 200 MW/s
• Computational optimization
  – Theoretical peak: (# FPUs) * (flops/cycle) * MHz
    • PII:    (1 FPU)  * (1 flop/cycle)  * (450 MHz) =  450 MFLOPS
    • Athlon: (2 FPUs) * (1 flop/cycle)  * (600 MHz) = 1200 MFLOPS
    • Power3: (2 FPUs) * (2 flops/cycle) * (375 MHz) = 1500 MFLOPS
• Memory is at least an order of magnitude slower than computation
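As a check on the arithmetic, a small sketch that plugs one machine's numbers into both peak formulas (the parameters are just the illustrative Athlon values from this slide):

#include <stdio.h>

/* Theoretical peaks from the formulas above; parameters are
 * illustrative (Athlon from the slide), words are 8 bytes. */
int main(void)
{
   double bus_bits = 64.0, bus_mhz = 200.0;
   double nfpu = 2.0, flops_per_cyc = 1.0, cpu_mhz = 600.0;
   double mbs = (bus_bits / 8.0) * bus_mhz;             /* MB/s */
   printf("mem peak: %.0f MB/s = %.0f MW/s\n", mbs, mbs/8.0);
   printf("cpu peak: %.0f MFLOPS\n", nfpu * flops_per_cyc * cpu_mhz);
   return 0;
}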

Page 8:

Memory Hierarchy

Level           Latency (cycles)   Size
Registers       1                  8, 16, or 32 words
Level 1 cache   4-16               8-128 KB (1-16 KW)
Level 2 cache   80-100             512 KB - 8 MB (64 KW - 1 MW)
Main memory     500                0.5-8 GB
Disk            Millertime         X TB

Page 9:

Cache Basics

• Write policy (write-through or write-back)
  – Writes are more expensive than reads
• Replacement policy (e.g., LRU)
• Line size (2-8 double precision words)
• Associativity
• Level (1-3)
• Separate or combined data/instruction caches
• Latency between levels
• Bandwidth between levels
• Number of outstanding requests before blocking
• Prefetch strategy/units
• TLB: Translation Lookaside Buffer
  – virtual-to-physical memory address buffer
  – >= 32 VM pages
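Several of the parameters above can be discovered at runtime. A sketch for Linux with glibc, whose sysconf() exposes some cache geometry (these _SC_ names are glibc extensions and may be unavailable, or return 0, on other systems):

#include <stdio.h>
#include <unistd.h>

/* Query L1 data cache geometry via glibc's sysconf extensions. */
int main(void)
{
   long line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
   long size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
   long assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);
   printf("L1D: %ld bytes, %ld-byte lines, %ld-way\n", size, line, assoc);
   return 0;
}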

Page 10:

Cache Basics

• Do:
  – Start on a cache line boundary
  – Use the entire line
  – Make stride (lda) a multiple of the cache line size (see the sketch after this list)
  – Issue as many nonblocking fetches as the cache supports
  – If you have reuse:
    • Block for the cache size
    • Copy to contiguous storage
• Don't:
  – Use a power of 2 for a non-contiguous matrix stride (lda)
  – Access strided memory
  – Access more separate memory regions than the TLB has entries
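A small sketch of the stride advice above: round lda up to a multiple of the cache line, then nudge it off a power of 2 (the helper name and the assumed 8-double line size are illustrative only):

/* Pick a "good" leading dimension: a multiple of the cache line size
 * (assumed here to be 8 doubles) that is not a power of 2. */
static int good_lda(int n)
{
   int lda = ((n + 7) >> 3) << 3;   /* round up to multiple of 8 */
   if ((lda & (lda - 1)) == 0)      /* power of 2? */
      lda += 8;                     /* move off it, staying line-aligned */
   return lda;
}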

Page 11:

Memory Optimized DDOT

• 2N flops, 2N data fetches (X and Y are not reused)
• Can still unroll for outstanding fetches and multiple prefetch units

for (dot=0.0, i=0; i < N; i++)
   dot += x[i] * y[i];

4 fetches (unroll by 8 with two accumulators; N assumed a multiple of 8, and dot = dot0 + dot1 after the loop):

for (dot0=dot1=0.0, i=0; i < N; i += 8)
{
   dot0 += x[0] * y[0]; dot1 += x[4] * y[4];
   dot0 += x[1] * y[1]; dot1 += x[5] * y[5];
   dot0 += x[2] * y[2]; dot1 += x[6] * y[6];
   dot0 += x[3] * y[3]; dot1 += x[7] * y[7];
   x += 8; y += 8;
}

4 prefetch units (split each vector in half, giving four sequential streams):

n2 = N / 2;
xx = X + n2; yy = Y + n2;
for (dot0=dot1=0.0, i=0; i < n2; i++)
{
   dot0 += x[i] * y[i];
   dot1 += xx[i] * yy[i];
}

Page 12:

Memory Optimization for GEMM

• (N² elements of C) * (N adds + N mults) = 2N³ FLOPs
• (N² elements of C) * (N iterations * (read A, read B, read C, write C)) = 3N³ reads + N³ writes = 4N³ memory references
• Since there are 4N³ memory references but only 3N² data words, reuse is possible
• The number of memory references is irreducible, but the number that hit main memory is not
• Number of main memory references: somewhere between 3N² and 4N³
• Can block for reuse at each level of the memory hierarchy
• Cache blocking can be varied through parameterization
• Register blocking requires differing codes to vary

Page 13:

Register Blocking

• Typically 8-32 registers, so can keep only one array in registers
• Remember, the cost is: N³ RA + N³ RB + (N³ RC + N³ WC)
• So C has at least twice the cost of A or B; therefore put K as the inner loop
  – No reuse along K for A & B, so reuse must come from within the register block
• Can use differing numbers of registers for A & B, but near-square blocks are theoretically superior, so for simplicity let the register blocking factor for both A and B be N0
• Main memory access is now: N² RC + N² WC + (N³/N0) RA + (N³/N0) RB

[Figure: C(I,J) += A(I,K) * B(K,J), blocked]

Page 14:

Ex.: 2x2 Register Blocking (N assumed even; cleanup code omitted)

for (j=0; j < N; j += 2)
{
   for (i=0; i < N; i += 2)
   {
      c00 = C[i+j*ldc];   c01 = C[i+(j+1)*ldc];
      c10 = C[i+1+j*ldc]; c11 = C[i+1+(j+1)*ldc];
      for (k=0; k < N; k++)
      {
         a0 = A[i+k*lda]; a1 = A[i+1+k*lda];
         b0 = B[k+j*ldb]; b1 = B[k+(j+1)*ldb];
         c00 += a0 * b0;  c10 += a1 * b0;
         c01 += a0 * b1;  c11 += a1 * b1;
      }
      C[i+j*ldc] = c00;   C[i+(j+1)*ldc] = c01;
      C[i+1+j*ldc] = c10; C[i+1+(j+1)*ldc] = c11;
   }
}

Page 15:

Blocking for Level 1 Cache

• Register blocking reduces the dot-product-like access to C:
  – ((2N³)/N0 + N²) R + N² W
• Assuming N0² + 2·N·N0 <= L1, access reduces to:
  – N² RA + (N³/N0) RB + N² RC + N² WC
  – Still cubic access; memory cost remains dominant
• So we need to explicitly block for L1

Page 16:

Blocking for Level 1 Cache

• Choose blocking factor N1 such that 3N1² <= L1; the cost becomes:
  – (2N³/N1 + N²) RM + N² WM
  – 2N³/N0 R1 + (N³/N1)(R1 + W1)
• Or such that N1² + N1·N0 + N0² <= L1; the cost becomes:
  – (2N³/N1 + N²) RM + N² WM
  – 3N³/N1 R2 + N³/N1 W2
  – 2N³/N0 R1 + (N³/N1)(R1 + W1)
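A minimal sketch of the explicit L1 blocking described above, tiling all three loops by a single factor N1 so that three N1 x N1 blocks fit in L1 together (3·N1² <= L1); the function name and N1 value are illustrative, and N is assumed a multiple of N1 (cleanup loops omitted):

#define N1 40   /* hypothetical L1 blocking factor; tune per machine */

/* Cache-blocked C += A*B on column-major N x N matrices. */
void gemm_blocked(int N, const double *A, int lda,
                  const double *B, int ldb, double *C, int ldc)
{
   int i, j, k, ib, jb, kb;
   for (jb=0; jb < N; jb += N1)
      for (kb=0; kb < N; kb += N1)
         for (ib=0; ib < N; ib += N1)    /* one N1 x N1 block of C */
            for (j=jb; j != jb+N1; j++)
               for (k=kb; k != kb+N1; k++)
                  for (i=ib; i != ib+N1; i++)   /* stride-1 on A and C */
                     C[i+j*ldc] += A[i+k*lda] * B[k+j*ldb];
}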

Page 17:

Blocking for Level 2 Cache

• Implicit blocking occurs if 2·N·N1 + N1² <= L2
• Main memory access is then reduced from:
  – (2N³/N1 + N²) R + N² W
  to:
  – (N³/N1 + 2N²) R + N² W
• Can do explicit L2 blocking by cutting K so that 2·N0·N1 + N1² <= L2
• Or can block all dimensions with a factor N2, just as with N1 for L1
• Can do the same for arbitrary levels of cache

Page 18:

FPU Basics

• Number of FPUs
• Pipelined or not ("lame")
• Repeat rate (1 cycle, or not fully pipelined)
• Instruction type:
  – muladd
  – separate multiply and add
  – multiply or add (shared unit)

Page 19:

Computational Optimization

• Use "const" if a variable does not change
• Use bitwise shifts to avoid integer multiply and divide
• Temporary arrays should come from the heap, not the stack
• Eliminate unused loop variables (pointer-controlled loops)
• Unroll loops for reduced loop overhead
• Increment pointers only at the end of the loop

These rules applied to DDOT:

double ddot(const int N, const double *X, const double *Y)
{
   const int N4 = (N>>2)<<2, nr = N-N4;
   const double *stX = X+N4;
   register double dot=0.0;

   if (N4)
   {
      do
      {
         dot += *X * *Y;
         dot += X[1] * Y[1];
         dot += X[2] * Y[2];
         dot += X[3] * Y[3];
         X += 4; Y += 4;
      }
      while (X != stX);
   }
   if (nr)   /* cleanup loop for the remaining N%4 elements */
   {
      stX += nr;
      do dot += *X++ * *Y++; while (X != stX);
   }
   return(dot);
}

Page 20:

Computational Optimization

• Never use < or > if you can safely use != or ==
  – for (i=0; i < N; i++)
  – for (i=0; i != N; i++)
  – for (i=N; i; i--)
• Branch prediction usually guesses "true"
  – In if/else, put the most common case first
  – Use do-while, not while

Max of a vector:

for (max=0.0, i=0; i < N; i++)
   if (X[i] > max) max = X[i];

Becomes:

if (N > 0)
{
   max = *X;
   if (N != 1)
   {
      const double *stX = X + N;
      X++;
      do
      {
         if (*X <= max) continue;
         else max = *X;
      }
      while (++X != stX);
   }
}

Page 21:

Computational Optimization

• If operating on multiple memory addresses separated by a non-compile-time constant, use multiple pointers
• Use local variables (registers) to avoid aliasing problems and provide register blocking
• Do not recompute intermediate computations

Page 22:

GEMM with the J loop unrolled by 2 (each A element reused for two columns of C):

for (j=0; j != (N/2)*2; j += 2)
{
   for (i=0; i != N; i++)
   {
      for (k=0; k != N; k++)
      {
         C[i+j*ldc] += A[k*lda+i] * B[k+j*ldb];
         C[i+(j+1)*ldc] += A[k*lda+i] * B[k+(j+1)*ldb];
      }
   }
}

Becomes, with multiple pointers and local (register) variables:

int i, j, k;
register double rA0, rB0, rB1, rC0, rC1;
const int n2 = (N>>1)<<1, incC=(ldc<<1), incB=(ldb<<1), incA=1-N*lda;
double *pC0=C, *pC1=C+ldc;
const double *pB0=B, *pB1=B+ldb, *pA0=A;

for (j=0; j != n2; j += 2)
{
   for (i=0; i != N; i++)
   {
      rC0 = pC0[i]; rC1 = pC1[i];
      for (k=0; k != N; k++)
      {
         rA0 = *pA0; pA0 += lda;
         rB0 = pB0[k]; rB1 = pB1[k];
         rC0 += rA0 * rB0;
         rC1 += rA0 * rB1;
      }
      pC0[i] = rC0; pC1[i] = rC1;
      pA0 += incA;
   }
   pA0 = A;
   pC0 += incC; pC1 += incC;
   pB0 += incB; pB1 += incB;
}

Page 23:

Simple MULADD Pipelining

• For each FPU, use at least pipelen accumulators to avoid pipeline stalls

for (i=0; i < N; i++)
   dot += X[i] * Y[i];

4-length pipeline example (dot0..dot3 start at 0.0 and are summed into dot after the loop):

register double dot0, dot1, dot2, dot3;
if (n4)
{
   do
   {
      dot0 += *X * *Y;
      dot1 += X[1] * Y[1];
      dot2 += X[2] * Y[2];
      dot3 += X[3] * Y[3];
      X += 4; Y += 4;
   } while (X != stX);
}

Page 24:

Pipelining/loop skewing for separate multiply and add units

register double m0, m1, m2, m3, dot0, dot1, dot2, dot3;
m0 = *X * *Y; m1 = X[1] * Y[1]; m2 = X[2] * Y[2]; m3 = X[3] * Y[3];
X += 4; Y += 4;
do
{
   dot0 += m0; m0 = *X * *Y;
   dot1 += m1; m1 = X[1] * Y[1];
   dot2 += m2; m2 = X[2] * Y[2];
   dot3 += m3; m3 = X[3] * Y[3];
   X += 4; Y += 4;
}
while (X != stX);
dot0 += m0; dot1 += m1; dot2 += m2; dot3 += m3;

Page 25:

void ATL_dJIK60x60x60TN60x60x0_a1_b1
   (const int M, const int N, const int K, const double alpha,
    const double *A, const int lda, const double *B, const int ldb,
    const double beta, double *C, const int ldc)
/*
 * matmul with TA=T, TB=N, MB=60, NB=60, KB=60,
 * lda=60, ldb=60, ldc=0, mu=6, nu=3, ku=1
 */
{
   const double *stM = A + 3600;
   const double *stN = B + 3600;
   #define incAk 1
   const int incAm = 300, incAn = -3600;
   #define incBk 1
   const int incBm = -60, incBn = 180;
   #define incCm 6
   const int incCn = (((ldc) << 1)+ldc) - 60;
   double *pC0=C, *pC1=pC0+(ldc), *pC2=pC1+(ldc);
   const double *pA0=A;
   const double *pB0=B;
   register int k;
   register double rA0, rA1, rA2, rA3, rA4, rA5;
   register double rB0, rB1, rB2;
   register double rC00, rC10, rC20, rC30, rC40, rC50,
                   rC01, rC11, rC21, rC31, rC41, rC51,
                   rC02, rC12, rC22, rC32, rC42, rC52;

   do /* N-loop */
   {
      do /* M-loop */
      {
         rC00 = *pC0;   rC10 = pC0[1]; rC20 = pC0[2];
         rC30 = pC0[3]; rC40 = pC0[4]; rC50 = pC0[5];
         rC01 = *pC1;   rC11 = pC1[1]; rC21 = pC1[2];
         rC31 = pC1[3]; rC41 = pC1[4]; rC51 = pC1[5];
         rC02 = *pC2;   rC12 = pC2[1]; rC22 = pC2[2];
         rC32 = pC2[3]; rC42 = pC2[4]; rC52 = pC2[5];

Page 26:

         for (k=60; k; k--) /* easy loop to unroll */
         {
            rB0 = *pB0;     rB1 = pB0[60];  rB2 = pB0[120];
            rA0 = *pA0;     rA1 = pA0[60];  rA2 = pA0[120];
            rA3 = pA0[180]; rA4 = pA0[240]; rA5 = pA0[300];
            rC00 += rA0 * rB0; rC10 += rA1 * rB0; rC20 += rA2 * rB0;
            rC30 += rA3 * rB0; rC40 += rA4 * rB0; rC50 += rA5 * rB0;
            rC01 += rA0 * rB1; rC11 += rA1 * rB1; rC21 += rA2 * rB1;
            rC31 += rA3 * rB1; rC41 += rA4 * rB1; rC51 += rA5 * rB1;
            rC02 += rA0 * rB2; rC12 += rA1 * rB2; rC22 += rA2 * rB2;
            rC32 += rA3 * rB2; rC42 += rA4 * rB2; rC52 += rA5 * rB2;
            pA0 += incAk; pB0 += incBk;
         }

         *pC0 = rC00;   pC0[1] = rC10; pC0[2] = rC20;
         pC0[3] = rC30; pC0[4] = rC40; pC0[5] = rC50;
         *pC1 = rC01;   pC1[1] = rC11; pC1[2] = rC21;
         pC1[3] = rC31; pC1[4] = rC41; pC1[5] = rC51;
         *pC2 = rC02;   pC2[1] = rC12; pC2[2] = rC22;
         pC2[3] = rC32; pC2[4] = rC42; pC2[5] = rC52;
         pC0 += incCm; pC1 += incCm; pC2 += incCm;
         pA0 += incAm; pB0 += incBm;
      }
      while (pA0 != stM);
      pC0 += incCn; pC1 += incCn; pC2 += incCn;
      pA0 += incAn; pB0 += incBn;
   }
   while (pB0 != stN);
}

Page 27:

The same kernel, generated with software pipelining/loop skewing for separate multiply and add units (this instance appears to use mu=2, nu=1, and a completely unrolled K-loop of KB=40):

   do /* N-loop */
   {
      do /* M-loop */
      {
         rC00 = *pC0; rC10 = pC0[1];
/*
 *       Start pipeline
 */
         rA0 = *pA0; rB0 = *pB0; rA1 = pA0[40];
         m0 = rA0 * rB0;
         m1 = rA1 * rB0;
         rA0 = pA0[1]; rB0 = pB0[1]; rA1 = pA0[41];
         m2 = rA0 * rB0;
         m3 = rA1 * rB0;
         rA0 = pA0[2]; rB0 = pB0[2]; rA1 = pA0[42];
         m4 = rA0 * rB0;
/*
 *       Completely unrolled K-loop
 */
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[3]; rB0 = pB0[3]; rA1 = pA0[43];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;

         rA0 = pA0[4]; rB0 = pB0[4]; rA1 = pA0[44];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[5]; rB0 = pB0[5]; rA1 = pA0[45];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[6]; rB0 = pB0[6]; rA1 = pA0[46];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[7]; rB0 = pB0[7]; rA1 = pA0[47];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[8]; rB0 = pB0[8]; rA1 = pA0[48];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[9]; rB0 = pB0[9]; rA1 = pA0[49];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[10]; rB0 = pB0[10]; rA1 = pA0[50];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[11]; rB0 = pB0[11]; rA1 = pA0[51];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;

Page 28:

         rA0 = pA0[12]; rB0 = pB0[12]; rA1 = pA0[52];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[13]; rB0 = pB0[13]; rA1 = pA0[53];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[14]; rB0 = pB0[14]; rA1 = pA0[54];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[15]; rB0 = pB0[15]; rA1 = pA0[55];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[16]; rB0 = pB0[16]; rA1 = pA0[56];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[17]; rB0 = pB0[17]; rA1 = pA0[57];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[18]; rB0 = pB0[18]; rA1 = pA0[58];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[19]; rB0 = pB0[19]; rA1 = pA0[59];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;

         rA0 = pA0[20]; rB0 = pB0[20]; rA1 = pA0[60];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[21]; rB0 = pB0[21]; rA1 = pA0[61];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[22]; rB0 = pB0[22]; rA1 = pA0[62];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[23]; rB0 = pB0[23]; rA1 = pA0[63];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[24]; rB0 = pB0[24]; rA1 = pA0[64];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[25]; rB0 = pB0[25]; rA1 = pA0[65];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[26]; rB0 = pB0[26]; rA1 = pA0[66];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[27]; rB0 = pB0[27]; rA1 = pA0[67];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[28]; rB0 = pB0[28]; rA1 = pA0[68];

Page 29:

         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[29]; rB0 = pB0[29]; rA1 = pA0[69];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[30]; rB0 = pB0[30]; rA1 = pA0[70];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[31]; rB0 = pB0[31]; rA1 = pA0[71];
         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[32]; rB0 = pB0[32]; rA1 = pA0[72];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[33]; rB0 = pB0[33]; rA1 = pA0[73];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[34]; rB0 = pB0[34]; rA1 = pA0[74];
         rC10 += m3; m3 = rA0 * rB0;
         rC00 += m4; m4 = rA1 * rB0;
         rA0 = pA0[35]; rB0 = pB0[35]; rA1 = pA0[75];
         rC10 += m0; m0 = rA0 * rB0;
         rC00 += m1; m1 = rA1 * rB0;
         rA0 = pA0[36]; rB0 = pB0[36]; rA1 = pA0[76];

         rC10 += m2; m2 = rA0 * rB0;
         rC00 += m3; m3 = rA1 * rB0;
         rA0 = pA0[37]; rB0 = pB0[37]; rA1 = pA0[77];
         rC10 += m4; m4 = rA0 * rB0;
         rC00 += m0; m0 = rA1 * rB0;
         rA0 = pA0[38]; rB0 = pB0[38]; rA1 = pA0[78];
         rC10 += m1; m1 = rA0 * rB0;
         rC00 += m2; m2 = rA1 * rB0;
         rA0 = pA0[39]; rB0 = pB0[39]; rA1 = pA0[79];
         rC10 += m3; m3 = rA0 * rB0;
/*
 *       Drain pipe on last iteration of K-loop
 */
         rC00 += m4; m4 = rA1 * rB0;
         rC10 += m0;
         rC00 += m1;
         rC10 += m2;
         rC00 += m3;
         rC10 += m4;
         pA0 += incAk; pB0 += incBk;
         *pC0 = rC00; pC0[1] = rC10;
         pC0 += incCm;
         pA0 += incAm; pB0 += incBm;
      }
      while (pA0 != stM);
      pC0 += incCn;
      pA0 += incAn; pB0 += incBn;
   }
   while (pB0 != stN);
