autotuning (2.5/2): tce & empirical compilers · platform: sun ultra iii 16 double regs 667...

50
Autotuning (2.5/2): TCE & Empirical compilers Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19] Tuesday, March 11, 2008 1

Upload: others

Post on 16-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Autotuning (2.5/2):TCE & Empirical compilers

Prof. Richard Vuduc

Georgia Institute of Technology

CSE/CS 8803 PNA: Parallel Numerical Algorithms

[L.19] Tuesday, March 11, 2008

1

Page 2: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Today’s sources

CS 267 at UCB (Demmel & Yelick)

Papers from various autotuning projects

PHiPAC, ATLAS, FFTW, SPIRAL, TCE

See: Proc. IEEE 2005 special issue on Program Generation, Optimization, and Platform Adaptation

Me (for once!)

2

Page 3: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Review:Autotuners

3

Page 4: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Performance-engineering challenges

Mini-Kernel

Belady /BRILA

Scalarized /Compiler

Outer Control Structure

Iterative Recursive

Inner Control Structure

Statement Recursive

Micro-Kernel

None /Compiler

Coloring /BRILA

Iterative

ATLAS CGw/SATLAS Unleashed

4

Page 5: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Source: J. Johnson (2007), CScADS autotuning workshop

pseudoMflop/s

Motivation for performance tuning

5

Page 6: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Context for autotuning

Problem: HPC needs detailed low-level machine knowledge

Autotuning methodology

Identify and generate a space of implementations

Search (modeling, experiments) to choose the best one

Early idea seedlings

Polyalgorithms

Profile and feedback-directed compilation

Domain- and architecture-specific code generators

6

Page 7: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

m0

n0

k0 = 1 Mflop/s

Example: What a search space looks like

Source: PHiPAC Project at UC Berkeley (1997)

Platform: Sun Ultra IIi

16 double regs

667 Mflop/s peak

Unrolled, pipelined inner-kernel

Sun cc v5.0 compiler

7

Page 8: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Cooley-Tukey FFT algorithm: Encoding in FFTW’s codelet generator

N2-point DFT

N1-point DFT Twiddle

y[k] ! DFTN (x, k) "N!1!

j=0

x[j] · !!kjN x, y # CN

y[k1 + k2 · N1] !N2!1!

n2=0

"#N1!1!

n1

x[n1 · N2 + n2] · !!k1n1N1

$·!!k1n2

N

%·!!k2n2

N2

(Functionalpseudo-code)

let dftgen(N,x) ! fun k " . . . # DFTN (x, k)let cooley tukey(N1, N2, x) !

let x ! fun n2, n1 " x(n2 + n1 · N2) inlet G1 ! fun n2 " dftgen(N1, x(n2, )) inlet W ! fun k1, n2 " G1(n2, k1) · !!k1n2

N inlet G2 ! fun k1 " dftgen(N2,W(k1, ))in

fun k " G2(k mod N1, k div N1)

8

Page 9: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

9

Page 10: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

10

Page 11: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Tensor Contraction Engine (TCE) for quantum chemistry

11

Page 12: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Tensor Contraction Engine (TCE)

Application domain: Quantum chemistry

Electronic structure calculations

Dominant computation expressible as a “tensor contraction”

TCE generates a complete parallel program from a high-level spec

Automates time-space trade-offs

Output

S. Hirata (2002), and many others

Following presentation taken from Proc. IEEE 2005 special issue

12

Page 13: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Source: Baumgartner, et al. (2005)

Motivation: Simplify program development

13

Page 14: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Sabij =!

c,d,e,f,k,l

Aacik !Bbefl ! Cdfjk !Dcdel

"

Sabij =!

c,k

"

#!

d,f

"

#!

e,l

Bbefl !Dcdel

$

%! Cdfjk

$

%!Aacik

Naïvely, ≈ 4 × N10 flops

Assuming associativity and distributivity, ≈ 6 × N6 flops,but also requires temporary storage.

Source: Baumgartner, et al. (2005)

Rewriting to reduce operation counts

14

Page 15: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

T (1)bcdf =

!

e,l

Bbefl !Dcdel

T (2)bcjk =

!

d,f

T (1)bcdf ! Cdfjk

Sabij =!

c,k

T (2)bcjk !Aacik

T1 = T2 = S = 0for b, c, d, e, f, l do

T1[b, c, d, f ] += B[b, e, f, l] · D[c, d, e, l]for b, c, d, f, j, k do

T2[b, c, j, k] += T1[b, c, d, f ] · C[d, f, j, k]for a, b, c, i, j, k do

S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

Operation and storage minimization via loop fusion

15

Page 16: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

T (1)bcdf =

!

e,l

Bbefl !Dcdel

T (2)bcjk =

!

d,f

T (1)bcdf ! Cdfjk

Sabij =!

c,k

T (2)bcjk !Aacik

T1 = T2 = S = 0for b, c, d, e, f, l do

T1[b, c, d, f ] += B[b, e, f, l] · D[c, d, e, l]for b, c, d, f, j, k do

T2[b, c, j, k] += T1[b, c, d, f ] · C[d, f, j, k]for a, b, c, i, j, k do

S[a, b, i, j] += T2[b, c, j, k] · A[a, c, i, k]

Operation and storage minimization via loop fusion

S = 0for b, c do

T1f ! 0, T2f ! 0for d, f do

for e, l doT1f += B[b, e, f, l] · D[c, d, e, l]

for j, k doT2f [j, k] += T1f · C[d, f, j, k]

for a, i, j, k doS[a, b, i, j] += T2f [j, k] · A[a, c, i, k]

16

Page 17: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Time-space trade-offs

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

Integrals, O(1000) flops

“Contraction” of T over i, j

“Contraction” over T(1) and T(2)

Max index of a—f: O(1000) i—k: O(100)

17

Page 18: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Time-space trade-offs

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

Same indices⇒ Loop fusion candidates

Max index of a—f: O(1000) i—k: O(100)

18

Page 19: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Time-space trade-offs

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for a, c, e, f , b, k do

T (1)cebk ! f1(c, e, b, k)

for a, e, c, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

Addextraflops

19

Page 20: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Time-space trade-offs

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

⇐ Fusedfor a, e, c, f dofor i, j do

x += Tijae · Tijcf

for b, k do

T (1)cebk ! f1(c, e, b, k)

T (2)afbk ! f2(a, f, b, k)

y += T (1)cebk · T (2)

afbk

E += x · y

20

Page 21: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

Tiled & partially fused for aB , eB , cB , fB dofor a, e, c, f do

for i, j doXaecf += Tijae · Tijcf

for b, k dofor c, e do

T (1)ce ! f1(c, e, b, k)

for a, f do

T (2)af ! f2(a, f, b, k)

for c, e, a, f do

Yceaf += T (1)ce · T (2)

af

for c, e, a, f doE += Xaecf · Yceaf

21

Page 22: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Transform algebraically, to minimize flops

Minimize temporary storage

Distribute and partition data for a parallel system

Search wrt space-time trade-off (feedback)

For out-of-core problems, apply optimize data locality

Generate final program (C/Fortran + MPI/Global-arrays)

22

Page 23: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

for a, e, c, f dofor i, j do

Xaecf += Tijae · Tijcf

for c, e, b, k do

T (1)cebk ! f1(c, e, b, k)

for a, f, b, k do

T (2)afbk ! f2(a, f, b, k)

for c, e, a, f dofor b, k do

Yceaf += T (1)cebk · T (2)

afbk

for c, e, a, f doE += Xaecf · Yceaf

Tensor loop nest ⇒ Expression tree

E, +ceaf

X, +ij

T T

f1 f2

T (1) T (2)

Y, +bk

23

Page 24: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Expression tree ⇒ fusion graph

E, +ceaf

X, +ij

T T

f1 f2

T (1) T (2)

Y, +bk

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

E

24

Page 25: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

E

25

Page 26: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

EFuse⇒ X scalar

26

Page 27: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

EFuse⇒ Y scalar

27

Page 28: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

E

28

Page 29: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

E

29

Page 30: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Fusion graph

c e a f

a e i j i jc fT T

XY

T (1)

ec a fb k b kf1

T (2)

f2

E

a f ec

30

Page 31: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

31

Page 32: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Empirical compilers and tools

32

Page 33: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Code generation tools for autotuning

Code generation tools

GNU Superoptimizer -- Exhaustive search over schedules of straight-line code

Denali -- Theorem proving based scheduling

iFKO (Whaley @ UTSA) -- Iterative floating-point kernel optimizer

POET (Yi @ UTSA) -- Parameterized Optimizations for Empirical Tuning

33

Page 34: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Iterative/empirical compilers

Compile-time

“Iterative compilation” -- Kisuki, Knijnenberg, O’Boyle, et al.

Hybrid model/search-based compiler -- Hall, et al. (USC)

Eigenmann @ Purdue (Polaris)

Quinlan, et al. (LLNL / PERI)

Qasem (TSU), Kennedy, Mellor-Crummey (Rice) -- Whole program tuning

Compilers that learn -- Cavazos (UDel); Stephenson/Amarsinghe (MIT)

Run-time: Voss, et al.: ADAPT

34

Page 35: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Administrivia

35

Page 36: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Upcoming schedule changes

Some adjustment of topics (TBD)

Today — Project proposals due

Th 3/13 — SIAM Parallel Processing (attendance encouraged)

Tu 4/1 — No class

Th 4/3 — Attend talk by Doug Post from DoD HPC Modernization Program

36

Page 37: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Homework 1:Parallel conjugate gradients

Put name on write-up!

Grading: 100 pts max

Correct implementation — 50 pts

Evaluation — 45 pts

Tested on two samples matrices — 5

Implemented and tested on stencil — 10

“Explained” performance (e.g., per proc, load balance, comp. vs. comm) — 15

Performance model — 15 pts

Write-up “quality” — 5 pts

37

Page 38: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Spy plot of matrix ‘msdoor--UF1644’

38

Page 39: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Non-zeros per row ‘msdoor--UF1644’

nnz

row

39

Page 40: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Non-zeros per row ‘msdoor--UF1644’

“active”elements

row

40

Page 41: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Spy plot of matrix ‘audikw_1--UF1252’

41

Page 42: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Non-zeros per row ‘audikw_1--UF1252’

nnz

row

42

Page 43: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Active elements for ‘audikw_1--UF1252’

“active”elements

row

43

Page 44: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Homework 2: Parallel n-body using the particle-mesh method

Acceleration of particle i, due to forces from all other particles:

Not yet decided what exactly I will ask to do (implementation? pencil-and-paper? thoughts?)

ri =!

j !=i

Fij = !!

j !=i

Gmj(ri ! rj)||ri ! rj ||3

44

Page 45: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Projects

Your goal should be to do something useful, interesting, and/or publishable!

Something you’re already working on, suitably adapted for this course

Faculty-sponsored/mentored

Collaborations encouraged

45

Page 46: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

“In conclusion…”

46

Page 47: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Backup slides

47

Page 48: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

My criteria for “approving” your project

“Relevant to this course:” Many themes, so think (and “do”) broadly

Parallelism and architectures

Numerical algorithms

Programming models

Performance modeling/analysis

48

Page 49: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

General styles of projects

Theoretical: Prove something hard (high risk)

Experimental:

Parallelize something

Take existing parallel program, and improve it using models & experiments

Evaluate algorithm, architecture, or programming model

49

Page 50: Autotuning (2.5/2): TCE & Empirical compilers · Platform: Sun Ultra IIi 16 double regs 667 Mflop/s peak Unrolled, pipelined inner-kernel Sun cc v5.0 compiler 7. Cooley-Tukey FFT

Anything of interest to a faculty member/project outside CoC

Parallel sparse triple product (R*A*RT, used in multigrid)

Future FFT

Out-of-core or I/O-intensive data analysis and algorithms

Block iterative solvers (convergence & performance trade-offs)

Sparse LU

Data structures and algorithms (trees, graphs)

Look at mixed-precision

Discrete-event approaches to continuous systems simulation

Automated performance analysis and modeling, tuning

“Unconventional,” but related

Distributed deadlock detection for MPI

UPC language extensions (dynamic block sizes)

Exact linear algebra

Examples

50