Spiral: an empirical search system for program generation and optimization
David Padua
Department of Computer Science
University of Illinois at Urbana-Champaign
2
Program optimization today
• The optimization phase of a compiler applies a series of transformations to achieve its objectives.
• The compiler uses the outcome of program analysis to determine which transformations are correctness-preserving.
• Compiler transformation and analysis techniques are reasonably well-understood.
• Since many of the compiler optimization problems have “exponential complexity”, heuristics are needed to drive the application of transformations.
3
Optimization drivers
• Developing driving heuristics is laborious.
• One reason for this is the lack of methodologies and tools to build optimization drivers.
• As a result, although there is much in common among compilers, their optimization phases are usually re-implemented from scratch.
4
Optimization drivers (Cont.)
• A consequence: machines and languages that are not widely popular usually lack good compilers (as do some popular systems).
– DSP, network processor, and embedded system programming is often done in assembly language.
– Evaluation of new architectural features requiring compiler involvement is not always meaningful.
– Languages such as APL, MATLAB, LISP, … suffer from chronic low performance.
– New languages are difficult to introduce (although compilers are only a part of the problem).
5
A methodology based on the notion of search space
• Program transformations often have several possible target versions.
– Loop unrolling: how many times to unroll.
– Loop tiling: size of the tile.
– Loop interchange: order of the loop headers.
– Register allocation: which registers are stored in memory to make room for new values.
• The process of optimization can be seen as a search in the space of possible program versions.
6
Empirical search: iterative compilation
• Perhaps the simplest application of the search space model is empirical search, where several versions are generated and executed on the target machine and the fastest version is selected.
T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff. Iterative compilation in program optimization. In Proc. CPC2000, pages 35-44, 2000.
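Empirical search can be illustrated with a minimal sketch. It is not taken from the slides or from the cited paper: the summation kernel, the function names `make_sum_unrolled` and `empirical_search`, and the choice of unrolling factors are all assumptions made for illustration. Each candidate version is generated, checked for correctness, timed on the machine at hand, and the fastest one is kept.

```python
import timeit


def make_sum_unrolled(factor):
    """Generate and compile a summation loop unrolled `factor` times (hypothetical kernel)."""
    body = " + ".join(f"data[i + {k}]" for k in range(factor))
    src = (
        "def summed(data):\n"
        "    n = len(data)\n"
        "    total = 0\n"
        f"    limit = n - n % {factor}\n"
        f"    for i in range(0, limit, {factor}):\n"
        f"        total += {body}\n"
        "    for i in range(limit, n):\n"  # remainder iterations
        "        total += data[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)
    return namespace["summed"]


def empirical_search(factors, data, repeats=5):
    """Time every version on this machine and return (best_factor, timings)."""
    timings = {}
    for factor in factors:
        version = make_sum_unrolled(factor)
        # Every version must compute the same result as the reference.
        assert version(data) == sum(data)
        runs = timeit.repeat(lambda: version(data), number=20, repeat=repeats)
        timings[factor] = min(runs)
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = empirical_search([1, 2, 4, 8], list(range(10_000)))
    print("fastest unrolling factor on this machine:", best)
```

Which factor wins depends on the machine and on the interpreter, which is the point of measuring rather than predicting.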
7
Empirical search and traditional compilers
• Searching is not a new approach. Compilers have applied it in the past, but using architectural prediction models instead of actual runs:
– KAP searched for the best loop header order.
– SGI’s MIPSpro and IBM PowerPC compilers select the best degree of unrolling.
8
Limitations of empirical search
• Empirical search is conceptually simple and portable.
• However,
– the search space tends to be too large, especially when several transformations are combined.
– it is not clear how to apply this method when program behavior is a function of the input data set.
• Heuristics and search strategies are needed.
• Availability of performance “formulas” could help evaluate transformations across input data sets and facilitate search.
9
Compilers and Library Generators
[Figure: diagram with boxes labeled Source Program, Algorithm, Internal representation, Program Transformation, and Program Generation]
10
Empirical search in program/library generators
• Examples:
– FFTW [M. Frigo, S. Johnson]
– Spiral (FFT/signal processing) [J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel), …]
– ATLAS (linear algebra) [R. Whaley, A. Petitet, J. Dongarra]
– PHiPAC [J. Demmel et al.]
11
12
SPIRAL
• The approach:
– Mathematical formulation of signal processing algorithms
– Automatically generate algorithm versions
– A generalization of the well-known FFTW
– Use compiler techniques to translate formulas into implementations
– Adapt to the target platform by searching for the optimal version
13
14
Fast DSP Algorithms As Matrix Factorizations
• Computing y = F4 x is carried out as:
t1 = A4 x (permutation)
t2 = A3 t1 (two F2’s)
t3 = A2 t2 (diagonal scaling)
y = A1 t3 (two F2’s)
• The cost is reduced because A1, A2, A3, and A4 are structured sparse matrices.
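The four steps can be checked numerically with a small sketch. The entries of A1 to A4 below are the standard Cooley-Tukey factors of F4. The sign convention ω4 = exp(-2πi/4) is an assumption, since the slides do not state one, and the helper names `matvec`, `f4_factored`, and `f4_direct` are hypothetical. For clarity the matrices are stored densely; a real implementation would exploit their sparsity.

```python
import cmath

# Assumed convention: F_n[j][k] = w_n ** (j * k) with w_n = exp(-2*pi*i / n).
W4 = cmath.exp(-2j * cmath.pi / 4)

A4 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]    # permutation
A3 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 1], [0, 0, 1, -1]]  # two F2's
A2 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, W4]]   # diagonal scaling
A1 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]  # two F2's


def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def f4_factored(x):
    """y = A1 (A2 (A3 (A4 x))), following the four steps on the slide."""
    t1 = matvec(A4, x)
    t2 = matvec(A3, t1)
    t3 = matvec(A2, t2)
    return matvec(A1, t3)


def f4_direct(x):
    """y = F4 x computed straight from the definition, for comparison."""
    return [sum(W4 ** (j * k) * x[k] for k in range(4)) for j in range(4)]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(f4_factored(x))
    print(f4_direct(x))
```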
15
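Tensor Product Formulation of Cooley-Tukey
Theorem
$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r$$
where $T^{rs}_s$ is a diagonal matrix and $L^{rs}_r$ is a stride permutation.
Example
$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$$
$$F_4 =
\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & \omega_4 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
where $\omega_4$ is a primitive fourth root of unity.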
16
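Formulas for Matrix Factorizations
R1:
$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r$$
for example
$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$$
R2:
$$F_n = \prod_{i=1}^{k} \left(I_{n_{i-}} \otimes F_{n_i} \otimes I_{n_{i+}}\right)\left(I_{n_{i-}} \otimes T^{n_i n_{i+}}_{n_{i+}}\right) \prod_{i=k}^{1} \left(I_{n_{i-}} \otimes L^{n_i n_{i+}}_{n_i}\right)$$
where $n = n_1 \cdots n_k$, $n_{i-} = n_1 \cdots n_{i-1}$, $n_{i+} = n_{i+1} \cdots n_k$.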
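Rule R1 can be verified numerically for small r and s. The sketch below is an illustration, not SPIRAL code: it assumes the convention ω_n = exp(-2πi/n), takes the twiddle matrix to be the diagonal of ω_{rs}^{ij} for 0 ≤ i < r, 0 ≤ j < s, and takes the stride permutation to map x[i·r + j] to position j·s + i. All function names are hypothetical.

```python
import cmath


def dft(n):
    """F_n under the assumed convention w_n = exp(-2*pi*i / n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)] for j in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def twiddle(r, s):
    """Diagonal matrix T^{rs}_s with entries w_{rs}^(i*j), i < r, j < s."""
    n = r * s
    diag = [cmath.exp(-2j * cmath.pi * i * j / n) for i in range(r) for j in range(s)]
    return [[diag[p] if p == q else 0 for q in range(n)] for p in range(n)]


def stride_perm(r, s):
    """Stride permutation L^{rs}_r: output[j*s + i] = input[i*r + j]."""
    n = r * s
    perm = [[0] * n for _ in range(n)]
    for i in range(s):
        for j in range(r):
            perm[j * s + i][i * r + j] = 1
    return perm


def cooley_tukey(r, s):
    """Right-hand side of rule R1 as a dense matrix."""
    product = kron(dft(r), identity(s))
    for factor in (twiddle(r, s), kron(identity(r), dft(s)), stride_perm(r, s)):
        product = matmul(product, factor)
    return product


def max_error(r, s):
    """Largest entrywise difference between F_{rs} and its R1 factorization."""
    exact, factored = dft(r * s), cooley_tukey(r, s)
    return max(abs(e - f) for re, rf in zip(exact, factored) for e, f in zip(re, rf))


if __name__ == "__main__":
    for r, s in [(2, 2), (2, 4), (4, 2), (3, 5)]:
        print(r, s, max_error(r, s))
```

The different (r, s) pairs for the same product, such as (2, 4) and (4, 2), give the same matrix but different computation orders.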
17
Factorization Trees
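[Figure: three factorization trees for F8. Two use rule R1 at the root, splitting F8 into an F2 and an F4 that is itself split by R1 into two F2’s, with the F4 on different sides. The third uses rule R2 to split F8 directly into three F2’s.]
• Different computation order
• Different data access pattern
• Different performance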
18
Walsh-Hadamard Transform
19
Optimal Factorization Trees
• Depend on the platform
• Difficult to deduce
• Can be found by empirical search
– The search space is very large
– Different search algorithms: random, DP, GA, hill-climbing, exhaustive
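As an illustration of the DP option, the following sketch searches binary factorization trees for sizes 2^m. It is a simplification under stated assumptions, not the SPIRAL search engine: the cost functions are a synthetic operation-count model standing in for measured run time, only binary splits (rule R1) are considered, and the names `dp_search`, `leaf_cost`, and `split_cost` are hypothetical. Dynamic programming assumes that the best plan for a sub-transform does not depend on where it is used.

```python
def leaf_cost(m):
    """Synthetic cost of computing F_{2^m} directly as a dense product: n^2."""
    return 4 ** m


def split_cost(a, b, cost_a, cost_b):
    """Synthetic cost of F_{rs} via R1 with r = 2^a, s = 2^b.

    F_r is applied s times, F_s is applied r times, plus r*s twiddle multiplies.
    """
    r, s = 2 ** a, 2 ** b
    return s * cost_a + r * cost_b + r * s


def dp_search(max_log2, leaf_cost, split_cost):
    """Return best[m] = (cost, plan) for every size 2^m up to 2^max_log2.

    A plan is either a leaf name such as 'F4' or a pair (left_plan, right_plan).
    """
    best = {}
    for m in range(1, max_log2 + 1):
        candidates = [(leaf_cost(m), f"F{2 ** m}")]
        for a in range(1, m):
            b = m - a
            cost_a, plan_a = best[a]  # reuse the best sub-plans found earlier
            cost_b, plan_b = best[b]
            candidates.append((split_cost(a, b, cost_a, cost_b), (plan_a, plan_b)))
        best[m] = min(candidates, key=lambda candidate: candidate[0])
    return best


if __name__ == "__main__":
    for m, (cost, plan) in dp_search(6, leaf_cost, split_cost).items():
        print(2 ** m, cost, plan)
```

In an empirical setting, `leaf_cost` and `split_cost` would be replaced by timing the generated code on the target machine.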
20
21
22
Size of Search Space
N       # of formulas     N       # of formulas
2^1     1                 2^9     20793
2^2     1                 2^10    103049
2^3     3                 2^11    518859
2^4     11                2^12    2646723
2^5     45                2^13    13648869
2^6     197               2^14    71039373
2^7     903               2^15    372693519
2^8     4279              2^16    1968801519
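The growth in the table can be reproduced with a short count. This is an observation rather than something stated on the slides: the tabulated counts appear to follow the little Schröder numbers, which count ordered trees with a given number of leaves in which every internal node has at least two children. Here each leaf is taken to be an F2 and each internal node a split of the exponent into two or more parts. The function name is hypothetical.

```python
def count_factorization_trees(max_log2):
    """Return a list f where f[m] is the number of factorization trees for size 2^m.

    A tree for exponent m is either a single leaf (m == 1) or a root whose
    children are an ordered sequence of at least two subtrees whose exponents
    sum to m.
    """
    f = [0, 1]  # f[1] = 1: F2 is a leaf
    for m in range(2, max_log2 + 1):
        total = 0
        for first in range(1, m):
            rest = m - first
            # Ordered sequences of one or more trees covering `rest`:
            # a single tree (f[rest] ways) or a sequence of two or more trees,
            # which is counted by f[rest] as well when rest >= 2.
            sequences = f[rest] if rest == 1 else 2 * f[rest]
            total += f[first] * sequences
        f.append(total)
    return f


if __name__ == "__main__":
    counts = count_factorization_trees(16)
    for m in range(1, 17):
        print(f"2^{m}: {counts[m]}")
```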
23
24
25
More Search Choices
• Programming:
– Loop unrolling
– Memory allocation
– In-lining
• Platform choices:
– Compiler optimization options
26
The SPIRAL System
[Figure: block diagram of the SPIRAL system. Components: Formula Generator, SPL Compiler, Performance Evaluation, Search Engine. Labels: DSP Transform, SPL Program, C/FORTRAN Programs, Target machine, DSP Library.]
27
Spiral
• Spiral does the factorization at installation time and generates one library routine for each size.
• FFTW generates only codelets (input size up to 64) and performs the factorization at run time.
28
A Simple SPL Program
The program consists of comments, definitions, a directive, and a formula:
; This is a simple SPL program
(define A (matrix(1 2)(2 1)))
(define B (diagonal(3 3)))
#subname simple
(tensor (I 2)(compose A B))
;; This is an invisible comment
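To make the meaning of the formula concrete, here is a minimal sketch that evaluates it as a matrix. It assumes that `matrix` lists rows, `diagonal` builds a diagonal matrix, `compose` is the matrix product, `tensor` is the tensor (Kronecker) product, and `(I 2)` is the 2×2 identity. The Python helpers mirror the SPL operator names but are hypothetical and unrelated to the SPL compiler itself.

```python
def matrix(*rows):
    """(matrix (1 2)(2 1)): a matrix given row by row."""
    return [list(row) for row in rows]


def diagonal(*entries):
    """(diagonal (3 3)): a diagonal matrix with the given entries."""
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]


def identity(n):
    """(I n): the n-by-n identity matrix."""
    return diagonal(*([1] * n))


def compose(a, b):
    """(compose A B): the matrix product A B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def tensor(a, b):
    """(tensor A B): the tensor (Kronecker) product of A and B."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


A = matrix((1, 2), (2, 1))
B = diagonal(3, 3)
simple = tensor(identity(2), compose(A, B))

if __name__ == "__main__":
    for row in simple:
        print(row)
```

The result is block diagonal: two copies of the product A B, one per block of the input vector.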
29
Templates
A template consists of a pattern, a condition, and I-code:
(template (F n)          ; pattern
  [ n >= 1 ]             ; condition
  ( do i=0,n-1           ; I-code
      y(i)=0
      do j=0,n-1
        y(i)=y(i)+W(n,i*j)*x(j)
      end
    end ))
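The I-code of this template translates almost line by line into Python. The sketch assumes that W(n, k) denotes the k-th power of an n-th root of unity, with the sign convention exp(-2πi/n) chosen here for illustration. The function names are hypothetical.

```python
import cmath


def W(n, k):
    """Assumed meaning of W(n, k): exp(-2*pi*i / n) ** k."""
    return cmath.exp(-2j * cmath.pi * k / n)


def apply_F(n, x):
    """Direct translation of the template's I-code for (F n)."""
    if not n >= 1:  # the template's condition
        raise ValueError("template (F n) requires n >= 1")
    y = [0j] * n
    for i in range(n):
        y[i] = 0
        for j in range(n):
            y[i] = y[i] + W(n, i * j) * x[j]
    return y


if __name__ == "__main__":
    print(apply_F(2, [1, 2]))
```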
30
SPL Compiler
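[Figure: SPL compiler pipeline. Inputs: SPL Formula and Template Definition. Stages: Parsing, Intermediate Code Generation, Intermediate Code Restructuring, Optimization, Target Code Generation. Intermediate forms: Abstract Syntax Tree, then I-Code between the later stages. Output: FORTRAN, C. A Template Table is also shown.]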
31
Intermediate Code Restructuring
• Loop unrolling
– Degree of unrolling can be controlled globally or case by case
• Scalar function evaluation
– Replace scalar functions with a constant value or an array access
• Type conversion
– Type of input data: real or complex
– Type of arithmetic: real or complex
– Same SPL formula, different C/Fortran programs
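Loop unrolling and scalar function evaluation can be sketched together with a tiny code generator. This is an illustration only, and it emits Python rather than C or Fortran: the loops of the (F n) template are fully unrolled and every call to W(n, k) is replaced by a constant. The sign convention for `w` and all function names are assumptions.

```python
import cmath


def w(n, k):
    """Assumed meaning of W(n, k): exp(-2*pi*i / n) ** k."""
    return cmath.exp(-2j * cmath.pi * k / n)


def f_loop(n, x):
    """Loop version: W is evaluated at run time inside the loop nest."""
    y = [0j] * n
    for i in range(n):
        for j in range(n):
            y[i] += w(n, i * j) * x[j]
    return y


def generate_unrolled(n):
    """Emit straight-line source for F_n with every W(n, k) folded to a constant."""
    lines = ["def f_unrolled(x):", f"    y = [0j] * {n}"]
    for i in range(n):
        terms = []
        for j in range(n):
            c = w(n, (i * j) % n)  # evaluated at generation time
            terms.append(f"complex({c.real!r}, {c.imag!r}) * x[{j}]")
        lines.append(f"    y[{i}] = " + " + ".join(terms))
    lines.append("    return y")
    return "\n".join(lines)


def compile_unrolled(n):
    """Compile the generated source and return the resulting function."""
    namespace = {}
    exec(generate_unrolled(n), namespace)
    return namespace["f_unrolled"]


if __name__ == "__main__":
    print(generate_unrolled(2))
    print(compile_unrolled(4)([1, 2, 3, 4]))
```

Full unrolling makes the code size grow with n², so a generator would normally apply it only to small transforms and keep loops elsewhere.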
32
33
Optimizations
• Formula Generator
– High-level scheduling
– Loop transformation
• SPL Compiler
– High-level optimizations: constant folding, copy propagation, CSE, dead code elimination
• C/Fortran Compiler
– Low-level optimizations: instruction scheduling, register allocation
34
Basic Optimizations (FFT, N=2^5, SPARC, f77 -fast -O5)
35
Basic Optimizations (FFT, N=2^5, MIPS, f77 -O3)
36
Basic Optimizations (FFT, N=2^5, PII, g77 -O6 -malign-double)
37
Performance Evaluation
• Evaluate the performance of the code generated by the SPL compiler
• Platforms: SPARC, MIPS, PII
• Search strategy: dynamic programming
38
Pseudo MFlops
• Estimation of the # of FP operations:
– FFT (radix-2): $5n\log_2 n - 10n + 16$
$$\text{Pseudo MFlops} = \frac{\text{\# of FP operations in the algorithm}}{\text{Execution time } (\mu s)}$$
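The metric is straightforward to compute. The sketch below assumes the radix-2 operation count 5n log2 n - 10n + 16 and an execution time measured in seconds, converted to microseconds inside the function. Both function names are hypothetical.

```python
import math


def radix2_fft_flops(n):
    """Estimated floating-point operations of a radix-2 FFT of size n."""
    return 5 * n * math.log2(n) - 10 * n + 16


def pseudo_mflops(n, seconds):
    """Estimated operation count divided by the execution time in microseconds."""
    microseconds = seconds * 1e6
    return radix2_fft_flops(n) / microseconds


if __name__ == "__main__":
    # A size-32 FFT that runs in 2 microseconds.
    print(pseudo_mflops(32, 2e-6))
```

Because the numerator is the estimate for the radix-2 algorithm rather than the operations actually executed, the metric rewards faster code regardless of how many operations the chosen factorization performs, which is why it is called "pseudo".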
39
FFT Performance (N=2^1 to 2^6)
[Figure: performance plots for SPARC, MIPS, and PII]
40
FFT Performance (N=2^7 to 2^20)
[Figure: performance plots for SPARC, MIPS, and PII]
41
Important Questions
• What lessons can be learned from this work?
• Can this approach be used in other domains?
42