an experimental comparison of empirical and model-based optimization
Post on 25-Jan-2016
34 Views
Preview:
DESCRIPTION
TRANSCRIPT
An Experimental Comparison of Empirical and Model-based Optimization
Keshav PingaliCornell University
Joint work with:Kamen Yotov2 ,Xiaoming Li1, Gang Ren1, Michael Cibulskis1,Gerald DeJong1, Maria Garzaran1, David Padua1,Paul Stodghill2, Peng Wu3
1UIUC, 2Cornell University, 3IBM T.J.Watson
Context: High-performance libraries Traditional approach
Hand-optimized code: (e.g.) BLAS Problem: tedious to develop
Alternatives: Restructuring compilers
General purpose: generate code from high-level specifications Use architectural models to determine optimization parameters Performance of optimized code is not satisfactory
Library generators Problem-specific: (e.g.) ATLAS for BLAS, FFTW for FFT Use empirical optimization to determine optimization parameters Believed to produce optimal code
Why are library generators beating compilers?
How important is empirical search? Model-based optimization and empirical
optimization are not in conflict Use models to prune search space
Search intelligent search Use empirical observations to refine models
Understand what essential aspects of reality are missing from model and refine model appropriately
Multiple models are fine Learn from experience
Previous work
Compared performance of code generated by a sophisticated compiler like SGI
MIPSpro code generated by ATLAS found ATLAS code is better
Hard to answer why Perhaps ATLAS is effectively doing transformations that
compilers do not know about Phase-ordering problem: perhaps compilers are doing
transformations in wrong order Perhaps parameters to transformations are chosen sub-
optimally by compiler using models
Our Approach
DetectHardware
Parameters
ATLAS SearchEngine
(MMSearch)NR
MulAddLatency
L1SizeATLAS MM
Code Generator(MMCase)
xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
Compile,Execute,Measure
MFLOPS
DetectHardware
ParametersModelNR
MulAddLatency
L1I$Size ATLAS MMCode Generator
(MMCase)xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
L1Size
Original ATLAS Infrastructure
Model-Based ATLAS Infrastructure
DetectHardware
Parameters
DetectHardware
Parameters
ATLAS MMCode Generator
(MMCase)
ATLAS MMCode Generator
(MMCase)
ATLAS SearchEngine
(MMSearch)
Model
Detecting Machine Parameters Micro-benchmarks
L1Size: L1 Data Cache size Similar to Hennessy-Patterson book
NR: Number of registers MulAdd: Fused Multiply Add (FMA)
“c+=a*b” as opposed to “c+=t; t=a*b” Latency: Latency of FP Multiplication
Code Generation: Compiler View
DetectHardware
Parameters
ATLAS SearchEngine
(MMSearch)NR
MulAddLatency
L1SizeATLAS MM
Code Generator(MMCase)
xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
Compile,Execute,Measure
MFLOPS
DetectHardware
ParametersModelNR
MulAddLatency
L1I$Size ATLAS MMCode Generator
(MMCase)xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
L1Size
Original ATLAS Infrastructure
Model-Based ATLAS Infrastructure
BLAS
Let us focus on BLAS-3 Code for MMM: for (int i = 0; i < M; i++) for (int j = 0; j < N; j++) for (int k = 0; k < K; k++) C[i][j] += A[i][k]*B[k][j]
Properties Very good reuse: O(N2) data, O(N3) computation Many optimization opportunities
Few “real” dependencies Will run poorly on modern machines
Poor use of cache and registers Poor use of processor pipelines
CBAC
Optimizations
Cache-level blocking (tiling) Atlas blocks only for L1 cache
Register-level blocking Highest level of memory hierarchy Important to hold array values in registers
Software pipelining Unroll and schedule operations
Versioning Dynamically decide which way to compute
Back-end compiler optimizations Scalar Optimizations Instruction Scheduling
Cache-level blocking (tiling) Tiling in ATLAS
Only square tiles (NBxNBxNB)
Working set of tile fits in L1 Tiles are usually copied to
continuous storage Special “clean-up” code
generated for boundaries Mini-MMM
for (int j = 0; j < NB; j++) for (int i = 0; i < NB; i++) for (int k = 0; k < NB; k++) C[i][j] += A[i][k] * B[k][j]
NB: Optimization parameter
B
N
M
A C
NB
NB
K
K
Short excursion into tiling
MMM miss ratio L1 Cache Miss Ratio for Intel Pentium III
– MMM with N = 1…1300– 16KB 32B/Block 4-way 8-byte elements
IJK version (large cache)
DO I = 1, N//row-major storage DO J = 1, N DO K = 1, N C(I,J) = C(I,J) + A(I,K)*B(K,J)
Large cache scenario: Matrices are small enough to fit into cache Only cold misses, no capacity misses Miss ratio:
Data size = 3 N2
Each miss brings in b floating-point numbers Miss ratio = 3 N2 /b*4N3 = 0.75/bN = 0.019 (b = 4,N=10)
C
B
A
K
K
IJK version (small cache)
DO I = 1, N DO J = 1, N DO K = 1, N C(I,J) = C(I,J) + A(I,K)*B(K,J)
Small cache scenario: Matrices are large compared to cache
reuse distance is not O(1) => miss Cold and capacity misses Miss ratio:
C: N2/b misses (good temporal locality) A: N3 /b misses (good spatial locality) B: N3 misses (poor temporal and spatial locality) Miss ratio 0.25 (b+1)/b = 0.3125 (for b = 4)
C
B
A
K
K
MMM experiments L1 Cache Miss Ratio for Intel Pentium III
– MMM with N = 1…1300– 16KB 32B/Block 4-way 8-byte elements
Tile size
Register-level blocking
Micro-MMM A: MUx1 B: 1xNU C: MUxNU MUxNU+MU+NU registers
Unroll loops by MU, NU, and KU Mini-MMM with Micro-MMM inside
for (int j = 0; j < NB; j += NU) for (int i = 0; i < NB; i += MU) load C[i..i+MU-1, j..j+NU-1] into registers for (int k = 0; k < NB; k++) load A[i..i+MU-1,k] into registers load B[k,j..j+NU-1] into registers multiply A’s and B’s and add to C’s store C[i..i+MU-1, j..j+NU-1]
MU, NU, KU: optimization parameters
B
NB
NB
A C
K
MU
NU
K
KU times
Scheduling
FMA Present? Schedule Computation
Using Latency Schedule Memory Operations
Using IFetch, NFetch,FFetch
Latency, xFetch: optimization parameters
M1
M2
M3
M4
MMU*NU
…
A1
A2
A3
A4
AMU*NU
…
L1
L2
L3
LMU+NU
…
La
ten
cy=2
A1
A2
AMU*NU
…
Computation
MemoryOperationsComputation
MemoryOperations
Computation
MemoryOperations
Computation
MemoryOperations
Computation
MemoryOperations
IFetch Loads
NFetch Loads
NFetch Loads
NFetch Loads
…
Comments Optimization parameters
NB: constrained by size of L1 cache MU,NU: constrained by NR KU: constrained by size of I-Cache xFetch: constrained by #of OL MulAdd/Latency: related to hardware parameters
Similar parameters would be used by compilers
parameter
MFlopsMFlops
parameter
Sensitive parameter Insensitive parameter
ATLAS Search
DetectHardware
Parameters
ATLAS SearchEngine
(MMSearch)NR
MulAddLatency
L1SizeATLAS MM
Code Generator(MMCase)
xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
Compile,Execute,Measure
MFLOPS
DetectHardware
ParametersModelNR
MulAddLatency
L1I$Size ATLAS MMCode Generator
(MMCase)xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
L1Size
Original ATLAS Infrastructure
Model-Based ATLAS Infrastructure
High-level picture
Multi-dimensional optimization problem: Independent parameters: NB,MU,NU,KU,… Dependent variable: MFlops Function from parameters to variables is given implicitly; can be evaluated repeatedly
One optimization strategy: orthogonal range search Optimize along one dimension at a time, using reference values for parameters not
yet optimized Not guaranteed to find optimal point, but might come close
Specification of OR Search
Order in which dimensions are optimized Reference values for un-optimized
dimensions at any step Interval in which range search is done for
each dimension
Search strategy
1. Find Best NB
2. Find Best MU & NU
3. Find Best KU
4. Find Best xFetch
5. Find Best Latency (lat)
6. Find non-copy version tile size (NCNB)
Find Best NB
Search in following range 16 <= NB <= 80 NB2 <= L1Size
In this search, use simple estimates for other parameters (eg) KU: Test each candidate for
Full K unrolling (KU = NB) No K unrolling (KU = 1)
Find best MU, NU : try all MU & NU that satisfy
In this step, use best NB from previous step
Find best KU
Find best Latency [1…6] Find best xFetch IFetch: [2,MU+NU], Nfetch:[1,MU+NU-IFetch]
Finding other parameters
1 <= MU,NU <= NB
MU*NU + MU + NU <= NR
Our Models
DetectHardware
Parameters
ATLAS SearchEngine
(MMSearch)NR
MulAddLatency
L1SizeATLAS MM
Code Generator(MMCase)
xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
Compile,Execute,Measure
MFLOPS
DetectHardware
ParametersModelNR
MulAddLatency
L1I$Size ATLAS MMCode Generator
(MMCase)xFetchMulAddLatency
NBMU,NU,KU MiniMMM
Source
L1Size
Original ATLAS Infrastructure
Model-Based ATLAS Infrastructure
Modeling for Optimization Parameters Optimization parameters
NB Hierarchy of Models (later)
MU, NU
KU maximize subject to L1 Instruction Cache
Latency, MulAdd hardware parameters
xFetch set to 2
NRLatencyNUMUNUMU *
Largest NB for no capacity/conflict misses
Tiles are copied into contiguous memory Condition for cold misses only:
3*NB2 <= L1Size
A
k
B
j
k
i
NB
NBNB
NB
Largest NB for no capacity misses
MMM: for (int j = 0; i < N; i++)
for (int i = 0; j < N; j++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j]
Cache model: Fully associative Line size 1 Word Optimal Replacement
Bottom line:N2+N+1<C One full matrix One row / column One element
A
M (I)
K
C
B
N (J)
K
Extending the Model
Line Size > 1 Spatial locality Array layout in memory matters
Bottom line: depending on loop order either or
B
C
B
NB
B
NB
1
2
B
CNB
B
NB
1
2
Extending the Model (cont.) LRU (not optimal replacement) MMM sample: for (int j = 0; i < N; i++)
for (int i = 0; j < N; j++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j]
Bottom line:
jijNBNBijijiCBABABA
,,,,22,,11,
jNBjNBNBNBjNBjNB
jNBNBNBNBNB
jNB
jNB
CBABABA
CAAA
CAAA
CAAA
,,,,22,,11,
,1,12,11,1
,2,22,21,2
,1,12,11,1
B
CNB
B
NB
13
2
B
C
B
NB
B
NB
13
2
B
C
B
NBNB
B
NB
12
2
B
CNB
B
NB
B
NB
12
2
Summary: Modeling for Tile Size (NB) Models of increasing complexity
3*NB2 ≤ C Whole work-set fits in L1
NB2 + NB + 1 ≤ C Fully Associative Optimal Replacement Line Size: 1 word
or
Line Size > 1 word
or
LRU Replacement
B
N
M
A C
NB
NB
K
KB
C
B
NB
B
NB
1
2
B
CNB
B
NB
1
2
B
C
B
NB
B
NB
B
NB
12
2
B
CNB
B
NB
13
2A
M(I)
K
C
B
N (J)
KB
A
M(I)
K
C
B
N (J)
KL
Comments
Lot of work in compiler literature on automatic tile size selection
Not much is known about how well these algorithms do in practice Few comparisons to BLAS
Not obvious how one generalizes our models to more complex codes Insight needed: how sensitive is performance to tile size?
Experiments
Architectures: SGI R12000, 270MHz Sun UltraSPARC III, 900MHz Intel Pentium III, 550MHz
Measure Mini-MMM performance Complete MMM performance Sensitivity of performance to parameter variations
Installation Time of ATLAS vs. Model
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
SGI Atlas SGI Model Sun Atlas Sun Model Intel Atlas Intel Model
tim
e (
s)
Detect Machine Parameters Optimize MMM Generate Final Code Build Library
MiniMMM Performance
SGI ATLAS: 457 MFLOPS Model: 453 MFLOPS Difference: 1%
Sun ATLAS: 1287 MFLOPS Model: 1052 MFLOPS Difference: 20%
Intel ATLAS: 394 MFLOPS Model: 384 MFLOPS Difference: 2%
MMM Performance
00.5
11.5
22.5
33.5
44.5
5
0 1000 2000 3000 4000 5000Matrix Size
TL
B M
isse
s (B
illio
ns)
0
100
200
300
400
500
600
0 1000 2000 3000 4000 5000Martix Size
MF
LO
PS
0200
400600800
100012001400
16001800
0 1000 2000 3000 4000 5000Martix Size
MF
LO
PS
0
100
200
300
400
500
600
0 1000 2000 3000 4000 5000Martix Size
MF
LO
PS
SGI Sun
Intel
BLAS
MODELF77ATLAS
Optimization Parameter Values
NB MU/NU/KU F/I/N-FetchLatency
ATLAS SGI: 64 4/4/64 0/5/1 3 Sun: 48 5/3/48 0/3/5 5 Intel: 40 2/1/40 0/3/1 4
Model SGI: 62 4/4/62 1/2/2 6 Sun: 88 4/4/78 1/2/2 4 Intel: 42 2/1/42 1/2/2 3
Sensitivity to NB and Latency: Sun
0
200
400
600
800
1000
1200
1400
1600
20 40 60 80 100 120 140Tile Size
MF
LO
PS
0
200
400
600
800
1000
1200
1400
1600
1 2 3 4 5 6 7 8 9 10 11 12 13 14Latency
MF
LO
PS
Tile Size (NB) Latency
ATLAS MODELBEST ATLASMODEL BEST
0
100
200
300
400
500
600
20 220 420 620 820
Tile Size
MF
LO
PS
Sensitivity to NB: SGI3*NB2 ≤ C2
NB2 + NB + 1 ≤ C2
ATLASMODEL BEST
Sensitivity to NB: Intel
M
A, B
0
50
100
150
200
250
300
350
400
450
20 70 120 170 220 270 320
Tile Size (B: Best, A: ATLAS, M: Model)
MF
LO
PS
Sensitivity to MU,NU: SGI
12
34
56
78
MU12345678
NU
200
300
400
12
34
56
78
MU
1 2 3 4 5 6 7 8MU
1
2
3
4
5
6
7
8
NU
200300400
1
2
3
4
5
6
7
8
NU
Sensitivity to MU,NU: Sun
12
34
56
78
MU12345678
NU250500750
10001250
12
34
56
78
MU
1 2 3 4 5 6 7 8MU
1
2
3
4
5
6
7
8
NU
2505007501000
12501
2
3
4
5
6
7
8
NU
12
34
56
78
MU12345678
NU
0100200300
12
34
56
78
MU
1 2 3 4 5 6 7 8MU
1
2
3
4
5
6
7
8
NU
0100
2003001
2
3
4
5
6
7
8
NU
Sensitivity to MU,NU: Intel
Shape of register tile matters
0
50
100
150
200
250
300
350
400
450
1 2 3 4 5 6 7 8
MU / NU
MF
LO
PS
MU x 1 1 x NU
Sensitivity to KU
BA
B
A
B A
0
200
400
600
800
1000
1200
1400
1600
0 10 20 30 40 50 60
KU (B: Best, A: ATLAS, M: Model)
MF
LO
PS
SGI Sun Intel
Conclusions
Search is not as important as one might think
Compilers can achieve near-ATLAS performance if they implement well known transformations use models to choose parameter values
There is room for improvement in both models and empirical search Both are 20-25% slower than BLAS Higher levels of memory hierarchy cannot be neglected
Future Directions Study hand-written BLAS codes to understand
performance gap Repeat study with FFTW/SPIRAL
Uses search to choose between algorithms Combine models with search
Use models to speed up empirical search Use empirical studies to enhance models
Feed insights back into compilers How do we make it easier for compiler writers to
implement transformations? Use insights to simplify memory system
Information
URL: http://iss.cs.cornell.edu
Email: pingali@cs.cornell.edu
Sensitivity to Latency: Intel
M
A
0
50
100
150
200
250
300
350
400
450
1 2 3 4 5 6 7 8
Latency (B: Best, A: ATLAS, M: Model)
MF
LO
PS
top related