Fine-grained Benchmark Subsetting for System Selection
Pablo de Oliveira Castro, Y. Kashnikov, C. Akel, M. Popov, W. Jalby
University of Versailles – Exascale Computing Research
CGO February 2014
Motivation
- Find the system with the best performance on a set of applications?
- Reduce the cost of benchmarking
[Figure: choosing among candidate systems (Sandy Bridge, Atom, Core2, Nehalem) for a set of applications (BT, LU, FT, MG).]
Key Idea
- Applications have redundancies
  - Similar code called multiple times
  - Similar code used in different applications
- Detect redundancies and keep only one representative
Previous Approaches
Remove similar applications (BT ≈ SP)
[Joshi, Phansalkar, Eeckhout]

Remove similar instruction blocks
[SimPoint: Sherwood, Perelman, Calder]
What can be improved?
Application subsetting (keep BT from BT ≈ SP)
- Coarse grained: less similarity, less accuracy

Instruction block subsetting
- Not portable, requires a simulator
- Cannot evaluate compilers
Source Code Subsetting
- Subset fine-grained source code fragments
  - Fine grained
  - Can be recompiled and executed on multiple architectures
- Codelets
Our Approach
Step A: Detect codelets
- Codelet Finder (CAPS) extracts codelets from the applications (BT, SP, ...)

Step B: Build profile on a reference system
- Static & dynamic profiling with Maqao and Likwid: vectorization ratio, FLOPS, cache misses, ...
Our Approach
Step C: Cluster similar codelets

Step D: Extract representative set (Codelet Finder)

Step E: Benchmark representatives
- Run the representatives on the target systems (Sandy Bridge, Atom, Core2); a model predicts the results of the full benchmarks
Breaking the Application into Codelets
- Codelet: source code fragment
  - Functions: too big, mix different computation patterns
  - Inner loops: too small, hard to warm up and to measure
  - Outer loops (sweet spot)
- Capture most of the performance in HPC applications
[Bar chart: % of execution time captured by codelets on the NAS Serial Benchmarks (BT, CG, FT, IS, LU, MG, SP); per-benchmark coverage ranges from 75.4% to 99.2%, with an average of 92.3%.]
Profiling and Clustering
- Automatically group similar codelets
- Profile codelets on a reference system
  - Memory/cache bandwidth, instruction mix, vectorization, ...
- Cluster codelets using feature distance
- We expect that:
  - Clusters capture similar computation patterns
  - Clusters react similarly to architecture change
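The clustering step can be sketched with SciPy's hierarchical clustering. The feature values below are invented for illustration; in the real pipeline they come from the Maqao/Likwid profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy feature vectors for four codelets (hypothetical values);
# columns could stand for e.g. FLOPS, L2 bandwidth, vectorization ratio.
features = {
    "tridag_1":  [0.2, 5.1, 0.0],
    "tridag_2":  [0.2, 5.0, 0.0],
    "matadd_16": [1.8, 9.7, 0.9],
    "svdcmp_6":  [1.9, 9.5, 0.9],
}
names = list(features)
X = np.array([features[n] for n in names])

# Normalize each feature so no single unit dominates the distance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Hierarchical (Ward) clustering, cut into K = 2 clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, lab in zip(names, labels):
    print(name, lab)
```

The two `tridag` codelets end up in one cluster and the two dense matrix codelets in the other, mirroring how similar computation patterns group together in the dendrograms below.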
Clustering NR Codelets
[Dendrogram: hierarchical clustering of Numerical Recipes codelets (toeplz_1, rstrct_29, mprove_8, toeplz_4, realft_4, toeplz_3, svbksb_3, lop_13, toeplz_2, four1_2, tridag_2, tridag_1, ludcmp_4, hqr_15, relax2_26, svdcmp_14, svdcmp_13, hqr_13, hqr_12_sq, jacobi_5, hqr_12, svdcmp_11, elmhes_11, mprove_9, matadd_16, svdcmp_6, elmhes_10, balanc_3), cut for K = 14. Each codelet is annotated with its computation pattern, e.g. "DP: 2 simultaneous reductions", "DP: FFT butterfly computation", "MP: Dense Matrix x vector product", "DP: First order recurrence", "SP: Sum of a square matrix".]
Clustering NR Codelets
[Same dendrogram, highlighting that neighboring codelets share similar computation patterns: one group of reduction sums (e.g. "DP: Sum of the absolute values of a matrix column", "SP: Sum of a square matrix", "SP: Sum of the upper/lower half of a square matrix") and one group of long latency operations (e.g. "DP: Vector divide element wise", "DP: Norm + Vector divide").]
Capturing Architecture Change
[Plot: per-invocation execution time (ms) of four codelets (LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275) on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on two targets, Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC). Cluster A groups triple-nested high latency operations (div and exp); Cluster B groups stencils on five planes (memory bound).]
Same Cluster = Same Speedup
[Same plot as the previous slide: within each cluster (A: triple-nested high latency operations; B: memory-bound stencil on five planes), the codelets slow down or speed up by similar factors when moving from the reference to each target.]
Representative Selection
Choose central codelet as representative
[Scatter plot in feature space (L2 bandwidth, FLOPS, vectorization): the central codelet of each cluster is chosen as its representative.]

- Prediction model: codelets from the same cluster have the same speedup when changing architectures
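The prediction model can be sketched with made-up timings: the speedup measured for the representative on a target is applied to every other codelet of its cluster.

```python
# Sketch of the prediction model: every codelet in a cluster is assumed
# to share its representative's speedup across an architecture change.
# All timings below are illustrative.

ref_time = {"rep": 100.0, "member": 40.0}   # ms on the reference system
target_time_rep = 50.0                      # representative measured on target

speedup = ref_time["rep"] / target_time_rep        # 2.0x faster on target
predicted_member = ref_time["member"] / speedup    # member never run on target

print(predicted_member)  # 20.0 ms predicted on the target
```

Only the representative is ever benchmarked on the target; the rest of the cluster is predicted for free, which is where the benchmarking reduction comes from.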
Representative Extraction: Codelet Finder
- Extract representatives as standalone microbenchmarks
- Can be recompiled and run outside of the original application

C or Fortran Application:

    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        a[i][j] += b[i]*c[j];
      }
    }

At the codelet entry, the memory state is captured into a core dump. The extracted codelet is the same loop nest plus the captured memory state, replayable standalone.
Is Source-code Isolation Viable for Performance Characterization? PSTI’13
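The capture-and-replay idea can be sketched in Python; Codelet Finder itself works at the C/Fortran level with a core dump, so the file name, helpers, and `np.outer` formulation here are only illustrative.

```python
import time
import numpy as np

def capture(path, **arrays):
    # Dump the codelet's input memory state to disk (a stand-in for the
    # core dump taken at the loop entry).
    np.savez(path, **arrays)

def replay(path):
    # Reload the captured state and run the extracted loop standalone.
    state = np.load(path)
    a, b, c = state["a"], state["b"], state["c"]
    t0 = time.perf_counter()
    a += np.outer(b, c)  # the codelet: a[i][j] += b[i]*c[j]
    return a, time.perf_counter() - t0

n = 64
capture("codelet_state.npz",
        a=np.zeros((n, n)), b=np.arange(n, dtype=float), c=np.ones(n))
a, elapsed = replay("codelet_state.npz")
```

Because the replayed loop sees the same memory state as in the original run, its timing can stand in for the in-application codelet on any machine where it is recompiled.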
Validation
- Trained and selected the feature set on Numerical Recipes + Atom + Sandy Bridge
- Validated the approach on NAS Serial and a new architecture, Core 2
                 Reference    Target
                 Nehalem      Atom      Core 2    Sandy Bridge
CPU              L5609        D510      E7500     E31240
Frequency (GHz)  1.86         1.66      2.93      3.30
Cores            4            2         2         4
L1 cache (KB)    4×64         2×56      2×64      4×64
L2 cache (KB)    4×256        2×512     3 MB      4×256
L3 cache (MB)    12           -         -         8
RAM (GB)         8            4         4         6

Table: Test architectures.
NAS results
[Plot: per-invocation execution time (ms) of the representative codelets of bt, cg, ft, is, lu, mg, and sp on the Reference (Nehalem), with Sandy Bridge real vs. Sandy Bridge predicted times.]
- 18 representatives
- 23 times faster benchmarking
- 5.8% median error
Tradeoff Reduction / Accuracy (NAS)
[Plot: median % error (left axis) and benchmarking reduction factor (right axis) vs. number of clusters, per target. Chosen operating points: Atom 8% error at ×44 reduction; Core 2 3.9% at ×25; Sandy Bridge 5.8% at ×23.]
- More clusters:
  - ↗ accuracy
  - ↗ benchmarking cost
- Automatically select a good tradeoff using the Elbow method
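A minimal sketch of the Elbow criterion: keep adding clusters until the relative drop in clustering error stalls. The error values and the 10% threshold below are illustrative; the real implementation works on within-cluster distances.

```python
def elbow(errors, threshold=0.1):
    """errors: clustering error for K = 1, 2, 3, ... clusters (decreasing).
    Return the smallest K after which the relative drop in error stalls."""
    for i in range(1, len(errors)):
        drop = (errors[i - 1] - errors[i]) / errors[i - 1]
        if drop < threshold:
            return i  # K = i clusters: the (i+1)-th cluster barely helped
    return len(errors)

# Hypothetical error curve: big gains up to K = 3, then a plateau.
print(elbow([100.0, 55.0, 30.0, 28.5, 28.0]))  # 3
```

This is how the number of clusters (and hence the number of representatives to benchmark) is picked automatically rather than by hand.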
Overall results (NAS)
[Plots: per-benchmark execution time (s) of bt, cg, ft, is, lu, mg, and sp on Atom, Core 2, and Sandy Bridge (Reference, Real, Predicted), and geometric mean speedups over the reference (real / predicted): Atom 0.15 / 0.19, Core 2 0.97 / 1.00, Sandy Bridge 1.98 / 1.89.]
- Accurately evaluate architectures
- Choose the best architecture-benchmark pairs
Conclusion
- Take advantage of source loop redundancies to reduce benchmarking time
- Generate portable compressed benchmarks
  - Accurate (< 10%) and faster (> ×23)
- Applications
  - System selection (this work)
  - Fast compiler performance regression tests
  - Iterative compilation
- http://benchmark-subsetting.github.io/fgbs/
  - Data and analysis code available as a reproducible IPython notebook

Thanks for your attention!
Feature Selection
- Genetic Algorithm: find the best set of features on Numerical Recipes + Atom + Sandy Bridge
- The feature set is still among the best on NAS

[Plot: median % error vs. number of clusters on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.]
Reduction Factor Breakdown
               Total   Reduced invocations   Clustering
Atom           44.3    ×12                   ×3.7
Core 2         24.7    ×8.7                  ×2.8
Sandy Bridge   22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
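Up to rounding, each total in the table is the product of its two columns (e.g. for Atom, ×12 from running fewer invocations times ×3.7 from clustering gives ≈ ×44.3):

```python
# Check that the total reduction factors out as
# (reduced invocations) x (clustering), up to rounding in the table.
breakdown = {                       # (reduced invocations, clustering)
    "Atom":         (12.0, 3.7),
    "Core 2":       (8.7, 2.8),
    "Sandy Bridge": (6.3, 3.6),
}
totals = {m: inv * clu for m, (inv, clu) in breakdown.items()}
print(totals)
```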
Same working set?
- NAS: regular codes
  - Only 19% of codelets have different behavior across invocations
  - Detect ill-behaved codelets; exclude them from the representatives
- SPEC: different working set per invocation
  - Ongoing: cluster codelets across working sets
Across Applications Similarities
[Plot: median % error vs. number of clusters on Atom, Core 2, and Sandy Bridge, comparing subsetting across applications against per-application subsetting.]
Profiling Features
Codelets (e.g. from SP) are profiled with two tools:
- Likwid: performance counters per codelet, giving 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
- Maqao: static disassembly and analysis, giving 8 static features: bytes stored / cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio ADD+SUB/MUL, vectorization (FP/FP+INT/INT)
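Assembling the 4 dynamic and 8 static measurements into one fixed-order, 12-entry vector per codelet might look like this; the key names and values are hypothetical stand-ins for the Likwid/Maqao outputs:

```python
# Hypothetical feature names, in a fixed order so that all codelet
# vectors are directly comparable by the clustering distance.
DYNAMIC = ["flops", "l2_bandwidth", "l3_miss_rate", "mem_bandwidth"]
STATIC = ["bytes_stored_per_cycle", "stalls", "estimated_ipc",
          "n_div", "n_sd", "pressure_p1", "add_sub_mul_ratio",
          "vectorization"]

def feature_vector(dynamic, static):
    """Concatenate dynamic (Likwid) and static (Maqao) profiles."""
    return [dynamic[k] for k in DYNAMIC] + [static[k] for k in STATIC]

vec = feature_vector(
    dynamic={"flops": 1.2, "l2_bandwidth": 4.0,
             "l3_miss_rate": 0.01, "mem_bandwidth": 2.5},
    static={"bytes_stored_per_cycle": 0.8, "stalls": 0.3,
            "estimated_ipc": 1.5, "n_div": 0, "n_sd": 2,
            "pressure_p1": 0.4, "add_sub_mul_ratio": 1.0,
            "vectorization": 0.9},
)
```

These vectors are what the feature-distance clustering of Step C consumes.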