Fine-grained Benchmark Subsetting for System Selection
Pablo de Oliveira Castro, Y. Kashnikov, C. Akel, M. Popov, W. Jalby
University of Versailles – Exascale Computing Research
CGO February 2014
Motivation
- Find the system with the best performance on a set of applications?
- Reduce the cost of benchmarking
[Figure: choosing among candidate systems (Sandy Bridge, Atom, Core2, Nehalem) for a set of applications (BT, LU, FT, MG).]
Key Idea
- Applications have redundancies
  - Similar code called multiple times
  - Similar code used in different applications
- Detect redundancies and keep only one representative
Previous Approaches
Remove similar applications (BT ≈ SP)
[Joshi, Phansalkar, Eeckhout]

Remove similar instruction blocks
[SimPoint: Sherwood, Perelman, Calder]
What can be improved?
Application subsetting (keep BT from BT ≈ SP)
- Coarse grained: less similarity, less accuracy

Instruction block subsetting
- Not portable, requires a simulator
- Cannot evaluate compilers
Source Code Subsetting
- Subset fine-grained source code fragments
  - Fine grained
  - Can be recompiled and executed on multiple architectures
- Codelets
Our Approach
Step A: Detect codelets
- Codelet Finder (CAPS) extracts codelets from the applications (BT, SP, ...)

Step B: Build profile on a reference system
- Static & dynamic profiling with Maqao and Likwid: vectorization ratio, FLOPS, cache misses, ...
Our Approach
Step C: Cluster similar codelets

Step D: Extract representative set (Codelet Finder)

Step E: Benchmark representatives
- Run the representatives on the target systems (Sandy Bridge, Atom, Core2); a model predicts the results of the full benchmarks
Breaking the Application into Codelets
- Codelet: source code fragment
  - Functions: too big, mix different computation patterns
  - Inner loops: too small, hard to warm up and to measure
  - Outer loops (sweet spot)
- Capture most of the performance in HPC applications
[Bar chart: % of execution time captured by codelets on the NAS Serial Benchmarks (BT, CG, FT, IS, LU, MG, SP); per-benchmark coverage ranges from 75.4% to 99.2%, with an average of 92.3%.]
Profiling and Clustering
- Automatically group similar codelets
- Profile codelets on a reference system
  - Memory/cache bandwidth, instruction mix, vectorization, ...
- Cluster codelets using feature distance
- We expect that:
  - Clusters capture similar computation patterns
  - Clusters react similarly to architecture change
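The clustering step can be sketched with SciPy's hierarchical clustering. The feature values below are invented for illustration; in the real pipeline they come from the Maqao/Likwid profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy feature vectors for four codelets (hypothetical values);
# columns could stand for e.g. FLOPS, L2 bandwidth, vectorization ratio.
features = {
    "tridag_1":  [0.2, 5.1, 0.0],
    "tridag_2":  [0.2, 5.0, 0.0],
    "matadd_16": [1.8, 9.7, 0.9],
    "svdcmp_6":  [1.9, 9.5, 0.9],
}
names = list(features)
X = np.array([features[n] for n in names])

# Normalize each feature so no single unit dominates the distance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Hierarchical (Ward) clustering, cut into K = 2 clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, lab in zip(names, labels):
    print(name, lab)
```

The two `tridag` codelets end up in one cluster and the two dense matrix codelets in the other, mirroring how similar computation patterns group together in the dendrograms below.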
Clustering NR Codelets
[Dendrogram: hierarchical clustering of Numerical Recipes codelets (toeplz_1, rstrct_29, mprove_8, toeplz_4, realft_4, toeplz_3, svbksb_3, lop_13, toeplz_2, four1_2, tridag_2, tridag_1, ludcmp_4, hqr_15, relax2_26, svdcmp_14, svdcmp_13, hqr_13, hqr_12_sq, jacobi_5, hqr_12, svdcmp_11, elmhes_11, mprove_9, matadd_16, svdcmp_6, elmhes_10, balanc_3), cut for K = 14. Each codelet is annotated with its computation pattern, e.g. "DP: 2 simultaneous reductions", "DP: FFT butterfly computation", "MP: Dense Matrix x vector product", "DP: First order recurrence", "SP: Sum of a square matrix".]
Clustering NR Codelets
[Same dendrogram, highlighting that neighboring codelets share similar computation patterns: one group of reduction sums (e.g. "DP: Sum of the absolute values of a matrix column", "SP: Sum of a square matrix", "SP: Sum of the upper/lower half of a square matrix") and one group of long latency operations (e.g. "DP: Vector divide element wise", "DP: Norm + Vector divide").]
Capturing Architecture Change
[Plot: per-invocation execution time (ms) of four codelets (LU/erhs.f:49, FT/appft.f:45, BT/rhs.f:266, SP/rhs.f:275) on the reference, Nehalem (1.86 GHz, 12 MB LLC), and on two targets, Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC). Cluster A groups triple-nested high latency operations (div and exp); Cluster B groups stencils on five planes (memory bound).]
Same Cluster = Same Speedup
[Same plot as the previous slide: within each cluster (A: triple-nested high latency operations; B: memory-bound stencil on five planes), the codelets slow down or speed up by similar factors when moving from the reference to each target.]
Representative Selection
Choose central codelet as representative
[Scatter plot in feature space (L2 bandwidth, FLOPS, vectorization): the central codelet of each cluster is chosen as its representative.]

- Prediction model: codelets from the same cluster have the same speedup when changing architectures
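The prediction model can be sketched with made-up timings: the speedup measured for the representative on a target is applied to every other codelet of its cluster.

```python
# Sketch of the prediction model: every codelet in a cluster is assumed
# to share its representative's speedup across an architecture change.
# All timings below are illustrative.

ref_time = {"rep": 100.0, "member": 40.0}   # ms on the reference system
target_time_rep = 50.0                      # representative measured on target

speedup = ref_time["rep"] / target_time_rep        # 2.0x faster on target
predicted_member = ref_time["member"] / speedup    # member never run on target

print(predicted_member)  # 20.0 ms predicted on the target
```

Only the representative is ever benchmarked on the target; the rest of the cluster is predicted for free, which is where the benchmarking reduction comes from.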
Representative Extraction: Codelet Finder
- Extract representatives as standalone microbenchmarks
- Can be recompiled and run outside of the original application

C or Fortran Application:

    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        a[i][j] += b[i]*c[j];
      }
    }

At the codelet entry, the memory state is captured into a core dump. The extracted codelet is the same loop nest plus the captured memory state, replayable standalone.
Is Source-code Isolation Viable for Performance Characterization? PSTI’13
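The capture-and-replay idea can be sketched in Python; Codelet Finder itself works at the C/Fortran level with a core dump, so the file name, helpers, and `np.outer` formulation here are only illustrative.

```python
import time
import numpy as np

def capture(path, **arrays):
    # Dump the codelet's input memory state to disk (a stand-in for the
    # core dump taken at the loop entry).
    np.savez(path, **arrays)

def replay(path):
    # Reload the captured state and run the extracted loop standalone.
    state = np.load(path)
    a, b, c = state["a"], state["b"], state["c"]
    t0 = time.perf_counter()
    a += np.outer(b, c)  # the codelet: a[i][j] += b[i]*c[j]
    return a, time.perf_counter() - t0

n = 64
capture("codelet_state.npz",
        a=np.zeros((n, n)), b=np.arange(n, dtype=float), c=np.ones(n))
a, elapsed = replay("codelet_state.npz")
```

Because the replayed loop sees the same memory state as in the original run, its timing can stand in for the in-application codelet on any machine where it is recompiled.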
Validation
- Trained and selected the feature set on Numerical Recipes + Atom + Sandy Bridge
- Validated the approach on NAS Serial and a new architecture, Core 2
                 Reference    Target
                 Nehalem      Atom      Core 2    Sandy Bridge
CPU              L5609        D510      E7500     E31240
Frequency (GHz)  1.86         1.66      2.93      3.30
Cores            4            2         2         4
L1 cache (KB)    4×64         2×56      2×64      4×64
L2 cache (KB)    4×256        2×512     3 MB      4×256
L3 cache (MB)    12           -         -         8
RAM (GB)         8            4         4         6

Table: Test architectures.
NAS results
[Plot: per-invocation execution time (ms) of the representative codelets of bt, cg, ft, is, lu, mg, and sp on the Reference (Nehalem), with Sandy Bridge real vs. Sandy Bridge predicted times.]
- 18 representatives
- 23 times faster benchmarking
- 5.8% median error
Tradeoff Reduction / Accuracy (NAS)
[Plot: median % error (left axis) and benchmarking reduction factor (right axis) vs. number of clusters, per target. Chosen operating points: Atom 8% error at ×44 reduction; Core 2 3.9% at ×25; Sandy Bridge 5.8% at ×23.]
- More clusters:
  - ↗ accuracy
  - ↗ benchmarking cost
- Automatically select a good tradeoff using the Elbow method
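A minimal sketch of the Elbow criterion: keep adding clusters until the relative drop in clustering error stalls. The error values and the 10% threshold below are illustrative; the real implementation works on within-cluster distances.

```python
def elbow(errors, threshold=0.1):
    """errors: clustering error for K = 1, 2, 3, ... clusters (decreasing).
    Return the smallest K after which the relative drop in error stalls."""
    for i in range(1, len(errors)):
        drop = (errors[i - 1] - errors[i]) / errors[i - 1]
        if drop < threshold:
            return i  # K = i clusters: the (i+1)-th cluster barely helped
    return len(errors)

# Hypothetical error curve: big gains up to K = 3, then a plateau.
print(elbow([100.0, 55.0, 30.0, 28.5, 28.0]))  # 3
```

This is how the number of clusters (and hence the number of representatives to benchmark) is picked automatically rather than by hand.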
Overall results (NAS)
[Plots: per-benchmark execution time (s) of bt, cg, ft, is, lu, mg, and sp on Atom, Core 2, and Sandy Bridge (Reference, Real, Predicted), and geometric mean speedups over the reference (real / predicted): Atom 0.15 / 0.19, Core 2 0.97 / 1.00, Sandy Bridge 1.98 / 1.89.]
- Accurately evaluate architectures
- Choose the best architecture-benchmark pairs
Conclusion
- Take advantage of source loop redundancies to reduce benchmarking time
- Generate portable compressed benchmarks
  - Accurate (< 10%) and faster (> ×23)
- Applications
  - System selection (this work)
  - Fast compiler performance regression tests
  - Iterative compilation
- http://benchmark-subsetting.github.io/fgbs/
  - Data and analysis code available as a reproducible IPython notebook

Thanks for your attention!
Feature Selection
- Genetic Algorithm: find the best set of features on Numerical Recipes + Atom + Sandy Bridge
- The feature set is still among the best on NAS

[Plot: median % error vs. number of clusters on Atom, Core 2, and Sandy Bridge, comparing the GA-selected features against the worst, median, and best feature sets.]
Reduction Factor Breakdown
               Total   Reduced invocations   Clustering
Atom           44.3    ×12                   ×3.7
Core 2         24.7    ×8.7                  ×2.8
Sandy Bridge   22.5    ×6.3                  ×3.6

Table: Benchmarking reduction factor breakdown with 18 representatives.
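Up to rounding, each total in the table is the product of its two columns (e.g. for Atom, ×12 from running fewer invocations times ×3.7 from clustering gives ≈ ×44.3):

```python
# Check that the total reduction factors out as
# (reduced invocations) x (clustering), up to rounding in the table.
breakdown = {                       # (reduced invocations, clustering)
    "Atom":         (12.0, 3.7),
    "Core 2":       (8.7, 2.8),
    "Sandy Bridge": (6.3, 3.6),
}
totals = {m: inv * clu for m, (inv, clu) in breakdown.items()}
print(totals)
```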
Same working set?
- NAS: regular codes
  - Only 19% of codelets have different behavior across invocations
  - Detect ill-behaved codelets; exclude them from the representatives
- SPEC: different working set per invocation
  - Ongoing: cluster codelets across working sets
Across Applications Similarities
[Plot: median % error vs. number of clusters on Atom, Core 2, and Sandy Bridge, comparing subsetting across applications against per-application subsetting.]
Profiling Features
Codelets (e.g. from SP) are profiled with two tools:
- Likwid: performance counters per codelet, giving 4 dynamic features: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth
- Maqao: static disassembly and analysis, giving 8 static features: bytes stored / cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio ADD+SUB/MUL, vectorization (FP/FP+INT/INT)
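Assembling the 4 dynamic and 8 static measurements into one fixed-order, 12-entry vector per codelet might look like this; the key names and values are hypothetical stand-ins for the Likwid/Maqao outputs:

```python
# Hypothetical feature names, in a fixed order so that all codelet
# vectors are directly comparable by the clustering distance.
DYNAMIC = ["flops", "l2_bandwidth", "l3_miss_rate", "mem_bandwidth"]
STATIC = ["bytes_stored_per_cycle", "stalls", "estimated_ipc",
          "n_div", "n_sd", "pressure_p1", "add_sub_mul_ratio",
          "vectorization"]

def feature_vector(dynamic, static):
    """Concatenate dynamic (Likwid) and static (Maqao) profiles."""
    return [dynamic[k] for k in DYNAMIC] + [static[k] for k in STATIC]

vec = feature_vector(
    dynamic={"flops": 1.2, "l2_bandwidth": 4.0,
             "l3_miss_rate": 0.01, "mem_bandwidth": 2.5},
    static={"bytes_stored_per_cycle": 0.8, "stalls": 0.3,
            "estimated_ipc": 1.5, "n_div": 0, "n_sd": 2,
            "pressure_p1": 0.4, "add_sub_mul_ratio": 1.0,
            "vectorization": 0.9},
)
```

These vectors are what the feature-distance clustering of Step C consumes.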