
Post on 05-May-2018


SIAM Parallel Processing 2012

• Motivation
• Application Performance Characterization
  ◦ Current approaches
  ◦ Our approach
    ▪ General Characteristics
    ▪ Memory Characteristics
• Experimental Setup
  ◦ Benchmarks
  ◦ Tools
• Results
• Conclusion

• Mantevo MiniApps are relatively new
• Compare to well-known, widely used benchmark suites (e.g., SPEC CPU2006)
• Compare to the original apps they represent
• Low-level detailed characterization
  ◦ Provides insight into performance
  ◦ Reveals optimization opportunities, if available
  ◦ Helps guide and/or validate the development of proxies (miniApps)
  ◦ Gives an idea of suitable platforms for the applications to run on
  ◦ Helps find suitable sets of benchmarks for an experiment
  ◦ …

• How is it usually done?
• Problems:
  ◦ No standard set of characteristics
  ◦ Most studies use microarchitecture/hardware-dependent characteristics
    ▪ Execution time, CPI, miss rates, etc.
  ◦ Others suggest microarchitecture-independent characteristics?
    ▪ Instruction dependence distance, instruction mix
    ▪ Spatial and/or temporal locality information, etc.
  ◦ A limited set of characteristics is usually used due to simulation cost

• Our approach
  ◦ Wide range of low-level detailed characteristics
    ▪ Better ability to explain performance
  ◦ Hardware independent, but ISA dependent
    ▪ Dynamic binary instrumentation (DBI) tools such as PIN
    ▪ Most characteristics captured in terms of a frequency distribution (histogram)
  ◦ Hardware dependent
    ▪ Hardware performance counters
    ▪ Validation
  ◦ More efficient
    ▪ No simulation

• Instruction Mix
  ◦ INT, FP, LD, ST, BR
  ◦ FP: FP, SIMD
  ◦ LD: INT/FP, E_LD: INT/FP
  ◦ ST: INT/FP, E_ST: INT/FP
  ◦ BR: INT/FP
• Instr-dependence distance
  ◦ Register-to-use distance histogram
• Instr-to-Instr distance histograms
  ◦ ld-to-ld, fp-to-fp, br-to-br, etc.
• Instr-to-Use distance histograms
  ◦ ld-to-use, fp-to-use, etc.
• Instruction size histogram
• Registers read per instruction
• Registers written per instruction
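A minimal sketch of how an instruction mix and an instr-to-instr distance histogram (for example ld-to-ld) could be computed from a dynamic instruction trace. The trace format (a list of class labels), the function names, and the definition of distance as the difference in dynamic instruction index are all assumptions made for illustration; the slides do not specify how the PIN tool records these.

```python
from collections import Counter

def instruction_mix(trace):
    """Return the fraction of dynamic instructions in each class."""
    counts = Counter(trace)
    total = len(trace)
    return {cls: n / total for cls, n in counts.items()}

def same_class_distance_histogram(trace, cls):
    """Histogram of distances between consecutive instructions of class `cls`.

    Assumption: distance is the difference in dynamic instruction index,
    so two back-to-back instructions of the same class have distance 1.
    """
    hist = Counter()
    last = None
    for i, c in enumerate(trace):
        if c == cls:
            if last is not None:
                hist[i - last] += 1
            last = i
    return hist

# Hypothetical 10-instruction trace, for illustration only.
trace = ["LD", "INT", "FP", "LD", "FP", "BR", "LD", "ST", "FP", "FP"]
mix = instruction_mix(trace)
ld_to_ld = same_class_distance_histogram(trace, "LD")
fp_to_fp = same_class_distance_histogram(trace, "FP")
```

Keeping each characteristic as a histogram rather than a single average matches the frequency-distribution approach described earlier in the talk.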

• CPI (Cycles per Instruction)
• Cache miss rates (per 1k instructions)
  ◦ L1, L2, L3, etc.
• Branch misprediction rate
• Totals (for validation purposes)
  ◦ Total instructions
  ◦ Total loads, stores, FP, and branches
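The hardware-dependent metrics above are simple ratios of counter totals. The sketch below shows the arithmetic; the function names and the counter values are hypothetical and are not measurements from the talk.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def misses_per_kilo_instruction(misses, instructions):
    """Cache misses per 1k instructions."""
    return 1000.0 * misses / instructions

def misprediction_rate(mispredicted, branches):
    """Fraction of branches that were mispredicted."""
    return mispredicted / branches

# Illustrative counter totals only (not data from the experiments).
example_cpi = cpi(cycles=3_000_000, instructions=2_000_000)
example_l1_mpki = misses_per_kilo_instruction(misses=50_000, instructions=2_000_000)
example_br_rate = misprediction_rate(mispredicted=8_000, branches=400_000)
```

Normalizing misses per 1k instructions, rather than per access, lets cache levels be compared across benchmarks with different instruction mixes.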

• Characteristics obtained from DBI tools
  ◦ Spatial locality histogram
    ▪ Cache line access stride distribution
    ▪ Stride is the minimum stride found between the current access and the last N accesses (N currently set at 32)
    ▪ Max stride is one page (4KB)
    ▪ 64-byte cache lines assumed
  ◦ Temporal locality histogram
    ▪ Memory-Reuse-Distance (MRD) histogram
    ▪ MRD is the # of unique memory references between two references to the same cache line
    ▪ Or: MRD is the # of unique cache lines referenced between two references to the same cache line
    ▪ Max distance currently set to cover 6MB
    ▪ 64-byte cache lines assumed
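A minimal sketch of the two locality histograms defined above, assuming the input is a plain list of byte addresses. Measuring stride in units of cache lines, clamping strides at one page (64 lines), and counting first-time references in a separate "cold" bucket are assumptions made here; the slides give only the definitions, not the tool's implementation. The list-based LRU stack is chosen for clarity, not speed.

```python
from collections import Counter, deque

LINE_BYTES = 64                            # 64-byte cache lines, as on the slide
PAGE_LINES = 4096 // LINE_BYTES            # max stride: one 4KB page = 64 lines
HISTORY = 32                               # N: number of previous accesses compared
MAX_LINES = 6 * 1024 * 1024 // LINE_BYTES  # lines needed to cover 6MB

def stride_histogram(addresses, history=HISTORY):
    """Histogram of the minimum stride (in cache lines) to the last N accesses."""
    recent = deque(maxlen=history)
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_BYTES
        if recent:
            stride = min(abs(line - prev) for prev in recent)
            hist[min(stride, PAGE_LINES)] += 1  # assumption: clamp at one page
        recent.append(line)
    return hist

def reuse_distance_histogram(addresses, max_lines=MAX_LINES):
    """Histogram of unique cache lines touched between two uses of the same line."""
    stack = []  # LRU stack of cache lines, least recently used first
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in stack:
            idx = stack.index(line)
            hist[len(stack) - 1 - idx] += 1  # distinct lines seen since last use
            del stack[idx]
        else:
            hist["cold"] += 1                # assumption: separate bucket for first use
            if len(stack) == max_lines:
                del stack[0]                 # forget lines beyond the tracked window
        stack.append(line)
    return hist

# Hypothetical address stream, for illustration only.
addresses = [0, 8, 64, 128, 0, 4 * 4096]
strides = stride_histogram(addresses)
reuse = reuse_distance_histogram(addresses)
```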

• Characteristics obtained from DBI tools
  ◦ Working set size
    ▪ Total unique bytes touched by the application
    ▪ Distribution of unique bytes touched by every 1 billion instructions
  ◦ Pattern of executed memory instructions
    ▪ Distance defined in number of instructions between memory ops
  ◦ Distribution of memory size read/written
• Characteristics obtained from hardware performance counters:
  ◦ Cache miss rates (per 1k instructions)
    ▪ L1, L2, L3, etc.
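A minimal sketch of the working set measurement described above: total unique bytes touched, plus unique bytes touched within each fixed-size instruction interval. The access-record format `(instruction_index, address, size_in_bytes)` and the function name are hypothetical. Byte-granular sets are used only for clarity and would be far too memory-hungry for a real run with 1-billion-instruction intervals.

```python
def working_set_sizes(accesses, interval):
    """Return (total unique bytes, {interval index: unique bytes in that interval}).

    `accesses` is an iterable of (instruction_index, address, size_in_bytes)
    records; this record format is an assumption made for illustration.
    """
    total = set()
    per_interval = {}
    for instr, addr, size in accesses:
        touched = range(addr, addr + size)
        total.update(touched)
        per_interval.setdefault(instr // interval, set()).update(touched)
    return len(total), {k: len(v) for k, v in per_interval.items()}

# Hypothetical accesses with a tiny interval of 4 instructions
# (the slides use intervals of 1 billion instructions).
accesses = [(0, 100, 8), (1, 104, 8), (5, 100, 4), (6, 200, 4)]
total_bytes, bytes_per_interval = working_set_sizes(accesses, interval=4)
```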

• Mantevo MiniApps
  ◦ Explicit Finite Element MiniApps
    ▪ PhdMesh
  ◦ Molecular Dynamics MiniApps
    ▪ MiniMD
  ◦ Implicit Finite Element MiniApps
    ▪ HPCCG, pHPCCG, MiniFE
• SPEC CPU2006
  ◦ 6 floating-point benchmarks:
    ▪ cactusADM, LBM, Povray, DealII, Leslie3d, Calculix
  ◦ 4 integer benchmarks:
    ▪ Perlbench, Astar, Libquantum, Xalancbmk
• Input sizes:
  ◦ Mantevo: adjusted for approximately the same instruction count as SPEC
  ◦ SPEC: reference input

• Platform:
  ◦ Experiments run on Xeon-E5504, Gainestown (based on Nehalem), 45nm, 4 cores, 256KB L2/core, 4MB L3
• Tools:
  ◦ PAPI (papiex)
    ▪ CPI, cache and branch statistics
  ◦ PIN (Dynamic Binary Instrumentation)
    ▪ All general characteristics
    ▪ Some memory characteristics
    ▪ Benchmarks run to completion (~1 day each)
  ◦ PIN + PinPoints + SimPoint
    ▪ Spatial & temporal locality characteristics
    ▪ Simulation points of size 1 billion dynamic instructions covering 95% of execution
    ▪ # of points ranges from 3 to 8 with different weights
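A minimal sketch of how histograms measured on individual simulation points might be combined into one whole-program estimate using the per-point weights. The slides do not say how the per-point results were merged, so the weight-normalized sum below is an assumption, and the function name, histograms, and weights are hypothetical.

```python
from collections import Counter

def combine_weighted(histograms, weights):
    """Combine per-simulation-point histograms into one weighted histogram.

    Assumption: each point contributes in proportion to its weight, and the
    weights are normalized so they sum to 1.
    """
    total_weight = sum(weights)
    combined = Counter()
    for hist, weight in zip(histograms, weights):
        for bucket, count in hist.items():
            combined[bucket] += count * weight / total_weight
    return combined

# Two hypothetical simulation points with stride histograms and weights.
point_histograms = [{0: 100, 1: 50}, {0: 20, 64: 80}]
point_weights = [0.75, 0.25]
combined = combine_weighted(point_histograms, point_weights)
```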

[Figure: % Stall Cycles (y-axis: %, 0 to 100)]
[Figure: CPI (y-axis: CPI, 0 to 3)]

[Figure: stacked percentage chart (0% to 100%); legend: Branches, Int Ops, FP Ops, FP Stores, FP Loads, Int Stores, Int Loads]
[Figure: FP-to-Use (y-axis: # of instructions)]
[Figure: FP-to-FP (y-axis: # of instructions)]

[Figure: Instruction Dependence Distance (y-axis: # of instructions)]
[Figure: Basic Block Size (y-axis: # of instructions)]
[Figure: Working Set Size (y-axis: Megabytes)]

[Figure: L1 Misses/1K inst]
[Figure: L2 Misses/1K inst]
[Figure: BR Misprediction Rate (y-axis: %)]
[Figure: L3 Misses/1K inst]
[Figure: Distance Between Mem Ops (y-axis: # of instructions)]
[Figure: % Mem Ops (y-axis: %)]
[Figure: LD-to-Use (y-axis: # of instructions)]

[Figure: Cache Miss Rates (L1, L2, L3; y-axis: %) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg]
[Figure: Stride histogram (y-axis: Frequency; bins: 0, 1, 64, Other)]
[Figure: MemReuse Distance histogram (y-axis: Frequency; x-axis: # unique cache lines referenced between two references to the same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]

[Figure: Cache Miss Rates (L1, L2, L3; y-axis: %) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg]
[Figure: Stride histogram (y-axis: Frequency; bins: 0, 1, Other)]
[Figure: MemReuse Distance histogram (y-axis: Frequency; x-axis: # unique cache lines referenced between two references to the same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]

[Figure: Cache Miss Rates (L1, L2, L3; y-axis: %) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg]
[Figure: Stride histogram (y-axis: Frequency; bins: 0, 1, 64, Other)]
[Figure: MemReuse Distance histogram (y-axis: Frequency; x-axis: # unique cache lines referenced between two references to the same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]

[Figure: Stride (Calculix) (y-axis: Frequency; bins: 0, 1, 2, Other)]
[Figure: Stride (DealII) (y-axis: Frequency; bins: 0, 1, 64, Other)]
[Figure: Stride (Leslie3D) (y-axis: Frequency; bins: 0, 1, 15, 16, 64, Other)]
[Figure: MemReuse Distance (Calculix) (y-axis: Frequency; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
[Figure: MemReuse Distance (DealII) (y-axis: Frequency; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
[Figure: MemReuse Distance (Leslie3D) (y-axis: Frequency; x-axis: # unique cache lines referenced between two references to the same line; bins: 0, <=10, <=512, <=4096, <=65536, >65536)]

• MiniApps exhibit more memory-intensive behavior
  ◦ >100% more misses (L2 & L3) than SPEC!
  ◦ Much larger data working set (500% more)
  ◦ More memory ops per instruction (16% more)
  ◦ Memory ops are closer to each other (2.1 vs. 2.9 instructions apart)
• More prone to contention for memory resources
• MiniApps have a much shorter dependence distance than SPEC (>100% difference)
• Suggests more dependence stalls
• MiniApps have much shorter basic blocks than SPEC
• MiniApps experience more stall time (>33%)
• Greater CPI
• Due to more cache misses & dependence stalls

• Compare performance of MiniApps and real full-size apps:
  ◦ Single node
  ◦ At scale
• Obtain memory performance characteristics using full runs instead of simulation points
  ◦ Compare findings to simulation points
• Determine how sensitive performance is to problem size