Post on 05-May-2018
SIAM Parallel Processing 2012
Motivation
Application Performance Characterization:
  ◦ Current approaches
  ◦ Our approach: General Characteristics, Memory Characteristics
Experimental Setup:
  ◦ Benchmarks
  ◦ Tools
Results
Conclusion
Mantevo MiniApps are relatively new
  ◦ Compare to well-known, widely used benchmark suites (e.g., SPEC CPU2006)
  ◦ Compare to the original apps they represent
Low-level detailed characterization
  ◦ Provides insight into performance
  ◦ Reveals optimization opportunities, if available
  ◦ Helps guide and/or validate the development of proxies (miniApps)
  ◦ Gives an idea of suitable platforms for the applications to run on
  ◦ Helps find suitable sets of benchmarks for an experiment
  ◦ …
How is it usually done?
Problems:
  ◦ No standard set of characteristics
  ◦ Most studies use microarchitecture/hardware-dependent characteristics: execution time, CPI, miss rates, etc.
  ◦ Others suggest microarchitecture-independent characteristics: instruction dependence distance, instruction mix, spatial and/or temporal locality information, etc.
  ◦ Only a limited set of characteristics is usually used, due to simulation cost
Our approach
Wide range of low-level detailed characteristics: better ability to explain performance
  ◦ Hardware independent, but ISA dependent
      Dynamic binary instrumentation (DBI) tools such as PIN
      Most characteristics captured in terms of a frequency distribution (histogram)
  ◦ Hardware dependent
      Hardware performance counters
      Validation
More efficient: no simulation
Instruction mix: INT, FP, LD, ST, BR
  ◦ FP: FP, SIMD
  ◦ LD: INT/FP, E_LD: INT/FP
  ◦ ST: INT/FP, E_ST: INT/FP
  ◦ BR: INT/FP
Instruction dependence distance
  ◦ Register-to-use distance histogram
Instr-to-Instr distance histograms: ld-to-ld, fp-to-fp, br-to-br, etc.
Instr-to-Use distance histograms: ld-to-use, fp-to-use, etc.
Instruction size histogram
Registers read per instruction
Registers written per instruction
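As an illustration of the register-to-use distance histogram listed above, here is a minimal Python sketch. It is not the authors' PIN tool. It assumes a hypothetical trace format (one `(dest_regs, src_regs)` pair per dynamic instruction) and assumes the distance is the number of dynamic instructions between the last write of a register and each later read of it; the slides do not spell out the exact definition.

```python
from collections import Counter

def reg_to_use_histogram(trace):
    """Histogram of distances between a register write and its later reads.

    trace: iterable of (dest_regs, src_regs) tuples, one per dynamic
    instruction (hypothetical format, chosen for this sketch).
    """
    last_write = {}   # register name -> index of the instruction that last wrote it
    hist = Counter()  # distance in instructions -> number of occurrences
    for i, (dests, srcs) in enumerate(trace):
        # Reads are processed first so an instruction that reads and writes
        # the same register is measured against the previous writer.
        for reg in srcs:
            if reg in last_write:
                hist[i - last_write[reg]] += 1
        for reg in dests:
            last_write[reg] = i
    return hist

# Tiny hand-made trace (illustrative only).
trace = [
    (("r1",), ()),             # 0: writes r1
    (("r2",), ("r1",)),        # 1: reads r1 (distance 1), writes r2
    (("r3",), ()),             # 2: writes r3
    (("r4",), ("r1", "r2")),   # 3: reads r1 (distance 3) and r2 (distance 2)
]
hist = reg_to_use_histogram(trace)
```

The same loop structure would apply to the ld-to-use or fp-to-use histograms by recording only writes made by loads or FP instructions.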
CPI (cycles per instruction)
Cache miss rates (per 1K instructions)
  ◦ L1, L2, L3, etc.
Branch misprediction rate
Totals (for validation purposes)
  ◦ Total instructions
  ◦ Total loads, stores, FP, and branches
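The counter-derived metrics above are simple ratios. The sketch below shows the arithmetic with made-up counter values; the dictionary keys and the numbers are hypothetical and are not PAPI event names or measured results. The last helper illustrates one plausible way to use the totals for validation, by comparing an instruction count from instrumentation against the hardware counter.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def mpki(misses, instructions):
    """Cache misses per 1K (1000) instructions."""
    return 1000.0 * misses / instructions

def relative_difference(a, b):
    """Relative gap between two counts of the same quantity."""
    return abs(a - b) / max(a, b)

# Hypothetical counter readings, for illustration only.
counters = {"cycles": 3_000_000, "instructions": 2_000_000, "l2_misses": 30_000}
pin_instructions = 1_990_000  # hypothetical total reported by the DBI tool

run_cpi = cpi(counters["cycles"], counters["instructions"])
run_l2_mpki = mpki(counters["l2_misses"], counters["instructions"])
gap = relative_difference(pin_instructions, counters["instructions"])
```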
Characteristics obtained from DBI tools:
Spatial locality histogram
  ◦ Cache line access stride distribution
  ◦ Stride is the minimum stride found between the current access and the last N accesses (N currently set at 32)
  ◦ Max stride is one page (4 KB)
  ◦ 64-byte cache lines assumed
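The stride definition above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' tool: it assumes strides are measured in cache lines, taken as absolute differences, and clamped at one page (64 lines of 64 bytes). The first access has no predecessor and is not counted.

```python
from collections import Counter, deque

LINE_BYTES = 64                          # cache line size assumed in the slides
PAGE_BYTES = 4096                        # max stride: one page
MAX_STRIDE = PAGE_BYTES // LINE_BYTES    # 64 cache lines
WINDOW = 32                              # N: number of previous accesses examined

def stride_histogram(addresses, window=WINDOW):
    """Histogram of the minimum cache-line stride to the last `window` accesses."""
    recent = deque(maxlen=window)  # cache-line numbers of the most recent accesses
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_BYTES
        if recent:
            stride = min(abs(line - prev) for prev in recent)
            hist[min(stride, MAX_STRIDE)] += 1  # clamp at one page (assumption)
        recent.append(line)
    return hist

# Byte addresses touching lines 0, 0, 1, 2 and one far-away line.
hist = stride_histogram([0, 8, 64, 128, 1 << 20])
```

With this reading, a stride of 0 means the same line was touched within the last N accesses, 1 means an adjacent line, and 64 collects everything a page or more away.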
Temporal locality histogram
  ◦ Memory-Reuse-Distance (MRD) histogram
  ◦ MRD is the number of unique memory references between two references to the same cache line
  ◦ Or: MRD is the number of unique cache lines referenced between two references to the same cache line
  ◦ Max distance currently set to cover 6 MB
  ◦ 64-byte cache lines assumed
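The cache-line form of the MRD definition above can be illustrated with an LRU-ordered stack. The sketch below is a simple, unoptimized version written for clarity (each lookup is linear in the number of tracked lines); it is not the authors' implementation. It assumes that lines not seen within the tracked 6 MB are counted separately rather than assigned a distance.

```python
from collections import Counter, OrderedDict

LINE_BYTES = 64
MAX_LINES = 6 * 1024 * 1024 // LINE_BYTES   # 6 MB of 64-byte lines = 98304

def reuse_distance_histogram(addresses, max_lines=MAX_LINES):
    """Return (hist, untracked): hist maps reuse distance -> count; untracked
    counts first touches and reuses beyond the tracked capacity."""
    stack = OrderedDict()   # cache lines ordered from least to most recently used
    hist = Counter()
    untracked = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in stack:
            order = list(stack)
            # Unique lines referenced since the previous reference to `line`
            # are exactly the ones that sit after it in LRU order.
            hist[len(order) - 1 - order.index(line)] += 1
            stack.move_to_end(line)
        else:
            untracked += 1
            stack[line] = None
            if len(stack) > max_lines:
                stack.popitem(last=False)   # drop the least recently used line
    return hist, untracked

# Lines A, B, C, A, B, B expressed as byte addresses.
hist, untracked = reuse_distance_histogram([0, 64, 128, 0, 64, 64])
```

A distance of 0 means the line was re-referenced with no other line in between; binning the distances (for example <=10, <=512, and so on) would give a coarser histogram.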
Characteristics obtained from DBI tools:
Working set size
  ◦ Total unique bytes touched by the application
  ◦ Distribution of unique bytes touched by every 1 billion instructions
Pattern of executed memory instructions
  ◦ Distance defined as the number of instructions between memory ops
Distribution of memory size read/written
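A minimal sketch of the working-set measurement described above, assuming a hypothetical access record of `(instruction_index, address, size_in_bytes)`. It tracks individual bytes in Python sets, which is only practical for small illustrative traces; a real tool would likely track at a coarser granularity.

```python
def working_set(accesses, window_instrs):
    """Return (total_unique_bytes, {window_index: unique_bytes_in_window}).

    accesses: iterable of (instr_index, address, size_bytes) records
    (hypothetical format). window_instrs: window length in instructions;
    the slides use 1 billion.
    """
    total = set()
    per_window = {}
    for instr, addr, size in accesses:
        touched = range(addr, addr + size)   # every byte covered by this access
        total.update(touched)
        per_window.setdefault(instr // window_instrs, set()).update(touched)
    return len(total), {w: len(s) for w, s in per_window.items()}

# Small example with a window of 5 instructions instead of 1 billion.
accesses = [(0, 0, 8), (1, 4, 8), (5, 100, 4), (6, 0, 4)]
total_bytes, by_window = working_set(accesses, window_instrs=5)
```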
Characteristics obtained from hardware performance counters:
Cache miss rates (per 1K instructions)
  ◦ L1, L2, L3, etc.
Mantevo MiniApps
  ◦ Explicit finite element MiniApps: PhdMesh
  ◦ Molecular dynamics MiniApps: MiniMD
  ◦ Implicit finite element MiniApps: HPCCG, pHPCCG, MiniFE
SPEC CPU2006
  ◦ 6 floating-point benchmarks: cactusADM, LBM, Povray, DealII, Leslie3d, Calculix
  ◦ 4 integer benchmarks: Perlbench, Astar, Libquantum, Xalancbmk
Input sizes:
  ◦ Mantevo: adjusted for approximately the same instruction count as SPEC
  ◦ SPEC: reference input
Platform:
  ◦ Experiments run on a Xeon E5504 (Gainestown, based on Nehalem), 45 nm, 4 cores, 256 KB L2 per core, 4 MB L3
Tools:
  ◦ PAPI (papiex): CPI, cache and branch statistics
  ◦ PIN (dynamic binary instrumentation): all general characteristics and some memory characteristics; benchmarks run to completion (~1 day each)
  ◦ PIN + PinPoints + SimPoint: spatial and temporal locality characteristics; simulation points of 1 billion dynamic instructions each, covering 95% of execution; the number of points ranges from 3 to 8, with different weights
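The slides do not say how the per-point locality histograms are merged into a whole-program estimate. One plausible approach, sketched below as an assumption, is a weighted sum in which each simulation point's histogram is scaled by its weight, with the weights renormalized because the selected points cover only part of the execution. The function name and the sample data are hypothetical.

```python
from collections import Counter

def combine_simpoints(points):
    """Weighted combination of per-simulation-point histograms.

    points: list of (weight, histogram) pairs, where histogram maps a bin
    to a count. Weights are renormalized to sum to 1 (assumption).
    """
    total_weight = sum(weight for weight, _ in points)
    combined = Counter()
    for weight, hist in points:
        for bin_label, count in hist.items():
            combined[bin_label] += (weight / total_weight) * count
    return combined

# Two hypothetical simulation points with stride histograms.
points = [
    (0.50, {0: 100, 1: 50}),
    (0.25, {0: 40}),
]
combined = combine_simpoints(points)
```

The result is an average per-point histogram; multiplying it by the number of 1-billion-instruction intervals in the full run would give an estimate of whole-run frequencies.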
[Figure: two charts, % Stall Cycles (axis 0 to 100%) and CPI (axis 0 to 3)]
[Figure: instruction mix as a percentage of instructions (Int Loads, Int Stores, FP Loads, FP Stores, FP Ops, Int Ops, Branches); FP-to-Use distance (# of instructions); FP-to-FP distance (# of instructions)]
[Figure: Instruction Dependence Distance (# of instructions); Basic Block Size (# of instructions); Working Set Size (megabytes)]
[Figure: L1 Misses/1K inst; L2 Misses/1K inst; L3 Misses/1K inst; BR Misprediction Rate; Distance Between Mem Ops (# of instructions); % Mem Ops; LD-to-Use (# of instructions)]
[Figure: Cache Miss Rates (L1, L2, L3) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, and Mantevo Avg; Stride frequency histogram (bins: 0, 1, 64, Other); MemReuse Distance frequency histogram, in # of unique cache lines referenced between two references to the same line (bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
[Figure: Stride frequency histogram (bins: 0, 1, Other); Cache Miss Rates (L1, L2, L3) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, and Mantevo Avg; MemReuse Distance frequency histogram, in # of unique cache lines referenced between two references to the same line (bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
[Figure: Stride frequency histogram (bins: 0, 1, 64, Other); Cache Miss Rates (L1, L2, L3) for Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, and Mantevo Avg; MemReuse Distance frequency histogram, in # of unique cache lines referenced between two references to the same line (bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
[Figure: Stride frequency histograms for Calculix (bins: 0, 1, 2, Other), DealII (bins: 0, 1, 64, Other), and Leslie3D (bins: 0, 1, 15, 16, 64, Other); MemReuse Distance frequency histograms for Calculix, DealII, and Leslie3D, in # of unique cache lines referenced between two references to the same line (bins: 0, <=10, <=512, <=4096, <=65536, >65536)]
• MiniApps exhibit more memory-intensive behavior
  ◦ >100% more misses (L2 & L3) than SPEC!
  ◦ Much larger data working set (500% more)
  ◦ More memory ops per instruction (16% more)
  ◦ Memory ops are closer to each other (2.1 vs. 2.9)
  ◦ More prone to contention for memory resources
• MiniApps have a much shorter dependence distance than SPEC (>100% difference)
  ◦ Suggests more dependence stalls
• MiniApps have much shorter basic blocks than SPEC
• MiniApps experience more stall time (>33%)
  ◦ Greater CPI
  ◦ Due to more cache misses & dependence stalls
Compare performance of MiniApps and real full-size apps:
  ◦ Single node
  ◦ At scale
Obtain memory performance characteristics using full runs instead of simulation points
  ◦ Compare findings to simulation points
Study how sensitive performance is to problem size