general characteristics memory characteristics ... · general characteristics memory...

SIAM Parallel Processing 2012

• Motivation
• Application Performance Characterization
  ◦ Current approaches
  ◦ Our approach: general characteristics, memory characteristics
• Experimental Setup
  ◦ Benchmarks
  ◦ Tools
• Results
• Conclusion

• Mantevo MiniApps are relatively new
  ◦ Compare them to well-known, widely used benchmark suites (e.g., SPEC CPU2006)
  ◦ Compare them to the original apps they represent
• Low-level detailed characterization
  ◦ Provides insight into performance
  ◦ Reveals optimization opportunities, if available
  ◦ Helps guide and/or validate the development of proxies (miniApps)
  ◦ Gives an idea of suitable platforms for the applications to run on
  ◦ Helps find suitable sets of benchmarks for an experiment
  ◦ …

How is it usually done?

• Problems:
  ◦ No standard set of characteristics
  ◦ Most studies use microarchitecture/hardware-dependent characteristics: execution time, CPI, miss rates, etc.
  ◦ Others suggest microarchitecture-independent characteristics: instruction dependence distance, instruction mix, spatial and/or temporal locality information, etc.
  ◦ Only a limited set of characteristics is usually used, due to simulation cost

Our approach

• Wide range of low-level detailed characteristics: better ability to explain performance
• Hardware independent, but ISA dependent
  ◦ Dynamic binary instrumentation (DBI) tools such as PIN
  ◦ Most characteristics captured in terms of a frequency distribution (histogram)
• Hardware dependent
  ◦ Hardware performance counters
  ◦ Validation
• More efficient: no simulation

Instruction Mix

• INT, FP, LD, ST, BR
  ◦ FP: FP, SIMD
  ◦ LD: INT / FP; E_LD: INT / FP
  ◦ ST: INT / FP; E_ST: INT / FP
  ◦ BR: INT / FP

• Instruction dependence distance
  ◦ Register-to-use distance histogram
• Instruction-to-instruction distance histograms
  ◦ ld-to-ld, fp-to-fp, br-to-br, etc.
• Instruction-to-use distance histograms
  ◦ ld-to-use, fp-to-use, etc.
• Instruction size histogram
• Registers read per instruction
• Registers written per instruction
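A minimal sketch of how a register-to-use distance histogram could be built from an instruction trace. The trace format (one pair of register tuples per dynamic instruction) and the function name are assumptions made for illustration; the slides do not describe the internals of the PIN tool. Distance is counted here in dynamic instructions between the write of a register and each later read of it.

```python
from collections import Counter

def register_to_use_histogram(trace):
    """trace: iterable of (regs_read, regs_written), one entry per dynamic instruction."""
    last_writer = {}   # register name -> index of the instruction that last wrote it
    hist = Counter()   # distance in instructions -> number of occurrences
    for idx, (reads, writes) in enumerate(trace):
        for reg in reads:
            if reg in last_writer:
                hist[idx - last_writer[reg]] += 1
        # Record writes after reads so an instruction that reads and writes
        # the same register is measured against the previous producer.
        for reg in writes:
            last_writer[reg] = idx
    return hist

# Hypothetical five-instruction trace.
trace = [
    ((), ("r1",)),             # 0: r1 = ...
    ((), ("r2",)),             # 1: r2 = ...
    (("r1", "r2"), ("r3",)),   # 2: r3 = r1 + r2
    (("r3",), ("r1",)),        # 3: r1 = f(r3)
    (("r1",), ()),             # 4: use r1
]
hist = register_to_use_histogram(trace)
```

The same loop structure would cover the ld-to-use or fp-to-use variants by recording only producers of the relevant instruction class.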

• CPI (cycles per instruction)
• Cache miss rates (per 1k instructions)
  ◦ L1, L2, L3, etc.
• Branch misprediction rate
• Totals (for validation purposes)
  ◦ Total instructions
  ◦ Total loads, stores, FP, and branches
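The hardware-dependent metrics above are simple ratios of counter values. The sketch below shows the arithmetic; the function names and the sample counter values are hypothetical, and defining the misprediction rate as mispredicted branches over total branches is an assumption, since the slides do not state the denominator.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def misses_per_kilo_instructions(misses, instructions):
    """Cache misses normalized to 1,000 retired instructions."""
    return 1000.0 * misses / instructions

def branch_misprediction_rate(mispredicted, branches):
    """Fraction of branches mispredicted (assumed definition)."""
    return mispredicted / branches

# Made-up counter readings, for illustration only.
example_cpi = cpi(cycles=2_000_000, instructions=1_000_000)
example_mpki = misses_per_kilo_instructions(misses=25_000, instructions=1_000_000)
example_br = branch_misprediction_rate(mispredicted=500, branches=10_000)
```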

Characteristics obtained from DBI tools

• Spatial locality histogram
  ◦ Cache line access stride distribution
  ◦ Stride is the minimum stride found between the current access and the last N accesses (N currently set to 32)
  ◦ Maximum stride is one page (4 KB)
  ◦ 64-byte cache lines assumed
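The stride definition above can be sketched as follows. This is an illustration under stated assumptions, not the authors' tool: strides are measured in cache lines, the absolute difference is used, and anything larger than one page is clamped to the page-sized bin (64 lines of 64 bytes each).

```python
from collections import Counter, deque

LINE_SIZE = 64                   # bytes per cache line (from the slides)
WINDOW = 32                      # N: number of previous accesses compared against
MAX_STRIDE = 4096 // LINE_SIZE   # one 4 KB page, expressed in cache lines (assumed unit)

def stride_histogram(addresses):
    """Histogram of the minimum line stride to any of the last WINDOW accesses."""
    recent = deque(maxlen=WINDOW)   # cache lines of the most recent accesses
    hist = Counter()
    for addr in addresses:
        line = addr // LINE_SIZE
        if recent:
            stride = min(abs(line - prev) for prev in recent)
            hist[min(stride, MAX_STRIDE)] += 1   # clamp to the one-page maximum
        recent.append(line)
    return hist

# Hypothetical byte addresses: two hits in line 0, two sequential lines, one far jump.
hist = stride_histogram([0, 8, 64, 128, 1 << 20])
```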

• Temporal locality histogram
  ◦ Memory reuse distance (MRD) histogram
  ◦ MRD is the number of unique memory references between two references to the same cache line
  ◦ Or: MRD is the number of unique cache lines referenced between two references to the same cache line
  ◦ Maximum distance currently set to cover 6 MB
  ◦ 64-byte cache lines assumed
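A minimal sketch of the reuse-distance measurement, using the cache-line form of the definition. It keeps an LRU-ordered stack of lines and counts how many distinct lines were touched since the previous access to the same line. The function name, the return shape, and the handling of first-time or beyond-limit accesses are assumptions; the linear search is kept for clarity and would be too slow for billion-instruction traces.

```python
from collections import Counter, OrderedDict

LINE_SIZE = 64                              # bytes per cache line (from the slides)
MAX_LINES = 6 * 1024 * 1024 // LINE_SIZE    # number of lines needed to cover 6 MB

def reuse_distance_histogram(addresses, max_lines=MAX_LINES):
    """Return (histogram of reuse distances, count of cold or beyond-limit accesses)."""
    stack = OrderedDict()   # cache lines ordered from least to most recently used
    hist = Counter()
    beyond = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in stack:
            keys = list(stack)   # O(n) scan, acceptable only for a sketch
            # Distinct lines touched since the previous access to this line.
            hist[len(keys) - 1 - keys.index(line)] += 1
            stack.move_to_end(line)
        else:
            beyond += 1
            stack[line] = True
            if len(stack) > max_lines:
                stack.popitem(last=False)   # drop the least recently used line
    return hist, beyond

# Hypothetical trace touching lines A, B, C, A, A.
hist, beyond = reuse_distance_histogram([0, 64, 128, 0, 0])
```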

Characteristics obtained from DBI tools

• Working set size
  ◦ Total unique bytes touched by the application
  ◦ Distribution of unique bytes touched per 1 billion instructions
• Pattern of executed memory instructions
  ◦ Distance defined as the number of instructions between memory ops
• Distribution of memory size read/written
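The two measurements above can be sketched as below. Both functions and their input formats are hypothetical. Tracking individual bytes in a set is only practical for small traces, and counting adjacent memory ops as distance 1 is an assumption about how the distance is defined.

```python
from collections import Counter

def working_set_bytes(accesses):
    """accesses: iterable of (address, size_in_bytes); returns unique bytes touched."""
    touched = set()
    for addr, size in accesses:
        touched.update(range(addr, addr + size))
    return len(touched)

def mem_op_distance_histogram(is_mem_op):
    """is_mem_op: one boolean per dynamic instruction, True for loads and stores."""
    hist = Counter()
    last = None   # index of the previous memory op, if any
    for idx, mem in enumerate(is_mem_op):
        if mem:
            if last is not None:
                hist[idx - last] += 1
            last = idx
    return hist

# Hypothetical inputs: two overlapping 8-byte accesses plus one 4-byte access.
ws = working_set_bytes([(0, 8), (4, 8), (100, 4)])
dist = mem_op_distance_histogram([True, False, True, True, False, False, True])
```

Resetting the set every fixed number of instructions would give the per-window distribution rather than the whole-run total.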

Characteristics obtained from hardware performance counters:

• Cache miss rates (per 1k instructions)
  ◦ L1, L2, L3, etc.

Mantevo MiniApps

• Explicit finite element MiniApps: PhdMesh
• Molecular dynamics MiniApps: MiniMD
• Implicit finite element MiniApps: HPCCG, pHPCCG, MiniFE

SPEC CPU2006

• 6 floating-point benchmarks: cactusADM, LBM, Povray, DealII, Leslie3d, Calculix
• 4 integer benchmarks: Perlbench, Astar, Libquantum, Xalancbmk

Input sizes:

• Mantevo: adjusted for approximately the same instruction count as SPEC
• SPEC: reference input

Platform:

• Experiments run on a Xeon E5504 (Gainestown, based on Nehalem): 45 nm, 4 cores, 256 KB L2 per core, 4 MB L3

Tools:

• PAPI (papiex)
  ◦ CPI, cache and branch statistics
• PIN (dynamic binary instrumentation)
  ◦ All general characteristics
  ◦ Some memory characteristics
  ◦ Benchmarks run to completion (~1 day each)
• PIN + PinPoints + SimPoint
  ◦ Spatial and temporal locality characteristics
  ◦ Simulation points of 1 billion dynamic instructions each, covering 95% of execution
  ◦ Number of points ranges from 3 to 8, with different weights
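The slides do not say how the per-point histograms are merged into one result. A common way to use simulation-point weights is a weighted average, sketched below under that assumption; the function name and the sample data are hypothetical.

```python
from collections import Counter

def combine_simulation_points(points):
    """points: list of (weight, histogram) pairs, one per simulation point.

    Weights are renormalized to sum to 1 before averaging.
    """
    total_weight = sum(weight for weight, _ in points)
    combined = Counter()
    for weight, hist in points:
        for bin_label, count in hist.items():
            combined[bin_label] += (weight / total_weight) * count
    return dict(combined)

# Two made-up simulation points with stride histograms and weights 0.75 and 0.25.
combined = combine_simulation_points([
    (0.75, {"0": 100, "1": 20}),
    (0.25, {"0": 20, "64": 40}),
])
```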

[Figures, general characteristics. Panels: % Stall Cycles; instruction breakdown (Int Loads, Int Stores, FP Loads, FP Stores, FP Ops, Int Ops, Branches); FP-to-Use; FP-to-FP; LD-to-Use; Instruction Dependence Distance; Basic Block Size; Working Set Size; L1, L2, and L3 Misses/1K inst; BR Misprediction Rate; Distance Between Mem Ops; % Mem Ops. Series: Calculix, DealII, Leslie3d, PHPCCG, MiniFE, MiniMD, PhdMesh, SPEC Avg, Mantevo Avg.]

[Figures, memory characteristics. Panels: Cache Miss Rates; Stride (bins include 0, 1, 64, Other); MemReuse Distance, the number of unique cache lines referenced between two references to the same line (bins 0, <=10, <=512, <=4096, <=65536, >65536). Separate Stride and MemReuse Distance panels are shown for Calculix, DealII, and Leslie3D.]

• MiniApps are more memory-intensive than SPEC
  ◦ >100% more L2 and L3 misses than SPEC
  ◦ Much larger data working set (500% more)
  ◦ More memory ops per instruction (16% more)
  ◦ Memory ops are closer to each other (2.1 vs. 2.9 instructions apart)
  ◦ More prone to contention for memory resources

• MiniApps have much shorter (>100%) dependence distances than SPEC
  ◦ Suggests more dependence stalls

• MiniApps have much shorter basic blocks than SPEC

• MiniApps experience more stall time (>33%)
  ◦ Greater CPI
  ◦ Due to more cache misses and dependence stalls

• Compare the performance of MiniApps and real full-size apps
  ◦ On a single node
  ◦ At scale
• Obtain memory performance characteristics using full runs instead of simulation points
  ◦ Compare findings to those from simulation points
• Study how sensitive performance is to problem size
