9/22/2002 NC State University 1
Detecting Performance Bottlenecks Using Binary Rewriting
Jaydeep Marathe and Frank Mueller
North Carolina State University
Department of Computer Science
Slide 2
Why are Memory Performance Bottlenecks a Problem?

[Figure: memory hierarchy: Processor (CPU) -> L1 Cache -> L2 Cache -> Main Memory (DRAM).]
• Processor speeds are growing much faster than memory access speeds.
• An application's memory performance therefore has an increasingly significant impact on overall performance.
Slide 3
Locality Of Reference

• Temporal Locality :: the same cache block element is accessed repeatedly before the block is evicted.

[Figure: temporal locality: the first access to x misses and loads its block into the cache; repeated accesses to x then hit.]

• Spatial Locality :: adjacent elements of a cache block are accessed before the block is evicted.

[Figure: spatial locality: the access to x misses and loads the whole cache block; subsequent accesses to its neighbors y and z hit.]

Increase Locality -> Decrease Misses!
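A minimal C sketch of the two access patterns (the functions and sizes are illustrative, not from the slides): both loops compute the same sum, but the stride-1 walk touches every element of each cache block before moving on (good spatial locality), while the strided walk uses only one element per block per pass.

```c
#include <stddef.h>

/* Good spatial locality: consecutive elements of each cache block
 * are consumed before the block can be evicted. */
double sum_stride1(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Poor spatial locality: each pass touches one element per block,
 * so every block must survive (or be refetched) across passes. */
double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t off = 0; off < stride; off++)
        for (size_t i = off; i < n; i += stride)
            s += a[i];
    return s;
}
```

Both return the same value; only the order of memory references, and hence the cache behavior, differs.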
Slide 4
How To Gauge Memory Performance?

One Way ..
• Use hardware event counters.
• Sample counter values at regular intervals.
[Figure: an observer process periodically samples the hardware counters of the processor running the application to produce usage statistics.]
Drawbacks:
• Tradeoff between accuracy and sampling overhead.
• Fairly coarse statistics (overall hits, misses, etc.).

Another Way ..
• Use an instrumenting compiler.
• Insert code to log memory accesses.
• Use the complete trace for analysis.
[Figure: the instrumenting compiler turns source code into an instrumented binary, which executes to produce a complete access trace; a post-processor reduces it to usage statistics.]
Drawbacks:
• High execution overhead due to logging.
• The complete trace is huge: hundreds of MBs in size.

Need Accurate Metrics with Minimum Time & Space Overheads!
Slide 5
Detecting Bottlenecks Using Binary Rewriting

[Figure: the controller process instruments the target binary; the running binary emits a memory trace that is compressed online, and the compressed trace file drives the cache simulator, which produces detailed cache statistics.]

• Binary rewriting instruments the application binary.
• Online compression routines are inserted to compress the generated trace.
• The compressed trace drives an incremental cache simulator.
• The simulator generates detailed cache metrics for user feedback.
Slide 6
Advantages ..

[Figure: same tool pipeline as the previous slide.]

• Selective instrumentation of parts of the target binary.
• Partial data traces instead of complete traces.
• Online compression reduces trace storage requirements.
• Statistics correlated to source data structures.
Slide 7
Instrumenting Target Binary

• Extended a portable binary manipulation framework (DynInst, U. Maryland).
• The mutator (controller) parses the control flow graph (CFG) of the target binary's machine code to locate routine scopes and loop scopes.
• Memory access (load/store) and scope-change instructions are instrumented.
• The instrumentation calls handler functions in a shared library: ENTER_SCOPE_Handler(), LOAD_Handler(), STORE_Handler(), EXIT_SCOPE_Handler().

Target_func()
{
    for (I = 0; I < N; I++) {
        A[I] = B[I] * C[I];
    } /* end loop */
} /* end function */

_Target_func:
LoopStart:
    ...
    LOAD  B[I], R1
    LOAD  C[I], R2
    MULT  R1, R2, R3
    STORE A[I]
    ...
    LOOP  LoopStart
end_routine
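A minimal sketch of what the shared-library handlers named above might do (the handler names come from the slide; the event buffer, its layout, and the fixed capacity are illustrative assumptions — the real tool feeds events to the online compressor instead of a flat buffer):

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { EV_ENTER_SCOPE, EV_EXIT_SCOPE, EV_LOAD, EV_STORE } event_type;

typedef struct {
    event_type type;
    uintptr_t  addr;      /* effective address for loads/stores */
    int        scope_id;  /* scope identifier for scope events */
} trace_event;

#define BUF_CAP 4096
trace_event buf[BUF_CAP];     /* illustrative event buffer */
size_t buf_len = 0;

static void record(event_type t, uintptr_t addr, int scope) {
    if (buf_len < BUF_CAP)
        buf[buf_len++] = (trace_event){ t, addr, scope };
}

/* Handlers invoked by the injected instrumentation. */
void ENTER_SCOPE_Handler(int scope_id) { record(EV_ENTER_SCOPE, 0, scope_id); }
void EXIT_SCOPE_Handler(int scope_id)  { record(EV_EXIT_SCOPE, 0, scope_id); }
void LOAD_Handler(void *ea)            { record(EV_LOAD,  (uintptr_t)ea, -1); }
void STORE_Handler(void *ea)           { record(EV_STORE, (uintptr_t)ea, -1); }
```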
Slide 8
Compressing Generated Trace

• The generated trace potentially contains millions of accesses!
• Solution :: detect regular patterns in the trace for effective compression.
• Regular Section Descriptor (RSD) :: the primary representation.

RSD :: <
    start_addr  :: starting address of the pattern,
    length      :: length of the pattern,
    addr_stride :: stride between successive addresses in the pattern,
    start_seq   :: starting position of the pattern in the overall trace,
    seq_stride  :: interleave distance in the overall trace between successive addresses from this pattern,
    event_type  :: Enter/Exit Scope or Load/Store Access,
    src_index   :: index into the {source_line :: source_file} table
>
Slide 9
An RSD Example

Consider the trace produced by the following sample loop:

for (I = 0; I <= N; I++) {
    A[I] = A[I] + B[I][I];
}

Two loads :: A[I] & B[I][I]; one store :: A[I].

Address trace generated:

B[0][0] A[0] A[0]   B[1][1] A[1] A[1]   ...   B[N][N] A[N] A[N]

RSD-1 :: over the loads of A[I]:
RSD-1 :: < start_addr = &A[0], length = N+1, addr_stride = 8, start_seq = 0, seq_stride = 3, event_type = LOAD, src_table_index = 1 >
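The RSD above can be checked mechanically: expanding it must regenerate exactly the addresses of A[0]..A[N]. A sketch in C (the struct mirrors a subset of the slide's field list; the expansion helper and array size are illustrative assumptions):

```c
#include <stdint.h>

#define N 7   /* small illustrative size */

/* Subset of the slide's RSD fields. */
typedef struct {
    uintptr_t start_addr;   /* starting address of the pattern */
    long length;            /* number of addresses */
    long addr_stride;       /* byte stride between addresses */
    long start_seq;         /* trace position of the first address */
    long seq_stride;        /* trace positions between members */
    int  src_index;         /* source line/file table index */
} rsd;

double A[N + 1];

/* Fill out[] with the addresses the RSD generates; return the count. */
long rsd_expand(const rsd *r, uintptr_t *out) {
    for (long i = 0; i < r->length; i++)
        out[i] = r->start_addr + i * r->addr_stride;
    return r->length;
}

/* RSD-1 from the slide: loads of A[I], stride sizeof(double) = 8 bytes,
 * one load every 3rd trace position. */
rsd rsd1(void) {
    rsd r = { (uintptr_t)&A[0], N + 1, sizeof(double), 0, 3, 1 };
    return r;
}
```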
Slide 10
Power Regular Section Descriptors (PRSDs)

• RSDs alone are not powerful enough to compress the address stream efficiently.
• Solution :: nest RSDs to create a Power Regular Section Descriptor (PRSD).

PRSD :: <
    base_addr       :: first address generated by the PRSD,
    base_addr_shift :: stride of base_addr between PRSD iterations,
    base_seq        :: starting position of this pattern in the trace,
    base_seq_shift  :: interleave distance between PRSD iterations,
    length          :: PRSD length,
    child PRSD/RSD  :: nested PRSD/RSD
>

[Figure: a triply nested loop accessing A[][][]; the innermost loop is captured by RSD-1, the middle loop by PRSD-2 (child = RSD-1), and the outermost loop by PRSD-3 (child = PRSD-2).]

• With PRSDs, loop nests of arbitrary depth can be represented efficiently.
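The nesting idea can be sketched as a small recursive expansion (field names follow the slide; the unified node layout and the expansion routine are illustrative assumptions). A node is either a leaf RSD or a PRSD that replays its child, shifting the child's base address by base_addr_shift on every iteration:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct node {
    long length;            /* iterations (PRSD) or addresses (RSD) */
    long base_addr_shift;   /* PRSD: byte shift of child base per iteration */
    long addr_stride;       /* RSD: byte stride between addresses */
    struct node *child;     /* NULL for a leaf RSD */
} node;

/* Emit all addresses generated by n starting at base; return the count. */
size_t expand(const node *n, uintptr_t base, uintptr_t *out) {
    size_t k = 0;
    for (long i = 0; i < n->length; i++) {
        if (n->child)   /* PRSD: replay child at a shifted base */
            k += expand(n->child, base + i * n->base_addr_shift, out + k);
        else            /* RSD: emit one strided address */
            out[k++] = base + i * n->addr_stride;
    }
    return k;
}
```

A two-level nest (PRSD over RSD) expands to length(outer) x length(inner) addresses, which is how a doubly nested loop's whole trace collapses into two fixed-size descriptors.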
Slide 11
Incremental Cache Simulation

[Figure: the cache simulator consumes the compressed address trace, a scopes file (scope structure of the target), and a variables file (base addresses of variables in the target), and writes detailed cache statistics to a report file.]

• Incremental cache simulation (modified MHSim, Rice U.).
• Correlates trace addresses <---> variable names, and access points (load/store instructions) <---> line numbers in the source.
• Metrics are reported per access point and also aggregated by the scope structure of the target.
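The address-to-variable correlation can be sketched as a range lookup (illustrative: the variables file supplies each variable's base address and size, and a trace address maps to the variable whose range contains it; the struct and linear search are assumptions, not the tool's implementation):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const char *name;   /* variable name from the symbol table */
    uintptr_t   base;   /* base address from the variables file */
    size_t      size;   /* extent in bytes */
} var_entry;

/* Map a trace address to the variable whose range contains it. */
const char *lookup_var(const var_entry *vars, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++)
        if (addr >= vars[i].base && addr < vars[i].base + vars[i].size)
            return vars[i].name;
    return "<unknown>";
}
```

With many variables, a sorted table and binary search would be the natural refinement.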
Slide 12
Report File

Cache metrics per access point:

Metric              Definition                                  What it tells us
Miss ratio          Total Misses / Total Accesses               Coarse indicator of performance
Temporal ratio      Temporal Hits / Total Hits                  Relative degree of temporal locality
Spatial ratio       Spatial Hits / Total Hits                   Relative degree of spatial locality
Spatial use         Used Bytes / (Block Size x #evictions)      Cache block fraction used before eviction (access efficiency)
Evictor references  List of evictors                            Conflicting variables (useful!)
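The table's definitions are direct ratios over raw counts. A sketch in C (the struct and helper names are illustrative, not from the tool):

```c
typedef struct {
    long accesses, misses;             /* total accesses and misses */
    long temporal_hits, spatial_hits;  /* hit breakdown */
    long used_bytes, evictions;        /* bytes touched in evicted blocks */
    int  block_size;                   /* cache block size in bytes */
} counts;

/* Total Misses / Total Accesses */
double miss_ratio(const counts *c) {
    return (double)c->misses / c->accesses;
}

/* Temporal Hits / Total Hits */
double temporal_ratio(const counts *c) {
    long hits = c->accesses - c->misses;
    return (double)c->temporal_hits / hits;
}

/* Spatial Hits / Total Hits */
double spatial_ratio(const counts *c) {
    long hits = c->accesses - c->misses;
    return (double)c->spatial_hits / hits;
}

/* Used Bytes / (Block Size x #evictions) */
double spatial_use(const counts *c) {
    return (double)c->used_bytes / ((double)c->block_size * c->evictions);
}
```

Plugging in the matrix-multiply totals from the next slide (1000000 accesses, 261189 misses, 703930 temporal hits) reproduces its reported miss ratio of 0.261 and temporal ratio of 0.95.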
Slide 13
Test Kernel: Matrix Multiplication

C source code:

60  for (i = 0; i < MAT_DIM; i++)
61    for (j = 0; j < MAT_DIM; j++)
62      for (k = 0; k < MAT_DIM; k++)
63        x[i][j] = y[i][k] * z[k][j] + x[i][j];

MAT_DIM = 800; total samples registered = 1000000

Overall performance:
reads = 750000, writes = 250000
hits = 738811, misses = 261189, miss ratio = 0.261
temporal hits = 703930, spatial hits = 34881
temporal ratio = 0.95, spatial ratio = 0.04721, spatial use = 0.169

Per-reference information:

Line  Name       Hits      Miss Ratio  Temporal Ratio  Spatial Use  Evictors
66    z_Read_1   0.00e+00  1.0         no hits         0.171        Z, Y, X
66    y_Read_0   2.39e+05  0.044       0.854           0.129        Z
66    x_Read_2   2.50e+05  0.0006      1.00            0.5          Z
66    x_Write_3  2.50e+05  0.0         1.00                         no evicts

• High miss ratio: more than 25% of accesses were misses.
• Low spatial use: references evicted before their cache block was fully referenced.
• z_Read_1 dominates, causing nearly 100% of the misses; cause: iteration-space layout.
• z_Read_1 is its own dominant evictor (evicts itself; 95% of the evictor table).
• Evictions lead to low spatial use for the x, y and z loads.

Suggested optimizations: for locality of z, interchange the j & k loops; for temporal reuse of y and x, use blocking (tiling).
Slide 14
Optimized Matrix Multiply

81  for (jj = 0; jj < MAT_DIM; jj += ts)
82    for (kk = 0; kk < MAT_DIM; kk += ts)
83      for (i = 0; i < MAT_DIM; i++)
84        for (k = kk; k < min(kk+ts, MAT_DIM); k++)
85          for (j = jj; j < min(jj+ts, MAT_DIM); j++)
86            x[i][j] = y[i][k] * z[k][j] + x[i][j];

tile size ts = 16

Overall performance (New / Old):
hits = 982128 / 738811
misses = 17872 / 261189
miss ratio = 0.017 / 0.261
temporal hits = 947173 / 703930
spatial hits = 34955 / 34881
temporal ratio = 0.96441 / 0.95
spatial ratio = 0.03559 / 0.04721
spatial use = 0.7039 / 0.169

Per-reference information (New / Old):

Name       Hits                  Misses                 Miss Ratio       Temporal Ratio    Spatial Use
x_Write_3  2.50e+05 / 2.50e+05   0.00e+00 / 0           0.0 / 0.0        0.89 / 1.00       no evicts / no evicts
x_Read_2   2.50e+05 / 2.50e+05   2.88e+02 / 1.57e+02    0.001 / 0.0006   0.99 / 1.00       0.861 / 0.5
y_Read_0   2.41e+05 / 2.39e+05   8.79e+03 / 1.10e+04    0.035 / 0.044    0.896 / 0.854     0.732 / 0.129
z_Read_1   2.41e+05 / 0          8.79e+03 / 2.50e+05    0.035 / 1.0      0.972 / no hits   0.673 / 0.171
Slide 15
Another example: ADI Integration

16  for (k = 1; k < N; k++) {
17    for (i = 2; i < N; i++)
18      x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];

22    for (i = 2; i < N; i++)
23      b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
    }

N = 800; accesses logged = 1000000

Overall performance:
reads = 800000, writes = 200000
hits = 499499, misses = 500501, miss ratio = 0.5
temporal hits = 351731, spatial hits = 147768
temporal ratio = 0.704, spatial ratio = 0.29583, spatial use = 0.2018

Per-reference metrics:

Line  Name       Source_Ref  Hits      Misses    Miss Ratio  Temporal Ratio  Spatial Use
18    x_Read_3   x[i][k]     0         1.00e+05  1.00        no hits         0.13
18    a_Read_1   a[i][k]     0         1.00e+05  1.00        no hits         0.25
18    b_Read_2   b[i-1][k]   0         1.00e+05  1.00        no hits         0.13
23    b_Read_8   b[i][k]     0         9.98e+04  1.00        no hits         0.24
23    a_Read_5   a[i][k]     0         9.98e+04  1.00        no hits         0.24
18    x_Read_0   x[i-1][k]   1.00e+05  1.26e+02  0           1.0             0.25
23    b_Read_7   b[i-1][k]   9.96e+04  1.25e+02  0           1.0             0.25
18    x_Write_4  x[i][k]     1.00e+05  0.00e+00  0           0.50            no evicts
23    b_Write_9  b[i][k]     9.98e+04  0.00e+00  0           0.27            no evicts
23    a_Read_6   a[i][k]     9.98e+04  0.00e+00  0           0.74            no evicts

• High overall miss rate: poor locality.
• Low overall spatial use.
• The first 5 references have 0 hits.
• Pattern: the references iterate over rows, so consecutive accesses are far apart in memory.
• Low spatial use values indicate premature evictions.

To increase locality for the top 5 references, increase spatial locality: interchange the loops.
Slide 16
Optimized ADI Integration

14  for (i = 2; i < N; i++) {
15    for (k = 1; k < N; k++)
16      x[i][k] = x[i][k] - x[i-1][k] * a[i][k] / b[i-1][k];
17    for (k = 1; k < N; k++)
18      b[i][k] = b[i][k] - a[i][k] * a[i][k] / b[i-1][k];
    }

N = 800; accesses logged = 1000000

Overall performance (New / Old):
hits = 874600 / 499499
misses = 125400 / 500501
miss ratio = 0.125 / 0.5
temporal hits = 454867 / 351731
spatial hits = 419733 / 147768
temporal ratio = 0.52009 / 0.704
spatial ratio = 0.4799 / 0.29583
spatial use = 0.9628 / 0.2018

Per-reference information (New / Old): the five references that previously had zero hits (x_Read_3, a_Read_1, b_Read_2, a_Read_5, b_Read_8) now hit on roughly 75-85% of their accesses, with miss ratios down from 1.00 to 0.145-0.25 and spatial use up from 0.13-0.25 to 0.91-0.99.

Significantly more hits, fewer evictions, higher spatial use.
Slide 17
Summing Up ..

Process:
• Use binary rewriting to instrument the target executable.
• Compress the generated trace online.
• Use the compressed trace for cache simulation.

Highlights:
• Compiler-independent support.
• Useful for mixed-language applications.
• Partial data traces: targeted instrumentation.
• Efficient online trace compression.
• Enhanced user feedback :: source-correlated statistics.
Slide 19
Future Work

Automatic Optimization
• Identify natural loops in the CFG.
• Attempt to identify data dependencies from the binary.
• Reconfigure the binary with optimizations, without violating data dependencies.
• Optimizations could include prefetching, tiling, loop fusion, loop interchange, etc.

[Figure: the controller attaches to the executing binary, extracts the CFG from its text section, and injects optimizations.]
Slide 20
Related Work

• SIGMA [Supercomputing '02] :: Simulator Infrastructure to Guide Memory Analysis. Captures the full address trace; no evictor information, weaker compression algorithm.
• MTOOL [TPDS '93], CPROF [Computer '94] :: correlation to source line numbers only.
• PAPI, HPM :: APIs to access hardware performance counters.
Slide 21
The Compression Algorithm

• Targeted at regular array accesses in tightly nested loops.

Works well (constant-size compressed trace!):

for (...; ...; ...) {
  for (...; ...; ...) {
    for (...; ...; ...) {
      ... A[ ][ ][ ] ...
    }
  }
}

Won't work well (irregular, data-dependent addresses):

for (...; ...; ...) {
  A[ B[I] ] = 2.0;
}

• The algorithm has growth rate O(n x w), where n = # accesses and w = 'pool size' (the maximum # of accesses residing in memory for pattern matching).
• The tool structure is modular :: it is possible to use another algorithm better suited to the application domain.
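The core idea behind this kind of compression can be sketched as greedy stride-run detection (an illustrative simplification, not the tool's algorithm): a run of addresses with a constant stride collapses into one (start, stride, length) record, which is exactly the kernel of an RSD.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uintptr_t start;   /* first address of the run */
    long      stride;  /* constant byte stride within the run */
    size_t    length;  /* number of addresses in the run */
} run;

/* Greedily split the address stream into constant-stride runs.
 * Returns the number of runs written to out. */
size_t compress_strides(const uintptr_t *a, size_t n, run *out) {
    size_t nruns = 0, i = 0;
    while (i < n) {
        run r = { a[i], 0, 1 };
        if (i + 1 < n) {
            r.stride = (long)(a[i + 1] - a[i]);
            while (i + r.length < n &&
                   (long)(a[i + r.length] - a[i + r.length - 1]) == r.stride)
                r.length++;
        }
        out[nruns++] = r;
        i += r.length;
    }
    return nruns;
}
```

On a regular loop's trace the output stays a handful of records regardless of trip count; on the irregular A[B[I]] pattern above it degenerates to one record per access, which is why such loops compress poorly.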
Slide 22
Challenges

• Reverse-mapping of accesses to variable expressions in the source:
  - Currently limited to local and global variables only.
  - Difficult to support dynamically allocated objects, since the program counter might have passed the object allocation stage (malloc) by the time we attach to the application.
  - Reverse-engineering the access point --> source expression mapping is difficult (e.g., A[I+j*2][Q+1][P+R] = 2.0 maps to many machine instructions).
• Symbol table information must be present for effective user feedback (getting variable names, line numbers, etc.).
Slide 23
Memory Performance Metrics

[Figure: memory hierarchy with relative access cycles: Processor -> L1 Cache (~2) -> L2 Cache (~5) -> Main Memory (~30).]

• A hit occurs when the layer contains the accessed element.
• A miss occurs when the requested element is absent from the layer.
• Misses are bad! :: they force the processor to stall until the data is fetched from the next layer.
• Fewer misses -> faster performance.
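The hit/miss bookkeeping described above can be sketched with a tiny direct-mapped cache model (the parameters and single level are illustrative; the tool's simulator models the hierarchy and locality metrics in much more detail):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define NSETS      64
#define BLOCK_SIZE 32   /* bytes per cache block */

typedef struct {
    uintptr_t tags[NSETS];   /* block number cached in each set */
    bool      valid[NSETS];
    long      hits, misses;
} cache;

void cache_access(cache *c, uintptr_t addr) {
    uintptr_t block = addr / BLOCK_SIZE;   /* which block is accessed */
    size_t set = block % NSETS;            /* direct-mapped placement */
    if (c->valid[set] && c->tags[set] == block) {
        c->hits++;                         /* layer contains the element */
    } else {
        c->misses++;                       /* stall: fetch from next layer, */
        c->valid[set] = true;              /* evicting the old block */
        c->tags[set] = block;
    }
}
```

Two accesses to the same address show a temporal hit; two accesses within the same 32-byte block show a spatial hit, the distinction the report file's temporal and spatial ratios capture.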