![Page 1: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/1.jpg)
Catching Accurate Profiles in Hardware
Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese
Presented by
Jelena Trajkovic
ICS 280/259
![Page 2: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/2.jpg)
Outline
• Introduction & Motivation
• Goal
• Related Work (Stratified Sampler)
• Interval-based Profiling for a Single Hash Profiler
• Experimental results
• Multiple-hash Profiler
• Experimental results
![Page 3: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/3.jpg)
Introduction & Motivation
• SW is used to gather program behavior information
• Architectural support for generating profiles at run-time
– HW is used to assist SW
– dependent on system SW (for management or aggregation of events)
• HW-only profiler
![Page 4: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/4.jpg)
Introduction & Motivation (cont.)
• HW optimizations that can take advantage of information gathered at run-time:
– Cache replacement & prefetching
  • identifying loads that cause the majority of misses
– Value-based optimization
  • 50% of memory accesses are dominated by 10 distinct values
  • capture this dynamically => this information is used for storing compressed values in the data cache
– Trace formation
  • dynamically extracting and ordering frequently executed code => more efficient I-fetch
– Multiple path execution
  • find branches that are hard to predict and execute down multiple paths
![Page 5: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/5.jpg)
Goal
• The goal is to build a profiling scheme that satisfies the following properties:
– Area Efficient – capacity constraints (fixed amount of area)
– Accurate – identify important / frequent events and count them accurately
– Timely – up-to-date information about program behavior
– Performance Efficiency and SW Independence – no dependence on system SW support to manage profiles (accumulate and analyze events); important events are identified in HW
![Page 6: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/6.jpg)
Related Work
• SW profiling
– Binary instrumentation (ATOM by Calder et al.)
• HW counter assisted profiling
– DCPI system for Alpha processors
• HW table based profiling
– Stratified sampling (Sastry et al.)
• Co-processor profiler
– Distill information passed from main processor (Zilles and Sohi)
![Page 7: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/7.jpg)
Profiling Events
• Profiling event: combination of several variables
– instruction PC, load address, register value or name, cache miss …
• A tuple represents an event as a combination of 2 variables
– <pc, value>
![Page 8: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/8.jpg)
Related Work: Stratified Sampler
• Divides the original input stream into multiple streams via hashing (independently sampled)
• Table of counters
– number of occurrences of different events
– counter is selected by applying a hash function to the input event
– incremented when the event appears in the input stream
– on reaching the threshold value, the counter is reset and the event is reported (interrupt to the OS)
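The counter-table behavior listed above can be illustrated with a minimal Python sketch. It is not the implementation from Sastry et al.: the class name `StratifiedSampler`, the `hash_fn` parameter, and the `reported` list (which stands in for the interrupt to the OS) are assumptions made for illustration.

```python
class StratifiedSampler:
    """Untagged counter table; an event is reported whenever its counter hits the threshold."""

    def __init__(self, num_counters, threshold, hash_fn=hash):
        self.counters = [0] * num_counters
        self.threshold = threshold
        self.hash_fn = hash_fn   # assumption: any function mapping an event to an int
        self.reported = []       # assumption: stands in for the interrupt to the OS

    def observe(self, event):
        # Select the counter by hashing the input event.
        idx = self.hash_fn(event) % len(self.counters)
        self.counters[idx] += 1
        # On reaching the threshold, reset the counter and report the event.
        if self.counters[idx] >= self.threshold:
            self.counters[idx] = 0
            self.reported.append(event)


sampler = StratifiedSampler(num_counters=16, threshold=4)
for _ in range(8):
    sampler.observe((0x400, 7))   # a <pc, value> tuple
```

Because the table holds no tags, two different events that hash to the same counter are counted together, which is the aliasing problem the next slide addresses.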
![Page 9: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/9.jpg)
Related Work: Stratified Sampler (cont.)
• To reduce aliasing and improve accuracy:
– Partial tags, miss counters, state information
– Hit counters – number of occurrences
– Miss counters – tuple hashes to a particular entry, but the tag differs (replacement policy)
– On reaching threshold value:
  • Generate interrupt
  • Buffered, interrupt is sent when buffer fills up
  • Placed in associative counter table, passed to SW (via intermediate buffer)
• Accumulating information in SW (5% interrupt overhead)
![Page 10: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/10.jpg)
Interval-based Profiling for a Single Hash Profiler
• Removing SW: accumulator table
• Interval-based
– significant number of occurrences within an interval
– reset hash-table counters after every interval
– improving accuracy – shielding
• Divide execution time into intervals
– interval length – fixed number of profiling events (tuples)
– capture only events (candidate tuples) that occur more often than the candidate threshold (% of interval length)
![Page 11: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/11.jpg)
Single Hash Architecture
– accumulator table is fully associative and tagged
    if (input tuple is in acc. table)
        increment its counter
    else
        hash into hash-table
        increment corresponding counter
– hash-table does not contain tags – aliasing
    if (tuple reaches candidate threshold value)
        if (acc. table is not full)
            allocate an acc. table entry
            mark entry as non-replaceable till the end of interval
– a tuple held in the acc. table is not given as an input to the hash-table – shielding
    if (end of the interval)
        flush hash-table
        mark all entries in acc. table as replaceable
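The flow on this slide can be sketched in Python as follows. This is a software illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, a new accumulator entry starts from the hash counter's value, and a full table evicts any replaceable entry (the slide does not specify either choice). The retaining and resetting optimizations are left out.

```python
class SingleHashProfiler:
    """Interval-based profiler: one untagged hash table plus a tagged accumulator table."""

    def __init__(self, interval_length, threshold_fraction, table_size=2048, hash_fn=hash):
        self.interval_length = interval_length
        self.threshold = max(1, round(interval_length * threshold_fraction))
        self.capacity = interval_length // self.threshold  # worst-case candidates per interval
        self.hash_fn = hash_fn
        self.hash_table = [0] * table_size
        self.acc = {}   # tuple -> [count, replaceable]
        self.seen = 0

    def observe(self, tup):
        entry = self.acc.get(tup)
        if entry is not None:
            entry[0] += 1   # shielding: the tuple never reaches the hash table
        else:
            idx = self.hash_fn(tup) % len(self.hash_table)
            self.hash_table[idx] += 1
            if self.hash_table[idx] >= self.threshold:
                self._promote(tup, self.hash_table[idx])
        self.seen += 1
        if self.seen == self.interval_length:
            self._end_interval()

    def _promote(self, tup, count):
        if len(self.acc) >= self.capacity:
            # Assumption: evict any entry that is marked replaceable, else drop the tuple.
            victim = next((t for t, e in self.acc.items() if e[1]), None)
            if victim is None:
                return
            del self.acc[victim]
        self.acc[tup] = [count, False]   # non-replaceable until the interval ends

    def _end_interval(self):
        self.hash_table = [0] * len(self.hash_table)   # flush hash table
        for entry in self.acc.values():
            entry[1] = True                            # mark replaceable
        self.seen = 0
```

With `SingleHashProfiler(100, 0.1)` the candidate threshold is 10 occurrences, so a tuple seen 10 times within an interval moves to the accumulator table and is counted exactly from then on.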
![Page 12: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/12.jpg)
Single Hash Architecture (cont.)
• Calculate the worst-case number of entries in the acc. table (to avoid capacity and aliasing issues) as a function of profile interval length and candidate threshold
– interval length: number of events that determine a profiling interval
– candidate threshold: number of occurrences needed to get recorded in the acc. table (percentage of interval length)
• e.g. interval length = 10,000
  – candidate threshold = 1% => 100 entries
  – candidate threshold = 0.1% => 1,000 entries
– interval / threshold pairs: 10,000 w/ 1% and 1 million w/ 0.1%
– Hash-table: 2K entries
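The worst-case bound follows from simple counting: if every candidate must occur at least threshold-many times, an interval cannot contain more candidates than interval length divided by that count. A small Python check of the numbers on this slide (the function name `worst_case_candidates` is hypothetical):

```python
def worst_case_candidates(interval_length, threshold_fraction):
    """Upper bound on distinct tuples that can reach the candidate threshold in one interval."""
    threshold_count = max(1, round(interval_length * threshold_fraction))
    return interval_length // threshold_count


entries_1pct = worst_case_candidates(10_000, 0.01)        # threshold = 100 occurrences
entries_01pct = worst_case_candidates(10_000, 0.001)      # threshold = 10 occurrences
entries_1m = worst_case_candidates(1_000_000, 0.001)      # threshold = 1,000 occurrences
```

The bound depends only on the threshold percentage, so the 1 million / 0.1% configuration needs the same 1,000 entries as 10,000 / 0.1%.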
![Page 13: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/13.jpg)
Single Hash Architecture (cont.)
• Hash functions: for a given tuple <pc, value>
npc = flip(randomize(pc))
nv = randomize(value)
index = xor-fold(npc xor nv, index-size)
• Optimizations:
– Retaining: keeps top entries in acc. table from the previous interval
– Resetting: reset counter in hash-table, after it reaches candidate threshold
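The index computation above can be sketched in Python. The slide does not define `randomize` or `flip`, so the versions below are stand-ins chosen only for illustration (a multiplicative mixer and a 32-bit bit reversal); `xor_fold` is the usual technique of XOR-ing successive index-sized chunks. The default of 11 index bits is an assumption matching a 2K-entry table.

```python
MASK32 = 0xFFFFFFFF


def randomize(x):
    """Stand-in mixer (assumption): multiply by an odd constant, then fold high bits down."""
    x = (x * 0x9E3779B1) & MASK32
    return x ^ (x >> 16)


def flip(x):
    """Stand-in (assumption): reverse the order of the 32 bits."""
    return int(f"{x & MASK32:032b}"[::-1], 2)


def xor_fold(x, index_bits):
    """XOR successive index_bits-wide chunks of x into one index_bits-wide value."""
    mask = (1 << index_bits) - 1
    folded = 0
    while x:
        folded ^= x & mask
        x >>= index_bits
    return folded


def tuple_index(pc, value, index_bits=11):
    """Hash-table index for a <pc, value> tuple, following the three steps on the slide."""
    npc = flip(randomize(pc))
    nv = randomize(value)
    return xor_fold(npc ^ nv, index_bits)
```

Flipping only the PC half means the bits of `pc` and `value` that vary the most do not line up and cancel in the XOR, which is one plausible reason for the asymmetry.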
![Page 14: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/14.jpg)
Experimental setup
• SPEC95: go, li, vortex; SPEC2K: gcc, vortex; deltablue, sis, burg
• Compilation:
– DEC Alpha 21164, DEC C (full optimizations)
• Profiling analysis: ATOM
• Fast forwarded and then ran for 500 million instructions
![Page 15: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/15.jpg)
Error Calculation
• For each interval, compare the candidates seen by the HW profiler with those of a perfect profiler
– False Positive
– False Negative
– Neutral Positive
– Neutral Negative
• Total error rate for an interval
![Page 16: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/16.jpg)
Experimental Results
• Accuracy of HW profiling depends on
– number of unique tuples in an interval (distinct tuples)
– number of unique tuples that cross the threshold
• Analysis of candidate tuples
Number of distinct tuples seen in an interval on average
![Page 17: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/17.jpg)
Number of unique candidate tuples in an interval on average
![Page 18: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/18.jpg)
Percentage of variation of candidates from one interval to the next
![Page 19: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/19.jpg)
Error rates
Single Hash table with retaining/resetting results across a set of benchmarks
![Page 20: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/20.jpg)
Multiple-hash Profiler
• Independent hash functions (for each table)
if(no entry in acc. table)
hash to each table
update each counter
if(all entries for particular tuple in hash table reach candidate threshold)
add entry to the acc. table
reset counters in hash-table (immediately or at the end of interval)
• Conservative update: update just the smallest counter
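A minimal Python sketch of the multi-hash scheme with conservative update follows. The names are hypothetical, the threshold is passed as an occurrence count, the default per-table hash functions are stand-ins, counters are not reset on promotion, and interval boundaries are signalled by calling `end_interval()` explicitly; accumulator capacity handling is omitted to keep the example short.

```python
class MultiHashProfiler:
    """Several untagged hash tables; a tuple is a candidate only if all its counters agree."""

    def __init__(self, threshold, num_tables=4, total_entries=2048,
                 hash_fns=None, conservative=True):
        size = total_entries // num_tables   # total entries are split across the tables
        self.threshold = threshold
        self.conservative = conservative
        self.tables = [[0] * size for _ in range(num_tables)]
        # Assumption: one stand-in hash function per table, made distinct by a seed.
        self.hash_fns = hash_fns or [lambda t, s=s: hash((s, t)) for s in range(num_tables)]
        self.acc = {}   # tuple -> count

    def observe(self, tup):
        if tup in self.acc:
            self.acc[tup] += 1
            return
        slots = [(tbl, fn(tup) % len(tbl)) for tbl, fn in zip(self.tables, self.hash_fns)]
        low = min(tbl[i] for tbl, i in slots)
        for tbl, i in slots:
            # Conservative update: only the smallest counter(s) are incremented.
            if not self.conservative or tbl[i] == low:
                tbl[i] += 1
        if all(tbl[i] >= self.threshold for tbl, i in slots):
            self.acc[tup] = min(tbl[i] for tbl, i in slots)

    def end_interval(self):
        for tbl in self.tables:
            tbl[:] = [0] * len(tbl)


# Tuples A and B collide in table 0 but not in table 1.
hash_fns = [lambda t: t[0], lambda t: t[1]]
profiler = MultiHashProfiler(threshold=3, num_tables=2, total_entries=8, hash_fns=hash_fns)
A, B = (0, 0), (0, 1)
for _ in range(3):
    profiler.observe(A)
profiler.observe(B)
```

In this run `A` becomes a candidate after three occurrences, while `B` does not, even though it shares a saturated counter with `A` in the first table. A single hash table would have reported `B` as a false positive.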
![Page 21: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/21.jpg)
Multi-hash profiler for an interval of 10,000, 1% candidate threshold, and a total of 2K hash-table entries
Multi-hash profiler for an interval of 1 million, 0.1% candidate threshold, and a total of 2K hash-table entries
![Page 22: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/22.jpg)
Varying number of hash tables for the best multi-hash profiler - C1, R0 (w/ conservative update and w/o resetting) (10,000, 1% - L; 1 million, 0.1% - R)
Variation in the error across different intervals (BSH w/ resetting - L; multi-hash w/ conservative update and no resetting, 4 hash tables - R)
![Page 23: Catching Accurate Profiles in Hardware](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814928550346895db65e96/html5/thumbnails/23.jpg)
Summary
• Profiling architecture
• Efficiently filters out important data
• Efficient in terms of HW cost (6KB + (1KB or 10 KB)) and overhead (no performance overhead)