vitaly slobodskoy andrey isakov pavel gerasimov · 2019. 1. 9. · vitaly slobodskoy andrey isakov...
TRANSCRIPT
Vitaly Slobodskoy
Andrey Isakov
Pavel Gerasimov
Intel® VTune™ Amplifier
• Performance profiler
• Focus is on hardware performance monitoring capabilities
• Performance Monitoring Unit (PMU)
• Event Based Sampling (EBS)
[Figure: samples on a time axis; each interrupt records thread and instruction address]
Why is capturing call stacks a pain for performance monitoring tools?
• Heavily OS-dependent
• Tricky to capture the stack segment on some operating systems
• Frame pointers make the call stack easier to obtain, but require all binaries (including system modules) to be compiled with frame pointers
• Stack unwinding is easily broken by a custom calling convention (compiler-dependent)
• Capturing the stack segment adds collection overhead and increases the size of the resulting trace files
Last Branch Record
• An LBR call stack is a list of From/To program address pairs that the CPU records for each call instruction in a specially designated buffer
• The TOS (Top of Stack) register points to the current stack element
• The Intel® Skylake microarchitecture introduces a third register, called "info", which stores the branch prediction flag, TSX info, and elapsed cycle data
Call stack
0x699
0x599
0x499
0x399
0x299
0x100
# From To Info
0 0x100 0x200 111
1 0x299 0x300 222
2 0x399 0x400 333
3 0x499 0x500 444
4 0x599 0x600 555
5 0x699 0x700 666
6 0 0 0
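As a rough illustration of how the table above maps to the call stack listed before it, the following Python sketch walks the buffer backwards from the TOS slot. The `LbrEntry` type, the 7-slot buffer, the choice of TOS = 5, and treating an all-zero slot as unused are assumptions made for this example; real hardware exposes these values through model-specific registers.

```python
from typing import List, NamedTuple

class LbrEntry(NamedTuple):
    from_addr: int  # address of the CALL instruction
    to_addr: int    # call target
    info: int       # LBR INFO value (treated as opaque here)

EMPTY = LbrEntry(0, 0, 0)  # assumption: an all-zero slot is unused

def call_stack(lbr: List[LbrEntry], tos: int) -> List[int]:
    """Return call-site addresses, innermost frame first."""
    n = len(lbr)
    stack = []
    for i in range(n):
        entry = lbr[(tos - i) % n]  # step backwards from TOS, wrapping around
        if entry == EMPTY:
            break
        stack.append(entry.from_addr)
    return stack

# The table from the slide, with TOS assumed to point at slot 5.
lbr = [
    LbrEntry(0x100, 0x200, 111),
    LbrEntry(0x299, 0x300, 222),
    LbrEntry(0x399, 0x400, 333),
    LbrEntry(0x499, 0x500, 444),
    LbrEntry(0x599, 0x600, 555),
    LbrEntry(0x699, 0x700, 666),
    EMPTY,
]
print([hex(a) for a in call_stack(lbr, tos=5)])
# ['0x699', '0x599', '0x499', '0x399', '0x299', '0x100']
```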
Limitations
• Up to 32 register pairs (depends on CPU architecture)
• The CPU is not aware of OS thread context switches
• The LBR registers can therefore contain stack frames from different threads
Getting valid portion of call stack using LBR
• Static binary analysis can validate that the current FROM address is reachable from the previous TO address
• Heuristics have to be applied, as indirect jumps can't be checked using static analysis alone
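The reachability check can be pictured with a much simplified stand-in: a frame is accepted if its FROM address lies in the same function that the previous frame's TO address entered. This is only a sketch. The `FUNCTION_STARTS` table is hypothetical (a real tool would derive it from symbols or disassembly), and the same-function test ignores indirect jumps and tail calls, which is exactly where the heuristics mentioned above would come in.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical function start addresses, sorted ascending (e.g. from a symbol table).
FUNCTION_STARTS = [0x100, 0x200, 0x300, 0x400, 0x500, 0x600, 0x700]

def function_of(addr: int) -> Optional[int]:
    """Return the start address of the function containing addr, if any."""
    i = bisect_right(FUNCTION_STARTS, addr)
    return FUNCTION_STARTS[i - 1] if i else None

def valid_portion(frames: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """frames: (from, to) pairs, oldest first.

    Keep the newest run of frames in which every FROM lies in the
    function entered by the previous frame's TO.
    """
    start = len(frames) - 1
    while start > 0:
        prev_to = frames[start - 1][1]
        cur_from = frames[start][0]
        if function_of(cur_from) != function_of(prev_to):
            break  # chain is broken: older frames likely belong to another thread
        start -= 1
    return frames[start:]

# The oldest frame comes from an unrelated context; the other three chain up.
frames = [(0x850, 0x900), (0x299, 0x300), (0x399, 0x400), (0x499, 0x500)]
print([(hex(f), hex(t)) for f, t in valid_portion(frames)])
```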
Full call stacks for PMU Event Based Sampling
Sample 1
# From To Info
0 0x100 0x200 111
1 0x299 0x300 222
2 0x399 0x400 333
3 0x499 0x500 444
4 0x599 0x600 555
5 0x699 0x700 666
6 0 0 0
Sample 2
# From To Info
0 0x899 0x900 888
1 0x999 0xa00 999
2 0x399 0x400 333
3 0x499 0x500 444
4 0x599 0x600 550
5 0x699 0x700 660
6 0x799 0x800 770
• Why don’t we try to stitch LBR-provided stack frames between samples?
Use MSR# and LBR INFO to identify common part
Common part of Sample 1 and Sample 2:
# From To Info
2 0x399 0x400 333
3 0x499 0x500 444
Full call stacks for PMU Event Based Sampling
Call stack (stitched from Sample 1 and Sample 2)
0x999
0x899
0x799
0x699
0x599
0x499
0x399
0x299
0x100
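The stitching idea can be sketched in Python as follows. This is an illustrative reading of the slides, not the authors' implementation: the `LbrEntry` type, the 7-slot buffer, the TOS positions (5 for Sample 1, 1 for Sample 2), and the rule "same slot number with identical From/To/Info" for the common part are assumptions chosen to reproduce the example.

```python
from typing import List, NamedTuple, Optional

class LbrEntry(NamedTuple):
    from_addr: int
    to_addr: int
    info: int

EMPTY = LbrEntry(0, 0, 0)  # assumption: an all-zero slot is unused

def ordered_slots(lbr: List[LbrEntry], tos: int) -> List[int]:
    """Slot numbers of used entries, oldest first (TOS is the newest)."""
    n = len(lbr)
    order = [(tos + 1 + i) % n for i in range(n)]
    return [s for s in order if lbr[s] != EMPTY]

def stitch(prev: List[LbrEntry], prev_tos: int,
           cur: List[LbrEntry], cur_tos: int) -> Optional[List[int]]:
    """Return call sites (outermost first) of the stitched stack, or None."""
    prev_order = ordered_slots(prev, prev_tos)
    cur_order = ordered_slots(cur, cur_tos)
    # Common part: oldest entries of cur that sit unchanged in the same slot of prev.
    common = 0
    for slot in cur_order:
        if slot in prev_order and prev[slot] == cur[slot]:
            common += 1
        else:
            break
    if common == 0:
        return None  # nothing survived between the samples; cannot stitch
    last_common = cur_order[common - 1]
    head = prev_order[: prev_order.index(last_common) + 1]  # older frames + common part
    tail = cur_order[common:]                               # frames pushed since prev
    return [prev[s].from_addr for s in head] + [cur[s].from_addr for s in tail]

sample1 = [LbrEntry(0x100, 0x200, 111), LbrEntry(0x299, 0x300, 222),
           LbrEntry(0x399, 0x400, 333), LbrEntry(0x499, 0x500, 444),
           LbrEntry(0x599, 0x600, 555), LbrEntry(0x699, 0x700, 666), EMPTY]
sample2 = [LbrEntry(0x899, 0x900, 888), LbrEntry(0x999, 0xa00, 999),
           LbrEntry(0x399, 0x400, 333), LbrEntry(0x499, 0x500, 444),
           LbrEntry(0x599, 0x600, 550), LbrEntry(0x699, 0x700, 660),
           LbrEntry(0x799, 0x800, 770)]

stack = stitch(sample1, 5, sample2, 1)
print([hex(a) for a in reversed(stack)])  # innermost first, as on the slide
```

Slots 4 and 5 hold the same addresses in both samples but different Info values, so the sketch treats them as new calls rather than as part of the common portion.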
Issue #1 – more than 32 calls/rets of depth between samples
• We are limited by the number of LBR registers
• With more than 32 calls/returns between two samples, the stacks cannot be stitched
• Suggested solutions:
• Increase sampling frequency based on related PMU events:
• BR_INST_RETIRED.CALL or similar
• ROB_MISC_EVENTS.LBR_INSERTS
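One hypothetical way to act on this suggestion is a small feedback rule: count LBR insertions between two samples and shorten the sampling interval when the count approaches the buffer depth. Everything in the sketch below is an assumption for illustration (the function name, the `headroom` fraction, the linear rescaling, and the minimum interval); the slides only state that the sampling frequency should be raised based on such events.

```python
def next_sampling_interval(interval: int, lbr_inserts: int,
                           lbr_depth: int = 32, headroom: float = 0.5,
                           min_interval: int = 1000) -> int:
    """Shrink the sampling interval when too many LBR entries were inserted.

    interval     -- current sample-after value (events between two samples)
    lbr_inserts  -- LBR insertions counted since the previous sample
    headroom     -- fraction of the LBR depth that may be overwritten
    """
    budget = int(lbr_depth * headroom)
    if lbr_inserts <= budget:
        return interval  # enough older entries survive; keep the interval
    # Assume insertions scale linearly with the interval and rescale to the budget.
    scaled = interval * budget // lbr_inserts
    return max(min_interval, scaled)

print(next_sampling_interval(1_000_000, lbr_inserts=10))  # 1000000 (unchanged)
print(next_sampling_interval(1_000_000, lbr_inserts=64))  # 250000
```

A shorter interval means more samples and therefore more overhead, so a real collector would likely also lengthen the interval again when the insertion count drops.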
[Figure: samples and LBR buffer contents over time]
Issue #2 – thread migration
• Hardware keeps LBR history per CPU core, not per software thread
• OS context switches must be handled in order to keep the LBR history for each thread
• Trigger an interrupt on every user-to-kernel mode switch via the CPU_CLK_UNHALTED.RING0_TRANS PMU event
• Capture the LBRs and TOS when a context switch occurs to retain the thread's call stack history
• Not every kernel-mode transition leads to a context switch (e.g. a syscall), so filtering is required
• The event can be configured to trigger an interrupt on entry to or exit from kernel mode
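The bookkeeping behind these points might look like the sketch below. It is a guess at the collector-side state, not the tool's actual design: the class, its method names, and the filtering rule (keep the kernel-entry snapshot only if a different thread runs on that CPU at kernel exit) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (TOS, [(from, to, info), ...]) as captured from one core
Snapshot = Tuple[int, List[Tuple[int, int, int]]]

class ThreadLbrHistory:
    """Keeps the last LBR snapshot per software thread (hypothetical collector state)."""

    def __init__(self) -> None:
        self._last_by_thread: Dict[int, Snapshot] = {}
        self._pending: Dict[int, Tuple[int, Snapshot]] = {}  # cpu -> (tid, snapshot)

    def on_kernel_entry(self, cpu: int, tid: int, snapshot: Snapshot) -> None:
        """Interrupt on a user-to-kernel transition: capture LBRs and TOS."""
        self._pending[cpu] = (tid, snapshot)

    def on_kernel_exit(self, cpu: int, tid: int) -> None:
        """Keep the entry snapshot only if a different thread now runs on this CPU."""
        entry = self._pending.pop(cpu, None)
        if entry is None:
            return
        prev_tid, snapshot = entry
        if prev_tid != tid:  # a real context switch, not just a syscall
            self._last_by_thread[prev_tid] = snapshot

    def on_sample(self, tid: int, snapshot: Snapshot) -> Optional[Snapshot]:
        """Record a regular sample; return the thread's previous snapshot, if any."""
        previous = self._last_by_thread.get(tid)
        self._last_by_thread[tid] = snapshot
        return previous

history = ThreadLbrHistory()
snap_a: Snapshot = (5, [(0x100, 0x200, 111)])
snap_b: Snapshot = (1, [(0x899, 0x900, 888)])
history.on_kernel_entry(cpu=0, tid=1, snapshot=snap_a)
history.on_kernel_exit(cpu=0, tid=2)            # thread 1 was switched out
previous = history.on_sample(tid=1, snapshot=snap_b)  # thread 1 sampled later, on any CPU
print(previous == snap_a)
```

Because the history is keyed by thread ID rather than by CPU, the previous snapshot is found even when the thread resumes on a different core.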
[Figure: CPU A and CPU B]
Issue #3 – C++ exceptions
• A C++ exception results in a long jump instead of a RET instruction, causing an imbalance between the number of CALL and RET instructions
• As a result, the LBR call stack contains duplicated entries
• Use the stack pointer to check whether an exception occurred?
Other issues
• Low chance of getting the right call depth in case of recursion
• The INFO register value is unlikely to be unique
• LBRs are flushed on entering the C1 state
• Call stack history is lost when the core goes into a C-state
• Capturing LBRs on exit from kernel mode won't work
• LBRs need to be captured on entry to kernel mode
FFmpeg, Intel® Core™ i7 (Skylake), 8 cores
• Running multiple FFmpeg processes in parallel
• Execution time: 29 seconds
• Collection overhead:
• Perf with call stacks: 30-31s, ~7% overhead
• LBR based call stacks with context switch detection: 31-32s, ~8% overhead
* [Stitch point frame] is intentionally added to highlight how the algorithm works
Mozilla Firefox, Intel® Kaby Lake, 8 cores
• Video streaming in the browser
Python, Intel® Core™ i7 (Skylake), 8 cores
• Single process running a Python linter on a subdirectory
• Application time: 60s
• Collection overhead:
• Perf with call stacks: 61s, ~2% overhead
• LBR based call stacks with context switch detection: 62s, ~3% overhead
Outcome
• Hardware-assisted call stacks are a promising solution with a number of advantages compared with traditional methods:
• No operating system dependency; can be applied on RTOSs and custom embedded OSs
• No calling convention and minimal compiler dependency
• Proper handling of C++ exceptions requires additional research