Understanding Performance with Flame Graphs
Vinzent Steinberg
C++ User Group, FIASNovember 11, 2015
1 / 27
Profiling Stack Traces
I Which code-paths are hot?I Timed interrupt collecting current function address, or entire
stack traceI Human readable summary reportI Problem: long output, difficult to comprehend
2 / 27
Flame Graphs
I Visualization of sampled stack tracesI Identify hot code-paths quicklyI Works for any profiler
3 / 27
Credits
Figure 1: Brendan Gregg
I Talk based on Brendan Gregg’s workI Performance architect at NetflixI Pioneered flame graphs for performance analysis during work
All information presented here is from his website:http://www.brendangregg.com/flamegraphs.html
4 / 27
Motivating Problem
I Slow MySQL data baseI Fix it or lose customerI CPU-limited
→ Sample stack traces with profiler
7 / 27
Profilers
I Linux: perf, SystemTap, ktapI Solaris: Dtrace, illumosI FreeBSD: DTraceI Max OS X: DTrace, InstrumentsI Windows: Xperf.exe
8 / 27
Background: Stack Frame
I Stack frame shows location in codeI Usual profiler output:
<module>`<function>+<instruction offset>
I For example:
libc.so.1`mutex_trylock_adaptive+0x112
9 / 27
Background: Stack TraceI List of stack framesI Current frame firstI Example:
libc.so.1`mutex_trylock_adaptive+0x112libc.so.1`mutex_lock_impl+0x165libc.so.1`mutex_lock+0xcmysqld`key_cache_read+0x741mysqld`_mi_fetch_keypage+0x48mysqld`w_search+0x84mysqld`_mi_ck_write_btree+0xa5mysqld`mi_write+0x344mysqld`ha_myisam::write_row+0x43mysqld`handler::ha_write_row+0x8dmysqld`end_write+0x1a3mysqld`evaluate_join_record+0x11e[...]libc.so.1`_lwp_start+0
10 / 27
Using perf_events$ perf record -F 99 -p 13204 -g -- sleep 30$ perf report -n --stdio[...]# Overhead Samples Command Shared Object Symbol# ........ ............ ....... ................. ...................................#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version|--- xen_hypercall_xen_version
check_events||--44.13%-- syscall_trace_enter| tracesys| || |--35.58%-- __GI___libc_fcntl| | || | |--65.26%-- do_redirection_internal| | | do_redirections| | | execute_builtin_or_function| | | execute_simple_command| | | execute_command_internal| | | execute_command| | | execute_while_or_until| | | execute_while_command| | | execute_command_internal| | | execute_command| | | reader_loop| | | main| | | __libc_start_main| | || | --34.74%-- do_redirections| | || | |--54.55%-- execute_builtin_or_function| | | execute_simple_command| | | execute_command_internal| | | execute_command| | | execute_while_or_until| | | execute_while_command| | | execute_command_internal| | | execute_command| | | reader_loop| | | main| | | __libc_start_main| | |
[...]
11 / 27
Problem
I Only 20.4% · 44.1% · 35.6% · 65.3% ≈ 2% of the samplesaccounted for
I This is already the condensed output
12 / 27
Flame Graph - Explained
I Each box represents a function in the stack(merged stack frame)
I y -axis: stack depthI x -axis: spans sample population (ordering meaningless)I Width of boxes: total time spent on-CPU (including ancestry)I Colors for readability onlyI InteractiveI Works well for multithreaded software
15 / 27
Creating Flame Graphs
1. Capture stacks (profiler of your choice)2. Fold stacks, converting
unix`i86_mwait+0xdunix`cpu_idle_mwait+0xf1unix`idle+0x114unix`thread_start+0x8
19486
to
unix`i86_mwait+0xd;...;unix`thread_start+0x8 19486
3. Generate plot
16 / 27
Creating Flame Graphs: Basics
I Sample all processes for 60 s with 99 Hz(avoiding lockstep)
$ git clone https://github.com/brendangregg/FlameGraph$ cd FlameGraph$ perf record -F 99 -a -g -- sleep 60$ perf script
| ./stackcollapse-perf.pl > out.perf-folded$ ./flamegraph.pl out.perf-folded > perf-kernel.svg
17 / 27
Creating Flame Graphs: Tricks
I Filtered flame graphs
[...]$ perf script
| ./stackcollapse-perf.pl > out.perf-folded$ grep -v cpu_idle out.perf-folded
| ./flamegraph.pl > nonidle.svg$ grep ext4 out.perf-folded
| ./flamegraph.pl > ext4internals.svg$ egrep 'system_call.*sys_(read|write)'
out.perf-folded | ./flamegraph.pl > rw.svg
I Demangle C++ names
$ perf script | c++filt| stackcollapse-perf.pl > out.perf-folded
18 / 27
Other Uses
I Other eventsI Device I/OI CPU events (for example cache misses)I Off-CPU eventsI Specific function callsI Memory allocation
I Other values (instead of sample count)I BytesI Latency
19 / 27
Off-CPU Time Problems
I Harder to interpretI On-CPU time limited, off-CPU unlimitedI Interpretation depends on what the code is doing
I Uses tracing scheduler eventsI Higher overhead
I Distorts resultsI Hurts the target
I More accurate than sampling
21 / 27
Differential Flame Graph
Figure 8: Difference to previous Flame Graph: Blue is faster, red is slower.
25 / 27
More Information
I http://www.brendangregg.com/flamegraphs.html(links to everything)
I Talk by Brendan Gregg at LISA 13
26 / 27