course notes on software monitoring & tracinginstruction set simulation, performance counters,...

Course Notes on Software Monitoring & TracingMartin Monperrus

KTH Royal Institute of Technology

Created April 2019, last updated April 27, 2020

“Program execution traces provide the most intimate details of a program’s dynamic behavior.” Bhansaliet al. 2006

This document contains course notes about monitoring and tracing of software execution (prepared forthe KTH course “Automated Testing and DevOps”). The monitoring in production aspect of DevOpsis essential to close the infinite loop.

Feedback: [email protected]: https://www.monperrus.net/martin/monitoring.pdf

Contents1 Core Concepts 3

2 Types of traces 42.1 Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Instruction stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.4 Memory trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.5 Call tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.6 Dynamic call graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.7 Event/message trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.7.1 Crashes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.7.2 HTTP traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.7.3 Database connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.7.4 Distributed system tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Implementation techniques 8

4 Monitoring Tools 84.1 Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84.2 Integrated / System level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.3 Native . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.4 Javascript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.5 Linux Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.6 Trace format & trace analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.7 Distributed Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94.8 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

5 Visualization 10

1

https://github.com/KTH/devops-course/

https://www.monperrus.net/martin/monitoring.pdf

6 Intercession 11

2

1 Core ConceptsExecution trace:

• an execution trace is a set of events resulting from the execution of a program

• it can be huge: “Nine billion instruction boot of FreeBSD”1

• it is most most of the time incomplete (it only gives partial information about the execution, itis not able to recompute the final result or recreate an intermediate program state), unless youuse time-travel debugging.

• a trace has different granularities, it can be for a single thread, a single program or for a systemdistributed over thousands of machines.

Usages of monitoring:

• Testing: code coverage

• Problem and anomaly detection

• Performance analysis and optimization

• Debugging (the absolute monitoring is time-travel debugging and execution replay)

• Security: intrusion detection (defense), reverse engineering, see for example this post

• Art (eg Web stalker)

Monitoring Terminology

• Monitoring and tracing is the art of collecting execution traces. Both terms are considered equalin this document.

• It is related to logging: monitoring is meant to be automatic while logging is traditionally manual.

• Monitoring can be achieved by a number of different techniques, including hardware interrupts,instruction set simulation, performance counters, operating system hooks (see below), code in-strumentation, dedicated runtime support (in interpreters, VMs, libraries).

• dynamic program analysis (or dynamic analysis for short) is based on monitoring to infer someknowledge, such as coverage or invariants.

• profiling is a specific kind of monitoring dedicated to performance analysis.

Core Trade-offs

• Performance overhead in production

• Amount of data (beware of the monitoring data lake)

• Privacy (we must not log passwords in clear)

Observability:

• In dynamic program analysis, observability is the property of a system referring to the ability ofcollecting information of interest.

• Observability has an engineering cost (it can take months to otain a specific kind of information).

• Observability has a performance cost (monitoring can slow down the whole process1).

• Observation has a storage cost (traces are quickly huge).1https://github.com/panda-re/panda

3

https://en.wikipedia.org/wiki/Time_travel_debugging

https://en.wikipedia.org/wiki/Code_coverage


https://www.google.com/search?q=execution+replay

https://blog.quarkslab.com/exploring-execution-trace-analysis.html

http://archive.rhizome.org/anthology/webstalker.html

https://en.wikipedia.org/wiki/Tracing_(software)

https://en.wikipedia.org/wiki/Log_file

https://en.wikipedia.org/wiki/hardware_interrupt

https://en.wikipedia.org/wiki/instruction_set_simulator

https://en.wikipedia.org/wiki/Hardware_performance_counter

https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)


https://en.wikipedia.org/wiki/Dynamic_program_analysis

https://en.wikipedia.org/wiki/Profiling_(computer_programming)

https://en.wikipedia.org/wiki/Data_lake

https://github.com/panda-re/panda

Figure 1: Example of code coverage

2 Types of traces2.1 Logs

• It is straightforward to do logging

• it is harder to do proper centralized logging

– See centralized log management tools such as https://github.com/Graylog2/graylog2-server,fluentd, Apache Flume

– A known practice is to put all the logs into ElasticSearch https://www.elastic.co/blog/get-system-logs-and-metrics-into-elasticsearch-with-beats-system-modules

• Most application-level monitoring is done with logging. However, there exists more genericsolutions such as JMX

2.2 Coverage• The code coverage, also known as spectrum, is an interesting aspect of execution.

• Open question: for a given software stack, how to do it in production with acceptable lowoverhead(remove unused code, A/B testing code, decrease technical debt).

2.3 Instruction streamAn instruction stream is a sequence of consecutive events at a low-level (machine code, byte code), seefor example this instream trace

• Intel PIN generates instruction opcode traces

• Bytecode traces (-XX:+TraceBytecodes, see bytestacks) hello world trace

• A branch trace logs successful branch instructions

• If the trace is completely captured, it enables time-travel debugging.

4

https://github.com/Graylog2/graylog2-server

https://www.fluentd.org/

https://flume.apache.org/

https://www.elastic.co/blog/get-system-logs-and-metrics-into-elasticsearch-with-beats-system-modules

https://www.elastic.co/blog/get-system-logs-and-metrics-into-elasticsearch-with-beats-system-modules

https://en.wikipedia.org/wiki/Java_Management_Extensions

https://en.wikipedia.org/wiki/Code_coverage

https://raw.githubusercontent.com/monperrus/traces/master/x86_64-instruction-trace.txt

https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

https://github.com/cl4es/bytestacks

https://raw.githubusercontent.com/monperrus/traces/master/jvm-bytecode.txt

https://en.wikipedia.org/wiki/Branch_trace


Figure 2: A flame graph is a nice visualization of a call tree

2.4 Memory trace• TracerGrind is a Valgrind tool (plugin) which can generate execution traces of a running process.

example trace

2.5 Call tree• The basic block of a call tree is a call trace (Kernel call trace with ftrace)

• A call tree is the concatenation of all consecutive call traces in a tree.

• Example: https://raw.githubusercontent.com/monperrus/traces/master/perf.txt

• Can be visualized http://www.valgrind.org/docs/manual/manual-core.html#manual-core.xtree, eg in a flame graph see Figure 2

• Capturing call tree in Java https://github.com/castor-software/yajta

• Capturing call tree in Javascript https://maierfelix.github.io/Iroh/

2.6 Dynamic call graphA call graph is a graph representation of a call tree.

• Jdcallgraph: https://github.com/dkarv/jdcallgraph/

• Java-callgraph: https://github.com/gousiosg/java-callgraph

5

http://www.brendangregg.com/flamegraphs.html

https://github.com/SideChannelMarvels/Tracer/tree/master/TracerGrind

https://raw.githubusercontent.com/monperrus/traces/master/tracegrind.txt

https://github.com/monperrus/traces/blob/master/ftrace.txt

https://raw.githubusercontent.com/monperrus/traces/master/perf.txt

http://www.valgrind.org/docs/manual/manual-core.html#manual-core.xtree

http://www.valgrind.org/docs/manual/manual-core.html#manual-core.xtree

https://github.com/castor-software/yajta

https://maierfelix.github.io/Iroh/

https://github.com/dkarv/jdcallgraph/

https://github.com/gousiosg/java-callgraph

2.7 Event/message trace2.7.1 Crashes

It is a common practice to collect crashes in the wild, with plenty of open-source and commercialtechnology (many of them being under the umbrella term application performance management.Crash handlers are all possible levels in the stack:

• at the kernel level (kernel oops)

• at the VM level Thread.setDefaultUncaughtExceptionHandler.

• at the cloud level openstack

Random sample of technology:

• https://sentry.io/welcome/

• https://github.com/mozilla-services/socorro

• https://chromium.googlesource.com/crashpad/crashpad/+/master/README.md

2.7.2 HTTP traces

• https://github.com/buger/goreplay/

• https://glowroot.org

2.7.3 Database connections

Example https://glowroot.org, https://github.com/onecodex/chrononaut

2.7.4 Distributed system tracing

• Distributed system tracing consists of tracing messages between machines / actors / microser-vices.

• Key enabler: use the same network library everywhere

• Core idea: tag all requests with a unique identifier

• For sake of performance, one can trace only a sample of requests

• See tools below

6

https://en.wikipedia.org/wiki/Crash_reporter

https://en.wikipedia.org/wiki/Application_performance_management

https://en.wikipedia.org/wiki/Linux_kernel_oops

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Thread.html#setDefaultUncaughtExceptionHandler(java.lang.Thread.UncaughtExceptionHandler)

https://docs.openstack.org/blazar/queens/user/compute-host-monitor.html

https://sentry.io/welcome/

https://github.com/mozilla-services/socorro

https://chromium.googlesource.com/crashpad/crashpad/+/master/README.md

https://github.com/buger/goreplay/

https://glowroot.org

https://glowroot.org

https://github.com/onecodex/chrononaut

Figure 3: Dashboard of Vizceral for animated traffic graphs youtube1 youtube2

7

https://www.youtube.com/watch?v=KVbTjlZ0sfE

https://www.youtube.com/watch?v=MYHf_BXWuOc

3 Implementation techniques• instrumenting either the program source code or its binary executable form.

– Source code instrumentation, eg https://github.com/INRIA/spoon/, hundreds of tech-niques

– Binary code instrumentation, eg https://asm.ow2.io/, hundreds of techniques, see gdoc,incl. standard compiler techniques (gcov, lvvm/xray), aspect technology

• using an instruction set simulator or an interpreter

• using kernel support (post), see subsection 4.5

• using hardware support (interrupts and performance counter)

• using debug interfaces

– JVM: Java Debug Interface see https://docs.oracle.com/javase/7/docs/jdk/api/jpda/jdi/

– Browser/Node/V8: Chrome Remote Debugging Protocol see https://www.igvita.com/2012/04/09/driving-google-chrome-via-websocket-api/ (working example)

• native support in virtual machine and interpreters

– NodeJs: node –trace

– JVM:∗ jstack -l <pid>∗ JVM options, see for instance https://github.com/cl4es/bytestacks∗ JVM Tool Interface and agents,∗ jconsole is a graphical tool for monitoring a JVM https://openjdk.java.net/tools/

svc/jconsole/∗ Java agents Martin’s list of Java agents∗ Visualvm, see https://visualvm.github.io/

4 Monitoring Tools4.1 Java

• Clover: https://bitbucket.org/openclover/clover (coverage)

• Jacoco: https://github.com/jacoco/jacoco (coverage)

• Yajta: https://github.com/castor-software/yajta (execution tree)

• Btrace: https://github.com/btraceio/btrace

• Kieker: https://github.com/kieker-monitoring/kieker

• Bytefrog: https://github.com/codedx/bytefrog

• Bytestacks: https://github.com/cl4es/bytestacks

• Jdcallgraph: https://github.com/dkarv/jdcallgraph/

• Java-callgraph: https://github.com/gousiosg/java-callgraph

8


https://github.com/INRIA/spoon/

https://asm.ow2.io/

https://docs.google.com/document/d/1ZcckSFCQalSt7oxLfw_3fbyJPkczqKHV0e9BmKFRu8M/edit?usp=sharing

http://gcc.gnu.org/onlinedocs/gcc/Gcov.html

https://llvm.org/docs/XRay.html

https://en.wikipedia.org/wiki/Aspect-oriented_programming

https://en.wikipedia.org/wiki/Instruction_set_simulator

https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html

https://en.wikipedia.org/wiki/Hardware_performance_counter

https://docs.oracle.com/javase/7/docs/jdk/api/jpda/jdi/

https://docs.oracle.com/javase/7/docs/jdk/api/jpda/jdi/

https://www.igvita.com/2012/04/09/driving-google-chrome-via-websocket-api/

https://www.igvita.com/2012/04/09/driving-google-chrome-via-websocket-api/

https://github.com/Jacarte/ws_chrome_debugging_example


https://docs.oracle.com/javase/8/docs/technotes/guides/jvmti/index.html

https://openjdk.java.net/tools/svc/jconsole/

https://openjdk.java.net/tools/svc/jconsole/

https://github.com/monperrus/misc/blob/master/awesome-java-agents.md

https://visualvm.github.io/

https://bitbucket.org/openclover/clover

https://github.com/jacoco/jacoco

https://github.com/castor-software/yajta

https://github.com/btraceio/btrace

https://github.com/kieker-monitoring/kieker

https://github.com/codedx/bytefrog


https://github.com/dkarv/jdcallgraph/

https://github.com/gousiosg/java-callgraph

• Java Flight Recorder: https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/about.htm

• Glowroot: https://glowroot.org/

• Anttracks: https://glowroot.org/

• Android/Dalvik: https://github.com/FrenchYeti/dexcalibur

4.2 Native• gcov (coverage)

• LLVM

– llvm-tracer: https://github.com/ysshao/LLVM-Tracer– Llvm-cov: https://llvm.org/docs/CommandGuide/llvm-cov.html– Panda: https://github.com/panda-re/panda

4.3 Javascript• Iroh: https://maierfelix.github.io/Iroh/

• Istanbul: https://istanbul.js.org/

4.4 Linux Kernel• Linux process

– strace– tracer (recursive tracer), strace-graph trace-children

• perf, ftrace, bcc, dtrace, dtrace4linux, LTTng, strace

• See also post https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

4.5 Trace format & trace analysis tools• See list of formats at https://www.eclipse.org/tracecompass/

• Common trace format: http://diamon.org/ctf/

• trace-diff: https://github.com/ryosuzuki/trace-diff

• Babeltrace: https://github.com/efficios/babeltrace

4.6 Distributed Tracing• https://github.com/openzipkin/zipkin/

• https://github.com/Netflix/zuul

• https://github.com/jaegertracing/jaeger

9

https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/about.htm

https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/about.htm

https://glowroot.org/


https://github.com/FrenchYeti/dexcalibur

https://gcc.gnu.org/onlinedocs/gcc/Gcov.html

https://github.com/ysshao/LLVM-Tracer

https://llvm.org/docs/CommandGuide/llvm-cov.html



https://istanbul.js.org/

https://en.wikipedia.org/wiki/Strace

https://github.com/jhuapl-saralab/tracer/

https://github.com/strace/strace/blob/master/strace-graph

https://github.com/alexvelea/trace-children

https://perf.wiki.kernel.org/index.php/Main_Page

https://www.kernel.org/doc/Documentation/trace/ftrace.txt

https://github.com/iovisor/bcc/

https://en.wikipedia.org/wiki/DTrace

https://github.com/dtrace4linux/linux

https://en.wikipedia.org/wiki/LTTng

https://en.wikipedia.org/wiki/Strace

https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

https://www.eclipse.org/tracecompass/

http://diamon.org/ctf/

https://github.com/ryosuzuki/trace-diff

https://github.com/efficios/babeltrace

https://github.com/openzipkin/zipkin/

https://github.com/Netflix/zuul

https://github.com/jaegertracing/jaeger

Figure 4: Visualization of execution by Iroh.js. Live video at https://maierfelix.github.io/3d-code-vizard/

4.7 Integrated / System level• Nagios

• Zabbix

4.8 Others• Pin https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

– https://github.com/SideChannelMarvels/Tracer– https://github.com/hasherezade/tiny_tracer

• Uftrace: https://github.com/namhyung/uftrace

• Valgrind: http://valgrind.org/

– https://github.com/SideChannelMarvels/Tracer/tree/master/TracerGrind– https://github.com/wmkhoo/taintgrind

• Instream: https://github.com/dwks/instream/ (x86 trace)

• Panda: https://github.com/panda-re/panda

• DynamoRio: http://dynamorio.org/. See module ‘DrCov’ for coverage http://dynamorio.org/docs/page_dr-cov.html

• Dyninst: https://github.com/dyninst/dyninst

• For Rust:

– https://github.com/anelson/tracers

10

https://maierfelix.github.io/3d-code-vizard/

https://maierfelix.github.io/3d-code-vizard/

https://en.wikipedia.org/wiki/Nagios

https://zabbix.org/wiki/Main_Page

https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

https://github.com/SideChannelMarvels/Tracer

https://github.com/hasherezade/tiny_tracer

https://github.com/namhyung/uftrace

http://valgrind.org/

https://github.com/SideChannelMarvels/Tracer/tree/master/TracerGrind

https://github.com/wmkhoo/taintgrind

https://github.com/dwks/instream/


http://dynamorio.org/

http://dynamorio.org/docs/page_drcov.html

http://dynamorio.org/docs/page_drcov.html

https://github.com/dyninst/dyninst

https://github.com/anelson/tracers

5 Visualization• https://maierfelix.github.io/Iroh/

• https://glowroot.org/

• https://github.com/ncatlin/rgat

• https://thlorenz.com/visulator/

• https://github.com/adrianco/spigo

• https://www.explorviz.net/

• https://github.com/brendangregg/FlameGraph

• https://github.com/chrishantha/jfr-flame-graph

• https://github.com/epickrram/grav

6 Intercession• Once you do monitoring, the next step is to do intercession (changing/impacting the execution)

• Behavioral change: hot patching

• Fault injection: chaos engineering

• Synthetic monitoring is active monitoring in production, using synthetic workloads.

• Security: capture and block calls to ‘eval’ (see https://maierfelix.github.io/Iroh/)

11



https://github.com/ncatlin/rgat

https://thlorenz.com/visulator/

https://github.com/adrianco/spigo

https://www.explorviz.net/

https://github.com/brendangregg/FlameGraph

https://github.com/chrishantha/jfr-flame-graph

https://github.com/epickrram/grav

https://en.wikipedia.org/wiki/Synthetic_monitoring


course notes on software monitoring & tracinginstruction set simulation, performance counters,...

Documents