TRANSCRIPT
1
Extensible Distributed Tracing from Kernels to Clusters
Úlfar Erlingsson, Google Inc.
Marcus Peinado, Microsoft Research
Simon Peter, Systems Group, ETH Zurich
Mihai Budiu, Microsoft Research
Fay
2
Wouldn’t it be nice if…
• We could know what our clusters were doing?
• We could ask any question… easily, using one simple-to-use system.
• We could collect answers extremely efficiently… so cheaply we might even ask continuously.
3
Let’s imagine...
• Applying data-mining to cluster tracing
• Bag-of-words technique
– Compare documents w/o structural knowledge
– N-dimensional feature vectors
– K-means clustering
• Can apply to clusters, too!
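As a rough illustration of the bag-of-words idea (in plain Python, not Fay's actual query language; all machine names and event data below are invented), each machine is reduced to a frequency vector of the "words" it emits, here hypothetical syscall names, with all ordering and structure deliberately ignored:

```python
from collections import Counter

def feature_vector(events, vocabulary):
    """Bag-of-words: count how often each 'word' (e.g. a syscall name)
    occurs, ignoring all structure and ordering of the events."""
    counts = Counter(events)
    return [counts[w] for w in vocabulary]

# Hypothetical per-machine syscall traces.
traces = {
    "m1": ["read", "read", "write"],
    "m2": ["read", "write", "write"],
    "m3": ["mmap", "mmap", "mmap"],
}
vocab = ["read", "write", "mmap"]
vectors = {m: feature_vector(t, vocab) for m, t in traces.items()}
print(vectors["m3"])  # [0, 0, 3]
```

The resulting vectors can then be fed to k-means to group machines with similar behavior.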
4
Cluster-mining with Fay
• Automatically categorize cluster behavior, based on system call activity
5
Cluster-mining with Fay
• Automatically categorize cluster behavior, based on system call activity
– Without measurable overhead on the execution
– Without any special Fay data-mining support
6
Fay K-Means Behavior-Analysis Code

Vector Nearest(Vector pt, Vectors centers) {
  var near = centers.First();
  foreach (var c in centers)
    if (Norm(pt - c) < Norm(pt - near))
      near = c;
  return near;
}

var kernelFunctionFrequencyVectors =
  cluster.Function(kernel, "syscalls!*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt, (k, g) => new { key = k, count = g.Count() });

Vectors OneKMeansStep(Vectors vs, Vectors cs) {
  return vs.GroupBy(v => Nearest(v, cs))
           .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Vectors KMeans(Vectors vs, Vectors cs, int K) {
  for (int i = 0; i < K; ++i)
    cs = OneKMeansStep(vs, cs);
  return cs;
}
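A minimal Python transliteration of the analysis half of this query may help; the tracing half has no analogue outside Fay, so the vectors here are plain tuples and the input data is invented:

```python
def nearest(pt, centers):
    # The center with the smallest (squared) Euclidean distance to pt,
    # mirroring the Nearest() helper in the Fay query.
    return min(centers, key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, c)))

def one_kmeans_step(vs, cs):
    # Group each vector with its nearest center, then average each group,
    # mirroring the GroupBy/Aggregate pipeline of OneKMeansStep().
    groups = {}
    for v in vs:
        groups.setdefault(nearest(v, cs), []).append(v)
    return [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups.values()]

def kmeans(vs, cs, k_steps):
    for _ in range(k_steps):
        cs = one_kmeans_step(vs, cs)
    return cs

# Two obvious clusters around (0, 0.5) and (10, 10.5).
vs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sorted(kmeans(vs, [(0.0, 0.0), (10.0, 10.0)], 5)))
# [(0.0, 0.5), (10.0, 10.5)]
```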
8
Fay vs. Specialized Tracing
• Could’ve built a specialized tool for this
– Automatic categorization of behavior (Fmeter)
• Fay is general, but can efficiently do
– Tracing across abstractions, systems (Magpie)
– Predicated and windowed tracing (Streams)
– Probabilistic tracing (Chopstix)
– Flight recorders, performance counters, …
9
Key Takeaways
Fay: Flexible monitoring of distributed executions
– Can be applied to existing, live Windows servers
1. Single query specifies both tracing & analysis
– Easy to write & enables automatic optimizations
2. Pervasively data-parallel, scalable processing
– Same model within machines & across clusters
3. Inline, safe machine-code at tracepoints
– Allows us to do computation right at the data source
10
K-Means: Single, Unified Fay Query

Vector Nearest(Vector pt, Vectors centers) {
  var near = centers.First();
  foreach (var c in centers)
    if (Norm(pt - c) < Norm(pt - near))
      near = c;
  return near;
}

var kernelFunctionFrequencyVectors =
  cluster.Function(kernel, "*")
    .Where(evt => evt.time < Now.AddMinutes(3))
    .Select(evt => new { Machine = fay.MachineID(),
                         Interval = evt.Cycles / CPS,
                         Function = evt.CallerAddr })
    .GroupBy(evt => evt, (k, g) => new { key = k, count = g.Count() });

Vectors OneKMeansStep(Vectors vs, Vectors cs) {
  return vs.GroupBy(v => Nearest(v, cs))
           .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Vectors KMeans(Vectors vs, Vectors cs, int K) {
  for (int i = 0; i < K; ++i)
    cs = OneKMeansStep(vs, cs);
  return cs;
}
11
Fay is Data-Parallel on Cluster
• View trace query as distributed computation
• Use cluster for analysis
12
Fay is Data-Parallel on Cluster
System call trace events
• Fay does early aggregation & data reduction
• Fay knows what’s needed for later analysis
13
Fay is Data-Parallel on Cluster
System call trace events
• Fay does early aggregation & data reduction
K-Means analysis
• Fay builds an efficient processing plan from the query
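The early-aggregation idea can be sketched as follows (plain Python, with an invented event stream and a made-up interval length): instead of shipping every trace event off the machine, each node keeps only the per-interval counts that the later analysis actually needs:

```python
from collections import Counter

def raw_events():
    # Hypothetical stream of (cycle_count, function) trace events on one node.
    yield (100, "NtReadFile")
    yield (150, "NtReadFile")
    yield (900, "NtWriteFile")

CYCLES_PER_INTERVAL = 500  # illustrative interval length

def early_aggregate(events):
    """Reduce at the data source: a small (interval, function) -> count
    table is shipped for analysis instead of every individual event."""
    counts = Counter()
    for cycles, fn in events:
        counts[(cycles // CYCLES_PER_INTERVAL, fn)] += 1
    return dict(counts)

print(early_aggregate(raw_events()))
# {(0, 'NtReadFile'): 2, (1, 'NtWriteFile'): 1}
```

Three events collapse to two counters here; at cluster scale the same reduction is what keeps tracing overhead low.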
14
Fay is Data-Parallel within Machines
• Early aggregation
• Inline, in the OS kernel
• Reduces dataflow & kernel/user transitions
• Data-parallel per core/thread
15
Processing w/o Fay Optimizations
• Collect data first (on disk)
• Reduce later
• Inefficient; can suffer data overload
(Diagram: K-Means system-call collection feeding K-Means clustering)
16
Traditional Trace Processing
• First log all data (a deluge)
• Process later (centrally)
• Compose tools via scripting
(Diagram: K-Means system-call collection feeding K-Means clustering)
17
Takeaways so far
Fay: Flexible monitoring of distributed executions
1. Single query specifies both tracing & analysis
2. Pervasively data-parallel, scalable processing
18
Safety of Fay Tracing Probes
• A variant of XFI used for safety [OSDI’06]
– Works well in the kernel or any address space
– Can safely use existing stacks, etc.
– Used instead of a language interpreter (as in DTrace)
– Allows arbitrary, efficient, stateful computation
• Probes can access thread-local/global state
• Probes can try to read any address
– I/O registers are protected
19
Key Takeaways, Again
Fay: Flexible monitoring of distributed executions
1. Single query specifies both tracing & analysis
2. Pervasively data-parallel, scalable processing
3. Inline, safe machine-code at tracepoints
20
Installing and Executing Fay Tracing
• Fay runtime on each machine
• Fay module in each traced address space
• Tracepoints at hotpatched function boundaries
(Diagram: a query creates XFI-verified probes, installed by hotpatching into the kernel and user-space target address spaces; the Fay tracing runtime collects probe output via ETW; ~200 cycles.)
21
Low-level Code Instrumentation
Module with a traced function Foo:

Caller:
  ...
  e8ab62ffff    call Foo
  ...

  ff1508e70600  call [Dispatcher]
Foo:
  ebf8          jmp Foo-6
  cccccc        (int 3 padding)
Foo2:
  57            push rdi
  ...
  c3            ret

• Replace 1st opcode of functions
22
Low-level Code Instrumentation
Module with a traced function Foo:

Caller:
  ...
  e8ab62ffff    call Foo
  ...

  ff1508e70600  call [Dispatcher]
Foo:
  ebf8          jmp Foo-6
  cccccc        (int 3 padding)
Foo2:
  57            push rdi
  ...
  c3            ret

Fay platform module:

Dispatcher:
  t = lookup(return_addr)
  ...
  call t.entry_probes
  ...
  call t.Foo2_trampoline
  ...
  call t.return_probes
  ...
  return  /* to after call Foo */

• Replace 1st opcode of functions
• Fay dispatcher called via trampoline
23
Low-level Code Instrumentation
Module with a traced function Foo:

Caller:
  ...
  e8ab62ffff    call Foo
  ...

  ff1508e70600  call [Dispatcher]
Foo:
  ebf8          jmp Foo-6
  cccccc        (int 3 padding)
Foo2:
  57            push rdi
  ...
  c3            ret

Fay platform module:

Dispatcher:
  t = lookup(return_addr)
  ...
  call t.entry_probes
  ...
  call t.Foo2_trampoline
  ...
  call t.return_probes
  ...
  return  /* to after call Foo */

Fay probes: PF3, PF4, PF5 (each sandboxed with XFI)

• Replace 1st opcode of functions
• Fay dispatcher called via trampoline
• Fay calls the function, and entry & exit probes
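The control flow the dispatcher imposes can be mimicked in ordinary code. This Python sketch does no hotpatching or machine-code work, and every name in it is invented; it only shows the shape of the dispatch: entry probes, then the relocated original function, then return probes:

```python
entry_probes = []
return_probes = []

def dispatch(original_fn, *args):
    """Mimic the Fay dispatcher: run entry probes, call the relocated
    original function (the 'Foo2' trampoline target), run return probes."""
    for p in entry_probes:
        p(args)
    result = original_fn(*args)
    for p in return_probes:
        p(result)
    return result

calls = []
entry_probes.append(lambda args: calls.append(("enter", args)))
return_probes.append(lambda r: calls.append(("exit", r)))

def foo(x):          # stands in for the traced function Foo
    return x + 1

print(dispatch(foo, 41))  # 42
print(calls)  # [('enter', (41,)), ('exit', 42)]
```

The real dispatcher does this per-tracepoint lookup and probe invocation in a few hundred cycles, inline in the traced address space.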
24
What’s Fay’s Performance & Scalability?

• Fay adds 220 to 430 cycles per traced function
• Fay adds 180% CPU to trace all kernel functions
• Both approx. 10x faster than DTrace, SystemTap

(Charts: null-probe overhead in cycles, and whole-system slowdown, for Fay, Solaris DTrace, OS X DTrace, and SystemTap on Linux. Slowdown: Fay 2.8x, Solaris DTrace 17.2x, OS X DTrace 26.7x; SystemTap crashed.)
25
Fay Scalability on a Cluster
• Fay tracing memory allocations, in a loop:
– Ran workload on a 128-node, 1024-core cluster
– Spread work over 128 to 1,280,000 threads
– 100% CPU utilization
• Fay overhead was 1% to 11% (mean 7.8%)
26
More Fay Implementation Details
• Details of query-plan optimizations
• Case studies of different tracing strategies
• Examples of using Fay for performance analysis
• Fay is based on LINQ and Windows specifics
– Could build on Linux using Ftrace, Hadoop, etc.
• Some restrictions apply currently
– E.g., skew towards batch processing due to Dryad
27
Conclusion
• Fay: Flexible tracing of distributed executions
• Both expressive and efficient
– Unified trace queries
– Pervasive data-parallelism
– Safe machine-code probe processing
• Often as efficient as purpose-built tools
28
Backup
29
A Fay Trace Query
from io in cluster.Function("iolib!Read")
where io.time < Now.AddMinutes(5)
let size = io.Arg(2) // request size in bytes
group io by size / 1024 into g
select new { sizeInKilobytes = g.Key,
             countOfReadIOs = g.Count() };
• Aggregates read activity in the iolib module
• Across the cluster, both user-mode & kernel
• Over 5 minutes
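In plain Python, with an invented list of observed read sizes, the aggregation this query performs looks like:

```python
from collections import Counter

# Hypothetical byte sizes of Read calls observed across the cluster.
read_sizes = [1500, 1800, 3000, 5000, 5100, 9000]

# Group into 1 KB buckets and count, as in `group io by size/1024`.
histogram = Counter(size // 1024 for size in read_sizes)
print(sorted(histogram.items()))
# [(1, 2), (2, 1), (4, 2), (8, 1)]
```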
30
A Fay Trace Query
from io in cluster.Function("iolib!Read")
where io.time < Now.AddMinutes(5)
let size = io.Arg(2) // request size in bytes
group io by size / 1024 into g
select new { sizeInKilobytes = g.Key,
             countOfReadIOs = g.Count() };

• Specifies what to trace
– 2nd argument of the Read function in iolib
• And how to aggregate
– Group into KB-size buckets and count

(Histogram: read counts per request-size bucket: 1024, 2048, 4096, 8192.)