snailtrail: generalizing critical paths for online ... · flink, timely ~10% overhead compared to...

29
SnailTrail Generalizing Critical Paths for Online Analysis of Distributed Dataflows Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer Chothia, and Timothy Roscoe Supported by

Upload: others

Post on 12-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

SnailTrail

Generalizing Critical Paths for Online

Analysis of Distributed Dataflows

Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki

Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer

Chothia, and Timothy Roscoe

Supported by

Page 2: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Reference ApplicationSnailTrail

Trace ingestion

Performance

summaries

Program activity graph

construction

Critical participation

computation and

activity ranking

trace streams

Profiling data

Flink,

Spark,

TensorFlow,

Heron,

Timely dataflow, ...

SnailTrail: Diagnosing latency issues in dataflows

“Where is the latency bottleneck in my computation?”

2

Page 3: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

SnailTrail works online with minimal instrumentation

3

Instrumentation complexity

Online

SnailTrail

Coz:

Causal

Profiling

SOSP’15

Pivot

Tracing

SOSP’15

OfflineLess More

Vscope

Middleware

’12

Stitch

OSDI’16

Page 4: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Example 1: Metrics in Flink’s dashboard

4

Page 5: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Example 2: Task Scheduling in Spark

Driver

W1

W2

W3

5

SnailTrail, critical participation

Window

Conventional profiling

Window

% tim

e

Page 6: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The real-world is more complex

Many tasks, activities, operators,

dependencies

Long-running, dynamic workloads

Bottlenecks not isolated

6Credits: Frank McSherry, “Tracking progress in timely dataflow”

Page 7: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Conventional profiling can indicate wrong bottleneck

W1

W2

W3serialization

waiting

deserializationprocessing

processing

7

Page 8: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Conventional profiling can indicate wrong bottleneck

W1

W2

W3serialization

waiting

deserializationprocessing

processing

8

Page 9: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

A quick review of

critical path analysis

9

Page 10: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The program activity graph

W1

W2

W3

u v

Nodes are timestamped events:

start or end of a worker activity

u = {timestamp: t,worker: 2

}

10

Page 11: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The program activity graph

W1

W2

W3d

u v

Edges represent typed

activities

(u, v) = {type: serialization,operator: map,

}

11

Page 12: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The program activity graph

W1

W2

W3

12

Waiting activities

are never on a

critical path

Activities express

dependencies

All workers

terminate

Page 13: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The program activity graph

W1

W2

W3

Which activities delay the overall execution?

13

Page 14: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Classical critical path analysis

W1

W2

W3

14

Page 15: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

What is the equivalent

of a critical path for

continuously running,

distributed streaming

applications,

with potentially

unbounded input?

There might be no “job end”

The program activity graph and

critical paths change continuously

Profiling information can quickly

become stale

15

Page 16: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Online critical path analysis

16

Page 17: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

SnailTrail: Online analysis of trace windows

Input stream Output stream

17

Periodic

windows

Trace stream

SnailTrail

Windowed

performance

summaries

Page 18: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Program activity graph window

18

W1

W2

W3

ts te

All critical paths have equal length

Page 19: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

W1

W2

W3

ts te

1

2

3

4

5

6

7

8

9

Cannot enumerate

all critical paths

19

Impractical!

Spark trace: 1021 critical paths

in 10 second window

Page 20: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Sampling critical paths misses critical activities

20

W1

W2

W3

ts te

Page 21: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

We rank activities

across all critical paths

to capture their

relative importance.

Intuition: The more critical paths go

through an activity, the more critical

it might be

21

Page 22: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Counting over enumerating

22

W1

W2

W3

ts te

0

6 9

6

Page 23: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

The Critical Participation metric

Fraction of an edge’s time contribution across all critical paths

Critical paths

through edgeEdge weight

Total number of

critical paths Can be computed without

critical path enumeration!

23

Page 24: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Reference ApplicationSnailTrail

Trace ingestion

Performance

summaries

Program activity graph

construction

Critical participation

computation and

activity ranking

trace streams

Profiling data

Flink,

Spark,

TensorFlow,

Heron,

Timely dataflow, ...

SnailTrail in action

24

Page 25: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

SnailTrail

Trace ingestion

Performance

summaries

Program activity graph

construction

Critical participation

computation and

activity ranking

trace streams

Interpreting critical participation-based summaries

Stream of tuples:

(Activity type, Operator, Worker, …, Critical

participation)

Examples:

Activity type bottleneck analysis

Operator bottleneck analysis

(More in the paper!)

25

Page 26: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Activity type bottleneck analysis (Spark)

Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s windows

Conventional profiling SnailTrail, critical participation

26

Window Window

% tim

e

Page 27: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Operator bottleneck analysis (Flink)

27

Conventional profiling SnailTrail, critical participationRead

sentences

Flatmap:

Split words

Count words

Increase flatmap’s

parallelism!

Apache Flink: Dhalion WordCount Benchmark, 10 workers, 1s windows

% tim

e p

roce

ssin

g

Seconds Seconds

Page 28: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

SnailTrail performance

Low instrumentation overhead

Spark, TensorFlow

No observed overhead

Flink, Timely

~10% overhead compared to

logging disabled

High throughput

1.2 million events/s

8 workers

Always online

1s of traces in 6ms (100x)

256s of traces in < 25s (10x)

SnailTrail on Intel Xeon E5-4640, 2.40GHz, 32

cores, 512GiB RAM

Trace: Apache Flink Sessionization, 48 workers,

1s-256s windows

28

Page 29: SnailTrail: Generalizing Critical Paths for Online ... · Flink, Timely ~10% overhead compared to logging disabled High throughput 1.2 million events/s 8 workers Always online 1s

Summary

Conventional profiling is misleading

CP-metric: online critical path analysis

SnailTrail: online CP-based summaries

29