counting elements in streams

47
Jamie Grier @jamiegrier [email protected] Apache Flink TM Counting elements in streams

Upload: jamie-grier

Post on 16-Apr-2017

548 views

Category:

Engineering


0 download

TRANSCRIPT

Page 1: Counting Elements in Streams

Jamie Grier@[email protected]

Apache FlinkTMCounting elements in streams

Page 2: Counting Elements in Streams

Introduction

Page 3: Counting Elements in Streams

Data streaming is becoming increasingly popular*

*Biggest understatement of 2016

Page 4: Counting Elements in Streams

Streaming technology is enabling the obvious: continuous processing on data

that is continuously produced

Page 5: Counting Elements in Streams

Streaming is the next programming paradigm for data applications, and you

need to start thinking in terms of streams

Page 6: Counting Elements in Streams

Counting

Page 7: Counting Elements in Streams

Continuous countingA seemingly simple

application, but generally an unsolved problem

E.g., count visitors, impressions, interactions, clicks, etc

Aggregations and OLAP cube operations are generalizations of counting

Page 8: Counting Elements in Streams

Counting in batch architectureContinuous

ingestion

Periodic (e.g., hourly) files

Periodic batch jobs

Page 9: Counting Elements in Streams

Problems with batch architectureHigh latency

Too many moving parts

Implicit treatment of time

Out of order event handling

Implicit batch boundaries

Page 10: Counting Elements in Streams

Counting in λ architecture"Batch layer":

what we had before

"Stream layer": approximate early results

Page 11: Counting Elements in Streams

Problems with batch and λWay too many moving parts (and code

dup)

Implicit treatment of time

Out of order event handling

Implicit batch boundaries

Page 12: Counting Elements in Streams

Counting in streaming architectureMessage queue ensures stream durability and

replay

Stream processor ensures consistent counting

Page 13: Counting Elements in Streams

Counting in Flink DataStream API

Number of visitors in last hour by country DataStream<LogEvent> stream = env .addSource(new FlinkKafkaConsumer(...)); // create stream from Kafka .keyBy("country"); // group by country .timeWindow(Time.minutes(60)) // window of size 1 hour .apply(new CountPerWindowFunction()); // do operations per window

Page 14: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and

repeatable,

... queryable

Based on Maslow's hierarchy of needs

Page 15: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

Page 16: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

Page 17: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

Page 18: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

Page 19: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and

repeatable,

Page 20: Counting Elements in Streams

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and

repeatable,

... queryable1.1+

Page 21: Counting Elements in Streams

Rest of this talk

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and

repeatable,

... queryable

Page 22: Counting Elements in Streams

Latency

Page 23: Counting Elements in Streams

Yahoo! Streaming BenchmarkStorm, Spark Streaming, and Flink benchmark by

the Storm team at Yahoo!

Focus on measuring end-to-end latency at low throughputs

First benchmark that was modeled after a real application

Read more:• https

://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 24: Counting Elements in Streams

Benchmark task: counting! Count ad

impressions grouped by campaign

Compute aggregates over last 10 seconds

Make aggregates available for queries (Redis)

Page 25: Counting Elements in Streams

Results (lower is better)

Flink and Storm at sub-second latencies

Spark has a latency-throughput tradeoff

170k events/sec

Page 26: Counting Elements in Streams

Efficiency, and scalability

Page 27: Counting Elements in Streams

Handling high-volume streamsScalability: how many events/sec can a

system scale to, with infinite resources?

Scalability comes at a cost (systems add overhead to be scalable)

Efficiency: how many events/sec can a system scale to, with limited resources?

Page 28: Counting Elements in Streams

Extending the Yahoo! benchmarkYahoo! benchmark is a great starting point to

understand engine behavior.

However, benchmark stops at low write throughput and programs are not fault tolerant.

We extended the benchmark to (1) high volumes, and (2) to use Flink's built-in state.• http://data-artisans.com/extending-the-yahoo-streaming-benchmark

/

Page 29: Counting Elements in Streams

Results (higher is better)

Also: Flink jobs are correct under failures (exactly once), Storm jobs are not

500k events/sec

3mi events/sec

15mi events/sec

Page 30: Counting Elements in Streams

Fault tolerance and repeatability

Page 31: Counting Elements in Streams

Stateful streaming applicationsMost interesting stream applications are stateful

How to ensure that state is correct after failures?

Page 32: Counting Elements in Streams

Fault tolerance definitionsAt least once

• May over-count • In some cases even under-count with different

definition

Exactly once• Counts are the same after failure

End-to-end exactly once• Counts appear the same in an external sink (database,

file system) after failure

Page 33: Counting Elements in Streams

Fault tolerance in FlinkFlink guarantees exactly once

End-to-end exactly once supported with specific sources and sinks • E.g., Kafka ➔ Flink ➔ HDFS

Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation

Page 34: Counting Elements in Streams

Savepoints Maintaining stateful

applications in production is challenging

Flink savepoints: externally triggered, durable checkpoints

Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests

Page 35: Counting Elements in Streams

Explicit handling of time

Page 36: Counting Elements in Streams

Notions of time

1977 1980 1983 1999 2002 2005 2015

Episode IV:A New Hope

Episode V:The EmpireStrikes Back

Episode VI:Return of the Jedi

Episode I:The Phantom

Menace

Episode II:Attach of the Clones

Episode III:Revenge of

the Sith

Episode VII:The Force Awakens

This is called event time

This is called processing time

Page 37: Counting Elements in Streams

Out of order streams

Event time windows

Arrival time windows

Instant event-at-a-time

First burst of eventsSecond burst of events

Page 38: Counting Elements in Streams

Why event timeMost stream processors are limited to

processing/arrival time, Flink can operate on event time as well

Benefits of event time • Accurate results for out of order data• Sessions and unaligned windows• Time travel (backstreaming)

Page 39: Counting Elements in Streams

What's coming up in Flink

Page 40: Counting Elements in Streams

Evolution of streaming in Flink Flink 0.9 (Jun 2015): DataStream API in beta, exactly-

once guarantees via checkpoiting

Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring

Flink 1.0 (Mar 2015): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support

Page 41: Counting Elements in Streams

Upcoming featuresSQL: ongoing work in collaboration with Apache

Calcite

Dynamic scaling: adapt resources to stream volume, historical stream processing

Queryable state: ability to query the state inside the stream processor

Mesos support

More sources and sinks (e.g., Kinesis, Cassandra)

Page 42: Counting Elements in Streams

Queryable state

Using the stream processor as a database

Page 43: Counting Elements in Streams

Closing

Page 44: Counting Elements in Streams

SummaryStream processing gaining momentum, the right

paradigm for continuous data applications

Even seemingly simple applications can be complex at scale and in production – choice of framework crucial

Flink: unique combination of capabilities, performance, and robustness

Page 45: Counting Elements in Streams

Flink in the wild

30 billion events daily 2 billion events in 10 1Gb machines

Picked Flink for "Saiki"data integration &

distribution platform

See talks by at

Page 46: Counting Elements in Streams

Join the community!Follow: @ApacheFlink, @dataArtisansRead: flink.apache.org/blog, data-artisans.com/blogSubscribe: (news | user | dev) @ flink.apache.org

Page 47: Counting Elements in Streams

2 meetups next week in Bay Area!

April 6, San JoseApril 5, San Francisco

What's new in Flink 1.0 & recent performance benchmarks with Flinkhttp://www.meetup.com/Bay-Area-Apache-Flink-Meetup/