counting elements in streams

Jamie Grier@[email protected]

Apache FlinkTMCounting elements in streams

Introduction

Data streaming is becoming increasingly popular*

*Biggest understatement of 2016

Streaming technology is enabling the obvious: continuous processing on data

that is continuously produced

Streaming is the next programming paradigm for data applications, and you

need to start thinking in terms of streams

Counting

Continuous countingA seemingly simple

application, but generally an unsolved problem

E.g., count visitors, impressions, interactions, clicks, etc

Aggregations and OLAP cube operations are generalizations of counting

Counting in batch architectureContinuous

ingestion

Periodic (e.g., hourly) files

Periodic batch jobs

Problems with batch architectureHigh latency

Too many moving parts

Implicit treatment of time

Out of order event handling

Implicit batch boundaries

Counting in λ architecture"Batch layer":

what we had before

"Stream layer": approximate early results

Problems with batch and λWay too many moving parts (and code

dup)

Implicit treatment of time

Out of order event handling

Implicit batch boundaries

Counting in streaming architectureMessage queue ensures stream durability and

replay

Stream processor ensures consistent counting

Counting in Flink DataStream API

Number of visitors in last hour by country DataStream<LogEvent> stream = env .addSource(new FlinkKafkaConsumer(...)); // create stream from Kafka .keyBy("country"); // group by country .timeWindow(Time.minutes(60)) // window of size 1 hour .apply(new CountPerWindowFunction()); // do operations per window

Counting hierarchy of needs

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and

repeatable,

... queryable

Based on Maslow's hierarchy of needs


Continuous counting


Continuous counting




... accurate and

repeatable,


Continuous counting




... accurate and

repeatable,

... queryable1.1+

Rest of this talk

Continuous counting




... accurate and

repeatable,

... queryable

Latency

Yahoo! Streaming BenchmarkStorm, Spark Streaming, and Flink benchmark by

the Storm team at Yahoo!

Focus on measuring end-to-end latency at low throughputs

First benchmark that was modeled after a real application

Read more:• https

://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at




Benchmark task: counting! Count ad

impressions grouped by campaign

Compute aggregates over last 10 seconds

Make aggregates available for queries (Redis)

Results (lower is better)

Flink and Storm at sub-second latencies

Spark has a latency-throughput tradeoff

170k events/sec

Efficiency, and scalability

Handling high-volume streamsScalability: how many events/sec can a

system scale to, with infinite resources?

Scalability comes at a cost (systems add overhead to be scalable)

Efficiency: how many events/sec can a system scale to, with limited resources?

Extending the Yahoo! benchmarkYahoo! benchmark is a great starting point to

understand engine behavior.

However, benchmark stops at low write throughput and programs are not fault tolerant.

We extended the benchmark to (1) high volumes, and (2) to use Flink's built-in state.• http://data-artisans.com/extending-the-yahoo-streaming-benchmark

/

http://data-artisans.com/extending-the-yahoo-streaming-benchmark/



Results (higher is better)

Also: Flink jobs are correct under failures (exactly once), Storm jobs are not

500k events/sec

3mi events/sec

15mi events/sec

Fault tolerance and repeatability

Stateful streaming applicationsMost interesting stream applications are stateful

How to ensure that state is correct after failures?

Fault tolerance definitionsAt least once

• May over-count • In some cases even under-count with different

definition

Exactly once• Counts are the same after failure

End-to-end exactly once• Counts appear the same in an external sink (database,

file system) after failure

Fault tolerance in FlinkFlink guarantees exactly once

End-to-end exactly once supported with specific sources and sinks • E.g., Kafka ➔ Flink ➔ HDFS

Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation

Savepoints Maintaining stateful

applications in production is challenging

Flink savepoints: externally triggered, durable checkpoints

Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests

Explicit handling of time

Notions of time

1977 1980 1983 1999 2002 2005 2015

Episode IV:A New Hope

Episode V:The EmpireStrikes Back

Episode VI:Return of the Jedi

Episode I:The Phantom

Menace

Episode II:Attach of the Clones

Episode III:Revenge of

the Sith

Episode VII:The Force Awakens

This is called event time

This is called processing time

Out of order streams

Event time windows

Arrival time windows

Instant event-at-a-time

First burst of eventsSecond burst of events

Why event timeMost stream processors are limited to

processing/arrival time, Flink can operate on event time as well

Benefits of event time • Accurate results for out of order data• Sessions and unaligned windows• Time travel (backstreaming)

What's coming up in Flink

Evolution of streaming in Flink Flink 0.9 (Jun 2015): DataStream API in beta, exactly-

once guarantees via checkpoiting

Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring

Flink 1.0 (Mar 2015): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support

Upcoming featuresSQL: ongoing work in collaboration with Apache

Calcite

Dynamic scaling: adapt resources to stream volume, historical stream processing

Queryable state: ability to query the state inside the stream processor

Mesos support

More sources and sinks (e.g., Kinesis, Cassandra)

Queryable state

Using the stream processor as a database

Closing

SummaryStream processing gaining momentum, the right

paradigm for continuous data applications

Even seemingly simple applications can be complex at scale and in production – choice of framework crucial

Flink: unique combination of capabilities, performance, and robustness

Flink in the wild

30 billion events daily 2 billion events in 10 1Gb machines

Picked Flink for "Saiki"data integration &

distribution platform

See talks by at

Join the community!Follow: @ApacheFlink, @dataArtisansRead: flink.apache.org/blog, data-artisans.com/blogSubscribe: (news | user | dev) @ flink.apache.org

2 meetups next week in Bay Area!

April 6, San JoseApril 5, San Francisco

What's new in Flink 1.0 & recent performance benchmarks with Flinkhttp://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/

counting elements in streams

Engineering