Apache Flink at Strata San Jose 2016


Kostas Tzoumas (@kostas_tzoumas)

Apache Flink™: Counting elements in streams

Introduction


Data streaming is becoming increasingly popular*

*Biggest understatement of 2016

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams


Counting

Continuous counting: a seemingly simple application, but generally an unsolved problem

E.g., count visitors, impressions, interactions, clicks, etc.

Aggregations and OLAP cube operations are generalizations of counting

Counting in a batch architecture

Continuous ingestion -> periodic (e.g., hourly) files -> periodic batch jobs

Problems with the batch architecture

High latency

Too many moving parts

Implicit treatment of time

Out-of-order event handling

Implicit batch boundaries

Counting in a λ architecture

"Batch layer": what we had before

"Stream layer": approximate early results

Problems with batch and λ

Way too many moving parts (and code duplication)

Implicit treatment of time

Out-of-order event handling

Implicit batch boundaries

Counting in a streaming architecture

Message queue ensures stream durability and replay

Stream processor ensures consistent counting

Counting in the Flink DataStream API

Number of visitors in the last hour, by country

DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...))  // create stream from Kafka
    .keyBy("country")                        // group by country
    .timeWindow(Time.minutes(60))            // window of size 1 hour
    .apply(new CountPerWindowFunction());    // do operations per window
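The slide does not show LogEvent or CountPerWindowFunction. A minimal sketch of what such a window function could look like, assuming LogEvent is a POJO with a country field (these details are assumptions of the sketch, not from the talk); note its output would then be (country, count) pairs rather than LogEvents:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical count-per-window helper: emits one (country, count) pair per hourly window.
public class CountPerWindowFunction
        implements WindowFunction<LogEvent, Tuple2<String, Long>, Tuple, TimeWindow> {

    @Override
    public void apply(Tuple key, TimeWindow window,
                      Iterable<LogEvent> events, Collector<Tuple2<String, Long>> out) {
        long count = 0;
        for (LogEvent ignored : events) {
            count++;
        }
        // keyBy("country") produces a Tuple key; field 0 carries the country value.
        String country = key.getField(0);
        out.collect(Tuple2.of(country, count));
    }
}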

Counting hierarchy of needs (based on Maslow's hierarchy of needs)

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and repeatable,

... queryable (Flink 1.1+)


Rest of this talk

Continuous counting

... with low latency,

... efficiently on high volume streams,

... fault tolerant (exactly once),

... accurate and repeatable,

... queryable

Latency


Yahoo! Streaming Benchmark

Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!

Focus on measuring end-to-end latency at low throughputs

First benchmark that was modeled after a real application

Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Benchmark task: counting! (a sketch of this counting step follows the task list below)

Count ad impressions grouped by campaign

Compute aggregates over last 10 seconds

Make aggregates available for queries (Redis)
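A rough, self-contained sketch of the windowed counting step described above (not the benchmark code itself): the tiny in-memory source and campaign ids are stand-ins, the real benchmark reads ad events from Kafka, and it exposes the per-campaign aggregates via Redis instead of printing them:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CampaignCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kafka stream of ad impressions: one (campaignId, 1) per impression.
        DataStream<Tuple2<String, Long>> impressions = env.fromElements(
                Tuple2.of("campaign-1", 1L),
                Tuple2.of("campaign-2", 1L),
                Tuple2.of("campaign-1", 1L));

        impressions
                .keyBy(0)                      // group by campaign id (tuple field 0)
                .timeWindow(Time.seconds(10))  // aggregates over the last 10 seconds
                .sum(1)                        // count = sum of the 1s in field 1
                .print();                      // the benchmark writes results to Redis instead

        // Note: with this finite demo source the processing-time window may not fire
        // before the job ends; the benchmark uses an unbounded Kafka source.
        env.execute("Campaign count sketch");
    }
}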

Results (lower is better)

Flink and Storm at sub-second latencies

Spark has a latency-throughput tradeoff

(Chart annotation: 170k events/sec)

Efficiency and scalability

Handling high-volume streams

Scalability: how many events/sec can a system scale to, with infinite resources?

Scalability comes at a cost (systems add overhead to be scalable)

Efficiency: how many events/sec can a system scale to, with limited resources?

Extending the Yahoo! benchmark

The Yahoo! benchmark is a great starting point for understanding engine behavior.

However, the benchmark stops at a low write throughput, and its programs are not fault tolerant.

We extended the benchmark (1) to high volumes and (2) to use Flink's built-in state.

Read more: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Results (higher is better)

Also: Flink jobs are correct under failures (exactly once), Storm jobs are not

(Chart annotations: 500k events/sec, 3 million events/sec, 15 million events/sec)

Fault tolerance and repeatability


Stateful streaming applications

Most interesting stream applications are stateful

How to ensure that state is correct after failures?
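To make "stateful" concrete, here is a minimal sketch of a per-key running count kept in Flink's keyed ValueState; the (key, increment) tuple shape and names are assumptions, and this registered state is exactly what has to survive failures:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical stateful operator: keeps a running count per key in keyed state,
// which Flink includes in its consistent snapshots (see the following slides).
public class RunningCount extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();  // null on the first event for this key
        long updated = (current == null ? 0L : current) + event.f1;
        count.update(updated);
        out.collect(Tuple2.of(event.f0, updated));
    }
}

Used on a keyed stream, e.g. events.keyBy(0).flatMap(new RunningCount()).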

Fault tolerance definitions

At least once: may over-count (in some cases even under-count, under a different definition)

Exactly once: counts are the same after a failure

End-to-end exactly once: counts appear the same in an external sink (database, file system) after a failure

Fault tolerance in Flink

Flink guarantees exactly once

End-to-end exactly once is supported with specific sources and sinks (e.g., Kafka -> Flink -> HDFS)

Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
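Those snapshots are enabled with a one-line setting on the execution environment; a minimal sketch (the 5-second interval and the placeholder pipeline are arbitrary choices made here, not details from the talk):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 5 seconds;
        // the job keeps processing while snapshots are taken.
        env.enableCheckpointing(5000);

        // Tiny placeholder pipeline so the example runs end to end;
        // a real job would be the Kafka counting pipeline from the earlier slide.
        env.fromElements(1, 2, 3).print();

        env.execute("Checkpointed counting job");
    }
}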


Savepoints

Maintaining stateful applications in production is challenging

Flink savepoints: externally triggered, durable checkpoints

Enable easy code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests


Explicit handling of time


Notions of time: the Star Wars example

Release order and years: Episode IV: A New Hope (1977), Episode V: The Empire Strikes Back (1980), Episode VI: Return of the Jedi (1983), Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), Episode III: Revenge of the Sith (2005), Episode VII: The Force Awakens (2015)

The in-story chronology of the episodes is called event time; the order in which they were released (and watched) is called processing time

[Figure: out-of-order streams, comparing event-time windows with arrival-time windows and instant event-at-a-time processing over a first and a second burst of events]

Why event time

Most stream processors are limited to processing/arrival time; Flink can operate on event time as well (a minimal configuration sketch follows below)

Benefits of event time: accurate results for out-of-order data, sessions and unaligned windows, time travel (backstreaming)
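A minimal sketch of that configuration, reusing the LogEvent type and CountPerWindowFunction from the earlier sketches; the getTimestamp() accessor and the 10-second out-of-orderness bound are assumptions made here, not details from the talk:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {

    public static void configure(StreamExecutionEnvironment env, DataStream<LogEvent> events) {
        // Use the timestamp carried in each record (event time) instead of the
        // default processing/arrival time.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        events
            // Extract event timestamps and emit watermarks that tolerate
            // events arriving up to 10 seconds out of order.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<LogEvent>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(LogEvent e) {
                        return e.getTimestamp();  // assumed accessor on LogEvent
                    }
                })
            // Hourly windows now close based on event time, so out-of-order
            // events still land in the correct window.
            .keyBy("country")
            .timeWindow(Time.minutes(60))
            .apply(new CountPerWindowFunction())
            .print();
    }
}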


What's coming up in Flink


Evolution of streaming in Flink

Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing

Flink 0.10 (Nov 2015): event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, NiFi, Elasticsearch, ...), improved monitoring

Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support

Upcoming features

SQL: ongoing work in collaboration with Apache Calcite

Dynamic scaling: adapt resources to stream volume, historical stream processing

Queryable state: ability to query the state inside the stream processor

Mesos support

More sources and sinks (e.g., Kinesis, Cassandra)

Queryable state


Using the stream processor as a database

Closing


Summary

Stream processing is gaining momentum and is the right paradigm for continuous data applications

Even seemingly simple applications can be complex at scale and in production; the choice of framework is crucial

Flink: a unique combination of capabilities, performance, and robustness

Flink in the wild

30 billion events daily

2 billion events in 10 1Gb machines

Picked Flink for the "Saiki" data integration & distribution platform

See talks by [user company logos] at [conference logo]

Join the community!

Follow: @ApacheFlink, @dataArtisans

Read: flink.apache.org/blog, data-artisans.com/blog

Subscribe: (news | user | dev) @ flink.apache.org

2 meetups next week in the Bay Area!

April 5, San Francisco and April 6, San Jose

What's new in Flink 1.0 & recent performance benchmarks with Flink

http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/
