apache-flink-what-how-why-who-where-by-slim-baltagi

116
Apache Flink: What, How, Why, Who, Where? By @SlimBaltagi Director of Big Data Engineering Capital One 1 New York City (NYC) Apache Flink Me etup Civic Hall, NYC February 2nd, 2016 New York City (NYC) Apache Flink Mee tup Civic Hall, NYC February 2 nd , 2016

Upload: slim-baltagi

Post on 16-Apr-2017

10.328 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Apache Flink: What, How, Why, Who, Where?By @SlimBaltagiDirector of Big Data Engineering Capital One

1

New York City (NYC) Apache Flink MeetupCivic Hall, NYCFebruary 2nd, 2016

New York City (NYC) Apache Flink MeetupCivic Hall, NYC

February 2nd, 2016

Page 2: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

AgendaI. What is Apache Flink stack and how it fits

into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop

and other open source tools? III. Why Apache Flink is an alternative to

Apache Hadoop MapReduce, Apache Storm and Apache Spark?

IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 

2

Page 3: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

I. What is Apache Flink stack and how it fits into the Big Data ecosystem?

1. What is Apache Flink?2. What is Flink Execution Engine?3. What are Flink APIs?4. What are Flink Domain Specific Libraries?5. What is Flink Architecture?6. What is Flink Programming Model?7. What are Flink tools?

3

Page 4: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1. What is Apache Flink?

1.1 Apache project with a cool logo!1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework1.3 Project with a unique vision and philosophy1.4 Only Hybrid ( Real-Time streaming + Batch) engine supporting many use cases1.5 Major contributor to the movement of unification of streaming and batch1.6 The 4G of Big Data Analytics frameworks

4

Page 5: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.1 Apache project with a cool logo! Apache Flink, like Apache Hadoop and

Apache Spark, is a community-driven open source framework for distributed Big Data Analytics.

Apache Flink has its origins in a research project called Stratosphere of which the idea was conceived in late 2008 by professor Volker Markl  from the Technische Universität Berlin in Germany.

Flink joined the Apache incubator in April 2014 and graduated as an Apache Top Level Project (TLP) in December 2014.

dataArtisans (data-artisans.com) is a German start-up company based in Berlin and is leading the development of Apache Flink. 5

Page 6: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.1 Apache project with a cool logo

Squirrel is an animal! This reflects the harmony with other animals in the Hadoop ecosystem (Zoo): elephant, pig, python, camel, …

A squirrel is swift and agile

This reflects the meaning of the word ‘Flink’: German for “nimble, swift, speedy”

Red color of the body This reflects the roots of the project at German universities: In harmony with red squirrels in Germany

Colorful tail This reflects an open source project as the colors match the ones of the feather symbolizing the Apache Software Foundation

Page 7: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework

7

What is a typical Big Data Analytics Stack: Hadoop, Spark, Flink, …?

Page 8: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework

Apache Flink, written in Java and Scala, consists of: 1. Big data processing engine: distributed and

scalable streaming dataflow engine 2. Several APIs in Java/Scala/Python:

• DataSet API – Batch processing• DataStream API – Real-Time streaming analytics

3. Domain-Specific Libraries:• FlinkML: Machine Learning Library for Flink• Gelly: Graph Library for Flink• Table: Relational Queries • FlinkCEP: Complex Event Processing for Flink8

Page 9: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

What is Apache Flink stack?

Gel

lyTa

ble

Had

oop

M/R

SAM

OA

DataSet (Java/Scala/Python)Batch Processing

DataStream (Java/Scala)Stream Processing

Flin

kML

LocalSingle JVMEmbedded

Docker

ClusterStandalone

YARN, Mesos (WIP)

CloudGoogle’s GCEAmazon’s EC2

IBM Docker Cloud, …

Goo

gle

Dat

aflo

w

Dat

aflo

w (W

iP)

MR

QL

Tabl

e

Cas

cadi

ng

Runtime - Distributed Streaming Dataflow

Zepp

elin

DEP

LOY

SYST

EMA

PIs

& L

IBR

AR

IES

STO

RA

GE Files

LocalHDFS

S3, Azure StorageTachyon

DatabasesMongoDB

HBaseSQL

Streams FlumeKafka

RabbitMQ…

Batch Optimizer Stream Builder

Stor

m

Gel

ly-S

trea

m

Page 10: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

• Declarativity• Query optimization• Efficient parallel in-

memory and out-of-core algorithms

• Massive scale-out• User Defined

Functions • Complex data types• Schema on read

• Real-Time Streaming

• Iterations• Memory

Management• Advanced

Dataflows• General APIs

Draws on concepts from

MPP Database Technology

Draws on concepts from

Hadoop MapReduce Technology

Add

1.3 Project with a unique vision and philosophy

Apache Flink’s original vision was getting the best from both worlds: MPP Technology and Hadoop MapReduce Technologies:

Page 11: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.3 Project with a unique vision and philosophy

All streaming all the time: execute everything as streams including batch!!

Write like a programming language, execute like a database.

Alleviate the user from a lot of the pain of:• manually tuning memory assignment to

intermediate operators• dealing with physical execution concepts (e.g.,

choosing between broadcast and partitioned joins, reusing partitions).

11

Page 12: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.3 Project with a unique vision and philosophy

Little configuration required• Requires no memory thresholds to configure – Flink manages its own memory • Requires no complicated network configurations – Pipelining engine requires much less memory for data exchange • Requires no serializers to be configured – Flink handles its own type extraction and data representation

Little tuning required: Programs can be adjusted to data automatically – Flink’s optimizer can choose execution strategies automatically 12

Page 13: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.3 Project with a unique vision and philosophy

Support for many file systems:• Flink is File System agnostic. BYOS: Bring Your

Own StorageSupport for many deployment options:

• Flink is agnostic to the underlying cluster infrastructure. BYOC: Bring Your Own Cluster

Be a good citizen of the Hadoop ecosystem• Good integration with YARN

Preserve your investment in your legacy Big Data applications: Run your legacy code on Flink’s powerful engine using Hadoop and Storm compatibility layers and Cascading adapter. 13

Page 14: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.3 Project with a unique vision and philosophy

Native Support of many use cases on top of the same streaming engine• Batch• Real-Time streaming• Machine learning• Graph processing• Relational queries

Support building complex data pipelines leveraging native libraries without the need to combine and manage external ones.

14

Page 15: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.4 The only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases:

Real-Time stream processing Machine Learning at scale

Graph AnalysisBatch Processing

15

Page 16: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.5 Major contributor to the movement of unification of streaming and batch

Dataflow proposal for incubation has been renamed to Apache Beam ( for combination of Batch and Stream) https://wiki.apache.org/incubator/BeamProposal

Apache Beam was accepted to the Apache incubation on February 1st, 2016 http://incubator.apache.org/projects/beam.html

Dataflow/Beam & Spark: A Programming Model Comparison, February 3rd, 2016https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

By Tyler Akidau & Frances Perry, Software Engineers, Apache Beam Committers

16

Page 17: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.5 Major contributor to the movement of unification of streaming and batchApache Flink includes DataFlow on Flink

http://data-artisans.com/dataflow-proposed-as-apache-incubator-project/

Keynotes of the Flink Forward 2015 conference: • Keynote on October 12th, 2015 by Kostas Tzoumas and Stephan

Ewen of dataArtisanshttp://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote/

• Keynote on October 13th, 2015 by William Vambenepe of Googlehttp://www.slideshare.net/FlinkForward/william-vambenepe-google-cloud-dataflow-and-flink-stream-processing-by-default 17

Page 18: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1.6 The 4G of Big Data Analytics frameworks

Apache Flink is not YABDAF (Yet Another Big Data Analytics Framework)!

Flink brings many technical innovations and a unique vision and philosophy that distinguish it from: Other multi-purpose Big Data analytics frameworks

such as Apache Hadoop and Apache Spark Single-purpose Big Data Analytics frameworks

such as Apache Storm Apache Flink is the 4G (4th Generation) of Big Data

Analytics frameworks succeeding to Apache Spark.

18

Page 19: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Apache Flink as the 4G of Big Data Analytics

Batch Batch Interactive

Batch Interactive Near-Real

Time Streaming Iterative

processing

Hybrid(Streaming +Batch) Interactive Real-Time

Streaming Native Iterative

processing

MapReduce Direct Acyclic Graphs (DAG)Dataflows

RDD: Resilient Distributed Datasets

Cyclic Dataflows

1st Generation (1G)

2ndGeneration(2G)

3rd Generation (3G)

4th Generation (4G)

19

Page 20: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

How Big Data Analytics engines evolved?

The evolution of Massive-Scale Data Processing Tyler Akidau, Google. Strata + Hadoop World, Singapore, December 2, 2015. Slides: https://docs.google.com/presentation/d/10vs2PnjynYMtDpwFsqmSePtMnfJirCkXcHZ1SkwDg-s/present?slide=id.g63ca2a7cd_0_527

The world beyond batch: Streaming 101, Tyler Akidau, Google, August 5, 2015

http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html

Streaming 102, Tyler Akidau, Google, January 20, 2016 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

It covers topics like event-time vs. processing-time, windowing, watermarks, triggers, and accumulation.

20

Page 21: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. What is Flink Execution Engine?

The core of Flink is a distributed and scalable streaming dataflow engine with some unique features:

1. True streaming capabilities: Execute everything as streams

2. Versatile: Engine allows to run all existing MapReduce, Cascading, Storm, Google DataFlow applications

3. Native iterative execution: Allow some cyclic dataflows4. Handling of mutable state5. Custom memory manager: Operate on managed

memory6. Cost-Based Optimizer: for both batch and stream

processing21

Page 22: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Flink APIs

3.1 DataSet API for static data - Java, Scala, and Python3.2 DataStream API for unbounded real-time streams - Java and Scala

22

Page 23: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3.1 DataSet API – Batch processing

case class Word (word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment()val lines: DataStream[String] = env.fromSocketStream(...)lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .keyBy("word").sum("frequency") .print()env.execute()

val env = ExecutionEnvironment.getExecutionEnvironment()val lines: DataSet[String] = env.readTextFile(...)lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print()env.execute()

DataSet API (batch): WordCount

DataStream API (streaming): Window WordCount

23

Page 24: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3.2 DataStream API – Real-Time Streaming Analytics Flink Streaming provides high-throughput, low-latency

stateful stream processing system with rich windowing semantics.

Streaming Fault-Tolerance allows Exactly-once processing delivery guarantees for Flink streaming programs that analyze streaming sources persisted by Apache Kafka.

Flink Streaming provides native support for iterative stream processing.

Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API.

 

24

Page 25: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3.2 DataStream API – Real-Time Streaming Analytics Flink being based on a pipelined (streaming) execution

engine akin to parallel database systems allows to:• implement true streaming & batch• integrate streaming operations with rich windowing

semantics seamlessly• process streaming operations in a pipelined way with

lower latency than micro-batch architectures and without the complexity of lambda architectures.

It has built-in connectors to many data sources like Flume, Kafka, Twitter, RabbitMQ, etc

25

Page 26: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3.2 DataStream API – Real-Time Streaming Analytics

Apache Flink: streaming done right. Till Rohrmann. January 31, 2016 https://fosdem.org/2016/schedule/event/hpc_bigdata_flink_streaming/

Web resources about stream processing with Apache Flink at the Flink Knowledge Base http://sparkbigdata.com/component/tags/tag/49-flink-streaming

26

Page 27: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4. Flink Domain Specific Libraries

4.1 FlinkML – Machine Learning Library

4.2 Table – Relational queries 4.3 Gelly – Graph Analytics for Flink

4.4 FlinkCEP: Complex Event Processing for Flink

27

Page 28: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 FlinkML - Machine Learning Library FlinkML is the Machine Learning (ML) library for Flink.

It is written in Scala and was added in March 2015. FlinkML aims to provide:

• an intuitive API• scalable ML algorithms• tools that help minimize glue code in end-to-end ML

applications FlinkML will allow data scientists to:

• test their models locally using subsets of data• use the same code to run their algorithms at a much

larger scale in a cluster setting.

28

Page 29: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 FlinkMLFlinkML unique features are:

1. Exploiting the in-memory data streaming nature of Flink.

2. Natively executing iterative processing algorithms which are common in Machine Learning.

3. Streaming ML designed specifically for data streams.

FlinkML: Large-scale machine learning with Apache Flink, Theodore Vasiloudis. October 21, 2015

Slides: https://sics.app.box.com/s/044omad6200pchyh7ptbyxkwvcvaiowu Video: https://www.youtube.com/watch?v=k29qoCm4c_k&feature=youtu.be

Check more FlinkML web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/51-flinkml

29

Page 30: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.2 Table – Relational Queries

Table API, written in Scala , allows specifying operations using SQL-like expressions instead of manipulating DataSet or DataStream.

Table API can be used in both batch (on structured data sets) and streaming programs (on structured data streams).http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

Flink Table web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/52-flink-table

30

Page 31: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.2 Table API – Relational Queries

val customers = envreadCsvFile(…).as('id, 'mktSegment) .filter("mktSegment = AUTOMOBILE")

val orders = env.readCsvFile(…) .filter( o => dateFormat.parse(o.orderDate).before(date) ) .as("orderId, custId, orderDate, shipPrio")

val items = orders .join(customers).where("custId = id") .join(lineitems).where("orderId = id") .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) – discount) as revenue")

val result = items .groupBy("orderId, orderDate, shipPrio") .select("orderId, revenue.sum, orderDate, shipPrio")

Table API (queries)

31

Page 32: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Gelly – Graph Analytics for Flink Gelly is Flink's large-scale graph processing API,

available in Java and Scala, which leverages Flink's efficient delta iterations to map various graph processing models (vertex-centric and gather-sum-apply) to dataflows.

Gelly provides:• A set of methods and utilities to create, transform

and modify graphs • A library of graph algorithms which aims to simplify

the development of graph analysis applications• Iterative graph algorithms are executed leveraging

mutable state32

Page 33: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Gelly – Graph Analytics for Flink Gelly allows Flink users to perform end-to-end data

analysis, without having to build complex pipelines and combine different systems.

It can be seamlessly combined with Flink's DataSet API, which means that pre-processing, graph creation, graph analysis and post-processing can be done in the same application.

Gelly documentation https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html

Introducing Gelly: Graph Processing with Apache Flink http://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html

Check out more Gelly web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/50-gelly33

Page 34: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Gelly – Graph Analytics for Flink Single-pass Graph Streaming Analytics with Apache

Flink. Vasia Kalavri & Paris Carbone. January 31, 2016 FOSDEM'16. Brussels, BELGIUM.  

• Talk description :https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/

• Slides: http://www.slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink

Gelly free training! http://www.slideshare.net/FlinkForward/vasia-kalavri-training-gelly-school

http://gellyschool.com/

34

Page 35: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.4 FlinkCEP: Complex Event Processing for FlinkFlinkCEP is the complex event processing library for

Flink. It allows you to easily detect complex event patterns in a stream of endless data.

Complex events can then be constructed from matching sequences. This gives you the opportunity to quickly get hold of what’s really important in your data.

https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html

 

35

Page 36: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5. What is Flink Architecture? Flink implements the Kappa Architecture:

run batch programs on a streaming system. References about the Kappa Architecture:

• Questioning the Lambda Architecture - Jay Kreps , July 2nd, 2014 http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

• Turning the database inside out with Apache Samza -Martin Kleppmann, March 4th, 2015o http://www.youtube.com/watch?v=fU9hR3kiOK0 (VIDEO)o http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.h

tml(TRANSCRIPT)

o http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/ (BLOG)

36

Page 37: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5. What is Flink Architecture?5.1 Client5.2 Master (Job Manager)5.3 Worker (Task Manager)

37

Page 38: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5.1 Client Type extraction Optimize: in all APIs not just SQL queries as in Spark Construct job Dataflow graph Pass job Dataflow graph to job manager Retrieve job results

Job Manager

Client

case class Path (from: Long, to: Long)val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }

Optimizer

Type extraction

Data Sourceorders.tbl

Filter

Map

DataSourcelineitem.tbl

JoinHybrid Hash

buildHT probe

hash-part [0]hash-part [0]

GroupRed

sort

forward

38

Page 39: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5.2 Job Manager (JM) with High Availability Parallelization: Create Execution Graph Scheduling: Assign tasks to task managers State tracking: Supervise the execution

Job Manager

Data Sourceorders.t

bl

FilterMap

DataSourcelineitem.tbl

JoinHybrid HashbuildHT probe

hash-part [0] hash-part [0]

GroupRed

sort

forward

Task Manager

Task Manager

Task Manager

Task Manager

Data Sourceorders.tbl

Filter

Map DataSource

lineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0]

hash-part [0]

GroupRed

sort

forward

Data Sourceorders.tbl

Filter

Map DataSource

lineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0]

hash-part [0]

GroupRed

sort

forward

Data Sourceorders.tbl

FilterMap DataSou

rcelineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0]

hash-part [0]

GroupRed

sort

forward

Data Sourceorders.tbl

Filter

MapDataSourc

elineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0]

hash-part [0]

GroupRedsort

forward

39

Page 40: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5.3 Task Manager ( TM) Operations are split up into tasks depending on the

specified parallelism Each parallel instance of an operation runs in a

separate task slot The scheduler may run several tasks from different

operators in one task slot

Task Manager

Slot

Task ManagerTask Manager

Slot

Slot

40

Page 41: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

6. What is Flink Programming Model? DataSet and DataStream as programming

abstractions are the foundation for user programs and higher layers.

Flink extends the MapReduce model with new operators that represent many common data analysis tasks more naturally and efficiently.

All operators will start working in memory and gracefully go out of core under memory pressure.

41

Page 42: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

6.1 DataSetDataSet: abstraction for distributed data and the

central notion of the batch programming APIFiles and other data sources are read into DataSets

• DataSet<String> text = env.readTextFile(…)Transformations on DataSets produce DataSets

• DataSet<String> first = text.map(…)DataSets are printed to files or on stdout

• first.writeAsCsv(…)Computation is specified as a sequence of lazily

evaluated transformationsExecution is triggered with env.execute()

42

Page 43: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

6.1 DataSet

Used for Batch Processing

Data Set Operation Data

SetSource

Example: Map and Reduce operation

Sink

b h

2 1

3 5

7 4

… …

Map Reduce

a

12

…43

Page 44: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

6.2 DataStream

Real-time event streams

Data Stream Operation Data

StreamSource Sink

Stock FeedName Price

Microsoft 124

Google 516

Apple 235

… …

Alert if Microsoft

> 120

Write event to database

Sum every 10 seconds

Alert if sum > 10000

Microsoft 124

Google 516Apple 235

Microsoft 124

Google 516

Apple 235

Example: Stream from a live stock feed

44

Page 45: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7. What are Apache Flink tools?

7.1   Command-Line Interface (CLI)7.2   Web Submission Client7.3   Job Manager Web Interface7.4   Interactive Scala Shell7.5   Zeppelin Notebook

45

Page 46: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.1   Command-Line Interface (CLI) Flink provides a CLI to run programs that are packaged as

JAR files, and control their execution. bin/flink has 4 major actions

• run  #runs a program.• info  #displays information about a program.• list  #lists scheduled and running jobs• cancel #cancels a running job.

Example: ./bin/flink info ./examples/KMeans.jar

See CLI usage and related examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/cli.html 46

Page 47: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.2   Web Submission Client

47

Page 48: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.2   Web Submission ClientFlink provides a web interface to:

• Upload programs• Execute programs• Inspect their execution plans• Showcase programs• Debug execution plans• Demonstrate the system as a whole

The web interface runs on port 8080 by default.To specify a custom port set the webclient.port property in the ./conf/flink.yaml configuration file.

48

Page 49: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.3   Job Manager Web Interface

Overall system status

Job execution details

Task Manager resourceutilization

49

Page 50: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.3 Job Manager Web Interface

The JobManager web frontend allows to :• Track the progress of a Flink program

as all status changes are also logged to the JobManager’s log file.

• Figure out why a program failed as it displays the exceptions of failed tasks and allow to figure out which parallel task first failed and caused the other tasks to cancel the execution.

50

Page 51: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.4   Interactive Scala Shellbin/start-scala-shell.sh --host localhost --port 6123

51

Page 52: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.4   Interactive Scala ShellFlink comes with an Interactive Scala Shell - REPL ( Read Evaluate Print Loop ) : ./bin/start-scala-shell.sh Interactive queries Let’s you explore data quickly It can be used in a local setup as well as in a

cluster setup. The Flink Shell comes with command history and

auto completion. Complete Scala API available So far only batch mode is supported. There is

plan to add streaming in the future: https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html

52

Page 53: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.5   Zeppelin Notebookhttp://localhost:8080/

53

Page 54: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

7.5   Zeppelin Notebook

Web-based interactive computation environment

Collaborative data analytics and visualization tool

Combines rich text, execution code, plots and rich media

Exploratory data scienceSaving and replaying of written codeStorytelling

54

Page 55: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

AgendaI. What is Apache Flink stack and how it fits

into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop

and other open source tools? III. Why Apache Flink is an alternative to

Apache Hadoop MapReduce, Apache Storm and Apache Spark? 

IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 

55

Page 56: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

II. How Apache Flink integrates with Hadoop and other open source tools? 

Service Open Source ToolStorage/Serving Layer

Data Formats

Data Ingestion Services

Resource Management

56

Page 57: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

II. How Apache Flink integrates with Hadoop and other open source tools? 

Flink integrates well with other open source tools for data input and output as well as deployment. 

Flink allows to run legacy Big Data applications: MapReduce, Cascading and Storm applications

Flink integrates with other open source tools

1. Data Input / Output2. Deployment3. Legacy Big Data applications4. Other tools

57

Page 58: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1. Data Input / OutputHDFS to read and write. Secure HDFS supportReuse data types (that implement Writables interface)Amazon S3 Microsoft Azure StorageMapR-FS Flink + Tachyon http://tachyon-project.org/

Running Apache Flink on Tachyon http://tachyon-project.org/Running-Flink-on-Tachyon.html

Flink + XtreemFS http://www.xtreemfs.org/

58

Page 60: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1. Data Input / Output

Apache Kafka, a system that provides durability and pub/sub functionality for data streams.

Kafka + Flink: A practical, how-to guide. Robert Metzger and Kostas Tzoumas, September 2, 2015  http://data-artisans.com/kafka-flink-a-practical-how-to/ https://www.youtube.com/watch?v=7RPQUsy4qOM

Click-Through Example for Flink’s KafkaConsumer Checkpointing. Robert Metzger, September 2nd , 2015. http://www.slideshare.net/robertmetzger1/clickthrough-example-for-flinks-kafkaconsumer-checkpointing

MapR Streams (proprietary alternative to Kafka that is compatible with Apache Kafka 0.9 API) provides out of the box integration with Apache Flinkhttp://sparkbigdata.com/component/tags/tag/61

60

Page 61: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1. Data Input / OutputUsing Apache Nifi with Flink:

• Flink and NiFi: Two Stars in the Apache Big Data Constellation. Matthew Ring. January 19th , 2016 http://www.slideshare.net/mring33/flink-and-nifi-two-stars-in-the-apache-big-data-constellation

• Integration of Apache Flink and Apache Nifi. Bryan Bende, February 4th , 2016

http://www.slideshare.net/BryanBende/integrating-nifi-and-flink

Using Elasticsearch with Flink: https://www.elastic.co/

Building real-time dashboard applications with Apache Flink, Elasticsearch, and Kibana. By Fabian Hueske, December 7, 2015.https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana

61

Page 62: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. DeploymentDeploy inside of Hadoop via YARN

• YARN Setup http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html

• YARN Configuration http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn

Apache Flink cluster deployment on Docker using Docker-Compose by Simons Laws from IBM.

Talk at the Flink Forward in Berlin on October 12, 2015. Slides:

http://www.slideshare.net/FlinkForward/simon-laws-apache-flink-cluster-deployment-on-docker-and-dockercompose

Video recording (40’:49): https://www.youtube.com/watch?v=CaObaAv9tLE62

Page 63: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Legacy Big Data applications

Flink’s MapReduce compatibility layer allows to: • run legacy Hadoop MapReduce jobs • reuse Hadoop input and output formats • reuse functions like Map and Reduce.

References: • Documentation: https://ci.apache.org/projects/flink/flink-docs-release-0.7/

hadoop_compatibility.html

• Hadoop Compatibility in Flink by Fabian Hüeske - November 18, 2014 http://flink.apache.org/news/2014/11/18/hadoop-compatibility.html

• Apache Flink - Hadoop MapReduce Compatibility. Fabian Hüeske, January 29, 2015 http://www.slideshare.net/fhueske/flink-hadoopcompat20150128

63

Page 64: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Legacy Big Data applications  Cascading on Flink allows to port existing Cascading-MapReduce

applications to Apache Flink with virtually no code changes. http://www.cascading.org/cascading-flink/

Expected advantages are performance boost and less resources consumption.

References: • Cascading on Apache Flink, Fabian Hueske, data Artisans. Flink

Forward 2015. October 12, 2015• http://www.slideshare.net/FlinkForward/fabian-hueske-training-cascading-on-

flink• https://www.youtube.com/watch?v=G7JlpARrFkU

• Cascading connector for Apache Flink. Code on Github https://github.com/dataArtisans/cascading-flink

• Running Scalding jobs on Apache Flink, Ian Hummel, December 20, 201http://themodernlife.github.io/scala/hadoop/hdfs/sclading/flink/streaming/realtime/2015/12/20/running-scalding-jobs-on-apache-flink/

64

Page 65: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Legacy Big Data applications  

Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm: • Execute existing Storm topologies using Flink as the underlying

engine. • Reuse legacy application code (bolts and spouts) inside Flink

programs.  https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/storm_compatibility.html

A Tale of Squirrels and Storms. Mathias J. Sax, October 13, 2015. Flink Forward 2015

http://www.slideshare.net/FlinkForward/matthias-j-sax-a-tale-of-squirrels-and-stormshttps://www.youtube.com/watch?v=aGQQkO83Ong

Storm Compatibility in Apache Flink: How to run existing Storm topologies on Flink. Mathias J. Sax, December 11, 2015

http://flink.apache.org/news/2015/12/11/storm-compatibility.html 65

Page 66: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Ambari service for Apache Flink: install, configure, manage Apache Flink on HDP, November 17, 2015

https://community.hortonworks.com/repos/4122/ambari-service-for-apache-flink.html

Exploring Apache Flink with HDPhttps://community.hortonworks.com/articles/2659/exploring-apache-flink-with-hdp.html

Apache Flink + Apache SAMOA for Machine Learning on streams http://samoa.incubator.apache.org/

Flink Integrates with Zeppelin http://zeppelin.incubator.apache.org/http://www.slideshare.net/FlinkForward/moon-soo-lee-data-science-lifecycle-with-apache-flink-and-apache-zeppelin

Flink + Apache MRQL http://mrql.incubator.apache.org

66

4. Other tools

Page 67: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Google Cloud Dataflow (GA on August 12, 2015) is a fully-managed cloud service and a unified programming model for batch and streaming big data processing. https://cloud.google.com/dataflow/ (Try it FREE)

Flink-Dataflow is a Google Cloud Dataflow SDK Runner for Apache Flink. It enables you to run Dataflow programs with Flink as an execution engine.

References: Google Cloud Dataflow on top of Apache Flink,

Maximilian Michels, data Artisans. Flink Forward conference, October 12, 2015 http://www.slideshare.net/FlinkForward/maximilian-michels-google-cloud-

dataflow-on-top-of-apache-flink Slides

https://www.youtube.com/watch?v=K3ugWmHb7CE Video recording

67

4. Other tools

Page 68: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

AgendaI. What is Apache Flink stack and how it fits

into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop

and other open source tools for data input and output as well as deployment? 

III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? 

IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 

68

Page 69: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? 

1. Why Flink is an alternative to Hadoop MapReduce?

2. Why Flink is an alternative to Apache Storm?3. Why Flink is an alternative to Apache Spark?4. What are the benchmarking results against

Flink?

69

Page 70: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. Why Flink is an alternative to Hadoop MapReduce?

1. Flink offers cyclic dataflows compared to the two-stage, disk-based MapReduce paradigm.

2. The application programming interface (API) for Flink is easier to use than programming for Hadoop’s MapReduce.

3. Flink is easier to test compared to MapReduce.4. Flink can leverage in-memory processing, data

streaming and iteration operators for faster data processing speed.

5. Flink can work on file systems other than Hadoop. 70

Page 71: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. Why Flink is an alternative to Hadoop MapReduce?

6. Flink lets users work in a unified framework allowing to build a single data workflow that leverages, streaming, batch, sql and machine learning for example.

7. Flink can analyze real-time streaming data.8. Flink can process graphs using its own Gelly library.9. Flink can use Machine Learning algorithms from its

own FlinkML library.10. Flink supports interactive queries and iterative

algorithms, not well served by Hadoop MapReduce. 

71

Page 72: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. Why  Flink is an alternative to Hadoop MapReduce?

11. Flink extends MapReduce model with new operators: join, cross, union, iterate, iterate delta, cogroup, … 

Input Map Reduce Output

DataSet DataSetDataSet

Red Join

DataSet Map DataSet

OutputS

Input

72

Page 73: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Why Flink is an alternative to Storm?

1. Higher Level and easier to use API2. Lower latency

• Thanks to pipelined engine

3. Exactly-once processing guarantees• Variation of Chandy-Lamport

4. Higher throughput• Controllable checkpointing overhead

5. Flink Separates application logic from recovery

• Checkpointing interval is just a configuration parameter 73

Page 74: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Why Flink is an alternative to Storm?

6. More light-weight fault tolerance strategy7. Stateful operators8. Native support for iterative stream processing. 9. Flink does also support batch processing10. Flink offers Storm compatibility

• Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm.

https://ci.apache.org/projects/flink/flink-docs-master/apis/storm_compatibility.html

74

Page 75: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

3. Why Flink is an alternative to Storm?

Extending the Yahoo! Streaming Benchmark, by Jamie Grier. February 2nd, 2016

http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Code at Github: https://github.com/dataArtisans/yahoo-streaming-benchmark

Results show that Flink has a much better throughput compared to storm and better fault-tolerance guarantees: exactly-once.High-throughput, low-latency, and exactly-once

stream processing with Apache Flink. The evolution of fault-tolerant streaming architectures and their performance – Kostas Tzoumas, August 5th 2015

http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ 75

Page 76: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4. Why  Flink is an alternative to Spark?

4.1 True Low latency streaming engine • Spark’s micro-batches aren’t good enough!• Unified batch and real-time streaming in a single

engine• The streaming model of Flink is based on the

Dataflow model similar to Google Dataflow4.2 Unique windowing features not available in Spark

• support for event time• out of order streams• a mechanism to define custom windows based on

window assigners and triggers. 76

Page 77: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4. Why  Flink is an alternative to Spark?

4.3 Native closed-loop iteration operators • make graph and machine learning applications run

much faster 4.4 Custom memory manager

• no more frequent Out Of Memory errors!• Flink’s own type extraction component• Flink’s own serialization component

4.5 Automatic Cost Based Optimizer • little re-configuration and little maintenance when

the cluster characteristics change and the data evolves over time

77

Page 78: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4. Why Flink is an alternative to Apache Spark?

4.6 Little configuration required 4.7 Little tuning required 4.8 Flink has better performance

78

Page 79: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 True low latency streaming engine Some claim that 95% of streaming use cases can be

handled with micro-batches!? Really!!! Spark’s micro-batching isn’t good enough for many

time-critical applications that need to process large streams of live data and provide results in real-time.

Below are Several use cases, taken from real industrial situations where batch or micro batch processing is not appropriate.

References: • MapR Streams FAQ https://www.mapr.com/mapr-streams-faq#question12

• Apache Spark vs. Apache Flink, January 13, 2015. Whiteboard walkthrough by Balaji Narasimhalu from MapRhttps://www.youtube.com/watch?v=Dzx-iE6RN4w 79

Page 80: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 True low latency streaming engine Financial Services

– Real-time fraud detection.– Real-time mobile notifications.

Healthcare– Smart hospitals - collect data and readings from hospital

devices (vitals, IVs, MRI, etc.) and analyze and alert in real time.– Biometrics - collect and analyze data from patient devices that

collect vitals while outside of care facilities.Ad Tech

– Real-time user targeting based on segment and preferences.Oil & Gas

– Real-time monitoring of pumps/rigs.

80

Page 81: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 True low latency streaming engine

Retail– Build an intelligent supply chain by placing sensors or RFID

tags on items to alert if items aren’t in the right place, or proactively order more if supply is low.

– Smart logistics with real-time end-to-end tracking of delivery trucks.

Telecommunications– Real-time antenna optimization based on user location data.– Real-time charging and billing based on customer usage, ability

to populate up-to-date usage dashboards for users.– Mobile offers.– Optimized advertising for video/audio content based on what

users are consuming.81

Page 82: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.1 True low latency streaming engine “I would consider stream data analysis to be a major

unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.” – Volker Markl

Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015 http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/

Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch.

Batch is just treated as a finite set of streamed data. This makes Flink the most sophisticated distributed open source Big Data processing engine.

82

Page 83: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.2 Unique windowing features not available in Spark StreamingBesides arrival time, support for event time or a mixture

of both for out of order streamsCustom windows based on window assigners and

triggers. How Apache Flink enables new streaming applications.

Part I: The power of event time and out of order stream processing. December 9, 2015 by Stephan Ewen and Kostas Tzoumas http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/

How Apache Flink enables new streaming applications. Part II: State and versioning. February 3, 2016 by Ufuk Celebi and Kostas Tzoumas

http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/83

Page 84: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.2 Unique windowing features not available in Spark Streaming

Flink 0.10: A significant step forward in open source stream processing. November 17, 2015. By Fabian Hueske and Kostas Tzoumashttp://data-artisans.com/flink-0-10-a-significant-step-forward-in-open-source-stream-processing/

 Dataflow/Beam & Spark: A Programming Model Comparison. February 3, 2016. By Tyler Akidau & Frances Perry, Software Engineers, Apache Beam Committershttps://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

84

Page 85: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Iteration OperatorsWhy Iterations? Many Machine Learning and Graph processing algorithms need iterations! For example:

Machine Learning Algorithms • Clustering (K-Means, Canopy, …) •  Gradient descent (Logistic Regression, Matrix

Factorization) Graph Processing Algorithms

• Page-Rank, Line-Rank • Path algorithms on graphs (shortest paths,

centralities, …) • Graph communities / dense sub-components • Inference (Belief propagation) 85

Page 86: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.2 Iteration Operators Flink's API offers two dedicated iteration operations:

Iterate and Delta Iterate. Flink executes programs with iterations as cyclic

data flows: a data flow program (and all its operators) is scheduled just once.

In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set), and computes the next version of the partial solution

86

Page 87: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Iteration Operators Delta iterations run only on parts of the data that is

changing and can significantly speed up many machine learning and graph algorithms because the work in each iteration decreases as the number of iterations goes on.

Documentation on iterations with Apache Flinkhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html

87

Page 88: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Iteration Operators

StepStep

Step Step Step

Client

for (int i = 0; i < maxIterations; i++) {

// Execute MapReduce job}

Non-native iterations in Hadoop and Spark are implemented as regular for-loops outside the system.

88

Page 89: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.3 Iteration Operators Although Spark caches data across iterations, it still

needs to schedule and execute a new set of tasks for each iteration.

In Spark, it is driver-based looping: • Loop outside the system, in driver program• Iterative program looks like many independent jobs

In Flink, it is Built-in iterations:• Dataflow with Feedback edges• System is iteration-aware, can optimize the job

Spinning Fast Iterative Data Flows - Ewen et al. 2012 : http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The Apache Flink model for incremental iterative dataflow processing.

89

Page 90: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.4 Custom Memory Manager Features:

C++ style memory management inside the JVM User data stored in serialized byte arrays in JVM Memory is allocated, de-allocated, and used strictly

using an internal buffer pool implementation. Advantages:

1. Flink will not throw an OOM exception on you.2. Reduction of Garbage Collection (GC)3. Very efficient disk spilling and network transfers4. No Need for runtime tuning5. More reliable and stable performance

90

Page 91: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.4 Custom Memory Manager

public class WC {public String

word; public int count;}

emptypage

Pool of Memory Pages

Sorting, hashing, caching

Shuffles/ broadcasts

User code objects

Man

aged

Unm

anag

edFlink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components.JVM Heap

91Net

wor

k B

uffe

rs

Page 92: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.4 Custom Memory ManagerFlink provides an Off-Heap option for its memory

management componentReferences:

• Peeking into Apache Flink's Engine Room - by Fabian Hüske, March 13, 2015 http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

• Juggling with Bits and Bytes - by Fabian Hüske, May 11,2015

https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html

• Memory Management (Batch API) by Stephan Ewen- May 16, 2015

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525

92

Page 93: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.4 Custom Memory Manager

Compared to Flink, Spark is catching up with its project Tungsten for Memory Management and Binary Processing: manage memory explicitly and eliminate the overhead of JVM object model and garbage collection. April 28, 2014https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

It seems that Spark is adopting something similar to Flink and the initial Tungsten announcement read almost like Flink documentation!!

93

Page 94: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.5 Built-in Cost-Based Optimizer Apache Flink comes with an optimizer that is

independent of the actual programming interface. It chooses a fitting execution strategy depending

on the inputs and operations. Example: the "Join" operator will choose between

partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.

This helps you focus on your application logic rather than parallel execution.

Quick introduction to the Optimizer: section 6 of the paper: ‘The Stratosphere platform for big data analytics’http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf

94

Page 95: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.5 Built-in Cost-Based Optimizer

Run locally on a data sample

on the laptopRun a month later

after the data evolved

Hash vs. SortPartition vs. Broadcast

CachingReusing partition/sortExecution

Plan A

ExecutionPlan B

Run on large fileson the cluster

ExecutionPlan C

What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.

95

Page 96: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.5 Built-in Cost-Based Optimizer In contrast to Flink’s built-in automatic optimization,

Spark jobs have to be manually optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right.

Spark SQL uses the Catalyst optimizer that supports both rule-based and cost-based optimization. References:

• Spark SQL: Relational Data Processing in Sparkhttp://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

• Deep Dive into Spark SQL’s Catalyst Optimizer https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

96

Page 97: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.6 Little configuration required Flink requires no memory thresholds to

configure• Flink manages its own memory

Flink requires no complicated network configurations• Pipelining engine requires much less

memory for data exchange Flink requires no serializers to be configured

• Flink handles its own type extraction and data representation

97

Page 98: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.7 Little tuning requiredFlink programs can be adjusted to data automatically

• Flink’s optimizer can choose execution strategies automatically

According to Mike Olsen, Chief Strategy Officer of Cloudera Inc. “Spark is too knobby — it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change. Reference: http://vision.cloudera.com/one-platform/

Tuning Spark Streaming for Throughput By Gerard Maas from Virdata. December 22, 2014 http://www.virdata.com/tuning-spark/

Spark Tuning: http://spark.apache.org/docs/latest/tuning.html

98

Page 99: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4.8 Flink has better performanceWhy Flink provides a better performance?

• Custom memory manager• Native closed-loop iteration operators make graph

and machine learning applications run much faster .• Role of the built-in automatic optimizer. For example,

more efficient join processing• Pipelining data to the next operator in Flink is more

efficient than in Spark. Reference:

• A comparative performance evaluation of Flink, Dongwon Kim, Postech. October 12, 2015http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink

99

Page 100: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5. What are the benchmarking results against Flink?I am maintaining a list of resources related to

benchmarks against Flink: http://sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark

A couple resources worth mentioning:• A comparative performance evaluation of Flink, Dongwon

Kim, POSTECH, Flink Forward October 13, 2015 http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink

• Benchmarking Streaming Computation Engines at Yahoo December 16, 2015 Code at github: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

https://github.com/yahoo/streaming-benchmarks

100

Page 101: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

AgendaI. What is Apache Flink stack and how it fits

into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop

and other open source tools for data input and output as well as deployment? 

III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark. 

IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 

101

Page 102: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

IV. Who is using Apache Flink? You might like what you saw so far about

Apache Flink and still reluctant to give it a try!You might wonder: Is there anybody using

Flink in pre-production or production environment?

I asked this question to our friend ‘Google’ and I came with a short list in the next slide!

I also heard more about who is using Flink in production at the Flink Forward conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/

102

Page 104: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

IV. Who is using Apache Flink? 6 Apache Flink Case Studies from the 2015 Flink

Forward conference http://sparkbigdata.com/102-spark-blog-slim-baltagi/21-6-apache-flink-case-studies-from-the-2015-flinkforward-conference

Mine the Apache Flink User mailing list to discover more!

Gradoop: Scalable Graph Analytics with Apache Flink

• Gradoop project page http://dbs.uni-leipzig.de/en/research/projects/gradoop

• Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016. January 31, 2016http://www.slideshare.net/s1ck/gradoop-scalable-graph-analytics-with-apache-flink-fosdem-2016 104

Page 105: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

PROTEUS http://www.proteus-bigdata.com/

a European Union funded research project to improve Apache Flink and mainly to develop two libraries (visualization and online machine learning) on top of Flink core.PROTEUS: Scalable Online Machine Learning by

Rubén Casado at Big Data Spain 2015 • Video: https://www.youtube.com/watch?v=EIH7HLyqhfE

• Slides: http://www.slideshare.net/Datadopter/proteus-h2020-big-data

105

IV. Who is using Apache Flink?

Page 106: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

IV. Who is using Apache Flink? has its hack week and the winner was a Flink based streaming project! December 18, 2015

• Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink. Posted on February 2, 2016 by Jamie Grier http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

did some benchmarks to compare performance of their use case implemented on Apache Storm against Spark Streaming and Flink. Results posted on December 18, 2015http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at 106

Page 107: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

AgendaI. What is Apache Flink stack and how it fits

into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop

and other open source tools for data input and output as well as deployment? 

III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?

IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 

107

Page 108: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

V. Where to learn more about Apache Flink?

1. What is Flink 2016 roadmap? 2. How to get started quickly with Apache

Flink?3. Where to find more resources about

Apache Flink?4. How to contribute to Apache Flink?5. What are some Key Takeaways? 

108

Page 109: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1  What is Flink 2016 roadmap? SQL/StreamSQL and Table APICEP Library: Complex Event Processing library for the

analysis of complex patterns such as correlations and sequence detection from multiple sources https://github.com/apache/flink/pull/1557 January 28, 2015

Dynamic Scaling: Runtime scaling for DataStream programs

Managed memory for streaming operatorsSupport for Apache Mesos

https://issues.apache.org/jira/browse/FLINK-1984

Security: Over-the-wire encryption of RPC (Akka) and data transfers (Netty)

Additional streaming connectors: Cassandra, Kinesis109

Page 110: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

1  What is Flink roadmap? Expose more runtime metrics: Throughput / Latencies,

Backpressure monitoring, Spilling / Out of CoreMaking YARN resource dynamicDataStream API enhancementsDataSet API EnhancementsReferences:

• Apache Flink Roadmap Draft, December 2015https://docs.google.com/document/d/1ExmtVpeVVT3TIhO1JoBpC5JKXm-778DAD7eqw5GANwE/edit

• What’s next? Roadmap 2016. Robert Metzger, January 26, 2016. Berlin Apache Flink Meetup. http://www.slideshare.net/robertmetzger1/january-2016-flink-community-update-roadmap-2016/9

110

Page 111: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

2. How to get started quickly with Apache Flink?

Step-By-Step Introduction to Apache Flinkhttp://www.slideshare.net/sbaltagi/stepbystep-introduction-to-apache-flink

Implementing BigPetStore with Apache Flink http://www.slideshare.net/MrtonBalassi/implementing-bigpetstore-with-apache-flink

Apache Flink Crash Coursehttp://www.slideshare.net/sbaltagi/apache-flinkcrashcoursebyslimbaltagiandsrinipalthepu

Free training from Data Artisans http://dataartisans.github.io/flink-training/

All talks at the Flink Forward 2015http://sparkbigdata.com/102-spark-blog-slim-baltagi/22-all-talks-of-the-2015-flink-forward-conference

111

Page 113: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

4. How to contribute to Apache Flink?

Contributions to the Flink project can be in the form of:• Code• Tests• Documentation• Community participation: discussions, questions,

meetups, … How to contribute guide ( also contains a list of

simple “starter issues”)http://flink.apache.org/how-to-contribute.html

113

Page 114: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

5. What are some key takeaways?1. Although most of the current buzz is about Spark,

Flink offers the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases.

2. With the upcoming release of Apache Flink 1.0, I foresee more adoption especially in use cases with Real-Time stream processing and also fast iterative machine learning or graph processing.

3. I foresee Flink embedded in major Hadoop distributions and supported!

4. Apache Spark and Apache Flink will both have their sweet spots despite their “Me Too Syndrome”!

114

Page 115: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Page 116: Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Thanks!

116

• To all of you for attending!• To Bloomberg for sponsoring this event. • To data Artisans for allowing me to use some of

their materials for my slide deck.• To Capital One for giving me time to prepare and

give this talk. • Yes, we are hiring for our New York City offices

and our other locations! http://jobs.capitalone.com

• Drop me a note at [email protected] if you’re interested.