Data Intensive Applications with Apache Flink

Milan – July 13 2016 Data Intensive Applications with Apache Flink Simone Robutti Machine Learning Engineer at Radicalbit @SimoneRobutti

Upload: simone-robutti

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Data Intensive Applications with Apache Flink

Milan – July 13 2016

Data Intensive Applications with Apache Flink

Simone Robutti

Machine Learning Engineer at Radicalbit

@SimoneRobutti

Page 2: Data Intensive Applications with Apache Flink

Agenda

1. Brief Introduction to Apache Flink

○ Why

○ What

○ How

2. Machine Learning on Flink

○ Present landscape

○ Future of the Ecosystem

3. Closing notes on Radicalbit (shameless plug ahead)

Page 3: Data Intensive Applications with Apache Flink

100% Buzzword-free guaranteed

Big Data

Machine Intelligence

Web-scale

400x

It’s like the human brain

Exactly-once

Exactly-once

Page 4: Data Intensive Applications with Apache Flink

Why Flink (and not Spark/Storm/Samza...)

Because it’s

production-ready

streaming-first

low-latency

fault-tolerant

high-throughput

processing engine

Page 5: Data Intensive Applications with Apache Flink

Flink: what is it?

From Flink’s Documentation

Page 6: Data Intensive Applications with Apache Flink

Connectors and integrations

Page 7: Data Intensive Applications with Apache Flink

Flink’s Runtime

From Flink’s Documentation

Page 8: Data Intensive Applications with Apache Flink

Flink’s DataFlow

From Flink’s Documentation

Written by the user through DataSet/DataStream API

Compiled and optimized in the client

Page 9: Data Intensive Applications with Apache Flink

Flink’s DataFlow

From Flink’s Documentation

The compiled job is translated to distributed tasks by the master and executed by workers
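The two DataFlow slides describe a program that is first declared, then split into tasks and run by workers. A minimal plain-Python sketch of that idea follows. The `Plan` class, its `map`/`filter`/`execute` methods, and the `parallelism` parameter are hypothetical names chosen for illustration; they are not Flink's API, and the "workers" here run one after another in a single process.

```python
class Plan:
    """Records operators instead of running them (hypothetical, not Flink's API)."""

    def __init__(self, source, ops=()):
        self.source = list(source)
        self.ops = tuple(ops)

    def map(self, fn):
        # Declaring an operator only extends the plan; nothing is computed yet.
        return Plan(self.source, self.ops + (("map", fn),))

    def filter(self, fn):
        return Plan(self.source, self.ops + (("filter", fn),))

    def execute(self, parallelism=2):
        # "Master" step: split the source into one partition per worker.
        partitions = [self.source[i::parallelism] for i in range(parallelism)]
        # "Worker" step: each partition goes through the full operator chain.
        results = []
        for part in partitions:
            results.extend(self._run(part))
        return results

    def _run(self, records):
        out = list(records)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(r) for r in out]
            else:
                out = [r for r in out if fn(r)]
        return out


plan = Plan([1, 2, 3, 4, 5, 6]).map(lambda x: x * 10).filter(lambda x: x > 20)
result = sorted(plan.execute(parallelism=2))
```

Nothing is computed until `execute` is called, which mirrors the separation between writing the job and running it. The result is sorted because partitioning changes the order in which records come back.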

Page 10: Data Intensive Applications with Apache Flink

Machine Learning on Flink

Page 11: Data Intensive Applications with Apache Flink

Ready and awesome for parallel ML

Work in progress for distributed ML

ML on Flink

Page 12: Data Intensive Applications with Apache Flink

Flink for Model Evaluation Pipelines

Diagram stages: Source, Data Preparation, Evaluation, Postprocessing, Sink (the diagram shows two Source boxes)

Composable, modular Flink Operator
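The stages on this slide can be sketched as small functions that each take a stream and return a stream, so that any stage can be swapped or tested alone. This is a plain-Python illustration, not Flink code: the stage functions, the `compose` helper, the scaling factor, and the threshold "model" are all assumptions made up for the example.

```python
from functools import reduce


def prepare(records):
    """Data preparation: drop missing values and scale to [0, 1]."""
    for r in records:
        if r is not None:
            yield r / 100.0


def evaluate(records):
    """Evaluation: apply a stand-in model (a fixed threshold)."""
    for x in records:
        yield (x, x >= 0.5)


def postprocess(records):
    """Postprocessing: turn raw predictions into readable labels."""
    for x, positive in records:
        yield f"{x:.2f} -> {'high' if positive else 'low'}"


def compose(*stages):
    """Chain stages left to right into a single stage."""
    return lambda records: reduce(lambda stream, stage: stage(stream), stages, records)


pipeline = compose(prepare, evaluate, postprocess)
source = [10, None, 50, 90]      # stand-in for a source operator
sink = list(pipeline(source))    # stand-in for a sink operator
```

Because every stage has the same shape, replacing `evaluate` with a different model does not touch the preparation or postprocessing code.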

Page 13: Data Intensive Applications with Apache Flink

Evaluation with Flink-JPMML

Diagram labels: Source Operator (two boxes), Data Preparation, Flink-JPMML Operator, model.pmml, Sink Operator

Small library that implements basic model evaluation operations on top of JPMML (GitLab)
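To make the role of `model.pmml` concrete, here is a rough sketch of an evaluation step that parses a model description once and then scores records one at a time. It uses only the Python standard library and does not use JPMML or Flink-JPMML. The embedded fragment is a simplified, namespace-free stand-in for a PMML regression model (real PMML files carry a namespace, a data dictionary, and more), and `load_model` and `make_evaluator` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for model.pmml (assumption: no namespace, regression only).
MODEL_PMML = """
<PMML>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(pmml_text):
    """Extract the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def make_evaluator(pmml_text):
    """Parse the model once and return a per-record scoring function."""
    intercept, coefficients = load_model(pmml_text)

    def score(record):
        return intercept + sum(c * record[name] for name, c in coefficients.items())

    return score


score = make_evaluator(MODEL_PMML)
records = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 4.0}]
predictions = [score(r) for r in records]
```

Parsing the model outside the per-record function is the design point: in a streaming operator the model would be loaded once when the operator starts, not once per event.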

Page 14: Data Intensive Applications with Apache Flink

“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”

- Yann LeCun

ML on Flink

Page 15: Data Intensive Applications with Apache Flink
Page 16: Data Intensive Applications with Apache Flink

FlinkML

What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …)

Status: early phase, slow development

Page 17: Data Intensive Applications with Apache Flink

FlinkML

Pro: available out of the box, written with Flink API

Cons: reinvents the wheel, only a few algorithms, no model persistence

Page 18: Data Intensive Applications with Apache Flink

Samsara

What: Linear algebra framework

Status: mature

Page 19: Data Intensive Applications with Apache Flink

Samsara

Pro: generic algorithms with platform-specific bindings, skilled community

Cons: covers only a few use cases

Page 20: Data Intensive Applications with Apache Flink

SAMOA

What: Online learning algorithm framework (VHT, AMR, …)

Status: early phase, complicated relationship with the industry

Page 21: Data Intensive Applications with Apache Flink

SAMOA

Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka)

Cons: not production ready, academic focus

Page 22: Data Intensive Applications with Apache Flink

ML on Flink: the future of the ecosystem

Page 23: Data Intensive Applications with Apache Flink

Apache Beam

Programming model for data processing pipelines

● Streaming first, batch as a bounded stream

● Layered API: What, Where, When, How

● Platform agnostic: same program, different runners
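The "batch as a bounded stream" point can be illustrated with a short plain-Python sketch in which one transformation is written once and fed either a finite collection or an endless generator. This is not the Beam SDK: `word_lengths` and `endless_words` are hypothetical names, and the sketch does not model runners, windowing, or triggers.

```python
from itertools import count, islice


def word_lengths(stream):
    """The 'program': written once, unaware of whether its input ever ends."""
    for word in stream:
        yield (word, len(word))


# Bounded input (batch): a finite collection, so the full result can be collected.
batch = ["flink", "beam", "spark"]
batch_result = list(word_lengths(batch))


# Unbounded input (streaming): an endless generator, so only a prefix is observed.
def endless_words():
    for i in count():
        yield batch[i % len(batch)]


stream_prefix = list(islice(word_lengths(endless_words()), 5))
```

The transformation is identical in both cases; only the source and the way results are consumed differ, which is the sense in which a batch job is a stream that happens to end.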

Page 24: Data Intensive Applications with Apache Flink

Apache Beam - Runners

● Flink

● Spark (Partial)

● Google Cloud Dataflow

● Plain Java

● Gearpump (WIP)

● Apex (WIP)

Page 25: Data Intensive Applications with Apache Flink

BeamML: a runner-agnostic ML library

Page 26: Data Intensive Applications with Apache Flink

FlinkML Roadmap

● More algorithms!

● Evaluation framework

● Persistence/export

● Online Learning Framework

Page 27: Data Intensive Applications with Apache Flink

Proteus

Online Learning Platform - based on Flink

Source: Proteus’ website

Page 28: Data Intensive Applications with Apache Flink

The role of Radicalbit

Page 29: Data Intensive Applications with Apache Flink

Contributions

● Cassandra Connector

● Scala API extensions

● FlinkML (Linear Algebra Framework, MinHash)

● Akka Connector

Page 30: Data Intensive Applications with Apache Flink

Our vision

Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput

To achieve this:

● Ambitious applications (aim for real-time services)

● Reliable distributed online learning (Proteus?)

● A Pipelining Framework (experiment fast, increase testability and modularity)

Page 31: Data Intensive Applications with Apache Flink

Q&A

Page 32: Data Intensive Applications with Apache Flink

THANKS!

Simone Robutti

Mail: [email protected]

Medium: @simone.robutti

Twitter: @SimoneRobutti, @weareradicalbit