Data Intensive Applications with Apache Flink - Simone Robutti, Radicalbit

Milan – July 13 2016

Data Intensive Applications with Apache Flink

Simone Robutti

Machine Learning Engineer at Radicalbit

@SimoneRobutti

Agenda

1. Brief Introduction to Apache Flink

○ Why

○ What

○ How

2. Machine Learning on Flink

○ Present landscape

○ Future of the Ecosystem

3. Closing notes on Radicalbit (shameless plug ahead)

100% Buzzword-free guaranteed

Big Data

Machine Intelligence

Web-scale

400x

It’s like the human brain

Exactly-once

Why Flink (and not Spark/Storm/Samza...)

Because it’s

production-ready

streaming-first

low-latency

fault-tolerant

high-throughput

processing engine

Flink: what is it?

From Flink’s Documentation

Connectors and integrations

Flink’s Runtime

Flink’s DataFlow

Written by the user through DataSet/DataStream API

Compiled and optimized in the client

Flink’s DataFlow

The compiled job is translated to distributed tasks by the master and executed by workers
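The flow described on these two slides (declare a job, compile it on the client, let the master split it into tasks, run the tasks on workers) can be sketched in plain Python. This is only an illustration of the idea, not Flink's API: the `Plan` class, its `compile` and `execute` methods, and the use of threads as "workers" are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


class Plan:
    """A lazily built chain of element-wise steps (hypothetical, not Flink's API)."""

    def __init__(self, source, steps=()):
        self.source = list(source)
        self.steps = tuple(steps)

    def map(self, fn):
        # Declaring a step only records it; nothing is computed yet.
        return Plan(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return Plan(self.source, self.steps + (("filter", pred),))

    def compile(self):
        """'Client' side: fuse all recorded steps into one function per partition."""
        steps = self.steps

        def task(partition):
            out = []
            for item in partition:
                keep = True
                for kind, fn in steps:
                    if kind == "map":
                        item = fn(item)
                    elif not fn(item):
                        keep = False
                        break
                if keep:
                    out.append(item)
            return out

        return task

    def execute(self, parallelism=2):
        """'Master' side: split the input and hand one task to each worker."""
        task = self.compile()
        partitions = [self.source[i::parallelism] for i in range(parallelism)]
        with ThreadPoolExecutor(max_workers=parallelism) as workers:
            results = list(workers.map(task, partitions))
        return [item for part in results for item in part]


# Nothing runs until execute() is called.
plan = Plan(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = sorted(plan.execute(parallelism=3))
print(result)
```

The point of the sketch is the separation of concerns: the user code only describes the job, while fusing the steps and distributing the partitions happen later and elsewhere.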

Machine Learning on Flink

Ready and awesome for parallel ML

Work in progress for distributed ML

ML on Flink

Flink for Model Evaluation Pipelines

Source

Data Preparation

Evaluation

Sink

Source

Postprocess

Composable, modular Flink Operator
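A minimal sketch of what "composable, modular" stages could look like, written in plain Python rather than with Flink operators. The stage order (preparation, evaluation, postprocessing), the record layout, and the threshold "model" are assumptions made for illustration; they do not come from the slides.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right; each stage maps an iterable to an iterable."""
    return lambda records: reduce(lambda data, stage: stage(data), stages, records)


def prepare(records):
    # Data preparation: drop incomplete records, scale the feature (assumed range 0..100).
    return ({"x": r["x"] / 100.0} for r in records if r.get("x") is not None)


def evaluate(records):
    # Evaluation: apply an already trained model; here a made-up threshold rule.
    return ({**r, "score": 1.0 if r["x"] > 0.5 else 0.0} for r in records)


def postprocess(records):
    # Postprocess: turn raw scores into labels.
    return ("positive" if r["score"] == 1.0 else "negative" for r in records)


pipeline = compose(prepare, evaluate, postprocess)

source = [{"x": 80}, {"x": None}, {"x": 20}]  # stand-in for a real source
sink = list(pipeline(source))                 # stand-in for a real sink
print(sink)
```

Because every stage has the same shape (records in, records out), a stage can be swapped, tested in isolation, or reused in another pipeline without touching the others.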

Evaluation with Flink-JPMML

Source Operator

Flink-JPMML Operator

Sink Operator

Source Operator

model.pmml

Small library that implements basic model evaluation operations on top of JPMML (Gitlab)
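To give a feel for what evaluating an exported model involves, here is a sketch that reads a tiny regression model and scores records with it. It does not use JPMML or the Flink-JPMML library: the XML below is a simplified, namespace-free fragment in the style of a PMML regression model, and `load_model` and `predict` are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Simplified fragment in the style of a PMML regression model (assumption:
# a real model.pmml also carries a namespace, a data dictionary and a mining schema).
MODEL_PMML = """
<PMML>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(pmml_text):
    """Parse the fragment once and return a function that scores one record."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }

    def predict(record):
        # Linear model: intercept + sum of coefficient * feature value.
        return intercept + sum(c * record[name] for name, c in coefficients.items())

    return predict


predict = load_model(MODEL_PMML)
records = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 4.0}]
predictions = [predict(r) for r in records]
print(predictions)
```

The model is loaded once and then applied record by record, which is the shape an evaluation operator sitting between a source and a sink would need.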

Data Preparation

“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”

- Yann LeCun

ML on Flink

FlinkML

What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg, …)

Status: early phase, slow development

FlinkML

Pro: available out of the box, written with Flink API

Cons: reinvents the wheel, only a few algorithms, no model persistence

Samsara

What: Linear algebra framework

Status: mature

Samsara

Pro: generic algorithms with platform-specific bindings, skilled community

Cons: covers only a few use cases

SAMOA

What: Online learning algorithm framework (VHT, AMR, …)

Status: early phase, complicated relationship with the industry

SAMOA

Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka)

Cons: not production ready, academic focus

ML on Flink: the future of the ecosystem

Apache Beam

Programming model for data processing pipelines

● Streaming first, batch as a bounded stream

● Layered API: What, Where, When, How

● Platform agnostic: same program, different runners
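The "batch as a bounded stream" idea can be illustrated without Beam itself. In the sketch below the same windowing function consumes either a finite list (batch) or an endless generator (stream). It is not Beam's API: `windowed_sums` is a hypothetical helper, and it assumes events arrive in timestamp order with no late data, which a real runner cannot assume.

```python
from itertools import count, islice


def windowed_sums(events, window_size):
    """Yield (window_start, total) for in-order (timestamp, value) events.

    What: a sum of values. Where: fixed event-time windows.
    When: a window is emitted as soon as an event from a later window arrives.
    """
    current, total = None, 0
    for timestamp, value in events:
        start = (timestamp // window_size) * window_size
        if current is not None and start != current:
            yield current, total
            total = 0
        current = start
        total += value
    if current is not None:
        # The end of a bounded input closes the last window.
        yield current, total


# Batch: a bounded input, consumed to the end.
batch = list(windowed_sums([(1, 10), (4, 5), (12, 7), (25, 1)], window_size=10))

# Stream: an unbounded input; only the first two completed windows are taken.
stream = ((t, 1) for t in count())
stream_head = list(islice(windowed_sums(stream, window_size=10), 2))

print(batch)
print(stream_head)
```

The function never needs to know whether its input ends, which is the property that lets one program serve both the batch and the streaming case.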

Apache Beam - Runners

● Flink

● Spark (Partial)

● Google Cloud Dataflow

● Plain Java

● Gearpump (WIP)

● Apex (WIP)

BeamML: a runner-agnostic ML library

FlinkML Roadmap

● More algorithms!

● Evaluation framework

● Persistence/export

● Online Learning Framework

Proteus

Online Learning Platform - based on Flink

Source: Proteus’ website

The role of Radicalbit

Contributions

● Cassandra Connector

● Scala API extensions

● FlinkML (Linear Algebra Framework, MinHash)

● Akka Connector

Our vision

Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput

To achieve this:

● Ambitious applications (aim for real-time services)

● Reliable distributed online learning (Proteus?)

● A Pipelining Framework (experiment fast, increase testability and modularity)

THANKS!

Simone Robutti

Mail: simone.robutti@radicalbit.io

Medium: @simone.robutti

Twitter: @SimoneRobutti, @weareradicalbit
