proteus h2020

PROTEUS
Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization
BONAVENTURA DEL MONTE
RESEARCHER @DFKI GMBH
PH.D. STUDENT @TU BERLIN
EUROPRO WORKSHOP, EDBT 2017
This project is funded by the European Union. Horizon 2020

Post on 16-Apr-2017

Page 2: PROTEUS H2020

[Figure: the five Vs of Big Data: Volume, Velocity, Variety, Veracity, Value (€€€ ???)]

Page 3: PROTEUS H2020

Page 4: PROTEUS H2020

PROTEUS is an EU H2020-funded research project that aims to design, develop, and provide an open-source, ready-to-use Big Data solution able to perform real-time interactive analytics and predictive analysis through massive online machine learning, efficiently dealing with extremely large historical data and data streams

Page 5: PROTEUS H2020

CONTENTS

1. PROJECT DETAILS
2. VALIDATION SCENARIO
3. HYBRID PROCESSING ENGINE
4. SCALABLE ONLINE MACHINE LEARNING
5. REAL-TIME INTERACTIVE VISUAL ANALYTICS
6. CONCLUSION

Page 6: PROTEUS H2020

Project Consortium

Page 7: PROTEUS H2020

Project details
Expected Outcomes

Hybrid processing
  Batch & Stream processing engine
  Declarative language for batch & stream analytics

Scalable Online Machine Learning
  SOLMA Library

Real-time interactive Visual Analytics
  Web charts library
  Incremental engine for interactive analytics

Business Impact
  Validation in a realistic industrial use case

Page 8: PROTEUS H2020

Hot Strip Mill: Big Data scenario

Page 9: PROTEUS H2020

System Architecture

Page 10: PROTEUS H2020

Hybrid Processing

Smoother processing of data streams and historical data in the same Flink job

A declarative language for batch and streaming analytics
  ETL and ML pipelines expressed in a unified language are holistically optimized

[Figure: pipeline with stages "Gather and clean sensor data", "PCA", "Train ML Model"; labels D1, D2, D3]

"Bridging the Gap: Towards Optimizations across Linear and Relational Algebra". Andreas Kunft, Alexander Alexandrov, Asterios Katsifodimos, Volker Markl. BeyondMR workshop @SIGMOD 2016.
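
The idea of handling historical data and a live stream in one job can be sketched in plain Python. This is only an illustration under stated assumptions, not the PROTEUS engine or the Flink API: the names `build_baseline` and `hybrid_job`, and the `(sensor_id, value)` record layout, are hypothetical. The bounded historical input builds per-sensor state, and the unbounded input is then enriched with that state inside the same function.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float]  # (sensor_id, value); hypothetical record layout


def build_baseline(historical: Iterable[Reading]) -> Dict[str, float]:
    """Batch part: per-sensor mean over the bounded historical data."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sensor, value in historical:
        sums[sensor] += value
        counts[sensor] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def hybrid_job(
    historical: Iterable[Reading], live: Iterable[Reading]
) -> Iterator[Tuple[str, float, float]]:
    """Stream part: emit (sensor, value, deviation from historical mean)."""
    baseline = build_baseline(historical)
    for sensor, value in live:
        reference = baseline.get(sensor)
        if reference is None:
            continue  # no history for this sensor: skip it in this sketch
        yield sensor, value, value - reference
```

Because `hybrid_job` is a generator, `live` may be an unbounded iterator; results are produced one reading at a time.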

Page 11: PROTEUS H2020

Scalable Online Machine Learning
ML challenge: Distributed Data Streams

The current state of the art in machine learning for Big Data is dominated by offline learning algorithms that process data-at-rest

Many current data sources are streaming (online, data-in-motion): sensors, social networks, clickstreams, etc.

In online learning, the algorithms see the data only once. The traditional meaning of "online" is that data is processed sequentially, one item at a time, but for many epochs

Prequential evaluation
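
Prequential (test-then-train) evaluation scores each incoming example with the current model before that example is used for training, so every item is seen exactly once. A minimal sketch in Python, assuming a simple perceptron as the learner; `OnlinePerceptron` and `prequential_accuracy` are illustrative names and not part of SOLMA.

```python
from typing import Iterable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {0, 1})


class OnlinePerceptron:
    """Binary perceptron that is updated one example at a time."""

    def __init__(self, n_features: int) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b: float = 0.0

    def predict(self, x: Sequence[float]) -> int:
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn(self, x: Sequence[float], y: int) -> None:
        error = y - self.predict(x)  # -1, 0 or +1
        if error != 0:  # update only on a mistake
            self.w = [wi + error * xi for wi, xi in zip(self.w, x)]
            self.b += error


def prequential_accuracy(model: OnlinePerceptron, stream: Iterable[Example]) -> float:
    """Test on each example first, then train on it; return the accuracy."""
    correct = 0
    seen = 0
    for x, y in stream:
        if model.predict(x) == y:  # test with the model as it is now
            correct += 1
        model.learn(x, y)          # then train on the same example
        seen += 1
    return correct / seen if seen else 0.0
```

The score includes the early mistakes made before the model has learned anything, which is the intended behaviour of this evaluation scheme.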

Page 12: PROTEUS H2020

Real-time Interactive Visual Analytics
How to interactively visualize Big Data?

Incremental Analytics engine: incremental partial results in ~O(1)

Visualization Layer: SSR-enabled web-based library seamlessly connected to the Incremental Analytics engine

https://github.com/proteus-h2020/proteic
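
To illustrate what "incremental partial results in ~O(1)" can mean, here is a small Python sketch of an aggregate that is updated in constant time per event and yields a fresh partial result after every update, which a chart could redraw. It is an assumption-laden example, not the PROTEUS incremental engine; `IncrementalMean` and `partial_results` are hypothetical names.

```python
from typing import Iterable, Iterator


class IncrementalMean:
    """Running mean with O(1) time and memory per update."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        # Shift the mean towards the new value; no past values are stored.
        self.mean += (value - self.mean) / self.count
        return self.mean


def partial_results(values: Iterable[float]) -> Iterator[float]:
    """Yield the current mean after each incoming value."""
    aggregate = IncrementalMean()
    for value in values:
        yield aggregate.update(value)
```

The cost of each update does not depend on how many events have been seen so far, which is what lets partial results keep pace with the input.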

Page 13: PROTEUS H2020

Conclusions

PROTEUS is an EU H2020 international research project

PROTEUS will contribute to the Big Data ecosystem with:
  An innovative hybrid engine for processing both data-at-rest and data-in-motion
  SOLMA: a new library for scalable online machine learning
  Big Data visualization guidelines: new ways of presenting and working with Big Data
  Real-time interactive visualization technology: incremental engine & web-based library

PROTEUS will validate its innovations in a realistic industrial scenario

PROTEUS will provide full-scale evaluation and impact assessment, including:
  Benchmarks, KPIs and anonymized datasets
  Specific metrics for the ArcelorMittal use case
  Generic indicators on the advancements in scalable machine learning, hybrid computation and real-time interactive visual analytics

Page 14: PROTEUS H2020

Thanks for your attention!
Questions?

Contact us: Bonaventura Del Monte
bonaventura dot delmonte at dfki dot de
www.dfki.berlin

www.proteus-bigdata.com
www.github.com/proteus-h2020

Page 15: PROTEUS H2020

Extra Slides

Page 16: PROTEUS H2020

Apache Flink 101

Massively parallel data flow engine with unified batch and stream processing

Rich set of operators (including native iteration)

Flink Optimizer
  Inspired by optimizers of parallel database systems
  Physical optimization follows a cost-based approach

Memory Management
  Flink manages its own memory
  Never breaks the JVM heap

Page 17: PROTEUS H2020

Scalable Online Machine Learning
PROTEUS contribution: SOLMA

User-friendly Extensibility

Basic scalable stream sketches that make it possible to query the stream

Iterative algorithms for approximating the outcome of offline computation

Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
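
As one example of a stream sketch that can be queried, the following Python code shows a Count-Min sketch, a well-known structure for approximate frequency counts in fixed memory. The slides do not say which sketches SOLMA provides, so this is an illustrative choice; the class name, the default sizes and the SHA-256-based hashing are assumptions made for this example.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate per-item counts over a stream in fixed memory."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One deterministic hash per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum over rows
        # never underestimates the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

A query may overestimate a count when items collide in every row, but it never underestimates; increasing `width` and `depth` reduces the error at the cost of more memory.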