welcome to the first workshop on big data open source systems...

49
Welcome to the second Workshop on Big data Open Source S ystems (BOSS) September 10th, 2016 Co-located with VLDB 2016 Tilmann Rabl & Sebastian Schelter

Upload: others

Post on 21-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Welcome to the secondWorkshop on Big data Open

Source Systems (BOSS)September 10th, 2016

Co-located with VLDB 2016

Tilmann Rabl & Sebastian Schelter

Page 2: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Hands on Big Data

• 6 parallel tutorials• 6 systems

• Open source• Publicly available

• Presenters• System experts

• Hands on• This is not a demo!

• You can pick two!

Page 3: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

But why?

• Mike Carey • Doing It On Big Data: a Tutorial/Workshop• Driving force

• Other people involved• Volker Markl• Kerstin Forster

• Second instance• Last time: 8 systems• Tell us what you think• Email: [email protected]

MOAR SYSTEMS!

Page 4: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Public Voting

• 9 Submissions, 6 tutorials selected• Google forms vote• 236 votes, 137 individuals• Max 46, min 11 votes

• Did you vote?

Page 5: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Presented Systems

• Apache Flink

• Apache SystemML

• HopsFS & ePipe

• LinkedIn‘s Open Source Analytics Platform

• rasdaman

• RHEEM

Page 6: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Massively Parallel Program• Bulk Synchronous Parallel• People Flow• Danger of skew!

Intr

oduc

tion

& F

lash

Ses

sion

Poly

glot

Coffe

e Br

eak

Lunc

h

Tuto

rials

Part

2Tutorial 1 Part 1Tutorial 2 Part 1Tutorial 3 Part 1Tutorial 4 Part 1Tutorial 5 Part 1Tutorial 6 Part 1

Page 7: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Maple

Heterogeneous Runtime Environment

You are here

Page 8: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Polyglot Session

Big Data processing using Polybase. Karthik Ramachandra (Microsoft Gray Systems Lab)

Multistore Systems: Retrospection on CloudMdsQL. Jose Pereira (Univ. do Minho & INESC)

Exploiting the data center in contemporary commodity boxes: The scaling-in approach. Jignesh Patel (Univ. of Wisconsin-Madison)

LeanBigData: Blending OLTP and OLAP to Deliver Real-Time Analytical Queries. Ricardo Jimenez-Peris (LeanXcale)

Page 9: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Flash Intro

Page 10: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Apache Flink

Page 11: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Kostas KloudasVasia KalavriJonas Traub

Introduction to Stream Processing with Apache Flink®

Page 12: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Overview

What is Stream Processing? What is Apache Flink? Windowed computations over streams Handling time Handling node failures Handling planned downtime Handling code upgrades

12

Page 13: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

A data processing engine

13

Apache Flink is an open source platform for distributed stream and batch processing

Page 14: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

What does Flink provide?

High Throughput and Low Latency Event-time (out-of-order) processing Exactly-once semantics Flexible windowing Fault-Tolerance

14

Page 15: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

The Apache Flink Ecosystem

15

SQL

SQL

Page 16: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Its Users

16

…https://flink.apache.org/poweredby.html

Page 17: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Time for demo…

17

Robust Stream Processing with Apache Flink®: A Simple Walkthroughhttp://data-artisans.com/robust-stream-processing-flink-walkthrough/#more-1181

Page 18: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Apache SystemML

Page 19: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

© 2016 IBM CorporationIBM Research

Apache SystemML: Declarative Large-Scale Machine Learning

Matthias BoehmIBM Research – Almaden

Acknowledgements: A. V. Evfimievski, F. Makari Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda,

M. W. Dusenberry, D. Eriksson, N. Jindal, C. R. Kadner, J. Kim, N. Kokhlikyan, D. Kumar, M. Li, L. Resende, A. Singh, A. C. Surve, G. Weidner, and W. P. Yu

Page 20: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

© 2016 IBM Corporation

Case Study: An Automobile Manufacturer Goal: Design a model to predict car reacquisition

20 IBM Research

Warranty Claims

Repair History

Diagnostic Readouts

Predictive Models

Features MachineLearningAlgorithm

Algorithm

Labels

Algorithm

Algorithm

• Class skew• Low precision

Result: 25x improvement in precision!

Page 21: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

© 2016 IBM Corporation

Common Patterns Across Customers

Algorithm customization

Changes in feature set

Changes in data size

Quick iteration

21 IBM Research

Custom Analytics

Declarative Machine Learning

Page 22: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

© 2016 IBM Corporation

Abstraction: The Good, the Bad and the Ugly

22 IBM Research

q = t(X) %*% (w * (X %*% v))

[adapted from Peter Alvaro: "I See What You Mean“, Strange Loop, 2015]

Simple & Analysis-CentricData Independence

Platform IndependenceAdaptivity

(Missing)Size InformationOperator

Selection

(Missing) Rewrites

Distributed Operations

Distributed Storage

(Implicit) Copy-on-WriteData Skew

Load Imbalance

Latency

Complex Control Flow

Local / Remote Memory Budgets

The Ugly: Expectations ≠ Reality

Understanding of optimizer and runtime techniques underpinning declarative, large-scale ML

Efficiency & Performance

Page 23: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

© 2016 IBM Corporation

Tutorial Outline Case Study and Motivation (Flash) 5min SystemML Overview, APIs, and Tools 30min Common Framework 15min SystemML’s Optimizer (w/ Hands-On-Labs) 45min

23 IBM Research

Page 24: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

HopsFS & ePipe

Page 25: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

HopsFS & ePipeMahmoud Ismail <[email protected]>Gautier Berthou

<[email protected]>

Page 26: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

From HDFS to HopsFS

• Scale to a million operations/sec

• Scale to billions of files/directories

• Search with sub second latency (ePipe)

Page 27: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Tutorial

• Introducing Github style for Hadoop projects (HopsWorks)

• Installation of Hops on AWS using Karamel

• Managing Datasets

• create, attach metadata, and search

• Running sample programs on HopsWorks

Page 28: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Goblin & Pinot

Page 29: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Open Source Analytics Pipeline at LinkedIn

Issac Buenrostro Jean François ImBOSS Workshop, 2016

Page 30: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Large Scale Analytics

1. Analyze many TB data daily.2. Multiple, heterogeneous sources, with varying data

quality.3. Fast querying for offline and real-time needs. 4. Integrate with other data processing jobs (MR, Hive,

Spark, etc.).5. Fault tolerance, scalability, manageability, …

Page 31: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Solution: Gobblin + Pinot

• Universal data ingestion framework.

• Extract, transform, quality check, and write data from/to a large variety of data storage technologies: HDFS, S3, Kafka, JDBC, Rest, …

• Distributed near-realtime OLAP data store.

• Index and combine data from offline data sources (e.g. Hadoop) and real time data sources (e.g. Kafka).

• SQL query interface.

P not

Page 32: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

In This Workshop

32

Kafka

REST

FileSystem QueryApp

Page 33: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Find out more:

©2015 LinkedIn Corporation. All Rights Reserved. 33

https://github.com/linkedin/gobblinhttp://gobblin.readthedocs.io/[email protected]

P not

https://github.com/linkedin/[email protected]

https://engineering.linkedin.com/

Page 34: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

rasdaman

Page 35: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

rasdaman @ BOSS’16 :: Dimitar Mišev

rasdaman @ BOSS’16New Delhi, India, 09-sep-2016

Dimitar Mišev <[email protected]>Jacobs University | rasdaman GmbH

[gamingfeeds.com]Research supported by EU through EarthServer

Page 36: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

rasdaman @ BOSS’16 :: Dimitar Mišev

Array Analytics Research @ Jacobs U

Large-Scale Scientific Information Systems research group• Flexible, scalable n-D array services• www.jacobs-university.de/lsis

Most visible results:• Pioneer Array DBMS, rasdaman• Standardization: OGC Big Geo Data, ISO SQL

Page 37: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

rasdaman @ BOSS’16 :: Dimitar Mišev

rasdaman: Agile Array Analytics

„raster data manager“: n-D arrays in SQL• [VLDB 1994, VLDB 1997, SIGMOD 1998, VLDB 2003, …]

Array Algebra [NGITS 1998]• SQL/MDA [SSDBM 2014, DOLAP 2015]

Scalable, parallel „tile streaming“ architecture

130+ TB installations in operational use

Page 38: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

rasdaman @ BOSS’16 :: Dimitar Mišev

Tutorial outline

Installation & deployment• RPM/DEB, VM download, build from source

Data modelling and concepts• What kind of data is supported?

Query language• Typical array analytics queries, hands on

Storage management• Single array datacubes can reach hundreds of TB• Learn how rasdaman scales to such volumes

Domain application: Geo services

Page 39: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

RHEEM

Page 40: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 41: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 42: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 43: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 44: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 45: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 46: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 47: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 48: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open
Page 49: Welcome to the first Workshop on Big data Open Source Systems (BOSS)boss.dima.tu-berlin.de/media/BOSS16-Opening.pdf · 2016-09-28 · Welcome to the second Workshop on Big data Open

Let‘s go!

Maple

Intro & Polyglot