new challenges for big data analytics ericsson sztaki ... · © sztaki 2016 streamline h2020 new...

© SZTAKI 2016 SZTAKI 2015. New challenges for Big Data Analytics Ericsson – SZTAKI collaboration Institute for Computer Science and Control Hungarian Academy of Sciences Andras Benczur, head, Informatics Laboratory „Big Data – Momentum” research group [email protected]

Post on 30-Aug-2019


Page 1: New challenges for Big Data Analytics Ericsson SZTAKI ... · © SZTAKI 2016 STREAMLINE H2020 New initiative on top of Apache Flink A "use-case complete" framework to unify batch and

© SZTAKI 2016 SZTAKI 2015.

New challenges for Big Data Analytics Ericsson – SZTAKI collaboration Institute for Computer Science and Control

Hungarian Academy of Sciences

Andras Benczur, head, Informatics Laboratory

„Big Data – Momentum” research group

[email protected]

Page 2:

Challenges addressed in Ericsson-SZTAKI cooperation

Immediate implementation challenges

• Access to Distributed Storage

– Measuring systems for real time querying

– Dynamic repartitioning during distributed execution

– Data locality considerations

• System configuration

– HDFS write partitioning:

• optimize block size for data

• optimal replication and data distribution policies

– Hadoop/YARN cluster: use all resources through appropriate worker configuration

– Simplify the data processing pipeline

– Caching (largest gain)

– Keep Spark warm and reuse it for faster process start

Methods that do not really exist yet, although many claim to have them

• Stream mining

– Predict on the fly, by using batch trained models?

– Train models on the fly?

• Architectures

– Lambda: Combine a batch layer (e.g. Hadoop) and a streaming layer (e.g. Storm)?

– Kappa: Or just use Spark Streaming, no batch at all?

– Or use the Flink unified dataflow engine, which can do both batch and streaming?
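
The Lambda option above can be pictured as two views that are merged at query time: a batch view recomputed from the full log, and a speed view holding only what arrived since the last batch run. The following is a minimal Python sketch under that reading. In-memory counters stand in for the Hadoop and Storm layers, and the class and method names are hypothetical, not part of any framework named in the slides.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.master_log = []         # append-only log of every event key
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # increments seen since the last batch run

    def ingest(self, key):
        # Every event goes to the log (batch input) and to the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # The batch layer recomputes the full view; the speed view is reset
        # because the batch view now covers everything in the log.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # The serving layer merges both views.
        return self.batch_view[key] + self.speed_view[key]
```

In the Kappa reading, only the streaming path would remain, and recomputation would be done by replaying the log through it.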

Page 3:

Big Data Machine Learning

Session Drop (2014)

• Periodic radio channel physical parameter measurements

• Predict abnormal termination

QoE time series in SSL traffic (WIP)

• Predict events and errors in the terminal from IP traffic

Record 1:
  periods_bytes: [363.0, 0.0, 227.0, 0.0, 33.0, 0.0, 131.0] ...
  periods_bytes_upload: [226.0, 0.0, 227.0, 0.0, 0.0, 0.0, 98.0] ...
  periods_end: [1458300345.7, 1458300345.71, ...
  periods_flow_cache: [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0] ...
  periods_start: [1458300345.59, 1458300345.7, ...
  protocol: (empty)
  service_provider: Facebook
  start: 1458300345.59

Record 2:
  periods_bytes: [52.0, 0.0, 33.0, 0.0, 52.0, 0.0, 33.0] ...
  periods_bytes_upload: [52.0, 0.0, 0.0, 0.0, 52.0, 0.0, 0.0] ...
  periods_end: [1458300355.23, 1458300355.36, ...
  periods_flow_cache: [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0] ...
  periods_start: [1458300355.21, 1458300355.23, ...
  protocol: (empty)
  service_provider: Facebook
  start: 1458300355.21

Record 3:
  periods_bytes: [1272.0]
  periods_bytes_upload: [861.0]
  periods_end: [1458300440.44]
  periods_flow_cache: [4.0]
  periods_start: [1458300440.07]
  protocol: Android C2DM/GCM
  service_provider: Google
  start: 1458300440.07

Timestamp | delay | event type | errors | service provider
1458301297 | | open app | | YouTube
1458301302 | 12 | start new video (advertisement) | | YouTube
1458301320 | 1 | skip ad to video | | YouTube
1458301362 | 30 | video freeze | connection lost error | YouTube
1458301392 | 30 | retry | connection lost error | YouTube
1458301588 | 2 | open app | | Facebook
1458301599 | | scroll | | Facebook
1458301683 | 3 | click on link | | index.hu
1458301686 | 10 | page rendering started | | index.hu
1458301707 | 0 | scroll down | | index.hu
1458301884 | 10 | click on link | | index.hu
1458301894 | | page rendering started | not fully completed | index.hu

[Figure legend: Drop, No Drop]

Page 4:

STREAMLINE H2020

New initiative on top of Apache Flink

A "use-case complete" framework to unify batch and stream processing

• DFKI (DE)

• SICS (SE)

• Portugal Telecom (PT)

• Internet Memory (FR)

• Rovio (FI)

• NMusic (PT)

• SZTAKI (HU)

Page 5:

Machine learning: Batch or Online or Updateable?

Currently, first experiments with recommender systems

Batch

• Repeatedly read all training data multiple times

• Stochastic gradient: random order

Online

• Read training data only once, no chance to store

• Update immediately with large learning rate

Batch + Online

• Linear combination

• Challenge to implement in a software stack

Sampling from the past

• Partially reuse recent data in a streaming solution

• Can be implemented by using sliding windows
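
The "Batch + Online" and "Sampling from the past" items can be combined in a few lines. The sketch below is illustrative only: it uses a plain linear model with squared-error SGD rather than a recommender, and the class names, the blending weight `alpha`, the window size and the number of replays are all assumptions made for the example.

```python
import random
from collections import deque


class OnlineLinearModel:
    """Linear model trained by one-pass SGD on squared error."""

    def __init__(self, dim, learning_rate=0.1):
        self.w = [0.0] * dim
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        # Gradient step for 0.5 * (prediction - y)^2.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


class BlendedModel:
    """Linear combination of a fixed batch model and an online model.

    The online part also replays a few examples sampled from a sliding
    window of recent data ("sampling from the past").
    """

    def __init__(self, batch_predict, online, alpha=0.5,
                 window=100, replays=2, seed=0):
        self.batch_predict = batch_predict  # callable trained offline
        self.online = online
        self.alpha = alpha                  # weight of the batch prediction
        self.recent = deque(maxlen=window)  # sliding window of (x, y)
        self.replays = replays
        self.rng = random.Random(seed)

    def predict(self, x):
        return (self.alpha * self.batch_predict(x)
                + (1 - self.alpha) * self.online.predict(x))

    def observe(self, x, y):
        # Update immediately on the new example ...
        self.online.update(x, y)
        # ... then partially reuse recent data from the window.
        for _ in range(min(self.replays, len(self.recent))):
            px, py = self.rng.choice(self.recent)
            self.online.update(px, py)
        self.recent.append((x, y))
```

Usage: call `predict(x)` before `observe(x, y)` for each arriving example, so that every prediction is made before the model has seen the label.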

Page 6:

Tracing under Spark by piggybacking

• Replacement for checkpointing records and building the traces offline with an external service

• We attach traces on-the-fly to records and pre-aggregate tracing-graphs for a fast, convenient and real-time result

Goal: Enable Spark to integrate with the internal tracing infrastructure of Ericsson’s Ark.

Despite the difficulty of the problem, the results exceeded expectations: full support for Spark’s Scala API.

SZTAKI came up with two solutions to the problem in Spark.

At Spark Summit next week

Sample graph piggybacked to a record in Spark

• 𝑇𝑖 can be any Spark operator

• We attach useful metrics to edges and vertices (current load, time spent at task, …)

[Figure: sample graph of Spark operators 𝑇𝑖 with numeric metric labels]
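
The piggybacking idea can be illustrated outside Spark: each record carries its own trace, every operator appends one hop to it, and the traces are folded into a graph with metrics on vertices and edges. This is a minimal plain-Python sketch under those assumptions. `TracedRecord`, `traced` and `aggregate_traces` are hypothetical names, and nothing here reflects the actual Spark integration.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TracedRecord:
    value: object
    # One (operator_name, seconds_spent) entry per operator visited.
    trace: list = field(default_factory=list)


def traced(name, fn):
    """Wrap a per-record function so that it appends its hop to the trace."""
    def wrapper(record):
        start = time.perf_counter()
        out = fn(record.value)
        elapsed = time.perf_counter() - start
        return TracedRecord(out, record.trace + [(name, elapsed)])
    return wrapper


def aggregate_traces(records):
    """Fold per-record traces into a graph: edge counts and time per vertex."""
    edges = Counter()
    vertex_time = Counter()
    for record in records:
        names = [name for name, _ in record.trace]
        for name, seconds in record.trace:
            vertex_time[name] += seconds
        for src, dst in zip(names, names[1:]):
            edges[(src, dst)] += 1
    return edges, vertex_time
```

For example, with `t1 = traced("T1", lambda v: v + 1)` and `t2 = traced("T2", lambda v: v * 2)`, passing three records through `t2(t1(...))` yields an edge `("T1", "T2")` with count 3.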

Page 7:

Spark execution visualizer

• Understand the execution mechanism of Spark (helps new users learn Spark faster)

• Discover issues of executors and physical tasks in detail

• Highlight bottlenecks of certain workflows

• Add insight to advanced, online optimization strategies

Demonstrated at ITU Telecom World, Budapest, 2015

Databricks, the company behind Spark, showed interest in the visualizer at Spark Summit EU, Amsterdam, 2015

Presented at Hadoop Summit, Dublin, 2016

Page 8:

Dynamic repartitioning in Spark

[Figure: even partitioning vs. skewed data at a stage boundary & shuffle; the skewed case shows slow tasks]

Goal: Tackle the problem of highly skewed and dynamically changing distribution of keys

• Avoid data skew dynamically on-the-fly in any workflow

• Mitigate the problem of slow tasks

• Sophisticated optimization algorithm

• Major improvement

heavy keys create heavy partitions (slow tasks)

Page 9:

Adaptively repartition data during execution at shuffle-phase with a better partition function to approximate even partitioning

[Figure: one Master and three Slaves exchanging key statistics]

1. Approximate local key distribution & statistics

2. Collect to master

3. Global key distribution & statistics

4. Redistribute new hash
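
The four steps can be sketched in plain Python. This is an illustration under simplifying assumptions, not the optimization algorithm referred to in the slides: local statistics are exact top-k counts, and the new partition function places the heaviest keys one by one on the currently least-loaded partition while hashing all other keys. All function names are hypothetical.

```python
import heapq
from collections import Counter


def local_histogram(keys, top_k=10):
    """Step 1: a worker summarizes its local key distribution (top-k counts)."""
    return Counter(keys).most_common(top_k)


def global_histogram(local_histograms):
    """Steps 2-3: the master merges local statistics into a global view."""
    merged = Counter()
    for hist in local_histograms:
        for key, count in hist:
            merged[key] += count
    return merged


def build_partitioner(global_hist, num_partitions):
    """Step 4: derive a new partition function from the global statistics."""
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition id)
    heapq.heapify(heap)
    explicit = {}
    # Heaviest keys first, each onto the least-loaded partition so far.
    for key, count in global_hist.most_common():
        load, p = heapq.heappop(heap)
        explicit[key] = p
        heapq.heappush(heap, (load + count, p))

    def partition(key):
        if key in explicit:
            return explicit[key]
        # Keys without statistics fall back to plain hash partitioning.
        return hash(key) % num_partitions

    return partition
```

With a single key holding half of all records, plain hashing can place that key together with other heavy keys on one partition; the explicit placement above keeps the heavy key apart and fills the remaining partitions with the lighter ones.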

Page 10:

QoS scheduling: Spark (2015) and YARN (2016)

1. It is hard to predict the resource usage of a production job, so users usually over-estimate consumption by 5x to meet the contract, which hurts utilization. This is a problem in both Spark and YARN.

2. Peaks can slow down certain nodes or cause out-of-memory errors. This is a problem in Spark.

[Figure: resources over time, comparing the resource allocated by a job with its current consumption; annotations: out-of-memory error; under-utilization, waste of resources; close to optimal allocation]
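
The trade-off between the two problems can be made concrete with a small calculation. The sketch below compares a fixed allocation with a consumption time series; the function name and the returned fields are hypothetical, and a peak above the allocation is only treated as a risk indicator, not as a prediction of an actual failure.

```python
def allocation_report(allocated, consumption):
    """Compare a fixed allocation with a consumption series (same units)."""
    if not consumption:
        raise ValueError("consumption series is empty")
    peak = max(consumption)
    mean = sum(consumption) / len(consumption)
    return {
        "utilization": mean / allocated,  # fraction of the reservation used
        "wasted": allocated - mean,       # average unused reservation
        "over_limit": peak > allocated,   # a peak above the allocation
    }
```

For a job that averages 4 units, a 5x over-estimate of 20 units gives a utilization of 0.2, while an allocation of 5 units gives 0.8 but is exceeded by a peak of 6.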

Page 11:

Hops.io (Stockholm) EIT Digital Activity Proposal

[Figure: Size (GB) by Year, 2015 to 2019; y-axis from 0 to 1000]

Page 12:

Hops.io: Interactive Analytics with Zeppelin

Page 13:

IoT Data Hub for Ericsson Business Unit Global Services (BUGS)

Test results for open source NoSQL systems

Selected system is confidential

Last bar is Cassandra

In combination with Spark, tested with BUGS in 2015

Page 14:

Spark vs Flink Stateful stream processing – “A/B testing”

Spark

• Better equipped for interactive, exploratory data analysis due to simpler runtime design and more mature third-party tools.

• Favorable in community support, currently available integration, monitoring and testing

Flink

• Competitive runtime, especially for heavy workloads

• Sort, join, iterative dataflows (e.g. ML)

• Transitioning to either of the two systems is fairly smooth: similar programming models

WIP: testing in Ericsson Aggregator

[Figure: Streaming A/B Test throughput. x-axis: Number of CPU cores (5, 10, 20); y-axis: 1000 records/sec (0 to 1000); series: Flink, Spark; bar labels: 465, 644, 877 and 232, 344, 512]

Page 15:

Achievements

• High visibility of SZTAKI researchers within the open source community, as well as among committers at other companies (e.g. Databricks and Cloudera in San Francisco)

• Presentations at Hadoop and Spark Summits, Flink Forward

• Bug fixes and minor and major improvements submitted to and accepted by the Apache Spark project in 2015:

– 20+ issues covered

– Major improvement of deployment to YARN

– Major improvement in repartitioning efficiency

• Built expertise in Apache Spark, Flink, YARN

• Training session and consultancy support for Ericsson Hungary

Page 16:

Commercial break: it is still not too late to join our contests

RecSys and Personalization Meetup: Budapest, June 9

RecSys Challenge 2016, Boston, Sept 15-19

ECML/PKDD Discovery Challenge 2016, Riva del Garda, September 19-23