experiences with spark @telefónica€¦ · telefónica digital what is spark? (i) • distributed...

30
1 Telefónica Digital Experiences with Spark @Telefónica Daniel Tapiador & Ignacio Blasco 12 th June 2014 – BCN Spark meetup

Upload: others

Post on 11-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

1 Telefónica Digital

Experiences with Spark @Telefónica

Daniel Tapiador & Ignacio Blasco 12th June 2014 – BCN Spark meetup

Page 2: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

2 Telefónica Digital

Outline

• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)

Page 3: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

3 Telefónica Digital

Outline

• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)

Page 4: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

4 Telefónica Digital

What is Spark? (I)

• Distributed data processing framework/system • Aims at making data analytics fast to run (100x) and

to write • Works in memory and/or disk • Fills the gap for near-real time (in memory) apps • Scales out to big data sizes • Several language bindings (Scala, Python and Java)

Page 5: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

5 Telefónica Digital

What is Spark? (II)

• Fully integrated with the Hadoop ecosystem §  Supports any Hadoop input format => can read from

HDFS, Hive, Impala, Hbase, etc • Higher level (and richer) interface than MapReduce

§  Generalize MR to support new apps in same engine §  General task DAG + data sharing

• Interactive shells for scala and python for exploratory work

Page 6: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

6 Telefónica Digital

Original Niche

• Originally developed for: §  Iterative algorithms §  Interactive data mining

Page 7: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

7 Telefónica Digital

Evolution (I)

MapReduce

Pregel

Dremel

GraphLab Storm

Giraph

Drill Tez

Impala

S4 …

Specialized systems (iterative, interactive and"

streaming apps)

General batch"processing

• Has aimed ever since at evolving to a much more complete framework

Page 8: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

8 Telefónica Digital

Evolution (II)

Page 9: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

9 Telefónica Digital

Motivation for a change (I)

• Column orientation (Parquet and ORC). Ratio 3.5x • Fast in-memory serialization (Kryo)

•  Largest input (per day) is 153 GB (bz2 - text).

•  Auxiliary tables sizes: §  4 GB (text – no compression) §  950 MB (text – no compression) §  12 MB (text – no compression)

•  Processing time: §  Something in between 5h – 15h

Page 10: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

10 Telefónica Digital

Motivation for a change (II)

•  Only batch oriented •  Maintainability

§  Tedious to add/modify functionality

•  Complexity (more LoC) •  Needs explicit Orchestration •  Testing somehow tedious with

a lot of Integration Tests

•  Stability

•  In memory processing •  Exploration

§  Prototyping, PoC, etc.

•  Speed of development •  Orchestration natively done in

the code. •  Testing

•  v1.0.0 out but still a lot to stabilize

Page 11: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

11 Telefónica Digital

Motivation for a change (III)

0

5000

10000

15000

20000

25000

30000

35000

C++ / Hadoop Streaming Scala / Spark

LoC (Model)

LoC (Model)

0 2000 4000 6000 8000

10000 12000

Java / Hadoop Scala / Spark

LoC (ETL)

LoC (ETL)

0

5

10

15

C++ / Hadoop Streaming

Scala / Spark

Model Development Time (in MM)

Model Development Time (in MM)

0 1 2 3 4 5

Java / Hadoop Scala / Spark

ETL Development Time (in MM)

ETL Development Time (in MM)

Page 12: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

12 Telefónica Digital

Outline

• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)

Page 13: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

13 Telefónica Digital

Spark Programming Model (I)

• Resilient Distributed Datasets (RDDs) §  Distributed collections of objects that can be cached

in memory across cluster nodes §  Manipulated through various parallel operators §  Automatically rebuilt on failure (some checkpointing

as well) • RDDs can be spilled to disk if needed and/or kept in

memory (de)serialized • RDDs partitioning can be explicitly stated • Spark context (sc) is the entry point

§  Can run locally (single or multicore)

Page 14: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

14 Telefónica Digital

Spark Programming Model (II)

• Shared variables (immutable) • Broadcast variables

§  Read-only variable cached on each machine §  Efficient broadcast algorithms to reduce

communication cost • Accumulators

§  Variables that are only added to through associative operations

§  Only driver program can read their value

Page 15: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

15 Telefónica Digital

RDD Operations

• RDD operations include transformations and actions: §  map, filter, flatMap, sample, union, distinct,

groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian, …

§  reduce, collect, count, first, take, takeSample, countByKey, foreach, saveAsTextFile, saveAsSequenceFile, …

§  persist/cache and repartitions

Page 16: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

16 Telefónica Digital

RDD Persistence • Partitions lost are recomputed using the original

transformations • Each RDD can be stored with a different storage level

Page 17: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

17 Telefónica Digital

Memory tuning

• Three considerations: §  Amount of memory used by the objects §  The cost of accessing those objects §  Overhead of garbage collection

• Java objects are fast to access but consume 2-5x more space (than raw data inside) §  Prefer array of objects and primitive types §  Avoid nested structures §  Kryo serializer §  …

Page 18: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

18 Telefónica Digital

Outline

• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)

Page 19: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

19 Telefónica Digital

Spark Ecosystem (I)

• Unify…

Page 20: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

20 Telefónica Digital

MLBase • Mllib

§  Distributed low-level machine learning library written against Spark

§  Maintained by Spark core developers §  Algorithms for classification, regression, clustering

and collaborative filtering (more to come)

• MLI §  High level ML abstractions §  MLTable and LocalMatrix

• ML Optimizer §  Model selection automation §  Solves a search problem over feature extractors and

ML algorithms

Page 21: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

21 Telefónica Digital

Spark Streaming motivation

• Processing the same data in live streams as well as batch post-processing

• Existing frameworks cannot do both §  100s of MB/s with low latency §  TBs of data with high latency

• Extremely painful to maintain • Mutable state lost if node fails (in traditional model)

Page 22: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

22 Telefónica Digital

Spark Streaming • Runs a streaming computation as a series of very

small, deterministic batch jobs • Chop up the live stream into

batches of X seconds • Each batch of data is processed

using DStream + RDD operations §  countByWindow, reduceByWindow, slice, window,

countByValue, countByValueAndWindow, …

• Combine live data streams with historical data

• Batch sizes as low as ½ s • Input sources

§  Kafka, HDFS, Flume, Akka actors, Raw TCP sockets, custom implementaions and RDDs pushed as a stream

Page 23: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

23 Telefónica Digital

Shark

•  Interactive SQL queries + unification §  GENERATE Kmeans(tweet_locations) AS TABLE

tweet_clusters • Key points added (leveraged by Spark):

§  Cached tables §  In memory column orientation (3-20x reduction in size) §  Data co-partitioning, fully distributed sort, dynamic join

algorithm selection based on the data, partition pruning using range statistics, etc

• Early development stage

Page 24: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

24 Telefónica Digital

• Approximate query processing §  Sampling module when ingesting §  Online sample selection based on query’s latency and accuracy §  Parallel query execution with appropriate error and confidence

bounds

BlinkDB

Page 25: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

25 Telefónica Digital

GraphX

• Resilient Distributed Graph, efficient partitioning, implementations of the PowerGraph and Pregel graph-parallel frameworks using RDGs.

Page 26: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

26 Telefónica Digital

Outline

• Introduction and motivation for a change • Spark Internals and API • Ecosystem • Tips & Tricks (Demo)

Page 27: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

27 Telefónica Digital

Tips and Tricks

• Use of Try monad to debug §  Can be used to run controlled exceptions building

RDD[(Arg,Try)] §  Allow to Use Exceptions as Data

Page 28: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

28 Telefónica Digital

Tips and Tricks

• Traits and Pimp my library to make DSLs §  Spark uses implicit conversion to add method to

specific RDD (Pimp my Library) §  Traits allow separation of concerns and allow to make

a modular and customizable DSL §  Can be loaded with a single import §  Allow custom operations and ETL

Page 29: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

29 Telefónica Digital

Tips and Tricks

• Summary §  Spark is easy to grasp and can be used with almost

no knowledge of Scala of Functional Programming but…

§  Advanced Scala features can make powerful DSL §  Use of FP concept boost productivity and improves

parallelization

Page 30: Experiences with Spark @Telefónica€¦ · Telefónica Digital What is Spark? (I) • Distributed data processing framework/system • Aims at making data analytics fast to run (100x)

30 Telefónica Digital

Demo