Telefónica Digital

Experiences with Spark @Telefónica

Daniel Tapiador & Ignacio Blasco 12th June 2014 – BCN Spark meetup


Outline

• Introduction and motivation for a change
• Spark Internals and API
• Ecosystem
• Tips & Tricks (Demo)



What is Spark? (I)

• Distributed data processing framework/system
• Aims at making data analytics fast to run (claimed up to 100x faster than Hadoop MapReduce in memory) and fast to write
• Works in memory and/or on disk
• Fills the gap for near-real-time (in-memory) apps
• Scales out to big data sizes
• Several language bindings (Scala, Python and Java)


What is Spark? (II)

• Fully integrated with the Hadoop ecosystem
  §  Supports any Hadoop input format => can read from HDFS, Hive, Impala, HBase, etc.
• Higher-level (and richer) interface than MapReduce
  §  Generalizes MapReduce to support new apps in the same engine
  §  General task DAG + data sharing
• Interactive shells for Scala and Python for exploratory work


Original Niche

• Originally developed for:
  §  Iterative algorithms
  §  Interactive data mining


Evolution (I)

[Figure: general batch processing (MapReduce) alongside the specialized systems for iterative, interactive and streaming apps (Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …)]

• Spark has aimed ever since at evolving into a much more complete framework


Evolution (II)


Motivation for a change (I)

• Column orientation (Parquet and ORC). Ratio 3.5x
• Fast in-memory serialization (Kryo)
• Largest input (per day) is 153 GB (bz2 - text)
• Auxiliary table sizes:
  §  4 GB (text – no compression)
  §  950 MB (text – no compression)
  §  12 MB (text – no compression)
• Processing time:
  §  Between 5 h and 15 h


Motivation for a change (II)

Hadoop-based solution:
• Only batch oriented
• Maintainability
  §  Tedious to add/modify functionality
• Complexity (more LoC)
• Needs explicit orchestration
• Testing is somewhat tedious, with a lot of integration tests
• Stability

Spark:
• In-memory processing
• Exploration
  §  Prototyping, PoC, etc.
• Speed of development
• Orchestration done natively in the code
• Testing
• v1.0.0 is out, but there is still a lot to stabilize


Motivation for a change (III)

[Figure: four bar charts, each comparing a Hadoop implementation with Scala / Spark]
• LoC (Model): C++ / Hadoop Streaming vs. Scala / Spark
• LoC (ETL): Java / Hadoop vs. Scala / Spark
• Model Development Time (in MM): C++ / Hadoop Streaming vs. Scala / Spark
• ETL Development Time (in MM): Java / Hadoop vs. Scala / Spark



Spark Programming Model (I)

• Resilient Distributed Datasets (RDDs)
  §  Distributed collections of objects that can be cached in memory across cluster nodes
  §  Manipulated through various parallel operators
  §  Automatically rebuilt on failure (some checkpointing as well)
• RDDs can be spilled to disk if needed and/or kept in memory, serialized or deserialized
• RDD partitioning can be explicitly stated
• Spark context (sc) is the entry point
  §  Can run locally (single or multicore)


Spark Programming Model (II)

• Shared variables (immutable)
• Broadcast variables
  §  Read-only variable cached on each machine
  §  Efficient broadcast algorithms to reduce communication cost
• Accumulators
  §  Variables that are only added to through associative operations
  §  Only the driver program can read their value
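The two kinds of shared variables can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not Spark's API: `run_task`, `lookup` and the per-partition miss counts are hypothetical names chosen for the illustration. It models a broadcast variable as a read-only mapping handed to every task, and an accumulator as a counter that tasks only add to while the driver merges and reads the total.

```python
from types import MappingProxyType

# "Broadcast" value: a read-only lookup table shared by every task.
lookup = MappingProxyType({"ES": "Spain", "UK": "United Kingdom"})

def run_task(partition, broadcast):
    """Process one partition; return its output and a local miss count."""
    out, misses = [], 0
    for code in partition:
        if code in broadcast:
            out.append(broadcast[code])
        else:
            misses += 1  # tasks only ever add to the counter
    return out, misses

partitions = [["ES", "FR"], ["UK", "ES", "DE"]]
results = [run_task(p, lookup) for p in partitions]

# "Driver" side: addition is associative, so the partial counts can be
# merged in any order; only the driver reads the final value.
names = [name for out, _ in results for name in out]
unknown_codes = sum(misses for _, misses in results)
```

Because the tasks never read the running total, they need no coordination with each other, which is the property the associativity requirement is meant to guarantee.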


RDD Operations

• RDD operations include transformations and actions:
  §  Transformations: map, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian, …
  §  Actions: reduce, collect, count, first, take, takeSample, countByKey, foreach, saveAsTextFile, saveAsSequenceFile, …
  §  persist/cache and repartitioning
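In Spark, transformations are lazy and only an action triggers the computation. The split can be sketched in plain Python; this is a toy model under that assumption, not the RDD API, and `LazyDataset` is a hypothetical class that only borrows the operator names from the list above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []

def square(x):
    calls.append(x)  # side effect used only to observe when work happens
    return x * x

squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
work_before_action = len(calls)   # nothing has run yet
even_squares = squares.collect()  # the action runs the whole pipeline
```

Chaining `map` and `filter` only extends the recorded list of steps; `square` is first called inside `collect`.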


RDD Persistence

• Lost partitions are recomputed using the original transformations
• Each RDD can be stored with a different storage level
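Recomputing a lost partition from the original transformations can be sketched as follows. This is a simplified plain-Python model, not Spark's recovery mechanism: `build_partition`, `cache` and the two lambdas are hypothetical, and the input is assumed to be durable.

```python
def build_partition(index, source_partitions, transformations):
    """Recompute one partition by replaying the transformations on its source."""
    data = source_partitions[index]
    for fn in transformations:
        data = [fn(x) for x in data]
    return data

source_partitions = [[1, 2], [3, 4], [5, 6]]   # durable input
lineage = [lambda x: x + 1, lambda x: x * 10]  # original transformations

# Materialize ("cache") every partition once.
cache = {i: build_partition(i, source_partitions, lineage)
         for i in range(len(source_partitions))}

del cache[1]  # simulate losing the node that held partition 1

# Only the missing partition is rebuilt; the others are served from cache.
for i in range(len(source_partitions)):
    if i not in cache:
        cache[i] = build_partition(i, source_partitions, lineage)

recovered = [cache[i] for i in range(len(source_partitions))]
```

Nothing is replicated up front in this model; the cost of a failure is the recomputation of the partitions that were lost.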


Memory tuning

• Three considerations:
  §  Amount of memory used by the objects
  §  The cost of accessing those objects
  §  Overhead of garbage collection
• Java objects are fast to access but consume 2-5x more space than the raw data inside them
  §  Prefer arrays of objects and primitive types
  §  Avoid nested structures
  §  Kryo serializer
  §  …
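The gap between boxed objects and raw data is not specific to the JVM. As a rough analogy only, the sketch below measures it in CPython with the standard library; the sizes it reports are those of the Python runtime, not of Java objects, so they say nothing about the 2-5x figure itself.

```python
import sys
from array import array

values = list(range(100_000))

# Boxed representation: a list of pointers plus one object per integer.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Packed representation: raw 64-bit integers stored contiguously.
packed = array("q", values)
packed_bytes = sys.getsizeof(packed)

overhead_ratio = boxed_bytes / packed_bytes
```

The same reasoning is behind the advice to prefer arrays and primitive types: every extra object adds a header and a pointer on top of the payload.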



Spark Ecosystem (I)

• Unify…


MLBase

• MLlib
  §  Distributed low-level machine learning library written against Spark
  §  Maintained by Spark core developers
  §  Algorithms for classification, regression, clustering and collaborative filtering (more to come)
• MLI
  §  High-level ML abstractions
  §  MLTable and LocalMatrix
• ML Optimizer
  §  Model selection automation
  §  Solves a search problem over feature extractors and ML algorithms


Spark Streaming motivation

• Processing the same data in live streams as well as batch post-processing
• Existing frameworks cannot do both
  §  100s of MB/s with low latency
  §  TBs of data with high latency
• Extremely painful to maintain
• Mutable state lost if node fails (in traditional model)


Spark Streaming

• Runs a streaming computation as a series of very small, deterministic batch jobs
• Chops up the live stream into batches of X seconds
• Each batch of data is processed using DStream + RDD operations
  §  countByWindow, reduceByWindow, slice, window, countByValue, countByValueAndWindow, …
• Combines live data streams with historical data
• Batch sizes as low as ½ s
• Input sources
  §  Kafka, HDFS, Flume, Akka actors, raw TCP sockets, custom implementations and RDDs pushed as a stream
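The idea of chopping a stream into small batches and then applying ordinary batch operations can be sketched in plain Python. This is a toy model, not the DStream API: `micro_batches` and `count_by_value_and_window` are hypothetical helpers (the second only echoes the name of the operator listed above), and events are assumed to be a non-empty list of (timestamp in seconds, value) pairs.

```python
from collections import Counter, defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-size batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Emit batches in time order, including empty ones in between.
    first, last = min(batches), max(batches)
    return [batches.get(i, []) for i in range(first, last + 1)]

def count_by_value_and_window(batches, window):
    """Count values over a sliding window of the last `window` batches."""
    return [Counter(v for b in batches[max(0, i - window + 1): i + 1] for v in b)
            for i in range(len(batches))]

events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (3.4, "b")]
batches = micro_batches(events, batch_seconds=1)
windowed = count_by_value_and_window(batches, window=2)
```

Each batch is a plain list, so any batch-style function can be reused on it unchanged, which is the property that lets live and historical data share the same code.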


Shark

• Interactive SQL queries + unification
  §  GENERATE Kmeans(tweet_locations) AS TABLE tweet_clusters
• Key points added (leveraged by Spark):
  §  Cached tables
  §  In-memory column orientation (3-20x reduction in size)
  §  Data co-partitioning, fully distributed sort, dynamic join algorithm selection based on the data, partition pruning using range statistics, etc.
• Early development stage


BlinkDB

• Approximate query processing
  §  Sampling module when ingesting
  §  Online sample selection based on the query's latency and accuracy
  §  Parallel query execution with appropriate error and confidence bounds
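The trade-off behind approximate query processing, a smaller sample for a faster but less precise answer, can be shown with a textbook estimator. The sketch below is not BlinkDB's algorithm: it assumes uniform random sampling and a normal-approximation interval, and `approximate_mean` and the synthetic `population` are hypothetical.

```python
import random
import statistics

def approximate_mean(population, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform random sample, with error bounds."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean; z=1.96 gives a ~95% normal-approximation interval.
    margin = z * statistics.stdev(sample) / sample_size ** 0.5
    return estimate, estimate - margin, estimate + margin

population = [i % 1000 for i in range(1_000_000)]
exact = statistics.fmean(population)

rough = approximate_mean(population, sample_size=100)
finer = approximate_mean(population, sample_size=10_000)
```

Each call returns the estimate together with the lower and upper bound, so a caller can pick the smallest sample whose interval is tight enough for the query at hand.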


GraphX

• Resilient Distributed Graph, efficient partitioning, implementations of the PowerGraph and Pregel graph-parallel frameworks using RDGs.



Tips and Tricks

• Use the Try monad to debug
  §  Exceptions can be captured in a controlled way by building an RDD[(Arg,Try)]
  §  Allows exceptions to be used as data
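The slide refers to Scala's Try; the same "exceptions as data" idea can be sketched in plain Python. This is an analogy rather than the Scala or Spark API: `attempt` is a hypothetical helper, and a list of (argument, outcome) pairs stands in for the RDD[(Arg,Try)].

```python
def attempt(fn, arg):
    """Return (arg, ("success", value)) or (arg, ("failure", exception))."""
    try:
        return arg, ("success", fn(arg))
    except Exception as exc:  # the failure is captured, not raised
        return arg, ("failure", exc)

raw_records = ["10", "x7", "42", ""]
attempts = [attempt(int, record) for record in raw_records]

parsed = [result for _, (status, result) in attempts if status == "success"]
# Failed inputs stay paired with the exception that explains them.
rejected = {arg: type(exc).__name__
            for arg, (status, exc) in attempts if status == "failure"}
```

A single bad record no longer aborts the whole job, and the failures can be counted, filtered or saved like any other dataset.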


Tips and Tricks

• Traits and the "Pimp my Library" pattern to make DSLs
  §  Spark uses implicit conversions to add methods to specific RDDs (Pimp my Library)
  §  Traits allow separation of concerns and make it possible to build a modular and customizable DSL
  §  Can be loaded with a single import
  §  Allows custom operations and ETL


Tips and Tricks

• Summary
  §  Spark is easy to grasp and can be used with almost no knowledge of Scala or functional programming, but…
  §  Advanced Scala features can make a powerful DSL
  §  Using FP concepts boosts productivity and improves parallelization


Demo
