
Post on 06-Jul-2020


1

An Introduction to Spark

By Ryan Fishel

With thanks to Ted Malaska

2

Intro

• Solutions Consultant at Cloudera Government Solutions

• Worked primarily with government clients

• Experience in:

• Current: Video processing, System integration, Performance tuning

• Past: Solutions consulting (outside of Hadoop), OS research/development

3

Agenda

• Quick Look at the Past

• Introduction to Spark

• Spark in CDH

4

A brief review of MapReduce

Map Map Map Map Map Map Map Map Map Map Map Map

Reduce Reduce Reduce Reduce

Key advances by MapReduce:

• Data locality: Automatic split computation and launch of mappers appropriately

• Fault tolerance: Write-out of intermediate results and restartable mappers meant the ability to run on commodity hardware

• Linear scalability: Combination of locality and a programming model that forces developers to write generally scalable solutions to problems

• Flexibility: It soon became an easy tool for distributed computing, going well beyond just map and reduce.
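The contract MapReduce enforces — a map phase, a framework-managed shuffle, and a reduce phase — can be sketched in a few lines of plain Python (single-process and illustrative; the function names are ours, not Hadoop's):

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word, as in word count.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key (the framework does this step).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

The three functions mirror the three stages the slide lists; everything a real framework adds (splits, locality, restartable tasks) sits around this core.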

5

BUT… Can we do better?

Two approaches to doing better:

1. Special-purpose systems that solve one problem domain well. Ex: Giraph / GraphLab (graph processing), Storm (stream processing)

2. Generalize the capabilities of MapReduce to provide a richer foundation for solving problems. Ex: MPI, Hama (BSP), Dryad (arbitrary DAGs)

Both are viable strategies depending on problem!

6

Agenda

• Quick Look at the Past

• Introduction to Spark

• Spark in CDH

7

What is Spark?

Spark is a general purpose computational framework

Key properties:

• Leverages distributed memory

• Added distributed tools: Accumulators and Broadcast variables (shared variables)

• Full directed-graph expressions for data-parallel computations

• Improved developer experience

Yet retains:

• Linear scalability

• Fault tolerance

• Data-locality-based computations

8

Easy: Get Started Immediately

• Multi-language support

• Interactive Shell

Python

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala

    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
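The same filter-and-count logic, stripped of the cluster: a plain-Python sketch of what the shell one-liners above express (the `lines` list stands in for an RDD of log lines):

```python
# Single-process analog of the shell examples: keep only the lines
# containing "ERROR", then count them.
lines = [
    "INFO starting up",
    "ERROR disk full",
    "WARN retrying",
    "ERROR timeout",
]
error_count = sum(1 for s in lines if "ERROR" in s)
# error_count == 2
```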

9

Spark from a High Level

[Diagram: the Spark Client (App Master) holds the RDD objects (e.g. rdd1.join(rdd2).groupBy(…).filter(…)) and the scheduler with the RDD graph; its Task Scheduler dispatches work through the Cluster Manager to Spark Workers. Each worker runs task threads and a Block Manager (MemoryStore, DiskStore, BlockInfo, ShuffleBlockManager) and reports to trackers.]

10

Resilient Distributed Datasets

• Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

• Lineage
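A toy sketch of the lineage idea in plain Python rather than Spark: each dataset remembers only its parent and the transformation that produced it, so it can always be rebuilt by replaying the chain (class and method names are illustrative, not Spark's):

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores its lineage (parent plus
    transformation) rather than materialized data."""

    def __init__(self, source, transform=None):
        self.source = source        # parent ToyRDD, or the raw data
        self.transform = transform  # function applied to the parent's data

    def map(self, f):
        return ToyRDD(self, lambda data: [f(x) for x in data])

    def filter(self, p):
        return ToyRDD(self, lambda data: [x for x in data if p(x)])

    def compute(self):
        # Walk the lineage back to the raw data and replay it; this is
        # also how a lost partition would be rebuilt after a failure.
        if self.transform is None:
            return list(self.source)
        return self.transform(self.source.compute())

base = ToyRDD([1, 2, 3, 4])
result = base.map(lambda x: x * 10).filter(lambda x: x > 15)
# result.compute() == [20, 30, 40]
```

Nothing is stored along the way: `result` is just a recipe, which is why losing an intermediate result is recoverable.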

11

Resilient Distributed Datasets

• Building a story:

• HadoopRDD

• JdbcRDD

• MappedRDD

• FlatMappedRDD

• FilteredRDD

• OrderedRDD

• PairRDD

• ShuffledRDD

• UnionRDD

• ….

[Diagram: RDD lineage graph — RDDs A through G built up through map, join, filter, groupBy, and take operations.]

12

Resilient Distributed Datasets

• Data Interface

• Partitions

• Dependencies

• Function to compute

• Optional:

• Preferred locations

• Partitioning info (Partitioner)
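The interface above can be written down as a small sketch (plain Python with illustrative field names — not Spark's actual internals):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RddInterface:
    """Sketch of the five-part RDD contract: three required pieces
    and two optional hints."""
    partitions: list                              # how the data is split
    dependencies: list                            # parent RDDs
    compute: Callable                             # compute one partition
    preferred_locations: Optional[list] = None    # optional locality hints
    partitioner: Optional[Callable] = None        # optional partitioning info

# A leaf dataset: two partitions, no parents, identity compute.
leaf = RddInterface(partitions=[[1, 2], [3, 4]],
                    dependencies=[],
                    compute=lambda part: part)
```

The two optional fields are exactly the hooks that enable data locality and shuffle planning.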

13

Easy: Expressive API

• Transformations

• Lazy

• Return RDDs

• Actions

• Eagerly demand computation

• Return a value (or write output)
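The lazy/eager split can be demonstrated without Spark, using a Python generator as a stand-in for a transformation and `sum` as the action that forces the work:

```python
calls = []

def expensive(x):
    calls.append(x)   # record every time real computation happens
    return x * x

data = range(4)

# "Transformation": building the generator does no work yet.
squares = (expensive(x) for x in data)
assert calls == []    # nothing computed so far

# "Action": consuming the generator forces the computation.
total = sum(squares)
# calls == [0, 1, 2, 3], total == 14
```

This is the same contract as the slide: transformations describe a result, actions demand one.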

14

Easy: Expressive API

• Actions

• collect

• count

• first

• take(n)

• saveAsTextFile

• foreach(func)

• reduce(func)

• …

• Transformations

• map

• filter

• flatMap

• union

• reduceByKey

• sortByKey

• join

• …

15

Easy: Example – Word Count

• Hadoop MapReduce

public static class WordCountMapClass extends MapReduceBase

implements Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output,

Reporter reporter) throws IOException {

String line = value.toString();

StringTokenizer itr = new StringTokenizer(line);

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

output.collect(word, one);

}

}

}

public static class WordCountReduce extends MapReduceBase

implements Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterator<IntWritable> values,

OutputCollector<Text, IntWritable> output,

Reporter reporter) throws IOException {

int sum = 0;

while (values.hasNext()) {

sum += values.next().get();

}

output.collect(key, new IntWritable(sum));

}

}

• Spark

val spark = new SparkContext(master, appName, [sparkHome], [jars])

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))

.map(word => (word, 1))

.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
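For comparison, the same flatMap → map → reduceByKey pipeline in plain, single-process Python (no Spark required; `Counter` plays the role of the per-key reduction):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be"]

# flatMap: split each line into words, flattening into one stream
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: emit (word, 1) and sum per key; Counter does both
counts = Counter(words)
# counts["to"] == 2, counts["be"] == 2, counts["or"] == 1
```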


17

Easy: Out of the Box Functionality

• Hadoop Integration

• Works natively with Hadoop Data

• Runs with YARN

• Libraries

• MLlib

• Spark Streaming

• GraphX (alpha)

• Roadmap

• Language support:

• Improved Python support

• SparkR

• Java 8

• Better ML:

• Sparse data support

• Model evaluation framework

• Performance testing

18

Memory management leads to greater performance

Trends:

• ½ price every 18 months

• 2x bandwidth every 3 years

Current specs:

• A 100-node Hadoop cluster contains 10+ TB of RAM today and will double next year

• 1 GB RAM ~ $10-$20

• Typical node: 64-128 GB RAM, 16 cores, 50 GB per sec

Memory can be an enabler for high-performance big data applications

19

Fast: Using RAM, Operator Graphs

• In-memory caching

• Data partitions read from RAM instead of disk

• Operator graphs

• Scheduling optimizations

• Fault tolerance

[Diagram: an RDD operator graph of map, join, filter, groupBy, and take over RDDs A through G; the legend distinguishes cached partitions from plain RDDs.]
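The payoff of in-memory caching can be sketched with a toy cache in plain Python (function names are illustrative): an iterative job touches the same partition repeatedly, but the expensive load happens only once.

```python
call_count = 0

def load_partition():
    # Stand-in for the expensive work: reading a partition from disk
    # (or recomputing it from lineage).
    global call_count
    call_count += 1
    return [1, 2, 3]

cache = {}

def get_partition(key):
    # Serve from the in-memory cache when possible, like a cached
    # RDD partition; fall back to the expensive load otherwise.
    if key not in cache:
        cache[key] = load_partition()
    return cache[key]

for _ in range(5):            # an iterative job hitting the same data
    part = get_partition("p0")
# call_count == 1: five reads, one load
```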

20

Logistic Regression Performance (data fits in memory)

[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]

21

Spark Streaming

• Run continuous processing of data using Spark’s core API.

• Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can reuse existing code for batch/offline processing.

• Adds "rolling window" operations, e.g. compute rolling averages or counts for data over the last five minutes.

• Example use cases:

• "On-the-fly" ETL as data is ingested into Hadoop/HDFS.

• Detecting anomalous behavior and triggering alerts.

• Continuous reporting of summary metrics for incoming data.
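A rolling window over micro-batches can be sketched in plain Python with a bounded deque (a toy, single-process analog of a DStream window operation, not Spark's API):

```python
from collections import deque

class RollingWindow:
    """Rolling count over the last `size` micro-batches; old
    batches fall off the back automatically."""

    def __init__(self, size):
        self.batches = deque(maxlen=size)

    def add_batch(self, batch):
        self.batches.append(batch)

    def windowed_count(self):
        # Count of all events currently inside the window.
        return sum(len(b) for b in self.batches)

w = RollingWindow(size=3)
for batch in ([1, 2], [3], [4, 5, 6], [7]):
    w.add_batch(batch)
# The window holds only the last 3 batches: [3], [4, 5, 6], [7]
```

The same shape works for rolling averages: keep per-batch sums and counts in the deque instead of raw events.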

22

val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: "Micro-batch" architecture — the tweets DStream and the hashTags DStream are each a sequence of per-batch RDDs (batch @ t, batch @ t+1, batch @ t+2); flatMap and save are applied to every batch. The stream is composed of small (1-10 s) batch computations.]

23

Sample use cases

User: Conviva
Use case: Optimize end users' online video experience by analyzing traffic patterns in real time, enabling fine-grained traffic-shaping control.
Spark's value:

• Rapid development and prototyping

• Shared business logic for offline and online computation

• Open machine learning algorithms

User: Yahoo!
Use case: Speed up the model-training pipeline for ad targeting, speeding up the feature extraction phase by over 3x; content recommendation using collaborative filtering.
Spark's value:

• Reduced latency in the broader data pipeline

• Use of iterative ML

• Efficient P2P broadcast

User: Anonymous (large tech company)
Use case: "Semi real-time" log aggregation and analysis for monitoring, alerting, etc.
Spark's value:

• Low-latency, frequently running "mini" batch jobs processing the latest data sets

User: Technicolor
Use case: Real-time analytics for their customers (i.e., telcos); need the ability to provide streaming results and to query them in real time.
Spark's value:

• Easy to develop; just need Spark & Spark Streaming

• Arbitrary queries on live data

24

Flume to Spark Streaming to HBase

[Diagram: an Avro client sends events to a Flume agent (Avro source → interceptor → two memory channels). One channel feeds an HDFS sink writing to HDFS; the other feeds an Avro sink streaming into Spark Streaming, which runs micro jobs on the Flume stream and writes results to HBase for HBase clients.]
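A toy, single-process sketch of the same flow: events queue up (standing in for Flume), get drained in micro-batches, and each batch's results land in a key-value store (standing in for HBase). All names are illustrative.

```python
from queue import Queue

source = Queue()   # stand-in for the Flume stream
store = {}         # stand-in for an HBase table of counts

for event in ["a", "b", "a", "c", "a"]:
    source.put(event)

def run_micro_batch(batch_size):
    # Drain up to batch_size events, then process them as one unit,
    # like one Spark Streaming micro job.
    batch = []
    while not source.empty() and len(batch) < batch_size:
        batch.append(source.get())
    for event in batch:
        store[event] = store.get(event, 0) + 1

while not source.empty():
    run_micro_batch(batch_size=2)
# store == {"a": 3, "b": 1, "c": 1}
```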

25

Building on Spark Today

• What kind of Apps?

• ETL

• Machine Learning

• Streaming

• Dashboards

Growing number of successful use cases & 3rd-party applications

26

Current project status

• Spark promoted to an Apache Top Level Project in mid-Feb

• 100+ contributors and 25+ companies contributing

• Includes: Intel, Cloudera, Yahoo!, Microsoft, etc.

27

Agenda

• Quick Look at the Past

• Introduction to Spark

• Spark in CDH

28

Timelines for Spark in CDH (Proposed)

Phase 0 — Available: Jan 2014, Spark 0.9

• Separate Spark parcel

• CDH 4.4+

• Manual start/stop

• Stand-alone mode Spark server

• No CM integration

Phase 1 — Available: Mar 2014, Spark 1.0

• Bundled into CDH 5.0

• CSD in Cloudera Manager for installation and configuration

• Supports both CDH 4 and CDH 5

• Stand-alone mode with resource management

• YARN mode with Kerberos

• Spark SQL alpha

Phase 2 — Available: July 2014, Spark 1.x.x

• Bundled into CDH 5.1

• Improved monitoring using the Spark History Server

• Deeper integration into Cloudera Manager

• Optimized Flume integration for Spark Streaming

• Hue support

29

Spark key roadmap items – Production requirements

• Security:

• Full Kerberos support in stand-alone mode

• Sentry integration

• High availability:

• Spark Master HA (available now)

• YARN Application Master availability (available now)

• Operations support:

• Improved performance monitoring, health checks, and alerts through Cloudera Manager

• Reporting:

• Usage history for chargeback / showback

• Audit and lineage capabilities through Cloudera Navigator

• Advanced resource management:

• Dynamic resource management capabilities for long-running Spark contexts

30

Spark key roadmap items – System capabilities

• Memory

• Tachyon integration with HDFS caching

• Off-heap memory optimization for Spark

• Enable multiple systems to share memory with Spark (stand-alone mode)

• Compatibility

• MapReduce 2 migration support for Spark

• Crunch on Spark (Pipelines)

• Pig on Spark (Spork)

• Oozie connector for Spark

• Integration

• Impala <-> Spark data migration

31

Recap

• Quick Look at the Past

• Introduction to Spark

• Spark in CDH
