Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014
DESCRIPTION
An introduction to Apache Spark presented at the Boston Apache Spark User Group in July 2014.
TRANSCRIPT
Boston Apache Spark User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
Background - MapReduce / Hadoop
● Map & reduce around for 5+ decades (McCarthy, 1960)
● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)
● MapReduce paper timed well with commodity hardware capabilities of the early 2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding
MapReduce / Hadoop difficulties
● Hardware evolved
  ○ Networks became fast
  ○ Memory became cheap
● Programming model proved non-trivial
  ○ Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
  ○ Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...
Some history - Spark
● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
  ○ AMP = Algorithms Machines People
  ○ AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
What is Apache Spark?
An open source, efficient and productive cluster computing system that is interoperable with Hadoop
Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
  ○ In a Nutshell, Apache Spark...
  ○ has had 7,366 commits made by 299 contributors representing 117,823 lines of code
  ○ is mostly written in Scala with well-commented source code
  ○ has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
  ○ took an estimated 30 years of effort (COCOMO model), starting with its first commit in March 2010 and ending with its most recent commit 5 days ago
Efficient
● In-memory primitives
  ○ Use cluster memory and spill to disk only when necessary
● High performance
  ○ https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
  ○ Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
  ○ Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
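To make the contrast concrete, here is a minimal Scala sketch of such a pipelined DAG. The object name, input paths, and the particular chain of operations are illustrative, not from the talk:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicit pair-RDD operations (needed on early 1.x)

object PipelineSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline sketch"))

    val a = sc.textFile("hdfs://...")                  // Load (placeholder path)
    val b = sc.textFile("hdfs://...")                  // Load (placeholder path)

    val result = a.union(b)                            // Union
      .flatMap(line => line.split(" "))                // Map
      .filter(word => word.nonEmpty)                   // Filter
      .map(word => (word, 1))
      .reduceByKey(_ + _)                              // Reduce
      .groupBy { case (word, _) => word.charAt(0) }    // Group (by first letter)
      .sample(false, 0.1, 42)                          // Sample ~10%

    result.saveAsTextFile("hdfs://...")                // Store (placeholder path)
    sc.stop()
  }
}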
Interoperable
● Read and write data from HDFS (or any storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc
Productive
● Unified data model, the RDD
● Multiple execution modes
  ○ Batch, interactive, streaming
● Multiple languages
  ○ Scala, Java, Python, R, SQL
● Rich standard library
  ○ Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce
Consistent API, less code...
MapReduce -
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Spark (Scala) -
import org.apache.spark._
val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python) -
from operator import add
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Spark workflow
[Workflow diagram: Load creates an RDD, Transforms produce new RDDs, an Action produces a Value, and Save writes results back to storage]
The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of elements that can be operated on in parallel
RDDs technically
1. a set of partitions ("splits" in Hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each partition
Load
Create an RDD.
● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
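A short sketch of these loaders in Scala, assuming the SparkContext sc from the earlier examples; the collection and paths are placeholders:

// Assumes the SparkContext sc from the earlier examples
import org.apache.spark.SparkContext._   // implicit Writable converters for sequenceFile (1.x)

val nums  = sc.parallelize(Seq(1, 2, 3, 4))            // parallelize: convert a local collection
val lines = sc.textFile("hdfs://...")                   // textFile: load a text file (placeholder path)
val docs  = sc.wholeTextFiles("hdfs://...")             // wholeTextFiles: (fileName, content) pairs for a directory
val seq   = sc.sequenceFile[String, Int]("hdfs://...")  // sequenceFile: load a Hadoop SequenceFile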
Transform
Lazy operations. Build compute DAG. Don't trigger computation.
● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
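A hedged sketch of these transformations in Scala, again assuming sc from the earlier examples and placeholder paths; note that none of these lines trigger a job:

// Assumes sc from the earlier examples; paths are placeholders
val lines    = sc.textFile("hdfs://...")
val words    = lines.flatMap(line => line.split(" "))   // 0..n output elements per input element
val nonEmpty = words.filter(word => word.nonEmpty)      // keep a subset
val upper    = nonEmpty.map(word => word.toUpperCase)   // pass every element through a function
val tenPct   = upper.sample(false, 0.1, 42)             // select ~10% of the elements
val other    = sc.textFile("hdfs://...").flatMap(_.split(" "))
val both     = upper.union(other)                       // union of two RDDs
val unique   = both.distinct()                          // drop duplicate elements
// Nothing has executed yet - only a DAG of transformations has been recorded.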
Transform (cont)
● groupByKey - (K, V) -> (K, Seq[V])
● reduceByKey(func) - (K, V) -> (K, func(V...))
● sortByKey - order (K, V) by K
● join(other) - (K, V) + (K, W) -> (K, (V, W))
● cogroup/groupWith(other) - (K, V) + (K, W) -> (K, (Seq[V], Seq[W]))
● cartesian(other) - cartesian product (all pairs)
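A small sketch of the key-value transformations over two made-up pair RDDs (the names and values are illustrative):

// Two tiny made-up pair RDDs; assumes sc and import org.apache.spark.SparkContext._
val ages    = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("alice", 31)))
val heights = sc.parallelize(Seq(("alice", 1.7), ("carol", 1.6)))

val grouped = ages.groupByKey()                    // ("alice", [30, 31]), ("bob", [25])
val oldest  = ages.reduceByKey((a, b) => a max b)  // ("alice", 31), ("bob", 25)
val sorted  = ages.sortByKey()                     // ordered by key
val joined  = ages.join(heights)                   // ("alice", (30, 1.7)), ("alice", (31, 1.7))
val co      = ages.cogroup(heights)                // ("alice", ([30, 31], [1.7])), ("carol", ([], [1.6])), ...
val all     = ages.cartesian(heights)              // every (age record, height record) pair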
Transform (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Action
Active operations. Trigger execution of the DAG. Result in a value.
● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
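A short sketch of these actions, assuming sc; each of these lines actually runs a job and returns a value to the driver:

// Assumes sc from the earlier examples
val nums       = sc.parallelize(1 to 5)
val total      = nums.reduce(_ + _)   // 15 - combine elements with a function
val local      = nums.collect()       // Array(1, 2, 3, 4, 5) - copy back to the driver
val howMany    = nums.count()         // 5
val firstThree = nums.take(3)         // Array(1, 2, 3)
nums.foreach(x => println(x))         // apply a function to each element (runs on the workers)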
Action (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#actions
Save
Action that results in data stored to a file system.
● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile / saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions
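A brief sketch of the save actions, continuing from the counts pair RDD of the word-count example (paths are placeholders):

// Continuing from the word-count counts RDD; paths are placeholders
// (saveAsSequenceFile on a pair RDD needs import org.apache.spark.SparkContext._ on early 1.x)
counts.saveAsTextFile("hdfs://...")      // plain text, one part file per partition
counts.saveAsSequenceFile("hdfs://...")  // Hadoop SequenceFile of (key, value) Writables
counts.saveAsObjectFile("hdfs://...")    // serialized Java objects (saveAsPickleFile is the PySpark analogue)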
Spark workflow
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Multiple modes, rich standard library
[Stack diagram: Apache Spark - Spark Core with SQL, Streaming, MLlib, GraphX, ... libraries on top]
Spark SQL
● Components
  ○ Catalyst - generic optimization for relational algebra
  ○ Core - RDD execution; formats: Parquet, JSON
  ○ Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: an RDD of User objects vs. a SchemaRDD with Name, Age, Height columns]
sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
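A minimal sketch of the SchemaRDD-plus-SQL flow with the Spark 1.0 era API; the Person case class, input format, and input path are illustrative:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD            // implicit RDD[Person] -> SchemaRDD conversion

// Placeholder input: one "name,age" record per line
val people = sc.textFile("hdfs://...")
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))

people.registerAsTable("people")         // make the SchemaRDD queryable by name

val teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)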
Spark Streaming
● Run a streaming computation as a series of time bound, deterministic batch jobs
● Time bound used to break stream into RDDs
[Diagram: the input stream is broken into RDDs X seconds wide; Spark processes each as a batch, producing results]
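A minimal sketch of that model with the Spark 1.0 streaming API, a network word count over 10-second batches; the host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // implicit pair-DStream operations

val conf = new SparkConf().setAppName("streaming word count")
val ssc  = new StreamingContext(conf, Seconds(10))     // the "X seconds wide" batch interval

val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                  // one small batch job per 10-second RDD
counts.print()

ssc.start()
ssc.awaitTermination()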
MLlib
● Machine learning algorithms over RDDs
● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees
● Regression - linear regression, regression trees
● Collaborative filtering - alternating least squares
● Clustering - K-Means
● Optimization - stochastic gradient descent, limited-memory BFGS
● Dimensionality reduction - singular value decomposition, principal component analysis
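As one hedged example of MLlib over RDDs, a k-means sketch with the Spark 1.0 era API; the input format, path, and parameters are made up for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated numeric feature vector per line
val data = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

val model = KMeans.train(data, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)
val wssse = model.computeCost(data)     // within-set sum of squared errors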
Deploying Spark
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Deploying Spark
● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos or YARN
  ○ Standalone - the default, simple setup, master + worker processes on nodes
  ○ Mesos - a general purpose manager that runs Hadoop and other services. Two modes of operation, fine & coarse.
  ○ YARN - Hadoop 2's resource manager
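A hedged sketch of how the cluster manager choice typically shows up in a driver's configuration; the master URLs are placeholders and in practice are more often supplied via spark-submit than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL for the cluster manager in use
val conf = new SparkConf()
  .setAppName("deployment sketch")
  .setMaster("local[*]")               // local threads, handy for development
  // .setMaster("spark://host:7077")   // standalone master (placeholder host)
  // .setMaster("mesos://host:5050")   // Mesos master (placeholder host)
  // .setMaster("yarn-client")         // YARN (Hadoop 2), driver runs locally

val sc = new SparkContext(conf)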
Highlights from Spark Summit 2014
http://spark-summit.org/east/2015
New York in early 2015
Thanks to...
@pacoid, @rxin and @michaelarmbrust for letting me crib slides for this introduction