Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014
DESCRIPTION
An introduction to Apache Spark presented at the Boston Apache Spark User Group in July 2014.
TRANSCRIPT
Boston Apache Spark User Group
(the Spahk group)
Microsoft NERD Center - Horace Mann
Tuesday, 15 July 2014
Intro to Apache Spark
Matthew Farrellee, @spinningmatt
Updated: July 2014
Background - MapReduce / Hadoop
● Map & reduce around for 5+ decades (McCarthy, 1960)
● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)
● MapReduce paper timed well with commodity hardware capabilities of the early 2000s
● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding
MapReduce / Hadoop difficulties
● Hardware evolved
  ○ Networks became fast
  ○ Memory became cheap
● Programming model proved non-trivial
  ○ Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
  ○ Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...
Some history - Spark
● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
  ○ AMP = Algorithms Machines People
  ○ AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014
What is Apache Spark?
An open source, efficient and productive cluster computing system that is interoperable with Hadoop
Open source
● Top level Apache project
● http://www.ohloh.net/p/apache-spark
  ○ In a Nutshell, Apache Spark...
  ○ has had 7,366 commits made by 299 contributors representing 117,823 lines of code
  ○ is mostly written in Scala with well-commented source code
  ○ has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
  ○ took an estimated 30 years of effort (COCOMO model), starting with its first commit in March 2010 and ending with its most recent commit 5 days ago
Efficient
● In-memory primitives
  ○ Use cluster memory and spill to disk only when necessary
● High performance
  ○ https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
  ○ Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
  ○ Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store
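To make the contrast concrete, here is a minimal Scala sketch of such a pipelined DAG. The object name, input paths, and the particular chain of operations are illustrative, not from the talk:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicit pair-RDD operations (needed on early 1.x)

object PipelineSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline sketch"))

    val a = sc.textFile("hdfs://...")                  // Load (placeholder path)
    val b = sc.textFile("hdfs://...")                  // Load (placeholder path)

    val result = a.union(b)                            // Union
      .flatMap(line => line.split(" "))                // Map
      .filter(word => word.nonEmpty)                   // Filter
      .map(word => (word, 1))
      .reduceByKey(_ + _)                              // Reduce
      .groupBy { case (word, _) => word.charAt(0) }    // Group (by first letter)
      .sample(false, 0.1, 42)                          // Sample ~10%

    result.saveAsTextFile("hdfs://...")                // Store (placeholder path)
    sc.stop()
  }
}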
Interoperable
● Read and write data from HDFS (or any storage system with an HDFS-like API)
● Read and write Hadoop file formats
● Run on YARN
● Interact with Hive, HBase, etc
Productive
● Unified data model, the RDD
● Multiple execution modes
  ○ Batch, interactive, streaming
● Multiple languages
  ○ Scala, Java, Python, R, SQL
● Rich standard library
  ○ Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce
Consistent API, less code...
MapReduce -
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Spark (Scala) -
import org.apache.spark._
val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark (Python) -
from operator import add
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Spark workflow
[Workflow diagram: Load creates an RDD, Transforms produce new RDDs, an Action produces a Value, and Save writes results back to storage]
The RDD
The resilient distributed dataset
A lazily evaluated, fault-tolerant collection of elements that can be operated on in parallel
RDDs technically
1. a set of partitions ("splits" in Hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each partition
Load
Create an RDD.
● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
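A short sketch of these loaders in Scala, assuming the SparkContext sc from the earlier examples; the collection and paths are placeholders:

// Assumes the SparkContext sc from the earlier examples
import org.apache.spark.SparkContext._   // implicit Writable converters for sequenceFile (1.x)

val nums  = sc.parallelize(Seq(1, 2, 3, 4))            // parallelize: convert a local collection
val lines = sc.textFile("hdfs://...")                   // textFile: load a text file (placeholder path)
val docs  = sc.wholeTextFiles("hdfs://...")             // wholeTextFiles: (fileName, content) pairs for a directory
val seq   = sc.sequenceFile[String, Int]("hdfs://...")  // sequenceFile: load a Hadoop SequenceFile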
Transform
Lazy operations. Build compute DAG. Don't trigger computation.
● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, ...) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
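A hedged sketch of these transformations in Scala, again assuming sc from the earlier examples and placeholder paths; note that none of these lines trigger a job:

// Assumes sc from the earlier examples; paths are placeholders
val lines    = sc.textFile("hdfs://...")
val words    = lines.flatMap(line => line.split(" "))   // 0..n output elements per input element
val nonEmpty = words.filter(word => word.nonEmpty)      // keep a subset
val upper    = nonEmpty.map(word => word.toUpperCase)   // pass every element through a function
val tenPct   = upper.sample(false, 0.1, 42)             // select ~10% of the elements
val other    = sc.textFile("hdfs://...").flatMap(_.split(" "))
val both     = upper.union(other)                       // union of two RDDs
val unique   = both.distinct()                          // drop duplicate elements
// Nothing has executed yet - only a DAG of transformations has been recorded.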
Transform (cont)
● groupByKey - (K, V) -> (K, Seq[V])
● reduceByKey(func) - (K, V) -> (K, func(V...))
● sortByKey - order (K, V) by K
● join(other) - (K, V) + (K, W) -> (K, (V, W))
● cogroup/groupWith(other) - (K, V) + (K, W) -> (K, (Seq[V], Seq[W]))
● cartesian(other) - cartesian product (all pairs)
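A small sketch of the key-value transformations over two made-up pair RDDs (the names and values are illustrative):

// Two tiny made-up pair RDDs; assumes sc and import org.apache.spark.SparkContext._
val ages    = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("alice", 31)))
val heights = sc.parallelize(Seq(("alice", 1.7), ("carol", 1.6)))

val grouped = ages.groupByKey()                    // ("alice", [30, 31]), ("bob", [25])
val oldest  = ages.reduceByKey((a, b) => a max b)  // ("alice", 31), ("bob", 25)
val sorted  = ages.sortByKey()                     // ordered by key
val joined  = ages.join(heights)                   // ("alice", (30, 1.7)), ("alice", (31, 1.7))
val co      = ages.cogroup(heights)                // ("alice", ([30, 31], [1.7])), ("carol", ([], [1.6])), ...
val all     = ages.cartesian(heights)              // every (age record, height record) pair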
Transform (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Action
Active operations. Trigger execution of the DAG. Result in a value.
● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
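A short sketch of these actions, assuming sc; each of these lines actually runs a job and returns a value to the driver:

// Assumes sc from the earlier examples
val nums       = sc.parallelize(1 to 5)
val total      = nums.reduce(_ + _)   // 15 - combine elements with a function
val local      = nums.collect()       // Array(1, 2, 3, 4, 5) - copy back to the driver
val howMany    = nums.count()         // 5
val firstThree = nums.take(3)         // Array(1, 2, 3)
nums.foreach(x => println(x))         // apply a function to each element (runs on the workers)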
Action (cont)
More available in documentation...
http://spark.apache.org/docs/latest/programming-guide.html#actions
Save
Action that results in data stored to a file system.
● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile / saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions
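A brief sketch of the save actions, continuing from the counts pair RDD of the word-count example (paths are placeholders):

// Continuing from the word-count counts RDD; paths are placeholders
// (saveAsSequenceFile on a pair RDD needs import org.apache.spark.SparkContext._ on early 1.x)
counts.saveAsTextFile("hdfs://...")      // plain text, one part file per partition
counts.saveAsSequenceFile("hdfs://...")  // Hadoop SequenceFile of (key, value) Writables
counts.saveAsObjectFile("hdfs://...")    // serialized Java objects (saveAsPickleFile is the PySpark analogue)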
Spark workflow
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")
Multiple modes, rich standard library
[Stack diagram: Apache Spark - Spark Core with SQL, Streaming, MLlib, GraphX, ... libraries on top]
Spark SQL
● Components
  ○ Catalyst - generic optimization for relational algebra
  ○ Core - RDD execution; formats: Parquet, JSON
  ○ Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL
[Diagram: an RDD of User objects vs. a SchemaRDD with Name, Age, Height columns]
sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
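A minimal sketch of the SchemaRDD-plus-SQL flow with the Spark 1.0 era API; the Person case class, input format, and input path are illustrative:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlCtx = new SQLContext(sc)
import sqlCtx.createSchemaRDD            // implicit RDD[Person] -> SchemaRDD conversion

// Placeholder input: one "name,age" record per line
val people = sc.textFile("hdfs://...")
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))

people.registerAsTable("people")         // make the SchemaRDD queryable by name

val teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)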
Spark Streaming
● Run a streaming computation as a series of time bound, deterministic batch jobs
● Time bound used to break stream into RDDs
[Diagram: the input stream is broken into RDDs X seconds wide; Spark processes each as a batch, producing results]
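A minimal sketch of that model with the Spark 1.0 streaming API, a network word count over 10-second batches; the host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // implicit pair-DStream operations

val conf = new SparkConf().setAppName("streaming word count")
val ssc  = new StreamingContext(conf, Seconds(10))     // the "X seconds wide" batch interval

val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                  // one small batch job per 10-second RDD
counts.print()

ssc.start()
ssc.awaitTermination()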
MLlib
● Machine learning algorithms over RDDs
● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees
● Regression - linear regression, regression trees
● Collaborative filtering - alternating least squares
● Clustering - K-Means
● Optimization - stochastic gradient descent, limited-memory BFGS
● Dimensionality reduction - singular value decomposition, principal component analysis
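As one hedged example of MLlib over RDDs, a k-means sketch with the Spark 1.0 era API; the input format, path, and parameters are made up for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one space-separated numeric feature vector per line
val data = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

val model = KMeans.train(data, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)
val wssse = model.computeCost(data)     // within-set sum of squared errors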
Deploying Spark
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Deploying Spark
● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos or YARN
  ○ Standalone - the default, simple setup, master + worker processes on nodes
  ○ Mesos - a general purpose manager that runs Hadoop and other services. Two modes of operation, fine & coarse.
  ○ YARN - Hadoop 2's resource manager
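A hedged sketch of how the cluster manager choice typically shows up in a driver's configuration; the master URLs are placeholders and in practice are more often supplied via spark-submit than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL for the cluster manager in use
val conf = new SparkConf()
  .setAppName("deployment sketch")
  .setMaster("local[*]")               // local threads, handy for development
  // .setMaster("spark://host:7077")   // standalone master (placeholder host)
  // .setMaster("mesos://host:5050")   // Mesos master (placeholder host)
  // .setMaster("yarn-client")         // YARN (Hadoop 2), driver runs locally

val sc = new SparkContext(conf)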
Highlights from Spark Summit 2014
http://spark-summit.org/east/2015
New York in early 2015
Thanks to...
@pacoid, @rxin and @michaelarmbrust for letting me crib slides for this introduction