Introduction to Spark on Hadoop


Page 1: Introduction to Spark on Hadoop

© 2014 MapR Technologies 1

An Overview of Apache Spark

Page 2: Introduction to Spark on Hadoop

© 2014 MapR Technologies 2

Agenda

• MapReduce Refresher

• What is Spark?

• The Difference with Spark

• Examples and Resources

Page 3: Introduction to Spark on Hadoop

© 2014 MapR Technologies 3

MapReduce Refresher

Page 4: Introduction to Spark on Hadoop

© 2014 MapR Technologies 4

MapReduce: A Programming Model

• MapReduce: Simplified Data Processing on Large Clusters (published 2004)

• Parallel and Distributed Algorithm:

• Data Locality

• Fault Tolerance

• Linear Scalability

Page 5: Introduction to Spark on Hadoop

© 2014 MapR Technologies 5

The Hadoop Strategy

http://developer.yahoo.com/hadoop/tutorial/module4.html

Distribute data (share nothing)

Distribute computation (parallelization without synchronization)

Tolerate failures (no single point of failure)

[Diagram: mapping processes run on Nodes 1–3; their output feeds reducing processes on Nodes 1–3]

Page 6: Introduction to Spark on Hadoop

© 2014 MapR Technologies 6

Distribute Data: HDFS

• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds the location metadata
• DataNodes store & retrieve the data
• A user process asks the NameNode for metadata over the network, then accesses the physical data directly on the DataNodes

Page 7: Introduction to Spark on Hadoop

© 2014 MapR Technologies 7

Distribute Computation

[Diagram: a MapReduce program is submitted to the Hadoop cluster, which reads from the data sources and produces the result]

Page 8: Introduction to Spark on Hadoop

© 2014 MapR Technologies 8

MapReduce Basics

• Foundational model is based on a distributed file system
  – Scalability and fault tolerance
• Map
  – Loads the data and defines a set of keys
• Many use cases do not need a reduce task
• Reduce
  – Collects the organized key-based data to process and output
• Performance can be tuned based on known details of your source files and cluster shape (size, total number of nodes)

Page 9: Introduction to Spark on Hadoop

© 2014 MapR Technologies 9

MapReduce Execution and Data Flow

[Diagram: on each node, files are loaded from HDFS and split; RecordReaders turn each split into input (k, v) pairs; map tasks emit intermediate (k, v) pairs; a Partitioner and the "shuffle" exchange those pairs between nodes; reduce tasks sort and reduce them; the OutputFormat writes the final (k, v) pairs back to HDFS]

Page 10: Introduction to Spark on Hadoop

© 2014 MapR Technologies 10

MapReduce Example: Word Count

Input:
  "The time has come," the Walrus said,
  "To talk of many things:
  Of shoes—and ships—and sealing-wax
  …

Map (emit (word, 1) for each word):
  the, 1   time, 1   has, 1   come, 1   and, 1   and, 1   …

Shuffle and Sort (group the values by key):
  and, [1, 1, 1]   come, [1, 1, 1]   has, [1, 1]   the, [1, 1, 1]   time, [1, 1, 1, 1]   …

Reduce (sum the values for each key) and Output:
  and, 12   come, 6   has, 8   the, 4   time, 14   …

Page 11: Introduction to Spark on Hadoop

© 2014 MapR Technologies 11

Tolerate Failures

Failures are expected & managed gracefully:

• DataNode fails -> the NameNode will locate a replica
• MapReduce task fails -> the JobTracker will schedule another one

Page 12: Introduction to Spark on Hadoop

© 2014 MapR Technologies 12

MapReduce Processing Model

• Define mappers

• Shuffling is automatic

• Define reducers

• For complex work, chain jobs together

– Use a higher level language or DSL that does this for you

Page 13: Introduction to Spark on Hadoop

© 2014 MapR Technologies 13

MapReduce Design Patterns

• Summarization – Inverted index, counting

• Filtering – Top ten, distinct

• Aggregation

• Data Organization – partitioning

• Join – Join data sets

• Metapattern – Job chaining

Page 14: Introduction to Spark on Hadoop

© 2014 MapR Technologies 14

Inverted Index Example

alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it

Map output:
  time, alice.txt   has, alice.txt   come, alice.txt   …
  tis, macbeth.txt   time, macbeth.txt   do, macbeth.txt   …

Reduce output (word -> list of files):
  come, (alice.txt)   do, (macbeth.txt)   has, (alice.txt)   time, (alice.txt, macbeth.txt)   …

Page 15: Introduction to Spark on Hadoop

© 2014 MapR Technologies 15

MapReduce Example: Inverted Index

• Input: (filename, text) records

• Output: list of files containing each word

• Map: foreach word in text.split(): output(word, filename)

• Combine: uniquify filenames for each word

• Reduce: def reduce(word, filenames): output(word, sort(filenames))
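To make the model concrete, here is a minimal single-process sketch of the same map/reduce logic written with plain Scala collections, not the Hadoop MapReduce API; the file names and text are illustrative only.

// Single-process sketch of the inverted-index map/shuffle/reduce logic above.
object InvertedIndexSketch {
  // Map: for each word in the text, emit (word, filename)
  def mapRecord(filename: String, text: String): Seq[(String, String)] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(word => (word, filename)).toSeq

  // Reduce: collect the unique, sorted file names for each word
  def reduceWord(word: String, filenames: Seq[String]): (String, Seq[String]) =
    (word, filenames.distinct.sorted)

  def main(args: Array[String]): Unit = {
    val records = Seq(
      "alice.txt"   -> "The time has come, the Walrus said",
      "macbeth.txt" -> "tis time to do it"
    )
    val index = records
      .flatMap { case (file, text) => mapRecord(file, text) } // map phase
      .groupBy(_._1)                                          // "shuffle": group by word
      .map { case (word, pairs) => reduceWord(word, pairs.map(_._2)) }
    index.toSeq.sortBy(_._1).foreach(println)
  }
}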

Page 16: Introduction to Spark on Hadoop

© 2014 MapR Technologies 18

MapReduce: The Good

• Built in fault tolerance

• Optimized IO path

• Scalable

• Developer focuses on Map/Reduce, not infrastructure

• Simple(?) API

Page 17: Introduction to Spark on Hadoop

© 2014 MapR Technologies 19

MapReduce: The Bad

• Optimized for disk IO

– Doesn’t leverage memory well

– Iterative algorithms go through disk IO path again and again

• Primitive API
  – Simple abstraction
  – Key/Value in and out
  – Basic things like joins require extensive code
• Results are often many files that need to be combined appropriately

Page 19: Introduction to Spark on Hadoop

© 2014 MapR Technologies 21

What is Hive?

• Data Warehouse on top of Hadoop

– Gives ability to query without programming

– Used for analytical querying of data

• SQL-like execution for Hadoop

• SQL evaluates to MapReduce code

– Submits jobs to your cluster

Page 20: Introduction to Spark on Hadoop

© 2014 MapR Technologies 22

Using HBase as a MapReduce/Hive Source

EXAMPLE: data warehouse for analytical processing queries

[Diagram: a Hive SELECT … JOIN runs as a MapReduce application that reads from the HBase database and from files (HDFS/MapR-FS) and produces a query result file]

Page 21: Introduction to Spark on Hadoop

© 2014 MapR Technologies 23

Using HBase as a MapReduce or Hive Sink

EXAMPLE: bulk load data into a table

[Diagram: a Hive INSERT … SELECT runs as a MapReduce application that reads files (HDFS/MapR-FS) and writes into the HBase database]

Page 22: Introduction to Spark on Hadoop

© 2014 MapR Technologies 24

Using HBase as a Source & Sink

EXAMPLE: calculate and store summaries (a pre-computed, materialized view)

[Diagram: a Hive SELECT … JOIN runs as a MapReduce application that both reads from and writes to the HBase database]

Page 23: Introduction to Spark on Hadoop

© 2014 MapR Technologies 25

[Diagram: Hive architecture. Client interfaces (command line interface, web interface, JDBC and ODBC via the Thrift Server) talk to the Hive Driver (compiler, optimizer, executor), which uses the Metastore and submits jobs to Hadoop (MapReduce + HDFS: JobTracker, NameNode, DataNodes + TaskTrackers)]

The schema metadata is stored in the Hive metastore.

[Diagram: a Hive table definition maps to the HBase trades_tall table]

Page 24: Introduction to Spark on Hadoop

© 2014 MapR Technologies 26

Hive HBase

[Diagram: Hive metastore table definitions either point to existing HBase tables (external) or are Hive managed]

Page 25: Introduction to Spark on Hadoop

© 2014 MapR Technologies 27

Hive HBase – External Table

CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");

[Diagram: the Hive table definition "trades" (key string, price bigint, vol bigint) points to the external HBase table "/usr/user1/trades_tall" with columns key, cf1:price, cf1:vol; for example, row AMZN_986186008 has price 12.34 and vol 1000, and row AMZN_986186007 has price 12.00 and vol 50]

Page 26: Introduction to Spark on Hadoop

© 2014 MapR Technologies 28

Hive HBase – Hive Query

SQL evaluates to MapReduce code

SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

[Diagram: queries pass through the parser, planner, and execution stages and run against the HBase tables]

Page 27: Introduction to Spark on Hadoop

© 2014 MapR Technologies 29

Hive HBase – External Table

SQL evaluates to MapReduce code:

SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";

• Selection: WHERE key LIKE …
• Projection: SELECT price
• Aggregation: AVG(price)

[HBase table sample: key, cf1:price, cf1:vol; AMZN_986186008, 12.34, 1000; AMZN_986186007, 12.00, 50]

Page 28: Introduction to Spark on Hadoop

© 2014 MapR Technologies 30

Hive Query Plan

• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

Filter Operator

predicate: (key like 'GOOG%') (type: boolean)

Select Operator

Group By Operator
Reduce Operator Tree:

Group By Operator

Select Operator

File Output Operator

Page 29: Introduction to Spark on Hadoop

© 2014 MapR Technologies 31

Hive Query Plan – (2)

hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";

[Diagram: the trades table is scanned by parallel map() tasks that filter (key like 'GOOG%') and project the price column; reduce() tasks group and apply the avg(price) aggregation to produce the output column col0]

Page 30: Introduction to Spark on Hadoop

© 2014 MapR Technologies 32

Hive Map Reduce

[Diagram: a Hive SELECT … JOIN query compiles to Map() tasks that scan the HBase regions (key, row), a shuffle, and reduce() tasks whose results are written to the query result file]

Page 31: Introduction to Spark on Hadoop

© 2014 MapR Technologies 33

Some Hive Design Patterns

• Summarization

  – SELECT min(delay), max(delay), count(*) FROM flights GROUP BY carrier;

• Filtering

  – SELECT * FROM trades WHERE key LIKE "GOOG%";
  – SELECT price FROM trades ORDER BY price DESC LIMIT 10;

• Join

  – SELECT tableA.field1, tableB.field2 FROM tableA
    JOIN tableB ON tableA.field1 = tableB.field2;

Page 32: Introduction to Spark on Hadoop

© 2014 MapR Technologies 34

What is a Directed Acyclic Graph (DAG)?

• Graph

– vertices (points) and edges (lines)

• Directed

– Only in a single direction

• Acyclic

– No looping

• This supports fault-tolerance


Page 33: Introduction to Spark on Hadoop

© 2014 MapR Technologies 35

Hive Query Plan Map Reduce Execution

[Diagram: the operator tree (table scans of t1, reduce sinks RS1–RS4, aggregations AGG1 and AGG2, JOIN1, file sink FS1) initially maps to three MapReduce jobs; after optimization the same tree runs as fewer jobs]

Page 34: Introduction to Spark on Hadoop

© 2014 MapR Technologies 36

Iteration: the bane of MapReduce (slow!)

Page 35: Introduction to Spark on Hadoop

© 2014 MapR Technologies 37

Typical MapReduce Workflows

[Diagram: Job 1 reads its input from HDFS and its maps and reduces write a SequenceFile; Job 2 reads that SequenceFile and writes another; the chain continues until the last job writes its final output to HDFS. Every handoff between jobs goes through HDFS.]

Page 36: Introduction to Spark on Hadoop

© 2014 MapR Technologies 38

Iterations

[Diagram: an iterative computation as a chain of steps over the same data]

In-memory Caching

• Data Partitions read from RAM instead of disk
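As a hedged illustration of why this matters, the sketch below uses the plain Spark Scala API in local mode with synthetic data and a made-up update rule: the RDD is cached once, so after the first pass every iteration reads its partitions from memory instead of re-reading the source.

import org.apache.spark.{SparkConf, SparkContext}

object CachingIterationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))
    // Stand-in for a large file on HDFS; cached once, reused by every iteration.
    val data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var estimate = 0.0
    for (step <- 1 to 5) {
      // After the first pass these scans read cached partitions from RAM,
      // not the original source. The update rule itself is made up.
      val delta = data.map(x => x - estimate).reduce(_ + _) / data.count()
      estimate += delta / 2
      println(s"step $step, estimate = $estimate")
    }
    sc.stop()
  }
}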

Page 38: Introduction to Spark on Hadoop

© 2014 MapR Technologies 40

Lab – Query HBase airline data with Hive

Import mapping to row key and columns:

Row key: Carrier-FlightNumber-Date-Origin-Destination
Column families: delay, info, stats, timing
Columns: aircraft delay, arr delay, carrier delay, cncl, cncl code, tailnum, distance, elaptime, arrtime, dep time, …

Example row: AA-1-2014-01-01-JFK-LAX with values 13, 0, N7704, 2475, 385.00, 359, …

Page 39: Introduction to Spark on Hadoop

© 2014 MapR Technologies 41

Count number of cancellations by reason (code)

$ hive
hive> explain select count(*) as cancellations, cnclcode from flighttable
      where cncl=1 group by cnclcode order by cancellations asc limit 100;
OK

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

Filter Operator

Select Operator

Group By Operator

aggregations: count()

Reduce Output Operator

Reduce Operator Tree:

Group By Operator

aggregations: count(VALUE._col0)

Select Operator

File Output Operator

Stage: Stage-2

Map Reduce

Map Operator Tree:

TableScan

Reduce Output Operator

Reduce Operator Tree:

Extract

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

Limit

File Output Operator

Stage: Stage-0

Fetch Operator

limit: 100

Page 40: Introduction to Spark on Hadoop

© 2014 MapR Technologies 42

2 MapReduce jobs

$ hive
hive> select count(*) as cancellations, cnclcode from flighttable
      where cncl=1 group by cnclcode order by cancellations asc limit 100;

Total jobs = 2

MapReduce Jobs Launched:

Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0

MAPRFS Write: 0 SUCCESS

Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0

MAPRFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 14 seconds 820 msec

OK

4598 C

7146 A

Page 41: Introduction to Spark on Hadoop

© 2014 MapR Technologies 43

Find the longest airline delays

$ hive
hive> select arrdelay, key from flighttable where arrdelay > 1000
      order by arrdelay desc limit 10;

MapReduce Jobs Launched:

Map: 1 Reduce: 1

OK

1530.0 AA-385-2014-01-18-BNA-DFW

1504.0 AA-1202-2014-01-15-ONT-DFW

1473.0 AA-1265-2014-01-05-CMH-LAX

1448.0 AA-1243-2014-01-21-IAD-DFW

1390.0 AA-1198-2014-01-11-PSP-DFW

1335.0 AA-1680-2014-01-21-SLC-DFW

1296.0 AA-1277-2014-01-21-BWI-DFW

1294.0 MQ-2894-2014-01-02-CVG-DFW

1201.0 MQ-3756-2014-01-01-CLT-MIA

1184.0 DL-2478-2014-01-10-BOS-ATL

Page 42: Introduction to Spark on Hadoop

© 2014 MapR Technologies 44

Apache Spark

Page 43: Introduction to Spark on Hadoop

© 2014 MapR Technologies 45

Apache Spark

spark.apache.org

github.com/apache/spark


• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation

Page 44: Introduction to Spark on Hadoop

© 2014 MapR Technologies 46

Spark: Fast Big Data

• Fast to Write
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage
• 2-5× less code

Page 45: Introduction to Spark on Hadoop

© 2014 MapR Technologies 47

The Spark Community

Page 46: Introduction to Spark on Hadoop

© 2014 MapR Technologies 48

Spark is the Most Active Open Source Project in Big Data

[Chart: number of project contributors in the past year for Spark, Giraph, Storm, and Tez (y-axis 0–140); Spark has by far the most contributors]

Page 47: Introduction to Spark on Hadoop

© 2014 MapR Technologies 49

Unified Platform

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine; Spark runs on cluster managers such as Mesos or Hadoop YARN, over a distributed file system (HDFS, MapR-FS, S3, …)]

Page 48: Introduction to Spark on Hadoop

© 2014 MapR Technologies 50

Spark Use Cases

• Iterative algorithms on large amounts of data

• Anomaly detection

• Classification

• Predictions

• Recommendations

Page 49: Introduction to Spark on Hadoop

© 2014 MapR Technologies 51

Why Iterative Algorithms

• Algorithms that need iterations

– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …

Page 50: Introduction to Spark on Hadoop

© 2014 MapR Technologies 52

Example: Logistic Regression

• Goal: find best line separating two sets of points

target

random initial line

Page 51: Introduction to Spark on Hadoop

© 2014 MapR Technologies 53

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Iteration!

Logistic Regression

Page 52: Introduction to Spark on Hadoop

© 2014 MapR Technologies 54

Data Sources

• Local Files

– file:///opt/httpd/logs/access_log

• S3

• Hadoop Distributed Filesystem

– Regular files, sequence files, any other Hadoop InputFormat

• HBase

• Other NoSQL data stores
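A hedged sketch of reading several of these sources with the same SparkContext API (all paths and bucket names are placeholders; assumes a Spark 1.x-era spark-shell session where sc already exists):

val localFile = sc.textFile("file:///opt/httpd/logs/access_log")          // local file
val hdfsFile  = sc.textFile("hdfs:///user/data/events.log")               // HDFS / MapR-FS
val s3File    = sc.textFile("s3n://my-bucket/logs/")                      // S3 via the s3n scheme
val seqFile   = sc.sequenceFile[String, Int]("hdfs:///user/data/counts")  // Hadoop SequenceFile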

Page 53: Introduction to Spark on Hadoop

© 2014 MapR Technologies 55

How Spark Works

Page 54: Introduction to Spark on Hadoop

© 2014 MapR Technologies 56

Spark Programming Model

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map(…)

[Diagram: the Driver Program creates the SparkContext, which connects to the cluster and runs tasks on executors on the Worker Nodes]
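A minimal, hedged driver-program skeleton matching this model (the input path is a placeholder; the application would normally be submitted with spark-submit, which supplies the master URL):

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-sketch")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///path/to/input")   // distributed dataset
    val lineLengths = rdd.map(_.length)              // transformation, evaluated lazily
    println(lineLengths.reduce(_ + _))               // action: triggers the actual job on the workers

    sc.stop()
  }
}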

Page 55: Introduction to Spark on Hadoop

© 2014 MapR Technologies 57

Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:

• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory, or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
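A small sketch of these properties with the standard RDD API (assumes a spark-shell session where sc already exists; the log path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs:///logs/access_log")        // partitioned across the cluster
val errors = lines.filter(_.contains("ERROR"))             // derived RDD; the original is unchanged (read-only)
errors.persist(StorageLevel.MEMORY_AND_DISK)               // cache in memory, spill to disk if it does not fit
println(errors.count())                                    // computed in parallel; later actions reuse the cache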

Page 56: Introduction to Spark on Hadoop

© 2014 MapR Technologies 58

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 57: Introduction to Spark on Hadoop

© 2014 MapR Technologies 59

Working With RDDs

RDD RDD RDD RDD

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

Page 58: Introduction to Spark on Hadoop

© 2014 MapR Technologies 60

Working With RDDs

RDD RDD RDD RDD

Transformations

Action Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()

74

linesWithSpark.first()

# Apache Spark

textFile = sc.textFile("SomeFile.txt")

Page 59: Introduction to Spark on Hadoop

© 2014 MapR Technologies 61

MapR Tutorial: Getting Started with Spark on MapR Sandbox

• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial

Page 60: Introduction to Spark on Hadoop

© 2014 MapR Technologies 62

Example Spark Word Count in Java

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax …

[Diagram: each word becomes a (word, 1) pair, e.g. (the, 1), (time, 1), (and, 1); the pairs are reduced by key to counts such as (and, 12), (time, 4), (the, 20)]

JavaRDD<String> input = sc.textFile(inputFile);

// Split each line into words

JavaRDD<String> words = input.flatMap(

new FlatMapFunction<String, String>() {

public Iterable<String> call(String x) {

return Arrays.asList(x.split(" "));

}});

// Turn the words into (word, 1) pairs

JavaPairRDD<String, Integer> word1s = words.mapToPair(

new PairFunction<String, String, Integer>(){

public Tuple2<String, Integer> call(String x){

return new Tuple2(x, 1);

}});

// reduceByKey: add the pair values by key to produce counts

JavaPairRDD<String, Integer> counts =word1s.reduceByKey(

new Function2<Integer, Integer, Integer>(){

public Integer call(Integer x, Integer y){

return x + y;

}});


Page 61: Introduction to Spark on Hadoop

© 2014 MapR Technologies 63

Example Spark Word Count in Scala

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax …

[Diagram: each word becomes a (word, 1) pair, e.g. (the, 1), (time, 1), (and, 1); the pairs are reduced by key to counts such as (and, 12), (time, 4), (the, 20)]

// Load our input data.

val input = sc.textFile(inputFile)

// Split it up into words.

val words = input.flatMap(line => line.split(" "))

// Transform into pairs and count.

val counts = words

.map(word => (word, 1))

.reduceByKey{case (x, y) => x + y}

// Save the word count back out to a text file,

counts.saveAsTextFile(outputFile)

Output: (the, 20), (time, 4), …, (and, 12)

Page 62: Introduction to Spark on Hadoop

© 2014 MapR Technologies 64

Example Spark Word Count in Scala

// Load input data.
val input = sc.textFile(inputFile)

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD, split into partitions]

Page 63: Introduction to Spark on Hadoop

© 2014 MapR Technologies 65

Example Spark Word Count in Scala

// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD]

Page 64: Introduction to Spark on Hadoop

© 2014 MapR Technologies 66

FlatMap

flatMap(line => line.split(" "))

One-to-many mapping: each line produces zero or more words, e.g. "Ships and wax" -> Ships, and, wax

Result: JavaRDD<String> words

Page 65: Introduction to Spark on Hadoop

© 2014 MapR Technologies 67

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD]

Page 66: Introduction to Spark on Hadoop

© 2014 MapR Technologies 68

Map

map(word => (word, 1))

One-to-one mapping: each word becomes a (word, 1) pair, e.g. and -> (and, 1)

Result: JavaPairRDD<String, Integer> word1s

Page 67: Introduction to Spark on Hadoop

© 2014 MapR Technologies 69

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{case (x, y) => x + y}

[RDD lineage so far: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD]

Page 68: Introduction to Spark on Hadoop

© 2014 MapR Technologies 70

reduceByKey

reduceByKey{case (x, y) => x + y}

Values with the same key are combined, e.g. (and, 1), (and, 1), (wax, 1) -> (and, 2), (wax, 1)

Result: JavaPairRDD<String, Integer> counts

Page 69: Introduction to Spark on Hadoop

© 2014 MapR Technologies 71

Example Spark Word Count in Scala

val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
  .map(word => (word, 1))
  .reduceByKey{case (x, y) => x + y}
val countArray = counts.collect()

[RDD lineage: textFile -> HadoopRDD -> MapPartitionsRDD; flatMap -> MapPartitionsRDD; map -> MapPartitionsRDD; reduceByKey -> ShuffledRDD -> MapPartitionsRDD; collect -> Array on the driver]

Page 70: Introduction to Spark on Hadoop

© 2014 MapR Technologies 72

Components Of Execution

Page 71: Introduction to Spark on Hadoop

© 2014 MapR Technologies 73

MapR Blog: Getting Started with the Spark Web UI

• https://www.mapr.com/blog/getting-started-spark-web-ui

Page 72: Introduction to Spark on Hadoop

© 2014 MapR Technologies 74

Spark RDD DAG -> Physical Execution plan

RDD Graph:
  sc.textFile(…) -> HadoopRDD
  flatMap -> MapPartitionsRDD
  map -> MapPartitionsRDD
  reduceByKey -> ShuffledRDD -> MapPartitionsRDD
  collect

Physical Plan: the graph is split into Stage 1 (up to the shuffle) and Stage 2 (after the shuffle)

Page 73: Introduction to Spark on Hadoop

© 2014 MapR Technologies 75

Physical Execution plan -> Stages and Tasks

[Diagram: the DAG's physical plan is split into Stage 1 and Stage 2; each stage is split into a set of tasks, one per partition. The Task Scheduler sends task sets to executors on the worker nodes, where each task runs in a thread against a cached partition, reading its HDFS blocks from the co-located data node]

Page 74: Introduction to Spark on Hadoop

© 2014 MapR Technologies 76

Summary of Components

• Task : unit of execution

• Stage: Group of Tasks

  – Based on partitions of the RDD
  – Tasks run in parallel

• DAG: Logical graph of RDD operations

• RDD: Parallel dataset with partitions

Page 75: Introduction to Spark on Hadoop

© 2014 MapR Technologies 77

How Spark Application runs on a Hadoop cluster

[Diagram: the Driver Program on the client node creates the SparkContext (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map). The SparkContext requests resources from the YARN Resource Manager; the YARN Node Manager on each worker node launches an executor; each executor runs tasks against cached partitions, reading HDFS blocks from the local data node. ZooKeeper appears in the diagram coordinating the cluster services.]

Page 76: Introduction to Spark on Hadoop

© 2014 MapR Technologies 78

Deploying Spark – Cluster Manager Types

• Standalone mode

• Mesos

• YARN

• EC2

• GCE
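The cluster manager is chosen through the master URL handed to Spark; a hedged sketch (host names and ports are examples only, using Spark 1.x master strings):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app")
  .setMaster("local[*]")                 // no cluster manager: run locally on all cores
//.setMaster("spark://master:7077")      // standalone cluster manager
//.setMaster("mesos://master:5050")      // Mesos
//.setMaster("yarn-client")              // YARN (Spark 1.x master string)
val sc = new SparkContext(conf)

On EC2 or GCE the launch scripts typically stand up a standalone cluster, so the spark:// form applies there.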

Page 77: Introduction to Spark on Hadoop

© 2014 MapR Technologies 79

Example: Log Mining

Page 78: Introduction to Spark on Hadoop

© 2014 MapR Technologies 80

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Based on slides from Pat McDonough at

Page 79: Introduction to Spark on Hadoop

© 2014 MapR Technologies 81

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

Page 80: Introduction to Spark on Hadoop

© 2014 MapR Technologies 82

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Page 81: Introduction to Spark on Hadoop

© 2014 MapR Technologies 83

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Base RDD

Page 82: Introduction to Spark on Hadoop

© 2014 MapR Technologies 84

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Page 83: Introduction to Spark on Hadoop

© 2014 MapR Technologies 85

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Transformed RDD

Page 84: Introduction to Spark on Hadoop

© 2014 MapR Technologies 86

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Page 85: Introduction to Spark on Hadoop

© 2014 MapR Technologies 87

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count() Action

Page 86: Introduction to Spark on Hadoop

© 2014 MapR Technologies 88

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Page 87: Introduction to Spark on Hadoop

© 2014 MapR Technologies 89

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver tasks

tasks

tasks

Page 88: Introduction to Spark on Hadoop

© 2014 MapR Technologies 90

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Read

HDFS

Block

Read

HDFS

Block

Read

HDFS

Block

Page 89: Introduction to Spark on Hadoop

© 2014 MapR Technologies 91

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

Process

& Cache

Data

Process

& Cache

Data

Process

& Cache

Data

Page 90: Introduction to Spark on Hadoop

© 2014 MapR Technologies 92

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

results

results

results

Page 91: Introduction to Spark on Hadoop

© 2014 MapR Technologies 93

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Page 92: Introduction to Spark on Hadoop

© 2014 MapR Technologies 94

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

tasks

tasks

tasks

Driver

Page 93: Introduction to Spark on Hadoop

© 2014 MapR Technologies 95

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Process

from

Cache

Process

from

Cache

Process

from

Cache

Page 94: Introduction to Spark on Hadoop

© 2014 MapR Technologies 96

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

results

results

results

Page 95: Introduction to Spark on Hadoop

© 2014 MapR Technologies 97

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(lambda s: s.startswith(“ERROR”))

messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Cache 1

Cache 2

Cache 3

messages.filter(lambda s: “php” in s).count()

Driver

Cache your data -> Faster results

Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on disk

Page 96: Introduction to Spark on Hadoop

© 2014 MapR Technologies 98

Transformations and Actions

Page 97: Introduction to Spark on Hadoop

© 2014 MapR Technologies 99

RDD Transformations and Actions

[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a value to the driver):
  reduce, collect, count, save, lookupKey, …
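A hedged Scala sketch exercising a few of the listed operations (assumes a spark-shell session where sc exists; the data is made up):

val a = sc.parallelize(Seq(("a", 1), ("b", 2)))
val b = sc.parallelize(Seq(("a", 10), ("c", 30)))

val unioned = a.union(b)                  // transformation: defines a new RDD, nothing runs yet
val summed  = unioned.reduceByKey(_ + _)  // transformation with a shuffle
val joined  = a.join(b)                   // transformation: ("a", (1, 10))

println(summed.collect().toSeq)           // action: brings the values back to the driver
println(joined.count())                   // action: returns a number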

Page 98: Introduction to Spark on Hadoop

© 2014 MapR Technologies 100

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)            # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)    # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))               # => {0, 0, 1, 0, 1, 2}

range(x) is a sequence of numbers 0, 1, …, x-1

Page 99: Introduction to Spark on Hadoop

© 2014 MapR Technologies 101

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()                       # => [1, 2, 3]

# Return first K elements
> nums.take(2)                         # => [1, 2]

# Count number of elements
> nums.count()                         # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)      # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Page 100: Introduction to Spark on Hadoop

© 2014 MapR Technologies 102

RDD Fault Recovery

• RDDs track lineage information
• Lineage can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File -> filter (func = startswith(…)) -> Filtered RDD -> map (func = split(…)) -> Mapped RDD]
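The lineage Spark keeps for recovery can be inspected directly; a hedged Scala-shell sketch of the same pipeline (the path is a placeholder):

val msgs = sc.textFile("hdfs:///logs/app.log")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
println(msgs.toDebugString)   // prints the HadoopRDD -> filtered -> mapped chain Spark would replay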

Page 101: Introduction to Spark on Hadoop

© 2014 MapR Technologies 103

Passing a function to Spark

• Spark's Scala API is based on anonymous function syntax
  – (x: Int) => x * x
• which is shorthand for:
  new Function1[Int, Int] {
    def apply(x: Int) = x * x
  }
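A hedged sketch of equivalent ways to pass the same function (assumes spark-shell, made-up data; the explicit Function1 form adds Serializable so the task can be shipped to executors):

val nums = sc.parallelize(1 to 5)

val squaresA = nums.map(x => x * x)                          // anonymous function literal
def square(x: Int): Int = x * x
val squaresB = nums.map(square)                              // named function
val squaresC = nums.map(new Function1[Int, Int] with Serializable {
  def apply(x: Int): Int = x * x                             // the desugared form shown above
})
println(squaresC.collect().toSeq)                            // Vector(1, 4, 9, 16, 25)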

Page 102: Introduction to Spark on Hadoop

© 2014 MapR Technologies 104

Dataframes

Page 103: Introduction to Spark on Hadoop

© 2014 MapR Technologies 105

DataFrame

Distributed collection of data organized into named columns.

// Create the DataFrame
val df = sqlContext.read.json("person.json")

// Print the schema in a tree format

df.printSchema()

// root

// |-- age: long (nullable = true)

// |-- name: string (nullable = true)

// |-- height: string (nullable = true)

// Select only the "name" column

df.select("name").show()

https://spark.apache.org/docs/latest/sql-programming-guide.html
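Continuing the sketch above on the same df (hedged; assumes the person.json schema shown), a column-expression filter, a grouped count, and a computed column:

df.filter(df("age") > 21).show()              // rows with age > 21
df.groupBy("age").count().show()              // number of people per age
df.select(df("name"), df("age") + 1).show()   // computed column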

Page 104: Introduction to Spark on Hadoop

© 2014 MapR Technologies 106

DataFrame RDD

• # DataFrame style
  lineitems.groupBy("customer").agg(Map(
    "units" -> "avg",
    "totalPrice" -> "std"
  ))

• # or SQL style
  SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
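A hedged Scala sketch of the SQL style with the Spark 1.x API (assumes lineitems is a DataFrame; the table and column names follow the example above and are assumptions):

lineitems.registerTempTable("lineitems")
val result = sqlContext.sql(
  "SELECT customer, AVG(units) AS avg_units FROM lineitems GROUP BY customer")
result.show()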

Page 105: Introduction to Spark on Hadoop

© 2014 MapR Technologies 107

Demo Interactive Shell

• Iterative Development

– Cache those RDDs

– Open the shell and ask questions

• We have all wished we could do this with MapReduce

– Compile / save your code for scheduled jobs later

• Scala – spark-shell

• Python – pyspark
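An example of that interactive workflow (hedged; launched with spark-shell, which already provides sc; the log path is a placeholder):

val logs = sc.textFile("/opt/httpd/logs/access_log").cache()   // cache once
logs.filter(_.contains(" 404 ")).count()                        // ask a question
logs.filter(_.contains(" 500 ")).count()                        // ask another; reuses the cached partitions
logs.take(5).foreach(println)                                   // peek at a few raw lines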

Page 106: Introduction to Spark on Hadoop

© 2014 MapR Technologies 108

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data

• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data

Page 107: Introduction to Spark on Hadoop

© 2014 MapR Technologies 109

The physical plan for DataFrames

Page 108: Introduction to Spark on Hadoop

© 2014 MapR Technologies 110

DataFrame Execution Plan

// Print the physical plan to the console
auction.select("auctionid").distinct.explain()

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37

Page 109: Introduction to Spark on Hadoop

© 2014 MapR Technologies 111

There’s a lot more !

Page 110: Introduction to Spark on Hadoop

© 2014 MapR Technologies 112

Unified Platform

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine; Spark runs on cluster managers such as Mesos or Hadoop YARN, over a distributed file system (HDFS, MapR-FS, S3, …)]

Page 111: Introduction to Spark on Hadoop

© 2014 MapR Technologies 113

Soon to Come

• Spark On-Demand Training
  – https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
  – Movie Recommendations with Collaborative Filtering
  – Spark Streaming

Page 112: Introduction to Spark on Hadoop

© 2014 MapR Technologies 114

Soon to Come

Blogs and Tutorials:
  – Re-write this Mahout example with Spark

Page 113: Introduction to Spark on Hadoop

© 2014 MapR Technologies 115

Examples and Resources

Page 114: Introduction to Spark on Hadoop

© 2014 MapR Technologies 116

Spark on MapR

• Certified Spark Distribution

• Fully supported and packaged by MapR in partnership with

Databricks

– mapr-spark package with Spark, Shark, Spark Streaming today

– Spark-python, GraphX and MLLib soon

• YARN integration

– Spark can then allocate resources from the cluster when needed

Page 116: Introduction to Spark on Hadoop

© 2014 MapR Technologies 118

Q & A

@mapr maprtech


Engage with us!

MapR

maprtech

mapr-technologies